Data Processing with Relational Database in Spark

neupane parlad
Jun 5, 2023

Problem Statement:

For the past few weeks I have been working on a large data-processing task: comparing data in a relational database against data in an S3 bucket. The data size is huge, and everything runs in an AWS environment. The comparison includes row counts and hash calculations on selected columns.

The data sets contain BLOBs, CSVs, and XMLs. Hash calculation on the S3 data was relatively fast, but the same calculation against DB2 was extremely slow and kept timing out.
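As a rough sketch of the comparison itself, the count-and-hash step in PySpark looks something like the following. The S3 path and the id and payload column names are illustrative assumptions, not the actual schema:

from pyspark.sql import SparkSession
from pyspark.sql.functions import sha2, col

spark = SparkSession.builder.getOrCreate()

# Placeholder source: in the real job one side comes from DB2 over JDBC
# and the other from files on S3.
df = spark.read.parquet("s3://my-bucket/some-dataset/")

# Cheap first check: row counts on both sides.
row_count = df.count()

# Per-row SHA-256 hash of a selected column; comparing these hashes
# between the two sources flags rows whose values differ.
hashed = df.select(
    col("id"),
    sha2(col("payload").cast("string"), 256).alias("payload_hash"),
)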

Reading the Entire Data:

If you have two or three million records containing BLOB or XML data, this approach is the worst option. A simple JDBC connection with a plain SELECT statement will struggle to process that much data within a reasonable time.

The best approach is to read the data in chunks or batches. This improves query performance and is better suited for big data. Batching also allows us to process the data in parallel.

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.getOrCreate()

# Define the connection properties
url = "jdbc:mysql://localhost:3306/mydatabase"
table = "mytable"
properties = {
    "user": "username",
    "password": "password",
    "driver": "com.mysql.jdbc.Driver"
}

# Create a DataFrame by reading the table from the database
df = spark.read.jdbc(url, table, properties=properties)

1. Specify Partitions:
numPartitions=4

You can set this value to enable parallel processing in distributed computing: Spark divides the data into partitions that can be processed in parallel across a cluster of machines. Note that for JDBC reads, numPartitions only takes effect together with a partition column and bounds, as shown in the sketch below.
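A minimal sketch of a partitioned JDBC read, assuming the table has a numeric id column (the column name and bounds here are illustrative):

# Spark opens 4 parallel JDBC connections, each fetching one slice of the id range.
df = spark.read.jdbc(
    url,
    table,
    column="id",          # numeric column used to split the table
    lowerBound=1,         # example bounds; the next point covers these
    upperBound=4000000,
    numPartitions=4,
    properties=properties,
)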

2. Specify lowerBound and upperBound:

df = spark.read.jdbc(url, table, column="id", lowerBound=lower_bound,
                     upperBound=upper_bound, numPartitions=4, properties=properties)

Reading the entire table in a single query can be time-consuming and resource-intensive. The lower and upper bounds define the range of the partition column that Spark splits into numPartitions slices, so each task fetches only a manageable chunk. Keep in mind that the bounds do not filter rows by themselves; rows outside the range still end up in the first and last partitions, so use predicates (next point) when you actually want to restrict the data.
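One way to pick the bounds is to query the minimum and maximum of the partition column first. This is only a sketch, assuming an id column on the mytable table from the earlier example:

# Derive the bounds from the actual id range instead of hard-coding them.
# A parenthesized subquery with an alias is accepted as the "table" argument for JDBC reads.
bounds_df = spark.read.jdbc(
    url,
    "(SELECT MIN(id) AS lo, MAX(id) AS hi FROM mytable) AS b",
    properties=properties,
)
lower_bound, upper_bound = bounds_df.collect()[0]
# lower_bound and upper_bound can then be passed to the partitioned read shown above.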

3. Predicate pushdown:

Filtering should be the first thing you do when working with any type of data. Removing unnecessary rows before they consume precious processing capacity makes the job much easier.

df = spark.read.jdbc(url, table, properties=properties, predicates=["id BETWEEN 100 AND 200"])
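Each entry in the predicates list becomes its own partition. A list of non-overlapping ranges (the ranges below are just example values) therefore gives you both filtering and parallelism:

# Four non-overlapping predicates -> four partitions read in parallel,
# and only the matching rows are pulled from the database.
predicates = [
    "id BETWEEN 1 AND 1000000",
    "id BETWEEN 1000001 AND 2000000",
    "id BETWEEN 2000001 AND 3000000",
    "id BETWEEN 3000001 AND 4000000",
]
df = spark.read.jdbc(url, table, properties=properties, predicates=predicates)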

Besides these techniques, the usual database-side approaches, such as indexing the partition column or tuning the database schema, will always help.


Follow me for Data Science, Data Engineering, and Data Analysis articles. Follow me on Twitter at @parladN.