Using SQL Queries on DataFrames in Spark
One of the most powerful features of Apache Spark is that it lets you run SQL queries on distributed datasets through DataFrames. Spark SQL integrates SQL and Python seamlessly, so data analysts and developers can analyze structured data with familiar tools.
Why Use SQL with Spark?
If you are familiar with traditional SQL used in databases like MySQL or PostgreSQL, Spark lets you use almost the same syntax to query huge datasets stored in DataFrames. This makes learning Spark much easier for beginners with SQL backgrounds.
Steps to Use SQL on DataFrames
- Create a Spark DataFrame
- Register the DataFrame as a temporary SQL view
- Use the spark.sql() function to run SQL queries
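Condensed into code, the three steps look like this (a minimal sketch with a tiny made-up table; a full worked example follows below):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Step 1: create a DataFrame
df = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])

# Step 2: register it as a temporary SQL view
df.createOrReplaceTempView("my_table")

# Step 3: run a SQL query against the view with spark.sql()
spark.sql("SELECT key FROM my_table WHERE value > 1").show()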
Example: Analyzing Employee Data
Let’s consider a sample employee dataset and learn how to use SQL to extract insights.
from pyspark.sql import SparkSession
# Create Spark session
spark = SparkSession.builder.appName("SQL-on-DataFrames").getOrCreate()
# Sample data
data = [
("John", "Sales", 3000),
("Alice", "Sales", 4000),
("Bob", "HR", 3500),
("Eve", "HR", 2800),
("Charlie", "IT", 5000)
]
# Define schema
columns = ["name", "department", "salary"]
# Create DataFrame
df = spark.createDataFrame(data, columns)
# Show the DataFrame
df.show()
+-------+----------+------+
|   name|department|salary|
+-------+----------+------+
|   John|     Sales|  3000|
|  Alice|     Sales|  4000|
|    Bob|        HR|  3500|
|    Eve|        HR|  2800|
|Charlie|        IT|  5000|
+-------+----------+------+
Registering DataFrame as a Temporary View
Before running SQL queries, we must register the DataFrame as a SQL view.
# Register as temporary view
df.createOrReplaceTempView("employees")
Using SQL Queries on the DataFrame
Query 1: Select All Employees in HR Department
result = spark.sql("SELECT * FROM employees WHERE department = 'HR'")
result.show()
+----+----------+------+
|name|department|salary|
+----+----------+------+
| Bob|        HR|  3500|
| Eve|        HR|  2800|
+----+----------+------+
Question:
Why do we need to register a DataFrame as a view?
Answer:
Spark SQL resolves table names, and a DataFrame on its own has no name the SQL engine can refer to. Registering it as a temporary view gives it a name in the session catalog, so spark.sql() can find it and access its schema and data.
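One way to see this is to list what the session catalog knows about; a minimal sketch, assuming the Spark session and view from the example above:
# The temporary view now appears alongside any tables in the session catalog
spark.sql("SHOW TABLES").show()

# The same information is available through the catalog API
print(spark.catalog.listTables())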
Query 2: Calculate Average Salary by Department
avg_salary = spark.sql("SELECT department, AVG(salary) as avg_salary FROM employees GROUP BY department")
avg_salary.show()
+----------+----------+
|department|avg_salary|
+----------+----------+
|        HR|    3150.0|
|        IT|    5000.0|
|     Sales|    3500.0|
+----------+----------+
This query groups employees by department and calculates the average salary per group.
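For comparison, the same aggregation can also be written with the DataFrame API rather than SQL; this is a minimal equivalent sketch, not the only way to do it:
from pyspark.sql import functions as F

# Group by department and compute the average salary, mirroring the SQL above
avg_salary_df = df.groupBy("department").agg(F.avg("salary").alias("avg_salary"))
avg_salary_df.show()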
Query 3: Join with Another Table
Let’s create another DataFrame that contains department locations and perform a join using SQL.
# Additional DataFrame
dept_data = [
("Sales", "New York"),
("HR", "Chicago"),
("IT", "San Francisco")
]
dept_columns = ["department", "location"]
dept_df = spark.createDataFrame(dept_data, dept_columns)
# Register as temp view
dept_df.createOrReplaceTempView("departments")
# SQL Join
joined = spark.sql("""
SELECT e.name, e.department, e.salary, d.location
FROM employees e
JOIN departments d ON e.department = d.department
""")
joined.show()
+-------+----------+------+-------------+
|   name|department|salary|     location|
+-------+----------+------+-------------+
|   John|     Sales|  3000|     New York|
|  Alice|     Sales|  4000|     New York|
|    Bob|        HR|  3500|      Chicago|
|    Eve|        HR|  2800|      Chicago|
|Charlie|        IT|  5000|San Francisco|
+-------+----------+------+-------------+
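The same join can also be expressed with the DataFrame API; a minimal sketch using the two DataFrames created above:
# Inner join on the shared "department" column, mirroring the SQL join above
joined_df = df.join(dept_df, on="department", how="inner")
joined_df.select("name", "department", "salary", "location").show()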
Common SQL Operations in Spark
- SELECT: To retrieve specific columns
- WHERE: To filter rows
- GROUP BY: To aggregate data
- ORDER BY: To sort results (see the sketch after this list)
- JOIN: To combine data from multiple tables
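Of these, ORDER BY is the only one not demonstrated above. Here is a minimal sketch that sorts results, assuming the employees view from the earlier example is still registered:
# Filter, then sort by salary from highest to lowest
spark.sql("""
    SELECT name, salary
    FROM employees
    WHERE salary > 3000
    ORDER BY salary DESC
""").show()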
Summary
Spark SQL makes it easy to analyze big data using familiar SQL syntax. By combining the flexibility of Python and the power of SQL, Spark lets you write expressive, high-performance queries on massive datasets.
As a beginner, you can use this feature to work with data intuitively without worrying about distributed computing; Spark takes care of parallelization and optimization in the background.