Using SQL Queries on DataFrames in Spark
One of the most powerful features of Apache Spark is that it lets you run SQL queries on distributed datasets through DataFrames. Spark SQL integrates SQL and Python seamlessly, so data analysts and developers can analyze structured data with familiar tools.
Why Use SQL with Spark?
If you are familiar with traditional SQL used in databases like MySQL or PostgreSQL, Spark lets you use almost the same syntax to query huge datasets stored in DataFrames. This makes learning Spark much easier for beginners with SQL backgrounds.
Steps to Use SQL on DataFrames
- Create a Spark DataFrame
- Register the DataFrame as a temporary SQL view
- Use the spark.sql() function to run SQL queries
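Condensed into code, the three steps look like this (a minimal sketch with a tiny made-up table; a full worked example follows below):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Step 1: create a DataFrame
df = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])

# Step 2: register it as a temporary SQL view
df.createOrReplaceTempView("my_table")

# Step 3: run a SQL query against the view with spark.sql()
spark.sql("SELECT key FROM my_table WHERE value > 1").show()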
Example: Analyzing Employee Data
Let’s consider a sample employee dataset and learn how to use SQL to extract insights.
from pyspark.sql import SparkSession
# Create Spark session
spark = SparkSession.builder.appName("SQL-on-DataFrames").getOrCreate()
# Sample data
data = [
("John", "Sales", 3000),
("Alice", "Sales", 4000),
("Bob", "HR", 3500),
("Eve", "HR", 2800),
("Charlie", "IT", 5000)
]
# Define schema
columns = ["name", "department", "salary"]
# Create DataFrame
df = spark.createDataFrame(data, columns)
# Show the DataFrame
df.show()
+-------+----------+------+
|   name|department|salary|
+-------+----------+------+
|   John|     Sales|  3000|
|  Alice|     Sales|  4000|
|    Bob|        HR|  3500|
|    Eve|        HR|  2800|
|Charlie|        IT|  5000|
+-------+----------+------+
Registering DataFrame as a Temporary View
Before running SQL queries, we must register the DataFrame as a SQL view.
# Register as temporary view
df.createOrReplaceTempView("employees")
Using SQL Queries on the DataFrame
Query 1: Select All Employees in HR Department
result = spark.sql("SELECT * FROM employees WHERE department = 'HR'")
result.show()
+----+----------+------+
|name|department|salary|
+----+----------+------+
| Bob|        HR|  3500|
| Eve|        HR|  2800|
+----+----------+------+
Question:
Why do we need to register a DataFrame as a view?
Answer:
Spark SQL resolves table names, and a DataFrame on its own has no name the SQL engine can refer to. Registering it as a temporary view gives it a name in the session catalog, so spark.sql() can find it and access its schema and data.
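One way to see this is to list what the session catalog knows about; a minimal sketch, assuming the Spark session and view from the example above:
# The temporary view now appears alongside any tables in the session catalog
spark.sql("SHOW TABLES").show()

# The same information is available through the catalog API
print(spark.catalog.listTables())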
Query 2: Calculate Average Salary by Department
avg_salary = spark.sql("SELECT department, AVG(salary) as avg_salary FROM employees GROUP BY department")
avg_salary.show()
+----------+----------+
|department|avg_salary|
+----------+----------+
|        HR|    3150.0|
|        IT|    5000.0|
|     Sales|    3500.0|
+----------+----------+
This query groups employees by department and calculates the average salary per group.
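For comparison, the same aggregation can also be written with the DataFrame API rather than SQL; this is a minimal equivalent sketch, not the only way to do it:
from pyspark.sql import functions as F

# Group by department and compute the average salary, mirroring the SQL above
avg_salary_df = df.groupBy("department").agg(F.avg("salary").alias("avg_salary"))
avg_salary_df.show()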
Query 3: Join with Another Table
Let’s create another DataFrame that contains department locations and perform a join using SQL.
# Additional DataFrame
dept_data = [
("Sales", "New York"),
("HR", "Chicago"),
("IT", "San Francisco")
]
dept_columns = ["department", "location"]
dept_df = spark.createDataFrame(dept_data, dept_columns)
# Register as temp view
dept_df.createOrReplaceTempView("departments")
# SQL Join
joined = spark.sql("""
SELECT e.name, e.department, e.salary, d.location
FROM employees e
JOIN departments d ON e.department = d.department
""")
joined.show()
+-------+----------+------+-------------+
|   name|department|salary|     location|
+-------+----------+------+-------------+
|   John|     Sales|  3000|     New York|
|  Alice|     Sales|  4000|     New York|
|    Bob|        HR|  3500|      Chicago|
|    Eve|        HR|  2800|      Chicago|
|Charlie|        IT|  5000|San Francisco|
+-------+----------+------+-------------+
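The same join can also be expressed with the DataFrame API; a minimal sketch using the two DataFrames created above:
# Inner join on the shared "department" column, mirroring the SQL join above
joined_df = df.join(dept_df, on="department", how="inner")
joined_df.select("name", "department", "salary", "location").show()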
Common SQL Operations in Spark
- SELECT: To retrieve specific columns
- WHERE: To filter rows
- GROUP BY: To aggregate data
- ORDER BY: To sort results (see the sketch after this list)
- JOIN: To combine data from multiple tables
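Of these, ORDER BY is the only one not demonstrated above. Here is a minimal sketch that sorts results, assuming the employees view from the earlier example is still registered:
# Filter, then sort by salary from highest to lowest
spark.sql("""
    SELECT name, salary
    FROM employees
    WHERE salary > 3000
    ORDER BY salary DESC
""").show()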
Summary
Spark SQL makes it easy to analyze big data using familiar SQL syntax. By combining the flexibility of Python and the power of SQL, Spark lets you write expressive, high-performance queries on massive datasets.
As a beginner, you can use this feature to work with data intuitively without worrying about distributed computing; Spark takes care of parallelization and optimization in the background.