One of the most powerful features of Apache Spark is that it lets you run SQL queries on distributed datasets through DataFrames. Spark SQL provides seamless integration of SQL and Python, allowing data analysts and developers to analyze structured data easily.
If you are familiar with traditional SQL used in databases like MySQL or PostgreSQL, Spark lets you use almost the same syntax to query huge datasets stored in DataFrames. This makes learning Spark much easier for beginners with SQL backgrounds.
Spark provides the spark.sql() function for running SQL queries against registered views. Let's consider a sample employee dataset and learn how to use SQL to extract insights.
from pyspark.sql import SparkSession
# Create Spark session
spark = SparkSession.builder.appName("SQL-on-DataFrames").getOrCreate()
# Sample data
data = [
("John", "Sales", 3000),
("Alice", "Sales", 4000),
("Bob", "HR", 3500),
("Eve", "HR", 2800),
("Charlie", "IT", 5000)
]
# Column names (Spark infers the column types from the data)
columns = ["name", "department", "salary"]
# Create DataFrame
df = spark.createDataFrame(data, columns)
# Show the DataFrame
df.show()
+-------+----------+------+
|   name|department|salary|
+-------+----------+------+
|   John|     Sales|  3000|
|  Alice|     Sales|  4000|
|    Bob|        HR|  3500|
|    Eve|        HR|  2800|
|Charlie|        IT|  5000|
+-------+----------+------+
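Since we passed only a list of column names, Spark infers each column's type from the values. A quick way to verify what it inferred is printSchema():

# Inspect the inferred schema
df.printSchema()

For this data it reports name and department as string and salary as long, because Python integers are inferred as 64-bit longs.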
Before running SQL queries, we must register the DataFrame as a SQL view.
# Register as temporary view
df.createOrReplaceTempView("employees")
# Query the view with standard SQL
result = spark.sql("SELECT * FROM employees WHERE department = 'HR'")
result.show()
+----+----------+------+
|name|department|salary|
+----+----------+------+
| Bob|        HR|  3500|
| Eve|        HR|  2800|
+----+----------+------+
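The same result is also available without SQL. As a sketch, here is one equivalent way to express the HR filter using the DataFrame API:

# Equivalent filter using the DataFrame API
df.filter(df.department == "HR").show()

Both forms are resolved by the same query engine, so choosing between SQL and the DataFrame API is largely a matter of taste.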
Why do we need to register a DataFrame as a view?
Because Spark SQL can only query table-like structures that are registered in its catalog; a DataFrame held in a Python variable is not visible there. Registering the DataFrame as a view gives the SQL engine a name it can resolve, along with access to the DataFrame's structure and data.
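You can see this requirement in action: querying a name that was never registered fails. The snippet below is a small illustration (the table name unregistered_df is hypothetical; no view with that name exists):

# Hypothetical: 'unregistered_df' was never registered as a view,
# so Spark SQL cannot resolve the name and raises an AnalysisException
try:
    spark.sql("SELECT * FROM unregistered_df").show()
except Exception as e:
    print(type(e).__name__)  # AnalysisException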
# Average salary per department
avg_salary = spark.sql("SELECT department, AVG(salary) AS avg_salary FROM employees GROUP BY department")
avg_salary.show()
+----------+----------+
|department|avg_salary|
+----------+----------+
|        HR|    3150.0|
|        IT|    5000.0|
|     Sales|    3500.0|
+----------+----------+
This query groups employees by department and calculates the average salary per group.
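If you prefer method chaining, the equivalent aggregation can be written with the DataFrame API; here is a sketch using Spark's built-in avg function:

from pyspark.sql import functions as F

# Equivalent aggregation via the DataFrame API
df.groupBy("department").agg(F.avg("salary").alias("avg_salary")).show()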
Let’s create another DataFrame that contains department locations and perform a join using SQL.
# Additional DataFrame
dept_data = [
("Sales", "New York"),
("HR", "Chicago"),
("IT", "San Francisco")
]
dept_columns = ["department", "location"]
dept_df = spark.createDataFrame(dept_data, dept_columns)
# Register as temp view
dept_df.createOrReplaceTempView("departments")
# SQL Join
joined = spark.sql("""
SELECT e.name, e.department, e.salary, d.location
FROM employees e
JOIN departments d ON e.department = d.department
""")
joined.show()
+-------+----------+------+-------------+
|   name|department|salary|     location|
+-------+----------+------+-------------+
|   John|     Sales|  3000|     New York|
|  Alice|     Sales|  4000|     New York|
|    Bob|        HR|  3500|      Chicago|
|    Eve|        HR|  2800|      Chicago|
|Charlie|        IT|  5000|San Francisco|
+-------+----------+------+-------------+
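For comparison, here is one equivalent way to express the same inner join with the DataFrame API, joining on the shared department column:

# Equivalent inner join using the DataFrame API
df.join(dept_df, on="department", how="inner") \
  .select("name", "department", "salary", "location") \
  .show()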
Spark SQL makes it easy to analyze big data using familiar SQL syntax. By combining the flexibility of Python and the power of SQL, Spark lets you write expressive, high-performance queries on massive datasets.
For beginners, this means you can work with data intuitively without worrying about the mechanics of distributed computing; Spark takes care of parallelization and optimization in the background.
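If you are curious what Spark is doing behind the scenes, any query can print the execution plan Spark generated for it. The call below is a standard way to inspect the plan for a query like our earlier join:

# Print the physical plan Spark generated for a join query
spark.sql("""
    SELECT e.name, d.location
    FROM employees e
    JOIN departments d ON e.department = d.department
""").explain()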