Working with large datasets often involves extracting relevant data by selecting specific columns, filtering rows based on conditions, and applying transformations. In Apache Spark, the DataFrame API makes these operations easy and efficient, even for beginners.
When you load a DataFrame, it may contain many columns, but often you only need a few for your analysis. You can use the select() method to choose specific columns.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SelectExample").getOrCreate()
# Sample data
data = [
("Alice", 25, "New York"),
("Bob", 30, "San Francisco"),
("Charlie", 28, "London")
]
# Create DataFrame
df = spark.createDataFrame(data, ["name", "age", "city"])
# Select specific columns
df.select("name", "city").show()
+-------+-------------+
|   name|         city|
+-------+-------------+
|  Alice|     New York|
|    Bob|San Francisco|
|Charlie|       London|
+-------+-------------+
What happens if you try to select a column that doesn't exist?
Spark will raise an AnalysisException. Always make sure the column name you specify matches a column in the DataFrame's schema (exact casing matters only if spark.sql.caseSensitive is enabled).
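For example, a minimal sketch (the salary column here is hypothetical and not part of our df; in recent PySpark versions the same exception is also exposed as pyspark.errors.AnalysisException):
from pyspark.sql.utils import AnalysisException
try:
    df.select("salary").show()  # "salary" is not in the schema
except AnalysisException as err:
    print("Column not found:", err)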
To work only with relevant data, we often need to filter rows that meet specific conditions using the filter() or where() methods.
Let’s say we want to filter people whose age is greater than 26.
# Filter rows where age > 26
df.filter(df.age > 26).show()
+-------+---+-------------+
|   name|age|         city|
+-------+---+-------------+
|    Bob| 30|San Francisco|
|Charlie| 28|       London|
+-------+---+-------------+
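where() is an alias for filter(), so the same condition can be written either way; a brief sketch:
# Equivalent ways to express the same filter
df.where(df.age > 26).show()
df.filter("age > 26").show()  # a SQL-style string expression also works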
You can also combine multiple conditions using & (AND), | (OR), and ~ (NOT).
# Filter people older than 26 and living in London
df.filter((df.age > 26) & (df.city == "London")).show()
+-------+---+------+
|   name|age|  city|
+-------+---+------+
|Charlie| 28|London|
+-------+---+------+
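The ~ operator negates a condition in the same way; for example, a quick sketch reusing the same df:
# Rows where the city is NOT London
df.filter(~(df.city == "London")).show()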
Why do we use parentheses around each condition?
Because in Python the bitwise operators & and | bind more tightly than comparison operators such as > and ==. Without parentheses, df.age > 26 & df.city == "London" would be grouped incorrectly. Parentheses ensure each comparison is evaluated first.
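For instance, an OR filter written with the same parenthesized style (a brief sketch using our df):
# People older than 29 OR living in London
df.filter((df.age > 29) | (df.city == "London")).show()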
Transformation refers to modifying existing columns or deriving new ones from them. Common transformations include adding or computing columns with withColumn() and renaming columns with withColumnRenamed().
Let's add a new column called age_plus_5 by adding 5 to the existing age column.
from pyspark.sql.functions import col
df.withColumn("age_plus_5", col("age") + 5).show()
+-------+---+-------------+----------+
|   name|age|         city|age_plus_5|
+-------+---+-------------+----------+
|  Alice| 25|     New York|        30|
|    Bob| 30|San Francisco|        35|
|Charlie| 28|       London|        33|
+-------+---+-------------+----------+
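withColumn() can also overwrite an existing column if you reuse its name; a small sketch:
# Replace the existing age column instead of adding a new one
df.withColumn("age", col("age") + 1).show()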
To rename an existing column, use withColumnRenamed(). Here we rename city to location.
df.withColumnRenamed("city", "location").show()
+-------+---+-------------+
|   name|age|     location|
+-------+---+-------------+
|  Alice| 25|     New York|
|    Bob| 30|San Francisco|
|Charlie| 28|       London|
+-------+---+-------------+
Is the original DataFrame modified permanently when we use withColumn or withColumnRenamed?
No. Spark DataFrames are immutable. Every transformation returns a new DataFrame. You must assign it to a variable if you want to store the changes.
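A short sketch (df_renamed is just a variable name chosen here):
# Keep the result by assigning it; df itself is unchanged
df_renamed = df.withColumnRenamed("city", "location")
df_renamed.printSchema()  # has "location"
df.printSchema()          # still has "city"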
In this section, we learned how to:
- Select specific columns with select()
- Filter rows with filter() or where()
- Transform and rename columns with withColumn() and withColumnRenamed()
These operations are essential for cleaning, preparing, and analyzing data before applying any machine learning or reporting logic. Mastering them is key to working effectively with Spark DataFrames.