Working with large datasets often involves extracting relevant data by selecting specific columns, filtering rows based on conditions, and applying transformations. In Apache Spark, the DataFrame API makes these operations easy and efficient, even for beginners.
When you load a DataFrame, it may contain many columns, but often you only need a few for your analysis. You can use the select() method to choose specific columns.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SelectExample").getOrCreate()
# Sample data
data = [
("Alice", 25, "New York"),
("Bob", 30, "San Francisco"),
("Charlie", 28, "London")
]
# Create DataFrame
df = spark.createDataFrame(data, ["name", "age", "city"])
# Select specific columns
df.select("name", "city").show()
+-------+-------------+
|   name|         city|
+-------+-------------+
|  Alice|     New York|
|    Bob|San Francisco|
|Charlie|       London|
+-------+-------------+
What happens if you try to select a column that doesn't exist?
Spark will raise an AnalysisException. Always make sure the column name you specify matches a column in the DataFrame's schema (exact casing matters only if spark.sql.caseSensitive is enabled).
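For example, a minimal sketch (the salary column here is hypothetical and not part of our df; in recent PySpark versions the same exception is also exposed as pyspark.errors.AnalysisException):
from pyspark.sql.utils import AnalysisException
try:
    df.select("salary").show()  # "salary" is not in the schema
except AnalysisException as err:
    print("Column not found:", err)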
To work only with relevant data, we often need to filter rows that meet specific conditions using the filter() or where() methods.
Let’s say we want to filter people whose age is greater than 26.
# Filter rows where age > 26
df.filter(df.age > 26).show()
+-------+---+-------------+
|   name|age|         city|
+-------+---+-------------+
|    Bob| 30|San Francisco|
|Charlie| 28|       London|
+-------+---+-------------+
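where() is an alias for filter(), so the same condition can be written either way; a brief sketch:
# Equivalent ways to express the same filter
df.where(df.age > 26).show()
df.filter("age > 26").show()  # a SQL-style string expression also works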
You can also combine multiple conditions using & (AND), | (OR), and ~ (NOT).
# Filter people older than 26 and living in London
df.filter((df.age > 26) & (df.city == "London")).show()
+-------+---+------+
|   name|age|  city|
+-------+---+------+
|Charlie| 28|London|
+-------+---+------+
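The ~ operator negates a condition in the same way; for example, a quick sketch reusing the same df:
# Rows where the city is NOT London
df.filter(~(df.city == "London")).show()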
Why do we use parentheses around each condition?
Because in Python the bitwise operators & and | bind more tightly than comparison operators such as > and ==. Without parentheses, df.age > 26 & df.city == "London" would be grouped incorrectly. Parentheses ensure each comparison is evaluated first.
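For instance, an OR filter written with the same parenthesized style (a brief sketch using our df):
# People older than 29 OR living in London
df.filter((df.age > 29) | (df.city == "London")).show()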
Transformation refers to modifying existing columns or deriving new ones from them. Common transformations include adding or computing columns with withColumn() and renaming columns with withColumnRenamed().
Let's add a new column called age_plus_5 by adding 5 to the existing age column.
from pyspark.sql.functions import col
df.withColumn("age_plus_5", col("age") + 5).show()
+-------+---+-------------+----------+
|   name|age|         city|age_plus_5|
+-------+---+-------------+----------+
|  Alice| 25|     New York|        30|
|    Bob| 30|San Francisco|        35|
|Charlie| 28|       London|        33|
+-------+---+-------------+----------+
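withColumn() can also overwrite an existing column if you reuse its name; a small sketch:
# Replace the existing age column instead of adding a new one
df.withColumn("age", col("age") + 1).show()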
To rename an existing column, use withColumnRenamed(). Here we rename city to location.
df.withColumnRenamed("city", "location").show()
+-------+---+-------------+
|   name|age|     location|
+-------+---+-------------+
|  Alice| 25|     New York|
|    Bob| 30|San Francisco|
|Charlie| 28|       London|
+-------+---+-------------+
Is the original DataFrame modified permanently when we use withColumn or withColumnRenamed?
No. Spark DataFrames are immutable. Every transformation returns a new DataFrame. You must assign it to a variable if you want to store the changes.
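A short sketch (df_renamed is just a variable name chosen here):
# Keep the result by assigning it; df itself is unchanged
df_renamed = df.withColumnRenamed("city", "location")
df_renamed.printSchema()  # has "location"
df.printSchema()          # still has "city"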
In this section, we learned how to:
- Select specific columns with select()
- Filter rows with filter() or where()
- Transform and rename columns with withColumn() and withColumnRenamed()
These operations are essential for cleaning, preparing, and analyzing data before applying any machine learning or reporting logic. Mastering them is key to working effectively with Spark DataFrames.