Apache Spark Course

Module 12: Project – Real-World Data Pipeline

Selecting, Filtering, and Transforming Data in Spark DataFrames




Working with large datasets often involves extracting relevant data by selecting specific columns, filtering rows based on conditions, and applying transformations. In Apache Spark, the DataFrame API makes these operations easy and efficient, even for beginners.

Selecting Specific Columns

When you load a DataFrame, it may contain many columns, but often you only need a few for your analysis. You can use the select() method to choose specific columns.

Example


from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SelectExample").getOrCreate()

# Sample data
data = [
    ("Alice", 25, "New York"),
    ("Bob", 30, "San Francisco"),
    ("Charlie", 28, "London")
]

# Create DataFrame
df = spark.createDataFrame(data, ["name", "age", "city"])

# Select specific columns
df.select("name", "city").show()
    
+-------+-------------+
|   name|         city|
+-------+-------------+
|  Alice|     New York|
|    Bob|San Francisco|
|Charlie|       London|
+-------+-------------+
    

Question:

What happens if you try to select a column that doesn't exist?

Answer:

Spark will throw an AnalysisException. Always make sure the column name you specify matches a column in the schema. Note that by default Spark resolves column names case-insensitively; set spark.sql.caseSensitive to true if you need exact-case matching.

Filtering Rows Based on Conditions

To work only with relevant data, we often need to filter rows that meet specific conditions using the filter() or where() method (where() is simply an alias for filter()).

Example

Let’s say we want to filter people whose age is greater than 26.


# Filter rows where age > 26
df.filter(df.age > 26).show()
    
+-------+---+-------------+
|   name|age|         city|
+-------+---+-------------+
|    Bob| 30|San Francisco|
|Charlie| 28|       London|
+-------+---+-------------+
    

You can also combine multiple conditions using & (AND), | (OR), and ~ (NOT).

Example: Filtering with Multiple Conditions


# Filter people older than 26 and living in London
df.filter((df.age > 26) & (df.city == "London")).show()
    
+-------+---+------+
|   name|age|  city|
+-------+---+------+
|Charlie| 28|London|
+-------+---+------+
    

Question:

Why do we use parentheses around each condition?

Answer:

Because in Python the bitwise operators & and | bind more tightly than comparison operators like > and ==. Without parentheses, df.age > 26 & df.city == "London" would be parsed as df.age > (26 & df.city) == "London", which is not what we mean and raises an error. Parentheses ensure the correct evaluation order.

Transforming Data

Transformation refers to modifying or deriving new data columns using existing ones. Common transformations include adding a new column derived from existing ones and renaming a column.

Example: Adding a New Column

Let’s add a new column called age_plus_5 by adding 5 to the existing age.


from pyspark.sql.functions import col

df.withColumn("age_plus_5", col("age") + 5).show()
    
+-------+---+-------------+----------+
|   name|age|         city|age_plus_5|
+-------+---+-------------+----------+
|  Alice| 25|     New York|        30|
|    Bob| 30|San Francisco|        35|
|Charlie| 28|       London|        33|
+-------+---+-------------+----------+
    

Example: Renaming a Column


df.withColumnRenamed("city", "location").show()
    
+-------+---+-------------+
|   name|age|     location|
+-------+---+-------------+
|  Alice| 25|     New York|
|    Bob| 30|San Francisco|
|Charlie| 28|       London|
+-------+---+-------------+
    

Question:

Is the original DataFrame modified permanently when we use withColumn or withColumnRenamed?

Answer:

No. Spark DataFrames are immutable. Every transformation returns a new DataFrame. You must assign it to a variable if you want to store the changes.

Summary

In this section, we learned how to:

- Select specific columns with select()
- Filter rows based on conditions using filter() or where(), combining conditions with & (AND), | (OR), and ~ (NOT)
- Transform data by adding derived columns with withColumn() and renaming columns with withColumnRenamed()

These operations are essential for cleaning, preparing, and analyzing data before applying any machine learning or reporting logic. Mastering them is key to working effectively with Spark DataFrames.


