Apache Spark Course

Module 12: Project – Real-World Data Pipeline

Basic DataFrame Operations in PySpark



Understanding Basic DataFrame Operations in PySpark

In PySpark, DataFrames are the core abstraction for working with structured and semi-structured data. They are similar to tables in a relational database or to pandas DataFrames, but they are designed to be processed in parallel across a distributed Spark cluster rather than on a single machine.

Creating a DataFrame

You can create a DataFrame from a list of tuples or read it from a CSV, JSON, or Parquet file. Here's how to create a simple DataFrame for demonstration:


from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("BasicDataFrameOps").getOrCreate()

data = [
    ("Alice", 25, "New York"),
    ("Bob", 30, "San Francisco"),
    ("Catherine", 29, "Chicago")
]

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True)
])

df = spark.createDataFrame(data, schema)
df.show()
    
+---------+---+-------------+
|     name|age|         city|
+---------+---+-------------+
|    Alice| 25|     New York|
|      Bob| 30|San Francisco|
|Catherine| 29|      Chicago|
+---------+---+-------------+
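
You can also read the same kind of DataFrame directly from files instead of an in-memory list. Below is a minimal sketch; the file names people.csv, people.json, and people.parquet are hypothetical placeholders for your own data sources.


# Hypothetical file paths; replace them with real data sources
df_csv = spark.read.csv("people.csv", header=True, inferSchema=True)
df_json = spark.read.json("people.json")
df_parquet = spark.read.parquet("people.parquet")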
    

Selecting Columns

The select() function allows you to extract specific columns from a DataFrame.


df.select("name", "city").show()
    
+---------+-------------+
|     name|         city|
+---------+-------------+
|    Alice|     New York|
|      Bob|San Francisco|
|Catherine|      Chicago|
+---------+-------------+
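
select() also accepts column expressions, not just column names. Here is a small sketch that projects one plain column and one computed value; the alias age_next_year is just an illustrative name.


from pyspark.sql.functions import col

# Expressions can be selected and given an alias
df.select(col("name"), (col("age") + 1).alias("age_next_year")).show()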
    

Question:

Why is it useful to select only specific columns?

Answer:

Selecting only the columns you need reduces the amount of data Spark has to move and process (and, with columnar formats such as Parquet, the amount it even reads from disk), and it keeps the output readable when a dataset has dozens or hundreds of columns.

Filtering Rows

You can filter rows using filter() or where(); where() is simply an alias for filter(), so the two behave identically.


df.filter(df.age > 25).show()
    
+---------+---+-------------+
|     name|age|         city|
+---------+---+-------------+
|      Bob| 30|San Francisco|
|Catherine| 29|      Chicago|
+---------+---+-------------+
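
Since where() is an alias for filter(), the same filter can be written either way, and conditions can be combined with & (and) or | (or). A small sketch, with the parentheses required around each condition:


from pyspark.sql.functions import col

# Equivalent to df.filter(df.age > 25)
df.where(col("age") > 25).show()

# Combine two conditions; each one must be wrapped in parentheses
df.filter((col("age") > 25) & (col("city") == "Chicago")).show()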
    

Adding New Columns

Use the withColumn() function to add a new column derived from existing ones (or to replace an existing column of the same name).


from pyspark.sql.functions import col

df_with_bonus = df.withColumn("age_plus_5", col("age") + 5)
df_with_bonus.show()
    
+---------+---+-------------+----------+
|     name|age|         city|age_plus_5|
+---------+---+-------------+----------+
|    Alice| 25|     New York|        30|
|      Bob| 30|San Francisco|        35|
|Catherine| 29|      Chicago|        34|
+---------+---+-------------+----------+
    

Question:

Why not just modify the existing 'age' column instead of adding a new one?

Answer:

Creating a new column helps avoid accidental data loss and keeps transformations traceable. You can always drop the original later if needed.
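
For example, once the derived column is in place you can remove the original with drop(). A minimal sketch continuing from df_with_bonus above:


# Keep the derived column and drop the original "age"
df_trimmed = df_with_bonus.drop("age")
df_trimmed.show()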

Renaming Columns

You can rename columns using the withColumnRenamed() function.


df_renamed = df.withColumnRenamed("city", "location")
df_renamed.show()
    
+---------+---+-------------+
|     name|age|     location|
+---------+---+-------------+
|    Alice| 25|     New York|
|      Bob| 30|San Francisco|
|Catherine| 29|      Chicago|
+---------+---+-------------+
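
Because withColumnRenamed() returns a new DataFrame, calls can be chained to rename several columns in one step. A small sketch; the target names full_name and location are arbitrary choices:


# Each call returns a new DataFrame, so renames chain naturally
df_renamed_all = (
    df.withColumnRenamed("name", "full_name")
      .withColumnRenamed("city", "location")
)
df_renamed_all.show()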
    

Grouping and Aggregation

To perform group-wise operations (like average, count, etc.), use groupBy() with aggregation functions.


df.groupBy("city").count().show()
    
+-------------+-----+
|         city|count|
+-------------+-----+
|San Francisco|    1|
|      Chicago|    1|
|     New York|    1|
+-------------+-----+
    

You can also apply multiple aggregations:


from pyspark.sql.functions import avg, max

df.groupBy("city").agg(
    avg("age").alias("average_age"),
    max("age").alias("max_age")
).show()
    
+-------------+-----------+-------+
|         city|average_age|max_age|
+-------------+-----------+-------+
|San Francisco|       30.0|     30|
|      Chicago|       29.0|     29|
|     New York|       25.0|     25|
+-------------+-----------+-------+
    

Question:

Can you group by a column and apply multiple calculations at once?

Answer:

Yes. PySpark’s agg() method allows applying multiple functions like avg(), count(), min(), and max() simultaneously.
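
As a sketch of that idea, the example below extends the earlier aggregation with count() and min() (output omitted):


from pyspark.sql.functions import avg, count, max, min

df.groupBy("city").agg(
    count("*").alias("num_people"),
    min("age").alias("min_age"),
    avg("age").alias("average_age"),
    max("age").alias("max_age")
).show()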

Conclusion

In this section, you've learned how to perform the most common DataFrame operations in PySpark, including selecting, filtering, renaming, adding new columns, and performing group aggregations. These operations are essential for transforming and analyzing data efficiently using Spark.
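
To tie these operations together, here is a minimal sketch that chains them into one small pipeline over the same df; the derived names age_plus_5, location, and avg_age_plus_5 follow the examples above:


from pyspark.sql.functions import avg, col

result = (
    df.select("name", "age", "city")
      .filter(col("age") >= 25)
      .withColumn("age_plus_5", col("age") + 5)
      .withColumnRenamed("city", "location")
      .groupBy("location")
      .agg(avg("age_plus_5").alias("avg_age_plus_5"))
)
result.show()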


