Understanding Basic DataFrame Operations in PySpark
In PySpark, DataFrames are a core abstraction for working with structured and semi-structured data. They are similar to tables in a relational database or to pandas DataFrames, but they are built to scale across the distributed compute that Apache Spark provides.
Creating a DataFrame
You can create a DataFrame from an in-memory list of tuples, or read one from a CSV, JSON, or Parquet file (a file-reading sketch follows the output below). Here's how to create a simple DataFrame with an explicit schema for demonstration:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Start (or reuse) a SparkSession; it is the entry point for DataFrame operations.
spark = SparkSession.builder.appName("BasicDataFrameOps").getOrCreate()

# Sample rows as (name, age, city) tuples.
data = [
    ("Alice", 25, "New York"),
    ("Bob", 30, "San Francisco"),
    ("Catherine", 29, "Chicago")
]

# Explicit schema: column name, data type, and whether nulls are allowed.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True)
])

df = spark.createDataFrame(data, schema)
df.show()
+---------+---+-------------+
|     name|age|         city|
+---------+---+-------------+
|    Alice| 25|     New York|
|      Bob| 30|San Francisco|
|Catherine| 29|      Chicago|
+---------+---+-------------+
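If your data already lives in a file rather than in memory, the same SparkSession exposes reader methods for common formats. Here is a minimal sketch, assuming hypothetical files named people.csv, people.json, and people.parquet:

# Hypothetical paths; substitute files from your own environment.
# header=True treats the first CSV line as column names, and
# inferSchema=True asks Spark to guess column types instead of using strings.
csv_df = spark.read.csv("people.csv", header=True, inferSchema=True)
json_df = spark.read.json("people.json")
parquet_df = spark.read.parquet("people.parquet")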
Selecting Columns
The select() function allows you to extract specific columns from a DataFrame.
df.select("name", "city").show()
+---------+-------------+
|     name|         city|
+---------+-------------+
|    Alice|     New York|
|      Bob|San Francisco|
|Catherine|      Chicago|
+---------+-------------+
Question:
Why is it useful to select only specific columns?
Answer:
It improves both readability and performance: Spark only has to carry the columns you ask for through the rest of the query, which reduces I/O and memory use. The benefit is most noticeable on wide datasets with dozens or hundreds of columns.
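Note that select() also accepts column expressions, not just names, so you can project a derived value in the same step. A small sketch; the alias age_next_year is purely illustrative:

from pyspark.sql.functions import col

# Project one plain column and one derived expression with an alias.
df.select(col("name"), (col("age") + 1).alias("age_next_year")).show()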
Filtering Rows
You can filter rows using filter() or where(). The two behave identically; where() is simply an alias for filter().
df.filter(df.age > 25).show()
+---------+---+-------------+
|     name|age|         city|
+---------+---+-------------+
|      Bob| 30|San Francisco|
|Catherine| 29|      Chicago|
+---------+---+-------------+
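Filters can also combine several conditions, and the condition can be given either as a Column expression or as a SQL-style string. A quick sketch using the same df:

from pyspark.sql.functions import col

# Combine conditions with & (and) or | (or); wrap each condition in parentheses.
df.filter((col("age") > 25) & (col("city") == "Chicago")).show()

# where() accepts a SQL-style string condition and is equivalent to filter().
df.where("age > 25 AND city = 'Chicago'").show()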
Adding New Columns
Use the withColumn() function to add a new column to your DataFrame.
from pyspark.sql.functions import col
df_with_bonus = df.withColumn("age_plus_5", col("age") + 5)
df_with_bonus.show()
+---------+---+-------------+----------+
|     name|age|         city|age_plus_5|
+---------+---+-------------+----------+
|    Alice| 25|     New York|        30|
|      Bob| 30|San Francisco|        35|
|Catherine| 29|      Chicago|        34|
+---------+---+-------------+----------+
Question:
Why not just modify the existing 'age' column instead of adding a new one?
Answer:
Creating a new column helps avoid accidental data loss and keeps transformations traceable. You can always drop the original later if needed.
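For instance, if the original age column is truly no longer needed, it can be dropped after the transformation. A small sketch building on the df defined above (df_updated is just an illustrative name):

from pyspark.sql.functions import col

# Add the derived column, then drop the original once you are sure it is unused.
df_updated = df.withColumn("age_plus_5", col("age") + 5).drop("age")
df_updated.show()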
Renaming Columns
You can rename columns using the withColumnRenamed() function.
df_renamed = df.withColumnRenamed("city", "location")
df_renamed.show()
+---------+---+-------------+
|     name|age|     location|
+---------+---+-------------+
|    Alice| 25|     New York|
|      Bob| 30|San Francisco|
|Catherine| 29|      Chicago|
+---------+---+-------------+
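To rename more than one column, withColumnRenamed() calls can be chained, or toDF() can replace every column name at once. A sketch with illustrative new names:

# Chain withColumnRenamed() for a few columns...
df.withColumnRenamed("name", "full_name") \
  .withColumnRenamed("city", "location") \
  .show()

# ...or rename all columns positionally in a single call with toDF().
df.toDF("full_name", "years", "location").show()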
Grouping and Aggregation
To perform group-wise operations (like average, count, etc.), use groupBy() with aggregation functions.
df.groupBy("city").count().show()
+-------------+-----+
|         city|count|
+-------------+-----+
|San Francisco|    1|
|      Chicago|    1|
|     New York|    1|
+-------------+-----+
You can also apply multiple aggregations:
from pyspark.sql.functions import avg, max
df.groupBy("city").agg(
avg("age").alias("average_age"),
max("age").alias("max_age")
).show()
+-------------+-----------+-------+
|         city|average_age|max_age|
+-------------+-----------+-------+
|San Francisco|       30.0|     30|
|      Chicago|       29.0|     29|
|     New York|       25.0|     25|
+-------------+-----------+-------+
Question:
Can you group by a column and apply multiple calculations at once?
Answer:
Yes. PySpark's agg() method allows applying multiple functions, such as avg(), count(), min(), and max(), simultaneously.
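As a sketch, extending the earlier example with a row count and a minimum (the aliases are illustrative):

from pyspark.sql.functions import avg, count, max, min

# Several aggregations computed in a single pass over each group.
df.groupBy("city").agg(
    count("*").alias("num_people"),
    min("age").alias("min_age"),
    max("age").alias("max_age"),
    avg("age").alias("average_age")
).show()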
Conclusion
In this section, you've learned how to perform the most common DataFrame operations in PySpark, including selecting, filtering, renaming, adding new columns, and performing group aggregations. These operations are essential for transforming and analyzing data efficiently using Spark.