In Apache Spark, a DataFrame is a distributed collection of data organized into named columns — much like a table in a relational database or an Excel spreadsheet. PySpark provides several simple ways to create and explore these DataFrames.
While RDDs give more fine-grained control, DataFrames are easier to use and better optimized for performance: they support SQL-like syntax, automatic query optimization, and more efficient memory management.
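As a rough illustration of that SQL-like syntax, here is a minimal sketch assuming a DataFrame df with Name and Age columns (one is created later in this tutorial):
# Column-based, SQL-like operations (df is assumed to already exist)
df.select("Name").filter(df.Age > 25).show()
# The same query written as SQL against a temporary view
df.createOrReplaceTempView("people")
spark.sql("SELECT Name FROM people WHERE Age > 25").show()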
You can create a DataFrame from various sources, such as Python lists and dictionaries, CSV files, or data paired with an explicitly defined schema.
To work with DataFrames in PySpark, you must first start a SparkSession.
from pyspark.sql import SparkSession
# Initialize SparkSession
spark = SparkSession.builder \
    .appName("DataFrameExample") \
    .getOrCreate()
What is a SparkSession?
It's the entry point to programming with Spark using the DataFrame and SQL APIs. It lets you create DataFrames, run queries, and interact with Spark.
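As a quick check that the session works, you can create a tiny built-in DataFrame and run a SQL statement directly through it:
# spark.range() creates a DataFrame with a single "id" column (0, 1, 2)
spark.range(3).show()
# spark.sql() runs a SQL query and returns the result as a DataFrame
spark.sql("SELECT 1 AS answer").show()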
This is one of the simplest ways to create a DataFrame manually.
# Sample data
data = [("Alice", 24), ("Bob", 30), ("Charlie", 28)]
# Define column names
columns = ["Name", "Age"]
# Create DataFrame
df = spark.createDataFrame(data, columns)
# Display the DataFrame
df.show()
+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 24|
|    Bob| 30|
|Charlie| 28|
+-------+---+
Here, each tuple represents a row, and the columns list defines the column headers.
What happens if you don’t specify column names?
Spark will assign default column names like _1, _2, etc., which may be confusing.
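A minimal sketch of that behaviour, reusing the same kind of sample data:
# No column names supplied, so Spark falls back to _1, _2, ...
df_default = spark.createDataFrame([("Alice", 24), ("Bob", 30)])
df_default.printSchema()
# root
#  |-- _1: string (nullable = true)
#  |-- _2: long (nullable = true)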
Using Python dictionaries allows you to map keys directly to column names.
# List of dictionaries
data = [
    {"Name": "David", "Age": 35},
    {"Name": "Eva", "Age": 29}
]
# Create DataFrame
df2 = spark.createDataFrame(data)
# Display
df2.show()
+-----+---+
| Name|Age|
+-----+---+
|David| 35|
|  Eva| 29|
+-----+---+
This is particularly useful when your data already exists in a dictionary format, such as JSON objects or API responses.
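For example, a hypothetical JSON payload (standing in for an API response) can be parsed into dictionaries and passed straight to createDataFrame:
import json

# Made-up JSON payload, e.g. the body of an API response
payload = '[{"Name": "Frank", "Age": 41}, {"Name": "Grace", "Age": 33}]'
records = json.loads(payload)          # a list of dictionaries
df_api = spark.createDataFrame(records)
df_api.show()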
CSV files are one of the most common data formats. PySpark makes it easy to load and inspect them.
# Load CSV file into DataFrame
df_csv = spark.read.csv("sample.csv", header=True, inferSchema=True)
# Show the first few rows
df_csv.show()
Here, header=True tells Spark to use the first row as column names, and inferSchema=True allows Spark to guess data types (like integer, string, etc.).
Why is inferSchema=True useful?
Without it, Spark treats all data as strings. This may cause issues if you want to run numeric operations later.
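A rough before-and-after comparison, assuming the same sample.csv with Name and Age columns:
# Without inferSchema, every column is read as a string
df_raw = spark.read.csv("sample.csv", header=True)
df_raw.printSchema()       # Age comes back as string

# With inferSchema, Spark scans the data and picks suitable types
df_typed = spark.read.csv("sample.csv", header=True, inferSchema=True)
df_typed.printSchema()     # Age comes back as integer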
In some cases, you might want to explicitly define the column types using StructType.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Define schema
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True)
])
# Data
data = [("John", 32), ("Lisa", 27)]
# Create DataFrame with schema
df_schema = spark.createDataFrame(data, schema)
df_schema.show()
+----+---+
|Name|Age|
+----+---+
|John| 32|
|Lisa| 27|
+----+---+
This approach gives more control and is helpful when reading data from raw sources where you want to enforce specific formats.
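For instance, the same schema object can be passed when reading a file, so Spark enforces the declared types instead of inferring them (assuming sample.csv has matching Name and Age columns):
# Read the CSV with an explicit schema instead of inferSchema
df_csv_schema = spark.read.csv("sample.csv", header=True, schema=schema)
df_csv_schema.printSchema()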
Once the DataFrame exists, a few built-in helpers make it easy to inspect:
df.show(n) – displays the top n rows in a tabular format.
df.printSchema() – prints the structure of the DataFrame.
df.columns – returns a list of column names.
df.printSchema()
root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
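And a quick look at the other two helpers, using the df created earlier:
df.show(2)            # only the first two rows
print(df.columns)     # ['Name', 'Age']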
Creating and displaying DataFrames in PySpark is the first step toward large-scale data analysis. You can start with simple lists, structured files like CSV, or even define schemas manually. Once the DataFrame is ready, you can visualize it with show() and inspect its structure using printSchema().