Creating and Displaying DataFrames in PySpark
In Apache Spark, a DataFrame is a distributed collection of data organized into named columns — much like a table in a relational database or an Excel spreadsheet. PySpark provides several simple ways to create and explore these DataFrames.
Why Use DataFrames Instead of RDDs?
While RDDs give you more control, DataFrames are easier to use and better optimized for performance. They support SQL-like syntax, automatic query optimization, and more efficient memory management.
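For example, filtering the same records looks quite different with each API. Here is a minimal sketch (it assumes the SparkSession named spark that is created in Step 1 below):
# RDD approach: rows are plain tuples, accessed by position
rdd = spark.sparkContext.parallelize([("Alice", 24), ("Bob", 30)])
adults_rdd = rdd.filter(lambda row: row[1] > 25)
# DataFrame approach: rows have named columns, and the query
# passes through Spark's optimizer before it runs
df_people = spark.createDataFrame([("Alice", 24), ("Bob", 30)], ["Name", "Age"])
adults_df = df_people.filter(df_people.Age > 25)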
How to Create a DataFrame?
You can create a DataFrame from various sources such as:
- Python lists and dictionaries
- CSV, JSON, or Parquet files
- External databases like MySQL or PostgreSQL
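Each source follows a similar reader pattern. As a quick sketch, JSON and Parquet files load with a single call (the file names here are placeholders, and spark is the SparkSession from Step 1 below):
# JSON: expects one JSON object per line by default
df_json = spark.read.json("people.json")
# Parquet: a columnar format that stores its own schema
df_parquet = spark.read.parquet("people.parquet")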
Step 1: Start a SparkSession
To work with DataFrames in PySpark, you must first start a SparkSession.
from pyspark.sql import SparkSession
# Initialize SparkSession
spark = SparkSession.builder \
    .appName("DataFrameExample") \
    .getOrCreate()
Question:
What is a SparkSession?
Answer:
It's the entry point to programming with Spark using the DataFrame and SQL APIs. It lets you create DataFrames, run queries, and interact with Spark.
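For example, the same session object can register a DataFrame as a temporary view and then query it with SQL. A short sketch (the view name people is arbitrary):
# Register a DataFrame as a SQL-queryable view
df_demo = spark.createDataFrame([("Alice", 24), ("Bob", 30)], ["Name", "Age"])
df_demo.createOrReplaceTempView("people")
# Run a SQL query through the same SparkSession
spark.sql("SELECT Name FROM people WHERE Age > 25").show()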
Example 1: Create DataFrame from a List of Tuples
This is one of the simplest ways to create a DataFrame manually.
# Sample data
data = [("Alice", 24), ("Bob", 30), ("Charlie", 28)]
# Define column names
columns = ["Name", "Age"]
# Create DataFrame
df = spark.createDataFrame(data, columns)
# Display the DataFrame
df.show()
+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 24|
|    Bob| 30|
|Charlie| 28|
+-------+---+
Here, each tuple represents a row, and the columns list defines the column headers.
Question:
What happens if you don’t specify column names?
Answer:
Spark will assign default column names like _1, _2, etc., which may be confusing.
Example 2: Create DataFrame from a List of Dictionaries
Using Python dictionaries allows you to map keys directly to column names.
# List of dictionaries
data = [
    {"Name": "David", "Age": 35},
    {"Name": "Eva", "Age": 29}
]
# Create DataFrame
df2 = spark.createDataFrame(data)
# Display
df2.show()
+-----+---+
| Name|Age|
+-----+---+
|David| 35|
|  Eva| 29|
+-----+---+
This is particularly useful when your data already exists in a dictionary format, such as JSON objects or API responses.
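For instance, a JSON string returned by an API can be parsed with the standard json module and handed straight to createDataFrame. A minimal sketch (the response string is made up for illustration):
import json
# Hypothetical API response
response = '[{"Name": "Frank", "Age": 41}, {"Name": "Grace", "Age": 33}]'
# json.loads() produces a list of dictionaries, exactly what Example 2 used
df_api = spark.createDataFrame(json.loads(response))
df_api.show()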
Example 3: Read DataFrame from CSV File
CSV files are one of the most common data formats. PySpark makes it easy to load and inspect them.
# Load CSV file into DataFrame
df_csv = spark.read.csv("sample.csv", header=True, inferSchema=True)
# Show the first few rows
df_csv.show()
Here, header=True tells Spark to use the first row as column names, and inferSchema=True allows Spark to guess data types (like integer, string, etc.).
Question:
Why is inferSchema=True useful?
Answer:
Without it, Spark treats all data as strings. This may cause issues if you want to run numeric operations later.
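The difference is easy to check by printing both schemas. A sketch, assuming the same sample.csv as above:
# All columns come back as strings
df_str = spark.read.csv("sample.csv", header=True)
df_str.printSchema()   # Age would show as: string
# Spark samples the file and picks appropriate types
df_typed = spark.read.csv("sample.csv", header=True, inferSchema=True)
df_typed.printSchema() # Age would show as: integer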
Example 4: Create DataFrame with a Defined Schema
In some cases, you might want to explicitly define the column types using StructType.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Define schema
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True)
])
# Data
data = [("John", 32), ("Lisa", 27)]
# Create DataFrame with schema
df_schema = spark.createDataFrame(data, schema)
df_schema.show()
+----+---+
|Name|Age|
+----+---+
|John| 32|
|Lisa| 27|
+----+---+
This approach gives more control and is helpful when reading data from raw sources where you want to enforce specific formats.
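For example, the same schema object can be passed to the CSV reader, which skips the extra pass over the data that inferSchema requires and guarantees the types you expect. A sketch, reusing schema and sample.csv from above:
# Enforce the schema at read time instead of inferring it
df_csv_strict = spark.read.csv("sample.csv", header=True, schema=schema)
df_csv_strict.printSchema()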
How to View DataFrames?
- df.show(n) – displays the top n rows in a tabular format.
- df.printSchema() – prints the structure of the DataFrame.
- df.columns – returns a list of column names.
df.printSchema()
root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
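The other two helpers are just as direct. For the df from Example 1:
df.show(2)         # prints only the first 2 rows
print(df.columns)  # ['Name', 'Age']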
Summary
Creating and displaying DataFrames in PySpark is the first step toward large-scale data analysis. You can start with simple lists, structured files like CSV, or even define schemas manually. Once the DataFrame is ready, you can display it with show() and inspect its structure using printSchema().