In Apache Spark, a DataFrame is a distributed collection of data organized into named columns — much like a table in a relational database or an Excel spreadsheet. PySpark provides several simple ways to create and explore these DataFrames.
While RDDs give more fine-grained control, DataFrames are easier to use and better optimized for performance: they support SQL-like syntax, automatic query optimization, and more efficient memory management.
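As a rough illustration of that SQL-like syntax, here is a minimal sketch assuming a DataFrame df with Name and Age columns (one is created later in this tutorial):
# Column-based, SQL-like operations (df is assumed to already exist)
df.select("Name").filter(df.Age > 25).show()
# The same query written as SQL against a temporary view
df.createOrReplaceTempView("people")
spark.sql("SELECT Name FROM people WHERE Age > 25").show()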
You can create a DataFrame from various sources, such as Python lists and dictionaries, CSV files, or data paired with an explicitly defined schema.
To work with DataFrames in PySpark, you must first start a SparkSession.
from pyspark.sql import SparkSession
# Initialize SparkSession
spark = SparkSession.builder \
    .appName("DataFrameExample") \
    .getOrCreate()
What is a SparkSession?
It's the entry point to programming with Spark using the DataFrame and SQL APIs. It lets you create DataFrames, run queries, and interact with Spark.
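As a quick check that the session works, you can create a tiny built-in DataFrame and run a SQL statement directly through it:
# spark.range() creates a DataFrame with a single "id" column (0, 1, 2)
spark.range(3).show()
# spark.sql() runs a SQL query and returns the result as a DataFrame
spark.sql("SELECT 1 AS answer").show()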
This is one of the simplest ways to create a DataFrame manually.
# Sample data
data = [("Alice", 24), ("Bob", 30), ("Charlie", 28)]
# Define column names
columns = ["Name", "Age"]
# Create DataFrame
df = spark.createDataFrame(data, columns)
# Display the DataFrame
df.show()
+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 24|
|    Bob| 30|
|Charlie| 28|
+-------+---+
Here, each tuple represents a row, and the columns list defines the column headers.
What happens if you don’t specify column names?
Spark will assign default column names like _1, _2, etc., which may be confusing.
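A minimal sketch of that behaviour, reusing the same kind of sample data:
# No column names supplied, so Spark falls back to _1, _2, ...
df_default = spark.createDataFrame([("Alice", 24), ("Bob", 30)])
df_default.printSchema()
# root
#  |-- _1: string (nullable = true)
#  |-- _2: long (nullable = true)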
Using Python dictionaries allows you to map keys directly to column names.
# List of dictionaries
data = [
    {"Name": "David", "Age": 35},
    {"Name": "Eva", "Age": 29}
]
# Create DataFrame
df2 = spark.createDataFrame(data)
# Display
df2.show()
+-----+---+
| Name|Age|
+-----+---+
|David| 35|
|  Eva| 29|
+-----+---+
This is particularly useful when your data already exists in a dictionary format, such as JSON objects or API responses.
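For example, a hypothetical JSON payload (standing in for an API response) can be parsed into dictionaries and passed straight to createDataFrame:
import json

# Made-up JSON payload, e.g. the body of an API response
payload = '[{"Name": "Frank", "Age": 41}, {"Name": "Grace", "Age": 33}]'
records = json.loads(payload)          # a list of dictionaries
df_api = spark.createDataFrame(records)
df_api.show()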
CSV files are one of the most common data formats. PySpark makes it easy to load and inspect them.
# Load CSV file into DataFrame
df_csv = spark.read.csv("sample.csv", header=True, inferSchema=True)
# Show the first few rows
df_csv.show()
Here, header=True tells Spark to use the first row as column names, and inferSchema=True allows Spark to guess data types (like integer, string, etc.).
Why is inferSchema=True useful?
Without it, Spark treats all data as strings. This may cause issues if you want to run numeric operations later.
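A rough before-and-after comparison, assuming the same sample.csv with Name and Age columns:
# Without inferSchema, every column is read as a string
df_raw = spark.read.csv("sample.csv", header=True)
df_raw.printSchema()       # Age comes back as string

# With inferSchema, Spark scans the data and picks suitable types
df_typed = spark.read.csv("sample.csv", header=True, inferSchema=True)
df_typed.printSchema()     # Age comes back as integer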
In some cases, you might want to explicitly define the column types using StructType.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Define schema
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True)
])
# Data
data = [("John", 32), ("Lisa", 27)]
# Create DataFrame with schema
df_schema = spark.createDataFrame(data, schema)
df_schema.show()
+----+---+
|Name|Age|
+----+---+
|John| 32|
|Lisa| 27|
+----+---+
This approach gives more control and is helpful when reading data from raw sources where you want to enforce specific formats.
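For instance, the same schema object can be passed when reading a file, so Spark enforces the declared types instead of inferring them (assuming sample.csv has matching Name and Age columns):
# Read the CSV with an explicit schema instead of inferSchema
df_csv_schema = spark.read.csv("sample.csv", header=True, schema=schema)
df_csv_schema.printSchema()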
Once the DataFrame exists, a few built-in helpers make it easy to inspect:
df.show(n) – displays the top n rows in a tabular format.
df.printSchema() – prints the structure of the DataFrame.
df.columns – returns a list of column names.
df.printSchema()
root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
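And a quick look at the other two helpers, using the df created earlier:
df.show(2)            # only the first two rows
print(df.columns)     # ['Name', 'Age']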
Creating and displaying DataFrames in PySpark is the first step toward large-scale data analysis. You can start with simple lists, structured files like CSV, or even define schemas manually. Once the DataFrame is ready, you can visualize it with show() and inspect its structure using printSchema().