Reading CSV, JSON, and Parquet Files in PySpark
In Apache Spark, data is usually read from external sources like CSV files, JSON documents, or Parquet files. PySpark makes this task easy through its powerful DataFrame API.
Why is File Reading Important?
Data analysis starts with loading data. If you can't read the data correctly, everything else — like cleaning, transforming, or modeling — falls apart. PySpark provides a simple, consistent interface to read various formats into DataFrames.
Reading CSV Files
CSV (Comma-Separated Values) is one of the most common formats for tabular data. Each line represents a row, and values are separated by commas.
Example: Reading a CSV File
from pyspark.sql import SparkSession
# Start Spark session
spark = SparkSession.builder.appName("ReadCSV").getOrCreate()
# Read CSV with header
df_csv = spark.read.option("header", True).csv("airtravel.csv")
# Show first few rows
df_csv.show()
+-----+-----+-----+-----+
|Month|1958 |1959 |1960 |
+-----+-----+-----+-----+
| JAN | 340 | 360 | 417 |
| FEB | 318 | 342 | 391 |
| ... | ... | ... | ... |
+-----+-----+-----+-----+
We used option("header", True) to treat the first row as column names.
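By default, Spark reads every CSV column as a string. If you also want Spark to detect numeric types, you can enable schema inference; a minimal sketch using the same file:
# Sample the file and infer column types (this costs an extra pass over the data)
df_csv_typed = spark.read.option("header", True).option("inferSchema", True).csv("airtravel.csv")
# Verify what Spark inferred
df_csv_typed.printSchema()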
Question:
What if the file doesn't have headers?
Answer:
In that case, Spark will assign generic column names like _c0, _c1. You can later rename them using withColumnRenamed().
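A short sketch of that workflow (the file name headerless.csv and the new column names are only placeholders):
# Without the header option, Spark assigns generic names: _c0, _c1, ...
df_no_header = spark.read.csv("headerless.csv")
# Rename the generated columns to something meaningful
df_renamed = df_no_header.withColumnRenamed("_c0", "month").withColumnRenamed("_c1", "passengers")
df_renamed.show()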
Reading JSON Files
JSON (JavaScript Object Notation) is commonly used for nested or hierarchical data, such as API responses or configuration files.
Example: Reading a JSON File
# Read JSON file
df_json = spark.read.json("example.json")
# Show data
df_json.show(truncate=False)
+------------+--------+-----+
|name        |country |age  |
+------------+--------+-----+
|John Doe    |USA     |29   |
|Alice Smith |Canada  |32   |
+------------+--------+-----+
Spark automatically infers the schema from the structure of the JSON file.
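You can check what was inferred with printSchema(). One assumption to watch: spark.read.json() expects one JSON object per line (JSON Lines) by default, so if example.json were a single pretty-printed document you would add the multiLine option.
# Inspect the inferred field names and types
df_json.printSchema()
# For a single multi-line JSON document instead of JSON Lines:
df_json_multi = spark.read.option("multiLine", True).json("example.json")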
Question:
Can JSON files have nested fields?
Answer:
Yes. Spark still reads the file, and nested objects appear as struct columns (for example, address.city). You can access inner fields with dot notation, or use explode() to flatten array fields into rows.
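Here is a brief sketch of both techniques, assuming a hypothetical struct column address and a hypothetical array column phones:
from pyspark.sql.functions import col, explode
# Reach into a struct column with dot notation
df_city = df_json.select(col("name"), col("address.city").alias("city"))
# Expand each element of an array column into its own row
df_phones = df_json.select(col("name"), explode(col("phones")).alias("phone"))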
Reading Parquet Files
Parquet is a columnar storage format optimized for performance and efficiency. It is widely used in Big Data pipelines.
Example: Reading a Parquet File
# Read Parquet file
df_parquet = spark.read.parquet("sample.parquet")
# Show data
df_parquet.show()
+----------+-----+------+
|first_name|age  |gender|
+----------+-----+------+
|Alice     | 30  |Female|
|Bob       | 45  |Male  |
+----------+-----+------+
Parquet is faster and uses less storage than CSV or JSON because it stores data in a compressed and encoded format.
Question:
Why is Parquet preferred in production pipelines?
Answer:
Because it supports efficient compression, provides faster reads, and lets Spark scan only the required columns, reducing resource usage.
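For instance, selecting only the columns you need lets Spark read just those column chunks from disk (column pruning), something row-based formats like CSV cannot do:
# Only first_name and age are read from the Parquet file; gender is skipped
df_parquet.select("first_name", "age").show()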
Reading Multiple Files
You can also read multiple files of the same type by using wildcards or passing a list of paths.
# Read multiple CSV files
df_multi = spark.read.option("header", True).csv(["file1.csv", "file2.csv"])
df_multi.show()
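A wildcard pattern works the same way; the data/ directory here is just an illustrative path:
# Read every CSV file that matches the glob pattern
df_glob = spark.read.option("header", True).csv("data/*.csv")
df_glob.show()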
Summary
- Use spark.read.csv() for CSV files (add option("header", True) if needed).
- Use spark.read.json() for JSON files; it handles nested structures well.
- Use spark.read.parquet() for high-performance, production-grade data pipelines.
Being comfortable with loading data is the first and most important step before analysis, transformation, or modeling in PySpark.