When working with structured or semi-structured data in Apache Spark, two key concepts come into play: Schema Inference and Schema Evolution. These are especially important when handling complex data sources like JSON, Parquet, or Avro that may change over time.
A schema defines the structure of your data — it specifies the column names and data types for each field in a dataset.
{ "name": "string", "age": "integer", "email": "string" }
If you're reading a JSON file or a Parquet file, Spark needs to know the structure (schema) to process it properly. This is where inference and evolution come in.
Schema Inference is the process by which Spark automatically determines the structure (column names and data types) of the data when reading files like JSON or CSV, without you having to manually define it.
Let’s consider a small JSON file with user records.
{"name": "Alice", "age": 25} {"name": "Bob", "age": 30}
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SchemaInferenceExample").getOrCreate()
# Load JSON data without defining schema
df = spark.read.json("users.json")
df.printSchema()
df.show()
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)

+---+-----+
|age| name|
+---+-----+
| 25|Alice|
| 30|  Bob|
+---+-----+
Spark inferred that age is a long and name is a string just by reading the file.
When should we let Spark infer schema?
Schema inference is useful during development and for small datasets, but for large production pipelines, manually defining schemas is safer and more efficient.
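As a minimal sketch of that advice, the schema for the sample data above could be declared explicitly instead of inferred (the field types here are an assumption based on the inferred output shown earlier):

# Minimal sketch: define the schema explicitly instead of inferring it
from pyspark.sql.types import StructType, StructField, StringType, LongType

user_schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", LongType(), True)
])

# With an explicit schema, Spark skips the extra pass over the data needed for inference
df = spark.read.schema(user_schema).json("users.json")
df.printSchema()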
Schema Evolution allows Spark to read and write data even when the structure of the data changes over time. It is especially supported in formats like Parquet and Avro.
Imagine a situation where the original data had 2 fields: name and age, but new data being added now includes an extra field: email.
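As a hedged sketch (not part of the original example), the user_data/ directory used below could have been produced by writing both versions of the data to the same Parquet path:

# Sketch: write two versions of the data, the second adding an email column
old_df = spark.createDataFrame([("Alice", 25)], ["name", "age"])
new_df = spark.createDataFrame([("Bob", 30, "bob@mail.com")], ["name", "age", "email"])

old_df.write.mode("overwrite").parquet("user_data/")   # original schema: name, age
new_df.write.mode("append").parquet("user_data/")      # evolved schema adds email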
Spark can handle this if schema merging is enabled. It will evolve the schema to include all fields from both versions.
# Read both versions with schema merge
df = spark.read.option("mergeSchema", "true").parquet("user_data/")
df.printSchema()
df.show()
root
 |-- age: integer (nullable = true)
 |-- email: string (nullable = true)
 |-- name: string (nullable = true)

+---+------------+-----+
|age|       email| name|
+---+------------+-----+
| 25|        null|Alice|
| 30|bob@mail.com|  Bob|
+---+------------+-----+
Spark handled the evolution and showed email as null for old records. This is schema evolution in action.
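If most reads need merging, the same behaviour can also be enabled globally through the spark.sql.parquet.mergeSchema setting rather than per read; a minimal sketch:

# Enable Parquet schema merging globally instead of passing the option on every read
spark.conf.set("spark.sql.parquet.mergeSchema", "true")

df = spark.read.parquet("user_data/")  # merging now applies without the explicit option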
Is schema evolution automatic in all formats?
No. Formats like Parquet and Avro support schema evolution, but CSV and JSON don’t support it inherently. You must handle schema changes manually in those formats.
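For those formats, a common workaround is to declare the newest, full schema up front so that older records simply return null for the missing fields. A minimal sketch, assuming a JSON source where email was added later:

# Sketch: handle an added field in JSON manually by declaring the full, newest schema
from pyspark.sql.types import StructType, StructField, StringType, LongType

evolved_schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", LongType(), True),
    StructField("email", StringType(), True)  # absent in older records -> null
])

df = spark.read.schema(evolved_schema).json("users.json")
df.show()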
Tip: use mergeSchema=true when reading evolving data.

Schema Inference makes it easy to get started without defining a schema, while Schema Evolution ensures Spark can handle changes in data structure over time. These features are critical when working with complex and ever-changing datasets, especially in real-world enterprise pipelines.