When working with structured or semi-structured data in Apache Spark, two key concepts come into play: Schema Inference and Schema Evolution. These are especially important when handling complex data sources like JSON, Parquet, or Avro that may change over time.
A schema defines the structure of your data — it specifies the column names and data types for each field in a dataset.
{ "name": "string", "age": "integer", "email": "string" }
If you're reading a JSON file or a Parquet file, Spark needs to know the structure (schema) to process it properly. This is where inference and evolution come in.
Schema Inference is the process by which Spark automatically determines the structure (column names and data types) of the data when reading files like JSON or CSV, without you having to manually define it.
Let’s consider a small JSON file with user records.
{"name": "Alice", "age": 25} {"name": "Bob", "age": 30}
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SchemaInferenceExample").getOrCreate()
# Load JSON data without defining schema
df = spark.read.json("users.json")
df.printSchema()
df.show()
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)

+---+-----+
|age| name|
+---+-----+
| 25|Alice|
| 30|  Bob|
+---+-----+
Spark inferred that age is a long and name is a string just by reading the file.
When should we let Spark infer schema?
Schema inference is useful during development and for small datasets, but for large production pipelines, manually defining schemas is safer and more efficient.
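As a minimal sketch of that advice, the schema for the sample data above could be declared explicitly instead of inferred (the field types here are an assumption based on the inferred output shown earlier):

# Minimal sketch: define the schema explicitly instead of inferring it
from pyspark.sql.types import StructType, StructField, StringType, LongType

user_schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", LongType(), True)
])

# With an explicit schema, Spark skips the extra pass over the data needed for inference
df = spark.read.schema(user_schema).json("users.json")
df.printSchema()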
Schema Evolution allows Spark to read and write data even when the structure of the data changes over time. It is especially supported in formats like Parquet and Avro.
Imagine a situation where the original data had 2 fields: name and age, but new data being added now includes an extra field: email.
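As a hedged sketch (not part of the original example), the user_data/ directory used below could have been produced by writing both versions of the data to the same Parquet path:

# Sketch: write two versions of the data, the second adding an email column
old_df = spark.createDataFrame([("Alice", 25)], ["name", "age"])
new_df = spark.createDataFrame([("Bob", 30, "bob@mail.com")], ["name", "age", "email"])

old_df.write.mode("overwrite").parquet("user_data/")   # original schema: name, age
new_df.write.mode("append").parquet("user_data/")      # evolved schema adds email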
Spark can handle this if schema merging is enabled. It will evolve the schema to include all fields from both versions.
# Read both versions with schema merge
df = spark.read.option("mergeSchema", "true").parquet("user_data/")
df.printSchema()
df.show()
root
 |-- age: integer (nullable = true)
 |-- email: string (nullable = true)
 |-- name: string (nullable = true)

+---+------------+-----+
|age|       email| name|
+---+------------+-----+
| 25|        null|Alice|
| 30|bob@mail.com|  Bob|
+---+------------+-----+
Spark handled the evolution and showed email as null for old records. This is schema evolution in action.
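If most reads need merging, the same behaviour can also be enabled globally through the spark.sql.parquet.mergeSchema setting rather than per read; a minimal sketch:

# Enable Parquet schema merging globally instead of passing the option on every read
spark.conf.set("spark.sql.parquet.mergeSchema", "true")

df = spark.read.parquet("user_data/")  # merging now applies without the explicit option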
Is schema evolution automatic in all formats?
No. Formats like Parquet and Avro support schema evolution, but CSV and JSON don’t support it inherently. You must handle schema changes manually in those formats.
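For those formats, a common workaround is to declare the newest, full schema up front so that older records simply return null for the missing fields. A minimal sketch, assuming a JSON source where email was added later:

# Sketch: handle an added field in JSON manually by declaring the full, newest schema
from pyspark.sql.types import StructType, StructField, StringType, LongType

evolved_schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", LongType(), True),
    StructField("email", StringType(), True)  # absent in older records -> null
])

df = spark.read.schema(evolved_schema).json("users.json")
df.show()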
Tip: use mergeSchema=true when reading evolving data.

Schema Inference makes it easy to get started without defining a schema, while Schema Evolution ensures Spark can handle changes in data structure over time. These features are critical when working with complex and ever-changing datasets, especially in real-world enterprise pipelines.