Apache Spark Course

Module 12: Project – Real-World Data Pipeline

Schema Evolution and Inference in Apache Spark



Understanding Schema Evolution and Inference

When working with structured or semi-structured data in Apache Spark, two key concepts come into play: Schema Inference and Schema Evolution. These are especially important when handling complex data sources like JSON, Parquet, or Avro that may change over time.

What is a Schema?

A schema defines the structure of your data — it specifies the column names and data types for each field in a dataset.

Example Schema:

    {
      "name": "string",
      "age": "integer",
      "email": "string"
    }
    

If you're reading a JSON file or a Parquet file, Spark needs to know the structure (schema) to process it properly. This is where inference and evolution come in.
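In PySpark, the same structure can be written out as an explicit StructType. The sketch below simply mirrors the example schema above; the field names and types come from that example, not from any particular dataset.

Python Code (Defining a Schema Manually):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Explicit schema matching the example above
user_schema = StructType([
    StructField("name", StringType(), True),    # nullable string column
    StructField("age", IntegerType(), True),    # nullable integer column
    StructField("email", StringType(), True),   # nullable string column
])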

What is Schema Inference?

Schema Inference is the process by which Spark automatically determines the structure (column names and data types) of the data when reading files like JSON or CSV, without you having to manually define it.

Example: Schema Inference from JSON

Let’s consider a small JSON file with user records.

    {"name": "Alice", "age": 25}
    {"name": "Bob", "age": 30}
    

Python Code (PySpark):


from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SchemaInferenceExample").getOrCreate()

# Load JSON data without defining schema
df = spark.read.json("users.json")

df.printSchema()
df.show()
Output:
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)

+---+-----+
|age| name|
+---+-----+
| 25|Alice|
| 30|  Bob|
+---+-----+
    

Spark inferred that age is a long and name is a string just by reading the file.
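For CSV files, inference works the same way but is off by default; you have to request it with the inferSchema option, otherwise every column comes back as a string. A minimal sketch (the file name users.csv is just for illustration):

Python Code (Schema Inference from CSV):

# CSV needs inferSchema turned on explicitly; otherwise all columns are strings
df_csv = (
    spark.read
    .option("header", "true")        # first line holds the column names
    .option("inferSchema", "true")   # scan the data to guess column types
    .csv("users.csv")
)

df_csv.printSchema()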

Question:

When should we let Spark infer schema?

Answer:

Schema inference is useful during development and for small datasets, but for large production pipelines, manually defining schemas is safer and more efficient.
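For example, the same users.json file can be read with an explicitly defined schema, which skips the extra pass Spark would otherwise make over the data to infer types. This is only a sketch; the field names and types should match your actual data.

Python Code (Reading with an Explicit Schema):

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Declare the expected structure up front
user_schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", LongType(), True),
])

# Pass the schema to the reader; no inference pass is needed
df = spark.read.schema(user_schema).json("users.json")
df.printSchema()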

What is Schema Evolution?

Schema Evolution allows Spark to read and write data even when the structure of the data changes over time. It is supported natively by formats like Parquet and Avro.

Imagine a situation where the original data had two fields, name and age, but newly arriving data includes an extra field, email.

Example Scenario:

What happens when we try to read both together?

Spark can handle this if schema merging is enabled. It will evolve the schema to include all fields from both versions.
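As a concrete setup for this scenario, the two versions of the data might be written to the same directory like this (the user_data/ path and the sample rows are made up for illustration):

Python Code (Writing the Two Versions):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Version 1: original records with only name and age
v1_schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
spark.createDataFrame([("Alice", 25)], v1_schema) \
    .write.mode("append").parquet("user_data/")

# Version 2: newer records carry an extra email column
v2_schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("email", StringType(), True),
])
spark.createDataFrame([("Bob", 30, "bob@mail.com")], v2_schema) \
    .write.mode("append").parquet("user_data/")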

Python Code (Schema Evolution with Parquet):


# Read both versions with schema merge
df = spark.read.option("mergeSchema", "true").parquet("user_data/")

df.printSchema()
df.show()
Output:
root
 |-- age: integer (nullable = true)
 |-- email: string (nullable = true)
 |-- name: string (nullable = true)

+---+------------+-----+
|age|       email| name|
+---+------------+-----+
| 25|        null|Alice|
| 30|bob@mail.com|  Bob|
+---+------------+-----+
    

Spark handled the evolution and showed email as null for old records. This is schema evolution in action.
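Schema merging can also be enabled for every Parquet read in a session through the spark.sql.parquet.mergeSchema configuration (it is off by default because merging the schemas of many files adds overhead). A short sketch:

Python Code (Enabling Merge Globally):

# Turn on schema merging for all Parquet reads in this Spark session
spark.conf.set("spark.sql.parquet.mergeSchema", "true")

# The per-read mergeSchema option is no longer required
df = spark.read.parquet("user_data/")
df.printSchema()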

Question:

Is schema evolution automatic in all formats?

Answer:

No. Formats like Parquet and Avro support schema evolution, but CSV and JSON do not support it natively. You must handle schema changes manually in those formats.
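For JSON, one common workaround is to read old and new files with a single explicit schema that already lists the newer fields as nullable; records that lack a field simply come back as null. A sketch, assuming both file versions sit under a json_data/ directory:

Python Code (Manual Evolution for JSON):

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Superset schema: email is nullable, so older records without it get null
full_schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", LongType(), True),
    StructField("email", StringType(), True),
])

df = spark.read.schema(full_schema).json("json_data/")
df.show()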

Best Practices for Schema Evolution

Prefer formats that support evolution natively, such as Parquet or Avro, for data whose structure changes over time. Define schemas explicitly in production pipelines instead of relying on inference. Add new fields as nullable so that older records remain readable. Enable mergeSchema only when you actually need it, because merging schemas across many files adds read overhead.

Summary

Schema Inference makes it easy to get started without defining schema, while Schema Evolution ensures Spark can handle changes in data structure over time. These features are critical when working with complex and ever-changing datasets, especially in real-world enterprise pipelines.


