












What is Apache Spark?
Apache Spark is an open-source, distributed computing system designed to process big data quickly and efficiently. It enables parallel processing across a cluster of computers, allowing it to handle massive datasets that wouldn’t fit or perform well on a single machine.
Why was Spark created?
Before Spark, Hadoop MapReduce was widely used for big data processing. However, MapReduce had limitations — it was slow due to repeated disk writes between processing stages.
Spark was created at UC Berkeley to overcome these limitations by using in-memory processing, making it up to 100x faster for certain tasks compared to MapReduce.
Key Features of Apache Spark
- Speed: Spark keeps data in memory whenever possible, drastically speeding up processing.
- Ease of Use: Supports APIs in Python (PySpark), Scala, Java, and R.
- Unified Engine: Spark can handle batch processing, real-time streaming, machine learning, and SQL queries — all in one platform.
- Scalability: Easily scales from a laptop to thousands of nodes in a cluster.
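One concrete mechanism behind the speed claim is caching: you can ask Spark to keep a dataset in memory so repeated computations reuse it instead of rereading it from disk. The short sketch below is only an illustration (the app name and row count are arbitrary):
from pyspark.sql import SparkSession
# Create a Spark session for this small demo
spark = SparkSession.builder \
    .appName("Caching Demo") \
    .getOrCreate()
# A single-column DataFrame ("id") with one million rows, generated in memory
df = spark.range(1_000_000)
df.cache()    # ask Spark to keep this data in memory once it has been computed
df.count()    # the first action computes the rows and fills the cache
df.count()    # the second action is served from the in-memory cache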
Real-Life Example: Ride-Sharing Platform
Consider a company like Uber or Ola. Every second, they collect massive data points:
- Ride requests (timestamp, location, user ID)
- Driver availability and positions
- Trip durations, pricing, and feedback
Spark can be used to process this data in real-time to:
- Assign drivers efficiently
- Predict surge pricing areas
- Detect fraudulent activities
Question:
Why can’t we just use Python with Pandas for all this?
Answer:
Pandas is great for small-to-medium data, but it works on a single machine and doesn’t scale well. Spark, on the other hand, distributes both storage and computation across a cluster, making it suitable for big data.
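To make the ride-sharing scenario above a little more concrete, here is a minimal sketch using Structured Streaming, Spark's DataFrame-based streaming API. The input path, event schema, and one-minute window are hypothetical; a real pipeline would more likely read from a message bus such as Kafka and write to a durable sink:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window
# Create a Spark session for the streaming sketch
spark = SparkSession.builder \
    .appName("Ride Requests Stream") \
    .getOrCreate()
# Hypothetical stream of ride-request events arriving as JSON files in a folder
requests = spark.readStream \
    .schema("user_id STRING, pickup_lat DOUBLE, pickup_lon DOUBLE, requested_at TIMESTAMP") \
    .json("/data/ride_requests/")
# Count requests per one-minute window; counts like these could feed a surge-pricing model
counts = requests.groupBy(window(col("requested_at"), "1 minute")).count()
# Print each updated result to the console (fine for a local experiment)
query = counts.writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()
query.awaitTermination()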
Apache Spark Ecosystem
- Spark Core: The foundation that handles task scheduling, memory management, and fault recovery.
- Spark SQL: Module for working with structured data using SQL queries.
- MLlib: Built-in machine learning library.
- Spark Streaming: For real-time data stream processing.
- GraphX: For graph computation (mostly in Scala).
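To get a feel for Spark SQL in particular, the sketch below registers a small DataFrame as a temporary view and queries it with plain SQL. The table name and data are made up:
from pyspark.sql import SparkSession
# Create a Spark session for the SQL sketch
spark = SparkSession.builder \
    .appName("Spark SQL Demo") \
    .getOrCreate()
# A tiny in-memory DataFrame standing in for a real trips table
trips = spark.createDataFrame(
    [("T1", 12.5), ("T2", 3.2), ("T3", 25.0)],
    ["trip_id", "distance_km"])
# Register it as a temporary view so it can be queried with SQL
trips.createOrReplaceTempView("trips")
# Same engine, SQL syntax instead of DataFrame methods
spark.sql("SELECT trip_id FROM trips WHERE distance_km > 10").show()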
Example: Running Your First PySpark Code
Let’s write a small PySpark program to understand how Spark works with data. This program creates a simple DataFrame and performs basic operations.
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder \
    .appName("Intro to Apache Spark") \
    .getOrCreate()
# Create a sample DataFrame
data = [("Alice", 25), ("Bob", 30), ("Charlie", 22)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
# Show the DataFrame
df.show()
# Filter people older than 24
df.filter(df.Age > 24).show()
+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 22|
+-------+---+

+-----+---+
| Name|Age|
+-----+---+
|Alice| 25|
|  Bob| 30|
+-----+---+
In this example, Spark is managing the data and operations in parallel, even if it’s just running on your local machine. In production, the same code would run on a distributed cluster without any changes.
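What changes between a laptop and a cluster is how you launch the script, not the script itself. As a rough illustration (the file name and cluster address below are hypothetical), the same program could be started either way:
# Run locally during development
python intro_spark.py
# Hand the same script to a cluster; --master tells Spark where to run it
spark-submit --master spark://<cluster-host>:7077 intro_spark.py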
Question:
Is Spark only for big companies?
Answer:
No. With tools like Databricks Community Edition and local Spark setups, even individual learners and startups can use Spark to process data efficiently.
Summary
Apache Spark is a powerful, fast, and scalable platform for big data processing. It offers a unified framework to work with batch, streaming, SQL, and machine learning workloads. Understanding what Spark is and why it exists gives you a strong foundation for learning how to use it effectively in real-world data projects.