












What is Apache Spark?
Apache Spark is an open-source, distributed computing system designed to process big data quickly and efficiently. It enables parallel processing across a cluster of computers, allowing it to handle massive datasets that wouldn’t fit or perform well on a single machine.
Why was Spark created?
Before Spark, Hadoop MapReduce was widely used for big data processing. However, MapReduce had limitations — it was slow due to repeated disk writes between processing stages.
Spark was created at UC Berkeley to overcome these limitations by using in-memory processing, making it up to 100x faster for certain tasks compared to MapReduce.
Key Features of Apache Spark
- Speed: Spark keeps data in memory whenever possible, drastically speeding up processing.
- Ease of Use: Supports APIs in Python (PySpark), Scala, Java, and R.
- Unified Engine: Spark can handle batch processing, real-time streaming, machine learning, and SQL queries — all in one platform.
- Scalability: Easily scales from a laptop to thousands of nodes in a cluster.
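One concrete mechanism behind the speed claim is caching: you can ask Spark to keep a dataset in memory so repeated computations reuse it instead of rereading it from disk. The short sketch below is only an illustration (the app name and row count are arbitrary):
from pyspark.sql import SparkSession
# Create a Spark session for this small demo
spark = SparkSession.builder \
    .appName("Caching Demo") \
    .getOrCreate()
# A single-column DataFrame ("id") with one million rows, generated in memory
df = spark.range(1_000_000)
df.cache()    # ask Spark to keep this data in memory once it has been computed
df.count()    # the first action computes the rows and fills the cache
df.count()    # the second action is served from the in-memory cache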
Real-Life Example: Ride-Sharing Platform
Consider a company like Uber or Ola. Every second, they collect massive data points:
- Ride requests (timestamp, location, user ID)
- Driver availability and positions
- Trip durations, pricing, and feedback
Spark can be used to process this data in real-time to:
- Assign drivers efficiently
- Predict surge pricing areas
- Detect fraudulent activities
Question:
Why can’t we just use Python with Pandas for all this?
Answer:
Pandas is great for small-to-medium data, but it works on a single machine and doesn’t scale well. Spark, on the other hand, distributes both storage and computation across a cluster, making it suitable for big data.
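To make the ride-sharing scenario above a little more concrete, here is a minimal sketch using Structured Streaming, Spark's DataFrame-based streaming API. The input path, event schema, and one-minute window are hypothetical; a real pipeline would more likely read from a message bus such as Kafka and write to a durable sink:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window
# Create a Spark session for the streaming sketch
spark = SparkSession.builder \
    .appName("Ride Requests Stream") \
    .getOrCreate()
# Hypothetical stream of ride-request events arriving as JSON files in a folder
requests = spark.readStream \
    .schema("user_id STRING, pickup_lat DOUBLE, pickup_lon DOUBLE, requested_at TIMESTAMP") \
    .json("/data/ride_requests/")
# Count requests per one-minute window; counts like these could feed a surge-pricing model
counts = requests.groupBy(window(col("requested_at"), "1 minute")).count()
# Print each updated result to the console (fine for a local experiment)
query = counts.writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()
query.awaitTermination()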
Apache Spark Ecosystem
- Spark Core: The foundation that handles task scheduling, memory management, and fault recovery.
- Spark SQL: Module for working with structured data using SQL queries.
- MLlib: Built-in machine learning library.
- Spark Streaming: For real-time data stream processing.
- GraphX: For graph computation (mostly in Scala).
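To get a feel for Spark SQL in particular, the sketch below registers a small DataFrame as a temporary view and queries it with plain SQL. The table name and data are made up:
from pyspark.sql import SparkSession
# Create a Spark session for the SQL sketch
spark = SparkSession.builder \
    .appName("Spark SQL Demo") \
    .getOrCreate()
# A tiny in-memory DataFrame standing in for a real trips table
trips = spark.createDataFrame(
    [("T1", 12.5), ("T2", 3.2), ("T3", 25.0)],
    ["trip_id", "distance_km"])
# Register it as a temporary view so it can be queried with SQL
trips.createOrReplaceTempView("trips")
# Same engine, SQL syntax instead of DataFrame methods
spark.sql("SELECT trip_id FROM trips WHERE distance_km > 10").show()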
Example: Running Your First PySpark Code
Let’s write a small PySpark program to understand how Spark works with data. This program creates a simple DataFrame and performs basic operations.
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder \
    .appName("Intro to Apache Spark") \
    .getOrCreate()
# Create a sample DataFrame
data = [("Alice", 25), ("Bob", 30), ("Charlie", 22)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
# Show the DataFrame
df.show()
# Filter people older than 24
df.filter(df.Age > 24).show()
+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 22|
+-------+---+

+-----+---+
| Name|Age|
+-----+---+
|Alice| 25|
|  Bob| 30|
+-----+---+
In this example, Spark is managing the data and operations in parallel, even if it’s just running on your local machine. In production, the same code would run on a distributed cluster without any changes.
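What changes between a laptop and a cluster is how you launch the script, not the script itself. As a rough illustration (the file name and cluster address below are hypothetical), the same program could be started either way:
# Run locally during development
python intro_spark.py
# Hand the same script to a cluster; --master tells Spark where to run it
spark-submit --master spark://<cluster-host>:7077 intro_spark.py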
Question:
Is Spark only for big companies?
Answer:
No. With tools like Databricks Community Edition and local Spark setups, even individual learners and startups can use Spark to process data efficiently.
Summary
Apache Spark is a powerful, fast, and scalable platform for big data processing. It offers a unified framework to work with batch, streaming, SQL, and machine learning workloads. Understanding what Spark is and why it exists gives you a strong foundation for learning how to use it effectively in real-world data projects.