Now that you have an understanding of what Apache Spark is, let's move one step closer by running your first Spark application. In this lesson, we'll use PySpark — the Python API for Spark — to create a simple Spark job and understand its basic flow.
A typical Spark application consists of a driver program that runs your main function and uses a SparkContext to connect to a cluster manager and execute tasks in parallel across executors.
Do I need a cluster to run a Spark application?
No, you can run Spark on your local machine in “local mode” to develop and test your code. Later, the same code can scale to run on a real cluster with no or minimal changes.
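In practice, the only thing that changes is the master URL you pass when building the session. Here is a minimal sketch (the cluster address below is a placeholder, not a real host):
from pyspark.sql import SparkSession
# Local mode: run everything on this machine, using all available cores
spark = SparkSession.builder.master("local[*]").getOrCreate()
# Cluster mode: the same code, pointed at a Spark standalone master
# ("spark-master-host:7077" is a placeholder for your cluster's address)
# spark = SparkSession.builder.master("spark://spark-master-host:7077").getOrCreate()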
Before we begin, ensure you have pyspark installed. You can install it using pip:
pip install pyspark
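If you want to confirm the installation before moving on, you can print the installed version from your terminal:
python -c "import pyspark; print(pyspark.__version__)"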
Once installed, open your terminal or Jupyter Notebook to begin coding.
Let's write a simple program that creates a Spark session, parallelizes a small list into an RDD, squares each element, and prints the result:
from pyspark.sql import SparkSession
# Step 1: Create Spark Session
spark = SparkSession.builder \
    .appName("FirstSparkApp") \
    .master("local[*]") \
    .getOrCreate()
# Step 2: Get Spark Context from Session
sc = spark.sparkContext
# Step 3: Create a list and parallelize into RDD
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)
# Step 4: Apply transformation (square each element)
squared_rdd = rdd.map(lambda x: x * x)
# Step 5: Collect the result
result = squared_rdd.collect()
# Step 6: Print the output
print("Squared Numbers:", result)
# Stop Spark Session
spark.stop()
Expected output:
Squared Numbers: [1, 4, 9, 16, 25]
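As an aside, if you are curious how local[*] split the work, you can add a line like this before spark.stop() in the script above. In local mode, the number of partitions typically matches your machine's CPU core count:
# Number of partitions the RDD was split into
print("Partitions:", rdd.getNumPartitions())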
Why do we need collect() at the end?
Until you call an action like collect() or count(), Spark doesn't actually execute anything. This is because Spark uses lazy evaluation: it builds a plan of transformations first, and only when an action is called does it execute the operations.
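You can see this for yourself with a small sketch (reusing the sc variable from the script above): the map call returns immediately because it only records the transformation, and the computation happens only when count() is called.
# Recorded, not executed: this line returns instantly even for large data
doubled = sc.parallelize(range(100000)).map(lambda x: x * 2)
# The action triggers the actual computation across partitions
print(doubled.count())  # 100000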
Save the code above as my_spark_app.py and run it from your terminal:
python my_spark_app.py
Congratulations! You've just run your first Apache Spark application. You created an RDD, performed a transformation, and executed an action. As we move forward, we’ll explore working with structured data using DataFrames, SQL, and machine learning pipelines.