Installing Apache Spark on Windows
This guide walks you through setting up Apache Spark on a Windows machine from scratch. We’ll install the required components step by step: Java, Hadoop winutils, and Spark. Then we’ll configure PySpark so you can use Spark from Python.
Why Do We Need These Tools?
- Java: Apache Spark runs on the Java Virtual Machine (JVM), so Java must be installed.
- Hadoop Winutils: Required to support Spark’s file system operations on Windows.
- Spark: The main engine for distributed data processing.
- PySpark: Python API to use Spark interactively.
Step 1: Install Java
Apache Spark needs Java to run. We recommend using Java 8 or Java 11.
- Download the JDK from the official website: Download JDK
- Install it and note the installation path. Example:
C:\Program Files\Java\jdk-11.0.x
- Set the environment variables:
  - Set JAVA_HOME to your Java installation path.
  - Add %JAVA_HOME%\bin to the Path variable.
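To confirm Java is installed and visible on the Path, open a new Command Prompt and run the standard version check (the exact version string will depend on the JDK build you installed):

java -version

If this prints a version such as 11.0.x, Java is set up correctly.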
Question:
Why does Spark need Java?
Answer:
Because Spark is written in Scala, which runs on the JVM. Java ensures the necessary runtime environment is available.
Step 2: Install Hadoop Winutils
Spark uses Hadoop's file system APIs. On Windows, these require a helper binary called winutils.exe.
- Download winutils.exe from a trusted source or GitHub mirror. Pick the build that matches the Hadoop version your Spark package was built against (e.g., Hadoop 3.3.x).
- Create a folder such as C:\hadoop\bin and place winutils.exe inside it.
- Set the environment variables:
  - Set HADOOP_HOME to C:\hadoop.
  - Add %HADOOP_HOME%\bin to the Path variable.
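A quick way to check that the binary is reachable: running winutils.exe with no arguments should print its usage text rather than a "file not found" error (this assumes the environment variables above are set in your current session):

%HADOOP_HOME%\bin\winutils.exe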
Question:
Why do we need winutils.exe?
Answer:
Spark expects Hadoop-style utilities (for example, for managing file permissions), and winutils.exe provides this compatibility on Windows.
Step 3: Download and Configure Spark
- Go to the Spark download page: Download Spark
- Select a Spark version (e.g., 3.5.0) and the package type “Pre-built for Apache Hadoop 3.3 and later”.
- Extract the archive (a .tgz file) to a directory, e.g., C:\spark.
- Set the environment variables:
  - Set SPARK_HOME to C:\spark.
  - Add %SPARK_HOME%\bin to the Path variable.
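To confirm the variables took effect, open a new Command Prompt and check them with standard Windows commands (the paths printed will reflect your own setup):

echo %SPARK_HOME%
where spark-shell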
Step 4: Verify Spark Installation
Open Command Prompt and run:
spark-shell
This should launch the interactive Spark shell in Scala.
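If everything is configured, you will see the Spark startup banner followed by a scala> prompt. To leave the shell, type the standard Scala REPL exit command:

:quit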
Question:
Do I need to learn Scala to use Spark?
Answer:
No. As a beginner, you can use PySpark, which allows you to use Python with Spark.
Step 5: Set Up PySpark
- Install Python if not already installed: Download Python
- Install PySpark using pip:
pip install pyspark
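To confirm the package installed correctly, print its version from Python (the version shown will match whatever pip just installed):

python -c "import pyspark; print(pyspark.__version__)"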
Step 6: Run PySpark
To verify everything is working:
from pyspark.sql import SparkSession

# Create (or reuse) a local SparkSession
spark = SparkSession.builder \
    .appName("LocalSparkTest") \
    .getOrCreate()

# Build a small DataFrame and display it
df = spark.createDataFrame([
    ("Alice", 25),
    ("Bob", 30),
    ("Charlie", 35)
], ["Name", "Age"])
df.show()
Expected output:

+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
+-------+---+
This confirms Spark is installed correctly and working with Python (PySpark).
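From here you can try a simple transformation on the same DataFrame to see Spark doing real work (a minimal sketch reusing the df and column names from the example above):

# Keep only rows where Age is greater than 28
df.filter(df.Age > 28).show()

# Compute the average age across all rows
df.groupBy().avg("Age").show()

# Shut down the local session when you're done
spark.stop()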
Summary
- Install Java (JDK 8 or 11) and configure environment variables
- Download Hadoop winutils and set HADOOP_HOME
- Download and extract Spark, and set SPARK_HOME
- Install Python and PySpark
- Run PySpark and verify with a simple DataFrame example
Once installed, you're ready to explore Spark’s powerful data processing features using Python.