Apache Spark Course
Module 12: Project – Real-World Data Pipeline

Installing Apache Spark on Windows (Step-by-Step Guide)

This guide walks you through setting up Apache Spark on a Windows machine from scratch. We’ll install the required components step by step: Java, Hadoop’s winutils helper, and Spark, and then configure PySpark so you can use Spark from Python.

Why Do We Need These Tools?

  • Java: Apache Spark runs on the Java Virtual Machine (JVM), so Java must be installed.
  • Hadoop Winutils: Required to support Spark’s file system operations on Windows.
  • Spark: The main engine for distributed data processing.
  • PySpark: The Python API for working with Spark.

Step 1: Install Java

Apache Spark needs Java to run. We recommend using Java 8 or Java 11.

  1. Download JDK 8 or 11 from the official vendor website.
  2. Install it and note the installation path. Example: C:\Program Files\Java\jdk-11.0.x
  3. Set the environment variables:
    • JAVA_HOME = Path to your Java installation
    • Add %JAVA_HOME%\bin to the Path variable
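
To confirm the setup, open a new Command Prompt (environment variable changes don’t apply to windows that were already open) and run:

java -version

This should print the installed JDK version.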

Question:

Why does Spark need Java?

Answer:

Because Spark is written in Scala, which runs on the JVM. Java ensures the necessary runtime environment is available.

Step 2: Install Hadoop Winutils

Spark uses Hadoop's file system APIs. On Windows, we need a helper binary called winutils.exe.

  1. Download winutils.exe from a trusted source or GitHub mirror, choosing the build that matches the Hadoop version your Spark package targets (Hadoop 3.3.x for the package in Step 3).
  2. Create a folder like C:\hadoop\bin and place winutils.exe inside it.
  3. Set environment variable:
    • HADOOP_HOME = C:\hadoop
    • Add %HADOOP_HOME%\bin to the Path
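
To confirm the Path is configured, open a new Command Prompt and run:

where winutils.exe

This should print C:\hadoop\bin\winutils.exe.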

Question:

Why do we need winutils.exe?

Answer:

Spark relies on Hadoop’s native utilities for operations such as managing file permissions, and winutils.exe provides that functionality on Windows.

Step 3: Download and Configure Spark

  1. Go to the official Spark download page: https://spark.apache.org/downloads.html
  2. Select a Spark version (e.g., 3.5.0) and package type as “Pre-built for Apache Hadoop 3.3 and later”.
  3. Extract the downloaded archive (a .tgz file) to a directory, e.g., C:\spark, so that C:\spark\bin contains the Spark scripts
  4. Set environment variables:
    • SPARK_HOME = C:\spark
    • Add %SPARK_HOME%\bin to the Path
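
Before launching the shell in the next step, you can sanity-check these settings from a new Command Prompt:

spark-submit --version

This should print the Spark version banner without errors.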

Step 4: Verify Spark Installation

Open Command Prompt and run:

spark-shell

This should launch the interactive Spark shell in Scala.
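
On success you will see the Spark banner and a scala> prompt. Type :quit to exit the shell.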

Question:

Do I need to learn Scala to use Spark?

Answer:

No. As a beginner, you can use PySpark, which allows you to use Python with Spark.

Step 5: Set Up PySpark

  1. Install Python if not already installed: Download Python
  2. Install PySpark using pip:
pip install pyspark

Step 6: Run PySpark

To verify everything is working, run the following in a Python shell or script:

import pyspark
from pyspark.sql import SparkSession
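
# Optional sanity check: print the installed PySpark version
print(pyspark.__version__)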

# Create (or reuse) a local SparkSession
spark = SparkSession.builder \
    .appName("LocalSparkTest") \
    .getOrCreate()

# Build a small example DataFrame with two columns
df = spark.createDataFrame([
    ("Alice", 25),
    ("Bob", 30),
    ("Charlie", 35)
], ["Name", "Age"])

df.show()

Expected output:

+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
+-------+---+

This confirms Spark is installed correctly and working with Python (PySpark).
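
As a quick follow-up, you can run a simple transformation on the same DataFrame. Here is a minimal sketch using the df and spark objects defined above:

# Keep only rows where Age is greater than 28
df.filter(df.Age > 28).show()

# Release local Spark resources when you are done
spark.stop()

The filtered output should list only Bob and Charlie.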

Summary

  • Install Java (JDK 8 or 11) and configure environment variables
  • Download Hadoop winutils and set up HADOOP_HOME
  • Download and extract Spark, set SPARK_HOME
  • Install Python and PySpark
  • Run PySpark and verify with a simple DataFrame example

Once installed, you're ready to explore Spark’s powerful data processing features using Python.