Apache Spark Course
Module 12: Project – Real-World Data Pipeline

Installing Apache Spark on Windows (Step-by-Step Guide)

This guide walks you through setting up Apache Spark on a Windows machine from scratch. We’ll install the required components step by step: Java, Hadoop’s winutils helper, and Spark, and then configure PySpark so you can use Spark from Python.

Why Do We Need These Tools?

  • Java: Apache Spark runs on the Java Virtual Machine (JVM), so Java must be installed.
  • Hadoop Winutils: Required to support Spark’s file system operations on Windows.
  • Spark: The main engine for distributed data processing.
  • PySpark: The Python API for working with Spark.

Step 1: Install Java

Apache Spark needs Java to run. We recommend using Java 8 or Java 11.

  1. Download JDK 8 or 11 from the official vendor website.
  2. Install it and note the installation path. Example: C:\Program Files\Java\jdk-11.0.x
  3. Set the environment variables:
    • JAVA_HOME = Path to your Java installation
    • Add %JAVA_HOME%\bin to the Path variable
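
To confirm the setup, open a new Command Prompt (environment variable changes don’t apply to windows that were already open) and run:

java -version

This should print the installed JDK version.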

Question:

Why does Spark need Java?

Answer:

Because Spark is written in Scala, which runs on the JVM. Java ensures the necessary runtime environment is available.

Step 2: Install Hadoop Winutils

Spark uses Hadoop's file system APIs. On Windows, we need a helper binary called winutils.exe.

  1. Download winutils.exe from a trusted source or GitHub mirror, choosing the build that matches the Hadoop version your Spark package targets (Hadoop 3.3.x for the package in Step 3).
  2. Create a folder like C:\hadoop\bin and place winutils.exe inside it.
  3. Set environment variable:
    • HADOOP_HOME = C:\hadoop
    • Add %HADOOP_HOME%\bin to the Path
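
To confirm the Path is configured, open a new Command Prompt and run:

where winutils.exe

This should print C:\hadoop\bin\winutils.exe.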

Question:

Why do we need winutils.exe?

Answer:

Spark relies on Hadoop’s native utilities for operations such as managing file permissions, and winutils.exe provides that functionality on Windows.

Step 3: Download and Configure Spark

  1. Go to the official Spark download page: https://spark.apache.org/downloads.html
  2. Select a Spark version (e.g., 3.5.0) and package type as “Pre-built for Apache Hadoop 3.3 and later”.
  3. Extract the downloaded archive (a .tgz file) to a directory, e.g., C:\spark, so that C:\spark\bin contains the Spark scripts
  4. Set environment variables:
    • SPARK_HOME = C:\spark
    • Add %SPARK_HOME%\bin to the Path
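
Before launching the shell in the next step, you can sanity-check these settings from a new Command Prompt:

spark-submit --version

This should print the Spark version banner without errors.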

Step 4: Verify Spark Installation

Open Command Prompt and run:

spark-shell

This should launch the interactive Spark shell in Scala.
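
On success you will see the Spark banner and a scala> prompt. Type :quit to exit the shell.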

Question:

Do I need to learn Scala to use Spark?

Answer:

No. As a beginner, you can use PySpark, which allows you to use Python with Spark.

Step 5: Set Up PySpark

  1. Install Python if not already installed: Download Python
  2. Install PySpark using pip:
pip install pyspark

Step 6: Run PySpark

To verify everything is working, run the following in a Python shell or script:

import pyspark
from pyspark.sql import SparkSession
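
# Optional sanity check: print the installed PySpark version
print(pyspark.__version__)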

# Create (or reuse) a local SparkSession
spark = SparkSession.builder \
    .appName("LocalSparkTest") \
    .getOrCreate()

# Build a small example DataFrame with two columns
df = spark.createDataFrame([
    ("Alice", 25),
    ("Bob", 30),
    ("Charlie", 35)
], ["Name", "Age"])

df.show()

Expected output:

+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
+-------+---+

This confirms Spark is installed correctly and working with Python (PySpark).
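
As a quick follow-up, you can run a simple transformation on the same DataFrame. Here is a minimal sketch using the df and spark objects defined above:

# Keep only rows where Age is greater than 28
df.filter(df.Age > 28).show()

# Release local Spark resources when you are done
spark.stop()

The filtered output should list only Bob and Charlie.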

Summary

  • Install Java (JDK 8 or 11) and configure environment variables
  • Download Hadoop winutils and set up HADOOP_HOME
  • Download and extract Spark, set SPARK_HOME
  • Install Python and PySpark
  • Run PySpark and verify with a simple DataFrame example

Once installed, you're ready to explore Spark’s powerful data processing features using Python.