Apache Spark Course

Module 12: Project – Real-World Data Pipeline

Installing Apache Spark on Windows (Step-by-Step Guide)




This guide walks you through setting up Apache Spark on a Windows machine from scratch. We’ll install the required components step by step: Java, Hadoop's winutils helper, and Spark itself, and then configure PySpark so you can use Spark from Python.

Why Do We Need These Tools?

Each component has a specific role: Java provides the JVM that Spark runs on, winutils.exe gives Spark the Hadoop-style file-system support it expects on Windows, Spark itself is the processing engine, and PySpark lets you drive it from Python. The question-and-answer notes after each step expand on these roles.

Step 1: Install Java

Apache Spark needs Java to run. We recommend using Java 8 or Java 11.

  1. Download the JDK from the official vendor website.
  2. Install it and note the installation path, e.g., C:\Program Files\Java\jdk-11.0.x
  3. Set the environment variables (example command below):
    • JAVA_HOME = path to your Java installation
    • Add %JAVA_HOME%\bin to the Path variable
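For example, you can set JAVA_HOME from a Command Prompt using setx (the folder name below is a placeholder; use the path of the JDK you actually installed). Editing Path is safest through the System Properties → Environment Variables dialog, since setx truncates values longer than 1024 characters:


setx JAVA_HOME "C:\Program Files\Java\jdk-11.0.x"


Afterwards, open a new Command Prompt and run java -version to confirm Java is available on your Path.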

Question:

Why does Spark need Java?

Answer:

Spark is written in Scala, which runs on the Java Virtual Machine (JVM). Installing Java provides the runtime environment Spark needs.

Step 2: Install Hadoop Winutils

Spark uses Hadoop's file system APIs. On Windows, we need a helper binary called winutils.exe.

  1. Download winutils.exe (built for the Hadoop version matching your Spark package) from a trusted source or GitHub mirror.
  2. Create a folder such as C:\hadoop\bin and place winutils.exe inside it.
  3. Set the environment variables (example commands below):
    • HADOOP_HOME = C:\hadoop
    • Add %HADOOP_HOME%\bin to the Path
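As with JAVA_HOME, this can be done from a Command Prompt (assuming you used the C:\hadoop folder from step 2):


setx HADOOP_HOME "C:\hadoop"


A common follow-up on Windows is to create the scratch directory Spark's Hive support uses and set its permissions with winutils; this also doubles as a check that the binary runs on your machine:


mkdir C:\tmp\hive
C:\hadoop\bin\winutils.exe chmod 777 C:\tmp\hive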

Question:

Why do we need winutils.exe?

Answer:

Spark relies on Hadoop's file-system utilities for tasks such as managing file permissions, and winutils.exe supplies those operations on Windows, where they are not available natively.

Step 3: Download and Configure Spark

  1. Go to the official Apache Spark downloads page.
  2. Select a Spark version (e.g., 3.5.0) and the package type “Pre-built for Apache Hadoop 3.3 and later”.
  3. Extract the downloaded archive (a .tgz file) to a directory, e.g., C:\spark
  4. Set environment variables (example command below):
    • SPARK_HOME = the folder that directly contains Spark's bin directory (if extraction created a subfolder such as C:\spark\spark-3.5.0-bin-hadoop3, point SPARK_HOME there)
    • Add %SPARK_HOME%\bin to the Path
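Again, setx can set this (the folder name below assumes the archive extracted into a versioned subdirectory; adjust it to match your actual layout):


setx SPARK_HOME "C:\spark\spark-3.5.0-bin-hadoop3"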

Step 4: Verify Spark Installation

Open Command Prompt and run:


spark-shell
  

This should launch the interactive Spark shell in Scala.
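Once the scala> prompt appears, you can run a quick sanity check. The line below is Scala (spark-shell is a Scala REPL), and spark is the SparkSession the shell creates for you; counting a small generated range should return 1000:


scala> spark.range(1000).count()
res0: Long = 1000


Type :quit to exit the shell.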

Question:

Do I need to learn Scala to use Spark?

Answer:

No. You can start with PySpark, Spark's Python API, and write all of your Spark code in Python.

Step 5: Set Up PySpark

  1. Install Python (3.8 or later) from python.org if it is not already installed.
  2. Install PySpark using pip:

pip install pyspark
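To confirm the package installed correctly, you can import it and print its version from a new Command Prompt:


python -c "import pyspark; print(pyspark.__version__)"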
  

Step 6: Run PySpark

To verify everything is working:


import pyspark
from pyspark.sql import SparkSession

# Create (or reuse) a local SparkSession, the entry point to Spark
spark = SparkSession.builder \
    .appName("LocalSparkTest") \
    .getOrCreate()

# Build a small in-memory DataFrame to confirm Spark can run a job
df = spark.createDataFrame([
    ("Alice", 25),
    ("Bob", 30),
    ("Charlie", 35)
], ["Name", "Age"])

# Display the DataFrame; this actually executes a Spark job
df.show()
  
+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
+-------+---+
  

This confirms Spark is installed correctly and working with Python (PySpark).

Summary

With Java, winutils, Spark, and PySpark in place, you're ready to explore Spark's powerful data processing features using Python.


