Course Setup Guide

This repository will guide you through the steps to set up your development environment for learning Spark, PySpark, and related technologies. Please follow the instructions below.

Step 1: Install Docker Desktop

  1. Navigate to the install-setup-docker folder.
  2. Follow the instructions inside to install Docker Desktop on your system.
    • Docker is required for running containers for your PySpark environment.
    • Make sure Docker is properly installed and running before proceeding to the next step.
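If you would like a scripted way to confirm Docker is ready (rather than just opening Docker Desktop), a small Python sketch like the one below can help. It is not part of the course materials; it only checks that the docker CLI is on your PATH and that the daemon responds to `docker info`.

```python
import shutil
import subprocess

# Check that the Docker CLI is installed and on PATH.
if shutil.which("docker") is None:
    raise SystemExit("Docker CLI not found - is Docker Desktop installed?")

# `docker info` only succeeds when the Docker daemon is running,
# which is what the rest of this setup depends on.
result = subprocess.run(["docker", "info"], capture_output=True, text=True)
if result.returncode == 0:
    print("Docker is installed and the daemon is running.")
else:
    print("Docker is installed but the daemon is not reachable:")
    print(result.stderr.strip())
```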

Step 2: Set up PySpark Jupyter Lab Environment for Spark Core, DataFrames, Datasets, SQL

  1. After installing Docker, navigate to the pyspark-jupyter-lab folder.

  2. Use the Dockerfile provided in the folder to create a Docker container for running PySpark in Jupyter Lab.

  3. The technical lectures on Spark Core, DataFrames, Datasets, and Spark SQL will be conducted in this containerized environment.

    • Instructions to build the Docker image and run the container are in the README file in the pyspark-jupyter-lab folder.
  4. Once the container is running, check the container's logs in Docker and open the host link (it includes an access token) in your browser to reach the Jupyter Lab environment where you'll be working with PySpark.

  5. There is a notes folder with lecture notes on RDDs, DataFrames, Datasets, and Spark SQL. A short sanity-check snippet you can run in the container follows this list.
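Once Jupyter Lab is open, a notebook cell along these lines can confirm the environment works. The sample rows and column names are illustrative, not taken from the course notebooks.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a local SparkSession inside the Jupyter Lab container.
spark = SparkSession.builder.appName("setup-check").getOrCreate()

# Build a tiny DataFrame from in-memory rows (sample data is made up).
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)

# Exercise the DataFrame API ...
df.filter(df.age > 30).show()

# ... and Spark SQL on the same data.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()
```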

Step 3: Set up PySpark Jupyter Environment for Structured Streaming

  • Refer to the README(spark-streaming) file in the spark-structured-streaming folder to build the Docker image and run the container.
  • There is a notes folder with lecture notes.
  • README(lab-one), README(lab-two-streaming), and README(practice-three-streaming) outline the steps for your non-graded practice on Structured Streaming. They are not labs to be submitted, but they are mandatory practice; a minimal streaming sketch follows this list.
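As a quick check that Structured Streaming works in the container (independent of the lab READMEs), a sketch like the one below uses Spark's built-in rate source and console sink, so no Kafka or socket setup is needed.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("streaming-check").getOrCreate()

# The built-in "rate" source emits rows with a timestamp and an increasing
# value, so no external system is required for a first test.
stream = (
    spark.readStream
    .format("rate")
    .option("rowsPerSecond", 5)
    .load()
)

# A trivial transformation, then write each micro-batch to the console.
query = (
    stream.withColumn("doubled", col("value") * 2)
    .writeStream
    .format("console")
    .outputMode("append")
    .start()
)

query.awaitTermination(30)  # let it run for about 30 seconds
query.stop()
```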

Step 4: Set up Jupyter Lab Environment Using Dataproc on GCP

  • Refer to the README file in the gcp-spark-jupyter-setup folder.
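Once the Dataproc notebook is up, a cell along these lines verifies that Spark can read from Cloud Storage. The gs:// path is a placeholder you would replace with a bucket and file in your own project.

```python
from pyspark.sql import SparkSession

# On Dataproc's PySpark kernel a session is usually pre-created as `spark`;
# building one explicitly is harmless either way.
spark = SparkSession.builder.appName("dataproc-check").getOrCreate()

# Dataproc ships with the GCS connector, so gs:// paths work directly.
# Replace the placeholder path with data from your own bucket.
df = spark.read.csv("gs://your-bucket/path/to/data.csv", header=True, inferSchema=True)
df.printSchema()
df.show(5)
```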

Step 5: Set up Apache Spark on Hadoop Cluster on AWS

  1. Navigate to the aws-spark-setup folder.
  2. Follow the instructions in the README file to set up Apache Spark on a Hadoop cluster running in an EC2 instance on AWS.
  3. This setup will allow you to execute distributed Spark jobs on a live cluster.
  4. Work through the README files in order: first sparkAWSREADME, then sparkAWSREADME(cont). A small test job you can submit to the cluster appears after this list.
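After the cluster is configured, a small script like the sketch below can be submitted from the master node to confirm that YARN accepts distributed work. The file names and HDFS path are placeholders, not part of the course materials.

```python
# wordcount.py - a small job to confirm the cluster runs distributed work.
# Submit it from the master node with something like (paths are placeholders):
#   spark-submit --master yarn wordcount.py hdfs:///data/input.txt
import sys

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split

spark = SparkSession.builder.appName("yarn-wordcount").getOrCreate()

# Read a text file from HDFS; the path is passed as the first argument.
lines = spark.read.text(sys.argv[1])

# Split lines into words and count occurrences across the cluster.
counts = (
    lines.select(explode(split(col("value"), r"\s+")).alias("word"))
    .groupBy("word")
    .count()
    .orderBy(col("count").desc())
)

counts.show(20)

spark.stop()
```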

Further Assistance

If you encounter any issues or have questions, feel free to reach out in the course discussion forum. Happy learning!
