Apache Spark is a powerful open-source framework designed for fast and versatile big data processing, enabling efficient large-scale data manipulation and real-time analytics. In this course, you will learn how to leverage Spark's capabilities to process massive datasets and perform complex data analysis tasks with ease.
- Discover or review Big Data concepts
- Discover the functionality of Apache Spark and why it is so widely used.
- Understand the internals of Spark.
- Learn to use Spark for batch and streaming data analytics.
- And more
- MCQ - 50%
- Project - 50%
Python programming knowledge and basic Linux/Unix shell skills.
- Information Systems
- Distributed systems
- Horizontal vs vertical scaling
- Data structures
- History of data
- Distributed systems
- The 3 Vs
- Who needs Big Data?
- Big Data clusters
- The Hadoop Ecosystem
- Data skills and profiles
- Presentation (a first example follows this list)
- Spark in Hadoop ecosystem
- Use cases
- Spark ecosystem
- Internals
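
Before diving into the APIs, here is a minimal sketch of what a Spark application looks like in Python; the application name and the local master setting are illustrative assumptions, not course requirements.

```python
# A minimal PySpark "hello world", assuming a local installation
# with the pyspark package available.
from pyspark.sql import SparkSession

# The SparkSession is the single entry point to Spark's APIs.
spark = (
    SparkSession.builder
    .appName("hello-spark")   # hypothetical application name
    .master("local[*]")       # run locally, one task slot per CPU core
    .getOrCreate()
)

# Distribute a small dataset and run a first computation on it.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
print(df.count())  # 2

spark.stop()
```

Everything else in the course builds on this entry point: RDDs are reached through `spark.sparkContext`, DataFrames and SQL through `spark` itself.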
- Data structures
- Operations
- Resilient Distributed Datasets (RDDs)
- RDDs: Pros and Cons
- DataFrames
- RDDs vs DataFrames (see the sketch after this list)
- Working with DataFrames
- Why SQL?
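
As a taste of the two APIs, here is a minimal word-count sketch written both ways; the input path `data/lines.txt` is a hypothetical placeholder.

```python
# A minimal sketch contrasting the RDD and DataFrame APIs on the
# same word-count task; the input path is hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rdd-vs-df").getOrCreate()

# RDD API: low-level transformations on arbitrary Python objects.
rdd_counts = (
    spark.sparkContext.textFile("data/lines.txt")  # hypothetical path
    .flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)
print(rdd_counts.take(5))

# DataFrame API: declarative column expressions, optimized by Catalyst.
df = spark.read.text("data/lines.txt")
df_counts = (
    df.select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
      .groupBy("word")
      .count()
)
df_counts.show(5)

# The same query through Spark SQL, using a temporary view.
df_counts.createOrReplaceTempView("word_counts")
spark.sql("SELECT word, `count` FROM word_counts ORDER BY `count` DESC").show(5)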
- Streaming introduction
- Difference between batch and stream processing
- Stream processing models
- Different processing semantics
- Programming model
- Event-time vs. processing time
- Windows: tumbling and sliding (overlapping); see the sketch after this list
- Handling late data and how long to wait
- Vocabulary
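
To tie the streaming vocabulary together, here is a minimal sketch of an event-time tumbling window with a watermark; the socket source, host, and port are toy assumptions for local experimentation, not part of the course material.

```python
# A minimal sketch of event-time windowing with late-data handling
# in Structured Streaming; feed it with `nc -lk 9999`.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("streaming-windows").getOrCreate()

# Read an unbounded stream; each row carries a 'value' and a 'timestamp'.
lines = (
    spark.readStream
    .format("socket")            # toy source, assumed for local tests
    .option("host", "localhost")
    .option("port", 9999)
    .option("includeTimestamp", "true")
    .load()
)

# Count words per 10-minute tumbling window, keyed on event time.
# The watermark answers "how long to wait": events more than 5 minutes
# behind the max event time seen so far are dropped as too late.
counts = (
    lines.select(F.explode(F.split("value", r"\s+")).alias("word"), "timestamp")
    .withWatermark("timestamp", "5 minutes")
    .groupBy(F.window("timestamp", "10 minutes"), "word")
    .count()
)

# 'update' output mode emits only the windows changed by each trigger.
query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```

For overlapping (sliding) windows, `F.window` takes a third argument, the slide duration, e.g. `F.window("timestamp", "10 minutes", "5 minutes")`.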
- Lab RDD: TBD
- Lab SQL & DataFrames: TBD
- Lab Structured Streaming: TBD
- Additional labs: TBD
- Final Project: TBD
- email: [email protected]
- github: jsanc525