adaltas/ece-big-data-processing-2025-fall

Big Data Processing and Applications

Introduction

Apache Spark is a powerful open-source framework designed for fast and versatile big data processing, enabling efficient large-scale data manipulation and real-time analytics. In this course, you will learn how to leverage Spark's capabilities to process massive datasets and perform complex data analysis tasks with ease.

Educational goals

  • Discover or review Big Data concepts.
  • Discover the core features of Apache Spark and why it is so widely adopted.
  • Understand the internals of Spark.
  • Learn to use Spark for batch and streaming data analytics.
  • and more

Evaluation

  • MCQ - 50%
  • Project - 50%

Prerequisites

Python programming knowledge and basic Linux/Unix shell knowledge.

Modules

Module 0 - Introduction to Big Data

  • Information Systems
  • Distributed systems
  • Horizontal vs vertical scaling
  • Data structure
  • History of data
  • The 3 Vs
  • Who needs Big Data?
  • Big Data clusters
  • The Hadoop Ecosystem
  • Data skills and profiles

Module 1 - Introduction to Spark & RDDs

  • Presentation
  • Spark in Hadoop ecosystem
  • Use cases
  • Spark ecosystem
  • Internals
  • Data structures
  • Operations
  • Resilient Distributed Datasets (RDDs)

Module 2 - Spark SQL and DataFrames

  • RDDs: Pros and Cons
  • DataFrames
  • RDDs vs DataFrames
  • Working with DataFrames
  • Why SQL?

Module 3 - Spark Structured Streaming

  • Streaming introduction
  • Difference between batch and stream processing
  • Stream processing models
  • Different processing semantics
  • Programming model
  • Event-time vs. processing time
  • Windows: tumbling, overlapping
  • Handling late data and how long to wait
  • Vocabulary
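
To make the window vocabulary above concrete, here is a minimal sketch in plain Python (deliberately not the Spark API) of how an event timestamp could be assigned to tumbling and overlapping (sliding) windows. The function names `tumbling_window` and `sliding_windows`, the use of integer seconds, and the alignment of window starts to multiples of the slide are all assumptions made for illustration.

```python
from typing import List, Tuple

# A window is a half-open interval [start, end), in seconds (assumed unit).
Window = Tuple[int, int]


def tumbling_window(event_time: int, size: int) -> Window:
    """Return the single fixed-size, non-overlapping window containing the event."""
    start = (event_time // size) * size
    return (start, start + size)


def sliding_windows(event_time: int, size: int, slide: int) -> List[Window]:
    """Return every overlapping window of length `size`, started every `slide`
    seconds, that contains the event."""
    windows = []
    # Latest window start that is not after the event.
    start = (event_time // slide) * slide
    # Walk backwards while the window still covers the event.
    while start > event_time - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


if __name__ == "__main__":
    print(tumbling_window(12, size=10))
    print(sliding_windows(12, size=10, slide=5))
```

With a slide equal to the window size, the overlapping case reduces to the tumbling one: each event falls into exactly one window.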

Due Dates

Group 1

  • Lab RDD: TBD
  • Lab SQL & DataFrames: TBD
  • Lab Structured Streaming: TBD
  • Additional labs: TBD

Group 2

  • Lab RDD: TBD
  • Lab SQL & DataFrames: TBD
  • Lab Structured Streaming: TBD
  • Additional labs: TBD

Group 3

  • Lab RDD: TBD
  • Lab SQL & DataFrames: TBD
  • Lab Structured Streaming: TBD
  • Additional labs: TBD

All Groups

  • Final Project: TBD

Contact Info

About

Course content for Big Data Processing and Applications, fall 2025.
