Apache Spark is a powerful open-source framework designed for fast and versatile big data processing, enabling efficient large-scale data manipulation and real-time analytics. In this course, you will learn how to leverage Spark's capabilities to process massive datasets and perform complex data analysis tasks with ease.
- Discover or review Big Data concepts
- Discover the functionality of Apache Spark and why it is so widely used.
- Understand the internals of Spark.
- Learn to use Spark for batch and streaming data analytics.
- And more
- MCQ - 50%
- Project - 50%
Python programming knowledge and basic Linux/Unix shell skills.
- Information Systems
- Distributed systems
- Horizontal vs vertical scaling
- Data structures
- History of data
- Distributed systems
- The 3 Vs
- Who needs Big Data?
- Big Data clusters
- The Hadoop Ecosystem
- Data skills and profiles
- Presentation (a first example follows this list)
- Spark in Hadoop ecosystem
- Use cases
- Spark ecosystem
- Internals
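
Before diving into the APIs, here is a minimal sketch of what a Spark application looks like in Python; the application name and the local master setting are illustrative assumptions, not course requirements.

```python
# A minimal PySpark "hello world", assuming a local installation
# with the pyspark package available.
from pyspark.sql import SparkSession

# The SparkSession is the single entry point to Spark's APIs.
spark = (
    SparkSession.builder
    .appName("hello-spark")   # hypothetical application name
    .master("local[*]")       # run locally, one task slot per CPU core
    .getOrCreate()
)

# Distribute a small dataset and run a first computation on it.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
print(df.count())  # 2

spark.stop()
```

Everything else in the course builds on this entry point: RDDs are reached through `spark.sparkContext`, DataFrames and SQL through `spark` itself.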
- Data structures
- Operations
- Resilient Distributed Datasets (RDDs)
- RDDs: Pros and Cons
- DataFrames
- RDDs vs DataFrames (see the sketch after this list)
- Working with DataFrames
- Why SQL?
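
As a taste of the two APIs, here is a minimal word-count sketch written both ways; the input path `data/lines.txt` is a hypothetical placeholder.

```python
# A minimal sketch contrasting the RDD and DataFrame APIs on the
# same word-count task; the input path is hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rdd-vs-df").getOrCreate()

# RDD API: low-level transformations on arbitrary Python objects.
rdd_counts = (
    spark.sparkContext.textFile("data/lines.txt")  # hypothetical path
    .flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)
print(rdd_counts.take(5))

# DataFrame API: declarative column expressions, optimized by Catalyst.
df = spark.read.text("data/lines.txt")
df_counts = (
    df.select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
      .groupBy("word")
      .count()
)
df_counts.show(5)

# The same query through Spark SQL, using a temporary view.
df_counts.createOrReplaceTempView("word_counts")
spark.sql("SELECT word, `count` FROM word_counts ORDER BY `count` DESC").show(5)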
- Streaming introduction
- Difference between batch and stream processing
- Stream processing models
- Different processing semantics
- Programming model
- Event-time vs. processing time
- Windows: tumbling and sliding (overlapping); see the sketch after this list
- Handling late data and how long to wait
- Vocabulary
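
To tie the streaming vocabulary together, here is a minimal sketch of an event-time tumbling window with a watermark; the socket source, host, and port are toy assumptions for local experimentation, not part of the course material.

```python
# A minimal sketch of event-time windowing with late-data handling
# in Structured Streaming; feed it with `nc -lk 9999`.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("streaming-windows").getOrCreate()

# Read an unbounded stream; each row carries a 'value' and a 'timestamp'.
lines = (
    spark.readStream
    .format("socket")            # toy source, assumed for local tests
    .option("host", "localhost")
    .option("port", 9999)
    .option("includeTimestamp", "true")
    .load()
)

# Count words per 10-minute tumbling window, keyed on event time.
# The watermark answers "how long to wait": events more than 5 minutes
# behind the max event time seen so far are dropped as too late.
counts = (
    lines.select(F.explode(F.split("value", r"\s+")).alias("word"), "timestamp")
    .withWatermark("timestamp", "5 minutes")
    .groupBy(F.window("timestamp", "10 minutes"), "word")
    .count()
)

# 'update' output mode emits only the windows changed by each trigger.
query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```

For overlapping (sliding) windows, `F.window` takes a third argument, the slide duration, e.g. `F.window("timestamp", "10 minutes", "5 minutes")`.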
- Lab RDD: TBD
- Lab SQL & DataFrames: TBD
- Lab Structured Streaming: TBD
- Additional labs: TBD
- Final Project: TBD
- email: [email protected]
- github: jsanc525