Apache Spark is an open-source distributed general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.
As a unified analytics engine for large-scale data processing, Spark provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing.
- https://spark.apache.org/news/spark-4.0.0-preview1.html
- ANSI mode enabled by default (see the sketch after this list).
- Python data source support added.
- Polymorphic Python UDTF (User-Defined Table Function) introduced.
- String collation support included.
- New VARIANT data type available.
- Streaming state store data source introduced.
- Structured logging capabilities added.
- Java 17 set as the default Java version.
- Plus many other enhancements.
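ANSI mode changes how Spark SQL handles invalid operations such as bad casts and numeric overflow. A minimal sketch of the behavior difference, assuming a local session; `spark.sql.ansi.enabled` is the long-standing configuration key that 4.0 flips to `true` by default:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("ansi-demo")
  .getOrCreate()

// Legacy behavior: an invalid cast silently yields NULL.
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT CAST('abc' AS INT) AS v").show()   // v = null

// Spark 4.0 default: the same cast fails fast with a runtime error.
spark.conf.set("spark.sql.ansi.enabled", "true")
// spark.sql("SELECT CAST('abc' AS INT) AS v").show() // throws a CAST_INVALID_INPUT error
```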
- Download 3.5.3
- Download 3.5.2
- Download 3.5.1
- Download 3.4.3
- Download 3.3.2
- Download 3.3.1
- Download 3.2.3
- Download 3.2.0
- Spark Connect: a new client-server architecture introduced in Apache Spark 3.4 that decouples client applications from the Spark driver, allowing thin clients to connect to remote Spark clusters (see the connection sketch after this list).
- TorchDistributor in PySpark: a new module that makes it easier to run distributed PyTorch training on Spark clusters.
- DEFAULT Values for Columns: Spark 3.4 lets you declare a default value for a column; any INSERT that does not specify the column explicitly gets the default filled in automatically (see the SQL sketch after this list).
- TIMESTAMP WITHOUT TIMEZONE: a new data type in Spark 3.4 for representing timestamp values that carry no time zone.
- Lateral Column Alias in the SQL SELECT List: a new feature in Spark 3.4 that lets an expression in a SELECT list reference an alias defined earlier in the same list, so derived values can be reused without repeating the expression or nesting a subquery (see the example after this list).
- Bloom Filter Join: a new feature in Spark 3.4 that builds a Bloom filter from one side of a join and uses it to discard non-matching rows from the other side early, improving the performance of joins between large datasets.
- Convert an Entire DataFrame to a Schema: Spark 3.4 introduces Dataset.to(StructType), which converts a whole DataFrame to a target schema (reordering and casting columns) in a single call (see the example after this list).
- Parameterized SQL Queries: a parameterized SQL query uses named parameters instead of literal values, so the values are passed in at runtime rather than hard-coded into the query text. This makes the query more reusable, since the same text works with different parameter values, and more secure, because it prevents attackers from injecting malicious code into the query (see the sketch after this list).
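A minimal connection sketch for Spark Connect, assuming a Connect server has been started (e.g. via `sbin/start-connect-server.sh`) on the default port 15002 and the `spark-connect-client-jvm` artifact is on the classpath; the endpoint URL is illustrative:

```scala
import org.apache.spark.sql.SparkSession

// The sc:// URL points at a Spark Connect server; host and port are assumptions.
val spark = SparkSession.builder()
  .remote("sc://localhost:15002")
  .getOrCreate()

// The DataFrame operations below are sent to the server as a plan and run remotely.
spark.range(5).show()
```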
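A short SQL sketch of column defaults, assuming an active session and a hypothetical `events` table backed by a data source that supports defaults:

```scala
spark.sql("CREATE TABLE events (id INT, status STRING DEFAULT 'new') USING parquet")

// The column list omits status, so Spark fills in the declared default.
spark.sql("INSERT INTO events (id) VALUES (1)")
spark.sql("SELECT * FROM events").show()
// +---+------+
// | id|status|
// +---+------+
// |  1|   new|
// +---+------+
```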
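A small example of lateral column aliases; the `nums` view exists only for illustration:

```scala
spark.range(3).createOrReplaceTempView("nums")

spark.sql("""
  SELECT id,
         id * 2      AS doubled,
         doubled + 1 AS doubled_plus_one  -- reuses the alias defined one line above
  FROM nums
""").show()
```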
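A sketch of Dataset.to(StructType), assuming the Spark shell (or an `import spark.implicits._` in scope); the target schema here reorders the columns and widens `id` to LONG:

```scala
import org.apache.spark.sql.types._
import spark.implicits._

val target = StructType(Seq(
  StructField("name", StringType),
  StructField("id",   LongType)
))

val df = Seq((1, "alice"), (2, "bob")).toDF("id", "name")

// One call reorders columns and applies safe upcasts to match the schema.
val reshaped = df.to(target)
reshaped.printSchema()
```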
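A sketch of the parameterized form, assuming the Map-based overload of `spark.sql` (the exact signature has varied slightly across releases) and an `ids` view created for illustration:

```scala
spark.range(10).createOrReplaceTempView("ids")

// :minId is bound at run time rather than concatenated into the SQL string,
// which is what blocks injection attacks.
val result = spark.sql("SELECT * FROM ids WHERE id > :minId", Map("minId" -> 5))
result.show()
```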
- Data Scientists
- Data Engineers
- Software Developers
- Data Analysts
- Scala
- Python
- Java
- R
- RDDs
- DataFrames
- Datasets
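A minimal sketch contrasting the three abstractions in a local session; the data is invented for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("apis").getOrCreate()
import spark.implicits._

case class Person(name: String, age: Int)

// RDD: a low-level distributed collection of JVM objects.
val rdd = spark.sparkContext.parallelize(Seq(Person("alice", 34), Person("bob", 28)))

// DataFrame: rows with a schema, planned by the Catalyst optimizer.
val df = rdd.toDF()

// Dataset: the same data with a compile-time element type.
val ds = df.as[Person]
ds.filter(_.age > 30).show()
```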
- Parallel processing across a cluster
- Performing ad hoc queries for exploratory data analysis (EDA)
- Implementing data pipelines
- Processing and analyzing graph data and social networks
Spark-Submit: Spark applications are launched with the `spark-submit` script, e.g. `./bin/spark-submit --class MyApp --master local[2] myapp.jar` (class and jar names are placeholders).
import org.apache.spark.sql.SparkSession

// Build (or reuse) a local SparkSession.
val spark = SparkSession
  .builder()
  .master("local[1]")
  .appName("Spark App")
  .config("spark.driver.cores", "2")
  .getOrCreate()

// Read a JSON file into a DataFrame and display it.
val df = spark.read.json("data.json")
df.show()
scala> val readMeDf = spark.read.text("../README.md")
scala> readMeDf.show(5, false)
+--------------------------------------------------------------------------------+
|value |
+--------------------------------------------------------------------------------+
|# Apache Spark |
| |
|Spark is a unified analytics engine for large-scale data processing. It provides|
|high-level APIs in Scala, Java, Python, and R, and an optimized engine that |
|supports general computation graphs for data analysis. It also supports a |
+--------------------------------------------------------------------------------+
only showing top 5 rows
- Transformations (lazy: they only build up the query plan) e.g. orderBy() | groupBy() | join() | filter() | select()
- Actions (eager: they trigger job execution) e.g. show() | count() | take() | collect() | save() (see the sketch below)
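Transformations record a step in the plan; nothing runs until an action is called. A short sketch, assuming the session above and a hypothetical people.json input:

```scala
import spark.implicits._

val adults = spark.read.json("people.json") // hypothetical input file
  .filter($"age" >= 18)                     // transformation: recorded, not executed
  .select($"name", $"age")                  // transformation: still nothing runs

adults.count() // action: triggers the actual job
adults.show(5) // action: triggers another job (unless the result is cached)
```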
- spark.read.format("parquet").load("file.parquet") or spark.read.load("file.parquet")
- spark.read.format("csv").option("inferSchema", "true").option("header", "true").option("mode", "PERMISSIVE").load("file.csv")
- spark.read.format("json").load("file.json")
- df.write.format("parquet").mode("overwrite").option("compression", "snappy").save("parquet")
- df.write.format("csv").mode("overwrite").save("csvfile")
- df.write.format("json").mode("overwrite").save("jsonfile")
- df.write.format("avro").mode("overwrite").save("avrofile")
- df.write.format("orc").mode("overwrite").save("orcfile")
- [SPARK-37113]: Upgrade Parquet to 1.12.2
- [SPARK-37238]: Upgrade ORC to 1.6.12
- [SPARK-37534]: Bump dev.ludovic.netlib to 2.2.1
- [SPARK-37656]: Upgrade SBT to 1.5.7