Apache Spark is an open-source distributed general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.
As a unified analytics engine for large-scale data processing, Spark provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing.
- https://spark.apache.org/news/spark-4.0.0-preview1.html
- ANSI mode enabled by default (see the sketch after this list).
- Python data source support added.
- Polymorphic Python UDTF (User-Defined Table Function) introduced.
- String collation support included.
- New VARIANT data type available.
- Streaming state store data source introduced.
- Structured logging capabilities added.
- Java 17 set as the default Java version.
- Plus many other enhancements.
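ANSI mode changes how Spark SQL handles invalid operations such as bad casts and numeric overflow. A minimal sketch of the behavior difference, assuming a local session; `spark.sql.ansi.enabled` is the long-standing configuration key that 4.0 flips to `true` by default:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("ansi-demo")
  .getOrCreate()

// Legacy behavior: an invalid cast silently yields NULL.
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT CAST('abc' AS INT) AS v").show()   // v = null

// Spark 4.0 default: the same cast fails fast with a runtime error.
spark.conf.set("spark.sql.ansi.enabled", "true")
// spark.sql("SELECT CAST('abc' AS INT) AS v").show() // throws a CAST_INVALID_INPUT error
```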
- Download 3.5.3
- Download 3.5.2
- Download 3.5.1
- Download 3.4.3
- Download 3.3.2
- Download 3.3.1
- Download 3.2.3
- Download 3.2.0
- Spark Connect: a new client-server architecture introduced in Apache Spark 3.4 that decouples client applications from the Spark driver, allowing thin clients to connect to remote Spark clusters (see the connection sketch after this list).
- TorchDistributor in PySpark: a new module that makes it easier to run distributed PyTorch training on Spark clusters.
- DEFAULT Values for Columns: Spark 3.4 lets you declare a default value for a column; any INSERT that does not specify the column explicitly gets the default filled in automatically (see the SQL sketch after this list).
- TIMESTAMP WITHOUT TIMEZONE: a new data type in Spark 3.4 for representing timestamp values that carry no time zone.
- Lateral Column Alias in the SQL SELECT List: a new feature in Spark 3.4 that lets an expression in a SELECT list reference an alias defined earlier in the same list, so derived values can be reused without repeating the expression or nesting a subquery (see the example after this list).
- Bloom Filter Join: a new feature in Spark 3.4 that builds a Bloom filter from one side of a join and uses it to discard non-matching rows from the other side early, improving the performance of joins between large datasets.
- Convert an Entire DataFrame to a Schema: Spark 3.4 introduces Dataset.to(StructType), which converts a whole DataFrame to a target schema (reordering and casting columns) in a single call (see the example after this list).
- Parameterized SQL Queries: a parameterized SQL query uses named parameters instead of literal values, so the values are passed in at runtime rather than hard-coded into the query text. This makes the query more reusable, since the same text works with different parameter values, and more secure, because it prevents attackers from injecting malicious code into the query (see the sketch after this list).
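A minimal connection sketch for Spark Connect, assuming a Connect server has been started (e.g. via `sbin/start-connect-server.sh`) on the default port 15002 and the `spark-connect-client-jvm` artifact is on the classpath; the endpoint URL is illustrative:

```scala
import org.apache.spark.sql.SparkSession

// The sc:// URL points at a Spark Connect server; host and port are assumptions.
val spark = SparkSession.builder()
  .remote("sc://localhost:15002")
  .getOrCreate()

// The DataFrame operations below are sent to the server as a plan and run remotely.
spark.range(5).show()
```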
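A short SQL sketch of column defaults, assuming an active session and a hypothetical `events` table backed by a data source that supports defaults:

```scala
spark.sql("CREATE TABLE events (id INT, status STRING DEFAULT 'new') USING parquet")

// The column list omits status, so Spark fills in the declared default.
spark.sql("INSERT INTO events (id) VALUES (1)")
spark.sql("SELECT * FROM events").show()
// +---+------+
// | id|status|
// +---+------+
// |  1|   new|
// +---+------+
```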
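A small example of lateral column aliases; the `nums` view exists only for illustration:

```scala
spark.range(3).createOrReplaceTempView("nums")

spark.sql("""
  SELECT id,
         id * 2      AS doubled,
         doubled + 1 AS doubled_plus_one  -- reuses the alias defined one line above
  FROM nums
""").show()
```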
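A sketch of Dataset.to(StructType), assuming the Spark shell (or an `import spark.implicits._` in scope); the target schema here reorders the columns and widens `id` to LONG:

```scala
import org.apache.spark.sql.types._
import spark.implicits._

val target = StructType(Seq(
  StructField("name", StringType),
  StructField("id",   LongType)
))

val df = Seq((1, "alice"), (2, "bob")).toDF("id", "name")

// One call reorders columns and applies safe upcasts to match the schema.
val reshaped = df.to(target)
reshaped.printSchema()
```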
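A sketch of the parameterized form, assuming the Map-based overload of `spark.sql` (the exact signature has varied slightly across releases) and an `ids` view created for illustration:

```scala
spark.range(10).createOrReplaceTempView("ids")

// :minId is bound at run time rather than concatenated into the SQL string,
// which is what blocks injection attacks.
val result = spark.sql("SELECT * FROM ids WHERE id > :minId", Map("minId" -> 5))
result.show()
```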
- Data Scientists
- Data Engineers
- Software Developers
- Data Analysts
- Scala
- Python
- Java
- R
- RDDs
- DataFrames
- Datasets
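A minimal sketch contrasting the three abstractions in a local session; the data is invented for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("apis").getOrCreate()
import spark.implicits._

case class Person(name: String, age: Int)

// RDD: a low-level distributed collection of JVM objects.
val rdd = spark.sparkContext.parallelize(Seq(Person("alice", 34), Person("bob", 28)))

// DataFrame: rows with a schema, planned by the Catalyst optimizer.
val df = rdd.toDF()

// Dataset: the same data with a compile-time element type.
val ds = df.as[Person]
ds.filter(_.age > 30).show()
```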
- Parallel processing across a cluster
- Performing ad hoc queries for exploratory data analysis (EDA)
- Implementing data pipelines
- Processing and analyzing graph data and social networks
Spark-Submit: Spark applications are launched with the `spark-submit` script, e.g. `./bin/spark-submit --class MyApp --master local[2] myapp.jar` (class and jar names are placeholders).
import org.apache.spark.sql.SparkSession

// Build (or reuse) a local SparkSession.
val spark = SparkSession
  .builder()
  .master("local[1]")
  .appName("Spark App")
  .config("spark.driver.cores", "2")
  .getOrCreate()

// Read a JSON file into a DataFrame and display it.
val df = spark.read.json("data.json")
df.show()
scala> val readMeDf = spark.read.text("../README.md")
scala> readMeDf.show(5, false)
+--------------------------------------------------------------------------------+
|value |
+--------------------------------------------------------------------------------+
|# Apache Spark |
| |
|Spark is a unified analytics engine for large-scale data processing. It provides|
|high-level APIs in Scala, Java, Python, and R, and an optimized engine that |
|supports general computation graphs for data analysis. It also supports a |
+--------------------------------------------------------------------------------+
only showing top 5 rows
- Transformations (lazy: they only build up the query plan) e.g. orderBy() | groupBy() | join() | filter() | select()
- Actions (eager: they trigger job execution) e.g. show() | count() | take() | collect() | save() (see the sketch below)
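Transformations record a step in the plan; nothing runs until an action is called. A short sketch, assuming the session above and a hypothetical people.json input:

```scala
import spark.implicits._

val adults = spark.read.json("people.json") // hypothetical input file
  .filter($"age" >= 18)                     // transformation: recorded, not executed
  .select($"name", $"age")                  // transformation: still nothing runs

adults.count() // action: triggers the actual job
adults.show(5) // action: triggers another job (unless the result is cached)
```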
- spark.read.format("parquet").load("file.parquet") or spark.read.load("file.parquet")
- spark.read.format("csv").option("inferSchema", "true").option("header", "true").option("mode", "PERMISSIVE").load("file.csv")
- spark.read.format("json").load("file.json")
- df.write.format("parquet").mode("overwrite").option("compression", "snappy").save("parquet")
- df.write.format("csv").mode("overwrite").save("csvfile")
- df.write.format("json").mode("overwrite").save("jsonfile")
- df.write.format("avro").mode("overwrite").save("avrofile")
- df.write.format("orc").mode("overwrite").save("orcfile")
- [SPARK-37113]: Upgrade Parquet to 1.12.2
- [SPARK-37238]: Upgrade ORC to 1.6.12
- [SPARK-37534]: Bump dev.ludovic.netlib to 2.2.1
- [SPARK-37656]: Upgrade SBT to 1.5.7