
⚡ Spark on Dataproc

中文 (Chinese README)

A comprehensive toolkit for running Apache Spark jobs on Google Cloud Dataproc with support for Lightning Engine and Native Query Engine (NQE).

📁 File Generator (optional)

Use the CSV file generator for generating large test files; note that filegen.py is relatively slow for very large datasets.

🛠️ Make Commands

🏗️ Infrastructure Setup

🔨 Build and Run

  • make build - Build the Scala source code
  • make run - Run job on ephemeral cluster (highly customizable)
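For example, a minimal end-to-end run on an ephemeral cluster (a sketch using only the make targets and environment variables documented in this README) might look like:

# Optionally pick the cluster tier first (see Configuration below)
export DATAPROC_TIER=premium

# Compile the Scala job, then launch it on an ephemeral Dataproc cluster
make build
make run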

☁️ Serverless Execution

  • make run_serverless - Run batch job in Dataproc Serverless premium mode
    • Uses N2 instances + LocalSSD shuffle
  • make run_serverless_std - Run batch job in Dataproc Serverless standard mode
    • Uses E2 instances + pd-standard shuffle
  • make run_nqe - Run with Native Query Engine enabled
    • Uses N2 + LocalSSD shuffle + native execution engine
    • Requires compatibility check with make qualify
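
For instance, to compare the two serverless shuffle configurations back to back (a sketch using the targets listed above):

# Premium mode: N2 instances + Local SSD shuffle
make run_serverless

# Standard mode: E2 instances + pd-standard shuffle
make run_serverless_std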

✅ Job Compatibility

  • make qualify - Run qualification tool against Spark event logs to check job compatibility for NQE

⚙️ Configuration

🎛️ Dataproc Cluster Tier Settings

# Default tier (standard)
export DATAPROC_TIER=standard

# Premium tier (Lightning Engine)
export DATAPROC_TIER=premium

# Enable Native Query Engine (Premium tier only)
export ENABLE_NQE=true

Note: Native Query Engine is only available on Premium tier clusters.

📝 Job Configuration

Manually adjust job configuration in spark.sh to fit your specific needs. When using NQE, always run the qualification tool first to ensure compatibility.
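
Putting the pieces together, a hedged end-to-end NQE workflow (using only the make targets and environment variables described above) could look like this:

# 1. Build the job and check NQE compatibility against existing Spark event logs
make build
make qualify

# 2. If the job qualifies, enable the Premium tier and the Native Query Engine
export DATAPROC_TIER=premium
export ENABLE_NQE=true

# 3. Launch with the native execution engine
make run_nqe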

📚 Additional Resources

🔬 FIO (Flexible I/O Tester) Integration

🔧 Building FIO from Source

🏗️ Basic Build

git clone https://github.com/axboe/fio.git
cd fio
./configure --build-static
make

🔧 Troubleshooting Builds

For virtualized environments (QEMU):

./configure --build-static --disable-optimizations

Minimal lightweight configuration:

./configure --build-static \
    --disable-numa \
    --disable-rdma \
    --disable-gfapi \
    --disable-libhdfs \
    --disable-pmem \
    --disable-gfio \
    --disable-libiscsi \
    --disable-rados \
    --disable-rbd \
    --disable-zlib

After running make, the fio binary will be available in the fio source directory you cloned.
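
Before uploading, you can sanity-check the binary with standard tools (nothing repo-specific here): confirm it runs and that it is statically linked.

./fio --version        # prints the fio version string
ldd ./fio              # a static build reports "not a dynamic executable"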

🚀 Deployment to GCS

gcloud storage cp fio gs://dingoproc/fio_linux_x86

See GcpTest.scala#L277-L295 for runtime download to Spark workers.
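
The repository performs that download from Scala; as a rough shell equivalent of the same idea (bucket path taken from the upload command above), a worker would fetch and execute the binary like so:

# Fetch the static fio binary from GCS onto the worker and make it executable
gcloud storage cp gs://dingoproc/fio_linux_x86 /tmp/fio
chmod +x /tmp/fio
/tmp/fio --version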

📊 FIO Performance Testing Cheat Sheet

🔨 Basic Tests

Random Read Test (2GB total: 4 jobs × 512MB)

fio --name=randread --ioengine=libaio --iodepth=16 --rw=randread --bs=4k --direct=0 --size=512M --numjobs=4 --runtime=240 --group_reporting

Random Write Test (2GB total: 4 jobs × 512MB)

fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k --direct=0 --size=512M --numjobs=4 --runtime=240 --group_reporting

Mixed Read/Write Test (75% read, 25% write)

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75

⏩ Sequential Performance Tests

Sequential Reads (8K blocks, Direct I/O)

fio --name=seqread --rw=read --direct=1 --ioengine=libaio --bs=8k --numjobs=8 --size=1G --runtime=600 --group_reporting

Sequential Writes (32K blocks, Direct I/O)

fio --name=seqwrite --rw=write --direct=1 --ioengine=libaio --bs=32k --numjobs=4 --size=2G --runtime=600 --group_reporting

🎲 Random Performance Tests

Random Reads (8K blocks, Direct I/O)

fio --name=randread --rw=randread --direct=1 --ioengine=libaio --bs=8k --numjobs=16 --size=1G --runtime=600 --group_reporting

Random Writes (64K blocks, Direct I/O)

fio --name=randwrite --rw=randwrite --direct=1 --ioengine=libaio --bs=64k --numjobs=8 --size=512m --runtime=600 --group_reporting

Mixed Random Read/Write (90% read, 10% write)

fio --name=randrw --rw=randrw --direct=1 --ioengine=libaio --bs=16k --numjobs=8 --rwmixread=90 --size=1G --runtime=600 --group_reporting

🚀 Advanced Testing Scenarios

Time-based Mixed Workload (70% read, 30% write, 5-minute duration)

  • Creates 8 files (512MB each) with 64K block size
  • Runs for exactly 5 minutes regardless of completion
fio --name=randrw --ioengine=libaio --iodepth=1 --rw=randrw --bs=64k --direct=1 --size=512m --numjobs=8 --runtime=300 --group_reporting --time_based --rwmixread=70

Database Simulation (3:1 read/write ratio, typical database workload)

  • 4GB file with 4KB operations
  • 75% read, 25% write split
  • 64 concurrent operations
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75

Pure Random Read Performance

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randread

Pure Random Write Performance

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randwrite
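
For post-processing results, fio can also emit machine-readable output; the small sketch below captures JSON and pulls out read IOPS and bandwidth (field paths follow fio's standard JSON layout; jq is assumed to be installed).

# Run a short random read test and capture JSON output
fio --name=randread --ioengine=libaio --rw=randread --bs=4k --size=256M --runtime=60 --output-format=json --output=randread.json

# Extract read IOPS and bandwidth (KiB/s) from the first job
jq '.jobs[0].read | {iops, bw}' randread.json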
