A comprehensive toolkit for running Apache Spark jobs on Google Cloud Dataproc with support for Lightning Engine and Native Query Engine (NQE).
Use the CSV file generator to produce large test files; note that filegen.py is somewhat slow.
make histserver
- Create a Persistent History Server (PHS)
make jobserver
- Create an ephemeral job server with compute-resource autoscaling, configurable through autoscaling-policy.yml (autoscaling can be turned off; see the example after this list)
make build
- Build the Scala source code
make run
- Run a job on an ephemeral cluster (highly customizable)
- Supports Lightning Engine and Native Query Engine
make run_serverless
- Run a batch job in Dataproc Serverless premium mode
- Uses N2 instances + Local SSD shuffle
make run_serverless_std
- Run a batch job in Dataproc Serverless standard mode
- Uses E2 instances + pd-standard shuffle
make run_nqe
- Run with Native Query Engine enabled
- Uses N2 instances + Local SSD shuffle + the native execution engine
- Requires a compatibility check with make qualify
make qualify
- Run the qualification tool against Spark event logs to check job compatibility for NQE
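The jobserver target reads its scaling behavior from autoscaling-policy.yml. As a rough sketch of how such a policy ends up on a cluster (the policy ID, region, and cluster name below are placeholders, not values from this project):
# Register the policy definition with Dataproc (IDs and region are examples)
gcloud dataproc autoscaling-policies import example-policy \
  --source=autoscaling-policy.yml \
  --region=us-central1
# Attach the policy when creating an ephemeral cluster (cluster name is an example)
gcloud dataproc clusters create example-cluster \
  --region=us-central1 \
  --autoscaling-policy=example-policy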
# Default tier (standard)
export DATAPROC_TIER=standard
# Premium tier (Lightning Engine)
export DATAPROC_TIER=premium
# Enable Native Query Engine (Premium tier only)
export ENABLE_NQE=true
Note: Native Query Engine is only available on Premium tier clusters.
Manually adjust the job configuration in spark.sh to fit your specific needs. When using NQE, always run the qualification tool first to ensure compatibility.
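A typical end-to-end NQE run, combining the targets and variables above (a sketch; it assumes Spark event logs from a previous run are available for make qualify to inspect):
# Build the Scala job and check NQE compatibility against existing event logs
make build
make qualify
# If the job qualifies, enable Premium tier + NQE and run with the native engine
export DATAPROC_TIER=premium
export ENABLE_NQE=true
make run_nqe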
- BigQuery Integration: Run Spark Serverless in BigQuery as a stored procedure - Guide
git clone https://github.com/axboe/fio.git
cd fio
./configure --build-static
make
For virtualized environments (QEMU):
./configure --build-static --disable-optimizations
Minimal lightweight configuration:
./configure --build-static \
  --disable-numa \
  --disable-rdma \
  --disable-gfapi \
  --disable-libhdfs \
  --disable-pmem \
  --disable-gfio \
  --disable-libiscsi \
  --disable-rados \
  --disable-rbd \
  --disable-zlib
After running make, the fio binary will be available in your project directory.
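As a quick sanity check that the resulting binary is static and functional (assuming a Linux build host with glibc's ldd):
./fio --version    # prints the fio version string
ldd ./fio          # a fully static binary reports "not a dynamic executable"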
gcloud storage cp fio gs://dingoproc/fio_linux_x86
See GcpTest.scala#L277-L295 for runtime download to Spark workers.
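As a sketch of what that runtime download amounts to on a worker node (the /tmp/fio destination is an assumption for illustration, not necessarily the path used in GcpTest.scala):
# Fetch the prebuilt static binary onto the worker and make it executable
gcloud storage cp gs://dingoproc/fio_linux_x86 /tmp/fio
chmod +x /tmp/fio
/tmp/fio --version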
Random Read Test
fio --name=randread --ioengine=libaio --iodepth=16 --rw=randread --bs=4k --direct=0 --size=512M --numjobs=4 --runtime=240 --group_reporting
Random Write Test (2GB total: 4 jobs × 512MB)
fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k --direct=0 --size=512M --numjobs=4 --runtime=240 --group_reporting
Mixed Read/Write Test (75% read, 25% write)
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
Sequential Reads (8K blocks, Direct I/O)
fio --name=seqread --rw=read --direct=1 --ioengine=libaio --bs=8k --numjobs=8 --size=1G --runtime=600 --group_reporting
Sequential Writes (32K blocks, Direct I/O)
fio --name=seqwrite --rw=write --direct=1 --ioengine=libaio --bs=32k --numjobs=4 --size=2G --runtime=600 --group_reporting
Random Reads (8K blocks, Direct I/O)
fio --name=randread --rw=randread --direct=1 --ioengine=libaio --bs=8k --numjobs=16 --size=1G --runtime=600 --group_reporting
Random Writes (64K blocks, Direct I/O)
fio --name=randwrite --rw=randwrite --direct=1 --ioengine=libaio --bs=64k --numjobs=8 --size=512m --runtime=600 --group_reporting
Mixed Random Read/Write (90% read, 10% write)
fio --name=randrw --rw=randrw --direct=1 --ioengine=libaio --bs=16k --numjobs=8 --rwmixread=90 --size=1G --runtime=600 --group_reporting
Time-based Mixed Workload (70% read, 30% write, 5-minute duration)
- Creates 8 files (512MB each) with 64K block size
- Runs for exactly 5 minutes regardless of completion
fio --name=randrw --ioengine=libaio --iodepth=1 --rw=randrw --bs=64k --direct=1 --size=512m --numjobs=8 --runtime=300 --group_reporting --time_based --rwmixread=70
Database Simulation (3:1 read/write ratio, typical database workload)
- 4GB file with 4KB operations
- 75% read, 25% write split
- 64 concurrent operations
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
Pure Random Read Performance
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randread
Pure Random Write Performance
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randwrite
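To compare results across runs or shuffle configurations, any of the commands above can also emit machine-readable output; for example (output filename is illustrative):
# Same database-style workload, but captured as JSON for later comparison
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75 --output-format=json --output=fio_result.json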