
⚡ Spark on Dataproc

中文 (Chinese README)

A comprehensive toolkit for running Apache Spark jobs on Google Cloud Dataproc with support for Lightning Engine and Native Query Engine (NQE).

📁 File Generator (optional)

Use the CSV file generator for generating large test files; note that filegen.py is relatively slow for very large datasets.

🛠️ Make Commands

🏗️ Infrastructure Setup

🔨 Build and Run

  • make build - Build the Scala source code
  • make run - Run job on ephemeral cluster (highly customizable)
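For example, a minimal end-to-end run on an ephemeral cluster (a sketch using only the make targets and environment variables documented in this README) might look like:

# Optionally pick the cluster tier first (see Configuration below)
export DATAPROC_TIER=premium

# Compile the Scala job, then launch it on an ephemeral Dataproc cluster
make build
make run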

☁️ Serverless Execution

  • make run_serverless - Run batch job in Dataproc Serverless premium mode
    • Uses N2 instances + LocalSSD shuffle
  • make run_serverless_std - Run batch job in Dataproc Serverless standard mode
    • Uses E2 instances + pd-standard shuffle
  • make run_nqe - Run with Native Query Engine enabled
    • Uses N2 + LocalSSD shuffle + native execution engine
    • Requires compatibility check with make qualify
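
For instance, to compare the two serverless shuffle configurations back to back (a sketch using the targets listed above):

# Premium mode: N2 instances + Local SSD shuffle
make run_serverless

# Standard mode: E2 instances + pd-standard shuffle
make run_serverless_std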

✅ Job Compatibility

  • make qualify - Run qualification tool against Spark event logs to check job compatibility for NQE

⚙️ Configuration

🎛️ Dataproc Cluster Tier Settings

# Default tier (standard)
export DATAPROC_TIER=standard

# Premium tier (Lightning Engine)
export DATAPROC_TIER=premium

# Enable Native Query Engine (Premium tier only)
export ENABLE_NQE=true

Note: Native Query Engine is only available on Premium tier clusters.

📝 Job Configuration

Manually adjust job configuration in spark.sh to fit your specific needs. When using NQE, always run the qualification tool first to ensure compatibility.
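
Putting the pieces together, a hedged end-to-end NQE workflow (using only the make targets and environment variables described above) could look like this:

# 1. Build the job and check NQE compatibility against existing Spark event logs
make build
make qualify

# 2. If the job qualifies, enable the Premium tier and the Native Query Engine
export DATAPROC_TIER=premium
export ENABLE_NQE=true

# 3. Launch with the native execution engine
make run_nqe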

📚 Additional Resources

🔬 FIO (Flexible I/O Tester) Integration

🔧 Building FIO from Source

🏗️ Basic Build

git clone https://github.com/axboe/fio.git
cd fio
./configure --build-static
make

🔧 Troubleshooting Builds

For virtualized environments (QEMU):

./configure --build-static --disable-optimizations

Minimal lightweight configuration:

./configure --build-static \
    --disable-numa \
    --disable-rdma \
    --disable-gfapi \
    --disable-libhdfs \
    --disable-pmem \
    --disable-gfio \
    --disable-libiscsi \
    --disable-rados \
    --disable-rbd \
    --disable-zlib

After running make, the fio binary will be available in the fio source directory you cloned.
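
Before uploading, you can sanity-check the binary with standard tools (nothing repo-specific here): confirm it runs and that it is statically linked.

./fio --version        # prints the fio version string
ldd ./fio              # a static build reports "not a dynamic executable"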

🚀 Deployment to GCS

gcloud storage cp fio gs://dingoproc/fio_linux_x86

See GcpTest.scala#L277-L295 for runtime download to Spark workers.
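
The repository performs that download from Scala; as a rough shell equivalent of the same idea (bucket path taken from the upload command above), a worker would fetch and execute the binary like so:

# Fetch the static fio binary from GCS onto the worker and make it executable
gcloud storage cp gs://dingoproc/fio_linux_x86 /tmp/fio
chmod +x /tmp/fio
/tmp/fio --version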

📊 FIO Performance Testing Cheat Sheet

🔨 Basic Tests

Random Read Test (2GB total: 4 jobs × 512MB)

fio --name=randread --ioengine=libaio --iodepth=16 --rw=randread --bs=4k --direct=0 --size=512M --numjobs=4 --runtime=240 --group_reporting

Random Write Test (2GB total: 4 jobs × 512MB)

fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k --direct=0 --size=512M --numjobs=4 --runtime=240 --group_reporting

Mixed Read/Write Test (75% read, 25% write)

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75

⏩ Sequential Performance Tests

Sequential Reads (8K blocks, Direct I/O)

fio --name=seqread --rw=read --direct=1 --ioengine=libaio --bs=8k --numjobs=8 --size=1G --runtime=600 --group_reporting

Sequential Writes (32K blocks, Direct I/O)

fio --name=seqwrite --rw=write --direct=1 --ioengine=libaio --bs=32k --numjobs=4 --size=2G --runtime=600 --group_reporting

🎲 Random Performance Tests

Random Reads (8K blocks, Direct I/O)

fio --name=randread --rw=randread --direct=1 --ioengine=libaio --bs=8k --numjobs=16 --size=1G --runtime=600 --group_reporting

Random Writes (64K blocks, Direct I/O)

fio --name=randwrite --rw=randwrite --direct=1 --ioengine=libaio --bs=64k --numjobs=8 --size=512m --runtime=600 --group_reporting

Mixed Random Read/Write (90% read, 10% write)

fio --name=randrw --rw=randrw --direct=1 --ioengine=libaio --bs=16k --numjobs=8 --rwmixread=90 --size=1G --runtime=600 --group_reporting

🚀 Advanced Testing Scenarios

Time-based Mixed Workload (70% read, 30% write, 5-minute duration)

  • Creates 8 files (512MB each) with 64K block size
  • Runs for exactly 5 minutes regardless of completion
fio --name=randrw --ioengine=libaio --iodepth=1 --rw=randrw --bs=64k --direct=1 --size=512m --numjobs=8 --runtime=300 --group_reporting --time_based --rwmixread=70

Database Simulation (3:1 read/write ratio, typical database workload)

  • 4GB file with 4KB operations
  • 75% read, 25% write split
  • 64 concurrent operations
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75

Pure Random Read Performance

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randread

Pure Random Write Performance

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randwrite
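
For post-processing results, fio can also emit machine-readable output; the small sketch below captures JSON and pulls out read IOPS and bandwidth (field paths follow fio's standard JSON layout; jq is assumed to be installed).

# Run a short random read test and capture JSON output
fio --name=randread --ioengine=libaio --rw=randread --bs=4k --size=256M --runtime=60 --output-format=json --output=randread.json

# Extract read IOPS and bandwidth (KiB/s) from the first job
jq '.jobs[0].read | {iops, bw}' randread.json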
