Easy CPU Profiling for Apache Spark applications.
The script spark-submit-flamegraph is a wrapper around the standard spark-submit command that generates a Flame Graph.
The following platforms are supported:
- Amazon EMR
- Most Linux distributions
- Mac (with Homebrew installed)
The script is adapted to run on Amazon EMR. On other platforms, the following utilities must be present on your system:
- perl
- python2.7 (or set the PYTHON environment variable to the Python executable)
- pip (or set the PIP environment variable to the pip utility)
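
The prerequisites above can be checked before installing. The snippet below is a minimal sketch, not part of the script itself: `missing_tools` is a hypothetical helper, and the fallback values assume the documented defaults (python2.7 and pip) apply when PYTHON and PIP are unset.

```shell
# Hypothetical helper (not part of spark-submit-flamegraph):
# prints each given tool name that cannot be found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# PYTHON and PIP fall back to the documented defaults when unset.
missing=$(missing_tools perl "${PYTHON:-python2.7}" "${PIP:-pip}")

if [ -n "$missing" ]; then
  # Unquoted expansion joins the names onto a single line.
  echo "Missing prerequisites: $(echo $missing)" >&2
else
  echo "All prerequisites found"
fi
```

If a tool is reported missing, either install it or point the matching environment variable at an existing executable.
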
wget -O /usr/local/bin/spark-submit-flamegraph \
  https://raw.githubusercontent.com/spektom/spark-flamegraph/master/spark-submit-flamegraph
chmod +x /usr/local/bin/spark-submit-flamegraph

Use spark-submit-flamegraph as a replacement for the spark-submit command.
To configure the script, use the following environment variables:
| Environment Variable | Description | Default value | 
|---|---|---|
| SPARK_CMD | Spark command to run | spark-submit | 
| PYTHON | Path to the Python executable | python2.7 | 
| PIP | Path to the pip utility | pip | 
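
The defaults in the table behave like the usual shell fallback idiom: a variable keeps its value if it is set, and the default is used otherwise. The sketch below illustrates that behavior only. `show_settings` is a hypothetical function, and how the script resolves these variables internally is an assumption.

```shell
# Hypothetical illustration: print the effective value of each setting,
# falling back to the documented default when the variable is unset or empty.
show_settings() {
  echo "SPARK_CMD=${SPARK_CMD:-spark-submit}"
  echo "PYTHON=${PYTHON:-python2.7}"
  echo "PIP=${PIP:-pip}"
}

show_settings
```

Setting a variable in the environment before the call changes only the corresponding line of the output.
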
For example, to profile a Spark shell session, set the SPARK_CMD environment variable:
SPARK_CMD=spark-shell /usr/local/bin/spark-submit-flamegraph

The script performs the following operations to make profiling Spark applications as easy as possible:
- Downloads InfluxDB and starts it on a random port.
- Starts the Spark application using the original spark-submit command, with the StatsD profiler JAR on its classpath and with configuration that tells it to report statistics back to the InfluxDB instance.
- After the Spark application finishes, queries all the reported metrics from the InfluxDB instance.
- Runs a script that generates the .SVG file.
- Stops the InfluxDB instance.
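
The order of the steps above can be sketched as a small shell skeleton. This is an illustration under stated assumptions, not the script's actual code: every function is a hypothetical placeholder that only logs what the real step would do, and the port number and the output name `flamegraph.svg` are made up for the example.

```shell
# Hypothetical placeholders: each one only logs the step it stands for.
start_influxdb()    { echo "start influxdb on port $1"; }
run_spark()         { echo "run $1 with profiler reporting to port $2"; }
export_metrics()    { echo "query metrics from port $1"; }
render_flamegraph() { echo "render flame graph to $1"; }
stop_influxdb()     { echo "stop influxdb"; }

# Runs the five steps in the order the list describes them.
profile_run() {
  port=$1
  start_influxdb "$port"
  run_spark "${SPARK_CMD:-spark-submit}" "$port"
  export_metrics "$port"
  render_flamegraph "flamegraph.svg"   # assumed output name
  stop_influxdb
}

profile_run 48081   # arbitrary example port
```

Keeping each step in its own function makes the sequence easy to follow and leaves one obvious place to put real logic for each stage.
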
