A starter project to begin coding an Arc job using the Jupyter Notebook interface.
Clone this repository then run the included shell script. The user interface will then be available at http://localhost:8888 and the token will be printed to the console.
```bash
./.develop.sh
```

The `.develop.sh` script contains a hard-coded memory allocation for Apache Spark via the Java Virtual Machine which should be configured for your specific environment, e.g. to change from 4 gigabytes to 8 gigabytes:

```bash
-e JAVA_OPTS="-Xmx4g" \
```

to

```bash
-e JAVA_OPTS="-Xmx8g" \
```

By default everything will be executed as an Arc stage.
If needed, SQL can be executed directly using the Jupyter `%sql` magic, which can speed development:
```sql
%sql numRows=10 truncate=100 outputView=green_tripdata0
SELECT *
FROM green_tripdata0_raw
WHERE fare_amount < 10
```

- `numRows` specifies the number of rows to display in the table
- `truncate` specifies the maximum character length of any output strings
- `outputView` allows registration of a Spark view so it can be referenced in later stages (see the sketch after this list).
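As a sketch of that chaining, a later `%sql` cell can query the registered view directly. The aggregation below is illustrative only (the `green_tripdata0_stats` view name and the aggregate columns are not part of this repository; `fare_amount` comes from the example above):

```sql
%sql numRows=10 outputView=green_tripdata0_stats
-- hypothetical follow-up: aggregate the view registered by the previous cell
SELECT COUNT(*) AS num_trips, AVG(fare_amount) AS avg_fare
FROM green_tripdata0
```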
These other 'magics' have been defined:
- `%env` which allows setting job variables via the notebook (e.g. `%env ETL_CONF_KEY0=value0 ETL_CONF_KEY1=value1`). These can be used in both `%arc` and `%sql` stages (see the sketch after this list).
- `%metadata` which will try to create and print the correct Arc metadata file for the supplied view.
- `%printschema` which will print the Spark schema in a simple text mode.
- `%schema` which will print the Spark schema of a view.
- `%summary` which will print summary statistics of a view.
- `%version` which will print relevant versions.
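As a minimal sketch of using `%env` in a `%sql` stage, assuming the variable is referenced with Arc's `${...}` substitution syntax (the `ETL_CONF_MAX_FARE` variable name and threshold are illustrative, not part of this repository):

```sql
%env ETL_CONF_MAX_FARE=10
```

A later `%sql` cell can then reference the variable:

```sql
%sql numRows=10 outputView=cheap_trips
-- ${ETL_CONF_MAX_FARE} is replaced with the value set via %env above
SELECT *
FROM green_tripdata0_raw
WHERE fare_amount < ${ETL_CONF_MAX_FARE}
```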
To export an Arc job an option has been provided in the `File > Download as` menu which will export all the Arc stages from the notebook and create a job file. Note that Jupyter Notebook has been modified so that the `.ipynb` file will not save any output datasets, to prevent data from being accidentally committed to version control.
Important:
If you are running Docker for Mac or Docker for Windows, ensure that the Docker memory allocation is large enough to support the `-Xmx4g` memory requested.