By: Brian Ray [email protected]
This project's goal is to use Docker containers to set up a network of services and workbenches commonly used by data scientists working on machine learning problems. It is currently marked as experimental, and contributions are welcome. The Docker Compose file outlines several of the containers. They should be configured to work with each other over the docpyml network you create on your Docker VM.
List of Containers:
- docpyml-namenode: Hadoop NameNode; keeps the directory tree of all files in the file system.
- docpyml-datanode1: Hadoop DataNode (HDFS data storage)
- docpyml-datanode2: Hadoop DataNode (HDFS data storage)
- docpyml-spark-master: Apache Spark master
- spark-worker (you may launch many): Spark workers; each also contains a Python version matching docpyml-conda
- docpyml-sparknotebook: preconfigured Spark Notebook
- docpyml-hdfsfb: HDFS file browser from Cloudera Hue
- docpyml-conda: Anaconda Python 3.5 with Jupyter Notebook, machine-learning packages, and PySpark preconfigured
- docpyml-rocker: RStudio
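
For the containers to reach each other over the docpyml network, each service in the Compose file needs to join it and the network must be declared as external (since it is created separately with docker network create). The fragment below is only an illustrative sketch, not the project's actual file; the image name shown is a placeholder:

```yaml
version: "2"
services:
  docpyml-namenode:            # one service shown; the others follow the same pattern
    image: example/hadoop-namenode   # placeholder image name, not the real one
    networks:
      - docpyml
networks:
  docpyml:
    external: true             # created beforehand with `docker network create docpyml`
```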
 
Prerequisites: Docker Toolbox.
Optionally, adjust your VM settings (stop the VM first, then restart it):
    docker-machine stop
    VBoxManage modifyvm default --cpus 4
    VBoxManage modifyvm default --memory 8192
    docker-machine start
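
If you want to confirm the VM picked up the new settings, VirtualBox can report them directly. This assumes your docker-machine VM is named "default", as in the commands above:

```shell
# Print the VM's CPU and memory allocation as VirtualBox sees it
VBoxManage showvminfo default | grep -E "Memory size|Number of CPUs"
```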
    
To start the environment:
    docker network create docpyml
    docker-compose up -d
If Docker reports that it is not running, try this first:
    eval "$(docker-machine env default)"
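
Once compose is up, you can check that the containers started and joined the network. These are standard Docker commands; the container names you should see are the ones listed above:

```shell
# List the containers managed by this Compose project and their state
docker-compose ps
# Show which containers are attached to the docpyml network
docker network inspect docpyml --format '{{range .Containers}}{{.Name}} {{end}}'
```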
To scale up spark-workers:
    docker-compose scale spark-worker=3
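
To verify the scale-out took effect, list just the worker service; you should see three spark-worker containers running. The registered workers should also appear in the Spark master's web UI (port 8080 is Spark's standalone-mode default, assuming this setup does not override it):

```shell
# Show only the spark-worker containers and their state
docker-compose ps spark-worker
```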