A lightweight tool for launching and managing ensembles of tasks on HPC systems.
To install ensemble_launcher, clone the repository and install the required dependencies:

```bash
git clone https://github.com/your-repo/ensemble_launcher.git
cd ensemble_launcher
python3 -m pip install -e ./
```
- Create a Configuration File: Define your ensembles and tasks in a JSON file. Below is an example configuration file (`tests/simple_test/config.json`) and an explanation of its options:
```json
{
    "poll_interval": 1,
    "update_interval": null,
    "sys_info": {
        "name": "local",
        "ncores_per_nodes": 1,
        "ngpus_per_node": 1
    },
    "ensembles": {
        "example_ensemble": {
            "num_nodes": 1,
            "num_processes_per_node": 1,
            "num_gpus_per_process": 1,
            "launcher": "mpi",
            "launcher_options": {
                "np": 1,
                "ppn": 1,
                "cpu-bind": "depth",
                "depth": 1
            },
            "relation": "one-to-one",
            "cmd_template": "./exe -a {arg1} -b {arg2}",
            "arg1": "linspace(0, 10, 5)",
            "arg2": "linspace(0, 1, 5)",
            "run_dir": "./run_dir",
            "env": {
                "var": "value"
            }
        }
    }
}
```
- `poll_interval`: Time interval (in seconds) at which the status of running tasks is checked.
- `update_interval`: Time interval (in seconds) at which the ensemble configuration is updated. Set to `null` to disable updates.
- `sys_info`: System-specific information:
  - `name`: Name of the system (e.g., `local`, `aurora`).
  - `ncores_per_nodes`: Number of CPU cores available per node.
  - `ngpus_per_node`: Number of GPUs available per node.
- `ensembles`: A dictionary defining the ensembles to be executed:
  - `example_ensemble`: Name of the ensemble.
  - `num_nodes`: Number of nodes required per task in the ensemble. Can be varied for each task.
  - `num_processes_per_node`: Number of processes per node used per task.
  - `num_gpus_per_process`: Number of GPUs allocated per process per task.
  - `launcher`: Task launcher type (`mpi` or `bash`).
  - `launcher_options`: Additional options for the launcher:
    - `cpu-bind`: CPU binding strategy (e.g., `depth`, `list`).
    - `depth`: Depth of CPU binding.
  - `relation`: Relationship between task parameters (`one-to-one` or `many-to-many`); see the sketch after this list.
  - `pre_launch_cmd`: A Linux command to be executed before the tasks are launched (e.g., `cp -r * ./run_dir`).
  - `cmd_template`: Template for the command to execute, with placeholders for task-specific arguments. Variable arguments should be surrounded by `{}`.
  - `arg1`, `arg2`: Task-specific arguments, which can be defined using functions like `linspace`.
  - `run_dir`: Directory where task outputs and logs will be stored.
  - `env`: A dictionary of environment variables to set for the tasks:
    - key: Name of the environment variable.
    - value: Value of the environment variable. Can be a static value or dynamically generated.
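Below is a minimal sketch of how a `one-to-one` relation pairs the argument lists from the example config into concrete task commands. This is an illustration of the pairing semantics only, not the actual ensemble_launcher internals; numpy-style `linspace` behavior is assumed.

```python
# Hypothetical illustration of "one-to-one" argument pairing; not the
# actual ensemble_launcher implementation.
import numpy as np

cmd_template = "./exe -a {arg1} -b {arg2}"
arg1 = np.linspace(0, 10, 5)  # assuming numpy-like linspace semantics
arg2 = np.linspace(0, 1, 5)

# one-to-one: the i-th value of each argument list belongs to the i-th task
for a, b in zip(arg1, arg2):
    print(cmd_template.format(arg1=a, arg2=b))
# ./exe -a 0.0 -b 0.0
# ./exe -a 2.5 -b 0.25
# ...

# A "many-to-many" relation would presumably pair every value of arg1 with
# every value of arg2 (25 tasks instead of 5).
```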
- Write a launcher script: Write a simple launcher script. An example is given below:

```python
import time
from ensemble_launcher import ensemble_launcher

"""
Instead of the manual generation in step 1, the config file can also be
generated on the fly. For example:

import json

ntasks = 1000
config = {
    "poll_interval": 1,
    "sys_info": {
        "name": "aurora",
        "ncores_per_node": 104,
        "ngpus_per_node": 12
    },
    "ensembles": {
        "inference": {
            "num_nodes": 1,
            "num_processes_per_node": 1,
            "num_gpus_per_process": 1,
            "launcher": "mpi",
            "relation": "one-to-one",
            "cmd_template": "<args> {opts}",
            "opts": list(range(ntasks)),
            "run_dir": [f"./run_dir/task_{i}" for i in range(ntasks)]
        }
    }
}

fname = "./config.json"
with open(fname, "w") as f:
    json.dump(config, f, indent=4)
"""

if __name__ == '__main__':
    el = ensemble_launcher("config.json")
    start_time = time.perf_counter()
    total_poll_time = el.run_tasks()
    end_time = time.perf_counter()
    total_run_time = end_time - start_time
    print(f"{total_run_time=}")
```
- Run the launcher script: Launch the script with:

```bash
python3 launcher_ensemble_launcher.py
```
- Monitor Progress: Check the `outputs` directory for logs and status updates.
The following are example JSON config files for various mpiexec commands used at ALCF.
Example 1: 2 nodes, 4 ranks/node, 1 thread/rank

```bash
mpiexec -n 8 -ppn 4 --depth 1 --cpu-bind=depth <app> <app_args>
```

```json
{
    "poll_interval": 1,
    "update_interval": null,
    "sys_info": {
        "name": "aurora",
        "ncores_per_nodes": 104,
        "ngpus_per_node": 12
    },
    "ensembles": {
        "example_ensemble": {
            "num_nodes": 2,
            "num_processes_per_node": 4,
            "launcher": "mpi",
            "launcher_options": {
                "cpu-bind": "depth",
                "depth": 1
            },
            "relation": "one-to-one",
            "cmd_template": "<app> <constant args> <variable args>",
            "<variable args>": [1, 2, ...],
            "run_dir": "./run_dir"
        }
    }
}
```
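As a rough mental model, each entry in `launcher_options` appears to map onto the corresponding `mpiexec` flag, while `num_nodes` and `num_processes_per_node` determine `-n` and `-ppn`. The sketch below illustrates that correspondence; `build_mpiexec_cmd` is a hypothetical helper, not part of the ensemble_launcher API.

```python
# Hypothetical sketch of how the config above corresponds to the mpiexec
# command line; the real ensemble_launcher internals may differ.
def build_mpiexec_cmd(ens, app_cmd):
    nranks = ens["num_nodes"] * ens["num_processes_per_node"]  # total ranks
    cmd = f"mpiexec -n {nranks} -ppn {ens['num_processes_per_node']}"
    for key, val in ens.get("launcher_options", {}).items():
        # cpu-bind uses the --flag=value form; the rest use --flag value
        cmd += f" --{key}={val}" if key == "cpu-bind" else f" --{key} {val}"
    return f"{cmd} {app_cmd}"

ens = {
    "num_nodes": 2,
    "num_processes_per_node": 4,
    "launcher_options": {"cpu-bind": "depth", "depth": 1},
}
print(build_mpiexec_cmd(ens, "<app> <app_args>"))
# mpiexec -n 8 -ppn 4 --cpu-bind=depth --depth 1 <app> <app_args>
```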
Example 2: 2 nodes, 2 ranks/node, 2 threads/rank

```bash
OMP_PLACES=threads OMP_NUM_THREADS=2 mpiexec -n 4 -ppn 2 --depth 2 --cpu-bind=depth <app> <app_args>
```

```json
{
    "poll_interval": 1,
    "update_interval": null,
    "sys_info": {
        "name": "aurora",
        "ncores_per_nodes": 104,
        "ngpus_per_node": 12
    },
    "ensembles": {
        "example_ensemble": {
            "num_nodes": 2,
            "num_processes_per_node": 2,
            "launcher": "mpi",
            "launcher_options": {
                "cpu-bind": "depth",
                "depth": 2
            },
            "relation": "one-to-one",
            "cmd_template": "<app> <constant args> <variable args>",
            "<variable args>": [1, 2, ...],
            "run_dir": "./run_dir",
            "env": {
                "OMP_PLACES": "threads",
                "OMP_NUM_THREADS": 2
            }
        }
    }
}
```
Example 3: 2 nodes, 2 ranks/node, 1 thread/rank, bound in a compact fashion

```bash
mpiexec -n 4 -ppn 2 --cpu-bind=list:0:104 <app> <app_args>
```

```json
{
    "poll_interval": 1,
    "update_interval": null,
    "sys_info": {
        "name": "aurora",
        "ncores_per_nodes": 104,
        "ngpus_per_node": 12
    },
    "ensembles": {
        "example_ensemble": {
            "num_nodes": 2,
            "num_processes_per_node": 2,
            "launcher": "mpi",
            "launcher_options": {
                "cpu-bind": "list:0:104"
            },
            "relation": "one-to-one",
            "cmd_template": "<app> <constant args> <variable args>",
            "<variable args>": [1, 2, ...],
            "run_dir": "./run_dir"
        }
    }
}
```
Example 4: 1 node, 12 ranks/node

```bash
mpiexec -n 12 -ppn 12 --cpu-bind=list:0-7:8-15:16-23:24-31:32-39:40-47:52-59:60-67:68-75:76-83:84-91:92-99 <app> <app_args>
```

```json
{
    "poll_interval": 1,
    "update_interval": null,
    "sys_info": {
        "name": "aurora",
        "ncores_per_nodes": 104,
        "ngpus_per_node": 12
    },
    "ensembles": {
        "example_ensemble": {
            "num_nodes": 1,
            "num_processes_per_node": 12,
            "launcher": "mpi",
            "launcher_options": {
                "cpu-bind": "list:0-7:8-15:16-23:24-31:32-39:40-47:52-59:60-67:68-75:76-83:84-91:92-99"
            },
            "relation": "one-to-one",
            "cmd_template": "<app> <constant args> <variable args>",
            "<variable args>": [1, 2, ...],
            "run_dir": "./run_dir"
        }
    }
}
```
Example 5: 1 node, 12 ranks/node, 1 thread/rank, 1 rank/GPU tile

```bash
mpiexec -n 12 -ppn 12 --cpu-bind=list:0-7:8-15:16-23:24-31:32-39:40-47:52-59:60-67:68-75:76-83:84-91:92-99 gpu_tile_compact.sh <app> <app_args>
```

```json
{
    "poll_interval": 1,
    "update_interval": null,
    "sys_info": {
        "name": "aurora",
        "ncores_per_nodes": 104,
        "ngpus_per_node": 12
    },
    "ensembles": {
        "example_ensemble": {
            "num_nodes": 1,
            "num_processes_per_node": 12,
            "num_gpus_per_process": 1,
            "launcher": "mpi",
            "launcher_options": {
                "cpu-bind": "list:0-7:8-15:16-23:24-31:32-39:40-47:52-59:60-67:68-75:76-83:84-91:92-99"
            },
            "relation": "one-to-one",
            "cmd_template": "<app> <constant args> <variable args>",
            "<variable args>": [1, 2, ...],
            "run_dir": "./run_dir",
            "env": {
                "ZE_FLAT_DEVICE_HIERARCHY": "COMPOSITE"
            }
        }
    }
}
```
Example 6: 1 node, 6 ranks/node, 1 thread/rank, 1 rank/GPU device

```bash
mpiexec -n 6 -ppn 6 --cpu-bind=list:0-7:8-15:16-23:24-31:32-39:40-47:52-59:60-67:68-75:76-83:84-91:92-99 gpu_dev_compact.sh <app> <app_args>
```

```json
{
    "poll_interval": 1,
    "update_interval": null,
    "sys_info": {
        "name": "aurora",
        "ncores_per_nodes": 104,
        "ngpus_per_node": 12
    },
    "ensembles": {
        "example_ensemble": {
            "num_nodes": 1,
            "num_processes_per_node": 6,
            "num_gpus_per_process": 2,
            "launcher": "mpi",
            "launcher_options": {
                "cpu-bind": "list:0-7:8-15:16-23:24-31:32-39:40-47:52-59:60-67:68-75:76-83:84-91:92-99"
            },
            "relation": "one-to-one",
            "cmd_template": "<app> <constant args> <variable args>",
            "<variable args>": [1, 2, ...],
            "run_dir": "./run_dir",
            "env": {
                "ZE_FLAT_DEVICE_HIERARCHY": "COMPOSITE"
            }
        }
    }
}
```
Example 7: 1 node, 12 ranks/node, 1 thread/rank, and any other MPI options

```bash
mpiexec -n 12 -ppn 12 --cpu-bind=list:0-7:8-15:16-23:24-31:32-39:40-47:52-59:60-67:68-75:76-83:84-91:92-99 <other mpi options> <app> <app_args>
```

```json
{
    "poll_interval": 1,
    "update_interval": null,
    "sys_info": {
        "name": "aurora",
        "ncores_per_nodes": 104,
        "ngpus_per_node": 12
    },
    "ensembles": {
        "example_ensemble": {
            "num_nodes": 1,
            "num_processes_per_node": 12,
            "launcher": "mpi",
            "launcher_options": {
                "cpu-bind": "list:0-7:8-15:16-23:24-31:32-39:40-47:52-59:60-67:68-75:76-83:84-91:92-99"
            },
            "relation": "one-to-one",
            "cmd_template": "<any other mpi options> <app> <constant args> <variable args>",
            "<variable args>": [1, 2, ...],
            "run_dir": "./run_dir"
        }
    }
}
```
Examples of other general ensembles
Example 1: An ensemble with N tasks, each using 1 node and 12 ranks/node, with identical args (a generation sketch follows the config)

```json
{
    "poll_interval": 1,
    "update_interval": null,
    "sys_info": {
        "name": "aurora",
        "ncores_per_nodes": 104,
        "ngpus_per_node": 12
    },
    "ensembles": {
        "example_ensemble": {
            "num_nodes": [1, 1, 1, ..., 1],
            "num_processes_per_node": 12,
            "launcher": "mpi",
            "relation": "one-to-one",
            "cmd_template": "<any other mpi options> <app> <constant args>",
            "run_dir": "./run_dir"
        }
    }
}
```
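Rather than writing out N identical entries by hand, such a config can be generated programmatically, following the same pattern as the docstring in the launcher script above. The key names below are taken from the examples in this README; N = 100 is an arbitrary illustrative choice.

```python
import json

N = 100  # number of tasks; arbitrary value for illustration
config = {
    "poll_interval": 1,
    "update_interval": None,  # serialized as null
    "sys_info": {"name": "aurora", "ncores_per_nodes": 104, "ngpus_per_node": 12},
    "ensembles": {
        "example_ensemble": {
            "num_nodes": [1] * N,  # one entry per task
            "num_processes_per_node": 12,
            "launcher": "mpi",
            "relation": "one-to-one",
            "cmd_template": "<any other mpi options> <app> <constant args>",
            "run_dir": "./run_dir",
        }
    },
}
with open("config.json", "w") as f:
    json.dump(config, f, indent=4)
```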
Example 2: A scaling test using 1-128 nodes and 104 ranks/node (a generation sketch follows the config)

```json
{
    "poll_interval": 1,
    "update_interval": null,
    "sys_info": {
        "name": "aurora",
        "ncores_per_nodes": 104,
        "ngpus_per_node": 12
    },
    "ensembles": {
        "example_ensemble": {
            "num_nodes": [1, 2, 4, 8, 16, 32, 64, 128],
            "num_processes_per_node": 104,
            "launcher": "mpi",
            "relation": "one-to-one",
            "cmd_template": "<app> <constant args> <variable args>",
            "<variable args>": [1, ...],
            "run_dir": ["./run_dir_1", "./run_dir_2", "./run_dir_4", "./run_dir_8", "./run_dir_16", "./run_dir_32", "./run_dir_64", "./run_dir_128"]
        }
    }
}
```
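The task-varying fields for such a sweep can likewise be generated in a few lines (a sketch; plug the resulting lists into a config dict as in the previous example):

```python
# Generate the per-task node counts and run directories for the scaling sweep.
node_counts = [2**i for i in range(8)]  # [1, 2, 4, 8, 16, 32, 64, 128]
run_dirs = [f"./run_dir_{n}" for n in node_counts]
print(node_counts)
print(run_dirs)
```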
Contributions are welcome! Please fork the repository, make your changes, and submit a pull request.
If you encounter any issues, feel free to open an issue on the GitHub repository or contact the maintainers.