Commit 8614909

tau is now built by the Chimbuko bundle package.
Added to run documentation.

1 parent d29de77, commit 8614909

3 files changed: +49 additions, -23 deletions
spack/repo/chimbuko/packages/chimbuko/package.py

Lines changed: 1 addition & 0 deletions

@@ -34,4 +34,5 @@ class Chimbuko(BundlePackage):

     depends_on('chimbuko-performance-analysis')
     depends_on('chimbuko-visualization2')
+    depends_on('tau')


sphinx/source/install_usage/install.rst

Lines changed: 3 additions & 1 deletion

@@ -37,14 +37,16 @@ Once installed, the unit and integration tests can be run as:
     cd $(spack location -i chimbuko-performance-analysis)/test
     ./run_all.sh

+.. _a_note_on_libfabric_providers:
+
 A note on libfabric providers
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 The Mercury library used for the provenance database requires a libfabric provider that supports the **FI_EP_RDM** endpoint. By default spack installs libfabric with the **sockets**, **tcp** and **udp** providers, of which only **sockets** supports this endpoint. However **sockets** is being deprecated as its performance is not as good as other dedicated providers. We recommend installing the **rxm** utility provider alongside **tcp** for most purposes, by appending the spack spec with :code:`^libfabric fabrics=sockets,tcp,rxm`.

 For network hardware supporting the Linux Verbs API (such as Infiniband) the **verbs** provider (with **rxm**) may provide better performance. This can be added to the spec as, for example, :code:`^libfabric fabrics=sockets,tcp,rxm,verbs`.

-Details of how to choose the libfabrics provider used by Mercury can be found `here <>`_. For further information consider the `Mercury documentation <https://mercury-hpc.github.io/documentation/#network-abstraction-layer>`_ .
+Details of how to choose the libfabric provider used by Mercury can be found :ref:`here <online_analysis>`. For further information, consult the `Mercury documentation <https://mercury-hpc.github.io/documentation/#network-abstraction-layer>`_.

 Integrating with system-installed MPI
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
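The provider specs recommended above can be appended to the install spec in a single Spack command. A minimal sketch, assuming the Chimbuko bundle package is the spec being installed (the package name and exact spec are illustrative; adapt them to your site and hardware):

```bash
# Illustrative only: append the libfabric provider list to the install spec.
# rxm alongside tcp, recommended for most purposes:
spack install chimbuko ^libfabric fabrics=sockets,tcp,rxm

# With the verbs provider for Infiniband-class hardware:
spack install chimbuko ^libfabric fabrics=sockets,tcp,rxm,verbs
```

These commands are environment-dependent and cannot be verified here; check the concretized spec with :code:`spack spec` before installing.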

sphinx/source/install_usage/run_chimbuko.rst

Lines changed: 45 additions & 22 deletions
@@ -2,77 +2,98 @@
 Running an application under Chimbuko
 *************************************

+.. _online_analysis:
+
 Online Analysis
 ~~~~~~~~~~~~~~~

 In this section we detail how to run Chimbuko both offline (Chimbuko analysis performed after application has completed) or online (Chimbuko analysis performed in real-time). In the first sections below we will describe how to run Chimbuko assuming it has been installed directly onto the system and is available on all nodes. The same procedure can be used to run Chimbuko inside a (single) Docker container, which can be convenient for testing or for offline analysis. For online analysis it is often more convenient to run Chimbuko through Singularity containers, and we will describe how to do so in a separate section below.

 Chimbuko is designed primarily for online analysis. In this mode an instance of the AD module is spawned for each instance of the application (e.g. for each MPI rank); a centralized parameter server aggregates global statistics and distributes information to the visualization module; and a centralized provenance database maintains detailed information regarding each anomaly.

-Note that the provenance database component of Chimbuko requires the Mochi/Sonata shared libraries which are typically installed using Spack. The user must ensure that the **mochi-sonata** Spack library is loaded. In addition the visualization module requires **py-mochi-sonata**, the Python interface to Sonata.
+It is expected that the user create a run script that performs three basic functions:
+
+- Launch the Chimbuko services (pserver, provDB, visualization).
+- Launch the Anomaly Detection (AD) component.
+- Launch the application.
+
+Because launching the AD and application requires explicit invocations of :code:`mpirun` or the system equivalent (e.g. :code:`jsrun` on Summit) tailored to the specific setup desired by the user, we are unfortunately unable to automate this entire process. In this section we will walk through the creation of a typical run script.
+
+--------------------------
+
+The first step is to load Chimbuko. If installed via Spack this can be accomplished simply by:

 .. code:: bash

-    spack load mochi-sonata py-mochi-sonata
+    spack load chimbuko

 (It may be necessary to source the **setup_env.sh** script to setup spack first, which by default installs in the **spack/share/spack/** subdirectory of Spack's install location.)

 ---------------------------

-A configuration file (like **chimbuko_config.sh**) is used to set various configuration variables that are used for launching Chimbuko services and used by the Anomaly detection algorithm in Chimbuko.
-A full list of variables along with their description is provided in the `Appendix Section <../appendix/appendix_usage.html#chimbuko-config>`_.
-Some of the important variable are
+A configuration file is used to set the variables needed for launching the Chimbuko services and by the anomaly detection algorithm. A template is provided in the Chimbuko install directory :code:`$(spack location -i chimbuko-performance-analysis)/scripts/launch/chimbuko_config.sh`, which the user should copy to their working directory and modify as necessary.
+
+A number of variables in **chimbuko_config.sh** are marked :code:`<------------ ***SET ME***` and must be set appropriately:

-- **service_node_iface** : It is the network interface upon which communication to the service node is performed.
-- **provdb_domain** : It is used by verbs provider.
+- **viz_root** : The root path of the Chimbuko visualization installation. For Spack users this can be set to :code:`$(spack location -i chimbuko-visualization2)` (the default).
+- **export C_FORCE_ROOT=1** : This variable is necessary when using Chimbuko in a Docker image; otherwise set it to 0.
+- **service_node_iface** : The network interface upon which communication between the AD and the parameter server is performed. There are `multiple ways <https://www.cyberciti.biz/faq/linux-list-network-interfaces-names-command/>`_ to list the available interfaces, for example using :code:`ip link show`. On Summit this command must be executed on a job node and not the login node; the interface :code:`ib0` should remain applicable for this machine.
+- **provdb_engine** : The libfabric provider used for the provenance database. Available fabrics can be found using :code:`fi_info` (on Summit this must be executed on a job node, not the login node). For more details, cf. :ref:`here <a_note_on_libfabric_providers>`.
+- **provdb_domain** : The network domain, used by the verbs provider. It can be found by running :code:`fi_info` and looking for a :code:`verbs;ofi_rxm` provider that has the :code:`FI_EP_RDM` type. On Summit this must be performed on a compute node, but the default, :code:`mlx5_0`, should remain applicable for this machine.
 - **TAU_EXEC** : This specifies how to execute tau_exec.
 - **TAU_PYTHON** : This specifies how to execute tau_python.
+- **TAU_MAKEFILE** : The Tau Makefile. For Spack users this variable is set by Spack when loading Tau, and this line can be commented out.
 - **export EXE_NAME=<name>** : This specifies the name of the executable (without full path). Replace **<name>** with an actual name of the application executable.

-Next, export the config script as follows:
+A full list of variables along with their descriptions is provided in the `Appendix Section <../appendix/appendix_usage.html#chimbuko-config>`_, and more guidance is also provided in the template script.
+
+Next, in the run script, export the config script as follows:

 .. code:: bash

     export CHIMBUKO_CONFIG=chimbuko_config.sh

---------------------------
3957

40-
The first step is to generate **explicit resource files** (ERF) for the head node, AD, and main programs in Chimbuko. It generates three ERF files (main.erf, ad.erf, services.erf) which are later used as input ERF files to instantiate Chimbuko services using **jsrun** command.
58+
In order to avoid having the Chimbuko services interfere with the running of the application, we typically run the services on a dedicated node. It is also necessary to place the ranks of the AD module on the same node as the corresponding rank of the application to avoid having to pipe the traces over the network. Unfortunately the means of setting this up will vary from system to system depending on the job scheduler (e.g. using a **hostfile** with :code:`mpirun`).
59+
60+
In this section we will concentrate on the Summit supercomputer, where the process is made more difficult by the restrictions on sharing resources between resource sets which forces us to dedicate cores to the AD instances. To achieve the job placement the first step is to generate **explicit resource files** (ERF) for the head node, AD, and main programs. For convenience we provide a script `here <https://github.com/CODARcode/PerformanceAnalysis/blob/ckelly_develop/scripts/summit/gen_erf_summit.sh>`_ to generate the ERF files. It generates three ERF files (main.erf, ad.erf, services.erf) which are later used as input ERF files to instantiate Chimbuko services using **jsrun** command.
4161
This can be achieved by running the following script
4262

4363
.. code:: bash
4464
45-
./gen_erf.pl ${n_nodes_total} ${n_mpi_ranks_per_node} 1 0 ${ncores_per_host_ad}
65+
./gen_erf.pl ${n_nodes_total} ${n_mpi_ranks_per_node} ${n_cores_per_rank_main} ${n_gpus_per_rank_main} ${ncores_per_host_ad}
66+
67+
where
4668

47-
where **${n_nodes_total}** is the Total number of nodes used. This number includes the one node that is dedicated to run the services. **${n_mpi_ranks_per_node}** is the number of MPI ranks that will run on each node. **${ncores_per_host_ad}** is the Number of cores used on each node.
69+
- **${n_nodes_total}** is the total number of nodes used, including the one node that is dedicated to run the services.
70+
- **${n_mpi_ranks_per_node}** is the number of MPI ranks of the application (and AD) that will run on each node (must be a multiple of 2).
71+
- **${n_cores_per_rank_main}** and **${n_gpus_per_rank_main}** specify the number of cores and GPUs, respectively, given to each rank of the application.
72+
- **${ncores_per_host_ad}** is the number of cores dedicated to the Chimbuko AD modules (must be a multiple of 2), with the application running on the remaining cores. Note that the total number of cores allocated per node must not exceed 42.
4873

4974
More details on ERF can be found `here <https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=SSWRJV_10.1.0/jsm/jsrun.html>`_.
5075

51-
For convenience we provide a script for generating the ERF files that should suffice for most normal MPI jobs:
52-
53-
.. code:: bash
54-
55-
${AD_SOURCE_DIR}/scripts/summit/gen_erf_summit.sh
56-
5776
----------------------------
5877

59-
In the next step, provenance database, visualization module and parameter server are launched as Chimbuko Services. This is achieved by running the **run_services.sh** script using **jsrun** command as following:
78+
In the next step the Chimbuko services are launched by running the **run_services.sh** script using **jsrun** command as following:
6079

6180
.. code:: bash
6281
63-
jsrun --erf_input=services.erf ${SERVICES} &
82+
jsrun --erf_input=services.erf ${chimbuko_services} &
6483
while [ ! -f chimbuko/vars/chimbuko_ad_cmdline.var ]; do sleep 1; done
6584
66-
**--erf_input** specifies the ERF file to use. In this case **services.erf** file is used. This files was generated in the previous step. **${SERVICES}** is the path to a script which specifies commands to launch the provenance database, the visualization module, and the parameter server. Description of commands used in script in ${SERVICES} is provided in `Appendix <../appendix/appendix_usage.html#launch-services>`_.
85+
Here **--erf_input=services.erf** launches the services using the the services ERF generated in the previous step. **${chimbuko_services}** is the path to a script which specifies commands to launch the provenance database, the visualization module, and the parameter server. This variable is set by the configuration script. A description of commands used in the services script is provided in `Appendix <../appendix/appendix_usage.html#launch-services>`_.
6786

68-
The while loop after the **jsrun** command is used to wait until it generates a command file (**chimbuko/vars/chimbuko_ad_cmdline.var**) to launch the Chimbuko's anomaly detection driver program as a next step, as following:
87+
The while loop after the **jsrun** command is used to wait until the services have progressed to a stage at which connection is possible. At this point the script generates a command file (**chimbuko/vars/chimbuko_ad_cmdline.var**) which provides the to launch the Chimbuko's anomaly detection driver program, assuming a basic (single component) workflow. This can be invoked as follows:
6988

7089
.. code:: bash
7190
7291
ad_cmd=$(cat chimbuko/vars/chimbuko_ad_cmdline.var)
7392
eval "jsrun --erf_input=ad.erf -e prepended ${ad_cmd} &"
7493
75-
Here the **ad.erf** file is used as **--erf_input** for the **jsrun** command. The generated command in previous step is used here to launch the AD driver.
94+
Here the **ad.erf** file is used as **--erf_input** for the **jsrun** command.
95+
96+
For more complicated workflows the AD will need to be invoked differently. To aid the user we write a second file, **chimbuko/vars/chimbuko_ad_opts.var**, which contains just the initial command line options for the AD. Examples of various setups can be found among the :ref:`benchmark applications <benchmark_suite>`.
7697

7798
-----------------------------
7899

@@ -90,6 +111,8 @@ Chimbuko can be run to perform offline analysis of the application by changing c
90111

91112
------------------------------
92113

114+
.. _benchmark_suite:
115+
93116
Examples
94117
~~~~~~~~
95118
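Putting the Summit steps above together, a complete run script might be sketched as follows. This is an outline under stated assumptions, not a verified script: the ERF arguments are example values, and the final application launch line (using **main.erf** with :code:`${TAU_EXEC}`) is an assumption about how the instrumented application would typically be started.

```bash
#!/bin/bash
# Sketch of a Summit run script for online Chimbuko analysis (illustrative).
spack load chimbuko
export CHIMBUKO_CONFIG=chimbuko_config.sh

# Generate services.erf, ad.erf and main.erf; example: 3 nodes total,
# 4 ranks/node, 1 core and 0 GPUs per application rank, 4 AD cores/node.
./gen_erf.pl 3 4 1 0 4

# Launch the Chimbuko services on the dedicated node and wait until
# the AD command file appears, indicating connection is possible.
jsrun --erf_input=services.erf ${chimbuko_services} &
while [ ! -f chimbuko/vars/chimbuko_ad_cmdline.var ]; do sleep 1; done

# Launch the AD driver using the command written by the services.
ad_cmd=$(cat chimbuko/vars/chimbuko_ad_cmdline.var)
eval "jsrun --erf_input=ad.erf -e prepended ${ad_cmd} &"

# Launch the instrumented application (placeholder; adapt to your job).
jsrun --erf_input=main.erf ${TAU_EXEC} ./${EXE_NAME} &
wait
```

The script assumes :code:`${chimbuko_services}`, :code:`${TAU_EXEC}` and :code:`${EXE_NAME}` are provided by the exported **chimbuko_config.sh**, as described above.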
