Description:
When running `BaseParallelProcessor`-based processors on an HPC cluster inside an interactive SLURM job, execution does not proceed past Dask cluster initialization: the scheduler and workers start successfully, but no processing happens afterward. This affects every processor that inherits from `BaseParallelProcessor` and uses Dask as the parallel backend.
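To narrow the problem down, a minimal smoke test (illustrative, not part of SDP) can check whether a Dask cluster created inside the SLURM allocation is able to execute work at all:

```python
# Minimal smoke test (illustrative, not SDP code): check whether a Dask
# cluster created inside the SLURM allocation can actually execute work.
from dask.distributed import Client, LocalCluster

# processes=False keeps workers in-process, sidestepping nanny/port issues
# while still exercising the scheduler.
cluster = LocalCluster(n_workers=2, threads_per_worker=1, processes=False)
client = Client(cluster)

# Submit a trivial task: if even this hangs, the problem is in the Dask
# setup itself rather than in any specific SDP processor.
result = client.submit(lambda x: x + 1, 41).result(timeout=60)
print(result)  # expected: 42

client.close()
cluster.close()
```

If this trivial task completes but SDP still hangs, the issue is likely in how SDP drives the cluster rather than in the cluster itself.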
Issue reproduction:
SLURM interactive job command:
```shell
srun -A convai_convaird_nemo-speech \
  --job-name yodas2:shell \
  --partition=interactive \
  --time=4:00:00 \
  --container-image=nvcr.io/nvstaging/nemo/nemo_data_processing:yt \
  --container-mounts=/home:/home \
  --gpus=8 \
  --nodes=1 \
  --ntasks-per-node=8 \
  --exclusive \
  --mem=128G --pty /bin/bash -l
```
YAML configuration:
```yaml
processors_to_run: "0"
processors:
  - _target_: sdp.processors.PreserveByValue
    input_manifest_file: /home/manifest_01.json
    output_manifest_file: /home/manifest_02.json
    input_value_key: src_lang
    target_value: en
```
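For context, my understanding of what this step should do (sketched here in plain Python, outside of SDP) is a simple filter over manifest entries, keeping only those whose `src_lang` equals `en`:

```python
import json

# Plain-Python sketch (not SDP code) of the filtering this PreserveByValue
# configuration is expected to perform on a manifest.
def preserve_by_value(entries, input_value_key, target_value):
    """Keep only entries whose value under input_value_key matches target_value."""
    return [e for e in entries if e.get(input_value_key) == target_value]

entries = [
    {"audio_filepath": "a.wav", "src_lang": "en"},
    {"audio_filepath": "b.wav", "src_lang": "de"},
]
kept = preserve_by_value(entries, "src_lang", "en")
print(json.dumps(kept))
```

This is a lightweight operation, so the hang is unlikely to be caused by the processor's own workload.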
Launch command:
```shell
export HYDRA_FULL_ERROR=1
python ./NeMo-speech-data-processor/main.py \
  --config-path=./NeMo-speech-data-processor/dataset_configs/multilingual/yodas2/ \
  --config-name=config.yaml
```
Log output:
```text
[2025-04-16 09:37:14 Rank 0] INFO: Hydra config:
processors_to_run: '0'
processors:
- _target_: sdp.processors.PreserveByValue
  input_manifest_file: /home/manifest_01.json
  output_manifest_file: /home/manifest_02.json
  input_value_key: src_lang
  target_value: en
[2025-04-16 09:37:14 Rank 0] INFO: Specified to run the following processors: ['sdp.processors.PreserveByValue']
[SDP I 2025-04-16 09:37:14 run_processors:157] Specified to run the following processors: ['sdp.processors.PreserveByValue']
[2025-04-16 09:37:14 Rank 0] INFO: => Building processor "sdp.processors.PreserveByValue"
[SDP I 2025-04-16 09:37:14 run_processors:179] => Building processor "sdp.processors.PreserveByValue"
[2025-04-16 09:37:14 Rank 0] INFO: => Running processor "<sdp.processors.modify_manifest.data_to_dropbool.PreserveByValue object at 0x155412dee9e0>"
[SDP I 2025-04-16 09:37:14 run_processors:204] => Running processor "<sdp.processors.modify_manifest.data_to_dropbool.PreserveByValue object at 0x155412dee9e0>"
[2025-04-16 09:37:14 Rank 0] INFO: Resources: 255 workers, each with memory limit 8094MB
[SDP I 2025-04-16 09:37:14 base_processor:170] Resources: 255 workers, each with memory limit 8094MB
INFO: To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
/usr/local/lib/python3.10/dist-packages/distributed/node.py:182: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 19923 instead
  warnings.warn(
INFO: State start
INFO: Scheduler at: tcp://127.0.0.1:9773
INFO: dashboard at: http://127.0.0.1:19923/status
INFO: Registering Worker plugin shuffle
INFO: Start Nanny at: 'tcp://127.0.0.1:9035'
INFO: Start Nanny at: 'tcp://127.0.0.1:19723'
INFO: Start Nanny at: 'tcp://127.0.0.1:13259'
... (48 more "Start Nanny" lines on other local ports omitted; 52 total) ...
INFO: Start Nanny at: 'tcp://127.0.0.1:32429'
```

Nothing is logged after this point; the job sits idle with no further processing.
NeMo SDP version: latest main
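The resource line above (255 workers with an 8094MB memory limit each, on an allocation of 8 tasks and 128G) suggests the worker count is derived from the node's full CPU count rather than from the SLURM allocation. A possible workaround sketch, sizing the cluster from standard SLURM environment variables (the defaults below mirror this job and are illustrative, not SDP API):

```python
import os
from dask.distributed import Client, LocalCluster

# Hypothetical workaround sketch (not SDP API): size the Dask cluster from
# the SLURM allocation instead of the full node, so an interactive job with
# --ntasks-per-node=8 and --mem=128G does not spawn 255 workers.
n_workers = int(os.environ.get("SLURM_NTASKS", "8"))
mem_mb = int(os.environ.get("SLURM_MEM_PER_NODE", "131072"))  # SLURM reports MB

cluster = LocalCluster(
    n_workers=n_workers,
    threads_per_worker=1,
    memory_limit=f"{mem_mb // n_workers}MB",
)
client = Client(cluster)
n_started = len(cluster.workers)
print(n_started)

client.close()
cluster.close()
```

Whether SDP can be pointed at an externally created cluster like this (instead of building its own) is exactly what I would like to clarify.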