
BaseParallelProcessor-based processors fail to execute correctly in cluster environments using Dask #107

@ssh-meister

Description:
When BaseParallelProcessor-based processors are run in an interactive SLURM job on an HPC cluster, execution does not proceed beyond Dask cluster initialization. The scheduler and workers start successfully, but no processing appears to happen afterward.

This affects all processors inheriting from BaseParallelProcessor and using Dask as the parallel backend.

Issue reproduction:

SLURM interactive job command:

srun -A convai_convaird_nemo-speech \
     --job-name yodas2:shell \
     --partition=interactive \
     --time=4:00:00 \
     --container-image=nvcr.io/nvstaging/nemo/nemo_data_processing:yt \
     --container-mounts=/home:/home \
     --gpus=8 \
     --nodes=1 \
     --ntasks-per-node=8 \
     --exclusive \
     --mem=128G --pty /bin/bash -l
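
Inside an allocation like this one, the CPU count the machine reports and the CPU set the shell is actually allowed to use can differ. The following is a minimal diagnostic sketch, not SDP code: `usable_worker_count` and its `reserve` parameter are hypothetical names, and the issue does not show how SDP derives its worker count. It compares `os.cpu_count()`, the process CPU affinity (Linux only), and the `SLURM_CPUS_ON_NODE` variable that SLURM exports inside a job, and returns the smallest of them minus a reserved core.

```python
import os


def usable_worker_count(env=None, reserve=1):
    """Estimate how many worker processes the current job can actually use.

    Hypothetical helper for diagnosis only; not part of SDP or Dask.
    """
    env = os.environ if env is None else env

    # Total CPUs on the machine, regardless of any scheduler limits.
    candidates = [os.cpu_count() or 1]

    # CPUs this process is allowed to run on (available on Linux only).
    if hasattr(os, "sched_getaffinity"):
        candidates.append(len(os.sched_getaffinity(0)))

    # Per-node CPU allocation exported by SLURM inside a job, if present.
    slurm_cpus = env.get("SLURM_CPUS_ON_NODE", "")
    if slurm_cpus.isdigit():
        candidates.append(int(slurm_cpus))

    # Keep `reserve` cores free, but never return fewer than one worker.
    return max(1, min(candidates) - reserve)


if __name__ == "__main__":
    print("os.cpu_count():", os.cpu_count())
    print("usable workers:", usable_worker_count())
```

If the value printed here is much smaller than the worker count reported in the processor log, the mismatch between requested workers and available cores is worth ruling out as a contributing factor.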

YAML configuration:

processors_to_run: "0"
processors:
  - _target_: sdp.processors.PreserveByValue
    input_manifest_file: /home/manifest_01.json
    output_manifest_file: /home/manifest_02.json
    input_value_key: src_lang
    target_value: en

Launch command:

export HYDRA_FULL_ERROR=1
python ./NeMo-speech-data-processor/main.py \
  --config-path=./NeMo-speech-data-processor/dataset_configs/multilingual/yodas2/ \
  --config-name=config.yaml

Log output:

[2025-04-16 09:37:14 Rank 0] INFO: Hydra config:
processors_to_run: '0'
processors:
  - _target_: sdp.processors.PreserveByValue
    input_manifest_file: /home/manifest_01.json
    output_manifest_file: /home/manifest_02.json
    input_value_key: src_lang
    target_value: en

[2025-04-16 09:37:14 Rank 0] INFO: Specified to run the following processors: ['sdp.processors.PreserveByValue'] 
[SDP I 2025-04-16 09:37:14 run_processors:157] Specified to run the following processors: ['sdp.processors.PreserveByValue'] 
[2025-04-16 09:37:14 Rank 0] INFO: => Building processor "sdp.processors.PreserveByValue"
[SDP I 2025-04-16 09:37:14 run_processors:179] => Building processor "sdp.processors.PreserveByValue"
[2025-04-16 09:37:14 Rank 0] INFO: => Running processor "<sdp.processors.modify_manifest.data_to_dropbool.PreserveByValue object at 0x155412dee9e0>"
[SDP I 2025-04-16 09:37:14 run_processors:204] => Running processor "<sdp.processors.modify_manifest.data_to_dropbool.PreserveByValue object at 0x155412dee9e0>"
[2025-04-16 09:37:14 Rank 0] INFO: Resources: 255 workers, each with memory limit 8094MB
[SDP I 2025-04-16 09:37:14 base_processor:170] Resources: 255 workers, each with memory limit 8094MB
INFO: To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
/usr/local/lib/python3.10/dist-packages/distributed/node.py:182: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 19923 instead
  warnings.warn(
INFO: State start
INFO:   Scheduler at:      tcp://127.0.0.1:9773
INFO:   dashboard at:  http://127.0.0.1:19923/status
INFO: Registering Worker plugin shuffle
INFO:         Start Nanny at: 'tcp://127.0.0.1:9035'
INFO:         Start Nanny at: 'tcp://127.0.0.1:19723'
INFO:         Start Nanny at: 'tcp://127.0.0.1:13259'
INFO:         Start Nanny at: 'tcp://127.0.0.1:27865'
INFO:         Start Nanny at: 'tcp://127.0.0.1:10199'
INFO:         Start Nanny at: 'tcp://127.0.0.1:21089'
INFO:         Start Nanny at: 'tcp://127.0.0.1:27073'
INFO:         Start Nanny at: 'tcp://127.0.0.1:27495'
INFO:         Start Nanny at: 'tcp://127.0.0.1:29023'
INFO:         Start Nanny at: 'tcp://127.0.0.1:25693'
INFO:         Start Nanny at: 'tcp://127.0.0.1:19975'
INFO:         Start Nanny at: 'tcp://127.0.0.1:28465'
INFO:         Start Nanny at: 'tcp://127.0.0.1:21615'
INFO:         Start Nanny at: 'tcp://127.0.0.1:27419'
INFO:         Start Nanny at: 'tcp://127.0.0.1:34145'
INFO:         Start Nanny at: 'tcp://127.0.0.1:24257'
INFO:         Start Nanny at: 'tcp://127.0.0.1:18757'
INFO:         Start Nanny at: 'tcp://127.0.0.1:29621'
INFO:         Start Nanny at: 'tcp://127.0.0.1:18349'
INFO:         Start Nanny at: 'tcp://127.0.0.1:25149'
INFO:         Start Nanny at: 'tcp://127.0.0.1:11425'
INFO:         Start Nanny at: 'tcp://127.0.0.1:27987'
INFO:         Start Nanny at: 'tcp://127.0.0.1:36105'
INFO:         Start Nanny at: 'tcp://127.0.0.1:27683'
INFO:         Start Nanny at: 'tcp://127.0.0.1:17747'
INFO:         Start Nanny at: 'tcp://127.0.0.1:26467'
INFO:         Start Nanny at: 'tcp://127.0.0.1:32865'
INFO:         Start Nanny at: 'tcp://127.0.0.1:25205'
INFO:         Start Nanny at: 'tcp://127.0.0.1:15689'
INFO:         Start Nanny at: 'tcp://127.0.0.1:31689'
INFO:         Start Nanny at: 'tcp://127.0.0.1:9823'
INFO:         Start Nanny at: 'tcp://127.0.0.1:22309'
INFO:         Start Nanny at: 'tcp://127.0.0.1:11867'
INFO:         Start Nanny at: 'tcp://127.0.0.1:26097'
INFO:         Start Nanny at: 'tcp://127.0.0.1:10511'
INFO:         Start Nanny at: 'tcp://127.0.0.1:31955'
INFO:         Start Nanny at: 'tcp://127.0.0.1:10215'
INFO:         Start Nanny at: 'tcp://127.0.0.1:32033'
INFO:         Start Nanny at: 'tcp://127.0.0.1:16921'
INFO:         Start Nanny at: 'tcp://127.0.0.1:32433'
INFO:         Start Nanny at: 'tcp://127.0.0.1:13653'
INFO:         Start Nanny at: 'tcp://127.0.0.1:29977'
INFO:         Start Nanny at: 'tcp://127.0.0.1:33981'
INFO:         Start Nanny at: 'tcp://127.0.0.1:33403'
INFO:         Start Nanny at: 'tcp://127.0.0.1:25933'
INFO:         Start Nanny at: 'tcp://127.0.0.1:32885'
INFO:         Start Nanny at: 'tcp://127.0.0.1:31599'
INFO:         Start Nanny at: 'tcp://127.0.0.1:27693'
INFO:         Start Nanny at: 'tcp://127.0.0.1:36293'
INFO:         Start Nanny at: 'tcp://127.0.0.1:22373'
INFO:         Start Nanny at: 'tcp://127.0.0.1:34851'
INFO:         Start Nanny at: 'tcp://127.0.0.1:32429'
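
The log above requests 255 workers, but the excerpt lists far fewer "Start Nanny" lines before output stops. A small script can quantify that gap when comparing runs. This is a sketch that assumes the log format shown in this issue; `summarize_dask_startup` is a hypothetical helper, and the sample text is a shortened copy of the lines above.

```python
import re


def summarize_dask_startup(log_text):
    """Compare the requested worker count with the nannies that were started.

    Hypothetical helper; assumes the log line formats shown in this issue.
    """
    # The "Resources" line may be printed by more than one logger; use the first.
    match = re.search(r"Resources: (\d+) workers", log_text)
    requested = int(match.group(1)) if match else None

    # Collect unique nanny addresses so duplicated log lines are not double counted.
    nannies = set(re.findall(r"Start Nanny at: '(tcp://[^']+)'", log_text))

    missing = None if requested is None else requested - len(nannies)
    return {
        "requested_workers": requested,
        "nannies_started": len(nannies),
        "missing": missing,
    }


SAMPLE_LOG = """\
[2025-04-16 09:37:14 Rank 0] INFO: Resources: 255 workers, each with memory limit 8094MB
[SDP I 2025-04-16 09:37:14 base_processor:170] Resources: 255 workers, each with memory limit 8094MB
INFO:         Start Nanny at: 'tcp://127.0.0.1:9035'
INFO:         Start Nanny at: 'tcp://127.0.0.1:19723'
INFO:         Start Nanny at: 'tcp://127.0.0.1:13259'
"""

if __name__ == "__main__":
    print(summarize_dask_startup(SAMPLE_LOG))
```

Running it against the full captured log, rather than the shortened sample, shows how many of the requested workers had begun starting when output stopped.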

NeMo SDP version: latest main
