example workflow with high memory use #7078

@oliver-sanders

Description

We have occasionally observed workflows gradually accumulating large quantities of memory.

Reproducible Example

Working with a recent workflow, Dave has managed to extract a reproducible example:

[task parameters]
    origins = ADD, ARN, ATH, ATL, AUH, BAH, BCN, BDA, BER, BEY, BGI, BKK, BLQ, BLR, BOG, BOM, BOS, CAI, CAN, CLT, CMB, CMN, CPH, CPT, CTU, DAC, DEL, DEN, DFW, DOH, DTW, DXB, EWR, EZE, FCO, FNC, GIB, GOT, GRU, GYD, HAN, HEL, HKG, HND, IAD, IAH, ICN, IST, JED, JFK, JNB, KBP, KEF, KUL, KWI, LAX, LIM, LIN, LIS, LOS, MAD, MEX, MIA, MLA, MNL, MUC, NBO, NCE, NQZ, ORD, OSL, OTP, PDL, PEK, PER, PHL, PHX, PRG, PVG, RUH, SCL, SEA, SFO, SIN, SVO, TIA, TLV, TPE, VIE, WAW, YHZ, YUL, YYC, YYZ, ZAG
    mogreps_timesteps = 24, 30, 36, 42, 48, 54, 60, 66, 72, 78, 84, 90, 96, 102, 108, 114, 120, 126, 132, 138
    ecmwf_timesteps = 144, 150, 156, 162, 168, 174, 180, 186, 192, 198, 204, 210, 216, 222, 228, 234, 240, 246, 252, 258, 264
    ecmwf_resil_timesteps = 270, 276, 282, 288
    [[templates]]
        origins = _%(origins)s
        mogreps_timesteps = _t+%(mogreps_timesteps)03d
        ecmwf_timesteps = _t+%(ecmwf_timesteps)03d
        ecmwf_resil_timesteps = _t+%(ecmwf_resil_timesteps)03d
[scheduling]
    initial cycle point = previous(T00)
    runahead limit = P0
    [[queues]]
        [[[default]]]
            limit = 5
    [[graph]]
        PT24H = '''
                natseg_res_create => nats_start_00 => nats_start => nats_run_astar_mogreps<origins, mogreps_timesteps>
                nats_start => nats_prepare_ecmwf
                nats_prepare_ecmwf => nats_run_astar_ecmwf<origins, ecmwf_timesteps>
                nats_run_astar_ecmwf<origins, ecmwf_timesteps> & nats_run_astar_mogreps<origins, mogreps_timesteps> => nats_genxml
                nats_run_astar_ecmwf<origins, ecmwf_timesteps> => nats_run_astar_ecmwf_resil<origins, ecmwf_resil_timesteps>
                nats_run_astar_mogreps<origins, mogreps_timesteps> & nats_run_astar_ecmwf_resil<origins, ecmwf_resil_timesteps> => natseg_res_delete
                nats_genxml & nats_run_astar_ecmwf_resil<origins, ecmwf_resil_timesteps> => nats_preprocess_for_archive
                nats_preprocess_for_archive => nats_archiving
                nats_archiving & housekeep[-P1D] => housekeep
                housekeep => archive_logs
                '''
[runtime]
    [[root]]
        run mode = skip
    [[nats_archiving]]
    [[nats_genxml]]
    [[housekeep]]
    [[nats_prepare_ecmwf]]
    [[nats_preprocess_for_archive]]
    [[nats_run_astar_set<origins>]]
    [[nats_run_astar_ecmwf<origins, ecmwf_timesteps>]]
        inherit = nats_run_astar_set<origins>
    [[nats_run_astar_ecmwf_resil<origins, ecmwf_resil_timesteps>]]
        inherit = nats_run_astar_set<origins>
    [[nats_run_astar_mogreps<origins, mogreps_timesteps>]]
        inherit = nats_run_astar_set<origins>
    [[nats_start]]
    [[nats_start_00]]
    [[natseg_res_create]]
    [[natseg_res_delete]]
    [[archive_logs]]

Note: this workflow contains many-to-many triggers, so the number of graph edges is extreme. It is likely that these are triggering the issue (pun intended).
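For a rough sense of scale, here is a hypothetical back-of-the-envelope count of the edges generated per cycle by the heaviest graph lines, assuming each parameterised graph line expands over the cross product of the parameters it references (95 origins, 20 mogreps timesteps, 21 ecmwf timesteps, 4 resil timesteps, taken from the parameter lists above). The counts are illustrative, not measured:

# Rough, illustrative per-cycle edge counts for the heaviest graph lines,
# assuming each parameterised line expands over the cross product of the
# parameters it references (shared parameters take the same value on both
# sides). Parameter sizes are taken from the [task parameters] section above.
N_ORIGINS = 95  # origins
N_MOGREPS = 20  # mogreps_timesteps
N_ECMWF = 21    # ecmwf_timesteps
N_RESIL = 4     # ecmwf_resil_timesteps

edge_counts = {
    # nats_start => nats_run_astar_mogreps<origins, mogreps_timesteps>
    "start -> mogreps": N_ORIGINS * N_MOGREPS,
    # nats_prepare_ecmwf => nats_run_astar_ecmwf<origins, ecmwf_timesteps>
    "prepare_ecmwf -> ecmwf": N_ORIGINS * N_ECMWF,
    # ecmwf<...> & mogreps<...> => nats_genxml (one edge per upstream task)
    "ecmwf & mogreps -> genxml": N_ORIGINS * (N_ECMWF + N_MOGREPS),
    # ecmwf<origins, ecmwf_timesteps> => resil<origins, ecmwf_resil_timesteps>
    # (many-to-many within each origin)
    "ecmwf -> resil": N_ORIGINS * N_ECMWF * N_RESIL,
    # mogreps<...> & resil<...> => natseg_res_delete
    "mogreps & resil -> res_delete": N_ORIGINS * (N_MOGREPS + N_RESIL),
    # nats_genxml & resil<...> => nats_preprocess_for_archive
    "genxml & resil -> preprocess": 1 + N_ORIGINS * N_RESIL,
}

for line, count in edge_counts.items():
    print(f"{line}: {count}")
print("total for these lines:", sum(edge_counts.values()))

On that assumption, the ecmwf -> resil line alone contributes nearly 8,000 edges per cycle and these lines together roughly 18,000, so any per-edge overhead retained by the scheduler adds up quickly.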

Example Results:

Example memory trace:

[memory trace plot attached to the issue]
