Tracking file provenance #3712
base: master
Conversation
I don't know why the tests are failing. They pass fine on my own machine.
@astro-friedel I think the general direction here makes sense, but there are some engineering choices that need a bit of rework. I scanned the docs and they look nice, especially the addition of the figures.
My main concerns are around:
- Passing the DFK to the futures is breaking the abstraction layer, I've left notes on where you could try using callbacks.
- I'm not clear on what we are claiming about when the file was created because we stat a file that could be transferred.
I've also got some minor nitpicks about a few unrelated changes.
@astro-friedel, I've got some minor comments, but this PR looks almost ready to merge.
@astro-friedel Please squash the changes, since there are quite a few commits. This looks good to go!
integrated dynamic file into output file handling

data flow kernel changes to accommodate dynamic file lists

Make manager->interchange tasks_output protocol always JSON objects (#3109)
This is to permit imminent changes to this protocol to allow quenching of task delivery as part of work to drain workers near the end of their walltime. Prior to this PR, the protocol used two formats: the first message in the worker->interchange task_output pipe was a JSON object, the "registration message". Subsequent messages were not JSON, but an integer encoded as a byte sequence. This integer was historically used for two purposes: as a heartbeat (with value HEARTBEAT_CODE = (2 ** 32) - 1) and as a count of tasks to request from the interchange. The latter use was removed in PRs #2588 and #3062, leaving only the heartbeat use case. To allow this channel to be used for richer messages, this PR switches the messages on this stream to always be JSON objects, with a 'type' entry: currently 'registration' or 'heartbeat'. This is in line with the 'result_package' format used on the manager->interchange results_outgoing channel, and so aligns with issue
Code paths: This PR removes the reg_flag variable used in registration code, which was redundant and only used to route control flow from a try block into an immediately following code block: that second block is merged upwards. That try block protects JSON decoding, and is moved higher up because there is now no need for separate "expecting JSON" and "expecting encoded int" paths.
Type checking: This PR adds some additional checking of the JSON structure to report a more useful error when the message is not a dictionary; this drives mypy to specialise the type of `msg` to dict[Any, Any], which can then not be used to update a ManagerRecord. This PR adds a # type: ignore with commentary to address that.
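The typed-JSON message handling described for #3109 can be sketched as below. This is a minimal illustration, not Parsl's actual code: the helper names (encode, decode, handle) and return strings are invented for the sketch; only the "JSON object with a 'type' entry" message shape comes from the PR description.

```python
import json

def encode(msg: dict) -> bytes:
    # Every message on the channel is now a JSON object with a 'type' entry.
    return json.dumps(msg).encode("utf-8")

def decode(raw: bytes) -> dict:
    msg = json.loads(raw)
    if not isinstance(msg, dict):
        # Reject non-object JSON early with a useful error, as the PR does.
        raise ValueError(f"Message must be a JSON object, got: {msg!r}")
    return msg

def handle(raw: bytes) -> str:
    # Dispatch on the 'type' entry, in the style of 'result_package' handling.
    msg = decode(raw)
    if msg["type"] == "registration":
        return "register manager"
    elif msg["type"] == "heartbeat":
        return "update last-heartbeat time"
    else:
        raise ValueError(f"Unknown message type: {msg['type']}")
```

This replaces the old two-format scheme, where only the first message was JSON and later messages were raw encoded integers.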
Testing: while registration is implicitly tested by htex tests successfully running tasks, there isn't any automated testing of manager->interchange heartbeats, so human review should pay special attention there.

Update cctools version and parsl_coprocess.py (#3107)
Updating cctools version to 7.8.0 after the latest release. The Work Queue executor's parsl_coprocess.py needs to be updated concurrently.

Tidy up htex worker pool start time recording (#3112)
The start time is needed as part of upcoming work on walltime-based worker draining (so this PR makes it into an instance attribute)... and it can be made more accurate than the previous implementation by measuring it earlier (most especially, prior to invoking the often expensive probe_addresses call). This PR rephrases a nearby log line which uses the word "starting" despite not being in the start() procedure of the manager, to reduce ambiguity when reading the code vs logs.

Test htex_auto_scale partial scaling-in (#3097)
The existing scaling-in test parsl/tests/test_scaling/test_scale_down.py only tests full scaling-in, which is implemented in a separate code path to partial scaling-in (case 4b vs case 1a in parsl/jobs/strategy.py).

Repurpose test_scaling/test_scale_down.py for simple strategy (#3114)
PR #3097 introduced a more comprehensive test for the htex_auto_scale strategy, based on this test, and prior to this PR, test_scale_down only tested the 'simple' strategy parts of htex_auto_scale.

Remove unused codepath from HighThroughputExecutor scale_in (#3115)
The codepath would be used when scale_in is called with force=False and max_idletime=None, and would pick blocks from the list of blocks which are currently idle. This PR removes that unused codepath, merging the choice of forced/non-forced scale-in and idle time specification into a single parameter that indicates "do not scale in blocks that have not been idle this long".
The two remaining cases from force=True/False, max_idletime=None/number are:
* max_idletime=None (previously: force=True, max_idletime=None). This means that the requested number of blocks should be scaled in, even if the blocks are not idle. The use case for this path is "parsl is shutting down, so we need to tidy up everything. If there are tasks still running, we don't care, because terminating in-progress tasks is part of parsl shutdown."
* max_idletime=some time (previously: force=False, max_idletime=some time). This means that the scaling code has decided to apply downwards pressure on the number of blocks: there are more blocks than needed. However, this pressure should not disrupt already running tasks, and it is less urgent to cancel blocks, because the same call will happen every 5 seconds to keep applying that pressure.

Change htex pull_tasks iteration from exponential backoff to calculated time (#3116)
Before this PR, this loop runs repeatedly, under high load quite often, but slowing down exponentially until reaching once per heartbeat period. The other things happening in this loop:
* emit a debug log
* check heartbeat related events (the upcoming drain work in PR #3063 would like to add an additional event there)
This PR calculates the earliest of the two heartbeat related events and waits in poll until that time. This will change:
i) the two heartbeat checks may happen earlier than previously, closer to the calculated time, because this loop will more accurately aim for the heartbeat time. This might uncover bugs or misassumptions elsewhere in Parsl code or user code. Related to this, this PR changes time comparisons from > to >= so that events can fire in the same fractional second as the corresponding computed poll ends.
ii) the debug log message will be output at different times, as the loop will now iterate at different times. The timing of the log message was quite weird before, and is now differently weird.
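The "wait until the earliest interesting event" calculation from #3116 can be sketched as follows. This is a hedged illustration: the function name, parameter names, and the 2x-period miss threshold are assumptions for the sketch, not Parsl's actual values.

```python
def next_poll_timeout(now: float, last_beat_sent: float,
                      last_beat_recv: float, heartbeat_period: float) -> float:
    # Next time we must send our own heartbeat:
    next_send = last_beat_sent + heartbeat_period
    # Time at which the other side's heartbeat is considered missed
    # (the 2x threshold here is illustrative only):
    next_check = last_beat_recv + 2 * heartbeat_period
    # Sleep in poll until the earliest of the two events; never negative.
    # Callers then fire events with >= comparisons, so an event can
    # trigger in the same fractional second the poll ends.
    return max(0.0, min(next_send, next_check) - now)
```

Compared with exponential backoff, this aims the loop directly at the next heartbeat-related event rather than waking on an unrelated schedule.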
This PR removes one use of the somewhat arbitrary poll timer htex parameter.

Deprecate `max_workers` for `max_workers_per_node` (#3117)
The latter is less ambiguous than the former. We will support both for the time being but raise a deprecation warning for `max_workers`. Also, `max_workers_per_node` will take precedence if both are defined.

Reduce flake8 max-line-length (#3122)

Reduce flake8 max-line-length (#3123)
Reduced the flake8 maximum line length and changed some lines which violate the new configuration.

Reduce flake8 max-line-length (#3130)

Reduce flake8 max-line-length (#3131)

Reduce flake8 max-line-length (#3125)

Add more typing around scale-in decision state (#3119)
Use a typed dataclass with human readable names, instead of a numerically indexed list, for the decision state itself, and add type annotations on the method signature. This is intended to make some of the decision code easier to read and typecheck, but should not change behaviour.

Update docs README with addition on viewing docs locally (#3140)
This PR adds instructions on launching a local python3 http server to view freshly rebuilt docs.

Update provider list in execution guide (#3145)

Remove Bluewaters section from user guide and delete example config (#3151)
This PR addresses issue #3139, where the user guide still contains an example configuration for the retired Bluewaters supercomputer at the NCSA (UIUC). This configuration is no longer relevant and should be removed to maintain the accuracy and relevance of the documentation.

Add process context switch counts to resource monitoring (#3120)

Make htex scale_in pick longest idle blocks (#3135)
Prior to this, scale_in picked the shortest-idle (i.e. most recently used but now idle) blocks. In the scaling code right now, there isn't any particular reason to order idle blocks in any particular order...
but the next PR following on from this one will also consider unstarted blocks with infinite idle time, and in that situation the policy will be that unstarted idle blocks should be scaled down in preference to currently running blocks. (Other policies are possible and reasonable but not addressed by either this PR or the upcoming PR.) I think this was intended to work this way to begin with (although I'm unsure of the justification), but there is a sign error that is flipped by this PR.

Reduce flake8 max-line-length (#3148)

Convert string to f-string (#3153)
The second line in the deprecation notice for the issue described in #3136 appears to be lacking the f prefix to define it as an f-string and thus interpolate the {new} variable.

Fix MPI docs typo (#3166)

Stop DFK cleanup in atexit, because this doesn't work in Python >3.12 (#3165)
atexit handlers work differently in the latest version of Python (version 3.12) and it is not safe to automatically call cleanup any more in Python 3.12.

Remove out of date banner (#3146)
Co-authored-by: Yadu Nand Babuji <[email protected]>

Reduce flake8 max-line-length (#3187)

Reduce flake8 max-line-length (#3190)

Added NSF badge for award 150475 (#3181)

Reduce flake8 max-line-length (#3199)

Move historical documentation pages into a new documentation book (#3174)
Co-authored-by: Yadu Nand Babuji <[email protected]>

Reduce max-length of flake8 (#3201)
* Reduce max-length from 148 to 147
* Delete hello1.txt
* Delete output.txt

Change API usage to explicit start/end style (#3230)
Previously the message format was driven by a small "invoke me repeatedly, I'll change my behaviour based on how many times you've invoked me" state machine. This PR removes that state machine and relies on the DFK knowing whether it is starting up or shutting down - that information is available statically inside the DFK code.
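The missing-f-prefix bug fixed in #3153 above is easy to reproduce. The strings and the `new` variable here are illustrative stand-ins for the deprecation notice, not the actual Parsl message:

```python
new = "max_workers_per_node"

# Without the f prefix, the braces are emitted literally:
broken = "please use {new} instead"

# With the f prefix, {new} is interpolated:
fixed = f"please use {new} instead"

assert broken == "please use {new} instead"
assert fixed == "please use max_workers_per_node instead"
```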
TaskVine: add config and links to CCL documentation (#3205)

TaskVine: fix tracking of output files/data futures (#3243)

Change runinfo_* into runinfo in Makefile clean target (#3236)

Drain workers as walltime expiry nears (#3063)
This is an implementation of the first part (the "easy part") of issue #3059. It adds a parameter to HighThroughputExecutor specifying a "not quite as long as walltime" period, after which workers will drain themselves: they will continue with existing tasks but ask the interchange not to send any more. When they are drained, the worker pools will exit immediately.

Fix address field in TaskVine manager config (#3066)
Previously it looks like this field was ignored, because it was always (in the absence of a project name) overridden by parsl.addresses.get_any_address. This PR makes that override only happen when the address field is set to None. Previously this field had a class level default value of socket.gethostname(). That value was never used, because of the above override, and resulted in the hostname of the documentation build host being listed as the default value for this parameter, incorrectly. This PR removes that default value, allows the address field to be None, and makes the address field be None by default. Cross-reference this Debian documentation reproducibility issue where the presence of socket.gethostname() breaks binary reproducibility: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1063542
Testing: Prior to this PR, setting address='10.10.10.10' does not break anything, even though that is not an IP address of my test environment. After this PR, setting that IP address causes taskvine to hang, as I would expect.

Change Timer type annotation to allow floats too (#3244)
Prior to this PR, Timer works with floats at runtime, but fails typechecks.
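The annotation change behind #3244 can be sketched with a stand-in class; this is not parsl.utils.Timer itself, just an illustration of why annotating as float is the permissive choice:

```python
class TimerSketch:
    # Annotating interval as float matches the runtime behaviour.
    # Per PEP 484's numeric tower, int arguments are still accepted
    # where float is expected, so existing int call sites keep passing
    # typechecks too.
    def __init__(self, interval: float) -> None:
        self.interval = interval

TimerSketch(5)     # int call site: fine
TimerSketch(0.5)   # float call site: now also passes typechecks
```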
Make JobStatusPoller parameters mandatory and remove defaults (#3245)
Making the parameters mandatory and without defaults helps with debuggability and code understanding. Defaults come from closer to the user, and so the defaults removed here are never used. To support Parsl users using JobStatusPoller as part of a DFK-free environment (for example, Globus Compute), this PR leaves the dfk parameter with a meaningful default.

Allow strategy polling period to be configured (#3246)
This is initially driven by a desire to run strategy polling faster in tests: there's no fundamental reason why the previous hard-coded value of 5 seconds needs to set the timescale for test execution. This was demonstrated previously in parsl/tests/test_scaling/test_scale_down_htex_auto_scale.py in PR #3097, which performed modifications on the internals of a live DFK-private JobStatusPoller. Work I've done on tests elsewhere benefits from strategy polling period reconfiguration too, so this PR makes that facility a publicly exposed feature. This change allows the interval to be set before the job status poller starts running, which means a racy initial first 5s poll in the above mentioned test_scale_down_htex_auto_scale.py is avoided: median runtime of that test on my laptop goes from 11s before this PR to 6s after this PR (dropping by exactly the 5s initial poll that is now avoided).
It's reasonable to expect some users to want to use this facility too: perhaps a user doesn't want to wait 5 seconds before the scaling code notices their workload; or perhaps they are more interested in running the strategy code much less frequently (for example, if running workloads on the scale of hours/days, to reduce e.g. debug log load).

Fix for TypeError when --available-accelerators is not specified in htex process worker pool (#3132)
Fixes #3110

Remove unused monitoring configuration parameters (#3256)
This could go through a deprecation cycle, but I'm not aware of anyone specifying them rather than leaving them at their unused defaults, so this PR avoids that deprecation workload.

Use parsl.utils log helpers for monitoring, rather than own variant (#3254)
This PR removes start_file_logger, which was similar to, but not the same as, parsl.utils.set_file_logger. It makes set_file_logger return the chosen root logger, which was the main difference between start_file_logger and set_file_logger. After this PR, monitoring logs will use the verbose log format used by parsl.log, instead of their own more terse format.

Tidy up interchange text on monitoring port descriptions (#3255)
This PR is clearer about connecting specifically to the MonitoringHub, and removes a sentence that describes how parameters are passed around, which is not monitoring specific.

Issue warning if SimpleLauncher is used with multiple nodes per block (#3177)
This addresses the issue where users could inadvertently attempt to use the SimpleLauncher with multiple nodes per block, which could lead to confusion and unexpected behavior. This enhancement ensures that a warning is raised when users try to use the SimpleLauncher with more than one node per block, unless explicitly permitted.
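The warning behaviour from #3177 above can be sketched with the standard warnings module. The function name and the exact message are hypothetical; only the nodes_per_block > 1 condition and the "unless explicitly permitted" escape hatch come from the PR description:

```python
import warnings

def check_nodes_per_block(nodes_per_block: int,
                          permit_multiple_nodes: bool = False) -> None:
    # SimpleLauncher runs the worker command once on one node, so
    # requesting multiple nodes per block is usually a mistake.
    if nodes_per_block > 1 and not permit_multiple_nodes:
        warnings.warn(
            "SimpleLauncher is being used with nodes_per_block > 1; "
            "pass permit_multiple_nodes=True if this is intentional.")
```

A warning (rather than an error) keeps any deliberate multi-node configurations working while surfacing the likely misconfiguration.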
Test app names are stored in monitoring under various decoration patterns (#3056)
This follows some investigations of app names that turned out to be a bug in a different piece of code, resolved in https://github.com/LSSTDESC/gen3_workflow/pull/58

Return None rather than raise an exception for Gantt chart of 0 tasks (#3253)
Prior to this PR, visualizing a workflow with no tasks in it would cause an HTTP 500 server error, caused by an exception raised in the Gantt generation code. However, a workflow with no tasks is a legitimate workflow. This PR makes task_gantt_plot return None rather than attempting a plot, which then places no Gantt chart in workflow visualization, rather than breaking. This bug was exposed through PR 3121, which sometimes (as a legitimate part of that test) results in a monitoring DB with a workflow with no tasks (but not always - that test is testing shutdown behaviour, not database contents).

Return None rather than constant True from cluster provider _write_submit_script (#3238)

Remove permit_multiple_nodes argument in SimpleLauncher (#3242)

Reduce flake8 max-line-length to 146 (#3203)

Split monitoring router and hub into separate files (#3264)
This is ongoing work to split up monitoring.py into more topical pieces - see also PRs #2468, #2439

Use module-scope logger in MonitoringHub (#3265)
It already does, except indirected via self.logger. This PR removes that indirection.

Rename monitoring port variables to reflect use (#3266)
The naming of the interchange port suggests that it is for the interchange to send monitoring messages, but it is a general ZMQ listening port used by at least the DFK too. There is another general listening port, listening on UDP, used by the UDPRadio on remote workers. This PR renames most variables referencing the above connections to have zmq_ and udp_ prefixes, instead of a mix of hub_ and ic_ prefixes, to clarify which variables and parameters apply to which of those two ports.
This clarifies in most of the codebase, for example, that the former hub_port and hub_port_range variables do not control the same thing in different ways, but that one is for ZMQ and one is for UDP. This PR avoids changing user-facing parameters, so some ambiguously named parameters still exist: those should be dealt with by a deprecation process over a longer time period in a different PR.

WQExecutor + TVExecutor: Automatic cleanup by adding their own atexit handlers (#3269)
Previously these executors did not exit on their own but waited for the DFK to call their shutdown methods from the DFK's atexit handler. Now that call is removed, so these executors add their own atexit handlers to clean themselves up; otherwise a Parsl program using these executors would hang at the end of the program. These handlers never spawn new threads or processes, so they should be safe under Python 3.12's atexit behaviour.

Test that monitoring is resistant to helper process termination (#3121)

Rephrase and move MonitoringHub init log (#3272)
This message previously talked about initializing ZMQ, but other things happen here too.

Add return annotation to windowed_error_handler (#3273)

Annotate scale_in and scale_out with types in JobStatusPoller (#3274)
These types align with the scale_in and scale_out types defined on BlockProviderExecutor.

Remove unused code path, driven by new type annotation (#3275)
Block IDs are strings, and _fail_job_async is never used without an allocated block ID. This was revealed by mypy after tightening the type annotation in this PR.

Remove docstring talk of scaling coroutines (#3276)
This dates from 2017, bfd4045b05f3ae9abd554a841e2ab73fb1213a5a, and reflects a possible scaling model that is not what ended up being implemented in the subsequent 7 years.
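The self-registering cleanup pattern from #3269 above can be sketched as follows. The class is an illustrative stand-in, not the actual WQ/TV executor code; the idempotency guard anticipates the double-shutdown concern addressed later in #3281:

```python
import atexit

class SelfCleaningExecutor:
    """Illustrative stand-in for the WQ/TV executor cleanup pattern."""

    def __init__(self) -> None:
        self._shutdown_done = False
        # Register our own cleanup, instead of relying on the DFK's
        # (now removed) atexit-driven shutdown call.
        atexit.register(self.shutdown)

    def shutdown(self) -> None:
        # Idempotent: safe whether invoked by the user, the DFK, or atexit.
        if self._shutdown_done:
            return
        self._shutdown_done = True
        # ... join helper processes here; joins spawn no new threads or
        # processes, which keeps this safe under Python 3.12's atexit rules.
```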
fix code of conduct link (#3277)

Tighten executor simulated status type: block_ids are str (#3279)

Test that init_blocks blocks are recorded in the monitoring database (#3282)
This is a test for an upcoming rearrangement of init_blocks handling.

Add type signatures related to add_executors (#3280)
This PR adds type signatures directly on add_executors, and adds an explicitly typed property onto the executor base class that is referenced from add_executors.

Support context manager protocol in DFK (#3260)
Recently, PR #3165 removed at-exit cleanup of DFKs, so after that PR a user has to explicitly invoke the DFK cleanup method. More justification for this is in issue #3104. This PR adds context manager support to the DFK, so that it can be used in a with block around an entire workflow.

Pin radical-pilot to version 1.47 (#3290)
A recently released version, 1.48, doesn't work with this executor, so this PR aggressively constrains the version to what was passing in Parsl GitHub Actions over the last few weeks. For further context, see https://github.com/radical-cybertools/radical.saga/issues/885#issuecomment-2017520999

Tighten database access in htex/monitoring test to avoid issue #3287 (#3293)
A more principled fix of #3287 looks to be much more deeply invasive, and involves a big rework of how ZMQ is used.

Fix TypeError race condition in htex draining code (#3285)
Before this PR, sometimes draining would fail with this error message:
Traceback (most recent call last):
  File "/home/runner/work/parsl/parsl/.venv/lib/python3.10/site-packages/parsl/process_loggers.py", line 27, in wrapped
    r = func(*args, **kwargs)
  File "/home/runner/work/parsl/parsl/.venv/bin/process_worker_pool.py", line 322, in pull_tasks
    next_interesting_event_time = min(last_beat + self.heartbeat_period,
TypeError: '<' not supported between instances of 'NoneType' and 'float'
because the code incorrectly sets the drain time to None to indicate that further draining messages should not be sent.
An earlier version of the drain code used None in that situation, but this was changed to +inf in the version that was actually merged. This error is statically detectable by mypy, if you turn on mypy checking for pull_tasks, although it also raises other type errors.
Co-authored-by: Kevin Hunter Kesling <[email protected]>

Correct not-scaling-in log message (#3284)
Previously this message said an executor was not being shut down when it is in a bad state. This `if` statement is actually about scaling in the executor's blocks. Shutdown always happens, a few lines later.

WQ + TV Executors: Making their shutdown methods idempotent (#3281)
Previously, WQ and TV Executors could be shut down twice: once by the registration of shutdown with atexit, and once by the user explicitly calling parsl.dfk().cleanup(). This PR makes the effect of these executors' shutdown happen only once. Note that previously the double shutdown did not create bugs, as the shutdown methods only call process joins and a Python process can be joined many times, but future bugs might come.

Add `cd` command to README (#3297)
To prevent errors when running the 'make' command immediately after cloning, remind users to navigate to the root directory by using the command 'cd Parsl' first.

Return DFK in context manager (#3296)
When attempting to use the with statement with parsl.load() to manage a DataFlowKernel object (dfk), the context manager protocol does not correctly assign the loaded DataFlowKernel to dfk, resulting in dfk being None. The desired behaviour is that "with parsl.load() as dfk:" assigns the loaded DataFlowKernel object to dfk.

Handle init_blocks in scaling strategy, rather than special-casing it (#3283)
This is part of issue #3278, tidying up job and block management. Now init_blocks scale out happens on the first strategy poll, not at executor start - that will often delay init_blocks scaling by one strategy poll period compared to before this PR.
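The None-vs-+inf distinction behind the #3285 fix above can be reproduced in a few lines; the variable names here are illustrative:

```python
# Using None as "no further drain messages" breaks comparisons at runtime:
drain_deadline = None
heartbeat_deadline = 42.0
try:
    min(drain_deadline, heartbeat_deadline)
except TypeError:
    # '<' not supported between instances of 'NoneType' and 'float'
    pass

# Using +inf instead keeps the value usable in min() and comparisons,
# and naturally means "this event never fires":
drain_deadline = float("inf")
assert min(drain_deadline, heartbeat_deadline) == 42.0
```

This is also why the bug is statically detectable: mypy rejects `min` over `Optional[float]` and `float`, whereas +inf keeps everything a plain float.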
Use .pytest rundir hook for disconnection test (#3299)
By leaving the test config's rundir as default, parsl/tests/conftest.py will reassign that attribute to a dynamically generated directory under .pytest/parsltest-current

Unify test_data and test_staging directories (#3305)
Before this PR, test_data contained file staging tests, but not tests for other non-file based data (such as serialization). This is a move away from that mis-over-abstraction of the word "data".

Test scale-in at dfk shutdown (#3304)

Don't initialise self.blocks in WorkQueue twice (#3301)
It is initialised in the BlockProviderExecutor superclass.

Use a defaultdict for BlockInfo in htex scale-in (#3300)
This is to accommodate upcoming expansion of the block_info dictionary to include data from an additional source (the list of blocks known to the scaling code). This change arises from a review of PR #3232, an earlier prototype of that expansion work. Behaviourally, this shouldn't change anything: it changes invalid block accesses from key errors into blocks that are infinitely idle and completely unloaded. However, there are no indexes into block_info except in code that is adding block information, so this situation should not arise.

Scale in blocks at shutdown using Job Status Poller (#3302)
This will now scale in blocks using the job status poller scale-in code, which means the (moved) shutdown scale-in loop does not need to send its own BLOCK_INFO messages. Behaviour-wise: the scale_in call is changed from the executor version to the status poller version. That latter version is mostly a passthrough to the executor version, with the above-mentioned monitoring code change. So the behaviour of the scale-in call + monitoring should be the same. The parameter to the scale_in call is a count of blocks. This is intended to mean "all of them" - by counting however many blocks are known and using that count.
Prior to this PR, the number of known blocks was the number of jobs in the provider's `resources` dictionary - a structure that should otherwise be private to the particular provider. Post this PR, the number of known blocks is the number of blocks known to the JobStatusPoller module: this is a possibly different number, although it should be at least the number of blocks which have been submitted (either by getting a new status in PollItem.scale_out or by being present in the most recent provider cache refresh in PollItem.poll). Because it should be at least the number of submitted blocks, this number should be large enough to represent "all of them" (with the above meaning).

Remove duplicated parsl.providers.GridEngineProvider (#3310)

Remove a hard-coded indirection to monitor_wrapper (#3309)

Rename PollItem to reflect its current role as a facade on executor (#3307)
This should not change behaviour, but should make the behaviour of this class a bit more understandable for human readers. The history here: PollItem used to be two classes, PollItem and ExecutorStatus. Those two classes were merged, leaving only PollItem, and the combination gained more facade-like behaviour, rather than being only a status cache.

Rename self.blocks and self.block_mapping for human clarity (#3308)
This PR makes it more obvious what they are maps for, and that they are/should be inverses of each other.

Force kwargs for monitor wrapper for reduced maintenance fragility (#3312)
This function has a lot of parameters, and long sequences of positional parameters are hard to align. See #2973 for similar work elsewhere.

Move scale-in-at-exit code into close method of job status poller (#3311)
Changed behaviour: this will now happen slightly before it used to, but the differences are only a changed place in the call stack, and log messages will appear in a different order now - so the actual scaling-in behaviour should be unchanged.
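The keyword-only-parameter technique behind #3312 above uses a bare `*` in the signature. The function name and parameters here are hypothetical stand-ins for monitor_wrapper's much longer parameter list:

```python
def monitor_wrapper_sketch(*, task_id: int, run_dir: str,
                           logging_level: int) -> str:
    # The bare '*' makes every following parameter keyword-only, so a
    # long argument list cannot be silently misaligned at call sites.
    return f"task {task_id}: {run_dir} @ level {logging_level}"

# Positional calls now fail fast instead of misbehaving:
try:
    monitor_wrapper_sketch(1, "/tmp/run", 20)
except TypeError:
    pass

result = monitor_wrapper_sketch(task_id=1, run_dir="/tmp/run",
                                logging_level=20)
```

Reordering or inserting a parameter in the signature then cannot silently shift the meaning of existing call sites.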
Inline fail_job_async (#3318)
This should not change any behaviour. The call happens only in one place, and it only modifies a block-related data structure, right next to code that more directly modifies block-related data structures. This PR makes all of those block-related data structure changes happen at the same level. This call is not asynchronous, and this PR also removes that confusing use of the term async - there is a historical reason.

Move 'first' field into strategy state (#3317)
This is part of work to move JobStatusPoller facade state into other classes, as part of job handling rearrangements in PR #3293. This should not change behaviour: each executor has a single PolledExecutorFacade and a single strategy.ExecutorState, and this PR moves the 'first' field from one to the other. parsl/tests/test_monitoring/test_htex_init_blocks_vs_monitoring.py tests that init_blocks handling still fires properly - that's what is switched by this 'first' field.

Do not store DFK in PolledExecutorFacade (#3320)
The DFK is only used to initialise monitoring, in __init__, and is not used after that. The object that lives beyond __init__ is self.hub_channel. This PR should not change behaviour.

Do not duplicate executor polling interval in PollingExecutorFacade (#3321)
This is part of work to move JobStatusPoller facade state into other classes, as part of job handling rearrangements in PR #3293. This value is constant, in both executors and in PollingExecutorFacade. This PR should not change behaviour.

Reduce flake8 max-line-length from 146 to 145 (#3228)
Reduced the max-line-length in flake8 from 146 to 145. Fixed the subsequent errors due to files exceeding the max-line-length.

Add ZipFileStaging to archive outputs into a zip file (#3319)
This output file staging mechanism provides a zip: URL scheme, naming a file inside a zip file; output files from apps can be staged out into this location. Input staging is not supported.
The use-case for this is as part of a sequence of PRs split out of PR #3306, leading to stdout staging into zipfiles.

Remove unused 'noci' pytest marker (#3326)
Not-running-tests-in-CI is achieved by more subtle means, such as topic and issue specific test markers.

Remove 'not issue363' markers from Work Queue and Task Vine tests (#3327)
These tests all pass with Work Queue and Task Vine in CI and on my laptop. Some of these used to fail, but PR #2803 changed the stdout/stderr paths from relative paths to absolute paths (to move files into a per-test working directory). This then causes a different codepath for std* behaviour to be followed in Task Vine and Work Queue: when given absolute paths, those executors assume a shared file system without staging. For example, test_stdout_truncate in parsl/tests/test_bash_apps/test_stdout.py works as currently written, but fails if this change from PR #2803 is reverted:
```
def test_stdout_append(tmpd_cwd):
- out = str('t1.out')
+ out = str(tmpd_cwd / 't1.out')
```
Making a more formal model of how Work Queue and Task Vine should handle relative vs absolute paths as a signal to stage or not stage is beyond the scope of this pull request. Instead this PR seeks to enable tests that are not failing (and no longer expected to fail).

Unshare "now" timestamp across executor polls (#3325)
This PR removes the notion of a shared current time among all polls. Instead, each facade poll looks up the current time individually. Although executor polls are all launched from the same loop, there's no need for them to act as if they are being polled at the same moment in time, which in the case of multiple executors will be some time in the past. This means there is one less variable shared between polls of different executors, which is part of work pushing executor polling deeper into each executor, away from the per-DFK JobStatusPoller.
This will change the cadence of polling: `now` may be measured a bit later, especially in the presence of multiple executors, meaning a poll that might previously have been skipped now runs, and as a knock-on, all subsequent skips/updates will be jiggled around based on that change. This should not have any serious effect on observable behaviour.

Merge PollingExecutorFacade monitoring_enabled and hub_channel (#3322)
self.monitoring_enabled = True is equivalent to self.hub_channel being assigned. This PR uses a more type-driven style of a single Optional for hub_channel, removing monitoring_enabled. This PR should not change behaviour.

Remove issue363 test marker (#3329)
This marker originally referred to issue #363, for tests which should not be expected to pass if running in an environment where stderr and stdout are not staged back to the submitting system by the executor. Since it was created, it then expanded in use to more generally refer to any test which had a problem in environments which did not properly make either output files or stdout/stderr available on the submitting system, as a separate behaviour from the core of Parsl behaving properly. An example of that would be running in a multi-site environment without working file staging. However, since all that happened, the feeling in the Parsl community is that running with a shared filesystem is the core supported usage mode of Parsl; and also since then, notions of file staging have become much stronger, both in Parsl core and in individual executors, making it a much more reasonable test requirement that if you make a file, it can be expected to be available for the test to inspect - and that if you are in an environment that does not support that, you should not expect the test suite (or Parsl) to work. Because of all of that, this PR removes the issue363 test marker entirely.
It was removed from Work Queue and Task Vine test environments in PR #3327, leaving only the Radical-PILOT test environment setting that. It turns out only two tests don't work in the Radical-PILOT environment when issue363 is removed: One requires support for the slightly awkward "stdout/err opening modes, specified by tuples" feature, and this PR labels and disables that test using a new 'executor_supports_std_stream_tuples' label. (In general, an executor which does staging of stdout/err streams is likely to not support the features offered by std stream tuples, such as appending to stderr files from multiple task - so this label is not Radical-PILOT specific) Another test looks like it is broken exception handling in the Radical-PILOT executor, and so it is labelled with a new issue specific label, relating to the bug report for that issue - issue #3328, issue3328. Delete duplicate misplaced DFK docstring on random executors (#3324) This text is already present on a more appropriate method, `submit`, where executor selection actually takes place. The deleted text was on the launch_task method, which runs long after executor selection has happened. Inline PolledExecutorFacade._should_poll_now (#3331) It was used once, a couple of lines below. This PR should not change behaviour. Update monitoring basic test to recent style (#3333) * Update monitoring basic test to recent style * use a tmpd_cwd working directory * remove logging * use DFK as a context manager This should not change what is tested. * Set runinfo to tmpd_cwd Exit monitoring router exit on multiprocessing event, not exit message (#3330) Prior to this PR, the monitoring router exited due to receiving a WORKFLOW_INFO message with an exit_now field set to True, but only if that message was received through a specific path. This PR removes that exit_now field, and makes the monitoring router exit on a multiprocessing event. 
This removes the need for the exit message to arrive through that specific path into the router, which makes message handling more consistent, and opens up opportunities to feed messages into monitoring through different paths. Slowly ongoing work has been trying to make all the different monitoring message paths behave the same with a goal of eliminating some of them, and this change also works towards that. Co-authored-by: Kevin Hunter Kesling <[email protected]> Fix scale_in return values for Work Queue and Task Vine (#3336) Before this PR, the scale in methods for those executors returned None, in violation of the type specification in BlockProviderExecutor. This PR adds type annotations to cause those methods to be type checked. This PR adds handling for return value based on the HTEX implementation of scale_in to both Work Queue and Task Vine executors. This PR adds a few debug logs around behaviour there. These return values are not used anywhere (which is why this has not been a problem) but an upcoming PR will both make use of them (in monitoring) and test that something is returned. Replace proc_id with human words in LocalProvider debug log (#3335) Before this PR, proc_id was used in log messages as if its a symbol such as a local variable or function parameter. It is not. This PR replaces proc_id with human words. Fix typing on general strategy (#3340) This PR adds a type annotation to the function signature. That has the knock-on effect of enabling type checking within the method, revealing two mypy errors. One is unreachable code: because the strategy will only be called with a BlockProviderExecutor executor, it does not need to check if it has been called with a BlockProviderExecutor. The other type error comes from mypy not realising that a path is unreachable: the idle time of an executor may be None, but at the point of the type error, an `if` statement has ensured that it is never none. 
This PR adds an assert statement to assert that assertion to the type checker (and, incidentally, to the logs via wrap_with_logs). Make CCTOOLS environment available to --config local tests (#3341) This is needed for PR #3339 to make Work Queue and Task Vine tests in --config local mode Co-authored-by: Kevin Hunter Kesling <[email protected]> Remove (unused?) PolledExecutorFacade __repr__ (#3337) I can't find anywhere this is used: I think PolledExecutorFacades are never referenced outside of job_status_poller.py and strategy.py; and in both of those files, I cannot see anywhere that a PolledExecutorFacade is repr'd. So, this PR should not change any behaviour. Turn on type checking for all of parsl.jobs.* (#3342) This reveals one type error, that JobStatusPoller does not properly implement the .close() method signature. This PR fixes that and passes the new timeout parameter to the superclass which can make use of it. Remove outdated parsl.dataflow.futures docstring (#3345) This refers to DataFutures which do not live in this module. AppFutures are already described in the AppFuture docstring. Make test executors use default staging storage access list (#3343) Aside from htex_local_alternate which is deliberately intended to use complicated option combinations, other test providers should use a minimal configuration and get the default. This has the immediate effect of making zip: file staging available to future tests on all executors. Co-authored-by: Kevin Hunter Kesling <[email protected]> Replace DFK and JobStatusPoller monitoring zmq channels with Queue plugin (#3338) Before this PR, the DataFlowKernel and each JobStatusPoller PolledExecutorFacade each opened a ZMQ connection to the monitoring router. These connections are not threadsafe, but (especially in the DFK case) no reasoning or checking stops the DFK's connection being used in multiple threads. 
Before this PR, the MonitoringHub presented a 'send' method and stored the DFK's ZMQ connection in self._dfk_channel, and each PolledExecutorFacade contained a copy of the ZMQ channel code to open its own channel, configured using parameters from the DFK passed in at construction. This PR: * moves the above uses to use the MonitoringRadio interface (which was originally designed for remote workers to send monitoring information, but seems ok here too) * has MonitoringHub construct an appropriate MonitoringRadio instance for use on the submit side, exposed as self.radio; * replaces the implementation of send with a new MultiprocessingQueueRadio which is thread safe but only works in the same multiprocessing environment as the launched monitoring database manager process (which is true on the submit side, but for example means this radio cannot be used on most remote workers) This work aligns with the prototype in #3315 (which pushes on monitoring radio configuration for remote workers) and pushes in the direction (without getting there) of allowing other submit-side hooks. This work removes some monitoring specific code from the JobStatusPoller, replacing it with a dependency injection style. This is part of work to move JobStatusPoller facade state into other classes, as part of job handling rearrangements in PR #3293 This PR will change how monitoring messages are delivered from the submitting process, and the most obvious thing I can think of that will change is how this will behave under load: heavily loaded messaging causing full buffers and other heavy-load symptoms will now behave as multiprocessing Queues do, rather than as ZMQ connections do. I have not attempted to characterise either of those modes. Co-authored-by: Kevin Hunter Kesling <[email protected]> Unabstract executor.create_monitoring_info (#3339) This method makes sense for BlockProviderExecutors, where it is used by the JobStatusPoller. 
It does not make sense for ParslExecutors in general and is not otherwise used. So remove from the ParslExecutor base class, which removes the dependency of parsl.executors.base on parsl.jobs.*, a cleaner separation of concerns. This method is implemented in HighThroughputExecutor, but the information generated only uses BlockProviderExecutor information and would make sense for other BlockProviderExecutors (Work Queue, Task Vine and the erstwhile IPP and Extreme Scale executors). Using this implementation has been prototyped in the desc branch of parsl since 2021. So move that implementation out of out of HighThroughputExecutor and into BlockProviderExecutor. This should not change any behaviour when using HighThroughputExecutor. When using Work Queue or Task Vine, this change will bring block monitoring support to those executors. As noted above, this has been used successfully with Work Queue since 2021 in the desc branch. This PR modifies the basic monitoring test to test against those executors in addition to the HighThroughputExecutor. This PR modifies the type annotations of create_monitoring_info to type check properly - the previous htex implementation was not type-checked. The return type is modified to a sequence, because List in invariant on its parameter type, which is incompatible with adding Dict[str, Any] to it. Rework monitoring recording of invalid stdout/err streams (#3344) This PR stops treating None as an invalid stdout/err: it is a legitimate value for stdout/err. Prior to this PR, using None was resulting in the stdout/err being recorded in monitoring as a stringified exception, like this: ``` sqlite> select task_stderr from task; type of argument "stdfspec" must be one of (os.PathLike, str, Tuple[str, str], Tuple[os.PathLike, str]); got NoneType instead ``` Before this PR, an invalid stdout/err stream was logged at WARNING level and the filename as recorded in the monitoring database was set to the exception message. 
This PR elevates that log to an ERROR, because this is an error, including the exception message to give a debugging hint. In this error case, a blank stdout or stderr will be recorded in the monitoring database. Generally the task will fail later due to other parts of the code being unable to interpret the specification. This PR adds a few checks around stdout/err tests that should not have any error raised - this test checks for ERROR rather than WARNING because Parsl will legitimately output WARNING messages in some configurations (such as when checkpointing is enabled but Parsl notices it hasn't checkpointed anything) Co-authored-by: Kevin Hunter Kesling <[email protected]> Add more cases for std stream logging (#3347) Previously, stdout/err logs not output if the streams were not set (i.e. defaulting to None) and logged with the entire stream target if set. This PR adds more cases here with case specific logging: if stdout/err is not redirected, that is now logged. If tuple mode is used, the tuple is unpacked and logged as separate fields. If an unknown format is specified an error is logged (but no exception is raised - it is not the job of this code to validate the stdout/err specifications) This PR is in preparation for a new case to be added in an upcoming PR which requires more serious formatting that was not accomodated by the previous implementation. Move monitoring radio reference from JobStatusPoller to BlockProviderExecutor (#3346) This is a step in consolidating responsibility for BLOCK monitoring into the BlockProviderExecutor. Before this PR, executors were already configured ad-hoc with a lot of monitoring information. This PR adds another attribute to that collection for the local monitoring radio. This PR removes some more state from the PolledExecutorFacade class. This PR should not change behaviour - it changes the code path by which the job poller's monitoring radio is found. 
Move send_monitoring_info method into BlockProviderExecutor (#3349) This should not change behaviour: it is a code move. Rename PolledExecutorFacade methods that overlap BlockProviderExecutor (#3351) This is in preparation for moving all the PolledExecutorFacade code into BlockProviderExecutor, in an upcoming PR. There should now be no intersection in the attributes or methods of PolledExecutorFacade and BlockProviderExecutor, in the methods that will move. __init__ is an exception because it will merge with the __init__ of BlockProviderException. The renamed methods are slightly awkwardly named because there isn't a short adjective that I have found that adequately describes the difference between BlockProviderExecutor and PolledExecutorFacade behaviours. Future untangling work in this area will necessarily involve quite a lot of rearranging and picking appropriate names for whatever remains is part of that future work - so no bikeshedding over adjectives. This PR should only rename methods. It should not change any behaviour. Move PolledExecutorFacade functionality into BlockProviderExecutor (#3352) This should be code movement only and should not change functionality. This consolidates scale in/out and status functionality in BlockProviderExecutor, which hopefully makes future work rearranging the three sources of block status information easier. Make HTEX scale-down be aware of unstarted blocks (#3353) Prior to this PR, the scale_in code for the HighThroughputExecutor will not scale in any block that has not had at least one manager register with the interchange, because it retrieves the list of blocks from the interchange. This is documented in issue #3232 This PR makes the htex scale in code also pay attention to blocks in the status_facade list - which includes blocks that have been submitted, and blocks which have been reported by the provider mechanism. 
Fix two CI-rot breakages (#3360) Two simultaneous CI-rot breakages have happened and this PR fixes them simultaneously - they cannot be fixed individually because neither fix is sufficient to pass CI. RADICAL-Pilot installation broke recently in CI, through some change outside of the Parsl repository, and even with a pinned RADICAL-Pilot version. More recent radical-pilot will install OK, and this PR advances the pin to that more recent version. This might be something to do with this fix introduced in radical-cybertools/radical.pilot#3169 but I'm not completely clear. Issue #3037 started happening in CI. That issue is that a test checks a property that it might not actually be reasonable to expect to hold. This PR skips that issue, leaving further work for whoever adopts issue #3037. Remove nested exception from BadStdStreamFile exception (#3362) Prior to this PR, this exception class had a mandatory nested exception attibute. This PR removes that exception, because Python already has some nesting mechanisms for exceptions which would serve well enough here: https://docs.python.org/3/tutorial/errors.html#exception-chaining This change allows exceptions to be optionally chained like any other Python exception; or not chained at all, simplifying one use of BadStdStreamFile where a contrived TypeError was constructed to fill the slot. get_all_addresses to log to debug rather than info (#3364) Currently get_all_addresses presents every single missing address from say address_by_interface to info when realistically, this is not useful at all. Any single method failing is not worth presenting to the user, and is useful only in a debug context. Set Executor attributes in super class constructor (#3361) Setting these attributes in the constructor makes reasoning about later use less complicated. If an Executor is instantiated these exist. Whether or not they've been set to a non-default value is another question. 
Co-authored-by: Ben Clifford <[email protected]> Fix broken stdstream handling for bytes and arbitrary os.PathLike objects; test harder. (#3363) Prior to this PR, an arbitrary os.PathLike was rendered to monitoring using that PathLike's str method. What should be appearing in the monitoring database should be a string representation of the path. This is an issue in (rare in practice?) cases where __str__ and __fspath__ return different things. This PR changes rendering to use fspath rather than str. os.PathLike objects can also render as bytes, not str. This PR decodes those bytes using .decode(). This PR adds tests for various valid data types going from app stdout/err parameters to the corresponding column in the monitoring database, including test cases that drove the above fixes. Treat stdout and stderr as stageable Files (#3348) Prior to this PR, stdout and stderr were treated as string pathnames interpreted in an executor specific context: for example, htex treats them as paths on the worker filesystem; Task Vine treats these paths as names of files that should be staged to the submitting system. This is/was different to the treatment of files placed in the outputs kwarg of an app using File objects. Those file objects can be staged out arbitrarily using Parsl file staging. This PR allows File objects to be specified for bash_app stdout and stderr kwargs, and stages those files out in the same way as outputs-kwarg files are staged. This PR preserves the previous str-based behaviour without staging. This is inconsistent with the data staging mechanism, which only accepts Files, but preserves backwards compatibility, which is important here because stdout and stderr kwargs are used by users a lot. When this new staging mechanism is used, DataFutures for stdout/err Files are made available to the user via app_future.stdout and app_future.stderr. 
This properties existed previously, and when using strings for stdout/err, retain their previous behaviour of returning those strings. This mode is incompatible with tuple-based stdout/err specifications, which allow opening stdout/err with specific file modes, and rely on the user to manage those files outwith Parsl: that concept is not compatible with the file staging model of "write a file once, don't ever modify it". Fixes issue #363 Fix stdout/err log errors introduced by PR #3347 (#3379) PR #3347 attempted to do case-based logging of all the different kinds of stdout/err specification. It failed to capture some of the cases involving os.PathLike, and so after PR #3347, those cases would log an ERROR that the specification was unknown. This new behavior is only a new ERROR logging message - PR #3347 did not change other behaviour. This PR also amends a rich test of stdout/err specification types introduced in PR #3363 to check that no ERROR messages are logged during these tests. Decrease strategy poll period in stdout/err vs monitoring test (#3380) This removes around 100s seconds of realtime during test execution. Shutdown multiprocessing queues at monitoring shutdown (#3385) This frees up some threads and file descriptors. Co-authored-by: Kevin Hunter Kesling <[email protected]> Shutdown multiprocessing queues at Work Queue shutdown (#3384) This frees up some threads and file descriptors. Fix typo in test invocation that results in skipped DFK cleanup (#3386) This typo is not a syntax error, although it might superficially look like one: it merely computes the method that if invoked, would clean up the DFK, and then discards the result of that computation rather than invoking it. This typo resulted in an entire DFK being left behind after the test finishes. Shutdown multiprocessing queues at TaskVine shutdown (#3383) This frees up some threads and file descriptors. Fix broken test that cannot be run multiple times in same environment. 
(#3382) This test used to re-use a temporary intermediate file by appending to it. When run multiple times, it failed with errors like this as the results from previous runs are also observed by the test: ``` E AssertionError: assert '2\n2\n2\n2\n' == '2\n' ``` This PR uses tmpd_cwd to avoid that behaviour. Use strategy_period in test, instead of custom thread (#3381) PR #2816 introduced a workaround in the polling API by adding a new thread to call job status poller poll() method more frequently than the JobStatusPoller invokes that method. PR #3246 introduced general configurability of the JobStatusPoller poll period. This PR replaces the thread from #2816 with the configuration option from This PR removes one thread left running at the end of this test case, in --config local tests. Reduce test sleep() (#3387) The checkpoint is the threshold of interest, so rather than waiting for an inordinate amount of time, dynamically ascertain what the checkpoint wait time is. Add 1 second to that value for good measure. Test time reduced from ~20s locally to ~4s. Fix typo in marker doc string (#3389) Tidy up wrong looking htex test (#3390) The name suggests that some failure is being tested, but that's not the case. This PR renames that. This PR also adds a very simple assert on the return value for a small amount of extra behavioural testing. Shut down usage tracking harder (#3392) This invokes .close() on usage tracking, which releases two fds per sub-process (so 4 for each usage-tracked DFK = 2 x {start, stop}) more immediately. This PR reworks logging around shutdown to emphasise that something is wrong if SIGKILL is needed, and to add a debug line right before a potentially long delay. Add logs around usage tracker shutdown (#3391) Parsl shutdown logs are deliberately rich at shutdown time because this is an incredibly hang prone part of the code-base. 
See https://github.com/Parsl/parsl/labels/safe-exit This PR adds logs bracketing the shutdown of the usage tracker, in line with later shutdowns of the job status poller, executors and monitoring. Remove expensive typeguard call from file staging closure (#3394) Prior to this PR, the closure stageout_one_file is redecorated by typeguard on every invocation, so once per task that is using files, and instruments that closure each time with internal type checking code. This was introduced incidentally as part of PR #3348. This PR removes that runtime type-checking. In general, parsl-internal calls should be checked statically by mypy if possible, not by typeguard, which should be reserved for the interface with users. Prior to this PR, parsl-perf modified to use a File stdout on my laptop ran at median 2176 tasks per second. After this PR, median 2627 tasks per second. @svandenhaute experienced a much starker difference on a psiflow test case, with one test case that takes 40 seconds reduced to 9 seconds by this PR, the same performance as before PR #3348. Add first unit test for File (#3396) We may have other UTs littered among the other tests, but this file now sets the stage for where to place them. A not-complete metric for UTs: do they they run _fast_? If not, or if the test includes `.sleep()`, it's probably not a UT. Report usage using a richer configuration traversing protocol (#3229) This PR introduces code to traverse the configuration object (in a similar manner to the RepresentationMixin style of logging the supplied configuration object) with the intention of giving each object a chance to report its own usage information. 
The protocol now reports configured objects either as a JSON string class name, or as a JSON object containing the class name and any additional information that class wishes to report for usage (via the UsageInformation abstract class) This PR modifies the HighThroughputExecutor to use this API to report richer usage information: a specific usage tracking query is to ask about use of the enable_mpi_mode parameter, and so the HighThroughputExecutor will now report the boolean value of that parameter. Beware that this reports on configuration, not use, of components: for example, configurations by default will include three staging providers, even though I believe it is extremely rare that either the FTP or HTTP staging providers are actually used to stage data. The UsageInformation API is intended to support reporting whether these staging providers actually stage anything, but this PR does not implement that in those staging provider components. To support UsageInformation instances which report on usage during a run, the component tree is traversed both for the start message and the end message, and may result in different information in each message - for example, the DFK report occurs only end in the end message. An example start message looks like this: (pretty-formatted) {'correlator': 'f7595d08-7b94-49bc-b3d7-1ea7532b2f51', 'parsl_v': '1.3.0-dev', 'python_v': '3.12.2', 'platform.system': 'Linux', 'start': 1710150467, 'components': ['parsl.config.Config', {'c': 'parsl.executors.high_throughput.executor.HighThroughputExecutor', 'mpi': False}, 'parsl.providers.local.local.LocalProvider', 'parsl.channels.local.local.LocalChannel', 'parsl.launchers.launchers.SingleNodeLauncher', 'parsl.data_provider.ftp.FTPInTaskStaging', 'parsl.data_provider.http.HTTPInTaskStaging', 'parsl.data_provider.file_noop.NoOpFileStaging', 'parsl.monitoring.monitoring.MonitoringHub']} Actually test the file: prefix. 
(#3401) Remove the mistaken conditional check from an apparent vestigial thought-process during initial implementation. .close() interchange process to release more fds (#3399) This releases two fds in the workflow process used to communicate with the interchange process, and is part of work to release more resources explicitly at shutdown rather than leaving them until process end. One test was making an out-of-API shutdown of HTEX on the basis that the DFK would not shutdown this executor when a bad state was set. That used to be true, but PR 2855, which re-arranged some shutdown behaviour, (accidentally?) changed shutdown to always happen. That is, I think, the right shutdown behaviour: if an executor wants to suppress parts of its own shutdown based on internal state, that's the business of the executor, not the DFK. This PR removes that out-of-API shutdown. Rework description of --config local tests (#3402) The previous description of putting --config local tests into /sites/ and /integration/ does not reflect the reality of current Parsl development. Use local config for htex MPI test, instead of setup/teardown fixutre (#3398) Implement stage-in for zip staging provider (#3395) Make parsl.AUTO_LOGNAME generation pluggable (#3388) A generation function can now be specified at configuration time, with the previous behaviour invoked as default. Use incluster config fallback for KubernetesProvider (#3357) The KubernetesProvider now falls back to loading config in-cluster if a kube-config file is not found. This allows in-cluster submission of parsl jobs. See kubernetes-client/python#1005 for a good description of the issue. Co-authored-by: T. 
Andrew Manning <[email protected]> Co-authored-by: Matt Fisher <[email protected]> Co-authored-by: Ben Clifford <[email protected]> Use DFK in a `with` block in README (#3405) This functionality was introduced in #3260 and we should be encouraging people to use it in order to clean up nicely, after PR #3165 took away atexit handler cleanup. WorkQueueExecutor: remove atexit handler (#3407) This PR removes the atexit handler of WorkQueueExecutor as discussed in #3334. Co-authored-by: Ben Clifford <[email protected]> TaskVineExecutor: remove atexit handler (#3406) This PR removes the atexit handler of TaskVineExecutor as discussed in #3334. Fix variable name in user guide example (#3408) Close monitoring processes at shutdown, freeing some fds (#3410) Before this PR, pytest parsl/tests/test_monitoring/ leaves 1093 fds open (which is too much for the test suite to complete in some default configurations of max fds 1024) After this PR, that test run leaves 895 fds open. Remove vestigial monitoring parameter (#3411) The use of this parameter was removed in PR #3346 Correct grammar in monitoring docstring (#3413) Remove a TODO for pluggable block error handling (#3412) This TODO was implemented in PR #2858 Unregister atexit hook at cleanup (#3416) This hook is only to deal with reminder the user about cleanup and is otherwise unnecessary. The exit warning about cleanup state is now switched based on the presence of the atexit handler or not, rather than the cleanup called variable. Effects of this PR: * removed log noise * one less garbage collector reference to the DFK - although that won't directly lead to the DFK being collected earlier, it might in combination with other future work. Update HTEX to report manager version info (#3417) htex.connected_managers will now report `python_version` and `parsl_version` for each connected manager. 
Fixing circular dependency (#3424) Fix `slurm` command reference in userguide overview (#3425) Fixes #3419 Pin radical.utils version, to work with pinned radical.pilot (#3431) On 10th May 2024, a new version of radical.utils was released, which does not work with the pinned radical.pilot==1.52.1 used by Parsl. I also tried upgrading radical.pilot to 1.60, released alongside the new radical.utils, but that does not work for other reasons documented in the below issue. See Parsl issue #3429 Remove double init of some HTEX/monitoring attributes (#3433) These have been initialised in the ParslExecutor base class since #3361 Use concurrency in the README (#3428) Remove return bool from Channel.close() (#3403) Theres was no defined meaning to the return bool either in the code or in the human text: - no code calls close to give an example of interpretation - it's not success/fail because local channel returns False to indicate that it didn't need to do anything, not that there was a failure. Other .close() style methods return None and raise an exception if there is a problem. This PR pushes Channel.close() in this direction. A separate PR will actually invoke this .close() method. Factor radio selection in monitoring worker-side code (#3432) Co-authored-by: Sicheng Zhou <[email protected]> Replace SingleNodeLauncher with SimpleLauncher for MPI Mode (#3418) Corrects a problem in the MPI mode implementation of HTEx. SingleNodeLauncher creates too many workers when running in MPI mode because it launches multiple copies of the worker pool onto the head node. We want SimpleLauncher, which (instead) launches a single copy of the worker pool. 
Co-authored-by: Ben Clifford <[email protected]> Remove unused logging import in MPI test (#3436) Co-authored-by: Logan Ward <[email protected]> Log commands that the interchange does not recognise, as an error (#3434) The interchange command protocol is tied to the submit side closely, and it is generally a programmer or deployment error rather than intentional cross-version compatibility that would result in this unknown command code path firing. Add an MPIExecutor (#3423) Add MPIExecutor -- a wrapper class over HTEx which fixes or removes options irrelevant when enable_mpi_mode=True. Remove $ when writing job name in LocalProvider (#3440) Fixes #3437 Add CONTRIBUTING note on local test config framework (#3438) Co-authored-by: Logan Ward <[email protected]> Rename hub_port in executors to clarify it is for ZMQ (#3439) This follows a rename in PR #3266 which focused on making the same change inside parsl/monitoring/ A final use of hub_port is left in place in the MonitoringHub constructor because it is user-facing - future work on monitoring radio plugins might change this significantly (or not) so I am leaving it as is for now. Extend dependency error change in #2989 to remaining two dependency types (#3445) PR #2989 made dependency errors look nicer, but only did it for positional parameters. This PR extends that to kwarg and inputs=... dependencies. PR #2989 describes these changes in more depth. This PR adds a type annotation onto DFK._unwrap_futures that drives this change - preventing _unwrap_futures from returning None task ids for failed dependencies and instead requiring the task IDs to be strings. Turn flake8 CONTRIBUTING note into its own section (#3450) Merge conflicting CONTRIBUTING sections on the best place for discussion (#3449) Fix misformatted 'Credits and Contributions' section heading (#3448) Prior to this PR, there were not enough - symbols so GitHub's RST renderer did not recognise this as a heading. 
Remove incorrect historical reference to Parsl semantic versioning (#3446) Remove issue number recommendation from CONTRIBUTING branch names (#3447) Branch names are largely irrelevant in the squash PR model: PR text provides much richer context; branch names do not find their way into permanent history; and we have been successfully operating with arbitrary branch name choices without trouble for a long time. Update CoC link in README (#3453) Updating the CoC link in the README's Code of Conduct section reflects recent updates. I've linked it to the CoC tab in parsl/parsl so users aren't unnecessarily routed to the .github repo. Delete CODE_OF_CONDUCT.md (#3452) The CoC has been added to the .github repo as a community health file (CHF) – which means it will now appear in all of Parsl's repos without needing to create separate CoC files in each. I've tested it and it is appearing in all of our public repos that do not have their own CoC file. CHF files help automate and standardize various aspects of our project's development and commun…
Description
This PR introduces the ability to track file provenance to Parsl. The provenance information includes:
This feature uses the existing monitoring infrastructure to capture the requisite information and store it in a database. The file provenance information can also be accessed via the parsl-visualizer interface. An additional keyword argument, `capture_file_provenance`, has been added to the MonitoringHub to enable or disable provenance tracking; it defaults to `False`, which leaves provenance tracking turned off.
While this type of file information can be tracked manually for smaller workflows, larger workflows require an automated solution, like this one.
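A minimal configuration sketch of enabling the feature. The executor choice and hub address here are illustrative defaults; only the `capture_file_provenance` keyword comes from this PR:

```python
import parsl
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.monitoring import MonitoringHub

config = Config(
    executors=[HighThroughputExecutor()],
    monitoring=MonitoringHub(
        hub_address="127.0.0.1",
        capture_file_provenance=True,  # new in this PR; defaults to False
    ),
)

with parsl.load(config):
    ...  # apps run here; file provenance is recorded alongside other monitoring data
```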
Changed Behaviour
In general, users will not see a change in Parsl's behavior, as the provenance tracking code is turned off by default. Enabling it only gathers additional information in the background. The only change users should outwardly notice is the workflow slowdown typically seen when using the monitoring framework.
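The information gathered in the background lands in the monitoring database (an SQLite file, typically `monitoring.db` under the run directory). The names of the provenance-related tables are not spelled out in this PR description, so here is a small sketch for discovering them after a run:

```python
import sqlite3
from contextlib import closing


def list_monitoring_tables(db_path: str) -> list[str]:
    """Return the table names in a Parsl monitoring database.

    Useful for discovering the provenance-related tables this PR adds,
    whose exact names are not given in the PR description.
    """
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")
        return [name for (name,) in rows]


# Typical usage after a monitored run (path assumes the usual monitoring default):
# print(list_monitoring_tables("runinfo/monitoring.db"))
```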
Fixes
This fixes #3711
Type of change