
Conversation

shanedsnyder

Mostly for discussion for now, see commits for more details.

One thing I'd like to clean up is adding some sort of helper logic for mapping between module names and their C library module identifiers (integers). It's annoying to maintain that in Python, but otherwise we don't really have any way to use the accumulator API unless we have an open Darshan log we can use to find the underlying module identifier. For now, the Python accumulator interface is ugly since it requires you to give both the module name and the ID.
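
For reference, a rough sketch of that current usage (the log path is a placeholder and the import path for `log_get_derived_metrics` is assumed): the integer module identifier has to be fished out of an open report before the accumulator interface can be used.

import darshan
from darshan.backend.cffi_backend import log_get_derived_metrics  # import path assumed

with darshan.DarshanReport("example.darshan", read_all=True) as report:
    report.mod_read_all_records("POSIX", dtype="pandas")
    rec_dict = report.records["POSIX"][0]        # dict of counters/fcounters DataFrames
    mod_idx = report.modules["POSIX"]["idx"]     # underlying C library module identifier
    nprocs = report.metadata["job"]["nprocs"]

# both the module name and its identifier are currently required
derived_metrics = log_get_derived_metrics(rec_dict, "POSIX", mod_idx, nprocs)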

tylerjereddy and others added 22 commits December 16, 2022 08:59
* early draft of Python/CFFI interface to
derived metrics/accumulators described in:
- gh-642
- gh-677

* for now this definitely doesn't work, and it feels like I'm basically
reconstituting C control flow in CFFI/Python instead of using a sensible
exposure point between C and Python to pull populated structs out of
a single entry point

* perhaps folks can just help me sort out the current issues noted
in the source changes rather than providing a convenient API, though
once thorough regression tests are in place that might be something
to consider in the future... (or even just maintaining it in
`pandas`/Python someday if the perf is ~similar)
* needed to expose `darshan_file_category_counters` so that its
size is available for use in the `darshan_derived_metrics` struct

* some improvements to the renamed `log_get_derived_metrics()` function,
including a first draft of error handling; the renamed
`test_derived_metrics_basic` test was also adjusted to expect an
error for the `LUSTRE` module
- for record buffer in accumulator test
- for error path in another test
* `log_get_derived_metrics()` was adjusted to inject
all the records for a given module, because this is our
initial target to replicate the stats in the old
perl summary report

* a new `log_get_bytes_bandwidth()` function was drafted
in as a convenience wrapper to get MiB (total bytes) and
bandwidth (MiB/s) values printed out in the old perl report

* renamed the regression test for this PR and adjusted
it to compare against the bytes/bandwidth strings
present in the perl reports; so far, only a small
subset of the STDIO results are working properly
(see the xfails in this test..)
* `log_get_bytes_bandwidth()` has been simplified
substantially as it no longer needs to generate a
new `report` object, and can instead use the
`agg_perf_by_slowest` structure member

* the `xfail` marks have all been removed from
`test_derived_metrics_bytes_and_bandwidth()`--those
cases all pass now thanks to the above changes

* remove some debug prints and unused code from
`log_get_derived_metrics()`
* a number of additional `MPI-IO` and `STDIO` test cases
were added from the logs repo to `test_derived_metrics_bytes_and_bandwidth()`

* for the `MPI-IO` cases to pass, special casing was added
to `log_get_bytes_bandwidth()` such that `total_bytes` is
actually extracted from `POSIX`
* removed an invalid `darshan_free()` from `log_get_derived_metrics()`--
the `buf` object didn't even exist at that point in the control flow

* add a `LUSTRE` test case, which raises a `RuntimeError` as expected

* add a tentative `POSIX` test case, which reports a bandwidth string
at the Python level, but is not included in the Perl summary reports...
* when `log_get_derived_metrics()` receives a
module name that doesn't exist in the log
file it received, it will now raise a `ValueError`
for clarity of feedback

* update `test_derived_metrics_bytes_and_bandwidth()`
accordingly, and also start regex matching on expected
error messages in this test
* add the bandwidth summary string to the Python
report proper, and include a test for the presence
of this string in logs repo-based summary reports
* add one of the tricky `APMPI` cases I discovered
to `test_derived_metrics_bytes_and_bandwidth()`, pending
discussion with team re: how I should handle this
* adjust tests to more closely match `darshan-parser` instead
of the old Perl report in cases where MPI-IO and POSIX are
both involved; this allows me to remove the weird MPI-IO
shim in `log_get_bytes_bandwidth()`
* the bandwidth text in the Python summary report is now
colored "blue," along with a regression test, based on
reviewer feedback

* added `skew-app.darshan` log to
`test_derived_metrics_bytes_and_bandwidth()`--we get the same
results as `darshan-parser`

* replaced the `xfail` for `e3sm_io_heatmap_only.darshan` with
an expected `KeyError` when handling `APMPI` (this should already
be handled gracefully/ignored by the Python summary report)
* the testsuite now always uses `DarshanReport` with a context
manager to avoid shenanigans with `__del__` and garbage
collection/`pytest`/multiple threads

* this appears to fix the problem with testsuite hangs
described in gh-839 and gh-851
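
A minimal usage sketch of that pattern (log name is a placeholder): the context
manager closes the underlying log handle when the block exits, rather than
relying on `__del__` firing during garbage collection or pytest teardown.

import darshan

with darshan.DarshanReport("example.darshan", read_all=True) as report:
    nprocs = report.metadata["job"]["nprocs"]
# the log handle is released deterministically here
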
* `cffi_backend` module changes requested from PR review
  - remove a spurious `darshan_free` from `_log_get_heatmap_record()`
  - fix the scoping of the `darshan_free` of `buf` object used with
    `darshan_accumulator_inject` in `log_get_derived_metrics`
  - add a missing `log_close()` to `log_get_derived_metrics` (maybe
    we can wrap it in Python contexts in the future though)
  - use a separate buffer for `darshan_accumulator_emit()` inside
    `log_get_derived_metrics`

* note that making the above CFFI/free-related changes caused
a segfault in the testsuite, so in the end I adjusted the location
of the memory freeing as I saw fit to avoid segfaults--I'd say at this
point please provide concrete evidence (a memory leak plot or a
failing test) for additional adjustments there, or just push the change
in

* in the end, there is a slightly more concise usage of `darshan_free()`
but no meaningful change in the free operations

* I also reverted the suggested changes to `darshan_accumulator_emit()`
usage--there was no testable evidence of an issue, and it was also
causing segfaults...

* address many of the discussion points that came up in gh-868:
  - `log_get_derived_metrics()` now uses an LRU cache, which effectively
    means that we use memoization to return derived metrics data
    rather than doing another pass over the log file if the same
    log path and module name have already been accumulated from; we
    still need to pass over a given log twice in most cases--once at
    initial read-in and once when using `log_get_derived_metrics`; how
    we decide to add filtering of records prior to the accumulation
    interface in Python is probably a deeper discussion for later
  - `log_get_bytes_bandwidth()` and its associated testing have been
     migrated to modules not named after "CFFI", like in the above
     PR, because I think we should only use the "CFFI"-named modules
     for direct CFFI interaction/testing, and for other analyses we
     should probably use more distinct names. Also, to some extent
     everything depends on the CFFI layer, so trying to restrict
     "CFFI" modules to direct rather than indirect interaction will
     help keep them manageably sized, especially given the proclivity
     for surprising memory issues/segfaults in those parts of the code.
  - add a proper docstring with examples for `log_get_bytes_bandwidth()`
[skip actions]
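
A minimal sketch of the memoization idea above, assuming the interface at this
stage took a log path and a module name (the decorator usage is the point here;
the body is elided, and a later commit removes the cache once the function
starts taking DataFrames, which are not hashable).

from functools import lru_cache

@lru_cache(maxsize=None)
def log_get_derived_metrics(log_path: str, mod_name: str):
    # the real function opens the log, injects all of mod_name's records
    # into a darshan_accumulator, emits the derived metrics struct, and
    # returns it; lru_cache hands back the cached struct on a repeat call
    # with the same (log_path, mod_name) pair instead of re-reading the log
    ...
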
* `log_get_derived_metrics()` changes:
    - update input parameters for what's currently needed for
      conversion to a raw buffer from a dataframe
    - remove code for opening log to read records, and replace
      with code to convert dataframe to a record buffer
    - remove lru cache as probably no longer appropriate
* Darshan job summary changes:
    - add code to get derived metrics from a module's record
      dataframe
    - to avoid various errors related to attempting to accumulate
      modules that don't support it, only accumulate modules that
      do support it
* `log_get_bytes_bandwidth()` changes:
    - take a derived metrics struct as input
* updated `lib/accum` tests
    - update to open a log report and read in records explicitly
    - some expected error types changed as a result of refactor
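
With that refactor, the wrapper only needs to format an already-accumulated
struct. A hedged sketch, assuming the struct exposes the `total_bytes` and
`agg_perf_by_slowest` fields referenced in earlier commits (the exact summary
string is illustrative, not the merged one):

def log_get_bytes_bandwidth(derived_metrics, mod_name):
    # total bytes moved, reported in MiB, and the aggregate bandwidth
    # estimate from the slowest rank/file heuristic, in MiB/s
    total_mib = derived_metrics.total_bytes / 2 ** 20
    bandwidth_mib_s = derived_metrics.agg_perf_by_slowest
    return (f"I/O performance estimate (at the {mod_name} layer): "
            f"transferred {total_mib:.1f} MiB at {bandwidth_mib_s:.2f} MiB/s")
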
@tylerjereddy (Collaborator) left a comment

I think I'm fine with most of it, but then again a lot of this is effectively improved versions of my original rough code so I may be biased.

The blue text bandwidth strings are still showing up in the Python summary reports for me on this branch, which is a good sign.

#        with pytest.raises(ValueError,
#                           match=f"{mod_name} is not in the available log"):
#            log_get_bytes_bandwidth(derived_metrics=derived_metrics,
#                                    mod_name=mod_name)
Collaborator


For this test, if we want to keep the error handling cases, this seems to fix things up locally:

diff --git a/darshan-util/pydarshan/darshan/report.py b/darshan-util/pydarshan/darshan/report.py
index 17b0e0e3..047e4d56 100644
--- a/darshan-util/pydarshan/darshan/report.py
+++ b/darshan-util/pydarshan/darshan/report.py
@@ -661,6 +661,9 @@ class DarshanReport(object):
         cn = backend.counter_names(mod)
         fcn = backend.fcounter_names(mod)
 
+        if mod not in self._modules:
+            raise ValueError(f"mod {mod} is not available in this DarshanReport object.")
+
         # update module metadata
         self._modules[mod]['num_records'] = 0
         if mod not in self.counters:
diff --git a/darshan-util/pydarshan/darshan/tests/test_lib_accum.py b/darshan-util/pydarshan/darshan/tests/test_lib_accum.py
index 4e46dc7e..6e3ec020 100644
--- a/darshan-util/pydarshan/darshan/tests/test_lib_accum.py
+++ b/darshan-util/pydarshan/darshan/tests/test_lib_accum.py
@@ -69,28 +69,27 @@ def test_derived_metrics_bytes_and_bandwidth(log_path, mod_name, expected_str):
     # test the basic scenario of retrieving
     # the total data transferred and bandwidth
     # for all records in a given module; the situation
-    # of accumulating drived metrics with filtering
+    # of accumulating derived metrics with filtering
     # (i.e., for a single filename) is not tested here
 
     log_path = get_log_path(log_path)
     with darshan.DarshanReport(log_path, read_all=True) as report:
-        report.mod_read_all_records(mod_name, dtype="pandas")
-        rec_dict = report.records[mod_name][0]
-        mod_idx = report.modules[mod_name]['idx']
-        nprocs = report.metadata['job']['nprocs']
+        if expected_str == "ValueError":
+            with pytest.raises(ValueError,
+                               match=f"mod {mod_name} is not available"):
+                report.mod_read_all_records(mod_name, dtype="pandas")
+        else:
+            report.mod_read_all_records(mod_name, dtype="pandas")
+            rec_dict = report.records[mod_name][0]
+            mod_idx = report.modules[mod_name]['idx']
+            nprocs = report.metadata['job']['nprocs']
 
-    if expected_str == "RuntimeError":
-        with pytest.raises(RuntimeError,
-                           match=f"{mod_name} module does not support derived"):
-            log_get_derived_metrics(rec_dict, mod_name, mod_idx, nprocs)
-    else:
-        derived_metrics = log_get_derived_metrics(rec_dict, mod_name, mod_idx, nprocs)
-        actual_str = log_get_bytes_bandwidth(derived_metrics=derived_metrics,
-                                             mod_name=mod_name)
-        assert actual_str == expected_str
-
-#    elif expected_str == "ValueError":
-#        with pytest.raises(ValueError,
-#                           match=f"{mod_name} is not in the available log"):
-#            log_get_bytes_bandwidth(derived_metrics=derived_metrics,
-#                                    mod_name=mod_name)
+            if expected_str == "RuntimeError":
+                with pytest.raises(RuntimeError,
+                                   match=f"{mod_name} module does not support derived"):
+                    log_get_derived_metrics(rec_dict, mod_name, mod_idx, nprocs)
+            else:
+                derived_metrics = log_get_derived_metrics(rec_dict, mod_name, mod_idx, nprocs)
+                actual_str = log_get_bytes_bandwidth(derived_metrics=derived_metrics,
+                                                     mod_name=mod_name)
+                assert actual_str == expected_str

@shanedsnyder (Author) Feb 13, 2023

Thanks, I added this patch in 5d9e362 and it looks to be working fine.

    return buf


def log_get_derived_metrics(rec_dict, mod_name, mod_idx, nprocs):
Collaborator


I suppose in the future we could allow an optional kwarg for custom filtering operations on the DataFrames, which then gets passed to a similar optional kwarg for _df_to_rec().

For now, I think we're focused on just getting the standard accum elements into the summary report, but I'm just thinking ahead to make sure we don't lock ourselves into a design that is harder to apply custom DataFrame filters to.

I suppose, however, that for supporting arbitrary slicing and dicing of DataFrames we may want to just allow folks to manually operate on the two DataFrames in rec_dict and then pass it in here, since a pass-through API that can achieve any kind of filtering op may be more work than it is worth.

Perhaps some pre-canned filtering ops will be built-in for us, like an argument for a regex to filter on filenames or something.

Author


I did think about this briefly and just assumed we'd expect users to do the filtering directly on the dataframes before having PyDarshan operate on them, similar to your approach in the test_reverse_record_array test case. I'd definitely be open to supporting this more directly in our APIs if we think it'll avoid some complexity and simplify things for users. I guess users have to filter counters/fcounters separately, so avoiding that and instead passing in some kwargs to handle it internally could be useful, if it's not a pain to get working for different use cases.
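
For example, a hedged sketch of that "filter first, then accumulate" flow, reusing the names from the test above (the "rank" column and the rank-0 filter are illustrative assumptions about the pandas record layout):

# both DataFrames in rec_dict have to be filtered consistently before being
# handed to the accumulator interface
filtered_rec_dict = {
    "counters": rec_dict["counters"][rec_dict["counters"]["rank"] == 0],
    "fcounters": rec_dict["fcounters"][rec_dict["fcounters"]["rank"] == 0],
}
derived_metrics = log_get_derived_metrics(filtered_rec_dict, mod_name,
                                          mod_idx, nprocs)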

    r = libdutil.darshan_accumulator_emit(darshan_accumulator[0],
                                          derived_metrics,
                                          total_record)
    libdutil.darshan_accumulator_destroy(darshan_accumulator[0])
Collaborator


I suppose we could check for a 0 retcode here, though the function appears to have no error handling anyway, not even in the form of a non-zero return code.

Author


Yeah, I just punted on error checking that for a couple of reasons:

  • Internal knowledge that the destroy() call doesn't actually have an error path despite having a return code -- it's basically just freeing some stuff, so if there's an error it's likely going to be a crash.
  • We want to call destroy() before potentially raising an exception related to the emit() call returning a failure just to make sure we clean up C memory. It just becomes a little complicated/verbose to unwind the errors from back-to-back calls like that, so I just opted for simplicity with the internal knowledge from above.
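
A minimal sketch of that ordering, reusing the names from the snippet above
(the retcode convention and error message are assumptions, not the merged
code): destroy the accumulator first so the C-side memory is released, then
raise if the emit reported a failure.

r_emit = libdutil.darshan_accumulator_emit(darshan_accumulator[0],
                                           derived_metrics,
                                           total_record)
# always destroy, even on the error path, to free the C-side allocation
libdutil.darshan_accumulator_destroy(darshan_accumulator[0])
if r_emit != 0:
    raise RuntimeError(f"darshan_accumulator_emit() failed with retcode {r_emit}")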

"It may be possible "
"to retrieve additional information from the stderr "
"stream.")
derived_metrics = ffi.new("struct darshan_derived_metrics *")
Collaborator


The CFFI stuff is of course a bit awkward, especially given how many different entry points to the C layer we need to interact with just to get a single final struct back, but I'd probably be inclined to just live with it if we're going to be operating on DataFrames or Arrow in-memory data in a few years anyway.

It occurs to me that for some things, like struct darshan_derived_metrics, we could probably avoid CFFI traversal by using the bytes from a NumPy recarray, but this would be more verbose with minimal gain for now, I suspect.
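
A hedged sketch of that recarray idea, reusing `ffi` and `derived_metrics` from
the snippet above (the dtype below is illustrative and does not reflect the
real struct darshan_derived_metrics layout): view the struct's bytes through a
NumPy structured dtype instead of walking fields through CFFI.

import numpy as np

metrics_dtype = np.dtype([("total_bytes", np.int64),
                          ("agg_perf_by_slowest", np.float64)])  # illustrative only
raw = ffi.buffer(derived_metrics, metrics_dtype.itemsize)
metrics = np.frombuffer(raw, dtype=metrics_dtype, count=1)
print(metrics["agg_perf_by_slowest"][0])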

@shanedsnyder (Author)

I just pushed one more commit that adds a helper for mapping module names to indices and uses that logic internally in log_get_derived_metrics() so that callers don't have to hold onto both module names and log format indices. Really, we should probably not require users to know what a module index is at all and have them interact entirely with module names for simplicity.
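
A hedged sketch of the kind of helper involved (the module-name list is passed
in here and is an assumption, not the merged code): resolve a module name to
its log-format index via list.index().

def _mod_name_to_idx(mod_name, mod_names):
    # mod_names: module names ordered to mirror the C library's module enum
    try:
        return mod_names.index(mod_name)
    except ValueError:
        raise ValueError(f"{mod_name} is not a known Darshan module name")

The list.index() lookup is the O(n) search mentioned in the review comment below; a dict would make it O(1), but the module list is tiny either way.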

I think I'm satisfied with the state of things here and am willing to merge this in, assuming there are no other critical concerns to address. I'm not sure if it makes any difference for attribution purposes, but I could cherry-pick these commits back onto your original PR @tylerjereddy if it matters to you? Otherwise, I've requested a review here and we can merge this in directly and close #839.

@tylerjereddy (Collaborator)

The last two commits look fine to me.

I suppose we could debate about using a dictionary to retrieve the mod_idx to get O(1) vs. the O(n) search from list.index(), but the sequence is so small, and unlikely to balloon, that it isn't worth the effort either way.

Feel free to merge this PR if that is easiest, and if you want to squash the commits or whatever I don't mind.

@shanedsnyder shanedsnyder changed the title WIP: updated dervied metrics interface WIP: updated derived metrics interface Feb 14, 2023
@shanedsnyder shanedsnyder changed the title WIP: updated derived metrics interface updated derived metrics interface Feb 14, 2023
@shanedsnyder shanedsnyder merged commit 0a4e9f2 into main Feb 14, 2023