updated derived metrics interface #898
Conversation
* early draft of Python/CFFI interface to derived metrics/accumulators described in:
  - gh-642
  - gh-677
* for now this definitely doesn't work, and feels like I'm basically reconstituting a C control flow in CFFI/Python instead of using a sensible exposure point between C and Python to pull out populated structs from a single entry point
* perhaps folks can just help me sort out the current issues noted in the source changes rather than providing a convenient API, though once thorough regression tests are in place that might be something to consider in the future... (or even just maintaining it in `pandas`/Python someday if the perf is ~similar)
* needed to expose `darshan_file_category_counters` so that its size is available for usage in the `darshan_derived_metrics` struct (a simplified CFFI illustration of why this matters follows below)
* some improvements to the renamed `log_get_derived_metrics()` function, including a first draft of error handling; the renamed `test_derived_metrics_basic` test was also adjusted to expect an error for the `LUSTRE` module
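To illustrate why that exposure matters, here is a simplified, generic CFFI sketch (the declarations are stand-ins, not darshan's actual headers): `ffi.new()` can only allocate the outer struct if the embedded struct used for its array member is also declared in the `cdef`.

```python
import cffi

ffi = cffi.FFI()
# Simplified stand-ins, NOT the real darshan definitions: the inner struct
# must be declared so the outer struct's array member has a known size.
ffi.cdef("""
    struct file_category_counters {
        long count;
        long total_bytes;
    };
    struct derived_metrics_example {
        double agg_perf_by_slowest;
        struct file_category_counters category_counters[4];
    };
""")

metrics = ffi.new("struct derived_metrics_example *")  # allocation now knows the full size
metrics.category_counters[0].count = 1
```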
- for record buffer in accumulator test
- for error path in another test
* `log_get_derived_metrics()` was adjusted to inject all the records for a given module, because this is our initial target to replicate the stats in the old perl summary report
* a new `log_get_bytes_bandwidth()` function was drafted in as a convenience wrapper to get MiB (total bytes) and bandwidth (MiB/s) values printed out in the old perl report
* renamed the regression test for this PR and adjusted it to compare against the bytes/bandwidth strings present in the perl reports; so far, only a small subset of the STDIO results are working properly (see the xfails in this test..)
* `log_get_bytes_bandwidth()` has been simplified substantially as it no longer needs to generate a new `report` object, and can instead use the `agg_perf_by_slowest` structure member
* the `xfail` marks have all been removed from `test_derived_metrics_bytes_and_bandwidth()`--those cases all pass now thanks to the above changes
* remove some debug prints and unused code from `log_get_derived_metrics()`
* a number of additional `MPI-IO` and `STDIO` test cases were added from the logs repo to `test_derived_metrics_bytes_and_bandwidth()`
* for the `MPI-IO` cases to pass, special casing was added to `log_get_bytes_bandwidth()` such that `total_bytes` is actually extracted from `POSIX`
* removed an invalid `darshan_free()` from `log_get_derived_metrics()`--the `buf` object didn't even exist at that point in the control flow
* add a `LUSTRE` test case, which raises a `RuntimeError` as expected
* add a tentative `POSIX` test case, which reports a bandwidth string at the Python level, but is not included in the Perl summary reports...
* when `log_get_derived_metrics()` receives a module name that doesn't exist in the log file it received, it will now raise a `ValueError` for clarity of feedback
* update `test_derived_metrics_bytes_and_bandwidth()` accordingly, and also start regex matching on expected error messages in this test
* add the bandwidth summary string to the Python report proper, and include a test for the presence of this string in logs repo-based summary reports
* add one of the tricky `APMPI` cases I discovered to `test_derived_metrics_bytes_and_bandwidth()`, pending discussion with team re: how I should handle this
* adjust tests to more closely match `darshan-parser` instead of the old Perl report in cases where MPI-IO and POSIX are both involved; this allows me to remove the weird MPI-IO shim in `log_get_bytes_bandwidth()`
* the bandwidth text in the Python summary report is now colored "blue," along with a regression test, based on reviewer feedback
* added `skew-app.darshan` log to `test_derived_metrics_bytes_and_bandwidth()`--we get the same results as `darshan-parser`
* replaced the `xfail` for `e3sm_io_heatmap_only.darshan` with an expected `KeyError` when handling `APMPI` (this should already be handled gracefully/ignored by the Python summary report)
* `cffi_backend` module changes requested from PR review:
  - remove a spurious `darshan_free` from `_log_get_heatmap_record()`
  - fix the scoping of the `darshan_free` of the `buf` object used with `darshan_accumulator_inject` in `log_get_derived_metrics`
  - add a missing `log_close()` to `log_get_derived_metrics` (maybe we can wrap in Python contexts in the future though)
  - use a separate buffer for `darshan_accumulator_emit()` inside `log_get_derived_metrics`
* note that making the above CFFI/free-related changes caused a segfault in the test suite, so in the end I adjusted the location of the memory freeing as I saw fit to avoid segfaults--I'd say at this point please provide concrete evidence with a memory leak plot or failing test for additional adjustments there, or just push the change in
* in the end, there is a slightly more concise usage of `darshan_free()` but no meaningful change in the free operations
* I also reverted the suggested change to `darshan_accumulator_emit()` usage--there was no testable evidence of an issue, and it was also causing segfaults..
* address many of the discussion points that came up in gh-868:
  - `log_get_derived_metrics()` now uses an LRU cache, which effectively means that we use memoization to return derived metrics data rather than doing another pass over the log file if the same log path and module name have already been accumulated from; we still need to pass over a given log twice in most cases--once at initial read-in and once for using `log_get_derived_metrics`; how we decide to add filtering of records prior to the accumulation interface in Python is probably a deeper discussion/for later (a rough sketch of this memoization idea is included just below)
  - `log_get_bytes_bandwidth()` and its associated testing have been migrated to modules not named after "CFFI", like in the above PR, because I think we should only use the "CFFI"-named modules for direct CFFI interaction/testing, and for other analyses we should probably use more distinct names. Also, to some extent everything depends on the CFFI layer, so trying to restrict the "CFFI" modules to direct rather than indirect interaction will help keep them manageably sized, especially given the proclivity for surprising memory issues/segfaults in those parts of the code.
  - add a proper docstring with examples for `log_get_bytes_bandwidth()`
[skip actions]
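A rough sketch of the memoization idea referenced in the gh-868 bullet above (illustrative only; the decorator usage is not the exact implementation, and this caching was removed again in the following commit):

```python
import functools

@functools.lru_cache(maxsize=None)
def log_get_derived_metrics(log_path, mod_name):
    # Illustrative stand-in body: the real function accumulates the module's
    # records from the log file; the cache means the expensive pass only
    # happens once per (log_path, mod_name) pair.
    print(f"accumulating {mod_name} records from {log_path}")
    return object()  # stand-in for the derived metrics struct

m1 = log_get_derived_metrics("example.darshan", "STDIO")  # does the work
m2 = log_get_derived_metrics("example.darshan", "STDIO")  # served from the cache
assert m1 is m2
```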
* `log_get_derived_metrics()` changes:
  - update input parameters to what's currently needed for converting a dataframe to a raw record buffer
  - remove the code for opening a log to read records, and replace it with code to convert the dataframe to a record buffer
  - remove the lru cache, as it is probably no longer appropriate
* Darshan job summary changes:
  - add code to get derived metrics from a module's record dataframe
  - to avoid various errors related to attempting to accumulate modules that don't support it, only accumulate modules that support it
* `log_get_bytes_bandwidth()` changes:
  - take a derived metrics struct as input
* updated `lib/accum` tests:
  - update to open a log report and read in records explicitly
  - some expected error types changed as a result of the refactor (an end-to-end sketch of the refactored flow follows below)
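For orientation, here is a rough end-to-end sketch of the refactored flow. The log path is a placeholder, the import locations of the two helpers are assumptions, and the printed string is only illustrative; the call pattern mirrors the updated tests further down.

```python
import darshan
# import paths assumed -- adjust to wherever these helpers live after this PR
from darshan.backend.cffi_backend import log_get_derived_metrics
from darshan.lib.accum import log_get_bytes_bandwidth

log_path = "example.darshan"  # placeholder path
mod_name = "STDIO"

with darshan.DarshanReport(log_path, read_all=True) as report:
    # read the module's records as DataFrames, then hand them to the accumulator
    report.mod_read_all_records(mod_name, dtype="pandas")
    rec_dict = report.records[mod_name][0]
    mod_idx = report.modules[mod_name]["idx"]
    nprocs = report.metadata["job"]["nprocs"]

derived_metrics = log_get_derived_metrics(rec_dict, mod_name, mod_idx, nprocs)
# e.g. something like "transferred X MiB at Y MiB/s" for the chosen module
print(log_get_bytes_bandwidth(derived_metrics=derived_metrics, mod_name=mod_name))
```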
I think I'm fine with most of it, but then again a lot of this is effectively improved versions of my original rough code so I may be biased.
The blue text bandwidth strings are still showing up in the Python summary reports for me on this branch, which is a good sign.
#     with pytest.raises(ValueError,
#                        match=f"{mod_name} is not in the available log"):
#         log_get_bytes_bandwidth(derived_metrics=derived_metrics,
#                                  mod_name=mod_name)
For this test, if we want to keep the error handling cases, this seems to fixup locally:
index 17b0e0e3..047e4d56 100644
--- a/darshan-util/pydarshan/darshan/report.py
+++ b/darshan-util/pydarshan/darshan/report.py
@@ -661,6 +661,9 @@ class DarshanReport(object):
cn = backend.counter_names(mod)
fcn = backend.fcounter_names(mod)
+ if mod not in self._modules:
+ raise ValueError(f"mod {mod} is not available in this DarshanReport object.")
+
# update module metadata
self._modules[mod]['num_records'] = 0
if mod not in self.counters:
diff --git a/darshan-util/pydarshan/darshan/tests/test_lib_accum.py b/darshan-util/pydarshan/darshan/tests/test_lib_accum.py
index 4e46dc7e..6e3ec020 100644
--- a/darshan-util/pydarshan/darshan/tests/test_lib_accum.py
+++ b/darshan-util/pydarshan/darshan/tests/test_lib_accum.py
@@ -69,28 +69,27 @@ def test_derived_metrics_bytes_and_bandwidth(log_path, mod_name, expected_str):
# test the basic scenario of retrieving
# the total data transferred and bandwidth
# for all records in a given module; the situation
- # of accumulating drived metrics with filtering
+ # of accumulating derived metrics with filtering
# (i.e., for a single filename) is not tested here
log_path = get_log_path(log_path)
with darshan.DarshanReport(log_path, read_all=True) as report:
- report.mod_read_all_records(mod_name, dtype="pandas")
- rec_dict = report.records[mod_name][0]
- mod_idx = report.modules[mod_name]['idx']
- nprocs = report.metadata['job']['nprocs']
+ if expected_str == "ValueError":
+ with pytest.raises(ValueError,
+ match=f"mod {mod_name} is not available"):
+ report.mod_read_all_records(mod_name, dtype="pandas")
+ else:
+ report.mod_read_all_records(mod_name, dtype="pandas")
+ rec_dict = report.records[mod_name][0]
+ mod_idx = report.modules[mod_name]['idx']
+ nprocs = report.metadata['job']['nprocs']
- if expected_str == "RuntimeError":
- with pytest.raises(RuntimeError,
- match=f"{mod_name} module does not support derived"):
- log_get_derived_metrics(rec_dict, mod_name, mod_idx, nprocs)
- else:
- derived_metrics = log_get_derived_metrics(rec_dict, mod_name, mod_idx, nprocs)
- actual_str = log_get_bytes_bandwidth(derived_metrics=derived_metrics,
- mod_name=mod_name)
- assert actual_str == expected_str
-
-# elif expected_str == "ValueError":
-# with pytest.raises(ValueError,
-# match=f"{mod_name} is not in the available log"):
-# log_get_bytes_bandwidth(derived_metrics=derived_metrics,
-# mod_name=mod_name)
+ if expected_str == "RuntimeError":
+ with pytest.raises(RuntimeError,
+ match=f"{mod_name} module does not support derived"):
+ log_get_derived_metrics(rec_dict, mod_name, mod_idx, nprocs)
+ else:
+ derived_metrics = log_get_derived_metrics(rec_dict, mod_name, mod_idx, nprocs)
+ actual_str = log_get_bytes_bandwidth(derived_metrics=derived_metrics,
+ mod_name=mod_name)
+ assert actual_str == expected_str
Thanks, I added this patch with 5d9e362 and it looks to be working fine.
    return buf

def log_get_derived_metrics(rec_dict, mod_name, mod_idx, nprocs):
I suppose in the future we could allow an optional kwarg for custom filtering operations on the DataFrames, which then gets passed to a similar optional kwarg for `_df_to_rec()`.
For now, I think we're focused on just getting the standard accum elements into the summary report, but I'm just thinking ahead to make sure we don't lock ourselves into a design that is harder to apply custom DataFrame filters to.
I suppose, however, that for supporting arbitrary slicing and dicing of DataFrames we may want to just allow folks to manually operate on the two DataFrames in `rec_dict` and then pass it in here, since a pass-through API that can achieve any kind of filtering op may be more work than it is worth.
Perhaps some pre-canned filtering ops will be built-in for us, like an argument for a regex to filter on filenames or something.
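To make that concrete, a hypothetical pre-canned filter might look roughly like this (none of this exists yet; the `name_records` id-to-filename mapping and the `"id"` column are assumptions about the pandas-dtype records):

```python
import re

def filter_by_filename_regex(rec_dict, name_records, pattern):
    # Hypothetical helper: keep only records whose filename (looked up via the
    # record id -> filename mapping) matches the regex, applying the same mask
    # to both DataFrames in rec_dict before accumulation.
    keep_ids = {rec_id for rec_id, fname in name_records.items()
                if re.search(pattern, fname)}
    return {key: df[df["id"].isin(keep_ids)] for key, df in rec_dict.items()}
```

Usage would then be something like `filter_by_filename_regex(rec_dict, report.name_records, r"\.h5$")` before handing the result to `log_get_derived_metrics()` (the `name_records` attribute name is likewise an assumption here).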
I did think about this briefly and just assumed we'd expect users to do the filtering directly on the dataframes before having PyDarshan operate on them, similar to your approach in the `test_reverse_record_array` test case. I'd definitely be open to supporting this more directly in our APIs if we think it'll avoid some complexity and simplify things for users. I guess users have to filter counters/fcounters separately, so avoiding that and instead passing in some kwargs to handle it internally could be useful if it's not a pain to get working for different use cases.
    r = libdutil.darshan_accumulator_emit(darshan_accumulator[0],
                                          derived_metrics,
                                          total_record)
    libdutil.darshan_accumulator_destroy(darshan_accumulator[0])
I suppose we could check for a `0` retcode here, though the function appears to have no error handling anyway, not even in the form of a non-zero exit code.
Yeah, I just punted on error checking that for a couple of reasons:

- Internal knowledge that the `destroy()` call doesn't actually have an error path despite having a return code -- it's basically just freeing some stuff, so if there's an error it's likely going to be a crash.
- We want to call `destroy()` before potentially raising an exception related to the `emit()` call returning a failure, just to make sure we clean up C memory. It just becomes a little complicated/verbose to unwind the errors from back-to-back calls like that, so I just opted for simplicity with the internal knowledge from above (a rough sketch of what the checks could look like is below, for reference).
"It may be possible " | ||
"to retrieve additional information from the stderr " | ||
"stream.") | ||
derived_metrics = ffi.new("struct darshan_derived_metrics *") |
The CFFI stuff is of course a bit awkward, especially given how many different entrypoints to the C layer we need to interact with just to get a single final struct back, but I'd probably be inclined to just live with it if we just operate on DataFrames or Arrow in-memory stuff in a few years anyway.
It occurs to me that for some things, like `struct darshan_derived_metrics`, we could probably avoid CFFI traversal by using the bytes from a NumPy `recarray`, but this would be more verbose with minimal gain for now I suspect.
I just pushed one more commit that adds a helper for mapping between module names and indices, and uses this logic internally.

I think I'm satisfied with the state of things here and am willing to merge things in, assuming there are no other critical concerns to address? I'm not sure if it makes any difference for attribution purposes, but I could cherry-pick these commits back onto your original PR @tylerjereddy if it matters any to you? Otherwise, I just requested a review here and we can merge this in directly and close #839.
The last two commits look fine to me. I suppose we could debate about using a dictionary to retrieve the module index.

Feel free to merge this PR if that is easiest, and if you want to squash the commits or whatever I don't mind.
Mostly for discussion for now, see commits for more details.
One thing I'd like to clean up is to add some sort of helper logic for mapping between module names and their C library module identifiers (integers). It's annoying to maintain that in Python, but otherwise we don't really have any way to use the accumulator API unless we have an open Darshan log we can use to find the underlying module identifier. For now, the Python accumulator interface is ugly since it requires you to give both the module name and ID.
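As a rough illustration of that helper idea (the function name and its reliance on an already-open report are assumptions, not necessarily what landed in the final commit):

```python
import darshan

def module_name_to_idx(report, mod_name):
    # Hypothetical helper: look up the C library module identifier from an
    # already-open DarshanReport instead of maintaining the integer IDs in Python.
    if mod_name not in report.modules:
        raise ValueError(f"mod {mod_name} is not available in this DarshanReport object.")
    return report.modules[mod_name]["idx"]

# usage sketch; "example.darshan" is a placeholder path
with darshan.DarshanReport("example.darshan", read_all=True) as report:
    mod_idx = module_name_to_idx(report, "POSIX")
```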