
Conversation

shanedsnyder

Mostly for discussion for now, see commits for more details.

One thing I'd like to clean up is adding some sort of helper logic for mapping between module names and their C library module identifiers (integers). It's annoying to maintain that in Python, but otherwise we don't really have any way to use the accumulator API unless we have an open Darshan log we can use to find the underlying module identifier. For now, the Python accumulator interface is ugly since it requires you to give both the module name and the ID.
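
For reference, a rough sketch of that current usage (the log path is a placeholder and the import path for `log_get_derived_metrics` is assumed): the integer module identifier has to be fished out of an open report before the accumulator interface can be used.

import darshan
from darshan.backend.cffi_backend import log_get_derived_metrics  # import path assumed

with darshan.DarshanReport("example.darshan", read_all=True) as report:
    report.mod_read_all_records("POSIX", dtype="pandas")
    rec_dict = report.records["POSIX"][0]        # dict of counters/fcounters DataFrames
    mod_idx = report.modules["POSIX"]["idx"]     # underlying C library module identifier
    nprocs = report.metadata["job"]["nprocs"]

# both the module name and its identifier are currently required
derived_metrics = log_get_derived_metrics(rec_dict, "POSIX", mod_idx, nprocs)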

tylerjereddy and others added 22 commits December 16, 2022 08:59
* early draft of Python/CFFI interface to
derived metrics/accumulators described in:
- gh-642
- gh-677

* for now this definitely doesn't work, and it feels like I'm basically
reconstituting C control flow in CFFI/Python instead of using a sensible
exposure point between C and Python to pull populated structs out of
a single entry point

* perhaps folks can just help me sort out the current issues noted
in the source changes rather than providing a convenient API, though
once thorough regression tests are in place that might be something
to consider in the future... (or even just maintaining it in
`pandas`/Python someday if the perf is ~similar)
* needed to expose `darshan_file_category_counters` so that its
size is available for use in the `darshan_derived_metrics` struct

* some improvements to the renamed `log_get_derived_metrics()` function,
including a first draft of error handling; the renamed
`test_derived_metrics_basic` test was also adjusted to expect an
error for the `LUSTRE` module
- for record buffer in accumulator test
- for error path in another test
* `log_get_derived_metrics()` was adjusted to inject
all the records for a given module, because this is our
initial target to replicate the stats in the old
perl summary report

* a new `log_get_bytes_bandwidth()` function was drafted
in as a convenience wrapper to get MiB (total bytes) and
bandwidth (MiB/s) values printed out in the old perl report

* renamed the regression test for this PR and adjusted
it to compare against the bytes/bandwidth strings
present in the perl reports; so far, only a small
subset of the STDIO results are working properly
(see the xfails in this test..)
* `log_get_bytes_bandwidth()` has been simplified
substantially as it no longer needs to generate a
new `report` object, and can instead use the
`agg_perf_by_slowest` structure member

* the `xfail` marks have all been removed from
`test_derived_metrics_bytes_and_bandwidth()`--those
cases all pass now thanks to the above changes

* remove some debug prints and unused code from
`log_get_derived_metrics()`
* a number of additional `MPI-IO` and `STDIO` test cases
were added from the logs repo to `test_derived_metrics_bytes_and_bandwidth()`

* for the `MPI-IO` cases to pass, special casing was added
to `log_get_bytes_bandwidth()` such that `total_bytes` is
actually extracted from `POSIX`
* removed an invalid `darshan_free()` from `log_get_derived_metrics()`--
the `buf` object didn't even exist at that point in the control flow

* add a `LUSTRE` test case, which raises a `RuntimeError` as expected

* add a tentative `POSIX` test case, which reports a bandwidth string
at the Python level, but is not included in the Perl summary reports...
* when `log_get_derived_metrics()` receives a
module name that doesn't exist in the log
file it received, it will now raise a `ValueError`
for clarity of feedback

* update `test_derived_metrics_bytes_and_bandwidth()`
accordingly, and also start regex matching on expected
error messages in this test
* add the bandwidth summary string to the Python
report proper, and include a test for the presence
of this string in logs repo-based summary reports
* add one of the tricky `APMPI` cases I discovered
to `test_derived_metrics_bytes_and_bandwidth()`, pending
discussion with team re: how I should handle this
* adjust tests to more closely match `darshan-parser` instead
of the old Perl report in cases where MPI-IO and POSIX are
both involved; this allows me to remove the weird MPI-IO
shim in `log_get_bytes_bandwidth()`
* the bandwidth text in the Python summary report is now
colored "blue," along with a regression test, based on
reviewer feedback

* added `skew-app.darshan` log to
`test_derived_metrics_bytes_and_bandwidth()`--we get the same
results as `darshan-parser`

* replaced the `xfail` for `e3sm_io_heatmap_only.darshan` with
an expected `KeyError` when handling `APMPI` (this should already
be handled gracefully/ignored by the Python summary report)
* the testsuite now always uses `DarshanReport` with a context
manager to avoid shenanigans with `__del__` and garbage
collection/`pytest`/multiple threads

* this appears to fix the problem with testsuite hangs
described in gh-839 and gh-851
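
A minimal usage sketch of that pattern (log name is a placeholder): the context
manager closes the underlying log handle when the block exits, rather than
relying on `__del__` firing during garbage collection or pytest teardown.

import darshan

with darshan.DarshanReport("example.darshan", read_all=True) as report:
    nprocs = report.metadata["job"]["nprocs"]
# the log handle is released deterministically here
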
* `cffi_backend` module changes requested from PR review
  - remove a spurious `darshan_free` from `_log_get_heatmap_record()`
  - fix the scoping of the `darshan_free` of `buf` object used with
    `darshan_accumulator_inject` in `log_get_derived_metrics`
  - add a missing `log_close()` to `log_get_derived_metrics` (maybe
    we can wrap it in Python contexts in the future though)
  - use a separate buffer for `darshan_accumulator_emit()` inside
    `log_get_derived_metrics`

* note that making the above CFFI/free-related changes caused
a segfault in the testsuite, so in the end I adjusted the location
of the memory freeing as I saw fit to avoid segfaults--I'd say at this
point please provide concrete evidence (a memory leak plot or a
failing test) for additional adjustments there, or just push the change
in

* in the end, there is a slightly more concise usage of `darshan_free()`
but no meaningful change in the free operations

* I also reverted the suggested changes to `darshan_accumulator_emit()`
usage--there was no testable evidence of an issue, and it was also
causing segfaults...

* address many of the discussion points that came up in gh-868:
  - `log_get_derived_metrics()` now uses an LRU cache, which effectively
    means that we use memoization to return derived metrics data
    rather than doing another pass over the log file if the same
    log path and module name have already been accumulated from; we
    still need to pass over a given log twice in most cases--once at
    initial read-in and once when using `log_get_derived_metrics`; how
    we decide to add filtering of records prior to the accumulation
    interface in Python is probably a deeper discussion for later
  - `log_get_bytes_bandwidth()` and its associated testing have been
     migrated to modules not named after "CFFI", like in the above
     PR, because I think we should only use the "CFFI"-named modules
     for direct CFFI interaction/testing, and for other analyses we
     should probably use more distinct names. Also, to some extent
     everything depends on the CFFI layer, so trying to restrict
     "CFFI" modules to direct rather than indirect interaction will
     help keep them manageably sized, especially given the proclivity
     for surprising memory issues/segfaults in those parts of the code.
  - add a proper docstring with examples for `log_get_bytes_bandwidth()`
[skip actions]
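
A minimal sketch of the memoization idea above, assuming the interface at this
stage took a log path and a module name (the decorator usage is the point here;
the body is elided, and a later commit removes the cache once the function
starts taking DataFrames, which are not hashable).

from functools import lru_cache

@lru_cache(maxsize=None)
def log_get_derived_metrics(log_path: str, mod_name: str):
    # the real function opens the log, injects all of mod_name's records
    # into a darshan_accumulator, emits the derived metrics struct, and
    # returns it; lru_cache hands back the cached struct on a repeat call
    # with the same (log_path, mod_name) pair instead of re-reading the log
    ...
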
* `log_get_derived_metrics()` changes:
    - update input parameters for what's currently needed for
      conversion to a raw buffer from a dataframe
    - remove code for opening log to read records, and replace
      with code to convert dataframe to a record buffer
    - remove lru cache as probably no longer appropriate
* Darshan job summary changes:
    - add code to get derived metrics from a module's record
      dataframe
    - to avoid various errors related to attempting to accumulate
      modules that don't support it, only accumulate modules that
      do support it
* `log_get_bytes_bandwidth()` changes:
    - take a derived metrics struct as input
* updated `lib/accum` tests
    - update to open a log report and read in records explicitly
    - some expected error types changed as a result of refactor
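
With that refactor, the wrapper only needs to format an already-accumulated
struct. A hedged sketch, assuming the struct exposes the `total_bytes` and
`agg_perf_by_slowest` fields referenced in earlier commits (the exact summary
string is illustrative, not the merged one):

def log_get_bytes_bandwidth(derived_metrics, mod_name):
    # total bytes moved, reported in MiB, and the aggregate bandwidth
    # estimate from the slowest rank/file heuristic, in MiB/s
    total_mib = derived_metrics.total_bytes / 2 ** 20
    bandwidth_mib_s = derived_metrics.agg_perf_by_slowest
    return (f"I/O performance estimate (at the {mod_name} layer): "
            f"transferred {total_mib:.1f} MiB at {bandwidth_mib_s:.2f} MiB/s")
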
@tylerjereddy (Collaborator) left a comment

I think I'm fine with most of it, but then again a lot of this is effectively improved versions of my original rough code so I may be biased.

The blue text bandwidth strings are still showing up in the Python summary reports for me on this branch, which is a good sign.

#        with pytest.raises(ValueError,
#                           match=f"{mod_name} is not in the available log"):
#            log_get_bytes_bandwidth(derived_metrics=derived_metrics,
#                                    mod_name=mod_name)
Collaborator


For this test, if we want to keep the error handling cases, this seems to fix things up locally:

diff --git a/darshan-util/pydarshan/darshan/report.py b/darshan-util/pydarshan/darshan/report.py
index 17b0e0e3..047e4d56 100644
--- a/darshan-util/pydarshan/darshan/report.py
+++ b/darshan-util/pydarshan/darshan/report.py
@@ -661,6 +661,9 @@ class DarshanReport(object):
         cn = backend.counter_names(mod)
         fcn = backend.fcounter_names(mod)
 
+        if mod not in self._modules:
+            raise ValueError(f"mod {mod} is not available in this DarshanReport object.")
+
         # update module metadata
         self._modules[mod]['num_records'] = 0
         if mod not in self.counters:
diff --git a/darshan-util/pydarshan/darshan/tests/test_lib_accum.py b/darshan-util/pydarshan/darshan/tests/test_lib_accum.py
index 4e46dc7e..6e3ec020 100644
--- a/darshan-util/pydarshan/darshan/tests/test_lib_accum.py
+++ b/darshan-util/pydarshan/darshan/tests/test_lib_accum.py
@@ -69,28 +69,27 @@ def test_derived_metrics_bytes_and_bandwidth(log_path, mod_name, expected_str):
     # test the basic scenario of retrieving
     # the total data transferred and bandwidth
     # for all records in a given module; the situation
-    # of accumulating drived metrics with filtering
+    # of accumulating derived metrics with filtering
     # (i.e., for a single filename) is not tested here
 
     log_path = get_log_path(log_path)
     with darshan.DarshanReport(log_path, read_all=True) as report:
-        report.mod_read_all_records(mod_name, dtype="pandas")
-        rec_dict = report.records[mod_name][0]
-        mod_idx = report.modules[mod_name]['idx']
-        nprocs = report.metadata['job']['nprocs']
+        if expected_str == "ValueError":
+            with pytest.raises(ValueError,
+                               match=f"mod {mod_name} is not available"):
+                report.mod_read_all_records(mod_name, dtype="pandas")
+        else:
+            report.mod_read_all_records(mod_name, dtype="pandas")
+            rec_dict = report.records[mod_name][0]
+            mod_idx = report.modules[mod_name]['idx']
+            nprocs = report.metadata['job']['nprocs']
 
-    if expected_str == "RuntimeError":
-        with pytest.raises(RuntimeError,
-                           match=f"{mod_name} module does not support derived"):
-            log_get_derived_metrics(rec_dict, mod_name, mod_idx, nprocs)
-    else:
-        derived_metrics = log_get_derived_metrics(rec_dict, mod_name, mod_idx, nprocs)
-        actual_str = log_get_bytes_bandwidth(derived_metrics=derived_metrics,
-                                             mod_name=mod_name)
-        assert actual_str == expected_str
-
-#    elif expected_str == "ValueError":
-#        with pytest.raises(ValueError,
-#                           match=f"{mod_name} is not in the available log"):
-#            log_get_bytes_bandwidth(derived_metrics=derived_metrics,
-#                                    mod_name=mod_name)
+            if expected_str == "RuntimeError":
+                with pytest.raises(RuntimeError,
+                                   match=f"{mod_name} module does not support derived"):
+                    log_get_derived_metrics(rec_dict, mod_name, mod_idx, nprocs)
+            else:
+                derived_metrics = log_get_derived_metrics(rec_dict, mod_name, mod_idx, nprocs)
+                actual_str = log_get_bytes_bandwidth(derived_metrics=derived_metrics,
+                                                     mod_name=mod_name)
+                assert actual_str == expected_str

@shanedsnyder (Author) Feb 13, 2023

Thanks, I added this patch in 5d9e362 and it looks to be working fine.

    return buf


def log_get_derived_metrics(rec_dict, mod_name, mod_idx, nprocs):
Collaborator


I suppose in the future we could allow an optional kwarg for custom filtering operations on the DataFrames, which then gets passed to a similar optional kwarg for _df_to_rec().

For now, I think we're focused on just getting the standard accum elements into the summary report, but I'm just thinking ahead to make sure we don't lock ourselves into a design that is harder to apply custom DataFrame filters to.

I suppose, however, that for supporting arbitrary slicing and dicing of DataFrames we may want to just allow folks to manually operate on the two DataFrames in rec_dict and then pass it in here, since a pass-through API that can achieve any kind of filtering op may be more work than it is worth.

Perhaps some pre-canned filtering ops will be built-in for us, like an argument for a regex to filter on filenames or something.

Author


I did think about this briefly and just assumed we'd expect users to do the filtering directly on the dataframes before having PyDarshan operate on them, similar to your approach in the test_reverse_record_array test case. I'd definitely be open to supporting this more directly in our APIs if we think it'll avoid some complexity and simplify things for users. I guess users have to filter counters/fcounters separately, so avoiding that and instead passing in some kwargs to handle it internally could be useful, if it's not a pain to get working for different use cases.
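
For example, a hedged sketch of that "filter first, then accumulate" flow, reusing the names from the test above (the "rank" column and the rank-0 filter are illustrative assumptions about the pandas record layout):

# both DataFrames in rec_dict have to be filtered consistently before being
# handed to the accumulator interface
filtered_rec_dict = {
    "counters": rec_dict["counters"][rec_dict["counters"]["rank"] == 0],
    "fcounters": rec_dict["fcounters"][rec_dict["fcounters"]["rank"] == 0],
}
derived_metrics = log_get_derived_metrics(filtered_rec_dict, mod_name,
                                          mod_idx, nprocs)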

    r = libdutil.darshan_accumulator_emit(darshan_accumulator[0],
                                          derived_metrics,
                                          total_record)
    libdutil.darshan_accumulator_destroy(darshan_accumulator[0])
Collaborator


I suppose we could check for a 0 retcode here, though the function appears to have no error handling anyway, not even in the form of a non-zero return code.

Author


Yeah, I just punted on error checking that for a couple of reasons:

  • Internal knowledge that the destroy() call doesn't actually have an error path despite having a return code -- it's basically just freeing some stuff, so if there's an error it's likely going to be a crash.
  • We want to call destroy() before potentially raising an exception related to the emit() call returning a failure just to make sure we clean up C memory. It just becomes a little complicated/verbose to unwind the errors from back-to-back calls like that, so I just opted for simplicity with the internal knowledge from above.
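
A minimal sketch of that ordering, reusing the names from the snippet above
(the retcode convention and error message are assumptions, not the merged
code): destroy the accumulator first so the C-side memory is released, then
raise if the emit reported a failure.

r_emit = libdutil.darshan_accumulator_emit(darshan_accumulator[0],
                                           derived_metrics,
                                           total_record)
# always destroy, even on the error path, to free the C-side allocation
libdutil.darshan_accumulator_destroy(darshan_accumulator[0])
if r_emit != 0:
    raise RuntimeError(f"darshan_accumulator_emit() failed with retcode {r_emit}")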

"It may be possible "
"to retrieve additional information from the stderr "
"stream.")
derived_metrics = ffi.new("struct darshan_derived_metrics *")
Collaborator


The CFFI stuff is of course a bit awkward, especially given how many different entry points to the C layer we need to interact with just to get a single final struct back, but I'd probably be inclined to just live with it if we're going to be operating on DataFrames or Arrow in-memory data in a few years anyway.

It occurs to me that for some things, like struct darshan_derived_metrics, we could probably avoid CFFI traversal by using the bytes from a NumPy recarray, but this would be more verbose with minimal gain for now, I suspect.
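
A hedged sketch of that recarray idea, reusing `ffi` and `derived_metrics` from
the snippet above (the dtype below is illustrative and does not reflect the
real struct darshan_derived_metrics layout): view the struct's bytes through a
NumPy structured dtype instead of walking fields through CFFI.

import numpy as np

metrics_dtype = np.dtype([("total_bytes", np.int64),
                          ("agg_perf_by_slowest", np.float64)])  # illustrative only
raw = ffi.buffer(derived_metrics, metrics_dtype.itemsize)
metrics = np.frombuffer(raw, dtype=metrics_dtype, count=1)
print(metrics["agg_perf_by_slowest"][0])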

@shanedsnyder (Author)

I just pushed one more commit that adds a helper for mapping module names to indices and uses that logic internally in log_get_derived_metrics() so that callers don't have to hold onto both module names and log format indices. Really, we should probably not require users to know what a module index is at all and have them interact entirely with module names for simplicity.
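
A hedged sketch of the kind of helper involved (the module-name list is passed
in here and is an assumption, not the merged code): resolve a module name to
its log-format index via list.index().

def _mod_name_to_idx(mod_name, mod_names):
    # mod_names: module names ordered to mirror the C library's module enum
    try:
        return mod_names.index(mod_name)
    except ValueError:
        raise ValueError(f"{mod_name} is not a known Darshan module name")

The list.index() lookup is the O(n) search mentioned in the review comment below; a dict would make it O(1), but the module list is tiny either way.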

I think I'm satisfied with the state of things here and am willing to merge this in, assuming there are no other critical concerns to address. I'm not sure if it makes any difference for attribution purposes, but I could cherry-pick these commits back onto your original PR @tylerjereddy if it matters to you? Otherwise, I've requested a review here and we can merge this in directly and close #839.

@tylerjereddy (Collaborator)

The last two commits look fine to me.

I suppose we could debate about using a dictionary to retrieve the mod_idx to get O(1) vs. the O(n) search from list.index(), but the sequence is so small, and unlikely to balloon, that it isn't worth the effort either way.

Feel free to merge this PR if that is easiest, and if you want to squash the commits or whatever I don't mind.

@shanedsnyder shanedsnyder changed the title WIP: updated dervied metrics interface WIP: updated derived metrics interface Feb 14, 2023
@shanedsnyder shanedsnyder changed the title WIP: updated derived metrics interface updated derived metrics interface Feb 14, 2023
@shanedsnyder shanedsnyder merged commit 0a4e9f2 into main Feb 14, 2023