Rationale
Currently, Unified Runtime has no unified way of querying or modifying internal loader or adapter state (or statistics). For example, the L0 adapter has more than 20 individual environment variables that control its behavior. It also collects some statistics, but the only way to see them is to set an env variable that prints them at the end.
There's no centralized location in which all knobs are stored, no uniform way to see which flags are set, no way to set different defaults depending on the device (e.g., PVC vs DG2), and no way to control these things programmatically (other than using setenv and hoping for the best).
Description
Ideally, we'd limit the number of knobs that users need to be aware of to a bare minimum, and never actually require that any of them be set for optimal performance. Sadly, the reality is that setting those options is often needed to get the most out of the hardware. Controllable feature flags are also a useful tool for testing new features or optimizations without affecting everyone.
Likewise, statistics are not strictly required, but they are an invaluable tool for debugging and performance evaluation. Currently, the adapters do not have any real infrastructure to output statistics in a controlled way. And even where some statistics exist, they are not queryable at runtime, which is often helpful, e.g., in tests.
To address these problems, I'm proposing we introduce a control and introspection API in Unified Runtime:
ur_result_t urLoaderCtlGet(const char *name, void *arg);
ur_result_t urLoaderCtlSet(const char *name, void *arg);
These are inspired by similar interfaces in jemalloc and libpmemobj.
Adapters would define a tree-like structure of the ctl interface:
- events
  - caching (read-write)
  - discarded
    - enabled (read-write)
- stats (all read only)
  - pool
    - allocated
    - active
  - queue
    - num
    - <index>
      - command_list
        - num
        - <index>
          - events_in_flight
- debug (read-write)
A pointer to this structure would then be passed to the loader (through a special API in DDI, not defined here), which would include it in the global control namespace, prepending the appropriate namespace prefix (or not, depending on whether some nodes are meant to be shared).
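To make the tree and the namespace prefixing more concrete, here is a minimal C++ sketch. Every name in it (`CtlNode`, `mountAdapterTree`, `ctlGetBool`, `ctlSetBool`) is hypothetical and only illustrates the idea; it handles only `bool` leaves and returns `bool` instead of `ur_result_t`, and it is not the DDI mechanism mentioned above.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node type; none of these names exist in Unified Runtime.
struct CtlNode {
    std::string name;
    bool *value = nullptr;          // leaf storage (this sketch only handles bool leaves)
    bool writable = false;          // false means read-only
    std::vector<CtlNode> children;  // std::vector accepts an incomplete element type since C++17
};

// Split "a.b.c" into {"a", "b", "c"}; an empty string yields no parts.
inline std::vector<std::string> splitPath(const std::string &path) {
    std::vector<std::string> parts;
    std::size_t begin = 0;
    while (begin < path.size()) {
        std::size_t end = path.find('.', begin);
        if (end == std::string::npos) end = path.size();
        parts.push_back(path.substr(begin, end - begin));
        begin = end + 1;
    }
    return parts;
}

// Return the child called `name`, creating an empty one if it does not exist yet.
inline CtlNode &getOrAddChild(CtlNode &parent, const std::string &name) {
    for (CtlNode &child : parent.children)
        if (child.name == name) return child;
    CtlNode node;
    node.name = name;
    parent.children.push_back(std::move(node));
    return parent.children.back();
}

// Attach the adapter's top-level nodes below `prefix` (e.g. "adapters.level_zero").
// An empty prefix attaches them directly to the global root, as shared nodes would be.
inline void mountAdapterTree(CtlNode &global, const std::string &prefix, CtlNode adapterTree) {
    CtlNode *node = &global;
    for (const std::string &part : splitPath(prefix))
        node = &getOrAddChild(*node, part);
    for (CtlNode &child : adapterTree.children)
        node->children.push_back(std::move(child));
}

// Walk a dot-separated path from `root`; returns nullptr if a component is missing.
inline CtlNode *findNode(CtlNode &root, const std::string &path) {
    CtlNode *node = &root;
    for (const std::string &part : splitPath(path)) {
        CtlNode *next = nullptr;
        for (CtlNode &child : node->children)
            if (child.name == part) { next = &child; break; }
        if (!next) return nullptr;
        node = next;
    }
    return node;
}

// Read a bool leaf; fails if the path does not name a leaf with storage.
inline bool ctlGetBool(CtlNode &root, const std::string &name, bool *out) {
    CtlNode *node = findNode(root, name);
    if (!node || !node->value) return false;
    *out = *node->value;
    return true;
}

// Write a bool leaf; fails for unknown paths and for read-only nodes.
inline bool ctlSetBool(CtlNode &root, const std::string &name, bool in) {
    CtlNode *node = findNode(root, name);
    if (!node || !node->value || !node->writable) return false;
    *node->value = in;
    return true;
}
```

In this sketch the adapter builds its tree with relative names such as `events.caching`, and only the loader decides which prefix, if any, the nodes end up under.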
This would then allow users to control these knobs programmatically, through an environment variable, or through a configuration file, like so:
bool caching = false;
urLoaderCtlSet("adapters.level_zero.events.caching", &caching);
UR_CTL="adapters.level_zero.events.caching=false;[more ctls]"
$ cat /etc/ur.conf
adapters:
  level_zero:
    events:
      - caching: false
(The config file doesn't have to be YAML; we could implement several parsers, and the simplest one could reuse the env variable syntax.)
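As a rough illustration of that simplest parser, the sketch below splits a `name=value;name=value` string into pairs. The function names `parseCtlString` and `ctlFromEnvironment` are made up for this example, and the handling of malformed entries (they are skipped) is an assumption, not something this proposal specifies.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <utility>
#include <vector>

// Split a UR_CTL-style string "name=value;name=value" into (name, value) pairs.
// Entries without '=' or with an empty name are skipped (an assumption of this
// sketch). Values stay strings; converting them would be up to the target node.
inline std::vector<std::pair<std::string, std::string>>
parseCtlString(const std::string &text) {
    std::vector<std::pair<std::string, std::string>> result;
    std::size_t begin = 0;
    while (begin < text.size()) {
        std::size_t end = text.find(';', begin);
        if (end == std::string::npos) end = text.size();
        const std::string entry = text.substr(begin, end - begin);
        const std::size_t eq = entry.find('=');
        if (eq != std::string::npos && eq > 0)
            result.emplace_back(entry.substr(0, eq), entry.substr(eq + 1));
        begin = end + 1;
    }
    return result;
}

// Read the UR_CTL environment variable; an unset variable yields no entries.
inline std::vector<std::pair<std::string, std::string>> ctlFromEnvironment() {
    const char *raw = std::getenv("UR_CTL");
    return parseCtlString(raw ? raw : "");
}
```

The same function could be applied to the contents of a config file that uses the env variable syntax, so both sources would share one code path.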
Similarly, to access a statistic a user would be able to simply call:
size_t nevents;
urLoaderCtlGet("adapters.level_zero.stats.queue.0.command_list.0.events_in_flight", &nevents);
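The `<index>` components in such a path need to be resolved against live objects. Below is a small sketch of how an adapter might do that for this one statistic; `QueueStats`, `CommandListStats`, and `readEventsInFlight` are hypothetical stand-ins rather than real adapter types, and the path is taken relative to the `stats` node.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical statistics storage; not actual Level Zero adapter types.
struct CommandListStats { std::size_t eventsInFlight = 0; };
struct QueueStats { std::vector<CommandListStats> commandLists; };

// Split a dot-separated path into its components.
inline std::vector<std::string> splitPath(const std::string &path) {
    std::vector<std::string> parts;
    std::stringstream stream(path);
    std::string part;
    while (std::getline(stream, part, '.')) parts.push_back(part);
    return parts;
}

// Parse a short non-negative decimal index and check that it is below `limit`.
inline bool parseIndex(const std::string &text, std::size_t limit, std::size_t &index) {
    if (text.empty() || text.size() > 9) return false;  // length cap avoids overflow
    index = 0;
    for (char c : text) {
        if (c < '0' || c > '9') return false;
        index = index * 10 + static_cast<std::size_t>(c - '0');
    }
    return index < limit;
}

// Resolve "queue.<i>.command_list.<j>.events_in_flight" and copy the value to `out`.
inline bool readEventsInFlight(const std::vector<QueueStats> &queues,
                               const std::string &path, std::size_t *out) {
    const std::vector<std::string> p = splitPath(path);
    if (p.size() != 5 || p[0] != "queue" || p[2] != "command_list" ||
        p[4] != "events_in_flight")
        return false;
    std::size_t q = 0, c = 0;
    if (!parseIndex(p[1], queues.size(), q)) return false;
    if (!parseIndex(p[3], queues[q].commandLists.size(), c)) return false;
    *out = queues[q].commandLists[c].eventsInFlight;
    return true;
}
```

An out-of-range index simply fails the lookup here; a real implementation would presumably map that to an appropriate `ur_result_t` error.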
This would also allow us to dump all CTLs at once at teardown time, to print all statistics and see the configuration:
UR_CTL="ctl.dump_all=true;..." ./app
adapters.level_zero.events
    .caching = false
    .discarded
        .enabled = false
...
This would be simple and useful information that we could require with all bug reports.
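A dump like this amounts to a recursive walk over the tree. The sketch below is one possible shape for it, using a hypothetical `DumpNode` type whose values are already rendered as strings. For simplicity it prints one fully qualified name per line rather than the indented layout shown above.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified node: values are pre-rendered as strings.
struct DumpNode {
    std::string name;
    std::string value;               // only meaningful for leaves
    std::vector<DumpNode> children;  // std::vector accepts an incomplete element type since C++17
};

// Append "full.path = value" lines for every leaf below `node` to `out`.
// A node without children is treated as a leaf.
inline void dumpAll(const DumpNode &node, const std::string &prefix, std::string &out) {
    const std::string path = prefix.empty() ? node.name : prefix + "." + node.name;
    if (node.children.empty()) {
        out += path + " = " + node.value + "\n";
        return;
    }
    for (const DumpNode &child : node.children)
        dumpAll(child, path, out);
}
```

With a one-line-per-leaf format, the output can be grepped directly, or pasted back into a `UR_CTL`-style string after minor editing.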
Implementation
Implementation could be copy/pasted from PMDK. Here's an example of a simple CTL namespace definition: https://github.com/pmem/pmdk/blob/master/src/common/ctl_prefault.c#L70. Here's a more complex one: https://github.com/pmem/pmdk/blob/master/src/libpmemobj/pmalloc.c#L846
However, this implementation is in pure C. A C++ one might be a tad simpler.