Skip to content

Conversation

cperkinsintel
Copy link
Contributor

Windows shutdown/teardown doesn't have the exact same timing as on Linux, and even more in the the unit tests which are statically compiled ( so no DllMain() call, which means no space between the calls to shutdown_early() and shutdown_late() ). Here we remove a race between the platform deletion in the mock and the win shutdown. This fixes a flaky failure we are seeing in a couple of the unit tests.

@uditagarwal97
Copy link
Contributor

uditagarwal97 commented Aug 11, 2025

Windows shutdown/teardown doesn't have the exact same timing as on Linux, and even more in the the unit tests which are statically compiled ( so no DllMain() call, which means no space between the calls to shutdown_early() and shutdown_late() ). Here we remove a race between the platform deletion in the mock and the win shutdown.

Can you please elaborate more on how the race happens? Like rough thread interleavings?
I'm trying to think of an alternative approach, somewhat along the lines of: can we use mutexes to enforce an ordering between shutdown_late and shutdown_early, on Windows? To avoid any overhead, this mutex based shutdown ordering can be put under a special macro, that we define only when building unittests.

@cperkinsintel
Copy link
Contributor Author

cperkinsintel commented Aug 15, 2025

@uditagarwal97 I had trouble pinning it down. It's flaky and also load sensitive. But what seems to be happening is that the clearing of the platform here is somehow leading to the UR to drop the adapters before we tell it to drop the adapters. This conflicts with our host tasks, which are threaded, when their event destructors fire and call to the now-dropped adapters.

But this entire scenario is contrived. These are stunts we are doing in the unit tests to "pretend" like we are shutting down when we aren't. If clearing backends and platforms between tests is imperative, then probably we change course entirely and do that at the beginning of the UrMock, rather than simulating shutdown. But I didn't want to make such a substantial change when trying to address a couple of flaky CI failures.

The scenario that is NOT contrived though, is using SYCL when statically linked. We don't have enough testing of that use case, and I think we need more. Perhaps a lot more

… Also fix the otherwise tautological DeviceRefCounter test.

Signed-off-by: Chris Perkins <[email protected]>
@cperkinsintel
Copy link
Contributor Author

This PR is limited to the unit tests. The test failures are in unrelated e2e tests. Not sure what's going one there

Copy link
Contributor

@uditagarwal97 uditagarwal97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@againull againull merged commit 5790b6a into intel:sycl Aug 20, 2025
39 of 44 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants