Description
Race conditions suck, and they're a common kind of bug in concurrent programs. We should help people test for them if we can.
Some interesting research out of MSR:
CHESS assumes you have a preemptive multitasking system and works by hooking synchronization primitives, and explores different legal paths through the resulting happens-before graph. This makes it tricky to find data races, i.e., places where two threads access the same variable without using any synchronization primitive; you really want to be able to try schedules where threads get preempted in the middle of these. In practice, it sounds like the way they do this (see page 7) is to first run some heavyweight data race detector that instruments memory reads and writes, and then use the result to add new annotations for the CHESS scheduler.
For us: AFAIK we don't have any reasonable way to do data race detection in Python (maybe you could write something with PyPy? it's non-trivial), but we do have direct access to the scheduler, so in principle we can explore all possible preemption decisions directly instead of having to infer them from synchronization points. However, this is likely to be somewhat inefficient. CHESS really needs to cut down the space of preemption points, because otherwise it has to consider a preemption at every instruction, which is ridiculously intractable. We're in a better position than that, but the average program still has lots of cases where two tasks that aren't interacting at all each have some preemption points, and this generates an exponential space of boring possible schedules.
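To make the blowup concrete: two non-interacting tasks that each pass through n preemption points can be interleaved in C(2n, n) ways (choose which n of the 2n steps belong to the first task), even though all of these schedules are behaviorally identical. A tiny sketch:

```python
from math import comb

def interleavings(n):
    # Number of distinct interleavings of two independent tasks that each
    # hit n preemption points: choose which n of the 2n scheduler steps
    # belong to the first task. Grows roughly like 4**n.
    return comb(2 * n, n)

for n in (5, 10, 20):
    print(n, interleavings(n))
```

Already at 20 checkpoints per task this is over 10^11 schedules, which is why naive exhaustive exploration of "all preemption decisions" doesn't scale without some equivalence pruning.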
They have a lot of interesting stuff in the paper (section 4) about how to structure the search, recovering from non-determinism in the code-under-test, and some (not enough) detail about fair scheduling (they want to handle code with spin-locks). It looks like this is the fair scheduling paper; the main idea seems to be that if a thread calls sched_yield or equivalent, that means it can't make progress, so you should lower its priority. That's... not a good signal in the trio context. Fortunately it's not clear that anyone wants to write spin-locks anyway... (We use the equivalent a few times in our test suite, but mostly only in the earliest tests I wrote before we had much infrastructure, and it's certainly a terrible thing to do in real code.)
Possibly useful citation: "ConTest [15] is a lightweight testing tool that attempts to create scheduling variance without resorting to systematic generation of all executions. In contrast, CHESS obtains greater control over thread scheduling to offer higher coverage guarantees and better reproducibility."
GAMBIT then adds a smarter best-first search algorithm on top of the basic CHESS framework. A lot of the cleverness in GAMBIT is about figuring out which states are equivalent and thus can be collapsed, which we don't really have access to. But the overall ideas are probably relevant. (It's also possible that we could use something like instrumented synchronization primitives plus the DPOR algorithm to pick "interesting" schedules to try first, and then fall back on more brute-force randomized search.)
I suspect there are two general approaches that are likely to be most useful in our context:
- Hypothesis-style randomized exploration of the space of scheduling decisions, with some clever heuristics to guide the search towards edge cases that are likely to find bugs, and maybe even hypothesis-style minimization. (The papers above have some citations to other papers showing that low-preemption traces are often sufficient to demonstrate bugs in practice.)
- Providing some sort of mechanism for users to explicitly guide the exploration to a subset of cases that they've identified as a problem (either by intuition, or especially for regression tests and increasing coverage). You can imagine relatively simple things like "use these priorities until task 3 hits the yield point on line 72 and then switch to task 5", maybe?
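The "explicit guidance" idea could be as simple as a little directive object the test author hands to the scheduler. None of these names exist in trio; this is just a sketch of what "use these priorities until task 3 hits the yield point on line 72, then switch to task 5" might look like as data:

```python
class ScheduleDirective:
    """Hypothetical user-supplied schedule: run by priority until a
    specific task reaches a specific yield point, then favor one task."""

    def __init__(self, priorities, trigger_task, trigger_point, then_run):
        self.priorities = priorities        # task name -> priority (higher wins)
        self.trigger_task = trigger_task    # e.g. "task3"
        self.trigger_point = trigger_point  # e.g. ("myfile.py", 72)
        self.then_run = then_run            # task to switch to after the trigger
        self.fired = False

    def pick(self, runnable, current_point):
        # current_point: (task_name, (filename, lineno)) of the yield just hit
        if not self.fired and current_point == (self.trigger_task,
                                                self.trigger_point):
            self.fired = True
        if self.fired and self.then_run in runnable:
            return self.then_run
        # Before the trigger (or if the favored task isn't runnable),
        # fall back to the static priorities.
        return max(runnable, key=lambda t: self.priorities.get(t, 0))
```

The point is just that a small declarative spec like this is enough to pin down one "interesting" region of the schedule space for a regression test, without the user having to script the whole interleaving.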
An intermediate form between these is one of the heuristics mentioned in the GAMBIT paper, of letting the programmer name some specific functions they want to "focus" on, and using that to increase the number of preemptions that happen within those functions. (This is sort of mentioned in the CHESS paper too when talking about their state space reduction techniques.)
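As a sketch of that heuristic (the function name and probabilities here are made up, not anything from the papers or trio): the scheduler could preempt with a low baseline probability everywhere, bumped up whenever a "focused" function is on the current task's call stack:

```python
def preempt_probability(stack_functions, focus, base=0.05, focused=0.5):
    """Hypothetical focus heuristic: probability of injecting a preemption
    at the current checkpoint.

    stack_functions: function names on the running task's call stack
    focus: user-named functions to concentrate preemptions in
    """
    if focus & set(stack_functions):
        return focused
    return base
```

A randomized explorer would then flip a coin with this probability at each checkpoint, so most preemptions land inside the code the user suspects.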
Possible implementation for scheduler control: call an instrument hook before scheduling a batch, with a list of runnable tasks (useful as a general notification anyway), and let it optionally return something to control the scheduling of this batch.
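A minimal sketch of what such an instrument might look like, assuming a hypothetical hook named `before_schedule_batch` that trio does not currently provide (the name and return-value convention are invented here):

```python
import random

class ShuffleScheduler:
    """Hypothetical instrument that randomizes execution order within each
    scheduler batch, reproducibly given a seed."""

    def __init__(self, seed):
        self._rng = random.Random(seed)

    def before_schedule_batch(self, runnable_tasks):
        # Called with the batch of runnable tasks. Returning a reordered
        # list would tell the scheduler what order to run them in;
        # returning None would keep the default behavior.
        tasks = list(runnable_tasks)
        self._rng.shuffle(tasks)
        return tasks
```

Recording the seed is enough to replay a failing schedule, which is most of what you need for the hypothesis-style randomized exploration above.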
See also: #77