Are you aware of solutions for multi-threaded and relative time-accurate recordi...

gregthelaw · on Feb 1, 2025

Co-founder of Undo here. This is a common misunderstanding, and just not true -- neither for Undo nor rr. Most races will reproduce at least as easily in Undo, especially if you use our "thread fuzzing" feature (rr has something similar, called chaos mode).

Sure, there will always be some races/timing issues that just won't repro under recording (Heisenberg principle and all that), but in fact most races are _more likely_ to occur under recording. Part of this is because you slow down the process being recorded, which is equivalent to speeding up the outside world.

And of course, when you do have your gnarly timing issue captured in a recording, it's usually trivial to root-cause exactly what happened. Our customers tell us that races and timing issues are a major use-case.

matu3ba · on Feb 1, 2025

Thanks for the clarification. I was not aware of thread fuzzing or chaos mode before.

Veserv · on Feb 1, 2025

Multiplexing onto a single thread is sufficient to observe and record concurrency errors. If that is not sufficient, then I am assuming you want to observe and record errors caused by true parallelism. If so, then you need a full memory trace. That restricts you to either hardware that supports full memory trace connected to a hardware trace probe or instrumented software full memory trace.
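To make the first point concrete, here is a toy sketch (not how rr or Undo are implemented) of two simulated threads multiplexed onto one real thread. The names `increment` and `run` are made up for illustration, and generators stand in for threads, with each `yield` marking a point where the scheduler may switch. Because the only source of nondeterminism is the seeded scheduler, the lost-update bug shows up and replays identically from the same seed.

```python
import random

def increment(shared, times):
    """Simulated thread: a non-atomic read-modify-write, repeated `times` times."""
    for _ in range(times):
        seen = shared["counter"]      # read
        yield                         # the scheduler may switch here
        shared["counter"] = seen + 1  # write back a possibly stale value
        yield

def run(seed, times=1000):
    """Interleave two simulated threads on one real thread, choosing pseudo-randomly."""
    rng = random.Random(seed)         # the seed fully determines the schedule
    shared = {"counter": 0}
    runnable = [increment(shared, times), increment(shared, times)]
    while runnable:
        index = rng.randrange(len(runnable))
        try:
            next(runnable[index])     # run the chosen thread up to its next yield
        except StopIteration:
            runnable.pop(index)       # that thread has finished
    return shared["counter"]

result = run(seed=1)
replay = run(seed=1)
print(result, replay)  # same value twice, and below the 2000 a correct program gives
```

Nothing here runs in parallel, yet updates are still lost, which is the sense in which single-threaded multiplexing is enough for concurrency errors. It cannot show effects that need true parallelism, such as memory-ordering behavior between cores.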

The former basically only exists for embedded boards, and the latter does not exist (at, say, less than a 10x slowdown) for Linux or any other common desktop operating system as far as I am aware.

matu3ba · on Feb 1, 2025

So the only way to trace probe consumer desktop CPUs and possibly GPUs is by the hardware vendors themselves or specialized facilities/laboratories? For Intel I can find https://www.lauterbach.com/supported-platforms/architectures..., but nothing for trace probing AMD or later versions.

Veserv · on Feb 1, 2025

You can trace consumer desktop CPUs using instrumented software full memory trace, but that requires OS + debugger + compiler support which is not available for Linux, Windows, Mac, etc.

You can trace hardware that exposes trace functionality, usually via a debug port of some kind. Many chips have trace functionality in their production design, but no debug connector is physically present in off-the-shelf boards (to reduce manufacturing cost). You can usually physically modify the board to get access to this functionality, which is routinely done when porting software to a new chip/board.

Trace functionality comes in two major flavors: control flow trace and memory trace. Control flow trace only records control flow, so the contents of memory are unknown, which is not very useful for your desired use case. Memory trace records memory accesses, so the contents of memory are known. Unfortunately, memory trace is very resource intensive, so most systems that support trace only implement control flow trace. As far as I am aware, it is very unlikely that any desktop or server CPU has memory trace.
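The difference between the two flavors can be illustrated with a toy model. This is only a sketch: real hardware trace formats are far more compact, and `traced_sum` is a hypothetical function that records both kinds of trace side by side. Two runs over different data take exactly the same branches, so their control flow traces are identical, while the memory traces show what was actually loaded.

```python
def traced_sum(values, threshold):
    """Sum the values below `threshold`, recording two kinds of trace."""
    control_flow = []  # branch outcomes only (taken / not taken)
    memory = []        # every load, with its index and the value read
    total = 0
    for i in range(len(values)):
        v = values[i]
        memory.append(("load", i, v))
        taken = v < threshold
        control_flow.append(taken)
        if taken:
            total += v
    return total, control_flow, memory

total_a, flow_a, mem_a = traced_sum([1, 2, 30], threshold=10)
total_b, flow_b, mem_b = traced_sum([3, 4, 50], threshold=10)

print(flow_a == flow_b)    # True: both runs took the same branches
print(mem_a == mem_b)      # False: the loaded values differ
print(total_a, total_b)    # 3 7
```

From the control flow trace alone there is no way to tell these runs apart, let alone recover which value another core wrote to a shared location.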

The major manufacturers of trace probes and solutions that I know of are Green Hills Software, Lauterbach, and Segger.

matu3ba · on Feb 1, 2025

Thanks for your write-up, this is very helpful for understanding.

dzaima · on Feb 1, 2025

While the single-threaded execution means that issues from thread interleaving on the scale of nanoseconds will effectively not happen, multiple threads are still allowed and will be context-switched between. rr also has a chaos mode to intentionally make the context switching unfair.

matu3ba · on Feb 1, 2025

What kind/class of issues are caused by "thread interleaving on the scale of nanoseconds"? Faulty CPU bit flips due to radiation/quantum effects or what are you referring to? Just curious.

dzaima · on Feb 2, 2025

Not doing things atomically when they should be (incl. missing locks around tiny ops) would be a pretty large class.

With native multithreading data can pass from thread to thread millions of times per second, and you're much less likely to hit obscure interactions when limited instead to maybe a couple hundred context switches per second.
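A rough way to see the effect of switch frequency is a toy round-robin simulation. It is a sketch with made-up names (`racy_increment`, `stale_writes`), where generators stand in for threads and `quantum` is the number of steps a thread runs before being switched out. It counts how often a thread writes back a value after the shared counter changed underneath it. The non-trivial quanta are odd on purpose: in this model each increment takes two steps, so an even quantum always switches on an operation boundary and hides the race entirely.

```python
def racy_increment(shared, stats, times):
    """Simulated thread doing a non-atomic increment; each yield is one step."""
    for _ in range(times):
        seen = shared["counter"]          # step 1: read
        yield
        if shared["counter"] != seen:     # someone else wrote in between
            stats["stale_writes"] += 1
        shared["counter"] = seen + 1      # step 2: write
        yield

def stale_writes(quantum, times=1000):
    """Round-robin two simulated threads, switching every `quantum` steps."""
    shared = {"counter": 0}
    stats = {"stale_writes": 0}
    threads = [racy_increment(shared, stats, times) for _ in range(2)]
    while threads:
        for thread in list(threads):      # copy, since finished threads are removed
            for _ in range(quantum):
                try:
                    next(thread)
                except StopIteration:
                    threads.remove(thread)
                    break
    return stats["stale_writes"]

for quantum in (1, 125, 10_000):
    print(quantum, stale_writes(quantum))
```

With a switch after every step the race fires on every round. With a long time slice it fires only at the few switches that happen to land between a read and a write, and with a slice longer than the whole workload it never fires.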

gregthelaw · on Feb 1, 2025

Exactly right. Undo has "thread fuzzing", which is a similar concept to chaos mode, but more targeted.

AlotOfReading · on Feb 1, 2025

Antithesis is the only general-purpose system I've seen for that. It takes the same single-threaded approach, but can scale to N separate systems and fault-inject possible orderings.