Is "standard practice" nowadays to use C++11 memory orderings or barriers, instead of platform-specific macros?.I know that's the case in the Rust world.
Yes. They are much, much more likely to do the right thing than anything one can cook up oneself. And if used correctly, they work consistently and performantly across platforms (in theory at least, a release or acquire barrier could compile to a full memory barrier on one platform and to something more granular on another).
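For illustration, the portable idiom looks something like this (a minimal C++11 sketch; the variable names are made up):

    #include <atomic>

    std::atomic<bool> ready{false};
    int payload = 0;  // plain data, published via `ready`

    // Writer: the release store guarantees that the write to `payload`
    // is visible to any thread that observes `ready == true` via an
    // acquire load.
    void produce() {
        payload = 42;
        ready.store(true, std::memory_order_release);
    }

    // Reader: the acquire load pairs with the release store above.
    // On x86 both compile to plain loads/stores; on ARM or POWER the
    // compiler emits the required barriers -- the source stays the same.
    int consume() {
        while (!ready.load(std::memory_order_acquire)) { /* spin */ }
        return payload;  // guaranteed to read 42
    }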
The bug looks a bit like these two threads running in parallel:
A: x := N
A: y := N
A: done := true
B: if(done) {
B: assert(x == y)
B: } else { retry }
Where the assert in thread B replaces some other code which had x == y as a precondition. The problem was thought to be that, due to weak memory ordering on the processor, B would sometimes observe done to be true but x != y.
The fix would be to insert appropriate memory barriers to get the threads to see consistent data.
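In C++11 terms, the fix could look something like this (a sketch using explicit fences and the pseudocode's variable names; the real code was presumably messier):

    #include <atomic>
    #include <cassert>

    int x, y;                       // plain shared data
    std::atomic<bool> done{false};  // the flag, now an atomic

    void thread_A(int N) {
        x = N;
        y = N;
        // Barrier: the writes to x and y become visible before the flag.
        std::atomic_thread_fence(std::memory_order_release);
        done.store(true, std::memory_order_relaxed);
    }

    void thread_B() {
        while (!done.load(std::memory_order_relaxed)) { /* retry */ }
        // Barrier: pairs with the release fence in thread_A, so seeing
        // done == true implies seeing the writes to x and y.
        std::atomic_thread_fence(std::memory_order_acquire);
        assert(x == y);
    }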
The problem is then to determine whether the hypothesis was correct. The bug is rare, and the proposed fix might merely make it rarer. It would be bad if the fix just made another bug harder to discover.
The idea, then, was that to test this you hit the edge case and check whether a memory fence is sufficient to make the assertion succeed. But instead of rewriting the code to something like the following, you can just rely on invoking the debugger being invasive enough to have the same effect as the fence, and then rerun the assertion.
B: if(done) {
B: if(x != y) {
B: fence();
B: assert(x == y);
B: printf("Gotcha!\n");
B: }
B: // Now use x, y as before…
B: } else { retry }
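Spelled out in C++, that instrumented check might look like this (a sketch: x, y and done become relaxed atomics so the compiler can't cache the re-reads, and fence() is taken to be a full seq_cst fence):

    #include <atomic>
    #include <cassert>
    #include <cstdio>

    std::atomic<int>  x{0}, y{0};
    std::atomic<bool> done{false};

    void thread_B() {
        while (!done.load(std::memory_order_relaxed)) { /* retry */ }
        int a = x.load(std::memory_order_relaxed);
        int b = y.load(std::memory_order_relaxed);
        if (a != b) {
            // Suspected stale view: force ordering, then look again.
            std::atomic_thread_fence(std::memory_order_seq_cst);
            a = x.load(std::memory_order_relaxed);
            b = y.load(std::memory_order_relaxed);
            assert(a == b);
            std::printf("Gotcha!\n");  // the fence alone repaired the view
        }
        // Now use x, y as before...
    }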
One issue is that while this finds evidence that missing memory barriers were sometimes the cause (and protects against the case where the CPU is just having a good day while you're testing and the bug-triggering case never happens), it doesn't really prove that the missing barrier is the only cause of the x != y issue (this example is presumably distilled from a single potential cause of something more complicated). E.g. maybe a rare buffer overflow sets done to true too early.
> it doesn't really prove that the missing barrier is the only cause of the x != y issue [...] maybe a rare buffer overflow sets done to true too early.
I mean, you can never prove that, and that goes for pretty much any bug. Hypothetically, a hardware issue could be in the mix. Or a cosmic ray may hit the device (that's a real thing). You can't prove a negative after all.
But while this does not prove that there's no other cause with the same effect, it at least makes it extremely likely that the missing barrier was an issue: breaking in with the debugger suddenly makes it work without any other intervention, and besides, they had already looked at the machine code in the first place, with knowledge of how CPUs work.
The likelihood that there's also a rare buffer overflow consistently reproducing the same issue is negligibly small. If there is, that's a problem you can look at once it actually happens again.
I kind of knew where this was going in terms of the actual bug, but the way they've "proven" their fix is indeed pretty neat. I would have thoroughly explained the bug and fix, but to actually verify it I honestly would most probably have just done what they suggested (and rejected) first: run it long enough to show that the issue no longer occurs.
As an aside, I have not stared at the problem long enough to say this with full confidence, but if the barriers they inserted are full barriers, they might be too strong. Acquire/release might have been sufficient.
The article mentions reordering as one possible explanation, but it's important to remember that ever since multiprocessor machines gained per-core caches, they haven't truly been shared-memory machines anymore.
I don't think that's a distinctive explanation, though (i.e. "memory accesses were reordered" covers it already).
On common architectures, from a core's own point of view it never looks like its memory accesses were reordered. From other cores, it may look like that. If there weren't any caches of any sort, there would not be any observable "memory reordering", unless the same core observed it as well, which would break sequential consistency (the guarantee that memory accesses are observed in program order).
(And if I'm not missing anything, reordering memory accesses would be pretty useless without caches anyway.)
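The classic store-buffering litmus test illustrates the "looks reordered from another core" part (a sketch in C++, not taken from the article):

    #include <atomic>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    void thread_1() {
        x.store(1, std::memory_order_relaxed);
        r1 = y.load(std::memory_order_relaxed);
    }

    void thread_2() {
        y.store(1, std::memory_order_relaxed);
        r2 = x.load(std::memory_order_relaxed);
    }

    // If thread_1 and thread_2 run concurrently, r1 == 0 && r2 == 0 is a
    // permitted outcome -- observable even on x86, courtesy of store
    // buffers rather than caches. Each core sees its *own* accesses in
    // program order; only the other core's view looks "reordered".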
Caches might be coherent, meaning that the masters can look into each other's content without any flushing. That was cool when I was working on DMA-heavy code on a MIPS CPU.
However, it's not just about caches, but also write buffers. "Modern" GPUs play tricks with memory accesses by combining them, and certain CPUs use this feature to combine byte accesses into word accesses, reorder nearby accesses, and eliminate redundant stores/loads. Sometimes this helps performance quite a lot, but it hurts concurrency badly. 10% of the code was calls to the au_sync() macro to flush the write buffer... followed by some more instructions to flush the cache, in case that was required.
> "memory accesses were reordered" covers it already
There are valid weak executions which correspond to no "reordered" memory access pattern at all. The reordering theory can take you 95% of the way, but that arguably makes it far worse than a theory which only works in 5% of cases: you learn to trust it precisely where it will mislead you.
I don't want to try to recreate an exact code example on the fly (I'd surely get it wrong). But there's [C++, Rust, etc.] code which uses seqcst and acquire/release together to set up this scenario: there are some memory operations on one location using acquire and release; call the two effects, which happen in some order, A and B. Another location is manipulated using seqcst instructions with effects C1, C2, .... We are interested in possible executions where a certain thread observes a sequence of effects C... which is only consistent with A occurring before B. It turns out it's possible to set this up so that those C operations still don't create a happens-before edge between A and B, and therefore it would be legal for a particular compiler + weakly ordered CPU not to forward the effect of A to the operation B. This contradicts any theory based not only on a global total order (which is more easily ruled out anyway), but even on each thread having a local total order in which it observes all other threads' operations on multiple memory locations.
Setting aside that without a code sample my explanation lacks a punch line: If you "believe" the memory reordering explanation, then this is patent nonsense. If you "believe" the actual acq/rel semantics in the relevant specifications, then it's obvious (albeit unintuitive).
The problem is, this behavior really can happen, and it may not be that uncommon in complex systems, where it is well hidden. Reasoning about these systems using the memory reordering theory will yield results which contradict actual executions on actual compiler/CPU combinations.
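Their exact scenario aside, a well-known litmus test with the same flavor is IRIW (independent reads of independent writes); a sketch, not the example alluded to above:

    #include <atomic>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r1, r2, r3, r4;

    void writer_x() { x.store(1, std::memory_order_release); }
    void writer_y() { y.store(1, std::memory_order_release); }

    // Each reader has an acquire fence (a LoadLoad barrier) between its
    // two loads, so neither reader's accesses can be reordered.
    void reader_xy() {
        r1 = x.load(std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_acquire);
        r2 = y.load(std::memory_order_relaxed);
    }

    void reader_yx() {
        r3 = y.load(std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_acquire);
        r4 = x.load(std::memory_order_relaxed);
    }

    int main() {
        std::thread a(writer_x), b(writer_y), c(reader_xy), d(reader_yx);
        a.join(); b.join(); c.join(); d.join();
        // r1 == 1, r2 == 0, r3 == 1, r4 == 0 is allowed: the two readers
        // disagree about the order of the two independent stores. No
        // interleaving of the (un-reorderable) per-thread accesses against
        // a single shared memory can produce this; it requires stores
        // propagating to different observers at different times. The C++
        // model permits it (and it is architecturally allowed on POWER);
        // making all the operations seq_cst forbids it.
    }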
OP was suggesting that caches and hierarchical memory in general are distinct from "memory reordering". If I understand correctly, you are saying that the situations you have in mind are indeed a result of caches/hierarchical memory, but cannot be modeled as memory access reordering?
> situations you have in mind are indeed a result of caches/hierarchical memory, but cannot be modeled as memory access reordering
Well yes, but actually no. I mean, my point is that they can't be (correctly!) modeled as memory access reordering. But one doesn't have to, and shouldn't, appeal to caches. Recent ISAs (aarch64, risc-v) and recent languages (C++11, C<whatever it is now>, Rust) are both defined in terms of acquire-release semantics: hardware and software engineers alike have decided this is the common set of requirements we want to use.

For software programmers, attempting to reason about cache hierarchies is not only unnecessary; it could also end up just as wrong as the memory reordering theory if the design of new microarchitectures changes. But as long as those new uarches conform to existing ISAs, and/or are targeted by existing languages, the acq-rel model will not change out from under you. So it is the only one that should be used. Indeed, the memory reordering theory seems to originate from this same kind of appeal to microarchitectures which are now out of date, i.e. whatever the first multiprocessing x86 cores did.
Despite being popular in x86-centric communities, the "barrier" theory of multiprocessor programming cannot be correctly extended to weak architectures. Just learn release/acquire semantics, which work everywhere and are fairly straightforward.