Fantastic article, as always. Regarding the top-down analysis: I was a bit surpr...

yvdriess · 2025-07-07T09:47:17 1751881637

Take "front-end bound" with a grain of salt. Frequently I find a backend backpressure reason for it, e.g. long-tail memory loads needed for a conditional branch or an atomic. Sampling methods and top-down analysis have their limitations; consider them a starting point for understanding the potential bottlenecks, not the final word.

fschutze · 2025-07-07T11:08:30 1751886510

Interesting. Do you figure this out by identifying the offending assembly instructions and then seeing that one of the operands comes from memory?

yvdriess · 2025-07-09T15:56:25 1752076585

There's no single good way, but yes, as you said, logical deduction based on the surrounding instructions and their hardware counters is one way to do it. Instruction B might be collecting a ton of hardware-counted cycles, but that could be because instruction A, which it depends on, is slow. Sometimes those dependencies are even implicit: since x86 commits in order, some instructions such as lock-prefixed/atomic operations have implicit, dynamic dependencies based on what is in the reorder buffer at the time.

To give a concrete example I encountered while analysing a GC: traversing the object graph in a loop means calculating the address of an object, loading that object, doing some work on it and then grabbing the bits to calculate which children to visit next. This creates a long, brittle chain of data-dependent conditionals, each depending on a calculation that ultimately came from a much earlier load. That conditional branch might be 30/70 taken/untaken, so the branch predictor often cannot speculate effectively, which reduces the ILP and makes it harder to hide the load latencies. Now, dear Watson, would you say the blame lies with the front end? There are no stalls when all the loads hit in fast cache, only when there is the occasional remote LLC hit, trip to DRAM or, god forbid, cross-NUMA access. And what if I tell you that there's an atomic operation to mark the object as visited, which is fast in itself but can only be issued once all prior loads have completed, and which stops newer instructions from being issued until it has committed?
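For anyone who wants to see the shape of that loop, here is a minimal C++ sketch. It is not the GC I was looking at: the `Object` layout, the `tag` bit and the `mark_from` function are all made up for illustration. It only shows where the load, the atomic mark and the data-dependent branch sit relative to each other.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical object layout, invented for this sketch.
struct Object {
    std::atomic<bool> marked{false};   // "visited" flag, set atomically
    std::uint32_t tag = 0;             // low bit: 1 = scan children, 0 = skip
    std::vector<Object*> children;
};

// Marks everything reachable from root (subject to the tag bit) and
// returns how many objects this call newly marked.
std::size_t mark_from(Object* root) {
    assert(root != nullptr);
    std::vector<Object*> worklist{root};
    std::size_t newly_marked = 0;

    while (!worklist.empty()) {
        // The object's address itself comes from an earlier load.
        Object* obj = worklist.back();
        worklist.pop_back();

        // Atomic read-modify-write. On x86 this is typically a locked
        // instruction, which is where the ordering constraints described
        // in the comment above come into play.
        if (obj->marked.exchange(true, std::memory_order_acq_rel)) {
            continue;  // someone already visited this object
        }
        ++newly_marked;

        // Data-dependent branch: it cannot resolve until obj->tag has
        // arrived, so a slow load here shows up as a stalled branch.
        if ((obj->tag & 1u) == 0) {
            continue;
        }
        for (Object* child : obj->children) {
            worklist.push_back(child);
        }
    }
    return newly_marked;
}
```

In a profile of something like this, the samples may pile up on the branch or on the instruction after the atomic, even though the slow part is the load of `*obj` a few instructions earlier.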

You need to look at a whole bunch of surrounding instructions and a variety of hardware counters to start forming a picture. Insert Always Sunny in Philadelphia meme with the red wire crime board here.