
Fantastic article, as always. Regarding the top-down analysis: I was a bit surprised to see that in ~1/5 of the cases the pipeline stalls because it is Frontend Bound. Can that be? Similarly, why is Frontend Bandwidth a subgroup of Frontend Bound? Shouldn't one micro-op be enough?


Take Frontend Bound with a grain of salt. Frequently I find a backend backpressure reason for it, e.g. long-tail memory loads needed for a conditional branch or atomic. There are limitations to sampling methods and top-down analysis; consider it a starting point for understanding the potential bottlenecks, not the final word.


Interesting. You realize this by identifying the offending assembly instructions and then seeing that one of the operands comes from memory?


There's no single good way, but yes, as you said, logical deduction based on the surrounding instructions and their hardware counters is one way to do it. Instruction B might be collecting a ton of hardware-counted cycles, but that could be because instruction A, which it depends on, is slow. Sometimes those dependencies are even implicit: since x86 commits in order, some instructions like lock/atomics have implicit and dynamic dependencies based on what is in the reorder buffer at the time.

To give a concrete example I encountered while analysing a GC: traversing the object graph in a loop means calculating the address of an object, loading that object, doing some work on it and then grabbing the bits to calculate the children to visit next. This creates a long, brittle chain of data-dependent conditionals, depending on a calculation that eventually came from a much earlier load. That conditional branch might be 30/70 taken/untaken, so the branch predictor often does not speculate, reducing the ILP and making it harder to hide the load's latencies. Now, dear Watson, would you say the blame lies with the front end? There are no stalls when all the loads go to fast cache, only when there is the occasional remote LLC hit, DRAM hit or, god forbid, cross-NUMA hit. What if I tell you that there's an atomic operation to mark the object as visited, which is fast in itself but can only be issued when all prior loads have completed, and which stops newer instructions from being issued until it has been committed?
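
For readers who want something to stare at, here is a minimal C++ sketch of the kind of marking loop described above. It is not the actual collector: the `Object` layout, the `child_mask` field and the `mark_from` function are all hypothetical names made up for illustration. It only shows where the dependent load, the data-dependent branch and the atomic "visited" flag sit relative to each other.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical object layout, not taken from any real collector.
struct Object {
    std::atomic<bool> marked{false};   // "visited" flag set by the marker
    std::uint32_t     child_mask = 0;  // bit i set => children[i] must be traced
    Object*           children[4] = {nullptr, nullptr, nullptr, nullptr};
};

// Marks everything reachable from root; returns how many objects were newly marked.
std::size_t mark_from(Object* root) {
    std::vector<Object*> stack;
    if (root != nullptr) stack.push_back(root);
    std::size_t newly_marked = 0;

    while (!stack.empty()) {
        // The address of the next object comes out of an earlier load.
        Object* obj = stack.back();
        stack.pop_back();

        // Atomic read-modify-write on the flag. On x86 this typically
        // compiles to a locked instruction, which is the operation the
        // comment above says waits for prior loads and holds up newer ones.
        if (obj->marked.exchange(true, std::memory_order_acq_rel)) {
            continue;  // someone already visited this object
        }
        ++newly_marked;

        // This load may hit L1 or go all the way to DRAM / a remote node.
        std::uint32_t mask = obj->child_mask;

        for (int i = 0; i < 4; ++i) {
            // Data-dependent branch: its outcome is unknown until `mask`
            // arrives, so a slow load here surfaces as a stall further on.
            if ((mask >> i) & 1u) {
                assert(obj->children[i] != nullptr);
                stack.push_back(obj->children[i]);
            }
        }
    }
    return newly_marked;
}
```

A sampling profiler would likely pile cycles onto the branch or the atomic exchange, even though in the scenario above the slow part is the load feeding them.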

You need to look at a whole bunch of surrounding instructions and a variety of hardware counters to start forming a picture. Insert Always Sunny in Philadelphia meme with the red wire crime board here.



