
Is it?

No one has come close to solving the problem of optimizing software for multiple heterogeneous CPUs with differing microarchitectures when the scheduler is 'randomly' placing threads. There are various mechanisms in Linux/Windows/etc. that allow runtime selection of optimized code paths, but they are all oriented around a selection made once (e.g. pick a foo variant based on X and keep it), rather than revisited on every CPU migration or context switch.
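A minimal C sketch of that "decide once and keep it" dispatch pattern. The names (foo, cpu_has_fast_path) are illustrative, not from any real library, and the feature probe is a stub:

```c
#include <stdbool.h>

/* Two hypothetical variants of the same routine, each notionally tuned
 * for a different microarchitecture. They compute the same result. */
static int foo_generic(int x)  { return x + 1; }
static int foo_fastpath(int x) { return x + 1; } /* "tuned" path */

static bool cpu_has_fast_path(void) {
    /* Stand-in for a real CPUID/feature probe. */
    return false;
}

/* Resolved once at first call and then kept -- the pattern described
 * above. If the thread later migrates to a different core type, the
 * choice is never revisited. */
static int (*foo_impl)(int) = 0;

int foo(int x) {
    if (!foo_impl)
        foo_impl = cpu_has_fast_path() ? foo_fastpath : foo_generic;
    return foo_impl(x);
}
```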

This means one either uses a generic target and takes the ~1.5-2.5x perf hit, or optimizes for a single core type and ignores perf on the other cores. This is probably at least partially why the gaming IPC numbers are so poor: the games are optimized for a much older device.

So a 1.5-2.5x perf difference these days could be 5-10 years of performance uplift left on the table. Which is why AMD seems to be showing everyone a better path by simply using process optimizations and execution-time optimization choices with the same Zen cores. This gives them both area- and power-optimized cores without having to do a lot of heavyweight scheduler/OS and application optimization. Especially for OSes like Linux, which don't have strong foreground/background task signalling from a UI feeding the scheduler.

In other words, heterogeneous SMP cores using differing microarchitectures will likely be replaced with more homogeneous-looking cores that have better fine-grained power and process optimizations. I.e., cores with even wider power/performance ranges and dynamism will win out over the nearly impossible task of adding yet another set of variables to schedulers solving an already NP-hard problem, plus even finer-grained optimized-path selection logic that would further damage branch prediction and cache locality, the two problems already limiting CPU performance.



> No one has come close to solving the problem of optimizing software for multiple heterogeneous CPU's with differing micro-architectures when the scheduler is 'randomly' placing threads.

I think this isn't wholly correct. The comp-sci part of things is pretty well figured out. You can do work-stealing parallelism to keep queues filled with decent latency; you can even dynamically adjust work distribution to per-thread performance (i.e., manual scheduling). It's not trivial to use the best parallelism techniques on a heterogeneous architecture, especially when it comes to adapting existing code bases that aren't fundamentally compatible with those techniques. Things get even more interesting once you take cache locality, IO, and library/driver interactions into consideration. However, I think it's more accurately described as an adoption problem than something that's unsolved. It took many years after their debut for homogeneous multicore processors to be well supported across software, for similar reasons. There are still actively played games that don't appreciably leverage multiple cores. (E.g., Starcraft 2 does some minimal offloading to a 2nd thread.)
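To make the "dynamically adjust work distribution" point concrete, here's a minimal C sketch (not from any real codebase): workers pull chunk indices from a shared atomic counter, so a faster core naturally completes more chunks without any help from the OS scheduler:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define N_ITEMS   1000
#define N_THREADS 4

static atomic_int  next_item = 0; /* shared work index */
static atomic_long total     = 0;

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        /* Each thread grabs the next chunk; faster threads loop more often,
         * so work distribution self-balances across unequal cores. */
        int i = atomic_fetch_add(&next_item, 1);
        if (i >= N_ITEMS)
            break;
        atomic_fetch_add(&total, (long)i); /* stand-in for real work */
    }
    return NULL;
}

long run(void) {
    pthread_t t[N_THREADS];
    for (int i = 0; i < N_THREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < N_THREADS; i++)
        pthread_join(t[i], NULL);
    return atomic_load(&total);
}
```

Real work-stealing schedulers use per-thread deques rather than one shared counter, but the self-balancing property is the same.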


I'm not sure I understand what you're trying to say here WRT CPU microarch optimization with multiple CPU microarches in the machine. Maybe something about SMT/hyperthreading? But that doesn't appear to be what you're saying either.

AKA: I'm talking about the uplift one gets from, say, -march=native (or your arch of choice), FDO/PGO, and various other optimization choices. Ex: instruction selection for OoO cores. The compiler can know that coreX has only two functional units capable of some operation, that your code's critical path is bottlenecked by those operations, and can adjust the instruction mix to (mis)use some other functional units in parallel: two units doing X, and one doing Y. Or just load-to-use latency, or avoidance of certain instruction sequences, etc.

Those optimizations are tightly bound to a given core type. Sure, modern OoO cores do a better job of keeping units busy, but it's not uncommon to be working around some core deficiency by tweaking the compiler heuristics even now. Trawling through the gcc machine definitions:

https://github.com/gcc-mirror/gcc/blob/master/gcc/config/i38...

So, when the CPUs are heterogeneous with differing optimization targets, the code author ends up picking a 'generic' optimization target, and this decision by itself can frequently mean leaving a generation or two of performance behind vs. the usual method of just building a handful of shared libraries/etc. and picking one at runtime based on the CPU type.
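GCC's function multi-versioning is the in-process equivalent of that "handful of libraries, pick at runtime" trick: the compiler emits several variants plus a resolver that picks one at load time based on CPUID. Note it is still a decide-once mechanism, so it has exactly the heterogeneity problem described above. A sketch (the "avx2" target is just an example; feature names are x86-specific):

```c
/* GCC/clang function multi-versioning on x86: several variants of one
 * function are emitted, and an ifunc resolver selects among them once,
 * at load time, based on the CPU's reported features. */
__attribute__((target_clones("avx2", "default")))
long sum_ints(const int *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}
```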

Although, sure, an application author can on some platforms hook a rescheduling notification and then run a custom thread-local jump-table update to reconfigure which code paths are being run, or some other non-standard operation. Or, for that matter, just set their affinity to a matching set of cores. But none of this is a core operation in any of the normal runtime/etc. environments without considerable effort on the part of the application vendor.


Yeah, sorry, everything you're saying is right. Compilers won't do the work for you. I just took issue with the wording about it being unsolved. If we can produce optimal binaries for a given process for multiple architectures we can also swap them as needed. I don't think any big new ideas need to come around, just work to implement ideas we have.


By the way, compilers can conceivably do "lowest common denominator" architecture optimization to get decent perf on heterogeneous cores as a compromise, without leaning into every optimization for both core types.


> Which is why AMD seems to be showing everyone a better path by simply using process optimizations and execution time optimization choices with the same zen cores.

Funny you would say that, because AMD X3D CPUs have cores with 3D Cache, and cores without, and massive scheduling problems because of this.


Which is just a caching/locality asymmetry, knowledge of which has been at least partially integrated into schedulers for a couple of decades now.

It just goes to show how hard scheduling actually is.

But also, you call it a 'massive' problem, and it's actually somewhat small in comparison to what can happen with vastly different core types in the same machine. Many of those cores also have quite large cache differences.


I think part of the problem is that, from where I stand, there's no way to tell my programming language (Java) "Hey, this thing doesn't need horsepower, so prefer to schedule it on little cores." Or conversely, "Hey, this thing is CPU sensitive, don't put it on a LITTLE core."

I don't think (but could be wrong) that C++ has a platform-independent way of doing this either. I'm not even sure such an API is exposed by the Windows or Linux kernel (though I'd imagine it is).

That to me is the bigger issue here. I can specify which core a thread should run on, but I can't specify what type of core it should run on.


Windows has an API for that, I don't think it's widely used though:

> On platforms with heterogeneous processors, the QoS of a thread may restrict scheduling to a subset of processors, or indicate a preference for a particular class of processor.

https://learn.microsoft.com/en-us/windows/win32/procthread/q...


The POSIX threads API, used by C/C++ and a lot of other language runtimes, is somewhat platform independent and provides affinity, priority, and policy controls.

For starters: https://man7.org/linux/man-pages/man3/pthread_setschedparam....

None of the standard attributes (AFAIK) directly say "put me on a tiny core", but they frequently work hand in hand with those decisions: lower your priority and, when the machine isn't busy, the scheduler dumps you on some low-clocked small core. Or, if you have machine knowledge, just set your core affinity with pthread_setaffinity_np(), which, as the _np suffix states, is non-portable, but can directly translate to "run me on a tiny core" if the right mask is provided.

At least in C/C++, most if not all modern platforms provide these kinds of controls and metadata, and while the API might change a bit, writing a couple of versions of schedule_me_on_a_tiny_core() for each target platform is fairly trivial.
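For instance, a Linux-only sketch of that schedule_me_on_a_tiny_core() idea. The caller still has to know which core numbers are the "tiny" ones; on a real system you'd read something like CPU capacity from sysfs, which this sketch assumes away:

```c
#define _GNU_SOURCE /* needed for pthread_setaffinity_np and cpu_set_t */
#include <pthread.h>
#include <sched.h>

/* Restrict the calling thread to the given set of core numbers.
 * Returns 0 on success, an errno value otherwise. */
int schedule_me_on_cores(const int *cores, int n) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int i = 0; i < n; i++)
        CPU_SET(cores[i], &set);
    /* _np: non-portable, as noted above; this is the Linux/glibc call. */
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```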



