
Is it?

No one has come close to solving the problem of optimizing software for multiple heterogeneous CPUs with differing microarchitectures when the scheduler is 'randomly' placing threads. There are various mechanisms in Linux/Windows/etc. that allow runtime selection of optimized code paths, but they are all oriented around a selection made once (e.g. pick a foo variant based on X and keep it), rather than revisited on every CPU migration or context switch.
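A minimal C sketch of that "decide once and keep it" dispatch pattern. The names (foo, cpu_has_fast_path) are illustrative, not from any real library, and the feature probe is a stub:

```c
#include <stdbool.h>

/* Two hypothetical variants of the same routine, each notionally tuned
 * for a different microarchitecture. They compute the same result. */
static int foo_generic(int x)  { return x + 1; }
static int foo_fastpath(int x) { return x + 1; } /* "tuned" path */

static bool cpu_has_fast_path(void) {
    /* Stand-in for a real CPUID/feature probe. */
    return false;
}

/* Resolved once at first call and then kept -- the pattern described
 * above. If the thread later migrates to a different core type, the
 * choice is never revisited. */
static int (*foo_impl)(int) = 0;

int foo(int x) {
    if (!foo_impl)
        foo_impl = cpu_has_fast_path() ? foo_fastpath : foo_generic;
    return foo_impl(x);
}
```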

This means one either uses a generic target and takes the ~1.5-2.5x perf hit, or optimizes for a single core type and ignores perf on the other cores. This is probably at least partially why the gaming IPC numbers are so poor: the games are optimized for a much older device.

So a 1.5-2.5x perf difference these days could be 5-10 years of performance uplift left on the table. Which is why AMD seems to be showing everyone a better path by simply using process optimizations and execution-time optimization choices with the same Zen cores. This gives them both area- and power-optimized cores without having to do a lot of heavyweight scheduler/OS and application optimization. Especially for OSes like Linux, which don't have strong foreground/background task signalling from a UI feeding the scheduler.

In other words, heterogeneous SMP cores using differing microarchitectures will likely be replaced with more homogeneous-looking cores that have better fine-grained power and process optimizations. I.e., cores with even wider power/performance ranges and dynamism will win out over the nearly impossible task of adding yet another set of variables to schedulers solving an already NP-hard problem, plus even finer-grained optimized-path selection logic that would further damage branch prediction and cache locality, the two problems already limiting CPU performance.



> No one has come close to solving the problem of optimizing software for multiple heterogeneous CPU's with differing micro-architectures when the scheduler is 'randomly' placing threads.

I think this isn't wholly correct. The comp-sci part of things is pretty well figured out. You can do work-stealing parallelism to keep queues filled with decent latency; you can even dynamically adjust work distribution to per-thread performance (i.e., manual scheduling). It's not trivial to use the best parallelism techniques on a heterogeneous architecture, especially when it comes to adapting existing code bases that aren't fundamentally compatible with those techniques. Things get even more interesting once you take cache locality, IO, and library/driver interactions into consideration. However, I think it's more accurately described as an adoption problem than something that's unsolved. It took many years after their debut for homogeneous multicore processors to be well supported across software, for similar reasons. There are still actively played games that don't appreciably leverage multiple cores. (E.g., Starcraft 2 does some minimal offloading to a 2nd thread.)
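To make the "dynamically adjust work distribution" point concrete, here's a minimal C sketch (not from any real codebase): workers pull chunk indices from a shared atomic counter, so a faster core naturally completes more chunks without any help from the OS scheduler:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define N_ITEMS   1000
#define N_THREADS 4

static atomic_int  next_item = 0; /* shared work index */
static atomic_long total     = 0;

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        /* Each thread grabs the next chunk; faster threads loop more often,
         * so work distribution self-balances across unequal cores. */
        int i = atomic_fetch_add(&next_item, 1);
        if (i >= N_ITEMS)
            break;
        atomic_fetch_add(&total, (long)i); /* stand-in for real work */
    }
    return NULL;
}

long run(void) {
    pthread_t t[N_THREADS];
    for (int i = 0; i < N_THREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < N_THREADS; i++)
        pthread_join(t[i], NULL);
    return atomic_load(&total);
}
```

Real work-stealing schedulers use per-thread deques rather than one shared counter, but the self-balancing property is the same.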


I'm not sure I understand what you're trying to say here WRT CPU microarch optimization with multiple CPU microarches in the machine. Maybe something about SMT/hyperthreading? But that doesn't appear to be what you're saying either.

AKA: I'm talking about the uplift one gets from, say, -march=native (or your arch of choice), FDO/PGO, and various other optimization choices. Ex: instruction selection for OoO cores. The compiler can know that coreX has only two functional units capable of some operation, that your code's critical path is bottlenecked by those operations, and can adjust the instruction mix to (mis)use some other functional units in parallel: two units doing X, and one doing Y. Or just load-to-use latency, or avoidance of certain instruction sequences, etc.

Those optimizations are tightly bound to a given core type. Sure, modern OoO cores do a better job of keeping units busy, but it's not uncommon to be working around some core deficiency by tweaking the compiler heuristics even now. Trawling through the gcc machine definitions:

https://github.com/gcc-mirror/gcc/blob/master/gcc/config/i38...

So, when the CPUs are heterogeneous with differing optimization targets, the code author ends up picking a 'generic' optimization target, and this decision by itself can frequently mean leaving a generation or two of performance behind vs. the usual method of just building a handful of shared libraries/etc. and picking one at runtime based on the CPU type.
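GCC's function multi-versioning is the in-process equivalent of that "handful of libraries, pick at runtime" trick: the compiler emits several variants plus a resolver that picks one at load time based on CPUID. Note it is still a decide-once mechanism, so it has exactly the heterogeneity problem described above. A sketch (the "avx2" target is just an example; feature names are x86-specific):

```c
/* GCC/clang function multi-versioning on x86: several variants of one
 * function are emitted, and an ifunc resolver selects among them once,
 * at load time, based on the CPU's reported features. */
__attribute__((target_clones("avx2", "default")))
long sum_ints(const int *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}
```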

Although, sure, an application author can on some platforms hook a rescheduling notification and then run a custom thread-local jump-table update to reconfigure which code paths are being run, or some other non-standard operation. Or, for that matter, just set their affinity to a matching set of cores. But none of this is a core operation in any of the normal runtime/etc. environments without considerable effort on the part of the application vendor.


Yeah, sorry, everything you're saying is right. Compilers won't do the work for you. I just took issue with the wording about it being unsolved. If we can produce optimal binaries for a given process for multiple architectures we can also swap them as needed. I don't think any big new ideas need to come around, just work to implement ideas we have.


By the way, compilers can conceivably do "lowest common denominator" architecture optimization to get decent perf on heterogeneous cores as a compromise, without leaning into every optimization for both core types.


> Which is why AMD seems to be showing everyone a better path by simply using process optimizations and execution time optimization choices with the same zen cores.

Funny you would say that, because AMD X3D CPUs have cores with 3D Cache, and cores without, and massive scheduling problems because of this.


Which is just a caching/locality asymmetry, knowledge of which has been at least partially integrated into schedulers for a couple of decades now.

It just goes to show how hard scheduling actually is.

But also, you call it a 'massive' problem, and it's actually somewhat small in comparison to what can happen with vastly different core types in the same machine. Many of those cores also have quite large cache differences.


I think part of the problem is that, from where I stand, there's no way to tell my programming language (Java) "Hey, this thing doesn't need horsepower, so prefer to schedule it on little cores." Or conversely, "Hey, this thing is CPU sensitive, don't put it on a LITTLE core."

I don't think (but could be wrong) that C++ has a platform-independent way of doing this either. I'm not even sure such an API is exposed by the Windows or Linux kernel (though I'd imagine it is).

That to me is the bigger issue here. I can specify which core a thread should run on, but I can't specify what type of core it should run on.


Windows has an API for that, I don't think it's widely used though:

> On platforms with heterogeneous processors, the QoS of a thread may restrict scheduling to a subset of processors, or indicate a preference for a particular class of processor.

https://learn.microsoft.com/en-us/windows/win32/procthread/q...


The POSIX threads API, used by C/C++ and a lot of other language runtimes, is somewhat platform independent and provides affinity, priority, and policy controls.

For starters: https://man7.org/linux/man-pages/man3/pthread_setschedparam....

None of the standard attributes (AFAIK) directly say "put me on a tiny core", but they frequently work hand in hand with those decisions: lower your priority and, when the machine isn't busy, the scheduler dumps you on some low-clocked small core. Or, if you have machine knowledge, just set your core affinity with pthread_setaffinity_np(), which, as the _np suffix states, is non-portable, but can directly translate to "run me on a tiny core" if the right mask is provided.

At least in C/C++, most if not all modern platforms provide these kinds of controls and metadata, and while the API might change a bit, writing a couple of versions of schedule_me_on_a_tiny_core() for each target platform is fairly trivial.
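For instance, a Linux-only sketch of that schedule_me_on_a_tiny_core() idea. The caller still has to know which core numbers are the "tiny" ones; on a real system you'd read something like CPU capacity from sysfs, which this sketch assumes away:

```c
#define _GNU_SOURCE /* needed for pthread_setaffinity_np and cpu_set_t */
#include <pthread.h>
#include <sched.h>

/* Restrict the calling thread to the given set of core numbers.
 * Returns 0 on success, an errno value otherwise. */
int schedule_me_on_cores(const int *cores, int n) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int i = 0; i < n; i++)
        CPU_SET(cores[i], &set);
    /* _np: non-portable, as noted above; this is the Linux/glibc call. */
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```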



