The code looks 100% identical except for the namespace prefixes. It must be something particular about the GitHub runner setup, because on my machine (GCC 15.2.1 / Clang 20.1.8 / Ryzen 5 5600X) the run times are indistinguishable. Interestingly, with default flags plus -O3 Clang is 30% slower, while with the flags from the script (-s -static -flto $MARCH_FLAG -mtune=native -fomit-frame-pointer -fno-signed-zeros -fno-trapping-math -fassociative-math) Clang is a bit faster.
A nitpick is that benchmarking C/C++ with $MARCH_FLAG -mtune=native and fast-math magic is kinda unfair to Zig/Julia (Nim seems to support those) - unless you are running Gentoo, those flags are unlikely to be used for real applications.
I suspect this is it. Any benchmark that takes less than a second to run should have its iteration count increased until it takes at least a second, and preferably 5+ seconds, to run. Otherwise CPU scheduling, network processing, etc. perturb everything.
BenchExec "uses the cgroups feature of the Linux kernel to correctly handle groups of processes and uses Linux user namespaces to create a container that restricts interference of [each program] with the benchmarking host."
Certainly better, but you’re always going to be better off maximizing the runtime to a level where it just swamps any of the other effects. Then do multiple runs and take an average.
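The advice in the last two comments could be sketched roughly like this (not from the thread; `workload`, the starting iteration count, and the 200 ms threshold are all placeholders - in practice you'd aim for 5+ seconds per run, as suggested above):

```shell
# Placeholder workload standing in for the real benchmark body.
workload() { seq 1 "$1" | awk '{s+=$1} END{print s}' >/dev/null; }

# Grow the iteration count until a single run takes long enough
# to swamp scheduling noise.
iters=1000
min_ms=200           # illustrative; use 5000+ in practice
while :; do
    t0=$(date +%s%N)
    workload "$iters"
    ms=$(( ($(date +%s%N) - t0) / 1000000 ))
    [ "$ms" -ge "$min_ms" ] && break
    iters=$(( iters * 10 ))
done

# Then do multiple runs at that count and take an average.
total=0
for run in 1 2 3; do
    t0=$(date +%s%N)
    workload "$iters"
    total=$(( total + ($(date +%s%N) - t0) / 1000000 ))
done
echo "iters=$iters avg_ms=$(( total / 3 ))"
```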
(I have spent a good amount of time hacking on the LLVM pass pipeline for my personal project, so if there were a significant difference I probably would have seen it by now.)
You are correct, that was an uneducated guess on my part.
I just glanced at the IR, which differed in some attributes (nounwind vs. mustprogress norecurse), but the resulting assembly is 100% identical at every optimization level.
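For anyone wanting to repeat that kind of check, here is a minimal sketch (the source file and function are made up for illustration): dump the optimized IR, where attribute differences like nounwind/mustprogress show up, alongside the final assembly at each optimization level.

```shell
# Hypothetical test function; the real comparison would use the
# benchmark's own translation units.
cat > f.c <<'EOF'
int add(int a, int b) { return a + b; }
EOF

# Assumes clang is on PATH; skip the dump otherwise.
if command -v clang >/dev/null; then
    for opt in -O0 -O1 -O2 -O3; do
        clang "$opt" -S -emit-llvm -o "f$opt.ll" f.c  # IR, with attributes
        clang "$opt" -S            -o "f$opt.s"  f.c  # final assembly
    done
    grep -h 'attributes #' f-O2.ll  # shows nounwind, mustprogress, etc.
fi
```

The `.s` files from two variants can then be diffed directly; IR-level attribute differences often wash out by the time code is emitted.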