Here's my favorite practically applicable cache-related fact: even on recent x86 server CPUs, cache-coherency protocols may operate at a granularity different from the cache line size. A typical case on new Intel server CPUs is operating at the granularity of 2 consecutive cache lines. Some thread-pool implementations, like CrossBeam in Rust and my ForkUnion in Rust and C++, explicitly document that and align objects to 128 bytes [1]:
/**
* @brief Defines variable alignment to avoid false sharing.
* @see https://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size
* @see https://docs.rs/crossbeam-utils/latest/crossbeam_utils/struct.CachePadded.html
*
* The C++ STL way to do it is to use `std::hardware_destructive_interference_size` if available:
*
* @code{.cpp}
* #if defined(__cpp_lib_hardware_interference_size)
* static constexpr std::size_t default_alignment_k = std::hardware_destructive_interference_size;
* #else
* static constexpr std::size_t default_alignment_k = alignof(std::max_align_t);
* #endif
* @endcode
*
 * That, however, results in all kinds of ABI warnings with GCC and a suboptimal alignment choice,
 * unless you hard-code `--param hardware_destructive_interference_size=64` or disable the warning
 * with `-Wno-interference-size`.
*/
static constexpr std::size_t default_alignment_k = 128;
As mentioned in the docstring above, using the STL's `std::hardware_destructive_interference_size` won't help you. On ARM, this issue becomes even more pronounced, so concurrency-heavy code should ideally be compiled multiple times for different coherence protocols and leverage "dynamic dispatch", similar to how I and others handle SIMD instructions in libraries that need to run on a very diverse set of platforms.
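To make the usage concrete, here is a minimal sketch (the names are illustrative, not ForkUnion's actual internals) of how such a constant gets applied, so that hot per-thread state never shares a cache-line pair:

#include <atomic>
#include <cstddef>

static constexpr std::size_t default_alignment_k = 128;

// Each worker owns one counter; `alignas` both aligns the object and pads
// `sizeof` up to 128 bytes, so adjacent array elements never share a line.
struct alignas(default_alignment_k) padded_counter_t {
    std::atomic<std::size_t> value {0};
};

static_assert(sizeof(padded_counter_t) == default_alignment_k, "padding didn't apply");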
> even on x86 on recent server CPUs, cache-coherency protocols may be operating at a different granularity than the cache line size. A typical case with new Intel server CPUs is operating at the granularity of 2 consecutive cache lines
I don’t think it is accurate that Intel CPUs use 2 cache lines / 128 bytes as the coherency protocol granule.
Yes, there can be additional destructive interference effects at that granularity, but that's due to prefetching (of two cache lines with coherency managed independently) rather than coherency operating on one 128-byte granule.
AFAIK 64 bytes is still the correct granule for avoiding false sharing, with two cores modifying two consecutive cachelines having way less destructive interference than two cores modifying one cacheline.
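A quick way to settle that on any particular machine is to time two threads hammering (a) two counters in the same cache line, (b) counters on adjacent lines, and (c) counters 128 bytes apart. A minimal sketch, with an arbitrary iteration count:

#include <atomic>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <thread>

constexpr std::size_t iterations_k = 100'000'000;

struct same_line_pair_t { std::atomic<long> first {0}, second {0}; }; // one cache line
struct adjacent_lines_pair_t {
    alignas(64) std::atomic<long> first {0};   // two consecutive,
    alignas(64) std::atomic<long> second {0};  // separately-coherent lines
};
struct far_apart_pair_t {
    alignas(128) std::atomic<long> first {0};  // different 128-byte
    alignas(128) std::atomic<long> second {0}; // blocks entirely
};

template <typename pair_type>
double run_milliseconds() {
    pair_type pair;
    auto hammer = [](std::atomic<long> &counter) {
        for (std::size_t i = 0; i != iterations_k; ++i)
            counter.fetch_add(1, std::memory_order_relaxed);
    };
    auto start = std::chrono::steady_clock::now();
    std::thread a([&] { hammer(pair.first); });
    std::thread b([&] { hammer(pair.second); });
    a.join(), b.join();
    return std::chrono::duration<double, std::milli>(std::chrono::steady_clock::now() - start).count();
}

int main() {
    std::printf("same line:      %.1f ms\n", run_milliseconds<same_line_pair_t>());
    std::printf("adjacent lines: %.1f ms\n", run_milliseconds<adjacent_lines_pair_t>());
    std::printf("128 B apart:    %.1f ms\n", run_milliseconds<far_apart_pair_t>());
    return 0;
}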
It’s not a cargo cult if the actions directly cause cargo to arrive based on well understood mechanics.
Regardless of whether it would be better in some situations to align to 128 bytes, 64 bytes really is the cache line size on all common x86 CPUs, and it is a good idea to avoid threads modifying the same cache line.
It indeed isn't, but I've seen my share of systems where nobody checked if cargo arrived. (The code was checked in without any benchmarks done, and after many years, it was found that the macros used were effectively no-ops :-) )
Storage, strings, sorting, counting, bioinformatics... I got nerd-sniped! Can't resist a shameless plug here :)
Looking at the code, there are a few things I would consider optimizing. I'd start by trying (my) StringZilla for hashing and sorting.
HashBrown collections under the hood use aHash, which is an excellent hash function, but on both short and long inputs, on new CPUs, StringZilla seems faster [0]:
                     short         long
aHash::hash_one      1.23 GiB/s     8.61 GiB/s
stringzilla::hash    1.84 GiB/s    11.38 GiB/s
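The Rust way to swap the function is to implement `BuildHasher` and build the map with `HashMap::with_hasher`. To stay consistent with the C++ snippet earlier in the thread, here is the same idea in C++, with a trivial FNV-1a stand-in where StringZilla's (or any other) byte-level hash would go:

#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <unordered_map>

// Stand-in hasher: replace the FNV-1a body with your hash function of choice.
struct byte_hasher_t {
    std::size_t operator()(std::string_view text) const noexcept {
        std::uint64_t hash = 1469598103934665603ull; // FNV offset basis
        for (unsigned char byte : text) {
            hash ^= byte;
            hash *= 1099511628211ull; // FNV prime
        }
        return static_cast<std::size_t>(hash);
    }
};

// Usage: a string-keyed counting map with the custom hash plugged in.
using counts_map_t = std::unordered_map<std::string, std::uint64_t, byte_hasher_t>;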
A similar story with sorting strings. Inner loops of arbitrary-length string comparisons often dominate such workloads. Doing it in a more radix-style fashion can 4x your performance [1]:
                                    short                   long
std::sort_unstable_by_key           ~54.35 M compares/s      57.70 M compares/s
stringzilla::argsort_permutation    ~213.73 M compares/s     74.64 M compares/s
Bear in mind that "compares/s" is a made-up metric here; in reality, I'm computing it from the total duration.
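To illustrate what "radix-style" means here (a toy sketch, not StringZilla's actual algorithm): compare fixed-width integer prefixes first and only fall back to full string comparison on ties, so the inner loop rarely touches the string bodies. Assumes a little-endian machine and a GCC/Clang builtin:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Pack the first 8 bytes into an integer whose order matches lexicographic order.
static std::uint64_t prefix_key(std::string const &text) {
    std::uint64_t key = 0;
    std::memcpy(&key, text.data(), std::min<std::size_t>(text.size(), sizeof(key)));
    return __builtin_bswap64(key); // little-endian only; `std::byteswap` in C++23
}

void sort_strings(std::vector<std::string> &strings) {
    std::sort(strings.begin(), strings.end(), [](std::string const &a, std::string const &b) {
        std::uint64_t key_a = prefix_key(a), key_b = prefix_key(b);
        if (key_a != key_b) return key_a < key_b; // resolved from the prefix alone
        return a < b;                             // full comparison only on ties
    });
}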
Cool suggestions! I'd definitely be interested in exploring other hash functions for this (and other bioinformatics work), so I'll take a look at your stringzilla lib.
Congrats on the release, Sam - the preview looks great!
I'm curious about the technical side: how are you handling the dimensionality reduction and visualization? Also noticed you mentioned "custom-trained LLMs" in the tweet - how large are those models, and what motivated using custom ones instead of existing open models?
We'll release the full data explorer soon, with more info.
At the core of this project is a structured-extraction task using a custom Qwen 14B model, which we distilled from larger closed-source models. We needed a model we could run at scale on https://devnet.inference.net, which consists mostly of idle consumer-grade NVIDIA devices.
Embeddings were generated using SPECTER2, a transformer model from AllenAI specifically designed for scientific documents. The model processes each paper's title, executive summary, and research context to generate 768-dimensional embeddings optimized for semantic search over scientific literature.
The visualization uses UMAP to reduce the 768D embeddings to 3D coordinates, preserving local and global structure. K-Means clustering groups papers into ~100 clusters based on semantic similarity in the embedding space. Cluster labels are automatically generated using TF-IDF analysis of paper fields and key takeaways, identifying the most distinctive terms for each cluster.
Will need some time to go through the details, but it’s increasingly rare to see teams consistently delivering meaningful improvements in the open. Impressive work!
Despite the promise of scalable vectors, I've struggled to find cases where it consistently beats NEON. Curious if others have found good use cases for SVE beyond AES-ing tiny string keys for hash-table lookups, scatter-gathers in small buffers, and histograms ;)
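For readers who haven't touched SVE: the main ergonomic win is the predication model, which lets a loop handle short inputs and tails without a scalar epilogue. A minimal sketch (illustrative only, not a claim that it beats the NEON equivalent), built with something like -march=armv8-a+sve:

#include <arm_sve.h>
#include <cstddef>
#include <cstdint>

// Count occurrences of `needle` in `text`: the `whilelt` predicate masks off
// the tail, so short inputs and the last partial vector need no scalar loop.
std::size_t count_byte_sve(std::uint8_t const *text, std::size_t length, std::uint8_t needle) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < length; i += svcntb()) {
        svbool_t predicate = svwhilelt_b8_u64(i, length);
        svuint8_t chunk = svld1_u8(predicate, text + i);
        svbool_t matches = svcmpeq_n_u8(predicate, chunk, needle);
        count += svcntp_b8(predicate, matches);
    }
    return count;
}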
Those are extremely uniform latencies. Seems like on these CPUs most of the benefit from NUMA-aware thread-pools will come from reduced contention - synchronizing mostly small subsets of cores - rather than from actual memory affinity.
Well, all of the memory hangs off the IO die. I remember AMD docs outright recommending configuring the processor to hide NUMA nodes from the workload, since trying to optimize for them might not even do anything for a lot of workloads.
That AMD slide (in the conclusion) claims their switching fabric has some kind of bypass mode to improve latency when utilisation is low.
So they have been really optimising that IO die for latency.
NUMA is already workload sensitive, you need to benchmark your exact workload to know if it’s worth enabling or not, and this change is probably going to make it even less worthwhile. Sounds like you will need a workload that really pushes total memory bandwidth to make NUMA worthwhile.
NUMA is only useful if you have multiple sockets, because then you have several I/O dies and you want your workload 1) to be closer to the I/O device and 2) to avoid crossing the socket interconnect. Within the same socket, all CPUs share the same I/O die, thus uniform latency.
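For anyone who wants to measure the affinity side on their own box, here is a minimal libnuma sketch (link with -lnuma); whether pinning plus node-local allocation helps at all is exactly the workload-dependent question above:

#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <numa.h> // libnuma

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "NUMA not available on this system\n");
        return EXIT_FAILURE;
    }
    std::printf("NUMA nodes: %d\n", numa_max_node() + 1);

    // Pin the calling thread to node 0 and allocate memory on the same node,
    // so the accesses below stay "local" in the NUMA sense.
    numa_run_on_node(0);
    std::size_t size = 1u << 30; // 1 GiB
    char *buffer = static_cast<char *>(numa_alloc_onnode(size, 0));
    if (!buffer) return EXIT_FAILURE;

    // ... touch the memory and run the benchmark of interest here ...

    numa_free(buffer, size);
    return EXIT_SUCCESS;
}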
The second piece (uniform call syntax) looks convenient, though I don’t see a realistic way to integrate it into modern C++. The first (structural pattern matching) is, for me, more of a dividing line between low- and high-level languages. I tend to avoid it in my C++, just as I avoid inheritance, virtual functions, and exceptions… or `<functional>` header contents.
Still, it’s always fun to stumble on corners of the STL I’d never paid attention to, even if I won’t end up using them. Thought it was worth sharing :)
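For readers wondering what the nearest STL equivalent of pattern matching looks like today, it's `std::visit` over a `std::variant` with an overload set; a minimal C++17 sketch, which is also a decent illustration of why people keep asking for the real thing:

#include <iostream>
#include <string>
#include <variant>

// The classic "overloaded" helper: one callable assembled from several lambdas.
template <class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
template <class... Ts> overloaded(Ts...) -> overloaded<Ts...>; // deduction guide, implicit in C++20

int main() {
    std::variant<int, double, std::string> value = std::string("hello");
    std::visit(overloaded {
                   [](int i) { std::cout << "int: " << i << "\n"; },
                   [](double d) { std::cout << "double: " << d << "\n"; },
                   [](std::string const &s) { std::cout << "string: " << s << "\n"; },
               },
               value);
    return 0;
}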
[1] https://github.com/ashvardanian/ForkUnion/blob/46666f6347ece...