Here's my favorite practically applicable cache-related fact: even on recent x86 server CPUs, cache-coherency protocols may operate at a granularity different from the cache line size. A typical case on new Intel server CPUs is operating at the granularity of 2 consecutive cache lines. Some thread-pool implementations, like CrossBeam in Rust and my ForkUnion in Rust and C++, explicitly document that and align objects to 128 bytes [1]:
/**
* @brief Defines variable alignment to avoid false sharing.
* @see https://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size
* @see https://docs.rs/crossbeam-utils/latest/crossbeam_utils/struct.CachePadded.html
*
* The C++ STL way to do it is to use `std::hardware_destructive_interference_size` if available:
*
* @code{.cpp}
* #if defined(__cpp_lib_hardware_interference_size)
* static constexpr std::size_t default_alignment_k = std::hardware_destructive_interference_size;
* #else
* static constexpr std::size_t default_alignment_k = alignof(std::max_align_t);
* #endif
* @endcode
*
 * That, however, results in all kinds of ABI warnings with GCC and a suboptimal alignment choice,
 * unless you hard-code `--param hardware_destructive_interference_size=64` or disable the warning
 * with `-Wno-interference-size`.
*/
static constexpr std::size_t default_alignment_k = 128;
As mentioned in the docstring above, using the STL's `std::hardware_destructive_interference_size` won't help you. On ARM, this issue becomes even more pronounced, so concurrency-heavy code should ideally be compiled multiple times for different coherence protocols and leverage "dynamic dispatch", similar to how I and others handle SIMD instructions in libraries that need to run on a very diverse set of platforms.
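To make the usage concrete, here is a minimal sketch (the names are illustrative, not ForkUnion's actual internals) of how such a constant gets applied, so that hot per-thread state never shares a cache-line pair:

#include <atomic>
#include <cstddef>

static constexpr std::size_t default_alignment_k = 128;

// Each worker owns one counter; `alignas` both aligns the object and pads
// `sizeof` up to 128 bytes, so adjacent array elements never share a line.
struct alignas(default_alignment_k) padded_counter_t {
    std::atomic<std::size_t> value {0};
};

static_assert(sizeof(padded_counter_t) == default_alignment_k, "padding didn't apply");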
> even on x86 on recent server CPUs, cache-coherency protocols may be operating at a different granularity than the cache line size. A typical case with new Intel server CPUs is operating at the granularity of 2 consecutive cache lines
I don’t think it is accurate that Intel CPUs use 2 cache lines / 128 bytes as the coherency protocol granule.
Yes, there can be additional destructive interference effects at that granularity, but that's due to prefetching (of two cache lines with coherency managed independently) rather than coherency operating on one 128-byte granule.
AFAIK 64 bytes is still the correct granule for avoiding false sharing, with two cores modifying two consecutive cachelines having way less destructive interference than two cores modifying one cacheline.
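A quick way to settle that on any particular machine is to time two threads hammering (a) two counters in the same cache line, (b) counters on adjacent lines, and (c) counters 128 bytes apart. A minimal sketch, with an arbitrary iteration count:

#include <atomic>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <thread>

constexpr std::size_t iterations_k = 100'000'000;

struct same_line_pair_t { std::atomic<long> first {0}, second {0}; }; // one cache line
struct adjacent_lines_pair_t {
    alignas(64) std::atomic<long> first {0};   // two consecutive,
    alignas(64) std::atomic<long> second {0};  // separately-coherent lines
};
struct far_apart_pair_t {
    alignas(128) std::atomic<long> first {0};  // different 128-byte
    alignas(128) std::atomic<long> second {0}; // blocks entirely
};

template <typename pair_type>
double run_milliseconds() {
    pair_type pair;
    auto hammer = [](std::atomic<long> &counter) {
        for (std::size_t i = 0; i != iterations_k; ++i)
            counter.fetch_add(1, std::memory_order_relaxed);
    };
    auto start = std::chrono::steady_clock::now();
    std::thread a([&] { hammer(pair.first); });
    std::thread b([&] { hammer(pair.second); });
    a.join(), b.join();
    return std::chrono::duration<double, std::milli>(std::chrono::steady_clock::now() - start).count();
}

int main() {
    std::printf("same line:      %.1f ms\n", run_milliseconds<same_line_pair_t>());
    std::printf("adjacent lines: %.1f ms\n", run_milliseconds<adjacent_lines_pair_t>());
    std::printf("128 B apart:    %.1f ms\n", run_milliseconds<far_apart_pair_t>());
    return 0;
}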
It’s not a cargo cult if the actions directly cause cargo to arrive based on well understood mechanics.
Regardless of whether it would be better in some situations to align to 128 bytes, 64 bytes really is the cache line size on all common x86 CPUs, and it is a good idea to avoid threads modifying the same cache line.
It indeed isn't, but I've seen my share of systems where nobody checked if cargo arrived. (The code was checked in without any benchmarks done, and after many years, it was found that the macros used were effectively no-ops :-) )
Storage, strings, sorting, counting, bioinformatics... I got nerd-sniped! Can't resist a shameless plug here :)
Looking at the code, there are a few things I would consider optimizing. I'd start by trying (my) StringZilla for hashing and sorting.
HashBrown collections under the hood use aHash, which is an excellent hash function, but on both short and long inputs, on new CPUs, StringZilla seems faster [0]:
                     short         long
aHash::hash_one      1.23 GiB/s     8.61 GiB/s
stringzilla::hash    1.84 GiB/s    11.38 GiB/s
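The Rust way to swap the function is to implement `BuildHasher` and build the map with `HashMap::with_hasher`. To stay consistent with the C++ snippet earlier in the thread, here is the same idea in C++, with a trivial FNV-1a stand-in where StringZilla's (or any other) byte-level hash would go:

#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <unordered_map>

// Stand-in hasher: replace the FNV-1a body with your hash function of choice.
struct byte_hasher_t {
    std::size_t operator()(std::string_view text) const noexcept {
        std::uint64_t hash = 1469598103934665603ull; // FNV offset basis
        for (unsigned char byte : text) {
            hash ^= byte;
            hash *= 1099511628211ull; // FNV prime
        }
        return static_cast<std::size_t>(hash);
    }
};

// Usage: a string-keyed counting map with the custom hash plugged in.
using counts_map_t = std::unordered_map<std::string, std::uint64_t, byte_hasher_t>;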
A similar story with sorting strings. Inner loops of arbitrary-length string comparisons often dominate such workloads. Doing it in a more radix-style fashion can 4x your performance [1]:
                                    short                   long
std::sort_unstable_by_key           ~54.35 M compares/s      57.70 M compares/s
stringzilla::argsort_permutation    ~213.73 M compares/s     74.64 M compares/s
Bear in mind that "compares/s" is a made-up metric here; in reality, I'm computing it from the total duration.
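To illustrate what "radix-style" means here (a toy sketch, not StringZilla's actual algorithm): compare fixed-width integer prefixes first and only fall back to full string comparison on ties, so the inner loop rarely touches the string bodies. Assumes a little-endian machine and a GCC/Clang builtin:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Pack the first 8 bytes into an integer whose order matches lexicographic order.
static std::uint64_t prefix_key(std::string const &text) {
    std::uint64_t key = 0;
    std::memcpy(&key, text.data(), std::min<std::size_t>(text.size(), sizeof(key)));
    return __builtin_bswap64(key); // little-endian only; `std::byteswap` in C++23
}

void sort_strings(std::vector<std::string> &strings) {
    std::sort(strings.begin(), strings.end(), [](std::string const &a, std::string const &b) {
        std::uint64_t key_a = prefix_key(a), key_b = prefix_key(b);
        if (key_a != key_b) return key_a < key_b; // resolved from the prefix alone
        return a < b;                             // full comparison only on ties
    });
}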
Cool suggestions! I'd definitely be interested in exploring other hash functions for this (and other bioinformatics work), so I'll take a look at your stringzilla lib.
Congrats on the release, Sam - the preview looks great!
I'm curious about the technical side: how are you handling the dimensionality reduction and visualization? Also noticed you mentioned "custom-trained LLMs" in the tweet - how large are those models, and what motivated using custom ones instead of existing open models?
We'll release the full data explorer soon, with more info.
At the core of this project is a structured-extraction task using a custom Qwen 14B model, which we distilled from larger closed-source models. We needed a model we could run at scale on https://devnet.inference.net, which consists mostly of idle consumer-grade NVIDIA devices.
Embeddings were generated using SPECTER2, a transformer model from AllenAI specifically designed for scientific documents. The model processes each paper's title, executive summary, and research context to generate 768-dimensional embeddings optimized for semantic search over scientific literature.
The visualization uses UMAP to reduce the 768D embeddings to 3D coordinates, preserving local and global structure. K-Means clustering groups papers into ~100 clusters based on semantic similarity in the embedding space. Cluster labels are automatically generated using TF-IDF analysis of paper fields and key takeaways, identifying the most distinctive terms for each cluster.
Will need some time to go through the details, but it’s increasingly rare to see teams consistently delivering meaningful improvements in the open. Impressive work!
Despite the promise of scalable vectors, I've struggled to find cases where it consistently beats NEON. Curious if others have found good use cases for SVE beyond AES-ing tiny string keys for hash-table lookups, scatter-gathers in small buffers, and histograms ;)
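For readers who haven't touched SVE: the main ergonomic win is the predication model, which lets a loop handle short inputs and tails without a scalar epilogue. A minimal sketch (illustrative only, not a claim that it beats the NEON equivalent), built with something like -march=armv8-a+sve:

#include <arm_sve.h>
#include <cstddef>
#include <cstdint>

// Count occurrences of `needle` in `text`: the `whilelt` predicate masks off
// the tail, so short inputs and the last partial vector need no scalar loop.
std::size_t count_byte_sve(std::uint8_t const *text, std::size_t length, std::uint8_t needle) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < length; i += svcntb()) {
        svbool_t predicate = svwhilelt_b8_u64(i, length);
        svuint8_t chunk = svld1_u8(predicate, text + i);
        svbool_t matches = svcmpeq_n_u8(predicate, chunk, needle);
        count += svcntp_b8(predicate, matches);
    }
    return count;
}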
Those are extremely uniform latencies. Seems like on these CPUs most of the benefit from NUMA-aware thread-pools will come from reduced contention - synchronizing mostly small subsets of cores - rather than from actual memory affinity.
Well, all of the memory hangs off the IO die. I remember AMD docs outright recommending configuring the processor to hide NUMA nodes from the workload, since trying to optimize for them might not even do anything for a lot of workloads.
That AMD slide (in the conclusion) claims their switching fabric has some kind of bypass mode to improve latency when utilisation is low.
So they have been really optimising that IO die for latency.
NUMA is already workload sensitive, you need to benchmark your exact workload to know if it’s worth enabling or not, and this change is probably going to make it even less worthwhile. Sounds like you will need a workload that really pushes total memory bandwidth to make NUMA worthwhile.
NUMA is only useful if you have multiple sockets, because then you have several I/O dies and you want your workload 1) to be closer to the I/O device and 2) to avoid crossing the socket interconnect. Within the same socket, all CPUs share the same I/O die, thus uniform latency.
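For anyone who wants to measure the affinity side on their own box, here is a minimal libnuma sketch (link with -lnuma); whether pinning plus node-local allocation helps at all is exactly the workload-dependent question above:

#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <numa.h> // libnuma

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "NUMA not available on this system\n");
        return EXIT_FAILURE;
    }
    std::printf("NUMA nodes: %d\n", numa_max_node() + 1);

    // Pin the calling thread to node 0 and allocate memory on the same node,
    // so the accesses below stay "local" in the NUMA sense.
    numa_run_on_node(0);
    std::size_t size = 1u << 30; // 1 GiB
    char *buffer = static_cast<char *>(numa_alloc_onnode(size, 0));
    if (!buffer) return EXIT_FAILURE;

    // ... touch the memory and run the benchmark of interest here ...

    numa_free(buffer, size);
    return EXIT_SUCCESS;
}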
The second piece (uniform call syntax) looks convenient, though I don’t see a realistic way to integrate it into modern C++. The first (structural pattern matching) is, for me, more of a dividing line between low- and high-level languages. I tend to avoid it in my C++, just as I avoid inheritance, virtual functions, and exceptions… or `<functional>` header contents.
Still, it’s always fun to stumble on corners of the STL I’d never paid attention to, even if I won’t end up using them. Thought it was worth sharing :)
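For readers wondering what the nearest STL equivalent of pattern matching looks like today, it's `std::visit` over a `std::variant` with an overload set; a minimal C++17 sketch, which is also a decent illustration of why people keep asking for the real thing:

#include <iostream>
#include <string>
#include <variant>

// The classic "overloaded" helper: one callable assembled from several lambdas.
template <class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
template <class... Ts> overloaded(Ts...) -> overloaded<Ts...>; // deduction guide, implicit in C++20

int main() {
    std::variant<int, double, std::string> value = std::string("hello");
    std::visit(overloaded {
                   [](int i) { std::cout << "int: " << i << "\n"; },
                   [](double d) { std::cout << "double: " << d << "\n"; },
                   [](std::string const &s) { std::cout << "string: " << s << "\n"; },
               },
               value);
    return 0;
}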
[1] https://github.com/ashvardanian/ForkUnion/blob/46666f6347ece...