An overengineered solution to `sort | uniq -c` with 25x throughput (hist)

MontyCarloHall · 2025-10-27T12:41:00 1761568860

>I use nucgen to generate a random 100M line FASTQ file and pipe it into different tools to compare their throughput with hyperfine.

This is a strange benchmark [0] -- here is what this random FASTQ looks like:

  $ nucgen -n 100000000 -l 20 | head -n8
  
  >seq.0
  TGGGGTAAATTGACAGTTGG
  >seq.1
  CTTCTGCTTATCGCCATGGC
  >seq.2
  AGCCATCGATTATATAGACA
  >seq.3
  ATACCCTAGGAGCTTGCGCA

There are going to be very few [*] repeated strings in this 100M line file, since each >seq.X will be unique and there are roughly a trillion random 4-letter (ACGT) strings of length 20. So this is really assessing the performance of how well a hashtable can deal with reallocating after being overloaded.

I did not have enough RAM to run a 100M line benchmark, but the following simple `awk` command performed ~15x faster on a 10M line benchmark (using the same hyperfine setup) versus the naïve `sort | uniq -c`, which isn't bad for something that comes standard with every *nix system.

  awk '{ x[$0]++ } END { for(y in x) { print y, x[y] }}' <file> | sort -k2,2nr

[0] https://github.com/noamteyssier/hist-rs/blob/main/justfile

[*] Birthday problem math says about 250, for 50M strings sampled from a pool of ~1T.

pclmulqdq · 2025-10-27T14:12:25 1761574345

The awk script is probably the fastest way to do this still, and it's faster if you use gawk or something similar rather than default awk. Most people also don't need ordering, so you can get away with only the awk part and you don't need the sort.

noamteyssier · 2025-10-28T16:10:53 1761667853

Totally agree it's a bit of weird benchmark - it was just the first thing that I thought of to generate a huge amount of lines to test throughput.

There are definitely other benchmarks that we could try as well to test other characteristics as well.

I've actually just added in this `awk` implementation you provided to the benchmarks well.

Cheers!

donatj · 2025-10-27T07:49:34 1761551374

I created "unic" a number of years ago because I had need to get the unique lines from a giant file without losing the order they initially appeared. It achieves this using a Cuckoo Filter so it's pretty dang quick about it, faster than sorting a large file in memory for sure.

https://github.com/donatj/unic

noamteyssier · 2025-10-28T18:31:06 1761676266

I've actually added a benchmark for this specific task and added `unic` to it.

It may not be the most fair comparison because with these random fastqs I'm generating the vast majority of the input is unique so it could be overloading the cuckoo filter.

noamteyssier · 2025-10-28T16:09:16 1761667756

Nice tool!

Someone · 2025-10-27T07:42:58 1761550978

> I am measuring the performance of equivalent cat <file> | sort | uniq -c | sort -n functionality.

It likely won’t matter much here, but invoking cat is unnecessary.

   sort <file> | uniq -c | sort -n

will do the job just fine. GNU’s sort also has a few flags controlling buffer size and parallelism. Those may matter more (see https://www.gnu.org/software/coreutils/manual/html_node/sort...)

noamteyssier · 2025-10-28T16:44:55 1761669895

Thanks for sharing!

You're right that the `cat` is unnecessary - and removing it actually had some marginal gains to the naive solution. I've updated the benchmarks to show this

Cheers

ashvardanian · 2025-10-27T12:08:35 1761566915

Storage, strings, sorting, counting, bioinformatics... I got nerd-sniped! Can't resist a shameless plug here :)

Looking at the code, there are a few things I would consider optimizing. I'd start by trying (my) StringZilla for hashing and sorting.

HashBrown collections under the hood use aHash, which is an excellent hash function, but on both short and long inputs, on new CPUs, StringZilla seems faster [0]:

                               short               long
  aHash::hash_one         1.23 GiB/s         8.61 GiB/s
  stringzilla::hash       1.84 GiB/s        11.38 GiB/s

A similar story with sorting strings. Inner loops of arbitrary length string comparisons often dominate such workloads. Doing it in a more Radix-style fashion can 4x your performance [1]:

                                                    short                  long
  std::sort_unstable_by_key           ~54.35 M compares/s    57.70 M compares/s
  stringzilla::argsort_permutation   ~213.73 M compares/s    74.64 M compares/s

Bear in mind that "compares/s" is a made-up metric here; in reality, I'm comparing from the duration.

[0] https://github.com/ashvardanian/StringWars?tab=readme-ov-fil...

[1] https://github.com/ashvardanian/StringWars?tab=readme-ov-fil...

noamteyssier · 2025-10-28T16:12:05 1761667925

Cool suggestions! I definitely would be interested in exploring other hash functions for this (and other binf works) so I'll definitely take a look at your stringzilla lib.

nasretdinov · 2025-10-27T07:25:09 1761549909

Note that by default sort command has a pretty low memory usage and spills to disk. You can improve the throughput quite a bit by increasing the allowed memory usage: --buffer-size=SIZE

noamteyssier · 2025-10-28T16:42:50 1761669770

I didn't know that - I've added in buffer size with a fairly large buffer to the benchmarks as well

f311a · 2025-10-27T09:06:13 1761555973

People often use sort | uniq when they don't want to load a bunch of lines into memory. That's why it's slow. It uses files and allocates very little memory by default. The pros? You can sort hundreds of gigabytes of data.

This Rust implementation uses hashmap, if you have a lot of unique values, you will need a lot of RAM.

noamteyssier · 2025-10-28T16:44:01 1761669841

Yeah definitely, it's always a trade-off. I think in many cases where I use it especially the number of unique values is actually not crazy high (much less than the required RAM) and the number of lines is crazy high.

So in those settings I think it's absolutely worth it

dbdr · 2025-10-27T06:59:12 1761548352

> using csv (serde) for writing leads to some big gains

Could you explain that, if you have the time? Is that for writing the output lines? Is actual CSV functionality used? That crate says "Fast CSV parsing with support for serde", so I'm especially confused how that helps with writing.

noamteyssier · 2025-10-28T16:49:10 1761670150

Yeah I'm using it to serialize the output lines as a TSV. Rust's `println!` is notoriously slow and using `csv` to serialize the output is a nice way to boost throughput

dbdr · 2025-10-29T07:07:57 1761721677

Thanks. Nice find! Though it feels weird to have to use a csv crate for that. Ideally the `fast printing` part should be understood, and either used directly, or extracted as a separate, smaller crate.

LtdJorge · 2025-10-27T11:29:07 1761564547

Yes, it’s used just for writing

xorcist · 2025-10-27T14:10:18 1761574218

From a causal glance, isn't your code limited by the amount of available memory?

Which could be totally useful in itself, but not even close to what "sort" is doing.

Did you run sort with a buffer size larger than the data? Your specialized one-pass program is likely faster, but at least the numbers would mean something.

That said, I don't see what is over-engineered here. It's pretty straightforward and easy to read.

noamteyssier · 2025-10-28T16:47:10 1761670030

Yes you're right, it's not trying to do what `sort` is doing, it's trying to reproduce the output of `sort | uniq -c | sort -n` which is a more specialized but common task.

But you're right - it will be limited by RAM in a way the unix tools are not.

I did add a test with sort using a larger buffer size to the benchmarks as well.

theemptiness · 2025-10-27T06:55:21 1761548121

Small semantics nit: it is not overengineered, it is engineered. You wanted more throughput, the collection of coreutils tools was not designed for throughput but flexibility.

It is not difficult to construct scenarios where throughput matters but that IMHO that does not determine engineering vs overengineering. What matters is whether there are requirements that need to be met. Debating the requirements is possible but doesn't take away from whether a solution obtained with reasonable effort meets the spec. Overengineering is about unreasonable effort, which could lead to overshoot the requirements, not about unreasonable requirements.

mabster · 2025-10-27T07:01:54 1761548514

We had similar thoughts about "premature optimisation" in the games industry. That is it's better to have prematurely optimised things than finding "everything is slow". But I guess in that context there are many many "inner-most loops" to optimise.

chii · 2025-10-27T07:07:30 1761548850

> That is it's better to have prematurely optimised things than finding "everything is slow".

or you found that you've optimized a game that is unfun to play and thus doesn't sell, even tho it runs fast...

wongarsu · 2025-10-27T10:42:39 1761561759

The best-practice solution would be to write a barely optimized ugly prototype to make sure the core idea is fun, then throw away the prototype and write the "real" game. But of course that's not always how reality works

chii · 2025-10-27T12:14:27 1761567267

> not always how reality works

yep. The stakeholder (who is paying the money) asks why the prototype can't just be "fixed up" and be sold for money, instead of paying for more dev time to rewrite. There's no answer that they can be satisfied with.

mabster · 2025-10-28T21:38:38 1761687518

Or the timeline just doesn't have capacity for experimentation so the expectations are clear right from the start!

mabster · 2025-10-28T21:37:28 1761687448

In context most of the major optimisation work was on the engine. The game code can be and usually is slow but we do try to tame things in an O() sense.

I worked licensed titles for a while and that area the quality of a title and whether it sells were largely uncorrelated haha!

majke · 2025-10-27T08:33:39 1761554019

I thought my mmuniq holds the crown!

https://blog.cloudflare.com/when-bloom-filters-dont-bloom/

https://github.com/majek/mmuniq

noamteyssier · 2025-10-28T18:32:36 1761676356

this looks very interesting and I'd love to add it to the benchmarking! I was interested in trying it but unfortunately got an installation error on my macbook where I'm running the benchmarks:

``` clang \ -g -ggdb -O3 \ -Wall -Wextra -Wpointer-arith \ -D_FORTIFY_SOURCE=2 -fPIE \ mmuniq.c \ -lm \ -Wl,-z,now -Wl,-z,relro \ -o mmuniq mmuniq.c:1:10: fatal error: 'byteswap.h' file not found 1 | #include <byteswap.h> | ^~~~~~~~~~~~ 1 error generated. make: ** [mmuniq] Error 1 ```

nasretdinov · 2025-10-27T10:53:31 1761562411

I believe, given its reliance on the Bloom filter, that it doesn't actually report occurrences count?

noctune · 2025-10-27T07:29:50 1761550190

I built something similarly a few years ago for `sort | uniq -d` using sketches. The downside is you need two passes, but still it's overall faster than sorting: https://github.com/mpdn/sketch-duplicates

saltcured · 2025-10-28T20:37:17 1761683837

Back in the day, optimizing this would be about parallel IO and some map-reduce processing. Data sharded on a bunch of nodes, each effectively doing "sort | uniq -c" and then doing a merge of those sorted counts.

And then there would be countless arguments about whether you have to count the time it takes to stage the data into the cluster as part of the task completion benchmark...

noamteyssier · 2025-10-28T20:53:47 1761684827

I think you'd still need to go through that if you were really optimizing both `sort` and `uniq` working with their constraints.

What I'm really optimizing here is the functional equivalent of `sort | uniq -c | sort -n`

scaredginger · 2025-10-27T08:08:06 1761552486

Looks like the impl uses a HashMap. I'd be curious about how a trie or some other specialized string data structure would compare here.

noamteyssier · 2025-10-28T20:55:43 1761684943

I think this could potentially really reduce the amount of memory required - especially in cases where there is a lot of repetitive prefixes.

Would be interesting to try this out

jll29 · 2025-10-27T08:52:27 1761555147

I use questions around this pipeline in interviews. As soon as people say they'd write a Python program to sort a file, they get rejected.

Arguably, this will result in a slower result in most cases, but the reason for the rejection is wasting developer time (not to mention time to test for correctness) to re-develop something that is already available in the OS.

pbhjpbhj · 2025-10-27T10:13:07 1761559987

I'm sure you're doing it in a sensible way, but... the thought that played out in my head went a little like this {apologies, I'm not well today, this might be a fever dream}:

Interviewer: "Welcome to your hammer-stuff interview, hope you're ready to show your hammering skills, we see from your resumé you've been hammering for a while now."

Schmuck: "Yeah, I just love to make my bosses rich by hammering things!"

Interviewer: "Great, let's get right into the hammer use ... here's a screw, show me how you'd hammer that."

Schmuck: (Thinks - "Well, of course, I wouldn't normally hammer those; but I know interviewers like to see weird things hammered! Here goes...")

[Hammering commences]

Interviewer: "Well thanks for flying in, but you've failed the interview. We were very impressed that you demonstrated some of the best hammering we've ever seen. But, of course, wanted to see you use a screwdriver here in your hammering interview at We Hammer All The Things."

wahern · 2025-10-27T09:19:14 1761556754

One of the cooler Unix command utilities is tsort, which performs a topological sort. Basically you give it a list of items (first word in each line) and their dependencies (subsequent words on each line) and it sorts them accordingly, similar to how, e.g., Make builds a graph of targets and dependencies to run recipes in the correct order. https://en.wikipedia.org/wiki/Tsort https://pubs.opengroup.org/onlinepubs/9799919799/utilities/t...

However, I've never found a use for it. Apparently it was written for the Version 7 Unix build system to sort libraries for passing to the linker. And still used.[1][2] But of the few times I've needed a topological sort, it was part of a much larger problem where shell scripting was inappropriate, and implementing it from scratch using a typical sort routine isn't that difficult. Still, I'm waiting for an excuse to use it someday, hopefully in something high visibility so I can blow people's minds.

[1] https://github.com/openbsd/src/blob/17290de/share/mk/bsd.lib... [2] https://github.com/NetBSD/src/blob/7d8184e/share/mk/bsd.lib....

mr_toad · 2025-10-27T11:27:14 1761564434

Sounds like it’s intended to be used to schedule jobs, or complex builds.

rs186 · 2025-10-27T11:19:26 1761563966

Your loss, not theirs. Lots of good developers are not expert at unix commands, many of which spend most of their time on Windows. They may be "wasting" 2 minutes on this specific task with their "inefficient" method, but they may perform much better on other tasks, to the point that 2 minutes saved here is nothing.

Not to mention that these days people often ask ChatGPT "what's the best way to do this" before proceeding, and whatever you ask in interviews is completely irrelevant.

It is exactly for these reasons we never ask such questions in our interviews. There are much more important aspects of a candidate to evaluate.

sudahtigabulan · 2025-10-28T10:41:15 1761648075

The 2 minutes wasted to write it doesn't move the needle, but the time that their present and future teammates will waste on reading it might.

It costs me more effort to read and understand a screenful of unfamiliar code than the equivalent "sort -k 1.1" or "uniq" while skimming through a shell script. This adds up.

rs186 · 2025-10-28T11:45:26 1761651926

Assuming teammates are all experts at linux commands. Fact: not everyone knows what "sort | uniq -c" does or can type it out on demand. Actually I'll be surprised if more than 25% of all software engineers can do that . If you don't believe me, ask random people inside and outside your company.

> It costs me more effort to read and understand ...

You don't need to. LLMs are meant for that. You probably will roll your own script anyway and none of this matters.

You are worrying about things that, in my experience, do not make any noticeable impact on overall productivity.

f311a · 2025-10-27T09:09:19 1761556159

This depends on the context... If a file is pretty small, I would avoid sort pipes when there is a Python codebase. It's only useful when the files are pretty big (1-5GB+)

They are tricky and not very portable. Sorting depends on locales and the GNU tools implementation.

Aefiam · 2025-10-27T12:55:23 1761569723

you can develop just as fast or even faster with python once you develop a good enough utility library for it.

For example my python interpreter imports my custom List and Path classes and I could just do the following to get the same result:

List(List(Path("filepath").read_text_file().splitlines()).group_by_key(lambda x:x).items()).map(lambda x:(len(x[1]),x[0])).sorted()

and if used often enough, it could made an utility method:

Path("filepath").read_sorted_by_most_common()

So I find it shortsighted to reject someone based on that without giving them a chance to explain their reasoning.

I think generally people really underestimate how much more productive you can be with a good utility library.

sudahtigabulan · 2025-10-28T12:37:06 1761655026

> once you develop a good enough utility library for it.

What happens when everybody comes to the job with their own utility library and start working on the same codebase?

Would you like it if you had to get up to speed with several utility libraries your coworkers developed for themselves?

A common set of tools, like the Unix commands, makes it easier for people to collaborate. They were put in an official standard for a reason.

zahlman · 2025-10-27T15:09:39 1761577779

> For example my python interpreter imports my custom List and Path classes and I could just do the following to get the same result:

  List(List(Path("filepath").read_text_file().splitlines()).group_by_key(lambda x:x).items()).map(lambda x:(len(x[1]),x[0])).sorted()

... But I don't know why you would, because with builtins and the standard library you can already do

  sorted((count, line) for (line, count) in Counter(Path("filepath").read_text().splitlines()).items())

> and if used often enough, it could made an utility method:

Sure, but you can do that for any functionality in any practical language.

coldstartops · 2025-10-27T09:09:49 1761556189

> Wasting developer time

What is the definition of wasting developer time? If a developer takes a 2 hours break to recover mental power and avoid burnout, is it considered time wasted?

trollbridge · 2025-10-27T12:20:25 1761567625

This reminds me of a program I wrote to do the same thing that wc -L does, except a lot faster. I had to run it on a corpus of data that was many gigabytes (terabytes) in size, far too big to fit in RAM. MIT license.

https://github.com/JoshRodd/mll

stackedinserter · 2025-10-27T14:40:44 1761576044

It's shame that we normalized sorting twice for these cases.

Somebody, implement `uniq --global` switch already. Put it into your resume, it's a legitimate thing to brag about.

rurban · 2025-10-27T20:07:04 1761595624

But GNU coreutils would reject a new flag, you'd need to add it to BSD or Rust uutils.

zahlman · 2025-10-27T14:42:22 1761576142

How often is "count the unique lines of a file" a realistic task for others out there, and how big of files do y'all need to process and why?

noamteyssier · 2025-10-28T16:52:06 1761670326

Shows up a lot in bioinformatics actually - trying to identify sequences with a specific subsequence (grep) and how many of each unique sequence there are. The number of lines here could be massive (order of 1-10's of GB)

You don't really end up using these results in any specific analysis but it's super helpful for troubleshooting tools or edge-cases.

Tostino · 2025-10-27T15:05:42 1761577542

Reasonably often in ETL type tasks.

flowerthoughts · 2025-10-27T06:35:33 1761546933

The win here might be using HashMap to avoid having to sort all entries. Then sorting at the end instead. What's the ratio of duplicates in the benchmark input?

There is no text encoding processing, so this only works for single byte encodings. That probably speeds it up a little bit.

Depending on the size of the benchmark input, sort(1) may have done disk-based sorting. What's the size of the benchmark input?

wodenokoto · 2025-10-27T06:53:40 1761548020

To me, the really big win would be _not_ to have to sort at all. Have an option to keep first or last duplicate and remove all others while keeping line order is usually what I need.

thaumasiotes · 2025-10-27T07:30:18 1761550218

That's easy to do if you're keeping the first duplicate. It becomes complex if you're keeping the last duplicate, because every time you find a duplicate you have to go back through your "output" and delete the earlier occurrence.

You could do an annotating pass for learning which of each line is the last one, and then a followup pass for printing (or otherwise echoing) only the lines that are the last of their kind. Technically still faster than sorting.

You could also keep the information on last occurrence of each line in the hash map (that's where it's going to be anyway), and once you're done with the first pass sort the map by earliest last occurrence. That will get you the lines in the right order, but you had to do a sort. If the original input was mostly duplicates, this is probably a better approach.

You could also track last occurrence of each line in a separate self-sorting structure. Now you have slightly more overhead while processing the input, and sorting the output is free.

wodenokoto · 2025-10-28T00:42:04 1761612124

> It becomes complex if you're keeping the last duplicate, because every time you find a duplicate you have to go back through your "output" and delete the earlier occurrence.

Can't you reverse file | keep first occurrence | reverse output?

thaumasiotes · 2025-10-28T05:57:51 1761631071

Is reversing the file an improvement over making two passes?

mabster · 2025-10-27T07:04:37 1761548677

I've written this kind of function so many times it's not funny. I usually want something that is fed from an iterator, removes duplicates, and yields values as soon as possible.

noamteyssier · 2025-10-28T18:33:23 1761676403

I've added this functionality to `hist-0.1.5` with a benchmark of other tools that do this on the CLI

G_o_D · 2025-10-27T11:32:06 1761564726

why no mention of awk ? awk '!a[$0]++'

noamteyssier · 2025-10-28T16:50:09 1761670209

I've added awk into the benchmarks also!

pabs3 · 2025-10-29T00:46:43 1761698803

Also perl, should be faster than awk IIRC:

perl -ne 'print if ! $a{$_}++'

fsiefken · 2025-10-27T09:13:57 1761556437

I'm curious how much faster this is compared to the rust uutils coreutils ports of sort and uniq

noamteyssier · 2025-10-28T21:45:11 1761687911

Good question! I just added that comparison and the rust uutils coreutils port is significantly faster than the standard coreutils.

ukuina · 2025-10-27T08:12:00 1761552720

Neat!

Are there any tools that tolerate slight mismatches across lines while combining them (e.g., a timestamp, or only one text word changing)?

I attempted this with a vector DB, but the embeddings calculation for millions of lines is prohibitive, especially on CPU.

southwindcg · 2025-10-23T17:24:38 1761240278

I don't [currently?] have a use case for this tool, but I love seeing existing tools made faster or more efficient.

noamteyssier · 2025-10-23T17:39:58 1761241198

I think that it's a pretty common use case for text processing - I end up needing to use it a lot in bioinformatics where there is a lot of text processing.

It's great when you quickly need to see what the distribution of classes in an input stream is. This pops up all the time. Like measuring different types of log messages, counting the variants of a field in a csv, finding the most common word or substring, etc.

southwindcg · 2025-10-23T18:55:03 1761245703

Oh, I meant me, personally, I don't have a use case for it. Without a doubt a lot of people are going to find this speed improvement valuable.

vlovich123 · 2025-10-27T06:55:04 1761548104

Why does this test against sort | uniq | sort? It’s kind of weird to sort twice no?

gucci-on-fleek · 2025-10-27T07:01:59 1761548519

The first "sort" sorts the input lines lexicographically (which is required for "uniq" to work); the second "sort" sorts the output of "uniq" numerically (so that lines are ordered from most-frequent to least-frequent):

  $ echo c a b c | tr ' ' '\n'
  c
  a
  b
  c
  
  $ echo c a b c | tr ' ' '\n' | sort
  a
  b
  c
  c
  
  $ echo c a b c | tr ' ' '\n' | sort | uniq -c
        1 a
        1 b
        2 c
  
  $ echo c a b c | tr ' ' '\n' | sort | uniq -c | sort -rn
        2 c
        1 b
        1 a

happysadpanda2 · 2025-10-27T18:26:38 1761589598

`uniq -c` introduces a "count" at the beginning of the line, so what we are then sorting is on frequency of the unique terms in the output, not sorting the unique terms again (which indeed would be kindof nonsensical)

Aaron2222 · 2025-10-27T07:00:43 1761548443

  sort | uniq -c | sort -n

The second sort is sorting by frequency (the count output by `uniq -c`).

emmelaich · 2025-10-28T01:01:17 1761613277

I often add `head` with `sort -rn` because I'm only interested in the largest.

BuildTheRobots · 2025-10-27T06:59:15 1761548355

It's something I've done myself in the past. First sort is because it needs to be sorted for uniq -c to count it proper, second sort because uniq doesn't always give the output in the right order.

evertedsphere · 2025-10-27T07:18:11 1761549491

more precisely, uniq produces output in the same order as the input to it, just collapsing runs / run-length encoding it

mfld · 2025-10-24T07:50:22 1761292222

Nice - thanks! I assume the non-naive implementations skip the sorting and instead hash the input lines?

noamteyssier · 2025-10-28T18:34:16 1761676456

yeah that's right - there are trade-offs in doing so as it can require much more memory. So like everything it's an application specific decision

zX41ZdbW · 2025-10-27T07:20:13 1761549613

This and similar tasks can be solved efficiently with clickhouse-local [1]. Example:

    ch --input-format LineAsString --query "SELECT line, count() AS c GROUP BY line ORDER BY c DESC" < data.txt

I've tested it and it is faster than both sort and this Rust code:

    time LC_ALL=C sort data.txt | uniq -c | sort -rn > /dev/null
    32 sec.

    time hist data.txt > /dev/null
    14 sec.

    time ch --input-format LineAsString --query "SELECT line, count() AS c GROUP BY line ORDER BY c DESC" < data.txt > /dev/null
    2.7 sec.

It is like a Swiss Army knife for data processing: it can solve various tasks, such as joining data from multiple files and data sources, processing various binary and text formats, converting between them, and accessing external databases.

[1] https://clickhouse.com/docs/operations/utilities/clickhouse-...

danlark1 · 2025-10-27T11:18:41 1761563921

Disclaimer: the author of the comment is the founder and CTO of ClickHouse

OJFord · 2025-10-27T14:21:20 1761574880

And all their comments are shilling Clickhouse either directly or via a project built on top of it, without disclosure.

tuukkah · 2025-10-27T14:44:12 1761576252

Considering that it's an open source tool, I don't know if it's that bad to be shilling for the commons, basically.

zX41ZdbW · 2025-11-04T21:10:02 1762290602

I've edited my profile to provide a link to GitHub.

edoceo · 2025-10-27T14:05:49 1761573949

Disclosure, not disclaimer.

They want to own the claims made.

danlark1 · 2025-10-27T14:34:16 1761575656

Yes, sorry, it should be "disclosure"

LtdJorge · 2025-10-27T11:38:01 1761565081

When using clickhouse-local like this, does it build a logical plan and run the optimizer on it? Does it have any kind of code generation, since it knows the query (and physical data layout) ahead of time?

gigatexal · 2025-10-27T09:05:05 1761555905

Exactly. I love this and DuckDb and other such amazing tools.

nsteel · 2025-10-27T10:56:51 1761562611

Just noting that in your benchmark (which we know nothing about), your "naive" data point is just 2.29x slower than hist. In their testing it was 27x slower! And it's not quite the same naive shell command, which isn't helpful.

da_chicken · 2025-10-27T10:34:48 1761561288

I'd not heard of clickhouse before. It does seem interesting, but I just can't get behind a project that says:

> The easiest way to download the latest version is with the following command:

> curl https://clickhouse.com/ | sh

Like, sure, there is some risk downloading a binary or running an arbitrary installer. But this is just nuts.

trollbridge · 2025-10-27T12:19:05 1761567545

It's Apache licenced and you could also install it via your favourite package installer. Given all the crazy supply chain attacks going on, I don't really feel this is any worse than downloading a binary from a distro archive, and specifically this pipe | sh doesn't expect you to run it as root (which a lot of other cut-and-paste installers do).

xorcist · 2025-10-27T13:59:22 1761573562

> I don't really feel this is any worse than downloading a binary from a distro archive

Please don't say that. It denigrates the work of all the packagers that actually keep our supply chains clean. At least in the major distributions such as Red Hat/Fedora and Debian/Ubuntu.

The distro model is far from perfect and there are still plenty of ways to insert malware into the process, but it certainly is far better than running binaries directly from a web page. You have no idea who have access to that page and its mirrors and what their motives are. The binary isn't even signed, let alone reviewed by anyone!

trollbridge · 2025-10-28T13:59:16 1761659956

I’m not sure how much better this is the man blindly “npm i thing”, where I have no real assurance I’m not downloading a giant piece of malware either.

da_chicken · 2025-10-28T17:36:22 1761672982

That's exactly why it's insane. People remember the pad-left fiasco.

Previously discussed here: https://news.ycombinator.com/item?id=11348798

Article now resides here: https://www.davidhaney.io/npm-left-pad-have-we-forgotten-how...

monerozcash · 2025-10-27T15:17:59 1761578279

>Like, sure, there is some risk downloading a binary or running an arbitrary installer. But this is just nuts.

It's literally exactly the same thing

gigatexal · 2025-10-27T10:35:58 1761561358

Chdb is just a binary. You can just grab that. Also pipe to sh is used by a ton of projects

bflesch · 2025-10-27T10:42:09 1761561729

it's used by many projects but still regarded as an anti-pattern and security issue

monerozcash · 2025-10-29T21:58:34 1761775114

it's really exactly the same as wget file;./file and not a real anti-pattern in any way

Etheryte · 2025-10-27T12:47:59 1761569279

A ton of people drink and drive too, doesn't make it any more fine.

gigatexal · 2025-10-27T18:39:34 1761590374

Y’all are so pure. Just don’t install it that way. Sheesh.

Aefiam · 2025-10-27T11:16:46 1761563806

how is this any less secure than running a binary/installer? the binary could run this inside?

nasretdinov · 2025-10-27T07:26:12 1761549972

To be more fair you could also add SETTINGS max_threads=1 though?

supermatt · 2025-10-27T08:17:19 1761553039

How is that “more fair”?

nasretdinov · 2025-10-27T08:37:15 1761554235

Well, fair in a sense that we'd compare which implementation is more efficient. Surely, ClickHouse is faster, but is it because it's using actually superior algorithms or is it just that it executes stuff in parallel by default? I'd like to believe it's both, but without "user%" it's hard to tell

mickeyp · 2025-10-27T08:39:21 1761554361

Last time I checked, writing efficient, contention-free and correct parallel code is hard and often harder than pulling an algorithm out of a book.

reppap · 2025-10-27T09:36:51 1761557811

Would you take half the wheels off a car to compare it to a motorcycle?

Almondsetat · 2025-10-27T12:45:29 1761569129

Motorcycles are faster than cars though

Etheryte · 2025-10-27T12:53:58 1761569638

Not necessarily, that really depends on what you mean by fast. Cars definitely go higher in top speed than bikes do for example. If I'm not mistaken, racing electric cars also accelerate comparable or faster than bikes. A bike can generally go around a track faster than a car, but that only holds true in dry conditions. Etc, many ways to define fast and what you actually mean.

AtlasBarfed · 2025-10-27T15:32:39 1761579159

I thought all the land speed records were basically motorcycles.

Jet motorcycles but motorcycles

wang_li · 2025-10-27T15:08:53 1761577733

Musk's roadster is currently going in excess of 10,000 mph. Which bike is faster than that? :)

2025-10-22T22:26:45 1761172005

[deleted]