Of course we can. Reliability is a spectrum, not a binary state. You can push it as high as you like and stop somewhere between "we don't care about an error rate this low" and "the error rate is so low it's unlikely to show up in practice".
It's not like this is a new concept. There are plenty of algorithms we've been using for decades that are only statistically correct. A perfect example of this is efficient primality testing, which is probabilistic in nature[0], but you can easily make the probability of error as small as "unlikely to happen before heat death of the universe".
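To make that concrete, here is a minimal sketch of the Miller-Rabin test (my own illustration, not something established elsewhere in the thread): a composite number survives any single round with probability at most 1/4, so k rounds bound the false-positive rate by 4^-k, and the round count is exactly the dial being described.

    import random

    def is_probable_prime(n: int, rounds: int = 40) -> bool:
        # Miller-Rabin: a composite n passes any single round with
        # probability at most 1/4, so the overall false-positive
        # chance is at most 4**-rounds.
        if n < 2:
            return False
        for p in (2, 3, 5, 7, 11, 13):
            if n % p == 0:
                return n == p
        d, r = n - 1, 0
        while d % 2 == 0:          # write n - 1 as d * 2**r with d odd
            d //= 2
            r += 1
        for _ in range(rounds):
            a = random.randrange(2, n - 1)
            x = pow(a, d, n)
            if x in (1, n - 1):
                continue
            for _ in range(r - 1):
                x = pow(x, 2, n)
                if x == n - 1:
                    break
            else:
                return False       # definitely composite
        return True                # "probably prime", error < 4**-rounds

    # 40 rounds -> error bound 4**-40, roughly 8e-25: heat-death territory.
    print(is_probable_prime(2**127 - 1))   # True (a Mersenne prime)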
There are two problems with this comparison. First, probabilistic primality testing has a mathematically proven error bound that tightens with each iteration. There is no comparably robust tuning parameter for an LLM. You can use a different model, or a bigger variant of the same model, and so on, but these all have empirically determined, context-sensitive reliability levels that are not otherwise tunable. Second, a prime-generation function will always give you an integer, and never an apple, or a bicycle, or a phantasm. LLMs regurgitate and hallucinate, which means that a simple error rate is not the only metric that matters. One must also consider how egregiously wrong, and even nonsensical, the errors can be.
I think the better statement is that, if, say, you're running the Miller-Rabin test 10 times, you can be confident that an error in one test is uncorrelated with an error in the next test, so it's easy to dial up the accuracy as close to 1 as desired. Whereas with an LLM, correlated errors seem much more likely; if it failed three times parsing the same piece of data, I would have no confidence that the 4th-10th times would have the same accuracy rate as on a fresh piece of data. LLMs seem much more like the Fermat primality test, except that their "Carmichael numbers" are a lot more common.
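A back-of-the-envelope illustration of why independence matters (the 25% failure rate here is made up, purely for the arithmetic):

    p = 0.25              # per-attempt failure rate (made-up number)
    k = 10                # number of repeated attempts

    independent = p ** k  # failures independent: about 9.5e-7
    correlated = p        # failures perfectly correlated: still 0.25

    print(f"independent attempts: {independent:.1e}")
    print(f"fully correlated:     {correlated:.2f}")

Miller-Rabin rounds behave like the first number; an LLM re-reading the same awkward input seems much more like the second.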
I compare LLMs to a door with a slot: you put in a piece of paper with a request on it and you get something back related to that request. If it's the same every time, great. But it might be different, or completely wrong. You don't know what goes on behind the door, and measuring the error rate tells you little that's predictive.
The general point is not that a feature currently exists to dial down the LLM parse error rate; it's that the abstract argument "we can't use LLMs because they aren't perfect" isn't a realistic argument in the first place. You're reading this on hardware that _probably_ shows you the correct text almost all of the time but isn't guaranteed to.
Precisely this. People dismiss utility of LLMs because they don't give 100% reliability, without considering the basic facts that:
- LLMs != the ChatGPT interface; they don't need to be run in isolation, nor do they need to do everything end-to-end.
- There are no 100% reliable systems - neither technological nor social. Voltages fluctuate, radiation flips bits, humans confabulate just as much as LLMs, if not worse, etc.
- We create reliability from unreliable systems.
LLMs aren't some magic unreliability pixie dust that makes everything they touch beyond repair. They're just another system with bounded reliability, and can be worked into larger systems just like anything else, and total reliability can be improved through this.
EDIT: In fact, my example with probabilistic primality tests is bad because those tests are too nice - they let us compute tight bounds on the error rate in advance. LLMs are not like that. But then, a lot of systems we rely on in our daily lives are like LLMs in this respect - their reliability is established empirically, i.e. we improve them until they work reliably enough, then we hope they'll keep on working and deal with random failures when they occur. So that's nothing new, either.
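For what it's worth, "working an LLM into a larger system" usually means something like the sketch below: wrap the unreliable step in validation and bounded retries, and make failure explicit instead of silent. The extractor, the validator, and the retry count are hypothetical stand-ins, not a real API.

    import json

    class ExtractionFailed(Exception):
        pass

    def reliable_extract(text, extract, validate, attempts=3):
        # Wrap an unreliable extractor (e.g. an LLM call) in validation
        # and bounded retries; only output that passes the checks escapes.
        last_error = None
        for _ in range(attempts):
            raw = extract(text)                 # hypothetical LLM call
            try:
                data = json.loads(raw)
                validate(data)                  # schema / sanity checks
                return data
            except (ValueError, AssertionError) as e:
                last_error = e                  # reject and retry
        raise ExtractionFailed(f"no valid output after {attempts} attempts: {last_error}")

    def validate_invoice(d):
        # Accept only the exact shape we asked for.
        assert set(d) == {"invoice_id", "total"}
        assert isinstance(d["total"], (int, float)) and d["total"] >= 0

This doesn't make the LLM itself reliable; it just bounds how its failures propagate, and the residual rate still has to be established empirically, as above.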
No, LLMs do not have "bounded reliability". All reliability figures for LLMs are based upon empirical observation in specific contexts using artificial benchmarks. As they say in finance, "past performance is not indicative of future results".
Saying LLMs are no worse than random bit flips is, again, an unjustified comparison. We can control bit errors with ECC; we cannot control the output of an LLM except by shackling it into uselessness.
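For contrast, here is what "controlling bit errors" means in its simplest possible form, a triple-repetition code with majority vote (real ECC such as Hamming or Reed-Solomon codes is far more efficient, but the principle is the same): any single flipped copy is corrected deterministically.

    def encode(bits):
        # Triple-repetition code: transmit every bit three times.
        return [b for bit in bits for b in (bit, bit, bit)]

    def decode(received):
        # Majority vote over each triple corrects any single flip per triple.
        return [1 if sum(received[i:i + 3]) >= 2 else 0
                for i in range(0, len(received), 3)]

    msg = [1, 0, 1, 1]
    sent = encode(msg)
    sent[4] ^= 1                  # a stray bit flip in transit
    assert decode(sent) == msg    # corrected, deterministically

The point of the contrast: the guarantee here follows from the construction itself, not from benchmarking it afterwards.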
I said bounded. I didn't say how tight. But all of science is about bounding empirical observations, so this is nothing new - nor is relying on systems with empirically established failure rates, which is a good chunk of what engineering is about.
The number of 9s that can be assigned to these "bounds" currently is zero. They are not even 90% reliable. And there is no straightforward way to get to 90%, never mind 95%, 99%, etc. The sliding scale of reliability you originally presented just does not exist.
Yeah, sure, we can hypothetically engineer a system that tolerates a key step in the process which has, say, a 30% chance of being wrong, including a 10% chance of being dangerously wrong (appears correct but is broken in subtle ways), and a 5% chance of being batshit insane, but why would we? The amount of training, vetting, and supervision of human operators necessary to make a working process here immediately raises the question of whether the machine serves man or the other way around.
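Put numbers on it (the failure rates are the hypothetical ones above; the reviewer miss rate is my own made-up figure): the residual risk is dominated by the outputs that look correct, which is precisely the category human review is worst at catching.

    # Hypothetical rates from the comment above, plus an assumed review step.
    p_subtly_wrong = 0.10      # "appears correct but is broken in subtle ways"
    reviewer_miss = 0.30       # assumed: subtle defects are what reviewers miss

    residual = p_subtly_wrong * reviewer_miss
    print(f"defects surviving review: {residual:.0%}")   # 3% per item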
The best uses of an LLM are those where engineering levels of precision are neither required nor useful.
I see people hallucinate on HN all the time. We tolerate it. Why should we? We should if the overall inclusion of unreliable things (humans) provides value. The error rate for LLMs doesn't matter. The net value does. So if the value is great enough to tolerate the error rate, we do. We don't categorically dismiss the technology because it can fail really poorly. We design things all the time that can fail catastrophically. Seriously. So LLMs will appear anywhere the net value is positive. Maybe you're taking a more nuanced stance, but I see a lot of "if it can hallucinate even once we can't use it" rhetoric here. And that's simply irrational. Even "we can't use it for important things" is wrong. Doctors are using LLMs today to help collate observed data and suggest diagnoses. A trained professional in the loop mitigates the "terrible failure" case. So no, I don't even agree that LLMs should be relegated to non-important things.
I also think categorically dismissing LLMs is a mistake.
However, an LLM for automated code generation (the context of the thread as I understand it) is basically a dubious-code-copy-paster on steroids. That was already the wrong way to develop code to begin with, automating and accelerating it is not an improvement.
There has never been a single case where I took code from Stack Overflow, which is already a relatively high quality source of such snippets, and didn't have to adapt it in at least some way to work with the code I already had. Heck, I often find rewriting the snippet entirely is better than copying and pasting it. Of course, I also give attribution, both for credit and for referring back to the original in case I made a mistake, the best solution changes in the future, there's context I didn't cover, etc. And in between the problems I solve with other people's help is a whole lot of code I write entirely on my own.
There are many cases of code in the wild being bad, not just from a "readability" or "performance" standpoint, but from a security standpoint. LLMs regurgitate bad code despite also having good code, and even the blog posts explaining what's good and what's bad, in their training corpus! And an LLM never gives attribution, partly because it was designed not to care, and partly because the end result is a synthesis of multiple sources rather than a pure regurgitation. Moreover, LLMs don't have much continuity, so they mix metaphors and naming conventions, they tie things together in absurd ways, etc. The end result is an unmaintainable mess, even if it happens to work.
So no, an LLM is not like a compiler, even though compilers often have their own special brand of crazy magic that isn't necessarily good. Nor is it going to deliver a robust way to turn abstract human thoughts into concrete code. It is still a useful tool, but it's not going to be an automated part of developing quality code. And this is going to be true for any non-coding scenario that requires at least the same level of reliability.
Finance is an excellent analogy. Relying on LLM output is similar to relying on the stock market. You might come out ahead but it's always a gamble and the lower bound is always catastrophic failure.
--
[0] - https://en.wikipedia.org/wiki/Primality_test#Probabilistic_t...