In aviation safety there is the concept of the "Swiss cheese" model, where each successive layer of safety may not be 100% perfect, but has a different set of holes, so overlapping layers create a net gain in safety.
One can treat current LLMs as a layer of "cheese" for any software development or deployment pipeline, so the goal of adding them should be an improvement in a measurable metric (code quality, uptime, development cost, successful transactions, etc).
Of course, one has to understand the chosen LLM's behaviour for each specific scenario - are they like Swiss cheese (a small number of large holes) or more like Havarti cheese (a large number of small holes) - and treat them accordingly.
LLMs are Kraft Singles. Stuff that only kind of looks like cheese. Once you know it's in there, someone has to inspect, and sign off on, the entire wheel for any credible semblance of safety.
It will only get better at generating random slop and other crap. Maybe it will help the morons who are unable to eat and breathe without consulting the "helpful assistant".
They probably already can for a lot of things, but "Safety" is really about accountability when things go wrong. As a society, I hope we don't end up at "AI isn't perfect, but it's better than people on average, sorry if it failed you, good luck with that."
LLMs are very good at first-pass PR checks, for example. They catch the silly stuff actual humans just miss sometimes. Typos, copy-paste mistakes, etc.
Before any human is pinged about a PR, have a properly tuned LLM look at it first so actual people don't have to waste their time pointing out typos in log messages.
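A rough sketch of what that could look like, assuming an OpenAI-compatible API; the model name, prompt, and wiring are placeholders, not a recommendation:

```python
# Hypothetical first-pass PR reviewer that runs before any human is pinged.
# Assumes an OpenAI-compatible API with OPENAI_API_KEY in the environment;
# the model name and system prompt are illustrative only.
import subprocess
from openai import OpenAI

client = OpenAI()

def review_diff(base_branch: str = "main") -> str:
    """Ask the model for a first pass over the current branch's diff."""
    diff = subprocess.run(
        ["git", "diff", base_branch, "--", "."],
        capture_output=True, text=True, check=True,
    ).stdout

    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model
        messages=[
            {"role": "system",
             "content": "You are a code reviewer. Flag only typos, copy-paste "
                        "mistakes, and obvious logic slips. Do not restyle code."},
            {"role": "user", "content": diff},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(review_diff())
```

The point is only that this layer runs cheaply and automatically, so the human reviewer sees a PR that has already had the trivial stuff pointed out.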
Interesting concept, but as of now we don't apply these technologies as a new compounding layer.
We are not using them after the fact, once we have constructed the initial solution. We are not ingesting the code to compare it against specs. We are not using them to curate and analyze current hand-written tests (prompt: is this test any good? assistant: it is hot garbage, you are asserting that the expected result equals your mocked result).
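For instance, a hypothetical test of this kind (all names made up), where the assertion can only ever compare the mock to itself:

```python
# Hypothetical "hot garbage" test: the expected result is literally the mocked
# result, so the assertion can never fail regardless of what the real code does.
from unittest import TestCase
from unittest.mock import MagicMock

class TestPricing(TestCase):
    def test_get_price(self):
        service = MagicMock()
        service.get_price.return_value = 42   # the mocked result...
        self.assertEqual(service.get_price("SKU-1"), 42)  # ...asserted against itself
```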
We are not really at this phase yet. Not in general, not intelligently.
But when the "safe and effective" crowd leaves technology, we will find good use cases for it, I am certain (unlike UML, VB, and Delphi).
> One can treat current LLMs as a layer of "cheese" for any software development or deployment pipeline
It's another interesting attempt at normalising the bullshit output by LLMs, but NO. Even with an enshittified Boeing, the aviation industry's safety and reliability record is far, far above that of deterministic software (itself known for plenty of unreliability), and deterministic B2C software is to LLMs what Boeing and Airbus software and hardware reliability are to B2C software... So you cannot even begin to apply aviation industry paradigms to the shit machines, please.
I understand the frustration, but factually it is not true.
Engines are reliable to about one anomaly per million flight hours or so, while current flight software is more reliable, on the order of one fault per billion hours. In-flight engine shutdowns are fairly common, while major software anomalies are much rarer.
I've used LLMs for coding and troubleshooting, and while they can definitely "hit" and "miss", they don't only "miss".
I was actually comparing aviation HW+SW vs. consumer software... and making the point that an old C++ invoice-processing application, while way less reliable than aviation HW or SW, is still orders of magnitude more reliable than LLMs. The LLMs don't always miss, true... but they miss far too often for the "hit" part to be relevant at all.
They miss but can self-correct, and this is the paradigm shift. You need a harness to unlock the potential, and the harness itself is usually very buildable by LLMs, too.
Concrete examples are in your code just as they're in my employer's, which I'm not at liberty to share - but every little bit counts, starting from the simplest lints, typechecks, and tests and going up to more esoteric methods like model checkers.

You're trying to get the probability of a miss down with the initial context; then you want to minimize the probability of not catching a miss; then you want to maximize the probability of the model being able to fix a miss itself. Due to the multiplicative nature of the process, the effect is that the pipeline rapidly jumps from 'doesn't work' to 'works well most of the time', and that is perceived as a step function by outsiders.

Concrete examples are all over the place, they're just being laughed at (yesterday's post about 100% coverage was spot on even if it was an ad).
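A minimal sketch of such a harness, with hypothetical propose_patch/apply_patch helpers standing in for whatever model and repo tooling you actually use; the check commands are just the usual lint/typecheck/test trio:

```python
# Minimal harness sketch: let the model retry until the deterministic checks
# pass or we run out of attempts. propose_patch and apply_patch are hypothetical
# wrappers around whatever model and repo tooling is actually in use.
import subprocess

CHECKS = [
    ["ruff", "check", "."],      # lint
    ["mypy", "."],               # typecheck
    ["pytest", "-q"],            # tests
]

def run_checks() -> list[str]:
    """Run each check and collect the output of any that fail."""
    failures = []
    for cmd in CHECKS:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failures.append(f"$ {' '.join(cmd)}\n{result.stdout}{result.stderr}")
    return failures

def fix_loop(task: str, propose_patch, apply_patch, max_attempts: int = 5) -> bool:
    """Ask the model for a patch, run the checks, feed failures back, repeat."""
    feedback = ""
    for _ in range(max_attempts):
        apply_patch(propose_patch(task, feedback))
        failures = run_checks()
        if not failures:
            return True                       # every layer of cheese passed
        feedback = "\n\n".join(failures)      # the miss, handed back for self-correction
    return False                              # escalate to a human
```

Assuming roughly independent attempts and checks that actually catch the misses, even a 30% per-attempt miss rate drops below 1% residual after five retries (0.3^5 ≈ 0.2%) - that multiplicative collapse is what reads as a step function from the outside.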