A "deep" tool that fully automates fairly specific tasks works this way. LLMs are more of a "shallow", general tool that can partially help with lots of different things, but none so completely that they alleviate the need for human involvement in them.
A car that can self-drive 100% of the time is a new tool that could make driving an obsolete skill. A car that can self-drive successfully 99% of the time is dangerous because it trains people to not be ready to take over for the 1% they need to.
What actually happens is that the 1% is ignored or outlawed. The shovel doesn't do 100% of human excavating tasks better than hands, but we rightly realized that the space of possibilities involving a shovel was much greater than the 1% of hand-powered excavation.
Sure -- I think articles like this are a warning that the skills we're losing are likely _not_ so completely supplanted by AI that they'll soon be irrelevant.
There's a lot of software in between Air Traffic Controller and Facebook. And honestly would Meta be okay with Instagram or Facebook going down even for just a few minutes? I'd think at this point that'd be considered a fairly severe incident.
Even if we ignore criticality, things just get really messy and confusing if you push a bunch of broken stuff and only try to start understanding what's actually going on after it's already causing issues.
> And honestly would Meta be okay with Instagram or Facebook going down even for just a few minutes?
sure, they coined the term “move fast and break things”
and not every “bug” brings the system down. there are bugs after bugs after bugs in both facebook and insta being pushed to production daily, and it is fine… it is (almost) always fine. if you are at a place where “deploying to production” is a “thing”, you'd better be on some super mission-critical, lives-at-stake project, or you should find another project to work on.
These are the bugs after bugs after bugs after bugs after bugs.
Simply put, they go through dev, QA, and UAT first before becoming the bugs that we see. When you're running an organization of any size on software, writing bugs that take the software down is extremely easy; causing data corruption is even easier.
> We live in a world where every line of code written by a human should be reviewed by another human. We can't even do that! Nothing should go straight to prod ever, ever ever, ever
Things should 100% go to prod whenever they need to go to prod. While mandatory review makes sense in theory, there is an insane amount of ceremony in a large number of places I have seen personally, where it takes an act of Congress to deploy to production, and it is just ceremony: people hunting other people with links to PRs in various Slack channels ("hey, anyone available to take a look at this?") until someone says "I know nothing about that service/system, but I'll look and approve." I would wager that this "we must review every line of code" rule, where actually implemented, is largely ceremony. Today I deployed three services to production without anyone looking at what I did. Deploying to production should absolutely be a non-event in places that are run well and where the right people are doing their jobs.
Even with code review, a well configured CI/CD system is going to include a wealth of automated unit and integration tests, and then also a complex deploy system involving canaries and ramp-up and blue/green deployment and flags and monitoring and alerts that's backed by a pager and on-call rotation with runbooks. Code review simply will never be perfect and catch 100% of issues, so systems are designed with that in mind.
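To make the canary/ramp-up idea concrete, here is a minimal sketch of the control loop such a deploy system runs; the step percentages, the 1% error budget, and the `error_rate_at` monitoring hook are illustrative assumptions, not any particular company's setup:

```python
def canary_rollout(error_rate_at, ramp_steps=(1, 5, 25, 50, 100), budget=0.01):
    """Ramp traffic to the new version step by step; abort and roll back
    the moment monitoring shows the error rate exceeding the budget."""
    for pct in ramp_steps:
        # In a real system this reads dashboards/alerts at this traffic level.
        if error_rate_at(pct) > budget:
            return ("rolled_back", pct)
    return ("promoted", 100)

# A healthy deploy ramps all the way; a bad one dies at the 1% canary step.
print(canary_rollout(lambda pct: 0.001))  # ('promoted', 100)
print(canary_rollout(lambda pct: 0.05))   # ('rolled_back', 1)
```

The point is that a bad change hits 1% of traffic and a pager, not 100% of users, which is exactly why review can afford to be imperfect.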
So then the question is: what's actually reasonable given today's code-generating tools? 0% review seems foolish, but 100% seems similarly unrealistic. Automated code review systems like CodeRabbit are, dare I say, reasonable as a first line of defense these days. It all comes down to developer velocity balanced against system stability. Error budgets, like those Google's SRE org is able to enforce against (some of) the services they support, are one way of accomplishing that, but they're hard to put into practice.
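The arithmetic behind an error budget is simple, which is part of its appeal; a sketch (the 99.9% SLO and the deploy-freeze policy are the standard textbook example, not a specific org's numbers):

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed downtime (in minutes) per window for an availability SLO.
    Budget = (1 - SLO) * window length."""
    return (1 - slo) * window_days * 24 * 60

def can_ship_risky_change(downtime_so_far_min, slo=0.999):
    """Freeze risky deploys once the window's budget is spent."""
    return downtime_so_far_min < error_budget_minutes(slo)

print(round(error_budget_minutes(0.999), 1))  # 43.2 min for 99.9% over 30 days
print(can_ship_risky_change(10.0))            # True: budget remains, ship away
print(can_ship_risky_change(50.0))            # False: stabilize before shipping
```

The hard part isn't the math, it's getting an org to actually honor the freeze when the budget runs out.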
So then, as you say, it takes an act of Congress to get anything deployed.
So in the abstract, imo it all comes down to the quality of the automated CI/CD system, and developers being on call for their service so they feel the pain of service unreliability and don't just throw code over the wall. But it's all talk at this level of abstraction. The reality of a given company's office politics and the amount of leverage the platform teams and whatever passes for SRE there have vs the rest of the company make all the difference.
I'm sure some companies do this poorly but there's lots of places where code review happens on every PR and there's processes and systems in place to make sure it's an easy process (or at least, as easy as it should be). Many large tech companies have things pushed to prod automatically many, many times per day and still have code review for all changes going out.
>sure, they coined the term “move fast and break things”
Yeah I'm aware, but as any company gets larger and has more and more traffic (and money) dependent on their existing systems working, keeping those systems working becomes more and more important.
There are lots of things worth protecting to ensure that people keep using your product that fall short of "lives are at stake". Of course it's a spectrum, but there are lots of large enterprises that aren't saving lives yet still care a lot about making sure their software keeps running.
How do you know which lines you need to review and which you don't?
Does it feel archaic because LLMs are clearly producing output of a quality that doesn't require any review, or because having to review all the code LLMs produce clips the productivity gains we can squeeze out of them?
> The best experiences I have are those where I can describe what I want done with details.
But that's the hard part! You can only eke out moderate productivity gains by automating the tedium of actually writing out the code, because it's a small fraction of software engineering.
That's why I don't like to claim massive productivity boosts, personally. It's helped me out with the tedious bits that are still necessary. It's also great as an idea board, where I ask for some sample approaches to a problem. That cuts way down on research time, even if a few of the options given are dead ends because e.g. they use an API that doesn't exist.
Stuff like this works for things that can be verified programmatically (though I find LLMs still do occasionally ignore instructions like this), but ensuring correct functionality and sensible code organization are bigger challenges.
There are techniques that can help deal with this but none of them work perfectly, and most of the time some direct oversight from me is required. And this really clips the potential productivity gains, because in order to effectively provide oversight you need to page in all the context of what's going on and how it ought to work, which is most of what the LLMs are in-theory helping you with.
LLMs are still very useful for certain tasks (bootstrapping in new unfamiliar domains, tedious plumbing or test fixture code), but the massive productivity gains people are claiming or alluding to still feel out of reach.
It depends - there are some very very difficult things that can still be easily verifiable!
For instance, if you are working on a compiler and have a huge test database of code to compile that all has tests itself, "all sample code must compile and pass tests, ensuring your new optimizer code gets adequate branch coverage in the process" - the underlying task can be very difficult, but you have large amounts of test coverage that have a very good chance at catching errors there.
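The harness for that kind of verification is straightforward; a sketch, where the directory layout, the `*.src` extension, and the `{src}` command templates are all assumptions about a hypothetical compiler project:

```python
import subprocess
from pathlib import Path

def verify_corpus(samples_dir, compile_cmd, test_cmd):
    """Run the whole sample corpus through the new compiler: every program
    must compile, and its own test suite must still pass afterwards."""
    failures = []
    for src in sorted(Path(samples_dir).glob("**/*.src")):
        if subprocess.run(compile_cmd.format(src=src), shell=True).returncode != 0:
            failures.append((str(src), "compile"))
        elif subprocess.run(test_cmd.format(src=src), shell=True).returncode != 0:
            failures.append((str(src), "tests"))
    return failures  # empty list means the optimizer change is plausibly safe
```

An agent can loop on this signal unattended for a long time, which is what makes compilers an unusually good fit despite the underlying task being hard.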
At the very least, "LLM code compiles, and is formatted and documented according to lint rules" is pretty basic. If people are saying LLM code doesn't compile, then yes, you are using it very incorrectly; you're not even beginning to engage the agentic loop, and compiling is the simplest step.
Sure, a lot of more complex cases require oversight or don't work.
But "the code didn't compile" is definitely in "you're holding it wrong" territory, and it's not even subtle.
Yeah performance optimization is potentially another good area for LLMs to shine, if you already have a sufficiently comprehensive test suite, because no functionality is changing. But if functionality is changing, you need to be in the loop to, at the very least, review the tests that the LLM outputs. Sometimes that's easier than reviewing the code itself, but other times I think it requires similar levels of context.
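The acceptance gate for that kind of change can itself be automated, since "no functionality is changing" is checkable; a sketch with hypothetical names, using a toy sum-of-squares function as the thing being optimized:

```python
import timeit

def equivalent_and_faster(old_fn, new_fn, cases, repeat=5, number=200):
    """Gate for an LLM-proposed optimization: behavior must match the old
    implementation on every test case, and the new version must actually
    be faster on the same workload."""
    for args in cases:
        if new_fn(*args) != old_fn(*args):
            return False  # functionality changed: send back for human review
    t_old = min(timeit.repeat(lambda: [old_fn(*a) for a in cases],
                              number=number, repeat=repeat))
    t_new = min(timeit.repeat(lambda: [new_fn(*a) for a in cases],
                              number=number, repeat=repeat))
    return t_new < t_old

# e.g. naive sum of squares vs the closed form n(n+1)(2n+1)/6
slow = lambda n: sum(i * i for i in range(n + 1))
fast = lambda n: n * (n + 1) * (2 * n + 1) // 6
print(equivalent_and_faster(slow, fast, [(10,), (100,), (2000,)]))  # True
```

Of course this only works as well as the test cases cover the behavior, which is the same caveat as above: someone still has to have written a trustworthy suite first.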
But honestly I think sane code organization is the bigger hurdle, which is a lot harder to get right without manual oversight. Which of course leads to the temptation to give up on reviewing the code and just trusting whatever the LLM outputs. But I'm skeptical this is a viable approach. LLMs, like human devs, seem to need reasonably well-organized code to be able to work in a codebase, but I think the code they output often falls short of this standard.
(But yes agree that getting the LLM to iterate until CI passes is table-stakes.)
I think getting good code organization out of an LLM is one of the subtler things - I've learned quite a bit about what sort of things need to be specified, realizing that the LLM isn't actively learning my preferences particularly well, so there are some things about code organization I just have to be explicit about.
Which is more work, but less work than just writing the code myself to begin with.
It's good at writing/updating tedious test cases and fixtures when you're directing it more closely. But yes, it's not as great at coming up with what to test in the first place.
Yesterday I wanted to change a white background to transparent on some clip art. I'm still learning Affinity, so I asked Google Gemini Nano Banana PRO 2. The output looked OK at first, but the grey squares were a little off: they didn't form a perfect grid. I opened it in mspaint and was able to erase the grey squares. It hadn't changed the white background to transparent at all; it had just drawn an array of grey squares imitating a transparency checkerboard, convincing only at first glance. I have no idea how these AI tools can make anything of use if left to their own devices.