Yeah, that matches my experience: LLMs are amazing “script interns” and shaky “systems engineers.” One trick that helps on bigger stuff: force it into a tight loop of (1) write a failing test for a single behavior, (2) implement the smallest change that passes, (3) run the tests, (4) refactor. When you make the unit of work “one green test,” the model’s tendency to wander gets way less destructive.
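
Concretely, the unit of work ends up looking something like this (a minimal Python sketch; "slugify" and the module path are just made-up stand-ins for "one behavior"):

    # test_text.py -- step (1): one failing test for one small behavior.
    # "slugify" and "myapp.text" are hypothetical names for illustration.
    from myapp.text import slugify

    def test_slugify_collapses_whitespace():
        assert slugify("Hello   World") == "hello-world"

    # myapp/text.py -- step (2): the smallest change that turns the test green.
    import re

    def slugify(text: str) -> str:
        return re.sub(r"\s+", "-", text.strip()).lower()

Steps (3) and (4) are just "run pytest" and "clean up while the bar stays green"; the point is the model never gets handed a bigger unit of work than one of these.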
Outside of work I've been running a pure vibe-coding experiment where I never look at the code at all. My approach is to tell it that a specific scenario has to work in a certain way (the software relates to financial and tax planning).
The AI bot is remarkably creative at making a mess even with such tight guardrails. Many days into it I discovered that it had implemented four completely separate tax computation routines. All of them buggy in different ways. Each one addressed a specific scenario from my spec. But it never occurred to the bot to have a single centralized tax function! It is very good at satisfying the specific scenarios I give it, but absolutely terrible at any kind of system-wide planning.
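
For what it's worth, the centralization I expected isn't hard: one routine that owns the bracket math and that every scenario calls into. Something like this rough sketch (the numbers are made-up placeholders, not real tax law):

    # tax.py -- a single centralized tax routine; every scenario goes through it.
    # The brackets below are invented placeholders, not actual tax rules.
    BRACKETS = [
        (0,      0.10),  # rate applies to income above this threshold
        (10_000, 0.20),
        (50_000, 0.30),
    ]

    def income_tax(taxable_income: float) -> float:
        """Progressive tax over the brackets above; one source of truth."""
        tax = 0.0
        for i, (lower, rate) in enumerate(BRACKETS):
            upper = BRACKETS[i + 1][0] if i + 1 < len(BRACKETS) else float("inf")
            if taxable_income > lower:
                tax += (min(taxable_income, upper) - lower) * rate
        return tax

Fix a bug here once and every scenario inherits the fix, instead of four routines drifting apart in four different buggy directions.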