
This is exactly the kind of task that LLMs are good at.

They are good at transforming one format to another. They are good at boilerplate.

They are bad at deciding requirements by themselves. They are bad at original research, for example developing a new algorithm.



> They are good at transforming one format to another. They are good at boilerplate.

You just described 90% of coding


Thing is, an LLM doesn't need motivation or self-discipline to start writing, which at this point I'm confident is the main factor slowing down software development, after requirements etc.


They also have a larger memory in a way, or deeper stacks of facts. They seem to be able to explore way more sources rapidly and thus emit a solution with more knowledge. As a human I will explore less before trying to solve a problem, and only if that fails will I dig deeper.


But they lack global context, consistency, and deep understanding, which constantly trips them up in the real world.

You basically have to tell them all the patterns they need to follow and give them lots of hints to do anything decent; otherwise they reinvent helpers that already exist in the codebase, ignore existing patterns, and put code in places that aren't consistent.

They are great at quickly researching a lot, but they start from 0 each time. Then they constantly "cheat" when they can't solve a problem immediately, stuff like casting to "any", skipping tests, deciding "it's ok if this doesn't work" etc.
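To make the "casting to any" cheat concrete, here's a made-up TypeScript sketch (not from any real codebase) of the kind of shortcut I mean:

    // Hypothetical example: the type error is real, the "fix" just hides it.
    interface User { id: number; name: string }
    function save(user: User) { /* persist the user somewhere */ }

    const row = { id: "42", name: "Ada" };  // id is a string here, not a number
    // The honest fix is to convert the id; the shortcut often taken instead is:
    save(row as any);  // compiles fine, breaks anything that assumes id is a number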

a few things that would make them much better:

- an ongoing "specific codebase model" that significantly improved ability to remember things across the current codebase / patterns / where/why

- a lot more RL to teach them how to investigate things more deeply and use browsers/debuggers/one-off scripts to actually figure out things before "assuming" some path is right or ok

- much better recall of past conversations dynamically for future work

- much cheaper operating costs; a big part of why they "cheat" so often is clearly that they are told to minimize token costs. If their internal prompts said "don't be afraid to spin off sub-tasks and dig extremely deep / spend lots of tokens to validate assumptions", they would do a lot better


90% of writing code, sure. But most professional programmers write code maybe 20% of the time. A lot of the time is spent clarifying requirements and similar stuff.


The more I hear about other developers' work, the more varied it seems. I've had a few different roles, from one programmer in a huge org to lead programmer in a small team, with a few stints as a technical expert in between. In each role the kind of work I did most varied a lot, but it has never been mostly about "clarifying requirements". As a grunt worker I mostly just wrote and tested code. As a lead I spent most of my time mentoring, reviewing code, or in meetings. These days I spend most of my time debugging issues and staring at graphics debugger captures.


> As a lead I spent most time

> mentoring

Clarifying either business or technical requirements for newer or junior hires.

> reviewing code

See mentoring.

> or in meetings

So clarifying requirements from/for other teams, including scope, purely financial or technical concerns, etc.

Rephrase "clarifying requirements" to "human oriented aspects of software engineering".

Plus, based on the graphics debugger part of your comment, you're a game developer (or at least adjacent). That's a different world. Most software developers are line of business developers (pharmaceutical, healthcare, automotive, etc) or generalists in big tech companies that have to navigate very complex social environments. In both places, developers that are just heads down in code tend not to do well long term.


> human oriented aspects

The irony is of course that humans in general and software professionals in particular (myself definitely included) notoriously struggle with communication, whereas RLHF is literally optimizing LLMs for clear communication. Why wouldn't you expect an AI that's both a superhuman coder and a superhuman communicator to be decent at translating between human requirements and code?


> Why wouldn't you expect an AI that's both a superhuman coder and a superhuman communicator to be decent at translating between human requirements and code?

At this point LLMs are a superhuman nothing, except in terms of volume, which is a standard computer thing ("To err is human, but to really foul things up you need a computer" - a quote from 60 years ago).

LLMs are fast, reasonably flexible, but at the moment they don't really raise the ceiling in terms of quality, which is what I would define as "superhuman".

They are comparatively cheaper than humans and volume matters ("quantity has a quality all its own" - speaking of quotes). But I'm fairly sure that superhuman to most people means "Superman", not 1 trillion ants :-)


I wrote that based on my experience comparing my prose writing and code to what I can get from ChatGPT or Claude Code, which I feel is on average significantly higher quality than what I can do in a single pass. The quality still improves when I critique its output and iterate with it, but from what I've tried, the result of it doing the work and me critiquing it is better (and definitely faster) than what I get when I do the work myself and have it critique my approach.

But maybe it's just because I personally am not as good as others, so let me try to offer some examples of tasks where the quality of AI output is empirically better than the human baseline:

1. Chess (and other games) - Stockfish has an Elo rating of 3644 [0], compared to Magnus Carlsen at 2882

2. Natural language understanding - AIs surpassed the human expert baseline on SuperGLUE a while ago [1]

3. General image classification - On ImageNet top-5 accuracy, Facebook's ConvNeXt is at 98.55% [2], while humans are at about 94.9% [3]. Humans are still better under poor lighting conditions, but with additional training data, AIs are catching up quickly.

4. Cancer diagnosis - on lymph-node whole slide images, the best human pathologist in the study got an AUC of 0.884, while the best AI classifier was at 0.994 [4]

5. Competition math - AI is at the level of the best competitors, achieving gold level at the IMO this year [5]. It's not clearly superhuman yet, but I expect it will be very soon.

6. Competition coding - Here too AI is head to head with the best competitors, successfully solving all problems at this year's ICPC [6]. Similarly, at the AtCoder World Tour Finals 2025 Heuristic contest, only one human managed to beat the OpenAI submission [7].

So summing this up, I'll say that even if AI isn't better at all of these tasks than the best-prepared humans, it's extremely unlikely that I'll get one of those humans to do tasks for me. So while AI is still very flawed, I already quite often prefer to rely on it rather than delegate to another human, and this is as bad as it ever will be.

P.S. While not a benchmark, there's a small study from last year that looked at the quality of AI-generated code documentation in comparison to the actual human-written documentation in a variety of code bases and found "results indicate that all LLMs (except StarChat) consistently outperform the original documentation generated by humans." [8]

[0] https://computerchess.org.uk/ccrl/4040/

[1] https://super.gluebenchmark.com/

[2] https://huggingface.co/spaces/Bekhouche/ImageNet-1k_leaderbo...

[3] https://cs.stanford.edu/people/karpathy/ilsvrc/

[4] https://jamanetwork.com/journals/jama/fullarticle/2665774

[5] https://deepmind.google/blog/advanced-version-of-gemini-with...

[6] https://worldfinals.icpc.global/2025/openai.html

[7] https://arstechnica.com/ai/2025/07/exhausted-man-defeats-ai-...

[8] https://arxiv.org/pdf/2312.10349


Brother, you are not going to convince people who dedicated their lives to learning a language, knowledge that bankrolls a pretty cushy life, that that language is likely to soon be readily accessible to everyone with access to a machine translator.


Indeed, or in the words of Upton Sinclair:

> It is difficult to get a man to understand something, when his salary depends on his not understanding it.


Any chance the business/product folks will be using LLMs on their side to help with "clarifying requirements" before they turn them over to the developers?

They view this task as tedious minutiae, which is the sort of thing LLMs like to churn out.


They’re bad at 90% of coding, but for other reasons. That said if you babysit them incessantly they can help you move a bit faster through some of it.


Maybe 90% of the actual typing part of coding, but not 90% of the JOB of coding.


+/-

> They are bad at deciding requirements by themselves.

What do you mean by requirements here? In my experience the frontier models today are pretty good at figuring out requirements, even when you don't explicitly state them.

> They are bad at original research

Sure, I don't have any experience with that, so I'll trust you on that.

> for example developing a new algorithm.

This is just not correct. I used to think so too, but recently I was trying to come up with a pretty complicated multi-dimensional pattern-matching algorithm (I can't go into the details). It was something I could figure out on my own, and I was halfway through it, but I decided to write up a description of it and feed it to Gemini 2.5 Pro a couple of months ago, and I was stunned.

It came up with a really clever approach, something I had previously been convinced the models weren't very good at.

In hindsight, since they are getting so good at math in general, there's probably some overlap, but you should revisit your views on this.

--

Your 'bad at' list is missing a few things though:

- Calculations (they can come up with how to calculate, or write a program to calculate from given data, but they are not good at doing the arithmetic inline in their responses; see the sketch after this list)

- Even though the frontier models are multi-modal, they are still bad at visualizing HTML/CSS, or interpreting what it would look like when rendered

- Same goes for visualizing or figuring out visual errors in graphics programming, such as game programming or 3D modeling (z-index issues, orientation, etc.)
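A minimal sketch of the calculations point, with hypothetical numbers (TypeScript just for illustration): the workaround is to have the model emit code, so the runtime does the arithmetic instead of the model guessing at the answer inline:

    // Hypothetical data; the runtime computes the result,
    // rather than the model producing a number token by token.
    const values = [3.2, 5.1, 4.4, 6.0];
    const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
    console.log(mean.toFixed(3)); // prints 4.675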


> I was trying to come up with a pretty complicated pattern matching, multi-dimensional algorithm (I can't go into the details)

The downside is that if you used Gemini to create the algorithm, your company won't be able to patent it.

Or maybe that's a good thing, for the rest of us.


Figuring out detailed requirements requires a lot of contact with reality: specific details about not only the technical surface area but also the organizational and financial constraints. An AI model with the appropriate context would probably do well. It seems one of the things humans do much better at the moment is distilling the big picture across a long period of time.



