Hacker News | zamalek's comments

The last two weeks with Claude have been a nightmare for code quality; it outright ignores standards (in CLAUDE.md). Just yesterday I was reviewing a PR from a coworker where it undid some compliant code, then proceeded to struggle with exactly what the standards were designed to address.

I threw in the towel last night and switched to codex, which has actually been following instructions.


> Ignores instructions

> Claims "simplest fixes" that are incorrect

> Does the opposite of requested activities

> Claims completion against instructions

I thought it was just me. I'm continuously interrupting it with "no, that's not what I said" - and being ignored, sometimes three times in a row; is Claude at the intellectual level of a teenager now?

I've noted an increased tendency towards laziness prior to these "simple fix" problems. Historically it would defer doing things correctly (only documenting that in the context).


I've noticed laziness in claude repeatedly. It sometimes takes the shortest way out even when asked explicitly to do the "right" thing.

It was later reproduced on the same machine without huge pages enabled. PICNIC?

Yes, I did reproduce it (to a much smaller degree, but it's just a 48c/96t machine). But it's an absurd workload in an insane configuration. Not using huge pages hurts way more than the regression due to PREEMPT_LAZY does.

With what we know so far, I expect essentially no real-world workloads will be affected that aren't already completely falling over.


So why does it happen only with hugepages? Is the extra overhead / TLB pressure enough to trigger the issue in some way? Or is it because the regular pages get swapped out (which hugepages can't be)?

I don't fully know, but I suspect it's just that, due to the minor faults and TLB misses, there is terrible contention on the spinlock regardless of PREEMPT_LAZY when using 4k pages (that's easily reproducible), which is then made worse by preempting more with the lock held.

> Here’s an exchange I had on twitter a few months ago:

The purple account is just plain wrong. Classically, the full architecture is this (keeping in mind that all rules are sometimes broken):

* CQRS is the linchpin.

* You generally only queue commands (writes). A few hundreds of ms of latency on those typically won't be noticed by users.

* Reads happen from either a read replica or cache.

The problems the author faces are caused by cherry-picking bits of the full picture.

A queue is a load-smoothing operator. Things are going to go bad one way or another if you exceed capacity; a queue at least guarantees progress (up to a point). It's also a great metric for scaling your worker count.
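A minimal sketch of that shape in Python (all names hypothetical, using an in-process dict to stand in for a cache or read replica): commands go through a bounded queue to a background worker, while queries read the read model directly and never wait behind writes.

```python
import queue
import threading

# Hypothetical minimal CQRS shape: commands (writes) are queued and applied
# by a background worker; queries (reads) go straight to a read model.
command_queue = queue.Queue(maxsize=1000)  # bounded: a full queue signals overload
read_model = {}                            # stands in for a cache or read replica

def worker():
    while True:
        key, value = command_queue.get()
        read_model[key] = value            # apply the write
        command_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

def handle_command(key, value):
    command_queue.put((key, value))        # a few hundred ms of latency here is fine

def handle_query(key):
    return read_model.get(key)             # reads never wait behind writes

handle_command("user:1", {"name": "alice"})
command_queue.join()                       # demo only: wait for the worker to catch up
print(handle_query("user:1"))              # -> {'name': 'alice'}
```

The bound on the command queue is the point: it is both the smoothing buffer and the overload signal.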

> What will you do when your queue is full

If your queue fills up, you need to start rejecting requests. If you have a public-facing API, there's a good chance there will be badly behaved clients that don't back off correctly - so you'll need a way to IP-ban them until things calm down. AWS has API Gateway and Azure has APIM, both of which can help with this.

If you're separating commands and queries you should _typically_ see more headroom.
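Rejecting when the queue is full is cheap to sketch with a bounded queue (hypothetical names; a tiny bound just to demonstrate the shedding behaviour):

```python
import queue

requests = queue.Queue(maxsize=2)  # deliberately tiny bound to show shedding

def accept(req):
    """Admit a request for async processing, or shed load HTTP-style."""
    try:
        requests.put_nowait(req)
        return 202  # accepted; will be processed by a worker
    except queue.Full:
        return 429  # rejected; well-behaved clients back off and retry

results = [accept(i) for i in range(3)]
print(results)  # -> [202, 202, 429]
```

The 429s are where rate limiting or IP banning of clients that never back off would hook in.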


Agree that CQRS seems like a useful way to partition writes from reads (aka slower requests from faster requests), to avoid many fast requests waiting in line behind a few slow ones.

But even if you shifted reads to one or more caches or read replicas, wouldn't those also have queues that will fill up when you are under-provisioned?

Note that I'm using the term "queue" pretty loosely, to include things like Redis' maxclients or tcpbacklog, or client-side queues when all connections are in use.


Absolutely. That's typically a good problem to have :). Hopefully you would have had gradual enough growth to implement elastic scaling before this is an issue, but you're definitely eventually screwed and have to outright copy what the likes of FAANG do - your startup is a unicorn at that point, so you'd probably already have the talent hired.

Don't forget the pointless backronym.


LLMs are notoriously terrible at multiplying large numbers: https://claude.ai/share/538f7dca-1c4e-4b51-b887-8eaaf7e6c7d3

> Let me calculate that. 729,278,429 × 2,969,842,939 = 2,165,878,555,365,498,631

Real answer is: https://www.wolframalpha.com/input?i=729278429*2969842939

> 2 165 842 392 930 662 831
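This is trivial to confirm; Python integers are arbitrary precision, so the product is exact (which is exactly why tool use works here):

```python
# The multiplication from the Claude transcript, computed exactly.
a, b = 729_278_429, 2_969_842_939
product = a * b
print(product)  # -> 2165842392930662831, matching Wolfram Alpha

# Claude's claimed answer differs in the middle digits.
assert product != 2_165_878_555_365_498_631
```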

Your example seems short enough to not pose a problem.


Modern LLMs, just like everyone reading this, will instead reach for a calculator to perform such tasks. I can't do that in my head either, but a python script can so that's what any tool-using LLM will (and should) do.


This is special pleading.

Long multiplication is a trivial form of reasoning that is taught at the elementary level. Furthermore, the LLM isn't doing things "in its head" - the headline feature of GPT LLMs is attention across all previous tokens; all of its "thoughts" are on paper. That was Opus with extended reasoning; it had every opportunity to get it right, but didn't. There are people who can quickly multiply such numbers in their head (I am not one of them).

LLMs don't reason.


I tried this with Claude - it has to be explicitly instructed to not make an external tool call, and it can get the right answer if asked to show its work long-form.


Mathematics is not the only kind of reasoning, so your conclusion is false. The human brain also has compartments for different types of activities. Why shouldn't an AI be able to use tools to augment its intelligence?


I used the mathematics example only because the GP did. There are many other examples of non-reasoning, including some papers (as recent as Feb).


There are many examples of current limitations, but do you see a reason to think they are fundamental limitations? (I'm not saying they aren't, I'm curious what the evidence is for that.)


It's because of how transformers work, especially the fact that the output layer is a bunch of weights from which we quite literally make a weighted random choice. My hunch is that diffusion models would have a higher chance of doing real reasoning - or something like a latent space for reasoning.

Thinking that LLMs are intelligent arises from an incomplete understanding of how they work or, alternatively, having shareholders to keep happy.
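A toy illustration of that sampling step (the logits here are made up): the output layer yields a distribution over tokens, and decoding draws from it, so even a digit the model rates as unlikely can be emitted.

```python
import math
import random

# Toy next-token step: softmax over hypothetical logits, then the
# "weighted random choice" referred to above.
tokens = ["7", "8", "9"]
logits = [4.0, 1.0, 0.5]

exps = [math.exp(l) for l in logits]
total = sum(exps)
probs = [e / total for e in exps]   # softmax: roughly [0.93, 0.05, 0.03]

sample = random.choices(tokens, weights=probs)[0]
print(probs, sample)                # usually "7", occasionally not
```

Temperature and top-k/top-p decoding reshape this distribution but don't remove the randomness.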


> Furthermore, the LLM isn't doing things "in its head" - the headline feature of GPT LLMs is attention across all previous tokens, all of its "thoughts" are on paper

LOL, talk about special pleading. Whatever it takes to reshape the argument into one you can win, I guess...

> LLMs don't reason.

Let's see you do that multiplication in your head. Then, when you fail, we'll conclude you don't reason. Sound fair?


I can do it with a scratch pad. And I can also tell you when the calculation exceeds what I can do in my head and when I need a scratch pad. I can also check a long multiplication answer in my head (casting 9s, last digit etc.) and tell if there’s a mistake.

The LLMs also have access to a scratch pad. And importantly don’t know when they need to use it (as in, they will sometimes get long multiplication right if you ask them to show their work but if you don’t ask them to they will almost certainly get it wrong).
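The casting-out-nines check mentioned above is easy to sketch: digit roots are preserved by multiplication mod 9, so a mismatch proves an error (though a match doesn't guarantee correctness). Applied to the figures from upthread:

```python
def digit_root_mod9(n: int) -> int:
    """Sum of decimal digits taken mod 9 (equivalent to n % 9 for n >= 0)."""
    return n % 9

a, b = 729_278_429, 2_969_842_939
claimed = 2_165_878_555_365_498_631   # the LLM's answer from upthread

# If a*b == claimed, then (a % 9 * b % 9) % 9 must equal claimed % 9.
check = (digit_root_mod9(a) * digit_root_mod9(b)) % 9
print(check, digit_root_mod9(claimed))  # -> 8 7: mismatch, so the answer is wrong
```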


> And importantly don’t know when they need to use it

patently false, but hey at least you’re able to see the parallel between you with a scratch pad and an LLM with a python terminal


Sure, lets test that:

https://chatgpt.com/s/t_69c420f3118081919cf525123e39598c

https://chatgpt.com/s/t_69c4215daeb481919fdaf22498fb0c4f

Do you have a different definition of false? I'm referring to their reasoning context as their scratch pad if that wasn't clear.


The context is the scratch pad. LLMs have perfect recall (ignoring "lost in the middle") across the entire context, unlike humans. LLMs "think on paper."


The conclusion that LLMs don't reason is not a consequence of them not being able to do arithmetic, so your argument isn't valid.

Also, see https://news.ycombinator.com/newsguidelines.html

"Be kind. Don't be snarky. Converse curiously; don't cross-examine. Edit out swipes.

Comments should get more thoughtful and substantive, not less, as a topic gets more divisive.

When disagreeing, please reply to the argument instead of calling names. "That is idiotic; 1 + 1 is 2, not 3" can be shortened to "1 + 1 is 2, not 3."

Don't be curmudgeonly. Thoughtful criticism is fine, but please don't be rigidly or generically negative."

etc.


Plenty of humans can't do arithmetic. Can they also not reason?

Reasoning isn't a binary switch. It's a multidimensional continuum. AI can clearly reason to some extent even if it also clearly doesn't reason in the same way that a human would.


> Plenty of humans can't do arithmetic. Can they also not reason.

I just pointed out that this isn't valid reasoning ... it's a fallacy of denial of the antecedent. No one is arguing that because LLMs can't do arithmetic, therefore they can't reason. After all, zamalek said that he can't quickly multiply large numbers in his head, but he isn't saying that therefore he can't reason.

> Reasoning isn't a binary switch. It's a multidimensional continuum.

Indeed, and a lot of humans are very bad at it, as is clear from the comments I'm responding to.

> AI can clearly reason to some extent

The claim was about LLMs, not AI. This is like if someone said that chihuahuas are little and someone responded by saying that dogs are tall to some extent.

LLMs do not reason ... they do syntactic pattern matching. The appearance of reasoning is because of all the reasoning by humans that is implicit in the training data.

I've had this argument too many times ... it never goes anywhere. So I won't respond again ... over and out.


> Indeed, and a lot of humans are very bad at it, as is clear from the comments I'm responding to.

This is your idea of "conversing curiously" and "editing out swipes," I suppose.

> I've had this argument too many times ... it never goes anywhere. So I won't respond again ... over and out.

A real reasoning entity might pause for self-examination here. Maybe run its chain of thought for a few more iterations, or spend some tokens calling research tools. Just to probe the apparent mismatch between its own priors and those of "a lot of humans," most of whom are not, in fact, morons.


> Don't be snarky.


ROFL


> Comments should get more thoughtful and substantive

Yes, they should, but instead we're stuck with the stochastic-parrot crowd, who log onto HN and try their best to emulate a stochastic parrot.


i assert that by your evidentiary standards humans don't reason.

presumably one of us is wrong.

therefore, humans don't reason.


LLMs don't use tools. Systems that contain LLMs are programmed to use tools under certain circumstances.


you’re just abstracting it away into this new “systems” definition

when someone says “LLMs” today they obviously mean software that does more than just text; if you want to be extra pedantic you could even say LLMs by themselves can’t even generate text, since they are just model files if you don’t add them to a “system” that makes use of those model files, duh


> when someone says LLMs today they obviously mean ...

LLMs, if the someone is me or others who understand why it's important to be precise. And in this context, the distinction between LLM and AI mattered--not pedantic at all.

I won't respond further ... over and out.


This hasn't been true for a while now.

I asked Gemini 3 Thinking to compute the multiplication "by hand." It showed its work and checked its answer by casting out nines and then by asking Python.

Sonnet 4.6 with Extended Thinking on also computed it correctly with the same prompt.


This doesn’t address the author’s point about novelty at all. You don’t need 100% accuracy to have the capability to solve novel problems.


It does address the GP comment about math.


I thought it might do better if I asked it to do long-form multiplication specifically rather than trying to vomit out an answer without any intermediate tokens. But surprisingly, I found it doesn't do much better.


Other comments indicate that asking it to do long multiplication does work, but the varying results make sense: LLMs are probabilistic, so you probably rolled an unlikely result.


Specifically, you need to use a reasoning model. Applying more test time compute is analogous to Kahneman's System 2 thinking, while directly taking the first output of an LLM is analogous to System 1.

This is true for solving difficult novel problems as well, with the addition of tools that an agent can use to research the problem autonomously.


Pijul isn't a CRDT, is it? It's a theory of patches (i.e. DARCS++) alongside native conflicts.


Its author says it implements a CRDT in its theory documentation.


I generally use flatpak for things that are important to keep extremely updated, e.g. my browser for vulnerability reasons.


I can completely understand how you were driven away. If you ever want to give it a go again:

> there's "Flakes" which I never quite understood

Nix never clicked for me until I started using flakes. There's a lot of internal drama surrounding them that is honestly childish; that's why they are marked as experimental and are not the official recommendation. You will have a worse time with Nix if you go with the official recommendation; flakes are significantly more intuitive. The Determinate Systems installer enables them by default, and whatever documentation they have is on the happier path (except for FlakeHub, which I haven't figured out yet).

On the most fundamental level, flakes allow you to take /etc/nixos/nixos.nix (or whatever, it has been forever) out of /etc and into a git repository. Old-style nix may be able to do that, but I discovered flakes before trying. I did previously attempt to use git on /etc/nix, but git was falling to pieces with bizarre ownership problems.

What this means is that I could install and completely configure a machine, once booted into a nix iso, by running: nixos-install --flake https://github.com/.../repo.git. I manage all of my system config out of /home/$user/$clone

As for /home there is home-manager and, again, you are not steered towards it (the tutorial pushes you towards nix profiles/nix-env instead). Home-manager will do for your home directory what the system config does for your system, and has many program modules. You can even declare home-level systemd units and whatnot.

> manually edited /etc files.

You can use environment.etc for these files[1]. systemd.tmpfiles can be used for things outside of etc. Home-manager has the equivalent for .config, .local, .cache. [2].

[1]: https://search.nixos.org/options?channel=unstable&query=envi... [2]: https://home-manager-options.extranix.com/?query=xdg.configF...


Yep, I am doing the same. I have a central remote flake repo where all my machines, services, etc. are defined, and they all run tweaked autoupdaters to periodically do full updates. I push commits, wait, and forget. It feels like maintaining your own distro everywhere, no matter where you ssh in. And soon I will migrate that repo off a central platform (github) onto radicle or something and turn some of my machines into seeders. Then, with offsite data backups, my house could burn down and github could go dark, and I could still recover - maybe in the future even bootstrap from my smartphone. A big step towards digital sovereignty.


Great comment -- thank you!


> services.desktopManager.gnome.extraGSettingsOverrides =

You can set dconf settings more declaratively: https://tangled.org/jonathan.dickinson.id/nix/blob/7c895ada8...

