
> strongly believe that structured outputs are one of the most underrated features in LLM engines

Structured output is really the whole foundation of lots of our hopes and dreams. The JSONSchemaBench paper is fairly preoccupied with performance, but where it talks about quality/compliance, the "LM only" scores in the tables are pretty bad. This post highlights the ongoing difficulty and confusion around doing a simple, necessary, and very routine task well.

Massaging small inputs into structured formats isn't really the point. It's about all the nontrivial cases central to MCP, tool-use, and local or custom APIs. My favorite example of this is every tool-use tutorial pretending that "ping" accepts 2 arguments, when it's actually more like 20 arguments with subtle gotchas. Do the tool-use demos that correctly work with 2 arguments actually work with 20? How many more retries might that take, and what does this change about the hardware and models we need for "basic" stuff?

If you had a JSON schema correctly and completely describing legal input for, say, ffmpeg, its size and complexity would approach that of the Kubernetes schemas (where JSONSchemaBench compliance is only at 0.56). Can you maybe yolo-generate a correct ffmpeg command without consulting any schema, using SOTA models? Of course, but that works well because ffmpeg is a well-documented tool with decades of examples floating around in the wild. What's the arg-count and type-complexity for that one important function/class in your in-house code base? For a less well-known use case or tool, if you want hallucination-free and correct output, then you need structured output that works, because the alternative is rolling your own model trained on your stuff.
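To make the ping example concrete, here's a rough hand-written sketch (hypothetical, not taken from the JSONSchemaBench paper) of a schema covering just a handful of ping's real flags. Even this slice has the gotchas that trip up schema-free generation: bounded integers, required fields, mutually exclusive flags.

    # Hypothetical JSON Schema for a small subset of ping's arguments,
    # written as a Python dict so it could be handed to whatever
    # structured-output / grammar-constrained decoding feature you use.
    import json

    PING_SCHEMA = {
        "type": "object",
        "properties": {
            "destination": {"type": "string", "minLength": 1},
            "count": {"type": "integer", "minimum": 1},                 # -c
            "interval": {"type": "number", "exclusiveMinimum": 0},      # -i
            "ttl": {"type": "integer", "minimum": 1, "maximum": 255},   # -t
            "ipv4": {"type": "boolean"},                                # -4
            "ipv6": {"type": "boolean"},                                # -6
        },
        "required": ["destination"],
        "additionalProperties": False,          # reject invented flags outright
        "not": {"required": ["ipv4", "ipv6"]},  # -4 and -6 can't both be set
    }

    print(json.dumps(PING_SCHEMA, indent=2))

Scale that up to all ~20 flags, then to ffmpeg, and the compliance numbers in the paper start to look less like an edge case and more like the main event.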


Is it? If kids grab the top result of a massively A/B-tested algorithmic feed trying to hook people and/or maximize controversy and engagement, then arbitrage it onto another, simpler, less gamified platform... is that really human-curated? There's some truth to the idea that, as long as there is any social media, everything is social media.


This is a meta-game. I got curious about related topics in game theory once and found out about [1,2]. There are also a few papers directly trying to study Calvinball and so-called minimal nomic. It's pretty crazy how little we know theoretically about this stuff, considering how relevant games with dynamic rules actually are for daily life.

Of course, there are probably no clean solutions in this space short of lots of sims. Regardless of whether new agentic stuff works for everything else in AI, agent-based modeling seems likely to benefit from some kind of renaissance, and that should be really interesting.

[1] https://en.wikipedia.org/wiki/Constitutional_economics [2] https://en.wikipedia.org/wiki/Mechanism_design


That's only sort of true.

The metagame within 1kbwc is that at the end of play people generally vote on which new cards to keep for seeding the next game, and which to discard. So you get a rush of joy if everybody liked your card and wants to keep it.

For an example of metagame play, one deck developed Angry Sheep, Sleepy Sheep, a bunch of other sheep, plus a rule card saying "if there are more than five sheep, the person with the most sheep wins." People liked those, so they kept them. Then someone created a different card called the Sheep Herder: all of a player's sheep get stacked under it, and it passes one player to the left every time a sheep is played, so it slowly goes around the circle vacuuming up sheep. People liked this but started making Angry Goat, Sleepy Goat, etc. so that they could have an alternate victory condition, which led to the Goat Herder card that moves to the right as new goats are played. The meta-joke then reached its peak with the Herder Herder, which picks up Herders and moves them around the board, dropping the things that they are herding as it moves.

The key to 1kbwc is that anyone can at any time create a card that says "I win the game," but that is no fun, not unless someone has a card called Counterspell that says "play me at any time to discard a card that some other player is playing, before it takes effect," etc. The metagame of 1kbwc allows the deck to become its own story, and the players, over many rounds of it, are rewarded as the storytellers.


> anyone can at any time create a card that says "I win the game" but that is no fun [..] The metagame of 1kbwc allows the deck to become its own story

Yep, exploring this question collaboratively is of course the real activity. Depending on your perspective it's barely recognizable as a game, or it's the ultimate / only game. Also kinda related here is Carse on finite and infinite games and Wittgenstein on language games [1,2]. It is "only" philosophy, but it also feels ripe for more rigorous treatment.

Presumably a good theoretical treatment would try to look at how games and their metas are related: how the number and stability of rules changes the richness of interaction, enjoyment, flexibility in strategy, average duration and tolerable length of game-play, etc.

[1] https://openlibrary.org/books/OL22379733M/Finite_and_infinit... [2] https://en.wikipedia.org/wiki/Language_game_(philosophy)


Basically, the true victory condition is to create a win condition that impresses the other player


(I have not read your links)

What do you mean by "solution" here?


Nothing really specific... just some kind of relatively tidy insight in classical terms: game trees, clearly articulated dominant strategies, surreal numbers.

You could model most games with anything simple that's convenient (trees, state machines, term-rewriting systems). Meta-games, dynamic protocols, and multi-agent systems are broadly related but also different animals, where you might need sims, full-blown process calculi, or weird new kinds of logic. Depending on where you land, a natural model candidate might be messy; maybe you have to give up things like completeness or decidability. Maybe the closest fit for a formalism here is dynamic deontic logic: https://www.cse.chalmers.se/~gersch/jlap2012.pdf
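To illustrate the difference (a made-up minimal-nomic toy, nothing to do with the deontic-logic paper): once the rules are themselves mutable state, a plain game tree over positions no longer captures the state space, because you also have to branch over rulebooks.

    # Hypothetical minimal-nomic sketch: the rules are ordinary data, and one
    # legal move is to amend them mid-game. Analyzing this means reasoning
    # about (position, rulebook) pairs, not just positions.
    from dataclasses import dataclass, field

    @dataclass
    class Game:
        scores: dict = field(default_factory=dict)
        rules: dict = field(default_factory=lambda: {"win_at": 10, "points_per_turn": 1})

        def take_turn(self, player):
            self.scores[player] = self.scores.get(player, 0) + self.rules["points_per_turn"]

        def amend(self, rule, value):
            # The meta-move: mutate the rulebook itself.
            self.rules[rule] = value

        def winner(self):
            return next((p for p, s in self.scores.items() if s >= self.rules["win_at"]), None)

    g = Game()
    g.take_turn("alice")
    g.amend("points_per_turn", 5)   # a Calvinball move
    g.take_turn("bob")
    g.amend("win_at", 5)            # and now bob has already won
    print(g.winner())               # -> bob

Even in a toy like this, classical notions of a dominant strategy get slippery, because "change what counts as winning" is always on the menu.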


> How would we measure the effects of AI coding tool taking over manual coding ?

Instead of asking "where are the AI-generated projects" we could ask about the easier problem of "where are the AI-generated ports". Why is it still hard to take an existing fully concrete specification, and an existing test suite, and dump out a working feature-complete port of huge, old, and popular projects? Lots of stuff like this will even be in the training set, so the fact that this isn't easy yet must mean something.

According to Claude, WordPress is still 43% of all the websites on the internet, and PHP has been despised by many people for many years and many reasons. Why is there no Python or Ruby port? Harder but similar: throw in Drupal and MediaWiki, and wonder when we can automatically port the Linux kernel to Rust, etc.


> Why is it still hard to take an existing fully concrete specification, and an existing test suite, and dump out a working feature-complete port of huge, old, and popular projects? Lots of stuff like this will even be in the training

We have a smaller version of that ability already:

- https://simonwillison.net/2025/Dec/15/porting-justhtml/

See also https://www.dbreunig.com/2026/01/08/a-software-library-with-...

I need to write these up properly, but I pulled a similar trick with an existing JavaScript test suite for https://github.com/simonw/micro-javascript and the official WebAssembly test suite for https://github.com/simonw/pwasm


So extrapolating from here and assuming applications are as easy as libraries, and operating systems are as easy as applications... at this rate, with a few people in a weekend you can convert anything to anything else, and the differences between programming languages are very nearly erased. Nice!

And yet it doesn't feel true yet, otherwise we'd see it. Why do you think that is?


Because it's not true yet. You can't convert anything to anything else, but you CAN get good results for problems that can be reduced to a robust conformance suite.

(This capability is also brand new: prior to Claude Opus 4.5 in November I wasn't getting results from coding agents that convinced me they could do this.)

It turns out there are some pretty big problems that this works for, like HTML5 parsers and WebAssembly runtimes and reduced-scope JavaScript language interpreters. You have to be selective though. This won't work for Linux.

I thought it wouldn't work for web browsers either - one of my 2026 predictions was "by 2029 someone will build a new web browser using mostly LLM-code"[1] - but then I saw this thread on Reddit https://www.reddit.com/r/Anthropic/comments/1q4xfm0/over_chr... "Over christmas break I wrote a fully functional browser with Claude Code in Rust" and took a look at the code and it's surprisingly deep: https://github.com/hiwavebrowser/hiwave

[1] https://simonwillison.net/2026/Jan/8/llm-predictions-for-202...


> you CAN get good results for problems that can be reduced to a robust conformance suite.

If that's what is shown then why doesn't it work on anything that has a sufficiently large test-suite, presumably scaling linearly in time with size? Why should we be selective, and based on what?


It probably does. This only became possible over the last six weeks, and most people haven't yet figured out the pattern.


I agree about skills actually, but it's also obvious that the parent is making a very real point that you cannot just dismiss. For several years now, and far short of wild AGI promises, the answer to literally every issue with casual or production AI has been something like "but the rate of model improvement..." or "but the tools and ecosystem will evolve..."

If you believe that uncritically about everything else, then you have to answer why agentic workflows or MCP or whatever is the one thing that it can't evolve to do for us. There's a logical contradiction here where you really can't have it both ways.


I'm not understanding your point… (and would be genuinely curious to)? The models and systems around them have evolved and gotten better (over the past few years for LLMs, and decades for "AI" more broadly).

Oh, I think I do get your point now after a few rereads (correct me if wrong, but you're saying it should keep getting better until there's nothing for us to do). "AI", and computer systems more broadly, are not and cannot be viable systems. They don't have agency (ironically) to effect change in their environment without humans in the loop. Computer systems don't exist/survive without people. All the human concerns around what/why remain; AI is just another tool in a long line of computer systems that make our lives easier/more efficient.


AI Engineer to Software Engineer: Humans writing code is a waste of time, you can only hope to add value by designing agentic workflows

Prompt Engineer to AI Engineer: Designing agentic workflows is a waste of time, just pre/postfix whatever input you'd normally give to the agentic system with the request to "build or simulate an appropriate agentic workflow for this problem"


Let's maybe avoid all the hype, whether for or against, and just have thoughtful and measured stances on things? This piece scores fairly high marks on that, despite the title. It has the obligatory remark that manually writing code is pointless now, but also the obligatory caveat that it depends on the kind of code you're writing.


A root-cause analysis here that's about intrinsic difficulty is misguided IMHO. Secrets and secrets-delivery are an environment service that individual developers shouldn't ever have to think about. If you cut platform/devops/secops teams to the bone because they aren't adding application features, or if you understaff or overwork the seniors who are supposed to be reviewing work and mentoring, then you will leak eventually. Simple as that. Cutting engineering budgets in favor of marketing budgets and executive bonuses practically guarantees these kinds of problems. Engineering leadership should understand this, and deep down it usually does. So the most direct way to talk about this is usually acknowledging willful negligence and/or greed.
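For what "environment service" means from the developer's side, here's a minimal sketch (the variable name and failure behavior are just illustrative assumptions): the application asks its environment for the secret, and how the value got there, via Vault, a Kubernetes Secret, CI injection, whatever, stays the platform team's problem.

    # Hedged sketch: developers consume secrets from the environment; the
    # delivery mechanism is owned by the platform/secops team.
    import os
    import sys

    def require_secret(name: str) -> str:
        value = os.environ.get(name)
        if not value:
            # Fail loudly at startup rather than limping along and leaking later.
            sys.exit(f"missing required secret: {name}")
        return value

    DB_PASSWORD = require_secret("DB_PASSWORD")  # hypothetical variable name

The point isn't the dozen lines of code; it's that keeping this boundary intact is staffing and ownership work, which is exactly what gets cut first.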


Agreed. Proper secrets management is table stakes for any company entrusted with paying customers.


Thank you robot wrangler - I wouldn't have that insight without people sharing things like you do here <3

I agree with what you write here. It was a bad proffer / explanation on my part.


> Then what sort of math problem would be a milestone for you where an AI was doing something novel?

What? If we're discussing novel synthesis, and it's being contrasted with answer-from-search / answer-from-remix... the problem does not matter. Only the answer and the originality of the approach. Connecting two fields that were not previously connected is novel; so is applying a new kind of technique to an old problem. Recognizing that an unsolved problem is very much like a solved one is search / remix. So what happened here? Tao says it is

> is largely consistent with other recent demonstrations of AI using existing methods to resolve Erdos problem

Existing. Methods. Tao also says "This is a demonstration of the genuine increase in capability of these tools in recent months". This is the sentence everyone will focus on, so what is that capability?

> the more interesting capability revealed by these events is the ability to rapidly write and rewrite new versions of a text as needed, even if one was not the original author of the argument.

Rejoice! But rejoice for the right reasons, and about what actually happened. Style and voice transformations, interesting new capabilities for fuzzy search. Correct usage of external tools for heavy-lifting with symbolics. And yes, actual problem solving. Novel techniques, creativity, originality though? IDK, sounds kind of optimistic based on the detail here.


If you squint hard enough, every new thing is an example of "answer-from-search / answer-from-remix". Solving any Erdős problem in this manner was largely seen as unthinkable just a year ago.

> the problem does not matter.

Really? All of the other Erdős problems? Millennium Problems? Anything at all? This gets us directly into the territory of "nothing can convince us otherwise".


Tiresome. You're quoting me out of context, and generally assigning me the POV you want to argue with. You come across as pro-AI looking for anti-AI to do combat with. First, I'm not the right guy, and second, all I'm really saying above is that if we're going to do argument-from-authority, maybe let's engage with what the authority is actually saying in TFA.


I don't think I quoted you out of context. In any case, Terence Tao and co. are doing wonderful work in this area. I'd encourage everyone to bookmark the following link: https://github.com/teorth/erdosproblems/wiki/AI-contribution...

It's a rapidly evolving story and I expect H1 2026 to bring much clarity on this topic. Especially with upcoming model releases and more professional mathematicians taking an interest.


The conclusion perfectly and concisely states the straw-man:

> If your job is only to write beautiful code, you have a problem.

Definitely no one at senior or even mid level thinks this is what their job is about. Something like modularity is beautiful because it works and because of what it enables... we don't try to shove it into place because it's beautiful. Talking about it the other way around sounds like a manager who does not understand engineering trying to paraphrase, poorly, the things engineers are saying. Indeed, quoting from this self-described entrepreneur's "About Me" page:

> While I was there, I took a couple computer science courses, decided I was terrible at it and swore to never write software again.

I guess that's thought-leadership for you.


Yeah, I think good code allows modularity and future development without turning into Frankenstein's monster. Readability isn't the main goal, but future adaptability is.

Also a big issue with AI imo: it lets people who stopped writing code ages ago write it again, because they think they can somehow work at a higher level.


> The kind of cultural cognition highlighted by the article/study is common to everyone, not to some groups that just are incapable of seeing it in themselves.

Yeah this seems political, and it is, but it's really about cognitive bias. Reframing the thing in terms of daily workplace dynamics is pretty easy: just convert "legally consequential facts" to "technically consequential facts" and convert "cultural outlook" to "preferred tech-stack". Now you're in a planning and architecture meeting which is theoretically easier to conduct but where everyone is still working hard to confirm their bias.

How to "fix" this in other people / society at large is a difficult question, but in principle you can imagine decision-systems (like data-driven policies and a kind of double-blind experimental politics) that's starting to chip away at the problem. Even assuming that was a tractable approach with a feasible transition plan, there's another question. What to do in the meanwhile?

IOW, assuming the existence of citizens/co-workers who have more persistent, non-situational goals and stable values that are fairly unbothered by "group commitments"... how should they participate in group dynamics that are still going to be basically dominated by tribalism? There are really only a few strategies, including stuff like "check out completely", "become a single-issue voter", or "give up all other goals and dedicate your entire life to educating others". All options seem quite bad for individuals and the whole. If group commitment is fundamentally problematic, maybe a way to recognize a "good" faction is by looking for one that is implicitly dedicated to eliminating itself as well as the rival factions.

