I don't think this is correct; previously you could already control output by reading tokens one at a time from the LLM until you hit a stop character.
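Roughly this pattern, as a toy sketch — `fake_model` and the stop-character check are hypothetical stand-ins, not any real engine's API:

```python
from typing import Iterator

def fake_model(prompt: str) -> Iterator[str]:
    # Stand-in for an LLM that streams sampled tokens one at a time.
    for tok in ['{"name"', ':', ' "Bob"', '}', '\n', 'and some', ' trailing', ' text']:
        yield tok

def generate_until(prompt: str, stop_chars: set[str]) -> str:
    out = []
    for tok in fake_model(prompt):
        # Pause after every token and decide, in our own code, whether to keep going.
        if any(c in tok for c in stop_chars):
            break
        out.append(tok)
    return "".join(out)

print(generate_until("Return a JSON object:", stop_chars={"\n"}))
# -> {"name": "Bob"}
```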
My take from the grammar-based sampling PR is that you ask llama.cpp to constrain the next output token to a restricted set of possible tokens, using the grammar.
Right, which is the same idea - it's just that the code in llama.cpp runs your grammar as part of its token generation decisions, as opposed to pausing and waiting for your own code to pick the next token.
(I'm trying for a very high-level explanation here.)
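Something like this toy sketch, where `allowed()` plays the role of the grammar and `fake_logits()` stands in for the model — it's not llama.cpp's actual GBNF machinery, just the shape of the idea:

```python
import math
import random

VOCAB = ["yes", "no", "maybe", "banana", "</s>"]

def fake_logits(history: list[str]) -> list[float]:
    # Stand-in for the model's raw scores; left to itself it would pick "banana".
    return [0.1, 0.2, 0.3, 5.0, 0.5]

def allowed(history: list[str]) -> set[str]:
    # A one-rule "grammar": the output must be exactly "yes" or "no", then end.
    return {"yes", "no"} if not history else {"</s>"}

def sample_constrained(history: list[str]) -> str:
    logits = fake_logits(history)
    legal = allowed(history)
    # The constraint is applied inside the sampling step: illegal tokens get
    # zero probability no matter how high the model scored them.
    weights = [math.exp(l) if tok in legal else 0.0 for tok, l in zip(VOCAB, logits)]
    return random.choices(VOCAB, weights=weights)[0]

history: list[str] = []
while True:
    tok = sample_constrained(history)
    if tok == "</s>":
        break
    history.append(tok)

print(history)  # always ['yes'] or ['no'], never 'banana'
```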
That's true, and one can bias logits in llama.cpp and friends too, but those are global biases that affect the entire output rather than being specified per-token. Uploading a grammar or a wasm binary to the inference engine does seem more expressive.
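For contrast, a global logit bias looks roughly like this — again a toy, with `fake_logits` standing in for the model; the real knob is something like llama.cpp's `--logit-bias` option or the `logit_bias` parameter in the OpenAI API:

```python
import math
import random

VOCAB = ["yes", "no", "maybe", "banana", "</s>"]

# One fixed offset per token, applied identically at every generation step.
GLOBAL_BIAS = {"banana": -100.0}

def fake_logits(history: list[str]) -> list[float]:
    # Stand-in for the model's raw scores at some step.
    return [0.1, 0.2, 0.3, 5.0, 0.5]

def sample_with_global_bias(history: list[str]) -> str:
    logits = fake_logits(history)
    biased = [l + GLOBAL_BIAS.get(tok, 0.0) for tok, l in zip(VOCAB, logits)]
    weights = [math.exp(l) for l in biased]
    return random.choices(VOCAB, weights=weights)[0]

# The bias can suppress or boost tokens everywhere, but it is the same at
# step 1 and step N; it can't express "only these tokens are legal at this
# point in the output", which is what a grammar (or a wasm hook) can.
print(sample_with_global_bias([]))
```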