I don't think this is correct; previously you could already control output by reading tokens one at a time from the LLM until you hit a stop character.
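Roughly this pattern, as a toy sketch — `fake_model` and the stop-character check are hypothetical stand-ins, not any real engine's API:

```python
from typing import Iterator

def fake_model(prompt: str) -> Iterator[str]:
    # Stand-in for an LLM that streams sampled tokens one at a time.
    for tok in ['{"name"', ':', ' "Bob"', '}', '\n', 'and some', ' trailing', ' text']:
        yield tok

def generate_until(prompt: str, stop_chars: set[str]) -> str:
    out = []
    for tok in fake_model(prompt):
        # Pause after every token and decide, in our own code, whether to keep going.
        if any(c in tok for c in stop_chars):
            break
        out.append(tok)
    return "".join(out)

print(generate_until("Return a JSON object:", stop_chars={"\n"}))
# -> {"name": "Bob"}
```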
My take from the grammar-based sampling PR is that you ask llama.cpp to constrain the next output token to a restricted set of possible tokens, using the grammar.
Right, which is the same idea - it's just that the code in llama.cpp runs your grammar as part of its token generation decisions, as opposed to pausing and waiting for your own code to pick the next token.
(I'm trying for a very high-level explanation here.)
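Something like this toy sketch, where `allowed()` plays the role of the grammar and `fake_logits()` stands in for the model — it's not llama.cpp's actual GBNF machinery, just the shape of the idea:

```python
import math
import random

VOCAB = ["yes", "no", "maybe", "banana", "</s>"]

def fake_logits(history: list[str]) -> list[float]:
    # Stand-in for the model's raw scores; left to itself it would pick "banana".
    return [0.1, 0.2, 0.3, 5.0, 0.5]

def allowed(history: list[str]) -> set[str]:
    # A one-rule "grammar": the output must be exactly "yes" or "no", then end.
    return {"yes", "no"} if not history else {"</s>"}

def sample_constrained(history: list[str]) -> str:
    logits = fake_logits(history)
    legal = allowed(history)
    # The constraint is applied inside the sampling step: illegal tokens get
    # zero probability no matter how high the model scored them.
    weights = [math.exp(l) if tok in legal else 0.0 for tok, l in zip(VOCAB, logits)]
    return random.choices(VOCAB, weights=weights)[0]

history: list[str] = []
while True:
    tok = sample_constrained(history)
    if tok == "</s>":
        break
    history.append(tok)

print(history)  # always ['yes'] or ['no'], never 'banana'
```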
That's true, and one can bias logits in llama.cpp and friends too, but those are global biases that affect the entire output rather than being specified per-token. Uploading a grammar or a wasm binary to the inference engine does seem more expressive.
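For contrast, a global logit bias looks roughly like this — again a toy, with `fake_logits` standing in for the model; the real knob is something like llama.cpp's `--logit-bias` option or the `logit_bias` parameter in the OpenAI API:

```python
import math
import random

VOCAB = ["yes", "no", "maybe", "banana", "</s>"]

# One fixed offset per token, applied identically at every generation step.
GLOBAL_BIAS = {"banana": -100.0}

def fake_logits(history: list[str]) -> list[float]:
    # Stand-in for the model's raw scores at some step.
    return [0.1, 0.2, 0.3, 5.0, 0.5]

def sample_with_global_bias(history: list[str]) -> str:
    logits = fake_logits(history)
    biased = [l + GLOBAL_BIAS.get(tok, 0.0) for tok, l in zip(VOCAB, logits)]
    weights = [math.exp(l) for l in biased]
    return random.choices(VOCAB, weights=weights)[0]

# The bias can suppress or boost tokens everywhere, but it is the same at
# step 1 and step N; it can't express "only these tokens are legal at this
# point in the output", which is what a grammar (or a wasm hook) can.
print(sample_with_global_bias([]))
```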