I'd love to see some real numbers showing how these different decisions impact performance and code size. I suspect the branch cost is pretty minimal: so few of my smallvecs ever get spilled that the branch predictor probably does a pretty good job here.
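
To make it concrete which branch I mean, here's a rough sketch. This is not the real smallvec implementation: `TinyVec`, its variants, and its methods are all made up for illustration, and I've restricted it to `Copy + Default` items to avoid unsafe code. The point is just that every access has to check "inline or spilled?", and if nearly everything stays inline, that check almost always goes the same way.

```rust
/// Hypothetical small vector: stores up to N items inline and
/// moves to a heap-allocated Vec once that buffer is full.
pub enum TinyVec<T: Copy + Default, const N: usize> {
    Inline { len: usize, buf: [T; N] },
    Spilled(Vec<T>),
}

impl<T: Copy + Default, const N: usize> TinyVec<T, N> {
    pub fn new() -> Self {
        TinyVec::Inline { len: 0, buf: [T::default(); N] }
    }

    pub fn push(&mut self, item: T) {
        match self {
            // Common case: there is still room in the inline buffer.
            TinyVec::Inline { len, buf } if *len < N => {
                buf[*len] = item;
                *len += 1;
            }
            // Rare case: inline buffer is full, so copy it to the heap.
            TinyVec::Inline { len, buf } => {
                let mut spilled = Vec::with_capacity(N * 2 + 1);
                spilled.extend_from_slice(&buf[..*len]);
                spilled.push(item);
                *self = TinyVec::Spilled(spilled);
            }
            // Already on the heap: behave like a plain Vec.
            TinyVec::Spilled(v) => v.push(item),
        }
    }

    /// Every read goes through this match. That is the branch in question.
    pub fn as_slice(&self) -> &[T] {
        match self {
            TinyVec::Inline { len, buf } => &buf[..*len],
            TinyVec::Spilled(v) => v.as_slice(),
        }
    }

    pub fn is_spilled(&self) -> bool {
        matches!(self, TinyVec::Spilled(_))
    }
}
```

A `TinyVec<u32, 4>` stays inline for its first four pushes and spills on the fifth. Whether the match in `as_slice` actually costs anything measurable is exactly the thing I'd want benchmarked.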

And there are often fiercely diminishing returns from optimizing allocations. Dropping the number of allocations from 1M to 1k made a massive performance difference. Dropping it from 1k to 1 will probably be under the benchmark noise floor.