I've got an orca-3b GGML model (via koboldcpp) running on an RPi 4, and it sucks. It takes a few minutes just to process the prompt, and then output comes at about 1 token per second.

...which is usually crap (because it's only 3b) and needs to be regenerated anyway. It's not a viable solution for any generative use case. Mechanical Turk is faster and more reliable.

There are smaller models I could try, but 7b is already about the smallest I have the patience for. YMMV.