
Update: I tried it out. It took about 8 seconds per token and didn't seem to be using much of my GPU, but it was using a lot of RAM. Not a model that I could use practically on my machine.
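For anyone who wants to reproduce the timing, here's a rough sketch using Ollama's local REST API (it assumes Ollama is serving on its default port 11434 and the model has already been pulled; the prompt is just a placeholder):

  import requests

  # Request a completion without streaming; the final response
  # includes timing fields we can turn into tokens/sec.
  resp = requests.post(
      "http://localhost:11434/api/generate",
      json={"model": "gpt-oss:20b", "prompt": "Why is the sky blue?", "stream": False},
      timeout=600,
  )
  stats = resp.json()

  # eval_count = generated tokens; eval_duration = nanoseconds spent generating.
  print(stats["eval_count"] / (stats["eval_duration"] / 1e9), "tokens/sec")

(ollama run gpt-oss:20b --verbose prints similar eval-rate stats directly in the terminal.)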


Did you run it the best way possible? I'm no expert, but I understand that the format/engine used can greatly affect inference time.
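One concrete thing worth checking is which format and quantization the Ollama build actually ships; a rough sketch against Ollama's /api/show endpoint (the request field was "name" in older Ollama versions, and the model tag here matches the one used downthread):

  import requests

  # Ask Ollama for the model's metadata; the "details" block reports
  # the file format, family, and quantization level the runner uses.
  info = requests.post(
      "http://localhost:11434/api/show",
      json={"model": "gpt-oss:20b"},
  ).json()
  print(info["details"])  # e.g. format, family, quantization_level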


I ran it via Ollama, which I assumed would use the best available backend. Screenshot in my post here: https://bsky.app/profile/pamelafox.bsky.social/post/3lvobol3...

I'm still wondering why my GPU usage was so low... maybe Ollama isn't optimized for running it yet?
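If anyone wants to check the same thing on their machine, Ollama's /api/ps endpoint reports how much of a loaded model made it into GPU memory; a rough sketch (it assumes the model is currently loaded and Ollama is on its default port):

  import requests

  # List currently loaded models; comparing size_vram to size shows
  # what fraction of the model Ollama offloaded to the GPU.
  for m in requests.get("http://localhost:11434/api/ps").json()["models"]:
      frac = m["size_vram"] / m["size"] if m["size"] else 0.0
      print(m["name"], f"{100 * frac:.0f}% in VRAM")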


Might need to wait for MLX support.
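If/when an MLX conversion appears, the mlx-lm package is the usual way to try it; a minimal sketch (the Hugging Face repo name below is hypothetical; substitute whichever conversion actually lands on the mlx-community org):

  from mlx_lm import load, generate

  # Load an MLX-format model from the Hugging Face hub (hypothetical
  # repo name) and generate a short completion on the Apple Silicon GPU.
  model, tokenizer = load("mlx-community/gpt-oss-20b-4bit")
  print(generate(model, tokenizer, prompt="Why is the sky blue?", max_tokens=100))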


To clarify, this was the 20B model?


Yep, the 20B model, via Ollama:

  ollama run gpt-oss:20b

Screenshot here with Ollama running and asitop in another terminal:

https://bsky.app/profile/pamelafox.bsky.social/post/3lvobol3...



