Hacker News: mongrelion's comments

It's clear to me that the maintainer is referring to "shushtain" and that type of people.

> when they take that tone with you.

This makes it sound as if you took it personally?


Having a bad day does not entitle you to take it out on others

Empathy goes both ways. You can recognize that they're being unfair while still appreciating their reasons for being unfair.

People seem to have this notion that there's some theoretical possible world where everything is completely moral, and we're just failing to get there. But that is not true. You get locally moral and globally moral arrangements, and they're not necessarily going to mesh. It's just like any other large system.

The guy can be justified from his perspective, and people can be justified in distancing themselves from him. That's life. Having a reason for something is just the bare minimum, not the endgame.


That's why I said it's not really an excuse?

You should totally post this on the original thread just for adjustment :-)

The project is archived, you can't.

Not the answer that you are looking for, but I am a fellow AMD GPU owner, so I want to share my experience.

I have a 9070 XT, which has 16GB of VRAM. My understanding from reading around a bunch of forums is that the smallest quant you want to go with is Q4. Below that, the compression starts hurting the results quite a lot, especially for agentic coding. The model might eventually start missing brackets, quotes, etc.

I tried various AI + VRAM calculators, but nothing was as on point as Hugging Face's built-in functionality. You simply sign up and configure in the settings [1] which GPU you have, so that when you visit a model page, you immediately see which of the quants fit in your card.

Of the open-source models out there, Qwen3.5 is the best right now. unsloth produces nice quants for it and even provides guidelines [2] on how to run them locally.

The 6-bit version of Qwen3.5 9B would fit nicely in your 6700 XT, but at 9B parameters, it probably isn't as smart as you'd expect.
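If it helps, here's the back-of-the-envelope arithmetic behind "a 6-bit 9B fits": weight size is roughly parameters × bits-per-weight / 8. This is a rough sketch only; the ~4.5 bits/weight figure for Q4_K_M is an approximation, and real GGUF files also carry KV cache and runtime overhead on top.

```python
# Rough rule-of-thumb size for a quantized model's weights:
# bytes ≈ parameters * bits_per_weight / 8. Illustrative only --
# not exact GGUF file sizes, and KV cache / buffers come on top.

def model_gib(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight size in GiB for a model of
    `params_b` billion parameters at the given quant width."""
    return params_b * 1e9 * bits_per_weight / 8 / 2**30

# 9B model at ~6 bits/weight (roughly a Q6 quant):
q6_9b = model_gib(9, 6)      # ≈ 6.3 GiB, so it fits a 12 GiB card
# 9B at ~4.5 bits/weight (roughly Q4_K_M):
q4_9b = model_gib(9, 4.5)    # ≈ 4.7 GiB

print(f"9B @ ~6 bpw ≈ {q6_9b:.1f} GiB, @ ~4.5 bpw ≈ {q4_9b:.1f} GiB")
```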

Which model have you tried locally? Also, out of curiosity, what is your host configuration?

[1]: https://huggingface.co/settings/local-apps [2]: https://unsloth.ai/docs/models/qwen3.5


For autocomplete, Qwen 3.5 9B should be enough even at Q4_k_m. The upcoming coding/math Omnicoder-2 finetune might be useful (should be released in a few days).

Either that, or just load up Qwen3.5-35B-A3B-Q4_K_S. I'm serving it at about 40-50 t/s on an RTX 4070 Super 12GB + 64GB of RAM. The weights are 20.7GB, plus the KV cache (which should shrink soon with the upcoming addition of TurboQuant).
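For anyone wondering why the KV cache matters here: its size is roughly 2 tensors (K and V) × layers × KV heads × head dim × context length × bytes per element. A minimal sketch, with made-up placeholder numbers (not the actual Qwen3.5-35B-A3B config):

```python
# Back-of-the-envelope KV-cache size. The layer/head numbers
# below are illustrative placeholders, not a real model config.

def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx_len: int, bytes_per_el: int = 2) -> float:
    """KV-cache size in GiB: K and V tensors for every layer,
    each (n_kv_heads * head_dim) wide per token of context."""
    total = 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_el
    return total / 2**30

# e.g. 48 layers, 8 KV heads of dim 128, 32k context at fp16:
print(f"{kv_cache_gib(48, 8, 128, 32768):.1f} GiB")  # 6.0 GiB
```

Halving bytes_per_el (quantizing the cache to 8-bit, which is presumably the kind of thing a KV-cache quantization feature would do) halves that figure, which is why people are excited about it on 12GB cards.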


I am definitely looking forward to TurboQuant. Makes me feel like my current setup is an investment that could pay over time. Imagine being able to run models like MiniMax M2.5 locally at Q4 levels. That would be swell.

I don't remember exact models, but I tried whatever was available in Ollama. I remember using some really low parameter version of llama

What is this 10€ per month subscription that you are talking about?


How is the speed and stability?

These small Chinese companies don't always have access to serious hardware.


I’ve never had any problems with MiniMax. I wouldn’t call the speed fast exactly, but it’s faster than GLM and seems similar to Opus.

It’s been fast enough that I’ve been using it as my main model (M2.7 and before that, M2.5). Opus still does better at tasks, but MiniMax is so much cheaper. I’ve used their cheaper plan and I’ve never been rate limited.


At what temperature did you run it and what was your context limit?


I don't understand why I'm getting downvoted.

I am legitimately curious about the parameters the person used for running the model locally, because I'm currently experimenting with running models locally myself. You can see I'm asking similar questions of others in this same thread; correlate the timestamps.


Apparently there is a whole science behind running models. I have seen the instructions that unsloth publishes for their quants, and depending on the model they'll tweak things like the temperature, top-k, etc.

The size of the quantization you choose also makes a difference.

The GPU driver also plays an important role.

What was your approach? What software did you use to run the models?


What front-end framework did you use? I find the UI so visually appealing.


FWIW, while I find it appealing, I also strongly associate it with "vibe coded webapp of dubious quality," so personally I'm not gonna try to replicate it myself.


Thanks. I actually used Google AI Studio for this. I prompted it with my color choices and let it do the rest; it turned out pretty good.


Which quantization are you running, and at what context size? 32 tok/s for that model on that card sounds pretty good to me!


It might be that the system prompt sent by Codex is not optimal for that model. Try with opencode and see if your results improve.

