Hacker News: mongrelion's comments

It's clear to me that the maintainer is referring to "shushtain" and that type of people.

> when they take that tone with you.

This makes it sound as if you took it personally?


Having a bad day does not entitle you to take it out on others

Empathy goes both ways. You can recognize that they're being unfair while still appreciating their reasons for being unfair.

People seem to have this notion that there's some theoretical possible world where everything is completely moral, and we're just failing to get there. But that is not true. You get locally moral and globally moral arrangements, and they're not necessarily going to mesh. It's just like any other large system.

The guy can be justified from his perspective, and people can be justified in distancing themselves from him. That's life. Having a reason for something is just the bare minimum, not the endgame.


That's why I said it's not really an excuse?

You should totally post this on the original thread just for adjustment :-)

The project is archived, you can't.

Not the answer that you are looking for, but I am a fellow AMD GPU owner, so I want to share my experience.

I have a 9070 XT, which has 16GB of VRAM. My understanding from reading around a bunch of forums is that the smallest quant you want to go with is Q4. Below that, the compression starts hurting the results quite a lot, especially for agentic coding. The model might eventually start missing brackets, quotes, etc.

I tried various AI + VRAM calculators, but nothing was as on point as Hugging Face's built-in functionality. You simply sign up and configure in the settings [1] which GPU you have, so that when you visit a model page, you immediately see which of the quants fit in your card.

Of the open-source models out there, Qwen3.5 is the best right now. unsloth produces nice quants for it and even provides guidelines [2] on how to run them locally.

The 6-bit version of Qwen3.5 9B would fit nicely in your 6700 XT, but at 9B parameters, it probably isn't as smart as you'd expect.
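If it helps, here's the back-of-the-envelope arithmetic behind "a 6-bit 9B fits": weight size is roughly parameters × bits-per-weight / 8. This is a rough sketch only; the ~4.5 bits/weight figure for Q4_K_M is an approximation, and real GGUF files also carry KV cache and runtime overhead on top.

```python
# Rough rule-of-thumb size for a quantized model's weights:
# bytes ≈ parameters * bits_per_weight / 8. Illustrative only --
# not exact GGUF file sizes, and KV cache / buffers come on top.

def model_gib(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight size in GiB for a model of
    `params_b` billion parameters at the given quant width."""
    return params_b * 1e9 * bits_per_weight / 8 / 2**30

# 9B model at ~6 bits/weight (roughly a Q6 quant):
q6_9b = model_gib(9, 6)      # ≈ 6.3 GiB, so it fits a 12 GiB card
# 9B at ~4.5 bits/weight (roughly Q4_K_M):
q4_9b = model_gib(9, 4.5)    # ≈ 4.7 GiB

print(f"9B @ ~6 bpw ≈ {q6_9b:.1f} GiB, @ ~4.5 bpw ≈ {q4_9b:.1f} GiB")
```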

Which model have you tried locally? Also, out of curiosity, what is your host configuration?

[1]: https://huggingface.co/settings/local-apps [2]: https://unsloth.ai/docs/models/qwen3.5


For autocomplete, Qwen 3.5 9B should be enough even at Q4_k_m. The upcoming coding/math Omnicoder-2 finetune might be useful (should be released in a few days).

Either that, or just load up Qwen3.5-35B-A3B-Q4_K_S. I'm serving it at about 40-50 t/s on an RTX 4070 Super 12GB + 64GB of RAM. The weights are 20.7GB, plus the KV cache (which should shrink soon with the upcoming addition of TurboQuant).
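For anyone wondering why the KV cache matters here: its size is roughly 2 tensors (K and V) × layers × KV heads × head dim × context length × bytes per element. A minimal sketch, with made-up placeholder numbers (not the actual Qwen3.5-35B-A3B config):

```python
# Back-of-the-envelope KV-cache size. The layer/head numbers
# below are illustrative placeholders, not a real model config.

def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx_len: int, bytes_per_el: int = 2) -> float:
    """KV-cache size in GiB: K and V tensors for every layer,
    each (n_kv_heads * head_dim) wide per token of context."""
    total = 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_el
    return total / 2**30

# e.g. 48 layers, 8 KV heads of dim 128, 32k context at fp16:
print(f"{kv_cache_gib(48, 8, 128, 32768):.1f} GiB")  # 6.0 GiB
```

Halving bytes_per_el (quantizing the cache to 8-bit, which is presumably the kind of thing a KV-cache quantization feature would do) halves that figure, which is why people are excited about it on 12GB cards.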


I am definitely looking forward to TurboQuant. Makes me feel like my current setup is an investment that could pay over time. Imagine being able to run models like MiniMax M2.5 locally at Q4 levels. That would be swell.

I don't remember exact models, but I tried whatever was available in Ollama. I remember using some really low parameter version of llama

What is this 10€ per month subscription that you are talking about?


How is the speed and stability?

These small Chinese companies don't always have access to serious hardware.


I’ve never had any problems with MiniMax. I wouldn’t call the speed fast exactly, but it’s faster than GLM and seems similar to Opus.

It’s been fast enough that I’ve been using it as my main model (M2.7 and before that, M2.5). Opus still does better at tasks, but MiniMax is so much cheaper. I’ve used their cheaper plan and I’ve never been rate limited.


At what temperature did you run it and what was your context limit?


I don't understand why I'm getting downvoted.

I am legitimately curious about the parameters the person used for running the model locally, because I'm currently experimenting with running models locally myself. You can see I'm asking similar questions of others in this same thread; correlate the timestamps.


Apparently there is a whole science behind running models. I have seen the instructions that unsloth publishes for their quants, and depending on the model they'll tweak things like the temperature, top-k, etc.

The size of the quantization you choose also makes a difference.

The GPU driver also plays an important role.

What was your approach? What software did you use to run the models?


What front-end framework did you use? I find the UI so visually appealing.


FWIW, while I find it appealing, I also strongly associate it with "vibe coded webapp of dubious quality," so personally I'm not gonna try to replicate it myself.


Thanks. I actually used Google AI Studio for this. I prompted it with my color choices and let it do the rest; it turned out pretty good.


Which quantization are you running, and at what context size? 32 tok/s for that model on that card sounds pretty good to me!


It might be that the system prompt sent by Codex is not optimal for that model. Try with opencode and see if your results improve.

