What are the lowest specs that are reasonably possible for running local LLMs? Do you recommend any distro in particular?

SocialistVibes01@lemmy.ml · edit-2 23 days ago

What are the lowest specs that are reasonably possible for running local LLMs? Do you recommend any distro in particular?

thingsiplay@lemmy.ml · 23 days ago

I run local LLMs with 8 GB of VRAM and 32 GB of system RAM, thanks to Vulkan support. My GPU is an RX 7600. For example, I can run qwen/qwen3.6-35B-A3B-Q4_K_M.gguf and gemma-4-26B-A4B-it-Q4_K_M.gguf. The model fills the GPU's VRAM first and the remainder goes to system RAM, which is slower, but at least bigger models fit and run. I just need to lower the context length, which has a big impact on memory use (my current custom value is 64k, for anyone who wants to know).

But this is still highly limited and not competitive at all. I mostly play around with it and occasionally ask a question here and there, and that’s it. So if you are serious about your system, you need something faster and with more than just 8 GB of VRAM.
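
To make the point about context length concrete, here is a rough sketch of how the KV cache grows with context in a plain transformer with grouped-query attention. The layer count, KV head count and head size below are made-up illustrative values, not the real configuration of the models mentioned above, and models with other attention designs or a quantized KV cache will behave differently.

```python
# Rough KV cache size estimate for a standard transformer decoder.
# All model shape values here are HYPOTHETICAL, chosen only for illustration.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """Bytes needed to cache keys and values for ctx_len tokens.

    The leading 2 accounts for storing both K and V; bytes_per_elem=2
    assumes an unquantized 16-bit cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem


def to_gib(n_bytes):
    """Convert bytes to GiB (2**30 bytes)."""
    return n_bytes / 2**30


# Hypothetical model shape (assumption, not taken from any real model card).
N_LAYERS = 48
N_KV_HEADS = 4
HEAD_DIM = 128

if __name__ == "__main__":
    for ctx in (8 * 1024, 64 * 1024, 256 * 1024):
        size = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, ctx)
        print(f"context {ctx:>7} tokens: about {to_gib(size):.2f} GiB of KV cache")
```

The cache scales linearly with context length, so under these assumed values dropping from 256k to 64k tokens cuts it to a quarter, which is memory that can go to model weights instead.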

Domi@lemmy.secnd.me · 22 days ago

As a side note, Qwen3.6-27B is much more capable than Qwen3.6-35B, even though it is much slower.

https://huggingface.co/unsloth/Qwen3.6-27B-GGUF

For coding tasks where you don’t mind waiting, you should be able to barely squeeze in the 8-bit quantized version with 32 GB RAM + 8 GB VRAM and have a pretty competent local model. 4-bit quants work but they have issues with complex tool calls.
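
As a back-of-the-envelope check on "barely squeeze in", the weight size can be estimated from parameter count times bits per weight. This is only a sketch: Q8_0 stores 8.5 bits per weight, the Q4_K_M figure is an approximate average I am assuming, and real GGUF files differ somewhat because not every tensor uses the same quantization.

```python
# Back-of-the-envelope estimate of quantized weight size vs. available memory.
# Bits-per-weight figures: Q8_0 is 8.5 bpw; the Q4_K_M value is an
# APPROXIMATE assumption and varies by model.

BITS_PER_WEIGHT = {
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,  # approximate, assumed for illustration
}


def weights_gb(n_params_billion, bits_per_weight):
    """Approximate weight size in GB (10**9 bytes)."""
    return n_params_billion * bits_per_weight / 8


def headroom_gb(n_params_billion, quant, ram_gb, vram_gb):
    """Memory left for the OS, KV cache and buffers after loading weights."""
    return (ram_gb + vram_gb) - weights_gb(n_params_billion, BITS_PER_WEIGHT[quant])


if __name__ == "__main__":
    # 27B parameters on 32 GB RAM + 8 GB VRAM, as in the comment above.
    for quant in BITS_PER_WEIGHT:
        size = weights_gb(27, BITS_PER_WEIGHT[quant])
        left = headroom_gb(27, quant, ram_gb=32, vram_gb=8)
        print(f"{quant}: about {size:.1f} GB of weights, about {left:.1f} GB left over")
```

The leftover has to cover the operating system, the KV cache and everything else running on the machine, which is why the 8-bit version is a tight fit.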

If you use the MTP branch of llama.cpp (and a suitable model) you can even double or triple your token generation speed: https://github.com/ggml-org/llama.cpp/pull/22673

For easier tasks, disable reasoning for instant responses.