Something to handle code, text and math.

  • thingsiplay@lemmy.ml · 23 days ago

    I run local LLMs with 8 GB of VRAM and 32 GB of system RAM, thanks to Vulkan support. My GPU is an RX 7600. I can run qwen/qwen3.6-35B-A3B-Q4_K_M.gguf and gemma-4-26B-A4B-it-Q4_K_M.gguf, for example. The model fills the GPU first and whatever doesn't fit goes into system RAM, which is slower, but at least bigger models fit and run. I just need to lower the context length, which has a big impact (my current custom value is 64k, for anyone who wants to know).
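    To give a rough feel for why context length matters so much, here is a back-of-the-envelope sketch of KV cache size. It assumes every layer is plain attention with a 16-bit cache, and the layer, head and dimension numbers are made up for illustration, not taken from any of the models above. Real models (and quantized caches) can come out quite different.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Approximate KV cache size for a plain attention transformer.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes a 16-bit cache (an assumption, not a given).
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


# Hypothetical architecture numbers, chosen only to illustrate the scaling.
example = dict(n_layers=48, n_kv_heads=8, head_dim=128)

for ctx in (16_384, 65_536, 262_144):
    gib = kv_cache_bytes(context_len=ctx, **example) / 2**30
    print(f"{ctx:>7} tokens -> {gib:.1f} GiB")
```

    The cache grows linearly with context length, so in this toy setup going from 64k to 256k tokens quadruples the memory needed on top of the model weights.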

    But this is still highly limited and not competitive at all. I mostly play around with it and occasionally ask a question here or there, and that’s it. So if you are serious about running models locally, you need something faster and with more than just 8 GB of VRAM.

    • Domi@lemmy.secnd.me · 22 days ago

      As a side note, Qwen3.6-27B is much more capable than Qwen3.6-35B, even though it is much slower.

      https://huggingface.co/unsloth/Qwen3.6-27B-GGUF

      For coding tasks where you don’t mind waiting, you should be able to barely squeeze the 8-bit quantized version into 32 GB RAM + 8 GB VRAM and have a pretty competent local model. 4-bit quants work, but they have issues with complex tool calls.
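      The "barely squeeze in" part can be sanity-checked with simple arithmetic. This is only a rough sketch: it counts parameters times bits per weight and ignores the per-block scale overhead in real GGUF quants, the KV cache, and whatever the OS and other programs need, so actual headroom is smaller than it prints.

```python
def weights_gb(params_billion, bits_per_weight):
    """Rough weight size in GB: parameters * bits per weight / 8 bits per byte."""
    return params_billion * bits_per_weight / 8


# Combined memory from the setup discussed above: 32 GB RAM + 8 GB VRAM.
budget_gb = 32 + 8

for label, bits in (("8-bit", 8.0), ("4-bit", 4.0)):
    size = weights_gb(27, bits)
    print(f"27B at {label}: ~{size:.1f} GB weights, ~{budget_gb - size:.1f} GB left")
```

      At 8 bits the weights alone take roughly two thirds of the combined memory, which is why it only just fits once context and the rest of the system are accounted for.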

      If you use the MTP branch of llama.cpp (and a suitable model) you can even double or triple your token generation speed: https://github.com/ggml-org/llama.cpp/pull/22673

      For easier tasks, disable reasoning for instant responses.