What is quantization?
Quantization compresses a model's weights so it fits in less memory. Q8 is close to lossless, Q4 is the usual sweet spot, and Q2 saves the most space at a noticeable cost in quality. Lower numbers mean a smaller file, less memory, and often faster inference, but more accuracy loss.
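To get a feel for the tradeoff, here's a quick sketch (the model and tag names below are illustrative; check the model's page on ollama.com for real tags):

```
# Same model, two quant levels; tag names vary per model
ollama pull llama3.1:8b-instruct-q8_0
ollama pull llama3.1:8b-instruct-q4_K_M

# Compare on-disk sizes; rule of thumb: size ≈ params × bits ÷ 8,
# so an 8B model is roughly 8 GB at Q8 and 4-5 GB at Q4
ollama list
```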
Ollama vs llama.cpp: which should I use?
Ollama is easier: one command to install and run. llama.cpp gives you more control over VRAM, threads, and context size. Start with Ollama, switch to llama.cpp if you need to tune performance.
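To make the contrast concrete, here's a minimal sketch of both (the model name and file path are placeholders):

```
# Ollama: one command, sensible defaults
ollama run qwen3-coder

# llama.cpp: the same idea, with explicit knobs
# -m: path to a GGUF file, -ngl: layers offloaded to the GPU,
# -c: context window in tokens, -t: CPU threads
./llama-cli -m ./models/model.gguf -ngl 32 -c 8192 -t 8 -p "Hello"
```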
How do I check my VRAM or memory?
Mac: Apple menu → About This Mac → Memory. NVIDIA: run nvidia-smi in a terminal. Windows: Task Manager → Performance → GPU.
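If you'd rather use the command line, these built-in commands report the same numbers on Mac and NVIDIA setups (the Windows step stays in Task Manager):

```
# macOS: total unified memory, in bytes
sysctl hw.memsize

# NVIDIA: total and used VRAM per GPU
nvidia-smi --query-gpu=memory.total,memory.used --format=csv
```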
What is MoE (Mixture of Experts)?
Qwen3-Coder's largest variant uses MoE: a 480B-parameter model where only about 35B parameters are active per token. All 480B weights still have to fit in memory, but each token only runs compute through the active experts, so it generates at roughly the speed of a 35B dense model.
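A back-of-the-envelope check of why total (not active) parameters set the memory floor, assuming roughly 4.5 bits per weight for a Q4-style quant with overhead included:

```
# file size ≈ total params × bits-per-weight ÷ 8
# 480B params at ~4.5 bits/weight:
echo "480 * 4.5 / 8" | bc -l   # ≈ 270 (GB): all experts must be resident
```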
Can I use this with VS Code or Cursor?
Yes. Once the model is running via Ollama, install the Continue extension in VS Code or Cursor and point it at localhost:11434 (Ollama's default port). Pro users get the full IDE setup guide.
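Before configuring the extension, you can confirm Ollama is actually listening on that port; this uses Ollama's standard HTTP API:

```
# Lists the models Ollama is serving; any JSON response means
# the endpoint Continue needs is reachable
curl http://localhost:11434/api/tags
```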