
GGUF Quantization Explained (Q4 vs Q8 vs FP16): What Runs on a Laptop

If you have ever downloaded a local LLM and seen files named Q4_K_M, Q5_K_S, Q8_0 or FP16, you have met quantization, the single most important concept for running AI models on a used laptop. Quantization is what lets a model that takes 13 GB at full precision shrink to a file of roughly 4 GB with only a small loss in quality. Understanding it tells you which models your laptop can run before you download a 10 GB file that won’t load.

This guide explains GGUF quantization in plain terms, shows the quality-vs-memory trade-off, and gives you a table of which quant fits 4, 6, 8 and 16 GB of VRAM.

What quantization actually does

A neural network is billions of numbers (weights). Models are usually trained and published with each weight stored as a 16-bit number: FP16 or BF16, “full precision” for our purposes. A 7-billion-parameter model in FP16 is therefore 7 billion weights × 2 bytes = 14 GB, or about 13 GiB in the binary units many tools report.
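The 13-versus-14 figure comes down to units. Here is a minimal sketch of the arithmetic, assuming exactly 7 billion parameters (real "7B" models vary a little) and ignoring file metadata. The function name is made up for illustration.

```python
def fp16_size_bytes(n_params: int, bytes_per_weight: int = 2) -> int:
    """Raw weight storage: one fixed-width number per parameter."""
    return n_params * bytes_per_weight


size = fp16_size_bytes(7_000_000_000)
size_gb = size / 1e9       # decimal gigabytes (10^9 bytes)
size_gib = size / 2**30    # binary gibibytes (2^30 bytes)

print(f"{size_gb:.1f} GB = {size_gib:.1f} GiB")  # prints "14.0 GB = 13.0 GiB"
```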

Quantization stores each weight using fewer bits — 8, 5, 4, even 2 — by mapping the range of values onto a smaller grid. The trade is simple:

  • Fewer bits → smaller file, less VRAM/RAM, faster loading.
  • Fewer bits → slightly less precise weights → marginally lower output quality.
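To make "mapping onto a smaller grid" concrete, here is a toy sketch of symmetric round-to-nearest quantization over one block of weights. It only illustrates the idea: the real llama.cpp formats use their own block layouts and scale encodings, and the block size of 32 and the random test weights are assumptions chosen for the demo.

```python
import random


def quantize_block(weights, bits):
    """Map floats onto a signed integer grid that shares one scale per block."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid a zero scale
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate weights from the integer codes."""
    return [scale * c for c in codes]


random.seed(0)
block = [random.gauss(0.0, 0.02) for _ in range(32)]  # stand-in weights

for bits in (8, 4, 2):
    scale, codes = quantize_block(block, bits)
    restored = dequantize_block(scale, codes)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(f"{bits}-bit: grid step {scale:.5f}, worst error {worst:.5f}")
```

The rounding error of any single weight is at most half a grid step, and the grid step grows as the bit count falls. That is the precision being traded for size.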

The remarkable finding of the last few years is that large language models are extraordinarily tolerant of this. Dropping from 16 bits to 4 bits shrinks a model by ~70% while keeping the great majority of its quality for chat, coding and summarisation. That is the whole reason local AI on modest hardware is viable at all — and why VRAM is the spec that matters most when buying a laptop.

GGUF and the K-quants

GGUF is the file format used by llama.cpp and Ollama — the two tools most people run local models with. A GGUF file bundles the quantised weights plus metadata so any compatible runtime can load it. The naming looks cryptic but decodes cleanly:

  • The number is the bits per weight: Q4 ≈ 4-bit, Q5 ≈ 5-bit, Q8 ≈ 8-bit.
  • _K means a K-quant — a smarter scheme that varies precision across the model, protecting the most sensitive weights.
  • _S, _M, _L are small / medium / large variants of the K-quant: Q4_K_M keeps a bit more precision than Q4_K_S.
  • Q8_0 and Q4_0 are older “legacy” quants without the K-scheme. Q4_0 is generally superseded by the K-quants at a similar size; Q8_0 remains the usual choice at 8 bits, where there is no K-quant alternative.

For almost everyone, Q4_K_M is the default choice: the best balance of size, speed and quality. Step up to Q5_K_M or Q6_K if you have spare memory; reach for Q8_0 only when you want near-lossless output and have the VRAM.
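The naming rules above can be written down as a small parser. This is a sketch that only covers the names used in this guide. Other naming schemes exist and are deliberately left unrecognised, and the function name is hypothetical.

```python
import re

# Matches names such as Q4_K_M, Q6_K and Q8_0; nothing else is handled.
QUANT_NAME = re.compile(
    r"^Q(?P<bits>\d)_(?:(?P<k>K)(?:_(?P<size>[SML]))?|(?P<legacy>0))$"
)

SIZES = {"S": "small", "M": "medium", "L": "large"}


def describe_quant(name: str) -> str:
    """Decode a quant name into a plain-English description."""
    m = QUANT_NAME.match(name.upper())
    if m is None:
        return f"{name}: not a quant name this sketch recognises"
    bits = m["bits"]
    if m["k"]:
        variant = SIZES.get(m["size"], "single")  # Q6_K has no S/M/L suffix
        return f"{name}: ~{bits}-bit K-quant, {variant} variant"
    return f"{name}: ~{bits}-bit legacy quant"


for name in ("Q4_K_M", "Q5_K_S", "Q6_K", "Q8_0", "FP16"):
    print(describe_quant(name))
```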

The quality vs memory trade-off

| Quant | Bits/weight | Size vs FP16 | Quality | When to use |
|---|---|---|---|---|
| FP16/BF16 | 16 | 100% | Reference (full) | Fine-tuning and training only |
| Q8_0 | ~8.5 | ~53% | Near-lossless | Spare memory; maximum fidelity |
| Q6_K | ~6.5 | ~41% | Excellent | High quality with moderate savings |
| Q5_K_M | ~5.5 | ~35% | Very good | A safe step up from Q4 when VRAM allows |
| Q4_K_M | ~4.5 | ~28% | Good (default) | The everyday laptop sweet spot |
| Q3_K_M | ~3.5 | ~22% | Noticeably degraded | Squeezing a bigger model into tight VRAM |
| Q2_K | ~2.6 | ~16% | Poor / last resort | Only to make a model load at all |

The practical takeaway: Q4_K_M to Q5_K_M is the zone you want. Below Q3 the model starts making more mistakes, repeating itself and losing instruction-following. Above Q6 you pay a lot of memory for gains most laptop tasks won’t notice.
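The "size vs FP16" column is just bits per weight divided by 16, and the same figures give a rough file size for any model. Here is a sketch using approximate bits per weight in line with the table. Q8_0 is taken as 8.5, which matches its ~53% size, and the helper name is an assumption for illustration.

```python
# Approximate bits per weight for each quant (rounded, not exact).
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.5,
    "Q5_K_M": 5.5,
    "Q4_K_M": 4.5,
    "Q3_K_M": 3.5,
    "Q2_K": 2.6,
}


def estimate_file_gb(n_params_billion: float, quant: str) -> float:
    """Weights only: billions of parameters x bits per weight / 8, in GB."""
    return n_params_billion * BITS_PER_WEIGHT[quant] / 8


for quant, bpw in BITS_PER_WEIGHT.items():
    size = estimate_file_gb(7, quant)
    print(f"{quant:<7} {size:5.1f} GB  ({bpw / 16:.0%} of FP16)")
```

Treat the result as a lower bound. Real GGUF files tend to come out somewhat larger, because the rounded bits-per-weight figures understate the true average and "7B" models often have slightly more than 7 billion parameters.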

Which quant fits your VRAM?

The rough rule: GGUF file size + 1–2 GB overhead (context, KV cache, runtime) is what you need in VRAM to run a model fully on the GPU. If the model is bigger than your VRAM, Ollama and llama.cpp will offload some layers to system RAM — it still runs, just slower, so plenty of RAM matters too.
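The rule of thumb translates directly into a check. This is a minimal sketch in which the 1–2 GB overhead range comes from the rule above, while the three verdict labels and the function name are illustrative choices rather than anything Ollama or llama.cpp report.

```python
def fits_in_vram(file_gb: float, vram_gb: float,
                 overhead_gb: tuple = (1.0, 2.0)) -> str:
    """Compare file size plus a low/high overhead estimate against VRAM."""
    low, high = overhead_gb
    if file_gb + high <= vram_gb:
        return "fits comfortably"
    if file_gb + low <= vram_gb:
        return "tight: keep the context short"
    return "needs offloading to system RAM"


# A 4.4 GB file (a typical 7B at Q4_K_M) against three VRAM sizes.
for vram in (4, 6, 8):
    print(f"{vram} GB VRAM: {fits_in_vram(4.4, vram)}")
```

Longer contexts grow the KV cache, so the overhead for a long conversation can exceed the 2 GB assumed here.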

| Your VRAM | Comfortable model + quant | Example laptop |
|---|---|---|
| 4 GB | 7B at Q4_K_M (partly offloaded), 3B fully on GPU | ThinkPad X1 Extreme Gen 4 |
| 6 GB | 7B at Q4_K_M fully on GPU; 13B partly offloaded | Lenovo Legion 5 Gen 6 |
| 8 GB | 7B at Q5/Q6; 13B at Q4_K_M (tight, mostly on GPU) | ASUS ROG Zephyrus G14 |
| 16 GB | 13B at Q6/Q8; 34B at Q4 (partly offloaded); fine-tuning | ThinkPad P15 Gen 2 |

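Note how a 13B model at Q4_K_M, which needs heavy offloading on a 6 GB card, runs mostly on the GPU at 8 GB. By the rule above it is a tight fit (a file of roughly 7 GB plus overhead), so keep the context short, but that single step is why 8 GB is such a meaningful VRAM tier. For the model-by-model speed picture, see our Ollama laptop requirements guide.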
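Combining the size estimate with the overhead rule gives a quick way to pick a quant for a given card. The sketch below assumes rounded bits-per-weight figures and a flat 1.5 GB overhead (the midpoint of the 1–2 GB rule), and `best_quant` is a hypothetical helper, not part of any tool.

```python
# Candidate quants from highest to lowest precision, with approximate
# bits per weight (rounded values, so results are estimates).
CANDIDATES = {
    "Q8_0": 8.5,
    "Q6_K": 6.5,
    "Q5_K_M": 5.5,
    "Q4_K_M": 4.5,
    "Q3_K_M": 3.5,
    "Q2_K": 2.6,
}


def best_quant(n_params_billion: float, vram_gb: float,
               overhead_gb: float = 1.5):
    """Return the highest-precision quant that fits fully in VRAM, else None."""
    for quant, bpw in CANDIDATES.items():
        file_gb = n_params_billion * bpw / 8
        if file_gb + overhead_gb <= vram_gb:
            return quant
    return None


for params, vram in ((7, 6), (7, 8), (13, 8), (13, 16)):
    print(f"{params}B on {vram} GB: {best_quant(params, vram)}")
```

If the answer comes back below Q4, that is usually a sign to pick a smaller model rather than a lower quant.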

What about Stable Diffusion and FLUX?

Image models work differently (they are not usually distributed as GGUF chat quants), but the same memory logic applies. SDXL wants 6–8 GB of VRAM to be comfortable; FLUX.1 is the big one. FLUX.1 has about 12 billion parameters, so its 16-bit weights alone come to roughly 24 GB. In practice a 16 GB card typically runs it in 8-bit form, and community quantised builds (GGUF Q4 and similar) bring it down to run on 8 GB cards, exactly as quantization does for LLMs. So an 8 GB laptop runs 4-bit quantised FLUX, and 16 GB gives room for the higher-quality 8-bit builds. See best used laptops for Stable Diffusion for the hardware tiers.

Practical recommendations

  • Start with Q4_K_M. It is the default for a reason. Only change if you have a specific need.
  • Match the model to your VRAM, not your ambition. A 7B at Q4_K_M running fully on GPU beats a 13B that’s half-offloaded to RAM and crawling.
  • Add RAM if you’ll offload. 32 GB of system RAM lets a 4–8 GB GPU run bigger models by spilling layers to CPU.
  • Keep FP16 for fine-tuning only. Inference almost never needs it; training does — and that’s a 16 GB-VRAM job.
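When a model has to be partly offloaded, the useful number is how many layers still fit on the GPU. llama.cpp exposes this as the `--n-gpu-layers` (`-ngl`) option. The sketch below estimates a starting value under two loudly stated assumptions: every layer is the same size, and 1.5 GB of VRAM is reserved for overhead. The 40-layer figure in the example is illustrative, so check your model's metadata for the real count.

```python
def gpu_layers(file_gb: float, n_layers: int, vram_gb: float,
               overhead_gb: float = 1.5) -> int:
    """Estimate how many equal-sized layers fit in VRAM after overhead."""
    per_layer_gb = file_gb / n_layers
    budget_gb = max(0.0, vram_gb - overhead_gb)
    return min(n_layers, int(budget_gb / per_layer_gb))


# A ~7.3 GB file (13B at Q4_K_M) with an assumed 40 layers on a 6 GB card.
layers = gpu_layers(7.3, 40, 6)
print(f"Try about {layers} of 40 layers on the GPU")  # 24 of 40
```

The remaining layers run from system RAM, which is why the RAM recommendation above matters. If the model fails to load, lower the number and try again.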

FAQ

What does Q4 mean in a GGUF model file?

Q4 means the model weights are quantised to roughly 4 bits each instead of 16. A Q4_K_M file is about a quarter the size of the FP16 original, so a 7B model drops from ~13 GB to ~4 GB. Quality loss is small for most chat and coding tasks, which is why Q4_K_M is the most popular laptop quant.

Is Q8 noticeably better than Q4?

Q8 is closer to the full-precision model and slightly more accurate, but for most laptop use the difference over Q4_K_M is hard to notice in everyday chat, summarisation and coding. Q8 roughly doubles the file size and VRAM/RAM footprint versus Q4, so you only choose it when you have memory to spare or need maximum fidelity.

Do I need FP16 to run a model locally?

No. FP16 (or BF16) is the full-precision format used for training and fine-tuning, not for everyday inference on a laptop. For running models locally, a quantised GGUF (Q4 or Q5) gives almost the same answers at a fraction of the memory. Reserve FP16 for fine-tuning, which needs a 16 GB GPU such as the one in the ThinkPad P15 Gen 2.

How do I know if a quant will fit my VRAM?

As a rough rule, the GGUF file size plus 1–2 GB of overhead is what you need in VRAM to run fully on GPU. A 4.4 GB Q4_K_M 7B model needs about 6 GB of VRAM to be comfortable. If the model is larger than your VRAM, llama.cpp and Ollama can offload some layers to system RAM — slower, but it works.
