LLM System Requirements Calculator

Estimate the VRAM, system RAM, and disk space you need to run any open-source LLM locally. Covers 30+ models across the Llama, Mistral, Qwen, DeepSeek, Gemma, and Phi families, at every quantization level from FP16 down to Q2_K.

Configure your model

Each model entry shows its license, release date, native context window, and a link to its HuggingFace repo — for example: Apache 2.0 · released 2024-05 · native context up to 32K tokens. The preselected quantization is the standard community default (the best size/quality tradeoff), and typical context choices are 4K–8K for chat and 32K–128K for long-document work; longer context means a larger KV cache.

Example estimate for this default configuration:

  • Estimated VRAM: 9.42 GB — weights 3.92 GB + KV cache 4.00 GB + overhead 1.50 GB
  • System RAM (CPU inference): 13.9 GB — the minimum for CPU-only inference; GPU inference needs about half this much system RAM
  • Disk space: 3.92 GB — the download size for a single GGUF/safetensors file

Will it run on my hardware?

Headroom below is measured against the example estimate above (9.42 GB of VRAM).

Hardware                | Memory           | Fit  | Headroom  | Note
NVIDIA RTX 3060 12GB    | 12 GB            | Fits | +2.58 GB  | Budget consumer GPU
NVIDIA RTX 4060 Ti 16GB | 16 GB            | Fits | +6.58 GB  | Mid-range with extra VRAM
NVIDIA RTX 3090 24GB    | 24 GB            | Fits | +14.6 GB  | Enthusiast, great for local LLMs
NVIDIA RTX 4090 24GB    | 24 GB            | Fits | +14.6 GB  | Top consumer GPU
NVIDIA RTX 5090 32GB    | 32 GB            | Fits | +22.6 GB  | Current-gen flagship
NVIDIA A100 40GB        | 40 GB            | Fits | +30.6 GB  | Datacenter-class
NVIDIA A100 80GB        | 80 GB            | Fits | +70.6 GB  | Datacenter-class, large
NVIDIA H100 80GB        | 80 GB            | Fits | +70.6 GB  | Current datacenter flagship
Mac M2 16GB             | 12 GB (unified)  | Fits | +2.58 GB  | Unified memory; ~12 GB usable after OS
Mac M3 Max 36GB         | 30 GB (unified)  | Fits | +20.6 GB  | Unified memory; ~30 GB usable
Mac M3 Max 64GB         | 56 GB (unified)  | Fits | +46.6 GB  | Unified memory; ~56 GB usable
Mac M3 Ultra 128GB      | 115 GB (unified) | Fits | +105.6 GB | Unified memory; ~115 GB usable
Mac M4 Max 128GB        | 115 GB (unified) | Fits | +105.6 GB | Unified memory; ~115 GB usable

How the math works:

  • Weights: params × bytes/param. FP16 = 2 bytes, Q4_K_M ≈ 0.58 bytes. For MoE, only active-expert params are loaded on the GPU per forward pass.
  • KV cache: 2 × num_layers × hidden_size × context × 2 bytes (FP16). This scales linearly with context length and is often the surprise cost.
  • Overhead: activations, framework workspace, CUDA kernels — typically 1.5–3 GB or ~10% of weights, whichever is larger.
  • Disk: the full quantized checkpoint. MoE models store every expert on disk even if only a few are active per token.
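If you want to script the same estimate, the list above translates directly into a few lines of Python. This is a minimal sketch, not the calculator's source: the GiB convention, the 1.5 GB overhead floor, and the example architecture (an 8B-class dense model with 32 layers and a 4096 hidden size) are illustrative assumptions.

```python
def estimate_vram_gib(params_billion: float, bytes_per_param: float,
                      num_layers: int, hidden_size: int, context_len: int) -> float:
    """Rough VRAM estimate from the three components listed above."""
    GIB = 2 ** 30
    weights = params_billion * 1e9 * bytes_per_param            # weight memory in bytes
    kv_cache = 2 * num_layers * hidden_size * context_len * 2   # K and V, FP16 = 2 bytes each
    overhead = max(1.5 * GIB, 0.10 * weights)                   # activations + framework workspace
    return (weights + kv_cache + overhead) / GIB

# Example: an 8B-class dense model at Q4_K_M (~0.58 bytes/param) with an 8K context.
print(f"{estimate_vram_gib(8, 0.58, 32, 4096, 8192):.1f} GiB")
```

Substitute your model's num_hidden_layers and hidden_size from its config.json to approximate the cards above.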

Sources: Architecture numbers (num_hidden_layers, hidden_size, max_position_embeddings) come directly from each model's published HuggingFace config.json — click the HuggingFace link next to the selected model to see the exact config. Quantization byte/param ratios come from the llama.cpp GGUF k-quants spec. Estimates typically land within ±10% of real-world nvidia-smi usage; exact overhead depends on your runtime (llama.cpp, vLLM, TensorRT-LLM, MLX).

Quantization availability: Mathematically any model can be quantized to any level — but community GGUF releases don't always ship every variant. Q4_K_M, Q5_K_M, Q8_0 and FP16 are near-universal; Q2_K and Q3_K_M are often skipped on smaller models where quality loss is noticeable. To find a specific quant for a specific model, search bartowski, TheBloke, or the official repo on HuggingFace.
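If you'd rather check availability from a script than by browsing, the Hugging Face Hub client can list community GGUF repos and the quant files they actually ship. A sketch, assuming the huggingface_hub package is installed; the search string and repo id are just examples:

```python
from huggingface_hub import HfApi

api = HfApi()

# Find community GGUF conversions of a model (author and search terms are examples).
for model in api.list_models(search="Llama-3.1-8B GGUF", author="bartowski", limit=5):
    print(model.id)

# See which quant variants one repo actually ships (repo id is an example).
files = api.list_repo_files("bartowski/Meta-Llama-3.1-8B-Instruct-GGUF")
print(sorted(f for f in files if f.endswith(".gguf")))
```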

Context length: Each model has a native max context defined in its config. Going beyond it (via RoPE/YaRN scaling) works mechanically but degrades quality — the warning banner above flags this whenever the current selection exceeds the model's native limit.

How to estimate LLM requirements

  1. Pick the model

     Select the open-source model you plan to run — Llama 3.1, Mistral, Qwen 2.5, DeepSeek R1, Gemma 2, Phi, or Code Llama. The dropdown is grouped by family and shows license and release date.

  2. Choose a quantization level

     Q4_K_M is the community default: about 0.58 bytes per parameter and nearly indistinguishable from FP16 on most tasks. FP16, at 2 bytes per parameter, matches the original training precision; Q2_K is the smallest, with visible quality loss.

  3. Set the context length

     Context length drives the KV cache, which scales linearly with tokens. A 128K-context session on Llama 3.1 70B adds tens of gigabytes of KV cache on top of the weights. Use the preset buttons for common sizes.

  4. Read the verdict

     The three cards show total VRAM (weights + KV cache + overhead), the recommended system RAM for CPU-only inference, and the disk space the download will take. The hardware table underneath marks every common GPU and Mac configuration as Fits / Tight / Too small with exact headroom — a sketch of that check follows these steps.
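The fit verdict in the hardware table is simple subtraction against the VRAM estimate. A minimal sketch of that check — the 'Tight' margin here is an assumed threshold, not the calculator's exact rule:

```python
def fit_verdict(needed_gb: float, available_gb: float, tight_margin_gb: float = 1.0):
    """Classify a GPU or unified-memory pool against the estimated requirement."""
    headroom = available_gb - needed_gb
    if headroom < 0:
        return "Too small", headroom
    if headroom < tight_margin_gb:   # assumed threshold for 'Tight'
        return "Tight", headroom
    return "Fits", headroom

# RTX 3060 12GB against the 9.42 GB example estimate -> ('Fits', ~2.58 GB of headroom)
print(fit_verdict(9.42, 12.0))
```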

Who this is for

1. Choosing a GPU for local inference

   Before you buy a 4090 or an M3 Max, check whether the model you care about actually fits after accounting for the KV cache at your real context length. A 70B model at 128K context needs far more memory than the base weights suggest.

2. Sizing a cloud instance

   Pick the smallest A100 / H100 / L40S / A10G instance that comfortably runs your quantization + context combo. The 'Headroom' column tells you exactly how much slack you have for multi-request batching.

3. Picking a quantization for your hardware

   If Q4_K_M fits on your GPU but Q5_K_M is tight, the calculator shows the delta so you can make an informed quality-vs-memory tradeoff. Swap between quantization levels and watch the VRAM estimate update instantly.

4. Budgeting disk space

   Download sizes for models are not obvious from their name. A Q4_K_M quant of Llama 3.1 70B is about 40 GB on disk; the 405B version exceeds 200 GB. MoE models store every expert on disk even if only a few are active per token. A back-of-the-envelope sketch follows this list.
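A rough disk estimate is just total parameters times bytes per parameter; the parameter counts and the 0.58 bytes/param ratio below are the approximate figures used elsewhere on this page:

```python
def disk_size_gb(total_params_billion: float, bytes_per_param: float = 0.58) -> float:
    """Approximate quantized checkpoint size; MoE models count every expert here."""
    return total_params_billion * bytes_per_param  # billions of params × bytes ≈ GB

print(f"Llama 3.1 70B  @ Q4_K_M ≈ {disk_size_gb(70.6):.0f} GB")
print(f"Llama 3.1 405B @ Q4_K_M ≈ {disk_size_gb(405):.0f} GB")
```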

About This Tool

This calculator estimates the VRAM, system RAM, and disk space required to run a given open-source large language model on your own hardware. It covers 30+ popular models across the Llama, Mistral, Qwen, DeepSeek, Gemma, Phi, and Code Llama families, and every common quantization level from full FP32 down to Q2_K. Select a model, pick a quantization, set a context length, and the tool computes the exact memory footprint plus a fit verdict for every common GPU and Apple Silicon configuration.

The math here is the same arithmetic llama.cpp, vLLM, and Ollama perform under the hood. Model weight memory equals the parameter count times bytes-per-parameter, where Q4_K_M sits at about 0.58 bytes/param and FP16 at 2 bytes/param. The KV cache is 2 × num_layers × hidden_size × context × 2 bytes (it stays FP16 by default even when weights are quantized), which is usually the surprise cost on long-context sessions — a 128K context on a 70B model adds tens of gigabytes on top of the weights. Runtime overhead (activations, CUDA workspace, framework buffers) adds another 1.5–3 GB or ~10% of weights, whichever is larger.
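To make the KV-cache term concrete, here is the same formula evaluated at a few context lengths, assuming a typical 7B–8B dense architecture (32 layers, 4096 hidden size — illustrative figures, not any specific checkpoint):

```python
layers, hidden = 32, 4096   # assumed 7B/8B-class architecture
for context in (4_096, 8_192, 32_768):
    kv_bytes = 2 * layers * hidden * context * 2   # K and V, FP16 (2 bytes each)
    print(f"{context:>6} tokens -> {kv_bytes / 2**30:.0f} GiB of KV cache")
```

At an 8K context the KV cache is already about the same size as the Q4_K_M weights of such a model.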

For MoE models like Mixtral 8x7B or DeepSeek V3, the calculator distinguishes active parameters (which determine per-token VRAM) from total parameters (which determine disk size). DeepSeek V3 has 671B total parameters but only 37B active per token, so it runs in a fraction of the VRAM the full number suggests — but still needs the full ~400 GB of disk for the quantized weights.
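In code, the MoE distinction is simply which parameter count feeds which estimate — a sketch using the figures quoted above:

```python
# MoE sizing: active parameters drive per-token weight VRAM, total parameters drive disk.
total_params_b, active_params_b = 671, 37   # DeepSeek V3 figures quoted above
bytes_per_param = 0.58                      # Q4_K_M

disk_gb = total_params_b * bytes_per_param             # ≈ 389 GB of quantized weights on disk
active_weights_gb = active_params_b * bytes_per_param  # ≈ 21 GB of weights exercised per token
print(f"disk ≈ {disk_gb:.0f} GB, active weights ≈ {active_weights_gb:.0f} GB")
```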

Pair this calculator with the AI Model Picker to choose a model by quality-per-cost, or with the dnpm Configurator and AI Agent Starter Guide when setting up a local development workflow. For hosted inference, the Cloudflare Cost Calculator estimates Workers AI spend for the same model at inference time.

How It Compares

Memory estimator blog posts and Hugging Face Space demos exist, but most are model-specific, skip the KV cache entirely, or ignore runtime overhead — which is exactly the memory that turns an apparent fit into an out-of-memory crash. This calculator is model-agnostic across 30+ curated checkpoints, always includes KV cache and overhead, and lets you see the headroom on 13 specific hardware targets side by side.

The alternative — trying to load the model and watching nvidia-smi or macOS Activity Monitor — works but wastes a 40 GB download when the answer is no. This tool gives you the answer before you download anything, and runs entirely in your browser: the model list and math are static data, so your hardware choices and model preferences are never transmitted anywhere.

Tips for running local LLMs

1. Q4_K_M is the community default for a reason — it produces the best quality per gigabyte on almost every model and saves roughly 70% vs FP16.

2. On unified-memory Macs (M1/M2/M3/M4), the whole RAM pool doubles as VRAM, minus several gigabytes reserved for macOS. A 36 GB M3 Max gives you ~30 GB usable for model weights.

3. The KV cache stays FP16 by default regardless of weight quantization in llama.cpp and most runtimes. Cutting context length is the fastest way to reclaim VRAM without switching models.

4. Mixture-of-Experts models (Mixtral 8x7B, DeepSeek V3) need the full expert set on disk, but only the active-expert slice in VRAM. That's why the VRAM estimate is much lower than the disk estimate.

5. For a 7B–8B model at 8K context, Q4_K_M fits in under 10 GB of VRAM and runs well on a mid-range laptop GPU or a base-model M2 Mac.

Frequently Asked Questions

1. How much VRAM do I need to run Llama 3.1 70B?

At Q4_K_M with an 8K context, Llama 3.1 70B needs about 42 GB of VRAM (roughly 40 GB of weights plus KV cache and runtime overhead). That fits a 2×3090 or 2×4090 setup or an M3 Max with 64+ GB of unified memory; a single A100 40GB falls just short unless you offload some layers to system RAM. At FP16 the footprint grows to about 150 GB, requiring an H100 80GB pair or multiple A100 80GB GPUs.
2. What does Q4_K_M mean, and why is it the default?

Q4_K_M is a 4-bit group-wise quantization format from llama.cpp. It stores most weights in 4-bit blocks with per-block scale factors, reaching about 0.58 bytes per parameter. The 'K' means k-quants (improved block layout) and the 'M' means medium mix — important weights stay in higher precision. It's the default because quality loss is typically under 1% on reasoning benchmarks while size drops by roughly 70% vs FP16.
3. Can I run DeepSeek V3 or DeepSeek R1 locally?

The full 671B MoE models are extremely heavy. Disk footprint at Q4_K_M is about 400 GB. VRAM for a forward pass is much lower — around 22 GB at Q4_K_M, 8K context — because only 37 B parameters are active per token, but you still need enough fast storage or RAM to stream the inactive experts. In practice, the 70B R1 Distill or 32B R1 Distill are the realistic local choices for consumer hardware.
4. Why does my VRAM estimate jump when I increase context length?

The KV cache is 2 × num_layers × hidden_size × context_length × 2 bytes, and it sits in VRAM alongside the weights. On a 70B model with 80 layers and an 8192 hidden size, each additional 1K of context adds roughly 2.5 GB by this formula, so going from 8K to 128K context multiplies the KV cache sixteen-fold. Shortening context is usually the fastest way to reclaim VRAM without changing models.
5. Is this calculator accurate for vLLM, TensorRT, and MLX?

The weight and KV-cache math is consistent across runtimes — they all obey the same linear algebra. The overhead constant varies a bit: vLLM with PagedAttention is slightly more efficient on KV cache memory than llama.cpp, and MLX on Apple Silicon has lower framework overhead. Expect estimates within ±10% of real-world nvidia-smi numbers.
6. Does the calculator handle Apple Silicon / M-series Macs?

Yes. Apple Silicon uses unified memory, so RAM and VRAM come from the same pool. The hardware table lists common M-series configurations and shows usable memory after reserving headroom for macOS and background processes. A 64 GB M3 Max effectively gives you ~56 GB of VRAM for inference.
7. What's the difference between VRAM and RAM in the results?

VRAM is what the GPU needs to hold the model for fast inference. System RAM is what your CPU needs if you run the model CPU-only (no GPU) — this figure is larger because the full working set (weights plus KV cache) has to share memory with the OS and everything else running, so extra headroom is built in. On unified-memory systems they're the same number.
8. Are these download sizes accurate for GGUF files?

The disk estimate matches GGUF file sizes to within a few percent because GGUF stores quantized weights essentially as-is. The small extra overhead (tokenizer, vocab, metadata) is under 100 MB for most models. HF Hub safetensors files at FP16/BF16 match the FP16 estimate exactly.
