LLM System Requirements Calculator

Estimate the VRAM, system RAM, and disk space you need to run any open-source LLM locally. Covers 30+ models across the Llama, Mistral, Qwen, DeepSeek, Gemma, and Phi families, at every quantization level from FP16 down to Q2_K.

Configure your model

Each model entry shows its license, release date, native context window, and a link to its HuggingFace repo — for example: Apache 2.0 · released 2024-05 · native context up to 32K tokens. The preselected quantization is the standard community default (the best size/quality tradeoff), and typical context choices are 4K–8K for chat and 32K–128K for long-document work; longer context means a larger KV cache.

Example estimate for this default configuration:

  • Estimated VRAM: 9.42 GB — weights 3.92 GB + KV cache 4.00 GB + overhead 1.50 GB
  • System RAM (CPU inference): 13.9 GB — the minimum for CPU-only inference; GPU inference needs about half this much system RAM
  • Disk space: 3.92 GB — the download size for a single GGUF/safetensors file

Will it run on my hardware?

Headroom below is measured against the example estimate above (9.42 GB of VRAM).

Hardware                | Memory           | Fit  | Headroom  | Note
NVIDIA RTX 3060 12GB    | 12 GB            | Fits | +2.58 GB  | Budget consumer GPU
NVIDIA RTX 4060 Ti 16GB | 16 GB            | Fits | +6.58 GB  | Mid-range with extra VRAM
NVIDIA RTX 3090 24GB    | 24 GB            | Fits | +14.6 GB  | Enthusiast, great for local LLMs
NVIDIA RTX 4090 24GB    | 24 GB            | Fits | +14.6 GB  | Top consumer GPU
NVIDIA RTX 5090 32GB    | 32 GB            | Fits | +22.6 GB  | Current-gen flagship
NVIDIA A100 40GB        | 40 GB            | Fits | +30.6 GB  | Datacenter-class
NVIDIA A100 80GB        | 80 GB            | Fits | +70.6 GB  | Datacenter-class, large
NVIDIA H100 80GB        | 80 GB            | Fits | +70.6 GB  | Current datacenter flagship
Mac M2 16GB             | 12 GB (unified)  | Fits | +2.58 GB  | Unified memory; ~12 GB usable after OS
Mac M3 Max 36GB         | 30 GB (unified)  | Fits | +20.6 GB  | Unified memory; ~30 GB usable
Mac M3 Max 64GB         | 56 GB (unified)  | Fits | +46.6 GB  | Unified memory; ~56 GB usable
Mac M3 Ultra 128GB      | 115 GB (unified) | Fits | +105.6 GB | Unified memory; ~115 GB usable
Mac M4 Max 128GB        | 115 GB (unified) | Fits | +105.6 GB | Unified memory; ~115 GB usable

How the math works:

  • Weights: params × bytes/param. FP16 = 2 bytes, Q4_K_M ≈ 0.58 bytes. For MoE, only active-expert params are loaded on the GPU per forward pass.
  • KV cache: 2 × num_layers × hidden_size × context × 2 bytes (FP16). This scales linearly with context length and is often the surprise cost.
  • Overhead: activations, framework workspace, CUDA kernels — typically 1.5–3 GB or ~10% of weights, whichever is larger.
  • Disk: the full quantized checkpoint. MoE models store every expert on disk even if only a few are active per token.
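If you want to script the same estimate, the list above translates directly into a few lines of Python. This is a minimal sketch, not the calculator's source: the GiB convention, the 1.5 GB overhead floor, and the example architecture (an 8B-class dense model with 32 layers and a 4096 hidden size) are illustrative assumptions.

```python
def estimate_vram_gib(params_billion: float, bytes_per_param: float,
                      num_layers: int, hidden_size: int, context_len: int) -> float:
    """Rough VRAM estimate from the three components listed above."""
    GIB = 2 ** 30
    weights = params_billion * 1e9 * bytes_per_param            # weight memory in bytes
    kv_cache = 2 * num_layers * hidden_size * context_len * 2   # K and V, FP16 = 2 bytes each
    overhead = max(1.5 * GIB, 0.10 * weights)                   # activations + framework workspace
    return (weights + kv_cache + overhead) / GIB

# Example: an 8B-class dense model at Q4_K_M (~0.58 bytes/param) with an 8K context.
print(f"{estimate_vram_gib(8, 0.58, 32, 4096, 8192):.1f} GiB")
```

Substitute your model's num_hidden_layers and hidden_size from its config.json to approximate the cards above.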

Sources: Architecture numbers (num_hidden_layers, hidden_size, max_position_embeddings) come directly from each model's published HuggingFace config.json — click the HuggingFace link next to the selected model to see the exact config. Quantization byte/param ratios come from the llama.cpp GGUF k-quants spec. Estimates typically land within ±10% of real-world nvidia-smi usage; exact overhead depends on your runtime (llama.cpp, vLLM, TensorRT-LLM, MLX).

Quantization availability: Mathematically any model can be quantized to any level — but community GGUF releases don't always ship every variant. Q4_K_M, Q5_K_M, Q8_0 and FP16 are near-universal; Q2_K and Q3_K_M are often skipped on smaller models where quality loss is noticeable. To find a specific quant for a specific model, search bartowski, TheBloke, or the official repo on HuggingFace.
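If you'd rather check availability from a script than by browsing, the Hugging Face Hub client can list community GGUF repos and the quant files they actually ship. A sketch, assuming the huggingface_hub package is installed; the search string and repo id are just examples:

```python
from huggingface_hub import HfApi

api = HfApi()

# Find community GGUF conversions of a model (author and search terms are examples).
for model in api.list_models(search="Llama-3.1-8B GGUF", author="bartowski", limit=5):
    print(model.id)

# See which quant variants one repo actually ships (repo id is an example).
files = api.list_repo_files("bartowski/Meta-Llama-3.1-8B-Instruct-GGUF")
print(sorted(f for f in files if f.endswith(".gguf")))
```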

Context length: Each model has a native max context defined in its config. Going beyond it (via RoPE/YaRN scaling) works mechanically but degrades quality — the warning banner above flags this whenever the current selection exceeds the model's native limit.

How to estimate LLM requirements

  1. Pick the model

     Select the open-source model you plan to run — Llama 3.1, Mistral, Qwen 2.5, DeepSeek R1, Gemma 2, Phi, or Code Llama. The dropdown is grouped by family and shows license and release date.

  2. Choose a quantization level

     Q4_K_M is the community default: about 0.58 bytes per parameter and nearly indistinguishable from FP16 on most tasks. FP16, at 2 bytes per parameter, matches the original training precision; Q2_K is the smallest, with visible quality loss.

  3. Set the context length

     Context length drives the KV cache, which scales linearly with tokens. A 128K-context session on Llama 3.1 70B adds tens of gigabytes of KV cache on top of the weights. Use the preset buttons for common sizes.

  4. Read the verdict

     The three cards show total VRAM (weights + KV cache + overhead), the recommended system RAM for CPU-only inference, and the disk space the download will take. The hardware table underneath marks every common GPU and Mac configuration as Fits / Tight / Too small with exact headroom — a sketch of that check follows these steps.
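The fit verdict in the hardware table is simple subtraction against the VRAM estimate. A minimal sketch of that check — the 'Tight' margin here is an assumed threshold, not the calculator's exact rule:

```python
def fit_verdict(needed_gb: float, available_gb: float, tight_margin_gb: float = 1.0):
    """Classify a GPU or unified-memory pool against the estimated requirement."""
    headroom = available_gb - needed_gb
    if headroom < 0:
        return "Too small", headroom
    if headroom < tight_margin_gb:   # assumed threshold for 'Tight'
        return "Tight", headroom
    return "Fits", headroom

# RTX 3060 12GB against the 9.42 GB example estimate -> ('Fits', ~2.58 GB of headroom)
print(fit_verdict(9.42, 12.0))
```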

Who this is for

1. Choosing a GPU for local inference

   Before you buy a 4090 or an M3 Max, check whether the model you care about actually fits after accounting for the KV cache at your real context length. A 70B model at 128K context needs far more memory than the base weights suggest.

2. Sizing a cloud instance

   Pick the smallest A100 / H100 / L40S / A10G instance that comfortably runs your quantization + context combo. The 'Headroom' column tells you exactly how much slack you have for multi-request batching.

3. Picking a quantization for your hardware

   If Q4_K_M fits on your GPU but Q5_K_M is tight, the calculator shows the delta so you can make an informed quality-vs-memory tradeoff. Swap between quantization levels and watch the VRAM estimate update instantly.

4. Budgeting disk space

   Download sizes for models are not obvious from their name. A Q4_K_M quant of Llama 3.1 70B is about 40 GB on disk; the 405B version exceeds 200 GB. MoE models store every expert on disk even if only a few are active per token. A back-of-the-envelope sketch follows this list.
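A rough disk estimate is just total parameters times bytes per parameter; the parameter counts and the 0.58 bytes/param ratio below are the approximate figures used elsewhere on this page:

```python
def disk_size_gb(total_params_billion: float, bytes_per_param: float = 0.58) -> float:
    """Approximate quantized checkpoint size; MoE models count every expert here."""
    return total_params_billion * bytes_per_param  # billions of params × bytes ≈ GB

print(f"Llama 3.1 70B  @ Q4_K_M ≈ {disk_size_gb(70.6):.0f} GB")
print(f"Llama 3.1 405B @ Q4_K_M ≈ {disk_size_gb(405):.0f} GB")
```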

About This Tool

This calculator estimates the VRAM, system RAM, and disk space required to run a given open-source large language model on your own hardware. It covers 30+ popular models across the Llama, Mistral, Qwen, DeepSeek, Gemma, Phi, and Code Llama families, and every common quantization level from full FP32 down to Q2_K. Select a model, pick a quantization, set a context length, and the tool computes the exact memory footprint plus a fit verdict for every common GPU and Apple Silicon configuration.

The math here is the same arithmetic llama.cpp, vLLM, and Ollama perform under the hood. Model weight memory equals the parameter count times bytes-per-parameter, where Q4_K_M sits at about 0.58 bytes/param and FP16 at 2 bytes/param. The KV cache is 2 × num_layers × hidden_size × context × 2 bytes (it stays FP16 by default even when weights are quantized), which is usually the surprise cost on long-context sessions — a 128K context on a 70B model adds tens of gigabytes on top of the weights. Runtime overhead (activations, CUDA workspace, framework buffers) adds another 1.5–3 GB or ~10% of weights, whichever is larger.
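To make the KV-cache term concrete, here is the same formula evaluated at a few context lengths, assuming a typical 7B–8B dense architecture (32 layers, 4096 hidden size — illustrative figures, not any specific checkpoint):

```python
layers, hidden = 32, 4096   # assumed 7B/8B-class architecture
for context in (4_096, 8_192, 32_768):
    kv_bytes = 2 * layers * hidden * context * 2   # K and V, FP16 (2 bytes each)
    print(f"{context:>6} tokens -> {kv_bytes / 2**30:.0f} GiB of KV cache")
```

At an 8K context the KV cache is already about the same size as the Q4_K_M weights of such a model.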

For MoE models like Mixtral 8x7B or DeepSeek V3, the calculator distinguishes active parameters (which determine per-token VRAM) from total parameters (which determine disk size). DeepSeek V3 has 671B total parameters but only 37B active per token, so it runs in a fraction of the VRAM the full number suggests — but still needs the full ~400 GB of disk for the quantized weights.
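In code, the MoE distinction is simply which parameter count feeds which estimate — a sketch using the figures quoted above:

```python
# MoE sizing: active parameters drive per-token weight VRAM, total parameters drive disk.
total_params_b, active_params_b = 671, 37   # DeepSeek V3 figures quoted above
bytes_per_param = 0.58                      # Q4_K_M

disk_gb = total_params_b * bytes_per_param             # ≈ 389 GB of quantized weights on disk
active_weights_gb = active_params_b * bytes_per_param  # ≈ 21 GB of weights exercised per token
print(f"disk ≈ {disk_gb:.0f} GB, active weights ≈ {active_weights_gb:.0f} GB")
```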

Pair this calculator with the AI Model Picker to choose a model by quality-per-cost, or with the dnpm Configurator and AI Agent Starter Guide when setting up a local development workflow. For hosted inference, the Cloudflare Cost Calculator estimates Workers AI spend for the same model at inference time.

How It Compares

Memory estimator blog posts and Hugging Face Space demos exist, but most are model-specific, skip the KV cache entirely, or ignore runtime overhead — which is exactly the memory that turns an apparent fit into an out-of-memory crash. This calculator is model-agnostic across 30+ curated checkpoints, always includes KV cache and overhead, and lets you see the headroom on 13 specific hardware targets side by side.

The alternative — trying to load the model and watching nvidia-smi or macOS Activity Monitor — works but wastes a 40 GB download when the answer is no. This tool gives you the answer before you download anything, and runs entirely in your browser: the model list and math are static data, so your hardware choices and model preferences are never transmitted anywhere.

Tips for running local LLMs

1. Q4_K_M is the community default for a reason — it produces the best quality per gigabyte on almost every model and saves roughly 70% vs FP16.

2. On unified-memory Macs (M1/M2/M3/M4), the whole RAM pool doubles as VRAM, minus several gigabytes reserved for macOS. A 36 GB M3 Max gives you ~30 GB usable for model weights.

3. The KV cache stays FP16 by default regardless of weight quantization in llama.cpp and most runtimes. Cutting context length is the fastest way to reclaim VRAM without switching models.

4. Mixture-of-Experts models (Mixtral 8x7B, DeepSeek V3) need the full expert set on disk, but only the active-expert slice in VRAM. That's why the VRAM estimate is much lower than the disk estimate.

5. For a 7B–8B model at 8K context, Q4_K_M fits in under 10 GB of VRAM and runs well on a mid-range laptop GPU or a base-model M2 Mac.

Frequently Asked Questions

1. How much VRAM do I need to run Llama 3.1 70B?

At Q4_K_M with an 8K context, Llama 3.1 70B needs about 42 GB of VRAM (roughly 40 GB of weights plus KV cache and runtime overhead). That fits a 2×3090 or 2×4090 setup or an M3 Max with 64+ GB of unified memory; a single A100 40GB falls just short unless you offload some layers to system RAM. At FP16 the footprint grows to about 150 GB, requiring an H100 80GB pair or multiple A100 80GB GPUs.
2. What does Q4_K_M mean, and why is it the default?

Q4_K_M is a 4-bit group-wise quantization format from llama.cpp. It stores most weights in 4-bit blocks with per-block scale factors, reaching about 0.58 bytes per parameter. The 'K' means k-quants (improved block layout) and the 'M' means medium mix — important weights stay in higher precision. It's the default because quality loss is typically under 1% on reasoning benchmarks while size drops by roughly 70% vs FP16.
3. Can I run DeepSeek V3 or DeepSeek R1 locally?

The full 671B MoE models are extremely heavy. Disk footprint at Q4_K_M is about 400 GB. VRAM for a forward pass is much lower — around 22 GB at Q4_K_M, 8K context — because only 37 B parameters are active per token, but you still need enough fast storage or RAM to stream the inactive experts. In practice, the 70B R1 Distill or 32B R1 Distill are the realistic local choices for consumer hardware.
4. Why does my VRAM estimate jump when I increase context length?

The KV cache is 2 × num_layers × hidden_size × context_length × 2 bytes, and it sits in VRAM alongside the weights. On a 70B model with 80 layers and an 8192 hidden size, each additional 1K of context adds roughly 2.5 GB by this formula, so going from 8K to 128K context multiplies the KV cache sixteen-fold. Shortening context is usually the fastest way to reclaim VRAM without changing models.
5. Is this calculator accurate for vLLM, TensorRT, and MLX?

The weight and KV-cache math is consistent across runtimes — they all obey the same linear algebra. The overhead constant varies a bit: vLLM with PagedAttention is slightly more efficient on KV cache memory than llama.cpp, and MLX on Apple Silicon has lower framework overhead. Expect estimates within ±10% of real-world nvidia-smi numbers.
6. Does the calculator handle Apple Silicon / M-series Macs?

Yes. Apple Silicon uses unified memory, so RAM and VRAM come from the same pool. The hardware table lists common M-series configurations and shows usable memory after reserving headroom for macOS and background processes. A 64 GB M3 Max effectively gives you ~56 GB of VRAM for inference.
7. What's the difference between VRAM and RAM in the results?

VRAM is what the GPU needs to hold the model for fast inference. System RAM is what your CPU needs if you run the model CPU-only (no GPU) — this figure is larger because the full working set (weights plus KV cache) has to share memory with the OS and everything else running, so extra headroom is built in. On unified-memory systems they're the same number.
8. Are these download sizes accurate for GGUF files?

The disk estimate matches GGUF file sizes to within a few percent because GGUF stores quantized weights essentially as-is. The small extra overhead (tokenizer, vocab, metadata) is under 100 MB for most models. HF Hub safetensors files at FP16/BF16 match the FP16 estimate exactly.
