GPU Layers Explained: How LLMs Use GPU Memory
When people talk about "GPU layers" in the context of running LLMs, they're referring to how a model's transformer layers get loaded into GPU memory. Understanding this is key to choosing the right hardware, picking the right quantization, and getting the most out of your available VRAM. Here's a practical breakdown.
What Are Layers in a Neural Network?
Modern large language models are built on the transformer architecture. At their core, transformers are stacks of identical layers. A 7-billion-parameter model like Llama 2 7B has 32 of these layers. A 70B model has 80. Each layer is a self-contained unit of computation.
Every transformer layer contains several components:
- Multi-head attention: The mechanism that lets the model look at different parts of the input simultaneously. This is where the "understanding context" happens.
- Feed-forward network: A two-layer MLP that processes the output of the attention heads. This is typically the largest component by parameter count.
- Layer normalization: Normalizes activations to stabilize training and inference. Usually RMSNorm in modern models.
- Residual connections: Skip connections that add the layer's input back to its output, enabling deeper networks to train effectively.
In addition to the repeated transformer layers, there's an embedding layer at the input (converting tokens to vectors) and an output head (converting vectors back to token probabilities). When we talk about "GPU layers," we're primarily talking about the transformer blocks, since they make up the vast majority of the model's parameters.
How Model Layers Map to GPU Memory
When you load a model onto a GPU, each transformer layer consumes VRAM proportional to its parameter count and the numerical precision you're using. The total VRAM required follows a simple formula:
VRAM Estimation Formula
Total VRAM = Model Weights + KV Cache + Activations + Overhead
Model Weights = num_layers × params_per_layer × bytes_per_param
KV Cache = 2 × num_layers × hidden_dim × seq_len × batch_size × bytes_per_param
The KV cache grows linearly with sequence length and batch size, which is why long-context inference can consume significantly more VRAM than just loading the model weights. (The formula above assumes standard multi-head attention; grouped-query attention shrinks the cache, as covered later.)
For a rough estimate, model weights dominate for short sequences. A 7B parameter model at FP16 precision needs approximately 14GB just for the weights. Add 1-4GB for the KV cache and activations depending on your context length, and you can see why a 24GB GPU is the practical minimum for running 7B models at full precision.
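To make the formula concrete, here's a minimal Python sketch of the estimate. The dimensions, the FP16 KV cache, and the 20% overhead factor are illustrative assumptions rather than measured values; real usage varies by framework and settings.

```python
def estimate_vram_gb(
    params_billion: float,         # total parameter count in billions
    num_layers: int,
    hidden_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_param: float = 2.0,  # 2 = FP16/BF16, 1 = INT8, 0.5 = INT4
    kv_bytes: float = 2.0,         # KV cache usually stays FP16 even with quantized weights
    overhead: float = 1.2,         # assumed ~20% for activations and CUDA overhead
) -> float:
    """Rough VRAM estimate (GB) following the weights + KV cache formula above."""
    weights_gb = params_billion * 1e9 * bytes_per_param / 1e9
    kv_cache_gb = 2 * num_layers * hidden_dim * seq_len * batch_size * kv_bytes / 1e9
    return (weights_gb + kv_cache_gb) * overhead

# Llama-2-7B-like dimensions: 32 layers, hidden size 4096, FP16 weights, 4k context
print(f"{estimate_vram_gb(7, 32, 4096, 4096):.1f} GB")  # ~19 GB with these assumptions
```

Swapping bytes_per_param to 0.5 (INT4) drops the same setup to roughly 7 GB: the 3.5 GB of INT4 weights from the table below plus the KV cache and overhead.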
VRAM Requirements by Precision
The number of bytes per parameter depends on the precision format you choose. Lower precision means smaller memory footprint and faster inference, with some tradeoff in output quality:
| Precision | Bytes/Param | 7B Model | 13B Model | 70B Model |
|---|---|---|---|---|
| FP32 | 4 bytes | 28 GB | 52 GB | 280 GB |
| FP16 / BF16 | 2 bytes | 14 GB | 26 GB | 140 GB |
| INT8 | 1 byte | 7 GB | 13 GB | 70 GB |
| INT4 (GPTQ/AWQ) | 0.5 bytes | 3.5 GB | 6.5 GB | 35 GB |
* Weights only. Actual VRAM usage will be 15-40% higher due to KV cache, activations, and CUDA overhead. These figures do not include the embedding and output head layers.
The practical takeaway: quantization is how most people fit large models on consumer GPUs. A 7B model at INT4 fits comfortably on an 8GB GPU. A 70B model at INT4 fits on two RTX 4090s. Without quantization, that same 70B model would need 140GB of VRAM at FP16, requiring multiple datacenter-grade GPUs.
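As a sketch of what that looks like in practice, the snippet below loads a model with on-the-fly 4-bit quantization via Hugging Face transformers and bitsandbytes; device_map="auto" spreads layers across whatever GPUs are available. The model ID and settings are just examples, and both libraries (plus a CUDA GPU) are assumed to be installed.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # example checkpoint

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # store weights in 4-bit, compute in FP16
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # split layers across available GPUs (and CPU if needed)
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```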
GPU Offloading: Splitting Layers Across Devices
What if your model doesn't quite fit in VRAM? That's where GPU offloading comes in. Tools like llama.cpp and Ollama let you control exactly how many layers run on the GPU versus the CPU using a parameter typically called n_gpu_layers or --n-gpu-layers.
When you set n_gpu_layers=20 on a 32-layer model, 20 of the layers run on the GPU (fast) and the remaining 12 run on the CPU (slower). Each token must traverse all layers sequentially, so the CPU layers become the bottleneck.
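For example, here's a minimal llama-cpp-python sketch of that 20-of-32 split. The GGUF path is a placeholder, and the right n_gpu_layers value depends on how much VRAM you actually have free.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-2-7b.Q4_K_M.gguf",  # placeholder path to a quantized GGUF file
    n_gpu_layers=20,  # offload 20 of the 32 transformer layers to the GPU; -1 offloads all
    n_ctx=4096,       # max context length, which also caps the KV cache
)

out = llm("Q: What does a transformer layer do? A:", max_tokens=64)
print(out["choices"][0]["text"])
```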
How Layer Offloading Works
- All layers on GPU: Maximum speed. Token generation runs at full throughput. Set n_gpu_layers to the total layer count, or -1 to offload everything.
- Most layers on GPU: The best compromise when you're just a few GB short on VRAM, but expect a real slowdown: even a handful of CPU layers takes a noticeable bite out of throughput.
- Half and half: Noticeably slower. Throughput depends heavily on your CPU and memory bandwidth, and often lands well below half of full-GPU speed.
- All layers on CPU: Slowest option. Can be 5-20x slower than full GPU inference. Only useful when you have no GPU at all.
The key insight is that partial offloading is a spectrum, not a binary choice. Per-token latency is roughly the sum of per-layer times, so every layer you move to the GPU shaves off the gap between that layer's CPU time and its GPU time. The catch is that the gap is large: because a CPU layer can be an order of magnitude slower than a GPU layer, even a few layers left on the CPU eat a disproportionate share of the token time, as the back-of-the-envelope model below illustrates. Fitting those last few layers on the GPU, even if it takes a more aggressive quantization, usually pays off.
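Here's that spectrum as a back-of-the-envelope latency model. The per-layer times are made-up round numbers (a CPU layer assumed 10x slower than a GPU layer), chosen only to show the shape of the curve; benchmark your own hardware for real figures.

```python
GPU_MS_PER_LAYER = 0.3   # assumed per-layer time on the GPU
CPU_MS_PER_LAYER = 3.0   # assumed per-layer time on the CPU (10x slower)
TOTAL_LAYERS = 32

for n_gpu in (32, 28, 16, 0):
    n_cpu = TOTAL_LAYERS - n_gpu
    ms_per_token = n_gpu * GPU_MS_PER_LAYER + n_cpu * CPU_MS_PER_LAYER
    print(f"{n_gpu:2d} GPU layers: {1000 / ms_per_token:6.1f} tokens/s")

# 32 GPU layers:  104.2 tokens/s
# 28 GPU layers:   49.0 tokens/s  <- four CPU layers already cost half the throughput
# 16 GPU layers:   18.9 tokens/s
#  0 GPU layers:   10.4 tokens/s
```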
Practical Guide: Which GPU for Which Model
Here's a practical reference for matching models to GPU hardware. This assumes you want all layers on GPU for maximum inference speed:
| Model Size | Layers | Recommended GPU | Notes |
|---|---|---|---|
| 7B (e.g., Llama 3 8B, Mistral 7B) | 32 | RTX 3090 / 4090 (24GB) | All layers on GPU at FP16. Runs great at INT4 on 8GB cards. |
| 13B (e.g., Llama 2 13B, CodeLlama 13B) | 40 | RTX 4090 (24GB) | Fits at INT4/INT8. FP16 requires partial offload or 2 GPUs. |
| 34B (e.g., CodeLlama 34B, Yi 34B) | 48-60 | RTX 4090 (4-bit) or 2x GPUs | Tight fit at INT4 on single 24GB. Comfortable on 2x 24GB. |
| 70B (e.g., Llama 3 70B, Qwen 72B) | 80 | 2x H100 or 4x RTX 4090 | 35GB at INT4 needs multi-GPU. 2x H100 handles FP16. |
A rule of thumb: at INT4 quantization, divide the model's parameter count by 2 to get a rough VRAM estimate in gigabytes. A 7B model needs roughly 3.5GB, a 13B model needs roughly 6.5GB, and so on. Then add 20-30% for KV cache and overhead.
Maximizing Layers on Your GPU
If you're running close to your VRAM limit, here are practical techniques to fit more layers on GPU:
Use Quantization
Going from FP16 to INT4 cuts memory by 4x with surprisingly little quality loss for inference. GPTQ and AWQ are the most popular quantization formats. For most use cases, the quality difference between FP16 and a good INT4 quantization is negligible.
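If you're using a pre-quantized checkpoint, loading it is usually a one-liner. The sketch below assumes transformers plus a matching GPTQ backend are installed; the model ID is just an example of a published 4-bit GPTQ checkpoint.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-7B-GPTQ"  # example pre-quantized 4-bit checkpoint

# The quantization config ships with the checkpoint, so nothing extra is
# needed beyond device placement.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
```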
Reduce KV Cache With Sliding Window Attention
Models like Mistral use sliding window attention, which limits the KV cache size to a fixed window (e.g., 4096 tokens) regardless of the total sequence length. This significantly reduces VRAM usage for long conversations. If your framework supports it, set a lower max context length to cap KV cache size.
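Capping the context length is usually a single parameter in whatever framework you're serving with. As one example, vLLM exposes max_model_len, which bounds the per-sequence KV cache; the model ID and memory fraction below are only illustrative.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # example model
    max_model_len=4096,           # cap the context length -> caps per-sequence KV cache
    gpu_memory_utilization=0.90,  # fraction of VRAM vLLM is allowed to claim
)

outputs = llm.generate(["Explain the KV cache in one sentence."],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```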
Enable Flash Attention
Flash Attention (and its successor FlashAttention-2) computes attention in tiles rather than materializing the full attention matrix. This can reduce the memory overhead of the attention mechanism by 5-20x for long sequences. Most modern inference frameworks support it out of the box.
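In transformers, for instance, it's one keyword argument at load time. This sketch assumes the flash-attn package, a supported GPU, and an FP16/BF16 dtype; the model ID is only an example.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",             # example checkpoint
    torch_dtype=torch.bfloat16,               # FlashAttention-2 requires FP16 or BF16
    attn_implementation="flash_attention_2",  # use the FlashAttention-2 kernels
    device_map="auto",
)
```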
Reduce Batch Size
If you're serving multiple requests, each concurrent request needs its own KV cache. Reducing the maximum batch size frees VRAM for model layers. Find the sweet spot between throughput and VRAM usage for your specific workload.
Use Grouped Query Attention (GQA) Models
Models that use GQA (like Llama 3, Mistral) share key-value heads across multiple query heads, dramatically reducing the KV cache size. When choosing between models of similar quality, prefer ones with GQA support if you're VRAM-constrained.
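To see why this matters, here's the KV cache formula with hidden_dim replaced by num_kv_heads × head_dim, evaluated with Llama-3-8B-like dimensions (32 layers, head_dim 128, 8 KV heads, versus 32 KV heads for a hypothetical non-GQA variant). The numbers are illustrative; check your model's config for the real values.

```python
def kv_cache_gib(num_layers: int, num_kv_heads: int, head_dim: int,
                 seq_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GiB for one sequence, FP16 cache."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem / 2**30

# 8k tokens of context, batch size 1
print(f"MHA, 32 KV heads: {kv_cache_gib(32, 32, 128, 8192):.1f} GiB")  # ~4.0 GiB
print(f"GQA,  8 KV heads: {kv_cache_gib(32,  8, 128, 8192):.1f} GiB")  # ~1.0 GiB
```

With these assumed dimensions, GQA cuts the per-sequence cache by 4x, which is VRAM you can spend on more layers, longer context, or a bigger batch.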
Running LLMs Without the Hardware Hassle
Understanding GPU layers and VRAM math is valuable, but you don't always want to spend your time optimizing layer counts and quantization settings. If you just need a GPU to run inference, VectorLay offers RTX 4090s starting at $0.49/hr and RTX 3090s at $0.29/hr with automatic failover built in.
Deploy a container with your model, point your API at the endpoint, and the platform handles the rest. No fiddling with n_gpu_layers when you have a full 24GB GPU dedicated to your workload.
Check out the available GPUs and pricing on the GPU catalog.