Llama 2 and Llama 3.1 Hardware Requirements: GPU, CPU, RAM
In recent years, large language models (LLMs) have revolutionized the field of artificial intelligence, offering unprecedented capabilities in programming, text summarization, role-playing, and serving as general AI assistants.
Among these, Llama 2 and the more recent Llama 3.1 stand out as powerful open-source alternatives to proprietary models. For enthusiasts and researchers alike, the ability to run these models locally presents an exciting opportunity to explore and harness their potential. However, the computational demands of these models necessitate careful consideration of hardware requirements.
GPU Requirements for Llama 2 and Llama 3.1
At the heart of any system designed to run Llama 2 or Llama 3.1 is the Graphics Processing Unit (GPU). The parallel processing capabilities of modern GPUs make them ideal for the matrix operations that underpin these language models. In our testing, we found that the NVIDIA GeForce RTX 3090 strikes an excellent balance between performance, price, and VRAM capacity for running Llama. With its 24 GB of GDDR6X memory, this GPU provides sufficient VRAM to accommodate the substantial memory footprint of these models.
Using an RTX 3090 in conjunction with optimized software solutions like ExLlamaV2 and an 8-bit quantized 13B Llama model, users can achieve impressive performance, with speeds of up to 50 tokens per second. This level of performance brings near real-time interactions within reach for home users.
VRAM requirements for pure GPU inference for 4-Bit quantized Llama models
LLaMA Model | Model size | Minimum VRAM Requirement | Recommended GPU Examples |
---|---|---|---|
Llama 2 / Llama 3.1 | 8B | 6GB | RTX 3060, RTX 4060, GTX 1660, 2060, AMD 5700 XT, RTX 3050 |
Llama 2 | 13B | 10GB | AMD 6900 XT, RTX 2060 12GB, 3060 12GB, RTX 4070, RTX 3080, A2000 |
LLaMA | 33B | 20GB | RTX 3080 20GB, RTX 4000 Ada, A4500, A5000, 3090, 4090, 6000, Tesla V100, Tesla P40 |
Llama 2 / Llama 3.1 | 70B | 40GB | A100 40GB, 2x3090, 2x4090, A40, RTX A6000, 8000 |
Llama 3.1 | 405B | 232GB | 10x3090, 10x4090, 6xA100 40GB, 3xH100 80GB |
For smaller Llama models like the 8B and 13B, you can use consumer GPUs such as the RTX 3060, which handles the 6GB and 10GB VRAM requirements well. The LLaMA 33B steps up to 20GB, making the RTX 3090 a good choice.
When we scaled up to the 70B Llama 2 and 3.1 models, we quickly realized the limitations of a single-GPU setup. A dual RTX 3090 or RTX 4090 configuration offered the necessary VRAM and processing power for smooth operation. This configuration allows the model weights to be distributed across the available VRAM, enabling faster token generation compared to setups where the model's weights are split between VRAM and system memory (RAM).
For the massive Llama 3.1 405B, you’re looking at a staggering 232GB of VRAM, which requires 10 RTX 3090s or powerful data center GPUs like A100s or H100s. The hardware demands scale dramatically with model size, from consumer-friendly to enterprise-level setups.
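To see where these GPU counts come from, you can divide a model's total VRAM requirement by the capacity of a single card and round up. The sketch below is purely illustrative arithmetic, using the requirements from the table above:

```python
# How many GPUs does a model need? Divide the total VRAM requirement
# (from the table above) by the per-card capacity and round up.
import math

def gpus_needed(total_vram_gb: float, vram_per_gpu_gb: float) -> int:
    return math.ceil(total_vram_gb / vram_per_gpu_gb)

print(gpus_needed(40, 24))   # 70B at 4-bit on 24 GB cards: 2x RTX 3090/4090
print(gpus_needed(232, 24))  # 405B at 4-bit on 24 GB cards: 10x RTX 3090/4090
print(gpus_needed(232, 80))  # 405B on H100 80GB: 3 cards
```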
Example of inference speed using llama.cpp, RTX 4090, and Intel i9-12900K CPU
Model | Size | Context | VRAM used | Speed |
---|---|---|---|---|
Llama 2 / Llama 3.1 | 8B | 2,048 t | 5 GB | 175 t/s |
Llama 2 / Llama 3.1 | 13B | 2,048 t | 9 GB | 90 t/s |
LLaMA | 33B | 2,048 t | 21 GB | 41 t/s |
Quantization: Balancing Performance and Accuracy
One of the key techniques enabling the use of these large models on consumer hardware is quantization (compression). This process reduces the precision of the model’s weights, significantly decreasing memory requirements and computational demands. The most common quantization levels for home use are 4-bit and 8-bit.
4-bit quantized models offer a higher level of compression, allowing larger models to fit within the constraints of consumer GPUs. These models run efficiently on GPUs with lower VRAM capacities, making them an excellent choice for users with more modest hardware. However, this compression comes at a cost of some reduction in model accuracy.
8-bit quantized models strike a balance between compression and accuracy. They require more GPU memory and computational power than their 4-bit counterparts but generally offer higher accuracy. These models are well-suited for users with high-end consumer GPUs that have more VRAM.
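As a back-of-the-envelope check, the weight footprint of a quantized model is roughly the parameter count multiplied by the bits per weight, divided by eight. The sketch below is an approximation only; real quantization formats (GGUF, GPTQ, EXL2) store per-group scales and keep some tensors at higher precision, and you also need headroom for the KV cache and activations:

```python
# Rough estimate of the memory footprint of quantized model weights.
# Real formats (GGUF, GPTQ, EXL2) add per-group scales and keep some
# tensors at higher precision, so expect actual files to be ~10-20% larger.

def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in gigabytes."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for params in (8, 13, 70):
    for bits in (4, 8):
        print(f"{params}B @ {bits}-bit: ~{weight_footprint_gb(params, bits):.1f} GB")

# A 13B model is ~6.5 GB at 4-bit but ~13 GB at 8-bit, which is why the
# 4-bit version fits a 12 GB card while the 8-bit version wants a 24 GB RTX 3090.
```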
CPU Considerations
While the GPU handles the bulk of the computational load, the CPU plays a crucial supporting role in the overall system performance. For users running Llama 2 or Llama 3.1 primarily on the GPU, the CPU’s main tasks involve data loading, preprocessing, and managing system resources. High-end consumer CPUs like the Intel Core i9-13900K or AMD Ryzen 9 7950X provide ample processing power for these tasks.
CPU-based inference is another popular approach for running large language models. Software solutions like llama.cpp (and frontends such as LM Studio and Ollama), combined with the GGUF model format, allow for split (VRAM and RAM) and pure CPU inference. This approach is particularly useful for users who do not have access to high-end GPUs or wish to run models that exceed their available VRAM.
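For readers who prefer scripting over the llama.cpp command line, the llama-cpp-python bindings expose the same GGUF loading and layer-offloading behavior. The snippet below is a minimal sketch: the model path is a placeholder, and n_gpu_layers should be tuned to however many layers fit in your VRAM (0 keeps everything in RAM for pure CPU inference, -1 offloads every layer to the GPU):

```python
# Minimal sketch of split (GPU + CPU) inference with llama-cpp-python.
# The model path is a placeholder; tune n_gpu_layers to your VRAM.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-13b.Q4_K_M.gguf",  # hypothetical local GGUF file
    n_gpu_layers=35,  # layers offloaded to VRAM; the rest stay in system RAM
    n_ctx=2048,       # context window in tokens
)

output = llm(
    "Explain why memory bandwidth limits CPU inference speed.",
    max_tokens=128,
)
print(output["choices"][0]["text"])
```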
When considering CPU-based inference, it's important to note that performance is closely tied to memory bandwidth. This is because generating a single token requires reading the entire model from memory once. That's why processors supporting higher memory bandwidth and more memory channels, such as those compatible with DDR5-6000 memory, can improve inference speeds.
This is an example of running llama.cpp with a Ryzen 7 3700X and 128GB RAM @ 3600 MHz.
GGML Model | Memory per Token | Load Time | Sample Time | Predict Time | Total Time |
---|---|---|---|---|---|
LLaMA-7B 4-bit | 14434244 bytes | 1270.15 ms | 325.76 ms | 15147.15 ms / 8.5 t/s | 17077.88 ms |
LLaMA-13B 4-bit | 22439492 bytes | 2946.00 ms | 86.11 ms | 7358.48 ms / 4.62 t/s | 11019.28 ms |
LLaMA-30B 4-bit | 43387780 bytes | 6666.53 ms | 332.71 ms | 68779.27 ms / 1.87 t/s | 77333.97 ms |
LLaMA-65B 4-bit | 70897348 bytes | 14010.35 ms | 335.09 ms | 140527.48 ms / 0.91 t/s | 157951.48 ms |
RAM and Memory Bandwidth
The importance of system memory (RAM) in running Llama 2 and Llama 3.1 cannot be overstated. For GPU-based inference, 16 GB of RAM is generally sufficient for most use cases, allowing the entire model to be held in memory without resorting to disk swapping. However, for larger models, 32 GB or more of RAM can provide a performance boost.
These are the memory (RAM) requirements for LLaMA models run on the CPU:
GGUF Model | Original size | Quantized size (4-bit) | Quantized size (5-bit) | Quantized size (8-bit) |
---|---|---|---|---|
7B | 13 GB | 3.9 – 7.5 GB | 7.5 – 8.5 GB | 8.5 – 10.0 GB |
13B | 24 GB | 7.8 – 11 GB | 11.5 – 13.5 GB | 13.5 – 17.5 GB |
30B | 60 GB | 19.5 – 23.0 GB | 23.5 – 27.5 GB | 28.5 – 38.5 GB |
65B | 120 GB | 38.5 – 47.0 GB | 47.0 – 52.0 GB | 71.0 – 80.0 GB |
In CPU-based inference scenarios, the relationship between the CPU and memory becomes even more critical. The bandwidth between the CPU and RAM directly impacts inference speed. For instance, a system with a Core i9-10900X (supporting 4 memory channels) and DDR4-3600 memory can achieve a throughput of about 115 GB/s. With a 13 GB model, this translates to an inference speed of approximately 8 tokens per second, regardless of the CPU’s clock speed or core count.
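Because every generated token requires one full pass over the weights, an upper bound on CPU inference speed is simply memory bandwidth divided by model size. The sketch below reproduces the estimates used here and in the table that follows; these are theoretical ceilings, and real-world speeds land somewhat below them:

```python
# Theoretical ceiling on CPU inference speed: each token requires reading
# all model weights once, so tokens/s <= memory bandwidth / model size.

def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

# Core i9-10900X, quad-channel DDR4-3600 (~115 GB/s) with a 13 GB model
print(max_tokens_per_second(115, 13))   # ~8.8 tokens/s ceiling

# Core i9-13900K, dual-channel DDR5-5600 (~89.6 GB/s) with an ~8 GB 4-bit 13B model
print(max_tokens_per_second(89.6, 8))   # ~11.2 tokens/s ceiling
```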
Inference speed for 13B model with 4-bit quantization, based on memory (RAM) speed when running on CPU:
RAM speed | CPU | CPU channels | Bandwidth | *Inference |
---|---|---|---|---|
DDR4-3600 | Ryzen 5 3600 | 2 | 56 GB/s | ~ 7 tokens/s |
DDR4-3200 | Ryzen 5 5600X | 2 | 51 GB/s | ~ 6.3 tokens/s |
DDR5-5600 | Core i9-13900K | 2 | 89.6 GB/s | ~ 11.2 tokens/s |
DDR4-2666 | Core i5-10400f | 2 | 41.6 GB/s | ~ 5.1 tokens/s |
*These are theoretical peak figures. The actual speed will be lower and will depend on the OS and system load.
Storage: Fast Access to Model Data
While often overlooked, storage plays a crucial role in the overall performance of a system running Llama 2 or Llama 3.1. A high-speed NVMe SSD with PCIe 4.0 support is recommended for storing model files and datasets. This ensures rapid data transfer between storage and system RAM, minimizing bottlenecks during model loading and data preprocessing.
A minimum of 1 TB storage is advisable, but 2 TB or 4 TB configurations provide more flexibility for storing multiple models, datasets, and generated outputs. The sequential read and write speeds of modern NVMe SSDs contribute significantly to reducing model load times and enhancing overall system responsiveness.
Motherboard: The Foundation of Your AI Workstation
For users building a single-GPU system, a solid mid-range motherboard that matches the chosen CPU socket will suffice. However, those considering a dual-GPU setup, particularly with cards like the RTX 3090, need to pay special attention to motherboard selection.
Key considerations for dual-GPU setups include:
- PCIe slot layout: Ensure there are two PCIe slots with adequate spacing between them, as the RTX 3090 is a 3-slot card.
- PCIe bifurcation support: The motherboard should be able to run two PCIe x16 slots in an x8/x8 configuration, splitting the CPU's 16 lanes between the GPUs, for optimal performance with dual GPUs.
- Power delivery: Robust VRM design to handle the power demands of high-end CPUs and multiple GPUs.
- Connectivity: Ample USB ports and M.2 slots for storage expansion.
Power Supply and Cooling: Keeping Your AI Rig Running Smoothly
Running Llama 2 or Llama 3.1 models, especially on high-end GPUs, can be power-intensive. A high-quality power supply unit (PSU) with sufficient wattage is crucial for system stability. For single-GPU setups, a 750W or 850W PSU is generally sufficient. Dual-GPU configurations may require 1200W or higher PSUs to ensure stable operation under load.
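A quick way to sanity-check PSU sizing is to add up the board power of the major components and leave roughly 30-40% headroom for transient spikes, which high-end GPUs like the RTX 3090 are known to produce. The figures below are rough assumed TDP values, not measurements:

```python
# Rough PSU sizing: sum approximate component power draw and add headroom
# for transient spikes. TDP values here are assumptions, not measurements.

def recommended_psu_watts(gpu_tdp: int, gpu_count: int, cpu_tdp: int,
                          other: int = 100, headroom: float = 1.35) -> float:
    """Suggested PSU wattage with ~35% headroom over steady-state draw."""
    return (gpu_tdp * gpu_count + cpu_tdp + other) * headroom

print(recommended_psu_watts(350, 1, 105))  # one RTX 3090 + mid-range CPU: ~750 W
print(recommended_psu_watts(350, 2, 105))  # dual RTX 3090: ~1220 W, hence a 1200 W+ PSU
```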
Equally important is maintaining optimal thermal conditions. Large language models can put significant stress on GPUs and CPUs, leading to sustained high temperatures. A well-designed case with good airflow, coupled with effective CPU and GPU cooling solutions, is essential for maintaining performance and longevity of components.
Comparing Performance: Dual GPUs vs. Apple Silicon
While dual-GPU setups using RTX 3090 or RTX 4090 cards offer impressive performance for running Llama 2 and Llama 3.1 models, it’s worth considering alternative platforms. Apple’s M1, M2, and M3 series of processors, particularly in their Pro, Max, and Ultra configurations, have shown remarkable capabilities in AI workloads.
We were particularly impressed by the M2 Max when testing Llama: its unified memory architecture (UMA) and high memory bandwidth clearly contributed to efficient data handling. This architecture can be particularly beneficial for running large language models. For instance, the M2 Max chip supports up to 400 GB/s of memory bandwidth, while the M2 Ultra pushes this to an impressive 800 GB/s.
However, when it comes to raw inference speed for the largest models, dual high-end NVIDIA GPUs still hold an edge. A dual RTX 4090 setup can achieve speeds of around 20 tokens per second with a 65B model, while two RTX 3090s manage about 15 tokens per second. In comparison, even the most powerful Apple Silicon chips struggle to match these speeds for the largest models.
Building Your Llama Rig
After exploring the hardware requirements for running Llama 2 and Llama 3.1 models, let’s summarize the key points and provide a step-by-step guide to building your own Llama rig.
Key Takeaways:
- GPU is crucial: A high-end GPU like the NVIDIA GeForce RTX 3090 with 24GB VRAM is ideal for running Llama models efficiently.
- CPU matters: While not as critical as the GPU, a strong CPU helps with data loading and preprocessing.
- RAM requirements: 32GB or more of fast RAM (DDR4-3600 or better) is recommended for optimal performance.
- Storage speed: An NVMe SSD with PCIe 4.0 support ensures fast model loading and data access.
- Cooling is essential: Proper cooling solutions are necessary to maintain performance during long inference sessions.
- Power supply: A high-quality PSU with sufficient wattage (850W+) is crucial for system stability.
LLM build based on RTX 3090 24GB
Type | Item | Price |
---|---|---|
CPU | AMD Ryzen 5 7600X 4.7 GHz 6-Core Processor | $209.99 |
CPU Cooler | Noctua NH-D15 82.5 CFM CPU Cooler | $109.95 |
Motherboard | Asus PROART B650-CREATOR ATX AM5 Motherboard | $229.99 |
Memory | G.Skill Flare X5 64 GB (2 x 32 GB) DDR5-5600 CL36 Memory | $179.99 |
Storage | Samsung 980 Pro 1 TB M.2-2280 PCIe 4.0 X4 NVME Solid State Drive | $89.99 |
Video Card | RTX 3090 24 GB Video Card (Secondhand) | ~ $650 @ Ebay |
Case | Lian Li O11 Dynamic EVO XL ATX Full Tower Case | $205.81 |
Power Supply | Corsair RM850x (2018) 850 W 80+ Gold Certified Fully Modular ATX Power Supply | $229.98 |
Total | | $1905.70 |
Expand and upgrade:
- Consider adding more storage as you accumulate models and datasets.
- This build has room for expansion into a dual RTX 3090 setup. Just add a higher-wattage PSU, such as the Corsair HX1200 Platinum 1200 W 80+ Platinum, and a few (~5) be quiet! Pure Wings 2 87 CFM 120 mm case fans, and you will be running 70B parameter models in no time.
Conclusion
Running Llama 2 and Llama 3.1 models locally opens up exciting possibilities for AI enthusiasts, researchers, and developers. While the hardware requirements may seem daunting, careful selection of components can result in a system capable of impressive performance. Whether opting for a high-end GPU setup or leveraging CPU-based inference, the key lies in balancing the various components to create a harmonious system tailored to your specific needs and budget.
As these models continue to evolve and new optimization techniques emerge, we can expect the hardware landscape to shift as well. Staying informed about the latest developments in both AI models and hardware will be crucial for those looking to push the boundaries of what’s possible with home-based AI systems.
Allan Witt
Allan Witt is co-founder and editor-in-chief of Hardware-corner.net. Computers and the web have fascinated me since I was a child. In 2011 I started training as an IT specialist in a medium-sized company and started a blog at the same time. I really enjoy blogging about tech. After successfully completing my training, I worked as a system administrator in the same company for two years. As a part-time job, I started tinkering with pre-built PCs and building custom gaming rigs at a local hardware shop. The desire to build PCs full-time grew stronger, and now this is my full-time job.
I’m curious about your testing setup. Did you actually build systems with dual 3090s, or were these benchmarks run in a different environment?
Thanks for the question! We used both physical test benches and cloud instances. For dual 3090s specifically, we used our own test bench with an X299 motherboard (48 PCIe lanes), an i9-7900X processor, and 64 GB of RAM. This setup allowed us to test a wider range of hardware.