Llama 2 and Llama 3.1 Hardware Requirements: GPU, CPU, RAM
In recent years, large language models (LLMs) have revolutionized the field of artificial intelligence, offering unprecedented capabilities in programming, text summarization, role-playing, and serving as general AI assistants.
Among these, Llama 2 and the more recent Llama 3.1 stand out as powerful open-source alternatives to proprietary models. For enthusiasts and researchers alike, the ability to run these models locally presents an exciting opportunity to explore and harness their potential. However, the computational demands of these models necessitate careful consideration of hardware requirements.
GPU Requirements for Llama 2 and Llama 3.1
At the heart of any system designed to run Llama 2 or Llama 3.1 is the Graphics Processing Unit (GPU). The parallel processing capabilities of modern GPUs make them ideal for the matrix operations that underpin these language models. In our testing, we found that the NVIDIA GeForce RTX 3090 strikes an excellent balance between performance, price, and VRAM capacity for running Llama. With its 24 GB of GDDR6X memory, this GPU provides sufficient VRAM to accommodate the substantial memory footprint of these models.
Using an RTX 3090 in conjunction with optimized software solutions like ExLlamaV2 and an 8-bit quantized 13B Llama model, users can achieve impressive performance, with speeds of up to 50 tokens per second. This level of performance brings near real-time interactions within reach for home users.
VRAM requirements for pure GPU inference for 4-Bit quantized Llama models
LLaMA Model | Model size | Minimum VRAM Requirement | Recommended GPU Examples |
---|---|---|---|
Llama 2 / Llama 3.1 | 8B | 6GB | RTX 3060, RTX 4060, GTX 1660, 2060, AMD 5700 XT, RTX 3050 |
Llama 2 | 13B | 10GB | AMD 6900 XT, RTX 2060 12GB, 3060 12GB, RTX 4070, RTX 3080, A2000 |
LLaMA | 33B | 20GB | RTX 3080 20GB, RTX 4000 Ada, A4500, A5000, 3090, 4090, 6000, Tesla V100, Tesla P40 |
Llama 2 / Llama 3.1 | 70B | 40GB | A100 40GB, 2x3090, 2x4090, A40, RTX A6000, 8000 |
Llama 3.1 | 405B | 232GB | 10x3090, 10x4090, 6xA100 40GB, 3xH100 80GB |
For smaller Llama models like the 8B and 13B, you can use consumer GPUs such as the RTX 3060, which handles the 6GB and 10GB VRAM requirements well. The LLaMA 33B steps up to 20GB, making the RTX 3090 a good choice.
When we scaled up to the 70B Llama 2 and 3.1 models, we quickly realized the limitations of a single-GPU setup. A dual RTX 3090 or RTX 4090 configuration offered the necessary VRAM and processing power for smooth operation. This configuration allows the model weights to be distributed across the available VRAM, enabling faster token generation compared to setups where the model's weights are split between VRAM and system memory (RAM).
For the massive Llama 3.1 405B, you’re looking at a staggering 232GB of VRAM, which requires 10 RTX 3090s or powerful data center GPUs like A100s or H100s. The hardware demands scale dramatically with model size, from consumer-friendly to enterprise-level setups.
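To see where these GPU counts come from, you can divide a model's total VRAM requirement by the capacity of a single card and round up. The sketch below is purely illustrative arithmetic, using the requirements from the table above:

```python
# How many GPUs does a model need? Divide the total VRAM requirement
# (from the table above) by the per-card capacity and round up.
import math

def gpus_needed(total_vram_gb: float, vram_per_gpu_gb: float) -> int:
    return math.ceil(total_vram_gb / vram_per_gpu_gb)

print(gpus_needed(40, 24))   # 70B at 4-bit on 24 GB cards: 2x RTX 3090/4090
print(gpus_needed(232, 24))  # 405B at 4-bit on 24 GB cards: 10x RTX 3090/4090
print(gpus_needed(232, 80))  # 405B on H100 80GB: 3 cards
```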
Example of inference speed using llama.cpp, RTX 4090, and Intel i9-12900K CPU
Model | Size | Context | VRAM used | Speed |
---|---|---|---|---|
Llama 2 / Llama 3.1 | 8B | 2,048 t | 5 GB | 175 t/s |
Llama 2 / Llama 3.1 | 13B | 2,048 t | 9 GB | 90 t/s |
LLaMA | 33B | 2,048 t | 21 GB | 41 t/s |
Quantization: Balancing Performance and Accuracy
One of the key techniques enabling the use of these large models on consumer hardware is quantization (compression). This process reduces the precision of the model’s weights, significantly decreasing memory requirements and computational demands. The most common quantization levels for home use are 4-bit and 8-bit.
4-bit quantized models offer a higher level of compression, allowing larger models to fit within the constraints of consumer GPUs. These models run efficiently on GPUs with lower VRAM capacities, making them an excellent choice for users with more modest hardware. However, this compression comes at a cost of some reduction in model accuracy.
8-bit quantized models strike a balance between compression and accuracy. They require more GPU memory and computational power than their 4-bit counterparts but generally offer higher accuracy. These models are well-suited for users with high-end consumer GPUs that have more VRAM.
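As a back-of-the-envelope check, the weight footprint of a quantized model is roughly the parameter count multiplied by the bits per weight, divided by eight. The sketch below is an approximation only; real quantization formats (GGUF, GPTQ, EXL2) store per-group scales and keep some tensors at higher precision, and you also need headroom for the KV cache and activations:

```python
# Rough estimate of the memory footprint of quantized model weights.
# Real formats (GGUF, GPTQ, EXL2) add per-group scales and keep some
# tensors at higher precision, so expect actual files to be ~10-20% larger.

def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in gigabytes."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for params in (8, 13, 70):
    for bits in (4, 8):
        print(f"{params}B @ {bits}-bit: ~{weight_footprint_gb(params, bits):.1f} GB")

# A 13B model is ~6.5 GB at 4-bit but ~13 GB at 8-bit, which is why the
# 4-bit version fits a 12 GB card while the 8-bit version wants a 24 GB RTX 3090.
```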
CPU Considerations
While the GPU handles the bulk of the computational load, the CPU plays a crucial supporting role in the overall system performance. For users running Llama 2 or Llama 3.1 primarily on the GPU, the CPU’s main tasks involve data loading, preprocessing, and managing system resources. High-end consumer CPUs like the Intel Core i9-13900K or AMD Ryzen 9 7950X provide ample processing power for these tasks.
CPU-based inference is another popular approach for running large language models. Software solutions like llama.cpp (and frontends such as LM Studio and Ollama), combined with the GGUF model format, allow for split (VRAM and RAM) and pure CPU inference. This approach is particularly useful for users who do not have access to high-end GPUs or wish to run models that exceed their available VRAM.
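For readers who prefer scripting over the llama.cpp command line, the llama-cpp-python bindings expose the same GGUF loading and layer-offloading behavior. The snippet below is a minimal sketch: the model path is a placeholder, and n_gpu_layers should be tuned to however many layers fit in your VRAM (0 keeps everything in RAM for pure CPU inference, -1 offloads every layer to the GPU):

```python
# Minimal sketch of split (GPU + CPU) inference with llama-cpp-python.
# The model path is a placeholder; tune n_gpu_layers to your VRAM.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-13b.Q4_K_M.gguf",  # hypothetical local GGUF file
    n_gpu_layers=35,  # layers offloaded to VRAM; the rest stay in system RAM
    n_ctx=2048,       # context window in tokens
)

output = llm(
    "Explain why memory bandwidth limits CPU inference speed.",
    max_tokens=128,
)
print(output["choices"][0]["text"])
```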
When considering CPU-based inference, it's important to note that performance is closely tied to memory bandwidth. This is because generating a single token requires reading the entire model from memory once. That's why processors supporting higher memory bandwidth and more memory channels, such as those compatible with DDR5-6000 memory, can improve inference speeds.
This is an example of running llama.cpp with a Ryzen 7 3700X and 128GB RAM @ 3600 MHz.
GGML Model | Memory per Token | Load Time | Sample Time | Predict Time | Total Time |
---|---|---|---|---|---|
LLaMA-7B 4-bit | 14434244 bytes | 1270.15 ms | 325.76 ms | 15147.15 ms / 8.5 t/s | 17077.88 ms |
LLaMA-13B 4-bit | 22439492 bytes | 2946.00 ms | 86.11 ms | 7358.48 ms / 4.62 t/s | 11019.28 ms |
LLaMA-30B 4-bit | 43387780 bytes | 6666.53 ms | 332.71 ms | 68779.27 ms / 1.87 t/s | 77333.97 ms |
LLaMA-65B 4-bit | 70897348 bytes | 14010.35 ms | 335.09 ms | 140527.48 ms / 0.91 t/s | 157951.48 ms |
RAM and Memory Bandwidth
The importance of system memory (RAM) in running Llama 2 and Llama 3.1 cannot be overstated. For GPU-based inference, 16 GB of RAM is generally sufficient for most use cases, allowing the entire model to be held in memory without resorting to disk swapping. However, for larger models, 32 GB or more of RAM can provide a performance boost.
These are the memory (RAM) requirements for LLaMA models run on the CPU:
GGUF Model | Original size | Quantized size (4-bit) | Quantized size (5-bit) | Quantized size (8-bit) |
---|---|---|---|---|
7B | 13 GB | 3.9 – 7.5 GB | 7.5 – 8.5 GB | 8.5 – 10.0 GB |
13B | 24 GB | 7.8 – 11 GB | 11.5 – 13.5 GB | 13.5 – 17.5 GB |
30B | 60 GB | 19.5 – 23.0 GB | 23.5 – 27.5 GB | 28.5 – 38.5 GB |
65B | 120 GB | 38.5 – 47.0 GB | 47.0 – 52.0 GB | 71.0 – 80.0 GB |
In CPU-based inference scenarios, the relationship between the CPU and memory becomes even more critical. The bandwidth between the CPU and RAM directly impacts inference speed. For instance, a system with a Core i9-10900X (supporting 4 memory channels) and DDR4-3600 memory can achieve a throughput of about 115 GB/s. With a 13 GB model, this translates to an inference speed of approximately 8 tokens per second, regardless of the CPU’s clock speed or core count.
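Because every generated token requires one full pass over the weights, an upper bound on CPU inference speed is simply memory bandwidth divided by model size. The sketch below reproduces the estimates used here and in the table that follows; these are theoretical ceilings, and real-world speeds land somewhat below them:

```python
# Theoretical ceiling on CPU inference speed: each token requires reading
# all model weights once, so tokens/s <= memory bandwidth / model size.

def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

# Core i9-10900X, quad-channel DDR4-3600 (~115 GB/s) with a 13 GB model
print(max_tokens_per_second(115, 13))   # ~8.8 tokens/s ceiling

# Core i9-13900K, dual-channel DDR5-5600 (~89.6 GB/s) with an ~8 GB 4-bit 13B model
print(max_tokens_per_second(89.6, 8))   # ~11.2 tokens/s ceiling
```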
Inference speed for 13B model with 4-bit quantization, based on memory (RAM) speed when running on CPU:
RAM speed | CPU | CPU channels | Bandwidth | *Inference |
---|---|---|---|---|
DDR4-3600 | Ryzen 5 3600 | 2 | 56 GB/s | ~ 7 tokens/s |
DDR4-3200 | Ryzen 5 5600X | 2 | 51 GB/s | ~ 6.3 tokens/s |
DDR5-5600 | Core i9-13900K | 2 | 89.6 GB/s | ~ 11.2 tokens/s |
DDR4-2666 | Core i5-10400f | 2 | 41.6 GB/s | ~ 5.1 tokens/s |
*These are theoretical peak figures. The actual speed will be lower and will depend on the OS and system load.
Storage: Fast Access to Model Data
While often overlooked, storage plays a crucial role in the overall performance of a system running Llama 2 or Llama 3.1. A high-speed NVMe SSD with PCIe 4.0 support is recommended for storing model files and datasets. This ensures rapid data transfer between storage and system RAM, minimizing bottlenecks during model loading and data preprocessing.
A minimum of 1 TB storage is advisable, but 2 TB or 4 TB configurations provide more flexibility for storing multiple models, datasets, and generated outputs. The sequential read and write speeds of modern NVMe SSDs contribute significantly to reducing model load times and enhancing overall system responsiveness.
Motherboard: The Foundation of Your AI Workstation
For users building a single-GPU system, a solid mid-range motherboard that matches the chosen CPU socket will suffice. However, those considering a dual-GPU setup, particularly with cards like the RTX 3090, need to pay special attention to motherboard selection.
Key considerations for dual-GPU setups include:
- PCIe slot layout: Ensure there are two PCIe slots with adequate spacing between them, as the RTX 3090 is a 3-slot card.
- PCIe bifurcation support: The motherboard should be able to run two PCIe x16 slots in an x8/x8 configuration, splitting the CPU's 16 lanes between the GPUs, for optimal performance with dual GPUs.
- Power delivery: Robust VRM design to handle the power demands of high-end CPUs and multiple GPUs.
- Connectivity: Ample USB ports and M.2 slots for storage expansion.
Power Supply and Cooling: Keeping Your AI Rig Running Smoothly
Running Llama 2 or Llama 3.1 models, especially on high-end GPUs, can be power-intensive. A high-quality power supply unit (PSU) with sufficient wattage is crucial for system stability. For single-GPU setups, a 750W or 850W PSU is generally sufficient. Dual-GPU configurations may require 1200W or higher PSUs to ensure stable operation under load.
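A quick way to sanity-check PSU sizing is to add up the board power of the major components and leave roughly 30-40% headroom for transient spikes, which high-end GPUs like the RTX 3090 are known to produce. The figures below are rough assumed TDP values, not measurements:

```python
# Rough PSU sizing: sum approximate component power draw and add headroom
# for transient spikes. TDP values here are assumptions, not measurements.

def recommended_psu_watts(gpu_tdp: int, gpu_count: int, cpu_tdp: int,
                          other: int = 100, headroom: float = 1.35) -> float:
    """Suggested PSU wattage with ~35% headroom over steady-state draw."""
    return (gpu_tdp * gpu_count + cpu_tdp + other) * headroom

print(recommended_psu_watts(350, 1, 105))  # one RTX 3090 + mid-range CPU: ~750 W
print(recommended_psu_watts(350, 2, 105))  # dual RTX 3090: ~1220 W, hence a 1200 W+ PSU
```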
Equally important is maintaining optimal thermal conditions. Large language models can put significant stress on GPUs and CPUs, leading to sustained high temperatures. A well-designed case with good airflow, coupled with effective CPU and GPU cooling solutions, is essential for maintaining performance and longevity of components.
Comparing Performance: Dual GPUs vs. Apple Silicon
While dual-GPU setups using RTX 3090 or RTX 4090 cards offer impressive performance for running Llama 2 and Llama 3.1 models, it’s worth considering alternative platforms. Apple’s M1, M2, and M3 series of processors, particularly in their Pro, Max, and Ultra configurations, have shown remarkable capabilities in AI workloads.
We were particularly impressed by the M2 Max when testing Llama: its unified memory architecture (UMA) and high memory bandwidth clearly contributed to efficient data handling. This architecture can be particularly beneficial for running large language models. For instance, the M2 Max chip supports up to 400 GB/s of memory bandwidth, while the M2 Ultra pushes this to an impressive 800 GB/s.
However, when it comes to raw inference speed for the largest models, dual high-end NVIDIA GPUs still hold an edge. A dual RTX 4090 setup can achieve speeds of around 20 tokens per second with a 65B model, while two RTX 3090s manage about 15 tokens per second. In comparison, even the most powerful Apple Silicon chips struggle to match these speeds for the largest models.
Building Your Llama Rig
After exploring the hardware requirements for running Llama 2 and Llama 3.1 models, let’s summarize the key points and provide a step-by-step guide to building your own Llama rig.
Key Takeaways:
- GPU is crucial: A high-end GPU like the NVIDIA GeForce RTX 3090 with 24GB VRAM is ideal for running Llama models efficiently.
- CPU matters: While not as critical as the GPU, a strong CPU helps with data loading and preprocessing.
- RAM requirements: 32GB or more of fast RAM (DDR4-3600 or better) is recommended for optimal performance.
- Storage speed: An NVMe SSD with PCIe 4.0 support ensures fast model loading and data access.
- Cooling is essential: Proper cooling solutions are necessary to maintain performance during long inference sessions.
- Power supply: A high-quality PSU with sufficient wattage (850W+) is crucial for system stability.
LLM build based on RTX 3090 24GB
Type | Item | Price |
---|---|---|
CPU | AMD Ryzen 5 7600X 4.7 GHz 6-Core Processor | $209.99 |
CPU Cooler | Noctua NH-D15 82.5 CFM CPU Cooler | $109.95 |
Motherboard | Asus PROART B650-CREATOR ATX AM5 Motherboard | $229.99 |
Memory | G.Skill Flare X5 64 GB (2 x 32 GB) DDR5-5600 CL36 Memory | $179.99 |
Storage | Samsung 980 Pro 1 TB M.2-2280 PCIe 4.0 X4 NVME Solid State Drive | $89.99 |
Video Card | RTX 3090 24 GB Video Card (Secondhand) | ~ $650 @ Ebay |
Case | Lian Li O11 Dynamic EVO XL ATX Full Tower Case | $205.81 |
Power Supply | Corsair RM850x (2018) 850 W 80+ Gold Certified Fully Modular ATX Power Supply | $229.98 |
Total | | $1905.70 |
Expand and upgrade:
- Consider adding more storage as you accumulate models and datasets.
- This build has room for expansion into a dual RTX 3090 setup. Just add a higher-wattage PSU, such as the Corsair HX1200 Platinum 1200 W 80+ Platinum, and a few (~5) be quiet! Pure Wings 2 87 CFM 120 mm case fans, and you will be running 70B parameter models in no time.
Conclusion
Running Llama 2 and Llama 3.1 models locally opens up exciting possibilities for AI enthusiasts, researchers, and developers. While the hardware requirements may seem daunting, careful selection of components can result in a system capable of impressive performance. Whether opting for a high-end GPU setup or leveraging CPU-based inference, the key lies in balancing the various components to create a harmonious system tailored to your specific needs and budget.
As these models continue to evolve and new optimization techniques emerge, we can expect the hardware landscape to shift as well. Staying informed about the latest developments in both AI models and hardware will be crucial for those looking to push the boundaries of what’s possible with home-based AI systems.
Allan Witt
Allan Witt is co-founder and editor-in-chief of Hardware-corner.net. Computers and the web have fascinated me since I was a child. In 2011 I started training as an IT specialist in a medium-sized company and started a blog at the same time. I really enjoy blogging about tech. After successfully completing my training, I worked as a system administrator in the same company for two years. As a part-time job, I started tinkering with pre-built PCs and building custom gaming rigs at a local hardware shop. The desire to build PCs full-time grew stronger, and now this is my full-time job.
I’m curious about your testing setup. Did you actually build systems with dual 3090s, or were these benchmarks run in a different environment?
Thanks for the question! We used both physical test benches and cloud instances. For dual 3090s specifically, we used our own test bench with an X299 motherboard (48 PCIe lanes), an i9-7900X processor, and 64 GB of RAM. This setup allowed us to test a wider range of hardware.