GPU and Apple Silicon Benchmarks with Large Language Models
This chart showcases a range of benchmarks for GPU performance while running large language models like LLaMA and Llama-2, using various quantizations. The data covers hardware from Apple Silicon M-series chips to Nvidia GPUs, helping you make an informed decision if you’re considering running a large language model locally.
To switch GPUs on or off, use the legend below the graph. When you hover over a data point, you’ll see additional details about each model, such as an estimated system price.
Memory Boundary Conditions for GPUs Running Large Language Models
Our benchmarks emphasize the crucial role of VRAM capacity when running large language models. Even if a GPU can manage a given model size and quantization at a short context (for instance, 512 tokens), it may struggle or fail with larger contexts due to VRAM limitations. Take the RTX 3090, which comes with 24 GB of VRAM, as an example. It handled the 30-billion-parameter (30B) Airoboros Llama-2 model with 5-bit quantization (Q5_0), consuming around 23 GB of VRAM. However, expanding the context caused the GPU to run out of memory. This scenario illustrates the importance of balancing model size, quantization level, and context length.
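To see why the context length tips the balance, here is a minimal sketch of the extra memory the KV cache needs as the context grows. The layer count and hidden size are the published LLaMA-30B dimensions; the fp16 cache assumption and the helper function are ours, and llama.cpp’s real allocation differs somewhat, so treat the output as a ballpark.

```python
# Rough estimate of how the KV cache grows with context length.
# Dimensions are the published LLaMA-30B values (60 layers, hidden size 6656);
# the fp16 cache and the helper itself are our assumptions, so treat the
# numbers as a ballpark rather than llama.cpp's exact allocation.

def kv_cache_gb(context_len, n_layers=60, hidden_size=6656, bytes_per_elem=2):
    """Keys and values for every layer and every token of context, in fp16."""
    return 2 * n_layers * hidden_size * context_len * bytes_per_elem / 1024**3

for ctx in (512, 2048, 4096):
    print(f"context {ctx:>4}: ~{kv_cache_gb(ctx):.1f} GB of KV cache on top of the weights")

# context  512: ~0.8 GB of KV cache on top of the weights
# context 2048: ~3.0 GB of KV cache on top of the weights
# context 4096: ~6.1 GB of KV cache on top of the weights
```

With roughly 23 GB already taken by the Q5 weights and scratch buffers, even a couple of extra gigabytes of KV cache is enough to push past a 24 GB card, which matches what we observed on the RTX 3090.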
Key Components of the Benchmark
GPUs Tested
We’ve included a variety of consumer-grade GPUs that are suitable for local setups. For comparison, the Nvidia A100 80GB sells on the second-hand market for around $15,000. In contrast, a dual RTX 4090 setup, which can run 70B models at a reasonable speed, costs only about $4,000 brand new.
Model | VRAM | Bandwidth | TFLOPS | 7B 4-bit TG |
---|---|---|---|---|
RTX 3060 | 12 GB | 360 GB/s | 12.74 | 59.86 t/s |
RTX 3080 Ti | 12 GB | 912 GB/s | 34.10 | 108.46 t/s |
RTX 4080 | 16 GB | 716 GB/s | 48.74 | 112.85 t/s |
RTX 4000 Ada | 20 GB | 360 GB/s | 26.73 | 64.53 t/s |
RTX 3090 | 24 GB | 936 GB/s | 35.58 | 120.6 t/s |
RTX 4090 | 24 GB | 1,008 GB/s | 82.58 | 139.37 t/s |
(2x) RTX 3060 | 24 GB | 360 GB/s | 12.74 | 59.86 t/s |
(2x) RTX 3090 | 48 GB | 936 GB/s | 35.58 | 120.6 t/s |
(2x) RTX 4090 | 48 GB | 1,008 GB/s | 82.58 | 139.37 t/s |
RTX A6000 | 48 GB | 768 GB/s | 38.71 | 107.11 t/s |
M3 Max 40-GPU | 48 GB | 400 GB/s | 13.6 | 66.31 t/s |
(3x) RTX 3090 | 72 GB | 936 GB/s | 35.58 | 120.6 t/s |
M4 Max 40-GPU | 128 GB | 546 GB/s | 18.4 | 69.77 t/s |
M2 Ultra 76-GPU | 192 GB | 800 GB/s | 27.2 | 93.86 t/s |
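A useful rule of thumb is hiding in this table: single-stream token generation (TG) is mostly limited by memory bandwidth, because every generated token has to read the entire set of weights once. The sketch below checks that intuition against three systems from the table; the efficiency figure it produces is an observation from our data, not a guarantee.

```python
# Back-of-the-envelope check: single-stream token generation (TG) is largely
# memory-bandwidth bound, because every generated token reads all weights once.
# Bandwidth and measured speeds are taken from the table above; the efficiency
# factor that falls out of it is an observation, not a law.

model_size_gb = 3.8  # 7B model at 4-bit quantization (see the next table)

gpus = {  # name: (bandwidth in GB/s, measured 7B 4-bit TG in t/s)
    "RTX 3090": (936, 120.6),
    "RTX 4090": (1008, 139.37),
    "M2 Ultra": (800, 93.86),
}

for name, (bandwidth, measured) in gpus.items():
    ceiling = bandwidth / model_size_gb  # theoretical upper bound in t/s
    print(f"{name}: ceiling ~{ceiling:.0f} t/s, measured {measured} t/s "
          f"({measured / ceiling:.0%} of the ceiling)")

# RTX 3090: ceiling ~246 t/s, measured 120.6 t/s (49% of the ceiling)
# RTX 4090: ceiling ~265 t/s, measured 139.37 t/s (53% of the ceiling)
# M2 Ultra: ceiling ~211 t/s, measured 93.86 t/s (45% of the ceiling)
```

In practice the cards above land at roughly 45–55% of this bandwidth ceiling on the 7B 4-bit model, so dividing a GPU’s bandwidth by the model size in gigabytes and halving the result gives a workable first estimate for hardware we have not benchmarked yet.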
Model Quantization
The benchmark includes model sizes ranging from 7 billion (7B) to 120 billion (120B) parameters, illustrating the influence of various quantizations on processing speed. Our tests were conducted on the LLaMA, Llama-2 and Mixtral MoE models; however, you can make rough estimates of the inference speed for other models, such as Mistral and Yi, based on the size of their weights in gigabytes (see the sketch after the table below). The table displays the sizes of the models we used, categorized by their quantization.
Model Name | Parameter Count | Model Quantization | Model Size (GB) |
---|---|---|---|
7B_q4_0 | 7B | 4-bit | 3.8 |
7B_q5_0 | 7B | 5-bit | 4.6 |
7B_q8_0 | 7B | 8-bit | 7.1 |
13B_q4_0 | 13B | 4-bit | 7.8 |
13B_q5_0 | 13B | 5-bit | 8.9 |
7B_f16 | 7B | 16-bit | 13.4 |
13B_q8_0 | 13B | 8-bit | 13.8 |
30B_q4_0 | 30B | 4-bit | 19.1 |
30B_q5_0 | 30B | 5-bit | 23.2 |
13B_f16 | 13B | 16-bit | 24.2 |
8x7B_q5_0 | 8x7B (46B) | 5-bit | 32.2 |
30B_q8_0 | 30B | 8-bit | 35.9 |
65B_q4_0 | 65B | 4-bit | 36.8 |
70B_q4_0 | 70B | 4-bit | 38.9 |
30B_f16 | 30B | 16-bit | 60 |
120B_Q4 | 120B | 4-bit | 66 |
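If a model you care about is missing from this table, its size can be approximated from the parameter count and the effective bits per weight of the quantization. The sketch below is a rough estimate only; the per-block scale overhead and the exact parameter count of the nominal “7B” model are our assumptions.

```python
# Rough GGUF file-size estimate: parameters x effective bits per weight / 8.
# "Effective" bits include the per-block scales llama.cpp stores, so q4_0
# works out to roughly 4.5 bits per weight rather than a flat 4. The exact
# parameter count of the nominal "7B" model and the bit widths are assumptions.

BITS_PER_WEIGHT = {"q4_0": 4.5, "q5_0": 5.5, "q8_0": 8.5, "f16": 16.0}
params = 6.74e9  # the "7B" LLaMA model actually has about 6.7 billion weights

for quant, bits in BITS_PER_WEIGHT.items():
    size_gb = params * bits / 8 / 1e9
    print(f"7B {quant}: ~{size_gb:.1f} GB")

# 7B q4_0: ~3.8 GB   (table: 3.8 GB)
# 7B q5_0: ~4.6 GB   (table: 4.6 GB)
# 7B q8_0: ~7.2 GB   (table: 7.1 GB)
# 7B f16:  ~13.5 GB  (table: 13.4 GB)
```

Combined with the bandwidth rule of thumb from the GPU table above, this lets you ballpark both whether a model fits in VRAM and how fast it is likely to generate.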
Speed Measurement
Performance is quantified in tokens per second (t/s), representing the average speed over three runs with a 512-token context and 1,024 generated tokens.
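We collected these numbers with llama.cpp’s own tooling, but if you want to reproduce the measurement procedure quickly from Python, a rough equivalent using the llama-cpp-python bindings might look like the sketch below. The model path is a placeholder, and the timing includes prompt processing, so it slightly understates pure generation speed.

```python
# Minimal timing sketch (not our benchmark harness): average tokens per second
# over three runs. Requires the llama-cpp-python bindings; the model path is a
# placeholder, and the prompt is only a stand-in for a real 512-token context.
# Elapsed time includes prompt processing, so pure generation speed is a bit higher.
import time
from llama_cpp import Llama

llm = Llama(model_path="llama-2-7b.Q4_0.gguf", n_ctx=2048, n_gpu_layers=-1)
prompt = "word " * 512

speeds = []
for _ in range(3):
    start = time.time()
    out = llm(prompt, max_tokens=1024)
    generated = out["usage"]["completion_tokens"]
    speeds.append(generated / (time.time() - start))

print(f"average generation speed: {sum(speeds) / len(speeds):.1f} t/s")
```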
Pricing Information
The chart includes estimated prices for computer systems equipped with the tested GPUs or Apple Silicon chips. Prices reflect the cost of complete systems, covering new, second-hand, and refurbished units, to account for market availability.
Benchmarking Environment
Data was gathered from user benchmarks across the web and our personal benchmarks. We used Ubuntu 22.04, CUDA 12.1, and llama.cpp (build: 8504d2d0, 2097).
For the dual-GPU setups, we tested both the `-sm row` and `-sm layer` options in llama.cpp. With `-sm row`, the dual RTX 3090 achieved roughly 3 tokens per second (t/s) more during generation, whereas the dual RTX 4090 performed better with `-sm layer`, gaining about 5 t/s. However, it’s important to note that the `-sm row` option reduces prompt processing speed by approximately 60%.
Tests were conducted on systems sourced from two cloud GPU providers, vast.ai and runpod.io.
This GPU benchmark graph is a work in progress and will be updated with more GPUs regularly.
Allan Witt
Allan Witt is co-founder and editor-in-chief of Hardware-corner.net. Computers and the web have fascinated me since I was a child. In 2011 I started training as an IT specialist at a medium-sized company and launched a blog at the same time. I really enjoy blogging about tech. After successfully completing my training, I worked as a system administrator at the same company for two years. As a side job I started tinkering with pre-built PCs and building custom gaming rigs at a local hardware shop. The desire to build PCs full-time grew stronger, and now this is my full-time job.

4 Comments
I was looking to drop big bucks on either an M4 Max or the upcoming 5090. Thanks to your benchmark I’ve decided to still stick with the PC master race.
Remember to factor in the power and cooling requirements: 150 W (M4 Max) vs 600 W (5090, GPU only). If it’s pure performance you want, sure, it will be faster, but how long will it last before it melts?
Just a friendly reminder:
MBP M4 Max – 40-core GPU, 128 GB VRAM, ~5,300 eur
MBP M4 Max – 40-core GPU, 64 GB VRAM, ~4,300 eur
Mac mini M4 Pro – 20-core GPU, 64 GB VRAM, ~2,300 eur
2x RTX 4090 – 48 GB VRAM, ~6,500 eur (GPUs only)
2x RTX 5090 – 64 GB VRAM, ~7,000 eur (GPUs only)
Yes, currently (March 2025), with RTX 5090 shortages and the high price of the RTX 4090, it makes sense to consider a Mac. However, you still have the option of a second-hand RTX 3090. A dual RTX 3090 setup costs about 1,500 euros, and I still think this is the best option for local inference at the moment.
Thanks for the reminder, Helmut!