GPU and Apple Silicon Benchmarks with Large Language Models
This chart showcases a range of benchmarks for GPU performance while running large language models like LLaMA and Llama-2, using various quantizations. The data covers a range of hardware, from Apple Silicon M-series chips to Nvidia GPUs, helping you make an informed decision if you’re considering running a large language model locally.
To switch GPUs on or off, use the legend below the graph. When you hover over a data point, you’ll see additional details about each model, such as an estimated system price.
Memory Boundary Conditions for GPUs Running Large Language Models
Our benchmarks emphasize the crucial role of VRAM capacity when running large language models. Even if a GPU can handle a given model size and quantization at a short context (for instance, 512 tokens), it may struggle or fail once the context grows, because the KV cache also has to fit in VRAM. Take the RTX 3090, which comes with 24 GB of VRAM, as an example. It handled the 30-billion-parameter (30B) Airoboros Llama-2 model with 5-bit quantization (Q5_0), consuming around 23 GB of VRAM. However, expanding the context caused the GPU to run out of memory. This illustrates why users need to balance model size, quantization level, and context length.
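To get a feel for where that boundary lies, you can approximate VRAM use as the quantized model size plus the KV cache, which grows linearly with context length. Below is a minimal Python sketch of that estimate; the layer count and hidden dimension are assumptions based on the original 30B LLaMA architecture, and llama.cpp’s compute buffers add some overhead on top.

```python
# Rough VRAM estimate for a quantized model: weights + fp16 KV cache.
# Assumes the original 30B LLaMA geometry (60 layers, hidden size 6656);
# llama.cpp also allocates compute buffers, so real usage is somewhat higher.

def kv_cache_gb(n_layers: int, hidden_dim: int, n_ctx: int, bytes_per_elem: int = 2) -> float:
    """K and V tensors for every layer and every context position, stored as fp16."""
    return 2 * n_layers * hidden_dim * n_ctx * bytes_per_elem / 1024**3

model_gb = 23.2  # 30B model at 5-bit quantization (see the size table below)

for n_ctx in (512, 2048, 4096):
    total = model_gb + kv_cache_gb(60, 6656, n_ctx)
    print(f"ctx={n_ctx:5d}: ~{total:.1f} GB")
# Larger contexts push the total past 24 GB, which matches the
# out-of-memory behaviour described above for the RTX 3090.
```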
Key Components of the Benchmark
GPUs Tested
We’ve included a variety of consumer-grade GPUs that are suitable for local setups. For comparison, the Nvidia A100 80GB sells on the second-hand market for around $15,000, while a dual RTX 4090 setup, which can run 70B models at a reasonable speed, costs only about $4,000 brand new.
Model | VRAM | Bandwidth | TFLOPS | 7B 4-bit TG |
---|---|---|---|---|
RTX 3060 | 12 GB | 360 GB/s | 12.74 | 59.86 t/s
RTX 3080 Ti | 12 GB | 912 GB/s | 34.10 | 108.46 t/s
RTX 4080 | 16 GB | 716 GB/s | 48.74 | 112.85 t/s
RTX 4000 Ada | 20 GB | 360 GB/s | 26.73 | 64.53 t/s
RTX 3090 | 24 GB | 936 GB/s | 35.58 | 120.6 t/s
RTX 4090 | 24 GB | 1,008 GB/s | 82.58 | 139.37 t/s
(2x) RTX 3060 | 24 GB | 360 GB/s | 12.74 | 59.86 t/s
(2x) RTX 3090 | 48 GB | 936 GB/s | 35.58 | 120.6 t/s
(2x) RTX 4090 | 48 GB | 1,008 GB/s | 82.58 | 139.37 t/s
RTX A6000 | 48 GB | 768 GB/s | 38.71 | 107.11 t/s
M3 Max 40-GPU | 48 GB | 400 GB/s | 13.6 | 66.31 t/s
(3x) RTX 3090 | 72 GB | 936 GB/s | 35.58 | 120.6 t/s
M4 Max 40-GPU | 128 GB | 546 GB/s | 18.4 | 69.77 t/s
M2 Ultra 76-GPU | 192 GB | 800 GB/s | 27.2 | 93.86 t/s
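One pattern worth noting in the table above: token generation speed tracks memory bandwidth far more closely than TFLOPS, because generating each token requires streaming roughly the entire quantized model from VRAM. The sketch below turns that into a back-of-the-envelope estimate; the 55% efficiency factor is an assumption chosen for illustration, not a measured constant.

```python
# Bandwidth-bound estimate for token generation (TG) speed: each token
# reads roughly the whole quantized model from VRAM, so t/s is capped
# near bandwidth / model size. The efficiency factor below is an
# illustrative assumption, not a measurement.

def tg_estimate(bandwidth_gb_s: float, model_gb: float, efficiency: float = 0.55) -> float:
    return bandwidth_gb_s / model_gb * efficiency

MODEL_GB = 3.8  # 7B model at 4-bit quantization

for name, bw in [("RTX 3060", 360), ("RTX 3090", 936), ("RTX 4090", 1008)]:
    print(f"{name}: ~{tg_estimate(bw, MODEL_GB):.0f} t/s")
# Prints roughly 52, 135 and 146 t/s, in the same ballpark as the
# measured 59.86, 120.6 and 139.37 t/s in the table above.
```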
Model Quantization
The benchmark includes model sizes ranging from 7 billion (7B) to 120 billion (120B) parameters, illustrating the influence of various quantizations on processing speed. Our tests were conducted on the LLaMA, Llama-2, and Mixtral MoE models; however, you can make rough estimates of the inference speed for other models, such as Mistral and Yi, based on the size of their weights in gigabytes. The table below lists the sizes of the models we used, grouped by quantization; a quick way to estimate the size of a model that isn’t listed is sketched after the table.
Model Name | Parameter Count | Model Quantization | Model Size (GB) |
---|---|---|---|
7B_q4_0 | 7B | 4-bit | 3.8 |
7B_q5_0 | 7B | 5-bit | 4.6 |
7B_q8_0 | 7B | 8-bit | 7.1 |
13B_q4_0 | 13B | 4-bit | 7.8 |
13B_q5_0 | 13B | 5-bit | 8.9 |
7B_f16 | 7B | 16-bit | 13.4 |
13B_q8_0 | 13B | 8-bit | 13.8 |
30B_q4_0 | 30B | 4-bit | 19.1 |
30B_q5_0 | 30B | 5-bit | 23.2 |
13B_f16 | 13B | 16-bit | 24.2 |
8x7B_q5_0 | 8x7B (46B) | 5-bit | 32.2 |
30B_q8_0 | 30B | 8-bit | 35.9 |
65B_q4_0 | 65B | 4-bit | 36.8 |
70B_q4_0 | 70B | 4-bit | 38.9 |
30B_f16 | 30B | 16-bit | 60 |
120B_Q4 | 120B | 4-bit | 66 |
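If a model or quantization you care about is not listed, its file size can be approximated from the parameter count and the effective bits per weight of the format. The values below reflect llama.cpp’s block layouts (Q4_0 stores 32 weights plus an fp16 scale, roughly 4.5 bits per weight) and are assumptions intended for a ballpark figure; real GGUF files differ slightly because some tensors are kept at higher precision and the nominal 7B/13B/70B labels are not exact parameter counts.

```python
# Ballpark model-size estimate from parameter count and effective bits
# per weight. Bits-per-weight values approximate llama.cpp's block
# formats; real files differ by a few percent (metadata, tensors kept
# at higher precision, nominal vs. exact parameter counts).

BITS_PER_WEIGHT = {"q4_0": 4.5, "q5_0": 5.5, "q8_0": 8.5, "f16": 16.0}

def model_size_gb(params_billions: float, quant: str) -> float:
    return params_billions * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9

print(f"7B  q4_0: ~{model_size_gb(7, 'q4_0'):.1f} GB")   # table: 3.8 GB
print(f"13B q5_0: ~{model_size_gb(13, 'q5_0'):.1f} GB")  # table: 8.9 GB
print(f"70B q4_0: ~{model_size_gb(70, 'q4_0'):.1f} GB")  # table: 38.9 GB
```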
Speed Measurement
Performance is quantified as tokens per second (t/s), representing the average speed over three runs with a 512-token context and 1,024 generated tokens.
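Put differently, each reported figure is simply the number of generated tokens divided by the wall-clock generation time, averaged over the three runs. A trivial sketch, with made-up timings rather than numbers from our benchmarks:

```python
# Average tokens-per-second over several runs.
# The (tokens, seconds) pairs below are illustrative placeholders.

runs = [(1024, 8.49), (1024, 8.53), (1024, 8.47)]

speeds = [tokens / seconds for tokens, seconds in runs]
print(f"average: {sum(speeds) / len(speeds):.2f} t/s")  # ~120.5 t/s
```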
Pricing Information
The chart includes estimated prices for computer systems equipped with the tested GPUs or Apple Silicon chips. Prices reflect the cost of complete systems and incorporate new, second-hand, and refurbished units to account for market availability.
Benchmarking Environment
Data was gathered from user benchmarks across the web as well as our own tests. We used Ubuntu 22.04, CUDA 12.1, and llama.cpp (build: 8504d2d0, 2097).
For the dual-GPU setups, we tested both the `-sm row` and `-sm layer` options in llama.cpp. With `-sm row`, the dual RTX 3090 was about 3 t/s faster at inference, whereas the dual RTX 4090 performed better with `-sm layer`, gaining about 5 t/s. Note, however, that the `-sm row` option reduces prompt processing speed by approximately 60%.
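If you want to reproduce this comparison, the split modes can be scripted against llama.cpp’s bundled llama-bench tool. The sketch below is a rough Python wrapper, assuming a reasonably recent build in which `-sm` selects the split mode; the binary location and model path are placeholders to adjust for your own setup.

```python
# Hypothetical wrapper comparing llama.cpp's multi-GPU split modes with
# llama-bench. Binary and model paths are placeholders for your setup.
import subprocess

LLAMA_BENCH = "./llama-bench"            # adjust to your build directory
MODEL = "models/llama-2-70b.Q4_0.gguf"   # placeholder model path

for split_mode in ("row", "layer"):
    subprocess.run(
        [
            LLAMA_BENCH, "-m", MODEL,
            "-sm", split_mode,                    # row vs. layer split across GPUs
            "-p", "512", "-n", "1024", "-r", "3"  # match the settings used above
        ],
        check=True,
    )
```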
Tests were conducted on systems sourced from two cloud GPU providers, vast.ai and runpod.io.
This GPU benchmark graph is a work in progress and will be updated with more GPUs regularly.