GPU and Apple Silicon Benchmarks with Large Language Models

This chart showcases a range of benchmarks for GPU performance while running large language models like LLaMA and Llama-2, using various quantizations. The data covers everything from Apple Silicon M-series chips to Nvidia GPUs, helping you make an informed decision if you’re considering running a large language model locally.

To switch GPUs on or off, use the legend below the graph. When you hover over a data point, you’ll see additional details about each model, such as an estimated system price.

Memory Boundary Conditions for GPUs Running Large Language Models

Our benchmarks emphasize the crucial role of VRAM capacity when running large language models. Even if a GPU can handle a given model size and quantization at a short context of, say, 512 tokens, it may slow down or fail outright at longer contexts because it runs out of VRAM. Take the RTX 3090, which comes with 24 GB of VRAM, as an example. It handled the 30 billion parameter (30B) Airoboros Llama-2 model with 5-bit quantization (q5_0), consuming around 23 GB of VRAM. However, expanding the context caused the GPU to run out of memory. This scenario illustrates why model size, quantization level, and context length have to be balanced against the available VRAM.
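As a rough rule of thumb, you can estimate whether a model will fit before downloading it: take the size of the quantized weights and add room for the KV cache, which grows with the context length. The Python sketch below is a simplified estimate, not llama.cpp's actual memory accounting; the per-token KV-cache figure is an assumed placeholder that varies with model architecture.

```python
# Rough VRAM estimate: quantized weights plus KV cache.
# kv_mb_per_token is an assumed placeholder (it depends on layer count,
# hidden size, and KV precision); runtime overhead is ignored entirely.

def estimate_vram_gb(model_file_gb: float,
                     context_tokens: int,
                     kv_mb_per_token: float = 1.0) -> float:
    """Very rough estimate of VRAM needed for weights + KV cache."""
    return model_file_gb + context_tokens * kv_mb_per_token / 1024

# Example: a 30B model at 5-bit quantization is roughly 23 GB on disk
# (see the quantization table further down).
for ctx in (512, 2048, 4096):
    print(f"{ctx} tokens -> ~{estimate_vram_gb(23.2, ctx):.1f} GB")
```

With these assumed numbers, the 23 GB model squeezes onto a 24 GB card at a 512-token context but not at 2,048 tokens, which matches the behavior described above.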

Key Components of the Benchmark

GPUs Tested

We’ve included a variety of consumer-grade GPUs that are suitable for local setups. For comparison, the Nvidia A100 80GB sells on the second-hand market for around $15,000, whereas a dual RTX 4090 setup, which can run 70B models at a reasonable speed, costs around $4,000 brand new.

Model | VRAM | Bandwidth | TFLOPS | 7B 4-bit TG
RTX 3060 | 12GB | 360 GB/s | 12.74 | 59.86 t/s
RTX 3080 Ti | 12GB | 912 GB/s | 34.10 | 108.46 t/s
RTX 4080 | 16GB | 716 GB/s | 48.74 | 112.85 t/s
RTX 4000 Ada | 20GB | 360 GB/s | 26.73 | 64.53 t/s
RTX 3090 | 24GB | 936 GB/s | 35.58 | 120.6 t/s
RTX 4090 | 24GB | 1,008 GB/s | 82.58 | 139.37 t/s
(2x) RTX 3060 | 24GB | 360 GB/s | 12.74 | 59.86 t/s
(2x) RTX 3090 | 48GB | 936 GB/s | 35.58 | 120.6 t/s
(2x) RTX 4090 | 48GB | 1,008 GB/s | 82.58 | 139.37 t/s
RTX A6000 | 48GB | 768 GB/s | 38.71 | 107.11 t/s
M3 Max 40-GPU | 48GB | 400 GB/s | 13.6 | 66.31 t/s
(3x) RTX 3090 | 72GB | 936 GB/s | 35.58 | 120.6 t/s
M4 Max 40-GPU | 128GB | 546 GB/s | 18.4 | 69.77 t/s
M2 Ultra 76-GPU | 192GB | 800 GB/s | 27.2 | 93.86 t/s
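Token generation at these model sizes is largely limited by memory bandwidth: every generated token has to read the full set of weights from VRAM. The Python sketch below turns that observation, plus an assumed efficiency factor, into a rough estimate you can compare against the measured values in the table above; it is an approximation, not a benchmark.

```python
# Back-of-the-envelope token-generation estimate: each token streams the
# full weight file from VRAM once, so bandwidth sets an upper bound.
# The efficiency factor is an assumed fudge factor, not a measured value.

def estimate_tg_tps(bandwidth_gb_s: float,
                    model_size_gb: float,
                    efficiency: float = 0.6) -> float:
    return bandwidth_gb_s * efficiency / model_size_gb

# A 7B 4-bit model is roughly 3.8 GB (see the quantization table below).
for name, bw in [("RTX 3060", 360), ("RTX 3090", 936), ("RTX 4090", 1008)]:
    print(f"{name}: ~{estimate_tg_tps(bw, 3.8):.0f} t/s (estimated)")
```

The same logic explains why adding a second identical GPU barely changes single-stream generation speed, which is consistent with the dual-GPU rows in the table showing the same speeds as the single cards.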

Model Quantization

The benchmark includes model sizes ranging from 7 billion (7B) to 120 billion (120B) parameters, illustrating the influence of various quantizations on processing speed. Our tests were conducted on the LLaMA, Llama-2, and Mixtral MoE models; however, you can make rough estimates about the inference speed for other models, such as Mistral and Yi, based on the size of their weights in gigabytes. The table below lists the sizes of the models we used, categorized by quantization.

Model Name | Parameter Count | Quantization | Model Size (GB)
7B_q4_0 | 7B | 4-bit | 3.8
7B_q5_0 | 7B | 5-bit | 4.6
7B_q8_0 | 7B | 8-bit | 7.1
13B_q4_0 | 13B | 4-bit | 7.8
13B_q5_0 | 13B | 5-bit | 8.9
7B_f16 | 7B | 16-bit | 13.4
13B_q8_0 | 13B | 8-bit | 13.8
30B_q4_0 | 30B | 4-bit | 19.1
30B_q5_0 | 30B | 5-bit | 23.2
13B_f16 | 13B | 16-bit | 24.2
8x7B_q5_0 | 8x7B (46B) | 5-bit | 32.2
30B_q8_0 | 30B | 8-bit | 35.9
65B_q4_0 | 65B | 4-bit | 36.8
70B_q4_0 | 70B | 4-bit | 38.9
30B_f16 | 30B | 16-bit | 60
120B_Q4 | 120B | 4-bit | 66
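If a model you care about is not listed, you can approximate its file size from the parameter count and the effective bits per weight of its quantization. The Python sketch below assumes about 4.5 effective bits per weight for 4-bit (q4_0) files to account for block scales; this is an approximation, and exact sizes vary by quantization type and architecture.

```python
# Approximate quantized file size from parameter count and the effective
# bits per weight. 4.5 bpw for q4_0 and 16 bpw for f16 are assumptions
# used for illustration; real GGUF files differ slightly.

def estimate_model_size_gb(params_billion: float, effective_bpw: float) -> float:
    return params_billion * 1e9 * effective_bpw / 8 / 1e9

print(round(estimate_model_size_gb(7, 4.5), 1))    # ~3.9 GB  (table: 3.8 GB for 7B_q4_0)
print(round(estimate_model_size_gb(70, 4.5), 1))   # ~39.4 GB (table: 38.9 GB for 70B_q4_0)
print(round(estimate_model_size_gb(7, 16.0), 1))   # ~14.0 GB (table: 13.4 GB for 7B_f16)
```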

Speed Measurement

Performance is quantified in tokens per second (t/s) and reported as the average speed across three test runs, each with a 512-token context and 1,024 generated tokens.
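In other words, each run divides the number of generated tokens by the time spent generating them, and the reported figure is the average across the three runs. A minimal sketch of that calculation, with hypothetical timings:

```python
# Tokens per second for one run = generated tokens / seconds spent generating;
# the reported number is the mean across runs. Timings below are hypothetical.

def average_tps(runs: list[tuple[int, float]]) -> float:
    """runs: (tokens_generated, seconds_elapsed) for each test run."""
    return sum(tokens / seconds for tokens, seconds in runs) / len(runs)

print(round(average_tps([(1024, 8.6), (1024, 8.4), (1024, 8.5)]), 1))  # ~120.5 t/s
```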

Pricing Information

The chart includes estimated prices for computer systems equipped with the tested GPUs or Apple Silicon chips. Prices reflect the cost of complete systems and include new, second-hand, and refurbished units to account for market availability.

Benchmarking Environment

Data was gathered from user benchmarks across the web and from our own tests. We used Ubuntu 22.04, CUDA 12.1, and llama.cpp build 2097 (commit 8504d2d0).

For the dual-GPU setups, we tested both the -sm row and -sm layer options in llama.cpp. With -sm row, the dual RTX 3090 generated about 3 tokens per second (t/s) more, whereas the dual RTX 4090 performed better with -sm layer, gaining around 5 t/s. However, it’s important to note that the -sm row option reduces prompt processing speed by approximately 60%.
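If you want to reproduce this comparison on your own multi-GPU system, it can be scripted around llama.cpp's llama-bench tool. The Python sketch below is a minimal example and assumes a build whose llama-bench accepts the -m, -p, -n, -ngl, and -sm flags; the binary and model paths are placeholders, so check them against your own setup.

```python
# Sketch: run llama-bench with both split modes and print its raw output.
# Binary path, model path, and flag set are assumptions; verify against
# `./llama-bench --help` for your build.
import subprocess

MODEL = "models/llama-2-70b.Q4_0.gguf"  # hypothetical path

for split_mode in ("layer", "row"):
    cmd = [
        "./llama-bench",
        "-m", MODEL,
        "-p", "512",        # prompt (context) tokens
        "-n", "1024",       # generated tokens
        "-ngl", "99",       # offload all layers to the GPUs
        "-sm", split_mode,  # how weights are split across GPUs
    ]
    print(f"--- split mode: {split_mode} ---")
    print(subprocess.run(cmd, capture_output=True, text=True).stdout)
```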

Tests were conducted on systems sourced from two cloud GPU providers, vast.ai and runpod.io.

This GPU benchmark graph is a work in progress and will be updated regularly with more GPUs.

Allan Witt

Allan Witt is co-founder and editor-in-chief of Hardware-corner.net. Computers and the web have fascinated me since I was a child. In 2011 I started training as an IT specialist at a medium-sized company and launched a blog at the same time. I really enjoy blogging about tech. After successfully completing my training, I worked as a system administrator at the same company for two years. As a part-time job I started tinkering with pre-built PCs and building custom gaming rigs at a local hardware shop. The desire to build PCs full-time grew stronger, and now this is my full-time job.
