What Hardware Do You Need To Run 120B Large Language Models Locally
So you want to run one of those crazy huge 120 billion parameter language models like Goliath or Miqu locally on your own machine? That’s an ambitious goal, but definitely doable if you’ve got the right hardware setup.
The first thing you need to understand is that these massive models require an absolute ton of memory just for the model weights themselves. We’re talking around 67GB of memory for a decent 4-bit quantized model using the GGUF file format. And that’s before you even add the context window size you want to use!
Throw in a 4K token context, and now you’re looking at around 70GB of total memory to run this model locally. Some of the newer frankenmerge models, like the 12K-context Miqu variant, are even crazier, requiring about 77GB of memory, with roughly 9GB of overhead just for that 12K context length.
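If you want to ballpark these numbers yourself, the math is just parameter count times bits per weight, plus whatever the context (the KV cache) adds on top. Here’s a quick Python sketch of that arithmetic – the 4.5 bits-per-weight figure is a rough value for typical “4-bit” GGUF quants, and the context overheads are the approximate numbers quoted above, not exact measurements:

```python
# Back-of-the-envelope memory math for a 120B model. A "4-bit" GGUF quant
# typically works out to roughly 4.5 bits per weight in practice, which is
# where the ~67GB figure comes from.
def estimate_memory_gb(params_billion, bits_per_weight, context_overhead_gb):
    weights_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
    return weights_gb + context_overhead_gb

print(estimate_memory_gb(120, 4.5, 3))  # ~70.5GB with a 4K context
print(estimate_memory_gb(120, 4.5, 9))  # ~76.5GB for the 12K-context variant
```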
So how do you get that much memory for local inference? You’ve got three main options: going full GPU and using the VRAM, sticking to your CPU and system RAM, or splitting the load between VRAM and RAM. Each has its own pros and cons.
GPU for a 120B large language model
The GPU VRAM route is definitely the performance king if you’ve got the hardware for it. Consumer-grade high-end GPUs like the RTX 3090 and 4090 have ridiculously fast memory bandwidth – we’re talking over 900GB/s, which blows even cutting-edge DDR5 system RAM out of the water. The tradeoff is that you need a ton of VRAM.
Those 24GB RTX 3090 and 4090 cards sadly aren’t enough for a full 120B model by themselves. To go this route, you’ll want to look at a triple RTX 3090 or 4090 setup to get around 72GB of total VRAM. Do that, and you can expect performance of around 10 tokens/second on the RTX 3090 trio or around 13 tokens/second on triple RTX 4090s when running that 4-bit 120B model locally.
Those are really great speeds that make the models interactive and usable for all kinds of tasks – writing, analysis, coding, and role-playing.
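For reference, here’s roughly what loading a 4-bit GGUF 120B model across three cards looks like with the llama-cpp-python bindings. This is a sketch rather than a recipe: the model file name is hypothetical, it assumes the package was built with CUDA support, and the tensor_split ratios simply spread the weights evenly across the three GPUs.

```python
from llama_cpp import Llama

# Sketch: full GPU offload of a 4-bit 120B GGUF model across three 24GB cards.
# Assumes llama-cpp-python was installed with CUDA support; the file name is
# hypothetical.
llm = Llama(
    model_path="goliath-120b.Q4_K_M.gguf",
    n_gpu_layers=-1,               # offload every layer to the GPUs
    tensor_split=[1.0, 1.0, 1.0],  # spread the weights evenly across 3 cards
    n_ctx=4096,                    # 4K context
)

out = llm("Write a haiku about VRAM.", max_tokens=64)
print(out["choices"][0]["text"])
```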
Just don’t expect putting together a build like that to be easy or cheap!
First off, fitting three of those massive high-end GPUs into a single system is no simple task. You’ll need to track down a motherboard whose PCIe slots can run in at least an x8/x4/x4 lane configuration to give all the cards enough bandwidth. Many consumer boards max out at just two full-speed slots.
Then, for sanity’s sake, you’ll likely want to go with an open-frame setup and skip the complexity of liquid cooling. RTX 3090 and 4090 cards put out a ridiculous amount of heat when churning away at full tilt, and an open-air frame lets you space them apart and rely on their stock air coolers. Cramming three RTX 3090 or 4090 GPUs into a desktop PC case and attempting to liquid cool them is an ambitious endeavor I wouldn’t recommend unless you really know what you’re doing.
You will want a reasonably beefy CPU. Current top-tier chips like Intel’s 13900K or one of AMD’s Ryzen 7000 series would make a solid pairing with the triple GPU setup.
Finally, make sure your power supply can handle three power-hungry GPUs plus the rest of the components. A quality 1600W unit from a reputable brand like EVGA or Corsair is an absolute must.
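To sanity-check that power figure, a rough budget using ballpark TDP numbers (not measurements) looks like this:

```python
# Rough power budget for a triple-GPU build (ballpark TDPs, not measurements).
gpu_tdp = {"RTX 3090": 350, "RTX 4090": 450}  # approximate watts per card
cpu_w, rest_w = 250, 100                      # CPU under load; board/RAM/storage/fans

for name, watts in gpu_tdp.items():
    print(f"3x {name}: ~{3 * watts + cpu_w + rest_w}W peak")
# ~1400W for triple 3090s, ~1700W for triple 4090s on paper. Transient spikes
# go higher still, so a quality 1600W unit (possibly with power limits dialed
# in on the 4090s) is the practical floor for this kind of build.
```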
Now, if dropping multiple thousands on a triple GPU setup isn’t an option, there are some other approaches. You can go lower on the quantization, to 3 bits using the EXL2 format, which slashes the VRAM requirement down to around 44GB. With that memory footprint, you could run the model entirely on just a pair of 3090s or 4090s while still getting decent 10 tokens/second performance. However, this setup leaves much less VRAM for the cache, so you’ll have to make do with a significantly smaller context size.
Another option that avoids going full multi-GPU is Apple’s M1 and M2 chips with their unified memory. The M2 Ultra in something like the Mac Studio comes with a massive 192GB of unified memory that can be used as either RAM or VRAM as needed. With that much memory, you can load and run one of those 4-bit quantized 120B models at around 5 tokens/second. It’s not blistering, but still pretty darn usable for interactive tasks.
Running 120B in split mode with GPU and RAM
If you only have two beefy 24GB GPUs, you can try split mode using llama.cpp and the GGUF format. With two 3090s or 4090s giving you 48GB of VRAM in total, you could load around 44GB of the model into VRAM while offloading the remaining 22GB into 32GB of DDR5 system RAM.
Just be warned that this split mode comes with a serious performance hit compared to loading the full model into VRAM. You’re looking at maybe 1 to 1.5 tokens/second with this setup, which makes it pretty sluggish for interactive use cases in my opinion. But hey, it’s an option if your GPU setup is limited.
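In llama.cpp terms, split mode just means offloading only as many layers as actually fit in VRAM and leaving the rest in system RAM for the CPU. A minimal sketch with the llama-cpp-python bindings, assuming a hypothetical model file and a layer count you’d tune down until the offloaded portion stops overflowing your 48GB of VRAM:

```python
from llama_cpp import Llama

# Sketch of GPU/RAM split mode: only part of the model is offloaded to the two
# GPUs; llama.cpp keeps the remaining layers in system RAM and runs them on the CPU.
llm = Llama(
    model_path="goliath-120b.Q4_K_M.gguf",  # hypothetical file name
    n_gpu_layers=90,           # hypothetical value: tune until ~44GB sits in VRAM
    tensor_split=[1.0, 1.0],   # split the offloaded layers across the two cards
    n_ctx=4096,
)
```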
Running a 120B large language model on CPU and RAM
For those of you CPU-only folks out there, it’s technically possible to load and run these models using just your system RAM, but performance is going to be painful. The bare minimum RAM you’ll need is 64GB.
Even with a tricked out 64GB DDR5 6000MHz kit, you’re only looking at around 0.7 tokens/second when running one of those 3-bit quantized models on just the CPU and RAM. Not terrible for lightweight use cases, but hardly going to blow your socks off.
And if you max out at 128GB of DDR5 using four RAM sticks (which prevents running the absolute highest frequencies), you’ll see speeds drop to around 0.5 tokens/second.
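For completeness, CPU-only inference with llama.cpp is the same idea with GPU offload switched off; the main knob left is the thread count. A minimal sketch (model file name hypothetical, thread count something you’d roughly match to your physical cores):

```python
from llama_cpp import Llama

# Sketch of CPU-only inference: nothing offloaded, the whole model lives in system RAM.
llm = Llama(
    model_path="goliath-120b.Q3_K_M.gguf",  # hypothetical 3-bit GGUF file
    n_gpu_layers=0,    # keep every layer on the CPU
    n_threads=16,      # roughly match your physical core count
    n_ctx=4096,
)
```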
So there you have it! To run one of those massive 120 billion parameter language models locally with good enough speeds for real-time use, you’re going to want to lean heavily on some GPU horsepower. The gold standard is definitely a trio of beefy 3090s or 4090s giving you around 72GB of VRAM to fully load the model.
Apple’s M1/M2 Ultra is another great single-chip solution with its huge unified memory. You can get by with lower VRAM requirements using 3-bit quantization on dual 3090/4090 setups, or try the split GPU/RAM mode. But for the best speeds, it’s hard to beat stacking up some premium GPUs if you can afford that kind of investment.
For us mere mortal CPU warriors out there, running these models locally is definitely possible with a well-equipped 128GB DDR5 RAM setup. Just don’t expect blistering speeds.
Finally, keep in mind that there are other large language models with fewer than 120 billion parameters that are still excellent options for local inference. Models like Mixtral 8x7B, Llama-2 70B, and Yi 34B are smaller in size, yet they have great fine-tuned variations available with high context support. These more modestly-sized models will work happily on a single or dual GPU setup without requiring the immense hardware investment of the 120B titans. Don’t overlook their potential if your use case doesn’t demand the utmost scale.
Allan Witt
Allan Witt is co-founder and editor in chief of Hardware-corner.net. Computers and the web have fascinated me since I was a child. In 2011 I started training as an IT specialist at a medium-sized company and launched a blog at the same time. I really enjoy blogging about tech. After successfully completing my training, I worked as a system administrator at the same company for two years. As a part-time job, I started tinkering with pre-built PCs and building custom gaming rigs at a local hardware shop. The desire to build PCs full-time grew stronger, and now this is my full-time job.