Best laptop for Large Language Models: Llama, Mistral and Mixtral
If you’re looking for the best laptop to handle large language models (LLMs) like Llama 2, Llama 3.1, Mistral, or Yi, the MacBook Pro with the M2 Max chip, 38 GPU cores, and 64GB of unified memory is the top choice. It’s a model that strikes the perfect balance between performance and portability, making it a game-changer for those who need to run LLMs on the go.
For a comprehensive guide on the best Mac options for LLM, including desktop solutions, check out our detailed best Mac for LLM guide.
The 64GB version allows you to use about 48GB (75% of the total pool) as VRAM, which is crucial for running LLMs efficiently. This laptop can smoothly run 34B models at 8-bit quantization and handle larger 70B models with a decent context length. And if you’re feeling adventurous, you can raise the VRAM limit to accommodate massive 120B models.
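For a rough sense of what fits in that ~48GB budget, you can estimate a model’s footprint from its parameter count and quantization. The sketch below is a back-of-the-envelope check, not an exact llama.cpp calculation: the bytes-per-parameter figures and the flat overhead allowance for the KV cache are approximations I’m assuming here.

```python
# Back-of-the-envelope VRAM check for quantized models on a 64GB M2 Max.
# Bytes-per-parameter values are rough averages for common GGUF quantizations (assumptions).
BYTES_PER_PARAM = {"q4_0": 0.56, "q5_0": 0.69, "q8_0": 1.06, "f16": 2.0}

def fits_in_vram(params_billion: float, quant: str,
                 vram_gb: float = 48.0, overhead_gb: float = 4.0) -> bool:
    """True if the weights plus a flat KV-cache/runtime allowance fit in the VRAM budget."""
    weights_gb = params_billion * BYTES_PER_PARAM[quant]
    return weights_gb + overhead_gb <= vram_gb

for size_b, quant in [(34, "q8_0"), (70, "q4_0"), (120, "q4_0")]:
    verdict = "fits" if fits_in_vram(size_b, quant) else "needs a raised VRAM limit"
    print(f"{size_b}B {quant}: {verdict}")

# On recent macOS releases the default ~75% VRAM cap can reportedly be raised with
# `sudo sysctl iogpu.wired_limit_mb=<value>` -- treat that as an assumption and
# verify it against your macOS version before relying on it.
```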
The M2 Max’s 400 GB/s memory bandwidth is what sets this MacBook Pro apart, offering inference speeds that come close to desktop GPU setups and leaving non-Apple laptops, which typically max out at 16GB of VRAM, in the dust.
Non-Apple alternatives often struggle with models larger than 13B, and their performance drops off sharply with even larger models such as 33B and 70B. This decline is primarily due to their reliance on slower system memory (dual-channel DDR5-4800, for example, offers only 76.8 GB/s) once the model has to be split between the GPU’s VRAM and system RAM.
In contrast, the M2 Max’s unified memory system not only offers higher bandwidth but also ensures smoother operation. So, for those who can’t be tied down to a desktop, this MacBook Pro is your go-to. It balances portability with the power needed for LLM inference.
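Token generation is largely bandwidth-bound: each new token requires streaming roughly the whole set of quantized weights through memory once. That gives a quick first-order ceiling on tokens per second, namely memory bandwidth divided by model size, which the sketch below computes. It ignores compute, caching, and overhead, so real speeds land below these numbers.

```python
# First-order, bandwidth-only ceiling on generation speed (ignores compute and overhead).
def tokens_per_sec_ceiling(model_size_gb: float, bandwidth_gb_s: float) -> float:
    """Assume each generated token streams the full quantized weight set once."""
    return bandwidth_gb_s / model_size_gb

MODEL_GB = 4.1  # ~7B model at 4-bit quantization (approximation)

for name, bandwidth in [("M2 Max unified memory", 400.0),
                        ("Dual-channel DDR5-4800", 76.8),
                        ("RTX 4090 mobile GDDR6", 576.0)]:
    print(f"{name}: ~{tokens_per_sec_ceiling(MODEL_GB, bandwidth):.0f} t/s ceiling")
```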
M2 Max Laptop LLM inference speed:
| Model name | Size & Quantization | Speed (t/s) |
|---|---|---|
| Mistral-Nemo-Instruct-2407 | 7B_q4_0 | 66.31 |
| Mistral-Nemo-Instruct-2407 | 7B_q5_0 | 60.1 |
| Mistral-v0.3-7B-ORPO | 7B_q8_0 | 42.75 |
| LLaMA2-13B-Tiefighter | 13B_q4_0 | 36.49 |
| LLaMA2-13B-Tiefighter | 13B_q5_0 | 34.2 |
| dolphin-2.7-mixtral-8x7b | 8x7B_q5_0 | 25.12 |
| Mistral-Nemo-Instruct-2407-f16 | 7B_f16 | 25.09 |
| LLaMA2-13B-Tiefighter | 13B_q8_0 | 21.18 |
| WizardLM-30B | 30B_q4_0 | 16.24 |
| WizardLM-30B | 30B_q5_0 | 14.44 |
| WizardLM-30B | 30B_q8_0 | 9.1 |
| Airoboros-65B-GPT4-2.0 | 65B_q4_0 | 3.14 |
| Meta-Llama-3.1-70B-Instruct | 70B_q4_0 | 3.11 |
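If you want to reproduce numbers like these on your own machine, the llama-cpp-python bindings are a quick way to do it with full GPU offload (Metal on Apple Silicon). The model path below is a placeholder, and the crude timing includes prompt processing, so treat the result as a ballpark rather than a clean benchmark.

```python
# Quick-and-dirty generation speed check with llama-cpp-python (model path is a placeholder).
import time
from llama_cpp import Llama

llm = Llama(
    model_path="models/mistral-7b-instruct.Q4_0.gguf",  # placeholder path
    n_gpu_layers=-1,   # offload every layer (Metal on Apple Silicon, CUDA on Nvidia)
    n_ctx=4096,
    verbose=False,
)

start = time.perf_counter()
out = llm("Explain what memory bandwidth means for LLM inference.", max_tokens=200)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} t/s")
```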
If you’re open to sacrificing a bit of performance – roughly 7-8% in inference speed and about 20% in prompt processing speed – then the MacBook Pro with M1 Max is a viable alternative. It offers the same unified memory bandwidth as the M2 Max, but with a slightly lower GPU core count (32 versus 38). The difference, while noticeable, isn’t drastic: when running a 7B 4-bit quantized model, the gap in inference speed is only about 5 tokens per second. So, for those who are budget-conscious or don’t need the absolute peak of performance, the M1 Max stands as a solid choice, balancing cost with capability.
Windows and Linux-Based Laptops for LLM
If you’re looking to step outside the Apple ecosystem and are in the market for a Windows or Linux-based laptop, there are several GPU options to consider: the mobile RTX 3080 (16GB), RTX 3080 Ti (16GB), RTX 4080 (12GB), or RTX 4090 (16GB).
It’s essential to understand that the maximum VRAM you’ll typically find in a PC-based laptop is 16GB. This capacity is adequate for models up to 13B. However, for anything beyond that, you’ll need to split the load between VRAM and RAM to manage the higher memory demands. This split, unfortunately, leads to a significant reduction in inference speed.
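A rough calculation shows why the split is so costly: the portion of the weights that spills out of the 16GB of VRAM has to be streamed from system RAM on every token, and that slow slice dominates the per-token time. The sizes and bandwidths below are approximations used purely for illustration.

```python
# Illustrative per-token cost when a model is split between VRAM and system RAM.
def split_tokens_per_sec(model_gb: float, vram_gb: float,
                         vram_bw_gb_s: float, ram_bw_gb_s: float) -> float:
    """Each token reads the GPU-resident slice at VRAM speed and the spilled slice at RAM speed."""
    gpu_slice = min(model_gb, vram_gb)
    ram_slice = max(model_gb - vram_gb, 0.0)
    seconds_per_token = gpu_slice / vram_bw_gb_s + ram_slice / ram_bw_gb_s
    return 1.0 / seconds_per_token

MODEL_GB = 19.0  # ~33B model at 4-bit quantization (approximation)

print(f"Entirely in VRAM (hypothetical 24GB card): ~{split_tokens_per_sec(MODEL_GB, 24, 576, 76.8):.0f} t/s")
print(f"16GB VRAM + dual-channel DDR5-4800 spill:  ~{split_tokens_per_sec(MODEL_GB, 16, 576, 76.8):.0f} t/s")
```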
Among these options, mobile GPUs like the RTX 4080, boasting a bandwidth of 432.0 GB/s, can offer speeds comparable to the M2 Max in terms of tokens per second. However, the RTX 4080 is somewhat limited with its 12GB of VRAM, making it most suitable for running a 13B 6-bit quantized model, but without much leeway for larger contexts. To get closer to the MacBook Pro’s capabilities, you might want to consider laptops with an RTX 3080 Ti or RTX 4090.
In particular, the mobile RTX 4090, with its 576 GB/s bandwidth, offers slightly better inference speeds than the RTX 4080. It also comes with 16GB of VRAM, allowing it to handle up to 13B models with more context space than the RTX 4080 can provide.
This positions RTX 4090-equipped laptops as a closer, albeit not perfect, alternative to the MacBook Pro in terms of LLM performance, keeping in mind the limitations when compared to Apple’s unified memory system.
Here are the best PC-based laptops for LLM inference:
ASUS ROG Zephyrus G14 2023
Asus has made some intriguing changes this year, especially with the GPU upgrade. Let’s break down the key hardware and how it fits into LLM inference:
GPU – Nvidia RTX 4090 Mobile: This is a significant upgrade over the AMD GPUs of previous G14 generations. For LLM tasks, the RTX 4090, even in its mobile form, is a powerhouse thanks to its high memory bandwidth (576 GB/s), though the mobile version won’t match the desktop RTX 4090’s full capabilities. It is more than enough for models like LLaMA-13B, which needs at least 10GB of VRAM, and it leaves headroom for even larger models, where you will need to split the model layers between VRAM and RAM (a rough way to pick that split is sketched after this spec rundown).
CPU – Ryzen 9 7940HS: A solid choice for LLM tasks. The CPU is essential for data loading, preprocessing, and managing prompts. The Ryzen 9 7940HS, being a high-end CPU, should handle these tasks efficiently.
RAM: With 64GB of RAM, this laptop sits comfortably above the minimum for running models like the 30B ones, which require at least 20GB of memory for the weights. The memory is dual-channel DDR5-4800 (76.8 GB/s of bandwidth), the most common setup in a 64GB configuration.
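When you do split a larger model, llama.cpp-based runtimes let you choose how many transformer layers stay on the GPU. A rough way to derive that number from the 16GB VRAM budget is sketched below; the layer count and model size are approximations, and the commented llama-cpp-python call mirrors the earlier example rather than anything specific to this laptop.

```python
# Rough n_gpu_layers estimate for partial offload on a 16GB laptop GPU (approximate figures).
def pick_gpu_layers(model_gb: float, n_layers: int,
                    vram_gb: float = 16.0, reserve_gb: float = 2.0) -> int:
    """Keep as many layers on the GPU as fit after reserving room for the KV cache and buffers."""
    per_layer_gb = model_gb / n_layers
    return min(n_layers, int((vram_gb - reserve_gb) / per_layer_gb))

# A 70B model at 4-bit quantization is roughly 40GB spread over ~80 layers (approximation).
layers = pick_gpu_layers(model_gb=40.0, n_layers=80)
print(f"Offload about {layers} of 80 layers to the GPU")  # ~28 on a 16GB card

# from llama_cpp import Llama
# llm = Llama(model_path="models/llama-70b.Q4_0.gguf", n_gpu_layers=layers, n_ctx=4096)
```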
MSI Raider GE68HX 13VI
The MSI Raider GE68, with its powerful CPU and GPU, ample RAM, and high memory bandwidth, is well equipped for LLM inference tasks.
CPU – Intel Core i9-13950HX: This is a high-end processor, excellent for tasks like data loading, preprocessing, and handling prompts in LLM applications. The increased performance over previous generations should be beneficial for running LLMs efficiently.
GPU – Nvidia RTX 4090 Mobile (576 GB/s bandwidth): This GPU is a great fit, especially for LLM tasks with models up to 13B. The high memory bandwidth is crucial for handling large models efficiently. Although the mobile version doesn’t reach the peak performance of its desktop counterpart, it’s still plenty powerful for running LLMs.
RAM – 64GB of DDR5-5200 memory (83.2 GB/s memory bandwidth): The ample RAM and high memory bandwidth are ideal for LLM tasks. This amount of RAM surpasses the minimum requirements of most LLM models, ensuring smooth operation even with larger contexts.
Allan Witt
Allan Witt is Co-founder and editor in chief of Hardware-corner.net. Computers and the web have fascinated me since I was a child. In 2011 I started training as an IT specialist at a medium-sized company and launched a blog at the same time. I really enjoy blogging about tech. After successfully completing my training, I worked as a system administrator at the same company for two years. As a part-time job I started tinkering with pre-built PCs and building custom gaming rigs at a local hardware shop. The desire to build PCs full-time grew stronger, and now this is my full-time job.

3 Comments
I am considering the M1 Max as a more budget-friendly option. Can you comment a bit on the difference between the two when running large language models?
You can check out our article on the best Mac for LLMs. Generally, for token generation, the M2 Max with a 38-core GPU is about 8% faster than the M1 Max with a 32-core GPU. This translates to around 7 or 8 additional tokens per second when performing inference on a 7B model with 4-bit quantization.
For prompt processing, the difference is more substantial – about 26% – as prompt processing is compute-bound, and the M2 Max has an advantage in that area. In contrast, token generation is bandwidth-bound, and both models have the same bandwidth at 400GB/s.
I have the M1 Max, and it’s a beast. A colleague of mine has the M2 Max MacBook Pro, and honestly, the difference is barely noticeable. Sure, the M1 is a bit slower with prompt processing, but that only really matters if you’re working with huge contexts like 16k or 32k tokens.
In general, if you’re planning to do long chats with large models, processing will slow down as the context size grows. For example, on the M1 Max, processing a 12k context can take over 6 minutes.