Mistral LLM: Versions, Prompt Templates & Hardware Requirements

Updated: 2024-09-04
Explore all versions of the model, their file formats like GGML, GPTQ, and HF, and understand the hardware requirements for local inference.

Mistral is a family of large language models known for their exceptional performance. Mistral 7B, a 7-billion-parameter model, uses grouped-query attention (GQA) for faster inference and sliding window attention (SWA) to handle longer sequences, making it one of the top choices for coding and creative writing. It integrates well with major platforms and is easily fine-tuned for various tasks.
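Because the weights are openly available, a quick way to try Mistral 7B locally is through the Hugging Face transformers library. The sketch below is illustrative rather than canonical: it assumes torch, transformers, and accelerate are installed, a GPU with roughly 16 GB of VRAM for FP16 weights, and uses the mistralai/Mistral-7B-Instruct-v0.2 checkpoint as one example.

```python
# Minimal sketch: load Mistral 7B Instruct and generate a short completion.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # example checkpoint; swap in your own
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit in ~16 GB of VRAM
    device_map="auto",          # place layers on available GPU(s) automatically
)

# Chat-style prompt via the tokenizer's built-in chat template
messages = [{"role": "user", "content": "Write a Python function that reverses a string."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```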

Mistral Large 2 Instruct, released in July 2024, is a major upgrade with 123 billion parameters and a context length of 128,000 tokens. It is available with open weights under a research license. Mistral Large 2 is an exceptional large language model for programming, surpassing Llama 3 405B, and it supports numerous natural languages as well as many programming languages.

Hardware requirements for Mistral Large 2

Mistral Large is a massive 123-billion-parameter model, which in its original size requires a staggering amount of memory—around 250 GB. However, you can use 2-bit quantized versions to run the model even on a dual RTX 3090 setup.
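The 250 GB figure follows directly from the parameter count: at 16 bits per weight, 123 billion parameters need roughly 246 GB before any runtime overhead. A rough back-of-the-envelope sketch (real GGUF/EXL2 files run slightly larger because some tensors stay at higher precision, and the KV cache adds more on top):

```python
# Back-of-the-envelope memory estimate: parameters x bits per weight / 8 bits per byte.
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(weight_memory_gb(123, 16))    # ~246 GB -> the "around 250 GB" unquantized figure
print(weight_memory_gb(123, 2.75))  # ~42 GB  -> close to the 43 GB 2.75bpw EXL2 file below
print(weight_memory_gb(7, 4))       # ~3.5 GB -> in line with the ~4 GB 4-bit 7B figure later on
```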

Quantization   File Format   VRAM Required
2.75bpw        EXL2          43 GB
Q2_K           GGUF          45.2 GB
IQ3_XXS        GGUF          47 GB
IQ3_M          GGUF          55 GB
Q3_K_M         GGUF          59 GB
IQ4_XS         GGUF          66 GB
Q4_K_M         GGUF          73 GB
Q5_K_M         GGUF          87 GB
Q6_K           GGUF          101 GB
Q8_0           GGUF          130 GB

For the highest quality, use the Q8_0 model (130 GB), though it is unnecessary for most tasks. If you seek very high quality, choose Q6_K (101 GB) for near-perfect results. For high quality, Q5_K_M (86.49 GB) is still highly recommended. The Q4_K_M model (73 GB) is ideal for most use cases where good quality is sufficient; you can run it across four GPUs with 24 GB of VRAM each.

If you have a triple-GPU setup (24 GB each), consider Q3_K_XL (64.91 GB) or Q3_K_L (64.55 GB) for lower quality but adequate performance. For medium-low quality with decent performance, opt for IQ3_M (55.28 GB). The Q3_K_S (52.85 GB) and IQ3_XXS (47.01 GB) models trade further quality for size and should be used only if low quality is acceptable.

For very low quality needs, the Q2_K_L (45.59 GB) and Q2_K (45.20 GB) models still deliver surprisingly usable output. The IQ2_M (41.62 GB) model is the smallest option listed here; its quality is low, but it remains usable.
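To see what a multi-GPU setup looks like in practice, here is a minimal sketch using the llama-cpp-python bindings (one of several ways to run GGUF files). The file name is a placeholder, the even tensor_split is an assumption, and the package must be built with GPU support (the relevant CMAKE_ARGS flag varies by version) for the offload to take effect.

```python
# Sketch: spreading a Q4_K_M GGUF of Mistral Large 2 across four 24 GB GPUs.
from llama_cpp import Llama

llm = Llama(
    model_path="./Mistral-Large-Instruct-2407-Q4_K_M.gguf",  # hypothetical local file
    n_gpu_layers=-1,                        # offload every layer to the GPUs
    tensor_split=[0.25, 0.25, 0.25, 0.25],  # divide the weights evenly across 4 cards
    n_ctx=8192,                             # keep context modest; the KV cache also needs VRAM
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain grouped-query attention in two sentences."}]
)
print(out["choices"][0]["message"]["content"])
```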

Hardware requirements

The performance of a Mistral model depends heavily on the hardware it's running on. For recommendations on the best computer hardware configurations to handle Mistral models smoothly, check out this guide: Best Computer for Running LLaMA and LLama-2 Models.

Below are the Mistral hardware requirements for 4-bit quantization:

For 7B Parameter Models

If the 7B model is what you're after, you need to think about hardware in two ways. For the GPTQ version, you'll want a decent GPU with at least 6 GB of VRAM; a GTX 1660 or 2060, an AMD RX 5700 XT, or an RTX 3050 or 3060 would all work nicely. For the GGML / GGUF format, it's more about having enough RAM: around 4 GB free is needed to run the model smoothly.

Format                                              RAM Requirements       VRAM Requirements
GPTQ (GPU inference)                                6 GB (swap to load*)   6 GB
GGML / GGUF (CPU inference)                         4 GB                   300 MB
Combination of GPTQ and GGML / GGUF (offloading)    2 GB                   2 GB

*RAM needed to load the model initially. Not required for inference. If your system doesn't have quite enough RAM to fully load the model at startup, you can create a swap file to help with the loading.
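As a concrete illustration of the offloading row above, the sketch below (again using llama-cpp-python as one possible runtime) keeps most of the quantized 7B weights in system RAM and pushes only a handful of layers onto a small GPU. The file name and layer count are placeholders; tune n_gpu_layers to whatever fits your VRAM.

```python
# Sketch of the "combination" row: mostly CPU inference with partial GPU offload.
from llama_cpp import Llama

llm = Llama(
    model_path="./mistral-7b-instruct-q4_k_m.gguf",  # hypothetical local file
    n_gpu_layers=10,   # offload ~10 of 32 layers; set to 0 for pure CPU inference
    n_ctx=4096,
    n_threads=8,       # match your physical core count for the CPU-side layers
)
print(llm("Q: What is sliding window attention?\nA:", max_tokens=120)["choices"][0]["text"])
```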

Memory speed

When running Mistral AI models, you need to pay attention to how RAM bandwidth and model size impact inference speed. The entire set of model weights must be read from RAM or VRAM every time the model generates a new token (piece of text). For example, a 4-bit quantized 7-billion-parameter Mistral model takes up around 4.0 GB of memory.

As a rule of thumb, tokens per second ≈ memory bandwidth × efficiency ÷ model size in memory. Suppose you have a Ryzen 5 5600X processor and DDR4-3200 RAM with a theoretical maximum bandwidth of 50 GB/s. In this scenario, you can expect to generate approximately 9 tokens per second. Real-world throughput is typically about 70% of the theoretical maximum because of limiting factors such as the inference software, latency, system overhead, and workload characteristics, which prevent reaching peak speed. To achieve a higher inference speed, say 16 tokens per second, you would need more bandwidth; for example, a system with DDR5-5600 offering around 90 GB/s could be enough.
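The arithmetic behind those numbers fits in a few lines; the 0.7 efficiency factor below is the rough "70% of theoretical peak" assumption from the paragraph above, not a measurement.

```python
# Rule of thumb: tokens/s ~= memory bandwidth x efficiency / model size in memory.
def tokens_per_second(bandwidth_gb_s: float, model_size_gb: float, efficiency: float = 0.7) -> float:
    return bandwidth_gb_s * efficiency / model_size_gb

print(tokens_per_second(50, 4.0))   # DDR4-3200 dual channel -> ~8.8 tok/s
print(tokens_per_second(90, 4.0))   # DDR5-5600 dual channel -> ~15.8 tok/s
print(tokens_per_second(930, 4.0))  # RTX 3090 VRAM          -> ~163 tok/s (memory-bound upper limit)
```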

For comparison, high-end GPUs like the Nvidia RTX 3090 boast nearly 930 GB/s of memory bandwidth, while dual-channel DDR5-6400 RAM can provide up to about 100 GB/s. Understanding and optimizing bandwidth is therefore crucial for running models like Mistral efficiently.

Recommendations:

  1. For Best Performance: Opt for a machine with a high-end GPU (like NVIDIA's RTX 3090 or RTX 4090) or a dual-GPU setup to accommodate the larger quantized models. A system with adequate RAM (minimum 16 GB, but 64 GB is best) would be optimal.
  2. For Budget Constraints: If you're limited by budget, focus on Mistral GGML/GGUF models that fit within the system RAM. Remember, while you can offload some weights to the system RAM, it will come at a performance cost.

Remember, these are recommendations, and the actual performance will depend on several factors, including the specific task, model implementation, and other system processes.

CPU requirements

For best performance, a modern multi-core CPU is recommended. An Intel Core i7 from the 8th generation onward or an AMD Ryzen 5 from the 3000 series onward will work well. A CPU with 6 or 8 cores is ideal. Higher clock speeds also improve prompt processing, so aim for 3.6 GHz or more.

Support for CPU instruction sets like AVX, AVX2, and AVX-512 can further improve performance where available. The key is a reasonably modern consumer-level CPU with a decent core count and clock speed, along with baseline vector support through AVX2 (required for CPU inference with llama.cpp). With those specs, the CPU should handle the 7B Mistral model comfortably.
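If you want to verify what your CPU supports before setting up llama.cpp, a small sketch like the one below lists the relevant flags. It is Linux-only (it reads /proc/cpuinfo); a cross-platform package such as py-cpuinfo is an alternative.

```python
# Print whether the vector extensions mentioned above are available on this CPU (Linux).
def cpu_flags() -> set[str]:
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
for ext in ("avx", "avx2", "avx512f"):
    print(f"{ext:8s} {'yes' if ext in flags else 'no'}")
```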