Mixtral LLM: Versions, Prompt Templates & Hardware Requirements
Explore all versions of the model, their file formats like GGML, GPTQ, and HF, and understand the hardware requirements for local inference.
Mistral AI has introduced Mixtral 8x7B, a highly efficient sparse mixture of experts model (MoE) with open weights, licensed under Apache 2.0. This model stands out for its rapid inference, being six times faster than Llama 2 70B and excelling in cost/performance trade-offs. It rivals or surpasses GPT-3.5 in most standard benchmarks, making it a leading open-weight model with a permissive license.
Mixtral uses a similar architecture to Mistral 7B, handles a context of 32k tokens, and supports English, French, Italian, German, and Spanish. It also demonstrates strong performance in code generation. Although it has 46.7B total parameters, Mixtral runs with the speed and cost of a 12.9B model, because only that many parameters are active for any given token.
Additional info about Mixtral 8x7B LLM
Mixtral 8x7B is a Mixture of Experts (MoE) large language model with an effective size of 46.7 billion parameters, while maintaining the computational efficiency of a model approximately one-fourth its size. This remarkable efficiency is achieved through the model's architecture, which significantly reduces the compute and memory bandwidth requirements traditionally associated with models of this scale.
Prompt format
The official Mistral prompt format plays it safe, steering clear of even the slightest hint of controversy. Switching to a different style, especially a more roleplay-centric one, lets the conversation flow more freely and makes responses more interesting.
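For reference, the official Mixtral Instruct template wraps each user message in [INST] ... [/INST] tags, with assistant replies closed by </s>. Below is a minimal sketch of building such a prompt in Python; the helper function name is hypothetical and the example conversation is illustrative:

```python
# Minimal sketch of the Mixtral Instruct prompt template.
# The helper name `format_mixtral_prompt` is made up for this example.
def format_mixtral_prompt(turns):
    """turns: list of (user_message, assistant_reply) pairs; the last reply may be None."""
    prompt = "<s>"
    for user, assistant in turns:
        prompt += f"[INST] {user} [/INST]"
        if assistant is not None:
            prompt += f" {assistant}</s>"
    return prompt

print(format_mixtral_prompt([("Give me three facts about Mixtral 8x7B.", None)]))
# -> <s>[INST] Give me three facts about Mixtral 8x7B. [/INST]
```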
If you want something a bit less censored, use SillyTavern's Roleplay preset; it is good at getting around the model's restrictions.
Playing around with different formats can unlock new levels of engagement and creativity, proving that even the smallest tweaks can lead to big changes in how Mixtral responds.
If you are looking for more prompt info, check WolframRavenwolf's Mixtral prompt guide.
Hardware Requirements and Performance
To run the 5-bit quantized version of Mixtral you need a minimum of 32.3 GB of memory. This makes the model a good fit for a dual-GPU setup such as two RTX 3090, RTX 4090, or Tesla P40 cards. LLM inference benchmarks show that performance varies by hardware: running fully on GPU with no CPU offloading, you can get around 54 t/s with dual RTX 3090s, 59 t/s with dual RTX 4090s, 44 t/s on an Apple Silicon M2 Ultra, and 22 t/s on an M3 Max.
You can also use a dual RTX 3060 12GB setup with layer offloading. For example, with the 5-bit quantized Mixtral model, offloading 20 of 33 layers (~19 GB) to the GPUs gives around 7 t/s for token generation and 42 t/s for prompt processing.
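As a rough illustration of this kind of partial offloading, here is a sketch using the llama-cpp-python bindings; the model filename is an assumption, and the layer count should be tuned to your available VRAM:

```python
# Sketch of partial GPU offloading for a 5-bit Mixtral GGUF via llama-cpp-python.
# The file name below is an assumption; adjust n_gpu_layers to fit your VRAM.
from llama_cpp import Llama

llm = Llama(
    model_path="mixtral-8x7b-instruct-v0.1.Q5_0.gguf",
    n_gpu_layers=20,   # offload 20 of 33 layers (~19 GB) to the GPU(s)
    n_ctx=4096,
)

out = llm("[INST] What is Mixtral 8x7B? [/INST]", max_tokens=128)
print(out["choices"][0]["text"])
```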
| Format | RAM Requirements | VRAM Requirements |
|---|---|---|
| EXL2/GPTQ (GPU inference) | 32 GB (Swap to Load*) | 23 GB |
| GGUF (CPU inference) | 32 GB | 300 MB |
| 50:50 GPU/RAM offloading | 16 GB | 16 GB |
Additionally, you can use split model execution, which offloads part of the model weights to system RAM to accommodate hardware with limited GPU memory. On a 24 GB NVIDIA GPU you can keep up to 27 of the 33 layers on the GPU, achieving between 15 and 23 tokens per second. Running the model purely on the CPU is also an option; it requires at least 32 GB of available system memory, with performance ranging from 1 to 7 tokens per second depending on RAM speed.
The table below gives a general overview of what to expect when running Mixtral (llama.cpp) on a single GPU with some layers offloaded to the GPU.
| GPU | Offloading | Context | TG* | PP* |
|---|---|---|---|---|
| RTX 3060 12 GB | 10/33 | 512 tokens | 4.31 | 42.17 |
| RTX 4080 16 GB | 15/33 | 512 tokens | 7.17 | 104.37 |
| RTX 3090 24 GB | 22/33 | 512 tokens | 14.66 | 148.05 |
*TG = token generation; PP = prompt processing; both in tokens per second
Model Architecture and Training
Mixtral utilizes the MoE architecture to enhance processing speed and efficiency. This architecture divides the neural network into multiple 'experts', each specializing in different aspects of the language model's capabilities. For each token, a router decides which experts handle the computation, so only a fraction of the total parameters is active at a time. Distributing work across experts this way allows for a more efficient use of resources, significantly increasing token generation throughput for the same total parameter count.
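To make the routing idea concrete, here is a heavily simplified top-2 routing sketch in PyTorch. The tiny linear 'experts', dimensions, and class name are illustrative only and do not reflect Mixtral's actual layer sizes:

```python
# Toy top-2 mixture-of-experts layer: a router scores experts per token,
# the two best experts process the token, and their outputs are mixed.
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, dim=32, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)   # gating network
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.top_k = top_k

    def forward(self, x):                            # x: (tokens, dim)
        scores = self.router(x)                      # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, -1)   # pick the 2 best experts per token
        weights = weights.softmax(dim=-1)            # normalize over the selected experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

moe = TinyMoE()
print(moe(torch.randn(4, 32)).shape)                 # torch.Size([4, 32])
```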
Benchmark Performance
Mixtral is the highest-ranked open-source model on the Chatbot Arena leaderboard, surpassing models like GPT-3.5 Turbo, Gemini Pro, and Llama 2 70B. Its MoE architecture not only lets it run on relatively accessible hardware but also provides a scalable way to handle large-scale computational tasks efficiently.
Hints and Tips for running Mixtral
- Use Mixtral 8x7B Q5_0: it has better overall performance, especially at understanding and following complex prompts and explaining its logic.
- Avoid "K" Quants: Mixtral "K" quant versions (e.g., Q4_K_M, Q5_K_M) they perform worse, and appear less intelligent and more prone to rambling.
- Q5_0 variants excel in coding tasks, including writing functional JavaScript snippets, whereas Q4 variants struggle.
- Lowering the temperature setting to 0.1 or 0.2 is recommended to stabilize Mixtral and reduce hallucinations.
- Some users report a bad repetition problem with Mixtral after about 1.5k tokens; adjusting settings such as the RoPE base frequency to 1 million can mitigate this issue (see the sketch after these tips).
- Mixtral performs exceptionally well at following instructions and can handle creative tasks effectively when provided with detailed prompts.
- Experimentation: try different settings, formats, and model versions to find the best balance between speed, accuracy, and response quality.
- Use the instruct versions of the model for improved performance, especially for tasks requiring specific instructions or creative output.
These tips encapsulate community insights and personal experiences with Mixtral 8x7B, offering guidance on model selection, performance optimization, and effective use across various applications.
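As a quick illustration of the sampling and RoPE settings mentioned in the tips above, here is a hedged sketch using llama-cpp-python; the model filename is an assumption and the values are community-reported starting points, not official recommendations:

```python
# Sketch: low temperature plus a raised RoPE base frequency, as suggested above.
# The model file name is an assumption; tune the values for your own setup.
from llama_cpp import Llama

llm = Llama(
    model_path="mixtral-8x7b-instruct-v0.1.Q5_0.gguf",
    n_ctx=8192,
    rope_freq_base=1_000_000,   # "Rope Base" tweak reported to reduce repetition
)

out = llm(
    "[INST] Summarize the benefits of a mixture-of-experts model. [/INST]",
    max_tokens=256,
    temperature=0.2,            # low temperature to stabilize output
    repeat_penalty=1.1,         # mild penalty against rambling
)
print(out["choices"][0]["text"])
```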