Mistral AI has released Mixtral 8x7B, a new large language model that sets a benchmark in the open-access AI field. The model, which outperforms GPT-3.5 on many benchmarks, has been integrated into the Hugging Face ecosystem, underscoring how quickly it has gained traction. However, its innovative architecture raises questions about VRAM (Video Random Access Memory) usage, a crucial concern for potential users with limited resources.
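As a rough illustration of that Hugging Face integration, the sketch below loads the published checkpoint through the transformers library in half precision. It is a minimal example, assuming a recent transformers release with Mixtral support and hardware with enough memory; the prompt text and generation settings are arbitrary.

```python
# Minimal sketch: loading Mixtral 8x7B via Hugging Face transformers in half precision.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half-precision weights: roughly 90GB across devices
    device_map="auto",          # spread layers over available GPUs (and CPU if needed)
)

inputs = tokenizer("Mixture of Experts models work by", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```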
Mixtral 8x7B employs a Mixture of Experts (MoE) technique, incorporating eight expert models into one. This structure lets it run at the speed of a roughly 13B-parameter dense model while holding nearly four times as many total parameters. It is a breakthrough in efficiency but comes with a high VRAM requirement: the model needs a staggering ~90GB of VRAM in half precision, posing a challenge for local setups and individual developers.
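A back-of-the-envelope calculation shows where that ~90GB figure comes from: every expert's weights must sit in memory, so the total parameter count (not the active count) drives the footprint. The snippet below is a rough sketch covering weights only; activations, KV cache, and framework overhead are not included.

```python
# Rough VRAM estimate for Mixtral 8x7B weights alone (no activations or KV cache).
def weight_vram_gb(total_params_billion: float, bytes_per_param: float) -> float:
    """Approximate memory needed to hold the model weights, in gigabytes."""
    return total_params_billion * 1e9 * bytes_per_param / 1024**3

total_params = 46.7  # all eight experts stay resident in memory

print(f"fp16 : {weight_vram_gb(total_params, 2):.0f} GB")    # ~87 GB -> the ~90GB figure
print(f"8-bit: {weight_vram_gb(total_params, 1):.0f} GB")    # ~44 GB
print(f"4-bit: {weight_vram_gb(total_params, 0.5):.0f} GB")  # ~22 GB
```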
Quantized versions from TheBloke, ranging from 2-bit to 8-bit, indicate a VRAM need between 16GB and 50GB. While the model supports splitting computation between the CPU and GPU and has experimental llama.cpp support, its high VRAM usage still stands as a significant barrier for local LLM inference. The model processes input and generates output at the same speed and cost as a 12.9B model, because only a fraction of the experts are active per token, but all 46.7B total parameters must remain loaded in memory. This disparity underlines the necessity of substantial VRAM for its operation.
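For setups closer to the lower end of that range, a quantized load is one option. The following is a hedged sketch, not a recommended configuration: it assumes the bitsandbytes and accelerate packages are installed, and exact memory use will vary with the quantization scheme and context length.

```python
# Sketch: loading the Instruct checkpoint in 4-bit to target roughly 24GB of VRAM,
# letting device_map offload any layers that do not fit onto the CPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # dequantized compute in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # split layers between GPU and CPU as capacity allows
)
```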
The VRAM demand highlights a broader issue in AI development: the balance between performance and accessibility. While Mixtral 8x7B offers exceptional speed and accuracy, surpassing major competitors and supporting five languages, its high resource requirement may limit its usability for many in the local LLM community. Mistral AI's release of this model, alongside its Instruct version optimized for instruction following, marks a significant step in AI progress.