Manticore LLM: Versions, Prompt Templates & Hardware Requirements
Explore all versions of the model, their file formats like GGML, GPTQ, and HF, and understand the hardware requirements for local inference.
Manticore 13B is fine-tuned on a mix of datasets, making it a highly adaptable language model. Built on the robust LLaMA 13B framework, it incorporates data from ShareGPT, WizardLM, and GPTeacher-General-Instruct for general knowledge and instruction following. It also specializes in roleplay and Chain-of-Thought (CoT) tasks through the QingyiSi/Alpaca-CoT dataset. For those seeking detailed responses in subjects like abstract algebra and conceptual physics, it leverages ARC-Easy & ARC-Challenge datasets.
Additional info about the Manticore model
Building upon the robust framework of Manticore 13B, the Chat version takes conversational intelligence to the next level. Trained on an expansive and diverse set of datasets, this model excels at understanding and generating human-like text.
What sets it apart is the inclusion of a carefully curated subset of the Pygmalion dataset, focused on Role Play (RP) data, making interactions even more engaging and context-aware. Moving away from the traditional Alpaca-style prompts, Manticore 13B Chat adopts a more natural, chat-only style with `USER:` and `ASSISTANT:` identifiers, as well as specialized Pygmalion/Metharme prompting. It's perfect for users who seek a streamlined, intuitive conversation experience.
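For reference, a minimal illustration of that chat-only format (the question and answer text here is purely illustrative):

```
USER: What does 4-bit quantization do to a language model?
ASSISTANT: It stores the weights at 4-bit precision, cutting memory use at a small cost in output quality.
```

Each turn is prefixed with its identifier, and the prompt simply ends with `ASSISTANT:` to cue the model's reply.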
Key Datasets Used:
- De-duped Pygmalion, focused on RP data
- Riddle_sense, instruct-augmented for riddle understanding
- Hellaswag, enhanced for detailed explanations with 30K+ rows
- GSM8k, instruct-augmented math word problems
- EWOF/Code-Alpaca, unfiltered for coding-related tasks
Hardware requirements
The performance of a Manticore model depends heavily on the hardware it's running on. For recommendations on the best computer hardware configurations to handle Manticore models smoothly, check out this guide: Best Computer for Running LLaMA and LLama-2 Models.
Below are the Manticore hardware requirements for 4-bit quantization:
For 13B Parameter Models
For beefier models like the Manticore-13B-SuperHOT-8K-fp16, you'll need more powerful hardware. If you're using the GPTQ version, you'll want a strong GPU with at least 10GB of VRAM; an AMD RX 6900 XT, RTX 2060 12GB, RTX 3060 12GB, or RTX 3080 would do the trick. For the CPU inference (GGML / GGUF) format, having enough RAM is key: you'll want around 8GB available to run it smoothly.
Format | RAM Requirements | VRAM Requirements |
---|---|---|
GPTQ (GPU inference) | 12GB (Swap to Load*) | 10GB |
GGML / GGUF (CPU inference) | 8GB | 500MB |
Combination of GPTQ and GGML / GGUF (offloading) | 10GB | 10GB |
*RAM needed to load the model initially. Not required for inference. If your system doesn't have quite enough RAM to fully load the model at startup, you can create a swap file to help with the loading.
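To make the GGML/GGUF and offloading rows concrete, here is a minimal sketch using the llama-cpp-python bindings. The model filename is a placeholder for whichever quantized Manticore file you actually downloaded, and the thread and layer counts are assumptions you'd tune for your machine:

```python
# Minimal sketch: CPU inference on a quantized Manticore GGUF file
# using llama-cpp-python (pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(
    model_path="./manticore-13b-chat.Q4_K_M.gguf",  # hypothetical filename
    n_ctx=2048,       # context window
    n_threads=8,      # match your physical core count
    n_gpu_layers=0,   # raise this to offload some layers to VRAM
)

output = llm(
    "USER: Why does quantization reduce memory use?\nASSISTANT:",
    max_tokens=256,
    stop=["USER:"],   # stop before the model starts a new turn
)
print(output["choices"][0]["text"])
```

Setting `n_gpu_layers` above zero is what the offloading row in the table refers to: some layers live in VRAM while the rest stay in system RAM, trading a little setup effort for speed.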
Memory speed
When running Manticore AI models, you need to pay attention to how RAM bandwidth and model size impact inference speed. These large language models need to be read from RAM or VRAM in full each time they generate a new token (piece of text). For example, a 4-bit, 7-billion-parameter Manticore model takes up around 4.0GB of RAM.
Suppose you have a Ryzen 5 5600X processor and DDR4-3200 RAM with a theoretical max bandwidth of 50 GB/s. In this scenario, you can expect to generate approximately 9 tokens per second. Typically, achievable performance is about 70% of the theoretical maximum speed due to several limiting factors such as inference software, latency, system overhead, and workload characteristics, which prevent reaching the peak. To achieve a higher inference speed, say 16 tokens per second, you would need more bandwidth; for example, a system with DDR5-5600 offering around 90 GB/s could be enough.
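The arithmetic behind that estimate is simple enough to sketch; the numbers below just restate the example above:

```python
# Back-of-the-envelope token speed estimate from memory bandwidth.
# Every generated token requires reading the full set of weights once.
model_size_gb = 4.0      # 4-bit 7B Manticore model
bandwidth_gb_s = 50.0    # dual-channel DDR4-3200, theoretical peak
efficiency = 0.7         # typical fraction of peak actually achieved

tokens_per_second = bandwidth_gb_s / model_size_gb * efficiency
print(f"~{tokens_per_second:.1f} tokens/s")  # ~8.8, i.e. roughly 9
```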
For comparison, high-end GPUs like the Nvidia RTX 3090 boast nearly 930 GB/s of bandwidth for their VRAM, while DDR5-6400 RAM can provide up to 100 GB/s. Therefore, understanding and optimizing bandwidth is crucial for running models like Manticore efficiently.
Recommendations:
- For Best Performance: Opt for a machine with a high-end GPU (like NVIDIA's RTX 3090 or RTX 4090) or a dual-GPU setup to accommodate the largest models (65B and 70B). A system with adequate RAM (minimum 16GB, but 64GB is best) would be optimal.
- For Budget Constraints: If you're limited by budget, focus on Manticore GGML/GGUF models that fit within the system RAM. Remember, while you can offload some weights to the system RAM, it will come at a performance cost.
Remember, these are recommendations, and the actual performance will depend on several factors, including the specific task, model implementation, and other system processes.
CPU requirements
For best performance, a modern multi-core CPU is recommended. An Intel Core i7 from the 8th gen onward or an AMD Ryzen 5 from the 3rd gen onward will work well; a 6-core or 8-core CPU is ideal. Higher clock speeds also improve prompt processing, so aim for 3.6GHz or more.
Support for CPU instruction sets like AVX, AVX2, and AVX-512 can further improve performance if available. The key is to have a reasonably modern consumer-level CPU with a decent core count and clock speed, along with baseline vector processing through AVX2 (required for CPU inference with llama.cpp). With those specs, the CPU should comfortably handle models of Manticore's size.
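If you want to verify those instruction sets before downloading a model, a quick sketch (this reads /proc/cpuinfo, so it only works on Linux, not Windows or macOS):

```python
# Check for the vector instruction sets mentioned above (Linux only).
def cpu_flags() -> set:
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
for isa in ("avx", "avx2", "avx512f"):
    print(f"{isa}: {'yes' if isa in flags else 'no'}")
```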