Airoboros LLM: Versions, Prompt Templates & Hardware Requirements
Explore all versions of the model, their file formats like GGML, GPTQ, and HF, and understand the hardware requirements for local inference.
Airoboros models are Mistral, LLaMa and Llama-2 based large language models, fine-tuned with synthetic data generated by GPT-4 via the Airoboros tool, aligned with the principles of the SELF-INSTRUCT framework. These models leverage machine-generated instructions to train specialized "experts" in tasks like multi-turn conversations, coding examples, and role-playing. By using synthetic data, they achieve higher performance and expertise in specific tasks without the need for limited or expensive human-annotated data. This approach results in a more comprehensive and efficient conversational experience, complete with capabilities like coding in multiple languages, jokes, and riddles.
Additional info about Airoboros models
The Airoboros series offers an evolving set of LLaMa and Llama-2 based language models, each fine-tuned for specific capabilities using synthetic data. The series ranges from version 1.0 to 2.2.1, with enhancements added in each new iteration.
For example, version 1.1 focuses on trivia, coding, and multiple-choice questions. Version 1.2 incorporates plaintext coding and role-playing skills. Version 1.3 further refines coding and reasoning instructions, and experiments with multi-character interactions.
The 1.4 versions increase parameters and integrate abilities like coding in multiple languages, humor, and riddles. Versions 2.0, 2.1, and 2.2.1 take a more exploratory approach, with longer prompts, roleplay personas, and uncensored modes.
In essence, the series showcases the potential of instruction-based fine-tuning to systematically enhance language models. Each build iterates on capabilities like coding, reasoning, and conversation.
Airoboros 2.2.1
Airoboros 2.2.1 is a refined version of the Airoboros 2.2 model, bringing with it several improvements and adjustments. This iteration offers re-generated writing responses, with many sourced from gpt-4-0613, which tend to be more concise and have different readability scores when compared to responses from gpt-4-0314. In an effort to get closer to the 4k context size, the model now features longer contextual blocks, especially for items with multiple contexts.
One significant change is the removal of the "rp" data, as the content was found to be lackluster and not as engaging as intended. Instead of this, Airoboros 2.2.1 integrates responses inspired by fictional characters, capturing their distinct linguistic styles, which sometimes includes darker or more informal language. Additionally, the model undergoes a form of de-censoring, making it less restrictive in its responses.
Airoboros c34B 2.1
The c34B 2.1 model is a refined Llama-2 version, enhanced with synthetic data through the Airoboros system. It introduces an experimental instruction set categorized into "rp" and "gtkm." While "rp" allows for multi-round, emotive conversations defined by character cards, "gtkm" provides a streamlined alternative for character-based questioning. The model also supports intricate writing prompts and is capable of generating next chapters for ongoing narratives.
Airoboros GPT4 2.1
- Experimental role-play and "gtkm" categories.
- Adds experimental support for longer writing prompts and next-chapter generation.
- Uses a subset of high-quality instructions and includes "stylized_response" for better style adherence.
Airoboros GPT4 m2.0
- Similar to 2.0 but includes the 1.4.1 dataset.
- Data from both March and June GPT-4 versions are included.
Airoboros GPT4 2.0
- Generated from the June version of the GPT-4 synthetic dataset only (unlike m2.0, which also includes the March data).
Airoboros GPT4 1.4.1
- Mostly the same as 1.4 but built on Llama-2.
Airoboros GPT4 1.4
- Enhances multi-character, multi-turn conversations and adds coding examples in 10 languages.
- Introduces more role-playing, jokes, and riddles.
Airoboros GPT4 1.3
- Extension of 1.2.
- Adds "PLAINFORMAT" for all coding instructions and introduces orca style reasoning instructions.
- Includes a first attempt at multi-character interactions with various types of random items.
Airoboros GPT4 1.2
- Builds upon 1.1 with thousands of new training examples.
- Introduces "PLAINFORMAT" for coding prompts.
- Focus areas are similar to 1.1, but with the addition of role-playing.
Airoboros GPT4 1.1
- Fine-tuned LLaMa model.
- Adds ~1k more coding instructions and improves context instructions.
- Focuses on trivia, math/reasoning, coding, multiple choice, fill-in-the-blank questions, context-obedient QA, and theory of mind.
Hardware requirements
The performance of an Airoboros model depends heavily on the hardware it's running on. For recommendations on the best computer hardware configurations to handle Airoboros models smoothly, check out this guide: Best Computer for Running LLaMA and Llama-2 Models.
Below are the Airoboros hardware requirements for 4-bit quantization:
For 7B Parameter Models
If a 7B Airoboros model is what you're after, you gotta think about hardware in two ways. First, for the GPTQ version, you'll want a decent GPU with at least 6GB VRAM. The GTX 1660 or 2060, AMD 5700 XT, or RTX 3050 or 3060 would all work nicely. But for the GGML / GGUF format, it's more about having enough RAM. You'll need around 4 gigs free to run that one smoothly.
| Format | RAM Requirements | VRAM Requirements |
|---|---|---|
| GPTQ (GPU inference) | 6GB (Swap to Load*) | 6GB |
| GGML / GGUF (CPU inference) | 4GB | 300MB |
| Combination of GPTQ and GGML / GGUF (offloading) | 2GB | 2GB |
*RAM needed to load the model initially. Not required for inference. If your system doesn't have quite enough RAM to fully load the model at startup, you can create a swap file to help with the loading.
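As a concrete illustration of the CPU route, here's a minimal sketch using the llama-cpp-python bindings to run a 4-bit 7B GGUF file. The model path and filename are placeholders for illustration; substitute whichever quantized file you actually downloaded.

```python
# Minimal CPU-inference sketch with llama-cpp-python (pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(
    model_path="./models/airoboros-l2-7b-2.1.Q4_K_M.gguf",  # placeholder filename
    n_ctx=4096,    # context window; Airoboros 2.x targets ~4k
    n_threads=8,   # match your physical core count
)

out = llm("USER: Tell me a riddle.\nASSISTANT:", max_tokens=128)
print(out["choices"][0]["text"])
```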
For 13B Parameter Models
For beefier models like the Airoboros-L2-13B-2.1-GPTQ, you'll need more powerful hardware. If you're using the GPTQ version, you'll want a strong GPU with at least 10 gigs of VRAM. An AMD 6900 XT, RTX 2060 12GB, RTX 3060 12GB, or RTX 3080 would do the trick. For the CPU inference (GGML / GGUF) format, having enough RAM is key. You'll want your system to have around 8 gigs available to run it smoothly.
| Format | RAM Requirements | VRAM Requirements |
|---|---|---|
| GPTQ (GPU inference) | 12GB (Swap to Load*) | 10GB |
| GGML / GGUF (CPU inference) | 8GB | 500MB |
| Combination of GPTQ and GGML / GGUF (offloading) | 10GB | 10GB |
*RAM needed to load the model initially. Not required for inference. If your system doesn't have quite enough RAM to fully load the model at startup, you can create a swap file to help with the loading.
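For the GPTQ route, a minimal sketch with Hugging Face transformers looks like the following. It assumes the auto-gptq and optimum backends are installed, and the repo ID shown is an example (TheBloke's quantized upload); swap in whichever GPTQ checkpoint you actually use.

```python
# GPU-inference sketch for a 13B GPTQ model
# (pip install transformers optimum auto-gptq accelerate).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Airoboros-L2-13B-2.1-GPTQ"  # example repo ID
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")  # weights go to the GPU

inputs = tok("USER: Explain GPTQ in one sentence.\nASSISTANT:", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=64)[0], skip_special_tokens=True))
```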
For 30B, 33B, and 34B Parameter Models
If you're venturing into the realm of larger models, the hardware requirements shift noticeably. GPTQ models benefit from GPUs like the RTX 3080 20GB, A4500, A5000, and the like, demanding roughly 20GB of VRAM. Conversely, GGML / GGUF formatted models will require a significant chunk of your system's RAM, nearing 20GB.
| Format | RAM Requirements | VRAM Requirements |
|---|---|---|
| GPTQ (GPU inference) | 32GB (Swap to Load*) | 20GB |
| GGML / GGUF (CPU inference) | 20GB | 500MB |
| Combination of GPTQ and GGML / GGUF (offloading) | 10GB | 4GB |
*RAM needed to load the model initially. Not required for inference. If your system doesn't have quite enough RAM to fully load the model at startup, you can create a swap file to help with the loading.
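When a model this size doesn't fit entirely in VRAM, the practical middle ground is llama.cpp-style layer offloading: keep as many transformer layers on the GPU as memory allows and run the rest on the CPU. A rough sketch, assuming llama-cpp-python was built with GPU support (the filename is hypothetical, and the layer count is found by trial and error):

```python
# Hybrid CPU/GPU sketch: offload part of a large GGUF model to the GPU.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/airoboros-c34b-2.1.Q4_K_M.gguf",  # placeholder 34B quant
    n_gpu_layers=35,  # layers kept in VRAM; raise until you run out of memory
    n_ctx=4096,
)
print(llm("USER: Say hi.\nASSISTANT:", max_tokens=32)["choices"][0]["text"])
```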
For 65B and 70B Parameter Models
When you step up to the big models like the 65B and 70B variants (e.g., Airoboros-65B-GPT4-m2.0-GGML), you need some serious hardware. For GPU inference with GPTQ, you'll want a top-shelf GPU with at least 40GB of VRAM. We're talking an A100 40GB, dual RTX 3090s or 4090s, an A40, RTX A6000, or RTX 8000. You'll also need 64GB of system RAM. For GGML / GGUF CPU inference, have around 40GB of RAM available for both the 65B and 70B models.
| Format | RAM Requirements | VRAM Requirements |
|---|---|---|
| GPTQ (GPU inference) | 64GB (Swap to Load*) | 40GB |
| GGML / GGUF (CPU inference) | 40GB | 600MB |
| Combination of GPTQ and GGML / GGUF (offloading) | 20GB | 20GB |
*RAM needed to load the model initially. Not required for inference. If your system doesn't have quite enough RAM to fully load the model at startup, you can create a swap file to help with the loading.
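To spread a 65B/70B GPTQ model across two cards, transformers can shard the weights automatically given a per-device memory budget. The repo ID and memory caps below are assumptions you'd tune to your own hardware:

```python
# Multi-GPU sharding sketch for a 70B-class GPTQ model (transformers + accelerate).
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Airoboros-L2-70B-2.1-GPTQ",                 # example repo ID
    device_map="auto",                                     # split layers across available GPUs
    max_memory={0: "22GiB", 1: "22GiB", "cpu": "48GiB"},   # per-device caps; tune to your system
)
```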
Memory speed
When running Airoboros AI models, you gotta pay attention to how RAM bandwidth and model size impact inference speed. These large language models must read their full set of weights from RAM or VRAM for every token (piece of text) they generate. For example, a 4-bit quantized 7B parameter Airoboros model takes up around 4.0GB of RAM.
Suppose you have a Ryzen 5 5600X processor and DDR4-3200 RAM with a theoretical max bandwidth of 50 GB/s. In this scenario, you can expect to generate approximately 9 tokens per second. Typically, real-world throughput is about 70% of the theoretical maximum due to limiting factors such as inference software, latency, system overhead, and workload characteristics. To achieve a higher inference speed, say 16 tokens per second, you would need more bandwidth; for example, a system with DDR5-5600 offering around 90 GB/s would be enough.
For comparison, high-end GPUs like the Nvidia RTX 3090 boast nearly 930 GB/s of VRAM bandwidth, while DDR5-6400 system RAM tops out around 100 GB/s. Understanding and optimizing bandwidth is therefore crucial for running models like Airoboros efficiently.
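This back-of-the-envelope math is easy to reproduce. Here's a tiny sketch using the same assumptions as above (the whole model is streamed through memory once per token, at roughly 70% of theoretical bandwidth):

```python
# Rough tokens-per-second estimate from memory bandwidth and model size.
def est_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float, efficiency: float = 0.7) -> float:
    # Each generated token requires reading (roughly) the whole model from memory once.
    return bandwidth_gb_s * efficiency / model_size_gb

print(est_tokens_per_sec(50, 4.0))   # DDR4-3200: ~8.75 tok/s (about 9)
print(est_tokens_per_sec(90, 4.0))   # DDR5-5600: ~15.75 tok/s (about 16)
```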
Recommendations:
- For Best Performance: Opt for a machine with a high-end GPU (like NVIDIA's RTX 3090 or RTX 4090) or a dual-GPU setup to accommodate the largest models (65B and 70B). A system with adequate RAM (minimum 16GB, but 64GB is best) would be optimal.
- For Budget Constraints: If you're limited by budget, focus on Airoboros GGML/GGUF models that fit within your system RAM. Remember, while you can offload some weights to the system RAM, it will come at a performance cost.
Remember, these are recommendations, and the actual performance will depend on several factors, including the specific task, model implementation, and other system processes.
CPU requirements
For best performance, a modern multi-core CPU is recommended. An Intel Core i7 from 8th gen onward or an AMD Ryzen 5 from 3rd gen onward will work well. A CPU with 6 or 8 cores is ideal. Higher clock speeds also improve prompt processing, so aim for 3.6GHz or more.
Having CPU instruction sets like AVX, AVX2, and AVX-512 can further improve performance if available. The key is a reasonably modern consumer-level CPU with a decent core count and clock speed, along with baseline vector processing through AVX2 (required for CPU inference with llama.cpp). With those specs, the CPU should handle Airoboros models comfortably.
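If you want to verify those instruction sets before downloading anything, a quick Linux-only check (it reads /proc/cpuinfo, so it won't work on Windows or macOS) looks like this:

```python
# Quick Linux-only check for the SIMD extensions llama.cpp benefits from.
with open("/proc/cpuinfo") as f:
    flags = set(f.read().split())

for ext in ("avx", "avx2", "avx512f"):  # avx512f is the base AVX-512 flag
    print(f"{ext}: {'yes' if ext in flags else 'no'}")
```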