Solar LLM: Versions, Prompt Templates & Hardware Requirements

Updated: 2024-02-29 | Base model

Explore all versions of the model, their file formats like GGUF, GPTQ, and EXL2, and understand the hardware requirements for local inference.

The Solar model is a large language model built on the Llama 2 architecture. Its 32-layer foundation is initialized with pretrained weights from the Mistral 7B model, chosen for its compatibility with Llama 2 and its strong performance. Solar's development features a depthwise scaling process: the base model is duplicated, layers are removed from each copy, and the two copies are concatenated. The result is a 48-layer model that sits between 7 and 13 billion parameters and exemplifies a non-Mixture-of-Experts (MoE) approach to scaling. It mirrors the technique of building larger models by combining and rearranging layers from smaller base models into a single, deeper architecture.

Solar's capabilities are further refined through a comprehensive fine-tuning process that combines several datasets. Instruction tuning uses datasets such as Alpaca-GPT4, OpenOrca, and Synth. Math-Instruct to strengthen the model's ability to follow instructions in a question-answer format, especially in mathematical contexts.

Model Overview

The Solar model is a sophisticated large language model developed using advanced scaling techniques and extensive fine-tuning processes. It represents an innovative approach to creating high-capacity models by leveraging existing architectures and pretrained weights.

Base Model: Llama 2 Architecture

  • Structure: Solar uses a 32-layer Llama 2 architecture as its foundation.
  • Pretraining: The Llama 2 base is initialized with pretrained weights from the Mistral 7B model, a top performer compatible with the Llama 2 architecture. This choice aims to utilize the robust features and capabilities of Mistral 7B.

Depthwise Scaling Process

  • Method: Solar undergoes a process called depthwise scaling, starting with the 32-layer base model (n = 32).
  • Scaling: The model is duplicated and modified by removing m layers from each version, resulting in two distinct models with n − m layers. These models are then concatenated, creating a scaled model with s = 2·(n−m) layers.
  • Parameters: For Solar, s is set to 48 layers, which places the model between 7 and 13 billion parameters and requires removing m = 8 layers from each of the duplicated models (a short sketch of this arithmetic follows the list).
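
For a concrete picture of the layer arithmetic, here is a minimal sketch in Python. Plain lists of layer indices stand in for transformer blocks, and which end of each copy gets trimmed is an assumption based on the depth up-scaling description above, not a statement about Upstage's training code.

```python
# Sketch of the depthwise scaling arithmetic described above.
# Layer indices stand in for transformer blocks that would carry
# pretrained Mistral 7B weights in the real model.

n = 32  # layers in the Llama 2 style base model
m = 8   # layers removed from each copy

base = list(range(n))

# One copy keeps its first n - m layers, the other keeps its last n - m
# layers; the two are then concatenated. (Which end is trimmed from which
# copy is assumed here.)
upper = base[: n - m]   # layers 0..23
lower = base[m:]        # layers 8..31
scaled = upper + lower

s = len(scaled)         # s = 2 * (n - m)
print(f"scaled model depth: {s} layers")  # -> scaled model depth: 48 layers
```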

Training and Fine-Tuning

Instruction Tuning

  • Datasets Used: The model is fine-tuned using the Alpaca-GPT4, OpenOrca, and Synth. Math-Instruct datasets.
  • Purpose: This stage trains the model to follow instructions in a question-answer format, enhancing its capabilities, especially in mathematics (an example of this prompt format follows the list).
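
To make the question-answer format concrete, here is a small, hypothetical helper that builds a prompt in the User/Assistant style commonly used by instruction-tuned Solar releases. The exact markers are an assumption, so verify them against the model card of the checkpoint you download.

```python
# Hypothetical prompt builder for an instruction-tuned Solar checkpoint.
# The "### User:" / "### Assistant:" markers are assumed; confirm the
# template on the model card of the specific release you are running.

def build_prompt(question: str) -> str:
    return (
        "### User:\n"
        f"{question}\n\n"
        "### Assistant:\n"
    )

print(build_prompt("A train travels 120 km in 1.5 hours. What is its average speed?"))
```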

Alignment Tuning

  • Datasets Used: For further refining, Solar employs the Orca DPO Pairs, Ultrafeedback Cleaned, and Synth. Math-Alignment datasets.
  • Goal: The alignment tuning stage aims to align the model more closely with human preferences and AI benchmarks, using direct preference optimization (DPO) techniques (a minimal sketch of the DPO objective follows the list).
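
For intuition about what direct preference optimization does, the core objective can be written in a few lines of PyTorch. This is a generic sketch of the standard DPO loss over per-sequence log-probabilities, not Upstage's training code; the beta value and the toy numbers are illustrative only.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss over summed per-sequence log-probabilities.

    Each argument has shape (batch,) and holds log p(response | prompt)
    under the trained policy or the frozen reference model.
    """
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # Push the policy to prefer the chosen response over the rejected one
    # by more than the reference model already does.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy batch of two preference pairs with made-up log-probabilities.
loss = dpo_loss(torch.tensor([-12.0, -15.0]), torch.tensor([-20.0, -18.0]),
                torch.tensor([-13.0, -15.5]), torch.tensor([-19.0, -17.5]))
print(loss.item())
```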

Model Characteristics

  • Scaling Methodology: Unlike Mixtral, Solar is not a Mixture of Experts (MoE) model. Instead, it follows a process similar to the one used to build 20B models from multiple 13B models and 120B models from multiple 70B models: it mixes and matches layers from the base models to create an architecture larger than any of the individual components.
  • Integration: By adopting the Llama 2 architecture, Solar leverages a vast pool of community resources, while introducing novel modifications to enhance its capabilities.
  • Efficiency: The model is designed with hardware constraints in mind, ensuring efficient operation within the specified parameter range.

In summary, the Solar model represents a significant step forward in large language model development, combining deep architectural knowledge with innovative scaling and fine-tuning methods. Its creation process reflects a blend of technical sophistication and practical application, making it a potent tool in the field of AI and natural language processing.

Hardware requirements

The performance of a Solar model depends heavily on the hardware it's running on. For recommendations on the best computer hardware configurations to handle Solar models smoothly, check out this guide: Best Computer for Running LLaMA and LLama-2 Models.

Below are the Solar hardware requirements for 4-bit quantization:

For 7B Parameter Models

If the 7B model is what you're after, you gotta think about hardware in two ways. First, for the GPTQ version, you'll want a decent GPU with at least 6GB VRAM. The GTX 1660 or 2060, AMD 5700 XT, or RTX 3050 or 3060 would all work nicely. But for the GGML / GGUF format, it's more about having enough RAM. You'll need around 4 gigs free to run that one smoothly.

Format                                              RAM Requirements      VRAM Requirements
GPTQ (GPU inference)                                6GB (Swap to Load*)   6GB
GGML / GGUF (CPU inference)                         4GB                   300MB
Combination of GPTQ and GGML / GGUF (offloading)    2GB                   2GB

*RAM needed to load the model initially. Not required for inference. If your system doesn't have quite enough RAM to fully load the model at startup, you can create a swap file to help with the loading.
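
If you go the GGUF route, partial offloading simply means handing some of the model's layers to the GPU at load time and keeping the rest in system RAM. Below is a minimal sketch using the llama-cpp-python bindings; the model filename, layer count, and thread count are placeholders, and the prompt uses the assumed User/Assistant template mentioned earlier, so adjust all of them to your setup.

```python
# Minimal GGUF inference sketch with llama-cpp-python.
# Filename, n_gpu_layers, and n_threads are placeholders for your setup.
from llama_cpp import Llama

llm = Llama(
    model_path="solar-10.7b-instruct.Q4_K_M.gguf",  # hypothetical filename
    n_ctx=4096,       # context window
    n_gpu_layers=20,  # 0 = pure CPU; raise until your VRAM is full
    n_threads=8,      # roughly your physical core count
)

prompt = "### User:\nExplain depthwise scaling in two sentences.\n\n### Assistant:\n"
output = llm(prompt, max_tokens=256)
print(output["choices"][0]["text"])
```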

Memory speed

When running Solar AI models, you gotta pay attention to how RAM bandwidth and model size impact inference speed. These large language models need to be read in full from RAM or VRAM each time they generate a new token (piece of text). For example, a 4-bit quantized 7B parameter Solar model takes up around 4.0GB of RAM.

Suppose you have a Ryzen 5 5600X processor and DDR4-3200 RAM with a theoretical max bandwidth of 50 GB/s. In this scenario, you can expect to generate approximately 9 tokens per second. Typically, this performance is about 70% of your theoretical maximum speed due to several limiting factors such as inference software, latency, system overhead, and workload characteristics, which prevent reaching the peak speed. To achieve a higher inference speed, say 16 tokens per second, you would need more bandwidth. For example, a system with DDR5-5600 offering around 90 GB/s could be enough.
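
The numbers above come straight from dividing memory bandwidth by model size and applying an efficiency factor; here is a small sketch of that estimate, using the roughly 70% figure mentioned above:

```python
# Back-of-the-envelope throughput estimate: every generated token streams
# the full set of weights through memory once, so tokens/s is roughly
# bandwidth / model size, discounted by an efficiency factor.

def estimate_tokens_per_second(bandwidth_gb_s: float,
                               model_size_gb: float,
                               efficiency: float = 0.7) -> float:
    return efficiency * bandwidth_gb_s / model_size_gb

print(estimate_tokens_per_second(50, 4.0))   # DDR4-3200 dual channel -> ~8.8 t/s
print(estimate_tokens_per_second(90, 4.0))   # DDR5-5600 -> ~15.8 t/s
print(estimate_tokens_per_second(930, 4.0))  # RTX 3090 VRAM -> ~163 t/s
```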

For comparison, high-end GPUs like the Nvidia RTX 3090 boast nearly 930 GB/s of bandwidth for their VRAM, while DDR5-6400 RAM can provide up to 100 GB/s. Therefore, understanding and optimizing bandwidth is crucial for running models like Solar efficiently.

Recommendations:

  1. For Best Performance: Opt for a machine with a high-end GPU (like NVIDIA's RTX 3090 or RTX 4090) or a dual-GPU setup to accommodate the largest models (65B and 70B). A system with adequate RAM (minimum 16 GB, but 64 GB is best) would be optimal.
  2. For Budget Constraints: If you're limited by budget, focus on Solar GGML/GGUF models that fit within the system RAM. Remember, while you can offload some weights to the system RAM, it will come at a performance cost.

Remember, these are recommendations, and the actual performance will depend on several factors, including the specific task, model implementation, and other system processes.

CPU requirements

For best performance, a modern multi-core CPU is recommended. An Intel Core i7 from the 8th gen onward or an AMD Ryzen 5 from the 3rd gen onward will work well. A CPU with 6 or 8 cores is ideal. Higher clock speeds also improve prompt processing, so aim for 3.6GHz or more.

Support for CPU instruction sets like AVX, AVX2, and AVX-512 can further improve performance where available. The key is a reasonably modern consumer-level CPU with a decent core count and clock speeds, along with baseline vector processing through AVX2 (required for CPU inference with llama.cpp). With those specs, the CPU should handle the Solar model's size without trouble.
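
To confirm that a CPU exposes the vector extensions llama.cpp relies on, you can inspect the feature flags it reports. The snippet below reads /proc/cpuinfo, so it is Linux-only; on other platforms, check your OS's CPU information tools instead.

```python
# Linux-only check for the vector instruction sets mentioned above.

def cpu_flags() -> set[str]:
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
for feature in ("avx", "avx2", "avx512f"):
    print(f"{feature}: {'yes' if feature in flags else 'no'}")
```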