New Chinese Mini-PC with AI MAX+ 395 (Strix Halo) and 128GB Memory Targets Local LLM Inference
Chinese manufacturer FAVM has announced FX-EX9, a compact 2-liter Mini-PC powered by AMD’s Ryzen AI MAX+ 395 “Strix Halo” processor, potentially offering new options for enthusiasts running quantized large language models locally.
Memory Configuration Suitable for Local LLMs
The system comes equipped with 128GB of LPDDR5X memory on a 256-bit bus, and FAVM allows up to 110GB of it to be allocated as video memory. This substantial capacity addresses one of the primary constraints in running larger quantized models locally, potentially enabling inference of 70B-parameter models that would otherwise require multiple GPUs or specialized hardware.
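As a back-of-the-envelope check, 4-bit quantization keeps even very large models well inside that 110GB allocation. The sketch below is illustrative only: the ~4.5 bits-per-weight figure approximates common 4-bit quantization formats, and the flat KV-cache allowance is an arbitrary margin, not a measurement.

```python
VRAM_GB = 110  # FAVM's claimed maximum iGPU memory allocation

def model_footprint_gb(params_b: float, bits_per_weight: float = 4.5,
                       kv_cache_gb: float = 4.0) -> float:
    """Rough resident size: quantized weights plus a flat KV-cache allowance.

    bits_per_weight ~4.5 approximates typical 4-bit quantizations;
    the 4 GB KV-cache figure is an illustrative assumption.
    """
    return params_b * bits_per_weight / 8 + kv_cache_gb

for params_b in (8, 14, 32, 70, 120):
    footprint = model_footprint_gb(params_b)
    verdict = "fits" if footprint <= VRAM_GB else "does not fit"
    print(f"{params_b:>4}B: ~{footprint:.0f} GB -> {verdict}")
```

By this estimate even a ~120B model at 4-bit fits in the allocatable video memory, which no single consumer GPU can match on capacity.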
The memory bandwidth of 256GB/s, while significant for an integrated solution, is notably lower than dedicated consumer GPUs. For comparison, even the RTX 3060 12GB delivers 360GB/s of bandwidth. This limitation will likely impact performance with larger models, particularly during memory-intensive operations.
Expected Performance
Based on the specifications provided, this system should handle various quantized models with differing levels of performance:
- 4-bit quantized 8B models at approximately 38 tokens/second with 4K context windows
- 4-bit quantized 14B models at around 20 tokens/second with 4K context windows
- 4-bit quantized 70B models at a more modest 5-9 tokens/second
Prompt processing is also constrained: roughly 31 tokens/second for 70B models at short context lengths.
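The token-generation figures above line up with a simple bandwidth-bound model: each generated token must stream the full set of quantized weights from memory at least once, so bandwidth divided by model size gives a hard ceiling. The sketch below is an estimate under that assumption; the ~4.5 bits-per-weight figure again approximates common 4-bit quantizations.

```python
BANDWIDTH_GB_S = 256  # Strix Halo's quoted memory bandwidth

def decode_ceiling_tok_s(params_b: float, bits_per_weight: float = 4.5) -> float:
    """Upper bound on token-generation speed when decode is
    memory-bandwidth bound: bandwidth / quantized model size."""
    model_gb = params_b * bits_per_weight / 8
    return BANDWIDTH_GB_S / model_gb

for params_b in (8, 14, 70):
    print(f"{params_b:>3}B @ ~4-bit: <= {decode_ceiling_tok_s(params_b):.1f} tok/s")
```

The resulting ~6.5 tokens/second ceiling for a 70B model sits inside the quoted 5-9 tokens/second range, while the smaller models fall well short of their ceilings, where compute and software overheads start to dominate.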
AMD’s Hybrid Architecture Approach
AMD’s architecture for the Ryzen AI 300-series processors integrates three distinct compute engines: Zen 5 CPU cores, Radeon Graphics (iGPU), and the XDNA 2-based NPU. The company is developing software to distribute LLM workloads across these components, with the NPU potentially handling prompt processing (the prefill phase) while the iGPU manages token generation.
This approach, implemented through Lemonade Server and the ONNX Runtime GenAI engine, aims to improve efficiency when running smaller quantized models.
Hardware Specifications
The Mini-PC features the MAX+ 395 processor with 16 Zen 5 cores and 40 RDNA 3.5 Compute Units (Radeon 8060S) running at 120W TDP. FAVM claims this delivers performance comparable to a Ryzen 9 9955HX paired with a GeForce RTX 4070 Laptop GPU.
Connectivity options include HDMI 2.1, DisplayPort 2.1, dual USB4 ports supporting up to four 8K displays, and an OCuLink connector for external GPU expansion.
The inclusion of an OCuLink port is particularly interesting, as it permits the use of affordable second-hand external GPUs such as an RTX 3080 Ti. Users could offload prompt processing to the dedicated GPU (using solutions like KTransformers) to speed it up. This flexibility addresses a key limitation of unified memory architectures like Strix Halo, where prompt processing is notably slower than on dedicated NVIDIA hardware, especially for larger models with long context windows.
Market Positioning
While pricing hasn’t been announced, the system will need to compete with existing options: the GMK EVO-X2 mini-PC ($2,000), the HP ZBook Ultra G1a 14″ laptop ($3,940), and the ROG Flow Z13 ($2,799.99). For local LLM enthusiasts, a sub-$2,000 price point would make it worth considering.
Assessment for LLM Users
For those looking to run local LLMs, especially quantized 70B models, this Strix Halo-based Mini-PC offers a compact form factor with substantial memory capacity. However, the relatively limited memory bandwidth compared to discrete GPUs suggests it will perform best with smaller models or in scenarios where space and power efficiency take priority over raw inference speed.
The system’s performance-per-watt and space efficiency may appeal to enthusiasts with specific constraints, though those requiring maximum inference speed will still be better served by discrete GPU solutions.