Llama 4 Scout & Maverick Benchmarks on Mac: How Fast Is Apple’s M3 Ultra with These LLMs?

The landscape of local large language model (LLM) inference is evolving at a breakneck pace. For enthusiasts building dedicated systems, maximizing performance-per-dollar while navigating the ever-present VRAM ceiling is a constant challenge. Following closely on the heels of models like DeepSeek V3, Meta has released its Llama 4 series, featuring Mixture of Experts (MoE) architectures. We’ve obtained early benchmarks running the 4-bit quantized versions of Llama 4 Scout and Llama 4 Maverick on a maxed-out Mac Studio M3 Ultra using the MLX framework, providing valuable insights for those considering Apple Silicon for high-end local inference.

Understanding the Llama 4 Contenders

Before diving into the numbers, let’s briefly dissect the new Llama 4 models tested:

  • Llama 4 Scout: This model features 17 billion active parameters routed through 16 experts, resulting in a total parameter count of approximately 109 billion. It boasts an impressive 10 million token context window. The implication for hardware is a moderate VRAM/memory footprint for the active parameters during inference, but potentially enormous unified memory requirements if one were to attempt utilizing the full context length – a scenario likely exceeding even the M3 Ultra’s capacity.

  • Llama 4 Maverick: Stepping up significantly, Maverick also uses 17 billion active parameters but routes them through 128 experts, totaling around 400 billion parameters. Its context window is a more conventional (for current high-end models) 1 million tokens. While inference relies on the smaller active parameter set, loading the full 400B model necessitates a substantial memory pool, making high-capacity unified memory systems like the top-tier Mac Studio particularly relevant; a rough memory estimate is sketched just below.
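To put those parameter counts in perspective, here is a rough back-of-the-envelope estimate of the weight memory needed at 4-bit quantization. It is only an approximation: it ignores quantization scales and metadata, activations, and the KV cache, which is exactly the component that balloons if you actually push toward Scout's 10 million token context.

```python
# Rough weight-memory estimate for the 4-bit quantized Llama 4 models.
# Approximation only: ignores quantization metadata, activations, and KV cache.

BYTES_PER_4BIT_WEIGHT = 0.5  # 4 bits = 0.5 bytes per parameter

models = {
    "Llama 4 Scout (109B total, 17B active)": 109e9,
    "Llama 4 Maverick (400B total, 17B active)": 400e9,
}

for name, total_params in models.items():
    weight_gb = total_params * BYTES_PER_4BIT_WEIGHT / 1e9
    print(f"{name}: ~{weight_gb:.1f} GB of weights at 4-bit")

# Approximate output: Scout ~54.5 GB, Maverick ~200 GB of weights,
# which is why the 512GB M3 Ultra is relevant for Maverick in particular.
```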

The MoE architecture is key here. Unlike dense models, where every parameter is engaged for every token, MoE models route each token through only a small subset of ‘experts’, so only the active parameters (17 billion in both cases) participate in each forward pass. This generally leads to faster inference than a dense model of equivalent total size, though the full set of weights must still be loaded, so the overall memory footprint remains significant.
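To make the active-versus-total distinction concrete, here is a toy top-k routing sketch. This is not Llama 4's actual router; the dimensions and gating scheme are made up purely to illustrate how a gate can pick a few experts per token so most of the weights sit idle for any given input:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts, top_k = 64, 16, 2               # toy sizes; 16 experts, Scout-like
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
gate = rng.standard_normal((d_model, n_experts))    # router ("gate") weights

def moe_forward(x):
    """Route a single token vector through only the top-k scoring experts."""
    logits = x @ gate                                # score every expert for this token
    topk = np.argsort(logits)[-top_k:]               # indices of the k best experts
    weights = np.exp(logits[topk])
    weights /= weights.sum()                         # softmax over the selected experts
    # Only top_k expert matrices are touched; the other 14 stay idle for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, topk))

token = rng.standard_normal(d_model)
print(moe_forward(token).shape)                      # (64,)
```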

The Test Platform: Mac Studio M3 Ultra

Our benchmark data comes from a Mac Studio equipped with the M3 Ultra System-on-Chip (SoC). Specifically, this configuration pairs a 32-core CPU and an 80-core GPU with 512GB of unified memory at roughly 800 GB/s of memory bandwidth.

This high-memory configuration is crucial. Apple’s unified memory architecture allows both the CPU and GPU cores to access the same large memory pool directly, eliminating the need to shuttle data across PCIe lanes, which is a significant advantage for LLMs that often exceed the VRAM capacity of discrete GPUs. The 800 GB/s of bandwidth is respectable, and while it is still lower than that of high-end multi-GPU NVIDIA setups, it is the sheer size of the accessible memory pool that sets the M3 Ultra apart for models of this scale.

Llama 4 Performance on M3 Ultra (MLX, 4-bit Quantization)

The benchmarks were run using MLX, Apple’s framework optimized for machine learning on Apple Silicon. Both Llama 4 models were tested with 4-bit quantization.

| Model            | Context Size | Prompt Processing Speed | Generation Speed  |
|------------------|--------------|-------------------------|-------------------|
| Llama 4 Scout    | 30 tokens    | 103.01 tokens/sec       | 44.15 tokens/sec  |
| Llama 4 Scout    | 10k tokens   | 82.03 tokens/sec        | 21.56 tokens/sec  |
| Llama 4 Maverick | 30 tokens    | 140.23 tokens/sec       | 50.11 tokens/sec  |
| Llama 4 Maverick | 10k tokens   | 117.07 tokens/sec       | 24.77 tokens/sec  |
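For readers who want to gather numbers like these themselves, the sketch below shows roughly how a 4-bit model can be loaded and run with the mlx-lm package (pip install mlx-lm). Treat it as a minimal example under assumptions: the repository name is a placeholder for whatever 4-bit MLX conversion you have locally, and in the versions we have used, verbose=True prints the prompt and generation tokens/sec for the run.

```python
# Minimal sketch of running a 4-bit quantized model with mlx-lm on Apple Silicon.
# The model repository name below is an assumption; substitute the 4-bit MLX
# conversion of Llama 4 Scout or Maverick that you actually have available.
from mlx_lm import load, generate

MODEL = "mlx-community/Llama-4-Scout-17B-16E-Instruct-4bit"  # assumed repo name

model, tokenizer = load(MODEL)

prompt = "Summarize the trade-offs of Mixture of Experts models in three sentences."

# verbose=True typically reports prompt and generation throughput (tokens/sec),
# which is the kind of figure shown in the table above.
text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
print(text)
```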

Analysis of Results

  1. Strong Prompt Processing: The M3 Ultra demonstrates impressive prompt processing speeds, particularly with the larger Maverick model reaching over 140 tokens/sec on short prompts and maintaining over 117 tokens/sec even at a 10k token context. This highlights the efficiency of the unified memory architecture and MLX when initially evaluating lengthy inputs. This initial processing phase can be a significant part of the user experience for tasks involving large documents or histories.

  2. Decent Generation Speed: Generation speeds are respectable, hitting around 50 tokens/sec (Maverick) and 44 tokens/sec (Scout) for short contexts, and settling into the mid-20s tokens/sec (Maverick) and low-20s tokens/sec (Scout) at 10k tokens. As expected, generation speed decreases as the context window grows, largely because attention over an ever-larger KV cache adds compute and memory traffic for every generated token.

  3. Maverick’s Edge: Despite having a significantly larger total parameter count (400B vs 109B), Maverick consistently outperforms Scout in both prompt processing and generation speed across both context sizes tested. Since both models activate the same 17B parameters per token, total size mainly dictates memory footprint rather than per-token compute; Maverick’s edge likely reflects optimizations within the MoE routing or implementation differences favoring the larger expert count on this specific hardware/software stack. It is a reminder that total parameter count is not the sole determinant of speed in MoE models.

  4. Context Impact: The drop-off in generation speed from 30 tokens to 10k tokens is noticeable (roughly halved for both models). While still usable, this underscores that even with massive unified memory, computational limits become apparent as context scales.

Comparison: Llama 4 vs. DeepSeek V3 on M3 Ultra

How do these new Llama 4 models stack up against another recent MoE behemoth, DeepSeek V3, on the same M3 Ultra hardware? Drawing from previous analyses (like our look at DeepSeek’s speed with MLX), we can make some comparisons using the DeepSeek V3 (671B total parameters) benchmarks on the M3 Ultra (512GB, MLX, 4-bit):

  • DeepSeek V3 @ 15,777 tokens: Prompt Processing: ~69.45 tokens/sec; Generation: ~5.79 tokens/sec

  • Llama 4 Maverick @ 10,000 tokens: Prompt Processing: 117.07 tokens/sec; Generation: 24.77 tokens/sec

  • Llama 4 Scout @ 10,000 tokens: Prompt Processing: 82.03 tokens/sec; Generation: 21.56 tokens/sec

Both Llama 4 models demonstrate significantly faster prompt processing than DeepSeek V3, even at large context sizes. Llama 4 Maverick is nearly 70% faster at processing a 10k prompt than DeepSeek V3 is at processing a ~16k prompt. This difference is likely attributable to DeepSeek V3’s substantially larger total parameter count (671B vs 400B/109B) and its higher active parameter count per token (roughly 37B active vs 17B), which impose a heavier load even with MoE architectures.

The difference in generation speed is even more stark. At large context sizes (~10k-16k tokens), both Llama 4 models deliver over 3.5x the generation speed of DeepSeek V3 on the M3 Ultra. Maverick achieves ~24.8 t/s and Scout ~21.6 t/s at 10k context, compared to DeepSeek V3’s ~5.8 t/s at ~16k context. While the context sizes aren’t identical, the trend is clear: the Llama 4 models appear considerably more computationally efficient during token generation on this hardware.
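For transparency, the "nearly 70% faster" and "over 3.5x" figures fall straight out of the measured numbers; a quick sanity-check calculation looks like this (bearing in mind the context sizes are not identical, so these are indicative ratios rather than a like-for-like benchmark):

```python
# Sanity-check the comparison ratios from the measured tokens/sec values above.
deepseek_pp, deepseek_gen = 69.45, 5.79        # DeepSeek V3 @ ~15.8k tokens
maverick_pp, maverick_gen = 117.07, 24.77      # Llama 4 Maverick @ 10k tokens
scout_pp, scout_gen = 82.03, 21.56             # Llama 4 Scout @ 10k tokens

print(f"Maverick prompt processing: {maverick_pp / deepseek_pp - 1:.0%} faster")  # ~69%
print(f"Maverick generation: {maverick_gen / deepseek_gen:.1f}x DeepSeek V3")     # ~4.3x
print(f"Scout generation:    {scout_gen / deepseek_gen:.1f}x DeepSeek V3")        # ~3.7x
```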

Value Proposition and Upgrade Path

For the technical enthusiast focused purely on performance-per-dollar for models fitting within, say, 48GB of VRAM (like 70B models across two 24GB GPUs), traditional multi-GPU PC builds likely still hold the crown.

However, the calculus changes dramatically when targeting models exceeding 100GB or even 200GB memory requirements like Llama 4 Maverick or DeepSeek V3. Here, the Mac Studio M3 Ultra, particularly in its 256GB or 512GB configurations, carves out a unique niche. It offers a turnkey solution that sidesteps the VRAM wall, providing a practical (though costly) platform for experimenting with the largest publicly available models.

For users currently hitting VRAM limits on their multi-GPU setups when trying to run 100B+ models or use very large context windows, the M3 Ultra represents a potential, albeit platform-shifting, upgrade path. It trades the modularity and GPU-compute focus of x86/NVIDIA for the massive unified memory advantage of Apple Silicon.

Final Thoughts

The Mac Studio M3 Ultra continues to impress in its ability to handle massive LLMs locally. The initial benchmarks for Llama 4 Scout and Maverick showcase strong performance. While prompt processing and generation speeds still drop as context grows, the results are usable and demonstrate the viability of Apple’s high-memory platform for cutting-edge MoE models.
