Apple’s latest Mac Studio, particularly the M3 Ultra variant configured with a staggering 512GB of unified memory, presents a unique proposition for local Large Language Model (LLM) enthusiasts. This massive memory pool theoretically allows running models far larger than what’s feasible on most consumer or even pro GPU setups without resorting to complex multi-card configurations. But capacity is only one part of the equation.
Performance, especially with demanding models and large context windows, is critical. We took a closer look at how the top-tier M3 Ultra fares when running the colossal DeepSeek V3 671B parameter model (specifically, the q4_K_M GGUF quant) using the popular llama.cpp inference engine. The results paint a picture of impressive capability tempered by significant performance considerations, particularly concerning prompt processing time with large contexts.
The primary allure of a 512GB unified memory system is the ability to load truly massive models entirely into memory, avoiding the need for slower system RAM or storage offloading. DeepSeek 671B, even in a 4-bit quantized format like q4_K_M, requires roughly 405GB of memory, making the 512GB M3 Ultra one of the few single-system solutions capable of handling it locally.
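As a rough sanity check on that figure, the footprint of a GGUF quant can be estimated from the parameter count and the average bits per weight. The sketch below assumes roughly 4.85 bits per weight for q4_K_M, a commonly cited average for that quant type; the exact number varies with the specific conversion, and the KV cache and compute buffers come on top.

```python
# Back-of-envelope estimate of GGUF weight size (assumes ~4.85 bits/weight for q4_K_M).
def gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate in-memory size of the quantized weights in GB."""
    return n_params * bits_per_weight / 8 / 1e9

deepseek_params = 671e9  # total parameters; with MoE, all experts stay resident in memory
weights_gb = gguf_size_gb(deepseek_params, bits_per_weight=4.85)
print(f"~{weights_gb:.0f} GB of weights")  # ~407 GB, in line with the ~405 GB figure above
# Add the KV cache and Metal compute buffers on top, and a 512GB machine is
# effectively the minimum single-box configuration for this quant.
```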
The Benchmark Numbers: Processing vs. Generation
Our tests focused on evaluating both prompt processing speed (the time taken to ingest the initial context) and token generation speed (the speed at which the model produces its response). These were conducted using llama.cpp (via the KoboldCpp frontend) with a substantial context size to simulate real-world workloads beyond simple chat interactions.
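For readers who want to reproduce this kind of split on their own hardware, the sketch below shows one way to separate prompt processing from generation time using the llama-cpp-python bindings with streaming output. The model path is a placeholder, and the time to the first streamed chunk is only an approximation of pure prompt processing (it includes one generated token); llama.cpp and KoboldCpp also report prompt-eval and generation timings directly in their console output.

```python
import time
from llama_cpp import Llama  # pip install llama-cpp-python (built with Metal support on macOS)

# Placeholder path to a local GGUF file; set the context size to match your workload.
llm = Llama(model_path="/models/deepseek-v3-671b-q4_k_m.gguf", n_ctx=8192, n_gpu_layers=-1)

prompt = open("large_prompt.txt").read()  # e.g. an ~8k-token document

start = time.perf_counter()
first_token_at = None
chunks = []
for chunk in llm(prompt, max_tokens=512, stream=True):
    if first_token_at is None:
        first_token_at = time.perf_counter()  # the prompt has been ingested by this point
    chunks.append(chunk["choices"][0]["text"])
end = time.perf_counter()

print(f"prompt processing (approx): {first_token_at - start:.1f}s")
print(f"generation: {end - first_token_at:.1f}s for {len(chunks)} chunks (~tokens)")
```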
Here are the key figures for the DeepSeek V3 671B q4_K_M GGUF model on the M3 Ultra 512GB using llama.cpp:
- Context Processed: 8001 tokens
- Prompt Processing: 888.01 seconds (14.80 minutes) at 9.01 tokens/second
- Token Generation: 145.25 seconds (2.42 minutes) at 6.21 tokens/second
- Total Time: 1033.26 seconds (17.22 minutes)
Several things stand out immediately. First, the prompt processing time is substantial. Waiting over 14 minutes for the model to simply ingest an ~8k token prompt before generating a single output token is a significant latency hit. This is the primary bottleneck observed.
Second, the token generation speed of roughly 6.2 tokens/second is not blazing fast, but it is considerably more respectable given the model's size.
MoE Architecture and Memory Bandwidth
Why is prompt processing so much slower than generation, especially compared to smaller models? The answer lies partly in the architecture of DeepSeek 671B and the nature of prompt processing itself.
DeepSeek 671B is a Mixture-of-Experts (MoE) model. During token generation, it activates only a subset of its parameters for each token (reportedly around 37B active parameters out of the total 671B). This significantly reduces the computational load per generated token compared to a dense model of equivalent size, and the observed generation speed of ~6.2 T/s is broadly consistent with running a ~37B parameter model.
Prompt processing (prefill), however, must push every token of the input through the network, and across a long prompt the router spreads those tokens over most of the experts. Because the tokens are evaluated in parallel as large batched matrix multiplications, prefill is bound by raw compute rather than memory bandwidth. This is where the M3 Ultra falls behind: its unified memory offers plenty of bandwidth, but its GPU delivers far less matrix-multiply throughput than NVIDIA GPUs equipped with Tensor Cores purpose-built for accelerating exactly these operations.
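A simple way to see the asymmetry is to model the two phases separately: prefill time scales with prompt length times active parameters divided by available compute, while decode time per token scales with the bytes of weights streamed from memory divided by bandwidth. The toy model below uses the M3 Ultra's published ~819 GB/s memory bandwidth but treats sustained matrix-multiply throughput as an assumed free parameter, since the achievable figure depends heavily on the backend and kernels; it is meant only to show which resource each phase leans on, not to predict the benchmark numbers above.

```python
# Toy cost model: prefill is compute-bound, decode is bandwidth-bound.
# All inputs are illustrative assumptions except the ~819 GB/s bandwidth (Apple's spec).

def prefill_seconds(prompt_tokens: int, active_params: float, sustained_tflops: float) -> float:
    flops = 2 * active_params * prompt_tokens        # ~2 FLOPs per active parameter per token
    return flops / (sustained_tflops * 1e12)

def decode_tokens_per_second(active_params: float, bits_per_weight: float, bandwidth_gbs: float) -> float:
    bytes_per_token = active_params * bits_per_weight / 8  # weights streamed for each new token
    return bandwidth_gbs * 1e9 / bytes_per_token            # upper bound, ignores KV-cache reads

# DeepSeek V3: ~37B active parameters at ~4.85 bits/weight (q4_K_M), 8k-token prompt.
print(prefill_seconds(8000, 37e9, sustained_tflops=5.0))        # ~118 s at an assumed 5 TFLOPS sustained
print(decode_tokens_per_second(37e9, 4.85, bandwidth_gbs=819))  # ~37 t/s theoretical ceiling
```

The measured figures fall well short of these ceilings, which is unsurprising: the toy model ignores attention over the growing KV cache, MoE routing overhead, and how efficiently the Metal backend actually keeps the GPU busy. The point is simply that prefill stresses compute while decode stresses bandwidth, and the M3 Ultra is far stronger on the latter.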
Contrast this with a dense model like Command-A 111B (q8) tested on the same hardware:
- Context Processed: ~8201 tokens
- Prompt Processing: 90.00 seconds (1.50 minutes) at 91.12 tokens/second
- Token Generation: 206.37 seconds (3.43 minutes) at 3.92 tokens/second
- Total time: 296.37 seconds (4.93 minutes)
Here, the 111B dense model processes the prompt far faster (91 T/s vs 9 T/s) than the 671B MoE model, as you would expect from a model with a fraction of the total parameters. Its token generation, however, is slower (3.9 T/s vs 6.2 T/s), because every generated token must read all 111 billion parameters, whereas DeepSeek only activates its ~37B-parameter subset.
The Large Context Problem and Misleading Benchmarks
Many performance showcases, particularly on platforms like X or YouTube, use very short prompts of a few dozen or a few hundred tokens, which paints an overly optimistic picture. The M3 Ultra may start generating quickly in those scenarios, but such tests completely mask the latency introduced by large initial contexts, which are common in tasks like document summarization, RAG (Retrieval-Augmented Generation) with extensive background material, or analysis of large codebases. If your workflow involves feeding in thousands of tokens up front, prompt processing time becomes the dominant factor in the user experience.
For multi-turn chat sessions, techniques like “context shifting” (as implemented in frontends like KoboldCpp and standard practice in modern inference engines) help significantly after the initial prompt is processed. Subsequent turns only require processing the new user input and the model’s last response, making interactions much faster. However, this doesn’t alleviate the pain of that first, large context ingestion.
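Conceptually, what these frontends do is reuse the cached state for whatever prefix of the context is unchanged, so only new tokens have to go through the expensive prefill step. The sketch below is a simplified illustration of that bookkeeping, not KoboldCpp's actual implementation: real engines keep the transformer KV cache, while this toy version just compares token IDs and reports how much of the new context would actually need processing.

```python
# Simplified illustration of prefix reuse ("context shifting" style caching).

def tokens_to_process(previous: list[int], current: list[int]) -> list[int]:
    """Return only the suffix of `current` that is not already covered by the cached prefix."""
    shared = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        shared += 1
    return current[shared:]

# First turn: the whole ~8k-token prompt must be prefilled (the slow step).
turn_1 = list(range(8000))
print(len(tokens_to_process([], turn_1)))       # 8000

# Second turn: same history plus ~60 new tokens of user input and model reply.
turn_2 = turn_1 + list(range(8000, 8060))
print(len(tokens_to_process(turn_1, turn_2)))   # 60 -> follow-up turns stay snappy
```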
The Software Factor: llama.cpp vs. MLX
It’s crucial to note that these benchmarks were performed using llama.cpp. Apple provides its own framework, MLX, specifically optimized for Apple Silicon. Reports and benchmarks from the community suggest that MLX can offer substantially better prompt processing performance on M-series chips compared to llama.cpp, sometimes by a factor of 4-5x for models like DeepSeek 671B.
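For readers who want to compare the two stacks themselves, the mlx-lm package exposes a simple generation API. The sketch below is a minimal example under stated assumptions: the model identifier is a placeholder (a 671B MLX quant would need to be downloaded or converted separately), and exact function signatures can differ between mlx-lm versions.

```python
# pip install mlx-lm   (Apple Silicon only)
from mlx_lm import load, generate

# Placeholder model id: substitute a local path or a Hugging Face repo containing MLX weights.
model, tokenizer = load("path/to/deepseek-v3-mlx-4bit")

response = generate(
    model,
    tokenizer,
    prompt="Summarize the following document: ...",
    max_tokens=256,
    verbose=True,  # prints prompt and generation throughput, handy for comparing against llama.cpp
)
print(response)
```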
Is the $10,000 Mac Studio Worth It for Massive LLMs?
The M3 Ultra Mac Studio configured with 512GB memory represents a significant investment, typically around $10,000 USD depending on other configuration choices. Based on these benchmarks, the value proposition for users specifically targeting models like DeepSeek 671B with large context requirements is questionable if low latency is critical. Waiting over 14 minutes to process a large document or codebase before generation begins will be unacceptable for many workflows.
However, the picture changes if:
- You primarily use models up to the ~70B-120B class: For these, the M3 Ultra offers very usable performance, especially for interactive chat, with prompt processing being much faster.
- Your primary workload involves large models but mostly short, interactive prompts: Context shifting makes multi-turn conversations snappy after the initial (potentially slow) prompt load.
- You leverage MLX: The significantly improved prompt processing speeds reported with MLX could drastically alter the usability equation for large context tasks, making the Mac Studio a more viable, albeit still premium, option.
- You value the simplicity of a single system with massive unified memory over managing a complex, power-hungry multi-GPU setup.
- You have other demanding workloads (video editing, code compilation, etc.) that also benefit from the M3 Ultra’s power and memory capacity.
Conclusion: Capable, But With Caveats
The Mac Studio M3 Ultra with 512GB unified memory is an impressive piece of engineering that can run enormous models like DeepSeek 671B locally. Its unified memory architecture provides a straightforward path to hosting models that are challenging to fit onto even multi-GPU setups.
However, performance, particularly prompt processing speed with large contexts using the widely adopted llama.cpp framework, is a major bottleneck. The ~9 t/s prompt processing speed translates to multi-minute waits for common large-context tasks. While generation speed is more reasonable (~6.2 t/s), the initial latency can be prohibitive.