# Flash-MoE Model Selection: Alternatives Research
Research by: Atlas, Director of Research
Date: 2026-03-19
Status: Recommendations Ready
Companion Doc: FLASH_MOE_PROPOSAL.md
## TL;DR

Qwen3.5-397B-A17B is not just a default — it's a genuinely strong choice for flash-moe's SSD streaming architecture. But three credible alternatives exist, and the right pick depends on what you're optimizing for: (1) maximum intelligence, (2) a proven, battle-tested ecosystem, or (3) real-time speed. Read on for the breakdown.
## Architecture Primer: What Flash-MoE Needs
Before recommending alternatives, it's critical to understand what flash-moe actually requires — because this engine is not a general-purpose MoE loader. It's custom C/Metal code hardcoded for Qwen3.5-397B's specific structure.
What makes a model flash-moe-compatible:
- ✅ MoE architecture with clearly separable, identically-shaped expert weight tensors
- ✅ High expert sparsity (low K/N expert ratio) — the fewer experts activate per token, the less SSD I/O needed
- ✅ Active-parameter working set comfortably below the 51GB of available RAM (so non-expert compute stays in memory)
- ✅ Weights available in a convertible format (safetensors, GGUF)
- ⚠️ Each new model requires porting the weight extraction, expert packing scripts, and adjusting hardcoded architecture constants in infer.m
Key flash-moe sparsity insight: The engine reads K=4 expert blocks of ~3.9MB each from SSD, totaling ~15.6MB of SSD reads per transformer layer. Models with fewer active experts or smaller expert blocks read less data per token, which translates directly into higher tok/s.
Sparsity ratios across candidates:
| Model | Total | Active | Sparsity % | Notes |
|---|---|---|---|---|
| Qwen3.5-397B-A17B | 397B | 17B | 4.3% | K=4 of 512 experts |
| Kimi K2 | 1T | 32B | 3.2% | Most sparse — ideal for streaming |
| DeepSeek-V3 | 671B | 37B | 5.5% | Moderate sparsity |
| Qwen3-235B-A22B | 235B | 22B | 9.4% | Less sparse; fits in RAM |
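The sparsity column is just active over total; a quick sanity check (sizes in billions of parameters; the helper function is ours, not part of flash-moe):

```python
def sparsity_pct(active_b: float, total_b: float) -> float:
    """Percentage of total parameters that are active per token."""
    return 100.0 * active_b / total_b

# Values from the table above (billions of parameters)
candidates = {
    "Qwen3.5-397B-A17B": (17, 397),   # -> 4.3%
    "Kimi K2":           (32, 1000),  # -> 3.2%
    "DeepSeek-V3":       (37, 671),   # -> 5.5%
    "Qwen3-235B-A22B":   (22, 235),   # -> 9.4%
}
for name, (active, total) in candidates.items():
    print(f"{name}: {sparsity_pct(active, total):.1f}%")
```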
## Recommendation 1: DeepSeek-V3-0324
"The Proven Alternative — Better Quality, More Engineering"
### Specs
| Parameter | Value |
|---|---|
| Total parameters | 671B |
| Active parameters | 37B per token |
| MoE layers | 61 layers, 256 routed experts + 1 shared expert |
| Experts activated | K=8 per token |
| Architecture | DeepSeekMoE + Multi-head Latent Attention (MLA) |
| License | MIT (open weights) |
| HuggingFace | deepseek-ai/DeepSeek-V3-0324 |
### Why It Fits Flash-MoE

DeepSeek-V3's architecture aligns well with flash-moe's SSD streaming concept:

- 256 routed experts per layer with K=8 activation → clean, separable expert tensors
- Each expert activation touches ~3.6% of the layer's parameters — a good streaming candidate
- Active params (37B) fit in 51GB RAM with room to spare for an expert cache
- MLX/GGUF weights widely available; the community has already quantized to 2-bit
Engineering work required: Significant. DeepSeek-V3 uses DeepSeekMoE architecture (not GatedDeltaNet like Qwen3.5). The flash-moe C engine would need:
- New extract_weights.py for DeepSeek tensor layout
- Updated expert packing scripts (different expert dimensions)
- Modified infer.m for DeepSeek's attention variant (MLA vs standard)
- Rewritten GatedDeltaNet linear attention → DeepSeek's MLA
Estimated port effort: 2–4 weeks for a competent C/Metal developer. Not trivial, but doable.
### Expected Performance on M4 Pro 64GB
| Mode | Estimated tok/s | Notes |
|---|---|---|
| 2-bit experts, warm cache | 3.5–4.5 | More experts active = more SSD I/O vs Qwen3.5 |
| 2-bit experts, cold SSD | 2.0–2.5 | I/O bound on K=8 vs K=4 |
Why slightly slower than Qwen3.5-397B: DeepSeek activates K=8 experts per token vs K=4 for Qwen3.5. That's roughly 2× the SSD reads per layer, which is the bottleneck. Compensating factor: 37B active parameters means larger GPU batches can be computed efficiently.
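The 2x figure is plain arithmetic, under the simplifying assumption that DeepSeek's expert blocks match flash-moe's ~3.9MB Qwen3.5 block size (in reality DeepSeek's expert dimensions differ):

```python
BLOCK_MB = 3.9                       # flash-moe's Qwen3.5 expert block size (proxy assumption)
qwen_mb_per_layer = 4 * BLOCK_MB     # K=4 -> ~15.6 MB of SSD reads per layer
dsv3_mb_per_layer = 8 * BLOCK_MB     # K=8 -> ~31.2 MB if blocks were the same size
print(dsv3_mb_per_layer / qwen_mb_per_layer)  # -> 2.0
```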
### Quality vs Frontier
| Benchmark | DeepSeek-V3-0324 | GPT-4o | Claude Opus 4.5 | Qwen3.5-397B |
|---|---|---|---|---|
| MMLU | 88.5% | 87.2% | ~88% | ~88% (est.) |
| HumanEval | 85%+ | ~85% | ~80% | ~85% (est.) |
| GPQA | Competitive | Strong | Strong | Strong |
| Intelligence Index | Top 5 open | Proprietary | Proprietary | #3 open |
Verdict on quality: DeepSeek-V3-0324 is a genuine GPT-4o-class model. It matches GPT-4o on MMLU despite 37B active parameters vs GPT-4o's ~220B (estimated). Strong at coding, math, and reasoning. The 0324 update improved over the original V3.
### Disk & Download
| Item | Size |
|---|---|
| BF16 full checkpoint | 1.3TB |
| 2-bit quantized GGUF (UD-Q2_K_XL) | ~230GB |
| 1.78-bit GGUF (UD-IQ1_S) | ~151GB |
| Non-expert weights (estimated) | ~6–8GB |
Download strategy: Use Unsloth's unsloth/DeepSeek-V3-0324-GGUF on HuggingFace. The UD-Q2_K_XL (2.4-bit dynamic) is the sweet spot.
### Gotchas
- Largest disk footprint of the three candidates (230GB vs 120GB for Qwen3.5)
- MLA architecture requires specialized attention kernel rewrites
- Community has run this via llama.cpp on Mac but NOT via custom C/Metal like flash-moe — you'd be pioneering here
- 37B active params means you lose some of flash-moe's RAM headroom advantage vs 17B active
### Use Case
Best choice if you want the most battle-tested, widely-benchmarked alternative that's definitively frontier quality. Higher engineering cost, but the community ecosystem (GGUF support, quantization tooling) is mature.
## Recommendation 2: Kimi K2 (Moonshot AI)
"The Intelligence Champion — Highest Ceiling, Largest Ask"
### Specs
| Parameter | Value |
|---|---|
| Total parameters | 1 Trillion (1T) |
| Active parameters | 32B per token |
| MoE layers | 61 layers (1 dense + 60 MoE) |
| Architecture | Standard MoE transformer (MuonClip optimizer) |
| License | Modified MIT (open weights, commercial use OK) |
| HuggingFace | moonshotai/Kimi-K2-Instruct |
| Intelligence Index | #2 open weights (Artificial Analysis, as Kimi K2.5) |
### Why It Fits Flash-MoE

Kimi K2 has the best sparsity profile of any 1T+ model currently available:

- 32B active out of 1T total = 3.2% activation rate — even more sparse than Qwen3.5-397B
- Standard transformer MoE architecture with clearly separable expert layers
- Active params (32B) fit in 51GB RAM when combined with a solid caching strategy
- Pre-trained on 15.5T tokens — broadest knowledge base among candidates
Sparsity advantage: Because Kimi K2 activates fewer experts as a proportion of total capacity, the SSD I/O per token could be comparable or better than Qwen3.5-397B per layer. This is flash-moe's sweet spot.
Engineering work required: Substantial. Kimi K2 uses standard MoE (no GatedDeltaNet hybrid), which actually simplifies the attention-layer port vs Qwen3.5-397B's hybrid. Main challenges:

- Weight extraction from the 1T-parameter checkpoint (Moonshot's tensor naming conventions)
- Expert repacking for a model with a different layer count (61 vs 60)
- Memory math at 1T scale needs careful tuning
Estimated port effort: 3–5 weeks. The standard MoE attention (no GatedDeltaNet) could make the inference kernel simpler than Qwen3.5, but the raw scale is daunting.
### Expected Performance on M4 Pro 64GB
| Mode | Estimated tok/s | Notes |
|---|---|---|
| 2-bit experts, warm cache | 3.0–4.0 | Expert block sizes are larger at 1T scale |
| 2-bit experts, cold SSD | 1.8–2.5 | I/O intensive; depends on expert geometry |
Why potentially slower: At 1T total parameters, each expert tensor is physically larger even with the same K count. More bytes per SSD read at the same number of read calls. 2-bit compression mitigates this significantly.
Community note: A Reddit user ran Kimi K2 via llama.cpp on 128GB RAM + 24GB VRAM and noted llama.cpp only uses ~1000MB/s of SSD bandwidth even when SSD can do 7300MB/s. Flash-moe's custom pread() + F_NOCACHE approach should dramatically outperform this, potentially making Kimi K2 viable on flash-moe where it's impractical on llama.cpp.
### Quality vs Frontier
| Benchmark | Kimi K2 | GPT-4o | DeepSeek-V3 | Qwen3.5-397B |
|---|---|---|---|---|
| Intelligence Index | #2 open (47/100) | Proprietary | Below top 3 open | #3 open (45/100) |
| Math | Top tier | Top tier | Top tier | Top tier |
| Coding | Excellent | Strong | Excellent | Excellent |
| Agentic tasks | Outstanding | Good | Good | Good |
Verdict on quality: Kimi K2 is legitimately the most capable open-weight model for agentic and non-thinking tasks right now — only GLM-5 (Reasoning) beats it. It's trained with the Muon optimizer specifically for agentic capability. For Jeff's product demo angle, this is the most impressive name to put on a local inference product.
### Disk & Download
| Item | Size |
|---|---|
| BF16 full checkpoint | 1.09TB |
| 1.8-bit dynamic GGUF (Unsloth UD) | ~230GB |
| 1-bit quantized | ~240GB |
Download: Use Unsloth's unsloth/Kimi-K2-Instruct-GGUF. This fits on our 1.8TB SSD (230GB model + 120GB Qwen3.5 = 350GB, well within headroom).
### Gotchas
- No existing Mac-native flash-moe implementation. You'd be the first. This is a risk and an opportunity.
- 1T model weight extraction is slow — expect 6–10 hours for the repacking pipeline
- Community llama.cpp results for Kimi K2 on Mac are poor due to llama.cpp's inefficient SSD usage. Flash-moe's direct I/O approach could be the breakthrough.
- Tool calling had issues at launch (noted in community); check current Instruct model status
- 1T parameters is a marketing story, but active params (32B) is what determines quality at inference
### Use Case
Best for the "most intelligent local AI" demo narrative. If the pitch is "we run the #2 open-weights AI model in the world on a Mac mini," Kimi K2 is the headline model. Higher risk/reward than DeepSeek-V3.
## Recommendation 3: Qwen3-235B-A22B (Updated: 2507 Instruct)
"The Speed Play — In-RAM Inference, No SSD Streaming"
### Specs
| Parameter | Value |
|---|---|
| Total parameters | 235B |
| Active parameters | 22B per token |
| MoE config | 128 total experts, 8 active per token |
| Layers | 94 transformer layers |
| Context window | 131,072 tokens (native) |
| Architecture | Standard transformer MoE |
| License | Apache 2.0 (fully open for commercial use) |
| HuggingFace | Qwen/Qwen3-235B-A22B-Instruct-2507 |
### Why It Fits — With a Twist

Here's the contrarian take: Qwen3-235B-A22B doesn't need SSD expert streaming at all.

At 2-bit quantization, the entire model fits in approximately 60–70GB on disk. With our 51GB of RAM available:

- A 2-bit or 3-bit quantized version of Qwen3-235B fits almost entirely in memory
- No SSD I/O bottleneck
- Expected speed: 15–25 tok/s vs 3.5–5.5 tok/s for the 397B via flash-moe
This is a different value proposition: Not "stream a giant model from SSD," but "fit a better-than-GPT-4-class model entirely in RAM."
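The fit-in-RAM claim reduces to bytes-per-weight arithmetic. A sketch, noting that the bit widths are nominal and real GGUF quants carry per-block scale overhead that pushes sizes somewhat higher:

```python
def quantized_size_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Approximate checkpoint size: billions of params * bits / 8 = GB."""
    return total_params_billion * bits_per_weight / 8

for bits in (2.0, 2.5, 3.0):
    gb = quantized_size_gb(235, bits)
    print(f"{bits:.1f} bpw: ~{gb:.0f} GB (under 64GB unified memory: {gb < 64})")
```

At a nominal 2.0 bits per weight the 235B model lands near 59GB, which is why the Q2_K fit is described as tight below.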
For OpenClaw integration, this model can run via standard llama.cpp/Ollama — no custom C engine required.
Flash-moe port effort: Lower than the others, since Qwen3-235B uses a similar MoE structure to Qwen3.5-397B (same family). But honestly, you wouldn't need flash-moe for this model — it defeats the purpose.
### Expected Performance on M4 Pro 64GB
| Mode | Estimated tok/s | Notes |
|---|---|---|
| Q3_K_M in RAM | 12–20 tok/s | Primary constraint: memory bandwidth |
| Q2_K in RAM | 18–25 tok/s | Faster at cost of some quality |
| Via flash-moe (overkill) | 6–8 tok/s | SSD streaming not needed, worse than pure RAM |
Why faster: At 273 GB/s memory bandwidth and 22B active params, the compute is genuinely fast. This model completes tokens roughly 4–5× faster than Qwen3.5-397B via flash-moe, which matters enormously for real-time conversation.
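A back-of-envelope bandwidth ceiling supports this: decode is memory-bandwidth-bound, so tok/s is capped by how many times per second the active weights can stream from unified memory (a simplification that ignores KV-cache and activation traffic):

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, active_params_billion: float,
                         bits_per_weight: float) -> float:
    """Upper bound on decode speed: bandwidth / bytes of active weights per token."""
    active_gb = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / active_gb

# M4 Pro: 273 GB/s; Qwen3-235B: 22B active at ~3.5 bpw (roughly Q3_K_M)
print(f"~{decode_ceiling_tok_s(273, 22, 3.5):.0f} tok/s theoretical ceiling")
```

The ceiling comes out near 28 tok/s; the 12–20 tok/s estimate in the table is plausibly 40–70% of that ideal.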
### Quality vs Frontier
| Benchmark | Qwen3-235B-A22B | GPT-4o | DeepSeek-V3 | Qwen3.5-397B |
|---|---|---|---|---|
| Intelligence Index | ~#5 open weights | Proprietary | ~#4 open | #3 open (45/100) |
| Reasoning | Excellent | Excellent | Strong | Excellent |
| Multilingual | Outstanding | Good | Strong | Outstanding |
| Coding | Very strong | Strong | Excellent | Very strong |
Verdict on quality: Qwen3-235B-A22B is excellent — competitive with GPT-4o — but it's meaningfully below Qwen3.5-397B and Kimi K2. If you're pitching "local frontier AI," this is the B-team. If you're pitching "the fastest capable local AI on a Mac mini," this is the A-team.
### Disk & Download
| Item | Size |
|---|---|
| BF16 full checkpoint | ~468GB |
| Q4_K_M GGUF | ~130GB |
| Q3_K_M GGUF | ~97GB |
| Q2_K GGUF | ~63GB |
Practical setup: Pull lmstudio-community/Qwen3-235B-A22B-GGUF via LM Studio or llama.cpp. The Q3_K_M at 97GB fits on our SSD trivially. It would also fit in RAM at Q2_K (63GB < 64GB with some OS overhead — tight).
### Gotchas
- Doesn't need flash-moe; using it would be counterproductive
- 63GB for Q2_K is a very tight fit in 64GB unified memory — would need OS + app overhead carefully managed
- Thinking mode and non-thinking mode both available (same `/think` / `/no_think` toggle as Qwen3.5)
- Context window (128K) is outstanding; far better than 397B's current implementation
- Apache 2.0 license is cleaner for commercial products than Qwen3.5-397B's custom terms
### Use Case
Best if the product is "best real-time local AI on consumer hardware" rather than "largest local AI." Ship faster, work better for live chat, integrate into OpenClaw today without custom C code. If 5 tok/s is too slow for the demo, this is the model to bet on.
## Verdict: Is Qwen3.5-397B Actually the Best Choice?
Short answer: Yes, for flash-moe specifically — and the reasons are architectural, not just familiarity.
Here's the honest breakdown:
### Why Qwen3.5-397B-A17B Is Hard to Beat for Flash-MoE
**1. Sparsity is almost perfectly tuned for SSD streaming**
512 experts with K=4 active = 0.78% expert activation rate per layer. This is extreme sparsity. At flash-moe's ~3.9MB per expert block, that's only 15.6MB of SSD reads per layer. The model was designed to be efficient at inference; GatedDeltaNet linear attention layers (45 of 60) are dramatically cheaper than full attention.
**2. 17B active parameters is the sweet spot for our RAM**
With only 17B active params, the working set fits easily in 51GB of available RAM, leaving room for an expert cache to reduce cold SSD reads. DeepSeek-V3 (37B active) and Kimi K2 (32B active) both leave less headroom for expert caching.
**3. The 2-bit compression ratio is exceptional**
Qwen3.5-397B compresses from 397B params to 120GB on disk in 2-bit mode. That's ~0.3 bytes per parameter — excellent for a model of this quality. The 44% size reduction over 4-bit is already baked into flash-moe's implementation.
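The ~0.3 bytes-per-parameter figure checks out:

```python
disk_gb, total_params_billion = 120, 397
bytes_per_param = disk_gb / total_params_billion  # ~0.30 bytes per weight
effective_bits = bytes_per_param * 8              # ~2.4 effective bits per weight
print(f"{bytes_per_param:.2f} B/param, {effective_bits:.2f} bits/param")
```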
**4. Intelligence ranking validates the choice**
Artificial Analysis ranks Qwen3.5-397B as #3 open weights globally (45/100), just behind GLM-5 Reasoning (50) and Kimi K2.5 Reasoning (47). This is genuine frontier quality — not mid-tier.
**5. The implementation already exists**
Flash-moe is written and tested for this model. Every alternative requires weeks of engineering work. The opportunity cost of porting is real.
### When an Alternative Makes Sense
| If you want… | Choose |
|---|---|
| Maximum intelligence for "most impressive open-weight model" narrative | Kimi K2 (1T/32B, #2 open weights) |
| Most battle-tested, widest community support, easiest GGUF path | DeepSeek-V3-0324 (671B/37B) |
| Real-time conversation speed without SSD streaming complexity | Qwen3-235B-A22B (235B/22B, via Ollama/llama.cpp) |
| Best flash-moe demo on existing code, today | Qwen3.5-397B-A17B (status quo) |
### Hardware Fit Matrix
| Model | Active RAM Need | 2-bit Disk | Flash-MoE Port Effort | Est. tok/s (M4 Pro) |
|---|---|---|---|---|
| Qwen3.5-397B-A17B | ~17B (~34GB est.) | 120GB | 0 (done) | 3.5–5.5 |
| DeepSeek-V3-0324 | ~37B (~74GB est.) | 230GB | High (4 weeks) | 3.0–4.5 |
| Kimi K2 | ~32B (~64GB est.) | 230GB | Very High (5 weeks) | 2.5–4.0 |
| Qwen3-235B-A22B | ~22B (~44GB est.) | 63–97GB | N/A (use Ollama) | 12–25 |
Active RAM estimates based on: active_params × 2 bytes (BF16) as a working-set floor. Actual flash-moe usage is lower because non-expert weights (5.5GB) are the only persistent in-memory component.
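The footnote's working-set floor is reproducible directly (billions of active params times 2 bytes for BF16 gives GB):

```python
def bf16_floor_gb(active_params_billion: float) -> float:
    """BF16 working-set floor from the table footnote: 2 bytes per active parameter."""
    return active_params_billion * 2

for name, active in [("Qwen3.5-397B-A17B", 17), ("DeepSeek-V3-0324", 37),
                     ("Kimi K2", 32), ("Qwen3-235B-A22B", 22)]:
    print(f"{name}: ~{bf16_floor_gb(active):.0f} GB")
```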
### "Most Impressive Demo" Ranking
If Jeff's goal is first-mover advantage on "frontier local AI" as a product:
- 🥇 Kimi K2 — "Running the #2 open-weights AI in the world on a Mac mini" is the strongest marketing line. Nobody has done this with flash-moe-style SSD streaming. The first-mover window is real.
- 🥈 Qwen3.5-397B-A17B — Already works. "#3 open weights, 397 billion parameters, $2500 hardware" is a strong demo today. Ship this while the Kimi K2 port is in progress.
- 🥉 DeepSeek-V3 — Well-known brand, GPT-4o-class benchmarks. A good secondary story ("we also run the model that shook the AI industry in January 2025").
- Qwen3-235B-A22B — A good product, but the wrong narrative for the "frontier local AI" pitch. Better positioned as a speed tier once the flagship is established.
## Recommended Path Forward
Phase 1 (Now): Ship Qwen3.5-397B via flash-moe as the proof point. #3 open weights on consumer hardware is a legitimate claim.
Phase 2 (3–6 months): Port flash-moe to Kimi K2. This is the flagship narrative upgrade — "from #3 to #2 open weights, same hardware." Hire or contract a C/Metal developer with MoE experience to drive the port.
Phase 3 (Ongoing): Maintain Qwen3-235B-A22B via Ollama/llama.cpp as the fast inference tier. Flash-moe handles the "impressive" demo; Ollama handles real-time conversation. Two-tier local inference is a differentiated product.
## Sources
- flash-moe GitHub: https://github.com/danveloper/flash-moe
- Artificial Analysis Intelligence Leaderboard: https://artificialanalysis.ai/leaderboards/models
- Kimi K2 architecture: https://moonshotai.github.io/Kimi-K2/
- DeepSeek-V3 technical report: https://arxiv.org/abs/2412.19437
- Unsloth GGUF sizing: https://unsloth.ai/blog/deepseek-v3-0324 / https://unsloth.ai/docs/models/kimi-k2.5
- Qwen3-235B-A22B GGUF: https://huggingface.co/lmstudio-community/Qwen3-235B-A22B-GGUF
- Kimi K2 SSD inference community discussion: https://www.reddit.com/r/LocalLLaMA/comments/1ow0jj0
- NVIDIA blog on Kimi K2.5 open-weights leaderboard status: https://blogs.nvidia.com/blog/mixture-of-experts-frontier-models/
Research completed 2026-03-19. Leaderboard rankings and model availability subject to rapid change in the current AI landscape.