# Flash-MoE Model Selection: Alternatives Research
Research by: Atlas, Director of Research
Date: 2026-03-19
Status: Recommendations Ready
Companion Doc: FLASH_MOE_PROPOSAL.md
## TL;DR

Qwen3.5-397B-A17B is not just a default — it's a genuinely strong choice for flash-moe's SSD streaming architecture. But three credible alternatives exist, and the right pick depends on what you're optimizing for: (1) maximum intelligence, (2) a proven, battle-tested ecosystem, or (3) real-time speed. Read on for the breakdown.
## Architecture Primer: What Flash-MoE Needs
Before recommending alternatives, it's critical to understand what flash-moe actually requires — because this engine is not a general-purpose MoE loader. It's custom C/Metal code hardcoded for Qwen3.5-397B's specific structure.
What makes a model flash-moe-compatible:
- ✅ MoE architecture with clearly separable, identically-shaped expert weight tensors
- ✅ High expert sparsity (low K/N expert ratio) — the fewer experts activate per token, the less SSD I/O needed
- ✅ Active-parameter working set comfortably below the 51GB of available RAM (so non-expert compute stays in memory)
- ✅ Weights available in a convertible format (safetensors, GGUF)
- ⚠️ Each new model requires porting the weight extraction, expert packing scripts, and adjusting hardcoded architecture constants in infer.m
Key flash-moe sparsity insight: The engine reads K=4 expert blocks of ~3.9MB each from SSD, totaling ~15.6MB of SSD reads per transformer layer. Models with fewer active experts or smaller expert blocks read less data per token, which translates directly into higher tok/s.
Sparsity ratios across candidates:
| Model | Total | Active | Sparsity % | Notes |
|---|---|---|---|---|
| Qwen3.5-397B-A17B | 397B | 17B | 4.3% | K=4 of 512 experts |
| Kimi K2 | 1T | 32B | 3.2% | Most sparse — ideal for streaming |
| DeepSeek-V3 | 671B | 37B | 5.5% | Moderate sparsity |
| Qwen3-235B-A22B | 235B | 22B | 9.4% | Less sparse; fits in RAM |
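The sparsity column is just active over total; a quick sanity check (sizes in billions of parameters; the helper function is ours, not part of flash-moe):

```python
def sparsity_pct(active_b: float, total_b: float) -> float:
    """Percentage of total parameters that are active per token."""
    return 100.0 * active_b / total_b

# Values from the table above (billions of parameters)
candidates = {
    "Qwen3.5-397B-A17B": (17, 397),   # -> 4.3%
    "Kimi K2":           (32, 1000),  # -> 3.2%
    "DeepSeek-V3":       (37, 671),   # -> 5.5%
    "Qwen3-235B-A22B":   (22, 235),   # -> 9.4%
}
for name, (active, total) in candidates.items():
    print(f"{name}: {sparsity_pct(active, total):.1f}%")
```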
## Recommendation 1: DeepSeek-V3-0324
"The Proven Alternative — Better Quality, More Engineering"
### Specs
| Parameter | Value |
|---|---|
| Total parameters | 671B |
| Active parameters | 37B per token |
| MoE layers | 61 layers, 256 routed experts + 1 shared expert |
| Experts activated | K=8 per token |
| Architecture | DeepSeekMoE + Multi-head Latent Attention (MLA) |
| License | MIT (open weights) |
| HuggingFace | deepseek-ai/DeepSeek-V3-0324 |
### Why It Fits Flash-MoE

DeepSeek-V3's architecture aligns well with flash-moe's SSD streaming concept:

- 256 routed experts per layer with K=8 activation → clean, separable expert tensors
- Each expert activation touches ~3.6% of the layer's parameters — a good streaming candidate
- Active params (37B) fit in 51GB RAM with room to spare for an expert cache
- MLX/GGUF weights widely available; the community has already quantized to 2-bit
Engineering work required: Significant. DeepSeek-V3 uses DeepSeekMoE architecture (not GatedDeltaNet like Qwen3.5). The flash-moe C engine would need:
- New extract_weights.py for DeepSeek tensor layout
- Updated expert packing scripts (different expert dimensions)
- Modified infer.m for DeepSeek's attention variant (MLA vs standard)
- Rewritten GatedDeltaNet linear attention → DeepSeek's MLA
Estimated port effort: 2–4 weeks for a competent C/Metal developer. Not trivial, but doable.
### Expected Performance on M4 Pro 64GB
| Mode | Estimated tok/s | Notes |
|---|---|---|
| 2-bit experts, warm cache | 3.5–4.5 | More experts active = more SSD I/O vs Qwen3.5 |
| 2-bit experts, cold SSD | 2.0–2.5 | I/O bound on K=8 vs K=4 |
Why slightly slower than Qwen3.5-397B: DeepSeek activates K=8 experts per token vs K=4 for Qwen3.5. That's roughly 2× the SSD reads per layer, which is the bottleneck. Compensating factor: 37B active parameters means larger GPU batches can be computed efficiently.
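The 2x figure is plain arithmetic, under the simplifying assumption that DeepSeek's expert blocks match flash-moe's ~3.9MB Qwen3.5 block size (in reality DeepSeek's expert dimensions differ):

```python
BLOCK_MB = 3.9                       # flash-moe's Qwen3.5 expert block size (proxy assumption)
qwen_mb_per_layer = 4 * BLOCK_MB     # K=4 -> ~15.6 MB of SSD reads per layer
dsv3_mb_per_layer = 8 * BLOCK_MB     # K=8 -> ~31.2 MB if blocks were the same size
print(dsv3_mb_per_layer / qwen_mb_per_layer)  # -> 2.0
```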
### Quality vs Frontier
| Benchmark | DeepSeek-V3-0324 | GPT-4o | Claude Opus 4.5 | Qwen3.5-397B |
|---|---|---|---|---|
| MMLU | 88.5% | 87.2% | ~88% | ~88% (est.) |
| HumanEval | 85%+ | ~85% | ~80% | ~85% (est.) |
| GPQA | Competitive | Strong | Strong | Strong |
| Intelligence Index | Top 5 open | Proprietary | Proprietary | #3 open |
Verdict on quality: DeepSeek-V3-0324 is a genuine GPT-4o-class model. It matches GPT-4o on MMLU despite 37B active parameters vs GPT-4o's ~220B (estimated). Strong at coding, math, and reasoning. The 0324 update improved over the original V3.
### Disk & Download
| Item | Size |
|---|---|
| BF16 full checkpoint | 1.3TB |
| 2-bit quantized GGUF (UD-Q2_K_XL) | ~230GB |
| 1.78-bit GGUF (UD-IQ1_S) | ~151GB |
| Non-expert weights (estimated) | ~6–8GB |
Download strategy: Use Unsloth's unsloth/DeepSeek-V3-0324-GGUF on HuggingFace. The UD-Q2_K_XL (2.4-bit dynamic) is the sweet spot.
### Gotchas
- Largest disk footprint of the three candidates (230GB vs 120GB for Qwen3.5)
- MLA architecture requires specialized attention kernel rewrites
- Community has run this via llama.cpp on Mac but NOT via custom C/Metal like flash-moe — you'd be pioneering here
- 37B active params means you lose some of flash-moe's RAM headroom advantage vs 17B active
### Use Case
Best choice if you want the most battle-tested, widely-benchmarked alternative that's definitively frontier quality. Higher engineering cost, but the community ecosystem (GGUF support, quantization tooling) is mature.
## Recommendation 2: Kimi K2 (Moonshot AI)
"The Intelligence Champion — Highest Ceiling, Largest Ask"
### Specs
| Parameter | Value |
|---|---|
| Total parameters | 1 Trillion (1T) |
| Active parameters | 32B per token |
| MoE layers | 61 layers (1 dense + 60 MoE) |
| Architecture | Standard MoE transformer (MuonClip optimizer) |
| License | Modified MIT (open weights, commercial use OK) |
| HuggingFace | moonshotai/Kimi-K2-Instruct |
| Intelligence Index | #2 open weights (Artificial Analysis, as Kimi K2.5) |
### Why It Fits Flash-MoE

Kimi K2 has the best sparsity profile of any 1T+ model currently available:

- 32B active out of 1T total = 3.2% activation rate — even more sparse than Qwen3.5-397B
- Standard transformer MoE architecture with clearly separable expert layers
- Active params (32B) fit in 51GB RAM when combined with a solid caching strategy
- Pre-trained on 15.5T tokens — broadest knowledge base among candidates
Sparsity advantage: Because Kimi K2 activates fewer experts as a proportion of total capacity, the SSD I/O per token could be comparable or better than Qwen3.5-397B per layer. This is flash-moe's sweet spot.
Engineering work required: Substantial. Kimi K2 uses standard MoE (no GatedDeltaNet hybrid), which actually simplifies the attention-layer port vs Qwen3.5-397B's hybrid. Main challenges:

- Weight extraction from the 1T-parameter checkpoint (Moonshot's tensor naming conventions)
- Expert repacking for a model with a different layer count (61 vs 60)
- Memory math at 1T scale needs careful tuning
Estimated port effort: 3–5 weeks. The standard MoE attention (no GatedDeltaNet) could make the inference kernel simpler than Qwen3.5, but the raw scale is daunting.
### Expected Performance on M4 Pro 64GB
| Mode | Estimated tok/s | Notes |
|---|---|---|
| 2-bit experts, warm cache | 3.0–4.0 | Expert block sizes are larger at 1T scale |
| 2-bit experts, cold SSD | 1.8–2.5 | I/O intensive; depends on expert geometry |
Why potentially slower: At 1T total parameters, each expert tensor is physically larger even with the same K count. More bytes per SSD read at the same number of read calls. 2-bit compression mitigates this significantly.
Community note: A Reddit user ran Kimi K2 via llama.cpp on 128GB RAM + 24GB VRAM and noted llama.cpp only uses ~1000MB/s of SSD bandwidth even when SSD can do 7300MB/s. Flash-moe's custom pread() + F_NOCACHE approach should dramatically outperform this, potentially making Kimi K2 viable on flash-moe where it's impractical on llama.cpp.
### Quality vs Frontier
| Benchmark | Kimi K2 | GPT-4o | DeepSeek-V3 | Qwen3.5-397B |
|---|---|---|---|---|
| Intelligence Index | #2 open (47/100) | Proprietary | Below top 3 open | #3 open (45/100) |
| Math | Top tier | Top tier | Top tier | Top tier |
| Coding | Excellent | Strong | Excellent | Excellent |
| Agentic tasks | Outstanding | Good | Good | Good |
Verdict on quality: Kimi K2 is legitimately the most capable open-weight model for agentic and non-thinking tasks right now — only GLM-5 (Reasoning) beats it. It's trained with the Muon optimizer specifically for agentic capability. For Jeff's product demo angle, this is the most impressive name to put on a local inference product.
### Disk & Download
| Item | Size |
|---|---|
| BF16 full checkpoint | 1.09TB |
| 1.8-bit dynamic GGUF (Unsloth UD) | ~230GB |
| 1-bit quantized | ~240GB |
Download: Use Unsloth's unsloth/Kimi-K2-Instruct-GGUF. This fits on our 1.8TB SSD (230GB model + 120GB Qwen3.5 = 350GB, well within headroom).
### Gotchas
- No existing Mac-native flash-moe implementation. You'd be the first. This is a risk and an opportunity.
- 1T model weight extraction is slow — expect 6–10 hours for the repacking pipeline
- Community llama.cpp results for Kimi K2 on Mac are poor due to llama.cpp's inefficient SSD usage. Flash-moe's direct I/O approach could be the breakthrough.
- Tool calling had issues at launch (noted in community); check current Instruct model status
- 1T parameters is a marketing story, but active params (32B) is what determines quality at inference
### Use Case
Best for the "most intelligent local AI" demo narrative. If the pitch is "we run the #2 open-weights AI model in the world on a Mac mini," Kimi K2 is the headline model. Higher risk/reward than DeepSeek-V3.
## Recommendation 3: Qwen3-235B-A22B (Updated: 2507 Instruct)
"The Speed Play — In-RAM Inference, No SSD Streaming"
### Specs
| Parameter | Value |
|---|---|
| Total parameters | 235B |
| Active parameters | 22B per token |
| MoE config | 128 total experts, 8 active per token |
| Layers | 94 transformer layers |
| Context window | 131,072 tokens (native) |
| Architecture | Standard transformer MoE |
| License | Apache 2.0 (fully open for commercial use) |
| HuggingFace | Qwen/Qwen3-235B-A22B-Instruct-2507 |
### Why It Fits — With a Twist

Here's the contrarian take: Qwen3-235B-A22B doesn't need SSD expert streaming at all.

At 2-bit quantization, the entire model fits in approximately 60–70GB on disk. With our 51GB of RAM available:

- A 2-bit or 3-bit quantized version of Qwen3-235B fits almost entirely in memory
- No SSD I/O bottleneck
- Expected speed: 15–25 tok/s vs 3.5–5.5 tok/s for the 397B via flash-moe
This is a different value proposition: Not "stream a giant model from SSD," but "fit a better-than-GPT-4-class model entirely in RAM."
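The fit-in-RAM claim reduces to bytes-per-weight arithmetic. A sketch, noting that the bit widths are nominal and real GGUF quants carry per-block scale overhead that pushes sizes somewhat higher:

```python
def quantized_size_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Approximate checkpoint size: billions of params * bits / 8 = GB."""
    return total_params_billion * bits_per_weight / 8

for bits in (2.0, 2.5, 3.0):
    gb = quantized_size_gb(235, bits)
    print(f"{bits:.1f} bpw: ~{gb:.0f} GB (under 64GB unified memory: {gb < 64})")
```

At a nominal 2.0 bits per weight the 235B model lands near 59GB, which is why the Q2_K fit is described as tight below.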
For OpenClaw integration, this model can run via standard llama.cpp/Ollama — no custom C engine required.
Flash-moe port effort: Lower than the others, since Qwen3-235B uses a similar MoE structure to Qwen3.5-397B (same family). But honestly, you wouldn't need flash-moe for this model — it defeats the purpose.
### Expected Performance on M4 Pro 64GB
| Mode | Estimated tok/s | Notes |
|---|---|---|
| Q3_K_M in RAM | 12–20 tok/s | Primary constraint: memory bandwidth |
| Q2_K in RAM | 18–25 tok/s | Faster at cost of some quality |
| Via flash-moe (overkill) | 6–8 tok/s | SSD streaming not needed, worse than pure RAM |
Why faster: At 273 GB/s memory bandwidth and 22B active params, the compute is genuinely fast. This model completes tokens roughly 4–5× faster than Qwen3.5-397B via flash-moe, which matters enormously for real-time conversation.
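A back-of-envelope bandwidth ceiling supports this: decode is memory-bandwidth-bound, so tok/s is capped by how many times per second the active weights can stream from unified memory (a simplification that ignores KV-cache and activation traffic):

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, active_params_billion: float,
                         bits_per_weight: float) -> float:
    """Upper bound on decode speed: bandwidth / bytes of active weights per token."""
    active_gb = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / active_gb

# M4 Pro: 273 GB/s; Qwen3-235B: 22B active at ~3.5 bpw (roughly Q3_K_M)
print(f"~{decode_ceiling_tok_s(273, 22, 3.5):.0f} tok/s theoretical ceiling")
```

The ceiling comes out near 28 tok/s; the 12–20 tok/s estimate in the table is plausibly 40–70% of that ideal.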
### Quality vs Frontier
| Benchmark | Qwen3-235B-A22B | GPT-4o | DeepSeek-V3 | Qwen3.5-397B |
|---|---|---|---|---|
| Intelligence Index | ~#5 open weights | Proprietary | ~#4 open | #3 open (45/100) |
| Reasoning | Excellent | Excellent | Strong | Excellent |
| Multilingual | Outstanding | Good | Strong | Outstanding |
| Coding | Very strong | Strong | Excellent | Very strong |
Verdict on quality: Qwen3-235B-A22B is excellent — competitive with GPT-4o — but it's meaningfully below Qwen3.5-397B and Kimi K2. If you're pitching "local frontier AI," this is the B-team. If you're pitching "the fastest capable local AI on a Mac mini," this is the A-team.
### Disk & Download
| Item | Size |
|---|---|
| BF16 full checkpoint | ~468GB |
| Q4_K_M GGUF | ~130GB |
| Q3_K_M GGUF | ~97GB |
| Q2_K GGUF | ~63GB |
Practical setup: Pull lmstudio-community/Qwen3-235B-A22B-GGUF via LM Studio or llama.cpp. The Q3_K_M at 97GB fits on our SSD trivially. It would also fit in RAM at Q2_K (63GB < 64GB with some OS overhead — tight).
### Gotchas
- Doesn't need flash-moe; using it would be counterproductive
- 63GB for Q2_K is a very tight fit in 64GB unified memory — would need OS + app overhead carefully managed
- Thinking mode and non-thinking mode both available (same `/think` / `/no_think` toggle as Qwen3.5)
- Context window (128K) is outstanding; far better than 397B's current implementation
- Apache 2.0 license is cleaner for commercial products than Qwen3.5-397B's custom terms
### Use Case
Best if the product is "best real-time local AI on consumer hardware" rather than "largest local AI." Ship faster, work better for live chat, integrate into OpenClaw today without custom C code. If 5 tok/s is too slow for the demo, this is the model to bet on.
## Verdict: Is Qwen3.5-397B Actually the Best Choice?
Short answer: Yes, for flash-moe specifically — and the reasons are architectural, not just familiarity.
Here's the honest breakdown:
### Why Qwen3.5-397B-A17B Is Hard to Beat for Flash-MoE
**1. Sparsity is almost perfectly tuned for SSD streaming**
512 experts with K=4 active = 0.78% expert activation rate per layer. This is extreme sparsity. At flash-moe's ~3.9MB per expert block, that's only 15.6MB of SSD reads per layer. The model was designed to be efficient at inference; GatedDeltaNet linear attention layers (45 of 60) are dramatically cheaper than full attention.
**2. 17B active parameters is the sweet spot for our RAM**
With only 17B active params, the working set fits easily in 51GB of available RAM, leaving room for an expert cache to reduce cold SSD reads. DeepSeek-V3 (37B active) and Kimi K2 (32B active) both leave less headroom for expert caching.
**3. The 2-bit compression ratio is exceptional**
Qwen3.5-397B compresses from 397B params to 120GB on disk in 2-bit mode. That's ~0.3 bytes per parameter — excellent for a model of this quality. The 44% size reduction over 4-bit is already baked into flash-moe's implementation.
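The ~0.3 bytes-per-parameter figure checks out:

```python
disk_gb, total_params_billion = 120, 397
bytes_per_param = disk_gb / total_params_billion  # ~0.30 bytes per weight
effective_bits = bytes_per_param * 8              # ~2.4 effective bits per weight
print(f"{bytes_per_param:.2f} B/param, {effective_bits:.2f} bits/param")
```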
**4. Intelligence ranking validates the choice**
Artificial Analysis ranks Qwen3.5-397B as #3 open weights globally (45/100), just behind GLM-5 Reasoning (50) and Kimi K2.5 Reasoning (47). This is genuine frontier quality — not mid-tier.
**5. The implementation already exists**
Flash-moe is written and tested for this model. Every alternative requires weeks of engineering work. The opportunity cost of porting is real.
### When an Alternative Makes Sense
| If you want… | Choose |
|---|---|
| Maximum intelligence for "most impressive open-weight model" narrative | Kimi K2 (1T/32B, #2 open weights) |
| Most battle-tested, widest community support, easiest GGUF path | DeepSeek-V3-0324 (671B/37B) |
| Real-time conversation speed without SSD streaming complexity | Qwen3-235B-A22B (235B/22B, via Ollama/llama.cpp) |
| Best flash-moe demo on existing code, today | Qwen3.5-397B-A17B (status quo) |
### Hardware Fit Matrix
| Model | Active RAM Need | 2-bit Disk | Flash-MoE Port Effort | Est. tok/s (M4 Pro) |
|---|---|---|---|---|
| Qwen3.5-397B-A17B | ~17B (~34GB est.) | 120GB | 0 (done) | 3.5–5.5 |
| DeepSeek-V3-0324 | ~37B (~74GB est.) | 230GB | High (4 weeks) | 3.0–4.5 |
| Kimi K2 | ~32B (~64GB est.) | 230GB | Very High (5 weeks) | 2.5–4.0 |
| Qwen3-235B-A22B | ~22B (~44GB est.) | 63–97GB | N/A (use Ollama) | 12–25 |
Active RAM estimates based on: active_params × 2 bytes (BF16) as a working-set floor. Actual flash-moe usage is lower because non-expert weights (5.5GB) are the only persistent in-memory component.
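The footnote's working-set floor is reproducible directly (billions of active params times 2 bytes for BF16 gives GB):

```python
def bf16_floor_gb(active_params_billion: float) -> float:
    """BF16 working-set floor from the table footnote: 2 bytes per active parameter."""
    return active_params_billion * 2

for name, active in [("Qwen3.5-397B-A17B", 17), ("DeepSeek-V3-0324", 37),
                     ("Kimi K2", 32), ("Qwen3-235B-A22B", 22)]:
    print(f"{name}: ~{bf16_floor_gb(active):.0f} GB")
```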
### "Most Impressive Demo" Ranking
If Jeff's goal is first-mover advantage on "frontier local AI" as a product:
- 🥇 Kimi K2 — "Running the #2 open-weights AI in the world on a Mac mini" is the strongest marketing line. Nobody has done this with flash-moe-style SSD streaming. The first-mover window is real.
- 🥈 Qwen3.5-397B-A17B — Already works. "#3 open weights, 397 billion parameters, $2500 hardware" is a strong demo today. Ship this while the Kimi K2 port is in progress.
- 🥉 DeepSeek-V3 — Well-known brand, GPT-4o-class benchmarks. A good secondary story ("we also run the model that shook the AI industry in January 2025").
- Qwen3-235B-A22B — A good product, but the wrong narrative for the "frontier local AI" pitch. Better positioned as a speed tier once the flagship is established.
## Recommended Path Forward
Phase 1 (Now): Ship Qwen3.5-397B via flash-moe as the proof point. #3 open weights on consumer hardware is a legitimate claim.
Phase 2 (3–6 months): Port flash-moe to Kimi K2. This is the flagship narrative upgrade — "from #3 to #2 open weights, same hardware." Hire or contract a C/Metal developer with MoE experience to drive the port.
Phase 3 (Ongoing): Maintain Qwen3-235B-A22B via Ollama/llama.cpp as the fast inference tier. Flash-moe handles the "impressive" demo; Ollama handles real-time conversation. Two-tier local inference is a differentiated product.
## Sources
- flash-moe GitHub: https://github.com/danveloper/flash-moe
- Artificial Analysis Intelligence Leaderboard: https://artificialanalysis.ai/leaderboards/models
- Kimi K2 architecture: https://moonshotai.github.io/Kimi-K2/
- DeepSeek-V3 technical report: https://arxiv.org/abs/2412.19437
- Unsloth GGUF sizing: https://unsloth.ai/blog/deepseek-v3-0324 / https://unsloth.ai/docs/models/kimi-k2.5
- Qwen3-235B-A22B GGUF: https://huggingface.co/lmstudio-community/Qwen3-235B-A22B-GGUF
- Kimi K2 SSD inference community discussion: https://www.reddit.com/r/LocalLLaMA/comments/1ow0jj0
- NVIDIA blog on Kimi K2.5 open-weights leaderboard status: https://blogs.nvidia.com/blog/mixture-of-experts-frontier-models/
Research completed 2026-03-19. Leaderboard rankings and model availability subject to rapid change in the current AI landscape.