Flash-MoE Implementation Proposal¶
Research by: Atlas, Director of Research
Date: 2026-03-19
Status: Ready for Jeff's Decision
Repo: https://github.com/danveloper/flash-moe
Executive Summary¶
Flash-MoE is a pure C/Metal inference engine that runs Qwen3.5-397B-A17B — a 397 billion parameter model — on a MacBook Pro M3 Max 48GB at 5.5 tok/s using SSD expert streaming. Our M4 Pro 64GB Mac mini is hardware-comparable, with the critical difference being lower memory bandwidth (~273 GB/s vs ~400 GB/s) offset by +16GB of RAM enabling a larger in-memory expert cache.
Bottom line: Flash-MoE is real, technically impressive, and can run on our hardware — but it's not production-ready as an OpenClaw integration today. It's a proof-of-concept (5,000-line Objective-C file, no API server). Worth building for strategic and product reasons; needs a thin wrapper to integrate with OpenClaw.
1. Build and Run: Step-by-Step Instructions¶
Prerequisites¶
# Verify Xcode command line tools
xcode-select --install
# Verify Metal (built into macOS on Apple Silicon)
xcrun metal --version
# Python 3.11+ for weight extraction scripts
python3 --version
Step 1: Clone the Repo¶
git clone https://github.com/danveloper/flash-moe.git
cd flash-moe
Step 2: Download Model Weights from Hugging Face¶
This is the largest step. You need the Qwen3.5-397B-A17B safetensors to extract weights.
pip install huggingface_hub
# Download the MLX 4-bit quantized version (~209GB on HF — but we'll repack to 2-bit)
# OR: use Unsloth's pre-packed GGUF as base for repacking
huggingface-cli download mlx-community/Qwen3.5-397B-A17B-4bit \
--local-dir ~/models/Qwen3.5-397B-4bit \
--repo-type model
Disk note: This download requires ~209GB free space temporarily. After 2-bit repacking, you need ~120GB. Total disk usage during process: ~330GB peak. Our 1.8TB SSD has headroom.
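As a sanity check on those figures (sizes from this section; arithmetic only, decimal GB):

```python
# Rough disk accounting for the setup pipeline (sizes in GB, from this report)
download_4bit = 209   # MLX 4-bit safetensors download
repacked_2bit = 120   # 2-bit expert files produced from the 4-bit set

# Peak: the 4-bit source and the 2-bit output coexist during repacking
peak = download_4bit + repacked_2bit
print(f"peak during repack: ~{peak} GB")   # ~329 GB, i.e. the ~330GB quoted

# After deleting the 4-bit artifacts, only the 2-bit experts plus
# the ~5.5 GB of non-expert weights remain on disk
steady_state = repacked_2bit + 5.5
print(f"steady state (2-bit only): ~{steady_state} GB")   # ~125 GB
```

This matches the ~125GB "2-bit only" total in Section 4's disk table.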
Step 3: Extract and Repack Weights¶
# Extract non-expert weights (creates model_weights.bin ~5.5GB)
python3 extract_weights.py ~/models/Qwen3.5-397B-4bit
# Pack expert weights into flash-moe format (~209GB at 4-bit)
python3 repack_experts.py ~/models/Qwen3.5-397B-4bit
# Optional but recommended: repack to 2-bit (120GB, +44% speed)
python3 repack_experts_2bit.py
The 2-bit repacking takes several hours. Run overnight.
Step 4: Build the Engine¶
cd metal_infer
make
# Compiles infer.m (~5,000 lines) + shaders.metal (~1,100 lines)
# Build time: ~2-3 minutes on M4 Pro
Step 5: Run Inference¶
# Single prompt (4-bit mode)
./infer --prompt "You are a helpful assistant." --tokens 200
# Single prompt (2-bit mode — faster)
./infer --prompt "You are a helpful assistant." --tokens 200 --2bit
# Interactive chat
./chat --2bit
Step 6: Verify Output Quality¶
The project includes a results.tsv with 90+ experiments. Run a few baseline prompts from math, code, and reasoning to validate output quality before integrating.
2. Expected Performance on Our M4 Pro 64GB¶
Reference Benchmark (M3 Max 48GB)¶
| Mode | tok/s | Notes |
|---|---|---|
| 2-bit experts, K=4 (warm cache) | 5.55 | Current best |
| 4-bit experts, K=4 (warm cache) | 4.80 | 209GB on disk |
| 4-bit experts, K=4 (cold SSD) | 2.83 | Steady-state cold |
| Peak single token | 7.05 | Warm cache, 2-bit |
M3 Max hardware: 400 GB/s memory bandwidth, 17.5 GB/s SSD sequential read, 48GB RAM
Our M4 Pro 64GB Projection¶
| Factor | M3 Max 48GB | Our M4 Pro 64GB | Impact |
|---|---|---|---|
| Memory bandwidth | ~400 GB/s | ~273 GB/s | -32% |
| RAM available | 48GB | 64GB | +16GB expert cache |
| SSD sequential read | ~17.5 GB/s | ~14–16 GB/s | −5 to −10% |
| SSD NVMe controller | Apple Fabric | Apple Fabric | Comparable |
| GPU cores | 40 | 20 | Fewer Metal cores |
The bottleneck is SSD I/O for expert streaming, not GPU compute. Each token requires loading K=4 expert blocks per layer (~3.9MB each, ~15.6MB per layer) from SSD, taking ~1.49ms per token per layer on the M3 Max. This I/O time, not compute, sets the throughput ceiling.
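To make the I/O ceiling concrete, here is a hedged back-of-envelope estimate using the numbers above plus the 60-layer count cited in Section 5. It is a pure-I/O lower bound that ignores random-access penalties, routing, and compute:

```python
# Back-of-envelope SSD ceiling for fully-cold decoding (numbers from this report)
experts_per_token_per_layer = 4       # K=4 routing
block_mb = 3.9                        # ~3.9 MB per expert block
layers = 60                           # layer count cited in Section 5

per_layer_mb = experts_per_token_per_layer * block_mb   # ~15.6 MB per layer
per_token_mb = per_layer_mb * layers                    # ~936 MB per cold token

ssd_gbps = 14.0                       # conservative M4 Pro sequential read
io_ms_per_token = per_token_mb / (ssd_gbps * 1000) * 1000

print(f"~{per_token_mb:.0f} MB read per fully-cold token")
print(f"~{io_ms_per_token:.0f} ms I/O per token")   # ~67 ms -> ~15 tok/s pure-I/O ceiling
```

The observed cold-SSD rate (2.83 tok/s on the M3 Max) is well below this bound, which suggests random-access overhead and compute dominate; the warm page cache is what narrows that gap.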
Expected performance range on our hardware:
- 2-bit mode, warm page cache: 4.0–5.0 tok/s
- 2-bit mode, cold SSD: 2.2–2.8 tok/s
- 4-bit mode, warm: 3.5–4.2 tok/s
- Peak single token: ~5.5–6.0 tok/s
The extra 16GB RAM is meaningful: the M3 Max has ~35GB available for page cache after the 5.5GB model weights + Metal buffers. Our M4 Pro has ~51GB available for page cache. This means more expert blocks stay hot in memory, reducing cold-SSD reads — partially compensating for lower bandwidth.
Realistic sustained throughput: 3.5–4.5 tok/s for typical multi-turn conversation.
That's slow by local-model standards but extraordinary for a 397B model. It's usable for research, drafting, and offline reasoning tasks where waiting 10–30 seconds for a response is acceptable. Not suitable for real-time chat.
3. OpenClaw Integration¶
Current State of flash-moe¶
Flash-MoE is a CLI tool, not an API server. It has:
- ./infer — single-shot inference, outputs to stdout
- ./chat — interactive REPL
- No HTTP server
- No OpenAI-compatible API
- No streaming output
- No JSON mode
Integration Options¶
Option A: Thin Python Wrapper (Recommended)¶
Build a lightweight FastAPI server that wraps the ./infer binary:
# flash_moe_server.py
from fastapi import FastAPI
from pydantic import BaseModel
import subprocess

app = FastAPI()

class ChatRequest(BaseModel):
    model: str
    messages: list
    max_tokens: int = 500

def format_prompt(messages: list) -> str:
    # Placeholder chat template: concatenate role-tagged turns.
    # Swap in the model's real chat template before relying on output quality.
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages) + "\nassistant:"

@app.post("/v1/chat/completions")
async def chat(req: ChatRequest):
    # Build prompt from messages
    prompt = format_prompt(req.messages)
    # Call the flash-moe binary (blocking; serializes concurrent requests)
    result = subprocess.run(
        ["./infer", "--prompt", prompt, "--tokens", str(req.max_tokens), "--2bit"],
        capture_output=True, text=True, cwd="/path/to/flash-moe/metal_infer",
    )
    # Return an OpenAI-compatible response
    return {
        "id": "chatcmpl-local",
        "object": "chat.completion",
        "model": "qwen3.5-397b-flash-moe",
        "choices": [{
            "message": {"role": "assistant", "content": result.stdout},
            "finish_reason": "stop",
        }],
    }
Run on port 11435 (adjacent to Ollama's 11434) to avoid conflicts.
Option B: Use llama-server Instead¶
The Unsloth GGUF versions of Qwen3.5-397B can run through llama-server (part of llama.cpp) which already has a full OpenAI-compatible API. This is less performant than flash-moe but immediately OpenClaw-compatible. See Section 6 for comparison.
OpenClaw Wiring¶
Once the wrapper is running, configure OpenClaw to route to it:
Streaming caveat: The current flash-moe binary does not stream tokens. A proper wrapper would need to either (a) modify the C source to emit tokens progressively, or (b) fake streaming by chunking the completed output token by token. Option (b) is easier; option (a) gives a real streaming UX.
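Option (b) can be sketched as a server-sent-events shim: run the binary to completion, then replay the output as OpenAI-style `chat.completion.chunk` events. A minimal sketch, chunking by whitespace; the chunk schema follows the OpenAI streaming convention, and `fake_stream` is a name invented here:

```python
import json

def fake_stream(text, model="qwen3.5-397b-flash-moe", chunk_words=3):
    """Yield SSE lines that replay already-generated text as streaming chunks."""
    words = text.split()
    for i in range(0, len(words), chunk_words):
        piece = " ".join(words[i:i + chunk_words])
        if i + chunk_words < len(words):
            piece += " "  # preserve spacing between chunks
        event = {
            "object": "chat.completion.chunk",
            "model": model,
            "choices": [{"delta": {"content": piece}, "finish_reason": None}],
        }
        yield f"data: {json.dumps(event)}\n\n"
    yield "data: [DONE]\n\n"

for line in fake_stream("The answer is forty two exactly"):
    print(line, end="")
```

In FastAPI this generator would back a `StreamingResponse` with `media_type="text/event-stream"`. Note it only improves perceived UX after generation finishes; time-to-first-token is unchanged.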
4. Disk Space Requirements¶
| Item | Size | Notes |
|---|---|---|
| Full BF16 checkpoint (raw) | ~807 GB | Not needed for flash-moe |
| MLX 4-bit download (flash-moe source) | ~209 GB | Downloaded, then repacked |
| flash-moe 4-bit expert files | ~209 GB | packed_experts/ directory |
| flash-moe 2-bit expert files | ~120 GB | packed_experts_2bit/ directory |
| Non-expert weights (model_weights.bin) | ~5.5 GB | mmap'd at runtime |
| Vocab + manifest | ~50 MB | Trivial |
| Total (2-bit only, no 4-bit) | ~125 GB | After cleanup |
| Total (both quants on disk) | ~335 GB | Keep both for quality comparisons |
| Peak during setup | ~330 GB | While 4-bit download + 2-bit repack coexist |
Our 1.8TB SSD: Assuming ~500GB used by OS + current models + workspace, we have ~1.3TB free. Running both quants simultaneously (335GB) is comfortable. Download + repack peak (330GB) is fine.
5. Memory Usage Projections¶
Flash-MoE is specifically designed not to require the full model in RAM. This is its killer feature.
| Component | RAM Usage | Notes |
|---|---|---|
| Non-expert weights | ~5.5 GB | mmap'd read-only, stays resident |
| Metal scratch buffers | ~200 MB | GPU compute workspace |
| Expert LRU cache (Metal) | 0–3.5 GB | Optional, configurable |
| OS + macOS processes | ~4–6 GB | Typical Mac mini background |
| Page cache (expert blocks) | up to remaining | OS fills this organically |
| Total flash-moe footprint | ~6–9 GB | By design |
| Available for page cache | ~51–58 GB | Our M4 Pro 64GB advantage |
Key insight: On the M3 Max 48GB, ~35GB is available for page cache. On our 64GB, ~51GB is available. Since expert blocks are 3.9MB each and there are 512 experts × 60 layers, the total unique expert space is ~120GB at 2-bit — well beyond what fits in RAM. But the 80/20 rule applies: a subset of experts is called frequently, and those will stay warm in our larger cache, meaningfully improving sustained performance over the M3 Max reference.
No OOM risk. Flash-moe's design explicitly avoids it. Expert data streams on demand; it never tries to load the full 120GB into RAM.
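The numbers above are internally consistent; a quick check (decimal GB, figures from this section), with the resident fraction as a rough proxy for how much of the expert working set each machine keeps warm:

```python
# Unique expert space vs. page-cache capacity (numbers from this report)
experts, layers = 512, 60
block_mb = 3.9                                  # 2-bit expert block size

total_gb = experts * layers * block_mb / 1000   # decimal GB
print(f"unique expert space: ~{total_gb:.0f} GB")   # ~120 GB, matching the report

for name, cache_gb in [("M3 Max 48GB", 35), ("M4 Pro 64GB", 51)]:
    frac = cache_gb / total_gb
    print(f"{name}: ~{cache_gb} GB page cache -> ~{frac:.0%} of experts resident")
```

Roughly 29% of expert blocks can stay resident on the M3 Max versus ~43% on our M4 Pro, before accounting for the skewed (80/20) access pattern that makes the hot subset even more cacheable.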
6. Flash-MoE vs Qwen3.5-35B via Ollama (What We Already Run)¶
| Dimension | flash-moe (397B-A17B) | Ollama (35B-A3B Q4) |
|---|---|---|
| Model | Qwen3.5-397B-A17B | Qwen3.5-35B-A3B |
| Parameter count | 397B total, 17B active | 35B total, 3B active |
| Intelligence tier | Gemini 3 Pro / Claude Opus 4.5 level | ~Qwen3.5 mid tier |
| Generation speed | ~3.5–5.5 tok/s | ~18–28 tok/s |
| Setup complexity | High (C build, weight repack) | Trivial (ollama pull) |
| API compatibility | None (CLI only, needs wrapper) | Full OpenAI-compatible |
| Memory footprint | ~6–9 GB RAM | ~20–25 GB RAM |
| Disk space | ~120–335 GB | ~22 GB |
| Streaming | No (requires patch/wrapper) | Yes (native) |
| Context window | Limited by implementation | 32K–128K |
| Production readiness | Research/PoC | Production |
| Thinking mode | Available in base model | Available via Ollama |
| Cost | One-time setup, ~24h | Already running |
Verdict¶
Ollama + Qwen3.5-35B wins on every operational metric for day-to-day use: speed, compatibility, reliability. Flash-MoE wins on one thing: model intelligence at equivalent compute cost. A 397B model with 17B active parameters is genuinely more capable than a 35B model with 3B active parameters — particularly on complex reasoning, code, and long-context tasks.
The use case for flash-moe isn't "replace Ollama." It's:
- Batch reasoning tasks that can tolerate low speed (overnight research, deep analysis)
- Tasks where model quality matters more than latency
- Demonstrating 397B capability from a consumer device (product/marketing angle)
Recommendation: Keep Ollama running Qwen3.5-35B as the primary fast model. Add flash-moe as a secondary "heavy reasoning" endpoint for high-stakes tasks.
7. Product Angle: "Local 397B on Consumer Hardware"¶
This is where it gets interesting.
The Market Signal¶
The fact that @danveloper built this in 24 hours and it spread virally signals genuine market appetite. People want:
1. Frontier-class AI locally (privacy, cost, latency)
2. Proof that it's possible on consumer hardware
3. Something they can actually set up and use
Flash-MoE answered #1 and #2. Nobody has answered #3 properly yet.
Product Concepts¶
Concept A: Flash-MoE Stack Installer (Low Effort, Fast to Ship)¶
A one-command installer that:
- Handles the dependency chain (Xcode tools, Python, HF CLI)
- Manages the model download (resumable, progress UI)
- Runs the weight repacking (scheduled overnight)
- Installs the OpenAI-compatible wrapper server
- Configures auto-start as a macOS Launch Agent
Target: Technical enthusiasts who want to run flash-moe but don't want to debug C compilation and weight extraction manually.
Distribution: GitHub + Homebrew tap
Pricing: Free (lead gen) or $49 one-time
Time to build: 1–2 weeks for Melody
Concept B: OpenClaw Local Inference Bundle (Medium Effort, Higher Value)¶
Bundle the flash-moe stack with OpenClaw configuration:
- OpenClaw pre-configured to route heavy tasks to local 397B
- Routing logic: fast tasks → Ollama 35B, reasoning tasks → flash-moe 397B
- Dashboard showing live model usage, cost savings vs cloud
- Mac mini "AI appliance" positioning
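The routing logic is simple enough to sketch. Everything here (the thresholds, the `task_kind` tag, the endpoint choices) is a hypothetical illustration, not OpenClaw's actual configuration:

```python
# Hypothetical two-tier router: fast model by default, heavy model for tasks
# tagged as deep reasoning or exceeding a size threshold (values invented).
FAST = "http://localhost:11434/v1"    # Ollama, Qwen3.5-35B
HEAVY = "http://localhost:11435/v1"   # flash-moe wrapper, Qwen3.5-397B

def pick_endpoint(task_kind: str, est_input_tokens: int) -> str:
    heavy_kinds = {"deep-reasoning", "code-review", "research"}
    if task_kind in heavy_kinds or est_input_tokens > 4000:
        return HEAVY
    return FAST

print(pick_endpoint("chat", 200))       # routes to the fast Ollama endpoint
print(pick_endpoint("research", 200))   # routes to the flash-moe endpoint
```

The real dispatcher would live wherever OpenClaw selects models per task; the point is that a static rule table is enough to start.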
Target: Privacy-conscious small businesses, solo operators, developers
Pricing: $299–499/year subscription or $149 setup fee + OpenClaw license
Positioning: "Your own private frontier AI, no monthly API bills"
Concept C: Managed "AI Appliance" Service (High Effort, High Margin)¶
Sell and configure Mac mini AI nodes for businesses:
- Hardware procurement (Mac mini M4 Pro 64GB, ~$2,500 hardware)
- Setup fee: $1,500–3,000 (flash-moe + OpenClaw configured)
- Monthly support/maintenance: $500–1,000/month
- VPN access so clients can use the node remotely
Target: Small law firms, medical practices, financial advisors, research teams — any high-privacy vertical that can't send data to OpenAI
Revenue model: Hardware margin (~15%) + setup fee + recurring support
Annual value per client: $7,500–14,000
Key constraints:
- Requires someone to handle the physical hardware + setup
- Support burden is real (OS updates, model updates)
- Not truly scalable without tooling
- Jeff's bandwidth is limited; needs systems to automate the ops
Concept D: "Local Frontier" API Reselling (Speculative)¶
Run multiple Mac minis as a local inference cluster, sell API access at rates below cloud:
- OpenAI GPT-4o: ~$5/M tokens
- Claude Sonnet 4.6: ~$3/M tokens
- Local 397B via flash-moe: ~$0.50/M tokens (after hardware amortization)
The problem: Throughput is 3.5–5.5 tok/s per machine. At 100 concurrent users, you need ~20 machines just to keep latency reasonable. Hardware cost: ~$50,000. Not viable at early stage.
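The sizing claim checks out under simple assumptions. The ~5 concurrent users per box is inferred from the report's ~20-machine figure, not measured:

```python
import math

# Cluster sizing for Concept D (machine cost from this report;
# users-per-machine inferred from its ~20-machine estimate)
users = 100
users_per_machine = 5
machine_cost = 2500          # Mac mini M4 Pro 64GB

machines = math.ceil(users / users_per_machine)
hardware = machines * machine_cost
print(f"{machines} machines, ~${hardware:,} hardware")   # 20 machines, ~$50,000
```

At 3.5–5.5 tok/s per single stream, even 5 users per machine implies painful queueing, which is why the concept stays speculative.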
Realistic Near-Term Recommendation¶
Start with Concept A + B: Build the installer (low effort) to establish credibility in the local AI space, then evolve it toward a bundled OpenClaw offering. This connects directly to the "Managed OpenClaw Service" idea already in BACKLOG.md.
The narrative that sells: "Anthropic, Google, and OpenAI charge $50–200/month for frontier AI access. For $2,500 in hardware and a weekend of setup, you can run an equivalent model locally forever, with zero data leaving your machine."
That's a compelling pitch — especially post-GDPR, post-AI-privacy-scare. The market is real. The technology works. The gap is the polish layer.
8. Open Questions / Risks¶
| Risk | Severity | Mitigation |
|---|---|---|
| Flash-moe isn't maintained | Medium | Fork it. It's MIT licensed. The C code is self-contained. |
| 3.5–5.5 tok/s is too slow for users | Medium | Frame it as batch/overnight mode, not real-time chat |
| Weight download takes hours | Low | Build resumable downloader, run overnight |
| Community hasn't packaged this yet | Low | First-mover advantage if we build the installer |
| Wrapper complexity | Low | Python FastAPI wrapper is a few hundred lines |
| Ollama adds Qwen3.5-397B support via llama.cpp | High | Flash-moe's edge is 2-bit speed optimization; monitor llama.cpp progress |
Watch: llama.cpp is actively adding MoE expert streaming. If it achieves similar SSD-offload performance natively, that would subsume flash-moe's advantage while being API-compatible out of the box. Flash-moe's head start may be only a 3–6 month window.
9. Recommended Next Steps¶
| Priority | Action | Effort | Owner |
|---|---|---|---|
| 1 | Build flash-moe on Mac mini, validate it compiles and runs | 2h | Jeff/Jules |
| 2 | Download 4-bit weights overnight, run weight extraction | 8–12h | Automated |
| 3 | Run 2-bit repack overnight | 4–6h | Automated |
| 4 | Benchmark actual tok/s on our hardware | 30m | Jules |
| 5 | Build OpenAI-compatible wrapper (Python FastAPI) | 1–2 days | Melody |
| 6 | Wire wrapper to OpenClaw as secondary model endpoint | 2h | Jules |
| 7 | Evaluate product packaging concept (A vs B) | Decision call | Jeff |
Sources¶
- flash-moe GitHub: https://github.com/danveloper/flash-moe
- flash-moe paper: https://github.com/danveloper/flash-moe/blob/main/paper/flash_moe.pdf
- Unsloth Qwen3.5 guide: https://unsloth.ai/docs/models/qwen3.5
- Unsloth GGUF benchmarks: https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks
- M4 Pro bandwidth: ~273 GB/s (confirmed by multiple sources)
- M3 Max bandwidth: ~400 GB/s (danveloper hardware)
- llama.cpp Apple Silicon discussion: https://github.com/ggml-org/llama.cpp/discussions/4167
No community benchmarks for flash-moe specifically on M4 Pro hardware found as of 2026-03-19. Flash-moe remains early-stage; no user-friendly packaging beyond the raw repo exists yet.