Flash-MoE Implementation Proposal¶
Research by: Atlas, Director of Research
Date: 2026-03-19
Status: Ready for Jeff's Decision
Repo: https://github.com/danveloper/flash-moe
Executive Summary¶
Flash-MoE is a pure C/Metal inference engine that runs Qwen3.5-397B-A17B — a 397 billion parameter model — on a MacBook Pro M3 Max 48GB at 5.5 tok/s using SSD expert streaming. Our M4 Pro 64GB Mac mini is hardware-comparable, with the critical difference being lower memory bandwidth (~273 GB/s vs ~400 GB/s) offset by +16GB of RAM enabling a larger in-memory expert cache.
Bottom line: Flash-MoE is real, technically impressive, and can run on our hardware — but it's not production-ready as an OpenClaw integration today. It's a proof-of-concept (5,000-line Objective-C file, no API server). Worth building for strategic and product reasons; needs a thin wrapper to integrate with OpenClaw.
1. Build and Run: Step-by-Step Instructions¶
Prerequisites¶
# Verify Xcode command line tools
xcode-select --install
# Verify Metal (built into macOS on Apple Silicon)
xcrun metal --version
# Python 3.11+ for weight extraction scripts
python3 --version
Step 1: Clone the Repo¶
git clone https://github.com/danveloper/flash-moe.git
cd flash-moe
Step 2: Download Model Weights from Hugging Face¶
This is the largest step. You need the Qwen3.5-397B-A17B safetensors to extract weights.
pip install huggingface_hub
# Download the MLX 4-bit quantized version (~209GB on HF — but we'll repack to 2-bit)
# OR: use Unsloth's pre-packed GGUF as base for repacking
huggingface-cli download mlx-community/Qwen3.5-397B-A17B-4bit \
--local-dir ~/models/Qwen3.5-397B-4bit \
--repo-type model
Disk note: This download requires ~209GB free space temporarily. After 2-bit repacking, you need ~120GB. Total disk usage during process: ~330GB peak. Our 1.8TB SSD has headroom.
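As a sanity check on those figures (sizes from this section; arithmetic only, decimal GB):

```python
# Rough disk accounting for the setup pipeline (sizes in GB, from this report)
download_4bit = 209   # MLX 4-bit safetensors download
repacked_2bit = 120   # 2-bit expert files produced from the 4-bit set

# Peak: the 4-bit source and the 2-bit output coexist during repacking
peak = download_4bit + repacked_2bit
print(f"peak during repack: ~{peak} GB")   # ~329 GB, i.e. the ~330GB quoted

# After deleting the 4-bit artifacts, only the 2-bit experts plus
# the ~5.5 GB of non-expert weights remain on disk
steady_state = repacked_2bit + 5.5
print(f"steady state (2-bit only): ~{steady_state} GB")   # ~125 GB
```

This matches the ~125GB "2-bit only" total in Section 4's disk table.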
Step 3: Extract and Repack Weights¶
# Extract non-expert weights (creates model_weights.bin ~5.5GB)
python3 extract_weights.py ~/models/Qwen3.5-397B-4bit
# Pack expert weights into flash-moe format (~209GB at 4-bit)
python3 repack_experts.py ~/models/Qwen3.5-397B-4bit
# Optional but recommended: repack to 2-bit (120GB, +44% speed)
python3 repack_experts_2bit.py
The 2-bit repacking takes several hours. Run overnight.
Step 4: Build the Engine¶
cd metal_infer
make
# Compiles infer.m (~5,000 lines) + shaders.metal (~1,100 lines)
# Build time: ~2-3 minutes on M4 Pro
Step 5: Run Inference¶
# Single prompt (4-bit mode)
./infer --prompt "You are a helpful assistant." --tokens 200
# Single prompt (2-bit mode — faster)
./infer --prompt "You are a helpful assistant." --tokens 200 --2bit
# Interactive chat
./chat --2bit
Step 6: Verify Output Quality¶
The project includes a results.tsv with 90+ experiments. Run a few baseline prompts from math, code, and reasoning to validate output quality before integrating.
2. Expected Performance on Our M4 Pro 64GB¶
Reference Benchmark (M3 Max 48GB)¶
| Mode | tok/s | Notes |
|---|---|---|
| 2-bit experts, K=4 (warm cache) | 5.55 | Current best |
| 4-bit experts, K=4 (warm cache) | 4.80 | 209GB on disk |
| 4-bit experts, K=4 (cold SSD) | 2.83 | Steady-state cold |
| Peak single token | 7.05 | Warm cache, 2-bit |
M3 Max hardware: 400 GB/s memory bandwidth, 17.5 GB/s SSD sequential read, 48GB RAM
Our M4 Pro 64GB Projection¶
| Factor | M3 Max 48GB | Our M4 Pro 64GB | Impact |
|---|---|---|---|
| Memory bandwidth | ~400 GB/s | ~273 GB/s | -32% |
| RAM available | 48GB | 64GB | +16GB expert cache |
| SSD sequential read | ~17.5 GB/s | ~14–16 GB/s | −5 to −10% |
| SSD NVMe controller | Apple Fabric | Apple Fabric | Comparable |
| GPU cores | 40 | 20 | Fewer Metal cores |
The bottleneck is SSD I/O for expert streaming, not GPU compute. Each token requires loading K=4 expert blocks per layer (~3.9MB each, ~15.6MB per layer) from SSD, taking ~1.49ms per token per layer on the M3 Max. This I/O time, not compute, sets the throughput ceiling.
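To make the I/O ceiling concrete, here is a hedged back-of-envelope estimate using the numbers above plus the 60-layer count cited in Section 5. It is a pure-I/O lower bound that ignores random-access penalties, routing, and compute:

```python
# Back-of-envelope SSD ceiling for fully-cold decoding (numbers from this report)
experts_per_token_per_layer = 4       # K=4 routing
block_mb = 3.9                        # ~3.9 MB per expert block
layers = 60                           # layer count cited in Section 5

per_layer_mb = experts_per_token_per_layer * block_mb   # ~15.6 MB per layer
per_token_mb = per_layer_mb * layers                    # ~936 MB per cold token

ssd_gbps = 14.0                       # conservative M4 Pro sequential read
io_ms_per_token = per_token_mb / (ssd_gbps * 1000) * 1000

print(f"~{per_token_mb:.0f} MB read per fully-cold token")
print(f"~{io_ms_per_token:.0f} ms I/O per token")   # ~67 ms -> ~15 tok/s pure-I/O ceiling
```

The observed cold-SSD rate (2.83 tok/s on the M3 Max) is well below this bound, which suggests random-access overhead and compute dominate; the warm page cache is what narrows that gap.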
Expected performance range on our hardware:
- 2-bit mode, warm page cache: 4.0–5.0 tok/s
- 2-bit mode, cold SSD: 2.2–2.8 tok/s
- 4-bit mode, warm: 3.5–4.2 tok/s
- Peak single token: ~5.5–6.0 tok/s
The extra 16GB RAM is meaningful: the M3 Max has ~35GB available for page cache after the 5.5GB model weights + Metal buffers. Our M4 Pro has ~51GB available for page cache. This means more expert blocks stay hot in memory, reducing cold-SSD reads — partially compensating for lower bandwidth.
Realistic sustained throughput: 3.5–4.5 tok/s for typical multi-turn conversation.
That's slow by local-model standards but extraordinary for a 397B model. It's usable for research, drafting, and offline reasoning tasks where waiting 10–30 seconds for a response is acceptable. Not suitable for real-time chat.
3. OpenClaw Integration¶
Current State of flash-moe¶
Flash-MoE is a CLI tool, not an API server. It has:
- ./infer — single-shot inference, outputs to stdout
- ./chat — interactive REPL
- No HTTP server
- No OpenAI-compatible API
- No streaming output
- No JSON mode
Integration Options¶
Option A: Thin Python Wrapper (Recommended)¶
Build a lightweight FastAPI server that wraps the ./infer binary:
# flash_moe_server.py
from fastapi import FastAPI
from pydantic import BaseModel
import subprocess

app = FastAPI()

class ChatRequest(BaseModel):
    model: str
    messages: list
    max_tokens: int = 500

def format_prompt(messages: list) -> str:
    # Placeholder chat template: concatenate role-tagged turns.
    # Swap in the model's real chat template before relying on output quality.
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages) + "\nassistant:"

@app.post("/v1/chat/completions")
async def chat(req: ChatRequest):
    # Build prompt from messages
    prompt = format_prompt(req.messages)
    # Call the flash-moe binary (blocking; serializes concurrent requests)
    result = subprocess.run(
        ["./infer", "--prompt", prompt, "--tokens", str(req.max_tokens), "--2bit"],
        capture_output=True, text=True, cwd="/path/to/flash-moe/metal_infer",
    )
    # Return an OpenAI-compatible response
    return {
        "id": "chatcmpl-local",
        "object": "chat.completion",
        "model": "qwen3.5-397b-flash-moe",
        "choices": [{
            "message": {"role": "assistant", "content": result.stdout},
            "finish_reason": "stop",
        }],
    }
Run on port 11435 (adjacent to Ollama's 11434) to avoid conflicts.
Option B: Use llama-server Instead¶
The Unsloth GGUF versions of Qwen3.5-397B can run through llama-server (part of llama.cpp) which already has a full OpenAI-compatible API. This is less performant than flash-moe but immediately OpenClaw-compatible. See Section 6 for comparison.
OpenClaw Wiring¶
Once the wrapper is running, configure OpenClaw to route to it:
Streaming caveat: The current flash-moe binary does not stream tokens. A proper wrapper would need to either (a) modify the C source to emit tokens progressively, or (b) fake streaming by chunking the completed output token by token. Option (b) is easier; option (a) gives a real streaming UX.
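Option (b) can be sketched as a server-sent-events shim: run the binary to completion, then replay the output as OpenAI-style `chat.completion.chunk` events. A minimal sketch, chunking by whitespace; the chunk schema follows the OpenAI streaming convention, and `fake_stream` is a name invented here:

```python
import json

def fake_stream(text, model="qwen3.5-397b-flash-moe", chunk_words=3):
    """Yield SSE lines that replay already-generated text as streaming chunks."""
    words = text.split()
    for i in range(0, len(words), chunk_words):
        piece = " ".join(words[i:i + chunk_words])
        if i + chunk_words < len(words):
            piece += " "  # preserve spacing between chunks
        event = {
            "object": "chat.completion.chunk",
            "model": model,
            "choices": [{"delta": {"content": piece}, "finish_reason": None}],
        }
        yield f"data: {json.dumps(event)}\n\n"
    yield "data: [DONE]\n\n"

for line in fake_stream("The answer is forty two exactly"):
    print(line, end="")
```

In FastAPI this generator would back a `StreamingResponse` with `media_type="text/event-stream"`. Note it only improves perceived UX after generation finishes; time-to-first-token is unchanged.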
4. Disk Space Requirements¶
| Item | Size | Notes |
|---|---|---|
| Full BF16 checkpoint (raw) | ~807 GB | Not needed for flash-moe |
| MLX 4-bit download (flash-moe source) | ~209 GB | Downloaded, then repacked |
| flash-moe 4-bit expert files | ~209 GB | packed_experts/ directory |
| flash-moe 2-bit expert files | ~120 GB | packed_experts_2bit/ directory |
| Non-expert weights (model_weights.bin) | ~5.5 GB | mmap'd at runtime |
| Vocab + manifest | ~50 MB | Trivial |
| Total (2-bit only, no 4-bit) | ~125 GB | After cleanup |
| Total (both quants on disk) | ~335 GB | Keep both for quality comparisons |
| Peak during setup | ~330 GB | While 4-bit download + 2-bit repack coexist |
Our 1.8TB SSD: Assuming ~500GB used by OS + current models + workspace, we have ~1.3TB free. Running both quants simultaneously (335GB) is comfortable. Download + repack peak (330GB) is fine.
5. Memory Usage Projections¶
Flash-MoE is specifically designed not to require the full model in RAM. This is its killer feature.
| Component | RAM Usage | Notes |
|---|---|---|
| Non-expert weights | ~5.5 GB | mmap'd read-only, stays resident |
| Metal scratch buffers | ~200 MB | GPU compute workspace |
| Expert LRU cache (Metal) | 0–3.5 GB | Optional, configurable |
| OS + macOS processes | ~4–6 GB | Typical Mac mini background |
| Page cache (expert blocks) | up to remaining | OS fills this organically |
| Total flash-moe footprint | ~6–9 GB | By design |
| Available for page cache | ~51–58 GB | Our M4 Pro 64GB advantage |
Key insight: On the M3 Max 48GB, ~35GB is available for page cache. On our 64GB, ~51GB is available. Since expert blocks are 3.9MB each and there are 512 experts × 60 layers, the total unique expert space is ~120GB at 2-bit — well beyond what fits in RAM. But the 80/20 rule applies: a subset of experts is called frequently, and those will stay warm in our larger cache, meaningfully improving sustained performance over the M3 Max reference.
No OOM risk. Flash-moe's design explicitly avoids it. Expert data streams on demand; it never tries to load the full 120GB into RAM.
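The numbers above are internally consistent; a quick check (decimal GB, figures from this section), with the resident fraction as a rough proxy for how much of the expert working set each machine keeps warm:

```python
# Unique expert space vs. page-cache capacity (numbers from this report)
experts, layers = 512, 60
block_mb = 3.9                                  # 2-bit expert block size

total_gb = experts * layers * block_mb / 1000   # decimal GB
print(f"unique expert space: ~{total_gb:.0f} GB")   # ~120 GB, matching the report

for name, cache_gb in [("M3 Max 48GB", 35), ("M4 Pro 64GB", 51)]:
    frac = cache_gb / total_gb
    print(f"{name}: ~{cache_gb} GB page cache -> ~{frac:.0%} of experts resident")
```

Roughly 29% of expert blocks can stay resident on the M3 Max versus ~43% on our M4 Pro, before accounting for the skewed (80/20) access pattern that makes the hot subset even more cacheable.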
6. Flash-MoE vs Qwen3.5-35B via Ollama (What We Already Run)¶
| Dimension | flash-moe (397B-A17B) | Ollama (35B-A3B Q4) |
|---|---|---|
| Model | Qwen3.5-397B-A17B | Qwen3.5-35B-A3B |
| Parameter count | 397B total, 17B active | 35B total, 3B active |
| Intelligence tier | Gemini 3 Pro / Claude Opus 4.5 level | ~Qwen3.5 mid tier |
| Generation speed | ~3.5–5.5 tok/s | ~18–28 tok/s |
| Setup complexity | High (C build, weight repack) | Trivial (ollama pull) |
| API compatibility | None (CLI only, needs wrapper) | Full OpenAI-compatible |
| Memory footprint | ~6–9 GB RAM | ~20–25 GB RAM |
| Disk space | ~120–335 GB | ~22 GB |
| Streaming | No (requires patch/wrapper) | Yes (native) |
| Context window | Limited by implementation | 32K–128K |
| Production readiness | Research/PoC | Production |
| Thinking mode | Available in base model | Available via Ollama |
| Cost | One-time setup, ~24h | Already running |
Verdict¶
Ollama + Qwen3.5-35B wins on every operational metric for day-to-day use: speed, compatibility, reliability. Flash-MoE wins on one thing: model intelligence at equivalent compute cost. A 397B model with 17B active parameters is genuinely more capable than a 35B model with 3B active parameters — particularly on complex reasoning, code, and long-context tasks.
The use case for flash-moe isn't "replace Ollama." It's:
- Batch reasoning tasks that can tolerate low speed (overnight research, deep analysis)
- Tasks where model quality matters more than latency
- Demonstrating 397B capability from a consumer device (product/marketing angle)
Recommendation: Keep Ollama running Qwen3.5-35B as the primary fast model. Add flash-moe as a secondary "heavy reasoning" endpoint for high-stakes tasks.
7. Product Angle: "Local 397B on Consumer Hardware"¶
This is where it gets interesting.
The Market Signal¶
The fact that @danveloper built this in 24 hours and it spread virally signals genuine market appetite. People want:
1. Frontier-class AI locally (privacy, cost, latency)
2. Proof that it's possible on consumer hardware
3. Something they can actually set up and use
Flash-MoE answered #1 and #2. Nobody has answered #3 properly yet.
Product Concepts¶
Concept A: Flash-MoE Stack Installer (Low Effort, Fast to Ship)¶
A one-command installer that:
- Handles the dependency chain (Xcode tools, Python, HF CLI)
- Manages the model download (resumable, progress UI)
- Runs the weight repacking (scheduled overnight)
- Installs the OpenAI-compatible wrapper server
- Configures auto-start as a macOS Launch Agent
Target: Technical enthusiasts who want to run flash-moe but don't want to debug C compilation and weight extraction manually.
Distribution: GitHub + Homebrew tap
Pricing: Free (lead gen) or $49 one-time
Time to build: 1–2 weeks for Melody
Concept B: OpenClaw Local Inference Bundle (Medium Effort, Higher Value)¶
Bundle the flash-moe stack with OpenClaw configuration:
- OpenClaw pre-configured to route heavy tasks to local 397B
- Routing logic: fast tasks → Ollama 35B, reasoning tasks → flash-moe 397B
- Dashboard showing live model usage, cost savings vs cloud
- Mac mini "AI appliance" positioning
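The routing logic is simple enough to sketch. Everything here (the thresholds, the `task_kind` tag, the endpoint choices) is a hypothetical illustration, not OpenClaw's actual configuration:

```python
# Hypothetical two-tier router: fast model by default, heavy model for tasks
# tagged as deep reasoning or exceeding a size threshold (values invented).
FAST = "http://localhost:11434/v1"    # Ollama, Qwen3.5-35B
HEAVY = "http://localhost:11435/v1"   # flash-moe wrapper, Qwen3.5-397B

def pick_endpoint(task_kind: str, est_input_tokens: int) -> str:
    heavy_kinds = {"deep-reasoning", "code-review", "research"}
    if task_kind in heavy_kinds or est_input_tokens > 4000:
        return HEAVY
    return FAST

print(pick_endpoint("chat", 200))       # routes to the fast Ollama endpoint
print(pick_endpoint("research", 200))   # routes to the flash-moe endpoint
```

The real dispatcher would live wherever OpenClaw selects models per task; the point is that a static rule table is enough to start.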
Target: Privacy-conscious small businesses, solo operators, developers
Pricing: $299–499/year subscription or $149 setup fee + OpenClaw license
Positioning: "Your own private frontier AI, no monthly API bills"
Concept C: Managed "AI Appliance" Service (High Effort, High Margin)¶
Sell and configure Mac mini AI nodes for businesses:
- Hardware procurement (Mac mini M4 Pro 64GB, ~$2,500 hardware)
- Setup fee: $1,500–3,000 (flash-moe + OpenClaw configured)
- Monthly support/maintenance: $500–1,000/month
- VPN access so clients can use the node remotely
Target: Small law firms, medical practices, financial advisors, research teams — any high-privacy vertical that can't send data to OpenAI
Revenue model: Hardware margin (~15%) + setup fee + recurring support
Annual value per client: $7,500–14,000
Key constraints:
- Requires someone to handle the physical hardware + setup
- Support burden is real (OS updates, model updates)
- Not truly scalable without tooling
- Jeff's bandwidth is limited; needs systems to automate the ops
Concept D: "Local Frontier" API Reselling (Speculative)¶
Run multiple Mac minis as a local inference cluster, sell API access at rates below cloud:
- OpenAI GPT-4o: ~$5/M tokens
- Claude Sonnet 4.6: ~$3/M tokens
- Local 397B via flash-moe: ~$0.50/M tokens (after hardware amortization)
The problem: Throughput is 3.5–5.5 tok/s per machine. At 100 concurrent users, you need ~20 machines just to keep latency reasonable. Hardware cost: ~$50,000. Not viable at early stage.
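The sizing claim checks out under simple assumptions. The ~5 concurrent users per box is inferred from the report's ~20-machine figure, not measured:

```python
import math

# Cluster sizing for Concept D (machine cost from this report;
# users-per-machine inferred from its ~20-machine estimate)
users = 100
users_per_machine = 5
machine_cost = 2500          # Mac mini M4 Pro 64GB

machines = math.ceil(users / users_per_machine)
hardware = machines * machine_cost
print(f"{machines} machines, ~${hardware:,} hardware")   # 20 machines, ~$50,000
```

At 3.5–5.5 tok/s per single stream, even 5 users per machine implies painful queueing, which is why the concept stays speculative.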
Realistic Near-Term Recommendation¶
Start with Concept A + B: Build the installer (low effort) to establish credibility in the local AI space, then evolve it toward a bundled OpenClaw offering. This connects directly to the "Managed OpenClaw Service" idea already in BACKLOG.md.
The narrative that sells: "Anthropic, Google, and OpenAI charge $50–200/month for frontier AI access. For $2,500 in hardware and a weekend of setup, you can run an equivalent model locally forever, with zero data leaving your machine."
That's a compelling pitch — especially post-GDPR, post-AI-privacy-scare. The market is real. The technology works. The gap is the polish layer.
8. Open Questions / Risks¶
| Risk | Severity | Mitigation |
|---|---|---|
| Flash-moe isn't maintained | Medium | Fork it. It's MIT licensed. The C code is self-contained. |
| 3.5–5.5 tok/s is too slow for users | Medium | Frame it as batch/overnight mode, not real-time chat |
| Weight download takes hours | Low | Build resumable downloader, run overnight |
| Community hasn't packaged this yet | Low | First-mover advantage if we build the installer |
| Wrapper complexity | Low | Python FastAPI wrapper is a few hundred lines |
| Ollama adds Qwen3.5-397B support via llama.cpp | High | Flash-moe's edge is 2-bit speed optimization; monitor llama.cpp progress |
Watch: llama.cpp is actively adding MoE expert streaming. If it achieves similar SSD-offload performance natively, that would subsume flash-moe's advantage while being API-compatible out of the box. Flash-moe's head start may be only a 3–6 month window.
9. Recommended Next Steps¶
| Priority | Action | Effort | Owner |
|---|---|---|---|
| 1 | Build flash-moe on Mac mini, validate it compiles and runs | 2h | Jeff/Jules |
| 2 | Download 4-bit weights overnight, run weight extraction | 8–12h | Automated |
| 3 | Run 2-bit repack overnight | 4–6h | Automated |
| 4 | Benchmark actual tok/s on our hardware | 30m | Jules |
| 5 | Build OpenAI-compatible wrapper (Python FastAPI) | 1–2 days | Melody |
| 6 | Wire wrapper to OpenClaw as secondary model endpoint | 2h | Jules |
| 7 | Evaluate product packaging concept (A vs B) | Decision call | Jeff |
Sources¶
- flash-moe GitHub: https://github.com/danveloper/flash-moe
- flash-moe paper: https://github.com/danveloper/flash-moe/blob/main/paper/flash_moe.pdf
- Unsloth Qwen3.5 guide: https://unsloth.ai/docs/models/qwen3.5
- Unsloth GGUF benchmarks: https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks
- M4 Pro bandwidth: ~273 GB/s (confirmed by multiple sources)
- M3 Max bandwidth: ~400 GB/s (danveloper hardware)
- llama.cpp Apple Silicon discussion: https://github.com/ggml-org/llama.cpp/discussions/4167
No community benchmarks for flash-moe specifically on M4 Pro hardware found as of 2026-03-19. Flash-moe remains early-stage; no user-friendly packaging beyond the raw repo exists yet.