
Flash-MoE Implementation Proposal

Research by: Atlas, Director of Research
Date: 2026-03-19
Status: Ready for Jeff's Decision
Repo: https://github.com/danveloper/flash-moe


Executive Summary

Flash-MoE is a pure C/Metal inference engine that runs Qwen3.5-397B-A17B — a 397 billion parameter model — on a MacBook Pro M3 Max 48GB at 5.5 tok/s using SSD expert streaming. Our M4 Pro 64GB Mac mini is hardware-comparable, with the critical difference being lower memory bandwidth (~273 GB/s vs ~400 GB/s) offset by +16GB of RAM enabling a larger in-memory expert cache.

Bottom line: Flash-MoE is real, technically impressive, and can run on our hardware — but it's not production-ready as an OpenClaw integration today. It's a proof-of-concept (5,000-line Objective-C file, no API server). Worth building for strategic and product reasons; needs a thin wrapper to integrate with OpenClaw.


1. Build and Run: Step-by-Step Instructions

Prerequisites

# Verify Xcode command line tools
xcode-select --install

# Verify Metal (built into macOS on Apple Silicon)
xcrun metal --version

# Python 3.11+ for weight extraction scripts
python3 --version

Step 1: Clone the Repo

git clone https://github.com/danveloper/flash-moe.git
cd flash-moe/metal_infer

Step 2: Download Model Weights from Hugging Face

This is the largest step. You need the Qwen3.5-397B-A17B safetensors to extract weights.

pip install huggingface_hub

# Download the MLX 4-bit quantized version (~209GB on HF — but we'll repack to 2-bit)
# OR: use Unsloth's pre-packed GGUF as base for repacking
huggingface-cli download mlx-community/Qwen3.5-397B-A17B-4bit \
    --local-dir ~/models/Qwen3.5-397B-4bit \
    --repo-type model

Disk note: This download requires ~209GB free space temporarily. After 2-bit repacking, you need ~120GB. Total disk usage during process: ~330GB peak. Our 1.8TB SSD has headroom.
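Before kicking off the 209GB download, it's worth a pre-flight check against the ~330GB peak noted above. A minimal sketch (the path and threshold are assumptions; point it at whatever volume will hold the models):

```python
import shutil

def has_headroom(path: str, required_gb: float) -> bool:
    """Return True if path's filesystem has at least required_gb free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb

# ~330GB peak while the 4-bit download and 2-bit repack coexist.
# "/" is a placeholder -- check the volume you'll actually use.
ok = has_headroom("/", 330)
print("enough headroom" if ok else "free up space before downloading")
```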

Step 3: Extract and Repack Weights

# Extract non-expert weights (creates model_weights.bin ~5.5GB)
python3 extract_weights.py ~/models/Qwen3.5-397B-4bit

# Pack expert weights into flash-moe format (~209GB at 4-bit)
python3 repack_experts.py ~/models/Qwen3.5-397B-4bit

# Optional but recommended: repack to 2-bit (120GB, +44% speed)
python3 repack_experts_2bit.py

The 2-bit repacking takes several hours. Run overnight.
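For the overnight run, the Mac mini shouldn't idle-sleep mid-repack. One way to guard against that is wrapping the step in macOS's built-in caffeinate; a sketch (log filename is an assumption):

```python
import subprocess

def overnight_cmd(script: str) -> list[str]:
    """Wrap a long-running step in `caffeinate -i` so the Mac doesn't idle-sleep."""
    return ["caffeinate", "-i", "python3", script]

# Capture output to a log so the morning check is just `tail repack_2bit.log`.
# Uncomment to actually launch the multi-hour repack:
# with open("repack_2bit.log", "w") as log:
#     subprocess.run(overnight_cmd("repack_experts_2bit.py"),
#                    stdout=log, stderr=subprocess.STDOUT, check=True)
```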

Step 4: Build the Engine

cd metal_infer
make
# Compiles infer.m (~5,000 lines) + shaders.metal (~1,100 lines)
# Build time: ~2-3 minutes on M4 Pro

Step 5: Run Inference

# Single prompt (4-bit mode)
./infer --prompt "You are a helpful assistant." --tokens 200

# Single prompt (2-bit mode — faster)
./infer --prompt "You are a helpful assistant." --tokens 200 --2bit

# Interactive chat
./chat --2bit

Step 6: Verify Output Quality

The project includes a results.tsv with 90+ experiments. Run a few baseline prompts from math, code, and reasoning to validate output quality before integrating.
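The validation pass can be scripted as a small smoke-test harness around ./infer. The flags mirror the invocations in Step 5; the prompts below are placeholders (swap in real rows from results.tsv), and the working directory is an assumption:

```python
import subprocess

def infer_cmd(prompt: str, tokens: int = 64, two_bit: bool = True) -> list[str]:
    """Build the ./infer argv used in Step 5 above."""
    cmd = ["./infer", "--prompt", prompt, "--tokens", str(tokens)]
    if two_bit:
        cmd.append("--2bit")
    return cmd

BASELINE_PROMPTS = [  # placeholders -- replace with prompts from results.tsv
    "What is 17 * 24?",                                  # math
    "Write a Python function that reverses a string.",   # code
    "If all bloops are razzies and all razzies are lazzies, are all bloops lazzies?",  # reasoning
]

def smoke_test(cwd: str = "flash-moe/metal_infer") -> None:
    for p in BASELINE_PROMPTS:
        out = subprocess.run(infer_cmd(p), capture_output=True, text=True, cwd=cwd)
        assert out.stdout.strip(), f"empty output for prompt: {p!r}"
```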


2. Expected Performance on Our M4 Pro 64GB

Reference Benchmark (M3 Max 48GB)

| Mode | tok/s | Notes |
|------|-------|-------|
| 2-bit experts, K=4 (warm cache) | 5.55 | Current best |
| 4-bit experts, K=4 (warm cache) | 4.80 | 209GB on disk |
| 4-bit experts, K=4 (cold SSD) | 2.83 | Steady-state cold |
| Peak single token | 7.05 | Warm cache, 2-bit |

M3 Max hardware: 400 GB/s memory bandwidth, 17.5 GB/s SSD sequential read, 48GB RAM

Our M4 Pro 64GB Projection

| Factor | M3 Max 48GB | Our M4 Pro 64GB | Impact |
|--------|-------------|-----------------|--------|
| Memory bandwidth | ~400 GB/s | ~273 GB/s | -32% |
| RAM available | 48GB | 64GB | +16GB expert cache |
| SSD sequential read | ~17.5 GB/s | ~14–16 GB/s | -5–10% |
| SSD NVMe controller | Apple Fabric | Apple Fabric | Comparable |
| GPU cores | 40 | 20 | Fewer Metal cores |

The bottleneck is SSD I/O for expert streaming, not GPU compute. Each token requires loading K=4 expert blocks per layer (~3.9MB each, ~15.6MB per layer) from SSD, taking ~1.49ms per token per layer on the M3 Max. That SSD read time, not compute, sets the throughput ceiling.
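A back-of-envelope check of those numbers, purely illustrative (the 14 GB/s M4 Pro read rate is the assumption from the table above; the measured ~1.49ms on the M3 Max includes overhead beyond raw streaming):

```python
# Per-layer expert read: K=4 blocks of ~3.9 MB each.
K, block_mb = 4, 3.9
per_layer_mb = K * block_mb  # ~15.6 MB

def per_layer_ms(ssd_gb_per_s: float) -> float:
    """Milliseconds to stream one layer's expert blocks at a given SSD rate."""
    return per_layer_mb / (ssd_gb_per_s * 1000) * 1000

m3_max = per_layer_ms(17.5)  # ~0.89 ms raw streaming time
m4_pro = per_layer_ms(14.0)  # ~1.11 ms at our assumed lower read rate
print(f"{per_layer_mb:.1f} MB/layer, M3 Max {m3_max:.2f} ms, M4 Pro {m4_pro:.2f} ms")
```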

Expected performance range on our hardware:
- 2-bit mode, warm page cache: 4.0–5.0 tok/s
- 2-bit mode, cold SSD: 2.2–2.8 tok/s
- 4-bit mode, warm: 3.5–4.2 tok/s
- Peak single token: ~5.5–6.0 tok/s

The extra 16GB RAM is meaningful: the M3 Max has ~35GB available for page cache after the 5.5GB model weights + Metal buffers. Our M4 Pro has ~51GB available for page cache. This means more expert blocks stay hot in memory, reducing cold-SSD reads — partially compensating for lower bandwidth.

Realistic sustained throughput: 3.5–4.5 tok/s for typical multi-turn conversation.

That's slow by local-model standards but extraordinary for a 397B model. It's usable for research, drafting, and offline reasoning tasks where waiting 10–30 seconds for a response is acceptable. Not suitable for real-time chat.


3. OpenClaw Integration

Current State of flash-moe

Flash-MoE is a CLI tool, not an API server. It has:
- ./infer — single-shot inference, outputs to stdout
- ./chat — interactive REPL
- No HTTP server
- No OpenAI-compatible API
- No streaming output
- No JSON mode

Integration Options

Option A: Lightweight FastAPI Wrapper

Build a lightweight FastAPI server that wraps the ./infer binary:

# flash_moe_server.py
from fastapi import FastAPI
from pydantic import BaseModel
import subprocess

app = FastAPI()

class ChatRequest(BaseModel):
    model: str
    messages: list
    max_tokens: int = 500

def format_prompt(messages: list) -> str:
    # Minimal chat flattening; replace with the model's real chat template.
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages) + "\nassistant:"

@app.post("/v1/chat/completions")
def chat(req: ChatRequest):
    # Build prompt from messages
    prompt = format_prompt(req.messages)

    # Call the flash-moe binary; sync endpoint, so FastAPI runs it in a threadpool
    result = subprocess.run(
        ["./infer", "--prompt", prompt, "--tokens", str(req.max_tokens), "--2bit"],
        capture_output=True, text=True, cwd="/path/to/flash-moe/metal_infer"
    )

    # Return OpenAI-compatible response
    return {
        "id": "chatcmpl-local",
        "object": "chat.completion",
        "model": "qwen3.5-397b-flash-moe",
        "choices": [{
            "message": {"role": "assistant", "content": result.stdout},
            "finish_reason": "stop"
        }]
    }

Run on port 11435 (adjacent to Ollama's 11434) to avoid conflicts.
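A quick client-side check against the wrapper, once it's up. The request shape mirrors the OpenAI chat schema the wrapper emulates; the actual send is commented out since it requires the server to be running on port 11435:

```python
import json
import urllib.request

payload = {
    "model": "qwen3.5-397b-flash-moe",
    "messages": [{"role": "user", "content": "Say hello in five words."}],
    "max_tokens": 50,
}
req = urllib.request.Request(
    "http://localhost:11435/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the wrapper server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```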

Option B: Use llama-server Instead

The Unsloth GGUF versions of Qwen3.5-397B can run through llama-server (part of llama.cpp) which already has a full OpenAI-compatible API. This is less performant than flash-moe but immediately OpenClaw-compatible. See Section 6 for comparison.

OpenClaw Wiring

Once the wrapper is running, configure OpenClaw to route to it:

{
  "model": "qwen3.5-397b-local",
  "endpoint": "http://localhost:11435/v1",
  "api_key": "local"
}

Streaming caveat: The current flash-moe binary does not stream tokens. A proper wrapper would need to either (a) modify the C source to emit tokens progressively, or (b) fake streaming with token-by-token chunking from post-processed output. Option (b) is easier; option (a) gives real UX.
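Option (b) can be sketched as a generator that slices the finished completion into OpenAI-style chat.completion.chunk SSE events. Whitespace splitting here is a crude stand-in for real tokenizer boundaries:

```python
import json
from typing import Iterator

def fake_stream(text: str, model: str = "qwen3.5-397b-flash-moe") -> Iterator[str]:
    """Yield SSE lines mimicking OpenAI streaming from an already-complete reply."""
    for word in text.split():  # stand-in for actual token boundaries
        chunk = {
            "object": "chat.completion.chunk",
            "model": model,
            "choices": [{"delta": {"content": word + " "}, "finish_reason": None}],
        }
        yield f"data: {json.dumps(chunk)}\n\n"
    yield "data: [DONE]\n\n"
```

In FastAPI this would be returned via StreamingResponse with media type text/event-stream.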


4. Disk Space Requirements

| Item | Size | Notes |
|------|------|-------|
| Full BF16 checkpoint (raw) | ~807 GB | Not needed for flash-moe |
| MLX 4-bit download (flash-moe source) | ~209 GB | Downloaded, then repacked |
| flash-moe 4-bit expert files | ~209 GB | packed_experts/ directory |
| flash-moe 2-bit expert files | ~120 GB | packed_experts_2bit/ directory |
| Non-expert weights (model_weights.bin) | ~5.5 GB | mmap'd at runtime |
| Vocab + manifest | ~50 MB | Trivial |
| Total (2-bit only, no 4-bit) | ~125 GB | After cleanup |
| Total (both quants on disk) | ~335 GB | Keep both for quality comparisons |
| Peak during setup | ~330 GB | While 4-bit download + 2-bit repack coexist |

Our 1.8TB SSD: Assuming ~500GB used by OS + current models + workspace, we have ~1.3TB free. Running both quants simultaneously (335GB) is comfortable. Download + repack peak (330GB) is fine.


5. Memory Usage Projections

Flash-MoE is specifically designed to not require the full model in RAM. This is its killer feature.

| Component | RAM Usage | Notes |
|-----------|-----------|-------|
| Non-expert weights | ~5.5 GB | mmap'd read-only, stays resident |
| Metal scratch buffers | ~200 MB | GPU compute workspace |
| Expert LRU cache (Metal) | 0–3.5 GB | Optional, configurable |
| OS + macOS processes | ~4–6 GB | Typical Mac mini background |
| Page cache (expert blocks) | up to remaining | OS fills this organically |
| Total flash-moe footprint | ~6–9 GB | By design |
| Available for page cache | ~51–58 GB | Our M4 Pro 64GB advantage |

Key insight: On the M3 Max 48GB, ~35GB is available for page cache. On our 64GB, ~51GB is available. Since expert blocks are 3.9MB each and there are 512 experts × 60 layers, the total unique expert space is ~120GB at 2-bit — well beyond what fits in RAM. But the 80/20 rule applies: a subset of experts is called frequently, and those will stay warm in our larger cache, meaningfully improving sustained performance over the M3 Max reference.
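The arithmetic behind that claim, made explicit (illustrative only; block count and size are the figures quoted above):

```python
experts, layers, block_mb = 512, 60, 3.9
total_gb = experts * layers * block_mb / 1000  # ~120 GB of unique 2-bit expert data

m3_cache_gb, m4_cache_gb = 35, 51
print(f"total expert space: {total_gb:.0f} GB")
print(f"M3 Max page cache covers {m3_cache_gb / total_gb:.0%}, "
      f"our M4 Pro covers {m4_cache_gb / total_gb:.0%}")
```

Neither machine can hold everything, but the M4 Pro keeps a noticeably larger slice of the hot experts resident.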

No OOM risk. Flash-moe's design explicitly avoids it. Expert data streams on demand; it never tries to load the full 120GB into RAM.


6. Flash-MoE vs Qwen3.5-35B via Ollama (What We Already Run)

| Dimension | flash-moe (397B-A17B) | Ollama (35B-A3B Q4) |
|-----------|------------------------|----------------------|
| Model | Qwen3.5-397B-A17B | Qwen3.5-35B-A3B |
| Parameter count | 397B total, 17B active | 35B total, 3B active |
| Intelligence tier | Gemini 3 Pro / Claude Opus 4.5 level | ~Qwen3.5 mid tier |
| Generation speed | ~3.5–5.5 tok/s | ~18–28 tok/s |
| Setup complexity | High (C build, weight repack) | Trivial (ollama pull) |
| API compatibility | None (CLI only, needs wrapper) | Full OpenAI-compatible |
| Memory footprint | ~6–9 GB RAM | ~20–25 GB RAM |
| Disk space | ~120–335 GB | ~22 GB |
| Streaming | No (requires patch/wrapper) | Yes (native) |
| Context window | Limited by implementation | 32K–128K |
| Production readiness | Research/PoC | Production |
| Thinking mode | Available in base model | Available via Ollama |
| Cost | One-time setup, ~24h | Already running |

Verdict

Ollama + Qwen3.5-35B wins on every operational metric for day-to-day use: speed, compatibility, reliability. Flash-MoE wins on one thing: model intelligence at equivalent compute cost. A 397B model with 17B active parameters is genuinely more capable than a 35B model with 3B active parameters — particularly on complex reasoning, code, and long-context tasks.

The use case for flash-moe isn't "replace Ollama." It's:
- Batch reasoning tasks that can tolerate low speed (overnight research, deep analysis)
- Tasks where model quality matters more than latency
- Demonstrating 397B capability from a consumer device (product/marketing angle)

Recommendation: Keep Ollama running Qwen3.5-35B as the primary fast model. Add flash-moe as a secondary "heavy reasoning" endpoint for high-stakes tasks.


7. Product Angle: "Local 397B on Consumer Hardware"

This is where it gets interesting.

The Market Signal

The fact that @danveloper built this in 24 hours and it spread virally signals genuine market appetite. People want:
1. Frontier-class AI locally (privacy, cost, latency)
2. Proof that it's possible on consumer hardware
3. Something they can actually set up and use

Flash-MoE answered #1 and #2. Nobody has answered #3 properly yet.

Product Concepts

Concept A: Flash-MoE Stack Installer (Low Effort, Fast to Ship)

A one-command installer that:
- Handles the dependency chain (Xcode tools, Python, HF CLI)
- Manages the model download (resumable, progress UI)
- Runs the weight repacking (scheduled overnight)
- Installs the OpenAI-compatible wrapper server
- Configures auto-start as a macOS Launch Agent

Target: Technical enthusiasts who want to run flash-moe but don't want to debug C compilation and weight extraction manually.
Distribution: GitHub + Homebrew tap
Pricing: Free (lead gen) or $49 one-time
Time to build: 1–2 weeks for Melody

Concept B: OpenClaw Local Inference Bundle (Medium Effort, Higher Value)

Bundle the flash-moe stack with OpenClaw configuration:
- OpenClaw pre-configured to route heavy tasks to local 397B
- Routing logic: fast tasks → Ollama 35B, reasoning tasks → flash-moe 397B
- Dashboard showing live model usage, cost savings vs cloud
- Mac mini "AI appliance" positioning

Target: Privacy-conscious small businesses, solo operators, developers
Pricing: $299–499/year subscription or $149 setup fee + OpenClaw license
Positioning: "Your own private frontier AI, no monthly API bills"
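The routing logic in this bundle could start as a simple heuristic. A first-pass sketch; the hint keywords, length threshold, and the flash-moe port are all assumptions:

```python
OLLAMA = "http://localhost:11434/v1"     # fast path: Qwen3.5-35B
FLASH_MOE = "http://localhost:11435/v1"  # heavy path: 397B wrapper (assumed port)

HEAVY_HINTS = ("prove", "analyze", "architecture", "plan", "review this")

def pick_endpoint(prompt: str) -> str:
    """Crude first-pass router: long or reasoning-flavored prompts go to the 397B."""
    heavy = len(prompt) > 2000 or any(h in prompt.lower() for h in HEAVY_HINTS)
    return FLASH_MOE if heavy else OLLAMA
```

A production version would classify on conversation context rather than keywords, but this captures the fast/heavy split.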

Concept C: Managed "AI Appliance" Service (High Effort, High Margin)

Sell and configure Mac mini AI nodes for businesses:
- Hardware procurement (Mac mini M4 Pro 64GB, ~$2,500 hardware)
- Setup fee: $1,500–3,000 (flash-moe + OpenClaw configured)
- Monthly support/maintenance: $500–1,000/month
- VPN access so clients can use the node remotely

Target: Small law firms, medical practices, financial advisors, research teams — any high-privacy vertical that can't send data to OpenAI
Revenue model: Hardware margin (~15%) + setup fee + recurring support
Annual value per client: $7,500–14,000

Key constraints:
- Requires someone to handle the physical hardware + setup
- Support burden is real (OS updates, model updates)
- Not truly scalable without tooling
- Jeff's bandwidth is limited; needs systems to automate the ops

Concept D: "Local Frontier" API Reselling (Speculative)

Run multiple Mac minis as a local inference cluster, sell API access at rates below cloud:
- OpenAI GPT-4o: ~$5/M tokens
- Claude Sonnet 4.6: ~$3/M tokens
- Local 397B via flash-moe: ~$0.50/M tokens (after hardware amortization)

The problem: Throughput is 3.5–5.5 tok/s per machine. At 100 concurrent users, you need ~20 machines just to keep latency reasonable. Hardware cost: ~$50,000. Not viable at early stage.
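The sizing math behind that, made explicit. The per-user token demand is an assumption; the per-machine throughput is the best-case sustained figure from Section 2:

```python
import math

per_machine_tok_s = 5.5  # best-case sustained throughput per Mac mini
per_user_tok_s = 1.0     # assumed average demand per concurrent user
users = 100

machines = math.ceil(users * per_user_tok_s / per_machine_tok_s)
cost = machines * 2500   # Mac mini M4 Pro 64GB, ~$2,500 each
print(f"{machines} machines, ~${cost:,} in hardware")
```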

Realistic Near-Term Recommendation

Start with Concept A + B: Build the installer (low effort) to establish credibility in the local AI space, then evolve it toward a bundled OpenClaw offering. This connects directly to the "Managed OpenClaw Service" idea already in BACKLOG.md.

The narrative that sells: "Anthropic, Google, and OpenAI charge $50–200/month for frontier AI access. For $2,500 in hardware and a weekend of setup, you can run an equivalent model locally forever, with zero data leaving your machine."

That's a compelling pitch — especially post-GDPR, post-AI-privacy-scare. The market is real. The technology works. The gap is the polish layer.


8. Open Questions / Risks

| Risk | Severity | Mitigation |
|------|----------|------------|
| Flash-moe isn't maintained | Medium | Fork it. It's MIT licensed. The C code is self-contained. |
| 3.5–5.5 tok/s is too slow for users | Medium | Frame it as batch/overnight mode, not real-time chat |
| Weight download takes hours | Low | Build resumable downloader, run overnight |
| Community hasn't packaged this yet | Low | First-mover advantage if we build the installer |
| Wrapper complexity | Low | Python FastAPI wrapper is a few hundred lines |
| Competing with Ollama (which adds Qwen3.5-397B via llama.cpp) | High | Flash-moe's edge is 2-bit speed optimization; monitor llama.cpp progress |

Watch: llama.cpp is actively adding MoE expert streaming. If they achieve similar SSD offloading performance natively, it would subsume flash-moe's advantage while being API-compatible out of the box. Flash-moe's 24-hour head start may be a 3–6 month window.


9. Recommended Next Steps

| Priority | Action | Effort | Owner |
|----------|--------|--------|-------|
| 1 | Build flash-moe on Mac mini, validate it compiles and runs | 2h | Jeff/Jules |
| 2 | Download 4-bit weights overnight, run weight extraction | 8–12h | Automated |
| 3 | Run 2-bit repack overnight | 4–6h | Automated |
| 4 | Benchmark actual tok/s on our hardware | 30m | Jules |
| 5 | Build OpenAI-compatible wrapper (Python FastAPI) | 1–2 days | Melody |
| 6 | Wire wrapper to OpenClaw as secondary model endpoint | 2h | Jules |
| 7 | Evaluate product packaging concept (A vs B) | Decision call | Jeff |

Sources

  • flash-moe GitHub: https://github.com/danveloper/flash-moe
  • flash-moe paper: https://github.com/danveloper/flash-moe/blob/main/paper/flash_moe.pdf
  • Unsloth Qwen3.5 guide: https://unsloth.ai/docs/models/qwen3.5
  • Unsloth GGUF benchmarks: https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks
  • M4 Pro bandwidth: ~273 GB/s (confirmed multiple sources)
  • M3 Max bandwidth: ~400 GB/s (danveloper hardware)
  • llama.cpp Apple Silicon discussion: https://github.com/ggml-org/llama.cpp/discussions/4167

No community benchmarks for flash-moe specifically on M4 Pro hardware found as of 2026-03-19. Flash-moe remains early-stage; no user-friendly packaging beyond the raw repo exists yet.