# Root Cause Analysis: qwen3-coder:30b Local Model Failures
**Date:** 2026-03-20 · **Author:** Jules (COA) · **Context:** 6/6 tasks failed on qwen3-coder:30b during the APA Production Readiness Shakedown
## Failure Pattern
Every task exhibited the identical failure mode:

- Model spends the entire 5-minute timeout "checking repo state" or "reading files"
- Zero files written across 6 attempts
- No code generation output produced
## Root Cause: Context Overload
The model itself is not broken. When tested directly via `ollama run` with small prompts:
| Test | Input Tokens | Time | Output |
|---|---|---|---|
| One-line function | ~33 | 2s | ✅ Correct |
| Zod schema from types | ~200 | 2s | ✅ Correct |
| Full Express router (CRUD) | ~50 | 22s | ✅ 226 lines, correct |
| Express router (verbose) | ~33 | 19s | ✅ 1,258 tokens at 68.7 tok/s |
The model generates fine with small context. The problem is what OpenClaw loads BEFORE the task prompt:
### Context Budget Breakdown (estimated per subagent spawn)
| Component | Size | Tokens (est.) |
|---|---|---|
| OpenClaw system prompt (framework) | ~20-40KB | 5,000-10,000 |
| Jules workspace context (AGENTS.md, SOUL.md, USER.md, MEMORY.md, HEARTBEAT.md, TOOLS.md, IDENTITY.md) | 39,363 bytes | ~9,800 |
| Task prompt | ~3,000-4,000 bytes | ~800-1,000 |
| Total input before generation starts | — | ~15,000-20,000 |
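The byte-to-token estimates in the table follow the common heuristic of roughly 4 bytes per token for English prose and Markdown. A minimal sketch (the 4-bytes/token ratio is an approximation, not a measured value):

```python
def estimate_tokens(num_bytes: int, bytes_per_token: float = 4.0) -> int:
    """Rough token estimate: ~4 bytes/token for English text and Markdown."""
    return round(num_bytes / bytes_per_token)

# Jules's workspace context files total 39,363 bytes:
workspace_tokens = estimate_tokens(39_363)
print(workspace_tokens)  # ≈9,800, matching the table's estimate
```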
### The Math That Kills It
- qwen3-coder:30b prompt eval rate: 42 tok/s (measured)
- 15,000 tokens at 42 tok/s = ~357 seconds just to process the prompt
- That's 6 minutes of prompt evaluation alone — EXCEEDING the 5-minute timeout
- The model literally never gets to the generation phase
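The arithmetic above, spelled out as a sketch (both rates and the timeout are the figures quoted in this document):

```python
PROMPT_EVAL_RATE = 42.0   # tok/s, measured for qwen3-coder:30b
TIMEOUT_S = 5 * 60        # the 5-minute subagent timeout

prompt_tokens = 15_000    # low end of the estimated context load
eval_seconds = prompt_tokens / PROMPT_EVAL_RATE

print(round(eval_seconds))       # ~357 seconds of prompt eval alone
print(eval_seconds > TIMEOUT_S)  # True: the timeout fires before generation starts
```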
This is a MoE architecture penalty: qwen3-coder:30b has 30B total params but only 3B active per token. The 3B active parameter count means prompt processing is slow compared to a dense 30B model. The model is essentially a 3B model when it comes to throughput.
## Why Direct `ollama run` Works

When running `ollama run` directly:
- No OpenClaw system prompt (~0 tokens)
- No workspace context files (~0 tokens)
- Just the task prompt (~50-200 tokens)
- Prompt eval: <1 second
- All time budget goes to generation
## Contributing Factors
1. **Jules's workspace is bloated for local models.** MEMORY.md alone is 18KB (~4,500 tokens). AGENTS.md is 11KB (~2,800 tokens). This context is designed for Opus (200K window, fast prompt eval), not a 3B-active MoE model.

2. **Subagents inherit the parent workspace.** When spawning a subagent, OpenClaw loads the spawning agent's workspace files as context. Jules's workspace is the heaviest in the fleet.

3. **No context-aware model routing.** The dispatch protocol didn't account for context load when routing to local models. A task tagged `[local-ok]` based on task complexity was actually `[cloud-required]` based on context size.

4. **Timeout was set for generation time, not prompt eval + generation.** The 5-minute timeout assumed the model would start generating within seconds. For this model with this context, prompt eval alone exceeds the timeout.
## Solutions

### Immediate (can implement now)
**S1: Minimal workspace for local model agents.** Create a stripped-down workspace for Melody when running on local models. Remove MEMORY.md, USER.md, SOUL.md; include only CONVENTIONS.md and task-specific files. Target: <2,000 tokens of workspace context.
**S2: Increase timeout for local models.** When using local models, set the timeout to 15-20 minutes instead of 5 to account for slow prompt eval. Downside: a slower feedback loop, and longer to detect actual failures.
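Rather than a flat 15-20 minutes, the S2 timeout could be derived from the measured rates, budgeting prompt eval and generation separately plus headroom. A sketch; the default output size and margin are assumptions, the two rates are the measured figures from this document:

```python
def local_timeout_s(prompt_tokens: int,
                    expected_output_tokens: int = 1_500,
                    prompt_eval_rate: float = 42.0,  # tok/s, measured
                    gen_rate: float = 68.7,          # tok/s, measured
                    margin: float = 1.5) -> float:
    """Timeout that budgets for prompt eval AND generation, with headroom."""
    return margin * (prompt_tokens / prompt_eval_rate
                     + expected_output_tokens / gen_rate)

# A 15K-token context needs roughly 9.5 minutes, not 5:
print(local_timeout_s(15_000) / 60)
```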
**S3: Context-aware routing.** Before dispatching to a local model, estimate the total context size (system prompt + workspace + task). If it exceeds 8K tokens, route directly to cloud. Add this check to the Agent Dispatch Protocol.
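A sketch of the S3 check, reusing the ~4 bytes/token heuristic. The 8K threshold comes from S3 itself; the function names and example byte figures are illustrative:

```python
CLOUD_THRESHOLD_TOKENS = 8_000  # from S3: above this, skip local entirely

def estimate_tokens(num_bytes: int) -> int:
    return num_bytes // 4  # rough heuristic for English text / Markdown

def route(system_prompt_bytes: int, workspace_bytes: int, task_bytes: int) -> str:
    """Route to cloud when the total pre-generation context is too large."""
    total = estimate_tokens(system_prompt_bytes + workspace_bytes + task_bytes)
    return "cloud" if total > CLOUD_THRESHOLD_TOKENS else "local"

# Jules's current load: ~30KB system prompt + 39,363B workspace + ~3.5KB task
print(route(30_000, 39_363, 3_500))  # "cloud"
```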
### Medium-term (architectural changes)
**S4: Agent-specific workspaces for local execution.** Melody already has her own workspace at `~/.openclaw/workspace-melody/`, which is much leaner. Route local model tasks through Melody's agent identity instead of spawning subagents from Jules. This gives Melody her own context (CONVENTIONS.md only, ~3KB) instead of inheriting Jules's 39KB workspace.
**S5: Use `sessions_spawn` with explicit `cwd` and minimal context.** Investigate whether OpenClaw supports spawning with a custom workspace that strips unnecessary context files. The subagent doesn't need Jules's MEMORY.md to write a Prisma schema.
**S6: Dense model instead of MoE.** qwen3-coder:30b is MoE (3B active). A dense 7-14B model would have faster prompt eval because all parameters are always active. Trade-off: a smaller model is less capable, but throughput is faster. Options:
- qwen2.5-coder:14b — dense, fast prompt eval, may handle the context load
- codestral:22b — dense, 13GB, fast
- devstral:24b — dense, 15GB, Mistral's coding agent model
### Long-term (strategic)
**S7: flash-moe (@danveloper).** The flash-moe project enables running much larger MoE models (Qwen3.5-397B) via SSD expert streaming on Apple Silicon. If this matures, it could give us frontier-class local inference at acceptable speed. But it's experimental — not a solution today.
**S8: Dedicated inference server.** If local model usage is strategic, consider a dedicated Mac Studio with M4 Ultra (192GB unified memory) that runs models without competing with OpenClaw, Mission Control, and other services. Cost: ~$5-7K. Only justified if local inference becomes a core competitive advantage (e.g., fine-tuned APA models).
## Recommendation
Short-term (this week):
1. Implement S3 (context-aware routing) in the Agent Dispatch Protocol — if total context >8K tokens, skip local
2. Test S6 — pull devstral:24b or qwen2.5-coder:14b (dense models) and re-run T01-T03 as a quick validation
3. Implement S4 — route local tasks through Melody's agent identity instead of Jules subagents
Medium-term (next 2 weeks):

4. Implement S1 — create a minimal "local execution" workspace template
5. Benchmark dense 14-24B models against the same shakedown tasks
Decision point: If dense 14-24B models can handle Tasks T01-T05 within reasonable timeouts, we have a viable local-first strategy. If not, local models are limited to non-agentic use cases (embeddings, triage, heartbeats) and all coding goes to cloud with smart tier routing (Haiku for scaffold, Sonnet for impl, Opus for architecture).
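The "smart tier routing" fallback described above could start as a simple lookup; the tier names are illustrative, while the model tiers are the ones named in the text:

```python
# Cloud fallback tiers per the decision point above; tier names are illustrative.
TIER_MODEL = {
    "scaffold": "haiku",         # cheap, fast scaffolding
    "implementation": "sonnet",  # day-to-day implementation work
    "architecture": "opus",      # high-stakes architecture decisions
}

def cloud_model_for(tier: str) -> str:
    """Pick a cloud model tier; default to the middle tier when unsure."""
    return TIER_MODEL.get(tier, "sonnet")
```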
## Key Insight
The problem was never model quality — it was model throughput under real-world context load. qwen3-coder:30b generates correct code at 68 tok/s when given small prompts. But our agentic workflow loads 15-20K tokens of context before the model even starts, and a 3B-active MoE model can't process that within our timeout windows.
The fix isn't "get a better model." The fix is one of:

1. Reduce context (lean workspaces, context-aware routing)
2. Use dense models (faster prompt eval at the cost of reduced capability)
3. Accept cloud for coding (local stays for embeddings/triage)
"The blueprint was fine. The loading dock was too narrow for the truck."