
Strategic Brief: APA Production Readiness Shakedown

Author: Jules (COA)
Date: 2026-03-20
Purpose: Determine whether our agent pipeline plus local model can execute APA development sprints with acceptable quality, speed, and cost.

What We're Actually Testing

This is NOT just a model benchmark. We're testing three things simultaneously:

1. Model Capability (qwen3-coder:30b)

  • Can it produce sprint-quality code for APA-style work?
  • Where does it hit its ceiling?
  • What percentage of tasks need cloud escalation?

2. Pipeline Integrity (Forge → Melody → Quinn → Feedback)

  • Does Forge's spec produce unambiguous work for Melody?
  • Can Quinn catch real issues on local model?
  • Does the feedback loop actually improve subsequent specs?
  • How much orchestration overhead do I (Jules) need to add?

3. Process Readiness

  • Task decomposition: are our atomic units the right size?
  • Estimation: do complexity ratings (⭐-⭐⭐⭐⭐⭐) predict actual effort?
  • Error handling: when something goes wrong, do we catch it or does it cascade?

Test Build: APA Core API Prototype

Build a functional (not production-grade) APA API that exercises the patterns we'll use in real sprints. This should be a real, runnable codebase, not toy code.

Layer 1: Foundation (tests data modeling + basic CRUD)

  • Prisma schema with core APA entities (Athlete, Team, Session, MetricReading)
  • TypeScript types that mirror schema + API-specific types
  • Express app scaffold with health check, error middleware, request logging
  • Basic CRUD for Athletes and Teams with validation (zod)

Layer 2: Business Logic (tests the hard stuff)

  • GAP Score calculation service — this is our core IP
  • Weighted composite from HRV, sleep, training load, mood, resting HR
  • Normalization, trend detection, missing data handling
  • This is where model quality matters most
  • Session management with metric ingestion
  • Athlete timeline/history aggregation
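To make the GAP Score task concrete, here is a minimal sketch of the shape of the computation. The weights, metric names, and 0-100 output scale are placeholder assumptions, not the real formula (the actual composite is the IP under test); the point is the missing-data handling, where absent metrics are dropped and the remaining weights re-scaled.

```typescript
// Placeholder metric set and weights; the real composite comes from
// Forge's spec, not this sketch.
type MetricKey = "hrv" | "sleep" | "trainingLoad" | "mood" | "restingHr";

const WEIGHTS: Record<MetricKey, number> = {
  hrv: 0.3,
  sleep: 0.25,
  trainingLoad: 0.2,
  mood: 0.15,
  restingHr: 0.1,
};

// Inputs are assumed pre-normalized to [0, 1]. Missing metrics are
// skipped and the remaining weights re-scaled, so a partial reading
// still yields a score on the same 0-100 scale.
export function gapScore(
  readings: Partial<Record<MetricKey, number>>,
): number | null {
  let weighted = 0;
  let totalWeight = 0;
  for (const key of Object.keys(WEIGHTS) as MetricKey[]) {
    const value = readings[key];
    if (value === undefined) continue;
    weighted += value * WEIGHTS[key];
    totalWeight += WEIGHTS[key];
  }
  if (totalWeight === 0) return null; // no usable data at all
  return Math.round((weighted / totalWeight) * 100);
}
```

Re-scaling by `totalWeight` is one of several defensible missing-data policies; whether the spec wants re-scaling, imputation, or a confidence penalty is exactly the kind of decision the test should force.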

Layer 3: Access Control (tests security judgment)

  • JWT auth with refresh tokens
  • Three roles: admin, coach, athlete
  • Row-level security: coach sees only their team, athlete sees only self
  • This tests whether local model makes security mistakes
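The row-level rule is small enough to sketch as a pure function, which is also the form that's easiest for Quinn to test. Role names come from the brief; the `userId`/`teamId` fields are assumptions about how the auth context will be modeled.

```typescript
type Role = "admin" | "coach" | "athlete";

interface AuthContext {
  userId: string;
  role: Role;
  teamId?: string; // set for coaches
}

interface AthleteRow {
  id: string;
  teamId: string;
}

// Admin sees everything; a coach sees only rows in their own team;
// an athlete sees only their own row. Defaulting to deny (e.g. a
// coach with no teamId) is the security posture being tested.
export function canReadAthlete(ctx: AuthContext, row: AthleteRow): boolean {
  switch (ctx.role) {
    case "admin":
      return true;
    case "coach":
      return ctx.teamId !== undefined && ctx.teamId === row.teamId;
    case "athlete":
      return ctx.userId === row.id;
  }
}
```

In the real codebase this predicate would back an Express middleware or a Prisma query filter; keeping it pure makes the deny-by-default behavior unit-testable.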

Layer 4: Integration Patterns (tests real-world messiness)

  • Mock Garmin Connect webhook receiver
  • Data transformation pipeline (external format → internal models)
  • Rate limiting and retry logic
  • Error recovery when external data is malformed
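For the retry logic, something like this generic backoff wrapper is the expected shape around calls to the mock Garmin receiver. Attempt counts and delays are illustrative defaults, not requirements.

```typescript
// Retry a flaky async operation with exponential backoff.
// attempts and baseDelayMs are illustrative defaults.
export async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 200,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        // Exponential backoff: 200ms, 400ms, 800ms, ...
        await new Promise((resolve) =>
          setTimeout(resolve, baseDelayMs * 2 ** i),
        );
      }
    }
  }
  throw lastError;
}
```

A production version would also distinguish retryable errors (timeouts, 5xx) from permanent ones (malformed payloads), which ties into the error-recovery bullet above; whether the model makes that distinction unprompted is worth scoring.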

Layer 5: Adversarial (tests debugging + refactoring + ambiguity)

  • Plant bugs in Layer 1-2 code, ask model to find and fix them
  • Give messy/duplicated code, ask for refactor without breaking tests
  • Give vague product requirement, evaluate how model handles ambiguity
  • Multi-file feature addition that requires coordinated changes across all layers

Evaluation Framework

Per-task scoring (Quinn evaluates):

  • Correctness (does it work?)
  • Completeness (all requirements met?)
  • Code quality (clean, typed, proper patterns?)
  • Edge cases (handled without being told?)
  • Production proximity (how much rework to ship?)

Pipeline scoring (Jules evaluates):

  • Spec clarity: did Melody need clarification? (0 = perfect, each question = -1)
  • Orchestration overhead: how many interventions did Jules need?
  • Feedback effectiveness: did spec quality improve across layers?
  • Estimation accuracy: predicted vs actual time per task
  • Cloud escalation rate: what % of tasks needed cloud model?

Go/No-Go Criteria:

  • GO: ≥70% of tasks pass at acceptable quality on local model, pipeline runs with <20% orchestration intervention, feedback loop shows measurable improvement across layers
  • CONDITIONAL GO: 50-70% local pass rate — we can proceed but need to define a cloud escalation policy per task type
  • NO GO: <50% local pass rate — local model can't carry sprint work, need different approach
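The thresholds above can be encoded directly. One judgment call is baked into this sketch: the criteria don't say where a high pass rate with excessive intervention (or no feedback improvement) lands, so it falls back to CONDITIONAL GO rather than GO here.

```typescript
type Verdict = "GO" | "CONDITIONAL GO" | "NO GO";

// localPassRate and interventionRate are fractions in [0, 1].
// Full GO requires all three conditions; a pass rate >= 50% that
// misses any of them drops to CONDITIONAL GO (an assumption, since
// the brief only gates GO on intervention and feedback).
export function decide(
  localPassRate: number,
  interventionRate: number,
  feedbackImproved: boolean,
): Verdict {
  if (localPassRate >= 0.7 && interventionRate < 0.2 && feedbackImproved) {
    return "GO";
  }
  if (localPassRate >= 0.5) {
    return "CONDITIONAL GO";
  }
  return "NO GO";
}
```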

Constraints for Forge

  • All Layer 1-2 tasks should be tagged [local-ok] — if the 30B model can't handle these, it can't handle anything
  • Layer 3-4 tasks: Forge decides [local-ok] vs [cloud-required] based on RESOURCE_CONSTRAINTS.md
  • Layer 5 adversarial tasks: expect mixed results — this is where we find the ceiling
  • Total task count should be 15-25 atomic tasks
  • Each task should be completable in a single agent session (5-20 min)
  • Use the test-runs/local-model-benchmark/ directory for all output

What Forge Needs to Deliver

A complete spec following his CONVENTIONS.md template:

1. Full task list with complexity ratings, model tier tags, and estimated time
2. Acceptance criteria for every task (Quinn's test script)
3. Intentional bug specifications for Layer 5 (what bugs to plant, where)
4. Dependency graph showing which tasks can run in parallel vs sequentially
5. The refactor scenario setup (what messy code to create, what clean looks like)
6. The ambiguous requirement for the ambiguity test
7. Scoring rubric mapped to go/no-go thresholds