
Strategic Brief: APA Production Readiness Shakedown

Author: Jules (COA)
Date: 2026-03-20
Purpose: Determine whether our agent pipeline plus local model can execute APA development sprints with acceptable quality, speed, and cost.

What We're Actually Testing

This is NOT just a model benchmark. We're testing three things simultaneously:

1. Model Capability (qwen3-coder:30b)

  • Can it produce sprint-quality code for APA-style work?
  • Where does it hit its ceiling?
  • What percentage of tasks need cloud escalation?

2. Pipeline Integrity (Forge → Melody → Quinn → Feedback)

  • Does Forge's spec produce unambiguous work for Melody?
  • Can Quinn catch real issues on local model?
  • Does the feedback loop actually improve subsequent specs?
  • How much orchestration overhead do I (Jules) need to add?

3. Process Readiness

  • Task decomposition: are our atomic units the right size?
  • Estimation: do complexity ratings (⭐-⭐⭐⭐⭐⭐) predict actual effort?
  • Error handling: when something goes wrong, do we catch it or does it cascade?

Test Build: APA Core API Prototype

Build a functional (not production-grade) APA API that exercises the patterns we'll use in real sprints. This should be a real, runnable codebase, not toy code.

Layer 1: Foundation (tests data modeling + basic CRUD)

  • Prisma schema with core APA entities (Athlete, Team, Session, MetricReading)
  • TypeScript types that mirror schema + API-specific types
  • Express app scaffold with health check, error middleware, request logging
  • Basic CRUD for Athletes and Teams with validation (zod)

Layer 2: Business Logic (tests the hard stuff)

  • GAP Score calculation service — this is our core IP
  • Weighted composite from HRV, sleep, training load, mood, resting HR
  • Normalization, trend detection, missing data handling
  • This is where model quality matters most
  • Session management with metric ingestion
  • Athlete timeline/history aggregation
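To make the GAP Score task concrete, here is a minimal sketch of the shape of the computation. The weights, metric names, and 0-100 output scale are placeholder assumptions, not the real formula (the actual composite is the IP under test); the point is the missing-data handling, where absent metrics are dropped and the remaining weights re-scaled.

```typescript
// Placeholder metric set and weights; the real composite comes from
// Forge's spec, not this sketch.
type MetricKey = "hrv" | "sleep" | "trainingLoad" | "mood" | "restingHr";

const WEIGHTS: Record<MetricKey, number> = {
  hrv: 0.3,
  sleep: 0.25,
  trainingLoad: 0.2,
  mood: 0.15,
  restingHr: 0.1,
};

// Inputs are assumed pre-normalized to [0, 1]. Missing metrics are
// skipped and the remaining weights re-scaled, so a partial reading
// still yields a score on the same 0-100 scale.
export function gapScore(
  readings: Partial<Record<MetricKey, number>>,
): number | null {
  let weighted = 0;
  let totalWeight = 0;
  for (const key of Object.keys(WEIGHTS) as MetricKey[]) {
    const value = readings[key];
    if (value === undefined) continue;
    weighted += value * WEIGHTS[key];
    totalWeight += WEIGHTS[key];
  }
  if (totalWeight === 0) return null; // no usable data at all
  return Math.round((weighted / totalWeight) * 100);
}
```

Re-scaling by `totalWeight` is one of several defensible missing-data policies; whether the spec wants re-scaling, imputation, or a confidence penalty is exactly the kind of decision the test should force.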

Layer 3: Access Control (tests security judgment)

  • JWT auth with refresh tokens
  • Three roles: admin, coach, athlete
  • Row-level security: coach sees only their team, athlete sees only self
  • This tests whether local model makes security mistakes
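The row-level rule is small enough to sketch as a pure function, which is also the form that's easiest for Quinn to test. Role names come from the brief; the `userId`/`teamId` fields are assumptions about how the auth context will be modeled.

```typescript
type Role = "admin" | "coach" | "athlete";

interface AuthContext {
  userId: string;
  role: Role;
  teamId?: string; // set for coaches
}

interface AthleteRow {
  id: string;
  teamId: string;
}

// Admin sees everything; a coach sees only rows in their own team;
// an athlete sees only their own row. Defaulting to deny (e.g. a
// coach with no teamId) is the security posture being tested.
export function canReadAthlete(ctx: AuthContext, row: AthleteRow): boolean {
  switch (ctx.role) {
    case "admin":
      return true;
    case "coach":
      return ctx.teamId !== undefined && ctx.teamId === row.teamId;
    case "athlete":
      return ctx.userId === row.id;
  }
}
```

In the real codebase this predicate would back an Express middleware or a Prisma query filter; keeping it pure makes the deny-by-default behavior unit-testable.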

Layer 4: Integration Patterns (tests real-world messiness)

  • Mock Garmin Connect webhook receiver
  • Data transformation pipeline (external format → internal models)
  • Rate limiting and retry logic
  • Error recovery when external data is malformed
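For the retry logic, something like this generic backoff wrapper is the expected shape around calls to the mock Garmin receiver. Attempt counts and delays are illustrative defaults, not requirements.

```typescript
// Retry a flaky async operation with exponential backoff.
// attempts and baseDelayMs are illustrative defaults.
export async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 200,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        // Exponential backoff: 200ms, 400ms, 800ms, ...
        await new Promise((resolve) =>
          setTimeout(resolve, baseDelayMs * 2 ** i),
        );
      }
    }
  }
  throw lastError;
}
```

A production version would also distinguish retryable errors (timeouts, 5xx) from permanent ones (malformed payloads), which ties into the error-recovery bullet above; whether the model makes that distinction unprompted is worth scoring.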

Layer 5: Adversarial (tests debugging + refactoring + ambiguity)

  • Plant bugs in Layer 1-2 code, ask model to find and fix them
  • Give messy/duplicated code, ask for refactor without breaking tests
  • Give vague product requirement, evaluate how model handles ambiguity
  • Multi-file feature addition that requires coordinated changes across all layers

Evaluation Framework

Per-task scoring (Quinn evaluates):

  • Correctness (does it work?)
  • Completeness (all requirements met?)
  • Code quality (clean, typed, proper patterns?)
  • Edge cases (handled without being told?)
  • Production proximity (how much rework to ship?)

Pipeline scoring (Jules evaluates):

  • Spec clarity: did Melody need clarification? (0 = perfect, each question = -1)
  • Orchestration overhead: how many interventions did Jules need?
  • Feedback effectiveness: did spec quality improve across layers?
  • Estimation accuracy: predicted vs actual time per task
  • Cloud escalation rate: what % of tasks needed cloud model?

Go/No-Go Criteria:

  • GO: ≥70% of tasks pass at acceptable quality on local model, pipeline runs with <20% orchestration intervention, feedback loop shows measurable improvement across layers
  • CONDITIONAL GO: 50-70% local pass rate — we can proceed but need to define a cloud escalation policy per task type
  • NO GO: <50% local pass rate — local model can't carry sprint work, need different approach
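The thresholds above can be encoded directly. One judgment call is baked into this sketch: the criteria don't say where a high pass rate with excessive intervention (or no feedback improvement) lands, so it falls back to CONDITIONAL GO rather than GO here.

```typescript
type Verdict = "GO" | "CONDITIONAL GO" | "NO GO";

// localPassRate and interventionRate are fractions in [0, 1].
// Full GO requires all three conditions; a pass rate >= 50% that
// misses any of them drops to CONDITIONAL GO (an assumption, since
// the brief only gates GO on intervention and feedback).
export function decide(
  localPassRate: number,
  interventionRate: number,
  feedbackImproved: boolean,
): Verdict {
  if (localPassRate >= 0.7 && interventionRate < 0.2 && feedbackImproved) {
    return "GO";
  }
  if (localPassRate >= 0.5) {
    return "CONDITIONAL GO";
  }
  return "NO GO";
}
```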

Constraints for Forge

  • All Layer 1-2 tasks should be tagged [local-ok] — if the 30B model can't handle these, it can't handle anything
  • Layer 3-4 tasks: Forge decides [local-ok] vs [cloud-required] based on RESOURCE_CONSTRAINTS.md
  • Layer 5 adversarial tasks: expect mixed results — this is where we find the ceiling
  • Total task count should be 15-25 atomic tasks
  • Each task should be completable in a single agent session (5-20 min)
  • Use the test-runs/local-model-benchmark/ directory for all output

What Forge Needs to Deliver

A complete spec following his CONVENTIONS.md template:

1. Full task list with complexity ratings, model tier tags, and estimated time
2. Acceptance criteria for every task (Quinn's test script)
3. Intentional bug specifications for Layer 5 (what bugs to plant, where)
4. Dependency graph showing which tasks can run in parallel vs sequentially
5. The refactor scenario setup (what messy code to create, what clean looks like)
6. The ambiguous requirement for the ambiguity test
7. Scoring rubric mapped to go/no-go thresholds