Local Model Benchmark: qwen3-coder:30b
APA-Style API Development Test
Purpose: Evaluate qwen3-coder:30b as a primary coding model for APA dev sprints.
Date: 2026-03-20
Model: qwen3-coder:30b (MoE, 30B params, 3B active)
Test Tasks (9 tasks, 5 difficulty levels)
Task 1: Data Models + Schema (Difficulty: ⭐)
Create TypeScript data models and Prisma schema for:

- Athlete (id, name, team, position, dateOfBirth, metrics relation)
- Team (id, name, sport, athletes relation)
- Session (id, athleteId, type, startTime, endTime, readings relation)
- MetricReading (id, sessionId, metricType, value, timestamp)

Success criteria: Valid Prisma schema, proper relations, correct TypeScript types exported.
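As a reference point when reviewing the model's output, one possible shape for the TypeScript side is sketched below. The primitive field types, the nullable `endTime`, the `teamId` foreign key, and the `MetricType` union (taken from the metrics named in Task 3) are assumptions, as is reading the Athlete "metrics relation" as all readings reachable through the athlete's sessions. The hypothetical helper `metricsFor` only illustrates that reading.

```typescript
// Sketch of the Task 1 types. Primitive field types and nullability are assumptions.
type MetricType = "hrv" | "sleepQuality" | "trainingLoad" | "mood"; // assumed from Task 3

interface MetricReading {
  id: string;
  sessionId: string;
  metricType: MetricType;
  value: number;
  timestamp: Date;
}

interface Session {
  id: string;
  athleteId: string;
  type: string; // allowed values are not specified in the task
  startTime: Date;
  endTime: Date | null; // assumed nullable for a session still in progress
  readings: MetricReading[];
}

interface Athlete {
  id: string;
  name: string;
  teamId: string; // assumed foreign key behind the "team" field
  position: string;
  dateOfBirth: Date;
  sessions: Session[];
}

interface Team {
  id: string;
  name: string;
  sport: string;
  athletes: Athlete[];
}

// Hypothetical helper: collects every reading reachable through the athlete's sessions.
function metricsFor(athlete: Athlete): MetricReading[] {
  const all: MetricReading[] = [];
  for (const session of athlete.sessions) {
    for (const reading of session.readings) all.push(reading);
  }
  return all;
}
```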
Task 2: CRUD Endpoints (Difficulty: ⭐⭐)
Build Express.js REST endpoints for Athletes and Sessions:

- GET/POST/PUT/DELETE /api/athletes (with pagination, filtering by team)
- GET/POST /api/sessions (with date range filtering)
- GET /api/athletes/:id/sessions (nested resource)
- Proper error handling, input validation (zod), consistent response format

Success criteria: All endpoints functional, validation works, errors return proper HTTP codes and messages.
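The pagination and "consistent response format" requirements can be checked against a small reference like the one below. This is a framework-free sketch, not the expected answer: the `Page<T>` envelope, its field names, and the 1-based `page` parameter are assumptions, and a real endpoint would apply the offset in the database query rather than on an in-memory array.

```typescript
// Hypothetical response envelope; the task only asks for a "consistent response format".
interface Page<T> {
  data: T[];
  page: number; // 1-based page index (assumed convention)
  pageSize: number;
  total: number;
  totalPages: number;
}

function isPositiveInteger(n: number): boolean {
  return isFinite(n) && Math.floor(n) === n && n >= 1;
}

// Slices an in-memory array the way a paginated GET /api/athletes might.
function paginate<T>(items: T[], page: number, pageSize: number): Page<T> {
  if (!isPositiveInteger(page)) throw new RangeError("page must be an integer >= 1");
  if (!isPositiveInteger(pageSize)) throw new RangeError("pageSize must be an integer >= 1");
  const offset = (page - 1) * pageSize; // page 1 starts at offset 0
  return {
    data: items.slice(offset, offset + pageSize),
    page: page,
    pageSize: pageSize,
    total: items.length,
    totalPages: Math.ceil(items.length / pageSize),
  };
}
```

The offset line is where an off-by-one typically hides: using `page * pageSize` instead silently skips the first page.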
Task 3: Business Logic - GAP Score Calculator (Difficulty: ⭐⭐⭐)
Implement a GAP Score calculation service:

- Takes an athlete's recent MetricReadings (HRV, sleep quality, training load, mood)
- Applies weighted composite scoring with normalization
- Returns GAP score (0-100), component breakdown, and trend direction
- Handles edge cases: missing data, insufficient history, out-of-range values

Success criteria: Mathematically correct, handles edge cases gracefully, well-structured service with clear interfaces.
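A minimal sketch of weighted composite scoring with min-max normalization is shown below, to make the "known inputs/expected outputs" concrete for graders. The ranges and relative weights in `SPECS` are invented for illustration, every metric is treated as higher-is-better (which a real model would likely not do for training load), and the 1-point dead band in `trend` is an assumption. None of these come from the task definition.

```typescript
type MetricKey = "hrv" | "sleepQuality" | "trainingLoad" | "mood";

interface MetricSpec { min: number; max: number; weight: number; }

// Hypothetical ranges and relative weights, chosen only for illustration.
const SPECS: Record<MetricKey, MetricSpec> = {
  hrv: { min: 20, max: 120, weight: 4 },
  sleepQuality: { min: 0, max: 10, weight: 3 },
  trainingLoad: { min: 0, max: 1000, weight: 2 },
  mood: { min: 1, max: 5, weight: 1 },
};
const ORDER: MetricKey[] = ["hrv", "sleepQuality", "trainingLoad", "mood"];

interface GapResult {
  score: number | null; // null when no usable metric is present
  components: Partial<Record<MetricKey, number>>;
}

// Clamp out-of-range values, then rescale to 0-100.
function normalize(value: number, spec: MetricSpec): number {
  const clamped = Math.min(spec.max, Math.max(spec.min, value));
  return ((clamped - spec.min) / (spec.max - spec.min)) * 100;
}

function gapScore(readings: Partial<Record<MetricKey, number>>): GapResult {
  const components: Partial<Record<MetricKey, number>> = {};
  let weighted = 0;
  let weightSum = 0;
  for (const key of ORDER) {
    const value = readings[key];
    if (value === undefined || !isFinite(value)) continue; // missing data: skip the metric
    const normalized = normalize(value, SPECS[key]);
    components[key] = normalized;
    weighted += normalized * SPECS[key].weight;
    weightSum += SPECS[key].weight;
  }
  // Dividing by the weight actually used re-weights the remaining metrics.
  return { score: weightSum === 0 ? null : weighted / weightSum, components: components };
}

function trend(previous: number | null, current: number | null): "up" | "down" | "flat" | "unknown" {
  if (previous === null || current === null) return "unknown"; // insufficient history
  const delta = current - previous;
  if (Math.abs(delta) < 1) return "flat"; // assumed dead band of 1 point
  return delta > 0 ? "up" : "down";
}
```

Re-weighting over the metrics that are present is only one way to handle missing data; returning an error below a minimum number of metrics is an equally defensible choice.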
Task 4: Auth Middleware (Difficulty: ⭐⭐⭐⭐)
Implement JWT authentication and RBAC:

- JWT token generation/validation with refresh token flow
- Three roles: admin, coach, athlete
- Route-level permission guards
- Coach can see their team's athletes only
- Athlete can see only their own data
- Admin has full access

Success criteria: Auth flow works, role checks enforce properly, token refresh handles expiry, no obvious security holes.
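The three access rules above reduce to a single predicate, sketched here without any JWT or Express wiring. The `AuthUser` shape (with `teamId` for coaches and `athleteId` for athlete accounts) and the function name `canViewAthlete` are assumptions for illustration; in a real guard the user object would come from a verified token.

```typescript
type Role = "admin" | "coach" | "athlete";

interface AuthUser {
  id: string;
  role: Role;
  teamId?: string; // assumed to be set for coaches
  athleteId?: string; // assumed to be set for athlete accounts
}

interface AthleteRef {
  id: string;
  teamId: string;
}

// Returns true only when the rules in Task 4 allow this user to read this athlete.
function canViewAthlete(user: AuthUser, athlete: AthleteRef): boolean {
  switch (user.role) {
    case "admin":
      return true; // full access
    case "coach":
      return user.teamId !== undefined && user.teamId === athlete.teamId;
    case "athlete":
      return user.athleteId !== undefined && user.athleteId === athlete.id;
    default:
      return false; // deny anything unrecognized
  }
}
```

The explicit `undefined` checks matter: without them, a coach with no team and an athlete record with a missing `teamId` could compare equal and leak data.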
Task 5: Integration Tests (Difficulty: ⭐⭐⭐⭐⭐)
Write a comprehensive test suite (Vitest) that:

- Tests each CRUD endpoint with valid and invalid inputs
- Tests auth flow (login, token refresh, expired token, wrong role)
- Tests GAP Score calculation with known inputs/expected outputs
- Tests role-based access (coach sees own team only, athlete sees own data only)
- Uses test fixtures and proper setup/teardown

Success criteria: Tests are meaningful (not just happy path), assertions are specific, test isolation is proper, all tests would pass against a correct implementation.
Evaluation Criteria (per task)
- Correctness: Does the code work? Are there bugs?
- Completeness: Did it address all requirements?
- Code Quality: Clean structure, proper typing, good patterns?
- Edge Cases: Did it handle the tricky stuff without being told?
- Production Readiness: Would you ship this (with review)?
Task 6: Bug Hunt (Difficulty: ⭐⭐⭐⭐)
Given the code from Tasks 1-3 with four planted subtle bugs:

- An off-by-one in pagination
- A race condition in the GAP Score calculator (async without await)
- A type coercion bug (the string "0" coerced to the number 0 and treated as falsy)
- A missing null check on optional relation

Find and fix all bugs. Explain each one.

Success criteria: Finds all bugs, fixes are correct, explanations show understanding not just pattern matching.
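One way the type coercion bug could be planted is sketched below. Note that in JavaScript the string "0" is itself truthy; it only becomes falsy after numeric coercion, so the bug usually appears as a `Number(...) || fallback` default. The function names and the query-parameter scenario are hypothetical, not taken from the Task 1-3 code.

```typescript
// Buggy: Number("0") is 0, which is falsy, so an explicit "0" is replaced by the fallback.
function parseMinBuggy(raw: string | undefined, fallback: number): number {
  return Number(raw) || fallback;
}

// Fixed: fall back only when the parameter is absent, blank, or not numeric.
function parseMinFixed(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  return isNaN(parsed) ? fallback : parsed;
}
```

A good explanation from the model should name the mechanism (logical OR discards every falsy value, including a legitimate 0), not just swap the operator.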
Task 7: Refactor (Difficulty: ⭐⭐⭐⭐)
Given working-but-messy code (duplicated validation, inconsistent error handling, god function):

- Refactor into clean service/repository pattern
- Extract shared validation into middleware
- Maintain identical external behavior

Success criteria: Refactored code passes same tests, structure is cleaner, no regressions introduced.
Task 8: Ambiguous Spec (Difficulty: ⭐⭐⭐⭐⭐)
Prompt: "Coaches are complaining that the athlete comparison feature doesn't work well when athletes have different training histories. Make it better."

- No file pointers, no specific requirements
- Model must: identify what "comparison" means in context, propose an approach, implement it

Success criteria: Asks clarifying questions OR makes reasonable assumptions and states them. Implementation is defensible.
Task 9: Multi-file Coordinated Change (Difficulty: ⭐⭐⭐⭐⭐)
Add a new "TeamAnalytics" feature that requires simultaneous changes to:

- Schema (new model: TeamSnapshot)
- Types (new interfaces)
- New service (TeamAnalyticsService)
- New routes (/api/teams/:id/analytics)
- Auth middleware (coach-level access)
- Tests (new test file + updates to existing)

All changes must be internally consistent.

Success criteria: All files are consistent with each other, no broken imports, types align, tests cover the new feature.
Scoring
Each task scored 1-5 on each criterion. Total possible: 225 points (9 tasks × 5 criteria × 5 max).

- 180+: Sprint-ready. Use as primary coding model.
- 135-179: Usable with oversight. Good for routine work, cloud for complex.
- 90-134: Supplement only. Cloud primary, local for boilerplate.
- <90: Not viable for sprint work.
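The tally and tier lookup can be automated with a short helper such as the sketch below. The function names and the `scores[task][criterion]` layout are assumptions; the thresholds are the ones listed above.

```typescript
type Tier = "Sprint-ready" | "Usable with oversight" | "Supplement only" | "Not viable";

const CRITERIA_PER_TASK = 5;
const MAX_PER_CRITERION = 5;

// scores[task][criterion], each an integer from 1 to 5.
function totalScore(scores: number[][]): number {
  let total = 0;
  for (const task of scores) {
    if (task.length !== CRITERIA_PER_TASK) {
      throw new Error("each task needs exactly " + CRITERIA_PER_TASK + " criterion scores");
    }
    for (const score of task) {
      if (Math.floor(score) !== score || score < 1 || score > MAX_PER_CRITERION) {
        throw new RangeError("scores must be integers from 1 to " + MAX_PER_CRITERION);
      }
      total += score;
    }
  }
  return total;
}

// Maps a total (out of 225 for 9 tasks) to the tiers defined in the Scoring section.
function tierFor(total: number): Tier {
  if (total >= 180) return "Sprint-ready";
  if (total >= 135) return "Usable with oversight";
  if (total >= 90) return "Supplement only";
  return "Not viable";
}
```

For example, a model that scores 4 on every criterion of all nine tasks totals 9 × 5 × 4 = 180, which just reaches the sprint-ready tier.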