60ce0fef54
fix: tighten vague language across all prompt files
...
- Remove blanket "write tests" instructions; tests only when
acceptance criteria require them
- Replace arbitrary "30-50% rejection rate" with clear directive
- Replace "4/5 threshold" with "majority of claims" rule
- List concrete quality gate commands instead of "whatever project uses"
- Remove "learnings" from progress summary (too vague)
- Make error-leak pattern generic (not HTTP-specific)
- Align fix evaluator with updated test expectations
2026-03-28 11:58:13 -04:00
2dc291aac4
fix: make evaluator calibration examples project-agnostic
...
Replace ChaosRush-specific references with generic examples
that apply to any codebase.
2026-03-28 11:21:11 -04:00
1d059e218b
feat: add few-shot calibration examples to evaluator prompt
...
Three examples showing bad rubber-stamp, good rejection, and good
pass patterns. Based on Anthropic's harness design recommendation
to calibrate evaluators with few-shot score breakdowns, and
informed by real failures observed in a production loop run.
2026-03-28 11:15:52 -04:00
48bc656cd8
refactor: trim generator and evaluator prompts — cut total in half
2026-03-27 14:48:42 -04:00
5f8a34cc7b
fix: simplify evaluator runtime verification — let claude figure out the tools
2026-03-27 14:45:55 -04:00
ee08e3617c
feat: evaluator runtime verification for web projects, optional Playwright docs
2026-03-27 14:30:09 -04:00
1e7f7ea6ed
feat: true interactive mode — run claude directly, verdict via file, no script/capture
2026-03-27 13:07:25 -04:00
17e5eb707f
feat: agent loop harness with Claude Code plugin support
...
Generator-evaluator architecture with iterative context-reset for
long-running coding tasks. Ships as a Claude Code plugin — install
with /plugin and use /agent-loop:init, /agent-loop:plan, /agent-loop:run.
2026-03-27 08:03:18 -04:00