19 Commits

Author SHA1 Message Date
ce111b4cbe feat: add guidance for subjective acceptance criteria
Planner now has examples for design/UX criteria that are evaluable
without being purely binary. Prevents the planner from avoiding
qualitative criteria just because they aren't grep-checkable.
2026-03-28 12:59:42 -04:00
77fd9e0cd6 feat: add concrete examples of good vs bad acceptance criteria
Planner now sees specific examples of verifiable criteria (grep,
test commands, file checks) alongside vague anti-patterns. Drives
higher story quality which directly improves evaluator accuracy.
2026-03-28 12:56:53 -04:00
1efca3c185 feat: add blocker handling and artifact protection to generator
Generator now has explicit instructions for when it's stuck: write
the blocker to notes, leave passes as false, and stop. Also adds
a "Do Not Modify" section preventing changes to other stories,
contracts, or config.
2026-03-28 12:40:05 -04:00
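The blocker procedure described in 1efca3c185 might look roughly like the sketch below. This is an illustration under stated assumptions, not the harness's actual code: the story dictionary, its `notes` field, the `EXAMPLE-1` id, and the `record_blocker` helper are all hypothetical; only the `passes` flag comes from the commit message.

```python
def record_blocker(story: dict, reason: str) -> dict:
    """Return a copy of the story with the blocker noted and passes left false."""
    updated = dict(story)
    # Append the blocker to the notes instead of overwriting earlier notes.
    updated["notes"] = list(story.get("notes", [])) + [f"BLOCKED: {reason}"]
    # A blocked story is never marked done.
    updated["passes"] = False
    return updated


# Hypothetical story record; field names are assumptions for illustration.
story = {"id": "EXAMPLE-1", "passes": False, "notes": []}
blocked = record_blocker(story, "missing API credentials")
print(blocked["notes"])
```

The helper returns a copy so the original record stays untouched, which loosely mirrors the "Do Not Modify" intent of the same commit.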
e4df81fdac feat: add self-verification gate before generator marks story done
Generator must now verify each acceptance criterion against actual
code before setting passes: true. Acts as a first filter before
the evaluator runs, reducing false completions.
2026-03-28 12:36:24 -04:00
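One way to picture the self-verification gate from e4df81fdac is a function that refuses to report success unless every acceptance criterion check succeeds. The sketch below is a hedged illustration only: `self_verify`, the criterion names, and the inline checks are hypothetical stand-ins for whatever the generator actually inspects in the code.

```python
from typing import Callable


def self_verify(checks: dict[str, Callable[[], bool]]) -> tuple[bool, list[str]]:
    """Run every criterion check; report success only if none fail."""
    failed = [name for name, check in checks.items() if not check()]
    return (not failed, failed)


# Hypothetical criteria; real checks would read files or run commands.
checks = {
    "config mentions timeout": lambda: "timeout" in "timeout = 30",
    "error handler is registered": lambda: "on_error" in ["on_start", "on_stop"],
}
passes, failed = self_verify(checks)
print(passes, failed)
```

Returning the list of failed criteria alongside the flag keeps the reason for a refusal visible before the evaluator runs.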
60ce0fef54 fix: tighten vague language across all prompt files
- Remove blanket "write tests" instructions; tests only when
  acceptance criteria require them
- Replace arbitrary "30-50% rejection rate" with clear directive
- Replace "4/5 threshold" with "majority of claims" rule
- List concrete quality gate commands instead of "whatever project uses"
- Remove "learnings" from progress summary (too vague)
- Make error-leak pattern generic (not HTTP-specific)
- Align fix evaluator with updated test expectations
2026-03-28 11:58:13 -04:00
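The "majority of claims" rule that replaced the "4/5 threshold" in 60ce0fef54 could be read as follows. The commit message does not define the rule precisely, so this sketch assumes it means strictly more than half of the checked claims must hold; `majority_verified` is a hypothetical name.

```python
def majority_verified(claims: list[bool]) -> bool:
    """True when strictly more than half of the claims were verified.

    Assumption: ties and an empty claim list do not count as a majority.
    """
    return sum(claims) * 2 > len(claims)


print(majority_verified([True, True, False]))
print(majority_verified([True, False]))
```

Unlike a fixed 4/5 cutoff, this form works for any number of claims.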
f26bdce534 fix: replace misleading context budget percentages with scope guidance
The planner prompt had vague context window budget percentages that
don't reflect how agents actually work. Replaced with concrete
scope guidance (keep stories to ~10 files) which aligns with the
existing scope budgets in config.json.
2026-03-28 11:49:04 -04:00
2dc291aac4 fix: make evaluator calibration examples project-agnostic
Replace ChaosRush-specific references with generic examples
that apply to any codebase.
2026-03-28 11:21:11 -04:00
1d059e218b feat: add few-shot calibration examples to evaluator prompt
Three examples showing bad rubber-stamp, good rejection, and good
pass patterns. Based on Anthropic's harness design recommendation
to calibrate evaluators with few-shot score breakdowns, and
informed by real failures observed in a production loop run.
2026-03-28 11:15:52 -04:00
80b0f0f4c1 feat: add regression patterns to evaluator implement prompt
Three new failure patterns: missing imports after refactoring,
orphaned resource instances, and error detail leakage. These were
observed in a real loop run where the evaluator missed them.
2026-03-28 10:57:44 -04:00
5e4ad3b12e feat: add smoke test step to generator startup sequence
Generator now runs a quick health check before implementing if the
project has tests or a dev server. Catches regressions from previous
iterations early instead of building on a broken foundation.
2026-03-27 21:09:36 -04:00
9a7fa3a1bd fix: enforce strict orientation sequence in generator prompt
Add git log step and explicit gate requiring all startup steps
complete before implementation begins. Based on Anthropic's
prompting guide recommendation for prescriptive session orientation.
2026-03-27 21:07:48 -04:00
a4e9c4de05 feat: US-003 - Clarify .loop/ changes are expected in explore evaluator 2026-03-27 18:42:46 -04:00
b3d263258a fix: critical bugs, stale refs, README rewrite, security fixes
- Fix evaluator bypass on last story (moved completion check)
- Fix all stale command name references across README, loop.sh, skills, plugin.json
- Fix explore evaluator false rejects (.loop/ files are expected)
- Fix stderr capture order in headless mode
- Fix shell injection risk in hooks.sh python fallback
- Remove .DS_Store from tracking
- Rewrite README to match current architecture (single entry point, tmux, optional tools)
- Add XcodeBuildMCP and iOS simulator MCP to optional tools docs
2026-03-27 14:58:01 -04:00
f3cbfd258c refactor: remove domain-specific language from prompts — fully universal 2026-03-27 14:50:52 -04:00
48bc656cd8 refactor: trim generator and evaluator prompts — cut total in half 2026-03-27 14:48:42 -04:00
5f8a34cc7b fix: simplify evaluator runtime verification — let claude figure out the tools 2026-03-27 14:45:55 -04:00
ee08e3617c feat: evaluator runtime verification for web projects, optional Playwright docs 2026-03-27 14:30:09 -04:00
1e7f7ea6ed feat: true interactive mode — run claude directly, verdict via file, no script/capture 2026-03-27 13:07:25 -04:00
17e5eb707f feat: agent loop harness with Claude Code plugin support
Generator-evaluator architecture with iterative context-reset for
long-running coding tasks. Ships as a Claude Code plugin — install
with /plugin and use /agent-loop:init, /agent-loop:plan, /agent-loop:run.
2026-03-27 08:03:18 -04:00