Sheldon Finlay ee08e3617c feat: evaluator runtime verification for web projects, optional Playwright docs

2026-03-27 14:30:09 -04:00


Agent Loop

Autonomous AI agent harness that combines a generator-evaluator architecture with iterative context-reset patterns for long-running coding tasks.

Inspired by Geoffrey Huntley's Ralph pattern and Anthropic's harness design research.

A generator-evaluator loop runs fresh agent instances per iteration. Each iteration: a Generator does the work, then an Evaluator verifies it. Human judgment stays in the planning phase; execution is autonomous.

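Two execution modes: headless via loop.sh (a fully autonomous bash process) or interactive via /loop-run (native to Claude Code, with full visibility and the ability to intervene). When installed as a plugin, the same commands are namespaced, e.g. /agent-loop:run.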

Install

As a Claude Code Plugin (Recommended)

/plugin marketplace add https://git.jagfly.com/sheldon/loop-loop.git
/plugin install agent-loop@agent-loop

Then in any project:

/agent-loop:init          # Set up the loop for your project
/agent-loop:plan          # Generate PRD and sprint contracts
/agent-loop:run           # Run the loop interactively

Manual Install

# Copy a local checkout of the repo into your project as .loop
cp -r /path/to/loop-loop .loop

# Install skills as Claude Code commands
mkdir -p .claude/commands
for skill in loop-init loop-plan loop-run loop-triage; do
    ln -sf "../../.loop/skills/$skill/SKILL.md" ".claude/commands/$skill.md"
done

# Then, in Claude Code, run these one at a time, in order:
/loop-init
/loop-plan
/loop-run

How It Works

[You + Claude Code]                    [Loop Execution]

/agent-loop:init                       Interactive (/agent-loop:run)
  → scaffolds .loop/                     └─ dispatches Agent subagents
  → detects project                      └─ visible tool calls, can intervene
  → picks mode                           └─ chat mid-loop to adjust course
  → creates config.json
                                        Headless (.loop/loop.sh)
/agent-loop:plan                         └─ spawns claude --print per iteration
  → asks clarifying questions            └─ fully autonomous, no UI
  → generates prd.json
  → generates sprint contracts          Both paths:
  → populates progress.md                ├─→ Generator → picks story → implements → commits
                                          ├─→ Evaluator → verifies → PASS or REJECT
                                          ├─→ next iteration...
                                          └─→ all stories pass → done
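
The "Both paths" flow above can be sketched as a small shell loop. This is only an illustration of the control flow, not the contents of loop.sh: `run_generator`, `run_evaluator`, and `run_loop` are hypothetical names, and the stubs stand in for the fresh agent instances that the real harness launches.

```shell
#!/bin/sh
# Sketch of the generator -> evaluator control flow.
# All function names here are hypothetical; loop.sh may be organized differently.

# Stub: a real generator would start a fresh agent to implement the story.
run_generator() { echo "generator: $1 (iteration $2)"; }

# Stub: a real evaluator would start a second fresh agent and report a verdict.
# Here even-numbered iterations pass, so every story is rejected once first.
run_evaluator() {
  if [ $(( $2 % 2 )) -eq 0 ]; then echo PASS; else echo REJECT; fi
}

# run_loop MAX STORY...: stories are given in priority order.
run_loop() {
  max=$1; shift            # remaining positional args are the pending stories
  i=1
  while [ "$#" -gt 0 ] && [ "$i" -le "$max" ]; do
    story=$1               # highest-priority pending story comes first
    run_generator "$story" "$i"
    verdict=$(run_evaluator "$story" "$i")
    echo "iteration $i: $story $verdict"
    # On PASS the story is dropped; on REJECT it stays first and is retried.
    if [ "$verdict" = "PASS" ]; then shift; fi
    i=$((i + 1))
  done
  [ "$#" -eq 0 ]           # succeed only if every story passed
}
```

With these stubs, `run_loop 10 US-001 US-002` finishes after four iterations, because each story is rejected once before it passes. A rejected story is simply retried on the next iteration, which mirrors the "no git revert on rejection" principle.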

Modes

| Mode | What it does | Git writes? |
| --- | --- | --- |
| implement | Build features from a PRD | Yes |
| explore | Read-only codebase analysis | No |
| fix | Targeted bug fixes / tech debt | Yes |

Running the Loop

Option A: Interactive (`/loop-run`) — Recommended

Run inside Claude Code. You see every tool call, file edit, and test run. You can intervene at any point — deny a tool call, chat to adjust course, or stop the loop.

/loop-run                    # Run until done or max iterations
/loop-run 3                  # Run at most 3 iterations
/loop-run --skip-eval        # Skip evaluator pass
/loop-run --story US-003     # Run only a specific story

Option B: Headless (`loop.sh`)

Run as a standalone bash process. Fully autonomous — no UI, no intervention. Useful for background execution or CI.

.loop/loop.sh [options]

--mode <implement|explore|fix>   Operating mode
--max <N>                        Maximum iterations (default: 20)
--skip-eval                      Skip evaluator pass
--tool <claude|amp>              AI tool to use
--no-hooks                       Don't install stop hooks
--dry-run                        Print assembled prompts without running agents
--resume                         Skip stories that already passed; exit if none remain
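
As a rough sketch of how a script could parse the options listed above, the function below handles each flag and validates the values. It is not taken from loop.sh: the variable names are hypothetical, and the defaults for `--mode` and `--tool` are assumptions (only the `--max` default of 20 comes from the list above).

```shell
#!/bin/sh
# Sketch of option parsing for the documented flags.
# Variable names and the MODE/TOOL defaults are assumptions for illustration.
parse_args() {
  MODE=implement MAX=20 SKIP_EVAL=0 TOOL=claude HOOKS=1 DRY_RUN=0 RESUME=0
  while [ "$#" -gt 0 ]; do
    case $1 in
      --mode|--max|--tool)
        # These options take a value; refuse to run if it is missing.
        [ "$#" -ge 2 ] || { echo "missing value for $1" >&2; return 2; }
        case $1 in
          --mode) MODE=$2 ;;
          --max)  MAX=$2 ;;
          --tool) TOOL=$2 ;;
        esac
        shift 2 ;;
      --skip-eval) SKIP_EVAL=1; shift ;;
      --no-hooks)  HOOKS=0; shift ;;
      --dry-run)   DRY_RUN=1; shift ;;
      --resume)    RESUME=1; shift ;;
      *) echo "unknown option: $1" >&2; return 2 ;;
    esac
  done
  # Validate values against the choices shown in the option list.
  case $MODE in implement|explore|fix) ;; *) echo "invalid mode: $MODE" >&2; return 2 ;; esac
  case $TOOL in claude|amp) ;; *) echo "invalid tool: $TOOL" >&2; return 2 ;; esac
  case $MAX in ''|*[!0-9]*) echo "invalid --max: $MAX" >&2; return 2 ;; esac
}
```

For example, `parse_args --mode fix --max 5 --resume` leaves `MODE=fix`, `MAX=5`, and `RESUME=1`, while an unknown mode such as `--mode deploy` returns a non-zero status.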

Architecture

Generator

Fresh Claude Code instance each iteration. Reads prd.json to find the highest-priority incomplete story, reads the sprint contract, implements the story, runs quality gates, commits, and marks it done.

Evaluator

Separate fresh instance after each generator pass. Skeptically verifies the work: checks acceptance criteria against actual code, runs tests independently, and issues a PASS or REJECT verdict. Rejection sends the story back to the generator with specific feedback.

Evaluator skepticism is deliberately tuned: left to its defaults, Claude tends to rationalize issues away, so the evaluator prompt includes explicit bias correction.
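
One way a harness could carry that skepticism into verdict handling is to fail closed: treat anything other than an explicit PASS as a REJECT. The sketch below assumes, purely for illustration, that the evaluator ends its output with a line reading `VERDICT: PASS` or `VERDICT: REJECT`; the source does not specify the actual output format, and `parse_verdict` is a hypothetical name.

```shell
#!/bin/sh
# parse_verdict: read evaluator output on stdin, print PASS or REJECT.
# ASSUMPTION: the "VERDICT: ..." line format is hypothetical.
parse_verdict() {
  # Keep only well-formed verdict lines; the last one wins.
  verdict=$(grep -E '^VERDICT: (PASS|REJECT)$' | tail -n 1)
  case $verdict in
    'VERDICT: PASS') echo PASS ;;
    *)               echo REJECT ;;   # missing or malformed verdict fails closed
  esac
}
```

With this rule, evaluator output that merely sounds positive but never states a verdict is still treated as a rejection.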

Sprint Contracts

Before the loop starts, /loop-plan generates contracts for each story. These define "done" conditions that both generator and evaluator reference, eliminating ambiguity about whether work is complete.
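
Because both agents depend on the contracts, it can be useful to confirm that every story has one before starting the loop. The following is a minimal sketch of such a check; the `<story-id>.md` file naming and the `missing_contracts` helper are assumptions for illustration, not the plugin's documented layout.

```shell
#!/bin/sh
# missing_contracts DIR STORY...: print each story ID that has no contract file.
# ASSUMPTION: contracts are stored as DIR/<story-id>.md (hypothetical naming).
missing_contracts() {
  dir=$1; shift
  for story in "$@"; do
    [ -f "$dir/$story.md" ] || echo "$story"
  done
}
```

An empty result means every listed story has a contract; any IDs printed would need a planning pass first.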

State Persistence

| Artifact | Purpose |
| --- | --- |
| `prd.json` | Story status (pass/fail), acceptance criteria |
| `progress.md` | Append-only session log + codebase patterns |
| `contracts/` | Sprint contracts per story |
| `config.json` | Harness configuration |
| Git commits | Code changes with story-tagged messages |
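
To illustrate what "append-only" means for `progress.md`, the sketch below adds one timestamped entry per call and never rewrites earlier ones. The entry format and the `log_progress` name are hypothetical; the real log layout may differ.

```shell
#!/bin/sh
# log_progress FILE STORY MESSAGE: append one timestamped entry to FILE.
# ASSUMPTION: the "## <timestamp> <story>" entry format is hypothetical.
log_progress() {
  # ">>" appends, so earlier entries are never modified or truncated.
  printf '## %s %s\n%s\n\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$2" "$3" >> "$1"
}
```

Since each iteration starts with a fresh context, a log like this is what lets the next agent see what earlier iterations learned.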

File Structure

.loop/
  loop.sh                        # Main loop orchestrator
  config.json                    # Project config (generated by /loop-init)
  init.sh                        # Project setup script (generated by /loop-init)
  prd.json                       # Active PRD (generated by /loop-plan)
  progress.md                    # Cross-session memory (append-only)

  prompts/
    generator/_base.md           # Shared generator instructions
    generator/implement.md       # Implement mode overlay
    generator/explore.md         # Explore mode overlay
    generator/fix.md             # Fix mode overlay
    evaluator/_base.md           # Skeptical evaluator base
    evaluator/implement.md       # Implement verification
    evaluator/explore.md         # Analysis verification
    evaluator/fix.md             # Fix verification
    planner/plan.md              # Planning context

  templates/                     # Reference templates
  lib/                           # Shell library functions
  skills/                        # Claude Code skills (/loop-init, /loop-plan, /loop-run, /loop-triage)
  contracts/                     # Sprint contracts (generated by /loop-plan)
  triage/                        # Analysis output (explore mode)
  archive/                       # Completed feature archives
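
The `prompts/` layout above suggests that each agent prompt is built from a shared `_base.md` plus a mode-specific overlay. A minimal sketch of that assembly, assuming simple concatenation (the actual assembly logic in loop.sh is not shown in this README, and `assemble_prompt` is a hypothetical name):

```shell
#!/bin/sh
# assemble_prompt ROOT ROLE MODE: print the base prompt followed by the overlay.
# ASSUMPTION: prompts are combined by plain concatenation.
assemble_prompt() {
  base="$1/prompts/$2/_base.md"
  overlay="$1/prompts/$2/$3.md"
  [ -f "$base" ] && [ -f "$overlay" ] || { echo "missing prompt file" >&2; return 1; }
  cat "$base"
  printf '\n'              # blank line between base and overlay
  cat "$overlay"
}
```

For instance, `assemble_prompt .loop evaluator fix` would print `evaluator/_base.md` followed by `evaluator/fix.md`.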

Browser Testing (Optional)

The evaluator includes basic runtime verification for web projects (starts a local server, checks HTTP response). For full browser testing with console error detection and screenshots, install the Playwright MCP server:

claude mcp add playwright npx @playwright/mcp@latest --headless --browser=chromium

When Playwright is available, the evaluator will use it to:

- Navigate to the running application
- Check for JavaScript console errors
- Take screenshots for visual verification
- Reject stories with runtime errors

This is optional — the evaluator works without it, but may miss runtime issues that only surface in a browser.
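
The basic check mentioned at the start of this section (start a local server, check the HTTP response) could look roughly like the sketch below. It is an assumption about the approach, not the evaluator's actual code: `wait_for_http` is a hypothetical helper, and it relies on `curl` being installed.

```shell
#!/bin/sh
# wait_for_http URL TIMEOUT_SECONDS: succeed once URL answers without an
# HTTP error status; fail if nothing answers within roughly TIMEOUT_SECONDS.
# ASSUMPTION: hypothetical helper; requires curl.
wait_for_http() {
  url=$1; deadline=$2; waited=0
  while [ "$waited" -lt "$deadline" ]; do
    # -f: non-zero exit on HTTP >= 400, -s: silent, -o /dev/null: discard body
    if curl -fs -o /dev/null --max-time 2 "$url"; then return 0; fi
    sleep 1
    waited=$((waited + 1))
  done
  return 1
}

# Example usage (assumes a static site served from the current directory):
#   python3 -m http.server 8000 >/dev/null 2>&1 &
#   server_pid=$!
#   wait_for_http http://127.0.0.1:8000/ 10 || echo "REJECT: server did not respond"
#   kill "$server_pid"
```

A check like this only confirms that the server responds; it cannot see console errors or rendering problems, which is the gap Playwright fills.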

Design Principles

- Fresh context per iteration: no accumulated hallucination drift
- Separate generation from evaluation: external skepticism is easier to tune than self-criticism
- Human judgment for planning, AI for execution: interactive /loop-plan, autonomous loop
- Structured handoffs via artifacts, not conversation memory
- No git revert on rejection: the next generator sees partial work + feedback (more signal)
- Advisory scope budgets: prompt-enforced limits on files read/written per iteration

Credits

- Geoffrey Huntley: original Ralph pattern
- Anthropic Engineering: generator-evaluator harness design
