loop-loop/README.md
Sheldon Finlay b3d263258a fix: critical bugs, stale refs, README rewrite, security fixes
- Fix evaluator bypass on last story (moved completion check)
- Fix all stale command name references across README, loop.sh, skills, plugin.json
- Fix explore evaluator false rejects (.loop/ files are expected)
- Fix stderr capture order in headless mode
- Fix shell injection risk in hooks.sh python fallback
- Remove .DS_Store from tracking
- Rewrite README to match current architecture (single entry point, tmux, optional tools)
- Add XcodeBuildMCP and iOS simulator MCP to optional tools docs
2026-03-27 14:58:01 -04:00

# Agent Loop
Autonomous AI agent harness that combines a generator-evaluator architecture with iterative context-reset patterns for long-running coding tasks.
Inspired by [Geoffrey Huntley's Ralph pattern](https://ghuntley.com/ralph/) and [Anthropic's harness design research](https://www.anthropic.com/engineering/harness-design-long-running-apps).
Each iteration runs two fresh Claude Code sessions: a **Generator** does the work, then an **Evaluator** verifies it. Human judgment stays in the planning phase; execution is autonomous with full visibility.
## Install
### As a Claude Code Plugin (Recommended)
```
/plugin marketplace add https://git.jagfly.com/sheldon/loop-loop.git
/plugin install agent-loop@agent-loop
```
Then in any project:
```
/agent-loop:run
```
That's it. The single command handles setup, planning, and execution.
### Manual Install
```bash
cp -r /path/to/loop-loop .loop
```
Then run `.loop/loop.sh` directly.
## How It Works
```
/agent-loop:run
  ├─ Phase 1: Scaffold .loop/ (if needed)
  ├─ Phase 2: Generate stories from spec (if needed)
  │    ├─ Presents stories for human review
  │    └─ STOPS: user reviews and says "go"
  └─ Phase 3: Launch loop in tmux
       ├─→ Generator → picks story → implements → commits
       ├─→ Evaluator → verifies → PASS or REJECT
       ├─→ next iteration (fresh CC session each time)
       └─→ all stories pass → done
```
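The Phase 3 control flow above can be sketched in a few lines of shell. This is a minimal illustration, not code taken from `loop.sh`: `run_generator`, `run_evaluator`, `run_loop`, and the `VERDICT` variable are hypothetical names, and both roles are stubs standing in for real Claude Code sessions.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the loop's control flow; not taken from loop.sh.

# Stub generator: the real harness would launch a fresh Claude Code session.
run_generator() { :; }

# Stub evaluator: rejects the first iteration, passes every later one.
# It sets the global VERDICT instead of echoing, so no subshell is needed.
run_evaluator() {
  if [ "$1" -eq 1 ]; then VERDICT=REJECT; else VERDICT=PASS; fi
}

# run_loop <incomplete-stories> <max-iterations>
# Returns 0 when every story has passed, 1 if the iteration cap is hit first.
run_loop() {
  local remaining=$1 max=$2 iteration=0
  while [ "$remaining" -gt 0 ] && [ "$iteration" -lt "$max" ]; do
    iteration=$((iteration + 1))
    run_generator "$iteration"
    run_evaluator "$iteration"
    # Only a PASS retires a story; a REJECT leaves it for the next iteration.
    if [ "$VERDICT" = PASS ]; then
      remaining=$((remaining - 1))
    fi
    echo "iteration $iteration: $VERDICT, $remaining stories left"
  done
  [ "$remaining" -eq 0 ]
}

run_loop 2 20
```

With two stories and this stub evaluator, the run takes three iterations: one rejection followed by two passes.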
## Modes
| Mode | What it does | Git writes? |
|------|-------------|-------------|
| **implement** | Build features from a PRD | Yes |
| **explore** | Read-only codebase analysis | No |
| **fix** | Targeted bug fixes / tech debt | Yes |
## Monitoring
After the loop launches in tmux:
```bash
# Watch live (from Claude Code)
! tmux attach -t agent-loop
# Detach back to Claude Code: press Ctrl+B, then D
# Stop the loop: press Ctrl+C in the tmux session
```
Or ask Claude Code "status" — it reads `.loop/prd.json` and `.loop/progress.md`.
## Headless Mode
For CI or background execution without the interactive UI:
```bash
.loop/loop.sh --headless [options]
  --mode <implement|explore|fix>   Operating mode
  --max <N>                        Maximum iterations (default: 20)
  --skip-eval                      Skip evaluator pass
  --dry-run                        Print assembled prompts without running
```
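For readers wiring the headless flags into their own wrapper, the sketch below shows one way such options could be parsed and validated. It is an assumption-laden illustration, not the parser in `loop.sh`: the function name `parse_args`, the variable names, and the `implement` default for `--mode` are all hypothetical. Only the flag names and the `--max` default of 20 come from the usage text above.

```bash
#!/usr/bin/env bash
# Hypothetical option parser for the documented headless flags.
parse_args() {
  # MODE default is an assumption; MAX default matches the documented 20.
  HEADLESS=0 MODE=implement MAX=20 SKIP_EVAL=0 DRY_RUN=0
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --headless)  HEADLESS=1 ;;
      --skip-eval) SKIP_EVAL=1 ;;
      --dry-run)   DRY_RUN=1 ;;
      --mode|--max)
        if [ "$#" -lt 2 ]; then
          echo "$1 needs a value" >&2; return 2
        fi
        if [ "$1" = --mode ]; then MODE=$2; else MAX=$2; fi
        shift ;;  # consume the value; the shift below consumes the flag
      *) echo "unknown option: $1" >&2; return 2 ;;
    esac
    shift
  done
  # Reject anything outside the three documented modes.
  case "$MODE" in
    implement|explore|fix) ;;
    *) echo "invalid mode: $MODE" >&2; return 2 ;;
  esac
  # Reject empty or non-numeric iteration caps.
  case "$MAX" in
    ''|*[!0-9]*) echo "--max must be a non-negative integer" >&2; return 2 ;;
  esac
}
```

Calling `parse_args --headless --mode fix --max 5` would leave `MODE=fix` and `MAX=5`, with the other flags at their defaults.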
## Architecture
### Generator
Fresh Claude Code session each iteration. Reads `prd.json` to find the highest-priority incomplete story, reads the sprint contract, implements the story, runs quality gates, commits, and marks it done.
### Evaluator
Separate fresh session after each generator pass. Skeptically verifies the work: checks acceptance criteria against actual code, runs tests and the application, and issues a `PASS` or `REJECT` verdict. Rejection sends the story back with specific feedback.
Evaluator skepticism is deliberately tuned: Claude's default tendency is to rationalize away issues, so the evaluator prompt includes explicit bias correction.
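A harness has to turn the evaluator's free-form output into a binary decision. The sketch below shows one possible approach under stated assumptions: it presumes the evaluator ends its output with a line of the form `VERDICT: PASS` or `VERDICT: REJECT`, which is a hypothetical format, and `parse_verdict` is a hypothetical helper rather than a function from this repository.

```bash
#!/usr/bin/env bash
# Hypothetical helper: extract the verdict from evaluator output.
# Assumes a line reading exactly "VERDICT: PASS" or "VERDICT: REJECT".
parse_verdict() {
  local last
  # Keep only well-formed verdict lines; the last one wins.
  last=$(printf '%s\n' "$1" | grep -E '^VERDICT: (PASS|REJECT)$' | tail -n 1)
  # Fail closed: anything other than an explicit PASS counts as a rejection.
  if [ "$last" = "VERDICT: PASS" ]; then
    echo PASS
  else
    echo REJECT
  fi
}
```

Treating a missing or malformed verdict as `REJECT` is a design choice of this sketch. It matches the skeptical stance described above, since ambiguous output never lets a story through.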
### Sprint Contracts
Before the loop starts, the planner generates a sprint contract for each story. Each contract defines the "done" conditions that both generator and evaluator reference, which removes ambiguity about whether work is complete.
### State Persistence
| Artifact | Purpose |
|----------|---------|
| `prd.json` | Story status (pass/fail), acceptance criteria |
| `progress.md` | Append-only session log + codebase patterns |
| `contracts/` | Sprint contracts per story |
| `config.json` | Harness configuration |
| Git commits | Code changes with story-tagged messages |
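Because `progress.md` is append-only, each session adds an entry and never rewrites earlier ones. A minimal sketch of that pattern follows. The helper name `log_progress` and the entry layout (a `##` heading with a UTC timestamp, story ID, and verdict) are assumptions for illustration, not the format the harness actually writes.

```bash
#!/usr/bin/env bash
# Hypothetical helper: append one entry to the session log.
# log_progress <file> <story-id> <verdict> <note>
log_progress() {
  local file=$1 story=$2 verdict=$3 note=$4
  # ">>" appends, so earlier entries (including rejections) are preserved.
  printf '## %s %s: %s\n%s\n\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$story" "$verdict" "$note" >> "$file"
}
```

A call such as `log_progress .loop/progress.md story-1 REJECT "missing input validation"` would leave the rejection feedback on disk for the next generator session to read.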
## Optional: Runtime Testing Tools
The evaluator verifies that code actually runs, not just that it looks correct. It uses whatever tools are available. For richer verification, install these optional MCP servers:
**Web projects (Playwright):**
```bash
claude mcp add playwright npx @playwright/mcp@latest --headless --browser=chromium
```
**iOS/Xcode projects (XcodeBuildMCP):**
```bash
brew tap getsentry/xcodebuildmcp && brew install xcodebuildmcp
claude mcp add xcodebuild -- xcodebuildmcp
```
**iOS Simulator interaction:**
```bash
claude mcp add ios-simulator -- npx -y ios-simulator-mcp
```
These are optional — the evaluator works without them but may miss runtime-only issues.
## Design Principles
- **Fresh context per iteration** — no accumulated hallucination drift
- **Separate generation from evaluation** — external skepticism is easier to tune than self-criticism
- **Human judgment for planning, AI for execution** — human reviews stories, loop executes autonomously
- **Structured handoffs via artifacts** — not conversation memory
- **No git revert on rejection** — next generator sees partial work + feedback (more signal)
- **Tool-agnostic** — evaluator uses whatever tools are available, no hardcoded dependencies
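The "fresh context" and "structured handoffs" principles imply that each session's prompt is rebuilt from files on disk rather than from earlier conversation. The sketch below illustrates the idea only. `assemble_prompt`, the `Role:` header, and the `--- name ---` separators are hypothetical, and the real prompt assembly in `loop.sh` may look quite different.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: build an iteration prompt purely from artifacts.
# assemble_prompt <loop-dir> <role>
assemble_prompt() {
  local loop_dir=$1 role=$2 artifact
  printf 'Role: %s\n' "$role"
  # Include each state artifact that exists; skip the ones that do not.
  for artifact in prd.json progress.md; do
    if [ -f "$loop_dir/$artifact" ]; then
      printf '\n--- %s ---\n' "$artifact"
      cat "$loop_dir/$artifact"
    fi
  done
}
```

Since the function reads nothing except the files under the given directory, two sessions started from the same artifacts would receive the same prompt, which is the property the first and fourth principles rely on.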
## Credits
- [Geoffrey Huntley](https://ghuntley.com/ralph/) — original Ralph pattern
- [Anthropic Engineering](https://www.anthropic.com/engineering/harness-design-long-running-apps) — generator-evaluator harness design