# Agent Loop

Autonomous AI agent harness that combines a generator-evaluator architecture with iterative context-reset patterns for long-running coding tasks.

Inspired by [Geoffrey Huntley's Ralph pattern](https://ghuntley.com/ralph/) and [Anthropic's harness design research](https://www.anthropic.com/engineering/harness-design-long-running-apps).

A generator-evaluator loop runs fresh Claude Code sessions per iteration. Each iteration: a **Generator** does the work, then an **Evaluator** verifies it. Human judgment stays in the planning phase; execution is autonomous with full visibility.

## Install

```
/plugin install agent-loop@agent-loop
```

Then in any project:

```
/agent-loop:run
```

That's it. The single command handles setup, planning, and execution.

## Prerequisites

- [Claude Code](https://docs.anthropic.com/en/docs/claude-code) CLI installed
- `tmux` available (used to run the loop in a detachable session)
- `jq` or `python3` (for JSON state management)

## How It Works

1. Write a spec describing what you want to build (`SPEC.md`, `docs/specs/*.md`, or similar). You can write it yourself, ask Claude to draft one, or use planning tools like `/plan`.
2. Run `/agent-loop:run`. It scaffolds `.loop/`, generates stories from your spec, and presents them for review.
3. Say "go". The loop launches in tmux and runs autonomously.

```
/agent-loop:run
  ├─ Phase 1: Scaffold .loop/ (if needed)
  ├─ Phase 2: Generate stories from spec (if needed)
  │    ├─ Presents stories for human review
  │    └─ STOPS: user reviews and says "go"
  └─ Phase 3: Launch loop in tmux
       ├─→ Generator → picks story → implements → commits
       ├─→ Evaluator → verifies → PASS or REJECT
       ├─→ next iteration (fresh Claude Code session each time)
       └─→ all stories pass → done
```

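
The Phase 3 control flow above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the plugin's actual implementation: the `Story` record, the `generate` and `evaluate` callables (stand-ins for fresh Claude Code sessions), and the `max_iterations` cap are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Story:
    """Hypothetical in-memory story record; the real prd.json schema may differ."""
    id: str
    passed: bool = False
    feedback: list[str] = field(default_factory=list)


def run_loop(
    stories: list[Story],
    generate: Callable[[Story], None],
    evaluate: Callable[[Story], tuple[bool, str]],
    max_iterations: int = 20,  # assumed safety cap, not a documented setting
) -> int:
    """Alternate generator and evaluator passes until every story passes.

    Returns the number of iterations that were run.
    """
    for iteration in range(1, max_iterations + 1):
        # Pick the first story that has not passed yet.
        story = next((s for s in stories if not s.passed), None)
        if story is None:
            return iteration - 1  # all stories pass: done
        generate(story)             # stands in for a fresh generator session
        ok, note = evaluate(story)  # stands in for a fresh evaluator session
        if ok:
            story.passed = True
        else:
            # Rejected work is kept; the feedback is left for the next pass.
            story.feedback.append(note)
    return max_iterations
```

Because a rejected story stays first in line, the next generator pass picks it up again together with the recorded feedback. Callers can check `all(s.passed for s in stories)` to tell a finished run from one that hit the cap.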
## Modes

| Mode | What it does | Git writes? |
|------|-------------|-------------|
| **implement** | Build features from a spec | Yes |
| **explore** | Read-only codebase analysis | No |
| **fix** | Targeted bug fixes / tech debt | Yes |

## Monitoring

After the loop launches in tmux:

```bash
# Watch live (from Claude Code)
! tmux attach -t agent-loop

# Detach back to Claude Code: press Ctrl+B, then D

# Stop the loop: press Ctrl+C in the tmux session
```

Or ask Claude Code for "status"; it reads `.loop/prd.json` and `.loop/progress.md`.

Each generator and evaluator run is a full Claude Code session saved to history. Use `claude -r` to resume any session and inspect what happened, debug a rejection, or continue from where it left off.

## Architecture

### Generator
Fresh Claude Code session each iteration. Follows a strict startup sequence before writing any code: reads `progress.md`, finds the next story from `prd.json`, reads the sprint contract, checks for evaluator feedback, reviews git history, and runs a smoke test if available. It then implements the story, runs quality gates, commits, and marks the story done.

### Evaluator
Separate fresh session after each generator pass. Skeptically verifies the work: checks each acceptance criterion against actual code with file paths and line numbers, runs tests, and issues a `PASS` or `REJECT` verdict. Rejection sends the story back with specific feedback.

Evaluator skepticism is deliberately tuned, because Claude's default tendency is to rationalize away issues. The evaluator prompt includes explicit bias correction and few-shot calibration examples.

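
One way to picture the verdict rule is the sketch below. It assumes a hypothetical `CriterionCheck` record that holds one acceptance criterion plus the file-and-line evidence the evaluator cites. The actual evaluator is a prompted Claude Code session, so this only models the decision rule described above, not how the plugin encodes it.

```python
from dataclasses import dataclass


@dataclass
class CriterionCheck:
    """Hypothetical record of one acceptance-criterion check."""
    criterion: str  # acceptance criterion text from the story
    met: bool       # whether the code satisfies it
    evidence: str   # file path and line supporting the finding, e.g. "src/auth.py:42"


def verdict(checks: list[CriterionCheck], tests_passed: bool) -> tuple[str, list[str]]:
    """Return ("PASS", []) only when every criterion is met and the tests pass.

    Otherwise return ("REJECT", feedback), where feedback lists each problem.
    """
    feedback = [f"{c.criterion}: not met ({c.evidence})" for c in checks if not c.met]
    if not tests_passed:
        feedback.append("test suite failed")
    if not checks:
        # Skeptical default: nothing verified means nothing passes.
        feedback.append("no acceptance criteria were checked")
    return ("PASS" if not feedback else "REJECT", feedback)
```

Treating an empty check list as a rejection is a design choice of this sketch: it mirrors the skeptical stance by refusing to pass work that was never verified.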
### Sprint Contracts
Before the loop starts, the planner generates a contract for each story. Each contract defines the "done" conditions that both generator and evaluator reference, eliminating ambiguity about whether work is complete.

### State Persistence

| Artifact | Purpose |
|----------|---------|
| `prd.json` | Story status (pass/fail), acceptance criteria |
| `progress.md` | Append-only session log + codebase patterns |
| `contracts/` | Sprint contracts per story |
| `config.json` | Harness configuration |
| Git commits | Code changes with story-tagged messages |

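
As a rough illustration of how a session could recover state from these artifacts instead of conversation memory, the sketch below reads a `prd.json`-style file and picks the next unfinished story. The schema is an assumption made for this example (a top-level `stories` list whose entries carry `id` and a boolean `passes`); the real file layout may differ.

```python
import json
from pathlib import Path
from typing import Optional


def load_stories(prd_path: Path) -> list[dict]:
    """Read stories from a prd.json-style file (assumed layout: {"stories": [...]})."""
    return json.loads(prd_path.read_text(encoding="utf-8"))["stories"]


def next_story(stories: list[dict]) -> Optional[dict]:
    """Return the first story not yet marked as passing, or None when all pass."""
    return next((s for s in stories if not s.get("passes", False)), None)


def summarize(stories: list[dict]) -> str:
    """Build a one-line status summary, e.g. for a "status" request."""
    done = sum(1 for s in stories if s.get("passes", False))
    return f"{done}/{len(stories)} stories passing"
```

A story with no `passes` key is treated as not yet passing, so a freshly generated story list starts with everything pending.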
## Runtime Verification

The evaluator doesn't just read diffs: it runs tests, builds the project, and checks for runtime errors using whatever tools the project already has (test runners, linters, build commands).

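
A minimal sketch of this kind of check, assuming the project's commands are supplied as a name-to-argv mapping. The `run_checks` helper and its `timeout` default are hypothetical; the evaluator itself decides at run time which tools to invoke.

```python
import subprocess
import sys


def run_checks(commands: dict[str, list[str]], timeout: int = 600) -> dict[str, bool]:
    """Run each named command and record whether it exited with status 0."""
    results: dict[str, bool] = {}
    for name, argv in commands.items():
        try:
            proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
            results[name] = proc.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            # A command that cannot start, or that hangs, counts as a failure.
            results[name] = False
    return results


# Placeholder commands; a real project would list its own test, lint, and build steps.
example = run_checks({
    "smoke": [sys.executable, "-c", "print('ok')"],
    "broken": [sys.executable, "-c", "raise SystemExit(1)"],
})
```

Here `example["smoke"]` is `True` and `example["broken"]` is `False`, since only the exit status is inspected.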
## Design Principles

- **Fresh context per iteration**: no accumulated hallucination drift
- **Separate generation from evaluation**: external skepticism is easier to tune than self-criticism
- **Human judgment for planning, AI for execution**: a human reviews the stories, then the loop executes autonomously
- **Structured handoffs via artifacts**: state passes through files and commits, not conversation memory
- **No git revert on rejection**: the next generator sees the partial work plus the evaluator's feedback, which gives it more signal
- **Tool-agnostic**: the evaluator uses whatever tools are available, with no hardcoded dependencies

## Credits

- [Geoffrey Huntley](https://ghuntley.com/ralph/): original Ralph pattern
- [Anthropic Engineering](https://www.anthropic.com/engineering/harness-design-long-running-apps): generator-evaluator harness design