fix: critical bugs, stale refs, README rewrite, security fixes
- Fix evaluator bypass on last story (moved completion check)
- Fix all stale command name references across README, loop.sh, skills, plugin.json
- Fix explore evaluator false rejects (.loop/ files are expected)
- Fix stderr capture order in headless mode
- Fix shell injection risk in hooks.sh python fallback
- Remove .DS_Store from tracking
- Rewrite README to match current architecture (single entry point, tmux, optional tools)
- Add XcodeBuildMCP and iOS simulator MCP to optional tools docs
@@ -1,7 +1,7 @@
|
||||
{
|
||||
"name": "agent-loop",
|
||||
"version": "0.8.0",
|
||||
"description": "Autonomous generator-evaluator agent loop for long-running coding tasks. Plan with /agent-loop:init, then execute with /agent-loop:run.",
|
||||
"description": "Autonomous generator-evaluator agent loop for long-running coding tasks. Run /agent-loop:run to start.",
|
||||
"author": {
|
||||
"name": "Sheldon"
|
||||
},
|
||||
|
||||
150
README.md
@@ -4,9 +4,7 @@ Autonomous AI agent harness that combines a generator-evaluator architecture wit
|
||||
|
||||
Inspired by [Geoffrey Huntley's Ralph pattern](https://ghuntley.com/ralph/) and [Anthropic's harness design research](https://www.anthropic.com/engineering/harness-design-long-running-apps).
|
||||
|
||||
A generator-evaluator loop runs fresh agent instances per iteration. Each iteration: a **Generator** does the work, then an **Evaluator** verifies it. Human judgment stays in the planning phase; execution is autonomous.
|
||||
|
||||
Two execution modes: **headless** via `loop.sh` (fully autonomous bash process) or **interactive** via `/loop-run` (Claude Code-native with full visibility and intervention).
|
||||
A generator-evaluator loop runs fresh Claude Code sessions per iteration. Each iteration: a **Generator** does the work, then an **Evaluator** verifies it. Human judgment stays in the planning phase; execution is autonomous with full visibility.
|
||||
|
||||
## Install
|
||||
|
||||
@@ -20,46 +18,32 @@ Two execution modes: **headless** via `loop.sh` (fully autonomous bash process)
|
||||
Then in any project:
|
||||
|
||||
```
|
||||
/agent-loop:init # Set up the loop for your project
|
||||
/agent-loop:plan # Generate PRD and sprint contracts
|
||||
/agent-loop:run # Run the loop interactively
|
||||
/agent-loop:run
|
||||
```
|
||||
|
||||
That's it. The single command handles setup, planning, and execution.
|
||||
|
||||
### Manual Install
|
||||
|
||||
```bash
|
||||
# Clone into your project
|
||||
cp -r /path/to/loop-loop .loop
|
||||
|
||||
# Install skills as Claude Code commands
|
||||
mkdir -p .claude/commands
|
||||
for skill in loop-init loop-plan loop-run loop-triage; do
|
||||
ln -sf "../../.loop/skills/$skill/SKILL.md" ".claude/commands/$skill.md"
|
||||
done
|
||||
|
||||
# Then in Claude Code:
|
||||
/loop-init && /loop-plan && /loop-run
|
||||
```
|
||||
|
||||
Then run `.loop/loop.sh` directly.
|
||||
|
||||
## How It Works
|
||||
|
||||
```
|
||||
[You + Claude Code] [Loop Execution]
|
||||
|
||||
/agent-loop:init Interactive (/agent-loop:run)
|
||||
→ scaffolds .loop/ └─ dispatches Agent subagents
|
||||
→ detects project └─ visible tool calls, can intervene
|
||||
→ picks mode └─ chat mid-loop to adjust course
|
||||
→ creates config.json
|
||||
Headless (.loop/loop.sh)
|
||||
/agent-loop:plan └─ spawns claude --print per iteration
|
||||
→ asks clarifying questions └─ fully autonomous, no UI
|
||||
→ generates prd.json
|
||||
→ generates sprint contracts Both paths:
|
||||
→ populates progress.md ├─→ Generator → picks story → implements → commits
|
||||
├─→ Evaluator → verifies → PASS or REJECT
|
||||
├─→ next iteration...
|
||||
└─→ all stories pass → done
|
||||
/agent-loop:run
|
||||
├─ Phase 1: Scaffold .loop/ (if needed)
|
||||
├─ Phase 2: Generate stories from spec (if needed)
|
||||
│ └─ Presents stories for human review
|
||||
│ └─ STOPS — user reviews and says "go"
|
||||
└─ Phase 3: Launch loop in tmux
|
||||
├─→ Generator → picks story → implements → commits
|
||||
├─→ Evaluator → verifies → PASS or REJECT
|
||||
├─→ next iteration (fresh CC session each time)
|
||||
└─→ all stories pass → done
|
||||
```
|
||||
|
||||
## Modes
|
||||
@@ -70,47 +54,48 @@ done
|
||||
| **explore** | Read-only codebase analysis | No |
|
||||
| **fix** | Targeted bug fixes / tech debt | Yes |
|
||||
|
||||
## Running the Loop
|
||||
## Monitoring
|
||||
|
||||
### Option A: Interactive (`/loop-run`) — Recommended
|
||||
|
||||
Run inside Claude Code. You see every tool call, file edit, and test run. You can intervene at any point — deny a tool call, chat to adjust course, or stop the loop.
|
||||
|
||||
```
|
||||
/loop-run # Run until done or max iterations
|
||||
/loop-run 3 # Run at most 3 iterations
|
||||
/loop-run --skip-eval # Skip evaluator pass
|
||||
/loop-run --story US-003 # Run only a specific story
|
||||
```
|
||||
|
||||
### Option B: Headless (`loop.sh`)
|
||||
|
||||
Run as a standalone bash process. Fully autonomous — no UI, no intervention. Useful for background execution or CI.
|
||||
After the loop launches in tmux:
|
||||
|
||||
```bash
|
||||
.loop/loop.sh [options]
|
||||
# Watch live (from Claude Code)
|
||||
! tmux attach -t agent-loop
|
||||
|
||||
# Detach back to Claude Code
|
||||
Ctrl+B then D
|
||||
|
||||
# Stop the loop
|
||||
Ctrl+C in the tmux session
|
||||
```
|
||||
|
||||
Or ask Claude Code "status" — it reads `.loop/prd.json` and `.loop/progress.md`.
|
||||
|
||||
## Headless Mode
|
||||
|
||||
For CI or background execution without the interactive UI:
|
||||
|
||||
```bash
|
||||
.loop/loop.sh --headless [options]
|
||||
|
||||
--mode <implement|explore|fix> Operating mode
|
||||
--max <N> Maximum iterations (default: 20)
|
||||
--skip-eval Skip evaluator pass
|
||||
--tool <claude|amp> AI tool to use
|
||||
--no-hooks Don't install stop hooks
|
||||
--dry-run Print assembled prompts without running agents
|
||||
--resume Skip already-passed stories (explicit exit when none remain)
|
||||
--dry-run Print assembled prompts without running
|
||||
```
|
||||
|
||||
## Architecture
|
||||
|
||||
### Generator
|
||||
Fresh Claude Code instance each iteration. Reads `prd.json` to find the highest-priority incomplete story, reads the sprint contract, implements the story, runs quality gates, commits, and marks it done.
|
||||
Fresh Claude Code session each iteration. Reads `prd.json` to find the highest-priority incomplete story, reads the sprint contract, implements the story, runs quality gates, commits, and marks it done.
|
||||
|
||||
### Evaluator
|
||||
Separate fresh instance after each generator pass. Skeptically verifies the work: checks acceptance criteria against actual code, runs tests independently, and issues a `PASS` or `REJECT` verdict. Rejection sends the story back to the generator with specific feedback.
|
||||
Separate fresh session after each generator pass. Skeptically verifies the work: checks acceptance criteria against actual code, runs tests and the application, and issues a `PASS` or `REJECT` verdict. Rejection sends the story back with specific feedback.
|
||||
|
||||
Evaluator skepticism is deliberately tuned — Claude's default tendency is to rationalize away issues. The evaluator prompt includes explicit bias correction.
|
||||
|
||||
### Sprint Contracts
|
||||
Before the loop starts, `/loop-plan` generates contracts for each story. These define "done" conditions that both generator and evaluator reference, eliminating ambiguity about whether work is complete.
|
||||
Before the loop starts, the planner generates contracts for each story. These define "done" conditions that both generator and evaluator reference, eliminating ambiguity about whether work is complete.
|
||||
|
||||
### State Persistence
|
||||
|
||||
@@ -122,59 +107,36 @@ Before the loop starts, `/loop-plan` generates contracts for each story. These d
|
||||
| `config.json` | Harness configuration |
|
||||
| Git commits | Code changes with story-tagged messages |
|
||||
|
||||
## File Structure
|
||||
## Optional: Runtime Testing Tools
|
||||
|
||||
```
|
||||
.loop/
|
||||
loop.sh # Main loop orchestrator
|
||||
config.json # Project config (generated by /loop-init)
|
||||
init.sh # Project setup script (generated by /loop-init)
|
||||
prd.json # Active PRD (generated by /loop-plan)
|
||||
progress.md # Cross-session memory (append-only)
|
||||
|
||||
prompts/
|
||||
generator/_base.md # Shared generator instructions
|
||||
generator/implement.md # Implement mode overlay
|
||||
generator/explore.md # Explore mode overlay
|
||||
generator/fix.md # Fix mode overlay
|
||||
evaluator/_base.md # Skeptical evaluator base
|
||||
evaluator/implement.md # Implement verification
|
||||
evaluator/explore.md # Analysis verification
|
||||
evaluator/fix.md # Fix verification
|
||||
planner/plan.md # Planning context
|
||||
|
||||
templates/ # Reference templates
|
||||
lib/ # Shell library functions
|
||||
skills/ # Claude Code skills (/loop-init, /loop-plan, /loop-run, /loop-triage)
|
||||
contracts/ # Sprint contracts (generated by /loop-plan)
|
||||
triage/ # Analysis output (explore mode)
|
||||
archive/ # Completed feature archives
|
||||
```
|
||||
|
||||
## Browser Testing (Optional)
|
||||
|
||||
The evaluator includes basic runtime verification for web projects (starts a local server, checks HTTP response). For full browser testing with console error detection and screenshots, install the Playwright MCP server:
|
||||
The evaluator verifies code actually runs, not just that it looks correct. It uses whatever tools are available. For richer verification, install these optional MCP servers:
|
||||
|
||||
**Web projects (Playwright):**
|
||||
```bash
|
||||
claude mcp add playwright npx @playwright/mcp@latest --headless --browser=chromium
|
||||
```
|
||||
|
||||
When Playwright is available, the evaluator will use it to:
|
||||
- Navigate to the running application
|
||||
- Check for JavaScript console errors
|
||||
- Take screenshots for visual verification
|
||||
- Reject stories with runtime errors
|
||||
**iOS/Xcode projects (XcodeBuildMCP):**
|
||||
```bash
|
||||
brew tap getsentry/xcodebuildmcp && brew install xcodebuildmcp
|
||||
claude mcp add xcodebuild -- xcodebuildmcp
|
||||
```
|
||||
|
||||
This is optional — the evaluator works without it, but may miss runtime issues that only surface in a browser.
|
||||
**iOS Simulator interaction:**
|
||||
```bash
|
||||
claude mcp add ios-simulator -- npx -y ios-simulator-mcp
|
||||
```
|
||||
|
||||
These are optional — the evaluator works without them but may miss runtime-only issues.
|
||||
|
||||
## Design Principles
|
||||
|
||||
- **Fresh context per iteration** — no accumulated hallucination drift
|
||||
- **Separate generation from evaluation** — external skepticism is easier to tune than self-criticism
|
||||
- **Human judgment for planning, AI for execution** — interactive `/loop-plan`, autonomous loop
|
||||
- **Human judgment for planning, AI for execution** — human reviews stories, loop executes autonomously
|
||||
- **Structured handoffs via artifacts** — not conversation memory
|
||||
- **No git revert on rejection** — next generator sees partial work + feedback (more signal)
|
||||
- **Advisory scope budgets** — prompt-enforced limits on files read/written per iteration
|
||||
- **Tool-agnostic** — evaluator uses whatever tools are available, no hardcoded dependencies
|
||||
|
||||
## Credits
|
||||
|
||||
|
||||
@@ -5,7 +5,7 @@
|
||||
# 1. Copies the harness to ~/.claude/loop/ (prompts, templates, lib, loop.sh)
|
||||
# 2. Installs skills as Claude Code commands at ~/.claude/commands/
|
||||
#
|
||||
# After install, use /loop-init in any project to get started.
|
||||
# After install, use /agent-loop:run in any project to get started.
|
||||
#
|
||||
# Usage:
|
||||
# ./install.sh # Install
|
||||
@@ -18,7 +18,7 @@ HARNESS_DIR="$CLAUDE_DIR/loop"
|
||||
COMMANDS_DIR="$CLAUDE_DIR/commands"
|
||||
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
|
||||
|
||||
SKILLS=(loop-init loop-plan loop-run loop-triage)
|
||||
SKILLS=(setup stories run triage)
|
||||
|
||||
# --- Colors (if terminal supports them) ---
|
||||
if [ -t 1 ]; then
|
||||
@@ -100,9 +100,7 @@ info "${BOLD}Installation complete.${RESET}"
|
||||
echo ""
|
||||
echo " Next steps (inside Claude Code, in any project):"
|
||||
echo ""
|
||||
echo " /loop-init # Set up the loop for your project"
|
||||
echo " /loop-plan # Generate PRD and sprint contracts"
|
||||
echo " /loop-run # Run the loop interactively"
|
||||
echo " /agent-loop:run # Single command — setup, plan, and run"
|
||||
echo ""
|
||||
echo " Or run headless: .loop/loop.sh"
|
||||
echo ""
|
||||
|
||||
13
lib/hooks.sh
@@ -20,9 +20,9 @@ install_hooks() {
|
||||
jq '.hooks.Stop = [{"matcher": "", "hooks": [{"type": "command", "command": "kill -INT $PPID || true"}]}]' \
|
||||
"$SETTINGS_FILE" > "${SETTINGS_FILE}.tmp" && mv "${SETTINGS_FILE}.tmp" "$SETTINGS_FILE"
|
||||
else
|
||||
python3 -c "
|
||||
LOOP_SETTINGS="$SETTINGS_FILE" python3 -c "
|
||||
import json, os
|
||||
p = '$SETTINGS_FILE'
|
||||
p = os.environ['LOOP_SETTINGS']
|
||||
s = json.load(open(p)) if os.path.exists(p) else {}
|
||||
s.setdefault('hooks', {})['Stop'] = [{'matcher': '', 'hooks': [{'type': 'command', 'command': 'kill -INT \$PPID || true'}]}]
|
||||
json.dump(s, open(p, 'w'), indent=2)
|
||||
@@ -37,12 +37,13 @@ remove_hooks() {
|
||||
jq 'del(.hooks.Stop)' "$SETTINGS_FILE" > "${SETTINGS_FILE}.tmp" && mv "${SETTINGS_FILE}.tmp" "$SETTINGS_FILE"
|
||||
jq 'if .hooks == {} then del(.hooks) else . end' "$SETTINGS_FILE" > "${SETTINGS_FILE}.tmp" && mv "${SETTINGS_FILE}.tmp" "$SETTINGS_FILE"
|
||||
else
|
||||
python3 -c "
|
||||
import json
|
||||
s = json.load(open('$SETTINGS_FILE'))
|
||||
LOOP_SETTINGS="$SETTINGS_FILE" python3 -c "
|
||||
import json, os
|
||||
p = os.environ['LOOP_SETTINGS']
|
||||
s = json.load(open(p))
|
||||
s.get('hooks', {}).pop('Stop', None)
|
||||
if not s.get('hooks'): s.pop('hooks', None)
|
||||
json.dump(s, open('$SETTINGS_FILE', 'w'), indent=2)
|
||||
json.dump(s, open(p, 'w'), indent=2)
|
||||
"
|
||||
fi
|
||||
log "Stop hook removed"
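The hooks.sh change above stops pasting `$SETTINGS_FILE` into the Python source and reads it from the environment instead. The sketch below shows the same idea in isolation. It uses `bash -c` as a stand-in for `python3 -c` so it runs without Python, and the path with an embedded quote is a hypothetical example, not a value from the repository.

```bash
# Hypothetical settings path containing a single quote, which would end a
# quoted string literal if pasted into program source.
SETTINGS_FILE="/tmp/it's settings.json"

# Unsafe pattern (the shape of the old fallback, with bash -c standing in for
# python3 -c): the value becomes part of the source text, so the stray quote
# causes a syntax error here, and crafted input could run arbitrary code.
unsafe_status=0
bash -c "p='$SETTINGS_FILE'; printf '%s\n' \"\$p\"" >/dev/null 2>&1 || unsafe_status=$?

# Safe pattern: the source text is constant and the value travels as data in
# the child process environment.
safe_output=$(LOOP_SETTINGS="$SETTINGS_FILE" bash -c 'printf "%s\n" "$LOOP_SETTINGS"')

echo "unsafe exit status: $unsafe_status"
echo "safe output: $safe_output"
```

The safe variant prints the path unchanged whatever characters it contains, because the child process never parses it as code.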
|
||||
|
||||
29
loop.sh
@@ -124,7 +124,7 @@ while [[ $# -gt 0 ]]; do
|
||||
--dry-run) DRY_RUN=true; shift ;;
|
||||
--headless) export LOOP_HEADLESS=true; shift ;;
|
||||
--resume) RESUME=true; shift ;;
|
||||
--replan) log "ERROR: --replan is not yet implemented. Use /loop-plan interactively."; exit 1 ;;
|
||||
--replan) log "ERROR: --replan is not yet implemented. Use /agent-loop:stories interactively."; exit 1 ;;
|
||||
[0-9]*) MAX_ITERATIONS="$1"; shift ;;
|
||||
*) log "Unknown option: $1"; exit 1 ;;
|
||||
esac
|
||||
@@ -162,7 +162,7 @@ check_archive
|
||||
|
||||
# Validate prd.json exists (AFTER archive check, which may delete it on branch change)
|
||||
if [ ! -f "$LOOP_DIR/prd.json" ]; then
|
||||
log "ERROR: No prd.json found. Run /loop-plan first to create one."
|
||||
log "ERROR: No prd.json found. Run /agent-loop:stories first to create one."
|
||||
exit 1
|
||||
fi
|
||||
|
||||
@@ -240,11 +240,11 @@ run_agent() {
|
||||
claude)
|
||||
printf '%s\n' "$prompt" | timeout "${LOOP_AGENT_TIMEOUT:-600}" \
|
||||
claude --dangerously-skip-permissions --output-format text \
|
||||
--print 2>&1 > "$output_file"
|
||||
--print > "$output_file" 2>&1
|
||||
;;
|
||||
amp)
|
||||
printf '%s\n' "$prompt" | timeout "${LOOP_AGENT_TIMEOUT:-600}" \
|
||||
amp --dangerously-allow-all 2>&1 > "$output_file"
|
||||
amp --dangerously-allow-all > "$output_file" 2>&1
|
||||
;;
|
||||
*)
|
||||
log "ERROR: Unknown tool '$TOOL'"
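The redirect reordering in `run_agent` matters because the shell applies redirections left to right. A minimal sketch of the difference, using a hypothetical `emit` function in place of the `claude` and `amp` commands:

```bash
# Hypothetical stand-in for an agent CLI: one line on stdout, one on stderr.
emit() {
  echo "result"
  echo "warning" >&2
}

tmpdir=$(mktemp -d)

# Old order: stderr is duplicated onto the current stdout first, and only
# then is stdout pointed at the file, so the warning never reaches the file.
# The command substitution captures what escapes.
leaked=$(emit 2>&1 > "$tmpdir/old.txt")

# New order: stdout goes to the file first, then stderr follows it there.
emit > "$tmpdir/new.txt" 2>&1

echo "old file: $(tr '\n' ' ' < "$tmpdir/old.txt")"
echo "leaked:   $leaked"
echo "new file: $(tr '\n' ' ' < "$tmpdir/new.txt")"
```

With the old order, agent errors in headless mode would go to the terminal rather than into `$output_file`, so anything that later inspects the captured output would not see them.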
|
||||
@@ -319,7 +319,7 @@ while [ "$ITERATION" -lt "$MAX_ITERATIONS" ]; do
|
||||
fi
|
||||
snapshot_for_archive
|
||||
if any_stories_blocked 2>/dev/null; then
|
||||
log "Some stories are blocked and need human review. Run /loop-triage for details."
|
||||
log "Some stories are blocked and need human review. Run /agent-loop:triage for details."
|
||||
exit $EXIT_ALL_BLOCKED
|
||||
fi
|
||||
exit $EXIT_OK
|
||||
@@ -364,7 +364,7 @@ while [ "$ITERATION" -lt "$MAX_ITERATIONS" ]; do
|
||||
# --- Scope budget check ---
|
||||
# Verify the generator stayed within configured limits (files modified, lines written).
|
||||
# Advisory in implement/fix modes (log warning), but enforced as rejection reason for evaluator.
|
||||
if [ -n "$PRE_GENERATOR_SHA" ] && [ "$PRE_GENERATOR_SHA" != "" ]; then
|
||||
if [ -n "$PRE_GENERATOR_SHA" ]; then
|
||||
SCOPE_FILES_MODIFIED=$(git diff --name-only "$PRE_GENERATOR_SHA" HEAD 2>/dev/null | wc -l | tr -d ' ')
|
||||
SCOPE_LINES_WRITTEN=$(git diff --stat "$PRE_GENERATOR_SHA" HEAD 2>/dev/null | tail -1 | grep -oE '[0-9]+ insertion' | grep -oE '[0-9]+' || echo "0")
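The scope check above extracts the insertion count from the summary line of `git diff --stat`. A possible alternative, shown only as a sketch and not what loop.sh does, is to sum the first column of `git diff --numstat`, which is tab-separated and meant for machine parsing. The helper name `count_insertions` is hypothetical.

```bash
# Sums the "added" column of `git diff --numstat` output read from stdin.
# Binary files report "-" in that column and are skipped.
count_insertions() {
  awk '$1 ~ /^[0-9]+$/ { total += $1 } END { print total + 0 }'
}

# Sample numstat-shaped input: added<TAB>deleted<TAB>path
printf '10\t2\tloop.sh\n-\t-\tlogo.png\n5\t0\tREADME.md\n' | count_insertions
```

In the loop this could be fed with `git diff --numstat "$PRE_GENERATOR_SHA" HEAD | count_insertions`. It prints `0` for an empty diff, so no `|| echo "0"` fallback is needed.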
|
||||
|
||||
@@ -381,18 +381,9 @@ while [ "$ITERATION" -lt "$MAX_ITERATIONS" ]; do
|
||||
export SCOPE_FILES_MODIFIED SCOPE_LINES_WRITTEN
|
||||
fi
|
||||
|
||||
# Check for completion — in interactive mode, check prd.json directly
|
||||
if all_stories_pass 2>/dev/null; then
|
||||
log_header "All Stories Complete! ($(story_counts))"
|
||||
snapshot_for_archive
|
||||
exit 0
|
||||
fi
|
||||
# Headless mode: also check output sentinel
|
||||
if [ -n "$GENERATOR_OUTPUT" ] && echo "$GENERATOR_OUTPUT" | grep -q "<promise>COMPLETE</promise>"; then
|
||||
log_header "Generator signaled COMPLETE ($(story_counts))"
|
||||
snapshot_for_archive
|
||||
exit 0
|
||||
fi
|
||||
# NOTE: Do NOT check all_stories_pass here. The generator marks its own story
|
||||
# as passed, but the evaluator hasn't verified yet. Checking here would skip
|
||||
# evaluation on the last story. The completion check is at the top of the loop.
|
||||
|
||||
# --- Evaluator pass ---
|
||||
if [ "$SKIP_EVAL" != true ]; then
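The NOTE above describes an ordering problem: the generator marks its own story as passed, so a completion check placed between generator and evaluator exits before the last story is verified. The simulation below is a simplified sketch of the corrected ordering. The state variables are hypothetical stand-ins for what the real loop reads from `.loop/prd.json`.

```bash
# Simulated state; the real loop reads this from .loop/prd.json.
story_marked_passed=false   # set by the generator on its own story
evaluator_runs=0

# Stand-in for the all_stories_pass helper referenced in loop.sh.
all_stories_pass() { [ "$story_marked_passed" = true ]; }

run_loop() {
  local iteration=0 max_iterations=3
  while [ "$iteration" -lt "$max_iterations" ]; do
    # Completion check lives here: any story marked passed at this point
    # has already been through the previous iteration's evaluator.
    if all_stories_pass; then
      return 0
    fi
    iteration=$((iteration + 1))

    story_marked_passed=true                # generator finishes the last story
    # No completion check at this point; exiting here would skip evaluation.
    evaluator_runs=$((evaluator_runs + 1))  # evaluator verifies the story
  done
  return 1
}

run_loop
echo "evaluator runs: $evaluator_runs"
```

Moving the `all_stories_pass` check back to just after the generator step would make the function return with `evaluator_runs` still at zero, which is the bypass this commit fixes.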
|
||||
@@ -460,6 +451,6 @@ done
|
||||
# --- Max iterations reached ---
|
||||
log_header "Max Iterations Reached ($MAX_ITERATIONS)"
|
||||
log "Stories completed: $(story_counts)"
|
||||
log "Run /loop-triage to generate a handoff brief."
|
||||
log "Run /agent-loop:triage to generate a handoff brief."
|
||||
snapshot_for_archive
|
||||
exit $EXIT_MAX_ITERATIONS
|
||||
|
||||
@@ -6,7 +6,7 @@ You are evaluating an analysis/exploration task. The generator claims to have an
|
||||
|
||||
Before any other checks, verify explore mode's read-only constraint:
|
||||
1. Run `git diff {{PRE_GENERATOR_SHA}}..HEAD --name-only`
|
||||
2. If ANY file outside `.loop/triage/` was modified or committed, **REJECT immediately** — explore mode is read-only. The generator must not modify host project files.
|
||||
2. If ANY file outside `.loop/` was modified or committed, **REJECT immediately** — explore mode is read-only. The generator must not modify host project files. (Files inside `.loop/` like `prd.json` and `progress.md` are expected.)
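The read-only rule above can also be checked mechanically. The helper below is a sketch under the assumption that changed paths arrive one per line, as `git diff --name-only` prints them. The function name `explore_is_read_only` is hypothetical and not part of the harness.

```bash
# Reads changed paths on stdin (one per line) and fails if any path lies
# outside .loop/. Paths inside .loop/ (prd.json, progress.md, triage/) pass.
explore_is_read_only() {
  local path violations=0
  while IFS= read -r path; do
    [ -z "$path" ] && continue
    case "$path" in
      .loop/*) ;;  # harness state: expected
      *)
        echo "outside .loop/: $path" >&2
        violations=$((violations + 1))
        ;;
    esac
  done
  [ "$violations" -eq 0 ]
}
```

A possible invocation is `git diff "$PRE_GENERATOR_SHA"..HEAD --name-only | explore_is_read_only || echo "REJECT"`, which reports each offending path on stderr before rejecting.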
|
||||
|
||||
## Exploration-Specific Checks
|
||||
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
# Planner Context
|
||||
|
||||
This file is loaded by the `/loop-plan` skill to provide additional context for PRD generation.
|
||||
This file provides additional context for PRD generation.
|
||||
|
||||
## Story Decomposition Guidelines
|
||||
|
||||
|
||||
4
setup.sh
@@ -1,7 +1,7 @@
|
||||
#!/bin/bash
|
||||
# Agent Loop — project setup script
|
||||
# Scaffolds .loop/ directory and generates config.json.
|
||||
# Called by /agent-loop:init or run directly.
|
||||
# Called by /agent-loop:setup or /agent-loop:run, or run directly.
|
||||
#
|
||||
# Usage:
|
||||
# setup.sh <mode> # mode: implement, explore, or fix
|
||||
@@ -120,5 +120,5 @@ echo "[setup] Mode: $MODE"
|
||||
echo "[setup] Config: .loop/config.json"
|
||||
echo ""
|
||||
echo "Next steps (in Claude Code):"
|
||||
echo " /agent-loop:plan # Generate stories from your spec or description"
|
||||
echo " /agent-loop:stories # Generate stories from your spec or description"
|
||||
echo ""
|
||||
|
||||
@@ -3,7 +3,7 @@ name: setup
|
||||
description: "Run the setup script to scaffold .loop/ directory. Does not plan features or write code."
|
||||
---
|
||||
|
||||
# /init — Scaffold the Agent Loop
|
||||
# /setup — Scaffold the Agent Loop
|
||||
|
||||
Run the setup script to create `.loop/` with harness files and config. This skill does ONE thing: run a bash command.
|
||||
|
||||
|
||||
@@ -3,7 +3,7 @@ name: stories
|
||||
description: "Generate prd.json and sprint contracts by dispatching the planner agent. Does not write source code."
|
||||
---
|
||||
|
||||
# /plan — Generate PRD and Sprint Contracts
|
||||
# /stories — Generate PRD and Sprint Contracts
|
||||
|
||||
Dispatch the planner agent to decompose a spec into stories. The planner agent cannot write source code or run bash commands — it can only write to `.loop/`.
|
||||
|
||||
@@ -11,7 +11,7 @@ Dispatch the planner agent to decompose a spec into stories. The planner agent c
|
||||
|
||||
### 1. Check prerequisites
|
||||
|
||||
Verify `.loop/config.json` exists. If not, tell the user to run `/agent-loop:init` first and stop.
|
||||
Verify `.loop/config.json` exists. If not, tell the user to run `/agent-loop:setup` first and stop.
|
||||
|
||||
### 2. Find the spec
|
||||
|
||||
|
||||