fix: critical bugs, stale refs, README rewrite, security fixes

- Fix evaluator bypass on last story (moved completion check)
- Fix all stale command name references across README, loop.sh, skills, plugin.json
- Fix explore evaluator false rejects (.loop/ files are expected)
- Fix stderr capture order in headless mode
- Fix shell injection risk in hooks.sh python fallback
- Remove .DS_Store from tracking
- Rewrite README to match current architecture (single entry point, tmux, optional tools)
- Add XcodeBuildMCP and iOS simulator MCP to optional tools docs
2026-03-27 14:58:01 -04:00
parent f3cbfd258c
commit b3d263258a
10 changed files with 84 additions and 132 deletions
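
The "stderr capture order" item is easiest to see in isolation. The sketch below is illustrative only and is not code from this commit: `emit` is a hypothetical stand-in for the agent CLI, and the temp-file names are made up. It assumes a POSIX shell, where redirections are applied left to right, so `2>&1 > file` sends stderr to wherever stdout pointed before the file redirect, while `> file 2>&1` sends both streams to the file.

```shell
#!/bin/sh
# Hypothetical stand-in for the agent CLI: one line per stream.
emit() {
    echo "to stdout"
    echo "to stderr" >&2
}

tmp=$(mktemp -d)
old_file="$tmp/old.txt"
new_file="$tmp/new.txt"

# Old order: stderr is duplicated onto the current stdout (here, the
# command-substitution pipe) BEFORE stdout is pointed at the file,
# so the stderr line never reaches the file.
leaked=$(emit 2>&1 > "$old_file")

# New order: stdout goes to the file first, then stderr follows it.
emit > "$new_file" 2>&1

old_content=$(cat "$old_file")
new_content=$(cat "$new_file")
rm -rf "$tmp"

printf 'old order, file holds:    %s\n' "$old_content"
printf 'old order, leaked output: %s\n' "$leaked"
printf 'new order, file holds:\n%s\n' "$new_content"
```

With the old order the output file holds only the stdout line; with the new order it holds both, which is the behavior the headless capture relies on.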
--- a/.claude-plugin/plugin.json
+++ b/.claude-plugin/plugin.json
@@ -1,7 +1,7 @@
 {
  "name": "agent-loop",
  "version": "0.8.0",
-  "description": "Autonomous generator-evaluator agent loop for long-running coding tasks. Plan with /agent-loop:init, then execute with /agent-loop:run.",
+  "description": "Autonomous generator-evaluator agent loop for long-running coding tasks. Run /agent-loop:run to start.",
  "author": {
    "name": "Sheldon"
  },
--- a/README.md
+++ b/README.md
@@ -4,9 +4,7 @@ Autonomous AI agent harness that combines a generator-evaluator architecture wit

 Inspired by [Geoffrey Huntley's Ralph pattern](https://ghuntley.com/ralph/) and [Anthropic's harness design research](https://www.anthropic.com/engineering/harness-design-long-running-apps).

-A generator-evaluator loop runs fresh agent instances per iteration. Each iteration: a **Generator** does the work, then an **Evaluator** verifies it. Human judgment stays in the planning phase; execution is autonomous.
-
-Two execution modes: **headless** via `loop.sh` (fully autonomous bash process) or **interactive** via `/loop-run` (Claude Code-native with full visibility and intervention).
+A generator-evaluator loop runs fresh Claude Code sessions per iteration. Each iteration: a **Generator** does the work, then an **Evaluator** verifies it. Human judgment stays in the planning phase; execution is autonomous with full visibility.

 ## Install

@@ -20,46 +18,32 @@ Two execution modes: **headless** via `loop.sh` (fully autonomous bash process)
 Then in any project:

 ```
-/agent-loop:init          # Set up the loop for your project
-/agent-loop:plan          # Generate PRD and sprint contracts
-/agent-loop:run           # Run the loop interactively
+/agent-loop:run
 ```

+That's it. The single command handles setup, planning, and execution.
+
 ### Manual Install

 ```bash
-# Clone into your project
 cp -r /path/to/loop-loop .loop
-
-# Install skills as Claude Code commands
-mkdir -p .claude/commands
-for skill in loop-init loop-plan loop-run loop-triage; do
-    ln -sf "../../.loop/skills/$skill/SKILL.md" ".claude/commands/$skill.md"
-done
-
-# Then in Claude Code:
-/loop-init && /loop-plan && /loop-run
 ```

+Then run `.loop/loop.sh` directly.
+
 ## How It Works

 ```
-[You + Claude Code]                    [Loop Execution]
-
-/agent-loop:init                       Interactive (/agent-loop:run)
-  → scaffolds .loop/                     └─ dispatches Agent subagents
-  → detects project                      └─ visible tool calls, can intervene
-  → picks mode                           └─ chat mid-loop to adjust course
-  → creates config.json
-                                        Headless (.loop/loop.sh)
-/agent-loop:plan                         └─ spawns claude --print per iteration
-  → asks clarifying questions            └─ fully autonomous, no UI
-  → generates prd.json
-  → generates sprint contracts          Both paths:
-  → populates progress.md                ├─→ Generator → picks story → implements → commits
-                                          ├─→ Evaluator → verifies → PASS or REJECT
-                                          ├─→ next iteration...
-                                          └─→ all stories pass → done
+/agent-loop:run
+  ├─ Phase 1: Scaffold .loop/ (if needed)
+  ├─ Phase 2: Generate stories from spec (if needed)
+  │    └─ Presents stories for human review
+  │    └─ STOPS — user reviews and says "go"
+  └─ Phase 3: Launch loop in tmux
+       ├─→ Generator → picks story → implements → commits
+       ├─→ Evaluator → verifies → PASS or REJECT
+       ├─→ next iteration (fresh CC session each time)
+       └─→ all stories pass → done
 ```

 ## Modes
@@ -70,47 +54,48 @@ done
 | **explore** | Read-only codebase analysis | No |
 | **fix** | Targeted bug fixes / tech debt | Yes |

-## Running the Loop
+## Monitoring

-### Option A: Interactive (`/loop-run`) — Recommended
-
-Run inside Claude Code. You see every tool call, file edit, and test run. You can intervene at any point — deny a tool call, chat to adjust course, or stop the loop.
-
-```
-/loop-run                    # Run until done or max iterations
-/loop-run 3                  # Run at most 3 iterations
-/loop-run --skip-eval        # Skip evaluator pass
-/loop-run --story US-003     # Run only a specific story
-```
-
-### Option B: Headless (`loop.sh`)
-
-Run as a standalone bash process. Fully autonomous — no UI, no intervention. Useful for background execution or CI.
+After the loop launches in tmux:

 ```bash
-.loop/loop.sh [options]
+# Watch live (from Claude Code)
+! tmux attach -t agent-loop
+
+# Detach back to Claude Code
+Ctrl+B then D
+
+# Stop the loop
+Ctrl+C in the tmux session
+```
+
+Or ask Claude Code "status" — it reads `.loop/prd.json` and `.loop/progress.md`.
+
+## Headless Mode
+
+For CI or background execution without the interactive UI:
+
+```bash
+.loop/loop.sh --headless [options]

 --mode <implement|explore|fix>   Operating mode
 --max <N>                        Maximum iterations (default: 20)
 --skip-eval                      Skip evaluator pass
---tool <claude|amp>              AI tool to use
---no-hooks                       Don't install stop hooks
---dry-run                        Print assembled prompts without running agents
---resume                         Skip already-passed stories (explicit exit when none remain)
+--dry-run                        Print assembled prompts without running
 ```

 ## Architecture

 ### Generator
-Fresh Claude Code instance each iteration. Reads `prd.json` to find the highest-priority incomplete story, reads the sprint contract, implements the story, runs quality gates, commits, and marks it done.
+Fresh Claude Code session each iteration. Reads `prd.json` to find the highest-priority incomplete story, reads the sprint contract, implements the story, runs quality gates, commits, and marks it done.

 ### Evaluator
-Separate fresh instance after each generator pass. Skeptically verifies the work: checks acceptance criteria against actual code, runs tests independently, and issues a `PASS` or `REJECT` verdict. Rejection sends the story back to the generator with specific feedback.
+Separate fresh session after each generator pass. Skeptically verifies the work: checks acceptance criteria against actual code, runs tests and the application, and issues a `PASS` or `REJECT` verdict. Rejection sends the story back with specific feedback.

 Evaluator skepticism is deliberately tuned — Claude's default tendency is to rationalize away issues. The evaluator prompt includes explicit bias correction.

 ### Sprint Contracts
-Before the loop starts, `/loop-plan` generates contracts for each story. These define "done" conditions that both generator and evaluator reference, eliminating ambiguity about whether work is complete.
+Before the loop starts, the planner generates contracts for each story. These define "done" conditions that both generator and evaluator reference, eliminating ambiguity about whether work is complete.

 ### State Persistence

@@ -122,59 +107,36 @@ Before the loop starts, `/loop-plan` generates contracts for each story. These d
 | `config.json` | Harness configuration |
 | Git commits | Code changes with story-tagged messages |

-## File Structure
+## Optional: Runtime Testing Tools

-```
-.loop/
-  loop.sh                        # Main loop orchestrator
-  config.json                    # Project config (generated by /loop-init)
-  init.sh                        # Project setup script (generated by /loop-init)
-  prd.json                       # Active PRD (generated by /loop-plan)
-  progress.md                    # Cross-session memory (append-only)
-
-  prompts/
-    generator/_base.md           # Shared generator instructions
-    generator/implement.md       # Implement mode overlay
-    generator/explore.md         # Explore mode overlay
-    generator/fix.md             # Fix mode overlay
-    evaluator/_base.md           # Skeptical evaluator base
-    evaluator/implement.md       # Implement verification
-    evaluator/explore.md         # Analysis verification
-    evaluator/fix.md             # Fix verification
-    planner/plan.md              # Planning context
-
-  templates/                     # Reference templates
-  lib/                           # Shell library functions
-  skills/                        # Claude Code skills (/loop-init, /loop-plan, /loop-run, /loop-triage)
-  contracts/                     # Sprint contracts (generated by /loop-plan)
-  triage/                        # Analysis output (explore mode)
-  archive/                       # Completed feature archives
-```
-
-## Browser Testing (Optional)
-
-The evaluator includes basic runtime verification for web projects (starts a local server, checks HTTP response). For full browser testing with console error detection and screenshots, install the Playwright MCP server:
+The evaluator verifies code actually runs, not just that it looks correct. It uses whatever tools are available. For richer verification, install these optional MCP servers:

+**Web projects (Playwright):**
 ```bash
 claude mcp add playwright npx @playwright/mcp@latest --headless --browser=chromium
 ```

-When Playwright is available, the evaluator will use it to:
-- Navigate to the running application
-- Check for JavaScript console errors
-- Take screenshots for visual verification
-- Reject stories with runtime errors
+**iOS/Xcode projects (XcodeBuildMCP):**
+```bash
+brew tap getsentry/xcodebuildmcp && brew install xcodebuildmcp
+claude mcp add xcodebuild -- xcodebuildmcp
+```

-This is optional — the evaluator works without it, but may miss runtime issues that only surface in a browser.
+**iOS Simulator interaction:**
+```bash
+claude mcp add ios-simulator -- npx -y ios-simulator-mcp
+```
+
+These are optional — the evaluator works without them but may miss runtime-only issues.

 ## Design Principles

 - **Fresh context per iteration** — no accumulated hallucination drift
 - **Separate generation from evaluation** — external skepticism is easier to tune than self-criticism
-- **Human judgment for planning, AI for execution** — interactive `/loop-plan`, autonomous loop
+- **Human judgment for planning, AI for execution** — human reviews stories, loop executes autonomously
 - **Structured handoffs via artifacts** — not conversation memory
 - **No git revert on rejection** — next generator sees partial work + feedback (more signal)
-- **Advisory scope budgets** — prompt-enforced limits on files read/written per iteration
+- **Tool-agnostic** — evaluator uses whatever tools are available, no hardcoded dependencies

 ## Credits

--- a/install.sh
+++ b/install.sh
@@ -5,7 +5,7 @@
 #   1. Copies the harness to ~/.claude/loop/  (prompts, templates, lib, loop.sh)
 #   2. Installs skills as Claude Code commands at ~/.claude/commands/
 #
-# After install, use /loop-init in any project to get started.
+# After install, use /agent-loop:run in any project to get started.
 #
 # Usage:
 #   ./install.sh            # Install
@@ -18,7 +18,7 @@ HARNESS_DIR="$CLAUDE_DIR/loop"
 COMMANDS_DIR="$CLAUDE_DIR/commands"
 SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"

-SKILLS=(loop-init loop-plan loop-run loop-triage)
+SKILLS=(setup stories run triage)

 # --- Colors (if terminal supports them) ---
 if [ -t 1 ]; then
@@ -100,9 +100,7 @@ info "${BOLD}Installation complete.${RESET}"
 echo ""
 echo "  Next steps (inside Claude Code, in any project):"
 echo ""
-echo "    /loop-init        # Set up the loop for your project"
-echo "    /loop-plan        # Generate PRD and sprint contracts"
-echo "    /loop-run         # Run the loop interactively"
+echo "    /agent-loop:run      # Single command — setup, plan, and run"
 echo ""
 echo "  Or run headless:    .loop/loop.sh"
 echo ""
--- a/lib/hooks.sh
+++ b/lib/hooks.sh
@@ -20,9 +20,9 @@ install_hooks() {
        jq '.hooks.Stop = [{"matcher": "", "hooks": [{"type": "command", "command": "kill -INT $PPID || true"}]}]' \
            "$SETTINGS_FILE" > "${SETTINGS_FILE}.tmp" && mv "${SETTINGS_FILE}.tmp" "$SETTINGS_FILE"
    else
-        python3 -c "
+        LOOP_SETTINGS="$SETTINGS_FILE" python3 -c "
 import json, os
-p = '$SETTINGS_FILE'
+p = os.environ['LOOP_SETTINGS']
 s = json.load(open(p)) if os.path.exists(p) else {}
 s.setdefault('hooks', {})['Stop'] = [{'matcher': '', 'hooks': [{'type': 'command', 'command': 'kill -INT \$PPID || true'}]}]
 json.dump(s, open(p, 'w'), indent=2)
@@ -37,12 +37,13 @@ remove_hooks() {
            jq 'del(.hooks.Stop)' "$SETTINGS_FILE" > "${SETTINGS_FILE}.tmp" && mv "${SETTINGS_FILE}.tmp" "$SETTINGS_FILE"
            jq 'if .hooks == {} then del(.hooks) else . end' "$SETTINGS_FILE" > "${SETTINGS_FILE}.tmp" && mv "${SETTINGS_FILE}.tmp" "$SETTINGS_FILE"
        else
-            python3 -c "
-import json
-s = json.load(open('$SETTINGS_FILE'))
+            LOOP_SETTINGS="$SETTINGS_FILE" python3 -c "
+import json, os
+p = os.environ['LOOP_SETTINGS']
+s = json.load(open(p))
 s.get('hooks', {}).pop('Stop', None)
 if not s.get('hooks'): s.pop('hooks', None)
-json.dump(s, open('$SETTINGS_FILE', 'w'), indent=2)
+json.dump(s, open(p, 'w'), indent=2)
 "
        fi
        log "Stop hook removed"
--- a/loop.sh
+++ b/loop.sh
@@ -124,7 +124,7 @@ while [[ $# -gt 0 ]]; do
        --dry-run) DRY_RUN=true; shift ;;
        --headless) export LOOP_HEADLESS=true; shift ;;
        --resume) RESUME=true; shift ;;
-        --replan) log "ERROR: --replan is not yet implemented. Use /loop-plan interactively."; exit 1 ;;
+        --replan) log "ERROR: --replan is not yet implemented. Use /agent-loop:stories interactively."; exit 1 ;;
        [0-9]*) MAX_ITERATIONS="$1"; shift ;;
        *) log "Unknown option: $1"; exit 1 ;;
    esac
@@ -162,7 +162,7 @@ check_archive

 # Validate prd.json exists (AFTER archive check, which may delete it on branch change)
 if [ ! -f "$LOOP_DIR/prd.json" ]; then
-    log "ERROR: No prd.json found. Run /loop-plan first to create one."
+    log "ERROR: No prd.json found. Run /agent-loop:stories first to create one."
    exit 1
 fi

@@ -240,11 +240,11 @@ run_agent() {
                claude)
                    printf '%s\n' "$prompt" | timeout "${LOOP_AGENT_TIMEOUT:-600}" \
                        claude --dangerously-skip-permissions --output-format text \
-                        --print 2>&1 > "$output_file"
+                        --print > "$output_file" 2>&1
                    ;;
                amp)
                    printf '%s\n' "$prompt" | timeout "${LOOP_AGENT_TIMEOUT:-600}" \
-                        amp --dangerously-allow-all 2>&1 > "$output_file"
+                        amp --dangerously-allow-all > "$output_file" 2>&1
                    ;;
                *)
                    log "ERROR: Unknown tool '$TOOL'"
@@ -319,7 +319,7 @@ while [ "$ITERATION" -lt "$MAX_ITERATIONS" ]; do
        fi
        snapshot_for_archive
        if any_stories_blocked 2>/dev/null; then
-            log "Some stories are blocked and need human review. Run /loop-triage for details."
+            log "Some stories are blocked and need human review. Run /agent-loop:triage for details."
            exit $EXIT_ALL_BLOCKED
        fi
        exit $EXIT_OK
@@ -364,7 +364,7 @@ while [ "$ITERATION" -lt "$MAX_ITERATIONS" ]; do
    # --- Scope budget check ---
    # Verify the generator stayed within configured limits (files modified, lines written).
    # Advisory in implement/fix modes (log warning), but enforced as rejection reason for evaluator.
-    if [ -n "$PRE_GENERATOR_SHA" ] && [ "$PRE_GENERATOR_SHA" != "" ]; then
+    if [ -n "$PRE_GENERATOR_SHA" ]; then
        SCOPE_FILES_MODIFIED=$(git diff --name-only "$PRE_GENERATOR_SHA" HEAD 2>/dev/null | wc -l | tr -d ' ')
        SCOPE_LINES_WRITTEN=$(git diff --stat "$PRE_GENERATOR_SHA" HEAD 2>/dev/null | tail -1 | grep -oE '[0-9]+ insertion' | grep -oE '[0-9]+' || echo "0")

@@ -381,18 +381,9 @@ while [ "$ITERATION" -lt "$MAX_ITERATIONS" ]; do
        export SCOPE_FILES_MODIFIED SCOPE_LINES_WRITTEN
    fi

-    # Check for completion — in interactive mode, check prd.json directly
-    if all_stories_pass 2>/dev/null; then
-        log_header "All Stories Complete! ($(story_counts))"
-        snapshot_for_archive
-        exit 0
-    fi
-    # Headless mode: also check output sentinel
-    if [ -n "$GENERATOR_OUTPUT" ] && echo "$GENERATOR_OUTPUT" | grep -q "<promise>COMPLETE</promise>"; then
-        log_header "Generator signaled COMPLETE ($(story_counts))"
-        snapshot_for_archive
-        exit 0
-    fi
+    # NOTE: Do NOT check all_stories_pass here. The generator marks its own story
+    # as passed, but the evaluator hasn't verified yet. Checking here would skip
+    # evaluation on the last story. The completion check is at the top of the loop.

    # --- Evaluator pass ---
    if [ "$SKIP_EVAL" != true ]; then
@@ -460,6 +451,6 @@ done
 # --- Max iterations reached ---
 log_header "Max Iterations Reached ($MAX_ITERATIONS)"
 log "Stories completed: $(story_counts)"
-log "Run /loop-triage to generate a handoff brief."
+log "Run /agent-loop:triage to generate a handoff brief."
 snapshot_for_archive
 exit $EXIT_MAX_ITERATIONS
--- a/prompts/evaluator/explore.md
+++ b/prompts/evaluator/explore.md
@@ -6,7 +6,7 @@ You are evaluating an analysis/exploration task. The generator claims to have an

 Before any other checks, verify explore mode's read-only constraint:
 1. Run `git diff {{PRE_GENERATOR_SHA}}..HEAD --name-only`
-2. If ANY file outside `.loop/triage/` was modified or committed, **REJECT immediately** — explore mode is read-only. The generator must not modify host project files.
+2. If ANY file outside `.loop/` was modified or committed, **REJECT immediately** — explore mode is read-only. The generator must not modify host project files. (Files inside `.loop/` like `prd.json` and `progress.md` are expected.)

 ## Exploration-Specific Checks

--- a/prompts/planner/plan.md
+++ b/prompts/planner/plan.md
@@ -1,6 +1,6 @@
 # Planner Context

-This file is loaded by the `/loop-plan` skill to provide additional context for PRD generation.
+This file provides additional context for PRD generation.

 ## Story Decomposition Guidelines

--- a/setup.sh
+++ b/setup.sh
@@ -1,7 +1,7 @@
 #!/bin/bash
 # Agent Loop — project setup script
 # Scaffolds .loop/ directory and generates config.json.
-# Called by /agent-loop:init or run directly.
+# Called by /agent-loop:setup or /agent-loop:run, or run directly.
 #
 # Usage:
 #   setup.sh <mode>           # mode: implement, explore, or fix
@@ -120,5 +120,5 @@ echo "[setup] Mode: $MODE"
 echo "[setup] Config: .loop/config.json"
 echo ""
 echo "Next steps (in Claude Code):"
-echo "  /agent-loop:plan    # Generate stories from your spec or description"
+echo "  /agent-loop:stories    # Generate stories from your spec or description"
 echo ""
--- a/skills/setup/SKILL.md
+++ b/skills/setup/SKILL.md
@@ -3,7 +3,7 @@ name: setup
 description: "Run the setup script to scaffold .loop/ directory. Does not plan features or write code."
 ---

-# /init — Scaffold the Agent Loop
+# /setup — Scaffold the Agent Loop

 Run the setup script to create `.loop/` with harness files and config. This skill does ONE thing: run a bash command.

--- a/skills/stories/SKILL.md
+++ b/skills/stories/SKILL.md
@@ -3,7 +3,7 @@ name: stories
 description: "Generate prd.json and sprint contracts by dispatching the planner agent. Does not write source code."
 ---

-# /plan — Generate PRD and Sprint Contracts
+# /stories — Generate PRD and Sprint Contracts

 Dispatch the planner agent to decompose a spec into stories. The planner agent cannot write source code or run bash commands — it can only write to `.loop/`.

@@ -11,7 +11,7 @@ Dispatch the planner agent to decompose a spec into stories. The planner agent c

 ### 1. Check prerequisites

-Verify `.loop/config.json` exists. If not, tell the user to run `/agent-loop:init` first and stop.
+Verify `.loop/config.json` exists. If not, tell the user to run `/agent-loop:setup` first and stop.

 ### 2. Find the spec