fix: critical bugs, stale refs, README rewrite, security fixes

- Fix evaluator bypass on last story (moved completion check)
- Fix all stale command name references across README, loop.sh, skills, plugin.json
- Fix explore evaluator false rejects (.loop/ files are expected)
- Fix stderr capture order in headless mode
- Fix shell injection risk in hooks.sh python fallback
- Remove .DS_Store from tracking
- Rewrite README to match current architecture (single entry point, tmux, optional tools)
- Add XcodeBuildMCP and iOS simulator MCP to optional tools docs
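
Note on the stderr capture fix: the shell applies redirections left to right, so `2>&1 > file` points stderr at the old stdout before stdout moves to the file, and the file never sees stderr. The sketch below illustrates the difference. The `emit` function and the temp files are hypothetical stand-ins for the agent command and `$output_file`, not code from `loop.sh`.

```shell
# Hypothetical command that writes one line to stdout and one line to stderr.
emit() { echo "out"; echo "err" >&2; }

tmp_old=$(mktemp)
tmp_new=$(mktemp)

# Old order: stderr is duplicated onto the current stdout (here a pipe that is
# discarded), and only afterwards is stdout moved to the file.
emit 2>&1 > "$tmp_old" | cat > /dev/null

# New order: stdout moves to the file first, then stderr follows it there.
emit > "$tmp_new" 2>&1

old_lines=$(wc -l < "$tmp_old" | tr -d ' ')
new_lines=$(wc -l < "$tmp_new" | tr -d ' ')
echo "old order captured $old_lines line(s); new order captured $new_lines line(s)"

rm -f "$tmp_old" "$tmp_new"
```

With the old order only the stdout line reaches the file, so agent errors would be missing from the captured output.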
2026-03-27 14:58:01 -04:00
parent f3cbfd258c
commit b3d263258a
10 changed files with 84 additions and 132 deletions
--- a/.claude-plugin/plugin.json
+++ b/.claude-plugin/plugin.json
@@ -1,7 +1,7 @@
 {
  "name": "agent-loop",
  "version": "0.8.0",
-  "description": "Autonomous generator-evaluator agent loop for long-running coding tasks. Plan with /agent-loop:init, then execute with /agent-loop:run.",
+  "description": "Autonomous generator-evaluator agent loop for long-running coding tasks. Run /agent-loop:run to start.",
  "author": {
    "name": "Sheldon"
  },
--- a/README.md
+++ b/README.md
@@ -4,9 +4,7 @@ Autonomous AI agent harness that combines a generator-evaluator architecture wit
 Inspired by [Geoffrey Huntley's Ralph pattern](https://ghuntley.com/ralph/) and [Anthropic's harness design research](https://www.anthropic.com/engineering/harness-design-long-running-apps).
-A generator-evaluator loop runs fresh agent instances per iteration. Each iteration: a **Generator** does the work, then an **Evaluator** verifies it. Human judgment stays in the planning phase; execution is autonomous.
+A generator-evaluator loop runs fresh Claude Code sessions per iteration. Each iteration: a **Generator** does the work, then an **Evaluator** verifies it. Human judgment stays in the planning phase; execution is autonomous with full visibility.
 Two execution modes: **headless** via `loop.sh` (fully autonomous bash process) or **interactive** via `/loop-run` (Claude Code-native with full visibility and intervention).
 ## Install
@@ -20,46 +18,32 @@ Two execution modes: **headless** via `loop.sh` (fully autonomous bash process)
 Then in any project:
 ```
-/agent-loop:init          # Set up the loop for your project
+/agent-loop:run
 /agent-loop:plan          # Generate PRD and sprint contracts
 /agent-loop:run           # Run the loop interactively
 ```
 That's it. The single command handles setup, planning, and execution.
 ### Manual Install
 ```bash
 # Clone into your project
 cp -r /path/to/loop-loop .loop
 # Install skills as Claude Code commands
 mkdir -p .claude/commands
 for skill in loop-init loop-plan loop-run loop-triage; do
    ln -sf "../../.loop/skills/$skill/SKILL.md" ".claude/commands/$skill.md"
 done
 # Then in Claude Code:
 /loop-init && /loop-plan && /loop-run
 ```
 Then run `.loop/loop.sh` directly.
 ## How It Works
 ```
-[You + Claude Code]                    [Loop Execution]
+/agent-loop:run
-
+  ├─ Phase 1: Scaffold .loop/ (if needed)
-/agent-loop:init                       Interactive (/agent-loop:run)
+  ├─ Phase 2: Generate stories from spec (if needed)
-  → scaffolds .loop/                     └─ dispatches Agent subagents
+  │    └─ Presents stories for human review
-  → detects project                      └─ visible tool calls, can intervene
+  │    └─ STOPS — user reviews and says "go"
-  → picks mode                           └─ chat mid-loop to adjust course
+  └─ Phase 3: Launch loop in tmux
-  → creates config.json
+       ├─→ Generator → picks story → implements → commits
-                                        Headless (.loop/loop.sh)
+       ├─→ Evaluator → verifies → PASS or REJECT
-/agent-loop:plan                         └─ spawns claude --print per iteration
+       ├─→ next iteration (fresh CC session each time)
-  → asks clarifying questions            └─ fully autonomous, no UI
+       └─→ all stories pass → done
  → generates prd.json
  → generates sprint contracts          Both paths:
  → populates progress.md                ├─→ Generator → picks story → implements → commits
                                          ├─→ Evaluator → verifies → PASS or REJECT
                                          ├─→ next iteration...
                                          └─→ all stories pass → done
 ```
 ## Modes
@@ -70,47 +54,48 @@ done
 | **explore** | Read-only codebase analysis | No |
 | **fix** | Targeted bug fixes / tech debt | Yes |
-## Running the Loop
+## Monitoring
-### Option A: Interactive (`/loop-run`) — Recommended
+After the loop launches in tmux:
 Run inside Claude Code. You see every tool call, file edit, and test run. You can intervene at any point — deny a tool call, chat to adjust course, or stop the loop.
 ```
 /loop-run                    # Run until done or max iterations
 /loop-run 3                  # Run at most 3 iterations
 /loop-run --skip-eval        # Skip evaluator pass
 /loop-run --story US-003     # Run only a specific story
 ```
 ### Option B: Headless (`loop.sh`)
 Run as a standalone bash process. Fully autonomous — no UI, no intervention. Useful for background execution or CI.
 ```bash
-.loop/loop.sh [options]
+# Watch live (from Claude Code)
 ! tmux attach -t agent-loop
 # Detach back to Claude Code
 Ctrl+B then D
 # Stop the loop
 Ctrl+C in the tmux session
 ```
 Or ask Claude Code "status" — it reads `.loop/prd.json` and `.loop/progress.md`.
 ## Headless Mode
 For CI or background execution without the interactive UI:
 ```bash
 .loop/loop.sh --headless [options]
 --mode <implement|explore|fix>   Operating mode
 --max <N>                        Maximum iterations (default: 20)
 --skip-eval                      Skip evaluator pass
--tool <claude|amp>              AI tool to use
+--dry-run                        Print assembled prompts without running
 --no-hooks                       Don't install stop hooks
 --dry-run                        Print assembled prompts without running agents
 --resume                         Skip already-passed stories (explicit exit when none remain)
 ```
 ## Architecture
 ### Generator
-Fresh Claude Code instance each iteration. Reads `prd.json` to find the highest-priority incomplete story, reads the sprint contract, implements the story, runs quality gates, commits, and marks it done.
+Fresh Claude Code session each iteration. Reads `prd.json` to find the highest-priority incomplete story, reads the sprint contract, implements the story, runs quality gates, commits, and marks it done.
 ### Evaluator
-Separate fresh instance after each generator pass. Skeptically verifies the work: checks acceptance criteria against actual code, runs tests independently, and issues a `PASS` or `REJECT` verdict. Rejection sends the story back to the generator with specific feedback.
+Separate fresh session after each generator pass. Skeptically verifies the work: checks acceptance criteria against actual code, runs tests and the application, and issues a `PASS` or `REJECT` verdict. Rejection sends the story back with specific feedback.
 Evaluator skepticism is deliberately tuned — Claude's default tendency is to rationalize away issues. The evaluator prompt includes explicit bias correction.
 ### Sprint Contracts
-Before the loop starts, `/loop-plan` generates contracts for each story. These define "done" conditions that both generator and evaluator reference, eliminating ambiguity about whether work is complete.
+Before the loop starts, the planner generates contracts for each story. These define "done" conditions that both generator and evaluator reference, eliminating ambiguity about whether work is complete.
 ### State Persistence
@@ -122,59 +107,36 @@ Before the loop starts, `/loop-plan` generates contracts for each story. These d
 | `config.json` | Harness configuration |
 | Git commits | Code changes with story-tagged messages |
-## File Structure
+## Optional: Runtime Testing Tools
-```
+The evaluator verifies code actually runs, not just that it looks correct. It uses whatever tools are available. For richer verification, install these optional MCP servers:
 .loop/
  loop.sh                        # Main loop orchestrator
  config.json                    # Project config (generated by /loop-init)
  init.sh                        # Project setup script (generated by /loop-init)
  prd.json                       # Active PRD (generated by /loop-plan)
  progress.md                    # Cross-session memory (append-only)
  prompts/
    generator/_base.md           # Shared generator instructions
    generator/implement.md       # Implement mode overlay
    generator/explore.md         # Explore mode overlay
    generator/fix.md             # Fix mode overlay
    evaluator/_base.md           # Skeptical evaluator base
    evaluator/implement.md       # Implement verification
    evaluator/explore.md         # Analysis verification
    evaluator/fix.md             # Fix verification
    planner/plan.md              # Planning context
  templates/                     # Reference templates
  lib/                           # Shell library functions
  skills/                        # Claude Code skills (/loop-init, /loop-plan, /loop-run, /loop-triage)
  contracts/                     # Sprint contracts (generated by /loop-plan)
  triage/                        # Analysis output (explore mode)
  archive/                       # Completed feature archives
 ```
 ## Browser Testing (Optional)
 The evaluator includes basic runtime verification for web projects (starts a local server, checks HTTP response). For full browser testing with console error detection and screenshots, install the Playwright MCP server:
 **Web projects (Playwright):**
 ```bash
 claude mcp add playwright npx @playwright/mcp@latest --headless --browser=chromium
 ```
-When Playwright is available, the evaluator will use it to:
+**iOS/Xcode projects (XcodeBuildMCP):**
- Navigate to the running application
+```bash
- Check for JavaScript console errors
+brew tap getsentry/xcodebuildmcp && brew install xcodebuildmcp
- Take screenshots for visual verification
+claude mcp add xcodebuild -- xcodebuildmcp
- Reject stories with runtime errors
+```
-This is optional — the evaluator works without it, but may miss runtime issues that only surface in a browser.
+**iOS Simulator interaction:**
 ```bash
 claude mcp add ios-simulator -- npx -y ios-simulator-mcp
 ```
 These are optional — the evaluator works without them but may miss runtime-only issues.
 ## Design Principles
 - **Fresh context per iteration** — no accumulated hallucination drift
 - **Separate generation from evaluation** — external skepticism is easier to tune than self-criticism
- **Human judgment for planning, AI for execution** — interactive `/loop-plan`, autonomous loop
+- **Human judgment for planning, AI for execution** — human reviews stories, loop executes autonomously
 - **Structured handoffs via artifacts** — not conversation memory
 - **No git revert on rejection** — next generator sees partial work + feedback (more signal)
- **Advisory scope budgets** — prompt-enforced limits on files read/written per iteration
+- **Tool-agnostic** — evaluator uses whatever tools are available, no hardcoded dependencies
 ## Credits
--- a/install.sh
+++ b/install.sh
@@ -5,7 +5,7 @@
 #   1. Copies the harness to ~/.claude/loop/  (prompts, templates, lib, loop.sh)
 #   2. Installs skills as Claude Code commands at ~/.claude/commands/
 #
-# After install, use /loop-init in any project to get started.
+# After install, use /agent-loop:run in any project to get started.
 #
 # Usage:
 #   ./install.sh            # Install
@@ -18,7 +18,7 @@ HARNESS_DIR="$CLAUDE_DIR/loop"
 COMMANDS_DIR="$CLAUDE_DIR/commands"
 SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
-SKILLS=(loop-init loop-plan loop-run loop-triage)
+SKILLS=(setup stories run triage)
 # --- Colors (if terminal supports them) ---
 if [ -t 1 ]; then
@@ -100,9 +100,7 @@ info "${BOLD}Installation complete.${RESET}"
 echo ""
 echo "  Next steps (inside Claude Code, in any project):"
 echo ""
-echo "    /loop-init        # Set up the loop for your project"
+echo "    /agent-loop:run      # Single command — setup, plan, and run"
 echo "    /loop-plan        # Generate PRD and sprint contracts"
 echo "    /loop-run         # Run the loop interactively"
 echo ""
 echo "  Or run headless:    .loop/loop.sh"
 echo ""
--- a/lib/hooks.sh
+++ b/lib/hooks.sh
@@ -20,9 +20,9 @@ install_hooks() {
        jq '.hooks.Stop = [{"matcher": "", "hooks": [{"type": "command", "command": "kill -INT $PPID || true"}]}]' \
            "$SETTINGS_FILE" > "${SETTINGS_FILE}.tmp" && mv "${SETTINGS_FILE}.tmp" "$SETTINGS_FILE"
    else
-        python3 -c "
+        LOOP_SETTINGS="$SETTINGS_FILE" python3 -c "
 import json, os
-p = '$SETTINGS_FILE'
+p = os.environ['LOOP_SETTINGS']
 s = json.load(open(p)) if os.path.exists(p) else {}
 s.setdefault('hooks', {})['Stop'] = [{'matcher': '', 'hooks': [{'type': 'command', 'command': 'kill -INT \$PPID || true'}]}]
 json.dump(s, open(p, 'w'), indent=2)
@@ -37,12 +37,13 @@ remove_hooks() {
            jq 'del(.hooks.Stop)' "$SETTINGS_FILE" > "${SETTINGS_FILE}.tmp" && mv "${SETTINGS_FILE}.tmp" "$SETTINGS_FILE"
            jq 'if .hooks == {} then del(.hooks) else . end' "$SETTINGS_FILE" > "${SETTINGS_FILE}.tmp" && mv "${SETTINGS_FILE}.tmp" "$SETTINGS_FILE"
        else
-            python3 -c "
+            LOOP_SETTINGS="$SETTINGS_FILE" python3 -c "
-import json
+import json, os
-s = json.load(open('$SETTINGS_FILE'))
+p = os.environ['LOOP_SETTINGS']
 s = json.load(open(p))
 s.get('hooks', {}).pop('Stop', None)
 if not s.get('hooks'): s.pop('hooks', None)
-json.dump(s, open('$SETTINGS_FILE', 'w'), indent=2)
+json.dump(s, open(p, 'w'), indent=2)
 "
        fi
        log "Stop hook removed"
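
Note on the `hooks.sh` change above: interpolating `$SETTINGS_FILE` into the Python source lets a path containing a quote change the program text, while passing it through the environment keeps it as plain data. The following is a minimal sketch of that pattern, assuming `python3` is on the PATH. The path value is a made-up example chosen to contain a single quote.

```shell
# Hypothetical settings path containing a single quote.
settings_path="/tmp/it's settings.json"

# Interpolated form: the quote ends the Python string early, so the source no
# longer parses (and a crafted path could inject code instead of just failing).
if python3 -c "p = '$settings_path'" 2>/dev/null; then
    interpolated="parsed"
else
    interpolated="broken"
fi

# Environment form: the Python source is a fixed string and the path is data.
result=$(LOOP_SETTINGS="$settings_path" python3 -c '
import os
print(os.environ["LOOP_SETTINGS"])
')

echo "interpolated source: $interpolated"
echo "path seen by python: $result"
```

The single-quoted `-c` argument in the sketch also stops the shell from expanding anything inside the Python source, which the double-quoted form in `hooks.sh` still permits for other variables.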
--- a/loop.sh
+++ b/loop.sh
@@ -124,7 +124,7 @@ while [[ $# -gt 0 ]]; do
        --dry-run) DRY_RUN=true; shift ;;
        --headless) export LOOP_HEADLESS=true; shift ;;
        --resume) RESUME=true; shift ;;
-        --replan) log "ERROR: --replan is not yet implemented. Use /loop-plan interactively."; exit 1 ;;
+        --replan) log "ERROR: --replan is not yet implemented. Use /agent-loop:stories interactively."; exit 1 ;;
        [0-9]*) MAX_ITERATIONS="$1"; shift ;;
        *) log "Unknown option: $1"; exit 1 ;;
    esac
@@ -162,7 +162,7 @@ check_archive
 # Validate prd.json exists (AFTER archive check, which may delete it on branch change)
 if [ ! -f "$LOOP_DIR/prd.json" ]; then
-    log "ERROR: No prd.json found. Run /loop-plan first to create one."
+    log "ERROR: No prd.json found. Run /agent-loop:stories first to create one."
    exit 1
 fi
@@ -240,11 +240,11 @@ run_agent() {
                claude)
                    printf '%s\n' "$prompt" | timeout "${LOOP_AGENT_TIMEOUT:-600}" \
                        claude --dangerously-skip-permissions --output-format text \
-                        --print 2>&1 > "$output_file"
+                        --print > "$output_file" 2>&1
                    ;;
                amp)
                    printf '%s\n' "$prompt" | timeout "${LOOP_AGENT_TIMEOUT:-600}" \
-                        amp --dangerously-allow-all 2>&1 > "$output_file"
+                        amp --dangerously-allow-all > "$output_file" 2>&1
                    ;;
                *)
                    log "ERROR: Unknown tool '$TOOL'"
@@ -319,7 +319,7 @@ while [ "$ITERATION" -lt "$MAX_ITERATIONS" ]; do
        fi
        snapshot_for_archive
        if any_stories_blocked 2>/dev/null; then
-            log "Some stories are blocked and need human review. Run /loop-triage for details."
+            log "Some stories are blocked and need human review. Run /agent-loop:triage for details."
            exit $EXIT_ALL_BLOCKED
        fi
        exit $EXIT_OK
@@ -364,7 +364,7 @@ while [ "$ITERATION" -lt "$MAX_ITERATIONS" ]; do
    # --- Scope budget check ---
    # Verify the generator stayed within configured limits (files modified, lines written).
    # Advisory in implement/fix modes (log warning), but enforced as rejection reason for evaluator.
-    if [ -n "$PRE_GENERATOR_SHA" ] && [ "$PRE_GENERATOR_SHA" != "" ]; then
+    if [ -n "$PRE_GENERATOR_SHA" ]; then
        SCOPE_FILES_MODIFIED=$(git diff --name-only "$PRE_GENERATOR_SHA" HEAD 2>/dev/null | wc -l | tr -d ' ')
        SCOPE_LINES_WRITTEN=$(git diff --stat "$PRE_GENERATOR_SHA" HEAD 2>/dev/null | tail -1 | grep -oE '[0-9]+ insertion' | grep -oE '[0-9]+' || echo "0")
@@ -381,18 +381,9 @@ while [ "$ITERATION" -lt "$MAX_ITERATIONS" ]; do
        export SCOPE_FILES_MODIFIED SCOPE_LINES_WRITTEN
    fi
-    # Check for completion — in interactive mode, check prd.json directly
+    # NOTE: Do NOT check all_stories_pass here. The generator marks its own story
-    if all_stories_pass 2>/dev/null; then
+    # as passed, but the evaluator hasn't verified yet. Checking here would skip
-        log_header "All Stories Complete! ($(story_counts))"
+    # evaluation on the last story. The completion check is at the top of the loop.
        snapshot_for_archive
        exit 0
    fi
    # Headless mode: also check output sentinel
    if [ -n "$GENERATOR_OUTPUT" ] && echo "$GENERATOR_OUTPUT" | grep -q "<promise>COMPLETE</promise>"; then
        log_header "Generator signaled COMPLETE ($(story_counts))"
        snapshot_for_archive
        exit 0
    fi
    # --- Evaluator pass ---
    if [ "$SKIP_EVAL" != true ]; then
@@ -460,6 +451,6 @@ done
 # --- Max iterations reached ---
 log_header "Max Iterations Reached ($MAX_ITERATIONS)"
 log "Stories completed: $(story_counts)"
-log "Run /loop-triage to generate a handoff brief."
+log "Run /agent-loop:triage to generate a handoff brief."
 snapshot_for_archive
 exit $EXIT_MAX_ITERATIONS
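
Note on the completion-check move in `loop.sh`: the generator marks its own story as passed, so a check placed between generator and evaluator can exit before the last story is verified. The sketch below models only that ordering. The helper functions and counters are simplified, hypothetical stand-ins, and the evaluator always passes here.

```shell
unpassed=2      # stories not yet marked as passed (illustrative value)
eval_runs=0     # how many times the evaluator ran

all_stories_pass() { [ "$unpassed" -eq 0 ]; }
run_generator() { unpassed=$((unpassed - 1)); }    # generator marks its story passed
run_evaluator() { eval_runs=$((eval_runs + 1)); }  # always PASS in this sketch

iteration=0
max_iterations=5
while [ "$iteration" -lt "$max_iterations" ]; do
    # Completion check at the top: it only fires after the previous
    # iteration's evaluator has already run.
    if all_stories_pass; then
        break
    fi
    run_generator
    # No completion check here, so the last story is still evaluated.
    run_evaluator
    iteration=$((iteration + 1))
done

echo "evaluator runs: $eval_runs"
```

If the check sat between `run_generator` and `run_evaluator`, the same run would stop with one evaluator pass for two stories.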
--- a/prompts/evaluator/explore.md
+++ b/prompts/evaluator/explore.md
@@ -6,7 +6,7 @@ You are evaluating an analysis/exploration task. The generator claims to have an
 Before any other checks, verify explore mode's read-only constraint:
 1. Run `git diff {{PRE_GENERATOR_SHA}}..HEAD --name-only`
-2. If ANY file outside `.loop/triage/` was modified or committed, **REJECT immediately** — explore mode is read-only. The generator must not modify host project files.
+2. If ANY file outside `.loop/` was modified or committed, **REJECT immediately** — explore mode is read-only. The generator must not modify host project files. (Files inside `.loop/` like `prd.json` and `progress.md` are expected.)
 ## Exploration-Specific Checks
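
Note on the relaxed read-only check above: the evaluator prompt describes the check in prose, and the same rule could be expressed in shell roughly as follows. This is a sketch, not part of the harness. `verdict_for` is a hypothetical helper that takes changed paths as arguments, whereas in real use the list would come from `git diff --name-only`.

```shell
# Hypothetical helper: prints REJECT if any changed path lies outside .loop/.
verdict_for() {
    # grep -v drops paths under .loop/; "|| true" keeps an empty result from
    # being treated as a failure.
    violations=$(printf '%s\n' "$@" | grep -v '^\.loop/' || true)
    if [ -z "$violations" ]; then
        echo "OK"
    else
        echo "REJECT"
    fi
}

verdict_for ".loop/prd.json" ".loop/progress.md" ".loop/triage/report.md"
verdict_for ".loop/prd.json" "src/app.js"
```

The first call covers the case that was previously a false reject: bookkeeping files such as `prd.json` and `progress.md` live in `.loop/` but outside `.loop/triage/`.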
--- a/prompts/planner/plan.md
+++ b/prompts/planner/plan.md
@@ -1,6 +1,6 @@
 # Planner Context
-This file is loaded by the `/loop-plan` skill to provide additional context for PRD generation.
+This file provides additional context for PRD generation.
 ## Story Decomposition Guidelines
--- a/setup.sh
+++ b/setup.sh
@@ -1,7 +1,7 @@
 #!/bin/bash
 # Agent Loop — project setup script
 # Scaffolds .loop/ directory and generates config.json.
-# Called by /agent-loop:init or run directly.
+# Called by /agent-loop:setup or /agent-loop:run, or run directly.
 #
 # Usage:
 #   setup.sh <mode>           # mode: implement, explore, or fix
@@ -120,5 +120,5 @@ echo "[setup] Mode: $MODE"
 echo "[setup] Config: .loop/config.json"
 echo ""
 echo "Next steps (in Claude Code):"
-echo "  /agent-loop:plan    # Generate stories from your spec or description"
+echo "  /agent-loop:stories    # Generate stories from your spec or description"
 echo ""
--- a/skills/setup/SKILL.md
+++ b/skills/setup/SKILL.md
@@ -3,7 +3,7 @@ name: setup
 description: "Run the setup script to scaffold .loop/ directory. Does not plan features or write code."
 ---
-# /init — Scaffold the Agent Loop
+# /setup — Scaffold the Agent Loop
 Run the setup script to create `.loop/` with harness files and config. This skill does ONE thing: run a bash command.
--- a/skills/stories/SKILL.md
+++ b/skills/stories/SKILL.md
@@ -3,7 +3,7 @@ name: stories
 description: "Generate prd.json and sprint contracts by dispatching the planner agent. Does not write source code."
 ---
-# /plan — Generate PRD and Sprint Contracts
+# /stories — Generate PRD and Sprint Contracts
 Dispatch the planner agent to decompose a spec into stories. The planner agent cannot write source code or run bash commands — it can only write to `.loop/`.
@@ -11,7 +11,7 @@ Dispatch the planner agent to decompose a spec into stories. The planner agent c
 ### 1. Check prerequisites
-Verify `.loop/config.json` exists. If not, tell the user to run `/agent-loop:init` first and stop.
+Verify `.loop/config.json` exists. If not, tell the user to run `/agent-loop:setup` first and stop.
 ### 2. Find the spec