feat: evaluator runtime verification for web projects, optional Playwright docs
16
README.md
@@ -151,6 +151,22 @@ Before the loop starts, `/loop-plan` generates contracts for each story. These d
archive/ # Completed feature archives
```

## Browser Testing (Optional)

The evaluator includes basic runtime verification for web projects: it starts a local server and checks the HTTP response. For full browser testing with console error detection and screenshots, install the Playwright MCP server:

```bash
claude mcp add playwright npx @playwright/mcp@latest --headless --browser=chromium
```

When Playwright is available, the evaluator will use it to:
- Navigate to the running application
- Check for JavaScript console errors
- Take screenshots for visual verification
- Reject stories with runtime errors

This is optional: the evaluator works without it, but it may miss runtime issues that surface only in a browser.

## Design Principles

- **Fresh context per iteration** — no accumulated hallucination drift
@@ -67,11 +67,58 @@ Be concrete — "the function doesn't handle null input" not "there might be edg
End your response with the same verdict block so it's visible in the terminal output.

## Runtime Verification (Web Projects)

If the project has an `index.html` or is a web application, you MUST verify it actually runs:

1. **Start a local server** (if not already running):
   ```bash
   python3 -m http.server 8080 &
   SERVER_PID=$!
   sleep 1
   ```

2. **Check that the page loads.** Use curl to verify that the server responds:
   ```bash
   curl -s -o /dev/null -w "%{http_code}" http://localhost:8080
   ```
   Expected: 200. If not, REJECT.

3. **Check the served HTML.** If Node.js is available, run a quick headless check. Note that this fetches the page without executing its JavaScript, so it reports only the status code and whether the markup uses ES modules or a `<canvas>`:
   ```bash
   node -e "
   const http = require('http');
   // Fetch the page and report the status plus two markers found in the HTML.
   http.get('http://localhost:8080', res => {
     let data = '';
     res.on('data', chunk => data += chunk);
     res.on('end', () => {
       const hasModules = data.includes('type=\"module\"');
       const hasCanvas = data.includes('<canvas');
       console.log(JSON.stringify({ status: res.statusCode, hasModules, hasCanvas }));
     });
   }).on('error', err => {
     // Connection failures (e.g. the server is not running) exit non-zero.
     console.error(err.message);
     process.exit(1);
   });
   "
   ```

4. **If Playwright MCP is available** (check for the `browser_navigate` tool), use it for full browser verification:
   - Navigate to `http://localhost:8080`
   - Check for console errors
   - Take a screenshot
   - REJECT if the console shows any JavaScript errors

5. **Kill the server when done:**
   ```bash
   kill "$SERVER_PID" 2>/dev/null
   ```

**Runtime errors = automatic REJECT.** Code that looks correct but doesn't run is not complete.
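
Steps 1, 2, and 5 can also be scripted so that the server is always stopped, even when a check fails. The following is a minimal sketch, assuming Python 3.7+ is available. `check_page_loads` and its parameters are hypothetical names, not part of the evaluator, and the sketch binds to an OS-assigned port instead of 8080 to avoid collisions with a server that is already running.

```python
import functools
import http.server
import threading
import urllib.error
import urllib.request


def check_page_loads(directory: str, path: str = "/") -> int:
    """Serve `directory` on a free local port and return the HTTP status for `path`."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    # Port 0 asks the OS for any free port (an assumption; the steps above use 8080).
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
    port = server.server_address[1]
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = f"http://127.0.0.1:{port}{path}"
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                return response.status
        except urllib.error.HTTPError as err:
            # urllib raises on 4xx/5xx responses; the status code is still the result.
            return err.code
    finally:
        # Counterpart of step 5: stop the server whether the check passed or failed.
        server.shutdown()
        server.server_close()
        thread.join(timeout=5)
```

Called from the project root as `check_page_loads(".")`, this returns the status code for `/`. Under step 2, anything other than 200 would count as a REJECT.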

## What Warrants Rejection

- ANY acceptance criterion not actually met (not "mostly met" — MET)
- Tests fail
- Typecheck fails
- Runtime errors (page doesn't load, console errors, server crashes)
- Placeholder/stub code left in place
- Security vulnerability introduced
- Regression in existing functionality
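
One way to make the list above mechanical is to record each check as a named pass/fail result and reject when any of them fails. This is a minimal sketch, assuming the results have already been collected. The check names and the `rejection_reasons` helper are hypothetical and are not part of the evaluator prompt.

```python
from typing import Dict, List


def rejection_reasons(checks: Dict[str, bool]) -> List[str]:
    """Return the names of failed checks; an empty list means nothing warrants rejection."""
    return [name for name, passed in checks.items() if not passed]


# Hypothetical results for one story; True means the check passed.
results = {
    "acceptance criteria met": True,
    "tests pass": True,
    "typecheck passes": True,
    "no runtime errors": False,  # e.g. the page returned a non-200 status
    "no placeholder code": True,
}

reasons = rejection_reasons(results)
print("REJECT" if reasons else "no rejection criteria triggered", reasons)
```

Keeping the reasons as a list, instead of a single boolean, lets the verdict name the concrete failure, which matches the instruction to be specific about what went wrong.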