Three examples illustrate the bad rubber-stamp, good rejection, and good pass patterns. They are based on Anthropic's harness design recommendation to calibrate evaluators with few-shot score breakdowns, and they draw on real failures observed in a production loop run.