| Benchmark window | Start date, end date, market, timeframe, rule version, agent persona, and paper account assumption. | Is this sample comparable with the prior benchmark window? | Continue, restart, or separate into a new benchmark. |
| Sample size | Total eligible paper trades, skipped setups, excluded trades, and reason for each exclusion. | Is the sample large enough to support a workflow decision? | Keep collecting, review now, or mark inconclusive. |
| Simulated return | Paper PnL, return on benchmark bankroll, average winner, average loser, and payoff ratio. | Did the result come from a repeatable payoff or from a single outlier? | Accept for review, normalize, or flag outlier dependency. |
| Risk behavior | Max drawdown, worst paper loss, size violations, stop behavior, and correlated exposure. | Did the agent respect the written paper risk limits? | Keep current risk settings, reduce size, tighten limits, or pause the test. |
| Rule fit | Number of entries that matched the setup, number of rule breaks, and examples of unclear instructions. | Was the benchmark measuring the intended rule? | Keep rule, rewrite prompt, add skip rule, or retire setup. |
| Market context | Trend, volatility, news regime, liquidity conditions, and broad market filter status. | Did the rule only work in one narrow market condition? | Segment by context or collect broader evidence. |
| Review notes | One sentence on what improved, one sentence on what failed, and one sentence on what is still unknown. | Can another reviewer understand the benchmark without guessing? | Approve notes, request detail, or rerun the review. |
| Next action | Keep collecting samples, change one rule, reduce paper size, split the benchmark, or stop the test. | Is the next action narrow enough to compare in the next window? | Assign one owner and one follow-up date. |
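
The "Simulated return" and "Risk behavior" rows can be filled in from a plain list of per-trade paper PnL values. The sketch below is one possible way to do that, not a prescribed method: the function name `summarize_paper_trades`, the `ReturnSummary` fields, and the 50% outlier threshold are all hypothetical choices made for illustration, and drawdown is measured in currency units on a per-trade equity curve.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ReturnSummary:
    """Hypothetical container for the table's return and risk fields."""
    total_pnl: float
    return_on_bankroll: float
    average_winner: float
    average_loser: float
    payoff_ratio: Optional[float]    # None when there are no winners or no losers
    max_drawdown: float              # largest peak-to-trough drop, in currency units
    worst_loss: float                # most negative single-trade PnL (0.0 if none)
    top_trade_share: Optional[float]  # best trade's share of total PnL, if total > 0
    outlier_dependent: bool


def summarize_paper_trades(
    pnls: Sequence[float],
    bankroll: float,
    outlier_threshold: float = 0.5,  # assumed cutoff, not taken from the source
) -> ReturnSummary:
    """Summarize closed paper trades, given in chronological order."""
    winners = [p for p in pnls if p > 0]
    losers = [p for p in pnls if p < 0]
    total = sum(pnls)

    avg_win = sum(winners) / len(winners) if winners else 0.0
    avg_loss = sum(losers) / len(losers) if losers else 0.0
    # Payoff ratio: average winner divided by the size of the average loser.
    payoff = avg_win / abs(avg_loss) if winners and losers else None

    # Walk the equity curve and track the deepest drop from any prior peak.
    equity = peak = bankroll
    max_dd = 0.0
    for p in pnls:
        equity += p
        peak = max(peak, equity)
        max_dd = max(max_dd, peak - equity)

    # Outlier dependency: how much of the total came from the single best trade.
    top_share = max(pnls) / total if pnls and total > 0 else None
    outlier = top_share is not None and top_share > outlier_threshold

    return ReturnSummary(
        total_pnl=total,
        return_on_bankroll=total / bankroll,
        average_winner=avg_win,
        average_loser=avg_loss,
        payoff_ratio=payoff,
        max_drawdown=max_dd,
        worst_loss=min(losers) if losers else 0.0,
        top_trade_share=top_share,
        outlier_dependent=outlier,
    )
```

Calling `summarize_paper_trades([100.0, -50.0, 200.0, -50.0, 300.0], bankroll=1000.0)` gives a payoff ratio of 4.0 and a maximum drawdown of 50.0, and it sets `outlier_dependent` because the best trade supplies 60% of the total. A flag like that would point toward the "flag outlier dependency" decision in the table rather than "accept for review"; the threshold itself should come from the written benchmark rules.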