Metrics to keep
Track rule-fit pass rate, average planned risk, maximum paper drawdown, skipped-trade count, mistake tags, sample size, and review completion rate.
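As a minimal sketch, these metrics could be computed from a reviewed journal roughly as follows. The `Decision` record, its field names, and the sample values are hypothetical and only serve the illustration; they are not a format the agent is known to produce.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Decision:
    """One journaled paper decision. All field names are hypothetical."""
    kind: str                        # "entry", "exit", or "skip"
    followed_rule: bool              # reviewer's rule-fit verdict
    planned_risk_pct: float = 0.0    # planned risk as % of paper equity (entries)
    mistake_tags: List[str] = field(default_factory=list)
    reviewed: bool = False           # has a human review been completed?


def max_drawdown_pct(equity_curve: List[float]) -> float:
    """Largest peak-to-trough decline of the paper equity curve, in percent."""
    peak = equity_curve[0]
    worst = 0.0
    for value in equity_curve:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak * 100.0)
    return worst


def summarize(decisions: List[Decision], equity_curve: List[float]) -> dict:
    """Compute the tracked metrics for one paper sample."""
    if not decisions:
        raise ValueError("cannot summarize an empty sample")
    entries = [d for d in decisions if d.kind == "entry"]
    tag_counts: Dict[str, int] = {}
    for d in decisions:
        for tag in d.mistake_tags:
            tag_counts[tag] = tag_counts.get(tag, 0) + 1
    n = len(decisions)
    avg_risk = sum(d.planned_risk_pct for d in entries) / len(entries) if entries else 0.0
    return {
        "sample_size": n,
        "rule_fit_pass_rate": sum(d.followed_rule for d in decisions) / n,
        "avg_planned_risk_pct": avg_risk,
        "max_paper_drawdown_pct": max_drawdown_pct(equity_curve),
        "skipped_trade_count": sum(d.kind == "skip" for d in decisions),
        "mistake_tags": tag_counts,
        "review_completion_rate": sum(d.reviewed for d in decisions) / n,
    }


# Illustrative data only; the numbers are made up.
decisions = [
    Decision("entry", True, 1.0, [], True),
    Decision("entry", False, 0.5, ["early_entry"], True),
    Decision("skip", True, reviewed=False),
    Decision("exit", True, reviewed=True),
]
summary = summarize(decisions, [10000.0, 10200.0, 9690.0, 10100.0])
```

Reporting sample size next to every rate keeps a small sample from reading as stronger evidence than it is.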
Evaluate an AI paper trading agent by process quality, not just paper profit and loss. Use rule fit, journal quality, risk behavior, skipped trades, sample size, and versioned improvements.
| Dimension | What to check | Weak signal |
|---|---|---|
| Rule fit | Did the agent follow the written setup, invalidation, and skip rules? | The agent explains the trade after the outcome instead of from the original rule. |
| Risk behavior | Did simulated size, drawdown, and exposure stay inside the written limits? | The agent makes profitable paper trades by breaking risk boundaries. |
| Journal quality | Can a reviewer understand thesis, rationale, exit reason, and next action? | The record contains confident summaries but no reviewable fields. |
| Skipped trades | Did the agent document when it did nothing and why? | Only entries are recorded, so discipline cannot be measured. |
| Version quality | Did each rule change target one specific behavior? | The prompt changes broadly after every small sample. |
A paper agent can show a high win rate and still follow a poor workflow. It may enter late, ignore risk, skip journaling, or rely on a market condition that happened to be favorable during a small sample. Evaluation needs to ask whether the process is repeatable and reviewable.
Start by reading the agent's rules and prompt version. The sample only means something if you know which instructions produced it. Then review the journal for every entry, exit, and skip. A complete record should explain what the agent saw, why the decision fit the rule, what invalidated the thesis, and what should be reviewed next.
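The completeness check described above can be sketched as a simple field audit. The field names in `REQUIRED_FIELDS` are assumptions chosen to mirror the four questions plus the prompt version; a real journal would use its own schema.

```python
from typing import Dict, List

# Hypothetical field names, one per question a complete record should answer.
REQUIRED_FIELDS = (
    "prompt_version",    # which instructions produced the decision
    "observation",       # what the agent saw
    "rule_fit_reason",   # why the decision fit the rule
    "invalidation",      # what invalidated (or would invalidate) the thesis
    "next_review",       # what should be reviewed next
)


def missing_fields(record: dict) -> List[str]:
    """Return required fields that are absent or blank in one journal record."""
    return [f for f in REQUIRED_FIELDS if not str(record.get(f) or "").strip()]


def incomplete_records(journal: List[dict]) -> Dict[int, List[str]]:
    """Map record index to its missing fields, for incomplete records only."""
    gaps_by_index = {}
    for index, record in enumerate(journal):
        gaps = missing_fields(record)
        if gaps:
            gaps_by_index[index] = gaps
    return gaps_by_index


# Illustrative journal; the content is made up.
journal = [
    {
        "prompt_version": "v1",
        "observation": "Price closed above the prior range high.",
        "rule_fit_reason": "Matches the written breakout setup.",
        "invalidation": "Close back inside the range.",
        "next_review": "Check whether confirmation was complete at entry.",
    },
    {
        "prompt_version": "v1",
        "observation": "Strong momentum.",
        "rule_fit_reason": "Looked good.",
        "invalidation": "   ",   # blank counts as missing
        "next_review": None,
    },
]
gaps = incomplete_records(journal)
```

A check like this only confirms that the fields exist; whether the rationale is actually reviewable still needs a human read.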
Risk behavior deserves its own score. An agent that wins by taking oversized paper positions is not better than an agent that loses while following the plan. The review should compare each decision with the AI agent risk controls workflow before drawing conclusions from paper returns.
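One way to keep the risk score separate from paper returns is to check each decision against the written limits and ignore profit and loss entirely. The sketch below assumes a hypothetical `RiskLimits` structure and made-up limit values; it is not the risk controls workflow itself.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RiskLimits:
    """Written limits for the paper account. Names and units are assumptions."""
    max_position_pct: float   # largest single simulated position, % of equity
    max_exposure_pct: float   # total open exposure, % of equity
    max_drawdown_pct: float   # deepest allowed paper drawdown, %


def risk_breaches(decision: Dict[str, float], limits: RiskLimits) -> List[str]:
    """List every written limit that one journaled decision exceeded."""
    checks = [
        ("position_pct", limits.max_position_pct),
        ("exposure_pct", limits.max_exposure_pct),
        ("drawdown_pct", limits.max_drawdown_pct),
    ]
    return [key for key, limit in checks if decision[key] > limit]


def risk_behavior_passes(decisions: List[Dict[str, float]], limits: RiskLimits) -> bool:
    """Pass only if no decision broke a limit; paper P&L is deliberately ignored."""
    return all(not risk_breaches(d, limits) for d in decisions)


# Illustrative values only.
limits = RiskLimits(max_position_pct=5.0, max_exposure_pct=20.0, max_drawdown_pct=8.0)
winner_oversized = {"position_pct": 12.0, "exposure_pct": 18.0,
                    "drawdown_pct": 3.0, "paper_pnl": 450.0}
loser_in_plan = {"position_pct": 4.0, "exposure_pct": 10.0,
                 "drawdown_pct": 6.0, "paper_pnl": -120.0}
```

Here the profitable but oversized decision fails the risk check, while the losing decision that stayed inside the plan passes.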
Evaluation should end with one decision: keep collecting evidence, tighten one rule, reduce simulated size, change the prompt output, or retire the test. That decision should be written before the next sample starts.
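The five possible outcomes can be modeled as a closed set so that a review cannot end without exactly one of them. This is a sketch under assumptions: `NextStep`, `ReviewDecision`, and `can_start_next_sample` are hypothetical names, and treating a decision written on the start date itself as "before" is a simplification.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class NextStep(Enum):
    """The five outcomes an evaluation may end with."""
    KEEP_COLLECTING = "keep collecting evidence"
    TIGHTEN_ONE_RULE = "tighten one rule"
    REDUCE_SIZE = "reduce simulated size"
    CHANGE_PROMPT_OUTPUT = "change the prompt output"
    RETIRE_TEST = "retire the test"


@dataclass(frozen=True)
class ReviewDecision:
    """A single written decision closing one evaluation (hypothetical record)."""
    step: NextStep
    written_on: date
    note: str = ""


def can_start_next_sample(decision: Optional[ReviewDecision],
                          next_sample_start: date) -> bool:
    """Allow a new sample only if a decision exists and was written in time.

    Assumption: a decision dated on the start day counts as written before it.
    """
    return decision is not None and decision.written_on <= next_sample_start


# Illustrative dates only.
decision = ReviewDecision(NextStep.TIGHTEN_ONE_RULE, date(2024, 3, 1),
                          note="Tighten confirmation language.")
```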
Sample: An AI paper trading agent records twenty simulated decisions over two weeks. Twelve are entries, five are skips, and three are exits from the prior sample.
Finding: Paper profit is positive, but the journal shows that four of the twelve entries were taken before the confirmation rule was satisfied. The skipped trades were well documented, and size stayed inside limits.
Score: Risk behavior passes, journal quality passes, skip discipline passes, but rule fit needs work. The next version tightens confirmation language without changing position sizing or the watchlist.
Decision: The agent is not promoted or discarded. It starts a new paper sample with one rule change so the next evaluation can compare early-entry behavior directly.
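The arithmetic behind this example can be written out as a short sketch. The counts and the pass or needs-work labels come from the example above; the variable names and the dictionary layout are assumptions.

```python
from fractions import Fraction

# Counts taken from the example sample.
sample = {"entries": 12, "skips": 5, "exits": 3}
early_entries = 4  # entries taken before confirmation

total_decisions = sum(sample.values())
early_entry_rate = Fraction(early_entries, sample["entries"])
entry_rule_fit_rate = 1 - early_entry_rate

# Dimension scores as stated in the example.
scorecard = {
    "risk_behavior": "pass",
    "journal_quality": "pass",
    "skip_discipline": "pass",
    "rule_fit": "needs work",
}

# Only dimensions that did not pass are candidates for the next version's change.
needs_work = [dim for dim, score in scorecard.items() if score != "pass"]

print(f"{total_decisions} decisions, early-entry rate {float(early_entry_rate):.0%}")
print("Target for next version:", needs_work)
```

With one third of entries taken early and every other dimension passing, a single change to the confirmation rule gives the next sample a clear number to compare against.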
Win rate, paper return, average profit, or a single leaderboard rank can be misleading when they are not paired with risk and rule-fit context.
Compare versions by behavior, not by a single paper result. Version one may have a higher paper return, while version two may follow rules more consistently and create better journal records. The stronger version depends on the evaluation goal.
Start every comparison by naming the change. If version two changed the confirmation rule, the review should focus on early entries and skipped trades. If version two changed the risk cap, the review should focus on size, exposure, and drawdown. A version comparison is weak when it changes many variables and then tries to explain the result afterward.
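Naming the change can be partly automated by diffing the two version configurations before looking at any results. In this sketch the configuration keys, the `FOCUS_BY_CHANGE` mapping, and the metric names are hypothetical; the two mappings shown follow the confirmation-rule and risk-cap cases described above.

```python
from typing import Dict, List

# Hypothetical mapping from a changed setting to the behaviors worth reviewing.
FOCUS_BY_CHANGE: Dict[str, List[str]] = {
    "confirmation_rule": ["early_entries", "skipped_trades"],
    "risk_cap_pct": ["position_size", "exposure", "drawdown"],
}


def changed_keys(old: dict, new: dict) -> List[str]:
    """Return the settings whose values differ between two versions."""
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))


def plan_comparison(old: dict, new: dict) -> dict:
    """Name the change and pick review focus; flag multi-variable changes as weak."""
    changes = changed_keys(old, new)
    focus: List[str] = []
    for key in changes:
        focus.extend(FOCUS_BY_CHANGE.get(key, []))
    return {"changes": changes, "focus": focus, "weak": len(changes) != 1}


# Illustrative version configs; the values are made up.
v1 = {"confirmation_rule": "first touch of level", "risk_cap_pct": 1.0}
v2 = {"confirmation_rule": "close beyond level", "risk_cap_pct": 1.0}
v3 = {"confirmation_rule": "close beyond level", "risk_cap_pct": 2.0}

single_change = plan_comparison(v1, v2)
broad_change = plan_comparison(v1, v3)
```

Comparing `v1` with `v3` is flagged as weak because two settings moved at once, so any difference in results cannot be attributed to either one.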
Use the same sample window when possible. If one version ran during a calm market and another ran during a volatile market, say that directly in the review. The evaluation may still be useful, but it should not pretend the samples were identical.
The best version is often the one that is easier to audit. A slightly less profitable paper sample with clean decisions may be more useful than a profitable sample with missing invalidation, vague rationale, and no skipped-trade evidence.
Evaluate rule fit, journal quality, risk behavior, skipped trades, drawdown, sample size, and whether each version improves one specific behavior.
Win rate alone is not enough to evaluate the agent, because it can hide a weak process. Review risk, rationale, sample size, skipped trades, and whether the agent followed the written rule.
Change the agent's rules or prompt when repeated paper evidence shows a specific behavior problem, such as early entries, missing invalidation, oversized positions, or poor skip discipline.