How to evaluate an AI paper trading agent

Evaluate an AI paper trading agent by process quality, not just paper profit and loss. Use rule fit, journal quality, risk behavior, skipped trades, sample size, and versioned improvements.

Evaluation scorecard

Dimension	What to check	Weak signal
Rule fit	Did the agent follow the written setup, invalidation, and skip rules?	The agent explains the trade after the outcome instead of from the original rule.
Risk behavior	Did simulated size, drawdown, and exposure stay inside the written limits?	The agent makes profitable paper trades by breaking risk boundaries.
Journal quality	Can a reviewer understand thesis, rationale, exit reason, and next action?	The record contains confident summaries but no reviewable fields.
Skipped trades	Did the agent document when it did nothing and why?	Only entries are recorded, so discipline cannot be measured.
Version quality	Did each rule change target one specific behavior?	The prompt changes broadly after every small sample.
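One way to keep the scorecard reviewable is to record it as a small structure with one pass or fail field per dimension. The sketch below is illustrative only: the `Scorecard` class and its field names are assumptions chosen to mirror the table, not part of any existing tool.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class Scorecard:
    """Hypothetical pass/fail record, one field per scorecard dimension."""
    rule_fit: bool
    risk_behavior: bool
    journal_quality: bool
    skipped_trades: bool
    version_quality: bool

    def failing(self) -> List[str]:
        # Names of the dimensions that did not pass, in table order.
        return [f.name for f in fields(self) if not getattr(self, f.name)]


card = Scorecard(
    rule_fit=False,
    risk_behavior=True,
    journal_quality=True,
    skipped_trades=True,
    version_quality=True,
)
print(card.failing())
```

Keeping each dimension as a separate field means a profitable sample cannot hide a failed dimension inside a single blended score.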

Why win rate is not enough

A paper agent can show a high win rate and still run a poor process. It may enter late, ignore risk, skip journaling, or rely on a market condition that happened to be favorable during a small sample. Evaluation needs to ask whether the process is repeatable and reviewable.

Start by reading the agent's rules and prompt version. The sample only means something if you know which instructions produced it. Then review the journal for every entry, exit, and skip. A complete record should explain what the agent saw, why the decision fit the rule, what invalidated the thesis, and what should be reviewed next.

Risk behavior deserves its own score. An agent that wins by taking oversized paper positions is not better than an agent that loses while following the plan. The review should compare each decision with the AI agent risk controls workflow before drawing conclusions from paper returns.

Evaluation should end with one decision: keep collecting evidence, tighten one rule, reduce simulated size, change the prompt output, or retire the test. That decision should be written before the next sample starts.

Example evaluation

Sample: An AI paper trading agent records twenty simulated decisions over two weeks. Twelve are entries, five are skips, and three are exits from the prior sample.

Finding: Paper profit is positive, but the journal shows four entries happened before the confirmation rule completed. The skipped trades were well documented, and size stayed inside limits.

Score: Risk behavior passes, journal quality passes, skip discipline passes, but rule fit needs work. The next version tightens confirmation language without changing position sizing or the watchlist.

Decision: The agent is not promoted or discarded. It starts a new paper sample with one rule change so the next evaluation can compare early-entry behavior directly.
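The arithmetic behind this example can be checked in a few lines. The counts come from the sample above. Measuring rule fit over entries only is an assumption made for illustration, since the example reports rule problems only on entries.

```python
from collections import Counter

# Decision counts taken from the example sample above.
decisions = ["entry"] * 12 + ["skip"] * 5 + ["exit"] * 3
early_entries = 4  # entries taken before the confirmation rule completed

counts = Counter(decisions)

# Assumption: rule fit is measured over entries only.
entry_rule_fit_rate = (counts["entry"] - early_entries) / counts["entry"]

print(f"sample size: {len(decisions)}")
print(f"entry rule-fit pass rate: {entry_rule_fit_rate:.0%}")
```

Four early entries out of twelve leave a pass rate of about two thirds, which is why rule fit is flagged even though paper profit is positive.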

Metrics to keep

Track rule-fit pass rate, average planned risk, maximum paper drawdown, skipped-trade count, mistake tags, sample size, and review completion rate.
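A minimal sketch of how these metrics could be computed from journal records follows. The record fields (`kind`, `rule_fit`, `planned_risk`, `tags`, `reviewed`), the sample values, and the equity series are all hypothetical. A real journal would need its own schema.

```python
from collections import Counter
from typing import Dict, List


def max_drawdown(equity: List[float]) -> float:
    """Largest peak-to-trough decline, as a fraction of the running peak.

    Assumes a non-empty series of positive paper equity values.
    """
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def summarize(journal: List[Dict]) -> Dict:
    """Summarize a non-empty list of hypothetical journal records."""
    entries = [r for r in journal if r["kind"] == "entry"]
    n = len(journal)
    return {
        "sample_size": n,
        "rule_fit_pass_rate": sum(r["rule_fit"] for r in journal) / n,
        "average_planned_risk": (
            sum(r["planned_risk"] for r in entries) / len(entries) if entries else 0.0
        ),
        "skipped_trade_count": sum(r["kind"] == "skip" for r in journal),
        "mistake_tags": Counter(tag for r in journal for tag in r["tags"]),
        "review_completion_rate": sum(r["reviewed"] for r in journal) / n,
    }


# Made-up records for illustration only.
journal = [
    {"kind": "entry", "rule_fit": True, "planned_risk": 0.01, "tags": [], "reviewed": True},
    {"kind": "entry", "rule_fit": False, "planned_risk": 0.01, "tags": ["early_entry"], "reviewed": True},
    {"kind": "skip", "rule_fit": True, "planned_risk": 0.0, "tags": [], "reviewed": False},
    {"kind": "exit", "rule_fit": True, "planned_risk": 0.0, "tags": [], "reviewed": True},
]
summary = summarize(journal)
drawdown = max_drawdown([100.0, 110.0, 99.0, 105.0, 120.0])
print(summary)
print(f"max paper drawdown: {drawdown:.1%}")
```

Skips are counted as first-class records here, which keeps skip discipline measurable alongside entries and exits.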

Metrics to distrust alone

Win rate, paper return, average profit, and a single leaderboard rank can each mislead when they are not paired with risk and rule-fit context.

How to compare agent versions

Compare versions by behavior, not by a single paper result. Version one may have a higher paper return, while version two may follow rules more consistently and create better journal records. The stronger version depends on the evaluation goal.

Start every comparison by naming the change. If version two changed the confirmation rule, the review should focus on early entries and skipped trades. If version two changed the risk cap, the review should focus on size, exposure, and drawdown. A version comparison is weak when it changes many variables and then tries to explain the result afterward.

Use the same sample window when possible. If one version ran during a calm market and another ran during a volatile market, say that directly in the review. The evaluation may still be useful, but it should not pretend the samples were identical.

The best version is often the one that is easier to audit. A slightly less profitable paper sample with clean decisions may be more useful than a profitable sample with missing invalidation, vague rationale, and no skipped-trade evidence.
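The comparison rules in this section can be sketched as a small pre-review check: name the change, derive the review focus from it, and flag comparisons that change several variables or span different market conditions. The rule names, the `REVIEW_FOCUS` mapping, and the `plan_comparison` function are hypothetical and exist only for this illustration.

```python
from typing import Dict, List

# Hypothetical mapping from the rule that changed to the behaviors to review.
REVIEW_FOCUS: Dict[str, List[str]] = {
    "confirmation_rule": ["early entries", "skipped trades"],
    "risk_cap": ["size", "exposure", "drawdown"],
}


def plan_comparison(changed_rules: List[str], window_a: str, window_b: str) -> Dict:
    """Return the review focus plus warnings that weaken the comparison."""
    warnings = []
    if len(changed_rules) != 1:
        warnings.append(
            f"{len(changed_rules)} variables changed; a clean comparison changes one"
        )
    if window_a != window_b:
        warnings.append(
            f"samples ran in different conditions: {window_a} vs {window_b}"
        )
    focus = [item for rule in changed_rules for item in REVIEW_FOCUS.get(rule, [])]
    return {"focus": focus, "warnings": warnings}


plan = plan_comparison(["confirmation_rule"], "calm", "volatile")
print(plan["focus"])
print(plan["warnings"])
```

The warnings do not block the comparison. They are meant to be written into the review so that it states directly when the samples were not identical.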

AI paper trading agent evaluation FAQ

How do you evaluate an AI paper trading agent?

Evaluate rule fit, journal quality, risk behavior, skipped trades, drawdown, sample size, and whether each version improves one specific behavior.

Is win rate enough?

No. Win rate can hide weak process. Review risk, rationale, sample size, skipped trades, and whether the agent followed the written rule.

When should the agent be changed?

Change it when repeated paper evidence shows a specific behavior problem, such as early entries, missing invalidation, oversized positions, or poor skip discipline.