Paper trading benchmark review worksheet

Use this worksheet to compare simulated benchmark results before changing an AI trading rule, prompt, persona, or risk limit. It is built for paper-trading review, sample quality checks, and clear next actions.

Paper-first safety note

Trading Boy does not execute live trades, hold funds, or provide financial advice. This worksheet is for reviewing simulated paper-trading evidence, not for approving live capital, copying signals, or predicting future returns.

Reusable benchmark review worksheet

Copy this worksheet into a journal, spreadsheet, team note, or agent review prompt. Complete the fields after a benchmark window closes so each row is evaluated against the same standard instead of being judged by the most recent outcome.

Worksheet field	What to record	Review question	Decision output
Benchmark window	Start date, end date, market, timeframe, rule version, agent persona, and paper account assumption.	Is this sample comparable with the prior benchmark window?	Continue, restart, or separate into a new benchmark.
Sample size	Total eligible paper trades, skipped setups, excluded trades, and reason for each exclusion.	Is the sample large enough to support a workflow decision?	Keep collecting, review now, or mark inconclusive.
Simulated return	Paper PnL, return on benchmark bankroll, average winner, average loser, and payoff ratio.	Did the result come from repeatable payoff or one outlier?	Accept for review, normalize, or flag outlier dependency.
Risk behavior	Max drawdown, worst paper loss, size violations, stop behavior, and correlated exposure.	Did the agent respect the written paper risk limits?	Keep risk, reduce size, tighten limits, or pause the test.
Rule fit	Number of entries that matched the setup, number of rule breaks, and examples of unclear instructions.	Was the benchmark measuring the intended rule?	Keep rule, rewrite prompt, add skip rule, or retire setup.
Market context	Trend, volatility, news regime, liquidity conditions, and broad market filter status.	Did the rule only work in one narrow market condition?	Segment by context or collect broader evidence.
Review notes	One sentence on what improved, one sentence on what failed, and one sentence on what is still unknown.	Can another reviewer understand the benchmark without guessing?	Approve notes, request detail, or rerun the review.
Next action	Keep collecting samples, change one rule, reduce paper size, split the benchmark, or stop the test.	Is the next action narrow enough to compare in the next window?	Assign one owner and one follow-up date.
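
The Simulated return row can be filled in from a plain list of per-trade paper PnL values. The sketch below is a minimal Python illustration, not Trading Boy's implementation: the function name summarize_paper_pnl, the input shape, and the 50% outlier threshold are assumptions chosen for this example.

```python
def summarize_paper_pnl(pnls, bankroll, outlier_share_limit=0.5):
    """Summarize the Simulated return fields from per-trade paper PnL.

    pnls: list of simulated PnL values, one per eligible paper trade.
    bankroll: fixed benchmark bankroll (assumed positive).
    outlier_share_limit: hypothetical threshold; if one winner supplies
        more than this share of gross winnings, flag outlier dependency.
    """
    winners = [p for p in pnls if p > 0]
    losers = [p for p in pnls if p < 0]

    avg_winner = sum(winners) / len(winners) if winners else 0.0
    avg_loser = sum(losers) / len(losers) if losers else 0.0  # zero or negative

    # Payoff ratio: average winner divided by the size of the average loser.
    payoff_ratio = avg_winner / abs(avg_loser) if avg_loser else None

    # Share of gross winnings contributed by the single largest winner.
    top_winner_share = max(winners) / sum(winners) if winners else None

    return {
        "paper_pnl": sum(pnls),
        "return_on_bankroll": sum(pnls) / bankroll,
        "average_winner": avg_winner,
        "average_loser": avg_loser,
        "payoff_ratio": payoff_ratio,
        "top_winner_share": top_winner_share,
        "outlier_dependent": (
            top_winner_share is not None
            and top_winner_share > outlier_share_limit
        ),
    }


# Example with made-up paper trades: one large winner dominates the gain.
summary = summarize_paper_pnl([50, -20, 30, -10, 400, -30], bankroll=10_000)
print(summary)
```

In this made-up sample the payoff ratio looks strong, but the largest winner supplies most of the gross winnings, so the decision output would be to flag outlier dependency rather than accept the result as repeatable.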

Use it before changing rules

A benchmark review should happen before a prompt rewrite, persona change, or risk-setting update. Start with the pre-trade checklist and the paper trading journal template, then use this worksheet to decide whether the sample supports a change.

The safest review habit is to change one variable at a time. If the worksheet points to unclear entries, poor skip logic, and oversized paper risk, choose the issue that blocks the next benchmark most directly instead of rewriting the whole workflow.

Use it after a benchmark closes

After the window closes, compare the row against the leaderboard methodology and the paper-trading limitations. Paper fills can miss live-market effects, so the benchmark is useful as review evidence, not as a live-performance promise.

Then connect the finding to the post-trade review template. The worksheet says whether the benchmark deserves more samples, a tighter rule, a smaller paper size, or a clean stop.

Example completed benchmark review

Benchmark window: Four-week SOL paper-trading benchmark, hourly setup, version 3 prompt, balanced agent persona, fixed benchmark bankroll, no live execution.

Sample size: Twenty-eight eligible paper trades, nine skipped setups, two excluded entries because the market filter was disabled. The sample is useful for process review but still too small for broad claims.

Simulated return: The paper return was positive, but one outlier winner contributed most of the gain. Average loss stayed inside plan, and the payoff ratio looked acceptable only after the outlier was marked.

Risk behavior: Max drawdown stayed below the written limit. One paper entry approached the correlated exposure cap, so the next review should check whether the agent understands overlapping market risk.

Rule fit: Most entries matched the reclaim setup, but late entries appeared in five cases. The next benchmark should add a confirmation timeout rather than changing the whole thesis.

Next action: Keep the setup active in paper mode, add one late-entry guardrail, and rerun the same benchmark window before comparing against the leaderboard.

How to read the worksheet

The worksheet is designed to keep benchmark review slow and specific. A single high-return paper window should not automatically promote a rule. A single losing paper window should not automatically delete a rule. The useful question is whether the agent followed the written process, respected the simulated risk limits, and produced enough clean evidence to justify the next step.

Start with sample quality: If there are too few trades, too many exclusions, or inconsistent rule versions, the benchmark should usually be marked inconclusive.
Separate result from behavior: Positive paper PnL does not fix a rule break. Negative paper PnL does not invalidate a clean process by itself.
Check risk before return: Drawdown, size violations, and correlated exposure can make a benchmark unusable even when the simulated return looks attractive.
Document one next action: The best review output is a narrow change or a decision to keep collecting evidence, not a long list of new experiments.
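
The "check risk before return" habit can be sketched as a gate that runs before anyone looks at paper PnL. This is a hedged Python illustration under stated assumptions: max_drawdown and risk_gate are hypothetical names, drawdown is measured peak-to-trough on a simulated equity curve built from a fixed bankroll, and the 20% limit stands in for whatever limit the written rules define.

```python
def max_drawdown(pnls, bankroll):
    """Largest peak-to-trough drop of the simulated equity curve, as a fraction."""
    equity = bankroll
    peak = bankroll
    worst = 0.0
    for pnl in pnls:
        equity += pnl
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst


def risk_gate(pnls, bankroll, drawdown_limit, size_violations=0):
    """Return a review status; the risk checks run before the return is read."""
    if max_drawdown(pnls, bankroll) > drawdown_limit or size_violations > 0:
        return "unusable: fix risk behavior before reading the return"
    return "risk ok: continue to return review"


# Made-up window: net paper PnL is positive, but the drawdown breaches the limit.
window = [500, -400, 300]
print(sum(window))
print(max_drawdown(window, bankroll=1_000))
print(risk_gate(window, bankroll=1_000, drawdown_limit=0.20))
```

The sample window ends with a positive paper PnL, yet the drop from the 1,500 peak to 1,100 is about 27%, above the assumed 20% limit, so the gate reports the benchmark as unusable before the return is considered.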

Good benchmark evidence

Good evidence has a defined window, enough eligible paper trades, stable rules, recorded skips, visible drawdown, payoff context, and a clear link between the benchmark result and the next action. It should be easy to audit alongside the written risk controls and the review notes.

Weak benchmark evidence

Weak evidence depends on one lucky outlier, mixes several rule versions, hides skipped setups, ignores drawdown, or explains the result only after the fact. Weak evidence should stay in research notes rather than becoming a claim on a public page.

Review sequence for AI trading agents

For an AI paper-trading workflow, the benchmark review worksheet sits between the individual journal entry and the public-facing benchmark summary. It helps reviewers decide what changed, what stayed stable, and what is still unknown.

Start by defining the prompt or rule version with an AI trading agent prompt template or a trade thesis journal template. During the benchmark, record each simulated decision in the journal and tag important skipped trades. When the benchmark window closes, complete the worksheet without editing the original entries.

Next, ask the review questions in order. Did the agent follow its setup? Did it respect invalidation? Did it size paper positions consistently? Did it avoid correlated exposure? Did the result depend on one market regime? Did the next action change only one thing? If any answer is unclear, the benchmark needs more evidence before it should influence the workflow.

Finally, turn the worksheet into a short decision. Keep collecting, tighten one rule, reduce paper size, split the benchmark by market context, or retire the test. The decision can then feed a trading journal review, a private team note, or a public benchmark explanation that points users back to paper-trading limitations.

When to mark a benchmark inconclusive

Mark the review inconclusive when the sample is small, the benchmark window changed midstream, the agent used multiple prompt versions, market context was not recorded, or the result depends on a single unusually large simulated trade. Inconclusive does not mean failed. It means the evidence is not strong enough to change the workflow yet.
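
The conditions above can be expressed as a small check that returns every reason a review should be marked inconclusive. This is a sketch only: the BenchmarkSample fields, the 30-trade minimum, and the 50% single-trade share are hypothetical values for illustration, since the worksheet does not define numeric thresholds.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkSample:
    eligible_trades: int
    window_changed_midstream: bool
    prompt_versions_used: int
    market_context_recorded: bool
    largest_trade_share: float  # share of total paper PnL from one trade


def inconclusive_reasons(sample, min_trades=30, max_single_trade_share=0.5):
    """List the reasons to mark a review inconclusive; empty means none found."""
    reasons = []
    if sample.eligible_trades < min_trades:  # assumed threshold
        reasons.append("sample is small")
    if sample.window_changed_midstream:
        reasons.append("benchmark window changed midstream")
    if sample.prompt_versions_used > 1:
        reasons.append("multiple prompt versions were used")
    if not sample.market_context_recorded:
        reasons.append("market context was not recorded")
    if sample.largest_trade_share > max_single_trade_share:  # assumed threshold
        reasons.append("result depends on one unusually large simulated trade")
    return reasons


sample = BenchmarkSample(
    eligible_trades=28,
    window_changed_midstream=False,
    prompt_versions_used=1,
    market_context_recorded=True,
    largest_trade_share=0.6,
)
reasons = inconclusive_reasons(sample)
print("inconclusive" if reasons else "reviewable", reasons)
```

Returning the list of reasons, rather than a single pass or fail flag, keeps the outcome auditable: another reviewer can see why the evidence was not strong enough to change the workflow yet.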

Benchmark review FAQ

What is a paper trading benchmark review worksheet?

It is a reusable checklist for comparing simulated paper-trading results against a benchmark. It covers sample size, drawdown, win rate, payoff, rule fit, and the next review action.

How often should I update the worksheet?

Update it each time a benchmark window closes, for example weekly or monthly. Avoid changing benchmark rules mid-sample unless the test has been retired and restarted.

Can this worksheet prove that a strategy will work live?

No. It reviews simulated paper-trading evidence only. Trading Boy does not execute live trades, hold funds, or provide financial advice.