Backtesting & Validation
This skill provides the complete backtesting and strategy validation pipeline — from historical data collection through statistical validation of results. Backtesting is the process of testing a trading strategy on historical data before risking real capital. Done correctly, it provides evidence of a strategy’s edge. Done incorrectly, it produces dangerously misleading results due to overfitting, look-ahead bias, and other pitfalls. This skill teaches rigorous backtesting methodology, including walk-forward analysis, Freqtrade integration, paper trading, and the statistical tests needed to distinguish genuine edge from random chance.
When to Use This Skill
- When testing a new trading strategy before deploying capital
- When evaluating the historical performance of any strategy
- When determining if backtest results are statistically significant
- When checking for overfitting in optimized strategy parameters
- When transitioning from backtest to paper trading to live trading
- When configuring Freqtrade or other backtesting frameworks
- When calculating performance metrics (Sharpe, Sortino, max drawdown, etc.)
- When comparing multiple strategy variants to select the best one
- When a user presents backtest results and wants them validated
What This Skill Does
- Backtesting Pipeline: Guides through data collection, strategy coding, execution simulation, and performance analysis
- Bias Detection: Identifies look-ahead bias, survivorship bias, overfitting, and other common pitfalls
- Walk-Forward Analysis: Implements in-sample optimization with out-of-sample validation on rolling windows
- Freqtrade Integration: Provides strategy templates, backtesting commands, and hyperopt configuration
- Paper Trading: Defines methodology for transitioning from backtest to paper to live
- Performance Metrics: Calculates and interprets Sharpe, Sortino, Calmar, max drawdown, win rate, profit factor, expectancy
- Statistical Significance: Determines minimum sample size, bootstrap confidence intervals, and Monte Carlo analysis
- Overfitting Detection: Compares in-sample vs. out-of-sample degradation and parameter sensitivity
How to Use
Run a Backtest
Backtest a momentum crossover strategy (EMA 9/21) on BTC daily data for the past 2 years
Test my mean-reversion strategy on ETH: buy when z-score < -2, sell at -0.5
Validate Results
I got a Sharpe of 2.5 on my backtest -- is this realistic or am I overfitting?
Run walk-forward validation on my strategy -- 6-month in-sample, 2-month out-of-sample
Freqtrade Setup
Create a Freqtrade strategy template for a Bollinger Band reversion trade
Configure hyperopt for my Freqtrade strategy
Paper Trading
My backtest looks good. What's the paper trading plan before going live?
Data Sources
With MCP/CLI tools connected:
- Freqtrade CLI — Strategy backtesting, hyperopt optimization, paper trading, live trading
- Empyrical MCP — Performance metrics calculation (Sharpe, Sortino, max drawdown, VaR, etc.)
- yFinance MCPs (tooyipjee, maxscheijen, Adity-star) — Historical price and volume data
- Binance MCP — Historical crypto data, OHLCV candles
- OpenBB CLI — Comprehensive financial data, backtesting frameworks
Without tool access: Ask the user to provide:
- Strategy rules (entry, exit, position sizing)
- Historical data or source
- Time period for the test
- Initial capital and commission assumptions
- Any existing backtest results to validate
Proceed with methodology guidance and manual analysis of provided results.
Methodology
Step 1: The Backtesting Pipeline
COMPLETE BACKTESTING WORKFLOW:
┌──────────────────────────────────────────────────────────────┐
│ Phase 1: DATA COLLECTION │
│ > Gather clean OHLCV data for target asset(s) │
│ > Check for data quality: gaps, splits, survivorship │
│ > Minimum: 2 years daily data (500+ bars) or equivalent │
├──────────────────────────────────────────────────────────────┤
│ Phase 2: STRATEGY CODING │
│ > Define exact entry/exit rules (no ambiguity) │
│ > Define position sizing rules │
│ > Include commission and slippage modeling │
│ > Include realistic execution assumptions │
├──────────────────────────────────────────────────────────────┤
│ Phase 3: IN-SAMPLE BACKTEST │
│ > Run on training data (60-70% of total) │
│ > Optimize parameters if needed │
│ > Record ALL metrics │
├──────────────────────────────────────────────────────────────┤
│ Phase 4: OUT-OF-SAMPLE VALIDATION │
│ > Run EXACT same strategy (no re-optimization) on held-out data│
│ > Compare IS vs OOS metrics │
│ > Acceptable degradation: < 30% decline in key metrics │
├──────────────────────────────────────────────────────────────┤
│ Phase 5: WALK-FORWARD ANALYSIS │
│ > Repeat IS/OOS on rolling windows │
│ > Chain OOS results to create a realistic equity curve │
├──────────────────────────────────────────────────────────────┤
│ Phase 6: STATISTICAL VALIDATION │
│ > Minimum sample size check │
│ > Bootstrap confidence intervals │
│ > Monte Carlo simulation for worst-case drawdown │
├──────────────────────────────────────────────────────────────┤
│ Phase 7: PAPER TRADING │
│ > Minimum 30 trades or 3 months (whichever is longer) │
│ > Compare paper results to backtest expectations │
│ > If within tolerance → proceed to live │
├──────────────────────────────────────────────────────────────┤
│ Phase 8: LIVE DEPLOYMENT (at reduced size) │
│ > Start at 25-50% of target size │
│ > Scale up over 3-6 months if results confirm │
└──────────────────────────────────────────────────────────────┘
Step 2: Bias Detection
Look-Ahead Bias
LOOK-AHEAD BIAS: Using future information in trading decisions.
COMMON SOURCES:
1. Using close price to make decisions that execute at close price
FIX: Use close for signals, execute at next bar's open
2. Using data that wasn't available at the time (revised earnings, restated GDP)
FIX: Use point-in-time data only
3. Indicator calculation using future data (centered moving averages)
FIX: Use only trailing indicators
4. Filling missing data with future values (forward fill from future)
FIX: Use only backward fill or drop missing periods
5. Using survivorship-bias-free universes retroactively
FIX: Use constituent lists as of each historical date
DETECTION CHECKLIST:
- [ ] Signals generated BEFORE the bar closes, execution at NEXT bar open?
- [ ] All data used was available at the time of the signal?
- [ ] No forward-looking indicators (centered averages, etc.)?
- [ ] Universe of assets reflects what was available at each point in time?
- [ ] Dividend adjustments applied correctly (backward, not forward)?
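The first two checklist items can be made mechanical. A minimal pandas sketch (the column names and the EMA crossover rule are illustrative, not part of the checklist): the signal uses only the closed bar, and the position earns the following bar's move.

```python
import pandas as pd

def lagged_signal_returns(df: pd.DataFrame) -> pd.Series:
    """Look-ahead-safe return stream for a long-only EMA crossover.

    The signal is computed on bar N's close, the position is taken at
    bar N+1's open, and the trade earns bar N+1's open-to-close move.
    (The close-to-next-open gap is ignored for simplicity.)
    """
    fast = df["close"].ewm(span=9, adjust=False).mean()
    slow = df["close"].ewm(span=21, adjust=False).mean()
    signal = (fast > slow).astype(int)        # known only once bar N closes
    position = signal.shift(1).fillna(0)      # acted on at bar N+1's open
    bar_return = df["close"] / df["open"] - 1  # open-to-close return per bar
    return position * bar_return
```

The `shift(1)` is the entire fix: deleting it reintroduces the classic same-bar look-ahead error.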
Survivorship Bias
SURVIVORSHIP BIAS: Only testing on assets that survived to the present.
IMPACT: Overstates returns by excluding bankruptcies, delistings, and failed projects.
EXAMPLES:
Stocks: Testing momentum on today's S&P 500 members ignores stocks that were
removed due to decline. Adds ~1-2% annual bias.
Crypto: Testing on today's top 50 coins ignores hundreds of projects that went
to zero. Adds potentially 5-10%+ annual bias.
MITIGATION:
1. Use survivorship-bias-free databases (CRSP for stocks, CoinGecko full history)
2. Include delisted assets in the backtest universe
3. For crypto: Account for tokens that went to zero
4. Clearly note if survivorship bias cannot be eliminated
Overfitting
OVERFITTING: Strategy is tuned to historical noise, not signal.
WARNING SIGNS:
- Very specific parameters (e.g., EMA(17) instead of EMA(20))
- Many parameters (> 5 free parameters for a simple strategy)
- Strategy only works on one asset or one time period
- In-sample Sharpe >> out-of-sample Sharpe
- Strategy fails on similar but different data
- Equity curve is unrealistically smooth
- Win rate > 80% (suspicious unless strategy has very tight stops)
OVERFITTING DETECTION:
Metric Degradation Test:
OOS_Sharpe / IS_Sharpe > 0.7 → Likely robust
OOS_Sharpe / IS_Sharpe = 0.5-0.7 → Possible overfitting
OOS_Sharpe / IS_Sharpe < 0.5 → Likely overfit
Parameter Stability Test:
Vary each parameter ±20%
If performance collapses → overfit to exact parameter
If performance degrades gracefully → more likely robust
Cross-Asset Test:
Run the same strategy on similar assets (e.g., BTC strategy on ETH)
If it works → strategy captures a general pattern
If it fails → may be overfit to that specific asset
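The parameter stability test above can be automated as a sweep. In this sketch, `toy_score` is a stand-in performance surface used only for illustration; a real `score_fn` would run the backtest and return the out-of-sample Sharpe for the given parameters.

```python
def parameter_stability(score_fn, base_params: dict, pct: float = 0.20) -> dict:
    """Evaluate score_fn at the base parameters and with each parameter
    perturbed by ±pct while holding the others fixed.

    Integer parameters are truncated after scaling. A robust strategy
    degrades gracefully across the perturbations; a collapse at any
    neighbor suggests overfitting to the exact base value.
    """
    results = {"base": score_fn(base_params)}
    for name, value in base_params.items():
        for direction in (-1, +1):
            perturbed = dict(base_params)
            perturbed[name] = type(value)(value * (1 + direction * pct))
            key = f"{name}{'+' if direction > 0 else '-'}{int(pct * 100)}%"
            results[key] = score_fn(perturbed)
    return results

# Stand-in score surface (peak at EMA 9/21) -- replace with a real backtest.
def toy_score(p):
    return 1.0 - 0.001 * (p["ema_fast"] - 9) ** 2 - 0.001 * (p["ema_slow"] - 21) ** 2

grid = parameter_stability(toy_score, {"ema_fast": 9, "ema_slow": 21})
```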
Step 3: Walk-Forward Analysis
WALK-FORWARD METHOD:
CONCEPT: Repeatedly optimize on a window, validate on the next window,
then roll forward. Chain the OOS results for a realistic performance estimate.
WINDOW CONFIGURATION:
| Strategy Frequency | IS Window | OOS Window | Step Size |
|-------------------|-------------|-------------|-------------|
| Daily trades | 6 months | 2 months | 2 months |
| Weekly trades | 12 months | 3 months | 3 months |
| Monthly trades | 24 months | 6 months | 6 months |
STEP-BY-STEP:
Window 1: IS = months 1-6, OOS = months 7-8 → Record OOS metrics
Window 2: IS = months 3-8, OOS = months 9-10 → Record OOS metrics
Window 3: IS = months 5-10, OOS = months 11-12 → Record OOS metrics
...continue rolling forward...
Final equity curve = chain of ALL OOS results (no IS data in the curve)
VALIDATION CRITERIA:
Walk-forward efficiency = Average(OOS_Metric) / Average(IS_Metric)
WFE > 0.85: Excellent → very robust strategy
WFE > 0.7: Good → strategy is robust
WFE > 0.5: Acceptable → strategy has a real edge
WFE 0.3-0.5: Marginal → weak evidence, proceed with caution
WFE < 0.3: Poor → strategy is likely overfit, do NOT deploy
MINIMUM WINDOWS:
At least 5 walk-forward windows for statistical relevance
Preferably 8-12 windows covering different market conditions
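The rolling window arithmetic above is easy to get wrong by hand. A small generator sketch (bar counts are illustrative; when `step` equals the OOS length, the chained OOS slices are contiguous and non-overlapping, which is what the final equity curve requires):

```python
def walk_forward_windows(n_bars: int, is_len: int, oos_len: int, step: int):
    """Yield (is_slice, oos_slice) index pairs for rolling walk-forward.

    Each window optimizes on is_len bars, validates on the oos_len bars
    immediately after, then rolls forward by step bars.
    """
    start = 0
    while start + is_len + oos_len <= n_bars:
        yield (slice(start, start + is_len),
               slice(start + is_len, start + is_len + oos_len))
        start += step

# ~12 months of daily bars: 6-month IS, 2-month OOS, 2-month step
windows = list(walk_forward_windows(n_bars=252, is_len=126, oos_len=42, step=42))
```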
Step 4: Freqtrade Integration
Strategy Template
```python
# Freqtrade strategy template
# Save as: user_data/strategies/MyStrategy.py
from freqtrade.strategy import IStrategy, merge_informative_pair
from pandas import DataFrame
import talib.abstract as ta


class MyStrategy(IStrategy):
    # Strategy parameters
    INTERFACE_VERSION = 3
    timeframe = '1h'

    # Position management
    stoploss = -0.05  # 5% stop loss
    trailing_stop = True
    trailing_stop_positive = 0.01
    trailing_stop_positive_offset = 0.03
    trailing_only_offset_is_reached = True

    # ROI table (take profit at these levels, keyed by minutes in trade)
    minimal_roi = {
        "0": 0.10,    # 10% immediate target
        "30": 0.05,   # 5% after 30 minutes
        "60": 0.03,   # 3% after 60 minutes
        "120": 0.01,  # 1% after 120 minutes
    }

    def populate_indicators(self, dataframe: DataFrame, metadata: dict) -> DataFrame:
        # Add your indicators here
        dataframe['ema_9'] = ta.EMA(dataframe, timeperiod=9)
        dataframe['ema_21'] = ta.EMA(dataframe, timeperiod=21)
        dataframe['rsi'] = ta.RSI(dataframe, timeperiod=14)
        dataframe['adx'] = ta.ADX(dataframe, timeperiod=14)
        dataframe['volume_sma'] = ta.SMA(dataframe['volume'], timeperiod=20)
        return dataframe

    def populate_entry_trend(self, dataframe: DataFrame, metadata: dict) -> DataFrame:
        dataframe.loc[
            (
                (dataframe['ema_9'] > dataframe['ema_21']) &        # EMA crossover
                (dataframe['adx'] > 25) &                           # Trend confirmed
                (dataframe['rsi'] < 70) &                           # Not overbought
                (dataframe['volume'] > dataframe['volume_sma'])     # Volume confirmation
            ),
            'enter_long'] = 1
        return dataframe

    def populate_exit_trend(self, dataframe: DataFrame, metadata: dict) -> DataFrame:
        dataframe.loc[
            (
                (dataframe['ema_9'] < dataframe['ema_21']) |  # EMA cross back
                (dataframe['rsi'] > 80)                       # Overbought
            ),
            'exit_long'] = 1
        return dataframe
```
Backtesting Commands
```bash
# Download historical data
freqtrade download-data --exchange binance --pairs BTC/USDT ETH/USDT SOL/USDT \
    --timeframe 1h --days 730

# Run backtest
freqtrade backtesting --strategy MyStrategy \
    --timeframe 1h \
    --timerange 20240101-20250101 \
    --enable-protections

# Run with detailed trade list
freqtrade backtesting --strategy MyStrategy \
    --timeframe 1h \
    --timerange 20240101-20250101 \
    --export trades --export-filename user_data/backtest_results/my_strategy.json

# Run hyperopt (parameter optimization)
freqtrade hyperopt --strategy MyStrategy \
    --hyperopt-loss SharpeHyperOptLoss \
    --spaces buy sell roi stoploss \
    --epochs 500 \
    --timerange 20240101-20241001  # IS period only!

# Paper trading (dry run)
freqtrade trade --strategy MyStrategy \
    --config user_data/config.json \
    --dry-run
```
Hyperopt Configuration
```python
# Add to the strategy class for hyperopt
from freqtrade.strategy import DecimalParameter, IntParameter


class MyStrategy(IStrategy):
    # Hyperopt parameters -- keep these to a minimum (< 6)
    buy_ema_short = IntParameter(5, 15, default=9, space='buy')
    buy_ema_long = IntParameter(15, 30, default=21, space='buy')
    buy_adx_threshold = IntParameter(20, 35, default=25, space='buy')
    sell_rsi_threshold = IntParameter(70, 85, default=80, space='sell')

    # WARNING: More parameters = higher overfitting risk
    # Rule of thumb: max parameters = sqrt(number_of_trades) / 2
    #   100 trades → max 5 parameters
    #   400 trades → max 10 parameters
```
Step 5: Performance Metrics
ESSENTIAL PERFORMANCE METRICS:
┌──────────────────────────────────────────────────────────────────┐
│ Metric │ Formula │ Benchmark │
├──────────────────────────────────────────────────────────────────┤
│ Sharpe Ratio │ (Avg Return - Rf) / StdDev │ > 1.0 acceptable│
│ │ Annualized: SR × √252 │ > 1.5 good │
│ │ │ > 2.0 excellent │
│ │ │ > 3.0 suspicious │
├──────────────────────────────────────────────────────────────────┤
│ Sortino Ratio │ (Avg Return - Rf) / DownDev │ > 1.5 acceptable│
│ │ (Uses only downside deviation)│ > 2.0 good │
├──────────────────────────────────────────────────────────────────┤
│ Calmar Ratio │ Annual Return / Max Drawdown │ > 1.0 acceptable│
│ │ │ > 2.0 good │
├──────────────────────────────────────────────────────────────────┤
│ Max Drawdown │ Largest peak-to-trough decline│ < 20% acceptable│
│ │ │ < 10% good │
├──────────────────────────────────────────────────────────────────┤
│ Win Rate │ Winning trades / Total trades │ Depends on R:R │
│ │ │ 40%+ for 2:1 R:R│
│ │ │ 55%+ for 1:1 R:R│
├──────────────────────────────────────────────────────────────────┤
│ Profit Factor │ Gross Profits / Gross Losses │ > 1.5 acceptable│
│ │ │ > 2.0 good │
├──────────────────────────────────────────────────────────────────┤
│ Expectancy │ (Win% × AvgWin) - (Loss% × AvgLoss) │ Must be > 0│
│ │ Per trade expected value │ │
├──────────────────────────────────────────────────────────────────┤
│ Risk of Ruin │ P(account drawdown > X%) │ < 5% acceptable │
│ │ Based on win rate, payoff, risk│ │
├──────────────────────────────────────────────────────────────────┤
│ Recovery Factor │ Net Profit / Max Drawdown │ > 3.0 acceptable│
└──────────────────────────────────────────────────────────────────┘
METRIC RED FLAGS:
Sharpe > 3.0 → Almost certainly overfit or biased
Win rate > 80% with tight stops → Likely look-ahead bias
Max drawdown < 5% on a 2-year test → Unrealistic
No losing months → Extreme red flag
Profit factor > 5.0 → Likely overfit
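The core metrics in the table can be computed directly from a series of per-period returns. A minimal NumPy sketch (assumes simple returns, a zero risk-free rate, and at least one win and one loss; a dedicated library such as empyrical provides hardened versions):

```python
import numpy as np

def performance_metrics(returns: np.ndarray, periods_per_year: int = 252) -> dict:
    """Core backtest metrics from an array of per-period simple returns."""
    mean, std = returns.mean(), returns.std(ddof=1)
    downside = returns[returns < 0].std(ddof=1)       # downside deviation only
    equity = np.cumprod(1 + returns)                  # compounded equity curve
    peak = np.maximum.accumulate(equity)
    max_dd = ((equity - peak) / peak).min()           # largest peak-to-trough decline
    wins, losses = returns[returns > 0], returns[returns < 0]
    return {
        "sharpe": mean / std * np.sqrt(periods_per_year),
        "sortino": mean / downside * np.sqrt(periods_per_year),
        "max_drawdown": max_dd,
        "win_rate": len(wins) / len(returns),
        "profit_factor": wins.sum() / abs(losses.sum()),
        "expectancy": mean,  # expected value per period/trade
    }
```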
Step 6: Statistical Significance
MINIMUM TRADE COUNT:
The minimum number of trades needed depends on the strategy's win rate
(win rates near 50% are the hardest to distinguish from coin-flip chance,
so they require the largest samples):
| Win Rate | Min Trades (95% confidence) | Min Trades (99% confidence) |
|----------|-----------------------------|-----------------------------|
| 40%      | ~60 trades                  | ~100 trades                 |
| 50%      | ~100 trades                 | ~150 trades                 |
| 60%      | ~60 trades                  | ~100 trades                 |
Rule of thumb: MINIMUM 100 trades for any strategy validation
For parameter optimization: Min trades = 20 × number_of_parameters
T-TEST FOR STRATEGY EDGE:
H0: Average return per trade = 0 (no edge)
H1: Average return per trade > 0 (strategy has edge)
t = (Mean_Return × √n) / StdDev_Return
p-value < 0.05 → Strategy has statistically significant edge
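The one-sided t-test above maps directly onto SciPy (the `alternative="greater"` form matches H1: mean return per trade > 0):

```python
import numpy as np
from scipy import stats

def edge_t_test(trade_returns: np.ndarray):
    """One-sided t-test of H0: mean return per trade = 0.

    Returns (t_statistic, p_value); p < 0.05 suggests a statistically
    significant edge, assuming roughly independent trades.
    """
    t_stat, p_value = stats.ttest_1samp(trade_returns, 0.0, alternative="greater")
    return t_stat, p_value
```

Note the independence assumption: overlapping or strongly autocorrelated trades inflate the effective sample size and make p-values optimistic.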
BOOTSTRAP CONFIDENCE INTERVAL:
1. Resample trades with replacement (10,000 iterations)
2. Calculate metric (Sharpe, expectancy) for each resample
3. Sort results, take 2.5th and 97.5th percentile → 95% CI
If 95% CI for Sharpe includes 0 → strategy may NOT have real edge
If 95% CI for Sharpe is entirely > 0 → strategy likely has edge
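The three bootstrap steps can be sketched in a few lines of NumPy (the `metric` callable is whatever statistic you want the interval for, e.g. mean return per trade or per-trade Sharpe):

```python
import numpy as np

def bootstrap_ci(trade_returns, metric, n_boot=10_000, ci=0.95, seed=42):
    """Percentile bootstrap confidence interval for a per-trade metric.

    Resamples trades with replacement n_boot times and returns the
    (lower, upper) percentile interval of the metric across resamples.
    """
    rng = np.random.default_rng(seed)
    r = np.asarray(trade_returns)
    samples = np.array([metric(rng.choice(r, size=len(r), replace=True))
                        for _ in range(n_boot)])
    lo, hi = np.percentile(samples, [(1 - ci) / 2 * 100, (1 + ci) / 2 * 100])
    return lo, hi

# Example: 95% CI for per-trade Sharpe
# lo, hi = bootstrap_ci(returns, lambda x: x.mean() / x.std(ddof=1))
```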
MONTE CARLO WORST-CASE DRAWDOWN:
1. Randomly shuffle trade sequence (10,000 iterations)
2. Calculate max drawdown for each sequence
3. 95th percentile of drawdowns = expected worst-case drawdown
Ensure your risk tolerance can handle the Monte Carlo worst-case drawdown
If worst-case drawdown > 40% → strategy needs better risk management
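The Monte Carlo procedure above is a short loop in NumPy. Drawdowns are signed negative here, so the 95th percentile of drawdown *magnitudes* corresponds to the 5th percentile of the signed values:

```python
import numpy as np

def monte_carlo_drawdown(trade_returns, n_sims=10_000, percentile=95, seed=42):
    """Estimate worst-case max drawdown by reshuffling the trade sequence.

    Returns a negative number: the (100 - percentile)th percentile of
    signed max drawdowns across n_sims random trade orderings.
    """
    rng = np.random.default_rng(seed)
    r = np.asarray(trade_returns)
    worst = np.empty(n_sims)
    for i in range(n_sims):
        equity = np.cumprod(1 + rng.permutation(r))   # random trade order
        peak = np.maximum.accumulate(equity)
        worst[i] = ((equity - peak) / peak).min()     # max drawdown (negative)
    return np.percentile(worst, 100 - percentile)
```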
Step 7: Paper Trading Protocol
PAPER TRADING TRANSITION:
REQUIREMENTS TO START PAPER TRADING:
- Backtest passes walk-forward validation (WFE > 0.5)
- Minimum 100 trades in backtest
- Statistical significance confirmed (p < 0.05)
- Sharpe ratio > 1.0 (out-of-sample)
- Max drawdown within tolerance (see risk-management)
- Strategy coded with no manual overrides
PAPER TRADING DURATION:
Minimum: 30 trades OR 3 months, whichever is LONGER
Preferred: 50+ trades or 6 months
PAPER TRADING VALIDATION:
Compare paper results to backtest expectations:
| Metric | Tolerance vs Backtest |
|----------------|---------------------------|
| Sharpe Ratio | > 50% of backtest Sharpe |
| Win Rate | Within ±10% absolute |
| Avg Win/Loss | Within ±20% relative |
| Max Drawdown | Within 1.5× backtest DD |
| Trade Frequency| Within ±30% of expected |
If ALL metrics within tolerance → APPROVED for live
If 1-2 metrics outside tolerance → Investigate cause, extend paper period
If 3+ metrics outside → Strategy likely overfit, return to development
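The tolerance table can be encoded as a simple checker. A sketch assuming a flat dict of metrics per run (key names are illustrative); the length of the returned failure list maps onto the three outcomes above:

```python
def paper_vs_backtest(paper: dict, backtest: dict) -> list:
    """Return the list of metrics where paper trading fell outside
    the tolerance table relative to the backtest.

    Expected keys: sharpe, win_rate, avg_win_loss, max_drawdown, trade_freq.
    """
    failures = []
    if paper["sharpe"] < 0.5 * backtest["sharpe"]:
        failures.append("sharpe")                      # > 50% of backtest Sharpe
    if abs(paper["win_rate"] - backtest["win_rate"]) > 0.10:
        failures.append("win_rate")                    # within ±10% absolute
    if abs(paper["avg_win_loss"] / backtest["avg_win_loss"] - 1) > 0.20:
        failures.append("avg_win_loss")                # within ±20% relative
    if abs(paper["max_drawdown"]) > 1.5 * abs(backtest["max_drawdown"]):
        failures.append("max_drawdown")                # within 1.5x backtest DD
    if abs(paper["trade_freq"] / backtest["trade_freq"] - 1) > 0.30:
        failures.append("trade_freq")                  # within ±30% of expected
    return failures
```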
LIVE DEPLOYMENT SCHEDULE:
Month 1: 25% of target size
Month 2: 50% of target size (if month 1 is within tolerance)
Month 3: 75% of target size
Month 4+: 100% of target size
At any point: If performance falls outside tolerance → reduce to 25% and reassess
Step 8: Overfitting Detection Deep Dive
OVERFITTING SCORECARD:
Test Score Result
─────────────────────────────────────────────────────
IS vs OOS Sharpe ratio > 0.7 0-20 ___
Parameter stability (±20% test) 0-20 ___
Cross-asset validation 0-20 ___
Walk-forward efficiency > 0.5 0-20 ___
Number of parameters ≤ 5 0-10 ___
Trade count > 100 0-10 ___
─────────────────────────────────────────────────────
TOTAL /100 ___
INTERPRETATION:
80-100: Low overfitting risk → proceed to paper trading
60-79: Moderate risk → simplify strategy, re-test
40-59: High risk → significant overfitting suspected
< 40: Very high risk → strategy is almost certainly overfit
DEFLATED SHARPE RATIO (DSR):
Accounts for the number of strategy variations tried. A simplified
deflation heuristic:
DSR ≈ Sharpe × (1 - N_trials / (4 × Sharpe² × T))
Where: N_trials = number of strategy variations tested
T = number of return observations
If DSR < 0 → the observed Sharpe can be explained by random trials alone
(The full Bailey & López de Prado DSR also adjusts for return skew and
kurtosis; the heuristic above captures only the multiple-testing penalty.)
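The heuristic above is one line of code. This implements the document's shorthand only, not the full Bailey/López de Prado estimator:

```python
def deflated_sharpe(sharpe: float, n_trials: int, n_obs: int) -> float:
    """Simplified multiple-testing deflation of an observed Sharpe ratio.

    sharpe   : observed (annualized) Sharpe of the selected strategy
    n_trials : number of strategy variations tested to find it
    n_obs    : number of return observations (T)
    A result <= 0 means the Sharpe is explainable by random trials alone.
    """
    return sharpe * (1 - n_trials / (4 * sharpe ** 2 * n_obs))
```

For example, a Sharpe of 1.5 found after 50 variations on 252 daily observations deflates only slightly, but the penalty grows linearly with the number of variations tried.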
Anti-Patterns
DO NOT do these — they produce misleading backtest results:
- Optimizing on the full dataset: If you use all data for optimization, there is no unseen data to validate on. Always split IS/OOS BEFORE any optimization.
- Peeking at OOS data during development: Once you look at OOS results and then modify the strategy, the OOS data is contaminated. It becomes IS data. Reserve OOS strictly.
- Ignoring transaction costs and slippage: A strategy that makes 0.1% per trade looks great until you realize 0.05% goes to commissions and 0.03% to slippage. Model costs realistically.
- Assuming perfect fills: In reality, limit orders may not fill, market orders have slippage, and illiquid assets have wide spreads. Add 1-2 ticks of slippage per trade.
- Testing on a single asset and time period: A strategy that works on BTC 2023-2024 may fail on BTC 2022 or on ETH. Test across assets and periods.
- Too many parameters: Every parameter you add multiplies the search space. More than 5 parameters for a simple strategy is almost guaranteed overfitting.
- Confusing backtest Sharpe with live Sharpe: Backtests overstate Sharpe by 30-50% on average. If your backtest Sharpe is 1.5, expect 0.75-1.0 live.
- Deploying without paper trading: Skipping paper trading means you discover data issues, execution problems, and strategy flaws with real money. Always paper trade first.
Validation Checkpoints
Input Validation
- Strategy rules fully defined (no ambiguity in entry, exit, sizing)
- Historical data quality verified (no gaps, correct adjustments)
- Data period includes different market regimes (bull, bear, sideways)
- Transaction costs and slippage modeled realistically
- Survivorship bias addressed or acknowledged
- Look-ahead bias checked with the detection checklist
Output Validation
- Minimum 100 trades in backtest
- IS/OOS split performed BEFORE optimization
- Walk-forward analysis completed (WFE > 0.5)
- All performance metrics calculated and within acceptable ranges
- Statistical significance confirmed (p < 0.05 or bootstrap CI excludes 0)
- Overfitting scorecard completed (score > 60)
- Monte Carlo worst-case drawdown within risk tolerance
- Risk management validated — per Risk Management
- Paper trading plan defined with specific pass/fail criteria
- Comparison to simple benchmark (buy-and-hold) documented
Example
User: “I built a momentum strategy for BTC — EMA 9/21 crossover with ADX filter. The backtest shows Sharpe 2.8 and 75% win rate over 2 years. Is this good?”
Analysis:
## Backtest Validation: BTC EMA 9/21 + ADX Strategy
### Initial Assessment -- Red Flags Check
- Sharpe 2.8: SUSPICIOUS -- backtested Sharpe > 2.5 is almost always overfit
- Win rate 75%: SUSPICIOUS -- momentum strategies typically have 35-50% win rates
- These metrics together suggest possible overfitting or look-ahead bias
### Bias Checklist
- [ ] Look-ahead bias: Was entry on the bar AFTER the crossover signal?
→ Common error: entering on the same bar the EMA crosses
→ FIX: Signal on bar N, enter at open of bar N+1
- [ ] Survivorship bias: N/A (single asset BTC)
- [ ] Slippage modeled? If no → add 0.1% per trade for crypto
- [ ] Commissions modeled? If no → add 0.1% per trade (0.05% each side)
### What Realistic Metrics Look Like
For an EMA crossover momentum strategy on BTC:
| Metric | Suspicious (reported) | Realistic range |
|-------------|----------------------|------------------|
| Sharpe | 2.8 | 0.8 - 1.5 |
| Win Rate | 75% | 35-50% |
| Profit Factor| (unknown) | 1.3 - 2.0 |
| Max Drawdown | (unknown) | 15-35% |
### Recommended Validation Steps
1. FIX ENTRY TIMING: Ensure execution is at next bar open, not signal bar close
2. ADD COSTS: Apply 0.1% round-trip (0.05% entry + 0.05% exit) + 0.05% slippage
3. SPLIT DATA: Use 2022-01 to 2024-06 as IS, 2024-07 to 2025-01 as OOS
4. RUN WALK-FORWARD: 6-month IS, 2-month OOS, rolling
5. CROSS-ASSET TEST: Run same strategy on ETH and SOL
6. PARAMETER SENSITIVITY: Test EMA(7/19), EMA(9/21), EMA(11/23) -- if results
collapse for nearby parameters, the strategy is overfit to EMA(9/21)
### Expected Outcome After Corrections
After fixing look-ahead bias and adding costs, expect:
- Sharpe: 0.8-1.2 (down from 2.8)
- Win rate: 40-50% (down from 75%)
- Max drawdown: 20-30%
- If OOS Sharpe > 0.7 after walk-forward → strategy may have real edge
- If OOS Sharpe < 0.5 → strategy is likely overfit
### Verdict
DO NOT deploy this strategy until the validation steps above are completed.
The reported metrics are almost certainly inflated. The strategy may still
have a real edge, but it needs rigorous validation to prove it.