Strategy Deep Dive·2026-04-11·5 min read

Our XGBoost Sharpe just hit 2.14. We're suspicious.

Since January 1, XGBoost has posted +5.5% with 7 trades and a Sharpe north of 2. That's better than the 3-year optimized backtest. When a model performs above its own ceiling, the first question should be 'what's broken?'

By Li Tan

If you scroll to the Live Track Record tab on our Performance page right now, you'll see XGBoost sitting at +5.5% return, Sharpe 2.14, max drawdown -3.0%. 71 days, 7 trades. It's the best-performing strategy in our stack by a comfortable margin.

My first instinct when I saw this yesterday: something is wrong.

Here's why. The 3-year backtest on Performance → Showcase, with grid-searched optimal parameters, shows XGBoost at +13.3% total return and Sharpe 0.90. Annualize that and you're looking at ~4.3% per year, ~0.90 Sharpe — the kind of numbers that match what the ML literature tells us to expect from walk-forward gradient-boosted trees on financial data. Modest, decent, believable.
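
For anyone who wants to check the annualization, here is a minimal sketch of the arithmetic. It assumes geometric compounding over exactly three years; the variable names are just for illustration.

```python
# Geometric annualization of a multi-year total return.
total_return = 0.133   # +13.3% over the backtest, from the Showcase tab
years = 3              # length of the backtest

# Solve (1 + annualized) ** years == 1 + total_return for the annual rate.
annualized = (1 + total_return) ** (1 / years) - 1

print(f"annualized return: {annualized:.2%}")
```

Compounding gives a slightly lower figure than dividing 13.3% by 3, which is why the per-year number lands a little under 4.4%.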

The Live Track Record uses DEFAULT parameters (no grid search, no optimization, straight from strategies/xgboost_strategy.py with zero arguments). Default should be WORSE than optimized. That's the whole point of optimization. And yet it's showing a Sharpe of 2.14, more than double the optimized backtest's 0.90.

Three hypotheses, ranked by how scared they make me

Hypothesis 1: It's a lucky streak.

71 days and 7 trades is a tiny sample. At that size, a strategy with a true Sharpe of 0.9 can easily post a 3-month Sharpe of 2.1 through sheer luck: the sampling error on an annualized Sharpe estimated from roughly a quarter of a year of daily returns is larger than the gap between 0.9 and 2.14. Slide a 3-month window across the 3-year backtest and you should expect to find stretches like this. If this is the explanation, the Sharpe will regress toward 0.9 over the next 6 to 12 months and we'll look silly for making a fuss.
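
One way to put a rough number on "easily" is a small simulation. This is a sketch under simplifying assumptions, not a model of the actual strategy: daily returns are i.i.d. normal, the risk-free rate is zero, a year has 252 trading days, and the daily volatility is a made-up value (the Sharpe ratio does not depend on it). A strategy that is flat on most days and trades 7 times will be noisier than this.

```python
import math
import random
import statistics

TRADING_DAYS = 252   # assumed annualization convention
TRUE_SHARPE = 0.9    # annualized, from the 3-year backtest
WINDOW = 71          # days in the live track record
OBSERVED = 2.14      # live Sharpe quoted in this article
DAILY_VOL = 0.01     # hypothetical; only the mean/vol ratio matters


def simulate_window_sharpe(rng):
    """Annualized Sharpe of one simulated WINDOW-day stretch of daily returns."""
    daily_mean = TRUE_SHARPE * DAILY_VOL / math.sqrt(TRADING_DAYS)
    returns = [rng.gauss(daily_mean, DAILY_VOL) for _ in range(WINDOW)]
    return statistics.mean(returns) / statistics.stdev(returns) * math.sqrt(TRADING_DAYS)


rng = random.Random(0)  # fixed seed so the run is repeatable
sharpes = [simulate_window_sharpe(rng) for _ in range(5000)]

# Fraction of simulated 71-day windows that look at least as good as the live one.
share_above = sum(s >= OBSERVED for s in sharpes) / len(sharpes)
print(f"share of windows with Sharpe >= {OBSERVED}: {share_above:.1%}")
```

Under these assumptions a meaningful fraction of 71-day windows clears 2.14 even though the true Sharpe is only 0.9, which is the whole argument for Hypothesis 1.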

I think this is the most likely explanation. It's also the most boring one.

Hypothesis 2: There's a market regime XGBoost happens to love right now.

XGBoost uses technical features like RSI, MACD, momentum z-scores, and volatility bands. In trending markets with low realized vol, these features are stable and the model's predictions track reality well. In choppy or volatile regimes, the same features whipsaw and the model breaks.

Look at the other strategies' live track record: TSMOM is also positive (+2.8%), HMM is mildly positive (+1.8%). Trend-followers are winning. Mean-reverters are flat or losing. That's a trend regime. XGBoost essentially IS a trend-follower when you strip away the ML label — it's using momentum features. So it's winning for the same reason TSMOM is winning.

What makes me nervous: if this is the case, XGBoost's outperformance will collapse the moment the regime shifts. Not slowly. Suddenly. And because XGBoost looks like it has 'learned something', people (including us) will be slow to accept that nothing was ever learned — it was just regime luck.

Hypothesis 3: There's a bug.

The one I'm most afraid of. Data leaks in walk-forward ML are subtle and can produce exactly this profile: too-good-to-be-true results that look legitimate. The classic failure modes:

  • Feature engineering that peeks at future data (e.g. a centered rolling window, or a rolling/ewm feature aligned with a negative shift; pandas ewm on its own only looks backward, with either adjust setting)
  • Target variable computed with a shifted index, so that the last training labels are built from prices inside the test window
  • Scaler (like RobustScaler) fit on the full dataset instead of only the training window
  • Survivorship bias in the asset list (though this doesn't apply to forex pairs)
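
The scaler item in the list above is the easiest one to show in code. What follows is a minimal sketch, not the code in strategies/xgboost_strategy.py: a median/IQR scaler written with the standard library stands in for RobustScaler (its quartile interpolation may differ slightly), and the feature series, window sizes, and function names are all hypothetical.

```python
import statistics


def fit_robust_scaler(values):
    """Return (median, IQR) computed from `values` only."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return statistics.median(values), q3 - q1


def transform(values, median, iqr):
    """Center on the median and scale by the IQR."""
    return [(v - median) / iqr for v in values]


def walk_forward_splits(n_rows, train_size, test_size):
    """Yield (train, test) index ranges; each test block starts right after its train block."""
    start = 0
    while start + train_size + test_size <= n_rows:
        split = start + train_size
        yield range(start, split), range(split, split + test_size)
        start += test_size


# Hypothetical trending feature, so train-only and full-sample statistics differ.
feature = [float(i % 7) + 0.1 * i for i in range(60)]

folds = []
for train_idx, test_idx in walk_forward_splits(len(feature), train_size=30, test_size=10):
    train_values = [feature[i] for i in train_idx]
    test_values = [feature[i] for i in test_idx]
    params = fit_robust_scaler(train_values)        # fit on the training window only
    scaled_test = transform(test_values, *params)   # apply to test data, never refit
    folds.append((params, scaled_test))

# The leaky variant: parameters computed once over the full series.
leaky_params = fit_robust_scaler(feature)
print("train-only params, first fold:", folds[0][0])
print("full-sample params (leaky):   ", leaky_params)
```

The two sets of parameters differ because the full-sample version has already seen the level the feature reaches in later test windows. That difference is the leak.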

I went through strategies/xgboost_strategy.py line by line last night. Features use ewm(adjust=True). Target uses shift(-N), which is correct for forward returns. Scaler is fit inside the walk-forward loop on train data only. The code looks clean.

But 'looks clean' has been famous last words many times in ML history.

What we're doing about it

For the next month, I'm going to watch the XGBoost Live Track Record closely. If the Sharpe stays above 1.5 through May while the regime holds, I'll start believing it. If it drops below 1.0, that's the regression we expected and we move on. If it crashes below 0 suddenly, we have a regime-dependence problem we need to document honestly.
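
Written down as a rule, the plan looks roughly like the sketch below. The function name and return strings are made up for illustration, and the "inconclusive" band between 1.0 and 1.5 is an assumption, since the thresholds above don't say what happens there.

```python
def read_live_sharpe(sharpe: float) -> str:
    """Map a live Sharpe reading to the interpretation described in the text."""
    if sharpe < 0.0:
        return "regime dependence: document it honestly"
    if sharpe < 1.0:
        return "expected regression: move on"
    if sharpe > 1.5:
        return "start believing, if the regime has held through May"
    return "inconclusive"  # assumption: the text does not cover 1.0 to 1.5


for reading in (2.14, 1.2, 0.8, -0.3):
    print(reading, "->", read_live_sharpe(reading))
```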

In the meantime, if you're a Learner subscriber, don't treat XGBoost's current Sharpe as normal. It isn't. Treat TSMOM's Sharpe of 1.01 as the number to anchor on. That's the 'we believe it' zone for this regime.

If you want to reproduce these numbers yourself: clone github.com/Lee26116/openalpha, run scripts/daily_track_record.py. The data file is published via API at /api/v1/backtest/live-track-record — you should get a SHA-256 that matches ours, or we have a bug.
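
Checking the hash takes a few lines of standard-library Python. This is a sketch with hypothetical helper names; it assumes you have already saved the API response to disk as raw bytes, and the file name in the comment is a placeholder.

```python
import hashlib


def sha256_hex(payload: bytes) -> str:
    """Hex SHA-256 digest of the raw bytes."""
    return hashlib.sha256(payload).hexdigest()


def matches_published(payload: bytes, published_hex: str) -> bool:
    """Compare a local digest with a published one, ignoring case and surrounding whitespace."""
    return sha256_hex(payload) == published_hex.strip().lower()


# Typical use (placeholder file name):
#     with open("live-track-record.json", "rb") as handle:
#         print(sha256_hex(handle.read()))

# Self-check against the well-known SHA-256 test vector for b"abc".
print(sha256_hex(b"abc"))
```

Hash the bytes exactly as downloaded. Parsing the JSON and re-serializing it can change whitespace or key order, which changes the digest even when the data is identical.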
