When is a Backtest Too Good to be True?

One statistic I find useful for forming a first impression of a backtest is the success (winning) percentage. Since it can mean different things, let’s be more precise: for a strategy over daily data, the winning percentage is the percentage of days on which the strategy had positive returns (in other words, the strategy guessed the sign of the return correctly on these days). Now the question: if I see a 60% winning percentage for an S&P 500 strategy, should my bullshit-alarm go off?

That actually happened not too long ago while I was reading a paper. One of the best strategies in the paper reported a 60% winning percentage out-of-sample. My gut feeling was that it was an unrealistic, cherry-picked sample. I was pretty sure I was correct, but it raised other interesting questions, so I decided to investigate a bit more.

Let’s perform an experiment. Let’s take 20 years of daily returns on the S&P 500 and draw a number of samples. All samples have the same size – 60% of the data. For each sample, we assume that we guessed the sign on these days correctly, and we were wrong on the rest. One way to implement this in R is:

library(quantmod)              # getSymbols, Cl (also loads TTR, which provides ROC)
library(PerformanceAnalytics)  # Return.annualized

return.mc = function(rets, samples=1000, size=252) {
   # The annualized return for each sample
   result = rep(NA, samples)
   for(ii in 1:samples) {
      # Sample the indexes
      aa = sample(1:NROW(rets), size=size)
      # All days we guessed wrong
      bb = -abs(rets)
      # On the days in the sample we guessed correctly
      bb[aa] = abs(bb[aa])
      # Compute the annualized return for this sample.
      # Note, we convert bb to a vector, otherwise, the computation takes forever.
      result[ii] = Return.annualized(as.numeric(bb),scale=252)
   }
   return(result)
}

Now, let’s see what we get for a couple of success rates:

gspc = getSymbols("^GSPC", from="1900-01-01", auto.assign=F)
rets = ROC(Cl(gspc),type="discrete",na.pad=F)["1994/2013"]

aa = return.mc(rets, size=as.integer(0.6*NROW(rets)))
summary(aa)
##   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.3353  0.4518  0.4849  0.4860  0.5183  0.6666

aa = return.mc(rets, size=as.integer(0.55*NROW(rets)))
summary(aa)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.08744 0.17870 0.20410 0.20410 0.23130 0.33670

For the 60% success rate, the average over all samples is about 48% annually, and even the worst sample returns more than 33%! My suspicion of serious cherry-picking seems completely justified. It’s either that, or the authors have discovered the ultimate trading strategy. Needless to say, it turned out to be the former …

The second result, the 55% success rate, is about where I draw the line for something plausible, although I’d be surprised if something of that quality were shared in public.
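As a rough cross-check on the simulated numbers, here is a minimal back-of-the-envelope sketch. If the correctly guessed days are picked at random, each day earns +|r| with probability p and -|r| with probability 1 - p, so the expected daily return is (2p - 1) times the mean absolute daily return. The helper name `expected.daily` and the synthetic return series are assumptions made for illustration; they are not part of the original experiment, and the synthetic numbers say nothing about the S&P 500 itself.

```r
# Synthetic daily returns stand in for the S&P 500 series (an assumption);
# replace with as.numeric(rets) from the post to use the real data.
set.seed(42)
rets.syn <- rnorm(5040, mean = 0.0003, sd = 0.012)

# Expected daily return when a random fraction p of days is guessed correctly:
# +|r| with probability p, -|r| with probability 1 - p.
expected.daily <- function(rets, p) {
   (2 * p - 1) * mean(abs(rets))
}

# Simple (arithmetic) annualization over 252 trading days, ignoring compounding
edge60 <- 252 * expected.daily(rets.syn, 0.60)
edge55 <- 252 * expected.daily(rets.syn, 0.55)

# Before compounding, the 60% edge is exactly twice the 55% edge
edge60 / edge55
```

Return.annualized compounds geometrically by default, so the simulated figures will not match this arithmetic approximation exactly; the sketch is only meant to show why a few extra percentage points of hit rate move the annual return so much.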

One last observation, which is important mostly for long-only strategies: 53.81% of the daily returns are positive overall, yet at a 50% success rate we still get a negative mean return! That’s a manifestation of the fact that positive and negative days have very different return profiles. In any case, what it means is that for long-only strategies I might push the plausible percentage up a bit.
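The 50% case can be checked with a small self-contained sketch in base R. It uses a synthetic return series and a hand-written geometric annualization instead of Return.annualized, so the names `annualize` and `hit.rate.return`, and the data itself, are assumptions for illustration only.

```r
# Synthetic daily returns (an assumption); replace with as.numeric(rets)
# from the post to repeat the check on the real series.
set.seed(1)
rets.syn <- rnorm(5040, mean = 0.0003, sd = 0.012)

# Geometric annualized return of a vector of daily simple returns
annualize <- function(x, scale = 252) {
   exp(scale * mean(log1p(x))) - 1
}

# One sample: the sign is guessed correctly on `size` randomly chosen days
# and incorrectly on all the others.
hit.rate.return <- function(rets, size) {
   signed <- -abs(rets)
   idx <- sample(length(rets), size = size)
   signed[idx] <- abs(signed[idx])
   annualize(signed)
}

size50 <- as.integer(0.5 * length(rets.syn))
res50 <- replicate(1000, hit.rate.return(rets.syn, size50))

mean(rets.syn > 0)   # share of positive days in the series
summary(res50)       # annualized returns at a 50% success rate
```

In this synthetic setup the returns are symmetric by construction and the expected arithmetic daily return at 50% is zero, so a negative mean here would come from compounding alone. It therefore does not test the asymmetry explanation; for that, the same code has to be run on the real series.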


  1. Excellent experiment! It shows how a little edge can be turned into real profit. One thing to take into consideration, however, is return asymmetry. Depending on the strategy (mean reversion vs. trend following), average gain vs. average loss significantly affects the outcome.

  2. Nice experiment indeed. For the discrepancy between the fraction of positive market returns and the success rate of a timing strategy, see my paper “On the expected performance of market timing strategies” in the Jnl of Portfolio Management (Summer 2014), http://www.iijournals.com/doi/abs/10.3905/jpm.2014.40.4.042

  3. Richard says:

    If you are allowed to skip days where you don’t have a strong conviction then a 60% out-of-sample hit rate is plausible. It also depends how many trades are in the out-of-sample. For approximately normally distributed returns I like to look at the annualized Sharpe because it naturally normalises for sample size and multiplying by sqrt(years) gives a crude estimate of statistical significance.

    1. quintuitive says:

      That’s exactly the point: the out-of-sample is not representative for some reason (size, location in time, etc).
      Even if we skip days, I am dubious that one can come up with something that doesn’t look way too impressive at these statistics. The worst guess is 30%, which is already impressive. At such a return rate, using leverage to boost the overall returns should not be an issue, but one needs to look into drawdowns (and other similar stats), of course.

  4. Kevin says:

    Can you please elaborate a bit on the rationale of this experiment? How do the results – the mean return conditioned on a certain successful prediction rate – translate into plausibility of the prediction?

    1. quintuitive says:

      If one can reliably produce a 60% success rate, the experiment shows that it’s impossible to come up with a sample (over 20 years at a success rate of 60%) whose returns are not ridiculous.
