Yahoo Finance (CSI) Data Quirks. Or Why is the ROC not Stable?

Rotational strategies on ETFs have been a common occurrence on this blog, and I have been using something similar for real life trading for about two years now. Readers of this blog may have also noticed concerns about the stability of the computations of such strategies. At the end it turned out be a quirk in the data feed, not accounted for in my code.

Before we jump to the code, a little history: About a year ago, I noticed that positions taken in real life based on rotational strategies, sometimes, are not the same as the ones taken in the backtest performed later. I was quite concerned, and did a fair amount of investigation, but wasn’t able to find the root cause. In other words, I couldn’t find anything wrong with the usual suspects – my code and other package code.

At a high level, it seemed I have been doing everything right: I have been using close (adjusted only for splits) to generate signals, and I have been using the adjusted (both for splits and dividends) close to compute rate of change (ROC or returns) and to compute performance.

That was the point I started thinking there is something wrong with my data handling, so I decided to backup the data at then end of each month (when the new positions are computed) so that I can compare the runs a month or two later, when the problem occurs. Well, it didn’t reproduce for a while, and I stopped the data backups and almost forgot about it.

It reproduced again a month ago, but that didn’t catch my attention either – I started thinking about simply retiring the approach or implementing it in Java (Tradelib).

Everything changed when I stumbled upon (for a completely other reason) the DTR Trading blog. I found about four posts describing the same issue in details. The only common piece between us two seemed to be the data feed. Thus, I decided to investigate further.

I had an old RData backup file lying around so I started investigating:


ee = new.env()
load("DBC.RData", envir=ee)
dbc1 = ee[["DBC"]]
dbc2 = getSymbols("DBC", from="1900-01-01", auto.assign=F)
dbc = merge(Cl(dbc1), Ad(dbc1), Cl(dbc2), Ad(dbc2), all=F)
dbc.roc = merge(ROC(dbc[,2],na.pad=F), ROC(dbc[,4],na.pad=F), all=F)

# Hmm, there are differences

# Inspecting the data visually reveals nothing - they look the same. However ...

any(dbc[,2] != dbc[,4])
# That yields TRUE, hmm ...

# 65% of the data is different - holy cow!!
# Now we know - the floating point representation is different

# Take the last five indexes
ii = tail(which(as.numeric(dbc[,2])!=as.numeric(dbc[,4]),arr.ind=T), 5)

# Increase the precision

# And view the data again

# Try a fix
any(round(dbc[,2],2) != round(dbc[,4],2))
# FALSE - vola!

To summarize: both the close and the adjusted close are presented as floating point numbers with a few digits (more than two) of precision (I checked the csv file downloaded from Yahoo Finance and it does contain more than two significant digits for precision). They change over time and if the full precision is used – they are different! Thus, the rate of change is different, and finally, the results may be different.

Now I am going to update the code and monitor the results, hopefully this is the fix. 🙂

Leave a Reply