For quite some time now I have been using R’s caret package to choose the model for forecasting time series data. The approach is satisfactory as long as the model is not an evolving model (i.e. it is not re-trained), or if it evolves rarely. If the model is re-trained often, the approach has significant computational overhead. Interestingly enough, an alternative, more efficient approach also allows for more flexibility in model selection.

Let’s first outline how caret chooses a single model. The high level algorithm is outlined here:

So let’s say we are training a random forest. For this model, a single parameter, *mtry*, is optimized:

```r
require(caret)

# Tuning parameters of the random forest model ("rf")
getModelInfo('rf')$rf$parameters
#   parameter   class                         label
# 1      mtry numeric #Randomly Selected Predictors
```

Let’s assume we are using some form of cross validation. According to the algorithm outline, caret will create a few subsets (folds). On each fold, it will train all candidate models (one per value of *mtry*), and finally it will choose the value whose models perform best on average over all cross validation folds. So far so good.
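The selection loop just described can be sketched in a few lines. This is a minimal illustration in Python (the language planned for the implementation in this series), not caret's actual code: `tune`, `fit_knn`, and `accuracy` are hypothetical names, and a tiny nearest-neighbour classifier with parameter `k` stands in for the random forest and its *mtry*.

```python
def tune(param_values, folds, fit, score):
    """Fit one model per (parameter value, fold); pick the best mean score."""
    results = {}
    for p in param_values:
        fold_scores = [score(fit(p, train), test) for train, test in folds]
        results[p] = sum(fold_scores) / len(fold_scores)
    best = max(results, key=results.get)
    return best, results


def fit_knn(k, train):
    """Toy stand-in model: majority vote among the k nearest training points."""
    def predict(x):
        nearest = sorted(train, key=lambda xy: abs(xy[0] - x))[:k]
        votes = [y for _, y in nearest]
        return max(set(votes), key=votes.count)
    return predict


def accuracy(predict, test):
    """Fraction of test points classified correctly."""
    return sum(predict(x) == y for x, y in test) / len(test)


# Made-up data: label is 1 when x >= 5. Two folds, split by parity of the index.
data = [(x, int(x >= 5)) for x in range(10)]
folds = [
    (data[1::2], data[0::2]),  # train on odd positions, test on even ones
    (data[0::2], data[1::2]),  # and the other way round
]
best, results = tune([1, 3], folds, fit_knn, accuracy)
print(best, results)
```

Note that the number of model fits is the number of parameter values times the number of folds, which is what makes frequent re-tuning expensive.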

When dealing with time series, regular cross validation has a future-snooping problem: a fold can be trained on points that come after the points it is tested on. In my experience, general cross validation doesn’t work well in practice for time series data. The results are good on the training set, but the performance on the test set, the hold-out, is bad. To address this issue, caret provides the *timeslice* cross validation method:

```r
require(caret)

history = 1000
initial.window = 800

train.control = trainControl(
    method="timeslice",
    initialWindow=initial.window,
    # One-step-ahead forecasts: with horizon=1 there are
    # history - initial.window = 200 sliding windows
    horizon=1,
    fixedWindow=TRUE)
```

When the above *train.control* is used in training (via the *train* call), we will end up training 200 models for each set of parameters (each value of *mtry* in the random forest case). In other words, for a single value of *mtry*, we will compute:

Window | Training Points | Test Point |
---|---|---|
1 | 1..800 | 801 |
2 | 2..801 | 802 |
3 | 3..802 | 803 |
… | … | … |
200 | 200..999 | 1000 |

The training set for each model is the previous 800 points. The test set for a single model is the single point forecast. Now, for each value of *mtry* we end up with 200 forecasted points. Using the accuracy (or any other metric), we select the best performing model over these 200 points. No future-snooping here, because all history points are prior to the points being forecasted.
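To make the windowing concrete, here is a small Python sketch that enumerates the sliding windows from the table. It only mimics the scheme described above (fixed-size training window, one-step test set, 1-based indices); `time_slices` is a hypothetical helper, not a caret function.

```python
def time_slices(n_points, initial_window, horizon=1):
    """List (train_first, train_last, test_first, test_last), 1-based and inclusive.

    The training window has a fixed size and slides forward one point at a time.
    """
    slices = []
    start = 1
    while start + initial_window - 1 + horizon <= n_points:
        train_last = start + initial_window - 1
        slices.append((start, train_last, train_last + 1, train_last + horizon))
        start += 1
    return slices


slices = time_slices(n_points=1000, initial_window=800, horizon=1)
print(len(slices))   # 200
print(slices[0])     # (1, 800, 801, 801)
print(slices[-1])    # (200, 999, 1000, 1000)
```

Every training window ends immediately before its test point, which is the property that rules out future-snooping.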

Granted, this approach (of doing things on a daily basis) may sound extreme, but it’s useful to illustrate the overhead which is imposed when the model evolves over time, so bear with me.

So far we have dealt with a single model selection. Once the best model is selected, we can forecast the next data point. Then what? What I usually do is walk the time series forward and repeat these steps at certain intervals. This is equivalent to saying something like: “Let’s choose the best model each Friday and use it to predict each day of the following week. Then re-fit it on the next Friday.” This forward-walking approach has been found useful in trading, but, surprisingly, it has hardly been discussed elsewhere. Abundant time series data is generated everywhere, hence I feel this evolving model approach deserves at least as much attention as the “fit once, live happily thereafter” approach.

Back to our discussion. To illustrate the inefficiency, consider an even more extreme case: we are selecting the best model every day, using the above parameters, i.e. the best model for each day is selected by tuning the parameters over the previous 200 days. On day *n*, for a given value of the parameter (*mtry*), we will train this model over a sequence of 200 sliding windows, each of size 800. Next we will move to day *n+1* and we will compute, yet again, this model over a sequence of 200 sliding windows, each of size 800. Most of these operations are repeated (the last 800-point window on day *n* is the second-to-last 800-point window on day *n+1*). So just for a single parameter value, we are repeating most of the computation on each step.
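A quick back-of-the-envelope count shows how much of this work is repeated. The Python sketch below uses the 200 windows of size 800 from above; the 250 days of daily re-selection and the 5 parameter values are assumptions chosen for illustration, and both function names are hypothetical.

```python
def windows_for_day(day, n_slices=200, window=800):
    """Training windows (first, last) needed to tune the model for `day`.

    Assumes the test points are the n_slices points just before `day`, each
    forecast from the `window` points that precede it.
    """
    return {(t - window, t - 1) for t in range(day - n_slices, day)}


def count_fits(first_day, n_days, n_params, n_slices=200, window=800):
    """Return (naive, unique) model-fit counts over n_days of daily re-selection."""
    naive = n_days * n_slices * n_params
    unique_windows = set()
    for day in range(first_day, first_day + n_days):
        unique_windows |= windows_for_day(day, n_slices, window)
    return naive, len(unique_windows) * n_params


shared = windows_for_day(1001) & windows_for_day(1002)
print(len(shared))                # 199 of the 200 windows are identical
print(count_fits(1001, 250, 5))   # (250000, 2245)
```

Under these assumptions, re-tuning from scratch every day performs roughly a hundred times more fits than there are distinct (window, parameter) pairs.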

At this point, I hope you get the idea. So what is my solution? Simple. For each set of model parameters (each value of *mtry*), walk the series separately, do the training (no cross validation is needed, since there is a single parameter value), do the forecasting and store everything important in, let’s say, a SQLite database. Next, pull out all predictions and walk the combined series. On each step, look at the history and, based on it, decide which model’s prediction to use for the next step. Assuming we are selecting the model over 5 different values for *mtry*, here is what the combined data may look like for a three-class (0, -1 and 1) classification:

The described approach is going to be orders of magnitude faster, because each model is trained only once per window and parameter value, yet it will deliver very similar results (there are differences based on the window sizes). It also has an added bonus: once the forecasts are generated, one can experiment with different metrics for model selection on each step, all without re-running the machine learning portion. For instance, instead of model accuracy (the default *caret* metric for classification), one can compare cumulative returns over the last n days.
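The two phases can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the actual implementation: the table layout, the function names, and the predictions themselves are all made up, only two *mtry* values are used to keep it short, and an in-memory SQLite database stands in for a file on disk.

```python
import sqlite3


def store_predictions(conn, preds_by_mtry, actuals):
    """Phase 1 output: one row per (step, mtry) with the forecast and the outcome."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS forecasts ("
        "step INTEGER, mtry INTEGER, predicted INTEGER, actual INTEGER, "
        "PRIMARY KEY (step, mtry))"
    )
    rows = [
        (step, mtry, pred, actuals[step])
        for mtry, preds in preds_by_mtry.items()
        for step, pred in enumerate(preds)
    ]
    conn.executemany("INSERT INTO forecasts VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def walk_and_select(conn, lookback):
    """Phase 2: at each step use the model with the most hits over the
    previous `lookback` steps (ties go to the smallest mtry)."""
    by_step = {}
    query = "SELECT step, mtry, predicted, actual FROM forecasts ORDER BY step, mtry"
    for step, mtry, predicted, actual in conn.execute(query):
        by_step.setdefault(step, {})[mtry] = (predicted, actual)
    steps = sorted(by_step)
    chosen = []
    for i, step in enumerate(steps):
        if i < lookback:
            continue  # not enough history to compare the models yet
        history = steps[i - lookback:i]

        def hits(mtry):
            return sum(by_step[s][mtry][0] == by_step[s][mtry][1] for s in history)

        best = max(sorted(by_step[step]), key=hits)
        chosen.append((step, best, by_step[step][best][0]))
    return chosen


# Made-up three-class outcomes and per-model forecasts, for illustration only.
actuals = [1, -1, 0, 1, 1, -1]
preds_by_mtry = {2: [1, -1, 0, 0, 0, 0], 4: [0, 0, 0, 1, 1, -1]}

conn = sqlite3.connect(":memory:")
store_predictions(conn, preds_by_mtry, actuals)
chosen = walk_and_select(conn, lookback=3)
print(chosen)   # [(3, 2, 0), (4, 2, 0), (5, 4, -1)]
```

Swapping the selection metric only means replacing `hits` (for example with a cumulative return over the lookback window); the stored forecasts stay untouched.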

Still cryptic, or curious about the details? My plan is to keep posting details and code as I progress with my Python implementation. So look out for the next installments of this series.