The previous post described the high-level architecture of a walk-forward forecasting approach for time series data. As a hands-on implementation, let’s apply a simple QDA classifier to the series discussed previously.
First things first: most of the relevant code is available on GitHub. Although in general I try to publish runnable code, this is not a self-contained executable script. There are various reasons for that, but the main goal here is to demonstrate what I dubbed the “Walk-Forward Loop” template.
Structuring the code wasn’t too hard. I have been using this approach for a few years now, so it was quite clear where I wanted to get to. Implementing it in Python was the bigger obstacle, especially since I wanted to make use of some nice packages to keep the code as flexible as possible.
```python
import pandas as pd

# `dsh` and `MlLoop` are defined in the accompanying code and must be imported as well.


def drive_mlloop():
    # Load the data from a file
    all_data = dsh.load('all_data.bin')

    # Use the data for the Heating Oil continuous contract
    data = all_data['HO2']

    # Sanity checks: merge features and response, then drop incomplete rows
    combined = pd.concat([data['features'], data['full']['entry']], axis=1)
    combined = combined.dropna()
    if len(data['features']) != len(combined):
        raise RuntimeError('Some observations were removed while merging the series. '
                           'Ensure there are no NAs and the series lengths match.')

    # The dependent (last column) and the independent variables
    response = combined.iloc[:, -1]
    features = combined.iloc[:, :-1]

    fl = dsh.ForecastLocations(features.index)

    # Run the machine learning loop
    ml = MlLoop('test_model', 'ml.log', db_url='sqlite:///ml.sqlite')
    ml.run(features, response, fl)
```
That’s pretty much the main driver. The previously prepared series are loaded from a file. The features and the response (the independent and dependent variables, respectively) are then concatenated as a basic check that they align: if dropping the NAs shortens the series, something is off. The first interesting point is the creation of the ForecastLocations object. It’s a simple class which, based on its configuration, outputs the periods (as indexes) where the forecasting is to occur. I have found it quite useful. My current implementation of this class is still immature, so use it with caution.
```python
import numpy as np


class ForecastLocations:
    """Computes the index ranges (starts/ends) at which to forecast."""

    def __init__(self, timestamps, nahead=1, min_history=1000, max_history=1e6,
                 start_index=None, end_index=None, start_date=None, end_date=None,
                 history_delay=0):
        # Note: max_history is accepted but not used in this version
        self.starts = None
        self.ends = None

        ll = len(timestamps)

        # Lower bound: leave at least min_history observations for training
        if start_index is None:
            if start_date is not None:
                start_index = max(np.searchsorted(timestamps, start_date),
                                  min_history + 1 + history_delay)
            else:
                start_index = min_history + history_delay

        # Upper bound for the forecast locations
        if end_index is None:
            if end_date is None:
                end_index = ll - 1
            else:
                end_index = np.searchsorted(timestamps, end_date)

        # Not enough data: starts and ends stay None
        if start_index >= end_index:
            return

        # Each forecast covers nahead consecutive indexes: starts[i]..ends[i]
        self.ends = np.arange(nahead, end_index - start_index + 1, nahead) + start_index - 1
        self.starts = self.ends - nahead + 1

    def len(self):
        if self.starts is None:
            return None
        return len(self.starts)
```
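To make the index arithmetic concrete, here is a minimal pure-Python sketch that mirrors only the default path of the class (no explicit start or end index or date). The function name `forecast_windows` and the toy sizes are made up for this illustration; they are not part of the code on GitHub.

```python
def forecast_windows(n_obs, nahead=1, min_history=1000, history_delay=0):
    """Return (start, end) index pairs for the default configuration."""
    start_index = min_history + history_delay
    end_index = n_obs - 1
    if start_index >= end_index:
        return []
    # Same arithmetic as np.arange(nahead, end_index - start_index + 1, nahead) + start_index - 1
    ends = [k + start_index - 1
            for k in range(nahead, end_index - start_index + 1, nahead)]
    return [(end - nahead + 1, end) for end in ends]


# 20 observations, forecast 5 steps at a time, keep 8 observations as history
windows = forecast_windows(n_obs=20, nahead=5, min_history=8)
print(windows)
```

With these numbers the forecasts cover indexes 8 to 12 and 13 to 17. The trailing observations that do not fill a complete window are left out.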
The machine learning loop is responsible for the rest. It walks through the locations identified for forecasting and performs a series of tasks:
- Extracts the training data set from the full series.
- Trains the models.
- Extracts the features for the period to be forecasted.
- Forecasts using the previously trained model and the features.
- Stores the results in a database.
- Logs the progress to a file.
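The steps above can be sketched as a plain loop. This is not the actual MlLoop class, just a stripped-down illustration under a few assumptions: `walk_forward` and `MajorityClassModel` are hypothetical names, the data is a toy list, and the results are collected in memory instead of going to a database and a log file. The stand-in model only mimics the fit/predict interface; in principle any object with that interface, such as scikit-learn’s QuadraticDiscriminantAnalysis, could be passed as `make_model` instead.

```python
from collections import Counter


class MajorityClassModel:
    """Stand-in classifier with a scikit-learn-like fit/predict interface."""

    def fit(self, X, y):
        # Remember the most frequent label in the training window
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def walk_forward(features, response, starts, ends, make_model, max_history=None):
    """Retrain at every forecast location and predict rows starts[i]..ends[i]."""
    results = []
    for start, end in zip(starts, ends):
        # 1. Training set: only rows strictly before the forecast window
        first = 0 if max_history is None else max(0, start - max_history)
        X_train, y_train = features[first:start], response[first:start]
        # 2. Train a fresh model on that window
        model = make_model().fit(X_train, y_train)
        # 3. Extract the features for the forecast window, 4. forecast
        X_new = features[start:end + 1]
        preds = model.predict(X_new)
        # 5./6. Collect the results (a real loop would store and log them)
        for offset, pred in enumerate(preds):
            results.append((start + offset, pred))
    return results


# Toy data: 10 observations, one feature, labels in {-1, 1}
features = [[float(i)] for i in range(10)]
response = [1, 1, -1, 1, -1, -1, -1, -1, 1, 1]
forecasts = walk_forward(features, response, starts=[5, 7], ends=[6, 8],
                         make_model=MajorityClassModel)
print(forecasts)
```

The important property is that the training slice always ends before the forecast window begins, so no future information leaks into the model.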
For the machine learning part, I used the scikit-learn package but that’s just for illustrative purposes.
A nice feature which makes this version shine compared to the R version is the use of the sqlalchemy package, a nice ORM implementation which allowed me to abstract the database code. I have been using SQLite to run the code, but changing a single line, the connection string, should be enough to switch to any other SQL engine that SQLAlchemy supports. Sounds simple, but try doing that in R … It’s complicated enough even in Java, to the degree that I am using direct SQL queries in my backtesting framework.
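To illustrate the point about the connection string, here is a minimal sketch of how forecasts could be stored through the ORM. The `Forecast` table, its columns and the `open_session` helper are assumptions made for this example and do not reflect the schema used in the actual code. It uses an in-memory SQLite database so it leaves nothing behind.

```python
from sqlalchemy import Column, Float, Integer, String, create_engine
from sqlalchemy.orm import sessionmaker

try:  # SQLAlchemy 1.4 and later
    from sqlalchemy.orm import declarative_base
except ImportError:  # older releases
    from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class Forecast(Base):
    """Hypothetical table: one row per model and forecast location."""
    __tablename__ = 'forecasts'
    id = Column(Integer, primary_key=True)
    model_id = Column(String(64), nullable=False)
    location = Column(Integer, nullable=False)
    prediction = Column(Float)


def open_session(db_url):
    # The URL is the only engine-specific piece of this script
    engine = create_engine(db_url)
    Base.metadata.create_all(engine)
    return sessionmaker(bind=engine)()


session = open_session('sqlite:///:memory:')  # or 'sqlite:///ml.sqlite' for a file
session.add_all([
    Forecast(model_id='qda', location=1000, prediction=1.0),
    Forecast(model_id='qda', location=1001, prediction=-1.0),
])
session.commit()

# Count the stored forecasts for one model
n_rows = session.query(Forecast).filter_by(model_id='qda').count()
print(n_rows)
session.close()
```

Pointing `open_session` at another engine should only require a different URL, assuming the matching database driver is installed.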
For the logging, I simply ported my R logging code. The log is useful for getting an idea of how fast the computations are going. Ideally, access to the file should be synchronized so that we can log even when executing in parallel. I hope there is a Python package I can use for that. 🙂
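One candidate is Python’s standard logging module, whose handlers serialize access with a lock, so several threads can share one file handler. The sketch below is only an illustration of that idea, not the ported logging code: the logger name, the message format and the model names are made up. For separate processes, a single shared file needs more care; the standard library’s QueueHandler and QueueListener are one documented way to approach it.

```python
import logging
import os
import tempfile
import threading


def make_logger(path):
    logger = logging.getLogger('mlloop_demo')
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the messages out of any root handlers
    handler = logging.FileHandler(path)
    handler.setFormatter(logging.Formatter('%(asctime)s %(threadName)s %(message)s'))
    logger.addHandler(handler)
    return logger, handler


def worker(logger, model_id, n_steps):
    # Each thread reports its progress through the shared logger
    for step in range(n_steps):
        logger.info('model=%s step=%d done', model_id, step)


log_path = os.path.join(tempfile.mkdtemp(), 'ml.log')
logger, handler = make_logger(log_path)

threads = [threading.Thread(target=worker, args=(logger, name, 3))
           for name in ('qda', 'lda')]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Flush and release the file before reading it back
logger.removeHandler(handler)
handler.close()

with open(log_path) as fh:
    lines = fh.read().splitlines()
print(len(lines))
```

Each call to `logger.info` produces exactly one line in the file, even though the two threads interleave.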
Synchronization, parallelization and a few other pieces are still missing, but this is pretty much the entire walk-forward template! Plug in different algorithms, store the results in the same database (using a different model ID for each), and then use the stored data to compare performance, do step-wise model selection or whatever else you can think of!