As mentioned earlier, currently I am playing with trading strategies based on Support Vector Machines. At a high level, the approach is quite similar to what I have implemented for my ARMA+GARCH strategy. Briefly, the simulation goes as follows: we step through the series one period (day, week, etc) at a time. For each period, we use history of pre-defined length to determine the best model parameters. Then, using these parameters, we forecast the next period.

There are two ways to parallelize this process. The approach taken in the ARMA+GARCH system, parallelizes **the model selection ** (done for each period), while it steps through the series **sequentially**. There are good reasons why I did it this way, for instance, the *garchSearch* function is usable standalone. Another reason was probably that I was dumb enough to hope that it (*garchSearch* function)may get picked up and added to a GARCH package (*fGarch* preferably), so that I don’t have to maintain it myself.:)

In any case, this is definitely the more inefficient approach. In parallel execution, it pays off to increase the size of the task executed in parallel. The more efficient alternative in our case, is to step through the series **in parallel**, while each task does the fitting for all possible models for a given period. A different way to think about it is how many tasks will be executed in total. In the ARMA+GARCH system, we execute *inputeSeriesLength*numberOfModels* tasks, while in the second we only execute *inputSeriesLength* tasks. Since the total amount of work is the same, certainly the latter approach has bigger tasks, thus, less overhead in terms of starting/stopping new tasks.

This approach has another advantage as well, it also works better with the way SVMs, Neural Networks and other similar apparatus is implemented in R. For instance, the e1071 package comes with a *tune* function which already does the tuning among a predefined set of models. Thus, if we want to make use of the provided *tune*, then we need to parallelize the *tune* invocations.

Let’s take a look at the driver code for an SVM (the comments should be helpful understanding the code):

svmComputeForecasts = function( data, response, history=500, modelPeriod="days", modelPeriodMultiple=1, trace=TRUE, startDate, endDate, # All following parameters are system (SVM, Neural Network, etc) dependent kernel="radial", gamma=10^(-5:-1), cost=10^(0:2), sampling="cross", cross=10, selectFeatures=TRUE, cores) { require( e1071 ) stopifnot( NROW( data ) == NROW( response ) ) len = NROW( response ) # Determine the starting index if( !missing( startDate ) ) { startIndex = max( len - NROW( index( data[paste( sep="", startDate, "/" )] ) ) + 1, history + 2 ) } else { startIndex = history + 2 } # Determine the ending index if( missing( endDate ) ) { lastIndex = len } else { lastIndex = NROW( index( data[paste( sep="", "/", endDate )] ) ) } if( startIndex > lastIndex ) { return( NULL ) } modelPeriod = tolower( modelPeriod[1] ) # Get the interesting indexes periods = index(data)[startIndex:lastIndex] # Compute the end points for each period (day, week, month, etc) endPoints = endpoints( periods, modelPeriod, modelPeriodMultiple ) # Compute the starting points of each period, relative to the *data* index startPoints = endPoints + startIndex # Remove the last start point - it's outside length(startPoints) = length(startPoints) - 1 # Make the end points relative to the *data* index endPoints = endPoints + startIndex - 1 # Remove the first end point - it's always zero endPoints = tail( endPoints, -1 ) stopifnot( length( endPoints ) == length( startPoints ) ) # Initialize the number of cores if( missing( cores ) ) { cores = 1 } res = mclapply( seq(1,length(startPoints)), svmComputeOneForecast, data=data, # System dependent response=response, startPoints=startPoints, endPoints=endPoints, len=len, history=history, trace=TRUE, kernel=kernel, gamma=gamma, cost=cost, selectFeatures=selectFeatures, mc.cores=cores ) # Vectors to summarize the results forecasts = rep( NA, len ) gammas = rep( NA, len ) costs = rep( NA, len ) performances = rep( NA, len ) features = rep( NA, len ) # Format the results for( ll in res ) { # Prepare the indexes ii = ll[["index"]] jj = ii + NROW( ll[["forecasts"]] ) - 1 # Copy the output forecasts[ii:jj] = ll[["forecasts"]] gammas[ii:jj] = ll[["gamma"]] costs[ii:jj] = ll[["cost"]] performances[ii:jj] = ll[["performance"]] # Encode the participating features as a bit mask stored in a single # integer. This representation limits us to max 32 features. features[ii:jj] = sum( 2^( ll[["features"]] - 1 ) ) } sigUp = ifelse( forecasts >= 0, 1, 0 ) sigUp[is.na( sigUp )] = 0 sigDown = ifelse( forecasts < 0, -1, 0 ) sigDown[is.na( sigDown)] = 0 sig = sigUp + sigDown res = merge( reclass( sig, response ), reclass( sigUp, response ), reclass( sigDown, response ), na.trim( reclass( forecasts, response ) ), reclass( performances, response ), reclass( gammas, response ), reclass( costs, response ), reclass( features, response ), all=F ) colnames( res ) = c( "Indicator", "Up", "Down", "Forecasts", "Performance", "Gamma", "Cost", "Features" ) return( res ) }

The pattern that I want to emphasize culminates in the call to *mclapply*. Each instance of the function executed in parallel (*svmComputeOneForecast* in this example), needs an index to identify the period it has to operate on. This is precisely what *seq(1,length(startPoints)* is for. Each task will be passed an unique index identifying the period. The period is retrieved by using: *startPoints[uniqueIndex]* and *endPoints[uniqueIndex]*, respectively.

I have used this approach in an earlier implementation of the ARMA+GARCH system and now I am using it for the SVM system. Likely I will be using the same approach with Neural Networks, Random Forest and all the rest I have on my to-test list. Thought it may be helpful to others too.

{ 3 comments… read them below or add one }

Very interesting. Looking forward to seeing your backtesting results. Are they any better than the ones you from the ARMA model you published back in August?

So far the results are very promising. I have only used data from the 50s, since I am trying to get a feel for the methods. More details are coming over the next few months.

Hi,

I’ve built a library for “cross-validating” time series models, which is really the same thing as a rolling backtest. I’ve included functions for training models from the caret package, which covers a pretty wide range of predictive models. My code is here:

https://github.com/zachmayer/cv.ts

And an example using stock market data is here:

http://moderntoolmaking.blogspot.com/2012/06/time-series-cross-validation-4.html

-Zach