# Utilizing Parallel Processing to Improve Model Runtime

## 2018/07/24

Categories: Parallel Computing Tags: R foreach doMC EC2

## Preface

A while back, while peer-reviewing a trading strategies research paper, I was bottlenecked by an AdaBoost decision tree stump analysis that took over 10 minutes to complete. Since this was only one-third of the analyses I was conducting, I had to remedy the bottleneck, for reasons both logistical and spiritual. An immediate solution was to use two parallel processing packages: foreach and doMC.

## foreach and doMC Packages

The foreach package provides a looping construct that supports parallel execution. The foreach function behaves like a for loop, but it returns a value: a list by default, or a vector, matrix, or other object when a .combine function is supplied. The main reason to use foreach over other looping functions, such as the apply family, is this support for parallel execution. The doMC package provides the parallel backend for the foreach function. It only works on a single computer, NOT a cluster of computers. I found this out the hard way.
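To make the point about return values concrete, here is a minimal sketch that runs sequentially with %do%, so no backend is needed. The variable names are mine and purely illustrative.

```r
library(foreach)

# Default: results come back as a list, one element per iteration.
as_list <- foreach(i = 1:3) %do% sqrt(i)

# .combine = c concatenates the per-iteration results into a vector.
as_vector <- foreach(i = 1:3, .combine = c) %do% sqrt(i)

# .combine = rbind stacks the per-iteration results as rows of a matrix.
as_matrix <- foreach(i = 1:3, .combine = rbind) %do% c(i, i^2)
```

The same calls should work with %dopar% once a backend is registered; only the operator changes.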

To run a parallel job, you register the backend with registerDoMC() and then use %dopar% in place of %do%.

    library(foreach)
    library(doMC)
    registerDoMC(2)                                                # register the doMC backend with 2 cores
    foreach(i = 1:4, .combine = '+') %do%    { some_function(i) }  # sequential
    foreach(i = 1:4, .combine = '+') %dopar% { some_function(i) }  # parallel

Seems simple, no? But there are many potential pitfalls. foreach works best when the tasks can be done independently; when tasks have to communicate with each other, it can be quite inefficient. I too found this out the hard way.
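One way to see why independence matters is the sketch below. It assumes a Unix-like machine where doMC can fork worker processes; the variable names are illustrative and not taken from my original analysis. Each forked worker gets its own copy of the parent's variables, so an update made inside %dopar% never reaches the parent session. Having each iteration return its own value and merging with .combine avoids the problem.

```r
library(foreach)
library(doMC)

registerDoMC(2)  # assumption: two forked worker processes

# Pitfall: every worker updates its own private copy of `total`,
# so the parent session's `total` is left unchanged (still 0).
total <- 0
invisible(foreach(i = 1:4) %dopar% { total <- total + i })
total_after_loop <- total

# Independent tasks: each iteration just returns a value,
# and .combine = '+' adds the returned values together.
combined <- foreach(i = 1:4, .combine = '+') %dopar% { i }
```

Here combined is 1 + 2 + 3 + 4, while total_after_loop is whatever total was before the loop ran.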

## A Simple Example

The example I chose to show here simply calculates the Frobenius norm over a list of vectors of varying size. Believe it or not, recreating this example so that it ran the way my model had was actually more educational than the original application. I chose it because it is similar to the baseline matrix multiplications typically used as performance benchmarks.

My computer only has 2 cores, so to get something more exciting and expensive, I used an EC2 instance. Amazon charged me $\$0.44$ to run this. I guess I could’ve bought a Bazooka Joe instead, if those still exist.

The results below show the expected overhead of parallelizing small vectors: for smaller objects, spinning up more tasks costs more time than it saves. This is evident when comparing the red, blue, and black lines below. Likewise, a performance gain from parallelizing is only observed once the object’s size exceeds $2^{25}$. The speedup doesn’t quite scale with the number of available cores, but this is most likely due to inefficient code on my part: a nested loop that reduces the number of cores actually available. Verifying this will be a follow-on exercise for me.
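For readers who want to try a comparison of this kind, here is a rough sketch of how such a timing could be set up. It is not the code behind the results above. The helpers frobenius, make_vectors, and time_norms are hypothetical names, the worker count of 2 is an assumption, and the sizes are deliberately small so that it runs on a laptop.

```r
library(foreach)
library(doMC)

registerDoMC(2)  # assumption: 2 workers; raise this on a larger machine

# Frobenius norm: square root of the sum of squared entries.
frobenius <- function(x) sqrt(sum(x^2))

# Hypothetical workload: a list of `n_vec` random vectors of length `len`.
make_vectors <- function(n_vec, len) {
  lapply(seq_len(n_vec), function(k) rnorm(len))
}

# Time the same loop sequentially and in parallel, and check that
# both versions return the same norms.
time_norms <- function(vectors) {
  seq_time <- system.time(
    seq_res <- foreach(v = vectors, .combine = c) %do% frobenius(v)
  )[["elapsed"]]
  par_time <- system.time(
    par_res <- foreach(v = vectors, .combine = c) %dopar% frobenius(v)
  )[["elapsed"]]
  list(sequential = seq_time,
       parallel = par_time,
       same_result = isTRUE(all.equal(seq_res, par_res)))
}

# Small illustrative sizes: 2^10, 2^12, 2^14 entries per vector.
timings <- lapply(c(10, 12, 14), function(p) time_norms(make_vectors(4, 2^p)))
```

At sizes this small I would expect the parallel version to be no faster than the sequential one, for the overhead reasons described above; the exponents would need to be pushed much higher, memory permitting, before any gain might appear.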