Too Much Parallelism is as Bad

The other day I ran a machine learning backtest on a new data set. Once I got through the initial LDA and QDA runs, I decided to try xgboost. The first thing I observed was really bad performance. The results of the debugging session that followed were quite surprising to me.


I have been using the same framework for a few years now. I think there are some examples outlining the approach even on this blog, but I am too lazy to dig them out now. Without going into further details, let me outline my “stack”:

  • Microsoft R Open, which pulls in Intel’s MKL library for multi-threaded math
  • the parallel package, to spread the work across multiple R processes
  • caret, as the front end for model training
  • and now xgboost, via caret’s xgbTree method

As I mentioned, I have been using this stack for a few years now, and during this time, I have seen some really slow models. Two factors got me suspicious in this case:

  • I was using a new method – yeah, my first attempt with xgboost.
  • The data set was rather small and simple.

What I found out was that there was too much parallelization happening. Somehow, all these threads and processes were getting in each other’s way, and, although there was progress, it was glacially slow.

Looking at the stack – the parallelization is not that obvious. I was certainly using multiple processes via the parallel package, but beyond that, I was seeing a lot more threads running than expected. The culprit turned out to be the default parallelization in xgboost. Nowadays, apparently, every layer tries to exploit multiple cores, so that wasn’t surprising – just something new to me.
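To see why this hurts, here is a back-of-the-envelope sketch of the thread count. The numbers are purely illustrative – they assume an 8-core machine and one worker per core, not my actual setup:

library(parallel)

cores <- detectCores()    # say 8 on this machine
workers <- cores          # one R worker per core via the parallel package
xgb_threads <- cores      # xgboost's default: use every core in each worker

# Threads competing for the CPU once both layers parallelize:
workers * xgb_threads     # 8 * 8 = 64 threads fighting over 8 cores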

The fix ended up being quite simple – call caret’s train with nthread=1, which in turn is passed to xgb.train and solves the problem.
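A minimal sketch of what that looks like – the cluster size, data and tuning setup below are illustrative, not my original code:

library(caret)
library(doParallel)

# Process-level parallelism, as before
cl <- makeCluster(4)
registerDoParallel(cl)

ctrl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)

# nthread is forwarded through caret's "..." down to xgb.train,
# so each worker process runs xgboost single-threaded
fit <- train(Species ~ ., data = iris,
             method = "xgbTree",
             trControl = ctrl,
             nthread = 1)

stopCluster(cl)

With that in place, the only parallelism left is the one I asked for – the workers created via the parallel backend.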

Looking at the stack above, I realized that there might be other, similar issues. For instance, Microsoft R Open provides some multi-threaded improvements via Intel’s MKL library. In my case, that was not causing any observable problems, but in case it does – the threading can be disabled via:

setMKLthreads(1)
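Since setMKLthreads() ships with MRO (via the bundled RevoUtilsMath package – an assumption about the setup), a guarded call keeps the same script usable on plain R as well:

# Limit MKL to a single thread only when running under MRO
if (requireNamespace("RevoUtilsMath", quietly = TRUE)) {
  RevoUtilsMath::setMKLthreads(1)
}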

Now everything is up and running, and I am looking forward to the output.

Comments

  1. patriczhao says:

    Great hints!

    I am curious: are you distributing the processes across multiple machines or only one machine?

    If on one machine, all the threads from the multiple processes (from ‘parallel’) will cause lots of overhead.
    But it may not be as bad if the processes are distributed to remote machines.

    I’d like to see the code and performance data if it’s possible.

    Thanks,

    1. quintuitive says:

      Single machine in this case, and I didn’t realize that xgboost is multi-threaded by default.

  2. Bab says:

    No, using parallelism the *wrong way* is bad

    1. quintuitive says:

      The presumption that a package can exploit all available cores on the system just because it was run on that system is questionable at best.

      1. Bab says:

        sure, but that’s why I read the manual

  3. Tony says:

    I’m having trouble with xgboost threads. I have found that manually setting the number of threads is better than letting xgboost set it and can result in a massive speedup. I suspect that the problem is due to MRO, not xgboost. I’m still investigating so this reply is just preliminary. Ask me for some plots if you like.

    1. Bab says:

      Tony, xgboost tries to use *all* available cores by default, and if any of those cores are already in use, you get a significant speed reduction

      1. Tony says:

        I’m doing all my testing while the machine is idle. I suspect that xgboost in some situations is trying to use more than the available cores. I’m not blaming xgboost either. It may be a problem with the local OpenMP setup. I can verify what you have said – if you try to use more than the available cores you do get a spectacular speed reduction.

        1. quintuitive says:

          Tony, are you on Linux? If so, htop helps see all threads as well.

  4. adam c says:

    nThread = 1 was a HUGE help to me – thank you so much for this post!! I was experiencing the same throttling using caret and xgbTree; it seems like both of their parallel backends were clashing, because it was taking hundreds of times longer (with all threads at 100%) before the nthread fix in the caret::train function.

  5. sazulay says:

    XGBoost is really aggressive in the way it utilizes all cores. I use it for an online application over a Python Tornado web server, and its performance is horrible when not limited, due to some thread mess-up and oversubscription.
    I use the OMP flag OMP_NUM_THREADS=1 to limit XGBoost to 1 thread.
