  1. Data Management
  2. DM-4714

Bad OpenBlas setting in miniconda/numpy causes very poor performance for running multiple processes

    Details

    • Type: Bug
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: afw, lsstsw
    • Labels:
      None

      Description

      I have been running many processes of processCcdDecam.py on my new Linux machine in Tucson (bambam). To my surprise, running 40 processes at once gives very poor performance (~70 sec per process) compared to running a single process (~16 sec). I expected some performance hit because of larger overheads, but not a factor of 4!

      I ran it on both a spinning HDD and a PCIe SSD, but both had the same problem. I also tried running it on multiple visits versus multiple chips for a single visit (all accessing the same MEF FITS file), but this made no difference. I tested it with various numbers of processes and found that the time per process increases linearly with the number of processes running.

      J Matt Peterson [X] has been helping me track this down. We used some performance tools (htop, iotop, and perf) to figure out what was going on. It was clear that the issue was not a RAM or I/O problem. Watching htop while the 40 processes were running showed that once some of the processes hit "deblending", everything slowed down considerably, with all cores maxed out and showing lots of kernel traffic. I also ran processCcdDecam.py with deblending turned off, and the performance was much more reasonable (~24 sec per process).

      After more digging (with perf top), we found that there was a lot of thread swapping going on during the deblending step, caused by "openblas". This is a package that numpy uses to speed up certain computationally intensive tasks (e.g. linear algebra) using multithreading. By default, each openblas instance takes advantage of ALL cores on a machine. So all 40 processes were trying to use all available cores, and most of the time was spent swapping between all of these threads.
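To put a rough number on the oversubscription described above, here is a minimal sketch. The helper name `oversubscription` is hypothetical, and the 16-core figure in the usage note is an illustrative assumption, not the actual core count of bambam.

```python
import os


def oversubscription(n_processes, threads_per_process=None, n_cores=None):
    """Return (total_threads, threads_per_core) for a batch of processes.

    Hypothetical helper for illustration only.
    """
    n_cores = n_cores or os.cpu_count() or 1
    # Default mirrors the behaviour described above: each process
    # starts one BLAS thread per core.
    if threads_per_process is None:
        threads_per_process = n_cores
    total_threads = n_processes * threads_per_process
    return total_threads, total_threads / n_cores
```

For example, on an assumed 16-core machine, `oversubscription(40, n_cores=16)` gives 640 threads in total, or 40 per core, whereas `oversubscription(40, threads_per_process=1, n_cores=16)` gives 40 threads, or 2.5 per core.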

      OpenBlas can be configured to use a more reasonable number of cores/threads, but the version that LSSTSW uses is installed by miniconda as a dependency of numpy and, as far as we could tell, it's not possible to configure NUM_THREADS for OpenBlas with miniconda.

      We ended up compiling our own version of OpenBlas with NUM_THREADS = 6 (the maximum threads that OpenBlas uses) and the performance was great, 24 sec.
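As a possible alternative to rebuilding, OpenBLAS also honours the `OPENBLAS_NUM_THREADS` environment variable at run time. The sketch below sets it, along with the analogous `MKL_NUM_THREADS` and `OMP_NUM_THREADS` variables, before numpy is imported. The helper name `limit_blas_threads` is hypothetical, and this is not necessarily how the ticket was ultimately resolved.

```python
import os

# Environment variables read by common threaded numerical back ends.
THREAD_ENV_VARS = ("OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS", "OMP_NUM_THREADS")


def limit_blas_threads(n_threads=1, env=None):
    """Set the thread-count variables in `env` (default: os.environ).

    Hypothetical helper; returns the values that were set.
    """
    env = os.environ if env is None else env
    for name in THREAD_ENV_VARS:
        env[name] = str(n_threads)
    return {name: env[name] for name in THREAD_ENV_VARS}


# Call this before the first `import numpy`: the libraries read these
# variables when they are loaded, so later changes may have no effect.
limit_blas_threads(1)
```

The same effect can be had by exporting the variables in the shell that launches the processes, which avoids touching the Python code at all.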

      I'm not sure what the solution is for this, but we probably don't want to go with miniconda for the default LSSTSW installation (as is currently done by bin/deploy).

      JMatt might have comments to add.

            Activity

            tjenness Tim Jenness added a comment -

            Builds for me on Mac (I pushed one fix for an include file) with and without MKL. I have one question about boost::algorithms on the PR. Otherwise looks okay to me and I assume it really does fix your problem with the pipe_base change? Thanks for including MKL.

            price Paul Price added a comment -

            Pulled out boost::algorithm in favour of my own implementation, with a comment (as requested).

            Rebasing, and will merge soon.

            price Paul Price added a comment -

            Merged to master.

            swinbank John Swinbank added a comment -

            Added a brief summary to the release notes at https://confluence.lsstcorp.org/pages/viewpage.action?pageId=41785757#DataReleaseProductionWIPW16/X16releasenotes-Disableimplicitthreading. Paul Price, can you sanity check please? Is there any documentation for this that we could link to beyond this ticket/comment in the code?

            price Paul Price added a comment -

            Tweaked the summary and added a link to the CLO post.


              People

              Assignee:
              price Paul Price
              Reporter:
              nidever David Nidever [X] (Inactive)
              Reviewers:
              Tim Jenness
              Watchers:
              David Nidever [X] (Inactive), Frossie Economou, Jim Bosch, J Matt Peterson [X] (Inactive), John Swinbank, Joshua Hoblitt, Kian-Tat Lim, Mario Juric, Paul Price, Russell Owen, Tim Jenness
