Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-5123

develop strategy for handling environments with Intel's MKL library

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: afw
    • Labels:
      None
    • Team:
      Architecture

      Description

      Per discussion with Tim Jenness and a bit of symbol table inspection, the root cause of the issue observed in DM-5105 is likely due to the Intel MKL library duplicating symbols which are present in fftw, blas, etc.. Eg.,

      https://software.intel.com/en-us/node/522277

      After discussion with CA sales, it has been clarified that the new mkl conda packages are under different terms than the previous MKL Optimized "add-ons".

      http://docs.continuum.io/anaconda/eula

      The new license is permissive and allows redistribution. It is reasonable to assume that MKL will be present on more LSST enduser's systems in the future and we need a strategy for managing linking to prevent symbol conflicts. A few possibilities are:

      • remove fftw, openblas, etc. from the stack and link solely with MKL (unknown if it is a 100% replacement and this would tie us to the conda package with the permissive eula)
      • officially declare MKL as unsupported
      • create library stubs (ie., symlinks) so that -lfftw/etc. will continue to work and re-plumb how EUPS handles optional dependencies so we can switch between openblas/fftw and MKL
      • modify our scons probing logic to be able to select between the various library options and re-plumb EUPS optional dependency handling

        Attachments

          Issue Links

            Activity

            jhoblitt Joshua Hoblitt created issue -
            jhoblitt Joshua Hoblitt made changes -
            Field Original Value New Value
            Link This issue relates to DM-5105 [ DM-5105 ]
            jhoblitt Joshua Hoblitt made changes -
            Summary develope strategy for handling environments with Intel's MKL library develop strategy for handling environments with Intel's MKL library
            Hide
            cwalter Chris Walter added a comment -

            Just a note... if you want anaconda can be explicitly installed without MKL and with openblas instead like before.

            Show
            cwalter Chris Walter added a comment - Just a note... if you want anaconda can be explicitly installed without MKL and with openblas instead like before.
            Hide
            tjenness Tim Jenness added a comment -

            Yes, that is what is currently done in Jenkins in the linked ticket. The issue though is that we are bound to see more people running with MKL now that it's the default. I imagine the fix is to have a check in place in scons or eupspkg that results in the Intel MKL library being used as a drop in replacement for FFTW if it is found on the system. Some thought required.

            Show
            tjenness Tim Jenness added a comment - Yes, that is what is currently done in Jenkins in the linked ticket. The issue though is that we are bound to see more people running with MKL now that it's the default. I imagine the fix is to have a check in place in scons or eupspkg that results in the Intel MKL library being used as a drop in replacement for FFTW if it is found on the system. Some thought required.
            Hide
            jhoblitt Joshua Hoblitt added a comment -

            Also, we don't internally use anaconda any longer. We have completely switched to miniconda, which requires that all packages be manually installed.

            Show
            jhoblitt Joshua Hoblitt added a comment - Also, we don't internally use anaconda any longer. We have completely switched to miniconda, which requires that all packages be manually installed.
            Hide
            cwalter Chris Walter added a comment -

            Just in case:

            According to this page:

            https://www.continuum.io/blog/developer-blog/anaconda-25-release-now-mkl-optimizations

            even for miniconda you need to do:

            conda install nomkl

            otherwise when you get nupmy or scipy etc you will get the MKL versions.

            Show
            cwalter Chris Walter added a comment - Just in case: According to this page: https://www.continuum.io/blog/developer-blog/anaconda-25-release-now-mkl-optimizations even for miniconda you need to do: conda install nomkl otherwise when you get nupmy or scipy etc you will get the MKL versions.
            Hide
            jhoblitt Joshua Hoblitt added a comment -
            Show
            jhoblitt Joshua Hoblitt added a comment - Chris Walter See DM-5105
            Show
            tjenness Tim Jenness added a comment - Yes, see https://github.com/lsst/lsstsw/blob/master/bin/deploy#L93
            Hide
            mjuric Mario Juric added a comment -

            Tim Jenness et al., here's a hack that may make it possible to ship conda binaries that don't require the nomkl versions: https://github.com/mjuric/conda-lsst/pull/71

            That said, I don't consider this an all-encompassing long-term solution (e.g., it doesn't help you if you build from source). IMHO, the right thing to do would be to understand why our code returns incorrect results when built against MKL's fftw3 interface.

            Show
            mjuric Mario Juric added a comment - Tim Jenness et al., here's a hack that may make it possible to ship conda binaries that don't require the nomkl versions: https://github.com/mjuric/conda-lsst/pull/71 That said, I don't consider this an all-encompassing long-term solution (e.g., it doesn't help you if you build from source). IMHO, the right thing to do would be to understand why our code returns incorrect results when built against MKL's fftw3 interface.
            Hide
            tjenness Tim Jenness added a comment -

            I agree with you that the fix is to stub out our fftw. Is it possible for fftw to work out whether it needs to be built and then just change the .cfg file to use MKL instead? I wonder if the results and segv are simply the linker picking symbols from libfftw and MKL and mixing and matching. I don't think anyone has spent any serious effort in tracking down the subtleties.

            Show
            tjenness Tim Jenness added a comment - I agree with you that the fix is to stub out our fftw. Is it possible for fftw to work out whether it needs to be built and then just change the .cfg file to use MKL instead? I wonder if the results and segv are simply the linker picking symbols from libfftw and MKL and mixing and matching. I don't think anyone has spent any serious effort in tracking down the subtleties.
            Hide
            mjuric Mario Juric added a comment -

            The issue is in the fftw_plan_r2r_2d call at https://github.com/lsst/meas_base/blob/master/src/SincCoeffs.cc#L447, which MKL doesn't support. Rewriting this using fftw_plan_c2r_2d (which MKL does support) should be straightforward but is tedious (had a brief chat about this with Jim Bosch and Robert Lupton; I'll open an issue for future reference). This seems to be the only place in our code that's incompatible with MKL.

            In the meantime, at:
            https://github.com/lsst/meas_base/compare/master...mjuric:mkl-support?expand=1
            there's an alternative patch for meas_base. It works around the problem by ensuring that the problematic calls always call the libfftw3 implementations. If this is OK, it should make it possible to build/run our code against all variants of Anaconda Python (mkl or nomkl).

            Any thoughts on this approach?

            Show
            mjuric Mario Juric added a comment - The issue is in the fftw_plan_r2r_2d call at https://github.com/lsst/meas_base/blob/master/src/SincCoeffs.cc#L447 , which MKL doesn't support. Rewriting this using fftw_plan_c2r_2d (which MKL does support) should be straightforward but is tedious (had a brief chat about this with Jim Bosch and Robert Lupton ; I'll open an issue for future reference). This seems to be the only place in our code that's incompatible with MKL. In the meantime, at: https://github.com/lsst/meas_base/compare/master...mjuric:mkl-support?expand=1 there's an alternative patch for meas_base . It works around the problem by ensuring that the problematic calls always call the libfftw3 implementations. If this is OK, it should make it possible to build/run our code against all variants of Anaconda Python (mkl or nomkl). Any thoughts on this approach?
            Hide
            jhoblitt Joshua Hoblitt added a comment -

            Mario Juric - is it a consequence of your proposal that original recipe fftw would need to be present at build time or is the fftw.h from MKL sufficient?

            Show
            jhoblitt Joshua Hoblitt added a comment - Mario Juric - is it a consequence of your proposal that original recipe fftw would need to be present at build time or is the fftw.h from MKL sufficient?
            Hide
            mjuric Mario Juric added a comment -

            No, libfftw is still needed as it's required by the SincCoeffs.cc code (MKL doesn't implement half-complex transforms). This patch just makes sure that the libfftw implementation (and not mkl) gets called at runtime in the one case where we use half-complex transforms. That should prevent the incorrect results & segfaults.

            Show
            mjuric Mario Juric added a comment - No, libfftw is still needed as it's required by the SincCoeffs.cc code (MKL doesn't implement half-complex transforms). This patch just makes sure that the libfftw implementation (and not mkl) gets called at runtime in the one case where we use half-complex transforms. That should prevent the incorrect results & segfaults.
            Hide
            mjuric Mario Juric added a comment -

            FWIW, from https://www.nersc.gov/users/announcements/featured-announcements/important-update-on-cori-status/:

            NERSC has retired its builds of Python in favor of Anaconda Python with threaded MKL support for numpy, scipy, scikit-learn, and numexpr. Some Python packages previously provided as modules are now available through the root Anaconda environment. Others remain as standalone modules or will return in the coming days. Users should note that conda environments replace virtualenv; we advise users to convert virtualenv to conda environments.

            (under "Python for Cori").

            Show
            mjuric Mario Juric added a comment - FWIW, from https://www.nersc.gov/users/announcements/featured-announcements/important-update-on-cori-status/: NERSC has retired its builds of Python in favor of Anaconda Python with threaded MKL support for numpy, scipy, scikit-learn, and numexpr. Some Python packages previously provided as modules are now available through the root Anaconda environment. Others remain as standalone modules or will return in the coming days. Users should note that conda environments replace virtualenv; we advise users to convert virtualenv to conda environments. (under "Python for Cori").
            jmatt J Matt Peterson [X] (Inactive) made changes -
            Link This issue relates to DM-7774 [ DM-7774 ]
            tjenness Tim Jenness made changes -
            Link This issue is triggering DM-8146 [ DM-8146 ]
            Hide
            tjenness Tim Jenness added a comment -

            I couldn't find a ticket in the system for fixing meas_base so I've created DM-8146.

            Show
            tjenness Tim Jenness added a comment - I couldn't find a ticket in the system for fixing meas_base so I've created DM-8146 .
            Hide
            tjenness Tim Jenness added a comment -

            Can either of Mario Juric or John Swinbank confirm that this ticket is complete in that we have agreed that the strategy is to fix meas_base in DM-8146 ?

            Show
            tjenness Tim Jenness added a comment - Can either of Mario Juric or John Swinbank confirm that this ticket is complete in that we have agreed that the strategy is to fix meas_base in DM-8146 ?
            tjenness Tim Jenness made changes -
            Reviewers John Swinbank, Mario Juric [ swinbank, mjuric ]
            Status To Do [ 10001 ] In Review [ 10004 ]
            Hide
            swinbank John Swinbank added a comment -

            I can confirm that the work on DM-8146 is in the DRP backlog and will be scheduled.

            I don't know anything much about MKL beyond what I've learned from scanning this ticket, so I can't confirm that DM-8146 is a sufficient "strategy for handling environments with Intel's MKL library".

            Show
            swinbank John Swinbank added a comment - I can confirm that the work on DM-8146 is in the DRP backlog and will be scheduled. I don't know anything much about MKL beyond what I've learned from scanning this ticket, so I can't confirm that DM-8146 is a sufficient "strategy for handling environments with Intel's MKL library".
            swinbank John Swinbank made changes -
            Reviewers John Swinbank, Mario Juric [ swinbank, mjuric ] Mario Juric [ mjuric ]
            Hide
            tjenness Tim Jenness added a comment -

            Mario Juric will you be able to look at this?

            Show
            tjenness Tim Jenness added a comment - Mario Juric will you be able to look at this?
            Hide
            mjuric Mario Juric added a comment -

            Hi Tim Jenness, sorry, lost these in JIRA noise.

            Yes, I do agree that DM-8146 is the right (righteous) solution (as opposed to a workaround).

            Show
            mjuric Mario Juric added a comment - Hi Tim Jenness , sorry, lost these in JIRA noise. Yes, I do agree that DM-8146 is the right (righteous) solution (as opposed to a workaround).
            Hide
            tjenness Tim Jenness added a comment -

            Mario Juric thank you. I'll close this now.

            Show
            tjenness Tim Jenness added a comment - Mario Juric thank you. I'll close this now.
            tjenness Tim Jenness made changes -
            Resolution Done [ 10000 ]
            Status In Review [ 10004 ] Done [ 10002 ]

              People

              • Assignee:
                tjenness Tim Jenness
                Reporter:
                jhoblitt Joshua Hoblitt
                Reviewers:
                Mario Juric
                Watchers:
                Chris Walter, Dominique Boutigny, Frossie Economou, Jim Bosch, John Parejko, John Swinbank, Joshua Hoblitt, Kian-Tat Lim, Mario Juric, Paul Price, Robert Lupton, Tim Jenness
              • Votes:
                0 Vote for this issue
                Watchers:
                12 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Summary Panel