  1. Data Management
  2. DM-37301

Pipeline builds are unreliable on macOS Ventura


    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Team:
      Architecture
    • Urgent?:
      No

      Description

      Building lsst_distrib on macOS Ventura is quite frustrating. The problem manifests in a couple of different ways, but they all relate to running tests with multiprocessing (the default situation when running rebuild). Sometimes pytest hangs during startup and rebuild prints dots forever. This seems to be related to the xdist workers crashing in some way, but it is not predictable (although it seemingly happens more often later in the build than early on).
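
      As a rough sketch of the pattern being exercised (not a reproducer taken from this ticket; the imported package is simply the one used in the comments below), each test worker is a fresh process whose first job is to import C++-extension-heavy packages, i.e. a long chain of dlopen calls:

      import multiprocessing as mp


      def worker(_):
          # Each worker imports a package with many compiled extensions, so most of
          # its startup time is spent in dlopen; this is where the crashes appear.
          import lsst.meas.algorithms  # noqa: F401
          return True


      if __name__ == "__main__":
          # xdist test workers are independent processes; "spawn" mimics that most closely.
          ctx = mp.get_context("spawn")
          with ctx.Pool(processes=8) as pool:
              print(pool.map(worker, range(8)))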

      A more explicit failure seems to be more repeatable with drp_pipe, where the tests fail:

      $ pytest -n8 tests
      ============================================================================================= test session starts =============================================================================================
      platform darwin -- Python 3.10.6, pytest-7.2.0, pluggy-1.0.0
      rootdir: /Users/timj/work/lsstsw/build/drp_pipe, configfile: setup.cfg
      plugins: subtests-0.9.0, remotedata-0.3.3, arraydiff-0.3, asdf-2.10.1, mock-3.10.0, hypothesis-6.56.4, xdist-3.0.2, flake8-1.1.1, doctestplus-0.12.1, typeguard-2.13.3, astropy-0.10.0, session2file-0.1.11, filter-subpackage-0.1.1, cov-4.0.0, openfiles-0.5.0, anyio-3.6.2, astropy-header-0.1.2
      gw0 [8] / gw1 [8] / gw2 [8] / gw3 [8] / gw4 [8] / gw5 [8] / gw6 [8] / gw7 [8]
      s[gw3] node down: Not properly terminated
      F
      replacing crashed worker gw3
      gw0 [8] / gw1 [8] / gw2 [8] / gw8 ok / gw4 [8] / gw5 [8] / gw6 [8] / gw7 [8].....
      ================================================================================================== FAILURES ===================================================================================================
      ___________________________________________________________________________________________ tests/test_pipelines.py ___________________________________________________________________________________________
      [gw3] darwin -- Python 3.10.6 /Users/timj/work/lsstsw/miniconda/envs/lsst-scipipe-5.0.0/bin/python
      worker 'gw3' crashed while running 'tests/test_pipelines.py::PipelineTestCase::test_hsc_drp_prod'
      =========================================================================================== short test summary info ===========================================================================================
      FAILED tests/test_pipelines.py::PipelineTestCase::test_hsc_drp_prod
      ============================================================================ 1 failed, 6 passed, 1 skipped, 14 warnings in 11.51s =============================================================================
      

      or slightly differently:

      ============================= test session starts ==============================
      platform darwin -- Python 3.10.6, pytest-7.2.0, pluggy-1.0.0
      rootdir: /Users/timj/work/lsstsw/build/drp_pipe, configfile: setup.cfg
      plugins: subtests-0.9.0, remotedata-0.3.3, arraydiff-0.3, asdf-2.10.1, mock-3.10.0, hypothesis-6.56.4, xdist-3.0.2, flake8-1.1.1, doctestplus-0.12.1, typeguard-2.13.3, astropy-0.10.0, session2file-0.1.11, filter-subpackage-0.1.1, cov-4.0.0, openfiles-0.5.0, anyio-3.6.2, astropy-header-0.1.2
      gw0 I / gw1 I / gw2 I / gw3 I / gw4 I / gw5 I / gw6 I / gw7 I / gw8 I / gw9 I
      gw0 [11] / gw1 [11] / gw2 [11] / gw3 [11] / gw4 [11] / gw5 [11] / gw6 [11] / gw7 [11] / gw8 [11] / gw9 [11]
       
      ....[gw8] node down: Not properly terminated
      F
      worker gw8 crashed and worker restarting disabled
      ......
      =================================== FAILURES ===================================
      ___________________________ tests/test_pipelines.py ____________________________
      [gw8] darwin -- Python 3.10.6 /Users/timj/work/lsstsw/miniconda/envs/lsst-scipipe-5.0.0/bin/python3.10
      worker 'gw8' crashed while running 'tests/test_pipelines.py::PipelineTestCase::test_lsstcam_imsim_drp_ci_imsim'
      

      I don't seem to get a crash report for these failures, but other failures look like:

      Thread 0 Crashed::  Dispatch queue: com.apple.main-thread
      0   dyld                          	       0x1a17c8f90 lsl::Allocator::free(void*) + 60
      1   dyld                          	       0x1a17c7c2c lsl::UniquePtr<char const>::~UniquePtr() + 52
      2   dyld                          	       0x1a17c7570 dyld4::FileRecord::~FileRecord() + 52
      3   dyld                          	       0x1a17af6f8 lsl::UniquePtr<dyld4::Atlas::Image>::~UniquePtr() + 68
      4   dyld                          	       0x1a17dba9c lsl::BTree<lsl::UniquePtr<dyld4::Atlas::Image>, std::__1::less<lsl::UniquePtr<dyld4::Atlas::Image> >, false>::NodeCore<31u, 15u>::deallocateChildren() + 116
      5   dyld                          	       0x1a17dba68 lsl::BTree<lsl::UniquePtr<dyld4::Atlas::Image>, std::__1::less<lsl::UniquePtr<dyld4::Atlas::Image> >, false>::NodeCore<31u, 15u>::deallocateChildren() + 64
      6   dyld                          	       0x1a17dba68 lsl::BTree<lsl::UniquePtr<dyld4::Atlas::Image>, std::__1::less<lsl::UniquePtr<dyld4::Atlas::Image> >, false>::NodeCore<31u, 15u>::deallocateChildren() + 64
      7   dyld                          	       0x1a17af678 lsl::BTree<lsl::UniquePtr<dyld4::Atlas::Image>, std::__1::less<lsl::UniquePtr<dyld4::Atlas::Image> >, false>::clear() + 32
      8   dyld                          	       0x1a17af00c dyld4::RuntimeState::freeProcessSnapshot() + 60
      9   dyld                          	       0x1a17d4990 dyld4::APIs::dlopen_from(char const*, int, void*) + 932
      10  python3.10                    	       0x10420f78c _PyImport_FindSharedFuncptr + 296
      11  python3.10                    	       0x1041cc640 _imp_create_dynamic + 940
      12  python3.10                    	       0x1040fc4e4 cfunction_vectorcall_FASTCALL + 200
      13  python3.10                    	       0x10419a808 _PyEval_EvalFrameDefault + 26928
      14  python3.10                    	       0x104193634 _PyEval_Vector + 2028
      

      This suggests the workers are crashing during the initial loading of shared libraries.
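
      A crude way to narrow that down (a diagnostic sketch, not something run as part of this ticket; the module list is purely illustrative) is to import candidate compiled modules one at a time in fresh subprocesses and see which one dies with SIGSEGV:

      import subprocess
      import sys

      # Illustrative module list; substitute whatever the failing tests actually import.
      candidates = [
          "lsst.afw.table",
          "lsst.meas.algorithms",
          "scipy.spatial._ckdtree",
      ]

      for mod in candidates:
          proc = subprocess.run([sys.executable, "-c", f"import {mod}"])
          # A return code of -11 means the child was killed by SIGSEGV.
          print(f"{mod}: exit code {proc.returncode}")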

        Attachments

        1. crash.txt
          316 kB
        2. drp_pipe-libraries.txt
          161 kB
        3. libraries.txt
          82 kB
        4. report2.txt
          4 kB

          Issue Links

            Activity

            tjenness Tim Jenness added a comment -

            Just to prove it doesn't really need a lot of activity in parallel to trigger this:

            $ time python -c 'import lsst.meas.algorithms'
            Caught signal 11, backtrace follows:
            0   libcpputils.dylib                   0x00000001057b92e4 lsst::cpputils::(anonymous namespace)::signalHandler(int) + 92
            1   libsystem_platform.dylib            0x00000001909ec2a4 (null) + 56
            1   libsystem_platform.dylib            0x00000001909ec2a4 _sigtramp + 56
            2   dyld                                0x00000001906bbc2c lsl::UniquePtr<char const>::~UniquePtr() + 52
            3   dyld                                0x00000001906bb570 dyld4::FileRecord::~FileRecord() + 52
            4   dyld                                0x00000001906a36f8 lsl::UniquePtr<dyld4::Atlas::Image>::~UniquePtr() + 68
            5   dyld                                0x00000001906cfa9c lsl::BTree<lsl::UniquePtr<dyld4::Atlas::Image>, std::__1::less<lsl::UniquePtr<dyld4::Atlas::Image> >, false>::NodeCore<31u, 15u>::deallocateChildren() + 116
            6   dyld                                0x00000001906cfa68 lsl::BTree<lsl::UniquePtr<dyld4::Atlas::Image>, std::__1::less<lsl::UniquePtr<dyld4::Atlas::Image> >, false>::NodeCore<31u, 15u>::deallocateChildren() + 64
            7   dyld                                0x00000001906cfa68 lsl::BTree<lsl::UniquePtr<dyld4::Atlas::Image>, std::__1::less<lsl::UniquePtr<dyld4::Atlas::Image> >, false>::NodeCore<31u, 15u>::deallocateChildren() + 64
            8   dyld                                0x00000001906a3678 lsl::BTree<lsl::UniquePtr<dyld4::Atlas::Image>, std::__1::less<lsl::UniquePtr<dyld4::Atlas::Image> >, false>::clear() + 32
            9   dyld                                0x00000001906a300c dyld4::RuntimeState::freeProcessSnapshot() + 60
            10  dyld                                0x00000001906c8990 dyld4::APIs::dlopen_from(char const*, int, void*) + 932
            11  python3.10                          0x0000000104d3378c (null) + 296
            11  python3.10                          0x0000000104d3378c _PyImport_FindSharedFuncptr + 296
            12  python3.10                          0x0000000104cf0640 (null) + 940
            12  python3.10                          0x0000000104cf0640 _imp_create_dynamic + 940
            13  python3.10                          0x0000000104c204e4 cfunction_vectorcall_FASTCALL + 200
            14  python3.10                          0x0000000104cbe808 (null) + 26928
            14  python3.10                          0x0000000104cbe808 _PyEval_EvalFrameDefault + 26928
            

            tjenness Tim Jenness added a comment -

            For the record, I can't run ci_hsc_gen3 at all: inevitably one of the spawned subprocesses gets a segv trying to import everything, and then there is a cascading failure.

            tjenness Tim Jenness added a comment -

            Emacs has the same problem: https://developer.apple.com/forums/thread/722590

            tjenness Tim Jenness added a comment - edited

            macOS 13.3 seems to have fixed the dyld4 problem, although ci_hsc_gen3 crashed:

            [2023-03-27T17:53:23.882905Z] INFO 2023-03-27T10:53:21.711-07:00 lsst.ctrl.mpexec.mpGraphExecutor ()(mpGraphExecutor.py:640) - Executed 198 quanta successfully, 0 failed and 971 remain out of total 1169 quanta.
            [2023-03-27T17:53:24.079541Z] INFO 2023-03-27T10:53:23.882-07:00 lsst.isolatedStarAssociation (isolatedStarAssociation:{instrument: 'HSC', skymap: 'discrete/ci_hsc', tract: 0})(isolatedStarAssociation.py:193) - Running with 11 source_table_visit dataRefs
            [2023-03-27T17:53:24.901520Z] INFO 2023-03-27T10:53:24.079-07:00 lsst.isolatedStarAssociation (isolatedStarAssociation:{instrument: 'HSC', skymap: 'discrete/ci_hsc', tract: 0})(isolatedStarAssociation.py:437) - Found 1591 primary stars in i band.
            [2023-03-27T17:53:24.903526Z] ERROR 2023-03-27T10:53:24.901-07:00 lsst.ctrl.mpexec.mpGraphExecutor ()(mpGraphExecutor.py:590) - Task <TaskDef(lsst.pipe.tasks.isolatedStarAssociation.IsolatedStarAssociationTask, label=isolatedStarAssociation) dataId={instrument: 'HSC', skymap: 'discrete/ci_hsc', tract: 0}> failed, killed by signal 11 (Segmentation fault: 11); processing will continue for remaining tasks.
            

            The segv seems to be in malloc inside _ckdtree, which is part of scipy.

            crash.txt
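
            To check whether that remaining crash reproduces outside the pipeline, an isolated cKDTree exercise along these lines could be tried (a hedged sketch, not something from this ticket; the data are synthetic):

            import numpy as np
            from scipy.spatial import cKDTree

            # Synthetic points standing in for the star catalogs; build a tree and
            # run a nearest-neighbour query so that _ckdtree does some allocation work.
            rng = np.random.default_rng(42)
            points = rng.random((100_000, 2))
            tree = cKDTree(points)
            dist, idx = tree.query(points[:10], k=3)
            print(dist.shape, idx.shape)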

            tjenness Tim Jenness added a comment -

            Closing this ticket since it seems that macOS 13.3 really has fixed the dyld4 problem. I did a build of lsst_distrib and there were no test worker crashes.


              People

              Assignee:
              Unassigned
              Reporter:
              tjenness Tim Jenness
              Watchers:
              Chris Walter, Eli Rykoff, Fabio Hernandez, Jim Bosch, Kian-Tat Lim, Matthias Wittgen, Nate Lust, Tim Jenness
              Votes:
              0

                Dates

                Created:
                Updated:
                Resolved:

                  Jenkins

                  No builds found.