Details
-
Type:
Story
-
Status: Done
-
Resolution: Done
-
Fix Version/s: None
-
Component/s: None
-
Labels:None
-
Team:Architecture
-
Urgent?:No
Description
Building lsst_distrib on macOS Ventura is quite frustrating. The problem manifests in a couple of different ways but they all relate to running tests with mult-iprocessing (the default situations when running rebuild). Sometimes what happens is that pytest hangs on runup and rebuild prints dots forever. This seems to be related to the xdist workers crashing in some way but is not predictable (although seemingly it happens more later in the build than early on).
A more explicit failure seems to be more repeatable with drp_pipe which fails the tests:
$ pytest -n8 tests
|
============================================================================================= test session starts =============================================================================================
|
platform darwin -- Python 3.10.6, pytest-7.2.0, pluggy-1.0.0
|
rootdir: /Users/timj/work/lsstsw/build/drp_pipe, configfile: setup.cfg
|
plugins: subtests-0.9.0, remotedata-0.3.3, arraydiff-0.3, asdf-2.10.1, mock-3.10.0, hypothesis-6.56.4, xdist-3.0.2, flake8-1.1.1, doctestplus-0.12.1, typeguard-2.13.3, astropy-0.10.0, session2file-0.1.11, filter-subpackage-0.1.1, cov-4.0.0, openfiles-0.5.0, anyio-3.6.2, astropy-header-0.1.2
|
gw0 [8] / gw1 [8] / gw2 [8] / gw3 [8] / gw4 [8] / gw5 [8] / gw6 [8] / gw7 [8]
|
s[gw3] node down: Not properly terminated
|
F
|
replacing crashed worker gw3
|
gw0 [8] / gw1 [8] / gw2 [8] / gw8 ok / gw4 [8] / gw5 [8] / gw6 [8] / gw7 [8].....
|
================================================================================================== FAILURES ===================================================================================================
|
___________________________________________________________________________________________ tests/test_pipelines.py ___________________________________________________________________________________________
|
[gw3] darwin -- Python 3.10.6 /Users/timj/work/lsstsw/miniconda/envs/lsst-scipipe-5.0.0/bin/python
|
worker 'gw3' crashed while running 'tests/test_pipelines.py::PipelineTestCase::test_hsc_drp_prod'
|
=========================================================================================== short test summary info ===========================================================================================
|
FAILED tests/test_pipelines.py::PipelineTestCase::test_hsc_drp_prod
|
============================================================================ 1 failed, 6 passed, 1 skipped, 14 warnings in 11.51s =============================================================================
|
or slightly differently:
============================= test session starts ==============================
|
platform darwin -- Python 3.10.6, pytest-7.2.0, pluggy-1.0.0
|
rootdir: /Users/timj/work/lsstsw/build/drp_pipe, configfile: setup.cfg
|
plugins: subtests-0.9.0, remotedata-0.3.3, arraydiff-0.3, asdf-2.10.1, mock-3.10.0, hypothesis-6.56.4, xdist-3.0.2, flake8-1.1.1, doctestplus-0.12.1, typeguard-2.13.3, astropy-0.10.0, session2file-0.1.11, filter-subpackage-0.1.1, cov-4.0.0, openfiles-0.5.0, anyio-3.6.2, astropy-header-0.1.2
|
gw0 I / gw1 I / gw2 I / gw3 I / gw4 I / gw5 I / gw6 I / gw7 I / gw8 I / gw9 I
|
gw0 [11] / gw1 [11] / gw2 [11] / gw3 [11] / gw4 [11] / gw5 [11] / gw6 [11] / gw7 [11] / gw8 [11] / gw9 [11]
|
|
....[gw8] node down: Not properly terminated
|
F
|
worker gw8 crashed and worker restarting disabled
|
......
|
=================================== FAILURES ===================================
|
___________________________ tests/test_pipelines.py ____________________________
|
[gw8] darwin -- Python 3.10.6 /Users/timj/work/lsstsw/miniconda/envs/lsst-scipipe-5.0.0/bin/python3.10
|
worker 'gw8' crashed while running 'tests/test_pipelines.py::PipelineTestCase::test_lsstcam_imsim_drp_ci_imsim'
|
I don't seem to get a crash report for these failures but other failures look like:
Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
|
0 dyld 0x1a17c8f90 lsl::Allocator::free(void*) + 60
|
1 dyld 0x1a17c7c2c lsl::UniquePtr<char const>::~UniquePtr() + 52
|
2 dyld 0x1a17c7570 dyld4::FileRecord::~FileRecord() + 52
|
3 dyld 0x1a17af6f8 lsl::UniquePtr<dyld4::Atlas::Image>::~UniquePtr() + 68
|
4 dyld 0x1a17dba9c lsl::BTree<lsl::UniquePtr<dyld4::Atlas::Image>, std::__1::less<lsl::UniquePtr<dyld4::Atlas::Image> >, false>::NodeCore<31u, 15u>::deallocateChildren() + 116
|
5 dyld 0x1a17dba68 lsl::BTree<lsl::UniquePtr<dyld4::Atlas::Image>, std::__1::less<lsl::UniquePtr<dyld4::Atlas::Image> >, false>::NodeCore<31u, 15u>::deallocateChildren() + 64
|
6 dyld 0x1a17dba68 lsl::BTree<lsl::UniquePtr<dyld4::Atlas::Image>, std::__1::less<lsl::UniquePtr<dyld4::Atlas::Image> >, false>::NodeCore<31u, 15u>::deallocateChildren() + 64
|
7 dyld 0x1a17af678 lsl::BTree<lsl::UniquePtr<dyld4::Atlas::Image>, std::__1::less<lsl::UniquePtr<dyld4::Atlas::Image> >, false>::clear() + 32
|
8 dyld 0x1a17af00c dyld4::RuntimeState::freeProcessSnapshot() + 60
|
9 dyld 0x1a17d4990 dyld4::APIs::dlopen_from(char const*, int, void*) + 932
|
10 python3.10 0x10420f78c _PyImport_FindSharedFuncptr + 296
|
11 python3.10 0x1041cc640 _imp_create_dynamic + 940
|
12 python3.10 0x1040fc4e4 cfunction_vectorcall_FASTCALL + 200
|
13 python3.10 0x10419a808 _PyEval_EvalFrameDefault + 26928
|
14 python3.10 0x104193634 _PyEval_Vector + 2028
|
Suggesting the workers are crashing during initial loading of shared libraries.
Attachments
Issue Links
- relates to
-
DM-37719 Reduce number of afw pybind11 shared libraries
- In Progress
-
DM-37720 Combine jointcal pybind11 wrappers into single shared library
- Done
-
DM-37767 Combine meas_modelfit pybind11 wrappers into single shared library
- Done
-
DM-37770 Combine ip_diffim pybind11 wrappers into single shared library
- Done
-
DM-37791 Combine meas_algorithms pybind11 wrappers into single shared library
- Done
-
DM-37793 Combine shapelet pybind11 wrappers into single shared library
- Done
-
DM-37804 Combine meas_astrom pybind11 wrappers into single shared library
- Done
-
DM-37982 Combine daf_base pybind11 wrappers into single shared library
- To Do
-
DM-37532 Combine meas_base pybind11 wrappers into single shared library
- Done
-
DM-37627 Combine astshim pybind11 wrappers into single shared library
- Done
- links to
Just to prove it doesn't really need a lot of activity in parallel to trigger this:
$ time python -c 'import lsst.meas.algorithms'
Caught signal 11, backtrace follows:
0 libcpputils.dylib 0x00000001057b92e4 lsst::cpputils::(anonymous namespace)::signalHandler(int) + 92
1 libsystem_platform.dylib 0x00000001909ec2a4 (null) + 56
1 libsystem_platform.dylib 0x00000001909ec2a4 _sigtramp + 56
2 dyld 0x00000001906bbc2c lsl::UniquePtr<char const>::~UniquePtr() + 52
3 dyld 0x00000001906bb570 dyld4::FileRecord::~FileRecord() + 52
4 dyld 0x00000001906a36f8 lsl::UniquePtr<dyld4::Atlas::Image>::~UniquePtr() + 68
5 dyld 0x00000001906cfa9c lsl::BTree<lsl::UniquePtr<dyld4::Atlas::Image>, std::__1::less<lsl::UniquePtr<dyld4::Atlas::Image> >, false>::NodeCore<31u, 15u>::deallocateChildren() + 116
6 dyld 0x00000001906cfa68 lsl::BTree<lsl::UniquePtr<dyld4::Atlas::Image>, std::__1::less<lsl::UniquePtr<dyld4::Atlas::Image> >, false>::NodeCore<31u, 15u>::deallocateChildren() + 64
7 dyld 0x00000001906cfa68 lsl::BTree<lsl::UniquePtr<dyld4::Atlas::Image>, std::__1::less<lsl::UniquePtr<dyld4::Atlas::Image> >, false>::NodeCore<31u, 15u>::deallocateChildren() + 64
8 dyld 0x00000001906a3678 lsl::BTree<lsl::UniquePtr<dyld4::Atlas::Image>, std::__1::less<lsl::UniquePtr<dyld4::Atlas::Image> >, false>::clear() + 32
9 dyld 0x00000001906a300c dyld4::RuntimeState::freeProcessSnapshot() + 60
10 dyld 0x00000001906c8990 dyld4::APIs::dlopen_from(char const*, int, void*) + 932
11 python3.10 0x0000000104d3378c (null) + 296
11 python3.10 0x0000000104d3378c _PyImport_FindSharedFuncptr + 296
12 python3.10 0x0000000104cf0640 (null) + 940
12 python3.10 0x0000000104cf0640 _imp_create_dynamic + 940
13 python3.10 0x0000000104c204e4 cfunction_vectorcall_FASTCALL + 200
14 python3.10 0x0000000104cbe808 (null) + 26928
14 python3.10 0x0000000104cbe808 _PyEval_EvalFrameDefault + 26928