  1. Data Management
  2. DM-15406

mosaic.py timeout error in readCatalog


    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: meas_mosaic
    • Labels: None
    • Team: External

      Description

      Jeffrey Carlin and I noticed that running mosaic.py with numCoresForReadSource > 1 no longer works: it hangs until the read timeout expires and then fails with a TimeoutError like the one below:

      Traceback (most recent call last):
        File "/software/lsstsw/stack3_20171023/stack/miniconda3-4.3.21-10a4fa6/Linux64/meas_mosaic/16.0-6-ged8029f+4/bin/mosaic.py", line 5, in <module>
          MosaicTask.parseAndRun()
        File "/software/lsstsw/stack3_20171023/stack/miniconda3-4.3.21-10a4fa6/Linux64/pipe_base/16.0-6-g44ca919+2/python/lsst/pipe/base/cmdLineTask.py", line 575, in parseAndRun
          resultList = taskRunner.run(parsedCmd)
        File "/software/lsstsw/stack3_20171023/stack/miniconda3-4.3.21-10a4fa6/Linux64/pipe_base/16.0-6-g44ca919+2/python/lsst/pipe/base/cmdLineTask.py", line 224, in run
          resultList = list(mapFunc(self, targetList))
        File "/software/lsstsw/stack3_20171023/stack/miniconda3-4.3.21-10a4fa6/Linux64/meas_mosaic/16.0-6-ged8029f+4/python/lsst/meas/mosaic/mosaicTask.py", line 80, in __call__
          result = task.run(*args)
        File "/software/lsstsw/stack3_20171023/stack/miniconda3-4.3.21-10a4fa6/Linux64/meas_mosaic/16.0-6-ged8029f+4/python/lsst/meas/mosaic/mosaicTask.py", line 1054, in run
          numCoresForReadSource, readTimeout, verbose)
        File "/software/lsstsw/stack3_20171023/stack/miniconda3-4.3.21-10a4fa6/Linux64/meas_mosaic/16.0-6-ged8029f+4/python/lsst/meas/mosaic/mosaicTask.py", line 871, in mosaic
          readTimeout, verbose)
        File "/software/lsstsw/stack3_20171023/stack/miniconda3-4.3.21-10a4fa6/Linux64/meas_mosaic/16.0-6-ged8029f+4/python/lsst/meas/mosaic/mosaicTask.py", line 554, in readCatalog
          resultList = pool.map_async(worker, params).get(readTimeout)
        File "/software/lsstsw/stack3_20171023/python/miniconda3-4.3.21/lib/python3.6/multiprocessing/pool.py", line 640, in get
          raise TimeoutError
      

      Stack version w_2018_30 was used.

      FWIW, I was able to run the same job using only 1 core for reading the source catalog (--numCoresForReadSource=1).
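
      The traceback ends in `pool.map_async(worker, params).get(readTimeout)`, so a worker that never returns shows up as a `TimeoutError` rather than as an error from the worker itself. The following is a minimal sketch of that behaviour, not the meas_mosaic code: `slow_worker` and `read_with_timeout` are hypothetical names, and a `ThreadPool` stands in for the process pool so the example stays self-contained (it exposes the same `map_async(...).get(timeout)` interface).

      ```python
      import multiprocessing
      import time
      from multiprocessing.pool import ThreadPool


      def slow_worker(delay):
          """Hypothetical stand-in for a catalog-reading worker that may stall."""
          time.sleep(delay)
          return delay


      def read_with_timeout(delays, num_workers, timeout):
          """Map slow_worker over delays; return the results, or None on timeout."""
          pool = ThreadPool(num_workers)
          try:
              # Same call shape as in the traceback: map_async(...).get(timeout).
              return pool.map_async(slow_worker, delays).get(timeout)
          except multiprocessing.TimeoutError:
              # get() gives up after `timeout` seconds if any result is missing.
              return None
          finally:
              pool.terminate()


      fast = read_with_timeout([0.0, 0.0], num_workers=2, timeout=5.0)
      stalled = read_with_timeout([0.0, 0.5], num_workers=2, timeout=0.1)
      print(fast, stalled)
      ```

      A single stalled worker is enough to make the whole `get()` call time out, which matches the symptom that the job only completes with `--numCoresForReadSource=1`.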

            Activity

            jbosch Jim Bosch added a comment -

            Great, thank you!

            tjenness Tim Jenness added a comment -

            Is it possible that this change has caused this problem?

            $ python
            Python 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:07:29) 
            [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
            Type "help", "copyright", "credits" or "license" for more information.
            >>> import warnings
            >>> warnings.simplefilter("error")
            >>> import lsst.afw.image
            libc++abi.dylib: terminating with uncaught exception of type pybind11::error_already_set: SystemError: <built-in method __contains__ of dict object at 0x10bd53480> returned a result with an error set
            Caught signal 6, backtrace follows:
            0   libutils.dylib                      0x0000000110dbe311 lsst::utils::(anonymous namespace)::signalHandler(int) + 81
            1   libsystem_platform.dylib            0x00007fff51f22f5a (null) + 26
            1   libsystem_platform.dylib            0x00007fff51f22f5a _sigtramp + 26
            2   ???                                 0x00007ffe00020008 0x0 + 140728898551816
            3   libsystem_c.dylib                   0x00007fff51cc01ae abort + 127
            4   libc++abi.dylib                     0x00007fff4fbbaf8f (null) + 0
            4   libc++abi.dylib                     0x00007fff4fbbaf8f __cxa_bad_cast + 0
            5   libc++abi.dylib                     0x00007fff4fbbb113 default_terminate_handler() + 241
            6   libobjc.A.dylib                     0x00007fff50ffceab _objc_terminate() + 105
            7   libc++abi.dylib                     0x00007fff4fbd67c9 std::__terminate(void (*)()) + 8
            8   libc++abi.dylib                     0x00007fff4fbd6843 std::terminate() + 51
            9   libc++abi.dylib                     0x00007fff4fbd63c3 (null) + 47
            9   libc++abi.dylib                     0x00007fff4fbd63c3 __cxa_decrement_exception_refcount + 47
            10  schema.so                           0x000000011eaa5755 PyInit_schema + 4053
            ...
            

            It only happens if importing triggers a warning (in my case a `RuntimeWarning`) – and that requires updated numpy, but I worry this is a real problem.
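
            For context, `warnings.simplefilter("error")` turns every warning into a raised exception, which is why the abort only appears when the import actually emits a warning. A pure-Python sketch of that effect follows; `noisy` and `call_with_filter` are hypothetical names, and the sketch does not reproduce the pybind11 crash itself, only the warning-to-exception conversion that precedes it.

            ```python
            import warnings


            def noisy():
                """Hypothetical stand-in for code that warns during import."""
                warnings.warn("hypothetical import-time warning", RuntimeWarning)
                return "imported"


            def call_with_filter(action):
                """Run noisy() under the given warnings filter action."""
                with warnings.catch_warnings():
                    warnings.simplefilter(action)
                    try:
                        return noisy()
                    except RuntimeWarning as exc:
                        # Under the "error" action the warning is raised instead.
                        return "raised: {}".format(exc)


            ignored = call_with_filter("ignore")
            raised = call_with_filter("error")
            print(ignored)
            print(raised)
            ```

            In pure Python the exception can simply be caught. The backtrace above suggests that in the compiled module the equivalent exception reaches `std::terminate()` instead.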

            price Paul Price added a comment -

            I could believe that the fix that I merged is incomplete, but I wonder if your exception is coming from somewhere else. What does gdb say?

            tjenness Tim Jenness added a comment -

            * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
              * frame #0: 0x00007fff51d64b66 libsystem_kernel.dylib`__pthread_kill + 10
                frame #1: 0x00007fff51f2f080 libsystem_pthread.dylib`pthread_kill + 333
                frame #2: 0x00007fff51cc01ae libsystem_c.dylib`abort + 127
                frame #3: 0x00007fff4fbbaf8f libc++abi.dylib`abort_message + 245
                frame #4: 0x00007fff4fbbb113 libc++abi.dylib`default_terminate_handler() + 241
                frame #5: 0x00007fff50ffceab libobjc.A.dylib`_objc_terminate() + 105
                frame #6: 0x00007fff4fbd67c9 libc++abi.dylib`std::__terminate(void (*)()) + 8
                frame #7: 0x00007fff4fbd6843 libc++abi.dylib`std::terminate() + 51
                frame #8: 0x000000011787977b schema.so`__clang_call_terminate + 11
                frame #9: 0x00000001178b77d9 schema.so`pybind11::error_already_set::~error_already_set(this=<unavailable>) at pybind11.h:1875 [opt]
                frame #10: 0x00007fff4fbd63c3 libc++abi.dylib`__cxa_decrement_exception_refcount + 47
                frame #11: 0x000000011787a755 schema.so`::PyInit_schema() at schema.cc:390 [opt]
                frame #12: 0x000000010018d999 python`_PyImport_LoadDynamicModuleWithSpec + 473
                frame #13: 0x000000010018d02c python`_imp_create_dynamic + 188
            ...
                frame #100: 0x00000001170bbf0f apCorrMap.so`::PyInit_apCorrMap() [inlined] pybind11::module::import(name=<unavailable>) at pybind11.h:843 [opt]
                frame #101: 0x00000001170bbf03 apCorrMap.so`::PyInit_apCorrMap() [inlined] lsst::afw::image::(anonymous namespace)::pybind11_init_apCorrMap(pybind11::module&) at apCorrMap.cc:44 [opt]
                frame #102: 0x00000001170bbf03 apCorrMap.so`::PyInit_apCorrMap() at apCorrMap.cc:43 [opt]
            ...
            

            tjenness Tim Jenness added a comment -

            I've moved discussion of the import error to DM-15478.


              People

              Assignee: Paul Price (price)
              Reporter: Hsin-Fang Chiang (hchiang2)
              Reviewers: Jim Bosch
              Watchers: Hsin-Fang Chiang, Jeffrey Carlin, Jim Bosch, John Swinbank, Paul Price, Pim Schellart [X] (Inactive), Russell Owen, Tim Jenness
