Data Management / DM-9746

validate_drp cfht/decam datasets timing out post pybind11 merge

    Details

    • Type: Story
    • Status: Invalid
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      The cfht and decam datasets have timed out several times since the pybind11 merge, e.g. https://ci.lsst.codes/job/validate_drp/851/

      The runtime before failure for both datasets is ~2 hours 47 minutes. The failures aren't conclusively related to the pybind11 merge, but the timing coincides with it.

      There have been recent changes to both pipe_tasks and pipe_drivers.

      https://github.com/lsst/pipe_tasks/commit/edc5e2ac3717660f1ff218ac0e43a61e2afcb794
      https://github.com/lsst/pipe_drivers/commit/a2154f1044a73c9f569b26784e97744581503f61

      Traceback (most recent call last):
        File "/home/jenkins-slave/workspace/validate_drp/dataset/cfht/label/centos-7/python/py2/lsstsw/stack/Linux64/pipe_tasks/13.0-4-gedc5e2a/bin/processCcd.py", line 25, in <module>
          ProcessCcdTask.parseAndRun()
        File "/home/jenkins-slave/workspace/validate_drp/dataset/cfht/label/centos-7/python/py2/lsstsw/stack/Linux64/pipe_base/13.0+5/python/lsst/pipe/base/cmdLineTask.py", line 482, in parseAndRun
          resultList = taskRunner.run(parsedCmd)
        File "/home/jenkins-slave/workspace/validate_drp/dataset/cfht/label/centos-7/python/py2/lsstsw/stack/Linux64/pipe_base/13.0+5/python/lsst/pipe/base/cmdLineTask.py", line 209, in run
          resultList = list(mapFunc(self, targetList))
        File "/home/jenkins-slave/workspace/validate_drp/dataset/cfht/label/centos-7/python/py2/lsstsw/stack/Linux64/pipe_base/13.0+5/python/lsst/pipe/base/cmdLineTask.py", line 70, in _runPool
          return pool.map_async(functools.partial(_poolFunctionWrapper, function), iterable).get(timeout)
        File "/home/jenkins-slave/workspace/validate_drp/dataset/cfht/label/centos-7/python/py2/lsstsw/miniconda/lib/python2.7/multiprocessing/pool.py", line 563, in get
          raise TimeoutError
      multiprocessing.TimeoutError
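
For readers unfamiliar with this failure mode, the following is a minimal sketch of how `map_async(...).get(timeout)` raises `multiprocessing.TimeoutError` when the work outlasts the timeout. It is an illustration only, not code from pipe_base: `slow_task` and `run_with_timeout` are hypothetical names, and a `ThreadPool` is swapped in for the process pool so the example stays self-contained (both return the same kind of result object).

```python
import multiprocessing
import time
from multiprocessing.pool import ThreadPool


def slow_task(seconds):
    """Hypothetical stand-in for a long-running per-CCD job."""
    time.sleep(seconds)
    return seconds


def run_with_timeout(durations, timeout):
    """Map slow_task over durations; return the results, or None on timeout."""
    # Assumption: ThreadPool replaces multiprocessing.Pool purely for
    # illustration; get(timeout) behaves the same way on both.
    with ThreadPool(processes=2) as pool:
        try:
            return pool.map_async(slow_task, durations).get(timeout)
        except multiprocessing.TimeoutError:
            # The same exception type that ends the traceback above.
            return None


if __name__ == "__main__":
    print(run_with_timeout([0.01, 0.01], timeout=5))  # finishes in time
    print(run_with_timeout([0.5], timeout=0.05))      # exceeds the timeout
```

Note that the timeout applies to the whole map call, not to each item, so a single stalled worker is enough to trigger it.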
      

            Activity

            jhoblitt Joshua Hoblitt added a comment -

            To clarify, the hsc dataset is working. It differs from the cfht/decam datasets in that it uses pipe_drivers.

            jhoblitt Joshua Hoblitt added a comment -

            The default timeout value in pipe_base is 9999 s, which explains the ~2 hour 47 min runtimes:

            https://github.com/lsst/pipe_base/blob/master/python/lsst/pipe/base/cmdLineTask.py#L137
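
As a quick check of that arithmetic, here is a small sketch that converts the 9999 s default into hours, minutes, and seconds. `format_duration` is a hypothetical helper written for this illustration, not part of pipe_base.

```python
def format_duration(total_seconds):
    """Render a duration in seconds as hours, minutes, and seconds."""
    hours, remainder = divmod(int(total_seconds), 3600)
    minutes, seconds = divmod(remainder, 60)
    return "{}h {:02d}m {:02d}s".format(hours, minutes, seconds)


if __name__ == "__main__":
    # 9999 s is the default timeout quoted in the comment above.
    print(format_duration(9999))  # prints "2h 46m 39s"
```

The result, just under 2 hours 47 minutes, matches the observed runtime before failure.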

            jhoblitt Joshua Hoblitt added a comment -

            It appears that the timeout can be controlled via a --timeout argument to processCcd.py. It is probably worth adding support for this flag to examples/runExample.sh, as we know both of these datasets should have much shorter runtimes.
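
One way a wrapper might pass the flag through is sketched below. This is an assumption-laden illustration, not the actual contents of examples/runExample.sh: `build_process_ccd_command` is a hypothetical helper, the repository path is a placeholder, and only the `--timeout` flag comes from the comment above.

```python
def build_process_ccd_command(input_repo, timeout_seconds=None, extra_args=()):
    """Build an argument list for processCcd.py (hypothetical helper).

    input_repo is a placeholder path; timeout_seconds, when given, is
    forwarded via the --timeout flag mentioned in the comment above.
    """
    cmd = ["processCcd.py", input_repo]
    if timeout_seconds is not None:
        if timeout_seconds <= 0:
            raise ValueError("timeout_seconds must be positive")
        cmd += ["--timeout", str(timeout_seconds)]
    # Any remaining arguments are passed through unchanged.
    cmd.extend(extra_args)
    return cmd


if __name__ == "__main__":
    # A 30 minute limit, chosen arbitrarily for illustration.
    print(build_process_ccd_command("path/to/input_repo", timeout_seconds=1800))
```

Leaving `timeout_seconds` unset keeps the default behaviour, so existing callers would be unaffected.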

            jhoblitt Joshua Hoblitt added a comment - - edited

            When implementing DM-9749, I tested running with NUMPROC set to 1, and the cfht dataset was able to run to completion.

            jhoblitt Joshua Hoblitt added a comment -

            Per discussion on Slack with Michael Wood-Vasey, we are going to set NUMPROC = 1 in production for the cfht/decam datasets in order to get them working for the time being.

            tjenness Tim Jenness added a comment -

            Is this still a problem?

            ktl Kian-Tat Lim added a comment -

            The timeouts don't seem to be happening anymore.

            ktl Kian-Tat Lim added a comment -

            And the NUMPROC=1 was removed in https://github.com/lsst-dm/jenkins-dm-jobs/commit/387422624be213d766e84883e792aa927a0695ae back in 2017.

              People

              Assignee:
              Unassigned
              Reporter:
              jhoblitt Joshua Hoblitt
              Watchers:
              Angelo Fausti, John Parejko, Jonathan Sick, Joshua Hoblitt, Kian-Tat Lim, Michael Wood-Vasey, Simon Krughoff, Tim Jenness
