Data Management / DM-34076

ctrl_mpexec unit test fails in test_mpexec_timeout


    Details

    • Story Points:
      1
    • Sprint:
      DB_S22_12
    • Team:
      Data Access and Database
    • Urgent?:
      No

      Description

      Jenkins builds fail for a test_mpexec_timeout unit test, e.g. https://ci.lsst.codes/blue/organizations/jenkins/scipipe%2Flsst_distrib/detail/lsst_distrib/1683/tests

      =================================== FAILURES ===================================
      _________________ MPGraphExecutorTestCase.test_mpexec_timeout __________________
      [gw3] linux -- Python 3.8.12 /j/ws/scipipe/lsst_distrib/centos-7-conda/lsstsw/miniconda/envs/lsst-scipipe-3.0.0/bin/python3.8
       
      self = <test_executors.MPGraphExecutorTestCase testMethod=test_mpexec_timeout>
       
          def test_mpexec_timeout(self):
              """Fail due to timeout"""
          
              taskDef = TaskDefMock()
              taskDefSleep = TaskDefMock(taskClass=TaskMockSleep)
              qgraph = QuantumGraphMock(
                  [
                      QuantumIterDataMock(index=0, taskDef=taskDef, detector=0),
                      QuantumIterDataMock(index=1, taskDef=taskDefSleep, detector=1),
                      QuantumIterDataMock(index=2, taskDef=taskDef, detector=2),
                  ]
              )
          
              # with failFast we'll get immediate MPTimeoutError
              qexec = QuantumExecutorMock(mp=True)
              mpexec = MPGraphExecutor(numProc=3, timeout=1, quantumExecutor=qexec, failFast=True)
              with self.assertRaises(MPTimeoutError):
                  mpexec.execute(qgraph, butler=None)
              report = mpexec.getReport()
              self.assertEqual(report.status, ExecutionStatus.TIMEOUT)
              self.assertEqual(report.exceptionInfo.className, "lsst.ctrl.mpexec.mpGraphExecutor.MPTimeoutError")
              self.assertGreater(len(report.quantaReports), 0)
              self.assertEquals(_count_status(report, ExecutionStatus.TIMEOUT), 1)
      >       self.assertTrue(any(qrep.exitCode < 0 for qrep in report.quantaReports))
      E       AssertionError: False is not true
       
      tests/test_executors.py:352: AssertionError
      


          Activity

          salnikov Andy Salnikov added a comment -

          One more similar failure: https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/36085/tests

          There is no timing info for individual unit tests, so it is hard to say exactly what is going on, but I have a guess that I think is reasonable. Because there is no actual synchronization between parent and child processes, there is a sort of race condition in MPGraphExecutor between the parent process checking that a child has timed out (and then trying to kill it) and the child process finishing successfully on its own. The window between the check and the kill is usually short, but on an over-subscribed CPU it can stretch indefinitely.

          That race condition is genuinely problematic, as it can lead to confusing results - exactly what we saw in these recent failures: the parent says the child timed out, but the child reports a success return code. The unit test timing was tuned to avoid this condition, but resources are apparently very over-subscribed on Jenkins (likely by the ctrl_mpexec unit tests themselves, which spawn a bunch of sub-processes), so we were "lucky" enough to actually catch this condition.
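          The check-then-kill race described above can be reproduced with a minimal, self-contained sketch. All names and timings here are illustrative, not the actual MPGraphExecutor code, and the fork start method assumes a POSIX platform:

          ```python
          # Sketch of the race: the parent decides the child has timed out, but
          # before it gets around to terminate(), the child finishes on its own.
          import multiprocessing
          import time

          def _child():
              # "Slow" task that outlives the parent's timeout but still finishes.
              time.sleep(0.3)

          def run_with_timeout(timeout=0.05, kill_delay=1.0):
              ctx = multiprocessing.get_context("fork")  # POSIX-only assumption
              proc = ctx.Process(target=_child)
              proc.start()
              proc.join(timeout)
              timed_out = proc.is_alive()  # parent's verdict: the child timed out
              # The window between the check above and the kill below is normally
              # short, but on an over-subscribed CPU it can stretch arbitrarily;
              # the sleep simulates that scheduling delay.
              time.sleep(kill_delay)
              if proc.is_alive():
                  proc.terminate()
              proc.join()
              # Contradiction: "timed out" according to the parent, yet a success
              # exit code (0) from the child.
              return timed_out, proc.exitcode

          print(run_with_timeout())
          ```

          On a typical run this prints a timed-out verdict together with exit code 0, which is the same contradiction the Jenkins failures showed.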

          There are a couple of things to try here:

          • If the parent thinks a child has timed out, but the child finishes successfully before the parent manages to kill it, then the child status should be set to SUCCESS, not TIMEOUT.
          • Make the unit tests less likely to trigger this condition; it is probably enough to extend the lifetime of a child that times out (make it 1 minute instead of 5 seconds). It may also be useful to reduce pytest concurrency for ctrl_mpexec to avoid over-subscription, but I don't know how to do that yet.
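          The first bullet could look roughly like the following. This is a hedged sketch only - ExecutionStatus here and the helper function are illustrative stand-ins, not the real ctrl_mpexec API:

          ```python
          # Decide the final per-quantum status once the child has exited,
          # trusting the child's actual exit code over the parent's possibly
          # stale timeout verdict.
          from enum import Enum, auto

          class ExecutionStatus(Enum):
              SUCCESS = auto()
              TIMEOUT = auto()
              FAILURE = auto()

          def resolve_status(parent_flagged_timeout: bool, exitcode: int) -> ExecutionStatus:
              if exitcode == 0:
                  # The child beat the parent to the finish line: even though a
                  # timeout was flagged, the work completed, so report SUCCESS.
                  return ExecutionStatus.SUCCESS
              if parent_flagged_timeout:
                  # The child really was killed (negative exit code on POSIX).
                  return ExecutionStatus.TIMEOUT
              return ExecutionStatus.FAILURE
          ```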
          tjenness Tim Jenness added a comment -

          The number of parallel tests is controlled by the scons -j parameter, which in Jenkins is set from the eupspkg EUPSPKG_NJOBS environment variable. It may be possible to override that in an eupspkg.cfg.sh file, but that would be the path of last resort.
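          For illustration only, such a last-resort override might amount to a single line in the package's eupspkg.cfg.sh (the exact hook semantics are not verified here):

          ```sh
          # Hypothetical eupspkg.cfg.sh fragment: cap the job count used for
          # this package's build/test step to reduce over-subscription.
          export EUPSPKG_NJOBS=2
          ```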

          DM-11792 talks about ways to limit how pytest breaks up a single test file, and maybe things have moved on with pytest in the past 5 years such that you can now tell pytest that all the tests in this one file should run sequentially.
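          One hedged possibility, assuming pytest-xdist is in use, is its loadfile distribution mode, which keeps every test from a given file on a single worker, so the subprocess-spawning tests in test_executors.py would run sequentially relative to each other:

          ```ini
          # setup.cfg sketch - requires pytest-xdist, which provides --dist
          [tool:pytest]
          addopts = --dist loadfile
          ```

          Whether this interacts well with the scons --tx invocation is not verified here.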

          salnikov Andy Salnikov added a comment -

          I tried the same command line as in DM-11792 and it did not hang for me. Limiting the number of workers for all tests in ctrl_mpexec is not optimal - there is only one test file that spawns subprocesses; all the others could run unchanged. I tried to see whether it is possible to run a single test separately from all the others, but I can't see how that can be done. Limiting the number of workers can be done via setup.py, e.g.:

          [tool:pytest]
          addopts = --flake8 --maxprocesses=2
          

          This also slows down all the other tests, but I think that is acceptable; even with that option the ctrl_mpexec tests take less than a minute.

          salnikov Andy Salnikov added a comment -

          Hmm, the setup.py addition works fine when I run pytest directly, but it does not work with scons. scons runs pytest with the --tx option instead of -n, and that seems to ignore --maxprocesses. I could add "-n auto" to setup.py, but then it would always run with 2 processes even if I say "scons -j 1". I think I'll ignore this for now and let it run as before; the changes to the unit test should fix the spurious failures even if we have too many processes.

          salnikov Andy Salnikov added a comment -

          Tim Jenness, could you check this when you have a minute? I decided not to mess with pytest; increasing the lifetime of a "hung" task should be sufficient to avoid the problem.

          salnikov Andy Salnikov added a comment -

          Missed the PR: https://github.com/lsst/ctrl_mpexec/pull/174
          tjenness Tim Jenness added a comment -

          Minor question on the PR.


            People

            Assignee:
            salnikov Andy Salnikov
            Reporter:
            salnikov Andy Salnikov
            Reviewers:
            Tim Jenness
            Watchers:
            Andy Salnikov, Tim Jenness

