Details
-
Type:
Bug
-
Status: Done
-
Resolution: Done
-
Fix Version/s: None
-
Component/s: ctrl_mpexec
-
Labels:
-
Story Points:1
-
Epic Link:
-
Sprint:DB_S22_12
-
Team:Data Access and Database
-
Urgent?:No
Description
Jenkins builds fail for a test_mpexec_timeout unit test, e.g. https://ci.lsst.codes/blue/organizations/jenkins/scipipe%2Flsst_distrib/detail/lsst_distrib/1683/tests
=================================== FAILURES ===================================
|
_________________ MPGraphExecutorTestCase.test_mpexec_timeout __________________
|
[gw3] linux -- Python 3.8.12 /j/ws/scipipe/lsst_distrib/centos-7-conda/lsstsw/miniconda/envs/lsst-scipipe-3.0.0/bin/python3.8
|
|
self = <test_executors.MPGraphExecutorTestCase testMethod=test_mpexec_timeout>
|
|
def test_mpexec_timeout(self):
|
"""Fail due to timeout"""
|
|
taskDef = TaskDefMock()
|
taskDefSleep = TaskDefMock(taskClass=TaskMockSleep)
|
qgraph = QuantumGraphMock(
|
[
|
QuantumIterDataMock(index=0, taskDef=taskDef, detector=0),
|
QuantumIterDataMock(index=1, taskDef=taskDefSleep, detector=1),
|
QuantumIterDataMock(index=2, taskDef=taskDef, detector=2),
|
]
|
)
|
|
# with failFast we'll get immediate MPTimeoutError
|
qexec = QuantumExecutorMock(mp=True)
|
mpexec = MPGraphExecutor(numProc=3, timeout=1, quantumExecutor=qexec, failFast=True)
|
with self.assertRaises(MPTimeoutError):
|
mpexec.execute(qgraph, butler=None)
|
report = mpexec.getReport()
|
self.assertEqual(report.status, ExecutionStatus.TIMEOUT)
|
self.assertEqual(report.exceptionInfo.className, "lsst.ctrl.mpexec.mpGraphExecutor.MPTimeoutError")
|
self.assertGreater(len(report.quantaReports), 0)
|
self.assertEquals(_count_status(report, ExecutionStatus.TIMEOUT), 1)
|
> self.assertTrue(any(qrep.exitCode < 0 for qrep in report.quantaReports))
|
E AssertionError: False is not true
|
|
tests/test_executors.py:352: AssertionError
|
One more similar failure: https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/36085/tests
There is no timing info for individual unit test so it's hard to say exactly what is going on, but I have a guess that I think is reasonable. Because there is no actual synchronization between parent and child processes, there is a sort of race condition in MPGraphExecutor between the parent process checking that a child has timed out and trying to kill it and the child process actually finishing successfully. The window between check and kill is usually short but on over-subscribed CPU it can extend indefinitely. That race condition is actually problematic as it can lead to confusing results, exactly what we saw in these recent failures - parent says that child has timed out, but child reports success return code. The unit test timing was configured to try to avoid this condition, but apparently resources are very over-subscribed on Jenkins (likely by ctrl_mpexec unit tests which spawn a bunch of sub-processes) so we are lucky to actually catch this condition.
There are couple of things to try to fix here: