One more similar failure: https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/36085/tests
There is no timing info for individual unit tests, so it's hard to say exactly what is going on, but I have a guess that I think is reasonable. Because there is no actual synchronization between parent and child processes, there is a race condition in MPGraphExecutor between the parent process checking that a child has timed out and then trying to kill it, and the child process actually finishing successfully. The window between check and kill is usually short, but on an over-subscribed CPU it can stretch arbitrarily long. That race is a real problem because it can lead to confusing results, which is exactly what we saw in these recent failures: the parent says that the child has timed out, but the child reports a success return code. The unit test timing was configured to try to avoid this condition, but apparently resources are very over-subscribed on Jenkins (likely by the ctrl_mpexec unit tests, which spawn a bunch of sub-processes), so we managed to actually hit this condition.
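To make the suspected sequence concrete, here is a toy timeline model of a check-then-kill parent. It is only a sketch of my guess, not the actual MPGraphExecutor code: the function name, the `Outcome` type, the 1-second timeout and the check/kill delays are all made up for illustration. Only the 5-second and 1-minute child lifetimes come from this discussion.

```python
import signal
from dataclasses import dataclass
from typing import Optional

# multiprocessing reports a child killed by signal N with exit code -N.
KILLED_BY_SIGTERM = -signal.SIGTERM


@dataclass
class Outcome:
    parent_state: str          # what the parent recorded for the child
    exit_code: Optional[int]   # what the child actually returned


def simulate_check_then_kill(
    timeout: float, check_time: float, kill_delay: float, child_lifetime: float
) -> Outcome:
    """Hypothetical model: the parent checks at `check_time`, and its kill
    lands `kill_delay` seconds later. All times are seconds since start."""
    if child_lifetime <= check_time:
        # Child was already done when the parent looked at it.
        return Outcome("SUCCESS", 0)
    if check_time < timeout:
        # Not timed out yet, parent leaves the child alone.
        return Outcome("RUNNING", None)
    # Parent decides TIMEOUT here, but the kill only happens later.
    kill_time = check_time + kill_delay
    if child_lifetime <= kill_time:
        # Child finished inside the check/kill window: the confusing case.
        return Outcome("TIMEOUT", 0)
    return Outcome("TIMEOUT", KILLED_BY_SIGTERM)


# Idle machine: the kill lands long before the 5 s child finishes.
print(simulate_check_then_kill(1.0, 1.1, 0.01, 5.0))
# Loaded machine: the check and the kill both slip, the child finishes first.
print(simulate_check_then_kill(1.0, 4.9, 0.5, 5.0))
# Same load with a 1 minute child: the window is far from the child's exit.
print(simulate_check_then_kill(1.0, 4.9, 0.5, 60.0))
```

The second case is the one that matches the failures: the parent has recorded a timeout while the child exits with code 0.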
There are a couple of things to try to fix here:
- If the parent thinks that a child has timed out, but the child finishes successfully before the parent manages to kill it, then the child status should be set to SUCCESS, not TIMEOUT.
- Make unit tests less likely to trigger this condition. It is probably enough to extend the lifetime of the child that is supposed to time out (make it 1 minute instead of 5 seconds). It may also be useful to reduce concurrency in pytest for ctrl_mpexec to avoid over-subscription, but I don't know how to do that yet.
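A rough sketch of what the first item could look like, assuming the parent holds a `multiprocessing.Process` for each child. `JobState`, `resolve_state` and `stop_and_resolve` are hypothetical names for illustration and the real MPGraphExecutor code may be structured quite differently. The idea is to decide the final state only after the child has been reaped, and to let a zero exit code win over the timeout flag.

```python
import enum
import multiprocessing
from typing import Optional


class JobState(enum.Enum):
    """Hypothetical final states, named after the ones discussed above."""
    SUCCESS = "SUCCESS"
    FAILED = "FAILED"
    TIMEOUT = "TIMEOUT"


def resolve_state(timed_out: bool, exitcode: Optional[int]) -> JobState:
    """Pick the final state from what the child actually did."""
    if exitcode == 0:
        # Child finished cleanly, even if the parent had already given up on it.
        return JobState.SUCCESS
    if timed_out:
        return JobState.TIMEOUT
    return JobState.FAILED


def stop_and_resolve(process: multiprocessing.Process, timed_out: bool) -> JobState:
    """Kill the child if the parent thinks it timed out, then look at its exit code."""
    if timed_out:
        # Harmless if the child has already exited on its own.
        process.terminate()
    process.join()
    return resolve_state(timed_out, process.exitcode)
```

With this ordering, a child that exits with code 0 inside the check/kill window ends up as SUCCESS, and a child that really was killed (negative exit code) still ends up as TIMEOUT.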