I tried to build the stack from scratch, and got a failure in one of the meas_astrom tests (attached). It's similar to the
DM-2303 failures, where Eups aborts because the cache pickle file is invalid.
This happens because:
a) tests in meas_astrom are run concurrently (good)
b) at least 11 of them instantiate EUPS (ok)
c) each time an instance of the Eups class is created, it checks whether its products cache is valid; if it isn't, it rebuilds the cache and writes it out to disk for later reading. Because the write is non-atomic, this creates a race condition: another Eups instance that reads the cache file before the write has finished sees the file as corrupted. That is what happens here.
While there are known issues with EUPS locking, I think this kind of race should not exist even in the absence of locking (i.e., readers should not have to lock). The fix is to make the write atomic, using the usual write-to-a-temporary-file-and-rename pattern.
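The write-to-temporary-and-rename pattern could be sketched roughly as follows (a minimal illustration, not EUPS's actual code; the function name `write_cache_atomically` is hypothetical). The key point is that `os.rename()` is atomic on POSIX when source and destination live on the same filesystem, so a concurrent reader sees either the old cache file or the new one, never a half-written one:

```python
import os
import pickle
import tempfile

def write_cache_atomically(cache, cache_path):
    """Write the cache pickle to a temporary file, then rename it into place.

    The temporary file is created in the same directory as the target so
    that the final os.rename() stays on one filesystem and is atomic.
    """
    dirname = os.path.dirname(cache_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(cache, f)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes hit the disk first
        os.rename(tmp_path, cache_path)  # atomic replace; readers never see a partial file
    except BaseException:
        os.unlink(tmp_path)  # don't leave stale temp files behind
        raise
```

With this in place, a reader that opens the cache path always gets a complete pickle, regardless of how many writers are racing.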
Also, the caching code should not abort on invalid cache, but fall back to rebuilding the cache instead (and maybe issue a warning).
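That fallback might look something like this (again a sketch, with hypothetical names, not the actual EUPS caching code): treat any unreadable or unpicklable cache as a cache miss, warn, and rebuild instead of aborting:

```python
import pickle
import warnings

def load_cache(cache_path, rebuild):
    """Load the products cache; on a missing or corrupt pickle, warn and rebuild.

    `rebuild` is a callable that regenerates the cache from scratch.
    """
    try:
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    except (OSError, EOFError, pickle.UnpicklingError, AttributeError, ValueError):
        warnings.warn("products cache at %s is invalid; rebuilding" % cache_path)
        return rebuild()
```

The broad exception tuple is deliberate: a truncated or garbage pickle can surface as several different exception types, and all of them should mean "rebuild", not "abort".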
Once this is implemented, I think it may be possible to remove the workaround developed for