# unrepeatable build failure in base

XMLWordPrintable

#### Details

• Type: Story
• Status: Done
• Resolution: Done
• Fix Version/s: None
• Component/s:
• Labels:
None
• Team:
Architecture

#### Description

The attached bug report is from a unit test failure that occurred in base when trying to build the sims packages. The failure did not happen to me. It happened once to Seth Digel, and once to Michael Reuter. In both cases, simply restarting the build "resolved" the issue (i.e. the build completed on the second attempt with no changes made to the code).

#### Attachments

1. testModuleImporter2.py.failed
0.8 kB

#### Activity

Hide
John Swinbank added a comment -

I'm a bit confused as to whether you've actually solved the race or not though. From reading the code, it looks as though generating lsst64defs.py is part of the python target, and that the tests target won't be run until python is complete. But that's not quite what you write, either here or in the code – you say you've "deferred the tests to the last minute", which rather implies you're just giving the C code more time to run but not guaranteeing that it will always have enough time. My scons-fu isn't strong enough to say for sure which is true without digging in further – can you clarify?

If we are not able to definitively avoid the race in this situation, I'm actually a bit tempted by the simple solution I suggested above of just hard-coding the value. It would make the code a lot simpler and avoid possible races, it only need to be hard-coded for Python 2/Mac, and it does appear to be hard-coded (in multiple places) in the Python source.

Show
John Swinbank added a comment - Minor comments on the PR . I'm a bit confused as to whether you've actually solved the race or not though. From reading the code, it looks as though generating lsst64defs.py is part of the python target, and that the tests target won't be run until python is complete. But that's not quite what you write, either here or in the code – you say you've "deferred the tests to the last minute", which rather implies you're just giving the C code more time to run but not guaranteeing that it will always have enough time. My scons -fu isn't strong enough to say for sure which is true without digging in further – can you clarify? If we are not able to definitively avoid the race in this situation, I'm actually a bit tempted by the simple solution I suggested above of just hard-coding the value. It would make the code a lot simpler and avoid possible races, it only need to be hard-coded for Python 2/Mac, and it does appear to be hard-coded (in multiple places ) in the Python source.
Hide
Tim Jenness added a comment - - edited

I didn't solve the race. All I did was try to make the race less likely by forcing all the tests to be deferred until both lib and python had completed. I am still confused as to how scons can report that the m4 execution has completed when it had not. In the initial report and the report from Robert Lupton last week the lsst64defs.py file has clearly been created but when the test executes it has not been populated (and has been by the time any one tries to work out what has happened by rerunning the test). This suggests that m4 is still running when the test has started but I have no idea why scons is not waiting for m4 to complete. It's not like NFS is involved.

I am coming round to the idea of removing the lsst64defs code generation completely and just hard-wiring RTLD_NOW in the case (Python2/Mac) where it's not able to be determined from other modules. This will guarantee that the race condition is removed. I'll modify my patch.

Show
Tim Jenness added a comment - - edited I didn't solve the race. All I did was try to make the race less likely by forcing all the tests to be deferred until both lib and python had completed. I am still confused as to how scons can report that the m4 execution has completed when it had not. In the initial report and the report from Robert Lupton last week the lsst64defs.py file has clearly been created but when the test executes it has not been populated (and has been by the time any one tries to work out what has happened by rerunning the test). This suggests that m4 is still running when the test has started but I have no idea why scons is not waiting for m4 to complete. It's not like NFS is involved. I am coming round to the idea of removing the lsst64defs code generation completely and just hard-wiring RTLD_NOW in the case (Python2/Mac) where it's not able to be determined from other modules. This will guarantee that the race condition is removed. I'll modify my patch.
Hide
Tim Jenness added a comment -

I've made the suggested changes and pushed the new version to the PR.

• Remove lsst64defs.py generation.
• No longer try to import dl
• Fall back to a hard-coded "2" if RTLD_NOW can not be found.
• Raise an exception if RTLD_GLOBAL could not be found.
Show
Tim Jenness added a comment - I've made the suggested changes and pushed the new version to the PR. Remove lsst64defs.py generation. No longer try to import dl Fall back to a hard-coded "2" if RTLD_NOW can not be found. Raise an exception if RTLD_GLOBAL could not be found.
Hide
John Swinbank added a comment -

Looks good.

My only (tiny!) comment would be that you could consider expanding the comment to explain why RTLD_NOW is set to 2 and provide a source for that number. Otherwise, good to go.

Show
John Swinbank added a comment - Looks good. My only (tiny!) comment would be that you could consider expanding the comment to explain why RTLD_NOW is set to 2 and provide a source for that number. Otherwise, good to go.
Hide
Tim Jenness added a comment -

Merged to master with the suggested change to the comment noting where RTLD_NOW comes from.

Show
Tim Jenness added a comment - Merged to master with the suggested change to the comment noting where RTLD_NOW comes from.

#### People

Assignee:
Tim Jenness
Reporter:
Scott Daniel
Reviewers:
John Swinbank
Watchers:
John Swinbank, Michael Reuter, Scott Daniel, Tim Jenness