Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-3517

unrepeatable build failure in base

    XMLWordPrintable

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: base
    • Labels:
      None
    • Team:
      Architecture

      Description

      The attached bug report is from a unit test failure that occurred in base when trying to build the sims packages. The failure did not happen to me. It happened once to Seth Digel, and once to Michael Reuter. In both cases, simply restarting the build "resolved" the issue (i.e. the build completed on the second attempt with no changes made to the code).

        Attachments

          Activity

          Hide
          swinbank John Swinbank added a comment -

          Minor comments on the PR.

          I'm a bit confused as to whether you've actually solved the race or not though. From reading the code, it looks as though generating lsst64defs.py is part of the python target, and that the tests target won't be run until python is complete. But that's not quite what you write, either here or in the code – you say you've "deferred the tests to the last minute", which rather implies you're just giving the C code more time to run but not guaranteeing that it will always have enough time. My scons-fu isn't strong enough to say for sure which is true without digging in further – can you clarify?

          If we are not able to definitively avoid the race in this situation, I'm actually a bit tempted by the simple solution I suggested above of just hard-coding the value. It would make the code a lot simpler and avoid possible races, it only need to be hard-coded for Python 2/Mac, and it does appear to be hard-coded (in multiple places) in the Python source.

          Show
          swinbank John Swinbank added a comment - Minor comments on the PR . I'm a bit confused as to whether you've actually solved the race or not though. From reading the code, it looks as though generating lsst64defs.py is part of the python target, and that the tests target won't be run until python is complete. But that's not quite what you write, either here or in the code – you say you've "deferred the tests to the last minute", which rather implies you're just giving the C code more time to run but not guaranteeing that it will always have enough time. My scons -fu isn't strong enough to say for sure which is true without digging in further – can you clarify? If we are not able to definitively avoid the race in this situation, I'm actually a bit tempted by the simple solution I suggested above of just hard-coding the value. It would make the code a lot simpler and avoid possible races, it only need to be hard-coded for Python 2/Mac, and it does appear to be hard-coded (in multiple places ) in the Python source.
          Hide
          tjenness Tim Jenness added a comment - - edited

          I didn't solve the race. All I did was try to make the race less likely by forcing all the tests to be deferred until both lib and python had completed. I am still confused as to how scons can report that the m4 execution has completed when it had not. In the initial report and the report from Robert Lupton last week the lsst64defs.py file has clearly been created but when the test executes it has not been populated (and has been by the time any one tries to work out what has happened by rerunning the test). This suggests that m4 is still running when the test has started but I have no idea why `scons` is not waiting for m4 to complete. It's not like NFS is involved.

          I am coming round to the idea of removing the lsst64defs code generation completely and just hard-wiring `RTLD_NOW` in the case (Python2/Mac) where it's not able to be determined from other modules. This will guarantee that the race condition is removed. I'll modify my patch.

          Show
          tjenness Tim Jenness added a comment - - edited I didn't solve the race. All I did was try to make the race less likely by forcing all the tests to be deferred until both lib and python had completed. I am still confused as to how scons can report that the m4 execution has completed when it had not. In the initial report and the report from Robert Lupton last week the lsst64defs.py file has clearly been created but when the test executes it has not been populated (and has been by the time any one tries to work out what has happened by rerunning the test). This suggests that m4 is still running when the test has started but I have no idea why `scons` is not waiting for m4 to complete. It's not like NFS is involved. I am coming round to the idea of removing the lsst64defs code generation completely and just hard-wiring `RTLD_NOW` in the case (Python2/Mac) where it's not able to be determined from other modules. This will guarantee that the race condition is removed. I'll modify my patch.
          Hide
          tjenness Tim Jenness added a comment -

          I've made the suggested changes and pushed the new version to the PR.

          • Remove lsst64defs.py generation.
          • No longer try to import dl
          • Fall back to a hard-coded "2" if RTLD_NOW can not be found.
          • Raise an exception if RTLD_GLOBAL could not be found.
          Show
          tjenness Tim Jenness added a comment - I've made the suggested changes and pushed the new version to the PR. Remove lsst64defs.py generation. No longer try to import dl Fall back to a hard-coded "2" if RTLD_NOW can not be found. Raise an exception if RTLD_GLOBAL could not be found.
          Hide
          swinbank John Swinbank added a comment -

          Looks good.

          My only (tiny!) comment would be that you could consider expanding the comment to explain why RTLD_NOW is set to 2 and provide a source for that number. Otherwise, good to go.

          Show
          swinbank John Swinbank added a comment - Looks good. My only (tiny!) comment would be that you could consider expanding the comment to explain why RTLD_NOW is set to 2 and provide a source for that number. Otherwise, good to go.
          Hide
          tjenness Tim Jenness added a comment -

          Merged to master with the suggested change to the comment noting where RTLD_NOW comes from.

          Show
          tjenness Tim Jenness added a comment - Merged to master with the suggested change to the comment noting where RTLD_NOW comes from.

            People

            Assignee:
            tjenness Tim Jenness
            Reporter:
            danielsf Scott Daniel
            Reviewers:
            John Swinbank
            Watchers:
            John Swinbank, Michael Reuter, Scott Daniel, Tim Jenness
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

              Dates

              Created:
              Updated:
              Resolved:

                Jenkins

                No builds found.