Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-1098

Buildbot is not recognizing the exit return from a lock-exceeded situation

    Details

    • Type: Bug
    • Status: Won't Fix
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      Re-edit: What I should have noted is that the error refinement mentioned below is occurring in DM-1430. It handles this Issue at no added cost.

      Buildbot's error detection is failing to recognize both conditions when lsstsw returns in error either from the lsstsw flock or the eups lock. The action should be to terminate the entire run in failure; instead it proceeds as though no error occurred and continues processing thru doxydoc generation and the demo end-to-end run.

      This Issue is being expanded (in DM-1430) to include more nuanced error messages than the current blanket: 'The build failed'. The new error messages (and error handling) should differentiate on:
      1) the error occurred during setup prior to the build;
      2) the error occurred during the build;
      3) the error occurred during the doxygen generation;
      4) the error occurred during the I&T test.

      The flock error is characterized as being in the pre-build phase. The eups timeout error may occur in either the pre- or build phases.

      (We leave for another Issue, the separation into new steps for 1&2, 3, 4.)

        Attachments

          Issue Links

            Activity

            Hide
            robyn Robyn Allsman [X] (Inactive) added a comment -

            History of both flock and eups-lock failures in a time-drawn-out sequence in the course of 24 hours. This hadn't occurred before and not since (as of 8/26/2014).

            System Failed due to exception:
            1) Thurs 21 Aug 2014 14:47:56
            http://lsst-buildx.ncsa.illinois.edu:8010/builders/DM_stack/builds/394/steps/shell/logs/stdio (dummy branch)
            sconsUtils: master-g934a82c692 ..................................................................................
            remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
            ]

            Problem dealing with eups-lock left in place after exception:
            2) Thurs 21 Aug 2014 15:00:56
            http://lsst-buildx.ncsa.illinois.edu:8010/builders/DM_stack/builds/395/steps/shell/logs/stdio (dummy branch)

                • error building product afwdata.
                • exit code = 9
                • log is in /lsst/home/lsstsw/build/afwdata/_build.log
                • last few lines:
                  :::: Unable to take exclusive lock on /lsst/home/lsstsw/stack: locks are held by [user=lsstsw, pid=3333] [user=lsstsw, pid=3282] [user=lsstsw, pid=3331]; retrying

            3) Thurs 21 Aug 2014 15:03:20
            http://lsst-buildx.ncsa.illinois.edu:8010/builders/DM_stack/builds/396/steps/shell/logs/stdio (master only)
            afwdata: 8.0.0.1+8 ........ERROR (34 sec).

                • error building product afwdata.
                • exit code = 9
                • log is in /lsst/home/lsstsw/build/afwdata/_build.log
                • last few lines:
                  ::::: Unable to take exclusive lock on /lsst/home/lsstsw/stack: locks are held by [user=lsstsw, pid=3333]
                  [user=lsstsw, pid=3282] [user=lsstsw, pid=3331]; retrying

            4) Thurs 21 Aug 2014 15:44:07
            http://lsst-buildx.ncsa.illinois.edu:8010/builders/DM_stack/builds/397/steps/shell/logs/stdio (master only)
            afwdata: 8.0.0.1+8 .......ERROR (33 sec).

                • error building product afwdata.
                • exit code = 9
                • log is in /lsst/home/lsstsw/build/afwdata/_build.log
                • last few lines:
                  ::::: Unable to take exclusive lock on /lsst/home/lsstsw/stack: locks are held by [user=lsstsw, pid=3333] [user=lsstsw, pid=3282] [user=lsstsw, pid=3331]; retrying

            5) Thurs 21 Aug 2014 16:31:10
            http://lsst-buildx.ncsa.illinois.edu:8010/builders/DM_stack/builds/398/steps/shell/logs/stdio (master only)
            afwdata: 8.0.0.1+8 ........ERROR (34 sec).

                • error building product afwdata.
                • exit code = 9
                • log is in /lsst/home/lsstsw/build/afwdata/_build.log
                • last few lines:
                  ::::: Unable to take exclusive lock on /lsst/home/lsstsw/stack: locks are held by [user=lsstsw, pid=3333] [user=lsstsw, pid=3282] [user=lsstsw, pid=3331]; retrying

            Mario intervenes and identifies fact this is eups lock and uses ~ 'eups admin clearlocks' to fix.
            Then Stack completes build successfully:
            6) Thurs 21 Aug 2014 16:48:24
            http://lsst-buildx.ncsa.illinois.edu:8010/builders/DM_stack/builds/399/steps/shell/logs/stdio (master only)
            anaconda: 1.8.0 (already installed).

            1. BUILD b193 completed.

            Then stack fails with a flock this time:
            7) Thurs 21 Aug 2014 21:07:38
            http://lsst-buildx.ncsa.illinois.edu:8010/builders/DM_stack/builds/400/steps/shell/logs/stdio ('u/pgee/DM-984 u/jbosch/DM-984')
            Rebuild is commencing....stand by; using -r u/pgee/DM-984 -r u/jbosch/DM-984 -r master
            flock: 200: No locks available
            a rebuild is already in process.
            Failed rebuild of DM stack.

            8) Thurs 21 Aug 2014 22:34:13
            http://lsst-buildx.ncsa.illinois.edu:8010/builders/DM_stack/builds/401/steps/shell/logs/stdio (u/pgee/DM-984 u/jbosch/DM-984)
            Rebuild is commencing....stand by; using -r u/pgee/DM-984 -r u/jbosch/DM-984 -r master
            flock: 200: No locks available
            a rebuild is already in process.
            Failed rebuild of DM stack.

            9) Thurs 21 Aug 2014 23:29:31
            http://lsst-buildx.ncsa.illinois.edu:8010/builders/DM_stack/builds/402/steps/shell/logs/stdio (u/pgee/DM-984 u/jbosch/DM-984)
            Rebuild is commencing....stand by; using -r u/pgee/DM-984 -r u/jbosch/DM-984 -r master
            lock: 200: No locks available
            a rebuild is already in process.
            Failed rebuild of DM stack.

            Now it starts working again:
            10) Friday 22 Aug 2014 8:45:03 am
            http://lsst-buildx.ncsa.illinois.edu:8010/builders/DM_stack/builds/403/steps/shell/logs/stdio (master only)
            anaconda: 1.8.0 (already installed).

            1. BUILD b194 completed.
              The DM stack has been installed at /lsst/home/lsstsw with tag: b194.
            Show
            robyn Robyn Allsman [X] (Inactive) added a comment - History of both flock and eups-lock failures in a time-drawn-out sequence in the course of 24 hours. This hadn't occurred before and not since (as of 8/26/2014). System Failed due to exception: 1) Thurs 21 Aug 2014 14:47:56 http://lsst-buildx.ncsa.illinois.edu:8010/builders/DM_stack/builds/394/steps/shell/logs/stdio (dummy branch) sconsUtils: master-g934a82c692 .................................................................................. remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion. ] Problem dealing with eups-lock left in place after exception: 2) Thurs 21 Aug 2014 15:00:56 http://lsst-buildx.ncsa.illinois.edu:8010/builders/DM_stack/builds/395/steps/shell/logs/stdio (dummy branch) error building product afwdata. exit code = 9 log is in /lsst/home/lsstsw/build/afwdata/_build.log last few lines: :::: Unable to take exclusive lock on /lsst/home/lsstsw/stack: locks are held by [user=lsstsw, pid=3333] [user=lsstsw, pid=3282] [user=lsstsw, pid=3331] ; retrying 3) Thurs 21 Aug 2014 15:03:20 http://lsst-buildx.ncsa.illinois.edu:8010/builders/DM_stack/builds/396/steps/shell/logs/stdio (master only) afwdata: 8.0.0.1+8 ........ERROR (34 sec). error building product afwdata. exit code = 9 log is in /lsst/home/lsstsw/build/afwdata/_build.log last few lines: ::::: Unable to take exclusive lock on /lsst/home/lsstsw/stack: locks are held by [user=lsstsw, pid=3333] [user=lsstsw, pid=3282] [user=lsstsw, pid=3331] ; retrying 4) Thurs 21 Aug 2014 15:44:07 http://lsst-buildx.ncsa.illinois.edu:8010/builders/DM_stack/builds/397/steps/shell/logs/stdio (master only) afwdata: 8.0.0.1+8 .......ERROR (33 sec). error building product afwdata. exit code = 9 log is in /lsst/home/lsstsw/build/afwdata/_build.log last few lines: ::::: Unable to take exclusive lock on /lsst/home/lsstsw/stack: locks are held by [user=lsstsw, pid=3333] [user=lsstsw, pid=3282] [user=lsstsw, pid=3331] ; retrying 5) Thurs 21 Aug 2014 16:31:10 http://lsst-buildx.ncsa.illinois.edu:8010/builders/DM_stack/builds/398/steps/shell/logs/stdio (master only) afwdata: 8.0.0.1+8 ........ERROR (34 sec). error building product afwdata. exit code = 9 log is in /lsst/home/lsstsw/build/afwdata/_build.log last few lines: ::::: Unable to take exclusive lock on /lsst/home/lsstsw/stack: locks are held by [user=lsstsw, pid=3333] [user=lsstsw, pid=3282] [user=lsstsw, pid=3331] ; retrying Mario intervenes and identifies fact this is eups lock and uses ~ 'eups admin clearlocks' to fix. Then Stack completes build successfully: 6) Thurs 21 Aug 2014 16:48:24 http://lsst-buildx.ncsa.illinois.edu:8010/builders/DM_stack/builds/399/steps/shell/logs/stdio (master only) anaconda: 1.8.0 (already installed). BUILD b193 completed. Then stack fails with a flock this time: 7) Thurs 21 Aug 2014 21:07:38 http://lsst-buildx.ncsa.illinois.edu:8010/builders/DM_stack/builds/400/steps/shell/logs/stdio ('u/pgee/ DM-984 u/jbosch/ DM-984 ') Rebuild is commencing....stand by; using -r u/pgee/ DM-984 -r u/jbosch/ DM-984 -r master flock: 200: No locks available a rebuild is already in process. Failed rebuild of DM stack. 8) Thurs 21 Aug 2014 22:34:13 http://lsst-buildx.ncsa.illinois.edu:8010/builders/DM_stack/builds/401/steps/shell/logs/stdio (u/pgee/ DM-984 u/jbosch/ DM-984 ) Rebuild is commencing....stand by; using -r u/pgee/ DM-984 -r u/jbosch/ DM-984 -r master flock: 200: No locks available a rebuild is already in process. Failed rebuild of DM stack. 9) Thurs 21 Aug 2014 23:29:31 http://lsst-buildx.ncsa.illinois.edu:8010/builders/DM_stack/builds/402/steps/shell/logs/stdio (u/pgee/ DM-984 u/jbosch/ DM-984 ) Rebuild is commencing....stand by; using -r u/pgee/ DM-984 -r u/jbosch/ DM-984 -r master lock: 200: No locks available a rebuild is already in process. Failed rebuild of DM stack. Now it starts working again: 10) Friday 22 Aug 2014 8:45:03 am http://lsst-buildx.ncsa.illinois.edu:8010/builders/DM_stack/builds/403/steps/shell/logs/stdio (master only) anaconda: 1.8.0 (already installed). BUILD b194 completed. The DM stack has been installed at /lsst/home/lsstsw with tag: b194.
            Hide
            robyn Robyn Allsman [X] (Inactive) added a comment -

            Random observations:
            1) These builds are presented in sequential order with none missing as noted by the 'waterfall' display. By assumption, there should have been no other activity occurring in the lsstsw build stack - SINCE ALL MODIFICATIONS TO THE BUILD STACK ARE DONE BY AND WITHIN BUILDBOT MANAGEMENT unless the buildmaster& slave have been halted. This was not the case.

            2) Perry reported that after the first flock error he tried a few more times before he gave up. There was a large overnight break between the verification run for the eups-lock problem's solution and Perry's subsequent flock problem.

            Buildbot can be made to recognize the eups-lock error exit (it did in the old buildbot) and potentially clean up after it. Buildbot should also be able to recognize the flock problem. I don't know how it can clean up after itself because it indicates that another lsst-build process has already locked access. Possibly we have parallel processes stomping on each other? But that seems to indicate that a non-buildbot process was active during Perry's first buildbot run having the flock problem.

            It would be easier if the generators of those errors used their usual error message template so the search is more targeted.

            Show
            robyn Robyn Allsman [X] (Inactive) added a comment - Random observations: 1) These builds are presented in sequential order with none missing as noted by the 'waterfall' display. By assumption, there should have been no other activity occurring in the lsstsw build stack - SINCE ALL MODIFICATIONS TO THE BUILD STACK ARE DONE BY AND WITHIN BUILDBOT MANAGEMENT unless the buildmaster& slave have been halted. This was not the case. 2) Perry reported that after the first flock error he tried a few more times before he gave up. There was a large overnight break between the verification run for the eups-lock problem's solution and Perry's subsequent flock problem. Buildbot can be made to recognize the eups-lock error exit (it did in the old buildbot) and potentially clean up after it. Buildbot should also be able to recognize the flock problem. I don't know how it can clean up after itself because it indicates that another lsst-build process has already locked access. Possibly we have parallel processes stomping on each other? But that seems to indicate that a non-buildbot process was active during Perry's first buildbot run having the flock problem. It would be easier if the generators of those errors used their usual error message template so the search is more targeted.
            Hide
            robyn Robyn Allsman [X] (Inactive) added a comment -

            From 10/24/2014:
            New occurrence of failure. Turns out it was due to NFS failures on the disk in use. However, it illustrated the 'error handling on lock' problems.

            Following is the build log output:
            _=/usr/bin/printenv
            Rebuild is commencing....stand by; using -r dummy1 -r master
            flock-fd: 200: No locks available
            a rebuild is already in process.
            The DM stack has been installed at /lsst/home/lsstsw with tag: b399.
            :LastEupsTag:ThisEupsTag: --> :b399:b399:
            Failed rebuild of DM stack.
            The following build artifacts are in directory: /lsst/home/lsstsw/build/FailedLogs/1277/
            ls: cannot access /lsst/home/lsstsw/build/FailedLogs/1277/*: No such file or directory
            program finished with exit code 1
            elapsedTime=60.606591

            Note that exit is taken. But subsequent error handling needs to recognize whether a build was initiated or not.

            Show
            robyn Robyn Allsman [X] (Inactive) added a comment - From 10/24/2014: New occurrence of failure. Turns out it was due to NFS failures on the disk in use. However, it illustrated the 'error handling on lock' problems. Following is the build log output: _=/usr/bin/printenv Rebuild is commencing....stand by; using -r dummy1 -r master flock-fd: 200: No locks available a rebuild is already in process. The DM stack has been installed at /lsst/home/lsstsw with tag: b399. :LastEupsTag:ThisEupsTag: --> :b399:b399: Failed rebuild of DM stack. The following build artifacts are in directory: /lsst/home/lsstsw/build/FailedLogs/1277/ ls: cannot access /lsst/home/lsstsw/build/FailedLogs/1277/*: No such file or directory program finished with exit code 1 elapsedTime=60.606591 Note that exit is taken. But subsequent error handling needs to recognize whether a build was initiated or not.
            Hide
            robyn Robyn Allsman [X] (Inactive) added a comment -

            I entered 'won't fix' because it's not done yet but will be covered under DM-1470.

            Show
            robyn Robyn Allsman [X] (Inactive) added a comment - I entered 'won't fix' because it's not done yet but will be covered under DM-1470 .

              People

              • Assignee:
                robyn Robyn Allsman [X] (Inactive)
                Reporter:
                robyn Robyn Allsman [X] (Inactive)
                Watchers:
                Robyn Allsman [X] (Inactive)
              • Votes:
                0 Vote for this issue
                Watchers:
                1 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Summary Panel