  1. Data Management
  2. DM-11161

unhandled exception in jenkins-node-cleanup job

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: Continuous Integration
    • Labels:
      None

      Description

      Build #206 marked jenkins-el7-5 as offline for cleanup, then threw an uncaught exception when a file could not be deleted for unknown reasons. Subsequent builds (#207+), which are supposed to detect nodes that the cleanup script unintentionally left offline, failed to return the node to service.

      https://ci.lsst.codes/job/sqre/job/infrastructure/job/jenkins-node-cleanup/206/consoleText

      Started by timer
      [EnvInject] - Loading node environment variables.
      Building remotely on jenkins-master (swarm) in workspace /home/jenkins-slave/workspace/sqre/infrastructure/jenkins-node-cleanup
      found elcapitan-1
      Node elcapitan-1 is offline
      found elcapitan-2
      Node elcapitan-2 is offline
      found elcapitan-3
      Node elcapitan-3 is offline
      found jenkins-el6-1
      node: jenkins-el6-1, free space: 434GB. Idle: true
      Skipping jenkins-el6-1 based on disk threshhold
      found jenkins-el6-2
      node: jenkins-el6-2, free space: 172GB. Idle: true
      Skipping jenkins-el6-2 based on disk threshhold
      found jenkins-el7-1
      node: jenkins-el7-1, free space: 523GB. Idle: true
      Skipping jenkins-el7-1 based on disk threshhold
      found jenkins-el7-2
      node: jenkins-el7-2, free space: 441GB. Idle: false
      Skipping jenkins-el7-2 based on disk threshhold
      found jenkins-el7-3
      node: jenkins-el7-3, free space: 811GB. Idle: true
      Skipping jenkins-el7-3 based on disk threshhold
      found jenkins-el7-4
      node: jenkins-el7-4, free space: 1128GB. Idle: false
      Skipping jenkins-el7-4 based on disk threshhold
      found jenkins-el7-5
      node: jenkins-el7-5, free space: 94GB. Idle: true
      Failed to delete /home/jenkins-slave/workspace: java.io.IOException: remote file operation failed: /home/jenkins-slave/workspace at hudson.remoting.Channel@17506ed0:jenkins-el7-5: java.io.IOException: Unable to delete '/home/jenkins-slave/workspace/infrastructure/update-cmirror/local_mirror/linux-64/_license-1.1-py27_0.tar.bz2'. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts.
      Error with jenkins-el7-5: groovy.lang.MissingPropertyException: No such property: allJobs for class: Script1
      found jenkins-master
      node: jenkins-master, free space: 479GB. Idle: false
      Skipping jenkins-master based on disk threshhold
      found lsst-dev
      Skipping lsst-dev based on labels
      found sierra-1
      node: sierra-1, free space: 153GB. Idle: true
      Skipping sierra-1 based on disk threshhold
      found sierra-2
      Node sierra-2 is offline
      found sierra-3
      node: sierra-3, free space: 160GB. Idle: true
      Skipping sierra-3 based on disk threshhold
      ### SUMMARY
      	ERRORS with: jenkins-el7-5
      	ERRORS with: jenkins-el7-5
      	Offline: elcapitan-1
      	Offline: elcapitan-2
      	Offline: elcapitan-3
      	Offline: sierra-2
      	Skipped: jenkins-el6-1
      	Skipped: jenkins-el6-2
      	Skipped: jenkins-el7-1
      	Skipped: jenkins-el7-2
      	Skipped: jenkins-el7-3
      	Skipped: jenkins-el7-4
      	Skipped: jenkins-master
      	Skipped: lsst-dev
      	Skipped: sierra-1
      	Skipped: sierra-3
      FATAL: assert failedNodes.size() == 0
             |           |      |
             |           2      false
             [hudson.plugins.swarm.SwarmSlave[jenkins-el7-5], hudson.plugins.swarm.SwarmSlave[jenkins-el7-5]]
      Assertion failed: 
       
      assert failedNodes.size() == 0
             |           |      |
             |           2      false
             [hudson.plugins.swarm.SwarmSlave[jenkins-el7-5], hudson.plugins.swarm.SwarmSlave[jenkins-el7-5]]
       
      	at org.codehaus.groovy.runtime.InvokerHelper.assertFailed(InvokerHelper.java:402)
      	at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.assertFailed(ScriptBytecodeAdapter.java:650)
      	at Script1.run(Script1.groovy:190)
      	at groovy.lang.GroovyShell.evaluate(GroovyShell.java:585)
      	at groovy.lang.GroovyShell.evaluate(GroovyShell.java:623)
      	at groovy.lang.GroovyShell.evaluate(GroovyShell.java:594)
      	at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SecureGroovyScript.evaluate(SecureGroovyScript.java:168)
      	at hudson.plugins.groovy.SystemGroovy.run(SystemGroovy.java:95)
      	at hudson.plugins.groovy.SystemGroovy.perform(SystemGroovy.java:59)
      	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
      	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:779)
      	at hudson.model.Build$BuildExecution.build(Build.java:205)
      	at hudson.model.Build$BuildExecution.doRun(Build.java:162)
      	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:534)
      	at hudson.model.Run.execute(Run.java:1720)
      	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
      	at hudson.model.ResourceController.execute(ResourceController.java:98)
      	at hudson.model.Executor.run(Executor.java:404)
      Finished: FAILURE
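
The failure mode in build #206 (node taken offline, delete fails, node never brought back) is the kind of problem a try/finally guard is meant to prevent. The following is a minimal sketch in Python rather than the job's Groovy, and it is not the actual cleanup script: the `Node` class, `CLEANUP_REASON`, `clean_node`, and `failing_wipe` are all hypothetical names introduced for illustration. It also collects failures in a set, which would report a node once rather than twice as in the summary above.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a Jenkins agent; not the real Jenkins API."""
    name: str
    offline: bool = False
    offline_reason: str = ""


CLEANUP_REASON = "disk cleanup"  # hypothetical marker text, an assumption


def clean_node(node, wipe_workspace):
    """Take a node offline, wipe it, and always bring it back online.

    Returns None on success, or the exception that interrupted the wipe.
    """
    node.offline = True
    node.offline_reason = CLEANUP_REASON
    try:
        wipe_workspace(node)
        return None
    except Exception as exc:  # record the failure instead of propagating it
        return exc
    finally:
        # Runs on success and on failure, so the node is never stranded offline.
        node.offline = False
        node.offline_reason = ""


def failing_wipe(node):
    """Simulates a workspace delete that cannot remove a file."""
    raise OSError(f"Unable to delete a file on {node.name}")


node = Node("jenkins-el7-5")
error = clean_node(node, failing_wipe)
failed = {node.name} if error else set()  # a set holds each node name once
```

With this shape the delete error is still reported through `failed`, but the node's online state no longer depends on the wipe succeeding.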
      

      https://ci.lsst.codes/job/sqre/job/infrastructure/job/jenkins-node-cleanup/207/consoleText

      Started by timer
      [EnvInject] - Loading node environment variables.
      Building remotely on jenkins-master (swarm) in workspace /home/jenkins-slave/workspace/sqre/infrastructure/jenkins-node-cleanup
      found elcapitan-1
      Node elcapitan-1 is offline
      found elcapitan-2
      Node elcapitan-2 is offline
      found elcapitan-3
      Node elcapitan-3 is offline
      found jenkins-el6-1
      node: jenkins-el6-1, free space: 434GB. Idle: true
      Skipping jenkins-el6-1 based on disk threshhold
      found jenkins-el6-2
      node: jenkins-el6-2, free space: 172GB. Idle: false
      Skipping jenkins-el6-2 based on disk threshhold
      found jenkins-el7-1
      node: jenkins-el7-1, free space: 521GB. Idle: false
      Skipping jenkins-el7-1 based on disk threshhold
      found jenkins-el7-2
      node: jenkins-el7-2, free space: 443GB. Idle: false
      Skipping jenkins-el7-2 based on disk threshhold
      found jenkins-el7-3
      node: jenkins-el7-3, free space: 811GB. Idle: false
      Skipping jenkins-el7-3 based on disk threshhold
      found jenkins-el7-4
      node: jenkins-el7-4, free space: 1120GB. Idle: false
      Skipping jenkins-el7-4 based on disk threshhold
      found jenkins-el7-5
      node: jenkins-el7-5, free space: 1070GB. Idle: true
      Skipping jenkins-el7-5 based on disk threshhold
      found jenkins-master
      node: jenkins-master, free space: 479GB. Idle: false
      Skipping jenkins-master based on disk threshhold
      found lsst-dev
      Skipping lsst-dev based on labels
      found sierra-1
      node: sierra-1, free space: 153GB. Idle: true
      Skipping sierra-1 based on disk threshhold
      found sierra-2
      Node sierra-2 is offline
      found sierra-3
      node: sierra-3, free space: 160GB. Idle: true
      Skipping sierra-3 based on disk threshhold
      ### SUMMARY
      	Offline: elcapitan-1
      	Offline: elcapitan-2
      	Offline: elcapitan-3
      	Offline: sierra-2
      	Skipped: jenkins-el6-1
      	Skipped: jenkins-el6-2
      	Skipped: jenkins-el7-1
      	Skipped: jenkins-el7-2
      	Skipped: jenkins-el7-3
      	Skipped: jenkins-el7-4
      	Skipped: jenkins-el7-5
      	Skipped: jenkins-master
      	Skipped: lsst-dev
      	Skipped: sierra-1
      	Skipped: sierra-3
      Finished: SUCCESS
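
The description says later builds are supposed to detect nodes that the cleanup script left offline. One way such a recovery pass could work is sketched below, again in Python and under stated assumptions: the `Node` class, the `CLEANUP_REASON` marker, and the node names other than jenkins-el7-5 are hypothetical, and the real script may identify its own offline nodes differently. The idea is to bring back only nodes whose offline reason matches the marker the cleanup job itself sets, leaving nodes that are offline for other reasons untouched.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a Jenkins agent; not the real Jenkins API."""
    name: str
    offline: bool = False
    offline_reason: str = ""


CLEANUP_REASON = "disk cleanup"  # hypothetical marker text, an assumption


def recover_stranded(nodes):
    """Bring back nodes a previous cleanup run left offline; return their names."""
    recovered = []
    for node in nodes:
        # Only touch nodes that this job took offline itself.
        if node.offline and node.offline_reason == CLEANUP_REASON:
            node.offline = False
            node.offline_reason = ""
            recovered.append(node.name)
    return recovered


nodes = [
    Node("jenkins-el7-5", offline=True, offline_reason=CLEANUP_REASON),
    Node("example-offline-node", offline=True, offline_reason="some other reason"),
    Node("example-online-node"),
]
recovered = recover_stranded(nodes)
```

Matching on an explicit marker keeps the recovery pass from re-enabling nodes that an administrator disabled on purpose.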
      

            Activity

            jhoblitt Joshua Hoblitt added a comment - - edited

            The delete error was valid. It is a side effect of the conda mirror job running in a container and writing to a bind-mounted volume, so the UID of the files it creates does not match that of the jenkins agent role user.

            [jenkins-slave@jenkins-el7-5 ~]$ ls -la /home/jenkins-slave/workspace/infrastructure/update-cmirror/local_mirror/linux-64/_license-1.1-py27_0.tar.bz2
            -rw-r--r-- 1 centos centos 195012 Apr  3 05:44 /home/jenkins-slave/workspace/infrastructure/update-cmirror/local_mirror/linux-64/_license-1.1-py27_0.tar.bz2
            

            That job needs to have the uid mapping fixed as well.
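
A quick way to spot this kind of ownership mismatch before a cleanup run is to scan the workspace for entries owned by a different UID. The sketch below is an illustration only, assuming a Unix host; `files_not_owned_by` is a hypothetical helper and the temporary directory stands in for the real workspace path.

```python
import os
import tempfile


def files_not_owned_by(root, uid):
    """Return paths under root whose owner UID differs from uid."""
    mismatched = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Check directories too: removing a file needs write access to its parent.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.lstat(path).st_uid != uid:
                mismatched.append(path)
    return sorted(mismatched)


with tempfile.TemporaryDirectory() as workspace:
    # Create one file owned by the current user.
    open(os.path.join(workspace, "example.tar.bz2"), "w").close()
    own = files_not_owned_by(workspace, os.getuid())
    # Pretend the agent runs as a different UID to show a mismatch.
    foreign = files_not_owned_by(workspace, os.getuid() + 1)
    foreign_names = [os.path.basename(p) for p in foreign]
```

On the agent, the same function could be pointed at the workspace directory with the agent user's UID to list entries that the cleanup job is unlikely to be able to remove.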

            jhoblitt Joshua Hoblitt added a comment -

            This was deployed to production a week ago and appears to be working as intended.


              People

              Assignee:
              jhoblitt Joshua Hoblitt
              Reporter:
              jhoblitt Joshua Hoblitt
              Watchers:
              Joshua Hoblitt