Data Management / DM-22333

tssw jenkins agent jenkins-el7-1 running out of disk space

    Details

      Description

      The jenkins-el7-1 agent has dropped below the free disk space threshold several times in the last few weeks. It seems plausible that there is an issue with workspace cleanup and that this agent may need a larger volume.

          Activity

          jhoblitt Joshua Hoblitt added a comment -

          It appears that this morning the free disk space is up to ~90GiB. I've gone ahead and updated the cleanup script to be in sync with the current DM version. Triggering a forced cleanup hit directories that can't be removed:

          .... trying to delete: /j/ws/_ts_mt_hexRot_middleware_develop
          Failed to delete /j/ws/_ts_mt_hexRot_middleware_develop: jenkins.util.io.CompositeIOException: Unable to delete '/j/ws/_ts_mt_hexRot_middleware_develop'. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts.
          

          I am investigating.

          jhoblitt Joshua Hoblitt added a comment -

          There are many (more than a dozen) job workspaces that are owned by uid/gid 1891:1891:

          [root@jenkins-el7-1 ws]# ls -lad _ts_mt_hexRot_middleware_develop*
          drwxr-xr-x 14          1891          1891 242 Sep 30 09:45 _ts_mt_hexRot_middleware_develop
          drwxr-sr-x  2 jenkins-slave jenkins-slave   6 Sep 30 09:45 _ts_mt_hexRot_middleware_develop_tmp
          drwxr-sr-x  2 jenkins-slave jenkins-slave   6 Sep 30 09:44 _ts_mt_hexRot_middleware_develop@tmp
          

          I've added a root crontab entry to fix up permissions once per hour:

          10 * * * * find /j/ws \! -user jenkins-slave -exec chown jenkins-slave:jenkins-slave {} \;
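
The cron entry above changes ownership unconditionally. A read-only version of the same search can help review what would be touched before running chown as root. The following is a minimal Python sketch, assuming a POSIX filesystem; the function name `find_foreign_paths` is hypothetical and not part of this ticket.

```python
import os


def find_foreign_paths(root, expected_uid):
    """Return every path at or under `root` whose owner is not `expected_uid`.

    Uses lstat so symlinks are judged by their own owner, not their target's.
    Directories that cannot be read are silently skipped by os.walk.
    """
    foreign = []
    if os.lstat(root).st_uid != expected_uid:
        foreign.append(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.lstat(path).st_uid != expected_uid:
                foreign.append(path)
    return sorted(foreign)
```

Assuming the jenkins-slave account exists on the host, `find_foreign_paths("/j/ws", pwd.getpwnam("jenkins-slave").pw_uid)` should list roughly the same entries that the `find /j/ws \! -user jenkins-slave` expression matches, without modifying anything.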
          

          jhoblitt Joshua Hoblitt added a comment -

          Fixing the permissions got a "force cleanup" working, but there was still ~90 GiB of file space under /j/ws that wasn't being cleaned up, so I did a manual delete.

          jhoblitt Joshua Hoblitt added a comment -

          Free disk space for jenkins-el7-1 is now showing as 246.63 GB. I've set the free space threshold back to 50 GiB. If this becomes a problem again I will likely expand the volume.
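
A free-space check like the 50 GiB threshold mentioned here can be approximated outside Jenkins with the standard library. This is only a sketch of the idea, not how Jenkins implements its disk space monitor; the function names and the default path handling are assumptions.

```python
import shutil

GIB = 1024 ** 3  # bytes per GiB


def is_below_threshold(free_bytes, threshold_gib):
    """Return True when the free space is under the threshold (in GiB)."""
    return free_bytes < threshold_gib * GIB


def free_gib(path):
    """Free space on the filesystem holding `path`, in GiB."""
    return shutil.disk_usage(path).free / GIB


def agent_needs_attention(path, threshold_gib=50):
    """Combine the two helpers for a single workspace volume."""
    return is_below_threshold(shutil.disk_usage(path).free, threshold_gib)
```

On the agent, `agent_needs_attention("/j/ws")` would report whether the workspace volume has dropped under 50 GiB. Note that the sketch works in GiB (2^30 bytes), which is not the same unit as a decimal GB.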

          aclements Andy Clements added a comment -

          Joshua Hoblitt, thanks for looking into this. Rob Bovill, Tiago Ribeiro, Te-Wei Tsai: can someone look into this? Is the hexRot job new?

          ttsai Te-Wei Tsai added a comment - - edited

          Joshua Hoblitt I am the owner of the ts_mt_hexRot_middleware repo. That repo was automatically added to the TSSW instance by the auto-population of "LSST Telescope & Site". Please help remove it.

          Thanks!

          BTW, the tests for the ts_mt_hexRot_middleware repo now run on the Jenkins instance operated by the T&S team.

          ttsai Te-Wei Tsai added a comment -

          To be clear, the TSSW Jenkins instance hosted by the DM team runs the docker image as root. The TSSW Jenkins instance hosted by the T&S team runs the docker image as jenkinsuser (uid: 1004, not root). This is why you see the error here.
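
The failed cleanup earlier in this thread is consistent with ordinary POSIX directory permissions: removing entries from a directory requires write and search permission on that directory, and the workspace in the listing is `drwxr-xr-x` owned by 1891:1891. The sketch below is a simplified model of that rule (it ignores ACLs, the sticky bit, and capabilities). The function name and the agent uid/gid of 1000 are hypothetical; the ticket does not state the numeric uid of jenkins-slave.

```python
import stat


def can_delete_entries(dir_mode, dir_uid, dir_gid, uid, gids):
    """Simplified check: may `uid` remove entries from this directory?

    Picks the owner, group, or other permission class the way POSIX does,
    then requires both the write and the search (execute) bit.
    """
    if uid == 0:
        return True  # root bypasses the mode bits
    if uid == dir_uid:
        needed = stat.S_IWUSR | stat.S_IXUSR
    elif dir_gid in gids:
        needed = stat.S_IWGRP | stat.S_IXGRP
    else:
        needed = stat.S_IWOTH | stat.S_IXOTH
    return (dir_mode & needed) == needed


# drwxr-xr-x owned by 1891:1891, as in the listing earlier in this thread.
WORKSPACE_MODE = 0o755
# Hypothetical non-root agent account (uid/gid 1000 is an assumption).
print(can_delete_entries(WORKSPACE_MODE, 1891, 1891, uid=1000, gids={1000}))  # False
print(can_delete_entries(WORKSPACE_MODE, 1891, 1891, uid=0, gids={0}))        # True
```

Under this model, a non-root agent account falls into the "other" class for the 1891-owned directory and cannot empty it, which matches the observation that the forced cleanup only worked after ownership was fixed.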

          jhoblitt Joshua Hoblitt added a comment -

          Te-Wei Tsai Are you saying that the job https://ts-ci.lsst.codes/blue/organizations/jenkins/lsst-ts%2Fts_mt_hexRot_middleware/branches/ should not exist? It looks like there are Jenkinsfiles at the tips of branches.

          tribeiro Tiago Ribeiro added a comment -

          I removed all the triggers from that build. It should not run anymore. 

          ttsai Te-Wei Tsai added a comment -

          Joshua Hoblitt The Jenkinsfile of ts_mt_hexRot_middleware is designed for the T&S team's Jenkins instance, where it works around the permission issue (uid: 1004). It will therefore fail on the DM team's Jenkins instance. Actually, I think auto-populating "LSST Telescope & Site" on DM's Jenkins instance may not be a good idea. Thanks!


            People

            • Assignee: Unassigned
            • Reporter: Joshua Hoblitt (jhoblitt)
            • Watchers: Andy Clements, Joshua Hoblitt, Te-Wei Tsai, Tiago Ribeiro
