## Description

The jenkins-el7-1 agent has dropped below the free disk space limit several times in the last few weeks. It seems plausible that there is an issue with workspace cleanup and that this agent may need a larger volume.

Attachment: autoPopulation.png
Joshua Hoblitt added a comment -

It appears that this morning the free disk space is up to ~90GiB. I've gone ahead and updated the cleanup script to be in sync with the current DM version. Triggering a forced cleanup hit directories that can't be removed:

    .... trying to delete: /j/ws/_ts_mt_hexRot_middleware_develop
    Failed to delete /j/ws/_ts_mt_hexRot_middleware_develop: jenkins.util.io.CompositeIOException: Unable to delete '/j/ws/_ts_mt_hexRot_middleware_develop'. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts.

I am investigating.

Joshua Hoblitt added a comment -

There are many (more than a dozen) job workspaces that are owned by uid/gid 1891:1891:

    [root@jenkins-el7-1 ws]# ls -lad _ts_mt_hexRot_middleware_develop*
    drwxr-xr-x 14 1891 1891 242 Sep 30 09:45 _ts_mt_hexRot_middleware_develop
    drwxr-sr-x 2 jenkins-slave jenkins-slave 6 Sep 30 09:45 _ts_mt_hexRot_middleware_develop_tmp
    drwxr-sr-x 2 jenkins-slave jenkins-slave 6 Sep 30 09:44 _ts_mt_hexRot_middleware_develop@tmp

I've added a root crontab entry to fix up permissions once per hour:

 10 * * * * find /j/ws \! -user jenkins-slave -exec chown jenkins-slave:jenkins-slave {} \; 
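Before letting a job like this rewrite ownership unattended, it can help to preview what it would touch. The following is only a minimal sketch, assuming a POSIX or GNU `find`; the function name `list_foreign_owned` is hypothetical and is not part of the crontab entry above.

```shell
#!/bin/sh
# list_foreign_owned DIR USER
# Print every path under DIR that is NOT owned by USER.
# Dry run only: nothing is modified.
# Hypothetical helper for illustration; not part of the actual crontab.
list_foreign_owned() {
    find "$1" \! -user "$2" -print
}

# Example (path and user taken from the thread):
#   list_foreign_owned /j/ws jenkins-slave
```

If the tree is large, ending `-exec` with `+` instead of `\;` lets `find` pass many paths to each `chown` invocation rather than starting one process per file.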

Joshua Hoblitt added a comment -

Fixing the permissions got a "force cleanup" working, but there were still ~90 GiB of files under /j/ws that weren't being cleaned up, so I did a manual delete.

Joshua Hoblitt added a comment -

Free disk space for jenkins-el7-1 is now showing as 246.63 GB. I've set the free space threshold back to 50 GiB. If this becomes a problem again I will likely expand the volume.
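For spot-checking an agent from the command line, a threshold comparison along these lines could be used. This is a rough sketch, not how Jenkins itself monitors disk space; `free_gib` and `below_threshold` are hypothetical helper names, and the code assumes POSIX `df -Pk` output, where the fourth column is available space in 1024-byte blocks.

```shell
#!/bin/sh
# free_gib PATH
# Print the free space, in whole GiB, on the filesystem holding PATH.
# Hypothetical helper; relies on the POSIX `df -Pk` column layout.
free_gib() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# below_threshold PATH LIMIT_GIB
# Exit status 0 when free space on PATH is under LIMIT_GIB, non-zero otherwise.
below_threshold() {
    [ "$(free_gib "$1")" -lt "$2" ]
}

# Example (path and limit taken from the thread):
#   below_threshold /j/ws 50 && echo "free space under 50 GiB"
```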

Andy Clements added a comment -

Joshua Hoblitt  Thanks for looking into this.  Rob Bovill, Tiago Ribeiro, Te-Wei Tsai - Can someone look into this?  Is the hexRot job new?

Te-Wei Tsai added a comment - - edited

Joshua Hoblitt I am the owner of the ts_mt_hexRot_middleware repo. That repo was automatically added to the TSSW instance by the auto-population of "LSST Telescope & Site". Please help to remove it.

Thanks!

BTW, the tests for the ts_mt_hexRot_middleware repo now run on the Jenkins instance hosted by the T&S team.

Te-Wei Tsai added a comment -

To be clear, the TSSW Jenkins instance hosted by the DM team uses root for the docker image. The TSSW Jenkins instance hosted by the T&S team uses jenkinsuser (uid: 1004, not root) for the docker image. This is why you see the error here.

Joshua Hoblitt added a comment -

Te-Wei Tsai Are you saying that the job https://ts-ci.lsst.codes/blue/organizations/jenkins/lsst-ts%2Fts_mt_hexRot_middleware/branches/ should not exist? It looks like there are Jenkinsfiles at the tips of branches.

Tiago Ribeiro added a comment -

I removed all the triggers from that build. It should not run anymore.

Te-Wei Tsai added a comment -

Joshua Hoblitt The Jenkinsfile of ts_mt_hexRot_middleware is designed for the T&S team's Jenkins instance, to work around the permission issue (uid: 1004). Therefore, it will fail on the DM team's Jenkins instance. Actually, I think the auto-population of "LSST Telescope & Site" on DM's Jenkins instance may not be a good idea. Thanks!

