Details
-
Type: Bug
-
Status: To Do
-
Resolution: Unresolved
-
Fix Version/s: None
-
Component/s: jenkins
-
Labels: None
-
Team: Architecture
-
Urgent?: No
Description
stack-os-matrix jobs occasionally fail with the error "eups declare: Unable to take exclusive lock on /j/ws/stack-os-matrix/[...]: locks are held by [user=jswarm, pid=...]".
The actual lock directory in the stack looks like /project/jenkins/prod/agent-ldfc-ws-1/ws/stack-os-matrix/adacff179f/lsstsw/stack/cb4e2dc/.lockDir. Removing that directory and its contents resolves the problem, but this is a manual step (although a Jenkins pipeline, sqre/infra/clean-locks, has been created so that anyone can do this cleanup without administrator intervention).
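For reference, the manual cleanup described above could be sketched roughly as follows. This is only an illustration, not the contents of the sqre/infra/clean-locks pipeline: the function name remove_lock_dirs and its dry_run parameter are hypothetical, and the only detail taken from this ticket is the .lockDir directory name. It does not check whether the process named in a lock is still alive, so it assumes no build is currently using the stack.

```python
import shutil
from pathlib import Path

# Directory name taken from the path quoted in this ticket.
LOCK_DIR_NAME = ".lockDir"


def remove_lock_dirs(stack_root, dry_run=True):
    """Find lock directories below stack_root and optionally delete them.

    Hypothetical helper: it does not verify that the lock holder has exited,
    so run it only when no job is using the stack.
    Returns the sorted list of lock directories that were found.
    """
    root = Path(stack_root)
    found = sorted(p for p in root.rglob(LOCK_DIR_NAME) if p.is_dir())
    if not dry_run:
        for lock_dir in found:
            # A nested match may already be gone if its parent was removed.
            if lock_dir.exists():
                shutil.rmtree(lock_dir)
    return found
```

Calling remove_lock_dirs(path) with the default dry_run=True only lists what would be removed; passing dry_run=False deletes the directories and their contents.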
The problem appears to be correlated with manual termination of previous jobs. The code in eups is supposed to clean up lock files and directories when interrupted, but perhaps something else is going on here.
Either fix manual job termination so that it cleans up properly, or turn off locking altogether for these jobs, if that is safe and feasible.