Details
Type: Story
Status: To Do
Resolution: Unresolved
Fix Version/s: None
Component/s: jenkins
Labels: None
Team: SQuaRE
Urgent?: No
Description
The Jenkins dind (Docker-in-Docker) containers appear to be sporadically running out of memory, causing OOM kills, exits with status 137, and JNLP4 connection failures. The evidence is that memory usage shown in monitor-ncsa.lsst.org approaches and at times hits 64 GiB, and that the exits appear in the Kubernetes container information, although the timing of the failures does not always seem to line up with the memory peaks.
Please bump up the memory allocated to these containers and redeploy the system. I believe the relevant parameters are at https://github.com/lsst-sqre/deploy-jenkins/blob/master/tf/modules/agent-ldfc/main.tf#L167-L172
96 GiB might be a good place to start?
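For reference when triaging, exit status 137 follows the common shell and container convention of 128 plus the number of the terminating signal, so 137 means signal 9 (SIGKILL), which is what the kernel OOM killer sends. The snippet below is a minimal sketch of that arithmetic only; the helper name `describe_exit_status` is hypothetical and is not part of Jenkins, Kubernetes, or the deploy-jenkins repository. A SIGKILL alone does not prove an OOM kill, so the Kubernetes container state should still be checked.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret an exit status using the 128 + signal-number convention.

    Hypothetical helper for illustration; assumes a POSIX system.
    """
    if status == 0:
        return "exited normally"
    if status > 128:
        signum = status - 128
        try:
            # Map the signal number to its symbolic name, e.g. 9 -> SIGKILL.
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not defined on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with error code {status}"


if __name__ == "__main__":
    print(describe_exit_status(137))
```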