Details

Type: Bug
Status: Done
Resolution: Done
Fix Version/s: None
Component/s: Continuous Integration, Developer Infrastructure
Labels: None
Story Points: 0.5
Epic Link:
Team: SQuaRE
Description
It appears that the agent-ldfc-0 pod has been stuck in Terminating status for the last 12 hours.
[jhoblitt@lsst-bastion01 ~]$ kubectl -n jenkins-prod get pods
NAME           READY   STATUS        RESTARTS   AGE
agent-ldfc-0   1/3     Terminating   1          22h
agent-ldfc-1   3/3     Running       0          22h
agent-ldfc-2   3/3     Running       0          22h
agent-ldfc-3   3/3     Running       0          22h
agent-ldfc-4   3/3     Running       0          22h
agent-ldfc-5   3/3     Running       0          22h

[jhoblitt@lsst-bastion01 ~]$ kubectl -n jenkins-prod describe pod agent-ldfc-0
Name:                      agent-ldfc-0
Namespace:                 jenkins-prod
Priority:                  0
PriorityClassName:         <none>
Node:                      lsst-kub016/141.142.181.185
Start Time:                Sun, 23 Jun 2019 12:46:07 -0500
Labels:                    app.k8s.io/component=agent
                           app.k8s.io/instance=prod
                           app.k8s.io/managed-by=terraform
                           app.k8s.io/name=agent-ldfc
                           app.k8s.io/part-of=jenkins
                           app.k8s.io/version=1.0.0
                           controller-revision-hash=agent-ldfc-748c96c6f6
                           statefulset.kubernetes.io/pod-name=agent-ldfc-0
Annotations:               kubernetes.io/psp: privileged
Status:                    Terminating (lasts 12h)
Termination Grace Period:  30s
IP:                        10.47.128.37
Controlled By:             StatefulSet/agent-ldfc
Containers:
  dind:
    Container ID:  docker://f2559135a5a3b3223985c4c368bf31919b07d1c5d129b8fe389ba7e88145817a
    Image:         lsstsqre/dind:18.09.5
    Image ID:      docker-pullable://lsstsqre/dind@sha256:3ed62672a84eb9c1a3c214b4d8159d77263aaec159eecab13ab18d39d3ea4652
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/local/bin/dockerd
    Args:
      --host=tcp://localhost:2375
      --mtu=1376
    State:          Running
      Started:      Sun, 23 Jun 2019 12:53:59 -0500
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     32
      memory:  96Gi
    Requests:
      cpu:     6
      memory:  12Gi
    Liveness:   exec [wget --spider -q http://localhost:2375/_ping] delay=5s timeout=1s period=5s #success=1 #failure=2
    Readiness:  exec [wget --spider -q http://localhost:2375/_ping] delay=5s timeout=1s period=5s #success=1 #failure=2
    Environment:
      DOCKER_HOST:  tcp://localhost:2375
    Mounts:
      /j from ws (rw)
      /var/lib/docker from docker-graph-storage (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-nkqvn (ro)
  docker-gc:
    Container ID:  docker://6f7bafaeb25a28ccb842a0532d6d2173e972407dcbc6ad7a1ef679778ff98cb8
    Image:         lsstsqre/docker-gc:latest
    Image ID:      docker-pullable://lsstsqre/docker-gc@sha256:83e4a8b52eef27ebf1b469e294ffe6905fc4e5de33b30d591811c8ec802c292c
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      while true; do /usr/local/bin/docker-gc; sleep $GRACE_PERIOD_SECONDS; done
    State:          Running
      Started:      Sun, 23 Jun 2019 12:54:14 -0500
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     500m
      memory:  512Mi
    Requests:
      cpu:     200m
      memory:  100Mi
    Environment:
      DOCKER_HOST:              tcp://localhost:2375
      GRACE_PERIOD_SECONDS:     3600
      MINIMUM_IMAGES_TO_SAVE:   5
      REMOVE_VOLUMES:           1
      FORCE_CONTAINER_REMOVAL:  1
      FORCE_IMAGE_REMOVAL:      1
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-nkqvn (ro)
  swarm:
    Container ID:  docker://2190d76765e077023b8d1dc284a68295aec7417a6082a401ad4d3fdbb9a96d81
    Image:         lsstsqre/jenkins-swarm-client:3.15-ldfc
    Image ID:      docker-pullable://lsstsqre/jenkins-swarm-client@sha256:8dfa62ed9c4323982e552c353b040701b8051c781df26f9bd77302ff58ea2f76
    Port:          <none>
    Host Port:     <none>
    State:          Running
      Started:      Sun, 23 Jun 2019 12:55:07 -0500
    Ready:          False
    Restart Count:  1
    Limits:
      cpu:     2
      memory:  3Gi
    Requests:
      cpu:     1
      memory:  2Gi
    Liveness:   exec [wget --spider -q http://localhost:8080/metrics] delay=5s timeout=1s period=5s #success=1 #failure=2
    Readiness:  exec [wget --spider -q http://localhost:8080/metrics] delay=5s timeout=1s period=5s #success=1 #failure=2
    Environment:
      DOCKER_HOST:                       tcp://localhost:2375
      JSWARM_MASTER_URL:                 https://ci.lsst.codes
      JSWARM_MODE:                       normal
      JSWARM_LABELS:                     docker ldfc
      JSWARM_EXECUTORS:                  1
      JSWARM_AGENT_NAME:                 agent-ldfc-0 (v1:metadata.name)
      JSWARM_DISABLE_CLIENTS_UNIQUE_ID:  true
      JSWARM_DELETE_EXISTING_CLIENTS:    true
      JSWARM_USERNAME:                   <set to the key 'JSWARM_USERNAME' in secret 'agent-ldfc'>  Optional: false
      JSWARM_PASSWORD:                   <set to the key 'JSWARM_PASSWORD' in secret 'agent-ldfc'>  Optional: false
    Mounts:
      /j from ws (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-nkqvn (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  ws:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  ws-agent-ldfc-0
    ReadOnly:   false
  docker-graph-storage:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  default-token-nkqvn:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-nkqvn
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>
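Had manual cleanup been required, the usual workaround for a pod wedged in Terminating on an unresponsive node is to force-delete the pod record from the API server. A sketch of that workaround follows; it was not run as part of this ticket, and the pod name and namespace are simply taken from the output above:

# Confirm the pod is still wedged and which node it is bound to
kubectl -n jenkins-prod get pod agent-ldfc-0 -o wide

# Remove the pod object without waiting for the (unreachable) kubelet to
# confirm container shutdown. Reasonable here only because the StatefulSet
# controller will recreate agent-ldfc-0 afterwards.
kubectl -n jenkins-prod delete pod agent-ldfc-0 --grace-period=0 --force

Note that force deletion only removes the API object; any containers still running on the wedged node are cleaned up when its kubelet reconnects.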
I'm going to consider this resolved, as kub016 did eventually come back online and there have been no further reports of problems since the CPU core request and limit were made equivalent.
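For context on that fix: setting a container's CPU request equal to its limit stops the scheduler from overcommitting cores on the node (the dind container above requested 6 cores but was allowed to burst to 32). A minimal sketch of what such a change could look like, assuming the dind container (index 0 in the pod spec) is the one adjusted and using its existing request of 6 as the target value; the ticket records neither the chosen value nor the exact mechanism, which was presumably the Terraform config rather than a live patch:

# Illustrative only: lower the dind cpu limit to match its request so
# request == limit. Patching the StatefulSet template triggers a rolling
# restart of the agent pods.
kubectl -n jenkins-prod patch statefulset agent-ldfc --type=json \
  -p '[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/cpu", "value": "6"}]'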