Details
- Type: Story
- Status: Done
- Resolution: Done
- Fix Version/s: None
- Component/s: None
- Labels: None
- Story Points: 1.4
- Epic Link:
- Team: SQuaRE
- Urgent?: Yes
Description
Expired k3s certificates at the Summit EFD. The k3s documentation says certificates should rotate automatically if k3s is restarted less than 90 days before they expire. We certainly restarted k3s after the Summit power up, but the certificates did not rotate. The k3s certificates expired on Nov 25 and services like Chronograf became unreachable. The problem was reported by Michael R. on Nov 30, after the holiday break.
time="2020-11-30T18:46:01.762244488Z" level=info msg="Starting k3s v0.5.0-rc1 (41520d73)"
time="2020-11-30T18:46:01.764015424Z" level=info msg="Running kube-apiserver --service-account-signing-key-file=/var/lib/rancher/k3s/server/tls/service.key --service-cluster-ip-range=10.43.0.0/16 --insecure-port=0 --requestheader-allowed-names=kubernetes-proxy --proxy-client-key-file=/var/lib/rancher/k3s/server/tls/client-auth-proxy.key --requestheader-extra-headers-prefix=X-Remote-Extra- --allow-privileged=true --tls-private-key-file=/var/lib/rancher/k3s/server/tls/localhost.key --service-account-issuer=k3s --kubelet-client-key=/var/lib/rancher/k3s/server/tls/token-node.key --requestheader-client-ca-file=/var/lib/rancher/k3s/server/tls/request-header-ca.crt --requestheader-group-headers=X-Remote-Group --requestheader-username-headers=X-Remote-User --advertise-address=127.0.0.1 --service-account-key-file=/var/lib/rancher/k3s/server/tls/service.key --basic-auth-file=/var/lib/rancher/k3s/server/cred/passwd --proxy-client-cert-file=/var/lib/rancher/k3s/server/tls/client-auth-proxy.crt --watch-cache=false --cert-dir=/var/lib/rancher/k3s/server/tls/temporary-certs --authorization-mode=Node,RBAC --advertise-port=6445 --secure-port=6444 --bind-address=127.0.0.1 --tls-cert-file=/var/lib/rancher/k3s/server/tls/localhost.crt --api-audiences=unknown --kubelet-client-certificate=/var/lib/rancher/k3s/server/tls/token-node.crt"
time="2020-11-30T18:46:01.842459907Z" level=info msg="Running kube-scheduler --port=10251 --bind-address=127.0.0.1 --secure-port=0 --kubeconfig=/var/lib/rancher/k3s/server/cred/kubeconfig-system.yaml --leader-elect=false"
time="2020-11-30T18:46:01.843294136Z" level=info msg="Running kube-controller-manager --port=10252 --secure-port=0 --service-account-private-key-file=/var/lib/rancher/k3s/server/tls/service.key --cluster-cidr=10.42.0.0/16 --allocate-node-cidrs=true --bind-address=127.0.0.1 --kubeconfig=/var/lib/rancher/k3s/server/cred/kubeconfig-system.yaml --root-ca-file=/var/lib/rancher/k3s/server/tls/token-ca.crt --leader-elect=false"
2020/11/30 18:46:01 http: TLS handshake error from 127.0.0.1:36548: remote error: tls: bad certificate
panic: creating CRD store Get https://localhost:6444/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions: x509: certificate has expired or is not yet valid
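For reference, the expiration date of any of the certificates named in the log above can be confirmed from the host by copying the file out of the container and inspecting it with openssl; a minimal sketch (the /tmp destination is just an example):
# copy one of the k3s server certificates out of the "master" container
docker cp master:/var/lib/rancher/k3s/server/tls/localhost.crt /tmp/localhost.crt
# print its validity window; on Nov 30 this shows a notAfter date of Nov 25
openssl x509 -noout -dates -in /tmp/localhost.crt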
I tried restarting the k3s docker container:
docker restart master
but it failed to restart.
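The ticket doesn't record how the failure was inspected; the error messages above were presumably pulled from the container logs with something like:
# show the most recent log lines from the k3s container
docker logs --tail 50 master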
To get the k3s container to start, and to be able to exec into it, I set the server clock back to before the certificate expiration (Oct 1):
sudo timedatectl set-ntp off
sudo date --set="2020-10-01 00:00:00.000"
docker exec -it master /bin/sh
At this point I confirmed that the certificates had not rotated after restarting k3s. That seems to be related to a k3s bug fixed recently.
I was then able to copy the certificates to the host as a backup, and I tried deleting one of them (/var/lib/rancher/k3s/server/tls/client-auth-proxy.crt) to see if k3s would recreate it after a restart. That worked:
/var/lib/rancher/k3s/server/tls # ls -lh
total 52K
-rw-r--r-- 1 0 0 1.1K Nov 25 2019 ca.crt
-rw------- 1 0 0 1.7K Nov 25 2019 ca.key
-rw-r--r-- 1 0 0 2.2K Oct  1 09:41 client-auth-proxy.crt
-rw------- 1 0 0 1.7K Oct  1 09:41 client-auth-proxy.key
-rw-r--r-- 1 0 0 2.2K Nov 25 2019 localhost.crt
-rw------- 1 0 0 1.7K Nov 25 2019 localhost.key
-rw-r--r-- 1 0 0 1.1K Nov 25 2019 request-header-ca.crt
-rw------- 1 0 0 1.7K Nov 25 2019 request-header-ca.key
-rw------- 1 0 0 1.7K Nov 25 2019 service.key
drwx------ 2 0 0   84 Nov 25 2019 temporary-certs
-rw-r--r-- 1 0 0 1.1K Nov 25 2019 token-ca.crt
-rw------- 1 0 0 1.7K Nov 25 2019 token-ca.key
-rw-r--r-- 1 0 0 2.3K Nov 25 2019 token-node.crt
-rw------- 1 0 0 1.7K Nov 25 2019 token-node.key
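The exact backup and deletion commands aren't recorded in the ticket; they were roughly along these lines (the backup destination is just an example, and deleting the matching key as well is an assumption based on the Oct 1 timestamps in the listing above):
# back up the whole TLS directory from the container to the host
docker cp master:/var/lib/rancher/k3s/server/tls /root/k3s-tls-backup
# delete the client-auth-proxy cert (and, presumably, its key), then restart k3s
docker exec master rm /var/lib/rancher/k3s/server/tls/client-auth-proxy.crt /var/lib/rancher/k3s/server/tls/client-auth-proxy.key
docker restart master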
I then decided to delete the other certificates to force k3s to recreate them. But after that I got a "bad certificate" error and the k3s container would not start anymore (presumably because regenerating the CA certificates invalidated everything issued under the old CA):
time="2020-10-01T09:47:20.785414474Z" level=info msg="Starting k3s v0.5.0-rc1 (41520d73)"
|
time="2020-10-01T09:47:20.787377725Z" level=info msg="Running kube-apiserver --kubelet-client-key=/var/lib/rancher/k3s/server/tls/token-node.key --requestheader-client-ca-file=/var/lib/rancher/k3s/server/tls/request-header-ca.crt --requestheader-allowed-names=kubernetes-proxy --proxy-client-key-file=/var/lib/rancher/k3s/server/tls/client-auth-proxy.key --requestheader-group-headers=X-Remote-Group --cert-dir=/var/lib/rancher/k3s/server/tls/temporary-certs --allow-privileged=true --service-cluster-ip-range=10.43.0.0/16 --insecure-port=0 --tls-private-key-file=/var/lib/rancher/k3s/server/tls/localhost.key --service-account-issuer=k3s --basic-auth-file=/var/lib/rancher/k3s/server/cred/passwd --kubelet-client-certificate=/var/lib/rancher/k3s/server/tls/token-node.crt --requestheader-extra-headers-prefix=X-Remote-Extra- --watch-cache=false --authorization-mode=Node,RBAC --service-account-signing-key-file=/var/lib/rancher/k3s/server/tls/service.key --secure-port=6444 --bind-address=127.0.0.1 --proxy-client-cert-file=/var/lib/rancher/k3s/server/tls/client-auth-proxy.crt --advertise-port=6445 --advertise-address=127.0.0.1 --tls-cert-file=/var/lib/rancher/k3s/server/tls/localhost.crt --service-account-key-file=/var/lib/rancher/k3s/server/tls/service.key --api-audiences=unknown --requestheader-username-headers=X-Remote-User"
|
time="2020-10-01T09:47:20.877849959Z" level=info msg="Running kube-scheduler --port=10251 --bind-address=127.0.0.1 --secure-port=0 --kubeconfig=/var/lib/rancher/k3s/server/cred/kubeconfig-system.yaml --leader-elect=false"
|
time="2020-10-01T09:47:20.878168722Z" level=info msg="Running kube-controller-manager --allocate-node-cidrs=true --port=10252 --bind-address=127.0.0.1 --secure-port=0 --cluster-cidr=10.42.0.0/16 --leader-elect=false --kubeconfig=/var/lib/rancher/k3s/server/cred/kubeconfig-system.yaml --service-account-private-key-file=/var/lib/rancher/k3s/server/tls/service.key --root-ca-file=/var/lib/rancher/k3s/server/tls/token-ca.crt"
|
2020/10/01 09:47:20 http: TLS handshake error from 127.0.0.1:39710: remote error: tls: bad certificate
|
panic: creating CRD store Get https://localhost:6444/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions: x509: certificate signed by unknown authority
|
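One way to see that kind of mismatch directly is to verify a regenerated certificate against the CA saved in the earlier backup; a sketch, assuming localhost.crt is issued by the cluster CA (ca.crt) in this k3s version and reusing the example backup path from above:
# copy the regenerated serving certificate out of the container
docker cp master:/var/lib/rancher/k3s/server/tls/localhost.crt /tmp/localhost.crt
# verify it against the *old* CA from the backup; after the CA was regenerated
# this is expected to fail verification
openssl verify -CAfile /root/k3s-tls-backup/ca.crt /tmp/localhost.crt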
I found this reference that explains how to recreate the certs manually, but it suggests that I would hit the same "bad certificate" error, since there are still pods running with the old certificates.
At this point it seemed safer to redeploy k3s and then the EFD. Note that we are planning to migrate the Summit EFD to Andes, a proper Kubernetes cluster at the Summit; however, since AuxTel operations will resume soon, the fastest path was to restore the Summit EFD still using k3s.
I tried my instructions in SQR-031 to deploy k3s, this time using the latest release:
export K3S_IMAGE=rancher/k3s
export K3S_TAG=v1.19.4-k3s1
export K3S_PORT=6443
export K3S_URL=https://139.229.160.30:${K3S_PORT}
export K3S_TOKEN=$(date | base64)
export HOST_PATH=/data # change depending on your host
export CONTAINER_PATH=/opt/local-path-provisioner
sudo docker run -d --restart always --tmpfs /run --tmpfs /var/run --volume ${HOST_PATH}:${CONTAINER_PATH} -e K3S_URL=${K3S_URL} -e K3S_TOKEN=${K3S_TOKEN} --privileged --network host --name efd docker.io/${K3S_IMAGE}:${K3S_TAG} server --https-listen-port ${K3S_PORT} --no-deploy traefik
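The container status and startup messages (shown below) can be checked with, for example:
# check whether the new k3s container stayed up, and read its startup logs
sudo docker ps -a --filter name=efd
sudo docker logs --tail 20 efd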
That didn't work; the k3s container didn't start:
time="2020-11-30T22:00:55.477185589Z" level=info msg="Starting k3s v1.19.4+k3s1 (2532c10f)"
time="2020-11-30T22:00:55.478646128Z" level=fatal msg="starting kubernetes: preparing server: failed to get CA certs: Get \"https://139.229.160.30:6443/cacerts\": dial tcp 139.229.160.30:6443: connect: connection refused"
I had no time to investigate this error further, so I decided to go back to the version of k3s that was running before (K3S_TAG=v0.5.0).
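In practice that meant removing the failed container and re-running the same docker run as above with the older tag; a sketch, reusing the environment variables defined earlier:
# remove the failed attempt and redeploy with the previously running k3s version
sudo docker rm -f efd
export K3S_TAG=v0.5.0
sudo docker run -d --restart always --tmpfs /run --tmpfs /var/run --volume ${HOST_PATH}:${CONTAINER_PATH} -e K3S_URL=${K3S_URL} -e K3S_TOKEN=${K3S_TOKEN} --privileged --network host --name efd docker.io/${K3S_IMAGE}:${K3S_TAG} server --https-listen-port ${K3S_PORT} --no-deploy traefik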
I then followed the steps in SQR-031 to deploy the local-path storage provisioner, install Argo CD, create the initial vault-secrets-operator secret, and bootstrap the EFD app.
After everything was synced, I moved the data from the old deployment to the new one:
InfluxDB
[afausti@efd-temp-k3s ~]$ sudo cp -r /data/pvc-9e6b3c05-2358-11ea-894b-d09466627ab2/* /data/pvc-4f67178f-3362-11eb-be16-d09466627ab2_influxdb_influxdb-data-influxdb-0/
Kapacitor
cp -r /data/pvc-9e0a9725-2358-11ea-894b-d09466627ab2/* /data/pvc-0ce8b1a0-3363-11eb-be16-d09466627ab2_kapacitor_kapacitor-kapacitor/
Chronograf
sudo cp /data/pvc-9e268a8b-2358-11ea-894b-d09466627ab2/chronograf-v1.db /data/pvc-f0bdff10-3362-11eb-be16-d09466627ab2_chronograf_chronograf-chronograf/chronograf-v1.db
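For reference, the new directory names above encode the bound volume name, namespace, and claim name (pvc-<volume-uid>_<namespace>_<claim-name>); with kubectl pointed at the new cluster, the corresponding claims can be listed with something like:
# list the PersistentVolumeClaims in the new deployment and the volumes they bound
kubectl get pvc --all-namespaces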
Finally, after restarting the InfluxDB, Kapacitor, and Chronograf pods, the Summit EFD was restored.
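The pod restarts aren't recorded in the ticket; with statefulset/deployment controllers recreating deleted pods, they would look roughly like the following (pod names, labels, and namespaces are assumptions based on the PVC names above). The NTP setting disabled during the clock workaround also needs to be restored:
# delete the pods so their controllers recreate them on top of the migrated data
# (names and labels below are assumptions, not taken from the ticket)
kubectl delete pod -n influxdb influxdb-0
kubectl delete pod -n kapacitor -l app=kapacitor
kubectl delete pod -n chronograf -l app=chronograf
# re-enable NTP on the host, disabled earlier for the certificate workaround
sudo timedatectl set-ntp on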