Data Management / DM-31573

Upgrade k3s on efd-temp-k3s.cp.lsst.org


    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      We need to upgrade k3s at the Summit before November because of the k3s certs expiration issue and try the latest version of k3s with Velero.
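
      For reference, a quick way to check how close the k3s certificates are to expiring is to inspect the server TLS directory (the path below is the k3s default; on efd-temp-k3s, where k3s runs as a docker container, the same command would need to be run inside that container):

      sudo sh -c 'for crt in /var/lib/rancher/k3s/server/tls/*.crt; do echo "$crt"; openssl x509 -noout -enddate -in "$crt"; done'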

        Attachments

          Issue Links

            Activity

            Angelo Fausti added a comment - edited

            Cristián Silva, my understanding is that the EFD will continue running on efd-temp-k3s.cp.lsst.org at the Summit until we get more resources for the Andes cluster to migrate it over; see DM-29576.

            In this case, I would like to upgrade k3s on efd-temp-k3s.cp.lsst.org since the version running there is quite old. I did install k3s manually in the past, but perhaps we should automate that part? Should I talk to Heinrich or Josh for help with this? Any thoughts?

            Angelo Fausti added a comment - edited

            During the Summit maintenance window on Nov 10 we tried to force the k3s certificates to rotate but that didn't work, see DM-32557.
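
            For context, the procedure commonly documented for forcing k3s to regenerate its serving certificate (not necessarily the exact steps attempted during that maintenance window) is to remove the cached certificate and restart k3s, which re-issues certificates on startup:

            kubectl -n kube-system delete secret k3s-serving
            sudo rm -f /var/lib/rancher/k3s/server/tls/dynamic-cert.json
            # then restart the k3s service or, in our case, the k3s container

            Paths assume a default k3s install; on efd-temp-k3s the files live inside the k3s container.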

            We have an emergency maintenance this week to upgrade k3s on efd-temp-k3s.cp.lsst.org and redeploy the Summit EFD.

            Angelo Fausti added a comment - edited

            Maintenance starts Nov 17 2:30pm MST

            0. Announce maintenance window on #summit-announce and #com-square-support one hour in advance

            1. Stop producers at the Summit

            2. Run Summit backups at NCSA

            export KUBECONFIG=$HOME/.kube/config-summit
            while true;  do kubectl port-forward service/influxdb -n influxdb 8088:8088; echo "Restarting..."; done
             
            backup-chronograf.sh summit
            backup-kapacitor.sh summit
            

            Back up the current InfluxDB shard at the Summit (shard 718, starting 2021-11-15T00:00:00Z):

            influxd backup -portable -database efd -host 127.0.0.1:8088 -start 2021-11-15T00:00:00Z -end  2021-11-18T00:00:00Z summit-efd-2021-11-17.influx
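
            If this shard ever needs to be restored, the companion command would look roughly like the following; InfluxDB 1.x cannot restore directly into an existing database, so the usual approach is to restore into a temporary database and sideload from there (database names here are just an example):

            influxd restore -portable -db efd -newdb efd_restore -host 127.0.0.1:8088 summit-efd-2021-11-17.influx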
            

            3. Pause connectors at LDF (see the sketch after this list)

            • replicator
            • InfluxDB Sink
            • S3 Sink
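
            A sketch of pausing these connectors through the Kafka Connect REST API; the endpoint and the exact connector names below are assumptions and should be taken from the LDF deployment:

            CONNECT=http://localhost:8083   # Kafka Connect REST endpoint (assumed)
            for c in replicator influxdb-sink s3-sink; do   # connector names assumed
              curl -s -X PUT "$CONNECT/connectors/$c/pause"
            done

            The same endpoints with /resume instead of /pause apply to step 8 below.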

            4. Summit EFD server upgrades

            4.1 Stop the running k3s efd container

            4.2 Upgrade docker to 20.10.10

            I had to stop the puppet agent and remove the yum locks on docker-ce* to upgrade docker.
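
            A sketch of that preparation, assuming the locks were applied with the yum versionlock plugin (the locks are puppet-managed, so check how they were set before deleting anything):

            sudo systemctl stop puppet
            sudo yum versionlock list
            sudo yum versionlock delete 'docker-ce*'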

            sudo yum-config-manager   --add-repo  https://download.docker.com/linux/centos/docker-ce.repo
             
            sudo yum upgrade docker-ce docker-ce-cli containerd.io
            

            4.3 Upgrade kubectl to 1.22.3

            curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
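
            Note that the command above fetches the latest stable release rather than 1.22.3 specifically; to pin the version in the heading and install the binary, something along these lines works (install path assumed):

            curl -LO "https://dl.k8s.io/release/v1.22.3/bin/linux/amd64/kubectl"
            chmod +x kubectl
            sudo mv kubectl /usr/local/bin/kubectl
            kubectl version --client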
            

            4.4 Install latest K3d

            k3d version v5.1.0
            k3s version v1.21.5-k3s2 (default)
             
            curl -s https://raw.githubusercontent.com/rancher/k3d/main/install.sh | bash
            

            4.5 Create EFD cluster

            export HOST_PATH=/data
            export CONTAINER_PATH=/var/lib/rancher/k3s/storage/
             
            sudo /usr/local/bin/k3d cluster create efd  --network host  --no-lb -v ${HOST_PATH}:${CONTAINER_PATH} --k3s-arg "--disable=traefik"
             
            sudo /usr/local/bin/k3d kubeconfig get efd > k3s.yaml
            export KUBECONFIG=$(pwd)/k3s.yaml
            

            4.6 Install latest Argo CD

            kubectl create namespace argocd
             
            kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
            

            4.7 Install latest Argo CD client

            curl -sSL -o bin/argocd https://github.com/argoproj/argo-cd/releases/latest/download/argocd-linux-amd64
            chmod +x bin/argocd
            

            5. EFD deployment

            5.1 Create EFD parent app

            kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d
             
            kubectl port-forward svc/argocd-server -n argocd 8080:443
             
            argocd login --insecure localhost:8080
             
            argocd app create efd --dest-namespace argocd --dest-server https://kubernetes.default.svc --repo https://github.com/lsst-sqre/argocd-efd.git --path apps/efd --helm-set env=summit
             
            argocd app sync efd
            

            5.2 Sync vault-secrets-operator

            argocd app sync vault-secrets-operator
             
            export VAULT_TOKEN=<read vault token for the summit>
            export VAULT_TOKEN_LEASE_DURATION=86400
             
            kubectl create secret generic vault-secrets-operator --from-literal=VAULT_TOKEN=$VAULT_TOKEN --from-literal=VAULT_TOKEN_LEASE_DURATION=$VAULT_TOKEN_LEASE_DURATION --namespace vault-secrets-operator
            

            5.3 Sync ingress-nginx

            argocd app sync nginx-ingress
            

            5.4 Sync remaining apps
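
            No commands were recorded for this step; a minimal sketch, assuming the remaining child apps of the efd parent app just need a sync (re-syncing already-synced apps is harmless):

            argocd app list
            for app in $(argocd app list -o name); do argocd app sync "$app"; done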

            5.5 Argo CD TLS

            # Create tls-certs for argocd
             
            cat << EOF | kubectl apply -f -
            apiVersion: ricoberger.de/v1alpha1
            kind: VaultSecret
            metadata:
              name: tls-certs
              namespace: argocd
            spec:
              path: secret/k8s_operator/summit-lsp.lsst.codes/efd/tls-certs
              type: Opaque
            EOF
            

            6. Migrate EFD volumes to the new deployment

            6.1 OLD volumes

            kubectl get pv
             
            NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                                     STORAGECLASS   REASON   AGE
            pvc-0ce8b1a0-3363-11eb-be16-d09466627ab2   8Gi        RWO            Retain           Bound    kapacitor/kapacitor-kapacitor                             local-path              351d
             
            pvc-4f67178f-3362-11eb-be16-d09466627ab2   15Gi       RWO            Retain           Bound    influxdb/influxdb-data-influxdb-0                         local-path              351d
             
            pvc-f0bdff10-3362-11eb-be16-d09466627ab2   8Gi        RWO            Retain           Bound    chronograf/chronograf-chronograf                          local-path              351d
            

            6.2 NEW volumes

            kubectl get pv
            

            6.3 Change reclaim policy of new volumes to Retain

            kubectl patch pv <new pv> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'

            6.4 Chronograf data

            sudo mv /data/pvc-f0bdff10-3362-11eb-be16-d09466627ab2_chronograf_chronograf-chronograf/* /data/<new pv>/
            

            Restart Chronograf pod
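
            A hedged example of the restart, shown here for Chronograf; the same pattern applies to the Kapacitor and InfluxDB steps below (InfluxDB is a statefulset rather than a deployment), and the resource name is an assumption based on the claim names above:

            kubectl -n chronograf rollout restart deployment chronograf-chronograf
            kubectl -n chronograf get pods -w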

            6.5 Kapacitor

            sudo mv /data/pvc-0ce8b1a0-3363-11eb-be16-d09466627ab2_kapacitor_kapacitor-kapacitor/* /data/<new pv>/
            

            Restart Kapacitor pod

            6.6 InfluxDB

            sudo mv /data/pvc-4f67178f-3362-11eb-be16-d09466627ab2_influxdb_influxdb-data-influxdb-0/* /data/<new pv>/
            

            Restart InfluxDB pod

            7. Resume producers at the Summit

            8. Resume connectors at LDF

            Angelo Fausti added a comment - edited

            We had a problem configuring ingress with k3d. Apparently the --network host option in k3d does not have the same effect as the --network host option I used with docker.

            Host port 443 is not exposed to the ingress-nginx container, even when we configure hostPort.enabled=true. In that case the ingress-nginx pod stays in Pending state and never starts.

            I've also tried the methods suggested in the k3d docs with Traefik (the default ingress controller) instead of ingress-nginx, but that didn't work either.

            [afausti@instance-1 ~]$ k3d cluster create efd --agents 3  -p "30000-32767:30000-32767@server:0"
            INFO[0000] Prep: Network
            INFO[0000] Created network 'k3d-efd'
            INFO[0000] Created volume 'k3d-efd-images'
            INFO[0000] Starting new tools node...
            INFO[0000] Starting Node 'k3d-efd-tools'
            INFO[0001] Creating node 'k3d-efd-server-0'
            INFO[0001] Creating node 'k3d-efd-agent-0'
            INFO[0001] Creating node 'k3d-efd-agent-1'
            INFO[0001] Creating node 'k3d-efd-agent-2'
            INFO[0001] Creating LoadBalancer 'k3d-efd-serverlb'
            INFO[0001] Using the k3d-tools node to gather environment information
            INFO[0001] HostIP: using network gateway...
            INFO[0001] Starting cluster 'efd'
            INFO[0001] Starting servers...
            INFO[0001] Starting Node 'k3d-efd-server-0'
            INFO[0006] Starting agents...
            INFO[0006] Starting Node 'k3d-efd-agent-1'
            INFO[0006] Starting Node 'k3d-efd-agent-2'
            INFO[0006] Starting Node 'k3d-efd-agent-0'
            INFO[0019] Starting helpers...
            INFO[0019] Starting Node 'k3d-efd-serverlb'
             
             
            ERRO[0210] Failed Cluster Start: Failed to add one or more helper nodes: runtime failed to start node 'k3d-efd-serverlb': docker failed to start container for node 'k3d-efd-serverlb': Error response from daemon: driver failed programming external connectivity on endpoint k3d-efd-serverlb (6b4358abdc6fe3e7073dbab3165977cbca16df71d2597c05a7ee3694d7ef171a): Error starting userland proxy:
            ERRO[0210] Failed to create cluster >>> Rolling Back
            INFO[0210] Deleting cluster 'efd'
            INFO[0212] Deleting cluster network 'k3d-efd'
            INFO[0212] Deleting image volume 'k3d-efd-images'
            FATA[0212] Cluster creation FAILED, all changes have been rolled back!
            

            The workaround was to use --network host in k3d when creating the cluster and hostPort.enabled=false in the ingress-nginx configuration, and use port :30828 to connect to the services.

            [afausti@efd-temp-k3s ~]$ kubectl get services -n ingress-nginx
            NAME                                 TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)                      AGE
            ingress-nginx-controller-admission   ClusterIP      10.43.122.214   <none>           443/TCP                      37h
            ingress-nginx-controller             LoadBalancer   10.43.217.102   139.229.160.30   80:30298/TCP,443:30828/TCP   37h
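
            For the record, the chart setting implied by this workaround can be expressed as a Helm parameter override on the Argo CD app (the values path is from the upstream ingress-nginx chart and may be nested differently in the argocd-efd repo):

            argocd app set nginx-ingress --helm-set controller.hostPort.enabled=false
            argocd app sync nginx-ingress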
            

            Angelo Fausti added a comment - edited

            Data is flowing to the Summit EFD as of yesterday. The attached screenshot gives an idea of the outage.

            I resumed the connectors at the LDF EFD and checked that the EFD replication service is working and data is being written to InfluxDB and Parquet again.


              People

              Assignee:
              Angelo Fausti
              Reporter:
              Angelo Fausti
              Watchers:
              Angelo Fausti, Cristián Silva, Frossie Economou

                Dates

                Created:
                Updated:
                Resolved:

                  Jenkins

                  No builds found.