  1. Data Management
  2. DM-23990

Deploy the EFD on the LSP integration cluster at NCSA

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Deploy the EFD on the LSP integration cluster as a rehearsal for the deployment on the LSP stable cluster.

      This ticket covers the work done from March 5 to March 16, amounting to 5.6 SP. Most of that effort went into problems with the local-path-provisioner and NFS file locking issues on the LSP integration cluster.

            Activity

            afausti Angelo Fausti added a comment -

            Argo CD environment ncsa-int has the configuration for this deployment.

            NCSA created the following namespaces for the deployment and configured my user with full permissions on these namespaces:

            argocd                
            chronograf            
            cp-helm-charts                       
            influxdb              
            influxdb-sink                  
            kapacitor                       
            local-path-storage    
            test-local-path       
            vault-secrets-operator
            

            NCSA also configured DNS for the EFD services, using the nginx-ingress already deployed on the integration cluster.

            Chronograf: https://lsst-chronograf-int-efd.ncsa.illinois.edu
            InfluxDB: https://lsst-influxdb-int-efd.ncsa.illinois.edu
            Control Center: https://lsst-control-center-int-efd.ncsa.illinois.edu
            Kafka Schema Registry: https://lsst-schema-registry-int-efd.ncsa.illinois.edu
            Kafka Broker: https://lsst-kafka-0-int-efd.ncsa.illinois.edu
            Argo CD: https://lsst-argocd-int-efd.ncsa.illinois.edu
            

            The corresponding TLS certificates were created by NCSA and installed in the cluster as well.

            For storage, we have the /efd NFS mount on kub0[16-20] on the cluster.

            lsst-nfs.ncsa.illinois.edu:/dac/services/efd on /efd
            

            We spent a couple of days debugging a problem that prevented us from creating PVs with the local-path-provisioner:

            W0311 15:25:30.375190       1 controller.go:893] Retrying syncing claim "6f0c2986-ce55-48f1-9eab-32c963ed2bf6" because failures 0 < threshold 15
            E0311 15:25:30.375243       1 controller.go:913] error syncing claim "6f0c2986-ce55-48f1-9eab-32c963ed2bf6": failed to provision volume with StorageClass "local-path": failed to create volume pvc-6f0c2986-ce55-48f1-9eab-32c963ed2bf6: create process timeout after 120 seconds
            I0311 15:25:30.375281       1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"test-local-path", Name:"local-path-pvc", UID:"6f0c2986-ce55-48f1-9eab-32c963ed2bf6", APIVersion:"v1", ResourceVersion:"194097661", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "local-path": failed to create volume pvc-6f0c2986-ce55-48f1-9eab-32c963ed2bf6: create process timeout after 120 seconds
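
A test claim like the one in the events above can be reproduced with a small PVC manifest. The sketch below builds one in Python and prints it as JSON, which `kubectl apply -f -` accepts. The claim name, namespace, and storage class come from the log; the access mode and the 1Gi request are assumptions, not the values used in this ticket.

```python
import json


def local_path_pvc(name="local-path-pvc", namespace="test-local-path",
                   storage_class="local-path", size="1Gi"):
    """Build a PersistentVolumeClaim manifest as a plain dict.

    The defaults for name, namespace and storage_class mirror the log above;
    size and the access mode are illustrative assumptions.
    """
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    # Pipe the output to `kubectl apply -f -` to create the claim.
    print(json.dumps(local_path_pvc(), indent=2))
```

Depending on the storage class's volume binding mode, the claim may stay Pending until a pod mounts it, so a provisioning event may only appear once a consumer pod exists.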
            
            

            I also did not have permission to get/create/edit storage classes in the cluster:

            [afausti@lsst-bastion01 ~]$ kubectl get sc
            Error from server (Forbidden): storageclasses.storage.k8s.io is forbidden: User "afausti" cannot list resource "storageclasses" in API group "storage.k8s.io" at the cluster scope
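
A permission gap like this one can be checked up front with `kubectl auth can-i`. The following is a minimal sketch that wraps that subcommand; the helper names `can_i_command` and `can_i` are made up for illustration, and it assumes `kubectl` is on the PATH and already configured for the cluster.

```python
import subprocess


def can_i_command(verb, resource, namespace=None):
    """Return the argv for a `kubectl auth can-i` permission check."""
    cmd = ["kubectl", "auth", "can-i", verb, resource]
    if namespace:
        cmd += ["--namespace", namespace]
    return cmd


def can_i(verb, resource, namespace=None):
    """Run the check; kubectl prints "yes" or "no" on stdout."""
    result = subprocess.run(
        can_i_command(verb, resource, namespace),
        capture_output=True,
        text=True,
    )
    return result.stdout.strip() == "yes"
```

For the case above, `can_i("list", "storageclasses")` would be expected to return False for a user without cluster-scoped access to storage classes.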
            

            We decided that NCSA would be responsible for deploying the local-path-provisioner as part of the cluster infrastructure. That makes sense, and it also avoids having to grant special privileges to my user.

            With admin privileges, provisioning finally worked:

            event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"test-local-path", Name:"local-path-pvc", UID:"f1d88cee-ba94-475e-8fe5-436848c5dec7", APIVersion:"v1", ResourceVersion:"194198088", FieldPath:""}): type: 'Normal' reason: 'ProvisioningSucceeded' Successfully provisioned volume pvc-f1d88cee-ba94-475e-8fe5-436848c5dec7
            

            afausti Angelo Fausti added a comment -

            After deploying the EFD with Argo CD we noticed the following errors all pointing to NFS file locking issues.

            Kafka:

            [2020-03-13 22:20:26,579] ERROR Disk error while locking directory /opt/kafka/data-0/logs (kafka.server.LogDirFailureChannel) java.io.IOException: No locks available
            

            Chronograf:

            time="2020-03-13T22:31:36Z" level=error msg="Unable to open boltdb; is there a chronograf already running?  no locks available" component=boltstore
            

            Kapacitor:

            ts=2020-03-13T22:32:39.012Z lvl=error msg="encountered error" service=run err="open server: open service *storage.Service: open boltdb @ \"/var/lib/kapacitor/kapacitor.db\": no locks available"
            run: open server: open service *storage.Service: open boltdb @ "/var/lib/kapacitor/kapacitor.db": no locks available
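
"No locks available" is the message for the `ENOLCK` error code. A quick way to check whether a directory supports file locking is to try to lock a scratch file in it. The sketch below does that with a POSIX `fcntl` lock; the function name `supports_locking` is hypothetical, and the probe only covers this one locking call, which may differ from what Kafka or BoltDB use internally.

```python
import errno
import fcntl
import os
import tempfile


def supports_locking(directory):
    """Return True if an exclusive lock can be taken on a file in directory.

    Returns False when the filesystem answers ENOLCK ("No locks available");
    any other error is re-raised.
    """
    fd, path = tempfile.mkstemp(dir=directory, prefix=".locktest-")
    try:
        try:
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            if exc.errno == errno.ENOLCK:
                return False
            raise
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True
    finally:
        # Always clean up the scratch file, whether or not locking worked.
        os.close(fd)
        os.unlink(path)
```

Running `supports_locking("/efd")` on one of the nodes with the NFS mount would show whether the export accepts locks before the services are deployed on it.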
            

            This happened in the past with the LSP (IHS-1575), and the solution was to use NFS v4 instead of NFS v3, which was initially used for exporting lsst-nfs.ncsa.illinois.edu:/dac/services/efd
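
To confirm which NFS version a mount is actually using, the mount table can be inspected on a node. This is a minimal sketch that parses text in the `/proc/mounts` format; the function name `nfs_mounts` is hypothetical, and it assumes the version is reported through the `vers=` mount option.

```python
def nfs_mounts(mounts_text):
    """Yield (source, mount_point, version) for NFS entries.

    mounts_text is expected in /proc/mounts format:
    source, mount point, filesystem type, comma-separated options, ...
    version is the value of the vers= option, or None if it is absent.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        options = fields[3].split(",")
        version = next(
            (opt.split("=", 1)[1] for opt in options if opt.startswith("vers=")),
            None,
        )
        yield fields[0], fields[1], version


if __name__ == "__main__":
    with open("/proc/mounts") as handle:
        for source, mount_point, version in nfs_mounts(handle.read()):
            print(source, mount_point, version)
```

On a node with the /efd mount, an entry reporting version 3 would point to the export that showed the locking errors, while a 4.x version would match the fix described above.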

            afausti Angelo Fausti added a comment - - edited

            Once that was solved, we successfully deployed the EFD on the LSP integration cluster. All EFD services are running, and this instance is temporarily serving the Summit EFD data because the Observatory is shut down due to the COVID-19 outbreak.

            SQR-034 was updated accordingly with the URLs for the services running on this instance.


              People

              • Assignee:
                afausti Angelo Fausti
                Reporter:
                afausti Angelo Fausti
                Watchers:
                Angelo Fausti