  Data Management / DM-12510

convert validate_drp jenkins job to pipeline

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: Continuous Integration
    • Labels: None

    Attachments

    Issue Links

    Activity

          Joshua Hoblitt added a comment -

          It looks like squash isn't accepting the results:

          "https://github.com/lsst/lsst_ci.git"}], "date": "2017-11-16T16:55:57.130106+00:00", "ci_id": "b3217", "ci_name": "validate_drp", "ci_dataset": "cfht", "ci_label": "docker.io/lsstsqre/centos:7-stack-lsst_distrib-d_2017_11_16", "ci_url": "https://ci.lsst.codes/job/sqre/job/validate_drp/1126/", "status": 0}
          400 Client Error: BAD REQUEST for url: ****jobs/
          script returned exit code 1
          

          From looking at the source, I'm guessing we're banging into a 16 char limit on the ci_label field. Previously, this was being set to the jenkins node label the sub-configuration was running on, which was always centos-7. As part of the rewrite, I'm now setting this to the registry url of the docker image (e.g., docker.io/lsstsqre/centos:7-stack-lsst_distrib-d_2017_11_16).

          https://github.com/lsst-sqre/squash-api/blob/master/squash/api/models.py#L18-L19
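
          For illustration, a minimal Python sketch of how the label could be trimmed to fit the assumed 16-character limit before the job document is POSTed. This is not the actual submission code; the endpoint path, helper name, and limit constant are assumptions based on the error above.

          import requests

          CI_LABEL_MAX = 16  # assumed limit, per the ci_label field in squash/api/models.py

          def submit_job(api_url, job_doc, auth=None):
              # POST a validate_drp job document to the squash jobs endpoint,
              # trimming ci_label so the request is not rejected with a 400.
              doc = dict(job_doc)
              label = doc.get("ci_label", "")
              if len(label) > CI_LABEL_MAX:
                  # e.g. "docker.io/lsstsqre/centos:7-stack-lsst_distrib-d_2017_11_16"
                  # is far longer than 16 chars; keep only the trailing portion
                  doc["ci_label"] = label[-CI_LABEL_MAX:]
              resp = requests.post(api_url.rstrip("/") + "/jobs/", json=doc, auth=auth)
              resp.raise_for_status()
              return resp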

          Joshua Hoblitt added a comment -

          Some benchmarking of LFS versions has been tacked onto this ticket after randomly occurring, very long pull times were observed with lfs 2.3.4. A longer (1000 sample) benchmark will be run over the week to attempt to determine whether this behavior is present in other versions as well.
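
          The benchmark itself is essentially just timing repeated clones; something along these lines (the repo URL, sample count, and output format here are placeholders, not the actual harness):

          import shutil
          import subprocess
          import tempfile
          import time

          REPO = "https://github.com/lsst/validation_data_cfht"  # placeholder dataset repo
          SAMPLES = 1000

          def time_clone(url):
              # clone into a throwaway directory and report the wall-clock time
              tmp = tempfile.mkdtemp(prefix="lfsbench-")
              start = time.monotonic()
              subprocess.run(["git", "clone", url, tmp], check=True,
                             stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
              elapsed = time.monotonic() - start
              shutil.rmtree(tmp)
              return elapsed

          times = sorted(time_clone(REPO) for _ in range(SAMPLES))
          print("min %.1fs  median %.1fs  max %.1fs"
                % (times[0], times[len(times) // 2], times[-1]))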

          Joshua Hoblitt added a comment -

          We are continuing to have frequent problems with the dataset clone filling up the disk. This results in the node automatically being taken offline, and the pipeline script sits until the master timeout is hit, which releases the build agent and retries. The master timeout for the hsc dataset is 15 hours, which is causing validate_drp jobs to not finish within 24 hours and a backlog to accumulate.

          I have added per-dataset time limits for cloning the dataset and running drp. Hopefully, this will reduce the time between retries.
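
          Conceptually the change amounts to something like the following. This is only a Python sketch; the real change is in the Groovy pipeline, and the dataset names, limits, and commands here are placeholders.

          import subprocess

          # per-dataset limits in minutes: (clone dataset, run drp); values are illustrative
          TIME_LIMITS = {
              "cfht": (60, 240),
              "hsc": (240, 900),  # hsc is much larger, so it gets longer limits
          }

          def run_step(cmd, limit_minutes):
              # raises subprocess.TimeoutExpired when the limit is exceeded, so the
              # build fails fast and can be retried instead of waiting out the master timeout
              subprocess.run(cmd, check=True, timeout=limit_minutes * 60)

          def run_dataset(dataset):
              clone_limit, drp_limit = TIME_LIMITS[dataset]
              run_step(["clone_dataset.sh", dataset], clone_limit)   # placeholder command
              run_step(["run_validate_drp.sh", dataset], drp_limit)  # placeholder command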

          Joshua Hoblitt added a comment -

          We're still having issues with LFS retry timeouts and disks filling up. To make the behavior more deterministic, I'm going to clear the workspace at the start of each build. This will result in the dataset always being re-downloaded but, hopefully, it will greatly reduce the chance that the agent disk fills up.
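
          A rough sketch of the idea (the actual implementation is in the pipeline script; the workspace path and function name here are placeholders):

          import shutil
          from pathlib import Path

          def clean_workspace(workspace):
              # wipe any leftover dataset clone so stale or partially fetched LFS
              # objects can never accumulate and fill the agent disk, at the cost
              # of re-downloading the dataset on every build
              ws = Path(workspace)
              if ws.exists():
                  shutil.rmtree(ws)
              ws.mkdir(parents=True)
              return ws

          clean_workspace("/j/ws/sqre/validate_drp")  # placeholder workspace path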

          Joshua Hoblitt added a comment -

          I'm considering this task complete. Any additional tweaking can be done under DM-12448.


            People

            • Assignee: Joshua Hoblitt
            • Reporter: Joshua Hoblitt
            • Reviewers: Joshua Hoblitt
            • Watchers (2): Joshua Hoblitt, Simon Krughoff
            • Votes: 0

              Dates

              • Created:
              • Updated:
              • Resolved:
