# convert validate_drp jenkins job to pipeline

## Details

• Type: Story
• Status: Done
• Resolution: Done
• Fix Version/s: None
• Component/s: None
• Labels: None
• Story Points: 10.125
• Team: SQuaRE

## Activity

Joshua Hoblitt added a comment -

It looks like squash isn't accepting the results:

 "https://github.com/lsst/lsst_ci.git"}], "date": "2017-11-16T16:55:57.130106+00:00", "ci_id": "b3217", "ci_name": "validate_drp", "ci_dataset": "cfht", "ci_label": "docker.io/lsstsqre/centos:7-stack-lsst_distrib-d_2017_11_16", "ci_url": "https://ci.lsst.codes/job/sqre/job/validate_drp/1126/", "status": 0} 400 Client Error: BAD REQUEST for url: ****jobs/ script returned exit code 1 

From looking at the source, I'm guessing we're running into a 16-character limit on the ci_label field. Previously, this was set to the Jenkins node label the sub-configuration was running on, which was always centos-7. As part of the rewrite, I'm now setting it to the registry URL of the docker image (e.g., docker.io/lsstsqre/centos:7-stack-lsst_distrib-d_2017_11_16).

https://github.com/lsst-sqre/squash-api/blob/master/squash/api/models.py#L18-L19
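
The length problem described above can be reproduced with a small pre-flight check. This is a hypothetical sketch, assuming the 16-character limit on ci_label suggested by the models.py link; the validate_payload helper is illustrative and not part of the squash API.

```python
# Hypothetical pre-flight check for the squash API payload, assuming a
# 16-character limit on the ci_label field (see squash-api models.py).
# Field names match the JSON from the error output above.
MAX_CI_LABEL_LEN = 16  # assumed limit, not confirmed against the live API

def validate_payload(payload):
    """Return a list of problems for fields exceeding assumed length limits."""
    problems = []
    label = payload.get("ci_label", "")
    if len(label) > MAX_CI_LABEL_LEN:
        problems.append(
            f"ci_label is {len(label)} chars, exceeds {MAX_CI_LABEL_LEN}"
        )
    return problems

payload = {
    "ci_id": "b3217",
    "ci_name": "validate_drp",
    "ci_dataset": "cfht",
    # the docker registry URL is far longer than the old node label
    "ci_label": "docker.io/lsstsqre/centos:7-stack-lsst_distrib-d_2017_11_16",
}
print(validate_payload(payload))
```

The old value, a node label like centos-7, fits comfortably; the registry URL does not, which is consistent with the 400 response above.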

Joshua Hoblitt added a comment -

Some benchmarking of Git LFS versions has been tacked onto this ticket after sporadic, very long pull times were observed with LFS 2.3.4. A longer (1000-sample) benchmark will be run over the week to determine whether this behavior is present in other versions as well.

Joshua Hoblitt added a comment -

We are continuing to have frequent problems with the dataset clone filling up the disk. When this happens, the node is automatically taken offline and the pipeline script sits until the master timeout is hit, which releases the build agent and retries. The master timeout for the hsc dataset is 15 hours, which is causing validate_drp jobs not to finish within 24 hours, and a backlog is accumulating.

I have added per-dataset time limits for cloning the dataset and running drp. Hopefully, this will reduce the time between retries.
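
The per-dataset limits can be sketched as follows. This is illustrative only: the actual pipeline is a Jenkins Groovy script, and the timeout values, dataset names, and repository URL here are placeholders, not the real configuration.

```python
# Sketch of per-dataset time limits for the clone and run stages.
# Values and URLs are hypothetical; the idea is that each stage fails
# fast under its own limit instead of waiting for the master timeout.
import subprocess

# assumed per-dataset limits, in minutes (illustrative values)
CLONE_TIMEOUT = {"cfht": 30, "hsc": 120}
RUN_TIMEOUT = {"cfht": 60, "hsc": 600}

def run_stage(cmd, timeout_min):
    """Run one pipeline stage; raises TimeoutExpired past its limit."""
    return subprocess.run(cmd, timeout=timeout_min * 60, check=True)

def process_dataset(dataset):
    # clone the (LFS-backed) dataset repo under its own time limit
    run_stage(
        ["git", "lfs", "clone", f"https://example.org/{dataset}"],  # placeholder URL
        CLONE_TIMEOUT[dataset],
    )
    # then run validate_drp under a separate, larger limit
    run_stage(["validate_drp", dataset], RUN_TIMEOUT[dataset])
```

With stage-level limits, a wedged clone on the hsc dataset is killed after its own budget rather than holding the agent for the 15-hour master timeout.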

Joshua Hoblitt added a comment -

We're still having issues with LFS retry timeouts and disks filling up. To make the behavior more deterministic, I'm going to clear the workspace at the start of each build. This means the dataset will always be re-downloaded but, hopefully, it will greatly reduce the chance that the agent disk fills up.
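
The workspace-clearing step amounts to something like the sketch below. In the real job this is done by the Jenkins pipeline itself; the function and path handling here are a minimal, hypothetical illustration.

```python
# Minimal sketch of clearing the workspace before each build so the
# dataset is always freshly downloaded. Paths are illustrative.
import shutil
from pathlib import Path

def clean_workspace(workspace):
    """Delete and recreate the workspace so every build starts empty."""
    ws = Path(workspace)
    if ws.exists():
        shutil.rmtree(ws)
    ws.mkdir(parents=True)
```

The trade-off is the one stated above: every build pays the full download cost, in exchange for never starting on a partially filled disk.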

Joshua Hoblitt added a comment -

I'm considering this task complete. Any additional tweaking can be done under DM-12448.


## People

• Assignee: Joshua Hoblitt
• Reporter: Joshua Hoblitt
• Reviewers: Joshua Hoblitt
• Watchers: Joshua Hoblitt, Simon Krughoff