Status: To Do
Fix Version/s: None
So far, this has only happened to ci_hsc, and only on one platform at a time, perhaps because of when that job runs, so it looks like a transient problem. The hope is that it stems from a server capacity or maintenance issue that will be resolved without our intervention; if not, here are some thoughts on what we could do.
There are some GitHub issues on conda that cite this error, but most are due to other problems (e.g. firewalls or Windows DLL issues). Nevertheless, this seems to have become more common recently, despite changes in 4.8.0 (we use 4.8.2) that were supposed to ameliorate it. An upstream fix at the client level seems unlikely.
There seem to be a few ways to avoid it: retry the entire job at the Jenkins level, retry the deploy job inside ci-scripts/jenkins_wrapper.sh, retry the conda command inside deploy, or configure conda to retry more times or with longer backoff.
Since this happens during the deploy step, not much time has been spent on the job by the time it fails, so retrying at the Jenkins level might make sense. However, the ci_hsc pipeline is simply a configuration of the main stack-os-matrix pipeline, and retrying that on any error would be inappropriate. I'm not sure it's easy to get access to the actual error so that we could retry only on particular ones.
Since conda is potentially installed by deploy, I think configuring it would have to happen there. Doing so would apply the setting to all new or updated lsstsw installations, which may not be harmful but seems heavy-handed.
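For reference, the conda-configuration option might look roughly like the sketch below. It assumes the conda version in use supports the remote_max_retries and remote_backoff_factor settings. The function name configure_conda_retries and the default values are hypothetical and have not been tested against this failure, and where deploy would call it is not shown.

```shell
#!/bin/bash
# Hypothetical helper (not part of deploy today): raise conda's network
# retry count and backoff. Assumes conda is already on PATH and that it
# supports the remote_max_retries and remote_backoff_factor settings.
configure_conda_retries() {
    local retries=${1:-6}   # illustrative default, not a tuned value
    local backoff=${2:-2}   # illustrative default, not a tuned value
    conda config --set remote_max_retries "$retries"
    conda config --set remote_backoff_factor "$backoff"
}
```

Because conda config --set writes to the user's .condarc by default, this would persist beyond the CI run, which is part of why it feels heavy-handed.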
So retrying the deploy step inside jenkins_wrapper.sh may be the best solution.
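A minimal sketch of what that retry could look like follows. The retry function, its argument order, and the attempt count and delay are all hypothetical. The exact deploy invocation in jenkins_wrapper.sh is not reproduced here, so the usage line is only a placeholder.

```shell
#!/bin/bash
# Hypothetical helper: run a command up to max_attempts times, doubling
# the sleep between failed attempts.
# Usage: retry MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
retry() {
    local max_attempts=$1
    local delay=$2
    shift 2
    local attempt=1
    # The command runs as the loop condition, so a failure here does not
    # abort the script even under "set -e".
    until "$@"; do
        if (( attempt >= max_attempts )); then
            echo "retry: '$*' failed after ${attempt} attempts" >&2
            return 1
        fi
        echo "retry: attempt ${attempt} failed; sleeping ${delay}s" >&2
        sleep "$delay"
        attempt=$(( attempt + 1 ))
        delay=$(( delay * 2 ))
    done
}

# Placeholder usage; the real deploy command line may differ:
#   retry 3 30 ./bin/deploy
```

Note that this retries on any deploy failure, not just the transient conda error, so a genuine deploy breakage would take a few minutes longer to surface.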