The forcedPhotCoadd failures look like they're all user-error on my part (I stupidly tinkered with my .lsst/db-auth.yaml to set up access on the RSP while they were running).
To close out this ticket, here's my analysis of memory usage from the step3 tasks:
- detection, mergeDetections, and mergeMeasurements never showed usage above the 2048 default. But in DM-29670 I definitely had to increase the request for mergeMeasurements up to 4096 (the others still use the default), so I'm worried that one may be another case of inaccurate reporting, and it's probably best to leave it at 4096.
- As discussed above, one deblend job needed more than the 16G in the step3 file right now. Many others needed almost that much, so it's operator preference whether to bump that up or leave it and deal with the special one manually. Doesn't look like we should bring it any lower, or there will be many more holds.
- The measure tasks do seem to frequently require more than 4096, so the 8192 limit in the step3 file is not bad, but the max I saw was 5373, so an adventurous operator could try to reduce that a bit to squeeze more jobs in (this is a particularly slow step). It's always possible that this maximum is underreported, though, and 8192 did work on everything (except the one downstream of the held deblend job, which I never ran, and there's a good chance that would also be the most memory-hungry measure quantum).
- The max memory usage for forcedPhotCoadd was 3481 (run with an 8192 limit). But there's the caveat that a lot of these didn't run. Still, might be worth shrinking this down to 4096 next time and seeing how many holds we get (if any).
- The max memory usage for forcedPhotCcd was 5113, so the 8192 limit I used is unfortunately probably a good one. I'm actually quite surprised this is as high as it is; the images are smaller than the coadds by roughly a factor of two, and we shouldn't be doing what I thought would be the most memory-intensive step (deblending). That's worth investigating on the algorithms side; I bet some deferred loads could help. But for now we should leave this at 8192, I think.
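For a quick side-by-side, here is a rough sketch comparing the reported peaks above against the limits they ran with. The numbers are just the ones quoted in this comment (assumed to be in MB, like the request values); the dictionary and helper names are made up for illustration, and the usual caveat about underreporting applies.

```python
# Reported peak memory vs. the limit each task ran with, as quoted above.
# Values are assumed to be in MB; all names here are illustrative only.
reported = {
    "measure": (5373, 8192),
    "forcedPhotCoadd": (3481, 8192),
    "forcedPhotCcd": (5113, 8192),
}


def headroom_fraction(peak, limit):
    """Return the fraction of the limit left unused at the reported peak."""
    return (limit - peak) / limit


for task, (peak, limit) in reported.items():
    unused = headroom_fraction(peak, limit)
    print(f"{task}: peak {peak} of {limit} ({unused:.0%} unused)")
```

Only forcedPhotCoadd has a reported peak below 4096, which is why it is the one candidate for shrinking.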
And finally, I've been rounding these numbers up to the nearest power of two (or multiple of 4096 in some cases), but that's probably not quite ideal for subdividing our machines. There may be a better "unit" to round up to, but I'm not going to try to figure that out on this ticket.
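To illustrate the rounding trade-off, here is a minimal sketch, not a recommendation. The node size and the 1024 rounding unit are purely hypothetical, values are assumed to be in MB, and the peak is the forcedPhotCcd figure quoted above (which may itself be underreported).

```python
import math


def round_up(value, unit):
    """Round value up to the next multiple of unit."""
    return math.ceil(value / unit) * unit


def next_power_of_two(value):
    """Return the smallest power of two >= value (value must be positive)."""
    return 1 << (value - 1).bit_length()


def jobs_per_node(node_memory, request):
    """Number of jobs that fit on a node if memory is the only constraint."""
    return node_memory // request


NODE_MEMORY = 128 * 1024  # hypothetical node size in MB, not a real machine spec
PEAK = 5113               # reported forcedPhotCcd peak from this comment

candidates = [
    ("power of two", next_power_of_two(PEAK)),
    ("multiple of 1024", round_up(PEAK, 1024)),  # hypothetical finer unit
]
for label, request in candidates:
    print(f"{label}: request {request}, {jobs_per_node(NODE_MEMORY, request)} jobs per node")
```

A finer unit packs more jobs onto a node but leaves almost no margin over the reported peak, so any real choice would need padding for underreported usage.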