Fix Version/s: None
We think w16 includes all the changes needed to make the Gen3 HSC RC2 w14 run succeed, as well as a single-shot FGCM configuration that should require less handholding and better pipeline subset definitions (
DM-29737, DM-29615, DM-29750, DM-29348). Test this by trying to run tract=9813. Also use this run to measure actual peak memory usage and update our BPS configs accordingly.
- relates to
DM-29884 Fix exception in Gen3+applyColorTerms logic branch of jointcal
DM-29885 Disable jointcal photometry in HSC via config
DM-29916 Single-shot, multi-cycle FGCM is memory-inefficient
DM-30046 Investigate memory usage of ForcedPhotCcd
DM-29670 Parallel/alternate Gen3 RC2 w_2021_14 processing for jointcal+
DM-29918 Investigate bad cgroup/HTCondor memory analysis in BPS vs. fgcmcal
DM-29944 Add some narrow-band filters to skymap's tract+patch+band data ID packers
DM-29714 Write simple tooling to check QuantumGraph expects vs. actual outputs
- Won't Fix
OK, and I filed https://jira.lsstcorp.org/browse/DM-29943 to look at WriteObjectTable.
w18 should be fine. WriteObjectTable's memory explosion should be fixed by
DM-29907 too. I'm not going to try to merge anything tonight, but use that ticket to make the joining a little more robust.
I have just attached three BPS configs (step1.yaml, step2.yaml, step3.yaml) that can be used (after filling in a few personalization and ticket fields) to run the new step1, step2, and step3 pipeline subsets in obs_subaru (now in w_2021_18). The first two contain requestMemory values that I have actually tested successfully, though some of these may be larger than they need to be (see the comments in the files); testing for step3 is still ongoing. It's fine (and possibly even advisable, to maximize resource usage) to split any of these into smaller pieces by:
- which tasks are run together (fine for any of these steps)
- which tracts are run at once (fine for any of these steps, but step1 should be restricted by changing the input collections, not by adding tract=X to the data ID expression)
- which bands are run in step2 (each band is totally independent in that step)
It is not safe to run tasks from different steps together, or with data ID expressions that extend beyond the ones in these files.
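To make the per-band splitting concrete, here's a hypothetical sketch of what a single-band step2 submission could look like. This is an illustration only: the repo path, collection names, payload name, and memory value are all placeholders, and the key spellings should be checked against the actual attached files, which remain the authoritative versions.

```yaml
# Hypothetical per-band step2 fragment; all paths, collection names,
# and memory values are placeholders -- the attached step2.yaml is
# the authoritative version.
pipelineYaml: "${OBS_SUBARU_DIR}/pipelines/DRP.yaml#step2"
payload:
  payloadName: RC2_step2_band_i
  butlerConfig: /path/to/butler/repo
  inCollection: HSC/RC2/defaults
  # Each band is fully independent in step2, so restrict to one band:
  dataQuery: "skymap = 'hsc_rings_v1' AND tract = 9813 AND band = 'i'"
pipetask:
  jointcal:
    requestMemory: 8192  # illustrative, not a tested value
```

Note that splitting step1 by tract would instead be done by changing inCollection, as noted above, not by adding tract=X to dataQuery.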
Monika Adamow, as a heads-up for your w18 run, my step3 test did uncover a few problems:
- one deblend job was held for OOM (with the 16G limit in my BPS configs);
- 37 forcedPhotCoadd jobs failed (compared to 346 succeeded so far);
- 4953 forcedPhotCcd jobs failed (compared to 9894 succeeded so far).
I think the last one is entirely expected; we just haven't bothered to transform this failure into some kind of qualified success because there's no good way to do it and there's nothing downstream of forcedPhotCcd. I will investigate the forcedPhotCoadd failures next week, as I'm going to focus today on things that wouldn't benefit from being able to ask HTCondor/BPS questions of others. And since it's just one deblend job that exceeded that memory limit, I'll let you decide whether to handle it by changing your BPS configs, editing the job manually, or something else.
The forcedPhotCoadd failures look like they're all user-error on my part (I stupidly tinkered with my .lsst/db-auth.yaml to set up access on the RSP while they were running).
To close out this ticket, here's my analysis of memory usage from the step3 tasks:
- detection, mergeDetections, and mergeMeasurements never showed usage above the 2048 default - but in
DM-29670 I definitely had to increase the request for mergeMeasurements up to 4096 (the others still use the default), so I'm worried that one may be another case of inaccurate reporting, and it's probably best to leave it at 4096.
- As discussed above, one deblend job needed more than the 16G in the step3 file right now. Many others needed almost that much, so it's operator preference whether to bump that up or leave it and deal with the special one manually. Doesn't look like we should bring it any lower, or there will be many more holds.
- The measure tasks do seem to frequently require more than 4096, so the 8192 limit in the step3 file is not bad, but the max I saw was 5373, so an adventurous operator could try to reduce that a bit to squeeze in more jobs (this is a particularly slow step). It's always possible that 5373 is underreported, though, and 8192 did work on everything (except the one quantum downstream of the held deblend job, which I never ran - and there's a good chance that would also be the most memory-hungry measure quantum).
- The max memory usage for forcedPhotCoadd was 3481 (run with an 8192 limit). But there's the caveat that a lot of these didn't run. Still, might be worth shrinking this down to 4096 next time and seeing how many holds we get (if any).
- The max memory usage for forcedPhotCcd was 5113, so the 8192 limit I used is unfortunately probably a good one. I'm actually quite surprised this is as high as it is; the images are smaller than the coadds by roughly a factor of two, and we shouldn't be doing what I thought would be the most memory-intensive step (deblending). That's worth investigating on the algorithms side; I bet some deferred loads could help. But for now we should leave this at 8192, I think.
And finally, I've been rounding these numbers up to the nearest power of two (or a multiple of 4096 in some cases), but that's probably not quite ideal for subdividing our machines. There may be a better "unit" to round up to, but I'm not going to try to figure that out on this ticket.
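For concreteness, the round-up rule I've been applying is trivial to sketch in Python; round_up_memory here is just an illustrative helper, not part of any of our packages:

```python
def round_up_memory(peak_mb, unit=None):
    """Round an observed peak memory (MB) up to a request value.

    With no unit, round up to the next power of two (what I've been
    doing so far); with e.g. unit=4096, round up to the next multiple
    of that unit instead.
    """
    if peak_mb <= 0:
        raise ValueError("peak must be positive")
    if unit is not None:
        return -(-peak_mb // unit) * unit  # ceiling division
    request = 1
    while request < peak_mb:
        request *= 2
    return request

# The step3 peaks discussed above:
print(round_up_memory(5373))             # measure          -> 8192
print(round_up_memory(3481))             # forcedPhotCoadd  -> 4096
print(round_up_memory(5113, unit=4096))  # forcedPhotCcd    -> 8192
```

Picking a different unit (e.g. a divisor of the per-node memory divided by cores) would just mean passing a different unit value.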
I've merged the ticket branches; now that Jenkins is green on this ticket, I figure any new problems I discover later are best done on new branches. I'll keep this ticket open until I either attach BPS configs with memory values or decide to make a new ticket for that.