Data Management / DM-29776

Attempt complete two-QG HSC RC2 run on w16 on one tract


    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: obs_subaru
    • Labels:
    • Story Points: 8
    • Epic Link:
    • Team: Data Release Production
    • Urgent?: No

      Description

      We think w16 includes all changes needed to make the Gen3 HSC RC2 w14 run succeed, as well as a single-shot FGCM configuration that should require less handholding, and better pipeline subset definitions (DM-29737, DM-29615, DM-29750, DM-29348). Test this by trying to run tract=9813, and use that run to measure actual peak memory usage and update our BPS configs.
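
      For reference, the step1/step2/step3 groupings discussed in the comments below are labeled subsets in the obs_subaru DRP pipeline YAML. A minimal sketch of how such a subset is defined in the pipe_base pipeline-definition syntax; the task labels shown are illustrative, not the actual contents of the obs_subaru subsets:

        description: Illustrative DRP pipeline fragment (not the real obs_subaru pipeline)
        subsets:
          step1:
            subset:
              - isr
              - characterizeImage
              - calibrate
            description: Per-detector processing that needs no tract-level inputs.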

        Attachments

        1. step1.yaml (2 kB)
        2. step2.yaml (2 kB)
        3. step3.yaml (2 kB)


            Activity

            Jim Bosch added a comment -

            I've merged the ticket branches; now that Jenkins is green on this ticket, I figure any new problems I discover later are best addressed on new branches. I'll keep this ticket open until I either attach BPS configs with memory values or decide to make a new ticket for that.

            Yusra AlSayyad added a comment - edited

            OK, and I filed https://jira.lsstcorp.org/browse/DM-29943 to look at WriteObjectTable.

            w18 should be fine. WriteObjectTable's memory explosion should be fixed by DM-29907 too. I'm not going to try to merge anything tonight, but use that ticket to make the joining a little more robust.

            Jim Bosch added a comment -

            I have just attached three BPS configs (step1.yaml, step2.yaml, step3.yaml) that can be used (after filling in a few personalization and ticket fields) to run the new step1, step2, and step3 pipeline subsets in obs_subaru (now in w_2021_18). The first two contain requestMemory values that I have actually tested successfully, though some of these may be larger than they need to be (see comments in the files); testing for step3 is still ongoing. It's fine (and possibly even advisable, to maximize resource usage) to split any of these into smaller pieces by:

            • which tasks are run together (fine for any of these steps)
            • which tracts are run at once (fine for any of these steps, but step1 should be restricted by changing the input collections, not by adding tract=X to the data ID expression)
            • which bands are run in step2 (each band is totally independent for that step)

            It is not safe to run tasks from different steps together, or with data ID expressions that extend beyond the ones in these files.
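
            A minimal sketch of the kind of ctrl_bps submit file being described here, for readers without the attachments; it assumes the ctrl_bps key names of this era, and the repo path, collection names, and payload name are placeholders rather than the values in the attached files. The data ID query reflects the tract=9813 restriction used on this ticket:

              pipelineYaml: "${OBS_SUBARU_DIR}/pipelines/DRP.yaml#step3"

              payload:
                payloadName: RC2_step3_test          # placeholder run/output name
                butlerConfig: /repo/main             # placeholder Butler repo
                inCollection: HSC/RC2/defaults       # placeholder input collections
                dataQuery: "skymap = 'hsc_rings_v1' AND tract = 9813"

              # Default per-job memory request in MB; per-task overrides can be
              # added in a "pipetask" section keyed by task label.
              requestMemory: 4096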

            Jim Bosch added a comment -

            Monika Adamow, as a heads-up for your w18 run, my step3 test did uncover a few problems:

            • one deblend job was held for OOM (with the 16G limit in my BPS configs);
            • 37 forcedPhotCoadd jobs failed (compared to 346 succeeded so far);
            • 4953 forcedPhotCcd jobs failed (compared to 9894 succeeded so far).

            I think the last one is entirely expected; we just haven't bothered to transform this failure into some kind of qualified success, because there's no good way to do it and there's nothing downstream of forcedPhotCcd. I will investigate the forcedPhotCoadd failures next week; today I'm focusing on things that wouldn't benefit from being able to ask others HTCondor/BPS questions. And since it's just one deblend job that exceeded that memory limit, I'll let you decide whether you want to handle it by changing your BPS configs, editing the job manually, or something else.
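
            For the "changing your BPS configs" option, a minimal sketch of a per-task override, assuming the ctrl_bps pipetask-section syntax; the 18432 MB value is purely illustrative (simply something above the 16384 MB that was exceeded), not a number measured on this ticket:

              # Hypothetical bump for the deblend quanta that can exceed 16G.
              pipetask:
                deblend:
                  requestMemory: 18432   # illustrative; the tested step3 config used 16384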

            Jim Bosch added a comment -

            The forcedPhotCoadd failures look like they're all user-error on my part (I stupidly tinkered with my .lsst/db-auth.yaml to set up access on the RSP while they were running).

            To close out this ticket, here's my analysis of memory usage from the step3 tasks:

            • detection, mergeDetections, and mergeMeasurements never showed usage above the 2048 default - but in DM-29670 I definitely had to increase the request for mergeMeasurements up to 4096 (the others still use the default), so I'm worried that one may be another case of inaccurate reporting, and it's probably best to leave it at 4096.
            • As discussed above, one deblend job needed more than the 16G currently in the step3 file. Many others needed almost that much, so it's operator preference whether to bump that up or leave it and deal with the special one manually. It doesn't look like we should bring it any lower, though, or there will be many more holds.
            • The measure tasks do seem to frequently require more than 4096, so the 8192 limit in the step3 file is not bad, but the max I saw was 5373, so an adventurous operator could try to reduce that a bit to squeeze more jobs in (this is a particularly slow step). It's always possible that the 5373 is underreported, though, and 8192 did work on everything (except the one quantum downstream of the held deblend job, which I never ran - and there's a good chance that would also be the most memory-hungry measure quantum).
            • The max memory usage for forcedPhotCoadd was 3481 (run with an 8192 limit), with the caveat that a lot of these didn't run. Still, it might be worth shrinking this down to 4096 next time and seeing how many holds we get (if any).
            • The max memory usage for forcedPhotCcd was 5113, so the 8192 limit I used is unfortunately probably a good one. I'm actually quite surprised it is as high as it is; the images are smaller than the coadds by roughly a factor of two, and we shouldn't be doing what I thought would be the most memory-intensive step (deblending). That's worth investigating on the algorithms side; I bet some deferred loads could help. But for now we should leave this at 8192, I think.

            And finally, I've been rounding these numbers up to the nearest power of two (or multiple of 4096 in some cases), but that's probably not quite ideal for trying to subdivide our machines. There may be a better "unit" to round up to, but I'm not going to try to figure that out on this ticket.
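
            Pulling the numbers above together, a minimal sketch of the per-task overrides a step3 submit file might carry after this analysis, again assuming the ctrl_bps pipetask-section syntax; values are in MB, come from the discussion in this comment, and are a starting point rather than tested settings:

              pipetask:
                detection:
                  requestMemory: 2048        # default; never exceeded here
                mergeDetections:
                  requestMemory: 2048        # default; never exceeded here
                mergeMeasurements:
                  requestMemory: 4096        # keep the DM-29670 value in case reporting is off
                deblend:
                  requestMemory: 16384       # one quantum exceeded even this; handle it manually
                measure:
                  requestMemory: 8192        # observed max 5373
                forcedPhotCoadd:
                  requestMemory: 4096        # observed max 3481, but many quanta did not run
                forcedPhotCcd:
                  requestMemory: 8192        # observed max 5113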


              People

              Assignee: Jim Bosch
              Reporter: Jim Bosch
              Reviewers: Yusra AlSayyad
              Watchers: Jim Bosch, John Parejko, Lee Kelvin, Monika Adamow, Yusra AlSayyad

                Dates

                Created:
                Updated:
                Resolved:
