  1. Data Management
  2. DM-29670

Parallel/alternate Gen3 RC2 w_2021_14 processing for jointcal+

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Story Points:
      1
    • Epic Link:
    • Team:
      Data Release Production
    • Urgent?:
      No

      Description

      DM-29528 has been stalled by multiple problems, the latest being jointcal fixes on DM-28991 that aren't working as expected. I'll use this ticket to attempt to complete this initial RC2 Gen3 run with BPS myself, as it has become apparent that the Gen3 DRP pipeline was not fully ready to hand off to operators. I won't use the DM-29528 branches, because I don't want to get in the way of that ticket.

      I will be starting from the single-frame + FGCM processing already completed on DM-29528.

            Activity

            jbosch Jim Bosch added a comment -

            After increasing memory limits, 9615 seems to have completed (or at least none of my retry jobs failed). I have not yet inspected the data repository in detail to check that everything is there; I just kicked off the 9813 post-jointcal tasks via BPS (after patching up the chained collection to include the 9813 jointcal outputs).

            jbosch Jim Bosch added a comment -

            Post-jointcal processing of 9813 is still underway, and I've kicked off jointcal runs for 9697 via pipetask on lsst-devl02.

            So far one post-jointcal deblend quantum (tract=9813, patch=77) has been held for running out of memory (with the maximum our slots seem to support, 14G).

            I also forgot to mention earlier that my -j2 pipetask jointcal run for 9813 yielded one failure (in y band) because of a Python multiprocessing timeout. I fixed that by just running that one again without multiprocessing, and I'm doing the 9697 jointcal processing using only one node both to avoid that and to try to reduce how much of lsst-devl02 I consume on a weekday.

            jbosch Jim Bosch added a comment -

            All of the OOM held jobs for 9813 were mergeMeasurements, which was using the 2G default. Bumping that to 4G and resubmitting.
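The memory bump described above can be sketched in Python. This is only an illustration, assuming a per-task map of memory requests in MB: `bump_memory` and its parameters are hypothetical names, not part of BPS or any LSST package. The 2048 MB default and the 14G cap are taken from the figures quoted in this thread.

```python
# Hypothetical helper for illustration only; not part of BPS or the LSST stack.
DEFAULT_REQUEST_MEMORY_MB = 2048  # the 2G default mentioned in this thread
SLOT_CAP_MB = 14 * 1024           # the 14G slot maximum mentioned earlier


def bump_memory(overrides, oom_tasks, factor=2, cap_mb=SLOT_CAP_MB):
    """Return a new per-task memory map with the OOM-held tasks scaled up.

    Tasks without an explicit override start from the default, and no
    request is allowed to exceed the slot cap.
    """
    updated = dict(overrides)  # leave the caller's mapping untouched
    for task in oom_tasks:
        current = updated.get(task, DEFAULT_REQUEST_MEMORY_MB)
        updated[task] = min(current * factor, cap_mb)
    return updated


if __name__ == "__main__":
    print(bump_memory({}, ["mergeMeasurements"]))
```

The resulting values would still have to be copied by hand into whatever per-task memory setting the submission config uses.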

            jbosch Jim Bosch added a comment -

            Resubmitting failed because we've got quanta with partial outputs (assembleCoadd quanta, to be specific). But condor doesn't think any of those jobs failed, while it does think a bunch of forcedPhotCoadd jobs did fail.

            I'm going to take a break from shepherding processing through to work on DM-29714 (and other things); a little tooling will help a lot with figuring out what's going on here, and now that DM-29528 is proceeding again, being able to understand the failures in w14 before w18 starts is more important than me actually finishing the processing on this ticket.

            jbosch Jim Bosch added a comment -

            DM-29528 has pulled well ahead of this ticket, and I don't think it's useful to complete the processing itself. (In retrospect, I really should have made this a u/jbosch collection, but back when I started I thought this would end up superseding DM-29528 as the official w14 run.) I'd still like to go through and inspect memory usage (at least) in order to firm up the requestMemory values in our BPS configs, but I think it'll be easiest to do that via a new w16 run that tries to run things in just two QGs (at least for one tract); I'll do that on DM-29776.
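One way to turn inspected memory usage into requestMemory values is sketched below. It assumes peak memory per quantum has already been collected into a mapping from task label to a list of peaks in MB. The function name, the 25% headroom, and the 1024 MB rounding step are all assumptions made for illustration, and the numbers in the example are placeholders, not measurements from this run.

```python
import math


def suggest_request_memory(peaks_mb, headroom=1.25, step_mb=1024):
    """Suggest a memory request per task from observed peak usage.

    For each task, take the largest observed peak, add headroom, and
    round up to the next multiple of step_mb.
    """
    suggestions = {}
    for task, peaks in peaks_mb.items():
        target = max(peaks) * headroom
        suggestions[task] = int(math.ceil(target / step_mb) * step_mb)
    return suggestions


if __name__ == "__main__":
    # Placeholder peaks in MB, invented for the example.
    example = {"mergeMeasurements": [1800, 3100], "deblend": [9000, 13500]}
    print(suggest_request_memory(example))
```

Using the maximum rather than a percentile is the conservative choice; a percentile would lower the requests at the cost of occasional held jobs that need a retry with more memory.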


              People

              Assignee:
              jbosch Jim Bosch
              Reporter:
              jbosch Jim Bosch
              Watchers:
              Jim Bosch, Monika Adamow
