Data Management / DM-29776

Attempt complete two-QG HSC RC2 run on w16 on one tract


    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: obs_subaru
    • Labels:
    • Story Points:
      8
    • Epic Link:
    • Team:
      Data Release Production
    • Urgent?:
      No

      Description

      We think w16 includes all changes needed to make the Gen3 HSC RC2 w14 run succeed, as well as a single-shot FGCM configuration that should require less handholding, and better pipeline subset definitions (DM-29737, DM-29615, DM-29750, DM-29348). Test this by trying to run tract=9813. Also use this run to measure actual peak memory usage and update our BPS configs accordingly.

        Attachments

        1. step1.yaml
          2 kB
        2. step2.yaml
          2 kB
        3. step3.yaml
          2 kB

          Issue Links

            Activity

            jbosch Jim Bosch added a comment -

            singleFrame subset worked (as expected; it already worked in w14).

            First attempt at multiVisit failed because of an exception raised in jointcal while trying to apply color terms:

              File "/software/lsstsw/stack_20210415/stack/miniconda3-py38_4.9.2-0.5.0/Linux64/jointcal/21.0.0-14-gc0d6d5c+36dc79dc4c/python/lsst/jointcal/jointcal.py", line 744, in run
                photometry = self._do_load_refcat_and_fit(associations, defaultFilter, center, radius,
              File "/software/lsstsw/stack_20210415/stack/miniconda3-py38_4.9.2-0.5.0/Linux64/jointcal/21.0.0-14-gc0d6d5c+36dc79dc4c/python/lsst/jointcal/jointcal.py", line 1229, in _do_load_refcat_and_fit
                refCat, fluxField = self._load_reference_catalog(refObjLoader, referenceSelector,
              File "/software/lsstsw/stack_20210415/stack/miniconda3-py38_4.9.2-0.5.0/Linux64/jointcal/21.0.0-14-gc0d6d5c+36dc79dc4c/python/lsst/jointcal/jointcal.py", line 1309, in _load_reference_catalog
                refCatName = refObjLoader.ref_dataset_name
            AttributeError: 'ReferenceObjectLoader' object has no attribute 'ref_dataset_name'
            

            This didn't happen in the w14 processing because the pre-review branches of DM-29615 disabled jointcal photometry and hence its color terms. During review, I was advised to try to disable jointcal photometry in the HSC configs, not just the Gen3 pipeline, but that got tricky and I ended up not disabling jointcal photometry at all. It looks like we do need to disable it in Gen3 for now (I'll do that on this ticket) and open new tickets both to disable it more properly via config and to fix the above error so it can be re-enabled if desired.
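
            For the follow-up "disable it more properly via config" work, the change would presumably be a per-task override along these lines. This is only a hedged sketch: it assumes JointcalConfig is importable from lsst.jointcal and exposes doPhotometry/doAstrometry flags; on this ticket the equivalent switch is flipped in the Gen3 pipeline definition instead.

            # Hedged sketch, not the change made on this ticket: turn off jointcal's
            # photometric fit (and hence its color terms) while keeping astrometry.
            from lsst.jointcal import JointcalConfig

            config = JointcalConfig()
            config.doPhotometry = False  # skip the photometric fit and its color terms
            config.doAstrometry = True   # keep the astrometric fit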

            jbosch Jim Bosch added a comment -

            A few more problems discovered and (mostly) fixed:

            • skyCorr was not in the singleFrame pipeline subset (fixed on the obs_subaru branch of this ticket);
            • fgcm uses more memory than expected, and in ways HTCondor doesn't recognize well (DM-29916, DM-29918; worked around by putting large numbers in BPS configs);
            • the RC2 raw collections didn't include the NB921 data in COSMOS (fixed in the repos, and in the gen3_shared_repo_admin branch for this ticket);

            I also measured the jointcal memory usage from the HTCondor logs, and came up with a max of 24G.
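
            (For anyone repeating that measurement, a rough sketch of one way to pull peak usage out of HTCondor job event logs is below; it assumes the usual "Memory (MB) : usage request allocated" resource lines that condor writes when a job terminates, which should be checked against the actual log format.)

            import re
            import sys
            from pathlib import Path

            # Hedged sketch: scan HTCondor job event logs and report the largest
            # memory usage value found, in MB.
            MEM_RE = re.compile(r"Memory \(MB\)\s*:\s*(\d+)")

            def peak_memory_mb(log_dir):
                peaks = []
                for log in Path(log_dir).glob("*.log"):
                    for line in log.read_text(errors="replace").splitlines():
                        match = MEM_RE.search(line)
                        if match:
                            peaks.append(int(match.group(1)))
                return max(peaks, default=None)

            if __name__ == "__main__":
                print(peak_memory_mb(sys.argv[1]))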

            I believe a complete multiVisit-subset run would now get through jointcal and FGCM, but would probably break later due to DM-29907. Since a fix for that seems imminent, and part of the reason for this ticket is to try running lots of things together, I'm going to wait a bit for that before kicking things off.

            jbosch Jim Bosch added a comment -

            DM-29916 and DM-29907 have now been fixed. Remaining problems include:

            One jointcal run failed under mysterious circumstances; the task log (.err) looks truncated, and the .log file (I'm not sure where that comes from) reports signal 7 (SIGBUS, a memory access violation). See https://lsstc.slack.com/archives/C4JQP6FRS/p1619619532281200 for more details.

            assembleCoadd writes nothing but does not raise when it has nothing to do, and that causes detection to raise an exception downstream. We have no good options for this right now (I consider this a high priority with my middleware hat on). Consider:

            • If AssembleCoadd raises, all downstream Tasks in the DAG will be pruned - including consolidateObjectTable (and anything else we have that's tract-wide in the future). This is what happens right now with detection raising when it can't find a coadd.
            • We could make AssembleCoadd do what makeWarp does, and write an empty exposure. We'd then have to make sure that satisfies everything that all downstream tasks expect, which probably means adding logic to make them short-circuit. This may be worth doing, but it's not a good day-of-the-weekly project, and it would need to be backed out when middleware provides a better solution anyway.
            • We could carry around a list of patch+band combinations we expect to fail, and pass them in as part of the data ID expression (see the sketch after this list). That would work, but it looks like about 20 patches (judging from the converted Gen2 outputs), and that's a big expression to carry around.
            • We could split the pipeline into three steps (singleFrame, jointcal+fgcm+makeWarp+assembleCoadd, remainder of multiVisit) instead of just two. Running QG gen on the third step after the second has completed will automatically prune out quanta that correspond only to the missing coadd, but not the merge-across-bands (or patches) steps. This is my preferred approach, and I'll try to work out a way to express that in the obs_subaru pipeline subsets on this ticket.
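
            To make the data-ID-expression option concrete, here is roughly what carrying that exclusion list around would look like (a hedged sketch; the patch/band values are placeholders, not the real ~20 missing combinations):

            # Hypothetical exclusion list for patch+band combinations with no coadd;
            # the entries below are placeholders, not the actual missing patches.
            missing = [
                (9813, 3, "N921"),
                (9813, 27, "N921"),
                # ... roughly 20 entries in practice
            ]

            clauses = [
                f"NOT (tract={tract} AND patch={patch} AND band='{band}')"
                for tract, patch, band in missing
            ]
            expression = "skymap='hsc_rings_v1' AND " + " AND ".join(clauses)
            print(expression)  # quickly becomes a very long data ID expression
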
            jbosch Jim Bosch added a comment -

            Yusra AlSayyad, I think this is ready for review. You can ignore the gen3_shared_repo_admin package, because it's in lsst-dm, but if you take a look at the PR it'll be a preview of next Wednesday's party-programming. I hope the obs_subaru changes are at least self-explanatory; they definitely aren't ideal, but I think they're the best we can do with the middleware we've got.

            Jenkins is still running, and I'm hoping to get some more BPS runs to test all of this - another important output of this ticket will be BPS configs with reasonably accurate requestMemory values for everything, but that part may not make the weekly.

            yusra Yusra AlSayyad added a comment -

            Looks good. Thanks for the clear exposition for each subset. I'm thinking about where #diffimDRP fits in; I definitely want that run on the w18 RC2 too.
            It requires that both makeWarp (for template generation) and mergeMeasurements (for the reference catalog for forcedPhotDiffim) be done first.

            yusra Yusra AlSayyad added a comment -

            And if you need a post-Jenkins late-night merge, I'd be happy to.

            jbosch Jim Bosch added a comment -

            Sounds like diffimDRP can probably go in step3, but I might vote for just making it a separate step4 (or running it as diffimDRP after step3) for now to make sure nothing unexpected pops up that brings down tasks in step3; we could then merge them together for w22 (if we even still need the step* subsets by then; I hope we don't, but that's an ambitious schedule for some middleware improvements).

            yusra Yusra AlSayyad added a comment - edited

            Also, a heads-up for your memory investigation: during pair-coding today, Sophie and I noticed that WriteObjectTable was acting strangely on /repo/main.

            pipetask run -b /repo/main -i HSC/runs/RC2/w_2021_14/DM-29528 -o u/yusra/objectTables -p $OBS_SUBARU_DIR/pipelines/DRP.yaml#writeObjectTable -d  "instrument='HSC' AND skymap='hsc_rings_v1' AND tract=9813 AND patch=23" --register-dataset-types --instrument lsst.obs.subaru.HyperSuprimeCam -j 1
            
            

            It just keeps climbing to 100GB and beyond while it's reading in the inputs (before the logger says it's starting the task). I killed it when top said:

            1491649 yusra     20   0  129.3g 125.5g  72772 R 100.0 49.9  12:07.66 python  
            

            In contrast, while running on /datasets/hsc/gen3repo/rc2w06_ssw06 with w_2021_17, writeObjectTable is quick and painless:

            (lsst-scipipe) [yusra@lsst-devl01 ~]$  /usr/bin/time -v pipetask run -b /datasets/hsc/gen3repo/rc2w06_ssw06 -i HSC/runs/RC2/w_2021_06  -o  u/yusra/object_test_again -p $OBS_SUBARU_DIR/pipelines/DRP.yaml#writeObjectTable   -d "instrument='HSC' AND skymap='hsc_rings_v1' AND tract=9813 AND patch=23 " --register-dataset-types --instrument lsst.obs.subaru.HyperSuprimeCam -j 1
            ctrl.mpexec.cmdLineFwk INFO: QuantumGraph contains 1 quanta for 1 tasks, graph ID: '1619653960.919665-1506427'
            conda.common.io INFO: overtaking stderr and stdout
            conda.common.io INFO: stderr and stdout yielding back
            ctrl.mpexec.singleQuantumExecutor INFO: Execution of task 'writeObjectTable' on quantum {skymap: 'hsc_rings_v1', tract: 9813, patch: 23} took 62.071 seconds
            ctrl.mpexec.mpGraphExecutor INFO: Executed 1 quanta, 0 remain out of total 1 quanta.
                    Command being timed: "pipetask run -b /datasets/hsc/gen3repo/rc2w06_ssw06 -i HSC/runs/RC2/w_2021_06 -o u/yusra/object_test_again -p /software/lsstsw/stack_20210415/stack/miniconda3-py38_4.9.2-0.5.0/Linux64/obs_subaru/21.0.0-32-g0ce1f32a+fd3c508698/pipelines/DRP.yaml#writeObjectTable -d instrument='HSC' AND skymap='hsc_rings_v1' AND tract=9813 AND patch=23  --register-dataset-types --instrument lsst.obs.subaru.HyperSuprimeCam -j 1"
                    User time (seconds): 57.62
                    System time (seconds): 13.84
                    Percent of CPU this job got: 94%
                    Elapsed (wall clock) time (h:mm:ss or m:ss): 1:15.29
                    Average shared text size (kbytes): 0
                    Average unshared data size (kbytes): 0
                    Average stack size (kbytes): 0
                    Average total size (kbytes): 0
                    Maximum resident set size (kbytes): 10290184
                    Average resident set size (kbytes): 0
                    Major (requiring I/O) page faults: 0
                    Minor (reclaiming a frame) page faults: 5767844
                    Voluntary context switches: 14906
                    Involuntary context switches: 448033
                    Swaps: 0
                    File system inputs: 0
                    File system outputs: 0
                    Socket messages sent: 0
                    Socket messages received: 0
                    Signals delivered: 0
                    Page size (bytes): 4096
                    Exit status: 0
            

            For reference, the gen2 version takes around 15GB per patch (remember, this is why we concatenate the narrow tables instead of these wide ones). It reads in 5 deepCoadd_meas, 5 deepCoadd_forced, and 1 deepCoadd_ref catalogs. (I think it holds the input afwTable in memory even after creating the DataFrame, which means I can probably cut memory in half for the next weekly.)
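
            (A hedged sketch of that halve-the-memory idea: drop each afw catalog as soon as its DataFrame copy exists, assuming the dict passed in holds the only references to the catalogs.)

            def catalogs_to_dataframes(catalogs):
                """Convert a dict of afw SourceCatalogs to pandas DataFrames,
                popping each catalog as soon as it has been converted so that only
                one copy of each input is live at a time (sketch; assumes the dict
                holds the only references to the afw tables).
                """
                frames = {}
                for key in list(catalogs):
                    frames[key] = catalogs.pop(key).asAstropy().to_pandas()
                return frames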

            jbosch Jim Bosch added a comment -

            Interesting, and something Dan Taranu and Huan Lin may be interested in - I think they saw some object-table weirdness in /repo/dc2 as well. I can't imagine what could be different about the repos, but I'll look into it when I get there.

            jbosch Jim Bosch added a comment -

            I've merged the ticket branches; now that Jenkins is green on this ticket, I figure any new problems I discover later are best addressed on new branches.  I'll keep this ticket open until I either attach BPS configs with memory values or decide to make a new ticket for that.

            yusra Yusra AlSayyad added a comment - edited

            OK, and I filed https://jira.lsstcorp.org/browse/DM-29943 to look at WriteObjectTable.

            w18 should be fine. WriteObjectTable's memory explosion should be fixed by DM-29907 too. I'm not going to try to merge anything tonight, but I'll use that ticket to make the joining a little more robust.

            jbosch Jim Bosch added a comment -

            I have just attached three BPS configs (step1.yaml, step2.yaml, step3.yaml) that can be used (after filling in a few personalization and ticket fields) to run the new step1, step2, and step3 pipeline subsets in obs_subaru (now in w_2021_18).  The first two contain requestMemory values that I have actually tested successfully, though some of these may be larger than they need to be (see comments in the files); testing for step3 is still ongoing.  It's fine (and possibly even advisable to maximize resource usage) to split up any of these into smaller pieces by:

            • which tasks are run together (fine for any of these steps)
            • which tracts are run at once (fine for any of these steps, but step1 should be restricted by changing the input collections, not by adding tract=X to the data ID expression)
            • which bands are run in step2 (each band is totally independent for that step; see the sketch below)

            It is not safe to run tasks from different steps together, or with data ID expressions that extend beyond the ones in these files.
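
            As an illustration of the per-band split, per-band data queries could be generated along these lines; this is a hedged sketch that only narrows (never extends) the data ID expression, and the band list and base constraint are assumptions that should be taken from the attached step2.yaml rather than from here.

            # Hypothetical per-band data queries for splitting a step2 submission.
            # The band list is an assumption (check the repo for the Gen3 name of
            # the NB0921 narrow band).
            BANDS = ["g", "r", "i", "z", "y"]

            def step2_data_queries(tract=9813, skymap="hsc_rings_v1"):
                base = f"instrument='HSC' AND skymap='{skymap}' AND tract={tract}"
                return {band: f"{base} AND band='{band}'" for band in BANDS}

            for band, query in step2_data_queries().items():
                print(band, "->", query)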

            jbosch Jim Bosch added a comment -

            Monika Adamow, as a heads-up for your w18 run, my step3 test did uncover a few problems:

            • one deblend job was held for OOM (with the 16G limit in my BPS configs);
            • 37 forcedPhotCoadd jobs failed (compared to 346 succeeded so far);
            • 4953 forcedPhotCcd jobs failed (compared to 9894 succeeded so far).

            I think the last one is entirely expected; we just haven't bothered to transform this failure into some kind of qualified success because there's no good way to do it and there's nothing downstream of forcedPhotCcd.  I will investigate the forcedPhotCoadd failures next week, as I'm going to focus on things that wouldn't benefit from being able to ask HTCondor/BPS questions of others today.  And since it's just one deblend job that exceeded that memory limit, I'll let you decide whether you want to handle that by changing your BPS configs, editing the job manually, or something else.
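
            (For scale, quick arithmetic on the counts above: roughly 10% of the completed forcedPhotCoadd quanta and about a third of the completed forcedPhotCcd quanta have failed so far.)

            # Quick arithmetic on the counts reported above ("so far" totals, so
            # these fractions are only approximate).
            for task, failed, succeeded in [
                ("forcedPhotCoadd", 37, 346),
                ("forcedPhotCcd", 4953, 9894),
            ]:
                print(f"{task}: {failed / (failed + succeeded):.1%} of completed jobs failed")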

            jbosch Jim Bosch added a comment -

            The forcedPhotCoadd failures look like they're all user-error on my part (I stupidly tinkered with my .lsst/db-auth.yaml to set up access on the RSP while they were running).

            To close out this ticket, here's my analysis of memory usage from the step3 tasks:

            • detection, mergeDetections, and mergeMeasurements never showed usage above the 2048 default - but in DM-29670 I definitely had to increase the request for mergeMeasurements to 4096 (the others still use the default), so I'm worried that one may be another case of inaccurate reporting, and it's probably best to leave it at 4096.
            • As discussed above, one deblend job needed more than the 16G in the step3 file right now.  Many others needed almost that much, so it's operator preference whether to bump that up or leave it and deal with the special one manually.  It doesn't look like we should bring it any lower, or there will be many more holds.
            • The measure tasks do seem to frequently require more than 4096, so the 8192 limit in the step3 file is not bad, but the max I saw was 5373, so an adventurous operator could try to reduce that a bit to squeeze more jobs in (this is a particularly slow step).  It's always possible that 5373 is underreported, though, and 8192 did work on everything (except the one downstream of the held deblend job, which I never ran - and there's a good chance that would also be the most memory-hungry measure quantum).
            • The max memory usage for forcedPhotCoadd was 3481 (run with an 8192 limit).  But there's the caveat that a lot of these didn't run.  Still, might be worth shrinking this down to 4096 next time and seeing how many holds we get (if any).
            • The max memory usage for forcedPhotCcd was 5113, so the 8192 limit I used is unfortunately probably a good one.  I'm actually quite surprised this is as high as it is; the images are smaller than the coadds by roughly a factor of two, and we shouldn't be doing what I thought would be the most memory-intensive step (deblending).  That's worth investigating on the algorithms side; I bet some deferred loads could help.  But for now we should leave this at 8192, I think.

            And finally, I've been rounding these numbers up to the nearest power of two (or a multiple of 4096 in some cases), but that's probably not quite ideal for trying to subdivide our machines.  There may be a better "unit" to round up to, but I'm not going to try to figure that out on this ticket.
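
            For reference, the rounding rule described above amounts to something like the following (a sketch only; it is not encoded anywhere in the attached configs):

            import math

            def round_up_request(peak_mb, unit_mb=None):
                """Round a measured peak (MB) up to the next power of two, or to the
                next multiple of unit_mb if one is given (e.g. 4096)."""
                if unit_mb:
                    return unit_mb * math.ceil(peak_mb / unit_mb)
                return 1 << math.ceil(math.log2(peak_mb))

            # e.g. round_up_request(5373) -> 8192, round_up_request(3481, 4096) -> 4096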


              People

              Assignee:
              jbosch Jim Bosch
              Reporter:
              jbosch Jim Bosch
              Reviewers:
              Yusra AlSayyad
              Watchers:
              Jim Bosch, John Parejko, Lee Kelvin, Monika Adamow, Yusra AlSayyad

                Dates

                Created:
                Updated:
                Resolved:

                  Jenkins

                  No builds found.