Data Management / DM-28885

makeWarp is a memory hog for large numbers of visits


    Details

    • Urgent?:
      No

      Description

      When running a pipetask that includes ISR, characterize, calibrate, and makeWarp, the last step appears to use way more memory than it ought to before crashing.

      I encountered this when I fed pipetask on the order of 100 visits in 5 filters, i.e. some 70,000 quanta, and naively ran it with pipetask run -j 16 on lsst-devl01. I checked back a day or two later and the run had died, so I restarted it (with --extend-run, --skip-existing, and --skip-init-writes). It had more failed quanta the second time, but enough successes that I let it go. Then I got into the regime of diminishing returns. It thinks there are over 20,000 quanta left to process, and the first warp it tries to make takes over 1000 seconds.

      Current thinking is that this happens on patches with maximal visit overlap, and that makeWarpTask needs to be more selective about which input visits it opens. See the Slack discussion for more: https://lsstc.slack.com/archives/C01FBUGM2CV/p1613608286458900


            Activity

            mgower Michelle Gower added a comment -

            In case it is helpful, I was checking memory usage of Monika Adamow's RC2 runs to make sure we're requesting enough memory for the jobs.  The production HTCondor cluster does not kill for memory usage (only kills at the OS level).   These runs are earlier in the year with older stack versions. There may be some way within the data collected by the pipeline to get similar information, but I looked at what HTCondor reported jobs were using.  makeWarp was the leading PipelineTask in using lots of memory.   The largest was 79.5GB for a single execution of makeWarp (makeWarp:{instrument: 'HSC', skymap: 'hsc_rings_v1', tract: 9615, patch: 20, visit: 24494, ...})  Full log output for this execution is located at /home/madamow/gen2-to-gen3/bps/submit/RC2/w_2021_02/DM-28282/20210119T162854Z/jobs/makeWarp/00022023.390394.err  Approximately 5 minutes into the job, the memory usage is already reported as over 70GB. 

            There are several others above 10GB (would need to do better digging to be careful about counting same dataID across RC2 reruns).   Let me know if I can provide more data.  If we need to look at same data after running RC2 with a newer software stack, we can do that too.  

            bps currently only runs a single Quantum in a job.  So the large memory usage isn't a memory leak across multiple Quantum executions.   

             

            mrawls Meredith Rawls added a comment -

            Thanks Michelle. We had some further discussion, and are pretty sure this is a gen3 thing (whereas I understand we are still running RC2 reprocessing in gen2 and then exporting to gen3) that will be improved once we have the ability to more easily specify which visits to use to build each warp. Right now, it chooses all visits within some (large!) radius of a patch center and opens them all. In contrast, gen2 runs a big loop over all the visits specified with --selectId and doesn't try to open every matching visit at once.
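
            The gen2-vs-gen3 difference described above can be sketched in a few lines of Python. This is an illustration only, not the actual MakeWarpTask code; load_exposure and warp_onto_patch are hypothetical stand-ins for a butler read and the warping step.

            ```python
            # Illustrative sketch only -- not actual MakeWarpTask code.
            # load_exposure() and warp_onto_patch() are hypothetical
            # stand-ins for a butler read and the warper.

            def load_exposure(visit):
                # Pretend each exposure is a large in-memory object.
                return {"visit": visit}

            def warp_onto_patch(exposure):
                # Placeholder "warp": just returns the visit id.
                return exposure["visit"]

            def make_warps_eager(visits):
                # Pre-fix gen3-like behavior: open every overlapping visit
                # up front, so peak memory scales with the number of visits.
                exposures = [load_exposure(v) for v in visits]  # all held at once
                return [warp_onto_patch(e) for e in exposures]

            def make_warps_streaming(visits):
                # Gen2-style loop: hold one exposure at a time, so peak
                # memory stays roughly constant regardless of visit count.
                for v in visits:
                    yield warp_onto_patch(load_exposure(v))

            warps = list(make_warps_streaming(range(100)))
            ```

            Both produce the same warps; only the peak number of exposures held in memory differs, which is why patches with maximal visit overlap were the ones that blew up.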

            mgower Michelle Gower added a comment -

            The info I provided was from Monika running RC2 in Gen3 (with the pipeline being only what was already converted to Gen3).  The Gen3 RC2 run is usually submitted after the Gen2 one has been converted (although there are discussions about increasing the frequency of the Gen3 runs).

            jbosch Jim Bosch added a comment -

            I think this was fixed by DM-20965, and on DM-29670 I'm going to sort-of test that theory by trying to build RC2 coadds with a BPS config that only gives makeWarp 4G of memory, instead of the 85G we seem to have been giving it before in some configs. Please let me know if you can already tell that's going to fail, or if you can already confirm that it works, or have a better (Gen2-based?) guess at how much memory makeWarp is likely to actually want.
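
            For reference, a per-task memory request in a BPS submit config looks roughly like the following. This is a sketch assuming the ctrl_bps convention of per-task overrides under a pipetask section, with requestMemory in MB; check the ctrl_bps documentation for the exact key names your version supports.

            ```yaml
            # Sketch of a BPS submit-config fragment (key names assumed
            # from the ctrl_bps per-task override convention).
            pipetask:
              makeWarp:
                requestMemory: 4096   # 4 GB, instead of the ~85 GB requested before
            ```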

            jbosch Jim Bosch added a comment -

            Confirmed on DM-29670 that this problem was fixed by DM-20695. Closing as a duplicate of that.


              People

              Assignee:
              Unassigned
              Reporter:
              mrawls Meredith Rawls
              Watchers:
              Brock Brendal [X] (Inactive), Ian Sullivan, James Chiang, Jim Bosch, John Parejko, Kenneth Herner, Meredith Rawls, Michelle Gower

