Details

Type: Bug
Status: Invalid
Resolution: Done
Fix Version/s: None
Component/s: pipe_tasks
Labels:
Urgent?: No

Description
When running a pipetask that includes ISR, characterize, calibrate, and makeWarp, the last step appears to use far more memory than it should before crashing.
I encountered this when I fed pipetask on the order of 100 visits in 5 filters, i.e. some 70,000 quanta, and naively ran it with pipetask run -j 16 on lsst-devl01. I checked back a day or two later and found that the run had died, so I restarted it (with --extend-run, --skip-existing, and --skip-init-writes). There were more failed quanta the second time, but enough succeeded that I let it keep going. Then I got into the regime of diminishing returns: it thinks there are over 20,000 quanta left to process, and the first warp it tries to make takes over 1000 seconds.
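For reference, the invocations were roughly of this form (the repository path, collections, and pipeline file below are placeholders for illustration, not the actual values used):

# initial run (placeholder repo, collections, and pipeline)
pipetask run -j 16 -b /repo/main -i HSC/defaults -o u/user/drp-run -p my_pipeline.yaml

# restart after the crash, extending the same output run and skipping completed quanta
pipetask run -j 16 -b /repo/main -o u/user/drp-run -p my_pipeline.yaml --extend-run --skip-existing --skip-init-writes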
Current thinking is that this happens on patches with maximal visit overlap and that MakeWarpTask needs to be more selective about opening all of the input visits at once. See the Slack discussion for more: https://lsstc.slack.com/archives/C01FBUGM2CV/p1613608286458900
In case it is helpful, I was checking memory usage of Monika Adamow's RC2 runs to make sure we're requesting enough memory for the jobs. (The production HTCondor cluster does not kill jobs for memory usage; jobs are only killed at the OS level.) These runs are from earlier in the year with older stack versions. There may be some way to get similar information from the data collected by the pipeline itself, but I looked at what HTCondor reported the jobs were using. makeWarp was the leading PipelineTask in memory consumption. The largest was 79.5 GB for a single execution of makeWarp (makeWarp:{instrument: 'HSC', skymap: 'hsc_rings_v1', tract: 9615, patch: 20, visit: 24494, ...}). The full log output for this execution is located at /home/madamow/gen2-to-gen3/bps/submit/RC2/w_2021_02/DM-28282/20210119T162854Z/jobs/makeWarp/00022023.390394.err. Approximately 5 minutes into the job, the memory usage is already reported as over 70 GB.

There are several other executions above 10 GB (careful counting would require checking that the same data ID is not counted more than once across RC2 reruns). Let me know if I can provide more data. If we need to look at the same data after running RC2 with a newer software stack, we can do that too.
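If anyone wants to pull similar numbers themselves, one way (a hedged sketch, not necessarily how the figures above were gathered; the constraint value is illustrative, and MemoryUsage is HTCondor's reported usage in megabytes) is to query the HTCondor job history:

condor_history -constraint 'MemoryUsage > 10000' -af ClusterId ProcId MemoryUsage RequestMemory Cmd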
bps currently runs only a single Quantum per job, so the large memory usage isn't a memory leak across multiple Quantum executions.