  1. Data Management
  2. DM-13667

singleFrameDriver of the HSC PDR1 dataset


    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      singleFrameDriver/ProcessCcd of the entire HSC PDR1 dataset

        Attachments

        1. fsfm-deep-136273.out
          46 kB
        2. fsfm-udeep-136275.out
          82 kB
        3. fsfm-wide-136933.out
          1.02 MB

          Issue Links

            Activity

            hchiang2 Hsin-Fang Chiang added a comment -

            Initially I attempted to run each visit as its own singleFrameDriver job. One advantage of doing so is that each visit gets its own log file. However, many jobs took much longer than expected and timed out. The IO wait time on GPFS was excessively high (e.g. ~5 sec), likely because too many nodes were simultaneously placing locks on the same folders, and sometimes on the same files. I therefore changed strategy and grouped multiple visits into fewer singleFrameDriver calls.

            hchiang2 Hsin-Fang Chiang added a comment -

            The modified strategy was to group all UDEEP visits into one singleFrameDriver job, DEEP visits into 7 jobs, and WIDE visits into 149 jobs. These jobs were submitted around the Apr 21 weekend.
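
            A minimal sketch of how visits could be grouped into a fixed number of jobs. It is not the script used for this run: the function names and the example visit numbers are hypothetical, and it assumes the command-line parser accepts `^`-separated values in a single `--id` argument.

```python
from typing import List, Sequence


def split_into_groups(visits: Sequence[int], n_groups: int) -> List[List[int]]:
    """Split visits into contiguous chunks whose sizes differ by at most one."""
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    base, extra = divmod(len(visits), n_groups)
    groups: List[List[int]] = []
    start = 0
    for i in range(n_groups):
        # The first `extra` groups take one additional visit each.
        size = base + (1 if i < extra else 0)
        if size:
            groups.append(list(visits[start:start + size]))
        start += size
    return groups


def id_argument(visits: Sequence[int]) -> str:
    """Format one group as a single --id argument (assumes '^' separates values)."""
    return "--id visit=" + "^".join(str(v) for v in visits)


# Hypothetical visit numbers, for illustration only.
example_visits = [1000, 1002, 1004, 1006, 1008, 1010, 1012]
for group in split_into_groups(example_visits, 3):
    print(id_argument(group))
```

            Fewer, larger jobs mean fewer processes contending for the same directories at once, at the cost of sharing one log file across many visits.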

            Each WIDE singleFrameDriver job used 3 nodes. For unknown reasons, a substantial fraction (~30%) of jobs failed because slurm could not launch them. This happened after singleFrameDriver.py had successfully submitted the job to slurm and the job had waited in the queue: once its turn came, the job attempted to start but failed to launch. The failure wasn't restricted to one specific worker node.

            Log records like the following were seen:

            srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
            srun: error: Task launch for 136725.0 failed on node lsst-verify-worker03: Socket timed out on send/recv operation
            srun: error: Application launch failed: Socket timed out on send/recv operation
            srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
            slurmstepd: error: *** STEP 136725.0 ON lsst-verify-worker02 CANCELLED AT 2018-04-22T20:53:48 ***
            srun: error: lsst-verify-worker02: task 0: Killed
            srun: error: lsst-verify-worker04: task 2: Killed
            [mpiexec@lsst-verify-worker02] control_cb (pm/pmiserv/pmiserv_cb.c:208): assert (!closed) failed
            [mpiexec@lsst-verify-worker02] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
            [mpiexec@lsst-verify-worker02] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
            [mpiexec@lsst-verify-worker02] main (ui/mpich/mpiexec.c:340): process manager error waiting for completion
            

            All failures were resubmitted iteratively; many failed again along the way, but eventually all were pushed through.

            DM-14181 was filed about this issue.
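
            A minimal sketch of how jobs needing resubmission could be picked out by scanning their logs for the launch-failure signature shown above. The job names and the dictionary-of-lines input are hypothetical; only the error strings come from the log excerpt.

```python
import re
from typing import Dict, Iterable, List

# Signatures taken from the srun error lines in the log excerpt.
LAUNCH_FAILURE = re.compile(
    r"srun: error: (Application launch failed|Task launch for \S+ failed)"
)


def has_launch_failure(log_lines: Iterable[str]) -> bool:
    """Return True if any line matches the slurm launch-failure signature."""
    return any(LAUNCH_FAILURE.search(line) for line in log_lines)


def jobs_to_resubmit(logs: Dict[str, List[str]]) -> List[str]:
    """Return the sorted names of jobs whose logs show a launch failure."""
    return sorted(name for name, lines in logs.items() if has_launch_failure(lines))


# Hypothetical job names; the first log reuses lines from the excerpt.
example_logs = {
    "job-a": [
        "srun: error: slurm_receive_msgs: Socket timed out on send/recv operation",
        "srun: error: Application launch failed: Socket timed out on send/recv operation",
    ],
    "job-b": ["processCcd INFO: Processing data ID"],
}
print(jobs_to_resubmit(example_logs))
```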

            hchiang2 Hsin-Fang Chiang added a comment -

            In singleFrameDriver/processCcd, there were reproducible failures in 14 CCDs in the UDEEP layer, 6 CCDs in the DEEP layer, and 120 CCDs in the WIDE layer. Their data IDs are:

            UDEEP

            --id visit=17934 ccd=1 --id visit=19712 ccd=33 --id visit=23596 ccd=6 --id visit=37828 ccd=101 --id visit=38494 ccd=4 --id visit=38494 ccd=6 --id visit=38494 ccd=11 --id visit=38494 ccd=32 --id visit=38494 ccd=37 --id visit=38494 ccd=43 --id visit=38494 ccd=54 --id visit=38494 ccd=80 --id visit=38494 ccd=96 --id visit=38494 ccd=102
            
            

            DEEP

            --id visit=19702 ccd=12 --id visit=22640 ccd=10 --id visit=24342 ccd=102 --id visit=9664 ccd=94 --id visit=36842 ccd=94 --id visit=15206 ccd=100
            
            

            WIDE

            --id visit=6342 ccd=11 --id visit=6478 ccd=99 --id visit=6528 ccd=24 --id visit=6528 ccd=59 --id visit=6528 ccd=67 --id visit=6528 ccd=75 --id visit=6542 ccd=96 --id visit=7344 ccd=67 --id visit=7356 ccd=96 --id visit=7408 ccd=84 --id visit=9708 ccd=99 --id visit=9736 ccd=67 --id visit=9748 ccd=96 --id visit=9838 ccd=101 --id visit=9868 ccd=76 --id visit=11406 ccd=22 --id visit=11414 ccd=66 --id visit=11582 ccd=76 --id visit=11614 ccd=101 --id visit=11640 ccd=70 --id visit=13166 ccd=20 --id visit=13178 ccd=91 --id visit=13182 ccd=101 --id visit=13198 ccd=84 --id visit=13198 ccd=85 --id visit=13198 ccd=90 --id visit=13198 ccd=91 --id visit=13288 ccd=84 --id visit=15096 ccd=47 --id visit=15096 ccd=54 --id visit=16064 ccd=101 --id visit=17670 ccd=24 --id visit=17672 ccd=24 --id visit=17692 ccd=7 --id visit=17692 ccd=8 --id visit=17722 ccd=20 --id visit=17736 ccd=63 --id visit=17738 ccd=69 --id visit=17750 ccd=58 --id visit=19394 ccd=24 --id visit=19414 ccd=8 --id visit=19454 ccd=20 --id visit=19468 ccd=69 --id visit=19646 ccd=2 --id visit=25894 ccd=68 --id visit=25956 ccd=76 --id visit=25968 ccd=65 --id visit=26054 ccd=43 --id visit=27068 ccd=96 --id visit=29378 ccd=70 --id visit=29898 ccd=99 --id visit=29916 ccd=99 --id visit=29936 ccd=66 --id visit=29942 ccd=96 --id visit=29966 ccd=103 --id visit=30588 ccd=98 --id visit=31410 ccd=73 --id visit=32506 ccd=3 --id visit=32506 ccd=8 --id visit=33824 ccd=56 --id visit=33862 ccd=8 --id visit=33890 ccd=61 --id visit=33934 ccd=95 --id visit=33950 ccd=52 --id visit=33950 ccd=60 --id visit=34268 ccd=103 --id visit=34332 ccd=61 --id visit=34334 ccd=61 --id visit=34348 ccd=100 --id visit=34410 ccd=58 --id visit=34412 ccd=78 --id visit=34634 ccd=61 --id visit=34636 ccd=61 --id visit=34684 ccd=58 --id visit=34748 ccd=85 --id visit=34928 ccd=61 --id visit=34930 ccd=61 --id visit=34934 ccd=101 --id visit=34936 ccd=50 --id visit=34938 ccd=95 --id visit=35852 ccd=8 --id visit=35862 ccd=61 --id visit=35882 ccd=19 --id visit=35892 ccd=12 --id visit=35894 ccd=86 --id visit=35908 ccd=28 --id visit=35916 ccd=50 --id visit=35942 ccd=4 --id visit=35942 ccd=64 --id visit=35948 ccd=72 --id visit=35966 ccd=60 --id visit=36178 ccd=98 --id visit=36216 ccd=6 --id visit=36264 ccd=94 --id visit=36604 ccd=81 --id visit=37532 ccd=33 --id visit=37538 ccd=100 --id visit=37552 ccd=12 --id visit=37988 ccd=33 --id visit=38316 ccd=11 --id visit=38328 ccd=91 --id visit=38330 ccd=8 --id visit=38346 ccd=8 --id visit=38912 ccd=86 --id visit=42218 ccd=102 --id visit=42222 ccd=102 --id visit=42454 ccd=17 --id visit=42454 ccd=24 --id visit=42510 ccd=77 --id visit=42534 ccd=65 --id visit=44050 ccd=94 --id visit=44060 ccd=31 --id visit=44090 ccd=27 --id visit=44154 ccd=66 --id visit=44160 ccd=55 --id visit=44162 ccd=61 --id visit=45262 ccd=64 --id visit=45348 ccd=64 --id visit=45940 ccd=47 --id visit=46892 ccd=64
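
            A minimal sketch of how the `--id visit=... ccd=...` lists above could be parsed into (visit, ccd) pairs, for example to count failures per visit. The helper name is hypothetical; the sample string is the DEEP list quoted above.

```python
import re
from collections import Counter
from typing import List, Tuple

# Matches one "--id visit=N ccd=M" entry; \s+ tolerates line breaks inside an entry.
DATA_ID = re.compile(r"--id\s+visit=(\d+)\s+ccd=(\d+)")


def parse_data_ids(text: str) -> List[Tuple[int, int]]:
    """Extract (visit, ccd) pairs from a string of --id arguments."""
    return [(int(visit), int(ccd)) for visit, ccd in DATA_ID.findall(text)]


deep = (
    "--id visit=19702 ccd=12 --id visit=22640 ccd=10 --id visit=24342 ccd=102 "
    "--id visit=9664 ccd=94 --id visit=36842 ccd=94 --id visit=15206 ccd=100"
)
deep_ids = parse_data_ids(deep)
failures_per_visit = Counter(visit for visit, _ in deep_ids)
print(len(deep_ids), failures_per_visit.most_common(1))
```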
            

            Out of the 140 failures:

            • 58 failed with "Unable to match sources"
            • 31 failed with "No matches to use for photocal"
            • 12 failed with "No objects passed our cuts for consideration as psf stars"
            • 2 failed with "Unable to measure aperture correction for required algorithm 'modelfit_CModel_exp': only [01] sources, but require at least 2."
            • 37 failed with "InvalidParameterError", which traces back to "PSF star selector found [123] candidates" in processCcd.charImage.measurePsf, followed by:

                File "/software/lsstsw/stack3_20171023/stack/miniconda3-4.3.21-10a4fa6/Linux64/pipe_tasks/15.0-5-g389937dc+5/python/lsst/pipe/tasks/characterizeImage.py", line 413, in characterize
                  psfSigma = psf.computeShape().getDeterminantRadius()
              lsst.pex.exceptions.wrappers.InvalidParameterError:
                File "src/PsfexPsf.cc", line 221, in virtual std::shared_ptr<lsst::afw::image::Image<double> > lsst::meas::extensions::psfex::PsfexPsf::_doComputeImage(const Point2D&, const lsst::afw::image::Color&, const Point2D&) const
                  Only spatial variation (ndim == 2) is supported; saw 0 {0}
              lsst::pex::exceptions::InvalidParameterError: 'Only spatial variation (ndim == 2) is supported; saw 0'
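
            A minimal sketch of how failure messages could be binned into the five categories listed above. The category keys are hypothetical labels, the patterns are copied from the quoted messages, and the tallies are the ones reported in this comment rather than recomputed from logs.

```python
import re

# (label, pattern) pairs; labels are hypothetical, patterns come from the messages above.
CATEGORIES = [
    ("match", re.compile(r"Unable to match sources")),
    ("photocal", re.compile(r"No matches to use for photocal")),
    ("psf_stars", re.compile(r"No objects passed our cuts for consideration as psf stars")),
    ("apcorr", re.compile(r"Unable to measure aperture correction")),
    ("psfex_ndim", re.compile(r"Only spatial variation \(ndim == 2\) is supported")),
]


def classify(message: str) -> str:
    """Return the label of the first matching category, or 'other'."""
    for label, pattern in CATEGORIES:
        if pattern.search(message):
            return label
    return "other"


# Counts as reported above; they should add up to the 140 failed CCDs.
reported = {"match": 58, "photocal": 31, "psf_stars": 12, "apcorr": 2, "psfex_ndim": 37}
print(sum(reported.values()))
print(classify("Only spatial variation (ndim == 2) is supported; saw 0"))
```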
              

            hchiang2 Hsin-Fang Chiang added a comment -

            Three log files from retrying the failed ProcessCcd runs are attached, one each for DEEP, UDEEP, and WIDE.

            hchiang2 Hsin-Fang Chiang added a comment - - edited

            Full logs, including 314 files for 157 slurm jobs, are copied to

            /datasets/hsc/repo/rerun/DM-13666/DEEP/logs/singleFrame/
            /datasets/hsc/repo/rerun/DM-13666/UDEEP/logs/singleFrame/
            /datasets/hsc/repo/rerun/DM-13666/WIDE/logs/singleFrame/
            

            hchiang2 Hsin-Fang Chiang added a comment -

            The singleFrameDriver outputs have been made immutable. 


              People

              Assignee:
              hchiang2 Hsin-Fang Chiang
              Reporter:
              hchiang2 Hsin-Fang Chiang
              Watchers:
              Hsin-Fang Chiang
