DM-36151: Create gen3 HSC repo on tiger2-sumire


    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Story Points:
      36
    • Team:
      Data Release Production
    • Urgent?:
      No

      Description

      This ticket collects effort to set up a gen3 repo on the tiger2-sumire machine at Princeton. A recent HSC RC2 run will be ingested, containing all data from raws through to final coadd outputs.

      Summary (see the command sketch after this list):

      1. Set up a PostgreSQL Server
      2. Create a butler repo and register the HSC instrument
      3. Register the skyMap
      4. Write curated calibrations
      5. Ingest reference catalogues
      6. Ingest raw HSC data
      7. Import Sky Frames
      8. Import yBackgrounds
      9. Import flats
      10. Import biases
      11. Import darks
      12. Import fringe data
      13. Set up 'HSC/calib' CHAINED collection
      14. Import brightObjectMask data
      15. Set up 'HSC/masks' CHAINED collection
      16. Import fgcmLookUpTable data
      17. Set up 'HSC/fgcmcal/lut/RC2' CHAINED collection
      18. Set up 'HSC/defaults' CHAINED collection
      19. Set up RC2 TAGGED collections
      20. Set up 'HSC/raw/RC2' CHAINED collection
      21. Set up 'HSC/RC2/defaults' CHAINED collection
      22. Define visits
      23. Run a small step 1 test run on the head node
      24. Run a small step 1 test run using BPS
      25. Run RC2 step 1
      26. Run RC2 step 2
      27. Run RC2 step 3
      28. Summary
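
      A minimal sketch of the kind of butler commands behind several of these steps is shown below. This is illustrative only: collection names follow the ticket, the seed config, skymap config and raw data paths are placeholders, and reference catalogue ingestion is omitted.

      # Steps 1-2: create the repo (registry in PostgreSQL via a seed config) and register HSC
      butler create $REPO --seed-config butler-seed.yaml
      butler register-instrument $REPO lsst.obs.subaru.HyperSuprimeCam
      # Steps 3-4: register the skyMap and write curated calibrations
      butler register-skymap $REPO -C skymap_config.py
      butler write-curated-calibrations $REPO HSC
      # Steps 6 and 22: ingest raw HSC data and define visits
      butler ingest-raws $REPO /path/to/raw/data
      butler define-visits $REPO HSC
      # Steps 13 and 18: build CHAINED collections, e.g. the overall defaults collection
      butler collection-chain $REPO HSC/defaults HSC/raw/all HSC/calib refcats skymaps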

      This ticket is going on a PREOPS epic because the work was directly generated by the loss of our construction data facility and by the USDF not being available to all project developers. Not all team members could get an account at the USDF because of their nationality; we therefore need an equivalent space at Princeton until SLAC can provide accounts for all project developers.



            Activity

            lskelvin Lee Kelvin added a comment - edited

            Run RC2 step 2

            source /projects/HSC/LSST/stack/loadLSST.sh
            setup lsst_distrib -t w_2022_40
            

            REPO=/projects/HSC/repo/main
            GPFSDIR=/scratch/gpfs/$USER
            LOGDIR=$GPFSDIR/logs
            

            Step 2 submit:

            cd $GPFSDIR; \
            BPSYAML=/projects/HSC/LSST/bps/bps_01h_10cores.yaml; \
            export OMP_NUM_THREADS=1; \
            export NUMEXPR_MAX_THREADS=1; \
            LOGFILE=$LOGDIR/hsc_runs_rc2_w_2022_40_DM-36151_step2.log; \
            DATAQUERY="band != 'N921'"; \
            date | tee -a $LOGFILE; \
            $(which time) -f "Total runtime: %E" \
            bps submit $BPSYAML \
            -b $REPO \
            -i HSC/RC2/defaults \
            -o HSC/runs/RC2/w_2022_40/DM-36151 \
            -p $DRP_PIPE_DIR/pipelines/HSC/DRP-RC2.yaml#step2 \
            -d "instrument='HSC' AND $DATAQUERY" \
            2>&1 | tee -a $LOGFILE; \
            date | tee -a $LOGFILE
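
            While a submission like this is in flight, progress can be followed with standard Slurm tools and by tailing the log file, for example:

            squeue -u $USER        # jobs parsl has queued on our behalf
            tail -f $LOGFILE       # follow the bps output as it arrives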
            

            Incomplete step 2 results (Total runtime: 43:10:40):

            $ ~lkelvin/software/task_times.py \
            > /scratch/gpfs/lkelvin/submit/HSC/runs/RC2/w_2022_40/DM-36151/20221019T210526Z
             
            Concatenating BPS log files... done!
             
            QuantumGraph contains 1658 quanta for 10 tasks.
             
            task                              time                  pass     fail     skip  
            --------------------------------------------------------------------------------
            finalizeCharacterization          720085.71s (~65%)     267      0        0     
            skyCorr                           371484.09s (~33%)     364      0        0     
            consolidateVisitSummary           14791.66s (~1%)       404      0        0     
            fgcmFitCycle                      2909.72s (~0%)        1        0        0     
            consolidatePreSourceTable         2406.27s (~0%)        404      0        0     
            fgcmBuildStarsTable               1379.23s (~0%)        1        0        0     
            isolatedStarAssociation           680.98s (~0%)         37       0        0     
            fgcmOutputProducts                40.55s (~0%)          1        0        0     
            makeCcdVisitTable                 39.55s (~0%)          1        0        0     
            makeVisitTable                    31.26s (~0%)          1        0        0     
            --------------------------------------------------------------------------------
            total                             1113849.03s           1481     0        0     
             
            Executed 1481 quanta out of a total of 1658 quanta (~89%).
            

            Step 2 afterburner:

            LOGFILE=$LOGDIR/hsc_runs_rc2_w_2022_40_DM-36151_step2_afterburner.log; \
            DATAQUERY="band != 'N921'"; \
            date | tee -a $LOGFILE; \
            $(which time) -f "Total runtime: %E" \
            pipetask --long-log run --register-dataset-types -j 5 \
            -b $REPO \
            -i HSC/RC2/defaults,HSC/runs/RC2/w_2022_40/DM-36151 \
            --output-run HSC/runs/RC2/w_2022_40/DM-36151/20221019T210526Z \
            -p $DRP_PIPE_DIR/pipelines/HSC/DRP-RC2.yaml#step2 \
            -d "instrument='HSC' AND $DATAQUERY" \
            --skip-existing-in HSC/runs/RC2/w_2022_40/DM-36151 --extend-run --clobber-outputs \
            2>&1 | tee -a $LOGFILE; \
            date | tee -a $LOGFILE
            

            Step 2 final results using task_check:

                      tasks           expected   completed   missing
            ------------------------- -------- ------------- -------
                              skyCorr      404 404 (~100.0%)       0
            consolidatePreSourceTable      404 404 (~100.0%)       0
              consolidateVisitSummary      404 404 (~100.0%)       0
              isolatedStarAssociation       37  37 (~100.0%)       0
                    makeCcdVisitTable        1   1 (~100.0%)       0
                  fgcmBuildStarsTable        1   1 (~100.0%)       0
                       makeVisitTable        1   1 (~100.0%)       0
             finalizeCharacterization      404 404 (~100.0%)       0
                         fgcmFitCycle        1   1 (~100.0%)       0
                   fgcmOutputProducts        1   1 (~100.0%)       0
            

            lskelvin Lee Kelvin added a comment - edited

            Run RC2 step 3

            source /projects/HSC/LSST/stack/loadLSST.sh
            setup lsst_distrib -t w_2022_40
            

            REPO=/projects/HSC/repo/main
            GPFSDIR=/scratch/gpfs/$USER
            LOGDIR=$GPFSDIR/logs
            

            Step 3 submit:

            cd $GPFSDIR; \
            BPSYAML=/projects/HSC/LSST/bps/bps_05h_10cores.yaml; \
            export OMP_NUM_THREADS=1; \
            export NUMEXPR_MAX_THREADS=1; \
            LOGFILE=$LOGDIR/hsc_runs_rc2_w_2022_40_DM-36151_step3.log; \
            DATAQUERY="band!='N921' AND skymap='hsc_rings_v1' AND tract IN (9615,9697,9813)"; \
            date | tee -a $LOGFILE; \
            $(which time) -f "Total runtime: %E" \
            bps submit $BPSYAML \
            -b $REPO \
            -i HSC/RC2/defaults \
            -o HSC/runs/RC2/w_2022_40/DM-36151 \
            -p $DRP_PIPE_DIR/pipelines/HSC/DRP-RC2.yaml#step3 \
            -d "instrument='HSC' AND $DATAQUERY" \
            2>&1 | tee -a $LOGFILE; \
            date | tee -a $LOGFILE
            

            Step 3 afterburner:

            LOGFILE=$LOGDIR/hsc_runs_rc2_w_2022_40_DM-36151_step3_afterburner.log; \
            DATAQUERY="band!='N921' AND skymap='hsc_rings_v1' AND tract IN (9615,9697,9813)"; \
            date | tee -a $LOGFILE; \
            $(which time) -f "Total runtime: %E" \
            pipetask --long-log run --register-dataset-types -j 5 \
            -b $REPO \
            -i HSC/RC2/defaults,HSC/runs/RC2/w_2022_40/DM-36151 \
            --output-run HSC/runs/RC2/w_2022_40/DM-36151/20221028T214041Z \
            -p $DRP_PIPE_DIR/pipelines/HSC/DRP-RC2.yaml#step3 \
            -d "instrument='HSC' AND $DATAQUERY" \
            --skip-existing-in HSC/runs/RC2/w_2022_40/DM-36151 --extend-run --clobber-outputs \
            2>&1 | tee -a $LOGFILE; \
            date | tee -a $LOGFILE
            

            Step 3 final results using task_check:

            $ ~lkelvin/software/task_check.py $REPO $RUN3 -q $QGRAPH3
                    tasks          expected   completed    missing
            ---------------------- -------- -------------- -------
            selectGoodSeeingVisits     1203  1203 (100.0%)       0
                          jointcal       15    15 (100.0%)       0
                          makeWarp    17371 17371 (100.0%)       0
                     assembleCoadd     1203  1203 (100.0%)       0
                       templateGen     1203  1203 (100.0%)       0
                         detection     1203  1202 (~99.9%)       1
            healSparsePropertyMaps       15    15 (100.0%)       0
                   mergeDetections      241   240 (~99.6%)       1
                           deblend      241   240 (~99.6%)       1
                           measure     1203  1198 (~99.6%)       5
                 mergeMeasurements      241   240 (~99.6%)       1
                   forcedPhotCoadd     1203  1198 (~99.6%)       5
                  writeObjectTable      241   240 (~99.6%)       1
              transformObjectTable      241   240 (~99.6%)       1
            consolidateObjectTable        3     3 (100.0%)       0
                             TOTAL    25827 25811 (~99.9%)      16
            

            lskelvin Lee Kelvin added a comment - edited

            Summary

            Now that step 3 has finished, I think I'm ready to call this ticket complete.

            In summary, we have set up an HSC repo on Tiger in /projects/HSC/repo/main. All HSC data on disk has been ingested into HSC/raw/all. All calibs have been set up, and the standard HSC RC2 defaults input collection has been established under HSC/RC2/defaults.
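
            For anyone wanting to inspect the new repo, the available collections can be listed directly on the command line (a minimal sketch; the same information is available via the Butler registry in Python):

            source /projects/HSC/LSST/stack/loadLSST.sh
            setup lsst_distrib -t w_2022_40
            butler query-collections /projects/HSC/repo/main "HSC/*"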

            Using w_2022_40, a full HSC RC2 run of steps 1, 2 and 3 has taken place, with only a handful of dataset types ultimately missing. Many of these missing dataset types are known failure modes. Indeed, the w_2022_40 run on disk at the USDF is missing 47 patches in its final step 3 outputs, so the w_2022_40 outputs on disk at Tiger are more complete than any w_2022_40 run yet available.

            I will shortly be updating the Dev Guide with information on how to access this repo, and providing a summary of the above available collections.

            Note: Getting BPS/parsl working at scale on the Princeton clusters has been a bit of trial and error. One important lesson learned is that the total requested memory must also account for OS memory usage. For example, the Princeton cluster consists of ~1000 nodes, ~90% of which have 192 GB of memory, with the remaining nodes on a significantly higher 768 GB. If BPS/parsl requests a maximum memory value of 192 GB, the Slurm scheduler will look for nodes with 192 GB plus a few extra GB of memory to cover OS usage, i.e., none of the 192 GB nodes will be considered for scheduling. In practice, it is strongly recommended that BPS/parsl users at Princeton limit themselves to maximum memory requests of 189 GB (edit: 187 GB, see DM-37049) or less if they want access to the main bulk of the cluster nodes.
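
            As a quick sanity check before choosing a memory request, the per-node resources that Slurm actually advertises can be listed with, for example:

            # count nodes by (CPUS, MEMORY); MEMORY is reported in MB
            sinfo -N -h -o "%c %m" | sort | uniq -c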

            Sophie Reed, as one of the active users of the Tiger HSC repo so far, would you have time to review this rather long ticket?

            sophiereed Sophie Reed added a comment

            I've been through all of this and it looks useful and complete. I've also played with this data, as has my student, and found the instructions very helpful. Thank you Lee for the effort you put into this.

            lskelvin Lee Kelvin added a comment

            Thanks Sophie Reed, much appreciated. Following the merge of this ticket's branch in lsst-dm, which contains updated Dev Guide instructions on how to access these resources, I'm happy to finally consider this ticket done. I'll now remove all temporary data stored on-disk in our shared /scratch/gpfs space.

            Thank you again for the review of this long ticket, much appreciated!


              People

              Assignee:
              lskelvin Lee Kelvin
              Reporter:
              lskelvin Lee Kelvin
              Reviewers:
              Sophie Reed
              Watchers:
              Lee Kelvin, Sophie Reed

                Dates

                Created:
                Updated:
                Resolved:

                  Jenkins

                  No builds found.