Data Management · DM-35492

Rerun RC2 w_2022_40 with FitAffineWcs on
Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Story Points: 32
    • Sprints: DRP S22B, DRP S23A
    • Team: Data Release Production
    Description

      Rerun RC2 with FitAffineWcs on:

      calibrate.astrometry.wcsFitter.retarget(FitAffineWcsTask)
      

      The end point will be a collection for analysis. We only need to run through coadd processing, i.e., only through step 3.

      Include BPS reports for steps 1, 2, and 3.

      This is the first step in commissioning FitAffineWcs for single-frame processing. Story points include time learning BPS.
       

      Attachments

        1. bps_hsc_rc2_step1.yaml
          1.0 kB
        2. bps_hsc_rc2_step2.yaml
          1 kB
        3. bps_hsc_rc2_step3.yaml
          1 kB
        4. cfg_affine_step1.yaml
          0.4 kB


          Activity

            yusra Yusra AlSayyad added a comment

            Ticket renamed to reflect that the w_2022_28 rerun at NCSA disappeared and was rerun on tiger with w_2022_40. Lee ran the corresponding main w_2022_40 collection.
            erfan Erfan Nourbakhsh added a comment (edited)

            Generally following the directions provided for the RC2 step 1 through step 3 runs in DM-36151, but with FitAffineWcs on.

            .bashrc

            SCIPIPE=/projects/HSC/LSST/stack/loadLSST.sh
            scipipe410 () { LSST_CONDA_ENV_NAME=lsst-scipipe-4.1.0 source "$SCIPIPE" ; }
            alias egset='eups list | sed -e "s/^lsst_distrib/0&/" -e t -e "s/^/1/" | sort | sed "s/^.//" | grep setup | grep -E "lsst_distrib|LOCAL"'
            setlsst40 () { setup lsst_distrib -t w_2022_40 ; egset ; }
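The `egset` alias uses a small sed/sort trick to float `lsst_distrib` to the top of an otherwise alphabetical `eups list`: lines starting with `lsst_distrib` are prefixed with `0`, all other lines with `1`, the result is sorted, and the prefix is stripped again. A minimal standalone sketch of just that trick, with made-up package lines standing in for `eups list` output (the grep filtering in the real alias is omitted):

```shell
# Sample lines standing in for `eups list` output (package name first).
printf '%s\n' \
  'afw          g1234+abc    setup' \
  'lsst_distrib w_2022_40    setup' \
  'daf_butler   g5678+def    setup' |
# Prefix lsst_distrib lines with 0 (the t branch skips the next command
# when the substitution matched), all others with 1, sort, strip prefix:
sed -e 's/^lsst_distrib/0&/' -e t -e 's/^/1/' | sort | sed 's/^.//'
```

This prints `lsst_distrib` first, then the remaining packages in alphabetical order.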

            Setting up the environment for my runs

            Start a screen session (required for Parsl):

            screen -S rc2

            Source the LSST Science Pipelines (v4.1.0) and set up the code base (weekly 40, 2022):

            scipipe410
            setlsst40

            REPO=/projects/HSC/repo/main
            GPFSDIR=/scratch/gpfs/$USER
            LOGDIR=$GPFSDIR/logs

            Introducing FitAffineWcs in my customized pipelineYaml

            cfg_affine_step1.yaml

            description: The DRP pipeline specialized for rc2_subset processing for https://jira.lsstcorp.org/browse/DM-35492
            imports:
              - $DRP_PIPE_DIR/pipelines/HSC/DRP-RC2.yaml
            tasks:
              calibrate:
                class: lsst.pipe.tasks.calibrate.CalibrateTask
                config:
                  python: |
                    from lsst.meas.astrom import FitAffineWcsTask
                    config.astrometry.wcsFitter.retarget(FitAffineWcsTask)
            

            Note: The "_step1" suffix in the file name indicates that the config change applies to the CalibrateTask in step 1 only. Beware, however, that the change will be reflected in the outputs of the subsequent steps as well.
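The `retarget` call above swaps which task class the `wcsFitter` sub-task field will construct, leaving the rest of the pipeline config intact. A toy sketch of that pattern, purely illustrative (this is not the real `lsst.pex.config` implementation, and the class names below are stand-ins):

```python
# Toy model of the pex_config retarget pattern; class and attribute
# names are illustrative, not the real lsst.pex.config API.

class FitTanSipWcs:      # stand-in for the default WCS fitter
    name = "FitTanSipWcs"

class FitAffineWcs:      # stand-in for FitAffineWcsTask
    name = "FitAffineWcs"

class ConfigurableField:
    """A field that records which class to construct for a sub-task."""
    def __init__(self, target):
        self.target = target

    def retarget(self, new_target):
        # Swap the class this field will construct; the field survives.
        self.target = new_target

    def apply(self):
        return self.target()

# The default config would construct the original fitter ...
wcs_fitter = ConfigurableField(FitTanSipWcs)
# ... until retargeted, as in config.astrometry.wcsFitter.retarget(FitAffineWcsTask):
wcs_fitter.retarget(FitAffineWcs)
print(wcs_fitter.apply().name)  # FitAffineWcs
```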

            erfan Erfan Nourbakhsh added a comment (edited)

            Run RC2 Step 1

            Step 1 submit:

            cd $GPFSDIR; \
            LOGFILE=$LOGDIR/rc2_step1.log; \
            BPSYAML=/home/en7908/HSC/bps_hsc_rc2_step1.yaml; \
            export OMP_NUM_THREADS=1; \
            export NUMEXPR_MAX_THREADS=1; \
            date | tee -a $LOGFILE; \
            $(which time) -f "Total runtime: %E" \
            bps submit $BPSYAML \
            2>&1 | tee -a $LOGFILE; \
            date | tee -a $LOGFILE
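Note that `$(which time)` in the block above deliberately bypasses bash's builtin `time`, which has no `-f` flag, and calls the external GNU time binary instead. The surrounding date/tee logging pattern can be wrapped into a reusable function; a sketch (`run_logged` is a hypothetical helper name, and it uses bash's `$SECONDS` counter rather than GNU time so it stays self-contained):

```shell
# Sketch of the submit logging pattern as a reusable bash function.
# run_logged is a hypothetical helper; it reports elapsed seconds via
# $SECONDS instead of GNU time's -f "Total runtime: %E".
run_logged () {
    logfile=$1; shift
    date | tee -a "$logfile"
    start=$SECONDS
    # Run the command, capturing stdout and stderr into the log:
    "$@" 2>&1 | tee -a "$logfile"
    echo "Total runtime: $((SECONDS - start))s" | tee -a "$logfile"
    date | tee -a "$logfile"
}

# Usage (mirrors the step 1 submit above):
#   run_logged "$LOGDIR/rc2_step1.log" bps submit "$BPSYAML"
```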
            

            Corresponding chain collection:

            u/en7908/HSC/runs/RC2/w_2022_40/DM-35492
            

            Run collection created for Step 1 (within the chain collection above):

            u/en7908/HSC/runs/RC2/w_2022_40/DM-35492/20221019T184937Z

            Initial step 1 results using Lee's task_times (Total runtime: 78:20:54):

            en7908@tiger2-sumire:~$ ~lkelvin/software/task_times.py /scratch/gpfs/en7908/submit/u/en7908/HSC/runs/RC2/w_2022_40/DM-35492/20221019T184937Z
             
            Concatenating BPS log files... done!
             
            QuantumGraph contains 222480 quanta for 5 tasks.
             
            task                              time                  pass     fail     skip
            --------------------------------------------------------------------------------
            characterizeImage                 7491490.14s (~59%)    44389    2        1
            calibrate                         3362938.93s (~27%)    44325    10       3
            isr                               1755970.06s (~14%)    44492    0        0
            writePreSourceTable               32919.07s (~0%)       44325    0        13
            transformPreSourceTable           21962.83s (~0%)       44322    0        14
            --------------------------------------------------------------------------------
            total                             12665281.04s          221853   12       31
             
            Executed 221853 quanta out of a total of 222480 quanta (~100%). 

            Run Step 1 afterburner on the head node:

            LOGFILE=$LOGDIR/rc2_step1_afterburner.log; \
            DATAQUERY="exposure.observation_type='science' AND detector NOT IN (9)"; \
            date | tee -a $LOGFILE; \
            $(which time) -f "Total runtime: %E" \
            pipetask --long-log run --register-dataset-types -j 20 \
            -b $REPO \
            -i HSC/RC2/defaults \
            --output-run u/en7908/HSC/runs/RC2/w_2022_40/DM-35492/20221019T184937Z \
            -p /home/en7908/HSC/cfg_affine_step1.yaml#step1 \
            -d "instrument='HSC' AND $DATAQUERY" \
            --skip-existing-in u/en7908/HSC/runs/RC2/w_2022_40/DM-35492 --extend-run --clobber-outputs \
            2>&1 | tee -a $LOGFILE; \
            date | tee -a $LOGFILE
            

            Note: task_times uses the Parsl logs to generate its report, and it turns out these logs are not reliable on Tiger: many of the BPS/Parsl logs appear to be corrupted. According to Lee, they seemingly cut off mid-run and don't provide a success/fail message at the end of processing each quantum. To get fully accurate numbers, we need to load the quantum graph and compare the actual number of files on disk with the number expected. Going forward, we will use task_check, an improved check developed by Lee, to report the completion of runs.
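The core of such a check is a simple expected-vs-completed comparison per task. A minimal illustration with hypothetical counts (the real task_check pulls expected counts from the quantum graph and completed counts from the butler repo; `completion_report` is an invented name):

```python
def completion_report(expected, completed):
    """Compare expected vs. completed quanta per task (task_check-style)."""
    rows = []
    for task, n_expected in expected.items():
        n_done = completed.get(task, 0)
        pct = 100.0 * n_done / n_expected if n_expected else 100.0
        rows.append((task, n_expected, n_done, round(pct, 1), n_expected - n_done))
    return rows

# Hypothetical counts standing in for quantum-graph / datastore queries:
expected = {"isr": 44496, "characterizeImage": 44496, "calibrate": 44496}
completed = {"isr": 44496, "characterizeImage": 44494, "calibrate": 44484}

for task, exp, done, pct, missing in completion_report(expected, completed):
    print(f"{task:>20} {exp:8d} {done:8d} (~{pct}%) {missing:4d}")
```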

            Step 1 final results using task_check:

            :$ RUN=u/en7908/HSC/runs/RC2/w_2022_40/DM-35492/20221019T184937Z
            :$ QGRAPH=/scratch/gpfs/en7908/submit/u/en7908/HSC/runs/RC2/w_2022_40/DM-35492/20221019T184937Z/u_en7908_HSC_runs_RC2_w_2022_40_DM-35492_20221019T184937Z.qgraph
            :$ ~lkelvin/software/task_check.py $REPO $RUN -q $QGRAPH
             
                     tasks          expected    completed    missing
            ----------------------- -------- --------------- -------
                                isr    44496 44496 (~100.0%)       0
                  characterizeImage    44496 44494 (~100.0%)       2
                          calibrate    44496 44484 (~100.0%)      12
                writePreSourceTable    44496 44484 (~100.0%)      12
            transformPreSourceTable    44496 44484 (~100.0%)      12
            

            Below are the Step 1 final results for Lee's run (from here):

                     tasks          expected    completed    missing
            ----------------------- -------- --------------- -------
                                isr    44496 44496 (~100.0%)       0
                  characterizeImage    44496 44494 (~100.0%)       2
                          calibrate    44496 44484 (~100.0%)      12
                writePreSourceTable    44496 44484 (~100.0%)      12
            transformPreSourceTable    44496 44484 (~100.0%)      12
            

            erfan Erfan Nourbakhsh added a comment (edited)

            Run RC2 Step 2

            Step 2 submit:

            cd $GPFSDIR; \
            LOGFILE=$LOGDIR/my_log_step2.log; \
            BPSYAML=/home/en7908/HSC/bps_hsc_rc2_step2.yaml; \
            export OMP_NUM_THREADS=1; \
            export NUMEXPR_MAX_THREADS=1; \
            date | tee -a $LOGFILE; \
            $(which time) -f "Total runtime: %E" \
            bps submit $BPSYAML \
            2>&1 | tee -a $LOGFILE; \
            date | tee -a $LOGFILE
            

            Corresponding chain collection:

            u/en7908/HSC/runs/RC2/w_2022_40/DM-35492
            

            Run collection created for Step 2 (under the chain collection above):

            u/en7908/HSC/runs/RC2/w_2022_40/DM-35492/20221025T195306Z

            Initial step 2 results using task_check:

            :$ RUN=u/en7908/HSC/runs/RC2/w_2022_40/DM-35492/20221025T195306Z
            :$ QGRAPH=/scratch/gpfs/en7908/submit//u/en7908/HSC/runs/RC2/w_2022_40/DM-35492/20221025T195306Z/u_en7908_HSC_runs_RC2_w_2022_40_DM-35492_20221025T195306Z.qgraph
            :$ ~lkelvin/software/task_check.py $REPO $QGRAPH $RUN
             
                      tasks           expected   completed   missing
            ------------------------- -------- ------------- -------
                              skyCorr      404  404 (100.0%)       0
              consolidateVisitSummary      404  404 (100.0%)       0
            consolidatePreSourceTable      404  404 (100.0%)       0
                       makeVisitTable        1    1 (100.0%)       0
                    makeCcdVisitTable        1    1 (100.0%)       0
              isolatedStarAssociation       37   37 (100.0%)       0
                  fgcmBuildStarsTable        1    1 (100.0%)       0
             finalizeCharacterization      404  404 (100.0%)       0
                         fgcmFitCycle        1    0 (~00.0%)       1
                   fgcmOutputProducts        1    0 (~00.0%)       1
                                TOTAL     1658 1656 (~99.9%)       2
            

            Apparently, FGCM doesn't play nice with multiprocessing. Let's run a series of Step 2 afterburners on the head node, the first one for fgcmBuildStarsTable with a single process:

            LOGFILE=$LOGDIR/hsc_runs_rc2_w_2022_40_DM-35492_step2_fgcmBuildStarsTable_afterburner.log; \
            DATAQUERY="band != 'N921'"; \
            date | tee -a $LOGFILE; \
            export OMP_NUM_THREADS=1; \
            export NUMEXPR_MAX_THREADS=1; \
            $(which time) -f "Total runtime: %E" \
            pipetask --long-log run --register-dataset-types \
            -b $REPO \
            -i HSC/RC2/defaults,u/en7908/HSC/runs/RC2/w_2022_40/DM-35492 \
            --output-run u/en7908/HSC/runs/RC2/w_2022_40/DM-35492/20221025T195306Z \
            -p /home/en7908/HSC/cfg_affine_step1.yaml#fgcmBuildStarsTable \
            -d "instrument='HSC' AND $DATAQUERY" \
            --skip-existing-in u/en7908/HSC/runs/RC2/w_2022_40/DM-35492 --extend-run --clobber-outputs \
            2>&1 | tee -a $LOGFILE; \
            date | tee -a $LOGFILE
            

            Then the rest with multiprocessing again:

            LOGFILE=$LOGDIR/hsc_runs_rc2_w_2022_40_DM-35492_step2_afterburner.log; \
            DATAQUERY="band != 'N921'"; \
            date | tee -a $LOGFILE; \
            $(which time) -f "Total runtime: %E" \
            pipetask --long-log run --register-dataset-types -j 5 \
            -b $REPO \
            -i HSC/RC2/defaults,u/en7908/HSC/runs/RC2/w_2022_40/DM-35492 \
            --output-run u/en7908/HSC/runs/RC2/w_2022_40/DM-35492/20221025T195306Z \
            -p /home/en7908/HSC/cfg_affine_step1.yaml#step2 \
            -d "instrument='HSC' AND $DATAQUERY" \
            --skip-existing-in u/en7908/HSC/runs/RC2/w_2022_40/DM-35492 --extend-run --clobber-outputs \
            2>&1 | tee -a $LOGFILE; \
            date | tee -a $LOGFILE
            

            Finally, another single-core afterburner run, this time for fgcmFitCycle:

            LOGFILE=$LOGDIR/hsc_runs_rc2_w_2022_40_DM-35492_step2_fgcmFitCycle_afterburner.log; \
            DATAQUERY="band != 'N921'"; \
            date | tee -a $LOGFILE; \
            export OMP_NUM_THREADS=1; \
            export NUMEXPR_MAX_THREADS=1; \
            $(which time) -f "Total runtime: %E" \
            pipetask --long-log run --register-dataset-types \
            -b $REPO \
            -i HSC/RC2/defaults,u/en7908/HSC/runs/RC2/w_2022_40/DM-35492 \
            --output-run u/en7908/HSC/runs/RC2/w_2022_40/DM-35492/20221025T195306Z \
            -p /home/en7908/HSC/cfg_affine_step1.yaml#step2 \
            -d "instrument='HSC' AND $DATAQUERY" \
            --skip-existing-in u/en7908/HSC/runs/RC2/w_2022_40/DM-35492 --extend-run --clobber-outputs \
            2>&1 | tee -a $LOGFILE; \
            date | tee -a $LOGFILE
            

            Step 2 final results using task_check:

                      tasks           expected   completed   missing
            ------------------------- -------- ------------- -------
                              skyCorr      404  404 (100.0%)       0
            consolidatePreSourceTable      404  404 (100.0%)       0
              consolidateVisitSummary      404  404 (100.0%)       0
              isolatedStarAssociation       37   37 (100.0%)       0
                  fgcmBuildStarsTable        1    1 (100.0%)       0
                       makeVisitTable        1    1 (100.0%)       0
                    makeCcdVisitTable        1    1 (100.0%)       0
             finalizeCharacterization      404  404 (100.0%)       0
                         fgcmFitCycle        1    1 (100.0%)       0
                   fgcmOutputProducts        1    1 (100.0%)       0
                                TOTAL     1658 1658 (100.0%)       0
            

            Below are the Step 2 final results for Lee's run (from here):

                      tasks           expected   completed   missing
            ------------------------- -------- ------------- -------
                              skyCorr      404  404 (100.0%)       0
            consolidatePreSourceTable      404  404 (100.0%)       0
              consolidateVisitSummary      404  404 (100.0%)       0
              isolatedStarAssociation       37   37 (100.0%)       0
                  fgcmBuildStarsTable        1    1 (100.0%)       0
                       makeVisitTable        1    1 (100.0%)       0
                    makeCcdVisitTable        1    1 (100.0%)       0
             finalizeCharacterization      404  404 (100.0%)       0
                         fgcmFitCycle        1    1 (100.0%)       0
                   fgcmOutputProducts        1    1 (100.0%)       0
                                TOTAL     1658 1658 (100.0%)       0
            

             

            erfan Erfan Nourbakhsh added a comment (edited)

            Run RC2 Step 3

            There have been numerous problems with Tiger at this stage, but for the sake of brevity and clarity, I am choosing to omit these details.

            Step 3 submit:

            cd $GPFSDIR; \
            LOGFILE=$LOGDIR/rc2_step3.log; \
            BPSYAML=/home/en7908/HSC/bps_hsc_rc2_step3.yaml; \
            export OMP_NUM_THREADS=1; \
            export NUMEXPR_MAX_THREADS=1; \
            date | tee -a $LOGFILE; \
            $(which time) -f "Total runtime: %E" \
            bps submit $BPSYAML \
            2>&1 | tee -a $LOGFILE; \
            date | tee -a $LOGFILE
            

            Corresponding chain collection:

            u/en7908/HSC/runs/RC2/w_2022_40/DM-35492
            

            Run collection created for Step 3 (under the chain collection above):

            u/en7908/HSC/runs/RC2/w_2022_40/DM-35492/20221123T230405Z

            Initial step 3 results using task_times:

            :$ ~lkelvin/software/task_times.py /scratch/gpfs/en7908/submit/u/en7908/HSC/runs/RC2/w_2022_40/DM-35492/20221123T230405Z
             
            Concatenating BPS log files... done!
             
            QuantumGraph contains 24873 quanta for 15 tasks.
             
            task                        time (s)                    pass    fail    skip
            --------------------------------------------------------------------------------
            forcedPhotCoadd             3996715 (~41%, ~4082/q)     979     0       0
            measure                     3516485 (~36%, ~3171/q)     1109    0       0
            makeWarp                    925341 (~9%, ~56/q)         16417   0       0
            assembleCoadd               539196 (~5%, ~450/q)        1198    0       0
            deblend                     364425 (~4%, ~1518/q)       240     0       0
            templateGen                 302836 (~3%, ~253/q)        1198    0       0
            detection                   74719 (~1%, ~62/q)          1197    1       0
            mergeMeasurements           36382 (~0%, ~175/q)         208     0       0
            healSparsePropertyMaps      29373 (~0%, ~2098/q)        14      0       0
            jointcal                    12337 (~0%, ~822/q)         15      0       0
            mergeDetections             12307 (~0%, ~51/q)          240     0       0
            writeObjectTable            6644 (~0%, ~38/q)           175     0       0
            transformObjectTable        4397 (~0%, ~25/q)           175     0       0
            selectGoodSeeingVisits      4228 (~0%, ~4/q)            1203    0       0
            --------------------------------------------------------------------------------
            total                       9825384                     24368   1       0
             
            Executed 24368 quanta out of a total of 24873 quanta (~98%).
            

            It turns out that during the run above the scratch space on Tiger was unmounted, causing the job to be terminated and leaving the completed quanta uncopied to the main repository. I had to transfer the files manually; thanks to Lee, I found the lines in the config YAML below (located in the submit directory) that are responsible for transferring the files.

            /scratch/gpfs/en7908/submit/u/en7908/HSC/runs/RC2/w_2022_40/DM-35492/20221123T230405Z/u_en7908_HSC_runs_RC2_w_2022_40_DM-35492_20221123T230405Z_config.yaml

            Digging into BPS's internals, there are three commands that make the transfer happen. We need to bashify them and then run them.

            # Define variables
            mergePreCmdOpts="--long-log --log-level=VERBOSE"
            executionButlerDir="/scratch/gpfs/en7908/submit/u/en7908/HSC/runs/RC2/w_2022_40/DM-35492/20221123T230405Z/EXEC_REPO-u_en7908_HSC_runs_RC2_w_2022_40_DM-35492_20221123T230405Z"
            butlerConfig="/projects/HSC/repo/main"
            # RUN collection
            outputRun="u/en7908/HSC/runs/RC2/w_2022_40/DM-35492/20221123T230405Z"
            # Chain collection
            output="u/en7908/HSC/runs/RC2/w_2022_40/DM-35492"
            inCollection="HSC/RC2/defaults"

            # Command 1
            butler ${mergePreCmdOpts} transfer-datasets ${executionButlerDir} ${butlerConfig} --collections ${outputRun} --register-dataset-types --transfer=move
            # Command 2
            butler ${mergePreCmdOpts} collection-chain ${butlerConfig} ${output} --flatten --mode=extend ${inCollection}
            # Command 3
            butler ${mergePreCmdOpts} collection-chain ${butlerConfig} ${output} --flatten --mode=prepend ${outputRun}
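"Bashifying" these commands amounts to substituting the `${var}` placeholders in the command templates stored in the BPS config YAML. A sketch of that substitution step using Python's `string.Template` (long paths abbreviated with `...`; the template text is schematic, not copied verbatim from the config):

```python
from string import Template

# Variable values as defined in the submit-directory config YAML
# (long paths abbreviated with ... here):
values = {
    "mergePreCmdOpts": "--long-log --log-level=VERBOSE",
    "executionButlerDir": "/scratch/gpfs/en7908/submit/.../EXEC_REPO-...",
    "butlerConfig": "/projects/HSC/repo/main",
    "outputRun": "u/en7908/HSC/runs/RC2/w_2022_40/DM-35492/20221123T230405Z",
}

# Command 1 as it appears (schematically) in the config:
cmd1 = Template(
    "butler ${mergePreCmdOpts} transfer-datasets ${executionButlerDir} "
    "${butlerConfig} --collections ${outputRun} "
    "--register-dataset-types --transfer=move"
)
print(cmd1.substitute(values))
```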

            Run Step 3 afterburner on the head node:

            LOGFILE=$LOGDIR/rc2_step3_afterburner.log; \
            DATAQUERY="band!='N921' AND skymap='hsc_rings_v1' AND tract IN (9615,9697,9813)"; \
            date | tee -a $LOGFILE; \
            $(which time) -f "Total runtime: %E" \
            pipetask --long-log run --register-dataset-types -j 5 \
            -b $REPO \
            -i HSC/RC2/defaults,u/en7908/HSC/runs/RC2/w_2022_40/DM-35492 \
            --output-run u/en7908/HSC/runs/RC2/w_2022_40/DM-35492/20221123T230405Z \
            -p $DRP_PIPE_DIR/pipelines/HSC/DRP-RC2.yaml#step3 \
            -d "instrument='HSC' AND $DATAQUERY" \
            --skip-existing-in u/en7908/HSC/runs/RC2/w_2022_40/DM-35492 --extend-run --clobber-outputs \
            2>&1 | tee -a $LOGFILE; \
            date | tee -a $LOGFILE
            

            Step 3 results using task_check:

            :$ RUN=u/en7908/HSC/runs/RC2/w_2022_40/DM-35492/20221123T230405Z
            :$ QGRAPH=/scratch/gpfs/en7908/submit//u/en7908/HSC/runs/RC2/w_2022_40/DM-35492/20221123T230405Z/u_en7908_HSC_runs_RC2_w_2022_40_DM-35492_20221123T230405Z.qgraph
            :$ ~lkelvin/software/task_check.py $REPO $QGRAPH $RUN
             
                    tasks          expected   completed    missing
            ---------------------- -------- -------------- -------
                          jointcal       15    15 (100.0%)       0
            selectGoodSeeingVisits     1203  1203 (100.0%)       0
                          makeWarp    16417 16417 (100.0%)       0
                     assembleCoadd     1203  1203 (100.0%)       0
                       templateGen     1203  1203 (100.0%)       0
            healSparsePropertyMaps       15    15 (100.0%)       0
                         detection     1203  1202 (~99.9%)       1
                   mergeDetections      241   240 (~99.6%)       1
                           deblend      241   240 (~99.6%)       1
                           measure     1203  1198 (~99.6%)       5
                 mergeMeasurements      241   240 (~99.6%)       1
                   forcedPhotCoadd     1203  1198 (~99.6%)       5
                  writeObjectTable      241   240 (~99.6%)       1
              transformObjectTable      241   240 (~99.6%)       1
            consolidateObjectTable        3     2 (~66.7%)       1
                             TOTAL    24873 24856 (~99.9%)      17
             

            Run another Step 3 afterburner on the head node to double-check that nothing else wants to complete. We only do it for these tasks of interest:

            writeObjectTable, transformObjectTable, consolidateObjectTable 
            

            Per Lee's advice, I had to run this final pipetask invocation to finish generating the object tables: "They don't want to complete because they think that they are still missing some inputs. The truth is that those inputs will always be missing, so you need to manually finish the run off."

            LOGFILE=$LOGDIR/rc2_step3_afterburner_final.log; \
            DATAQUERY="band!='N921' AND skymap='hsc_rings_v1' AND tract IN (9615,9697,9813)"; \
            date | tee -a $LOGFILE; \
            $(which time) -f "Total runtime: %E" \
            pipetask --long-log run --register-dataset-types -j 5 \
            -b $REPO \
            -i HSC/RC2/defaults,u/en7908/HSC/runs/RC2/w_2022_40/DM-35492 \
            --output-run u/en7908/HSC/runs/RC2/w_2022_40/DM-35492/20221123T230405Z \
            -p $DRP_PIPE_DIR/pipelines/HSC/DRP-RC2.yaml#writeObjectTable,transformObjectTable,consolidateObjectTable \
            -d "instrument='HSC' AND $DATAQUERY" \
            --skip-existing-in u/en7908/HSC/runs/RC2/w_2022_40/DM-35492 --extend-run --clobber-outputs \
            2>&1 | tee -a $LOGFILE; \
            date | tee -a $LOGFILE
            

            Step 3 final results using task_check:

                    tasks          expected   completed    missing
            ---------------------- -------- -------------- -------
            selectGoodSeeingVisits     1203  1203 (100.0%)       0
                          jointcal       15    15 (100.0%)       0
                          makeWarp    16417 16417 (100.0%)       0
                     assembleCoadd     1203  1203 (100.0%)       0
                       templateGen     1203  1203 (100.0%)       0
                         detection     1203  1202 (~99.9%)       1
            healSparsePropertyMaps       15    15 (100.0%)       0
                   mergeDetections      241   240 (~99.6%)       1
                           deblend      241   240 (~99.6%)       1
                           measure     1203  1198 (~99.6%)       5
                 mergeMeasurements      241   240 (~99.6%)       1
                   forcedPhotCoadd     1203  1198 (~99.6%)       5
                  writeObjectTable      241   240 (~99.6%)       1
              transformObjectTable      241   240 (~99.6%)       1
            consolidateObjectTable        3     3 (100.0%)       0
                             TOTAL    24873 24857 (~99.9%)      16
            

            Below are the Step 3 final results for Lee's run (from here):

                    tasks          expected   completed    missing
            ---------------------- -------- -------------- -------
            selectGoodSeeingVisits     1203  1203 (100.0%)       0
                          jointcal       15    15 (100.0%)       0
                          makeWarp    17371 17371 (100.0%)       0
                     assembleCoadd     1203  1203 (100.0%)       0
                       templateGen     1203  1203 (100.0%)       0
                         detection     1203  1202 (~99.9%)       1
            healSparsePropertyMaps       15    15 (100.0%)       0
                   mergeDetections      241   240 (~99.6%)       1
                           deblend      241   240 (~99.6%)       1
                           measure     1203  1198 (~99.6%)       5
                 mergeMeasurements      241   240 (~99.6%)       1
                   forcedPhotCoadd     1203  1198 (~99.6%)       5
                  writeObjectTable      241   240 (~99.6%)       1
              transformObjectTable      241   240 (~99.6%)       1
            consolidateObjectTable        3     3 (100.0%)       0
                             TOTAL    25827 25811 (~99.9%)      16
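
            Comparing the expected counts in the two tables shows that the runs differ only in makeWarp (16417 versus 17371 quanta), which also accounts for the entire 954-quantum gap between the totals (24873 versus 25827). A quick way to verify this is to parse the two tables and diff their expected columns; a minimal sketch (function names are illustrative):

```python
# Sketch: diff the "expected" column of two task_check-style tables to see
# where two runs' QuantumGraphs differ. Table format copied from the output
# above; only a few rows are reproduced here for brevity.

def parse_expected(table_text):
    """Map task name -> expected count from a task_check-style table."""
    counts = {}
    for line in table_text.strip().splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[1].isdigit():
            counts[parts[0]] = int(parts[1])
    return counts

def diff_expected(a, b):
    """Tasks whose expected counts differ, as {task: (a_count, b_count)}."""
    return {t: (a.get(t, 0), b.get(t, 0))
            for t in sorted(set(a) | set(b)) if a.get(t, 0) != b.get(t, 0)}

mine = parse_expected("""
makeWarp 16417 16417
TOTAL 24873 24857
""")
lees = parse_expected("""
makeWarp 17371 17371
TOTAL 25827 25811
""")
print(diff_expected(mine, lees))
# -> {'TOTAL': (24873, 25827), 'makeWarp': (16417, 17371)}
```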
            

             

            erfan Erfan Nourbakhsh added a comment - - edited

            Summary

            It appears that there are missing warps in Step 3 of the run with FitAffineWcs on. With the help of Yusra, Arun, and Lee, I identified that the issue originated in Step 1. The task_check script, which searches for per-task metadata datasets (see the line of code below for reference), is not fully reliable: it falsely reported some Step 1 quanta as completed. This propagated downstream and caused the missing makeWarp outputs in Step 3.

            butler.registry.queryDatasets(f"{task}_metadata", collections=run)
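
            A more defensive completeness check would require a quantum's actual science outputs to exist in addition to its metadata dataset. The sketch below is hypothetical and stack-independent; in the real stack, both sets of data IDs would come from butler.registry.queryDatasets calls (one for the metadata dataset type, one for the task's output dataset type):

```python
# Hypothetical sketch: treat a quantum as complete only if BOTH its metadata
# dataset and its expected science output exist, rather than metadata alone.
# In the real stack, both sets would come from butler.registry.queryDatasets.

def truly_completed(metadata_ids, output_ids):
    """Data IDs with metadata AND outputs present."""
    return metadata_ids & output_ids

def false_positives(metadata_ids, output_ids):
    """Data IDs that a metadata-only query would over-count as complete."""
    return metadata_ids - output_ids

# Illustrative (visit, detector) data IDs for a Step 1 task:
metadata = {(1228, 40), (1228, 41), (1228, 42)}
outputs = {(1228, 40), (1228, 41)}  # one quantum wrote metadata but no output
print(sorted(false_positives(metadata, outputs)))  # -> [(1228, 42)]
```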


            Hi Yusra, could you please review this ticket? I have included all necessary details and attachments in the ticket, but please let me know if you need any additional information to complete the review. I would appreciate your feedback.

            erfan Erfan Nourbakhsh added a comment -

            People

              erfan Erfan Nourbakhsh
              erfan Erfan Nourbakhsh
              Yusra AlSayyad
              Clare Saunders, Dan Taranu, Erfan Nourbakhsh, Eric Bellm, John Parejko, Lee Kelvin, Sophie Reed, Yusra AlSayyad