Run RC2 Step 3
Tiger caused numerous problems at this stage, but for the sake of brevity and clarity I am omitting the details.
Step 3 submit:
cd $GPFSDIR; \
LOGFILE=$LOGDIR/rc2_step3.log; \
BPSYAML=/home/en7908/HSC/bps_hsc_rc2_step3.yaml; \
export OMP_NUM_THREADS=1; \
export NUMEXPR_MAX_THREADS=1; \
date | tee -a $LOGFILE; \
$(which time) -f "Total runtime: %E" \
bps submit $BPSYAML \
2>&1 | tee -a $LOGFILE; \
date | tee -a $LOGFILE
Corresponding chain collection:
u/en7908/HSC/runs/RC2/w_2022_40/DM-35492
Run collection created for Step 3 (under the chain collection above):
u/en7908/HSC/runs/RC2/w_2022_40/DM-35492/20221123T230405Z
Initial step 3 results using task_times:
:$ ~lkelvin/software/task_times.py /scratch/gpfs/en7908/submit/u/en7908/HSC/runs/RC2/w_2022_40/DM-35492/20221123T230405Z

Concatenating BPS log files... done!

QuantumGraph contains 24873 quanta for 15 tasks.

task                     time (s)                      pass  fail  skip
--------------------------------------------------------------------------------
forcedPhotCoadd           3996715  (~41%, ~4082/q)      979     0     0
measure                   3516485  (~36%, ~3171/q)     1109     0     0
makeWarp                   925341  (~9%, ~56/q)       16417     0     0
assembleCoadd              539196  (~5%, ~450/q)       1198     0     0
deblend                    364425  (~4%, ~1518/q)       240     0     0
templateGen                302836  (~3%, ~253/q)       1198     0     0
detection                   74719  (~1%, ~62/q)        1197     1     0
mergeMeasurements           36382  (~0%, ~175/q)        208     0     0
healSparsePropertyMaps      29373  (~0%, ~2098/q)        14     0     0
jointcal                    12337  (~0%, ~822/q)         15     0     0
mergeDetections             12307  (~0%, ~51/q)         240     0     0
writeObjectTable             6644  (~0%, ~38/q)         175     0     0
transformObjectTable         4397  (~0%, ~25/q)         175     0     0
selectGoodSeeingVisits       4228  (~0%, ~4/q)         1203     0     0
--------------------------------------------------------------------------------
total                     9825384                     24368     1     0

Executed 24368 quanta out of a total of 24873 quanta (~98%).
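As a quick cross-check of the summary line, the arithmetic can be redone in the shell. This is only a sketch: the two numbers are copied from the task_times output above, and the variable names are mine.

```shell
#!/usr/bin/env bash
# Numbers copied from the task_times summary above.
executed=24368
total=24873

# Quanta not counted as executed in the initial submission.
remaining=$(( total - executed ))

# Shell arithmetic is integer-only, so use awk for the percentage.
percent=$(awk -v e="$executed" -v t="$total" 'BEGIN { printf "%.1f", 100 * e / t }')

echo "remaining quanta: $remaining"
echo "executed: ${percent}%"
```

This reports 505 remaining quanta and 98.0% executed, in line with the ~98% quoted by task_times.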
It turns out that during the run above, the scratch space for Tiger was unmounted. The job was terminated and the completed quanta were never copied back to the main repository, so I had to transfer the files manually. Thanks to Lee, I found the lines responsible for the transfer in the config YAML below (located in the submit directory).
/scratch/gpfs/en7908/submit/u/en7908/HSC/runs/RC2/w_2022_40/DM-35492/20221123T230405Z/u_en7908_HSC_runs_RC2_w_2022_40_DM-35492_20221123T230405Z_config.yaml
Digging into BPS's brain shows that three commands make the transfer happen. We need to bashify them and then run them.
# Define variables
mergePreCmdOpts="--long-log --log-level=VERBOSE"
executionButlerDir="/scratch/gpfs/en7908/submit/u/en7908/HSC/runs/RC2/w_2022_40/DM-35492/20221123T230405Z/EXEC_REPO-u_en7908_HSC_runs_RC2_w_2022_40_DM-35492_20221123T230405Z"
butlerConfig="/projects/HSC/repo/main"
# RUN collection
outputRun="u/en7908/HSC/runs/RC2/w_2022_40/DM-35492/20221123T230405Z"
# Chain collection
output="u/en7908/HSC/runs/RC2/w_2022_40/DM-35492"
inCollection="HSC/RC2/defaults"
# Command 1
butler ${mergePreCmdOpts} transfer-datasets ${executionButlerDir} ${butlerConfig} --collections ${outputRun} --register-dataset-types --transfer=move
# Command 2
butler ${mergePreCmdOpts} collection-chain ${butlerConfig} ${output} --flatten --mode=extend ${inCollection}
# Command 3
butler ${mergePreCmdOpts} collection-chain ${butlerConfig} ${output} --flatten --mode=prepend ${outputRun}
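Because Command 1 uses --transfer=move, it can be worth previewing the expanded commands before running them for real. The sketch below is not part of BPS or butler: run_step and DRY_RUN are hypothetical names introduced here for illustration, and only Command 1 is shown.

```shell
#!/usr/bin/env bash
# Hypothetical helper: echo a command, then run it unless DRY_RUN=1.
run_step() {
    echo "+ $*"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        return 0
    fi
    "$@"
}

# Values copied from the variable definitions above.
mergePreCmdOpts="--long-log --log-level=VERBOSE"
executionButlerDir="/scratch/gpfs/en7908/submit/u/en7908/HSC/runs/RC2/w_2022_40/DM-35492/20221123T230405Z/EXEC_REPO-u_en7908_HSC_runs_RC2_w_2022_40_DM-35492_20221123T230405Z"
butlerConfig="/projects/HSC/repo/main"
outputRun="u/en7908/HSC/runs/RC2/w_2022_40/DM-35492/20221123T230405Z"

# Preview Command 1 without touching either repository.
DRY_RUN=1
# mergePreCmdOpts is left unquoted on purpose so it splits into two options.
run_step butler ${mergePreCmdOpts} transfer-datasets "${executionButlerDir}" "${butlerConfig}" \
    --collections "${outputRun}" --register-dataset-types --transfer=move
```

With DRY_RUN=1 the line is only printed. Setting DRY_RUN=0 and appending `|| exit 1` to each run_step call would run the three commands in order and stop at the first failure.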
Run Step 3 afterburner on the head node:
LOGFILE=$LOGDIR/rc2_step3_afterburner.log; \
DATAQUERY="band!='N921' AND skymap='hsc_rings_v1' AND tract IN (9615,9697,9813)"; \
date | tee -a $LOGFILE; \
$(which time) -f "Total runtime: %E" \
pipetask --long-log run --register-dataset-types -j 5 \
-b $REPO \
-i HSC/RC2/defaults,u/en7908/HSC/runs/RC2/w_2022_40/DM-35492 \
--output-run u/en7908/HSC/runs/RC2/w_2022_40/DM-35492/20221123T230405Z \
-p $DRP_PIPE_DIR/pipelines/HSC/DRP-RC2.yaml#step3 \
-d "instrument='HSC' AND $DATAQUERY" \
--skip-existing-in u/en7908/HSC/runs/RC2/w_2022_40/DM-35492 --extend-run --clobber-outputs \
2>&1 | tee -a $LOGFILE; \
date | tee -a $LOGFILE
Step 3 results using task_check:
:$ RUN=u/en7908/HSC/runs/RC2/w_2022_40/DM-35492/20221123T230405Z
:$ QGRAPH=/scratch/gpfs/en7908/submit//u/en7908/HSC/runs/RC2/w_2022_40/DM-35492/20221123T230405Z/u_en7908_HSC_runs_RC2_w_2022_40_DM-35492_20221123T230405Z.qgraph
:$ ~lkelvin/software/task_check.py $REPO $QGRAPH $RUN

tasks                  expected completed      missing
---------------------- -------- -------------- -------
jointcal                     15 15 (100.0%)          0
selectGoodSeeingVisits     1203 1203 (100.0%)        0
makeWarp                  16417 16417 (100.0%)       0
assembleCoadd              1203 1203 (100.0%)        0
templateGen                1203 1203 (100.0%)        0
healSparsePropertyMaps       15 15 (100.0%)          0
detection                  1203 1202 (~99.9%)        1
mergeDetections             241 240 (~99.6%)         1
deblend                     241 240 (~99.6%)         1
measure                    1203 1198 (~99.6%)        5
mergeMeasurements           241 240 (~99.6%)         1
forcedPhotCoadd            1203 1198 (~99.6%)        5
writeObjectTable            241 240 (~99.6%)         1
transformObjectTable        241 240 (~99.6%)         1
consolidateObjectTable        3 2 (~66.7%)           1
TOTAL                     24873 24856 (~99.9%)      17
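To pick out only the tasks that still have missing quanta, the table can be filtered with awk. This is a sketch that assumes task_check output has been captured as text in the layout shown above; tasks_with_missing is a hypothetical helper name, not part of task_check.py.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read task_check-style text on stdin and print
# "<task> <missing>" for every task row whose last column is a positive integer.
# The header and separator rows are skipped because their last field is not a number.
tasks_with_missing() {
    awk '$1 != "TOTAL" && $NF ~ /^[0-9]+$/ && $NF > 0 { print $1, $NF }'
}

# Demo on a few rows copied from the table above.
tasks_with_missing <<'EOF'
tasks                  expected completed      missing
---------------------- -------- -------------- -------
makeWarp                  16417 16417 (100.0%)       0
detection                  1203 1202 (~99.9%)        1
measure                    1203 1198 (~99.6%)        5
consolidateObjectTable        3 2 (~66.7%)           1
TOTAL                     24873 24856 (~99.9%)      17
EOF
```

For the demo rows this prints detection, measure and consolidateObjectTable with their missing counts and drops the complete makeWarp row and the TOTAL line.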
Run another Step 3 afterburner on the head node to double-check that nothing else wants to complete. We only do it for these tasks of interest:
writeObjectTable, transformObjectTable, consolidateObjectTable
Per Lee's advice, I had to do this final pipetask run to finish generating the object table: "They don't want to complete because they think that they are still missing some inputs. The truth is that those inputs will always be missing, so you need to manually finish the run off."
LOGFILE=$LOGDIR/rc2_step3_afterburner_final.log; \
DATAQUERY="band!='N921' AND skymap='hsc_rings_v1' AND tract IN (9615,9697,9813)"; \
date | tee -a $LOGFILE; \
$(which time) -f "Total runtime: %E" \
pipetask --long-log run --register-dataset-types -j 5 \
-b $REPO \
-i HSC/RC2/defaults,u/en7908/HSC/runs/RC2/w_2022_40/DM-35492 \
--output-run u/en7908/HSC/runs/RC2/w_2022_40/DM-35492/20221123T230405Z \
-p $DRP_PIPE_DIR/pipelines/HSC/DRP-RC2.yaml#writeObjectTable,transformObjectTable,consolidateObjectTable \
-d "instrument='HSC' AND $DATAQUERY" \
--skip-existing-in u/en7908/HSC/runs/RC2/w_2022_40/DM-35492 --extend-run --clobber-outputs \
2>&1 | tee -a $LOGFILE; \
date | tee -a $LOGFILE
Step 3 final results using task_check:
tasks                  expected completed      missing
---------------------- -------- -------------- -------
selectGoodSeeingVisits     1203 1203 (100.0%)        0
jointcal                     15 15 (100.0%)          0
makeWarp                  16417 16417 (100.0%)       0
assembleCoadd              1203 1203 (100.0%)        0
templateGen                1203 1203 (100.0%)        0
detection                  1203 1202 (~99.9%)        1
healSparsePropertyMaps       15 15 (100.0%)          0
mergeDetections             241 240 (~99.6%)         1
deblend                     241 240 (~99.6%)         1
measure                    1203 1198 (~99.6%)        5
mergeMeasurements           241 240 (~99.6%)         1
forcedPhotCoadd            1203 1198 (~99.6%)        5
writeObjectTable            241 240 (~99.6%)         1
transformObjectTable        241 240 (~99.6%)         1
consolidateObjectTable        3 3 (100.0%)           0
TOTAL                     24873 24857 (~99.9%)      16
Below are the Step 3 final results for Lee's run (from here):
tasks                  expected completed      missing
---------------------- -------- -------------- -------
selectGoodSeeingVisits     1203 1203 (100.0%)        0
jointcal                     15 15 (100.0%)          0
makeWarp                  17371 17371 (100.0%)       0
assembleCoadd              1203 1203 (100.0%)        0
templateGen                1203 1203 (100.0%)        0
detection                  1203 1202 (~99.9%)        1
healSparsePropertyMaps       15 15 (100.0%)          0
mergeDetections             241 240 (~99.6%)         1
deblend                     241 240 (~99.6%)         1
measure                    1203 1198 (~99.6%)        5
mergeMeasurements           241 240 (~99.6%)         1
forcedPhotCoadd            1203 1198 (~99.6%)        5
writeObjectTable            241 240 (~99.6%)         1
transformObjectTable        241 240 (~99.6%)         1
consolidateObjectTable        3 3 (100.0%)           0
TOTAL                     25827 25811 (~99.9%)      16
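Comparing the two final tables row by row, the only task whose counts differ is makeWarp. A short shell check, using numbers copied from the tables above (the variable names are mine), confirms that makeWarp alone accounts for the difference in the totals:

```shell
#!/usr/bin/env bash
# Expected quanta, copied from the two final task_check tables above.
my_total=24873
lee_total=25827
my_makewarp=16417
lee_makewarp=17371

total_diff=$(( lee_total - my_total ))
makewarp_diff=$(( lee_makewarp - my_makewarp ))

echo "difference in TOTAL expected:    $total_diff"
echo "difference in makeWarp expected: $makewarp_diff"

if [ "$total_diff" -eq "$makewarp_diff" ]; then
    echo "makeWarp accounts for the whole difference"
fi
```

Both differences come out to 954 quanta. The missing counts match in both runs (16), so the completed totals differ by the same amount.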
Ticket renamed to reflect that the w_2022_28 rerun at NCSA disappeared and was rerun on Tiger with w_2022_40. Lee ran the corresponding main w_2022_40 collection.