The modified strategy was to group all UDEEP visits into one singleFrameDriver job, DEEP visits into 7 jobs, and WIDE visits into 149 jobs. These jobs were submitted over the Apr 21 weekend.
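For illustration, a grouped submission looked roughly like the command below. The repository path, rerun, visit list, and job name are placeholders rather than the exact values used; the batch options shown are the standard pipe_drivers/ctrl_pool ones for submitting to Slurm.

    singleFrameDriver.py /datasets/hsc/repo --rerun <my_rerun> \
        --id visit=<visit1>^<visit2>^...^<visitN> \
        --batch-type=slurm --nodes 3 --procs 24 --job sfm_wide_<group>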
Each WIDE singleFrameDriver used 3 nodes. For reasons unknown, a sizable fraction (~30%) of the jobs failed because Slurm could not launch them. This happened after singleFrameDriver.py had successfully submitted the job to Slurm and the job had waited in the queue for its turn; once its turn came and it started, the launch itself failed. The failures were not restricted to one specific worker node.
Log records like the following were seen:
    srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
    srun: error: Task launch for 136725.0 failed on node lsst-verify-worker03: Socket timed out on send/recv operation
    srun: error: Application launch failed: Socket timed out on send/recv operation
    srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
    slurmstepd: error: *** STEP 136725.0 ON lsst-verify-worker02 CANCELLED AT 2018-04-22T20:53:48 ***
    srun: error: lsst-verify-worker02: task 0: Killed
    srun: error: lsst-verify-worker04: task 2: Killed
    [mpiexec@lsst-verify-worker02] control_cb (pm/pmiserv/pmiserv_cb.c:208): assert (!closed) failed
    [mpiexec@lsst-verify-worker02] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
    [mpiexec@lsst-verify-worker02] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
    [mpiexec@lsst-verify-worker02] main (ui/mpich/mpiexec.c:340): process manager error waiting for completion
All failures were resubmitted iteratively; many failed again in subsequent iterations, but eventually all jobs were pushed through.
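To decide which jobs needed resubmission, failed and timed-out jobs can be listed with a Slurm accounting query along these lines (the start date and output fields below are illustrative, not the exact query used):

    sacct -X --starttime 2018-04-21 --state=FAILED,TIMEOUT,CANCELLED \
        --format=JobID,JobName%30,State,Elapsed,NodeList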
DM-14181 was filed about this issue.
Initially I attempted to run each visit as its own singleFrameDriver job. One advantage of doing so is that each visit would have its own log file. During that attempt, I found that many jobs took much longer than expected and timed out. The I/O wait time on GPFS was excessively high (e.g. ~5 sec), likely because too many nodes simultaneously held locks on the same directories and sometimes the same files. So I changed the strategy and grouped multiple visits into fewer singleFrameDriver calls. A sketch of the grouping is given below.
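The following is a minimal sketch of the grouping, assuming the visit IDs are kept in a plain-text file with one ID per line; the file names, group size, repo path, rerun, and batch parameters are illustrative placeholders.

    # Split the visit list into groups of 50 and print one
    # singleFrameDriver command per group; the ^-joined IDs
    # form a single --id visit=... specification.
    repo=/datasets/hsc/repo        # placeholder repo path
    rerun=u/$USER/sfm-grouped      # placeholder rerun
    split -l 50 wide_visits.txt wide_group_
    for f in wide_group_*; do
        visits=$(paste -s -d'^' "$f")
        echo singleFrameDriver.py "$repo" --rerun "$rerun" \
            --id visit="$visits" \
            --batch-type=slurm --nodes 3 --procs 24 --job "sfm_$f"
    done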