UDEEP
In the UDEEP layer, there are 11 tracts in total, so the initial plan was to run multiband as 11 jobs. In the first attempt, each job used 4 nodes with 12 cores per node. Some jobs failed to launch with socket timeout messages like those in DM-14181; this was before the sssd timeout window was updated on the morning of Apr 24. Those jobs were resubmitted, and the socket timeout issues were no longer seen after the DM-14181 update.
Two jobs failed because they ran out of memory: tract=8523 and tract=9813. I then reran them with 5 nodes and 6 cores each (without reusing the existing data). tract=8523 finished, but tract=9813 ran out of memory again. I continued tract=9813 using the --reuse-outputs-from option, and it then completed. Therefore, 12 slurm jobs in total contributed to the output data products.
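The trade-off between the two job layouts can be sketched with simple arithmetic. The snippet below is only an illustration under stated assumptions: it reads "6 cores each" as 6 cores per node, assumes one process per core, and assumes a node's memory is shared evenly among the processes on it. The node memory value and the helper name `job_layout` are hypothetical placeholders, not values or code from this run.

```python
def job_layout(nodes, cores_per_node, node_mem_gb):
    """Return (total cores, even per-core share of node memory in GB)."""
    total_cores = nodes * cores_per_node
    mem_per_core_gb = node_mem_gb / cores_per_node
    return total_cores, mem_per_core_gb


# Hypothetical node memory; the real value for the cluster is not stated here.
NODE_MEM_GB = 128.0

# First attempt: 4 nodes, 12 cores per node.
first = job_layout(4, 12, NODE_MEM_GB)
# Retry for the out-of-memory tracts: 5 nodes, 6 cores per node (assumed reading).
retry = job_layout(5, 6, NODE_MEM_GB)

# Ratio of per-core memory share; this does not depend on NODE_MEM_GB.
mem_ratio = retry[1] / first[1]

print("first attempt: total cores =", first[0])
print("retry:         total cores =", retry[0])
print("per-core memory ratio (retry / first) =", mem_ratio)
```

Under these assumptions, the retry uses fewer cores in total (30 versus 48) but gives each process about twice the memory share, whatever the actual node memory is.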