The UDEEP layer has 11 tracts in total, so the initial plan was to run multiband as 11 jobs, one per tract. In the first attempt, each job used 4 nodes with 12 cores per node. Some jobs failed to launch with socket timeout messages like those in DM-14181; this was before the sssd timeout window was updated on the morning of Apr 24. Those jobs were resubmitted, and the socket timeout issue was not seen again after the DM-14181 update.
Two jobs failed because they ran out of memory: tract=8523 and tract=9813. I reran both with 5 nodes and 6 cores per node, without reusing the existing data. tract=8523 finished, but tract=9813 ran out of memory again. I then continued tract=9813 with the --reuse-outputs-from option, and it completed. In total, therefore, 12 slurm jobs contributed to the output data products.
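For reference, the count of 12 can be reproduced with the small tally below. This is only my reading of the bookkeeping, not output from any tool: it assumes the first-attempt outputs of the two out-of-memory tracts were not kept (they were rerun without reusing existing data), and that the partial outputs of the second tract=9813 job were kept, since the final job reused them. The variable names are made up for illustration.

```python
# Hypothetical tally of the slurm jobs that contributed output data products.
# Assumptions (not stated explicitly in the notes above):
#   - first-attempt outputs of tract=8523 and tract=9813 were discarded,
#     because both were rerun without reusing existing data;
#   - the second tract=9813 job left partial outputs that the final job
#     picked up via --reuse-outputs-from.

n_tracts = 11                      # one first-attempt job per UDEEP tract
oom_tracts = [8523, 9813]          # first-attempt jobs that ran out of memory

# First-attempt jobs whose outputs were kept.
first_attempt_kept = n_tracts - len(oom_tracts)

# Later jobs that contributed outputs.
later_jobs = [
    ("tract=8523", "rerun with 5 nodes, completed"),
    ("tract=9813", "rerun with 5 nodes, out of memory, partial outputs kept"),
    ("tract=9813", "continued with --reuse-outputs-from, completed"),
]

total_contributing = first_attempt_kept + len(later_jobs)
print(total_contributing)
```

Jobs that failed to launch with the socket timeout are not counted, since they produced no output before being resubmitted.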