Data Management / DM-14202

multiBandDriver of the HSC PDR1 dataset

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      multiBandDriver.py of the entire HSC PDR1 dataset with w_2018_15

            Activity

            hchiang2 Hsin-Fang Chiang added a comment (edited)
            UDEEP

            In the UDEEP layer, there are 11 tracts in total, so the initial plan was to run multiband in 11 jobs. In the first attempt, each job used 4 nodes and 12 cores per node. Some jobs failed to launch with socket timeout messages like those in DM-14181; this was before the sssd timeout window was updated on the morning of Apr 24. Those jobs were resubmitted, and the socket timeout issues were not seen again after the DM-14181 update.

            Two jobs, tract=8523 and tract=9813, failed because they ran out of memory. I then re-ran both with 5 nodes and 6 cores each (without reusing the existing data). tract=8523 finished, but tract=9813 ran out of memory again. I continued tract=9813 using the --reuse-outputs-from option, and it then completed. Therefore, 12 slurm jobs in total contributed to the output data products.
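
            The total of 12 can be reconstructed from the account above. The following is only a sketch of that bookkeeping: the variable names are illustrative, and it assumes that the second failed run of tract=9813 counts as a contributing job because the --reuse-outputs-from continuation built on its outputs.

```python
# Reconstruct the UDEEP job count from the narrative above.
# All names are illustrative; the numbers come from the comment.
total_tracts = 11
oom_tracts = [8523, 9813]  # ran out of memory in the first attempt

# First-attempt jobs that completed and produced the final output.
first_attempt_ok = total_tracts - len(oom_tracts)

# Assumption: a later job counts once if its output was kept or reused.
later_jobs = {
    8523: 1,  # re-run from scratch with 5 nodes, finished
    9813: 2,  # re-run from scratch (out of memory again), then a
              # continuation with --reuse-outputs-from
}

contributing_jobs = first_attempt_ok + sum(later_jobs.values())
print(contributing_jobs)  # 12
```

            Jobs that failed to launch because of the socket timeout are left out of the count, since they produced no output before being resubmitted.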

            hchiang2 Hsin-Fang Chiang added a comment
            DEEP

            In the DEEP layer, there are 37 tracts in total. In the first attempt, 37 slurm jobs were submitted and each used 4 nodes and 12 cores. All completed except the job for tract=9463, which ran out of memory and failed. tract=9463 was then re-run using 5 nodes and 6 cores each (without reusing the existing data), and it completed.

            WIDE

            In the WIDE layer, there are 91 tracts in total. They were completed in 91 slurm jobs, each using either 3 or 4 nodes with 12 cores per node.

            hchiang2 Hsin-Fang Chiang added a comment

            The 280 log files, including 24 files from 12 UDEEP jobs, 74 files from 37 DEEP jobs, and 182 files from 91 WIDE jobs, have been copied to:

            /datasets/hsc/repo/rerun/DM-13666/UDEEP/logs/multiBand
            /datasets/hsc/repo/rerun/DM-13666/DEEP/logs/multiBand
            /datasets/hsc/repo/rerun/DM-13666/WIDE/logs/multiBand
            

            No fatal errors were found in these logs.
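
            The file counts above imply two log files per slurm job in every layer. A minimal sketch of that tally follows, together with one way such a directory could be scanned. The helper name find_marker_lines and the "FATAL" marker are hypothetical; the comment does not say how the logs were actually checked or what a fatal error looks like in them.

```python
from pathlib import Path

# Jobs per layer, taken from the comments above.
jobs_per_layer = {"UDEEP": 12, "DEEP": 37, "WIDE": 91}
FILES_PER_JOB = 2  # implied by 24 / 12, 74 / 37, and 182 / 91

log_counts = {layer: n * FILES_PER_JOB for layer, n in jobs_per_layer.items()}
total_logs = sum(log_counts.values())
print(log_counts, total_logs)


def find_marker_lines(log_dir, marker="FATAL"):
    """Return (file name, line number, text) for lines containing marker.

    The marker is a hypothetical placeholder, not the actual criterion
    used for the check described in the ticket.
    """
    hits = []
    for path in sorted(Path(log_dir).iterdir()):
        if not path.is_file():
            continue
        with path.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if marker in line:
                    hits.append((path.name, number, line.rstrip("\n")))
    return hits
```

            Under these assumptions, calling find_marker_lines on each of the three logs/multiBand directories and getting an empty list back would correspond to the result reported here.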


              People

              Assignee:
              hchiang2 Hsin-Fang Chiang
              Reporter:
              hchiang2 Hsin-Fang Chiang
              Watchers:
              Hsin-Fang Chiang
