DM-32074 (project: Data Management)

Modify how HTCondor plugin handles transfer of execution butler

    Details

      Description

      After the execution butler became the default behavior and large workflows (e.g., > 100,000 in a single submission) started running, users began reporting problems.

      Investigation showed that the time was being spent transferring the input files to each job:

      • The defaults don't have bpsUseShared set, so the HTCondor plugin was telling HTCondor to also copy the QuantumGraph file. Users have been told to add bpsUseShared to their submit YAML. Easier-to-use site-specific configs are really needed, but that is for a different ticket.
      • The execution butler in these large runs can grow larger than 1.5 GB.

      The HTCondor shadow processes on the submit node are responsible for this type of transfer, and they were the processes causing the extra load. This also explains why people were noticing more of them on the machine along with the higher-than-usual load.
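To put the load in perspective, here is a rough back-of-the-envelope calculation. It assumes every job receives its own full copy of a 1.5 GB execution butler through the submit node; the helper name is hypothetical, and the figures are simply the ones quoted above rather than measurements.

```python
def total_transfer_gb(n_jobs, size_gb_per_job):
    """Upper bound on data pushed through the submit node, in GB.

    Assumes each job gets one full copy of the file (an assumption,
    not a measured value).
    """
    return n_jobs * size_gb_per_job


# 100,000 jobs, each pulling a 1.5 GB execution butler
print(total_transfer_gb(100_000, 1.5))  # 150000.0 GB, i.e. about 150 TB
```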

      Check whether HTCondor has a different file transfer mechanism. If so, change bps to use it when bpsUseShared is set. If not, a new bps executable will be needed that does the ButlerURI transfer (cp in this case) and then runs pipetask.
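The fallback option (a job executable that stages its own inputs and then runs the real command) could look roughly like the sketch below. This is an illustration only: `stage_and_run` is a hypothetical name, and `shutil` stands in for the ButlerURI transfer, which the ticket notes reduces to a plain `cp` in this case.

```python
import shutil
import subprocess
from pathlib import Path


def stage_and_run(inputs, workdir, command):
    """Copy each input into workdir, then run command from there.

    Hypothetical helper: shutil is a stand-in for the ButlerURI transfer.
    Returns the exit code of the command.
    """
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    for src in map(Path, inputs):
        dest = workdir / src.name
        if src.is_dir():
            # Directories are copied recursively.
            shutil.copytree(src, dest, dirs_exist_ok=True)
        else:
            shutil.copy2(src, dest)
    # The copy happens inside the job, so the submit node's shadow
    # process is not involved in moving the data.
    return subprocess.run(command, cwd=workdir, check=False).returncode
```

In a real job, `command` would be the pipetask invocation and `inputs` would include the execution butler and the QuantumGraph file.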

            Activity

            mgower Michelle Gower added a comment -

            Have replicated the issue with a simple condor submit file using:

            /scratch/mgower/EXEC_REPO-2.2i_runs_test-med-1_w_2021_40_DM-32024_20211005T175202Z/gen3.sqlite3
            

            The above showed noticeable slowness with only 100 jobs, and it only got worse when trying to get the maximum of 240 jobs running at once.

            Changing that line to the following allowed 1200 jobs, each lasting a few seconds, to run through quickly with barely any load increase on the submit node:

            transfer_input_files=file:///scratch/mgower/EXEC_REPO-2.2i_runs_test-med-1_w_2021_40_DM-32024_20211005T175202Z/gen3.sqlite3
            

            With this change, HTCondor runs a command within the job to transfer the file (in this case a local copy from one directory to another). Unfortunately this mechanism (the curl plugin) doesn't work with directories, so some other changes will be needed in the bps code. The file pulls will probably go into the future new job executable, so the HTCondor plugin changes for this ticket take some shortcuts that aren't completely generic or future-proof.
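A minimal sketch of how a submit-file line might be built with this distinction in mind: regular files become `file://` URLs, while directories are left as plain paths. Both function names are hypothetical. Leaving directories as plain paths is an assumption made for illustration (HTCondor's default transfer would still handle them); the ticket does not say how the plugin actually treats directories.

```python
from pathlib import Path


def transfer_entry(path):
    """Return one transfer_input_files entry for the given path.

    Files are rewritten as file:// URLs; directories are returned as
    plain absolute paths (an assumed fallback, since the URL-based
    transfer does not work with directories).
    """
    path = Path(path).resolve()
    if path.is_dir():
        return str(path)
    return path.as_uri()


def transfer_input_files_line(paths):
    """Build a comma-separated transfer_input_files submit line."""
    return "transfer_input_files=" + ",".join(transfer_entry(p) for p in paths)
```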

            mkowalik Mikolaj Kowalik added a comment -

            Changes can be merged after addressing the few issues pointed out by Tim J and me.

            yusra Yusra AlSayyad added a comment -

            Your backport request has been approved. See my longer comment on DM-32217 for backporting info.


              People

              Assignee:
              mgower Michelle Gower
              Reporter:
              mgower Michelle Gower
              Reviewers:
              Mikolaj Kowalik
              Watchers:
              Michelle Gower, Mikolaj Kowalik, Yusra AlSayyad
