# Modify how HTCondor plugin handles transfer of execution butler


#### Details

• Type: Story
• Status: Done
• Resolution: Done
• Fix Version/s: None
• Component/s:
• Labels:
• Team:
Data Facility
• Urgent?:
No

#### Description

After the execution butler became the default behavior and large workflows (e.g., > 100,000 in a single submission) started running, users began reporting problems.

Investigation showed that the time was being spent transferring the input files to each job:

• The defaults do not set bpsUseShared, so the HTCondor plugin was telling HTCondor to also copy the QuantumGraph file. Users have been told to add bpsUseShared to their submit YAML. Easier-for-user site-specific configs are really needed, but that is for a different ticket.
• The execution butler in these large runs can grow larger than 1.5 GB.

The HTCondor shadow processes on the submit node are responsible for this type of transfer, and they were the processes causing the extra load. This also explains why folks were noticing more of them on the machine along with the higher-than-usual load.

Check whether HTCondor has a different file transfer mechanism. If so, change bps to use it when bpsUseShared is set. If not, a new bps executable will be needed that does the ButlerURI transfer (cp in this case) and then runs pipetask.
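The fallback option above (copy inside the job, then run the real command) could look roughly like the sketch below. It is only an illustration: `stage_and_run` is a hypothetical name, `shutil` stands in for the ButlerURI transfer mentioned in the ticket, and the command to run is passed in by the caller rather than being a real pipetask invocation.

```python
import shutil
import subprocess
from pathlib import Path


def stage_and_run(src, dest_dir, command):
    """Copy src into dest_dir, then run command and return its exit code.

    Hypothetical helper: shutil is used here as a stand-in for the
    ButlerURI transfer; the copy happens in the job, not via the shadow.
    """
    src = Path(src)
    dest = Path(dest_dir) / src.name
    if src.is_dir():
        # Directories are copied recursively (requires Python 3.8+).
        shutil.copytree(src, dest, dirs_exist_ok=True)
    else:
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)
    # Run the wrapped command (e.g., the pipetask call) after staging.
    return subprocess.run(command, check=False).returncode
```

A job would call this with the shared-filesystem path of the execution butler, a job-local directory, and the command line to execute as a list of strings.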

#### Activity

Michelle Gower added a comment -

Replicated the issue with a simple condor submit file using:

 /scratch/mgower/EXEC_REPO-2.2i_runs_test-med-1_w_2021_40_DM-32024_20211005T175202Z/gen3.sqlite3 

The above showed noticeable slowness with only 100 jobs, and it got worse when trying to get the maximum of 240 jobs running at once.

Changing that line to the following allowed 1200 jobs, each a few seconds long, to run through quickly with barely any load increase on the submit node:

 transfer_input_files=file:///scratch/mgower/EXEC_REPO-2.2i_runs_test-med-1_w_2021_40_DM-32024_20211005T175202Z/gen3.sqlite3 

With this change, HTCondor runs a command within the job to transfer the file (in this case a local copy from one directory to another). Unfortunately this mechanism (the curl plugin) does not work with directories, so some other changes will be needed in the bps code. The file pulls will probably move into the future new job executable, so the HTCondor plugin changes for this ticket take some shortcuts that are not completely generic or future-proof.
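A minimal sketch of how a `transfer_input_files` value might be built under this scheme follows. The function name `build_transfer_input_files` and its `use_shared` argument are hypothetical and are not the actual plugin code; leaving directories as plain paths is also an assumption, made only because the URL-based transfer does not handle them.

```python
import os
from pathlib import Path


def build_transfer_input_files(paths, use_shared):
    """Return a comma-separated value for HTCondor's transfer_input_files.

    Hypothetical helper: when use_shared is true, regular files are
    written as file:// URLs so the copy runs inside the job; directories
    are left as plain paths because the URL transfer cannot handle them.
    """
    entries = []
    for p in paths:
        # abspath avoids resolving symlinks, so the path stays as given.
        path = Path(os.path.abspath(p))
        if use_shared and not path.is_dir():
            entries.append(path.as_uri())
        else:
            entries.append(str(path))
    return ",".join(entries)
```

With `use_shared` false the paths are passed through unchanged, which corresponds to the original behavior where the shadow processes handle the transfer.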

Mikolaj Kowalik added a comment -

Changes can be merged after addressing a few issues pointed out by Tim J and me.

Yusra AlSayyad added a comment -

Your backport request has been approved. See my longer comment on DM-32217 for backporting info.

#### People

Assignee:
Michelle Gower
Reporter:
Michelle Gower
Reviewers:
Mikolaj Kowalik
Watchers:
Michelle Gower, Mikolaj Kowalik, Yusra AlSayyad