Details
- Type: Story
- Status: Done
- Resolution: Done
- Fix Version/s: None
- Component/s: ctrl_bps
- Epic Link:
- Team: Data Facility
- Urgent?: No
Description
After the execution butler became the default behavior and large workflows (e.g., > 100,000 in a single submission) started running, users began reporting problems:
- Reports of higher-than-normal load on the submit machine (https://lsstc.slack.com/archives/C01FBUGM2CV/p1633472157031500)
- Reports of a reprocessing run taking much longer than normal (https://lsstc.slack.com/archives/C01FBUGM2CV/p1633547834070100)
Investigation showed that the time was being spent transferring the input files to each job:
- The defaults don't have bpsUseShared set, so the HTCondor plugin was telling HTCondor to also copy the QuantumGraph file. Users have been told to add it to their submit yaml. We really need to work on easier-for-user site-specific configs, but that's for a different ticket.
- The execution butler in these large runs can get larger than 1.5GB.
The HTCondor shadow processes on the submit node, which were causing the extra load, are responsible for this type of transfer. This also explains why people were noticing them on the machine more often, along with the higher-than-usual load.
Check whether HTCondor has a different file transfer mechanism. If so, change bps to use it when bpsUseShared is set. If not, we will need to make a new bps executable that does the ButlerURI transfer (cp in this case) and then runs pipetask.
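The fallback described above (a job executable that pulls its own inputs and then runs the real command) could look roughly like the following. This is only a sketch under stated assumptions: `stage_and_run` is a hypothetical name, not an existing bps function, and it uses plain `shutil` copies as a stand-in for the ButlerURI transfer.

```python
import shutil
import subprocess
from pathlib import Path


def stage_and_run(inputs, dest_dir, command):
    """Copy each input into dest_dir, run command there, return its exit code.

    Hypothetical helper: shutil stands in for the ButlerURI transfer,
    which for a shared filesystem amounts to a local cp.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for src in map(Path, inputs):
        if src.is_dir():
            # Directories are copied recursively (Python 3.8+ for dirs_exist_ok).
            shutil.copytree(src, dest / src.name, dirs_exist_ok=True)
        else:
            shutil.copy2(src, dest / src.name)
    # The command (e.g. a pipetask invocation) runs inside the job, so the
    # copy above is done on the execute node rather than by a shadow process.
    return subprocess.run(command, cwd=dest, check=False).returncode
```

With a wrapper like this, the copy happens within the job itself, which is the property that matters for keeping load off the submit node.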
Have replicated the issue with a simple condor submit file using:
/scratch/mgower/EXEC_REPO-2.2i_runs_test-med-1_w_2021_40_DM-32024_20211005T175202Z/gen3.sqlite3
The above showed noticeable slowness with only 100 jobs, and it got worse when trying to get the maximum of 240 jobs running at once.
Changing that line to the following allowed 1200 jobs of a few seconds each to run through quickly with barely any load increase on the submit node:
transfer_input_files=file:///scratch/mgower/EXEC_REPO-2.2i_runs_test-med-1_w_2021_40_DM-32024_20211005T175202Z/gen3.sqlite3
With this change, HTCondor runs a command within the job to transfer the file (in this case a local copy from one directory to another). Unfortunately, this mechanism (the curl plugin) doesn't work with directories, so some other changes will be needed in the bps code. The file pulls will probably go into the future new job executable, so the HTCondor plugin changes for this ticket take some shortcuts that aren't completely generic or future-proof.
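One way to picture the file-versus-directory distinction is a small helper that builds the `transfer_input_files` value. This is a minimal sketch, not the actual plugin change: the function names are hypothetical, and leaving directories as plain paths is an assumption made for illustration, since the ticket does not say how directories end up being handled.

```python
from pathlib import Path


def transfer_entry(path):
    """Return the transfer_input_files entry for one input path.

    Regular files become file:// URIs so the transfer runs inside the job.
    Directories are left as plain paths (an assumption for this sketch),
    because the URL-based transfer does not handle directories.
    """
    p = Path(path).resolve()
    if p.is_file():
        return p.as_uri()
    return str(p)


def transfer_input_files_line(paths):
    """Build the submit-file line from a list of input paths."""
    return "transfer_input_files=" + ",".join(transfer_entry(p) for p in paths)
```

For a single existing file such as the gen3.sqlite3 above, this produces a line of the same shape as the working submit file entry.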