Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-32241

bps submit can take multiple hours

    XMLWordPrintable

Details

    Description

      The post-QuantumGraph generation stages of submission can take multiple hours for large graphs. See if can figure out where the time is going and implement some changes to speed this up.

      Attachments

        Issue Links

          Activity

            Found an assert isinstance added in an unusual location called O(# clusters) times purely for debugging a particular issue during coding that didn't get removed before committing.

             

            mkowalik pointed out a _LOG.debug line (also called O(# clusters) times) that was problematic:

              _LOG.debug("get_cluster nodes = %s", list(self._cluster_graph.nodes))

            mgower Michelle Gower added a comment - Found an assert isinstance added in an unusual location called O(# clusters) times purely for debugging a particular issue during coding that didn't get removed before committing.   mkowalik  pointed out a _LOG.debug line (also called O(# clusters) times) that was problematic:   _LOG.debug("get_cluster nodes = %s", list(self._cluster_graph.nodes))

            Further profiling was reporting most time spent in config functions.  Created DM-32305.  Also changing bps code to cache results known to be currently shared (e.g., all Quanta or clusters with same label) to reduce the actual numbers of config searches.

            mgower Michelle Gower added a comment - Further profiling was reporting most time spent in config functions.  Created DM-32305 .  Also changing bps code to cache results known to be currently shared (e.g., all Quanta or clusters with same label) to reduce the actual numbers of config searches.
            tjenness Tim Jenness added a comment -

            For the log problem, if you still need the debug message you can use _LOG.isEnabledFor(logging.DEBUG) so you only get the slow down if the log message would be issued.

            tjenness Tim Jenness added a comment - For the log problem, if you still need the debug message you can use _LOG.isEnabledFor(logging.DEBUG) so you only get the slow down if the log message would be issued.

            Most of the changes are saving values currently the same across labels.  There are a couple clean up changes (removing an isinstance in an unusual location that was missed in a previous commit, removing a debug statement that doesn't make sense at huge scales, etc).  And a couple bug fixes in rarely used code discovered while trying to profile/test changes.

            mgower Michelle Gower added a comment - Most of the changes are saving values currently the same across labels.  There are a couple clean up changes (removing an isinstance in an unusual location that was missed in a previous commit, removing a debug statement that doesn't make sense at huge scales, etc).  And a couple bug fixes in rarely used code discovered while trying to profile/test changes.

            Ran a little script to gather timings for various submission stages using one of Brock's DC2's steps that had a large Quantum Graph (325710 Quanta).  It uses the existing Quantum Graph (so there's a read time).  Also timed saving/loading of bps classes (note the ClusteredQuantumGraph loading also reads the QuantumGraph).

            This is using only w44 stack (no profiling happening during these timings):

            INFO 2021-10-29T09:50:51.015-05:00 __main__ ()(run_clustering.py:40) - Reading quantum graph - START
            INFO 2021-10-29T09:56:41.612-05:00 __main__ ()(run_clustering.py:42) - Reading QuantumGraph: Took 350.5966 seconds
            INFO 2021-10-29T09:56:41.612-05:00 __main__ ()(run_clustering.py:43) - Reading quantum graph 325710 - END
            INFO 2021-10-29T09:56:41.612-05:00 __main__ ()(run_clustering.py:46) - Clustering quantum graph (single) - START
            INFO 2021-10-29T09:58:39.184-05:00 __main__ ()(run_clustering.py:51) - Timing single quantum clustering: Took 117.5717 seconds
            INFO 2021-10-29T09:58:39.185-05:00 __main__ ()(run_clustering.py:52) - Clustering quantum graph (single) - END
            INFO 2021-10-29T09:58:39.185-05:00 __main__ ()(run_clustering.py:54) - saving clustered quantum graph - START
            INFO 2021-10-29T09:58:41.589-05:00 __main__ ()(run_clustering.py:56) - Saving clustered quantum graph: Took 2.4044 seconds
            INFO 2021-10-29T09:58:41.589-05:00 __main__ ()(run_clustering.py:57) - saving clustered quantum graph - END
            INFO 2021-10-29T09:59:06.867-05:00 __main__ ()(run_transform.py:32) - Reading ClusteredQuantumGraph - START
            INFO 2021-10-29T10:04:52.935-05:00 __main__ ()(run_transform.py:34) - Reading ClusteredQuantumGraph: Took 346.0675 seconds
            INFO 2021-10-29T10:04:52.935-05:00 __main__ ()(run_transform.py:35) - Reading ClusteredQuantumGraph - END
            INFO 2021-10-29T10:04:52.935-05:00 __main__ ()(run_transform.py:37) - Running transform - START
            INFO 2021-10-29T12:40:33.726-05:00 __main__ ()(run_transform.py:42) - Timing transform: Took 9340.7905 seconds
            INFO 2021-10-29T12:40:33.726-05:00 __main__ ()(run_transform.py:43) - Running transform - END
            INFO 2021-10-29T12:40:33.726-05:00 __main__ ()(run_transform.py:45) - Saving GenericWorkflow - START
            INFO 2021-10-29T12:40:41.103-05:00 __main__ ()(run_transform.py:48) - Saving GenericWorkflow: Took 7.3770 seconds
            INFO 2021-10-29T12:40:41.104-05:00 __main__ ()(run_transform.py:49) - Saving GenericWorkflow - END
            INFO 2021-10-29T12:41:17.489-05:00 __main__ ()(run_prepare.py:32) - Reading GenericWorkflow - START
            INFO 2021-10-29T12:41:27.836-05:00 __main__ ()(run_prepare.py:35) - Reading GenericWorkflow: Took 10.3469 seconds
            INFO 2021-10-29T12:41:27.837-05:00 __main__ ()(run_prepare.py:36) - Reading GenericWorkflow - END
            INFO 2021-10-29T12:41:27.837-05:00 __main__ ()(run_prepare.py:38) - Running prepare - START
            INFO 2021-10-29T12:55:13.814-05:00 __main__ ()(run_prepare.py:43) - Timing prepare: Took 825.9773 seconds
            INFO 2021-10-29T12:55:13.814-05:00 __main__ ()(run_prepare.py:44) - Running prepare - END
            

            mgower Michelle Gower added a comment - Ran a little script to gather timings for various submission stages using one of Brock's DC2's steps that had a large Quantum Graph (325710 Quanta).  It uses the existing Quantum Graph (so there's a read time).  Also timed saving/loading of bps classes (note the ClusteredQuantumGraph loading also reads the QuantumGraph). This is using only w44 stack (no profiling happening during these timings): INFO 2021-10-29T09:50:51.015-05:00 __main__ ()(run_clustering.py:40) - Reading quantum graph - START INFO 2021-10-29T09:56:41.612-05:00 __main__ ()(run_clustering.py:42) - Reading QuantumGraph: Took 350.5966 seconds INFO 2021-10-29T09:56:41.612-05:00 __main__ ()(run_clustering.py:43) - Reading quantum graph 325710 - END INFO 2021-10-29T09:56:41.612-05:00 __main__ ()(run_clustering.py:46) - Clustering quantum graph (single) - START INFO 2021-10-29T09:58:39.184-05:00 __main__ ()(run_clustering.py:51) - Timing single quantum clustering: Took 117.5717 seconds INFO 2021-10-29T09:58:39.185-05:00 __main__ ()(run_clustering.py:52) - Clustering quantum graph (single) - END INFO 2021-10-29T09:58:39.185-05:00 __main__ ()(run_clustering.py:54) - saving clustered quantum graph - START INFO 2021-10-29T09:58:41.589-05:00 __main__ ()(run_clustering.py:56) - Saving clustered quantum graph: Took 2.4044 seconds INFO 2021-10-29T09:58:41.589-05:00 __main__ ()(run_clustering.py:57) - saving clustered quantum graph - END INFO 2021-10-29T09:59:06.867-05:00 __main__ ()(run_transform.py:32) - Reading ClusteredQuantumGraph - START INFO 2021-10-29T10:04:52.935-05:00 __main__ ()(run_transform.py:34) - Reading ClusteredQuantumGraph: Took 346.0675 seconds INFO 2021-10-29T10:04:52.935-05:00 __main__ ()(run_transform.py:35) - Reading ClusteredQuantumGraph - END INFO 2021-10-29T10:04:52.935-05:00 __main__ ()(run_transform.py:37) - Running transform - START INFO 2021-10-29T12:40:33.726-05:00 __main__ ()(run_transform.py:42) - Timing transform: Took 9340.7905 seconds INFO 2021-10-29T12:40:33.726-05:00 __main__ ()(run_transform.py:43) - Running transform - END INFO 2021-10-29T12:40:33.726-05:00 __main__ ()(run_transform.py:45) - Saving GenericWorkflow - START INFO 2021-10-29T12:40:41.103-05:00 __main__ ()(run_transform.py:48) - Saving GenericWorkflow: Took 7.3770 seconds INFO 2021-10-29T12:40:41.104-05:00 __main__ ()(run_transform.py:49) - Saving GenericWorkflow - END INFO 2021-10-29T12:41:17.489-05:00 __main__ ()(run_prepare.py:32) - Reading GenericWorkflow - START INFO 2021-10-29T12:41:27.836-05:00 __main__ ()(run_prepare.py:35) - Reading GenericWorkflow: Took 10.3469 seconds INFO 2021-10-29T12:41:27.837-05:00 __main__ ()(run_prepare.py:36) - Reading GenericWorkflow - END INFO 2021-10-29T12:41:27.837-05:00 __main__ ()(run_prepare.py:38) - Running prepare - START INFO 2021-10-29T12:55:13.814-05:00 __main__ ()(run_prepare.py:43) - Timing prepare: Took 825.9773 seconds INFO 2021-10-29T12:55:13.814-05:00 __main__ ()(run_prepare.py:44) - Running prepare - END

            Here's the timing using w44 + ctrl_bps ticket branch (version at time of initial PR):
            (Added separate timing of creating HTCondor DAG vs writing all of the HTCondor files in the prepare stage)

            INFO 2021-11-02T10:24:32.174-05:00 __main__ ()(run_clustering.py:40) - Reading quantum graph - START
            INFO 2021-11-02T10:30:19.168-05:00 __main__ ()(run_clustering.py:42) - Reading QuantumGraph: Took 346.9940 seconds
            INFO 2021-11-02T10:30:19.168-05:00 __main__ ()(run_clustering.py:43) - Reading quantum graph 325710 - END
            INFO 2021-11-02T10:30:19.169-05:00 __main__ ()(run_clustering.py:46) - Clustering quantum graph (single) - START
            INFO 2021-11-02T10:31:11.780-05:00 __main__ ()(run_clustering.py:51) - Timing single quantum clustering: Took 52.6112 seconds
            INFO 2021-11-02T10:31:11.780-05:00 __main__ ()(run_clustering.py:52) - Clustering quantum graph (single) - END
            INFO 2021-11-02T10:31:11.780-05:00 __main__ ()(run_clustering.py:54) - saving clustered quantum graph - START
            INFO 2021-11-02T10:31:14.221-05:00 __main__ ()(run_clustering.py:56) - Saving clustered quantum graph: Took 2.4403 seconds
            INFO 2021-11-02T10:31:14.221-05:00 __main__ ()(run_clustering.py:57) - saving clustered quantum graph - END
            INFO 2021-11-02T10:31:38.914-05:00 __main__ ()(run_transform.py:32) - Reading ClusteredQuantumGraph - START
            INFO 2021-11-02T10:37:31.523-05:00 __main__ ()(run_transform.py:34) - Reading ClusteredQuantumGraph: Took 352.6081 seconds
            INFO 2021-11-02T10:37:31.523-05:00 __main__ ()(run_transform.py:35) - Reading ClusteredQuantumGraph - END
            INFO 2021-11-02T10:37:31.523-05:00 __main__ ()(run_transform.py:37) - Running transform - START
            INFO 2021-11-02T10:38:15.415-05:00 __main__ ()(run_transform.py:42) - Timing transform: Took 43.8918 seconds
            INFO 2021-11-02T10:38:15.415-05:00 __main__ ()(run_transform.py:43) - Running transform - END
            INFO 2021-11-02T10:38:15.415-05:00 __main__ ()(run_transform.py:45) - Saving GenericWorkflow - START
            INFO 2021-11-02T10:38:20.751-05:00 __main__ ()(run_transform.py:48) - Saving GenericWorkflow: Took 5.3355 seconds
            INFO 2021-11-02T10:38:20.751-05:00 __main__ ()(run_transform.py:49) - Saving GenericWorkflow - END
            INFO 2021-11-02T10:38:49.392-05:00 __main__ ()(run_prepare.py:32) - Reading GenericWorkflow - START
            INFO 2021-11-02T10:38:58.729-05:00 __main__ ()(run_prepare.py:35) - Reading GenericWorkflow: Took 9.3376 seconds
            INFO 2021-11-02T10:38:58.730-05:00 __main__ ()(run_prepare.py:36) - Reading GenericWorkflow - END
            INFO 2021-11-02T10:38:58.730-05:00 __main__ ()(run_prepare.py:38) - Running prepare - START
            INFO 2021-11-02T10:39:56.386-05:00 lsst.ctrl.bps.wms.htcondor.htcondor_service ()(htcondor_service.py:123) - Completed HTCondor workflow creation: Took 57.5734 seconds
            INFO 2021-11-02T10:43:58.801-05:00 lsst.ctrl.bps.wms.htcondor.htcondor_service ()(htcondor_service.py:129) - Completed writing out HTCondor workflow: Took 242.4142 seconds
            INFO 2021-11-02T10:43:58.802-05:00 __main__ ()(run_prepare.py:43) - Timing prepare: Took 300.0719 seconds
            INFO 2021-11-02T10:43:58.802-05:00 __main__ ()(run_prepare.py:44) - Running prepare - END
            

            mgower Michelle Gower added a comment - Here's the timing using w44 + ctrl_bps ticket branch (version at time of initial PR): (Added separate timing of creating HTCondor DAG vs writing all of the HTCondor files in the prepare stage) INFO 2021-11-02T10:24:32.174-05:00 __main__ ()(run_clustering.py:40) - Reading quantum graph - START INFO 2021-11-02T10:30:19.168-05:00 __main__ ()(run_clustering.py:42) - Reading QuantumGraph: Took 346.9940 seconds INFO 2021-11-02T10:30:19.168-05:00 __main__ ()(run_clustering.py:43) - Reading quantum graph 325710 - END INFO 2021-11-02T10:30:19.169-05:00 __main__ ()(run_clustering.py:46) - Clustering quantum graph (single) - START INFO 2021-11-02T10:31:11.780-05:00 __main__ ()(run_clustering.py:51) - Timing single quantum clustering: Took 52.6112 seconds INFO 2021-11-02T10:31:11.780-05:00 __main__ ()(run_clustering.py:52) - Clustering quantum graph (single) - END INFO 2021-11-02T10:31:11.780-05:00 __main__ ()(run_clustering.py:54) - saving clustered quantum graph - START INFO 2021-11-02T10:31:14.221-05:00 __main__ ()(run_clustering.py:56) - Saving clustered quantum graph: Took 2.4403 seconds INFO 2021-11-02T10:31:14.221-05:00 __main__ ()(run_clustering.py:57) - saving clustered quantum graph - END INFO 2021-11-02T10:31:38.914-05:00 __main__ ()(run_transform.py:32) - Reading ClusteredQuantumGraph - START INFO 2021-11-02T10:37:31.523-05:00 __main__ ()(run_transform.py:34) - Reading ClusteredQuantumGraph: Took 352.6081 seconds INFO 2021-11-02T10:37:31.523-05:00 __main__ ()(run_transform.py:35) - Reading ClusteredQuantumGraph - END INFO 2021-11-02T10:37:31.523-05:00 __main__ ()(run_transform.py:37) - Running transform - START INFO 2021-11-02T10:38:15.415-05:00 __main__ ()(run_transform.py:42) - Timing transform: Took 43.8918 seconds INFO 2021-11-02T10:38:15.415-05:00 __main__ ()(run_transform.py:43) - Running transform - END INFO 2021-11-02T10:38:15.415-05:00 __main__ ()(run_transform.py:45) - Saving GenericWorkflow - START INFO 2021-11-02T10:38:20.751-05:00 __main__ ()(run_transform.py:48) - Saving GenericWorkflow: Took 5.3355 seconds INFO 2021-11-02T10:38:20.751-05:00 __main__ ()(run_transform.py:49) - Saving GenericWorkflow - END INFO 2021-11-02T10:38:49.392-05:00 __main__ ()(run_prepare.py:32) - Reading GenericWorkflow - START INFO 2021-11-02T10:38:58.729-05:00 __main__ ()(run_prepare.py:35) - Reading GenericWorkflow: Took 9.3376 seconds INFO 2021-11-02T10:38:58.730-05:00 __main__ ()(run_prepare.py:36) - Reading GenericWorkflow - END INFO 2021-11-02T10:38:58.730-05:00 __main__ ()(run_prepare.py:38) - Running prepare - START INFO 2021-11-02T10:39:56.386-05:00 lsst.ctrl.bps.wms.htcondor.htcondor_service ()(htcondor_service.py:123) - Completed HTCondor workflow creation: Took 57.5734 seconds INFO 2021-11-02T10:43:58.801-05:00 lsst.ctrl.bps.wms.htcondor.htcondor_service ()(htcondor_service.py:129) - Completed writing out HTCondor workflow: Took 242.4142 seconds INFO 2021-11-02T10:43:58.802-05:00 __main__ ()(run_prepare.py:43) - Timing prepare: Took 300.0719 seconds INFO 2021-11-02T10:43:58.802-05:00 __main__ ()(run_prepare.py:44) - Running prepare - END

            Can be merged after addressing minor issues in docstrings of some newly added functions.

            mkowalik Mikolaj Kowalik added a comment - Can be merged after addressing minor issues in docstrings of some newly added functions.

            People

              mgower Michelle Gower
              mgower Michelle Gower
              Mikolaj Kowalik
              Michelle Gower, Mikolaj Kowalik, Tim Jenness
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Jenkins

                  No builds found.