Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-32241

bps submit can take multiple hours

    XMLWordPrintable

    Details

      Description

      The post-QuantumGraph generation stages of submission can take multiple hours for large graphs. See if can figure out where the time is going and implement some changes to speed this up.

        Attachments

          Issue Links

            Activity

            Hide
            mgower Michelle Gower added a comment -

            Found an assert isinstance added in an unusual location called O(# clusters) times purely for debugging a particular issue during coding that didn't get removed before committing.

             

            Mikolaj Kowalik pointed out a _LOG.debug line (also called O(# clusters) times) that was problematic:

              _LOG.debug("get_cluster nodes = %s", list(self._cluster_graph.nodes))

            Show
            mgower Michelle Gower added a comment - Found an assert isinstance added in an unusual location called O(# clusters) times purely for debugging a particular issue during coding that didn't get removed before committing.   Mikolaj Kowalik  pointed out a _LOG.debug line (also called O(# clusters) times) that was problematic:   _LOG.debug("get_cluster nodes = %s", list(self._cluster_graph.nodes))
            Hide
            mgower Michelle Gower added a comment -

            Further profiling was reporting most time spent in config functions.  Created DM-32305.  Also changing bps code to cache results known to be currently shared (e.g., all Quanta or clusters with same label) to reduce the actual numbers of config searches.

            Show
            mgower Michelle Gower added a comment - Further profiling was reporting most time spent in config functions.  Created DM-32305 .  Also changing bps code to cache results known to be currently shared (e.g., all Quanta or clusters with same label) to reduce the actual numbers of config searches.
            Hide
            tjenness Tim Jenness added a comment -

            For the log problem, if you still need the debug message you can use _LOG.isEnabledFor(logging.DEBUG) so you only get the slow down if the log message would be issued.

            Show
            tjenness Tim Jenness added a comment - For the log problem, if you still need the debug message you can use _LOG.isEnabledFor(logging.DEBUG) so you only get the slow down if the log message would be issued.
            Hide
            mgower Michelle Gower added a comment -

            Most of the changes are saving values currently the same across labels.  There are a couple clean up changes (removing an isinstance in an unusual location that was missed in a previous commit, removing a debug statement that doesn't make sense at huge scales, etc).  And a couple bug fixes in rarely used code discovered while trying to profile/test changes.

            Show
            mgower Michelle Gower added a comment - Most of the changes are saving values currently the same across labels.  There are a couple clean up changes (removing an isinstance in an unusual location that was missed in a previous commit, removing a debug statement that doesn't make sense at huge scales, etc).  And a couple bug fixes in rarely used code discovered while trying to profile/test changes.
            Hide
            mgower Michelle Gower added a comment -

            Ran a little script to gather timings for various submission stages using one of Brock's DC2's steps that had a large Quantum Graph (325710 Quanta).  It uses the existing Quantum Graph (so there's a read time).  Also timed saving/loading of bps classes (note the ClusteredQuantumGraph loading also reads the QuantumGraph).

            This is using only w44 stack (no profiling happening during these timings):

            INFO 2021-10-29T09:50:51.015-05:00 __main__ ()(run_clustering.py:40) - Reading quantum graph - START
            INFO 2021-10-29T09:56:41.612-05:00 __main__ ()(run_clustering.py:42) - Reading QuantumGraph: Took 350.5966 seconds
            INFO 2021-10-29T09:56:41.612-05:00 __main__ ()(run_clustering.py:43) - Reading quantum graph 325710 - END
            INFO 2021-10-29T09:56:41.612-05:00 __main__ ()(run_clustering.py:46) - Clustering quantum graph (single) - START
            INFO 2021-10-29T09:58:39.184-05:00 __main__ ()(run_clustering.py:51) - Timing single quantum clustering: Took 117.5717 seconds
            INFO 2021-10-29T09:58:39.185-05:00 __main__ ()(run_clustering.py:52) - Clustering quantum graph (single) - END
            INFO 2021-10-29T09:58:39.185-05:00 __main__ ()(run_clustering.py:54) - saving clustered quantum graph - START
            INFO 2021-10-29T09:58:41.589-05:00 __main__ ()(run_clustering.py:56) - Saving clustered quantum graph: Took 2.4044 seconds
            INFO 2021-10-29T09:58:41.589-05:00 __main__ ()(run_clustering.py:57) - saving clustered quantum graph - END
            INFO 2021-10-29T09:59:06.867-05:00 __main__ ()(run_transform.py:32) - Reading ClusteredQuantumGraph - START
            INFO 2021-10-29T10:04:52.935-05:00 __main__ ()(run_transform.py:34) - Reading ClusteredQuantumGraph: Took 346.0675 seconds
            INFO 2021-10-29T10:04:52.935-05:00 __main__ ()(run_transform.py:35) - Reading ClusteredQuantumGraph - END
            INFO 2021-10-29T10:04:52.935-05:00 __main__ ()(run_transform.py:37) - Running transform - START
            INFO 2021-10-29T12:40:33.726-05:00 __main__ ()(run_transform.py:42) - Timing transform: Took 9340.7905 seconds
            INFO 2021-10-29T12:40:33.726-05:00 __main__ ()(run_transform.py:43) - Running transform - END
            INFO 2021-10-29T12:40:33.726-05:00 __main__ ()(run_transform.py:45) - Saving GenericWorkflow - START
            INFO 2021-10-29T12:40:41.103-05:00 __main__ ()(run_transform.py:48) - Saving GenericWorkflow: Took 7.3770 seconds
            INFO 2021-10-29T12:40:41.104-05:00 __main__ ()(run_transform.py:49) - Saving GenericWorkflow - END
            INFO 2021-10-29T12:41:17.489-05:00 __main__ ()(run_prepare.py:32) - Reading GenericWorkflow - START
            INFO 2021-10-29T12:41:27.836-05:00 __main__ ()(run_prepare.py:35) - Reading GenericWorkflow: Took 10.3469 seconds
            INFO 2021-10-29T12:41:27.837-05:00 __main__ ()(run_prepare.py:36) - Reading GenericWorkflow - END
            INFO 2021-10-29T12:41:27.837-05:00 __main__ ()(run_prepare.py:38) - Running prepare - START
            INFO 2021-10-29T12:55:13.814-05:00 __main__ ()(run_prepare.py:43) - Timing prepare: Took 825.9773 seconds
            INFO 2021-10-29T12:55:13.814-05:00 __main__ ()(run_prepare.py:44) - Running prepare - END
            

            Show
            mgower Michelle Gower added a comment - Ran a little script to gather timings for various submission stages using one of Brock's DC2's steps that had a large Quantum Graph (325710 Quanta).  It uses the existing Quantum Graph (so there's a read time).  Also timed saving/loading of bps classes (note the ClusteredQuantumGraph loading also reads the QuantumGraph). This is using only w44 stack (no profiling happening during these timings): INFO 2021-10-29T09:50:51.015-05:00 __main__ ()(run_clustering.py:40) - Reading quantum graph - START INFO 2021-10-29T09:56:41.612-05:00 __main__ ()(run_clustering.py:42) - Reading QuantumGraph: Took 350.5966 seconds INFO 2021-10-29T09:56:41.612-05:00 __main__ ()(run_clustering.py:43) - Reading quantum graph 325710 - END INFO 2021-10-29T09:56:41.612-05:00 __main__ ()(run_clustering.py:46) - Clustering quantum graph (single) - START INFO 2021-10-29T09:58:39.184-05:00 __main__ ()(run_clustering.py:51) - Timing single quantum clustering: Took 117.5717 seconds INFO 2021-10-29T09:58:39.185-05:00 __main__ ()(run_clustering.py:52) - Clustering quantum graph (single) - END INFO 2021-10-29T09:58:39.185-05:00 __main__ ()(run_clustering.py:54) - saving clustered quantum graph - START INFO 2021-10-29T09:58:41.589-05:00 __main__ ()(run_clustering.py:56) - Saving clustered quantum graph: Took 2.4044 seconds INFO 2021-10-29T09:58:41.589-05:00 __main__ ()(run_clustering.py:57) - saving clustered quantum graph - END INFO 2021-10-29T09:59:06.867-05:00 __main__ ()(run_transform.py:32) - Reading ClusteredQuantumGraph - START INFO 2021-10-29T10:04:52.935-05:00 __main__ ()(run_transform.py:34) - Reading ClusteredQuantumGraph: Took 346.0675 seconds INFO 2021-10-29T10:04:52.935-05:00 __main__ ()(run_transform.py:35) - Reading ClusteredQuantumGraph - END INFO 2021-10-29T10:04:52.935-05:00 __main__ ()(run_transform.py:37) - Running transform - START INFO 2021-10-29T12:40:33.726-05:00 __main__ ()(run_transform.py:42) - Timing transform: Took 9340.7905 seconds INFO 2021-10-29T12:40:33.726-05:00 __main__ ()(run_transform.py:43) - Running transform - END INFO 2021-10-29T12:40:33.726-05:00 __main__ ()(run_transform.py:45) - Saving GenericWorkflow - START INFO 2021-10-29T12:40:41.103-05:00 __main__ ()(run_transform.py:48) - Saving GenericWorkflow: Took 7.3770 seconds INFO 2021-10-29T12:40:41.104-05:00 __main__ ()(run_transform.py:49) - Saving GenericWorkflow - END INFO 2021-10-29T12:41:17.489-05:00 __main__ ()(run_prepare.py:32) - Reading GenericWorkflow - START INFO 2021-10-29T12:41:27.836-05:00 __main__ ()(run_prepare.py:35) - Reading GenericWorkflow: Took 10.3469 seconds INFO 2021-10-29T12:41:27.837-05:00 __main__ ()(run_prepare.py:36) - Reading GenericWorkflow - END INFO 2021-10-29T12:41:27.837-05:00 __main__ ()(run_prepare.py:38) - Running prepare - START INFO 2021-10-29T12:55:13.814-05:00 __main__ ()(run_prepare.py:43) - Timing prepare: Took 825.9773 seconds INFO 2021-10-29T12:55:13.814-05:00 __main__ ()(run_prepare.py:44) - Running prepare - END
            Hide
            mgower Michelle Gower added a comment -

            Here's the timing using w44 + ctrl_bps ticket branch (version at time of initial PR):
            (Added separate timing of creating HTCondor DAG vs writing all of the HTCondor files in the prepare stage)

            INFO 2021-11-02T10:24:32.174-05:00 __main__ ()(run_clustering.py:40) - Reading quantum graph - START
            INFO 2021-11-02T10:30:19.168-05:00 __main__ ()(run_clustering.py:42) - Reading QuantumGraph: Took 346.9940 seconds
            INFO 2021-11-02T10:30:19.168-05:00 __main__ ()(run_clustering.py:43) - Reading quantum graph 325710 - END
            INFO 2021-11-02T10:30:19.169-05:00 __main__ ()(run_clustering.py:46) - Clustering quantum graph (single) - START
            INFO 2021-11-02T10:31:11.780-05:00 __main__ ()(run_clustering.py:51) - Timing single quantum clustering: Took 52.6112 seconds
            INFO 2021-11-02T10:31:11.780-05:00 __main__ ()(run_clustering.py:52) - Clustering quantum graph (single) - END
            INFO 2021-11-02T10:31:11.780-05:00 __main__ ()(run_clustering.py:54) - saving clustered quantum graph - START
            INFO 2021-11-02T10:31:14.221-05:00 __main__ ()(run_clustering.py:56) - Saving clustered quantum graph: Took 2.4403 seconds
            INFO 2021-11-02T10:31:14.221-05:00 __main__ ()(run_clustering.py:57) - saving clustered quantum graph - END
            INFO 2021-11-02T10:31:38.914-05:00 __main__ ()(run_transform.py:32) - Reading ClusteredQuantumGraph - START
            INFO 2021-11-02T10:37:31.523-05:00 __main__ ()(run_transform.py:34) - Reading ClusteredQuantumGraph: Took 352.6081 seconds
            INFO 2021-11-02T10:37:31.523-05:00 __main__ ()(run_transform.py:35) - Reading ClusteredQuantumGraph - END
            INFO 2021-11-02T10:37:31.523-05:00 __main__ ()(run_transform.py:37) - Running transform - START
            INFO 2021-11-02T10:38:15.415-05:00 __main__ ()(run_transform.py:42) - Timing transform: Took 43.8918 seconds
            INFO 2021-11-02T10:38:15.415-05:00 __main__ ()(run_transform.py:43) - Running transform - END
            INFO 2021-11-02T10:38:15.415-05:00 __main__ ()(run_transform.py:45) - Saving GenericWorkflow - START
            INFO 2021-11-02T10:38:20.751-05:00 __main__ ()(run_transform.py:48) - Saving GenericWorkflow: Took 5.3355 seconds
            INFO 2021-11-02T10:38:20.751-05:00 __main__ ()(run_transform.py:49) - Saving GenericWorkflow - END
            INFO 2021-11-02T10:38:49.392-05:00 __main__ ()(run_prepare.py:32) - Reading GenericWorkflow - START
            INFO 2021-11-02T10:38:58.729-05:00 __main__ ()(run_prepare.py:35) - Reading GenericWorkflow: Took 9.3376 seconds
            INFO 2021-11-02T10:38:58.730-05:00 __main__ ()(run_prepare.py:36) - Reading GenericWorkflow - END
            INFO 2021-11-02T10:38:58.730-05:00 __main__ ()(run_prepare.py:38) - Running prepare - START
            INFO 2021-11-02T10:39:56.386-05:00 lsst.ctrl.bps.wms.htcondor.htcondor_service ()(htcondor_service.py:123) - Completed HTCondor workflow creation: Took 57.5734 seconds
            INFO 2021-11-02T10:43:58.801-05:00 lsst.ctrl.bps.wms.htcondor.htcondor_service ()(htcondor_service.py:129) - Completed writing out HTCondor workflow: Took 242.4142 seconds
            INFO 2021-11-02T10:43:58.802-05:00 __main__ ()(run_prepare.py:43) - Timing prepare: Took 300.0719 seconds
            INFO 2021-11-02T10:43:58.802-05:00 __main__ ()(run_prepare.py:44) - Running prepare - END
            

            Show
            mgower Michelle Gower added a comment - Here's the timing using w44 + ctrl_bps ticket branch (version at time of initial PR): (Added separate timing of creating HTCondor DAG vs writing all of the HTCondor files in the prepare stage) INFO 2021-11-02T10:24:32.174-05:00 __main__ ()(run_clustering.py:40) - Reading quantum graph - START INFO 2021-11-02T10:30:19.168-05:00 __main__ ()(run_clustering.py:42) - Reading QuantumGraph: Took 346.9940 seconds INFO 2021-11-02T10:30:19.168-05:00 __main__ ()(run_clustering.py:43) - Reading quantum graph 325710 - END INFO 2021-11-02T10:30:19.169-05:00 __main__ ()(run_clustering.py:46) - Clustering quantum graph (single) - START INFO 2021-11-02T10:31:11.780-05:00 __main__ ()(run_clustering.py:51) - Timing single quantum clustering: Took 52.6112 seconds INFO 2021-11-02T10:31:11.780-05:00 __main__ ()(run_clustering.py:52) - Clustering quantum graph (single) - END INFO 2021-11-02T10:31:11.780-05:00 __main__ ()(run_clustering.py:54) - saving clustered quantum graph - START INFO 2021-11-02T10:31:14.221-05:00 __main__ ()(run_clustering.py:56) - Saving clustered quantum graph: Took 2.4403 seconds INFO 2021-11-02T10:31:14.221-05:00 __main__ ()(run_clustering.py:57) - saving clustered quantum graph - END INFO 2021-11-02T10:31:38.914-05:00 __main__ ()(run_transform.py:32) - Reading ClusteredQuantumGraph - START INFO 2021-11-02T10:37:31.523-05:00 __main__ ()(run_transform.py:34) - Reading ClusteredQuantumGraph: Took 352.6081 seconds INFO 2021-11-02T10:37:31.523-05:00 __main__ ()(run_transform.py:35) - Reading ClusteredQuantumGraph - END INFO 2021-11-02T10:37:31.523-05:00 __main__ ()(run_transform.py:37) - Running transform - START INFO 2021-11-02T10:38:15.415-05:00 __main__ ()(run_transform.py:42) - Timing transform: Took 43.8918 seconds INFO 2021-11-02T10:38:15.415-05:00 __main__ ()(run_transform.py:43) - Running transform - END INFO 2021-11-02T10:38:15.415-05:00 __main__ ()(run_transform.py:45) - Saving GenericWorkflow - START INFO 2021-11-02T10:38:20.751-05:00 __main__ ()(run_transform.py:48) - Saving GenericWorkflow: Took 5.3355 seconds INFO 2021-11-02T10:38:20.751-05:00 __main__ ()(run_transform.py:49) - Saving GenericWorkflow - END INFO 2021-11-02T10:38:49.392-05:00 __main__ ()(run_prepare.py:32) - Reading GenericWorkflow - START INFO 2021-11-02T10:38:58.729-05:00 __main__ ()(run_prepare.py:35) - Reading GenericWorkflow: Took 9.3376 seconds INFO 2021-11-02T10:38:58.730-05:00 __main__ ()(run_prepare.py:36) - Reading GenericWorkflow - END INFO 2021-11-02T10:38:58.730-05:00 __main__ ()(run_prepare.py:38) - Running prepare - START INFO 2021-11-02T10:39:56.386-05:00 lsst.ctrl.bps.wms.htcondor.htcondor_service ()(htcondor_service.py:123) - Completed HTCondor workflow creation: Took 57.5734 seconds INFO 2021-11-02T10:43:58.801-05:00 lsst.ctrl.bps.wms.htcondor.htcondor_service ()(htcondor_service.py:129) - Completed writing out HTCondor workflow: Took 242.4142 seconds INFO 2021-11-02T10:43:58.802-05:00 __main__ ()(run_prepare.py:43) - Timing prepare: Took 300.0719 seconds INFO 2021-11-02T10:43:58.802-05:00 __main__ ()(run_prepare.py:44) - Running prepare - END
            Hide
            mkowalik Mikolaj Kowalik added a comment -

            Can be merged after addressing minor issues in docstrings of some newly added functions.

            Show
            mkowalik Mikolaj Kowalik added a comment - Can be merged after addressing minor issues in docstrings of some newly added functions.

              People

              Assignee:
              mgower Michelle Gower
              Reporter:
              mgower Michelle Gower
              Reviewers:
              Mikolaj Kowalik
              Watchers:
              Michelle Gower, Mikolaj Kowalik, Tim Jenness
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:

                  Jenkins

                  No builds found.