Data Management / DM-21817

tract-scale execution integration and tests


    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Based on the w_2019_38 stack (before the backward-incompatible change in the Gen3 registry and before the RC2 repo generation code went missing), integrate the steps and test the execution. Use tract=9615, the HSC-RC2 tract from the June/July milestone run. Troubleshoot issues.


          Activity

          hchiang2 Hsin-Fang Chiang added a comment -

          2019-11-27T19:20:22Z, 2019-11-27T19:23:45Z

          Attempted running two graphs simultaneously again. The master was an m5.2xlarge on-demand instance; its CPU utilization was ~50-70% while it managed the sfm (single frame processing) jobs. There was a surge of database transactions shortly after the sfm jobs started, but it soon evened out to ~50-70 average active sessions during sfm processing. For the workers, I got 150 m5.xlarge Spot instances for the single frame processing part and 150 r4.2xlarge Spot instances later. From RDS Performance Insights, xact_commit.avg (transaction commits per second) was typically ~10 and the DB load (average active sessions) ~1.

          Workflow wall time                                       : 8 hrs, 44 mins
          Cumulative job wall time                                 : 62 days, 16 hrs
          Cumulative job wall time as seen from submit side        : 62 days, 23 hrs
          Cumulative job badput wall time                          : 0.0 secs
          Cumulative job badput wall time as seen from submit side : 0.0 secs
           
          Workflow wall time                                       : 9 hrs, 59 mins
          Cumulative job wall time                                 : 62 days, 23 hrs
          Cumulative job wall time as seen from submit side        : 63 days, 6 hrs
          Cumulative job badput wall time                          : 0.0 secs
          Cumulative job badput wall time as seen from submit side : 0.0 secs
          

          # All (All)
          Transformation           Count     Succeeded Failed  Min       Max       Mean          Total       
          dagman::post             29788     29788     0       0.0       9.0       1.052         31343.0     
          pegasus::dirmanager      1         1         0       3.0       3.0       3.0           3.0         
          pegasus::transfer        2712      2712      0       2.176     6.768     4.21          11417.243   
          pipetask                 27075     27075     0       17.861    6243.819  199.636       5405141.464 
           
          # All (All)
          Transformation           Count     Succeeded Failed  Min       Max       Mean          Total       
          dagman::post             29788     29788     0       0.0       7.0       1.012         30148.0     
          pegasus::dirmanager      1         1         0       2.0       2.0       2.0           2.0         
          pegasus::transfer        2712      2712      0       2.167     9.002     4.255         11540.378   
          pipetask                 27075     27075     0       19.297    6300.657  200.601       5431273.315 
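
          For reference, a minimal monitoring sketch (my own illustration, not something run as part of this workflow): it samples the same two registry-database metrics quoted above, transaction commits per second and active sessions, using psycopg2. The DSN is a placeholder.

          import time

          import psycopg2

          # Placeholder DSN; the real RDS endpoint and credentials are not recorded here.
          DSN = "host=<rds-endpoint> dbname=registry user=... password=..."

          def sample(interval=60):
              conn = psycopg2.connect(DSN)
              conn.autocommit = True  # each query then sees a fresh statistics snapshot
              cur = conn.cursor()
              cur.execute("SELECT sum(xact_commit) FROM pg_stat_database;")
              before = cur.fetchone()[0]
              time.sleep(interval)
              cur.execute("SELECT sum(xact_commit) FROM pg_stat_database;")
              after = cur.fetchone()[0]
              # Roughly what Performance Insights reports as DB load (average active sessions).
              cur.execute("SELECT count(*) FROM pg_stat_activity WHERE state = 'active';")
              active = cur.fetchone()[0]
              print(f"xact_commit/s ~{(after - before) / interval:.1f}, active sessions: {active}")
              conn.close()

          sample()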
          

          hchiang2 Hsin-Fang Chiang added a comment -

          2019-12-03T19:58:30Z, 2019-12-03T19:59:26Z, 2019-12-03T20:01:45Z, 2019-12-03T20:03:15Z, 2019-12-03T20:04:33Z

          I attempted to run 5 graphs at the same time. Of the initialization jobs (the first blocking job of each graph), two died with:
           

          Traceback (most recent call last):
            File "/home/centos/w_2019_38/stack/miniconda3-4.5.12-1172c30/Linux64/sqlalchemy/1.2.16+2/lib/python/SQLAlchemy-1.2.16-py3.7-linux-x86_64.egg/sqlalchemy/engine/base.py", line 1230, in _execute_context
              cursor, statement, parameters, context
            File "/home/centos/w_2019_38/stack/miniconda3-4.5.12-1172c30/Linux64/sqlalchemy/1.2.16+2/lib/python/SQLAlchemy-1.2.16-py3.7-linux-x86_64.egg/sqlalchemy/engine/default.py", line 536, in do_execute
              cursor.execute(statement, parameters)
          psycopg2.OperationalError: out of shared memory
          HINT:  You might need to increase max_locks_per_transaction.
          

          One finished its init job and continued with single frame processing. The other two are still trying, but I'll just kill them.
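
          For the record, a sketch (illustrative only, not what we actually did) of bumping max_locks_per_transaction on RDS with boto3; the parameter group name is a placeholder, and since this is a static PostgreSQL parameter it only takes effect after a reboot.

          import boto3

          rds = boto3.client("rds")

          # Placeholder parameter group name; the actual one is not recorded here.
          rds.modify_db_parameter_group(
              DBParameterGroupName="gen3-registry-params",
              Parameters=[
                  {
                      "ParameterName": "max_locks_per_transaction",
                      "ParameterValue": "256",          # PostgreSQL default is 64
                      "ApplyMethod": "pending-reboot",  # static parameter: requires a reboot
                  }
              ],
          )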

          hchiang2 Hsin-Fang Chiang added a comment -

          2019-12-16T20:13:34Z, 2019-12-16T20:14:29Z, 2019-12-16T20:17:52Z, 2019-12-16T20:19:57Z, 2019-12-16T20:21:27Z

          After Dino Bektesevic increased the cache sizes of the RDS instance to ~1 GB, I tried the same 5 graphs again.

          After two of the 5 workflows finished their initialization jobs, many (but not all) ISR jobs failed with:

          conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
          sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) FATAL:  remaining connection slots are reserved for non-replication superuser connections
          

          At the time there were 200 m5.xlarge workers.

          Dino then increased max_connections from 350 to 1000. After re-submitting new workflows with the same 200 workers, the connection-slot error was no longer seen.
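
          A quick sketch (illustrative only; the DSN is a placeholder) for checking how close the registry is to the connection limit while workflows are running:

          import psycopg2

          # Placeholder DSN; the real RDS endpoint and credentials are not recorded here.
          conn = psycopg2.connect("host=<rds-endpoint> dbname=registry user=... password=...")
          cur = conn.cursor()
          cur.execute("SHOW max_connections;")
          limit = int(cur.fetchone()[0])
          cur.execute("SELECT count(*) FROM pg_stat_activity;")
          in_use = cur.fetchone()[0]
          print(f"{in_use} of {limit} connection slots in use")
          conn.close()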

          hchiang2 Hsin-Fang Chiang added a comment -

          In the next attempt I (stupidly) requested 300 m5.xlarge workers; since each instance ran 4 jobs, that meant 1200 connections, so I hit the same error again. We then increased max_connections to 2500.
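
          The back-of-the-envelope sizing, written out (the 4 jobs per instance comes from the comment above; the 25% headroom is my own assumption):

          workers = 300        # m5.xlarge Spot instances requested
          jobs_per_worker = 4  # each running job holds a registry connection
          headroom = 1.25      # assumed margin for init jobs, monitoring, reserved superuser slots

          needed = int(workers * jobs_per_worker * headroom)
          print(needed)  # 1500 -- comfortably below the new max_connections of 2500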

          hchiang2 Hsin-Fang Chiang added a comment -

          We are wrapping up this effort for now, or at least until the new LDF / DOE lab decision is made. Closing this ticket. The report is being written in https://dmtn-137.lsst.io/ (DM-21820).


            People

            Assignee:
            hchiang2 Hsin-Fang Chiang
            Reporter:
            hchiang2 Hsin-Fang Chiang
            Watchers:
            Dino Bektesevic, Hsin-Fang Chiang, Yusra AlSayyad

