Data Management / DM-21817

tract-scale execution integration and tests


    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Based on the w_2019_38 stack (before the backward-incompatible change in the Gen3 registry and the missing RC2 repo generation code), integrate the steps and test execution. Use tract=9615, the HSC-RC2 tract used in the June/July milestone run. Troubleshoot issues.

          Activity
          hchiang2 Hsin-Fang Chiang created issue
          hchiang2 Hsin-Fang Chiang set Epic Link to DM-21045
          hchiang2 Hsin-Fang Chiang changed Status from To Do to In Progress
          hchiang2 Hsin-Fang Chiang added a comment -

          Made the HSC-RC2 repo with its datastore in an S3 bucket and its registry in RDS, made a quantum graph pickle (DM-21461), and transformed the quantum graph into Pegasus format with a one-to-one quantum-to-job mapping.
          All files needed in the Pegasus workflow are stored in the s3://pegasus9615/ bucket and can be retrieved for new runs. Multiple test runs were done to iteratively identify issues and improve the process.

          hchiang2 Hsin-Fang Chiang added a comment - - edited

          Run 20191014T193110+0000

           

          Before the new AMIs (DM-21818) were ready, test runs were done using the old-style AMIs, with messier environments and credentials. r5.large Spot instances were used for memory-demanding tasks, and loss of instance connectivity was observed.
          This was first noticed as instances disappearing from the condor pool (according to condor_status), while "condor_annex -status" still reported the instances as running. The instances appeared to lose network reachability (for example, they could not be reached directly via ssh as usual). Eventually, with a delay, the EC2 "instance status check" showed the reachability check failing, while the EC2 instance state remained "running". Investigations continue with new runs.

          Connection timeouts to the RDS instance, and errors like lib/python3.7/site-packages/urllib3/response.py", line 378, in _error_catcher, were seen in some failed jobs. In total, 109 jobs had failed and 1014 were unready when the workflow stopped.

          See more discussions in https://confluence.lsstcorp.org/display/DM/AWS+PoC+Meeting+2019-10-18

          hchiang2 Hsin-Fang Chiang added a comment -

          Run 20191023T000938+0000

           

          This used the new AMIs centos_condor_w38_master and centos_condor_w38_worker_hfc_partitionable. Condor annex workers were added with Spot Fleets of m4.large and then r4.large. Jobs of the 'MakeWarpTask', 'CompareWarpAssembleCoaddTask', 'DeblendCoaddSourcesSingleTask', and 'MeasureMergedCoaddSourcesTask' types were set to require 12 GB of memory.
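          The per-task memory requirement can be expressed as a small lookup table when generating the workflow. A minimal sketch; the helper name and the 2 GB default are illustrative assumptions, and only the four task names and the 12 GB figure come from this run:

```python
# Hypothetical helper: map pipetask labels to a Condor memory request (MB).
# The four large-memory tasks and the 12 GB figure are from this run; the
# 2 GB default for everything else is an assumption for illustration.
LARGE_MEMORY_TASKS = {
    "MakeWarpTask",
    "CompareWarpAssembleCoaddTask",
    "DeblendCoaddSourcesSingleTask",
    "MeasureMergedCoaddSourcesTask",
}

def request_memory_mb(task_name, large_mb=12 * 1024, default_mb=2 * 1024):
    """Return the memory (in MB) to request for a job of the given task."""
    return large_mb if task_name in LARGE_MEMORY_TASKS else default_mb

print(request_memory_mb("MakeWarpTask"))  # 12288
print(request_memory_mb("IsrTask"))       # 2048
```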

          The instance reachability issue happened again, this time with the r4.large instances. As recommended by our AWS collaborators, a support case was filed with AWS Support.

          MemoryError: std::bad_alloc was found in the logs of 4 DetectCoaddSourcesTask jobs. Many RDS connection timeouts were found in the logs of failed jobs, including sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not connect to server. Some logs look truncated, without a clear indication of the cause. When I stopped the workflow, it had 115 failed jobs and 1034 jobs unready. Failed jobs included MakeWarpTask, CompareWarpAssembleCoaddTask, DeblendCoaddSourcesSingleTask, and MeasureMergedCoaddSourcesTask.

          I identified a few failed jobs and retried them manually, with memory monitoring enabled through CloudWatch (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/mon-scripts.html). Memory usage spiked to around 26 GB for these test jobs. I suspect OOM was related to the troublesome instances; nonetheless, a more graceful exit is much more desirable.

          hchiang2 Hsin-Fang Chiang added a comment -

          20191026T041828+0000

          Same as the last run, but requiring 30 GB for the heavy jobs instead of 12 GB. A fleet of 40 m5.xlarge instances was used for single frame processing, then 50 r4.2xlarge instances for the rest. The workflow finished successfully.

          In the MakeWarpTask logs, I see 207 occurrences of lsst::pex::exceptions::InvalidParameterError: 'bbox dimensions = … in one or both dimensions'. For example, for tract=9615 patch=43 visit=1278:

          makeWarp INFO: Processing calexp 25 of 68 for this Warp: id={instrument: HSC, visit: 1278, detector: 86}
          makeWarp.warpAndPsfMatch.psfMatch INFO: compute Psf-matching kernel
          makeWarp.warpAndPsfMatch.psfMatch INFO: Adjusted dimensions of reference PSF model from (23, 23) to (41, 41)
          ip.diffim.generateAlardLuptonBasisList INFO: PSF sigmas are not available or scaling by fwhm disabled, falling back to config values
          makeWarp.warpAndPsfMatch.psfMatch INFO: Psf-match science exposure to reference
          makeWarp.warpAndPsfMatch INFO: Cannot PSF-Match:
            File "src/math/Kernel.cc", line 188, in lsst::geom::Box2I lsst::afw::math::Kernel::shrinkBBox(const lsst::geom::Box2I&) const
              bbox dimensions = (1883, 7) < (21, 21) in one or both dimensions {0}
          lsst::pex::exceptions::InvalidParameterError: 'bbox dimensions = (1883, 7) < (21, 21) in one or both dimensions'
          

          As the warp files were still made correctly, I can ignore these. But maybe Yusra AlSayyad and Jim Bosch would like to know.

          hchiang2 Hsin-Fang Chiang added a comment -

          This workflow for tract=9615 has 1 init job and 27074 regular pipetask jobs:

          Count Task
          6787 IsrTask
          6787 CharacterizeImageTask
          6787 CalibrateTask
          4580 MakeWarpTask
          395 CompareWarpAssembleCoaddTask
          395 DetectCoaddSourcesTask
          79 MergeDetectionsTask
          395 DeblendCoaddSourcesSingleTask
          395 MeasureMergedCoaddSourcesTask
          79 MergeMeasurementsTask
          395 ForcedPhotCoaddTask

          Statistics from Pegasus:

          # All (All)
          Transformation           Count     Succeeded Failed  Min       Max       Mean          Total       
          dagman::post             29788     29788     0       0.0       12.0      3.079         91709.0     
          pegasus::dirmanager      1         1         0       2.0       2.0       2.0           2.0         
          pegasus::transfer        2712      2712      0       2.273     16.538    6.102         16547.867   
          pipetask                 27075     27075     0       17.025    5936.038  195.465       5292217.997 
          

          ------------------------------------------------------------------------------
          Type           Succeeded Failed  Incomplete  Total     Retries   Total+Retries
          Tasks          27075     0       0           27075     0         27075        
          Jobs           29788     0       0           29788     0         29788        
          Sub-Workflows  0         0       0           0         0         0            
          ------------------------------------------------------------------------------
           
          Workflow wall time                                       : 1 day, 4 hrs
          Cumulative job wall time                                 : 61 days, 10 hrs
          Cumulative job wall time as seen from submit side        : 61 days, 17 hrs
          Cumulative job badput wall time                          : 0.0 secs
          Cumulative job badput wall time as seen from submit side : 0.0 secs
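          As a rough utilization check, dividing the cumulative job wall time by the workflow wall time gives the average number of concurrently busy cores. A back-of-the-envelope sketch using the figures above:

```python
# Average effective parallelism = cumulative job wall time / workflow wall time.
cumulative_hours = 61 * 24 + 10   # "61 days, 10 hrs"
workflow_hours = 1 * 24 + 4       # "1 day, 4 hrs"
avg_parallelism = cumulative_hours / workflow_hours
print(round(avg_parallelism))     # 53 -> ~53 cores busy on average
```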
          

          yusra Yusra AlSayyad added a comment -

          Thanks for the heads up on the error! That usually means that the WCS solution for a calexp was bad.

          We'll record that dataID (visit: 1278, detector: 86) as a chip with a hard-to-fit WCS.

          hchiang2 Hsin-Fang Chiang added a comment -

          Yusra AlSayyad There are >200 such errors. Getting all their data IDs is possible but needs some log processing. Hopefully a few examples suffice.
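          The data IDs could be pulled out of the makeWarp logs with a small script. A sketch, with the regex tailored to the log-line format quoted above (in practice one would only keep the IDs logged just before a "Cannot PSF-Match" message):

```python
import re

# Match the data ID in log lines like:
#   makeWarp INFO: Processing calexp 25 of 68 for this Warp:
#   id={instrument: HSC, visit: 1278, detector: 86}
DATA_ID_RE = re.compile(
    r"id=\{instrument: (\w+), visit: (\d+), detector: (\d+)\}"
)

def extract_data_ids(log_text):
    """Return (instrument, visit, detector) tuples found in the log text."""
    return [(m.group(1), int(m.group(2)), int(m.group(3)))
            for m in DATA_ID_RE.finditer(log_text)]

sample = "id={instrument: HSC, visit: 1278, detector: 86}"
print(extract_data_ids(sample))  # [('HSC', 1278, 86)]
```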

          hchiang2 Hsin-Fang Chiang added a comment -

          20191112T205946+0000 

          Attempted the same workflow / same software multiple times and encountered FileNotFoundError during ingest on S3, even though the file does exist (eventually), such as:

          "/home/centos/w_2019_38/stack/miniconda3-4.5.12-1172c30/Linux64/daf_butler/18.1.0-25-g487c8c7+2/python/lsst/daf/butler/datastores/s3Datastore.py", line 233, in put
              self.ingest(location.uri, ref, formatter=formatter)
            File "/home/centos/w_2019_38/stack/miniconda3-4.5.12-1172c30/Linux64/daf_butler/18.1.0-25-g487c8c7+2/python/lsst/daf/butler/core/utils.py", line 226, in inner
              return func(self, *args, **kwargs)
            File "/home/centos/w_2019_38/stack/miniconda3-4.5.12-1172c30/Linux64/daf_butler/18.1.0-25-g487c8c7+2/python/lsst/daf/butler/datastores/s3Datastore.py", line 299, in ingest
              .format(srcUri))
          FileNotFoundError: File at 's3://hsc-rc2-test1-hfc/repo-w38/hfc30/srcMatchFull/1316/srcMatchFull_1316_24_HSC_17.fits' does not exist; note that paths to ingest are assumed to be relative to self.root unless they are absolute.
          

          The errors appeared random and were not tied to specific files/tasks. A possible cause is discussed in DM-22201, which will further investigate and fix the issue.
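          Pending that fix, one workaround for such eventually-consistent reads is to retry the existence check with backoff. A generic sketch; the helper and its retry parameters are assumptions, not part of s3Datastore (check() would wrap the S3 HEAD-object call that the ingest performs):

```python
import time

def retry_until_found(check, attempts=5, delay=1.0, backoff=2.0):
    """Call check() until it stops raising FileNotFoundError, sleeping
    with exponentially increasing delays between attempts."""
    for i in range(attempts):
        try:
            return check()
        except FileNotFoundError:
            if i == attempts - 1:
                raise  # give up after the final attempt
            time.sleep(delay)
            delay *= backoff
```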

          hchiang2 Hsin-Fang Chiang added a comment -

          20191118T191713+0000

          Attempted to request a fleet of 100 m5.xlarge instances but only 75 were launched; a spotInstanceCountLimitExceeded error in the Spot Request indicated that the maximum Spot instance count was exceeded. I created a support case to request an increase of the Spot instance limit.

          A tiny fraction of jobs (7 so far) failed with
          botocore.exceptions.ReadTimeoutError: Read timeout on endpoint URL: "None"
          before science processing started. These could be re-submitted via a pegasus-halt/remove followed by a new pegasus-run; the new attempts of the same jobs finished without problems.

          hchiang2 Hsin-Fang Chiang added a comment -

          The last run eventually failed because I mistakenly killed an instance while a job was running and had already written some outputs to the datastore & registry. Retrying the same job as-is resulted in dataset conflicts, as expected.

          pegasus-status -l /home/centos/pegasus9615E/submit/centos/pegasus/ciHsc/20191118T191713+0000
          (no matching jobs found in Condor Q)
          UNRDY READY   PRE  IN_Q  POST   DONE  FAIL %DONE STATE   DAGNAME                                                      
             39     0     0     0     0 29,748     1  99.9 Failure..ubmit/centos/pegasus/ciHsc/20191118T191713+0000/ciHsc-0.dag
          Summary: 1 DAG total (Failure:1)
          

          A more proper way to rescue it may be to make a new quantum graph with --skip-existing and launch a new workflow writing to a new collection. Deferring that to another day.

          hchiang2 Hsin-Fang Chiang added a comment -

          20191121T015100+0000

          Same as the last successful run, but requiring 29 GB for the large-memory jobs instead of 30 GB, so as to fit two jobs in one r4.2xlarge instance. The head node was an m5.large on-demand instance. Workers were launched in 3 fleets: the first two for a total of 75 m4.xlarge instances, and, starting ~1 hr later, a fleet of 50 r4.2xlarge. The workflow finished successfully around 2-3 hours before I killed the idle instances.
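          The 29 GB figure follows from the packing arithmetic on an r4.2xlarge (61 GiB physical memory, of which HTCondor advertises somewhat less after system overhead). A sketch; the ~59 GB advertised figure is an assumption for illustration:

```python
def jobs_per_instance(advertised_gb, request_gb):
    """How many jobs with a given memory request fit on one worker node."""
    return int(advertised_gb // request_gb)

# Assumed usable/advertised memory on an r4.2xlarge (61 GiB raw).
advertised = 59
print(jobs_per_instance(advertised, 30))  # 1 -> only one 30 GB job fits
print(jobs_per_instance(advertised, 29))  # 2 -> two 29 GB jobs fit
```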

          ------------------------------------------------------------------------------
          Type           Succeeded Failed  Incomplete  Total     Retries   Total+Retries
          Tasks          27075     0       0           27075     0         27075        
          Jobs           29788     0       0           29788     0         29788        
          Sub-Workflows  0         0       0           0         0         0            
          ------------------------------------------------------------------------------
           
          Workflow wall time                                       : 11 hrs, 39 mins
          Cumulative job wall time                                 : 65 days, 8 hrs
          Cumulative job wall time as seen from submit side        : 65 days, 19 hrs
          Cumulative job badput wall time                          : 0.0 secs
          Cumulative job badput wall time as seen from submit side : 0.0 secs
           
           
          # All (All)
          Transformation           Count     Succeeded Failed  Min       Max       Mean          Total       
          dagman::post             29788     29788     0       0.0       14.0      5.31          158184.0    
          pegasus::dirmanager      1         1         0       2.0       2.0       2.0           2.0         
          pegasus::transfer        2712      2712      0       2.162     19.621    8.528         23128.89    
          pipetask                 27075     27075     0       18.272    5852.514  207.691       5623244.556 
          

          hchiang2 Hsin-Fang Chiang added a comment -

          20191123T020404+0000   20191123T020534+0000

          20191126T015937+0000  20191126T020115+0000  20191126T025156+0000

          On 11/23 and 11/26 I attempted to execute multiple copies of the same workflow graph simultaneously. On both nights, I was not able to get the number of m4.xlarge instances I wanted, and some also got terminated by AWS (e.g. outbid). I then requested m5 instances to boost the number of running instances. For m4.xlarge I got messages like this:

           Error message: There is no Spot capacity available that matches your request. (Service: AmazonEC2; Status Code: 500; Error Code: InsufficientInstanceCapacity)

           
          Workflows all failed, with seemingly random dataset conflict errors from the registry, even though all had distinct output collection names and Pegasus/Condor retry was turned off.

          lsst.daf.butler.core.registry.ConflictingDefinitionError: A dataset of type DatasetType(calexpBackground, DimensionGraph({instrument, detector, visit}, joins={visit_detector_region}), Background) with id: {instrument: HSC, visit: 1280, detector: 56} already exists in collection hfc41
          

          I don't know how to explain this. Two thoughts: (1) the known not-so-robust ingestion, which has been refactored in DM-21201; and (2) Spot instances were getting terminated (by AWS or by lease expiration) while I was watching.
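          If retried jobs re-ingest outputs that a terminated instance had already registered, one defensive pattern is to treat "already exists" as success so a retry is idempotent. A generic sketch; ConflictingError is a stand-in for lsst.daf.butler's ConflictingDefinitionError, and put_if_absent is a hypothetical helper, not Butler API:

```python
class ConflictingError(RuntimeError):
    """Stand-in for lsst.daf.butler's ConflictingDefinitionError."""

def put_if_absent(put, dataset):
    """Attempt a put; treat an 'already exists' conflict as success,
    so a retried job does not fail on outputs written by an earlier
    attempt that died mid-run."""
    try:
        put(dataset)
        return True       # newly written
    except ConflictingError:
        return False      # already present from an earlier attempt
```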

           

          Some log files are at https://lsst-web.ncsa.illinois.edu/~hchiang2/aws_poc/faults/
           

          hchiang2 Hsin-Fang Chiang added a comment -

          20191127T192022+0000 20191127T192345+0000

          Attempted again running two graphs simultaneously. The master was an m5.2xlarge on-demand instance; its CPU utilization was ~50-70% while it managed the single frame processing (sfm) jobs. There was a surge of database transactions shortly after the sfm jobs started, but it soon evened out to ~50-70 average active sessions during sfm processing. For the workers, I got 150 m5.xlarge Spot instances for the single frame processing part and 150 r4.2xlarge Spot instances later. From the RDS performance insights, xact_commit.avg (transaction commits per second) was typically ~10 and the DB load (average active sessions) ~1.

          Workflow wall time                                       : 8 hrs, 44 mins
          Cumulative job wall time                                 : 62 days, 16 hrs
          Cumulative job wall time as seen from submit side        : 62 days, 23 hrs
          Cumulative job badput wall time                          : 0.0 secs
          Cumulative job badput wall time as seen from submit side : 0.0 secs
           
          Workflow wall time                                       : 9 hrs, 59 mins
          Cumulative job wall time                                 : 62 days, 23 hrs
          Cumulative job wall time as seen from submit side        : 63 days, 6 hrs
          Cumulative job badput wall time                          : 0.0 secs
          Cumulative job badput wall time as seen from submit side : 0.0 secs
          

          # All (All)
          Transformation           Count     Succeeded Failed  Min       Max       Mean          Total       
          dagman::post             29788     29788     0       0.0       9.0       1.052         31343.0     
          pegasus::dirmanager      1         1         0       3.0       3.0       3.0           3.0         
          pegasus::transfer        2712      2712      0       2.176     6.768     4.21          11417.243   
          pipetask                 27075     27075     0       17.861    6243.819  199.636       5405141.464 
           
          # All (All)
          Transformation           Count     Succeeded Failed  Min       Max       Mean          Total       
          dagman::post             29788     29788     0       0.0       7.0       1.012         30148.0     
          pegasus::dirmanager      1         1         0       2.0       2.0       2.0           2.0         
          pegasus::transfer        2712      2712      0       2.167     9.002     4.255         11540.378   
          pipetask                 27075     27075     0       19.297    6300.657  200.601       5431273.315 
          

          hchiang2 Hsin-Fang Chiang added a comment -

          20191203T195830+0000 20191203T195926+0000 20191203T200145+0000 20191203T200315+0000 20191203T200433+0000

          I attempted to run 5 graphs at the same time. Of the initialization jobs (the first blocking job of each graph), two died with:
           

          Traceback (most recent call last):
            File "/home/centos/w_2019_38/stack/miniconda3-4.5.12-1172c30/Linux64/sqlalchemy/1.2.16+2/lib/python/SQLAlchemy-1.2.16-py3.7-linux-x86_64.egg/sqlalchemy/engine/base.py", line 1230, in _execute_context
              cursor, statement, parameters, context
            File "/home/centos/w_2019_38/stack/miniconda3-4.5.12-1172c30/Linux64/sqlalchemy/1.2.16+2/lib/python/SQLAlchemy-1.2.16-py3.7-linux-x86_64.egg/sqlalchemy/engine/default.py", line 536, in do_execute
              cursor.execute(statement, parameters)
          psycopg2.OperationalError: out of shared memory
          HINT:  You might need to increase max_locks_per_transaction.
          

          One finished its init job and continued on with single frame processing. The other two are still trying but I'll just kill them.
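          The "out of shared memory" failure is the Postgres shared lock table filling up. Per the Postgres runtime-configuration docs, that table is sized for roughly max_locks_per_transaction * (max_connections + max_prepared_transactions) object locks, and max_locks_per_transaction is static (on RDS it must be changed via a parameter group and a reboot). A back-of-the-envelope sketch of the pool size, assuming the Postgres default of 64 locks per transaction and this instance's max_connections of 350 (helper name is mine):

          ```python
          def lock_table_slots(max_locks_per_transaction: int,
                               max_connections: int,
                               max_prepared_transactions: int = 0) -> int:
              """Approximate capacity of the Postgres shared lock table.

              Per the Postgres docs, the shared lock table can track at most
              max_locks_per_transaction * (max_connections + max_prepared_transactions)
              object locks; exhausting it raises 'out of shared memory' with the
              max_locks_per_transaction hint seen above.
              """
              return max_locks_per_transaction * (max_connections + max_prepared_transactions)

          # Default 64 locks/transaction with max_connections=350:
          print(lock_table_slots(64, 350))  # -> 22400
          ```

          Five graphs initializing at once evidently needed more object locks than that pool could hold, hence the hint to raise max_locks_per_transaction.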

          hchiang2 Hsin-Fang Chiang added a comment -

          20191216T201334+0000 20191216T201429+0000 20191216T201752+0000 20191216T201957+0000 20191216T202127+0000

          After Dino Bektesevic increased the cache sizes of the RDS instance to ~1GB, I tried the same 5 graphs again.

          After two of the 5 workflows finished their initialization jobs, many (but not all) ISR jobs failed with:

          conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
          sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) FATAL:  remaining connection slots are reserved for non-replication superuser connections
          

          At the time there were 200 m5.xlarge workers.

          Dino then increased max_connections from 350 to 1000. After re-submitting new workflows with the same 200 workers, the connection-slot error was no longer seen.

          hchiang2 Hsin-Fang Chiang added a comment -

          In the next attempt I (stupidly) requested 300 m5.xlarge workers; since each instance ran 4 jobs, that meant 1200 connections, so I hit the same error again. We then increased max_connections to 2500.
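          The arithmetic above generalizes into a simple pre-flight check before sizing a fleet. A sketch (helper names are mine; Postgres keeps superuser_reserved_connections slots, 3 by default, for superusers only, which is why the error message mentions reserved slots):

          ```python
          def required_connections(n_instances: int, jobs_per_instance: int,
                                   conns_per_job: int = 1) -> int:
              """Worst case if every running job holds conns_per_job registry connections."""
              return n_instances * jobs_per_instance * conns_per_job

          def fleet_fits(max_connections: int, needed: int,
                         reserved_superuser: int = 3) -> bool:
              # Ordinary roles cannot use the superuser-reserved slots.
              return needed <= max_connections - reserved_superuser

          need = required_connections(300, 4)   # 300 m5.xlarge x 4 jobs each
          print(need, fleet_fits(1000, need))   # -> 1200 False  (hence the error)
          print(fleet_fits(2500, need))         # -> True   (after the bump to 2500)
          ```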

          hchiang2 Hsin-Fang Chiang made changes -
          Resolution Done [ 10000 ]
          Status In Progress [ 3 ] Done [ 10002 ]
          hchiang2 Hsin-Fang Chiang added a comment -

          We are wrapping up this effort for now, or at least until the new LDF / DOE lab decision is made. Closing this ticket. The report is being written in https://dmtn-137.lsst.io/ (DM-21820).

          hchiang2 Hsin-Fang Chiang made changes -
          Story Points 20

            People

            Assignee:
            hchiang2 Hsin-Fang Chiang
            Reporter:
            hchiang2 Hsin-Fang Chiang
            Watchers:
            Dino Bektesevic, Hsin-Fang Chiang, Yusra AlSayyad