# tract-scale execution integration and tests


#### Details

• Type: Story
• Status: Done
• Resolution: Done
• Fix Version/s: None
• Component/s: None
• Labels: None
• Story Points: 20
• Team: DM Science

#### Description

Based on the w_2019_38 stack (before the backward-incompatible change in the Gen3 registry and the missing RC2 repo generation code), integrate the steps and test execution. Use tract=9615, the same HSC-RC2 tract as in the June/July milestone run. Troubleshoot issues.

#### Activity

Hsin-Fang Chiang added a comment -

Made the HSC-RC2 repo with its datastore in an S3 bucket and its registry in RDS, made a quantum graph pickle (DM-21461), and transformed the quantum graph into Pegasus format with a one-to-one quantum-to-job mapping.
All files needed in the Pegasus workflow are stored in the s3://pegasus9615/ bucket and can be retrieved for new runs. Multiple test runs were done to iteratively identify issues and improve the process.
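For context, the one-to-one quantum-to-job mapping can be sketched as follows. This is a minimal illustration, not the actual conversion code: the `PegasusJob` class, the `(task_label, quantum_id)` input shape, and the per-quantum pickle filename are hypothetical stand-ins for the real QuantumGraph and Pegasus DAX structures.

```python
from dataclasses import dataclass


@dataclass
class PegasusJob:
    """One Pegasus job wrapping a single-quantum pipetask invocation."""
    name: str
    args: list


def quanta_to_jobs(quanta):
    """Map each quantum to exactly one job (one-to-one), as in this run.

    `quanta` is assumed to be an iterable of (task_label, quantum_id)
    pairs; the real QuantumGraph object carries much more structure.
    """
    jobs = []
    for task_label, quantum_id in quanta:
        jobs.append(PegasusJob(
            name=f"{task_label}_{quantum_id}",
            # Hypothetical per-quantum invocation; the actual run shipped
            # a pickled subgraph per job rather than an id argument.
            args=["pipetask", "run", "--qgraph", f"quantum_{quantum_id}.pickle"],
        ))
    return jobs
```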

Hsin-Fang Chiang added a comment (edited)

#### Run 20191014T193110+0000

Before the new AMIs (DM-21818) were ready, test runs were done using the old-style AMIs with messier environments and credentials. r5.large Spot instances were used for memory-demanding tasks, and loss of instance connectivity was observed.
It was first noticed as instances disappearing from the condor pool (according to condor_status), while "condor_annex -status" still reported the instances as running. The instances seemed to lose reachability (for example, they could not be reached via ssh directly as usual). Eventually, after a delay, the EC2 "instance status check" showed that the reachability check had failed, while the EC2 instance status remained "running". Investigations continue with new runs.
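This symptom can also be detected programmatically. Below is a sketch of flagging such instances from a response shaped like EC2's DescribeInstanceStatus output (e.g. from boto3's `ec2.describe_instance_status` with `IncludeAllInstances=True`); only the fields shown are assumed, and this was not part of the run itself.

```python
def unreachable_instances(status_response):
    """Return ids of instances that are 'running' but failed the EC2
    instance status (reachability) check.

    `status_response` follows the shape of EC2 DescribeInstanceStatus;
    the status value is "ok", "impaired", or "insufficient-data".
    """
    bad = []
    for st in status_response.get("InstanceStatuses", []):
        running = st["InstanceState"]["Name"] == "running"
        check = st["InstanceStatus"]["Status"]
        if running and check == "impaired":
            bad.append(st["InstanceId"])
    return bad
```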

The connection to the RDS instance timed out, and errors like `lib/python3.7/site-packages/urllib3/response.py", line 378, in _error_catcher` were seen in some failed jobs. In total, 109 jobs had failed and 1014 were unready when the workflow stopped.

See more discussions in https://confluence.lsstcorp.org/display/DM/AWS+PoC+Meeting+2019-10-18

Hsin-Fang Chiang added a comment -

#### Run 20191023T000938+0000

The instance reachability issue happened again, this time with r4.large instances. As recommended by our AWS collaborators, a support case was filed with AWS Support.

I identified a few failed jobs and re-ran them manually with memory monitoring enabled through CloudWatch (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/mon-scripts.html). Memory usage spiked to around 26GB for the few test jobs. I suspect OOM was related to the troublesome instances; nonetheless, a more graceful exit would be much preferred.
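As an aside, a cheap in-process alternative to the CloudWatch monitoring scripts is to log the peak resident set size at job exit via getrusage. A minimal sketch (not what was run here):

```python
import resource
import sys


def peak_rss_bytes():
    """Peak resident set size of this process so far, in bytes.

    ru_maxrss is reported in kilobytes on Linux but bytes on macOS.
    """
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return rss if sys.platform == "darwin" else rss * 1024


# Touch ~50 MB so the high-water mark visibly grows.
payload = b"x" * (50 * 1024 * 1024)
print(peak_rss_bytes())
```

Logging this at the end of each pipetask job would cheaply confirm whether the troublesome jobs really approached the instance's memory limit.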

Hsin-Fang Chiang added a comment -

#### Run 20191026T041828+0000

Same as the last run, but requiring 30GB for the heavy jobs instead of 12GB. A fleet of 40 m5.xlarge instances was used for single frame processing, followed by 50 r4.2xlarge instances for the rest. The workflow finished successfully.

In the MakeWarpTask logs, I see 207 occurrences of `lsst::pex::exceptions::InvalidParameterError: 'bbox dimensions = …. in one or both dimensions'`. For example, for tract=9615 patch=43 visit=1278:

```
makeWarp INFO: Processing calexp 25 of 68 for this Warp: id={instrument: HSC, visit: 1278, detector: 86}
makeWarp.warpAndPsfMatch.psfMatch INFO: compute Psf-matching kernel
makeWarp.warpAndPsfMatch.psfMatch INFO: Adjusted dimensions of reference PSF model from (23, 23) to (41, 41)
ip.diffim.generateAlardLuptonBasisList INFO: PSF sigmas are not available or scaling by fwhm disabled, falling back to config values
makeWarp.warpAndPsfMatch.psfMatch INFO: Psf-match science exposure to reference
makeWarp.warpAndPsfMatch INFO: Cannot PSF-Match:
  File "src/math/Kernel.cc", line 188, in lsst::geom::Box2I lsst::afw::math::Kernel::shrinkBBox(const lsst::geom::Box2I&) const
    bbox dimensions = (1883, 7) < (21, 21) in one or both dimensions {0}
lsst::pex::exceptions::InvalidParameterError: 'bbox dimensions = (1883, 7) < (21, 21) in one or both dimensions'
```

As the warp files were still made correctly, I can ignore these errors. But maybe Yusra AlSayyad and Jim Bosch would like to know.

Hsin-Fang Chiang added a comment -

This workflow for tract=9615 has 1 init job and 27074 regular pipetask jobs:

Statistics from Pegasus:

| Transformation | Count | Succeeded | Failed | Min | Max | Mean | Total |
|---|---:|---:|---:|---:|---:|---:|---:|
| dagman::post | 29788 | 29788 | 0 | 0.0 | 12.0 | 3.079 | 91709.0 |
| pegasus::dirmanager | 1 | 1 | 0 | 2.0 | 2.0 | 2.0 | 2.0 |
| pegasus::transfer | 2712 | 2712 | 0 | 2.273 | 16.538 | 6.102 | 16547.867 |
| pipetask | 27075 | 27075 | 0 | 17.025 | 5936.038 | 195.465 | 5292217.997 |

| Type | Succeeded | Failed | Incomplete | Total | Retries | Total+Retries |
|---|---:|---:|---:|---:|---:|---:|
| Tasks | 27075 | 0 | 0 | 27075 | 0 | 27075 |
| Jobs | 29788 | 0 | 0 | 29788 | 0 | 29788 |
| Sub-Workflows | 0 | 0 | 0 | 0 | 0 | 0 |

- Workflow wall time: 1 day, 4 hrs
- Cumulative job wall time: 61 days, 10 hrs
- Cumulative job wall time as seen from submit side: 61 days, 17 hrs
- Cumulative job badput wall time: 0.0 secs
- Cumulative job badput wall time as seen from submit side: 0.0 secs

Yusra AlSayyad added a comment -

Thanks for the heads up on the error! That usually means that the WCS solution for a calexp was bad.

We'll record that data ID (visit: 1278, detector: 86) as a chip with a hard-to-fit WCS.

Hsin-Fang Chiang added a comment -

Yusra AlSayyad There are >200 of such errors. Getting all their data IDs is possible but needs some log processing. Hopefully a few examples suffice.
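A sketch of the kind of log processing meant here, assuming the `id={...}` field and the "Cannot PSF-Match" message land in the same (long) log record as in the example above; real logs may need multi-line records stitched together first:

```python
import re
from collections import Counter

# Matches ids like: id={instrument: HSC, visit: 1278, detector: 86}
ID_RE = re.compile(r"id=\{instrument: (\w+), visit: (\d+), detector: (\d+)\}")


def failed_psfmatch_ids(log_lines):
    """Count (visit, detector) pairs appearing in 'Cannot PSF-Match' records."""
    ids = []
    for line in log_lines:
        if "Cannot PSF-Match" in line:
            m = ID_RE.search(line)
            if m:
                ids.append((int(m.group(2)), int(m.group(3))))
    return Counter(ids)
```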

Hsin-Fang Chiang added a comment -

#### Run 20191112T205946+0000

I attempted the same workflow with the same software multiple times, and encountered FileNotFoundError errors during ingest on S3 even though the file exists (eventually), such as:

```
  "/home/centos/w_2019_38/stack/miniconda3-4.5.12-1172c30/Linux64/daf_butler/18.1.0-25-g487c8c7+2/python/lsst/daf/butler/datastores/s3Datastore.py", line 233, in put
    self.ingest(location.uri, ref, formatter=formatter)
  File "/home/centos/w_2019_38/stack/miniconda3-4.5.12-1172c30/Linux64/daf_butler/18.1.0-25-g487c8c7+2/python/lsst/daf/butler/core/utils.py", line 226, in inner
    return func(self, *args, **kwargs)
  File "/home/centos/w_2019_38/stack/miniconda3-4.5.12-1172c30/Linux64/daf_butler/18.1.0-25-g487c8c7+2/python/lsst/daf/butler/datastores/s3Datastore.py", line 299, in ingest
    .format(srcUri))
FileNotFoundError: File at 's3://hsc-rc2-test1-hfc/repo-w38/hfc30/srcMatchFull/1316/srcMatchFull_1316_24_HSC_17.fits' does not exist; note that paths to ingest are assumed to be relative to self.root unless they are absolute.
```

The errors appeared random and did not happen for specific files/tasks. A possible cause is described in DM-22201, which will further investigate and fix the issue.
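Pending the real fix, a common workaround for this kind of read-after-write lag is to retry the existence check with backoff. A generic sketch, not part of the actual stack code:

```python
import time


def wait_until_exists(exists, timeout=30.0, base_delay=0.5):
    """Poll `exists()` with exponential backoff until it returns True.

    `exists` is any zero-argument callable (e.g. wrapping an S3 HEAD
    request); returns False if the deadline passes without success.
    """
    delay = base_delay
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if exists():
            return True
        time.sleep(delay)
        delay = min(delay * 2, 5.0)
    return False
```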

Hsin-Fang Chiang added a comment -

#### Run 20191118T191713+0000

I attempted to request a fleet of 100 m5.xlarge instances, but only 75 were launched; a spotInstanceCountLimitExceeded error in the Spot Request indicated that the maximum Spot instance count was exceeded. I created a support case to request an increase of the Spot instance limit.

A tiny fraction of jobs (7 so far) failed with `botocore.exceptions.ReadTimeoutError: Read timeout on endpoint URL: "None"` before science processing started. Those could be re-submitted via pegasus-halt/remove followed by a new pegasus-run. New attempts of the same jobs were able to finish alright.

Hsin-Fang Chiang added a comment -

The last run eventually failed because I mistakenly killed an instance while a job was running, after it had already written some outputs to the datastore & registry. Retrying the same job as-is resulted in dataset conflicts, as expected.

```
$ pegasus-status -l /home/centos/pegasus9615E/submit/centos/pegasus/ciHsc/20191118T191713+0000
(no matching jobs found in Condor Q)
UNRDY READY   PRE  IN_Q  POST    DONE  FAIL %DONE STATE    DAGNAME
   39     0     0     0     0  29,748     1  99.9 Failure  ..ubmit/centos/pegasus/ciHsc/20191118T191713+0000/ciHsc-0.dag
Summary: 1 DAG total (Failure:1)
```

A more proper way to rescue it may be to make a new qgraph with --skip-existing and launch a new workflow writing to a new collection. Deferring that to another day.
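The --skip-existing idea amounts to pruning quanta whose outputs are already present before building the new workflow. A toy illustration of that pruning, using a simplified `(quantum_id, outputs)` representation rather than the real QuantumGraph:

```python
def prune_done_quanta(quanta, existing):
    """Drop quanta whose outputs are all already in the registry.

    `quanta` is a list of (quantum_id, output_dataset_names) pairs and
    `existing` the set of dataset names already present in the output
    collection; a quantum is kept if any of its outputs is missing.
    """
    kept = []
    for quantum_id, outputs in quanta:
        if not all(o in existing for o in outputs):
            kept.append((quantum_id, outputs))
    return kept
```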

Hsin-Fang Chiang added a comment -

#### Run 20191121T015100+0000

Same as the last successful run, but requiring 29GB for the large-memory jobs instead of 30GB, so as to fit two jobs on one r4.2xlarge instance. The head node was an m5.large on-demand instance. Workers were launched in 3 fleets: the first two for a total of 75 m4.xlarge instances, and, started ~1 hr later, a fleet of 50 r4.2xlarge. The workflow finished successfully around 2-3 hours before I killed the idle instances.

| Type | Succeeded | Failed | Incomplete | Total | Retries | Total+Retries |
|---|---:|---:|---:|---:|---:|---:|
| Tasks | 27075 | 0 | 0 | 27075 | 0 | 27075 |
| Jobs | 29788 | 0 | 0 | 29788 | 0 | 29788 |
| Sub-Workflows | 0 | 0 | 0 | 0 | 0 | 0 |

- Workflow wall time: 11 hrs, 39 mins
- Cumulative job wall time: 65 days, 8 hrs
- Cumulative job wall time as seen from submit side: 65 days, 19 hrs
- Cumulative job badput wall time: 0.0 secs
- Cumulative job badput wall time as seen from submit side: 0.0 secs

| Transformation | Count | Succeeded | Failed | Min | Max | Mean | Total |
|---|---:|---:|---:|---:|---:|---:|---:|
| dagman::post | 29788 | 29788 | 0 | 0.0 | 14.0 | 5.31 | 158184.0 |
| pegasus::dirmanager | 1 | 1 | 0 | 2.0 | 2.0 | 2.0 | 2.0 |
| pegasus::transfer | 2712 | 2712 | 0 | 2.162 | 19.621 | 8.528 | 23128.89 |
| pipetask | 27075 | 27075 | 0 | 18.272 | 5852.514 | 207.691 | 5623244.556 |

Hsin-Fang Chiang added a comment -

#### Runs 20191123T020404+0000 20191123T020534+0000 20191126T015937+0000 20191126T020115+0000 20191126T025156+0000

On 11/23 and 11/26 I attempted to execute multiple copies of the same workflow graph simultaneously. On both nights, I was not able to get the number of m4.xlarge instances I wanted, and some also got terminated by AWS (e.g. outbid). I then asked for m5 instances to boost the number of running instances. For m4.xlarge I got messages like this:

```
Error message: There is no Spot capacity available that matches your request. (Service: AmazonEC2; Status Code: 500; Error Code: InsufficientInstanceCapacity)
```

Workflows all failed, with seemingly random dataset conflict errors from the registry, even though all had distinct output collection names and Pegasus/Condor retry was turned off.

```
lsst.daf.butler.core.registry.ConflictingDefinitionError: A dataset of type DatasetType(calexpBackground, DimensionGraph({instrument, detector, visit}, joins={visit_detector_region}), Background) with id: {instrument: HSC, visit: 1280, detector: 56} already exists in collection hfc41
```

I don't know how to explain this. Two thoughts: (1) the known not-so-robust ingestion, which has been refactored in DM-21201, and (2) there were Spot instances getting terminated (by AWS or by lease expiration) while I observed it.

Some log files are at https://lsst-web.ncsa.illinois.edu/~hchiang2/aws_poc/faults/

Hsin-Fang Chiang added a comment -

#### Runs 20191127T192022+0000 20191127T192345+0000

Attempted again to run two graphs simultaneously. The master was an m5.2xlarge on-demand instance. CPU utilization was ~50-70% on the master instance while it managed the sfm (single frame) jobs. There was a surge of database transactions shortly after the sfm jobs started, but it soon evened out to ~50-70 average active sessions during sfm processing. For the workers, I got 150 m5.xlarge Spot instances for the single frame processing part and 150 r4.2xlarge Spot instances later. From the RDS performance insights, xact_commit.avg (transaction commits per second) was typically ~10 and the DB load (average active sessions) ~1.

First workflow:

- Workflow wall time: 8 hrs, 44 mins
- Cumulative job wall time: 62 days, 16 hrs
- Cumulative job wall time as seen from submit side: 62 days, 23 hrs
- Cumulative job badput wall time: 0.0 secs
- Cumulative job badput wall time as seen from submit side: 0.0 secs

Second workflow:

- Workflow wall time: 9 hrs, 59 mins
- Cumulative job wall time: 62 days, 23 hrs
- Cumulative job wall time as seen from submit side: 63 days, 6 hrs
- Cumulative job badput wall time: 0.0 secs
- Cumulative job badput wall time as seen from submit side: 0.0 secs

First workflow:

| Transformation | Count | Succeeded | Failed | Min | Max | Mean | Total |
|---|---:|---:|---:|---:|---:|---:|---:|
| dagman::post | 29788 | 29788 | 0 | 0.0 | 9.0 | 1.052 | 31343.0 |
| pegasus::dirmanager | 1 | 1 | 0 | 3.0 | 3.0 | 3.0 | 3.0 |
| pegasus::transfer | 2712 | 2712 | 0 | 2.176 | 6.768 | 4.21 | 11417.243 |
| pipetask | 27075 | 27075 | 0 | 17.861 | 6243.819 | 199.636 | 5405141.464 |

Second workflow:

| Transformation | Count | Succeeded | Failed | Min | Max | Mean | Total |
|---|---:|---:|---:|---:|---:|---:|---:|
| dagman::post | 29788 | 29788 | 0 | 0.0 | 7.0 | 1.012 | 30148.0 |
| pegasus::dirmanager | 1 | 1 | 0 | 2.0 | 2.0 | 2.0 | 2.0 |
| pegasus::transfer | 2712 | 2712 | 0 | 2.167 | 9.002 | 4.255 | 11540.378 |
| pipetask | 27075 | 27075 | 0 | 19.297 | 6300.657 | 200.601 | 5431273.315 |

Hsin-Fang Chiang added a comment -

#### Runs 20191203T195830+0000 20191203T195926+0000 20191203T200145+0000 20191203T200315+0000 20191203T200433+0000

I attempted to run 5 graphs at the same time. Among the initialization jobs (the first blocking job of each graph), two died with:

```
Traceback (most recent call last):
  File "/home/centos/w_2019_38/stack/miniconda3-4.5.12-1172c30/Linux64/sqlalchemy/1.2.16+2/lib/python/SQLAlchemy-1.2.16-py3.7-linux-x86_64.egg/sqlalchemy/engine/base.py", line 1230, in _execute_context
    cursor, statement, parameters, context
  File "/home/centos/w_2019_38/stack/miniconda3-4.5.12-1172c30/Linux64/sqlalchemy/1.2.16+2/lib/python/SQLAlchemy-1.2.16-py3.7-linux-x86_64.egg/sqlalchemy/engine/default.py", line 536, in do_execute
    cursor.execute(statement, parameters)
psycopg2.OperationalError: out of shared memory
HINT:  You might need to increase max_locks_per_transaction.
```

One finished its init job and continued on with single frame processing. The other two are still trying but I'll just kill them.

Hsin-Fang Chiang added a comment -

#### Runs 20191216T201334+0000 20191216T201429+0000 20191216T201752+0000 20191216T201957+0000 20191216T202127+0000

After Dino Bektesevic increased the cache sizes of the RDS instance to ~1GB, I tried the same 5 graphs again.

After two of the 5 workflows finished their initialization jobs, many (but not all) ISR jobs failed with:

```
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) FATAL: remaining connection slots are reserved for non-replication superuser connections
```

At the time there were 200 m5.xlarge workers.

Dino then increased max_connections from 350 to 1000. Re-submitting new workflows with the same 200 workers, the connection slot error was no longer seen.
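For reference, the two parameters touched so far would look like this in a postgresql.conf or RDS parameter group. The values below are illustrative, matching the numbers mentioned in these comments, not a dump of the actual parameter group:

```ini
# RDS parameter group / postgresql.conf (illustrative values)
max_connections = 1000             # raised from 350 for ~800 concurrent jobs
max_locks_per_transaction = 128    # default is 64; init jobs hit "out of shared memory"
```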

Hsin-Fang Chiang added a comment -

In the next attempt I (stupidly) requested 300 m5.xlarge workers; as each instance ran 4 jobs, that meant 1200 connections, so I hit the same error again. We then increased max_connections to 2500.
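The connection budget is simple arithmetic: each running job holds at least one registry connection. A small helper like the following (hypothetical, with an assumed 20% headroom for monitoring sessions and reconnect churn) would have flagged the shortfall before submission:

```python
import math


def required_max_connections(workers, jobs_per_worker, headroom=0.2, reserved=10):
    """Back-of-envelope sizing for Postgres max_connections.

    peak = workers * jobs_per_worker concurrent registry connections;
    headroom and reserved slots cover superuser/monitoring sessions.
    """
    peak = workers * jobs_per_worker
    return math.ceil(peak * (1 + headroom)) + reserved


print(required_max_connections(300, 4))  # 300*4 = 1200 peak → 1450 with headroom
```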

Hsin-Fang Chiang added a comment -

We are wrapping up this effort for now, or at least until the new LDF / DOE lab decision is made. Closing this ticket. The report is being written in https://dmtn-137.lsst.io/ (DM-21820).


#### People

• Assignee: Hsin-Fang Chiang
• Reporter: Hsin-Fang Chiang
• Watchers: Dino Bektesevic, Hsin-Fang Chiang, Yusra AlSayyad