Details
Type: Story
Status: Done
Resolution: Done
Fix Version/s: None
Component/s: None
Labels: None
Story Points: 20
Epic Link:
Team: DM Science
Description
Based on the w_2019_38 stack (before the backward-incompatible change in the Gen3 registry and before the RC2 repo generation code went missing), integrate the steps and test the execution. Use tract=9615, the HSC-RC2 tract from the June/July milestone run. Troubleshoot issues.
Attachments
Activity
Run 20191014T193110+0000
Before the new AMIs (DM-21818) were ready, test runs were done using the old-style AMIs with messier environments and credentials. r5.large Spot instances were used for memory-demanding tasks, and loss of connection to the instances was observed.
It was first noticed as instances disappearing from the condor pool (according to condor_status) while "condor_annex -status" still reported the instances as running. The instances seemed to lose network reachability (for example, they could not be reached via ssh directly as usual). Eventually, after a delay, the EC2 "instance status check" showed that the reachability check had failed, while the EC2 instance state remained "running". Investigations continue with new runs.
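For reference, the EC2 "instance status check" results can also be polled programmatically and compared against what condor_status reports. A minimal sketch with boto3 (not part of the workflow; the region and instance ID are placeholders):

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption
instance_ids = ["i-0123456789abcdef0"]              # placeholder annex worker IDs

resp = ec2.describe_instance_status(InstanceIds=instance_ids,
                                    IncludeAllInstances=True)
for status in resp["InstanceStatuses"]:
    print(status["InstanceId"],
          "state:", status["InstanceState"]["Name"],              # e.g. "running"
          "system check:", status["SystemStatus"]["Status"],
          "instance check:", status["InstanceStatus"]["Status"])  # includes reachability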
Connection timeouts to the RDS instance and errors like lib/python3.7/site-packages/urllib3/response.py", line 378, in _error_catcher were seen in some failed jobs. In total, 109 jobs had failed and 1014 were unready when the workflow stopped.
See more discussions in https://confluence.lsstcorp.org/display/DM/AWS+PoC+Meeting+2019-10-18
Run 20191023T000938+0000
This run used the new AMIs centos_condor_w38_master and centos_condor_w38_worker_hfc_partitionable. Condor annex workers were added with Spot Fleets of m4.large and then r4.large instances. Jobs of 'MakeWarpTask', 'CompareWarpAssembleCoaddTask', 'DeblendCoaddSourcesSingleTask', and 'MeasureMergedCoaddSourcesTask' were set to require 12GB of memory.
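As context for how such a requirement can be expressed: when the quantum graph is turned into a Pegasus workflow, a per-job memory request can be attached as a profile. A minimal sketch using the Pegasus DAX3 Python API with a condor-namespace request_memory profile; the exact mechanism used in these runs is not recorded here, and the names below are illustrative:

from Pegasus.DAX3 import ADAG, Job, Profile, Namespace

HEAVY_TASKS = {"MakeWarpTask", "CompareWarpAssembleCoaddTask",
               "DeblendCoaddSourcesSingleTask", "MeasureMergedCoaddSourcesTask"}

dax = ADAG("ciHsc")                 # illustrative workflow name
job = Job(name="pipetask")          # one quantum -> one pipetask job
task_label = "MakeWarpTask"         # in practice this comes from the quantum graph
if task_label in HEAVY_TASKS:
    # HTCondor interprets a bare request_memory value as MiB.
    job.addProfile(Profile(Namespace.CONDOR, "request_memory", str(12 * 1024)))
dax.addJob(job)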
The instance reachability issue happened again, this time with the r4.large instances. As recommended by our AWS collaborators, a support case was filed with AWS Support.
MemoryError: std::bad_alloc was found in the logs of 4 DetectCoaddSourcesTask jobs. Many connection timeouts to the RDS instance were found in the logs of failed jobs, including sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not connect to server. Some logs looked truncated without any clear indication why. When I stopped the workflow, it had 115 failed jobs and 1034 unready. Failed jobs included MakeWarpTask, CompareWarpAssembleCoaddTask, DeblendCoaddSourcesSingleTask, and MeasureMergedCoaddSourcesTask.
I identified a few failed jobs and re-ran them manually with memory monitoring enabled through CloudWatch (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/mon-scripts.html). Memory usage spiked to around 26GB for these test jobs. I suspect running out of memory was related to the troublesome instances; nonetheless, a more graceful exit would be much preferred.
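The memory numbers can be pulled back out of CloudWatch with boto3 as well. A minimal sketch, assuming the monitoring scripts linked above publish to their default System/Linux custom namespace; the region and instance ID are placeholders:

import datetime
import boto3

cw = boto3.client("cloudwatch", region_name="us-east-1")  # region is an assumption
end = datetime.datetime.utcnow()
start = end - datetime.timedelta(hours=6)

resp = cw.get_metric_statistics(
    Namespace="System/Linux",          # default namespace of the mon-scripts
    MetricName="MemoryUtilization",    # percent of memory in use
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=start, EndTime=end,
    Period=300, Statistics=["Maximum"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Maximum"], 1), "%")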
20191026T041828+0000
Same as the last run, but requiring 30GB for the heavy jobs instead of 12GB. A fleet of 40 m5.xlarge instances was used for single frame processing and then 50 r4.2xlarge instances for the rest. The workflow finished successfully.
In MakeWarpTask logs, I see 207 occurrences of lsst::pex::exceptions::InvalidParameterError: 'bbox dimensions = … in one or both dimensions'. For example, for tract=9615 patch=43 visit=1278:
makeWarp INFO: Processing calexp 25 of 68 for this Warp: id={instrument: HSC, visit: 1278, detector: 86}
makeWarp.warpAndPsfMatch.psfMatch INFO: compute Psf-matching kernel
makeWarp.warpAndPsfMatch.psfMatch INFO: Adjusted dimensions of reference PSF model from (23, 23) to (41, 41)
ip.diffim.generateAlardLuptonBasisList INFO: PSF sigmas are not available or scaling by fwhm disabled, falling back to config values
makeWarp.warpAndPsfMatch.psfMatch INFO: Psf-match science exposure to reference
makeWarp.warpAndPsfMatch INFO: Cannot PSF-Match:
File "src/math/Kernel.cc", line 188, in lsst::geom::Box2I lsst::afw::math::Kernel::shrinkBBox(const lsst::geom::Box2I&) const
bbox dimensions = (1883, 7) < (21, 21) in one or both dimensions {0}
lsst::pex::exceptions::InvalidParameterError: 'bbox dimensions = (1883, 7) < (21, 21) in one or both dimensions'
As the warp files were still made fine, I can ignore these errors. But maybe Yusra AlSayyad and Jim Bosch would like to know.
This workflow for tract=9615 has 1 init job and 27074 regular pipetask jobs:
Count | Task |
---|---|
6787 | IsrTask |
6787 | CharacterizeImageTask |
6787 | CalibrateTask |
4580 | MakeWarpTask |
395 | CompareWarpAssembleCoaddTask |
395 | DetectCoaddSourcesTask |
79 | MergeDetectionsTask |
395 | DeblendCoaddSourcesSingleTask |
395 | MeasureMergedCoaddSourcesTask |
79 | MergeMeasurementsTask |
395 | ForcedPhotCoaddTask |
Statistics from Pegasus:
# All (All)
Transformation       Count  Succeeded  Failed  Min     Max       Mean     Total
dagman::post         29788  29788      0       0.0     12.0      3.079    91709.0
pegasus::dirmanager  1      1          0       2.0     2.0       2.0      2.0
pegasus::transfer    2712   2712       0       2.273   16.538    6.102    16547.867
pipetask             27075  27075      0       17.025  5936.038  195.465  5292217.997
------------------------------------------------------------------------------
Type           Succeeded  Failed  Incomplete  Total  Retries  Total+Retries
Tasks          27075      0       0           27075  0        27075
Jobs           29788      0       0           29788  0        29788
Sub-Workflows  0          0       0           0      0        0
------------------------------------------------------------------------------
Workflow wall time                                       : 1 day, 4 hrs
Cumulative job wall time                                 : 61 days, 10 hrs
Cumulative job wall time as seen from submit side        : 61 days, 17 hrs
Cumulative job badput wall time                          : 0.0 secs
Cumulative job badput wall time as seen from submit side : 0.0 secs
Thanks for the heads up on the error! That usually means that the WCS solution for a calexp was bad.
We'll record that dataID (visit: 1278, detector: 86) as a chip with a hard-to-fit WCS.
Yusra AlSayyad There are >200 such errors. Getting all their data IDs is possible but needs some log processing. Hopefully a few examples suffice.
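The log processing could look roughly like the sketch below: scan the MakeWarpTask logs for the shrinkBBox error and report the calexp data ID logged just before it. The expected log format follows the excerpt above; the log location is a placeholder:

import glob
import re

ID_RE = re.compile(r"Processing calexp \d+ of \d+ for this Warp: id=(\{[^}]*\})")
ERR_RE = re.compile(r"InvalidParameterError: 'bbox dimensions =")

bad_ids = set()
for path in glob.glob("logs/MakeWarpTask/*.log"):   # placeholder log location
    last_id = None
    with open(path) as log:
        for line in log:
            match = ID_RE.search(line)
            if match:
                last_id = match.group(1)
            elif ERR_RE.search(line) and last_id:
                bad_ids.add(last_id)

for data_id in sorted(bad_ids):
    print(data_id)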
20191112T205946+0000
Attempted the same workflow / same software multiple times and encountered FileNotFoundError errors during ingest to S3 even though the file does exist (eventually), such as:
"/home/centos/w_2019_38/stack/miniconda3-4.5.12-1172c30/Linux64/daf_butler/18.1.0-25-g487c8c7+2/python/lsst/daf/butler/datastores/s3Datastore.py", line 233, in put
    self.ingest(location.uri, ref, formatter=formatter)
  File "/home/centos/w_2019_38/stack/miniconda3-4.5.12-1172c30/Linux64/daf_butler/18.1.0-25-g487c8c7+2/python/lsst/daf/butler/core/utils.py", line 226, in inner
    return func(self, *args, **kwargs)
  File "/home/centos/w_2019_38/stack/miniconda3-4.5.12-1172c30/Linux64/daf_butler/18.1.0-25-g487c8c7+2/python/lsst/daf/butler/datastores/s3Datastore.py", line 299, in ingest
    .format(srcUri))
FileNotFoundError: File at 's3://hsc-rc2-test1-hfc/repo-w38/hfc30/srcMatchFull/1316/srcMatchFull_1316_24_HSC_17.fits' does not exist; note that paths to ingest are assumed to be relative to self.root unless they are absolute.
The errors appeared random and were not tied to specific files/tasks. A possible cause is described in DM-22201, which will further investigate and fix the issue.
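As an illustration only (the actual investigation and fix are tracked in DM-22201): if the missing files are a read-after-write visibility effect, a head_object retry with backoff would show whether the object becomes visible shortly after the error. A sketch with boto3, using the bucket/key from the traceback above:

import time
import boto3
import botocore

s3 = boto3.client("s3")

def wait_for_object(bucket, key, attempts=5, delay=2.0):
    """Return True once the object is visible, retrying with exponential backoff."""
    for i in range(attempts):
        try:
            s3.head_object(Bucket=bucket, Key=key)
            return True
        except botocore.exceptions.ClientError as err:
            if err.response["Error"]["Code"] not in ("404", "NoSuchKey"):
                raise
            time.sleep(delay * (2 ** i))
    return False

print(wait_for_object("hsc-rc2-test1-hfc",
                      "repo-w38/hfc30/srcMatchFull/1316/srcMatchFull_1316_24_HSC_17.fits"))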
20191118T191713+0000
Attempted to request a fleet of 100 m5.xlarge instances but only 75 were launched; a spotInstanceCountLimitExceeded error was shown in the Spot Request, indicating that the maximum Spot instance count was exceeded. I created a support case to request an increase of the Spot instance limit.
A tiny fraction of jobs (7 so far) failed with
botocore.exceptions.ReadTimeoutError: Read timeout on endpoint URL: "None"
before science processing started. Those could be re-submitted via pegasus-halt/pegasus-remove followed by a new pegasus-run; the new attempts of the same jobs finished fine.
The last run eventually failed because I mistakenly killed an instance while a job was running and had already written some outputs to the datastore & registry. Retrying the same job as-is resulted in dataset conflicts, as expected.
pegasus-status -l /home/centos/pegasus9615E/submit/centos/pegasus/ciHsc/20191118T191713+0000
(no matching jobs found in Condor Q)
UNRDY READY PRE IN_Q POST DONE   FAIL %DONE STATE   DAGNAME
   39     0   0    0    0 29,748    1  99.9 Failure ..ubmit/centos/pegasus/ciHsc/20191118T191713+0000/ciHsc-0.dag
Summary: 1 DAG total (Failure:1)
A more proper way to rescue it may be to make a new qgraph with --skip-existing and re-launch a new workflow writing to a new collection. Deferring that to another day.
20191121T015100+0000
Same as the last successful run, but requiring 29GB for the large-memory jobs instead of 30GB, so that two jobs fit in one r4.2xlarge instance (presumably two 30GB requests slightly exceed what HTCondor advertises on the 61 GiB r4.2xlarge). The head node was an m5.large On-Demand instance. Workers were launched in 3 fleets: the first two for a total of 75 m4.xlarge instances, and, started ~1 hour later, a fleet of 50 r4.2xlarge instances. The workflow finished successfully around 2-3 hours before I killed the idle instances.
------------------------------------------------------------------------------
Type           Succeeded  Failed  Incomplete  Total  Retries  Total+Retries
Tasks          27075      0       0           27075  0        27075
Jobs           29788      0       0           29788  0        29788
Sub-Workflows  0          0       0           0      0        0
------------------------------------------------------------------------------
Workflow wall time                                       : 11 hrs, 39 mins
Cumulative job wall time                                 : 65 days, 8 hrs
Cumulative job wall time as seen from submit side        : 65 days, 19 hrs
Cumulative job badput wall time                          : 0.0 secs
Cumulative job badput wall time as seen from submit side : 0.0 secs

# All (All)
Transformation       Count  Succeeded  Failed  Min     Max       Mean     Total
dagman::post         29788  29788      0       0.0     14.0      5.31     158184.0
pegasus::dirmanager  1      1          0       2.0     2.0       2.0      2.0
pegasus::transfer    2712   2712       0       2.162   19.621    8.528    23128.89
pipetask             27075  27075      0       18.272  5852.514  207.691  5623244.556
20191123T020404+0000 20191123T020534+0000
20191126T015937+0000 20191126T020115+0000 20191126T025156+0000
On 11/23 and 11/26 I attempted to execute multiple copies of the same workflow graph simultaneously. On both nights, I was not able to get the number of m4.xlarge instances I wanted, and some also got terminated by AWS (e.g. outbid). I then asked for m5 instances to boost the number of running instances. For m4.xlarge I got messages like this:
Error message: There is no Spot capacity available that matches your request. (Service: AmazonEC2; Status Code: 500; Error Code: InsufficientInstanceCapacity)
The workflows all failed, with seemingly random dataset conflict errors from the registry, even though all had distinct output collection names and Pegasus/Condor retries were turned off.
lsst.daf.butler.core.registry.ConflictingDefinitionError: A dataset of type DatasetType(calexpBackground, DimensionGraph({instrument, detector, visit}, joins={visit_detector_region}), Background) with id: {instrument: HSC, visit: 1280, detector: 56} already exists in collection hfc41
I don't know how to explain this. Two thoughts: (1) the known not-so-robust ingestion which has been refactored in DM-21201, and (2) there were Spot instances getting terminated (by AWS or by lease expiration) while I observed it.
Some log files at
https://lsst-web.ncsa.illinois.edu/~hchiang2/aws_poc/faults/
20191127T192022+0000 20191127T192345+0000
Attempted running two graphs simultaneously again. The master was an m5.2xlarge On-Demand instance. CPU utilization on the master instance was ~50-70% while it managed the sfm jobs. There was a surge of database transactions shortly after the sfm jobs started, but it soon evened out to ~50-70 average active sessions during sfm processing. For the workers, I got 150 m5.xlarge Spot instances for the single frame processing part and 150 r4.2xlarge Spot instances later. From RDS Performance Insights, xact_commit.avg (transaction commits per second) was typically ~10 and the DB load (average active sessions) was ~1; a sketch of querying these metrics with boto3 follows the statistics below.
Workflow wall time                                       : 8 hrs, 44 mins
Cumulative job wall time                                 : 62 days, 16 hrs
Cumulative job wall time as seen from submit side        : 62 days, 23 hrs
Cumulative job badput wall time                          : 0.0 secs
Cumulative job badput wall time as seen from submit side : 0.0 secs

Workflow wall time                                       : 9 hrs, 59 mins
Cumulative job wall time                                 : 62 days, 23 hrs
Cumulative job wall time as seen from submit side        : 63 days, 6 hrs
Cumulative job badput wall time                          : 0.0 secs
Cumulative job badput wall time as seen from submit side : 0.0 secs

# All (All)
Transformation       Count  Succeeded  Failed  Min     Max       Mean     Total
dagman::post         29788  29788      0       0.0     9.0       1.052    31343.0
pegasus::dirmanager  1      1          0       3.0     3.0       3.0      3.0
pegasus::transfer    2712   2712       0       2.176   6.768     4.21     11417.243
pipetask             27075  27075      0       17.861  6243.819  199.636  5405141.464

# All (All)
Transformation       Count  Succeeded  Failed  Min     Max       Mean     Total
dagman::post         29788  29788      0       0.0     7.0       1.012    30148.0
pegasus::dirmanager  1      1          0       2.0     2.0       2.0      2.0
pegasus::transfer    2712   2712       0       2.167   9.002     4.255    11540.378
pipetask             27075  27075      0       19.297  6300.657  200.601  5431273.315
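The Performance Insights numbers quoted above were read off the RDS console; the same series can in principle be fetched with the boto3 "pi" client. A hedged sketch with a placeholder DbiResourceId and region; counter metrics such as xact_commit would be queried analogously under the Performance Insights naming scheme:

import datetime
import boto3

pi = boto3.client("pi", region_name="us-east-1")    # region is an assumption
end = datetime.datetime.utcnow()
start = end - datetime.timedelta(hours=12)

resp = pi.get_resource_metrics(
    ServiceType="RDS",
    Identifier="db-ABCDEFGHIJKLMNOPQRSTUVWXY",      # placeholder DbiResourceId
    MetricQueries=[{"Metric": "db.load.avg"}],      # average active sessions
    StartTime=start, EndTime=end,
    PeriodInSeconds=300,
)
for series in resp["MetricList"]:
    for point in series["DataPoints"]:
        print(point["Timestamp"], point.get("Value"))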
20191203T195830+0000 20191203T195926+0000 20191203T200145+0000 20191203T200315+0000 20191203T200433+0000
I attempted to run 5 graphs at the same time. Of the initialization jobs (the first, blocking job of each graph), two died with:
Traceback (most recent call last):
  File "/home/centos/w_2019_38/stack/miniconda3-4.5.12-1172c30/Linux64/sqlalchemy/1.2.16+2/lib/python/SQLAlchemy-1.2.16-py3.7-linux-x86_64.egg/sqlalchemy/engine/base.py", line 1230, in _execute_context
    cursor, statement, parameters, context
  File "/home/centos/w_2019_38/stack/miniconda3-4.5.12-1172c30/Linux64/sqlalchemy/1.2.16+2/lib/python/SQLAlchemy-1.2.16-py3.7-linux-x86_64.egg/sqlalchemy/engine/default.py", line 536, in do_execute
    cursor.execute(statement, parameters)
psycopg2.OperationalError: out of shared memory
HINT: You might need to increase max_locks_per_transaction.
One finished its init job and continued on with single frame processing. The other two are still trying but I'll just kill them.
20191216T201334+0000 20191216T201429+0000 20191216T201752+0000 20191216T201957+0000 20191216T202127+0000
After Dino Bektesevic increased the cache sizes of the RDS instance to ~1GB, I tried the same 5 graphs again.
After two of the 5 workflows finished their initialization jobs, many (but not all) ISR jobs failed with:
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) FATAL: remaining connection slots are reserved for non-replication superuser connections
At the time there were 200 m5.xlarge workers.
Dino then increased max_connections from 350 to 1000. After re-submitting new workflows with the same 200 workers, the connection slot error was no longer seen.
In the next attempt I (stupidly) requested 300 m5.xlarge workers; as each instance ran 4 jobs, that meant 1200 connections, so I hit the same error again. We then increased max_connections to 2500.
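For the record, the connection arithmetic and the relevant server settings can be checked quickly. A minimal sketch assuming direct psycopg2 access to the registry database; the endpoint and credentials are placeholders:

import psycopg2

workers = 300          # m5.xlarge instances requested
jobs_per_worker = 4    # pipetask jobs running per instance
print("connections needed (at least):", workers * jobs_per_worker)   # 1200 here

conn = psycopg2.connect(host="rds-endpoint.example.com",   # placeholder endpoint
                        dbname="registry", user="pipeline", password="...")
with conn.cursor() as cur:
    cur.execute("SHOW max_connections")
    print("max_connections:", cur.fetchone()[0])
    cur.execute("SHOW max_locks_per_transaction")
    print("max_locks_per_transaction:", cur.fetchone()[0])
conn.close()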
We are wrapping up this effort for now, or at least until the new LDF / DOE lab decision is made. Closing this ticket. The report is being written at https://dmtn-137.lsst.io/ (DM-21820).
Made the HSC-RC2 repo with its datastore in an S3 bucket and its registry in RDS, made a quantum graph pickle (DM-21461), and transformed the quantum graph into Pegasus format with a one-to-one quantum-to-job mapping. All files needed in the Pegasus workflow are stored in the s3://pegasus9615/ bucket and can be retrieved for new runs. Multiple test runs were done to iteratively identify issues and improve the process.