This run used the new AMIs centos_condor_w38_master and centos_condor_w38_worker_hfc_partitionable. Condor annex workers were added via Spot Fleets of m4.large and then r4.large instances. Jobs of MakeWarpTask, CompareWarpAssembleCoaddTask, DeblendCoaddSourcesSingleTask, and MeasureMergedCoaddSourcesTask were configured to require 12 GB of memory.
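For illustration only, a 12 GB memory requirement like this can be expressed in an HTCondor submit description, for example through the HTCondor Python bindings; this is a minimal sketch with a placeholder executable and arguments, not how the workflow jobs were actually submitted:

```python
import htcondor

# Minimal sketch of a submit description requesting 12 GB of memory.
# The executable and arguments are placeholders, not the real pipeline command.
submit = htcondor.Submit({
    "executable": "/path/to/run_task.sh",   # hypothetical wrapper script
    "arguments": "MakeWarpTask",
    "request_memory": "12GB",               # HTCondor accepts unit suffixes like GB
    "request_cpus": "1",
    "output": "job.out",
    "error": "job.err",
    "log": "job.log",
})

schedd = htcondor.Schedd()
# Older bindings submit through a transaction; newer versions also
# support schedd.submit(submit) directly.
with schedd.transaction() as txn:
    submit.queue(txn)
```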
The instance reachability issue happened again, with the r4.large instances this time. As recommended by our AWS collaborators, a support case was filed with AWS Support.
MemoryError: std::bad_alloc was found in the logs of 4 DetectCoaddSourcesTask jobs. Many "Connection timed out" errors against the RDS instance were found in the logs of failed jobs, including sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not connect to server. Some logs appear truncated without clear indicators of the cause. When I stopped the workflow, it had 115 failed jobs and 1034 unready jobs. Failed jobs included MakeWarpTask, CompareWarpAssembleCoaddTask, DeblendCoaddSourcesSingleTask, and MeasureMergedCoaddSourcesTask.
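As a possible mitigation sketch for the connection timeouts (not something applied in this run), a SQLAlchemy engine can be made more tolerant of transient RDS connectivity problems with connection pre-ping, a bounded connect timeout, and a simple retry wrapper; the database URL and query below are placeholders:

```python
import time
from sqlalchemy import create_engine, text
from sqlalchemy.exc import OperationalError

# Placeholder URL; the real RDS endpoint and credentials are not shown here.
DB_URL = "postgresql+psycopg2://user:password@my-rds-endpoint:5432/registry"

# pool_pre_ping tests pooled connections before use so stale ones are recycled;
# connect_timeout bounds how long psycopg2 waits when opening a connection.
engine = create_engine(
    DB_URL,
    pool_pre_ping=True,
    connect_args={"connect_timeout": 10},
)

def run_with_retry(statement, attempts=3, delay=5.0):
    """Retry a statement a few times on transient OperationalError."""
    for i in range(attempts):
        try:
            with engine.connect() as conn:
                return conn.execute(statement)
        except OperationalError:
            if i == attempts - 1:
                raise
            time.sleep(delay)

# Example use with a trivial placeholder query.
run_with_retry(text("SELECT 1"))
```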
I identified a few failed jobs and reran them manually, with memory monitoring enabled through CloudWatch (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/mon-scripts.html). Memory usage spiked to around 26 GB for those test jobs. I suspect the OOM condition was related to the troublesome instances; nonetheless, a more graceful exit would be much preferred.
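For reference, and assuming the CloudWatch monitoring scripts were publishing to their default System/Linux namespace, the recorded memory metrics could be pulled back with boto3 roughly as below; the region, instance ID, and time window are placeholders:

```python
import boto3
from datetime import datetime, timedelta

# Assumes the monitoring scripts publish metrics such as MemoryUsed under the
# "System/Linux" namespace (their default); the instance ID is a placeholder.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

response = cloudwatch.get_metric_statistics(
    Namespace="System/Linux",
    MetricName="MemoryUsed",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=datetime.utcnow() - timedelta(hours=6),
    EndTime=datetime.utcnow(),
    Period=300,                 # 5-minute buckets
    Statistics=["Maximum"],
)

# Print the per-bucket peak memory use, oldest first.
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"], point.get("Unit"))
```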