Summary of work:
High Stackdriver log ingestion volume was investigated, and GKE "Stackdriver monitoring" was disabled for all clusters. Problematic test pods emitting debug output were deleted, and all GKE clusters were upgraded to reduce the large number of GKE node pool management messages some clusters were producing. This is expected to reduce log volume from ~4 TiB/month to under 100 GiB/month.
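For reference, disabling the legacy Stackdriver monitoring integration is a per-cluster update. A minimal sketch, assuming the gcloud SDK is configured for the right project; CLUSTER_NAME and ZONE are placeholders, not the real cluster names:

```shell
# Sketch: turn off the legacy Stackdriver monitoring service for one cluster.
# CLUSTER_NAME and ZONE are placeholders; repeat per cluster.
gcloud container clusters update CLUSTER_NAME \
  --zone ZONE \
  --monitoring-service none

# Logging was intentionally left enabled (the goal was reduced volume,
# not zero logs); disabling it entirely would look like:
# gcloud container clusters update CLUSTER_NAME --zone ZONE --logging-service none
```

Note this stops new monitoring data from the cluster but does not delete already-ingested data.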
GCE billing notifications were set up.
AWS billing notifications were set up.
A Stackdriver log ingestion rate alert was created (untested).
The [broken] AWS RabbitMQ instance was terminated.
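Since the ingestion rate alert is untested, here is a sketch of the shape such a policy takes, assuming the `logging.googleapis.com/billing/bytes_ingested` metric; the display names, threshold, and durations are illustrative placeholders, not the values actually configured:

```shell
# Sketch: alerting policy that fires when log bytes ingested per hour
# exceeds a threshold. Threshold/durations are placeholders to tune.
cat > ingestion-alert.yaml <<'EOF'
displayName: "Stackdriver log ingestion rate"
combiner: OR
conditions:
  - displayName: "bytes_ingested above threshold"
    conditionThreshold:
      filter: 'metric.type="logging.googleapis.com/billing/bytes_ingested"'
      comparison: COMPARISON_GT
      thresholdValue: 5.0e9   # ~5 GB per aligned hour; placeholder value
      duration: 3600s
      aggregations:
        - alignmentPeriod: 3600s
          perSeriesAligner: ALIGN_SUM
EOF
gcloud alpha monitoring policies create --policy-from-file=ingestion-alert.yaml
```

Verifying the alert would mean temporarily generating a burst of logs above the threshold and confirming a notification arrives.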