Details
- Type: Story
- Status: Done
- Resolution: Done
- Fix Version/s: None
- Component/s: None
- Epic Link:
- Team: Ops Middleware
- Urgent?: No
Description
I noticed a case where a job doesn't run properly, but PanDA reports it as successful instead of reporting an error. The case is reproducible, and a bps submission YAML file is attached.
The silent failure occurs when this job runs in the default queue, DOMA_LSST_GOOGLE_TEST. If I increase its requestMemory so that it runs on DOMA_LSST_GOOGLE_TEST_HIMEM, it runs correctly. The job very likely does need more memory, but I'd expect it to fail when it doesn't have enough memory to finish, rather than falsely appearing to succeed.
Examples:
- This job really succeeded: https://panda-doma.cern.ch/jobs/?jeditaskid=7467&mode=nodrop&display_limit=100 with the payload log at https://storage.googleapis.com/drp-us-central1-logging/logs/DOMA_LSST_GOOGLE_TEST_HIMEM/PandaJob_2759210/payload.stderr
- This job should show as failed: https://panda-doma.cern.ch/jobs/?jeditaskid=7465&mode=nodrop&display_limit=100 with the payload log at https://storage.googleapis.com/drp-us-central1-logging/logs/DOMA_LSST_GOOGLE_TEST/PandaJob_2746451/payload.stderr
Comparing the two payload logs, they diverge after this log record:
consolidateObjectTable (consolidateObjectTable:{skymap: 'DC2', tract: 4431})(postprocess.py:959) - Concatenating 49 per-patch Object Tables
The former had two more log records, whereas the latter went straight to the post-payload steps. The latter also did not write some expected outputs (objectTable_tract, consolidateObjectTable_metadata, consolidateObjectTable_log) to the datastore bucket.
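For anyone who wants to repeat the log comparison, here is a minimal Python sketch of how it could be done on locally downloaded copies of the two payload.stderr files. The helper name `records_after` and the sample strings in the demo are hypothetical placeholders, not real log content; only the marker text is taken from the logs quoted above.

```python
from typing import List

# Marker text copied from the log record quoted in this ticket.
MARKER = "Concatenating 49 per-patch Object Tables"


def records_after(log_text: str, marker: str) -> List[str]:
    """Return the non-empty lines that follow the first line containing marker.

    Returns an empty list if the marker never appears in the log.
    """
    lines = log_text.splitlines()
    for index, line in enumerate(lines):
        if marker in line:
            # Keep everything after the marker line, dropping blank lines.
            return [ln for ln in lines[index + 1:] if ln.strip()]
    return []


if __name__ == "__main__":
    # Placeholder text standing in for the two downloaded payload.stderr files.
    good_log = f"setup\nx - {MARKER}\nrecord A\nrecord B\npost-payload step\n"
    bad_log = f"setup\nx - {MARKER}\npost-payload step\n"

    print("really succeeded:", records_after(good_log, MARKER))
    print("silently failed: ", records_after(bad_log, MARKER))
```

Printing the two tails side by side makes it easy to see where the silently failing job stops emitting payload records. Real log lines usually carry timestamps, so the tails are meant to be read by eye rather than compared for exact equality.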
Could you please help investigate why this happens? I suspect this isn't the only case where it happens (and it might explain some mysteries I encountered before but couldn't pin down), but this example is relatively simple and reproducible.
I am attaching the bps submit YAML file. The only difference between making the job really succeed and making it fail silently is the last 3 lines, which set requestMemory.
It can be submitted from the data-int RSP or from the submission VM (in the latter case, change butlerConfig from butler-external.yaml to butler.yaml).