I noticed a case where a job doesn't run properly, but PanDA reports it as successful rather than reporting an error. It's reproducible, and a BPS submission YAML is attached.
The silent failure is seen if the job runs in the default queue DOMA_LSST_GOOGLE_TEST. If I increase its requestMemory so that it runs on DOMA_LSST_GOOGLE_TEST_HIMEM instead, it runs correctly. It's very likely that this job really does need more memory, but I'd expect it to fail when it doesn't have enough memory to finish, rather than falsely appearing successful.
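For context, the workaround in the BPS submission YAML looked roughly like this (task name and value are illustrative, not the exact settings from the attached file):

```yaml
pipetask:
  consolidateObjectTable:        # illustrative task label
    requestMemory: 8192          # MB; a value high enough to route the job to the HIMEM queue
```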
- This really succeeded: https://panda-doma.cern.ch/jobs/?jeditaskid=7467&mode=nodrop&display_limit=100 and the payload log at https://storage.googleapis.com/drp-us-central1-logging/logs/DOMA_LSST_GOOGLE_TEST_HIMEM/PandaJob_2759210/payload.stderr
- This should show as a failed job: https://panda-doma.cern.ch/jobs/?jeditaskid=7465&mode=nodrop&display_limit=100 and the payload log at https://storage.googleapis.com/drp-us-central1-logging/logs/DOMA_LSST_GOOGLE_TEST/PandaJob_2746451/payload.stderr
Comparing the payload logs: after the last log record common to both, the former had two more log records, while the latter went straight to the post-payload steps. The latter also did not write some expected outputs (objectTable_tract, consolidateObjectTable_metadata, consolidateObjectTable_log) to the datastore bucket.
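In case it helps reproduce the comparison, here is a minimal sketch of how I diffed the two payload logs (a hypothetical helper; it assumes the two payload.stderr files have been downloaded locally, with filenames chosen here for illustration):

```python
def diverging_tail(lines_a, lines_b):
    """Return the trailing lines of each log after their longest common prefix.

    For the logs above, the first return value ends with the two extra
    records from the successful HIMEM run, and the second jumps straight
    to the post-payload output of the silently failing run.
    """
    i = 0
    while i < len(lines_a) and i < len(lines_b) and lines_a[i] == lines_b[i]:
        i += 1
    return lines_a[i:], lines_b[i:]


# Usage with the downloaded logs (filenames are illustrative):
# with open("payload_himem.stderr") as a, open("payload_test.stderr") as b:
#     tail_ok, tail_bad = diverging_tail(a.readlines(), b.readlines())
```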
Could you please help investigate why this happens? I suspect this isn't the only case where it happens (and it might explain some mysteries I encountered before but couldn't pin down), but this is a relatively simple example and is in a reproducible state.