# Silent failure in executing memory-hungry job?


#### Details

• Type: Story
• Status: Done
• Resolution: Done
• Fix Version/s: None
• Component/s: None
• Labels:
• Team: Ops Middleware
• Urgent?: No

#### Description

I noticed a case where a job doesn't run properly, but PanDA reports it as successful instead of reporting an error. It's reproducible, and a bps submission yaml is attached.

The silent failure is seen if this job is run in the default queue DOMA_LSST_GOOGLE_TEST. If I increase its requestMemory to make it run on DOMA_LSST_GOOGLE_TEST_HIMEM, then it runs correctly. It's very likely that this job really needs more memory, but I'd expect it to fail when it doesn't have enough memory to finish, rather than falsely appearing to be successful.

Example:

Comparing the payload logs, after the log record

 consolidateObjectTable (consolidateObjectTable:{skymap: 'DC2', tract: 4431})(postprocess.py:959) - Concatenating 49 per-patch Object Tables 

The former (the run that really succeeded) had two more log records, but the latter (the silently failing run) went straight to the post-payload steps. The latter did not write some expected outputs (objectTable_tract, consolidateObjectTable_metadata, consolidateObjectTable_log) to the datastore bucket.

Could you please help investigate why this happens? I suspect this isn't the only case where this happens (and it might explain some mysteries I encountered before but couldn't pin down), but this is a relatively simple example and is in a reproducible state.

#### Attachments

1. DM-32435.yaml
0.6 kB

#### Activity

Hsin-Fang Chiang added a comment - edited

Attaching the bps submit yaml file. The only difference between making it really succeed or making it fail silently is the last 3 lines for the requestMemory.

It can be submitted from the data-int RSP or from the submission VM (in the latter case, change butlerConfig from butler-external.yaml to butler.yaml).

Sergey Padolski added a comment -

I think I found the root of the problem: the SW wrapper passed along the pipeline exit code and didn't properly handle signal-based termination. This job: https://panda-doma.cern.ch/job?pandaid=2762834 has the correct failure information.
I am going to prepare a PR, but first I'd like to test a bit more. The image used in the test is spodolsky/centos:7-stack-lsst_distrib-w_2021_40-exit-code
Thanks a lot for collecting so much information, which helped to reproduce the problem.

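For readers unfamiliar with this failure mode, here is a minimal Python sketch of how a wrapper can avoid reporting success when its payload is killed by a signal. It is not the code from the actual PR: the function names `exit_code_from_returncode` and `run_payload` are hypothetical, and the `128 + N` mapping is the common shell convention, assumed here for illustration. It relies on the documented POSIX behaviour of `subprocess`, where a child terminated by signal N yields a returncode of `-N`.

```python
import signal
import subprocess
import sys
from typing import Sequence


def exit_code_from_returncode(returncode: int) -> int:
    """Map a subprocess returncode to an exit code for the wrapper.

    On POSIX, subprocess reports -N when the child was terminated by
    signal N. Use the shell convention 128 + N (an assumption of this
    sketch) so that a killed payload is never reported as 0.
    """
    if returncode < 0:
        return 128 + (-returncode)
    return returncode


def run_payload(cmd: Sequence[str]) -> int:
    """Run the payload (hypothetical helper) and return a wrapper exit code."""
    result = subprocess.run(list(cmd))
    if result.returncode < 0:
        signum = -result.returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number without a named enum member.
            name = f"signal {signum}"
        print(f"payload terminated by {name}", file=sys.stderr)
    return exit_code_from_returncode(result.returncode)
```

A wrapper script could end with `sys.exit(run_payload(sys.argv[1:]))`, so that a payload killed with SIGKILL (signal 9 on Linux) surfaces as exit code 137 rather than as a success.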
Michelle Gower added a comment -

Need to add a file to the doc/changes directory.  Tiny question about default exit value. Changes approved for merging.


#### People

Assignee:
Reporter:
Hsin-Fang Chiang
Reviewers:
Michelle Gower
Watchers:
Hsin-Fang Chiang, Michelle Gower, Sergey Padolski, Tim Jenness