Data Management / DM-32435

Silent failure in executing memory-hungry job?


    Details

      Description

      I noticed a case where a job does not run properly, but PanDA reports it as successful instead of reporting an error. It is reproducible, and a bps submission yaml is attached.

      The silent failure occurs when this job runs in the default queue DOMA_LSST_GOOGLE_TEST. If I increase its requestMemory so that it runs on DOMA_LSST_GOOGLE_TEST_HIMEM, it runs correctly. The job very likely does need more memory, but I would expect it to fail when it runs out of memory, rather than falsely appear successful.

      Example:

      Comparing the payload logs, after the log record

      consolidateObjectTable (consolidateObjectTable:{skymap: 'DC2', tract: 4431})(postprocess.py:959) - Concatenating 49 per-patch Object Tables
      

      The successful run (on the HIMEM queue) had two more log records after this one, whereas the silently failing run (on the default queue) went straight to the post-payload steps. The failing run also did not write some expected outputs (objectTable_tract, consolidateObjectTable_metadata, consolidateObjectTable_log) to the datastore bucket.

      Could you please help investigate why this happens? I suspect this is not the only case where it happens (it might explain some mysteries I encountered before but could not pin down), but this is a relatively simple example and is in a reproducible state.
       

        Attachments

          Activity

          hchiang2 Hsin-Fang Chiang added a comment - - edited

          Attaching the bps submit yaml file. The only difference between a genuine success and a silent failure is the last 3 lines, which set requestMemory.

          It can be submitted from the data-int RSP or from the submission VM (in which case butlerConfig must be changed from butler-external.yaml to butler.yaml).

          podolsky Sergey Padolski added a comment -

          I think I found the root of the problem: the software wrapper passed on the pipeline exit code but did not properly handle signal-based termination. This job: https://panda-doma.cern.ch/job?pandaid=2762834 has the correct failure information.
          I am going to prepare a PR, but I would like to test a bit more first. The image used in the test is spodolsky/centos:7-stack-lsst_distrib-w_2021_40-exit-code
          Thanks a lot for collecting so much information, which helped to reproduce the problem.
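
To illustrate the failure mode described in this comment, here is a minimal sketch of how a wrapper might turn signal-based termination into a non-zero exit code. It is not the actual fix from the PR, which is not shown in this ticket. The function names `exit_code_from_returncode` and `run_payload` are hypothetical, and the `128 + N` mapping is the common shell convention, assumed here for illustration. The sketch relies on the documented POSIX behaviour of Python's `subprocess`, where a child killed by signal N has `returncode == -N` (for example, -9 after a SIGKILL from an out-of-memory kill).

```python
import subprocess


def exit_code_from_returncode(returncode):
    """Map a subprocess return code to a wrapper exit code (hypothetical helper).

    On POSIX, subprocess reports a child killed by signal N as -N.
    That case is mapped to 128 + N (shell convention, an assumption here)
    so that a killed payload can never be reported as a success.
    """
    if returncode < 0:
        return 128 + (-returncode)
    return returncode


def run_payload(cmd):
    """Run the payload command (a list of strings) and return a safe exit code."""
    completed = subprocess.run(cmd, check=False)
    return exit_code_from_returncode(completed.returncode)


# A wrapper script would typically end with something like:
#     sys.exit(run_payload(sys.argv[1:]))
```

With this mapping, a payload killed by SIGKILL yields exit code 137 rather than a value that the workflow system could interpret as success.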

          mgower Michelle Gower added a comment -

          Need to add a file to the doc/changes directory.  Tiny question about default exit value. Changes approved for merging.


            People

            Assignee:
            podolsky Sergey Padolski
            Reporter:
            hchiang2 Hsin-Fang Chiang
            Reviewers:
            Michelle Gower
            Watchers:
            Hsin-Fang Chiang, Michelle Gower, Sergey Padolski, Tim Jenness
