Fix Memory monitoring for Rubin PanDA jobs


Details

• Type: Story
• Status: Done
• Resolution: Done
• Fix Version/s: None
• Component/s:
• Labels:
• Team: Ops Middleware
• Urgent?: No

Activity

Shuwei Ye added a comment -

Hi Michelle Gower, I did not realize that I should click the "merge" button. I just made the merge.

Shuwei

Michelle Gower added a comment -

Thanks, Shuwei Ye, for merging and helping me test the backport. I have the backport in the ticket branch, but I'm having trouble getting runs through to test it, and the one run that did go through last night didn't work. So I'm not sure what's up.

LSST stack version: r23_0_1_rc4

https://github.com/lsst/ctrl_bps branch (should only be needed on submit side):  tickets/DM-32579-v23

Here's the YAML with 3 jobs, where the first and last jobs should report success and the middle job should always fail (bad pipetask command line):

includeConfigs:
  - ${CTRL_BPS_DIR}/config/bps_idf.yaml
project: dev
campaign: quick
pipelineYaml: "${OBS_LSST_DIR}/pipelines/imsim/DRP.yaml#isr"
runPreCmdOpts: "--bad"
payload:
  payloadName: prmon/shouldFail/r23_0_1_rc4
  butlerConfig: s3://butler-us-central1-panda-dev/dc2/butler-external.yaml
  inCollection: "2.2i/defaults/test-med-1"
  dataQuery: "instrument='LSSTCam-imSim' and skymap='DC2' and exposure in (214433) and detector=2"
  sw_image: "lsstsqre/centos:7-stack-lsst_distrib-r23_0_1_rc4"

You don't have to do anything with the branch or PR. I will do all the GitHub work once someone has verified that it actually works in the r23_0_1_rc4 environment. Thanks again for helping.

Shuwei Ye added a comment -

Hi Michelle Gower, you asked for the container image "lsstsqre/centos:7-stack-lsst_distrib-r23_0_1_rc4", but I could not find such an image tag on https://hub.docker.com/r/lsstsqre/centos/tags.

Shuwei

Michelle Gower added a comment -

Whew.  Where did you find any error messages about that?

It is actually: lsstsqre/centos:7-stack-lsst_distrib-v23_0_1_rc4

(Not sure why the Jupyter one uses an "r" whereas this one uses a "v".)
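A tag mismatch like this one could in principle be caught before submission. The following is only a minimal sketch, not part of ctrl_bps or PanDA: the helper names `split_image_ref` and `check_image_tag` are hypothetical, and the list of known tags is assumed to be supplied by the caller (for example, copied from the Docker Hub tags page). It compares the tag in an `sw_image` value against that list and suggests near matches.

```python
import difflib


def split_image_ref(image):
    """Split 'repo:tag' into (repo, tag). Hypothetical helper.

    If there is no tag (or the last ':' belongs to a registry port),
    the tag defaults to 'latest'.
    """
    repo, sep, tag = image.rpartition(":")
    if not sep or "/" in tag:
        return image, "latest"
    return repo, tag


def check_image_tag(image, known_tags):
    """Return (found, suggestions) for the tag part of `image`.

    `known_tags` is an assumption: a caller-supplied list of tags that
    are known to exist for the repository.
    """
    _, tag = split_image_ref(image)
    if tag in known_tags:
        return True, []
    # Suggest tags that differ only slightly from the requested one.
    return False, difflib.get_close_matches(tag, known_tags, n=3, cutoff=0.8)


if __name__ == "__main__":
    # The only tag listed here is the one named in this thread.
    known = ["7-stack-lsst_distrib-v23_0_1_rc4"]
    requested = "lsstsqre/centos:7-stack-lsst_distrib-r23_0_1_rc4"
    found, suggestions = check_image_tag(requested, known)
    if not found:
        print(f"Tag not found for {requested}; did you mean: {suggestions}?")
```

Because the check works on a plain list, it makes no network calls; how the list of known tags is obtained is left open here.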

Michelle Gower added a comment -

Never mind. I found it in the pilot stdout. At various points today I've clicked on something and it's told me "not found", but this time I must have clicked in all the right places.


People

Assignee:
Shuwei Ye
Reporter:
Shuwei Ye
Reviewers:
Michelle Gower
Watchers:
Hsin-Fang Chiang, Kian-Tat Lim, Michelle Gower, Sergey Padolski, Shuwei Ye, Tim Jenness