Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-27380

Watcher loses control capability after start command failure

    XMLWordPrintable

    Details

    • Story Points:
      1
    • Sprint:
      TSSW Sprint - Oct 26 - Nov 9
    • Team:
      Telescope and Site
    • Urgent?:
      No

      Description

      If one forgets to pass a configuration to the Watcher on the start command, it correctly rejects the command with the following error:

      Traceback (most recent call last):
       File "/opt/lsst/software/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-cb4e2dc/lib/python3.7/site-packages/lsst/ts/salobj/csc_utils.py", line 134, in set_summary_state
       await cmd.start(timeout=timeout)
       File "/opt/lsst/software/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-cb4e2dc/lib/python3.7/site-packages/lsst/ts/salobj/topics/remote_command.py", line 446, in start
       return await cmd_info.next_ackcmd(timeout=timeout)
       File "/opt/lsst/software/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-cb4e2dc/lib/python3.7/site-packages/lsst/ts/salobj/topics/remote_command.py", line 183, in next_ackcmd
       raise base.AckError(msg="Command failed", ackcmd=ackcmd)
       lsst.ts.salobj.base.AckError: msg='Command failed', ackcmd=(ackcmd private_seqNum=1729314749, ack=<SalRetCode.CMD_FAILED: -302>, error=1, result="Failed: 'latin-1' codec can't encode character '
      u2013' in position 14808: ordinal not in range(256)")

      When trying to send the start command to the Watcher with a configuration after the above rejection, the Watcher is no longer commandable.

      Traceback (most recent call last):
       File "/opt/lsst/software/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-cb4e2dc/lib/python3.7/site-packages/lsst/ts/salobj/csc_utils.py", line 134, in set_summary_state
       await cmd.start(timeout=timeout)
       File "/opt/lsst/software/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-cb4e2dc/lib/python3.7/site-packages/lsst/ts/salobj/topics/remote_command.py", line 446, in start
       return await cmd_info.next_ackcmd(timeout=timeout)
       File "/opt/lsst/software/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-cb4e2dc/lib/python3.7/site-packages/lsst/ts/salobj/topics/remote_command.py", line 198, in next_ackcmd
       msg="Timed out waiting for command acknowledgement", ackcmd=last_ackcmd
       lsst.ts.salobj.base.AckTimeoutError: msg='Timed out waiting for command acknowledgement', ackcmd=(ackcmd private_seqNum=1834082701, ack=<SalRetCode.CMD_ACK: 300>, error=0, result='')

      A command rejection should not leave the Watcher unresponsive.

        Attachments

          Activity

          Hide
          rowen Russell Owen added a comment - - edited

          The problem was quite difficult to track down. It turns out that the SalLogHandler could raise an exception when emitting a message, and my code did not anticipate that. The error from the Watcher's schema validator triggered this, causing the callback loop to die in the `start` command topic. That also explains the strange message; it was the logging system complaining that it could not encode a log message properly.

          My changes:

          • Explicitly encode the message and traceback fields in SalLogHandler.emit, replacing difficult characters with "?" or something like that.
          • Catch all exceptions in SalLogHandler.emit and print a warning to stderr.
          • Add a test for this.
          • Also I beefed up salobj's existing test for configuring a CSC with invalid configuration: I made it check that the command was rejected with CMD_FAILED and also made it check that the CSC could still be configured after a few rejected start commands. Note that the test was not tickling the bug just described, so even the improved version passed before fixing the log handler. But it may catch a new problem someday.

          I also made two changes to ts_watcher:

          • Fix a bug in how rules with no configuration were handled in the Model and add a unit test to check this.
          • Beef up the unit test for invalid configuration much as I did the one in salobj, plus I added an explicit test that a blank settingsToApply fails. Note that the beefed up version does expose this problem (which was triggered by every invalid config I tried). So if you try to use this new Watcher with an older salobj the test will fail.

          Pull requests:

          Show
          rowen Russell Owen added a comment - - edited The problem was quite difficult to track down. It turns out that the SalLogHandler could raise an exception when emitting a message, and my code did not anticipate that. The error from the Watcher's schema validator triggered this, causing the callback loop to die in the `start` command topic. That also explains the strange message; it was the logging system complaining that it could not encode a log message properly. My changes: Explicitly encode the message and traceback fields in SalLogHandler.emit , replacing difficult characters with "?" or something like that. Catch all exceptions in SalLogHandler.emit and print a warning to stderr. Add a test for this. Also I beefed up salobj's existing test for configuring a CSC with invalid configuration: I made it check that the command was rejected with CMD_FAILED and also made it check that the CSC could still be configured after a few rejected start commands. Note that the test was not tickling the bug just described, so even the improved version passed before fixing the log handler. But it may catch a new problem someday. I also made two changes to ts_watcher: Fix a bug in how rules with no configuration were handled in the Model and add a unit test to check this. Beef up the unit test for invalid configuration much as I did the one in salobj, plus I added an explicit test that a blank settingsToApply fails. Note that the beefed up version does expose this problem (which was triggered by every invalid config I tried). So if you try to use this new Watcher with an older salobj the test will fail. Pull requests: https://github.com/lsst-ts/ts_salobj/pull/161 https://github.com/lsst-ts/ts_watcher/pull/25/files
          Hide
          mareuter Michael Reuter added a comment -

          Both PRs reviewed and approved.

          Show
          mareuter Michael Reuter added a comment - Both PRs reviewed and approved.
          Hide
          rowen Russell Owen added a comment -

          Released ts_salobj v6.0.4 and ts_watcher v1.3.3.

          Show
          rowen Russell Owen added a comment - Released ts_salobj v6.0.4 and ts_watcher v1.3.3.

            People

            Assignee:
            rowen Russell Owen
            Reporter:
            mareuter Michael Reuter
            Reviewers:
            Michael Reuter
            Watchers:
            Michael Reuter, Russell Owen
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

              Dates

              Created:
              Updated:
              Resolved:

                Jenkins

                No builds found.