DM-31994 (Data Management)

ESS: Handle lost controller connections

    Details

    • Story Points:
      2
    • Sprint:
      TSSW Sprint - Oct 25 - Nov 08
    • Team:
      Telescope and Site
    • Urgent?:
      No

      Description

      While trying to diagnose the issue with mtdome-ess02, I tried to send the associated CSC OFFLINE, but was greeted with this:

      2021-09-30 17:17:31,874:INFO:ESS:Start periodic polling of the sensor data.
      2021-09-30 17:18:33,468:ERROR:ESS:beg_disable failed; remaining in state <State.ENABLED: 2>
      Traceback (most recent call last):
        File "/opt/lsst/software/stack/miniconda/lib/python3.8/site-packages/lsst/ts/salobj/base_csc.py", line 830, in _do_change_state
          await getattr(self, f"begin_{cmd_name}")(data)
        File "/opt/lsst/software/stack/miniconda/lib/python3.8/site-packages/lsst/ts/ess/csc/ess_csc.py", line 351, in begin_disable
          await self.write(command="stop", parameters={})
        File "/opt/lsst/software/stack/miniconda/lib/python3.8/site-packages/lsst/ts/ess/csc/ess_csc.py", line 150, in write
          await self.writer.drain()
        File "/opt/lsst/software/stack/miniconda/lib/python3.8/asyncio/streams.py", line 387, in drain
          await self._protocol._drain_helper()
        File "/opt/lsst/software/stack/miniconda/lib/python3.8/asyncio/streams.py", line 190, in _drain_helper
          raise ConnectionResetError('Connection lost')
      ConnectionResetError: Connection lost
      2021-09-30 17:18:33,473:ERROR:ESS.command_disable:Callback <bound method BaseCsc.do_disable of <lsst.ts.ess.csc.ess_csc.EssCsc object at 0x7f42f18406d0>> failed with data=ESSID: 102, private_revCode: 1a5c6ccb, private_sndStamp: 1633022350.464937, private_rcvStamp: 1633022350.46549, private_seqNum: 327806053, private_identity: Script:200008, private_origin: 229559, private_host: 0
      Traceback (most recent call last):
        File "/opt/lsst/software/stack/miniconda/lib/python3.8/site-packages/lsst/ts/salobj/topics/controller_command.py", line 252, in _run_callback
          ack = await result # type: ignore
        File "/opt/lsst/software/stack/miniconda/lib/python3.8/site-packages/lsst/ts/salobj/base_csc.py", line 487, in do_disable
          await self._do_change_state(data, "disable", [State.ENABLED], State.DISABLED)
        File "/opt/lsst/software/stack/miniconda/lib/python3.8/site-packages/lsst/ts/salobj/base_csc.py", line 830, in _do_change_state
          await getattr(self, f"begin_{cmd_name}")(data)
        File "/opt/lsst/software/stack/miniconda/lib/python3.8/site-packages/lsst/ts/ess/csc/ess_csc.py", line 351, in begin_disable
          await self.write(command="stop", parameters={})
        File "/opt/lsst/software/stack/miniconda/lib/python3.8/site-packages/lsst/ts/ess/csc/ess_csc.py", line 150, in write
          await self.writer.drain()
        File "/opt/lsst/software/stack/miniconda/lib/python3.8/asyncio/streams.py", line 387, in drain
          await self._protocol._drain_helper()
        File "/opt/lsst/software/stack/miniconda/lib/python3.8/asyncio/streams.py", line 190, in _drain_helper
          raise ConnectionResetError('Connection lost')
      ConnectionResetError: Connection lost
      

      The component should go into FAULT and report that the connection is lost. Leaving the component in ENABLED implies that it is still functioning. As a general comment, it is important to work out how to handle lost connections gracefully during state transitions.
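      The requested behavior could look roughly like the sketch below. It is an illustration only, not the ESS CSC code: SketchCsc, its fault() method and BrokenWriter are hypothetical stand-ins, and only the State values mirror the ones seen in the log above. The idea is that a ConnectionError raised while draining the writer sends the component to FAULT with a report, instead of leaving it in ENABLED.

```python
import asyncio
import enum


class State(enum.IntEnum):
    # Simplified stand-in for the CSC summary states.
    DISABLED = 1
    ENABLED = 2
    FAULT = 3


class SketchCsc:
    """Hypothetical, minimal stand-in for a CSC that talks to a controller."""

    def __init__(self, writer):
        self.writer = writer
        self.state = State.ENABLED
        self.fault_report = ""

    def fault(self, report):
        # Assumption: going to FAULT records a human-readable report.
        self.state = State.FAULT
        self.fault_report = report

    async def write(self, command):
        try:
            self.writer.write(command.encode() + b"\r\n")
            await self.writer.drain()
        except ConnectionError as e:
            # ConnectionResetError is a subclass of ConnectionError.
            self.fault(f"Connection to controller lost: {e!r}")
            raise

    async def disable(self):
        try:
            await self.write("stop")
        except ConnectionError:
            # Already in FAULT; do not pretend the transition succeeded.
            return self.state
        self.state = State.DISABLED
        return self.state


class BrokenWriter:
    """Hypothetical writer whose connection has been lost."""

    def write(self, data):
        pass

    async def drain(self):
        raise ConnectionResetError("Connection lost")


async def demo():
    csc = SketchCsc(BrokenWriter())
    await csc.disable()
    return csc


csc = asyncio.run(demo())
print(csc.state.name, "-", csc.fault_report)
```

      With this shape, a failed "stop" during the disable transition ends in FAULT with a report that names the lost connection.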

            Activity

            wvreeven Wouter van Reeven added a comment -

            This ticket addresses two issues:

            1. As indicated in the description, the CSC should go into FAULT state if an error happens. However, not receiving telemetry should not necessarily result in an error; the CSC should instead make a new attempt to receive telemetry.
            2. The controller, which receives a connection from the CSC, refuses new connections, even when the same CSC tries to reconnect. This means that the controller has to be restarted, which should not be necessary.
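            For the second issue, one possible approach on the controller side is to let a new connection replace the previous one rather than refusing it. The sketch below uses only asyncio streams; SketchController and its echo behavior are hypothetical and are not taken from ts_ess_common.

```python
import asyncio


class SketchController:
    """Hypothetical controller-side server that serves one client at a time."""

    def __init__(self):
        self.writer = None  # Writer for the currently connected client.
        self.accepted = 0  # Number of connections accepted so far.

    async def handle_client(self, reader, writer):
        # A new connection replaces the previous one instead of being refused.
        if self.writer is not None:
            self.writer.close()
        self.writer = writer
        self.accepted += 1
        try:
            # Echo lines until the client disconnects (readline returns b"").
            while line := await reader.readline():
                writer.write(line)
                await writer.drain()
        except ConnectionError:
            pass  # The client vanished; cleanup happens below.
        finally:
            if self.writer is writer:
                self.writer = None
            writer.close()


async def demo():
    controller = SketchController()
    server = await asyncio.start_server(controller.handle_client, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    replies = []
    writers = []
    async with server:
        # Connect twice without closing the first connection, as a CSC
        # that lost track of its old connection might do.
        for _ in range(2):
            reader, writer = await asyncio.open_connection("127.0.0.1", port)
            writers.append(writer)
            writer.write(b"ping\n")
            await writer.drain()
            replies.append(await reader.readline())
        for writer in writers:
            writer.close()
            await writer.wait_closed()
    return controller.accepted, replies


accepted, replies = asyncio.run(demo())
print(accepted, replies)
```

            The second connection is served without restarting the server, and the stale first connection is closed by the controller.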
            wvreeven Wouter van Reeven added a comment -

            I am not sure whether I go to FAULT state too often in the ESS CSC, so please feel free to push back on that.

            ts_ess_common PR:

            https://github.com/lsst-ts/ts_ess_common/pull/7

            ts_ess_csc PR:

            https://github.com/lsst-ts/ts_ess_csc/pull/48
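            One way to balance retrying telemetry reads against going to FAULT is to retry a bounded number of times and only give up afterwards. The sketch below is an assumption about how that could be done, not the code in the PRs above: read_with_retries, FlakySensor and all parameter values are hypothetical. Only timeouts are retried; a ConnectionError propagates immediately, since re-reading a lost connection cannot succeed.

```python
import asyncio


async def read_with_retries(read_once, max_attempts=3, timeout=1.0, retry_delay=0.1):
    """Call `read_once` until it returns data or `max_attempts` is exhausted."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return await asyncio.wait_for(read_once(), timeout=timeout)
        except asyncio.TimeoutError as e:
            # No telemetry this time: try again rather than failing at once.
            last_error = e
            if attempt < max_attempts:
                await asyncio.sleep(retry_delay)
    # The caller can treat this as the point at which to go to FAULT.
    raise RuntimeError(f"No telemetry after {max_attempts} attempts") from last_error


class FlakySensor:
    """Hypothetical sensor that stays silent for the first `failures` reads."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    async def read(self):
        self.calls += 1
        if self.calls <= self.failures:
            await asyncio.sleep(10)  # Much longer than the timeout used below.
        return b"21.5\r\n"


sensor = FlakySensor(failures=2)
data = asyncio.run(
    read_with_retries(sensor.read, max_attempts=3, timeout=0.05, retry_delay=0.01)
)
print(sensor.calls, data)
```

            The attempt count and timeout would need tuning per sensor; the point is only that a single missed reading does not have to end in FAULT.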

             

            rowen Russell Owen added a comment -

            Reviewed on GitHub.

            wvreeven Wouter van Reeven added a comment -

            Added ts_idl PR:

            https://github.com/lsst-ts/ts_idl/pull/77

              People

              Assignee:
              wvreeven Wouter van Reeven
              Reporter:
              mareuter Michael Reuter
              Reviewers:
              Russell Owen
              Watchers:
              Michael Reuter, Russell Owen, Wouter van Reeven
              Votes:
              0
