DM-34633

Improve stability of the connection between the ESS CSC and Controller


    Details

    • Team:
      Telescope and Site
    • Urgent?:
      No

      Description

      Currently the ESS CSCs and Controllers need to be restarted on a near-daily basis. For an unknown reason the connection between the two times out, causing the CSC to go into FAULT state. Possible causes are a clogged Controller or a glitch in the network. It happens more often to the MTDome CSCs than to the others. The CSC doesn't recover well from this: the only way to get it to connect to the Controller again is to set it to OFFLINE and start it again. The Controller doesn't detect that the CSC has disconnected and needs to be restarted too.

      The CSC usually simply logs "Read timed out" when this happens and occasionally "Fault! errorCode=2, errorReport='RPiDataClient(host=mtdome-rpi01.cp.lsst.org, port=5000) timed out waiting for data'".

      This ticket is for trying to keep the CSC from going to FAULT state. One possible way is to increase the timeout value of the CSC. Another is to not let the CSC go into FAULT state immediately, but to give it a number of retries first. To make sure that the CSC can reconnect when the Controller doesn't realize that the CSC has disconnected, perhaps some kind of handshake mechanism can be implemented in which the CSC sends a certain ID string (perhaps as simple as "ESS:101"), which would let the OneClientServer reconnect the client even if it thinks a client is already connected.
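The retry idea above could be sketched roughly as follows. This is a minimal illustration, not the CSC's actual code: `read_with_retries`, `ConnectionFault`, and the `read_once` callable are hypothetical names introduced here, and the timeout and retry count are placeholders.

```python
import asyncio


class ConnectionFault(Exception):
    """Hypothetical exception: every read attempt timed out."""


async def read_with_retries(read_once, timeout, max_retries):
    """Await read_once() up to max_retries + 1 times, each with a timeout.

    read_once is any zero-argument coroutine function; it stands in for
    the CSC's real read call. ConnectionFault is raised only after the
    final attempt has timed out, which is the point where the CSC would
    go to FAULT state.
    """
    for _ in range(max_retries + 1):
        try:
            return await asyncio.wait_for(read_once(), timeout=timeout)
        except asyncio.TimeoutError:
            # A real CSC would log the timeout here, and perhaps
            # reconnect, before trying again.
            continue
    raise ConnectionFault(f"no data after {max_retries + 1} attempts")
```

With this shape, a single slow read no longer causes a FAULT. The cost is that a genuinely dead connection is reported later, after at most `timeout * (max_retries + 1)` seconds.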

      I am open to discussing any of these ideas with you to work out the exact details.


            Activity

            wvreeven Wouter van Reeven added a comment (edited)

            I just noticed that the CSC doesn't disconnect when going to FAULT state. That may well be why the Controller doesn't notice that the CSC has disconnected: it never did.

            EDIT: Connecting and disconnecting are handled by handle_summary_state, and I am sure that it also gets called when going to FAULT state.

            wvreeven Wouter van Reeven added a comment (edited)

            I grabbed this log from the MTDome ESS CSC with ID 101:

            2022-05-06 12:34:32,168:ERROR:SocketServer.OneClientServer(EssSensorsServer):read_loop failed. Disconnecting.
            Traceback (most recent call last):
              File "/home/saluser/miniconda3/lib/python3.8/site-packages/lsst/ts/ess/common/socket_server.py", line 137, in read_loop
                line = await self.reader.readuntil(tcpip.TERMINATOR)
              File "/home/saluser/miniconda3/lib/python3.8/asyncio/streams.py", line 629, in readuntil
                raise exceptions.IncompleteReadError(chunk, None)
            asyncio.exceptions.IncompleteReadError: 0 bytes read on a total of undefined expected bytes
            2022-05-06 12:34:32,170:INFO:CommandHandler:stop_sending_telemetry
            2022-05-06 12:34:32,171:DEBUG:CommandHandler:Closing <lsst.ts.ess.controller.device.rpi_serial_hat.RpiSerialHat object at 0x7f846dbac0> device with name MTDome-ESS01
            2022-05-06 12:34:32,171:DEBUG:CommandHandler.RpiSerialHat:Stopping read loop for 'MTDome-ESS01' sensor.
            2022-05-06 12:34:32,172:ERROR:CommandHandler.RpiSerialHat:Serial port closed.
            Traceback (most recent call last):
              File "/home/saluser/miniconda3/lib/python3.8/site-packages/lsst/ts/ess/common/socket_server.py", line 137, in read_loop
                line = await self.reader.readuntil(tcpip.TERMINATOR)
              File "/home/saluser/miniconda3/lib/python3.8/asyncio/streams.py", line 629, in readuntil
                raise exceptions.IncompleteReadError(chunk, None)
            asyncio.exceptions.IncompleteReadError: 0 bytes read on a total of undefined expected bytes
            2022-05-06 12:34:32,172:DEBUG:SocketServer.OneClientServer(EssSensorsServer):Cancelling read_loop_task.
            2022-05-06 12:34:32,173:DEBUG:SocketServer.OneClientServer(EssSensorsServer):Closing client.
            2022-05-06 12:34:32,173:INFO:SocketServer.OneClientServer(EssSensorsServer):Closing the client socket.
            

            Line 137 in socket_server.py is the line reading incoming data from the CSC. This seems to indicate that the network connection was terminated.
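The exception in the traceback can be reproduced with plain asyncio, which may help when reading the log. In this minimal sketch `read_line_or_none` and `demo` are hypothetical helpers, and the value of `TERMINATOR` is an assumption standing in for `tcpip.TERMINATOR`. `StreamReader.readuntil` raises `IncompleteReadError` with `expected=None` when the stream reaches end-of-file before the terminator arrives, which is what happens when the peer closes the connection.

```python
import asyncio

# Assumed terminator, standing in for tcpip.TERMINATOR.
TERMINATOR = b"\r\n"


async def read_line_or_none(reader):
    """Return one terminated line, or None if the peer closed the stream."""
    try:
        return await reader.readuntil(TERMINATOR)
    except asyncio.IncompleteReadError:
        # EOF arrived before the terminator: the connection is gone.
        return None


async def demo():
    # Feed a StreamReader by hand instead of opening a real socket.
    reader = asyncio.StreamReader()
    reader.feed_data(b"hello" + TERMINATOR)
    reader.feed_eof()  # simulates the client closing the connection
    first = await read_line_or_none(reader)
    second = await read_line_or_none(reader)
    return first, second


if __name__ == "__main__":
    # The first read returns the line; the second hits EOF and gives None.
    print(asyncio.run(demo()))
```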

            I see the same error message at 2022-05-06 12:34:18,920 in the log from the MTDome ESS CSC with ID 102 and at 2022-05-06 12:34:32,374 in the log from the MTDome ESS CSC with ID 103.

            rowen Russell Owen added a comment

            My suggestion is that we run the RPi server code with ts_tcpip v0.3.7 once it is released (DM-34694). That will not make connections more reliable, but it should make reconnection more reliable. At that point we should be able to better evaluate any remaining needed work.

            rowen Russell Owen added a comment

            Wouter van Reeven fixed this on a different ticket.


              People

              Assignee:
              rowen Russell Owen
              Reporter:
              wvreeven Wouter van Reeven
              Watchers:
              Russell Owen, Wouter van Reeven
