Details
Type: Story
Status: Won't Fix
Resolution: Done
Fix Version/s: None
Component/s: None
Team: Telescope and Site
Urgent?: No
Description
Currently the ESS CSCs and Controllers need to be restarted on a near-daily basis. For an unknown reason the connection between the two times out, causing the CSC to go into FAULT state. Possible causes are a clogged Controller or a glitch in the network. It happens more often to the MTDome CSCs than to the others. The CSC doesn't recover well from this: the only way to get it to connect to the Controller again is to set it to OFFLINE and start it again. The Controller doesn't detect that the CSC has disconnected, so it needs to be restarted too.
When this happens the CSC usually just logs "Read timed out", and occasionally "Fault! errorCode=2, errorReport='RPiDataClient(host=mtdome-rpi01.cp.lsst.org, port=5000) timed out waiting for data'".
This ticket is for trying to avoid letting the CSC go to FAULT state. One possible way would be to increase the timeout value of the CSC. Another would be to not let the CSC go into FAULT state immediately, but to give it a number of retries before it does so. To make sure that the CSC can reconnect in case the Controller doesn't realize that the CSC has disconnected, perhaps some kind of handshake mechanism can be implemented in which the CSC sends a certain ID string (perhaps as simple as "ESS:101"), which would let the OneClientServer reconnect the client even if it thinks a client is already connected.
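The retry idea could be sketched as follows. This is only an illustration under assumptions: `read_with_retries`, `read_once` and the default values are hypothetical names and numbers, not part of the existing ESS code. The read is attempted with a timeout, and the timeout error is only passed on to the caller (the point where the CSC would go to FAULT) after a number of consecutive failures.

```python
import asyncio


async def read_with_retries(read_once, timeout=5.0, max_retries=3, retry_delay=1.0):
    """Await ``read_once()`` with a timeout, retrying before giving up.

    ``read_once`` is a hypothetical coroutine function that reads one
    telemetry message. asyncio.TimeoutError is re-raised only after
    ``max_retries`` consecutive timeouts; the caller would go to FAULT then.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return await asyncio.wait_for(read_once(), timeout=timeout)
        except asyncio.TimeoutError:
            if attempt == max_retries:
                # Retry budget spent: let the caller handle the failure.
                raise
            # Short pause so a brief network glitch has time to clear.
            await asyncio.sleep(retry_delay)
```

With this shape, a longer timeout and a retry count are two separate knobs, so either of the first two proposals can be tried on its own.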
I am open to discussing any of these ideas with you to work out the exact details.
I just noticed that the CSC doesn't disconnect when going to FAULT state. That may very well be why the Controller doesn't notice that the CSC has disconnected: it never did.
EDIT: Connecting and disconnecting is handled by handle_summary_state, and I am sure that it also gets called when going to FAULT state.
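Regardless of whether the CSC disconnects on FAULT, a stale connection on the Controller side could be handled by the ID-string handshake suggested in the description. A minimal sketch, assuming a plain asyncio TCP server: `HandshakeServer` and the "OK" reply are hypothetical and do not reflect the real OneClientServer API. The first line a client sends is its ID; if a client with the same ID is already registered, the old connection is dropped and replaced.

```python
import asyncio


class HandshakeServer:
    """Hypothetical stand-in for a one-client server (not OneClientServer)."""

    def __init__(self):
        self.clients = {}  # client ID -> StreamWriter of the current connection
        self.server = None

    async def start(self, host="127.0.0.1", port=0):
        """Start listening and return the port actually bound."""
        self.server = await asyncio.start_server(self.handle_client, host, port)
        return self.server.sockets[0].getsockname()[1]

    async def handle_client(self, reader, writer):
        # The first line from the client is its ID, e.g. b"ESS:101\n".
        client_id = (await reader.readline()).decode().strip()
        old_writer = self.clients.get(client_id)
        if old_writer is not None:
            # Same ID is reconnecting: drop the stale connection.
            old_writer.close()
        self.clients[client_id] = writer
        # Assumed acknowledgement; the real protocol may differ.
        writer.write(b"OK\n")
        await writer.drain()

    async def close(self):
        for writer in self.clients.values():
            writer.close()
        self.server.close()
```

The point of the design is that the server no longer has to detect the lost connection itself: the reconnecting client's ID is enough to decide that the old connection is stale.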