Data Management / DM-31111

MTHexapod reports failure in state transition when it is actually succeeding

Details

    • 1
    • TSSW Sprint - Sep 13 - Sep 27, TSSW Sprint - Sep 27 - Oct 11
    • Telescope and Site
    • No

    Description

      I am not sure whether this is an issue with the CSC or with the low-level controller, but we are constantly getting failures in state transitions with the MTHexapod component when operating with the real hardware.

      Assuming the system is in STANDBY state

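      A simple:

      from lsst.ts import salobj

      # Remote for the MTHexapod CSC with SAL index 1
      r = salobj.Remote(salobj.Domain(), "MTHexapod", index=1)
      await r.start_task

      # Request the STANDBY -> DISABLED transition
      await salobj.set_summary_state(r, salobj.State.DISABLED, settingsToApply="default")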

      results in:


      RuntimeError: Error on cmd=cmd_start, initial_state=5: msg='Command failed', ackcmd=(ackcmd private_seqNum=1948430428, ack=<SalRetCode.CMD_FAILED: -302>, error=1, result='Failed: Failed: final state is <State.STANDBY: 5> instead of <State.DISABLED: 1>')

      most of the time, though it sometimes works. Despite the failure reported above, the CSC does transition to DISABLED state shortly after.

       

      I wonder if the CSC should allow a bit more time for the state transition to occur, or if the low level controller is reporting the command as completed too early. 


          Activity

            rowen Russell Owen added a comment - - edited

            I think this is in the CSC. Let me give a bit of background: the low-level controller does not report command success or failure, so the CSC has to guess based on the data that the low-level controller does send. I fervently hope we (ttsai) can fix this someday. Meanwhile we are stuck with it, and it leads to issues such as this.

            I looked at the CSC code that handles the state transition commands, and its guessing is too naive. The current code issues the state transition command, then waits for 2 telemetry messages from the low-level controller, checks the controller state, and fails the command if it is not the desired new state. A more robust algorithm is to check the next "up to N" telemetry samples, waiting for the new state.

            ttsai is there some way to predict a minimum time for the low-level controller to respond to a request for state change? I could use that information to pick a suitable maximum number of telemetry samples.
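
            The "up to N samples" idea described above could look roughly like the following. This is only a minimal sketch and not the actual ts_hexrotcomm code: wait_for_state, make_fake_telemetry, and the next_telemetry callable are hypothetical names invented for illustration, and the telemetry source is simulated so the example runs on its own.

```python
import asyncio
import enum


class State(enum.IntEnum):
    # Values match those shown in the error message in the description.
    DISABLED = 1
    STANDBY = 5


async def wait_for_state(next_telemetry, desired_state, max_samples):
    """Check up to max_samples telemetry samples for desired_state.

    next_telemetry is a hypothetical coroutine function that returns the
    controller state reported by the next telemetry sample.
    Return the number of samples read; raise RuntimeError if the
    desired state is not seen within max_samples samples.
    """
    state = None
    for n in range(1, max_samples + 1):
        state = await next_telemetry()
        if state == desired_state:
            return n
    raise RuntimeError(f"final state is {state!r} instead of {desired_state!r}")


def make_fake_telemetry(states):
    """Return a coroutine function that reports the given states in order."""
    iterator = iter(states)

    async def next_telemetry():
        await asyncio.sleep(0)  # stand-in for waiting on real telemetry
        return next(iterator)

    return next_telemetry


# Simulated controller that needs four samples to reach DISABLED.
slow = [State.STANDBY, State.STANDBY, State.STANDBY, State.DISABLED]
samples_needed = asyncio.run(
    wait_for_state(make_fake_telemetry(slow), State.DISABLED, max_samples=10)
)
```

            With max_samples=2, which mirrors the current two-sample check, the same simulated controller would raise RuntimeError even though it reaches DISABLED two samples later.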

            tribeiro Tiago Ribeiro added a comment -

            Sounds good! I figured it was something along those lines. Once you have an appropriate number of samples, I imagine you can convert that into a timeout in seconds, right? Can you also report that in the “ack in progress”?
            ttsai Te-Wei Tsai added a comment -

            This is related to DM-29578. The telemetry frequency is ~20 Hz. If there is a state change, it will be reflected in State and EnabledSubState:

                // Get state information
                tlmStruct->State = GUItlmStruct->State;
                tlmStruct->EnabledSubState = GUItlmStruct->EnabledSubState;
                tlmStruct->OfflineSubState = GUItlmStruct->OfflineSubState;
                tlmStruct->TestState = GUItlmStruct->TestState;
            

            https://github.com/lsst-ts/ts_hexapod_controller/blob/develop/src/actuatorTlm.c#L693-L697

            I think waiting for >= 0.5 seconds is reasonable, but I might be wrong. Thanks!
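
            For reference, the numbers above translate into a sample count as follows. This is a small arithmetic sketch only: the constant and function names are made up for illustration, and the 20 Hz rate is approximate.

```python
import math

TELEMETRY_RATE_HZ = 20.0  # approximate telemetry rate quoted above
STATE_CHANGE_TIMEOUT_S = 0.5  # suggested minimum wait quoted above


def max_samples_for(timeout_s, rate_hz):
    """Return the number of telemetry samples that span timeout_s."""
    return math.ceil(timeout_s * rate_hz)


def timeout_for(max_samples, rate_hz):
    """Return the time in seconds covered by max_samples samples."""
    return max_samples / rate_hz


max_samples = max_samples_for(STATE_CHANGE_TIMEOUT_S, TELEMETRY_RATE_HZ)
timeout_s = timeout_for(max_samples, TELEMETRY_RATE_HZ)
```

            At ~20 Hz, a 0.5 second wait corresponds to 10 telemetry samples, compared with the 2 samples checked by the current code.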

            rowen Russell Owen added a comment -

            This affects both the MT hexapod and MT rotator.

            rowen Russell Owen added a comment - - edited

            The issue affects both MTHexapod and MTRotator.

            The fix is in BaseCsc in ts_hexrotcomm. However, I took the liberty of simplifying assert_summary_state, deprecating an argument used by ts_mtrotator, so I also have a trivial patch for that package.

            Additional changes to ts_hexrotcomm:

            • Update to use ts_utils.
            • Fix cleanup in a unit test file.

            Pull requests:

            • https://github.com/lsst-ts/ts_hexrotcomm/pull/42
            • https://github.com/lsst-ts/ts_mtrotator/pull/51

            tribeiro Tiago Ribeiro added a comment -

            Reviewed in GitHub.

            rowen Russell Owen added a comment -

            Released:

            • ts_hexrotcomm v0.20.0
            • ts_mtrotator v0.18.0. This requires ts_hexrotcomm v0.20.0, but is not required in order to get the fix (i.e. one can use v0.17.0 if desired).

            People

              rowen Russell Owen
              tribeiro Tiago Ribeiro
              Tiago Ribeiro
              Andy Clements, Holger Drass, Russell Owen, Sandrine Thomas, Te-Wei Tsai, Tiago Ribeiro