Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-36931

rotation data increasingly delayed over time

    XMLWordPrintable

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: ts_main_telescope
    • Labels:

      Description

      The rotation telemetry topic is increasingly delayed over time.

      I added a diagnostic print statement to the MTRotator CSC in an effort to figure out if this was likely in the low-level controller or the CSC. The statement is in the callback function that is called whenever the CSC reads telemetry from the low-level controller. It prints two values:

      • telemetry_age: rcv_time - telemetry header timestamp, where rcv_time is the time at which the telemetry calllback function is called to process low-level telemetry.
      • write rotation delay: the time just after the "rotation" SAL event is sent - rcv_time.

      What I see is that relemetry_tag increases slowly and apparently fairly linearly, whereas write rotation delay does not change.

      telemetry_age=0.084; write rotation delay=0.001
      telemetry_age=0.084; write rotation delay=0.001
      telemetry_age=0.084; write rotation delay=0.001
      telemetry_age=0.085; write rotation delay=0.001
      telemetry_age=0.085; write rotation delay=0.001
      telemetry_age=0.085; write rotation delay=0.001
      telemetry_age=0.086; write rotation delay=0.001
      telemetry_age=0.086; write rotation delay=0.001
      telemetry_age=0.086; write rotation delay=0.001
      telemetry_age=0.087; write rotation delay=0.001
      telemetry_age=0.087; write rotation delay=0.001
      telemetry_age=0.087; write rotation delay=0.001
      telemetry_age=0.088; write rotation delay=0.001
      telemetry_age=0.088; write rotation delay=0.001
      telemetry_age=0.088; write rotation delay=0.001
      telemetry_age=0.089; write rotation delay=0.001
      telemetry_age=0.089; write rotation delay=0.001
      telemetry_age=0.089; write rotation delay=0.001
      telemetry_age=0.090; write rotation delay=0.001
      telemetry_age=0.090; write rotation delay=0.001
      telemetry_age=0.091; write rotation delay=0.001
      telemetry_age=0.091; write rotation delay=0.001
      telemetry_age=0.091; write rotation delay=0.001
      telemetry_age=0.092; write rotation delay=0.001
      telemetry_age=0.092; write rotation delay=0.001
      telemetry_age=0.092; write rotation delay=0.001
      telemetry_age=0.093; write rotation delay=0.001
      telemetry_age=0.093; write rotation delay=0.001
      telemetry_age=0.093; write rotation delay=0.001
      telemetry_age=0.094; write rotation delay=0.001
      telemetry_age=0.094; write rotation delay=0.001
      telemetry_age=0.094; write rotation delay=0.001
      telemetry_age=0.095; write rotation delay=0.001
      telemetry_age=0.095; write rotation delay=0.001
      telemetry_age=0.095; write rotation delay=0.001
      telemetry_age=0.096; write rotation delay=0.001
      telemetry_age=0.096; write rotation delay=0.001
      telemetry_age=0.097; write rotation delay=0.001
      telemetry_age=0.097; write rotation delay=0.001
      telemetry_age=0.097; write rotation delay=0.001
      telemetry_age=0.098; write rotation delay=0.001
      telemetry_age=0.098; write rotation delay=0.001
      telemetry_age=0.098; write rotation delay=0.001
      telemetry_age=0.099; write rotation delay=0.001
      telemetry_age=0.099; write rotation delay=0.001
      telemetry_age=0.099; write rotation delay=0.001
      telemetry_age=0.100; write rotation delay=0.001
      ...
      

      The message is printed every 25 telemetry samples. After many minutes the time is up to 0.465 seconds.

      I strongly suspect the error is inside the low-level controller and I hope you will be willing to rule this out before I look for it in the CSC. I'm having trouble imagining how the CSC could do this (especially with such large delays).

      The only possible explanation I have come up with is an algorithm such as this:

      timestamp = get_current_time()
      while True:
          compute and send telemetry
          sleep telemetry_interval
          timestamp = old_timestamp + telemetry_interval
      

      The timestamp would fall further behind in each iteration by the amount of the time spent in the "compute and send telemetry" step.

        Attachments

        1. delay times after Te-Wei's fix.txt
          33 kB
          Russell Owen
        2. diff_sal_rcvStamp.png
          17 kB
          Te-Wei Tsai
        3. diff_sal_sndStamp_after_90sec.png
          18 kB
          Te-Wei Tsai
        4. diff_sal_sndStamp.png
          18 kB
          Te-Wei Tsai
        5. show_mismatched_delays_modified.py
          0.8 kB
          Te-Wei Tsai
        6. show_mismatched_delays.py
          0.7 kB
          Russell Owen
        7. timestamp_rotator_90sec.ipynb
          290 kB
          Te-Wei Tsai
        8. timestamp_rotator.ipynb
          288 kB
          Te-Wei Tsai

          Activity

          Hide
          ttsai Te-Wei Tsai added a comment - - edited

          Another interesting thing is that after 90 sec, the sndStamp comes back to normal:

          time_start = Time("2022-11-08T22:24:00", scale="utc", format="isot")
          time_end = Time("2022-11-08T22:25:30", scale="utc", format="isot")
          

          After some checking of other time period, it looks like this delay is intermittent. Sometimes I have it and sometimes I do not have. Right now, the question is: this intermittence is by accident or something does happen at that time.

          Show
          ttsai Te-Wei Tsai added a comment - - edited Another interesting thing is that after 90 sec, the sndStamp comes back to normal: time_start = Time( "2022-11-08T22:24:00" , scale = "utc" , format = "isot" ) time_end = Time( "2022-11-08T22:25:30" , scale = "utc" , format = "isot" ) After some checking of other time period, it looks like this delay is intermittent. Sometimes I have it and sometimes I do not have. Right now, the question is: this intermittence is by accident or something does happen at that time.
          Hide
          ttsai Te-Wei Tsai added a comment -

          The following is the code in low-level controller to write the telemetry TCP/IP packet:

          // Send the telemetry and command status to the client.
          // Note: The data types of input and output are required for pthread_create().
          // The input needs to cast to the correct data type.
          static void *cmdTlmServer_sendTlmAndCmdStatus(void *pData) {
           
              // Get the server information
              serverInfo_t *pServerInfo = (serverInfo_t *)pData;
           
              // Loop delay time
              // 50 milliseconds (= 20 Hz)
              struct timespec ts50;
              ts50.tv_nsec = 50000000;
              ts50.tv_sec = 0;
           
              // Wait until begins to write the telemetry
              while (!pServerInfo->isReadyTlm) {
                  nanosleep(&ts50, NULL);
              }
           
              // Write the telemetry and command status
              syslog(LOG_NOTICE, "The telemetry thread is running in the %s server.",
                     pServerInfo->pName);
           
              int lastServerStatus = ServerStatus_Disconnected;
              while (pServerInfo->isReadyTlm) {
                  // Reset the record of finding the closed connection and update the last
                  // server status
                  if ((lastServerStatus == ServerStatus_Disconnected) &&
                      (pServerInfo->serverStatus == ServerStatus_Connected)) {
                      pServerInfo->isCloseConnDetected = false;
                  }
           
                  lastServerStatus = pServerInfo->serverStatus;
           
                  // Reply the last command status from commanding.c in controller code
                  int error = 0;
                  if (!pServerInfo->isCloseConnDetected && (error == 0)) {
                      error = cmdTlmServer_sendLastCmdStateInMsgQueue(pServerInfo);
                  }
           
                  // Send the telemetry if any
                  if (!pServerInfo->isCloseConnDetected && (error == 0)) {
                      error = cmdTlmServer_sendLastTlmInMsgQueue(pServerInfo);
                  }
           
                  // Report if sending failed. The server thread will notice this and
                  // change the server status to be disconnected in cmdTlmServer_run()
                  if (error == -1) {
                      pServerInfo->isCloseConnDetected = true;
                  }
           
                  nanosleep(&ts50, NULL);
              }
           
              return 0;
          }
          

          If we assume there is one command status and one telemetry packets need to write into TCP/IP pipe at each iteration, the time will be more than 50 second for sure. However, this value would be "steady" and can not be gradually increasing as what we saw on summit. The message queues are non-blocking and the functions (cmdTlmServer_sendLastCmdStateInMsgQueue() and cmdTlmServer_sendLastTlmInMsgQueue() will only take one instance from queue at each time. There is no while loop instead (if we have, it would be easier to explain the increase of time delay).

          Show
          ttsai Te-Wei Tsai added a comment - The following is the code in low-level controller to write the telemetry TCP/IP packet: // Send the telemetry and command status to the client. // Note: The data types of input and output are required for pthread_create(). // The input needs to cast to the correct data type. static void *cmdTlmServer_sendTlmAndCmdStatus( void *pData) {   // Get the server information serverInfo_t *pServerInfo = (serverInfo_t *)pData;   // Loop delay time // 50 milliseconds (= 20 Hz) struct timespec ts50; ts50.tv_nsec = 50000000; ts50.tv_sec = 0;   // Wait until begins to write the telemetry while (!pServerInfo->isReadyTlm) { nanosleep(&ts50, NULL); }   // Write the telemetry and command status syslog(LOG_NOTICE, "The telemetry thread is running in the %s server." , pServerInfo->pName);   int lastServerStatus = ServerStatus_Disconnected; while (pServerInfo->isReadyTlm) { // Reset the record of finding the closed connection and update the last // server status if ((lastServerStatus == ServerStatus_Disconnected) && (pServerInfo->serverStatus == ServerStatus_Connected)) { pServerInfo->isCloseConnDetected = false ; }   lastServerStatus = pServerInfo->serverStatus;   // Reply the last command status from commanding.c in controller code int error = 0; if (!pServerInfo->isCloseConnDetected && (error == 0)) { error = cmdTlmServer_sendLastCmdStateInMsgQueue(pServerInfo); }   // Send the telemetry if any if (!pServerInfo->isCloseConnDetected && (error == 0)) { error = cmdTlmServer_sendLastTlmInMsgQueue(pServerInfo); }   // Report if sending failed. The server thread will notice this and // change the server status to be disconnected in cmdTlmServer_run() if (error == -1) { pServerInfo->isCloseConnDetected = true ; }   nanosleep(&ts50, NULL); }   return 0; } If we assume there is one command status and one telemetry packets need to write into TCP/IP pipe at each iteration, the time will be more than 50 second for sure. However, this value would be "steady" and can not be gradually increasing as what we saw on summit. The message queues are non-blocking and the functions ( cmdTlmServer_sendLastCmdStateInMsgQueue() and cmdTlmServer_sendLastTlmInMsgQueue() will only take one instance from queue at each time. There is no while loop instead (if we have, it would be easier to explain the increase of time delay).
          Hide
          ttsai Te-Wei Tsai added a comment -

          Please help to review the PR:
          https://github.com/lsst-ts/ts_rotator_controller/pull/46

          Thanks!

          Show
          ttsai Te-Wei Tsai added a comment - Please help to review the PR: https://github.com/lsst-ts/ts_rotator_controller/pull/46 Thanks!
          Hide
          rowen Russell Owen added a comment - - edited

          Reviewed on github.

          Also the change is running and seems to have worked. I will append a short log of delay times (time MTRotator received the telemetry - telemetry header time). It increases and then resets to a small value and starts over. All delays values are much smaller than they used to be.

          Unlike the log in the original report this shows every packet (so is chatty enough that I did not want to leave it running). I am not sure what to make of the alternation between almost no delay and slightly longer delay, but I do not consider it to be a problem.

          Show
          rowen Russell Owen added a comment - - edited Reviewed on github. Also the change is running and seems to have worked. I will append a short log of delay times (time MTRotator received the telemetry - telemetry header time). It increases and then resets to a small value and starts over. All delays values are much smaller than they used to be. Unlike the log in the original report this shows every packet (so is chatty enough that I did not want to leave it running). I am not sure what to make of the alternation between almost no delay and slightly longer delay, but I do not consider it to be a problem.
          Hide
          ttsai Te-Wei Tsai added a comment -

          I am glad to hear it works, although I still do not understand why. I am not sure how long you tested this already. Based on the data analysis on EFD, this time delay is intermittent. I have an assumption that this delay happened only when you issued the track command but not others (such as the point-to-point movement). Just let me know if you want me to take time to dig into it more or not. Thanks!

          Show
          ttsai Te-Wei Tsai added a comment - I am glad to hear it works, although I still do not understand why. I am not sure how long you tested this already. Based on the data analysis on EFD, this time delay is intermittent. I have an assumption that this delay happened only when you issued the track command but not others (such as the point-to-point movement). Just let me know if you want me to take time to dig into it more or not. Thanks!

            People

            Assignee:
            ttsai Te-Wei Tsai
            Reporter:
            rowen Russell Owen
            Reviewers:
            Russell Owen
            Watchers:
            Russell Owen, Te-Wei Tsai
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

              Dates

              Created:
              Updated:
              Resolved:

                Jenkins

                No builds found.