DM-36318

Base SAL Script fails during configuration and exits in unconfigured state

    Details

    • Story Points:
      3
    • Sprint:
      TSSW Sprint - Oct 10 - Oct 24
    • Team:
      Telescope and Site
    • Urgent?:
      No

      Description

      During the night of 2022-09-13, a CWFS script was added to the script queue by the Scheduler CSC but failed during configuration and remained in the queue in an UNCONFIGURED state. Since it never transitioned to FAILED, the observation remained in the queued list of scripts but was never executed. See https://jira.lsstcorp.org/browse/OBS-54 for more details.

      It is not clear whether this was a bug or by design. If it is by design, we may need to consider whether the script can instead exit in a FAILED state, so that the scheduler queue handles these situations more gracefully.

            Activity

            rowen Russell Owen added a comment - - edited

            The intent was that one could send an amended configuration if the first one failed. But I can see why it is confusing to leave the script in the UNCONFIGURED state when that happens. Also, given the way the script queue works, I don't think it is possible for a user to issue a new configuration.

            So I propose:

            • Add a new CONFIGUREFAILED ScriptState and make it a terminal state. (This name matches a ScriptQueue ScriptProcessState).
            • Change ScriptQueue to not remember scripts that fail in that fashion in the list of scripts that can be requeued (since that will just fail in the same way again).
            • Change ScriptQueue "add" and "requeue" commands to wait until configuration succeeds or fails, for final acknowledgement. Issue an "in progress" ack, to avoid new timeouts caused by this change. This could be sent before or after loading, but loading is likely a bit slower than configuration, so I lean towards issuing that before loading.
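
The first two points of the proposal can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual ts_xml or ts_scriptqueue code: the enum values, the `TERMINAL_STATES` set, and the `RequeueHistory` class are all hypothetical names invented for this example. Only the state name CONFIGUREFAILED comes from the proposal itself.

```python
import collections
import enum


class ScriptState(enum.Enum):
    """Hypothetical subset of script states; the numeric values are illustrative."""

    UNCONFIGURED = enum.auto()
    CONFIGURED = enum.auto()
    RUNNING = enum.auto()
    DONE = enum.auto()
    FAILED = enum.auto()
    CONFIGUREFAILED = enum.auto()  # proposed new terminal state


# Assumption: states from which a script never transitions again.
TERMINAL_STATES = frozenset(
    {ScriptState.DONE, ScriptState.FAILED, ScriptState.CONFIGUREFAILED}
)


class RequeueHistory:
    """Hypothetical record of finished scripts that may be requeued."""

    def __init__(self, max_len=10):
        self._indices = collections.deque(maxlen=max_len)

    def record(self, script_index, final_state):
        """Remember a finished script; return True if it can be requeued."""
        if final_state not in TERMINAL_STATES:
            raise ValueError(f"{final_state!r} is not a terminal state")
        if final_state is ScriptState.CONFIGUREFAILED:
            # Requeueing would reuse the same configuration and fail again,
            # so the script is not remembered.
            return False
        self._indices.append(script_index)
        return True

    @property
    def requeueable(self):
        return list(self._indices)
```

With this sketch, a script that ends in DONE or FAILED stays available for requeueing, while one that ends in CONFIGUREFAILED is dropped from the history.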
            tribeiro Tiago Ribeiro added a comment -

            I think the first 2 points are ok, but I think add and requeue should retain their current behavior.

            rowen Russell Owen added a comment - - edited

            I asked Tiago about his reasons and he points out that different scripts can take radically different amounts of time to be configured (perhaps because constructing Remotes is so slow with DDS?). In any case, he would like more predictable timing for the add command. For now I will leave it as is. Perhaps we can revisit this if and when we switch to Kafka.


            On further inspection, the ScriptQueue itself is designed to terminate a script if configuration fails (this is a feature of the ScriptInfo class). I still agree that BaseScript should quit if configuration fails, and have implemented that in this ticket. But I also think the existing ScriptQueue code should have taken care of this problem. So I am puzzled.
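
The behavior described above, where a script quits if configuration fails instead of lingering in UNCONFIGURED, might look roughly like this. It is a hedged sketch only: `SketchScript`, its methods, the `exited` flag, and the `run_demo` helper are hypothetical and do not reproduce the real BaseScript API in ts_salobj.

```python
import asyncio
import enum


class State(enum.Enum):
    """Hypothetical, reduced set of script states."""

    UNCONFIGURED = enum.auto()
    CONFIGURED = enum.auto()
    CONFIGUREFAILED = enum.auto()


class SketchScript:
    """Hypothetical stand-in for a script; not the real BaseScript."""

    def __init__(self, required_keys):
        self.required_keys = set(required_keys)
        self.state = State.UNCONFIGURED
        self.exited = False

    async def configure(self, config):
        """Validate the configuration; raise if it is unusable."""
        missing = self.required_keys - set(config)
        if missing:
            raise ValueError(f"missing config keys: {sorted(missing)}")

    async def do_configure(self, config):
        """Handle a configure request, quitting if configuration fails."""
        try:
            await self.configure(config)
        except Exception:
            # Move to a terminal state and quit, rather than staying
            # in UNCONFIGURED and waiting for an amended configuration.
            self.state = State.CONFIGUREFAILED
            self.exited = True
            return
        self.state = State.CONFIGURED


def run_demo(config):
    """Configure a sketch script and report its final state."""
    script = SketchScript(required_keys=["filter"])
    asyncio.run(script.do_configure(config))
    return script.state, script.exited
```

Here `run_demo({})` leaves the script in CONFIGUREFAILED with `exited` set, while `run_demo({"filter": "r"})` leaves it in CONFIGURED and still alive.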

            rowen Russell Owen added a comment - - edited

            Pull requests:

            • https://github.com/lsst-ts/ts_salobj/pull/256
            • https://github.com/lsst-ts/ts_xml/pull/623
            • https://github.com/lsst-ts/ts_idl/pull/99
            • https://github.com/lsst-ts/ts_scriptqueue/pull/74

            rowen Russell Owen added a comment -

            Reviewed on GitHub.

            rowen Russell Owen added a comment -

            Merged to develop. The release will probably await the next cycle build (DM-35787).


              People

              Assignee:
              rowen Russell Owen
              Reporter:
              edennihy Erik Dennihy
              Reviewers:
              Petr Kubanek
              Watchers:
              Erik Dennihy, Petr Kubanek, Russell Owen, Tiago Ribeiro