Data Management / DM-35679

ts_atwhitelight freezing after chiller goes to alarm state


    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      Parker reports having an issue with the chiller. The log has these messages (I have taken the liberty of inserting a few comments):

      1658357968.746: chillerTECDrive: isCooling=True, level=62.0000
      1658357968.848: chillerTECBankCurrents: bank1=0.1600, bank2=0.1600
      1658357976.001: chillerTemperatures: setTemperature=25.0000, supplyTemperature=25.2000, returnTemperature=nan, ambientTemperature=27.7000
      1658357976.304: chillerTECDrive: isCooling=True, level=61.0000
      1658357976.406: chillerTECBankCurrents: bank1=0.1500, bank2=0.1600
      1658357983.557: chillerTemperatures: setTemperature=25.0000, supplyTemperature=25.0000, returnTemperature=nan, ambientTemperature=27.7000
      1658357983.854: chillerTECDrive: isCooling=True, level=56.0000
      1658357983.956: chillerTECBankCurrents: bank1=0.1300, bank2=0.1300
       
      # The chiller reports an alarm. level1=4096 is PT7_LOW_TEMP.
      # I don't know what PT7 temp is and the chiller manuals do not explain what it is, nor do they show
      # how to read that temperature or set alarm thresholds for it. I have asked ThermoTek for an explanation.
      1658357989.804: chillerWatchdog: controllerState=2, pumpRunning=False, alarmsPresent=True, warningsPresent=False
      1658357990.049: chillerAlarms: level1=4096, level21=0, level22=0
       
      # The CSC then goes to FAULT state, as it should for a chiller alarm
      1658357990.066: logMessage: name=ATWhiteLight, level=50, message=Fault! errorCode=4, errorReport='Chiller reporting alarms', traceback=, filePath=/home/saluser/tsrepos/ts_salobj/python/lsst/ts/salobj/base_csc.py, functionName=fault, lineNumber=697, process=540, timestamp=0.0000
      1658357990.069: logMessage: name=ATWhiteLight, level=30, message=Going to fault while connected to the lamp controller; forcing the lamp off, traceback=, filePath=/home/saluser/tsrepos/ts_atwhitelight/python/lsst/ts/atwhitelight/atwhitelight_csc.py, functionName=handle_summary_state, lineNumber=153, process=540, timestamp=0.0000
      1658357990.071: logMessage: name=ATWhiteLight.LampModel, level=30, message=Turning off lamp while warming up because force=True; remaining warmup duration=282.8 seconds, traceback=, filePath=/home/saluser/tsrepos/ts_atwhitelight/python/lsst/ts/atwhitelight/lamp_model.py, functionName=turn_lamp_off, lineNumber=547, process=540, timestamp=0.0000
       
      # Why this additional errorCode message? We already had errorCode=4, and once we are in fault state we should not get another fault.
      # Looks like a bug in the CSC.
      1658357990.076: errorCode: errorCode=5, errorReport=Chiller pump is off, traceback=
      1658357990.085: logMessage: name=ATWhiteLight, level=50, message=Fault! errorCode=5, errorReport='Chiller pump is off', traceback=, filePath=/home/saluser/tsrepos/ts_salobj/python/lsst/ts/salobj/base_csc.py, functionName=fault, lineNumber=697, process=540, timestamp=0.0000
      1658357990.087: logMessage: name=ATWhiteLight, level=30, message=Going to fault while connected to the lamp controller; forcing the lamp off, traceback=, filePath=/home/saluser/tsrepos/ts_atwhitelight/python/lsst/ts/atwhitelight/atwhitelight_csc.py, functionName=handle_summary_state, lineNumber=153, process=540, timestamp=0.0000
      1658357990.089: logMessage: name=ATWhiteLight.LampModel, level=30, message=Turning off lamp while warming up because force=True; remaining warmup duration=282.7 seconds, traceback=, filePath=/home/saluser/tsrepos/ts_atwhitelight/python/lsst/ts/atwhitelight/lamp_model.py, functionName=turn_lamp_off, lineNumber=547, process=540, timestamp=0.0000
      1658357990.097: lampState: basicState=3, controllerError=-1, controllerState=4, setPower=0.0000, cooldownEndTime=1658358912.0000, warmupEndTime=1658358272.0000
      1658357990.101: lampOnHours: hours=0.1715
       
      # Note that despite going to fault the CSC is still getting status from the chiller.
      # I think this is by design -- keep reporting status for the devices the CSC can still talk to.
      1658357991.119: chillerTemperatures: setTemperature=25.0000, supplyTemperature=25.0000, returnTemperature=nan, ambientTemperature=27.5000
      1658357991.419: chillerTECDrive: isCooling=True, level=0.0000
      1658357991.520: chillerTECBankCurrents: bank1=0.0000, bank2=0.0000
       
      # This is very odd: the chiller reports no alarms present. Did the CSC send any commands to the chiller, other than
      # status commands, once it detected the fault? What could cause the alarm to clear again?
      1658357995.142: chillerWatchdog: controllerState=1, pumpRunning=False, alarmsPresent=False, warningsPresent=False
      1658357995.144: chillerAlarms: level1=0, level21=0, level22=0
      1658357998.673: chillerTemperatures: setTemperature=25.0000, supplyTemperature=25.0000, returnTemperature=nan, ambientTemperature=27.0000
      1658357999.079: chillerTECBankCurrents: bank1=0.0100, bank2=0.0100
      1658358006.233: chillerTemperatures: setTemperature=25.0000, supplyTemperature=25.0000, returnTemperature=nan, ambientTemperature=26.6000
      1658358006.632: chillerTECBankCurrents: bank1=0.0000, bank2=0.0000
      1658358013.784: chillerTemperatures: setTemperature=25.0000, supplyTemperature=25.0000, returnTemperature=nan, ambientTemperature=26.5000
      1658358028.886: chillerTemperatures: setTemperature=25.0000, supplyTemperature=25.0000, returnTemperature=nan, ambientTemperature=26.4000
      1658358036.434: chillerTemperatures: setTemperature=25.0000, supplyTemperature=25.1000, returnTemperature=nan, ambientTemperature=26.4000
      1658358043.986: chillerTemperatures: setTemperature=25.0000, supplyTemperature=25.1000, returnTemperature=nan, ambientTemperature=26.5000
      1658358051.536: chillerTemperatures: setTemperature=25.0000, supplyTemperature=25.1000, returnTemperature=nan, ambientTemperature=26.4000
       
      # The chiller is reported as disconnected and, in close succession, the chiller watchdog loop times out reading the chiller.
      # Despite the disconnect being reported first, it is very likely that the timeout actually came first;
      # note the slightly longer gap in time since the last status was reported.
      1658358060.739: chillerConnected: connected=False
      1658358060.742: logMessage: name=ATWhiteLight.ChillerModel, level=40, message=watchdog_loop failed, traceback=Traceback (most recent call last):
        File "/opt/lsst/software/stack/conda/miniconda3-py38_4.9.2/envs/lsst-scipipe-4.0.1/lib/python3.10/asyncio/streams.py", line 617, in readuntil
          await self._wait_for_data('readuntil')
        File "/opt/lsst/software/stack/conda/miniconda3-py38_4.9.2/envs/lsst-scipipe-4.0.1/lib/python3.10/asyncio/streams.py", line 502, in _wait_for_data
          await self._waiter
      asyncio.exceptions.CancelledError
       
      During handling of the above exception, another exception occurred:
       
      Traceback (most recent call last):
        File "/opt/lsst/software/stack/conda/miniconda3-py38_4.9.2/envs/lsst-scipipe-4.0.1/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
          return fut.result()
      asyncio.exceptions.CancelledError
       
      The above exception was the direct cause of the following exception:
       
      Traceback (most recent call last):
        File "/home/saluser/tsrepos/ts_atwhitelight/python/lsst/ts/atwhitelight/chiller_model.py", line 832, in watchdog_loop
          await self.do_watchdog()
        File "/home/saluser/tsrepos/ts_atwhitelight/python/lsst/ts/atwhitelight/chiller_model.py", line 465, in do_watchdog
          await self.run_command("01WatchDog")
        File "/home/saluser/tsrepos/ts_atwhitelight/python/lsst/ts/atwhitelight/chiller_model.py", line 763, in run_command
          reply = await self.client.run_command(cmd)
        File "/home/saluser/tsrepos/ts_atwhitelight/python/lsst/ts/atwhitelight/chiller_client.py", line 117, in run_command
          reply = await asyncio.wait_for(
        File "/opt/lsst/software/stack/conda/miniconda3-py38_4.9.2/envs/lsst-scipipe-4.0.1/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
          raise exceptions.TimeoutError() from exc
      asyncio.exceptions.TimeoutError, filePath=/home/saluser/tsrepos/ts_atwhitelight/python/lsst/ts/atwhitelight/chiller_model.py, functionName=watchdog_loop, lineNumber=836, process=540, timestamp=0.0000
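
      Regarding the second fault (errorCode=5) noted in the comments above: for illustration only (this is not the actual ts_atwhitelight code, and it assumes nothing beyond the standard salobj BaseCsc.fault API that appears in the traceback), a guard against reporting another fault while the CSC is already in FAULT might look like:

      from lsst.ts import salobj

      def fault_once(csc, code, report):
          """Report a fault only if the CSC is not already in FAULT state."""
          if csc.summary_state != salobj.State.FAULT:
              csc.fault(code=code, report=report)
          else:
              csc.log.info(f"Already in FAULT; ignoring errorCode={code}: {report}")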
      

      The chiller control panel shows an alarm dialog "SYSTEM MODE MISMATCH ALARM" (see attached photo) and the chiller needs to be power cycled to get out of it. Pushing the "Accept" button on the control panel does nothing.

      One possibility is that the chiller stops communicating about a minute after it first shows an alarm. If that is the case, then it may be best not to keep requesting status after the chiller goes to an alarm state (or at least not for much longer).
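
      If we go that route, here is a minimal sketch of halting status polling once the watchdog reports an alarm (purely illustrative: the WatchdogPoller class, its attributes, and the reply format are hypothetical, not the ts_atwhitelight ChillerModel; only the "01WatchDog" command string is taken from the traceback above):

      import asyncio
      import logging

      class WatchdogPoller:
          """Hypothetical poller that stops polling once the chiller reports an alarm."""

          def __init__(self, client, interval=1.0):
              self.client = client      # object with an async run_command(cmd) coroutine
              self.interval = interval  # polling interval in seconds (a guess)
              self.log = logging.getLogger("WatchdogPoller")

          async def watchdog_loop(self):
              while True:
                  reply = await self.client.run_command("01WatchDog")
                  if reply.alarms_present:
                      self.log.warning("Chiller reports alarms; halting status polling")
                      return
                  await asyncio.sleep(self.interval)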

            Activity

            Russell Owen added a comment - edited

            I contacted ThermoTek to ask about this. Based on logs we sent them (thanks to Parker Fagrelius for collecting those), they think:

            • We are low on fluid.
            • The fluid sensor is sampled every 500 msec. If it is low for 10 consecutive readings (5 seconds) the alarm goes off and the chiller goes to standby.
            • Unfortunately, reading the alarm with Command 18 clears the alarm bits, including the latched ones. The intention is to allow you to restart the chiller and go from Standby back into Cooling Mode, but that's not what we try to do.
            • The mode mismatch is probably happening because the user interface sees the chiller has stopped, but the user interface does not check the alarm bits for at least 1.5 seconds.
            • They suspect that the alarm bits are being cleared so quickly by Command 18 that the user interface misses them and does not understand why the chiller stopped cooling. It then sets the mode mismatch alarm out of an abundance of caution (see the toy simulation after this list).
            • The mode mismatch alarm is what is preventing the unit from restarting without a power cycle.
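
            Here is a toy simulation of that suspected race, purely illustrative (the ToyChiller class, its bit values, and the timing constants are guesses, not chiller firmware):

            import asyncio

            class ToyChiller:
                def __init__(self):
                    self.alarm_bits = 0x1000  # PT7_LOW_TEMP (4096) latched
                    self.cooling = False      # chiller has already stopped itself

                def read_alarms(self):
                    """Command 18: return the latched alarm bits and clear them."""
                    bits, self.alarm_bits = self.alarm_bits, 0
                    return bits

            async def csc(chiller):
                await asyncio.sleep(0.1)  # the CSC reads the alarms almost immediately
                print(f"CSC read alarms: {chiller.read_alarms():#06x}")

            async def front_panel(chiller):
                await asyncio.sleep(1.5)  # the UI checks the bits about 1.5 seconds later
                if not chiller.cooling and chiller.read_alarms() == 0:
                    print("Front panel: stopped with no alarm -> SYSTEM MODE MISMATCH ALARM")

            async def main():
                chiller = ToyChiller()
                await asyncio.gather(csc(chiller), front_panel(chiller))

            asyncio.run(main())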

            They suggest trying to send Command 18 again, to see if that clears the mode mismatch alarm. But that seems unlikely to be robust (and may not even work, since the chiller stops responding to commands on the serial port at some point – we suspect when the mode mismatch alarm starts).

            My own thought is that we should explicitly stop chilling as soon as the watchdog reports an alarm – before trying to read the alarm state. That way the chiller's user interface code will see that the chiller has been commanded to standby and not get upset. We will try that first. If that doesn't work, a fallback is to never read the alarm state.
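
            Here is a minimal sketch of that ordering (the command strings below are hypothetical placeholders, not the real ThermoTek command IDs; run_command stands in for ChillerModel.run_command):

            async def handle_alarm(run_command, read_alarms_after_stopping=False):
                """Stop chilling first, then optionally read the alarm registers."""
                # Command standby before anything else, so the chiller's front panel
                # sees an intentional stop rather than an unexplained halt (the
                # suspected trigger of the SYSTEM MODE MISMATCH ALARM).
                await run_command("SetStandbyMode")  # hypothetical command name
                if read_alarms_after_stopping:
                    # Command 18 clears the latched alarm bits when read, which is why
                    # the fallback is to never read them at all.
                    await run_command("ReadAlarmStateLevel1")  # hypothetical command name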

            These changes will be implemented in DM-35753.


              People

              Assignee:
              Russell Owen
              Reporter:
              Russell Owen
              Reviewers:
              Tiago Ribeiro
              Watchers:
              Parker Fagrelius, Patrick Ingraham, Russell Owen, Tiago Ribeiro

