Details
-
Type:
Story
-
Status: Done
-
Resolution: Done
-
Fix Version/s: None
-
Component/s: None
-
Labels:
-
Story Points:2
-
Epic Link:
-
Sprint:TSSW Sprint - Jul 18 - Aug 01
-
Team:Telescope and Site
-
Urgent?:No
Description
Parker reports having an issue with the chiller. The log has these messages (I have taken the liberty of inserting a few comments):
1658357968.746: chillerTECDrive: isCooling=True, level=62.0000
|
1658357968.848: chillerTECBankCurrents: bank1=0.1600, bank2=0.1600
|
1658357976.001: chillerTemperatures: setTemperature=25.0000, supplyTemperature=25.2000, returnTemperature=nan, ambientTemperature=27.7000
|
1658357976.304: chillerTECDrive: isCooling=True, level=61.0000
|
1658357976.406: chillerTECBankCurrents: bank1=0.1500, bank2=0.1600
|
1658357983.557: chillerTemperatures: setTemperature=25.0000, supplyTemperature=25.0000, returnTemperature=nan, ambientTemperature=27.7000
|
1658357983.854: chillerTECDrive: isCooling=True, level=56.0000
|
1658357983.956: chillerTECBankCurrents: bank1=0.1300, bank2=0.1300
|
|
# The chiller reports an alarm. level1=4096 is PT7_LOW_TEMP.
|
# I don't know what PT7 temp is and the chiller manuals do not explain what it is, nor do they show
|
# how to read that temperature or set alarm thresholds for it. I have asked ThermoTek for an explanation.
|
1658357989.804: chillerWatchdog: controllerState=2, pumpRunning=False, alarmsPresent=True, warningsPresent=False
|
1658357990.049: chillerAlarms: level1=4096, level21=0, level22=0
|
|
# The CSC then goes to FAULT state, as it should for a chiller alarm
|
1658357990.066: logMessage: name=ATWhiteLight, level=50, message=Fault! errorCode=4, errorReport='Chiller reporting alarms', traceback=, filePath=/home/saluser/tsrepos/ts_salobj/python/lsst/ts/salobj/base_csc.py, functionName=fault, lineNumber=697, process=540, timestamp=0.0000
|
1658357990.069: logMessage: name=ATWhiteLight, level=30, message=Going to fault while connected to the lamp controller; forcing the lamp off, traceback=, filePath=/home/saluser/tsrepos/ts_atwhitelight/python/lsst/ts/atwhitelight/atwhitelight_csc.py, functionName=handle_summary_state, lineNumber=153, process=540, timestamp=0.0000
|
1658357990.071: logMessage: name=ATWhiteLight.LampModel, level=30, message=Turning off lamp while warming up because force=True; remaining warmup duration=282.8 seconds, traceback=, filePath=/home/saluser/tsrepos/ts_atwhitelight/python/lsst/ts/atwhitelight/lamp_model.py, functionName=turn_lamp_off, lineNumber=547, process=540, timestamp=0.0000
|
|
# Why this additional errorCode message? We already had erroCode=4 and once we are in fault state we should not get another fault.
|
# Looks like a bug in the CSC.
|
1658357990.076: errorCode: errorCode=5, errorReport=Chiller pump is off, traceback=
|
1658357990.085: logMessage: name=ATWhiteLight, level=50, message=Fault! errorCode=5, errorReport='Chiller pump is off', traceback=, filePath=/home/saluser/tsrepos/ts_salobj/python/lsst/ts/salobj/base_csc.py, functionName=fault, lineNumber=697, process=540, timestamp=0.0000
|
1658357990.087: logMessage: name=ATWhiteLight, level=30, message=Going to fault while connected to the lamp controller; forcing the lamp off, traceback=, filePath=/home/saluser/tsrepos/ts_atwhitelight/python/lsst/ts/atwhitelight/atwhitelight_csc.py, functionName=handle_summary_state, lineNumber=153, process=540, timestamp=0.0000
|
1658357990.089: logMessage: name=ATWhiteLight.LampModel, level=30, message=Turning off lamp while warming up because force=True; remaining warmup duration=282.7 seconds, traceback=, filePath=/home/saluser/tsrepos/ts_atwhitelight/python/lsst/ts/atwhitelight/lamp_model.py, functionName=turn_lamp_off, lineNumber=547, process=540, timestamp=0.0000
|
1658357990.097: lampState: basicState=3, controllerError=-1, controllerState=4, setPower=0.0000, cooldownEndTime=1658358912.0000, warmupEndTime=1658358272.0000
|
1658357990.101: lampOnHours: hours=0.1715
|
|
# Note that despite going to fault the CSC is still getting status from the chiller.
|
# I think this is as per design -- keep getting status for the things the CSC can talk to.
|
1658357991.119: chillerTemperatures: setTemperature=25.0000, supplyTemperature=25.0000, returnTemperature=nan, ambientTemperature=27.5000
|
1658357991.419: chillerTECDrive: isCooling=True, level=0.0000
|
1658357991.520: chillerTECBankCurrents: bank1=0.0000, bank2=0.0000
|
|
# This is very odd: the chiller reports no alarms present. Did the CSC send any commands to the chiller once it read the fault state,
|
# other than status commands? What could cause the alarm to go off again?
|
1658357995.142: chillerWatchdog: controllerState=1, pumpRunning=False, alarmsPresent=False, warningsPresent=False
|
1658357995.144: chillerAlarms: level1=0, level21=0, level22=0
|
1658357998.673: chillerTemperatures: setTemperature=25.0000, supplyTemperature=25.0000, returnTemperature=nan, ambientTemperature=27.0000
|
1658357999.079: chillerTECBankCurrents: bank1=0.0100, bank2=0.0100
|
1658358006.233: chillerTemperatures: setTemperature=25.0000, supplyTemperature=25.0000, returnTemperature=nan, ambientTemperature=26.6000
|
1658358006.632: chillerTECBankCurrents: bank1=0.0000, bank2=0.0000
|
1658358013.784: chillerTemperatures: setTemperature=25.0000, supplyTemperature=25.0000, returnTemperature=nan, ambientTemperature=26.5000
|
1658358028.886: chillerTemperatures: setTemperature=25.0000, supplyTemperature=25.0000, returnTemperature=nan, ambientTemperature=26.4000
|
1658358036.434: chillerTemperatures: setTemperature=25.0000, supplyTemperature=25.1000, returnTemperature=nan, ambientTemperature=26.4000
|
1658358043.986: chillerTemperatures: setTemperature=25.0000, supplyTemperature=25.1000, returnTemperature=nan, ambientTemperature=26.5000
|
1658358051.536: chillerTemperatures: setTemperature=25.0000, supplyTemperature=25.1000, returnTemperature=nan, ambientTemperature=26.4000
|
|
# The chiller is disconnected and the chiller loop times out reading the chiller in close succession.
|
# Despite the disconnect being reported first, it is very likely that the order is the other way around;
|
# note the slightly longer gap in time since the last status was reported.
|
1658358060.739: chillerConnected: connected=False
|
1658358060.742: logMessage: name=ATWhiteLight.ChillerModel, level=40, message=watchdog_loop failed, traceback=Traceback (most recent call last):
|
File "/opt/lsst/software/stack/conda/miniconda3-py38_4.9.2/envs/lsst-scipipe-4.0.1/lib/python3.10/asyncio/streams.py", line 617, in readuntil
|
await self._wait_for_data('readuntil')
|
File "/opt/lsst/software/stack/conda/miniconda3-py38_4.9.2/envs/lsst-scipipe-4.0.1/lib/python3.10/asyncio/streams.py", line 502, in _wait_for_data
|
await self._waiter
|
asyncio.exceptions.CancelledError
|
|
During handling of the above exception, another exception occurred:
|
|
Traceback (most recent call last):
|
File "/opt/lsst/software/stack/conda/miniconda3-py38_4.9.2/envs/lsst-scipipe-4.0.1/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
|
return fut.result()
|
asyncio.exceptions.CancelledError
|
|
The above exception was the direct cause of the following exception:
|
|
Traceback (most recent call last):
|
File "/home/saluser/tsrepos/ts_atwhitelight/python/lsst/ts/atwhitelight/chiller_model.py", line 832, in watchdog_loop
|
await self.do_watchdog()
|
File "/home/saluser/tsrepos/ts_atwhitelight/python/lsst/ts/atwhitelight/chiller_model.py", line 465, in do_watchdog
|
await self.run_command("01WatchDog")
|
File "/home/saluser/tsrepos/ts_atwhitelight/python/lsst/ts/atwhitelight/chiller_model.py", line 763, in run_command
|
reply = await self.client.run_command(cmd)
|
File "/home/saluser/tsrepos/ts_atwhitelight/python/lsst/ts/atwhitelight/chiller_client.py", line 117, in run_command
|
reply = await asyncio.wait_for(
|
File "/opt/lsst/software/stack/conda/miniconda3-py38_4.9.2/envs/lsst-scipipe-4.0.1/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
|
raise exceptions.TimeoutError() from exc
|
asyncio.exceptions.TimeoutError, filePath=/home/saluser/tsrepos/ts_atwhitelight/python/lsst/ts/atwhitelight/chiller_model.py, functionName=watchdog_loop, lineNumber=836, process=540, timestamp=0.0000
|
The chiller control panel shows an alarm dialog "SYSTEM MODE MISMATCH ALARM" (see attached photo) and the chiller needs to be power cycled to get out of it. Pushing the "Accept" button on the control panel does nothing.
One possibility is that the chiller stops communicating about a minute after it first shows an alarm. If that is the case then it may be best to not keep getting status after the chilller goes to an alarm state (or at least not for much longer).
Attachments
Issue Links
- is mitigated by
-
DM-35753 Fix chiller alarm parsing
- Done
I contacted ThermoTek to ask about this. Based on logs we sent them (thanks to Parker Fagrelius for collecting those) and they think:
They suggest trying to send 18 again, to see if that clears the mode mismatch alarm. But that seems unlikely to be robust (and may not even work, since the chiller stops responding to commands on the serial port at some point – we suspect when the mode mismatch alarm starts).
My own thought is that we should explicitly stop chilling as soon as the watchdog reports an alarm – before trying to read the alarm state. That way the chiller's user interface code will see that the chiller has been commanded to standby and not get upset. We will try that first. If that doesn't work, a fallback is to never read the alarm state.
These changes will be implemented in
DM-35753.