Fix Version/s: None
Sprint:TSSW Sprint - Oct 24 - Nov 07
Team:Telescope and Site
Respond to emerging issues, helping out where I can to solve problems, plans for options to maintain schedule and so on
Monday was not good. We went up at 8:30 expecting that the power would be back after lunch, not!!
They had to revert to the previous transfer switch and we finally got power to the MCS after dinner.
Russell managed to get a connection only by going up to the 6th floor, but even that was not good as the network is in a bad state and he could not get a decent connection (simple commands timing out). It appears that the computer room power chaos of last week has damaged some of the switches. There may be a possibility of sparing from the base but unless the switches are the same AND the configurations were backed up for easy restore, it will be a big job for IT to set them back up. I asked Alberto for the Tekniker MAC/IP data and he has updated a document with that. No diagram as yet but it looks like moving part of it to the control network will also keep IT busy for a while.
I worked on the PXI and ran a bunch of tests and examined logs. It was not possible to get any of the popular performance packages to run as they all were either binaries for regular Linux systems, or had incompatible dependencies. I resorted to loading the CPU with large gzip tasks and disk copies. No sign of problems. I also loaded some debug tools (ltrace, strace, etc.) so I can look at what happens when they run the simple LabVIEW test that was not working before. In the logs I found a reference to SoftMotion, but Julen explained that they had kept the SoftMotion class names so that replacing the library would be simpler. I managed to build the C code for the Trajectory into a .so library. This does not depend upon any (e.g. MATLAB) libraries so I do not think it matters which modern version it was generated with, but at least the build path is much simpler than the two options described by the Tekniker document (2 lines of command line invocations!). I will compare the symbol tables with the one Tekniker is using….
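The symbol table comparison mentioned above could be sketched roughly as follows. This is only an illustration, assuming GNU `nm` is available on the machine doing the comparison; the library file names and symbol names are hypothetical placeholders, not the real Tekniker ones.

```python
import subprocess


def parse_nm_output(text):
    """Return the set of symbol names found in `nm` output.

    Defined symbols are printed as "<address> <type> <name>", so only
    lines with exactly three fields are kept.
    """
    symbols = set()
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            symbols.add(parts[2])
    return symbols


def exported_symbols(path):
    """List the defined dynamic symbols of a shared library via GNU nm."""
    result = subprocess.run(
        ["nm", "-D", "--defined-only", path],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_nm_output(result.stdout)


def compare_symbols(ours, theirs):
    """Report symbols present in only one of the two libraries."""
    return {
        "only_ours": sorted(ours - theirs),
        "only_theirs": sorted(theirs - ours),
    }


if __name__ == "__main__":
    # Hypothetical file names: the locally built library and the deployed one.
    diff = compare_symbols(
        exported_symbols("libTrajectory_local.so"),
        exported_symbols("libTrajectory_tekniker.so"),
    )
    print(diff)
```

An empty result in both lists would suggest the two builds export the same interface, though it says nothing about whether the code behind those symbols is identical.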
Today I will try and get them to build the HHD code and update the device. They may be busy working with the Hall sensor though as they removed it late yesterday. If we get a single day of soak test before I leave I will be quite surprised.
Tuesday there was some more work on the ccw/rotator and the link between MCS and CSC seemed better for no particular reason. Many of the servers are still inaccessible from either 2nd or 6th (or both) floors. In particular the time server and VMs….. I got an updated GIS interface manual from Alberto and learned how to reset the Rotator interlock, and how to bypass/unbypass items. I also managed to build the Trajectory library on the PXI itself, so that should certainly not have any compatibility issue! I installed the LSST stack back on a Tucson machine and learnt how to configure it to set up the salobj Kafka to generate Avro JSON (I will continue learning about that and generating C++/Java code from it, whilst waiting for other TMA blockers). Also downloaded the LabVIEW rt-linux source repository. Interestingly it is now based on Yocto/OpenEmbedded and uses Ubuntu 20.04 for the binary repos! (should be better supported going forward even if NI stop). I started on the long journey of setting up a development environment based on it; it will be a learning experience but probably useful in the long term.
Tekniker were working on the Azimuth sensor / Ethercat interaction but I think they need Phase’s help to make significant progress. They also worked on the new Inclinometer based homing code. If we get the replacement switches up and configured tomorrow we will be off and running again…
Wednesday started off slow with a lot of discussions about !progress, schedule implications etc, etc.
Later on some progress was made though:
More computer room servers rebooted
HHD updated (despite IT being unable to provide a USB2 hub!)
Hall sensor checked against Ethercat in standalone mode - no issues
Hall sensor re-installed and full Azimuth range tested - no issues
(this will still need calibration work by Phase tomorrow before MCS control)
Communications to MCS controller computer restored by adding explicit route
Fixed small issues with chrony on the MCS/PXI and added it to /etc/init.d so it survives a reboot
(this still needs a re-do with locally built chronyd on PXIs but working for now)
Small errors in added MTMount telemetry fixed, more topics added
Once MCS-CSC and the timeserver were back up, Rotator/CCW tests could resume
Thursday was a mixed bag. Phase tuned the Hall Effect sensor configuration and then Tekniker exercised the Azimuth axis and pronounced it good. The new PXI was also tested with the trivial LabVIEW test program and was working again. Russell did some more Rotator/CCW tests and observed some pretty bizarre behaviour from the Rotator. There also appeared to be a “delay” somewhere in the Rotator which is not understood.
I helped Alberto with the 2nd HHD (some assembly required!), and then observed the connection process on the TMA. We hope to train others on the procedure tomorrow. He had some issues rebuilding the HDD image initially but that got fixed and I plan to observe a successful build procedure tomorrow as well.
I got the EUI displaying on one of the big screens in the control room and watched as Julen went through the startup procedures with OSS, Cap Banks, Locking Pins etc. I would like to record a clean instance of this for training.
After dinner Tekniker started to test two axis moves but soon ran into pre-existing problems with the trajectory generator (still using the old version). Ishmael at Tekniker is working on re-factoring the new generator but it will take a while as he has injured his hand, hopefully Alberto can assist when he is back in Spain. There was a lengthy set of meetings all discussing most of the same items! For TMA software Tekniker will be working remotely for 2-3 weeks after they are back in Spain and the Telescope will be reserved for them to access/move for a portion of each day (someone will have to be assigned on-summit to assist with startup checkout and E-STOP readiness during any motion). The Star Tracker camera was connected and they will be trying to connect up the basic (VIMBA) software tomorrow. A static telescope / manual dome shutter early test was also proposed to exercise the imaging aspects of the Pointing test.
Friday was a bit more successful. Tekniker swapped back in the PXI, and after a few hiccups got it updated and “working” again. In our brief tests it did not exhibit the faulting behaviour shown before! Alberto showed me the image rebuild procedure for the EUI (no LabVIEW popups so that was nice). I worked with Julen to document the startup procedure using the EUI and recorded a set of videos of the process. This included OSS, power up, locking pins, torque current monitoring (i.e. we are balanced), and point to point slews. We also exercised the ESTOP and recovery process with both the control room and HHD ESTOPs. Alberto also recorded a video of the HHD connection process (it is quite simple and the timeout is set to 60 seconds which is ample). One concern was the OSS startup time; from “cold”, it took about 30 minutes, and even after a quick power cycle it still takes 5 or so. Julen said the bulk of it is the chillers, so maybe they can be “tuned”. I have arranged to get the Telescope (motion allowed) for a 2pm till 11pm time slot for the next 2 weeks+ and will be working with Tekniker remote (starting next Tuesday), and Samuel on-site. I will then of course also be remote starting Nov 14. The StarTracker folks could not connect to it despite having the IPs, which Michael says they don’t need anyway. They were using wireless from the laptop and I suspect non-Jumbo-frames somewhere along the connection path. Next week I will try again to set up a BJ face-to-face with IT (still trying to get all the required info from Alberto myself). Russell and Bruno continued with the Rotator; they managed to get it stuck past a limit once, did a visual inspection of the full range to check for interferences (none), and then tried some motion (ramps). The behaviour was still quite bizarre in some cases. Bruno has shared a notebook he uses to plot relevant EFD data. We also have instructions for running test trajectories in simulation so maybe that will shed some light.
From Te-Wei I learnt that although we are using the “new” Trajectory algorithm for both Rotator and CCW, it does not mean they use the same shared library as the actual implementation, yuk. He sent me some documents to look at for the details.
An amazing amount of work and heroic struggle by Dave on the summit to get things accomplished on the TMA. A lot of progress was accomplished (as shown by his comments). Thanks Dave for all the work!
Today was a bit frustrating. Alberto’s plane was delayed so he arrived and slept, came up for dinner, not sure if they will achieve much this evening. Julen is wrestling with an issue with the new Trajectory (main axes) algorithm. He has installed the old one temporarily so we can hope to move under MCS control tomorrow. Some “valves” had been removed from the top end, so the telescope had to be coarse balanced again; that was completed before dinner. Russell arrived safely and started some CSC testing (he can better speak to detail). We identified the thermal telemetry as a priority for him to tackle getting through to the EFD. We are trying to get Tekniker to be able to work from the 2nd floor instead of the somewhat frozen 6th. This requires getting access to the TMA private network from there, so I put in a high priority ticket with IT. The proposed “add a valve” solution to the OSS issue has a 4 month lead time on the part, so other sources are being investigated. The SAL build failure turned out to be an environmental issue triggered by “mamba”, so a fix to the JenkinsFile took care of that (kudos Tiago). Te-Wei gave me a link to the build instructions for the offending Trajectory library (doco may need an update). I signed up for a MATLAB trial and downloaded that so I can exercise/understand that procedure. Downloading my Win10 VM from Tucson to run it on took a very long time (max of 8MB per sec even though we are supposed to have a dedicated 40GB link for non-pixel uses; Josh informs me that is what I can expect). In good news Russell reports that the Telemetry has been running for hours without a disconnect event, so that is another punch list ticket I can close.
Today we still did not get motion under MCS control! Tekniker had a major issue with the Axes PXI and are reverting to using the Siemens computer instead as a substitute. The PXI was running very slowly and they could not work out why, even a trivial LabVIEW program was super-slow. They will give the PXI to me tomorrow to run some basic performance metrics on. I had a look whilst it was still connected and could see nothing obvious. In the downtime I started trying to build the trajectories library using MATLAB in my VM. This took up a lot of time as MATLAB is crashing when I try the build….. Russell swapped in his rotator/ccw only CSC for Holger to test with tonight.
Let’s hope for better progress tomorrow.
Wednesday was interesting. An impressive feat of diagnosis by Tekniker found that the issue after replacing the PXI with Siemens was due to a hall effect sensor being too close to a magnet, causing the shutdown of part of the Ethercat network. This may have been triggered by a certain physical location of the telescope where it was left. In ccw/rot news there is an issue with the timing between the computers: system A gets time from the Microsemi and serves it to system B on the private network, yet there is a 2 second offset, even after resyncing both systems. Other 37 second offsets also appear to be in play according to Russell (I think we may need to hire a timing guru to sort this out, but I will try and load the more modern chronyd on the PXIs today). I think we should move the Tekniker computers onto the control network…….this caused network issues before but the offending switch has been replaced by a dumb one for other reasons anyway. Te-Wei may be able to help me get the generated Trajectory code, but cannot run the other steps either and suggests Tekniker update their procedure document (they are however rather busy so don’t hold your breath on that one). Chuck has asked me to look at using my old GigE camera stream-to-fits library to capture his high speed pointing data as the GenericCamera CSC cannot keep up. I have asked Eric to send up a PXI chassis so I can boot up the PXI standalone to investigate if there is really a hardware issue with it. Shawn wants to have an IT tag-up to try and speed up the Tekniker move to the 2nd floor and other IT issues; I think that will also need Franco to help get an emergency stop button installed as well.
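One possible reading of the 37 second offsets, offered only as a hypothesis: TAI has been exactly 37 seconds ahead of UTC since 1 January 2017, so a pair of systems where one reports TAI and the other UTC would disagree by that amount. The sketch below is a minimal illustration of sorting measured offsets into "looks like TAI vs UTC" and "unexplained"; the function name and the tolerance are assumptions, not part of any existing tool.

```python
# TAI minus UTC in seconds; this value has been in effect since 2017-01-01.
TAI_MINUS_UTC_S = 37.0


def classify_offset(offset_s, tolerance_s=0.5):
    """Classify a measured clock offset (in seconds) between two systems.

    Returns "in-sync" when the clocks agree within the tolerance,
    "tai-utc" when the offset matches the TAI/UTC difference in either
    direction, and "unexplained" otherwise.
    """
    magnitude = abs(offset_s)
    if magnitude <= tolerance_s:
        return "in-sync"
    if abs(magnitude - TAI_MINUS_UTC_S) <= tolerance_s:
        return "tai-utc"
    return "unexplained"


if __name__ == "__main__":
    for measured in (0.01, 2.0, 37.02, -36.9):
        print(measured, classify_offset(measured))
```

Under this reading a 2 second offset would remain in the "unexplained" bucket, which is consistent with it needing a separate fix.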
We also did 4 hours EUI training with a variety of summit support staff. I recorded the 2 sessions using BlueJeans.
In unrelated matters UTE are threatening to leave the mountain if the awful food service does not improve real soon now.
Today I updated the TMA computers to use the chronyd time service instead of ntp (it is compatible however).
This fixed the bizarre 2 second offset we saw yesterday. Russell resumed ccw/rotator work only to be immediately stymied by a CSC failure to communicate. Te-Wei realized this was likely due to an inappropriate power up sequence of the Rotator controller and directed us to the necessary ethernet power strip tool and EUI. Sadly Russell could not access it due to IPA (so he has filed a ticket), but I found a workaround by logging directly into the server and directing X-windows back to the laptop. Tekniker seem confident that the Azimuth motion issue is with the Phase hall sensor, and have arranged for them to dial in tomorrow. They are now testing Elevation motion and the newly installed inclinometer. I got the “bad” PXI and took it down to the second floor (still in its chassis so I do not need to source that). It has a DisplayPort connector and most people had already left the building, so I posted a high priority ticket for IT to furnish me with the correct adaptor; hopefully I will get one tomorrow. Also trying to get MAC addresses of all the TMA machines together to add to the “move Tekniker into the control network” ticket. Added another ticket to request an ESTOP in the control room.
Tried to debug an unusual network behaviour with one of the PXIs where it fails to ping the machine you just logged into it from, until you run a traceroute back to said machine, then ping works! I vaguely suspect some DNS issue and suggested adding all the machines to the local /etc/hosts; we shall see if that works when they can reboot it. Noticed that Russell could temporarily not log in to azar02, and then a few minutes later he could. I am also suspicious of DNS here; it may be worth trying to see if there is some kind of systemwide DNS audit/monitoring that can be done (by IT).
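As a rough illustration of the /etc/hosts suggestion and of a very basic DNS check, the sketch below formats a name-to-address table as hosts file lines and times a single name lookup. The host names and the 192.0.2.x addresses are placeholders (that range is reserved for documentation), not the real TMA machines.

```python
import socket
import time


def hosts_lines(machines):
    """Format a {hostname: ip} mapping as /etc/hosts lines, sorted by name."""
    return [f"{ip}\t{name}" for name, ip in sorted(machines.items())]


def timed_lookup(name, resolver=socket.gethostbyname):
    """Resolve a host name and return (address or None, elapsed seconds).

    A slow or failing lookup for a machine that should be well known
    would point towards DNS rather than the network path itself.
    """
    start = time.monotonic()
    try:
        address = resolver(name)
    except OSError:
        address = None
    return address, time.monotonic() - start


if __name__ == "__main__":
    # Hypothetical machine table for illustration only.
    machines = {"tma-mcs": "192.0.2.10", "tma-pxi-axes": "192.0.2.11"}
    print("\n".join(hosts_lines(machines)))
    for name in machines:
        print(name, timed_lookup(name))
```

Running the lookup periodically and logging the elapsed time would give a crude version of the DNS monitoring mentioned above, though a proper audit is still something for IT.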
Friday we worked on the CCW and Russell got the Rotator following working to some extent. It was faulting though with 3 degree moves. There is some tuning of parameters required. Tekniker continued working on the azimuth and Phase were engaged in remote diagnosis. They found an inverted parameter setting and fixed that, but there was still an issue with the Hall effect sensor borking the Ethercat communications. The suspicion is that the sensor - magnet distance is incorrect (too small). They are requesting that the elevation sensor distances be measured for comparison. Italy is on holiday Monday/Tuesday so they will not be able to help again until Wednesday. If this is the cause then the fix will be to make azimuth match. Whilst they were still diagnosing, we had a computer room power issue again, so they lost contact. IT were about to leave and did not want to reboot everything once power came back…..but Jacques stepped in and the main services were restarted. He also tried to get permission/access for me to reboot things on Monday if it becomes necessary. IT did not get me a display adaptor, so I hunted around and “borrowed” one from the control room displays. This allowed me to boot the PXI and get into the BIOS. Of course this model does not have the onboard diagnostics! Next week I will load some performance test software on it and see what I can find (I suspect the hardware is fine though). Russell, Tekniker, Brian and I plan to go up Monday as usual and keep moving forward with other checkout. We should be able to exercise everything but the azimuth motion. Te-Wei sent me the generated C code for the Trajectory algorithm, so I will have a go at generating the .so library next week. The DIMM arrived, but there seems to be some question as to whether it will be used for the pointing tests (it has to be used for acceptance testing first). The control room ESTOP is now in the control room. It is a cool wireless (not wifi) device; it needs to be configured in the interlock hardware, which will likely be done Wednesday when the rest of the folks return to the summit. IT is requesting the TMA computer MAC addresses, and a “diagram”. The addresses are easy; I will engage with Alberto on Monday to try and get an updated diagram as any that we have are of course out of date. I am attaching a first pass at a spares list…
Kevin Reil wants ALL the telemetry to be flowing into the EFD for the soak/pointing tests. Russell’s first response was “that is impossible” due to XML release policy, but policies can be adjusted! I will work on him to try and get as much as possible into a TMA only XML update (we already had a full XML spec during the FAT). Note that all the telemetry is stored in the MCS computer 2 day store, albeit in some NI (TDMS) format. I previously wrote a quick parser for that so in principle it could be hacked to populate the EFD on a daily basis if it comes to that.