DM-12847 (Data Management)

Upgrade to Interim Kubernetes

    Details

    • Epic Name:
      Upgrade to Interim Kubernetes
    • Story Points:
      20
    • WBS:
      02C.07.09
    • Team:
      Data Facility
    • Cycle:
      Spring 2018

      Description

      Upgrade interim Kubernetes service for development use. Per conversations with the SLAC and SQuaRE teams, this will involve investigating Kubernetes versions with desired capabilities and installing requested system software and services on PDAC nodes. Ongoing administration is not covered in this epic.

            Activity

            plutchak Joel Plutchak (Inactive) created issue -
            plutchak Joel Plutchak (Inactive) made changes -
            Field Original Value New Value
            Cycle Spring 2018 [ 10806 ]
            Priority Undefined [ 10000 ] Major [ 3 ]
            plutchak Joel Plutchak (Inactive) made changes -
            WBS 02C.07.07
            Labels Environment_and_Tools
            plutchak Joel Plutchak (Inactive) made changes -
            WBS 02C.07.07 02C.07.09
            bglick Bill Glick [X] (Inactive) made changes -
            Watchers Jacob Rundall, Joel Plutchak [ Jacob Rundall, Joel Plutchak ] Andrew Loftus, Bill Glick, Jacob Rundall, Joel Plutchak, Steve Pietrowicz [ Andrew Loftus, Bill Glick, Jacob Rundall, Joel Plutchak, Steve Pietrowicz ]
            kaylynr Kaylyn Rogers [X] (Inactive) made changes -
            Story Points 20
            plutchak Joel Plutchak (Inactive) made changes -
            Assignee Steve Pietrowicz [ spietrowicz ]
            plutchak Joel Plutchak (Inactive) made changes -
            Description Upgrade interim Kubernetes service for development use Upgrade interim Kubernetes service for development use. Per conversations with the SLAC and SQuaRE teams, this will involve investigating Kubernetes versions with desired capabilities and installing requested system software and services on PDAC nodes. Ongoing administration is not covered in this epic.
            gpdf Gregory Dubois-Felsmann made changes -
            Labels Environment_and_Tools Environment_and_Tools pdac
            gpdf Gregory Dubois-Felsmann made changes -
            Remote Link This issue links to "Page (Confluence)" [ 16103 ]
            aloftus Andrew Loftus [X] (Inactive) added a comment -

            Initial kubernetes puppet modules are ready to roll out on qserv nodes. Awaiting word from Fritz Mueller to "go ahead".

            gpdf Gregory Dubois-Felsmann added a comment -

            Can you paste in here the Kubernetes and Docker software versions that are configured in those modules?

            aloftus Andrew Loftus [X] (Inactive) added a comment -

            Kubernetes version is 1.8.5-0

            Docker is currently not version-locked, so it will be updated to the latest release at every monthly maintenance.

            Please let me know what version you want, if you need Docker to be locked to a specific version.

            gpdf Gregory Dubois-Felsmann added a comment -

            Kubernetes is tested against specific versions of Docker.

            "Continuous integration builds use Docker versions 1.11.2, 1.12.6, 1.13.1, and 17.03.2. These versions were validated on Kubernetes 1.8." - from https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.8.md#external-dependencies

            So it would be interesting to know whether the Docker version you are installing is one of those.
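
            A minimal sketch of that check, assuming the installed Docker version is available as a string (for example from the package manager). The function name and the handling of package release suffixes are illustrative assumptions; only the list of versions comes from the changelog quoted above.

```python
# Docker versions listed as validated in the Kubernetes 1.8 changelog quoted above.
VALIDATED_DOCKER_FOR_K8S_1_8 = ("1.11.2", "1.12.6", "1.13.1", "17.03.2")


def is_validated_docker(installed: str) -> bool:
    """Return True if the installed Docker version is one of the validated releases.

    Assumption: distribution packages may append a suffix after the first "-"
    (for example "1.12.6-68.el7" or "17.03.2-ce"), so only the part before it
    is compared.
    """
    upstream = installed.split("-", 1)[0]
    return upstream in VALIDATED_DOCKER_FOR_K8S_1_8
```

            A version string carrying an RPM epoch prefix would need that prefix stripped first; the sketch does not handle that case.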

            aloftus Andrew Loftus [X] (Inactive) added a comment -

            I am working on getting yum-versionlock in place (in the puppet infrastructure) so that Docker and Kubernetes versions will be guaranteed to remain at the specified version until explicitly changed.

            Can you please post the current versions that you prefer for Kubernetes and for Docker.

            gpdf Gregory Dubois-Felsmann added a comment -

            I've asked Fritz Mueller to weigh in on this, as I suspect his application is more sensitive to this than ours.

            fritzm Fritz Mueller added a comment - - edited

            Yes, we'll want to have this version locked as soon as we have a "matched set" that seems to be working for us. My experience with Docker so far is that it evolves so rapidly that there can be a lot of tail-chasing if you let it float free (we'll want to be keeping up, but taking updates with intention).

            I know that Steve Pietrowicz was doing quite a bit of work in December to identify a constellation of versions that would meet our needs (thanks, Steve!) He should probably comment here on the results of his testing, and we could start with that?

            spietrowicz Steve Pietrowicz added a comment -

            I've been testing a couple of things. First, the install of Kubernetes (k8s) itself. I have a set of scripts set up so I can deploy on a new node pretty quickly (meaning, it does the yum updates, installs the necessary repos, and installs the Kubernetes packages). The second part of this is to see how the install works in various environments. The two I've been testing against are OpenStack (on Nebula) and a set of VMs we have hosted outside of that cluster. The OpenStack installs have been pretty straightforward. I was able to test various versions of k8s (and Docker) with various overlays, and was able to get them to work. The VMs were a bit more difficult, because of a few different factors. I was able to resolve those today.

            The main issues there were 1) how k8s deals with pre-existing firewall rules and 2) what it expects from DNS configurations. I won't go into the specifics here (I'm writing this up in greater detail tomorrow), but it wasn't very obvious from the k8s logs what was happening. Since this also involved using Weave (because of the multicast requirements), that also needed to be tested. I was able to get multicast send/receive to work from all six nodes I tested against, making sure that the tests executed from pods on those specific nodes. After this all worked, we wiped everything, put the correct firewall rules in place, and I re-installed everything and re-ran all the multicast tests. This confirmed that what we were testing all worked.

            The VM work was done with Docker 1.12.6, Kubernetes 1.9.2, and Weave 2.1.3. I had been planning on doing this same test with Kubernetes 1.8.5-0 starting tomorrow, on both OpenStack and on the VMs I have been using. At that point, you can decide what you'd like to use.

            As Fritz said, the changes for everything happen pretty often, so it would be good to see if we can settle on something and stick with it a while, unless a real blocker comes up. The main thing I ran across from the system standpoint is that even with log messages from k8s, it can be pretty opaque what is really going on. Many of the errors I was seeing have multiple causes and resolutions, and in the end, none of the solutions I found on the web matched what was happening to us.

            We believe this testing will help make the install for Fritz a bit easier (especially now we have a better handle on the firewall and DNS stuff), and will help us longer term with the upcoming k8s install we're doing on the hardware that's being set up.

            A bit more detailed version of what was going on during that debug is forthcoming.
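
            For readers unfamiliar with this kind of test, a multicast send/receive check between pods can be sketched roughly as follows. This is not the tooling used in the tests described above; the group address, port, and function names are hypothetical, and the sketch assumes IPv4 UDP multicast over the overlay network.

```python
import ipaddress
import socket

GROUP = "239.1.1.1"  # hypothetical multicast group, chosen for illustration
PORT = 5007          # hypothetical UDP port


def membership_request(group: str) -> bytes:
    """Build the 8-byte ip_mreq value: group address followed by 0.0.0.0 (any interface)."""
    if not ipaddress.ip_address(group).is_multicast:
        raise ValueError(f"{group} is not a multicast address")
    return socket.inet_aton(group) + socket.inet_aton("0.0.0.0")


def send_probe(payload: bytes, group: str = GROUP, port: int = PORT) -> None:
    """Send one UDP datagram to the multicast group (run in the sending pod)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # TTL 1 keeps the datagram on the local network segment.
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
        sock.sendto(payload, (group, port))


def receive_probe(group: str = GROUP, port: int = PORT, timeout: float = 5.0) -> bytes:
    """Join the group and return the first datagram received (run in the receiving pod).

    Raises socket.timeout if nothing arrives within `timeout` seconds.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", port))
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                        membership_request(group))
        sock.settimeout(timeout)
        data, _sender = sock.recvfrom(1024)
        return data
```

            Running `receive_probe()` in a pod on one node and `send_probe(b"ping")` in a pod on another would exercise the path; a timeout on the receiver suggests the overlay or a firewall rule is dropping the traffic.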

            spietrowicz Steve Pietrowicz added a comment -

            I tested against 1.8.5, on both OpenStack and on the VMs I've been using.  I ran tests using the Weave network overlay and multicast tools.  Everything worked fine.

            spietrowicz Steve Pietrowicz made changes -
            Status To Do [ 10001 ] In Progress [ 3 ]
            plutchak Joel Plutchak (Inactive) made changes -
            Epic Child DM-13421 [ 38414 ]
            plutchak Joel Plutchak (Inactive) made changes -
            Epic Child DM-13422 [ 38415 ]
            plutchak Joel Plutchak (Inactive) made changes -
            Epic Child DM-13424 [ 38417 ]
            plutchak Joel Plutchak (Inactive) made changes -
            Epic Child DM-13425 [ 38418 ]
            plutchak Joel Plutchak (Inactive) made changes -
            Epic Child DM-13426 [ 38419 ]
            plutchak Joel Plutchak (Inactive) made changes -
            Link This issue relates to DM-12836 [ DM-12836 ]
            spietrowicz Steve Pietrowicz added a comment -

            Documentation on what was done is available here: https://dmtn-071.lsst.io

            The main thing to watch out for will be firewall issues that might prevent a service from reaching its destination.

            Fritz Mueller, we can start with whichever version of Kubernetes/Docker you are comfortable with. I've tested Docker 1.12.6 with both Kubernetes 1.8.5 and 1.9.2.

            fritzm Fritz Mueller added a comment -

            Awesome, THANK YOU Steve Pietrowicz! Let's go ahead and plan to get started with 1.9.2 as soon as the WISE validation wraps up on the PDAC (which I understand is real soon now).

            spietrowicz Steve Pietrowicz added a comment -

            ok!  I sent a note to Bill Glick [X] and Andrew Loftus [X] to let them know.

            fritzm Fritz Mueller added a comment -

            Gregory Dubois-Felsmann has informed me that the WISE validation will be wrapping up by close-of-business on Wednesday, 2/14. So a rollout at any opportunity soon after that would be great – thanks very much!

            aloftus Andrew Loftus [X] (Inactive) added a comment -

            Hi Fritz Mueller, I heard there was an inquiry about when this change will be applied. Please respond in this ticket when you are ready to have the changes applied. This is a routine puppet change that can be rolled out as soon as you are ready.

            aloftus Andrew Loftus [X] (Inactive) made changes -
            Assignee Steve Pietrowicz [ spietrowicz ] Andrew Loftus [ aloftus ]
            fritzm Fritz Mueller added a comment -

            Hi folks – please proceed any time on or after Thur, 2/15. Thanks much!

            aloftus Andrew Loftus [X] (Inactive) added a comment -

            Thanks, I will comment here when it's done (hopefully Thu, but the planned maintenance takes precedence).

            aloftus Andrew Loftus [X] (Inactive) added a comment -

            Puppet changes are rolled out. Kubernetes version is enforced at version 1.9.3-0

            spietrowicz Steve Pietrowicz added a comment -

            I'll run some tests on the elast nodes I've been using to see if 1.9.3 works properly. Some versions in the past (notably, 1.7.1) were broken. I'll post an update here when I'm done testing.

            spietrowicz Steve Pietrowicz added a comment -

            Kubernetes 1.9.3 installed here late this afternoon.  That all works. I ran the multicast test with the Weave 1.7 overlay and that worked fine too.   Installed the dashboard, and that responded as well.  I say "responded" because I did that via wget, and not the browser, since I'm on a VPN here, and wasn't able to test it fully.  The VPN is something to keep in mind for the iptables rules for those systems.  I don't know how they're set since I don't have access to them, but we don't want that exposed on the open internet.  Doing the dashboard via a VPN would be a better choice, I think.

            spietrowicz Steve Pietrowicz added a comment -

            I'm seeing an issue here with 1.9.3 as both the control plane and client, and with a 1.9.2 client against a 1.9.3 control plane. I'm trying to track this down, and I'm retesting a 1.9.2 control plane with a 1.9.2 client to be sure. I'll update here.

            spietrowicz Steve Pietrowicz added a comment -

            This was an issue where there was a race condition between when the firewall rules were set by kubernetes and when additional rules were put into place by puppet.  An additional rule for port 6443 had to be introduced.   This was tested under all configs listed above and works fine.
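
            A quick way to confirm that such a rule is in effect is to test whether the port accepts TCP connections from another node. The sketch below is illustrative only: the function name is made up, and it assumes the port to check is 6443 as mentioned above.

```python
import socket


def api_port_reachable(host: str, port: int = 6443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout.

    A refused or timed-out connection (for example, one dropped by a firewall
    rule) is reported as False rather than raised.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

            Because the problem was a race between two sources of firewall rules, a single successful check is weak evidence; repeating it after a Puppet run and after a Kubernetes restart would cover both orderings.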

            fritzm Fritz Mueller added a comment - - edited

            Hello Andrew Loftus [X], we will need to add the line:

            Environment="KUBELET_EXTRA_ARGS=--fail-swap-on=false"

            ...to the top section of /etc/systemd/system/kubelet.service.d/10-kubeadm.conf on the pdac nodes, if this file is under puppet control?

            Additionally, we will need the kubelet systemd service enabled and started on all the nodes if systemd services are under puppet control?
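
            As a rough illustration of the requested edit, the sketch below inserts the line directly under the `[Service]` header of the drop-in's text and leaves the text unchanged if the line is already present. It assumes the file contains a `[Service]` section; the function name is hypothetical, and in practice the change would be made through Puppet rather than a script like this.

```python
EXTRA_ARGS_LINE = 'Environment="KUBELET_EXTRA_ARGS=--fail-swap-on=false"'


def add_extra_args(dropin_text: str, line: str = EXTRA_ARGS_LINE) -> str:
    """Return the drop-in text with `line` placed right after the [Service] header.

    The operation is idempotent: if the exact line already exists, the text
    is returned unchanged.
    """
    lines = dropin_text.splitlines()
    if line in lines:
        return dropin_text
    for index, existing in enumerate(lines):
        if existing.strip() == "[Service]":
            lines.insert(index + 1, line)
            return "\n".join(lines) + "\n"
    raise ValueError("no [Service] section found in drop-in text")
```

            After a drop-in changes, systemd needs `systemctl daemon-reload` before the kubelet is restarted, and `systemctl enable --now kubelet` both enables and starts the service. The sketch does not merge with a `KUBELET_EXTRA_ARGS` line that already carries different flags.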

            plutchak Joel Plutchak (Inactive) made changes -
            Epic Child DM-13683 [ 39062 ]
            plutchak Joel Plutchak (Inactive) made changes -
            Labels Environment_and_Tools pdac Environment_and_Tools FY18a pdac
            plutchak Joel Plutchak (Inactive) added a comment -

            Kubernetes installation has been up and running. Initial configuration changes in place and stable.

            plutchak Joel Plutchak (Inactive) made changes -
            Resolution Done [ 10000 ]
            Status In Progress [ 3 ] Done [ 10002 ]

              People

              Assignee:
              aloftus Andrew Loftus [X] (Inactive)
              Reporter:
              plutchak Joel Plutchak (Inactive)
              Watchers:
              Andrew Loftus [X] (Inactive), Bill Glick [X] (Inactive), Fritz Mueller, Gregory Dubois-Felsmann, Jacob Rundall, Joel Plutchak (Inactive), Steve Pietrowicz, Xiuqin Wu [X] (Inactive)