Uploaded image for project: 'Request For Comments'
  1. Request For Comments
  2. RFC-905

Miscellaneous improvements to the Chile IT puppet workflow

    XMLWordPrintable

    Details

    • Type: RFC
    • Status: Adopted
    • Resolution: Unresolved
    • Component/s: DM
    • Labels:

      Description

      The scope of this RFC is changes the Puppet workflow for Chile IT, which
      encompasses the Summit, Base, and Tucson Test Stand (TTS). The Chilean IT team
      is not the exclusive user of puppet within the project. Any use of puppet
      outside of Chile IT is explicitly out of scope.

      Proposed changes:

      1) https://ittn-005.lsst.io/ is designated as the canonical document for puppet
      workflow. ittn-005 is currently minimal and should be updated after this RFC.

      Discussion:

      The current workflow is currently recorded as a mixture of google documents, a
      large number of dated / inaccurate confluence pages, or is by word of mouth.

      Examples of outdated documentation:

      2) The https://github.com/lsst-it/lsst-itconf/ repo is renamed to lsst-it/lsst-control.

      Discussion:

      In puppet speak this repo is the "control repo". In conversation, it sometimes
      referred to as "itconf" and sometimes referred to as "the control repo". To
      remove ambiguity, we propose using "control" in the repo's name.

      3) The default branch of the control repo is renamed from "master" to "production".

      Discussion:

      puppetserver effectively requires that a puppet environment named "production"
      is defined. As tooling is used to generate puppet environments based on branch
      names, this means that the control repo must contain a branch named
      "production". Although, the "production" environment is not currently used by
      any node, it must be periodically updated to be in sync with the current
      "master" branch (for extremely hand wavy reasons beyond the scope of this RFC).
      Instead of continuing to manually sync the branch or automating the process, it
      seems easier to simply use this branch.

      4) Retire the "dev_production", "ls_production", and "tu_production"
      environment names and replace them with the "production" environment.

      Discussion:

      Per site environment name of "{dev,cp,ls,tu}_production" were introduced in
      late 2019 to decouple the code review/merging process from the time which said
      changes were applied. Part of the motivation was to provide for a human to be
      able to pay attention to when changes for a particular site may be occurring
      (to watch for issues). For example, to allow code changes to be merged on a
      Friday afternoon without fear of unforeseen consequences causing breakage.

      These concerns were more relevant in 2019 because of lack of automated tested.
      It has proven to be rare for merged changes to be problematic. There are
      occasions upon which one of the site production branches have fallen weeks
      behind master and then springs forward all at once to someones surprise. This
      has never caused a problem as the changes have usually already been live on
      another production branch... which is a demonstration that this isn't a
      particularly valuable practice. In recent years, the practice has been to
      either update all production branches immediately after a PR is merged to
      master or to roll {dev,ls,tu}_production forward immediately and then after an
      hour or two to roll cp_production forward. This seems to be wasted effort and
      an unnecessarily the workflow.

      We propose using the "production" environment as the default environment for
      dev,ls,tu and continuing to use "cp_production" at the summit for the time
      being.

      5) The github branch protection for IT-* branches in the control repo is
      changed to require linear history. Pull-requests to the master branch are
      already required to be up to date.

      Discussion:

      The policy for this repo has long been that changes to the master branch are
      fast-forward only with the exception of a merge commit at the point at which a
      branch is merged (git merge --no-ff). Code review has occasionally missed merge
      commits lurking within a PR. The presence of errant merge commits complicates
      bisecting the history. Using CI to catch merge commits is a bit tricky. This
      modification is intended to interact well with the upcoming github merge queue
      feature.

      6) Pull-requests in the control repo are required to be labeled as either "bug" or "enhancement" prior to merge.

      Discussion:

      It is currently difficult to capture how often new features are being made
      verses bugs are being fixed. PR labels are a common means of automatically
      generating a change log. Such as with
      https://github.com/github-changelog-generator which is already being used by
      our stand-alone puppet module repos. A GHA already exists which is able to
      check if a PR has a specified list of labels.

      7) The subject of commits (excluding merge commits) should be prefixed with "(facility)".

      Discussion:

      The control repo contains a fair amount of code which is only loosely coupled.
      Looking through the history for the entire repo, it is difficult to determine
      what functionality a commit message of "fix error string whitespace" is
      relevant to. Whereas "(rke role) fix error string whitespace" is significantly
      more informative.

      Examples of good facility prefixes:

      • (Puppetfile)
      • (yagan cluster)
      • (rke role)
      • (profile::foo:bar)
      • (spec)
      • (gha)
      • (site/dev/role/hypervisor)

        Attachments

          Issue Links

            Activity

            Hide
            tjohnson Tony Johnson added a comment -

            Josh – a question. At the top of the RFC you say "Any use of puppet outside of Chile IT is explicitly out of scope." Does that mean the use of puppet to install CCS software on machines at TTS/BTS/summit is out of scope for this RFC, or is that something which we should discuss here?

            Show
            tjohnson Tony Johnson added a comment - Josh – a question. At the top of the RFC you say "Any use of puppet outside of Chile IT is explicitly out of scope." Does that mean the use of puppet to install CCS software on machines at TTS/BTS/summit is out of scope for this RFC, or is that something which we should discuss here?
            Hide
            jhoblitt Joshua Hoblitt added a comment -

            Tony Johnson The summit, BTS, and TTS are responsibilities of Chile IT and in scope of this RFC. I am aware of other uses of puppet within the project, which is why I added the statement on scope.

            Show
            jhoblitt Joshua Hoblitt added a comment - Tony Johnson The summit, BTS, and TTS are responsibilities of Chile IT and in scope of this RFC. I am aware of other uses of puppet within the project, which is why I added the statement on scope.
            Hide
            tjohnson Tony Johnson added a comment -

            Hi Josh, apologies for the slow response, but it took us a while to digest your discussion above. Most CCS users of puppet are not very experienced with  DM-style workflow and might need assistance at first to comply with new rules proposed here, and hopefully that can be done in a constructive way that doesn't result in a slow-down of work.

            I am not entirely sure how the changes you propose will interact with our typical workflow, so I have outlined it here to get your input.
            Often a new CCS release coincides with a new XML release (but not always) and takes several weeks to migrate from TTS->BTS->Summit. The way we typically do this now is to:

            a) Prepare a CCS release candidate for auxtel and comcam (and in future for main camera) (a.k.a. a SNAPSHOT in our terminology)
            b) Create an IT ticket to track the changes, along with corresponding branches in lsst-it/lsst-itconf and lsst-it/lsst-puppet-hiera-private.
            c) Use foreman.tu.lsst.org to change the environment to our branch on comcam and auxtel nodes at TTS and use puppet to install the new release. Often we find problems at this stage and need to make updates to the release candidate, and sometimes to puppet modules, and then need to get puppet to update the installation (which sometimes we need to force by uninstalling the existing CCS version and having puppet reinstall it).
            d) Change the CCS prod links on TTS machines and restart CCS processes to fully deploy the new version
            e) Typically there are then several days during which Michael Reuter does his TTS testing, and sometimes he (or others) find problems which we patch in the release candidate (similar process to c)
            f) Once testing is complete we typically change the release candidate to an official release and repeat step c) on TTS
            g) Wait for go ahead to deploy new version on the summit, which normally happens once the CSCs at the summit have been shutdown. Deployment of the new versions is typically time critical at this point, especially if done during a run.
            h) Use foremap.cp.lsst.org to update the summit nodes to the new branch
            i) Use puppet to deploy the new release
            j) Move the prod links on the summit nodes, and restart the CCS processes. Where this involves HCUs controlling hardware we take extra care to make sure the restarts are done by appropriate hardware experts, and that we have people available on the summit in case any manual intervention is needed.
            k) Ensure all CCS processes are working successfully, and then wait for the go-ahead to restart CSCs (which for us means the appropriate ocs-bridge)
            l) Create a pull request to merge the branches into production. Our typical "CCS" way of doing this is to pull the main branch into our working branch and then create a pull request. This typically generates a clean pull request and allows us to keep track of the history of who did what when.
            m) Use foreman at TTS and summit to put nodes back to the production environment/branch

            Since this process takes several weeks, we typically have branches which last several weeks. Sometimes we forget to do steps l) and m), or get our pull request rejected for reasons that are unclear to us, which means the branches last even longer.

            Can you outline what about the process above you would like us to change to be compliant with this RFC?

            Show
            tjohnson Tony Johnson added a comment - Hi Josh, apologies for the slow response, but it took us a while to digest your discussion above. Most CCS users of puppet are not very experienced with  DM-style workflow and might need assistance at first to comply with new rules proposed here, and hopefully that can be done in a constructive way that doesn't result in a slow-down of work. I am not entirely sure how the changes you propose will interact with our typical workflow, so I have outlined it here to get your input. Often a new CCS release coincides with a new XML release (but not always) and takes several weeks to migrate from TTS->BTS->Summit. The way we typically do this now is to: a) Prepare a CCS release candidate for auxtel and comcam (and in future for main camera) (a.k.a. a SNAPSHOT in our terminology) b) Create an IT ticket to track the changes, along with corresponding branches in lsst-it/lsst-itconf and lsst-it/lsst-puppet-hiera-private. c) Use foreman.tu.lsst.org to change the environment to our branch on comcam and auxtel nodes at TTS and use puppet to install the new release. Often we find problems at this stage and need to make updates to the release candidate, and sometimes to puppet modules, and then need to get puppet to update the installation (which sometimes we need to force by uninstalling the existing CCS version and having puppet reinstall it). d) Change the CCS prod links on TTS machines and restart CCS processes to fully deploy the new version e) Typically there are then several days during which Michael Reuter does his TTS testing, and sometimes he (or others) find problems which we patch in the release candidate (similar process to c) f) Once testing is complete we typically change the release candidate to an official release and repeat step c) on TTS g) Wait for go ahead to deploy new version on the summit, which normally happens once the CSCs at the summit have been shutdown. Deployment of the new versions is typically time critical at this point, especially if done during a run. h) Use foremap.cp.lsst.org to update the summit nodes to the new branch i) Use puppet to deploy the new release j) Move the prod links on the summit nodes, and restart the CCS processes. Where this involves HCUs controlling hardware we take extra care to make sure the restarts are done by appropriate hardware experts, and that we have people available on the summit in case any manual intervention is needed. k) Ensure all CCS processes are working successfully, and then wait for the go-ahead to restart CSCs (which for us means the appropriate ocs-bridge) l) Create a pull request to merge the branches into production. Our typical "CCS" way of doing this is to pull the main branch into our working branch and then create a pull request. This typically generates a clean pull request and allows us to keep track of the history of who did what when. m) Use foreman at TTS and summit to put nodes back to the production environment/branch Since this process takes several weeks, we typically have branches which last several weeks. Sometimes we forget to do steps l) and m), or get our pull request rejected for reasons that are unclear to us, which means the branches last even longer. Can you outline what about the process above you would like us to change to be compliant with this RFC?
            Hide
            jhoblitt Joshua Hoblitt added a comment - - edited

            Tony Johnson This RFC won't change the nodes or edges in the graph of the workflow you outlined. It would change some of the low level details as a repo would be renamed, the branch to merge into changes, PRs would need a label applied, etc. but it is all intentionally minor. There are changes I've been considering that would impact the CCS workflow but I don't want to start that discussion until after this RFC has been implemented.

            Show
            jhoblitt Joshua Hoblitt added a comment - - edited Tony Johnson This RFC won't change the nodes or edges in the graph of the workflow you outlined. It would change some of the low level details as a repo would be renamed, the branch to merge into changes, PRs would need a label applied, etc. but it is all intentionally minor. There are changes I've been considering that would impact the CCS workflow but I don't want to start that discussion until after this RFC has been implemented.
            Hide
            jhoblitt Joshua Hoblitt added a comment -

            I'm going to mark the RFC as adopted and start opening implementation issues.

            Show
            jhoblitt Joshua Hoblitt added a comment - I'm going to mark the RFC as adopted and start opening implementation issues.

              People

              Assignee:
              jhoblitt Joshua Hoblitt
              Reporter:
              jhoblitt Joshua Hoblitt
              Watchers:
              Carlos Barria, Cristián Silva, Felipe Menanteau, Glenn Morris, Heinrich Reinking, John Gregg Thayer, Joshua Hoblitt, Kian-Tat Lim, Max Turri, Steve Pietrowicz, Tony Johnson
              Votes:
              0 Vote for this issue
              Watchers:
              11 Start watching this issue

                Dates

                Created:
                Updated:
                Planned End:

                  Jenkins

                  No builds found.