  1. Data Management
  2. DM-2623

Design Basic Watcher

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: Qserv
    • Labels:
      None

      Description

      Design the watcher, including its interactions with other components (MySQL, CSS, etc.). In the near term, the watcher will handle deleting tables and databases.

            Activity

            salnikov Andy Salnikov added a comment -

            Jacek Becla, I have reassigned review to you, if you have a minute to do it today.

            jbecla Jacek Becla added a comment -

            Andy, looks good. My comments:

            • standard question: who is watching the watcher. Seriously, what happens when the watcher dies? Who starts it?
            • No matter how careful we are, sometimes the watcher will just die unexpectedly (say, due to a power glitch). So we must have code that resumes the work correctly even if the watcher was in the middle of something important, like dropping tables. And if we have such code, then saving state when stopping and resuming from saved state becomes purely a convenience, and I think I would not even bother doing it.
            • If we have robust code for resuming work, then reconfiguring could be done through killing the watcher. So I am proposing we don't worry about reconfiguring without restarting.
            • In addition to the start/stop commands that the watcher accepts, I'd add "dump your state", just for debugging. And maybe "force full scan"; this might be needed for debugging, or maybe when ZooKeeper restarts.
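
The resume-after-crash idea in the comments above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the watcher's actual design: a plain dict stands in for CSS, `FakeBackend` stands in for MySQL, and the status strings `READY` and `PENDING_DROP`, the `crash_at` test hook, and the method names `full_scan` and `dump_state` are all hypothetical. The point is that when every step is idempotent and the pending work is recorded outside the process, a restarted watcher can finish the job with a plain full scan and needs no saved state of its own.

```python
class FakeBackend:
    """Hypothetical stand-in for the MySQL side: just a set of table names."""

    def __init__(self, tables):
        self.tables = set(tables)

    def drop_table_if_exists(self, name):
        # Idempotent: dropping a table that is already gone is not an error.
        self.tables.discard(name)


class Watcher:
    """Sketch of a watcher whose only durable state lives in the CSS stand-in."""

    def __init__(self, css, backend):
        self.css = css          # dict: table name -> status string (assumed layout)
        self.backend = backend

    def full_scan(self, crash_at=None):
        """Handle every table marked PENDING_DROP; return how many were completed.

        crash_at is a test hook: raise right after dropping that table but
        before its CSS entry is removed, to imitate dying mid-operation.
        """
        completed = 0
        for name in sorted(self.css):   # sorted() copies the keys, so deleting is safe
            if self.css[name] != "PENDING_DROP":
                continue
            self.backend.drop_table_if_exists(name)
            if name == crash_at:
                raise RuntimeError("simulated crash while dropping " + name)
            del self.css[name]          # last step: forget the request
            completed += 1
        return completed

    def dump_state(self):
        """Debugging aid: report pending work, derived entirely from CSS."""
        pending = sorted(n for n, s in self.css.items() if s == "PENDING_DROP")
        return {"pending": pending}


css = {"a": "READY", "b": "PENDING_DROP", "c": "PENDING_DROP"}
backend = FakeBackend(["a", "b", "c"])

try:
    Watcher(css, backend).full_scan(crash_at="b")
except RuntimeError:
    pass                                # the first watcher "died" mid-drop

restarted = Watcher(css, backend)       # fresh process, no saved state
state_after_crash = restarted.dump_state()
completed_on_restart = restarted.full_scan()
```

In this sketch the CSS entry is removed only after the drop succeeds, so a crash between the two steps leaves the request visible and the next scan simply repeats a harmless drop.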
            salnikov Andy Salnikov added a comment -

            Thanks Jacek.

            Regarding who is watching the watcher (or many levels of watchers): in the end there must be a human being who is notified about disasters. I presume there will be a monitoring system in place, watching both hardware and services, which can take care of this. Of course, from our side we need to code things so that they do not die frequently and unexpectedly (or are restarted if they die or when the machine is restarted).

            Most of the watcher state will be in CSS, so we do not really need to ask the watcher; we can just look at CSS. Forcing periodic operation may be a good idea, I'll add this to the final page.
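
One way to picture the "forced periodic operation" mentioned above is a loop that scans either when it is notified of a change or when a timeout expires, whichever comes first. The sketch below is an assumption-laden illustration, not the design from the wiki page: `watch_loop`, its parameters, and the `max_rounds` cap (present only so the example terminates) are hypothetical names, and the notification source is modelled with a standard `threading.Event`.

```python
import threading


def watch_loop(scan, wakeup, stop, period_s, max_rounds=None):
    """Call scan() on every notification and at least once per period_s seconds.

    wakeup: threading.Event set by whoever notices a change (assumed source).
    stop:   threading.Event used to ask the loop to exit.
    Returns the list of reasons for each scan, for inspection while debugging.
    """
    reasons = []
    while not stop.is_set():
        # wait() returns True if the event was set, False if the timeout expired.
        notified = wakeup.wait(timeout=period_s)
        wakeup.clear()
        if stop.is_set():
            break
        scan()
        reasons.append("notified" if notified else "periodic")
        if max_rounds is not None and len(reasons) >= max_rounds:
            break
    return reasons


scans = []
wakeup, stop = threading.Event(), threading.Event()
wakeup.set()                            # pretend one change notification is pending
reasons = watch_loop(lambda: scans.append(1), wakeup, stop,
                     period_s=0.01, max_rounds=2)
```

To shut the loop down promptly in this sketch, a caller would set `stop` first and then `wakeup`, so the wait returns without running another scan.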

            salnikov Andy Salnikov added a comment - edited

            I copied all pieces (except for bloody poster) to https://dev.lsstcorp.org/trac/wiki/db/Qserv/WatcherDesign and made a few small edits.
            Done

            ktl Kian-Tat Lim added a comment -

            Our goal should be to avoid a requirement for human input at any time other than the addition or removal of a node from the cluster (and even that may be at least partly automated) – to the extent possible without radically increasing complexity. In our production systems, we should not in general be expecting humans to reboot machines or start or kill processes. I believe there are plenty of available examples to look at and well-tested components to use to achieve this. (A quick search turns up clusterlabs.org.) Some of this could be provided by cluster management software at the platform/infrastructure layer, rather than within Qserv, however.
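
For illustration only, the "restart it without a human" behaviour discussed above reduces to a loop like the one below. It is a toy sketch, not a recommendation to hand-roll supervision: as the comment notes, well-tested cluster management components exist for this. The function name `supervise` and its parameters are hypothetical, and the child commands are trivial Python one-liners standing in for a real watcher process.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts, backoff_s=0.0):
    """Run cmd, restarting it after a non-zero exit, up to max_restarts times.

    Returns (number_of_starts, last_exit_code).
    """
    starts = 0
    while True:
        starts += 1
        code = subprocess.call(cmd)     # blocks until the child exits
        if code == 0 or starts > max_restarts:
            return starts, code
        time.sleep(backoff_s)           # pause before restarting the child


# A child that always fails is started once plus max_restarts more times.
always_fails = [sys.executable, "-c", "import sys; sys.exit(3)"]
failed_result = supervise(always_fails, max_restarts=2)

# A child that exits cleanly is started exactly once.
clean_result = supervise([sys.executable, "-c", "pass"], max_restarts=2)
```

A real deployment would also need the supervisor itself to survive reboots, which is the part usually delegated to the platform or infrastructure layer.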


              People

              Assignee:
              salnikov Andy Salnikov
              Reporter:
              fritzm Fritz Mueller
              Reviewers:
              Jacek Becla
              Watchers:
              Andy Salnikov, Jacek Becla, Kian-Tat Lim
