  Data Management / DM-26100

Review/update orchestration harness for KPM50 tests


    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: Qserv
    • Labels:

      Description

      For KPM50 we want to check the state of the existing test harness that was used for previous tests and understand its limitations and bottlenecks.

        Attachments

          Issue Links

            Activity

            salnikov Andy Salnikov added a comment -

            Starting to think about what we need from a new implementation.

            Multiprocessing

            I think we definitely want to use multiprocessing instead of multithreading. One potential issue is that running 100 processes on a single machine may introduce bottlenecks that are different from the multithreading bottlenecks. We want either one beefy machine with many cores, or to spread the whole load across multiple machines. The good news is that those 100 processes will all be sleeping most of the time, waiting for responses from Qserv, so we do not need 100 active cores; some small fraction of that should be enough. We could probably use master02 for that, it has 28 true cores. For a multi-machine test we could use the verification cluster and scale it as necessary. Management of a multi-node test is more complicated, but it should not be too terrible, as we don't need very many clients; probably 5 or 6 should be enough. We need to configure clients to each do only a part of the load, e.g. 5 clients doing 1/5 of the LV queries each and one client doing all remaining types of queries (or something similar).
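
            A very rough sketch of the process-pool idea, assuming a plain multiprocessing.Pool; run_client and all the numbers here are invented for illustration, not existing harness code:

                # Hypothetical sketch only: a pool of client processes, each spending
                # most of its time sleeping, as the real clients would wait on Qserv.
                import multiprocessing
                import random
                import time

                def run_client(client_id, n_queries):
                    # stand-in for a loop that submits queries and waits for results
                    for _ in range(n_queries):
                        time.sleep(random.uniform(0.1, 1.0))
                    return client_id

                if __name__ == "__main__":
                    n_clients = 100
                    with multiprocessing.Pool(processes=n_clients) as pool:
                        done = pool.starmap(run_client, [(i, 10) for i in range(n_clients)])
                    print(f"{len(done)} clients finished")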

            LV queries

            We need to make sure that the objectIds in the queries correspond to existing ones, and also to use a very large set of objectIds covering a large fraction of chunks, not just a few of them. It's probably worth dumping a reasonably large set of objectIds from the database (the secondary index) and using that list with randomization.

            The same applies to region-based queries: we want randomness there too, but the regions should not fall into empty areas of the sky.
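
            A minimal sketch of the randomization idea for LV queries, assuming the objectIds have already been dumped to a text file; the file name, table name, and template are placeholders:

                # Illustrative only: sample random objectIds from a pre-dumped list
                # and substitute them into a low-volume query template.
                import random

                def load_object_ids(path="object_ids.txt"):
                    with open(path) as f:
                        return [int(line) for line in f if line.strip()]

                LV_TEMPLATE = "SELECT * FROM Object WHERE objectId IN ({ids})"

                def make_lv_query(object_ids, n=10):
                    sample = random.sample(object_ids, n)
                    return LV_TEMPLATE.format(ids=",".join(str(oid) for oid in sample))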

            Monitoring

            All useful info should be dumped to a file with exact timestamps so that we can load it into a time series database (e.g. InfluxDB) and probably correlate it with anything that happens on the Qserv side. It would be nice to have that integrated into the NCSA Grafana monitor, but we do not have the ability to edit Grafana panels there. Should we ask if we can feed our data into whatever backend they use (InfluxDB or anything else) and get a Grafana playground so we could mess around a bit?

            The interesting data are: query execution time, how many queries are running, query types, number of rows and data sizes returned, and maybe something else. Each client would dump its own set of metrics and we will need to merge them; Grafana should be able to do that easily. It would still be useful to identify each separate client to see if there are any correlations. And of course the query class should be part of the metrics too.
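
            One possible format for the per-client metrics file, using InfluxDB line protocol with explicit timestamps; the measurement, tag, and field names below are just placeholders:

                # Illustrative only: one InfluxDB line-protocol record per finished
                # query, tagged with client id and query class, nanosecond timestamp.
                import time

                def metrics_line(client_id, query_class, exec_time_sec, n_rows, n_bytes):
                    ts_ns = time.time_ns()
                    return (
                        f"query,client={client_id},class={query_class} "
                        f"exec_time={exec_time_sec},rows={n_rows}i,bytes={n_bytes}i {ts_ns}"
                    )

                # e.g. append metrics_line(3, "LV", 0.42, 128, 65536) to this client's file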

            Configuration

            The configuration for the new harness will be somewhat more complicated, so I think we should move most of it from Python code to a separate config file. YAML is probably the easiest format for what we need. Things that will go there: definitions of query classes, target rates, and per-class queries or query templates.
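
            One guess at what such a YAML config could look like, loaded with PyYAML; the keys, class names, and templates below are invented for illustration, not a final schema:

                # Hypothetical config shape, parsed with yaml.safe_load (PyYAML).
                import yaml

                CONFIG = """
                queryClasses:
                  LV:
                    targetRate: 100    # hypothetical target, e.g. concurrent queries
                    templates:
                      - "SELECT * FROM Object WHERE objectId IN ({ids})"
                  FTSObj:
                    targetRate: 3
                    templates:
                      - "SELECT COUNT(*) FROM Object WHERE y_instFlux > {flux}"
                """

                config = yaml.safe_load(CONFIG)
                for name, cls in config["queryClasses"].items():
                    print(name, cls["targetRate"], len(cls["templates"]))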

            Testing the harness

            To debug this new harness it would be better to make it testable, e.g. able to run against a sort of mock database that internally generates some responses.
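
            A tiny sketch of the mock-database idea: a fake connection with a DB-API-like cursor that sleeps and fabricates rows instead of talking to Qserv. The names are made up; a real mock would mirror whatever interface the clients actually use:

                # Illustrative only: a fake connection/cursor pair that generates
                # canned responses so the harness can be exercised without Qserv.
                import random
                import time

                class MockCursor:
                    def execute(self, query):
                        # pretend the query takes a while and produces a handful of rows
                        time.sleep(random.uniform(0.05, 0.5))
                        self._rows = [(i,) for i in range(random.randint(1, 100))]

                    def fetchall(self):
                        return self._rows

                class MockConnection:
                    def cursor(self):
                        return MockCursor()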

            salnikov Andy Salnikov added a comment -

            I think this is ready for review. I tested it with my mock database implementation on the verification cluster and it "works" OK. When we are ready to run it against Qserv we may discover that some fixes are needed; I will work on that then.

            Nate Pease, I'm assigning this to you, in exchange (preemptively) for whatever MW review you are going to give me, and as the only person on the Qserv team who is enthusiastic about Python. It's not much code, and it even comes with a README. Fritz Mueller, if you could look at the README it would be helpful too (maybe after you finish reviewing my APDB technote).

            npease Nate Pease added a comment -

            Overall it looks good. I left some comments. I will mark the review complete; I'm not sure if I'm supposed to close the PR?

            salnikov Andy Salnikov added a comment -

            Thanks for the review! I think I fixed all the issues, merged, and closed. The PR is closed automatically when it's merged, so there is no need to do it yourself (I think closing a PR manually is sort of "cancelling" the request, not "resolving" it).


              People

              Assignee:
              salnikov Andy Salnikov
              Reporter:
              salnikov Andy Salnikov
              Reviewers:
              Nate Pease
              Watchers:
              Andy Salnikov, Fritz Mueller, Nate Pease

                Dates

                Created:
                Updated:
                Resolved: