Status: In Progress
Fix Version/s: None
Sprint: DB_F20_09, DB_S21_12, DB_F21_06
Team: Data Access and Database
This is a continuation of work started in
The current ticket documents an effort to set up, configure, and test the data collection and visualization services needed for testing Qserv at NCSA. Though the initial focus of the effort is KPM50, the services will also be used for Qserv performance testing in other contexts, such as regression testing.
Here is an incomplete list of services to be installed or affected:
Each instance of Qserv installed at NCSA will be instrumented with a separate collection of services run on the primary or satellite master nodes of that instance. The mapping of the services to specific nodes is shown below:
||instance||node||service||description||
|small|lsst-qserv-master04|influxdb|(to be installed) time-series database for collecting test data|
|small|lsst-qserv-master04|grafana|(to be installed) Web-based visualization service for the test data|
|small|lsst-qserv-master03|nginx|(existing) Web proxy for user access to the visualization services|
|large|lsst-qserv-master02|influxdb|(to be installed) time-series database for collecting test data|
|large|lsst-qserv-master02|grafana|(to be installed) Web-based visualization service for the test data|
|large|lsst-qserv-master01|nginx|(existing) Web proxy for user access to the visualization services|
All services will run inside properly configured Docker containers. All aspects of configuring and managing the services shall be documented in the Git package qserv-ncsa. Configuration and automation scripts will also be placed in that package.
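As a starting point for the qserv-ncsa package, the container layout sketched above could be expressed as a docker-compose file. This is a hypothetical sketch only: the service versions, ports, and host volume paths are assumptions, not the final configuration.

```yaml
# Hypothetical docker-compose.yml sketch for one Qserv instance's monitoring
# stack. Image tags, ports, and host paths below are assumptions.
version: "3"
services:
  influxdb:
    image: influxdb:1.8
    ports:
      - "8086:8086"
    volumes:
      # Persistent data on the SSD-backed local filesystem (path assumed)
      - /qserv/monitoring/influxdb:/var/lib/influxdb
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    volumes:
      - /qserv/monitoring/grafana:/var/lib/grafana
```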
Persistent data of the influxdb service will be placed on the SSD-based local filesystems of the satellite hosts:
Missing filesystem: this filesystem does not appear to be mounted on lsst-qserv-master02. Get in touch with the NCSA team and resolve this matter directly with them. See [IHS-4334] for progress on resolving this issue.
It's expected that the main test application implemented for Qserv performance testing (see DM-26100) will run on the LSST development machines (lsst-devl*) or on the Verification Cluster (the SLURM cluster). Metrics obtained from the application will be ingested into the corresponding influxdb service via the Qserv instance's nginx server configured as a proxy.
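The proxying described above might look roughly like the following fragment for the existing nginx server. The location prefixes, upstream host, and ports are assumptions for illustration, not the actual deployed configuration.

```nginx
# Hypothetical fragment for the existing nginx proxy (large instance,
# lsst-qserv-master01). Prefixes and ports are assumptions.
location /influxdb/ {
    # Trailing slash on proxy_pass strips the /influxdb/ prefix
    proxy_pass http://lsst-qserv-master02:8086/;
    proxy_set_header Host $host;
}
location /grafana/ {
    proxy_pass http://lsst-qserv-master02:3000/;
    proxy_set_header Host $host;
}
```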
It would also be possible, where applicable, to run the test application inside the latest version of the qserv/qserv:deps-* container directly on the satellite machines of the corresponding Qserv clusters.
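Metrics ingestion into influxdb typically goes through its HTTP write endpoint using the line protocol. The sketch below shows how the test application could format records before POSTing them through the nginx proxy; the measurement, tag, and field names are hypothetical examples, not an agreed schema.

```python
import time

def _fmt_field(v):
    """Render a field value in InfluxDB line-protocol syntax."""
    if isinstance(v, bool):
        return "true" if v else "false"
    if isinstance(v, int):
        return f"{v}i"        # integers carry a trailing 'i' in line protocol
    if isinstance(v, str):
        return f'"{v}"'
    return repr(v)            # floats

def to_line(measurement, tags, fields, ts_ns=None):
    """Format one line-protocol record: measurement,tags fields timestamp."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={_fmt_field(v)}" for k, v in sorted(fields.items()))
    ts = time.time_ns() if ts_ns is None else ts_ns
    return f"{measurement},{tag_str} {field_str} {ts}"

# Example record (names are illustrative only):
line = to_line("query_latency",
               {"instance": "large", "node": "lsst-qserv-master02"},
               {"seconds": 0.42, "rows": 1024})
# The resulting lines would then be POSTed to the write endpoint exposed
# through the nginx proxy, e.g. something like
# http://<proxy-host>/influxdb/write?db=<database> (path is an assumption).
```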
It's understood that the primary focus of this effort is the infrastructure for testing Qserv. Subsequent efforts (to be documented in tickets linked to the current one) would also include:
- configuring grafana plots for visualizing test data
- studying the possibility of visualizing, on the grafana plots, additional metrics obtained from other sources, such as the existing monitoring services for the relevant machines, storage, and networks at NCSA
- obtaining Qserv-specific metrics directly from Qserv's Replication System and putting them into the influxdb database for visualization alongside the test metrics
- investigating the possibility of setting up a registry of tests, which would encapsulate various information relevant to each test: its name, configuration parameters, test conditions, the date, time and duration of the run, the Qserv version, a link to the collection of grafana plots, etc.
- a Web-based "push-button" solution for configuring, initiating, and managing tests (in this case the tests would run on the satellite machines)
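The test-registry idea above could be prototyped with a simple record type before committing to a storage backend. The fields below mirror the attributes listed in the ticket; the class and field names are assumptions for illustration.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timedelta

@dataclass
class TestRecord:
    """One entry in a hypothetical registry of Qserv performance tests."""
    name: str
    qserv_version: str
    instance: str                                    # e.g. "small" or "large"
    started_at: datetime
    duration: timedelta
    parameters: dict = field(default_factory=dict)   # test configuration
    conditions: str = ""                             # free-form test conditions
    grafana_url: str = ""                            # link to the plot collection

# A record can be serialized with asdict() for storage or display.
```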