Status: In Progress
Fix Version/s: None
Team: Data Access and Database
This ticket documents additional (but required) data management actions to be taken for the previously ingested catalog KPM50 before proceeding with the tests.
This is needed because the current implementation of the Qserv Replication/Ingest system doesn't automatically create any table-level indexes, either at super-transaction commit time or at catalog publishing time. The reasons for this are explained in the documentation for the Replication/Ingest system. The indexes are still required in order to achieve reasonable query performance and to set the ground for a fair comparison of Qserv's performance against the earlier (smaller scale) KPM tests (KPM30, KPM20 and KPM10).
At a minimum, the following indexes on the chunked tables need to be created:
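Since the Replication/Ingest system doesn't create these indexes itself, they have to be added to each per-chunk table directly. The sketch below illustrates one way to generate the required ALTER TABLE statements; all table, column, index, and database names in it are hypothetical placeholders, not the actual KPM50 schema:

```python
# Sketch (not the actual procedure): generate index-creation statements for
# the per-chunk tables of a chunked catalog table. Qserv stores each chunk
# of a chunked table in its own MySQL table named <table>_<chunkId>.
# All names below ("kpm50", "Object", "objectId") are placeholders.

def index_statements(table, column, chunk_ids, database="kpm50"):
    """Yield one ALTER TABLE statement per chunk table <table>_<chunkId>."""
    for chunk_id in chunk_ids:
        chunk_table = f"{table}_{chunk_id}"
        yield (f"ALTER TABLE `{database}`.`{chunk_table}` "
               f"ADD INDEX `idx_{column}` (`{column}`)")

# Example: index a hypothetical objectId column for three chunks.
for stmt in index_statements("Object", "objectId", [1234, 1235, 1236]):
    print(stmt)
```

In practice the statements would have to be executed on every worker holding replicas of the corresponding chunks, which is why doing this by hand is error-prone and an automated mechanism in the Replication system is preferable.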
Initially, the catalog was ingested with replication level 1. This poses the following problems:
- There is a risk of permanent partial loss of the valuable data created in the course of a substantial (multi-week) effort, due to filesystem failures (data corruption, etc.) or the loss of entire workers. Should this happen, a significant effort would be required to regenerate the lost data and re-ingest them into the catalog. This would be problematic because the current implementation of the Qserv Ingest system still doesn't support an easy way of extending or patching catalogs. That is supposed to be solved by DM-28626.
- Interruptions of the testing due to the temporary loss of single workers. Some workers may become unavailable due to intermittent network problems, or because of Qserv worker crashes, which are still possible under heavy loads (where the probability of such crashes has been found to be non-negligible). This would be especially painful should it happen during very long tests taking many hours or days to complete.
Increasing the replication level to 2 would allow losing (temporarily or permanently) a single worker without losing any valuable data or interrupting the tests (though with some penalty to the performance of the ongoing tests). The Replication system would compensate for a permanent data loss at a worker by creating extra replicas of the chunks located on that worker, and Qserv would automatically redirect ongoing queries to the backup replicas of chunks that happened to be on the lost worker.
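The availability argument above can be illustrated with a small sketch: a chunk remains queryable as long as at least one of its replicas sits on a surviving worker, so with two replicas per chunk no single worker loss takes any chunk offline. The worker and chunk identifiers below are made up for illustration:

```python
# Illustrative sketch of why replication level 2 tolerates the loss of a
# single worker. The placements and names here are invented, not the real
# KPM50 chunk-to-worker mapping.

def unavailable_chunks(placement, lost_worker):
    """Return the chunks whose every replica was on the lost worker.

    placement maps chunk id -> set of workers holding a replica of it.
    """
    return {chunk for chunk, workers in placement.items()
            if workers <= {lost_worker}}

# Replication level 1: each chunk lives on exactly one worker.
level1 = {57: {"db01"}, 58: {"db02"}, 59: {"db01"}}
# Replication level 2: each chunk lives on two distinct workers.
level2 = {57: {"db01", "db02"}, 58: {"db02", "db03"}, 59: {"db01", "db03"}}

print(unavailable_chunks(level1, "db01"))  # chunks 57 and 59 become unavailable
print(unavailable_chunks(level2, "db01"))  # no chunk becomes unavailable
```

The same reasoning generalizes: replication level N keeps every chunk available through the simultaneous loss of any N-1 workers, at the cost of N times the storage.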
Evaluate storage conditions at workers to allow further increasing the number of replicas to 3.
All operations were initiated from the master node lsst-qserv-master01 of the "large" Qserv cluster at NCSA.