Fix Version/s: None
Team:Data Access and Database
In the current implementation of the worker ingest services in Qserv, it's the responsibility of the workers to ensure a destination table exists and has a required MySQL partition before loading contributions into the table (the target MySQL partition). Therefore, each time the worker thread begins processing a table contribution request the first actions of the thread would be to:
Where, the prototype-table is already defined as the MySQL partitioned table, and it has one default partition corresponding to transactionId=0 (Note, each "super-transaction" in the Qserv Ingest system maps to a unique MySQL partition).
These statements are followed by the table data loading operation LOAD DATA INFILE ... into the new table.
An assumption behind this implementation was that each of the above-mentioned table management operations is atomic. Unfortunately, this may not be the case for the MyISAM engine of MySQL. It's been observed that when the ingest service is put under a very high load (high frequency of low size contributions into thousands of chunks while having hundreds or thousands of "super-transactions" open) the worker may fail at one of the table management operations:
Reported by Client A:
Reported by Client B:
While an overall probability of this happening is relatively low, the service implementation still needs to be reinforced.
Since it would be hard to track and eliminate the problem at a level of MySQL then the most reasonable approach would be to reinforce the code of the Ingests system. The new version of the code proposed in this ticket would put the above-mentioned table management operations into a critical section specific to a given database, table, and chunk. This will guarantee that only one thread at a time can modify the schema of a specific table while allowing such operations on different tables to be done in parallel.