Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-27695

Managing engine-independent table statistics in Qserv



    • Type: Improvement
    • Status: To Do
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: Qserv
    • Labels:



      Many complex queries in Qserv would benefit from having persistent engine-independent table-level statistics. The technique is explained for MariaDB in https://mariadb.com/kb/en/engine-independent-table-statistics/

      The statistic collection phase is a data-intensive operation that requires the MySQL/MariaDB server to read the content of each table. Altogether it could become a lengthy procedure for large catalogs. Based on early experimentations with query ANALYZE TABLE, the performance of each such statement is about 15 MB/s for the table read operations (measured by iostat) while consuming 100% CPU. For a typical 12-cores worker machine, this translates into approximately 200 MB/s of the worker-level performance. The resulting catalog-level performance would depend on the total size of the catalog and the number of worker machines in the catalog (NOTE: an assumption here is that the underlying data filesystem is based on a decent hardware RAID).

      Hence, the main goal of the proposed effort is to extend the Replication system to allow analyzing and retrieving statistics. The main interface for the new tools will be the REST API of the Master Replication Controller.

      In this effort

      The following REST services will be added into the system:

      method service description
      POST /replication/sql/table/stat Build or re-build the statistics for a select table of a catalog
      GET /replication/sql/table/stat Return info on the table statistics (if any)
      DELETE /replication/sql/table/stat Delete table statistics (if any)

      In addition, there will be a command-line tool that would implement similar functionality.

      Extra requirements/notes:

      • The operations implemented in this context would be allowed on the published catalogs only.
      • All three above-mentioned services would allow extra flexibility of choosing a scope of the operation, which could be one of:
        • A specific group of tables, the regular one or the partitioned (implying all chunks of a partitioned table, including replicas) distributed across all workers.
        • A subset of the partitioned tables (to differentiate between chunk or overlap tables)
        • A single table like Object of chunk 123 (including all replicas).
      • The GET method would also support an option for returning results at various levels of utilization, including:
        • (default) A short YES/NO status of the availability of stats for each table.
        • An extended status that included a table and index-level info.
        • And complete status including column-level stats ( This option would be available only for a single table due to a large amount of data returned for each such table).
        • (possibly) A column-level stat accros a specific group of tables.


          Issue Links


            There are no comments yet on this issue.


              gapon Igor Gaponenko
              gapon Igor Gaponenko
              Fabrice Jammes, Fritz Mueller, Igor Gaponenko, Nate Pease
              0 Vote for this issue
              4 Start watching this issue