Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-27695

Managing engine-independent table statistics in Qserv

    XMLWordPrintable

    Details

    • Type: Improvement
    • Status: To Do
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: Qserv
    • Labels:
      None

      Description

      Goals

      Many complex queries in Qserv would benefit from having persistent engine-independent table-level statistics. The technique is explained for MariaDB in https://mariadb.com/kb/en/engine-independent-table-statistics/

      The statistic collection phase is a data-intensive operation that requires the MySQL/MariaDB server to read the content of each table. Altogether it could become a lengthy procedure for large catalogs. Based on early experimentations with query ANALYZE TABLE, the performance of each such statement is about 15 MB/s for the table read operations (measured by iostat) while consuming 100% CPU. For a typical 12-cores worker machine, this translates into approximately 200 MB/s of the worker-level performance. The resulting catalog-level performance would depend on the total size of the catalog and the number of worker machines in the catalog (NOTE: an assumption here is that the underlying data filesystem is based on a decent hardware RAID).

      Hence, the main goal of the proposed effort is to extend the Replication system to allow analyzing and retrieving statistics. The main interface for the new tools will be the REST API of the Master Replication Controller.

      In this effort

      The following REST services will be added into the system:

      method service description
      POST /replication/sql/table/stat Build or re-build the statistics for a select table of a catalog
      GET /replication/sql/table/stat Return info on the table statistics (if any)
      DELETE /replication/sql/table/stat Delete table statistics (if any)

      In addition, there will be a command-line tool that would implement similar functionality.

      Extra requirements/notes:

      • The operations implemented in this context would be allowed on the published catalogs only.
      • All three above-mentioned services would allow extra flexibility of choosing a scope of the operation, which could be one of:
        • A specific group of tables, the regular one or the partitioned (implying all chunks of a partitioned table, including replicas) distributed across all workers.
        • A subset of the partitioned tables (to differentiate between chunk or overlap tables)
        • A single table like Object of chunk 123 (including all replicas).
      • The GET method would also support an option for returning results at various levels of utilization, including:
        • (default) A short YES/NO status of the availability of stats for each table.
        • An extended status that included a table and index-level info.
        • And complete status including column-level stats ( This option would be available only for a single table due to a large amount of data returned for each such table).
        • (possibly) A column-level stat accros a specific group of tables.

        Attachments

          Issue Links

            Activity

            There are no comments yet on this issue.

              People

              Assignee:
              gapon Igor Gaponenko
              Reporter:
              gapon Igor Gaponenko
              Watchers:
              Fabrice Jammes, Fritz Mueller, Igor Gaponenko, Nate Pease
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

                Dates

                Created:
                Updated: