Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-27400

Ingesting the secondary index into Qserv from local files

    XMLWordPrintable

    Details

    • Type: Improvement
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: Qserv
    • Labels:
      None

      Description

      Objectives

      An implementation of the "secondary index" builder in the current version of the Replication/Ingest system requires that intermediate files (index contributions into the index table) were placed at a folder that would be directly seen by the MySQL server of Qserv czar. Once a file is placed into that folder the Master Controller of the Replication system would issue MySQL statement to load the file in the table:

      LOAD DATA INFILE <file-path> ...
      

      As a direct consequence of that, both the MySQL server's container and the Replication Controller containers are required to share a folder where the contributions get stored. A problem is that in the Kubernetes environment, this is not always possible. It also makes the implementation of such distributed applications more complex and error-prone.

      Hence, the goal of this effort is to extend the implementation of the system to allow an option for loading contributions using MySQL statements of the following kind:

      LOAD DATA LOCAL INFILE <file-path> ...
      

      Notes

      According to the MySQL documentation of this table ingest technique https://dev.mysql.com/doc/refman/8.0/en/load-data.html one should be aware of the following caveats of the technique:

      • error handlings gets more complex:
        • incorrect data and duplicate keys may be ignored during the ingest
        • any problems are reported by MySQL at the WARNING level rather than be reported as errors
      • the overall performance of the index builder may be affected

      The last complication should be taken with care due to a scale of the index (in terms of the number of rows to be ingested into the index table). In some past tests, it took up to 24 hours to load a table comprising 18 billion rows.

      Additional considerations for using the proposed mechanism in the Qserv deplotements:

      Documentation

      Update the following documents due to changes in the REST API and behavior of the new implementation:

        Attachments

          Activity

          Hide
          gapon Igor Gaponenko added a comment - - edited

          Large scale tests of the extended protocol

          Using catalog sdss_stripe82_01 and director table RunDeepSource for the tests. The table is large enough to study effects of the new protocol.

          Record the current number of objects in the "secondary index" of the catalog for the quality control of the tests:

          % mysql --protocol=tcp -hlocalhost -P3306 -uroot -p***** qservMeta \
              -e "SELECT COUNT(*) FROM qservMeta.sdss_stripe82_01__RunDeepSource"
          

          +-----------+
          | COUNT(*)  |
          +-----------+
          | 186671762 |
          +-----------+
          

          Rebuild the catalog using the original non-local version of the builder (by default, if option local is not specified, it's assumed to be 0):

          % time curl http://localhost:25080/ingest/index/secondary -X POST \
            -H 'Content-Type: application/json' \
            -d'{"auth_key":"kukara4a","database":"sdss_stripe82_01","allow_for_published":1,"rebuild":1}'
           
          real	19m6.626s
          user	0m0.029s
          sys	0m0.039s
          

          Check the numbe of objects in the updated index:

          SELECT COUNT(*) FROM qservMeta.sdss_stripe82_01__RunDeepSource;
          +-----------+
          | COUNT(*)  |
          +-----------+
          | 186671762 |
          +-----------+
          

          NOw, rebuild the index using the local option:

          % time curl http://localhost:25080/ingest/index/secondary -X POST \
            -H 'Content-Type: application/json' \
            -d'{"auth_key":"kukara4a","database":"sdss_stripe82_01","allow_for_published":1,"rebuild":1,"local":1}'
           
          real	19m9.016s
          user	0m0.028s
          sys	0m0.047s
          

          Check the number of entries in the index table to ensure it's the same as before:

          SELECT COUNT(*) FROM qservMeta.sdss_stripe82_01__RunDeepSource;
          +-----------+
          | COUNT(*)  |
          +-----------+
          | 186671762 |
          +-----------+
          

          Conclusions:

          • the new version of the builder produces similar results as the older one ( a simple method of counting rows is still sufficient to demonstrate the "similarity" due to UNIQUE constraint on the primary key of the table).
          • performance degradation of the non-local build mode versus the local one is negligible, thereforo it's no longer a point of concern.
          Show
          gapon Igor Gaponenko added a comment - - edited Large scale tests of the extended protocol Using catalog sdss_stripe82_01 and director table RunDeepSource for the tests. The table is large enough to study effects of the new protocol. Record the current number of objects in the "secondary index" of the catalog for the quality control of the tests: % mysql --protocol=tcp -hlocalhost -P3306 -uroot -p***** qservMeta \ -e "SELECT COUNT(*) FROM qservMeta.sdss_stripe82_01__RunDeepSource" + -----------+ | COUNT (*) | + -----------+ | 186671762 | + -----------+ Rebuild the catalog using the original non-local version of the builder (by default, if option local is not specified, it's assumed to be 0 ): % time curl http: //localhost :25080 /ingest/index/secondary -X POST \ -H 'Content-Type: application/json' \ -d '{"auth_key":"kukara4a","database":"sdss_stripe82_01","allow_for_published":1,"rebuild":1}'   real 19m6.626s user 0m0.029s sys 0m0.039s Check the numbe of objects in the updated index: SELECT COUNT (*) FROM qservMeta.sdss_stripe82_01__RunDeepSource; + -----------+ | COUNT (*) | + -----------+ | 186671762 | + -----------+ NOw, rebuild the index using the local option: % time curl http: //localhost :25080 /ingest/index/secondary -X POST \ -H 'Content-Type: application/json' \ -d '{"auth_key":"kukara4a","database":"sdss_stripe82_01","allow_for_published":1,"rebuild":1,"local":1}'   real 19m9.016s user 0m0.028s sys 0m0.047s Check the number of entries in the index table to ensure it's the same as before: SELECT COUNT (*) FROM qservMeta.sdss_stripe82_01__RunDeepSource; + -----------+ | COUNT (*) | + -----------+ | 186671762 | + -----------+ Conclusions: the new version of the builder produces similar results as the older one ( a simple method of counting rows is still sufficient to demonstrate the "similarity" due to UNIQUE constraint on the primary key of the table). performance degradation of the non-local build mode versus the local one is negligible, thereforo it's no longer a point of concern.
          Hide
          npease Nate Pease added a comment -

          I had one very minor comment, but otherwise looks fine.

          Show
          npease Nate Pease added a comment - I had one very minor comment, but otherwise looks fine.

            People

            Assignee:
            gapon Igor Gaponenko
            Reporter:
            gapon Igor Gaponenko
            Reviewers:
            Nate Pease
            Watchers:
            Fabrice Jammes, Fritz Mueller, Igor Gaponenko, Nate Pease
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

              Dates

              Created:
              Updated:
              Resolved: