Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-26390

Ingest GAIA DR2 catalogs into "small" Qserv cluster at NCSA

    XMLWordPrintable

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: Qserv
    • Labels:
      None

      Description

      Goals

      This effort is similar to DM-23086 and DM-23962 efforts, except the catalog will be ingested into the "small" Qserv cluster explained in DM-25949.

      Implementation details

      Files supporting the ingest installed on the GPFS filesystem at:

      /datasets/gapon/gaia_dr2_02/
      

      Many files are inherited (via symbolic links or copied and modified) from the previous effort:

      /datasets/gapon/gaia_dr2_01/
      

      The partitioning and chunk/table contributions phase will be run inside the latest version of the Docker container qserv/replica:tools-DM-25986 (see [DM-25986}] which included the latest improvements and fixes to the Replication & Ingest system.

      The ingest orchestration will be done from the master machine qserv-master03. This will include interactions with the Replication System's REST API, launching containers at workers.

      There will be one container per worker. Each container will be used for partitioning input files into the worker's work folder, and launching many ingest streams, The number of streams per worker would be 16 which is equal to the number of the CPU cores per node. This allows to fully utilize CPU resources available to each worker.

      There will be a dedicated super-transaction started for each such stream.

      Scripts and configurations used during this ingest are found in the dedicated Git package: https://github.com/lsst-dm/qserv-ncsa-gaia

        Attachments

          Issue Links

            Activity

            Hide
            gapon Igor Gaponenko added a comment - - edited

            Tests

            Test regular tables by counting rows

            From qserv-master03:

            for TABLE in $(cat /datasets/gapon/gaia_dr2_02/regular/tables); do \
                mysql -NB --protocol=tcp -P4040 -uqsmaster  -e "SELECT '${TABLE}',COUNT(*) FROM gaia_dr2_02.${TABLE}"; \
            done
            

            table rows
            aux_allwise_agn_gdr2_cross_id 555934
            aux_iers_gdr2_cross_id 2820
            aux_sso_orbit_residuals 1977702
            aux_sso_orbits 14099
            light_curves 48255086
            sso_observation 1977702
            sso_source 14099
            vari_cepheid 9575
            vari_classifier_result 363969
            vari_long_period_variable 89617
            vari_rotation_modulation 147535
            vari_rrlyrae 140784
            vari_short_timescale 3018
            vari_time_series_statistics 550737

            VERIFIED that these numbers match the ones reported for tables of catalog gaia_dr2_01 loaded into the "large" Qserv cluster.

            Test the director table gaia_source

            Count the total number of objects:

            SELECT COUNT(*) FROM gaia_dr2_02.gaia_source;
            +------------+
            | COUNT(*)   |
            +------------+
            | 1692919135 |
            +------------+
            

            VERIFIED that this numbers matches the one reported for the same table of catalog gaia_dr2_01 loaded into the "small" Qserv cluster.

            Test te secondary index optimization using one of the object identifiers reported from:

            SELECT * FROM qservMeta.gaia_dr2_02__gaia_source LIMIT 10;
            +--------------+---------+------------+
            | source_id    | chunkId | subChunkId |
            +--------------+---------+------------+
            |   4295806720 |  115684 |          2 |
            |  34361129088 |  115684 |          2 |
            |  38655544960 |  115684 |          2 |
            | 309238066432 |  115684 |          2 |
            | 343597448960 |  115684 |          2 |
            | 515396233856 |  115684 |          2 |
            | 549755818112 |  115684 |          2 |
            | 828929527040 |  115684 |          2 |
            | 927713095040 |  115684 |          2 |
            | 966367933184 |  115684 |          2 |
            +--------------+---------+------------+
            

            Qserv query:

            SELECT source_id,ra,`dec`,chunkId,subChunkId FROM gaia_dr2_02.gaia_source WHERE source_id=4295806720';
            +------------+--------------------+----------------------+---------+------------+
            | source_id  | ra                 | dec                  | chunkId | subChunkId |
            +------------+--------------------+----------------------+---------+------------+
            | 4295806720 | 44.996153684159594 | 0.005615806210679649 |  115684 |          2 |
            +------------+--------------------+----------------------+---------+------------+
            

            Show
            gapon Igor Gaponenko added a comment - - edited Tests Test regular tables by counting rows From qserv-master03 : for TABLE in $( cat /datasets/gapon/gaia_dr2_02/regular/tables ); do \ mysql -NB --protocol=tcp -P4040 -uqsmaster -e "SELECT '${TABLE}',COUNT(*) FROM gaia_dr2_02.${TABLE}" ; \ done table rows aux_allwise_agn_gdr2_cross_id 555934 aux_iers_gdr2_cross_id 2820 aux_sso_orbit_residuals 1977702 aux_sso_orbits 14099 light_curves 48255086 sso_observation 1977702 sso_source 14099 vari_cepheid 9575 vari_classifier_result 363969 vari_long_period_variable 89617 vari_rotation_modulation 147535 vari_rrlyrae 140784 vari_short_timescale 3018 vari_time_series_statistics 550737 VERIFIED that these numbers match the ones reported for tables of catalog gaia_dr2_01 loaded into the "large" Qserv cluster. Test the director table gaia_source Count the total number of objects: SELECT COUNT (*) FROM gaia_dr2_02.gaia_source; + ------------+ | COUNT (*) | + ------------+ | 1692919135 | + ------------+ VERIFIED that this numbers matches the one reported for the same table of catalog gaia_dr2_01 loaded into the "small" Qserv cluster. Test te secondary index optimization using one of the object identifiers reported from: SELECT * FROM qservMeta.gaia_dr2_02__gaia_source LIMIT 10; + --------------+---------+------------+ | source_id | chunkId | subChunkId | + --------------+---------+------------+ | 4295806720 | 115684 | 2 | | 34361129088 | 115684 | 2 | | 38655544960 | 115684 | 2 | | 309238066432 | 115684 | 2 | | 343597448960 | 115684 | 2 | | 515396233856 | 115684 | 2 | | 549755818112 | 115684 | 2 | | 828929527040 | 115684 | 2 | | 927713095040 | 115684 | 2 | | 966367933184 | 115684 | 2 | + --------------+---------+------------+ Qserv query: SELECT source_id,ra,` dec `,chunkId,subChunkId FROM gaia_dr2_02.gaia_source WHERE source_id=4295806720'; + ------------+--------------------+----------------------+---------+------------+ | source_id | ra | dec | chunkId | subChunkId | + ------------+--------------------+----------------------+---------+------------+ | 4295806720 | 44.996153684159594 | 0.005615806210679649 | 115684 | 2 | + ------------+--------------------+----------------------+---------+------------+
            Hide
            gapon Igor Gaponenko added a comment - - edited

            What's next

            • Test the performance of Qserv queries and compare the numbers with the ones obtained for similar queries run against the "large" catalog. See results in the next comment.
            • Put scripts and configurations used during this ingest into a dedicated Git package: https://github.com/lsst-dm/qserv-ncsa-gaia
            • Create a JIRA ticket for loading both partitioned tables of catalog wise into the cluster. Attach his ticket as the "related" one to the current one. See DM-26544.
            • build table-level UNIQUE indexes for the object identifier column source_id of table gaia_source
            • build table-level indexes for columns ra and dec for this table as well
            • get rid of the intermediate partitioned files at the Qserv workers' filesystems
            • get rid of the intermediate input files at GPFS
            Show
            gapon Igor Gaponenko added a comment - - edited What's next Test the performance of Qserv queries and compare the numbers with the ones obtained for similar queries run against the "large" catalog. See results in the next comment. Put scripts and configurations used during this ingest into a dedicated Git package: https://github.com/lsst-dm/qserv-ncsa-gaia Create a JIRA ticket for loading both partitioned tables of catalog wise into the cluster. Attach his ticket as the "related" one to the current one. See DM-26544 . build table-level UNIQUE indexes for the object identifier column source_id of table gaia_source build table-level indexes for columns ra and dec for this table as well get rid of the intermediate partitioned files at the Qserv workers' filesystems get rid of the intermediate input files at GPFS
            Hide
            gapon Igor Gaponenko added a comment - - edited

            Basic performance tests

            All results were reported after putting Qserv into the "warm" mode by launching queries on the relevant tables.

            Unconditionally counting objects across all chunks (1 query at a time)

            SELECT COUNT(*) FROM gaia_dr2_02.gaia_source
            

            Performance:

            • "small" cluster: 2m 42s
            • "large" cluster: 1m 11s

            Shared scan query (1 query at a time)

            SELECT COUNT(*) FROM gaia_dr2_02.gaia_source WHERE ra < 0.1
            

            Performance:

            • "small" cluster: 19m 20s
            • "large" cluster: 4m 20s

            NOTE: the "small" cluster is ~5 times slower compared to the "big" one. The performance seems to scale linearly with the number of worker nodes. The "small" cluster has 5 times fewer workers compared to the other one.

            Shared scan query (4 parallel queries)

            SELECT COUNT(*) FROM gaia_dr2_02.gaia_source WHERE ra < 0.0[1,2,3,4]
            

            Performance:

            • "small" cluster: 19m 32s, 19m 35s, *19m 35s, 19m 37s
            • "large" cluster: 2m 58s, 3m 12s, 3m 15s, 3m 17s
            Show
            gapon Igor Gaponenko added a comment - - edited Basic performance tests All results were reported after putting Qserv into the "warm" mode by launching queries on the relevant tables. Unconditionally counting objects across all chunks (1 query at a time) SELECT COUNT (*) FROM gaia_dr2_02.gaia_source Performance: "small" cluster: 2m 42s "large" cluster: 1m 11s Shared scan query (1 query at a time) SELECT COUNT (*) FROM gaia_dr2_02.gaia_source WHERE ra < 0.1 Performance: "small" cluster: 19m 20s "large" cluster: 4m 20s NOTE : the "small" cluster is ~5 times slower compared to the "big" one. The performance seems to scale linearly with the number of worker nodes. The "small" cluster has 5 times fewer workers compared to the other one. Shared scan query (4 parallel queries) SELECT COUNT (*) FROM gaia_dr2_02.gaia_source WHERE ra < 0.0[1,2,3,4] Performance: "small" cluster: 19m 32s , 19m 35s , *19m 35s , 19m 37s "large" cluster: 2m 58s , 3m 12s , 3m 15s , 3m 17s
            Hide
            gapon Igor Gaponenko added a comment - - edited

            Fritz Mueller I'm done with the ingest. Do you want to review this before I close the ticket?

            Show
            gapon Igor Gaponenko added a comment - - edited Fritz Mueller I'm done with the ingest. Do you want to review this before I close the ticket?
            Hide
            gapon Igor Gaponenko added a comment - - edited

            Creating table indexes

            gaia_source

            The object identifier column*source_id*:

            curl 'http://localhost:25081/replication/sql/index' \
              -X POST -H "Content-Type: application/json" \
              -d@/datasets/gapon/gaia_dr2_02/qserv-ncsa-gaia/gaia_source/gaia_dr2_02_gaia_source_idx_source_id.json \
              -o gaia_dr2_02_gaia_source_idx_source_id.result \
              &>gaia_dr2_02_gaia_source_idx_source_id.log
            

            Columns dec and ra:

            curl 'http://localhost:25081/replication/sql/index' \
              -X POST -H "Content-Type: application/json" \
              -d@/datasets/gapon/gaia_dr2_02/qserv-ncsa-gaia/gaia_source/gaia_dr2_02_gaia_source_idx_dec.json \
              -o gaia_dr2_02_gaia_source_idx_dec.result \
              >& gaia_dr2_02_gaia_source_idx_dec.log&
             
            curl 'http://localhost:25081/replication/sql/index' \
              -X POST -H "Content-Type: application/json" \
              -d@/datasets/gapon/gaia_dr2_02/qserv-ncsa-gaia/gaia_source/gaia_dr2_02_gaia_source_idx_ra.json \
              -o gaia_dr2_02_gaia_source_idx_ra.result \
              >& gaia_dr2_02_gaia_source_idx_ra.log
            

            Show
            gapon Igor Gaponenko added a comment - - edited Creating table indexes gaia_source The object identifier column* source_id *: curl 'http://localhost:25081/replication/sql/index' \ -X POST -H "Content-Type: application/json" \ -d@ /datasets/gapon/gaia_dr2_02/qserv-ncsa-gaia/gaia_source/gaia_dr2_02_gaia_source_idx_source_id .json \ -o gaia_dr2_02_gaia_source_idx_source_id.result \ &>gaia_dr2_02_gaia_source_idx_source_id.log Columns dec and ra : curl 'http://localhost:25081/replication/sql/index' \ -X POST -H "Content-Type: application/json" \ -d@ /datasets/gapon/gaia_dr2_02/qserv-ncsa-gaia/gaia_source/gaia_dr2_02_gaia_source_idx_dec .json \ -o gaia_dr2_02_gaia_source_idx_dec.result \ >& gaia_dr2_02_gaia_source_idx_dec.log&   curl 'http://localhost:25081/replication/sql/index' \ -X POST -H "Content-Type: application/json" \ -d@ /datasets/gapon/gaia_dr2_02/qserv-ncsa-gaia/gaia_source/gaia_dr2_02_gaia_source_idx_ra .json \ -o gaia_dr2_02_gaia_source_idx_ra.result \ >& gaia_dr2_02_gaia_source_idx_ra.log

              People

              Assignee:
              gapon Igor Gaponenko
              Reporter:
              gapon Igor Gaponenko
              Reviewers:
              Fritz Mueller
              Watchers:
              Fritz Mueller, Hsin-Fang Chiang, Igor Gaponenko, Nate Pease
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved: