Data Management / DM-7053

Assemble a complete database with S13 DRP catalogs

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: Qserv
    • Labels:

      Description

      Create a database populated with the complete catalogs resulting from processing the SDSS Stripe 82 data at both the NCSA and IN2P3 sites. The database has to be created at NCSA on the following database server: lsst-db.ncsa.illinois.edu

      • 2016-08-10: CORRECTION: that was the original plan. Due to the space limitation on that server, the database will be installed on the Nebula volume lsst-dm-gapon attached to the image lsst-gapon-0 (floating IP address 141.142.211.152). See details on this setup in the comments stream below.

      The database name will be: gapon_SDRP_Stripe82

      The database will be populated with the contents of the following databases:

      • NCSA (lsst-db.ncsa.illinois.edu):

        daues_SDRP_Stripe82_ncsa
        daues_SDRP_dedupe_byfilter_0
        daues_SDRP_dedupe_byfilter_1
        daues_SDRP_dedupe_byfilter_2
        daues_SDRP_dedupe_byfilter_3
        daues_SDRP_dedupe_byfilter_4
        

      • IN2P3 (ccdb02.in2p3.fr):

        lsst_prod_DC_2013_2
        lsst_prod_dedupe_byfilter_g
        lsst_prod_dedupe_byfilter_i
        lsst_prod_dedupe_byfilter_r
        lsst_prod_dedupe_byfilter_u
        lsst_prod_dedupe_byfilter_z
        

      Additional requirements:

      • The duplicate entries (due to the overlap in the RA range) will need to be carefully assessed and eliminated.
      • The referential integrity of the resulting database will need to be tested (see the sketch after this list).
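
      A minimal sketch of such an integrity test, assuming the dependency chain described in the comments below (RunDeepForcedSource.objectId referencing RunDeepSource.id); a non-zero count would indicate broken references:

        SELECT COUNT(*) AS orphans
        FROM gapon_SDRP_Stripe82.RunDeepForcedSource `rdfs`
          LEFT JOIN gapon_SDRP_Stripe82.RunDeepSource `rds`
            ON rdfs.objectId = rds.id
        WHERE rds.id IS NULL;
        -- Expect 0: every source row must reference an existing object.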

    Attachments

    Issue Links

    Activity

            Igor Gaponenko added a comment (edited)

            Merging and de-duplicating DeepCoadd and RunDeepSource

            These tables need to be merged and de-duplicated before attempting to de-duplicate the five IN2P3 RunDeepForcedSource tables. The effort is based on the following dependency between tables (explained earlier in this ticket):

            • RunDeepForcedSource.objectId -> RunDeepSource.id, and RunDeepSource.coadd_id -> DeepCoadd.deepCoaddId

            We shall also increase the MySQL key buffer size because some of the operations used later on will greatly benefit from it. The parameter is explained at http://dev.mysql.com/doc/refman/5.7/en/server-system-variables.html#sysvar_key_buffer_size. Set it to 4 GB as per Serge's suggestion at https://github.com/lsst-dm/db_tests_summer15/blob/master/sql/dump_ForcedSource.sql, then restart the server.

            % cat /etc/my.cnf.d/server.cnf
            ..
            key_buffer_size = 4294967296
            ..
            % systemctl restart mariadb.service
            
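
            After the restart it is worth confirming that the server actually picked up the new value; a standard check:

            SHOW GLOBAL VARIABLES LIKE 'key_buffer_size';
            -- Expected value: 4294967296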

            Saving the overlap info for DeepCoadd

            Here we're going to save the overlap identifiers in intermediate tables.
            NCSA:

            CREATE TABLE merge_info.ncsa_DeepCoadd_overlap (
              `deepCoaddId` bigint(20) NOT NULL,
              `tract` int(11) NOT NULL,
              `patch` char(16) NOT NULL,
              `filterId` tinyint(4) NOT NULL,
              PRIMARY KEY (`deepCoaddId`)
            ) ENGINE=MyISAM;
             
            INSERT INTO merge_info.ncsa_DeepCoadd_overlap
              SELECT deepCoaddId,tract,patch,filterId
              FROM gapon_SDRP_Stripe82.DeepCoadd
              WHERE (tract,patch,filterId) IN
                (SELECT tract,patch,filterId FROM lsst_prod_DC_2013_2.DeepCoadd);
            

            Query OK, 1380 rows affected (0.10 sec)
            Records: 1380  Duplicates: 0  Warnings: 0
            

            IN2P3:

            CREATE TABLE merge_info.in2p3_DeepCoadd_overlap (
              `deepCoaddId` bigint(20) NOT NULL,
              `tract` int(11) NOT NULL,
              `patch` char(16) NOT NULL,
              `filterId` tinyint(4) NOT NULL,
              PRIMARY KEY (`deepCoaddId`)
            ) ENGINE=MyISAM;
             
            INSERT INTO merge_info.in2p3_DeepCoadd_overlap
              SELECT deepCoaddId,tract,patch,filterId
              FROM lsst_prod_DC_2013_2.DeepCoadd
              WHERE (tract,patch,filterId) IN
                (SELECT tract,patch,filterId FROM gapon_SDRP_Stripe82.DeepCoadd);
            

            Query OK, 1380 rows affected (41.85 sec)
            Records: 1380  Duplicates: 0  Warnings: 0
            

            Observations:

            • both tables have identical entries
            • the number of rows in each table is rather small: 1380
            • the table size is 85560 bytes
            • the index size is 23552 bytes
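
            The "identical entries" observation can be cross-checked along these lines (a sketch; since the deepCoaddId values are site-specific, the comparison is done on the (tract,patch,filterId) triple):

            SELECT COUNT(*)
            FROM merge_info.ncsa_DeepCoadd_overlap `ncsa`
              LEFT JOIN merge_info.in2p3_DeepCoadd_overlap `in2p3`
                USING (tract, patch, filterId)
            WHERE in2p3.deepCoaddId IS NULL;
            -- Expect 0; swapping the two tables checks the other direction.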

            Saving the overlap info for RunDeepSource

            Estimating the total number of Objects in the overlap area

            NCSA:

            SELECT
              COUNT(id)
            FROM
              gapon_SDRP_Stripe82.RunDeepSource `rds`,
              merge_info.ncsa_DeepCoadd_overlap `overlap`
            WHERE
              rds.coadd_id = overlap.deepCoaddId;
            

            +-----------+
            | COUNT(id) |
            +-----------+
            |   9991529 |
            +-----------+
            1 row in set (1 min 43.00 sec)
            

            IN2P3:

            SELECT
              COUNT(id)
            FROM
              lsst_prod_DC_2013_2.RunDeepSource `rds`,
              merge_info.in2p3_DeepCoadd_overlap `overlap`
            WHERE
              rds.coadd_id = overlap.deepCoaddId;
            

            +-----------+
            | COUNT(id) |
            +-----------+
            |   9993373 |
            +-----------+
            1 row in set (2 min 46.84 sec)
            

            Recording the identifiers in the bookkeeping tables for both sites

            NCSA:

            CREATE TABLE merge_info.ncsa_RunDeepSource_overlap (
              `id` bigint(20) NOT NULL,
              `coadd_id` bigint(20) NOT NULL,
              PRIMARY KEY (`id`)
            ) ENGINE=MyISAM;
             
            INSERT INTO merge_info.ncsa_RunDeepSource_overlap
            SELECT id, coadd_id
            FROM
              gapon_SDRP_Stripe82.RunDeepSource `rds`,
              merge_info.ncsa_DeepCoadd_overlap `overlap`
            WHERE
              rds.coadd_id = overlap.deepCoaddId;
            

            Query OK, 9991529 rows affected (25.07 sec)
            Records: 9991529  Duplicates: 0  Warnings: 0
            

            IN2P3:

            CREATE TABLE merge_info.in2p3_RunDeepSource_overlap (
              `id` bigint(20) NOT NULL,
              `coadd_id` bigint(20) NOT NULL,
              PRIMARY KEY (`id`)
            ) ENGINE=MyISAM;
             
            INSERT INTO merge_info.in2p3_RunDeepSource_overlap
            SELECT id, coadd_id
            FROM
              lsst_prod_DC_2013_2.RunDeepSource `rds`,
              merge_info.in2p3_DeepCoadd_overlap `overlap`
            WHERE
              rds.coadd_id = overlap.deepCoaddId;
            

            Query OK, 9993373 rows affected (32.20 sec)
            Records: 9993373  Duplicates: 0  Warnings: 0
            

            Observations

            • the numbers of entries in the two tables are almost identical
            • the number of rows in each table is about 10^7 (9991529 and 9993373)
            • the table size is 169887341 bytes (170 MB)
            • the index size is 144075776 bytes (144 MB)

            This means that with a proper key buffer size (as mentioned at the start of this comment) the whole table index will fit into memory, where it can be used for eliminating overlaps from the largest tables, RunDeepForcedSource.
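
            A sketch of how to re-check these numbers against the configured buffer via information_schema:

            SELECT table_name, data_length, index_length
            FROM information_schema.TABLES
            WHERE table_schema = 'merge_info'
              AND table_name IN ('ncsa_RunDeepSource_overlap',
                                 'in2p3_RunDeepSource_overlap');
            -- index_length (~144 MB) is well under the 4 GB key buffer,
            -- so the whole index can stay cached in memory.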

            Deduplicating DeepCoadd (and its metadata tables) of IN2P3

            This is an intermediate step before producing the fully merged tables. The goal is to remove the overlap in tables:

            • lsst_prod_DC_2013_2.DeepCoadd
            • lsst_prod_DC_2013_2.DeepCoadd_Metadata
            • lsst_prod_DC_2013_2.DeepCoadd_To_Htm10

            The resulting tables will reside within the same database:

            CREATE TABLE lsst_prod_DC_2013_2.DeepCoadd_dedup
            LIKE lsst_prod_DC_2013_2.DeepCoadd;
            ALTER TABLE lsst_prod_DC_2013_2.DeepCoadd_dedup DISABLE KEYS;
             
            CREATE TABLE lsst_prod_DC_2013_2.DeepCoadd_Metadata_dedup
            LIKE lsst_prod_DC_2013_2.DeepCoadd_Metadata;
            ALTER TABLE lsst_prod_DC_2013_2.DeepCoadd_Metadata_dedup DISABLE KEYS;
             
            CREATE TABLE lsst_prod_DC_2013_2.DeepCoadd_To_Htm10_dedup
            LIKE lsst_prod_DC_2013_2.DeepCoadd_To_Htm10;
            ALTER TABLE lsst_prod_DC_2013_2.DeepCoadd_To_Htm10_dedup DISABLE KEYS;
            

            Deduplicate by excluding the previously recorded identifiers of coadds in the overlap area of IN2P3. Run a few test queries first to make sure we're getting consistent results.

            DeepCoadd

            SELECT COUNT(*) FROM lsst_prod_DC_2013_2.DeepCoadd;
            

            +----------+
            | COUNT(*) |
            +----------+
            |    13680 |
            +----------+
            

            SELECT COUNT(*) FROM lsst_prod_DC_2013_2.DeepCoadd
            WHERE
              deepCoaddId IN
                (SELECT deepCoaddId FROM merge_info.in2p3_DeepCoadd_overlap);
            

            +----------+
            | COUNT(*) |
            +----------+
            |     1380 |
            +----------+
            

            SELECT COUNT(*) FROM lsst_prod_DC_2013_2.DeepCoadd
            WHERE
              deepCoaddId NOT IN
                (SELECT deepCoaddId FROM merge_info.in2p3_DeepCoadd_overlap);
            

            +----------+
            | COUNT(*) |
            +----------+
            |    12300 |
            +----------+
            

            INSERT INTO lsst_prod_DC_2013_2.DeepCoadd_dedup
            SELECT *
            FROM
              lsst_prod_DC_2013_2.DeepCoadd
            WHERE  
              deepCoaddId NOT IN
                (SELECT deepCoaddId FROM merge_info.in2p3_DeepCoadd_overlap);
            

            Query OK, 12300 rows affected (0.09 sec)
            Records: 12300  Duplicates: 0  Warnings: 0
            

            DeepCoadd_Metadata

            SELECT COUNT(*) FROM lsst_prod_DC_2013_2.DeepCoadd_Metadata;
            

            +----------+
            | COUNT(*) |
            +----------+
            |  3022797 |
            +----------+
            

            SELECT COUNT(*) FROM lsst_prod_DC_2013_2.DeepCoadd_Metadata
            WHERE
              deepCoaddId IN
                (SELECT deepCoaddId FROM merge_info.in2p3_DeepCoadd_overlap);
            

            +----------+
            | COUNT(*) |
            +----------+
            |   289140 |
            +----------+
            1 row in set (3.12 sec)
            

            SELECT COUNT(*) FROM lsst_prod_DC_2013_2.DeepCoadd_Metadata
            WHERE
              deepCoaddId NOT IN
                (SELECT deepCoaddId FROM merge_info.in2p3_DeepCoadd_overlap);
            

            +----------+
            | COUNT(*) |
            +----------+
            |  2733657 |
            +----------+
            1 row in set (1.34 sec)
            

            INSERT INTO lsst_prod_DC_2013_2.DeepCoadd_Metadata_dedup
            SELECT *
            FROM
              lsst_prod_DC_2013_2.DeepCoadd_Metadata
            WHERE  
              deepCoaddId NOT IN
                (SELECT deepCoaddId FROM merge_info.in2p3_DeepCoadd_overlap);
            

            Query OK, 2733657 rows affected (1 min 1.31 sec)
            Records: 2733657  Duplicates: 0  Warnings: 0
            

            DeepCoadd_To_Htm10

            SELECT COUNT(*) FROM lsst_prod_DC_2013_2.DeepCoadd_To_Htm10;
            

            +----------+
            | COUNT(*) |
            +----------+
            |   324020 |
            +----------+
            

            SELECT COUNT(*) FROM lsst_prod_DC_2013_2.DeepCoadd_To_Htm10
            WHERE
              deepCoaddId IN
                (SELECT deepCoaddId FROM merge_info.in2p3_DeepCoadd_overlap);
            

            +----------+
            | COUNT(*) |
            +----------+
            |    32740 |
            +----------+
            1 row in set (0.06 sec)
            

            SELECT COUNT(*) FROM lsst_prod_DC_2013_2.DeepCoadd_To_Htm10
            WHERE
              deepCoaddId NOT IN
                (SELECT deepCoaddId FROM merge_info.in2p3_DeepCoadd_overlap);
            

            +----------+
            | COUNT(*) |
            +----------+
            |   291280 |
            +----------+
            1 row in set (0.10 sec)
            

            INSERT INTO lsst_prod_DC_2013_2.DeepCoadd_To_Htm10_dedup
            SELECT *
            FROM
              lsst_prod_DC_2013_2.DeepCoadd_To_Htm10
            WHERE  
              deepCoaddId NOT IN
                (SELECT deepCoaddId FROM merge_info.in2p3_DeepCoadd_overlap);
            

            Query OK, 291280 rows affected (0.13 sec)
            Records: 291280  Duplicates: 0  Warnings: 0
            
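
            As a sanity check before moving on, the three counts for each table should be consistent; a sketch for DeepCoadd:

            SELECT
              (SELECT COUNT(*) FROM lsst_prod_DC_2013_2.DeepCoadd)       AS original,
              (SELECT COUNT(*) FROM merge_info.in2p3_DeepCoadd_overlap)  AS overlap,
              (SELECT COUNT(*) FROM lsst_prod_DC_2013_2.DeepCoadd_dedup) AS dedup;
            -- Expect original = overlap + dedup (13680 = 1380 + 12300).
            -- The same identity holds for DeepCoadd_Metadata
            -- (3022797 = 289140 + 2733657) and for DeepCoadd_To_Htm10
            -- (324020 = 32740 + 291280).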

            Deduplicating RunDeepSource of IN2P3

            This is an intermediate step before producing the fully merged tables. The goal is to remove the overlap in the following table:

            • lsst_prod_DC_2013_2.RunDeepSource

            The resulting table will reside within the same database:

            CREATE TABLE lsst_prod_DC_2013_2.RunDeepSource_dedup
            LIKE lsst_prod_DC_2013_2.RunDeepSource;
            ALTER TABLE lsst_prod_DC_2013_2.RunDeepSource_dedup DISABLE KEYS;
            

            SELECT COUNT(*) FROM lsst_prod_DC_2013_2.RunDeepSource;
            

            +----------+
            | COUNT(*) |
            +----------+
            | 98088148 |
            +----------+
            1 row in set (0.01 sec)
            
            

            SELECT COUNT(*) FROM merge_info.in2p3_RunDeepSource_overlap;
            

            +----------+
            | COUNT(*) |
            +----------+
            |  9993373 |
            +----------+
            

            SELECT COUNT(*) FROM lsst_prod_DC_2013_2.RunDeepSource
            WHERE
              id IN
                (SELECT id FROM merge_info.in2p3_RunDeepSource_overlap);
            

            +----------+
            | COUNT(*) |
            +----------+
            |  9993373 |
            +----------+
            1 row in set (46.01 sec)
            

            SELECT COUNT(*) FROM lsst_prod_DC_2013_2.RunDeepSource
            WHERE
              id NOT IN
                (SELECT id FROM merge_info.in2p3_RunDeepSource_overlap);
            

            +----------+
            | COUNT(*) |
            +----------+
            | 88094775 |
            +----------+
            1 row in set (3 min 35.95 sec)
            

            INSERT INTO lsst_prod_DC_2013_2.RunDeepSource_dedup
            SELECT *
            FROM
              lsst_prod_DC_2013_2.RunDeepSource
            WHERE  
              id NOT IN
                (SELECT id FROM merge_info.in2p3_RunDeepSource_overlap);
            

            Query OK, 88094775 rows affected (25 min 33.03 sec)
            Records: 88094775  Duplicates: 0  Warnings: 0
            
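
            An assumed follow-up, not shown in this ticket: the keys disabled on the *_dedup tables would have to be re-enabled (rebuilt) before any index-based access, e.g.:

            ALTER TABLE lsst_prod_DC_2013_2.RunDeepSource_dedup ENABLE KEYS;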

            Merging tables DeepCoadd and RunDeepSource

            This is the final stage: each pair of site-specific tables is merged to include rows from both sites. The resulting tables are placed into a separate database called lsst_sdss_stripe82. The database will also be populated with other shared tables needed by the PDAC Qserv loader.

            • NOTE: The database will be placed on the OpenStack volume lsst-dm-gapon-3.

            CREATE DATABASE lsst_sdss_stripe82;
            

            DeepCoadd

            Create the table and disable all but the PRIMARY key, which is kept as a safeguard against duplicate rows (on MyISAM, DISABLE KEYS only affects non-unique indexes, so the PRIMARY key remains active):

            CREATE TABLE lsst_sdss_stripe82.DeepCoadd_merge
            LIKE gapon_SDRP_Stripe82.DeepCoadd;
             
            ALTER TABLE lsst_sdss_stripe82.DeepCoadd_merge DISABLE KEYS;
            

            Populate the table with rows from both sites (starting with NCSA):

            INSERT INTO lsst_sdss_stripe82.DeepCoadd_merge
            SELECT * FROM gapon_SDRP_Stripe82.DeepCoadd;
            

            Query OK, 13735 rows affected (0.20 sec)
            Records: 13735  Duplicates: 0  Warnings: 0
            

            INSERT INTO lsst_sdss_stripe82.DeepCoadd_merge
            SELECT * FROM lsst_prod_DC_2013_2.DeepCoadd_dedup;
            

            Query OK, 12300 rows affected (0.14 sec)
            Records: 12300  Duplicates: 0  Warnings: 0
            

            SELECT COUNT(*) FROM lsst_sdss_stripe82.DeepCoadd_merge;
            

            +----------+
            | COUNT(*) |
            +----------+
            |    26035 |
            +----------+
            
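
            With the PRIMARY key left active as the safeguard, a spot check that no duplicate identifiers survived the merge (a sketch):

            SELECT deepCoaddId, COUNT(*) AS n
            FROM lsst_sdss_stripe82.DeepCoadd_merge
            GROUP BY deepCoaddId
            HAVING n > 1;
            -- Expect the empty set; 26035 = 13735 + 12300 also confirms
            -- that no rows were lost.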

            Follow the same approach for the two dependent metadata tables:

            CREATE TABLE lsst_sdss_stripe82.DeepCoadd_Metadata_merge
            LIKE  gapon_SDRP_Stripe82.DeepCoadd_Metadata;
            

            INSERT INTO lsst_sdss_stripe82.DeepCoadd_Metadata_merge
            SELECT * FROM gapon_SDRP_Stripe82.DeepCoadd_Metadata;
            

            Query OK, 2809690 rows affected (31.07 sec)
            Records: 2809690  Duplicates: 0  Warnings: 0
            

            INSERT INTO lsst_sdss_stripe82.DeepCoadd_Metadata_merge
            SELECT * FROM lsst_prod_DC_2013_2.DeepCoadd_Metadata_dedup;
            

            Query OK, 2733657 rows affected (1 min 35.83 sec)
            Records: 2733657  Duplicates: 0  Warnings: 0
            

            SELECT COUNT(*) FROM lsst_sdss_stripe82.DeepCoadd_Metadata_merge;
            

            +----------+
            | COUNT(*) |
            +----------+
            |  5543347 |
            +----------+
            

            CREATE TABLE lsst_sdss_stripe82.DeepCoadd_To_Htm10_merge
            LIKE gapon_SDRP_Stripe82.DeepCoadd_To_Htm10;
            

            INSERT INTO lsst_sdss_stripe82.DeepCoadd_To_Htm10_merge
            SELECT * FROM gapon_SDRP_Stripe82.DeepCoadd_To_Htm10;
            

            Query OK, 326420 rows affected (0.52 sec)
            Records: 326420  Duplicates: 0  Warnings: 0
            

            INSERT INTO lsst_sdss_stripe82.DeepCoadd_To_Htm10_merge
            SELECT * FROM lsst_prod_DC_2013_2.DeepCoadd_To_Htm10_dedup;
            

            Query OK, 291280 rows affected (1.40 sec)
            Records: 291280  Duplicates: 0  Warnings: 0
            

            SELECT COUNT(*) FROM lsst_sdss_stripe82.DeepCoadd_To_Htm10_merge;
            

            +----------+
            | COUNT(*) |
            +----------+
            |   617700 |
            +----------+
            

            RunDeepSource

            CREATE TABLE lsst_sdss_stripe82.RunDeepSource_merge
            LIKE gapon_SDRP_Stripe82.RunDeepSource;
             
            ALTER TABLE lsst_sdss_stripe82.RunDeepSource_merge DISABLE KEYS;
            

            INSERT INTO lsst_sdss_stripe82.RunDeepSource_merge
            SELECT * FROM gapon_SDRP_Stripe82.RunDeepSource;
            

            Query OK, 98577782 rows affected (18 min 56.41 sec)
            Records: 98577782  Duplicates: 0  Warnings: 0
            

            INSERT INTO lsst_sdss_stripe82.RunDeepSource_merge
            SELECT * FROM lsst_prod_DC_2013_2.RunDeepSource_dedup;
            

            Query OK, 88094775 rows affected (17 min 12.25 sec)
            Records: 88094775  Duplicates: 0  Warnings: 0
            

            SELECT COUNT(*) FROM lsst_sdss_stripe82.RunDeepSource_merge;
            

            +-----------+
            | COUNT(*)  |
            +-----------+
            | 186672557 |
            +-----------+
            

            Make a dump of the merged core tables

            Done

            Igor Gaponenko added a comment (edited)

            De-duplication of RunDeepForcedSource (IN2P3 u-band)

            This operation is only needed for the IN2P3 tables, covering all 5 bands. Only the u-band is going to be processed at this stage, mainly to figure out the optimal protocol for the de-duplication. The other bands will be processed later.

            The general idea of the merge is to use the pre-recorded identifiers of objects (rows of RunDeepSource) previously found in the overlap area in order to eliminate the source rows which correspond to those objects. The identifiers are stored in table merge_info.in2p3_RunDeepSource_overlap. There are about 10^7 such objects across all 5 bands.

            In order to achieve the maximum performance of the operation, it would be best to avoid any direct I/O during the merge. This would be possible if all of the following conditions were met:

            • do not write back into the input table
            • stream the filtered rows into another output table which has the same schema
            • the index for the FK objectId on the input table must be DISABLED (this allows a sequential full table scan)
            • all keys (including the PRIMARY one) on the output table must be DISABLED
            • make sure the index of object identifiers which are to be eliminated fits into the MySQL key buffer in the machine's memory (this has already been done by increasing the buffer size up to 8 GB)

            Database server tuneup

            Tuned up some server parameters for the 16 GB memory machine as per the following document: https://dev.lsstcorp.org/trac/wiki/mysqlLargeTables

            Here is a list of those:

            bulk_insert_buffer_size = 8589934592
            key_buffer_size = 4294967296
            sort_buffer_size = 8589934592
            myisam_max_sort_file_size = 2147483648
            myisam_repair_threads = 8
            
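
            For reference, all five of these are dynamic server variables, so they could also have been applied at runtime without a restart (the server.cnf edits are still needed for the values to survive the next restart):

            SET GLOBAL bulk_insert_buffer_size   = 8589934592;
            SET GLOBAL key_buffer_size           = 4294967296;
            SET GLOBAL sort_buffer_size          = 8589934592;
            SET GLOBAL myisam_max_sort_file_size = 2147483648;
            SET GLOBAL myisam_repair_threads     = 8;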

            Estimating the performance of the operation

            ALTER TABLE lsst_prod_dedupe_byfilter_u.RunDeepForcedSource DISABLE KEYS;
             
            SELECT COUNT(*)
            FROM
              lsst_prod_dedupe_byfilter_u.RunDeepForcedSource `rdfs`,
              merge_info.in2p3_RunDeepSource_overlap `overlap`
            WHERE
              rdfs.objectId=overlap.id;
            

            +-----------+
            | COUNT(*)  |
            +-----------+
            | 176989749 |
            +-----------+
            1 row in set (1 hour 45 min 39.36 sec)
            

            De-duplicating the table

            The progress of the operation can be monitored by watching for changes in the size of the corresponding file in the data directory of the database installation.
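
            An in-database alternative is to poll the table statistics (a sketch; this may block or lag while the bulk INSERT holds its write lock, which is why watching the file directly is more reliable here):

            SELECT table_name, table_rows, data_length
            FROM information_schema.TABLES
            WHERE table_schema = 'lsst_prod_dedupe_byfilter_u'
              AND table_name   = 'RunDeepForcedSource_dedup';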

            Setting up the output table:

            CREATE TABLE
              lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup
            LIKE
              lsst_prod_dedupe_byfilter_u.RunDeepForcedSource;
             
            SHOW INDEX IN lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup;
            

            +---------------------------+------------+------------------------+--------------+--------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
            | Table                     | Non_unique | Key_name               | Seq_in_index | Column_name        | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
            +---------------------------+------------+------------------------+--------------+--------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
            | RunDeepForcedSource_dedup |          0 | PRIMARY                |            1 | id                 | A         |           0 |     NULL | NULL   |      | BTREE      |         |               |
            | RunDeepForcedSource_dedup |          1 | IDX_coord_htmId20      |            1 | coord_htmId20      | A         |        NULL |     NULL | NULL   | YES  | BTREE      |         |               |
            | RunDeepForcedSource_dedup |          1 | IDX_coord_decl         |            1 | coord_decl         | A         |        NULL |     NULL | NULL   | YES  | BTREE      |         |               |
            | RunDeepForcedSource_dedup |          1 | IDX_parent             |            1 | parent             | A         |        NULL |     NULL | NULL   | YES  | BTREE      |         |               |
            | RunDeepForcedSource_dedup |          1 | IDX_exposure_id        |            1 | exposure_id        | A         |        NULL |     NULL | NULL   |      | BTREE      |         |               |
            | RunDeepForcedSource_dedup |          1 | IDX_exposure_filter_id |            1 | exposure_filter_id | A         |        NULL |     NULL | NULL   |      | BTREE      |         |               |
            | RunDeepForcedSource_dedup |          1 | IDX_objectId           |            1 | objectId           | A         |        NULL |     NULL | NULL   |      | BTREE      |         |               |
            +---------------------------+------------+------------------------+--------------+--------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
            

            DROP INDEX `PRIMARY` ON lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup;
            DROP INDEX `IDX_coord_htmId20` ON lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup;
            DROP INDEX `IDX_coord_decl` ON lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup;
            DROP INDEX `IDX_parent` ON lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup;
            DROP INDEX `IDX_exposure_id` ON lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup;
            DROP INDEX `IDX_exposure_filter_id` ON lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup;
            DROP INDEX `IDX_objectId` ON lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup;
            
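
            Re-running the SHOW INDEX statement from above should now return an empty set, confirming the output table is completely index-free before the bulk load:

            SHOW INDEX IN lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup;
            -- Empty set expected after the seven DROP INDEX statements above.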

            De-duplicating:

            SELECT CONCAT(CURRENT_DATE(),' ',CURRENT_TIME()) AS 'BEGIN TIME';
             
            INSERT INTO lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup
              SELECT * FROM lsst_prod_dedupe_byfilter_u.RunDeepForcedSource
              WHERE objectId NOT IN
                (SELECT id FROM merge_info.in2p3_RunDeepSource_overlap);
             
            SELECT CONCAT(CURRENT_DATE(),' ',CURRENT_TIME()) AS 'END TIME';
            

            +---------------------+
            | BEGIN TIME          |
            +---------------------+
            | 2016-09-02 18:37:04 |
            +---------------------+
             
            Query OK, 1720938745 rows affected (4 hours 21 min 29.13 sec)
            Records: 1720938745  Duplicates: 0  Warnings: 0
             
            +---------------------+
            | END TIME            |
            +---------------------+
            | 2016-09-02 22:58:34 |
            +---------------------+
            

            Summary of observations

            The performance of the operation was:

            • wrote ~109690 rows/sec (1720938745 rows in 4 hours 21 min 29 sec)
            • wrote ~41 MB/sec
            • about 4.5 hours per table

            The file system view on the original and reduced tables:

            % ls -alh /data/mysql/db/lsst_prod_dedupe_byfilter_u/ | grep  RunDeepForcedSource
             
            -rw-rw----  1 mysql mysql 634G Sep  2 22:58 RunDeepForcedSource_dedup.MYD
            -rw-rw----  1 mysql mysql 1.0K Sep  2 22:58 RunDeepForcedSource_dedup.MYI
            -rw-rw----  1 mysql mysql 699G Aug 14 03:11 RunDeepForcedSource.MYD
            -rw-rw----  1 mysql mysql 155G Sep  2 03:30 RunDeepForcedSource.MYI
            

            Igor Gaponenko added a comment (edited)

            Extending and preparing RunDeepForcedSource for partitioning

            This procedure applies to all 10 instances of the table.

            Reasons

            PROBLEM: Here is the problem which is going to be addressed. The current implementation of the partitioning algorithm can't benefit from the direct association of Sources (RunDeepForcedSource) with the corresponding Objects (RunDeepSource). That would require building and maintaining a so-called secondary index mapping object identifiers to the corresponding Qserv chunks. The index will be built later.

            In the meantime the partitioning is made purely statically, based on the (RA,DECL) attributes of both tables, where the Objects table is known to be the director table which determines the boundaries of chunks. Entries of the Sources table will simply be placed into the corresponding partition/chunk based on an implicit association of their (RA,DECL) with those of the corresponding objects. The problem with this scheme is that the current instances of table RunDeepForcedSource do not carry the Objects' (RA,DECL) values, while the (RA,DECL) attributes of Sources fluctuate within some spatial region in the vicinity of their Object.

            This is illustrated by the result of the following query:

            SELECT
              ROUND(coord_ra-38.37141147524322, 12) AS 'diff_coord_ra',
              ROUND(coord_decl-1.148188656054416,12) AS 'diff_coord_decl'
            FROM 
              lsst_prod_dedupe_byfilter_u.RunDeepForcedSource
            WHERE
              objectId=2067094074033859
            ORDER BY
              diff_coord_ra;
            

            +-----------------+-----------------+
            | diff_coord_ra   | diff_coord_decl |
            +-----------------+-----------------+
            | -0.000161892500 | -0.000111360577 |
            | -0.000095904728 | -0.000020831633 |
            | -0.000094140933 | -0.000029997114 |
            | -0.000083455936 | -0.000460564277 |
            | -0.000046769964 | -0.000084374571 |
            | -0.000039476722 |  0.000091648320 |
            | -0.000027119942 |  0.000072532090 |
            | -0.000024605283 |  0.000003101783 |
            | -0.000012371979 | -0.000301252512 |
            | -0.000003916412 |  0.000170950966 |
            | -0.000000000000 |  0.000000000000 |
            |  0.000000000000 | -0.000000000000 |
            ...
            | -0.000000000000 | -0.000000000000 |
            | -0.000000000000 | -0.000000000000 |
            |  0.000018318412 |  0.000036593137 |
            |  0.000029191432 |  0.000110790091 |
            |  0.000034774232 | -0.000108592399 |
            |  0.000060996604 | -0.000148811948 |
            |  0.000072243194 | -0.000003792909 |
            |  0.000092347604 | -0.000036465276 |
            |  0.000094263668 |  0.000113629595 |
            |  0.000106699599 |  0.000305447839 |
            |  0.000195478015 | -0.000067243647 |
            |  0.000385799167 | -0.000061535130 |
            +-----------------+-----------------+
            92 rows in set (0.00 sec)
            

            SOLUTION: before dumping its contents into the CSV/TSV format, extend the table with two columns populated with the (RA,DECL) values copied from the corresponding entries of the Objects table (RunDeepSource). This operation will apply to all 10 instances of table RunDeepForcedSource. Note that the 5 tables of IN2P3 will be de-duplicated first.

            Creating a subset of table RunDeepSource (Objects) of IN2P3

            This table is quite large (considering just its IN2P3 half):

            • about 100 million rows
            • the total size (w/o indexes) is about 55 GB
            • the row size is 556 bytes

            Attempting a join of the Sources table RunDeepForcedSource_dedup (2 billion entries, 700 GB in size) with this one would be terribly slow on a machine with only 16 GB of memory. Hence the goal of this step is to create a smaller table which will:

            • have only 3 columns: id, coord_ra, coord_decl (3 x 64 bits = 24 bytes of payload, about 25 bytes per row with MyISAM overhead)
            • be roughly 20 times smaller than the full table, that is ~2.5 GB in size, which should fit into the machine's physical memory
            • be indexed on the object identifier column

            Create the table:

            CREATE TABLE lsst_prod_DC_2013_2.RunDeepSource_radecl (
              `id` bigint(20) NOT NULL,
              `coord_ra` double DEFAULT NULL,
              `coord_decl` double DEFAULT NULL,
              PRIMARY KEY (`id`)
            ) ENGINE=MyISAM;
            

            Populate the table with a subset:

            INSERT INTO lsst_prod_DC_2013_2.RunDeepSource_radecl
            SELECT id,coord_ra,coord_decl
            FROM lsst_prod_DC_2013_2.RunDeepSource;
            

            Query OK, 98088148 rows affected (13 min 23.19 sec)
            Records: 98088148  Duplicates: 0  Warnings: 0
            

            % ls -alh /data/mysql/db/lsst_prod_DC_2013_2/ | grep RunDeepSource_radecl
            -rw-rw----  1 mysql mysql 8.5K Sep  2 22:45 RunDeepSource_radecl.frm
            -rw-rw----  1 mysql mysql 2.3G Sep  2 23:01 RunDeepSource_radecl.MYD
            -rw-rw----  1 mysql mysql 1.7G Sep  2 23:01 RunDeepSource_radecl.MYI
            

            Observations:

            • the row size is: 25 Bytes
            • the table size is: 2.3 GB
            • the primary index's size is: 1.7 GB

            Creating a subset of table RunDeepSource (Objects) of NCSA

            Create the table:

            CREATE TABLE gapon_SDRP_Stripe82.RunDeepSource_radecl (
              `id` bigint(20) NOT NULL,
              `coord_ra` double DEFAULT NULL,
              `coord_decl` double DEFAULT NULL,
              PRIMARY KEY (`id`)
            ) ENGINE=MyISAM;
            

            Populate the table with a subset:

            INSERT INTO gapon_SDRP_Stripe82.RunDeepSource_radecl
            SELECT id,coord_ra,coord_decl
            FROM gapon_SDRP_Stripe82.RunDeepSource;
            

            Query OK, 98577782 rows affected (14 min 52.09 sec)
            Records: 98577782  Duplicates: 0  Warnings: 0
            

            Extending Sources of the IN2P3's u-band

            The operation will be using the de-duplicated table RunDeepForcedSource_dedup from database lsst_prod_dedupe_byfilter_u as its source and it will produce the extended table RunDeepForcedSource_dedup_radecl as its destination within the same database. The destination table will have two extra columns:

            • object_ra
            • object_decl

            No indexes (including the PRIMARY one) will be created for the destination table.

            Creating the extended table (two extra columns are added at the end):

            CREATE TABLE lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup_radecl
            LIKE lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup;
             
            ALTER TABLE lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup_radecl
            ADD COLUMN `object_coord_ra` double DEFAULT NULL;
             
            ALTER TABLE lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup_radecl
            ADD COLUMN `object_coord_decl` double DEFAULT NULL;
            

            Load the table with a small sample (10 million rows) to estimate the performance of the operation:

            INSERT INTO lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup_radecl
              SELECT rdfs.*,rdss.coord_ra,rdss.coord_decl
              FROM
                lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup `rdfs`
                  STRAIGHT_JOIN
                lsst_prod_DC_2013_2.RunDeepSource_radecl `rdss`
                  ON 
                rdfs.objectId = rdss.id
              LIMIT 10000000;
            

            Query OK, 10000000 rows affected (1 min 46.48 sec)
            Records: 10000000  Duplicates: 0  Warnings: 0
            

            Observations:

            • performance: ~94000 rows/sec (10000000 rows in 1 min 46.48 sec)
            • write I/O rate: 37 MB/sec

            Clearing the test data from the destination table and then processing all rows from the input table:

            DELETE FROM lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup_radecl;
             
            SELECT CONCAT(CURRENT_DATE(),' ',CURRENT_TIME()) AS 'BEGIN TIME';
             
            INSERT INTO lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup_radecl
              SELECT rdfs.*,rdss.coord_ra,rdss.coord_decl
              FROM
                lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup `rdfs`
                  STRAIGHT_JOIN
                lsst_prod_DC_2013_2.RunDeepSource_radecl `rdss`
                  ON 
                rdfs.objectId = rdss.id;
             
            SELECT CONCAT(CURRENT_DATE(),' ',CURRENT_TIME()) AS 'END TIME';
            

            +---------------------+
            | BEGIN TIME          |
            +---------------------+
            | 2016-09-03 01:04:48 |
            +---------------------+
             
            +---------------------+
            | END TIME            |
            +---------------------+
            | 2016-09-03 05:42:48 |
            +---------------------+
            

            Dumping sources of IN2P3's u-band into TSV

            % mysqldump -u root \
            lsst_prod_dedupe_byfilter_u RunDeepForcedSource_dedup_radecl \
            -T/data1/mysql_tsv_dumps/lsst_prod_dedupe_byfilter_u/
             
            % ls -alh /data1/mysql_tsv_dumps/lsst_prod_dedupe_byfilter_u/
            -rw-r--r-- 1 root  root  5.1K Sep  3 05:58 RunDeepForcedSource_dedup_radecl.sql
            -rw-rw-rw- 1 mysql mysql 1.2T Sep  3 18:59 RunDeepForcedSource_dedup_radecl.txt
             
            %  mv \
            /data1/mysql_tsv_dumps/lsst_prod_dedupe_byfilter_u/RunDeepForcedSource_dedup_radecl.txt \
            /data1/mysql_tsv_dumps/lsst_prod_dedupe_byfilter_u/RunDeepForcedSource_dedup_radecl.tsv
             
            % nohup pigz --fast \
            /data1/mysql_tsv_dumps/lsst_prod_dedupe_byfilter_u/RunDeepForcedSource_dedup_radecl.tsv&
            

            Extending Sources of the NCSA's u-band and making the TSV dump

            No indexes (including the PRIMARY one) should be defined for the destination table.

            Prepare the empty table:

            CREATE TABLE gapon_SDRP_Stripe82.RunDeepForcedSource_radecl
            LIKE gapon_SDRP_Stripe82.RunDeepForcedSource;
             
            DROP INDEX `PRIMARY`
            ON gapon_SDRP_Stripe82.RunDeepForcedSource_radecl;
             
            DROP INDEX `IDX_coord_htmId20`
            ON gapon_SDRP_Stripe82.RunDeepForcedSource_radecl;
             
            DROP INDEX `IDX_coord_decl`
            ON gapon_SDRP_Stripe82.RunDeepForcedSource_radecl;
             
            DROP INDEX `IDX_parent`
            ON gapon_SDRP_Stripe82.RunDeepForcedSource_radecl;
             
            DROP INDEX `IDX_exposure_id`
            ON gapon_SDRP_Stripe82.RunDeepForcedSource_radecl;
             
            DROP INDEX `IDX_exposure_filter_id`
            ON gapon_SDRP_Stripe82.RunDeepForcedSource_radecl;
             
            DROP INDEX `IDX_objectId`
            ON gapon_SDRP_Stripe82.RunDeepForcedSource_radecl;
             
            DROP INDEX `IDX_objectId_exposure`
            ON gapon_SDRP_Stripe82.RunDeepForcedSource_radecl;
             
             
            ALTER TABLE gapon_SDRP_Stripe82.RunDeepForcedSource_radecl
            ADD COLUMN `object_coord_ra` double DEFAULT NULL;
             
            ALTER TABLE gapon_SDRP_Stripe82.RunDeepForcedSource_radecl
            ADD COLUMN `object_coord_decl` double DEFAULT NULL;
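             
            With all eight indexes dropped, a quick verification that the destination table is really index-free before the bulk load (a check, not part of the original log):
             
            SHOW INDEX FROM gapon_SDRP_Stripe82.RunDeepForcedSource_radecl;
             
            An empty result set confirms the table carries no indexes.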
            

            Begin extending the table.

            SELECT CONCAT(CURRENT_DATE(),' ',CURRENT_TIME()) AS 'BEGIN TIME';
             
            INSERT INTO gapon_SDRP_Stripe82.RunDeepForcedSource_radecl
              SELECT rdfs.*,rdss.coord_ra,rdss.coord_decl
              FROM
                gapon_SDRP_Stripe82.RunDeepForcedSource `rdfs`
                  STRAIGHT_JOIN
                gapon_SDRP_Stripe82.RunDeepSource_radecl `rdss`
                  ON 
                rdfs.objectId = rdss.id;
             
            SELECT CONCAT(CURRENT_DATE(),' ',CURRENT_TIME()) AS 'END TIME';
            

            +---------------------+
            | BEGIN TIME          |
            +---------------------+
            | 2016-09-04 06:53:30 |
            +---------------------+
             
            +---------------------+
            | END TIME            |
            +---------------------+
            | 2016-09-04 11:25:05 |
            +---------------------+
            

            Dumping the table into the TSV format:

            % mysqldump -u root \
            gapon_SDRP_Stripe82 RunDeepForcedSource_radecl \
            -T/data1/mysql_tsv_dumps/daues_SDRP_dedupe_byfilter_0/
             
            % ls -alh /data1/mysql_tsv_dumps/daues_SDRP_dedupe_byfilter_0/
             
            % mv \
            /data1/mysql_tsv_dumps/daues_SDRP_dedupe_byfilter_0/RunDeepForcedSource_radecl.txt \
            /data1/mysql_tsv_dumps/daues_SDRP_dedupe_byfilter_0/RunDeepForcedSource_radecl.tsv
             
            % nohup pigz --fast \
            /data1/mysql_tsv_dumps/daues_SDRP_dedupe_byfilter_0/RunDeepForcedSource_radecl.tsv&
            

            Extending Sources of the NCSA's g-band and making the TSV dump

            No indexes (including the PRIMARY one) should be defined for the destination table.

            Prepare the empty table:

            CREATE TABLE daues_SDRP_dedupe_byfilter_1.RunDeepForcedSource_radecl
            LIKE daues_SDRP_dedupe_byfilter_1.RunDeepForcedSource;
             
            DROP INDEX `PRIMARY`
            ON daues_SDRP_dedupe_byfilter_1.RunDeepForcedSource_radecl;
             
            DROP INDEX `IDX_coord_htmId20`
            ON daues_SDRP_dedupe_byfilter_1.RunDeepForcedSource_radecl;
             
            DROP INDEX `IDX_coord_decl`
            ON daues_SDRP_dedupe_byfilter_1.RunDeepForcedSource_radecl;
             
            DROP INDEX `IDX_parent`
            ON daues_SDRP_dedupe_byfilter_1.RunDeepForcedSource_radecl;
             
            DROP INDEX `IDX_exposure_id`
            ON daues_SDRP_dedupe_byfilter_1.RunDeepForcedSource_radecl;
             
            DROP INDEX `IDX_exposure_filter_id`
            ON daues_SDRP_dedupe_byfilter_1.RunDeepForcedSource_radecl;
             
            DROP INDEX `IDX_objectId`
            ON daues_SDRP_dedupe_byfilter_1.RunDeepForcedSource_radecl;
             
            DROP INDEX `IDX_objectId_exposure`
            ON daues_SDRP_dedupe_byfilter_1.RunDeepForcedSource_radecl;
             
            ALTER TABLE daues_SDRP_dedupe_byfilter_1.RunDeepForcedSource_radecl
            ADD COLUMN `object_coord_ra` double DEFAULT NULL;
             
            ALTER TABLE daues_SDRP_dedupe_byfilter_1.RunDeepForcedSource_radecl
            ADD COLUMN `object_coord_decl` double DEFAULT NULL;
            
            

            Begin extending the table.

            SELECT CONCAT(CURRENT_DATE(),' ',CURRENT_TIME()) AS 'BEGIN TIME';
             
            INSERT INTO daues_SDRP_dedupe_byfilter_1.RunDeepForcedSource_radecl
              SELECT rdfs.*,rdss.coord_ra,rdss.coord_decl
              FROM
                daues_SDRP_dedupe_byfilter_1.RunDeepForcedSource `rdfs`
                  STRAIGHT_JOIN
                gapon_SDRP_Stripe82.RunDeepSource_radecl `rdss`
                  ON 
                rdfs.objectId = rdss.id;
             
            SELECT CONCAT(CURRENT_DATE(),' ',CURRENT_TIME()) AS 'END TIME';
            

            IN PROGRESS...

            +---------------------+
            | BEGIN TIME          |
            +---------------------+
            | 2016-09-04 19:37:57 |
            +---------------------+
            
            

            gapon Igor Gaponenko added a comment - - edited

            Planning the next steps

            Bulk processing the forced deep sources

            database                     | load | de-dupe | extend | dump | compress | notes
            daues_SDRP_dedupe_byfilter_0 |      | n/a     |        |      |          |
            daues_SDRP_dedupe_byfilter_1 |      | n/a     |        |      |          |
            daues_SDRP_dedupe_byfilter_2 |      | n/a     |        |      |          |
            daues_SDRP_dedupe_byfilter_3 |      | n/a     |        |      |          |
            daues_SDRP_dedupe_byfilter_4 |      | n/a     |        |      |          | a copy of this catalog has been migrated to lsst-dev7:/datasets/gapon/ uncompressed, which will make it easier to test the partitioning algorithm
            lsst_prod_dedupe_byfilter_u  |      |         |        |      |          |
            lsst_prod_dedupe_byfilter_g  |      |         |        |      |          |
            lsst_prod_dedupe_byfilter_r  |      |         |        |      |          |
            lsst_prod_dedupe_byfilter_i  |      |         |        |      |          |
            lsst_prod_dedupe_byfilter_z  |      |         |        |      |          |

            Other actions required to finish producing the complete input data set for the partitioning algorithm and the Qserv loader:

            1. do the actual merge of DeepCoadd and RunDeepSource, as well as of the metadata tables DeepCoadd_Metadata and DeepCoadd_To_Htm10. Report this in the corresponding comment section above, and do not forget to mention the right order of these operations.
            2. dump the rest of the tables from the main NCSA database in the format expected by the partitioning algorithm and the Qserv loader (see the sketch below)
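
            A minimal sketch of step 2, following the same mysqldump -T pattern used for the Sources dumps above (the table list and output directory here are illustrative only; the actual run would cover all remaining tables of gapon_SDRP_Stripe82):

            % mysqldump -u root \
            gapon_SDRP_Stripe82 DeepCoadd_Metadata DeepCoadd_To_Htm10 \
            -T/data1/mysql_tsv_dumps/gapon_SDRP_Stripe82/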

            These steps are expected to be properly documented and reported in comments of this JIRA ticket.

            The FINAL step would be to put the most important summary information on the whole process into the ticket's description. This should include the key decisions on the merge, the schema extension, and the location of the input dataset for the loader.

            gapon Igor Gaponenko added a comment -

            Summary

            The task has been completed. All dumps have been made and copied over to the GPFS file system:

            • lsst-dev7.ncsa.illinois.edu:/datasets/gapon/catalogs

            The intermediate data are still available on NCSA's OpenStack instance lsst-gapon-0 and the 4 attached volumes:

            • lsst-dm-gapon
            • lsst-dm-gapon-1
            • lsst-dm-gapon-2
            • lsst-dm-gapon-3

              People

              • Assignee:
                gapon Igor Gaponenko
                Reporter:
                gapon Igor Gaponenko
                Watchers:
                Fritz Mueller, Gregory Dubois-Felsmann, Igor Gaponenko