Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-7007

Investigate coverage of S13 databases found so far

    XMLWordPrintable

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: Qserv
    • Labels:

      Description

      Look at databases located at NCSA so far to assess if they cover the full survey. The databases to be evaluated are mentioned in DM-6905.

      According to S13 Testing Plan the S13 DRP dataset was split into two regions with an overalp used for cross-site verification:

      • NCSA: -40< R.A. < +10
      • IN2P3: +5 < R.A. < +55

      Hence a goal of this task is to identify which previously located candidate databases and files correspond to either or both of these ranges.

        Attachments

          Issue Links

            Activity

            Hide
            gapon Igor Gaponenko added a comment - - edited

            The final conclusions

            The study has reviled the following:

            • The complete (and scientifically meaningful) catalogs corresponding to NCSA-only half of the S13 dataset have been located withing LSST databse server at NCSA
            • The catalogs are spread between 6 databases: daues_SDRP_Stripe82_ncsa (CCD exposure metadata and deep sources) and daues_SDRP_dedupe_byfilter_0,1,2,3,4 (deep forced sources accros 5 wavelength bands).
            • there is a clear evidence that the referential integrity is possible across these databases
            • there is a clear evidence that the primary key structure of the RunDeepForcedSource tables across all 5 bands won't prevent merging the contents of these tables into a single one

            These are optional ideas which may still need to be tested:

            • running a comprehensive test to verify the referential integrity across tables of those 6 databases
            • verifying that no non-trivial intersection exists between keys of the deep forced source tables
            Show
            gapon Igor Gaponenko added a comment - - edited The final conclusions The study has reviled the following: The complete (and scientifically meaningful) catalogs corresponding to NCSA-only half of the S13 dataset have been located withing LSST databse server at NCSA The catalogs are spread between 6 databases: daues_SDRP_Stripe82_ncsa (CCD exposure metadata and deep sources) and daues_SDRP_dedupe_byfilter_0,1,2,3,4 (deep forced sources accros 5 wavelength bands). there is a clear evidence that the referential integrity is possible across these databases there is a clear evidence that the primary key structure of the RunDeepForcedSource tables across all 5 bands won't prevent merging the contents of these tables into a single one These are optional ideas which may still need to be tested: running a comprehensive test to verify the referential integrity across tables of those 6 databases verifying that no non-trivial intersection exists between keys of the deep forced source tables
            Hide
            gapon Igor Gaponenko added a comment - - edited

            The final sweep through other databases of lsst-db

            The following tests have been made in order to rule out a possibility that there is some other (or many) database on the server which might we well be involved into this investigation. The tests were based on patters established in the prior investigation of the well known databases. And we were looking for signs of a presence the substantial amount of data in the corresponding tables of the databases. All tables with the number of rows in the relevant tables which (the numbers) were less than the what had been found earlier were ignored. An additional constrain to be satisfied was a time when each table was created and updated. All tables updated before July 2013 were ignored. MySQL/MariaDB information schema was used in this investigation.

            Science exposures:

            SELECT
              TABLE_SCHEMA,
              TABLE_NAME,
              TABLE_ROWS,
              SUBSTR(CREATE_TIME,1,10) AS 'CREATE',
              SUBSTR(UPDATE_TIME,1,10) AS 'UPDATE'
            FROM
              INFORMATION_SCHEMA.TABLES
            WHERE
              TABLE_NAME='Science_Ccd_Exposure'
                AND
              TABLE_ROWS > 0
            ORDER BY
              TABLE_ROWS DESC;
            

            +--------------------------------------------------+----------------------+------------+------------+------------+
            | TABLE_SCHEMA                                     | TABLE_NAME           | TABLE_ROWS | CREATE     | UPDATE     |
            +--------------------------------------------------+----------------------+------------+------------+------------+
            | DC_W13_Stripe82                                  | Science_Ccd_Exposure |    1998219 | 2013-01-18 | 2013-01-18 |
            | daues_SDRP_Stripe82_ncsa                         | Science_Ccd_Exposure |    1403385 | 2013-07-31 | 2013-08-01 |
            | yusra_Stripe82_S13                               | Science_Ccd_Exposure |    1403385 | 2014-09-02 | 2014-09-02 |
            | yusra_rgb2                                       | Science_Ccd_Exposure |      83427 | 2012-12-30 | 2013-03-06 |
            | daues_S13_Stripe82_2000                          | Science_Ccd_Exposure |      79179 | 2013-06-22 | 2013-06-22 |
            ..
            

            Deep sources:

            SELECT
              TABLE_SCHEMA,
              TABLE_NAME,
              TABLE_ROWS,
              SUBSTR(CREATE_TIME,1,10) AS 'CREATE',
              SUBSTR(UPDATE_TIME,1,10) AS 'UPDATE'
            FROM
              INFORMATION_SCHEMA.TABLES
            WHERE
              TABLE_NAME='RunDeepSource'
            ORDER BY
              TABLE_ROWS DESC;
            

            +----------------------------------+---------------+------------+------------+------------+
            | TABLE_SCHEMA                     | TABLE_NAME    | TABLE_ROWS | CREATE     | UPDATE     |
            +----------------------------------+---------------+------------+------------+------------+
            | daues_SDRP_Stripe82_ncsa         | RunDeepSource |   98577782 | 2013-08-13 | 2013-08-13 |
            | DC_W13_Stripe82                  | RunDeepSource |   14670597 | 2012-12-01 | 2012-12-02 |
            | yusra_S13_Gordon                 | RunDeepSource |    5884398 | 2013-11-01 | 2013-11-01 |
            | daues_S13_Stripe82_2000          | RunDeepSource |    5131738 | 2013-06-27 | 2013-09-01 |
            | daues_coadd_test                 | RunDeepSource |    2836990 | 2013-08-08 | 2013-08-08 |
            | yusra_S13_Stripe82_demo          | RunDeepSource |     763995 | 2013-06-16 | 2013-06-16 |
            | daues_test1                      | RunDeepSource |     749895 | 2013-11-30 | 2013-11-30 |
            | daues_S13_Stripe82_1000          | RunDeepSource |     749895 | 2013-06-16 | 2013-06-16 |
            | shaw_Stripe82_demo               | RunDeepSource |     285785 | 2014-01-30 | 2014-01-30 |
            | pgee2000_a35                     | RunDeepSource |      92343 | 2013-07-07 | 2013-07-07 |
            ...
            

            Forced deep sources:

            SELECT
              TABLE_SCHEMA,
              TABLE_NAME,
              TABLE_ROWS,
              SUBSTR(CREATE_TIME,1,10) AS 'CREATE',
              SUBSTR(UPDATE_TIME,1,10) AS 'UPDATE'
            FROM
              INFORMATION_SCHEMA.TABLES
            WHERE
              TABLE_NAME='RunDeepForcedSource'
            ORDER BY
              TABLE_ROWS DESC;
            

            +------------------------------+---------------------+------------+------------+------------+
            | TABLE_SCHEMA                 | TABLE_NAME          | TABLE_ROWS | CREATE     | UPDATE     |
            +------------------------------+---------------------+------------+------------+------------+
            | DC_W13_Stripe82              | RunDeepForcedSource | 3870141868 | 2012-12-11 | 2012-12-11 |
            | daues_SDRP_dedupe_byfilter_2 | RunDeepForcedSource | 1754290686 | 2014-05-06 | 2014-05-30 |
            | daues_SDRP_dedupe_byfilter_3 | RunDeepForcedSource | 1754272841 | 2014-05-18 | 2014-05-18 |
            | daues_SDRP_dedupe_byfilter_1 | RunDeepForcedSource | 1752805399 | 2014-05-16 | 2014-05-17 |
            | daues_SDRP_dedupe_byfilter_4 | RunDeepForcedSource | 1751879670 | 2014-05-19 | 2014-05-19 |
            | daues_SDRP_dedupe_byfilter_0 | RunDeepForcedSource | 1729599791 | 2014-05-21 | 2014-05-21 |
            | daues_S13_Stripe82_2000      | RunDeepForcedSource |  447180788 | 2013-08-22 | 2013-08-22 |
            ..
            

            Observation

            • no other database meets desired criteria of the search
            Show
            gapon Igor Gaponenko added a comment - - edited The final sweep through other databases of lsst-db The following tests have been made in order to rule out a possibility that there is some other (or many) database on the server which might we well be involved into this investigation. The tests were based on patters established in the prior investigation of the well known databases. And we were looking for signs of a presence the substantial amount of data in the corresponding tables of the databases. All tables with the number of rows in the relevant tables which (the numbers) were less than the what had been found earlier were ignored. An additional constrain to be satisfied was a time when each table was created and updated. All tables updated before July 2013 were ignored. MySQL/MariaDB information schema was used in this investigation. Science exposures: SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_ROWS, SUBSTR(CREATE_TIME,1,10) AS 'CREATE' , SUBSTR(UPDATE_TIME,1,10) AS 'UPDATE' FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME= 'Science_Ccd_Exposure' AND TABLE_ROWS > 0 ORDER BY TABLE_ROWS DESC ; +--------------------------------------------------+----------------------+------------+------------+------------+ | TABLE_SCHEMA | TABLE_NAME | TABLE_ROWS | CREATE | UPDATE | +--------------------------------------------------+----------------------+------------+------------+------------+ | DC_W13_Stripe82 | Science_Ccd_Exposure | 1998219 | 2013-01-18 | 2013-01-18 | | daues_SDRP_Stripe82_ncsa | Science_Ccd_Exposure | 1403385 | 2013-07-31 | 2013-08-01 | | yusra_Stripe82_S13 | Science_Ccd_Exposure | 1403385 | 2014-09-02 | 2014-09-02 | | yusra_rgb2 | Science_Ccd_Exposure | 83427 | 2012-12-30 | 2013-03-06 | | daues_S13_Stripe82_2000 | Science_Ccd_Exposure | 79179 | 2013-06-22 | 2013-06-22 | .. Deep sources: SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_ROWS, SUBSTR(CREATE_TIME,1,10) AS 'CREATE' , SUBSTR(UPDATE_TIME,1,10) AS 'UPDATE' FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME= 'RunDeepSource' ORDER BY TABLE_ROWS DESC ; +----------------------------------+---------------+------------+------------+------------+ | TABLE_SCHEMA | TABLE_NAME | TABLE_ROWS | CREATE | UPDATE | +----------------------------------+---------------+------------+------------+------------+ | daues_SDRP_Stripe82_ncsa | RunDeepSource | 98577782 | 2013-08-13 | 2013-08-13 | | DC_W13_Stripe82 | RunDeepSource | 14670597 | 2012-12-01 | 2012-12-02 | | yusra_S13_Gordon | RunDeepSource | 5884398 | 2013-11-01 | 2013-11-01 | | daues_S13_Stripe82_2000 | RunDeepSource | 5131738 | 2013-06-27 | 2013-09-01 | | daues_coadd_test | RunDeepSource | 2836990 | 2013-08-08 | 2013-08-08 | | yusra_S13_Stripe82_demo | RunDeepSource | 763995 | 2013-06-16 | 2013-06-16 | | daues_test1 | RunDeepSource | 749895 | 2013-11-30 | 2013-11-30 | | daues_S13_Stripe82_1000 | RunDeepSource | 749895 | 2013-06-16 | 2013-06-16 | | shaw_Stripe82_demo | RunDeepSource | 285785 | 2014-01-30 | 2014-01-30 | | pgee2000_a35 | RunDeepSource | 92343 | 2013-07-07 | 2013-07-07 | ... Forced deep sources: SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_ROWS, SUBSTR(CREATE_TIME,1,10) AS 'CREATE' , SUBSTR(UPDATE_TIME,1,10) AS 'UPDATE' FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME= 'RunDeepForcedSource' ORDER BY TABLE_ROWS DESC ; +------------------------------+---------------------+------------+------------+------------+ | TABLE_SCHEMA | TABLE_NAME | TABLE_ROWS | CREATE | UPDATE | +------------------------------+---------------------+------------+------------+------------+ | DC_W13_Stripe82 | RunDeepForcedSource | 3870141868 | 2012-12-11 | 2012-12-11 | | daues_SDRP_dedupe_byfilter_2 | RunDeepForcedSource | 1754290686 | 2014-05-06 | 2014-05-30 | | daues_SDRP_dedupe_byfilter_3 | RunDeepForcedSource | 1754272841 | 2014-05-18 | 2014-05-18 | | daues_SDRP_dedupe_byfilter_1 | RunDeepForcedSource | 1752805399 | 2014-05-16 | 2014-05-17 | | daues_SDRP_dedupe_byfilter_4 | RunDeepForcedSource | 1751879670 | 2014-05-19 | 2014-05-19 | | daues_SDRP_dedupe_byfilter_0 | RunDeepForcedSource | 1729599791 | 2014-05-21 | 2014-05-21 | | daues_S13_Stripe82_2000 | RunDeepForcedSource | 447180788 | 2013-08-22 | 2013-08-22 | .. Observation no other database meets desired criteria of the search
            Hide
            gapon Igor Gaponenko added a comment - - edited

            daues_SDRP_dedupe_byfilter_0(1,2,3,4)

            According to Greg, these contain the final deduped forced sources for each band as shown by the following filter definition:

            SELECT filterId,filterName FROM daues_SDRP_dedupe_byfilter_0.Filter;
            

            +----------+------------+
            | filterId | filterName |
            +----------+------------+
            |        0 | u          |
            |        1 | g          |
            |        2 | r          |
            |        3 | i          |
            |        4 | z          |
            |      -99 | DD         |
            +----------+------------+
            

            Non-empty tables:

            SELECT
              TABLE_NAME, TABLE_ROWS,
              SUBSTR(CREATE_TIME,1,10) AS 'CREATE',
              SUBSTR(UPDATE_TIME,1,10) AS 'UPDATE'
            FROM
              INFORMATION_SCHEMA.TABLES
            WHERE
              TABLE_SCHEMA='daues_SDRP_dedupe_byfilter_0'
                AND
              TABLE_ROWS > 0;
            

            +---------------------+------------+------------+------------+
            | TABLE_NAME          | TABLE_ROWS | CREATE     | UPDATE     |
            +---------------------+------------+------------+------------+
            | Filter              |          6 | 2013-09-09 | 2013-09-09 |
            | LeapSeconds         |         39 | 2013-09-09 | 2013-09-09 |
            | RunDeepForcedSource | 1729599791 | 2014-05-21 | 2014-05-21 |
            | ZZZ_Db_Description  |          1 | 2013-09-09 | 2013-09-09 |
            +---------------------+------------+------------+------------+
            

            Each table has over 1.7 billion entries and is nearly 700 GB in size:

            SELECT
              TABLE_SCHEMA,
              TABLE_NAME,
              TABLE_ROWS,
              DATA_LENGTH,
              SUBSTR(CREATE_TIME,1,10) AS 'CREATE',SUBSTR(UPDATE_TIME,1,10) AS 'UPDATE'
            FROM
              INFORMATION_SCHEMA.TABLES
            WHERE
              TABLE_SCHEMA LIKE 'daues_SDRP_dedupe_byfilter_%'
                AND
              ABLE_NAME='RunDeepForcedSource';
            

            +------------------------------+---------------------+------------+--------------+------------+------------+
            | TABLE_SCHEMA                 | TABLE_NAME          | TABLE_ROWS | DATA_LENGTH  | CREATE     | UPDATE     |
            +------------------------------+---------------------+------------+--------------+------------+------------+
            | daues_SDRP_dedupe_byfilter_0 | RunDeepForcedSource | 1729599791 | 683191917445 | 2014-05-21 | 2014-05-21 |
            | daues_SDRP_dedupe_byfilter_1 | RunDeepForcedSource | 1752805399 | 692358132605 | 2014-05-16 | 2014-05-17 |
            | daues_SDRP_dedupe_byfilter_2 | RunDeepForcedSource | 1754290686 | 692944820970 | 2014-05-06 | 2014-05-30 |
            | daues_SDRP_dedupe_byfilter_3 | RunDeepForcedSource | 1754272841 | 692937772195 | 2014-05-18 | 2014-05-18 |
            | daues_SDRP_dedupe_byfilter_4 | RunDeepForcedSource | 1751879670 | 691992469650 | 2014-05-19 | 2014-05-19 |
            +------------------------------+---------------------+------------+--------------+------------+------------+
            

            The RA region covered by the tables (judging on one band only) is (ATTENTION : this is a very long query. It takes about 30 minutes for the query to finish):

            SELECT
              MIN(CASE WHEN coord_ra < 180 THEN coord_ra ELSE coord_ra - 360 END) AS 'min_ra',
              MAX(CASE WHEN coord_ra < 180 THEN coord_ra ELSE coord_ra - 360 END) AS 'max_ra'
            FROM
              RunDeepForcedSource;
            

            +--------------------+--------------------+
            | min_ra             | max_ra             |
            +--------------------+--------------------+
            | -40.17089170204929 | 10.036033419213043 |
            +--------------------+--------------------+
            

            Evaluating primary key constrains of the tables

            Here we going to evaluate a possibility of merging the contents of these tables into a single one which would contain forced sources across all bands.

            The primary key constrain on the table requires uniqueness only for id:

            SHOW CREATE TABLE daues_SDRP_dedupe_byfilter_0.RunDeepForcedSource\G
            

              `id` bigint(20) NOT NULL,
              ..
              `objectId` bigint(20) NOT NULL,
              ..
              PRIMARY KEY (`id`),
              ..
              KEY `IDX_objectId` (`objectId`),
              ..
            

            Where the other column objectId serves as AKA foreign key for the corresponding entries in the DeepSource (or RunDeepSource) tables.

            Fortunately, the values of the primary keys (id) in tables RunDeepForcedSource of all bands do not intersect. This can be demonstrated with this technique:

            CREATE TEMPORARY TABLE
              tmpRunDeepForcedSource
            LIKE
              daues_SDRP_dedupe_byfilter_0.RunDeepForcedSource;
            

            INSERT INTO
              tmpRunDeepForcedSource
            SELECT
              *
            FROM
              daues_SDRP_dedupe_byfilter_0.RunDeepForcedSource
            WHERE
              objectId=217773672520226380;
             
            ...
             
            INSERT INTO
              tmpRunDeepForcedSource
            SELECT
              *
            FROM
              daues_SDRP_dedupe_byfilter_4.RunDeepForcedSource
            WHERE
              objectId=217773672520226380;
            

            This code won't cause any violation of the primary key constrain. And the contents of the temporary table will be the union of result sets select from all 5 input tables.

            SELECT
              id,
              objectId,
              COUNT(*)
            FROM
              tmpRunDeepForcedSource
            GROUP BY
              id
            ORDER BY
              id;
            

            +--------------------+--------------------+----------+
            | id                 | objectId           | COUNT(*) |
            +--------------------+--------------------+----------+
            |  69324128875708417 | 217773672520226380 |        1 |
            |  69330839762108417 | 217773672520226380 |        1 |
            |  69337550648508417 | 217773672520226380 |        1 |
            |  69344261534908417 | 217773672520226380 |        1 |
            |  69350972421308417 | 217773672520226380 |        1 |
            | 160055318036873217 | 217773672520226380 |        1 |
            | 160062028923273217 | 217773672520226380 |        1 |
            | 160068739809673217 | 217773672520226380 |        1 |
            | 160075450696073217 | 217773672520226380 |        1 |
            | 160082161582473217 | 217773672520226380 |        1 |
            | 178443148853248001 | 217773672520226380 |        1 |
            | 178449859739648001 | 217773672520226380 |        1 |
            | 178456570626048001 | 217773672520226380 |        1 |
            | 178463281512448001 | 217773672520226380 |        1 |
            | 178469992398848001 | 217773672520226380 |        1 |
            ...
            

            Evaluating cross-database referential integrity

            This is another important precondition to be asserted on a way of an idea of merging the tables of daues_SDRP_Stripe82_ncsa with 5 band-specific databases. There are two external dependencies (AKA foreign keys) in the forced deep sources tables (RunDeepForcedSource) which would need to be satisfied:

            • objectId
            • exposure_id

            It turns out the dependencies are resolved as follows:

            RunDeepForcedSource key target table key
            objectId daues_SDRP_Stripe82_ncsa.DeepSource deepSourceId
            exposure_id daues_SDRP_Stripe82_ncsa.Science_Ccd_Exposure scienceCcdExposureId

            A weak proof of this referential integrity can be demonstrated with the following test (based on any row from the deep forced sources table:

            SELECT
              rdfs.objectId,
              rdfs.exposure_id,
              sce.run
            FROM
              daues_SDRP_dedupe_byfilter_0.RunDeepForcedSource rdfs,
              daues_SDRP_Stripe82_ncsa.DeepSource ds,
              daues_SDRP_Stripe82_ncsa.Science_Ccd_Exposure sce
            WHERE
              rdfs.objectId=217773672520226380
                AND
              rdfs.objectId=ds.deepSourceId
                AND
              rdfs.exposure_id=sce.scienceCcdExposureId;
            

            +--------------------+-------------+------+
            | objectId           | exposure_id | run  |
            +--------------------+-------------+------+
            | 217773672520226380 |  1033010019 | 1033 |
            | 217773672520226380 |  2385010094 | 2385 |
            | 217773672520226380 |  2659010125 | 2659 |
            | 217773672520226380 |  3384010101 | 3384 |
            | 217773672520226380 |  3437010075 | 3437 |
            | 217773672520226380 |  4184010122 | 4184 |
            ...
            | 217773672520226380 |  7195010091 | 7195 |
            | 217773672520226380 |  7202010034 | 7202 |
            | 217773672520226380 |  7202010035 | 7202 |
            +--------------------+-------------+------+
            77 rows in set (0.63 sec)
            

            This corresponds to the same number of entries found in table RunDeepForcedSource for this objectId.
            NOTE: strictly speaking this is NOT a proof. Though, a more reliable verification procedure could be establish to verify/guarantee the referential integrity.

            Observations

            • each of these databases contains only forced sources. No other data from the preceding S13 DRP pipeline stages are present.
            • according to the RA range seen in the database its contents corresponds to the NCSA-only subset of the full dataset
            • it's quite clear that the contents of table RunDeepForcedSource from all these databases could be combined within a single large table covering all bands due to non-overlapping sequences of the primary key id in the table instances.
            • the referential integrity between tables of databases daues_SDRP_Stripe82_ncsa and daues_SDRP_dedupe_byfilter_0,1,2,3,4 has been demostrated
            • it also seems to be conceivable to merge tables RunDeepForcedSource into the large database daues_SDRP_Stripe82_ncsa which holds catalogs from the first stages of the processing pipeline. This will yield the complete catalog of the first (NCSA) half of the dataset.
            Show
            gapon Igor Gaponenko added a comment - - edited daues_SDRP_dedupe_byfilter_0(1,2,3,4) According to Greg, these contain the final deduped forced sources for each band as shown by the following filter definition: SELECT filterId,filterName FROM daues_SDRP_dedupe_byfilter_0.Filter; +----------+------------+ | filterId | filterName | +----------+------------+ | 0 | u | | 1 | g | | 2 | r | | 3 | i | | 4 | z | | -99 | DD | +----------+------------+ Non-empty tables: SELECT TABLE_NAME, TABLE_ROWS, SUBSTR(CREATE_TIME,1,10) AS 'CREATE' , SUBSTR(UPDATE_TIME,1,10) AS 'UPDATE' FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA= 'daues_SDRP_dedupe_byfilter_0' AND TABLE_ROWS > 0; +---------------------+------------+------------+------------+ | TABLE_NAME | TABLE_ROWS | CREATE | UPDATE | +---------------------+------------+------------+------------+ | Filter | 6 | 2013-09-09 | 2013-09-09 | | LeapSeconds | 39 | 2013-09-09 | 2013-09-09 | | RunDeepForcedSource | 1729599791 | 2014-05-21 | 2014-05-21 | | ZZZ_Db_Description | 1 | 2013-09-09 | 2013-09-09 | +---------------------+------------+------------+------------+ Each table has over 1.7 billion entries and is nearly 700 GB in size: SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_ROWS, DATA_LENGTH, SUBSTR(CREATE_TIME,1,10) AS 'CREATE' ,SUBSTR(UPDATE_TIME,1,10) AS 'UPDATE' FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA LIKE 'daues_SDRP_dedupe_byfilter_%' AND ABLE_NAME= 'RunDeepForcedSource' ; +------------------------------+---------------------+------------+--------------+------------+------------+ | TABLE_SCHEMA | TABLE_NAME | TABLE_ROWS | DATA_LENGTH | CREATE | UPDATE | +------------------------------+---------------------+------------+--------------+------------+------------+ | daues_SDRP_dedupe_byfilter_0 | RunDeepForcedSource | 1729599791 | 683191917445 | 2014-05-21 | 2014-05-21 | | daues_SDRP_dedupe_byfilter_1 | RunDeepForcedSource | 1752805399 | 692358132605 | 2014-05-16 | 2014-05-17 | | daues_SDRP_dedupe_byfilter_2 | RunDeepForcedSource | 1754290686 | 692944820970 | 2014-05-06 | 2014-05-30 | | daues_SDRP_dedupe_byfilter_3 | RunDeepForcedSource | 1754272841 | 692937772195 | 2014-05-18 | 2014-05-18 | | daues_SDRP_dedupe_byfilter_4 | RunDeepForcedSource | 1751879670 | 691992469650 | 2014-05-19 | 2014-05-19 | +------------------------------+---------------------+------------+--------------+------------+------------+ The RA region covered by the tables (judging on one band only) is ( ATTENTION : this is a very long query. It takes about 30 minutes for the query to finish): SELECT MIN ( CASE WHEN coord_ra < 180 THEN coord_ra ELSE coord_ra - 360 END ) AS 'min_ra' , MAX ( CASE WHEN coord_ra < 180 THEN coord_ra ELSE coord_ra - 360 END ) AS 'max_ra' FROM RunDeepForcedSource; +--------------------+--------------------+ | min_ra | max_ra | +--------------------+--------------------+ | -40.17089170204929 | 10.036033419213043 | +--------------------+--------------------+ Evaluating primary key constrains of the tables Here we going to evaluate a possibility of merging the contents of these tables into a single one which would contain forced sources across all bands. The primary key constrain on the table requires uniqueness only for id : SHOW CREATE TABLE daues_SDRP_dedupe_byfilter_0.RunDeepForcedSource\G `id` bigint(20) NOT NULL, .. `objectId` bigint(20) NOT NULL, .. PRIMARY KEY (`id`), .. KEY `IDX_objectId` (`objectId`), .. Where the other column objectId serves as AKA foreign key for the corresponding entries in the DeepSource (or RunDeepSource ) tables. Fortunately, the values of the primary keys ( id ) in tables RunDeepForcedSource of all bands do not intersect. This can be demonstrated with this technique: CREATE TEMPORARY TABLE tmpRunDeepForcedSource LIKE daues_SDRP_dedupe_byfilter_0.RunDeepForcedSource; INSERT INTO tmpRunDeepForcedSource SELECT * FROM daues_SDRP_dedupe_byfilter_0.RunDeepForcedSource WHERE objectId=217773672520226380;   ...   INSERT INTO tmpRunDeepForcedSource SELECT * FROM daues_SDRP_dedupe_byfilter_4.RunDeepForcedSource WHERE objectId=217773672520226380; This code won't cause any violation of the primary key constrain. And the contents of the temporary table will be the union of result sets select from all 5 input tables. SELECT id, objectId, COUNT (*) FROM tmpRunDeepForcedSource GROUP BY id ORDER BY id; +--------------------+--------------------+----------+ | id | objectId | COUNT(*) | +--------------------+--------------------+----------+ | 69324128875708417 | 217773672520226380 | 1 | | 69330839762108417 | 217773672520226380 | 1 | | 69337550648508417 | 217773672520226380 | 1 | | 69344261534908417 | 217773672520226380 | 1 | | 69350972421308417 | 217773672520226380 | 1 | | 160055318036873217 | 217773672520226380 | 1 | | 160062028923273217 | 217773672520226380 | 1 | | 160068739809673217 | 217773672520226380 | 1 | | 160075450696073217 | 217773672520226380 | 1 | | 160082161582473217 | 217773672520226380 | 1 | | 178443148853248001 | 217773672520226380 | 1 | | 178449859739648001 | 217773672520226380 | 1 | | 178456570626048001 | 217773672520226380 | 1 | | 178463281512448001 | 217773672520226380 | 1 | | 178469992398848001 | 217773672520226380 | 1 | ... Evaluating cross-database referential integrity This is another important precondition to be asserted on a way of an idea of merging the tables of daues_SDRP_Stripe82_ncsa with 5 band-specific databases. There are two external dependencies (AKA foreign keys) in the forced deep sources tables ( RunDeepForcedSource ) which would need to be satisfied: objectId exposure_id It turns out the dependencies are resolved as follows: RunDeepForcedSource key target table key objectId daues_SDRP_Stripe82_ncsa.DeepSource deepSourceId exposure_id daues_SDRP_Stripe82_ncsa.Science_Ccd_Exposure scienceCcdExposureId A weak proof of this referential integrity can be demonstrated with the following test (based on any row from the deep forced sources table: SELECT rdfs.objectId, rdfs.exposure_id, sce.run FROM daues_SDRP_dedupe_byfilter_0.RunDeepForcedSource rdfs, daues_SDRP_Stripe82_ncsa.DeepSource ds, daues_SDRP_Stripe82_ncsa.Science_Ccd_Exposure sce WHERE rdfs.objectId=217773672520226380 AND rdfs.objectId=ds.deepSourceId AND rdfs.exposure_id=sce.scienceCcdExposureId; +--------------------+-------------+------+ | objectId | exposure_id | run | +--------------------+-------------+------+ | 217773672520226380 | 1033010019 | 1033 | | 217773672520226380 | 2385010094 | 2385 | | 217773672520226380 | 2659010125 | 2659 | | 217773672520226380 | 3384010101 | 3384 | | 217773672520226380 | 3437010075 | 3437 | | 217773672520226380 | 4184010122 | 4184 | ... | 217773672520226380 | 7195010091 | 7195 | | 217773672520226380 | 7202010034 | 7202 | | 217773672520226380 | 7202010035 | 7202 | +--------------------+-------------+------+ 77 rows in set (0.63 sec) This corresponds to the same number of entries found in table RunDeepForcedSource for this objectId . NOTE : strictly speaking this is NOT a proof. Though, a more reliable verification procedure could be establish to verify/guarantee the referential integrity. Observations each of these databases contains only forced sources . No other data from the preceding S13 DRP pipeline stages are present. according to the RA range seen in the database its contents corresponds to the NCSA-only subset of the full dataset it's quite clear that the contents of table RunDeepForcedSource from all these databases could be combined within a single large table covering all bands due to non-overlapping sequences of the primary key id in the table instances. the referential integrity between tables of databases daues_SDRP_Stripe82_ncsa and daues_SDRP_dedupe_byfilter_0,1,2,3,4 has been demostrated it also seems to be conceivable to merge tables RunDeepForcedSource into the large database daues_SDRP_Stripe82_ncsa which holds catalogs from the first stages of the processing pipeline. This will yield the complete catalog of the first (NCSA) half of the dataset.
            Hide
            gapon Igor Gaponenko added a comment - - edited

            daues_S13_Stripe82_2000

            This is a very interesting one because it contains the forced photometry. The database has also been "de-duplicated". In some sense this is the most complete database both from the scientific point of view and also per Yusra's instructions (see the documentation link in the Description section of this task).

            SELECT
              TABLE_NAME, TABLE_ROWS,
              SUBSTR(CREATE_TIME,1,10) AS 'CREATE',
              SUBSTR(UPDATE_TIME,1,10) AS 'UPDATE'
            FROM
              INFORMATION_SCHEMA.TABLES
            WHERE
              TABLE_SCHEMA='daues_S13_Stripe82_2000'
                AND
              TABLE_ROWS > 0;
            

            +-------------------------------+------------+------------+------------+
            | TABLE_NAME                    | TABLE_ROWS | CREATE     | UPDATE     |
            +-------------------------------+------------+------------+------------+
            | DeepCoadd                     |        720 | 2013-06-22 | 2013-06-27 |
            | DeepCoadd_Metadata            |     148296 | 2013-06-22 | 2013-06-27 |
            | DeepCoadd_To_Htm10            |      17000 | 2013-06-22 | 2013-06-27 |
            | Filter                        |          6 | 2013-06-22 | 2013-06-22 |
            | LeapSeconds                   |         39 | 2013-06-22 | 2013-06-22 |
            | RefDeepSrcMatch               |   25731005 | 2013-07-04 | 2013-07-04 |
            | RefObject                     |     471554 | 2013-07-04 | 2013-07-04 |
            | RunDeepForcedSource           |  447180788 | 2013-08-22 | 2013-08-22 |
            | RunDeepForcedSource4          |   90030986 | 2013-09-30 | 2013-10-01 |
            | RunDeepSource                 |    5131738 | 2013-06-27 | 2013-09-01 |
            | Science_Ccd_Exposure          |      79179 | 2013-06-22 | 2013-06-22 |
            | Science_Ccd_Exposure_Metadata |    9263943 | 2013-06-22 | 2013-06-22 |
            | Science_Ccd_Exposure_To_Htm10 |    1514810 | 2013-06-22 | 2013-06-22 |
            | ZZZ_Db_Description            |          1 | 2013-06-22 | 2013-06-22 |
            | medianForcedPhot              |    1068212 | 2014-03-11 | 2014-03-11 |
            | medianForcedPhotStd           |    1068212 | 2014-03-12 | 2014-03-12 |
            | medianForcedPhotStdByFilter   |    5340082 | 2014-03-12 | 2014-03-12 |
            | tmp_dedupe4                   |    7677429 | 2013-09-30 | 2013-09-30 |
            +-------------------------------+------------+------------+------------+
            

            SELECT
              TABLE_NAME
            FROM
              INFORMATION_SCHEMA.VIEWS
            WHERE
              TABLE_SCHEMA='daues_S13_Stripe82_2000';
            

            +------------------+
            | TABLE_NAME       |
            +------------------+
            | DeepForcedSource |
            | DeepSource       |
            +------------------+
            

            The RA coverage range of the database is:

            SELECT
              MIN(CASE WHEN ra < 180 THEN ra ELSE ra - 360 END) AS 'min_ra',
              MAX(CASE WHEN ra < 180 THEN ra ELSE ra - 360 END) AS 'max_ra'
            FROM
              DeepSource;
            

            +-------------------+---------+
            | min_ra            | max_ra  |
            +-------------------+---------+
            | 4.957029999999996 | 7.60286 |
            +-------------------+---------+
            

            Observations

            • the database contains results from all stages of the pipeline, including: deep sources (from coadds), deep forced sources and de-duplication.
            • only about 5% of the deep sources are found in this database compared with previously discussed database daues_SDRP_Stripe82_ncsa
            • most tables were created and populated BEFORE the full-scale S13 DRP effort. The only exception was for the last two stages of the pipeline: forced photometry and de-duplication. Those were done in September 2013.
            • Apparently this database represents a full scale (complete in a sense of a variety of data produced by the S13 DRP) effort of testing the pipeline before processing all data within the OVERLAP region of the dataset.
            Show
            gapon Igor Gaponenko added a comment - - edited daues_S13_Stripe82_2000 This is a very interesting one because it contains the forced photometry. The database has also been "de-duplicated". In some sense this is the most complete database both from the scientific point of view and also per Yusra's instructions (see the documentation link in the Description section of this task). SELECT TABLE_NAME, TABLE_ROWS, SUBSTR(CREATE_TIME,1,10) AS 'CREATE' , SUBSTR(UPDATE_TIME,1,10) AS 'UPDATE' FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA= 'daues_S13_Stripe82_2000' AND TABLE_ROWS > 0; +-------------------------------+------------+------------+------------+ | TABLE_NAME | TABLE_ROWS | CREATE | UPDATE | +-------------------------------+------------+------------+------------+ | DeepCoadd | 720 | 2013-06-22 | 2013-06-27 | | DeepCoadd_Metadata | 148296 | 2013-06-22 | 2013-06-27 | | DeepCoadd_To_Htm10 | 17000 | 2013-06-22 | 2013-06-27 | | Filter | 6 | 2013-06-22 | 2013-06-22 | | LeapSeconds | 39 | 2013-06-22 | 2013-06-22 | | RefDeepSrcMatch | 25731005 | 2013-07-04 | 2013-07-04 | | RefObject | 471554 | 2013-07-04 | 2013-07-04 | | RunDeepForcedSource | 447180788 | 2013-08-22 | 2013-08-22 | | RunDeepForcedSource4 | 90030986 | 2013-09-30 | 2013-10-01 | | RunDeepSource | 5131738 | 2013-06-27 | 2013-09-01 | | Science_Ccd_Exposure | 79179 | 2013-06-22 | 2013-06-22 | | Science_Ccd_Exposure_Metadata | 9263943 | 2013-06-22 | 2013-06-22 | | Science_Ccd_Exposure_To_Htm10 | 1514810 | 2013-06-22 | 2013-06-22 | | ZZZ_Db_Description | 1 | 2013-06-22 | 2013-06-22 | | medianForcedPhot | 1068212 | 2014-03-11 | 2014-03-11 | | medianForcedPhotStd | 1068212 | 2014-03-12 | 2014-03-12 | | medianForcedPhotStdByFilter | 5340082 | 2014-03-12 | 2014-03-12 | | tmp_dedupe4 | 7677429 | 2013-09-30 | 2013-09-30 | +-------------------------------+------------+------------+------------+ SELECT TABLE_NAME FROM INFORMATION_SCHEMA.VIEWS WHERE TABLE_SCHEMA= 'daues_S13_Stripe82_2000' ; +------------------+ | TABLE_NAME | +------------------+ | DeepForcedSource | | DeepSource | +------------------+ The RA coverage range of the database is: SELECT MIN ( CASE WHEN ra < 180 THEN ra ELSE ra - 360 END ) AS 'min_ra' , MAX ( CASE WHEN ra < 180 THEN ra ELSE ra - 360 END ) AS 'max_ra' FROM DeepSource; +-------------------+---------+ | min_ra | max_ra | +-------------------+---------+ | 4.957029999999996 | 7.60286 | +-------------------+---------+ Observations the database contains results from all stages of the pipeline, including: deep sources (from coadds), deep forced sources and de-duplication. only about 5% of the deep sources are found in this database compared with previously discussed database daues_SDRP_Stripe82_ncsa most tables were created and populated BEFORE the full-scale S13 DRP effort. The only exception was for the last two stages of the pipeline: forced photometry and de-duplication . Those were done in September 2013. Apparently this database represents a full scale (complete in a sense of a variety of data produced by the S13 DRP) effort of testing the pipeline before processing all data within the OVERLAP region of the dataset.
            Hide
            gapon Igor Gaponenko added a comment - - edited

            daues_SDRP_Stripe82_ncsa

            Non-empty tables:

            SELECT
              TABLE_NAME, TABLE_ROWS,
              SUBSTR(CREATE_TIME,1,10) AS 'CREATE',
              SUBSTR(UPDATE_TIME,1,10) AS 'UPDATE'
            FROM
              INFORMATION_SCHEMA.TABLES
            WHERE
              TABLE_SCHEMA='daues_SDRP_Stripe82_ncsa'
                AND
              TABLE_ROWS > 0;
            

            +-------------------------------+------------+------------+------------+
            | TABLE_NAME                    | TABLE_ROWS | CREATE     | UPDATE     |
            +-------------------------------+------------+------------+------------+
            | DeepCoadd                     |      13735 | 2013-07-31 | 2013-08-09 |
            | DeepCoadd_Metadata            |    2809690 | 2013-07-31 | 2013-08-09 |
            | DeepCoadd_To_Htm10            |     326420 | 2013-07-31 | 2013-08-09 |
            | Filter                        |          6 | 2013-07-31 | 2013-07-31 |
            | LeapSeconds                   |         39 | 2013-07-31 | 2013-07-31 |
            | RunDeepSource                 |   98577782 | 2013-08-13 | 2013-08-13 |
            | Science_Ccd_Exposure          |    1403385 | 2013-07-31 | 2013-08-01 |
            | Science_Ccd_Exposure_Metadata |  164196045 | 2013-07-31 | 2013-08-01 |
            | Science_Ccd_Exposure_To_Htm10 |   26861430 | 2013-07-31 | 2013-08-01 |
            | ZZZ_Db_Description            |          1 | 2013-07-31 | 2013-07-31 |
            +-------------------------------+------------+------------+------------+
            

            SELECT
              TABLE_NAME
            FROM
              INFORMATION_SCHEMA.VIEWS
            WHERE
              TABLE_SCHEMA='daues_SDRP_Stripe82_ncsa';
            

            +------------+
            | TABLE_NAME |
            +------------+
            | DeepSource |
            +------------+
            

            Evaluating RA range of the deep sources:

            SELECT
              MIN(CASE WHEN ra < 180 THEN ra ELSE ra - 360 END) AS 'min_ra',
              MAX(CASE WHEN ra < 180 THEN ra ELSE ra - 360 END) AS 'max_ra'
            FROM
              DeepSource;
            

            +--------------------+--------------------+
            | min_ra             | max_ra             |
            +--------------------+--------------------+
            | -40.16296999999997 | 10.022860000000001 |
            +--------------------+--------------------+
            

            And a histogram:

            SELECT
              ROUND(CASE WHEN ra < 180 THEN ra ELSE ra - 360 END) as chan,
              COUNT(*)
            FROM
              DeepSource
            GROUP BY
              chan;
            

            +------+----------+
            | chan | COUNT(*) |
            +------+----------+
            |  -40 |  1400094 |
            |  -39 |  2163949 |
            |  -38 |  2165932 |
            |  -37 |  2133441 |
            |  -36 |  2126065 |
            |  -35 |  2104630 |
            |  -34 |  2180195 |
            |  -33 |  2103957 |
            |  -32 |  2029172 |
            |  -31 |  2094254 |
            |  -30 |  2049792 |
            |  -29 |  1987232 |
            |  -28 |  2000964 |
            |  -27 |  2113021 |
            |  -26 |  2037486 |
            |  -25 |  1968147 |
            |  -24 |  1999535 |
            |  -23 |  1961868 |
            |  -22 |  2047897 |
            |  -21 |  1979037 |
            |  -20 |  2050123 |
            |  -19 |  1956640 |
            |  -18 |  1936693 |
            |  -17 |  1918960 |
            |  -16 |  1863719 |
            |  -15 |  1786404 |
            |  -14 |  1954821 |
            |  -13 |  1868526 |
            |  -12 |  1870825 |
            |  -11 |  1959243 |
            |  -10 |  1881106 |
            |   -9 |  1787642 |
            |   -8 |  1778006 |
            |   -7 |  1860040 |
            |   -6 |  1859687 |
            |   -5 |  1859810 |
            |   -4 |  1732045 |
            |   -3 |  1839934 |
            |   -2 |  1873321 |
            |   -1 |  1903577 |
            |   -0 |  1034431 |
            |    0 |  1033119 |
            |    1 |  1922121 |
            |    2 |  1796961 |
            |    3 |  1889517 |
            |    4 |  1838618 |
            |    5 |  1891434 |
            |    6 |  1955242 |
            |    7 |  1972145 |
            |    8 |  1995514 |
            |    9 |  1996845 |
            |   10 |  1064045 |
            +------+----------+
            

            Observations

            • since table DeepForcedSource is empty this means that that results of the last two steps of the S13 data release production pipeline (ingesting deep sources and de-duplication) never made into this database.
            • according to the RA range seen in the database its contents corresponds to the NCSA-only subset of the full dataset
            • the timing information for the creation and modification times of the tables corresponds to the large scale S13 DRP effort
            Show
            gapon Igor Gaponenko added a comment - - edited daues_SDRP_Stripe82_ncsa Non-empty tables: SELECT TABLE_NAME, TABLE_ROWS, SUBSTR(CREATE_TIME,1,10) AS 'CREATE' , SUBSTR(UPDATE_TIME,1,10) AS 'UPDATE' FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA= 'daues_SDRP_Stripe82_ncsa' AND TABLE_ROWS > 0; +-------------------------------+------------+------------+------------+ | TABLE_NAME | TABLE_ROWS | CREATE | UPDATE | +-------------------------------+------------+------------+------------+ | DeepCoadd | 13735 | 2013-07-31 | 2013-08-09 | | DeepCoadd_Metadata | 2809690 | 2013-07-31 | 2013-08-09 | | DeepCoadd_To_Htm10 | 326420 | 2013-07-31 | 2013-08-09 | | Filter | 6 | 2013-07-31 | 2013-07-31 | | LeapSeconds | 39 | 2013-07-31 | 2013-07-31 | | RunDeepSource | 98577782 | 2013-08-13 | 2013-08-13 | | Science_Ccd_Exposure | 1403385 | 2013-07-31 | 2013-08-01 | | Science_Ccd_Exposure_Metadata | 164196045 | 2013-07-31 | 2013-08-01 | | Science_Ccd_Exposure_To_Htm10 | 26861430 | 2013-07-31 | 2013-08-01 | | ZZZ_Db_Description | 1 | 2013-07-31 | 2013-07-31 | +-------------------------------+------------+------------+------------+ SELECT TABLE_NAME FROM INFORMATION_SCHEMA.VIEWS WHERE TABLE_SCHEMA= 'daues_SDRP_Stripe82_ncsa' ; +------------+ | TABLE_NAME | +------------+ | DeepSource | +------------+ Evaluating RA range of the deep sources: SELECT MIN ( CASE WHEN ra < 180 THEN ra ELSE ra - 360 END ) AS 'min_ra' , MAX ( CASE WHEN ra < 180 THEN ra ELSE ra - 360 END ) AS 'max_ra' FROM DeepSource; +--------------------+--------------------+ | min_ra | max_ra | +--------------------+--------------------+ | -40.16296999999997 | 10.022860000000001 | +--------------------+--------------------+ And a histogram: SELECT ROUND( CASE WHEN ra < 180 THEN ra ELSE ra - 360 END ) as chan, COUNT (*) FROM DeepSource GROUP BY chan; +------+----------+ | chan | COUNT(*) | +------+----------+ | -40 | 1400094 | | -39 | 2163949 | | -38 | 2165932 | | -37 | 2133441 | | -36 | 2126065 | | -35 | 2104630 | | -34 | 2180195 | | -33 | 2103957 | | -32 | 2029172 | | -31 | 2094254 | | -30 | 2049792 | | -29 | 1987232 | | -28 | 2000964 | | -27 | 2113021 | | -26 | 2037486 | | -25 | 1968147 | | -24 | 1999535 | | -23 | 1961868 | | -22 | 2047897 | | -21 | 1979037 | | -20 | 2050123 | | -19 | 1956640 | | -18 | 1936693 | | -17 | 1918960 | | -16 | 1863719 | | -15 | 1786404 | | -14 | 1954821 | | -13 | 1868526 | | -12 | 1870825 | | -11 | 1959243 | | -10 | 1881106 | | -9 | 1787642 | | -8 | 1778006 | | -7 | 1860040 | | -6 | 1859687 | | -5 | 1859810 | | -4 | 1732045 | | -3 | 1839934 | | -2 | 1873321 | | -1 | 1903577 | | -0 | 1034431 | | 0 | 1033119 | | 1 | 1922121 | | 2 | 1796961 | | 3 | 1889517 | | 4 | 1838618 | | 5 | 1891434 | | 6 | 1955242 | | 7 | 1972145 | | 8 | 1995514 | | 9 | 1996845 | | 10 | 1064045 | +------+----------+ Observations since table DeepForcedSource is empty this means that that results of the last two steps of the S13 data release production pipeline (ingesting deep sources and de-duplication) never made into this database. according to the RA range seen in the database its contents corresponds to the NCSA-only subset of the full dataset the timing information for the creation and modification times of the tables corresponds to the large scale S13 DRP effort

              People

              Assignee:
              gapon Igor Gaponenko
              Reporter:
              fritzm Fritz Mueller
              Watchers:
              Fritz Mueller, Igor Gaponenko
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved: