# Investigate coverage of S13 databases found so far

XMLWordPrintable

#### Details

• Type: Story
• Status: Done
• Resolution: Done
• Fix Version/s: None
• Component/s:
• Labels:
• Story Points:
4
• Sprint:
DB_F16_7
• Team:
Data Access and Database

#### Description

Look at databases located at NCSA so far to assess if they cover the full survey. The databases to be evaluated are mentioned in DM-6905.

According to S13 Testing Plan the S13 DRP dataset was split into two regions with an overalp used for cross-site verification:

• NCSA: -40< R.A. < +10
• IN2P3: +5 < R.A. < +55

Hence a goal of this task is to identify which previously located candidate databases and files correspond to either or both of these ranges.

#### Activity

Hide
Igor Gaponenko added a comment - - edited

# daues_SDRP_Stripe82_ncsa

Non-empty tables:

 SELECT  TABLE_NAME, TABLE_ROWS,  SUBSTR(CREATE_TIME,1,10) AS 'CREATE',  SUBSTR(UPDATE_TIME,1,10) AS 'UPDATE' FROM  INFORMATION_SCHEMA.TABLES WHERE  TABLE_SCHEMA='daues_SDRP_Stripe82_ncsa'  AND  TABLE_ROWS > 0; 

 +-------------------------------+------------+------------+------------+ | TABLE_NAME | TABLE_ROWS | CREATE | UPDATE | +-------------------------------+------------+------------+------------+ | DeepCoadd | 13735 | 2013-07-31 | 2013-08-09 | | DeepCoadd_Metadata | 2809690 | 2013-07-31 | 2013-08-09 | | DeepCoadd_To_Htm10 | 326420 | 2013-07-31 | 2013-08-09 | | Filter | 6 | 2013-07-31 | 2013-07-31 | | LeapSeconds | 39 | 2013-07-31 | 2013-07-31 | | RunDeepSource | 98577782 | 2013-08-13 | 2013-08-13 | | Science_Ccd_Exposure | 1403385 | 2013-07-31 | 2013-08-01 | | Science_Ccd_Exposure_Metadata | 164196045 | 2013-07-31 | 2013-08-01 | | Science_Ccd_Exposure_To_Htm10 | 26861430 | 2013-07-31 | 2013-08-01 | | ZZZ_Db_Description | 1 | 2013-07-31 | 2013-07-31 | +-------------------------------+------------+------------+------------+ 

 SELECT  TABLE_NAME FROM  INFORMATION_SCHEMA.VIEWS WHERE  TABLE_SCHEMA='daues_SDRP_Stripe82_ncsa'; 

 +------------+ | TABLE_NAME | +------------+ | DeepSource | +------------+ 

Evaluating RA range of the deep sources:

 SELECT  MIN(CASE WHEN ra < 180 THEN ra ELSE ra - 360 END) AS 'min_ra',  MAX(CASE WHEN ra < 180 THEN ra ELSE ra - 360 END) AS 'max_ra' FROM  DeepSource; 

 +--------------------+--------------------+ | min_ra | max_ra | +--------------------+--------------------+ | -40.16296999999997 | 10.022860000000001 | +--------------------+--------------------+ 

And a histogram:

 SELECT  ROUND(CASE WHEN ra < 180 THEN ra ELSE ra - 360 END) as chan,  COUNT(*) FROM  DeepSource GROUP BY  chan; 

 +------+----------+ | chan | COUNT(*) | +------+----------+ | -40 | 1400094 | | -39 | 2163949 | | -38 | 2165932 | | -37 | 2133441 | | -36 | 2126065 | | -35 | 2104630 | | -34 | 2180195 | | -33 | 2103957 | | -32 | 2029172 | | -31 | 2094254 | | -30 | 2049792 | | -29 | 1987232 | | -28 | 2000964 | | -27 | 2113021 | | -26 | 2037486 | | -25 | 1968147 | | -24 | 1999535 | | -23 | 1961868 | | -22 | 2047897 | | -21 | 1979037 | | -20 | 2050123 | | -19 | 1956640 | | -18 | 1936693 | | -17 | 1918960 | | -16 | 1863719 | | -15 | 1786404 | | -14 | 1954821 | | -13 | 1868526 | | -12 | 1870825 | | -11 | 1959243 | | -10 | 1881106 | | -9 | 1787642 | | -8 | 1778006 | | -7 | 1860040 | | -6 | 1859687 | | -5 | 1859810 | | -4 | 1732045 | | -3 | 1839934 | | -2 | 1873321 | | -1 | 1903577 | | -0 | 1034431 | | 0 | 1033119 | | 1 | 1922121 | | 2 | 1796961 | | 3 | 1889517 | | 4 | 1838618 | | 5 | 1891434 | | 6 | 1955242 | | 7 | 1972145 | | 8 | 1995514 | | 9 | 1996845 | | 10 | 1064045 | +------+----------+ 

## Observations

• since table DeepForcedSource is empty this means that that results of the last two steps of the S13 data release production pipeline (ingesting deep sources and de-duplication) never made into this database.
• according to the RA range seen in the database its contents corresponds to the NCSA-only subset of the full dataset
• the timing information for the creation and modification times of the tables corresponds to the large scale S13 DRP effort
Show
Igor Gaponenko added a comment - - edited daues_SDRP_Stripe82_ncsa Non-empty tables: SELECT TABLE_NAME, TABLE_ROWS, SUBSTR(CREATE_TIME,1,10) AS 'CREATE' , SUBSTR(UPDATE_TIME,1,10) AS 'UPDATE' FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA= 'daues_SDRP_Stripe82_ncsa' AND TABLE_ROWS > 0; +-------------------------------+------------+------------+------------+ | TABLE_NAME | TABLE_ROWS | CREATE | UPDATE | +-------------------------------+------------+------------+------------+ | DeepCoadd | 13735 | 2013-07-31 | 2013-08-09 | | DeepCoadd_Metadata | 2809690 | 2013-07-31 | 2013-08-09 | | DeepCoadd_To_Htm10 | 326420 | 2013-07-31 | 2013-08-09 | | Filter | 6 | 2013-07-31 | 2013-07-31 | | LeapSeconds | 39 | 2013-07-31 | 2013-07-31 | | RunDeepSource | 98577782 | 2013-08-13 | 2013-08-13 | | Science_Ccd_Exposure | 1403385 | 2013-07-31 | 2013-08-01 | | Science_Ccd_Exposure_Metadata | 164196045 | 2013-07-31 | 2013-08-01 | | Science_Ccd_Exposure_To_Htm10 | 26861430 | 2013-07-31 | 2013-08-01 | | ZZZ_Db_Description | 1 | 2013-07-31 | 2013-07-31 | +-------------------------------+------------+------------+------------+ SELECT TABLE_NAME FROM INFORMATION_SCHEMA.VIEWS WHERE TABLE_SCHEMA= 'daues_SDRP_Stripe82_ncsa' ; +------------+ | TABLE_NAME | +------------+ | DeepSource | +------------+ Evaluating RA range of the deep sources: SELECT MIN ( CASE WHEN ra < 180 THEN ra ELSE ra - 360 END ) AS 'min_ra' , MAX ( CASE WHEN ra < 180 THEN ra ELSE ra - 360 END ) AS 'max_ra' FROM DeepSource; +--------------------+--------------------+ | min_ra | max_ra | +--------------------+--------------------+ | -40.16296999999997 | 10.022860000000001 | +--------------------+--------------------+ And a histogram: SELECT ROUND( CASE WHEN ra < 180 THEN ra ELSE ra - 360 END ) as chan, COUNT (*) FROM DeepSource GROUP BY chan; +------+----------+ | chan | COUNT(*) | +------+----------+ | -40 | 1400094 | | -39 | 2163949 | | -38 | 2165932 | | -37 | 2133441 | | -36 | 2126065 | | -35 | 2104630 | | -34 | 2180195 | | -33 | 2103957 | | -32 | 2029172 | | -31 | 2094254 | | -30 | 2049792 | | -29 | 1987232 | | -28 | 2000964 | | -27 | 2113021 | | -26 | 2037486 | | -25 | 1968147 | | -24 | 1999535 | | -23 | 1961868 | | -22 | 2047897 | | -21 | 1979037 | | -20 | 2050123 | | -19 | 1956640 | | -18 | 1936693 | | -17 | 1918960 | | -16 | 1863719 | | -15 | 1786404 | | -14 | 1954821 | | -13 | 1868526 | | -12 | 1870825 | | -11 | 1959243 | | -10 | 1881106 | | -9 | 1787642 | | -8 | 1778006 | | -7 | 1860040 | | -6 | 1859687 | | -5 | 1859810 | | -4 | 1732045 | | -3 | 1839934 | | -2 | 1873321 | | -1 | 1903577 | | -0 | 1034431 | | 0 | 1033119 | | 1 | 1922121 | | 2 | 1796961 | | 3 | 1889517 | | 4 | 1838618 | | 5 | 1891434 | | 6 | 1955242 | | 7 | 1972145 | | 8 | 1995514 | | 9 | 1996845 | | 10 | 1064045 | +------+----------+ Observations since table DeepForcedSource is empty this means that that results of the last two steps of the S13 data release production pipeline (ingesting deep sources and de-duplication) never made into this database. according to the RA range seen in the database its contents corresponds to the NCSA-only subset of the full dataset the timing information for the creation and modification times of the tables corresponds to the large scale S13 DRP effort
Hide
Igor Gaponenko added a comment - - edited

# daues_S13_Stripe82_2000

This is a very interesting one because it contains the forced photometry. The database has also been "de-duplicated". In some sense this is the most complete database both from the scientific point of view and also per Yusra's instructions (see the documentation link in the Description section of this task).

 SELECT  TABLE_NAME, TABLE_ROWS,  SUBSTR(CREATE_TIME,1,10) AS 'CREATE',  SUBSTR(UPDATE_TIME,1,10) AS 'UPDATE' FROM  INFORMATION_SCHEMA.TABLES WHERE  TABLE_SCHEMA='daues_S13_Stripe82_2000'  AND  TABLE_ROWS > 0; 

 +-------------------------------+------------+------------+------------+ | TABLE_NAME | TABLE_ROWS | CREATE | UPDATE | +-------------------------------+------------+------------+------------+ | DeepCoadd | 720 | 2013-06-22 | 2013-06-27 | | DeepCoadd_Metadata | 148296 | 2013-06-22 | 2013-06-27 | | DeepCoadd_To_Htm10 | 17000 | 2013-06-22 | 2013-06-27 | | Filter | 6 | 2013-06-22 | 2013-06-22 | | LeapSeconds | 39 | 2013-06-22 | 2013-06-22 | | RefDeepSrcMatch | 25731005 | 2013-07-04 | 2013-07-04 | | RefObject | 471554 | 2013-07-04 | 2013-07-04 | | RunDeepForcedSource | 447180788 | 2013-08-22 | 2013-08-22 | | RunDeepForcedSource4 | 90030986 | 2013-09-30 | 2013-10-01 | | RunDeepSource | 5131738 | 2013-06-27 | 2013-09-01 | | Science_Ccd_Exposure | 79179 | 2013-06-22 | 2013-06-22 | | Science_Ccd_Exposure_Metadata | 9263943 | 2013-06-22 | 2013-06-22 | | Science_Ccd_Exposure_To_Htm10 | 1514810 | 2013-06-22 | 2013-06-22 | | ZZZ_Db_Description | 1 | 2013-06-22 | 2013-06-22 | | medianForcedPhot | 1068212 | 2014-03-11 | 2014-03-11 | | medianForcedPhotStd | 1068212 | 2014-03-12 | 2014-03-12 | | medianForcedPhotStdByFilter | 5340082 | 2014-03-12 | 2014-03-12 | | tmp_dedupe4 | 7677429 | 2013-09-30 | 2013-09-30 | +-------------------------------+------------+------------+------------+ 

 SELECT  TABLE_NAME FROM  INFORMATION_SCHEMA.VIEWS WHERE  TABLE_SCHEMA='daues_S13_Stripe82_2000'; 

 +------------------+ | TABLE_NAME | +------------------+ | DeepForcedSource | | DeepSource | +------------------+ 

The RA coverage range of the database is:

 SELECT  MIN(CASE WHEN ra < 180 THEN ra ELSE ra - 360 END) AS 'min_ra',  MAX(CASE WHEN ra < 180 THEN ra ELSE ra - 360 END) AS 'max_ra' FROM  DeepSource; 

 +-------------------+---------+ | min_ra | max_ra | +-------------------+---------+ | 4.957029999999996 | 7.60286 | +-------------------+---------+ 

## Observations

• the database contains results from all stages of the pipeline, including: deep sources (from coadds), deep forced sources and de-duplication.
• only about 5% of the deep sources are found in this database compared with previously discussed database daues_SDRP_Stripe82_ncsa
• most tables were created and populated BEFORE the full-scale S13 DRP effort. The only exception was for the last two stages of the pipeline: forced photometry and de-duplication. Those were done in September 2013.
• Apparently this database represents a full scale (complete in a sense of a variety of data produced by the S13 DRP) effort of testing the pipeline before processing all data within the OVERLAP region of the dataset.
Show
Igor Gaponenko added a comment - - edited daues_S13_Stripe82_2000 This is a very interesting one because it contains the forced photometry. The database has also been "de-duplicated". In some sense this is the most complete database both from the scientific point of view and also per Yusra's instructions (see the documentation link in the Description section of this task). SELECT TABLE_NAME, TABLE_ROWS, SUBSTR(CREATE_TIME,1,10) AS 'CREATE' , SUBSTR(UPDATE_TIME,1,10) AS 'UPDATE' FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA= 'daues_S13_Stripe82_2000' AND TABLE_ROWS > 0; +-------------------------------+------------+------------+------------+ | TABLE_NAME | TABLE_ROWS | CREATE | UPDATE | +-------------------------------+------------+------------+------------+ | DeepCoadd | 720 | 2013-06-22 | 2013-06-27 | | DeepCoadd_Metadata | 148296 | 2013-06-22 | 2013-06-27 | | DeepCoadd_To_Htm10 | 17000 | 2013-06-22 | 2013-06-27 | | Filter | 6 | 2013-06-22 | 2013-06-22 | | LeapSeconds | 39 | 2013-06-22 | 2013-06-22 | | RefDeepSrcMatch | 25731005 | 2013-07-04 | 2013-07-04 | | RefObject | 471554 | 2013-07-04 | 2013-07-04 | | RunDeepForcedSource | 447180788 | 2013-08-22 | 2013-08-22 | | RunDeepForcedSource4 | 90030986 | 2013-09-30 | 2013-10-01 | | RunDeepSource | 5131738 | 2013-06-27 | 2013-09-01 | | Science_Ccd_Exposure | 79179 | 2013-06-22 | 2013-06-22 | | Science_Ccd_Exposure_Metadata | 9263943 | 2013-06-22 | 2013-06-22 | | Science_Ccd_Exposure_To_Htm10 | 1514810 | 2013-06-22 | 2013-06-22 | | ZZZ_Db_Description | 1 | 2013-06-22 | 2013-06-22 | | medianForcedPhot | 1068212 | 2014-03-11 | 2014-03-11 | | medianForcedPhotStd | 1068212 | 2014-03-12 | 2014-03-12 | | medianForcedPhotStdByFilter | 5340082 | 2014-03-12 | 2014-03-12 | | tmp_dedupe4 | 7677429 | 2013-09-30 | 2013-09-30 | +-------------------------------+------------+------------+------------+ SELECT TABLE_NAME FROM INFORMATION_SCHEMA.VIEWS WHERE TABLE_SCHEMA= 'daues_S13_Stripe82_2000' ; +------------------+ | TABLE_NAME | +------------------+ | DeepForcedSource | | DeepSource | +------------------+ The RA coverage range of the database is: SELECT MIN ( CASE WHEN ra < 180 THEN ra ELSE ra - 360 END ) AS 'min_ra' , MAX ( CASE WHEN ra < 180 THEN ra ELSE ra - 360 END ) AS 'max_ra' FROM DeepSource; +-------------------+---------+ | min_ra | max_ra | +-------------------+---------+ | 4.957029999999996 | 7.60286 | +-------------------+---------+ Observations the database contains results from all stages of the pipeline, including: deep sources (from coadds), deep forced sources and de-duplication. only about 5% of the deep sources are found in this database compared with previously discussed database daues_SDRP_Stripe82_ncsa most tables were created and populated BEFORE the full-scale S13 DRP effort. The only exception was for the last two stages of the pipeline: forced photometry and de-duplication . Those were done in September 2013. Apparently this database represents a full scale (complete in a sense of a variety of data produced by the S13 DRP) effort of testing the pipeline before processing all data within the OVERLAP region of the dataset.
Hide
Igor Gaponenko added a comment - - edited

# daues_SDRP_dedupe_byfilter_0(1,2,3,4)

According to Greg, these contain the final deduped forced sources for each band as shown by the following filter definition:

 SELECT filterId,filterName FROM daues_SDRP_dedupe_byfilter_0.Filter; 

 +----------+------------+ | filterId | filterName | +----------+------------+ | 0 | u | | 1 | g | | 2 | r | | 3 | i | | 4 | z | | -99 | DD | +----------+------------+ 

Non-empty tables:

 SELECT  TABLE_NAME, TABLE_ROWS,  SUBSTR(CREATE_TIME,1,10) AS 'CREATE',  SUBSTR(UPDATE_TIME,1,10) AS 'UPDATE' FROM  INFORMATION_SCHEMA.TABLES WHERE  TABLE_SCHEMA='daues_SDRP_dedupe_byfilter_0'  AND  TABLE_ROWS > 0; 

 +---------------------+------------+------------+------------+ | TABLE_NAME | TABLE_ROWS | CREATE | UPDATE | +---------------------+------------+------------+------------+ | Filter | 6 | 2013-09-09 | 2013-09-09 | | LeapSeconds | 39 | 2013-09-09 | 2013-09-09 | | RunDeepForcedSource | 1729599791 | 2014-05-21 | 2014-05-21 | | ZZZ_Db_Description | 1 | 2013-09-09 | 2013-09-09 | +---------------------+------------+------------+------------+ 

Each table has over 1.7 billion entries and is nearly 700 GB in size:

 SELECT  TABLE_SCHEMA,  TABLE_NAME,  TABLE_ROWS,  DATA_LENGTH,  SUBSTR(CREATE_TIME,1,10) AS 'CREATE',SUBSTR(UPDATE_TIME,1,10) AS 'UPDATE' FROM  INFORMATION_SCHEMA.TABLES WHERE  TABLE_SCHEMA LIKE 'daues_SDRP_dedupe_byfilter_%'  AND  ABLE_NAME='RunDeepForcedSource'; 

 +------------------------------+---------------------+------------+--------------+------------+------------+ | TABLE_SCHEMA | TABLE_NAME | TABLE_ROWS | DATA_LENGTH | CREATE | UPDATE | +------------------------------+---------------------+------------+--------------+------------+------------+ | daues_SDRP_dedupe_byfilter_0 | RunDeepForcedSource | 1729599791 | 683191917445 | 2014-05-21 | 2014-05-21 | | daues_SDRP_dedupe_byfilter_1 | RunDeepForcedSource | 1752805399 | 692358132605 | 2014-05-16 | 2014-05-17 | | daues_SDRP_dedupe_byfilter_2 | RunDeepForcedSource | 1754290686 | 692944820970 | 2014-05-06 | 2014-05-30 | | daues_SDRP_dedupe_byfilter_3 | RunDeepForcedSource | 1754272841 | 692937772195 | 2014-05-18 | 2014-05-18 | | daues_SDRP_dedupe_byfilter_4 | RunDeepForcedSource | 1751879670 | 691992469650 | 2014-05-19 | 2014-05-19 | +------------------------------+---------------------+------------+--------------+------------+------------+ 

The RA region covered by the tables (judging on one band only) is (ATTENTION : this is a very long query. It takes about 30 minutes for the query to finish):

 SELECT  MIN(CASE WHEN coord_ra < 180 THEN coord_ra ELSE coord_ra - 360 END) AS 'min_ra',  MAX(CASE WHEN coord_ra < 180 THEN coord_ra ELSE coord_ra - 360 END) AS 'max_ra' FROM  RunDeepForcedSource; 

 +--------------------+--------------------+ | min_ra | max_ra | +--------------------+--------------------+ | -40.17089170204929 | 10.036033419213043 | +--------------------+--------------------+ 

## Evaluating primary key constrains of the tables

Here we going to evaluate a possibility of merging the contents of these tables into a single one which would contain forced sources across all bands.

The primary key constrain on the table requires uniqueness only for id:

 SHOW CREATE TABLE daues_SDRP_dedupe_byfilter_0.RunDeepForcedSource\G 

  id bigint(20) NOT NULL,  ..  objectId bigint(20) NOT NULL,  ..  PRIMARY KEY (id),  ..  KEY IDX_objectId (objectId),  .. 

Where the other column objectId serves as AKA foreign key for the corresponding entries in the DeepSource (or RunDeepSource) tables.

Fortunately, the values of the primary keys (id) in tables RunDeepForcedSource of all bands do not intersect. This can be demonstrated with this technique:

 CREATE TEMPORARY TABLE  tmpRunDeepForcedSource LIKE  daues_SDRP_dedupe_byfilter_0.RunDeepForcedSource; 

 INSERT INTO  tmpRunDeepForcedSource SELECT  * FROM  daues_SDRP_dedupe_byfilter_0.RunDeepForcedSource WHERE  objectId=217773672520226380;   ...   INSERT INTO  tmpRunDeepForcedSource SELECT  * FROM  daues_SDRP_dedupe_byfilter_4.RunDeepForcedSource WHERE  objectId=217773672520226380; 

This code won't cause any violation of the primary key constrain. And the contents of the temporary table will be the union of result sets select from all 5 input tables.

 SELECT  id,  objectId,  COUNT(*) FROM  tmpRunDeepForcedSource GROUP BY  id ORDER BY  id; 

 +--------------------+--------------------+----------+ | id | objectId | COUNT(*) | +--------------------+--------------------+----------+ | 69324128875708417 | 217773672520226380 | 1 | | 69330839762108417 | 217773672520226380 | 1 | | 69337550648508417 | 217773672520226380 | 1 | | 69344261534908417 | 217773672520226380 | 1 | | 69350972421308417 | 217773672520226380 | 1 | | 160055318036873217 | 217773672520226380 | 1 | | 160062028923273217 | 217773672520226380 | 1 | | 160068739809673217 | 217773672520226380 | 1 | | 160075450696073217 | 217773672520226380 | 1 | | 160082161582473217 | 217773672520226380 | 1 | | 178443148853248001 | 217773672520226380 | 1 | | 178449859739648001 | 217773672520226380 | 1 | | 178456570626048001 | 217773672520226380 | 1 | | 178463281512448001 | 217773672520226380 | 1 | | 178469992398848001 | 217773672520226380 | 1 | ... 

## Evaluating cross-database referential integrity

This is another important precondition to be asserted on a way of an idea of merging the tables of daues_SDRP_Stripe82_ncsa with 5 band-specific databases. There are two external dependencies (AKA foreign keys) in the forced deep sources tables (RunDeepForcedSource) which would need to be satisfied:

• objectId
• exposure_id

It turns out the dependencies are resolved as follows:

RunDeepForcedSource key target table key
objectId daues_SDRP_Stripe82_ncsa.DeepSource deepSourceId
exposure_id daues_SDRP_Stripe82_ncsa.Science_Ccd_Exposure scienceCcdExposureId

A weak proof of this referential integrity can be demonstrated with the following test (based on any row from the deep forced sources table:

 SELECT  rdfs.objectId,  rdfs.exposure_id,  sce.run FROM  daues_SDRP_dedupe_byfilter_0.RunDeepForcedSource rdfs,  daues_SDRP_Stripe82_ncsa.DeepSource ds,  daues_SDRP_Stripe82_ncsa.Science_Ccd_Exposure sce WHERE  rdfs.objectId=217773672520226380  AND  rdfs.objectId=ds.deepSourceId  AND  rdfs.exposure_id=sce.scienceCcdExposureId; 

 +--------------------+-------------+------+ | objectId | exposure_id | run | +--------------------+-------------+------+ | 217773672520226380 | 1033010019 | 1033 | | 217773672520226380 | 2385010094 | 2385 | | 217773672520226380 | 2659010125 | 2659 | | 217773672520226380 | 3384010101 | 3384 | | 217773672520226380 | 3437010075 | 3437 | | 217773672520226380 | 4184010122 | 4184 | ... | 217773672520226380 | 7195010091 | 7195 | | 217773672520226380 | 7202010034 | 7202 | | 217773672520226380 | 7202010035 | 7202 | +--------------------+-------------+------+ 77 rows in set (0.63 sec) 

This corresponds to the same number of entries found in table RunDeepForcedSource for this objectId.
NOTE: strictly speaking this is NOT a proof. Though, a more reliable verification procedure could be establish to verify/guarantee the referential integrity.

## Observations

• each of these databases contains only forced sources. No other data from the preceding S13 DRP pipeline stages are present.
• according to the RA range seen in the database its contents corresponds to the NCSA-only subset of the full dataset
• it's quite clear that the contents of table RunDeepForcedSource from all these databases could be combined within a single large table covering all bands due to non-overlapping sequences of the primary key id in the table instances.
• the referential integrity between tables of databases daues_SDRP_Stripe82_ncsa and daues_SDRP_dedupe_byfilter_0,1,2,3,4 has been demostrated
• it also seems to be conceivable to merge tables RunDeepForcedSource into the large database daues_SDRP_Stripe82_ncsa which holds catalogs from the first stages of the processing pipeline. This will yield the complete catalog of the first (NCSA) half of the dataset.
Show
Igor Gaponenko added a comment - - edited daues_SDRP_dedupe_byfilter_0(1,2,3,4) According to Greg, these contain the final deduped forced sources for each band as shown by the following filter definition: SELECT filterId,filterName FROM daues_SDRP_dedupe_byfilter_0.Filter; +----------+------------+ | filterId | filterName | +----------+------------+ | 0 | u | | 1 | g | | 2 | r | | 3 | i | | 4 | z | | -99 | DD | +----------+------------+ Non-empty tables: SELECT TABLE_NAME, TABLE_ROWS, SUBSTR(CREATE_TIME,1,10) AS 'CREATE' , SUBSTR(UPDATE_TIME,1,10) AS 'UPDATE' FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA= 'daues_SDRP_dedupe_byfilter_0' AND TABLE_ROWS > 0; +---------------------+------------+------------+------------+ | TABLE_NAME | TABLE_ROWS | CREATE | UPDATE | +---------------------+------------+------------+------------+ | Filter | 6 | 2013-09-09 | 2013-09-09 | | LeapSeconds | 39 | 2013-09-09 | 2013-09-09 | | RunDeepForcedSource | 1729599791 | 2014-05-21 | 2014-05-21 | | ZZZ_Db_Description | 1 | 2013-09-09 | 2013-09-09 | +---------------------+------------+------------+------------+ Each table has over 1.7 billion entries and is nearly 700 GB in size: SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_ROWS, DATA_LENGTH, SUBSTR(CREATE_TIME,1,10) AS 'CREATE' ,SUBSTR(UPDATE_TIME,1,10) AS 'UPDATE' FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA LIKE 'daues_SDRP_dedupe_byfilter_%' AND ABLE_NAME= 'RunDeepForcedSource' ; +------------------------------+---------------------+------------+--------------+------------+------------+ | TABLE_SCHEMA | TABLE_NAME | TABLE_ROWS | DATA_LENGTH | CREATE | UPDATE | +------------------------------+---------------------+------------+--------------+------------+------------+ | daues_SDRP_dedupe_byfilter_0 | RunDeepForcedSource | 1729599791 | 683191917445 | 2014-05-21 | 2014-05-21 | | daues_SDRP_dedupe_byfilter_1 | RunDeepForcedSource | 1752805399 | 692358132605 | 2014-05-16 | 2014-05-17 | | daues_SDRP_dedupe_byfilter_2 | RunDeepForcedSource | 1754290686 | 692944820970 | 2014-05-06 | 2014-05-30 | | daues_SDRP_dedupe_byfilter_3 | RunDeepForcedSource | 1754272841 | 692937772195 | 2014-05-18 | 2014-05-18 | | daues_SDRP_dedupe_byfilter_4 | RunDeepForcedSource | 1751879670 | 691992469650 | 2014-05-19 | 2014-05-19 | +------------------------------+---------------------+------------+--------------+------------+------------+ The RA region covered by the tables (judging on one band only) is ( ATTENTION : this is a very long query. It takes about 30 minutes for the query to finish): SELECT MIN ( CASE WHEN coord_ra < 180 THEN coord_ra ELSE coord_ra - 360 END ) AS 'min_ra' , MAX ( CASE WHEN coord_ra < 180 THEN coord_ra ELSE coord_ra - 360 END ) AS 'max_ra' FROM RunDeepForcedSource; +--------------------+--------------------+ | min_ra | max_ra | +--------------------+--------------------+ | -40.17089170204929 | 10.036033419213043 | +--------------------+--------------------+ Evaluating primary key constrains of the tables Here we going to evaluate a possibility of merging the contents of these tables into a single one which would contain forced sources across all bands. The primary key constrain on the table requires uniqueness only for id : SHOW CREATE TABLE daues_SDRP_dedupe_byfilter_0.RunDeepForcedSource\G id bigint(20) NOT NULL, .. objectId bigint(20) NOT NULL, .. PRIMARY KEY (id), .. KEY IDX_objectId (objectId), .. Where the other column objectId serves as AKA foreign key for the corresponding entries in the DeepSource (or RunDeepSource ) tables. Fortunately, the values of the primary keys ( id ) in tables RunDeepForcedSource of all bands do not intersect. This can be demonstrated with this technique: CREATE TEMPORARY TABLE tmpRunDeepForcedSource LIKE daues_SDRP_dedupe_byfilter_0.RunDeepForcedSource; INSERT INTO tmpRunDeepForcedSource SELECT * FROM daues_SDRP_dedupe_byfilter_0.RunDeepForcedSource WHERE objectId=217773672520226380;   ...   INSERT INTO tmpRunDeepForcedSource SELECT * FROM daues_SDRP_dedupe_byfilter_4.RunDeepForcedSource WHERE objectId=217773672520226380; This code won't cause any violation of the primary key constrain. And the contents of the temporary table will be the union of result sets select from all 5 input tables. SELECT id, objectId, COUNT (*) FROM tmpRunDeepForcedSource GROUP BY id ORDER BY id; +--------------------+--------------------+----------+ | id | objectId | COUNT(*) | +--------------------+--------------------+----------+ | 69324128875708417 | 217773672520226380 | 1 | | 69330839762108417 | 217773672520226380 | 1 | | 69337550648508417 | 217773672520226380 | 1 | | 69344261534908417 | 217773672520226380 | 1 | | 69350972421308417 | 217773672520226380 | 1 | | 160055318036873217 | 217773672520226380 | 1 | | 160062028923273217 | 217773672520226380 | 1 | | 160068739809673217 | 217773672520226380 | 1 | | 160075450696073217 | 217773672520226380 | 1 | | 160082161582473217 | 217773672520226380 | 1 | | 178443148853248001 | 217773672520226380 | 1 | | 178449859739648001 | 217773672520226380 | 1 | | 178456570626048001 | 217773672520226380 | 1 | | 178463281512448001 | 217773672520226380 | 1 | | 178469992398848001 | 217773672520226380 | 1 | ... Evaluating cross-database referential integrity This is another important precondition to be asserted on a way of an idea of merging the tables of daues_SDRP_Stripe82_ncsa with 5 band-specific databases. There are two external dependencies (AKA foreign keys) in the forced deep sources tables ( RunDeepForcedSource ) which would need to be satisfied: objectId exposure_id It turns out the dependencies are resolved as follows: RunDeepForcedSource key target table key objectId daues_SDRP_Stripe82_ncsa.DeepSource deepSourceId exposure_id daues_SDRP_Stripe82_ncsa.Science_Ccd_Exposure scienceCcdExposureId A weak proof of this referential integrity can be demonstrated with the following test (based on any row from the deep forced sources table: SELECT rdfs.objectId, rdfs.exposure_id, sce.run FROM daues_SDRP_dedupe_byfilter_0.RunDeepForcedSource rdfs, daues_SDRP_Stripe82_ncsa.DeepSource ds, daues_SDRP_Stripe82_ncsa.Science_Ccd_Exposure sce WHERE rdfs.objectId=217773672520226380 AND rdfs.objectId=ds.deepSourceId AND rdfs.exposure_id=sce.scienceCcdExposureId; +--------------------+-------------+------+ | objectId | exposure_id | run | +--------------------+-------------+------+ | 217773672520226380 | 1033010019 | 1033 | | 217773672520226380 | 2385010094 | 2385 | | 217773672520226380 | 2659010125 | 2659 | | 217773672520226380 | 3384010101 | 3384 | | 217773672520226380 | 3437010075 | 3437 | | 217773672520226380 | 4184010122 | 4184 | ... | 217773672520226380 | 7195010091 | 7195 | | 217773672520226380 | 7202010034 | 7202 | | 217773672520226380 | 7202010035 | 7202 | +--------------------+-------------+------+ 77 rows in set (0.63 sec) This corresponds to the same number of entries found in table RunDeepForcedSource for this objectId . NOTE : strictly speaking this is NOT a proof. Though, a more reliable verification procedure could be establish to verify/guarantee the referential integrity. Observations each of these databases contains only forced sources . No other data from the preceding S13 DRP pipeline stages are present. according to the RA range seen in the database its contents corresponds to the NCSA-only subset of the full dataset it's quite clear that the contents of table RunDeepForcedSource from all these databases could be combined within a single large table covering all bands due to non-overlapping sequences of the primary key id in the table instances. the referential integrity between tables of databases daues_SDRP_Stripe82_ncsa and daues_SDRP_dedupe_byfilter_0,1,2,3,4 has been demostrated it also seems to be conceivable to merge tables RunDeepForcedSource into the large database daues_SDRP_Stripe82_ncsa which holds catalogs from the first stages of the processing pipeline. This will yield the complete catalog of the first (NCSA) half of the dataset.
Hide
Igor Gaponenko added a comment - - edited

# The final sweep through other databases of lsst-db

The following tests have been made in order to rule out a possibility that there is some other (or many) database on the server which might we well be involved into this investigation. The tests were based on patters established in the prior investigation of the well known databases. And we were looking for signs of a presence the substantial amount of data in the corresponding tables of the databases. All tables with the number of rows in the relevant tables which (the numbers) were less than the what had been found earlier were ignored. An additional constrain to be satisfied was a time when each table was created and updated. All tables updated before July 2013 were ignored. MySQL/MariaDB information schema was used in this investigation.

Science exposures:

 SELECT  TABLE_SCHEMA,  TABLE_NAME,  TABLE_ROWS,  SUBSTR(CREATE_TIME,1,10) AS 'CREATE',  SUBSTR(UPDATE_TIME,1,10) AS 'UPDATE' FROM  INFORMATION_SCHEMA.TABLES WHERE  TABLE_NAME='Science_Ccd_Exposure'  AND  TABLE_ROWS > 0 ORDER BY  TABLE_ROWS DESC; 

 +--------------------------------------------------+----------------------+------------+------------+------------+ | TABLE_SCHEMA | TABLE_NAME | TABLE_ROWS | CREATE | UPDATE | +--------------------------------------------------+----------------------+------------+------------+------------+ | DC_W13_Stripe82 | Science_Ccd_Exposure | 1998219 | 2013-01-18 | 2013-01-18 | | daues_SDRP_Stripe82_ncsa | Science_Ccd_Exposure | 1403385 | 2013-07-31 | 2013-08-01 | | yusra_Stripe82_S13 | Science_Ccd_Exposure | 1403385 | 2014-09-02 | 2014-09-02 | | yusra_rgb2 | Science_Ccd_Exposure | 83427 | 2012-12-30 | 2013-03-06 | | daues_S13_Stripe82_2000 | Science_Ccd_Exposure | 79179 | 2013-06-22 | 2013-06-22 | .. 

Deep sources:

 SELECT  TABLE_SCHEMA,  TABLE_NAME,  TABLE_ROWS,  SUBSTR(CREATE_TIME,1,10) AS 'CREATE',  SUBSTR(UPDATE_TIME,1,10) AS 'UPDATE' FROM  INFORMATION_SCHEMA.TABLES WHERE  TABLE_NAME='RunDeepSource' ORDER BY  TABLE_ROWS DESC; 

 +----------------------------------+---------------+------------+------------+------------+ | TABLE_SCHEMA | TABLE_NAME | TABLE_ROWS | CREATE | UPDATE | +----------------------------------+---------------+------------+------------+------------+ | daues_SDRP_Stripe82_ncsa | RunDeepSource | 98577782 | 2013-08-13 | 2013-08-13 | | DC_W13_Stripe82 | RunDeepSource | 14670597 | 2012-12-01 | 2012-12-02 | | yusra_S13_Gordon | RunDeepSource | 5884398 | 2013-11-01 | 2013-11-01 | | daues_S13_Stripe82_2000 | RunDeepSource | 5131738 | 2013-06-27 | 2013-09-01 | | daues_coadd_test | RunDeepSource | 2836990 | 2013-08-08 | 2013-08-08 | | yusra_S13_Stripe82_demo | RunDeepSource | 763995 | 2013-06-16 | 2013-06-16 | | daues_test1 | RunDeepSource | 749895 | 2013-11-30 | 2013-11-30 | | daues_S13_Stripe82_1000 | RunDeepSource | 749895 | 2013-06-16 | 2013-06-16 | | shaw_Stripe82_demo | RunDeepSource | 285785 | 2014-01-30 | 2014-01-30 | | pgee2000_a35 | RunDeepSource | 92343 | 2013-07-07 | 2013-07-07 | ... 

Forced deep sources:

 SELECT  TABLE_SCHEMA,  TABLE_NAME,  TABLE_ROWS,  SUBSTR(CREATE_TIME,1,10) AS 'CREATE',  SUBSTR(UPDATE_TIME,1,10) AS 'UPDATE' FROM  INFORMATION_SCHEMA.TABLES WHERE  TABLE_NAME='RunDeepForcedSource' ORDER BY  TABLE_ROWS DESC; 

 +------------------------------+---------------------+------------+------------+------------+ | TABLE_SCHEMA | TABLE_NAME | TABLE_ROWS | CREATE | UPDATE | +------------------------------+---------------------+------------+------------+------------+ | DC_W13_Stripe82 | RunDeepForcedSource | 3870141868 | 2012-12-11 | 2012-12-11 | | daues_SDRP_dedupe_byfilter_2 | RunDeepForcedSource | 1754290686 | 2014-05-06 | 2014-05-30 | | daues_SDRP_dedupe_byfilter_3 | RunDeepForcedSource | 1754272841 | 2014-05-18 | 2014-05-18 | | daues_SDRP_dedupe_byfilter_1 | RunDeepForcedSource | 1752805399 | 2014-05-16 | 2014-05-17 | | daues_SDRP_dedupe_byfilter_4 | RunDeepForcedSource | 1751879670 | 2014-05-19 | 2014-05-19 | | daues_SDRP_dedupe_byfilter_0 | RunDeepForcedSource | 1729599791 | 2014-05-21 | 2014-05-21 | | daues_S13_Stripe82_2000 | RunDeepForcedSource | 447180788 | 2013-08-22 | 2013-08-22 | .. 

## Observation

• no other database meets desired criteria of the search
Show
Igor Gaponenko added a comment - - edited The final sweep through other databases of lsst-db The following tests have been made in order to rule out a possibility that there is some other (or many) database on the server which might we well be involved into this investigation. The tests were based on patters established in the prior investigation of the well known databases. And we were looking for signs of a presence the substantial amount of data in the corresponding tables of the databases. All tables with the number of rows in the relevant tables which (the numbers) were less than the what had been found earlier were ignored. An additional constrain to be satisfied was a time when each table was created and updated. All tables updated before July 2013 were ignored. MySQL/MariaDB information schema was used in this investigation. Science exposures: SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_ROWS, SUBSTR(CREATE_TIME,1,10) AS 'CREATE' , SUBSTR(UPDATE_TIME,1,10) AS 'UPDATE' FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME= 'Science_Ccd_Exposure' AND TABLE_ROWS > 0 ORDER BY TABLE_ROWS DESC ; +--------------------------------------------------+----------------------+------------+------------+------------+ | TABLE_SCHEMA | TABLE_NAME | TABLE_ROWS | CREATE | UPDATE | +--------------------------------------------------+----------------------+------------+------------+------------+ | DC_W13_Stripe82 | Science_Ccd_Exposure | 1998219 | 2013-01-18 | 2013-01-18 | | daues_SDRP_Stripe82_ncsa | Science_Ccd_Exposure | 1403385 | 2013-07-31 | 2013-08-01 | | yusra_Stripe82_S13 | Science_Ccd_Exposure | 1403385 | 2014-09-02 | 2014-09-02 | | yusra_rgb2 | Science_Ccd_Exposure | 83427 | 2012-12-30 | 2013-03-06 | | daues_S13_Stripe82_2000 | Science_Ccd_Exposure | 79179 | 2013-06-22 | 2013-06-22 | .. Deep sources: SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_ROWS, SUBSTR(CREATE_TIME,1,10) AS 'CREATE' , SUBSTR(UPDATE_TIME,1,10) AS 'UPDATE' FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME= 'RunDeepSource' ORDER BY TABLE_ROWS DESC ; +----------------------------------+---------------+------------+------------+------------+ | TABLE_SCHEMA | TABLE_NAME | TABLE_ROWS | CREATE | UPDATE | +----------------------------------+---------------+------------+------------+------------+ | daues_SDRP_Stripe82_ncsa | RunDeepSource | 98577782 | 2013-08-13 | 2013-08-13 | | DC_W13_Stripe82 | RunDeepSource | 14670597 | 2012-12-01 | 2012-12-02 | | yusra_S13_Gordon | RunDeepSource | 5884398 | 2013-11-01 | 2013-11-01 | | daues_S13_Stripe82_2000 | RunDeepSource | 5131738 | 2013-06-27 | 2013-09-01 | | daues_coadd_test | RunDeepSource | 2836990 | 2013-08-08 | 2013-08-08 | | yusra_S13_Stripe82_demo | RunDeepSource | 763995 | 2013-06-16 | 2013-06-16 | | daues_test1 | RunDeepSource | 749895 | 2013-11-30 | 2013-11-30 | | daues_S13_Stripe82_1000 | RunDeepSource | 749895 | 2013-06-16 | 2013-06-16 | | shaw_Stripe82_demo | RunDeepSource | 285785 | 2014-01-30 | 2014-01-30 | | pgee2000_a35 | RunDeepSource | 92343 | 2013-07-07 | 2013-07-07 | ... Forced deep sources: SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_ROWS, SUBSTR(CREATE_TIME,1,10) AS 'CREATE' , SUBSTR(UPDATE_TIME,1,10) AS 'UPDATE' FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME= 'RunDeepForcedSource' ORDER BY TABLE_ROWS DESC ; +------------------------------+---------------------+------------+------------+------------+ | TABLE_SCHEMA | TABLE_NAME | TABLE_ROWS | CREATE | UPDATE | +------------------------------+---------------------+------------+------------+------------+ | DC_W13_Stripe82 | RunDeepForcedSource | 3870141868 | 2012-12-11 | 2012-12-11 | | daues_SDRP_dedupe_byfilter_2 | RunDeepForcedSource | 1754290686 | 2014-05-06 | 2014-05-30 | | daues_SDRP_dedupe_byfilter_3 | RunDeepForcedSource | 1754272841 | 2014-05-18 | 2014-05-18 | | daues_SDRP_dedupe_byfilter_1 | RunDeepForcedSource | 1752805399 | 2014-05-16 | 2014-05-17 | | daues_SDRP_dedupe_byfilter_4 | RunDeepForcedSource | 1751879670 | 2014-05-19 | 2014-05-19 | | daues_SDRP_dedupe_byfilter_0 | RunDeepForcedSource | 1729599791 | 2014-05-21 | 2014-05-21 | | daues_S13_Stripe82_2000 | RunDeepForcedSource | 447180788 | 2013-08-22 | 2013-08-22 | .. Observation no other database meets desired criteria of the search
Hide
Igor Gaponenko added a comment - - edited

# The final conclusions

The study has reviled the following:

• The complete (and scientifically meaningful) catalogs corresponding to NCSA-only half of the S13 dataset have been located withing LSST databse server at NCSA
• The catalogs are spread between 6 databases: daues_SDRP_Stripe82_ncsa (CCD exposure metadata and deep sources) and daues_SDRP_dedupe_byfilter_0,1,2,3,4 (deep forced sources accros 5 wavelength bands).
• there is a clear evidence that the referential integrity is possible across these databases
• there is a clear evidence that the primary key structure of the RunDeepForcedSource tables across all 5 bands won't prevent merging the contents of these tables into a single one

These are optional ideas which may still need to be tested:

• running a comprehensive test to verify the referential integrity across tables of those 6 databases
• verifying that no non-trivial intersection exists between keys of the deep forced source tables
Show
Igor Gaponenko added a comment - - edited The final conclusions The study has reviled the following: The complete (and scientifically meaningful) catalogs corresponding to NCSA-only half of the S13 dataset have been located withing LSST databse server at NCSA The catalogs are spread between 6 databases: daues_SDRP_Stripe82_ncsa (CCD exposure metadata and deep sources) and daues_SDRP_dedupe_byfilter_0,1,2,3,4 (deep forced sources accros 5 wavelength bands). there is a clear evidence that the referential integrity is possible across these databases there is a clear evidence that the primary key structure of the RunDeepForcedSource tables across all 5 bands won't prevent merging the contents of these tables into a single one These are optional ideas which may still need to be tested: running a comprehensive test to verify the referential integrity across tables of those 6 databases verifying that no non-trivial intersection exists between keys of the deep forced source tables

#### People

Assignee:
Igor Gaponenko
Reporter:
Fritz Mueller
Watchers:
Fritz Mueller, Igor Gaponenko