Details
-
Type:
Story
-
Status: Done
-
Resolution: Done
-
Fix Version/s: None
-
Component/s: Qserv
-
Labels:
-
Story Points:4
-
Epic Link:
-
Sprint:DB_F16_7
-
Team:Data Access and Database
Description
Look at databases located at NCSA so far to assess if they cover the full survey. The databases to be evaluated are mentioned in DM-6905.
According to S13 Testing Plan the S13 DRP dataset was split into two regions with an overalp used for cross-site verification:
- NCSA: -40< R.A. < +10
- IN2P3: +5 < R.A. < +55
Hence a goal of this task is to identify which previously located candidate databases and files correspond to either or both of these ranges.
Attachments
Issue Links
Activity
daues_S13_Stripe82_2000
This is a very interesting one because it contains the forced photometry. The database has also been "de-duplicated". In some sense this is the most complete database both from the scientific point of view and also per Yusra's instructions (see the documentation link in the Description section of this task).
SELECT
|
TABLE_NAME, TABLE_ROWS,
|
SUBSTR(CREATE_TIME,1,10) AS 'CREATE', |
SUBSTR(UPDATE_TIME,1,10) AS 'UPDATE' |
FROM
|
INFORMATION_SCHEMA.TABLES
|
WHERE
|
TABLE_SCHEMA='daues_S13_Stripe82_2000' |
AND |
TABLE_ROWS > 0;
|
+-------------------------------+------------+------------+------------+
|
| TABLE_NAME | TABLE_ROWS | CREATE | UPDATE |
|
+-------------------------------+------------+------------+------------+
|
| DeepCoadd | 720 | 2013-06-22 | 2013-06-27 |
|
| DeepCoadd_Metadata | 148296 | 2013-06-22 | 2013-06-27 |
|
| DeepCoadd_To_Htm10 | 17000 | 2013-06-22 | 2013-06-27 |
|
| Filter | 6 | 2013-06-22 | 2013-06-22 |
|
| LeapSeconds | 39 | 2013-06-22 | 2013-06-22 |
|
| RefDeepSrcMatch | 25731005 | 2013-07-04 | 2013-07-04 |
|
| RefObject | 471554 | 2013-07-04 | 2013-07-04 |
|
| RunDeepForcedSource | 447180788 | 2013-08-22 | 2013-08-22 |
|
| RunDeepForcedSource4 | 90030986 | 2013-09-30 | 2013-10-01 |
|
| RunDeepSource | 5131738 | 2013-06-27 | 2013-09-01 |
|
| Science_Ccd_Exposure | 79179 | 2013-06-22 | 2013-06-22 |
|
| Science_Ccd_Exposure_Metadata | 9263943 | 2013-06-22 | 2013-06-22 |
|
| Science_Ccd_Exposure_To_Htm10 | 1514810 | 2013-06-22 | 2013-06-22 |
|
| ZZZ_Db_Description | 1 | 2013-06-22 | 2013-06-22 |
|
| medianForcedPhot | 1068212 | 2014-03-11 | 2014-03-11 |
|
| medianForcedPhotStd | 1068212 | 2014-03-12 | 2014-03-12 |
|
| medianForcedPhotStdByFilter | 5340082 | 2014-03-12 | 2014-03-12 |
|
| tmp_dedupe4 | 7677429 | 2013-09-30 | 2013-09-30 |
|
+-------------------------------+------------+------------+------------+
|
SELECT
|
TABLE_NAME
|
FROM
|
INFORMATION_SCHEMA.VIEWS
|
WHERE
|
TABLE_SCHEMA='daues_S13_Stripe82_2000'; |
+------------------+
|
| TABLE_NAME |
|
+------------------+
|
| DeepForcedSource |
|
| DeepSource |
|
+------------------+
|
The RA coverage range of the database is:
SELECT
|
MIN(CASE WHEN ra < 180 THEN ra ELSE ra - 360 END) AS 'min_ra', |
MAX(CASE WHEN ra < 180 THEN ra ELSE ra - 360 END) AS 'max_ra' |
FROM
|
DeepSource;
|
+-------------------+---------+
|
| min_ra | max_ra |
|
+-------------------+---------+
|
| 4.957029999999996 | 7.60286 |
|
+-------------------+---------+
|
Observations
- the database contains results from all stages of the pipeline, including: deep sources (from coadds), deep forced sources and de-duplication.
- only about 5% of the deep sources are found in this database compared with previously discussed database daues_SDRP_Stripe82_ncsa
- most tables were created and populated BEFORE the full-scale S13 DRP effort. The only exception was for the last two stages of the pipeline: forced photometry and de-duplication. Those were done in September 2013.
- Apparently this database represents a full scale (complete in a sense of a variety of data produced by the S13 DRP) effort of testing the pipeline before processing all data within the OVERLAP
region of the dataset.
daues_SDRP_dedupe_byfilter_0(1,2,3,4)
According to Greg, these contain the final deduped forced sources for each band as shown by the following filter definition:
SELECT filterId,filterName FROM daues_SDRP_dedupe_byfilter_0.Filter; |
+----------+------------+
|
| filterId | filterName |
|
+----------+------------+
|
| 0 | u |
|
| 1 | g |
|
| 2 | r |
|
| 3 | i |
|
| 4 | z |
|
| -99 | DD |
|
+----------+------------+
|
Non-empty tables:
SELECT
|
TABLE_NAME, TABLE_ROWS,
|
SUBSTR(CREATE_TIME,1,10) AS 'CREATE', |
SUBSTR(UPDATE_TIME,1,10) AS 'UPDATE' |
FROM
|
INFORMATION_SCHEMA.TABLES
|
WHERE
|
TABLE_SCHEMA='daues_SDRP_dedupe_byfilter_0' |
AND |
TABLE_ROWS > 0;
|
+---------------------+------------+------------+------------+
|
| TABLE_NAME | TABLE_ROWS | CREATE | UPDATE |
|
+---------------------+------------+------------+------------+
|
| Filter | 6 | 2013-09-09 | 2013-09-09 |
|
| LeapSeconds | 39 | 2013-09-09 | 2013-09-09 |
|
| RunDeepForcedSource | 1729599791 | 2014-05-21 | 2014-05-21 |
|
| ZZZ_Db_Description | 1 | 2013-09-09 | 2013-09-09 |
|
+---------------------+------------+------------+------------+
|
Each table has over 1.7 billion entries and is nearly 700 GB in size:
SELECT
|
TABLE_SCHEMA,
|
TABLE_NAME,
|
TABLE_ROWS,
|
DATA_LENGTH,
|
SUBSTR(CREATE_TIME,1,10) AS 'CREATE',SUBSTR(UPDATE_TIME,1,10) AS 'UPDATE' |
FROM
|
INFORMATION_SCHEMA.TABLES
|
WHERE
|
TABLE_SCHEMA LIKE 'daues_SDRP_dedupe_byfilter_%' |
AND |
ABLE_NAME='RunDeepForcedSource'; |
+------------------------------+---------------------+------------+--------------+------------+------------+
|
| TABLE_SCHEMA | TABLE_NAME | TABLE_ROWS | DATA_LENGTH | CREATE | UPDATE |
|
+------------------------------+---------------------+------------+--------------+------------+------------+
|
| daues_SDRP_dedupe_byfilter_0 | RunDeepForcedSource | 1729599791 | 683191917445 | 2014-05-21 | 2014-05-21 |
|
| daues_SDRP_dedupe_byfilter_1 | RunDeepForcedSource | 1752805399 | 692358132605 | 2014-05-16 | 2014-05-17 |
|
| daues_SDRP_dedupe_byfilter_2 | RunDeepForcedSource | 1754290686 | 692944820970 | 2014-05-06 | 2014-05-30 |
|
| daues_SDRP_dedupe_byfilter_3 | RunDeepForcedSource | 1754272841 | 692937772195 | 2014-05-18 | 2014-05-18 |
|
| daues_SDRP_dedupe_byfilter_4 | RunDeepForcedSource | 1751879670 | 691992469650 | 2014-05-19 | 2014-05-19 |
|
+------------------------------+---------------------+------------+--------------+------------+------------+
|
The RA region covered by the tables (judging on one band only) is (ATTENTION : this is a very long query. It takes about 30 minutes for the query to finish):
SELECT
|
MIN(CASE WHEN coord_ra < 180 THEN coord_ra ELSE coord_ra - 360 END) AS 'min_ra', |
MAX(CASE WHEN coord_ra < 180 THEN coord_ra ELSE coord_ra - 360 END) AS 'max_ra' |
FROM
|
RunDeepForcedSource;
|
+--------------------+--------------------+
|
| min_ra | max_ra |
|
+--------------------+--------------------+
|
| -40.17089170204929 | 10.036033419213043 |
|
+--------------------+--------------------+
|
Evaluating primary key constrains of the tables
Here we going to evaluate a possibility of merging the contents of these tables into a single one which would contain forced sources across all bands.
The primary key constrain on the table requires uniqueness only for id:
SHOW CREATE TABLE daues_SDRP_dedupe_byfilter_0.RunDeepForcedSource\G |
`id` bigint(20) NOT NULL,
|
..
|
`objectId` bigint(20) NOT NULL,
|
..
|
PRIMARY KEY (`id`),
|
..
|
KEY `IDX_objectId` (`objectId`),
|
..
|
Where the other column objectId serves as AKA foreign key for the corresponding entries in the DeepSource (or RunDeepSource) tables.
Fortunately, the values of the primary keys (id) in tables RunDeepForcedSource of all bands do not intersect. This can be demonstrated with this technique:
CREATE TEMPORARY TABLE |
tmpRunDeepForcedSource
|
LIKE
|
daues_SDRP_dedupe_byfilter_0.RunDeepForcedSource;
|
INSERT INTO |
tmpRunDeepForcedSource
|
SELECT
|
*
|
FROM
|
daues_SDRP_dedupe_byfilter_0.RunDeepForcedSource
|
WHERE
|
objectId=217773672520226380;
|
|
...
|
|
INSERT INTO |
tmpRunDeepForcedSource
|
SELECT
|
*
|
FROM
|
daues_SDRP_dedupe_byfilter_4.RunDeepForcedSource
|
WHERE
|
objectId=217773672520226380;
|
This code won't cause any violation of the primary key constrain. And the contents of the temporary table will be the union of result sets select from all 5 input tables.
SELECT
|
id,
|
objectId,
|
COUNT(*) |
FROM
|
tmpRunDeepForcedSource
|
GROUP BY |
id
|
ORDER BY |
id;
|
+--------------------+--------------------+----------+
|
| id | objectId | COUNT(*) |
|
+--------------------+--------------------+----------+
|
| 69324128875708417 | 217773672520226380 | 1 |
|
| 69330839762108417 | 217773672520226380 | 1 |
|
| 69337550648508417 | 217773672520226380 | 1 |
|
| 69344261534908417 | 217773672520226380 | 1 |
|
| 69350972421308417 | 217773672520226380 | 1 |
|
| 160055318036873217 | 217773672520226380 | 1 |
|
| 160062028923273217 | 217773672520226380 | 1 |
|
| 160068739809673217 | 217773672520226380 | 1 |
|
| 160075450696073217 | 217773672520226380 | 1 |
|
| 160082161582473217 | 217773672520226380 | 1 |
|
| 178443148853248001 | 217773672520226380 | 1 |
|
| 178449859739648001 | 217773672520226380 | 1 |
|
| 178456570626048001 | 217773672520226380 | 1 |
|
| 178463281512448001 | 217773672520226380 | 1 |
|
| 178469992398848001 | 217773672520226380 | 1 |
|
...
|
Evaluating cross-database referential integrity
This is another important precondition to be asserted on a way of an idea of merging the tables of daues_SDRP_Stripe82_ncsa with 5 band-specific databases. There are two external dependencies (AKA foreign keys) in the forced deep sources tables (RunDeepForcedSource) which would need to be satisfied:
- objectId
- exposure_id
It turns out the dependencies are resolved as follows:
RunDeepForcedSource key | target table | key |
---|---|---|
objectId | daues_SDRP_Stripe82_ncsa.DeepSource | deepSourceId |
exposure_id | daues_SDRP_Stripe82_ncsa.Science_Ccd_Exposure | scienceCcdExposureId |
A weak proof of this referential integrity can be demonstrated with the following test (based on any row from the deep forced sources table:
SELECT
|
rdfs.objectId,
|
rdfs.exposure_id,
|
sce.run
|
FROM
|
daues_SDRP_dedupe_byfilter_0.RunDeepForcedSource rdfs,
|
daues_SDRP_Stripe82_ncsa.DeepSource ds,
|
daues_SDRP_Stripe82_ncsa.Science_Ccd_Exposure sce
|
WHERE
|
rdfs.objectId=217773672520226380
|
AND |
rdfs.objectId=ds.deepSourceId
|
AND |
rdfs.exposure_id=sce.scienceCcdExposureId;
|
+--------------------+-------------+------+
|
| objectId | exposure_id | run |
|
+--------------------+-------------+------+
|
| 217773672520226380 | 1033010019 | 1033 |
|
| 217773672520226380 | 2385010094 | 2385 |
|
| 217773672520226380 | 2659010125 | 2659 |
|
| 217773672520226380 | 3384010101 | 3384 |
|
| 217773672520226380 | 3437010075 | 3437 |
|
| 217773672520226380 | 4184010122 | 4184 |
|
...
|
| 217773672520226380 | 7195010091 | 7195 |
|
| 217773672520226380 | 7202010034 | 7202 |
|
| 217773672520226380 | 7202010035 | 7202 |
|
+--------------------+-------------+------+
|
77 rows in set (0.63 sec)
|
This corresponds to the same number of entries found in table RunDeepForcedSource for this objectId.
NOTE: strictly speaking this is NOT a proof. Though, a more reliable verification procedure could be establish to verify/guarantee the referential integrity.
Observations
- each of these databases contains only forced sources. No other data from the preceding S13 DRP pipeline stages are present.
- according to the RA range seen in the database its contents corresponds to the NCSA-only subset of the full dataset
- it's quite clear that the contents of table RunDeepForcedSource from all these databases could be combined within a single large table covering all bands due to non-overlapping sequences of the primary key id in the table instances.
- the referential integrity between tables of databases daues_SDRP_Stripe82_ncsa and daues_SDRP_dedupe_byfilter_0,1,2,3,4 has been demostrated
- it also seems to be conceivable to merge tables RunDeepForcedSource into the large database daues_SDRP_Stripe82_ncsa which holds catalogs from the first stages of the processing pipeline. This will yield the complete catalog of the first (NCSA) half of the dataset.
The final sweep through other databases of lsst-db
The following tests have been made in order to rule out a possibility that there is some other (or many) database on the server which might we well be involved into this investigation. The tests were based on patters established in the prior investigation of the well known databases. And we were looking for signs of a presence the substantial amount of data in the corresponding tables of the databases. All tables with the number of rows in the relevant tables which (the numbers) were less than the what had been found earlier were ignored. An additional constrain to be satisfied was a time when each table was created and updated. All tables updated before July 2013 were ignored. MySQL/MariaDB information schema was used in this investigation.
Science exposures:
SELECT
|
TABLE_SCHEMA,
|
TABLE_NAME,
|
TABLE_ROWS,
|
SUBSTR(CREATE_TIME,1,10) AS 'CREATE', |
SUBSTR(UPDATE_TIME,1,10) AS 'UPDATE' |
FROM
|
INFORMATION_SCHEMA.TABLES
|
WHERE
|
TABLE_NAME='Science_Ccd_Exposure' |
AND |
TABLE_ROWS > 0
|
ORDER BY |
TABLE_ROWS DESC; |
+--------------------------------------------------+----------------------+------------+------------+------------+
|
| TABLE_SCHEMA | TABLE_NAME | TABLE_ROWS | CREATE | UPDATE |
|
+--------------------------------------------------+----------------------+------------+------------+------------+
|
| DC_W13_Stripe82 | Science_Ccd_Exposure | 1998219 | 2013-01-18 | 2013-01-18 |
|
| daues_SDRP_Stripe82_ncsa | Science_Ccd_Exposure | 1403385 | 2013-07-31 | 2013-08-01 |
|
| yusra_Stripe82_S13 | Science_Ccd_Exposure | 1403385 | 2014-09-02 | 2014-09-02 |
|
| yusra_rgb2 | Science_Ccd_Exposure | 83427 | 2012-12-30 | 2013-03-06 |
|
| daues_S13_Stripe82_2000 | Science_Ccd_Exposure | 79179 | 2013-06-22 | 2013-06-22 |
|
..
|
Deep sources:
SELECT
|
TABLE_SCHEMA,
|
TABLE_NAME,
|
TABLE_ROWS,
|
SUBSTR(CREATE_TIME,1,10) AS 'CREATE', |
SUBSTR(UPDATE_TIME,1,10) AS 'UPDATE' |
FROM
|
INFORMATION_SCHEMA.TABLES
|
WHERE
|
TABLE_NAME='RunDeepSource' |
ORDER BY |
TABLE_ROWS DESC; |
+----------------------------------+---------------+------------+------------+------------+
|
| TABLE_SCHEMA | TABLE_NAME | TABLE_ROWS | CREATE | UPDATE |
|
+----------------------------------+---------------+------------+------------+------------+
|
| daues_SDRP_Stripe82_ncsa | RunDeepSource | 98577782 | 2013-08-13 | 2013-08-13 |
|
| DC_W13_Stripe82 | RunDeepSource | 14670597 | 2012-12-01 | 2012-12-02 |
|
| yusra_S13_Gordon | RunDeepSource | 5884398 | 2013-11-01 | 2013-11-01 |
|
| daues_S13_Stripe82_2000 | RunDeepSource | 5131738 | 2013-06-27 | 2013-09-01 |
|
| daues_coadd_test | RunDeepSource | 2836990 | 2013-08-08 | 2013-08-08 |
|
| yusra_S13_Stripe82_demo | RunDeepSource | 763995 | 2013-06-16 | 2013-06-16 |
|
| daues_test1 | RunDeepSource | 749895 | 2013-11-30 | 2013-11-30 |
|
| daues_S13_Stripe82_1000 | RunDeepSource | 749895 | 2013-06-16 | 2013-06-16 |
|
| shaw_Stripe82_demo | RunDeepSource | 285785 | 2014-01-30 | 2014-01-30 |
|
| pgee2000_a35 | RunDeepSource | 92343 | 2013-07-07 | 2013-07-07 |
|
...
|
Forced deep sources:
SELECT
|
TABLE_SCHEMA,
|
TABLE_NAME,
|
TABLE_ROWS,
|
SUBSTR(CREATE_TIME,1,10) AS 'CREATE', |
SUBSTR(UPDATE_TIME,1,10) AS 'UPDATE' |
FROM
|
INFORMATION_SCHEMA.TABLES
|
WHERE
|
TABLE_NAME='RunDeepForcedSource' |
ORDER BY |
TABLE_ROWS DESC; |
+------------------------------+---------------------+------------+------------+------------+
|
| TABLE_SCHEMA | TABLE_NAME | TABLE_ROWS | CREATE | UPDATE |
|
+------------------------------+---------------------+------------+------------+------------+
|
| DC_W13_Stripe82 | RunDeepForcedSource | 3870141868 | 2012-12-11 | 2012-12-11 |
|
| daues_SDRP_dedupe_byfilter_2 | RunDeepForcedSource | 1754290686 | 2014-05-06 | 2014-05-30 |
|
| daues_SDRP_dedupe_byfilter_3 | RunDeepForcedSource | 1754272841 | 2014-05-18 | 2014-05-18 |
|
| daues_SDRP_dedupe_byfilter_1 | RunDeepForcedSource | 1752805399 | 2014-05-16 | 2014-05-17 |
|
| daues_SDRP_dedupe_byfilter_4 | RunDeepForcedSource | 1751879670 | 2014-05-19 | 2014-05-19 |
|
| daues_SDRP_dedupe_byfilter_0 | RunDeepForcedSource | 1729599791 | 2014-05-21 | 2014-05-21 |
|
| daues_S13_Stripe82_2000 | RunDeepForcedSource | 447180788 | 2013-08-22 | 2013-08-22 |
|
..
|
Observation
- no other database meets desired criteria of the search
The final conclusions
The study has reviled the following:
- The complete (and scientifically meaningful) catalogs corresponding to NCSA-only half of the S13 dataset have been located withing LSST databse server at NCSA
- The catalogs are spread between 6 databases: daues_SDRP_Stripe82_ncsa (CCD exposure metadata and deep sources) and daues_SDRP_dedupe_byfilter_0,1,2,3,4 (deep forced sources accros 5 wavelength bands).
- there is a clear evidence that the referential integrity is possible across these databases
- there is a clear evidence that the primary key structure of the RunDeepForcedSource tables across all 5 bands won't prevent merging the contents of these tables into a single one
These are optional ideas which may still need to be tested:
- running a comprehensive test to verify the referential integrity across tables of those 6 databases
- verifying that no non-trivial intersection exists between keys of the deep forced source tables
daues_SDRP_Stripe82_ncsa
Non-empty tables:
SELECT
TABLE_NAME, TABLE_ROWS,
FROM
INFORMATION_SCHEMA.TABLES
WHERE
TABLE_ROWS > 0;
+-------------------------------+------------+------------+------------+
| TABLE_NAME | TABLE_ROWS | CREATE | UPDATE |
+-------------------------------+------------+------------+------------+
| DeepCoadd | 13735 | 2013-07-31 | 2013-08-09 |
| DeepCoadd_Metadata | 2809690 | 2013-07-31 | 2013-08-09 |
| DeepCoadd_To_Htm10 | 326420 | 2013-07-31 | 2013-08-09 |
| Filter | 6 | 2013-07-31 | 2013-07-31 |
| LeapSeconds | 39 | 2013-07-31 | 2013-07-31 |
| RunDeepSource | 98577782 | 2013-08-13 | 2013-08-13 |
| Science_Ccd_Exposure | 1403385 | 2013-07-31 | 2013-08-01 |
| Science_Ccd_Exposure_Metadata | 164196045 | 2013-07-31 | 2013-08-01 |
| Science_Ccd_Exposure_To_Htm10 | 26861430 | 2013-07-31 | 2013-08-01 |
| ZZZ_Db_Description | 1 | 2013-07-31 | 2013-07-31 |
+-------------------------------+------------+------------+------------+
SELECT
TABLE_NAME
FROM
INFORMATION_SCHEMA.VIEWS
WHERE
+------------+
| TABLE_NAME |
+------------+
| DeepSource |
+------------+
Evaluating RA range of the deep sources:
SELECT
FROM
DeepSource;
+--------------------+--------------------+
| min_ra | max_ra |
+--------------------+--------------------+
| -40.16296999999997 | 10.022860000000001 |
+--------------------+--------------------+
And a histogram:
SELECT
FROM
DeepSource
chan;
+------+----------+
| chan | COUNT(*) |
+------+----------+
| -40 | 1400094 |
| -39 | 2163949 |
| -38 | 2165932 |
| -37 | 2133441 |
| -36 | 2126065 |
| -35 | 2104630 |
| -34 | 2180195 |
| -33 | 2103957 |
| -32 | 2029172 |
| -31 | 2094254 |
| -30 | 2049792 |
| -29 | 1987232 |
| -28 | 2000964 |
| -27 | 2113021 |
| -26 | 2037486 |
| -25 | 1968147 |
| -24 | 1999535 |
| -23 | 1961868 |
| -22 | 2047897 |
| -21 | 1979037 |
| -20 | 2050123 |
| -19 | 1956640 |
| -18 | 1936693 |
| -17 | 1918960 |
| -16 | 1863719 |
| -15 | 1786404 |
| -14 | 1954821 |
| -13 | 1868526 |
| -12 | 1870825 |
| -11 | 1959243 |
| -10 | 1881106 |
| -9 | 1787642 |
| -8 | 1778006 |
| -7 | 1860040 |
| -6 | 1859687 |
| -5 | 1859810 |
| -4 | 1732045 |
| -3 | 1839934 |
| -2 | 1873321 |
| -1 | 1903577 |
| -0 | 1034431 |
| 0 | 1033119 |
| 1 | 1922121 |
| 2 | 1796961 |
| 3 | 1889517 |
| 4 | 1838618 |
| 5 | 1891434 |
| 6 | 1955242 |
| 7 | 1972145 |
| 8 | 1995514 |
| 9 | 1996845 |
| 10 | 1064045 |
+------+----------+
Observations