Details
- Type: Story
- Status: Done
- Resolution: Done
- Fix Version/s: None
- Component/s: Qserv
- Labels:
- Story Points: 16
- Epic Link:
- Sprint: DB_F16_7, DB_F16_8, DB_F16_9
- Team: Data Access and Database
Description
Create a database populated with the complete catalogs resulting from processing of the SDSS Stripe 82 data at both the NCSA and IN2P3 sites. The database has to be created at NCSA on the following database server: lsst-db.ncsa.illinois.edu
- 2016-08-10 CORRECTION: that was the original plan. Due to space limitations on that server, the database will instead be installed on the Nebula volume lsst-dm-gapon attached to image lsst-gapon-0 (floating IP address 141.142.211.152). See details on this setup in the comments stream below.
The database name will be: gapon_SDRP_Stripe82
The database will be populated with the contents of the following databases:
- NCSA (lsst-db.ncsa.illinois.edu):
daues_SDRP_Stripe82_ncsa
daues_SDRP_dedupe_byfilter_0
daues_SDRP_dedupe_byfilter_1
daues_SDRP_dedupe_byfilter_2
daues_SDRP_dedupe_byfilter_3
daues_SDRP_dedupe_byfilter_4
- IN2P3 (ccdb02.in2p3.fr):
lsst_prod_DC_2013_2
lsst_prod_dedupe_byfilter_g
lsst_prod_dedupe_byfilter_i
lsst_prod_dedupe_byfilter_r
lsst_prod_dedupe_byfilter_u
lsst_prod_dedupe_byfilter_z
Additional requirements:
- Duplicate entries (due to the overlap in the RA range) will need to be carefully assessed and eliminated.
- The referential integrity of the resulting database will need to be tested.
Activity
De-duplication of RunDeepForcedSource (IN2P3 u-band)
This operation is only needed for the IN2P3 tables, in all 5 bands. Only the u-band is going to be processed at this stage, mainly to work out the optimal protocol for the de-duplication. The other bands will be processed later.
The general idea of the merge is to use previously recorded object identifiers to eliminate the rows of RunDeepForcedSource which correspond to those objects. The identifiers are stored in the table merge_info.in2p3_RunDeepSource_overlap. There are about 10^7 such objects across all 5 bands.
In order to achieve the maximum performance of the operation it would be best to avoid any direct I/O during the merge. This would be possible if all of the following conditions were met:
- do not write back into the input table
- stream the filtered rows into another output table which has the same schema
- the index for the FK objectId on the input table must be DISABLED (this allows a sequential full table scan)
- all keys (including the PRIMARY one) on the output table must be DISABLED
- the index of the object identifiers to be eliminated must fit into the MySQL key buffer in the machine's memory (this has already been ensured by increasing the buffer size up to 8 GB)
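The filtering itself is a simple anti-join on the pre-recorded object identifiers. A minimal Python sketch of the logic, using hypothetical toy rows rather than the real tables:

```python
# Sketch of the de-duplication anti-join: keep only the source rows
# whose objectId is NOT in the pre-recorded overlap set.
overlap_ids = {101, 103}  # stands in for merge_info.in2p3_RunDeepSource_overlap

source_rows = [  # (id, objectId) pairs standing in for RunDeepForcedSource
    (1, 100), (2, 101), (3, 102), (4, 103), (5, 104),
]

# Equivalent of: SELECT * FROM RunDeepForcedSource
#                WHERE objectId NOT IN (SELECT id FROM ..._overlap);
deduped = [row for row in source_rows if row[1] not in overlap_ids]
print(deduped)  # rows with objectId 101 and 103 are dropped
```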
Database server tuneup
Tuned up some server parameters for the 16 GB memory machine as per the following document:
Here is a list of those:

bulk_insert_buffer_size = 8589934592
key_buffer_size = 4294967296
sort_buffer_size = 8589934592
myisam_max_sort_file_size = 2147483648
myisam_repair_threads = 8
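The raw byte values translate to round powers-of-two sizes; a quick check (assuming GiB, i.e. 1024^3 bytes):

```python
# Convert the raw byte values of the tuned parameters to GiB.
GiB = 1024 ** 3
settings = {
    "bulk_insert_buffer_size": 8589934592,    # 8 GiB
    "key_buffer_size": 4294967296,            # 4 GiB
    "sort_buffer_size": 8589934592,           # 8 GiB
    "myisam_max_sort_file_size": 2147483648,  # 2 GiB
}
for name, value in settings.items():
    print(f"{name} = {value // GiB} GiB")
```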
Estimating the performance of the operation
ALTER TABLE lsst_prod_dedupe_byfilter_u.RunDeepForcedSource DISABLE KEYS;

SELECT COUNT(*)
FROM
    lsst_prod_dedupe_byfilter_u.RunDeepForcedSource `rdfs`,
    merge_info.in2p3_RunDeepSource_overlap `overlap`
WHERE
    rdfs.objectId = overlap.id;

+-----------+
| COUNT(*)  |
+-----------+
| 176989749 |
+-----------+
1 row in set (1 hour 45 min 39.36 sec)
De-duplicating the table
The performance of the operation shall be monitored by watching for changes in the size of the corresponding file in the data directory of the database installation.
Setting up the output table:
CREATE TABLE lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup
LIKE lsst_prod_dedupe_byfilter_u.RunDeepForcedSource;

SHOW INDEX IN lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup;
+---------------------------+------------+------------------------+--------------+--------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------------------------+------------+------------------------+--------------+--------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| RunDeepForcedSource_dedup | 0 | PRIMARY | 1 | id | A | 0 | NULL | NULL | | BTREE | | |
| RunDeepForcedSource_dedup | 1 | IDX_coord_htmId20 | 1 | coord_htmId20 | A | NULL | NULL | NULL | YES | BTREE | | |
| RunDeepForcedSource_dedup | 1 | IDX_coord_decl | 1 | coord_decl | A | NULL | NULL | NULL | YES | BTREE | | |
| RunDeepForcedSource_dedup | 1 | IDX_parent | 1 | parent | A | NULL | NULL | NULL | YES | BTREE | | |
| RunDeepForcedSource_dedup | 1 | IDX_exposure_id | 1 | exposure_id | A | NULL | NULL | NULL | | BTREE | | |
| RunDeepForcedSource_dedup | 1 | IDX_exposure_filter_id | 1 | exposure_filter_id | A | NULL | NULL | NULL | | BTREE | | |
| RunDeepForcedSource_dedup | 1 | IDX_objectId | 1 | objectId | A | NULL | NULL | NULL | | BTREE | | |
+---------------------------+------------+------------------------+--------------+--------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
DROP INDEX `PRIMARY` ON lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup;
DROP INDEX `IDX_coord_htmId20` ON lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup;
DROP INDEX `IDX_coord_decl` ON lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup;
DROP INDEX `IDX_parent` ON lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup;
DROP INDEX `IDX_exposure_id` ON lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup;
DROP INDEX `IDX_exposure_filter_id` ON lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup;
DROP INDEX `IDX_objectId` ON lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup;
De-duplicating:
SELECT CONCAT(CURRENT_DATE(),' ',CURRENT_TIME()) AS 'BEGIN TIME';

INSERT INTO lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup
SELECT * FROM lsst_prod_dedupe_byfilter_u.RunDeepForcedSource
WHERE objectId NOT IN
    (SELECT id FROM merge_info.in2p3_RunDeepSource_overlap);

SELECT CONCAT(CURRENT_DATE(),' ',CURRENT_TIME()) AS 'END TIME';

+---------------------+
| BEGIN TIME          |
+---------------------+
| 2016-09-02 18:37:04 |
+---------------------+

Query OK, 1720938745 rows affected (4 hours 21 min 29.13 sec)
Records: 1720938745  Duplicates: 0  Warnings: 0

+---------------------+
| END TIME            |
+---------------------+
| 2016-09-02 22:58:34 |
+---------------------+
Summary of observations
The performance of the operation is:
- wrote 109690 rows/sec
- wrote 41 MB/sec
- 4.5 hours per table
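The quoted row rate follows directly from the row count and wall-clock time of the INSERT:

```python
# Derive the de-duplication throughput from the INSERT's row count
# and its wall-clock time of 4 hours 21 min 29.13 sec.
rows = 1_720_938_745
seconds = 4 * 3600 + 21 * 60 + 29.13
rows_per_sec = rows / seconds
print(round(rows_per_sec))  # ~109,690 rows/sec
```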
The file system view on the original and reduced tables:
% ls -alh /data/mysql/db/lsst_prod_dedupe_byfilter_u/ | grep RunDeepForcedSource
-rw-rw---- 1 mysql mysql 634G Sep  2 22:58 RunDeepForcedSource_dedup.MYD
-rw-rw---- 1 mysql mysql 1.0K Sep  2 22:58 RunDeepForcedSource_dedup.MYI
-rw-rw---- 1 mysql mysql 699G Aug 14 03:11 RunDeepForcedSource.MYD
-rw-rw---- 1 mysql mysql 155G Sep  2 03:30 RunDeepForcedSource.MYI
Extending and preparing RunDeepForcedSource for partitioning
This procedure applies to all 10 instances of the table.
Reasons
PROBLEM: Here is the problem which is going to be addressed. The current implementation of the partitioning algorithm can't benefit from the direct association of Sources (RunDeepForcedSource) with the corresponding Objects (RunDeepSource). That would require building and maintaining a so-called secondary index mapping object identifiers to the corresponding Qserv chunks. That index will be built later. In the meantime the partitioning is done purely statically, based on the (RA,DECL) attributes of both tables, where the Objects table is the director table which determines the boundaries of chunks. Entries of the Sources table will simply be placed into the corresponding partition/chunk based on an implicit association of their (RA,DECL) with those of the corresponding objects. The problem with this scheme is that the current instances of table RunDeepForcedSource do not carry the Objects' (RA,DECL) values; the (RA,DECL) attributes of Sources fluctuate within a small spatial region in the vicinity of their Object.
This is illustrated with a result of following query:
SELECT
    ROUND(coord_ra   - 38.37141147524322, 12) AS 'diff_coord_ra',
    ROUND(coord_decl - 1.148188656054416, 12) AS 'diff_coord_decl'
FROM
    lsst_prod_dedupe_byfilter_u.RunDeepForcedSource
WHERE
    objectId = 2067094074033859
ORDER BY
    diff_coord_ra;
+-----------------+-----------------+
| diff_coord_ra   | diff_coord_decl |
+-----------------+-----------------+
| -0.000161892500 | -0.000111360577 |
| -0.000095904728 | -0.000020831633 |
| -0.000094140933 | -0.000029997114 |
| -0.000083455936 | -0.000460564277 |
| -0.000046769964 | -0.000084374571 |
| -0.000039476722 | 0.000091648320 |
| -0.000027119942 | 0.000072532090 |
| -0.000024605283 | 0.000003101783 |
| -0.000012371979 | -0.000301252512 |
| -0.000003916412 | 0.000170950966 |
| -0.000000000000 | 0.000000000000 |
| 0.000000000000 | -0.000000000000 |
...
| -0.000000000000 | -0.000000000000 |
| -0.000000000000 | -0.000000000000 |
| 0.000018318412 | 0.000036593137 |
| 0.000029191432 | 0.000110790091 |
| 0.000034774232 | -0.000108592399 |
| 0.000060996604 | -0.000148811948 |
| 0.000072243194 | -0.000003792909 |
| 0.000092347604 | -0.000036465276 |
| 0.000094263668 | 0.000113629595 |
| 0.000106699599 | 0.000305447839 |
| 0.000195478015 | -0.000067243647 |
| 0.000385799167 | -0.000061535130 |
+-----------------+-----------------+
92 rows in set (0.00 sec)
SOLUTION: extend the table, before dumping its contents into the CSV/TSV format, with two columns populated with the (RA,DECL) values copied from the corresponding entries of the Objects table (RunDeepSource). This operation will apply to all 10 instances of table RunDeepForcedSource. Note that the 5 IN2P3 tables will be de-duplicated first.
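The extension is a plain key lookup: each Source row gets the (RA,DECL) of its parent Object appended. A toy Python sketch of that join (hypothetical values and a reduced schema, not the real tables):

```python
# Objects: id -> (coord_ra, coord_decl), standing in for RunDeepSource.
objects = {
    2067094074033859: (38.37141147524322, 1.148188656054416),
}

# Source rows carry their own slightly fluctuating coordinates plus objectId.
sources = [
    {"id": 1, "objectId": 2067094074033859,
     "coord_ra": 38.37125, "coord_decl": 1.14808},
]

# Equivalent of: SELECT rdfs.*, rdss.coord_ra, rdss.coord_decl
#                FROM ... JOIN ... ON rdfs.objectId = rdss.id;
extended = []
for row in sources:
    ra, decl = objects[row["objectId"]]
    extended.append({**row, "object_coord_ra": ra, "object_coord_decl": decl})
print(extended[0]["object_coord_ra"], extended[0]["object_coord_decl"])
```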
Creating a subset of table RunDeepSource (Objects) of IN2P3
This table is quite large (considering just its IN2P3 half):
- ~100 million rows
- the total size (w/o indexes) is about 55 GB
- the row size is 556 bytes
Attempting to join the Sources table RunDeepForcedSource_dedup (2 billion entries, 700 GB in size) with this one would be terribly slow on a machine with only 16 GB of memory. Hence the goal of this step is to create a smaller table which will:
- have only 3 columns: id, ra, decl (3 x 64 bits, about 25 bytes per row with overhead)
- be 20 times smaller than the full table, that is about 2.5 GB in size, which should fit into the machine's physical memory
- be indexed on the object identifier column
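The 2.5 GB target is simple arithmetic, assuming the 25-byte row size:

```python
# Back-of-the-envelope size estimate for the 3-column subset table.
rows = 100_000_000   # ~1e8 objects in the IN2P3 half
row_bytes = 25       # 3 x 8-byte columns plus MyISAM per-row overhead
size_gb = rows * row_bytes / 1e9
print(size_gb)       # 2.5 (GB), comfortably inside 16 GB of RAM
```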
Create the table:
CREATE TABLE lsst_prod_DC_2013_2.RunDeepSource_radecl (
    `id` bigint(20) NOT NULL,
    `coord_ra` double DEFAULT NULL,
    `coord_decl` double DEFAULT NULL,
    PRIMARY KEY (`id`)
) ENGINE=MyISAM;

Populate the table with a subset:

INSERT INTO lsst_prod_DC_2013_2.RunDeepSource_radecl
SELECT id,coord_ra,coord_decl
FROM lsst_prod_DC_2013_2.RunDeepSource;

Query OK, 98088148 rows affected (13 min 23.19 sec)
Records: 98088148  Duplicates: 0  Warnings: 0

% ls -alh /data/mysql/db/lsst_prod_DC_2013_2/ | grep RunDeepSource_radecl
-rw-rw---- 1 mysql mysql 8.5K Sep  2 22:45 RunDeepSource_radecl.frm
-rw-rw---- 1 mysql mysql 2.3G Sep  2 23:01 RunDeepSource_radecl.MYD
-rw-rw---- 1 mysql mysql 1.7G Sep  2 23:01 RunDeepSource_radecl.MYI
Observations:
- the row size is: 25 Bytes
- the table size is: 2.3 GB
- the primary index's size is: 1.7 GB
Creating a subset of table RunDeepSource (Objects) of NCSA
Create the table:
CREATE TABLE gapon_SDRP_Stripe82.RunDeepSource_radecl (
    `id` bigint(20) NOT NULL,
    `coord_ra` double DEFAULT NULL,
    `coord_decl` double DEFAULT NULL,
    PRIMARY KEY (`id`)
) ENGINE=MyISAM;

Populate the table with a subset:

INSERT INTO gapon_SDRP_Stripe82.RunDeepSource_radecl
SELECT id,coord_ra,coord_decl
FROM gapon_SDRP_Stripe82.RunDeepSource;

Query OK, 98577782 rows affected (14 min 52.09 sec)
Records: 98577782  Duplicates: 0  Warnings: 0
Extending Sources of the IN2P3's u-band
The operation will use the de-duplicated table RunDeepForcedSource_dedup from database lsst_prod_dedupe_byfilter_u as its source and will produce the extended table RunDeepForcedSource_dedup_radecl as its destination within the same database. The destination table will have two extra columns:
- object_coord_ra
- object_coord_decl
No indexes (including the PRIMARY one) will be created for the destination table.
Creating the extended table (the two extra columns are appended at the end):

CREATE TABLE lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup_radecl
LIKE lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup;

ALTER TABLE lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup_radecl
    ADD COLUMN `object_coord_ra` double DEFAULT NULL;

ALTER TABLE lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup_radecl
    ADD COLUMN `object_coord_decl` double DEFAULT NULL;
Load the table with a small sample (10 million rows) to estimate the performance of the operation:
INSERT INTO lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup_radecl
SELECT rdfs.*,rdss.coord_ra,rdss.coord_decl
FROM
    lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup `rdfs`
STRAIGHT_JOIN
    lsst_prod_DC_2013_2.RunDeepSource_radecl `rdss`
ON
    rdfs.objectId = rdss.id
LIMIT 10000000;

Query OK, 10000000 rows affected (1 min 46.48 sec)
Records: 10000000  Duplicates: 0  Warnings: 0
Observations:
- performance: 94000 rows/sec
- write I/O rate: 37 MB/sec
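Those figures come straight from the 10-million-row test insert:

```python
# Derive the quoted rate from the test insert of 10 million rows
# in 1 min 46.48 sec.
rows = 10_000_000
seconds = 1 * 60 + 46.48
print(round(rows / seconds))  # 93914, i.e. ~94,000 rows/sec
```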
Clearing the test data from the destination table, then processing all rows from the input table:

DELETE FROM lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup_radecl;

SELECT CONCAT(CURRENT_DATE(),' ',CURRENT_TIME()) AS 'BEGIN TIME';

INSERT INTO lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup_radecl
SELECT rdfs.*,rdss.coord_ra,rdss.coord_decl
FROM
    lsst_prod_dedupe_byfilter_u.RunDeepForcedSource_dedup `rdfs`
STRAIGHT_JOIN
    lsst_prod_DC_2013_2.RunDeepSource_radecl `rdss`
ON
    rdfs.objectId = rdss.id;

SELECT CONCAT(CURRENT_DATE(),' ',CURRENT_TIME()) AS 'END TIME';

+---------------------+
| BEGIN TIME          |
+---------------------+
| 2016-09-03 01:04:48 |
+---------------------+

+---------------------+
| END TIME            |
+---------------------+
| 2016-09-03 05:42:48 |
+---------------------+
Dumping sources of IN2P3's u-band into TSV
% mysqldump -u root \
    lsst_prod_dedupe_byfilter_u RunDeepForcedSource_dedup_radecl \
    -T/data1/mysql_tsv_dumps/lsst_prod_dedupe_byfilter_u/

% ls -alh /data1/mysql_tsv_dumps/lsst_prod_dedupe_byfilter_u/
-rw-r--r-- 1 root  root  5.1K Sep  3 05:58 RunDeepForcedSource_dedup_radecl.sql
-rw-rw-rw- 1 mysql mysql 1.2T Sep  3 18:59 RunDeepForcedSource_dedup_radecl.txt

% mv \
    /data1/mysql_tsv_dumps/lsst_prod_dedupe_byfilter_u/RunDeepForcedSource_dedup_radecl.txt \
    /data1/mysql_tsv_dumps/lsst_prod_dedupe_byfilter_u/RunDeepForcedSource_dedup_radecl.tsv

% nohup pigz --fast \
    /data1/mysql_tsv_dumps/lsst_prod_dedupe_byfilter_u/RunDeepForcedSource_dedup_radecl.tsv &
Extending Sources of the NCSA's u-band and making the TSV dump
No indexes (including the PRIMARY one) should be defined for the destination table.
Prepare the empty table:
CREATE TABLE gapon_SDRP_Stripe82.RunDeepForcedSource_radecl
LIKE gapon_SDRP_Stripe82.RunDeepForcedSource;

DROP INDEX `PRIMARY`
    ON gapon_SDRP_Stripe82.RunDeepForcedSource_radecl;
DROP INDEX `IDX_coord_htmId20`
    ON gapon_SDRP_Stripe82.RunDeepForcedSource_radecl;
DROP INDEX `IDX_coord_decl`
    ON gapon_SDRP_Stripe82.RunDeepForcedSource_radecl;
DROP INDEX `IDX_parent`
    ON gapon_SDRP_Stripe82.RunDeepForcedSource_radecl;
DROP INDEX `IDX_exposure_id`
    ON gapon_SDRP_Stripe82.RunDeepForcedSource_radecl;
DROP INDEX `IDX_exposure_filter_id`
    ON gapon_SDRP_Stripe82.RunDeepForcedSource_radecl;
DROP INDEX `IDX_objectId`
    ON gapon_SDRP_Stripe82.RunDeepForcedSource_radecl;
DROP INDEX `IDX_objectId_exposure`
    ON gapon_SDRP_Stripe82.RunDeepForcedSource_radecl;

ALTER TABLE gapon_SDRP_Stripe82.RunDeepForcedSource_radecl
    ADD COLUMN `object_coord_ra` double DEFAULT NULL;

ALTER TABLE gapon_SDRP_Stripe82.RunDeepForcedSource_radecl
    ADD COLUMN `object_coord_decl` double DEFAULT NULL;
Begin extending the table.
SELECT CONCAT(CURRENT_DATE(),' ',CURRENT_TIME()) AS 'BEGIN TIME';

INSERT INTO gapon_SDRP_Stripe82.RunDeepForcedSource_radecl
SELECT rdfs.*,rdss.coord_ra,rdss.coord_decl
FROM
    gapon_SDRP_Stripe82.RunDeepForcedSource `rdfs`
STRAIGHT_JOIN
    gapon_SDRP_Stripe82.RunDeepSource_radecl `rdss`
ON
    rdfs.objectId = rdss.id;

SELECT CONCAT(CURRENT_DATE(),' ',CURRENT_TIME()) AS 'END TIME';

+---------------------+
| BEGIN TIME          |
+---------------------+
| 2016-09-04 06:53:30 |
+---------------------+

+---------------------+
| END TIME            |
+---------------------+
| 2016-09-04 11:25:05 |
+---------------------+
Dumping the table into the TSV format:
% mysqldump -u root \
    gapon_SDRP_Stripe82 RunDeepForcedSource_radecl \
    -T/data1/mysql_tsv_dumps/daues_SDRP_dedupe_byfilter_0/

% ls -alh /data1/mysql_tsv_dumps/daues_SDRP_dedupe_byfilter_0/

% mv \
    /data1/mysql_tsv_dumps/daues_SDRP_dedupe_byfilter_0/RunDeepForcedSource_radecl.txt \
    /data1/mysql_tsv_dumps/daues_SDRP_dedupe_byfilter_0/RunDeepForcedSource_radecl.tsv

% nohup pigz --fast \
    /data1/mysql_tsv_dumps/daues_SDRP_dedupe_byfilter_0/RunDeepForcedSource_radecl.tsv &
Extending Sources of the NCSA's g-band and making the TSV dump
No indexes (including the PRIMARY one) should be defined for the destination table.
Prepare the empty table:
CREATE TABLE daues_SDRP_dedupe_byfilter_1.RunDeepForcedSource_radecl
LIKE daues_SDRP_dedupe_byfilter_1.RunDeepForcedSource;

DROP INDEX `PRIMARY`
    ON daues_SDRP_dedupe_byfilter_1.RunDeepForcedSource_radecl;
DROP INDEX `IDX_coord_htmId20`
    ON daues_SDRP_dedupe_byfilter_1.RunDeepForcedSource_radecl;
DROP INDEX `IDX_coord_decl`
    ON daues_SDRP_dedupe_byfilter_1.RunDeepForcedSource_radecl;
DROP INDEX `IDX_parent`
    ON daues_SDRP_dedupe_byfilter_1.RunDeepForcedSource_radecl;
DROP INDEX `IDX_exposure_id`
    ON daues_SDRP_dedupe_byfilter_1.RunDeepForcedSource_radecl;
DROP INDEX `IDX_exposure_filter_id`
    ON daues_SDRP_dedupe_byfilter_1.RunDeepForcedSource_radecl;
DROP INDEX `IDX_objectId`
    ON daues_SDRP_dedupe_byfilter_1.RunDeepForcedSource_radecl;
DROP INDEX `IDX_objectId_exposure`
    ON daues_SDRP_dedupe_byfilter_1.RunDeepForcedSource_radecl;

ALTER TABLE daues_SDRP_dedupe_byfilter_1.RunDeepForcedSource_radecl
    ADD COLUMN `object_coord_ra` double DEFAULT NULL;

ALTER TABLE daues_SDRP_dedupe_byfilter_1.RunDeepForcedSource_radecl
    ADD COLUMN `object_coord_decl` double DEFAULT NULL;
Begin extending the table.
SELECT CONCAT(CURRENT_DATE(),' ',CURRENT_TIME()) AS 'BEGIN TIME';

INSERT INTO daues_SDRP_dedupe_byfilter_1.RunDeepForcedSource_radecl
SELECT rdfs.*,rdss.coord_ra,rdss.coord_decl
FROM
    daues_SDRP_dedupe_byfilter_1.RunDeepForcedSource `rdfs`
STRAIGHT_JOIN
    gapon_SDRP_Stripe82.RunDeepSource_radecl `rdss`
ON
    rdfs.objectId = rdss.id;

SELECT CONCAT(CURRENT_DATE(),' ',CURRENT_TIME()) AS 'END TIME';
IN PROGRESS...
+---------------------+
| BEGIN TIME          |
+---------------------+
| 2016-09-04 19:37:57 |
+---------------------+
Planning the next steps
Bulk processing the forced deep sources
database | load | de-dupe | extend | dump | compress | notes
---|---|---|---|---|---|---
daues_SDRP_dedupe_byfilter_0 | ✓ | n/a | ✓ | ✓ | ✓ |
daues_SDRP_dedupe_byfilter_1 | ✓ | n/a | ✓ | ✓ | ✓ |
daues_SDRP_dedupe_byfilter_2 | ✓ | n/a | ✓ | ✓ | ✓ |
daues_SDRP_dedupe_byfilter_3 | ✓ | n/a | ✓ | ✓ | ✓ |
daues_SDRP_dedupe_byfilter_4 | ✓ | n/a | ✓ | ✓ | ✓ |
lsst_prod_dedupe_byfilter_u | ✓ | ✓ | ✓ | ✓ | ✓ |
lsst_prod_dedupe_byfilter_g | ✓ | ✓ | ✓ | ✓ | ✓ |
lsst_prod_dedupe_byfilter_r | ✓ | ✓ | ✓ | ✓ | ✓ |
lsst_prod_dedupe_byfilter_i | ✓ | ✓ | ✓ | ✓ | ✓ |
lsst_prod_dedupe_byfilter_z | ✓ | ✓ | ✓ | ✓ | ✓ |
Other actions required to finish producing the complete input data set for the partitioning algorithm and the Qserv loader:
- do the actual merge of DeepCoadd and RunDeepSource and of the metadata tables DeepCoadd_Metadata and DeepCoadd_To_Htm10; report this in the corresponding comment section above, and do not forget to mention the right order of these operations
- dump the rest of the tables from the main NCSA database in the format which is expected by the partitioning algorithm and the Qserv loader
These steps are expected to be properly documented and reported in comments of this JIRA ticket.
The FINAL step will be to put the most important summary information on the whole process into the ticket's description. This should include the key decisions on the merge, the schema extension, and the location of the input dataset for the loader.
Summary
The task has been completed. All dumps have been made and copied over to the GPFS file system:
- lsst-dev7.ncsa.illinois.edu:/datasets/gapon/catalogs
The intermediate data are still available on the NCSA OpenStack instance lsst-gapon-0 and the 4 attached volumes:
- lsst-dm-gapon
- lsst-dm-gapon-1
- lsst-dm-gapon-2
- lsst-dm-gapon-3
Merging and de-duplicating DeepCoadd and RunDeepSource
These tables need to be merged and de-duplicated before attempting to de-duplicate the five IN2P3 RunDeepForcedSource tables. The effort is based on the following dependency between tables (explained earlier in this ticket):
..
key_buffer_size = 4294967296
..
% systemctl restart mariadb.service
Saving the overlap info for DeepCoadd
Here we're going to save the overlap identifiers in intermediate tables.
NCSA:
) ENGINE=MyISAM;
Query OK, 1380 rows affected (0.10 sec)
Records: 1380 Duplicates: 0 Warnings: 0
IN2P3:
) ENGINE=MyISAM;
Query OK, 1380 rows affected (41.85 sec)
Records: 1380 Duplicates: 0 Warnings: 0
Observations:
Saving the overlap info for RunDeepSource
Estimating the total number of Objects in the overlap area
NCSA:
SELECT
    COUNT(id)
FROM
    gapon_SDRP_Stripe82.RunDeepSource `rds`,
    merge_info.ncsa_DeepCoadd_overlap `overlap`
WHERE
    rds.coadd_id = overlap.deepCoaddId;

+-----------+
| COUNT(id) |
+-----------+
| 9991529 |
+-----------+
1 row in set (1 min 43.00 sec)
IN2P3:
SELECT
    COUNT(id)
FROM
    lsst_prod_DC_2013_2.RunDeepSource `rds`,
    merge_info.in2p3_DeepCoadd_overlap `overlap`
WHERE
    rds.coadd_id = overlap.deepCoaddId;

+-----------+
| COUNT(id) |
+-----------+
| 9993373 |
+-----------+
1 row in set (2 min 46.84 sec)
Recording the identifiers in the bookkeeping tables for both sites
NCSA:
) ENGINE=MyISAM;
FROM
gapon_SDRP_Stripe82.RunDeepSource `rds`,
merge_info.ncsa_DeepCoadd_overlap `overlap`
WHERE
rds.coadd_id = overlap.deepCoaddId;
Query OK, 9991529 rows affected (25.07 sec)
Records: 9991529 Duplicates: 0 Warnings: 0
IN2P3:
) ENGINE=MyISAM;
FROM
lsst_prod_DC_2013_2.RunDeepSource `rds`,
merge_info.in2p3_DeepCoadd_overlap `overlap`
WHERE
rds.coadd_id = overlap.deepCoaddId;
Query OK, 9993373 rows affected (32.20 sec)
Records: 9993373 Duplicates: 0 Warnings: 0
Observations
This means that, with the proper key buffer size (as mentioned at the start of this comment), the whole table index will fit into memory, and it can then be used for eliminating overlaps from the largest tables (RunDeepForcedSource).
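A rough estimate supports this. Assuming roughly 16 bytes per MyISAM index entry (an 8-byte BIGINT key plus a row pointer; the exact per-entry cost is an assumption), the whole overlap index is a small fraction of the 4 GiB key buffer:

```python
# Rough size of the overlap-identifier index vs. the configured key buffer.
ids = 9_993_373             # identifiers recorded for the IN2P3 overlap
entry_bytes = 16            # assumed: 8-byte key + pointer/overhead
index_bytes = ids * entry_bytes
key_buffer = 4 * 1024 ** 3  # key_buffer_size = 4294967296
print(f"{index_bytes / 1e6:.0f} MB index vs {key_buffer / 1e9:.1f} GB buffer")
```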
Deduplicating DeepCoadd (and its metadata tables) of IN2P3
This is an intermediate step before producing the fully merged tables. The goal is to remove the overlap in tables:
The resulting tables will be residing within the same database:
Deduplicate by excluding the previously recorded identifiers of coadds in the overlap area of IN2P3. Run a few test shots first to make sure we're getting consistent results.
DeepCoadd
+----------+
| COUNT(*) |
+----------+
| 13680 |
+----------+
WHERE
+----------+
| COUNT(*) |
+----------+
| 1380 |
+----------+
WHERE
+----------+
| COUNT(*) |
+----------+
| 12300 |
+----------+
FROM
lsst_prod_DC_2013_2.DeepCoadd
Query OK, 12300 rows affected (0.09 sec)
Records: 12300 Duplicates: 0 Warnings: 0
DeepCoadd_Metadata
+----------+
| COUNT(*) |
+----------+
| 3022797 |
+----------+
WHERE
+----------+
| COUNT(*) |
+----------+
| 289140 |
+----------+
1 row in set (3.12 sec)
WHERE
+----------+
| COUNT(*) |
+----------+
| 2733657 |
+----------+
1 row in set (1.34 sec)
FROM
lsst_prod_DC_2013_2.DeepCoadd_Metadata
Query OK, 2733657 rows affected (1 min 1.31 sec)
Records: 2733657 Duplicates: 0 Warnings: 0
DeepCoadd_To_Htm10
+----------+
| COUNT(*) |
+----------+
| 324020 |
+----------+
WHERE
+----------+
| COUNT(*) |
+----------+
| 32740 |
+----------+
1 row in set (0.06 sec)
WHERE
+----------+
| COUNT(*) |
+----------+
| 291280 |
+----------+
1 row in set (0.10 sec)
FROM
lsst_prod_DC_2013_2.DeepCoadd_To_Htm10
Query OK, 291280 rows affected (0.13 sec)
Records: 291280 Duplicates: 0 Warnings: 0
Deduplicating RunDeepSource of IN2P3
This is an intermediate step before producing the fully merged tables. The goal is to remove the overlap in tables:
The resulting table will be residing within the same database:
+----------+
| COUNT(*) |
+----------+
| 98088148 |
+----------+
1 row in set (0.01 sec)
+----------+
| COUNT(*) |
+----------+
| 9993373 |
+----------+
WHERE
+----------+
| COUNT(*) |
+----------+
| 9993373 |
+----------+
1 row in set (46.01 sec)
WHERE
+----------+
| COUNT(*) |
+----------+
| 88094775 |
+----------+
1 row in set (3 min 35.95 sec)
FROM
lsst_prod_DC_2013_2.RunDeepSource
Query OK, 88094775 rows affected (25 min 33.03 sec)
Records: 88094775 Duplicates: 0 Warnings: 0
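As a sanity check, the de-duplicated row counts reported in these comments are internally consistent: each equals the original count minus the recorded overlap.

```python
# Verify the de-duplication arithmetic: total - overlap == kept.
checks = {
    "DeepCoadd":          (13_680,     1_380,     12_300),
    "DeepCoadd_Metadata": (3_022_797,  289_140,   2_733_657),
    "DeepCoadd_To_Htm10": (324_020,    32_740,    291_280),
    "RunDeepSource":      (98_088_148, 9_993_373, 88_094_775),
}
for table, (total, overlap, kept) in checks.items():
    assert total - overlap == kept, table
print("all de-duplication counts consistent")
```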
Merging tables DeepCoadd and RunDeepSource
This is the final stage of the merge: each pair of tables (one from each site) is combined to include rows from both sites. The resulting tables are placed into a separate database called lsst_sdss_stripe82. The database will also be populated with other shared tables needed by the PDAC Qserv loader.
DeepCoadd
Create the table and disable all keys but the PRIMARY one, which will be needed as a safeguard against duplicate rows:
Populate the table with rows from both sites (starting with NCSA):
Query OK, 13735 rows affected (0.20 sec)
Records: 13735 Duplicates: 0 Warnings: 0
Query OK, 12300 rows affected (0.14 sec)
Records: 12300 Duplicates: 0 Warnings: 0
+----------+
| COUNT(*) |
+----------+
| 26035 |
+----------+
Follow the same approach for 2 dependent metadata tables:
Query OK, 2809690 rows affected (31.07 sec)
Records: 2809690 Duplicates: 0 Warnings: 0
Query OK, 2733657 rows affected (1 min 35.83 sec)
Records: 2733657 Duplicates: 0 Warnings: 0
+----------+
| COUNT(*) |
+----------+
| 5543347 |
+----------+
Query OK, 326420 rows affected (0.52 sec)
Records: 326420 Duplicates: 0 Warnings: 0
Query OK, 291280 rows affected (1.40 sec)
Records: 291280 Duplicates: 0 Warnings: 0
+----------+
| COUNT(*) |
+----------+
| 617700 |
+----------+
RunDeepSource
Query OK, 98577782 rows affected (18 min 56.41 sec)
Records: 98577782 Duplicates: 0 Warnings: 0
Query OK, 88094775 rows affected (17 min 12.25 sec)
Records: 88094775 Duplicates: 0 Warnings: 0
+-----------+
| COUNT(*) |
+-----------+
| 186672557 |
+-----------+
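The merged row counts are exactly the sums of the per-site counts, which is a useful consistency check on the whole merge:

```python
# Check merged count = NCSA count + de-duplicated IN2P3 count.
merges = {
    "DeepCoadd":          (13_735,     12_300,     26_035),
    "DeepCoadd_Metadata": (2_809_690,  2_733_657,  5_543_347),
    "DeepCoadd_To_Htm10": (326_420,    291_280,    617_700),
    "RunDeepSource":      (98_577_782, 88_094_775, 186_672_557),
}
for table, (ncsa, in2p3, merged) in merges.items():
    assert ncsa + in2p3 == merged, table
print("all merged counts consistent")
```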
Make a dump of the merged core tables