# Assemble a complete database with S13 DRP catalogs


#### Details

• Type: Story
• Status: Done
• Resolution: Done
• Fix Version/s: None
• Component/s:
• Labels:
• Story Points: 16
• Sprint: DB_F16_7, DB_F16_8, DB_F16_9
• Team: Data Access and Database

#### Description

Create a database populated with complete catalogs resulting from processing of the SDSS Stripe 82 data at both the NCSA and IN2P3 sites. The database has to be created at NCSA on the following database server: lsst-db.ncsa.illinois.edu

• 2016-08-10: CORRECTION: that was the original plan. Due to space limitations on that server, the database will instead be installed on Nebula volume lsst-dm-gapon attached to image lsst-gapon-0 (floating IP address 141.142.211.152). See details on this setup in the comments stream below.

The database name will be: gapon_SDRP_Stripe82

The database will be populated with the contents of the following databases:

• NCSA (lsst-db.ncsa.illinois.edu):

```
daues_SDRP_Stripe82_ncsa
daues_SDRP_dedupe_byfilter_0
daues_SDRP_dedupe_byfilter_1
daues_SDRP_dedupe_byfilter_2
daues_SDRP_dedupe_byfilter_3
daues_SDRP_dedupe_byfilter_4
```

• IN2P3 (ccdb02.in2p3.fr):

```
lsst_prod_DC_2013_2
lsst_prod_dedupe_byfilter_g
lsst_prod_dedupe_byfilter_i
lsst_prod_dedupe_byfilter_r
lsst_prod_dedupe_byfilter_u
lsst_prod_dedupe_byfilter_z
```

Additional requirements:

• The duplicate entries (due to the overlap in the RA range) will need to be carefully assessed and eliminated.
• The referential integrity of the resulting database will need to be tested.
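The deduplication requirement can be sketched as follows. This is a hypothetical illustration only: the column name `object_id` and the in-memory row representation are assumptions, not the actual schema, and the real elimination would be done in SQL against the merged tables.

```python
# Hypothetical sketch of eliminating rows duplicated between the NCSA
# and IN2P3 catalogs because of the RA-range overlap. "object_id" is an
# illustrative key name, not the actual column in the DRP schema.

def dedupe_by_id(ncsa_rows, in2p3_rows):
    """Keep every NCSA row; keep an IN2P3 row only if its id is unseen."""
    seen = {r["object_id"] for r in ncsa_rows}
    merged = list(ncsa_rows)
    for r in in2p3_rows:
        if r["object_id"] not in seen:
            seen.add(r["object_id"])
            merged.append(r)
    return merged
```

In the real database the same idea would be expressed as an anti-join on the primary key, followed by referential-integrity checks against the source tables.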

#### Activity

Igor Gaponenko created issue

• Epic Link: DM-6230
• Link: This issue has to be done after DM-7007
Igor Gaponenko added a comment (edited)

# Dumping catalogs from a MySQL server at IN2P3

## The location of the dumps

The dump files will be placed at the following file system:

```
% hostname -f
ccosvms0070.in2p3.fr

% df -h /sps/lsst/data/gapon
Filesystem     Size  Used Avail Use% Mounted on
/dev/sps_lsst  131T  123T  8.4T  94% /sps/lsst
```

Each database will be dumped into the corresponding sub-folder:

```
% ls -1 /sps/lsst/data/gapon/
lsst_prod_DC_2013_2
lsst_prod_dedupe_byfilter_g
lsst_prod_dedupe_byfilter_i
lsst_prod_dedupe_byfilter_r
lsst_prod_dedupe_byfilter_u
lsst_prod_dedupe_byfilter_z
```

## Installed PIGZ (parallel GZIP) to speed up the compression phase of the dumps

Compiled and installed the tool at:

```
/afs/in2p3.fr/home/g/gapon//soft/pigz-2.3.3/
```

## Planning for parallel data extraction

The procedure will run in parallel by launching an independent instance of the mysqldump command for each table. The following non-trivial tables have been identified:

```sql
SELECT
  TABLE_SCHEMA,
  TABLE_NAME,
  TABLE_ROWS,
  ROUND(DATA_LENGTH/1e6) AS 'DATA_LENGTH_MB'
FROM
  information_schema.tables
WHERE
  TABLE_ROWS > 0;
```

```
+-----------------------------+-------------------------------+------------+----------------+
| TABLE_SCHEMA                | TABLE_NAME                    | TABLE_ROWS | DATA_LENGTH_MB |
+-----------------------------+-------------------------------+------------+----------------+
| lsst_prod_DC_2013_2         | DeepCoadd                     |      13680 |              5 |
| lsst_prod_DC_2013_2         | DeepCoadd_Metadata            |    3022797 |            133 |
| lsst_prod_DC_2013_2         | DeepCoadd_To_Htm10            |     324020 |              4 |
| lsst_prod_DC_2013_2         | Filter                        |          6 |              0 |
| lsst_prod_DC_2013_2         | LeapSeconds                   |         39 |              0 |
| lsst_prod_DC_2013_2         | RunDeepSource                 |   98088148 |          54733 |
| lsst_prod_DC_2013_2         | Science_Ccd_Exposure          |    1543108 |            704 |
| lsst_prod_DC_2013_2         | Science_Ccd_Exposure_Metadata |  180543635 |           5592 |
| lsst_prod_DC_2013_2         | Science_Ccd_Exposure_To_Htm10 |   29107193 |            378 |
| lsst_prod_DC_2013_2         | ZZZ_Db_Description            |          1 |              0 |
| lsst_prod_dedupe_byfilter_g | RunDeepForcedSource           | 1947408074 |         769226 |
| lsst_prod_dedupe_byfilter_g | tmp_dedupe_g                  |  166970338 |          10686 |
| lsst_prod_dedupe_byfilter_i | RunDeepForcedSource           | 1950338024 |         770384 |
| lsst_prod_dedupe_byfilter_i | tmp_dedupe_i                  |  167273430 |          10705 |
| lsst_prod_dedupe_byfilter_r | RunDeepForcedSource           | 1950357798 |         770391 |
| lsst_prod_dedupe_byfilter_r | tmp_dedupe_r                  |  128661060 |           8234 |
| lsst_prod_dedupe_byfilter_u | RunDeepForcedSource           | 1897928494 |         749682 |
| lsst_prod_dedupe_byfilter_u | tmp_dedupe_u                  |  160239581 |          10255 |
| lsst_prod_dedupe_byfilter_z | RunDeepForcedSource           | 1946839240 |         769001 |
| lsst_prod_dedupe_byfilter_z | tmp_dedupe_z                  |  166920433 |          10683 |
+-----------------------------+-------------------------------+------------+----------------+
```

The list of operations to be run in parallel:

```
% mysqldump --skip-lock-tables ... lsst_prod_DC_2013_2 DeepCoadd | pigz > lsst_prod_DC_2013_2/DeepCoadd.sql.gz&
% mysqldump --skip-lock-tables ... lsst_prod_DC_2013_2 DeepCoadd_Metadata | pigz > lsst_prod_DC_2013_2/DeepCoadd_Metadata.sql.gz&
% mysqldump --skip-lock-tables ... lsst_prod_DC_2013_2 DeepCoadd_To_Htm10 | pigz > lsst_prod_DC_2013_2/DeepCoadd_To_Htm10.sql.gz&
% mysqldump --skip-lock-tables ... lsst_prod_DC_2013_2 Filter | pigz > lsst_prod_DC_2013_2/Filter.sql.gz&
% mysqldump --skip-lock-tables ... lsst_prod_DC_2013_2 LeapSeconds | pigz > lsst_prod_DC_2013_2/LeapSeconds.sql.gz&
% mysqldump --skip-lock-tables ... lsst_prod_DC_2013_2 RunDeepSource | pigz > lsst_prod_DC_2013_2/RunDeepSource.sql.gz&
% mysqldump --skip-lock-tables ... lsst_prod_DC_2013_2 Science_Ccd_Exposure | pigz > lsst_prod_DC_2013_2/Science_Ccd_Exposure.sql.gz&
% mysqldump --skip-lock-tables ... lsst_prod_DC_2013_2 Science_Ccd_Exposure_Metadata | pigz > lsst_prod_DC_2013_2/Science_Ccd_Exposure_Metadata.sql.gz&
% mysqldump --skip-lock-tables ... lsst_prod_DC_2013_2 Science_Ccd_Exposure_To_Htm10 | pigz > lsst_prod_DC_2013_2/Science_Ccd_Exposure_To_Htm10.sql.gz&
% mysqldump --skip-lock-tables ... lsst_prod_DC_2013_2 ZZZ_Db_Description | pigz > lsst_prod_DC_2013_2/ZZZ_Db_Description.sql.gz&
% mysqldump --skip-lock-tables ... lsst_prod_dedupe_byfilter_g RunDeepForcedSource | pigz > lsst_prod_dedupe_byfilter_g/RunDeepForcedSource.sql.gz&
% mysqldump --skip-lock-tables ... lsst_prod_dedupe_byfilter_g tmp_dedupe_g | pigz > lsst_prod_dedupe_byfilter_g/tmp_dedupe_g.sql.gz&
% mysqldump --skip-lock-tables ... lsst_prod_dedupe_byfilter_i RunDeepForcedSource | pigz > lsst_prod_dedupe_byfilter_i/RunDeepForcedSource.sql.gz&
% mysqldump --skip-lock-tables ... lsst_prod_dedupe_byfilter_i tmp_dedupe_i | pigz > lsst_prod_dedupe_byfilter_i/tmp_dedupe_i.sql.gz&
% mysqldump --skip-lock-tables ... lsst_prod_dedupe_byfilter_r RunDeepForcedSource | pigz > lsst_prod_dedupe_byfilter_r/RunDeepForcedSource.sql.gz&
% mysqldump --skip-lock-tables ... lsst_prod_dedupe_byfilter_r tmp_dedupe_r | pigz > lsst_prod_dedupe_byfilter_r/tmp_dedupe_r.sql.gz&
% mysqldump --skip-lock-tables ... lsst_prod_dedupe_byfilter_u RunDeepForcedSource | pigz > lsst_prod_dedupe_byfilter_u/RunDeepForcedSource.sql.gz&
% mysqldump --skip-lock-tables ... lsst_prod_dedupe_byfilter_u tmp_dedupe_u | pigz > lsst_prod_dedupe_byfilter_u/tmp_dedupe_u.sql.gz&
% mysqldump --skip-lock-tables ... lsst_prod_dedupe_byfilter_z RunDeepForcedSource | pigz > lsst_prod_dedupe_byfilter_z/RunDeepForcedSource.sql.gz&
% mysqldump --skip-lock-tables ... lsst_prod_dedupe_byfilter_z tmp_dedupe_z | pigz > lsst_prod_dedupe_byfilter_z/tmp_dedupe_z.sql.gz&
```
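A small helper can generate this command list from a (schema, table) list instead of typing it by hand. The sketch below reproduces the command shape above; the connection options are elided ("...") exactly as in the listing, and the table list here is abbreviated for illustration.

```python
# Generate the per-table "mysqldump | pigz" command lines from a list of
# (schema, table) pairs. The "..." stands in for the connection options,
# which are elided in the ticket as well.

def dump_command(schema, table):
    return (f"mysqldump --skip-lock-tables ... {schema} {table} "
            f"| pigz > {schema}/{table}.sql.gz &")

# Abbreviated table list; the full set is in the table above.
tables = [
    ("lsst_prod_DC_2013_2", "DeepCoadd"),
    ("lsst_prod_dedupe_byfilter_g", "RunDeepForcedSource"),
]
for schema, table in tables:
    print(dump_command(schema, table))
```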

## Monitoring the performance of the operation

The effective data reduction rate of 3.2 has been established by two independent techniques. The first was monitoring the inbound and outbound network traffic of the node where the dumps were running, using a custom script. The script periodically (every 5 seconds) samples the counters of the network interface and reports the overall progress. The script location and its usage are illustrated below:

```
/sps/lsst/data/gapon/netstat_rhel7.pl eth0 5
```

```
Hostname:   ccosvms0070
IP address: 134.158.239.70
Interface:  eth0
Unit:       MB/s

+--------+--------+------------+--------------------------+
|     RX |     TX | TIMESTAMP  | LOCAL TIME               |
+--------+--------+------------+--------------------------+
|   11.9 |    3.7 | 1469828109 | Fri Jul 29 23:35:09 2016 |
|   11.9 |    3.7 | 1469828114 | Fri Jul 29 23:35:14 2016 |
|   11.9 |    3.4 | 1469828119 | Fri Jul 29 23:35:19 2016 |
|   11.9 |    3.7 | 1469828124 | Fri Jul 29 23:35:24 2016 |
..
```
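For reference, the rate computation such a monitor performs is simple: sample the interface byte counters twice and divide the difference by the interval. A minimal sketch (the /sys path in the comment is the usual Linux location; the original Perl script is not reproduced here):

```python
def mb_per_s(bytes_before, bytes_after, interval_s):
    """Average transfer rate over one sampling window, in MB/s."""
    return (bytes_after - bytes_before) / interval_s / 1e6

# On Linux the counters could be read from, e.g.:
#   /sys/class/net/eth0/statistics/rx_bytes
# 59.5 MB received over a 5-second window gives the 11.9 MB/s seen above.
print(mb_per_s(0, 59_500_000, 5))
```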

The second technique was based on dumping one table, storing it in the compressed format and comparing the resulting file size with the amount of data in the corresponding table as reported earlier. Table lsst_prod_DC_2013_2.Science_Ccd_Exposure_Metadata was used as a reference:

```
% ls -lh lsst_prod_DC_2013_2/Science_Ccd_Exposure_Metadata.sql.gz
-rw-r--r-- 1 gapon lsst 1.7G Jul 29 22:01 lsst_prod_DC_2013_2/Science_Ccd_Exposure_Metadata.sql.gz
```

The 'DATA_LENGTH' counter reported earlier for the table was 5592 MB. Hence, we have the same deflation factor of 3.2 as reported by the network monitoring tool.
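As a sanity check, the deflation factor can be recomputed from the exact dump-file size (1,789,068,690 bytes, from the directory listing of the completed lsst_prod_DC_2013_2 dump below) and the reported DATA_LENGTH of 5592 MB:

```python
def deflation(data_length_mb, dump_bytes):
    """Ratio of uncompressed table size to compressed dump size."""
    return data_length_mb * 1e6 / dump_bytes

# Comes out at roughly 3.1, in line with the factor of ~3.2 quoted above
# (the exact value depends on which size counters one compares).
ratio = deflation(5592, 1_789_068_690)
```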

Based on the total amount of data as reported by:

```sql
SELECT
  SUM(TABLE_ROWS),
  ROUND(SUM(DATA_LENGTH/1e6)) AS 'TOTAL_DATA_LENGTH_MB'
FROM
  information_schema.tables;
```

```
+-----------------+----------------------+
| SUM(TABLE_ROWS) | TOTAL_DATA_LENGTH_MB |
+-----------------+----------------------+
|     10795579099 |              3940798 |
+-----------------+----------------------+
```

Estimates:

• ESTIMATED TIME TO COMPLETION for the operation is 331160 s, or 92 hours, or slightly less than 4 days.
• ESTIMATED DATASET SIZE to be moved to NCSA over WAN will be 1.18 TB
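The arithmetic behind these two estimates, for the record:

```python
# Inputs taken from the measurements above.
TOTAL_MB = 3_940_798   # TOTAL_DATA_LENGTH_MB from information_schema
RX_MB_PER_S = 11.9     # sustained inbound rate from the network monitor
DEFLATION = 3.2        # observed compression factor

eta_s = TOTAL_MB / RX_MB_PER_S              # ~331160 s
eta_h = eta_s / 3600                        # ~92 hours
dump_size_tb = TOTAL_MB / DEFLATION / 1e6   # ~1.2 TB to move over WAN
```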

## Dump completed: lsst_prod_DC_2013_2

```
/sps/lsst/data/gapon(0)>ls -al lsst_prod_DC_2013_2/
total 44301120
drwxr-sr-x 2 gapon lsst       32768 Jul 29 21:04 .
drwxr-sr-x 8 gapon lsst       32768 Jul 29 21:58 ..
-rw-r--r-- 1 gapon lsst    30687846 Jul 29 21:04 DeepCoadd_Metadata.sql.gz
-rw-r--r-- 1 gapon lsst     2346437 Jul 29 21:04 DeepCoadd.sql.gz
-rw-r--r-- 1 gapon lsst      908926 Jul 29 21:04 DeepCoadd_To_Htm10.sql.gz
-rw-r--r-- 1 gapon lsst         787 Jul 29 21:04 Filter.sql.gz
-rw-r--r-- 1 gapon lsst        1637 Jul 29 21:04 LeapSeconds.sql.gz
-rw-r--r-- 1 gapon lsst 42934514253 Jul 30 13:27 RunDeepSource.sql.gz
-rw-r--r-- 1 gapon lsst  1789068690 Jul 29 22:01 Science_Ccd_Exposure_Metadata.sql.gz
-rw-r--r-- 1 gapon lsst   527090153 Jul 29 21:06 Science_Ccd_Exposure.sql.gz
-rw-r--r-- 1 gapon lsst    79454956 Jul 29 21:05 Science_Ccd_Exposure_To_Htm10.sql.gz
-rw-r--r-- 1 gapon lsst         745 Jul 29 21:04 ZZZ_Db_Description.sql.gz
```

## Failed dumps for the forced deep source database

The dump of the large tables from all 5 deep forced source catalogs failed approximately 2 days after it began. All 5 outstanding dumps were killed at the same time. Here is what was left in the folders:

```
>ls -alh lsst_prod_dedupe_byfilter_*/

lsst_prod_dedupe_byfilter_g/:
-rw-r--r-- 1 gapon lsst 190G Aug  1 21:03 RunDeepForcedSource.sql.gz
-rw-r--r-- 1 gapon lsst 2.4G Jul 30 00:10 tmp_dedupe_g.sql.gz

lsst_prod_dedupe_byfilter_i/:
-rw-r--r-- 1 gapon lsst 192G Aug  1 21:03 RunDeepForcedSource.sql.gz
-rw-r--r-- 1 gapon lsst 2.4G Jul 30 00:14 tmp_dedupe_i.sql.gz

lsst_prod_dedupe_byfilter_r/:
-rw-r--r-- 1 gapon lsst 191G Aug  1 21:03 RunDeepForcedSource.sql.gz
-rw-r--r-- 1 gapon lsst 1.8G Jul 29 23:37 tmp_dedupe_r.sql.gz

lsst_prod_dedupe_byfilter_u/:
-rw-r--r-- 1 gapon lsst 186G Aug  1 21:03 RunDeepForcedSource.sql.gz
-rw-r--r-- 1 gapon lsst 2.3G Jul 30 00:08 tmp_dedupe_u.sql.gz

lsst_prod_dedupe_byfilter_z/:
-rw-r--r-- 1 gapon lsst 188G Aug  1 21:03 RunDeepForcedSource.sql.gz
-rw-r--r-- 1 gapon lsst 2.4G Jul 30 00:13 tmp_dedupe_z.sql.gz
```

UPDATE (2016-08-01): Resubmitted the operation for database lsst_prod_dedupe_byfilter_z. The main rationales for taking a snapshot of a single database are:

• it will increase the chances of the operation completing within a reasonable period of time
• it will allow measuring the final size of the dump
• this database alone should be sufficient to study the effects of the overlap between the NCSA and IN2P3 datasets (and to find a solution should that overlap cause any problems).

The operation successfully completed in 30 hours:

```
>stat lsst_prod_dedupe_byfilter_z/RunDeepForcedSource.sql.gz
Access: 2016-08-02 07:47:05.902134000 +0200
Modify: 2016-08-03 12:39:28.288670022 +0200
Change: 2016-08-03 12:39:28.288670022 +0200
```
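The "30 hours" figure follows from the timestamps above, assuming the access time marks the start of the dump and the modify time its completion; the difference comes to just under 29 hours, consistent with the roughly 30 hours stated.

```python
from datetime import datetime

# Timestamps from the stat output (sub-second parts dropped).
start = datetime(2016, 8, 2, 7, 47, 5)    # Access
end = datetime(2016, 8, 3, 12, 39, 28)    # Modify
hours = (end - start).total_seconds() / 3600   # just under 29 hours
```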

OBSERVATIONS: the aggregate network I/O during this single-dump operation was similar to the case of 5 simultaneous dumps. This means running multiple dumps in parallel makes no sense.

### Status of the operation

| database | status |
| --- | --- |
| lsst_prod_DC_2013_2 | |
| lsst_prod_dedupe_byfilter_z | |
| lsst_prod_dedupe_byfilter_u | |
| lsst_prod_dedupe_byfilter_r | |
| lsst_prod_dedupe_byfilter_i | |
| lsst_prod_dedupe_byfilter_g | |
Igor Gaponenko added a comment (edited)

# Transferring MySQL dump files from IN2P3 to NCSA using BBCP

Due to an ongoing modernization of the NCSA network, which may cause long transfers to fail, the transfer is happening in two stages:

1. copy files to SLAC: psexport.slac.stanford.edu:/reg/data/ana04/test/gapon/in2p3/
2. copy from SLAC to NCSA: /lsst8/gapon/

## IN2P3 to SLAC

All transfers were initiated from host ccqservbuild.in2p3.fr or ccage.in2p3.fr. The typical configuration of the data transfer command was:

```
% bbcp -v -P 2 -s 16 -r \
    lsst_prod_DC_2013_2/ \
    psexport.slac.stanford.edu:/reg/data/ana04/test/gapon/in2p3/
```

### Progress report

| database | status |
| --- | --- |
| lsst_prod_DC_2013_2 | |
| lsst_prod_dedupe_byfilter_z | |
| lsst_prod_dedupe_byfilter_u | |
| lsst_prod_dedupe_byfilter_r | |
| lsst_prod_dedupe_byfilter_i | |
| lsst_prod_dedupe_byfilter_g | |

### Observations

The sustained transfer rates:

| host | rate | notes |
| --- | --- | --- |
| ccqservbuild.in2p3.fr | 7.5 MB/s | limited by the network configuration of that VM |
| ccage.in2p3.fr | 65 MB/s | limited by WAN |

The option of running parallel transfers into separate destinations, such as NCSA, has also been investigated. It led to the same result as above. Apparently there is a bottleneck in the WAN segments connecting IN2P3 with US networks.

## SLAC to NCSA

The files are being moved to NCSA via a dedicated DTN node, lsst-xfer.ncsa.illinois.edu. The machine has the port range 65001-65535 open for incoming connections. Details can be found at: https://developer.lsst.io/services/ncsa_bulk_transfer.html#ncsa-bulk-bbcp

The copying was made using the following command:

```
% bbcp -i $HOME/.ssh/lsst/id_ncsa_rsa -v -P 2 -s 16 -Z 65001 -r \
    /reg/data/ana04/test/gapon/in2p3/lsst_prod_DC_2013_2 \
    lsst-xfer.ncsa.illinois.edu:/lsst8/gapon/
```

### Progress report

| database | status |
| --- | --- |
| lsst_prod_DC_2013_2 | |
| lsst_prod_dedupe_byfilter_z | |
| lsst_prod_dedupe_byfilter_u | |
| lsst_prod_dedupe_byfilter_r | |
| lsst_prod_dedupe_byfilter_i | |
| lsst_prod_dedupe_byfilter_g | |

### Observations

The average transfer rate was about 300 MB/s.

## Uploading dumps into a Nebula volume

This operation was run (AS NEEDED FOR FURTHER LOADING!) from node lsst-dev.ncsa.illinois.edu:

```
% bbcp -v -P 2 -Z 65001 -z -s 16 -i ~/.ssh/id_ncsa_rsa -r \
    /lsst8/gapon/daues_SDRP_Stripe82_ncsa.sql \
    141.142.211.152:/data/gapon/

% bbcp -v -P 2 -Z 65001 -z -s 16 -i ~/.ssh/id_ncsa_rsa -r \
    /lsst8/gapon/lsst_prod_dedupe_byfilter_z \
    141.142.211.152:/data/gapon/

...
```

### Observations

The typical performance of the operation was 95 MB/s.
Igor Gaponenko added a comment (edited)

# The first attempt at initializing database gapon_SDRP_Stripe82 (NCSA)

The database will be initialized as a clone of the existing database daues_SDRP_Stripe82_ncsa. The procedure has the following steps:

1. making a mysqldump of the source database
2. creating the destination database
3. loading the dump into the destination database

Initially the dump files were placed at (NCSA):

```
/nfs/lsst8/gapon
```

## Loading options

• In theory, a more efficient way of cloning a database would be to chain two operations in a UNIX pipe, without compressing/decompressing the stream:

```
mysqldump ... daues_SDRP_Stripe82_ncsa | mysql ... gapon_SDRP_Stripe82
```

The main reason I chose to go via an intermediate file is that it allows a quicker turnaround for making other clones (should this be needed) from the dump file. The second reason was related to the security restrictions imposed by MySQL for views. For instance, a straightforward import of the snapshot taken above failed with the following message:

```
ERROR 1227 (42000) at line 140273: Access denied; you need (at least one of) the SUPER privilege(s) for this operation
```

Further investigation pointed to the following line in the dump:

```
/*!50013 DEFINER=daues@% SQL SECURITY DEFINER */
```

Hence, a solution was to:

```
% cat daues_SDRP_Stripe82_ncsa.sql | sed 's/daues/gapon/g' | mysql gapon_SDRP_Stripe82
```
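The effect of this rewrite on the offending DEFINER clause can be sketched in isolation: a global replacement of the account name, so the views are created under the importing user instead of requiring the SUPER privilege.

```python
# The DEFINER clause as it appears in the dump, and after the rewrite.
line = "/*!50013 DEFINER=daues@% SQL SECURITY DEFINER */"
fixed = line.replace("daues", "gapon")
# -> "/*!50013 DEFINER=gapon@% SQL SECURITY DEFINER */"
```

Note that a global s/daues/gapon/g also renames any other occurrence of the string anywhere in the dump, which happened to be the desired behavior here.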

At this point the database was properly seeded.

### Observations

The performance of the database loading was rather low: 20 MB/s or less, and there was substantial CPU usage on the database server:

```
% top

  PID USER  PR NI   VIRT    RES  SHR S %CPU %MEM     TIME+ COMMAND
 5236 mysql 20  0  972804 216772 4044 S 83.0  1.3 689:45.22 mysqld
 8307 gapon 20  0  238796   6632 2768 S 16.3  0.0  95:42.25 mysql
```

## Loading deep forced sources into gapon_SDRP_Stripe82 (NCSA)

Loading data of the u band (filter index 0):

```sql
USE gapon_SDRP_Stripe82;
CREATE TABLE RunDeepForcedSource LIKE daues_SDRP_dedupe_byfilter_0.RunDeepForcedSource;
INSERT INTO RunDeepForcedSource SELECT * FROM daues_SDRP_dedupe_byfilter_0.RunDeepForcedSource;
```

The query finished in 32 hours:

```
Query OK, 1729599791 rows affected (1 day 8 hours 15 min 12.97 sec)
Records: 1729599791  Duplicates: 0  Warnings: 0
```

Observations: the aggregate performance was (700 GB of data from the table + 190 GB of index) / (1 day 8 hours 15 min) = 7.7 MB/s.
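Checking that arithmetic:

```python
# (700 GB of table data + 190 GB of index) over 1 day 8 h 15 min.
total_mb = (700 + 190) * 1000           # GB -> MB
elapsed_s = (24 + 8) * 3600 + 15 * 60   # 116100 s
rate_mb_s = total_mb / elapsed_s        # ~7.7 MB/s
```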

UPDATED 2016-08-08: ran out of space on the database server. Giving up on this option. Investigating other options for setting up a database server with a large disk.

Show
Igor Gaponenko added a comment - - edited

# Setting up a MariaDB server on the LSST Nebula cluster at NCSA

Please provide the most important details that may be helpful when reproducing the setup!

Igor Gaponenko added a comment -

# Loading MySQL dumps taken in IN2P3 and NCSA into the combined database (LSST Nebula)

This is a very delicate operation because it also requires a thorough evaluation of the input data. This is the proposed scenario:

• initialize database gapon_SDRP_Stripe82
• load the first batch of deep forced sources for filter band 0 from database daues_SDRP_dedupe_byfilter_0 into gapon_SDRP_Stripe82
• initialize database gapon_SDRP_dedupe_byfilter_1 and load the second batch from database daues_SDRP_dedupe_byfilter_1 into it
• cross-check the deep forced sources of that band against the first band to see if there are any conflicts (key overlaps, etc.) between them

TO BE COMPLETED...
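The cross-check step can be prototyped outside the database before running expensive queries. A minimal sketch (hypothetical helper; it assumes the primary keys of the two per-band tables have been exported, e.g. one list of RunDeepForcedSource.id values per band):

```python
# Detect primary-key conflicts between two batches of deep forced sources.
# The id lists are assumed to have been exported from the two databases
# (one file of RunDeepForcedSource.id values per band).

def key_conflicts(ids_band_a, ids_band_b):
    """Return the set of primary keys present in both batches."""
    return set(ids_band_a) & set(ids_band_b)

# Toy illustration: two id ranges with a small overlap.
band0 = range(1000, 1010)
band1 = range(1008, 1020)
print(sorted(key_conflicts(band0, band1)))   # -> [1008, 1009]
```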

Igor Gaponenko added a comment - - edited

# Evaluating schema of loaded catalogs

## Schema

Explained in: https://github.com/lsst/cat/tree/master/sql

Each database has a special table with a reference to a schema initialization procedure:

    SELECT * FROM ZZZ_Db_Description;

    +------------------------------+------+
    | f                            | r    |
    +------------------------------+------+
    | lsstSchema4mysqlS12_sdss.sql | NULL |
    +------------------------------+------+

Tables with non-empty contents in the main database (using the one from NCSA as an example):

    SELECT TABLE_NAME, TABLE_ROWS, DATA_LENGTH
    FROM information_schema.Tables
    WHERE TABLE_SCHEMA = 'gapon_SDRP_Stripe82'
      AND TABLE_ROWS > 0;

    +-------------------------------+------------+--------------+
    | TABLE_NAME                    | TABLE_ROWS | DATA_LENGTH  |
    +-------------------------------+------------+--------------+
    | DeepCoadd                     |      13735 |      4862980 |
    | DeepCoadd_Metadata            |    2809690 |    122591652 |
    | DeepCoadd_To_Htm10            |     326420 |      4243460 |
    | Filter                        |          6 |           78 |
    | LeapSeconds                   |         39 |         1443 |
    | RunDeepForcedSource           | 1729599791 | 683191917445 |
    | RunDeepSource                 |   98577782 |  55006402356 |
    | Science_Ccd_Exposure          |    1403385 |    644435260 |
    | Science_Ccd_Exposure_Metadata |  164196045 |   5085842272 |
    | Science_Ccd_Exposure_To_Htm10 |   26861430 |    349198590 |
    | ZZZ_Db_Description            |          1 |           36 |
    +-------------------------------+------------+--------------+

NOTE: this particular database has already been preloaded with table RunDeepForcedSource which has deep forced sources of filter 0 (band u) processed at NCSA.

Filter-specific databases (using one of those as an example):

    SELECT TABLE_NAME, TABLE_ROWS, DATA_LENGTH
    FROM information_schema.Tables
    WHERE TABLE_SCHEMA = 'daues_SDRP_dedupe_byfilter_1'
      AND TABLE_ROWS > 0;

    +---------------------+------------+--------------+
    | TABLE_NAME          | TABLE_ROWS | DATA_LENGTH  |
    +---------------------+------------+--------------+
    | RunDeepForcedSource | 1752805399 | 692358132605 |
    +---------------------+------------+--------------+

## Dependencies

table: Filter

    column
    PK  filterId

table: Science_Ccd_Exposure

    column                    FK.table  FK.column
    PK  scienceCcdExposureId
    FK  filterId              Filter    filterId

table: Science_Ccd_Exposure_To_Htm10

    column           FK.table              FK.column
    FK  exposure_id  Science_Ccd_Exposure  scienceCcdExposureId

table: RunDeepSource (L2 Object as per LSST DPD)

    column      FK.table  FK.column
    PK  id
    FK  parent  self      id

table: RunDeepForcedSource (L2 Source as per LSST DPD)

    column                  FK.table              FK.column
    PK  id
    FK  object_id           RunDeepSource         id
    FK  exposure_filter_id  Filter                filterId
    FK  exposure_id         Science_Ccd_Exposure  scienceCcdExposureId

### Observations on direct dependencies

Direct FK/PK dependencies as per the schema (each referenced table followed by the tables that reference it):

• Filter
  • Science_Ccd_Exposure
  • RunDeepSource
  • RunDeepForcedSource
• Science_Ccd_Exposure
  • Science_Ccd_Exposure_To_Htm10
  • RunDeepForcedSource
• RunDeepSource
  • RunDeepSource
  • RunDeepForcedSource

Preliminary conclusions:

1. Dependencies on Filter are trivial due to the small size and static nature of that table.
2. Dependencies on Science_Ccd_Exposure allow for a potentially easy merge (across the two sites) of the two relevant metadata tables (Science_Ccd_Exposure_Metadata and Science_Ccd_Exposure_To_Htm10) based on the run numbers encoded into the primary key of that table.
3. DeepCoadd, RunDeepSource and RunDeepForcedSource form a cluster of dependencies which requires further investigation before attempting any merge.
4. RunDeepForcedSource has a dual dependency on Science_Ccd_Exposure and RunDeepSource, of which one may be trivially eliminated.
• 2016-09-01: it turns out the key dependency is on RunDeepSource. See details in one of the comments down the stream.
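The FK graph determines a safe load order for populating the combined database (referenced tables first). A minimal sketch using Python's graphlib (the self-reference of RunDeepSource.parent is omitted, since a self-edge would otherwise register as a cycle):

```python
# Derive a safe load order from the direct FK dependencies listed above.
# Each table maps to the set of tables it references (its FK targets).
from graphlib import TopologicalSorter

fk_targets = {
    "Filter": set(),
    "Science_Ccd_Exposure": {"Filter"},
    "Science_Ccd_Exposure_To_Htm10": {"Science_Ccd_Exposure"},
    # RunDeepSource also references itself via 'parent'; the self-edge is
    # dropped here because TopologicalSorter would report it as a cycle.
    "RunDeepSource": set(),
    "RunDeepForcedSource": {"RunDeepSource", "Filter", "Science_Ccd_Exposure"},
}

# Referenced tables come out before the tables that reference them.
order = list(TopologicalSorter(fk_targets).static_order())
print(order)
```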

Further actions:

• investigate the structure of the primary key deepCoaddId of table DeepCoadd
• understand the meaning of patch and tract in table DeepCoadd. This may be a clue for establishing a correct merge algorithm
• investigate the dependency chain of DeepCoadd, RunDeepSource and RunDeepForcedSource

### Spatial (indirect) dependencies

There are three options here:

• RA and DEC
• HTM
Igor Gaponenko added a comment -

# The algorithm for generating calibrated CCD exposure IDs

The algorithm can be deduced from the following result sets (one per processing site).
NCSA:

    USE gapon_SDRP_Stripe82;

    SELECT scienceCcdExposureId, run, filterId, camcol, field
    FROM Science_Ccd_Exposure
    WHERE run = 1033
    ORDER BY scienceCcdExposureId ASC
    LIMIT 3;

    +----------------------+------+----------+--------+-------+
    | scienceCcdExposureId | run  | filterId | camcol | field |
    +----------------------+------+----------+--------+-------+
    |           1033010018 | 1033 |        0 |      1 |    18 |
    |           1033010019 | 1033 |        0 |      1 |    19 |
    |           1033010020 | 1033 |        0 |      1 |    20 |
    +----------------------+------+----------+--------+-------+

IN2P3:

    USE lsst_prod_DC_2013_2;

    SELECT scienceCcdExposureId, run, filterId, camcol, field
    FROM Science_Ccd_Exposure
    WHERE run = 1752
    ORDER BY scienceCcdExposureId ASC
    LIMIT 3;

    +----------------------+------+----------+--------+-------+
    | scienceCcdExposureId | run  | filterId | camcol | field |
    +----------------------+------+----------+--------+-------+
    |           1752010040 | 1752 |        0 |      1 |    40 |
    |           1752010041 | 1752 |        0 |      1 |    41 |
    |           1752010042 | 1752 |        0 |      1 |    42 |
    +----------------------+------+----------+--------+-------+

The final formula:

 scienceCcdExposureId = 1000000 * run + 100000 * filterId + 10000 * camcol + field 
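The formula can be checked directly against the sample rows above; a minimal sketch:

```python
# Check the scienceCcdExposureId formula against the sample result sets above.
def science_ccd_exposure_id(run, filter_id, camcol, field):
    return 1000000 * run + 100000 * filter_id + 10000 * camcol + field

assert science_ccd_exposure_id(1033, 0, 1, 18) == 1033010018   # NCSA sample
assert science_ccd_exposure_id(1752, 0, 1, 40) == 1752010040   # IN2P3 sample
```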

NOTE: Apparently, there is existing code for calculating these identifiers:

https://github.com/lsst/obs_sdss/blob/master/python/lsst/obs/sdss/sdssMapper.py#L61-L69

The relevant function is quoted below:

    def _computeCcdExposureId(self, dataId):
        """Compute the 64-bit (long) identifier for a CCD exposure.

        @param dataId (dict) Data identifier with run, rerun, filter, camcol, field
        """
        return ((long(dataId['run']) \
            * 10 + self.filterIdMap[dataId['filter']]) \
            * 10 + dataId['camcol']) \
            * 10000 + dataId['field']

This code is PERFECTLY CONSISTENT with the previously shown observations on what's actually stored in the databases.

Igor Gaponenko added a comment -

# The algorithm for generating primary keys of table DeepCoadd

The algorithm is explained in:

https://github.com/lsst/obs_sdss/blob/master/python/lsst/obs/sdss/sdssMapper.py#L71-L89

The simplified formula, based on the presence of filters, looks like this:

    tract    = 3
    patchX   = 99
    patchY   = 9
    filterId = 0

    id = 8 * ((tract * 2**13 + patchX) * 2**13 + patchY) + filterId
    1617100872

It produces results that are consistent with the following result set:

    USE gapon_SDRP_Stripe82;

    SELECT deepCoaddId, tract, patch, filterId
    FROM DeepCoadd
    WHERE patch = '99,9'
    ORDER BY deepCoaddId;

    +-------------+-------+-------+----------+
    | deepCoaddId | tract | patch | filterId |
    +-------------+-------+-------+----------+
    |  1617100872 |     3 | 99,9  |        0 |
    |  1617100873 |     3 | 99,9  |        1 |
    |  1617100874 |     3 | 99,9  |        2 |
    |  1617100875 |     3 | 99,9  |        3 |
    |  1617100876 |     3 | 99,9  |        4 |
    +-------------+-------+-------+----------+
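The formula is easy to verify against that result set. A small sketch (patchX and patchY are the two components of the patch string):

```python
# Verify the DeepCoadd primary-key formula against the result set above.
def deep_coadd_id(tract, patch, filter_id):
    patch_x, patch_y = (int(p) for p in patch.split(","))
    return 8 * ((tract * 2**13 + patch_x) * 2**13 + patch_y) + filter_id

# All five filters of patch '99,9', tract 3, occupy consecutive ids.
for fid in range(5):
    assert deep_coadd_id(3, "99,9", fid) == 1617100872 + fid
```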

Igor Gaponenko added a comment - - edited

# Analyzing an overlap in DeepCoadd and RunDeepSource

As per my earlier conversation with KT Lim:

• the overlap can be uniquely identified by the patch and tract attributes of table DeepCoadd
• single-exposure images are not aligned with the boundaries of patches and tracts
• RA/DEC is not a reliable mechanism for tracking overlaps in this set of tables

The overlap can be tracked using the following query:

    SELECT patch, tract, COUNT(*)
    FROM gapon_SDRP_Stripe82.DeepCoadd
    WHERE (patch, tract) IN (SELECT patch, tract FROM lsst_prod_DC_2013_2.DeepCoadd)
    GROUP BY patch, tract
    ORDER BY patch, tract;

    +--------+-------+----------+
    | patch  | tract | COUNT(*) |
    +--------+-------+----------+
    | 364,0  |     0 |        5 |
    | 364,1  |     0 |        5 |
    | 364,10 |     0 |        5 |
    | 364,11 |     0 |        5 |
    | 364,2  |     0 |        5 |
    | 364,3  |     0 |        5 |
    | 364,4  |     0 |        5 |
    | 364,5  |     0 |        5 |
    | 364,6  |     0 |        5 |
    | 364,7  |     0 |        5 |
    | 364,8  |     0 |        5 |
    | 364,9  |     0 |        5 |
    ...
    | 386,8  |     0 |        5 |
    | 386,9  |     0 |        5 |
    +--------+-------+----------+

(The full result set repeats the same pattern, 12 patch rows for every patch column X = 364 through 386, all with tract 0 and COUNT(*) = 5.)

The opposite query yields the same result, which means that coadds have been built by both sites in the overlap area:

    SELECT patch, tract, COUNT(*)
    FROM lsst_prod_DC_2013_2.DeepCoadd
    WHERE (patch, tract) IN (SELECT patch, tract FROM gapon_SDRP_Stripe82.DeepCoadd)
    GROUP BY patch, tract
    ORDER BY patch, tract;

The overlap has been found in a rectangular region of patches:

• X: 364 - 386
• Y: 0 - 11

All 5 bands are present in each patch.
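The extent of the region can be cross-checked against the query output above (a sketch; 23 patch columns by 12 patch rows, tract 0, with all five filters per patch):

```python
# Cross-check the reported overlap region against the query output above.
overlap_patches = [f"{x},{y}" for x in range(364, 387) for y in range(12)]

# One row per (patch, tract) group, each with COUNT(*) = 5 (all five bands).
assert len(overlap_patches) == 23 * 12 == 276

# Total DeepCoadd rows in the overlap, all bands included.
print(len(overlap_patches) * 5)   # -> 1380
```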

# The structure of the primary key in table RunDeepSource

This table also has an FK parent referencing the PK id of the same table. This raises a potential problem of key collisions (and, more generally, referential-integrity violations) when attempting to merge the table contents.

A preliminary investigation shows that:

• the key computation algorithm is based on patch, tract and filter
• there is an exact match of the keys within the overlap region

The first observation is illustrated by:

    SELECT COUNT(id), MAX(id)-MIN(id), MIN(id), MAX(id)
    FROM gapon_SDRP_Stripe82.RunDeepSource
    WHERE coadd_id IN (SELECT deepCoaddId FROM gapon_SDRP_Stripe82.DeepCoadd
                       WHERE patch='364,0' AND tract=0 AND filterId=0);

    +-----------+-----------------+------------------+------------------+
    | COUNT(id) | MAX(id)-MIN(id) | MIN(id)          | MAX(id)          |
    +-----------+-----------------+------------------+------------------+
    |      5428 |            5427 | 3201777860083756 | 3201777860089183 |
    +-----------+-----------------+------------------+------------------+

    SELECT COUNT(id), MAX(id)-MIN(id), MIN(id), MAX(id)
    FROM gapon_SDRP_Stripe82.RunDeepSource
    WHERE coadd_id IN (SELECT deepCoaddId FROM gapon_SDRP_Stripe82.DeepCoadd
                       WHERE patch='364,1' AND tract=0 AND filterId=0);

    +-----------+-----------------+------------------+------------------+
    | COUNT(id) | MAX(id)-MIN(id) | MIN(id)          | MAX(id)          |
    +-----------+-----------------+------------------+------------------+
    |      5008 |            5007 | 3201778933825600 | 3201778933830607 |
    +-----------+-----------------+------------------+------------------+

    SELECT 3201778933825600-3201777860083756;

    +-----------------------------------+
    | 3201778933825600-3201777860083756 |
    +-----------------------------------+
    |                        1073741844 |
    +-----------------------------------+

    SELECT COUNT(id), MAX(id)-MIN(id), MIN(id), MAX(id)
    FROM gapon_SDRP_Stripe82.RunDeepSource
    WHERE coadd_id IN (SELECT deepCoaddId FROM gapon_SDRP_Stripe82.DeepCoadd
                       WHERE patch='364,0' AND tract=0 AND filterId=1);

    +-----------+-----------------+------------------+------------------+
    | COUNT(id) | MAX(id)-MIN(id) | MIN(id)          | MAX(id)          |
    +-----------+-----------------+------------------+------------------+
    |      6510 |            6509 | 3201777994301623 | 3201777994308132 |
    +-----------+-----------------+------------------+------------------+

    SELECT 3201777994301623-3201777860083756;

    +-----------------------------------+
    | 3201777994301623-3201777860083756 |
    +-----------------------------------+
    |                         134217867 |
    +-----------------------------------+

The second observation is illustrated by the results of the following queries:

NCSA:

    SELECT COUNT(id), MAX(id)-MIN(id), MIN(id), MAX(id)
    FROM gapon_SDRP_Stripe82.RunDeepSource
    WHERE coadd_id IN (SELECT deepCoaddId FROM gapon_SDRP_Stripe82.DeepCoadd
                       WHERE patch='364,0' AND tract=0 AND filterId=0);

    +-----------+-----------------+------------------+------------------+
    | COUNT(id) | MAX(id)-MIN(id) | MIN(id)          | MAX(id)          |
    +-----------+-----------------+------------------+------------------+
    |      5428 |            5427 | 3201777860083756 | 3201777860089183 |
    +-----------+-----------------+------------------+------------------+

IN2P3:

    SELECT COUNT(id), MAX(id)-MIN(id), MIN(id), MAX(id)
    FROM lsst_prod_DC_2013_2.RunDeepSource
    WHERE coadd_id IN (SELECT deepCoaddId FROM lsst_prod_DC_2013_2.DeepCoadd
                       WHERE patch='364,0' AND tract=0 AND filterId=0);

    +-----------+-----------------+------------------+------------------+
    | COUNT(id) | MAX(id)-MIN(id) | MIN(id)          | MAX(id)          |
    +-----------+-----------------+------------------+------------------+
    |      5445 |            5444 | 3201777860083756 | 3201777860089200 |
    +-----------+-----------------+------------------+------------------+

ATTENTION: this observation doesn't directly prove that the same behavior holds in the non-overlap region. An additional study (if needed) may have to be conducted. The easiest way to do this would be by scanning patches and tracts before and after the overlap region.

## The algorithm for generating the primary keys

It has been experimentally discovered that the formula for generating contiguous ranges of primary keys of table RunDeepSource directly depends on the corresponding primary key of table DeepCoadd (which itself is calculated from the triplet of tract, patch and filter). The formula computes the start key for a range of deep sources as:

 id_min = 2**27 * deepCoaddId 

This theory is supported by the following observation:

    SELECT id/coadd_id AS 'divider', COUNT(*)
    FROM lsst_prod_DC_2013_2.RunDeepSource
    GROUP BY divider;

    +----------------+----------+
    | divider        | COUNT(*) |
    +----------------+----------+
    | 134217728.0000 |  8684893 |
    | 134217728.0001 | 24277289 |
    | 134217728.0002 | 22549282 |
    | 134217728.0003 | 17703275 |
    | 134217728.0004 | 11786896 |
    | 134217728.0005 |  6560958 |
    | 134217728.0006 |  3519636 |
    | 134217728.0007 |  1742696 |
    | 134217728.0008 |   814825 |
    | 134217728.0009 |   320346 |
    | 134217728.0010 |    99486 |
    | 134217728.0011 |    22900 |
    | 134217728.0012 |     4810 |
    | 134217728.0013 |      856 |
    +----------------+----------+

Apparently the allocation of the remaining keys within the scope of the same coadd was sequential, because it happened within the same process while searching for deep sources on the coadd. This can be seen by:

    SELECT POWER(2,27);

    +-------------+
    | POWER(2,27) |
    +-------------+
    |   134217728 |
    +-------------+

    SELECT id, coadd_id*134217728, id-coadd_id*134217728 AS 'diff'
    FROM lsst_prod_DC_2013_2.RunDeepSource
    WHERE coadd_id IN (
        SELECT deepCoaddId
        FROM lsst_prod_DC_2013_2.DeepCoadd
        WHERE patch='364,0' AND tract=0 AND filterId=0)
    ORDER BY id;

    +------------------+--------------------+------+
    | id               | coadd_id*134217728 | diff |
    +------------------+--------------------+------+
    | 3201777860083756 |   3201777860083712 |   44 |
    | 3201777860083757 |   3201777860083712 |   45 |
    | 3201777860083758 |   3201777860083712 |   46 |
    | 3201777860083759 |   3201777860083712 |   47 |
    | 3201777860083760 |   3201777860083712 |   48 |
    ...
    | 3201777860089196 |   3201777860083712 | 5484 |
    | 3201777860089197 |   3201777860083712 | 5485 |
    | 3201777860089198 |   3201777860083712 | 5486 |
    | 3201777860089199 |   3201777860083712 | 5487 |
    | 3201777860089200 |   3201777860083712 | 5488 |
    +------------------+--------------------+------+
    5445 rows in set (0.01 sec)

This leads to the following optimization for eliminating overlapped keys from table RunDeepSource:

• for each value of deepCoaddId (FK: coadd_id) in the overlap area eliminate all entries whose primary keys fall in a range of:

 2**27 * deepCoaddId < id < 2**27 * (deepCoaddId + 1) 

A clear advantage of this algorithm over a nested query directly matching RunDeepSource.coadd_id against DeepCoadd.deepCoaddId is that the optimized removal can use the existing index on the primary key RunDeepSource.id.
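A sketch of how the per-coadd range bounds could be generated (hypothetical helper; the DELETE text is illustrative, not a tested production statement):

```python
# Generate per-coadd primary-key ranges for removing overlapped RunDeepSource
# rows. The bounds follow the 2**27 * deepCoaddId keying scheme derived above.
def run_deep_source_id_range(deep_coadd_id):
    """Half-open [lo, hi) range of RunDeepSource.id values for one coadd."""
    lo = 2**27 * deep_coadd_id
    return lo, lo + 2**27

# Example: the coadd for patch '364,0', tract 0, filterId 0.
deep_coadd_id = 8 * ((0 * 2**13 + 364) * 2**13 + 0) + 0
lo, hi = run_deep_source_id_range(deep_coadd_id)
assert lo == 3201777860083712            # matches coadd_id*134217728 above

# Illustrative statement (hypothetical; the range predicate on id lets the
# server walk the primary-key index instead of probing coadd_id):
print(f"DELETE FROM RunDeepSource WHERE id >= {lo} AND id < {hi};")
```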

## A minor inconsistency between sites in tables RunDeepSource

The total numbers of deep sources found by the two processing sites on the same coadd seem to disagree. This is seen from the following report for:

    patch    = '364,0'
    tract    = 0
    filterId = 0

NCSA:

    SELECT id, coadd_id*134217728, id-coadd_id*134217728 AS 'diff'
    FROM gapon_SDRP_Stripe82.RunDeepSource
    WHERE coadd_id IN (
        SELECT deepCoaddId
        FROM gapon_SDRP_Stripe82.DeepCoadd
        WHERE patch='364,0' AND tract=0 AND filterId=0)
    ORDER BY id;

    +------------------+--------------------+------+
    | id               | coadd_id*134217728 | diff |
    +------------------+--------------------+------+
    | 3201777860083756 |   3201777860083712 |   44 |
    | 3201777860083757 |   3201777860083712 |   45 |
    ...
    | 3201777860089174 |   3201777860083712 | 5462 |
    | 3201777860089175 |   3201777860083712 | 5463 |
    | 3201777860089176 |   3201777860083712 | 5464 |
    +------------------+--------------------+------+

IN2P3:

    SELECT id, coadd_id*134217728, id-coadd_id*134217728 AS 'diff'
    FROM lsst_prod_DC_2013_2.RunDeepSource
    WHERE coadd_id IN (
        SELECT deepCoaddId
        FROM lsst_prod_DC_2013_2.DeepCoadd
        WHERE patch='364,0' AND tract=0 AND filterId=0)
    ORDER BY id;

    +------------------+--------------------+------+
    | id               | coadd_id*134217728 | diff |
    +------------------+--------------------+------+
    | 3201777860083756 |   3201777860083712 |   44 |
    | 3201777860083757 |   3201777860083712 |   45 |
    ...
    | 3201777860089191 |   3201777860083712 | 5479 |
    | 3201777860089192 |   3201777860083712 | 5480 |
    | 3201777860089193 |   3201777860083712 | 5481 |
    +------------------+--------------------+------+

The origin of this disagreement likely lies in subtle differences between the processing pipelines run at the two sites.

# Planning the merge

A trivial merge of this cluster of tables, based on excluding the overlap region from one of the sites (either NCSA or IN2P3), is possible. The merge protocol is presented below:

1. TBD
Igor Gaponenko added a comment - - edited

# Analyzing the overlap in table Science_Ccd_Exposure and its dependents

The previously studied structure of the table's primary key suggests that the overlap can be eliminated based on the quadruplet (run, filterId, camcol, field). Here is the relevant query:

    SELECT run, COUNT(*)
    FROM gapon_SDRP_Stripe82.Science_Ccd_Exposure
    WHERE (run, filterId, camcol, field) IN
          (SELECT run, filterId, camcol, field
           FROM lsst_prod_DC_2013_2.Science_Ccd_Exposure)
    GROUP BY run;

It yields 150 runs in the overlap:

    +------+----------+
    | run  | COUNT(*) |
    +------+----------+
    | 1755 |     1080 |
    | 2649 |      810 |
    | 2650 |     1020 |
    | 2662 |     1110 |
    | 2677 |      210 |
    | 2708 |     1109 |
    | 2728 |     1080 |
    | 2768 |     1110 |
    ...
    | 7183 |      419 |
    | 7195 |      919 |
    | 7199 |      202 |
    | 7202 |     1097 |
    +------+----------+

(150 rows in total; the intermediate runs are elided here.)

The opposite query yields the same result. This is also illustrated with this report:

 SELECT COUNT(*) FROM gapon_SDRP_Stripe82.Science_Ccd_Exposure WHERE (run,filterId,camcol,field)  IN (SELECT run,filterId,camcol,field FROM lsst_prod_DC_2013_2.Science_Ccd_Exposure); 

 +----------+ | COUNT(*) | +----------+ | 153666 | +----------+ 

 SELECT COUNT(*) FROM lsst_prod_DC_2013_2.Science_Ccd_Exposure WHERE (run,filterId,camcol,field)  IN  (SELECT run,filterId,camcol,field FROM gapon_SDRP_Stripe82.Science_Ccd_Exposure); 

 +----------+ | COUNT(*) | +----------+ | 153666 | +----------+ 

For further analysis of the overlap, and to optimize the merge, the overlap identifiers (along with their quadruplets) have been captured in the following table:

 SELECT * FROM merge_info.ccd_exp_overlap LIMIT 1; 

 +----------------------+------+--------+----------+-------+
 | scienceCcdExposureId | run  | camcol | filterId | field |
 +----------------------+------+--------+----------+-------+
 |           1755010409 | 1755 |      1 |        0 |   409 |
 +----------------------+------+--------+----------+-------+

 SELECT COUNT(*) FROM merge_info.ccd_exp_overlap; 

 +----------+ | COUNT(*) | +----------+ | 153666 | +----------+ 

## Observations and conclusions

Based on the nature of the data in these tables, the dependencies between them, and the previously investigated structure of the primary key of the CCD exposures table, it seems reasonable to rely on the previously mentioned quadruplet (or the primary key derived from it) as the sole disambiguation marker when merging the IN2P3 tables into the final catalog.

# The structure of the primary key in the dependent table RunDeepForcedSource

It's important to make sure that no primary key conflict arises when merging all 10 instances of the table into a single table. Fortunately, it has been experimentally discovered (though no relevant Data Challenge 13 code has been found so far) that the primary keys of this table are computed directly from the CCD exposure identifiers using the following simplified formula:

 id = 2**26 * ccd_exposure_id + delta 

This is illustrated with the following example:

 SELECT POW(2,26); 

 +-----------+ | POW(2,26) | +-----------+ | 67108864 | +-----------+ 

 SELECT id,exposure_id,id/exposure_id FROM gapon_SDRP_Stripe82.RunDeepForcedSource LIMIT 1; 

 +-------------------+-------------+----------------+
 | id                | exposure_id | id/exposure_id |
 +-------------------+-------------+----------------+
 | 69327485392650241 |  1033060035 |  67108864.0000 |
 +-------------------+-------------+----------------+

 SELECT MIN(id),MAX(id),MAX(id)-MIN(id)+1,COUNT(id) FROM gapon_SDRP_Stripe82.RunDeepForcedSource WHERE exposure_id=1033060035; 

 +-------------------+-------------------+-------------------+-----------+
 | MIN(id)           | MAX(id)           | MAX(id)-MIN(id)+1 | COUNT(id) |
 +-------------------+-------------------+-------------------+-----------+
 | 69327485392650241 | 69327485392655340 |              5100 |      5100 |
 +-------------------+-------------------+-------------------+-----------+
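Consistent with the formula, any id from this table can be decomposed back into its exposure identifier and per-exposure sequence number with integer arithmetic. A quick sanity check (a sketch, using the example row above; not part of the original log):

```sql
-- Split a RunDeepForcedSource primary key into its components:
--   id = 2^26 * exposure_id + delta,  with 2^26 = 67108864
SELECT 69327485392650241 DIV 67108864 AS exposure_id,  -- 1033060035
       69327485392650241 MOD 67108864 AS delta;        -- 1 (deltas start at 1)
```

Note that integer literals (DIV/MOD with 67108864) are used rather than POW(2,26), since POW returns a DOUBLE and these ids exceed the exact range of double-precision arithmetic.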

## Other observations and conclusions

• the primary keys of the table were allocated in a contiguous interval whose base directly depends (FK) on the corresponding (PK) identifier in the exposures table.
• the maximum size of each interval accommodates up to 2^26 (about 67 million) sources per CCD exposure, which rules out even the remote possibility of a key conflict when merging the contents of the 10 tables (minus the overlap) into one.

# The coverage of entries stored in RunDeepForcedSource in the overlap area across sites

Now that we know each site processed the same set of exposures in the overlap region, the next important question is how complete the subsequent search for forced deep sources on those exposures was.

• setting aside the possibility that a site performed only partial processing of an image based on certain RA-Dec limits.

In this study we count the total number of forced deep sources found for each such exposure, to make sure each image was processed.

• note that we're not requiring the total number of detections (per image) to match exactly, since there may be subtle site-specific differences in processing algorithms, configurations, hardware (different CPU versions, etc.), or order of processing. The general expectation is to see broad agreement between the sites.

The study uses the u band (filterId=0).

## Counting the total number of forced deep sources found at each site

NCSA:

 SELECT COUNT(*) FROM gapon_SDRP_Stripe82.RunDeepForcedSource WHERE exposure_id IN  (SELECT scienceCcdExposureId FROM merge_info.ccd_exp_overlap); 

 +-----------+ | COUNT(*) | +-----------+ | 179599164 | +-----------+ 

IN2P3:

 SELECT COUNT(*) FROM lsst_prod_dedupe_byfilter_u.RunDeepForcedSource WHERE exposure_id IN  (SELECT scienceCcdExposureId FROM merge_info.ccd_exp_overlap); 

 +-----------+ | COUNT(*) | +-----------+ | 186050567 | +-----------+ 

## Per-image comparison of the number of found sources in the overlap region

Create two tables to store intermediate results of the analysis:

 CREATE TABLE merge_info.ncsa_num_deep_forced_src_overlap_u (
   scienceCcdExposureId bigint(20) NOT NULL,
   num int(11) DEFAULT 0
 );
 CREATE TABLE merge_info.in2p3_num_deep_forced_src_overlap_u (
   scienceCcdExposureId bigint(20) NOT NULL,
   num int(11) DEFAULT 0
 );
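One possible refinement, not applied in the original log: declaring scienceCcdExposureId as the primary key of both staging tables would let the later IN/NOT IN subqueries use an index. A sketch:

```sql
-- Hypothetical variant: index the join column of both staging tables
ALTER TABLE merge_info.ncsa_num_deep_forced_src_overlap_u
  ADD PRIMARY KEY (scienceCcdExposureId);
ALTER TABLE merge_info.in2p3_num_deep_forced_src_overlap_u
  ADD PRIMARY KEY (scienceCcdExposureId);
```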

Collect the statistics for each site.

NCSA:

 INSERT INTO merge_info.ncsa_num_deep_forced_src_overlap_u SELECT exposure_id,COUNT(*) FROM gapon_SDRP_Stripe82.RunDeepForcedSource WHERE exposure_id IN  (SELECT scienceCcdExposureId FROM merge_info.ccd_exp_overlap) GROUP BY exposure_id ORDER BY exposure_id; 

 Query OK, 29090 rows affected (1 min 0.73 sec) Records: 29090 Duplicates: 0 Warnings: 0 

IN2P3:

 INSERT INTO merge_info.in2p3_num_deep_forced_src_overlap_u SELECT exposure_id,COUNT(*) FROM lsst_prod_dedupe_byfilter_u.RunDeepForcedSource WHERE exposure_id IN  (SELECT scienceCcdExposureId FROM merge_info.ccd_exp_overlap) GROUP BY exposure_id ORDER BY exposure_id; 

Counting the number of exposures with a non-trivial number of found sources:

| site | total sources found | total exposures | unique exposures (not in the other site) |
|------|---------------------|-----------------|------------------------------------------|
| NCSA | 179599164 | 29090 | 223 |
| IN2P3 | 186050567 | 29905 | 1038 |

This table shows the number of runs (per site) in which differences were found:

| site | num runs |
|------|----------|
| NCSA | 39 |
| IN2P3 | 144 |

These were the queries used to calculate the numbers:

 SELECT run, COUNT(*)
 FROM merge_info.ccd_exp_overlap
 WHERE scienceCcdExposureId IN
   (SELECT scienceCcdExposureId
    FROM ncsa_num_deep_forced_src_overlap_u
    WHERE scienceCcdExposureId NOT IN
      (SELECT scienceCcdExposureId
       FROM in2p3_num_deep_forced_src_overlap_u))
 GROUP BY run ORDER BY run;

 SELECT run, COUNT(*)
 FROM merge_info.ccd_exp_overlap
 WHERE scienceCcdExposureId IN
   (SELECT scienceCcdExposureId
    FROM in2p3_num_deep_forced_src_overlap_u
    WHERE scienceCcdExposureId NOT IN
      (SELECT scienceCcdExposureId
       FROM ncsa_num_deep_forced_src_overlap_u))
 GROUP BY run ORDER BY run;
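For the exposures present at both sites, a natural follow-up (a sketch only, not run as part of this ticket) would be to compare the per-exposure counts directly:

```sql
-- Count overlap exposures for which the two sites found a different
-- number of forced deep sources (u band staging tables defined above)
SELECT COUNT(*)
FROM merge_info.ncsa_num_deep_forced_src_overlap_u n
JOIN merge_info.in2p3_num_deep_forced_src_overlap_u i
  USING (scienceCcdExposureId)
WHERE n.num <> i.num;
```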

HOW TO INTERPRET THIS?

Further investigation of those unique exposures is needed to see if (possibly) there is any correlation with RA-Dec.

Igor Gaponenko added a comment - edited

# Planning the merge

Two independent clusters of tables have been identified. Each group can be merged separately from the other. Intermediate results of the merge will be stored, as needed, in a special database called merge_info.

MySQL optimization techniques for MyISAM:
The most relevant suggestion from this document is:

 ALTER TABLE tbl_name DISABLE KEYS;
 ALTER TABLE tbl_name ENABLE KEYS;
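In practice the two statements bracket a bulk load, so that non-unique indexes are rebuilt once at the end rather than updated row by row. A generic sketch (tbl_name and src_db are placeholders; DISABLE KEYS affects only the non-unique indexes of MyISAM tables, and PRIMARY/UNIQUE keys are still enforced during the load):

```sql
-- bulk-load bracket for a MyISAM table
ALTER TABLE tbl_name DISABLE KEYS;           -- suspend non-unique index updates
INSERT INTO tbl_name SELECT * FROM src_db.tbl_name;
ALTER TABLE tbl_name ENABLE KEYS;            -- rebuild the indexes in one pass
```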

Q: MySQL server tuning: can anything be done here?

## CCD exposures and their dependent metadata tables

As discovered and explained earlier in this ticket, the merge can be safely performed based on the exact equality of the primary keys of table Science_Ccd_Exposure in the overlap area. The keys are computed from a quadruplet of columns: (run,filterId,camcol,field). Hence, in the proposed merge algorithm the overlapped rows will first be removed from the IN2P3 CCD exposure table and its dependent metadata tables (2 of those). The next step is to ingest the remaining content of those tables into the corresponding tables of the primary NCSA database.

Note that the biggest (both in total number of rows and in size) band-specific tables, RunDeepForcedSource, will be merged based on RunDeepSource, not on the exposures.

There are 10 instances of this table spread between 10 databases. Each database represents a filter band (ugriz) and a site (NCSA, IN2P3). Rows of the table directly depend on:

• the corresponding rows of the CCD exposures table
• the corresponding rows of the objects in the RunDeepSource table

According to the study made earlier, the first dependency (on exposures) can be safely eliminated because both sites provide the same collection of images in the overlap area, and the corresponding rows have the same identifiers.

The second dependency is more fundamental from the science point of view because it establishes a relationship between Astrophysical Objects (measured on the deep co-adds) and the corresponding collections of Sources (measured on individual images). Hence, the elimination of the overlapped rows of table RunDeepForcedSource should follow this chain of dependencies.

Other ideas for an optimal merge algorithm:

• all 5 NCSA RunDeepForcedSource tables require no modifications; they can be merged as is.
• IN2P3's tables DeepCoadd and RunDeepSource:
• first of all need to be cleaned of the overlapped data (presumably based on patch and tract)
• the overlapped object identifiers RunDeepSource.id should be recorded in a temporary table, as they will be needed at the next stage
• IN2P3's tables RunDeepForcedSource:
• should be cleaned of all rows referring to the previously recorded objects in the overlap area
• then the contents of the tables can be either dumped into CSV/TSV or merged with the main table RunDeepForcedSource at NCSA.
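The chain above could look roughly like this. This is a sketch only: the table merge_info.deep_src_overlap, the coadd_id column, the objectId column, and the coadd-overlap table are assumptions, not names verified against the DC 2013 schema:

```sql
-- 1. record identifiers of the overlapped objects before deleting them
CREATE TABLE merge_info.deep_src_overlap (
  id bigint(20) NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB;

INSERT INTO merge_info.deep_src_overlap
SELECT s.id
FROM lsst_prod_DC_2013_2.RunDeepSource s
JOIN merge_info.coadd_overlap c        -- hypothetical list of overlapped coadds
  ON s.coadd_id = c.deepCoaddId;

-- 2. remove the forced sources that point at those objects
DELETE FROM lsst_prod_dedupe_byfilter_u.RunDeepForcedSource
WHERE objectId IN (SELECT id FROM merge_info.deep_src_overlap);
```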
Igor Gaponenko added a comment - edited

# Merging CCD exposures and their dependent metadata tables

The merge log (status, step, queries/procedures):

• initialize and load the main database gapon_SDRP_Stripe82 with exposures and relevant metadata processed at NCSA (MySQL dump/load)
• initialize and load the second database lsst_prod_DC_2013_2 with exposures and relevant metadata processed at IN2P3 (MySQL dump/load)
• create table merge_info.ccd_exp_overlap for primary keys and quadruplets in the overlap area

 CREATE TABLE merge_info.ccd_exp_overlap (
   scienceCcdExposureId bigint(20) NOT NULL,
   run int(11) NOT NULL,
   camcol tinyint(4) NOT NULL,
   filterId tinyint(4) NOT NULL,
   field int(11) NOT NULL,
   PRIMARY KEY (scienceCcdExposureId)
 ) ENGINE=InnoDB;

record the primary keys and quadruplets into merge_info.ccd_exp_overlap

 INSERT INTO merge_info.ccd_exp_overlap SELECT scienceCcdExposureId, run, camcol, filterId, field FROM gapon_SDRP_Stripe82.Science_Ccd_Exposure WHERE  (scienceCcdExposureId, run, camcol, filterId, field)  IN  (SELECT scienceCcdExposureId, run, camcol, filterId, field  FROM lsst_prod_DC_2013_2.Science_Ccd_Exposure); 

 Query OK, 153666 rows affected (3.70 sec) Records: 153666 Duplicates: 0 Warnings: 0 

delete the primary keys and quadruplets of the overlap from the IN2P3 table lsst_prod_DC_2013_2.Science_Ccd_Exposure

 DELETE FROM lsst_prod_DC_2013_2.Science_Ccd_Exposure WHERE  (scienceCcdExposureId, run, camcol, filterId, field)  IN  (SELECT scienceCcdExposureId, run, camcol, filterId, field  FROM merge_info.ccd_exp_overlap); 

 Query OK, 153666 rows affected (5.10 sec) 

eliminate dependent rows from the IN2P3's CCD metadata table lsst_prod_DC_2013_2.Science_Ccd_Exposure_Metadata

 SELECT COUNT(*) FROM gapon_SDRP_Stripe82.Science_Ccd_Exposure_Metadata WHERE  scienceCcdExposureId  IN  (SELECT scienceCcdExposureId  FROM merge_info.ccd_exp_overlap); 

 +----------+ | COUNT(*) | +----------+ | 17978922 | +----------+ 1 row in set (1 min 55.73 sec) 

 SELECT COUNT(*) FROM lsst_prod_DC_2013_2.Science_Ccd_Exposure_Metadata WHERE  scienceCcdExposureId  IN  (SELECT scienceCcdExposureId  FROM merge_info.ccd_exp_overlap); 

 +----------+ | COUNT(*) | +----------+ | 17978922 | +----------+ 1 row in set (3 min 34.18 sec) 

 DELETE FROM lsst_prod_DC_2013_2.Science_Ccd_Exposure_Metadata WHERE  scienceCcdExposureId  IN  (SELECT scienceCcdExposureId  FROM merge_info.ccd_exp_overlap); 

 Query OK, 17978922 rows affected (41 min 30.88 sec) 

eliminate dependent rows from the IN2P3's CCD metadata table lsst_prod_DC_2013_2.Science_Ccd_Exposure_To_Htm10

 SELECT COUNT(*) FROM gapon_SDRP_Stripe82.Science_Ccd_Exposure_To_Htm10 WHERE  scienceCcdExposureId  IN  (SELECT scienceCcdExposureId  FROM merge_info.ccd_exp_overlap); 

 +----------+ | COUNT(*) | +----------+ | 2938203 | +----------+ 1 row in set (3.64 sec) 

 SELECT COUNT(*) FROM lsst_prod_DC_2013_2.Science_Ccd_Exposure_To_Htm10 WHERE  scienceCcdExposureId  IN  (SELECT scienceCcdExposureId  FROM merge_info.ccd_exp_overlap); 

 +----------+ | COUNT(*) | +----------+ | 2938203 | +----------+ 1 row in set (3.49 sec) 

 DELETE FROM lsst_prod_DC_2013_2.Science_Ccd_Exposure_To_Htm10 WHERE  scienceCcdExposureId  IN  (SELECT scienceCcdExposureId  FROM merge_info.ccd_exp_overlap); 

 Query OK, 2938203 rows affected (56.25 sec) 

ingest the remaining contents of lsst_prod_DC_2013_2.Science_Ccd_Exposure into gapon_SDRP_Stripe82.Science_Ccd_Exposure

 SELECT COUNT(*) FROM lsst_prod_DC_2013_2.Science_Ccd_Exposure; 

 +----------+ | COUNT(*) | +----------+ | 1389442 | +----------+ 1 row in set (0.00 sec) 

 SELECT COUNT(*) FROM gapon_SDRP_Stripe82.Science_Ccd_Exposure; 

 +----------+ | COUNT(*) | +----------+ | 1403385 | +----------+ 1 row in set (0.00 sec) 

 INSERT INTO gapon_SDRP_Stripe82.Science_Ccd_Exposure SELECT * FROM lsst_prod_DC_2013_2.Science_Ccd_Exposure; 

 Query OK, 1389442 rows affected (19.01 sec) Records: 1389442 Duplicates: 0 Warnings: 0 

 SELECT COUNT(*) FROM gapon_SDRP_Stripe82.Science_Ccd_Exposure; 

 +----------+ | COUNT(*) | +----------+ | 2792827 | +----------+ 1 row in set (0.00 sec) 
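A cheap post-merge sanity check (a sketch; not part of the original log) is to confirm that no duplicate exposure identifiers slipped through the merge:

```sql
-- total rows should equal distinct primary keys after the merge
SELECT COUNT(*) AS total_rows,
       COUNT(DISTINCT scienceCcdExposureId) AS distinct_ids
FROM gapon_SDRP_Stripe82.Science_Ccd_Exposure;
```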

ingest the remaining contents of lsst_prod_DC_2013_2.Science_Ccd_Exposure_Metadata into gapon_SDRP_Stripe82.Science_Ccd_Exposure_Metadata

 SELECT COUNT(*) FROM lsst_prod_DC_2013_2.Science_Ccd_Exposure_Metadata; 

 +-----------+ | COUNT(*) | +-----------+ | 162564713 | +-----------+ 1 row in set (0.00 sec) 

 SELECT COUNT(*) FROM gapon_SDRP_Stripe82.Science_Ccd_Exposure_Metadata; 

 +-----------+ | COUNT(*) | +-----------+ | 164196045 | +-----------+ 1 row in set (0.00 sec) 

 INSERT INTO gapon_SDRP_Stripe82.Science_Ccd_Exposure_Metadata SELECT * FROM lsst_prod_DC_2013_2.Science_Ccd_Exposure_Metadata; 

 Query OK, 162564713 rows affected (1 hour 46 min 56.68 sec) Records: 162564713 Duplicates: 0 Warnings: 0 

ingest the remaining contents of lsst_prod_DC_2013_2.Science_Ccd_Exposure_To_Htm10 into gapon_SDRP_Stripe82.Science_Ccd_Exposure_To_Htm10

 SELECT COUNT(*) FROM gapon_SDRP_Stripe82.Science_Ccd_Exposure_To_Htm10; 

 +----------+ | COUNT(*) | +----------+ | 26861430 | +----------+ 1 row in set (0.00 sec) 

 SELECT COUNT(*) FROM lsst_prod_DC_2013_2.Science_Ccd_Exposure_To_Htm10; 

 +----------+ | COUNT(*) | +----------+ | 26168990 | +----------+ 1 row in set (0.00 sec) 

 ALTER TABLE gapon_SDRP_Stripe82.Science_Ccd_Exposure_To_Htm10 DISABLE KEYS; 

Query OK, 0 rows affected (0.19 sec)

 INSERT INTO gapon_SDRP_Stripe82.Science_Ccd_Exposure_To_Htm10 SELECT * FROM lsst_prod_DC_2013_2.Science_Ccd_Exposure_To_Htm10; 

Query OK, 26168990 rows affected (7.02 sec)
Records: 26168990 Duplicates: 0 Warnings: 0

 ALTER TABLE gapon_SDRP_Stripe82.Science_Ccd_Exposure_To_Htm10 ENABLE KEYS; 

Query OK, 0 rows affected (2 min 0.93 sec)

 SELECT COUNT(*) FROM gapon_SDRP_Stripe82.Science_Ccd_Exposure_To_Htm10; 

 +----------+ | COUNT(*) | +----------+ | 53030420 | +----------+ 1 row in set (0.00 sec) 

Show
Igor Gaponenko added a comment - - edited Merging CCD exposures and its dependent metadata tables status step queries, procedures, etc. initialize and load main database gapon_SDRP_Stripe82 with exposures and relevant metadata processed at NCSA MySQL dump/load initialize and load the second database lsst_prod_DC_2013_2 with exposures and relevant metadata processed in IN2P3 MySQL dump/load create table merge_info.ccd_exp_overlap for primary keys and quadruplets in the overlap area CREATE TABLE merge_info.ccd_exp_overlap ( scienceCcdExposureId bigint (20) NOT NULL , run int (11) NOT NULL , camcol tinyint(4) NOT NULL , filterId tinyint(4) NOT NULL , field int (11) NOT NULL , PRIMARY KEY (scienceCcdExposureId) ) ENGINE=InnoDB; records the primary keys and quadruplets into merge_info.ccd_exp_overlap INSERT INTO merge_info.ccd_exp_overlap SELECT scienceCcdExposureId, run, camcol, filterId, field FROM gapon_SDRP_Stripe82.Science_Ccd_Exposure WHERE (scienceCcdExposureId, run, camcol, filterId, field) IN ( SELECT scienceCcdExposureId, run, camcol, filterId, field FROM lsst_prod_DC_2013_2.Science_Ccd_Exposure); Query OK, 153666 rows affected (3.70 sec) Records: 153666 Duplicates: 0 Warnings: 0 delete the primary keys and quadruplets of the overlap from the IN2P3 table lsst_prod_DC_2013_2.Science_Ccd_Exposure DELETE FROM lsst_prod_DC_2013_2.Science_Ccd_Exposure WHERE (scienceCcdExposureId, run, camcol, filterId, field) IN ( SELECT scienceCcdExposureId, run, camcol, filterId, field FROM merge_info.ccd_exp_overlap); Query OK, 153666 rows affected (5.10 sec) eliminate dependent rows from the IN2P3's CCD metadata table lsst_prod_DC_2013_2.Science_Ccd_Exposure_Metadata SELECT COUNT (*) FROM gapon_SDRP_Stripe82.Science_Ccd_Exposure_Metadata WHERE scienceCcdExposureId IN ( SELECT scienceCcdExposureId FROM merge_info.ccd_exp_overlap); +----------+ | COUNT(*) | +----------+ | 17978922 | +----------+ 1 row in set (1 min 55.73 sec) SELECT COUNT (*) FROM 
lsst_prod_DC_2013_2.Science_Ccd_Exposure_Metadata WHERE scienceCcdExposureId IN ( SELECT scienceCcdExposureId FROM merge_info.ccd_exp_overlap); +----------+ | COUNT(*) | +----------+ | 17978922 | +----------+ 1 row in set (3 min 34.18 sec) DELETE FROM lsst_prod_DC_2013_2.Science_Ccd_Exposure_Metadata WHERE scienceCcdExposureId IN ( SELECT scienceCcdExposureId FROM merge_info.ccd_exp_overlap); Query OK, 17978922 rows affected (41 min 30.88 sec) eliminate dependent rows from the IN2P3's CCD metadata table lsst_prod_DC_2013_2.Science_Ccd_Exposure_To_Htm10 SELECT COUNT (*) FROM gapon_SDRP_Stripe82.Science_Ccd_Exposure_To_Htm10 WHERE scienceCcdExposureId IN ( SELECT scienceCcdExposureId FROM merge_info.ccd_exp_overlap); +----------+ | COUNT(*) | +----------+ | 2938203 | +----------+ 1 row in set (3.64 sec) SELECT COUNT (*) FROM lsst_prod_DC_2013_2.Science_Ccd_Exposure_To_Htm10 WHERE scienceCcdExposureId IN ( SELECT scienceCcdExposureId FROM merge_info.ccd_exp_overlap); +----------+ | COUNT(*) | +----------+ | 2938203 | +----------+ 1 row in set (3.49 sec) DELETE FROM lsst_prod_DC_2013_2.Science_Ccd_Exposure_To_Htm10 WHERE scienceCcdExposureId IN ( SELECT scienceCcdExposureId FROM merge_info.ccd_exp_overlap); Query OK, 2938203 rows affected (56.25 sec) ingest the remaining contents of lsst_prod_DC_2013_2.Science_Ccd_Exposure into gapon_SDRP_Stripe82.Science_Ccd_Exposure SELECT COUNT (*) FROM lsst_prod_DC_2013_2.Science_Ccd_Exposure; +----------+ | COUNT(*) | +----------+ | 1389442 | +----------+ 1 row in set (0.00 sec) SELECT COUNT (*) FROM gapon_SDRP_Stripe82.Science_Ccd_Exposure; +----------+ | COUNT(*) | +----------+ | 1403385 | +----------+ 1 row in set (0.00 sec) INSERT INTO gapon_SDRP_Stripe82.Science_Ccd_Exposure SELECT * FROM lsst_prod_DC_2013_2.Science_Ccd_Exposure; Query OK, 1389442 rows affected (19.01 sec) Records: 1389442 Duplicates: 0 Warnings: 0 SELECT COUNT(*) FROM gapon_SDRP_Stripe82.Science_Ccd_Exposure; +----------+ | COUNT(*) | +----------+ | 2792827 
| +----------+ 1 row in set (0.00 sec) ingest the remaining contents of lsst_prod_DC_2013_2.Science_Ccd_Exposure_Metadata into gapon_SDRP_Stripe82.Science_Ccd_Exposure_Metadata SELECT COUNT (*) FROM lsst_prod_DC_2013_2.Science_Ccd_Exposure_Metadata; +-----------+ | COUNT(*) | +-----------+ | 162564713 | +-----------+ 1 row in set (0.00 sec) SELECT COUNT (*) FROM gapon_SDRP_Stripe82.Science_Ccd_Exposure_Metadata; +-----------+ | COUNT(*) | +-----------+ | 164196045 | +-----------+ 1 row in set (0.00 sec) INSERT INTO gapon_SDRP_Stripe82.Science_Ccd_Exposure_Metadata SELECT * FROM lsst_prod_DC_2013_2.Science_Ccd_Exposure_Metadata; Query OK, 162564713 rows affected (1 hour 46 min 56.68 sec) Records: 162564713 Duplicates: 0 Warnings: 0 ingest the remaining contents of lsst_prod_DC_2013_2.Science_Ccd_Exposure_To_Htm10 into gapon_SDRP_Stripe82.Science_Ccd_Exposure_To_Htm10 SELECT COUNT (*) FROM gapon_SDRP_Stripe82.Science_Ccd_Exposure_To_Htm10; +----------+ | COUNT(*) | +----------+ | 26861430 | +----------+ 1 row in set (0.00 sec) SELECT COUNT (*) FROM lsst_prod_DC_2013_2.Science_Ccd_Exposure_To_Htm10; +----------+ | COUNT(*) | +----------+ | 26168990 | +----------+ 1 row in set (0.00 sec) ALTER TABLE gapon_SDRP_Stripe82.Science_Ccd_Exposure_To_Htm10 DISABLE KEYS; Query OK, 0 rows affected (0.19 sec) {code:sql} INSERT INTO gapon_SDRP_Stripe82.Science_Ccd_Exposure_To_Htm10 SELECT * FROM lsst_prod_DC_2013_2.Science_Ccd_Exposure_To_Htm10; Query OK, 26168990 rows affected (7.02 sec) Records: 26168990 Duplicates: 0 Warnings: 0 {code:sql} ALTER TABLE gapon_SDRP_Stripe82.Science_Ccd_Exposure_To_Htm10 ENABLE KEYS; Query OK, 0 rows affected (2 min 0.93 sec) {code:sql} SELECT COUNT(*) FROM gapon_SDRP_Stripe82.Science_Ccd_Exposure_To_Htm10; +----------+ | COUNT(*) | +----------+ | 53030420 | +----------+ 1 row in set (0.00 sec)
Igor Gaponenko added a comment - - edited

# Merging RunDeepForcedSource (FAILED ATTEMPT)

These are important notes on the merge:

• the NCSA u band (filterId=0) is already preloaded into database gapon_SDRP_Stripe82
• the remaining 4 NCSA bands will be loaded into the database w/o any filtering
• all 5 IN2P3 bands will be loaded with a filter which excludes sources related to the overlapping exposures (as discussed earlier in the corresponding analysis section)
• a list of those overlapping exposures is preloaded into table merge_info.ccd_exp_overlap

The log of the operation is shown below:

status step queries, procedures, etc.

{code:sql}
INSERT INTO gapon_SDRP_Stripe82.RunDeepForcedSource
SELECT *
FROM lsst_prod_dedupe_byfilter_u.RunDeepForcedSource
WHERE exposure_id NOT IN
    (SELECT scienceCcdExposureId FROM merge_info.ccd_exp_overlap);
{code}
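For reference, the exclusion pattern of this INSERT ... SELECT can be exercised on a toy SQLite database (table and column names mirror the ticket; the data values are invented):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE src (id INTEGER PRIMARY KEY, exposure_id INTEGER);
CREATE TABLE dst (id INTEGER PRIMARY KEY, exposure_id INTEGER);
CREATE TABLE ccd_exp_overlap (scienceCcdExposureId INTEGER PRIMARY KEY);
INSERT INTO src VALUES (1, 100), (2, 200), (3, 300);
INSERT INTO ccd_exp_overlap VALUES (200);
""")

# Copy only the rows whose exposure is NOT in the overlap list,
# mirroring the INSERT ... SELECT ... WHERE ... NOT IN (...) above.
db.execute("""
INSERT INTO dst
SELECT * FROM src
WHERE exposure_id NOT IN (SELECT scienceCcdExposureId FROM ccd_exp_overlap)
""")

rows = [r[0] for r in db.execute("SELECT id FROM dst ORDER BY id")]
print(rows)  # [1, 3]
```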

The operation was CANCELLED after 2+ days due to poor performance of the distributed file system for small-block random I/O. The postmortem analysis is presented below.

## The duration of the operation before it was interrupted (195811 seconds)

The duration (see the Time column, reported in seconds) was 195811 s, i.e. about 54 hours:

{code:sql}
SHOW PROCESSLIST;
{code}

{code}
+----+------+-----------+---------------------+---------+--------+--------------+------------------------------------------------------+
| Id | User | Host      | db                  | Command | Time   | State        | Info                                                 |
+----+------+-----------+---------------------+---------+--------+--------------+------------------------------------------------------+
| 55 | root | localhost | gapon_SDRP_Stripe82 | Query   | 195811 | Sending data | INSERT INTO gapon_SDRP_Stripe82.RunDeepForcedSource SELECT * FROM lsst_prod_dedupe_byfilter_u.RunDee |
+----+------+-----------+---------------------+---------+--------+--------------+------------------------------------------------------+
{code}

## The change in the table & index sizes

The table stats (number of rows, data & index sizes) at the beginning of the operation:

{code:sql}
SHOW TABLE STATUS;
{code}

{code}
Name:            RunDeepForcedSource
Engine:          MyISAM
Version:         10
Row_format:      Fixed
Rows:            1729599791
Avg_row_length:  395
Data_length:     683191917445
Max_data_length: 111182615800709119
Index_length:    190732060672
Data_free:       0
Auto_increment:  NULL
Create_time:     2016-08-10 17:39:27
Update_time:     2016-08-31 00:16:31
Check_time:      2016-08-12 13:17:23
{code}

And after it was interrupted:

{code:sql}
SHOW TABLE STATUS;
{code}

{code}
Name:            RunDeepForcedSource
Engine:          MyISAM
Version:         10
Row_format:      Fixed
Rows:            2197365176
Avg_row_length:  395
Data_length:     867959244520
Max_data_length: 111182615800709119
Index_length:    262239522816
Data_free:       0
Auto_increment:  NULL
Create_time:     2016-08-10 17:39:27
Update_time:     2016-08-31 00:27:23
Check_time:      2016-08-12 13:17:23
{code}

The change was:

{code}
         rows         data size (bytes)   index size (bytes)
before   1729599791   683191917445        190732060672
after    2197365176   867959244520        262239522816
{code}

The estimated performance:

• rows: (2197365176 - 1729599791) / 195811 = 2389 rows/s
• table data: (867959244520 - 683191917445) / 195811 = 943600 bytes/s ≈ 0.9 MB/s
• index data: (262239522816 - 190732060672) / 195811 = 365186 bytes/s ≈ 0.35 MB/s

NOTE: most likely the performance was NOT LINEAR: a significant slowdown of the file growth was observed during the merge, and the rate could well be dominated by the index rebuild.
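The rate estimates above can be re-derived directly from the before/after table stats (a quick Python check, using only the numbers already quoted):

```python
# Re-deriving the back-of-the-envelope rates from the before/after table stats.
seconds = 195811                        # runtime before the query was cancelled
rows    = 2197365176 - 1729599791       # rows inserted
data    = 867959244520 - 683191917445   # bytes of table data written
index   = 262239522816 - 190732060672   # bytes of index written

rows_per_s = round(rows / seconds)      # ~2389 rows/s
data_bps   = data // seconds            # ~943600 bytes/s (about 0.9 MB/s)
index_bps  = index // seconds           # ~365186 bytes/s (about 0.35 MB/s)
print(rows_per_s, data_bps, index_bps)
```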

## File system performance analysis

It turned out the operation was bottlenecked by the high latency and low direct-I/O performance of the OpenStack volume attached to the VM. This was seen with:

{code}
% iostat -m 10
{code}

{code}
Linux 3.10.0-327.13.1.el7.x86_64 (lsst-gapon-0.os.ncsa.edu)  08/31/2016  _x86_64_  (8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.52    0.00    0.50    2.75    0.00   93.23

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
vda               0.12         0.00         0.00       6137        539
vdb             359.15        14.64         6.04   26630294   10981068
vdc               4.90         0.01         2.06      13961    3744667

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.89    0.00    0.53   10.10    0.01   87.47

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
vda               0.30         0.00         0.00          0          0
vdb             661.10        18.83         0.01        188          0
vdc               0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.01    0.00    0.46   17.77    0.01   80.74

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
vda               0.00         0.00         0.00          0          0
vdb            7900.00        12.52        43.49        125        434
vdc               0.00         0.00         0.00          0          0
{code}

And this is what iozone reported under ideal conditions, with no other I/O activity happening on that machine:

{code}
% iozone -s 100000 -r 16 -I
{code}

{code}
File size set to 100000 kB
Record Size 16 kB
O_DIRECT feature enabled
Command line used: iozone -s 100000 -r 16 -I
Output is in kBytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 kBytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.

                                                      random  random    bkwd  record  stride
    kB  reclen  write  rewrite   read  reread    read   write    read rewrite    read  fwrite frewrite   fread freread
100000      16  34260    74921  62643   65426   37339   73873   70515   78902   69543 1409936  2021713 4858634 5284841
{code}

The direct (O_DIRECT) I/O throughput was on the order of 34 MB/s for writes to 62 MB/s for reads.
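For reference, the iozone columns are in kB/s; converting the O_DIRECT write and read figures to (decimal) MB/s:

```python
# iozone reports kBytes/sec; the write and read columns from the run above.
write_kbs, read_kbs = 34260, 62643
print(write_kbs // 1000, read_kbs // 1000)  # 34 62
```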

## Conclusions (on the poor performance)

Most likely the main factor limiting the performance was the table index rebuild, which required rebalancing huge B-Trees far exceeding the machine's memory, while the underlying file system had inadequate performance (for a problem of this scale) for the random (direct) I/O operations with MySQL's 16 KB pages.

Further options will be investigated in the next comment.

Igor Gaponenko added a comment -

# Planning more efficient merge of the forced deep sources

Three options have been discussed:

1. making another attempt to merge two tables with disabled keys using the same hardware setup (OpenStack and NCSA)
• measure the performance of the operation, make observations on the file system behavior
• then try re-enabling the keys and study the performance of the index rebuild operation
2. develop an application which will produce a collection of CSV/TSV files compatible with the Qserv database loader/partitioner application
• the application should chunk its output into multiple files (one file per run)
• to speed up the operation 10 OpenStack VMs and 10 separate storage partitions could be set up
3. investigate the performance of the index rebuild on the new Qserv hardware at IN2P3 (NVMe/SSD-based large-memory machine ccqserv150)
Igor Gaponenko added a comment - - edited

# Merging RunDeepForcedSource w/ DISABLE KEYS

This is the first option discussed in the previous comment. Note the following:

• the DISABLE KEYS request won't affect the PRIMARY index of the table
• the performance of the INSERT ... SELECT ... operations can be significantly sped up by increasing the value of the following global variable: http://dev.mysql.com/doc/refman/5.6/en/server-system-variables.html#sysvar_bulk_insert_buffer_size

The log of the operation is shown below:

status step queries, procedures, etc.
saving the previous state of table gapon_SDRP_Stripe82.RunDeepForcedSource

{code:sql}
RENAME TABLE gapon_SDRP_Stripe82.RunDeepForcedSource
    TO gapon_SDRP_Stripe82.RunDeepForcedSource_saved;
{code}

loading NCSA's u band into the newly created table from a MySQL dump file, with a command which prevents keys from being re-enabled at the end of the load:

{code}
date; \
cat /data1/gapon/_daues_SDRP_dedupe_byfilter_0/RunDeepForcedSource.sql | \
sed 's/ENABLE KEYS/DISABLE KEYS/' | \
mysql -u root gapon_SDRP_Stripe82; \
date
{code}

{code}
Wed Aug 31 02:27:33 UTC 2016
Thu Sep  1 06:16:05 UTC 2016
{code}
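For clarity, the sed stage rewrites every ENABLE KEYS statement in the dump stream into DISABLE KEYS, so mysql finishes the load with the table's non-unique indexes still disabled. A minimal Python illustration of the same substitution (the sample dump line is a typical mysqldump statement, shown only for illustration):

```python
# A typical mysqldump statement that re-enables keys after the bulk INSERTs;
# the sed filter turns it into a DISABLE KEYS no-op so the load skips the
# index rebuild.
line = "/*!40000 ALTER TABLE `RunDeepForcedSource` ENABLE KEYS */;"
rewritten = line.replace("ENABLE KEYS", "DISABLE KEYS")
print(rewritten)
```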

increased the size of MyISAM's special tree-like cache for bulk inserts from 8 MB to 8 GB (half of the machine's memory) to optimize bulk inserts into non-empty tables, as per: http://dev.mysql.com/doc/refman/5.6/en/server-system-variables.html#sysvar_bulk_insert_buffer_size

{code}
% cat /etc/my.cnf.d/server.cnf
...
bulk_insert_buffer_size = 8589934592
...
{code}

Restart the MariaDB server to make sure this parameter takes effect.

delete all indexes on table gapon_SDRP_Stripe82.RunDeepForcedSource
ATTENTION: do this only after finishing analyzing the relationship between this table and RunDeepCoadd based on the objectId. This may potentially affect the de-duplication effort.

{code:sql}
SHOW INDEX FROM gapon_SDRP_Stripe82.RunDeepForcedSource;
DROP INDEX `PRIMARY` ON gapon_SDRP_Stripe82.RunDeepForcedSource;
...
{code}

merge the de-duplicated rows of table RunDeepForcedSource of the IN2P3's u band into table gapon_SDRP_Stripe82.RunDeepForcedSource. See a recipe from the previous (FAILED) merging effort
 Sprint DB_F16_7, DB_F16_8 [ 232, 233 ] DB_F16_7, DB_F16_8, DB_F16_9 [ 232, 233, 249 ]
 Rank Ranked higher
Igor Gaponenko added a comment - - edited

# Merging and de-duplicating DeepCoadd and RunDeepSource

These tables need to be merged and de-duplicated before attempting to de-duplicate the 5 IN2P3 RunDeepForcedSource tables. The effort is based on the dependency between the tables explained earlier in this ticket.

We shall also increase the MySQL key buffer size because some of the operations used later on will greatly benefit from this. The parameter is explained at http://dev.mysql.com/doc/refman/5.7/en/server-system-variables.html#sysvar_key_buffer_size. Set it to 4 GB as per Serge's suggestion at https://github.com/lsst-dm/db_tests_summer15/blob/master/sql/dump_ForcedSource.sql then restart the server.

{code}
% cat /etc/my.cnf.d/server.cnf
..
key_buffer_size = 4294967296
..
% systemctl restart mariadb.service
{code}

## Saving the overlap info for DeepCoadd

Here we're going to save the overlap identifiers in intermediate tables.
NCSA:

{code:sql}
CREATE TABLE merge_info.ncsa_DeepCoadd_overlap (
    deepCoaddId bigint(20) NOT NULL,
    tract int(11) NOT NULL,
    patch char(16) NOT NULL,
    filterId tinyint(4) NOT NULL,
    PRIMARY KEY (deepCoaddId)
) ENGINE=MyISAM;

INSERT INTO merge_info.ncsa_DeepCoadd_overlap
    SELECT deepCoaddId,tract,patch,filterId
    FROM gapon_SDRP_Stripe82.DeepCoadd
    WHERE (tract,patch,filterId) IN
        (SELECT tract,patch,filterId FROM lsst_prod_DC_2013_2.DeepCoadd);
{code}

{code}
Query OK, 1380 rows affected (0.10 sec)
Records: 1380  Duplicates: 0  Warnings: 0
{code}

IN2P3:

{code:sql}
CREATE TABLE merge_info.in2p3_DeepCoadd_overlap (
    deepCoaddId bigint(20) NOT NULL,
    tract int(11) NOT NULL,
    patch char(16) NOT NULL,
    filterId tinyint(4) NOT NULL,
    PRIMARY KEY (deepCoaddId)
) ENGINE=MyISAM;

INSERT INTO merge_info.in2p3_DeepCoadd_overlap
    SELECT deepCoaddId,tract,patch,filterId
    FROM lsst_prod_DC_2013_2.DeepCoadd
    WHERE (tract,patch,filterId) IN
        (SELECT tract,patch,filterId FROM gapon_SDRP_Stripe82.DeepCoadd);
{code}

{code}
Query OK, 1380 rows affected (41.85 sec)
Records: 1380  Duplicates: 0  Warnings: 0
{code}
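The tuple-membership test used above can be reproduced on a toy SQLite database (row values need SQLite ≥ 3.15; the table layout mirrors the ticket, the data values are invented):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE ncsa_DeepCoadd  (deepCoaddId INTEGER PRIMARY KEY, tract INT, patch TEXT, filterId INT);
CREATE TABLE in2p3_DeepCoadd (deepCoaddId INTEGER PRIMARY KEY, tract INT, patch TEXT, filterId INT);
INSERT INTO ncsa_DeepCoadd  VALUES (1, 0, '1,1', 0), (2, 0, '1,2', 0);
INSERT INTO in2p3_DeepCoadd VALUES (7, 0, '1,2', 0), (8, 0, '9,9', 0);
CREATE TABLE ncsa_DeepCoadd_overlap (deepCoaddId INTEGER PRIMARY KEY, tract INT, patch TEXT, filterId INT);
""")

# Record every NCSA coadd whose (tract, patch, filterId) also appears at IN2P3.
db.execute("""
INSERT INTO ncsa_DeepCoadd_overlap
SELECT deepCoaddId, tract, patch, filterId
FROM ncsa_DeepCoadd
WHERE (tract, patch, filterId) IN
    (SELECT tract, patch, filterId FROM in2p3_DeepCoadd)
""")

overlap_ids = [r[0] for r in db.execute("SELECT deepCoaddId FROM ncsa_DeepCoadd_overlap")]
print(overlap_ids)  # [2]
```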

Observations:

• both tables have identical entries
• the number of rows in each table is rather small: 1380
• the table data size is 85560 bytes
• the index size is 23552 bytes

## Saving the overlap info for RunDeepSource

### Estimating the total number of Objects in the overlap area

NCSA:

{code:sql}
SELECT COUNT(id)
FROM
    gapon_SDRP_Stripe82.RunDeepSource rds,
    merge_info.ncsa_DeepCoadd_overlap overlap
WHERE
    rds.coadd_id = overlap.deepCoaddId;
{code}

{code}
+-----------+
| COUNT(id) |
+-----------+
|   9991529 |
+-----------+
1 row in set (1 min 43.00 sec)
{code}

IN2P3:

{code:sql}
SELECT COUNT(id)
FROM
    lsst_prod_DC_2013_2.RunDeepSource rds,
    merge_info.in2p3_DeepCoadd_overlap overlap
WHERE
    rds.coadd_id = overlap.deepCoaddId;
{code}

{code}
+-----------+
| COUNT(id) |
+-----------+
|   9993373 |
+-----------+
1 row in set (2 min 46.84 sec)
{code}

### Recording the identifiers in the bookkeeping tables for both sites

NCSA:

{code:sql}
CREATE TABLE merge_info.ncsa_RunDeepSource_overlap (
    id bigint(20) NOT NULL,
    coadd_id bigint(20) NOT NULL,
    PRIMARY KEY (id)
) ENGINE=MyISAM;

INSERT INTO merge_info.ncsa_RunDeepSource_overlap
SELECT id, coadd_id
FROM
    gapon_SDRP_Stripe82.RunDeepSource rds,
    merge_info.ncsa_DeepCoadd_overlap overlap
WHERE
    rds.coadd_id = overlap.deepCoaddId;
{code}

{code}
Query OK, 9991529 rows affected (25.07 sec)
Records: 9991529  Duplicates: 0  Warnings: 0
{code}

IN2P3:

{code:sql}
CREATE TABLE merge_info.in2p3_RunDeepSource_overlap (
    id bigint(20) NOT NULL,
    coadd_id bigint(20) NOT NULL,
    PRIMARY KEY (id)
) ENGINE=MyISAM;

INSERT INTO merge_info.in2p3_RunDeepSource_overlap
SELECT id, coadd_id
FROM
    lsst_prod_DC_2013_2.RunDeepSource rds,
    merge_info.in2p3_DeepCoadd_overlap overlap
WHERE
    rds.coadd_id = overlap.deepCoaddId;
{code}

{code}
Query OK, 9993373 rows affected (32.20 sec)
Records: 9993373  Duplicates: 0  Warnings: 0
{code}
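The bookkeeping insert is a plain INSERT ... SELECT over an implicit join; a toy SQLite reproduction (names mirror the ticket, data values are invented):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE RunDeepSource (id INTEGER PRIMARY KEY, coadd_id INTEGER);
CREATE TABLE DeepCoadd_overlap (deepCoaddId INTEGER PRIMARY KEY);
CREATE TABLE RunDeepSource_overlap (id INTEGER PRIMARY KEY, coadd_id INTEGER);
INSERT INTO RunDeepSource VALUES (10, 1), (11, 2), (12, 2), (13, 3);
INSERT INTO DeepCoadd_overlap VALUES (2);
""")

# Record (id, coadd_id) of every source whose coadd is in the overlap table,
# using the same implicit-join form as the bookkeeping INSERT above.
db.execute("""
INSERT INTO RunDeepSource_overlap
SELECT id, coadd_id
FROM RunDeepSource rds, DeepCoadd_overlap overlap
WHERE rds.coadd_id = overlap.deepCoaddId
""")

ids = [r[0] for r in db.execute("SELECT id FROM RunDeepSource_overlap ORDER BY id")]
print(ids)  # [11, 12]
```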

### Observations

• the numbers of entries in the two tables are almost identical
• the number of rows in each table is about 9993373 (~10^7)
• the table data size is 169887341 bytes (170 MB)
• the index size is 144075776 bytes (144 MB)

It means that with a proper key buffer size (as mentioned at the start of this comment) the whole table index will fit into memory, and it can be used for eliminating overlaps from the largest tables, RunDeepForcedSource.
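A quick sanity check of that claim, using the index size from the observations above and the key_buffer_size configured earlier in this comment:

```python
# The bookkeeping index (~144 MB) is roughly 30x smaller than the 4 GiB
# key buffer, so it fits in memory entirely.
index_bytes = 144075776           # index size reported above
key_buffer_bytes = 4 * 1024**3    # key_buffer_size = 4294967296
print(index_bytes < key_buffer_bytes, key_buffer_bytes // index_bytes)
```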

{code:sql}
CREATE TABLE lsst_prod_DC_2013_2.DeepCoadd_dedup LIKE lsst_prod_DC_2013_2.DeepCoadd;
ALTER TABLE lsst_prod_DC_2013_2.DeepCoadd_dedup DISABLE KEYS;

CREATE TABLE lsst_prod_DC_2013_2.DeepCoadd_Metadata_dedup LIKE lsst_prod_DC_2013_2.DeepCoadd_Metadata;
ALTER TABLE lsst_prod_DC_2013_2.DeepCoadd_Metadata_dedup DISABLE KEYS;

CREATE TABLE lsst_prod_DC_2013_2.DeepCoadd_To_Htm10_dedup LIKE lsst_prod_DC_2013_2.DeepCoadd_To_Htm10
{code}