Data Management / DM-6905

Locate the test dataset for PDAC

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: Qserv
    • Labels:

      Description

      Locate and evaluate the SDSS Stripe 82 dataset that will be used for testing the prototype DAC (PDAC).

            Activity

            gapon Igor Gaponenko added a comment -

            Primary sources of information for the study

              • LSST Data Challenge Report: the document published in September 2013 with the summary of the Data Challenge
              • A collection of DC planning documents
              • Technical Instructions from Yusra AlSayyad on the data processing and database ingestion. The document provides a detailed explanation of the specific steps taken during the DC
              • Additional instructions from Yusra AlSayyad on the final stage of ingesting and de-duping forced sources

            Experts contacted

              • NCSA production: Greg Daues, Yusra AlSayyad, Serge Monkewitz, Kian-Tat Lim
              • IN2P3 production: Dominique Boutigny, Bogdan Vilpesku (IN2P3)

            gapon Igor Gaponenko added a comment -

            Data runs imported from SDSS

            The data of 303 Stripe 82 runs were imported into NCSA and placed at:

            % du -hs /lsst7/stripe82/dr7/runs/
            11T /lsst7/stripe82/dr7/runs/
             
            % ls -al /lsst7/stripe82/dr7/runs/
            total 1112732
            drwxrwsr-x 309 mjuric   lsst          4096 Dec  9  2015 .
            drwxrwsr-x   7 mjuric   lsst          4096 Nov  2  2012 ..
            drwxrwsr-x   3 mjuric   mjuric        4096 Mar  4  2012 100006
            drwxrwsr-x   3 mjuric   mjuric        4096 Mar  3  2012 1033
            drwxrwsr-x   3 mjuric   mjuric        4096 Mar  3  2012 1040
            drwxrwsr-x   3 mjuric   mjuric        4096 Mar  3  2012 1056
            ..
            

            The raw dataset also came with a registry in the form of an SQLite3 database. The registry was inspected to make sure the dataset is complete:

            % sqlite3 /lsst7/stripe82/dr7/runs/registry.final.sqlite3
             
            sqlite> .header ON
            sqlite> .databases
            seq  name             file
            ---  ---------------  ----------------------------------------------------------
            0    main             /lsst7/stripe82/dr7/runs/registry.final.sqlite3
             
            sqlite> .tables
            raw          raw_skyTile
             
            sqlite> SELECT * FROM raw LIMIT 1;
            id|run|rerun|filter|camcol|field|taiObs|strip|band|frame
            1|1033|40|r|2|229|1999-10-13T06:16:56.190000000|82N       |r|229
            

            Runs:

            sqlite> SELECT COUNT(DISTINCT run) FROM raw;
            COUNT(DISTINCT run)
            303
             
            sqlite> SELECT DISTINCT run from raw;
            run
            94
            125
            1033
            1040
            1056
            1752
            1755
            1894
            ...
            7188
            7195
            7199
            7202
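
            The run list in the registry can also be cross-checked against the run directories on disk. The following is a minimal sketch rather than the procedure used in this study: the function name compare_runs is hypothetical, and it assumes that each run directory is named after its decimal run number and that the registry keeps the raw table with a run column as shown above.

```python
import os
import sqlite3


def compare_runs(runs_dir, registry_path):
    """Return (on_disk_only, in_registry_only) as two sorted lists of run numbers."""
    # Assumption: every run directory is named with its decimal run number.
    on_disk = {
        int(entry.name)
        for entry in os.scandir(runs_dir)
        if entry.is_dir() and entry.name.isdigit()
    }
    # Same query as in the session above, issued through Python's sqlite3 module.
    conn = sqlite3.connect(registry_path)
    try:
        rows = conn.execute("SELECT DISTINCT run FROM raw")
        in_registry = {int(row[0]) for row in rows}
    finally:
        conn.close()
    return sorted(on_disk - in_registry), sorted(in_registry - on_disk)
```

            Called with the runs folder and the registry file quoted above, two empty lists would mean that the directories and the registry agree. Any run number returned in either list would need a closer look.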
            

            Images in 5 spectral bands:

            sqlite> SELECT COUNT(*) FROM raw;
            COUNT(*)
            3681013
             
            sqlite>  SELECT filter, COUNT(*)  FROM raw GROUP BY filter;
            filter|COUNT(*)
            g|736203
            i|736202
            r|736203
            u|736203
            z|736202
            

            CONCLUSION: This is exactly the dataset of interest: it has the 303 runs and nearly 4 million images mentioned in the 2013 LSST Data Challenge Report.
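
            The registry checks above can be collected into one small helper. This is a sketch under stated assumptions, not part of the original evaluation: the name summarize_registry and the "filter_spread" measure are illustrative, and only the raw table with its run and filter columns is taken from the session above.

```python
import sqlite3


def summarize_registry(conn):
    """Collect the counts shown above from an open registry connection."""
    n_runs = conn.execute("SELECT COUNT(DISTINCT run) FROM raw").fetchone()[0]
    n_images = conn.execute("SELECT COUNT(*) FROM raw").fetchone()[0]
    # Each result row is a (filter, count) pair, so dict() can consume the cursor.
    per_filter = dict(conn.execute("SELECT filter, COUNT(*) FROM raw GROUP BY filter"))
    # A large spread between filters would hint at images missing in some band.
    spread = max(per_filter.values()) - min(per_filter.values()) if per_filter else 0
    return {
        "runs": n_runs,
        "images": n_images,
        "per_filter": per_filter,
        "filter_spread": spread,
    }
```

            For the counts quoted above, the per-filter numbers add up to the total of 3681013 rows, and the spread between the largest and the smallest filter is a single image (736203 versus 736202).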

            gapon Igor Gaponenko added a comment - - edited

            Database catalogs discovered during the search at NCSA

            According to the IN2P3 contacts involved in the DC, no relevant database is expected to exist at IN2P3. All relevant catalogs are reportedly located at NCSA on the LSST MySQL server:

            lsst-db.ncsa.illinois.edu
            

            The following candidate databases have been identified so far. They still need to be evaluated to see how complete they are:

            daues_SDRP_Stripe82_ncsa
            daues_S13_Stripe82_2000
            daues_SDRP_dedupe_byfilter_0
            daues_SDRP_dedupe_byfilter_1
            daues_SDRP_dedupe_byfilter_2
            daues_SDRP_dedupe_byfilter_3
            daues_SDRP_dedupe_byfilter_4
            

            NOTE: it is not yet clear whether these catalogs correspond to the NCSA-only half of the Stripe 82 dataset or also include the other half processed at IN2P3.

            gapon Igor Gaponenko added a comment - - edited

            Processed datasets imported from IN2P3 into NCSA

            The following data resulting from the DC processing in IN2P3 were discovered at NCSA:

            % ls -al /lsst2/yusra/SDRP-IN2P3
            total 28
            drwxr-xr-x   7 yusra yusra 4096 Jun 21 04:56 .
            drwxrwxr-x   3 yusra yusra 4096 Feb 26  2015 ..
            drwxr-xr-x   3 yusra yusra 4096 Jun 21 14:25 coadd
            drwxr-xr-x   5 yusra yusra 4096 Jul 30  2015 coadd_dir
            drwxr-xr-x   3 yusra yusra 4096 Jun  5 00:46 coadd_image_dir
            drwxr-xr-x   7 yusra yusra 4096 Aug  3  2015 forcedPhot
            drwxr-xr-x 117 yusra yusra 4096 Aug  8  2015 ingestProcessed_csv
             
            % du -hs /lsst2/yusra/SDRP-IN2P3/
            1.7T    /lsst2/yusra/SDRP-IN2P3/
            

            This folder supposedly contains deep sources and deep forced sources. It is possible, though it still needs to be verified, that these files could be used to generate the corresponding database catalogs.

            gapon Igor Gaponenko added a comment -

            Processed dataset at NCSA

            The processed results of the first (NCSA) half of Stripe 82 are most likely located at:

            % ls -al /lsst2/daues/SDRP/
            total 56
            drwxr-xr-x  11 daues ac 4096 Jul 22 15:49 .
            drwxr-xr-x  11 daues ac 4096 Oct 22  2014 ..
            lrwxrwxrwx   1 daues ac   33 Sep 12  2013 calexp_dir -> /lsst8/daues/SDRP/data/calexp_dir
            drwxr-xr-x   4 daues ac 4096 Dec 11  2013 coadd_csv_dir
            lrwxrwxrwx   1 daues ac   34 Sep 12  2013 coadd_g_dir -> /lsst7/daues/SDRP/data/coadd_g_dir
            lrwxrwxrwx   1 daues ac   34 Sep 12  2013 coadd_i_dir -> /lsst7/daues/SDRP/data/coadd_i_dir
            lrwxrwxrwx   1 daues ac   34 Sep 12  2013 coadd_r_dir -> /lsst7/daues/SDRP/data/coadd_r_dir
            lrwxrwxrwx   1 daues ac   34 Sep 12  2013 coadd_u_dir -> /lsst7/daues/SDRP/data/coadd_u_dir
            lrwxrwxrwx   1 daues ac   34 Sep 12  2013 coadd_z_dir -> /lsst7/daues/SDRP/data/coadd_z_dir
            -rw-r--r--   1 daues ac  139 Sep  3  2013 forcedPhotConfig.py
            lrwxrwxrwx   1 daues ac   41 Sep 12  2013 forcedPhot_csv_dir -> /lsst8/daues/SDRP/data/forcedPhot_csv_dir
            drwxr-xr-x   5 daues ac 4096 Sep  3  2013 forcedPhot_dir
            drwxr-xr-x   2 daues ac 4096 Nov 21  2013 ingestCoadd_g_csv_dir
            drwxr-xr-x   2 daues ac 4096 Sep  3  2013 ingestCoadd_i_csv_dir
            drwxr-xr-x   2 daues ac 4096 Sep  3  2013 ingestCoadd_r_csv_dir
            drwxr-xr-x   2 daues ac 4096 Sep  3  2013 ingestCoadd_u_csv_dir
            drwxr-xr-x   2 daues ac 4096 Sep  3  2013 ingestCoadd_z_csv_dir
            drwxr-xr-x   3 daues ac 4096 Jul 22 15:52 ingestForcedSources
            drwxr-xr-x 209 daues ac 4096 Sep  3  2013 ingestProcessed_csv_dir
            -rw-r--r--   1 daues ac  250 Sep  3  2013 processConfig.py
            -rw-r--r--   1 daues ac  635 Sep 30  2013 READ_IngestResults
            

            NOTE: these data need to be evaluated to see whether they are complete and whether they could be used to generate complete MySQL catalogs. Also note the following file in that folder:

            % cat /lsst2/daues/SDRP/READ_IngestResults
             
            mysql> USE  daues_SDRP_dedupe_byfilter_0;
            mysql> SELECT COUNT(*) FROM  RunDeepForcedSource;
            | 1729599791 |
            mysql> USE  daues_SDRP_dedupe_byfilter_1;
            | 1752805399 |
            mysql> USE  daues_SDRP_dedupe_byfilter_2;
            | 1754290686 |
            mysql> USE  daues_SDRP_dedupe_byfilter_3;
            | 1754272841 |
            mysql> USE  daues_SDRP_dedupe_byfilter_4;
            | 1751879670 |
             
            | Database      | SELECT COUNT(*) FROM  RunDeepForcedSource |
             
            | daues_SDRP_dedupe_byfilter_0 | 1729599791 |
            | daues_SDRP_dedupe_byfilter_1 | 1752805399 |
            | daues_SDRP_dedupe_byfilter_2 | 1754290686 |
            | daues_SDRP_dedupe_byfilter_3 | 1754272841 |
            | daues_SDRP_dedupe_byfilter_4 | 1751879670 |
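
            To compare the five RunDeepForcedSource counts, the summary table from READ_IngestResults can be parsed and totalled. The following is only an illustrative sketch: the helper name parse_counts and the regular expression are assumptions, and the table text is copied from the file as quoted above.

```python
import re

# Summary table copied from READ_IngestResults as quoted above.
SUMMARY = """\
| daues_SDRP_dedupe_byfilter_0 | 1729599791 |
| daues_SDRP_dedupe_byfilter_1 | 1752805399 |
| daues_SDRP_dedupe_byfilter_2 | 1754290686 |
| daues_SDRP_dedupe_byfilter_3 | 1754272841 |
| daues_SDRP_dedupe_byfilter_4 | 1751879670 |
"""

# Matches lines of the form "| name | number |" and nothing else.
ROW = re.compile(r"^\|\s*(\w+)\s*\|\s*(\d+)\s*\|$")


def parse_counts(text):
    """Map database name -> row count for every '| name | number |' line."""
    counts = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


counts = parse_counts(SUMMARY)
total = sum(counts.values())
largest_gap = max(counts.values()) - min(counts.values())
print(f"{len(counts)} databases, {total} rows in total, largest gap {largest_gap}")
```

            The same helper could be applied to counts obtained from the live databases, which would show whether the server still holds what the file recorded in 2013.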
            

            Another interesting observation is the number of files under:

            % ls -al /lsst2/daues/SDRP/calexp_dir/processCcd_metadata/*/*/* | wc -l
            1457614
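
            Note that ls -al applied to a glob prints extra header and "total" lines whenever the glob matches directories, so the line count above may differ from the exact number of files. A count that does not depend on the listing format could be sketched as follows. The function name count_files is hypothetical, and the sketch assumes that every non-directory entry below the root should be counted.

```python
import os


def count_files(root):
    """Count non-directory entries anywhere below root.

    os.walk does not follow symlinks to directories found below root by
    default, so each such link is counted as one entry instead of being descended into.
    """
    total = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        total += len(filenames)
        # Symlinked directories are listed in dirnames but never visited.
        total += sum(
            1 for name in dirnames if os.path.islink(os.path.join(_dirpath, name))
        )
    return total
```

            Pointed at the processCcd_metadata folder quoted above, this would give a number that can be compared with the 1457614 lines reported by wc -l.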
            

            gapon Igor Gaponenko added a comment -

            The task has been completed. Candidate datasets and databases have been identified and reported in the comments of this story. The discovered data still need to be evaluated for completeness.


              People

              Assignee:
              gapon Igor Gaponenko
              Reporter:
              gapon Igor Gaponenko
              Watchers:
              Igor Gaponenko