  1. Data Management
  2. DM-34483

Butler Registry to ObsCore exporter


Details

    • Ops Middleware
    • No

    Description

      The ObsTAP service requires that we have a database populated with ObsCore entries.

      We need something that can take a Registry (possibly with a dataset query constraint) and exports it to, say, a parquet file that can be ingested into the ObsCore database. Some values can come from configuration (such as the telescope location and how to map a dataset type to a processing level).

      Setting up the ObsCore database and loading these files into it can be done as a separate ticket.

      A proof-of-concept was written at https://github.com/lsst-dm/dax_obscore which attempts to show that this can be done. I don't know whether any of that code can be used in the final system, but it seems reasonable to keep using the package itself, dax_obscore.

      The code in dax_obscore has a problem in that it calls butler.get, which it should never do: the sky region for an exposure must come from the visit record, as should the exposure time. calexp is visit-based, so that is fine, but for raw there might be a complication in getting the "footprint" in the absence of a direct link from an exposure record to a visit record.

      I envisage a butler plugin click command that can do the equivalent of query-datasets and dump the results as a parquet table. Obviously we may also want a bespoke exporter, distinct from the command-line tooling, for doing a full export.
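      As a rough illustration of the intended flow (all names below are invented stand-ins, not real Butler APIs): query dataset refs, map each one to an ObsCore row via static configuration, and write the rows out as a table.

```python
import csv
import io

# Hypothetical stand-in for a Registry dataset query; in the real system
# the refs would come from something like Registry.queryDatasets(...).
def query_datasets(dataset_type, collections):
    return [
        {"id": "uuid-1", "dataset_type": dataset_type, "ra": 216.1, "dec": 2.0},
    ]

# Values such as calib_level come from configuration, keyed by dataset type.
def to_obscore_row(ref, config):
    return {
        "dataproduct_type": config["dataproduct_type"],
        "calib_level": config["calib_level"],
        "s_ra": ref["ra"],
        "s_dec": ref["dec"],
    }

config = {"dataproduct_type": "image", "calib_level": 2}  # e.g. for calexp
rows = [to_obscore_row(r, config) for r in query_datasets("calexp", ["some/run"])]

# A real exporter would write parquet (e.g. via pyarrow); stdlib CSV keeps
# this sketch dependency-free.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
```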

      gpdf, how are you envisaging we use the visit concept in ObsCore? In theory we don't need it at all and can take the naive approach of treating every observation as an exposure. How many dataset types is the service meant to expose? raw, calexp, and the tract/patch coadds?

      Attachments

        Issue Links

          Activity

            tjenness Tim Jenness added a comment -

            Looks like gpdf did an analysis of the mapping of registry to ObsCore here: https://confluence.lsstcorp.org/display/~gpdf/Satisfying+ObsCore+from+the+Gen3+Butler+Schema

            tjenness Tim Jenness added a comment -

            Moving this to ops team because the work is driven entirely by DP0.2 timeline.


            salnikov Andy Salnikov added a comment -

            [I'm very new to the whole ObsCore/ObsTAP thing, please correct me if I say something disagreeable]

            I have been reading the ObsCore paper and Gregory's table, which gave me some insight; I will try to collect my thoughts here.

            Here is an "implementation phase 1" subset of the columns from Gregory's table:

            • dataproduct_type - this can probably be guessed from the storage type, e.g. "Exposure*" maps to "image" and "DataFrame" to "measurements"?
            • dataproduct_subtype - could it be the dataset type name (prefixed with "lsst.")?
            • calib_level - some sort of mapping is needed from dataset type name to level number.
            • obs_collection - this probably identifies a particular butler repository (or maybe a subset of it)?
            • obs_publisher_did - a unique URL for a dataset; I think Tim mentioned this should include the UUID. Should it also include other context (e.g. repo or obs_collection)?
            • access_url - this depends on the other service that is used to access data via this URL.
            • access_format - this can probably be mapped from the dataset type name?
            • s_ra, s_dec - Region has a bounding circle; we can use its center for these.
            • s_region - this needs STC-S encoding; I don't think Region knows how to do that. Most regions should be polygons.
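            For the s_region point above, a minimal sketch of producing an STC-S polygon string from plain (ra, dec) vertex pairs in degrees; extracting and converting vertices from an lsst.sphgeom Region is left aside here.

```python
def region_to_stcs(vertices):
    """Format (ra, dec) vertex pairs, in degrees, as an STC-S polygon string."""
    coords = " ".join(f"{ra:.6f} {dec:+.6f}" for ra, dec in vertices)
    return f"POLYGON ICRS {coords}"
```

            For a four-vertex region this yields strings of the same shape as "POLYGON ICRS 216.130297 +2.066402 ...".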

            A few general ideas I have so far:

            • Registries have datasets that don't need to be exposed to ObsTAP, so filtering based on dataset type name is probably the right thing.
            • I expect that registries may also have a bunch of data that is not useful (e.g. test runs); maybe we also want to export only a subset of collections.
            • Some sort of configuration is probably needed; command line options can't handle this complexity.
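            The filtering-plus-configuration idea could look something like this sketch (the dataset type names and values are illustrative only, not an agreed mapping):

```python
# Hypothetical static configuration: a single lookup table from dataset
# type name to the ObsCore attributes derivable from it.
OBSCORE_CONFIG = {
    "calexp": {
        "dataproduct_type": "image",
        "dataproduct_subtype": "lsst.calexp",
        "calib_level": 2,
    },
    "deepCoadd": {
        "dataproduct_type": "image",
        "dataproduct_subtype": "lsst.deepCoadd",
        "calib_level": 3,
    },
}

def should_export(dataset_type_name):
    # Filtering: only dataset types listed in the configuration are exported.
    return dataset_type_name in OBSCORE_CONFIG
```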
            tjenness Tim Jenness added a comment -

            obs_publisher_did: I think UUID is what rra is using in the cutout service. We decided that a run/datasetType/dataId tuple was more complicated than was necessary and the UUID is enough to find the relevant dataset. It also should be portable across different repositories that have the same data release.

            Controlling dataset types and collections for the export is going to be important. I assume it will work much like the query-datasets command line parameters.

            Configuration is definitely going to be needed.


            gpdf Gregory Dubois-Felsmann added a comment -

            Agreed on the need for configuration. Presumably we'd use our usual Python configuration tooling for that?

            I would like to centralize all the "static" mappings as based on dataset type name, including the one for dataproduct_type; I don't think there's any need to involve the storage type in the reasoning here. It'll be clearer if there's a single lookup table from dataset type to all the attributes derivable from it. (That includes more columns than are in my original Confluence page - for instance, s_resolution and o_ucd.) I will help you put together the mapping. For now: PVIs are calib_level 2; diffims and coadds are 3.

            Agreed that the extractor should take a positive list of dataset types for which extraction is to be performed: e.g., "calexp, deepCoadd_calexp".

            I would have to talk to tjenness about how to slice the repository by collection or other attribute. I think having it work on an enumerated list of collections, and if that's not supplied default to the whole repository, is likely to be OK, though.

            We very much need em_min and em_max populated from the filter ID.

            A critical column to get working for immediate prototyping purposes is s_region, because fritzm and company are going to have to develop the ability to ingest that into Qserv, turning it both into a string value the user can use in a SELECT and into the encoded form needed by scisql.

            We need two options for access_url: either (only for early testing and perhaps some ad-hoc usages I'll tell you about later) a URL that points straight to the image file artifact, or (what's really needed for production) a URL for a "links service" (see the DataLink standard if you are curious, but you don't need to know any real details about that in order to implement this). The latter is what I informally call the "CADC model", which we are adopting.

            The URL should be constructed based on a template provided in the configuration. Ideally the obs_creator_did should be the thing plugged in to the template, as this is relocatable to all sites (unlike the obs_publisher_did, which is meant to be unique for every site), but for DP0.2 we don't need to worry about that if it becomes an obstacle. I think, from what tjenness was saying above, that we are going to be using the UUID as the obs_creator_did?
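            The template-based construction might be as simple as the sketch below (the template URL is invented, and the identifier plugged in is assumed to be the dataset UUID):

```python
# Hypothetical configuration-supplied template for the links service URL.
LINKS_TEMPLATE = "https://data.example.org/api/datalink/links?ID={id}"

def make_access_url(template, identifier):
    # Fill the {id} placeholder with the dataset identifier (e.g. a UUID).
    return template.format(id=identifier)
```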

            The access_format is mandated to be "application/x-votable+xml;content=datalink" in the CADC model, and is the same for all dataset types.

            In the "direct link" model, you are correct, the access_format would be dataset-type-dependent, but for now always either "image/fits" or "application/fits" (the latter is always acceptable, so we can just default to it for DP0.2) for all the datasets we are likely to put in ObsCore.


            salnikov Andy Salnikov added a comment -

            tjenness, to find a region for an exposure, I guess I need to associate an exposure with a visit. Should it be done based on timespans or is there a better way?

            For em_min and em_max, I think I will need to read filter data from Butler, they are not in the dimension records.

            tjenness Tim Jenness added a comment -

            I am absolutely fine with reading the min and max em values for a filter from a static config file. Butler knows the physical_filter, and pipelines code no longer stores filter profiles in the Exposures (it uses FilterLabel). The relevant Instrument class could be told the filter widths, but it isn't at the moment. Use a config keyed by physical_filter for now and we'll worry about a cleverer scheme later on.
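            A table lookup keyed by physical_filter could be as simple as the sketch below; the filter name and wavelength numbers are placeholders, not real values.

```python
# Placeholder filter bounds in nanometres keyed by physical_filter; real
# numbers would come from a reviewed static config file.
FILTER_BOUNDS_NM = {
    "example_r": (550.0, 690.0),
}

def em_bounds_m(physical_filter):
    """Return (em_min, em_max) in metres, the units ObsCore expects."""
    lo_nm, hi_nm = FILTER_BOUNDS_NM[physical_filter]
    return lo_nm * 1e-9, hi_nm * 1e-9
```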

            The sky region for raws was worrying me. For a calexp and later products you are fine and can use the visit record directly. It looks like there is a visit_definition record that maps an exposure to the corresponding visit.


            gpdf Gregory Dubois-Felsmann added a comment -

            I think I'm going to have to spend a little extra time on working out the right way to do the calexp-to-raw (i.e., visit-to-exposure) connection in the DP0.2 deployment. I suppose it's definite that the plan is to include the raws in the data exposed? tjenness, let's plan to talk a bit about what obs_id should be for exposures vs. visits.

            For em_min and em_max, I 100% agree - just make it a table lookup from a config based on the filter ID. We can worry later on about how to populate that more directly from an authoritative source. It is very clear that no one should be using the ObsCore values for science-critical purposes, though.

            It's possible that this will be my only window of time to look at work-related things today. I do expect to be working a full day on Monday, though, and returning to at least 75% time from then on.

            tjenness Tim Jenness added a comment -

            Completely ignoring raw for the moment does not seem like a bad plan.

            calexp is well-defined and is what the VO services currently need to show they work.

            fritzm Fritz Mueller added a comment - - edited

            The regions in an existing sample of Kenny's old converted metadata appear to all be four-point spherical polygons, but I assume this need not always be the case? Is there a limit on the number of points? (Trying to figure out allocation for scisql binary polygon columns in Qserv...)

            tjenness Tim Jenness added a comment -

            STC-S places no limit on polygon definitions, but practically speaking the visit region is defined in this code: https://github.com/lsst/obs_base/blob/main/python/lsst/obs/base/defineVisits.py#L806-L823

            fritzm Fritz Mueller added a comment - - edited

            As a proof of concept for the ability to leverage scisql_s2PtInCPoly, I dumped Kenny's previous ObsCore data (~88K rows) from Postgres and imported it into a MariaDB instance. A scisql binary polygon column was then added via:

            ALTER TABLE obscore ADD s_region_scisql VARBINARY(128);
             
            UPDATE obscore SET s_region_scisql = scisql_s2CPolyToBin(
                cast(substring_index(substring_index(s_region, ' ', -8), ' ', 1) as double),
                cast(substring_index(substring_index(s_region, ' ', -7), ' ', 1) as double),
                cast(substring_index(substring_index(s_region, ' ', -6), ' ', 1) as double),
                cast(substring_index(substring_index(s_region, ' ', -5), ' ', 1) as double),
                cast(substring_index(substring_index(s_region, ' ', -4), ' ', 1) as double),
                cast(substring_index(substring_index(s_region, ' ', -3), ' ', 1) as double),
                cast(substring_index(substring_index(s_region, ' ', -2), ' ', 1) as double),
                cast(substring_index(substring_index(s_region, ' ', -1), ' ', 1) as double)
            );
            

            After this, we can, e.g.:

            MariaDB [imgserv]> select s_region from obscore where scisql_s2PtInCPoly(216, 2, s_region_scisql);
            +--------------------------------------------------------------------------------------------------+
            | s_region                                                                                         |
            +--------------------------------------------------------------------------------------------------+
            | POLYGON ICRS 216.130297 +2.066402 216.133659 +1.972385 215.945891 +1.977446 215.943826 +2.072778 |
            | POLYGON ICRS 216.130318 +2.066394 216.133680 +1.972377 215.945912 +1.977438 215.943847 +2.072770 |
            | POLYGON ICRS 216.096602 +2.018849 216.097539 +1.922714 215.901814 +1.923511 215.901870 +2.019844 |
            | POLYGON ICRS 216.096614 +2.018841 216.097552 +1.922706 215.901826 +1.923503 215.901883 +2.019836 |
            | POLYGON ICRS 216.130293 +2.066413 216.133655 +1.972396 215.945887 +1.977457 215.943822 +2.072789 |
            | POLYGON ICRS 216.134442 +2.067605 216.137666 +1.977445 215.952147 +1.982629 215.950160 +2.074052 |
            | POLYGON ICRS 216.135765 +2.067434 216.138961 +1.977278 215.953429 +1.982520 215.951479 +2.073942 |
            | POLYGON ICRS 216.098774 +2.020408 216.099655 +1.928171 215.906041 +1.929070 215.906094 +2.021492 |
            | POLYGON ICRS 216.135666 +2.067753 216.138867 +1.977591 215.953328 +1.982811 215.951376 +2.074240 |
            | POLYGON ICRS 216.099985 +2.020794 216.100866 +1.928568 215.907266 +1.929489 215.907336 +2.021898 |
            +--------------------------------------------------------------------------------------------------+
            10 rows in set (0.191 sec)
            

            That looks workable at a basic level, but as it stands it entails a full table scan. I will be investigating indexing strategies next...

            fritzm Fritz Mueller added a comment - - edited

            Okay, MariaDB has spatial indices on its own GIS geometry types, but they are not spherical. We can maybe get around that by storing a projected set of bounds in each row, then building a spatial index on that column and using it for a coarse cut that is further refined by the exact scisql clause above. A lat/lon bounding box and a cylindrical projection are probably as simple as it gets, and likely a good match for MariaDB's r-tree spatial index. So:

            DELIMITER $$
            CREATE OR REPLACE FUNCTION bounds_from_region(s_region VARCHAR(512)) RETURNS POLYGON
            BEGIN
                DECLARE lon1, lat1, lon2, lat2, lon3, lat3, lon4, lat4 VARCHAR(32);
                SET lon1 = substring_index(substring_index(s_region, ' ', -8), ' ', 1);
                SET lat1 = substring_index(substring_index(s_region, ' ', -7), ' ', 1);
                SET lon2 = substring_index(substring_index(s_region, ' ', -6), ' ', 1);
                SET lat2 = substring_index(substring_index(s_region, ' ', -5), ' ', 1);
                SET lon3 = substring_index(substring_index(s_region, ' ', -4), ' ', 1);
                SET lat3 = substring_index(substring_index(s_region, ' ', -3), ' ', 1);
                SET lon4 = substring_index(substring_index(s_region, ' ', -2), ' ', 1);
                SET lat4 = substring_index(substring_index(s_region, ' ', -1), ' ', 1);
                RETURN ENVELOPE(POLYGON(LINESTRING(
                    POINT(lon1, lat1), POINT(lon2, lat2), POINT(lon3, lat3), POINT(lon4, lat4), POINT(lon1, lat1)
                )));
            END $$
            DELIMITER ;
             
            ALTER TABLE obscore ADD s_region_bounds POLYGON NOT NULL;
            UPDATE obscore SET s_region_bounds = bounds_from_region(s_region);
            ALTER TABLE obscore ADD SPATIAL INDEX(s_region_bounds);
            

            As a first cut it does seem to work. The query planner reports:

            MariaDB [imgserv]> explain select s_region from obscore where scisql_s2PtInCPoly(216, 2, s_region_scisql);
            +------+-------------+---------+------+---------------+------+---------+------+-------+-------------+
            | id   | select_type | table   | type | possible_keys | key  | key_len | ref  | rows  | Extra       |
            +------+-------------+---------+------+---------------+------+---------+------+-------+-------------+
            |    1 | SIMPLE      | obscore | ALL  | NULL          | NULL | NULL    | NULL | 88134 | Using where |
            +------+-------------+---------+------+---------------+------+---------+------+-------+-------------+
            1 row in set (0.001 sec)
             
            MariaDB [imgserv]> explain select s_region from obscore where contains(s_region_bounds, point(216, 2)) and scisql_s2PtInCPoly(216, 2, s_region_scisql);
            +------+-------------+---------+-------+-----------------+-----------------+---------+------+------+-------------+
            | id   | select_type | table   | type  | possible_keys   | key             | key_len | ref  | rows | Extra       |
            +------+-------------+---------+-------+-----------------+-----------------+---------+------+------+-------------+
            |    1 | SIMPLE      | obscore | range | s_region_bounds | s_region_bounds | 34      | NULL | 1    | Using where |
            +------+-------------+---------+-------+-----------------+-----------------+---------+------+------+-------------+
            1 row in set (0.001 sec)
            

            and we now have:

            MariaDB [imgserv]> select s_region from obscore where contains(s_region_bounds, point(216, 2)) and scisql_s2PtInCPoly(216, 2, s_region_scisql);
            +--------------------------------------------------------------------------------------------------+
            | s_region                                                                                         |
            +--------------------------------------------------------------------------------------------------+
            | POLYGON ICRS 216.135666 +2.067753 216.138867 +1.977591 215.953328 +1.982811 215.951376 +2.074240 |
            | POLYGON ICRS 216.099985 +2.020794 216.100866 +1.928568 215.907266 +1.929489 215.907336 +2.021898 |
            | POLYGON ICRS 216.130297 +2.066402 216.133659 +1.972385 215.945891 +1.977446 215.943826 +2.072778 |
            | POLYGON ICRS 216.130318 +2.066394 216.133680 +1.972377 215.945912 +1.977438 215.943847 +2.072770 |
            | POLYGON ICRS 216.096602 +2.018849 216.097539 +1.922714 215.901814 +1.923511 215.901870 +2.019844 |
            | POLYGON ICRS 216.096614 +2.018841 216.097552 +1.922706 215.901826 +1.923503 215.901883 +2.019836 |
            | POLYGON ICRS 216.130293 +2.066413 216.133655 +1.972396 215.945887 +1.977457 215.943822 +2.072789 |
            | POLYGON ICRS 216.134442 +2.067605 216.137666 +1.977445 215.952147 +1.982629 215.950160 +2.074052 |
            | POLYGON ICRS 216.135765 +2.067434 216.138961 +1.977278 215.953429 +1.982520 215.951479 +2.073942 |
            | POLYGON ICRS 216.098774 +2.020408 216.099655 +1.928171 215.906041 +1.929070 215.906094 +2.021492 |
            +--------------------------------------------------------------------------------------------------+
            10 rows in set (0.001 sec)
            

            so, a speedup of ~200x for that query. This is probably good enough to test next at scale.
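            For clarity, the bounds computation the UDF above performs is equivalent to this Python sketch (same naive cylindrical treatment; polygons crossing RA = 0 would need special handling, which both the UDF and this sketch ignore):

```python
def bounding_box(vertices):
    """Axis-aligned lon/lat bounding box of polygon vertices in degrees.

    Returns (min_lon, min_lat, max_lon, max_lat), i.e. the envelope that
    the SQL UDF stores as a MariaDB POLYGON for the r-tree index.
    """
    lons = [lon for lon, _ in vertices]
    lats = [lat for _, lat in vertices]
    return min(lons), min(lats), max(lons), max(lats)
```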

            fritzm Fritz Mueller added a comment - - edited Okay, MariaDB has spatial indices on its own GIS geometry types, but they are not spherical. We can maybe get around that by storing a projected set of bounds in each row, then building a spatial index on that column and using it for a cut further refined by the exact scisql clause above. A lat/lon bounding box and a cylindrical projection are probably as simple as it could be, and likely a good match for MariaDB's r-tree spatial index. So: DELIMITER $$ CREATE OR REPLACE FUNCTION bounds_from_region(s_region VARCHAR(512)) RETURNS POLYGON BEGIN DECLARE lon1, lat1, lon2, lat2, lon3, lat3, lon4, lat4 VARCHAR(32); SET lon1 = substring_index(substring_index(s_region, ' ', -8), ' ', 1); SET lat1 = substring_index(substring_index(s_region, ' ', -7), ' ', 1); SET lon2 = substring_index(substring_index(s_region, ' ', -6), ' ', 1); SET lat2 = substring_index(substring_index(s_region, ' ', -5), ' ', 1); SET lon3 = substring_index(substring_index(s_region, ' ', -4), ' ', 1); SET lat3 = substring_index(substring_index(s_region, ' ', -3), ' ', 1); SET lon4 = substring_index(substring_index(s_region, ' ', -2), ' ', 1); SET lat4 = substring_index(substring_index(s_region, ' ', -1), ' ', 1); RETURN ENVELOPE(POLYGON(LINESTRING( POINT(lon1, lat1), POINT(lon2, lat2), POINT(lon3, lat3), POINT(lon4, lat4), POINT(lon1, lat1) ))); END $$ DELIMITER ;   ALTER TABLE obscore ADD s_region_bounds POLYGON NOT NULL; UPDATE obscore SET s_region_bounds = bounds_from_region(s_region); ALTER TABLE obscore ADD SPATIAL INDEX(s_region_bounds); As a first cut it does seem to work. 
Query planner reports:

             MariaDB [imgserv]> explain select s_region from obscore where scisql_s2PtInCPoly(216, 2, s_region_scisql);
             +------+-------------+---------+------+---------------+------+---------+------+-------+-------------+
             | id   | select_type | table   | type | possible_keys | key  | key_len | ref  | rows  | Extra       |
             +------+-------------+---------+------+---------------+------+---------+------+-------+-------------+
             |    1 | SIMPLE      | obscore | ALL  | NULL          | NULL | NULL    | NULL | 88134 | Using where |
             +------+-------------+---------+------+---------------+------+---------+------+-------+-------------+
             1 row in set (0.001 sec)

             MariaDB [imgserv]> explain select s_region from obscore where contains(s_region_bounds, point(216, 2)) and scisql_s2PtInCPoly(216, 2, s_region_scisql);
             +------+-------------+---------+-------+-----------------+-----------------+---------+------+------+-------------+
             | id   | select_type | table   | type  | possible_keys   | key             | key_len | ref  | rows | Extra       |
             +------+-------------+---------+-------+-----------------+-----------------+---------+------+------+-------------+
             |    1 | SIMPLE      | obscore | range | s_region_bounds | s_region_bounds | 34      | NULL |    1 | Using where |
             +------+-------------+---------+-------+-----------------+-----------------+---------+------+------+-------------+
             1 row in set (0.001 sec)

and we now have:

             MariaDB [imgserv]> select s_region from obscore where contains(s_region_bounds, point(216, 2)) and scisql_s2PtInCPoly(216, 2, s_region_scisql);
             +--------------------------------------------------------------------------------------------------+
             | s_region                                                                                         |
             +--------------------------------------------------------------------------------------------------+
             | POLYGON ICRS 216.135666 +2.067753 216.138867 +1.977591 215.953328 +1.982811 215.951376 +2.074240 |
             | POLYGON ICRS 216.099985 +2.020794 216.100866 +1.928568 215.907266 +1.929489 215.907336 +2.021898 |
             | POLYGON ICRS 216.130297 +2.066402 216.133659 +1.972385 215.945891 +1.977446 215.943826 +2.072778 |
             | POLYGON ICRS 216.130318 +2.066394 216.133680 +1.972377 215.945912 +1.977438 215.943847 +2.072770 |
             | POLYGON ICRS 216.096602 +2.018849 216.097539 +1.922714 215.901814 +1.923511 215.901870 +2.019844 |
             | POLYGON ICRS 216.096614 +2.018841 216.097552 +1.922706 215.901826 +1.923503 215.901883 +2.019836 |
             | POLYGON ICRS 216.130293 +2.066413 216.133655 +1.972396 215.945887 +1.977457 215.943822 +2.072789 |
             | POLYGON ICRS 216.134442 +2.067605 216.137666 +1.977445 215.952147 +1.982629 215.950160 +2.074052 |
             | POLYGON ICRS 216.135765 +2.067434 216.138961 +1.977278 215.953429 +1.982520 215.951479 +2.073942 |
             | POLYGON ICRS 216.098774 +2.020408 216.099655 +1.928171 215.906041 +1.929070 215.906094 +2.021492 |
             +--------------------------------------------------------------------------------------------------+
             10 rows in set (0.001 sec)

so, a speedup of ~200x for that query. This is probably good enough to test next at scale.
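The two-stage filter above is a standard pattern: a cheap, indexable bounding-box test (`contains(s_region_bounds, point(...))`) prunes nearly all rows, and the exact `scisql_s2PtInCPoly` polygon test runs only on the survivors. A minimal sketch of how an `s_region_bounds` box could be derived from an ObsCore `s_region` string and used as the coarse prefilter (function names are illustrative, not from the actual loader; the naive box assumes no RA wrap or pole crossing):

```python
def bounding_box(s_region: str):
    """Compute (lon_min, lat_min, lon_max, lat_max) from an ObsCore
    s_region string like 'POLYGON ICRS 216.13 +2.06 216.13 +1.97 ...'.
    Naive: assumes the polygon does not wrap in RA or cross a pole."""
    coords = [float(t) for t in s_region.split()[2:]]  # skip 'POLYGON ICRS'
    lons, lats = coords[0::2], coords[1::2]
    return (min(lons), min(lats), max(lons), max(lats))


def bbox_contains(bbox, lon, lat):
    """Coarse prefilter: point-in-bounding-box test, playing the role of
    contains(s_region_bounds, point(...)) in the MariaDB query above."""
    lon_min, lat_min, lon_max, lat_max = bbox
    return lon_min <= lon <= lon_max and lat_min <= lat <= lat_max


region = ("POLYGON ICRS 216.135666 +2.067753 216.138867 +1.977591 "
          "215.953328 +1.982811 215.951376 +2.074240")
box = bounding_box(region)
print(bbox_contains(box, 216.0, 2.0))  # True: point lies inside the box
```

Rows passing this cheap test would still be handed to the exact polygon test, exactly as in the combined WHERE clause above.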
            salnikov Andy Salnikov added a comment - - edited

            I have a preliminary implementation of the butler obscore-export command on the tickets/DM-34483 branch. There are obviously a few things missing for now:

            • access_url is currently taken directly from the datastore; to make a DataLink URL I need to know how that is constructed
            • the exposure region is determined from the visit region, assuming that the exposure group_id is the same as the visit ID; I do not know if this is guaranteed to always be true
            • many things come from configuration, some per dataset type (see config/example.yaml for an example)
            • the obs_id format (template) can be specified in configuration; I am not sure if that is going to be sufficient for all cases
            • obs_publisher_did is not filled yet; I need some specific input on how to construct it (I can probably specify a per-dataset-type format string for it, e.g. "ivoa://{obs_collection}/{obs_id}" or something like it)
            • probably a few other things
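The template-driven fields in the list above (obs_id_fmt, and a possible obs_publisher_did format string) amount to ordinary str.format expansion over the data ID. A minimal sketch, with the collection name and helper names purely illustrative, echoing the "ivoa://{obs_collection}/{obs_id}" example above:

```python
# Hypothetical per-dataset-type configuration, echoing the examples above.
config = {
    "obs_collection": "LSST.DP02",  # illustrative collection name
    "obs_id_fmt": "{exposure}",
    "obs_publisher_did_fmt": "ivoa://{obs_collection}/{obs_id}",
}


def make_ids(data_id: dict, cfg: dict) -> tuple:
    """Expand obs_id from the data ID, then obs_publisher_did from obs_id."""
    obs_id = cfg["obs_id_fmt"].format(**data_id)
    did = cfg["obs_publisher_did_fmt"].format(
        obs_collection=cfg["obs_collection"], obs_id=obs_id
    )
    return obs_id, did


print(make_ids({"exposure": 903334}, config))
# -> ('903334', 'ivoa://LSST.DP02/903334')
```

Whatever the final obs_publisher_did scheme is, this shows why a per-dataset-type format string is sufficient as long as all the referenced fields are available from the data ID or the configuration.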

            I used that example config to run the tool on /repo/main to produce example parquet and CSV files; the output is on lsst-devl:

            • /project/salnikov/obscore/data.csv.gz
            • /project/salnikov/obscore/data.parquet

            If you want to play with it, you need dax_obscore and daf_butler from the ticket branch; the command I used to make the parquet file:

            butler obscore-export -c $DAX_OBSCORE_DIR/configs/example.yaml /repo/main data.parquet
            

            tjenness Tim Jenness added a comment -

            the exposure region is determined from the visit region, assuming that the exposure group_id is the same as the visit ID; I do not know if this is guaranteed to always be true

            This is not going to be reliable (and is imminently changing). Whether group_id or obs_id is used for visits depends on the visit_system (0 or 1).

            The visit table defines visits but the visit_definition dimension actually stores the exposure to visit mapping. See the discussion on DM-30948. If you have an exposure you can get the visits it is part of or vice versa.


            The visit table defines visits but the visit_definition dimension actually stores the exposure to visit mapping. See the discussion on DM-30948. If you have an exposure you can get the visits it is part of or vice versa.

            I have updated the code to use visit_definition.

            tjenness Tim Jenness added a comment -

            For current code you should also be using visit_system in the dimension records filtering, but for DP0.2 we only have one type of visit defined so that's not going to be a problem. Once we change the schema things will need to be tweaked because visit_system is moving to a different place.


            I am not sure how this can work in my case. The exposure record does not have visit_system, so if there are multiple visits for the same exposure (with different visit_system values) I cannot actually filter for just one of them. If there are multiple visits for the same exposure, should all of them have the same region?

            tjenness Tim Jenness added a comment -

            Our definition of visit is that the telescope has not moved between exposures so I think we can probably assume that every exposure in a visit has the same sky region as any of the visits defined for that exposure. So I think you can just take the first result for an exposure from the visit_definition query and you will be fine. Just don't assume you will only get one result.
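The lookup described above can be sketched as follows, using plain dictionaries to stand in for the Registry's visit_definition and visit dimension records (all names and values here are illustrative toy data; the real exporter queries these records through the Butler Registry):

```python
# Toy stand-ins for dimension records; in the real exporter these
# come from Butler Registry queries.
visit_definition = [  # (exposure, visit) mapping
    {"exposure": 903334, "visit": 903334},
    {"exposure": 903334, "visit": 9033340},  # hypothetical second visit
    {"exposure": 903336, "visit": 903336},
]
visit_region = {  # visit -> sky region (regions are just strings here)
    903334: "POLYGON ICRS ...A",
    9033340: "POLYGON ICRS ...A",  # same pointing, so same region
    903336: "POLYGON ICRS ...B",
}


def region_for_exposure(exposure_id):
    """Take the first visit the exposure belongs to and use its region;
    per the discussion above, any visit containing the exposure should
    give the same sky region, but there may be more than one match."""
    for rec in visit_definition:
        if rec["exposure"] == exposure_id:
            return visit_region[rec["visit"]]
    return None  # exposure not defined in any visit


print(region_for_exposure(903334))  # region of the first matching visit
```

The key point from the comment above is encoded in the early return: take the first match, but do not assume the query yields exactly one row.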


            Only 90K lines in that csv – should we give this a go against a DC2 repo?

            tjenness Tim Jenness added a comment -

            With the configuration yaml:

            dataset_types:
               -
                 name: raw
                 dataproduct_type: image
                 dataproduct_subtype: lsst.raw
                 calib_level: 1
                 obs_id_fmt: "{exposure}"
                 o_ucd: phot.count
                 access_format: image/fits
               -
                 name: calexp
                 dataproduct_type: image
                 dataproduct_subtype: lsst.calexp
                 calib_level: 2
                 obs_id_fmt: "{visit}"
                 o_ucd: phot.count
                 access_format: image/fits
            

            Can't the dataset type names be keys rather than being a list?

            I think it would be good if the raw obs_id was actually the exposure.obs_id name, not the exposure ID (the obs_id string is more descriptive than the integer).

            Similarly a calexp should probably use visit.name not visit.id.
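A dictionary-keyed variant of the configuration above might look like the following. This is only a sketch of the suggestion; whether dax_obscore accepts this form depends on the final config schema:

```yaml
dataset_types:
  raw:
    dataproduct_type: image
    dataproduct_subtype: lsst.raw
    calib_level: 1
    obs_id_fmt: "{exposure}"
    o_ucd: phot.count
    access_format: image/fits
  calexp:
    dataproduct_type: image
    dataproduct_subtype: lsst.calexp
    calib_level: 2
    obs_id_fmt: "{visit}"
    o_ucd: phot.count
    access_format: image/fits
```

Keying on the dataset type name removes the redundant `name` field and makes duplicate entries impossible by construction.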


            Can't the dataset type names be keys rather than being a list?

            I was thinking about making it a match-type (regex) to cover more than one dataset type, but if we don't need that then it's easy to convert to dictionary.

            I'll try to do something with obs_id in that example config.

            tjenness Tim Jenness added a comment -

            Wouldn't a regex string still be usable as a key?
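A regex string can indeed serve as a dictionary key; matching then becomes a scan over the keys rather than a direct lookup. A minimal sketch, with the config entries and dataset type names purely illustrative:

```python
import re

# Hypothetical config: keys are regular expressions over dataset type names.
config = {
    r"raw": {"calib_level": 1},
    r"calexp": {"calib_level": 2},
    r"deepCoadd.*": {"calib_level": 3},
}


def config_for(dataset_type: str):
    """Return the config entry whose regex key fully matches the name,
    scanning keys in declaration order."""
    for pattern, entry in config.items():
        if re.fullmatch(pattern, dataset_type):
            return entry
    return None


print(config_for("deepCoadd_calexp"))  # matched by the deepCoadd.* key
```

One caveat worth noting: with regex keys, declaration order matters when more than one pattern could match, so the scan should be deterministic (dicts preserve insertion order in Python 3.7+).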


            Only 90K lines in that csv – should we give this a go against a DC2 repo?

            fritzm, I made another parquet file /project/salnikov/obscore/dc2.parquet from /repo/dc2 and collection 2.2i/runs/DP0.1. It has quite a few more records in it, about 7M, and took ~1 hour to generate. To speed things up a bit I replaced access_url with something bogus; that should be fine for Qserv testing purposes.


            Thanks, Andy – I'll give it a go!


            fritzm, I have regenerated parquet file at the same location (/project/salnikov/obscore/dc2.parquet) with the fixed regions.


            Thanks, will give that a go!


            Looks like the scaling is going to be okay. With the new 7M row DC2 dataset:

            MariaDB [imgserv]> select s_region from obscore where contains(s_region_bounds, point(60, -30)) and scisql_s2PtInCPoly(60, -30, s_region_scisql);
             
            [abbreviated]
             
            | POLYGON ICRS 60.244426 -30.028724 60.267159 -29.775530 59.980076 -29.755842 59.956634 -30.008983 |
            | POLYGON ICRS 60.001429 -30.016522 60.281413 -29.941318 60.195939 -29.702694 59.916532 -29.777716 |
            | POLYGON ICRS 59.905059 -29.781186 59.975650 -30.027569 60.255450 -29.966884 60.184241 -29.720673 |
            | POLYGON ICRS 60.202006 -29.976891 60.154785 -29.726338 59.870781 -29.766263 59.917312 -30.016948 |
            | POLYGON ICRS 59.916275 -29.777072 59.987107 -30.023529 60.267065 -29.962715 60.195594 -29.716416 |
            +--------------------------------------------------------------------------------------------------+
            1104 rows in set (0.017 sec)
            
            


            tjenness, I think this should be OK to merge; I believe most of the functionality is there already, including some docs. Other possible improvements can go on separate tickets.


            tjenness, sorry, forgot to push one more change and now it needs some rebasing. I'll let you know when it's ready.


            salnikov I have (finally) been looking closely at this, looking at what was released a while back (on 4/24).

            Do you have a CSV version of the most recent export format?


            gpdf, I have made a fresh copy of CSV from /repo/main using example.yaml configuration, it is at https://lsst.ncsa.illinois.edu/~salnikov/data.csv.gz, or in /project/salnikov/obscore/data.csv.gz on lsst-devl.


            tjenness, sorry, I forgot to make PR for daf_butler earlier, it needed a small fix to avoid crashing: https://github.com/lsst/daf_butler/pull/682.

            tjenness Tim Jenness added a comment -

            Looks great. Thanks for all the cleanups.


            Thanks everyone for reviewing and all suggestions! Finally merged.


            People

              salnikov Andy Salnikov
              tjenness Tim Jenness
              Tim Jenness
              Andy Salnikov, Christine Banek, Fritz Mueller, Frossie Economou, Gregory Dubois-Felsmann, Tim Jenness