Details
- Type: Story
- Status: Done
- Resolution: Done
- Fix Version/s: None
- Component/s: daf_butler, dax_obscore
- Labels: None
- Team: Ops Middleware
- Urgent?: No
Description
The ObsTAP service requires that we have a database populated with ObsCore entries.
We need something that can take a Registry (possibly with a dataset query constraint) and export it to, say, a parquet file that can be ingested into the ObsCore database. Some values can come from configuration (such as the telescope location and how to map a dataset type to a processing level).
Setting up the ObsCore database and loading these files into it can be done as a separate ticket.
A proof-of-concept was written at https://github.com/lsst-dm/dax_obscore which attempts to show that this can be done. I don't know whether any of this code can be used in the final system, but it doesn't seem unreasonable to keep using the dax_obscore package itself.
The code in dax_obscore has one notable problem: it should never call butler.get – the sky region for an exposure must come from the visit record, as should the exposure time. calexp is visit-based so that is fine, but for raw there might be a complication in getting the "footprint" in the absence of a direct link from an exposure record to a visit record.
I envisage a butler plugin click command that can do the equivalent of query-datasets and dump the results as a parquet table (a rough sketch follows below). Obviously we may also want a bespoke exporter, distinct from the command-line tooling, for doing a full export.
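A rough sketch of what such a plugin command might look like (the command name, options, and column choices here are illustrative assumptions, not a design):

# Hypothetical sketch only: a butler subcommand that queries datasets and
# dumps a few ObsCore-style columns to parquet. Option names and the
# column set are assumptions, not the proposed interface.
import click
import pyarrow as pa
import pyarrow.parquet as pq
from lsst.daf.butler import Butler


@click.command("obscore-export")
@click.argument("repo")
@click.argument("destination")
@click.option("--dataset-type", "-d", multiple=True, help="Dataset types to export.")
@click.option("--collections", multiple=True, help="Collections to search.")
def obscore_export(repo, destination, dataset_type, collections):
    """Dump query-datasets-like results as a parquet table."""
    butler = Butler(repo)
    rows = []
    for dtype in dataset_type:
        # collections=... means "all collections" if none were given.
        for ref in butler.registry.queryDatasets(dtype, collections=collections or ...):
            rows.append(
                {
                    "dataproduct_subtype": f"lsst.{dtype}",
                    "obs_publisher_did": str(ref.id),  # dataset UUID
                    "obs_collection": ref.run,
                }
            )
    pq.write_table(pa.Table.from_pylist(rows), destination)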
Gregory Dubois-Felsmann, how are you envisaging we use the visit concept in ObsCore? In theory we don't need it at all and can take the naive approach that every observation is an exposure. How many dataset types is the service meant to expose? raw, calexp and the tract/patch coadds?
Issue Links
- is blocked by: DM-22496 Provide an interim definition of the obs_publisher_did field in ObsCore that is usable for Gen3-based image data (In Progress)
- is triggering: DM-34735 Add UUID support to URL format string (Done)
- is triggering: DM-34685 Create production Butler-to-ObsCore configuration file(s) in sdm_schemas/obscore (Reviewed)
Activity
Moving this to the ops team because the work is driven entirely by the DP0.2 timeline.
[I'm very new to the whole ObsCore/ObsTAP thing, please correct me if I say something disagreeable]
I have been reading the ObsCore paper and Gregory's table, which gave me some insight; I will try to collect my thoughts here.
Here is an "implementation phase 1" subset of the columns from Gregory's table:
- dataproduct_type - this can probably be guessed from the storage class, e.g. "Exposure*" maps to "image", "DataFrame" to "measurements"?
- dataproduct_subtype - could it be the dataset type name (prefixed with "lsst.")?
- calib_level - some sort of mapping is needed from dataset type name to level number
- obs_collection - this probably identifies a particular butler repository (or maybe a subset of it)?
- obs_publisher_did - a unique URI for a dataset; I think Tim mentioned this should include the UUID. Should it also include other context (e.g. repo or obs_collection)?
- access_url - this depends on whatever service is used to access the data via this URL
- access_format - this can probably be mapped from the dataset type name?
- s_ra, s_dec - Region has a bounding circle, so we can use its center for these.
- s_region - in STC-S encoding; I don't think Region knows how to produce that. Most regions should be polygons (see the sketch after this list).
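For the two spatial items, something along these lines might work with lsst.sphgeom (a sketch; spatial_columns is a hypothetical helper and the STC-S formatting has not been validated against the standard):

# Sketch: derive s_ra/s_dec from the region's bounding circle and an
# STC-S string from the polygon vertices. Assumes the region is (or can
# be approximated by) a ConvexPolygon.
from lsst.sphgeom import ConvexPolygon, LonLat, Region


def spatial_columns(region: Region) -> dict:
    # Bounding-circle center for s_ra/s_dec, as suggested above.
    center = region.getBoundingCircle().getCenter()
    s_ra = LonLat.longitudeOf(center).asDegrees()
    s_dec = LonLat.latitudeOf(center).asDegrees()
    s_region = None
    if isinstance(region, ConvexPolygon):
        vertices = []
        for v in region.getVertices():
            lon = LonLat.longitudeOf(v).asDegrees()
            lat = LonLat.latitudeOf(v).asDegrees()
            vertices.append(f"{lon:.6f} {lat:+.6f}")
        s_region = "POLYGON ICRS " + " ".join(vertices)
    return {"s_ra": s_ra, "s_dec": s_dec, "s_region": s_region}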
A few general ideas I have so far:
- Registries have datasets that don't need to be exposed to ObsTAP, so filtering based on dataset type name is probably the right thing.
- I expect that registries may also have a bunch of data that is not useful (e.g. test runs); maybe we also want to export only a subset of collections.
- Some sort of configuration is probably needed; command line options can't handle the complexity.
obs_publisher_did: I think the UUID is what Russ Allbery is using in the cutout service. We decided that a run/datasetType/dataId tuple was more complicated than necessary and the UUID is enough to find the relevant dataset. It should also be portable across different repositories that have the same data release.
Controlling dataset types and collections for the export is going to be important; I assume it will work much like the query-datasets command-line parameters.
Configuration is definitely going to be needed.
Agreed on the need for configuration. Presumably we'd use our usual Python configuration tooling for that?
I would like to centralize all the "static" mappings as based on dataset type name, including the one for dataproduct_type; I don't think there's any need to involve the storage class in the reasoning here. It'll be clearer if there's a single lookup table from dataset type to all the attributes derivable from it (that includes more columns that are not in my original Confluence page - for instance, s_resolution and o_ucd). I will help you put together the mapping. For now: PVIs are calib_level 2; diffims and coadds are 3.
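As an illustration of the kind of single lookup table I have in mind (the entries below just encode the calib_level guidance above; everything else is a placeholder to be confirmed):

# Illustrative only: one table from dataset type name to all attributes
# derivable from it. Values other than calib_level are placeholders.
DATASET_TYPE_MAP = {
    "calexp": {
        "dataproduct_type": "image",
        "dataproduct_subtype": "lsst.calexp",
        "calib_level": 2,  # PVIs are calib_level 2
    },
    "deepCoadd_calexp": {
        "dataproduct_type": "image",
        "dataproduct_subtype": "lsst.deepCoadd_calexp",
        "calib_level": 3,  # diffims and coadds are 3
    },
}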
Agreed that the extractor should take a positive list of dataset types for which extraction is to be performed: e.g., "calexp, deepCoadd_calexp".
I would have to talk to Tim Jenness about how to slice the repository by collection or other attribute. I think having it work on an enumerated list of collections, and if that's not supplied default to the whole repository, is likely to be OK, though.
We very much need em_min and em_max populated from the filter ID.
A critical column to get working for immediate prototyping purposes is s_region, because Fritz Mueller and company are going to have to develop the ability to ingest that into Qserv, turning it both into a string value the user can use in a SELECT and into the encoded form needed by scisql.
We need two options for access_url: either (only for early testing and perhaps some ad-hoc usages I'll tell you about later) a URL that points straight to the image file artifact, or (what's really needed for production) a URL for a "links service" (see the DataLink standard if you are curious, but you don't need to know any real details about that in order to implement this). The latter is what I informally call the "CADC model", which we are adopting.
The URL should be constructed based on a template provided in the configuration. Ideally the obs_creator_did should be the thing plugged in to the template, as this is relocatable to all sites (unlike the obs_publisher_did, which is meant to be unique for every site), but for DP0.2 we don't need to worry about that if it becomes an obstacle. I think, from what Tim Jenness was saying above, that we are going to be using the UUID as the obs_creator_did?
The access_format is mandated to be "application/x-votable+xml;content=datalink" in the CADC model, and is the same for all dataset types.
In the "direct link" model, you are correct, the access_format would be dataset-type-dependent, but for now always either "image/fits" or "application/fits" (the latter is always acceptable, so we can just default to it for DP0.2) for all the datasets we are likely to put in ObsCore.
Tim Jenness, to find a region for an exposure, I guess I need to associate an exposure with a visit. Should this be done based on timespans, or is there a better way?
For em_min and em_max, I think I will need to read filter data from the Butler; they are not in the dimension records.
I am absolutely fine with reading the min and max em values for a filter from a static config file. Butler knows the physical_filter, and pipelines code does not store filter profiles in the Exposures any more (it uses FilterLabel). The relevant Instrument class could be told the filter widths, but it isn't at the moment. Use a config keyed by physical_filter for now and we'll worry about a cleverer scheme later on.
The sky region for raws was worrying me. For a calexp and later data products you are fine and can use the visit record directly. It looks like there is a visit_definition record that maps an exposure to the corresponding visit.
I think I'm going to have to spend a little extra time on working out the right way to do the calexp-to-raw, i.e., visit-to-exposure connection in the DP0.2 deployment. I suppose it's definite that the plan is to include the raws in the data exposed? Tim Jenness, let's plan to talk a bit about what obs_id should be for exposures vs. visits.
For em_min and em_max, I 100% agree - just make it a table lookup from a config based on the filter ID. We can worry later on about how to populate that more directly from an authoritative source. It is very clear that no one should be using the ObsCore values for science-critical purposes, though.
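A sketch of what that lookup could look like (the filter names and wavelengths below are made-up placeholders, not authoritative values):

# Hypothetical per-filter spectral ranges keyed by physical_filter, in
# metres; the entries are placeholders, not real filter profiles.
EM_RANGES = {
    "r_example": (5.5e-7, 6.9e-7),
    "i_example": (6.9e-7, 8.2e-7),
}


def spectral_columns(physical_filter: str) -> dict:
    em_min, em_max = EM_RANGES[physical_filter]
    return {"em_min": em_min, "em_max": em_max}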
It's possible that this will be my only window of time to look at work-related things today. I do expect to be working a full day on Monday, though, and returning to at least 75% time from then on.
Completely ignoring raw for the moment does not seem like a bad plan.
calexp is well-defined and is what the VO services currently need to show they work.
The regions in an existing sample of Kenny's old converted metadata appear to all be four-point spherical polygons, but I assume this need not always be the case? Is there a limit on the number of points? (Trying to figure out allocation for scisql binary polygon columns in Qserv...)
STC-S places no limit on polygon definitions, but practically speaking the visit region is defined in this code: https://github.com/lsst/obs_base/blob/main/python/lsst/obs/base/defineVisits.py#L806-L823
As a proof of concept for ability to leverage scisql_s2PtInCPoly, I dumped Kenny's previous ObsCore data (~88K rows) from Postgres and imported it into a MariaDB instance. A scisql binary polygon column was then added via:
ALTER TABLE obscore ADD s_region_scisql VARBINARY(128);

UPDATE obscore SET s_region_scisql = scisql_s2CPolyToBin(
    cast(substring_index(substring_index(s_region, ' ', -8), ' ', 1) as double),
    cast(substring_index(substring_index(s_region, ' ', -7), ' ', 1) as double),
    cast(substring_index(substring_index(s_region, ' ', -6), ' ', 1) as double),
    cast(substring_index(substring_index(s_region, ' ', -5), ' ', 1) as double),
    cast(substring_index(substring_index(s_region, ' ', -4), ' ', 1) as double),
    cast(substring_index(substring_index(s_region, ' ', -3), ' ', 1) as double),
    cast(substring_index(substring_index(s_region, ' ', -2), ' ', 1) as double),
    cast(substring_index(substring_index(s_region, ' ', -1), ' ', 1) as double)
);
After this, we can, e.g.:
MariaDB [imgserv]> select s_region from obscore where scisql_s2PtInCPoly(216, 2, s_region_scisql);
+--------------------------------------------------------------------------------------------------+
| s_region                                                                                          |
+--------------------------------------------------------------------------------------------------+
| POLYGON ICRS 216.130297 +2.066402 216.133659 +1.972385 215.945891 +1.977446 215.943826 +2.072778 |
| POLYGON ICRS 216.130318 +2.066394 216.133680 +1.972377 215.945912 +1.977438 215.943847 +2.072770 |
| POLYGON ICRS 216.096602 +2.018849 216.097539 +1.922714 215.901814 +1.923511 215.901870 +2.019844 |
| POLYGON ICRS 216.096614 +2.018841 216.097552 +1.922706 215.901826 +1.923503 215.901883 +2.019836 |
| POLYGON ICRS 216.130293 +2.066413 216.133655 +1.972396 215.945887 +1.977457 215.943822 +2.072789 |
| POLYGON ICRS 216.134442 +2.067605 216.137666 +1.977445 215.952147 +1.982629 215.950160 +2.074052 |
| POLYGON ICRS 216.135765 +2.067434 216.138961 +1.977278 215.953429 +1.982520 215.951479 +2.073942 |
| POLYGON ICRS 216.098774 +2.020408 216.099655 +1.928171 215.906041 +1.929070 215.906094 +2.021492 |
| POLYGON ICRS 216.135666 +2.067753 216.138867 +1.977591 215.953328 +1.982811 215.951376 +2.074240 |
| POLYGON ICRS 216.099985 +2.020794 216.100866 +1.928568 215.907266 +1.929489 215.907336 +2.021898 |
+--------------------------------------------------------------------------------------------------+
10 rows in set (0.191 sec)
That looks workable at a basic level, but as it stands it entails a full table scan. Will be investigating indexing strategies next...
Okay, MariaDB has spatial indices on its own GIS geometry types, but they are not spherical. We can maybe get around that by storing a projected set of bounds in each row, then building a spatial index on that column and using it for a cut further refined by the exact scisql clause above. A lat/lon bounding box and a cylindrical projection are probably as simple as it could be, and likely a good match for MariaDB's r-tree spatial index. So:
DELIMITER $$
CREATE OR REPLACE FUNCTION bounds_from_region(s_region VARCHAR(512)) RETURNS POLYGON
BEGIN
    DECLARE lon1, lat1, lon2, lat2, lon3, lat3, lon4, lat4 VARCHAR(32);
    SET lon1 = substring_index(substring_index(s_region, ' ', -8), ' ', 1);
    SET lat1 = substring_index(substring_index(s_region, ' ', -7), ' ', 1);
    SET lon2 = substring_index(substring_index(s_region, ' ', -6), ' ', 1);
    SET lat2 = substring_index(substring_index(s_region, ' ', -5), ' ', 1);
    SET lon3 = substring_index(substring_index(s_region, ' ', -4), ' ', 1);
    SET lat3 = substring_index(substring_index(s_region, ' ', -3), ' ', 1);
    SET lon4 = substring_index(substring_index(s_region, ' ', -2), ' ', 1);
    SET lat4 = substring_index(substring_index(s_region, ' ', -1), ' ', 1);
    RETURN ENVELOPE(POLYGON(LINESTRING(
        POINT(lon1, lat1), POINT(lon2, lat2), POINT(lon3, lat3), POINT(lon4, lat4), POINT(lon1, lat1)
    )));
END $$
DELIMITER ;

ALTER TABLE obscore ADD s_region_bounds POLYGON NOT NULL;
UPDATE obscore SET s_region_bounds = bounds_from_region(s_region);
ALTER TABLE obscore ADD SPATIAL INDEX(s_region_bounds);
As a first cut it does seem to work. The query planner reports:
MariaDB [imgserv]> explain select s_region from obscore where scisql_s2PtInCPoly(216, 2, s_region_scisql);
+------+-------------+---------+------+---------------+------+---------+------+-------+-------------+
| id   | select_type | table   | type | possible_keys | key  | key_len | ref  | rows  | Extra       |
+------+-------------+---------+------+---------------+------+---------+------+-------+-------------+
|    1 | SIMPLE      | obscore | ALL  | NULL          | NULL | NULL    | NULL | 88134 | Using where |
+------+-------------+---------+------+---------------+------+---------+------+-------+-------------+
1 row in set (0.001 sec)

MariaDB [imgserv]> explain select s_region from obscore where contains(s_region_bounds, point(216, 2)) and scisql_s2PtInCPoly(216, 2, s_region_scisql);
+------+-------------+---------+-------+-----------------+-----------------+---------+------+------+-------------+
| id   | select_type | table   | type  | possible_keys   | key             | key_len | ref  | rows | Extra       |
+------+-------------+---------+-------+-----------------+-----------------+---------+------+------+-------------+
|    1 | SIMPLE      | obscore | range | s_region_bounds | s_region_bounds | 34      | NULL |    1 | Using where |
+------+-------------+---------+-------+-----------------+-----------------+---------+------+------+-------------+
1 row in set (0.001 sec)
and we now have:
MariaDB [imgserv]> select s_region from obscore where contains(s_region_bounds, point(216, 2)) and scisql_s2PtInCPoly(216, 2, s_region_scisql);
+--------------------------------------------------------------------------------------------------+
| s_region                                                                                          |
+--------------------------------------------------------------------------------------------------+
| POLYGON ICRS 216.135666 +2.067753 216.138867 +1.977591 215.953328 +1.982811 215.951376 +2.074240 |
| POLYGON ICRS 216.099985 +2.020794 216.100866 +1.928568 215.907266 +1.929489 215.907336 +2.021898 |
| POLYGON ICRS 216.130297 +2.066402 216.133659 +1.972385 215.945891 +1.977446 215.943826 +2.072778 |
| POLYGON ICRS 216.130318 +2.066394 216.133680 +1.972377 215.945912 +1.977438 215.943847 +2.072770 |
| POLYGON ICRS 216.096602 +2.018849 216.097539 +1.922714 215.901814 +1.923511 215.901870 +2.019844 |
| POLYGON ICRS 216.096614 +2.018841 216.097552 +1.922706 215.901826 +1.923503 215.901883 +2.019836 |
| POLYGON ICRS 216.130293 +2.066413 216.133655 +1.972396 215.945887 +1.977457 215.943822 +2.072789 |
| POLYGON ICRS 216.134442 +2.067605 216.137666 +1.977445 215.952147 +1.982629 215.950160 +2.074052 |
| POLYGON ICRS 216.135765 +2.067434 216.138961 +1.977278 215.953429 +1.982520 215.951479 +2.073942 |
| POLYGON ICRS 216.098774 +2.020408 216.099655 +1.928171 215.906041 +1.929070 215.906094 +2.021492 |
+--------------------------------------------------------------------------------------------------+
10 rows in set (0.001 sec)
so, a speedup of ~200x for that query. This is probably good enough to test next at scale.
I have a preliminary implementation of the butler obscore-export command on the tickets/DM-34483 branch. There are obviously a few things missing for now:
- access_url is currently taken directly from the datastore; to make a DataLink URL I need to know how that is constructed
- the exposure region is determined from the visit region assuming that the exposure group_id is the same as the visit ID; I do not know if this is guaranteed to always be true
- many things come from configuration, some per dataset type (check configs/example.yaml for an example)
- the obs_id format (template) can be specified in configuration; not sure if that is going to be sufficient for all cases
- obs_publisher_did is not filled yet; I need some specific input on how to construct it (I can probably specify a per-dataset-type format string for it, e.g. "ivoa://{obs_collection}/{obs_id}" or something like it; see the sketch after this list)
- probably a few other things
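The kind of thing I have in mind for that format string (a sketch; the ivoa:// pattern is just the example from above, not a decided scheme):

# Sketch: expand a per-dataset-type format string for obs_publisher_did.
# The templates and context keys are examples, not decided values.
DID_TEMPLATES = {
    "raw": "ivoa://{obs_collection}/{obs_id}",
    "calexp": "ivoa://{obs_collection}/{obs_id}",
}


def make_publisher_did(dataset_type: str, **context: str) -> str:
    return DID_TEMPLATES[dataset_type].format(**context)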
I used that example config to run the tool on /repo/main to produce example parquet and CSV files; the output is on lsst-devl:
- /project/salnikov/obscore/data.csv.gz
- /project/salnikov/obscore/data.parquet
If you want to play with it, you need dax_obscore and daf_butler from the ticket branch; the command I used to make the parquet file:
butler obscore-export -c $DAX_OBSCORE_DIR/configs/example.yaml /repo/main data.parquet
the exposure region is determined from the visit region assuming that the exposure group_id is the same as the visit ID; I do not know if this is guaranteed to always be true
This is not going to be reliable (and is imminently changing). Whether group_id or obs_id is used for visits depends on the visit_system (0 or 1).
The visit table defines visits but the visit_definition dimension actually stores the exposure to visit mapping. See the discussion on DM-30948. If you have an exposure you can get the visits it is part of or vice versa.
The visit table defines visits but the visit_definition dimension actually stores the exposure to visit mapping. See the discussion on DM-30948. If you have an exposure you can get the visits it is part of or vice versa.
I have updated the code to use visit_definition.
For the current code you should also be using visit_system in the dimension-record filtering, but for DP0.2 we only have one type of visit defined so that's not going to be a problem. Once we change the schema things will need to be tweaked, because visit_system is moving to a different place.
Not sure how this can work for my case. The exposure record does not have visit_system, so if there are multiple visits for the same exposure (with different visit_system values) I cannot actually filter on one of them. If there are multiple visits for the same exposure, should all of them have the same region?
Our definition of visit is that the telescope has not moved between exposures, so I think we can assume that every exposure in a visit has the same sky region as any of the visits defined for that exposure. So I think you can just take the first result for an exposure from the visit_definition query and you will be fine. Just don't assume you will only get one result.
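A sketch of that lookup (assuming registry.queryDimensionRecords; the exact where-clause syntax may need adjusting):

# Sketch: find a sky region for an exposure via visit_definition, taking
# the first matching visit as suggested above.
def exposure_region(registry, instrument: str, exposure_id: int):
    definitions = registry.queryDimensionRecords(
        "visit_definition",
        where="instrument = inst AND exposure = exp_id",
        bind={"inst": instrument, "exp_id": exposure_id},
    )
    for definition in definitions:
        visits = registry.queryDimensionRecords(
            "visit",
            where="instrument = inst AND visit = vis",
            bind={"inst": instrument, "vis": definition.visit},
        )
        for visit in visits:
            # All visits containing this exposure should share its region.
            return visit.region
    return None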
Only 90K lines in that csv – should we give this a go against a DC2 repo?
With the configuration YAML:

dataset_types:
  - name: raw
    dataproduct_type: image
    dataproduct_subtype: lsst.raw
    calib_level: 1
    obs_id_fmt: "{exposure}"
    o_ucd: phot.count
    access_format: image/fits
  - name: calexp
    dataproduct_type: image
    dataproduct_subtype: lsst.calexp
    calib_level: 2
    obs_id_fmt: "{visit}"
    o_ucd: phot.count
    access_format: image/fits
Can't the dataset type names be keys rather than being a list?
I think it would be good if the raw obs_id was actually the exposure.obs_id name, not the exposure ID (the obs_id string is more descriptive than the integer).
Similarly, a calexp should probably use visit.name, not visit.id.
Can't the dataset type names be keys rather than being a list?
I was thinking about making it a match pattern (regex) to cover more than one dataset type, but if we don't need that then it's easy to convert to a dictionary.
I'll try to do something with obs_id in that example config.
Wouldn't a regex string still be usable as a key?
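Something like this would keep both properties (a sketch; the patterns and config values are hypothetical):

# Sketch: regex strings as dictionary keys, matched against dataset type
# names at lookup time.
import re

PER_TYPE_CONFIG = {
    r"raw": {"calib_level": 1},
    r".*Coadd.*": {"calib_level": 3},
}


def config_for(dataset_type: str):
    for pattern, cfg in PER_TYPE_CONFIG.items():
        if re.fullmatch(pattern, dataset_type):
            return cfg
    return None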
Only 90K lines in that csv – should we give this a go against a DC2 repo?
Fritz Mueller, I made another parquet file /project/salnikov/obscore/dc2.parquet from /repo/dc2 and collection 2.2i/runs/DP0.1. It has quite a few more records in it, about 7M, and took ~1 hour to generate. To speed things up a bit I replaced access_url with something bogus, which should be fine for Qserv testing purposes.
Thanks, Andy – I'll give it a go!
Fritz Mueller, I have regenerated parquet file at the same location (/project/salnikov/obscore/dc2.parquet) with the fixed regions.
Thanks, will give that a go!
Looks like the scaling is going to be okay. With the new 7M row DC2 dataset:
MariaDB [imgserv]> select s_region from obscore where contains(s_region_bounds, point(60, -30)) and scisql_s2PtInCPoly(60, -30, s_region_scisql);

[abbreviated]

| POLYGON ICRS 60.244426 -30.028724 60.267159 -29.775530 59.980076 -29.755842 59.956634 -30.008983 |
| POLYGON ICRS 60.001429 -30.016522 60.281413 -29.941318 60.195939 -29.702694 59.916532 -29.777716 |
| POLYGON ICRS 59.905059 -29.781186 59.975650 -30.027569 60.255450 -29.966884 60.184241 -29.720673 |
| POLYGON ICRS 60.202006 -29.976891 60.154785 -29.726338 59.870781 -29.766263 59.917312 -30.016948 |
| POLYGON ICRS 59.916275 -29.777072 59.987107 -30.023529 60.267065 -29.962715 60.195594 -29.716416 |
+--------------------------------------------------------------------------------------------------+
1104 rows in set (0.017 sec)
Tim Jenness, I think this should be OK to merge; I believe most functionality is there already, including some docs. Other possible improvements can go on separate tickets.
Tim Jenness, sorry, forgot to push one more change and now it needs some rebasing. I'll let you know when it's ready.
Andy Salnikov, I have (finally) been looking closely at this, starting from what was released a while back (on 4/24).
Do you have a CSV version of the most recent export format?
Gregory Dubois-Felsmann, I have made a fresh copy of the CSV from /repo/main using the example.yaml configuration; it is at https://lsst.ncsa.illinois.edu/~salnikov/data.csv.gz, or in /project/salnikov/obscore/data.csv.gz on lsst-devl.
Tim Jenness, sorry, I forgot to make a PR for daf_butler earlier; it needed a small fix to avoid crashing: https://github.com/lsst/daf_butler/pull/682.
Looks great. Thanks for all the cleanups.
Thanks everyone for reviewing and all suggestions! Finally merged.
Looks like Gregory Dubois-Felsmann did an analysis of the mapping of registry to ObsCore here: https://confluence.lsstcorp.org/display/~gpdf/Satisfying+ObsCore+from+the+Gen3+Butler+Schema