Agreed on the need for configuration. Presumably we'd use our usual Python configuration tooling for that?
I would like to centralize all the "static" mappings, keyed on dataset type name, including the one for dataproduct_type; I don't think there's any need to involve the storage type in the reasoning here. It'll be clearer if there's a single lookup table from dataset type to all the attributes derivable from it. (That includes more that are not in my original Confluence page, for instance s_resolution and o_ucd. I will help you put together the mapping.) For now: PVIs are calib_level 2; diffims and coadds are 3.
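To make the single-lookup-table idea concrete, here is a rough sketch, not a proposal for the final code. The class and function names are made up for illustration, "goodSeeingDiff_differenceExp" is only my assumed example of a diffim dataset type name, and assigning dataproduct_type "image" to all three is likewise an assumption. Only the calib_level values come from the statement above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StaticAttributes:
    """ObsCore attributes that depend only on the dataset type name."""

    dataproduct_type: str
    calib_level: int


# One table, keyed on dataset type name. Further static columns
# (s_resolution, o_ucd, ...) would become additional fields.
STATIC_BY_DATASET_TYPE = {
    "calexp": StaticAttributes("image", 2),             # PVI
    "deepCoadd_calexp": StaticAttributes("image", 3),   # coadd
    # Assumed example name for a difference image dataset type.
    "goodSeeingDiff_differenceExp": StaticAttributes("image", 3),
}


def static_attributes(dataset_type_name):
    """Return the static attributes, failing loudly for unmapped types."""
    try:
        return STATIC_BY_DATASET_TYPE[dataset_type_name]
    except KeyError:
        raise ValueError(
            f"No ObsCore mapping configured for {dataset_type_name!r}"
        ) from None
```

In practice the table contents would live in the configuration rather than in code; the point is only that nothing but the dataset type name is consulted.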
Agreed that the extractor should take a positive list of dataset types for which extraction is to be performed: e.g., "calexp, deepCoadd_calexp".
I would have to talk to Tim Jenness about how to slice the repository by collection or other attribute. I think having it work on an enumerated list of collections, and if that's not supplied default to the whole repository, is likely to be OK, though.
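As a sketch of the selection logic I have in mind (positive list of dataset types, optional list of collections, whole repository if no collections are given): the plain dataclass below just stands in for whatever our real configuration tooling would provide, and all the names, including the "my/collection" collection name, are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ExtractorConfig:
    # Positive list: only these dataset types are extracted.
    dataset_types: Sequence[str]
    # Enumerated collections; None means "the whole repository".
    collections: Optional[Sequence[str]] = None


def should_extract(config, dataset_type, collection):
    """Decide whether a dataset is in scope for ObsCore extraction."""
    if dataset_type not in config.dataset_types:
        return False
    if config.collections is None:
        return True
    return collection in config.collections


# Usage: extract calexp and deepCoadd_calexp from one named collection.
config = ExtractorConfig(
    dataset_types=["calexp", "deepCoadd_calexp"],
    collections=["my/collection"],
)
```

How the collection constraint is actually pushed down into the repository query is the part that still needs the conversation mentioned above.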
We very much need em_min and em_max populated from the filter ID.
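Roughly what I mean, as a sketch only: a per-filter table of band edges, converted to the metres that ObsCore expects for em_min and em_max. The numbers below are deliberately round placeholders, not the real bandpasses, and keying on a band name like "r" rather than a numeric filter ID is an assumption; the real table would come from configuration.

```python
NM_TO_M = 1e-9

# PLACEHOLDER band edges in nanometres; illustrative round numbers only,
# to be replaced by the real filter definitions from configuration.
BAND_EDGES_NM = {
    "g": (400.0, 550.0),
    "r": (550.0, 700.0),
}


def em_range_m(band, table=BAND_EDGES_NM):
    """Return (em_min, em_max) in metres for the given band name."""
    lo_nm, hi_nm = table[band]
    return lo_nm * NM_TO_M, hi_nm * NM_TO_M
```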
A critical column to get working for immediate prototyping purposes is s_region, because Fritz Mueller and company are going to have to develop the ability to ingest that into Qserv, turning it both into a string value the user can use in a SELECT and into the encoded form needed by scisql.
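For the string form of s_region, something along these lines would do as a starting point. This is a sketch that assumes the footprint is available as a list of (ra, dec) corners in ICRS degrees and that we emit an STC-S style "POLYGON ICRS ..." string; the function name and the fixed six-decimal precision are my own choices, and the encoded form needed by scisql is not covered here.

```python
def polygon_to_s_region(corners):
    """Format (ra, dec) corners in degrees as an STC-S polygon string."""
    if len(corners) < 3:
        raise ValueError("A polygon needs at least three vertices")
    coords = " ".join(f"{ra:.6f} {dec:.6f}" for ra, dec in corners)
    return f"POLYGON ICRS {coords}"


# Usage with made-up coordinates:
region = polygon_to_s_region([(10.0, -5.0), (10.5, -5.0), (10.5, -4.5)])
```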
We need two options for access_url: either (only for early testing and perhaps some ad-hoc usages I'll tell you about later) a URL that points straight to the image file artifact, or (what's really needed for production) a URL for a "links service" (see the DataLink standard if you are curious, but you don't need to know any real details about that in order to implement this). The latter is what I informally call the "CADC model", which we are adopting.
The URL should be constructed based on a template provided in the configuration. Ideally the obs_creator_did should be the thing plugged into the template, as this is relocatable to all sites (unlike the obs_publisher_did, which is meant to be unique for every site), but for DP0.2 we don't need to worry about that if it becomes an obstacle. I think, from what Tim Jenness was saying above, that we are going to be using the UUID as the obs_creator_did?
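A minimal sketch of the template idea, assuming a Python format-style template with a single {id} placeholder. The example.org URL, the placeholder name, and the function name are all hypothetical, and whether the identifier is the UUID is still the open question above.

```python
from urllib.parse import quote


def build_access_url(template, identifier):
    """Substitute a percent-encoded identifier into a URL template.

    The template is expected to contain an ``{id}`` placeholder.
    """
    return template.format(id=quote(str(identifier), safe=""))


# Hypothetical links-service template, as it might appear in configuration.
LINKS_TEMPLATE = "https://example.org/api/datalink/links?ID={id}"
url = build_access_url(LINKS_TEMPLATE, "abc/def")
```

The same function covers the direct-link case; only the template in the configuration changes.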
The access_format is mandated to be "application/x-votable+xml;content=datalink" in the CADC model, and is the same for all dataset types.
In the "direct link" model, you are correct that the access_format would be dataset-type-dependent, but for now it is always either "image/fits" or "application/fits" for all the datasets we are likely to put in ObsCore. The latter is always acceptable, so we can just default to it for DP0.2.
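Put together, the access_format choice is small enough to sketch in a few lines. The function and parameter names are made up; the two MIME strings are the ones quoted above.

```python
DATALINK_FORMAT = "application/x-votable+xml;content=datalink"
DEFAULT_DIRECT_FORMAT = "application/fits"


def access_format(use_datalink, direct_format=DEFAULT_DIRECT_FORMAT):
    """Return the access_format value for the configured access model."""
    if use_datalink:
        # Links-service model: same value for every dataset type.
        return DATALINK_FORMAT
    # Direct-link model: could become dataset-type-dependent later.
    return direct_format
```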
Looks like Gregory Dubois-Felsmann did an analysis of the mapping of registry to ObsCore here: https://confluence.lsstcorp.org/display/~gpdf/Satisfying+ObsCore+from+the+Gen3+Butler+Schema