Fix Version/s: None
Component/s: daf_butler, dax_obscore
Team: Data Access and Database
When data are taken, the files are transferred to the USDF and ingested into a butler registry. A lot of the general tooling developed by SQuaRE relies on ObsTAP queries of an ObsCore database table. Currently we generate the content of the ObsCore table by doing an explicit export of a butler registry using dax_obscore. This works, but it is an additional step that would have to run as part of the automated data ingest, explicitly exporting the records for the new raws and adding them to the ObsCore table. Presumably some data processing will also happen at USDF and we would like, for example, calexp images to be available to other systems as soon as the processing completes. Do we want to have to do an ObsCore export/import every time a batch job completes?
Things would be simpler if we could have an ObsCore view of the butler registry that automatically updates when files are added or deleted. This ticket is to investigate the feasibility of this and to write a tech note with options.
- is triggering
DM-35850 Collect ideas and requirements for daf_butler obscore implementation
- relates to
DM-37275 Establish ObsTAP service for use with a Postgres Butler registry
- To Do
The formatted version: https://dmtn-236.lsst.io/v/DM-35532/index.html
I've read the tech note and it looks really good to me.
I agree that the only tractable option is to have butler "own" an ObsCore table and have butler insert into ObsCore whenever a butler put/ingest/transfer_from occurs, using the dax_obscore configuration system.
What I'm not sure about is how decoupled we can make this. Should butler have some kind of system that lets you add callbacks that are passed a DatasetRef? This would essentially allow the ObsCore code to be decoupled from Butler – dax_obscore would have a function registered with butler via config but would otherwise be distinct. If we are tightly coupled, where does that leave dax_obscore? Do we fully integrate it into daf_butler? How do we set up the ObsCore table such that deleting a dataset from the registry will automatically delete it from the ObsCore table? Can we add a dataset_id UUID to the ObsCore table to make it easy to relate a row in ObsCore to a butler dataset?
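One way the last two questions could fit together, as a minimal sketch only: if the ObsCore table carries a dataset_id column declared as a foreign key with ON DELETE CASCADE, the database itself removes the ObsCore row when the dataset row is deleted. The table and column names below (`dataset`, `obscore`, `dataset_id`) are assumptions for illustration and are not taken from the daf_butler schema; SQLite stands in for the real registry backend.

```python
import sqlite3
import uuid

# In-memory SQLite database standing in for the registry backend.
conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical, heavily simplified tables.
conn.execute("CREATE TABLE dataset (id TEXT PRIMARY KEY)")
conn.execute(
    """
    CREATE TABLE obscore (
        dataset_id TEXT PRIMARY KEY
            REFERENCES dataset (id) ON DELETE CASCADE,
        dataproduct_type TEXT,
        s_region TEXT
    )
    """
)

# Insert one dataset and its matching ObsCore row, keyed by the same UUID.
dataset_id = str(uuid.uuid4())
conn.execute("INSERT INTO dataset (id) VALUES (?)", (dataset_id,))
conn.execute(
    "INSERT INTO obscore (dataset_id, dataproduct_type, s_region) VALUES (?, ?, ?)",
    (dataset_id, "image", None),
)
rows_before = conn.execute("SELECT COUNT(*) FROM obscore").fetchone()[0]

# Deleting the dataset cascades to the ObsCore row.
conn.execute("DELETE FROM dataset WHERE id = ?", (dataset_id,))
rows_after = conn.execute("SELECT COUNT(*) FROM obscore").fetchone()[0]
```

This only covers removal of the dataset itself. It does not cover a dataset that stays in the registry but should leave ObsCore for other reasons.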
- The s_region column in the database does not necessarily have to be usable in a WHERE clause as an actual string - e.g., eligible for use with the LIKE operator. One can imagine an implementation (regardless of how the ObsCore table is actually populated) in which the column in the database really is the packed-binary object, and somehow it's the TAP service itself (or some other part of the query layer) that materializes it as an ASCII POLYGON string in a query result. It might simplify things if we stay in the native geometry object representation within the database, I think? I believe this may be what CADC does in its Postgres database + TAP combo, from something I recall hearing.
- I think I understand that the "find_first" problem that you mention in the note is associated with the use of chained collections in defining the membership of the ObsCore table. In the client-side trigger model that appears to be emerging as favored, how would this be handled? If the same (dataID, datasetType) pair is, for ObsCore purposes, effectively redefined by a subsequent put() to a collection "ahead of" where the previous instance of that pair was, does the trigger do an UPDATE against the ObsCore table?
Just FYI, please see DM-35740 for a separate issue regarding the usability of the Registry-to-ObsCore conversion code, and a possible refactoring. It might be worth including in whatever architecture is developed to implement the client-side trigger model.
Final comment: right now dax_obscore has very limited dependencies, which makes it usable in non-Rubin applications (which we will do in SPHEREx). I have no specific reason to worry that this might change, but please do try to keep these dependencies slim if you develop the client-side trigger model in practice. We would certainly use it immediately in SPHEREx.
Gregory Dubois-Felsmann, I agree that some of the region formatting job can be shifted to the TAP service. Still, to be efficient for spatial queries, we need the region in a database-native representation suitable for indexing. The trouble is, of course, that database-native is also database-specific, and it would be nice to avoid backend-specific features in TAP. I think this is not a very hard problem to solve; it just needs some attention.
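To make the formatting half of this concrete, here is a minimal sketch of how the query layer might render polygon vertices as the ASCII POLYGON string when results are returned, leaving the stored column in whatever native form the database indexes. The function name `polygon_to_s_region`, the (ra, dec) tuples in degrees, and the fixed six-decimal precision are all assumptions for illustration; nothing here is taken from dax_obscore or an existing TAP implementation.

```python
from typing import Sequence, Tuple


def polygon_to_s_region(
    vertices: Sequence[Tuple[float, float]], frame: str = "ICRS"
) -> str:
    """Render (ra, dec) vertices in degrees as a POLYGON string.

    Hypothetical helper: a real service would first unpack the vertices
    from the database-native geometry object.
    """
    if len(vertices) < 3:
        raise ValueError("a polygon needs at least three vertices")
    coords = " ".join(f"{ra:.6f} {dec:.6f}" for ra, dec in vertices)
    return f"POLYGON {frame} {coords}"


# Example: a small triangle on the sky.
triangle = polygon_to_s_region([(10.0, -5.0), (10.5, -5.0), (10.5, -4.5)])
```

The backend-specific part, unpacking the native geometry into vertices, is deliberately left out, since that is exactly the piece that differs between databases.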
For the find-first issue and client-side "trigger", the insertion test should be easy: one just has to check that the run collection for the new dataset ref is a member of the configured chained collection. There are more interesting cases for removal of datasets, either as an explicit dataset purge, or removal of a run collection from a chained collection. I'll try to expand my technote with ideas for those cases.
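The insertion test described here could look roughly like the following sketch. It assumes chained collections are available as a plain mapping from chain name to ordered children, which is a simplification and not the daf_butler registry API; the function names and collection names are hypothetical, and cyclic chains are not handled.

```python
from typing import List, Mapping, Sequence


def flatten_chain(name: str, children_of: Mapping[str, Sequence[str]]) -> List[str]:
    """Expand a (possibly nested) chained collection into its run collections.

    Any name that is not a key of ``children_of`` is treated as a run.
    Order follows the chain and duplicates are dropped.
    """
    children = children_of.get(name)
    if children is None:
        return [name]
    flat: List[str] = []
    for child in children:
        for run in flatten_chain(child, children_of):
            if run not in flat:
                flat.append(run)
    return flat


def should_insert(run: str, chain: str, children_of: Mapping[str, Sequence[str]]) -> bool:
    """Return True if a dataset in ``run`` belongs in the ObsCore table."""
    return run in flatten_chain(chain, children_of)


# Hypothetical collection layout.
chains = {
    "obscore/all": ["raw/all", "processed"],
    "processed": ["processed/run2", "processed/run1"],
}
runs = flatten_chain("obscore/all", chains)
```

The removal cases are harder because taking a run out of the chain changes the answer for datasets that are already in the ObsCore table.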
I have added one more section to the technote with a bunch of random ideas for client-side implementation.
Tim Jenness, do you want to re-review it (or you can just say it's OK to merge as is)?
What I'm not sure about is how decoupled we can make this. Should butler have some kind of system that lets you add callbacks that are passed a DatasetRef?
I think my current idea is to make it a new optional manager that will be called by other managers. We then could re-use all our tooling for optional loading of that manager and schema/data migration.
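As a rough illustration of that idea, and not of any existing daf_butler interface, the sketch below has a datasets manager that forwards inserts and deletes to an ObsCore manager when one is configured and skips the call otherwise. Every class and method name here (`DatasetRef` as a stand-in dataclass, `ObsCoreManager`, `DatasetManager`, `add_datasets`, `remove_datasets`) is hypothetical.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class DatasetRef:
    """Stand-in for a butler dataset reference (not the daf_butler class)."""

    id: uuid.UUID
    dataset_type: str
    run: str


class ObsCoreManager:
    """Hypothetical optional manager that owns the ObsCore rows."""

    def __init__(self) -> None:
        self.rows: Dict[uuid.UUID, dict] = {}

    def add_datasets(self, refs: Iterable[DatasetRef]) -> None:
        for ref in refs:
            self.rows[ref.id] = {"dataset_type": ref.dataset_type, "run": ref.run}

    def remove_datasets(self, refs: Iterable[DatasetRef]) -> None:
        for ref in refs:
            self.rows.pop(ref.id, None)


class DatasetManager:
    """Hypothetical datasets manager that calls the optional ObsCore manager."""

    def __init__(self, obscore: Optional[ObsCoreManager] = None) -> None:
        self.datasets: Dict[uuid.UUID, DatasetRef] = {}
        self._obscore = obscore

    def insert(self, refs: Iterable[DatasetRef]) -> None:
        refs = list(refs)
        for ref in refs:
            self.datasets[ref.id] = ref
        if self._obscore is not None:  # manager is optional
            self._obscore.add_datasets(refs)

    def delete(self, refs: Iterable[DatasetRef]) -> None:
        refs = list(refs)
        for ref in refs:
            self.datasets.pop(ref.id, None)
        if self._obscore is not None:
            self._obscore.remove_datasets(refs)


# Usage: one registry with the ObsCore manager loaded, one without.
obscore = ObsCoreManager()
with_obscore = DatasetManager(obscore)
without_obscore = DatasetManager()
ref = DatasetRef(uuid.uuid4(), "raw", "raw/all")
with_obscore.insert([ref])
without_obscore.insert([ref])
```

Compared with a generic callback registry, this keeps the ObsCore code inside the manager framework, which is what would allow the existing optional-loading and migration tooling to be reused.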
Thanks for the review; merged. The result is here: https://dmtn-236.lsst.io/
Tim Jenness, I have pushed my draft of a note. I'm not sure this is what people expect to see, so this is mostly for getting some feedback and suggestions for what else may be worth exploring there.