Details
- Type: Story
- Status: Done
- Resolution: Done
- Fix Version/s: None
- Component/s: daf_butler
- Labels:
- Story Points: 10
- Epic Link:
- Sprint: DB_F22_6
- Team: Data Access and Database
- Urgent?: No
Description
DM-35850 defined the main ideas for adding obscore table updates to Registry code. This ticket is for implementing the actual thing; I'll add more comments as I go.
Activity
The first one has the advantage that it only says "obscore" once, so I would tend to prefer it.
I made some progress with the implementation of the basic things that insert new records into the obscore table (without the spatial stuff yet). I'm stuck on implementing record removal and need some advice. In the registry the obscore table has a foreign key to the dataset table with ON DELETE CASCADE, so when one removes a dataset or a whole RUN collection the corresponding records are automatically dropped from obscore. But I am also adding obscore records when a dataset is associated or certified to a non-RUN collection (in the case when the RUN collection itself is not monitored by obscore and the dataset is not in obscore when it is associated/certified). Removing those records when they are disassociated/decertified is much more complicated: before I remove them from the obscore table, I need to make sure that they do not appear in any other RUN/CHAINED/TAGGED collections that are monitored by obscore. It may be possible to implement this sort of check, but I think it's going to be expensive, and I don't like doing this complicated stuff just for obscore. I am inclined to say/document that we do not remove things from obscore when we disassociate/decertify datasets. In fact, I'm not even sure that there is any utility in tracking TAGGED and CALIBRATED collections at all. Could we get away with ignoring associate/certify calls and only adding datasets on insert/import (for RUN/CHAINED collections)?
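To make the cascade behaviour described above concrete, here is a toy sqlite3 model: the obscore row carries a foreign key to the dataset table with ON DELETE CASCADE, so deleting a dataset (or all datasets in a RUN collection) drops the matching obscore row automatically. Table and column names here are illustrative, not the real daf_butler schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # sqlite enforces FKs only when enabled
conn.execute("CREATE TABLE dataset (id TEXT PRIMARY KEY, run TEXT NOT NULL)")
conn.execute(
    """CREATE TABLE obscore (
           dataset_id TEXT PRIMARY KEY
               REFERENCES dataset (id) ON DELETE CASCADE,
           dataproduct_type TEXT
       )"""
)
conn.execute("INSERT INTO dataset VALUES ('uuid-1', 'run/a')")
conn.execute("INSERT INTO obscore VALUES ('uuid-1', 'image')")

# Removing the RUN collection's datasets cascades into obscore: no extra
# bookkeeping is needed for this direction of the problem.
conn.execute("DELETE FROM dataset WHERE run = 'run/a'")
remaining = conn.execute("SELECT COUNT(*) FROM obscore").fetchone()[0]
print(remaining)  # 0: the obscore record disappeared with the dataset
```

It is the other direction (disassociate/decertify, where no such cascade applies) that needs the expensive membership check discussed above.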
I can definitely see we can make a case for not supporting CALIBRATED collections in ObsCore.
TAGGED is a bit trickier in the sense that you could see people starting to use TAGGED more widely in production as a way to overcome slow queries involving chains of hundreds of collections.
Maybe a workaround is an obscore admin script that can be run to clean up any obscore rows for tagged datasets that have now been deleted (assuming the obscore row can know it came from a tagged collection), and we document that it's not automated.
I would want to avoid having the obscore record know whether it came from a RUN or TAGGED collection; this would be another complication to manage on configuration changes. I'm OK with a separate cleanup script, as we can afford for it to be less efficient. There is a related issue for when the composition of a CHAINED collection changes (e.g. a non-empty RUN collection is added or removed), which is also non-trivial to handle during live updates. We can probably move all those slow operations to a cleanup/batch update script.
One more interesting obstacle for live obscore: Registry does not store regions for exposures (or exposure+detector dimensions). To work around that in a standalone script I matched the exposure to a visit and used the visit's region. For live updates, ingest of raw data happens before visits are defined, which means that when raw exposure+detector datasets are inserted into the registry there is no visit information yet and no way to guess a region for raws. Given that the Registry regions are inflated and are not exactly what Gregory wants in obscore, maybe we also need to populate obscore regions at a later point? Live obscore looks less and less live.
Live update for the "raw" dataset type was never going to be possible if a region is needed. "calexp" is really the first dataset type that has a region attached to it. I assume the region can be left out for raw, since otherwise calibration observations would be invisible to ObsCore. The question then becomes how to backfill the region information if define-visit runs (and does define-visit actually trigger this?).
I'm having some misgivings about where this is going, in terms of dependencies and complexity:
- I'm still not thrilled about adding a new spatial query system in parallel to the one we already have, based on not just PostgreSQL-specific but pgSphere-specific types, all because we are tied to a particular TAP implementation. If we're talking about this going in a manager implementation in daf_butler, those are pretty unfortunate dependencies to be adding (even implicitly, in the case of the TAP implementation), and while they'd be optional dependencies for users, they'd have to be considered required for developers due to the need to test that we don't break the ObsCore functionality.
- Guaranteeing consistency between Registry and Datastore during dataset writes and especially deletes is already super tricky and something we do not do as well as we should. Having a new hook in play for those operations will make that important work harder to get right, especially if they involve Python logic. I suspect they would at least need to be able to handle rollbacks rigorously, and that could be very hard, especially if we can't rely fully on ON DELETE CASCADE.
- If regions need to come from (e.g.) calexp WCSs instead of the Registry, we're actually talking about a Registry manager that knows not just about Datastore contents but the Exposure storage class as well. That's not something we want to put in daf_butler, especially since any test that exercises that region creation would need access to formatters defined over in obs_base (based on code in afw).
- Even if the manager implementation lived in obs_base (or some new package at around the same point in the package structure), with only an ABC for it in daf_butler, actually running that region-creation code is going to be pretty I/O heavy, and having it triggered deep inside butler when inserting datasets into particular collections means in practice that it's going to be run by processes like BPS transfer jobs, which we actually want to be quite focused on their main tasks, and not slowed down (or possibly disrupted) by add-on functionality.
All that makes me think that "live ObsCore" is just not going to work, at least not in terms of being able to provide regions (or, ideally, anything that involves inserts into a separate table rather than a true SQL view). I know Gregory Dubois-Felsmann has said in the past that the "live" ObsCore doesn't need to be fully featured, and I wonder if "whatever we can define via an actual SQL view" includes enough to be viable as what we support live. We could support that via new hooks in daf_butler for creating views or introspecting the actual SQL schema generated by the configured managers, and then let code in obs_base or a new package actually define the view when run (and then, being a view, it'd naturally stay updated). We could similarly support a more fully featured, non-live ObsCore by beefing up the opaque-table interfaces as necessary for other downstream code to augment the views with actual new tables. And if we need the non-live ObsCore view to behave like it's live in some specific context (e.g. for summit observers), then that sounds like it should really be a new service that maintains that view. As I understand it, we have a whole observatory event-handling system for that sort of thing. And we already have precedent here: inserting rows into an ObsCore table based on content in other butler tables feels a lot like inserting rows into the visit (and related) tables based on content in the exposure table, and we have never expected butler to be the thing that triggers visit definition.
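As a concrete sketch of the "actual SQL view" idea: a view over registry-style tables needs no insert/delete hooks at all, because it reflects the base tables at query time. The tables and column names below are invented for illustration, not the real registry schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dataset (id TEXT PRIMARY KEY, run TEXT, instrument TEXT,
                          exposure INTEGER);
    CREATE TABLE exposure (instrument TEXT, id INTEGER, obs_start TEXT);

    -- The ObsCore-like "table" is just a view joining existing tables.
    CREATE VIEW obscore_view AS
        SELECT d.id AS obs_publisher_did, e.obs_start AS t_min
        FROM dataset d
        JOIN exposure e ON e.instrument = d.instrument AND e.id = d.exposure;

    INSERT INTO exposure VALUES ('LSSTCam', 1, '2022-08-01T00:00:00');
    INSERT INTO dataset VALUES ('d1', 'run/a', 'LSSTCam', 1);
    """
)
print(conn.execute("SELECT COUNT(*) FROM obscore_view").fetchone()[0])  # 1

# Deleting the dataset needs no obscore bookkeeping; the view simply follows.
conn.execute("DELETE FROM dataset WHERE id = 'd1'")
print(conn.execute("SELECT COUNT(*) FROM obscore_view").fetchone()[0])  # 0
```

The trade-off, as noted above, is that a view can only expose whatever is already in the database, so anything requiring external I/O (e.g. regions from calexp WCSs) is out of scope for it.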
As a side note, if we need to get the calexp (etc.) WCS/bbox/region information into the database to allow all of this to work without a ton of new I/O, then I think we need to bite the bullet and tackle DM-21773 and add per-StorageClass metadata to Registry that's populated by the formatters.
If regions need to come from (e.g.) calexp WCSs instead of the Registry
I didn't think we were discussing that. Maybe I missed something. This has to work regardless of whether people are writing Exposure or NDData to their repo. We should start by assuming that we are copying the visit region to the raw ObsCore row. If we later come up with a way of storing the accurate calexp region somewhere that should be handled like a metric.
That was in response to this comment from Andy Salnikov:
Given that the Registry regions are inflated and are not exactly what Gregory wants in obscore, maybe we also need to populate obscore regions at a later point?
It is certainly true that s_region is not strictly mandatory in ObsCore, so we can discuss having some heterogeneity in this column.
Calibration exposures can of course be recorded without any of the spatial columns filled. An important way people will get to these is via DataLink and other provenance services, but we do want them browsable by type and time (e.g., "show me all the narrow-band flats taken yesterday afternoon").
We definitely need to discuss the boundaries of where the capability being developed will be used in the system - is this "internal use only" for making it possible to explore any repo in a basic way, with all user-facing query services done in some other way, or is it also applicable to serving AP data to science users?
I made a PR for daf_butler which captures the current state of the new obscore manager. Still missing features:
- spatial column(s) and index,
- s_region for raw dataset type is not filled (as visits are defined only after raw ingest),
- cleanup of the datasets after chained collection changes or disassociation from tagged collections.
The main reason for the PR is to check that the approach based on this new manager can work in general. Right now the implementation of the obscore manager interface is defined in daf_butler, and it does not bring any new dependencies. I think Jim suggested that it should be in some other package (obs_base?); I'd like to think about it more, as my impression is that it will make some things problematic.
Potentially we can deploy this code and start filling the new table (ignoring spatial things for now). Deployment for an existing repo will be done via the usual migration script. Obscore manager configuration lives in the butler_attributes table, similarly to the dimensions configuration, and similarly to dimensions we'll need to provide a migration script for every non-trivial change in the obscore configuration. I have not started working on the migration script yet, but it should be relatively simple. Obviously, to start deployment we need to make sure that pipelines are running a release which supports obscore.
Jim Bosch, could you check the daf_butler PR and tell me your impression and any suggestions? There is also a dax_obscore PR; you can ignore that for now.
Regarding manager location, if we decide that daf_butler is not a good idea then I think dax_obscore would be a much better option than obs_base. dax_obscore does already depend on daf_butler and it also doesn't depend on afw (hence this would still be usable by SPHEREx).
Usability by SPHEREx is definitely a goal.
I've done a very fast review, since I'll be off next week, and I'm fine with all of this for now. I think it's worth thinking a bit more about how to map what's in the ObsCore view to our collection system at some point; not having to support chained collections (or even tagged collections) would be a very nice simplification on its own, but even without that I wonder if we could provide a separate table that a TAP service could also see and join to in order to resolve collections at query time, in order to simplify the handling of collections at insert/delete time. In other words, if we simplified the "which collections do we watch" configuration to support only RUN collections, and that leads to putting more datasets in the ObsCore view than we really need, we could still come out ahead.
One issue with the new obscore manager being in a separate package is that anything that uses butler or registry with an obscore-configured repo would need to set up that extra package. So running any butler command would need e.g. dax_obscore, which effectively makes dax_obscore a dependency of daf_butler (a circular one, too). I guess this means that the manager has to be in daf_butler, unless we want to play some dirty tricks for specific repositories which have obscore data.
I can't get rid of the thought that implementing everything as a separate middleware HTTP service (on top of our middleware) would solve all our numerous problems.
Thinking about what that cleanup script would need to do, how to implement it, and how it can affect the schema of the obscore table.
The current obscore manager implementation only supports the following registry operations (and resulting invariants):
- adding datasets to a monitored RUN collection (or a RUN collection that is in a monitored CHAINED collection) via Registry.insertDatasets() or Registry._importDatasets()
- associating/certifying datasets with a monitored TAGGED/CALIBRATION collection (or one that is in a monitored CHAINED collection) via Registry.associate() and Registry.certify()
- a dataset can only appear once in the live obscore table (its UUID is the primary key in that table)
- if the dataset's RUN collection is also monitored then the dataset will be added to obscore naturally when it is added to the RUN collection; otherwise it will be added when it is associated/certified
- if a dataset is associated/certified multiple times with different validity timespans or in different collections it will still appear in obscore only once (obscore does not have validity ranges or collection names to distinguish them anyway)
- the dataset UUID is also a foreign key into the dataset table, so when a dataset is removed from the registry (individually or as part of a whole RUN collection) it also disappears from obscore
What is not covered by the above cases and what should be handled by the cleanup script:
- disassociating/decertifying datasets, either individually or due to removal of a whole TAGGED/CALIBRATION collection, which should result in removal from obscore in some cases
- datasets that come from a monitored RUN collection or belong to other monitored TAGGED/CALIBRATION collections should of course be kept in obscore
- changes in the composition of monitored CHAINED collections:
- if a non-empty RUN collection is added to a CHAINED collection, all matching datasets need to be added to obscore; the same applies to TAGGED/CALIBRATION collections
- removal of a RUN collection from a monitored CHAINED collection should remove its datasets, except when they are associated with monitored TAGGED/CALIBRATION collections
- removal of a TAGGED/CALIBRATION collection removes its datasets, except when they belong to a monitored RUN collection or other monitored TAGGED/CALIBRATION collections
Note that changes in the list of monitored collections that happen due to a configuration change are supposed to be handled by the schema migration script, though there is definitely an overlap in that a change in a CHAINED collection's composition also causes similar "migrations".
Doing those removals/additions after the fact in the cleanup script is an interesting problem. Checking those conditions (whether a dataset belongs to some other monitored collection) will be non-trivial, and I'm afraid it's not going to scale if we try to do it all on the Python side. What could help is if we could quickly determine which datasets are NOT in any monitored collection. I think something like that would be possible if we could have a foreign key to dataset associations/certifications (and use CASCADE to remove things that disappear). It's not possible in the current schema where we have multiple tags/calibs tables, so the foreign key cannot be defined (unless I repeat the same table structure used by tags/calibs). I need to think whether there is anything at all that can help us here.
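One way to keep that membership check on the SQL side rather than in Python is a NOT EXISTS query over both RUN membership and tagged associations. The sketch below uses a hypothetical, heavily simplified schema (single tags table, a `monitored` list materialized as a table), not the real registry layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dataset (id TEXT PRIMARY KEY, run TEXT NOT NULL);
    CREATE TABLE tagged_assoc (dataset_id TEXT, collection TEXT);
    CREATE TABLE monitored (collection TEXT PRIMARY KEY);
    CREATE TABLE obscore (dataset_id TEXT PRIMARY KEY);

    INSERT INTO monitored VALUES ('run/a'), ('tagged/x');
    INSERT INTO dataset VALUES ('d1', 'run/a'), ('d2', 'run/b'), ('d3', 'run/b');
    INSERT INTO tagged_assoc VALUES ('d2', 'tagged/x');
    INSERT INTO obscore VALUES ('d1'), ('d2'), ('d3');
    """
)
# d1 is covered via its monitored RUN, d2 via a monitored tagged collection;
# only d3 is no longer reachable from any monitored collection.
orphans = [
    row[0]
    for row in conn.execute(
        """
        SELECT o.dataset_id
        FROM obscore o JOIN dataset d ON d.id = o.dataset_id
        WHERE d.run NOT IN (SELECT collection FROM monitored)
          AND NOT EXISTS (
              SELECT 1 FROM tagged_assoc t
              JOIN monitored m ON m.collection = t.collection
              WHERE t.dataset_id = o.dataset_id)
        """
    )
]
print(orphans)  # ['d3']
conn.executemany("DELETE FROM obscore WHERE dataset_id = ?",
                 [(d,) for d in orphans])
```

With multiple per-dataset-type tags/calibs tables the inner EXISTS would have to become a union over all of them, which is where the real scaling concern comes in.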
Gregory Dubois-Felsmann, I have added a section to DMTN-236 which describes my latest idea on what to do with collections for the new obscore manager: https://dmtn-236.lsst.io/v/DM-35947/index.html#collection-issues-with-client-side-updates
In addition to what we discussed before (multiple run collections) I can implement support for a single tagged collection. That approach has one advantage over the run list (details are in the technote) but it shifts all the hard work to the clients, who will need to associate "worthy" datasets to a separate tagged collection. I'd like to get some input from you and anyone else on whether this could be useful before I start implementing all of that.
Jim Bosch, I think this ticket is ready for "final" review. There were some big changes since you last looked at it and I did not want to squash commits, but it may be easier to look at the complete diff instead of individual commits. dax_obscore adds a new butler command which updates obscore records with missing regions (raws), which I think is also worth reviewing.
I have not implemented the migration script yet, which would be necessary to start deploying this. We still need to do something for spatial indexing, and we also need to agree on configuration before we can deploy it (changing configuration needs a migration; I prefer to have a ~stable configuration as a starting point).
For the new obscore registry manager I will also need a configuration which is only used by that manager. I was thinking about combining the two into a single key under managers, e.g.:
registry:
  db: ...
  managers:
    attributes: lsst.daf.butler.registry.attributes.DefaultButlerAttributeManager
    obscore:
      cls: lsst.daf.butler.registry.obscore.ObsCoreTableManager
      config: !include obscore.yaml
This needs an extension to RegistryManagerTypes to understand sub-keys in managers configuration.
Alternatively I can do:
registry:
  db: ...
  managers:
    attributes: lsst.daf.butler.registry.attributes.DefaultButlerAttributeManager
    obscore: lsst.daf.butler.registry.obscore.ObsCoreTableManager
  obscore: !include obscore.yaml
Even with the latter, some changes to RegistryManagerTypes are necessary, as I need to pass the obscore configuration to the obscore manager instance.
Tim Jenness, what do you think about extending managers configuration to allow cls/config subkeys?
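A possible shape for the cls/config extension is to normalize each manager entry to a (class name, config) pair, accepting either the existing bare-string form or the new mapping form. The function and names below are illustrative, not the actual RegistryManagerTypes API.

```python
from typing import Any


def normalize_manager_entry(entry: Any) -> tuple[str, dict]:
    """Return (fully qualified class name, extra config) for one manager key.

    Accepts either the current bare-string form or a mapping with ``cls``
    and optional ``config`` sub-keys, as proposed above.
    """
    if isinstance(entry, str):
        return entry, {}
    if isinstance(entry, dict) and "cls" in entry:
        return entry["cls"], entry.get("config", {})
    raise TypeError(f"Unexpected manager configuration: {entry!r}")


# Mixed-style managers section, mirroring the first YAML example; the
# obscore config dict stands in for the !include'd obscore.yaml content.
managers = {
    "attributes": "lsst.daf.butler.registry.attributes.DefaultButlerAttributeManager",
    "obscore": {
        "cls": "lsst.daf.butler.registry.obscore.ObsCoreTableManager",
        "config": {"collections": ["some/run"]},  # hypothetical content
    },
}
for name, entry in managers.items():
    cls, config = normalize_manager_entry(entry)
    print(name, cls, bool(config))
```

Existing configurations keep working unchanged, and only managers that actually need extra configuration pay for the new mapping form.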