  1. Data Management
  2. DM-35532

Investigate feasibility of ObsCore as a view on Butler registry


    Details

    • Story Points:
      4
    • Team:
      Data Access and Database
    • Urgent?:
      No

      Description

      When data are taken, the files are transferred to the USDF and ingested into a butler registry. A lot of the general tooling developed by SQuaRE relies on ObsTAP queries of an ObsCore database table. Currently we generate the content of the ObsCore table by doing an explicit export of a butler registry using dax_obscore. This works, but it is an additional step that would have to run as part of the automated data ingest, explicitly exporting the records for the new raws and adding them to the ObsCore table. Presumably some data processing will also happen at USDF and we would like, for example, calexp images to be available to other systems as soon as the processing completes. Do we want to have to do an ObsCore export/import every time a batch job completes?

      Things would be simpler if we could have an ObsCore view of the butler registry that automatically updates when new files are added or deleted. This ticket is to investigate the feasibility of this and to write a tech note with options.

            Activity

            salnikov Andy Salnikov added a comment -

            Tim Jenness, I have pushed my draft of a note. I'm not sure this is what people expect to see, so this is mostly for getting some feedback and suggestions for what else may be worth exploring there.

            salnikov Andy Salnikov added a comment -

            The formatted version: https://dmtn-236.lsst.io/v/DM-35532/index.html
            PR: https://github.com/lsst-dm/dmtn-236/pull/1
            tjenness Tim Jenness added a comment -

            I've read the tech note and it looks really good to me.

            I agree that the only tractable option is to have butler "own" an ObsCore table and have butler insert into ObsCore whenever a butler put/ingest/transfer_from occurs, using the dax_obscore configuration system.

            What I'm not sure about is how decoupled we can make this. Should butler have some kind of system that lets you add callbacks that are passed a DatasetRef? This would essentially allow the ObsCore code to be decoupled from Butler – dax_obscore would have a function registered with butler via config but would otherwise be distinct. If we are tightly coupled where does that leave dax_obscore? Do we fully integrate it into daf_butler? How do we set up the ObsCore table such that deleting a dataset from registry will automatically delete it from the ObsCore table? Can we add a dataset_id UUID to the ObsCore table to make it easy to relate a row in ObsCore to a butler dataset?
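The dataset_id and automatic-deletion questions above can be illustrated with a foreign key that cascades deletes. This is only a minimal sketch using SQLite from the Python standard library; the table names, columns, run name, and URL are hypothetical and much simpler than the real registry or ObsCore schema.

```python
import sqlite3
import uuid

# Hypothetical, heavily simplified schema; not the actual registry layout.
conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE dataset (id TEXT PRIMARY KEY, run TEXT NOT NULL)")
conn.execute(
    """
    CREATE TABLE obscore (
        dataset_id TEXT PRIMARY KEY
            REFERENCES dataset (id) ON DELETE CASCADE,
        dataproduct_type TEXT,
        access_url TEXT
    )
    """
)


def obscore_count(connection):
    """Return the number of rows currently in the obscore table."""
    return connection.execute("SELECT COUNT(*) FROM obscore").fetchone()[0]


# Insert one dataset and its matching ObsCore row, keyed by the same UUID.
dataset_id = str(uuid.uuid4())
conn.execute("INSERT INTO dataset VALUES (?, ?)", (dataset_id, "some/run"))
conn.execute(
    "INSERT INTO obscore VALUES (?, ?, ?)",
    (dataset_id, "image", "https://example.org/placeholder"),
)
rows_before = obscore_count(conn)

# Deleting the dataset removes the ObsCore row through the cascade.
conn.execute("DELETE FROM dataset WHERE id = ?", (dataset_id,))
rows_after = obscore_count(conn)
```

With this layout the database itself keeps the two tables consistent on deletion, so no client-side code has to remember to clean up the ObsCore row.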

            gpdf Gregory Dubois-Felsmann added a comment -

            Two comments/questions:

            1. The s_region column in the database does not necessarily have to be usable in a WHERE clause as an actual string - e.g., eligible for use with the LIKE operator. One can imagine an implementation (regardless of how the ObsCore table is actually populated) in which the column in the database really is the packed-binary object, and somehow it's the TAP service itself (or some other part of the query layer) that materializes it as an ASCII POLYGON string in a query result. It might simplify things if we stay in the native geometry object representation within the database, I think? I believe this may be what CADC does in its Postgres database + TAP combo, from something I recall hearing.
            2. I think I understand that the "find_first" problem that you mention in the note is associated with the use of chained collections in defining the membership of the ObsCore table. In the client-side trigger model that appears to be emerging as favored, how would this be handled? If the same (dataID, datasetType) pair is, for ObsCore purposes, effectively redefined by a subsequent put() to a collection "ahead of" where the previous instance of that pair was, does the trigger do an UPDATE against the ObsCore table?
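As an illustration of the first point, the query layer could build the ASCII string from polygon vertices at result time, leaving the stored column in whatever native form the database prefers. This is a minimal sketch: the function name and its parameters are hypothetical, and the "POLYGON ICRS ra dec ..." layout is assumed here rather than taken from any particular TAP implementation.

```python
def polygon_to_s_region(vertices, frame="ICRS", precision=6):
    """Format (ra, dec) vertex pairs in degrees as a POLYGON string.

    Hypothetical helper: a query layer could call something like this
    after decoding the database-native geometry into vertices.
    """
    if len(vertices) < 3:
        raise ValueError("a polygon needs at least three vertices")
    coords = " ".join(
        f"{ra:.{precision}f} {dec:.{precision}f}" for ra, dec in vertices
    )
    return f"POLYGON {frame} {coords}"


example = polygon_to_s_region(
    [(10.0, -5.0), (10.5, -5.0), (10.5, -4.5)], precision=1
)
```

Decoding the packed-binary or native geometry into vertices is the backend-specific part and is deliberately left out of the sketch.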
            gpdf Gregory Dubois-Felsmann added a comment -

            Just FYI, please see DM-35740 for a separate issue regarding the usability of the Registry-to-ObsCore conversion code, and a possible refactoring. It might be worth including in whatever architecture is developed to implement the client-side trigger model.

            gpdf Gregory Dubois-Felsmann added a comment -

            Final comment: right now dax_obscore has very limited dependencies, which makes it usable in non-Rubin applications (as we will do in SPHEREx). I have no specific reason to worry that this might change, but please do try to keep these dependencies slim if you develop the client-side trigger model in practice. We would certainly use it immediately in SPHEREx.

            salnikov Andy Salnikov added a comment -

            Gregory Dubois-Felsmann, I agree that some of the region formatting job can be shifted to the TAP service. Still, to be efficient for spatial queries, we need the region in a database-native representation suitable for indexing. The trouble is, of course, that database-native is also database-specific, and it would be nice to avoid backend-specific features in TAP. I think this is not a very hard problem to solve, it just needs some attention.

            For the find-first issue and client-side "trigger", the insertion test should be easy: one just has to check that the run collection for the new dataset ref is a member of the configured chained collection. There are more interesting cases for removal of datasets, either as an explicit dataset purge, or removal of a run collection from a chained collection. I'll try to expand my technote with ideas for those cases.
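The insertion test described above could look roughly like the following. This is a minimal sketch that models chained collections as a plain dictionary instead of using the registry API; all collection names and both function names are hypothetical.

```python
def flatten_chain(name, chains):
    """Return the run collections reachable from ``name`` in search order.

    ``chains`` maps a chained collection name to its ordered children.
    Any name that is not a key is treated as a run collection.
    Cycles are not handled in this sketch.
    """
    children = chains.get(name)
    if children is None:
        return [name]
    flat = []
    for child in children:
        for run in flatten_chain(child, chains):
            if run not in flat:
                flat.append(run)
    return flat


def should_insert(run, chained_collection, chains):
    """True if a dataset in ``run`` belongs in the ObsCore table."""
    return run in flatten_chain(chained_collection, chains)


# Hypothetical collection layout used only for illustration.
chains = {
    "obscore/all": ["processed", "raw/all"],
    "processed": ["runs/2022-08-01", "runs/2022-07-15"],
}
search_order = flatten_chain("obscore/all", chains)
```

The flattened list also preserves search order, which is the information a find-first check or a later UPDATE decision would need.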

            salnikov Andy Salnikov added a comment -

            I have added one more section to the technote with a bunch of random ideas for client-side implementation.

            Tim Jenness, do you want to re-review it? (Or you can just say it's OK to merge as it is.)

            "What I'm not sure about is how decoupled we can make this. Should butler have some kind of system that lets you add callbacks that are passed a DatasetRef?"

            I think my current idea is to make it a new optional manager that will be called by other managers. We then could re-use all our tooling for optional loading of that manager and schema/data migration.
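One way to picture an optional manager that other managers call is sketched below. Every class and method name here is hypothetical and none of it is the daf_butler manager interface; the sketch only shows the control flow, with the ObsCore manager being optional and the datasets manager notifying it on insert and delete.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical stand-in for the information carried by a dataset ref."""

    dataset_id: uuid.UUID
    dataset_type: str
    run: str


class ObsCoreHook(Protocol):
    """Interface the datasets manager expects from an ObsCore manager."""

    def add_datasets(self, records: list[DatasetRecord]) -> None: ...

    def remove_datasets(self, dataset_ids: list[uuid.UUID]) -> None: ...


class InMemoryObsCoreManager:
    """Toy ObsCore manager keeping rows in a dict keyed by dataset ID."""

    def __init__(self) -> None:
        self.rows: dict[uuid.UUID, DatasetRecord] = {}

    def add_datasets(self, records: list[DatasetRecord]) -> None:
        for record in records:
            self.rows[record.dataset_id] = record

    def remove_datasets(self, dataset_ids: list[uuid.UUID]) -> None:
        for dataset_id in dataset_ids:
            self.rows.pop(dataset_id, None)


class DatasetsManager:
    """Toy datasets manager that forwards changes to an optional hook."""

    def __init__(self, obscore: Optional[ObsCoreHook] = None) -> None:
        self._datasets: dict[uuid.UUID, DatasetRecord] = {}
        self._obscore = obscore

    def insert(self, records: list[DatasetRecord]) -> None:
        for record in records:
            self._datasets[record.dataset_id] = record
        if self._obscore is not None:
            self._obscore.add_datasets(records)

    def delete(self, dataset_ids: list[uuid.UUID]) -> None:
        for dataset_id in dataset_ids:
            self._datasets.pop(dataset_id, None)
        if self._obscore is not None:
            self._obscore.remove_datasets(dataset_ids)
```

When no ObsCore manager is configured the datasets manager behaves exactly as before, which is the property that makes the extra manager optional.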

            salnikov Andy Salnikov added a comment -

            Thanks for the review. Merged; the result is here: https://dmtn-236.lsst.io/


              People

              Assignee:
              salnikov Andy Salnikov
              Reporter:
              tjenness Tim Jenness
              Reviewers:
              Tim Jenness
              Watchers:
              Andy Salnikov, Frossie Economou, Gregory Dubois-Felsmann, Jim Bosch, Kian-Tat Lim, Tim Jenness
