Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-33493

Add option to save Datastore records to QGs

    XMLWordPrintable

    Details

    • Story Points:
      6
    • Sprint:
      DB_S22_12
    • Team:
      Data Access and Database
    • Urgent?:
      No

      Description

      DM-32072 adds an in-memory attribute that can be used to store opaque table records for datastores in Quantum objects, as well as a QuantumBackedButler that can use these.  But we also need to extract the records for overall inputs from the origin data repository at the end of QG (via the new export_records method on Datastore), and update the QG on-disk data structure to include them.  This should be an option that is disabled by default, until have tested its role in a complete replacement for execution butler.

        Attachments

          Issue Links

            Activity

            Hide
            salnikov Andy Salnikov added a comment -

            Jim Bosch, I think this is ready for review. There is one thing that I do not particularly like (for some time already) is the FakeDatasetRef, in this case it creates dependency cycles and I had to add a workaround for it (import inside method). Would it be better to move that class to core, or maybe even make it a base class of DatasetRef? (but I probably miss some perspective on why this class was made separate in the first place). Also pydatic causes some grief, but I think this is common.

            Show
            salnikov Andy Salnikov added a comment - Jim Bosch , I think this is ready for review. There is one thing that I do not particularly like (for some time already) is the FakeDatasetRef , in this case it creates dependency cycles and I had to add a workaround for it (import inside method). Would it be better to move that class to core , or maybe even make it a base class of DatasetRef ? (but I probably miss some perspective on why this class was made separate in the first place). Also pydatic causes some grief, but I think this is common.
            Hide
            jbosch Jim Bosch added a comment -

            I've suggested some changes to the data structures on the PRs that (if you agree) might take another afternoon of work, but otherwise looks great.

            I'd also love to get rid of FakeDatasetRef, but my plan was a pretty intrusive/disruptive one that I hadn't prioritized (and I haven't thought it through in detail, let alone ask Tim Jenness about it): I wanted to just change all of the (mostly Datastore) methods that use it to use dataset IDs directly instead. That will require some code that interacts with them to do things like pass (ref.getCheckedId() for ref in refs), because a container of DatasetRef is what they actually have. I think that generally moves those getCheckedId calls to better places, though I'm actually hoping the the QuantumBackedButler-driven change of generating UUIDs for all dataset in the QG will remove what I think is our only use case for unresolved DatasetRef instances, letting us guarantee DatasetRef.id is never None and dropping getCheckedId. This is also related to long-stalled work on DM-30332, where I was hoping to add a custom dict-backed container for DatasetRef that would let you get the IDs (the keys of the backing dict, at least most of the time) without having to iterate; if we used those in the right places, dropping FakeDatasetRef in favor of bare dataset IDs would be even easier.

            In the meantime, I'm fine with moving FakeDatasetRef to core, but I think making it a base class of DatasetRef will just make removing it more of a pain later (if there's agreement that's a good long-term plan anyway).

            Show
            jbosch Jim Bosch added a comment - I've suggested some changes to the data structures on the PRs that (if you agree) might take another afternoon of work, but otherwise looks great. I'd also love to get rid of FakeDatasetRef , but my plan was a pretty intrusive/disruptive one that I hadn't prioritized (and I haven't thought it through in detail, let alone ask Tim Jenness about it): I wanted to just change all of the (mostly Datastore ) methods that use it to use dataset IDs directly instead. That will require some code that interacts with them to do things like pass (ref.getCheckedId() for ref in refs) , because a container of DatasetRef is what they actually have. I think that generally moves those getCheckedId calls to better places, though I'm actually hoping the the QuantumBackedButler -driven change of generating UUIDs for all dataset in the QG will remove what I think is our only use case for unresolved DatasetRef instances, letting us guarantee DatasetRef.id is never None and dropping getCheckedId . This is also related to long-stalled work on DM-30332 , where I was hoping to add a custom dict -backed container for DatasetRef that would let you get the IDs (the keys of the backing dict , at least most of the time) without having to iterate; if we used those in the right places, dropping FakeDatasetRef in favor of bare dataset IDs would be even easier. In the meantime, I'm fine with moving FakeDatasetRef to core, but I think making it a base class of DatasetRef will just make removing it more of a pain later (if there's agreement that's a good long-term plan anyway).
            Hide
            tjenness Tim Jenness added a comment -

            I think FakeDatasetRef exists solely for emptyTrash. The problem being that the trash table has dataset_id but not any of the other DatasetRef knowledge, but all the Datastore APIs want DatasetRef. Creating a pretend class for that that just had an id method was the quickest way to get something working. mypy makes it annoying of course.

            Show
            tjenness Tim Jenness added a comment - I think FakeDatasetRef exists solely for emptyTrash. The problem being that the trash table has dataset_id but not any of the other DatasetRef knowledge, but all the Datastore APIs want DatasetRef. Creating a pretend class for that that just had an id method was the quickest way to get something working. mypy makes it annoying of course.
            Hide
            tjenness Tim Jenness added a comment -

            Would it help to define a typing Protocol thing that Datastore can use so it can declare it's returning something that has an id method but then we don't care what the class actually is?

            Show
            tjenness Tim Jenness added a comment - Would it help to define a typing Protocol thing that Datastore can use so it can declare it's returning something that has an id method but then we don't care what the class actually is?
            Hide
            jbosch Jim Bosch added a comment -

            My assumption - which may be wrong - was that relatively few Datastore APIs actually needed the full DatasetRef, since they actually only need the run and data ID to construct templates. I suppose it's the dataset type (and particularly the storage class and component-ness) that's needed as well most of the time?

            Show
            jbosch Jim Bosch added a comment - My assumption - which may be wrong - was that relatively few Datastore APIs actually needed the full DatasetRef , since they actually only need the run and data ID to construct templates. I suppose it's the dataset type (and particularly the storage class and component-ness) that's needed as well most of the time?
            Hide
            salnikov Andy Salnikov added a comment -

            Jim Bosch, I am re-opening this for review, there is one more commit on daf_butler PR which changes the structure of DatastoreRecordData, its serialization, and also updates QuantumProvenanceData to use the common class.

            Show
            salnikov Andy Salnikov added a comment - Jim Bosch , I am re-opening this for review, there is one more commit on daf_butler PR which changes the structure of DatastoreRecordData, its serialization, and also updates QuantumProvenanceData to use the common class.
            Hide
            salnikov Andy Salnikov added a comment -

            Thanks for review and suggestions, merged.

             

            Show
            salnikov Andy Salnikov added a comment - Thanks for review and suggestions, merged.  

              People

              Assignee:
              salnikov Andy Salnikov
              Reporter:
              jbosch Jim Bosch
              Reviewers:
              Jim Bosch
              Watchers:
              Andy Salnikov, Jim Bosch, Tim Jenness
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:

                  Jenkins

                  No builds found.