Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-24614

Move dataset_location tables into manager/storage hierarchy

    XMLWordPrintable

    Details

    • Story Points:
      4
    • Epic Link:
    • Team:
      Data Release Production
    • Urgent?:
      No

      Description

      This follows the pattern used for other Registry systems. After this change, we could consider giving Datastore instances these classes and a manager for the opaque tables instead of a full Registry, would be better for interface segregation.

      The default implementation of these manager/storage classes should not modify the current schema, but it will give us flexibility to change it via configuration in the future, so it probably still merits being part of the stability milestone.

        Attachments

          Issue Links

            Activity

            Hide
            jbosch Jim Bosch added a comment -

            Tim Jenness, I'm contemplating making some changes to the Datastore/Registry boundary (code, not conceptual) that I wanted to run by you first:

            • I'd like to remove most of the DatasetLocation methods from Registry itself, because Datastores should be the only ones calling them, and Datastores don't need the vast majority of the Registry methods.  Instead, there'd be one method on Registry to register a new Datastore that would return a bridge object that can do all of the things a Datastore needs to do with the Registry.  Datastores would just call that in their constructors and hold on to the bridge object instead of the Registry instance.
            • I think it'll be pretty easy to implement ephemeral dataset locations for InMemoryDatastore at the same time; that'll just be a dict-based implementation of the bridge object.  Eventually we may want to replace that with something based on temporary database tables so we can include existence-in-datastore in more general dataset queries, but because there will an ABC for the bridge class Datastores shouldn't notice that change.
            • I'm wondering whether it would be better to have per-Datastore trash tables, which would be pretty easy to implement as well (assuming we have a way to get a sanitized datastore name to use when making the table name).  Though I also wonder if we could just rely on the set difference of (<internal-datastore-table> -< dataset_location where datastore=self>) as trash, and save some database writes?  Did we discuss this already?
            Show
            jbosch Jim Bosch added a comment - Tim Jenness , I'm contemplating making some changes to the Datastore/Registry boundary (code, not conceptual) that I wanted to run by you first: I'd like to remove most of the DatasetLocation methods from Registry itself, because Datastores should be the only ones calling them, and Datastores don't need the vast majority of the Registry methods.  Instead, there'd be one method on Registry to register a new Datastore that would return a bridge object that can do all of the things a Datastore needs to do with the Registry.  Datastores would just call that in their constructors and hold on to the bridge object instead of the Registry instance. I think it'll be pretty easy to implement ephemeral dataset locations for InMemoryDatastore at the same time; that'll just be a dict-based implementation of the bridge object.  Eventually we may want to replace that with something based on temporary database tables so we can include existence-in-datastore in more general dataset queries, but because there will an ABC for the bridge class Datastores shouldn't notice that change. I'm wondering whether it would be better to have per-Datastore trash tables, which would be pretty easy to implement as well (assuming we have a way to get a sanitized datastore name to use when making the table name).  Though I also wonder if we could just rely on the set difference of (<internal-datastore-table> -< dataset_location where datastore=self>) as trash, and save some database writes?  Did we discuss this already?
            Hide
            tjenness Tim Jenness added a comment -

            The dataset_location table was meant to be a registry level table that is public and allows registry to answer general questions about whether a dataset is stored anywhere and how many copies of a dataset exist if we use chained datastores. The posix_datastore_records table is designed to be entirely internal and registry is an implementation detail. I'm fine with datastore using a more constrained class for writing to the registry – I assume it will effectively have zero impact on datastore other than being more explicit about the interface.

            A fake connection object for InMemoryDatastore is fine with me – the interface to that table is already factored out in datastore in genericDatastore.py. No-one is really using the dataset_location tables for their intended use other than datastore anyhow. We already use dict in place of registry for internal records so wouldn't need a fancy connection object for dealing with what we currently call "opaqueData".

            The code that moves datasets from dataset_location to dataset_location_trash is shared by all datastores so it's not a big job to switch that to something else. What do we really gain by having a table per datastore rather than using the name in the datastore as a column. On the one hand it becomes an internal implementation detail (like posix_datastore_records), but on the other hand you have to worry a bit more about transactions and you can't query one place for where all the trashed but somehow not trashed yet datasets are listed.

            Show
            tjenness Tim Jenness added a comment - The dataset_location table was meant to be a registry level table that is public and allows registry to answer general questions about whether a dataset is stored anywhere and how many copies of a dataset exist if we use chained datastores. The posix_datastore_records table is designed to be entirely internal and registry is an implementation detail. I'm fine with datastore using a more constrained class for writing to the registry – I assume it will effectively have zero impact on datastore other than being more explicit about the interface. A fake connection object for InMemoryDatastore is fine with me – the interface to that table is already factored out in datastore in genericDatastore.py. No-one is really using the dataset_location tables for their intended use other than datastore anyhow. We already use dict in place of registry for internal records so wouldn't need a fancy connection object for dealing with what we currently call "opaqueData". The code that moves datasets from dataset_location to dataset_location_trash is shared by all datastores so it's not a big job to switch that to something else. What do we really gain by having a table per datastore rather than using the name in the datastore as a column. On the one hand it becomes an internal implementation detail (like posix_datastore_records), but on the other hand you have to worry a bit more about transactions and you can't query one place for where all the trashed but somehow not trashed yet datasets are listed.
            Hide
            jbosch Jim Bosch added a comment -

            I was planning for Registry to keep the DatasetLocation methods that return information to the user, and just remove the ones that modify the information.

            The opaque table interface would also remain on Registry because it might be of more general use, though I would also put those on the bridge object so Datastores don't need to carry around two things.

            For the trash tables, I was thinking that it might be better if they were a Datastore implementation detail (but like the opaque tables, something they might delegate back to Registry, via the bridge object). InMemoryDatastore and ChainedDatastore don't need one, and an in-the-same-database SQL-based Datastore probably wouldn't need one either. And whether to have one or use the set difference with an internal table could also be an implementation (or configuration) choice for Posix and S3. But my reasoning for separate trash tables was just that sort of separation-of-concerns musing, not something I think would be more efficient or safe.

            Show
            jbosch Jim Bosch added a comment - I was planning for Registry to keep the DatasetLocation methods that return information to the user, and just remove the ones that modify the information. The opaque table interface would also remain on Registry because it might be of more general use, though I would also put those on the bridge object so Datastores don't need to carry around two things. For the trash tables, I was thinking that it might be better if they were a Datastore implementation detail (but like the opaque tables, something they might delegate back to Registry, via the bridge object). InMemoryDatastore and ChainedDatastore don't need one, and an in-the-same-database SQL-based Datastore probably wouldn't need one either. And whether to have one or use the set difference with an internal table could also be an implementation (or configuration) choice for Posix and S3. But my reasoning for separate trash tables was just that sort of separation-of-concerns musing, not something I think would be more efficient or safe.
            Hide
            jbosch Jim Bosch added a comment -

            Tim Jenness, I gather you're slammed with reviews (and probably more), but I think you're the only viable reviewer for this one. No rush - I've got orthogonal tickets I can work on - and unless you think it'll be a long wait I'm happy to rebase on whatever typing changes you've made before merge (or even your review).

            PR is https://github.com/lsst/daf_butler/pull/281

            Show
            jbosch Jim Bosch added a comment - Tim Jenness , I gather you're slammed with reviews (and probably more), but I think you're the only viable reviewer for this one. No rush - I've got orthogonal tickets I can work on - and unless you think it'll be a long wait I'm happy to rebase on whatever typing changes you've made before merge (or even your review). PR is https://github.com/lsst/daf_butler/pull/281
            Hide
            tjenness Tim Jenness added a comment -

            This looks really good to me. I like how you use the ephemeral version for testing. I'm fine with you merging it and me trying to rebase.

            Show
            tjenness Tim Jenness added a comment - This looks really good to me. I like how you use the ephemeral version for testing. I'm fine with you merging it and me trying to rebase.

              People

              Assignee:
              jbosch Jim Bosch
              Reporter:
              jbosch Jim Bosch
              Reviewers:
              Tim Jenness
              Watchers:
              Jim Bosch, Tim Jenness
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:

                  Jenkins

                  No builds found.