Data Management / DM-21231

Refactor Registry handling of dataset and associated tables

    Details

    • Story Points:
      0
    • Team:
      Data Release Production

      Description

      This is the second half of the big registry overhaul (the big part being DM-17023).  Plans have been described in detail at https://docs.google.com/presentation/d/1KxVmRN_8S4GskyGxEkeoX7tn5c8b9xViy73JU4dkXOA/edit?usp=sharing.

       

      Goals include:

      • Normalizing the dataset table into different tables for each dataset type.  This should improve query performance and give us flexibility in how we store metadata associated with datasets (possibly including regions and timestamps that are currently restricted to dimensions).
      • Restructuring the registry codebase towards supporting eventual chained-schema registries.
      • Enabling bulk inserts of datasets during ingest.  This will require changes to Datastore as well (particularly its relationship with Registry).
      • Addressing performance and concurrency problems in our usage of transactions.
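
As a rough illustration of the first, third, and fourth goals together, the sketch below keeps one table per dataset type and inserts a whole batch of datasets inside a single transaction. It uses the standard-library sqlite3 module only to stay self-contained; the table layout, the `dataset_<type>` naming, and the helper names (`dataset_table_name`, `ensure_dataset_table`, `bulk_insert`) are assumptions made for this example and are not the actual Registry schema or API.

```python
import sqlite3
from typing import Iterable


def dataset_table_name(dataset_type: str) -> str:
    """Return the (hypothetical) per-dataset-type table name."""
    # The name is interpolated into SQL, so only accept plain identifiers.
    if not dataset_type.isidentifier():
        raise ValueError(f"unsafe dataset type name: {dataset_type!r}")
    return f"dataset_{dataset_type}"


def ensure_dataset_table(conn: sqlite3.Connection, dataset_type: str) -> str:
    """Create the table for one dataset type if it does not exist yet."""
    table = dataset_table_name(dataset_type)
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} ("
        "dataset_id INTEGER PRIMARY KEY, "
        "collection TEXT NOT NULL, "
        "data_id TEXT NOT NULL, "
        # Assumed constraint: one dataset per data ID within a collection.
        "UNIQUE (collection, data_id))"
    )
    return table


def bulk_insert(
    conn: sqlite3.Connection,
    dataset_type: str,
    collection: str,
    data_ids: Iterable[str],
) -> int:
    """Insert many datasets of one type in a single transaction."""
    table = ensure_dataset_table(conn, dataset_type)
    rows = [(collection, data_id) for data_id in data_ids]
    # The connection context manager commits on success and rolls back
    # the whole batch if any row violates a constraint.
    with conn:
        conn.executemany(
            f"INSERT INTO {table} (collection, data_id) VALUES (?, ?)", rows
        )
    return len(rows)
```

With this shape, a batch that contains one conflicting row leaves the table unchanged, and the transaction stays open only for the duration of one `executemany` call.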

      Whenever possible, I'll try to split this up into smaller tickets.  The sheer size of DM-17023 has become a problem of its own, though I'm not sure how much further I could have split it up.  Happily, this ticket should have a much smaller effect on public interfaces, though it will still involve some broad breaking changes.

            Activity

            jbosch Jim Bosch added a comment - - edited

            Note to my future self: some datasets, like raw and (post DM-17023) reference catalogs, should have their "one DatasetType+DataId" constraint valid across all collections, not just within any one collection.  That means we don't actually need to use collections when retrieving them, as long as we are explicitly asked for that DatasetType.  This is a potentially important optimization that we should find a way to take advantage of.
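
A toy sketch of that optimization, under the assumption that some dataset types are flagged as unique across all collections: lookups for those types can skip the collection search entirely. The `ToyRegistry` class, its methods, and the dict-based storage are hypothetical and exist only to show the idea; they are not the Registry interface.

```python
from typing import Dict, Iterable, Optional, Sequence, Set, Tuple


class ToyRegistry:
    """In-memory stand-in for a registry (hypothetical, for illustration)."""

    def __init__(self, globally_unique_types: Iterable[str]) -> None:
        # Dataset types whose (DatasetType, DataId) pair is assumed to be
        # unique across every collection, e.g. "raw".
        self._globally_unique: Set[str] = set(globally_unique_types)
        self._global: Dict[Tuple[str, str], int] = {}
        self._by_collection: Dict[Tuple[str, str, str], int] = {}

    def insert(
        self, dataset_type: str, data_id: str, collection: str, dataset_id: int
    ) -> None:
        if dataset_type in self._globally_unique:
            key = (dataset_type, data_id)
            if key in self._global:
                raise ValueError(f"{key} already exists in some collection")
            self._global[key] = dataset_id
        self._by_collection[(dataset_type, data_id, collection)] = dataset_id

    def find(
        self, dataset_type: str, data_id: str, collections: Sequence[str]
    ) -> Optional[int]:
        if dataset_type in self._globally_unique:
            # Fast path: at most one match can exist anywhere, so the
            # collections argument is deliberately not consulted.
            return self._global.get((dataset_type, data_id))
        # General path: search the collections in the order given.
        for collection in collections:
            found = self._by_collection.get((dataset_type, data_id, collection))
            if found is not None:
                return found
        return None
```

Note that the fast path returns the dataset even when it lives in a collection the caller did not list; whether that is acceptable is part of what the real design would have to decide.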

            jbosch Jim Bosch added a comment -

            I'm zeroing out the story points here to reflect the fact that I expect to do all work on linked tickets.

            jbosch Jim Bosch added a comment -

            I've removed the tickets here related to StorageClass metadata (which is not a near-term goal) and added the one for CALIBRATION-type collections; that makes this ticket an umbrella for everything described on the prototyping Confluence page, and something we could expect to close in the not-too-distant future.


              People

              • Assignee:
                jbosch Jim Bosch
                Reporter:
                jbosch Jim Bosch