Details
- Type: Story
- Status: Done
- Resolution: Done
- Fix Version/s: None
- Component/s: daf_butler
- Story Points: 0
- Team: Data Release Production
Description
This is the second half of the big registry overhaul (the big part being DM-17023). Plans have been described in detail at https://docs.google.com/presentation/d/1KxVmRN_8S4GskyGxEkeoX7tn5c8b9xViy73JU4dkXOA/edit?usp=sharing.
Goals include:
- Normalizing the dataset table into different tables for each dataset type. This should improve query performance and give us flexibility in how we store metadata associated with datasets (possibly including regions and timestamps that are currently restricted to dimensions).
- Restructuring the registry codebase towards supporting eventual chained-schema registries.
- Enabling bulk inserts of datasets during ingest. This will require changes to Datastore as well (particularly its relationship with Registry).
- Addressing performance and concurrency problems in our usage of transactions.
Whenever possible, I'll try to split this up into smaller tickets. The sheer size of DM-17023 has become a problem of its own, though I'm not sure how much further I could have split it up. Happily, this ticket should have a much smaller effect on public interfaces, though it will still involve some broad breaking changes.
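As a rough illustration of the first and third goals (one table per dataset type, plus bulk inserts), here is a minimal sketch using the standard-library `sqlite3` module. Everything in it is an assumption made for illustration: the `dataset_<type>` table naming, the `run` column, the integer dimension columns, and the helper names `ensure_dataset_table` and `bulk_insert_datasets` are hypothetical and are not the daf_butler schema or API.

```python
import sqlite3


def _check_identifier(name):
    # Names are interpolated into SQL below, so apply a minimal guard and
    # only accept plain identifiers (hypothetical helper, illustration only).
    if not name.isidentifier():
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return name


def ensure_dataset_table(conn, dataset_type, dimensions):
    """Create (if needed) a hypothetical table holding a single dataset type."""
    table = "dataset_" + _check_identifier(dataset_type)
    dims = [_check_identifier(d) for d in dimensions]
    columns = ", ".join(f"{d} INTEGER NOT NULL" for d in dims)
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} ("
        f"id INTEGER PRIMARY KEY AUTOINCREMENT, "
        f"run TEXT NOT NULL, {columns}, "
        # Assumed constraint: one dataset per data ID within a run.
        f"UNIQUE (run, {', '.join(dims)}))"
    )
    return table


def bulk_insert_datasets(conn, dataset_type, dimensions, run, data_ids):
    """Insert many datasets of one type in a single transaction."""
    table = ensure_dataset_table(conn, dataset_type, dimensions)
    rows = [(run, *(data_id[d] for d in dimensions)) for data_id in data_ids]
    placeholders = ", ".join("?" for _ in range(len(dimensions) + 1))
    sql = (
        f"INSERT INTO {table} (run, {', '.join(dimensions)}) "
        f"VALUES ({placeholders})"
    )
    # The connection context manager commits on success and rolls the
    # whole batch back if any row violates a constraint.
    with conn:
        conn.executemany(sql, rows)
    return len(rows)
```

For example, `bulk_insert_datasets(conn, "raw", ["exposure", "detector"], "run1", data_ids)` would write every data ID in `data_ids` to a `dataset_raw` table in one transaction, instead of one insert (and one transaction) per dataset.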
Issue Links
- blocks
  - DM-21907 Implement multi-user Registries (To Do)
- contains
  - DM-21448 Clean up DatasetRef comparisons and immutability (Done)
  - DM-21451 Remove DatabaseDict and vectorize Datastore/Butler ingest APIs (Done)
  - DM-21764 Better encapsulate dataset storage in Registry (Done)
  - DM-21766 Add per-dataset-type tables to Registry (Done)
  - DM-21768 Vectorize dataset insert API (Done)
  - DM-21795 Rework Registry provenance objects to match prototype (Done)
  - DM-21849 Make runs a type of collection (Done)
  - DM-24432 Add CALIBRATION collections and remove the calibration_label dimension (Done)
  - DM-24612 Add indexes to dataset_collection tables (Done)
  - DM-24614 Move dataset_location tables into manager/storage hierarchy (Done)
  - DM-21770 Vectorize dataset insert implementations (Invalid)
- is blocked by
  - DM-21201 Research cross-database approach to inserts with custom conflict resolution (Done)
  - DM-21203 Research cross-database approach to bulk inserts returning autoincrement values (Done)
  - DM-17023 Refactor the Dimensions and query system (Done)
- relates to
  - DM-22487 Prototype Registry architecture for schema changes and (eventually) multiple layers (Done)
  - DM-17154 Move Registry schema definition from YAML to Python (Invalid)
Note to my future self: some datasets, like raw and (post-DM-17023) reference catalogs, should have their "one DatasetType+DataId" constraint valid across all collections, not just within any one collection. That means we don't actually need to use collections when retrieving them, provided we are explicitly asked for that DatasetType. This is a potentially important optimization that we should find a way to take advantage of.
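A minimal sketch of that optimization, assuming plain in-memory dictionaries stand in for the registry tables. The `GLOBALLY_UNIQUE` set, the `find_dataset` function, and both index structures are hypothetical names for illustration, not existing Registry code.

```python
# Hypothetical set of dataset types whose (DatasetType, DataId) pair is
# unique across all collections, not just within one collection.
GLOBALLY_UNIQUE = {"raw"}


def find_dataset(index, collection_members, dataset_type, data_id, collections):
    """Return the matching dataset ID, or None if there is no match.

    index maps (dataset_type, data_id) to a list of dataset IDs, and
    collection_members maps a collection name to the set of dataset IDs
    it contains. data_id must be hashable (a tuple here).
    """
    candidates = index.get((dataset_type, data_id), [])
    if dataset_type in GLOBALLY_UNIQUE:
        # At most one candidate can exist, so the collection search
        # (a join in a real database) can be skipped entirely.
        return candidates[0] if candidates else None
    # General case: the first collection in search order that holds a
    # candidate determines the result.
    for collection in collections:
        members = collection_members.get(collection, set())
        for dataset_id in candidates:
            if dataset_id in members:
                return dataset_id
    return None
```

In a real registry the saving would come from dropping the join against the dataset-collection table for these dataset types; the dictionary lookup here only mirrors the control flow.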