Details
-
Type:
Story
-
Status: Done
-
Resolution: Done
-
Fix Version/s: None
-
Component/s: daf_butler
-
Story Points:4
-
Epic Link:
-
Team:Data Release Production
Description
(Previous title was "Refactor DatasetRef into multiple better-defined classes", but now we're changing DatasetRef).
The current DatasetRef has too many fields (provenance stuff we never use), too many states (whether or not it has a dataset_id completely changes what it represents), and ill-defined comparisons. It's almost-but-not-quite immutable, which makes it tempting but a bit dangerous to use as the key type in a set or dict.
Attachments
Issue Links
- blocks
-
DM-21766 Add per-dataset-type tables to Registry
- Done
-
DM-21768 Vectorize dataset insert API
- Done
- is contained by
-
DM-21231 Refactor Registry handling of dataset and associated tables
- Done
- relates to
-
DM-22178 queryDatasets produces lots of duplicate outputs
- Won't Fix
-
DM-22286 Remove duplicates from the output of queryDatasets
- Won't Fix
- mentioned in
-
Page Loading...
Activity
Description | The current DatasetRef has too many fields (provenance stuff we never use), too many states (whether or not it has a {{dataset_id}} _completely_ changes what it represents), and ill-defined comparisons. It's almost-but-not-quite immutable, which makes it tempting but a bit dangerous to use the key type in a set or dict. And [~yusra] has pointed out that the {{DeferredDatasetHandle}} class is what actually behaves most like Gen2's ubiquito{{us }}{{DataRef}} class |
Description | The current DatasetRef has too many fields (provenance stuff we never use), too many states (whether or not it has a {{dataset_id}} _completely_ changes what it represents), and ill-defined comparisons. It's almost-but-not-quite immutable, which makes it tempting but a bit dangerous to use the key type in a set or dict. And [~yusra] has pointed out that the {{DeferredDatasetHandle}} class is what actually behaves most like Gen2's ubiquito{{us }}{{DataRef}} class |
The current DatasetRef has too many fields (provenance stuff we never use), too many states (whether or not it has a {{dataset_id}} _completely_ changes what it represents), and ill-defined comparisons. It's almost-but-not-quite immutable, which makes it tempting but a bit dangerous to use the key type in a set or dict. And [~yusra] has pointed out that the {{DeferredDatasetHandle}} class (which has a Butler rather than being something used by a Butler) is what actually behaves most like Gen2's ubiquitous {{DataRef}} class, so if we name any Gen3 class with something that sounds like {{DataRef}} it'd be confusing for it to be anything but that.
I intend to play with a the design a bit in code before settling on anything, but my preliminary ideas are: * Use "Handle" instead of "Ref" as the name suffix for the classes being primarily discussed here; rename {{DeferredDatasetHandle}} to {{DeferredDataRef}}. Let's consider those names preliminary as we work out what classes should exist - we'll go to #dm-naming-things after that's settled. (In all bullet points below I'll use these new names.) * Define {{PartialDatasetHandle}} to be a simple immutable struct (probably dataclass) of {{DatasetType}} and data ID, probably with the latter a {{DataCoordinate}} whose dimensions are guaranteed to be a subset of the {{DatasetType}}'s dimensions. ** This would be used in the {{Datastore}} {{LookupKey}} machinery, where we need to be able to dispatch on just some of the key-value pairs in the data ID (e.g. "instrument") as well as the dataset type. ** It _may_ also be used as a type users can pass to various Butler/Registry APIs that actually require fully-specified data IDs, with those APIs validating this internally (e.g. by upgrading to one of the complete {{DatasetHandle}} types below. ** I don't think we have any Butler/Registry APIs that would conceptually take or return a partial data ID for a dataset type at present, but it's conceivable that we could add some in the future, for instance something like Gen2's {{Butler.subset}}. While everything {{Butler.subset}} could do can already be done with the much more powerful {{Registry.queryDatasets}} method, we might want a simpler form for convenience. * Define {{DatasetHandle}} to be a simple immutable struct (probably dataclass) of {{DatasetType}} and {{DataCoordinate}} (which may be an {{ExpandedDataCoordinate}} but is not guaranteed to be) with the dimensions of the dataset type and the data ID guaranteed to be the same. ** This would be the type most frequently used as an argument to Butler/Registry APIs, especially those that look for datasets. ** This would be the type used in {{Quantum}} to represent datasets that do not or may not already exist. * Define {{ResolvedDatasetHandle}} to be a subclass of {{DatasetHandle}} that adds fields for {{dataset_id}} (which we need to rename, but not on this ticket, and split into {{dataset_id}} and {{site_id}}, but not on this ticket), {{run_id}}, {{quantum_id}}, {{components}} (a dict mapping string component name to child {{ResolveDatasetHandle}}), and {{collections}} (a {{set}} of string collection names). Its data ID would be guaranteed to be an {{ExpandedDataCoordinate}}. ** This would be the type most frequently used as a return type in Butler/Registry APIs, especially those that look for datasets. ** It may be used as an argument in some APIs as well, especially those that operate on datasets that must already exist (like {{Registry.associate}}. ** {{Datastores}} will interact exclusively with {{ResolvedDatasetHandle}} when reading and writing datasets (as noted above, they may use {{PartialDatasetHandle}} in {{LookupKey}} mechanisms). * All equality comparisons between any of the above handle classes would be strictly on dataset type and data ID, so two {{ResolvedDatasetHandles}} that have those fields in common but have different {{dataset_ids}} (e.g. because they represent the same data product in different runs/collections) _would compare as equal_. There would be a separate named method for comparing {{ResolvedDatasetHandles}} for complete equivalence, including {{dataset_id}} (it's a pity we can't overload the {{is}} operator for this). I believe we have no choice about this if we want to make simple {{DatasetHandles}} and {{ResolvedDatasetHandles}} cross-comparable. Major questions: * Should {{PartialDatasetHandle}} be a base class of {{DatasetHandle}}, so all of these classes have an inheritance relationship? If so, we may want to make {{PartialDatasetHandle}} construction always go through a factory that can return a {{DatasetHandle}} if the data ID dimensions are in fact equal to the dataset type dimensions. I can't think of a substitutability reason why they shouldn't have such a relationship, but the naming certainly makes it feel weird ("a ResolvedDatasetHandle IS A PartialDatasetHandle?"). * This proposal for {{ResolvedDatasetHandle}}'s fields is to just attach the IDs of the associated Run and (if present) Quantum, which is primarily about avoiding database lookups of those objects given that they're usually not going to be accessed. It may make sense to attach the full Run instance, at least, because there will relatively few of those and we should be able to cache those instances in memory. Do we have use cases where we need it, or the full Quantum? * Do we need an intermediate class between {{DatasetHandle}} and {{ResolvedDatasetHandle}} that has a {{dataset_id}} but lacks {{components}} and/or {{collections}}, to avoid those lookups? I think this depends on how often those are used. An alternative might be to make those fields optional or populated-as-needed, but both of those options would compound further our existing problem with defining sane equality comparisons (are two instances equal if they don't have the same optional fields populated?). Populating as needed is even worse, because it would require the object to hold a reference to its own Registry. |
Story Points | 2 | 4 |
Description |
The current DatasetRef has too many fields (provenance stuff we never use), too many states (whether or not it has a {{dataset_id}} _completely_ changes what it represents), and ill-defined comparisons. It's almost-but-not-quite immutable, which makes it tempting but a bit dangerous to use the key type in a set or dict. And [~yusra] has pointed out that the {{DeferredDatasetHandle}} class (which has a Butler rather than being something used by a Butler) is what actually behaves most like Gen2's ubiquitous {{DataRef}} class, so if we name any Gen3 class with something that sounds like {{DataRef}} it'd be confusing for it to be anything but that.
I intend to play with a the design a bit in code before settling on anything, but my preliminary ideas are: * Use "Handle" instead of "Ref" as the name suffix for the classes being primarily discussed here; rename {{DeferredDatasetHandle}} to {{DeferredDataRef}}. Let's consider those names preliminary as we work out what classes should exist - we'll go to #dm-naming-things after that's settled. (In all bullet points below I'll use these new names.) * Define {{PartialDatasetHandle}} to be a simple immutable struct (probably dataclass) of {{DatasetType}} and data ID, probably with the latter a {{DataCoordinate}} whose dimensions are guaranteed to be a subset of the {{DatasetType}}'s dimensions. ** This would be used in the {{Datastore}} {{LookupKey}} machinery, where we need to be able to dispatch on just some of the key-value pairs in the data ID (e.g. "instrument") as well as the dataset type. ** It _may_ also be used as a type users can pass to various Butler/Registry APIs that actually require fully-specified data IDs, with those APIs validating this internally (e.g. by upgrading to one of the complete {{DatasetHandle}} types below. ** I don't think we have any Butler/Registry APIs that would conceptually take or return a partial data ID for a dataset type at present, but it's conceivable that we could add some in the future, for instance something like Gen2's {{Butler.subset}}. While everything {{Butler.subset}} could do can already be done with the much more powerful {{Registry.queryDatasets}} method, we might want a simpler form for convenience. * Define {{DatasetHandle}} to be a simple immutable struct (probably dataclass) of {{DatasetType}} and {{DataCoordinate}} (which may be an {{ExpandedDataCoordinate}} but is not guaranteed to be) with the dimensions of the dataset type and the data ID guaranteed to be the same. ** This would be the type most frequently used as an argument to Butler/Registry APIs, especially those that look for datasets. ** This would be the type used in {{Quantum}} to represent datasets that do not or may not already exist. * Define {{ResolvedDatasetHandle}} to be a subclass of {{DatasetHandle}} that adds fields for {{dataset_id}} (which we need to rename, but not on this ticket, and split into {{dataset_id}} and {{site_id}}, but not on this ticket), {{run_id}}, {{quantum_id}}, {{components}} (a dict mapping string component name to child {{ResolveDatasetHandle}}), and {{collections}} (a {{set}} of string collection names). Its data ID would be guaranteed to be an {{ExpandedDataCoordinate}}. ** This would be the type most frequently used as a return type in Butler/Registry APIs, especially those that look for datasets. ** It may be used as an argument in some APIs as well, especially those that operate on datasets that must already exist (like {{Registry.associate}}. ** {{Datastores}} will interact exclusively with {{ResolvedDatasetHandle}} when reading and writing datasets (as noted above, they may use {{PartialDatasetHandle}} in {{LookupKey}} mechanisms). * All equality comparisons between any of the above handle classes would be strictly on dataset type and data ID, so two {{ResolvedDatasetHandles}} that have those fields in common but have different {{dataset_ids}} (e.g. because they represent the same data product in different runs/collections) _would compare as equal_. There would be a separate named method for comparing {{ResolvedDatasetHandles}} for complete equivalence, including {{dataset_id}} (it's a pity we can't overload the {{is}} operator for this). I believe we have no choice about this if we want to make simple {{DatasetHandles}} and {{ResolvedDatasetHandles}} cross-comparable. Major questions: * Should {{PartialDatasetHandle}} be a base class of {{DatasetHandle}}, so all of these classes have an inheritance relationship? If so, we may want to make {{PartialDatasetHandle}} construction always go through a factory that can return a {{DatasetHandle}} if the data ID dimensions are in fact equal to the dataset type dimensions. I can't think of a substitutability reason why they shouldn't have such a relationship, but the naming certainly makes it feel weird ("a ResolvedDatasetHandle IS A PartialDatasetHandle?"). * This proposal for {{ResolvedDatasetHandle}}'s fields is to just attach the IDs of the associated Run and (if present) Quantum, which is primarily about avoiding database lookups of those objects given that they're usually not going to be accessed. It may make sense to attach the full Run instance, at least, because there will relatively few of those and we should be able to cache those instances in memory. Do we have use cases where we need it, or the full Quantum? * Do we need an intermediate class between {{DatasetHandle}} and {{ResolvedDatasetHandle}} that has a {{dataset_id}} but lacks {{components}} and/or {{collections}}, to avoid those lookups? I think this depends on how often those are used. An alternative might be to make those fields optional or populated-as-needed, but both of those options would compound further our existing problem with defining sane equality comparisons (are two instances equal if they don't have the same optional fields populated?). Populating as needed is even worse, because it would require the object to hold a reference to its own Registry. |
The current DatasetRef has too many fields (provenance stuff we never use), too many states (whether or not it has a {{dataset_id}} _completely_ changes what it represents), and ill-defined comparisons. It's almost-but-not-quite immutable, which makes it tempting but a bit dangerous to use as the key type in a set or dict. And [~yusra] has pointed out that the {{DeferredDatasetHandle}} class (which has a Butler rather than being something used by a Butler) is what actually behaves most like Gen2's ubiquitous {{DataRef}} class, so if we name any Gen3 class with something that sounds like {{DataRef}} it'd be confusing for it to be anything but that.
I intend to play with a the design a bit in code before settling on anything, but my preliminary ideas are: * Use "Handle" instead of "Ref" as the name suffix for the classes being primarily discussed here; rename {{DeferredDatasetHandle}} to {{DeferredDataRef}}. Let's consider those names preliminary as we work out what classes should exist - we'll go to #dm-naming-things after that's settled. (In all bullet points below I'll use these new names.) * Define {{PartialDatasetHandle}} to be a simple immutable struct (probably dataclass) of {{DatasetType}} and data ID, probably with the latter a {{DataCoordinate}} whose dimensions are guaranteed to be a subset of the {{DatasetType}}'s dimensions. * ** This would be used in the {{Datastore}} {{LookupKey}} machinery, where we need to be able to dispatch on just some of the key-value pairs in the data ID (e.g. "instrument") as well as the dataset type. ** It _may_ also be used as a type users can pass to various Butler/Registry APIs that actually require fully-specified data IDs, with those APIs validating this internally (e.g. by upgrading to one of the complete {{DatasetHandle}} types below. ** I don't think we have any Butler/Registry APIs that would conceptually take or return a partial data ID for a dataset type at present, but it's conceivable that we could add some in the future, for instance something like Gen2's {{Butler.subset}}. While everything {{Butler.subset}} could do can already be done with the much more powerful {{Registry.queryDatasets}} method, we might want a simpler form for convenience. * Define {{DatasetHandle}} to be a simple immutable struct (probably dataclass) of {{DatasetType}} and {{DataCoordinate}} (which may be an {{ExpandedDataCoordinate}} but is not guaranteed to be) with the dimensions of the dataset type and the data ID guaranteed to be the same. ** This would be the type most frequently used as an argument to Butler/Registry APIs, especially those that look for datasets. ** This would be the type used in {{Quantum}} to represent datasets that do not or may not already exist. * Define {{ResolvedDatasetHandle}} to be a subclass of {{DatasetHandle}} that adds fields for {{dataset_id}} (which we need to rename, but not on this ticket, and split into {{dataset_id}} and {{site_id}}, but not on this ticket), {{run_id}}, {{quantum_id}}, {{components}} (a dict mapping string component name to child {{ResolveDatasetHandle}}), and {{collections}} (a {{set}} of string collection names). Its data ID would be guaranteed to be an {{ExpandedDataCoordinate}}. ** This would be the type most frequently used as a return type in Butler/Registry APIs, especially those that look for datasets. ** It may be used as an argument in some APIs as well, especially those that operate on datasets that must already exist (like {{Registry.associate}}. ** {{Datastores}} will interact exclusively with {{ResolvedDatasetHandle}} when reading and writing datasets (as noted above, they may use {{PartialDatasetHandle}} in {{LookupKey}} mechanisms). * All equality comparisons between any of the above handle classes would be strictly on dataset type and data ID, so two {{ResolvedDatasetHandles}} that have those fields in common but have different {{dataset_ids}} (e.g. because they represent the same data product in different runs/collections) _would compare as equal_. There would be a separate named method for comparing {{ResolvedDatasetHandles}} for complete equivalence, including {{dataset_id}} (it's a pity we can't overload the {{is}} operator for this). I believe we have no choice about this if we want to make simple {{DatasetHandles}} and {{ResolvedDatasetHandles}} cross-comparable. Major questions: * Should {{PartialDatasetHandle}} be a base class of {{DatasetHandle}}, so all of these classes have an inheritance relationship? If so, we may want to make {{PartialDatasetHandle}} construction always go through a factory that can return a {{DatasetHandle}} if the data ID dimensions are in fact equal to the dataset type dimensions. I can't think of a substitutability reason why they shouldn't have such a relationship, but the naming certainly makes it feel weird ("a ResolvedDatasetHandle IS A PartialDatasetHandle?"). * This proposal for {{ResolvedDatasetHandle}}'s fields is to just attach the IDs of the associated Run and (if present) Quantum, which is primarily about avoiding database lookups of those objects given that they're usually not going to be accessed. It may make sense to attach the full Run instance, at least, because there will relatively few of those and we should be able to cache those instances in memory. Do we have use cases where we need it, or the full Quantum? * Do we need an intermediate class between {{DatasetHandle}} and {{ResolvedDatasetHandle}} that has a {{dataset_id}} but lacks {{components}} and/or {{collections}}, to avoid those lookups? I think this depends on how often those are used. An alternative might be to make those fields optional or populated-as-needed, but both of those options would compound further our existing problem with defining sane equality comparisons (are two instances equal if they don't have the same optional fields populated?). Populating as needed is even worse, because it would require the object to hold a reference to its own Registry. |
Description |
The current DatasetRef has too many fields (provenance stuff we never use), too many states (whether or not it has a {{dataset_id}} _completely_ changes what it represents), and ill-defined comparisons. It's almost-but-not-quite immutable, which makes it tempting but a bit dangerous to use as the key type in a set or dict. And [~yusra] has pointed out that the {{DeferredDatasetHandle}} class (which has a Butler rather than being something used by a Butler) is what actually behaves most like Gen2's ubiquitous {{DataRef}} class, so if we name any Gen3 class with something that sounds like {{DataRef}} it'd be confusing for it to be anything but that.
I intend to play with a the design a bit in code before settling on anything, but my preliminary ideas are: * Use "Handle" instead of "Ref" as the name suffix for the classes being primarily discussed here; rename {{DeferredDatasetHandle}} to {{DeferredDataRef}}. Let's consider those names preliminary as we work out what classes should exist - we'll go to #dm-naming-things after that's settled. (In all bullet points below I'll use these new names.) * Define {{PartialDatasetHandle}} to be a simple immutable struct (probably dataclass) of {{DatasetType}} and data ID, probably with the latter a {{DataCoordinate}} whose dimensions are guaranteed to be a subset of the {{DatasetType}}'s dimensions. * ** This would be used in the {{Datastore}} {{LookupKey}} machinery, where we need to be able to dispatch on just some of the key-value pairs in the data ID (e.g. "instrument") as well as the dataset type. ** It _may_ also be used as a type users can pass to various Butler/Registry APIs that actually require fully-specified data IDs, with those APIs validating this internally (e.g. by upgrading to one of the complete {{DatasetHandle}} types below. ** I don't think we have any Butler/Registry APIs that would conceptually take or return a partial data ID for a dataset type at present, but it's conceivable that we could add some in the future, for instance something like Gen2's {{Butler.subset}}. While everything {{Butler.subset}} could do can already be done with the much more powerful {{Registry.queryDatasets}} method, we might want a simpler form for convenience. * Define {{DatasetHandle}} to be a simple immutable struct (probably dataclass) of {{DatasetType}} and {{DataCoordinate}} (which may be an {{ExpandedDataCoordinate}} but is not guaranteed to be) with the dimensions of the dataset type and the data ID guaranteed to be the same. ** This would be the type most frequently used as an argument to Butler/Registry APIs, especially those that look for datasets. ** This would be the type used in {{Quantum}} to represent datasets that do not or may not already exist. * Define {{ResolvedDatasetHandle}} to be a subclass of {{DatasetHandle}} that adds fields for {{dataset_id}} (which we need to rename, but not on this ticket, and split into {{dataset_id}} and {{site_id}}, but not on this ticket), {{run_id}}, {{quantum_id}}, {{components}} (a dict mapping string component name to child {{ResolveDatasetHandle}}), and {{collections}} (a {{set}} of string collection names). Its data ID would be guaranteed to be an {{ExpandedDataCoordinate}}. ** This would be the type most frequently used as a return type in Butler/Registry APIs, especially those that look for datasets. ** It may be used as an argument in some APIs as well, especially those that operate on datasets that must already exist (like {{Registry.associate}}. ** {{Datastores}} will interact exclusively with {{ResolvedDatasetHandle}} when reading and writing datasets (as noted above, they may use {{PartialDatasetHandle}} in {{LookupKey}} mechanisms). * All equality comparisons between any of the above handle classes would be strictly on dataset type and data ID, so two {{ResolvedDatasetHandles}} that have those fields in common but have different {{dataset_ids}} (e.g. because they represent the same data product in different runs/collections) _would compare as equal_. There would be a separate named method for comparing {{ResolvedDatasetHandles}} for complete equivalence, including {{dataset_id}} (it's a pity we can't overload the {{is}} operator for this). I believe we have no choice about this if we want to make simple {{DatasetHandles}} and {{ResolvedDatasetHandles}} cross-comparable. Major questions: * Should {{PartialDatasetHandle}} be a base class of {{DatasetHandle}}, so all of these classes have an inheritance relationship? If so, we may want to make {{PartialDatasetHandle}} construction always go through a factory that can return a {{DatasetHandle}} if the data ID dimensions are in fact equal to the dataset type dimensions. I can't think of a substitutability reason why they shouldn't have such a relationship, but the naming certainly makes it feel weird ("a ResolvedDatasetHandle IS A PartialDatasetHandle?"). * This proposal for {{ResolvedDatasetHandle}}'s fields is to just attach the IDs of the associated Run and (if present) Quantum, which is primarily about avoiding database lookups of those objects given that they're usually not going to be accessed. It may make sense to attach the full Run instance, at least, because there will relatively few of those and we should be able to cache those instances in memory. Do we have use cases where we need it, or the full Quantum? * Do we need an intermediate class between {{DatasetHandle}} and {{ResolvedDatasetHandle}} that has a {{dataset_id}} but lacks {{components}} and/or {{collections}}, to avoid those lookups? I think this depends on how often those are used. An alternative might be to make those fields optional or populated-as-needed, but both of those options would compound further our existing problem with defining sane equality comparisons (are two instances equal if they don't have the same optional fields populated?). Populating as needed is even worse, because it would require the object to hold a reference to its own Registry. |
The current DatasetRef has too many fields (provenance stuff we never use), too many states (whether or not it has a {{dataset_id}} _completely_ changes what it represents), and ill-defined comparisons. It's almost-but-not-quite immutable, which makes it tempting but a bit dangerous to use as the key type in a set or dict. And [~yusra] has pointed out that the {{DeferredDatasetHandle}} class (which has a Butler rather than being something used by a Butler) is what actually behaves most like Gen2's ubiquitous {{DataRef}} class, so if we name any Gen3 class with something that sounds like {{DataRef}} it'd be confusing for it to be anything but that.
I intend to play with a the design a bit in code before settling on anything, but my preliminary ideas are: * Use "Handle" instead of "Ref" as the name suffix for the classes being primarily discussed here; rename {{DeferredDatasetHandle}} to {{DeferredDataRef}}. Let's consider those names preliminary as we work out what classes should exist - we'll go to #dm-naming-things after that's settled. (In all bullet points below I'll use these new names.) * Define {{PartialDatasetHandle}} to be a simple immutable struct (probably dataclass) of {{DatasetType}} and data ID, probably with the latter a {{DataCoordinate}} whose dimensions are guaranteed to be a subset of the {{DatasetType}}'s dimensions. ** This would be used in the {{Datastore}} {{LookupKey}} machinery, where we need to be able to dispatch on just some of the key-value pairs in the data ID (e.g. "instrument") as well as the dataset type. ** It _may_ also be used as a type users can pass to various Butler/Registry APIs that actually require fully-specified data IDs, with those APIs validating this internally (e.g. by upgrading to one of the complete {{DatasetHandle}} types below. ** I don't think we have any Butler/Registry APIs that would conceptually take or return a partial data ID for a dataset type at present, but it's conceivable that we could add some in the future, for instance something like Gen2's {{Butler.subset}}. While everything {{Butler.subset}} could do can already be done with the much more powerful {{Registry.queryDatasets}} method, we might want a simpler form for convenience. * Define {{DatasetHandle}} to be a simple immutable struct (probably dataclass) of {{DatasetType}} and {{DataCoordinate}} (which may be an {{ExpandedDataCoordinate}} but is not guaranteed to be) with the dimensions of the dataset type and the data ID guaranteed to be the same. ** This would be the type most frequently used as an argument to Butler/Registry APIs, especially those that look for datasets. ** This would be the type used in {{Quantum}} to represent datasets that do not or may not already exist. * Define {{ResolvedDatasetHandle}} to be a subclass of {{DatasetHandle}} that adds fields for {{dataset_id}} (which we need to rename, but not on this ticket, and split into {{dataset_id}} and {{site_id}}, but not on this ticket), {{run_id}}, {{quantum_id}}, {{components}} (a dict mapping string component name to child {{ResolveDatasetHandle}}), and {{collections}} (a {{set}} of string collection names). Its data ID would be guaranteed to be an {{ExpandedDataCoordinate}}. ** This would be the type most frequently used as a return type in Butler/Registry APIs, especially those that look for datasets. ** It may be used as an argument in some APIs as well, especially those that operate on datasets that must already exist (like {{Registry.associate}}. ** {{Datastores}} will interact exclusively with {{ResolvedDatasetHandle}} when reading and writing datasets (as noted above, they may use {{PartialDatasetHandle}} in {{LookupKey}} mechanisms). * All equality comparisons between any of the above handle classes would be strictly on dataset type and data ID, so two {{ResolvedDatasetHandles}} that have those fields in common but have different {{dataset_ids}} (e.g. because they represent the same data product in different runs/collections) _would compare as equal_. There would be a separate named method for comparing {{ResolvedDatasetHandles}} for complete equivalence, including {{dataset_id}} (it's a pity we can't overload the {{is}} operator for this). I believe we have no choice about this if we want to make simple {{DatasetHandles}} and {{ResolvedDatasetHandles}} cross-comparable. Major questions: * Should {{PartialDatasetHandle}} be a base class of {{DatasetHandle}}, so all of these classes have an inheritance relationship? If so, we may want to make {{PartialDatasetHandle}} construction always go through a factory that can return a {{DatasetHandle}} if the data ID dimensions are in fact equal to the dataset type dimensions. I can't think of a substitutability reason why they shouldn't have such a relationship, but the naming certainly makes it feel weird ("a ResolvedDatasetHandle IS A PartialDatasetHandle?"). * This proposal for {{ResolvedDatasetHandle}}'s fields is to just attach the IDs of the associated Run and (if present) Quantum, which is primarily about avoiding database lookups of those objects given that they're usually not going to be accessed. It may make sense to attach the full Run instance, at least, because there will relatively few of those and we should be able to cache those instances in memory. Do we have use cases where we need it, or the full Quantum? * Do we need an intermediate class between {{DatasetHandle}} and {{ResolvedDatasetHandle}} that has a {{dataset_id}} but lacks {{components}} and/or {{collections}}, to avoid those lookups? I think this depends on how often those are used. An alternative might be to make those fields optional or populated-as-needed, but both of those options would compound further our existing problem with defining sane equality comparisons (are two instances equal if they don't have the same optional fields populated?). Populating as needed is even worse, because it would require the object to hold a reference to its own Registry. |
Status | To Do [ 10001 ] | In Progress [ 3 ] |
Description |
The current DatasetRef has too many fields (provenance stuff we never use), too many states (whether or not it has a {{dataset_id}} _completely_ changes what it represents), and ill-defined comparisons. It's almost-but-not-quite immutable, which makes it tempting but a bit dangerous to use as the key type in a set or dict. And [~yusra] has pointed out that the {{DeferredDatasetHandle}} class (which has a Butler rather than being something used by a Butler) is what actually behaves most like Gen2's ubiquitous {{DataRef}} class, so if we name any Gen3 class with something that sounds like {{DataRef}} it'd be confusing for it to be anything but that.
I intend to play with a the design a bit in code before settling on anything, but my preliminary ideas are: * Use "Handle" instead of "Ref" as the name suffix for the classes being primarily discussed here; rename {{DeferredDatasetHandle}} to {{DeferredDataRef}}. Let's consider those names preliminary as we work out what classes should exist - we'll go to #dm-naming-things after that's settled. (In all bullet points below I'll use these new names.) * Define {{PartialDatasetHandle}} to be a simple immutable struct (probably dataclass) of {{DatasetType}} and data ID, probably with the latter a {{DataCoordinate}} whose dimensions are guaranteed to be a subset of the {{DatasetType}}'s dimensions. ** This would be used in the {{Datastore}} {{LookupKey}} machinery, where we need to be able to dispatch on just some of the key-value pairs in the data ID (e.g. "instrument") as well as the dataset type. ** It _may_ also be used as a type users can pass to various Butler/Registry APIs that actually require fully-specified data IDs, with those APIs validating this internally (e.g. by upgrading to one of the complete {{DatasetHandle}} types below. ** I don't think we have any Butler/Registry APIs that would conceptually take or return a partial data ID for a dataset type at present, but it's conceivable that we could add some in the future, for instance something like Gen2's {{Butler.subset}}. While everything {{Butler.subset}} could do can already be done with the much more powerful {{Registry.queryDatasets}} method, we might want a simpler form for convenience. * Define {{DatasetHandle}} to be a simple immutable struct (probably dataclass) of {{DatasetType}} and {{DataCoordinate}} (which may be an {{ExpandedDataCoordinate}} but is not guaranteed to be) with the dimensions of the dataset type and the data ID guaranteed to be the same. ** This would be the type most frequently used as an argument to Butler/Registry APIs, especially those that look for datasets. ** This would be the type used in {{Quantum}} to represent datasets that do not or may not already exist. * Define {{ResolvedDatasetHandle}} to be a subclass of {{DatasetHandle}} that adds fields for {{dataset_id}} (which we need to rename, but not on this ticket, and split into {{dataset_id}} and {{site_id}}, but not on this ticket), {{run_id}}, {{quantum_id}}, {{components}} (a dict mapping string component name to child {{ResolveDatasetHandle}}), and {{collections}} (a {{set}} of string collection names). Its data ID would be guaranteed to be an {{ExpandedDataCoordinate}}. ** This would be the type most frequently used as a return type in Butler/Registry APIs, especially those that look for datasets. ** It may be used as an argument in some APIs as well, especially those that operate on datasets that must already exist (like {{Registry.associate}}. ** {{Datastores}} will interact exclusively with {{ResolvedDatasetHandle}} when reading and writing datasets (as noted above, they may use {{PartialDatasetHandle}} in {{LookupKey}} mechanisms). * All equality comparisons between any of the above handle classes would be strictly on dataset type and data ID, so two {{ResolvedDatasetHandles}} that have those fields in common but have different {{dataset_ids}} (e.g. because they represent the same data product in different runs/collections) _would compare as equal_. There would be a separate named method for comparing {{ResolvedDatasetHandles}} for complete equivalence, including {{dataset_id}} (it's a pity we can't overload the {{is}} operator for this). I believe we have no choice about this if we want to make simple {{DatasetHandles}} and {{ResolvedDatasetHandles}} cross-comparable. Major questions: * Should {{PartialDatasetHandle}} be a base class of {{DatasetHandle}}, so all of these classes have an inheritance relationship? If so, we may want to make {{PartialDatasetHandle}} construction always go through a factory that can return a {{DatasetHandle}} if the data ID dimensions are in fact equal to the dataset type dimensions. I can't think of a substitutability reason why they shouldn't have such a relationship, but the naming certainly makes it feel weird ("a ResolvedDatasetHandle IS A PartialDatasetHandle?"). * This proposal for {{ResolvedDatasetHandle}}'s fields is to just attach the IDs of the associated Run and (if present) Quantum, which is primarily about avoiding database lookups of those objects given that they're usually not going to be accessed. It may make sense to attach the full Run instance, at least, because there will relatively few of those and we should be able to cache those instances in memory. Do we have use cases where we need it, or the full Quantum? * Do we need an intermediate class between {{DatasetHandle}} and {{ResolvedDatasetHandle}} that has a {{dataset_id}} but lacks {{components}} and/or {{collections}}, to avoid those lookups? I think this depends on how often those are used. An alternative might be to make those fields optional or populated-as-needed, but both of those options would compound further our existing problem with defining sane equality comparisons (are two instances equal if they don't have the same optional fields populated?). Populating as needed is even worse, because it would require the object to hold a reference to its own Registry. |
The current DatasetRef has too many fields (provenance stuff we never use), too many states (whether or not it has a {{dataset_id}} _completely_ changes what it represents), and ill-defined comparisons. It's almost-but-not-quite immutable, which makes it tempting but a bit dangerous to use as the key type in a set or dict. And [~yusra] has pointed out that the {{DeferredDatasetHandle}} class (which has a Butler rather than being something used by a Butler) is what actually behaves most like Gen2's ubiquitous {{DataRef}} class, so if we name any Gen3 class with something that sounds like {{DataRef}} it'd be confusing for it to be anything but that.
I intend to play with a the design a bit in code before settling on anything, but my preliminary ideas are: * Use "Handle" instead of "Ref" as the name suffix for the classes being primarily discussed here; rename {{DeferredDatasetHandle}} to {{DeferredDataRef}}. Let's consider those names preliminary as we work out what classes should exist - we'll go to #dm-naming-things after that's settled. (In all bullet points below I'll use these new names.) * Define {{PartialDatasetHandle}} to be a simple immutable struct (probably dataclass) of {{DatasetType}} and data ID, probably with the latter a {{DataCoordinate}} whose dimensions are guaranteed to be a subset of the {{DatasetType}}'s dimensions. ** This would be used in the {{Datastore}} {{LookupKey}} machinery, where we need to be able to dispatch on just some of the key-value pairs in the data ID (e.g. "instrument") as well as the dataset type. ** It _may_ also be used as a type users can pass to various Butler/Registry APIs that actually require fully-specified data IDs, with those APIs validating this internally (e.g. by upgrading to one of the complete {{DatasetHandle}} types below. ** I don't think we have any Butler/Registry APIs that would conceptually take or return a partial data ID for a dataset type at present, but it's conceivable that we could add some in the future, for instance something like Gen2's {{Butler.subset}}. While everything {{Butler.subset}} could do can already be done with the much more powerful {{Registry.queryDatasets}} method, we might want a simpler form for convenience. * Define {{DatasetHandle}} to be a simple immutable struct (probably dataclass) of {{DatasetType}} and {{DataCoordinate}} (which may be an {{ExpandedDataCoordinate}} but is not guaranteed to be) with the dimensions of the dataset type and the data ID guaranteed to be the same. ** This would be the type most frequently used as an argument to Butler/Registry APIs, especially those that look for datasets. ** This would be the type used in {{Quantum}} to represent datasets that do not or may not already exist. * Define {{ResolvedDatasetHandle}} to be a subclass of {{DatasetHandle}} that adds fields for {{dataset_id}} (which we need to rename, but not on this ticket, and split into {{dataset_id}} and {{site_id}}, but not on this ticket), {{run_id}}, {{quantum_id}}, {{components}} (a dict mapping string component name to child {{ResolveDatasetHandle}}), and {{collections}} (a {{set}} of string collection names). Its data ID would be guaranteed to be an {{ExpandedDataCoordinate}}. ** This would be the type most frequently used as a return type in Butler/Registry APIs, especially those that look for datasets. ** It may be used as an argument in some APIs as well, especially those that operate on datasets that must already exist (like {{Registry.associate}}. ** This would be the type used in {{Quantum}} to represent datasets that definitely already exist. ** {{Datastores}} will interact exclusively with {{ResolvedDatasetHandle}} when reading and writing datasets (as noted above, they may use {{PartialDatasetHandle}} in {{LookupKey}} mechanisms). * All equality comparisons between any of the above handle classes would be strictly on dataset type and data ID, so two {{ResolvedDatasetHandles}} that have those fields in common but have different {{dataset_ids}} (e.g. because they represent the same data product in different runs/collections) _would compare as equal_. There would be a separate named method for comparing {{ResolvedDatasetHandles}} for complete equivalence, including {{dataset_id}} (it's a pity we can't overload the {{is}} operator for this). I believe we have no choice about this if we want to make simple {{DatasetHandles}} and {{ResolvedDatasetHandles}} cross-comparable. Major questions: * Should {{PartialDatasetHandle}} be a base class of {{DatasetHandle}}, so all of these classes have an inheritance relationship? If so, we may want to make {{PartialDatasetHandle}} construction always go through a factory that can return a {{DatasetHandle}} if the data ID dimensions are in fact equal to the dataset type dimensions. I can't think of a substitutability reason why they shouldn't have such a relationship, but the naming certainly makes it feel weird ("a ResolvedDatasetHandle IS A PartialDatasetHandle?"). * This proposal for {{ResolvedDatasetHandle}}'s fields is to just attach the IDs of the associated Run and (if present) Quantum, which is primarily about avoiding database lookups of those objects given that they're usually not going to be accessed. It may make sense to attach the full Run instance, at least, because there will relatively few of those and we should be able to cache those instances in memory. Do we have use cases where we need it, or the full Quantum? * Do we need an intermediate class between {{DatasetHandle}} and {{ResolvedDatasetHandle}} that has a {{dataset_id}} but lacks {{components}} and/or {{collections}}, to avoid those lookups? I think this depends on how often those are used. An alternative might be to make those fields optional or populated-as-needed, but both of those options would compound further our existing problem with defining sane equality comparisons (are two instances equal if they don't have the same optional fields populated?). Populating as needed is even worse, because it would require the object to hold a reference to its own Registry. |
Remote Link | This issue links to "Page (Confluence)" [ 22047 ] |
Labels | gen3-middleware | gen2-deprecation-blocker gen3-middleware |
Summary | Refactor DatasetRef into multiple better-defined classes | Clean up DatasetRef comparisons and immutability |
Description |
The current DatasetRef has too many fields (provenance stuff we never use), too many states (whether or not it has a {{dataset_id}} _completely_ changes what it represents), and ill-defined comparisons. It's almost-but-not-quite immutable, which makes it tempting but a bit dangerous to use as the key type in a set or dict. And [~yusra] has pointed out that the {{DeferredDatasetHandle}} class (which has a Butler rather than being something used by a Butler) is what actually behaves most like Gen2's ubiquitous {{DataRef}} class, so if we name any Gen3 class with something that sounds like {{DataRef}} it'd be confusing for it to be anything but that.
I intend to play with a the design a bit in code before settling on anything, but my preliminary ideas are: * Use "Handle" instead of "Ref" as the name suffix for the classes being primarily discussed here; rename {{DeferredDatasetHandle}} to {{DeferredDataRef}}. Let's consider those names preliminary as we work out what classes should exist - we'll go to #dm-naming-things after that's settled. (In all bullet points below I'll use these new names.) * Define {{PartialDatasetHandle}} to be a simple immutable struct (probably dataclass) of {{DatasetType}} and data ID, probably with the latter a {{DataCoordinate}} whose dimensions are guaranteed to be a subset of the {{DatasetType}}'s dimensions. ** This would be used in the {{Datastore}} {{LookupKey}} machinery, where we need to be able to dispatch on just some of the key-value pairs in the data ID (e.g. "instrument") as well as the dataset type. ** It _may_ also be used as a type users can pass to various Butler/Registry APIs that actually require fully-specified data IDs, with those APIs validating this internally (e.g. by upgrading to one of the complete {{DatasetHandle}} types below. ** I don't think we have any Butler/Registry APIs that would conceptually take or return a partial data ID for a dataset type at present, but it's conceivable that we could add some in the future, for instance something like Gen2's {{Butler.subset}}. While everything {{Butler.subset}} could do can already be done with the much more powerful {{Registry.queryDatasets}} method, we might want a simpler form for convenience. * Define {{DatasetHandle}} to be a simple immutable struct (probably dataclass) of {{DatasetType}} and {{DataCoordinate}} (which may be an {{ExpandedDataCoordinate}} but is not guaranteed to be) with the dimensions of the dataset type and the data ID guaranteed to be the same. ** This would be the type most frequently used as an argument to Butler/Registry APIs, especially those that look for datasets. ** This would be the type used in {{Quantum}} to represent datasets that do not or may not already exist. * Define {{ResolvedDatasetHandle}} to be a subclass of {{DatasetHandle}} that adds fields for {{dataset_id}} (which we need to rename, but not on this ticket, and split into {{dataset_id}} and {{site_id}}, but not on this ticket), {{run_id}}, {{quantum_id}}, {{components}} (a dict mapping string component name to child {{ResolveDatasetHandle}}), and {{collections}} (a {{set}} of string collection names). Its data ID would be guaranteed to be an {{ExpandedDataCoordinate}}. ** This would be the type most frequently used as a return type in Butler/Registry APIs, especially those that look for datasets. ** It may be used as an argument in some APIs as well, especially those that operate on datasets that must already exist (like {{Registry.associate}}. ** This would be the type used in {{Quantum}} to represent datasets that definitely already exist. ** {{Datastores}} will interact exclusively with {{ResolvedDatasetHandle}} when reading and writing datasets (as noted above, they may use {{PartialDatasetHandle}} in {{LookupKey}} mechanisms). * All equality comparisons between any of the above handle classes would be strictly on dataset type and data ID, so two {{ResolvedDatasetHandles}} that have those fields in common but have different {{dataset_ids}} (e.g. because they represent the same data product in different runs/collections) _would compare as equal_. There would be a separate named method for comparing {{ResolvedDatasetHandles}} for complete equivalence, including {{dataset_id}} (it's a pity we can't overload the {{is}} operator for this). I believe we have no choice about this if we want to make simple {{DatasetHandles}} and {{ResolvedDatasetHandles}} cross-comparable. Major questions: * Should {{PartialDatasetHandle}} be a base class of {{DatasetHandle}}, so all of these classes have an inheritance relationship? If so, we may want to make {{PartialDatasetHandle}} construction always go through a factory that can return a {{DatasetHandle}} if the data ID dimensions are in fact equal to the dataset type dimensions. I can't think of a substitutability reason why they shouldn't have such a relationship, but the naming certainly makes it feel weird ("a ResolvedDatasetHandle IS A PartialDatasetHandle?"). * This proposal for {{ResolvedDatasetHandle}}'s fields is to just attach the IDs of the associated Run and (if present) Quantum, which is primarily about avoiding database lookups of those objects given that they're usually not going to be accessed. It may make sense to attach the full Run instance, at least, because there will relatively few of those and we should be able to cache those instances in memory. Do we have use cases where we need it, or the full Quantum? * Do we need an intermediate class between {{DatasetHandle}} and {{ResolvedDatasetHandle}} that has a {{dataset_id}} but lacks {{components}} and/or {{collections}}, to avoid those lookups? I think this depends on how often those are used. An alternative might be to make those fields optional or populated-as-needed, but both of those options would compound further our existing problem with defining sane equality comparisons (are two instances equal if they don't have the same optional fields populated?). Populating as needed is even worse, because it would require the object to hold a reference to its own Registry. |
(Previous title was "Refactor DatasetRef into multiple better-defined classes", but now we're changing DatasetRef).
The current DatasetRef has too many fields (provenance stuff we never use), too many states (whether or not it has a {{dataset_id}} _completely_ changes what it represents), and ill-defined comparisons. It's almost-but-not-quite immutable, which makes it tempting but a bit dangerous to use as the key type in a set or dict. |
Reviewers | Tim Jenness [ tjenness ] | |
Status | In Progress [ 3 ] | In Review [ 10004 ] |
Status | In Review [ 10004 ] | Reviewed [ 10101 ] |
Resolution | Done [ 10000 ] | |
Status | Reviewed [ 10101 ] | Done [ 10002 ] |
I am returning to work on this now, but plan to lower ambitions significantly: