Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-21448

Clean up DatasetRef comparisons and immutability

    XMLWordPrintable

Details

    Description

      (Previous title was "Refactor DatasetRef into multiple better-defined classes", but now we're changing DatasetRef).

      The current DatasetRef has too many fields (provenance stuff we never use), too many states (whether or not it has a dataset_id completely changes what it represents), and ill-defined comparisons.  It's almost-but-not-quite immutable, which makes it tempting but a bit dangerous to use as the key type in a set or dict.

       

      Attachments

        Issue Links

          Activity

            No builds found.
            jbosch Jim Bosch created issue -
            jbosch Jim Bosch made changes -
            Field Original Value New Value
            Epic Link DM-21254 [ 414685 ]
            jbosch Jim Bosch made changes -
            Link This issue is child task of DM-21231 [ DM-21231 ]
            jbosch Jim Bosch made changes -
            Description The current DatasetRef has too many fields (provenance stuff we never use), too many states (whether or not it has a {{dataset_id}} _completely_ changes what it represents), and ill-defined comparisons.  It's almost-but-not-quite immutable, which makes it tempting but a bit dangerous to use the key type in a set or dict.  And [~yusra] has pointed out that the {{DeferredDatasetHandle}} class is what actually behaves most like Gen2's ubiquito{{us }}{{DataRef}} class
            jbosch Jim Bosch made changes -
            Description The current DatasetRef has too many fields (provenance stuff we never use), too many states (whether or not it has a {{dataset_id}} _completely_ changes what it represents), and ill-defined comparisons.  It's almost-but-not-quite immutable, which makes it tempting but a bit dangerous to use the key type in a set or dict.  And [~yusra] has pointed out that the {{DeferredDatasetHandle}} class is what actually behaves most like Gen2's ubiquito{{us }}{{DataRef}} class The current DatasetRef has too many fields (provenance stuff we never use), too many states (whether or not it has a {{dataset_id}} _completely_ changes what it represents), and ill-defined comparisons.  It's almost-but-not-quite immutable, which makes it tempting but a bit dangerous to use the key type in a set or dict.  And [~yusra] has pointed out that the {{DeferredDatasetHandle}} class (which has a Butler rather than being something used by a Butler) is what actually behaves most like Gen2's ubiquitous {{DataRef}} class, so if we name any Gen3 class with something that sounds like {{DataRef}} it'd be confusing for it to be anything but that.

            I intend to play with a the design a bit in code before settling on anything, but my preliminary ideas are:
             * Use "Handle" instead of "Ref" as the name suffix for the classes being primarily discussed here; rename {{DeferredDatasetHandle}} to {{DeferredDataRef}}.  Let's consider those names preliminary as we work out what classes should exist - we'll go to #dm-naming-things after that's settled.  (In all bullet points below I'll use these new names.)
             * Define {{PartialDatasetHandle}} to be a simple immutable struct (probably dataclass) of {{DatasetType}} and data ID, probably with the latter a {{DataCoordinate}} whose dimensions are guaranteed to be a subset of the {{DatasetType}}'s dimensions.

             ** This would be used in the {{Datastore}} {{LookupKey}} machinery, where we need to be able to dispatch on just some of the key-value pairs in the data ID (e.g. "instrument") as well as the dataset type.
             ** It _may_ also be used as a type users can pass to various Butler/Registry APIs that actually require fully-specified data IDs, with those APIs validating this internally (e.g. by upgrading to one of the complete {{DatasetHandle}} types below.
             ** I don't think we have any Butler/Registry APIs that would conceptually take or return a partial data ID for a dataset type at present, but it's conceivable that we could add some in the future, for instance something like Gen2's {{Butler.subset}}.  While everything {{Butler.subset}} could do can already be done with the much more powerful {{Registry.queryDatasets}} method, we might want a simpler form for convenience.
             * Define {{DatasetHandle}} to be a simple immutable struct (probably dataclass) of {{DatasetType}} and {{DataCoordinate}} (which may be an {{ExpandedDataCoordinate}} but is not guaranteed to be) with the dimensions of the dataset type and the data ID guaranteed to be the same.
             ** This would be the type most frequently used as an argument to Butler/Registry APIs, especially those that look for datasets.
             ** This would be the type used in {{Quantum}} to represent datasets that do not or may not already exist.
             * Define {{ResolvedDatasetHandle}} to be a subclass of {{DatasetHandle}} that adds fields for {{dataset_id}} (which we need to rename, but not on this ticket, and split into {{dataset_id}} and {{site_id}}, but not on this ticket), {{run_id}}, {{quantum_id}}, {{components}} (a dict mapping string component name to child {{ResolveDatasetHandle}}), and {{collections}} (a {{set}} of string collection names).  Its data ID would be guaranteed to be an {{ExpandedDataCoordinate}}.
             ** This would be the type most frequently used as a return type in Butler/Registry APIs, especially those that look for datasets.
             ** It may be used as an argument in some APIs as well, especially those that operate on datasets that must already exist (like {{Registry.associate}}.
             ** {{Datastores}} will interact exclusively with {{ResolvedDatasetHandle}} when reading and writing datasets (as noted above, they may use {{PartialDatasetHandle}} in {{LookupKey}} mechanisms).
             * All equality comparisons between any of the above handle classes would be strictly on dataset type and data ID, so two {{ResolvedDatasetHandles}} that have those fields in common but have different {{dataset_ids}} (e.g. because they represent the same data product in different runs/collections) _would compare as equal_.  There would be a separate named method for comparing {{ResolvedDatasetHandles}} for complete equivalence, including {{dataset_id}} (it's a pity we can't overload the {{is}} operator for this).  I believe we have no choice about this if we want to make simple {{DatasetHandles}} and {{ResolvedDatasetHandles}} cross-comparable.

            Major questions:
             * Should {{PartialDatasetHandle}} be a base class of {{DatasetHandle}}, so all of these classes have an inheritance relationship?  If so, we may want to make {{PartialDatasetHandle}} construction always go through a factory that can return a {{DatasetHandle}}  if the data ID dimensions are in fact equal to the dataset type dimensions.  I can't think of a substitutability reason why they shouldn't have such a relationship, but the naming certainly makes it feel weird ("a ResolvedDatasetHandle IS A PartialDatasetHandle?").
             * This proposal for {{ResolvedDatasetHandle}}'s fields is to just attach the IDs of the associated Run and (if present) Quantum, which is primarily about avoiding database lookups of those objects given that they're usually not going to be accessed.  It may make sense to attach the full Run instance, at least, because there will relatively few of those and we should be able to cache those instances in memory.  Do we have use cases where we need it, or the full Quantum?
             * Do we need an intermediate class between {{DatasetHandle}} and {{ResolvedDatasetHandle}} that has a {{dataset_id}} but lacks {{components}} and/or {{collections}}, to avoid those lookups?  I think this depends on how often those are used.  An alternative might be to make those fields optional or populated-as-needed, but both of those options would compound further our existing problem with defining sane equality comparisons (are two instances equal if they don't have the same optional fields populated?).  Populating as needed is even worse, because it would require the object to hold a reference to its own Registry.

             
            jbosch Jim Bosch made changes -
            Story Points 2 4
            jbosch Jim Bosch made changes -
            Description The current DatasetRef has too many fields (provenance stuff we never use), too many states (whether or not it has a {{dataset_id}} _completely_ changes what it represents), and ill-defined comparisons.  It's almost-but-not-quite immutable, which makes it tempting but a bit dangerous to use the key type in a set or dict.  And [~yusra] has pointed out that the {{DeferredDatasetHandle}} class (which has a Butler rather than being something used by a Butler) is what actually behaves most like Gen2's ubiquitous {{DataRef}} class, so if we name any Gen3 class with something that sounds like {{DataRef}} it'd be confusing for it to be anything but that.

            I intend to play with a the design a bit in code before settling on anything, but my preliminary ideas are:
             * Use "Handle" instead of "Ref" as the name suffix for the classes being primarily discussed here; rename {{DeferredDatasetHandle}} to {{DeferredDataRef}}.  Let's consider those names preliminary as we work out what classes should exist - we'll go to #dm-naming-things after that's settled.  (In all bullet points below I'll use these new names.)
             * Define {{PartialDatasetHandle}} to be a simple immutable struct (probably dataclass) of {{DatasetType}} and data ID, probably with the latter a {{DataCoordinate}} whose dimensions are guaranteed to be a subset of the {{DatasetType}}'s dimensions.

             ** This would be used in the {{Datastore}} {{LookupKey}} machinery, where we need to be able to dispatch on just some of the key-value pairs in the data ID (e.g. "instrument") as well as the dataset type.
             ** It _may_ also be used as a type users can pass to various Butler/Registry APIs that actually require fully-specified data IDs, with those APIs validating this internally (e.g. by upgrading to one of the complete {{DatasetHandle}} types below.
             ** I don't think we have any Butler/Registry APIs that would conceptually take or return a partial data ID for a dataset type at present, but it's conceivable that we could add some in the future, for instance something like Gen2's {{Butler.subset}}.  While everything {{Butler.subset}} could do can already be done with the much more powerful {{Registry.queryDatasets}} method, we might want a simpler form for convenience.
             * Define {{DatasetHandle}} to be a simple immutable struct (probably dataclass) of {{DatasetType}} and {{DataCoordinate}} (which may be an {{ExpandedDataCoordinate}} but is not guaranteed to be) with the dimensions of the dataset type and the data ID guaranteed to be the same.
             ** This would be the type most frequently used as an argument to Butler/Registry APIs, especially those that look for datasets.
             ** This would be the type used in {{Quantum}} to represent datasets that do not or may not already exist.
             * Define {{ResolvedDatasetHandle}} to be a subclass of {{DatasetHandle}} that adds fields for {{dataset_id}} (which we need to rename, but not on this ticket, and split into {{dataset_id}} and {{site_id}}, but not on this ticket), {{run_id}}, {{quantum_id}}, {{components}} (a dict mapping string component name to child {{ResolveDatasetHandle}}), and {{collections}} (a {{set}} of string collection names).  Its data ID would be guaranteed to be an {{ExpandedDataCoordinate}}.
             ** This would be the type most frequently used as a return type in Butler/Registry APIs, especially those that look for datasets.
             ** It may be used as an argument in some APIs as well, especially those that operate on datasets that must already exist (like {{Registry.associate}}.
             ** {{Datastores}} will interact exclusively with {{ResolvedDatasetHandle}} when reading and writing datasets (as noted above, they may use {{PartialDatasetHandle}} in {{LookupKey}} mechanisms).
             * All equality comparisons between any of the above handle classes would be strictly on dataset type and data ID, so two {{ResolvedDatasetHandles}} that have those fields in common but have different {{dataset_ids}} (e.g. because they represent the same data product in different runs/collections) _would compare as equal_.  There would be a separate named method for comparing {{ResolvedDatasetHandles}} for complete equivalence, including {{dataset_id}} (it's a pity we can't overload the {{is}} operator for this).  I believe we have no choice about this if we want to make simple {{DatasetHandles}} and {{ResolvedDatasetHandles}} cross-comparable.

            Major questions:
             * Should {{PartialDatasetHandle}} be a base class of {{DatasetHandle}}, so all of these classes have an inheritance relationship?  If so, we may want to make {{PartialDatasetHandle}} construction always go through a factory that can return a {{DatasetHandle}}  if the data ID dimensions are in fact equal to the dataset type dimensions.  I can't think of a substitutability reason why they shouldn't have such a relationship, but the naming certainly makes it feel weird ("a ResolvedDatasetHandle IS A PartialDatasetHandle?").
             * This proposal for {{ResolvedDatasetHandle}}'s fields is to just attach the IDs of the associated Run and (if present) Quantum, which is primarily about avoiding database lookups of those objects given that they're usually not going to be accessed.  It may make sense to attach the full Run instance, at least, because there will relatively few of those and we should be able to cache those instances in memory.  Do we have use cases where we need it, or the full Quantum?
             * Do we need an intermediate class between {{DatasetHandle}} and {{ResolvedDatasetHandle}} that has a {{dataset_id}} but lacks {{components}} and/or {{collections}}, to avoid those lookups?  I think this depends on how often those are used.  An alternative might be to make those fields optional or populated-as-needed, but both of those options would compound further our existing problem with defining sane equality comparisons (are two instances equal if they don't have the same optional fields populated?).  Populating as needed is even worse, because it would require the object to hold a reference to its own Registry.

             
            The current DatasetRef has too many fields (provenance stuff we never use), too many states (whether or not it has a {{dataset_id}} _completely_ changes what it represents), and ill-defined comparisons.  It's almost-but-not-quite immutable, which makes it tempting but a bit dangerous to use as the key type in a set or dict.  And [~yusra] has pointed out that the {{DeferredDatasetHandle}} class (which has a Butler rather than being something used by a Butler) is what actually behaves most like Gen2's ubiquitous {{DataRef}} class, so if we name any Gen3 class with something that sounds like {{DataRef}} it'd be confusing for it to be anything but that.

            I intend to play with a the design a bit in code before settling on anything, but my preliminary ideas are:
             * Use "Handle" instead of "Ref" as the name suffix for the classes being primarily discussed here; rename {{DeferredDatasetHandle}} to {{DeferredDataRef}}.  Let's consider those names preliminary as we work out what classes should exist - we'll go to #dm-naming-things after that's settled.  (In all bullet points below I'll use these new names.)
             * Define {{PartialDatasetHandle}} to be a simple immutable struct (probably dataclass) of {{DatasetType}} and data ID, probably with the latter a {{DataCoordinate}} whose dimensions are guaranteed to be a subset of the {{DatasetType}}'s dimensions.

             *
             ** This would be used in the {{Datastore}} {{LookupKey}} machinery, where we need to be able to dispatch on just some of the key-value pairs in the data ID (e.g. "instrument") as well as the dataset type.
             ** It _may_ also be used as a type users can pass to various Butler/Registry APIs that actually require fully-specified data IDs, with those APIs validating this internally (e.g. by upgrading to one of the complete {{DatasetHandle}} types below.
             ** I don't think we have any Butler/Registry APIs that would conceptually take or return a partial data ID for a dataset type at present, but it's conceivable that we could add some in the future, for instance something like Gen2's {{Butler.subset}}.  While everything {{Butler.subset}} could do can already be done with the much more powerful {{Registry.queryDatasets}} method, we might want a simpler form for convenience.
             * Define {{DatasetHandle}} to be a simple immutable struct (probably dataclass) of {{DatasetType}} and {{DataCoordinate}} (which may be an {{ExpandedDataCoordinate}} but is not guaranteed to be) with the dimensions of the dataset type and the data ID guaranteed to be the same.
             ** This would be the type most frequently used as an argument to Butler/Registry APIs, especially those that look for datasets.
             ** This would be the type used in {{Quantum}} to represent datasets that do not or may not already exist.
             * Define {{ResolvedDatasetHandle}} to be a subclass of {{DatasetHandle}} that adds fields for {{dataset_id}} (which we need to rename, but not on this ticket, and split into {{dataset_id}} and {{site_id}}, but not on this ticket), {{run_id}}, {{quantum_id}}, {{components}} (a dict mapping string component name to child {{ResolveDatasetHandle}}), and {{collections}} (a {{set}} of string collection names).  Its data ID would be guaranteed to be an {{ExpandedDataCoordinate}}.
             ** This would be the type most frequently used as a return type in Butler/Registry APIs, especially those that look for datasets.
             ** It may be used as an argument in some APIs as well, especially those that operate on datasets that must already exist (like {{Registry.associate}}.
             ** {{Datastores}} will interact exclusively with {{ResolvedDatasetHandle}} when reading and writing datasets (as noted above, they may use {{PartialDatasetHandle}} in {{LookupKey}} mechanisms).
             * All equality comparisons between any of the above handle classes would be strictly on dataset type and data ID, so two {{ResolvedDatasetHandles}} that have those fields in common but have different {{dataset_ids}} (e.g. because they represent the same data product in different runs/collections) _would compare as equal_.  There would be a separate named method for comparing {{ResolvedDatasetHandles}} for complete equivalence, including {{dataset_id}} (it's a pity we can't overload the {{is}} operator for this).  I believe we have no choice about this if we want to make simple {{DatasetHandles}} and {{ResolvedDatasetHandles}} cross-comparable.

            Major questions:
             * Should {{PartialDatasetHandle}} be a base class of {{DatasetHandle}}, so all of these classes have an inheritance relationship?  If so, we may want to make {{PartialDatasetHandle}} construction always go through a factory that can return a {{DatasetHandle}}  if the data ID dimensions are in fact equal to the dataset type dimensions.  I can't think of a substitutability reason why they shouldn't have such a relationship, but the naming certainly makes it feel weird ("a ResolvedDatasetHandle IS A PartialDatasetHandle?").
             * This proposal for {{ResolvedDatasetHandle}}'s fields is to just attach the IDs of the associated Run and (if present) Quantum, which is primarily about avoiding database lookups of those objects given that they're usually not going to be accessed.  It may make sense to attach the full Run instance, at least, because there will relatively few of those and we should be able to cache those instances in memory.  Do we have use cases where we need it, or the full Quantum?
             * Do we need an intermediate class between {{DatasetHandle}} and {{ResolvedDatasetHandle}} that has a {{dataset_id}} but lacks {{components}} and/or {{collections}}, to avoid those lookups?  I think this depends on how often those are used.  An alternative might be to make those fields optional or populated-as-needed, but both of those options would compound further our existing problem with defining sane equality comparisons (are two instances equal if they don't have the same optional fields populated?).  Populating as needed is even worse, because it would require the object to hold a reference to its own Registry.

             
            jbosch Jim Bosch made changes -
            Description The current DatasetRef has too many fields (provenance stuff we never use), too many states (whether or not it has a {{dataset_id}} _completely_ changes what it represents), and ill-defined comparisons.  It's almost-but-not-quite immutable, which makes it tempting but a bit dangerous to use as the key type in a set or dict.  And [~yusra] has pointed out that the {{DeferredDatasetHandle}} class (which has a Butler rather than being something used by a Butler) is what actually behaves most like Gen2's ubiquitous {{DataRef}} class, so if we name any Gen3 class with something that sounds like {{DataRef}} it'd be confusing for it to be anything but that.

            I intend to play with a the design a bit in code before settling on anything, but my preliminary ideas are:
             * Use "Handle" instead of "Ref" as the name suffix for the classes being primarily discussed here; rename {{DeferredDatasetHandle}} to {{DeferredDataRef}}.  Let's consider those names preliminary as we work out what classes should exist - we'll go to #dm-naming-things after that's settled.  (In all bullet points below I'll use these new names.)
             * Define {{PartialDatasetHandle}} to be a simple immutable struct (probably dataclass) of {{DatasetType}} and data ID, probably with the latter a {{DataCoordinate}} whose dimensions are guaranteed to be a subset of the {{DatasetType}}'s dimensions.

             *
             ** This would be used in the {{Datastore}} {{LookupKey}} machinery, where we need to be able to dispatch on just some of the key-value pairs in the data ID (e.g. "instrument") as well as the dataset type.
             ** It _may_ also be used as a type users can pass to various Butler/Registry APIs that actually require fully-specified data IDs, with those APIs validating this internally (e.g. by upgrading to one of the complete {{DatasetHandle}} types below.
             ** I don't think we have any Butler/Registry APIs that would conceptually take or return a partial data ID for a dataset type at present, but it's conceivable that we could add some in the future, for instance something like Gen2's {{Butler.subset}}.  While everything {{Butler.subset}} could do can already be done with the much more powerful {{Registry.queryDatasets}} method, we might want a simpler form for convenience.
             * Define {{DatasetHandle}} to be a simple immutable struct (probably dataclass) of {{DatasetType}} and {{DataCoordinate}} (which may be an {{ExpandedDataCoordinate}} but is not guaranteed to be) with the dimensions of the dataset type and the data ID guaranteed to be the same.
             ** This would be the type most frequently used as an argument to Butler/Registry APIs, especially those that look for datasets.
             ** This would be the type used in {{Quantum}} to represent datasets that do not or may not already exist.
             * Define {{ResolvedDatasetHandle}} to be a subclass of {{DatasetHandle}} that adds fields for {{dataset_id}} (which we need to rename, but not on this ticket, and split into {{dataset_id}} and {{site_id}}, but not on this ticket), {{run_id}}, {{quantum_id}}, {{components}} (a dict mapping string component name to child {{ResolveDatasetHandle}}), and {{collections}} (a {{set}} of string collection names).  Its data ID would be guaranteed to be an {{ExpandedDataCoordinate}}.
             ** This would be the type most frequently used as a return type in Butler/Registry APIs, especially those that look for datasets.
             ** It may be used as an argument in some APIs as well, especially those that operate on datasets that must already exist (like {{Registry.associate}}.
             ** {{Datastores}} will interact exclusively with {{ResolvedDatasetHandle}} when reading and writing datasets (as noted above, they may use {{PartialDatasetHandle}} in {{LookupKey}} mechanisms).
             * All equality comparisons between any of the above handle classes would be strictly on dataset type and data ID, so two {{ResolvedDatasetHandles}} that have those fields in common but have different {{dataset_ids}} (e.g. because they represent the same data product in different runs/collections) _would compare as equal_.  There would be a separate named method for comparing {{ResolvedDatasetHandles}} for complete equivalence, including {{dataset_id}} (it's a pity we can't overload the {{is}} operator for this).  I believe we have no choice about this if we want to make simple {{DatasetHandles}} and {{ResolvedDatasetHandles}} cross-comparable.

            Major questions:
             * Should {{PartialDatasetHandle}} be a base class of {{DatasetHandle}}, so all of these classes have an inheritance relationship?  If so, we may want to make {{PartialDatasetHandle}} construction always go through a factory that can return a {{DatasetHandle}}  if the data ID dimensions are in fact equal to the dataset type dimensions.  I can't think of a substitutability reason why they shouldn't have such a relationship, but the naming certainly makes it feel weird ("a ResolvedDatasetHandle IS A PartialDatasetHandle?").
             * This proposal for {{ResolvedDatasetHandle}}'s fields is to just attach the IDs of the associated Run and (if present) Quantum, which is primarily about avoiding database lookups of those objects given that they're usually not going to be accessed.  It may make sense to attach the full Run instance, at least, because there will relatively few of those and we should be able to cache those instances in memory.  Do we have use cases where we need it, or the full Quantum?
             * Do we need an intermediate class between {{DatasetHandle}} and {{ResolvedDatasetHandle}} that has a {{dataset_id}} but lacks {{components}} and/or {{collections}}, to avoid those lookups?  I think this depends on how often those are used.  An alternative might be to make those fields optional or populated-as-needed, but both of those options would compound further our existing problem with defining sane equality comparisons (are two instances equal if they don't have the same optional fields populated?).  Populating as needed is even worse, because it would require the object to hold a reference to its own Registry.

             
            The current DatasetRef has too many fields (provenance stuff we never use), too many states (whether or not it has a {{dataset_id}} _completely_ changes what it represents), and ill-defined comparisons.  It's almost-but-not-quite immutable, which makes it tempting but a bit dangerous to use as the key type in a set or dict.  And [~yusra] has pointed out that the {{DeferredDatasetHandle}} class (which has a Butler rather than being something used by a Butler) is what actually behaves most like Gen2's ubiquitous {{DataRef}} class, so if we name any Gen3 class with something that sounds like {{DataRef}} it'd be confusing for it to be anything but that.

            I intend to play with a the design a bit in code before settling on anything, but my preliminary ideas are:
             * Use "Handle" instead of "Ref" as the name suffix for the classes being primarily discussed here; rename {{DeferredDatasetHandle}} to {{DeferredDataRef}}.  Let's consider those names preliminary as we work out what classes should exist - we'll go to #dm-naming-things after that's settled.  (In all bullet points below I'll use these new names.)
             * Define {{PartialDatasetHandle}} to be a simple immutable struct (probably dataclass) of {{DatasetType}} and data ID, probably with the latter a {{DataCoordinate}} whose dimensions are guaranteed to be a subset of the {{DatasetType}}'s dimensions.
             ** This would be used in the {{Datastore}} {{LookupKey}} machinery, where we need to be able to dispatch on just some of the key-value pairs in the data ID (e.g. "instrument") as well as the dataset type.
             ** It _may_ also be used as a type users can pass to various Butler/Registry APIs that actually require fully-specified data IDs, with those APIs validating this internally (e.g. by upgrading to one of the complete {{DatasetHandle}} types below.
             ** I don't think we have any Butler/Registry APIs that would conceptually take or return a partial data ID for a dataset type at present, but it's conceivable that we could add some in the future, for instance something like Gen2's {{Butler.subset}}.  While everything {{Butler.subset}} could do can already be done with the much more powerful {{Registry.queryDatasets}} method, we might want a simpler form for convenience.

             * Define {{DatasetHandle}} to be a simple immutable struct (probably dataclass) of {{DatasetType}} and {{DataCoordinate}} (which may be an {{ExpandedDataCoordinate}} but is not guaranteed to be) with the dimensions of the dataset type and the data ID guaranteed to be the same.
             ** This would be the type most frequently used as an argument to Butler/Registry APIs, especially those that look for datasets.
             ** This would be the type used in {{Quantum}} to represent datasets that do not or may not already exist.

             * Define {{ResolvedDatasetHandle}} to be a subclass of {{DatasetHandle}} that adds fields for {{dataset_id}} (which we need to rename, but not on this ticket, and split into {{dataset_id}} and {{site_id}}, but not on this ticket), {{run_id}}, {{quantum_id}}, {{components}} (a dict mapping string component name to child {{ResolveDatasetHandle}}), and {{collections}} (a {{set}} of string collection names).  Its data ID would be guaranteed to be an {{ExpandedDataCoordinate}}.
             ** This would be the type most frequently used as a return type in Butler/Registry APIs, especially those that look for datasets.
             ** It may be used as an argument in some APIs as well, especially those that operate on datasets that must already exist (like {{Registry.associate}}.
             ** {{Datastores}} will interact exclusively with {{ResolvedDatasetHandle}} when reading and writing datasets (as noted above, they may use {{PartialDatasetHandle}} in {{LookupKey}} mechanisms).

             * All equality comparisons between any of the above handle classes would be strictly on dataset type and data ID, so two {{ResolvedDatasetHandles}} that have those fields in common but have different {{dataset_ids}} (e.g. because they represent the same data product in different runs/collections) _would compare as equal_.  There would be a separate named method for comparing {{ResolvedDatasetHandles}} for complete equivalence, including {{dataset_id}} (it's a pity we can't overload the {{is}} operator for this).  I believe we have no choice about this if we want to make simple {{DatasetHandles}} and {{ResolvedDatasetHandles}} cross-comparable.

            Major questions:
             * Should {{PartialDatasetHandle}} be a base class of {{DatasetHandle}}, so all of these classes have an inheritance relationship?  If so, we may want to make {{PartialDatasetHandle}} construction always go through a factory that can return a {{DatasetHandle}}  if the data ID dimensions are in fact equal to the dataset type dimensions.  I can't think of a substitutability reason why they shouldn't have such a relationship, but the naming certainly makes it feel weird ("a ResolvedDatasetHandle IS A PartialDatasetHandle?").
             * This proposal for {{ResolvedDatasetHandle}}'s fields is to just attach the IDs of the associated Run and (if present) Quantum, which is primarily about avoiding database lookups of those objects given that they're usually not going to be accessed.  It may make sense to attach the full Run instance, at least, because there will relatively few of those and we should be able to cache those instances in memory.  Do we have use cases where we need it, or the full Quantum?
             * Do we need an intermediate class between {{DatasetHandle}} and {{ResolvedDatasetHandle}} that has a {{dataset_id}} but lacks {{components}} and/or {{collections}}, to avoid those lookups?  I think this depends on how often those are used.  An alternative might be to make those fields optional or populated-as-needed, but both of those options would compound further our existing problem with defining sane equality comparisons (are two instances equal if they don't have the same optional fields populated?).  Populating as needed is even worse, because it would require the object to hold a reference to its own Registry.

             
            jbosch Jim Bosch made changes -
            Status To Do [ 10001 ] In Progress [ 3 ]
            jbosch Jim Bosch made changes -
            Description The current DatasetRef has too many fields (provenance stuff we never use), too many states (whether or not it has a {{dataset_id}} _completely_ changes what it represents), and ill-defined comparisons.  It's almost-but-not-quite immutable, which makes it tempting but a bit dangerous to use as the key type in a set or dict.  And [~yusra] has pointed out that the {{DeferredDatasetHandle}} class (which has a Butler rather than being something used by a Butler) is what actually behaves most like Gen2's ubiquitous {{DataRef}} class, so if we name any Gen3 class with something that sounds like {{DataRef}} it'd be confusing for it to be anything but that.

            I intend to play with a the design a bit in code before settling on anything, but my preliminary ideas are:
             * Use "Handle" instead of "Ref" as the name suffix for the classes being primarily discussed here; rename {{DeferredDatasetHandle}} to {{DeferredDataRef}}.  Let's consider those names preliminary as we work out what classes should exist - we'll go to #dm-naming-things after that's settled.  (In all bullet points below I'll use these new names.)
             * Define {{PartialDatasetHandle}} to be a simple immutable struct (probably dataclass) of {{DatasetType}} and data ID, probably with the latter a {{DataCoordinate}} whose dimensions are guaranteed to be a subset of the {{DatasetType}}'s dimensions.
             ** This would be used in the {{Datastore}} {{LookupKey}} machinery, where we need to be able to dispatch on just some of the key-value pairs in the data ID (e.g. "instrument") as well as the dataset type.
             ** It _may_ also be used as a type users can pass to various Butler/Registry APIs that actually require fully-specified data IDs, with those APIs validating this internally (e.g. by upgrading to one of the complete {{DatasetHandle}} types below.
             ** I don't think we have any Butler/Registry APIs that would conceptually take or return a partial data ID for a dataset type at present, but it's conceivable that we could add some in the future, for instance something like Gen2's {{Butler.subset}}.  While everything {{Butler.subset}} could do can already be done with the much more powerful {{Registry.queryDatasets}} method, we might want a simpler form for convenience.

             * Define {{DatasetHandle}} to be a simple immutable struct (probably dataclass) of {{DatasetType}} and {{DataCoordinate}} (which may be an {{ExpandedDataCoordinate}} but is not guaranteed to be) with the dimensions of the dataset type and the data ID guaranteed to be the same.
             ** This would be the type most frequently used as an argument to Butler/Registry APIs, especially those that look for datasets.
             ** This would be the type used in {{Quantum}} to represent datasets that do not or may not already exist.

             * Define {{ResolvedDatasetHandle}} to be a subclass of {{DatasetHandle}} that adds fields for {{dataset_id}} (which we need to rename, but not on this ticket, and split into {{dataset_id}} and {{site_id}}, but not on this ticket), {{run_id}}, {{quantum_id}}, {{components}} (a dict mapping string component name to child {{ResolveDatasetHandle}}), and {{collections}} (a {{set}} of string collection names).  Its data ID would be guaranteed to be an {{ExpandedDataCoordinate}}.
             ** This would be the type most frequently used as a return type in Butler/Registry APIs, especially those that look for datasets.
             ** It may be used as an argument in some APIs as well, especially those that operate on datasets that must already exist (like {{Registry.associate}}.
             ** {{Datastores}} will interact exclusively with {{ResolvedDatasetHandle}} when reading and writing datasets (as noted above, they may use {{PartialDatasetHandle}} in {{LookupKey}} mechanisms).

             * All equality comparisons between any of the above handle classes would be strictly on dataset type and data ID, so two {{ResolvedDatasetHandles}} that have those fields in common but have different {{dataset_ids}} (e.g. because they represent the same data product in different runs/collections) _would compare as equal_.  There would be a separate named method for comparing {{ResolvedDatasetHandles}} for complete equivalence, including {{dataset_id}} (it's a pity we can't overload the {{is}} operator for this).  I believe we have no choice about this if we want to make simple {{DatasetHandles}} and {{ResolvedDatasetHandles}} cross-comparable.

            Major questions:
             * Should {{PartialDatasetHandle}} be a base class of {{DatasetHandle}}, so all of these classes have an inheritance relationship?  If so, we may want to make {{PartialDatasetHandle}} construction always go through a factory that can return a {{DatasetHandle}}  if the data ID dimensions are in fact equal to the dataset type dimensions.  I can't think of a substitutability reason why they shouldn't have such a relationship, but the naming certainly makes it feel weird ("a ResolvedDatasetHandle IS A PartialDatasetHandle?").
             * This proposal for {{ResolvedDatasetHandle}}'s fields is to just attach the IDs of the associated Run and (if present) Quantum, which is primarily about avoiding database lookups of those objects given that they're usually not going to be accessed.  It may make sense to attach the full Run instance, at least, because there will relatively few of those and we should be able to cache those instances in memory.  Do we have use cases where we need it, or the full Quantum?
             * Do we need an intermediate class between {{DatasetHandle}} and {{ResolvedDatasetHandle}} that has a {{dataset_id}} but lacks {{components}} and/or {{collections}}, to avoid those lookups?  I think this depends on how often those are used.  An alternative might be to make those fields optional or populated-as-needed, but both of those options would compound further our existing problem with defining sane equality comparisons (are two instances equal if they don't have the same optional fields populated?).  Populating as needed is even worse, because it would require the object to hold a reference to its own Registry.

             
            The current DatasetRef has too many fields (provenance stuff we never use), too many states (whether or not it has a {{dataset_id}} _completely_ changes what it represents), and ill-defined comparisons.  It's almost-but-not-quite immutable, which makes it tempting but a bit dangerous to use as the key type in a set or dict.  And [~yusra] has pointed out that the {{DeferredDatasetHandle}} class (which has a Butler rather than being something used by a Butler) is what actually behaves most like Gen2's ubiquitous {{DataRef}} class, so if we name any Gen3 class with something that sounds like {{DataRef}} it'd be confusing for it to be anything but that.

            I intend to play with a the design a bit in code before settling on anything, but my preliminary ideas are:
             * Use "Handle" instead of "Ref" as the name suffix for the classes being primarily discussed here; rename {{DeferredDatasetHandle}} to {{DeferredDataRef}}.  Let's consider those names preliminary as we work out what classes should exist - we'll go to #dm-naming-things after that's settled.  (In all bullet points below I'll use these new names.)
             * Define {{PartialDatasetHandle}} to be a simple immutable struct (probably dataclass) of {{DatasetType}} and data ID, probably with the latter a {{DataCoordinate}} whose dimensions are guaranteed to be a subset of the {{DatasetType}}'s dimensions.
             ** This would be used in the {{Datastore}} {{LookupKey}} machinery, where we need to be able to dispatch on just some of the key-value pairs in the data ID (e.g. "instrument") as well as the dataset type.
             ** It _may_ also be used as a type users can pass to various Butler/Registry APIs that actually require fully-specified data IDs, with those APIs validating this internally (e.g. by upgrading to one of the complete {{DatasetHandle}} types below.
             ** I don't think we have any Butler/Registry APIs that would conceptually take or return a partial data ID for a dataset type at present, but it's conceivable that we could add some in the future, for instance something like Gen2's {{Butler.subset}}.  While everything {{Butler.subset}} could do can already be done with the much more powerful {{Registry.queryDatasets}} method, we might want a simpler form for convenience.

             * Define {{DatasetHandle}} to be a simple immutable struct (probably dataclass) of {{DatasetType}} and {{DataCoordinate}} (which may be an {{ExpandedDataCoordinate}} but is not guaranteed to be) with the dimensions of the dataset type and the data ID guaranteed to be the same.
             ** This would be the type most frequently used as an argument to Butler/Registry APIs, especially those that look for datasets.
             ** This would be the type used in {{Quantum}} to represent datasets that do not or may not already exist.

             * Define {{ResolvedDatasetHandle}} to be a subclass of {{DatasetHandle}} that adds fields for {{dataset_id}} (which we need to rename, but not on this ticket, and split into {{dataset_id}} and {{site_id}}, but not on this ticket), {{run_id}}, {{quantum_id}}, {{components}} (a dict mapping string component name to child {{ResolveDatasetHandle}}), and {{collections}} (a {{set}} of string collection names).  Its data ID would be guaranteed to be an {{ExpandedDataCoordinate}}.
             ** This would be the type most frequently used as a return type in Butler/Registry APIs, especially those that look for datasets.
             ** It may be used as an argument in some APIs as well, especially those that operate on datasets that must already exist (like {{Registry.associate}}.
             ** This would be the type used in {{Quantum}} to represent datasets that definitely already exist.
             ** {{Datastores}} will interact exclusively with {{ResolvedDatasetHandle}} when reading and writing datasets (as noted above, they may use {{PartialDatasetHandle}} in {{LookupKey}} mechanisms).

             * All equality comparisons between any of the above handle classes would be strictly on dataset type and data ID, so two {{ResolvedDatasetHandles}} that have those fields in common but have different {{dataset_ids}} (e.g. because they represent the same data product in different runs/collections) _would compare as equal_.  There would be a separate named method for comparing {{ResolvedDatasetHandles}} for complete equivalence, including {{dataset_id}} (it's a pity we can't overload the {{is}} operator for this).  I believe we have no choice about this if we want to make simple {{DatasetHandles}} and {{ResolvedDatasetHandles}} cross-comparable.

            Major questions:
             * Should {{PartialDatasetHandle}} be a base class of {{DatasetHandle}}, so all of these classes have an inheritance relationship?  If so, we may want to make {{PartialDatasetHandle}} construction always go through a factory that can return a {{DatasetHandle}}  if the data ID dimensions are in fact equal to the dataset type dimensions.  I can't think of a substitutability reason why they shouldn't have such a relationship, but the naming certainly makes it feel weird ("a ResolvedDatasetHandle IS A PartialDatasetHandle?").
             * This proposal for {{ResolvedDatasetHandle}}'s fields is to just attach the IDs of the associated Run and (if present) Quantum, which is primarily about avoiding database lookups of those objects given that they're usually not going to be accessed.  It may make sense to attach the full Run instance, at least, because there will relatively few of those and we should be able to cache those instances in memory.  Do we have use cases where we need it, or the full Quantum?
             * Do we need an intermediate class between {{DatasetHandle}} and {{ResolvedDatasetHandle}} that has a {{dataset_id}} but lacks {{components}} and/or {{collections}}, to avoid those lookups?  I think this depends on how often those are used.  An alternative might be to make those fields optional or populated-as-needed, but both of those options would compound further our existing problem with defining sane equality comparisons (are two instances equal if they don't have the same optional fields populated?).  Populating as needed is even worse, because it would require the object to hold a reference to its own Registry.

             
            jbosch Jim Bosch made changes -
            Remote Link This issue links to "Page (Confluence)" [ 22047 ]
            jbosch Jim Bosch made changes -
            Link This issue blocks DM-21768 [ DM-21768 ]
            jbosch Jim Bosch made changes -
            Link This issue blocks DM-21766 [ DM-21766 ]
            jbosch Jim Bosch made changes -
            Link This issue is child task of DM-21231 [ DM-21231 ]
            jbosch Jim Bosch made changes -
            Link This issue is contained by DM-21231 [ DM-21231 ]
            jbosch Jim Bosch made changes -
            Labels gen3-middleware gen2-deprecation-blocker gen3-middleware
            jbosch Jim Bosch added a comment - - edited

            I am returning to work on this now, but plan to lower ambitions significantly:

            • DatasetRef will be retained, and there will be no new classes in this area.
            • DatasetRef equality will be defined as a comparison of the dataset type, data ID, and dataset_id; this will make "resolved" DatasetRefs (which know their dataset_id) compare more precisely to each other, and always compare as unequal to unresolved DatasetRefs.
            • Add resolved() and unresolved() methods to add and remove ID, run, and component information (returning new objects).
            • Remove the reference to quantum from DatasetRef (we don't use this at all right now, and I don't plan to use it in the future).
            • Make DatasetRef truly immutable (or at least fixing the attributes used in comparisons; maybe not components), instead of letting some classes (i.e. Registry) modify it in-place.
            • Make DatasetRef hashable.

             

            jbosch Jim Bosch added a comment - - edited I am returning to work on this now, but plan to lower ambitions significantly: DatasetRef will be retained, and there will be no new classes in this area. DatasetRef equality will be defined as a comparison of the dataset type, data ID, and dataset_id; this will make "resolved" DatasetRefs (which know their dataset_id) compare more precisely to each other, and always compare as unequal to unresolved DatasetRefs. Add resolved() and unresolved() methods to add and remove ID, run, and component information (returning new objects). Remove the reference to quantum from DatasetRef (we don't use this at all right now, and I don't plan to use it in the future). Make DatasetRef truly immutable (or at least fixing the attributes used in comparisons; maybe not components), instead of letting some classes (i.e. Registry) modify it in-place. Make DatasetRef hashable.  
            jbosch Jim Bosch made changes -
            Summary Refactor DatasetRef into multiple better-defined classes Clean up DatasetRef comparisons and immutability
            jbosch Jim Bosch made changes -
            Description The current DatasetRef has too many fields (provenance stuff we never use), too many states (whether or not it has a {{dataset_id}} _completely_ changes what it represents), and ill-defined comparisons.  It's almost-but-not-quite immutable, which makes it tempting but a bit dangerous to use as the key type in a set or dict.  And [~yusra] has pointed out that the {{DeferredDatasetHandle}} class (which has a Butler rather than being something used by a Butler) is what actually behaves most like Gen2's ubiquitous {{DataRef}} class, so if we name any Gen3 class with something that sounds like {{DataRef}} it'd be confusing for it to be anything but that.

            I intend to play with a the design a bit in code before settling on anything, but my preliminary ideas are:
             * Use "Handle" instead of "Ref" as the name suffix for the classes being primarily discussed here; rename {{DeferredDatasetHandle}} to {{DeferredDataRef}}.  Let's consider those names preliminary as we work out what classes should exist - we'll go to #dm-naming-things after that's settled.  (In all bullet points below I'll use these new names.)
             * Define {{PartialDatasetHandle}} to be a simple immutable struct (probably dataclass) of {{DatasetType}} and data ID, probably with the latter a {{DataCoordinate}} whose dimensions are guaranteed to be a subset of the {{DatasetType}}'s dimensions.
             ** This would be used in the {{Datastore}} {{LookupKey}} machinery, where we need to be able to dispatch on just some of the key-value pairs in the data ID (e.g. "instrument") as well as the dataset type.
             ** It _may_ also be used as a type users can pass to various Butler/Registry APIs that actually require fully-specified data IDs, with those APIs validating this internally (e.g. by upgrading to one of the complete {{DatasetHandle}} types below.
             ** I don't think we have any Butler/Registry APIs that would conceptually take or return a partial data ID for a dataset type at present, but it's conceivable that we could add some in the future, for instance something like Gen2's {{Butler.subset}}.  While everything {{Butler.subset}} could do can already be done with the much more powerful {{Registry.queryDatasets}} method, we might want a simpler form for convenience.

             * Define {{DatasetHandle}} to be a simple immutable struct (probably dataclass) of {{DatasetType}} and {{DataCoordinate}} (which may be an {{ExpandedDataCoordinate}} but is not guaranteed to be) with the dimensions of the dataset type and the data ID guaranteed to be the same.
             ** This would be the type most frequently used as an argument to Butler/Registry APIs, especially those that look for datasets.
             ** This would be the type used in {{Quantum}} to represent datasets that do not or may not already exist.

             * Define {{ResolvedDatasetHandle}} to be a subclass of {{DatasetHandle}} that adds fields for {{dataset_id}} (which we need to rename, but not on this ticket, and split into {{dataset_id}} and {{site_id}}, but not on this ticket), {{run_id}}, {{quantum_id}}, {{components}} (a dict mapping string component name to child {{ResolveDatasetHandle}}), and {{collections}} (a {{set}} of string collection names).  Its data ID would be guaranteed to be an {{ExpandedDataCoordinate}}.
             ** This would be the type most frequently used as a return type in Butler/Registry APIs, especially those that look for datasets.
             ** It may be used as an argument in some APIs as well, especially those that operate on datasets that must already exist (like {{Registry.associate}}.
             ** This would be the type used in {{Quantum}} to represent datasets that definitely already exist.
             ** {{Datastores}} will interact exclusively with {{ResolvedDatasetHandle}} when reading and writing datasets (as noted above, they may use {{PartialDatasetHandle}} in {{LookupKey}} mechanisms).

             * All equality comparisons between any of the above handle classes would be strictly on dataset type and data ID, so two {{ResolvedDatasetHandles}} that have those fields in common but have different {{dataset_ids}} (e.g. because they represent the same data product in different runs/collections) _would compare as equal_.  There would be a separate named method for comparing {{ResolvedDatasetHandles}} for complete equivalence, including {{dataset_id}} (it's a pity we can't overload the {{is}} operator for this).  I believe we have no choice about this if we want to make simple {{DatasetHandles}} and {{ResolvedDatasetHandles}} cross-comparable.

            Major questions:
             * Should {{PartialDatasetHandle}} be a base class of {{DatasetHandle}}, so all of these classes have an inheritance relationship?  If so, we may want to make {{PartialDatasetHandle}} construction always go through a factory that can return a {{DatasetHandle}}  if the data ID dimensions are in fact equal to the dataset type dimensions.  I can't think of a substitutability reason why they shouldn't have such a relationship, but the naming certainly makes it feel weird ("a ResolvedDatasetHandle IS A PartialDatasetHandle?").
             * This proposal for {{ResolvedDatasetHandle}}'s fields is to just attach the IDs of the associated Run and (if present) Quantum, which is primarily about avoiding database lookups of those objects given that they're usually not going to be accessed.  It may make sense to attach the full Run instance, at least, because there will relatively few of those and we should be able to cache those instances in memory.  Do we have use cases where we need it, or the full Quantum?
             * Do we need an intermediate class between {{DatasetHandle}} and {{ResolvedDatasetHandle}} that has a {{dataset_id}} but lacks {{components}} and/or {{collections}}, to avoid those lookups?  I think this depends on how often those are used.  An alternative might be to make those fields optional or populated-as-needed, but both of those options would compound further our existing problem with defining sane equality comparisons (are two instances equal if they don't have the same optional fields populated?).  Populating as needed is even worse, because it would require the object to hold a reference to its own Registry.

             
            (Previous title was "Refactor DatasetRef into multiple better-defined classes", but now we're changing DatasetRef).

            The current DatasetRef has too many fields (provenance stuff we never use), too many states (whether or not it has a {{dataset_id}} _completely_ changes what it represents), and ill-defined comparisons.  It's almost-but-not-quite immutable, which makes it tempting but a bit dangerous to use as the key type in a set or dict.

             
            jbosch Jim Bosch added a comment -

            After much experimentation, I've decided not to change very much here; having different classes to distinguish between different states doesn't help very much when all of the typing is dynamic anyway.  tjenness, mind reviewing?

            PR is https://github.com/lsst/daf_butler/pull/210

             

            jbosch Jim Bosch added a comment - After much experimentation, I've decided not to change very much here; having different classes to distinguish between different states doesn't help very much when all of the typing is dynamic anyway.  tjenness , mind reviewing? PR is https://github.com/lsst/daf_butler/pull/210  
            jbosch Jim Bosch made changes -
            Reviewers Tim Jenness [ tjenness ]
            Status In Progress [ 3 ] In Review [ 10004 ]
            tjenness Tim Jenness made changes -
            Link This issue duplicates DM-22178 [ DM-22178 ]
            tjenness Tim Jenness made changes -
            Link This issue duplicates DM-22178 [ DM-22178 ]
            tjenness Tim Jenness made changes -
            Link This issue is duplicated by DM-22178 [ DM-22178 ]
            tjenness Tim Jenness made changes -
            Link This issue relates to DM-22178 [ DM-22178 ]
            tjenness Tim Jenness made changes -
            Link This issue is duplicated by DM-22178 [ DM-22178 ]
            jbosch Jim Bosch made changes -
            Link This issue relates to DM-22286 [ DM-22286 ]
            tjenness Tim Jenness added a comment -

            Looks good to me. I have one fairly substantive comment on the PR concerning components.

            tjenness Tim Jenness added a comment - Looks good to me. I have one fairly substantive comment on the PR concerning components.
            tjenness Tim Jenness made changes -
            Status In Review [ 10004 ] Reviewed [ 10101 ]
            jbosch Jim Bosch added a comment -

            I've added a PR for ctrl_mpexec, where setting components to None for unresolved DatasetRefs broke some code (noticed in a ci_hsc_gen3 failure).

            PR is here: https://github.com/lsst/ctrl_mpexec/pull/37

            The change is trivial (once you understand the context, at least) so I'll merge once Jenkins passes, but I'd love to get a sign-off on that PR, too.

            jbosch Jim Bosch added a comment - I've added a PR for ctrl_mpexec, where setting components to None for unresolved DatasetRefs broke some code (noticed in a ci_hsc_gen3 failure). PR is here: https://github.com/lsst/ctrl_mpexec/pull/37 The change is trivial (once you understand the context, at least) so I'll merge once Jenkins passes, but I'd love to get a sign-off on that PR, too.
            jbosch Jim Bosch made changes -
            Resolution Done [ 10000 ]
            Status Reviewed [ 10101 ] Done [ 10002 ]

            People

              jbosch Jim Bosch
              jbosch Jim Bosch
              Tim Jenness
              Jim Bosch, Tim Jenness
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Jenkins

                  No builds found.