Uploaded image for project: 'Request For Comments'
  1. Request For Comments
  2. RFC-804

Make dataset type names non-unique in data repositories

    XMLWordPrintable

    Details

    • Type: RFC
    • Status: Withdrawn
    • Resolution: Done
    • Component/s: DM
    • Labels:
      None

      Description

      The Gen3 Butler currently uses the string name of a DatasetType as a repository-wide unique identifier for it; a calexp, raw, or deepCoadd means exactly the same thing for all instruments, collections, and users.

      This consistency is convenient most of the time, but deeply problematic in some important cases:

      • In a many-user repository, two users that are otherwise completely unaware of each can experience a conflict over a dataset type name that only they use.
      • We cannot change a DatasetType from one Python type to a newer one without migrating data repositories (and in doing so, breaking the ability for older code to read older datasets), even if we make sure the new and old types are compatible from a code perspective.  This problem is already blocking the implementation of RFC-783.
      • We cannot change the dimensions associated with a DatasetType without similar problems, even DatasetTypes that represent processing intermediates rather than public data products.

      Much of the discussion about resolving this has involved the idea of "per-collection DatasetTypes", but I think that'd be a step much too far: a lot of the power of our current DatasetType concept is that it is cross-collection, and hence can be used to make meaningful comparisons across different processing runs.

      A better solution is to make DatasetType names non-unique in the repository but unique within any collection (with the possible exception of CHAINED collections; more on that later); one DatasetType would typically still be used in many, many collections, and while a different DatasetType with the same name could exist in the same repository, it could only be used in completely different sets of collections.

      This is less disruptive than it might seem, because in almost all contexts where a user wants to resolve a DatasetType name (or wildcard), a list of relevant collections is readily available, either from Butler initialization, the pipetask command line, or arguments passed to Registry methods.  We already have (and aggressively fetch) summary tables that record which DatasetTypes are present in each collection, so we can perform those resolutions efficiently.  I think we can demand that collections are provided whenever dataset type names are solved without breaking much code; this would be an API change, but I think a relatively minor one.

      Problems do arise when this resolution is not unique (i.e. the name resolves to multiple DatasetTypes even in the set of collections known to be in play), and this proposal does rely a bit on these simply being rare in practice.  We can use database constraints to ensure that only one definition appears in any particular RUN, TAGGED, or CALIBRATION collection, to avoid mistakes.  Avoiding multiple definitions in CHAINED collections or multi-collection search paths is trickier, but I think in at least the vast majority of cases we can get away with making it an error to query a sequence of collections via a DatasetType name that is non-unique within that set; we can tell the user to instead pass a fully-resolved DatasetType object instead, resolving the ambiguity and making it quite difficult for the user to silently get something other than what they expect.

      This change will require a butler schema change and migration.  It may be possible to allow older code to read migrated repositories, but they should probably be prohibited from writing to them in certain ways (and it may not be worth the effort to support even that).  It should be possible to ensure that newer code can read and write to unmigrated older repositories (though again I'm not sure of the effort involved).  Because all repositories already use a synthetic integer primary for DatasetType, the migration should be quite straightforward (removal of the unique constraint on the name, addition of a name column and unique constraint to dataset-collection tag tables).

       

        Attachments

          Issue Links

            Activity

            Hide
            jbosch Jim Bosch added a comment -

            I'm just withdrawing this for now; I think a more extensive compromise proposal is needed, and the motivation for doing this quickly has gone away.  So some (unspecified) time in the future I'll come back with a new proposal as a new RFC.

            Show
            jbosch Jim Bosch added a comment - I'm just withdrawing this for now; I think a more extensive compromise proposal is needed, and the motivation for doing this quickly has gone away.  So some (unspecified) time in the future I'll come back with a new proposal as a new RFC.
            Hide
            gpdf Gregory Dubois-Felsmann added a comment - - edited

            I am very uncomfortable with the proposal.

            I feel strongly about Butler data access needing to be a stable interface. In addition to Tim Jenness’s concerns, with which I agree:

            We declare the interfaces and capabilities of PipelineTasks, both in the code and in derived documentation, in terms of the string names of DatasetTypes. This would become an unreliable guarantee in the proposal under consideration. The usability of a given PipelineTask would become completely contingent on what collection it happened to be executed on at run time.

            In addition, I fear an unmanageable situation could develop in which different types names are “sliced” by different subgroups of collections, leading to a combinatorial growth in the number of different “flavors” of collections (some with calexp-A and src-A, one with calexp-B and src-A, others with calexp-B and src-B, and so on), a really nightmarish possibility.

            With regard to the three initial concerns expressed in the proposal:

            i) I agree that it’s a concern that the long life of repositories and their multi-user nature will lead to unpredictable collisions, but I am much happier with a namespaced solution to this. We should explore further how to do that in a way that doesn’t make the most common experimental-code-development use cases unnecessarily verbose. The suggestions above seem to be headed in a good direction.

            ii) This is a really serious issue that does require an engineering solution. The long life of repositories and their size mean that we have to have an acceptable solution for migrations of type within a repository. However, I don’t think redefining names from collection to collection is a suitable solution.

            Before we go further down this road, I also want to be sure that we are clearly distinguishing between changes in the Python type for an existing (and unchanged) persistence-type and changes in the persistence-type (which might or might not need to be reflected by changes in the associated Python type).

            On the first one, which I’ll call ii.a), I’d like to note that we have other reasons than migrations to want the capability to have a single external, persisted dataset be instantiatable by Butler as more than one Python type. It would be very convenient for our users, for instance, to be able to retrieve an exposure as either its native afw type or as a common Astropy type; or a source table as afw.table or as a pandas dataframe.

            A digression on that: On BaBar, our C++ equivalent to the Butler had an interface like this (with a little translation / obscuration of unimportant details):

            pointer-to-object = Ifd<ObjectType>::get( butler(dataid), key )

            (Making this rigorously type-safe without having the “butler” class depend on every data type in our system, which would have produced unmanageable dependency cycles, or constant recompilation of the world, was a publishable tour-de-force of mid-1990s C++ - hat tip to Ed Frank, currently at ANL. It would be much easier to implement with current C++.)

            The “key” was most commonly a string that allowed distinguishing between semantically distinct dataset types that were represented by a single C++ type - this is perhaps somewhat similar to the way we have multiple logical image dataset types all of which resolve to the same Python class.

            This system naturally supported retrieving the same dataset as different class types, easily supporting migrations. You could, say, do both Ifd<A1>::get(b, "ChargedTracks") and Ifd<A2>::get(b, "ChargedTracks") on the same “butler” b, with both loading from the same external persisted dataset.

            Inspired by this, could we not support something like a class= parameter to butler.get()? This would throw an exception if no formatter for the specified class and the specified dataset were available. In the absence of{{ class=}}, Butler would return a default class type for a specified dataset type. The formatter to use for a given dataset type would be looked up from both its (string) DatasetType and from the specified class= value, if any.

            The class could also be added to a PipelineTask configuration, declaring - optionally - that a particular implementation of that task can only deal with one specific class type for its inputs.

            In the course of a migration to a new class type for an existing persisted dataset type, new code could be introduced first with the use of the class= parameter, leaving existing code able to still run on the old version of the class, while newly written code explicitly requests the new version. Then, at some later time, the default class type - which would be a property of the software release in addition to the repository could be changed.

            Finally, on

            iii) We cannot change the dimensions associated with a DatasetType without similar problems, even DatasetTypes that represent processing intermediates rather than public data products.

            I’m not as sympathetic. This seems like a pretty fundamental part of the notion of a DatasetType, and I’m willing to ask developers to give the new version a new name. Remember that if we have user-namespaced dataset types, we have a “sandbox” in which developers can give some transitional versions of types under rapid development names that may seem “ugly” - while they are working on branches, etc., switching to a "clean" name in the common namespace once the work has settled down. But once there’s data in the shared repo of a particular type, in the common namespace, it seems immensely confusing to have the “keys” needed for that name be different in different collections. Among other things, this means that understanding the qualitative nature of the slicing/grouping that occurs when a PipelineTask using such a type is executed in a Pipeline is not predictable even roughly by just knowing the types - it’s only resolvable when the specific collections involved are known.

            With these concerns in mind, I am not able to support the RFC without further discussion.

            Show
            gpdf Gregory Dubois-Felsmann added a comment - - edited I am very uncomfortable with the proposal. I feel strongly about Butler data access needing to be a stable interface. In addition to Tim Jenness ’s concerns, with which I agree: We declare the interfaces and capabilities of PipelineTasks, both in the code and in derived documentation, in terms of the string names of DatasetTypes. This would become an unreliable guarantee in the proposal under consideration. The usability of a given PipelineTask would become completely contingent on what collection it happened to be executed on at run time. In addition, I fear an unmanageable situation could develop in which different types names are “sliced” by different subgroups of collections, leading to a combinatorial growth in the number of different “flavors” of collections (some with calexp-A and src-A, one with calexp-B and src-A, others with calexp-B and src-B, and so on), a really nightmarish possibility. With regard to the three initial concerns expressed in the proposal: i) I agree that it’s a concern that the long life of repositories and their multi-user nature will lead to unpredictable collisions, but I am much happier with a namespaced solution to this. We should explore further how to do that in a way that doesn’t make the most common experimental-code-development use cases unnecessarily verbose. The suggestions above seem to be headed in a good direction. ii) This is a really serious issue that does require an engineering solution. The long life of repositories and their size mean that we have to have an acceptable solution for migrations of type within a repository. However, I don’t think redefining names from collection to collection is a suitable solution. Before we go further down this road, I also want to be sure that we are clearly distinguishing between changes in the Python type for an existing (and unchanged) persistence-type and changes in the persistence-type (which might or might not need to be reflected by changes in the associated Python type). On the first one, which I’ll call ii.a), I’d like to note that we have other reasons than migrations to want the capability to have a single external, persisted dataset be instantiatable by Butler as more than one Python type. It would be very convenient for our users, for instance, to be able to retrieve an exposure as either its native afw type or as a common Astropy type; or a source table as afw.table or as a pandas dataframe. A digression on that: On BaBar , our C++ equivalent to the Butler had an interface like this (with a little translation / obscuration of unimportant details): pointer-to-object = Ifd< ObjectType >::get( butler(dataid), key ) (Making this rigorously type-safe without having the “butler” class depend on every data type in our system, which would have produced unmanageable dependency cycles, or constant recompilation of the world, was a publishable tour-de-force of mid-1990s C++ - hat tip to Ed Frank, currently at ANL. It would be much easier to implement with current C++.) The “key” was most commonly a string that allowed distinguishing between semantically distinct dataset types that were represented by a single C++ type - this is perhaps somewhat similar to the way we have multiple logical image dataset types all of which resolve to the same Python class. This system naturally supported retrieving the same dataset as different class types, easily supporting migrations. You could, say, do both Ifd<A1>::get(b, "ChargedTracks") and Ifd<A2>::get(b, "ChargedTracks") on the same “butler” b , with both loading from the same external persisted dataset. Inspired by this, could we not support something like a class= parameter to butler.get()? This would throw an exception if no formatter for the specified class and the specified dataset were available. In the absence of{{ class=}}, Butler would return a default class type for a specified dataset type. The formatter to use for a given dataset type would be looked up from both its (string) DatasetType and from the specified class= value, if any. The class could also be added to a PipelineTask configuration, declaring - optionally - that a particular implementation of that task can only deal with one specific class type for its inputs. In the course of a migration to a new class type for an existing persisted dataset type, new code could be introduced first with the use of the class= parameter, leaving existing code able to still run on the old version of the class, while newly written code explicitly requests the new version. Then, at some later time, the default class type - which would be a property of the software release in addition to the repository could be changed. Finally, on iii) We cannot change the dimensions associated with a DatasetType without similar problems, even DatasetTypes that represent processing intermediates rather than public data products. I’m not as sympathetic. This seems like a pretty fundamental part of the notion of a DatasetType, and I’m willing to ask developers to give the new version a new name. Remember that if we have user-namespaced dataset types, we have a “sandbox” in which developers can give some transitional versions of types under rapid development names that may seem “ugly” - while they are working on branches, etc., switching to a "clean" name in the common namespace once the work has settled down. But once there’s data in the shared repo of a particular type, in the common namespace, it seems immensely confusing to have the “keys” needed for that name be different in different collections. Among other things, this means that understanding the qualitative nature of the slicing/grouping that occurs when a PipelineTask using such a type is executed in a Pipeline is not predictable even roughly by just knowing the types - it’s only resolvable when the specific collections involved are known. With these concerns in mind, I am not able to support the RFC without further discussion.
            Hide
            tjenness Tim Jenness added a comment - - edited

            My worry is the StorageClass part of it, not the dimensions part of it.

            If we allow the storage class for a dataset type to change in every collection then you know longer no what a `butler.get()` is going to return.

            Show
            tjenness Tim Jenness added a comment - - edited My worry is the StorageClass part of it, not the dimensions part of it. If we allow the storage class for a dataset type to change in every collection then you know longer no what a `butler.get()` is going to return.
            Hide
            cmorrison Chris Morrison [X] (Inactive) added a comment -

            I agree with Jim and have been bitten by this before. Would be great not to have to change names and be locked in to dimensions while we are developing the code.

             

            Show
            cmorrison Chris Morrison [X] (Inactive) added a comment - I agree with Jim and have been bitten by this before. Would be great not to have to change names and be locked in to dimensions while we are developing the code.  
            Hide
            jbosch Jim Bosch added a comment -

            I would really like to avoid adding usernames or (especially) version numbers to dataset type names; those names are read, written, and remembered frequently, and having usually-unnecessary terms that are often not relevant to what the user is doing (hence harder to remember) attached to them is something I regard as a big problem.

            A dataset type definition is indeed an interface, but like code interfaces I think the best guard against unexpected changes and results is policy, not rigid code-level prohibitions.  This RFC proposes to make it so that one can redefine a major dataset type in a new collection; so perhaps the biggest thing missing from this RFC is a statement that one may not (at least not in an official, non-u/ collection) without an RFC.

            I'm also on board with ideas that make accidental introduction of new dataset types more difficult, e.g.:

            • boolean flag kwargs/options to enable registering a new dataset type with an already used name
            • a prohibition on building QGs where the input collections and outputs involve multiple definitions for the same name

            Thinking of these more as interfaces, I also wonder if some level of visibility control would help (e.g. "public" dataset types are CCB change-controlled, "private" dataset types for pure intermediates might be RFC-controlled, and "personal" dataset types are a free-for-all).  FWIW, I do think we need to eventually have visibility attributes of some kind on collections (captured a while back as DM-29585).

            I'm not completely set against the idea of having a short, non-unique dataset type name fulfill the role I want, with a longer name with extra disambiguator terms to make it repository-wide unique, but I think our collections do already provide enough of a unique lookup context that right now that would feel like extra complexity for little gain.

            Some direct replies:

            I wonder if we could add a StorageClass configuration that adds a type cast, such that we declare a mapping that says if the formatter has returned a pytype X then it should call this function to convert it to the required pytype Y? Then we could switch isr_metadata to TaskMetadata but if the formatter returns a PropertySet it would then call the conversion function that would convert it to a TaskMetadata.

            If the RFC doesn't converge quickly, or converges to something that would be a lot of work, I'd be on board with this as a way to move forward on RFC-783 (or more generally any configuration-level level of indirection that lets us change the Python type).  I hesitate because I think the functionality might be one-time-use if we ultimately come up with something better, but I like it a lot better than a any schema change that carries the same danger of being single-use.

            The backward compatibility problem could be resolved by maintaining a list of Python types (or Formatters?) in the database, trying them sequentially until one is importable and works.

            This is compatible with my goal of keeping the names short and easy to remember, but I think I like Tim Jenness 's proposal of adding indirection in config/Python better if we can make it work; that should be easier to implement and might give us a bit more flexibility.

            Show
            jbosch Jim Bosch added a comment - I would really like to avoid adding usernames or (especially) version numbers to dataset type names; those names are read, written, and remembered frequently, and having usually-unnecessary terms that are often not relevant to what the user is doing (hence harder to remember) attached to them is something I regard as a big problem. A dataset type definition is indeed an interface, but like code interfaces I think the best guard against unexpected changes and results is policy, not rigid code-level prohibitions.  This RFC proposes to make it so that one can redefine a major dataset type in a new collection; so perhaps the biggest thing missing from this RFC is a statement that one may not (at least not in an official, non-u/ collection) without an RFC. I'm also on board with ideas that make accidental introduction of new dataset types more difficult, e.g.: boolean flag kwargs/options to enable registering a new dataset type with an already used name a prohibition on building QGs where the input collections and outputs involve multiple definitions for the same name Thinking of these more as interfaces, I also wonder if some level of visibility control would help (e.g. "public" dataset types are CCB change-controlled, "private" dataset types for pure intermediates might be RFC-controlled, and "personal" dataset types are a free-for-all).  FWIW, I do think we need to eventually have visibility attributes of some kind on collections (captured a while back as DM-29585 ). I'm not completely set against the idea of having a short, non-unique dataset type name fulfill the role I want, with a longer name with extra disambiguator terms to make it repository-wide unique, but I think our collections do already provide enough of a unique lookup context that right now that would feel like extra complexity for little gain. Some direct replies: I wonder if we could add a StorageClass configuration that adds a type cast, such that we declare a mapping that says if the formatter has returned a pytype X then it should call this function to convert it to the required pytype Y? Then we could switch isr_metadata to TaskMetadata but if the formatter returns a PropertySet it would then call the conversion function that would convert it to a TaskMetadata. If the RFC doesn't converge quickly, or converges to something that would be a lot of work, I'd be on board with this as a way to move forward on RFC-783 (or more generally any configuration-level level of indirection that lets us change the Python type).  I hesitate because I think the functionality might be one-time-use if we ultimately come up with something better, but I like it a lot better than a any schema change that carries the same danger of being single-use. The backward compatibility problem could be resolved by maintaining a list of Python types (or Formatters?) in the database, trying them sequentially until one is importable and works. This is compatible with my goal of keeping the names short and easy to remember, but I think I like Tim Jenness 's proposal of adding indirection in config/Python better if we can make it work; that should be easier to implement and might give us a bit more flexibility.
            Hide
            tjenness Tim Jenness added a comment -

            My concern is mostly that a user doesn't really have any idea what butler.get() is going to return. We might end up with code that thinks that calexp returns an afw Exposure but which then encounters a collection where calexp is an astropy.NDData. Pipelines are not the worry, user notebooks and code sharing are my worry.

            Requiring u-dataset types is interesting and we could presumably make pipetask clever enough to add the prefix with register-dataset-types option and try with and without the prefix when building a graph (because we wouldn't want people to have to edit their pipelines between testing on the branch and release).

            versioned dataset types might solve the storage class problem and dimensions problem. We could adopt RFC-783 by switching everything to isr_metadata_v2 for example and having that have the new storage class. I suppose the difference between the two is that changing dimensions is a graph building issue and so the graph builder could default to using the highest version datasetType whereas for storage class change the actual Task code will need to be changed so won't be able to be picked up automatically (and so the pipeline task will have to declare that it now supports _v2).

            For the RFC-783 problem, I wonder if we could add a StorageClass configuration that adds a type cast, such that we declare a mapping that says if the formatter has returned a pytype X then it should call this function to convert it to the required pytype Y? Then we could switch isr_metadata to TaskMetadata but if the formatter returns a PropertySet it would then call the conversion function that would convert it to a TaskMetadata.

            Show
            tjenness Tim Jenness added a comment - My concern is mostly that a user doesn't really have any idea what butler.get() is going to return. We might end up with code that thinks that calexp returns an afw Exposure but which then encounters a collection where calexp is an astropy.NDData. Pipelines are not the worry, user notebooks and code sharing are my worry. Requiring u-dataset types is interesting and we could presumably make pipetask clever enough to add the prefix with register-dataset-types option and try with and without the prefix when building a graph (because we wouldn't want people to have to edit their pipelines between testing on the branch and release). versioned dataset types might solve the storage class problem and dimensions problem. We could adopt RFC-783 by switching everything to isr_metadata_v2 for example and having that have the new storage class. I suppose the difference between the two is that changing dimensions is a graph building issue and so the graph builder could default to using the highest version datasetType whereas for storage class change the actual Task code will need to be changed so won't be able to be picked up automatically (and so the pipeline task will have to declare that it now supports _v2). For the RFC-783 problem, I wonder if we could add a StorageClass configuration that adds a type cast, such that we declare a mapping that says if the formatter has returned a pytype X then it should call this function to convert it to the required pytype Y? Then we could switch isr_metadata to TaskMetadata but if the formatter returns a PropertySet it would then call the conversion function that would convert it to a TaskMetadata.
            Hide
            hchiang2 Hsin-Fang Chiang added a comment -

            Since this might involve a schema migration or repo/stack compatibility, these two DP0.2 milestones may be relevant, depending on this RFC's timeline:

            • Nov 2021: freeze the input repo, start v23 production branch, and start DP0.2 processing
            • June 2022: DP0.2 repo release to delegates
            Show
            hchiang2 Hsin-Fang Chiang added a comment - Since this might involve a schema migration or repo/stack compatibility, these two DP0.2 milestones may be relevant, depending on this RFC's timeline: Nov 2021: freeze the input repo, start v23 production branch, and start DP0.2 processing June 2022: DP0.2 repo release to delegates
            Hide
            ktl Kian-Tat Lim added a comment -

            I'm coming to agree with others that this could be a dangerous relaxation in the interface to data.

            I would instead suggest (somewhat off the top of my head) that:

            1. The name collision problem could be resolved by pre-registering all dataset types using per-user prefixes unless something like an --official command-line option is used.
            2. The backward compatibility problem could be resolved by maintaining a list of Python types (or Formatters?) in the database, trying them sequentially until one is importable and works.
            3. The dimension problem might require a _v2 dataset type suffix, but since these are internal, that could be OK.
            Show
            ktl Kian-Tat Lim added a comment - I'm coming to agree with others that this could be a dangerous relaxation in the interface to data. I would instead suggest (somewhat off the top of my head) that: The name collision problem could be resolved by pre-registering all dataset types using per-user prefixes unless something like an --official command-line option is used. The backward compatibility problem could be resolved by maintaining a list of Python types (or Formatters?) in the database, trying them sequentially until one is importable and works. The dimension problem might require a _v2 dataset type suffix, but since these are internal, that could be OK.

              People

              Assignee:
              jbosch Jim Bosch
              Reporter:
              jbosch Jim Bosch
              Watchers:
              Chris Morrison [X] (Inactive), Colin Slater, Gregory Dubois-Felsmann, Hsin-Fang Chiang, Ian Sullivan, Jim Bosch, Kian-Tat Lim, Tim Jenness
              Votes:
              1 Vote for this issue
              Watchers:
              8 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:
                Planned End:

                  Jenkins

                  No builds found.