The Gen3 Butler currently uses the string name of a DatasetType as a repository-wide unique identifier for it; a calexp, raw, or deepCoadd means exactly the same thing for all instruments, collections, and users.
This consistency is convenient most of the time, but deeply problematic in some important cases:
- In a many-user repository, two users who are otherwise completely unaware of each other can experience a conflict over a dataset type name that only they use.
- We cannot change a DatasetType from one Python type to a newer one without migrating data repositories (and in doing so, breaking the ability for older code to read older datasets), even if we make sure the new and old types are compatible from a code perspective. This problem is already blocking the implementation of
- We cannot change the dimensions associated with a DatasetType without similar problems, even for DatasetTypes that represent processing intermediates rather than public data products.
Much of the discussion about resolving this has involved the idea of "per-collection DatasetTypes", but I think that would be a step too far: much of the power of our current DatasetType concept is that it is cross-collection, and hence can be used to make meaningful comparisons across different processing runs.
A better solution is to make DatasetType names non-unique in the repository but unique within any collection (with the possible exception of CHAINED collections; more on that later); one DatasetType would typically still be used in many, many collections, and while a different DatasetType with the same name could exist in the same repository, the two could only be used in disjoint sets of collections.
This is less disruptive than it might seem, because in almost all contexts where a user wants to resolve a DatasetType name (or wildcard), a list of relevant collections is readily available, either from Butler initialization, the pipetask command line, or arguments passed to Registry methods. We already have (and aggressively fetch) summary tables that record which DatasetTypes are present in each collection, so we can perform those resolutions efficiently. I think we can demand that collections be provided whenever dataset type names are resolved without breaking much code; this would be an API change, but a relatively minor one.
Problems do arise when this resolution is not unique (i.e. the name resolves to multiple DatasetTypes even within the set of collections known to be in play), and this proposal does rely on such conflicts being rare in practice. We can use database constraints to ensure that only one definition appears in any particular RUN, TAGGED, or CALIBRATION collection, to avoid mistakes. Avoiding multiple definitions in CHAINED collections or multi-collection search paths is trickier, but in at least the vast majority of cases I think we can get away with making it an error to query a sequence of collections via a DatasetType name that is non-unique within that set; we can tell the user to pass a fully-resolved DatasetType object instead, resolving the ambiguity and making it quite difficult to silently get something other than what they expect.
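To make the proposed resolution rule concrete, here is a minimal sketch (not the real daf_butler API; all names here are hypothetical stand-ins) of resolving a dataset type name against an explicit list of collections using per-collection summaries, with ambiguity treated as an error:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetType:
    """Hypothetical stand-in for lsst.daf.butler.DatasetType."""
    id: int          # synthetic integer primary key
    name: str
    storage_class: str


class AmbiguousDatasetTypeError(RuntimeError):
    """Name matches multiple definitions within the given collections."""


def resolve_dataset_type(name, collections, summaries):
    """Resolve a dataset type name against an explicit list of collections.

    ``summaries`` maps collection name -> set of DatasetType definitions
    present in that collection (standing in for the summary tables the
    Registry already maintains).
    """
    matches = {
        dt
        for collection in collections
        for dt in summaries.get(collection, ())
        if dt.name == name
    }
    if not matches:
        raise LookupError(f"Dataset type {name!r} not found in {collections}.")
    if len(matches) > 1:
        # Non-unique resolution: demand a fully-resolved DatasetType object
        # rather than silently picking one definition.
        raise AmbiguousDatasetTypeError(
            f"Name {name!r} matches {len(matches)} definitions in "
            f"{collections}; pass a resolved DatasetType to disambiguate."
        )
    return matches.pop()


# One name, two definitions, used in disjoint collections:
old = DatasetType(1, "calexp", "ExposureF")
new = DatasetType(2, "calexp", "ExposureG")
summaries = {"runA": {old}, "runB": {new}}

resolve_dataset_type("calexp", ["runA"], summaries)   # unambiguous: old
# resolve_dataset_type("calexp", ["runA", "runB"], summaries)
#   -> raises AmbiguousDatasetTypeError
```

The key design point is that resolution is always scoped to a caller-supplied collection sequence, so ambiguity is detected exactly where the proposal says it should be: at the boundary where a bare name enters the system.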
This change will require a butler schema change and migration. It may be possible to allow older code to read migrated repositories, but older code should probably be prohibited from writing to them in certain ways (and it may not be worth the effort to support even that). It should be possible to ensure that newer code can read and write unmigrated older repositories (though again I'm not sure of the effort involved). Because all repositories already use a synthetic integer primary key for DatasetType, the migration itself should be quite straightforward: remove the unique constraint on the name, and add a name column and unique constraint to the dataset-collection tag tables.