I am very uncomfortable with the proposal.
I feel strongly that Butler data access needs to be a stable interface. In addition to Tim Jenness’s concerns, with which I agree:
We declare the interfaces and capabilities of PipelineTasks, both in the code and in derived documentation, in terms of the string names of DatasetTypes. Under the proposal this would become an unreliable guarantee: the usability of a given PipelineTask would be completely contingent on which collection it happened to be executed against at run time.
In addition, I fear an unmanageable situation could develop in which different type names are “sliced” by different subgroups of collections, leading to combinatorial growth in the number of “flavors” of collections (some with calexp-A and src-A, some with calexp-B and src-A, others with calexp-B and src-B, and so on) - a really nightmarish possibility.
With regard to the three initial concerns expressed in the proposal:
i) I agree that it’s a concern that the long life of repositories and their multi-user nature will lead to unpredictable collisions, but I am much happier with a namespaced solution to this. We should explore further how to do that in a way that doesn’t make the most common experimental-code-development use cases unnecessarily verbose. The suggestions above seem to be headed in a good direction.
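Purely for concreteness, the kind of thing I imagine is sketched below. The actual syntax and namespacing scheme are very much TBD; the “u/jdoe” prefix is just an illustration borrowed from our collection-naming habits, not an existing convention for dataset types.

    # Illustrative only - no dataset-type namespacing syntax exists today.
    # Assumes a Butler "butler" and a data ID "dataId" are already in scope.
    calexp = butler.get("calexp", dataId)           # shared, stable namespace
    calexp = butler.get("u/jdoe/calexp", dataId)    # user sandbox during development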
ii) This is a really serious issue that does require an engineering solution. The long life and size of our repositories mean that we must have an acceptable solution for type migrations within a repository. However, I don’t think redefining names from collection to collection is a suitable solution.
Before we go further down this road, I also want to be sure that we are clearly distinguishing between changes in the Python type for an existing (and unchanged) persistence-type and changes in the persistence-type itself (which might or might not need to be reflected in the associated Python type). For example, deciding to read an unchanged on-disk source table into a different in-memory class is the first kind of change; altering the on-disk schema itself is the second.
On the first one, which I’ll call ii.a), I’d like to note that we have reasons beyond migrations to want the capability for a single external, persisted dataset to be instantiable by Butler as more than one Python type. It would be very convenient for our users, for instance, to be able to retrieve an exposure as either its native afw type or as a common Astropy type, or a source table as either an afw.table catalog or a pandas DataFrame.
A digression on that: On BaBar, our C++ equivalent to the Butler had an interface like this (with a little translation / obscuration of unimportant details):
pointer-to-object = Ifd<ObjectType>::get( butler(dataid), key )
(Making this rigorously type-safe without having the “butler” class depend on every data type in our system, which would have produced unmanageable dependency cycles, or constant recompilation of the world, was a publishable tour-de-force of mid-1990s C++ - hat tip to Ed Frank, currently at ANL. It would be much easier to implement with current C++.)
The “key” was most commonly a string that allowed distinguishing between semantically distinct dataset types that were represented by a single C++ type - this is perhaps somewhat similar to the way we have multiple logical image dataset types all of which resolve to the same Python class.
This system naturally supported retrieving the same dataset as different class types, easily supporting migrations. You could, say, do both Ifd<A1>::get(b, "ChargedTracks") and Ifd<A2>::get(b, "ChargedTracks") on the same “butler” b, with both loading from the same external persisted dataset.
Inspired by this, could we not support something like a class= parameter to butler.get()? This would throw an exception if no formatter were available for the specified class and the specified dataset. In the absence of class=, Butler would return the default class for the specified dataset type. The formatter to use for a given dataset would be looked up from both its (string) DatasetType and the specified class= value, if any.
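To make the idea concrete, here is a minimal sketch of what the call sites might look like. It is entirely hypothetical: class is a reserved word in Python, so the actual keyword would need a different spelling (I use python_type below), and none of these signatures exist today.

    # Hypothetical sketch; "python_type" stands in for the proposed class=
    # parameter, since "class" is a reserved word in Python.
    # Assumes a Butler "butler" and a data ID "dataId" are already in scope.
    import lsst.afw.image as afwImage
    import astropy.nddata

    exp_afw = butler.get("calexp", dataId, python_type=afwImage.ExposureF)
    exp_ccd = butler.get("calexp", dataId, python_type=astropy.nddata.CCDData)

    # Without python_type, Butler returns the default class for "calexp":
    exp_default = butler.get("calexp", dataId)

    # If no formatter is registered for the (DatasetType, requested class)
    # pair, the call raises an exception rather than guessing.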
The class could also be added to a PipelineTask configuration, declaring - optionally - that a particular implementation of that task can only deal with one specific class type for its inputs.
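In lsst.pex.config terms, such a declaration might look something like the following. The field name and its interpretation by the execution framework are notional, not existing API:

    # Notional config sketch; "inputPythonType" is an assumption, not real API.
    import lsst.pex.config as pexConfig

    class MyTaskConfig(pexConfig.Config):
        inputPythonType = pexConfig.Field(
            dtype=str,
            optional=True,
            default=None,
            doc="If set, the only Python class this task accepts for its "
                "inputs; execution would fail if no matching formatter exists.",
        )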
In the course of a migration to a new class type for an existing persisted dataset type, new code could be introduced first using the class= parameter, leaving existing code able to run on the old version of the class while newly written code explicitly requests the new one. Then, at some later time, the default class type - which would be a property of the software release as well as of the repository - could be changed.
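Sketched as a timeline, again with the hypothetical python_type spelling (NewSourceCatalog and OldSourceCatalog are stand-ins for whatever classes the migration involves):

    # Stage 1: new code opts in explicitly; existing code is untouched.
    cat = butler.get("src", dataId, python_type=NewSourceCatalog)  # new code
    cat = butler.get("src", dataId)            # old code: still the old class

    # Stage 2, some releases later: flip the release/repository default.
    cat = butler.get("src", dataId)            # now returns NewSourceCatalog
    cat = butler.get("src", dataId, python_type=OldSourceCatalog)  # escape hatch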
Finally, on iii) - “We cannot change the dimensions associated with a DatasetType without similar problems, even for DatasetTypes that represent processing intermediates rather than public data products”:
I’m not as sympathetic here. The dimensions seem like a pretty fundamental part of the notion of a DatasetType, and I’m willing to ask developers to give the new version a new name. Remember that if we have user-namespaced dataset types, we have a “sandbox” in which developers can give transitional versions of rapidly evolving types names that may seem “ugly” while they are working on branches, switching to a “clean” name in the common namespace once the work has settled down.

But once there is data of a particular type in the shared repo, in the common namespace, it seems immensely confusing for the “keys” needed for that name to differ from collection to collection. Among other things, it means that the qualitative nature of the slicing/grouping that occurs when a PipelineTask using such a type is executed in a Pipeline cannot be predicted, even roughly, just from knowing the types - it is only resolvable once the specific collections involved are known.
With these concerns in mind, I am not able to support the RFC without further discussion.
I'm coming to agree with others that this could be a dangerous relaxation of the interface to data.
I would instead suggest (somewhat off the top of my head) that: