DM-30332

Improve custom containers in daf_butler




      When I added the DataCoordinateIterable hierarchy, I took a few shortcuts (no intermediate ABCs that correspond to AbstractSet and Sequence) and punted on the problem of specialized containers for DimensionRecords and DatasetRefs.

      Those specialized containers would really help with getting new Quantum classes up on DM-29814 (which helps with DM-29761), so I think it's time to tackle this.  I've done some prototyping already, and I'm pretty happy with how things are looking, and I actually think this has the potential to help a lot with RemoteRegistry, too.  Here's what I have in mind:

      • Clean up the DataCoordinateIterable hierarchy, basically by adding intermediate ABCs and more closely following the collections.abc classes.  This will involve API changes (e.g. DataCoordinateSequence will become abstract; the new concrete class that inherits from it will be DataCoordinateTuple), but I think the disruption will be limited to daf_butler (at least to a very good approximation).
      • Add abstract Iterable, AbstractSet, and concrete Set containers for both heterogeneous and homogeneous DimensionRecords.  These will be used for the return type of queryDimensionRecords, but the really big deal here is that I want to move the Registry.expandDataIds logic to the heterogeneous container hierarchy and vectorize it, so you can use an in-memory collection of DimensionRecords to expand a bunch of data IDs efficiently.  And then I want to add a caching container implementation that fetches missing records from a Registry as needed.  I think this will be a huge step towards moving/copying some important caches that are currently inside the Manager classes to code the RemoteRegistry client can use.  It will also make it possible to serialize a DataCoordinateIterable with records via a much less duplicative form that factors the unique records into a separate container.
      • Add abstract Iterable, AbstractSet, and concrete Set containers for both heterogeneous and homogeneous (according to dataset type) DatasetRefs.  This is the thing I really want for the new Quantum hierarchy, but by building it on the above I can also handle data ID expansion and normalized serialization well for these.  The AbstractSet and Set versions will enforce uniqueness on dataset type + data ID, so they won't be appropriate for all cases, but the Iterable base class should provide room for future extension.
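
As a rough illustration of the "expand a bunch of data IDs from an in-memory record container, fetching only what is missing" idea, here is a minimal sketch.  Every name in it (CachingRecordSet, expand, the fetch callback) is a hypothetical stand-in and not the daf_butler API, and it simplifies heavily: a data ID is a plain mapping from element name to a single key value, whereas real dimension records can have compound keys.

```python
from __future__ import annotations

from collections.abc import Callable, Iterable, Mapping
from typing import Any

# Hypothetical stand-ins, not daf_butler classes.
Record = Mapping[str, Any]          # plays the role of a DimensionRecord
DataId = Mapping[str, Any]          # plays the role of a DataCoordinate
Fetch = Callable[[str, set], dict]  # (element, missing keys) -> {key: record}


class CachingRecordSet:
    """Heterogeneous record container that fetches missing records on demand."""

    def __init__(self, fetch: Fetch):
        self._fetch = fetch
        self._by_element: dict[str, dict[Any, Record]] = {}

    def expand(self, data_ids: Iterable[DataId]) -> list[tuple[DataId, dict[str, Record]]]:
        """Pair each data ID with its records, one fetch per element at most."""
        data_ids = list(data_ids)
        # Pass 1: gather every key we do not hold yet, grouped by element.
        missing: dict[str, set] = {}
        for data_id in data_ids:
            for element, key in data_id.items():
                if key not in self._by_element.setdefault(element, {}):
                    missing.setdefault(element, set()).add(key)
        # One bulk fetch per element instead of one lookup per data ID.
        for element, keys in missing.items():
            self._by_element[element].update(self._fetch(element, keys))
        # Pass 2: attach the (now cached) records to each data ID.
        return [
            (data_id, {e: self._by_element[e][k] for e, k in data_id.items()})
            for data_id in data_ids
        ]
```

In this sketch the fetch callback stands in for whatever a Registry-backed implementation would do; because records are stored once per element and key, the same structure also hints at how unique records could be factored out for serialization.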

      Two downsides that I think we'll have to live with:

      • Custom collections in Python tend to be super boilerplate-y, and these are no exception.  Lots of classes and lots of methods, but almost all of them trivial.
      • Like DataCoordinate[Abstract]Set, I am not planning to make the new set-like classes inherit from collections.abc.Set, because of some edge-case behavior it requires that I don't think we want (mostly, I want more exceptions when you try to mix types).  And I'm actually planning to implement even less of the collections.abc.Set interface, because it's a lot more work and I'm not sure we need it.  So these are definitely conceptually sets - they are unique-element containers with generally-unspecified order - but there is some potential confusion with the true set interfaces given the names.
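
To show what "conceptually a set, but stricter about mixing types" could look like, here is a small sketch.  Ref and RefSet are hypothetical stand-ins, not the planned classes, and the choice that a later duplicate replaces an earlier one is an arbitrary assumption.  The class enforces uniqueness on dataset type + data ID, implements only a few set-like methods, and raises TypeError where the collections.abc.Set mixin operators would accept any iterable.

```python
from __future__ import annotations

from collections.abc import Iterable, Iterator
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for a DatasetRef."""

    dataset_type: str
    data_id: tuple  # e.g. (("visit", 1), ("detector", 2))
    run: str = ""


class RefSet:
    """Unique-element container keyed on (dataset type, data ID).

    Deliberately not a collections.abc.Set subclass, and only a small
    part of that interface is provided.
    """

    def __init__(self, refs: Iterable[Ref] = ()):
        self._refs: dict[tuple, Ref] = {}
        for ref in refs:
            self.add(ref)

    def add(self, ref: Ref) -> None:
        if not isinstance(ref, Ref):
            raise TypeError(f"expected Ref, got {type(ref).__name__}")
        # Assumption for this sketch: a later duplicate replaces the earlier one.
        self._refs[(ref.dataset_type, ref.data_id)] = ref

    def __iter__(self) -> Iterator[Ref]:
        return iter(self._refs.values())

    def __len__(self) -> int:
        return len(self._refs)

    def __contains__(self, ref: object) -> bool:
        if not isinstance(ref, Ref):
            raise TypeError(f"expected Ref, got {type(ref).__name__}")
        return (ref.dataset_type, ref.data_id) in self._refs

    def __or__(self, other: RefSet) -> RefSet:
        # Stricter than the abc mixin: refuse anything that is not a RefSet.
        if not isinstance(other, RefSet):
            raise TypeError(f"cannot combine RefSet with {type(other).__name__}")
        return RefSet([*self, *other])
```

With this sketch, RefSet([a, b]) holds one element when a and b share a dataset type and data ID, and an expression such as refs | {a} raises TypeError rather than silently producing a mixed result.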





            jbosch Jim Bosch
            Jim Bosch, Nate Lust, Tim Jenness