Uploaded image for project: 'Request For Comments'
  1. Request For Comments
  2. RFC-783

Replace Task metadata class with pure Python Mapping

    XMLWordPrintable

    Details

    • Type: RFC
    • Status: Flagged
    • Resolution: Unresolved
    • Component/s: DM
    • Labels:
      None

      Description

      Currently we store Task metadata in a PropertySet that contains PropertyList. This was done mostly because it was convenient at the time and supported some nice features like add() and splitting hierarchy on a period.

      The downside of using PropertySet is that it leads to pipe_base depending on daf_base and pulls in a C++ compiler. This makes it effectively impossible for Task to be used outside the project (see also RFC-775 for related discussion on dependences of pipe_base).

      Given that Task metadata is never passed into C++ code, there is no reason we have to use a specialist C++ class to implement it.

      We propose that Task metadata be changed to use a pure Python Mapping class which reimplements the add() and single-level dot hierarchy but otherwise follows the standard dict-like interface.

      As an interim measure this new TaskMetadata class will support other PropertySet methods but they will issue deprecation warnings and will be cleaned up over time. In particular, code that calls .set() can already be replaced with standard dict [] assignment syntax.

      A simple TaskMetadata prototype has already been written that builds lsst_distrib and ci_hsc_gen3.

      During the transition there will be a need to have new and old forms of metadata coexisting in single repositories. The simplest approach would be to adopt a new dataset type name but if continuity is required a migration script could be written that would change the definition in existing repositories and a formatter could be constructed that recognizes PropertySet and casts to TaskMetadata. I would welcome opinions on this in the RFC.

      In conjunction with RFC-782 this should remove all C++ dependencies from the Task infrastructure.

        Attachments

          Issue Links

            Activity

            Hide
            tjenness Tim Jenness added a comment -

            At DMCCB today Jim Bosch mentioned that the butler dataset type problem might solve itself because he is considering on changing pipetask to store a richer metadata object that would include provenance. This would be an entirely new dataset type and would (presumably) contain the Task metadata discussed on this ticket as a subset. Switching from PropertySet to TaskMetadata as part of that work would make a lot of sense and minimize issues with backwards compatibility.

            Show
            tjenness Tim Jenness added a comment - At DMCCB today Jim Bosch mentioned that the butler dataset type problem might solve itself because he is considering on changing pipetask to store a richer metadata object that would include provenance. This would be an entirely new dataset type and would (presumably) contain the Task metadata discussed on this ticket as a subset. Switching from PropertySet to TaskMetadata as part of that work would make a lot of sense and minimize issues with backwards compatibility.
            Hide
            tjenness Tim Jenness added a comment -

            I'm moving the date out on this RFC a little to allow more time to ponder the gen3 butler migration plan for existing task metadata.

            Show
            tjenness Tim Jenness added a comment - I'm moving the date out on this RFC a little to allow more time to ponder the gen3 butler migration plan for existing task metadata.
            Hide
            ktl Kian-Tat Lim added a comment -

            The DMCCB approves of the idea of having a distinct, non-generic TaskMetadata class and associated (backward-compatible) formatter; this can be added to the code, but not actually used by pipe_base (no datasets read or written using this) until a further RFC for non-unique dataset types has been written and considered.

            Show
            ktl Kian-Tat Lim added a comment - The DMCCB approves of the idea of having a distinct, non-generic TaskMetadata class and associated (backward-compatible) formatter; this can be added to the code, but not actually used by pipe_base (no datasets read or written using this) until a further RFC for non-unique dataset types has been written and considered.
            Hide
            tjenness Tim Jenness added a comment -

            Discussing the compatibility issues with Gregory Dubois-Felsmann suggested the following migration scheme:

            • Create a TaskMetadata class and release it into a weekly without activating it.
            • Update StorageClass to allow a new "type cast" functionality such that if a formatter returns something of type X (PropertySet) the helper function listed in the storage class definition could be called on it to convert it to type Y (TaskMetadata) – doing this means we do not have to change all the formatter entries in the datastore record.
            • If Type Y or the casting function is not known to the current software version allow it to return type X – this would allow old code that always expected type X to still return type X (it wouldn't help old code read newly-written type Y data and code using a new weekly would need to be modified to support TaskMetadata being returned).
            • Whether it's a good idea to allow people to locally override StorageClass definitions such that they could cast Exposure to Astropy NDData is open to debate since that suffers from the same problem of being surpised described in RFC-804 – except the code working or not depends on configuration (that could be under their control) rather than what someone else has done to registry – for that use case it would be better to have a programmatic solution such as creating a butler with an explicit casts defined and explicit formatter overrides (to remove the overhead of first reading an Exposure and then converting it – you could override the formatter itself to read the FITS file directly in astropy format) which would make the code more robust against configuration modification.
            • After a few releases have been made with the above fixes in place, modify the _metadata dataset type definition in registry to replace the PropertySet storage class with a TaskMetadata storage class.

            I don't think there's any way to switch from a generic container type to a specific container type without a registry migration to the new dataset type definition. The issue is whether we can do it without requiring also updating every datastore record and still allowing people to run a slightly older weekly release for a while.

            Show
            tjenness Tim Jenness added a comment - Discussing the compatibility issues with Gregory Dubois-Felsmann suggested the following migration scheme: Create a TaskMetadata class and release it into a weekly without activating it. Update StorageClass to allow a new "type cast" functionality such that if a formatter returns something of type X (PropertySet) the helper function listed in the storage class definition could be called on it to convert it to type Y (TaskMetadata) – doing this means we do not have to change all the formatter entries in the datastore record. If Type Y or the casting function is not known to the current software version allow it to return type X – this would allow old code that always expected type X to still return type X (it wouldn't help old code read newly-written type Y data and code using a new weekly would need to be modified to support TaskMetadata being returned). Whether it's a good idea to allow people to locally override StorageClass definitions such that they could cast Exposure to Astropy NDData is open to debate since that suffers from the same problem of being surpised described in RFC-804 – except the code working or not depends on configuration (that could be under their control) rather than what someone else has done to registry – for that use case it would be better to have a programmatic solution such as creating a butler with an explicit casts defined and explicit formatter overrides (to remove the overhead of first reading an Exposure and then converting it – you could override the formatter itself to read the FITS file directly in astropy format) which would make the code more robust against configuration modification. After a few releases have been made with the above fixes in place, modify the _metadata dataset type definition in registry to replace the PropertySet storage class with a TaskMetadata storage class. I don't think there's any way to switch from a generic container type to a specific container type without a registry migration to the new dataset type definition. The issue is whether we can do it without requiring also updating every datastore record and still allowing people to run a slightly older weekly release for a while.
            Hide
            jbosch Jim Bosch added a comment - - edited

            I have no objection to this plan as a way to solve the immediate, but a few thoughts on the implications for future functionality.

            I am a bit wary of formatter overrides as a way to solve the "get me a PVI as an astropy thing" problem. Casting is a lot cleaner, and seems unlikely to have any drawbacks in practice other than demanding that the original formatter, target type, and their dependencies be in the user's software environment.

            If we do want to do formatter overrides, I think we really need to consider "on-disk data structure" and "file format for an on-disk data structure" independent of any Python type as first-class concepts, and then rethink how the current Formatter and StorageClass concepts relate to those.

            In fact, at one (very early) point, I think we'd intended for StorageClass to mean "on-disk data structure independent of any Python type", but it was just too convenient to add the Python type to it, and now it's really just an alias for a Python type that's mostly annoying (because it has to be registered) but occasionally really important as a level of indirection (as in the proposed solution to this problem). We might consider the problem raised by this RFC to be our punishment for not following through on that original idea and designing a proper many-to-many mapping between that purely on-disk concept and Python type, and using that more complicated mapping to select a Formatter. A tiny part of me even thinks it'd be a useful thought experiment to try to go back to that in the solution to this problem, but most of me thinks that's a distraction from more important things.

            Show
            jbosch Jim Bosch added a comment - - edited I have no objection to this plan as a way to solve the immediate, but a few thoughts on the implications for future functionality. I am a bit wary of formatter overrides as a way to solve the "get me a PVI as an astropy thing" problem. Casting is a lot cleaner, and seems unlikely to have any drawbacks in practice other than demanding that the original formatter, target type, and their dependencies be in the user's software environment. If we do want to do formatter overrides, I think we really need to consider "on-disk data structure" and "file format for an on-disk data structure" independent of any Python type as first-class concepts, and then rethink how the current Formatter and StorageClass concepts relate to those. In fact, at one (very early) point, I think we'd intended for StorageClass to mean "on-disk data structure independent of any Python type", but it was just too convenient to add the Python type to it, and now it's really just an alias for a Python type that's mostly annoying (because it has to be registered) but occasionally really important as a level of indirection (as in the proposed solution to this problem). We might consider the problem raised by this RFC to be our punishment for not following through on that original idea and designing a proper many-to-many mapping between that purely on-disk concept and Python type, and using that more complicated mapping to select a Formatter. A tiny part of me even thinks it'd be a useful thought experiment to try to go back to that in the solution to this problem, but most of me thinks that's a distraction from more important things.

              People

              Assignee:
              tjenness Tim Jenness
              Reporter:
              tjenness Tim Jenness
              Watchers:
              Eli Rykoff, Gregory Dubois-Felsmann, Jim Bosch, John Parejko, Kian-Tat Lim, Krzysztof Findeisen, Leanne Guy, Michelle Butler, Tatiana Goldina, Tim Jenness, Wil O'Mullane, Yusra AlSayyad
              Votes:
              0 Vote for this issue
              Watchers:
              12 Start watching this issue

                Dates

                Created:
                Updated:
                Planned End:

                  CI Builds

                  No builds found.