Uploaded image for project: 'Request For Comments'
  1. Request For Comments
  2. RFC-783

Replace Task metadata class with pure Python Mapping

    XMLWordPrintable

    Details

    • Type: RFC
    • Status: Implemented
    • Resolution: Done
    • Component/s: DM
    • Labels:
      None

      Description

      Currently we store Task metadata in a PropertySet that contains PropertyList. This was done mostly because it was convenient at the time and supported some nice features like add() and splitting hierarchy on a period.

      The downside of using PropertySet is that it leads to pipe_base depending on daf_base and pulls in a C++ compiler. This makes it effectively impossible for Task to be used outside the project (see also RFC-775 for related discussion on dependences of pipe_base).

      Given that Task metadata is never passed into C++ code, there is no reason we have to use a specialist C++ class to implement it.

      We propose that Task metadata be changed to use a pure Python Mapping class which reimplements the add() and single-level dot hierarchy but otherwise follows the standard dict-like interface.

      As an interim measure this new TaskMetadata class will support other PropertySet methods but they will issue deprecation warnings and will be cleaned up over time. In particular, code that calls .set() can already be replaced with standard dict [] assignment syntax.

      A simple TaskMetadata prototype has already been written that builds lsst_distrib and ci_hsc_gen3.

      During the transition there will be a need to have new and old forms of metadata coexisting in single repositories. The simplest approach would be to adopt a new dataset type name but if continuity is required a migration script could be written that would change the definition in existing repositories and a formatter could be constructed that recognizes PropertySet and casts to TaskMetadata. I would welcome opinions on this in the RFC.

      In conjunction with RFC-782 this should remove all C++ dependencies from the Task infrastructure.

        Attachments

          Issue Links

            Activity

            Hide
            jbosch Jim Bosch added a comment -

            Ah, I wasn't thinking about the in-memory type at all, and yes, that's important to figure out. It'd be ideal if we could switch to using TaskMetadata in code all the time, and have the backwards-compatibility layer kick in on write (as well as in dataset type registration). I'll take a quick look at various code paths tomorrow and see how how hard it looks.

            Show
            jbosch Jim Bosch added a comment - Ah, I wasn't thinking about the in-memory type at all, and yes, that's important to figure out. It'd be ideal if we could switch to using TaskMetadata in code all the time, and have the backwards-compatibility layer kick in on write (as well as in dataset type registration). I'll take a quick look at various code paths tomorrow and see how how hard it looks.
            Hide
            jbosch Jim Bosch added a comment -

            Ok, so the code that defines the dataset type to registered for metadata is here:

            https://github.com/lsst/pipe_base/blob/205747afb71452b7ad7736fd751342810b50880e/python/lsst/pipe/base/pipeline.py#L908-L912

            And that already has access to a Registry, so it can do a getDatasetType and use the existing StorageClass there if one exists (i.e. only let the _TASK_METADATA_TYPE logic there fire when the dataset type does not exist).

            We can let the Task constructor pay attention to _TASK_METADATA_TYPE as it now does, so after we flip that switch, we will always use TaskMetadata there.

            And then, on write, I think we have two options:

            • Add a pyyaml override hook for TaskMetadata, so it can be used with YamlFormatter.  Make sure that writes files in a way that's consistent with writing a PropertySet to YAML.  Make sure Datastore isn't doing any checks on StorageClass.pytype, or provide a way to bypass them.
            • Write a TaskMetadata->PropetySet conversion function, and register a StorageClass conversion in that direction, too.  Make Datastore.put (not just get) attempt coercion when StorageClass.pytype and the object being written are not consistent.

             

            Show
            jbosch Jim Bosch added a comment - Ok, so the code that defines the dataset type to registered for metadata is here: https://github.com/lsst/pipe_base/blob/205747afb71452b7ad7736fd751342810b50880e/python/lsst/pipe/base/pipeline.py#L908-L912 And that already has access to a Registry , so it can do a getDatasetType and use the existing StorageClass there if one exists (i.e. only let the _TASK_METADATA_TYPE logic there fire when the dataset type does not exist). We can let the Task constructor pay attention to _TASK_METADATA_TYPE as it now does, so after we flip that switch, we will always use TaskMetadata there. And then, on write, I think we have two options: Add a pyyaml override hook for TaskMetadata , so it can be used with YamlFormatter .  Make sure that writes files in a way that's consistent with writing a PropertySet to YAML.  Make sure Datastore isn't doing any checks on StorageClass.pytype , or provide a way to bypass them. Write a TaskMetadata -> PropetySet conversion function, and register a StorageClass conversion in that direction, too.  Make Datastore.put (not just get ) attempt coercion when StorageClass.pytype and the object being written are not consistent.  
            Hide
            tjenness Tim Jenness added a comment - - edited

            Pydantic models don't natively support YAML but we could easily make the formatter deal with the pedantic .dict() method to convert to a plain dict and then have the formatter try the pydantic constructor from it like we do in the JSON formatter. PropertySet YAML looks terrible and I wouldn't try to emulate it. That leaves option 2 to end up with a PropertySet.from_TaskMetadata() method. I'll take another look at seeing if we can do this all transparently.

            Show
            tjenness Tim Jenness added a comment - - edited Pydantic models don't natively support YAML but we could easily make the formatter deal with the pedantic .dict() method to convert to a plain dict and then have the formatter try the pydantic constructor from it like we do in the JSON formatter. PropertySet YAML looks terrible and I wouldn't try to emulate it. That leaves option 2 to end up with a PropertySet.from_TaskMetadata() method. I'll take another look at seeing if we can do this all transparently.
            Hide
            tjenness Tim Jenness added a comment -

            DM-33155 was relatively trivial so I propose that we can adopt this RFC. Everything that is needed for migration to TaskMetadata has been demonstrated and we can do it without breaking existing repositories. We can choose to migrate individual repositories over to TaskMetadata storage classes at any time we want but new metadata dataset types will use TaskMetadata immediately.

            Show
            tjenness Tim Jenness added a comment - DM-33155 was relatively trivial so I propose that we can adopt this RFC. Everything that is needed for migration to TaskMetadata has been demonstrated and we can do it without breaking existing repositories. We can choose to migrate individual repositories over to TaskMetadata storage classes at any time we want but new metadata dataset types will use TaskMetadata immediately.
            Hide
            tjenness Tim Jenness added a comment -

            After discussion on DMCCB slack and approvals from Jim Bosch and Kian-Tat Lim, I am marking this RFC as board recommended.

            Show
            tjenness Tim Jenness added a comment - After discussion on DMCCB slack and approvals from Jim Bosch and Kian-Tat Lim , I am marking this RFC as board recommended.

              People

              Assignee:
              tjenness Tim Jenness
              Reporter:
              tjenness Tim Jenness
              Watchers:
              Eli Rykoff, Gregory Dubois-Felsmann, Jim Bosch, John Parejko, Kian-Tat Lim, Leanne Guy, Michelle Butler [X] (Inactive), Tatiana Goldina, Tim Jenness, Wil O'Mullane, Yusra AlSayyad
              Votes:
              0 Vote for this issue
              Watchers:
              11 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:
                Planned End:

                  Jenkins

                  No builds found.