Data Management / DM-28280

Documenting Butler DatasetTypes


    Details

    • Urgent?:
      No

      Description

      Following up on a question Dominique Boutigny posted on Slack about the difference between the deepCoadd_calexp and deepCoadd products, I asked whether we have documentation of the different kinds of datasets in gen3 like we do in the description fields of obs_base's datasets.yaml and exposures.yaml (added in DM-13756). Jim Bosch said that the only place such docs might live in gen3 would be "associated with the PipelineTask connections that produce them", but most of those Connections do not currently include such descriptive information about what the dataset is.

      If those Connections are to be where we document PipelineTask output, we need a way to aggregate that information (shades of DM-6655) and we need to ensure that all of the descriptive information in the obs_base yaml files is copied over to the relevant Connections docs.
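If the Connections are the source of truth, the aggregation step mentioned above could be quite simple in principle. The sketch below is purely illustrative: it uses plain dataclasses as stand-ins for the real `lsst.pipe.base` connection and task objects, just to show the shape of a "walk the pipeline, collect output doc strings" pass.

```python
# Hypothetical sketch of aggregating dataset-type docs from connections.
# OutputConnection and TaskDef are stand-ins for the real lsst.pipe.base
# classes; names and docs below are illustrative, not real pipeline content.
from dataclasses import dataclass


@dataclass
class OutputConnection:
    name: str   # Butler dataset type name
    doc: str    # descriptive text we want to surface


@dataclass
class TaskDef:
    label: str
    outputs: list


def collect_dataset_docs(tasks):
    """Map each output dataset type name to its connection doc string."""
    docs = {}
    for task in tasks:
        for conn in task.outputs:
            # Multiple producers of one dataset type would need conflict
            # resolution; last-writer-wins here for simplicity.
            docs[conn.name] = conn.doc
    return docs


pipeline = [
    TaskDef("makeWarp", [OutputConnection("deepCoadd_directWarp",
                                          "Warped exposure for coaddition.")]),
    TaskDef("assembleCoadd", [OutputConnection("deepCoadd",
                                               "Per-patch coadded image.")]),
]
print(collect_dataset_docs(pipeline)["deepCoadd"])
```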


            Activity

            Jim Bosch (jbosch) added a comment:

            I'm linking DM-37544 and DM-33034 to this as I think they need to be part of the plan for how we do this:

            • The first is a technote that Nate Lust and I are planning to write, and one of the things we've discussed including is a section in pipeline YAML files where the major output dataset types can be defined, with documentation, with those definitions checked against the tasks in the pipeline. I think that's the source of this information.
            • The second is a long-stalled ticket that involves having scons (in drp_pipe, in this case, as a prototype) that generates content to be included in the Sphinx docs from the pipelines in a package. That is very much alive in my mind as the way to get the doc source into a more public, presentable location. When I'm not putting out fires, I am actively working on blockers for that ticket.
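The pipeline-YAML section described in the first bullet might look something like the fragment below. This is purely speculative: the `datasetTypes` key, its fields, and the `public` flag are invented for illustration; no such section exists in pipeline files today.

```yaml
# Hypothetical pipeline-file section; all key names below "tasks" are
# illustrative only and do not reflect any implemented schema.
description: Example DRP pipeline
tasks:
  assembleCoadd:
    class: lsst.pipe.tasks.assembleCoadd.AssembleCoaddTask
datasetTypes:
  deepCoadd:
    doc: >
      Per-patch coadded image, as realized by this pipeline's
      configuration.
    public: true   # science-user-visible vs. pipeline implementation detail
```

Checking these definitions against the tasks in the pipeline (as the technote proposes) would catch dataset types that are documented but no longer produced, and vice versa.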
            John Parejko (Parejkoj) added a comment:

            Why can't we use the `doc` fields of the Connections definitions for that documentation, instead of (separately?) putting it in the pipeline yaml?

            Jim Bosch (jbosch) added a comment:

            I'd like to reserve those doc fields for "this is how this dataset is used by this task", though that may be more relevant for inputs than outputs. I wouldn't be opposed to making the output connection docs the default documentation for dataset types that aren't documented in the pipeline file.

            John Parejko (Parejkoj) added a comment:

            Partly, my question is because we already have those fields and wouldn't have to implement too much extra (I hope? Jonathan Sick?) to insert them into the already existing Task Topic Type for PipelineTasks.

            Why would you expect that the documentation fields you want to add to the yaml pipeline files would be different from the docs that we can already put on Output Connections?

            Jim Bosch (jbosch) added a comment:

            On many tasks, there probably wouldn't be a difference. But in some cases the description of the dataset type depends on configuration and sometimes its context in the pipeline ("this is the final photometric calibration, produced by combining the outputs of FGCM with ..."). I also want to use the pipeline definitions to help distinguish between public, science-user-visible dataset types and those that should be considered pipeline implementation details. Put another way, the documentation on the connection is a generic description of what the dataset type might be given some configuration, and the documentation in the pipeline would be the description of a particular realization of that.
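The fallback relationship sketched in the comments above (pipeline-level doc describes the particular realization; the connection doc is the generic default) is just a two-level lookup. A minimal sketch, with all names and doc strings invented for illustration:

```python
# Hypothetical lookup: prefer the curated pipeline-level description of a
# dataset type's realization, falling back to the generic connection doc.
def dataset_doc(name, pipeline_docs, connection_docs):
    """Return the best available doc string for a dataset type name."""
    return pipeline_docs.get(name, connection_docs.get(name, "(undocumented)"))


# Generic doc scraped from an output connection (illustrative).
connection_docs = {"finalPhotoCalib": "Photometric calibration produced by this task."}
# Realized doc curated in the pipeline file (illustrative).
pipeline_docs = {"finalPhotoCalib": "Final photometric calibration, combining FGCM outputs."}

print(dataset_doc("finalPhotoCalib", pipeline_docs, connection_docs))
```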


              People

              Assignee:
              Unassigned
              Reporter:
              John Parejko (Parejkoj)
              Watchers:
              Ian Sullivan, Jim Bosch, John Parejko, Meredith Rawls, Nate Lust, Tim Jenness, Yusra AlSayyad
