Data Management / DM-28280

Documenting Butler DatasetTypes

    Details

    • Urgent?:
      No

      Description

      Following up on a question Dominique Boutigny posted on Slack about the difference between the deepCoadd_calexp and deepCoadd products, I asked whether we have documentation of the different kinds of datasets in gen3, like we do in the description fields of obs_base's datasets.yaml and exposures.yaml (added in DM-13756). Jim Bosch said that the only place such docs might live in gen3 would be "associated with the PipelineTask connections that produce them," but most of those Connections do not currently include any descriptive information about what the dataset is.

      If those Connections are to be where we document PipelineTask output, we need a way to aggregate that information (shades of DM-6655) and we need to ensure that all of the descriptive information in the obs_base yaml files is copied over to the relevant Connections docs.

            Activity

            tjenness Tim Jenness added a comment -

            At first glance this is possibly a request for the DatasetType constructor to take an optional description string (as we support for collections) and for that string to be stored in the registry when the DatasetType is registered. This would still require pipeline tasks to include these strings, but it would allow the `butler query-dataset-types` command to report the description string. It would require a registry schema change and so is not immediately trivial, but we could easily add the summary string to DatasetType so that people could at least start adding the documentation strings to the code even if they aren't persisted.

            I don't think obs_base is involved in this.
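
            As an illustration of the idea in this comment, here is a minimal sketch of a dataset type that carries an optional description and a registry that keeps it on registration. The names `DatasetTypeSketch`, `RegistrySketch`, `register_dataset_type`, and `describe` are hypothetical stand-ins invented for this example; they are not the `lsst.daf.butler` API, and the storage class, dimensions, and description text are placeholder values.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class DatasetTypeSketch:
    """Hypothetical stand-in for a dataset type (not the real butler class)."""

    name: str
    storage_class: str
    dimensions: Tuple[str, ...]
    # The proposed addition: an optional, human-readable summary string.
    description: Optional[str] = None


class RegistrySketch:
    """Hypothetical in-memory registry that stores the description too."""

    def __init__(self) -> None:
        self._types: Dict[str, DatasetTypeSketch] = {}

    def register_dataset_type(self, dataset_type: DatasetTypeSketch) -> bool:
        """Store the dataset type; return True only if it was newly added."""
        if dataset_type.name in self._types:
            return False
        self._types[dataset_type.name] = dataset_type
        return True

    def describe(self, name: str) -> str:
        """Return the stored description, or a fallback if none was given."""
        dataset_type = self._types[name]
        return dataset_type.description or "(no description registered)"


if __name__ == "__main__":
    registry = RegistrySketch()
    registry.register_dataset_type(
        DatasetTypeSketch(
            name="deepCoadd",
            storage_class="ExposureF",  # placeholder value
            dimensions=("tract", "patch", "band", "skymap"),  # placeholder
            description="Placeholder summary text for illustration.",
        )
    )
    print(registry.describe("deepCoadd"))
```

            In this sketch the description defaults to `None`, so existing code that never passes one keeps working, which matches the suggestion that the strings could be added to the code before they are persisted.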

            Parejkoj John Parejko added a comment -

            obs_base is involved because that is where the existing descriptions live.

            nlust Nate Lust added a comment -

            Connection types (which a type name is just an identifier for) do already support doc fields. In the past we have not talked about doing this at the registry level, but rather about having a command that reports all the dataset types and docs associated with a given pipeline. In principle I can't see any reason not to also store it in the registry, apart from a general feeling that we would be creating one gigantic complex thing.
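
            The per-pipeline report described here could be sketched roughly as follows. `ConnectionSketch`, `TaskSketch`, and `collect_output_docs` are hypothetical names made up for this example; real connections and pipelines are not modelled, only the idea of walking each task's connections and gathering the doc string of every output.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class ConnectionSketch:
    """Hypothetical stand-in for one connection of a task."""

    name: str        # dataset type name the connection refers to
    doc: str         # free-text documentation attached to the connection
    is_output: bool  # True for outputs, False for inputs


@dataclass(frozen=True)
class TaskSketch:
    """Hypothetical stand-in for a task in a pipeline."""

    label: str
    connections: Tuple[ConnectionSketch, ...]


def collect_output_docs(tasks: Iterable[TaskSketch]) -> Dict[str, str]:
    """Map each output dataset type name to the first non-empty doc found."""
    docs: Dict[str, str] = {}
    for task in tasks:
        for conn in task.connections:
            if not conn.is_output:
                continue
            # Only fill in a name that is missing or still has an empty doc.
            if not docs.get(conn.name):
                docs[conn.name] = conn.doc.strip()
    return docs


if __name__ == "__main__":
    pipeline = [
        TaskSketch(
            label="taskA",
            connections=(
                ConnectionSketch("inputThing", "An input.", is_output=False),
                ConnectionSketch("outputThing", "  An output.  ", is_output=True),
            ),
        ),
    ]
    for name, doc in sorted(collect_output_docs(pipeline).items()):
        print(f"{name}: {doc or '(undocumented)'}")
```

            Outputs with an empty doc still appear in the result, so a report built this way also shows which dataset types are undocumented.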

            tjenness Tim Jenness added a comment -

            Okay, but this ticket doesn't involve any work on obs_base.

            tjenness Tim Jenness added a comment - - edited

            There are two things then. One is "for pipeline X, describe to me the output DatasetTypes that this pipeline creates"; that just needs the pipeline. The other is "I see there is a dataset type in the butler repository called Y; what does it represent?" Short of clever provenance-tracking code that looks up a dataset of that type, then looks at the run, then looks at the provenance to see which pipeline made it, and then loads that pipeline and asks for the definition, the second option is much easier if registering a dataset type also registers a short summary string for it.

            jbosch Jim Bosch added a comment -

            The other place (and maybe the primary place) we should target for this documentation is pipelines.lsst.io. Ideally, we'd put together some Sphinx (etc.) magic such that one could designate a pipeline (YAML file) in a package as being sufficiently important that its output dataset types should be rendered into the static docs, and then the doc build would pull the information from that and put it in a table somewhere.
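
            One small piece of such a doc build might look like the sketch below, which turns a mapping of dataset type names to descriptions into a reStructuredText `list-table`. The function name `render_rst_table` and the table title are assumptions for this example; how the mapping would be extracted from a pipeline YAML file, and how Sphinx would invoke this, is left open.

```python
from typing import Dict, List


def render_rst_table(docs: Dict[str, str], title: str = "Output dataset types") -> str:
    """Render {dataset type name: description} as a reST list-table."""
    lines: List[str] = [
        f".. list-table:: {title}",
        "   :header-rows: 1",
        "",
        "   * - Dataset type",
        "     - Description",
    ]
    for name in sorted(docs):
        # Collapse internal whitespace so each description stays on one line.
        text = " ".join(docs[name].split()) or "(undocumented)"
        lines.append(f"   * - ``{name}``")
        lines.append(f"     - {text}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = {
        "outputThing": "Placeholder description\nspanning two lines.",
        "anotherThing": "",
    }
    print(render_rst_table(example))
```

            Sorting by name keeps the generated table stable between builds, so changes in the rendered docs reflect changes in the pipeline rather than dictionary ordering.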


              People

              Assignee:
              Unassigned
              Reporter:
              Parejkoj John Parejko
              Watchers:
              Ian Sullivan, Jim Bosch, John Parejko, Meredith Rawls, Nate Lust, Tim Jenness, Yusra AlSayyad