RFC-243

Request for a regularly updated pipeline output dataset

    Details

    • Type: RFC
    • Status: Adopted
    • Resolution: Unresolved
    • Component/s: DM
    • Labels:
      None

      Description

      This is a request to add a new item to our construction-era development environment baseline: the regular production of a small, readily Internet-accessible collection of output data, together with the corresponding input data, from a defined set of science pipelines code, run from master and/or from other defined releases.

      This is starting as an RFD as opposed to an RFC because some aspects of this require some architecture/design/scoping decisions. However, the need is real (and SUIT, in particular, have been asking for this for some time).

      Uses for this data can be added as comments.

      There may be both small datasets generated frequently (nightly would be ideal) and larger datasets generated at longer intervals. The datasets resulting from runs of named releases should be kept for long periods; those resulting from nightly or other frequent runs can be kept in small ring buffers. It is a requirement that at least one previous successful run of each "flavor" be kept around so that "what's changed?" tests can be performed.
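As a rough illustration of the retention rules above, the following Python sketch selects which runs to keep. Everything in it is an assumption made for illustration: the `Run` record, its field names, the `ring_size` default, and the reading of "previous successful run" as the newest successful run older than the latest run of that flavor. It is not an existing tool or an agreed design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """Hypothetical record describing one pipeline run."""
    flavor: str         # e.g. "nightly" (illustrative label)
    run_id: int         # assumed to increase with time; larger means newer
    succeeded: bool
    from_release: bool  # True when produced from a named release


def runs_to_keep(runs, ring_size=3):
    """Return the set of runs to retain under the policy sketched above."""
    by_flavor = {}
    for run in runs:
        by_flavor.setdefault(run.flavor, []).append(run)

    keep = set()
    for flavor_runs in by_flavor.values():
        ordered = sorted(flavor_runs, key=lambda r: r.run_id, reverse=True)
        # Runs of named releases are kept long-term, outside the ring buffer.
        keep.update(r for r in ordered if r.from_release)
        # Ring buffer: only the newest `ring_size` frequent runs survive.
        frequent = [r for r in ordered if not r.from_release]
        keep.update(frequent[:ring_size])
        # Also keep the newest successful run older than the latest run,
        # so a "what's changed?" comparison always has a baseline.
        previous_successes = [r for r in ordered[1:] if r.succeeded]
        if previous_successes:
            keep.add(previous_successes[0])
    return keep
```

Note that the baseline rule can retain a run that has already fallen out of the ring buffer, which is the point of stating it as a separate requirement.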

      Initially this request can be satisfied by providing the data in its current science pipelines code output file format (i.e., FITS for images, and FITS binary table in afw.table format for catalog data), as simple files available in some standard way (https:, git-lfs, etc.). (It is not a requirement that the outputs appear in one of the validation_* repos.) As our work progresses, it may be reasonable to extend this to providing the data ingested into a database and served via DAX, but that is an architecture and scope issue, regarding end-to-end testing, that can be deferred. If the file formats we support evolve, e.g., to include HDF5, that should be reflected in evolution of the output format(s) used.

      Initially providing just basic outputs like calibrated single-epoch images and single-epoch source catalogs would already be very useful. Over time, this should be extended to as much of the full set of Level 1 and Level 2 data products as possible.

      The inputs and outputs should form a coordinated set, so that linkages between inputs and outputs, and between single-epoch and coadd-related data products can be tested on the resulting datasets.
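One way to test such linkages is to check that every product refers only to inputs or products that are actually present in the dataset. The sketch below assumes a hypothetical manifest in which each output lists the identifiers of the items it was derived from; the identifier scheme and the function name are illustrative only.

```python
def find_broken_links(input_ids, products):
    """Report products whose parents are missing from the dataset.

    input_ids: iterable of identifiers for the input data.
    products:  mapping of product id -> list of parent ids it derives from
               (inputs for single-epoch products, single-epoch products
               for coadds). This manifest layout is an assumption.
    """
    known = set(input_ids) | set(products)
    broken = {}
    for product_id, parents in products.items():
        missing = [p for p in parents if p not in known]
        if missing:
            broken[product_id] = missing
    return broken
```

An empty result means the inputs and outputs form a closed, coordinated set; any entry names a product and the parents it points to that are absent.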

      Such a dataset has applications both to scientific data content and to data format and metadata. The former are closely related to the planned QC system's functions, and the QC system may well end up being the supplier of the output datasets contemplated herein. The latter include tests such as "can the SUIT properly display the coadded images produced by the pipelines?". In general, for both types of tests, we have developed and/or will continue to develop automated test rigs that it may be possible to run as part of the CI system itself. But there is also a need to have the datasets available for human / interactive access, so that developers of downstream stages and components can test their code on recent pipeline outputs without having to learn how to run the upstream steps, let alone actually run them, in order to regenerate the upstream intermediates. Note that we have developers, e.g., JavaScript specialists in SUIT, who may never learn how to run the science pipelines code on their own and should not be required to do so.

      The input dataset(s) to this process should ultimately include:

      1. precursor data (e.g., public HSC images);
      2. simulated LSST images with the truth catalogs available;
      3. on-sky images taken with prototype sensors;
      4. commissioning data once it is available with a reasonable level of stability and image quality; and
      5. production LSST data (as the system proposed here transitions to a role in the operations-era software development environment).

      However, providing even one of these (presumably (1) or (2)) in the near term would be most useful.

      The input datasets should be under change control to allow incremental comparisons to be made.
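A minimal sketch of what change control could look like at the file level: record a checksum manifest for each version of the inputs and diff two manifests to see what changed. The function names and the choice of SHA-256 are assumptions; in practice git-lfs or a similar tool may already provide this bookkeeping.

```python
import hashlib
from pathlib import Path


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Reads the whole file at once; large images would be better
            # hashed in chunks.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def diff_manifests(old, new):
    """Return (added, removed, changed) file paths between two manifests."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(k for k in set(old) & set(new) if old[k] != new[k])
    return added, removed, changed
```

With the input changes known, a difference between two output datasets can be attributed either to the inputs or to the pipeline code.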

      Initially I am suggesting that the "D" in RFD happen in the comments, but I will be happy to schedule a live discussion based on the initial round of feedback.

      See the discussion in the #dm channel on Slack beginning at 15:52 PDT today for some background; however, the interest in this predates that discussion by some months.

            Activity

            jsick Jonathan Sick added a comment -

            John Swinbank, absolutely agree. I've created DM-18883 to carry on the conversation about documentation datasets, which I agree is a separate issue (but potentially some synergy can be found).

            tjenness Tim Jenness added a comment -

            Gregory Dubois-Felsmann this RFC is showing that all triggered work has been completed. Does this mean that the RFC has been implemented or does it mean more triggered tickets are needed?

            gpdf Gregory Dubois-Felsmann added a comment -

            I'll have to check whether the vision of getting updated data regularly run through to a point where it's accessible is really completely implemented. I think there are still some open questions at the end of the chain. I'll poke at this this week.

            lguy Leanne Guy added a comment - - edited

            The process of transforming DRP pipeline outputs to DPDD-specified format is not complete, nor is it fully automated yet. This is something that Hsin-Fang Chiang is working on as part of DM-22806 this cycle. 


              People

              • Assignee: gpdf Gregory Dubois-Felsmann
              • Reporter: gpdf Gregory Dubois-Felsmann
              • Watchers: Frossie Economou, Gregory Dubois-Felsmann, Hsin-Fang Chiang, John Parejko, John Swinbank, Jonathan Sick, Kian-Tat Lim, Leanne Guy, Michael Wood-Vasey, Paul Price, Simon Krughoff, Tim Jenness, Trey Roby, Wil O'Mullane, Xiuqin Wu [X] (Inactive)
              • Votes: 2
