Details
- Type: RFC
- Status: Adopted
- Resolution: Unresolved
- Component/s: DM
- Labels: None
Description
This is a request to add, to our construction-era development environment baseline, the regular production of a small, readily Internet-accessible collection of output data (together with the corresponding input data) from a defined set of science pipelines code, run from master and/or from other defined releases.
This is starting as an RFD, as opposed to an RFC, because some aspects of it require architecture/design/scoping decisions. However, the need is real (and SUIT, in particular, have been asking for this for some time).
Uses for this data can be added as comments.
There may be both small datasets generated frequently (nightly would be ideal) and larger datasets generated at longer intervals. The datasets resulting from runs of named releases should be kept for long periods; those resulting from nightly or other frequent runs can be kept in small ring buffers. It is a requirement that at least one previous successful run of each "flavor" be kept around so that "what's changed?" tests can be performed.
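As a sketch of the retention policy above: a nightly job could prune timestamped run directories down to a fixed ring-buffer depth. The root path, directory layout, and keep count below are all illustrative assumptions, not a decided design.

```python
import shutil
from pathlib import Path

# Hypothetical layout: /datasets/nightly/<YYYY-MM-DD>/ -- illustrative only.
NIGHTLY_ROOT = Path("/datasets/nightly")
KEEP = 3  # ring-buffer depth; must be >= 2 so "what's changed?" tests
          # always have the previous successful run to compare against


def prune_nightly_runs(root: Path = NIGHTLY_ROOT, keep: int = KEEP) -> None:
    """Delete all but the newest `keep` nightly run directories."""
    runs = sorted(p for p in root.iterdir() if p.is_dir())
    for stale in runs[:-keep]:
        shutil.rmtree(stale)


if __name__ == "__main__":
    prune_nightly_runs()
```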
Initially this request can be satisfied by providing the data in its current science pipelines output file format (i.e., FITS for images, and FITS binary tables in afw.table format for catalog data), as simple files available in some standard way (https:, git-lfs, etc.). (It is not a requirement that the outputs appear in one of the validation_* repos.) As our work progresses, it may be reasonable to extend this to providing the data ingested into a database and served via DAX, but that is an architecture and scope issue regarding end-to-end testing that can be deferred. If the file formats we support evolve, e.g., to include HDF5, that should be reflected in the evolution of the output format(s) used.
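For illustration, a downstream consumer could inspect such files with astropy alone: afw.table catalogs are ordinary FITS binary tables, and the image outputs are ordinary FITS files. The file names below are placeholders for wherever the outputs end up being published.

```python
from astropy.io import fits
from astropy.table import Table

# Placeholder paths: stand-ins for wherever the nightly outputs are served.
CALEXP_PATH = "calexp-example.fits"   # calibrated single-epoch image
SRC_PATH = "src-example.fits"         # single-epoch source catalog

# Images are plain FITS; the pixel plane is typically in an extension HDU.
with fits.open(CALEXP_PATH) as hdul:
    image = hdul[1].data
    header = hdul[0].header

# afw.table catalogs are FITS binary tables, readable as an astropy Table.
sources = Table.read(SRC_PATH, hdu=1)
print(len(sources), sources.colnames[:5])
```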
Initially providing just basic outputs like calibrated single-epoch images and single-epoch source catalogs would already be very useful. Over time, this should be extended to as much of the full set of Level 1 and Level 2 data products as possible.
The inputs and outputs should form a coordinated set, so that linkages between inputs and outputs, and between single-epoch and coadd-related data products, can be tested on the resulting datasets.
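One way to make those linkages explicit would be a small machine-readable manifest published alongside each dataset. The schema sketched below is purely hypothetical, meant only to illustrate the idea:

```python
import json

# Hypothetical manifest tying outputs back to their inputs, so that
# single-epoch -> coadd lineage can be walked without running the pipelines.
manifest = {
    "release": "w_2018_10",          # illustrative weekly tag
    "inputs": {
        "raw": ["raw/v123456-r/R22-S11.fits"],
        "calibs": ["calib/bias-R22-S11.fits", "calib/flat-r-R22-S11.fits"],
    },
    "outputs": {
        "calexp": ["calexp/v123456-r/R22-S11.fits"],
        "src": ["src/v123456-r/R22-S11.fits"],
        "deepCoadd": ["deepCoadd/r/0/1,1.fits"],
    },
}

with open("manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```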
Such a dataset has applications both for scientific data content and for data format and metadata. The former are closely related to the planned QC system's functions, and the QC system may well end up being the supplier of the output datasets contemplated here. The latter include tests such as "can the SUIT properly display the coadded images produced by the pipelines?". In general, for both types of tests, we have and/or will continue to develop automated test rigs that it may be possible to run as part of the CI system itself. But there is also a need to have the datasets available for human / interactive access, so that developers of downstream stages and components need not concern themselves with regenerating the upstream intermediates in order to test their code on recent pipeline outputs (either by learning how to run the upstream steps, or by actually running them!). Note that we have developers, e.g., JavaScript specialists in SUIT, who may never learn how to run the science pipelines code on their own and should not be required to do so.
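As one concrete shape for the automated "what's changed?" tests mentioned above, a CI job could compare the current catalog against the one from the previous successful run. A minimal pytest-style sketch, with placeholder paths and an arbitrary illustrative threshold:

```python
from astropy.table import Table

# Placeholder paths into the current and previous nightly run directories.
CURRENT = "nightly/2018-03-20/src-example.fits"
PREVIOUS = "nightly/2018-03-19/src-example.fits"


def test_catalog_schema_and_size_stable():
    """Flag schema changes or large swings in source counts between runs."""
    cur = Table.read(CURRENT, hdu=1)
    prev = Table.read(PREVIOUS, hdu=1)
    assert set(cur.colnames) == set(prev.colnames), "catalog schema changed"
    # A >10% change in row count is only an illustrative threshold.
    assert abs(len(cur) - len(prev)) <= 0.1 * len(prev), "source count shifted"
```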
The input dataset(s) to this process should ultimately include:
1. precursor data (e.g., public HSC images);
2. simulated LSST images with the truth catalogs available;
3. on-sky images taken with prototype sensors;
4. commissioning data, once it is available with a reasonable level of stability and image quality; and
5. production LSST data (as the system proposed here transitions to a role in the operations-era software development environment).
However, providing even one of these (presumably (1) or (2)) in the near term would be most useful.
The input datasets should be under change control to allow incremental comparisons to be made.
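Whatever change-control mechanism is chosen (git-lfs or otherwise), a checksum manifest over the inputs would make incremental comparisons cheap. A sketch; the directory and manifest file names are assumptions:

```python
import hashlib
import json
from pathlib import Path


def checksum_manifest(root: Path) -> dict:
    """Map each file under `root` to its SHA-256, for change detection."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digests[str(path.relative_to(root))] = hashlib.sha256(
                path.read_bytes()
            ).hexdigest()
    return digests


if __name__ == "__main__":
    # "inputs/" and the manifest file name are illustrative choices.
    manifest = checksum_manifest(Path("inputs"))
    Path("inputs.sha256.json").write_text(json.dumps(manifest, indent=2))
```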
Initially I am suggesting that the "D" in RFD happen in the comments, but I will be happy to schedule a live discussion based on the initial round of feedback.
See the discussion in the #dm channel on Slack beginning at 15:52 PDT today for some background; however, the interest in this predates that discussion by some months.
Issue Links
- is triggering:
  - DM-22806 Automation of end-to-end system from science pipeline outputs to Qserv ingest (Done)
  - DM-14568 Transform DRP pipeline outputs to DPDD-specified format in F18 (Done)
  - DM-14085 Add README.txt to auxTel dataset (Done)
  - DM-14086 Add README.txt to comCam dataset (Done)
  - DM-14087 Add README.txt to ctio0m9 dataset (Done)
  - DM-14089 Add README.txt to gapon dataset (Done)
  - DM-14092 Add README.txt to lsstCam dataset (Done)
  - DM-14093 Add README.txt to sdss dataset (Done)
  - DM-11019 Add README to /datasets/hsc (Invalid)
- relates to:
  - DM-13201 calexp have TPV and SIP terms (Done)
  - DM-18883 Design a system for making example datasets accessible to pipelines.lsst.io documentation users and documentation CI (To Do)
  - DM-14683 Update DMTN-067 with data model plans reached during the DMLT meeting (In Progress)
  - RFC-447 Add requirement to include README for repositories in /datasets (Adopted)
  - DM-36240 Setup repository demonstrating we can read historical files (In Progress)
  - DM-10125 Get first version of V & V plan out (Done)
Comments
Gregory Dubois-Felsmann: this RFC is showing that all triggered work has been completed. Does this mean that the RFC has been implemented, or does it mean more triggered tickets are needed?