This proposal includes three related topics:
• How to organize verification datasets, data products preprocessed by external pipelines, and outputs from our own pipeline on disk.
• Another try at the --rerun convenience argument for the pipe_base argument parser.
• Functionality in pipe_base/daf_butlerUtils for handling data products preprocessed by external pipelines.
First off, I think we need to put all of our datasets in a central location (rather than multiple /lsstN volumes), or at least make it appear that way using symlinks. I'm hoping we can eventually just have a large distributed filesystem that can include everything, but I think it's important to make it appear that way even before that's feasible. Thus, for some (short, but otherwise unimportant) value of $root, we'd have a layout like this:
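A hypothetical sketch of that layout (the camera directory names below are illustrative, following the corresponding obs package names):

```
$root/
    subaru/      # obs_subaru
    decam/       # obs_decam
    sdss/        # obs_sdss
    lsstSim/     # obs_lsstSim
```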
The subdirectories are per-camera; everything below these directories should use the same mapper and obs package. Where possible, the name should reflect the name of the obs package.
For real data, the _mapper and registry.sqlite files would go directly in the per-camera directory; there'd be only one input data repository and one registry for all data from that camera (no splitting it up based on survey, proposal, or patch of sky - managing those differences is what the registry is for). The rest of the directory structure below that level (with some exceptions below) would be managed by the mapper for that camera.
For simulated data, we need another level before we get to the data repository root, because we can have multiple realizations of the sky that should never be processed together. I propose we create a JIRA issue for each distinct simulation run and form directory names from the issue name and optionally the date (e.g. "DM-NNNN(_YYYYMMDD)?", though I don't care about the details of the formatting). Furthermore, we should distinguish between "public" simulation runs that have a wide audience and "private" runs used only by a small group of people with a particular goal in mind ("private" != "secret"). Public simulation directories should always have a date, should go directly within the camera-level directory, should be publicized somehow when they're created, and should require an RFC to remove. Private simulation directories should go under an additional username-based subdirectory, may or may not have a date, and may be removed by that user whenever desired. For example:
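A hypothetical example (the issue numbers, dates, and username below are invented for illustration):

```
$root/lsstSim/
    DM-1234_20150601/        # public run: dated, RFC required to remove
    private/
        someuser/
            DM-5678/         # private run: date optional, owner may remove at will
```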
For public simulation runs that are important enough to have a widely-recognized name of their own, the directory can use this name instead of an issue number (though I think the date should still generally be included).
Data products that are produced by an external pipeline that may be used in our own processing (e.g. DECam community pipeline ISR or SDSS Photo PSF models) should go within a "preprocessed" subdirectory in the camera-level directory and additional directories below that to indicate both the source and the version of the processing. These additional directories have camera-dependent names. For instance:
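A hypothetical example (the source and version directory names below are invented for illustration):

```
$root/decam/
    preprocessed/
        community-pipeline/     # source of the external processing
            v2.0/               # version of that processing
$root/sdss/
    preprocessed/
        photo/
            dr9/
```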
All such directories for a particular camera should have the same layout below this level; they'll be treated the same as the root of the raw data repository by the mapper, with this path set by a new --preprocessed command-line argument in the pipe_base argument parser (similar to the way we use --calib to specify the calibration root directory now). Like --calib, we'll have environment variables and a search path to provide the default when the argument is not provided.
Outputs from our own pipeline should usually go within the camera-level root repository, in a subdirectory of a "rerun" directory there. Like the simulation runs, these should usually be named using JIRA issue numbers and dates, again with a distinction between public and private reruns. For example:
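A hypothetical example (again with invented issue numbers, dates, and username):

```
$root/subaru/
    rerun/
        DM-1234_20150601/       # public rerun
        private/
            someuser/
                DM-5678/        # private rerun
```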
Once we have CI-generated reruns, these should go within e.g. rerun/weekly, with a naming convention TBD. Unlike simulation runs, for which issues are generally created just to document the run, most private reruns will be named for an issue that is mostly about something else, with the rerun just part of testing or resolving that issue. As with any other output repository, reruns can be chained, but the name of the parent rerun is not generally recorded in the name of the child.
To better support this convention, I propose we bring over the --rerun argument from the HSC fork. It serves the same purpose as the --output argument (which will be retained), but the value supplied to --rerun is interpreted relative to the "rerun" directory of the root input data repository (i.e. we follow the _parent chain all the way back, then append "rerun" to that). This saves the user from typing the input repository path twice. Unlike --output, --rerun also supports a two-value form separated by a colon, which specifies both a parent rerun and the new output directory, making chaining much easier. For example:
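Chaining a new private rerun off an existing public one might look like this (the task name, repository path, and rerun names here are illustrative):

```
processCcd.py $root/subaru --rerun DM-1234:private/someuser/DM-5678 --id visit=1000
```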
Compare that to:
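That is, the same invocation with --output, spelling out both repository paths in full (same illustrative names):

```
processCcd.py $root/subaru/rerun/DM-1234 --output $root/subaru/rerun/private/someuser/DM-5678 --id visit=1000
```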
We considered adding --rerun a long time ago and couldn't reach a decision on whether it was valuable; it certainly has been on the HSC side. We essentially always use it there, and never use --output, and we've had a much easier time of sharing and making use of each other's outputs as a result. But this is coupled with the directory structure above: without those conventions, --rerun isn't much use, so it's not at all surprising that it wasn't unanimously considered valuable on the LSST side in the past.
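The path resolution described above (follow the _parent chain back to the root, then append "rerun") can be sketched in a few lines. This is a hypothetical illustration, not the actual pipe_base implementation, and it assumes _parent is a symlink to the parent repository:

```python
import os

def resolve_rerun(repo, rerun):
    """Resolve a --rerun value against an input repository.

    Illustrative sketch: follow the ``_parent`` chain from ``repo``
    back to the root data repository, then interpret ``rerun``
    relative to ``<root>/rerun``.  Returns the pair
    (input repository, output repository).
    """
    root = os.path.realpath(repo)
    # Walk _parent links (assumed here to be symlinks) back to the root.
    while os.path.exists(os.path.join(root, "_parent")):
        root = os.path.realpath(os.path.join(root, "_parent"))
    if ":" in rerun:
        # Two-value form "parent:child": chain a new rerun off an old one.
        parent, child = rerun.split(":", 1)
        return (os.path.join(root, "rerun", parent),
                os.path.join(root, "rerun", child))
    return (root, os.path.join(root, "rerun", rerun))
```

The two-value form returns the parent rerun as the input repository, which is exactly what makes chaining a one-argument operation for the user.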