Last week, the DRP team started to prototype a refactoring of ci_hsc on DM-9059 with the goal of adding test coverage for pipe_drivers and ultimately (on different tickets) PipelineTasks. That would involve something like the following packages:
- ci_hsc_data (or just ci_hsc; name TBD): a git-lfs package (the only one here) containing all raw data, reference catalogs, master calibrations, etc. "Building" this package essentially just runs (Gen2) ingest. Installing the package copies the data repository to the installation directory, so it can be picked up by downstream packages in lsstsw/Jenkins. This package also contains common Python validation code used by downstream packages to test the existence and properties of expected outputs.
- ci_hsc_cmdLineTasks: contains the remainder of the current ci_hsc scripts, and hence runs single-frame processing, warping, coadding, and multi-band coadd processing, by calling the CmdLineTasks individually. Outputs go into a new repo with the installed ci_hsc_data repo as its parent. Invokes ci_hsc_data's validation scripts to check that everything we expected to see is present.
- ci_hsc_drivers: contains almost the same algorithmic content as ci_hsc_cmdLineTasks, but invoked via the higher-level scripts in pipe_drivers, which sometimes have slightly different behavior and certainly involve different code paths. Also runs the validation scripts. Depends only on ci_hsc_data.
- ci_hsc_gen3: Depends on ci_hsc_drivers and ci_hsc_cmdLineTasks, and creates a Gen3 repo view into their output repos. Uses Gen3 tools to compare the outputs and ensure they're the same in the ways that matter to us.
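To make the dependency structure above concrete, here is a minimal Python sketch of the proposed package graph. It is only an illustration, not part of lsstsw or any existing build tooling: the `DEPENDENCIES` table and the `required_packages` and `build_order` helpers are hypothetical names, and the ci_hsc_cmdLineTasks dependency on ci_hsc_data is inferred from the description (its output repo uses the installed ci_hsc_data repo as its parent).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical table: package -> packages it depends on, as described above.
# The ci_hsc_cmdLineTasks entry is inferred, not stated outright.
DEPENDENCIES = {
    "ci_hsc_data": set(),
    "ci_hsc_cmdLineTasks": {"ci_hsc_data"},
    "ci_hsc_drivers": {"ci_hsc_data"},
    "ci_hsc_gen3": {"ci_hsc_drivers", "ci_hsc_cmdLineTasks"},
}


def required_packages(target):
    """Return the target plus everything it transitively depends on."""
    needed = set()
    stack = [target]
    while stack:
        pkg = stack.pop()
        if pkg not in needed:
            needed.add(pkg)
            stack.extend(DEPENDENCIES[pkg])
    return needed


def build_order(target):
    """Return one valid build order for the target, dependencies first."""
    needed = required_packages(target)
    subgraph = {pkg: DEPENDENCIES[pkg] & needed for pkg in needed}
    return list(TopologicalSorter(subgraph).static_order())


if __name__ == "__main__":
    print(sorted(required_packages("ci_hsc_drivers")))
    print(build_order("ci_hsc_gen3"))
```

Under these assumptions, asking for ci_hsc_drivers pulls in only ci_hsc_data, while ci_hsc_gen3 pulls in all four packages.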
In the future I expect to add:
- Gen3 raw ingest directly to ci_hsc_data.
- A new package to run the same code again in PipelineTask form. This will start out depending on ci_hsc_gen3 in order to use Gen2 outputs that we can't yet produce via PipelineTasks, but when complete it should depend only on ci_hsc_data, and eventually replace both ci_hsc_cmdLineTasks and ci_hsc_drivers. And, ultimately, I'm hoping we'll have better ways of running Gen3 pipelines in CI than building them as if they were uncompiled code - but I don't think we should treat full retirement of Gen2 and a Pipeline-based test harness as right around the corner, and in the meantime we need this expanded test coverage.
This is, of course, an initial factor of two and an eventual factor of three more processing than ci_hsc does today. But I think that's inevitable; we have 2-3x as much stuff we should be testing as we are testing. It's also an even larger expansion in the amount of disk space, as ci_hsc doesn't currently install anything, and with installs we'd essentially have two copies of all data products.
That said, a big goal of this refactoring is to make it possible to run individual pipeline flavors without running all of them. Once ci_hsc_drivers is in good shape, I think it's all most Science Pipelines developers will need to run on most tickets, because there is a substantial overlap in coverage between _drivers and _cmdLineTasks. Gen3 middleware and PipelineTask developers would probably find it most efficient to install and infrequently update copies of ci_hsc_data, ci_hsc_cmdLineTasks, and ci_hsc_drivers, and rely on local runs of ci_hsc_gen3 under the usually-reasonable assumption that they aren't breaking Gen2 pipelines. And of course we can build all of these packages to ensure complete coverage on a timer or on ticket branches that are particularly far-reaching.
As (I believe) we're about to create one or more ci_lsst packages for various kinds of simulated, test-data, and eventually on-sky LSST data, it might be worth discussing here as well whether they should follow a similar pattern. I'm not convinced the answer is yes, as I think it's much more important right now to organize LSST data in a way that makes sense for more complete testing of earlier stages of the pipeline, rather than end-to-end testing of the DRP pipelines and Gen3 middleware. But it's a discussion we should have.