The warping implementation (currently MakeCoaddTempExpTask) relies on a SelectImagesTask to sort out the inputs. Unfortunately, this selection is performed inside MakeCoaddTempExpTask.run, which means the task gets parallelised over patches rather than over individual warps (unless the user has some additional mechanics for setting the command lines, essentially doing the selection themselves), which is inefficient. The SelectImagesTask hierarchy grew up as a MySQL database query before other things got tacked on to support butler-based selection, and it could use simplification.
I propose to move the selection of images into the TaskRunner.getTargetList method, which will allow for parallelisation over individual warps. I will also refactor the SelectImagesTask hierarchy and add a sub-class to perform selection of images using the spatial database introduced in DM-5766 (Implement spatial exposure selection task).
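As a rough sketch of the intent (using stand-in names like `WarpTaskRunner` and toy "bounding boxes", not the actual lsst.pipe.base API), moving selection into `getTargetList` means each (patch, exposure) pair becomes its own parallel target, so the pool parallelises over warps rather than patches:

```python
# Schematic only: stand-in classes illustrating the proposed split, not the
# real pipe_base TaskRunner interface.

def select_images(patch, exposures):
    """Toy selection: keep exposures whose footprint overlaps the patch."""
    return [exp for exp in exposures if exp["bbox"] & patch["bbox"]]

class WarpTaskRunner:
    """Stand-in for a TaskRunner whose getTargetList does the selection."""

    def __init__(self, exposures):
        self.exposures = exposures

    def get_target_list(self, patches):
        # One target per (patch, exposure) pair -> the batch system can
        # parallelise over individual warps, not over whole patches.
        targets = []
        for patch in patches:
            for exp in select_images(patch, self.exposures):
                targets.append((patch["id"], exp["id"]))
        return targets

# Toy "footprints" as sets of overlapping cells:
exposures = [
    {"id": "visit1", "bbox": {1, 2}},
    {"id": "visit2", "bbox": {2, 3}},
]
patches = [{"id": "patchA", "bbox": {2}}, {"id": "patchB", "bbox": {3}}]
runner = WarpTaskRunner(exposures)
print(runner.get_target_list(patches))
# [('patchA', 'visit1'), ('patchA', 'visit2'), ('patchB', 'visit2')]
```

The point is only where the selection happens: once it runs inside `getTargetList`, the per-warp work items exist before any worker starts, instead of being discovered inside `run` on a per-patch basis.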
There is a SelectSdssImagesTask in obs_sdss that I propose to drop because it relies on a particular database that I don't have access to and cannot recreate, and that code probably hasn't been used in a while. If someone objects to dropping it, I will attempt to modify it to conform with the SelectImagesTask changes, but testing and fixing it will have to be the responsibility of whoever wants to keep it.
While I'm modifying the warping front-end, I also propose to move the config parameters that are operational in nature (e.g., doWrite, doOverwrite) out of the Config. Their presence in the Config causes trouble because there are legitimate reasons for changing their values between command-line incantations, but doing so requires adding --clobber-config, which should only be necessary for changes of an algorithmic nature.
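One possible shape for that split (all names here are hypothetical, not a proposal for the actual pipe_base interface): keep algorithmic parameters in the persisted Config, and collect operational flags in a plain object populated from command-line options, so changing them between runs never trips the config-consistency check:

```python
import argparse
from dataclasses import dataclass

# Hypothetical sketch: operational flags live outside the persisted Config,
# so toggling --overwrite between runs never requires --clobber-config.

@dataclass
class OperationalOptions:
    do_write: bool = True       # write outputs?
    do_overwrite: bool = False  # clobber existing outputs?

@dataclass
class AlgorithmConfig:
    warping_kernel: str = "lanczos3"  # algorithmic: persisted and checked

def parse_args(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--no-write", dest="do_write", action="store_false")
    parser.add_argument("--overwrite", dest="do_overwrite", action="store_true")
    args = parser.parse_args(argv)
    return OperationalOptions(do_write=args.do_write,
                              do_overwrite=args.do_overwrite)

opts = parse_args(["--overwrite"])
print(opts)  # OperationalOptions(do_write=True, do_overwrite=True)
```

Only `AlgorithmConfig` would be persisted and compared against previous runs; `OperationalOptions` is rebuilt fresh from the command line every time.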
Is triggering:
- DM-5766 Implement spatial exposure selection task
- DM-17028 Write PipelineTask to Make Warps (MakeCoaddTempExp conversion)
- DM-15236 Refactor MakeCoaddTempExp s.t. run doesn't touch disk
We can have the TaskRunner call a method in the Task to do the selection; I don't think that's too different from what I was planning.
For moving the overwrite options out of the Config, I was thinking of putting them in a separate Config that wouldn't be persisted. Alternatively, they could be separate command-line options, but I like the idea of grouping them together in a Config. I think we should do something similar for the "debug" options (which would allow us to declare and document all the possible options and set them on the command line), but let's not discuss that here.
I have been aiming towards a solution like the one Jim Bosch suggested, with two tasks (the first of which I've been calling a DataIdGenerator) communicating via a list of dataIds, although not necessarily via a persisted Butler dataset.
Paul Price has an account on the lsst-db server that holds the krughoff_SDSS_quality_db database (which was copied from a public SDSS release, I believe). The credentials need to be saved either in both ~/.lsst/db-auth.paf and ~/.lsst/db-auth.py, or else in ~/.mysql.cnf (I think that works).
Options for configuring I/O, including what datasets are written, should ideally be in a Butler or SuperTask configuration, separate from the algorithmic Task configuration.
I'm not sure if Gregory Dubois-Felsmann, and possibly Rob Kooper and Andy Salnikov, will be able to weigh in by the planned end of this RFC. It may need to be extended a day or two.
I still use SelectSdssImagesTask and will take responsibility for testing it after your SelectImagesTask refactor.
Certainly the long-term solution is to put the selection into the workflow system using SuperTask, but at the moment we don't have SuperTask or a workflow system. It's important to fix the parallelisation of warping in the code we're using now, because it will allow higher efficiency in the coming HSC production run. Also, the code in question has grown in fits and starts and has needed refactoring for a long time.
What I do should not impact development of the workflow system using SuperTask, but will help inform its development.
I'm not sure exactly what to say here. Our current expectation is indeed that SuperTask will help to solve this sort of issue (likely with some help from task authors). Exactly how it will be done I cannot say; it's a complicated issue and we need to think more about all the possible options. Cases like this are certainly interesting for developing a functional scheduling system, and I'll look at this one more closely to understand all the details.
Regarding configuration: SuperTask does not do anything special yet; all configuration issues are handled at the Task level. K-T is right that we will most likely reuse (task) configuration for defining "non-algorithmic" parameters, but this is again not at all settled yet.
Are there further arguments against this proposal, or do I need to hold off longer to give people a chance to respond before I adopt?
Paul Price, are you opposed to having a separate selection CmdLineTask that we run first, which writes out a dataset (perhaps an ExposureCatalog) listing what would be coadded? I think that sounds like the most future-proof option. I imagine it could easily go in as part of coaddDrivers.py and hence not make that driver any harder to use, and I even think having that list-of-exposures dataset could be good for provenance (since we wouldn't have to regenerate the list in assembleCoadd's TaskRunner and just hope it's consistent with the one from makeCoaddTempExp's TaskRunner).
That's essentially what the spatial database is. I don't see a difference between your proposal and mine except for the implementation details.
The difference between the spatial database and my list is an extra level of filtering that takes it from the list of possible exposures included to the list of actual exposures included (from either more precise spatial checks or any kind of quality check). I'd like to put that filtering in Task code if at all possible. But I take your point that even if we do that, the coaddition task TaskRunners will need to load the outputs using the butler and iterate over them to generate data IDs.
I think I see the merits of having a "list of actual exposures": it factors out the common operation of selecting exposures from the warping and coadding tasks. OK, so then I see it going like this:
1. singleFrameDriver/processCcd adds entries to the spatial database.
2. selectImagesToCoadd selects entries from the spatial database and writes a selection database.
3. makeMatchedWarp iterates over (exposure, patch) combinations in the selection database.
4. assembleCoadd iterates over (patch, filter) combinations in the selection database.
Do you have a particular preference for the format of the selection database? You mentioned an ExposureCatalog, but I'm leaning towards SQLite for consistency with the spatial database and more flexibility. I can see the spatial and selection databases morphing into a proper database that tracks quality and processing metrics for each stage (e.g., section 2.1 of The Pan-STARRS Data Processing System) down the road (we definitely need something like that).
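If the selection database does end up as SQLite, the four steps above might map onto something like this minimal sketch (table and column names are made up for illustration, not a schema proposal; the filter is stored as a `band` column):

```python
import sqlite3

# Illustrative only: a toy selection database for the four-step flow above.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE selection (
        tract INTEGER,
        patch TEXT,
        band  TEXT,     -- filter name
        visit INTEGER,
        ccd   INTEGER
    )""")

# Step 2: selectImagesToCoadd records the selected (visit, ccd) per patch.
con.executemany("INSERT INTO selection VALUES (?,?,?,?,?)",
                [(0, "1,1", "HSC-I", 100, 4),
                 (0, "1,1", "HSC-I", 102, 4),
                 (0, "1,2", "HSC-I", 100, 5)])

# Step 3: makeMatchedWarp pulls the exposures for one patch.
warps = con.execute(
    "SELECT visit, ccd FROM selection"
    " WHERE tract=? AND patch=? AND band=? ORDER BY visit",
    (0, "1,1", "HSC-I")).fetchall()
print(warps)  # [(100, 4), (102, 4)]

# Step 4: assembleCoadd pulls the patches for one filter.
patch_list = con.execute(
    "SELECT DISTINCT patch FROM selection"
    " WHERE tract=? AND band=? ORDER BY patch",
    (0, "HSC-I")).fetchall()
print(patch_list)  # [('1,1',), ('1,2',)]
```

Both downstream tasks then iterate over query results rather than re-running the selection, which is the consistency benefit discussed above.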
I hadn't really thought about extending it to a real database, but it's an interesting idea. I'd like to get Kian-Tat Lim and DAX opinions on that.
Conceptually, I think this is just a per tract+patch+filter list of visit+ccd data IDs, right? I think the simplest way to represent that is a regular butlerized file (maybe just BaseCatalog rather than ExposureCatalog, if that's all it is). Mapping generic data IDs to a schema is probably a bit thorny right now, but I think we'll have to provide generic capabilities for that someday, and hopefully we can do it well now.
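As a toy illustration of that "just a flat table of data IDs" view (plain Python and CSV standing in for a butlerized BaseCatalog, whose schema API isn't reproduced here), the dataset per tract+patch+filter reduces to rows of (visit, ccd):

```python
import csv
import io

# Toy stand-in for a butlerized "list of inputs" dataset: one flat table of
# visit+ccd data IDs keyed by tract/patch/filter. A BaseCatalog would carry
# the same columns; CSV is used only to keep the sketch self-contained.
rows = [
    {"tract": 0, "patch": "1,1", "filter": "HSC-I", "visit": 100, "ccd": 4},
    {"tract": 0, "patch": "1,1", "filter": "HSC-I", "visit": 102, "ccd": 4},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["tract", "patch", "filter", "visit", "ccd"])
writer.writeheader()
writer.writerows(rows)

# A coaddition TaskRunner would read this back and turn rows into data IDs:
buf.seek(0)
data_ids = [{"visit": int(r["visit"]), "ccd": int(r["ccd"])}
            for r in csv.DictReader(buf)]
print(data_ids)  # [{'visit': 100, 'ccd': 4}, {'visit': 102, 'ccd': 4}]
```

Whether the table also carries denormalized exposure metadata (the question raised below) is what would tip the choice toward ExposureCatalog or a real database.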
So I guess the big question is whether you think it'd be useful to essentially denormalize some of the exposure metadata into those files so it's useful for something other than the coaddition tasks. If that's the case, ExposureCatalog or a real database starts to make more sense. I think the LSST plan is not (thus far) to use a database for a lot of the provenance stuff (or, if we do, to make it write-only by the pipelines), but other people can comment much more usefully on that than I can.
Status: Proposed → Adopted
Is DM-5766 the triggered work? (It's currently a "Blocks" relationship, not "Is Triggering"; it can be both.)
Paul Price what's going on with this RFC? The triggered work closed as invalid. Has the RFC been withdrawn?
Paul Price please comment on the status of this RFC. It has no triggered work associated with it.
Without reading the comment history too carefully, I can probably do this on DM-15236, which is the ticket to "refactor makeCoaddTempExp in prep for Gen 3 middleware."
I just marked DM-15236 as invalid because the work was actually done on DM-17028: https://github.com/lsst/pipe_tasks/commit/62e839c2e8686beb4b9883a710c8e7fb8af9ab5b
Jim Bosch's original comment, "I expect SuperTask will help with this in the future," has been realized. There's still some work to be done this cycle to retire the Gen 2 version. I'll attach the appropriate ticket when we get there.
Yusra AlSayyad, RFCs that have no work to be done keep popping up at the DMCCB for discussion. If you need to do more work on this RFC implementation, would it be possible to create a ticket for it? (It doesn't have to cover all the work, just enough to keep the RFC from popping up again.)
Resolution: Done
Status: Adopted → Implemented
I'm comfortable calling this implemented now. I expect deprecation of the Gen 2 version this Fall.
I agree that doing the selection before we actually start the warping makes sense, but I don't like the idea of putting it in the TaskRunner instead of the Task itself. I expect SuperTask will help with this in the future, but in the meantime, would it be feasible to have the TaskRunner call a method in the Task to do the selection, at least?
A cleaner but much less user-friendly interim solution might be to have a separate command-line task for generating the selection, and writing it out as a list to a new butlerized dataset; this would better handle the case where selection itself is an expensive operation that may want to be parallelized.
I'm all for moving the overwrite options out of the config. Do you have any thoughts on where we could put them instead, or do you think it's just better to not have them for now?