# Standardize primary method names, run/runDataRef, across PipeTasks

XMLWordPrintable

#### Details

• Type: Story
• Status: Done
• Resolution: Done
• Fix Version/s: None
• Component/s:
• Labels:
• Story Points:
12
• Sprint:
DRP F18-2
• Team:
Data Release Production

#### Description

DM-1299 changed the API of IsrTask so that run takes exposures and runDataRef takes a DataRef. This was implemented per the interest in RFC-26 to make run the method of reuse across our tasks, which seemed reasonable. The problem here is that CmdLineTask.parseAndRun calls "run," and we cannot trivially run IsrTask as a CmdLineTask.

If we want run to be the method of reuse and runDataRef to be the CmdLineTask:

• parseAndRun should be updated to call runDataRef
• Remaining run methods that take dataRefs should be renamed to runDataRef in all PipeTasks.

#### Activity

Hide
John Swinbank added a comment -

This is likely something that will be tackled (or, perhaps, rendered redundant) by the work described at DM-3756. It's a valid issue for now so I'm not "Won't Fix"ing it, but please check DM-3756 before picking it up to work on.

Show
John Swinbank added a comment - This is likely something that will be tackled (or, perhaps, rendered redundant) by the work described at DM-3756 . It's a valid issue for now so I'm not "Won't Fix"ing it, but please check DM-3756 before picking it up to work on.
Hide

In the supertask migration meeting today with Jim Bosch,Christopher Waters,Nate Lust,Russell Owen we decided that instead of tackling at the same time as the Task refactoring that we plan to do for the migration, we're going to tackle this BEFORE the refactoring. The rationale is that most of the API breaking changes are in this renaming, not the refactoring, and should be done on one swift ticket.

Show
Yusra AlSayyad added a comment - In the supertask migration meeting today with Jim Bosch , Christopher Waters , Nate Lust , Russell Owen we decided that instead of tackling at the same time as the Task refactoring that we plan to do for the migration, we're going to tackle this BEFORE the refactoring. The rationale is that most of the API breaking changes are in this renaming, not the refactoring, and should be done on one swift ticket.
Hide
Christopher Waters added a comment -

Most repositories have minor changes, but all need to conform to the RFC-352 standard.

Show
Christopher Waters added a comment - Most repositories have minor changes, but all need to conform to the RFC-352 standard.
Hide
Paul Price added a comment -

I've just started the review and I already have cause to question the wisdom of the RFC decision. Take the changes to coaddDriver.py, for example. The new CoaddDriverTask would have the following functional methods:

 def runDataRef(self, tractPatchRefList, butler, selectIdList=[]):

and

 def run(self, patchRefList, butler, selectDataList=[]):

The runDataRef method is a bit of a misnomer, as it doesn't take a data reference but a list of data references. It then delegates to the run method, but that is not independent of data references either, as it also takes a list of data references. That seems quite contrary to the intent of the RFC (separating I/O from function), but I don't think that's a fault in the implementation, but naïveté in the RFC: there are times when it's not possible to separate I/O out from the functionality until one gets quite deep into the functionality. Coaddition is a prime example: we coadd images a few rows at a time to keep the memory footprint low, which requires delaying the I/O.

I suggest the RFC needs to be rethought, rephrased or expanded to cover these cases.

Show
Paul Price added a comment - I've just started the review and I already have cause to question the wisdom of the RFC decision. Take the changes to coaddDriver.py , for example. The new CoaddDriverTask would have the following functional methods: def runDataRef(self, tractPatchRefList, butler, selectIdList=[]): and def run(self, patchRefList, butler, selectDataList=[]): The runDataRef method is a bit of a misnomer, as it doesn't take a data reference but a list of data references. It then delegates to the run method, but that is not independent of data references either, as it also takes a list of data references. That seems quite contrary to the intent of the RFC (separating I/O from function), but I don't think that's a fault in the implementation, but naïveté in the RFC: there are times when it's not possible to separate I/O out from the functionality until one gets quite deep into the functionality. Coaddition is a prime example: we coadd images a few rows at a time to keep the memory footprint low, which requires delaying the I/O. I suggest the RFC needs to be rethought, rephrased or expanded to cover these cases.
Hide
Russell Owen added a comment - - edited

Paul Price most command-line tasks can have a runDataRef method that is called by the task runner (and takes one or more data references) and a run method that takes no data references. In those cases I personally would much rather proceed with this ticket instead of re-opening the discussion. The intent of the refactoring in this ticket is to prepare for SuperTask by having a runDataRef method that will be similar to runQuantum in the SuperTask version.

This is interesting timing as I am just now reviewing a new task that computes brighter-fatter kernels and I think it is like coaddition in that it processes more data than can fit into memory in one run.

Show
Hide
Robert Lupton added a comment -

Isn't this covered (at least in theory) by the discussion that started with reference catalogues?  You are allowed to pass a proxy that defers the I/O until it is needed, but you are not allowed to pass a blob such as a DataRef that is idempotent.  This is pretty clear for tasks that might reasonably be reused from a script (or notebook), although I agree that it's less clear for larger-scale processing and will require detailed thought.

Show
Robert Lupton added a comment - Isn't this covered (at least in theory) by the discussion that started with reference catalogues?  You are allowed to pass a proxy that defers the I/O until it is needed, but you are not allowed to pass a blob such as a DataRef  that is idempotent.  This is pretty clear for tasks that might reasonably be reused from a script (or notebook), although I agree that it's less clear for larger-scale processing and will require detailed thought.
Hide
Paul Price added a comment -

The reference catalogs carry a butler, which makes them very much like a data reference. If that's the solution you want, I'm fine with it, but someone needs to decide whether that work will be done here, or those special cases will be deferred to another ticket.

Show
Paul Price added a comment - The reference catalogs carry a butler, which makes them very much like a data reference. If that's the solution you want, I'm fine with it, but someone needs to decide whether that work will be done here, or those special cases will be deferred to another ticket.
Hide

Paul Price Some backstory on why we’re doing the renaming now during the SuperTask refactoring. After the SuperTask refactoring, Tasks will have 3 primary methods:

• runQuantum (Takes a “quantum” and a “Gen3 butler”, called by the Gen 3 middleware)
• runDataRef (Takes Gen2 butler dataRef, list of Gen2 butler DataRefs)
Both of these will get the appropriate data in their respective ways and call:
• run (the heart of the Task that does the processing)

Right now these methods are named:

Gen3: runQuantum
Gen2: run
which calls the "heart" of the Task, which has some nice greppable action name (e.g. assemble, calibrate, characterize, runDetection.

In RFC-26 and RFC-352 it was clear Robert much prefers that the “heart” of a Task be called run.

Before DM-2639, pipe.base.CmdlineTask.parseAndRun() called run, which was blocking our refactoring efforts (along with our ability to run CmdLineTasks with the new naming convention like IsrTask.

My first implementation of DM-2639 introduced a second TaskRunner that called a Task’s runDataRef() instead of run so that we could rename the methods one by one as we refactored them for SuperTask. But in the SuperTask meeting we decided that since the renaming was the major API-breaking change and had cross-package implications, it was best to rip off the band-aid and do it all at once by one person.

So here we are.

Show
Yusra AlSayyad added a comment - Paul Price Some backstory on why we’re doing the renaming now during the SuperTask refactoring. After the SuperTask refactoring, Tasks will have 3 primary methods: runQuantum (Takes a “quantum” and a “Gen3 butler”, called by the Gen 3 middleware) runDataRef (Takes Gen2 butler dataRef, list of Gen2 butler DataRefs) Both of these will get the appropriate data in their respective ways and call: run (the heart of the Task that does the processing) Right now these methods are named: Gen3: runQuantum Gen2: run which calls the "heart" of the Task, which has some nice greppable action name (e.g. assemble , calibrate , characterize , runDetection . In RFC-26 and RFC-352 it was clear Robert much prefers that the “heart” of a Task be called run . Before DM-2639 , pipe.base.CmdlineTask.parseAndRun() called run, which was blocking our refactoring efforts (along with our ability to run CmdLineTasks with the new naming convention like IsrTask. My first implementation of DM-2639 introduced a second TaskRunner that called a Task’s runDataRef() instead of run so that we could rename the methods one by one as we refactored them for SuperTask. But in the SuperTask meeting we decided that since the renaming was the major API-breaking change and had cross-package implications, it was best to rip off the band-aid and do it all at once by one person. So here we are. https://community.lsst.org/t/api-change-for-tasks-rename-run-primarymethod-to-rundataref-run/3054/8
Hide

Paul I will also clarify that we are still refactoring all the tasks we can per RFC-352, just not on this ticket.  It was too much work for one person on one ticket.  See bucket tickets DM-14816 and DM-11098. We have a spreadsheet of tickets that need refactoring and are starting with the ones that get run by a HSC DRP. Nate Lust is working on ReferenceLoaders and I'm going to start on makeCoaddTempExp and the zeropoint scalers.

If you have time, pick your favorite Task to separate out disk access from processing.

Show
Yusra AlSayyad added a comment - Paul  I will also clarify that we are still refactoring all the tasks we can per RFC-352 , just not on this ticket.  It was too much work for one person on one ticket.  See bucket tickets DM-14816 and DM-11098 . We have a spreadsheet of tickets that need refactoring and are starting with the ones that get run by a HSC DRP. Nate Lust is working on ReferenceLoaders and I'm going to start on makeCoaddTempExp and the zeropoint scalers. If you have time, pick your favorite Task to separate out disk access from processing.
Hide
Jim Bosch added a comment - - edited

The singular/plural mismatch of runDataRef for Tasks whose unit of work is comprised of multiple DataRefs is unfortunate, but can also easily be dealt with on a case-by-case basis on this ticket, because those Tasks must already be using a custom TaskRunner, and that TaskRunner can of course choose to call whatever method it wants. Note that in the SuperTask era, the main driver that uses the butler will be called runQuantum, which does not have this singular/plural problem.

Show
Hide
Jim Bosch added a comment - - edited

In a separate conversation with Yusra AlSayyad, she pointed out that because custom TaskRunner doesn't always imply custom TaskRunner.__call__, my assertion that it'd be easy to pick an alternate name for runDataRef in the plural case is probably false. When that's the case, I suggest we just hold our noses about the singular/plural mismatch and call it runDataRef for now. I think the important thing is that we get this change merged before it develops too many conflicts with other ongoing work, so we can proceed with the follow-up tickets to clean individual Tasks up further and start making them usable as SuperTasks.

On the reference catalog question, this is definitely not the ticket to try to address this. There's already another ticket (DM-14819) to make the reference loaders work with SuperTask, so the current approach will be going away when that conversion is complete, if not earlier. It's also worth pointing out that the reason those are considered conceptually different from DataRefs in this context is because their API was designed to support non-butler-backed implementations, so if a user had an instance of such a thing in a butler-free context, they could still use the Task that wants the references. We don't actually have any such implementations, because we've never actually needed one (and maybe that says something about the utility of this approach), but I do still maintain that it'd be easy to add one.

Show
Hide
Paul Price added a comment -

That's great, Jim, thanks!

Show
Paul Price added a comment - That's great, Jim, thanks!
Hide
Paul Price added a comment -

With the understanding that the goal here is not full RFC-352 compliance (which I think we've agreed needs some more thought and/or guidelines and/or work) but easing the SuperTask transition, this looks fine.

The commit messages are very short. In general, I shouldn't need to know what RFC-352 is (or even have access to Jira to look it up) in order to understand the motivation behind the changes. Please have a look at the commit message best practices in the Dev Guide. And I think you shouldn't pass this off as RFC-352 compliance in the commit summary line (which confused me) but easing the SuperTask transition (as I understand it, the fact that this helps RFC-352 is nice, but not the primary motivation).

I trust this has all been run through Jenkins and ci_hsc, and that there will be an announcement on CLO when this merges.

Show
Paul Price added a comment - With the understanding that the goal here is not full RFC-352 compliance (which I think we've agreed needs some more thought and/or guidelines and/or work) but easing the SuperTask transition, this looks fine. The commit messages are very short. In general, I shouldn't need to know what RFC-352 is (or even have access to Jira to look it up) in order to understand the motivation behind the changes. Please have a look at the commit message best practices in the Dev Guide . And I think you shouldn't pass this off as RFC-352 compliance in the commit summary line (which confused me) but easing the SuperTask transition (as I understand it, the fact that this helps RFC-352 is nice, but not the primary motivation). I trust this has all been run through Jenkins and ci_hsc, and that there will be an announcement on CLO when this merges.
Hide

Made another review pass for completeness. Nice work. Looks like all that's left is documentation.

There are currently two copies of "How to write a commandline Task" and "How to write a Task" the dox version is in pipe_tasks and the rst version is in pipe_base. We're in transition. I would say you must update the rst and should update the dox.

I reviewed the dox first so I'll put those down first only because the comments are more complete.

• run--> runDataRef lines 51, 53, 68, 86, 151, 176, 190, 202, 280

It’d be good to add the sentence to the intro of custom TaskRunners: “If your Task uses the pre-2018 naming convention and has a run method that operates on a DataRef instead of a runDataRef, you can still run this as a CmdLineTask by using the \ref lsst.pipe.base.cmdLineTask.LegacyTaskRunner "lsst.pipe.base.LegacyTaskRunner”, which will call your Task’s run method.”

• line 150 should now be be: “In addition, Command-line tasks require a runDataRef method that fetches the appropriate data products specified by Gen-2 Butler DataRefs”
• 164: The run and runDataRef methods should always…
• 174: runDataRef does everything
• 215: Including run and runDataRef (if present)

pipe_base:
lines: 32, 39, 55, 75, 131, 146, 174, 193

• lines: 32, 39, 55, 75, 131, 146, 174, 193
• add the info from the old docs about the LegacyTaskRunner here in the Custom Task Runner section that begins with 208. 244, 247

Nice doc string in TaskRunner.runTask, but check the veracity of the copied and pasted text from _call_.

• docstring for Task. run —> runDataRef

pipe_drivers:
Note to Self: Reminding myself that CalibCombineTask is OK as is since its not a cmdlineTask and will be refactored later. Tasks in obs_subaru liek HscFlatCombineTask inherit from this.

pipe_drivers/python/lsst/pipe/drivers/utils.py

• line 13 should prob say runDataRef

And finally squash the commits in all packages.

Show
Yusra AlSayyad added a comment - Made another review pass for completeness. Nice work. Looks like all that's left is documentation. There are currently two copies of "How to write a commandline Task" and "How to write a Task" the dox version is in pipe_tasks and the rst version is in pipe_base. We're in transition. I would say you must update the rst and should update the dox. I reviewed the dox first so I'll put those down first only because the comments are more complete. pipe_tasks : pipe_tasks/doc/old/writeCmdLineTask.dox: run--> runDataRef lines 51, 53, 68, 86, 151, 176, 190, 202, 280 in: 198 \section pipeTasks_writeCmdLineTask_customTaskRunner Custom Task Runner It’d be good to add the sentence to the intro of custom TaskRunners: “If your Task uses the pre-2018 naming convention and has a run method that operates on a DataRef instead of a runDataRef, you can still run this as a CmdLineTask by using the \ref lsst.pipe.base.cmdLineTask.LegacyTaskRunner "lsst.pipe.base.LegacyTaskRunner”, which will call your Task’s run method.” pipe_tasks/doc/old/writeTask.dox: line 150 should now be be: “In addition, Command-line tasks require a runDataRef method that fetches the appropriate data products specified by Gen-2 Butler DataRefs” 164: The run and runDataRef methods should always… 174: runDataRef does everything 215: Including run and runDataRef (if present) pipe_base: pipe_base/doc/lsst.pipe.base/command-line-task-retargeting-howto.rst lines: 32, 39, 55, 75, 131, 146, 174, 193 pipe_base/doc/lsst.pipe.base/command-line-task-retargeting-howto.rst lines: 32, 39, 55, 75, 131, 146, 174, 193 add the info from the old docs about the LegacyTaskRunner here in the Custom Task Runner section that begins with 208. 244, 247 Nice doc string in TaskRunner.runTask , but check the veracity of the copied and pasted text from _ call _. pipe_base/python/lsst/pipe/base/task.py: docstring for Task. run —> runDataRef pipe_drivers: Note to Self: Reminding myself that CalibCombineTask is OK as is since its not a cmdlineTask and will be refactored later. Tasks in obs_subaru liek HscFlatCombineTask inherit from this. pipe_drivers/python/lsst/pipe/drivers/utils.py line 13 should prob say runDataRef And finally squash the commits in all packages.
Hide
Christopher Waters added a comment -

I've taken the duplicate reference to pipe_base/doc/lsst.pipe.base/command-line-task-retargeting-howto.rst as a typo, and checked all the files in that directory, modifying the two creating-*-task.rst files which had run references that needed updating to runDataRef.

Show
Christopher Waters added a comment - I've taken the duplicate reference to pipe_base/doc/lsst.pipe.base/command-line-task-retargeting-howto.rst as a typo, and checked all the files in that directory, modifying the two creating-*-task.rst files which had run references that needed updating to runDataRef.
Hide

Closing. SPs include time becoming familiar with our workflow.

Show
Yusra AlSayyad added a comment - Closing. SPs include time becoming familiar with our workflow.

#### People

Assignee:
Christopher Waters
Reporter:
Reviewers:
Paul Price
Watchers:
Christopher Waters, Hsin-Fang Chiang, Jim Bosch, John Swinbank, Paul Price, Robert Lupton, Russell Owen, Simon Krughoff, Yusra AlSayyad