Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-2639

Standardize primary method names, run/runDataRef, across PipeTasks

    XMLWordPrintable

    Details

    • Story Points:
      12
    • Sprint:
      DRP F18-2
    • Team:
      Data Release Production

      Description

      DM-1299 changed the API of IsrTask so that run takes exposures and runDataRef takes a DataRef. This was implemented per the interest in RFC-26 to make run the method of reuse across our tasks, which seemed reasonable. The problem here is that CmdLineTask.parseAndRun calls "run," and we cannot trivially run IsrTask as a CmdLineTask.

      If we want run to be the method of reuse and runDataRef to be the CmdLineTask:

      • parseAndRun should be updated to call runDataRef
      • Remaining run methods that take dataRefs should be renamed to runDataRef in all PipeTasks.

        Attachments

          Issue Links

            Activity

            Hide
            swinbank John Swinbank added a comment -

            This is likely something that will be tackled (or, perhaps, rendered redundant) by the work described at DM-3756. It's a valid issue for now so I'm not "Won't Fix"ing it, but please check DM-3756 before picking it up to work on.

            Show
            swinbank John Swinbank added a comment - This is likely something that will be tackled (or, perhaps, rendered redundant) by the work described at DM-3756 . It's a valid issue for now so I'm not "Won't Fix"ing it, but please check DM-3756 before picking it up to work on.
            Hide
            yusra Yusra AlSayyad added a comment -

            In the supertask migration meeting today with Jim Bosch,Christopher Waters,Nate Lust,Russell Owen we decided that instead of tackling at the same time as the Task refactoring that we plan to do for the migration, we're going to tackle this BEFORE the refactoring. The rationale is that most of the API breaking changes are in this renaming, not the refactoring, and should be done on one swift ticket.

            Show
            yusra Yusra AlSayyad added a comment - In the supertask migration meeting today with Jim Bosch , Christopher Waters , Nate Lust , Russell Owen we decided that instead of tackling at the same time as the Task refactoring that we plan to do for the migration, we're going to tackle this BEFORE the refactoring. The rationale is that most of the API breaking changes are in this renaming, not the refactoring, and should be done on one swift ticket.
            Hide
            czw Christopher Waters added a comment -

            Most repositories have minor changes, but all need to conform to the RFC-352 standard.

            Show
            czw Christopher Waters added a comment - Most repositories have minor changes, but all need to conform to the RFC-352 standard.
            Hide
            price Paul Price added a comment -

            I've just started the review and I already have cause to question the wisdom of the RFC decision. Take the changes to coaddDriver.py, for example. The new CoaddDriverTask would have the following functional methods:

                def runDataRef(self, tractPatchRefList, butler, selectIdList=[]):
            

            and

                def run(self, patchRefList, butler, selectDataList=[]):
            

            The runDataRef method is a bit of a misnomer, as it doesn't take a data reference but a list of data references. It then delegates to the run method, but that is not independent of data references either, as it also takes a list of data references. That seems quite contrary to the intent of the RFC (separating I/O from function), but I don't think that's a fault in the implementation, but naïveté in the RFC: there are times when it's not possible to separate I/O out from the functionality until one gets quite deep into the functionality. Coaddition is a prime example: we coadd images a few rows at a time to keep the memory footprint low, which requires delaying the I/O.

            I suggest the RFC needs to be rethought, rephrased or expanded to cover these cases.

            Show
            price Paul Price added a comment - I've just started the review and I already have cause to question the wisdom of the RFC decision. Take the changes to coaddDriver.py , for example. The new CoaddDriverTask would have the following functional methods: def runDataRef(self, tractPatchRefList, butler, selectIdList=[]): and def run(self, patchRefList, butler, selectDataList=[]): The runDataRef method is a bit of a misnomer, as it doesn't take a data reference but a list of data references. It then delegates to the run method, but that is not independent of data references either, as it also takes a list of data references. That seems quite contrary to the intent of the RFC (separating I/O from function), but I don't think that's a fault in the implementation, but naïveté in the RFC: there are times when it's not possible to separate I/O out from the functionality until one gets quite deep into the functionality. Coaddition is a prime example: we coadd images a few rows at a time to keep the memory footprint low, which requires delaying the I/O. I suggest the RFC needs to be rethought, rephrased or expanded to cover these cases.
            Hide
            rowen Russell Owen added a comment - - edited

            Paul Price most command-line tasks can have a runDataRef method that is called by the task runner (and takes one or more data references) and a run method that takes no data references. In those cases I personally would much rather proceed with this ticket instead of re-opening the discussion. The intent of the refactoring in this ticket is to prepare for SuperTask by having a runDataRef method that will be similar to runQuantum in the SuperTask version.

            Re-reading RFC-352 I don't see a clear statement of how to handle tasks such as coaddition where one "run" processes more data than can fit into memory. At the super task meeting where we decided implement the RFC before refactoring for SuperTask, we agreed to convert those tasks that we could on this ticket, but I don't recall what we said about coaddition. It seems to me the most natural thing to do is rename run to runDataRef (as for all other tasks being converted) using the rule that the task runner always calls runDataRef. I have no idea if a run method also makes sense for that task.

            This is interesting timing as I am just now reviewing a new task that computes brighter-fatter kernels and I think it is like coaddition in that it processes more data than can fit into memory in one run.

            Show
            rowen Russell Owen added a comment - - edited Paul Price most command-line tasks can have a runDataRef method that is called by the task runner (and takes one or more data references) and a run method that takes no data references. In those cases I personally would much rather proceed with this ticket instead of re-opening the discussion. The intent of the refactoring in this ticket is to prepare for SuperTask by having a runDataRef method that will be similar to runQuantum in the SuperTask version. Re-reading RFC-352 I don't see a clear statement of how to handle tasks such as coaddition where one "run" processes more data than can fit into memory. At the super task meeting where we decided implement the RFC before refactoring for SuperTask, we agreed to convert those tasks that we could on this ticket, but I don't recall what we said about coaddition. It seems to me the most natural thing to do is rename run to runDataRef (as for all other tasks being converted) using the rule that the task runner always calls runDataRef . I have no idea if a run method also makes sense for that task. This is interesting timing as I am just now reviewing a new task that computes brighter-fatter kernels and I think it is like coaddition in that it processes more data than can fit into memory in one run.
            Hide
            rhl Robert Lupton added a comment -

            Isn't this covered (at least in theory) by the discussion that started with reference catalogues?  You are allowed to pass a proxy that defers the I/O until it is needed, but you are not allowed to pass a blob such as a DataRef that is idempotent.  This is pretty clear for tasks that might reasonably be reused from a script (or notebook), although I agree that it's less clear for larger-scale processing and will require detailed thought.

             

            Show
            rhl Robert Lupton added a comment - Isn't this covered (at least in theory) by the discussion that started with reference catalogues?  You are allowed to pass a proxy that defers the I/O until it is needed, but you are not allowed to pass a blob such as a DataRef  that is idempotent.  This is pretty clear for tasks that might reasonably be reused from a script (or notebook), although I agree that it's less clear for larger-scale processing and will require detailed thought.  
            Hide
            price Paul Price added a comment -

            The reference catalogs carry a butler, which makes them very much like a data reference. If that's the solution you want, I'm fine with it, but someone needs to decide whether that work will be done here, or those special cases will be deferred to another ticket.

            Show
            price Paul Price added a comment - The reference catalogs carry a butler, which makes them very much like a data reference. If that's the solution you want, I'm fine with it, but someone needs to decide whether that work will be done here, or those special cases will be deferred to another ticket.
            Hide
            yusra Yusra AlSayyad added a comment -

            Paul Price Some backstory on why we’re doing the renaming now during the SuperTask refactoring. After the SuperTask refactoring, Tasks will have 3 primary methods:

            • runQuantum (Takes a “quantum” and a “Gen3 butler”, called by the Gen 3 middleware)
            • runDataRef (Takes Gen2 butler dataRef, list of Gen2 butler DataRefs)
              Both of these will get the appropriate data in their respective ways and call:
            • run (the heart of the Task that does the processing)

            Right now these methods are named:

            Gen3: runQuantum
            Gen2: run
            which calls the "heart" of the Task, which has some nice greppable action name (e.g. assemble, calibrate, characterize, runDetection.

            In RFC-26 and RFC-352 it was clear Robert much prefers that the “heart” of a Task be called run.

            Before DM-2639, pipe.base.CmdlineTask.parseAndRun() called `run`, which was blocking our refactoring efforts (along with our ability to run CmdLineTasks with the new naming convention like IsrTask.

            My first implementation of DM-2639 introduced a second TaskRunner that called a Task’s runDataRef() instead of run so that we could rename the methods one by one as we refactored them for SuperTask. But in the SuperTask meeting we decided that since the renaming was the major API-breaking change and had cross-package implications, it was best to rip off the band-aid and do it all at once by one person.

            So here we are.

            https://community.lsst.org/t/api-change-for-tasks-rename-run-primarymethod-to-rundataref-run/3054/8

            Show
            yusra Yusra AlSayyad added a comment - Paul Price Some backstory on why we’re doing the renaming now during the SuperTask refactoring. After the SuperTask refactoring, Tasks will have 3 primary methods: runQuantum (Takes a “quantum” and a “Gen3 butler”, called by the Gen 3 middleware) runDataRef (Takes Gen2 butler dataRef, list of Gen2 butler DataRefs) Both of these will get the appropriate data in their respective ways and call: run (the heart of the Task that does the processing) Right now these methods are named: Gen3: runQuantum Gen2: run which calls the "heart" of the Task, which has some nice greppable action name (e.g. assemble , calibrate , characterize , runDetection . In RFC-26 and RFC-352 it was clear Robert much prefers that the “heart” of a Task be called run . Before DM-2639 , pipe.base.CmdlineTask.parseAndRun() called `run`, which was blocking our refactoring efforts (along with our ability to run CmdLineTasks with the new naming convention like IsrTask. My first implementation of DM-2639 introduced a second TaskRunner that called a Task’s runDataRef() instead of run so that we could rename the methods one by one as we refactored them for SuperTask. But in the SuperTask meeting we decided that since the renaming was the major API-breaking change and had cross-package implications, it was best to rip off the band-aid and do it all at once by one person. So here we are. https://community.lsst.org/t/api-change-for-tasks-rename-run-primarymethod-to-rundataref-run/3054/8
            Hide
            yusra Yusra AlSayyad added a comment -

            Paul I will also clarify that we are still refactoring all the tasks we can per RFC-352, just not on this ticket.  It was too much work for one person on one ticket.  See bucket tickets DM-14816 and DM-11098. We have a spreadsheet of tickets that need refactoring and are starting with the ones that get run by a HSC DRP. Nate Lust is working on ReferenceLoaders and I'm going to start on makeCoaddTempExp and the zeropoint scalers.

            If you have time, pick your favorite Task to separate out disk access from processing.

            Show
            yusra Yusra AlSayyad added a comment - Paul  I will also clarify that we are still refactoring all the tasks we can per RFC-352 , just not on this ticket.  It was too much work for one person on one ticket.  See bucket tickets DM-14816 and DM-11098 . We have a spreadsheet of tickets that need refactoring and are starting with the ones that get run by a HSC DRP. Nate Lust is working on ReferenceLoaders and I'm going to start on makeCoaddTempExp and the zeropoint scalers. If you have time, pick your favorite Task to separate out disk access from processing.
            Hide
            jbosch Jim Bosch added a comment - - edited

            I do not believe there is any requirement that all Tasks do all I/O in runDataRef and all non-I/O work in run; that is clearly impractical for a few Tasks. I would state it as "any Task that can do all I/O in runDataRef and all non-I/O work in run should" (and that we expect this to be true for the vast majority of Tasks, but not all). The goal here is to better support ad-hoc and/or non-butler-based usage of Tasks and (in the future) to avoid boilerplate in SuperTask. We should not tie ourselves up in knots trying to support ad-hoc and/or non-butler-based usage of Tasks like AssembleCoadd and Jointcal for which much of the work is itself I/O bookkeeping and the case where an interactive user has all inputs already in memory is at least exceedingly rare, and the extra boilerplate in the SuperTask era is really not that bad (and can be mitigated other ways). I suggest we handle the naming of the methods of those rare CmdLineTasks on a case-by-case basis in this ticket; I trust Paul Price and Christopher Waters (and any interested bystanders) to come up with something decent, and I don't think we need another RFC for that.

            The singular/plural mismatch of runDataRef for Tasks whose unit of work is comprised of multiple DataRefs is unfortunate, but can also easily be dealt with on a case-by-case basis on this ticket, because those Tasks must already be using a custom TaskRunner, and that TaskRunner can of course choose to call whatever method it wants. Note that in the SuperTask era, the main driver that uses the butler will be called runQuantum, which does not have this singular/plural problem.

            Show
            jbosch Jim Bosch added a comment - - edited I do not believe there is any requirement that all Tasks do all I/O in runDataRef and all non-I/O work in run; that is clearly impractical for a few Tasks. I would state it as "any Task that can do all I/O in runDataRef and all non-I/O work in run should " (and that we expect this to be true for the vast majority of Tasks, but not all). The goal here is to better support ad-hoc and/or non-butler-based usage of Tasks and (in the future) to avoid boilerplate in SuperTask. We should not tie ourselves up in knots trying to support ad-hoc and/or non-butler-based usage of Tasks like AssembleCoadd and Jointcal for which much of the work is itself I/O bookkeeping and the case where an interactive user has all inputs already in memory is at least exceedingly rare, and the extra boilerplate in the SuperTask era is really not that bad (and can be mitigated other ways). I suggest we handle the naming of the methods of those rare CmdLineTasks on a case-by-case basis in this ticket; I trust Paul Price and Christopher Waters (and any interested bystanders) to come up with something decent, and I don't think we need another RFC for that. The singular/plural mismatch of runDataRef for Tasks whose unit of work is comprised of multiple DataRefs is unfortunate, but can also easily be dealt with on a case-by-case basis on this ticket, because those Tasks must already be using a custom TaskRunner, and that TaskRunner can of course choose to call whatever method it wants. Note that in the SuperTask era, the main driver that uses the butler will be called runQuantum , which does not have this singular/plural problem.
            Hide
            jbosch Jim Bosch added a comment - - edited

            In a separate conversation with Yusra AlSayyad, she pointed out that because custom TaskRunner doesn't always imply custom TaskRunner.__call__, my assertion that it'd be easy to pick an alternate name for runDataRef in the plural case is probably false. When that's the case, I suggest we just hold our noses about the singular/plural mismatch and call it runDataRef for now. I think the important thing is that we get this change merged before it develops too many conflicts with other ongoing work, so we can proceed with the follow-up tickets to clean individual Tasks up further and start making them usable as SuperTasks.

            On the reference catalog question, this is definitely not the ticket to try to address this. There's already another ticket (DM-14819) to make the reference loaders work with SuperTask, so the current approach will be going away when that conversion is complete, if not earlier. It's also worth pointing out that the reason those are considered conceptually different from DataRefs in this context is because their API was designed to support non-butler-backed implementations, so if a user had an instance of such a thing in a butler-free context, they could still use the Task that wants the references. We don't actually have any such implementations, because we've never actually needed one (and maybe that says something about the utility of this approach), but I do still maintain that it'd be easy to add one.

            Show
            jbosch Jim Bosch added a comment - - edited In a separate conversation with Yusra AlSayyad , she pointed out that because custom TaskRunner doesn't always imply custom TaskRunner.__call__ , my assertion that it'd be easy to pick an alternate name for runDataRef in the plural case is probably false. When that's the case, I suggest we just hold our noses about the singular/plural mismatch and call it runDataRef for now. I think the important thing is that we get this change merged before it develops too many conflicts with other ongoing work, so we can proceed with the follow-up tickets to clean individual Tasks up further and start making them usable as SuperTasks. On the reference catalog question, this is definitely not the ticket to try to address this. There's already another ticket ( DM-14819 ) to make the reference loaders work with SuperTask, so the current approach will be going away when that conversion is complete, if not earlier. It's also worth pointing out that the reason those are considered conceptually different from DataRefs in this context is because their API was designed to support non-butler-backed implementations, so if a user had an instance of such a thing in a butler-free context, they could still use the Task that wants the references. We don't actually have any such implementations, because we've never actually needed one (and maybe that says something about the utility of this approach), but I do still maintain that it'd be easy to add one.
            Hide
            price Paul Price added a comment -

            That's great, Jim, thanks!

            Show
            price Paul Price added a comment - That's great, Jim, thanks!
            Hide
            price Paul Price added a comment -

            With the understanding that the goal here is not full RFC-352 compliance (which I think we've agreed needs some more thought and/or guidelines and/or work) but easing the SuperTask transition, this looks fine.

            The commit messages are very short. In general, I shouldn't need to know what RFC-352 is (or even have access to Jira to look it up) in order to understand the motivation behind the changes. Please have a look at the commit message best practices in the Dev Guide. And I think you shouldn't pass this off as RFC-352 compliance in the commit summary line (which confused me) but easing the SuperTask transition (as I understand it, the fact that this helps RFC-352 is nice, but not the primary motivation).

            I trust this has all been run through Jenkins and ci_hsc, and that there will be an announcement on CLO when this merges.

            Show
            price Paul Price added a comment - With the understanding that the goal here is not full RFC-352 compliance (which I think we've agreed needs some more thought and/or guidelines and/or work) but easing the SuperTask transition, this looks fine. The commit messages are very short. In general, I shouldn't need to know what RFC-352 is (or even have access to Jira to look it up) in order to understand the motivation behind the changes. Please have a look at the commit message best practices in the Dev Guide . And I think you shouldn't pass this off as RFC-352 compliance in the commit summary line (which confused me) but easing the SuperTask transition (as I understand it, the fact that this helps RFC-352 is nice, but not the primary motivation). I trust this has all been run through Jenkins and ci_hsc, and that there will be an announcement on CLO when this merges.
            Hide
            yusra Yusra AlSayyad added a comment -

            Made another review pass for completeness. Nice work. Looks like all that's left is documentation.

            There are currently two copies of "How to write a commandline Task" and "How to write a Task" the dox version is in pipe_tasks and the rst version is in pipe_base. We're in transition. I would say you must update the rst and should update the dox.

            I reviewed the dox first so I'll put those down first only because the comments are more complete.
            pipe_tasks:
            pipe_tasks/doc/old/writeCmdLineTask.dox:

            • run--> runDataRef lines 51, 53, 68, 86, 151, 176, 190, 202, 280
            • in: 198 \section pipeTasks_writeCmdLineTask_customTaskRunner Custom Task Runner

            It’d be good to add the sentence to the intro of custom TaskRunners: “If your Task uses the pre-2018 naming convention and has a `run` method that operates on a `DataRef` instead of a `runDataRef`, you can still run this as a CmdLineTask by using the \ref lsst.pipe.base.cmdLineTask.LegacyTaskRunner "lsst.pipe.base.LegacyTaskRunner”, which will call your Task’s `run` method.”

            pipe_tasks/doc/old/writeTask.dox:

            • line 150 should now be be: “In addition, Command-line tasks require a `runDataRef` method that fetches the appropriate data products specified by Gen-2 Butler DataRefs”
            • 164: The `run` and `runDataRef` methods should always…
            • 174: `runDataRef` does everything
            • 215: Including `run` and `runDataRef` (if present)

            pipe_base:
            pipe_base/doc/lsst.pipe.base/command-line-task-retargeting-howto.rst
            lines: 32, 39, 55, 75, 131, 146, 174, 193

            pipe_base/doc/lsst.pipe.base/command-line-task-retargeting-howto.rst

            • lines: 32, 39, 55, 75, 131, 146, 174, 193
            • add the info from the old docs about the LegacyTaskRunner here in the Custom Task Runner section that begins with 208. 244, 247

            Nice doc string in TaskRunner.runTask, but check the veracity of the copied and pasted text from _call_.

            pipe_base/python/lsst/pipe/base/task.py:

            • docstring for Task. ``run`` —> ``runDataRef``

            pipe_drivers:
            Note to Self: Reminding myself that `CalibCombineTask` is OK as is since its not a cmdlineTask and will be refactored later. Tasks in obs_subaru liek HscFlatCombineTask inherit from this.

            pipe_drivers/python/lsst/pipe/drivers/utils.py

            • line 13 should prob say `runDataRef`

            And finally squash the commits in all packages.

            Show
            yusra Yusra AlSayyad added a comment - Made another review pass for completeness. Nice work. Looks like all that's left is documentation. There are currently two copies of "How to write a commandline Task" and "How to write a Task" the dox version is in pipe_tasks and the rst version is in pipe_base. We're in transition. I would say you must update the rst and should update the dox. I reviewed the dox first so I'll put those down first only because the comments are more complete. pipe_tasks : pipe_tasks/doc/old/writeCmdLineTask.dox: run--> runDataRef lines 51, 53, 68, 86, 151, 176, 190, 202, 280 in: 198 \section pipeTasks_writeCmdLineTask_customTaskRunner Custom Task Runner It’d be good to add the sentence to the intro of custom TaskRunners: “If your Task uses the pre-2018 naming convention and has a `run` method that operates on a `DataRef` instead of a `runDataRef`, you can still run this as a CmdLineTask by using the \ref lsst.pipe.base.cmdLineTask.LegacyTaskRunner "lsst.pipe.base.LegacyTaskRunner”, which will call your Task’s `run` method.” pipe_tasks/doc/old/writeTask.dox: line 150 should now be be: “In addition, Command-line tasks require a `runDataRef` method that fetches the appropriate data products specified by Gen-2 Butler DataRefs” 164: The `run` and `runDataRef` methods should always… 174: `runDataRef` does everything 215: Including `run` and `runDataRef` (if present) pipe_base: pipe_base/doc/lsst.pipe.base/command-line-task-retargeting-howto.rst lines: 32, 39, 55, 75, 131, 146, 174, 193 pipe_base/doc/lsst.pipe.base/command-line-task-retargeting-howto.rst lines: 32, 39, 55, 75, 131, 146, 174, 193 add the info from the old docs about the LegacyTaskRunner here in the Custom Task Runner section that begins with 208. 244, 247 Nice doc string in TaskRunner.runTask , but check the veracity of the copied and pasted text from _ call _. pipe_base/python/lsst/pipe/base/task.py: docstring for Task. ``run`` —> ``runDataRef`` pipe_drivers: Note to Self: Reminding myself that `CalibCombineTask` is OK as is since its not a cmdlineTask and will be refactored later. Tasks in obs_subaru liek HscFlatCombineTask inherit from this. pipe_drivers/python/lsst/pipe/drivers/utils.py line 13 should prob say `runDataRef` And finally squash the commits in all packages.
            Hide
            czw Christopher Waters added a comment -

            I've taken the duplicate reference to pipe_base/doc/lsst.pipe.base/command-line-task-retargeting-howto.rst as a typo, and checked all the files in that directory, modifying the two creating-*-task.rst files which had run references that needed updating to runDataRef.

            Show
            czw Christopher Waters added a comment - I've taken the duplicate reference to pipe_base/doc/lsst.pipe.base/command-line-task-retargeting-howto.rst as a typo, and checked all the files in that directory, modifying the two creating-*-task.rst files which had run references that needed updating to runDataRef.
            Hide
            yusra Yusra AlSayyad added a comment -

            Closing. SPs include time becoming familiar with our workflow.

            Show
            yusra Yusra AlSayyad added a comment - Closing. SPs include time becoming familiar with our workflow.

              People

              Assignee:
              czw Christopher Waters
              Reporter:
              yusra Yusra AlSayyad
              Reviewers:
              Paul Price
              Watchers:
              Christopher Waters, Hsin-Fang Chiang, Jim Bosch, John Swinbank, Paul Price, Robert Lupton, Russell Owen, Simon Krughoff, Yusra AlSayyad
              Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:

                  Jenkins

                  No builds found.