Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-14823

Clean up handling of extra data ID information in SuperTask.run

    XMLWordPrintable

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      The default implementation of SuperTask.runQuantum currently passes additional output data ID information to run, as this is necessary in at least some contexts in which run needs to be able to group input datasets.  However, the way this is passed is confusing (the kwargs generated do not have names that suggest that they are IDs), and the need for these IDs may be sufficiently rare that most SuperTasks should not be required to accept them.

      One possibility for how to address this would be:

      • runQuantum always passes just a single data ID (the quantum data ID, not the data of either inputs or outputs) to run, as an always-optional dataId keyword argument (i.e. SuperTasks must permit this argument to be None).  That will at least meet the needs of SuperTasks that want to use the data ID for diagnostic or custom-provenance purposes (see also DM-14821).
      • SuperTasks that need to do data ID grouping in run should override runQuantum themselves.
      • To make the above easier / less verbose, we should look for ways to make some of the logic in the default implementation of runQuantum available to subclasses that override that method (e.g. via utility methods that do some of the work).

      I'm open to other ideas as well, and I should note that I have not thought much about how this proposal would change which of our concrete SuperTasks would need to override runQuantum.

        Attachments

          Issue Links

            Activity

            No builds found.
            jbosch Jim Bosch created issue -
            jbosch Jim Bosch made changes -
            Field Original Value New Value
            Link This issue blocks DM-14816 [ DM-14816 ]
            jbosch Jim Bosch made changes -
            Risk Score 0
            salnikov Andy Salnikov made changes -
            Epic Link DM-14661 [ 106365 ]
            salnikov Andy Salnikov made changes -
            Sprint BG3_F18_07 [ 759 ]
            salnikov Andy Salnikov made changes -
            Status To Do [ 10001 ] In Progress [ 3 ]
            Hide
            salnikov Andy Salnikov added a comment - - edited

            I think that if we want output units to be optional (not used by most concrete tasks) then run() method cannot even have them in the argument list (but we still want all input data there). I can imagine few ways to design/implement that:

            1. have run() method to only take inputs, most normal tasks will override that method. In addition to run() method have some other method, e.g. runWithOutputs(inputData: dict, outputUnits: dict) which has a default implementation calling run():

                  def runWithOutputs(self, inputData, outputUnits):
                      """default implementation calls run() with only input data as kw arguments"""
                      return self.run(**inputData)
               
                  def run(self, **kw):
                      # subclass will override this with actual input args, e.g. "def run(self, input1, input2):"
              

              The runQuantum() will call this new method instead of run(). Tasks that need access to output units would have to override runWithOutputs() method.

            1. variation of this is to skip runWithOutputs() entirely and let people override runQuantum() instead. But most or all of the logic in runQuantum() has to be reimplemented anyways, so I think that intermediate method is a little bit cleaner approach.
            2. one more option is to provide separate method or property which returns output units for a currently processed Quanta, the units will be remembered by runQuantum() before calling run() method. I'm not particularly happy with this idea, I would prefer passing all context via method arguments, but if this simplifies life for many users then it's probably OK (and I don't think we'll ever use same Task instance to process more than one Quanta in parallel).
            Show
            salnikov Andy Salnikov added a comment - - edited I think that if we want output units to be optional (not used by most concrete tasks) then run() method cannot even have them in the argument list (but we still want all input data there). I can imagine few ways to design/implement that: have run() method to only take inputs, most normal tasks will override that method. In addition to run() method have some other method, e.g. runWithOutputs(inputData: dict, outputUnits: dict) which has a default implementation calling run() : def runWithOutputs( self , inputData, outputUnits): "" "default implementation calls run() with only input data as kw arguments" "" return self .run( * * inputData)   def run( self , * * kw): # subclass will override this with actual input args, e.g. "def run(self, input1, input2):" The runQuantum() will call this new method instead of run() . Tasks that need access to output units would have to override runWithOutputs() method. variation of this is to skip runWithOutputs() entirely and let people override runQuantum() instead. But most or all of the logic in runQuantum()  has to be reimplemented anyways, so I think that intermediate method is a little bit cleaner approach. one more option is to provide separate method or property which returns output units for a currently processed Quanta, the units will be remembered by runQuantum() before calling run()  method. I'm not particularly happy with this idea, I would prefer passing all context via method arguments, but if this simplifies life for many users then it's probably OK (and I don't think we'll ever use same Task instance to process more than one Quanta in parallel).
            Hide
            salnikov Andy Salnikov added a comment -

            Jim Bosch, if you have no other preference I'll go ahead with my option 1 from the list above.

            Show
            salnikov Andy Salnikov added a comment - Jim Bosch , if you have no other preference I'll go ahead with my option 1 from the list above.
            Hide
            jbosch Jim Bosch added a comment -

            Sorry I didn't get to this earlier.  (1) would at least be okay; it certainly solves this problem.  The option that looks like it was supposed to be (2) (but is also labeled (1)) might be better if we can find some way to refactor some of the logic in runQuantum into utility methods in order to make it easier to override runQuantum directly.  Nate Lust and I have already identified (on DM-14819) a piece of shared code (for reading reference catalogs) that does some specialized I/O; that would naturally go best in an override of runQuantum, so we have at least one other use case for wanting to override runQuantum.  But we haven't yet looked to see if that kind of refactoring into utility methods is feasible.

            Show
            jbosch Jim Bosch added a comment - Sorry I didn't get to this earlier.  (1) would at least be okay; it certainly solves this problem.  The option that looks like it was supposed to be (2) (but is also labeled (1)) might be better if we can find some way to refactor some of the logic in runQuantum into utility methods in order to make it easier to override runQuantum directly.  Nate Lust and I have already identified (on DM-14819 ) a piece of shared code (for reading reference catalogs) that does some specialized I/O; that would naturally go best in an override of runQuantum , so we have at least one other use case for wanting to override runQuantum .  But we haven't yet looked to see if that kind of refactoring into utility methods is feasible.
            Hide
            salnikov Andy Salnikov added a comment -

            Here is my attempt to improve things. I'm not particularly happy with it but I think we can at least freeze API for run() method at this point as we are preparing to merge it on Friday. Any suggestions for better name of runWithOutputs() method are welcome. 

            I also think we may want to do something with Struct in case when task returns output data for multiple DataIds. Right now code expects it to be a list in the same order as DataRefs in Quantum, but that feels somewhat fragile. It may be better to return a dict with DataIds as keys, but that would complicate things for single-DataId case.

            Show
            salnikov Andy Salnikov added a comment - Here is my attempt to improve things. I'm not particularly happy with it but I think we can at least freeze API for run() method at this point as we are preparing to merge it on Friday. Any suggestions for better name of runWithOutputs()  method are welcome.  I also think we may want to do something with Struct  in case when task returns output data for multiple DataIds. Right now code expects it to be a list in the same order as DataRefs in Quantum, but that feels somewhat fragile. It may be better to return a dict with DataIds as keys, but that would complicate things for single-DataId case.
            salnikov Andy Salnikov made changes -
            Reviewers Jim Bosch [ jbosch ]
            Status In Progress [ 3 ] In Review [ 10004 ]
            Hide
            jbosch Jim Bosch added a comment -

            Comments on the PR.  I'd also like Nate Lust to take a quick look, as he may have some ideas about how to make runQuantum easier to override.

            Show
            jbosch Jim Bosch added a comment - Comments on the PR.  I'd also like Nate Lust to take a quick look, as he may have some ideas about how to make runQuantum easier to override.
            jbosch Jim Bosch made changes -
            Status In Review [ 10004 ] In Review [ 10004 ]
            Hide
            salnikov Andy Salnikov added a comment -

            Ready to merge (after Jenkins finishes for this branch), Jim Bosch can you mark JIRA as reviewed if there are no more questions/suggestions?

            Show
            salnikov Andy Salnikov added a comment - Ready to merge (after Jenkins finishes for this branch), Jim Bosch can you mark JIRA as reviewed if there are no more questions/suggestions?
            Hide
            jbosch Jim Bosch added a comment -

            Review complete!  Sorry I forgot to do that here before.

            Show
            jbosch Jim Bosch added a comment - Review complete!  Sorry I forgot to do that here before.
            jbosch Jim Bosch made changes -
            Status In Review [ 10004 ] Reviewed [ 10101 ]
            Hide
            salnikov Andy Salnikov added a comment -

            Thanks for review! Jenkins passed for Centos jobs, still in progress for osx. Merged both packages, done.

            Show
            salnikov Andy Salnikov added a comment - Thanks for review! Jenkins passed for Centos jobs, still in progress for osx. Merged both packages, done.
            salnikov Andy Salnikov made changes -
            Resolution Done [ 10000 ]
            Status Reviewed [ 10101 ] Done [ 10002 ]

              People

              Assignee:
              salnikov Andy Salnikov
              Reporter:
              jbosch Jim Bosch
              Reviewers:
              Jim Bosch
              Watchers:
              Andy Salnikov, Jim Bosch
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:

                  CI Builds

                  No builds found.