Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-16275

PipelineTask should always use overridable methods to get DatasetTypes

    XMLWordPrintable

    Details

      Description

      It looks like PipelineTask.runQuantum (at least) uses iteration through the Config instance to find all of the input and output DatasetTypes.  This at least partially ignores the get*DatasetType methods, which should be permitted to add and modify the set of DatasetTypes processed.

      Fixing this may involving changing the signatures of the get*DatasetType methods a bit - there is at least some information (e.g. "scalar") that is currently only available from the configs, and that probably needs to be forwarded through the get*DatasetType methods so they can be considered the final source of truth by PipelineTask and all activators.

       

        Attachments

          Activity

          Hide
          salnikov Andy Salnikov added a comment -

          I'm having hard time imagining how can I change get*DatasetType to also return extra info from configuration object without making it overly complicated and keeping it future-proof.

          How about adding extra method to PipelineTask which returns the value of "scalar" for given field with default implementation that sub-classes can override as well, e.g.:

              @classmethod
              def getDatasetIsScalar(cls, config, key):
                  field = config.get(key)
                  if isinstance(field, (InputDatasetConfig, OutputDatasetConfig)):
                      return field.scalar
                  return False
          

          I do not particularly like this, if we need to add something else to configuration it would mean new method like this will be needed. Should we just rename get*DatasetType() to get*DatasetConfig() and return configuration objects from those (and let sub-classes override those)?

          Show
          salnikov Andy Salnikov added a comment - I'm having hard time imagining how can I change get*DatasetType to also return extra info from configuration object without making it overly complicated and keeping it future-proof. How about adding extra method to PipelineTask which returns the value of "scalar" for given field with default implementation that sub-classes can override as well, e.g.: @classmethod def getDatasetIsScalar( cls , config, key): field = config.get(key) if isinstance (field, (InputDatasetConfig, OutputDatasetConfig)): return field.scalar return False I do not particularly like this, if we need to add something else to configuration it would mean new method like this will be needed. Should we just rename get*DatasetType() to get*DatasetConfig() and return configuration objects from those (and let sub-classes override those)?
          Hide
          jbosch Jim Bosch added a comment -

          Should we just rename get*DatasetType() to get*DatasetConfig() and return configuration objects from those (and let sub-classes override those)?

          I think this is a bit closer to what we want.  I'm a bit worried it may be confusing to use a Config class as the return element type of these methods, despite the fact that they're likely to have all of the attributes we need*.*

          Should we invent a new class for "DatasetType as used by a PipelineTask" that could be returned instead?  Those could have a DatasetType instance, and a way to construct from *DatasetConfig (taking over the logic in PipelineTask.makeDatasetType), and a scalar attribute, and whatever else we need them to have in the future.  We could start with a simple namedtuple for now.

          Show
          jbosch Jim Bosch added a comment - Should we just rename get*DatasetType() to get*DatasetConfig() and return configuration objects from those (and let sub-classes override those)? I think this is a bit closer to what we want.  I'm a bit worried it may be confusing to use a Config class as the return element type of these methods, despite the fact that they're likely to have all of the attributes we need*.* Should we invent a new class for "DatasetType as used by a PipelineTask" that could be returned instead?  Those could have a DatasetType instance, and a way to construct from *DatasetConfig (taking over the logic in PipelineTask.makeDatasetType), and a scalar attribute, and whatever else we need them to have in the future.  We could start with a simple namedtuple for now.
          Hide
          salnikov Andy Salnikov added a comment -

          I was trying to avoid introducing new class (semantically it should be ~equivalent to *DatasetConfig classes) but it may be the safest option for the future extensions. I'll try to implement something not too complicated.

          Show
          salnikov Andy Salnikov added a comment - I was trying to avoid introducing new class (semantically it should be ~equivalent to *DatasetConfig classes) but it may be the safest option for the future extensions. I'll try to implement something not too complicated.
          Hide
          salnikov Andy Salnikov added a comment -

          I think this is ready for review. I ran Jenkins yesterday with success, and re-running it now.

          I have decided to keep method names (they are already quite long) but they now return instances of new class DatasetTypeDescriptor instead of DatasetType. All code that uses those methods is updated and I switched runQuantum() to use those methods instead of doing config introspection.

          Show
          salnikov Andy Salnikov added a comment - I think this is ready for review. I ran Jenkins yesterday with success, and re-running it now. I have decided to keep method names (they are already quite long) but they now return instances of new class DatasetTypeDescriptor instead of DatasetType . All code that uses those methods is updated and I switched runQuantum() to use those methods instead of doing config introspection.
          Hide
          salnikov Andy Salnikov added a comment -

          Thanks for review! I think I have taken care of all comments. Re-ran Jenkins, centos builds completed OK, osx is still running but I presume it's going to be success Merged and closing.

          Show
          salnikov Andy Salnikov added a comment - Thanks for review! I think I have taken care of all comments. Re-ran Jenkins, centos builds completed OK, osx is still running but I presume it's going to be success Merged and closing.

            People

            Assignee:
            salnikov Andy Salnikov
            Reporter:
            jbosch Jim Bosch
            Reviewers:
            Jim Bosch
            Watchers:
            Andy Salnikov, Jim Bosch, Nate Lust, Vaikunth Thukral
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

              Dates

              Created:
              Updated:
              Resolved:

                CI Builds

                No builds found.