DM-16503

Investigate requirements for Butler 3 interoperability

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Story Points:
      4
    • Epic Link:
    • Sprint:
      AP F18-6
    • Team:
      Alert Production

      Description

      DMTN-098 speculates that it may be possible to write an adapter that allows the same MetricTask to be used with both Butler 2 and Butler 3 workflows. However, the details depend substantially on how the new Butler works and how much custom information the adapter needs (in particular, it's not clear how to create a DatasetTypeDescriptor, one of the classes the PipelineTask API must return). While we can't implement such an adapter until Butler output of measurements is supported, knowing whether or not an adapter is possible affects the best way to implement MetricTask and its subclasses.

      This issue is to spend some time investigating how Butler 3 and PipelineTask interact, and what is "missing" from the Butler 2 equivalents.

      Note that the middleware development team is also working on something that will allow CmdLineTasks to be used as PipelineTasks. Their solution may or may not satisfy our requirements, for the following reasons:

      • it may only work on CmdLineTask, which the metrics support framework does not use
      • even in the Butler 2 context, MetricTask reports the dataset(s) it needs as input. An adapter that can translate this information to a Butler 3 representation may be simpler than one that needs to get the same information from an external source
      • it is not actually clear whether the middleware team's solution is an adapter in the OOD sense, or whether it only maps PipelineTask.run to Task.runDataRef while leaving other PipelineTask capabilities (i.e., dataset management) unsupported

      This potential conflict/redundancy can be addressed once we have a clearer picture of our own requirements.

            Activity

            krzys Krzysztof Findeisen added a comment -

            The middleware team's approach is to write a separate class adapter for each high-level CmdLineTask (including CoaddTask, ProcessCcdTask, and IsrTask). I gather that subtasks will interact with their parent tasks like before, i.e. they will not be aware of the PipelineTask interface at all. The adapter will request inputs through, e.g., PipelineTask.getInputDatasetTypes, though it is not clear for what purpose; the adapter must be able to function without a QuantumGraphGenerator to match up the inputs and outputs.

            The proposed design is fairly labor-intensive and requires a custom class for each adaptee, making it scale poorly to an indefinite (and hopefully growing) number of MetricTasks. Therefore, it is still worthwhile to research our own solution, particularly since we can use known properties of MetricTask to simplify the design (e.g., no MetricTask will create catalogs, so a specialized adapter need not support catalog schemas).

            krzys Krzysztof Findeisen added a comment - - edited

            TL;DR: it is almost certainly possible to create an adapter as proposed in DMTN-098, but it will require some boilerplate.

            My intent was that the adapter could be implemented as a Python class decorator, something like:

            Gen3Task = PipelineTaskAdapter([extra info])(OriginalTaskClass)
            pipeline.task[1] = Gen3Task
            pipeline.task[1].config.[field] = [configuration info]
            

            I assumed that a Pipeline will be configured using the existing Config framework, much like subtasks are. This is in principle possible: a Pipeline is a list of TaskDef, each of which includes a task type object and a config. However, it's not clear how a Pipeline will be configured in the final system; I suspect the middleware group has not yet thought about a user interface.
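As a rough illustration of the decorator idea above, the following sketch uses only the Python standard library. `FakeMetricTask`, the `adapterInfo` attribute, and the `storageClass="Placeholder"` argument are hypothetical stand-ins invented for this example; a real adapter would have to implement the actual PipelineTask interface, which is not shown here.

```python
class FakeMetricTask:
    """Hypothetical stand-in for a Gen 2 MetricTask (not the real LSST class)."""

    def __init__(self, config=None):
        self.config = config or {}

    def run(self, **inputs):
        # Pretend measurement: report how many inputs were received.
        return len(inputs)


def PipelineTaskAdapter(**extra_info):
    """Class-decorator factory: returns a decorator that wraps a task class.

    ``extra_info`` stands for whatever the Gen 3 side needs but the wrapped
    task cannot supply (an assumption made for illustration).
    """
    def decorate(task_class):
        class Adapted(task_class):
            # Keep the extra information on the generated class.
            adapterInfo = dict(extra_info)

        # Give the generated class a recognizable name.
        Adapted.__name__ = "Gen3" + task_class.__name__
        Adapted.__qualname__ = Adapted.__name__
        return Adapted
    return decorate


# Same shape as the declaration in the comment above.
Gen3Task = PipelineTaskAdapter(storageClass="Placeholder")(FakeMetricTask)
task = Gen3Task(config={"field": "value"})
```

The adapted class still behaves like the original task, so existing Gen 2 callers would not need to change; only the extra class-level information is new.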

            The current placeholder (CmdLineFwk) is configured by passing task names through command line arguments, and requires that all tasks be importable classes. The decorator approach can be made to work with such a system, but it requires promoting the Gen3Task = PipelineTaskAdapter(...) declaration from a local (possibly anonymous) object in a config file to a publicly visible object in some Python module, adding a lot of API clutter.
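The importability constraint can be sketched as follows. This is not CmdLineFwk's actual lookup code: `resolve_task`, `adapt`, and the throwaway module `demo_metrics` are hypothetical names, used only to show why an adapted class must be bound to a public module-level name before a name-based framework can find it.

```python
import importlib
import sys
import types


def resolve_task(dotted_name):
    """Look up a class from a 'module.ClassName' string, as a name-based
    framework might (hypothetical helper)."""
    module_name, _, class_name = dotted_name.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Build a throwaway module so the example is self-contained.
demo = types.ModuleType("demo_metrics")
sys.modules["demo_metrics"] = demo


class OriginalTask:
    pass


def adapt(task_class):
    # Trivial stand-in for PipelineTaskAdapter(...)(task_class).
    return type("Gen3" + task_class.__name__, (task_class,), {})


# An adapted class held only in a local variable cannot be resolved by name...
local_only = adapt(OriginalTask)
try:
    resolve_task("demo_metrics.Gen3OriginalTask")
    found_before = True
except AttributeError:
    found_before = False

# ...until it is promoted to a public, module-level name.
demo.Gen3OriginalTask = local_only
found_after = resolve_task("demo_metrics.Gen3OriginalTask") is local_only
```

Every adapted task would need such a module-level binding, which is the API clutter described above.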

            We can avoid supporting some optional PipelineTask features on the grounds that MetricTask will never need them. However, even a minimal PipelineTask still needs a substantial amount of information that cannot be derived from the original Gen 2 task:

            • input and output dataset types: there must be one for each task input as well as the (single) output, and each must specify three fields: the type name, the data units (e.g., "ccd", "visit", "tract"), and the Butler 3 storage class. Of these three, only the type names can be inferred from the wrapped MetricTask.
              • It may be possible to preemptively include this information in the Gen 2 configs, as appears to have been done in DM-15663. However, I'm not completely sure this won't cause problems when the config is loaded without access to a Butler 3 repository. Update: done in DM-16822, no ill effects so far.
            • PipelineTaskConfig.quantum: this appears to be a replacement for user-provided data IDs. This field must specify a list of data units (presumably not redundant with those in the input spec), and may provide an SQL query that selects the data to be processed. It may be possible to have the latter default to empty, then overwrite it in the user config file. Given that CmdLineFwk bypasses this field, and it only appears in one example (where it is set to None), I don't have a clear picture of how PipelineTaskConfig.quantum.sql is meant to be used. Nate Lust says this field should not be exposed to, let alone set by, the task user.
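To make the gap in the dataset type information concrete, here is a minimal sketch. `DatasetTypeSpec`, `specs_from_task`, the dataset names `inputA` and `inputB`, and the storage class `PlaceholderStorage` are all hypothetical; they are not the Butler 3 `DatasetTypeDescriptor` API. The sketch only models the three fields listed above and shows which of them must come from outside the wrapped task.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DatasetTypeSpec:
    """Hypothetical record of the three fields a dataset type needs."""
    name: str                                     # inferable from the wrapped task
    dataUnits: Optional[Tuple[str, ...]] = None   # must be supplied externally
    storageClass: Optional[str] = None            # must be supplied externally

    def missing(self):
        """Return the names of fields that still need to be provided."""
        return [f for f in ("dataUnits", "storageClass")
                if getattr(self, f) is None]


def specs_from_task(input_names, extras):
    """Combine names inferred from the task with user-supplied extras."""
    specs = []
    for name in input_names:
        info = extras.get(name, {})
        units = info.get("dataUnits")
        specs.append(DatasetTypeSpec(
            name=name,
            dataUnits=tuple(units) if units is not None else None,
            storageClass=info.get("storageClass"),
        ))
    return specs


# Only "inputA" has the extra information; "inputB" is incomplete.
specs = specs_from_task(
    ["inputA", "inputB"],
    {"inputA": {"dataUnits": ["visit", "ccd"],
                "storageClass": "PlaceholderStorage"}},
)
incomplete = {s.name: s.missing() for s in specs if s.missing()}
```

In this sketch the extras could come either from adapter arguments or from config fields, which are the two options weighed in this comment.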

            In short, we may need to treat adapted MetricTasks as if they were classes in their own right, and we need to provide a substantial amount of information either as arguments to the adapter or as mandatory config options. Both add complexity to a feature that was intended to keep the code simple.


              People

              Assignee:
              krzys Krzysztof Findeisen
              Reporter:
              krzys Krzysztof Findeisen
              Watchers:
              John Swinbank, Krzysztof Findeisen