  Data Management / DM-7490

Understand Production requirements to Supertask

    Details

    • Type: Story
    • Status: Done
    • Priority: Major
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: supertask
    • Labels:
      None
    • Story Points:
      4
    • Sprint:
      DB_F16_9, DB_F16_10
    • Team:
      Data Access and Database

      Description

      A number of new requirements for Supertask come from the Production system currently being designed by NCSA (DM-7678).

      Understand these requirements and think about how they map onto a new interface between Supertask and the Production system.

        Issue Links

          Activity

          Andy Salnikov added a comment - edited

          Last week we had a meeting with the NCSA folks to discuss their proposed workflow and how it interacts with the pipeline; a diagram of the interaction is attached to DM-7678. I also had a short session with K-T to improve my understanding of the vocabulary and ideas. Here is a brief summary of what I remember from those two discussions.

          • one of the main goals of all this is to automate workflow configuration and reduce the need for humans to enter the many parameters used by the workflow system
          • the obvious approach for that is to ask the pipeline itself for things that the pipeline now "knows" implicitly, for example which inputs are needed for a given execution, how many resources are needed to run the job, etc.
          • accordingly, the Production workflow is going to interact with the pipeline in the following ways:
            1. it will ask the pipeline for all necessary inputs so that it can stage input data and bring it to the execution location
            2. it will ask the pipeline for (an estimate of) the resources needed during execution, such as memory, CPU, etc.
            3. it will ask the pipeline about its tasks and their corresponding inputs/outputs so that it can build an execution graph
            4. it will schedule and run the individual tasks in the correct order

          The existing Supertask interface (the execute() method or whatever replaces it) probably covers the last point to some extent, depending on other ideas. The other three points will need to be defined and implemented by the pipeline. In the workflow context the first three steps are performed by a "non-executable" pipeline and the last step is done by an "executable" pipeline; the non-executable pipeline is a sort of preparation (or part of the job configuration) for the actual pipeline execution. The two are thought to be instances of the same pipeline, so they should be (almost) identical and consistent even though they may run on different hosts/environments. (Potentially one could see the first three steps as part of configuration and pass all of the configuration information to the last step; this would avoid repetition and possible inconsistency.) A rough sketch of what such an interface could look like is given below.
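          To make the four points concrete, here is a minimal, hypothetical sketch. None of these class or method names (PipelineIntrospection, define_inputs, estimate_resources, define_quanta, run_quantum) are settled SuperTask API; they only illustrate the shape of the interface the workflow system would call.

          from abc import ABC, abstractmethod


          class PipelineIntrospection(ABC):
              """What the "non-executable" pipeline would answer (points 1-3)."""

              @abstractmethod
              def define_inputs(self, config):
                  """Return the dataset types/data IDs needed, so the workflow
                  can stage input data to the execution location (point 1)."""

              @abstractmethod
              def estimate_resources(self, config):
                  """Return an estimate of memory, CPU, etc. needed during
                  execution (point 2)."""

              @abstractmethod
              def define_quanta(self, config, input_data_ids):
                  """Split the work into per-task units with their inputs and
                  outputs so the workflow can build an execution graph (point 3)."""


          class PipelineExecution(ABC):
              """What the "executable" pipeline does (point 4)."""

              @abstractmethod
              def run_quantum(self, quantum):
                  """Execute one unit of work scheduled by the workflow system."""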

          One thing that was not entirely clear to me is whether the diagram applies to a single execution of a pipeline or whether it also covers repeated executions on different datasets. If the latter, how is that specified and what is responsible for splitting the input data into smaller subsets? This may be related to the pre-configuration part, as "other" pipeline inputs may depend on data IDs (not sure whether that makes sense to me though).

          Andy Salnikov added a comment

          A few more things that I was thinking about:

          The idea of a "non-executable" pipeline implies that the pipeline can be instantiated and configured in a "restricted" environment which lacks most of the regular inputs (e.g. no databases, no butler, etc.). There are a couple of interesting issues related to that:

          • how do we ensure that the non-executable pipeline is configured in exactly the same way as the executable one?
          • is there really a need to instantiate the whole pipeline if we only need a small subset of its functionality?

          Having all those configuration-related methods (covering points 1 and 2 above) in the supertask API makes it implicitly dependent on the workflow used by production, and I'm not sure that other activators will use that part of the supertask interface. If questions 1 and 2 could be answered by something other than the supertask, then we probably would not even need to instantiate a non-executable pipeline. I think information like that could be made part of the supertask configuration, so that the workflow could analyze just the configuration (and overrides) and decide for itself what inputs are needed, for example. The configuration API could also be extended if some dynamic behavior is needed for this sort of introspection. A schematic sketch of that idea follows.
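          As a schematic illustration of that idea (not the real SuperTask or pex_config classes, just plain Python standing in for the configuration system), the inputs could be declared as configuration data that the workflow reads without instantiating any pipeline:

          from dataclasses import dataclass, field
          from typing import List


          @dataclass
          class TaskIOConfig:
              """Hypothetical per-task configuration listing dataset types."""
              input_dataset_types: List[str] = field(default_factory=list)
              output_dataset_types: List[str] = field(default_factory=list)


          # The workflow inspects the configuration (plus overrides) directly;
          # the dataset type names below are only examples.
          config = TaskIOConfig(
              input_dataset_types=["calexp", "src"],
              output_dataset_types=["deepCoadd"],
          )
          print("inputs to stage:", config.input_dataset_types)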

          Resource estimation looks to me like the most complicated part of the interaction. From what I understand, the resource estimate should be more or less precise (an order-of-magnitude guess is probably not going to help anybody). It's clear that the resource use of any task is related to the size of the data that has to be processed (the size of images or the number of rows selected from the database), but the non-executable pipeline does not know anything about the actual size of the data being processed, and I'm not sure that the metadata service can give even an approximate answer about data size. One consideration is that the workflow has to be able to override any resource requirement, even when the pipeline can provide a guess, in case that guess is wrong. This implies that the workflow has to have some sort of per-task configuration override for resources. Given all that, it may be easier and more reliable (at least initially) to provide resource requirements also at the level of task configuration and not as a task API; a sketch of that is given below.
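          Here is a sketch of what per-task resource requirements could look like as configuration rather than API. The field names are illustrative only, and the override logic just shows that a workflow-level value wins over the pipeline's guess:

          from dataclasses import dataclass


          @dataclass
          class TaskResourceConfig:
              """Hypothetical per-task resource requirements."""
              max_memory_mb: int = 2048   # rough per-process memory ceiling
              n_cores: int = 1            # cores requested for one quantum
              wall_time_min: int = 30     # expected wall-clock time per quantum


          # A workflow override takes precedence over the pipeline's estimate.
          pipeline_estimate = TaskResourceConfig(max_memory_mb=4096)
          operator_override = {"max_memory_mb": 8192}
          effective = {**pipeline_estimate.__dict__, **operator_override}
          print(effective)  # {'max_memory_mb': 8192, 'n_cores': 1, 'wall_time_min': 30}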

          Point 3 above is an interesting case; it may look similar to point 1 but it is more involved. The DAG that the workflow needs to build should take advantage of possible data parallelism, so the whole dataset (input/output data IDs) has to be split into the smallest units of work ("quanta"), and this splitting can only reasonably be done by pipeline code; there is probably no way to infer it from configuration (at least it seems hard to me with my near-zero understanding of all the related concepts). To me this implies that the pipeline has to be instantiated and correctly configured to do point 3, which means we cannot totally avoid instantiating a non-executable pipeline (unless point 3 can be done by the executable pipeline). One more thing about this: it looks to me like the result of point 3 could be used (after proper reduction) to answer the question from point 1, so if we cannot eliminate the non-executable pipeline we could probably limit the exposed set of methods in its API to a smaller number. A rough sketch of that reduction follows.
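          To show how the point-3 answer could be reduced to the point-1 answer, here is a rough sketch; the Quantum class and field names are hypothetical, not existing SuperTask API. The external inputs that need staging are everything consumed by some quantum but produced by none:

          from dataclasses import dataclass, field
          from typing import Dict, List, Set, Tuple

          # A data ID is represented here as a tuple of (key, value) pairs,
          # e.g. (("visit", "1234"), ("ccd", "42")); this is only for illustration.
          DataId = Tuple[Tuple[str, str], ...]


          @dataclass
          class Quantum:
              """Smallest unit of work: one task invocation with its inputs/outputs."""
              task_name: str
              inputs: Dict[str, List[DataId]] = field(default_factory=dict)
              outputs: Dict[str, List[DataId]] = field(default_factory=dict)


          def all_external_inputs(quanta: List[Quantum]) -> Dict[str, Set[DataId]]:
              """Reduce per-quantum inputs (point 3) to the overall inputs that
              the workflow must stage (point 1)."""
              produced: Dict[str, Set[DataId]] = {}
              for q in quanta:
                  for dataset_type, ids in q.outputs.items():
                      produced.setdefault(dataset_type, set()).update(ids)
              needed: Dict[str, Set[DataId]] = {}
              for q in quanta:
                  for dataset_type, ids in q.inputs.items():
                      external = set(ids) - produced.get(dataset_type, set())
                      if external:
                          needed.setdefault(dataset_type, set()).update(external)
              return needed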

          Andy Salnikov added a comment

          I think I managed to identify the main issues; resolving them will need additional work, of course. I'm going to close this ticket (no review needed) and switch to hacking on a prototype SuperTask interface that we can discuss with the workflow folks.


            People

            • Assignee:
              Andy Salnikov
            • Reporter:
              Fritz Mueller
            • Watchers:
              Andy Salnikov, Fritz Mueller, Gregory Dubois-Felsmann, Hsin-Fang Chiang, Rob Kooper
            • Votes:
              0

              Dates

              • Created:
              • Updated:
              • Resolved:
