A few more things that I was thinking about:
The idea of a "non-executable" pipeline implies that a pipeline can be instantiated and configured in a "restricted" environment which lacks most of the regular inputs (e.g. no databases, no butler, etc.). There are a couple of interesting issues related to that:
- how do we ensure that a non-executable pipeline is configured in exactly the same way as an executable one
- is there really a need to instantiate the whole pipeline if we only need a small subset of its functionality
Having all those configuration-related methods (covering points 1 and 2 above) in the supertask API makes it implicitly dependent on the workflow used by production; I'm not sure that other activators would use that part of the supertask interface. If questions 1 and 2 could be answered by something other than the supertask, then we probably wouldn't even need to instantiate a non-executable pipeline. I think that sort of information could be made a part of the supertask configuration, so that the workflow could analyze just the configuration (and overrides) and decide for itself, for example, what inputs are needed. The configuration API could probably also be extended if some dynamic behavior is needed for this sort of introspection.
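To make the idea concrete, here is a minimal sketch (plain Python, not the actual supertask or pex_config API) of what it could look like if each task's configuration declared its input/output dataset types, so a workflow could work out the external inputs from configuration alone without instantiating anything. The names `input_dataset_types`, `output_dataset_types`, and `required_inputs` are hypothetical.

```python
# Sketch only: hypothetical per-task config carrying dataset-type declarations.
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class TaskConfigSketch:
    """Stand-in for a per-task configuration object."""
    # Dataset types this task reads, declared statically so an activator or
    # workflow can introspect them without a butler or a pipeline instance.
    input_dataset_types: List[str] = field(default_factory=list)
    # Dataset types this task produces.
    output_dataset_types: List[str] = field(default_factory=list)


def required_inputs(configs: List[TaskConfigSketch]) -> Set[str]:
    """External inputs of the whole pipeline: everything consumed that is not
    produced by an earlier task in the chain."""
    produced: Set[str] = set()
    needed: Set[str] = set()
    for cfg in configs:
        needed.update(t for t in cfg.input_dataset_types if t not in produced)
        produced.update(cfg.output_dataset_types)
    return needed


# Two-task example where the second task consumes the first task's output.
configs = [
    TaskConfigSketch(input_dataset_types=["raw"], output_dataset_types=["calexp"]),
    TaskConfigSketch(input_dataset_types=["calexp"], output_dataset_types=["src"]),
]
print(required_inputs(configs))  # {'raw'}
```

Overrides would be applied to these config objects before the workflow reads them, which is what would keep the "restricted" view consistent with the executable one.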
Resource estimation looks to me like the most complicated part of the interaction. From what I understand, resource estimation needs to be more or less precise (an order-of-magnitude guess is probably not going to help anybody). It's clear that resource use for any task is related to the size of the data that has to be processed (size of images or number of rows selected from a database), but a non-executable pipeline does not know anything about the actual size of the data being processed, and I'm not sure that the metadata service can give even an approximate answer about data size. One consideration is that the workflow has to be able to override any resource requirement in case the pipeline's guess is wrong, which implies that the workflow needs some sort of per-task configuration override for resources. Given all that, it may be easier and more reliable (at least initially) to provide resource requirements at the level of task configuration as well, rather than as a task API.
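As a rough illustration (all names are invented, not part of any existing API), per-task resource requirements in configuration with a workflow-side override could be as simple as a field-by-field merge where the workflow's values win:

```python
# Sketch only: hypothetical resource config plus workflow override.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResourceConfig:
    """Resource guess attached to a task's configuration; None means "no estimate"."""
    max_memory_mb: Optional[int] = None
    max_cores: Optional[int] = None
    max_walltime_min: Optional[int] = None


def effective_resources(task_guess: ResourceConfig,
                        workflow_override: ResourceConfig) -> ResourceConfig:
    """Workflow-provided values take precedence over the task's own guess."""
    return ResourceConfig(
        max_memory_mb=workflow_override.max_memory_mb or task_guess.max_memory_mb,
        max_cores=workflow_override.max_cores or task_guess.max_cores,
        max_walltime_min=workflow_override.max_walltime_min or task_guess.max_walltime_min,
    )


# The task config guesses 4 GB; the workflow operator knows better and overrides it.
guess = ResourceConfig(max_memory_mb=4096, max_cores=1)
override = ResourceConfig(max_memory_mb=16384)
print(effective_resources(guess, override))
# ResourceConfig(max_memory_mb=16384, max_cores=1, max_walltime_min=None)
```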
Point 3 above is an interesting case; it may look similar to point 1 but it's more involved. The DAG that the workflow needs to build should take advantage of possible data parallelism, so the whole dataset (input/output data IDs) has to be split into the smallest units of work ("quanta"), and this splitting can only reasonably be done by pipeline code; there is probably no way to infer it from configuration alone (at least it seems hard to me, with my near-zero understanding of all the related concepts). To me this implies that the pipeline has to be instantiated and correctly configured to do point 3, which means we cannot totally avoid instantiating a non-executable pipeline (unless point 3 can be done by an executable pipeline). One more thing about this: it looks to me like the result of point 3 could be used (after proper reduction) to answer the question from point 1, so if we cannot eliminate the non-executable pipeline we could probably at least limit its API to a smaller set of exposed methods.
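A very schematic sketch of that point-3-to-point-1 reduction, assuming the pipeline (instantiated in non-executable mode) can split the full set of data IDs into quanta; collapsing the quanta then yields the pipeline-level input/output dataset types without a separate API call. `Quantum`, `define_quanta`, and the per-visit splitting are hypothetical and only stand in for real pipeline logic:

```python
# Sketch only: hypothetical quanta definition and its reduction to dataset types.
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class Quantum:
    """Smallest unit of work: the data IDs it reads and writes, per dataset type."""
    inputs: Dict[str, List[dict]]
    outputs: Dict[str, List[dict]]


def define_quanta(visits: List[int]) -> List[Quantum]:
    """Toy per-visit splitting; the real splitting can only come from pipeline code."""
    return [
        Quantum(inputs={"raw": [{"visit": v}]},
                outputs={"calexp": [{"visit": v}]})
        for v in visits
    ]


def reduce_to_dataset_types(quanta: List[Quantum]) -> Tuple[Set[str], Set[str]]:
    """Collapse quanta into pipeline-level external inputs and outputs
    (the kind of answer point 1 needs)."""
    ins: Set[str] = set()
    outs: Set[str] = set()
    for q in quanta:
        ins.update(q.inputs)
        outs.update(q.outputs)
    return ins - outs, outs


quanta = define_quanta([903334, 903336])
print(reduce_to_dataset_types(quanta))  # ({'raw'}, {'calexp'})
```

If something like this works, the non-executable pipeline might only need to expose the quanta-defining method, and everything else could be derived from its result or from configuration.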