Nate Lust and I had a very productive discussion this morning about future directions for PipelineTask and Pipeline. I'm jotting down the key points of that discussion here to help us remember them, and at some point we should flesh them out in a technote and get broader feedback.
- Inverting the relationship between tasks and connections by introducing a QuantumHarness class that replaces the connections class, implements runQuantum, and calls the run method of a regular Task.
  - PipelineTask and PipelineTaskConnections could probably remain as a legacy interface, and possibly remain the only public interface (with QuantumHarness at least starting as a behind-the-scenes class, and maybe only ever being used directly in advanced cases).
  - Connections would be defined in the harness's own config class, which would hold the task's config class (maybe via a RegistryField?) and have special fields for defining connections in the common ways.
  - The actual information needed to build pipeline and task graphs would go through QuantumHarness APIs, instead of being read from its config directly at all.
  - Some task config fields would need to be annotated as affecting the harness, while the rest could be assumed not to affect the graphs.
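A minimal sketch of that inversion might look like the following. All of the concrete names here (ExampleTask, the inputs/outputs mappings, the dict-based quantum) are illustrative stand-ins, not the real pipe_base API; only the QuantumHarness/runQuantum/run relationship is the point.

```python
class ExampleTask:
    """A regular Task with a plain run() method and no knowledge of
    connections or quanta (hypothetical stand-in)."""

    def run(self, exposure):
        # Pretend to process the input and return a struct-like result.
        return {"processed": f"processed-{exposure}"}


class QuantumHarness:
    """Replaces the connections class: it owns the connection
    definitions, implements runQuantum, and delegates the science code
    to a regular Task's run() method."""

    # Connections would really be defined on the harness's own config
    # class; simple class attributes stand in for that here.
    inputs = {"exposure": "postISRCCD"}
    outputs = {"processed": "calexp"}

    def __init__(self, task):
        self.task = task

    def runQuantum(self, quantum_inputs):
        # Map connection dataset types to run() keyword arguments,
        # call the task, and map the result back onto the outputs.
        kwargs = {
            name: quantum_inputs[dataset_type]
            for name, dataset_type in self.inputs.items()
        }
        result = self.task.run(**kwargs)
        return {self.outputs[name]: value for name, value in result.items()}


harness = QuantumHarness(ExampleTask())
outputs = harness.runQuantum({"postISRCCD": "exp1"})
```

The task class stays completely ignorant of connections, which is the heart of the proposal.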
- Adding boolean feature flags to the combination of a dataset type and a RUN.
  - For example, a postISRCCD could have a "flat-fielded" feature, directly corresponding (in this and many cases, but not always) to IsrTask's doFlat config field.
  - This would help us use dataset types as interfaces between tasks, rather than forcing one task to depend too heavily on the configuration and implementation of another.
  - Configured tasks could specify features that must or must not be present on their input connections.
  - Configured tasks might be able to elide the dataset type name of a connection entirely if the features, storage class, and dimensions it provides are sufficient for a unique match to an output connection or a registry dataset type.
  - Features that must or must not be present could also be supplied to butler searches for datasets, causing datasets that do not satisfy the feature specification to be ignored.
  - Features must be intrinsic to datasets; they should not be used to try to capture a dataset's relationship to other datasets (e.g. whether it is the final version of something). That sounds like it would be useful, but every version of it we could come up with proved unsound, so we are better off continuing to use dataset type names for that.
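To make the feature-flag idea concrete, here is a toy sketch of how a (dataset type, RUN) feature table and a must/must-not match might behave. None of these names exist in the real butler API; the `satisfies` function and the `features` table are purely hypothetical.

```python
# Hypothetical table of boolean features attached to a (dataset type,
# RUN) combination, e.g. whether a postISRCCD was flat-fielded.
features = {
    ("postISRCCD", "run/a"): {"flat-fielded": True},
    ("postISRCCD", "run/b"): {"flat-fielded": False},
}


def satisfies(dataset_type, run, required=(), rejected=()):
    """Return True if the dataset's features include everything in
    `required` and nothing in `rejected` (hypothetical helper)."""
    flags = features.get((dataset_type, run), {})
    return all(flags.get(f) for f in required) and not any(
        flags.get(f) for f in rejected
    )


# A configured task whose input connection requires flat-fielded
# postISRCCDs would match run/a but ignore run/b; a butler search with
# the same feature specification would behave the same way.
matches = [
    run
    for run in ("run/a", "run/b")
    if satisfies("postISRCCD", run, required=("flat-fielded",))
]
```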
- Major public dataset types produced by a pipeline should be documented directly in the pipeline, with those definitions checked against its connections.
- Pipelines should define named groups of dataset types, similar to our named subsets of task labels, and we should encourage users to refer to these groups instead of task labels.
  - To that end, we should consider mechanisms for generating numbered-step subsets on the fly to satisfy certain criteria. New QG generation algorithms may make some numbered-step subsets entirely unnecessary in the future, but we will still need them for (at least) large-scale productions in which multiple submissions of the same task subsets are necessary.
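As a strawman for the last two points, pipeline YAML might one day document dataset types and group them by name. This syntax is entirely made up (the `datasetTypes` and `datasetSubsets` sections do not exist in the current pipeline schema); it is only meant to show the shape of the idea.

```yaml
# Hypothetical pipeline YAML fragment (not the current schema).
datasetTypes:
  calexp:
    doc: "Per-detector calibrated exposure; checked against connections."
datasetSubsets:
  coaddInputs:
    datasetTypes: [calexp, skyCorr]
    description: "Dataset types users need before coaddition starts."
```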
Nate Lust, did I forget anything?