We when last discussed the problem of PipelineTask quanta that fail because they have no work to do (the most common kind of "expected failure"), I assumed the problem was much harder than it actually is, because I was worried that workflow systems needed to track our files, and need some kind of dummy output dataset to be present (or a file with a particular name to appear via some other means) in order to prevent them holding or trimming downstream DAG vertices. That pushed the discussion towards solutions that would write some dataset in the case of failures, and perhaps having dataset types or storage classes whose instances could actually be exactly one of a few sub-types (a concept often referred to as "sum types" or "stateful enums" in the programming languages that support them).
But that limitation does not exist; the HTCondor BPS backend doesn't care about what files we write, and I'm hoping no workflow system will ever need to, given that we plan to read and write directly from object stores.
That opens up solutions in which we don't write files or create datasets in these cases at all. That avoids cluttering the filesystem and database when there's no information for the sentinal dataset to convey, and it's very naturally expressed in Python (a language that does not have sum types, and won't have the tooling to emulate them well until 3.10) by having the PipelineTask.run return a pipe.base.Struct with None values (or that just lacks attributes for some output connections). The base PipelineTask.runQuantum would simply not put things missing from the returned Struct. If desired, we could add some flag options to the connections classes to control when this behavior is desirable vs. considering it a logic error if some output connection attribute is missing.
Of course, when one quantum fails to produce a dataset that was expected, this creates problems downstream. I think there are three possibilities (I use "harness" below to mean "process that runs a quantum", so that's pipetask, today):
- The downstream task thinks missing inputs ought to be impossible, and genuinely fails; this really means that it was user error to define the Pipeline this way, and a true failure (task or harness raises an exception, process exits with nonzero exit, workflow system sees it as a failure) informs the user of this.
- The downstream task also has nothing to do, and should also return a Struct with None or missing attributes for its output connections. From the perspective of the workflow system, this is just an unusually fast "success"; we just need to be careful that higher-level tooling that analyzes resource consumption can identify these cases. It probably makes sense for this to be something a PipelineTask can declare to the harness in its connection class ("if this input doesn't exist at runtime, none of my outputs will"), allowing the harness to handle this case without even running the task. It might even be a reasonable default for input connections with multiple=False.
- The downstream task can do what it needs to do even if some of its inputs are not present (though it may still fail if too many of them are missing). At present, the only thing stopping concrete PipelineTasks from supporting missing inputs is the fact that the harness currently checks for the expected inputs and fails early when any are missing. But this check is going to be a waste of time with limited registry batch anyway (because all datasets will always appear to be present), and I think it's best to remove it and let the task handle missing inputs itself, to keep behavior similar in different execution contexts. We may want to make the ButlerQuantumContext object actually used by PipelineTasks return None rather than raise in this case to make this easier, though.
The last problem (that I see) created by this scenario is that it's a bit tricky to distinguish from just the data repository state between the case where a dataset doesn't exist because there was nothing for the quantum to do and the case where the quantum actually failed (or was pre-empted). We could partially resolve this by having the harness record in the <label>_metadata datasets whether the task raise an exception or finished normally; that could naturally extend to saving much richer provenance information in those files in the future (information we could ETL into the Registry database later for easier access; this sidesteps the problem of trying to write provenance directly to the database in a way that scales). In particular, when "retrying" either in QG generation or execution,
- We can skip any quantum for which all outputs exist;
- We can usually assume any quantum for which there is no <label>_metadata dataset needs to be rerun, possibly after clobbering any outputs that are present, because it was brought down hard and fast, but repeatable algorithmic and/or data problems will usually manifest as a Python exception the harness can catch;
- When the <label>_metadata file does exist but some outputs are missing, we can consult the metadata file itself to see directly whether this is a nothing-to-do case that we can skip, a failure that needs human inspection, or a predictable external failure that merits running again (like the harness catching a preemption signal).
The big change here, then, is transforming those metadata files into more than just a grab-bag of task-specific key-value pairs, or perhaps subsuming them into or (adding alongside them) a more structured provenance dataset.