Data Management / DM-29761

Improving error handling (and nothing-to-do quanta) in PipelineTask execution


    Details

    • Story Points:
      6
    • Team:
      Architecture
    • Urgent?:
      No

      Description

      When we last discussed the problem of PipelineTask quanta that fail because they have no work to do (the most common kind of "expected failure"), I assumed the problem was much harder than it actually is, because I was worried that workflow systems needed to track our files, and needed some kind of dummy output dataset to be present (or a file with a particular name to appear via some other means) in order to prevent them from holding or trimming downstream DAG vertices. That pushed the discussion towards solutions that would write some dataset in the case of failures, and perhaps towards dataset types or storage classes whose instances could actually be exactly one of a few sub-types (a concept often referred to as "sum types" or "stateful enums" in the programming languages that support them).

      But that limitation does not exist; the HTCondor BPS backend doesn't care about what files we write, and I'm hoping no workflow system will ever need to, given that we plan to read and write directly from object stores.

      That opens up solutions in which we don't write files or create datasets in these cases at all. That avoids cluttering the filesystem and database when there's no information for the sentinel dataset to convey, and it's very naturally expressed in Python (a language that does not have sum types, and won't have the tooling to emulate them well until 3.10) by having PipelineTask.run return a pipe.base.Struct with None values (or one that simply lacks attributes for some output connections). The base PipelineTask.runQuantum would simply not put outputs that are missing from the returned Struct. If desired, we could add some flag options to the connections classes to control when this behavior is desirable vs. considering it a logic error if some output connection attribute is missing.
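
      A minimal sketch of that put-skipping behavior (this is not the real lsst.pipe.base API; Struct and the put callable are simplified stand-ins):

```python
class Struct:
    """Simplified stand-in for lsst.pipe.base.Struct: a bag of attributes."""

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)


def put_run_outputs(result, output_connections, put):
    """Put only the outputs the task actually produced.

    Outputs that are None or absent from the returned Struct are silently
    skipped, so no sentinel dataset is ever written for nothing-to-do cases.
    Returns the names of the connections that were written.
    """
    written = []
    for name in output_connections:
        value = getattr(result, name, None)
        if value is not None:
            put(name, value)
            written.append(name)
    return written
```

      A task that had nothing to do would return e.g. Struct(catalog=None), and the harness would write nothing for that connection.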

      Of course, when one quantum fails to produce a dataset that was expected, this creates problems downstream. I think there are three possibilities (I use "harness" below to mean "the process that runs a quantum"; today that's pipetask):

      • The downstream task thinks missing inputs ought to be impossible, and genuinely fails; this really means that it was user error to define the Pipeline this way, and a true failure (task or harness raises an exception, process exits with nonzero exit, workflow system sees it as a failure) informs the user of this.
      • The downstream task also has nothing to do, and should also return a Struct with None or missing attributes for its output connections. From the perspective of the workflow system, this is just an unusually fast "success"; we just need to be careful that higher-level tooling that analyzes resource consumption can identify these cases. It probably makes sense for this to be something a PipelineTask can declare to the harness in its connection class ("if this input doesn't exist at runtime, none of my outputs will"), allowing the harness to handle this case without even running the task. It might even be a reasonable default for input connections with multiple=False.
      • The downstream task can do what it needs to do even if some of its inputs are not present (though it may still fail if too many of them are missing). At present, the only thing stopping concrete PipelineTasks from supporting missing inputs is the fact that the harness currently checks for the expected inputs and fails early when any are missing. But this check is going to be a waste of time with limited-registry batch execution anyway (because all datasets will always appear to be present), and I think it's best to remove it and let the task handle missing inputs itself, to keep behavior similar in different execution contexts. We may want to make the ButlerQuantumContext object actually used by PipelineTasks return None rather than raise in this case to make this easier, though.
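
      The second bullet's "if this input doesn't exist at runtime, none of my outputs will" declaration could look something like the following harness-side check. The flag name and classes here are hypothetical illustrations, not the real connections API:

```python
from dataclasses import dataclass


@dataclass
class InputConnection:
    """Toy stand-in for a PipelineTask input connection declaration."""

    name: str
    multiple: bool = False
    # Hypothetical flag: if this input is missing at runtime, the quantum
    # has nothing to do and the harness may skip running the task entirely.
    skip_quantum_if_missing: bool = False


def harness_should_run(connections, existing_inputs):
    """Decide whether to run the task at all, given which inputs exist.

    Returning False means the harness records an unusually fast
    nothing-to-do "success" without invoking PipelineTask.run.
    """
    for conn in connections:
        if conn.skip_quantum_if_missing and conn.name not in existing_inputs:
            return False
    return True
```

      As the description suggests, skip_quantum_if_missing=True might even be a reasonable default for connections with multiple=False.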

      The last problem (that I see) created by this scenario is that it's a bit tricky to distinguish, from the data repository state alone, between the case where a dataset doesn't exist because there was nothing for the quantum to do and the case where the quantum actually failed (or was pre-empted). We could partially resolve this by having the harness record in the <label>_metadata datasets whether the task raised an exception or finished normally; that could naturally extend to saving much richer provenance information in those files in the future (information we could ETL into the Registry database later for easier access; this sidesteps the problem of trying to write provenance directly to the database in a way that scales). In particular, when "retrying" either in QG generation or execution:

      • We can skip any quantum for which all outputs exist;
      • We can usually assume any quantum for which there is no <label>_metadata dataset needs to be rerun, possibly after clobbering any outputs that are present, because it was brought down hard and fast, but repeatable algorithmic and/or data problems will usually manifest as a Python exception the harness can catch;
      • When the <label>_metadata file does exist but some outputs are missing, we can consult the metadata file itself to see directly whether this is a nothing-to-do case that we can skip, a failure that needs human inspection, or a predictable external failure that merits running again (like the harness catching a preemption signal).
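
      The three retry rules above could be sketched as a single classification step. The status values and metadata structure here are assumptions about what the harness might record, not an existing format:

```python
def classify_quantum(outputs_exist, metadata):
    """Decide what to do with a previously attempted quantum on retry.

    outputs_exist: dict mapping output dataset name -> bool (present?).
    metadata: None if the <label>_metadata dataset was never written,
        else a dict with a hypothetical 'status' key recorded by the harness.
    """
    if all(outputs_exist.values()):
        return "skip"       # all outputs exist; nothing to redo
    if metadata is None:
        return "rerun"      # brought down hard and fast; clobber and retry
    status = metadata.get("status")
    if status == "no_work":
        return "skip"       # nothing-to-do case; not a failure at all
    if status == "preempted":
        return "rerun"      # predictable external failure; run again
    return "inspect"        # unexpected failure; needs human attention
```
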

      The big change here, then, is transforming those metadata files into more than just a grab-bag of task-specific key-value pairs, or perhaps subsuming them into (or adding alongside them) a more structured provenance dataset.

        Attachments

          Issue Links

            Activity

            ktl Kian-Tat Lim added a comment -

            Storing this kind of execution result information within the *_metadata files seems very compatible with their intended usage.  Some key/value pairs should already be harness-generic rather than task-specific.  Specifying the usage of additional keys (or adding additional structure) would be fine.

            tjenness Tim Jenness added a comment -

            If you ask whether butler has a dataset it will do an existence check on the object store even if registry says the file should be there.

            If a node is pre-empted the workflow system should not start downstream jobs so I'm not sure how pre-emption is relevant here. The job never completes and has to be restarted. This is not the same as the job completing with bad exit status from the harness.

            jbosch Jim Bosch added a comment -

            Nate Lust has convinced me that using one dataset (the _metadata or _quantum dataset) to provide essential status information about another (the possibly-optional dataset) is not a good idea, and I've created DM-29818 to capture our alternative (which I think nicely solves some other problems, too).

            Saving as-run quanta to files is still a good idea for other reasons, and I've created DM-29814 to start that process. The question of whether to save those to the existing _metadata datasets or create a new _quantum one is still a relevant one, but that ticket will focus on the Python representation first.

            With those alternatives in place, I'm closing this ticket as Won't Fix, as I think we have other ones that better address the problems it attempted to solve.

            jbosch Jim Bosch added a comment -

            I'm back to thinking that this is a good idea, and in fact the best approach to our most immediate problem; DM-29818 solves one that is really quite different (and lower priority).  See the last two sections of https://confluence.lsstcorp.org/display/DM/Saving+per-Quantum+provenance+and+propagating+nothing-to-do+cases%2C+and+The+Future for details.

            Nate Lust, I'd like your feedback on that in particular, as it was the earlier discussion with you that made me change my mind about this ticket before.  I'm happy to talk live as well (even today, though we've got a toddler at home and scheduling may be a bit difficult).

            jbosch Jim Bosch added a comment - - edited

            I'm starting in on this now.  I think I can do it without any QuantumDirectory stuff if I make the assumption throughout that a PipelineTask's metadata dataset will always be written (even if other datasets are not) if it had nothing to do, while failures or preemptions will cause the metadata not to be written.  I think that is the case already.  But I'll hide this implementation behind a "did this quantum run successfully" interface that we could reimplement via QuantumDirectory (making _metadata less special) in the future.

            I've also realized that this possibility from the original ticket description:

            • The downstream task thinks missing inputs ought to be impossible, and genuinely fails; this really means that it was user error to define the Pipeline this way, and a true failure (task or harness raises an exception, process exits with nonzero exit, workflow system sees it as a failure) informs the user of this.

            is actually invalid: the downstream task cannot be allowed to distinguish between "the upstream task that produced my input did not produce that input" and "my input did not exist at QG generation time and hence this quantum shouldn't even exist".  In other words, we must prune the downstream quantum when the input does not exist unless the task says it can proceed without it.  Unfortunately, we can't really make QGs that involve inputs that don't exist, unless they're prerequisite inputs.  But that's a known problem with a different (difficult, long-term) solution, and I don't think we need it fixed for this ticket - I'll just be sure on this ticket to make the interfaces reflect consistent behavior for QG-gen missing inputs vs. execution-time missing inputs, and raise NotImplementedError for the case we can't actually implement consistently (but don't need) yet: "run this quantum even if no inputs for this connection exist".


              People

              Assignee:
              jbosch Jim Bosch
              Reporter:
              jbosch Jim Bosch
              Watchers:
              Andy Salnikov, Jim Bosch, John Parejko, Kian-Tat Lim, Michelle Gower, Nate Lust, Tim Jenness, Yusra AlSayyad
              Votes:
              0

                Dates

                Created:
                Updated:

                  Jenkins

                  No builds found.