Details
- Type: Story
- Status: Done
- Resolution: Done
- Fix Version/s: None
- Component/s: ctrl_mpexec, pipe_base
- Labels:
- Story Points: 4
- Epic Link:
- Sprint: DB_F20_06
- Team: Data Access and Database
- Urgent?: No
Description
One of the problems we are having with running pipelines on Google is that they support pre-emption of compute nodes. At the moment the node is simply shut down and, depending on how far the pipeline had got, we can end up in an inconsistent state. When we run tens of thousands of jobs, some of them are going to be pre-empted during the writes to the datastore with no chance of rolling back any transactions. Sometimes only some of the writes have completed, sometimes all of them have.
Google do support pre-emption scripts, which they can launch 30 seconds before the node is killed. In theory such a script could send a SIGINT to any running pipetask commands and give them 30 seconds to clean up before being yanked.
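As a rough illustration of what such a pre-emption script might do, here is a minimal sketch in Python. It assumes the script has already been handed the process IDs of the running pipetask commands; how those are discovered is left out. The function name `interrupt_processes` and its injectable `kill` parameter are hypothetical and are not part of any Google or LSST tooling.

```python
import os
import signal


def interrupt_processes(pids, kill=os.kill):
    """Send SIGINT to each process ID and return the ones that were signalled.

    Hypothetical helper for a pre-emption script. ``kill`` defaults to
    ``os.kill`` and is injectable only so the logic can be exercised
    without touching real processes.
    """
    signalled = []
    for pid in pids:
        try:
            # SIGINT is what CTRL-C delivers; by default Python turns it
            # into a KeyboardInterrupt in the main thread of the target.
            kill(pid, signal.SIGINT)
        except ProcessLookupError:
            # The process has already exited, so there is nothing to clean up.
            continue
        signalled.append(pid)
    return signalled
```

The script only delivers the signal. Whether the 30 seconds is enough for the clean-up is up to the process receiving it.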
What I'm wondering is whether we could wrap the code that does all the butler puts in a try block catching KeyboardInterrupt. Then it would know whether it had done partial puts and could roll those back. We probably shouldn't be in the game of trying to start a timer and seeing if we can squeeze in all the puts before the 30 seconds are up.
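The idea could look something like the sketch below. It is not the real Butler API: `InMemoryDatastore`, its `put` and `remove` methods, and `put_all_or_nothing` are hypothetical stand-ins used only to show the try block and the roll back. It also assumes that removing a dataset that was never written is harmless.

```python
class InMemoryDatastore:
    """Hypothetical stand-in for a datastore; not the real Butler interface."""

    def __init__(self):
        self.datasets = {}

    def put(self, name, value):
        self.datasets[name] = value

    def remove(self, name):
        # Removing a dataset that was never written is a no-op.
        self.datasets.pop(name, None)


def put_all_or_nothing(datastore, outputs):
    """Write every output, rolling back any partial writes on CTRL-C."""
    attempted = []
    try:
        for name, value in outputs.items():
            # Record the name before writing so that an interrupt arriving
            # in the middle of a put is still cleaned up.
            attempted.append(name)
            datastore.put(name, value)
    except KeyboardInterrupt:
        # Undo whatever was written, most recent first.
        for name in reversed(attempted):
            datastore.remove(name)
        # Re-raise: the interrupt must not be swallowed, because it is the
        # only notice that a shut down is coming.
        raise
    return attempted
```

Note that there is no timer here. The roll back simply runs when the interrupt arrives, and a second interrupt during the roll back is not handled by this sketch.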
I think we need to support CTRL-C anyhow, and it looks like CmdLineTask tries to do so but pipetask does not (I might be wrong).
Note that we absolutely do not want CTRL-C to be ignored – the point is that this is the only way we are going to tell that a shut down is about to happen.
On a related note, pipetask really does need some way of overwriting a dataset if it has no child datasets that depend on it. We are running into many issues during development where rerunning jobs in a big workflow is something we really want to do automatically. I imagine that's a different ticket to this one though.
Issue Links
- relates to: DM-25818 S3Datastore tests existence before writing (Done)
You meant --clobber-partial-outputs, not --skip-partial-outputs, right?
Also, to be clear,
referred more to the result of my changes to deal with RUN vs. CHAINED collections, not Andy's original interface.