Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-26131

Pipetask should handle SIGINT and back out of database transactions

    XMLWordPrintable

    Details

    • Story Points:
      4
    • Sprint:
      DB_F20_06
    • Team:
      Data Access and Database
    • Urgent?:
      No

      Description

      One of the problems we are having with running pipelines on Google is that they support pre-emption of compute nodes. At the moment what happens is that the node is shut down and depending on where the pipeline was we can end up in a weird state. When we run tens of thousands of jobs some of those are going to be pre-empted during the writes to the datastore with no chance of rolling back any transactions. Sometimes only some of the writes have been completed, sometimes all of them have been.

      Google do support pre-emption scripts which they can launch 30 seconds before the node is killed. In theory this script could send a SIGINT to any pipetask commands that are running and give them 30 seconds to clean up before being yanked.

      What I'm wondering is if we could wrap the code that does all the butler puts in a try block catching KeyboardInterrupt. Then it could know if it has done partial puts and roll those back. We probably shouldn't be in the game of trying to start a timer and seeing if we can squeeze in all the puts before the 30 seconds triggers.

      I think we need to support CTRL-C anyhow and it looks like CmdLineTask tries to do so but not pipetask (I might be wrong).

      Note that we absolutely do not want CTRL-C to be ignored – the point is that this is the only way we are going to tell that a shut down is about to happen.

      In a related note, pipetask really does need some way of overwriting a dataset if it has not child datasets that depend on it. We are running into many issues during development where rerunning jobs in a big workflow is something we really want to do automatically. I imagine that's a different ticket to this one though.

        Attachments

          Issue Links

            Activity

            Hide
            jbosch Jim Bosch added a comment -

            You meant --clobber-partial-outputs, not --skip-partial-outputs, right?

            Also, to be clear,

            I think we've overall got a confusing interface here

            referred more to the result of my changes to deal with RUN vs. CHAINED collections, not Andy's original interface.

            Show
            jbosch Jim Bosch added a comment - You meant --clobber-partial-outputs , not --skip-partial-outputs , right? Also, to be clear, I think we've overall got a confusing interface here referred more to the result of my changes to deal with RUN vs. CHAINED collections, not Andy's original interface.
            Hide
            salnikov Andy Salnikov added a comment - - edited

            Right, sorry, I'm typing without reading, must be --clobber-partial-outputs of course.

            And speaking about multiple confising options - should --clobber-partial-outputs imply --skip-existing? It does not matter for single-quantum jobs but for multi-quanta I think this is logical. Alternatively we can tell users to always add --skip-existing if --clobber-partial-outputs is specified.

            Show
            salnikov Andy Salnikov added a comment - - edited Right, sorry, I'm typing without reading, must be --clobber-partial-outputs of course. And speaking about multiple confising options - should --clobber-partial-outputs imply --skip-existing ? It does not matter for single-quantum jobs but for multi-quanta I think this is logical. Alternatively we can tell users to always add  --skip-existing if --clobber-partial-outputs is specified.
            Hide
            salnikov Andy Salnikov added a comment -

            Should be ready for review. I could not decide whether we want to turn on --skip-existing when --clobber-partial-outputs is given, if you have an opinion on this please tell. The change is really small, logic is limited to singleQuantumExecutor.py but I also refactored bunch of unit test code to implement test case for this new feature, bulk of changes is in the test code.

            Show
            salnikov Andy Salnikov added a comment - Should be ready for review. I could not decide whether we want to turn on --skip-existing when --clobber-partial-outputs is given, if you have an opinion on this please tell. The change is really small, logic is limited to singleQuantumExecutor.py but I also refactored bunch of unit test code to implement test case for this new feature, bulk of changes is in the test code.
            Hide
            tjenness Tim Jenness added a comment -

            Looks okay. I don't have a strong feeling either way on skip-existing being the default when clobber is enable. Someone could make the case that they want a clobber-existing as well just to force everything to be rebuilt. We are going to have to ask for feedback once people start using these tools more.

            Show
            tjenness Tim Jenness added a comment - Looks okay. I don't have a strong feeling either way on skip-existing being the default when clobber is enable. Someone could make the case that they want a clobber-existing as well just to force everything to be rebuilt. We are going to have to ask for feedback once people start using these tools more.
            Hide
            salnikov Andy Salnikov added a comment -

            Thanks for review! Merged with the fixes to the review comments and, as there were no other comments, I leave skip-existing unchanged.

            Show
            salnikov Andy Salnikov added a comment - Thanks for review! Merged with the fixes to the review comments and, as there were no other comments, I leave skip-existing unchanged.

              People

              Assignee:
              salnikov Andy Salnikov
              Reporter:
              tjenness Tim Jenness
              Reviewers:
              Tim Jenness
              Watchers:
              Andy Salnikov, Hsin-Fang Chiang, Jim Bosch, Kian-Tat Lim, Nate Lust, Tim Jenness
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:

                  CI Builds

                  No builds found.