  1. Data Management
  2. DM-10992

Dealing with failures

    Details

      Description

      During production, job failures are to be expected, whether due to infrastructure problems such as network or node failures or due to faulty applications. The goal is to understand:

      • which conditions make Pegasus consider a job failed,
      • what happens after a job enters a failed state, i.e.:
        • how a failure affects the transfer of the job's intermediate/output files (if any),
        • what happens if a failure occurs in a job cluster,
      • how retries work in practice, i.e.:
        • can an operating organization control the number of retries,
        • can an operating organization make Pegasus retry on certain failures but not others.

          Activity

          mkowalik Mikolaj Kowalik added a comment -

          After reading the Pegasus documentation, I wrote a stub addressing the concerns listed above. In the future it may be incorporated into a larger document describing how Pegasus deals with situations commonly encountered during batch production operations.


            People

            • Assignee:
              mkowalik Mikolaj Kowalik
              Reporter:
              mkowalik Mikolaj Kowalik
