Details
- Type: Story
- Status: Done
- Resolution: Done
- Fix Version/s: None
- Component/s: None
- Labels:
- Story Points: 10
- Epic Link:
- Team: Data Facility
Description
During production, job failures are to be expected, caused either by infrastructure problems such as network or node failures or by faulty applications. The goal is to understand:
- which conditions make Pegasus consider a job failed,
- what happens after a job enters a failed state, i.e.:
  - how a failure affects the transfer of the job's intermediate/output files (if any),
  - what happens if a failure occurs in a job cluster,
- how retries work in practice, i.e.:
  - whether an operating organization can control the number of retries,
  - whether an operating organization can make Pegasus retry on certain failures but not others.
After reading the Pegasus documentation, I wrote a stub addressing the concerns listed above. In the future, it may be included in a larger document describing how Pegasus deals with situations commonly encountered during batch production operations.