Data Management / DM-30977

Write pipeline task log output to butler


    Details

    • Story Points:
      8
    • Team:
      Architecture
    • Urgent?:
      No

      Description

      Modify pipetask such that it captures all log output into a buffer and at quantum completion it puts that log output into the butler repository just as is done for other metadata.

        Attachments

          Issue Links

            Activity

            mgower Michelle Gower added a comment -

            What's the plan for if an execution fails (nicely) and then is rerun?  Will the failure log just be overwritten or do we need a way to keep multiple logs?  A restart with a new run collection would handle keeping both around.  Is that good enough?  An automatic WMS retry because a file couldn't be opened for writing would have the same run collection and presumably clobber the log that had the error message explaining why the failure and thus retry.

            tjenness Tim Jenness added a comment -

            You mean if it reruns into an existing run and is overwriting all the contents? This is written at quantum completion so presumably it means that everything worked. If the quantum is skipped next time then no log would be written. But if this is an execution butler and the job is rerun with the same run then just like everything else the file will be overwritten. The only way we would stop that is for us to be using a temporary datastore location in the execution butler – that is entirely doable if we need to do it that way but it would slow down the "bring home" part of things since the data would have to be copied.

            tjenness Tim Jenness added a comment -

            I think it would make sense to have a special log object that accumulates log message components rather than the fully-formatted text. This would make it easier to analyze the log content later, and it would be reasonable to serialize in JSON format. For Python logging this should be trivial: register a log handler and capture the LogRecord directly. Unfortunately, if lsst.log is installed, pipetask/butler forwards all Python logging messages to lsst.log, so lsst.log messages would have to be captured as well.

            This is much harder, since lsst.log messages don't appear to have a way to call Python callbacks for each message. Log4CXX does support JSON strings, but I don't know whether lsst.log can enable that functionality.

            mgower Michelle Gower added a comment -

            Presumably we would want log files for Quanta failures in the same place as log files for successes?

            There was a recent problem at NCSA where a file couldn't be opened for writing. The Quantum failed. We have asked to turn on automatic retries in the WMS (not done yet). The job would then have failed and automatically rerun. Let's assume the second attempt succeeds (some intermittent filesystem issue). The automatic rerun would have the exact same run collection, but the Quantum would be executed again, producing different log contents. Ideally the second log is the only one we would need, but do we want the first log in order to understand why it failed the first time and was retried? I was assuming that this was the only copy of the Quanta output, but if it also goes to job output then we could find the first attempt's output there.

            tjenness Tim Jenness added a comment -

            This is a general issue with butler and reruns. All the other datasets would also overwrite (with clobber-outputs). So in this example the log file would at least be correct for the files that are in the butler for that run, even if you no longer have the log file for the failed run. You have the same problem with the run provenance metadata. I suppose pipetask could put a log dataset type on success (and failure) and a log_failure dataset type only on failure and then on retry you would only overwrite the log dataset type. That would work unless you retried 3 or more times.

            tjenness Tim Jenness added a comment -

            Andy Salnikov, sorry to keep asking you for reviews, but K-T is already doing the other big logging ticket for me.

            On this ticket there have been a number of changes:

            • A new ButlerLogRecord class as a pydantic model representing a LogRecord+MDC
            • ButlerLogRecords as a collection of ButlerLogRecord that is understood by butler for put/get.
            • A new logging.Formatter subclass that writes JSON records.
            • A new datastore formatter that can read and write those JSON records.
            • A new --log-file command line option that will allow all log records to be written to a file (can be JSON if the file ends in json else formatted text).
            • Change to pipe_base to understand a new _log dataset type.
            • Change to ctrl_mpexec to open a log file when the quantum execution starts and ingest that log file when quantum execution finishes. K-T convinced me that accumulating log messages into memory was a bad idea (especially when I saw characterizeImage generate half a million debug messages).

            Jenkins seems to still work. I've run pipelines_check, can see the log files in there, and can butler.get() them. Since I use a stream format of one record per line, all the log records can be trivially combined if needed (although bps should probably use the --log-file option to generate JSON logs for itself).

            salnikov Andy Salnikov added a comment -

            Looks good; a few comments, mostly for daf_butler.


              People

              Assignee:
              tjenness Tim Jenness
              Reporter:
              tjenness Tim Jenness
              Reviewers:
              Andy Salnikov
              Watchers:
              Andy Salnikov, Kian-Tat Lim, Michelle Gower, Tim Jenness
              Votes:
              0

                Dates

                Created:
                Updated:
                Resolved:

                  Jenkins

                  No builds found.