Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-33820

Emit periodic log messages during Faro data loading

    XMLWordPrintable

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: faro
    • Labels:
      None
    • Story Points:
      2
    • Team:
      Data Release Production
    • Urgent?:
      No

      Description

      Some of the batch system monitoring tools expect tasks to periodically emit log messages, to ensure that the task is still running and not in some stuck state. Panda e.g. expects these messages at at least once every two hours, but it's advisable for it to be more frequent. Faro's match preparation tasks can potentially spend a long time in data loading, and should have some log messages added during this process.

        Attachments

          Issue Links

            Activity

            No builds found.
            ctslater Colin Slater created issue -
            Hide
            tjenness Tim Jenness added a comment -

            Using self.log.verbose is okay.

            Show
            tjenness Tim Jenness added a comment - Using self.log.verbose is okay.
            tjenness Tim Jenness made changes -
            Field Original Value New Value
            Link This issue relates to DM-31528 [ DM-31528 ]
            Show
            tjenness Tim Jenness added a comment - See DM-31528 : https://github.com/lsst/meas_extensions_scarlet/blob/main/python/lsst/meas/extensions/scarlet/scarletDeblendTask.py#L1057-L1060
            yusra Yusra AlSayyad made changes -
            Assignee Arun Kannawadi [ kannawad ]
            Hide
            yusra Yusra AlSayyad added a comment -

            Arun Kannawadi it is unclear whether the part that takes >2hrs without logging is during processing or during dataset loading. We're going to assume the latter and this might be fixed by turning on logging whenever butler reads a file from s3.

            If that doesn't fix this, we might pursue this.

            Show
            yusra Yusra AlSayyad added a comment - Arun Kannawadi it is unclear whether the part that takes >2hrs without logging is during processing or during dataset loading. We're going to assume the latter and this might be fixed by turning on logging whenever butler reads a file from s3. If that doesn't fix this, we might pursue this.
            yesw Shuwei Ye made changes -
            Watchers Colin Slater, Tim Jenness, Yusra AlSayyad [ Colin Slater, Tim Jenness, Yusra AlSayyad ] Colin Slater, Shuwei Ye, Tim Jenness, Yusra AlSayyad [ Colin Slater, Shuwei Ye, Tim Jenness, Yusra AlSayyad ]
            tjenness Tim Jenness made changes -
            Link This issue relates to DM-33853 [ DM-33853 ]
            tjenness Tim Jenness made changes -
            Link This issue relates to DM-33854 [ DM-33854 ]
            Hide
            kannawad Arun Kannawadi added a comment -

            Is this run with logging level set to VERBOSE or DEBUG (or greater)? I'm adding a bunch of log messages at both levels, but wondering if the DEBUG level messages will prevent premature termination.

            Show
            kannawad Arun Kannawadi added a comment - Is this run with logging level set to VERBOSE or DEBUG (or greater)? I'm adding a bunch of log messages at both levels, but wondering if the DEBUG level messages will prevent premature termination.
            Hide
            hlin Huan Lin added a comment -

            We are running with logging level VERBOSE for DP0.2 production, so that is sufficient.

            Show
            hlin Huan Lin added a comment - We are running with logging level VERBOSE for DP0.2 production, so that is sufficient.
            Hide
            ctslater Colin Slater added a comment -

            Relevant portion of the the faro log for a job that ran successfully:

             

            INFO 2022-03-02T01:00:41.068+00:00 botocore.credentials ()(credentials.py:1111) - Found credentials in environment variables.
            INFO 2022-03-02T01:00:42.431+00:00 matplotlib.font_manager ()(font_manager.py:1443) - generated new fontManager
            INFO 2022-03-02T01:00:43.900+00:00 ctrl.mpexec.cmdLineFwk ()(cmdLineFwk.py:580) - QuantumGraph contains 1 quanta for 1 tasks, graph ID: '1646177452.0195708-9048
            INFO 2022-03-02T08:00:33.611+00:00 matchCatalogsTract (matchCatalogsTract:{band: 'g', instrument: 'LSSTCam-imSim', skymap: 'DC2', tract: 3083})(MatchedCatalogBase.py:289) - Running tract: 3083
            INFO 2022-03-02T08:00:33.617+00:00 matchCatalogsTract (matchCatalogsTract:{band: 'g', instrument: 'LSSTCam-imSim', skymap: 'DC2', tract: 3083})(MatchedCatalogBase.py:181) - Running catalog matching
            WARNING 2022-03-02T08:00:33.658+00:00 py.warnings (matchCatalogsTract:{band: 'g', instrument: 'LSSTCam-imSim', skymap: 'DC2', tract: 3083})(warnings.py:109) - /opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-0.8.0/Linux64/faro/ge8d9ffb934+d34cda5e8b/python/lsst/faro/utils/matcher.py:211: RuntimeWarning: invalid value encountered in true_divide
            e = (i_xx - i_yy + 2j * i_xy) / (i_xx + i_yy)

            INFO 2022-03-02T08:27:27.516+00:00 matchCatalogsTract (matchCatalogsTract:{band: 'g', instrument: 'LSSTCam-imSim', skymap: 'DC2', tract: 3083})(MatchedCatalogBase.py:188) - 6160117 sources in matched catalog.
            INFO 2022-03-02T08:29:02.199+00:00 matchCatalogsTract (matchCatalogsTract:{band: 'g', instrument: 'LSSTCam-imSim', skymap: 'DC2', tract: 3083})(MatchedCatalogBase.py:192) - 4251647 sources when trimmed to tract boundaries.
            INFO 2022-03-02T08:31:53.406+00:00 ctrl.mpexec.singleQuantumExecutor (matchCatalogsTract:{band: 'g', instrument: 'LSSTCam-imSim', skymap: 'DC2', tract: 3083})(singleQuantumExecutor.py:189) - Execution of task 'matchCatalogsTract' on quantum {band: 'g', instrument: 'LSSTCam-imSim', skymap: 'DC2', tract: 3083} took 27069.216 seconds
            INFO 2022-03-02T08:31:53.731+00:00 ctrl.mpexec.mpGraphExecutor ()(mpGraphExecutor.py:386) - Executed 1 quanta successfully, 0 failed and 0 remain out of total 1 quanta.

            Show
            ctslater Colin Slater added a comment - Relevant portion of the the faro log for a job that ran successfully:   INFO 2022-03-02T01:00:41.068+00:00 botocore.credentials ()(credentials.py:1111) - Found credentials in environment variables. INFO 2022-03-02T01:00:42.431+00:00 matplotlib.font_manager ()(font_manager.py:1443) - generated new fontManager INFO 2022-03-02T01:00:43.900+00:00 ctrl.mpexec.cmdLineFwk ()(cmdLineFwk.py:580) - QuantumGraph contains 1 quanta for 1 tasks, graph ID: '1646177452.0195708-9048 INFO 2022-03-02T08:00:33.611+00:00 matchCatalogsTract (matchCatalogsTract:{band: 'g', instrument: 'LSSTCam-imSim', skymap: 'DC2', tract: 3083})(MatchedCatalogBase.py:289) - Running tract: 3083 INFO 2022-03-02T08:00:33.617+00:00 matchCatalogsTract (matchCatalogsTract:{band: 'g', instrument: 'LSSTCam-imSim', skymap: 'DC2', tract: 3083})(MatchedCatalogBase.py:181) - Running catalog matching WARNING 2022-03-02T08:00:33.658+00:00 py.warnings (matchCatalogsTract:{band: 'g', instrument: 'LSSTCam-imSim', skymap: 'DC2', tract: 3083})(warnings.py:109) - /opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-0.8.0/Linux64/faro/ge8d9ffb934+d34cda5e8b/python/lsst/faro/utils/matcher.py:211: RuntimeWarning: invalid value encountered in true_divide e = (i_xx - i_yy + 2j * i_xy) / (i_xx + i_yy) INFO 2022-03-02T08:27:27.516+00:00 matchCatalogsTract (matchCatalogsTract:{band: 'g', instrument: 'LSSTCam-imSim', skymap: 'DC2', tract: 3083})(MatchedCatalogBase.py:188) - 6160117 sources in matched catalog. INFO 2022-03-02T08:29:02.199+00:00 matchCatalogsTract (matchCatalogsTract:{band: 'g', instrument: 'LSSTCam-imSim', skymap: 'DC2', tract: 3083})(MatchedCatalogBase.py:192) - 4251647 sources when trimmed to tract boundaries. INFO 2022-03-02T08:31:53.406+00:00 ctrl.mpexec.singleQuantumExecutor (matchCatalogsTract:{band: 'g', instrument: 'LSSTCam-imSim', skymap: 'DC2', tract: 3083})(singleQuantumExecutor.py:189) - Execution of task 'matchCatalogsTract' on quantum {band: 'g', instrument: 'LSSTCam-imSim', skymap: 'DC2', tract: 3083} took 27069.216 seconds INFO 2022-03-02T08:31:53.731+00:00 ctrl.mpexec.mpGraphExecutor ()(mpGraphExecutor.py:386) - Executed 1 quanta successfully, 0 failed and 0 remain out of total 1 quanta.
            Hide
            tjenness Tim Jenness added a comment -

            So over 7 hours between log messages?

            Show
            tjenness Tim Jenness added a comment - So over 7 hours between log messages?
            Hide
            kannawad Arun Kannawadi added a comment - - edited

            The log message that appears after 7 hours seems to occur fairly early on from a call in the `runQuantum` method. So it looks like it is indeed the loading that is taking so much time.

            Show
            kannawad Arun Kannawadi added a comment - - edited The log message that appears after 7 hours seems to occur fairly early on from a call in the `runQuantum` method. So it looks like it is indeed the loading that is taking so much time.
            Hide
            tjenness Tim Jenness added a comment -

            I will see if I can take a quick look at DM-33853 but 7 hours to load the data is a bit of a red flag and really implies that deferred dataset loading should be used if at all possible (like we do when coadding images).

            Show
            tjenness Tim Jenness added a comment - I will see if I can take a quick look at DM-33853 but 7 hours to load the data is a bit of a red flag and really implies that deferred dataset loading should be used if at all possible (like we do when coadding images).
            Hide
            kannawad Arun Kannawadi added a comment -

            Agreed. FWIW, I raised a PR with some log messages sprinkled throughout. Not all of them are timed, heartbeat log messages. If this becomes excessive, we can demote more of them to DEBUG or TRACE level.

            Show
            kannawad Arun Kannawadi added a comment - Agreed. FWIW, I raised a PR with some log messages sprinkled throughout. Not all of them are timed, heartbeat log messages. If this becomes excessive, we can demote more of them to DEBUG or TRACE level.
            kannawad Arun Kannawadi made changes -
            Status To Do [ 10001 ] In Progress [ 3 ]
            kannawad Arun Kannawadi made changes -
            Team DM Science [ 12218 ] Data Release Production [ 10301 ]
            kannawad Arun Kannawadi made changes -
            Status In Progress [ 3 ] In Review [ 10004 ]
            kannawad Arun Kannawadi made changes -
            Reviewers Tim Jenness [ tjenness ]
            Show
            kannawad Arun Kannawadi added a comment - Ongoing Jenkins run:  https://ci.lsst.codes/job/stack-os-matrix/36163/display/redirect
            Hide
            tjenness Tim Jenness added a comment -

            Thanks for doing this. I have a philosophical question for Jeffrey Carlin and Keith Bechtol about logger names in the faro.utils package but that's not something that needs to be solved on this ticket.

            Note that back porting this is going to be tricky because PeriodicLogger does not exist in 23.x.

            Show
            tjenness Tim Jenness added a comment - Thanks for doing this. I have a philosophical question for Jeffrey Carlin and Keith Bechtol about logger names in the faro.utils package but that's not something that needs to be solved on this ticket. Note that back porting this is going to be tricky because PeriodicLogger does not exist in 23.x.
            tjenness Tim Jenness made changes -
            Status In Review [ 10004 ] Reviewed [ 10101 ]
            Hide
            kannawad Arun Kannawadi added a comment -

            Now that you mention, the possibility of backporting is why I had not immediately switched to using PeriodicLogger. However, since the time consuming part really is the input fetching from the butler, it wouldn't make sense anyway to backport just this ticket without also backporting DM-33853, which requires ticket that implemented PeriodicLogger to also be backported.

            Show
            kannawad Arun Kannawadi added a comment - Now that you mention, the possibility of backporting is why I had not immediately switched to using PeriodicLogger. However, since the time consuming part really is the input fetching from the butler, it wouldn't make sense anyway to backport just this ticket without also backporting DM-33853 , which requires ticket that implemented PeriodicLogger to also be backported.
            Hide
            tjenness Tim Jenness added a comment -

            Yes, your previous patch (if you still have it) could seemingly be backported easily even if this better version is merged to main. I have not backported my logging to butler get because they changed the node configuration.

            Show
            tjenness Tim Jenness added a comment - Yes, your previous patch (if you still have it) could seemingly be backported easily even if this better version is merged to main. I have not backported my logging to butler get because they changed the node configuration.
            Hide
            kannawad Arun Kannawadi added a comment -

            I can pull up my earlier patch from my reflogs before too late. As long it is acceptable to have the backport branch have the same behavior with different implementation, we should do that.

            Show
            kannawad Arun Kannawadi added a comment - I can pull up my earlier patch from my reflogs before too late. As long it is acceptable to have the backport branch have the same behavior with different implementation, we should do that.
            kannawad Arun Kannawadi made changes -
            Labels backport-v23
            Hide
            tjenness Tim Jenness added a comment -

            I don't see any problem with having the branch use a slightly different implementation.

            Show
            tjenness Tim Jenness added a comment - I don't see any problem with having the branch use a slightly different implementation.
            kannawad Arun Kannawadi made changes -
            Epic Link PREOPS-995 [ 1078856 ]
            tjenness Tim Jenness made changes -
            Labels backport-v23
            Hide
            tjenness Tim Jenness added a comment - - edited

            Arun Kannawadi we've discussed the backporting of this and decided that we do not need to do the backport because step 3 (the step where this would matter) is now almost complete and there is not going to be another 23.0.1 release candidate.

            Show
            tjenness Tim Jenness added a comment - - edited Arun Kannawadi we've discussed the backporting of this and decided that we do not need to do the backport because step 3 (the step where this would matter) is now almost complete and there is not going to be another 23.0.1 release candidate.
            Hide
            kannawad Arun Kannawadi added a comment -

            Alright, I had raised a PR for backporting to not lose the discarded commits altogether. I closed the DM-33820-v23 PR and merged the main ticket branch to the main branch.

            Show
            kannawad Arun Kannawadi added a comment - Alright, I had raised a PR for backporting to not lose the discarded commits altogether. I closed the DM-33820 -v23 PR and merged the main ticket branch to the main branch.
            kannawad Arun Kannawadi made changes -
            Resolution Done [ 10000 ]
            Status Reviewed [ 10101 ] Done [ 10002 ]
            kannawad Arun Kannawadi made changes -
            Story Points 2

              People

              Assignee:
              kannawad Arun Kannawadi
              Reporter:
              ctslater Colin Slater
              Reviewers:
              Tim Jenness
              Watchers:
              Arun Kannawadi, Colin Slater, Huan Lin, Shuwei Ye, Tim Jenness, Yusra AlSayyad
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:

                  Jenkins

                  No builds found.