Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-33820

Emit periodic log messages during Faro data loading

    XMLWordPrintable

Details

    • Story
    • Status: Done
    • Resolution: Done
    • None
    • faro
    • None
    • 2
    • Data Release Production
    • No

    Description

      Some of the batch system monitoring tools expect tasks to periodically emit log messages, to ensure that the task is still running and not in some stuck state. Panda e.g. expects these messages at at least once every two hours, but it's advisable for it to be more frequent. Faro's match preparation tasks can potentially spend a long time in data loading, and should have some log messages added during this process.

      Attachments

        Issue Links

          Activity

            No builds found.
            ctslater Colin Slater created issue -
            tjenness Tim Jenness added a comment -

            Using self.log.verbose is okay.

            tjenness Tim Jenness added a comment - Using self.log.verbose is okay.
            tjenness Tim Jenness made changes -
            Field Original Value New Value
            Link This issue relates to DM-31528 [ DM-31528 ]
            tjenness Tim Jenness added a comment - See DM-31528 : https://github.com/lsst/meas_extensions_scarlet/blob/main/python/lsst/meas/extensions/scarlet/scarletDeblendTask.py#L1057-L1060
            yusra Yusra AlSayyad made changes -
            Assignee Arun Kannawadi [ kannawad ]

            kannawad it is unclear whether the part that takes >2hrs without logging is during processing or during dataset loading. We're going to assume the latter and this might be fixed by turning on logging whenever butler reads a file from s3.

            If that doesn't fix this, we might pursue this.

            yusra Yusra AlSayyad added a comment - kannawad it is unclear whether the part that takes >2hrs without logging is during processing or during dataset loading. We're going to assume the latter and this might be fixed by turning on logging whenever butler reads a file from s3. If that doesn't fix this, we might pursue this.
            yesw Shuwei Ye made changes -
            Watchers Colin Slater, Tim Jenness, Yusra AlSayyad [ Colin Slater, Tim Jenness, Yusra AlSayyad ] Colin Slater, Shuwei Ye, Tim Jenness, Yusra AlSayyad [ Colin Slater, Shuwei Ye, Tim Jenness, Yusra AlSayyad ]
            tjenness Tim Jenness made changes -
            Link This issue relates to DM-33853 [ DM-33853 ]
            tjenness Tim Jenness made changes -
            Link This issue relates to DM-33854 [ DM-33854 ]

            Is this run with logging level set to VERBOSE or DEBUG (or greater)? I'm adding a bunch of log messages at both levels, but wondering if the DEBUG level messages will prevent premature termination.

            kannawad Arun Kannawadi added a comment - Is this run with logging level set to VERBOSE or DEBUG (or greater)? I'm adding a bunch of log messages at both levels, but wondering if the DEBUG level messages will prevent premature termination.
            hlin Huan Lin added a comment -

            We are running with logging level VERBOSE for DP0.2 production, so that is sufficient.

            hlin Huan Lin added a comment - We are running with logging level VERBOSE for DP0.2 production, so that is sufficient.
            ctslater Colin Slater added a comment -

            Relevant portion of the the faro log for a job that ran successfully:

             

            INFO 2022-03-02T01:00:41.068+00:00 botocore.credentials ()(credentials.py:1111) - Found credentials in environment variables.
            INFO 2022-03-02T01:00:42.431+00:00 matplotlib.font_manager ()(font_manager.py:1443) - generated new fontManager
            INFO 2022-03-02T01:00:43.900+00:00 ctrl.mpexec.cmdLineFwk ()(cmdLineFwk.py:580) - QuantumGraph contains 1 quanta for 1 tasks, graph ID: '1646177452.0195708-9048
            INFO 2022-03-02T08:00:33.611+00:00 matchCatalogsTract (matchCatalogsTract:{band: 'g', instrument: 'LSSTCam-imSim', skymap: 'DC2', tract: 3083})(MatchedCatalogBase.py:289) - Running tract: 3083
            INFO 2022-03-02T08:00:33.617+00:00 matchCatalogsTract (matchCatalogsTract:{band: 'g', instrument: 'LSSTCam-imSim', skymap: 'DC2', tract: 3083})(MatchedCatalogBase.py:181) - Running catalog matching
            WARNING 2022-03-02T08:00:33.658+00:00 py.warnings (matchCatalogsTract:{band: 'g', instrument: 'LSSTCam-imSim', skymap: 'DC2', tract: 3083})(warnings.py:109) - /opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-0.8.0/Linux64/faro/ge8d9ffb934+d34cda5e8b/python/lsst/faro/utils/matcher.py:211: RuntimeWarning: invalid value encountered in true_divide
            e = (i_xx - i_yy + 2j * i_xy) / (i_xx + i_yy)

            INFO 2022-03-02T08:27:27.516+00:00 matchCatalogsTract (matchCatalogsTract:{band: 'g', instrument: 'LSSTCam-imSim', skymap: 'DC2', tract: 3083})(MatchedCatalogBase.py:188) - 6160117 sources in matched catalog.
            INFO 2022-03-02T08:29:02.199+00:00 matchCatalogsTract (matchCatalogsTract:{band: 'g', instrument: 'LSSTCam-imSim', skymap: 'DC2', tract: 3083})(MatchedCatalogBase.py:192) - 4251647 sources when trimmed to tract boundaries.
            INFO 2022-03-02T08:31:53.406+00:00 ctrl.mpexec.singleQuantumExecutor (matchCatalogsTract:{band: 'g', instrument: 'LSSTCam-imSim', skymap: 'DC2', tract: 3083})(singleQuantumExecutor.py:189) - Execution of task 'matchCatalogsTract' on quantum {band: 'g', instrument: 'LSSTCam-imSim', skymap: 'DC2', tract: 3083} took 27069.216 seconds
            INFO 2022-03-02T08:31:53.731+00:00 ctrl.mpexec.mpGraphExecutor ()(mpGraphExecutor.py:386) - Executed 1 quanta successfully, 0 failed and 0 remain out of total 1 quanta.

            ctslater Colin Slater added a comment - Relevant portion of the the faro log for a job that ran successfully:   INFO 2022-03-02T01:00:41.068+00:00 botocore.credentials ()(credentials.py:1111) - Found credentials in environment variables. INFO 2022-03-02T01:00:42.431+00:00 matplotlib.font_manager ()(font_manager.py:1443) - generated new fontManager INFO 2022-03-02T01:00:43.900+00:00 ctrl.mpexec.cmdLineFwk ()(cmdLineFwk.py:580) - QuantumGraph contains 1 quanta for 1 tasks, graph ID: '1646177452.0195708-9048 INFO 2022-03-02T08:00:33.611+00:00 matchCatalogsTract (matchCatalogsTract:{band: 'g', instrument: 'LSSTCam-imSim', skymap: 'DC2', tract: 3083})(MatchedCatalogBase.py:289) - Running tract: 3083 INFO 2022-03-02T08:00:33.617+00:00 matchCatalogsTract (matchCatalogsTract:{band: 'g', instrument: 'LSSTCam-imSim', skymap: 'DC2', tract: 3083})(MatchedCatalogBase.py:181) - Running catalog matching WARNING 2022-03-02T08:00:33.658+00:00 py.warnings (matchCatalogsTract:{band: 'g', instrument: 'LSSTCam-imSim', skymap: 'DC2', tract: 3083})(warnings.py:109) - /opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-0.8.0/Linux64/faro/ge8d9ffb934+d34cda5e8b/python/lsst/faro/utils/matcher.py:211: RuntimeWarning: invalid value encountered in true_divide e = (i_xx - i_yy + 2j * i_xy) / (i_xx + i_yy) INFO 2022-03-02T08:27:27.516+00:00 matchCatalogsTract (matchCatalogsTract:{band: 'g', instrument: 'LSSTCam-imSim', skymap: 'DC2', tract: 3083})(MatchedCatalogBase.py:188) - 6160117 sources in matched catalog. INFO 2022-03-02T08:29:02.199+00:00 matchCatalogsTract (matchCatalogsTract:{band: 'g', instrument: 'LSSTCam-imSim', skymap: 'DC2', tract: 3083})(MatchedCatalogBase.py:192) - 4251647 sources when trimmed to tract boundaries. INFO 2022-03-02T08:31:53.406+00:00 ctrl.mpexec.singleQuantumExecutor (matchCatalogsTract:{band: 'g', instrument: 'LSSTCam-imSim', skymap: 'DC2', tract: 3083})(singleQuantumExecutor.py:189) - Execution of task 'matchCatalogsTract' on quantum {band: 'g', instrument: 'LSSTCam-imSim', skymap: 'DC2', tract: 3083} took 27069.216 seconds INFO 2022-03-02T08:31:53.731+00:00 ctrl.mpexec.mpGraphExecutor ()(mpGraphExecutor.py:386) - Executed 1 quanta successfully, 0 failed and 0 remain out of total 1 quanta.
            tjenness Tim Jenness added a comment -

            So over 7 hours between log messages?

            tjenness Tim Jenness added a comment - So over 7 hours between log messages?
            kannawad Arun Kannawadi added a comment - - edited

            The log message that appears after 7 hours seems to occur fairly early on from a call in the `runQuantum` method. So it looks like it is indeed the loading that is taking so much time.

            kannawad Arun Kannawadi added a comment - - edited The log message that appears after 7 hours seems to occur fairly early on from a call in the `runQuantum` method. So it looks like it is indeed the loading that is taking so much time.
            tjenness Tim Jenness added a comment -

            I will see if I can take a quick look at DM-33853 but 7 hours to load the data is a bit of a red flag and really implies that deferred dataset loading should be used if at all possible (like we do when coadding images).

            tjenness Tim Jenness added a comment - I will see if I can take a quick look at DM-33853 but 7 hours to load the data is a bit of a red flag and really implies that deferred dataset loading should be used if at all possible (like we do when coadding images).

            Agreed. FWIW, I raised a PR with some log messages sprinkled throughout. Not all of them are timed, heartbeat log messages. If this becomes excessive, we can demote more of them to DEBUG or TRACE level.

            kannawad Arun Kannawadi added a comment - Agreed. FWIW, I raised a PR with some log messages sprinkled throughout. Not all of them are timed, heartbeat log messages. If this becomes excessive, we can demote more of them to DEBUG or TRACE level.
            kannawad Arun Kannawadi made changes -
            Status To Do [ 10001 ] In Progress [ 3 ]
            kannawad Arun Kannawadi made changes -
            Team DM Science [ 12218 ] Data Release Production [ 10301 ]
            kannawad Arun Kannawadi made changes -
            Status In Progress [ 3 ] In Review [ 10004 ]
            kannawad Arun Kannawadi made changes -
            Reviewers Tim Jenness [ tjenness ]
            kannawad Arun Kannawadi added a comment - Ongoing Jenkins run:  https://ci.lsst.codes/job/stack-os-matrix/36163/display/redirect
            tjenness Tim Jenness added a comment -

            Thanks for doing this. I have a philosophical question for jcarlin and kbechtol about logger names in the faro.utils package but that's not something that needs to be solved on this ticket.

            Note that back porting this is going to be tricky because PeriodicLogger does not exist in 23.x.

            tjenness Tim Jenness added a comment - Thanks for doing this. I have a philosophical question for jcarlin and kbechtol about logger names in the faro.utils package but that's not something that needs to be solved on this ticket. Note that back porting this is going to be tricky because PeriodicLogger does not exist in 23.x.
            tjenness Tim Jenness made changes -
            Status In Review [ 10004 ] Reviewed [ 10101 ]

            Now that you mention, the possibility of backporting is why I had not immediately switched to using PeriodicLogger. However, since the time consuming part really is the input fetching from the butler, it wouldn't make sense anyway to backport just this ticket without also backporting DM-33853, which requires ticket that implemented PeriodicLogger to also be backported.

            kannawad Arun Kannawadi added a comment - Now that you mention, the possibility of backporting is why I had not immediately switched to using PeriodicLogger. However, since the time consuming part really is the input fetching from the butler, it wouldn't make sense anyway to backport just this ticket without also backporting DM-33853 , which requires ticket that implemented PeriodicLogger to also be backported.
            tjenness Tim Jenness added a comment -

            Yes, your previous patch (if you still have it) could seemingly be backported easily even if this better version is merged to main. I have not backported my logging to butler get because they changed the node configuration.

            tjenness Tim Jenness added a comment - Yes, your previous patch (if you still have it) could seemingly be backported easily even if this better version is merged to main. I have not backported my logging to butler get because they changed the node configuration.

            I can pull up my earlier patch from my reflogs before too late. As long it is acceptable to have the backport branch have the same behavior with different implementation, we should do that.

            kannawad Arun Kannawadi added a comment - I can pull up my earlier patch from my reflogs before too late. As long it is acceptable to have the backport branch have the same behavior with different implementation, we should do that.
            kannawad Arun Kannawadi made changes -
            Labels backport-v23
            tjenness Tim Jenness added a comment -

            I don't see any problem with having the branch use a slightly different implementation.

            tjenness Tim Jenness added a comment - I don't see any problem with having the branch use a slightly different implementation.
            kannawad Arun Kannawadi made changes -
            Epic Link PREOPS-995 [ 1078856 ]
            tjenness Tim Jenness made changes -
            Labels backport-v23
            tjenness Tim Jenness added a comment - - edited

            kannawad we've discussed the backporting of this and decided that we do not need to do the backport because step 3 (the step where this would matter) is now almost complete and there is not going to be another 23.0.1 release candidate.

            tjenness Tim Jenness added a comment - - edited kannawad we've discussed the backporting of this and decided that we do not need to do the backport because step 3 (the step where this would matter) is now almost complete and there is not going to be another 23.0.1 release candidate.

            Alright, I had raised a PR for backporting to not lose the discarded commits altogether. I closed the DM-33820-v23 PR and merged the main ticket branch to the main branch.

            kannawad Arun Kannawadi added a comment - Alright, I had raised a PR for backporting to not lose the discarded commits altogether. I closed the DM-33820 -v23 PR and merged the main ticket branch to the main branch.
            kannawad Arun Kannawadi made changes -
            Resolution Done [ 10000 ]
            Status Reviewed [ 10101 ] Done [ 10002 ]
            kannawad Arun Kannawadi made changes -
            Story Points 2

            People

              kannawad Arun Kannawadi
              ctslater Colin Slater
              Tim Jenness
              Arun Kannawadi, Colin Slater, Huan Lin, Shuwei Ye, Tim Jenness, Yusra AlSayyad
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Jenkins

                  No builds found.