Data Management / DM-34865

Script for looking at ap_verify metrics


Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Component: ap_verify
    • Story Points: 6
    • Sprint: AP F22-1 (June)
    • Team: Alert Production

    Description

      To help us determine whether our performance changes to the AP pipeline cause unwanted output changes, we need to be able to compare the metric values output from ap_verify runs on ci_hits2015 and ci_cosmos_pdr2 to those in Chronograph. This ticket is to make some simple scripts to read in the metrics and dump them to the console or make them easier to read in a notebook.

    Attachments

    Issue Links

    Activity
            Parejkoj John Parejko created issue -
            Parejkoj John Parejko made changes -
            Link: This issue relates to DM-34826
            Parejkoj John Parejko made changes -
            Status: To Do → In Progress
            sullivan Ian Sullivan made changes -
            Story Points: 2 → 6

            krzys Krzysztof Findeisen added a comment -

            Copying some feedback on the preliminary output shown on DM-34826:

            first number on the left is the detector id

            I suggest at least including the exposure/visit as well. Note also that ap.association.totalUnassociatedDiaObjects is not associated with either an exposure or detector, so something's going very wrong if you're able to measure it that way.

            If you just mean "out of what number" then you should be able to just compare it with what's in Chronograph.

            Doesn't that defeat the purpose of having a local utility that inspects metrics?
            sullivan Ian Sullivan made changes -
            Parent: DM-34623
            Supports Epic Completion: No
            Supports Milestone Completion: No
            Urgent?: off
            Issue Type: Technical task → Story
            sullivan Ian Sullivan made changes -
            Epic Link: PREOPS-1161
            sullivan Ian Sullivan made changes -
            Link: This issue relates to DM-34623
            sullivan Ian Sullivan made changes -
            Sprint: AP F22-1 (June)
            sullivan Ian Sullivan made changes -
            Rank: Ranked higher
            Parejkoj John Parejko made changes -
            Remote Link: This issue links to "Page (Confluence)"
            Parejkoj John Parejko added a comment -

            re: documentation, the inspect_job.py script is an older thing for doing this sort of analysis (using the old job file format), and I could write a similar document for this script. At minimum, I should have a cmdline task topic type.

            https://pipelines.lsst.io/modules/lsst.verify/reviewing-verification-json-on-the-command-line.html

            Parejkoj John Parejko made changes -
            Summary: Script(s) for looking at ap_verify metrics → Script for looking at ap_verify metrics
            Parejkoj John Parejko added a comment -

            krzys: would you mind taking a look at this? Before you look at the code, please try it out on a repo of your choice: I'd like to know what you think of the commandline documentation (there's also a task topic for it, if you build the package docs).

            For the code review, I'm getting two flake8 errors that I'm not sure what to do about. I believe they're valid Python (and I'm using the first `match` in the stack, I think!), but I get `F811 redefinition of unused 'value_formatter' from line 64` for those two mini functions.

            The PR, since there are two closed PRs that someone made a typo on: https://github.com/lsst/verify/pull/99

            Parejkoj John Parejko made changes -
            Reviewers: Krzysztof Findeisen
            Status: In Progress → In Review
            krzys Krzysztof Findeisen added a comment (edited) -

            It sounds like you're using a version of flake8 that doesn't support match; you might need to avoid it until we update our tooling. I assume it also fails in Jenkins?

            krzys Krzysztof Findeisen added a comment (edited) -

            Feedback on the CLI:

            • Why is a collection mandatory? This is different from the behavior of the butler tool, where queries that don't specify a collection are understood to search everything.
            • --data_id_restriction sounds like it would be a useful option, though given the warning I haven't tried testing it.
            • I don't understand the --kind argument at all. Metrics are metrics; how do you decide whether something is a "value" or a "timing"? The majority of metrics are likely to be neither timing nor memory (and this is already true for faro); do you plan to split up the "value" category in the future?
              • It's not clear that omitting --kind is equivalent to --kind value instead of providing an unfiltered view. This could potentially lead the user into thinking that timing metrics aren't being measured at all, and cause them to waste time trying to track down a bug that doesn't exist.
            • It would be helpful if there were a space between the blocks for each data ID; as it is, it's hard to tell at a glance how many values exist for how many IDs.
            • The help text for --verbose should probably read "Print extra information when loading metric values or handling errors."
            • The log given with the --verbose option is incorrect; it reads Loaded 18 metrics for 3 dataIds. when run on 6 data IDs (also, strictly speaking it's 18 values of 3 different metrics).
            • Why is the output format different with and without --data_id_keys?

              {instrument: 'DECam', detector: 5, visit: 411420, ...}; ap_association.fracUpdatedDiaObjects: 0.17993079584775087
              visit: 411420, detector: 5; ap_association.fracUpdatedDiaObjects: 0.17993079584775087
              

            • It's not clear which way the difference is measured, and in fact the order is the opposite of what I'd expect from programs like diff. Maybe change the header to read something like "Change in <repo1> relative to <repo2>"?
            krzys Krzysztof Findeisen added a comment (edited) -

            An alternative way of presenting the data ID information might be to provide the data ID in a header; this would split up the output visually, and avoid wrapping problems with long.package.veryLongAndPreciselyVerboseDescriptiveMetrics. Or metrics with both tract and detector info, as might happen when characterizing fakes.

            Parejkoj John Parejko added a comment -

            Collection is mandatory because of the way I look up the data to search. I use `registry.queryDataIds(("detector", "visit"), datasets="calexp")` to get the dataIds of interest, which requires a collection. Suggestions for alternative approaches are welcome (maybe when you review the code itself?).

            data_id_restriction is available as a kwarg in the print function; I just don't know how to implement it on the cmdline.

            I wrote this for ap_verify, as that's what I needed it for. That has `metricvalue_blah`, `metricvalue_blahTime`, and `metricvalue_blahMemory`. The "values" option is for the former (i.e., everything that isn't Time or Memory); I don't know what else to call it. I don't know what future expansions of this tool might be wanted; I just wanted to make the thing I wrote for ap_verify usable by others.
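The naming convention being described might be sketched as follows (a hypothetical illustration with an invented helper name; the real tool filters butler dataset types, not bare strings):

```python
def classify_kind(dataset_name):
    """Classify an ap_verify metric dataset by its naming convention:
    metricvalue_<name>Time -> timing, metricvalue_<name>Memory -> memory,
    anything else -> value."""
    name = dataset_name.removeprefix("metricvalue_")
    if name.endswith("Time"):
        return "timing"
    if name.endswith("Memory"):
        return "memory"
    return "value"
```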

            I've added `default='value'` to the `--kind` option; I had thought it included the defaults in the cmdline help, but it doesn't.

            I refactored the output to be in per-dataId blocks, with the dataID as a "header". I think you're right that it is more interpretable that way.

            Good catch on the verbose message re: number of dataIds: that was a bug in the code. I've added the number of metrics to that, and tweaked the message.

            Without --data_id_keys, I just print the DataCoordinates directly, since I don't know what they might contain. That results in the dict-like output. I don't know if it needs to be kept consistent with a user-specified list, and doing so seems like it might be a pain.
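The two output styles under discussion could be sketched like this (a hedged sketch; `format_data_id` is an invented helper, and a plain dict stands in for the butler's DataCoordinate):

```python
def format_data_id(data_id, keys=None):
    """Render a data ID either as the full mapping (no --data_id_keys)
    or restricted to user-specified keys (--data_id_keys)."""
    if keys is None:
        # No --data_id_keys: print everything, dict-style, since we don't
        # know what dimensions the data ID contains.
        return "{" + ", ".join(f"{k}: {v!r}" for k, v in data_id.items()) + "}"
    # --data_id_keys: print only the requested keys, "key: value" style.
    return ", ".join(f"{k}: {data_id[k]}" for k in keys)
```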

            I took your wording re: the difference, but I'm not sure whether I should be printing `repo2-repo1` or `repo1-repo2` (the latter is what's implemented).
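Under the "change in repo1 relative to repo2" wording, the implemented `repo1 - repo2` direction amounts to something like this sketch (invented names; plain dicts stand in for butler-loaded metric values):

```python
def metric_differences(repo1_values, repo2_values):
    """Return repo1 - repo2 for every key measured in both repos,
    i.e. the change in repo1 relative to repo2."""
    common = repo1_values.keys() & repo2_values.keys()
    return {key: repo1_values[key] - repo2_values[key] for key in common}
```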


            krzys Krzysztof Findeisen added a comment -

            Collection is mandatory because of the way I look up the data to search. I use `registry.queryDataIds(("detector", "visit"), datasets="calexp")` to get the dataIds of interest, which requires a collection. Suggestions for alternative approaches are welcome (maybe when you review the code itself?).

            Looks like the way butler does it is pretty ugly: https://github.com/lsst/daf_butler/blob/main/python/lsst/daf/butler/script/queryDatasets.py#L160-L168. You can always pass a list of all (run?) collections to the Butler constructor.

            I just wanted to make the thing I wrote for ap_verify useable by others.

            I still don't understand why it's useful to need three calls to see all the metrics.
            krzys Krzysztof Findeisen made changes -
            Status: In Review → Reviewed
            Parejkoj John Parejko added a comment -

            Jenkins: https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/36824/pipeline
            Parejkoj John Parejko made changes -
            Resolution: Done
            Status: Reviewed → Done
            sullivan Ian Sullivan made changes -
            Link: This issue relates to DM-35326

    People

      Assignee: John Parejko
      Reporter: John Parejko
      Reviewers: Krzysztof Findeisen
      Watchers (4): Eric Bellm, John Parejko, Krzysztof Findeisen, Meredith Rawls
      Votes: 0
