Data Management / DM-34865

Script for looking at ap_verify metrics


Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Component: ap_verify
    • Story Points: 6
    • Sprint: AP F22-1 (June)
    • Team: Alert Production

    Description

      To help us determine whether our performance changes to the AP pipeline cause unwanted output changes, we need to be able to compare the metric values output from ap_verify runs on ci_hits2015 and ci_cosmos_pdr2 to those in Chronograf. This ticket is to make some simple scripts to read in the metrics and dump them to the console, or make them easier to read in a notebook.

      Attachments

        Issue Links

          Activity

            Krzysztof Findeisen added a comment (edited)

            Feedback on the CLI:

            • Why is a collection mandatory? This is different from the behavior of the butler tool, where queries that don't specify a collection are understood to search everything.
            • --data_id_restriction sounds like it would be a useful option, though given the warning I haven't tried testing it.
            • I don't understand the --kind argument at all. Metrics are metrics; how do you decide whether something is a "value" or a "timing"? The majority of metrics are likely to be neither timing nor memory (and this is already true for faro); do you plan to split up the "value" category in the future?
              • It's not clear that omitting --kind is equivalent to --kind value instead of providing an unfiltered view. This could potentially lead the user into thinking that timing metrics aren't being measured at all, and cause them to waste time trying to track down a bug that doesn't exist.
            • It would be helpful if there were a space between the blocks for each data ID; as it is, it's hard to tell at a glance how many values exist for how many IDs.
            • The help text for --verbose should probably read "Print extra information when loading metric values or handling errors."
            • The log given with the --verbose option is incorrect; it reads Loaded 18 metrics for 3 dataIds. when run on 6 data IDs (also, strictly speaking it's 18 values of 3 different metrics).
            • Why is the output format different with and without --data_id_keys?

              {instrument: 'DECam', detector: 5, visit: 411420, ...}; ap_association.fracUpdatedDiaObjects: 0.17993079584775087
              visit: 411420, detector: 5; ap_association.fracUpdatedDiaObjects: 0.17993079584775087
              

            • It's not clear which way the difference is measured, and in fact the order is the opposite of what I'd expect from programs like diff. Maybe change the header to read something like "Change in <repo1> relative to <repo2>"?
            Krzysztof Findeisen added a comment (edited)

            An alternative way of presenting the data ID information might be to provide the data ID in a header; this would split up the output visually, and avoid wrapping problems with long.package.veryLongAndPreciselyVerboseDescriptiveMetrics, or with metrics that carry both tract and detector info, as might happen when characterizing fakes.

            John Parejko added a comment

            Collection is mandatory because of the way I look up the data to search. I use `registry.queryDataIds(("detector", "visit"), datasets="calexp")` to get the dataIds of interest, which requires a collection. Suggestions for alternative approaches are welcome (maybe when you review the code itself?).

            data_id_restriction is available as a kwarg in the print function; I just don't know how to implement it on the command line.

            I wrote this for ap_verify, as that's what I needed it for. That produces `metricvalue_blah`, `metricvalue_blahTime`, and `metricvalue_blahMemory` datasets. The "value" option is for the former (i.e., everything that isn't Time or Memory); I don't know what else to call it. I don't know what future expansions of this tool might be wanted; I just wanted to make the thing I wrote for ap_verify usable by others.
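            A minimal sketch of the suffix-based split described above (the `metricvalue_*` naming and the Time/Memory suffixes are assumptions taken from this comment, not an authoritative description of the implementation):

            ```python
            # Hypothetical sketch: classify ap_verify metricvalue dataset names by
            # suffix. Everything that isn't a Time or Memory measurement falls into
            # the catch-all "value" category.

            def classify_kind(dataset_name):
                """Map a metricvalue dataset name to 'timing', 'memory', or 'value'."""
                if dataset_name.endswith("Time"):
                    return "timing"
                if dataset_name.endswith("Memory"):
                    return "memory"
                return "value"

            # Illustrative dataset names only; not taken from a real repo.
            names = [
                "metricvalue_ap_association_fracUpdatedDiaObjects",
                "metricvalue_ap_pipe_ApPipelineTime",
                "metricvalue_ap_pipe_ApPipelineMemory",
            ]
            by_kind = {}
            for name in names:
                by_kind.setdefault(classify_kind(name), []).append(name)
            ```

            Under this scheme, seeing all metrics does take one call per kind; collapsing the three kinds behind a single `--kind all` style option would be the natural extension.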

            I've added `default='value'` to the `--kind` option; I had thought it included the defaults in the cmdline help, but it doesn't.

            I refactored the output into per-dataId blocks, with the dataId as a "header". I think you're right that it is more interpretable that way.
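            The per-dataId grouping could look something like the following sketch (the grouping logic, function name, and metric values are all illustrative assumptions, not the actual implementation):

            ```python
            # Hypothetical sketch of per-dataId output blocks with the data ID as a
            # header line and a blank line between blocks.

            # Illustrative (data ID, metric name, value) triples.
            values = [
                ({"visit": 411420, "detector": 5}, "ap_association.fracUpdatedDiaObjects", 0.1799),
                ({"visit": 411420, "detector": 5}, "ap_association.numNewDiaObjects", 731),
                ({"visit": 411420, "detector": 10}, "ap_association.numNewDiaObjects", 645),
            ]

            def format_blocks(values, data_id_keys=("visit", "detector")):
                """Group metric values by data ID and format one block per ID."""
                blocks = {}
                for data_id, metric, value in values:
                    key = tuple((k, data_id[k]) for k in data_id_keys)
                    blocks.setdefault(key, []).append((metric, value))
                lines = []
                for key, metrics in blocks.items():
                    lines.append(", ".join(f"{k}: {v}" for k, v in key))  # header
                    lines.extend(f"  {metric}: {value}" for metric, value in metrics)
                    lines.append("")  # blank line separates blocks
                return lines

            print("\n".join(format_blocks(values)))
            ```

            A header per data ID also sidesteps the line-wrapping problem with very long metric names, since each metric gets its own indented line.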

            Good catch on the verbose message re: number of dataIds: that was a bug in the code. I've added the number of metrics to that, and tweaked the message.

            Without --data_id_keys, I just print the DataCoordinates directly, since I don't know what they might contain. That results in the dict-like output. I don't know if it needs to be kept consistent with a user-specified list, and doing so seems like it might be a pain.

            I took your wording re: the difference, but I'm not sure whether I should be printing `repo2-repo1` or `repo1-repo2` (the latter is what's implemented).
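            For concreteness, the implemented convention (`repo1-repo2`, under a header like "Change in <repo1> relative to <repo2>") amounts to something like this sketch; the metric values here are made up for illustration:

            ```python
            # Hypothetical sketch of the repo1 - repo2 difference convention.
            # A negative result means repo1's value is lower than repo2's.

            repo1 = {"ap_association.fracUpdatedDiaObjects": 0.1799}
            repo2 = {"ap_association.fracUpdatedDiaObjects": 0.1821}

            def differences(repo1_values, repo2_values):
                """Return repo1 - repo2 for every metric present in both repos."""
                return {m: repo1_values[m] - repo2_values[m]
                        for m in repo1_values if m in repo2_values}

            diffs = differences(repo1, repo2)
            ```

            Whichever direction is chosen, stating it explicitly in the header avoids the ambiguity either way.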


            Krzysztof Findeisen added a comment

            > Collection is mandatory because of the way I look up the data to search. I use `registry.queryDataIds(("detector", "visit"), datasets="calexp")` to get the dataIds of interest, which requires a collection. Suggestions for alternative approaches are welcome (maybe when you review the code itself?).

            Looks like the way butler does it is pretty ugly: https://github.com/lsst/daf_butler/blob/main/python/lsst/daf/butler/script/queryDatasets.py#L160-L168. You can always pass a list of all (run?) collections to the Butler constructor.

            > I just wanted to make the thing I wrote for ap_verify usable by others.

            I still don't understand why it's useful to need three calls to see all the metrics.

            John Parejko added a comment

            Jenkins: https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/36824/pipeline

            People

              Assignee: John Parejko
              Reporter: John Parejko
              Reviewers: Krzysztof Findeisen
              Watchers: Eric Bellm, John Parejko, Krzysztof Findeisen, Meredith Rawls

