Fix Version/s: None
Use DP0.2 as a testbed to study the performance of querying over many thousands of metric values persisted as lsst.verify.Measurement objects, and start to think about the workflows to compile summary statistics and correlate them with metadata.
I added this as a suggested topic for the DMLT VF2F next week to get a status update on it. I was not necessarily thinking it would be you doing it.
I saw this referenced from the DMLT f2f and I was curious about how lsst.verify.Measurement is related to YAML storage. Traditionally metric and specification definitions were stored in human-editable YAML (https://github.com/lsst/verify_metrics) and measurements were serialized out to JSON. Are measurements being written out to YAML now?
Another thing, if the slow-down is happening while creating measurements, I just want to highlight that lsst.verify.Measurement doesn’t need a loaded Metric instance in order to create a measurement. Instead, we envisioned that measurement code would pass the metric’s name as a string and only later analysis code would load a Metric instance from the YAML repos. See https://pipelines.lsst.io/py-api/lsst.verify.Measurement.html#lsst.verify.Measurement
In other words, a serialized Measurement probably shouldn’t (or at least, shouldn’t have to) have the full Metric in its data.
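To make the separation concrete, here is a minimal stdlib-only sketch of that design (the classes below are illustrative stand-ins, not the actual lsst.verify API): measurement code records only the metric's fully qualified name, and analysis code later joins it against metric definitions loaded from the YAML repos.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for lsst.verify.Metric / lsst.verify.Measurement,
# illustrating that a measurement only needs the metric's *name*.
@dataclass
class Metric:
    name: str          # e.g. "validate_drp.PA1"
    unit: str          # e.g. "mmag"
    description: str

@dataclass
class Measurement:
    metric_name: str   # just a string; no loaded Metric instance required
    value: float

# Measurement code: cheap to create, no metric-repo access needed.
m = Measurement(metric_name="validate_drp.PA1", value=5.2)

# Analysis code: load the definitions once and join by name afterwards.
metric_defs = {
    "validate_drp.PA1": Metric("validate_drp.PA1", "mmag",
                               "Photometric repeatability"),
}
metric = metric_defs[m.metric_name]
summary = f"{m.value} {metric.unit}"
```

The point being that the expensive YAML lookup happens once in the analysis step, not per measurement.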
Of course, I’m not up to date on how lsst.verify is being used now, but I just wanted to highlight this in case we’re missing a basic optimization.
I have some notes written up in DMTN-203 – DM-31599
Writing metrics to YAML was a mistake caused by it being the easiest possible approach to get a quick butler test going. It wasn't meant to be the end game, or even to end up in production that way. The problem is that Measurement does not follow any of the conventions for serialization and reconstruction supported by the JSON formatter (which supports a few different approaches, including pydantic) and so could not be used directly. This is discussed in DM-31617. We either need a specialist MeasurementFormatter or need to tweak its API to match something like pydantic.
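For illustration, the kind of API convention the JSON formatter can pick up looks roughly like the following (a stdlib sketch of the pydantic-style contract; MeasurementModel is a hypothetical name, not an existing class): a symmetric to-JSON method and from-JSON constructor that the formatter can call generically.

```python
import json
from dataclasses import asdict, dataclass

# Hypothetical pydantic-style model that a formatter could target:
# serialize with .json(), reconstruct with .parse_raw().
@dataclass
class MeasurementModel:
    metric_name: str
    value: float
    unit: str

    def json(self) -> str:
        """Serialize to a JSON string, as pydantic's .json() does."""
        return json.dumps(asdict(self))

    @classmethod
    def parse_raw(cls, data: str) -> "MeasurementModel":
        """Reconstruct from a JSON string, as pydantic's .parse_raw() does."""
        return cls(**json.loads(data))

original = MeasurementModel(metric_name="validate_drp.PA1",
                            value=5.2, unit="mmag")
round_trip = MeasurementModel.parse_raw(original.json())
```

If Measurement exposed (or wrapped) something with this shape, the existing JSON formatter machinery could handle it without a bespoke MeasurementFormatter.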
I've also just realized that the timing test at the start of this ticket is not optimal. If you already have queryDatasets results, you should not then call butler.get, because butler will take the ref, expand it, and do another registry query to check that it's consistent. You need to use butler.getDirect to bypass the registry.
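A runnable sketch of the difference being described (FakeButler is a stub standing in for lsst.daf.butler.Butler; the real calls are butler.get(ref), which re-expands the ref via a registry query, versus butler.getDirect(ref), which trusts the ref as-is):

```python
import timeit

# Stub illustrating the two retrieval paths; not the real Butler class.
class FakeButler:
    def get(self, ref):
        # The real get() expands the ref and re-queries the registry
        # for consistency; model that as extra per-call overhead.
        self._registry_lookup(ref)
        return ref

    def getDirect(self, ref):
        # getDirect() uses the resolved ref directly, no registry round-trip.
        return ref

    def _registry_lookup(self, ref):
        sum(range(1000))  # placeholder for the registry-query cost

butler = FakeButler()
refs = list(range(100))  # stand-ins for resolved DatasetRefs from queryDatasets

t_get = timeit.timeit(lambda: [butler.get(r) for r in refs], number=10)
t_direct = timeit.timeit(lambda: [butler.getDirect(r) for r in refs], number=10)
```

With many thousands of measurements, that per-get registry round-trip is exactly the kind of overhead that would dominate the timing.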
I've just been told that people are waiting on me to change butler to use JSON rather than YAML. Is that correct? I did not think I was working on DM-31617. The questions I had on that ticket relate to the verify package which is not middleware.