You could probably avoid some cognitive dissonance for some readers by replacing "MeasurementTask" with something else ("MetricReportingTask"?) here. (And this is out of scope, but I think we'll want to rename lsst.verify.Measurement itself before using it more broadly).
Yes, I've found the term "Measurement" to be confusing to other members of the AP group. MetricReportingTask sounds a bit odd (to me, "reporting" suggests SQuaSH upload); perhaps ComputeMetricsTask?
My preferred way to deal with this would just be to add any information necessary for metrics to the task metadata.
I can add a mention of metadata as a possible solution, but I'm not convinced that it's the best one. In particular, metadata doesn't handle auxiliary information (notes and especially extras) very well.
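To make the contrast concrete, here's a rough sketch (the metric and key names are made up, and I'm assuming the current lsst.verify and lsst.daf.base interfaces): the metadata route flattens everything into scalar key/value pairs whose meaning lives only in documentation, while a Measurement can carry notes and extras alongside the value and its units.

```python
import astropy.units as u
from lsst.daf.base import PropertyList
from lsst.verify import Datum, Measurement

# Values a task would have computed; hard-coded here for illustration.
num_dipoles = 42
num_dia_sources = 1000

# Metadata route: everything is flattened into scalar key/value pairs.
# (Inside a Task this would be self.metadata.add(...).)
metadata = PropertyList()
metadata.add("numDipoles", num_dipoles)
metadata.add("numDiaSources", num_dia_sources)

# verify route: the value, its units, and the auxiliary context travel
# together in a single Measurement.
meas = Measurement(
    "ip_diffim.fracDipoles",  # made-up metric name
    num_dipoles / num_dia_sources * u.dimensionless_unscaled)
meas.notes["estimator"] = "naive dipole/DIASource ratio"  # free-form note
meas.extras["numDiaSources"] = Datum(
    num_dia_sources * u.dimensionless_unscaled,
    label="N_DiaSrc",
    description="Number of DIASources considered")
```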
I do think this means we'd need to start thinking of the metadata keys provided by Tasks (or subtask slots) as part of their documented public interface, but I think that's broadly true of any of the metadata-based architectures you've proposed.
I'd go farther and say that metadata must be considered part of a task's API, whether or not we need it for metrics. It is impossible to do anything useful with metadata if you don't know what keys to expect and what each means.
I think the fact that a value is useful for a metric is an indication that it's the sort of thing we ought to include in the metadata anyway.
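For concreteness, a metrics afterburner (whatever we end up calling it) could consume such a documented key with something like the following sketch; the key and metric names here are hypothetical, and the key is assumed to be part of the producing task's documented interface:

```python
import astropy.units as u
from lsst.verify import Measurement


def measurement_from_metadata(metadata, key, metric_name,
                              unit=u.dimensionless_unscaled):
    """Convert one documented task-metadata entry into a Measurement.

    Sketch only: assumes ``metadata`` supports dict-style access
    (substitute the appropriate PropertySet getter) and that ``key``
    holds a scalar documented by the task that produced it.
    """
    return Measurement(metric_name, metadata[key] * unit)


# Hypothetical usage with made-up names:
# meas = measurement_from_metadata(task.metadata, "numDiaSources",
#                                  "ip_diffim.numDiaSources")
```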
I think part of the mitigation is that we should prefer algorithm-agnostic metrics whenever possible (after all, our requirements are algorithm-agnostic)... It might be worth thinking about whether algorithm-specific quantities are consumed in a manner similar enough to standard metrics to be worth using the verify classes at all.
I suspect this may be a fundamental mismatch in how the AP and DRP teams are thinking about metrics. While we will, of course, have specifications for metrics that correspond to official requirements, most of the metrics we're considering are motivated by their diagnostic value rather than a high-level requirement. From our point of view, the main value of verify and SQuaSH is that they let us monitor how the quality/performance of the code evolves as we work on it. For example, I could see the developer of a particular task adding new metrics as they identify new performance issues (e.g., a bottleneck in a specific implementation), essentially generalizing regression testing to quantities that don't have meaningful pass/fail thresholds.
Do you think this is something that I should address more explicitly in the Introduction or Design Goals? I'm a bit worried that a misunderstanding about this could derail discussion between the AP/DRP groups.
Hi Jim Bosch, please review the corrections I've made in response to everyone's feedback. In particular, please let me know if there's anything in the new dataset-driven architecture that won't work with the current Tasks framework or doesn't fit your intent. Thanks!