Fix Version/s: None
Component/s: ap_association, ap_pipe, ap_verify, ip_diffim
Sprint: AP F21-1 (June), AP F21-2 (July)
Per DM with Eric Bellm, our source count metrics should be agnostic to whether or not we are running fakes processing. However, the pipeline does not distinguish between fake and natural sources; even ProcessCcdWithFakesTask forgets this information as soon as it modifies the image. There is a FAKE mask plane and corresponding catalog flags, but these flags are not suitable for source filtering.
The current best way to identify fake sources is to cross-match them to the original fakes catalog, as is done for the existing fakes metrics. This adds a dependency on a dataset that does not exist in non-fakes pipelines, though the dependency can be turned on and off in pipeline configurations at the ap_verify level. Add support for such cross-matching to the existing metrics, preferably in a way that leaves the Diffim and SFP metrics portable across pipelines.
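As a rough illustration of the idea above, here is a minimal standalone sketch of a source count that excludes anything matching the fakes catalog and falls back to a plain count when no fakes catalog is supplied. It does not use the LSST stack: the function names, the (ra, dec) tuple layout, and the 0.5 arcsec match radius are all assumptions made for the example, not the existing metric code.

```python
import math


def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds (haversine); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a)))) * 3600.0


def count_natural_sources(sources, fakes=None, match_radius_arcsec=0.5):
    """Count sources that do not match any inserted fake.

    sources, fakes: iterables of (ra, dec) in degrees (hypothetical layout).
    fakes=None models a non-fakes pipeline, where the fakes catalog does
    not exist and every source counts.
    """
    sources = list(sources)
    if fakes is None:
        return len(sources)
    fakes = list(fakes)
    return sum(
        1 for ra, dec in sources
        if not any(
            angular_sep_arcsec(ra, dec, fra, fdec) <= match_radius_arcsec
            for fra, fdec in fakes
        )
    )


# Hypothetical inputs: three detections, one of which sits on a fake.
example_sources = [(10.0, 0.0), (10.001, 0.0), (10.002, 0.0)]
example_fakes = [(10.001, 0.00001)]
```

Making the fakes catalog an optional input is what would keep the same metric usable in both fakes and non-fakes pipelines. The brute-force loop is only for clarity; a real catalog would need an indexed match.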
For Eric Bellm, another alternative would be to give up on making the metrics fakes-invariant, and handle this at the SQuaSH level, by running each dataset with and without fakes and comparing the numbers. But that would eat more into our (still vague?) space and runtime budget.
Krzysztof Findeisen, right now we don't write the per-visit/ccd DiaObject table to disk. If we did, you could do a bit of post-processing to combine it with the DiaSource fakes matched from that visit and subtract the counts that come from fakes.
I was going to echo Chris Morrison [X]'s comment about reconstructing the association order in post-processing (and then JIRA ate my comment).
However, I'm not sure it's worth the effort, at least right now. Those specific metrics are largely meant to catch gross failures in the pipelines--the baseline itself is not particularly informative. While changing the number of inserted fakes will cause a step in our metrics, we can annotate those rare changes with our usual tools.
As we continue to build up metrics that use fakes, sky objects, and known SSObjects I expect that we'll rely less and less on these relatively simple metrics anyway.
(Also, you mentioned problems with four metrics but only listed three--which was the fourth?)
The fourth is totalUnassociatedDiaObjects, which is unusual because it needs to be calculated from the APDB. However, I think I can use the standard fakes match catalog for it, so I'm not sure why I thought it would be a problem.
So what's the verdict? Should I try to use the per-visit tables (though I don't see a config flag for that), should I stop with what I have (ip_diffim metrics converted, but ap_association ones untouched), should I abandon this ticket entirely...?
Discussed at sprint planning today; decision was to drop ap_association metrics from the scope of the ticket, and to finish integrating the modified ip_diffim metrics into the ApVerifyWithFakes pipeline.
Eventual solution ended up being completely different, because it turns out that the pipeline does distinguish fake and real sources through SFP, putting them into distinct datasets. Most metrics were using non-fakes datasets to begin with, so the only ip_diffim or pipe_tasks metric that needed fixing was fracDiaSourcesToSciSources. I've added a new difference imaging task that creates a clean diaSource catalog. I still don't think this trick can be extended up to the diaPipe metrics, because we only have one APDB.
Since even this, much simpler, solution involved some hacking of ApVerifyWithFakes, I suggest either Chris Morrison [X] or Eric Bellm as the reviewer, to check that I haven't inadvertently broken the intent of that pipeline.
Looks good. I tried to come up with a semi-coherent idea for how we could set up two separate APDBs and use them in different DiaPipe tasks, but I'm not sure it amounted to anything useful.
I've encountered a problem with the four ap_association metrics affected by this issue. Three of these, numNewDiaObjects, numUnassociatedDiaObjects, and fracUpdatedDiaObjects, are actually computed by AssociationTask during the matching process. To support these with my current approach, I'd have to cross-match the fakes to both the DIASources and the DIAObjects (separately from the "official" match to associated DIASources, which must be run after AssociationTask), then pass both sets of matches into DiaPipelineTask and thence to AssociationTask.
While I can do this, it would be a very intrusive change to ap_association. On the other hand, I don't see a way to compute these three metrics from only data products; you can tell that a DIAObject was associated with a DIASource, but the new/updated distinction requires knowledge of the association order.
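To make the association-order point concrete, here is a toy model (not AssociationTask; the function name and the use of string labels as stand-ins for sky positions are assumptions for illustration). Two visits processed in either order produce the same final set of objects, yet the per-visit new/updated counts differ.

```python
def per_visit_counts(visits):
    """Process visits in order and tally new vs. updated objects per visit.

    visits: list of (visit_name, [source_label, ...]); a label stands in
    for a sky position, and sources with the same label are assumed to
    associate to the same object. Labels are assumed unique within a visit.
    Returns ({visit_name: (num_new, num_updated)}, final_object_set).
    """
    known_objects = set()
    counts = {}
    for name, sources in visits:
        new = [s for s in sources if s not in known_objects]
        counts[name] = (len(new), len(sources) - len(new))
        known_objects.update(new)
    return counts, known_objects


visit_a = ("A", ["x", "y"])
visit_b = ("B", ["y", "z"])

forward_counts, forward_objects = per_visit_counts([visit_a, visit_b])
reverse_counts, reverse_objects = per_visit_counts([visit_b, visit_a])

print(forward_counts)   # visit A creates 2 objects when processed first
print(reverse_counts)   # visit A creates only 1 when processed second
print(forward_objects == reverse_objects)
```

Because the final object sets are identical, the end-state data products alone cannot say which visit created which object.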
Chris Morrison [X], do you know of some other way to figure out which DIASources/DIAObjects are fakes within AssociationTask?