Details
- Type: Story
- Status: Done
- Resolution: Done
- Fix Version/s: None
- Component/s: daf_butler
- Labels:
- Story Points: 4
- Epic Link:
- Team: Data Release Production
- Urgent?: No
Description
Butler has been storing summaries of which dataset types and governor dimension values are present in each collection since DM-24939, which was merged almost a year ago in order to make it into the first stable butler schema. But we still haven't actually used those summaries as intended. On this ticket, I'll:
- Provide some public interfaces for querying these summaries. These will have both a fast path that will use the summary tables and a slow path that will run a new SELECT DISTINCT or SELECT with GROUP BY; we cannot guarantee that summary entries will be deleted when datasets are deleted, so while inaccuracies are rare, we need to be clear about what they represent (i.e. "this dataset type may be present in this collection").
- Update the logic that generates subqueries for datasets to skip combinations that are known from the summary tables to have no matches. This should dramatically reduce the number of UNION clauses we have in subqueries, in most cases.
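As a rough illustration of the two points above, here is a minimal sketch using `sqlite3`. The table names (`dataset`, `collection_summary`), their columns, and the helpers `dataset_types_in` and `prune_search` are all hypothetical stand-ins, not the real butler schema or API. The fast path reads the summary table and can report a dataset type that is no longer present; the slow path runs a SELECT DISTINCT against the datasets themselves.

```python
import sqlite3

# Hypothetical, simplified schema: table and column names are illustrative
# only and do not match the real butler registry schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dataset (
        id INTEGER PRIMARY KEY, dataset_type TEXT, collection TEXT
    );
    CREATE TABLE collection_summary (
        collection TEXT, dataset_type TEXT,
        PRIMARY KEY (collection, dataset_type)
    );
    """
)
conn.executemany(
    "INSERT INTO dataset (dataset_type, collection) VALUES (?, ?)",
    [("raw", "HSC/raw/all"), ("raw", "HSC/raw/all"), ("calexp", "HSC/runs/test")],
)
# The summary still lists "flat" in HSC/calib although no such dataset
# exists: this models a summary row that outlived its deleted datasets.
conn.executemany(
    "INSERT INTO collection_summary (collection, dataset_type) VALUES (?, ?)",
    [("HSC/raw/all", "raw"), ("HSC/runs/test", "calexp"), ("HSC/calib", "flat")],
)


def dataset_types_in(collection, fast=True):
    """Return the dataset types that are (or may be) present in a collection."""
    if fast:
        # Fast path: summary lookup; the answer means "may be present".
        sql = "SELECT dataset_type FROM collection_summary WHERE collection = ?"
    else:
        # Slow path: scan the datasets themselves; the answer is exact.
        sql = "SELECT DISTINCT dataset_type FROM dataset WHERE collection = ?"
    return {row[0] for row in conn.execute(sql, (collection,))}


def prune_search(dataset_types, collections):
    """Keep only the (dataset type, collection) pairs the summaries allow."""
    return [
        (dataset_type, collection)
        for dataset_type in dataset_types
        for collection in collections
        if dataset_type in dataset_types_in(collection, fast=True)
    ]


fast = dataset_types_in("HSC/calib", fast=True)
slow = dataset_types_in("HSC/calib", fast=False)
pairs = prune_search(
    ["raw", "calexp"], ["HSC/raw/all", "HSC/runs/test", "HSC/calib"]
)
```

In this toy setup only two of the six possible (dataset type, collection) combinations survive pruning, so a query built from `pairs` would need two subqueries rather than six. A stale summary row can only cause an unnecessary subquery, never a missed result, as long as summary rows are always written when datasets are inserted.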
Open questions include:
- Where to put the new interfaces. I'm leaning towards new methods on DatasetQueryResults, so the usage would be something like
>>> registry.queryDatasets("calexp", collections="HSC/raw/all").any()
False
>>> registry.queryDatasets(..., collections="HSC/raw/all", fast=True).dataset_types().names
{"raw"}
- When to fetch the summary data (at Butler startup vs. as needed) and whether to cache it.
I took a look at the query generated by this ticket for my validation_data_hsc SQLite query that started the ball rolling on DM-31548. I can confirm that (a) the generated queries look a lot simpler, with no UNIONs (so good!), but (b) without the index, that test query is still slow. With the index, the query is fast.