I thought about this a little before, and I think the possibilities for messing things up while generating that one big query are almost infinite. The query that we generate can be thought of as a sequence of steps:
1. a dimensions query generates the complete set of dimensions (the universe), not constrained by anything
2. we constrain this universe with pre-existing datasets from the input dataset types
3. we further constrain it with the user-provided expression (and in the future probably with an implicit SkyMap and Instrument)
4. optionally, we remove quanta whose output datasets all exist already (this is not part of the query but is done by graphBuilder)
#1 should not normally generate an empty result set unless the database itself is inconsistent, but for debugging it may still be prudent to check that this step is not empty. Also, if we do constrain SkyMap/Instrument at this step, then an incorrectly specified SkyMap or Instrument name can produce an empty result set. And if we do not constrain things at this point, it may also be reasonable to check that things that are supposed to be unique (SkyMap) actually appear only once.
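As a rough illustration of the two checks on the universe (non-empty, and SkyMap appearing only once), here is a minimal sketch. It assumes the result of the dimensions query is available as a list of plain dicts; `check_universe`, the `skymap`/`tract` keys, and the sample values are hypothetical names for illustration and are not part of the actual preflight code.

```python
def check_universe(rows, unique_dimensions=("skymap",)):
    """Return a list of human-readable problems found in the dimension rows.

    `rows` is a list of dicts mapping dimension name to value (an assumed,
    simplified stand-in for the result of the unconstrained dimensions query).
    """
    problems = []
    if not rows:
        # Nothing else can be checked on an empty universe.
        problems.append("dimension universe is empty")
        return problems
    for name in unique_dimensions:
        # Collect the distinct values seen for a dimension that should be unique.
        values = sorted({row[name] for row in rows if name in row})
        if len(values) > 1:
            problems.append(f"{name} is not unique: {values}")
    return problems


# Hypothetical data: two different skymaps show up in the same universe.
rows = [
    {"skymap": "skymap_a", "tract": 1},
    {"skymap": "skymap_b", "tract": 1},
]
problems = check_universe(rows)
```

An empty list of problems means both checks passed; anything else could be turned into a diagnostic message before the later steps run.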
#2 is, I think, covered by your comment: if there are no datasets for some input dataset type, this will also result in an empty graph. Preflight does not know task names, so it cannot generate a message that includes them, but it knows the dataset type name, and GraphBuilder can use that to find the tasks that have that dataset type as input. Finding which dataset type has no datasets probably means that we need to take the dimensions query (possibly with the user-defined expression) and run it repeatedly with individual dataset types. There are of course more complicated cases where two input dataset types each have datasets but their combination is empty (e.g. non-overlapping visit ranges).
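The idea of re-running the dimensions query once per dataset type, and then per pair to catch the non-overlapping case, could look roughly like the sketch below. It uses an in-memory SQLite database with a deliberately simplified, made-up schema (a `dim` table keyed by `visit` and a `dataset` table); the real registry schema and the way preflight builds its joins are different, and `diagnose_empty_inputs` is a hypothetical helper.

```python
import sqlite3
from itertools import combinations


def diagnose_empty_inputs(conn, dataset_types, where="1=1"):
    """Return (empty_alone, empty_pairs) for the given input dataset types.

    `where` stands in for the user expression; it is pasted into the SQL
    as-is, which is acceptable only in a throwaway sketch like this one.
    """
    def count(types):
        # One join against the dataset table per dataset type being tested.
        joins = " ".join(
            f"JOIN dataset AS d{i} ON d{i}.visit = dim.visit "
            f"AND d{i}.dataset_type = ?"
            for i in range(len(types))
        )
        sql = f"SELECT COUNT(*) FROM dim {joins} WHERE {where}"
        return conn.execute(sql, list(types)).fetchone()[0]

    # Dataset types that give an empty result on their own.
    empty_alone = [t for t in dataset_types if count([t]) == 0]
    # Pairs that are each non-empty alone but empty when combined.
    non_empty = [t for t in dataset_types if t not in empty_alone]
    empty_pairs = [p for p in combinations(non_empty, 2) if count(p) == 0]
    return empty_alone, empty_pairs


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim (visit INTEGER PRIMARY KEY);
CREATE TABLE dataset (dataset_type TEXT, visit INTEGER);
INSERT INTO dim VALUES (1), (2), (3), (4);
-- 'raw' and 'calexp' cover non-overlapping visits; 'flat' has no datasets.
INSERT INTO dataset VALUES ('raw', 1), ('raw', 2), ('calexp', 3), ('calexp', 4);
""")
empty_alone, empty_pairs = diagnose_empty_inputs(conn, ["raw", "calexp", "flat"])
```

With this sample data the first list names the dataset type that has no datasets at all, and the second names the pair whose visit ranges do not overlap. Checking pairs only is a shortcut; larger combinations can be empty even when every pair is not.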
#3: the user expression can also be too restrictive or simply incorrect. In principle it can be tested with the dimensions query alone, separately from the dataset joins. Still, as with datasets, there is a possibility that each constraint gives a non-empty selection on its own but combining everything gives an empty one. (We also need better diagnostics for incorrect user expressions, which would mean replacing the current text-based expressions with a better representation from which we can generate sqlalchemy code.)
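One way to report where the selection becomes empty is to apply the constraints cumulatively and name the first stage that leaves nothing. The sketch below does this in plain Python over in-memory rows rather than in SQL; `first_empty_stage`, the stage labels, and the predicates are hypothetical and only illustrate the idea.

```python
def first_empty_stage(universe, stages):
    """Apply (label, predicate) stages in order and return the label of the
    first stage after which no rows remain, or None if rows survive them all.
    """
    rows = list(universe)
    for label, predicate in stages:
        rows = [row for row in rows if predicate(row)]
        if not rows:
            return label
    return None


# Hypothetical universe of four visits.
universe = [{"visit": v} for v in range(1, 5)]
stages = [
    # Input datasets exist only for visits 1 and 2.
    ("input datasets", lambda row: row["visit"] in {1, 2}),
    # The user expression asks for visits above 2; non-empty on its own,
    # empty once combined with the datasets constraint.
    ("user expression", lambda row: row["visit"] > 2),
]
culprit = first_empty_stage(universe, stages)
```

The returned label identifies the constraint that emptied the selection, which is a more useful message than just reporting an empty graph.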
#4 is probably the easiest to detect, because it starts with a non-empty selection and results in a graph from which some or all tasks have been excluded.
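Detecting this case amounts to comparing the set of tasks before and after pruning. A minimal sketch, assuming quanta are plain dicts with a task name and a list of output dataset references; `prune_completed`, the dict layout, and the task and dataset type names are all made up for illustration and do not reflect the actual graphBuilder data structures.

```python
def prune_completed(quanta, existing_outputs):
    """Drop quanta whose outputs all exist already.

    Returns (kept_quanta, dropped_tasks), where dropped_tasks lists the tasks
    that had quanta before pruning and have none left afterwards.
    """
    kept = []
    for quantum in quanta:
        outputs = quantum["outputs"]
        # A quantum with no declared outputs is never treated as complete.
        complete = bool(outputs) and all(o in existing_outputs for o in outputs)
        if not complete:
            kept.append(quantum)
    tasks_before = {q["task"] for q in quanta}
    tasks_after = {q["task"] for q in kept}
    return kept, sorted(tasks_before - tasks_after)


# Hypothetical quanta: each output is a (dataset type, visit) pair.
quanta = [
    {"task": "isr", "outputs": [("postISR", 1)]},
    {"task": "isr", "outputs": [("postISR", 2)]},
    {"task": "calibrate", "outputs": [("calexp", 1)]},
]
existing = {("postISR", 1), ("postISR", 2)}
kept, dropped_tasks = prune_completed(quanta, existing)
```

If `dropped_tasks` is non-empty, a message can say which tasks were excluded because their outputs already exist, which distinguishes this case from an empty query result.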