Data Management / DM-15214

QuantumGraph option to filter existing output Datasets


    Details

      Description

      The current version of the QuantumGraph (QG) builder does not check whether any of the output Datasets that it generates already exist in the Registry. We need an option to either filter those existing Datasets from the output or raise an exception.

        Attachments

          Activity

          Andy Salnikov added a comment -

          A trivial example of how it could work. Suppose we have a pipeline made of tasks T1 and T2:

          • T1 has input dataset types A and B and output dataset types C and D
          • T2 has input dataset types C and D and output dataset type E
          • there is just a single dataset in each of A and B: a and b

          If there are no existing datasets in C, D, and E then the graph would look like:

          a        c
            \    /   \
              T1       T2 -> e
            /    \   /
          b        d

          If both c and d Datasets exist already (but not e) then there are two options:

          • Raise an exception
          • Skip T1 step

          In the latter case the graph will look like:

           c
             \
                T2 -> e
             /
           d

          If only one of the c or d Datasets exists then I think the only reasonable thing to do is to throw an exception. More generally, if a task that creates a non-existing dataset also creates another dataset that already exists, this should be considered an exceptional situation.

          Another interesting possibility is that e exists but c and d are both missing. I think we can also have the same choice here (a sketch of the combined decision logic follows this list):

          • Raise an exception
          • Only run T1 step.
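
          A minimal sketch of that per-task decision, assuming we already know for each output whether it exists (all names here are illustrative, not the actual GraphBuilder API):

          def keepTask(outputsExist, skipExisting):
              """Return True if the task should stay in the graph.

              outputsExist -- dict mapping each output dataset name to whether
                              it already exists in the Registry.
              skipExisting -- if True, silently drop tasks whose outputs all exist.
              """
              existing = {name for name, exists in outputsExist.items() if exists}
              if not existing:
                  return True                      # nothing exists yet, run the task
              if len(existing) == len(outputsExist):
                  if skipExisting:
                      return False                 # e.g. drop T1 when both c and d exist
                  raise RuntimeError("all outputs already exist")
              # partial overlap is always an error
              raise RuntimeError("outputs already exist: " + ", ".join(sorted(existing)))

          # keepTask({"c": True, "d": True}, skipExisting=True)   -> False (skip T1)
          # keepTask({"c": True, "d": False}, skipExisting=True)  -> RuntimeError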

           

           

          Andy Salnikov added a comment -

          Thinking about how this could be implemented. Probably the easiest option is to return the matching Dataset ID for each output DatasetType in the same super-query (NULL should be returned for Datasets that don't exist). Pre-flight can look at the IDs which are not NULL and decide what to do with them: either filter them out or raise an exception.

          The non-trivial thing here is that this sort of query is an OUTER JOIN, and I do not know yet how to build that query with SQLAlchemy. I actually started by building the query as JOINs, but that did not work out very well, so I switched to a simpler structure. I need to look at how to revert to JOINs again.
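
          For reference, a minimal self-contained SQLAlchemy sketch (1.x-style API, with toy stand-in tables rather than the actual Registry schema) of the kind of LEFT OUTER JOIN described above; dataset_id comes back as NULL where no output dataset exists:

          import sqlalchemy as sa

          metadata = sa.MetaData()
          # Toy stand-ins for the Registry tables, just to show the construct.
          visit = sa.Table("Visit", metadata,
                           sa.Column("visit", sa.Integer),
                           sa.Column("camera", sa.String))
          dataset = sa.Table("Dataset", metadata,
                             sa.Column("dataset_id", sa.Integer),
                             sa.Column("dataset_type_name", sa.String),
                             sa.Column("visit", sa.Integer),
                             sa.Column("camera", sa.String))

          dscalexp = dataset.alias("dscalexp")
          joined = visit.outerjoin(
              dscalexp,
              sa.and_(dscalexp.c.visit == visit.c.visit,
                      dscalexp.c.camera == visit.c.camera,
                      # keeping the type restriction in ON (not WHERE) preserves
                      # the outer join for rows that have no calexp yet
                      dscalexp.c.dataset_type_name == "calexp"))
          query = sa.select([visit.c.visit, visit.c.camera,
                             dscalexp.c.dataset_id.label("calexp_id")]
                            ).select_from(joined)
          print(query)

          Note that the dataset_type_name restriction lives in the ON clause; moving it to the WHERE clause would effectively turn the outer join back into an inner join.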

          Andy Salnikov added a comment - edited

          Looking at the examples of the queries that pre-flight generates today.

          Here is a simple task which has the "raw" DatasetType (units Camera+Exposure+Sensor) as input and the "calexp" DatasetType (units Camera+Visit+Sensor) as output; its generated query is:

          SELECT
              "Visit".visit, "Exposure".exposure, "Sensor".sensor, "Camera".camera 
          FROM
              "Visit", "Exposure", "Sensor", "Camera", "PhysicalFilter", "AbstractFilter", "Dataset" AS dsraw, "DatasetCollection" AS "dsCollraw" 
          WHERE
              "Exposure".camera = "PhysicalFilter".camera AND "Exposure".physical_filter = "PhysicalFilter".physical_filter AND
              "Exposure".visit = "Visit".visit AND "Exposure".camera = "Visit".camera AND "Exposure".camera = "Camera".camera AND
              "Visit".camera = "PhysicalFilter".camera AND "Visit".physical_filter = "PhysicalFilter".physical_filter AND
              "Visit".camera = "Camera".camera AND "Sensor".camera = "Camera".camera AND
              "PhysicalFilter".abstract_filter = "AbstractFilter".abstract_filter AND "PhysicalFilter".camera = "Camera".camera AND
              dsraw.exposure = "Exposure".exposure AND dsraw.sensor = "Sensor".sensor AND
              dsraw.camera = "Camera".camera AND dsraw.dataset_id = "dsCollraw".dataset_id AND
              dsraw.dataset_type_name = 'raw' AND "dsCollraw".collection = 'raw/HSC'
          

          Using INNER JOINs this can be rewritten as:

          SELECT
              "Visit".visit, "Exposure".exposure, "Sensor".sensor, "Camera".camera 
          FROM
              "Camera"
              INNER JOIN "PhysicalFilter" ON ("PhysicalFilter".camera = "Camera".camera)
              INNER JOIN "AbstractFilter" ON ("PhysicalFilter".abstract_filter = "AbstractFilter".abstract_filter)
              INNER JOIN "Exposure" ON ("Exposure".camera = "Camera".camera AND
                      "Exposure".camera = "PhysicalFilter".camera AND "Exposure".physical_filter = "PhysicalFilter".physical_filter)
              INNER JOIN "Visit" ON ("Exposure".visit = "Visit".visit AND "Exposure".camera = "Visit".camera AND 
                       "Visit".camera = "PhysicalFilter".camera AND "Visit".physical_filter = "PhysicalFilter".physical_filter AND
                       "Visit".camera = "Camera".camera)
              INNER JOIN "Sensor" ON ("Sensor".camera = "Camera".camera)
              INNER JOIN "Dataset" AS dsraw ON (dsraw.exposure = "Exposure".exposure AND
                        dsraw.sensor = "Sensor".sensor AND dsraw.camera = "Camera".camera)
              INNER JOIN "DatasetCollection" AS "dsCollraw"  ON (dsraw.dataset_id = "dsCollraw".dataset_id)
          WHERE
              dsraw.dataset_type_name = 'raw' AND "dsCollraw".collection = 'raw/HSC'
          

          (The reason why I did not want to use JOIN in pre-flight was that the contents of ON depend on the ordering of tables in the JOIN and are non-trivial to implement, but maybe it's not too hard.)
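
          One possible way to deal with the ordering problem (a sketch only, not the pre-flight implementation): join the tables in a fixed order and, for each newly joined table, put into its ON clause only those conditions whose other table is already part of the join:

          import sqlalchemy as sa

          def buildJoin(tables, conditions):
              """Chain INNER JOINs over `tables` in the given order.

              `conditions` is a list of (tableA, tableB, clause) tuples; each
              clause is attached to the join of whichever of its two tables
              appears later in the ordering.
              """
              joined = tables[0]
              seen = {tables[0]}
              for table in tables[1:]:
                  onclause = [clause for ta, tb, clause in conditions
                              if (table is ta and tb in seen) or
                                 (table is tb and ta in seen)]
                  # sa.true() covers tables that share no units with anything
                  # joined so far (compare the ":param_1 = TRUE" quirk in the
                  # generated query shown in a later comment)
                  joined = joined.join(table,
                                       sa.and_(*onclause) if onclause else sa.true())
                  seen.add(table)
              return joined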

          Adding an OUTER JOIN to this may look like:

          SELECT
              "Visit".visit, "Exposure".exposure, "Sensor".sensor, "Camera".camera,
              dscalexp.dataset_id AS calexp_id
          FROM
              "Camera"
              INNER JOIN "PhysicalFilter" ON ("PhysicalFilter".camera = "Camera".camera)
              INNER JOIN "AbstractFilter" ON ("PhysicalFilter".abstract_filter = "AbstractFilter".abstract_filter)
              INNER JOIN "Exposure" ON ("Exposure".camera = "Camera".camera AND
                      "Exposure".camera = "PhysicalFilter".camera AND "Exposure".physical_filter = "PhysicalFilter".physical_filter)
              INNER JOIN "Visit" ON ("Exposure".visit = "Visit".visit AND "Exposure".camera = "Visit".camera AND 
                       "Visit".camera = "PhysicalFilter".camera AND "Visit".physical_filter = "PhysicalFilter".physical_filter AND
                       "Visit".camera = "Camera".camera)
              INNER JOIN "Sensor" ON ("Sensor".camera = "Camera".camera)
              INNER JOIN "Dataset" AS dsraw ON (dsraw.exposure = "Exposure".exposure AND
                        dsraw.sensor = "Sensor".sensor AND dsraw.camera = "Camera".camera)
              INNER JOIN "DatasetCollection" AS "dsCollraw"  ON (dsraw.dataset_id = "dsCollraw".dataset_id)
              LEFT OUTER JOIN "Dataset" AS dscalexp ON (dscalexp.visit = "Visit".visit AND
                        dscalexp.sensor = "Sensor".sensor AND dscalexp.camera = "Camera".camera)
              INNER JOIN "DatasetCollection" AS "dsCollcalexp"  ON (dscalexp.dataset_id = "dsCollcalexp".dataset_id)
          WHERE
              dsraw.dataset_type_name = 'raw' AND "dsCollraw".collection = 'raw/HSC' AND
              dscalexp.dataset_type_name = 'calexp' AND "dsCollcalexp".collection = 'shared/ci_hsc'
          

          Andy Salnikov added a comment - edited

          Here is an example of the SQL that pre-flight now generates for one simple calexp-to-coadd task:

          SELECT
              "AbstractFilter".abstract_filter,
              "Sensor".sensor,
              "Patch".patch,
              "Visit".visit,
              "Camera".camera,
              "SkyMap".skymap,
              "Tract".tract,
              "VisitSensorRegion".region,
              "Patch".region,
              dscalexp.dataset_id,
              "dsdeepCoadd_calexp".dataset_id 
          FROM
              "SkyMap" 
              JOIN
                  "AbstractFilter" 
                  ON :param_1          -- param_1 = TRUE, this appears here only because of sqlalchemy quirks
              JOIN
                  "Tract" 
                  ON "Tract".skymap = "SkyMap".skymap 
              JOIN
                  "Patch" 
                  ON "Patch".skymap = "SkyMap".skymap 
                  AND "Patch".tract = "Tract".tract 
                  AND "Patch".skymap = "Tract".skymap 
              JOIN
                  "Camera" 
                  ON :param_2          -- param_2 = TRUE
              JOIN
                  "PhysicalFilter" 
                  ON "PhysicalFilter".abstract_filter = "AbstractFilter".abstract_filter 
                  AND "PhysicalFilter".camera = "Camera".camera 
              JOIN
                  "Visit" 
                  ON "Visit".physical_filter = "PhysicalFilter".physical_filter 
                  AND "Visit".camera = "PhysicalFilter".camera 
                  AND "Visit".camera = "Camera".camera 
              JOIN
                  "Sensor" 
                  ON "Sensor".camera = "Camera".camera 
              JOIN
                  "VisitSensorRegion" 
                  ON "VisitSensorRegion".visit = "Visit".visit 
                  AND "VisitSensorRegion".camera = "Visit".camera 
                  AND "VisitSensorRegion".sensor = "Sensor".sensor 
                  AND "VisitSensorRegion".camera = "Sensor".camera 
              JOIN
                  "VisitSensorPatchJoin" 
                  ON "VisitSensorPatchJoin".visit = "VisitSensorRegion".visit 
                  AND "VisitSensorPatchJoin".sensor = "VisitSensorRegion".sensor 
                  AND "VisitSensorPatchJoin".camera = "VisitSensorRegion".camera 
                  AND "VisitSensorPatchJoin".tract = "Patch".tract 
                  AND "VisitSensorPatchJoin".patch = "Patch".patch 
                  AND "VisitSensorPatchJoin".skymap = "Patch".skymap 
              JOIN
                  (
                      SELECT
                          sub2calexp.dataset_id AS dataset_id,
                          sub2calexp.visit AS visit,
                          sub2calexp.sensor AS sensor,
                          sub2calexp.camera AS camera 
                      FROM
                          (
                              SELECT
                                  CASE
                                      WHEN ("DatasetCollection".collection = :collection_1) THEN :param_3  -- param_3 = 0
                                      WHEN ("DatasetCollection".collection = :collection_2) THEN :param_4  -- param_4 = 1
                                  END AS collorder,
                                  "Dataset".dataset_id AS dataset_id,
                                  "Dataset".visit AS visit,
                                  "Dataset".sensor AS sensor,
                                  "Dataset".camera AS camera 
                              FROM
                                  "Dataset" 
                                  JOIN
                                      "DatasetCollection" 
                                      ON "Dataset".dataset_id = "DatasetCollection".dataset_id 
                              WHERE
                                  "Dataset".dataset_type_name = :dataset_type_name_1 
                                  AND "DatasetCollection".collection IN (:collection_3, :collection_4)
                          ) AS sub2calexp 
                          JOIN
                              (
                                  SELECT
                                      min(CASE
                                          WHEN ("DatasetCollection".collection = :collection_1) THEN :param_3  -- param_3 = 0
                                          WHEN ("DatasetCollection".collection = :collection_2) THEN :param_4  -- param_4 = 1
                                      END) AS collorder,
                                      "Dataset".visit AS visit,
                                      "Dataset".sensor AS sensor,
                                      "Dataset".camera AS camera 
                                  FROM
                                      "Dataset" 
                                      JOIN
                                          "DatasetCollection" 
                                          ON "Dataset".dataset_id = "DatasetCollection".dataset_id 
                                  WHERE
                                      "Dataset".dataset_type_name = :dataset_type_name_1 
                                      AND "DatasetCollection".collection IN (:collection_3, :collection_4)
                                  GROUP BY
                                      visit, sensor, camera
                              ) AS sub1calexp 
                              ON sub1calexp.collorder = sub2calexp.collorder 
                              AND sub1calexp.visit = sub2calexp.visit 
                              AND sub1calexp.sensor = sub2calexp.sensor 
                              AND sub1calexp.camera = sub2calexp.camera
                  ) AS dscalexp 
                  ON dscalexp.camera = "Camera".camera 
                  AND dscalexp.visit = "Visit".visit 
                  AND dscalexp.sensor = "Sensor".sensor 
              LEFT OUTER JOIN
                  (
                      SELECT
                          "Dataset".dataset_id AS dataset_id,
                          "Dataset".tract AS tract,
                          "Dataset".abstract_filter AS abstract_filter,
                          "Dataset".patch AS patch,
                          "Dataset".skymap AS skymap 
                      FROM
                          "Dataset" 
                          JOIN
                              "DatasetCollection" 
                              ON "Dataset".dataset_id = "DatasetCollection".dataset_id 
                      WHERE
                          "Dataset".dataset_type_name = :dataset_type_name_2 
                          AND "DatasetCollection".collection = :collection_5
                  ) AS "dsdeepCoadd_calexp" 
                  ON "dsdeepCoadd_calexp".abstract_filter = "AbstractFilter".abstract_filter 
                  AND "dsdeepCoadd_calexp".patch = "Patch".patch 
                  AND "dsdeepCoadd_calexp".skymap = "SkyMap".skymap 
                  AND "dsdeepCoadd_calexp".tract = "Tract".tract
          

          It looks scary, but it works; it also searches for datasets in multiple collections and returns the correct dataset_id based on collection order.
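
          The collection-priority trick used above can be summarized in a small sketch (table and argument names are illustrative): a CASE expression maps each collection name to its position in the search order, and joining against a GROUP BY subquery that takes min() of that expression keeps, for each DataId, only the dataset from the highest-priority collection.

          import sqlalchemy as sa

          def collectionOrder(dsColl, collections):
              """CASE expression giving the search priority of a collection;
              a lower value means earlier in the user-supplied collection list."""
              return sa.case(
                  {name: order for order, name in enumerate(collections)},
                  value=dsColl.c.collection)

          # As in the generated query above:
          #   sub2calexp: datasets labelled with this "collorder"
          #   sub1calexp: min(collorder) grouped by the DataId columns
          #   joining sub2calexp to sub1calexp on collorder + the DataId columns
          #   keeps exactly one dataset_id per DataId.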

          Andy Salnikov added a comment -

          Jim Bosch, I think this is ready for review now. Pre-flight now returns a bit more information about input/output datasets, which allows GraphBuilder to either filter out Quanta for which all outputs exist or raise an exception.

          Pre-flight now also implements dataset search in multiple input collections, taking the order of collections into account; the returned DatasetRef now has an id which can be used to resolve collection ambiguity in the post-preflight stage.

          Tim Jenness, you are welcome to continue reviewing my changes too.

          Jim Bosch added a comment - edited

          This is really impressive, Andy Salnikov. I've left a number of minor and for-the-future comments on the PRs. I especially appreciate all the effort you've put into documenting and commenting the algorithms - it's still sufficiently complex that I worry about anyone other than you being able to maintain it, but all of the examples you've given really help with that.

          I'd like to expand on those "comments for the future" here:

          I think the object currently called PreFlightUnitsRow has the potential to become a really fundamental piece of the Registry API, well beyond just preflight. Iterators that yield PreFlightUnitsRow could be how we represent a repository Subset, which is something we've talked about conceptually for a long time without ever defining it at the code level. Some possibilities:

          • You could produce a Subset given a WHERE expression, some Collections, and some DatasetTypes (via selectDataUnits).
          • You could produce a Subset via custom, hand-written SQL.
          • You could consume a Subset to produce a QuantumGraph for new processing to be done.
          • You could consume a Subset to produce a QuantumGraph representing the provenance of processing already done.
          • You could consume a Subset to produce a set of DataIds (i.e. removing duplicates).
          • You could consume a Subset to produce a set of DatasetRefs.
          • You could consume a Subset to dump Registry content to (e.g.) CSV for export/transfer.
          • You could filter (consume and produce) a Subset in many different ways: only certain DatasetTypes, require certain DataUnit links to be non-null, many different kinds of spatial filters, etc.

          I think DM-15034 will need to be done before we start generalizing the concept to all of those use cases - that ticket should move some of the smarts in selectDataUnits into other classes, and it might change how we decide to map what PreFlightUnitsRow conceptually represents (DataId/DatasetRef/Region) into its actual Python attributes. I really need to find time to work on that ticket. Obviously, when we do generalize the code we have now into that Subset concept, we'd want to rename PreFlightUnitsRow to something more general-sounding at the same time.

          In the meantime, I think it'd be useful for you to just keep this idea in mind (and by all means chime in with your thoughts). In particular, I'm curious about:

          • Can we refactor any more of the preflight logic into subset producers/consumers/filters? The post-query spatial filtering to remove disjoint spatial DataUnits seems like at least one candidate for a filter.
          • Is a Subset just a regular iterator over PreFlightUnitsRow, or does it need to be a special iterator with some kind of header that describes what's common to all rows? Regular iterators would be really nice, because then we could use all kinds of itertools operations and generator-expression syntax on them (see the sketch after this list).
          • Related: would it ever be useful/possible to support heterogeneous Subsets, in which the set of links in the DataId changes from row to row? Could we implement that by just treating missing links as null/None?
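
          A sketch of the "regular iterator" flavor, with a plain namedtuple standing in for PreFlightUnitsRow (purely illustrative, not the actual class):

          import itertools
          from collections import namedtuple

          # Stand-in for PreFlightUnitsRow: a DataId dict plus its DatasetRefs.
          Row = namedtuple("Row", ["dataId", "datasetRefs"])

          def subset(rows):
              """A Subset is just an ordinary generator over rows."""
              yield from rows

          # Ordinary itertools and generator-expression machinery then applies:
          rows = [Row({"visit": 1, "sensor": 10}, {}),
                  Row({"visit": 2, "sensor": 11}, {})]
          onlyVisit1 = (r for r in subset(rows) if r.dataId["visit"] == 1)
          firstTen = list(itertools.islice(onlyVisit1, 10))
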
          Andy Salnikov added a comment -

          Jim, no objections to trying to generalize PreFlightUnitsRow if it can be made useful elsewhere. I'm not sure I have that deep a knowledge of all the concepts; I'm mostly discovering things by looking at them from the pre-flight side.

          Can we refactor any more of the preflight logic into subset producers/consumers/filters? The post-query spatial filtering to remove disjoint spatial DataUnits seems like at least one candidate for a filter.

          I'm sure things can be made more modular and reusable. Spatial filtering can probably be moved to a separate filter, but I suspect that this would be a 100% obligatory filter; I can't imagine anyone would want to deal with regions that don't overlap (ideally all selection should be based on regions only; I consider SkyMap just an optimization that most people don't need to worry about).

          Is a Subset just a regular iterator over PreFlightUnitsRow, or does it need to be a special iterator with some kind of header that describes what's common to all rows? Regular iterators would be really nice, because then we could use all kinds of itertools operations and generator-expression syntax on them.

          It is a regular iterator; all the structure is contained in the record itself (which may not be super-efficient if we have large volumes of those records).

          would it ever be useful/possible to support heterogeneous Subsets, in which the set of links in the DataId change from row to row? Could we implement that by just treating missing links as null/None?

          This could be useful for some cases, I guess. For pre-flight I don't think it matters as we still need to know which missing things we need to make.

          BTW, I don't see your comments on the PR; did you click on Submit Review?

          Jim Bosch added a comment -

          I had indeed forgotten to hit Submit in GitHub; now done.

          Andy Salnikov added a comment -

          I think I addressed all the concerns, Jenkins has passed (and I had to re-base twice), so both packages are merged now.


            People

            Assignee:
             Andy Salnikov
            Reporter:
             Andy Salnikov
            Reviewers:
            Jim Bosch
            Watchers:
            Andy Salnikov, Jim Bosch
             Votes:
             0

              Dates

              Created:
              Updated:
              Resolved:

                CI Builds

                No builds found.