  1. Data Management
  2. DM-16482

Add limited support for ExposureRange to Pre-flight

    Details

      Description

      DM-16467 cannot be completed until we have some ExposureRange support in pre-flight. The minimum support we need is the ability to specify ExposureRange in DatasetTypes; support in the user filter query can be implemented later.

            Activity

            salnikov Andy Salnikov created issue -
            salnikov Andy Salnikov made changes -
            Field Original Value New Value
            Epic Link DM-14661 [ 106365 ]
            salnikov Andy Salnikov made changes -
            Link This issue blocks DM-16467 [ DM-16467 ]
            salnikov Andy Salnikov made changes -
            Status To Do [ 10001 ] In Progress [ 3 ]
            salnikov Andy Salnikov added a comment -

            Although ExposureRange is a DataUnit, it is a very special unit and needs special handling in pre-flight. ExposureRange is basically a time interval:

                ExposureRange:
                  dependencies:
                    required:
                    - Instrument
                  link:
                  - valid_first
                  - valid_last
                  doc: >
                    An inclusive range of Exposure dates that may be open in either
                    direction, typically used to identify master calibration products.
                    There is no SQL table associated with ExposureRanges; there is no
                    additional information associated with an ExposureRange besides the
                    instrument, valid_first, and valid_last fields already present in
                    Dataset.
            

            As the doc comment says, there is no regular DataUnit table for ExposureRange as there is for other units; the only place that knows about the range is the Dataset table.

            Here is how pre-flight works today, at a very high level, with regular DataUnits, which all have corresponding tables:

            • One super-query is built that selects all combinations of DataUnits that participate in a particular pipeline (limited by existing inputs and the user filter).
            • One can think of the result of that query as the full set of "paths" connecting inputs and outputs of the pipeline (including all intermediates); each row in the result is one such path.
            • The query is built as a series of joins:
              • First we extract all units in the input/output datasets and join them on their dependencies.
              • Then we use DataUnitJoins to join DataUnits spatially.
              • These two steps make a "unit universe" of all units that can potentially exist.
              • Next, for each input dataset type we join that universe with the Dataset table using the links from the dataset type's units. This is an inner equi-join (the Dataset link value has to equal the DataUnit value from the unit universe), so it limits the query to inputs that already exist. (The actual implementation is complicated by the Dataset-collection relationship, but that is not important here.)
              • Next we do an outer join of that "input universe" with the output dataset types. This is only needed to check that outputs don't already exist; it does not limit anything.
              • The last step is to add the user filter query to the WHERE clause of the super-query. It needs to be written in terms of DataUnit tables and their columns. (Dataset tables are wrapped in multiple sub-queries whose names are not known to users, so they can't be used. This is not an issue: because all unit values enter the query via equi-joins, it does not matter which one is used.)
            • The SELECT list of the super-query is made of the link column names of all DataUnits used in the query. It also includes the region columns from the DataUnitJoins so that we can do region-based filtering when we return data.
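
            The steps above can be illustrated with a toy query. This is only a sketch, not the actual pre-flight code: the table layouts, the 'raw' and 'calexp' dataset type names, and the helper functions are assumptions made for the example, and the spatial DataUnitJoin step and the Dataset-collection relationship are left out.

```python
import sqlite3


def make_demo_registry():
    """Create an in-memory database with toy DataUnit and Dataset tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Exposure (instrument TEXT, exposure INTEGER);
        CREATE TABLE Sensor (instrument TEXT, sensor INTEGER);
        CREATE TABLE Dataset (dataset_id INTEGER PRIMARY KEY, dataset_type TEXT,
                              instrument TEXT, exposure INTEGER, sensor INTEGER);
        INSERT INTO Exposure VALUES ('cam', 1), ('cam', 2), ('cam', 3);
        INSERT INTO Sensor VALUES ('cam', 10), ('cam', 11);
        INSERT INTO Dataset (dataset_type, instrument, exposure, sensor) VALUES
            ('raw', 'cam', 1, 10), ('raw', 'cam', 1, 11),
            ('raw', 'cam', 2, 10), ('raw', 'cam', 2, 11),
            ('raw', 'cam', 3, 10),
            ('calexp', 'cam', 1, 10);
    """)
    return conn


SUPER_QUERY = """
    SELECT Exposure.instrument, Exposure.exposure, Sensor.sensor,
           out_ds.dataset_id AS existing_output
    -- unit universe: DataUnits joined on their dependencies
    FROM Exposure
    JOIN Sensor ON Sensor.instrument = Exposure.instrument
    -- input dataset type: inner equi-join keeps only existing inputs
    JOIN (SELECT * FROM Dataset WHERE dataset_type = 'raw') in_ds
      ON in_ds.instrument = Exposure.instrument
     AND in_ds.exposure = Exposure.exposure
     AND in_ds.sensor = Sensor.sensor
    -- output dataset type: outer join only reports pre-existing outputs
    LEFT OUTER JOIN (SELECT * FROM Dataset WHERE dataset_type = 'calexp') out_ds
      ON out_ds.instrument = Exposure.instrument
     AND out_ds.exposure = Exposure.exposure
     AND out_ds.sensor = Sensor.sensor
    -- user filter, written in terms of DataUnit tables only
    WHERE {user_filter}
    ORDER BY Exposure.exposure, Sensor.sensor
"""


def run_super_query(conn, user_filter="1 = 1"):
    # The filter is pasted in verbatim for illustration only; real code
    # would have to parse or validate user expressions.
    return conn.execute(SUPER_QUERY.format(user_filter=user_filter)).fetchall()


conn = make_demo_registry()
for row in run_super_query(conn, "Exposure.exposure <= 2"):
    print(row)
```

            Each returned row is one "path"; the last column is non-NULL only where the output already exists.
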
            salnikov Andy Salnikov added a comment - - edited

            ExposureRange is a time interval (or a set of Exposures), so we cannot use an equi-join for the corresponding dataset. Instead, for input datasets we should do something like:

            -- "UnitUniverse" must include Exposure table.
            UnitUniverse JOIN 
                (SELECT Dataset.valid_first, Dataset.valid_last, ...
                 FROM Dataset JOIN DatasetCollection .......) DS_42
            ON (DS_42.valid_first <= Exposure.datetime_begin AND DS_42.valid_last >= Exposure.datetime_end)
            

            The idea is simple: find an ExposureRange that fully covers an Exposure. I think it is expected that there is always exactly one such dataset; it's not clear what to do if there is none or more than one (if there is none, we skip that part of the unit universe entirely, but if there is more than one, many ExposureRanges will be returned). I also expect this join to be much less efficient than a simple equi-join.
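
            To make the zero/one/many cases concrete, here is a small runnable sketch of the covering join. The integer timestamps, the 'bias' dataset type, and the simplified Exposure and Dataset tables are assumptions made for the example; the real schema has more columns and goes through DatasetCollection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Exposure (instrument TEXT, exposure INTEGER,
                           datetime_begin INTEGER, datetime_end INTEGER);
    CREATE TABLE Dataset (dataset_id INTEGER PRIMARY KEY, dataset_type TEXT,
                          instrument TEXT, valid_first INTEGER, valid_last INTEGER);
    INSERT INTO Exposure VALUES
        ('cam', 1, 100, 110), ('cam', 2, 200, 210), ('cam', 3, 300, 310);
    INSERT INTO Dataset VALUES
        (1, 'bias', 'cam', 50, 250),
        (2, 'bias', 'cam', 150, 250);
""")

# Non-equi join: keep a Dataset only if its validity range fully covers
# the Exposure's time span.
RANGE_JOIN = """
    SELECT Exposure.exposure, DS.dataset_id
    FROM Exposure
    JOIN (SELECT * FROM Dataset WHERE dataset_type = 'bias') DS
      ON DS.instrument = Exposure.instrument
     AND DS.valid_first <= Exposure.datetime_begin
     AND DS.valid_last >= Exposure.datetime_end
    ORDER BY Exposure.exposure, DS.dataset_id
"""

# Start from every exposure so that those with no match stay visible.
matches = {exposure: [] for (exposure,) in conn.execute("SELECT exposure FROM Exposure")}
for exposure, dataset_id in conn.execute(RANGE_JOIN):
    matches[exposure].append(dataset_id)

for exposure, ids in sorted(matches.items()):
    status = "ok" if len(ids) == 1 else ("missing" if not ids else "ambiguous")
    print(exposure, ids, status)
```

            Exposure 1 has exactly one covering dataset, exposure 2 has two overlapping ones, and exposure 3 has none, so it silently drops out of the inner join.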

            It is not at all clear what to do for output dataset types with ExposureRange units. The whole idea of pre-flight is to generate all possible outputs, but for ExposureRange the set of possible outputs is probably every combination of Exposure begin/end times (or even beyond that), so it's not useful. We'll need some other idea for handling outputs later.

            Because there is no ExposureRange table, no corresponding columns are added to the SELECT list; we need to add those too. Unlike all other DataUnits, which can only enter a result row once (e.g. only one Exposure can appear on one path even if many tasks work on it), there may be several different ExposureRanges in a row: e.g. one calibration may have an exposure range covering all time from minus to plus infinity, while another is produced on a per-night basis. So each DatasetType with ExposureRange needs its own set of valid_first/valid_last columns in the result.
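
            One way to picture the per-DatasetType columns is a query builder that adds one range join and one aliased valid_first/valid_last pair per dataset type. This is a hypothetical sketch: build_range_query, the 'bias' and 'flat' type names, the integer timestamps, and the very wide range standing in for "all time" are assumptions, not the merged implementation.

```python
import sqlite3


def build_range_query(dataset_types):
    """Return SQL with one range join and one column pair per dataset type."""
    select = ["Exposure.exposure"]
    joins = []
    for name in dataset_types:
        # Names become SQL identifiers, so accept only plain identifiers.
        if not name.isidentifier():
            raise ValueError(f"unsafe dataset type name: {name!r}")
        alias = f"DS_{name}"
        select.append(f"{alias}.valid_first AS {name}_valid_first")
        select.append(f"{alias}.valid_last AS {name}_valid_last")
        joins.append(
            f"JOIN (SELECT * FROM Dataset WHERE dataset_type = ?) {alias}"
            f" ON {alias}.instrument = Exposure.instrument"
            f" AND {alias}.valid_first <= Exposure.datetime_begin"
            f" AND {alias}.valid_last >= Exposure.datetime_end"
        )
    return (f"SELECT {', '.join(select)} FROM Exposure "
            f"{' '.join(joins)} ORDER BY Exposure.exposure")


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Exposure (instrument TEXT, exposure INTEGER,
                           datetime_begin INTEGER, datetime_end INTEGER);
    CREATE TABLE Dataset (dataset_type TEXT, instrument TEXT,
                          valid_first INTEGER, valid_last INTEGER);
    INSERT INTO Exposure VALUES ('cam', 1, 100, 110), ('cam', 2, 1100, 1110);
    INSERT INTO Dataset VALUES
        ('bias', 'cam', 0, 1000000),                           -- stand-in for "all time"
        ('flat', 'cam', 0, 999), ('flat', 'cam', 1000, 1999);  -- per-night
""")

types = ["bias", "flat"]
# One "?" placeholder per join, bound in the same order as the joins.
rows = conn.execute(build_range_query(types), types).fetchall()
for row in rows:
    print(row)
```

            Both exposures share the same bias range but pick up different flat ranges, which is why a single pair of range columns per row would not be enough.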

            salnikov Andy Salnikov added a comment - - edited

            We also have ExposureRangeJoin in dataUnits.yaml, with this definition:

                ExposureRangeJoin:
                  doc: >
                    A calculated join between Datasets identified with an Exposure
                    (typically raw science frames) and Datasets identified with
                    ExposureRange (typically master calibrations).
                  lhs: [Exposure]
                  rhs: [ExposureRange]
                  sql:
                    (lhs.instrument = rhs.instrument) AND
                    (lhs.datetime_begin BETWEEN rhs.valid_first AND rhs.valid_last)
            

            However, pre-flight currently does not know how to use it correctly (and in reality it is a join between Dataset and Exposure).
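
            Note that this condition only tests datetime_begin, while the join sketched in the earlier comment requires the range to cover the whole exposure. The two can disagree for an exposure that starts inside a range and ends after it. A minimal illustration with plain predicates (the function names and integer timestamps are assumptions for the example, and the instrument equality is omitted):

```python
def begins_within(exp_begin, exp_end, valid_first, valid_last):
    """Condition used by ExposureRangeJoin: only the start time is tested."""
    return valid_first <= exp_begin <= valid_last


def fully_covered(exp_begin, exp_end, valid_first, valid_last):
    """Condition from the earlier sketch: the range covers the whole exposure."""
    return valid_first <= exp_begin and valid_last >= exp_end


# An exposure that straddles the end of a validity range.
exposure = (240, 260)
valid_range = (50, 250)

print(begins_within(*exposure, *valid_range))  # start time is inside the range
print(fully_covered(*exposure, *valid_range))  # end time falls outside it
```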

            jbosch Jim Bosch added a comment -

            One more wrinkle: I've recently come to the conclusion that we need another level of indirection between ExposureRange and Dataset for calibration Datasets themselves. In other words, I think we need to eventually change the DataUnit schema to:

            • Add a new CalibIdentifier DataUnit.  This would be a string, and it would be part of the Dataset table instead of the current valid_begin and valid_end fields.
            • We would have a separate CalibIdentifier DataUnit table that has valid_begin and valid_end.
            • We could create a CalibIdentifierExposureJoin view that calculates a many-to-many join between CalibIdentifier and Exposure.

            I don't have a ticket for that yet, and I don't think we need it to be done before this ticket.  But I think that change actually makes this relationship behave much more like existing DataUnit relationships, and hence it might make sense to do it first (possibly on this ticket, if you're ambitious).
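
            A rough sketch of what this proposal could look like, to show how it turns the range join into ordinary equi-joins. Everything here is hypothetical: the column layout, the 'night-1'/'night-2' identifiers, the integer timestamps, and the choice of the datetime_begin BETWEEN condition for the view are assumptions, since no schema has been written yet.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Exposure (instrument TEXT, exposure INTEGER,
                           datetime_begin INTEGER, datetime_end INTEGER);
    -- The validity range moves out of Dataset into its own DataUnit table.
    CREATE TABLE CalibIdentifier (instrument TEXT, calib_identifier TEXT,
                                  valid_begin INTEGER, valid_end INTEGER);
    -- Dataset only carries the string identifier.
    CREATE TABLE Dataset (dataset_id INTEGER PRIMARY KEY, dataset_type TEXT,
                          instrument TEXT, calib_identifier TEXT);
    -- Calculated many-to-many join between CalibIdentifier and Exposure.
    CREATE VIEW CalibIdentifierExposureJoin AS
        SELECT e.instrument AS instrument, e.exposure AS exposure,
               c.calib_identifier AS calib_identifier
        FROM Exposure e
        JOIN CalibIdentifier c
          ON c.instrument = e.instrument
         AND e.datetime_begin BETWEEN c.valid_begin AND c.valid_end;

    INSERT INTO Exposure VALUES ('cam', 1, 100, 110), ('cam', 2, 1100, 1110);
    INSERT INTO CalibIdentifier VALUES
        ('cam', 'night-1', 0, 999), ('cam', 'night-2', 1000, 1999);
    INSERT INTO Dataset VALUES
        (1, 'bias', 'cam', 'night-1'), (2, 'bias', 'cam', 'night-2');
""")

# With the view in place, the Dataset join is a plain equi-join again.
rows = conn.execute("""
    SELECT Exposure.exposure, j.calib_identifier, d.dataset_id
    FROM Exposure
    JOIN CalibIdentifierExposureJoin j
      ON j.instrument = Exposure.instrument AND j.exposure = Exposure.exposure
    JOIN Dataset d
      ON d.instrument = j.instrument AND d.calib_identifier = j.calib_identifier
    WHERE d.dataset_type = 'bias'
    ORDER BY Exposure.exposure
""").fetchall()
for row in rows:
    print(row)
```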

            salnikov Andy Salnikov added a comment -

            Jim Bosch, I have implemented minimal changes to support ExposureRange, though this is just a temporary hack to unblock the isrTask migration (but the unit test should be solid). I do agree that we need a better schema to reflect the nature of calibrations, but I think it has to be something different from regular DataUnits. In preflight I have to treat ExposureRange differently from other DataUnits, and the reason is that there can only be a single regular DataUnit per Quantum, but there may be multiple different ExposureRanges in the same Quantum. That distinction will remain even if you rename it to CalibIdentifier, so something in the schema should reflect this difference.

            salnikov Andy Salnikov made changes -
            Reviewers Jim Bosch [ jbosch ]
            Status In Progress [ 3 ] In Review [ 10004 ]
            jbosch Jim Bosch added a comment -

            there can only be a single regular DataUnit per Quantum but there may be multiple different ExposureRanges in the same Quantum

            How is this different from any other gather-style step?  In the classic coaddition example, there are multiple Visits that contribute (via Visit+Patch inputs) to a Quantum (Patch+Filter).

            (That question aside, I'll start looking at the code changes now.)

            salnikov Andy Salnikov added a comment -

            The difference between ExposureRanges and other DataUnits is that for normal units all relationships are defined at the DataUnit level and do not depend on Datasets. If there is a many-to-one or many-to-many relationship between units, it is determined by the structure/topology of the "unit universe" (e.g. the relation between Visits and Patches is determined by the corresponding DataUnits/Joins). That topology may also be trimmed by the pre-existing input Datasets, but in general the unit topology is the same for any Dataset. If, for example, we build a graph to produce two DatasetTypes with a Patch unit, then the inputs to those DatasetTypes will be an identical set of Visits (the latter can also come from different DatasetTypes, and may be trimmed by existing inputs).

            For ExposureRanges the whole thing is different: we do not have a DataUnit-level relationship between them and other units. By its nature (or by its size) that relationship can only be defined at the Dataset level, and it is Dataset-specific. So in the same Quantum we end up with one set of ExposureRange units for one calibration and a completely unrelated set of ExposureRange units for another calibration. It is of course possible to say that a whole universe of ExposureRange units exists, with all possible combinations of begin/end validity times, and to include all of them (with some time constraints) in quanta, but that is not really helpful (even at second-level precision this is a lot), especially for defining outputs with those units.
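
            As a back-of-the-envelope check on "a lot": with n distinct time points there are n(n+1)/2 closed intervals [first, last] with first <= last. The sketch below assumes one-second resolution over a single 24-hour day; both the function name and that time window are illustrative choices.

```python
def count_closed_intervals(n_points):
    """Number of [first, last] pairs with first <= last over n_points time points."""
    return n_points * (n_points + 1) // 2


# Sanity check against brute-force enumeration for a small case.
n = 5
brute_force = sum(1 for first in range(n) for last in range(first, n))
print(brute_force == count_closed_intervals(n))

# One day at one-second resolution.
seconds_per_day = 24 * 60 * 60
print(count_closed_intervals(seconds_per_day))
```

            Even for a single day this gives roughly 3.7 billion candidate ranges, so enumerating them as a "unit universe" is not practical.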

            jbosch Jim Bosch added a comment -

            Ah, ok, I understand now.  I am thinking that when we add the extra level of indirection with CalibIdentifier, then we will need to have a separate CalibIdentifier DataUnit table that contains all of the ExposureRange information.  That would make it like a regular (pre-defined) DataUnit rather than something strictly pulled out of the Dataset table.

            Anyhow, code all looks good for now, and while I hope to make the schema changes that would allow us to clean it up soon, I'm not yet sure when I'll get to them.

            jbosch Jim Bosch made changes -
            Status In Review [ 10004 ] Reviewed [ 10101 ]
            salnikov Andy Salnikov added a comment -

            Thanks for the review! Merged to master.

            A new CalibIdentifier would help make it uniform, though maintaining the CalibIdentifier <-> Exposure join table may become an issue.

            salnikov Andy Salnikov made changes -
            Resolution Done [ 10000 ]
            Status Reviewed [ 10101 ] Done [ 10002 ]

              People

              Assignee:
              salnikov Andy Salnikov
              Reporter:
              salnikov Andy Salnikov
              Reviewers:
              Jim Bosch
              Watchers:
              Andy Salnikov, Jim Bosch, Vaikunth Thukral
