Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-15049

Continue QuantumGraph implementation

    XMLWordPrintable

    Details

      Description

      This is a continuation of work started in DM-14334. Main goal of this ticket is to extend preflight solver with joins between skymap units and camera units so that we have more or less complete solution for building quantum graphs. 

        Attachments

          Activity

          No builds found.
          vaikunth Vaikunth Thukral created issue -
          vaikunth Vaikunth Thukral made changes -
          Field Original Value New Value
          Epic Link DM-14661 [ 106365 ]
          vaikunth Vaikunth Thukral made changes -
          Risk Score 0
          salnikov Andy Salnikov made changes -
          Labels gen3-middleware
          salnikov Andy Salnikov made changes -
          Description This is a continuation of work started in DM-14334. Main goal of this ticket is to extend preflight solver with joins between skymap units and camera units so that we have more or less complete solution for building quantum graphs. 
          salnikov Andy Salnikov made changes -
          Status To Do [ 10001 ] In Progress [ 3 ]
          Hide
          salnikov Andy Salnikov added a comment -

          I just had a quick look at the options for implementing joins with "join tables" and it turns out there is no trivial way to do that. For non-join tables I join those using foreign key definition in SQLAlchemy (SA) schema. It is even possible in trivial cases to use SA join() method which does everything automagically, but our case is not that trivial so I have to generate all those joins myself just looking at Table foreign key constraints.

          With Join tables situation is problematic because those are not actually tables but views and views do not have foreign key constraints. What is also problematic is that DataUnitRegistry does not store foreign key info in DataUnitJoin instances. I'll probably have to extend that class to include that otherwise there is no place where to get this info from.

          Show
          salnikov Andy Salnikov added a comment - I just had a quick look at the options for implementing joins with "join tables" and it turns out there is no trivial way to do that. For non-join tables I join those using foreign key definition in SQLAlchemy (SA) schema. It is even possible in trivial cases to use SA join() method which does everything automagically, but our case is not that trivial so I have to generate all those joins myself just looking at Table foreign key constraints. With Join tables situation is problematic because those are not actually tables but views and views do not have foreign key constraints. What is also problematic is that DataUnitRegistry does not store foreign key info in DataUnitJoin instances. I'll probably have to extend that class to include that otherwise there is no place where to get this info from.
          Hide
          salnikov Andy Salnikov added a comment -

          I think it is possible to workaround FK issue by making SA.Table instances for those view instead of TableClause as we do now. Table can have FK added to it even if it represents database view (example is in SA doc: http://docs.sqlalchemy.org/en/latest/core/reflection.html#reflecting-views).

          More interesting issue that bothers me now is the relation between PhysicalFilter and AbstractFilter units. I'm trying to generate graph for a simple task which involves SkyMap/Camera joins, the units are:

          • input: Camera, Visit, Sensor
          • output: SkyMap, Tract, Patch, AbstractFilter
          • input is also constrained by existing Datasets.

          PhysicalFilter is not in the unit list, so the first problem to understand is how PhysicalFilter table has to be brought into the picture. The reason why we need PhysicalFilter table is the presence of AbstractFilter in output units, without that we would not even need to look at what PhysicalFilter contains (user expression may depend on physical_filter but that does not matter yet). OTOH if there is an AbstractFilter in the units but no Camera/Visit/Senson units then PhysicalFilter is not needed too. Not sure how to express that logic in schema definition.

          Another issue is that there is an implicit link between PhysicalFilter and AbstractFilter but we do not have any FK relationship defined for them in the schema. Logically I think PhysicalFilter.abstract_filter column should be a FK for AbstractFilter.abstract_filter, but I'm not sure that we can do it in current schema where AbstractFilter is a view of PhysicalFilter (Maybe we should make AbstractFilter an actual table?).

          Jim Bosch, if you have though about this could you suggest what needs to be done here?

          Show
          salnikov Andy Salnikov added a comment - I think it is possible to workaround FK issue by making SA.Table instances for those view instead of TableClause as we do now. Table can have FK added to it even if it represents database view (example is in SA doc: http://docs.sqlalchemy.org/en/latest/core/reflection.html#reflecting-views ). More interesting issue that bothers me now is the relation between PhysicalFilter and AbstractFilter units. I'm trying to generate graph for a simple task which involves SkyMap/Camera joins, the units are: input: Camera, Visit, Sensor output: SkyMap, Tract, Patch, AbstractFilter input is also constrained by existing Datasets. PhysicalFilter is not in the unit list, so the first problem to understand is how PhysicalFilter table has to be brought into the picture. The reason why we need PhysicalFilter table is the presence of AbstractFilter in output units, without that we would not even need to look at what PhysicalFilter contains (user expression may depend on physical_filter but that does not matter yet). OTOH if there is an AbstractFilter in the units but no Camera/Visit/Senson units then PhysicalFilter is not needed too. Not sure how to express that logic in schema definition. Another issue is that there is an implicit link between PhysicalFilter and AbstractFilter but we do not have any FK relationship defined for them in the schema. Logically I think PhysicalFilter.abstract_filter column should be a FK for AbstractFilter.abstract_filter, but I'm not sure that we can do it in current schema where AbstractFilter is a view of PhysicalFilter (Maybe we should make AbstractFilter an actual table?). Jim Bosch , if you have though about this could you suggest what needs to be done here?
          Hide
          salnikov Andy Salnikov added a comment - - edited

          To clarify things a bit for Jim Bosch, this is how I currently try to approach building the query which extracts all unit combinations:

          • Extract all DataUnits from all DatasetTypes (both inputs and outputs)
          • SELECT column list:
            • for each DataUnit add its link column(s) to the SELECT column list
          • WHERE clause
            • for each DataUnit table join it with its "required" DataUnits (using FK definition from sqlalchemy table schema)
            • for each DataUnit table also join it with its "optional" DataUnits if optional DataUnit in in the DataUnit list
            • select DataUnitJoins in the units registry for which all "lhs" and "rhs" are in the DataUnits list
            • for each of the selected DataUnitJoin:
              • if a "region table" exists for "lhs" or "rhs" DataUnits join that table with its corresponding DataUnits
              • join DataUnitJoin table using all of its FK definition from SA schema (there is a slight problem with views which I think can be solved)
            • filter based on existing input data - for each of the input DatasetType:
              • define per-datasettype aliases for Dataset and DatasetCollection tables
              • join Dataset alias with the DataUnits from the input DatasetType (this is currently using liks definition but could switch to SA FK definition instead)
              • add a filter using dataset_type_name and collection name

          This is not terribly complicated (implementation is a bit messy today but will be improved).

          I execute this on a simple task which has input DatasetType "calexp" (from ci_hsc registry) and output DatasetType "deepCoadd_calexp", with DataUnits (Camera, Visit, Sensor) and (SkyMap, Tract, Patch, AbstractFilter) respectively. Trouble here is that PhysicalFilter is not in the input/output DataUnits so it does not appear in the query and is not joined with AbstractFilter. As a result quanta contain a cartesian product with every possible AbstractFilter in the output.

           

          Show
          salnikov Andy Salnikov added a comment - - edited To clarify things a bit for Jim Bosch , this is how I currently try to approach building the query which extracts all unit combinations: Extract all DataUnits from all DatasetTypes (both inputs and outputs) SELECT column list: for each DataUnit add its link column(s) to the SELECT column list WHERE clause for each DataUnit table join it with its "required" DataUnits (using FK definition from sqlalchemy table schema) for each DataUnit table also join it with its "optional" DataUnits if optional DataUnit in in the DataUnit list select DataUnitJoins in the units registry for which all "lhs" and "rhs" are in the DataUnits list for each of the selected DataUnitJoin: if a "region table" exists for "lhs" or "rhs" DataUnits join that table with its corresponding DataUnits join DataUnitJoin table using all of its FK definition from SA schema (there is a slight problem with views which I think can be solved) filter based on existing input data - for each of the input DatasetType: define per-datasettype aliases for Dataset and DatasetCollection tables join Dataset alias with the DataUnits from the input DatasetType (this is currently using liks definition but could switch to SA FK definition instead) add a filter using dataset_type_name and collection name This is not terribly complicated (implementation is a bit messy today but will be improved). I execute this on a simple task which has input DatasetType "calexp" (from ci_hsc registry) and output DatasetType "deepCoadd_calexp", with DataUnits (Camera, Visit, Sensor) and (SkyMap, Tract, Patch, AbstractFilter) respectively. Trouble here is that PhysicalFilter is not in the input/output DataUnits so it does not appear in the query and is not joined with AbstractFilter. As a result quanta contain a cartesian product with every possible AbstractFilter in the output.  
          Hide
          jbosch Jim Bosch added a comment -

          I think the first thing to do is to try to start using our own definitions of the unit relationships (i.e. the objects in daf/butler/core/dataUnits.py) in the preflight code, so you're not relying on the SQLAlchemy foreign key relationships, as we may just not be able to make those as expressive as what we define ourselves.

          I'm a bit ambivalent about whether it's a good idea to then use SQLAlchemy to write the query itself - that might make things more portable as long as the Registry is implemented on top of SQLAlchemy, but means we can't use this (shared) component at all if the Registry isn't implemented on top of SQLAlchemy, and while we want to use SQLAlchemy to make it easier to implement Registries, we don't want to make it part of any interface.  Of course, using it in the implementation of a shared component isn't quite the same as making it part of an interface - we could always reimplement the shared component when if/when we have a non-SQLAlchemy Registry.  In any case, I'm happy to follow your judgement on that question for now, but I wanted to make sure you understood the situation.

          Anyhow, I'm hoping we may already have information in our own DataUnit classes that could be used t say that whenever Visit appears, PhysicalFilter should as well.  That might make PhysicalFilter appear when it isn't needed, but that's "only" a performance problem, and I hope it may not be a meaningful one.  I'm pretty sure our DataUnit classes should report that there is a foreign-key-like relationship between AbstractFilter and PhysicalFilter, even if there isn't one in the actual SQL.

           

           

          Show
          jbosch Jim Bosch added a comment - I think the first thing to do is to try to start using our own definitions of the unit relationships (i.e. the objects in daf/butler/core/dataUnits.py) in the preflight code, so you're not relying on the SQLAlchemy foreign key relationships, as we may just not be able to make those as expressive as what we define ourselves. I'm a bit ambivalent about whether it's a good idea to then use SQLAlchemy to write the query itself - that might make things more portable as long as the Registry is implemented on top of SQLAlchemy, but means we can't use this (shared) component at all if the Registry isn't implemented on top of SQLAlchemy, and while we want to use SQLAlchemy to make it easier to implement Registries, we don't want to make it part of any interface.  Of course, using it in the implementation of a shared component isn't quite the same as making it part of an interface - we could always reimplement the shared component when if/when we have a non-SQLAlchemy Registry.  In any case, I'm happy to follow your judgement on that question for now, but I wanted to make sure you understood the situation. Anyhow, I'm hoping we may already have information in our own DataUnit classes that could be used t say that whenever Visit appears, PhysicalFilter should as well.  That might make PhysicalFilter appear when it isn't needed, but that's "only" a performance problem, and I hope it may not be a meaningful one.  I'm pretty sure our DataUnit classes should report that there is a foreign-key-like relationship between AbstractFilter and PhysicalFilter, even if there isn't one in the actual SQL.    
          Hide
          jbosch Jim Bosch added a comment -

          I think our messages posted while we were both writing, and it sounds like you're already using our DataUnit relationships at least some of the time.  So two follow-up questions:

          • Does the PhysicalFilter problem go away if you expand the list of DataUnits to include all optional dependencies ** after populating the SELECT columns but before starting to generate the WHERE?
          • Is there enough information in the combination of the requiredDependencies, optionalDependencies, and link attributes of the DataUnit classes to generate the join statements without looking at foreign key definitions (at least for relationships that don't involve a many-to-many table)?  If not, we can make sure that information is added.
          Show
          jbosch Jim Bosch added a comment - I think our messages posted while we were both writing, and it sounds like you're already using our DataUnit relationships at least some of the time.  So two follow-up questions: Does the PhysicalFilter problem go away if you expand the list of DataUnits to include all optional dependencies ** after populating the SELECT columns but before starting to generate the WHERE? Is there enough information in the combination of the requiredDependencies, optionalDependencies, and link attributes of the DataUnit classes to generate the join statements without looking at foreign key definitions (at least for relationships that don't involve a many-to-many table)?  If not, we can make sure that information is added.
          Hide
          salnikov Andy Salnikov added a comment -

          I do agree that we should try to make things as generic as possible but preflight is closely coupled with registry implementation and I'm not sure we can define a practically useful API that is implementation-independent. We can try to generalize things if/when we work on another implementation of registry/preflight. Right now even if I only use schema definitions I'll still has to write SQL which matches SqlRegistry's transformation of that schema into database level, which is more or less equivalent to using SQLAlchemy schema (but potentially with more code).

          Regarding Filter units - I can try to extend DataUnits set to include optional dependencies though this does not make them optional any more (in preflight sense). This can probably solve Filters issue, I think their dependencies already setup correctly.

          One more thing that I was thinking about - in preflight I'll need to further filter query results based on the region overlaps so I'll need to return not only DataUnit value but also regions for some data units. I think right now there is no info in the schema which would tell me that particular DataUnit has associated region (or the column name used for region). I can probably assume that all region columns are named "region" and I can probably guess which DataUnits have them but it may also be worth adding that into to schema as well.

           

           

          Show
          salnikov Andy Salnikov added a comment - I do agree that we should try to make things as generic as possible but preflight is closely coupled with registry implementation and I'm not sure we can define a practically useful API that is implementation-independent. We can try to generalize things if/when we work on another implementation of registry/preflight. Right now even if I only use schema definitions I'll still has to write SQL which matches SqlRegistry's transformation of that schema into database level, which is more or less equivalent to using SQLAlchemy schema (but potentially with more code). Regarding Filter units - I can try to extend DataUnits set to include optional dependencies though this does not make them optional any more (in preflight sense). This can probably solve Filters issue, I think their dependencies already setup correctly. One more thing that I was thinking about - in preflight I'll need to further filter query results based on the region overlaps so I'll need to return not only DataUnit value but also regions for some data units. I think right now there is no info in the schema which would tell me that particular DataUnit has associated region (or the column name used for region). I can probably assume that all region columns are named "region" and I can probably guess which DataUnits have them but it may also be worth adding that into to schema as well.    
          Hide
          jbosch Jim Bosch added a comment -

          I do agree that we should try to make things as generic as possible but preflight is closely coupled with registry implementation and I'm not sure we can define a practically useful API that is implementation-independent. We can try to generalize things if/when we work on another implementation of registry/preflight.

          Sounds reasonable to me.

          Regarding Filter units - I can try to extend DataUnits set to include optional dependencies though this does not make them optional any more (in preflight sense). This can probably solve Filters issue, I think their dependencies already setup correctly.

          I think of "optional" as referring to the "do I need this DataUnit to make a unique Data ID" question, so I'm not bothered by this change.

          One more thing that I was thinking about - in preflight I'll need to further filter query results based on the region overlaps so I'll need to return not only DataUnit value but also regions for some data units. I think right now there is no info in the schema which would tell me that particular DataUnit has associated region (or the column name used for region). I can probably assume that all region columns are named "region" and I can probably guess which DataUnits have them but it may also be worth adding that into to schema as well.

          Yes, I think it is safe to assume the region is called "region", but we should make this explicit.  I think I took some steps towards this in the Python API recently in DataUnitRegistry.getRegionTable (note that this can return the DataUnit's own table), but we should make the logic for it is more explicit in the YAML.  Please do feel free to make changes to that yourself in daf_butler.

          Show
          jbosch Jim Bosch added a comment - I do agree that we should try to make things as generic as possible but preflight is closely coupled with registry implementation and I'm not sure we can define a practically useful API that is implementation-independent. We can try to generalize things if/when we work on another implementation of registry/preflight. Sounds reasonable to me. Regarding Filter units - I can try to extend DataUnits set to include optional dependencies though this does not make them optional any more (in preflight sense). This can probably solve Filters issue, I think their dependencies already setup correctly. I think of "optional" as referring to the "do I need this DataUnit to make a unique Data ID" question, so I'm not bothered by this change. One more thing that I was thinking about - in preflight I'll need to further filter query results based on the region overlaps so I'll need to return not only DataUnit value but also regions for some data units. I think right now there is no info in the schema which would tell me that particular DataUnit has associated region (or the column name used for region). I can probably assume that all region columns are named "region" and I can probably guess which DataUnits have them but it may also be worth adding that into to schema as well. Yes, I think it is safe to assume the region is called "region", but we should make this explicit.  I think I took some steps towards this in the Python API recently in DataUnitRegistry.getRegionTable (note that this can return the DataUnit's own table), but we should make the logic for it is more explicit in the YAML.  Please do feel free to make changes to that yourself in daf_butler.
          Hide
          salnikov Andy Salnikov added a comment - - edited

          Jim Bosch, I am thinking of adding "regionColumn" field to dataUnits and dataUnitRegions descriptions in YAML, but VisitSensorRegion table gives me trouble. Thing is that we do not have Python-side object for this table to which I can add an attribute, we only access it using its SQLAlchemy.Table instance. I can get around it by adding new class (e.g. DataUnitRegion) and store instances of those in DataUnitRegistry._dataUnitRegions. Then I can add region property to that new class and DataUnit which will be None by default and sqlalchemy.Column for DataUnit with region. This adds more dependencies on sqlalchemy to units API, if that is not what we want then it may be better to have region property be a name of a column. What do you think?

          Show
          salnikov Andy Salnikov added a comment - - edited Jim Bosch , I am thinking of adding "regionColumn" field to dataUnits and dataUnitRegions descriptions in YAML, but VisitSensorRegion table gives me trouble. Thing is that we do not have Python-side object for this table to which I can add an attribute, we only access it using its SQLAlchemy.Table instance. I can get around it by adding new class (e.g. DataUnitRegion) and store instances of those in DataUnitRegistry._dataUnitRegions . Then I can add region property to that new class and DataUnit which will be None by default and sqlalchemy.Column for DataUnit with region. This adds more dependencies on sqlalchemy to units API, if that is not what we want then it may be better to have region property be a name of a column. What do you think?
          Hide
          jbosch Jim Bosch added a comment -

          Adding a DataUnitRegion class sounds reasonable to me, though I don't have a very strong opinion.  Maybe Pim Schellart [X] does?

           

          Show
          jbosch Jim Bosch added a comment - Adding a DataUnitRegion class sounds reasonable to me, though I don't have a very strong opinion.  Maybe Pim Schellart [X] does?  
          Hide
          pschella Pim Schellart [X] (Inactive) added a comment -

          I think this is fine.
          I'm just not convinced that adding regionColumn = "region" to the YAML is all that useful, since we could also just say that the column "region" is present if and only if there is a region associated with the DataUnit (which is what I assumed in DM-15098), but if there is a compelling reason to do it this way that is also fine.

          Show
          pschella Pim Schellart [X] (Inactive) added a comment - I think this is fine. I'm just not convinced that adding regionColumn = "region" to the YAML is all that useful, since we could also just say that the column "region" is present if and only if there is a region associated with the DataUnit (which is what I assumed in DM-15098 ), but if there is a compelling reason to do it this way that is also fine.
          Hide
          salnikov Andy Salnikov added a comment -

          Pim, thanks for looking at this. I just try to follow "explicit is better than implicit" mantra And Jim said above that we want it in YAML explicitly.

          Show
          salnikov Andy Salnikov added a comment - Pim, thanks for looking at this. I just try to follow "explicit is better than implicit" mantra And Jim said above that we want it in YAML explicitly.
          Hide
          salnikov Andy Salnikov added a comment -

          I think this is ready for merge (bit I'll probably wait until Pim Schellart [X] merges DM-15098 and then rebase my stuff). I was thinking to add unit test for this but I do not know how to generate all those skypixes in a unit test so I decided to skip it (for now at least).

          Show
          salnikov Andy Salnikov added a comment - I think this is ready for merge (bit I'll probably wait until Pim Schellart [X] merges DM-15098 and then rebase my stuff). I was thinking to add unit test for this but I do not know how to generate all those skypixes in a unit test so I decided to skip it (for now at least).
          salnikov Andy Salnikov made changes -
          Reviewers Jim Bosch [ jbosch ]
          Status In Progress [ 3 ] In Review [ 10004 ]
          salnikov Andy Salnikov made changes -
          Risk Score 0 1
          Hide
          pschella Pim Schellart [X] (Inactive) added a comment -

          Feel free to merge now, I'll rebase DM-15096 instead.

          Show
          pschella Pim Schellart [X] (Inactive) added a comment - Feel free to merge now, I'll rebase DM-15096 instead.
          Hide
          jbosch Jim Bosch added a comment -

          Comments on the PR.

          Show
          jbosch Jim Bosch added a comment - Comments on the PR.
          jbosch Jim Bosch made changes -
          Status In Review [ 10004 ] Reviewed [ 10101 ]
          Hide
          salnikov Andy Salnikov added a comment -

          Jim Bosch, thanks for review! Sorry, I forgot to push pipe_supertask commit, just made a PR for that too. It only adds an example SuperTask that I used for testing my pre-flight changes. You can check it too if you prefer.

          Show
          salnikov Andy Salnikov added a comment - Jim Bosch , thanks for review! Sorry, I forgot to push pipe_supertask commit, just made a PR for that too. It only adds an example SuperTask that I used for testing my pre-flight changes. You can check it too if you prefer.
          Hide
          jbosch Jim Bosch added a comment -

          Changes in pipe_supertask look fine; I left one comment on the PR, but it's really an unrelated question about preflight in general prompted by a bit of code there.

          Show
          jbosch Jim Bosch added a comment - Changes in pipe_supertask look fine; I left one comment on the PR, but it's really an unrelated question about preflight in general prompted by a bit of code there.
          Hide
          salnikov Andy Salnikov added a comment -

          Thanks for review! Merged both packages to master.

          Show
          salnikov Andy Salnikov added a comment - Thanks for review! Merged both packages to master.
          salnikov Andy Salnikov made changes -
          Resolution Done [ 10000 ]
          Status Reviewed [ 10101 ] Done [ 10002 ]

            People

            Assignee:
            salnikov Andy Salnikov
            Reporter:
            vaikunth Vaikunth Thukral
            Reviewers:
            Jim Bosch
            Watchers:
            Andy Salnikov, Fritz Mueller, Jim Bosch, Pim Schellart [X] (Inactive), Vaikunth Thukral
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

              Dates

              Created:
              Updated:
              Resolved:

                Jenkins

                No builds found.