Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-24407

Add abstractions for spatial indexing to daf.butler.registry.Database

    XMLWordPrintable

    Details

    • Story Points:
      4
    • Sprint:
      DB_S20_02
    • Team:
      Data Access and Database
    • Urgent?:
      No

      Description

      Our current approach to spatial indexing on the sphere is simply to store the HTM IDs each region overlaps at a particular level, in an extra "common skypix overlap" table for every table that has a region column. This might be an okay approach for regions of a certain size, but it's definitely not a good approach for regions of arbitrary size, and probably not close to the best approach even for its ideal region size.

      Doing better would probably require using special index types built for spatial overlap queries (R trees and variants thereof). While the databases of interest to us (SQLite, Oracle, and PostgreSQL) do not support indexing regions on the sphere natively, they all do provide ways to index 3-d bounding boxes, and the lsst.sphgeom objects we work with provide ways to get 3-d bounding boxes (for a unit sphere).

      The interfaces for the three databases are quite different, however, so the big challenge is probably encapsulating that functionality behind a consistent interface for the rest of our Registry code to use. We might be able to leverage SQLAlchemy hooks for some of this customization, but I doubt they will cover everything. While Oracle and PostgreSQL both treat geometries as column types and spatial indexes much like normal indexes, they have different syntax for inserting, querying, and relating them. SQLite is even more different: its R tree implementation involves creating a special table for every spatial column, in which lower and upper bounds behave like regular (but separate) columms. Adding at least some of the abstraction to our Database class might be a lot easier, and no less usable, though at present it doesn't do much with SELECT queries (they're just passed through to SQLAlchemy directly). And I don't think this functionality needs to be totally transparent - just database-generic. It would be totally reasonable for Registry code that's using our Database interface to work with a spatial column to have to do things it wouldn't have to do with a normal column, like manually extracting bounding boxes from regions for inserts, calling Database methods to get SQLAlchemy expressions that represent overlaps in query join/where clauses, or doing post-query filtering in Python (as we do now).

      This ticket is for adding that abstraction layer to some combination of SQLAlchemy and the Database class. I'll create another one for actually using it in Registry (maybe after the design for this one is further along).

        Attachments

          Activity

          Hide
          jbosch Jim Bosch added a comment - - edited

          Conversations on slack revealed that my initial investigation missed some built-in and extension database functionality we should at least investigate first before diving into this scheme.

          The most obvious one is that Oracle Spatial does support expressing geometries on the sphere natively and Christopher Stephens [X] has a contact at NCSA who has used that successfully in astronomy. We should check that this includes indexing spatial regions for overlap queries and joins (not just point-in-region queries) before diving in, but if it does, it's probably the way to go there. There's a slight chance that the Oracle spherical stuff is actually slower than just using the Oracle 3-d stuff directly, because it can also do ellipsoids and heights (a lot of sophistication we don't need), but hopefully the code path that needs spherical but doesn't need those things is exercised enough by non-astronomers to be reasonably optimized.

          I had already known of pgSphere, but hadn't considering it worth exploring until/unless profiling revealed we needed it, as using the scheme described earlier for all databases seemed easier. If we have a more native approach for Oracle, it may be worth using pgSphere for PostgreSQL, too, though its maintenance situation does not seem ideal. PostGIS also now seems to have support for spherical geometries, but this "GEOGRAPHY" type seems largely distinct from its much more broadly-used and complete 2-d map projection functionality.

          SpatiaLite aims to bring PostGIS-like functionality to SQLite, and it's unclear whether it has anything useful for us. It certainly has a lot of 2-d map projection geometry functionality, but I have not been able to wade through all of the GIS jargon and acronyms to determine whether it can actually do indexing over a significant fraction of the sphere.

          On that "2-d map projection functionality" point: this seems to be the approach that is big in the open-source GIS world, and I've always assumed it just wouldn't work for us, because we just can't put all of the regions we might want to relate onto the same 2-d map projection (and I've never seen it used in astronomy). But there's so much stuff to wade through that's not of interest to us that it's very hard to tell whether there is any stuff that is of use to us. Actually asking on some email lists may teach us a lot more than a similar amount of time spent poring over docs; I already found one such thread involving an astronomer that helped a lot, though it was about distance calculations rather than region overlaps and doesn't directly answer the most critical questions for us.

          Show
          jbosch Jim Bosch added a comment - - edited Conversations on slack revealed that my initial investigation missed some built-in and extension database functionality we should at least investigate first before diving into this scheme. The most obvious one is that Oracle Spatial does support expressing geometries on the sphere natively and Christopher Stephens [X] has a contact at NCSA who has used that successfully in astronomy. We should check that this includes indexing spatial regions for overlap queries and joins (not just point-in-region queries) before diving in, but if it does, it's probably the way to go there. There's a slight chance that the Oracle spherical stuff is actually slower than just using the Oracle 3-d stuff directly, because it can also do ellipsoids and heights (a lot of sophistication we don't need), but hopefully the code path that needs spherical but doesn't need those things is exercised enough by non-astronomers to be reasonably optimized. I had already known of pgSphere , but hadn't considering it worth exploring until/unless profiling revealed we needed it, as using the scheme described earlier for all databases seemed easier. If we have a more native approach for Oracle, it may be worth using pgSphere for PostgreSQL, too, though its maintenance situation does not seem ideal. PostGIS also now seems to have support for spherical geometries , but this "GEOGRAPHY" type seems largely distinct from its much more broadly-used and complete 2-d map projection functionality. SpatiaLite aims to bring PostGIS-like functionality to SQLite, and it's unclear whether it has anything useful for us. It certainly has a lot of 2-d map projection geometry functionality, but I have not been able to wade through all of the GIS jargon and acronyms to determine whether it can actually do indexing over a significant fraction of the sphere. On that "2-d map projection functionality" point: this seems to be the approach that is big in the open-source GIS world, and I've always assumed it just wouldn't work for us, because we just can't put all of the regions we might want to relate onto the same 2-d map projection (and I've never seen it used in astronomy). But there's so much stuff to wade through that's not of interest to us that it's very hard to tell whether there is any stuff that is of use to us. Actually asking on some email lists may teach us a lot more than a similar amount of time spent poring over docs; I already found one such thread involving an astronomer that helped a lot, though it was about distance calculations rather than region overlaps and doesn't directly answer the most critical questions for us.
          Hide
          jbosch Jim Bosch added a comment -

          Andy Salnikov, that is the best-arranged and most readable diagram of the butler schema I have ever seen. Someday I'm going to have to ask you to teach me how you made it.

          Show
          jbosch Jim Bosch added a comment - Andy Salnikov , that is the best-arranged and most readable diagram of the butler schema I have ever seen. Someday I'm going to have to ask you to teach me how you made it.
          Hide
          salnikov Andy Salnikov added a comment - - edited

          Jim Bosch, that diagram was quickly done with DBeaver CE (https://dbeaver.com/edition/community/). It can read SQLite files (and lots of other databases) and quickly draw ER diagram from that. Downside is that of course you need to spend some time on arranging the diagram to get nice view. Also representation convention is not my favorite (they use IDEF1X) and it cannot be adjusted. But it still looks usable.

          Show
          salnikov Andy Salnikov added a comment - - edited Jim Bosch , that diagram was quickly done with DBeaver CE ( https://dbeaver.com/edition/community/ ). It can read SQLite files (and lots of other databases) and quickly draw ER diagram from that. Downside is that of course you need to spend some time on arranging the diagram to get nice view. Also representation convention is not my favorite (they use IDEF1X) and it cannot be adjusted. But it still looks usable.
          Hide
          salnikov Andy Salnikov added a comment -

          Just a quick followup to our discussion at MW meeting today. One possible option to define relation between spacial tables is to define explicit many-to-many relationship between two tables in the schema and fill it when adding new records to any of the two tables. E.g. for visit <-> tract we can define a table visit_tract_spatial_match(skymap, track, instrument, visit). Joining on that relation should be more efficient than joining on overlapping skypix.

          For filling that relation I could imagine two options:

          • Just as now use skypix indices from each table overlap, this will produce some fraction of false matches that we'll need to filter out on client side based on regions.
          • We could calculate exact overlap for skymap tracts/patches and any other region on client side while storing visits/detectors and use that info, should eliminate false matches. This should also be more efficient, I presume that skymap regions can be calculated on client side efficiently without the need to query database (though some care will be needed to make sure that database definition is consistent with client code) and that client has access to skymap code for every skymap in database (I do not particularly like that but it may be a reasonable trade-off).

          Management of the spatial relation table will be more complicated, e.g. adding new skymap would require scanning full set of other spacial tables and adding records for all of them (with potential transactional issues) though this kind of operation does not happen frequently.

          Show
          salnikov Andy Salnikov added a comment - Just a quick followup to our discussion at MW meeting today. One possible option to define relation between spacial tables is to define explicit many-to-many relationship between two tables in the schema and fill it when adding new records to any of the two tables. E.g. for visit <-> tract we can define a table visit_tract_spatial_match(skymap, track, instrument, visit) . Joining on that relation should be more efficient than joining on overlapping skypix. For filling that relation I could imagine two options: Just as now use skypix indices from each table overlap, this will produce some fraction of false matches that we'll need to filter out on client side based on regions. We could calculate exact overlap for skymap tracts/patches and any other region on client side while storing visits/detectors and use that info, should eliminate false matches. This should also be more efficient, I presume that skymap regions can be calculated on client side efficiently without the need to query database (though some care will be needed to make sure that database definition is consistent with client code) and that client has access to skymap code for every skymap in database (I do not particularly like that but it may be a reasonable trade-off). Management of the spatial relation table will be more complicated, e.g. adding new skymap would require scanning full set of other spacial tables and adding records for all of them (with potential transactional issues) though this kind of operation does not happen frequently.
          Hide
          jbosch Jim Bosch added a comment -

          I'd definitely vote for calculating exact overlaps if we're materializing the relationship tables. I was also thinking we might condition whether a relationship is populated on some of the parent dimensions, e.g. particular combinations of instrument+skymap (+visit_system after DM-15536, probably) would have their relationships explicitly materialized, and we could track which relationships exist in much smaller summary table. I think we could rely on all tracts and patches for a skymap being registered up front when the skymap itself is added (and then never changed), so it'd only be visit definition that would need to be able to update the relationship tables. The more I write down about that, the more reasonable it sounds.

          The skypix dimensions (HTM, someday HEALPix) may be a little trickier, but we may not have to do anything with them - at present, we never join on those (except as a way to join other dimensions) - we only need to look up datasets identified by an HTM ID given an visit+detector data ID, and we can use the region on the latter to compute the right HTM IDs in Python easily.

          Show
          jbosch Jim Bosch added a comment - I'd definitely vote for calculating exact overlaps if we're materializing the relationship tables. I was also thinking we might condition whether a relationship is populated on some of the parent dimensions, e.g. particular combinations of instrument+skymap (+visit_system after DM-15536 , probably) would have their relationships explicitly materialized, and we could track which relationships exist in much smaller summary table. I think we could rely on all tracts and patches for a skymap being registered up front when the skymap itself is added (and then never changed), so it'd only be visit definition that would need to be able to update the relationship tables. The more I write down about that, the more reasonable it sounds. The skypix dimensions (HTM, someday HEALPix) may be a little trickier, but we may not have to do anything with them - at present, we never join on those (except as a way to join other dimensions) - we only need to look up datasets identified by an HTM ID given an visit+detector data ID, and we can use the region on the latter to compute the right HTM IDs in Python easily.
          Hide
          jbosch Jim Bosch added a comment -

          Andy Salnikov, if you are pretty confident that we should go with explicitly materialized relationship tables rather than try in-database spatial support, I'm thinking we might want to just close this ticket as Won't Fix and defer work on those relationship tables. DM-15536 has highlighted some other deficiencies in the dimensions system, and if the work on spatial stuff is going to be more tightly coupled with the dimensions system than the database class we might want to design all of that out in advance more rather than trying to do this independently.

          Show
          jbosch Jim Bosch added a comment - Andy Salnikov , if you are pretty confident that we should go with explicitly materialized relationship tables rather than try in-database spatial support, I'm thinking we might want to just close this ticket as Won't Fix and defer work on those relationship tables. DM-15536 has highlighted some other deficiencies in the dimensions system, and if the work on spatial stuff is going to be more tightly coupled with the dimensions system than the database class we might want to design all of that out in advance more rather than trying to do this independently.
          Hide
          salnikov Andy Salnikov added a comment -

          I copied few class diagrams that I made while studying the code, not sure if anyone else would find them useful, I made these for myself to try to understand the big picture.

          Jim Bosch, I think I strongly prefer to have spacial relations as materialized tables, with performance being our major concern it should be most efficient representation. Still it may be beneficial to have a sort of abstraction if we want to try some different implementation for that relation in the future. I still don't have a clear view of the big picture in my head (I probably need a bigger head) so I'm not sure at which level this abstraction can appear. I'm inclined to close this ticket as you suggest and return to the issue when we are going to implement materialized tables.

          Show
          salnikov Andy Salnikov added a comment - I copied few class diagrams that I made while studying the code, not sure if anyone else would find them useful, I made these for myself to try to understand the big picture. Jim Bosch , I think I strongly prefer to have spacial relations as materialized tables, with performance being our major concern it should be most efficient representation. Still it may be beneficial to have a sort of abstraction if we want to try some different implementation for that relation in the future. I still don't have a clear view of the big picture in my head (I probably need a bigger head) so I'm not sure at which level this abstraction can appear. I'm inclined to close this ticket as you suggest and return to the issue when we are going to implement materialized tables.
          Hide
          salnikov Andy Salnikov added a comment -

          I am closing this as WontFix, work for materializing spatial relations will be done on separate tickets.

          Show
          salnikov Andy Salnikov added a comment - I am closing this as WontFix, work for materializing spatial relations will be done on separate tickets.

            People

            Assignee:
            salnikov Andy Salnikov
            Reporter:
            jbosch Jim Bosch
            Watchers:
            Andy Salnikov, Jim Bosch, Tim Jenness
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

              Dates

              Created:
              Updated:
              Resolved:

                Jenkins

                No builds found.