Data Management / DM-35850

Collect ideas and requirements for daf_butler obscore implementation


    Details

    • Story Points:
      10
    • Sprint:
      DB_F22_6, DB_S23_6
    • Team:
      Data Access and Database
    • Urgent?:
      No

      Description

      Time to start implementation of the ObsCore table as part of the daf_butler Registry. Before doing any actual work I want to spend some time trying to understand all the requirements and options that we need for that. The TechNote already lists a few ideas for implementation; I'm going to expand on those in more detail and also try to figure out the issues related to schema migration. Also, dax_obscore already has some parts that could be reused.

            Activity

            salnikov Andy Salnikov added a comment -

            Some random general architectural considerations, repeating some of what was said in a technote:

            • Only very few butler repositories need to have an obscore table.
            • It makes sense to implement the obscore concern as a separate and optional registry manager, which will only be instantiated (based on registry configuration) for those repos that need it.
            • The Registry will, as for other managers, define an abstract interface for the obscore manager; other managers and registry methods will only ever use this interface.
            • The manager implementation could be defined in some other package (e.g. dax_obscore), but I think it makes sense to keep the implementation in daf_butler, just to avoid requiring dax_obscore to be set up for a few special repos.
            • Dependency-wise this should not be an issue; the code in dax_obscore only depends on sphgeom and daf_butler (plus pyarrow, but we don't need that piece in daf_butler).
            • Spatial indexing will only ever work with Postgres, and depending on how we implement it, it will need a new extension loaded into Postgres.
            • For testing purposes it may make sense to support SQLite, but without spatial indexing.
            • Access to ObsCore data from TAP services will happen via direct queries on the ObsCore table (or maybe a view on top of that table); we do not expect any need for a Python API for queries at the registry level.
            • Changes in ObsCore configuration, and the corresponding schema/data changes in the database, can be managed by the same daf_butler_migrate; that should also cover the initial addition of the obscore table to an existing repo and its population from pre-existing data.
            • At this point it's not clear whether we may need more than one obscore table per registry; I'll try to expand on this in further comments.
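The "separate and optional manager" idea above can be sketched roughly as follows. All names here (ObsCoreManager, add_datasets, make_obscore_manager, the config keys) are illustrative only, not the real daf_butler API:

```python
# Hypothetical sketch of an optional "obscore" registry manager behind an
# abstract interface, instantiated only when the registry config asks for it.
from abc import ABC, abstractmethod
from typing import Any, Iterable


class ObsCoreManager(ABC):
    """Abstract interface that Registry code would program against."""

    @abstractmethod
    def add_datasets(self, refs: Iterable[Any]) -> None:
        """Insert obscore records for expanded dataset refs."""

    @abstractmethod
    def remove_datasets(self, refs: Iterable[Any]) -> None:
        """Remove obscore records for the given refs."""


class NullObsCoreManager(ObsCoreManager):
    """No-op used for the majority of repositories without an obscore table."""

    def add_datasets(self, refs: Iterable[Any]) -> None:
        pass

    def remove_datasets(self, refs: Iterable[Any]) -> None:
        pass


def make_obscore_manager(registry_config: dict) -> ObsCoreManager:
    # A real factory would import and instantiate the configured manager
    # class; here we only show the "absent by default" behaviour.
    if registry_config.get("managers", {}).get("obscore"):
        raise NotImplementedError("real implementation goes here")
    return NullObsCoreManager()
```

Other registry code would only ever hold an ObsCoreManager reference, so repos without the feature pay nothing beyond no-op calls.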
            salnikov Andy Salnikov added a comment -

            More details on which Registry methods need to interact with the obscore manager. I think obscore operations only need to support insertion and deletion of obscore records; we probably do not need dedicated update methods in the obscore manager (some updates will be done when the schema gets updated by migration scripts).

            • Storing new datasets in an existing RUN collection that belongs to one of the collections monitored by obscore:
              • Registry.insertDatasets() method
              • Registry._importDatasets() method
              • The obscore manager needs expanded dataset refs; both methods above have an expand parameter, which may be set to False, but I think the caller is then supposed to guarantee that the refs are already expanded.
              • Registry.associate() - in case a tagged collection is monitored by the obscore manager while the dataset's run collection is not. Not sure if this is a useful case, but I think we should still support it.
              • Registry.certify() - similar to associate()?
            • Removing individual datasets:
              • Registry.removeDatasets() method; it's likely that we can implement removal just by enabling cascade delete on the foreign key in the obscore table, which requires a dataset_id column in the obscore table.
              • Registry.disassociate() - in case a dataset ref is removed from a monitored tagged collection and its run collection is not monitored; similarly Registry.decertify().
            • Collection-level additions:
              • Registry.setCollectionChain() - if this adds a new RUN-type collection to a chained collection, all datasets with matching dataset types need to be added to the obscore table.
            • Collection-level removal:
              • Registry.setCollectionChain() - if one or more run collections are dropped from a monitored chained collection, all refs from those run collections need to be removed from obscore.
              • Registry.removeCollection() - if the removed collection is a run collection, then its datasets need to be removed from obscore; this again can be done via cascade delete.
            salnikov Andy Salnikov added a comment -

            Some thoughts on configuration and schema management:

            • In dax_obscore the configuration is based on pydantic models; for daf_butler it would be more natural to use butler Config sub-classes.
            • The ObsCore table schema will depend on both the manager implementation (its version) and the configuration. This is quite similar to how the dimensions schema is currently managed, so it makes sense to use a similar approach for schema management.
            • As with dimensions, we want to keep the configuration in the database, in the butler_attributes table in JSON format (optionally, of course, only when the repository has obscore support configured).
            • If we want to support multiple ObsCore tables, it probably makes sense to make their configurations completely independent to simplify their management.

            With all that, I think it should look like:

            • We have one new optional manager, obscore; there will be a record for it in butler_attributes with the key config:registry.managers.obscore only when the registry has an ObsCore table.
            • There will be a corresponding version number stored for it, with a key like version:lsst.daf.butler.registry.obscore._manager.ObsCoreManagerClass
            • The configuration will be stored under the key config:obscore.json (or multiple keys config:obscore-something.yaml); the configuration should include a unique ID (similar to the namespace key in the dimensions config) and a version number.
            • daf_butler_migrate will be extended to recognize config:obscore.json and treat it as a special obscore-config manager for schema migration purposes.
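Put together, the butler_attributes entries described above might look like this. The key names follow the comment; the class version, namespace value, and collection name are made up for illustration:

```python
# Illustration of the proposed butler_attributes content for an
# obscore-enabled repository (all concrete values are hypothetical).
import json

obscore_config = {
    "namespace": "daily",   # unique ID, analogous to the dimensions namespace key
    "version": 0,           # configuration version
    "collections": ["some/chained/collection"],
}

butler_attributes = {
    "config:registry.managers.obscore":
        "lsst.daf.butler.registry.obscore._manager.ObsCoreManagerClass",
    "version:lsst.daf.butler.registry.obscore._manager.ObsCoreManagerClass": "0.1.0",
    "config:obscore.json": json.dumps(obscore_config),
}

# Round-trip check: the stored JSON carries the namespace and version that
# daf_butler_migrate would use to pick migration scripts.
stored = json.loads(butler_attributes["config:obscore.json"])
print(stored["namespace"], stored["version"])
```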

            There is one issue with bootstrapping this which we did not have with the dimensions manager. The dimensions manager is initially created in an empty state, without any records, when the Registry is created, with the then-current versions of the manager class and configuration. The ObsCore table, in contrast, is going to be created when we already have a bunch of data in a Registry, and we'll need to populate it with data from all the other tables. I think it makes sense to do that via a migration script; the issue is that that migration script will depend on the versions of both the obscore and obscore-config managers. But maybe I'm overthinking this: it is likely that I'll have to call obscore manager code to do that initial population anyway, and it will handle the config version correctly. I never call other managers from migration scripts, and I try to limit the use of daf_butler classes in general to keep that stuff stable, but with obscore I probably cannot avoid it, as reproducing that logic in a migration script would be too much. That also means that the manager should have a method for populating the obscore table from data already in the tables (not just from expanded datasets).

            Related to this is the issue of compatibility: what happens when old code that does not know about the obscore manager tries to run against a repository with an ObsCore table? I think that old code will crash because there will be a mismatch between the managers in the registry and the managers in the code, but I need to verify that.

            tjenness Tim Jenness added a comment -

            Regarding configs. Using pydantic models is great and I wish I had known about them when we first started using butler configs.

            A Config class can read JSON or YAML. I don't think there is anything stopping you from reading the files with Config and then constructing your pydantic model from that config dict: using Config to do the reading gives you the ability to pull content from other locations and merge it together into a single configuration.

            salnikov Andy Salnikov added a comment -

            reading the files with Config and then constructing your pydantic model from that config dict

            This is exactly what I'm doing in dax_obscore. Pydantic does not provide methods for parsing YAML, so I have to read the YAML into a Config/dict and pass it to a pydantic model. I do like pydantic for configuration classes; it works really well for that purpose, so I should probably keep it.
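The pattern discussed here, parse with Config into a dict and hand that to the model, can be sketched as follows. A dataclass stands in for the pydantic model to keep the sketch dependency-free, and ObsCoreConfig with these fields is hypothetical, not dax_obscore's real model; with pydantic, the last step would be the same dict hand-off:

```python
# Sketch: Config-style reading produces a plain dict (JSON shown here for a
# stdlib-only example; butler Config also reads YAML), which then constructs
# the strongly typed configuration model.
import json
from dataclasses import dataclass, field


@dataclass
class ObsCoreConfig:  # hypothetical stand-in for a pydantic model
    namespace: str
    version: int = 0
    collections: list = field(default_factory=list)


raw = json.loads('{"namespace": "demo", "collections": ["run/1"]}')
config = ObsCoreConfig(**raw)  # the dict unpacks straight into the model
print(config.namespace)  # demo
```

With pydantic the construction step validates field types as well, which is the main attraction over a bare dict.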

            tjenness Tim Jenness added a comment -

            Yes, please keep the pydantic model.

            salnikov Andy Salnikov added a comment -

            The most interesting piece of ObsCore will be spatial search and indexing, which is definitely needed to avoid full table scans when searching for things by position. Presently, when we import an obscore CSV file into QServ/MySQL, we define two additional spatial columns for the obscore table (DM-34483):

            • a binary representation of the scisql region corresponding to a record's region; this of course cannot be indexed, but queries can do a point-in-region test using scisql functions
            • a MySQL GIS representation of the bounding box/envelope of a record's region; this column is indexed and can provide a quick point-in-region test using a standard GIS function. (I'm not sure we do it 100% right, need to talk to Fritz.)
            • obscore has a column with an ADQL-formatted representation of the region; the two columns above are filled from the ADQL expressions (and I'm not sure that is 100% correct either w.r.t. reference frame, but we ignore that for now)

            For Postgres we need to replace those two columns with something different, as we cannot use scisql with Postgres. Here I think we have a very broad set of options depending on how query processing can be split between TAP and the database. I can imagine these possibilities:

            • If we want to support arbitrary queries then it probably requires native Postgres GIS support: a column containing the region in native format, and functions that operate on this column. The column should be indexed, which will help with the speed of spatial queries.
            • There may be issues with combining spatial indexing and other types of indices. As an alternative to native spatial indexing, we could consider indexing based on the skypix overlaps which are already in the registry. Using that feature would require support on the TAP side, so that TAP could restrict queries to a set of skypix indices. Queries could still use a native GIS column for more specific spatial filtering.
            • As an extension of the previous case, we could abandon the native GIS column entirely and have TAP do region-based filtering on its side using sphgeom region methods (if TAP can call Python methods). That means more complications on the TAP side and probably more restrictions on what kind of queries we can support in the obscore TAP service.
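The skypix pre-filtering option can be sketched with SQLite. The column name, pixelization level, and pixel values below are all made up for illustration; in practice sphgeom would compute the index ranges overlapping the query region, and TAP would inject them into the SQL:

```python
# Sketch: a plain integer skypix column with a B-tree index lets a query
# cheaply narrow down candidate rows by pixel ranges before any exact
# (and expensive) region test.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obscore (obs_id TEXT, htm9 INTEGER)")
conn.execute("CREATE INDEX obscore_htm9 ON obscore (htm9)")
conn.executemany(
    "INSERT INTO obscore VALUES (?, ?)",
    [("a", 1000), ("b", 1500), ("c", 9000)],
)

# Suppose the query region overlaps (hypothetical) pixel range [900, 2000).
ranges = [(900, 2000)]
where = " OR ".join("(htm9 >= ? AND htm9 < ?)" for _ in ranges)
params = [value for lo_hi in ranges for value in lo_hi]
hits = [row[0] for row in conn.execute(
    f"SELECT obs_id FROM obscore WHERE {where} ORDER BY obs_id", params)]
print(hits)  # ['a', 'b']
```

Rows surviving the range filter would then go through an exact point-in-region test, either in the database (native GIS column) or on the TAP side.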

            For native Postgres geometry support I think there are two possible options:

            • PostGIS, which looks quite advanced and mature, but I'm not sure how well it works with celestial reference systems
            • pgSphere, which looks almost abandoned at this point

            Both are implemented as Postgres extensions and are not included in Postgres by default; they need to be installed on the server host separately.

            I should note that I'm not an expert in GIS, ADQL, and anything related. I tried to learn as much as I could in the past few days, but there is much gobbledygook which I still do not quite understand. I should probably spend more time on learning PostGIS, as I think this is where we should start with our implementation. Any pointers for how other astronomy projects do this are welcome of course.

            jbosch Jim Bosch added a comment - - edited

            I should note that I'm not an expert in GIS, ADQL, and anything related. I tried to learn as much as I could in the past few days, but there is much gobbledygook which I still do not quite understand. I should probably spend more time on learning PostGIS, as I think this is where we should start with our implementation. Any pointers for how other astronomy projects do this are welcome of course.

            I doubt my knowledge is any more advanced than yours on any of these topics, but maybe these tidbits will prove useful:

            • Somebody (Tim Jenness? Gregory Dubois-Felsmann?) mentioned that another major data center (if I had to guess I'd say it was CADC, but I don't remember) does use PostGIS for VO queries of some kind.
            • Eli Rykoff has some experience making the GIS world's main projection library (PROJ) make sense for astronomical applications, via his SkyProj library.
            tjenness Tim Jenness added a comment -

            CADC uses pgsphere, but the development situation there is very odd. I was told at an IVOA meeting that we are all meant to be using https://github.com/postgrespro/pgsphere, but I'm not sure which one we use in our CADC TAP service. Christine Banek do you know?

            jbosch Jim Bosch added a comment -

            Hmm, if it wasn't CADC, that memory of someone telling me about using PostGIS in astro might have been from a conversation with Serge (which would then presumably mean something at IPAC).

            In any case, I do like the idea of going with PostGIS instead. It's a real project - and a major, well-supported one - that can certainly do what we need, and if figuring out how to get it to do just what we need (instead of the myriad GIS things that are irrelevant for astronomy) is the hard part, I think the astro community would be really well served by us writing some tooling to do that, or even just documenting what it took.

            salnikov Andy Salnikov added a comment -

            Found this interesting thread from ~16 years ago: https://lists.osgeo.org/pipermail/postgis-devel/2006-October/002348.html

            Looks like circle (ellipse) support is practically complete

            gpdf Gregory Dubois-Felsmann added a comment -

            Regarding pgsphere, I think the best thing would be to sync up with Pat Dowler about this. I note that both he and Markus Demleitner have recently-refreshed forks of the repo Tim Jenness pointed out, so something is still happening.

            I found this: https://www.asterics2020.eu/dokuwiki/lib/exe/fetch.php?media=open:wp4:wp4techforum4:pgsphere_techforum4_nullmeier.pdf

            RegTAP 1.2 adds significant MOC-based spatial query capabilities to what's required of TAP, and as far as I know only pgsphere-based services have implemented that yet.

            Having said that, the general argument that PostGIS is much more actively supported is still pretty compelling, but we might be blazing a new trail there.

            gpdf Gregory Dubois-Felsmann added a comment -

            I'd like to re-flag the request in DM-35740 that there be a callable interface to retrieve, in-memory, an ObsCore table for a single dataset, a collection, and possibly a dataset wildcard (e.g., all DatasetTypes for a given DataID) or a list of datasets.

            For Registries for which ObsCore data is maintained "live" by the DM-35850 work, this callable interface would just return that data. For others (i.e., for sqlite Registries), the data would be generated on-demand.

            The in-memory table should not be afw.table, to avoid dependency creep.
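Purely for discussion, the requested callable interface might look roughly like this. Every name below is invented for illustration, not a proposal for the real API, and the row content is a placeholder:

```python
# Hypothetical signature sketch: return in-memory ObsCore rows for a set of
# datasets, as plain dicts (deliberately not afw.table).
from typing import Any, Iterable, Optional


def make_obscore_records(
    refs: Iterable[Any],             # DatasetRefs, a collection, or a wildcard
    *,
    config: Optional[dict] = None,   # needed when the registry stores no obscore config
) -> list[dict]:
    """Return one ObsCore row per dataset; a real implementation would fill
    the standard columns (obs_id, s_ra, s_dec, s_region, ...)."""
    return [{"obs_id": str(ref)} for ref in refs]


rows = make_obscore_records(["dataset-1", "dataset-2"])
print(len(rows))  # 2
```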

            tjenness Tim Jenness added a comment -

            There is no concern about butler ever depending on afw. Never going to happen.

            jbosch Jim Bosch added a comment -

            Found this interesting thread from ~16 years ago: https://lists.osgeo.org/pipermail/postgis-devel/2006-October/002348.html

            Wow, that's a blast from the past. That's second-year grad student me, in the middle of a CS grad course on database spatial indexing that I somehow convinced my (physics) department I should take, looking for a way to justify that decision via my class project. Sadly not actually relevant for the problem at hand.

            salnikov Andy Salnikov added a comment -

            For Registries for which ObsCore data is maintained "live" by the DM-35850 work, this callable interface would just return that data. For others (i.e., for sqlite Registries), the data would be generated on-demand.

            Gregory Dubois-Felsmann, it may be problematic to support both use cases in one interface. Transformation to obscore needs a non-trivial configuration; for an obscore-enabled Registry that configuration will be stored in the Registry itself, while for a non-enabled Registry it has to be provided by the client. If the client always provides a configuration, then we need to check that the client configuration is compatible with the Registry configuration, which may be complicated. Another dimension of this problem is that some obscore functionality can be offloaded to a TAP service or an SQL view, so on the Registry side you may not even get a complete obscore view of the data. I think it would be more natural to use TAP for access to a pre-existing obscore table, and to use the new API to generate a possibly incomplete in-memory table.

            salnikov Andy Salnikov added a comment - - edited

            I have been learning and playing with both PostGIS and pgSphere to see which option works better for storing geometry data in the ObsCore table. Here is a brief summary of what I have learned so far.

            PostGIS

            This is an advanced and well-supported project for PostgreSQL-based GIS. It is a separate PostgreSQL extension that needs to be installed as a system package and added to a database or schema. GIS mostly cares about a more or less precise description of features located on the surface of the Earth (or perhaps other planets), not about what happens in the sky. The Earth, of course, is not a spherical object, and a lot of effort in PostGIS is devoted to supporting the non-ideal Earth shape. PostGIS implements the SQL/MM standard for spatial data management, with some extensions.

            PostGIS has two distinct types for spatial data - geometry and geography. The geometry type uses Cartesian coordinates and is typically applied to patches of the surface small enough that curvature can be ignored. The geography type is used for long distances, where curvature must be taken into account for correct calculations. The shape of the Earth (or a small patch of it) is determined by a Spatial Reference System (SRS), and all spatial objects are associated with some SRS (with a few exceptions). PostGIS supports user-defined SRSs, so one could potentially define a perfectly spherical planet shape, project the sky map onto its surface, and use that projection for all calculations with the geography type.

            PostGIS geography uses lon/lat ranges [-180, 180] and [-90, 90], while the ra range is [0, 360]. When a geography type receives data outside its allowed range, it coerces the data into that range. If the coercion of ra/dec to lon/lat is done consistently, then PostGIS should still be able to produce correct answers for ra/dec. Even for the geography type, calculations that produce distances return their result in meters; to obtain an angular separation, the result needs to be converted to degrees. One possible workaround is to project onto a surface where one degree spans one meter - a small "planet" with a radius of 57.29577951308232 meters.
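The "one degree equals one metre" trick above can be checked with a few lines of arithmetic (a standalone sketch; PostGIS itself is not involved):

```python
import math

# On a sphere of radius R, an arc subtending theta radians has length
# R * theta.  For one degree of arc to span exactly one metre we need
# R * (pi / 180) = 1, i.e. R = 180 / pi metres.
SPHERE_RADIUS_M = 180.0 / math.pi  # 57.29577951308232...

def metres_to_degrees(distance_m: float, radius_m: float = SPHERE_RADIUS_M) -> float:
    """Convert a geography-style distance in metres to degrees of arc."""
    return math.degrees(distance_m / radius_m)
```

On this sphere the distance reported in "metres" is numerically equal to the angular separation in degrees, so no conversion step is needed in queries.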

            PostGIS's set of shapes includes POINT, LINE, POLYGON, and collections of those (e.g. MULTIPOLYGON), but no circles or ellipses. There is a large set of functions for all kinds of calculations and transformations; geography types support only a subset of them. Indexing of spatial columns is supported with GiST, BRIN, or SP-GiST indices.

            pgSphere

            The support and development of pgSphere look patchy; it is certainly a smaller project, used and supported by VO folks. The "official" repository at https://github.com/akorotkov/pgsphere had its latest commit almost three years ago. Ubuntu 22.04 has a package for pgSphere version 1.1.5, which seems to be the latest pgSphere version number. Creating the pgSphere extension for a database requires superuser privileges (true for PostGIS as well).

            pgSphere is built specifically for use in astronomy, and uses a lon/lat coordinate system in radians, with ranges [0, 2π] and [-π/2, π/2]. On input the coordinates can be specified in degrees (e.g. spoint '(90d, 45d)'), though functions that return distances always return radians.
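The unit and range conventions above can be illustrated with a small standalone conversion helper (the function name is ours, not part of pgSphere or daf_butler):

```python
import math

TWO_PI = 2.0 * math.pi

def radec_to_spoint(ra_deg: float, dec_deg: float) -> tuple[float, float]:
    """Map ra/dec in degrees onto pgSphere's lon/lat convention:
    radians, with lon normalized into [0, 2*pi) and lat in [-pi/2, pi/2]."""
    lat = math.radians(dec_deg)
    if not -math.pi / 2.0 <= lat <= math.pi / 2.0:
        raise ValueError(f"declination out of range: {dec_deg}")
    return math.radians(ra_deg) % TWO_PI, lat
```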

            Supported geometry types include POINT, LINE, PATH, CIRCLE, ELLIPSE, POLYGON, and BOX, but there is no support for composite geometries (except for PATH). pgSphere supports a number of basic functions and operators, but not all of them work on every type of geometry; e.g., distance calculation only works for points and circles. Indexing of spatial columns is supported using a GiST index.
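For reference, the point-to-point separation underlying a cone search reduces to the haversine formula; a pure-Python sketch of the predicate (independent of either extension, shown only to make the computation concrete):

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two positions given
    in degrees, computed via the haversine formula."""
    phi1, phi2 = math.radians(dec1), math.radians(dec2)
    dphi = phi2 - phi1
    dlam = math.radians(ra2 - ra1)
    a = (math.sin(dphi / 2.0) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2.0) ** 2)
    return math.degrees(2.0 * math.asin(math.sqrt(a)))

def in_cone(ra, dec, cone_ra, cone_dec, radius_deg):
    """Cone-search predicate: is (ra, dec) within radius_deg of the centre?"""
    return angular_separation_deg(ra, dec, cone_ra, cone_dec) <= radius_deg
```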

            The choice between the two options may depend on what kind of spatial queries we need to support, and on what needs to be done on the TAP side for transforming ADQL into either PostGIS or pgSphere. If we need to calculate distances between polygons and points (e.g. for a cone search), then PostGIS seems to be the only answer (extending pgSphere is another option, of course). (And I think I also heard that TAP already has something for pgSphere.)

            tjenness Tim Jenness added a comment -

            https://github.com/postgrespro/pgsphere had a commit last November. CADC TAP server (which we use in RSP) assumes pgsphere.

            Would we be able to patch PostGIS to support circle/ellipse?

            salnikov Andy Salnikov added a comment -

            Would we be able patch PostGIS to support circle/ellipse?

            I doubt that. It may be possible for a perfect sphere or a flat surface, but I can't even imagine how it could be done for spheroids or ellipsoids. Maybe, if you can find a person who knows their way around the PostGIS code; on the other hand, maybe there is a reason why it does not exist in the first place.

            tjenness Tim Jenness added a comment -

            Gregory Dubois-Felsmann given the close relationship between this implementation and any TAP server using it, we don't feel we can decide between PostGIS and pgsphere in a vacuum. Could you either tell us which option you prefer, or should we get together with you and the TAP implementation person to make a decision?

            gpdf Gregory Dubois-Felsmann added a comment -

            I'm pretty sure (but only from memory) that CADC TAP, which is our only realistic near-term option, supports only pgsphere for Postgres, but this question should really go to Christine Banek for an expert answer.

            gpdf Gregory Dubois-Felsmann added a comment -

            For some additional background, which might also be related to Jim Bosch's 8/11 comment above: the PostGIS reference very likely was from Serge Monkewitz, because IRSA does use it on the Postgres back ends of its only-partly-open-source C++-based TAP server. (IRSA uses both Oracle and Postgres, via a common TAP server.) Realistically, though, I don't think trying to adopt that code base is likely to be a useful path on any time scale that helps us with commissioning.

            So I think unless Christine tells us that I'm wrong about pgsphere and CADC TAP, we're going to be writing to a pgsphere interface.

            gpdf Gregory Dubois-Felsmann added a comment - - edited

            Since Andy Salnikov mentioned "collecting requirements", I think there's one thing that's become clear from the DP0.2 experience. We decided to add six Rubin-specific nullable columns to the ObsCore table - which is perfectly legitimate as long as we support the mandatory ones (that's why TAP_SCHEMA has an std attribute for columns) - lsst_visit, lsst_detector, lsst_tract, lsst_patch, lsst_band, and lsst_filter. So far, that's been a big win from a usability standpoint. We set the attributes that make sense for a particular row, and null the others (as Andy knows - he wrote the code).

            So for this autogenerated-ObsCore project I think the same will be true: we'll want the configuration for this aspect of a Butler registry to include something that lets these columns be added in a configurable way. Andy was probably already assuming this, but I want to confirm that, let's say, the experiment of adding these columns has been a success.
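The set-what-applies-and-null-the-others behaviour described above can be sketched as a simple mapping from data ID dimensions to the Rubin-specific columns (the helper and the dictionary layout are illustrative, not the actual dax_obscore configuration):

```python
# Rubin-specific obscore columns from DP0.2, each filled from the matching
# dimension of a dataset's data ID; dimensions absent from the data ID
# come out as None (NULL).  Mapping and helper are illustrative only.
EXTRA_COLUMNS = {
    "lsst_visit": "visit",
    "lsst_detector": "detector",
    "lsst_tract": "tract",
    "lsst_patch": "patch",
    "lsst_band": "band",
    "lsst_filter": "physical_filter",
}

def extra_column_values(data_id: dict) -> dict:
    """Return one obscore row fragment: a value per extra column, or None."""
    return {col: data_id.get(dim) for col, dim in EXTRA_COLUMNS.items()}
```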

            gpdf Gregory Dubois-Felsmann added a comment -

            Another near-requirement, I think, and probably again one that's not going to be difficult to meet:

            From the perspective of just the content of the ObsCore table, we want enough configurability to make it possible for the access_url to be generated in a way compatible with at least the following situations, covering both the "direct" and "CADC" (indirect access via a DataLink "links service") models:

            1. file:// URL to the physical files when on a Posix filesystem (direct)
            2. templated https:// URL to the physical files when such a server is available in a particular installation (direct)
            3. templated https:// URL to a "links service" (CADC)

            In the "direct" model, the access_format would be determined based on the DatasetType of a dataset. In the CADC model (which is what we're using for the statically extracted service for DP0.2) it's a single standard-prescribed value for a DataLink "links response" table.

            I want to preserve the ability to use the "direct" model in smaller-scale situations.

            Again, from what I understand about what Andy did originally, and is likely to continue to do, I see no reason for this to be any sort of problem to meet.
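The three access_url situations above could be sketched roughly as follows (the template strings, function name, and base URL are illustrative assumptions, not the actual configuration; only the DataLink media type is the standard-prescribed value):

```python
def make_access_url(mode: str, dataset_id: str, file_path: str,
                    base_url: str = "https://example.org") -> tuple[str, str]:
    """Return (access_url, access_format) for one dataset.

    In the "direct" modes access_format would really be derived from the
    DatasetType; a fixed FITS media type is used here for brevity.
    """
    if mode == "posix":  # direct: file:// URL to the physical file
        return f"file://{file_path}", "application/fits"
    if mode == "https":  # direct: templated https:// URL to the file
        return f"{base_url}/files/{file_path}", "application/fits"
    if mode == "links":  # CADC model: DataLink "links service" endpoint
        return (f"{base_url}/datalink/links?ID={dataset_id}",
                "application/x-votable+xml;content=datalink")
    raise ValueError(f"unknown access_url mode: {mode!r}")
```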

            tjenness Tim Jenness added a comment -

            Couple of quick comments:

            • Using pgsphere means we have to support our own Postgres deployment on Google, since CloudSQL does not support it (we wouldn't even be able to request it, given that no one can agree on which pgsphere we are meant to use). Maybe we don't need any of this to run on Google, because a data release will be using Qserv and we only want this to run at USDF and the summit.
            • Storing the direct URIs means that we are never going to do composite disassembly. It now seems pretty clear that no one wants to do composite disassembly, so I should give up the idea. The links service at least gave the illusion that on-the-fly assembly was a thing that could happen (along with on-the-fly format conversion).
            • Direct URIs also mean that the ObsCore manager has to be able to talk to the datastore to get that URI (previously ObsCore was entirely within registry).
            gpdf Gregory Dubois-Felsmann added a comment -

            Quick reactions:

            • Yes, the pgsphere issue does complicate cloud deployment. We've looked at that at IPAC as well in some contexts, w.r.t. the Amazon cloud (because of NASA's deal with Amazon). Amazon has pre-packaged configurations of Postgres including PostGIS, but not pgsphere. At some point the astronomy community might be well advised to find the resources, in some project, to build a full-featured ADQL solution against PostGIS, perhaps including adding additional capabilities to PostGIS itself. pgsphere, even if it is currently still being maintained, is looking like a fragile reed.
            • I have been assuming that in DRP releases we would need ObsCore tables both in Qserv (for use/joining with the catalogs, as we're doing on DP0.2) and in Postgres, as part of the CDB, for joining with observatory metadata. So I think we'll need TAP-with-geometry on Postgres even in DRP.
            • The PPDB is by default Postgres, I think, and we'll certainly need an ObsCore table for the AP image data products, ideally joinable with the PPDB catalog contents.
            • I am not suggesting that we would ever use the "direct" model on DRP outputs. I just don't want to foreclose it in the code itself. If composite disassembly becomes something we really need to do, I wouldn't object to it on the grounds that it prevents using the "direct" model.
            salnikov Andy Salnikov added a comment - - edited

            ... lsst_visit, lsst_detector, lsst_tract, lsst_patch, lsst_band, and lsst_filter columns

            Gregory Dubois-Felsmann, indeed, I assume we want to keep these, and they (and probably other columns) can be added via simple configuration. One issue here is that extending the configuration in a way that changes the obscore table schema will need some coordination, so we should try to anticipate everything we need in advance.

            re access_url variations

            This is again a matter of configuration, but I also think that some aspects may be easier to handle on a TAP server side, or maybe in a special database view, which should be easier to modify compared to migrating the data in the table. I'll keep that in mind working on implementation.

            re PPDB

            Joining the PPDB and obscore is, I think, a whole new requirement. I believe it implies that we either need the Registry on the same server as the PPDB, or we need to replicate the obscore table to the PPDB server. We probably need to know more specifics about this to decide what we can do.

            salnikov Andy Salnikov added a comment -

            I think there is nothing more to add on this ticket, and implementation work is already being done on other tickets. The possible issue with PPDB integration will need to be addressed on a separate ticket. I'm going to close this one.


              People

              Assignee:
              salnikov Andy Salnikov
              Reporter:
              salnikov Andy Salnikov
              Watchers:
              Andy Salnikov, Gregory Dubois-Felsmann, Jim Bosch, Tim Jenness

                Dates

                Created:
                Updated:
                Resolved:
