Details
- Type: Improvement
- Status: Done
- Resolution: Done
- Fix Version/s: None
- Component/s: None
- Labels:
- Story Points: 3
- Epic Link:
- Team: Ops Middleware
Description
For spatial indexing of the obscore table we need the pg_sphere extension enabled. A standard Postgres deployment usually comes without that extension, so it has to be installed separately. The details of that process depend on how Postgres is deployed; some Linux distributions have the extension in their repositories, and PostgreSQL provides it in its own repositories as well.
I think we need this for USDF and also for the cloud instance; the two cases will likely require very different approaches.
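For illustration, enabling the extension and indexing a region column could look roughly like the sketch below. This assumes the pg_sphere package is already installed on the server host and that we have superuser privileges; the table and column names are placeholders, not the actual schema.

    -- Minimal sketch: enable the extension and create a GiST index on a
    -- spherical-geometry column; "obscore" and "pgsphere_region" are
    -- placeholder names.
    CREATE EXTENSION IF NOT EXISTS pg_sphere;
    CREATE INDEX obscore_region_idx ON obscore USING GIST (pgsphere_region);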
Activity
SQuaRE no longer deploys pg_sphere with PostgreSQL anywhere. That was part of the proof-of-concept obstap service, which we retired once we decided to use qserv for ObsCore. All the spherical geometry is now done by qserv.
The old repository we used for obstap is at https://github.com/lsst-sqre/tap-postgres if that's helpful, and it did work at one point, although we didn't put a lot of weight on it.
Presumably this ticket is to ask the USDF database admins to add the pgsphere plugin to the OGA repo? Gregory Dubois-Felsmann, for which repos do we need live ObsCore working?
Markus Demleitner recommends this version of pgsphere: https://github.com/postgrespro/pgsphere
Russ Allbery, thanks, that is useful. The comments in the repo say that the pgsphere RPM came from a CADC build on the DM-23081 branch. Christine Banek, do you remember where you got that RPM? USDF runs Postgres 14, so we will need a build of pgsphere for that version.
I asked Yee about details on USDF install, will post more info here.
We have pg_sphere installed on the usdf-butler-dev instance (version 1.2, which I understand is the latest one from the GitHub repo). I ran some tests with the obscore table and pg_sphere using the migration scripts from the DM-36766 branch; here is a summary of the tests.
The DM-36766 code was developed and tested on a copy of the OGA repository loaded onto my personal server. That copy has very little data, approximately 20k records for raw images. These tests were performed on a copy of the main repo (main_20210215 schema), which has a lot more data, so one of the goals was to check the performance of the whole process.
The process of adding the obscore table to the registry includes two steps:
- Adding the obscore and obscore-config managers to the registry configuration (butler_attributes table). This obviously needs an obscore configuration; I used something based on the configuration used for generating the obscore table for QServ. The migration script only creates an empty obscore table but does not populate it with any data.
- Adding records to the obscore table from existing registry data, which is done by a new command in the daf_butler package: butler obscore-update-table. It scans all dataset refs based on the obscore configuration, converts them to obscore records, and adds them to the table. This is a slow process, especially for the raw dataset type, because we also need to find a corresponding region from matching visit records. (A simple sanity-check query for this step is sketched right after this list.)
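As a rough sanity check of the second step, a grouping query along these lines could be run once population finishes. dataproduct_subtype is the standard ObsCore column name; the actual table and schema names may differ.

    -- Hypothetical sanity check after populating the table: count obscore
    -- records per dataset (sub)type; table and column names are placeholders.
    SELECT dataproduct_subtype, COUNT(*) AS n_records
    FROM obscore
    GROUP BY dataproduct_subtype
    ORDER BY n_records DESC;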
The first step worked perfectly. One minor issue is that DM-36766 defines the migration script only for the OGA namespace (I plan to use that for demonstration and add a migration script for the main namespace later), so I had to cheat in my config, but this does not really matter for this test.
The second step needed one minor patch to DM-36766 to handle empty timespans correctly (already added to the DM-36766 branch). With that fix I managed to run the whole thing, and it took about 3 hours 40 minutes to complete. That feels like a long time, but on the other hand we do not need downtime for this, and it can run concurrently with regular activity, so this should not be an issue.
The two longest contributions to that come from:
- raw dataset type from LSSTCam/raw/all collection - approx. 6100 seconds and 6,188,213 records
- raw dataset type from HSC/raw/all collection - approx. 5300 seconds and 2,935,078 records
All other collections and dataset types take much less time compared to these two. I believe that the major contribution for the raw dataset type comes from finding matching regions from visit records. There may be ways to improve that, but I do not want to spend too much time on this right now (this step should only be done once, so performance is not critical). What is not quite clear to me is why the time does not scale linearly with the number of records for these two collections; I'll try to investigate.
I ran a number of spatial queries on the obscore table to verify that pg_sphere works OK, that it does not crash the server, and that spatial indexing works as expected. The results were reasonable: no crashes were observed, queries (those that can be indexed) ran fast, and query analysis shows that the index is indeed used for those queries.
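As an illustration of the kind of check involved, a query like the one below can be run under EXPLAIN to confirm that the GiST index is picked up. The column names and coordinates are placeholders, not the actual test queries.

    -- Illustrative spatial query: select records whose region contains a
    -- given point (pg_sphere's ~ "contains" operator, coordinates in
    -- radians).  EXPLAIN should show the GiST index being used for queries
    -- of this form.
    EXPLAIN ANALYZE
    SELECT obs_id
    FROM obscore
    WHERE pgsphere_region ~ spoint(radians(56.65), radians(24.12));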
I think I can declare success with this test, and we can ask Yee Li to deploy pg_sphere on the production instance when there is an opportunity (I guess this will need downtime).
Yee Li, one thing that I noticed, which is probably a dev-specific issue: I get occasional errors from the server that look like this:
ERROR sqlalchemy.pool.impl.QueuePool ()(base.py:781) - Exception during reset or similar
Traceback (most recent call last):
  File "/cvmfs/sw.lsst.eu/linux-x86_64/lsst_distrib/w_2022_45/conda/envs/lsst-scipipe-4.1.0-ext/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 753, in _finalize_fairy
    fairy._reset(pool)
  File "/cvmfs/sw.lsst.eu/linux-x86_64/lsst_distrib/w_2022_45/conda/envs/lsst-scipipe-4.1.0-ext/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 1004, in _reset
    pool._dialect.do_rollback(self)
  File "/cvmfs/sw.lsst.eu/linux-x86_64/lsst_distrib/w_2022_45/conda/envs/lsst-scipipe-4.1.0-ext/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 683, in do_rollback
    dbapi_connection.rollback()
psycopg2.OperationalError: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
Could it be something related to proxy (I think CNPG uses pgpool?) timeouts?
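If it helps to narrow this down, a few standard server-side timeout settings could be checked directly; this is only a guess at where to look, and the proxy layer may have its own, separate timeouts.

    -- Standard PostgreSQL settings that can close connections that sit idle
    -- during a long-running job; the proxy may impose additional timeouts.
    SHOW idle_in_transaction_session_timeout;
    SHOW statement_timeout;
    SHOW tcp_keepalives_idle;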
Great! Thanks for the updates, Andy Salnikov. We'll make plans to roll it into the production butler ASAP. Do you have a timestamp for that error?
Yee Li, I saw a few instances of this exception:
- 2022-11-04T00:57:16.401-07:00
- 2022-11-03T23:10:09.020-07:00
- 2022-11-03T19:21:27.423-07:00
- 2022-11-03T17:30:57.513-07:00
All seem to be related to a long-running transaction.
The pg_sphere extension was installed yesterday by the DBAs on the usdf-butler instance and added to the lsstdb1 database. Today I loaded ci_hsc data into a separate schema in that database and ran the obscore migration scripts on it to test that pg_sphere is in usable shape. Everything works perfectly: queries run successfully and spatial indexing works as well.
I'll wait for feedback on DM-36701 to finalize the configuration for the OGA repository and will then add obscore to that repo. This ticket is finished; there is no code to review.
Yee Li, thanks a lot for coordinating this installation!
Andy Salnikov, what is the origin of the pgsphere extension that you've actually been installing? Are we getting an RPM from outside, or building it ourselves from source, or...?
Gregory Dubois-Felsmann, I believe it was built from source; our DBAs made a special image that includes pg_sphere. Yee Li can probably point you to specifics.
We need to involve the SQuaRE team in this since they deploy exactly this form of Postgres for the TAP service and we should agree on which variant of pgsphere we are using.
IDF is problematic given that CloudSQL does not support pgsphere. I didn't think we had a requirement to use ObsCore on the IDF.