Data Management / DM-26051

Try ap_pipe HiTS2015 rerun with PostgreSQL


    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Story Points: 8
    • Sprint: AP F20-2 (July), AP F20-3 (August)
    • Team: Alert Production
    • Urgent?: No

      Description

      After discussing with Eric Morganson [X] and Chris Stephens at NCSA, I have an empty APDB-like PostgreSQL database I can connect to, hosted at lsst-pg-devel1. Eventually we will probably want to use one of these instead of sqlite for all/most AP reprocessing, especially the larger multi-filter HSC "medium" dataset I used for DM-24252. For now, this ticket is just to rerun ap_pipe with this new flavor of database to see if I can get it to work. The results should be nearly identical to those of the official July 2020 reprocessing (DM-25910).

      Fun facts

      • Credentials are an issue: every time I connect to this DB, I presently have to type my NCSA password
      • I don't currently have the ability to run, e.g., make_apdb.py and get a fresh empty PostgreSQL proto-APDB on this host; instead I have to ask Chris S to make me one, and there's just one for me to play with for starters
      • I need to read up on psycopg2 (the Python package for interfacing with PostgreSQL databases) so I can load APDB data into python/pandas
      • The host for this test database is, as Chris S described, the wild west, and we will need to do something to make sure this and future APDBs remain accessible in the future


            Activity

            mrawls Meredith Rawls added a comment -

            I successfully completed an ap_pipe run with the APDB being written to the new lsstdevapdb1 PostgreSQL database! I need to verify I can access the data to make my usual plots and analysis. I also need to learn a workflow for copying a past rerun's APDB into a new schema, make sure I know how to access that schema in the future, and then wipe the old default schema (conveniently called "mrawls") so it is ready for the next rerun to connect to. There is presently a permission issue and I still need to find a better way to store my credentials.

            mrawls Meredith Rawls added a comment -

            It works! The workflow for using PostgreSQL instead of sqlite with ap_pipe is as follows:

            • Ask someone at NCSA to make you an empty sandbox PostgreSQL database. Christopher Stephens [X] helped me with this. Mine is called lsstdevapdb1, and it's accessed at host lsst-pg-devel1 using a certain port and my NCSA username. The default schema (which is the actual thing containing DB tables and whatnot) is also my NCSA username.
            • Run make_apdb in the usual way, e.g., make_apdb.py -c diaPipe.apdb.isolation_level=READ_UNCOMMITTED -c diaPipe.apdb.db_url="postgresql://USER@HOST.ncsa.illinois.edu/lsstdevapdb1" (with USER and HOST filled in appropriately)
            • Run ap_pipe in the usual way, e.g., with the same db_url config as well as the usual isolation_level="READ_UNCOMMITTED" config (I omitted the connection_timeout config that I'd added for sqlite)
            • Use `psql` to quickly introspect the APDB (it's actually rather well documented online?!), e.g. psql "host=lsst-pg-devel1.ncsa.illinois.edu user=USER dbname=lsstdevapdb1"
            • Use the python package psycopg2 to connect to the APDB and run queries etc. Note that queries are very particular about quotation marks in a way that they weren't for me with sqlite, e.g., select "ra", "decl" from DiaObject where "validityEnd" is NULL limit 5; will only work with the double quotes around each field name.
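
            The psycopg2 step above might look something like the sketch below. It assumes psycopg2 and pandas are installed; `quoted` and `fetch_diaobjects` are hypothetical helper names, and USER is a placeholder for an NCSA username:

            ```python
            def quoted(*columns):
                """Double-quote identifiers so PostgreSQL keeps their mixed case."""
                return ", ".join('"{}"'.format(column) for column in columns)

            QUERY = ("SELECT {} FROM DiaObject "
                     'WHERE "validityEnd" IS NULL LIMIT 5;').format(quoted("ra", "decl"))

            def fetch_diaobjects():
                # Imports deferred so the query helpers above work without a database.
                import psycopg2
                import pandas as pd
                with psycopg2.connect(host="lsst-pg-devel1.ncsa.illinois.edu",
                                      dbname="lsstdevapdb1", user="USER") as connection:
                    return pd.read_sql_query(QUERY, connection)
            ```

            Calling fetch_diaobjects() then returns a DataFrame ready for the usual plots and analysis, with the password picked up by libpq from `~/.pgpass` if one is configured.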

            Still to be solved: a better way to save my NCSA credentials for connecting to the DB host than a plain text `~/.pgpass` file. Then again, you need to be logged in as me to access that file, and the only way you could do that is if you already had my password, so this isn't my top concern.
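
            For reference, `~/.pgpass` is one colon-separated line per server (PORT, USER, and PASSWORD below are placeholders), and libpq ignores the file unless its permissions are 0600 (chmod 600 ~/.pgpass):

            ```
            # hostname:port:database:username:password
            lsst-pg-devel1.ncsa.illinois.edu:PORT:lsstdevapdb1:USER:PASSWORD
            ```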


            tjenness Tim Jenness added a comment -

            Regarding the quoting, Andy Salnikov might correct me but I think that's because of the mixed case in the column names. At least in Butler we had to switch to all lower case to simplify quoting requirements.

            salnikov Andy Salnikov added a comment -

            Indeed. Even though every backend has different rules for case sensitivity, the surest way to trigger problems is to use mixed case in table/column names. Double quotes are the standard SQL way to pass an identifier so that the backend does not transform it, so this is indeed a solution if the schema defines those fields as case-sensitive (which dax_apdb probably does). Maybe a better solution is to use sqlalchemy to build/execute queries instead of native psycopg2; it's easier to be consistent with dax_apdb that way. And, as Tim suggested, we could switch to lower-case names if it simplifies matters, but this is a DPDD-defined schema, so we may need to start with the DPDD definitions first.
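
            A sketch of the sqlalchemy approach suggested above, assuming SQLAlchemy 1.4+ is installed; the column list here is illustrative, not the full APDB schema:

            ```python
            import sqlalchemy as sa

            metadata = sa.MetaData()
            dia_object = sa.Table(
                "DiaObject", metadata,
                sa.Column("ra", sa.Float),
                sa.Column("decl", sa.Float),
                sa.Column("validityEnd", sa.DateTime),
            )

            # SQLAlchemy quotes mixed-case identifiers automatically when rendering SQL,
            # so the caller never writes the double quotes by hand.
            query = (
                sa.select(dia_object.c.ra, dia_object.c.decl)
                .where(dia_object.c.validityEnd.is_(None))
                .limit(5)
            )
            ```

            Rendering the statement (str(query), or compiling with the postgresql dialect) produces the quoted identifiers that raw psycopg2 queries had to include manually.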

            tjenness Tim Jenness added a comment - - edited

            Whenever I've discussed DPDD "schemas" in the past, I've been told quite clearly that the DPDD tells you the content that is promised, not the specifics of exact names. My understanding is that we are at liberty to tweak things if it does not change the scientific content and is driven by implementation needs.

            cc/ Colin Slater

            salnikov Andy Salnikov added a comment -

            I do agree that it would be nice if the database (or any other persistency mechanism) could use a different schema from what is declared in the DPDD. Today we do not have any intermediate layer that separates DPDD definitions from the implementation, but we do expose the database to the external world via TAP. In the APDB case this is probably less of an issue because we are going to have the PPDB as a separate layer for end users, but we will still need to think about how PPDB names relate to the DPDD schema.

            I think we should take this out of this particular ticket's comments and let DAX people (Fritz Mueller, Colin Slater) discuss it separately.

            gkovacs Gabor Kovacs [X] (Inactive) added a comment -

            Looks good. Some remarks based on my former postgres experience:

            • Though I don't think it is directly relevant now: if you ever need to format external values into SQL query strings, you should use psycopg2's special character-escaping functions for this purpose. Otherwise it becomes a security hole. (So far I have only seen "str.format()" with internal constants in the scripts.)
            • If you want to browse a pg database directly, I'd recommend pgAdmin. It can also show the exact SQL syntax for certain operations.
            • I don't know how the NCSA server is set up, but if you can have different, less valuable credentials for the DB itself, you can try SSH port forwarding from your client computer and use USER@localhost:localport for DB access. Then you can avoid storing your login info in cleartext files.
            • I think you can back up, clean, and re-create an empty database for yourself from a pg_dump backup and a psql restore. There are options to play with, depending on whether you need table drop/creation privileges.
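
            The escaping point above can be sketched with the stdlib sqlite3 driver so it runs anywhere; psycopg2 works the same way but uses %s placeholders (and offers the psycopg2.sql module for safely composing identifiers). The table and values here are toy stand-ins:

            ```python
            import sqlite3

            connection = sqlite3.connect(":memory:")
            connection.execute('CREATE TABLE "DiaObject" ("ra" REAL, "decl" REAL)')
            connection.execute('INSERT INTO "DiaObject" VALUES (10.5, -43.2)')

            # Unsafe: interpolating external values into the query string is an
            # injection hole, e.g.
            #   connection.execute('SELECT * FROM "DiaObject" WHERE "ra" > {}'.format(user_input))

            # Safe: let the driver bind the value.
            rows = connection.execute(
                'SELECT "ra", "decl" FROM "DiaObject" WHERE "ra" > ?', (0,)).fetchall()
            print(rows)  # → [(10.5, -43.2)]
            ```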
            mrawls Meredith Rawls added a comment -

            Thanks all for the insights on SQL query formatting, and thanks Gabor for the review and postgres tips.

            As an aside, I don't strongly care if the APDB schema matches the DPDD exactly, and I don't think we should get too hung up on SQL details at this stage. It's my understanding we'll be using a totally different "NoSQL" database when it's time for the real LSST APDB to happen anyway.


              People

              Assignee:
              mrawls Meredith Rawls
              Reporter:
              mrawls Meredith Rawls
              Reviewers:
              Gabor Kovacs [X] (Inactive)
              Watchers:
              Andy Salnikov, Gabor Kovacs [X] (Inactive), Meredith Rawls, Tim Jenness


                  Jenkins

                  No builds found.