Data Management / DM-25379

psycopg2.OperationalError: SSL when running ci_hsc_gen3

Details

    • 1
    • Data Release Production
    • No

    Description

      While running ci_hsc_gen3 against the PostgreSQL server, I've been seeing the following error message (usually more than 100 times per ci_hsc_gen3 run when it happens):

      sqlalchemy.pool.impl.QueuePool ERROR: Exception during reset or similar
      Traceback (most recent call last):
        File "/software/lsstsw/stack_20200515/python/miniconda3-4.7.12/envs/lsst-scipipe/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 693, in _finalize_fairy
          fairy._reset(pool)
        File "/software/lsstsw/stack_20200515/python/miniconda3-4.7.12/envs/lsst-scipipe/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 880, in _reset
          pool._dialect.do_rollback(self)
        File "/software/lsstsw/stack_20200515/python/miniconda3-4.7.12/envs/lsst-scipipe/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 540, in do_rollback
          dbapi_connection.rollback()
      psycopg2.OperationalError: SSL SYSCALL error: EOF detected
      

      I've only seen the following once:

      sqlalchemy.pool.impl.QueuePool ERROR: Exception during reset or similar
      Traceback (most recent call last):
        File "/software/lsstsw/stack_20200515/python/miniconda3-4.7.12/envs/lsst-scipipe/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 693, in _finalize_fairy
          fairy._reset(pool)
        File "/software/lsstsw/stack_20200515/python/miniconda3-4.7.12/envs/lsst-scipipe/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 880, in _reset
          pool._dialect.do_rollback(self)
        File "/software/lsstsw/stack_20200515/python/miniconda3-4.7.12/envs/lsst-scipipe/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 540, in do_rollback
          dbapi_connection.rollback()
      psycopg2.OperationalError: SSL error: decryption failed or bad record mac
      

      This doesn't seem to actually affect the run, and it doesn't happen on every run. It only seems to happen when ci_hsc_gen3 runs pipetask (i.e., never during butler commands).
      Here are counts from a few runs (from grep of stdout/stderr):

      1591627439/cihsc_w_2020_23_1591627439_postgres.out:149
      1591633412/cihsc_w_2020_23_1591633412_postgres.out:136
      1591819226/cihsc_w_2020_23_1591819226_postgres.out:0
      1591886082/cihsc_w_2020_23_1591886082_postgres.out:140
      1591888106/cihsc_w_2020_23_1591888106_postgres.out:0
      1591889656/cihsc_w_2020_23_1591889656_postgres.out:136
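
For reference, a minimal sketch of how per-run counts like these could be collected in Python. The search string and the file glob are assumptions; the exact grep pattern used for the numbers above is not recorded here.

```python
from pathlib import Path

# Assumed marker: the common prefix of both psycopg2 messages quoted above.
ERROR_MARKER = "psycopg2.OperationalError: SSL"


def count_ssl_errors(log_path):
    """Return the number of lines in log_path that contain ERROR_MARKER."""
    with open(log_path, errors="replace") as log:
        return sum(1 for line in log if ERROR_MARKER in line)


def summarize(run_root, pattern="*/*_postgres.out"):
    """Map each matching log file (relative to run_root) to its error count."""
    root = Path(run_root)
    return {
        str(path.relative_to(root)): count_ssl_errors(path)
        for path in sorted(root.glob(pattern))
    }
```

Calling `summarize(".")` from the directory that holds the per-run subdirectories would give one count per log file, including the runs with zero errors.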
      

      Google searches for the error messages make it sound like a multiprocessing code issue (e.g., https://virtualandy.wordpress.com/2019/09/04/a-fix-for-operationalerror-psycopg2-operationalerror-ssl-error-decryption-failed-or-bad-record-mac/). I had found a result that talked about TCP keepalive, so I tried pool_pre_ping=True in the create_engine call, but that didn't seem to make any difference.
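
A possible reason pool_pre_ping made no difference: pre-ping tests a connection when it is checked out of the pool, while the tracebacks above come from `_finalize_fairy` / `_reset`, i.e. the rollback issued when a connection is handed back. The sketch below only illustrates that ordering. It uses an in-memory SQLite URI as a stand-in (an assumption, so that it runs without a server) and an explicit QueuePool to match the pool named in the log; it does not reproduce the SSL error itself.

```python
from sqlalchemy import create_engine, event, text
from sqlalchemy.pool import QueuePool

# Stand-in URI (assumption); the ticket's engine pointed at PostgreSQL.
engine = create_engine("sqlite://", poolclass=QueuePool, pool_pre_ping=True)

pool_events = []


@event.listens_for(engine, "checkout")
def on_checkout(dbapi_connection, connection_record, connection_proxy):
    # pool_pre_ping checks liveness as part of checking a connection out.
    pool_events.append("checkout")


@event.listens_for(engine, "checkin")
def on_checkin(dbapi_connection, connection_record):
    # Fires once the connection has been reset (rolled back) and returned.
    pool_events.append("checkin")


with engine.connect() as conn:
    conn.execute(text("SELECT 1"))

print(pool_events)
```

If the rollback on return is where a connection shared across processes fails, a liveness check at checkout would not be expected to suppress the logged error.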

          Activity

            mgower Michelle Gower added a comment -

            20 ci_hsc_gen3 runs last night with the NullPool change ran without any SSL errors. Not a guarantee, but in the previous 20 runs I was seeing errors in at least some of the runs. I pushed the poolclass capitalization fix. You can test this before and after the capitalization change by using a dummy postgresql connection string with butler create. With the wrong capitalization it fails before even trying to contact the non-existent machine. I think this is ready to merge. If you'd rather I run more ci_hsc_gen3 tests over a few days, I can do that in the background with little effort. I volunteer to review this whenever you are ready (i.e., hit the GitHub and Jira buttons).
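
The capitalization check described in the comment above can be sketched as follows. The misspelling `poolClass` is only an assumed example of a wrong capitalization (the actual typo is not recorded in this ticket), and an in-memory SQLite URI stands in for the dummy postgresql string so that the sketch runs without psycopg2 installed.

```python
from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

# Stand-in URI (assumption); the ticket used a dummy postgresql string.
URI = "sqlite://"

# Correct keyword: create_engine() is lazy, so no connection is opened here.
engine = create_engine(URI, poolclass=NullPool)
print(type(engine.pool).__name__)

# Hypothetical miscapitalized keyword: rejected while the engine is being
# constructed, before any attempt to reach a server.
try:
    create_engine(URI, poolClass=NullPool)
except TypeError as err:
    bad_keyword_error = err
    print("rejected:", err)
```

With NullPool, every checkout opens a fresh DBAPI connection and every return closes it, so no pooled connection is kept around to be shared between processes. That is consistent with the clean runs reported above, though it is not proof of the root cause.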
            tjenness Tim Jenness added a comment -

            Should we make the same change in oracle.py for completeness? It would be nice if these generic sqlalchemy connection parameters that we always want to use came from a single location.

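
One way the "single location" idea could look, as a rough sketch only. The names `COMMON_ENGINE_KWARGS` and `make_engine` are hypothetical and do not exist in the code base; the only setting taken from this ticket is `poolclass=NullPool`.

```python
from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

# Hypothetical shared defaults; only poolclass=NullPool comes from this ticket.
COMMON_ENGINE_KWARGS = {"poolclass": NullPool}


def make_engine(uri, **overrides):
    """Create an engine from the shared defaults, letting callers override."""
    kwargs = {**COMMON_ENGINE_KWARGS, **overrides}
    return create_engine(uri, **kwargs)


# In-memory SQLite is a stand-in URI so the sketch runs without a server.
engine = make_engine("sqlite://")
```

A backend that needs different pooling could pass its own keyword arguments as overrides rather than repeating the shared ones.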
            nlust Nate Lust added a comment -

            I went ahead and squashed your commit into mine. As this fix worked for sqlite, and it seemed to work for the run you did, I feel pretty confident that we can merge it in if you are happy with the changes.

            mgower Michelle Gower added a comment -

            Oracle has pool_size=1, which may explain why we weren't seeing problems there, so I would prefer leaving it alone for now. Making the generic sqlalchemy connection parameters come from a single location sounds like a new ticket. Starting review.

            mgower Michelle Gower added a comment -

            Changes look fine. Testing didn't reproduce the error message. Please merge.

            People

              nlust Nate Lust
              mgower Michelle Gower
              Michelle Gower
              Andy Salnikov, Michelle Gower, Nate Lust, Tim Jenness