Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-26539

w_35 ci_hsc_gen pipeline.sh dying due to SSL Error

    XMLWordPrintable

    Details

    • Story Points:
      1
    • Sprint:
      DB_F20_06
    • Team:
      Data Access and Database
    • Urgent?:
      No

      Description

      w_35 of the software stack + ci_hsc_gen3 dies in pipeline.sh with

      psycopg2.OperationalError: SSL error: decryption failed or bad record mac
      

      Monika Adamow first generated the error and I was able to duplicate it. While a full run has not yet completed using -j 1, it was run far enough to get through the first few PipelineTasks without having this error.
      No error messages from previous steps in scons. Full output of pipeline.sh:

       
      bin/pipeline.sh 8 /scratch/mgower/weekly_ci_hsc_gen3/postgres/git/ci_hsc_gen3/DATA
      ctrl.mpexec.cmdLineFwk INFO: QuantumGraph contains 155 quanta for 12 tasks
      ctrl.mpexec.cmdLineFwk INFO: QuantumGraph contains 155 quanta for 12 tasks
      Process task-0:
      Traceback (most recent call last):
        File "/software/lsstsw/stack_20200515/python/miniconda3-4.7.12/envs/lsst-scipipe/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1248, in _execute_context
          cursor, statement, parameters, context
        File "/software/lsstsw/stack_20200515/python/miniconda3-4.7.12/envs/lsst-scipipe/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 590, in do_execute
          cursor.execute(statement, parameters)
      psycopg2.OperationalError: SSL error: decryption failed or bad record mac
       
       
      The above exception was the direct cause of the following exception:
      Traceback (most recent call last):
        File "/software/lsstsw/stack_20200515/python/miniconda3-4.7.12/envs/lsst-scipipe/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
          self.run()
        File "/software/lsstsw/stack_20200515/python/miniconda3-4.7.12/envs/lsst-scipipe/lib/python3.7/multiprocessing/process.py", line 99, in run
          self._target(*self._args, **self._kwargs)
        File "/software/lsstsw/stack_20200515/stack/miniconda3-4.7.12-46b24e8/Linux64/ctrl_mpexec/20.0.0-13-g37abb6e+24c9b31241/python/lsst/ctrl/mpexec/singleQuantumExecutor.py", line 83, in execute
          self.initGlobals(quantum, butler)
        File "/software/lsstsw/stack_20200515/stack/miniconda3-4.7.12-46b24e8/Linux64/ctrl_mpexec/20.0.0-13-g37abb6e+24c9b31241/python/lsst/ctrl/mpexec/singleQuantumExecutor.py", line 273, in initGlobals
          Instrument.fromName(instrument, butler.registry)
        File "/software/lsstsw/stack_20200515/stack/miniconda3-4.7.12-46b24e8/Linux64/obs_base/20.0.0-37-g38a3e24+ce95e46117/python/lsst/obs/base/_instrument.py", line 219, in fromName
          records = list(registry.queryDimensionRecords("instrument", instrument=name))
        File "/software/lsstsw/stack_20200515/stack/miniconda3-4.7.12-46b24e8/Linux64/daf_butler/19.0.0-148-g38150800+de92faa26f/python/lsst/daf/butler/registry/_registry.py", line 1402, in queryDimensionRecords
          where=where, components=components, **kwargs)
        File "/software/lsstsw/stack_20200515/stack/miniconda3-4.7.12-46b24e8/Linux64/daf_butler/19.0.0-148-g38150800+de92faa26f/python/lsst/daf/butler/registry/_registry.py", line 1324, in queryDataIds
          standardizedDataId = self.expandDataId(dataId, **kwargs)
        File "/software/lsstsw/stack_20200515/stack/miniconda3-4.7.12-46b24e8/Linux64/daf_butler/19.0.0-148-g38150800+de92faa26f/python/lsst/daf/butler/registry/_registry.py", line 851, in expandDataId
          fetched = tuple(storage.fetch(dataIdSet))
        File "/software/lsstsw/stack_20200515/stack/miniconda3-4.7.12-46b24e8/Linux64/daf_butler/19.0.0-148-g38150800+de92faa26f/python/lsst/daf/butler/registry/dimensions/caching.py", line 111, in fetch
          for record in self._nested.fetch(toFetch):
        File "/software/lsstsw/stack_20200515/stack/miniconda3-4.7.12-46b24e8/Linux64/daf_butler/19.0.0-148-g38150800+de92faa26f/python/lsst/daf/butler/registry/dimensions/table.py", line 127, in fetch
          for row in self._db.query(query.combine()):
        File "/software/lsstsw/stack_20200515/stack/miniconda3-4.7.12-46b24e8/Linux64/daf_butler/19.0.0-148-g38150800+de92faa26f/python/lsst/daf/butler/registry/interfaces/_database.py", line 1373, in query
          return self._connection.execute(sql, *args, **kwds)
        File "/software/lsstsw/stack_20200515/python/miniconda3-4.7.12/envs/lsst-scipipe/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 984, in execute
          return meth(self, multiparams, params)
        File "/software/lsstsw/stack_20200515/python/miniconda3-4.7.12/envs/lsst-scipipe/lib/python3.7/site-packages/sqlalchemy/sql/elements.py", line 293, in _execute_on_connection
          return connection._execute_clauseelement(self, multiparams, params)
        File "/software/lsstsw/stack_20200515/python/miniconda3-4.7.12/envs/lsst-scipipe/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1103, in _execute_clauseelement
          distilled_params,
        File "/software/lsstsw/stack_20200515/python/miniconda3-4.7.12/envs/lsst-scipipe/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1288, in _execute_context
          e, statement, parameters, cursor, context
        File "/software/lsstsw/stack_20200515/python/miniconda3-4.7.12/envs/lsst-scipipe/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1482, in _handle_dbapi_exception
          sqlalchemy_exception, with_traceback=exc_info[2], from_=e
        File "/software/lsstsw/stack_20200515/python/miniconda3-4.7.12/envs/lsst-scipipe/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
          raise exception
        File "/software/lsstsw/stack_20200515/python/miniconda3-4.7.12/envs/lsst-scipipe/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1248, in _execute_context
          cursor, statement, parameters, context
        File "/software/lsstsw/stack_20200515/python/miniconda3-4.7.12/envs/lsst-scipipe/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 590, in do_execute
          cursor.execute(statement, parameters)
      sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) SSL error: decryption failed or bad record mac
       
      [SQL: SELECT svclsstdbmgower1.instrument.name, svclsstdbmgower1.instrument.visit_max, svclsstdbmgower1.instrument.exposure_max, svclsstdbmgower1.instrument.detector_max, svclsstdbmgower1.instrument.class_name
      FROM svclsstdbmgower1.instrument
      WHERE svclsstdbmgower1.instrument.name = %(name_1)s]
      [parameters: {'name_1': 'HSC'}]
      (Background on this error at: http://sqlalche.me/e/e3q8)
      Process task-1:
      Traceback (most recent call last):
        File "/software/lsstsw/stack_20200515/python/miniconda3-4.7.12/envs/lsst-scipipe/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1248, in _execute_context
          cursor, statement, parameters, context
        File "/software/lsstsw/stack_20200515/python/miniconda3-4.7.12/envs/lsst-scipipe/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 590, in do_execute
          cursor.execute(statement, parameters)
      psycopg2.OperationalError: SSL error: decryption failed or bad record mac
       
       
      The above exception was the direct cause of the following exception:
       
      Traceback (most recent call last):
        File "/software/lsstsw/stack_20200515/python/miniconda3-4.7.12/envs/lsst-scipipe/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
          self.run()
        File "/software/lsstsw/stack_20200515/python/miniconda3-4.7.12/envs/lsst-scipipe/lib/python3.7/multiprocessing/process.py", line 99, in run
          self._target(*self._args, **self._kwargs)
        File "/software/lsstsw/stack_20200515/stack/miniconda3-4.7.12-46b24e8/Linux64/ctrl_mpexec/20.0.0-13-g37abb6e+24c9b31241/python/lsst/ctrl/mpexec/singleQuantumExecutor.py", line 83, in execute
          self.initGlobals(quantum, butler)
        File "/software/lsstsw/stack_20200515/stack/miniconda3-4.7.12-46b24e8/Linux64/ctrl_mpexec/20.0.0-13-g37abb6e+24c9b31241/python/lsst/ctrl/mpexec/singleQuantumExecutor.py", line 273, in initGlobals
          Instrument.fromName(instrument, butler.registry)
        File "/software/lsstsw/stack_20200515/stack/miniconda3-4.7.12-46b24e8/Linux64/obs_base/20.0.0-37-g38a3e24+ce95e46117/python/lsst/obs/base/_instrument.py", line 219, in fromName
          records = list(registry.queryDimensionRecords("instrument", instrument=name))
        File "/software/lsstsw/stack_20200515/stack/miniconda3-4.7.12-46b24e8/Linux64/daf_butler/19.0.0-148-g38150800+de92faa26f/python/lsst/daf/butler/registry/_registry.py", line 1402, in queryDimensionRecords
          where=where, components=components, **kwargs)
        File "/software/lsstsw/stack_20200515/stack/miniconda3-4.7.12-46b24e8/Linux64/daf_butler/19.0.0-148-g38150800+de92faa26f/python/lsst/daf/butler/registry/_registry.py", line 1324, in queryDataIds
          standardizedDataId = self.expandDataId(dataId, **kwargs)
        File "/software/lsstsw/stack_20200515/stack/miniconda3-4.7.12-46b24e8/Linux64/daf_butler/19.0.0-148-g38150800+de92faa26f/python/lsst/daf/butler/registry/_registry.py", line 851, in expandDataId
          fetched = tuple(storage.fetch(dataIdSet))
        File "/software/lsstsw/stack_20200515/stack/miniconda3-4.7.12-46b24e8/Linux64/daf_butler/19.0.0-148-g38150800+de92faa26f/python/lsst/daf/butler/registry/dimensions/caching.py", line 111, in fetch
          for record in self._nested.fetch(toFetch):
        File "/software/lsstsw/stack_20200515/stack/miniconda3-4.7.12-46b24e8/Linux64/daf_butler/19.0.0-148-g38150800+de92faa26f/python/lsst/daf/butler/registry/dimensions/table.py", line 127, in fetch
          for row in self._db.query(query.combine()):
        File "/software/lsstsw/stack_20200515/stack/miniconda3-4.7.12-46b24e8/Linux64/daf_butler/19.0.0-148-g38150800+de92faa26f/python/lsst/daf/butler/registry/interfaces/_database.py", line 1373, in query
          return self._connection.execute(sql, *args, **kwds)
        File "/software/lsstsw/stack_20200515/python/miniconda3-4.7.12/envs/lsst-scipipe/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 984, in execute
          return meth(self, multiparams, params)
        File "/software/lsstsw/stack_20200515/python/miniconda3-4.7.12/envs/lsst-scipipe/lib/python3.7/site-packages/sqlalchemy/sql/elements.py", line 293, in _execute_on_connection
          return connection._execute_clauseelement(self, multiparams, params)
        File "/software/lsstsw/stack_20200515/python/miniconda3-4.7.12/envs/lsst-scipipe/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1103, in _execute_clauseelement
          distilled_params,
        File "/software/lsstsw/stack_20200515/python/miniconda3-4.7.12/envs/lsst-scipipe/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1288, in _execute_context
          e, statement, parameters, cursor, context
        File "/software/lsstsw/stack_20200515/python/miniconda3-4.7.12/envs/lsst-scipipe/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1482, in _handle_dbapi_exception
          sqlalchemy_exception, with_traceback=exc_info[2], from_=e
        File "/software/lsstsw/stack_20200515/python/miniconda3-4.7.12/envs/lsst-scipipe/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
          raise exception
        File "/software/lsstsw/stack_20200515/python/miniconda3-4.7.12/envs/lsst-scipipe/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1248, in _execute_context
          cursor, statement, parameters, context
        File "/software/lsstsw/stack_20200515/python/miniconda3-4.7.12/envs/lsst-scipipe/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 590, in do_execute
          cursor.execute(statement, parameters)
      sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) SSL error: decryption failed or bad record mac
      

      The error seems to repeat multiple times and then we get lines like:

      trl.mpexec.mpGraphExecutor ERROR: Upstream job failed for task <TaskDef(MakeWarpTask, label=makeWarpTask) dataId={abstract_filter: r, instrument: HSC, skymap: discrete/ci_hsc, physical_filter: HSC-R, tract: 0, visit_system: 0, patch: 69, visit: 903342}>, skipping this task.
      ctrl.mpexec.mpGraphExecutor ERROR: Upstream job failed for task <TaskDef(MakeWarpTask, label=makeWarpTask) dataId={abstract_filter: i, instrument: HSC, skymap: discrete/ci_hsc, physical_filter: HSC-I, tract: 0, visit_system: 0, patch: 69, visit: 903986}>, skipping this task.
      ctrl.mpexec.mpGraphExecutor ERROR: Upstream job failed for task <TaskDef(MakeWarpTask, label=makeWarpTask) dataId={abstract_filter: r, instrument: HSC, skymap: discrete/ci_hsc, physical_filter: HSC-R, tract: 0, visit_system: 0, patch: 69, visit: 903344}>, skipping this task.
      ctrl.mpexec.mpGraphExecutor ERROR: Upstream job failed for task <TaskDef(CompareWarpAssembleCoaddTask, label=assembleCoadd) dataId={abstract_filter: r, skymap: discrete/ci_hsc, tract: 0, patch: 69}>, skipping this task.
      ctrl.mpexec.mpGraphExecutor ERROR: Upstream job failed for task <TaskDef(CompareWarpAssembleCoaddTask, label=assembleCoadd) dataId={abstract_filter: i, skymap: discrete/ci_hsc, tract: 0, patch: 69}>, skipping this task.
      ctrl.mpexec.mpGraphExecutor ERROR: Upstream job failed for task <TaskDef(DetectCoaddSourcesTask, label=detection) dataId={abstract_filter: r, skymap: discrete/ci_hsc, tract: 0, patch: 69}>, skipping this task.
      ctrl.mpexec.mpGraphExecutor ERROR: Upstream job failed for task <TaskDef(DetectCoaddSourcesTask, label=detection) dataId={abstract_filter: i, skymap: discrete/ci_hsc, tract: 0, patch: 69}>, skipping this task.
      ctrl.mpexec.mpGraphExecutor ERROR: Upstream job failed for task <TaskDef(MergeDetectionsTask, label=mergeDetections) dataId={skymap: discrete/ci_hsc, tract: 0, patch: 69}>, skipping this task.
      ctrl.mpexec.mpGraphExecutor ERROR: Upstream job failed for task <TaskDef(DeblendCoaddSourcesSingleTask, label=deblend) dataId={abstract_filter: r, skymap: discrete/ci_hsc, tract: 0, patch: 69}>, skipping this task.
      ctrl.mpexec.mpGraphExecutor ERROR: Upstream job failed for task <TaskDef(DeblendCoaddSourcesSingleTask, label=deblend) dataId={abstract_filter: i, skymap: discrete/ci_hsc, tract: 0, patch: 69}>, skipping this task.
      ctrl.mpexec.mpGraphExecutor ERROR: Upstream job failed for task <TaskDef(MeasureMergedCoaddSourcesTask, label=measure) dataId={abstract_filter: r, skymap: discrete/ci_hsc, tract: 0, patch: 69}>, skipping this task.
      ctrl.mpexec.mpGraphExecutor ERROR: Upstream job failed for task <TaskDef(MeasureMergedCoaddSourcesTask, label=measure) dataId={abstract_filter: i, skymap: discrete/ci_hsc, tract: 0, patch: 69}>, skipping this task.
      ctrl.mpexec.mpGraphExecutor ERROR: Upstream job failed for task <TaskDef(MergeMeasurementsTask, label=mergeMeasurements) dataId={skymap: discrete/ci_hsc, tract: 0, patch: 69}>, skipping this task.
      

        Attachments

          Issue Links

            Activity

            Hide
            tjenness Tim Jenness added a comment -

            It looks like it was DM-26136 then. In that case I've assigned this ticket to Andy Salnikov.

            Show
            tjenness Tim Jenness added a comment - It looks like it was DM-26136 then. In that case I've assigned this ticket to Andy Salnikov .
            Hide
            salnikov Andy Salnikov added a comment -

            Looks like my assumption about when connection to database is open was wrong, I though that conne3ction is established on the first use but in reality just creating a butler instance also makes new connection. My mistake was that made `copy()` which does pickle/unpickle completely in parent process (idea was that copy does not have a connection open). Instead of copy I should pass pickled butler to the subprocess.

            Show
            salnikov Andy Salnikov added a comment - Looks like my assumption about when connection to database is open was wrong, I though that conne3ction is established on the first use but in reality just creating a butler instance also makes new connection. My mistake was that made `copy()` which does pickle/unpickle completely in parent process (idea was that copy does not have a connection open). Instead of copy I should pass pickled butler to the subprocess.
            Hide
            salnikov Andy Salnikov added a comment -

            Tim Jenness, you have approved PR already, you can probably just click "Reviewed".

            Jenkins passed for centos7 and macos OK, and I tested it with Postgres so I'm sure it fixes this problem.

            Show
            salnikov Andy Salnikov added a comment - Tim Jenness , you have approved PR already, you can probably just click "Reviewed". Jenkins passed for centos7 and macos OK, and I tested it with Postgres so I'm sure it fixes this problem.
            Hide
            tjenness Tim Jenness added a comment -

            Looks good to me.

            Show
            tjenness Tim Jenness added a comment - Looks good to me.
            Hide
            salnikov Andy Salnikov added a comment -

            Done. Michelle Gower, master should be now OK to run with many processes.

            Show
            salnikov Andy Salnikov added a comment - Done. Michelle Gower , master should be now OK to run with many processes.

              People

              Assignee:
              salnikov Andy Salnikov
              Reporter:
              mgower Michelle Gower
              Reviewers:
              Tim Jenness
              Watchers:
              Andy Salnikov, Michelle Gower, Tim Jenness
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:

                  Jenkins

                  No builds found.