  Data Management
  DM-16831

Crash when running multi-task pipeline

    Details

    • Story Points:
      2
    • Sprint:
      BG3_F18_11
    • Team:
      Data Access and Database

      Description

      We tried to run stac with two tasks in a pipeline and it crashed:

      $ stac -b ../ci_hsc/DATA/butler.yaml -i raw/HSC -o coll12 -d "" run -t RawToCalexpTask -t CalexpToCoaddTask
      Traceback (most recent call last):
        File "/software/lsstsw/stack_20181012/stack/miniconda3-4.5.4-fcd27eb/Linux64/sqlalchemy/1.2.2+2/lib/python/SQLAlchemy-1.2.2-py3.6-linux-x86_64.egg/sqlalchemy/engine/base.py", line 1170, in _execute_context
          context)
        File "/software/lsstsw/stack_20181012/stack/miniconda3-4.5.4-fcd27eb/Linux64/sqlalchemy/1.2.2+2/lib/python/SQLAlchemy-1.2.2-py3.6-linux-x86_64.egg/sqlalchemy/engine/default.py", line 504, in do_executemany
          cursor.executemany(statement, parameters)
      sqlite3.IntegrityError: NOT NULL constraint failed: DatasetCollection.dataset_id
       
      The above exception was the direct cause of the following exception:
       
      Traceback (most recent call last):
        File "/project/salnikov/gen3-middleware/pipe_supertask/bin/stac", line 25, in <module>
          sys.exit(CmdLineFwk().parseAndRun())
        File "/project/salnikov/gen3-middleware/pipe_supertask/python/lsst/pipe/supertask/cmdLineFwk.py", line 228, in parseAndRun
          return self.runPipeline(qgraph, butler, args)
        File "/project/salnikov/gen3-middleware/pipe_supertask/python/lsst/pipe/supertask/cmdLineFwk.py", line 334, in runPipeline
          self._updateOutputCollection(graph, butler)
        File "/project/salnikov/gen3-middleware/pipe_supertask/python/lsst/pipe/supertask/cmdLineFwk.py", line 412, in _updateOutputCollection
          registry.associate(collection, list(id2ref.values()))
        File "/project/salnikov/gen3-middleware/daf_butler/python/lsst/daf/butler/core/utils.py", line 290, in inner
          return func(self, *args, **kwargs)
        File "/project/salnikov/gen3-middleware/daf_butler/python/lsst/daf/butler/registries/sqlRegistry.py", line 613, in associate
          [{"dataset_id": ref.id, "collection": collection} for ref in refs])
        File "/software/lsstsw/stack_20181012/stack/miniconda3-4.5.4-fcd27eb/Linux64/sqlalchemy/1.2.2+2/lib/python/SQLAlchemy-1.2.2-py3.6-linux-x86_64.egg/sqlalchemy/engine/base.py", line 948, in execute
          return meth(self, multiparams, params)
        File "/software/lsstsw/stack_20181012/stack/miniconda3-4.5.4-fcd27eb/Linux64/sqlalchemy/1.2.2+2/lib/python/SQLAlchemy-1.2.2-py3.6-linux-x86_64.egg/sqlalchemy/sql/elements.py", line 269, in _execute_on_connection
          return connection._execute_clauseelement(self, multiparams, params)
        File "/software/lsstsw/stack_20181012/stack/miniconda3-4.5.4-fcd27eb/Linux64/sqlalchemy/1.2.2+2/lib/python/SQLAlchemy-1.2.2-py3.6-linux-x86_64.egg/sqlalchemy/engine/base.py", line 1060, in _execute_clauseelement
          compiled_sql, distilled_params
        File "/software/lsstsw/stack_20181012/stack/miniconda3-4.5.4-fcd27eb/Linux64/sqlalchemy/1.2.2+2/lib/python/SQLAlchemy-1.2.2-py3.6-linux-x86_64.egg/sqlalchemy/engine/base.py", line 1200, in _execute_context
          context)
        File "/software/lsstsw/stack_20181012/stack/miniconda3-4.5.4-fcd27eb/Linux64/sqlalchemy/1.2.2+2/lib/python/SQLAlchemy-1.2.2-py3.6-linux-x86_64.egg/sqlalchemy/engine/base.py", line 1413, in _handle_dbapi_exception
          exc_info
        File "/software/lsstsw/stack_20181012/stack/miniconda3-4.5.4-fcd27eb/Linux64/sqlalchemy/1.2.2+2/lib/python/SQLAlchemy-1.2.2-py3.6-linux-x86_64.egg/sqlalchemy/util/compat.py", line 203, in raise_from_cause
          reraise(type(exception), exception, tb=exc_tb, cause=cause)
        File "/software/lsstsw/stack_20181012/stack/miniconda3-4.5.4-fcd27eb/Linux64/sqlalchemy/1.2.2+2/lib/python/SQLAlchemy-1.2.2-py3.6-linux-x86_64.egg/sqlalchemy/util/compat.py", line 186, in reraise
          raise value.with_traceback(tb)
        File "/software/lsstsw/stack_20181012/stack/miniconda3-4.5.4-fcd27eb/Linux64/sqlalchemy/1.2.2+2/lib/python/SQLAlchemy-1.2.2-py3.6-linux-x86_64.egg/sqlalchemy/engine/base.py", line 1170, in _execute_context
          context)
        File "/software/lsstsw/stack_20181012/stack/miniconda3-4.5.4-fcd27eb/Linux64/sqlalchemy/1.2.2+2/lib/python/SQLAlchemy-1.2.2-py3.6-linux-x86_64.egg/sqlalchemy/engine/default.py", line 504, in do_executemany
          cursor.executemany(statement, parameters)
      sqlalchemy.exc.IntegrityError: (sqlite3.IntegrityError) NOT NULL constraint failed: DatasetCollection.dataset_id [SQL: 'INSERT INTO "DatasetCollection" (dataset_id, collection) VALUES (?, ?)'] [parameters: ((6, 'coll12'), (12, 'coll12'), (2, 'coll12'), (1, 'coll12'), (16, 'coll12'), (10, 'coll12'), (11, 'coll12'), (15, 'coll12')  ... displaying 10 of 34 total bound parameter sets ...  (22, 'coll12'), (None, 'coll12'))] (Background on this error at: http://sqlalche.me/e/gkpj)
      

       
      This looks like an attempt to insert NULL as dataset_id into the collection table for the new collection, and it happens when adding inputs to a new output collection. Maybe it also tries to add an intermediate dataset that does not exist yet?

        Attachments

          Activity

          salnikov Andy Salnikov added a comment -

          Apparently it happens because cmdLineFwk attempts to copy an intermediate dataset ID (which does not exist yet and is set to None) to the output collection. The fix is simple: just filter out those inputs that have None as their dataset ID.
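The filtering described above could look roughly like this (a minimal sketch; the `DatasetRef` stand-in below is a simplification, not the real daf_butler class):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetRef:
    """Simplified stand-in for the daf_butler DatasetRef class."""
    id: Optional[int]
    datasetType: str


def existing_refs(refs: List[DatasetRef]) -> List[DatasetRef]:
    """Drop refs whose dataset_id is still None, i.e. intermediate
    datasets that have not been stored in the registry yet."""
    return [ref for ref in refs if ref.id is not None]


# Only refs with a known dataset_id would be passed to registry.associate()
refs = [DatasetRef(6, "raw"), DatasetRef(None, "calexp"), DatasetRef(12, "raw")]
print([ref.id for ref in existing_refs(refs)])  # [6, 12]
```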

          Fixing this reveals another issue that I have not seen before - there is a similar crash when adding a Quantum to a registry:

          Traceback (most recent call last):
            File "/software/lsstsw/stack_20181012/stack/miniconda3-4.5.4-fcd27eb/Linux64/sqlalchemy/1.2.2+2/lib/python/SQLAlchemy-1.2.2-py3.6-linux-x86_64.egg/sqlalchemy/engine/base.py", line 1193, in _execute_context
              context)
            File "/software/lsstsw/stack_20181012/stack/miniconda3-4.5.4-fcd27eb/Linux64/sqlalchemy/1.2.2+2/lib/python/SQLAlchemy-1.2.2-py3.6-linux-x86_64.egg/sqlalchemy/engine/default.py", line 507, in do_execute
              cursor.execute(statement, parameters)
          sqlite3.IntegrityError: NOT NULL constraint failed: DatasetConsumers.dataset_id
           
          The above exception was the direct cause of the following exception:
           
          Traceback (most recent call last):
            File "/project/salnikov/gen3-middleware/pipe_supertask/bin/stac", line 25, in <module>
              sys.exit(CmdLineFwk().parseAndRun())
            File "/project/salnikov/gen3-middleware/pipe_supertask/python/lsst/pipe/supertask/cmdLineFwk.py", line 228, in parseAndRun
              return self.runPipeline(qgraph, butler, args)
            File "/project/salnikov/gen3-middleware/pipe_supertask/python/lsst/pipe/supertask/cmdLineFwk.py", line 374, in runPipeline
              mapFunc(self._executePipelineTask, target_list)
            File "/project/salnikov/gen3-middleware/pipe_supertask/python/lsst/pipe/supertask/cmdLineFwk.py", line 361, in _mapFunc
              return [func(item) for item in iterable]
            File "/project/salnikov/gen3-middleware/pipe_supertask/python/lsst/pipe/supertask/cmdLineFwk.py", line 361, in <listcomp>
              return [func(item) for item in iterable]
            File "/project/salnikov/gen3-middleware/pipe_supertask/python/lsst/pipe/supertask/cmdLineFwk.py", line 451, in _executePipelineTask
              butler.registry.addQuantum(quantum)
            File "/project/salnikov/gen3-middleware/daf_butler/python/lsst/daf/butler/core/utils.py", line 185, in inner
              return func(self, *args, **kwargs)
            File "/project/salnikov/gen3-middleware/daf_butler/python/lsst/daf/butler/registries/sqlRegistry.py", line 858, in addQuantum
              for ref in flatInputs])
            File "/software/lsstsw/stack_20181012/stack/miniconda3-4.5.4-fcd27eb/Linux64/sqlalchemy/1.2.2+2/lib/python/SQLAlchemy-1.2.2-py3.6-linux-x86_64.egg/sqlalchemy/engine/base.py", line 948, in execute
              return meth(self, multiparams, params)
            File "/software/lsstsw/stack_20181012/stack/miniconda3-4.5.4-fcd27eb/Linux64/sqlalchemy/1.2.2+2/lib/python/SQLAlchemy-1.2.2-py3.6-linux-x86_64.egg/sqlalchemy/sql/elements.py", line 269, in _execute_on_connection
              return connection._execute_clauseelement(self, multiparams, params)
            File "/software/lsstsw/stack_20181012/stack/miniconda3-4.5.4-fcd27eb/Linux64/sqlalchemy/1.2.2+2/lib/python/SQLAlchemy-1.2.2-py3.6-linux-x86_64.egg/sqlalchemy/engine/base.py", line 1060, in _execute_clauseelement
              compiled_sql, distilled_params
            File "/software/lsstsw/stack_20181012/stack/miniconda3-4.5.4-fcd27eb/Linux64/sqlalchemy/1.2.2+2/lib/python/SQLAlchemy-1.2.2-py3.6-linux-x86_64.egg/sqlalchemy/engine/base.py", line 1200, in _execute_context
              context)
            File "/software/lsstsw/stack_20181012/stack/miniconda3-4.5.4-fcd27eb/Linux64/sqlalchemy/1.2.2+2/lib/python/SQLAlchemy-1.2.2-py3.6-linux-x86_64.egg/sqlalchemy/engine/base.py", line 1413, in _handle_dbapi_exception
              exc_info
            File "/software/lsstsw/stack_20181012/stack/miniconda3-4.5.4-fcd27eb/Linux64/sqlalchemy/1.2.2+2/lib/python/SQLAlchemy-1.2.2-py3.6-linux-x86_64.egg/sqlalchemy/util/compat.py", line 203, in raise_from_cause
              reraise(type(exception), exception, tb=exc_tb, cause=cause)
            File "/software/lsstsw/stack_20181012/stack/miniconda3-4.5.4-fcd27eb/Linux64/sqlalchemy/1.2.2+2/lib/python/SQLAlchemy-1.2.2-py3.6-linux-x86_64.egg/sqlalchemy/util/compat.py", line 186, in reraise
              raise value.with_traceback(tb)
            File "/software/lsstsw/stack_20181012/stack/miniconda3-4.5.4-fcd27eb/Linux64/sqlalchemy/1.2.2+2/lib/python/SQLAlchemy-1.2.2-py3.6-linux-x86_64.egg/sqlalchemy/engine/base.py", line 1193, in _execute_context
              context)
            File "/software/lsstsw/stack_20181012/stack/miniconda3-4.5.4-fcd27eb/Linux64/sqlalchemy/1.2.2+2/lib/python/SQLAlchemy-1.2.2-py3.6-linux-x86_64.egg/sqlalchemy/engine/default.py", line 507, in do_execute
              cursor.execute(statement, parameters)
          sqlalchemy.exc.IntegrityError: (sqlite3.IntegrityError) NOT NULL constraint failed: DatasetConsumers.dataset_id [SQL: 'INSERT INTO "DatasetConsumers" (quantum_id, dataset_id, actual) VALUES (?, ?, ?)'] [parameters: (48, None, 0)] (Background on this error at: http://sqlalche.me/e/gkpj)
          

          It looks like dataset ID needs to be updated for those intermediate inputs before sending Quantum to database.

          salnikov Andy Salnikov added a comment -

          For intermediate inputs we obviously do not have dataset_id set before we start executing tasks. To be able to store the quantum in the registry we need to update its input ids with the correct values; those become known when the preceding task writes its outputs to the registry. Somehow we need to return those saved DataRefs from a task and use that information to update downstream input refs.
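The bookkeeping sketched above might look like this (purely illustrative; `record_outputs` and `resolve_inputs` are hypothetical helper names, not part of the real daf_butler API): after each task stores its outputs, remember the registry-assigned ids keyed by dataset type and data ID, then patch the None ids of downstream inputs before addQuantum is called.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Ref:
    """Simplified stand-in for a dataset reference."""
    datasetType: str
    dataId: Tuple  # greatly simplified data ID
    id: Optional[int] = None


Key = Tuple[str, Tuple]


def record_outputs(produced: Dict[Key, int], outputs: List[Ref]) -> None:
    """After a task writes its outputs, remember the ids the registry assigned."""
    for ref in outputs:
        produced[(ref.datasetType, ref.dataId)] = ref.id


def resolve_inputs(produced: Dict[Key, int], inputs: List[Ref]) -> None:
    """Fill in the ids of intermediate inputs before storing the Quantum."""
    for ref in inputs:
        if ref.id is None:
            ref.id = produced.get((ref.datasetType, ref.dataId))


produced: Dict[Key, int] = {}
record_outputs(produced, [Ref("calexp", ("HSC", 903334), id=17)])
inputs = [Ref("calexp", ("HSC", 903334))]
resolve_inputs(produced, inputs)
print(inputs[0].id)  # 17
```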

          salnikov Andy Salnikov added a comment -

          One major complication for using previous task outputs: in more complicated workflows where tasks are executed by different controllers, there may be no easy way to communicate information produced by one task to the controller of another task. This probably leaves us with the need to find dataset_id when we read inputs from the butler. A simple butler.get() does not return dataset_id, so we need to replace that with a slightly more complicated approach: first find dataset_id by looking at the registry, and then ask the butler to fetch that specific dataset_id (not sure if that is possible).
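In outline, the registry-first read would look something like this (a hypothetical sketch: `find_dataset_id` and `get_with_id` do not exist in daf_butler, and the registry is modeled as a plain dict where in reality it would be a query against the Dataset table for the given collection):

```python
from typing import Dict, Optional, Tuple

Key = Tuple[str, str, int]  # (collection, datasetType, dataId) -- simplified


def find_dataset_id(registry: Dict[Key, int], collection: str,
                    datasetType: str, dataId: int) -> Optional[int]:
    """Resolve dataset_id in the registry before reading the data."""
    return registry.get((collection, datasetType, dataId))


def get_with_id(registry: Dict[Key, int], store: Dict[int, object],
                collection: str, datasetType: str, dataId: int):
    """Two-step read: look up the id in the registry, then fetch by id,
    so the caller ends up with a fully resolved reference."""
    dataset_id = find_dataset_id(registry, collection, datasetType, dataId)
    if dataset_id is None:
        raise LookupError(f"{datasetType}@{dataId} not found in {collection}")
    return dataset_id, store[dataset_id]


registry = {("coll12", "calexp", 903334): 17}
store = {17: "calexp-pixels"}
print(get_with_id(registry, store, "coll12", "calexp", 903334))  # (17, 'calexp-pixels')
```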

          salnikov Andy Salnikov added a comment -

          Hsin-Fang, this is a rather trivial change, and it should fix the problem that we saw when we tried to run a two-task pipeline.
          Note also that if you want to try running stac against the current weekly build, you also need daf_butler master (with the DM-16869 fix) in addition to pipe_supertask and ci_hsc.

          hchiang2 Hsin-Fang Chiang added a comment -

          I haven't been able to run this yet because my ci_hsc repo from last week is too old now, but the patch makes sense and looks good to me. I'm generating a new ci_hsc repo to try it out. No need to wait for me for merging.

          salnikov Andy Salnikov added a comment -

          Thanks for the review! I have rebased onto the latest master of daf_butler and pipe_supertask and had to add one small patch to an example task to run it (things move very fast). Merged and done.


            People

            • Assignee:
              salnikov Andy Salnikov
              Reporter:
              salnikov Andy Salnikov
              Reviewers:
              Hsin-Fang Chiang
              Watchers:
              Andy Salnikov, Hsin-Fang Chiang
            • Votes:
              0
            • Watchers:
              2
