Data Management / DM-38957

New resolved dataref handling led to a KeyError and database lockup

    Details

      Description

      It looks like DM-38780 is the prime suspect for the error that occurred in this verify_drp_metrics run.

      The error seems to have been a KeyError:

      Traceback (most recent call last):
        File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-6.0.0/Linux64/daf_butler/g87c2bcf78e+1bb61f183b/python/lsst/daf/butler/registry/interfaces/_database.py", line 465, in temporary_table
          yield table
        File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-6.0.0/Linux64/daf_butler/g87c2bcf78e+1bb61f183b/python/lsst/daf/butler/registry/queries/_query.py", line 168, in open_context
          yield
        File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-6.0.0/Linux64/daf_butler/g87c2bcf78e+1bb61f183b/python/lsst/daf/butler/registry/queries/_results.py", line 119, in materialize
          yield DataCoordinateQueryResults(self._query.materialized())
        File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-6.0.0/Linux64/pipe_base/gdf2caccafc+f05c588fde/python/lsst/pipe/base/graphBuilder.py", line 908, in connectDataIds
          yield commonDataIds
        File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-6.0.0/Linux64/pipe_base/gdf2caccafc+f05c588fde/python/lsst/pipe/base/graphBuilder.py", line 1540, in makeGraph
          scaffolding.resolveDatasetRefs(
        File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-6.0.0/Linux64/pipe_base/gdf2caccafc+f05c588fde/python/lsst/pipe/base/graphBuilder.py", line 1099, in resolveDatasetRefs
          refs[resolvedRef.dataId].ref = (
      KeyError: {instrument: 'HSC', detector: 47, visit: 322, ...}
      

      This KeyError then triggered a database locking error when the context manager handling the exception tried to clean up a temporary table.
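
As a rough illustration of this failure mode (not the actual daf_butler code), the sketch below shows how a generator-based context manager runs its cleanup while an exception from the body is still propagating, and how a failure in that cleanup surfaces with the original KeyError attached as its context. All names here (`temporary_table`, `CleanupError`, the `fail_cleanup` flag, `run`) are hypothetical stand-ins.

```python
from contextlib import contextmanager


class CleanupError(Exception):
    """Hypothetical stand-in for a database locking error during cleanup."""


@contextmanager
def temporary_table(events, fail_cleanup=False):
    # Hypothetical stand-in for creating a temporary table.
    events.append("create")
    try:
        yield "tmp_table"
    finally:
        # Runs even when the body raised. If the cleanup itself fails, the
        # new exception propagates and the original is kept as __context__.
        events.append("drop")
        if fail_cleanup:
            raise CleanupError("could not drop temporary table")


def run(fail_cleanup):
    """Raise a KeyError inside the context and report what came out."""
    events = []
    refs = {}
    try:
        with temporary_table(events, fail_cleanup=fail_cleanup):
            refs["unexpected data ID"]  # raises KeyError inside the context
    except Exception as exc:
        return events, exc
    return events, None
```

When the cleanup succeeds, the caller sees the KeyError itself; when it fails, the caller sees the cleanup error first and has to look at the chained exception to find the root cause.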

            Activity

            salnikov Andy Salnikov added a comment -

            Jim's ticket DM-38948 is likely going to fix it.

            jbosch Jim Bosch added a comment -

            I was hoping so, too, but the exception isn't happening where I would have expected for that to be the case, either in terms of the traceback or the log messages leading up to it.

            I'm pretty baffled by this. My best guess is that, prior to DM-38780, the usual "add a key if it isn't there" behavior of dict item assignment masked the findDatasets call returning some data IDs that the calling code didn't expect and didn't use.  If that's the case, a quick fix that gets us back to the previous state would be easy, but it'd be better to figure out what's going on.

            I'll see if I can reproduce locally.
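
To make the guess above concrete, here is a minimal sketch of the difference between the two dict access patterns. The `Holder` class and the string data IDs are hypothetical; this is not the pipe_base code, only an illustration of why plain item assignment hides unexpected keys while lookup-then-set-attribute raises.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Holder:
    """Hypothetical container whose `ref` attribute is filled in later."""

    ref: Optional[str] = None


# Hypothetical data IDs: the query returns one more than the caller expects.
expected = ["visit=322,detector=46"]
returned = ["visit=322,detector=46", "visit=322,detector=47"]

# Old pattern: item assignment silently adds the unexpected key.
old_refs = {data_id: None for data_id in expected}
for data_id in returned:
    old_refs[data_id] = f"resolved:{data_id}"

# New pattern: the lookup must succeed before the attribute can be set,
# so the unexpected data ID raises KeyError.
new_refs = {data_id: Holder() for data_id in expected}
error = None
try:
    for data_id in returned:
        new_refs[data_id].ref = f"resolved:{data_id}"
except KeyError as exc:
    error = exc
```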

            salnikov Andy Salnikov added a comment -

            Kian-Tat Lim, how can I run the same thing locally?

            jbosch Jim Bosch added a comment -

            Ok, I've figured out why the findDatasets call was returning additional data IDs, and it's totally benign.  The data IDs that cause this failure are the ones that involve an apparent overlap that goes away when we apply our post-query filtering, and we can't do that filtering in the particular form of chained query in play here.

            So the right fix is just to guard against this case explicitly (and probably in a few other similar loops) and drop the data IDs that don't already have an entry in the dict.  I'll probably be able to push out a branch with that tonight.
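
A guard of the kind described above might look roughly like the following. This is a sketch under stated assumptions, not the actual pipe_base change: `Holder`, `apply_resolved`, and the string data IDs are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass
class Holder:
    """Hypothetical container whose `ref` attribute is filled in later."""

    ref: Optional[str] = None


def apply_resolved(
    refs: Dict[str, Holder], resolved: Iterable[Tuple[str, str]]
) -> List[str]:
    """Fill in entries that already exist; return the data IDs that were dropped."""
    dropped = []
    for data_id, ref in resolved:
        holder = refs.get(data_id)
        if holder is None:
            # Data ID came back from the query but was never expected by the
            # caller (e.g. an apparent overlap), so skip it instead of raising.
            dropped.append(data_id)
            continue
        holder.ref = ref
    return dropped
```

Returning the dropped data IDs is a design choice made for this sketch; it keeps the skipped entries visible for logging rather than discarding them silently.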

            jbosch Jim Bosch added a comment -

            Andy Salnikov wrote: "how can I run the same thing locally?"

            Don't worry about it for now (I'm stealing the ticket), but for the future: clone https://github.com/lsst/rc2_subset.git, set the NUMPROCS environment variable, and run the only bin script in that package. It'll take a few hours unless you have lots of cores to throw at it, and the download will also take a long while.

            jbosch Jim Bosch added a comment -

            Ready for review: https://github.com/lsst/pipe_base/pull/327

            I've confirmed that it works with a one-off test of the original problem in rc2_subset. I'm not attempting a unit test here, but I'm really looking forward to improving test coverage for quantum graph (QG) generation and the execution butler on DM-38952 ASAP.

            Jenkins is underway at https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/38604/pipeline/

            salnikov Andy Salnikov added a comment -

            Looks good, thanks again for fixing more of my messes!


              People

              Assignee:
              jbosch Jim Bosch
              Reporter:
              ktl Kian-Tat Lim
              Reviewers:
              Andy Salnikov
              Watchers:
              Andy Salnikov, Jim Bosch, Kian-Tat Lim
