  1. Data Management
  2. DM-29908

Registry collection loading can fail due to concurrent deletes


    Details

    • Story Points:
      1
    • Epic Link:
    • Team:
      Data Release Production
    • Urgent?:
      No

      Description

      I just had a BPS job die with the following traceback:

        File "/software/lsstsw/stack_20210415/stack/miniconda3-py38_4.9.2-0.5.0/Linux64/daf_butler/21.0.0-74-g62d1151e+6c308c38c1/python/lsst/daf/butler/registry/managers.py", line 309, in refresh
          self.collections.refresh()
        File "/software/lsstsw/stack_20210415/stack/miniconda3-py38_4.9.2-0.5.0/Linux64/daf_butler/21.0.0-74-g62d1151e+6c308c38c1/python/lsst/daf/butler/registry/collections/_base.py", line 361, in refresh
          chain.refresh(self)
        File "/software/lsstsw/stack_20210415/stack/miniconda3-py38_4.9.2-0.5.0/Linux64/daf_butler/21.0.0-74-g62d1151e+6c308c38c1/python/lsst/daf/butler/registry/interfaces/_collections.py", line 204, in refresh
          self._children = self._load(manager)
        File "/software/lsstsw/stack_20210415/stack/miniconda3-py38_4.9.2-0.5.0/Linux64/daf_butler/21.0.0-74-g62d1151e+6c308c38c1/python/lsst/daf/butler/registry/collections/_base.py", line 284, in _load
          [manager[row[self._table.columns.child]].name for row in self._db.query(sql)]
        File "/software/lsstsw/stack_20210415/stack/miniconda3-py38_4.9.2-0.5.0/Linux64/daf_butler/21.0.0-74-g62d1151e+6c308c38c1/python/lsst/daf/butler/registry/collections/_base.py", line 284, in <listcomp>
          [manager[row[self._table.columns.child]].name for row in self._db.query(sql)]
        File "/software/lsstsw/stack_20210415/stack/miniconda3-py38_4.9.2-0.5.0/Linux64/daf_butler/21.0.0-74-g62d1151e+6c308c38c1/python/lsst/daf/butler/registry/collections/_base.py", line 424, in __getitem__
          raise MissingCollectionError(f"Collection with key '{key}' not found.") from err
      lsst.daf.butler.registry._exceptions.MissingCollectionError: Collection with key 'u/kbechtol/calib_test/20210427T062718Z' not found.
      

      I'm 90% sure that Keith Bechtol happened to delete this completely-unrelated-to-me collection last night (please confirm if you can, Keith) while my Butler was linking up all of its parent and child collection information at startup.

      I've always known this super-aggressive up-front fetching wouldn't scale to many, many users, but I hadn't anticipated having to replace it so soon (it'd be nice to get e.g. DM-29585 first). We might be able to work around this in the short term by either wrapping this in a transaction or just catching the exception and retrying.
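
      The catch-and-retry option could look roughly like the sketch below. It is not taken from daf_butler: `refresh_with_retry`, its parameters, and the local `MissingCollectionError` class are hypothetical stand-ins so the example runs on its own, and `flaky_refresh` only imitates a refresh that fails once because of a concurrent change.

```python
import time


class MissingCollectionError(Exception):
    """Local stand-in for the registry exception seen in the traceback."""


def refresh_with_retry(refresh, max_attempts=3, delay=0.0):
    """Call ``refresh()``, retrying if a collection goes missing mid-load.

    ``refresh`` is any zero-argument callable (hypothetical; in the real
    code it would presumably wrap the managers' refresh).
    Returns the number of attempts used.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            refresh()
            return attempt
        except MissingCollectionError:
            if attempt == max_attempts:
                raise  # still inconsistent after several tries; give up
            time.sleep(delay)


# Usage: a fake refresh that fails once, as if another client changed
# the collection tables between our two queries, then succeeds.
calls = []


def flaky_refresh():
    calls.append(1)
    if len(calls) == 1:
        raise MissingCollectionError("Collection with key 'x' not found.")


attempts_used = refresh_with_retry(flaky_refresh)
```

      A retry only narrows the window; wrapping both queries in one transaction would be the more robust of the two options if the database isolation level allows it.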

      Andy Salnikov, I've added you as a watcher because I figure it's possible you may be able to get to this before I do (I know we're both super busy, so neither of us is likely to), or you may have some thoughts on the best approach to take on this. And Tim Jenness, this is the kind of thing that might get exacerbated in the client/server butler, depending on where we do the caching.

            Activity

            kbechtol Keith Bechtol added a comment -

            I am sorry if I inadvertently disrupted a job. Here are the commands I was running. I believe this was at roughly 12:30pm Project time (PDT) 26 April 2021.

            import lsst.daf.butler as dafButler
            config = '/repo/main/butler.yaml'
            butler = dafButler.Butler(config=config, collections='HSC/runs/RC2/w_2021_14/DM-29528', writeable=True)
            registry = butler.registry
            registry.removeDatasetType('metricvalue_info_nsrcMeasDetector')
            registry.removeDatasetType('nsrcMeasDetector_metadata')

            I think I also attempted to run these commands at ~10 pm Project time (PDT) 26 April 2021.

            jbosch Jim Bosch added a comment -

            Looking at the code and some more recent occurrences, I think the actual race condition is:

            1. Butler A fetches everything from the collection table
            2. Butler B creates a new collection
            3. Butler B adds the new collection as a child of an existing CHAINED collection
            4. Butler A fetches associations between parents and children, assuming it already has fetched the basic information about all of these.

            A fix for that case is on a branch.
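
            The four steps above can be reproduced with a toy model. Everything here is an assumption made for illustration: plain dicts stand in for the collection and chain tables, `ClientCache` and `tolerate_new` are hypothetical names, the collection names are made up, and the re-fetch shown is only one possible workaround, not necessarily what the branch does.

```python
# Toy shared "database": two tables that both clients read.
collection_table = {1: "HSC/defaults", 2: "HSC/runs/old"}  # key -> name
chain_table = {1: [2]}  # parent key -> ordered child keys


class ClientCache:
    """Hypothetical client-side cache with a two-step refresh."""

    def __init__(self):
        self.records = {}

    def fetch_collections(self):
        # Step 1: take a snapshot of the collection table.
        self.records = dict(collection_table)

    def load_children(self, parent, tolerate_new=False):
        # Step 4: resolve child keys against the snapshot from step 1.
        names = []
        for child in chain_table[parent]:
            if child not in self.records and tolerate_new:
                # Snapshot is stale, so re-fetch before looking again.
                self.fetch_collections()
            names.append(self.records[child])  # KeyError if still missing
        return names


a = ClientCache()
a.fetch_collections()                   # 1. Butler A fetches the collection table
collection_table[3] = "u/someone/new"   # 2. Butler B creates a new collection
chain_table[1].append(3)                # 3. Butler B adds it to a CHAINED collection

try:
    a.load_children(1)                  # 4. Butler A resolves children and fails
    failed = False
except KeyError:
    failed = True

# With the stale-snapshot check enabled, the same call succeeds.
children = a.load_children(1, tolerate_new=True)
```

            Note that no delete is involved in this interleaving: a create followed by a chain update is enough to make the second query reference a key the first query never saw.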

             

            jbosch Jim Bosch added a comment -

            Andy Salnikov, could you review this small try-except addition?

            Only PR is daf_butler: https://github.com/lsst/daf_butler/pull/516

            salnikov Andy Salnikov added a comment -

            Looks good. I can't say I fully understood the distributed logic, but I'm going to trust you on this.


              People

              Assignee:
              jbosch Jim Bosch
              Reporter:
              jbosch Jim Bosch
              Reviewers:
              Andy Salnikov
              Watchers:
              Andy Salnikov, Jim Bosch, Keith Bechtol, Tim Jenness
