Data Management / DM-29908

Registry collection loading can fail due to concurrent deletes

Details

    • 1
    • Data Release Production
    • No

    Description

      I just had a BPS job die with the following traceback:

        File "/software/lsstsw/stack_20210415/stack/miniconda3-py38_4.9.2-0.5.0/Linux64/daf_butler/21.0.0-74-g62d1151e+6c308c38c1/python/lsst/daf/butler/registry/managers.py", line 309, in refresh
          self.collections.refresh()
        File "/software/lsstsw/stack_20210415/stack/miniconda3-py38_4.9.2-0.5.0/Linux64/daf_butler/21.0.0-74-g62d1151e+6c308c38c1/python/lsst/daf/butler/registry/collections/_base.py", line 361, in refresh
          chain.refresh(self)
        File "/software/lsstsw/stack_20210415/stack/miniconda3-py38_4.9.2-0.5.0/Linux64/daf_butler/21.0.0-74-g62d1151e+6c308c38c1/python/lsst/daf/butler/registry/interfaces/_collections.py", line 204, in refresh
          self._children = self._load(manager)
        File "/software/lsstsw/stack_20210415/stack/miniconda3-py38_4.9.2-0.5.0/Linux64/daf_butler/21.0.0-74-g62d1151e+6c308c38c1/python/lsst/daf/butler/registry/collections/_base.py", line 284, in _load
          [manager[row[self._table.columns.child]].name for row in self._db.query(sql)]
        File "/software/lsstsw/stack_20210415/stack/miniconda3-py38_4.9.2-0.5.0/Linux64/daf_butler/21.0.0-74-g62d1151e+6c308c38c1/python/lsst/daf/butler/registry/collections/_base.py", line 284, in <listcomp>
          [manager[row[self._table.columns.child]].name for row in self._db.query(sql)]
        File "/software/lsstsw/stack_20210415/stack/miniconda3-py38_4.9.2-0.5.0/Linux64/daf_butler/21.0.0-74-g62d1151e+6c308c38c1/python/lsst/daf/butler/registry/collections/_base.py", line 424, in __getitem__
          raise MissingCollectionError(f"Collection with key '{key}' not found.") from err
      lsst.daf.butler.registry._exceptions.MissingCollectionError: Collection with key 'u/kbechtol/calib_test/20210427T062718Z' not found.
      

      I'm 90% sure that kbechtol happened to delete this completely-unrelated-to-me collection last night (please confirm if you can, Keith) while my Butler was linking up all of its parent and child collection information at startup.

      I've always known this super-aggressive up-front fetching wouldn't scale to many, many users, but I hadn't anticipated having to replace it so soon (it'd be nice to get e.g. DM-29585 first). We might be able to work around this in the short term by either wrapping the refresh in a transaction or just catching the exception and retrying.
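
A minimal sketch of the catch-and-retry workaround mentioned above. It is not the daf_butler implementation: `refresh_with_retry`, its `attempts` and `delay` parameters, and the local `MissingCollectionError` class (a stand-in for the exception in the traceback) are all hypothetical names used for illustration.

```python
import time


class MissingCollectionError(LookupError):
    """Stand-in for the registry exception shown in the traceback."""


def refresh_with_retry(refresh, attempts=3, delay=0.0):
    """Call ``refresh()`` and retry if a collection vanished mid-load.

    ``refresh`` is any zero-argument callable (hypothetically, something
    like a bound ``managers.refresh``). The last failure is re-raised.
    """
    for attempt in range(1, attempts + 1):
        try:
            return refresh()
        except MissingCollectionError:
            if attempt == attempts:
                raise
            # Give the concurrent writer time to finish before re-reading.
            time.sleep(delay)


# Usage: a refresh that fails once, as if another client changed the
# collection tables between the two queries, then succeeds.
calls = []


def flaky_refresh():
    calls.append(1)
    if len(calls) == 1:
        raise MissingCollectionError("Collection with key 'x' not found.")
    return "refreshed"


print(refresh_with_retry(flaky_refresh))
```

A retry only narrows the window; it does not close it, which is why the description treats it as a short-term measure.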

      salnikov, I've added you as a watcher because I figure it's possible you may be able to get to this before I do (I know we're both super busy, so neither of us is likely to), or you may have some thoughts on the best approach to take on this. And tjenness, this is the kind of thing that might get exacerbated in the client/server butler, depending on where we do the caching.

          Activity

            kbechtol Keith Bechtol added a comment -

            I am sorry if I inadvertently disrupted a job. Here are the commands I was running. I believe this was at roughly 12:30 pm Project time (PDT) on 26 April 2021.

            import lsst.daf.butler as dafButler
            config = '/repo/main/butler.yaml'
            butler = dafButler.Butler(config=config, collections='HSC/runs/RC2/w_2021_14/DM-29528', writeable=True)
            registry = butler.registry
            registry.removeDatasetType('metricvalue_info_nsrcMeasDetector')
            registry.removeDatasetType('nsrcMeasDetector_metadata')

            I think I also attempted to run these commands at ~10 pm Project time (PDT) on 26 April 2021.

            jbosch Jim Bosch added a comment -

            Looking at the code and some more recent occurrences, I think the actual race condition is:

            1. Butler A fetches everything from the collection table
            2. Butler B creates a new collection
            3. Butler B adds the new collection as a child of an existing CHAINED collection
            4. Butler A fetches associations between parents and children, assuming it already has fetched the basic information about all of these.

            A fix for that case is on a branch.
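
The four steps above can be reproduced with a toy in-memory model. This is a sketch under stated assumptions, not the code on the branch: `CollectionCache` and its methods are hypothetical, and plain dictionaries stand in for the collection and chain tables. The fallback shown (fetch the single missing row instead of raising) is one possible way to tolerate the race.

```python
class CollectionCache:
    """Toy model of one client's snapshot of the collection tables."""

    def __init__(self, collection_table, chain_table):
        # Shared "database" state: key -> name, and parent key -> child keys.
        self._collection_table = collection_table
        self._chain_table = chain_table
        self._records = {}

    def fetch_collections(self):
        # Step 1: snapshot every row of the collection table.
        self._records = dict(self._collection_table)

    def load_children(self, parent_key):
        # Step 4: resolve the child keys read from the chain table.
        names = []
        for child_key in self._chain_table.get(parent_key, []):
            if child_key not in self._records:
                # The child was created after the snapshot was taken, so
                # fetch just that row rather than failing the whole load.
                # (A concurrently deleted child would still raise KeyError.)
                self._records[child_key] = self._collection_table[child_key]
            names.append(self._records[child_key])
        return names


collections = {1: "chain", 2: "run/a"}
chains = {1: [2]}

butler_a = CollectionCache(collections, chains)
butler_a.fetch_collections()        # 1. Butler A reads the collection table
collections[3] = "run/b"            # 2. Butler B creates a new collection
chains[1].append(3)                 # 3. Butler B adds it to the chain
print(butler_a.load_children(1))    # 4. Butler A loads the chain's children
```

Without the fallback branch, step 4 fails on key 3, which mirrors the `MissingCollectionError` in the traceback.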

             

            jbosch Jim Bosch added a comment -

            salnikov, could you add this small try-except addition?

            Only PR is daf_butler: https://github.com/lsst/daf_butler/pull/516


            salnikov Andy Salnikov added a comment -

            Looks good. I can't say I fully understood the distributed logic, but I'm going to trust you on this.

            People

              jbosch Jim Bosch
              jbosch Jim Bosch
              Andy Salnikov
              Andy Salnikov, Jim Bosch, Keith Bechtol, Tim Jenness