Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-28095

A missing storage class causes all dataset queries to fail

    XMLWordPrintable

    Details

    • Story Points:
      1
    • Epic Link:
    • Team:
      Data Release Production
    • Urgent?:
      No

      Description

      Attempting to run butler query-datasets against the repo at /project/hsc/gen3repo/rc2w42_ssw46 with essentially any arguments results in an exception about the 'BrightStarStamps' storage class not being found. I'm pretty sure that's because there's a dataset type with that storage class name registered in the repo, but the storage class is not defined in the butler configs on master; I'm pretty sure it's only defined on a ticket branch that only Morgan Schmitz uses.

      We've discussed in the past how to add new storage classes to an existing registry, and I don't want to try to solve that problem on this ticket - though it's clear we'll have to do something in order for Morgan Schmitz to get his functionality on master in a way that's compatible with existing repos.

      What I do want to do here is to make it possible for people who aren't Morgan to query datasets from the command-line in that repo again, and I suspect that'll be a piece of any StorageClass extension system in the future.

      Here's what's going on, as far as I can tell:

      • In the command-line script, all of the dataset types are assumed to be glob patterns, and are transformed into regular expressions, even if they're fully-qualified types.
      • Seeing regular expressions for dataset types and components=None (which means "look for component dataset types only when the parent dataset type doesn't match), Registry.queryDatasets calls Registry.queryDatasetTypes to query for all dataset types, including all components, in order to get a list to match those regular expressions against.
      • Expanding the components for a dataset type requires getting its actual StorageClass, and boom.

      This also affects querying for datasets in Python, but only when the dataset type argument is passed explicitly as a regex - when it's a fully-qualified dataset type name, DatasetType instance, or list thereof (as it is in QG gen), we don't have to expand unrelated dataest type components to match against.

      So, how do we fix that?

      • We can catch the exception raised when the storage class isn't found (probably warn), and proceed with expanding all other dataset types. I'm not sure this scales to a world where everyone has their own storage class extensions that cause warnings for everyone else, but maybe that's something we should take into account when we actually tackle storage class extensions (e.g. by putting storage class definitions into the database when they're used).
      • We can disable or restrict searching for components via regex. I don't think we want to disable it entirely, or even restrict it to one parent dataset at a time (it's not unreasonable for someone to ask, say, "what are the dataset types with a .wcs component?"). It's frustrating that we can't inspect the pattern to see whether it could match a component of some parent dataset, because the vast majority of patterns wouldn't, but I don't see a way to do that. But given that searching for components is really just searching for dataset types with a particular storage class, maybe we should retire the component-regex matching in favor of a way to filter the dataset types by their storage class.

      Right now, I'm leaning towards the former, because it'll be a lot easier (and not an API change), but that may be punting some problems to the future.

        Attachments

          Activity

          Hide
          tjenness Tim Jenness added a comment -

          We should probably also lazy load StorageClass inside a DatasetType – if you pass in a string for a storage class instead of a StorageClass to the datasetType constructor it should create the StorageClass when you need it. That would solve this problem of having to create all the DatasetType and associated StorageClass up front wouldn't it? (unless I misread this ticket).

          Show
          tjenness Tim Jenness added a comment - We should probably also lazy load StorageClass inside a DatasetType – if you pass in a string for a storage class instead of a StorageClass to the datasetType constructor it should create the StorageClass when you need it. That would solve this problem of having to create all the DatasetType and associated StorageClass up front wouldn't it? (unless I misread this ticket).
          Hide
          jbosch Jim Bosch added a comment - - edited

          DatasetType construction already doesn't trigger a StorageClass load - if you give it a StorageClass instance, it remembers it, but if you just give it a name, it sets self._storageClassName and sets self._storageClass = None, deferring loading until first use. The problem here is that we end up calling DatasetType.makeAllComponentDatasetTypes on all dataset types any time someone queries for datasets or dataset types with a regex, and that method (reasonably) does need to load the real StorageClass.

          Show
          jbosch Jim Bosch added a comment - - edited DatasetType construction already doesn't trigger a StorageClass load - if you give it a StorageClass instance, it remembers it, but if you just give it a name, it sets self._storageClassName and sets self._storageClass = None , deferring loading until first use. The problem here is that we end up calling DatasetType.makeAllComponentDatasetTypes on all dataset types any time someone queries for datasets or dataset types with a regex, and that method (reasonably) does need to load the real StorageClass.
          Hide
          tjenness Tim Jenness added a comment -

          Okay. Yes. That makes sense.

          Show
          tjenness Tim Jenness added a comment - Okay. Yes. That makes sense.
          Hide
          jbosch Jim Bosch added a comment -

          The changes here are small and not really butler-specific. Eli Rykoff, are you up for reviewing, since you saw the context already on Slack, and I'm hitting all non-vacationing middleware team members up for other reviews?

          PR is https://github.com/lsst/daf_butler/pull/453

          Show
          jbosch Jim Bosch added a comment - The changes here are small and not really butler-specific. Eli Rykoff , are you up for reviewing, since you saw the context already on Slack, and I'm hitting all non-vacationing middleware team members up for other reviews? PR is https://github.com/lsst/daf_butler/pull/453

            People

            Assignee:
            jbosch Jim Bosch
            Reporter:
            jbosch Jim Bosch
            Reviewers:
            Eli Rykoff
            Watchers:
            Eli Rykoff, Jim Bosch, Tim Jenness
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

              Dates

              Created:
              Updated:
              Resolved:

                Jenkins

                No builds found.