Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-16869

Inconsistent hash for DimensionSet/DimensionNameSet

    Details

    • Story Points:
      1
    • Sprint:
      BG3_F18_11
    • Team:
      Data Access and Database

      Description

      It looks like recent switch from units to dimensions created issuers in GraphBuilder, I'm getting error when trying to run simple pipeline of two tasks :

      Traceback (most recent call last):
        File "/project/salnikov/gen3-middleware/pipe_supertask/bin/stac", line 25, in <module>
          sys.exit(CmdLineFwk().parseAndRun())
        File "/project/salnikov/gen3-middleware/pipe_supertask/python/lsst/pipe/supertask/cmdLineFwk.py", line 198, in parseAndRun
          qgraph = graphBuilder.makeGraph(pipeline, coll, args.data_query)
        File "/project/salnikov/gen3-middleware/pipe_supertask/python/lsst/pipe/supertask/graphBuilder.py", line 198, in makeGraph
          originInfo, userQuery)
        File "/project/salnikov/gen3-middleware/pipe_supertask/python/lsst/pipe/supertask/graphBuilder.py", line 324, in _makeGraph
          dataRef = row.datasetRefs[dsType]
      KeyError: DatasetType(calexp, {Visit, Instrument, Detector}, ExposureF)
      

      Looks like the dict indexing on DatasetType is broken, and this seems to be related to hash returning different values when dimension is either DimensionGraph or DimensionNameSet.

        Attachments

          Activity

          Hide
          salnikov Andy Salnikov added a comment -

          I think that DimensionSetBase hash method is not quite correct:

              def __hash__(self):
                  # Two possibilities in practice:
                  try:
                      # self.names is a frozenset, which is hashable but not sorted.
                      return hash(self.names)
                  except TypeError:
                      # self.names is an OrderedDict keys view, which is not hashable
                      # but is sorted.
                      return hash(tuple(self.names))
          

          for the same set of names it can return different hash.

          For proper hashing we need to use the same container type , it's probably easiest to just make frozenset out of names.

          Show
          salnikov Andy Salnikov added a comment - I think that DimensionSetBase hash method is not quite correct: def __hash__( self ): # Two possibilities in practice: try : # self.names is a frozenset, which is hashable but not sorted. return hash ( self .names) except TypeError: # self.names is an OrderedDict keys view, which is not hashable # but is sorted. return hash ( tuple ( self .names)) for the same set of names it can return different hash. For proper hashing we need to use the same container type , it's probably easiest to just make frozenset out of names.
          Hide
          salnikov Andy Salnikov added a comment - - edited

          JIm, could you check that my fix makes sense? PR: https://github.com/lsst/daf_butler/pull/111

          Show
          salnikov Andy Salnikov added a comment - - edited JIm, could you check that my fix makes sense? PR: https://github.com/lsst/daf_butler/pull/111
          Hide
          jbosch Jim Bosch added a comment - - edited

          Fix looks good, thanks!

          We should probably also try to make sure the activator logic "upgrades" its DatasetTypes to have full DimensionGraphs as soon as it can (this is done automatically by Registry.getDatasetType and Registry.registerDatasetType), and maybe consider making DimensionGraph __hash__ on frozenset(self.names), too.  I am a bit worried that while the ability to compare DimensionNameSet, DimensionSet, and DimensionGraph for equality is very nice, it is very hard to those binary operations consistent with the unary hash operation, and even this ticket doesn't fully achieve that.

          Show
          jbosch Jim Bosch added a comment - - edited Fix looks good, thanks! We should probably also try to make sure the activator logic "upgrades" its DatasetTypes to have full DimensionGraphs as soon as it can (this is done automatically by Registry.getDatasetType and Registry.registerDatasetType), and maybe consider making DimensionGraph __hash__ on frozenset(self.names) , too.  I am a bit worried that while the ability to compare DimensionNameSet, DimensionSet, and DimensionGraph for equality is very nice, it is very hard to those binary operations consistent with the unary hash operation, and even this ticket doesn't fully achieve that.
          Hide
          salnikov Andy Salnikov added a comment -

          Thanks for review, indeed we probably need another review it once again, this is just enough to unblock my other tickets.

          Show
          salnikov Andy Salnikov added a comment - Thanks for review, indeed we probably need another review it once again, this is just enough to unblock my other tickets.

            People

            • Assignee:
              salnikov Andy Salnikov
              Reporter:
              salnikov Andy Salnikov
              Reviewers:
              Jim Bosch
              Watchers:
              Andy Salnikov, Jim Bosch
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Summary Panel