Data Management / DM-27419

Fix bugs when reading yaml file after running PTC extract task (DM-23159)


    Details

    • Type: Story
    • Status: Invalid
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels:
    • Team:
      Data Release Production
    • Urgent?:
      No

      Description

      When trying to run the photon transfer curve measurement task (version DM-23159) on LATISS data (w_2020_44):

      pipetask run -d "detector=0 and exposure IN (2020012800028, 2020012800029, 2020012800041, 2020012800042)" -b ./DATA/butler.yaml -i ptcTest.19 -o ptcTest.40 -p /home/plazas/lsst_devel/LSST/cp_pipe/pipelines/measurePhotonTransferCurve.yaml  --register-dataset-types
      

      The first part of the task creates a yaml file

       /home/plazas/lsst_devel/LSST/ci_cpp_gen3/DATA/ptcTest.40/20201103T15h29m09s/cpCovariances/KPNO_406_828nm~EMPTY/KPNO_406_828nm~EMPTY/cpCovariances_KPNO_406_828nm~EMPTY_KPNO_406_828nm~EMPTY_2020012800028_0_LATISS_ptcTest_40_20201103T15h29m09s.yaml
      

      that then can’t be read by the second part of the task (via inputs = butlerQC.get(inputRefs)):

      inputRefs InputQuantizedConnection(_attributes={'inputCovariances', 'camera'}, camera=DatasetRef(DatasetType('camera', {instrument}, Camera, isCalibration=True), {instrument: LATISS}, id=1, run='LATISS/calib/unbounded'), inputCovariances=[DatasetRef(DatasetType('cpCovariances', {band, instrument, detector, physical_filter, exposure}, StructuredDataDict), {band: KPNO_406_828nm~EMPTY, instrument: LATISS, detector: 0, physical_filter: KPNO_406_828nm~EMPTY, exposure: 2020012800028}, id=793, run='ptcTest.40/20201103T15h29m09s'), DatasetRef(DatasetType('cpCovariances', {band, instrument, detector, physical_filter, exposure}, StructuredDataDict), {band: KPNO_406_828nm~EMPTY, instrument: LATISS, detector: 0, physical_filter: KPNO_406_828nm~EMPTY, exposure: 2020012800029}, id=794, run='ptcTest.40/20201103T15h29m09s'), DatasetRef(DatasetType('cpCovariances', {band, instrument, detector, physical_filter, exposure}, StructuredDataDict), {band: KPNO_406_828nm~EMPTY, instrument: LATISS, detector: 0, physical_filter: KPNO_406_828nm~EMPTY, exposure: 2020012800041}, id=795, run='ptcTest.40/20201103T15h29m09s'), DatasetRef(DatasetType('cpCovariances', {band, instrument, detector, physical_filter, exposure}, StructuredDataDict), {band: KPNO_406_828nm~EMPTY, instrument: LATISS, detector: 0, physical_filter: KPNO_406_828nm~EMPTY, exposure: 2020012800042}, id=796, run='ptcTest.40/20201103T15h29m09s')])
      Error: An error occurred during command execution:
      Traceback (most recent call last):
        File "/home/plazas/lsst_devel/LSST/daf_butler/python/lsst/daf/butler/datastores/posixDatastore.py", line 120, in _read_artifact_into_memory
          result = formatter.read(component=getInfo.component if isComponent else None)
        File "/home/plazas/lsst_devel/LSST/daf_butler/python/lsst/daf/butler/formatters/file.py", line 196, in read
          raise ValueError(f"Unable to read data with URI {self.fileDescriptor.location.uri}")
      ValueError: Unable to read data with URI file:///home/plazas/lsst_devel/LSST/ci_cpp_gen3/DATA/ptcTest.40/20201103T15h29m09s/cpCovariances/KPNO_406_828nm~EMPTY/KPNO_406_828nm~EMPTY/cpCovariances_KPNO_406_828nm~EMPTY_KPNO_406_828nm~EMPTY_2020012800028_0_LATISS_ptcTest_40_20201103T15h29m09s.yaml
      The above exception was the direct cause of the following exception:
      Traceback (most recent call last):
        File "/home/plazas/lsst_devel/LSST/daf_butler/python/lsst/daf/butler/cli/utils.py", line 446, in cli_handle_exception
          return func(*args, **kwargs)
        File "/software/lsstsw/stack_20200922/stack/miniconda3-py37_4.8.2-cb4e2dc/Linux64/ctrl_mpexec/20.0.0-33-g43e4e0e+110e833cc8/python/lsst/ctrl/mpexec/cli/script/run.py", line 173, in run
          f.runPipeline(qgraphObj, taskFactory, args)
        File "/software/lsstsw/stack_20200922/stack/miniconda3-py37_4.8.2-cb4e2dc/Linux64/ctrl_mpexec/20.0.0-33-g43e4e0e+110e833cc8/python/lsst/ctrl/mpexec/cmdLineFwk.py", line 646, in runPipeline
          executor.execute(graph, butler)
        File "/software/lsstsw/stack_20200922/stack/miniconda3-py37_4.8.2-cb4e2dc/Linux64/ctrl_mpexec/20.0.0-33-g43e4e0e+110e833cc8/python/lsst/ctrl/mpexec/mpGraphExecutor.py", line 255, in execute
          self._executeQuantaInProcess(graph, butler)
        File "/software/lsstsw/stack_20200922/stack/miniconda3-py37_4.8.2-cb4e2dc/Linux64/ctrl_mpexec/20.0.0-33-g43e4e0e+110e833cc8/python/lsst/ctrl/mpexec/mpGraphExecutor.py", line 301, in _executeQuantaInProcess
          self.quantumExecutor.execute(qnode.taskDef, qnode.quantum, butler)
        File "/software/lsstsw/stack_20200922/stack/miniconda3-py37_4.8.2-cb4e2dc/Linux64/ctrl_mpexec/20.0.0-33-g43e4e0e+110e833cc8/python/lsst/ctrl/mpexec/singleQuantumExecutor.py", line 99, in execute
          self.runQuantum(task, quantum, taskDef, butler)
        File "/software/lsstsw/stack_20200922/stack/miniconda3-py37_4.8.2-cb4e2dc/Linux64/ctrl_mpexec/20.0.0-33-g43e4e0e+110e833cc8/python/lsst/ctrl/mpexec/singleQuantumExecutor.py", line 266, in runQuantum
          task.runQuantum(butlerQC, inputRefs, outputRefs)
        File "/home/plazas/lsst_devel/LSST/cp_pipe/python/lsst/cp/pipe/ptc.py", line 565, in runQuantum
          inputs = butlerQC.get(inputRefs)
        File "/software/lsstsw/stack_20200922/stack/miniconda3-py37_4.8.2-cb4e2dc/Linux64/pipe_base/20.0.0-23-g8900aa8+487b895792/python/lsst/pipe/base/butlerQuantumContext.py", line 142, in get
          val = [self._get(r) for r in ref]
        File "/software/lsstsw/stack_20200922/stack/miniconda3-py37_4.8.2-cb4e2dc/Linux64/pipe_base/20.0.0-23-g8900aa8+487b895792/python/lsst/pipe/base/butlerQuantumContext.py", line 142, in <listcomp>
          val = [self._get(r) for r in ref]
        File "/software/lsstsw/stack_20200922/stack/miniconda3-py37_4.8.2-cb4e2dc/Linux64/pipe_base/20.0.0-23-g8900aa8+487b895792/python/lsst/pipe/base/butlerQuantumContext.py", line 94, in _get
          return butler.getDirect(ref)
        File "/home/plazas/lsst_devel/LSST/daf_butler/python/lsst/daf/butler/_butler.py", line 681, in getDirect
          return self.datastore.get(ref, parameters=parameters)
        File "/home/plazas/lsst_devel/LSST/daf_butler/python/lsst/daf/butler/datastores/fileLikeDatastore.py", line 1176, in get
          return self._read_artifact_into_memory(getInfo, ref, isComponent=isComponent)
        File "/home/plazas/lsst_devel/LSST/daf_butler/python/lsst/daf/butler/datastores/posixDatastore.py", line 123, in _read_artifact_into_memory
          f" ({ref.datasetType.name} from {location.path}): {e}") from e
      ValueError: Failure from formatter 'lsst.daf.butler.formatters.yaml.YamlFormatter' for dataset 793 (cpCovariances from /home/plazas/lsst_devel/LSST/ci_cpp_gen3/DATA/ptcTest.40/20201103T15h29m09s/cpCovariances/KPNO_406_828nm~EMPTY/KPNO_406_828nm~EMPTY/cpCovariances_KPNO_406_828nm~EMPTY_KPNO_406_828nm~EMPTY_2020012800028_0_LATISS_ptcTest_40_20201103T15h29m09s.yaml): Unable to read data with URI
      file:///home/plazas/lsst_devel/LSST/ci_cpp_gen3/DATA/ptcTest.40/20201103T15h29m09s/cpCovariances/KPNO_406_828nm~EMPTY/KPNO_406_828nm~EMPTY/cpCovariances_KPNO_406_828nm~EMPTY_KPNO_406_828nm~EMPTY_2020012800028_0_LATISS_ptcTest_40_20201103T15h29m09s.yaml
      

      The yaml file is a list of dictionaries keyed by tuples (alternating with empty dictionaries, to match dimensions between inputs and outputs). The first few lines of the yaml file are:

      ? !!python/tuple
      - C00
      - 2020012800028
      - 2020012800029
      : - !!python/tuple
          - 8727.117627992311
          - 8139.554851354043
          - 0
          - 0
          - !!python/object/apply:numpy.core.multiarray.scalar
            - &id001 !!python/object/apply:numpy.dtype
              args:
              - f8
              - false
              - true
              state: !!python/tuple
              - 3
              - <
              - null
              - null
              - null
              - -1
              - -1
              - 0
            - !!binary |
              8Yq8/ovLz0A=
          - !!python/object/apply:numpy.core.multiarray.scalar
            - *id001
            - !!binary |
              8Yq8/ovLz0A=
          - 1018000
          - 15
          - 0.4
          - C00
        - !!python/tuple
          - 8727.117627992311
          - 8139.554851354043
          - 1
          - 0
          - !!python/object/apply:numpy.core.multiarray.scalar
            - *id001
            - !!binary |
              8Yq8/ovLz0A=
          - !!python/object/apply:numpy.core.multiarray.scalar
            - *id001
            - !!binary |
              kXAsqRnIE8A=
          - 1016000
          - 15
          - 0.4
          - C00
        - !!python/tuple
          - 8727.117627992311
          - 8139.554851354043
          - 2
          - 0
          - !!python/object/apply:numpy.core.multiarray.scalar
            - *id001
            - !!binary |
              8Yq8/ovLz0A=
          - !!python/object/apply:numpy.core.multiarray.scalar
            - *id001
            - !!binary |
      

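As an aside, the opaque `!!binary` payloads above are just the raw bytes of a float64 (the `f8` dtype with `<` byte order declared in the `numpy.dtype` tag), so a quick stdlib decode recovers the number (a sketch using one payload from the excerpt):

```python
import base64
import struct

# One !!binary payload from the excerpt above; the accompanying
# numpy.dtype tag declares 'f8' with '<' byte order, i.e. a
# little-endian 64-bit float.
raw = base64.b64decode("8Yq8/ovLz0A=")
value, = struct.unpack("<d", raw)
print(value)  # ≈ 16279
```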
      Tim Jenness says that the problem is that the code is using a generic YAML formatter that expects a simple dict, but the task is in fact storing the data as described above (a list of dicts of tuples), and the default YAML loader doesn’t know how to load them.

      He reports two bugs that this ticket should address:
      1) The YamlFormatter should complain about this; currently it traps the error and lets code higher up report something generic.
      2) We should probably be using yaml.UnsafeLoader here instead (since we wrote the files, we should be able to read them and trust that they will be okay).
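The loader mismatch can be reproduced outside the butler: `yaml.safe_load` refuses the `!!python/tuple` tags this file is full of, while `yaml.UnsafeLoader` trusts them and rebuilds the tuple keys. A minimal sketch with a made-up document shaped like the cpCovariances file (not the actual formatter code):

```python
import yaml

# A tiny document shaped like the cpCovariances file: a mapping whose
# key is a python tuple, a non-standard YAML tag.
doc = """\
? !!python/tuple [C00, 2020012800028, 2020012800029]
: covariances
"""

# The safe loader rejects python-specific tags outright.
try:
    yaml.safe_load(doc)
except yaml.constructor.ConstructorError as e:
    print("safe_load failed:", e.problem)

# UnsafeLoader trusts the tags and reconstructs the tuple key.
data = yaml.load(doc, Loader=yaml.UnsafeLoader)
print(data)  # {('C00', 2020012800028, 2020012800029): 'covariances'}
```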

            Activity

            Tim Jenness added a comment (edited)

            Given the request by Kian-Tat Lim to use safe_dump for dumping to YAML in the standard formatter for dict (DM-27418), I think that is going to break any attempt to get this working for you, because numpy values will not round-trip.

            You have two choices:

            1. In the short term, clean up your dict to replace numpy values with floats (or lists of floats). Note also that tuple is a non-standard data type for YAML; if you want the YAML to be more portable you should use a python list.
            2. In the longer term, switch from dict to a proper python class so you can declare your own storage class and a specialist formatter that gives you more control over the serialization format. (That is DM-27420, by the looks of it.)
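The short-term option amounts to a recursive cleanup pass before the data is put. A minimal sketch (the helper name and the sample dict are made up; note that tuple *keys* also need a plain hashable representation, stringified here):

```python
import numpy as np
import yaml

def to_plain(obj):
    """Recursively replace numpy scalars with Python numbers and tuples
    with lists, so yaml.safe_dump / yaml.safe_load round-trips."""
    if isinstance(obj, dict):
        # Tuple keys are not safe_dump-able either; stringify them.
        return {str(k) if isinstance(k, tuple) else k: to_plain(v)
                for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_plain(x) for x in obj]
    if isinstance(obj, np.generic):
        return obj.item()  # numpy scalar -> native Python float/int
    return obj

# Sample data shaped loosely like the cpCovariances dict above.
raw = {("C00", 28, 29): [(np.float64(8727.12), np.float64(8139.55), 0, 0)]}
text = yaml.safe_dump(to_plain(raw))
print(yaml.safe_load(text))
```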
            Kian-Tat Lim added a comment

            I agree with Tim Jenness.

            Dictionaries are very general data structures, but they are often a poor choice for an interface, as there is little opportunity to document their contents or to validate them.  A class, even implemented on top of a dictionary, is better.

            Generally data should be YAML when it is intended for humans to read or write.  This seems to be neither, given that the numpy data seems to be written in unreadable binary anyway.  I'm not sure how flexible the dictionary values need to be, but some kind of table format (FITS, Parquet, etc.) might be appropriate.

            Andrés Alejandro Plazas Malagón added a comment (edited)

            Thank you both. To get DM-23159 (and its umbrella ticket DM-21786) done soon, I'll see if I can just follow option 1) above from Tim for now (or use another table format, as K-T suggests). These files are just intermediates of the PTC task, and started being saved after the task was split into an extract and a solve task. They get saved after the former runs, and in fact the first thing that solve does is read them again and assemble them into covariance matrices (extract measures covariances from pairs of flat fields, and solve fits models to those covariances).

            Tim Jenness added a comment (edited)

            I think that replacing each numpy number with a float would work for you even after DM-27418.

            Andrés Alejandro Plazas Malagón added a comment

            OK, thanks. I'll try that too.

            Tim Jenness added a comment

            I fixed the problem with not forwarding the error message on read in DM-27418. That also changed to safe_dump, so things will now break for you on put rather than get until you change the numpy values to float. I will close this ticket as invalid.

            Andrés Alejandro Plazas Malagón added a comment

            Thanks.


              People

              Assignee:
              Tim Jenness
              Reporter:
              Andrés Alejandro Plazas Malagón
              Watchers:
              Andrés Alejandro Plazas Malagón, Christopher Waters, Kian-Tat Lim, Tim Jenness
              Votes:
              0
