# Fix bugs when reading yaml file after running PTC extract task (DM-23159)

#### Details

• Type: Story
• Status: Invalid
• Resolution: Done
• Fix Version/s: None
• Component/s: None
• Labels:
• Team: Data Release Production
• Urgent?: No

#### Description

When trying to run the photon transfer curve measurement task (version DM-23159) on LATISS data (w_2020_44):

 pipetask run -d "detector=0 and exposure IN (2020012800028, 2020012800029, 2020012800041, 2020012800042)" -b ./DATA/butler.yaml -i ptcTest.19 -o ptcTest.40 -p /home/plazas/lsst_devel/LSST/cp_pipe/pipelines/measurePhotonTransferCurve.yaml --register-dataset-types 

The first part of the task creates a YAML file

  /home/plazas/lsst_devel/LSST/ci_cpp_gen3/DATA/ptcTest.40/20201103T15h29m09s/cpCovariances/KPNO_406_828nm~EMPTY/KPNO_406_828nm~EMPTY/cpCovariances_KPNO_406_828nm~EMPTY_KPNO_406_828nm~EMPTY_2020012800028_0_LATISS_ptcTest_40_20201103T15h29m09s.yaml 

that then can’t be read by the second part of the task (via inputs = butlerQC.get(inputRefs)):

 inputRefs InputQuantizedConnection(_attributes={'inputCovariances', 'camera'}, camera=DatasetRef(DatasetType('camera', {instrument}, Camera, isCalibration=True), {instrument: LATISS}, id=1, run='LATISS/calib/unbounded'), inputCovariances=[DatasetRef(DatasetType('cpCovariances', {band, instrument, detector, physical_filter, exposure}, StructuredDataDict), {band: KPNO_406_828nm~EMPTY, instrument: LATISS, detector: 0, physical_filter: KPNO_406_828nm~EMPTY, exposure: 2020012800028}, id=793, run='ptcTest.40/20201103T15h29m09s'), DatasetRef(DatasetType('cpCovariances', {band, instrument, detector, physical_filter, exposure}, StructuredDataDict), {band: KPNO_406_828nm~EMPTY, instrument: LATISS, detector: 0, physical_filter: KPNO_406_828nm~EMPTY, exposure: 2020012800029}, id=794, run='ptcTest.40/20201103T15h29m09s'), DatasetRef(DatasetType('cpCovariances', {band, instrument, detector, physical_filter, exposure}, StructuredDataDict), {band: KPNO_406_828nm~EMPTY, instrument: LATISS, detector: 0, physical_filter: KPNO_406_828nm~EMPTY, exposure: 2020012800041}, id=795, run='ptcTest.40/20201103T15h29m09s'), DatasetRef(DatasetType('cpCovariances', {band, instrument, detector, physical_filter, exposure}, StructuredDataDict), {band: KPNO_406_828nm~EMPTY, instrument: LATISS, detector: 0, physical_filter: KPNO_406_828nm~EMPTY, exposure: 2020012800042}, id=796, run='ptcTest.40/20201103T15h29m09s')])

 Error: An error occurred during command execution:
 Traceback (most recent call last):
   File "/home/plazas/lsst_devel/LSST/daf_butler/python/lsst/daf/butler/datastores/posixDatastore.py", line 120, in _read_artifact_into_memory
     result = formatter.read(component=getInfo.component if isComponent else None)
   File "/home/plazas/lsst_devel/LSST/daf_butler/python/lsst/daf/butler/formatters/file.py", line 196, in read
     raise ValueError(f"Unable to read data with URI {self.fileDescriptor.location.uri}")
 ValueError: Unable to read data with URI file:///home/plazas/lsst_devel/LSST/ci_cpp_gen3/DATA/ptcTest.40/20201103T15h29m09s/cpCovariances/KPNO_406_828nm~EMPTY/KPNO_406_828nm~EMPTY/cpCovariances_KPNO_406_828nm~EMPTY_KPNO_406_828nm~EMPTY_2020012800028_0_LATISS_ptcTest_40_20201103T15h29m09s.yaml

 The above exception was the direct cause of the following exception:

 Traceback (most recent call last):
   File "/home/plazas/lsst_devel/LSST/daf_butler/python/lsst/daf/butler/cli/utils.py", line 446, in cli_handle_exception
     return func(*args, **kwargs)
   File "/software/lsstsw/stack_20200922/stack/miniconda3-py37_4.8.2-cb4e2dc/Linux64/ctrl_mpexec/20.0.0-33-g43e4e0e+110e833cc8/python/lsst/ctrl/mpexec/cli/script/run.py", line 173, in run
     f.runPipeline(qgraphObj, taskFactory, args)
   File "/software/lsstsw/stack_20200922/stack/miniconda3-py37_4.8.2-cb4e2dc/Linux64/ctrl_mpexec/20.0.0-33-g43e4e0e+110e833cc8/python/lsst/ctrl/mpexec/cmdLineFwk.py", line 646, in runPipeline
     executor.execute(graph, butler)
   File "/software/lsstsw/stack_20200922/stack/miniconda3-py37_4.8.2-cb4e2dc/Linux64/ctrl_mpexec/20.0.0-33-g43e4e0e+110e833cc8/python/lsst/ctrl/mpexec/mpGraphExecutor.py", line 255, in execute
     self._executeQuantaInProcess(graph, butler)
   File "/software/lsstsw/stack_20200922/stack/miniconda3-py37_4.8.2-cb4e2dc/Linux64/ctrl_mpexec/20.0.0-33-g43e4e0e+110e833cc8/python/lsst/ctrl/mpexec/mpGraphExecutor.py", line 301, in _executeQuantaInProcess
     self.quantumExecutor.execute(qnode.taskDef, qnode.quantum, butler)
   File "/software/lsstsw/stack_20200922/stack/miniconda3-py37_4.8.2-cb4e2dc/Linux64/ctrl_mpexec/20.0.0-33-g43e4e0e+110e833cc8/python/lsst/ctrl/mpexec/singleQuantumExecutor.py", line 99, in execute
     self.runQuantum(task, quantum, taskDef, butler)
   File "/software/lsstsw/stack_20200922/stack/miniconda3-py37_4.8.2-cb4e2dc/Linux64/ctrl_mpexec/20.0.0-33-g43e4e0e+110e833cc8/python/lsst/ctrl/mpexec/singleQuantumExecutor.py", line 266, in runQuantum
     task.runQuantum(butlerQC, inputRefs, outputRefs)
   File "/home/plazas/lsst_devel/LSST/cp_pipe/python/lsst/cp/pipe/ptc.py", line 565, in runQuantum
     inputs = butlerQC.get(inputRefs)
   File "/software/lsstsw/stack_20200922/stack/miniconda3-py37_4.8.2-cb4e2dc/Linux64/pipe_base/20.0.0-23-g8900aa8+487b895792/python/lsst/pipe/base/butlerQuantumContext.py", line 142, in get
     val = [self._get(r) for r in ref]
   File "/software/lsstsw/stack_20200922/stack/miniconda3-py37_4.8.2-cb4e2dc/Linux64/pipe_base/20.0.0-23-g8900aa8+487b895792/python/lsst/pipe/base/butlerQuantumContext.py", line 142, in <listcomp>
     val = [self._get(r) for r in ref]
   File "/software/lsstsw/stack_20200922/stack/miniconda3-py37_4.8.2-cb4e2dc/Linux64/pipe_base/20.0.0-23-g8900aa8+487b895792/python/lsst/pipe/base/butlerQuantumContext.py", line 94, in _get
     return butler.getDirect(ref)
   File "/home/plazas/lsst_devel/LSST/daf_butler/python/lsst/daf/butler/_butler.py", line 681, in getDirect
     return self.datastore.get(ref, parameters=parameters)
   File "/home/plazas/lsst_devel/LSST/daf_butler/python/lsst/daf/butler/datastores/fileLikeDatastore.py", line 1176, in get
     return self._read_artifact_into_memory(getInfo, ref, isComponent=isComponent)
   File "/home/plazas/lsst_devel/LSST/daf_butler/python/lsst/daf/butler/datastores/posixDatastore.py", line 123, in _read_artifact_into_memory
     f" ({ref.datasetType.name} from {location.path}): {e}") from e
 ValueError: Failure from formatter 'lsst.daf.butler.formatters.yaml.YamlFormatter' for dataset 793 (cpCovariances from /home/plazas/lsst_devel/LSST/ci_cpp_gen3/DATA/ptcTest.40/20201103T15h29m09s/cpCovariances/KPNO_406_828nm~EMPTY/KPNO_406_828nm~EMPTY/cpCovariances_KPNO_406_828nm~EMPTY_KPNO_406_828nm~EMPTY_2020012800028_0_LATISS_ptcTest_40_20201103T15h29m09s.yaml): Unable to read data with URI file:///home/plazas/lsst_devel/LSST/ci_cpp_gen3/DATA/ptcTest.40/20201103T15h29m09s/cpCovariances/KPNO_406_828nm~EMPTY/KPNO_406_828nm~EMPTY/cpCovariances_KPNO_406_828nm~EMPTY_KPNO_406_828nm~EMPTY_2020012800028_0_LATISS_ptcTest_40_20201103T15h29m09s.yaml

The YAML file is a list of dictionaries with tuples (alternating with empty dictionaries, to match dimensions between inputs and outputs). The first few lines of the YAML file are:

 ? !!python/tuple
 - C00
 - 2020012800028
 - 2020012800029
 : - !!python/tuple
     - 8727.117627992311
     - 8139.554851354043
     - 0
     - 0
     - !!python/object/apply:numpy.core.multiarray.scalar
       - &id001 !!python/object/apply:numpy.dtype
         args:
         - f8
         - false
         - true
         state: !!python/tuple
         - 3
         - <
         - null
         - null
         - null
         - -1
         - -1
         - 0
       - !!binary |
         8Yq8/ovLz0A=
     - !!python/object/apply:numpy.core.multiarray.scalar
       - *id001
       - !!binary |
         8Yq8/ovLz0A=
     - 1018000
     - 15
     - 0.4
     - C00
   - !!python/tuple
     - 8727.117627992311
     - 8139.554851354043
     - 1
     - 0
     - !!python/object/apply:numpy.core.multiarray.scalar
       - *id001
       - !!binary |
         8Yq8/ovLz0A=
     - !!python/object/apply:numpy.core.multiarray.scalar
       - *id001
       - !!binary |
         kXAsqRnIE8A=
     - 1016000
     - 15
     - 0.4
     - C00
   - !!python/tuple
     - 8727.117627992311
     - 8139.554851354043
     - 2
     - 0
     - !!python/object/apply:numpy.core.multiarray.scalar
       - *id001
       - !!binary |
         8Yq8/ovLz0A=
     - !!python/object/apply:numpy.core.multiarray.scalar
       - *id001
       - !!binary |

Tim Jenness says that the problem is that the code is using a generic YAML formatter that expects a simple dict, but the task is in fact storing the data as explained above (a list of dicts of tuples). The YAML loader doesn't know how to load them.

He reports that there are two bugs that this ticket should address:
1) The YamlFormatter should complain about this; currently it traps the error and lets a bit of code higher up report something generic.
2) We should probably be using yaml.UnsafeLoader here instead (since we wrote the files ourselves, we should be able to read them and trust that they are okay).
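The loader problem can be reproduced outside the butler with PyYAML alone. The sketch below uses a minimal, made-up document with the same `!!python/tuple` tag that appears in the cpCovariances file; the key and values are illustrative, not the actual task output:

```python
import yaml

# Minimal document using the Python-specific !!python/tuple tag, as in the
# cpCovariances file (the key and values here are made up for illustration).
doc = """
? !!python/tuple [C00, 2020012800028, 2020012800029]
: [8727.117627992311, 8139.554851354043]
"""

# SafeLoader has no constructor for !!python/tuple, so safe_load fails...
try:
    yaml.safe_load(doc)
except yaml.YAMLError as exc:
    print("safe_load failed:", type(exc).__name__)

# ...while UnsafeLoader reconstructs the tuple key. Using it is reasonable
# here because the files were written by the same code that reads them.
data = yaml.load(doc, Loader=yaml.UnsafeLoader)
print(list(data)[0])
```

yaml.UnsafeLoader exists in PyYAML 5.1 and later; yaml.FullLoader is a middle ground, but it still refuses arbitrary object construction, so it would not load the numpy `!!python/object/apply:...` tags in the real file.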

#### Activity

Tim Jenness added a comment - edited

Given the request by Kian-Tat Lim to use safe_dump when dumping to YAML in the standard formatter for dict (DM-27418), I think that is going to break any attempt to get this working for you, because numpy values will not round-trip.

You have two choices:

1. In the short term, clean up your dict to replace numpy values with floats (or lists of floats). Note also that tuple is a non-standard data type for YAML; if you want the YAML to be more portable you should use a Python list.
2. In the longer term, switch from dict to a proper Python class so you can declare your own storage class and a specialist formatter that will give you more control over the serialization format (that looks to be DM-27420).
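The short-term cleanup in option 1 might look something like the sketch below. The helper name `to_plain` and the underscore-joined string form chosen for tuple keys are hypothetical choices for illustration, not anything from cp_pipe:

```python
import numpy as np
import yaml

def to_plain(obj):
    # Hypothetical helper (not from cp_pipe): recursively replace numpy
    # scalars with plain Python numbers, arrays with lists, and tuples
    # with lists, so that yaml.safe_dump can serialize the result.
    if isinstance(obj, np.generic):
        return obj.item()
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    if isinstance(obj, dict):
        # safe_dump cannot emit tuple keys either, so flatten them to a
        # string (an arbitrary illustrative choice).
        return {
            "_".join(str(p) for p in k) if isinstance(k, tuple) else k: to_plain(v)
            for k, v in obj.items()
        }
    if isinstance(obj, (list, tuple)):
        return [to_plain(v) for v in obj]
    return obj

# Made-up miniature of the covariance structure: tuple key, numpy values.
raw = {("C00", 28, 29): [(np.float64(8727.1), np.float64(8139.5), 0, 0)]}
clean = to_plain(raw)
print(yaml.safe_dump(clean))  # plain scalars only; round-trips via safe_load
```

The resulting document uses only standard YAML types, so it survives the safe_dump/safe_load round trip that DM-27418 enforces.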
Kian-Tat Lim added a comment

I agree with Tim Jenness.

Dictionaries are very general data structures, but they are often a poor choice for an interface, as there is little opportunity to document their contents or to validate them.  A class, even implemented on top of a dictionary, is better.

Generally data should be YAML when it is intended for humans to read or write.  This seems to be neither, given that the numpy data seems to be written in unreadable binary anyway.  I'm not sure how flexible the dictionary values need to be, but some kind of table format (FITS, Parquet, etc.) might be appropriate.

Andrés Alejandro Plazas Malagón added a comment - edited

Thank you both. To get DM-23159 (and its umbrella ticket DM-21786) done soon, I'll see if I can just follow option 1 from Tim for now (or use another table format, as K-T suggests). These files are just intermediates of the PTC task, and started being saved after the task was split into an extract and a solve task. They get saved after the former is run, and in fact the first thing that solve does is read them again and assemble them into covariance matrices (extract measures covariances from pairs of flat fields, and solve fits models to those covariances).

Tim Jenness added a comment - edited

I think that replacing each numpy number with a float would work for you even after DM-27418.

Andrés Alejandro Plazas Malagón added a comment

OK, thanks. I'll try that too.

Tim Jenness added a comment

I fixed the problem with not forwarding the error message on read in DM-27418. That ticket also switched to safe_dump, so things will now break for you on put rather than get until you change the numpy values to floats. I will close this ticket as invalid.
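For reference, the put-time failure after the switch to safe_dump can be seen with PyYAML directly. This is a sketch with made-up values, not the actual butler code path:

```python
import numpy as np
import yaml

# With safe_dump, numpy scalars are rejected at write time: SafeRepresenter
# only knows exact built-in types, and np.float64 is not plain float.
try:
    yaml.safe_dump({"mean": np.float64(8727.117627992311)})
except yaml.YAMLError as exc:
    print("safe_dump rejected numpy scalar:", exc)

# Converting to a plain Python float first succeeds.
print(yaml.safe_dump({"mean": float(np.float64(8727.117627992311))}))
```

So after DM-27418 the error surfaces when the extract task writes the dataset, rather than when the solve task reads it back.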

Andrés Alejandro Plazas Malagón added a comment

Thanks.


#### People

Assignee: Tim Jenness
Reporter: Andrés Alejandro Plazas Malagón
Watchers: Andrés Alejandro Plazas Malagón, Christopher Waters, Kian-Tat Lim, Tim Jenness