# Use specialist formatter for serializing lsst.verify.Measurement in butler


#### Details

• Type: Story
• Status: To Do
• Resolution: Unresolved
• Fix Version/s: None
• Component/s: None
• Labels: None
• Story Points: 2
• Urgent?: No

#### Description

Today it was noted that reading many serialized lsst.verify.Measurement files with the butler takes a long time. Part of this slowdown is caused by their being serialized in YAML format (using complex YAML object specifications). This was done on DM-21875 as the simplest and quickest solution to the problem, given that we had no standardized API to look for in the JSON formatter (following the Pydantic model would probably have been the right answer, had we known about it), but in hindsight I gave bad advice.

We need a new specialist Formatter that will use the Measurement write_json() API to write a JSON file. The read method should look at the file extension and read the existing YAML or the new JSON.
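The extension-based dispatch described above can be sketched as follows. This is a hypothetical illustration, not the real butler Formatter API: `MeasurementStandin` stands in for `lsst.verify.Measurement`, and the YAML branch is only stubbed.

```python
import json
import pathlib

# Hypothetical sketch of the proposed specialist formatter's read dispatch.
# MeasurementStandin is a stand-in for lsst.verify.Measurement; the real
# formatter would subclass the butler Formatter base class instead.

class MeasurementStandin:
    """Minimal stand-in exposing a write_json()-style API."""

    def __init__(self, payload):
        self.payload = payload

    def write_json(self, path):
        # Mirrors the Measurement.write_json() API mentioned in the ticket.
        pathlib.Path(path).write_text(json.dumps(self.payload))


def read_measurement(path):
    """Pick a reader based on the file extension, as the ticket proposes."""
    ext = pathlib.Path(path).suffix
    if ext == ".json":
        return MeasurementStandin(json.loads(pathlib.Path(path).read_text()))
    if ext in (".yaml", ".yml"):
        # Existing datasets: delegate to the current YAML path (not shown).
        raise NotImplementedError("YAML read path not sketched here")
    raise ValueError(f"unsupported extension: {ext!r}")
```

A new dataset would round-trip through the JSON branch, while files written by the old YAML formatter would fall through to the legacy reader.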

The question that vexed us in DM-21875 was where such a formatter should go. Current policy (and in the absence of any progress on DM-26190) is to only write non-lsst formatters in daf_butler and to write lsst-specific formatters in obs_base.

#### Activity

Krzysztof Findeisen added a comment -

Can you clarify why we need a specialized class instead of lsst.daf.butler.formatters.JsonFormatter?

Tim Jenness added a comment - edited

It might work if we were careful, but it might require some enhancements to the standard JSON formatter. And you are right that we don't have to support .yaml files at all, since the old formatter will still be used for existing datasets – we'd only need to support YAML if we had a write configuration option allowing either format to be written.

In particular:

• For writing, the in-memory dataset must either have an _asdict() method or be dumpable directly as JSON. Measurement does not support that. We could add support to the JSON formatter to check whether .json() works and use that (it is already supported by Measurement and pydantic).
• For reading, we use json.loads, which is fine, but currently, to coerce the result to the right Python type in the coercion method, we assume that pytype(object) will work. Does it? We could follow the pydantic API here, add support in the formatter for pytype.parse_obj(object), and then add such a method to Measurement.
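The duck-typing probes proposed in these two bullets might look something like the sketch below. This is a hypothetical illustration of the pattern, not actual daf_butler code; `PydanticLike` is a stand-in mimicking the pydantic v1 `json()`/`parse_obj()` surface.

```python
import json

# Hypothetical sketch of the write/read probes the comment proposes for
# JsonFormatter; names and fallback order are assumptions, not real code.

class PydanticLike:
    """Stand-in with a pydantic-v1-style json()/parse_obj() surface."""

    def __init__(self, value):
        self.value = value

    def json(self):
        return json.dumps({"value": self.value})

    @classmethod
    def parse_obj(cls, obj):
        return cls(obj["value"])


def to_json_bytes(in_memory):
    # Write path: prefer a callable .json(), then _asdict(), then a raw dump.
    if callable(getattr(in_memory, "json", None)):
        return in_memory.json().encode()
    if hasattr(in_memory, "_asdict"):
        return json.dumps(in_memory._asdict()).encode()
    return json.dumps(in_memory).encode()


def from_json_bytes(data, pytype):
    obj = json.loads(data)
    # Read path: prefer pytype.parse_obj(obj), falling back to pytype(obj).
    if hasattr(pytype, "parse_obj"):
        return pytype.parse_obj(obj)
    return pytype(obj)
```

Types without either hook (plain dicts, for example) would still take the existing `json.dumps`/`pytype(object)` path unchanged.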

So yes, with some minor tweaks to both places we could use the JsonFormatter.

Tim Jenness added a comment -

Pydantic models are supported by JsonFormatter now, so parse_obj() is supported, along with a json() method that returns a string.

Measurement is not compatible with that, because it seems to have json as a property (not a method).

Writing out a JSON measurement seems to be lossy compared to the YAML serialization. In particular blobs are stored only by ref ID. I assume that's not really an issue (and storing the blobs will make things even slower on read).

The content looks something like:

```json
{
  "blob_refs": ["e1e9081c12764c04a83dbe72e93ee82f"],
  "identifier": "43068e671da94ce7b419d969f01de0c9",
  "metric": "TE3",
  "unit": "",
  "value": 0.000168362403348868
}
```

It looks like you need to use Measurement.deserialize to convert it back to a measurement. This all suggests a specialist formatter is the only option, assuming that the lack of blobs is not a problem.
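The property-vs-method mismatch and the deserialize read path can be shown with a stand-in. `MeasurementLike` below is hypothetical; the real `Measurement.deserialize` signature may differ, and the point is only why the pydantic-style callable probe skips a property.

```python
import json

# MeasurementLike is a hypothetical stand-in illustrating the comment above:
# json is exposed as a property, so probing for a callable .json() fails.

class MeasurementLike:
    def __init__(self, payload):
        self.payload = payload

    @property
    def json(self):  # a property, not a method: accessing it yields a string
        return json.dumps(self.payload)

    @classmethod
    def deserialize(cls, **kwargs):
        # Stand-in for the Measurement.deserialize read path (arguments
        # assumed; the real API may differ).
        return cls(kwargs)


m = MeasurementLike({"metric": "TE3", "value": 0.000168362403348868})

# A pydantic-style probe sees a string, not a bound method:
assert not callable(m.json)

# A specialist formatter would read the property and call deserialize:
restored = MeasurementLike.deserialize(**json.loads(m.json))
```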


#### People

Assignee:
Unassigned
Reporter:
Tim Jenness
Watchers:
Keith Bechtol, Krzysztof Findeisen, Tim Jenness