# Use pyarrow.Table for butler parquet datasets instead of pandas.DataFrame


#### Details

• Type: RFC
• Resolution: Unresolved

#### Description

Parquet can do a lot of potentially useful things in terms of data partitioning, streaming, metadata, and optimized I/O that we don't currently take advantage of, but would like to in the future.

The fact that we use pandas.DataFrame as the in-memory type corresponding to a Parquet file on disk blocks some of these and hinders others, because pandas simply can't do some of them - for example, one can't attach the same kind of metadata to a pandas.DataFrame that one can attach to a Parquet file on disk.

There simply are no good in-memory table libraries that can do everything we'd like as well as we'd like, but pyarrow.Table does a very good job of fully supporting Parquet, and of the major contenders it's the one whose internal data structure maps most naturally to Parquet. The Arrow docs themselves say:

> Parquet is not a “runtime in-memory format”; in general, file formats almost always have to be deserialized into some in-memory data structure for processing. We intend for Arrow to be that in-memory data structure.

In fact, our implementation for Parquet persistence of DataFrames already works by converting the DataFrame to/from a pyarrow.Table.

This RFC proposes that we move that conversion out of the Butler formatter, and instead make it the responsibility of application code to either use the pyarrow.Table object directly or convert it to/from something else (often a DataFrame).

Specifically, this would involve introducing a new butler "Table" storage class and, just as with the PropertySet -> TaskMetadata conversion, defining converters between the "Table" storage class and the existing "DataFrame" storage class. When particular PipelineTasks that use DataFrame datasets are migrated as proposed by this RFC, they would update both their code and their connection definition at the same time, ensuring that they get the type they expect regardless of how the dataset type was originally registered.

That isn't quite enough, because for dataset types already registered with "DataFrame", we'll end up round-tripping through pandas even when a pyarrow.Table is expected, defeating much of the purpose of this RFC. So we will eventually also need to either:

• migrate data repositories as well, by replacing all DataFrame datasets with Table datasets;
• define new Table dataset types to replace all existing DataFrame dataset types.

I think the latter is probably a better approach - it greatly reduces the chances of different data repositories having different dataset type definitions. But I think it's an advantage of this proposal that we don't need to do either immediately; we can migrate tasks and pipelines incrementally, and hopefully finish the project before taking full advantage of Arrow->Parquet functionality becomes critical.

#### Activity

Eli Rykoff added a comment -

I support this! And it is also not hard to write a pyarrow.Table to/from numpy.recarray converter that bypasses pandas.DataFrame entirely. See, e.g. https://github.com/astropy/astropy/blob/main/astropy/io/misc/parquet.py

One could even create a column of binary blobs if you need to store (e.g.) footprints that way. But I have no idea what the performance of this would be.

A follow-up question is whether we are envisioning also expanding support from single-files to multi-file parquet datasets, and if that has any bearing on this RFC.

John Parejko added a comment -

I would advocate not calling the storage class Table, but rather PyArrowTable or something similar, since Table by itself is a bit too generic.

Jim Bosch added a comment (edited) -

> A follow-up question is whether we are envisioning also expanding support from single-files to multi-file parquet datasets, and if that has any bearing on this RFC.

I regard that as one of the things "hindered" by using DataFrame as the in-Python type, but not actually blocked. I'm hoping to implement it at some point in the future without the need for an RFC, because it should appear as new functionality that doesn't change or break anything that exists.

> I would advocate not calling the storage class Table, but rather PyArrowTable or something similar, since Table by itself is a bit too generic.

I reluctantly agree, though I might go with ArrowTable instead of PyArrowTable to be a bit less tied to the Python version of the type while still being sufficiently specific. I'm reluctant because I had originally conceived of these storage classes as being less coupled to a particular Python class than they are in practice, but that's a problem I don't have a solution for at hand, and given how things work at present using a more specific type does make sense.

Eli Rykoff added a comment -

I agree that the name should be ArrowTable rather than Table. I think Table is too generic/overloaded, and PyArrowTable is too specific. These are ArrowTable objects that will be returned, and I think that is relevant. (And the underlying storage, whether Parquet or Feather or something else supported by Arrow, is still abstracted away.)

Jim Bosch added a comment -

This has converged, and I've added a pair of implementation tickets. Adopting.


#### People

Assignee:
Jim Bosch
Reporter:
Jim Bosch
Watchers:
Clare Saunders, Colin Slater, Eli Rykoff, Jim Bosch, John Parejko, Lee Kelvin, Tim Jenness