Parquet can do a lot of potentially useful things in terms of data partitioning, streaming, metadata, and optimized I/O that we don't currently take advantage of, but would like to in the future.
The fact that we use pandas.DataFrame as the in-memory type that corresponds to a Parquet file on disk is a blocker for some of these, and a hindrance for others, because pandas can't do some of these things - for example, one can't attach the same kind of metadata to pandas.DataFrame that one could attach to a Parquet file on disk.
There simply are no good in-memory table libraries that can do everything we'd like as well as we'd like, but pyarrow.Table does a very good job of fully supporting Parquet, and of the major contenders it's the one whose internal data structure maps most naturally to Parquet. The Arrow docs themselves say:
Parquet is not a “runtime in-memory format”; in general, file formats almost always have to be deserialized into some in-memory data structure for processing. We intend for Arrow to be that in-memory data structure.
In fact, our implementation for Parquet persistence of DataFrames already works by converting the DataFrame to/from a pyarrow.Table.
This RFC proposes that we move that conversion out of the Butler formatter, and instead make it the responsibility of application code to either use the pyarrow.Table object directly or convert it to/from something else (often a DataFrame).
Specifically, this would involve introducing a new butler "Table" storage class, and just as with the PropertySet -> TaskMetadata conversion we'd define converters between the "Table" storage class and the existing "DataFrame" storage class. When particular PipelineTasks that use DataFrame datasets are migrated as proposed by this RFC, they would update both their code and their connection definition at the same time, ensuring that they get the type the expect regardless of how the dataset type was originally registered.
That isn't quite enough, because for dataset types already registered with "DataFrame", we'll end up round-tripping through pandas even when a pyarrow table is expected, defeating much of the purpose of the RFC. So we will eventually also need to either:
- migrate data repositories as well, by replacing all DataFrame datasets with Table datasets;
- define new Table dataset types to replace all existing DataFrame dataset types.
I think the latter is probably a better approach - it greatly reduces the chances of different data repositories having different dataset type definitions. But I think it's an advantage of this proposal that we don't need to do either immediately; we can migrate tasks and pipelines incrementally, and hopefully finish the project before taking full advantage of Arrow->Parquet functionality becomes critical.