Details
- Type: Story
- Status: Done
- Resolution: Done
- Fix Version/s: None
- Component/s: None
- Labels: None
- Story Points: 10
- Epic Link:
- Team: Data Release Production
Description
DM-13770 implemented butler-enabled reading/writing of Parquet tables via a FitsCatalogStorage shim. However, to take advantage of the column-store nature of Parquet, it is necessary to define a new storage type that enables extraction of specified columns without loading the whole table. Jim Bosch suggests that this "basically boils down to grepping for all the places e.g. FitsStorage appears in obs_base and daf_persistence, and adding new clauses to various if blocks for the new storage type." Additionally, FitsStorage provides a model for passing extra keywords to the loading function, which allows, e.g., butler.get("calexp_sub", dataId=..., bbox=...).
This is now implemented. I have defined the deepCoadd_obj object table (from DM-13770) to be stored with ParquetStorage. This storage type expects put to take a pandas DataFrame, while get returns a ParquetTable object (defined for now within qa_explorer). ParquetTable itself has tests within qa_explorer; the put/get test I run is as follows (I am open to suggestions for a better way to unit-test this). First, set up qa_explorer, then run
writeObjectTable.py /datasets/hsc/repo/rerun/RC/w_2018_10/DM-13647/ --output /your/test/repo/path --id tract=9615 patch=4,4 filter=HSC-G^HSC-R --no-versions
to write the table. Then, test the reading as follows: