I have now written a writeObjectTableTask in the qa_explorer package, which is now eups-installable (DM-13793). This work defines a new deepCoadd_obj dataset, which reads from and writes to parquet via the lsst.qa.explorer.table.ParquetTable object; that object defines writeFits and readFits methods, which lets it masquerade as a FitsCatalogStorage storage type.
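For reference, here is a minimal sketch of the kind of wrapper this involves (illustrative only, not the actual lsst.qa.explorer.table.ParquetTable implementation); the writeFits/readFits names exist only to satisfy the FitsCatalogStorage interface, while the payload is really parquet:

import pandas as pd

class ParquetTable:
    """Illustrative sketch: a parquet-backed table that satisfies the
    FitsCatalogStorage interface by exposing writeFits/readFits."""
    def __init__(self, dataframe):
        self.df = dataframe

    def writeFits(self, path):
        # Despite the name, this writes parquet (pyarrow engine).
        self.df.to_parquet(path, engine='pyarrow')

    @classmethod
    def readFits(cls, path):
        # Likewise, this reads parquet back into a pandas dataframe.
        return cls(pd.read_parquet(path, engine='pyarrow'))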
Running the task looks like this:
writeObjectTable.py /datasets/hsc/repo/rerun/RC/w_2018_10/DM-13647/ --output /project/tmorton/DM-13770 --id tract=9615 filter=HSC-G^HSC-R^HSC-I^HSC-Z^HSC-Y --no-versions -j 24
This writes a single parquet file per patch that is a merge of the deepCoadd_meas, deepCoadd_forced_src and deepCoadd_ref tables for all requested bands (containing ~10k columns).
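Conceptually, the merge produces a dataframe whose columns are keyed by (dataset, filter, column). A rough sketch of that structure in pandas is below; the variable names and the exact task logic are illustrative only, with catalogs[dataset][filt] standing in for the per-band catalog already converted to a dataframe indexed by object id:

import pandas as pd

datasets = ('meas', 'forced_src', 'ref')
filters = ('HSC-G', 'HSC-R', 'HSC-I', 'HSC-Z', 'HSC-Y')

# Nested pd.concat along the columns: the dict keys become the outer
# column levels, giving a (dataset, filter, column) hierarchy.
merged = pd.concat(
    {dataset: pd.concat({filt: catalogs[dataset][filt] for filt in filters}, axis=1)
     for dataset in datasets},
    axis=1)

# merged['meas']['HSC-R'] then recovers the single-band meas columns.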
The tables can currently be loaded as follows:
from lsst.daf.persistence import Butler
butler = Butler('/project/tmorton/DM-13770')
dataId = {'tract':9615, 'patch':'4,4'}
df = butler.get('deepCoadd_obj', dataId=dataId)
Here, df is now a pandas dataframe with a multi-level column index (check out what it looks like with df.head()). If you want a particular subset of this table, e.g. the meas catalog in HSC-R, you can access it with df['meas']['HSC-R'], which returns just that portion of the table. Note that you cannot currently load only selected columns (the main motivation for using the parquet format); this will be addressed in future work.
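A couple of other access patterns that may be handy with the multi-level columns (the level number below is an assumption about where the filter sits in the column hierarchy):

# Inspect the column hierarchy.
df.columns.levels

# Tuple-key selection, equivalent to the chained df['meas']['HSC-R'] above.
meas_r = df[('meas', 'HSC-R')]

# Grab every dataset for a single band, assuming the filter is the
# second column level.
all_r = df.xs('HSC-R', axis=1, level=1)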
I note that while, in principle, one should be able to load all patches into a single dask dataframe using dask.dataframe.read_parquet, this doesn't currently work, presumably due to the multi-index (question asked at https://github.com/dask/dask/issues/1493).
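Until that is sorted out, one workaround is simply to loop over patches with the butler and concatenate with pandas (the patch list below is illustrative; in practice it would come from the skymap):

import pandas as pd
from lsst.daf.persistence import Butler

butler = Butler('/project/tmorton/DM-13770')

patches = ['4,4', '4,5', '5,4']
dfs = {patch: butler.get('deepCoadd_obj', dataId={'tract': 9615, 'patch': patch})
       for patch in patches}

# The dict keys become the outer level of the row index, so each row
# remains traceable to its patch.
all_patches = pd.concat(dfs)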
I will also note that before I followed the suggestions on RFC-465 to exclusively use the pyarrow engine to write/read the parquet files, I discovered that fastparquet was very slow to initialize ParquetFile objects on these large tables, so I submitted a fix to the fastparquet project (https://github.com/dask/fastparquet/pull/318), which decreased the metadata-read time for the ~10,000-column file from about 12 s to 0.3 s.