Data Management / DM-13876

Write a ParquetStorage Butler storage type


    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      DM-13770 implemented butler-enabled reading and writing of Parquet tables via a FitsCatalogStorage shim.  However, to take advantage of Parquet's column-store nature, we need to define a new storage type that enables extraction of specified columns without loading the whole table.  Jim Bosch suggests that this "basically boils down to grepping for all the places e.g. FitsStorage appears in obs_base and daf_persistence, and adding new clauses to various if blocks for the new storage type."  Additionally, FitsStorage provides a model for passing extra keywords to the loading function, which allows, e.g., butler.get("calexp_sub", dataId=..., bbox=...).
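
      For illustration, here is a minimal sketch of the reader/writer pair such a storage type would need, written directly against pyarrow (the function names and the columns keyword here are illustrative assumptions, not the actual daf_persistence API):

          import pyarrow as pa
          import pyarrow.parquet as pq


          def writeParquetStorage(df, path):
              # Hypothetical writer hook: serialize a pandas DataFrame to Parquet.
              pq.write_table(pa.Table.from_pandas(df), path)


          def readParquetStorage(path, columns=None):
              # Hypothetical reader hook: read back only the requested columns,
              # so pyarrow touches just those column chunks, never the whole table.
              return pq.read_table(path, columns=columns).to_pandas()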

        Attachments

          Activity

          tmorton Tim Morton [X] (Inactive) added a comment (edited)

          This is now implemented. I have defined the deepCoadd_obj object table (from DM-13770) to be stored with ParquetStorage; a sketch of what such a dataset definition looks like follows. This storage type expects put to receive a pandas DataFrame, while get returns a ParquetTable object (defined for now within qa_explorer). ParquetTable itself has tests within qa_explorer; the put/get test I run is described below (I am open to suggestions for a better way to unit-test this).
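
          For orientation, a Gen2 mapper policy entry for such a dataset would look roughly like this (the template path and python class location are illustrative guesses, not the exact entry):

              deepCoadd_obj:
                  template: "deepCoadd-results/%(tract)d/%(patch)s/objectTable-%(tract)d-%(patch)s.parq"  # illustrative path
                  python: lsst.qa.explorer.parquetTable.ParquetTable  # illustrative location
                  persistable: ignored
                  storage: ParquetStorage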

          First, set up qa_explorer. Then

          writeObjectTable.py /datasets/hsc/repo/rerun/RC/w_2018_10/DM-13647/ --output /your/test/repo/path --id tract=9615 patch=4,4 filter=HSC-G^HSC-R --no-versions
          

          to write the table. Then, test the reading as follows:

          from lsst.daf.persistence import Butler

          butler = Butler('/your/test/repo/path')
          dataId = {'tract': 9615, 'patch': '4,4'}
          catalog = butler.get('deepCoadd_obj', dataId=dataId)  # A lazy ParquetTable (a ParquetFile wrapper); no data read yet

          columnDict = {'dataset': 'meas',
                        'filter': ['HSC-G', 'HSC-R'],
                        'column': ['coord_ra', 'coord_dec']}
          df = catalog.to_df(columns=columnDict)  # This loads data for the first time

          print(len(df))
          print(df.head())
          
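          For context, the essence of such a lazy wrapper fits in a few lines of pyarrow (an illustrative sketch only; the real ParquetTable in qa_explorer also understands the multi-level column dict used above):

              import pyarrow.parquet as pq


              class LazyParquetTable:
                  """Illustrative sketch of a lazy Parquet wrapper."""

                  def __init__(self, filename):
                      # Opening the file reads only the footer metadata, no column data.
                      self._pf = pq.ParquetFile(filename)

                  @property
                  def columns(self):
                      return self._pf.schema.names

                  def to_df(self, columns=None):
                      # Column data hit disk only here, and only the named columns.
                      return self._pf.read(columns=columns).to_pandas()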

          tmorton Tim Morton [X] (Inactive) added a comment

          I ended up sticking with the dual-purpose ParquetTable object for now.  I had considered having the butler read/write raw DataFrames instead (with get taking a columns kwarg), with a separate dataset for the lazy loader, but decided against it for now, just for simplicity; the two approaches are contrasted below.  Kian-Tat Lim I have implemented ParquetTable within qa_explorer for now, but I expect it should eventually move to afw.table?  I figure that I should at least go through the exercise of implementing DM-13877 (which will use this machinery) before we do that, to work out some kinks.
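
          For contrast, the two designs side by side (the columns keyword on butler.get in the second snippet is the hypothetical rejected alternative, not an implemented API):

              # Implemented: get is cheap (metadata only); I/O is deferred to to_df
              catalog = butler.get('deepCoadd_obj', dataId=dataId)
              df = catalog.to_df(columns=columnDict)

              # Considered and rejected for now: eager read with a columns kwarg on get
              # (hypothetical API, shown only for comparison)
              # df = butler.get('deepCoadd_obj', dataId=dataId, columns=columnDict)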

          ktl Kian-Tat Lim added a comment

          I would not want ParquetTable to move to afw if it can be avoided. Things that are for serialization only should be outside the core package to avoid dependency bloat.

          ktl Kian-Tat Lim added a comment

          While the overall strategy and organization seem OK, there are a fair number of cleanups suggested in the PR.  Please address them before merging.


            People

            Assignee:
            tmorton Tim Morton [X] (Inactive)
            Reporter:
            tmorton Tim Morton [X] (Inactive)
            Reviewers:
            Kian-Tat Lim
            Watchers:
            Jim Bosch, John Swinbank, Kian-Tat Lim, Tim Morton [X] (Inactive), Yusra AlSayyad
            Votes: 0

              Dates

              Created:
              Updated:
              Resolved:

                Jenkins

                No builds found.