Data Management / DM-13770

Write a stand-alone task to convert/consolidate source tables to parquet files for QA


    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Story Points: 6
    • Sprint: DRP S18-4, DRP S18-5
    • Team: Data Release Production

      Description

      Currently, pipe_analysis scripts have an optional writeParquetOnly flag to skip making plots and only write the parquet tables. It might be desirable to factor that functionality out into a separate task. One thing that might be interesting to test as part of this work is how performance compares if catalogs are written only at the patch level instead of consolidating all requested patches into a single full-tract catalog.

            Activity

            Tim Morton [X] (Inactive) added a comment - edited

            I have now written a writeObjectTableTask in the qa_explorer package, which is now eups-installable (DM-13793). This includes defining a new deepCoadd_obj dataset, which is read from and written to parquet via the lsst.qa.explorer.table.ParquetTable object; that object defines writeFits and readFits methods, enabling it to masquerade as a FitsCatalogStorage storage type.
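            To make the masquerading a bit more concrete, here is a minimal sketch of the idea (the class below is illustrative only, not the actual qa_explorer implementation): the "FITS" methods really perform parquet I/O with pyarrow, which is what lets the butler treat the dataset as a FitsCatalogStorage type.

            import pyarrow as pa
            import pyarrow.parquet as pq


            class ParquetTableSketch:
                """Illustrative stand-in for a parquet-backed table object."""

                def __init__(self, df):
                    self.df = df  # a pandas DataFrame

                def writeFits(self, path):
                    # Despite the name, write the DataFrame to a parquet file via pyarrow.
                    pq.write_table(pa.Table.from_pandas(self.df), path)

                @classmethod
                def readFits(cls, path):
                    # Despite the name, read a parquet file back into a pandas DataFrame.
                    return cls(pq.read_table(path).to_pandas())
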

            Running the task looks like this:

            writeObjectTable.py /datasets/hsc/repo/rerun/RC/w_2018_10/DM-13647/ --output /project/tmorton/DM-13770 --id tract=9615 filter=HSC-G^HSC-R^HSC-I^HSC-Z^HSC-Y --no-versions -j 24
            

            This writes a single parquet file per patch that is a merge of the deepCoadd_meas, deepCoadd_forced_src and deepCoadd_ref tables for all requested bands (containing ~10k columns).
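            For illustration only, the multi-level column structure of such a merge can be reproduced in plain pandas roughly as follows (toy data and column names, not the actual task internals):

            import pandas as pd

            # Toy per-(dataset, band) catalogs; in reality each shares the same object
            # rows and carries thousands of columns.
            meas_r = pd.DataFrame({"psfFlux": [1.0, 2.0]})
            meas_i = pd.DataFrame({"psfFlux": [3.0, 4.0]})
            forced_r = pd.DataFrame({"psfFlux": [1.1, 2.1]})

            # Concatenating along columns with (dataset, band) keys yields the
            # multi-level column index that the per-patch tables use.
            merged = pd.concat(
                [meas_r, meas_i, forced_r],
                axis=1,
                keys=[("meas", "HSC-R"), ("meas", "HSC-I"), ("forced_src", "HSC-R")],
            )
            print(merged["meas"]["HSC-R"].head())
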

            The tables can currently be loaded as follows:

            from lsst.daf.persistence import Butler
            butler = Butler('/project/tmorton/DM-13770')
            dataId = {'tract':9615, 'patch':'4,4'}
            df = butler.get('deepCoadd_obj', dataId=dataId)
            

            Here, df is now a pandas dataframe with a multi-level index (check out what it looks like with df.head()). If you want a particular subset of this table (e.g., the meas catalog in HSC-R), you can access it with df['meas']['HSC-R'], which returns just that portion of the table. Note that you cannot currently load selected columns (the main motivation for using the parquet format); this will be addressed in future work.
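            As a stopgap outside the butler, a column subset can be read straight from one of the parquet files with pyarrow; the path below is a placeholder, and with multi-level columns the on-disk names are the flattened string forms.

            import pyarrow.parquet as pq

            # "patch_4_4.parq" is a placeholder path for one of the per-patch files.
            schema = pq.read_schema("patch_4_4.parq")
            print(schema.names[:5])  # the flattened column names as stored on disk

            # Read only the columns of interest instead of all ~10k, then convert to pandas.
            df_subset = pq.read_table("patch_4_4.parq", columns=schema.names[:2]).to_pandas()
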

            I note that while, in principle, one should be able to load all patches into a single dask dataframe using dask.dataframe.read_parquet, this doesn't currently work, presumably because of the multi-index (question asked at https://github.com/dask/dask/issues/1493).
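            A generic workaround (untested against these particular files) would be to flatten the column MultiIndex to plain strings before writing, at the cost of losing the convenient df['meas']['HSC-R'] access pattern:

            import pandas as pd

            # Toy frame with multi-level columns, standing in for one patch table.
            df = pd.DataFrame({("meas", "HSC-R", "psfFlux"): [1.0, 2.0],
                               ("meas", "HSC-I", "psfFlux"): [3.0, 4.0]})

            # Flatten the column MultiIndex to plain strings before writing, so that
            # readers that choke on multi-level columns see ordinary column names.
            flat = df.copy()
            flat.columns = ["_".join(levels) for levels in flat.columns]
            print(flat.columns.tolist())  # ['meas_HSC-R_psfFlux', 'meas_HSC-I_psfFlux']
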

            I will also note that before I followed the suggestion on RFC-465 to use the pyarrow engine exclusively to write/read the parquet files, I discovered that fastparquet was very slow at initializing ParquetFile objects on these large tables, and I submitted a fix to the fastparquet project (https://github.com/dask/fastparquet/pull/318), which decreased the metadata-read time for the ~10,000-column file from about 12 s to 0.3 s.
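            For reference, pinning the engine explicitly in pandas looks like this (toy frame and placeholder filename; a sketch of the RFC-465 recommendation rather than pipeline code):

            import pandas as pd
            import pyarrow as pa
            import pyarrow.parquet as pq

            df = pd.DataFrame({"a": [1, 2, 3]})

            # Write and read through pyarrow explicitly, rather than letting pandas fall
            # back to whichever engine (pyarrow or fastparquet) happens to be installed.
            pq.write_table(pa.Table.from_pandas(df), "example.parq")
            df2 = pd.read_parquet("example.parq", engine="pyarrow")
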

            John Parejko added a comment - edited

            See comments on the qa_explorer PR.

            I missed the changes to obs_base because there wasn't a PR. I'm not sure I'm comfortable reviewing changes to datasets.yaml, but deepCoadd_obj definitely needs a thorough description: field (see Yusra AlSayyad's work on DM-13765) to clarify exactly what you're doing and why the template doesn't look like a FITS file. I'm not sure who the most appropriate reviewer for the obs_base change would be though.

            It does suggest that if we wanted to play with Parquet for other objects (say, full-visit catalogs), we'd have to make similar "fake FITS" datasets for them too.

            Your qa_explorer PR refers to deepCoadd_object, but I assume you meant deepCoadd_obj? The former name seems better to me; why abbreviate it like that?

            Tim Morton [X] (Inactive) added a comment

            Thanks for all the comments and suggestions. I'll work through them today.

            I figured the "obj" abbreviation was OK, since I don't think any other English word begins with "obj" without continuing through "ect". I could be persuaded, though.

            I'll make the PR to obs_base soon. Jim Bosch suggested the read/writeFits hack as a temporary workaround. I emphasize temporary, since a better solution is an actual ParquetStorage object, which I'm starting to work on in DM-13876.

            Tim Morton [X] (Inactive) added a comment

            John Parejko, I finally got around to addressing your comments; I've pushed to the PRs. Let me know if this looks OK to you. I didn't write the roundtrip test for ParquetTable you suggested because I'm moving on to DM-13876, which will be junking that object anyway.

            John Parejko added a comment

            It looks like there are a number of comments I made that you didn't incorporate or respond to. At least, I didn't see them in the 4 new commits. It can help if you emoji-tag the comments when you've dealt with them.

            I'm fine with not writing the roundtrip test, since you're replacing that object immediately anyway.

            Please rebase to squash/flatten your work so far: there are a lot of commits there.

            John Parejko added a comment

            It looks like squashing it all together wasn't unreasonable.

            I have one remaining dependency question on DM-13793, but otherwise I think this ticket is ok to merge now.


              People

              Assignee:
              Tim Morton [X] (Inactive)
              Reporter:
              Tim Morton [X] (Inactive)
              Reviewers:
              John Parejko
              Watchers:
              Hsin-Fang Chiang, Jim Bosch, John Parejko, John Swinbank, Tim Morton [X] (Inactive), Yusra AlSayyad
              Votes:
              0

