Request For Comments: RFC-662

Adopt Parquet as user facing columnar data format for LSST catalog data

    Details

    • Type: RFC
    • Status: Adopted
    • Resolution: Unresolved
    • Component/s: DM
    • Labels:

      Description

      Parquet files are in use internally within DM for a variety of purposes and have already been included in the latest sizing model (DMTN-135).

      The adoption of a columnar data format such as Parquet as a user-facing data format, particularly to support next-to-data analysis and to provide MapReduce-style access to catalog data, has been discussed at several DMLT and SST meetings and at several user-facing meetings, e.g. the LSP FDR. The idea has received broad acceptance within DM but is not stated in any DM policy, design, or architecture document. The community is now asking us what our plans are.

      This RFC proposes that DM support a columnar format for catalog data (Parquet at this time) in addition to ADQL (Qserv), and that DM documents be updated to reflect this. Documents that would need updating include:

      • DPDD - should contain a short section explaining to the user community that catalog data will be available both via ADQL (Qserv) and in columnar format (Parquet)
      • LDM-148 - include in the DM architecture 
      • ... others?

            Activity

            gpdf Gregory Dubois-Felsmann added a comment - - edited

            Preserving some discussion about "next actions" for future reference: https://lsstc.slack.com/archives/C2K6YMTK2/p1583539429112300

            ctslater Colin Slater added a comment -

            After some discussions with Fritz Mueller about how to structure the implementation of this RFC, I propose two actions:

            • One implementation ticket for updating the DPDD, stating that columnar files will be made available.
            • Another implementation ticket to create a new LDM document for user-facing parquet data products. This would give a bit more room to cover topics such as which specific tables are made available, format and partitioning details, data type conversion conventions between parquet and RDBMS tables, validation, and metadata. A first version of this would be relatively short, and could later expand as necessary. Making this a separate document would be preferable to overloading LDM-135, which is focused on qserv design.

            One potential gap is that these documents would not specify a service to provide the files to users in the LSP. I am less certain of both the right technical implementation and the appropriate policy mechanism to address that. The use case is broadly similar to other file-based access in the LSP, though, so addressing the two tickets listed above seems like it would handle the most pressing topics.

             

            gpdf Gregory Dubois-Felsmann added a comment -

            I think we should cover the file-based data access aspect pretty soon.  In fact it's really not all that well defined; we are still climbing a hill of trying to get a coherent IVOA-ish file-based data (i.e., image) service architecture defined, and it makes sense to see if we can ensure that other file-based data is also readily handled.  (I don't think it needs to be all that difficult.)

            We should cover:

            • Access (if any) to the Parquet "tiles" via BG3 in the Notebook Aspect
            • Access to the Parquet tiles via IVOA or other network services in the API Aspect
            • Desirements for how to handle these data in a restarted Portal Aspect effort
            lguy Leanne Guy added a comment -

            The policy outlined in this RFC is accepted; the two suggested implementation tickets will be created.

            gpdf Gregory Dubois-Felsmann added a comment -

            We also need to think about what LSE-61 (DMSR) and LDM-554 (LSP) requirements might be affected by this change. See RFC-704 for a relevant discussion.


              People

              • Assignee: Leanne Guy (lguy)
              • Reporter: Leanne Guy (lguy)
              • Watchers: Colin Slater, Fritz Mueller, Frossie Economou, Gregory Dubois-Felsmann, Hsin-Fang Chiang, John Parejko, John Swinbank, Kian-Tat Lim, Leanne Guy, Tim Jenness