  1. Data Management
  2. DM-24779

Investigate alternatives for the Parquet Sink connector

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      One of the output formats required for the interoperability of the EFD and the LSP is Parquet. Parquet files have proven to be an efficient columnar data source for large datasets and also work with Dask distributed processing.

      The Parquet Sink connector must also meet the following requirements.

      • It must use Kafka Connect, like the other EFD connectors.
      • It must be open source. If it is provided by Confluent, it must be under the Confluent Community License, since we are not paying for the Confluent Enterprise License.
      • It must support time-based data partitioning, because we want to split the EFD data by observing night.
      • It must be able to read from Avro.
      • It should write to a POSIX file system or to an object store.
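
      As an illustration of the partitioning requirement, here is a minimal sketch of how a record timestamp could be mapped to an observing-night partition. It assumes the observing night rolls over at 12:00 UTC, which the ticket does not state. The function names and the `night=` directory layout are hypothetical and are not part of any connector.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the observing night rolls over at 12:00 UTC, so subtracting
# 12 hours maps every timestamp to the calendar date of its night.
NIGHT_OFFSET = timedelta(hours=12)


def observing_night(timestamp: datetime) -> str:
    """Return the observing night (YYYY-MM-DD) for a timezone-aware timestamp."""
    if timestamp.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    shifted = timestamp.astimezone(timezone.utc) - NIGHT_OFFSET
    return shifted.strftime("%Y-%m-%d")


def partition_path(topic: str, timestamp: datetime) -> str:
    """Build a hypothetical partition directory for one record."""
    return f"{topic}/night={observing_night(timestamp)}"
```

      With this rule, a record stamped 03:00 UTC on 2020-05-02 falls in the `night=2020-05-01` partition, so a whole night of data stays in one directory.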

       

            Activity

            Angelo Fausti (afausti) added a comment (edited)

            Three Confluent connectors write to Parquet format under the Confluent Community License: the HDFS 2.0 Sink connector, the SFTP Sink connector, and the Amazon S3 Sink connector.

            After reviewing the docs and discussing this internally with the team, we plan to test the Amazon S3 Sink connector. It can write to Parquet format and has options for time-based partitioning.
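
            A sketch of what such a test configuration might look like, written as a Python helper that builds the JSON payload for the Kafka Connect REST API. The property names are taken from the Confluent S3 Sink connector and storage partitioner documentation. The helper name, its arguments, the daily `path.format`, and the `flush.size` value are illustrative assumptions, not the configuration the team actually deployed.

```python
import json


def s3_parquet_sink_config(name, topics, bucket, region, schema_registry_url):
    """Build a hypothetical Kafka Connect payload for an S3 Parquet sink."""
    config = {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": ",".join(topics),
        "s3.bucket.name": bucket,
        "s3.region": region,
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        # Read Avro records from Kafka and write them as Parquet files.
        "value.converter": "io.confluent.connect.avro.AvroConverter",
        "value.converter.schema.registry.url": schema_registry_url,
        "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
        # Time-based partitioning: one partition per day.
        "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
        "partition.duration.ms": str(24 * 60 * 60 * 1000),
        "path.format": "'year'=YYYY/'month'=MM/'day'=dd",
        "locale": "en-US",
        # Assumption: partitions break at UTC midnight, not at the
        # observing-night boundary.
        "timezone": "UTC",
        "timestamp.extractor": "Record",
        # Assumption: arbitrary number of records per output file.
        "flush.size": "1000",
    }
    return {"name": name, "config": config}


if __name__ == "__main__":
    payload = s3_parquet_sink_config(
        name="parquet-sink-test",              # hypothetical connector name
        topics=["example.topic"],              # hypothetical topic
        bucket="example-bucket",               # hypothetical bucket
        region="us-east-1",
        schema_registry_url="http://localhost:8081",  # hypothetical URL
    )
    print(json.dumps(payload, indent=2))
```

            The printed JSON could then be posted to the `/connectors` endpoint of a Kafka Connect worker. The `timezone` setting is what decides where one day's partition ends and the next begins.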


              People

              • Assignee:
                afausti Angelo Fausti
                Reporter:
                afausti Angelo Fausti
                Watchers:
                Angelo Fausti
