  1. Data Management
  2. DM-30103

SQL types containing binary data are not correctly processed by the Qserv Ingest system

    Details

    • Type: Bug
    • Status: In Progress
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: Qserv
    • Labels:
      None

      Description

      The bug

      The bug was noticed while working on DM-29994.

      The current implementation of the Qserv Ingest system does not support ingesting data of the binary SQL types. These include the following MySQL types:

      BINARY
      VARBINARY
      TINYBLOB
      BLOB
      MEDIUMBLOB
      LONGBLOB
      

      The problem occurs in three classes responsible for reading data files:

      Class                                     Method
      lsst::qserv::replica::IngestClient        send
      lsst::qserv::replica::IngestHttpSvcMod    _readLocal
      lsst::qserv::replica::HttpFileReader      _store

      In all three cases, the code does not recognize an escape character preceding the EOL (end-of-line) character '\n' when the latter occurs within binary data. As a result, the first EOL character encountered is treated as the line terminator, even if it is escaped.

      For example, consider the following schema:

      INT NOT NULL,
      INT NOT NULL,
      INT NOT NULL,
      BINARY(10)
      

      In this case, the current code interprets the input row shown below as two separate rows (assuming the column terminator is a comma and the escape character is a backslash):

      1,2,3,12%^&**\
      &^TH^
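
      The difference between the two interpretations can be illustrated with a minimal Python sketch. It is not the Qserv code (which is C++); the function name split_rows and the single-byte EOL and escape characters are assumptions made for illustration only.

```python
def split_rows(data: bytes, eol: bytes = b"\n", escape: bytes = b"\\") -> list[bytes]:
    """Split data into rows at unescaped EOL bytes.

    Hypothetical helper: assumes single-byte EOL and escape characters.
    """
    rows = []
    start = 0
    i = 0
    while i < len(data):
        ch = data[i:i + 1]
        if ch == escape:
            i += 2  # skip the escape character and the byte it protects
            continue
        if ch == eol:
            rows.append(data[start:i])
            start = i + 1
        i += 1
    if start < len(data):
        rows.append(data[start:])  # trailing row without a final EOL
    return rows


# The example row from this ticket: the EOL inside the BINARY(10) value is escaped.
payload = b"1,2,3,12%^&**\\\n&^TH^\n"

naive_rows = payload.split(b"\n")[:-1]  # splitting at every EOL yields two rows
aware_rows = split_rows(payload)        # honoring the escape yields one row
```

      Here naive_rows has two elements, while aware_rows has a single element that still contains the escaped EOL.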
      

      The proposed solution

      • Extend the table loading interfaces (the uploader application qserv-replica-file and the worker ingest REST service /ingest/file) with an option for specifying the desired escape character.
      • Use the escape character when reading and preprocessing input files in the above-mentioned classes.
      • Pass the escape character to MySQL when ingesting the preprocessed files using LOAD DATA INFILE.
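
      As an illustration of the last item, the sketch below builds a LOAD DATA INFILE statement that carries the escape character in its ESCAPED BY clause. The clause names are standard MySQL syntax; the helper names sql_quote and load_data_statement, the file path, and the table name are hypothetical, and the quoting is deliberately minimal.

```python
def sql_quote(value: str) -> str:
    """Render a string as a MySQL single-quoted literal (minimal escaping only)."""
    replacements = {"\\": "\\\\", "'": "\\'", "\n": "\\n", "\t": "\\t"}
    return "'" + "".join(replacements.get(c, c) for c in value) + "'"


def load_data_statement(path: str, table: str,
                        fields_terminated_by: str = ",",
                        escaped_by: str = "\\",
                        lines_terminated_by: str = "\n") -> str:
    """Build a LOAD DATA INFILE statement (hypothetical helper).

    The table name is assumed to contain no backticks.
    """
    return (
        f"LOAD DATA INFILE {sql_quote(path)} INTO TABLE `{table}` "
        f"FIELDS TERMINATED BY {sql_quote(fields_terminated_by)} "
        f"ESCAPED BY {sql_quote(escaped_by)} "
        f"LINES TERMINATED BY {sql_quote(lines_terminated_by)}"
    )


# Hypothetical path and table name, used only for illustration.
stmt = load_data_statement("/tmp/chunk_0.csv", "Object_0")
print(stmt)
```

      The escape character used when preprocessing the file and the one given in ESCAPED BY need to match; otherwise MySQL would split the rows differently from the reader.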

      Other notes

      • To avoid code duplication, introduce a utility class shared by the implementations of the methods IngestClient::send and IngestHttpSvcMod::_readLocal.
      • Have a look at how this problem is addressed (if it is addressed) in the Git package https://github.com/lsst/partition
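
      One possible shape for such a shared utility is sketched below in Python, purely as an illustration (the real classes are C++). It assumes that the readers receive their input in arbitrary chunks, so the escape state has to survive chunk boundaries. The class name EscapedRowSplitter and its methods feed and flush are hypothetical.

```python
class EscapedRowSplitter:
    """Incrementally split a byte stream into rows at unescaped EOL bytes.

    Hypothetical utility: assumes single-byte EOL and escape characters.
    Rows are returned verbatim (escape characters and EOL included) so they
    can be handed to a loader configured with the same escape character.
    """

    def __init__(self, eol: bytes = b"\n", escape: bytes = b"\\"):
        self._eol = eol[0]
        self._escape = escape[0]
        self._pending = bytearray()  # bytes of the row seen so far
        self._escaped = False        # True if the previous byte was an escape

    def feed(self, chunk: bytes) -> list[bytes]:
        """Consume a chunk and return the rows completed by it."""
        rows = []
        for byte in chunk:
            self._pending.append(byte)
            if self._escaped:
                self._escaped = False  # protected byte, never a terminator
            elif byte == self._escape:
                self._escaped = True
            elif byte == self._eol:
                rows.append(bytes(self._pending))
                self._pending.clear()
        return rows

    def flush(self) -> bytes:
        """Return any unterminated remainder and reset the state."""
        rest = bytes(self._pending)
        self._pending.clear()
        self._escaped = False
        return rest


splitter = EscapedRowSplitter()
rows = splitter.feed(b"1,2,3,12%^&**\\")        # chunk ends right after the escape
rows += splitter.feed(b"\n&^TH^\n4,5,6,abc\n")  # escaped EOL arrives in the next chunk
remainder = splitter.flush()
```

      In this usage, rows contains two complete rows and remainder is empty, even though the escape character and the EOL it protects arrive in different chunks.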


              People

              Assignee:
              gapon Igor Gaponenko
              Reporter:
              gapon Igor Gaponenko
              Watchers:
              Fabrice Jammes, Fritz Mueller, Hsin-Fang Chiang, Igor Gaponenko, Nate Pease
