Data Management / DM-2385

Implement data loading in worker manager service


    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: Qserv
    • Labels:

      Description

      This is a separate ticket for the implementation of the data loading part of the worker management service (started in DM-2176). Some ideas and thoughts are outlined in that ticket.

            Activity

            Andy Salnikov added a comment -

            Copy of the issues mentioned in DM-2176:

            There are a few issues related to data loading via HTTP:

            • data volume loaded in each request is potentially large
            • data is currently (and will likely remain) formatted according to the representation accepted by LOAD DATA INFILE ...; together with the data we need to pass some extra info (line terminators, column separators, etc.)
            • with large data volumes there is a chance that the connection is broken while data is being transferred from client to worker; this should result in a clean rollback, as if no data were loaded at all, so that clients can retry and succeed
              • one simple way to achieve that is to store the uploaded data in a temporary file and then load it using LOAD DATA INFILE ...
            • we probably want to avoid unnecessary encoding/decoding of the data if at all possible (except for encryption, which will be required)
              • if we only needed to send file data we could use the application/octet-stream content type to transfer data in raw format
              • but since we also need to pass extra info we'll likely need to use multipart/form-data for the request (application/octet-stream for the file data and some other content type for the other parameters); a client-side sketch is shown after this list
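
            A minimal client-side sketch of the multipart upload idea, assuming the Python requests library; the endpoint URL, database/table names, and form field names are illustrative, not the actual wmgr API:

            # hypothetical wmgr client call; URL and field names are made up
            import requests

            with open("DeepForcedSource.csv", "rb") as data:
                resp = requests.post(
                    "http://worker.example.org:5012/dbs/testdb/tables/DeepForcedSource/data",
                    # the file part travels raw, as application/octet-stream
                    files={"table-data": ("DeepForcedSource.csv", data,
                                          "application/octet-stream")},
                    # extra LOAD DATA options travel as ordinary form fields
                    data={"delimiter": ",", "terminator": "\\n"},
                )
            resp.raise_for_status()
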
            Andy Salnikov added a comment -

            Initial implementation of the data loading is based on temporary files:

            • file data is sent to wmgr over the network via HTTP without encoding (things will get worse with SSL)
            • wmgr writes the data to a temporary file
            • wmgr calls mysql with command LOAD DATA INFILE or LOAD DATA LOCAL INFILE
              • LOCAL is probably slightly less efficient, but it does not need the global FILE grant and the temporary file location can be anywhere
              • mysql docs seem to imply that with LOCAL the mysql server creates a temporary file on the server side, but I don't think this is really what happens

            Creating a temporary file is not the most efficient approach; the second version of the process uses a named pipe instead, which avoids the temporary file altogether (see the sketch below).
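
            A minimal sketch of the named pipe approach, assuming a mysql command line client with local_infile enabled; the database name, table name, and input path are illustrative:

            import os
            import shutil
            import subprocess
            import tempfile

            fifo_dir = tempfile.mkdtemp()
            fifo = os.path.join(fifo_dir, "data.fifo")
            os.mkfifo(fifo)
            try:
                # start the reader first: opening a FIFO for writing blocks
                # until some process opens it for reading
                loader = subprocess.Popen(
                    ["mysql", "--local-infile=1", "testdb", "-e",
                     "LOAD DATA LOCAL INFILE '%s' INTO TABLE DeepForcedSource"
                     " FIELDS TERMINATED BY ','" % fifo])
                # for compressed uploads the source could be a gzip.GzipFile
                # wrapped around the request stream instead
                with open("/tmp/DeepForcedSource.csv", "rb") as src, \
                        open(fifo, "wb") as sink:
                    shutil.copyfileobj(src, sink)
                if loader.wait() != 0:
                    raise RuntimeError("LOAD DATA failed")
            finally:
                shutil.rmtree(fifo_dir, ignore_errors=True)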

            Here are a few numbers obtained when loading a large file using different methods. The file has 123753 records; its size is 62MB uncompressed or 22MB compressed (wmgr supports loading of compressed files):

            # size in kBytes
            21616 datasets/case04/data/DeepForcedSource.csv.gz
            63288 /tmp/DeepForcedSource.csv

            All measurements were done on the same host, which ran the mysql server, wmgr server, wmgr client, and mysql client:

            Method                                     Time, sec
            mysql LOAD INFILE LOCAL                          2.3
            wmgr uncompressed data via temporary file        3.9
            wmgr compressed data via temporary file          4.0
            wmgr uncompressed data via named pipe            3.2
            wmgr compressed data via named pipe              3.2

            There is no significant difference between loading compressed and uncompressed data. Piping data is certainly more efficient than making a temporary file for it.

            Werkzeug (which is responsible for request parsing in Flask) itself creates temporary files (using tempfile.TemporaryFile) for uploaded data when the file size exceeds 500 kBytes (see werkzeug.wrappers.BaseRequest._get_file_stream()). One could monkey-patch the BaseRequest._get_file_stream() method, or change the BaseRequest.form_data_parser_class member to a replacement form parser class which instantiates a different stream factory; a sketch of the first option follows.
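
            A hedged sketch of the monkey-patch option, assuming the Werkzeug 0.x BaseRequest._get_file_stream() signature and a 500 MB in-memory threshold; not the committed wmgr code:

            from io import BytesIO
            from tempfile import TemporaryFile

            from werkzeug.wrappers import BaseRequest

            _MEMORY_LIMIT = 500 * 1024 * 1024  # most bytes kept in memory

            def _get_file_stream(self, total_content_length, content_type,
                                 filename=None, content_length=None):
                # keep uploads below the limit in memory, spill the rest to disk
                if (total_content_length is not None
                        and total_content_length <= _MEMORY_LIMIT):
                    return BytesIO()
                return TemporaryFile("wb+")

            BaseRequest._get_file_stream = _get_file_stream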

            Andy Salnikov added a comment - edited

            With a quick implementation of the in-memory limit of up to 500 MBytes, the time to load dropped to 2.9 sec. Still slower than direct mysql, but this kind of overhead is reasonable to expect. The table above, extended with the new number:

            Method                                                 Time, sec
            mysql LOAD INFILE LOCAL                                      2.3
            wmgr uncompressed data via temporary file                    3.9
            wmgr compressed data via temporary file                      4.0
            wmgr uncompressed data via named pipe                        3.2
            wmgr compressed data via named pipe                          3.2
            wmgr compressed data, in-memory streaming, named pipe        2.9

            At this point I'm not sure that it's worth using large memory buffers instead of temporary files, so I won't commit that monkey patch.

            Andy Salnikov added a comment -

            Serge, can you look at the PR? It's not too much code, and maybe you can suggest ways to improve it too.

            Thanks,
            Andy

            Serge Monkewitz added a comment -

            Overall, looks very nice! I posted a few trivial comments in the pull request.

            For the record, Andy and I discussed a variation on his named pipe idea on HipChat:

            So here's my initial reaction. Not sure if it makes sense or if it's just crazy talk...
            Can we create the named_pipe, launch a mysql client process that loads from it (or a chain of named pipes for decompression purposes), and make werkzeug write to the named pipe instead?

            Andy:

            It may be possible, but will likely need some monkey patching for werkzeug.

            I'd like to get some numbers from actual data loading before trying to optimize it.

            Even if it's possible, I'm not sure if it's a win (pipe buffers are pretty small, 4k or 32k IIRC, so there may be overhead in lots of context switches from the pipe(s) constantly being emptied/filled).
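
            A hypothetical sketch of that variation (not what was implemented): the stream factory hands back a write handle to a FIFO that a mysql client process is already reading from, so the upload is piped straight into LOAD DATA. All names are illustrative:

            import os
            import subprocess
            import tempfile

            def make_pipe_stream(database, table):
                """Start a LOAD DATA reader on a fresh FIFO and return a
                writable file object connected to it."""
                fifo = os.path.join(tempfile.mkdtemp(), "upload.fifo")
                os.mkfifo(fifo)
                # the reader must be running before the FIFO is opened for
                # writing, otherwise open() blocks indefinitely
                subprocess.Popen(
                    ["mysql", "--local-infile=1", database, "-e",
                     "LOAD DATA LOCAL INFILE '%s' INTO TABLE %s" % (fifo, table)])
                return open(fifo, "wb")

            Every write into that stream goes through the small pipe buffer, which is where the context-switch concern above would show up.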

            Andy Salnikov added a comment -

            I added a few context managers to the mix; it indeed makes things cleaner. Merged and pushed.
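
            A sketch of the kind of context manager that helps here (illustrative, not the actual wmgr code): it owns the FIFO's lifetime so callers cannot leak it.

            import os
            import shutil
            import tempfile
            from contextlib import contextmanager

            @contextmanager
            def named_pipe(name="data.fifo"):
                """Create a FIFO in a private temporary directory and remove
                everything on exit, even if the load fails."""
                tmpdir = tempfile.mkdtemp()
                path = os.path.join(tmpdir, name)
                os.mkfifo(path)
                try:
                    yield path
                finally:
                    shutil.rmtree(tmpdir, ignore_errors=True)

            # usage: the pipe disappears as soon as the block is left
            # with named_pipe() as fifo:
            #     ...start the loader, then write data to fifo...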


              People

              Assignee:
              Andy Salnikov
              Reporter:
              Andy Salnikov
              Reviewers:
              Serge Monkewitz
              Watchers:
              Andy Salnikov, Serge Monkewitz
