
#### Details

• Type: Story
• Status: Done
• Resolution: Done
• Fix Version/s: None
• Component/s:
• Labels:
• Story Points: 6
• Sprint: DB_S15_04
• Team: Data Access and Database

#### Description

This is a separate ticket for implementation of the data loading part of the worker management service (started in DM-2176). Some ideas and thoughts are outlined in that ticket.

#### Activity

Andy Salnikov added a comment -

Copy of the issues mentioned in DM-2176. There are a few issues related to data loading via HTTP:

• data volume loaded in each request is potentially large
• data is currently (and likely in the future) formatted according to the representation accepted by LOAD DATA INFILE ...; together with the data we need to pass some extra info (line terminators, column separators, etc.)
• with large data volumes there is a chance that the connection could be broken while data is transferred from client to worker; this should result in a clean rollback, as if no data were loaded at all, so that clients can retry and succeed
• one simple way to achieve that is to store the loaded data in a temporary file and then load it using LOAD DATA INFILE ...
• we probably want to avoid unnecessary encoding/decoding of the data if possible at all (except for encryption, which will be required)
• if we only need to send file data then we could use the application/octet-stream content type to transfer data in raw format
• but since we also need to pass extra info we'll likely need to use multipart/form-data for the request (combined with application/octet-stream for the file data and some other content type for the other parameters)
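The multipart/form-data idea can be sketched by building such a request body by hand. The field names below are hypothetical, chosen only to illustrate the shape of the request, not the actual wmgr API:

```python
import uuid

# Build a multipart/form-data body with one application/octet-stream part
# for the raw table data plus plain-text parts for the extra LOAD DATA
# options.  Field names are hypothetical, not the real wmgr parameters.
boundary = uuid.uuid4().hex

def part(name, value, content_type='text/plain', filename=None):
    disposition = f'form-data; name="{name}"'
    if filename:
        disposition += f'; filename="{filename}"'
    return (f'--{boundary}\r\n'
            f'Content-Disposition: {disposition}\r\n'
            f'Content-Type: {content_type}\r\n\r\n').encode() + value + b'\r\n'

body = b''.join([
    part('table-data', b'1,foo\n2,bar\n',
         content_type='application/octet-stream', filename='data.csv'),
    part('fields-terminated-by', b','),
    part('lines-terminated-by', b'\\n'),
]) + f'--{boundary}--\r\n'.encode()

headers = {'Content-Type': f'multipart/form-data; boundary={boundary}'}
```

The file part travels as raw bytes, so the data itself needs no extra encoding.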
Andy Salnikov added a comment -

• file data is sent to wmgr over the network via HTTP without encoding (things will get worse with SSL)
• wmgr writes the data to a temporary file
• wmgr calls mysql with the LOAD DATA INFILE or LOAD DATA LOCAL INFILE command
• LOCAL is probably slightly less efficient, but it does not need the global FILE grant and the temporary file location can be anywhere
• mysql docs seem to imply that with LOCAL the mysql server creates a temporary file on the server side, but I don't think this is really what happens

Creating a temporary file is not the most efficient approach; the second version of the process used a named pipe instead, which avoids the temporary file altogether.
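A minimal sketch of the named-pipe mechanics, under the assumption of a POSIX host. A reader thread stands in for the mysql client so the example is self-contained; in wmgr the reader would be mysql executing LOAD DATA INFILE against the pipe path:

```python
import os
import tempfile
import threading

# Stream data through a named pipe instead of a temporary file.
pipe_dir = tempfile.mkdtemp()
pipe_path = os.path.join(pipe_dir, 'load_data.fifo')
os.mkfifo(pipe_path)

received = []

def consumer():
    # In wmgr this side would be mysql reading the pipe via
    # LOAD DATA INFILE '<pipe_path>' ...
    with open(pipe_path, 'rb') as pipe:
        received.append(pipe.read())

reader = threading.Thread(target=consumer)
reader.start()

# open() for writing blocks until the reader has opened the pipe, so
# no data is lost; request data is copied straight into the pipe.
with open(pipe_path, 'wb') as pipe:
    pipe.write(b'1,foo\n2,bar\n')

reader.join()
os.unlink(pipe_path)
os.rmdir(pipe_dir)
```

The data never touches the filesystem; it only passes through the kernel pipe buffer.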

Here are a few numbers obtained when loading a large file using different methods. The file has 123753 records; its size is 62MB uncompressed or 22MB compressed (wmgr supports loading of compressed files):

 # size in kBytes
 21616 datasets/case04/data/DeepForcedSource.csv.gz
 63288 /tmp/DeepForcedSource.csv

All measurements are done on the same host (mysql server, wmgr server, wmgr client, and mysql client all running together):

Method                                       Time, sec
wmgr uncompressed data via temporary file    3.9
wmgr compressed data via temporary file      4.0
wmgr uncompressed data via named pipe        3.2
wmgr compressed data via named pipe          3.2

There is no significant difference between loading compressed and uncompressed data. Piping data is certainly more efficient than writing it to a temporary file.
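The compressed case can also avoid temporary files: gzip data can be decompressed incrementally as it arrives. A sketch, where each decompressed chunk would be written straight into the pipe (the wbits value 16 + MAX_WBITS tells zlib to expect gzip framing):

```python
import gzip
import io
import zlib

# Decompress an uploaded gzip stream chunk by chunk; in wmgr each
# decompressed chunk would go to pipe.write(...) instead of a buffer.
compressed = gzip.compress(b'1,foo\n2,bar\n' * 1000)
src = io.BytesIO(compressed)                # stands in for the request stream

decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)  # expect a gzip header
out = bytearray()
while chunk := src.read(8192):
    out += decomp.decompress(chunk)
out += decomp.flush()
```

This keeps memory usage bounded by the chunk size regardless of the file size.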

Werkzeug (which is responsible for request parsing in Flask) itself creates temporary files (using tempfile.TemporaryFile) for uploaded data when the file size exceeds 500 kBytes (see werkzeug.wrappers.BaseRequest._get_file_stream()). One could monkey-patch the BaseRequest._get_file_stream() method, or change the BaseRequest.form_data_parser_class member to a replacement form parser class which instantiates a different stream factory.
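The behavior a replacement stream factory would need is essentially what tempfile.SpooledTemporaryFile provides: keep data in memory up to a threshold, then roll over to a real file on disk. A small stdlib-only demonstration (the 500 kB threshold mirrors werkzeug's default; _rolled is a CPython implementation detail, inspected here only for illustration):

```python
import tempfile

# In-memory buffer that rolls over to a real temporary file once it
# grows past max_size -- the same behavior werkzeug gives uploads,
# just with a configurable threshold.
buf = tempfile.SpooledTemporaryFile(max_size=500 * 1024)

buf.write(b'x' * 1024)              # small write: stays in memory
small_rolled = buf._rolled          # still an in-memory buffer

buf.write(b'x' * (600 * 1024))      # past max_size: spills to disk
large_rolled = buf._rolled          # now backed by a real temp file

buf.close()
```

A patched stream factory would simply return such an object with a much larger max_size.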

Andy Salnikov added a comment - - edited

Quick implementation with the in-memory limit raised to 500 MBytes: time to load dropped to 2.9 sec. Still slower than direct mysql, but it is reasonable to expect this kind of overhead. Extended the table above with the new number:

Method                                                  Time, sec
mysql LOAD INFILE LOCAL                                 2.3
wmgr uncompressed data via temporary file               3.9
wmgr compressed data via temporary file                 4.0
wmgr uncompressed data via named pipe                   3.2
wmgr compressed data via named pipe                     3.2
wmgr compressed data, in-memory streaming, named pipe   2.9

At this point I'm not sure that it's worth using large memory buffers instead of temporary files, so I won't commit that monkey patch.

Andy Salnikov added a comment -

Serge, can you look at the PR? It's not too much code, and maybe you can suggest ways to improve it too.

Thanks,
Andy

Serge Monkewitz added a comment -

Overall, looks very nice! I posted a few trivial comments in the pull request.

For the record, Andy and I discussed a variation on his named pipe idea on hipchat:

So here's my initial reaction. Not sure if it makes sense or if it's just crazy talk...
Can we create the named_pipe, launch a mysql client process that loads from it (or a chain of named pipes for decompression purposes), and make werkzeug write to the named pipe instead?

Andy:

It may be possible, but will likely need some monkey patching for werkzeug.

I'd like to get some numbers from actual data loading before trying to optimize it.

Even if it's possible, I'm not sure if it's a win (pipe buffers are pretty small, 4k or 32k IIRC, so there may be overhead in lots of context switches from the pipe(s) constantly being emptied/filled)

Andy Salnikov added a comment -

I added a few context managers to the mix; it indeed makes things cleaner. Merged and pushed.


#### People

Assignee:
Andy Salnikov
Reporter:
Andy Salnikov
Reviewers:
Serge Monkewitz
Watchers:
Andy Salnikov, Serge Monkewitz