Details
Type: Story
Status: Done
Resolution: Done
Fix Version/s: None
Component/s: Qserv
Labels:
Story Points: 4
Epic Link:
Sprint: DB_S17_6
Team: Data Access and Database
Description
Objectives
ATTENTION: This investigation relies on the infrastructure, tools and datasets developed in the context of DM-10988.
Study parallel options for loading the Qserv secondary index (assuming the present InnoDB-based implementation). The main objective of the study is to see if there is a possibility to improve the performance of the loading process for large (multi-billion row) catalogs. Test if pre-ordering the input data makes any difference to the aggregate performance.
Also, in the context of this effort, investigate the effect of various storage technologies (including hardware RAID of spinning disks, SSD and NVMe) on the performance of the loader.
Post a summary of observations and conclusions to the same Confluence document which was set up for DM-10988.
Resources
Run the tests on ccqserv150.in2p3.fr (a member of the Qserv cluster in IN2P3).
References
These are interesting ideas on speeding up the loading process from a key MySQL developer:
Activity
Testing /qserv_nvm
Loading in 4 streams
Setup
Table schema:
CREATE TABLE secondary_index.parallel4 LIKE secondary_index.sequential;
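The ticket never shows the schema of secondary_index.sequential itself. A plausible sketch, based on the "triplet" wording below and on the later comment that the table carries nothing but a PRIMARY key (the column names are assumptions, not taken from the ticket):

CREATE TABLE secondary_index.sequential (
    objectId   BIGINT NOT NULL,   -- object identifier (the primary key of the index)
    chunkId    INT    NOT NULL,   -- chunk hosting the object
    subChunkId INT    NOT NULL,   -- sub-chunk hosting the object
    PRIMARY KEY (objectId)
) ENGINE=InnoDB;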
The streams are defined in:
% cd /qserv_nvm/gapon
% wc -l tsv4/*
  26723 tsv4/00
  26723 tsv4/01
  26723 tsv4/02
  26723 tsv4/03
 106892 total
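The ticket does not record how these tsv4/NN lists were produced. A minimal sketch, assuming the input TSV files sit under tsv/ (the /tmp/all_files name is arbitrary):

% ls tsv > /tmp/all_files                 # one input file name per line
% mkdir -p tsv4
% split -n l/4 -d /tmp/all_files tsv4/    # 4 roughly equal lists: tsv4/00 .. tsv4/03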
Log files of this test:
% mkdir log4
Using the modified version of the loading script:
% cat scripts/load_parallel.bash
#!/bin/bash

set -e

# command-line parameters
basedir=$1      # base directory of the test
num_streams=$2  # total number of parallel streams (used to name the log directory)
stream_file=$3  # list of the TSV files to be loaded by this stream
db=$4
table=$5
user=$6
password=$7

logdir=${basedir}/log${num_streams}
mysql_cmd="mysql -u${user} -p${password}"

echo "["`date`"] ** loading begin"
for f in `cat ${stream_file}`; do
    f_path=${basedir}/tsv/${f}
    echo "["`date`"] ** loading triplet from file: ${f}"
    $mysql_cmd -e "LOAD DATA INFILE '${f_path}' INTO TABLE ${db}.${table}" >& ${logdir}/load_${f}.log
done
echo "["`date`"] ** loading end"
Loading
Launched 4 streams in parallel:
% for f in `ls -1 tsv4/`; do
    nohup ./scripts/load_parallel.bash $PWD 4 tsv4/$f secondary_index parallel4 <user> <password> \
        >& log4/load_parallel_${f}.log &
  done
Monitoring
I/O rates:
nvme0n1  1196.00  0.00   69.33  0   69
nvme0n1  2031.00  0.00  137.01  0  137
nvme0n1  1701.00  0.00   89.69  0   89
nvme0n1  1034.00  0.00   69.20  0   69
nvme0n1   481.00  0.00   38.56  0   38
nvme0n1  1798.00  0.00  108.99  0  108
nvme0n1  1963.00  0.00  122.00  0  121
nvme0n1  1193.00  0.00   75.49  0   75
nvme0n1  1370.00  0.00   80.07  0   80
nvme0n1  1043.00  0.00   77.48  0   77
CPU utilization:
%Cpu(s):  2.2 us,  0.2 sy,  0.0 ni, 97.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

138480 mysql  20  0  0.137t  0.132t  10920 S  145.9  26.9  907:45.16 mysqld
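The exact monitoring invocations are not recorded in the ticket. A minimal sketch of how such samples could be taken (the 1-second interval is an assumption):

% iostat -m 1 | grep nvme0n1             # per-interval device I/O rates, in MB/s
% top -b -d 1 | grep -E '%Cpu|mysqld'    # overall CPU utilization and the mysqld process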
Performance, conclusions
Total loading time: 20:23 (hours:minutes)
Table file size: 675 GB
Also, the overall CPU utilization was noticeably higher (by a factor of up to 1.5) compared with the sequential loading method. There was a similar observation regarding the write I/O rate (as sample-monitored with iotop). The latter may be attributed to temporary files created by the 4 loading streams.
The main conclusion: parallel loading into InnoDB tables shows no benefit compared with the earlier tested sequential method, neither in the overall run time nor in a reduction of the table file size. It also makes no sense to proceed with testing other storage technologies: the performance limitations of the loading procedure are due to the implementation of the storage engine (provided the underlying file system delivers enough bandwidth and IOPS to keep up with the data ingestion process).
The next step is to test potential benefits of loading presorted data.
Loading presorted data
The main objective of this test is to see if the overall loading procedure could be sped up by breaking it into 4 phases:
- sorting the TSV files by the values of their primary (object identifier) keys
- using the merge-sort algorithm to produce a single (globally) sorted file
- splitting this file into smaller files before loading into the database (otherwise MySQL will create huge temporary/log files during the loading)
- loading these (smaller) files into the database in the right order (to preserve the sort order of the index)
In this test we're going to see if the total amount of time spent on all 4 stages is less than the earlier reported time for loading unsorted data. Another interesting question is the final size of the table file: there are reasons to believe that feeding unsorted data into the loader may result in significant internal fragmentation of the file.
Preparing a dataset
Sorting TSV files
NOTE: this is done in 64 parallel streams because sort is a CPU-bound application. Hence the idea is to create 64 lists of file names, each to be sorted sequentially by one (of the 64) streams. Details of how these file groups were created are omitted below; only a summary is shown.
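A minimal sketch of how 64 such groups could be created (the actual commands are not recorded in the ticket; /tmp/all_files is an arbitrary name):

% ls local_tsv > /tmp/all_files
% mkdir -p tsv64
% split -n l/64 -d /tmp/all_files tsv64/   # 64 roughly equal lists: tsv64/00 .. tsv64/63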
These are 64 sets of files:
% cd /qserv_ssd/gapon
% wc -l tsv64/*
   1671 tsv64/00
   1671 tsv64/01
   1671 tsv64/02
   1671 tsv64/03
   1671 tsv64/04
   ...
   1671 tsv64/61
   1671 tsv64/62
   1618 tsv64/63
 106891 total
Sort in 64 streams and store the sorted files in a separate folder:
% mkdir local_tsv_sorted
% for c in `ls tsv64`; do
    for f in `cat tsv64/${c}`; do
        sort local_tsv/${f} > local_tsv_sorted/${f}
    done &
  done

% ls -1 local_tsv_sorted | wc -l
106891

% du -hs local_tsv_sorted
Total run time of the stage: 30 min
Merge-sort
Due to the large number of input files this operation is split into 2 steps:
- sort-merging the files within each stream (as defined in the previous step); all 64 streams are processed simultaneously
- a final merge of the 64 files (produced by the previous step) into a single file
Parallel merge of 64 streams:
% mkdir sorted_merged_64
% for f in `ls ../tsv64`; do
    mkdir -p ../tmp_${f}
    cat ../tsv64/${f} | tr '\n' '\0' | \
        sort -m -s -k1 -T ../tmp_${f} -o ../sorted_merged_64/${f}.tsv --files0-from=- &
  done
NOTE: the translation of newlines to NUL characters is required to feed NUL-terminated (input) file names into the sort tool via --files0-from. Also note that each stream uses its own temporary directory on the current disk instead of relying on the small /tmp file system.
Total run time of the stage: 60 minutes
Final merge of the 64 stream files into one large file:
% sort -m -s -k1 -T tmp/ -o sorted_merged.tsv sorted_merged_64/*
% ls -alh sorted_merged.tsv
-rw-r--r-- 1 gapon babar 187G Jun 28 04:25 sorted_merged.tsv
Total run time of the stage: 2 hours
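Not part of the ticket, but a cheap way to verify that the result is globally sorted (sort -c only checks the ordering and writes nothing):

% sort -c -s -k1 sorted_merged.tsv && echo OK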
Splitting the large file into smaller chunks
% mkdir sorted_merged_split
% cd sorted_merged_split
% date
Wed Jun 28 05:45:28 CEST 2017

% split -d -l 80663508 ../sorted_merged.tsv ''

% date
Wed Jun 28 05:53:58 CEST 2017
Total number of files: 90
Average file size: 2.1 GB
Lines per file: 80663508
Total run time of the stage: 8 min 30 seconds
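The line count per output file is consistent with splitting the input into 90 roughly even chunks; a sketch of how such a number could be derived (the target of 90 files is an assumption):

% total=`wc -l < ../sorted_merged.tsv`
% echo $(( (total + 89) / 90 ))   # lines per file, rounded up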
Loading files into the database
Loading is done into the NVMe-based storage while reading the previously sorted files from the SSD-based storage:
% cd /qserv_nvm/gapon
% ln -s /qserv_ssd/gapon/sorted_merged_split tsv_sorted
Index table:
% CREATE TABLE secondary_index.sequential_sorted LIKE secondary_index.parallel4;
% DROP TABLE secondary_index.parallel4;
Added the following option to the loading SQL command:
SET UNIQUE_CHECKS=0;
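The customized loader itself is not listed in the ticket. A minimal sketch, modeled on load_parallel.bash above, with the files taken in sorted (lexicographic) order and UNIQUE_CHECKS disabled in the same session that runs LOAD DATA (a hypothetical reconstruction, not the actual script):

#!/bin/bash
set -e

basedir=$1; db=$2; table=$3; user=$4; password=$5
mysql_cmd="mysql -u${user} -p${password}"

echo "["`date`"] ** loading begin"
for f_path in ${basedir}/tsv_sorted/*; do
    echo "["`date`"] ** loading triplet from file: ${f_path}"
    # the session variable must be set in the same session that runs the LOAD
    $mysql_cmd -e "SET UNIQUE_CHECKS=0; LOAD DATA INFILE '${f_path}' INTO TABLE ${db}.${table}"
done
echo "["`date`"] ** loading end"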
Run the customized version of the sequential loader:
% nohup scripts/load_sorted_sequential.bash \
    $PWD secondary_index sequential_sorted <user> <password> \
    >& log/load_sorted_sequential.log &
The operation is IN PROGRESS.
Total loading time: 9 hours 45 minutes
Table file size: 416 GB
Conclusions
Two candidate optimization techniques for index ingestion have been tried. This is the summary:
- The parallel (4-stream) loading option shows no benefit compared with the original sequential approach (loading unsorted data), neither in its performance nor in the resulting size of the table file on disk. Even worse, this method extends the total loading time by 2 hours (assuming the same underlying storage) and uses more resources (CPU time and disk I/O) than the original method.
- The other technique, which requires presorting the input data, demonstrates a noticeable (though not dramatic) reduction in the overall loading time (from 18:00 down to 13:30 hours) and, importantly, a significant reduction in the table file size (from 674 GB down to 416 GB). The latter may be an important factor for resource-constrained setups.
TODO: put this summary, along with the relevant conditions (dataset, configuration, setup, test method), onto the Confluence page mentioned in the Description section of this ticket.
Are there any noticeable differences in the index lookup time, after loading, given the differences in index size?
Disabling index generation (not just the unique key check) on load and then enabling it afterwards may also give a speedup.
Kian-Tat Lim I didn't have a chance to test the read performance of the index (built in the two different ways). My "educated guess" is that it should not matter: InnoDB stores tuple data in the leaf nodes, which requires a disk I/O to fetch the data (given the very high width of the index B-Tree and a high probability of having the upper levels of the tree cached in memory). When fetching random objectIds one would always hit a random disk page at the leaf level.
Though, I'm interested in testing this.
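A minimal sketch of such a read test (the table and column names follow the hypothetical schema sketched earlier; the probe count and the way random identifiers are drawn are arbitrary):

% for i in `seq 1000`; do
    id=$(( RANDOM * RANDOM ))   # crude; real probes should sample objectIds known to exist
    mysql -u<user> -p<password> -N -e \
        "SELECT chunkId, subChunkId FROM secondary_index.sequential_sorted WHERE objectId=${id}"
  done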
Kian-Tat Lim regarding the idea of disabling the PRIMARY key: I don't think this is possible in the InnoDB engine because it implements its tables as B-Trees clustered on the PRIMARY key values. So there must be some index. Other (secondary) indexes can be disabled; however, we don't have any but the PRIMARY index in this particular table.
Planning the study
Prepare for the parallel loading:
Test the loading using the same approach as for the simple sequential setup. Try to repeat the tests for different numbers of parallel loading streams: 2, 4, 8, 16, 32, 64. A driver sketch is shown below.
Monitor the progress using the same techniques reported in DM-10988.
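A minimal sketch of a driver for the planned stream counts, reusing load_parallel.bash (the directory layout and the pre-created parallel${n} tables are assumptions):

% ls tsv > /tmp/all_files
% for n in 2 4 8 16 32 64; do
    mkdir -p tsv${n} log${n}
    split -n l/${n} -d /tmp/all_files tsv${n}/   # build ${n} file lists
    for f in `ls tsv${n}`; do
      nohup ./scripts/load_parallel.bash $PWD ${n} tsv${n}/$f secondary_index parallel${n} <user> <password> \
          >& log${n}/load_parallel_${f}.log &
    done
    wait   # let all ${n} streams finish before starting the next round
  done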