Data Management / DM-11027

Investigate options for speeding up data ingestion into Qserv "secondary index"


      Description

      Objectives

ATTENTION: this investigation relies on the infrastructure, tools, and datasets developed in the context of DM-10988.

Study parallel options for loading the Qserv secondary index (assuming the present InnoDB-based implementation). The main objective of the study is to see whether the performance of the loading process can be improved for large (multi-billion-row) catalogs. Also test whether pre-ordering the input data makes any difference to the aggregate performance.

Also, in the context of this effort, investigate the effect of various storage technologies (including hardware RAID over spinning disks, SSD, and NVMe) on the performance of the loader.

Post a summary of observations and conclusions to the same Confluence document that was set up for DM-10988.

      Resources

      Run the tests on ccqserv150.in2p3.fr (a member of the Qserv cluster in IN2P3).

      References

Interesting ideas on speeding up the loading process from one of the key MySQL developers:

Activity

Igor Gaponenko added a comment - edited

            Planning the study

            Prepare for the parallel loading:

• group the TSV files into N file lists (see the sketch below)
• prepare a custom version of the secondary-index loading script that loads the N groups in parallel (all files within each group are loaded sequentially)
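
A minimal sketch of the grouping step for N=4 (assuming a flat directory of input files; the helper file name all_files.txt is made up, not part of the actual procedure):

% ls -1 tsv > all_files.txt            # one input file name per line
% mkdir tsv4
% split -d -n l/4 all_files.txt tsv4/  # produces the lists tsv4/00 ... tsv4/03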

Test the loading using the same approach as for the simple sequential setup. Repeat the tests for different numbers of parallel loading streams: 2, 4, 8, 16, 32, and 64.

            Monitor the progress using the same techniques reported in DM-10988.

Igor Gaponenko added a comment - edited

            Testing /qserv_nvm

            Loading in 4 streams

            Setup

            Table schema:

            CREATE TABLE secondary_index.parallel4 LIKE secondary_index.sequential;
            

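The definition of secondary_index.sequential itself is not shown in this ticket. For orientation only, a plausible shape of the triplet table (the column names and types are assumptions, not the actual schema) would be:

CREATE TABLE secondary_index.sequential (
    objectId   BIGINT NOT NULL PRIMARY KEY,  -- object identifier (the index key)
    chunkId    INT    NOT NULL,              -- chunk containing the object
    subChunkId INT    NOT NULL               -- sub-chunk within the chunk
) ENGINE=InnoDB;
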
The 4 streams are defined by these file lists:

            % cd /qserv_nvm/gapon
            % wc -l tsv4/*
              26723 tsv4/00
              26723 tsv4/01
              26723 tsv4/02
              26723 tsv4/03
             106892 total
            

Log directory for this test:

            % mkdir log4
            

            Using the modified version of the loading script:

% cat scripts/load_parallel.bash
#!/bin/bash
# Usage: load_parallel.bash <basedir> <num_streams> <stream_file> <db> <table> <user> <password>
# Sequentially loads every TSV file listed in <stream_file>. Run one copy
# of the script per stream to get parallel loading.

set -e

basedir=$1
num_streams=$2
stream_file=$3
db=$4
table=$5
user=$6
password=$7

logdir=${basedir}/log${num_streams}

mysql_cmd="mysql -u${user} -p${password}"

echo "["`date`"] ** loading begin"

for f in `cat ${stream_file}`; do
  f_path=${basedir}/tsv/${f}
  echo "["`date`"] ** loading triplet from file: ${f}"
  $mysql_cmd -e "LOAD DATA INFILE '${f_path}' INTO TABLE ${db}.${table}" >& ${logdir}/load_${f}.log
done

echo "["`date`"] ** loading end"
            

            Loading

            Launched 4 streams in parallel:

            % for f in `ls -1 tsv4/`; do
                nohup ./scripts/load_parallel.bash $PWD 4 tsv4/$f secondary_index parallel4 <user> <password> \
                >& log4/load_parallel_${f}.log& done
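
Each stream logs "** loading begin" and "** loading end" timestamps, so the aggregate loading time can be read off as the span from the earliest "begin" to the latest "end" across the stream logs:

% grep -E 'loading (begin|end)' log4/load_parallel_*.log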
            

            Monitoring

            I/O rates:

            nvme0n1        1196.00         0.00        69.33          0         69
            nvme0n1        2031.00         0.00       137.01          0        137
            nvme0n1        1701.00         0.00        89.69          0         89
            nvme0n1        1034.00         0.00        69.20          0         69
            nvme0n1         481.00         0.00        38.56          0         38
            nvme0n1        1798.00         0.00       108.99          0        108
            nvme0n1        1963.00         0.00       122.00          0        121
            nvme0n1        1193.00         0.00        75.49          0         75
            nvme0n1        1370.00         0.00        80.07          0         80
            nvme0n1        1043.00         0.00        77.48          0         77
            

            CPU utilization:

            %Cpu(s):  2.2 us,  0.2 sy,  0.0 ni, 97.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
            138480 mysql     20   0  0.137t 0.132t  10920 S 145.9 26.9 907:45.16 mysqld
            

            Performance, conclusions

            Total loading time: 20:23.08
            Table file size: 675 GB

Also, the overall CPU utilization was noticeably higher (by a factor of up to 1.5) compared with the sequential loading method. A similar increase was observed in the write I/O rate (as sample-monitored with iotop). The latter may be attributed to the temporary files created by the 4 loading streams.

The main conclusion: parallel loading into InnoDB tables shows no benefit compared with the earlier-tested sequential method, neither in the overall run time nor in the resulting file size. It may also make no sense to proceed with testing other storage technologies, because the performance limitations of the loading procedure are due to the implementation of the storage engine (provided the underlying file system offers enough bandwidth and IOPS to keep up with the data ingestion process).

            The next step is to test potential benefits of loading presorted data.

Igor Gaponenko added a comment - edited

            Loading presorted data

The main objective of this test is to see whether the overall loading procedure can be sped up by breaking it into 4 phases:

1. sorting the rows within each TSV file by the values of the primary (object identifier) key
2. using the merge-sort algorithm to produce a single (globally) sorted file
3. splitting this file into smaller files before loading into the database (otherwise MySQL creates huge temporary/log files during the load)
4. loading these (smaller) files into the database in the right order (to preserve the sort order of the index)

This test will show whether the total amount of time spent on all 4 stages is less than the earlier-reported time for loading unsorted data. Another interesting question is the final size of the table file: there are reasons to believe that feeding unsorted data into the loader results in significant internal fragmentation of the file.

            Preparing a dataset

            Sorting TSV files

NOTE: this is done in 64 parallel streams because sort is a CPU-bound application. Hence the idea is to create 64 lists of file names, with each list processed sequentially by one of the 64 streams. Details of how these file groups were created are omitted below; only a summary is shown.

            These are 64 sets of files:

            % cd /qserv_ssd/gapon
            % wc -l tsv64/*
               1671 tsv64/00
               1671 tsv64/01
               1671 tsv64/02
               1671 tsv64/03
               1671 tsv64/04
               ...
               1671 tsv64/61
               1671 tsv64/62
               1618 tsv64/63
             106891 total
            % 
            

            Sort in 64 streams and store the sorted files in a separate folder:

% mkdir local_tsv_sorted
% for c in `ls tsv64`; do            # one background job per file group
    for f in `cat tsv64/${c}`; do    # files within a group are sorted sequentially
        sort local_tsv/${f} > local_tsv_sorted/${f}
    done&
done
            % ls -1 local_tsv_sorted | wc -l
            106891
            % du -hs local_tsv_sorted
            

            Total run time of the stage: 30 min

            Merge-sort

Due to the large number of input files this operation is split into 2 steps:

1. sort-merging the files within each stream (as defined in the previous step); all 64 streams are processed simultaneously
2. a final merge of the 64 files (produced by the previous step) into a single file

            Parallel merge of 64 streams:

            % mkdir sorted_merged_64
            % for f in `ls ../tsv64`; do
                mkdir -p ../tmp_${f}
                cat ../tsv64/${f} | tr '\n' '\0' | sort -m -s -k1 -T ../tmp_${f} -o ../sorted_merged_64/${f}.tsv --files0-from=- &
            done
            

NOTE: the translation of newlines into NUL characters is required to feed null-terminated (input) file names into sort via its --files0-from option. Also note that each stream uses its own temporary directory on the current disk instead of relying on the small /tmp file system.

            Total run time of the stage: 60 minutes

Final merge of the 64 stream files into a single large one:

            % sort -m -s -k1 -T tmp/  -o sorted_merged.tsv sorted_merged_64/*
            % ls -alh sorted_merged.tsv
            -rw-r--r-- 1 gapon babar 187G Jun 28 04:25 sorted_merged.tsv
            

            Total run time of the stage: 2 hours

            Splitting the large file into smaller chunks

            % ls -alh sorted_merged.tsv
            -rw-r--r-- 1 gapon babar 187G Jun 28 04:25 sorted_merged.tsv
             
            % mkdir sorted_merged_split
            % cd sorted_merged_split
            % date
            Wed Jun 28 05:45:28 CEST 2017
             
            % split -d -l 80663508 ../sorted_merged.tsv ''
             
            % date
            Wed Jun 28 05:53:58 CEST 2017
            

            Total number of files: 90
            Average file size: 2.1 GB
Lines per file: 80663508
            Total run time of the stage: 8 min 30 seconds

Loading the files into the database

Loading is done into the NVMe-based storage while reading the previously sorted files from the SSD-based storage:

            % cd /qserv_nvm/gapon
            % ln -s /qserv_ssd/gapon/sorted_merged_split tsv_sorted
            

            Index table:

            % CREATE TABLE secondary_index.sequential_sorted LIKE secondary_index.parallel4;
            % DROP TABLE secondary_index.parallel4;
            

Added the following option to the loading SQL command (UNIQUE_CHECKS is a session variable, so it has to be set in the same session as each LOAD statement):

            SET UNIQUE_CHECKS=0;
            

            Run the customized version of the sequential loader:

            % nohup scripts/load_sorted_sequential.bash \
                $PWD secondary_index sequential_sorted <user> <password> \
                &> log/load_sorted_sequential.log &
            

The operation completed with the following results:

            Total loading time: 9 hours 45 minutes
            Table file size: 416 GB
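
The load_sorted_sequential.bash script itself is not included in the ticket. A plausible reconstruction, modeled on load_parallel.bash above (the argument order matches the invocation; everything else is an assumption):

#!/bin/bash
# Hypothetical reconstruction of scripts/load_sorted_sequential.bash.
# Usage: load_sorted_sequential.bash <basedir> <db> <table> <user> <password>

set -e

basedir=$1
db=$2
table=$3
user=$4
password=$5

mysql_cmd="mysql -u${user} -p${password}"

echo "["`date`"] ** loading begin"

# Load the split files in lexicographic order, which is also the global
# sort order, so rows arrive in PRIMARY KEY order. UNIQUE_CHECKS is a
# session variable, hence it is set in the same session as each LOAD.
for f in `ls ${basedir}/tsv_sorted`; do
  echo "["`date`"] ** loading file: ${f}"
  $mysql_cmd -e "SET UNIQUE_CHECKS=0; LOAD DATA INFILE '${basedir}/tsv_sorted/${f}' INTO TABLE ${db}.${table}" >& ${basedir}/log/load_${f}.log
done

echo "["`date`"] ** loading end"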

Igor Gaponenko added a comment - edited

            Conclusions

            Two candidate optimization techniques for index ingestion have been tried. This is the summary:

• The parallel (4-stream) loading option shows no benefit compared with the original sequential approach of loading unsorted data, neither in performance nor in the resulting size of the table file on disk. Even worse, this method extends the total loading time by 2 hours (on the same underlying storage) and consumes more resources (CPU time and disk I/O) than the original method.
• The other technique, which requires presorting the input data, demonstrates a noticeable (though not dramatic) reduction in the overall loading time (from 18:00 down to 13:30 hours) and, importantly, a significant reduction in the table file size (from 674 GB to 416 GB). The latter may be an important factor for resource-constrained setups.

TODO: put this summary, along with the relevant conditions (dataset, configuration, setup, test method), onto the Confluence page mentioned in the Description section of this issue.

Kian-Tat Lim added a comment

            Are there any noticeable differences in the index lookup time, after loading, given the differences in index size?

Kian-Tat Lim added a comment

            Disabling index generation (not just the unique key check) on load and then enabling it afterwards may also give a speedup.

Igor Gaponenko added a comment - edited

Kian-Tat Lim I didn't have a chance to test the read performance of the index (built in the two different ways). My "educated guess" is that it should not matter, since InnoDB stores tuple data in the leaf nodes, which requires a disk I/O to fetch the data (given the very high fan-out of the index B-Tree and the high probability that the upper levels of the tree are cached in memory). When fetching random objectId's one would always hit a random disk page at the leaf level.

            Though, I'm interested in testing this.
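
A minimal sketch of such a lookup test (the objectId values here are made up, and the column names are the same assumptions as above):

SELECT chunkId, subChunkId
  FROM secondary_index.sequential_sorted
 WHERE objectId IN (1234567890123, 9876543210987);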

Igor Gaponenko added a comment

Kian-Tat Lim regarding the idea of disabling the PRIMARY key: I don't think this is possible with the InnoDB engine, because it implements its tables as B-Trees clustered on the PRIMARY key values. So, there must be some index. Other (secondary) indexes can be disabled; however, we don't have any index but the PRIMARY one in this particular table.
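
For reference, if the table did carry a secondary index, the drop-and-rebuild pattern would look as follows (the index and column names are made up for illustration):

-- hypothetical: no such secondary index exists in the real table
ALTER TABLE secondary_index.sequential_sorted DROP INDEX idx_chunk;
-- ... bulk LOAD DATA here ...
ALTER TABLE secondary_index.sequential_sorted ADD INDEX idx_chunk (chunkId);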


People

Assignee: Igor Gaponenko
Reporter: Igor Gaponenko
Watchers: Fritz Mueller, Igor Gaponenko, Kian-Tat Lim
