ID | IEP-22 |
Author | Vladimir Ozerov Ozerov |
Sponsor | Vladimir Ozerov Ozerov |
Created | 15 Jun 2018 |
Status | ACTIVE |
Initial data load is one of the most frequent and important use cases for Apache Ignite. At the moment the fastest way to load data is data streamer, which relies on asynchronous messaging and fast-path entry update method [1]. But all other internal mechanics of cache and page memory works as usual during data load.
General approach employed by major database vendors is to disable and/or skip as much internals as possible during data load, and to employ alternative sorting methods for indexes. With this idea in mind, we have potential to improve data loading speed in several times as follows:
Speedup is expected from less number of IO operations and and less locking page/entry overhead.
Most industrial vendors rely on the following assumptions:
Oracle offers "direct path load" [1] as the fastest method to populate data after manual file copying. Direct path load can be from both command line utility SQL*Loader to load existing files (resembling Ignite's COPY
command), and with conventional SQL, such as INSERT
and CREATE TABLE AS SELECT
(CTAS). Major highlights:
[1] https://docs.oracle.com/cd/E11882_01/server.112/e22490/ldr_modes.htm#SUTIL009
[2] https://docs.oracle.com/en/database/oracle/oracle-database/18/admin/managing-tables.html#GUID-134D6EB6-219E-4820-AB54-8C60067A8F0F
PostgreSQL offers a number of best practices for efficient data load. Dedicated COPY
command exists to bypass SQL parsing and minimize network overhead. Apart from that there are not dedicated features for data load. General recommendations are:
maintenance_work_mem
parameter (buffer for index rebuild routines)max_wal_size
parameterMySQL and it's forks are pretty similar to PostgreSQL in available techniques and best practices [1]. MySQL also has COPY
command [2], which is the fastest data load method, reducing a lot parsing and network overhead. MySQL also advises to disable indexes, unique and FK constraints. But as MySQL is index-organized database and has key-value MyISAM engine, there are two unique points:
[1] https://dev.mysql.com/doc/refman/8.0/en/insert-optimization.html
[2] https://dev.mysql.com/doc/refman/8.0/en/load-data.html
[3] https://dev.mysql.com/doc/internals/en/bulk-insert.html
Implement distributed table (cache) lock - when it is held no other operation on cache is possible