IDIEP-22
Author: Vladimir Ozerov
Sponsor: Vladimir Ozerov
Created: 15 Jun 2018
Status: DRAFT


Motivation

Initial data load is one of the most frequent and important use cases for Apache Ignite. At the moment the fastest way to load data is the data streamer, which relies on asynchronous messaging and a fast-path entry update method [1]. But all other internal mechanics of the cache and page memory work as usual during data load.

The general approach employed by major database vendors is to disable and/or skip as much internal machinery as possible during data load, and to employ alternative sorting methods for indexes. With this idea in mind, we have the potential to improve data loading speed several-fold as follows:

  • Add exclusive cache access mode, when only data loading process is able to interact with cache
  • Load data directly to data pages, skipping page buffer and free lists
  • Then rebuild indexes bottom-up using external sort algorithm

Speedup is expected from a smaller number of IO operations and lower page/entry locking overhead.

[1] https://github.com/apache/ignite/blob/ignite-2.5/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/GridCacheMapEntry.java#L2691

Competitive Analysis

Most industrial vendors rely on the following assumptions:

  • Sequential IO is faster than random IO
  • Maintenance of internal database data structures is expensive; it is better to rebuild them from scratch in bulk when data load is finished
  • Acquiring per-entry locks is expensive, so it is better to escalate the lock to an upper level (typically the table)
  • Crash recovery of data being loaded is typically not needed

Oracle

Oracle offers "direct path load" [1] as the fastest method to populate data apart from manual file copying. Direct path load can be performed both from the command line utility SQL*Loader, which loads existing files (resembling Ignite's COPY command), and with conventional SQL statements such as INSERT and CREATE TABLE AS SELECT (CTAS). Major highlights:

  • SQL parsing phase is skipped when SQL*Loader is used
  • Page cache is skipped
  • Old pages are not changed, data is written to new data blocks
  • Multi-block writes and async IO is used to improve throughput
  • Table lock is held during data load - writes are not allowed, reads see data before data load start
  • Partial indexes are built using separate memory buffers and then merged to existing indexes

[1] https://docs.oracle.com/cd/E11882_01/server.112/e22490/ldr_modes.htm#SUTIL009
[2] https://docs.oracle.com/en/database/oracle/oracle-database/18/admin/managing-tables.html#GUID-134D6EB6-219E-4820-AB54-8C60067A8F0F

PostgreSQL

PostgreSQL offers a number of best practices for efficient data load. A dedicated COPY command exists to bypass SQL parsing and minimize network overhead. Apart from that, there are no dedicated features for data load. General recommendations are:

  • Disable indexes and FK constraints
  • Turn off WAL
  • Increase size of maintenance_work_mem parameter (buffer for index rebuild routines)
  • Indirectly increase time between checkpoints through max_wal_size parameter

MySQL and MariaDB

MySQL and its forks are pretty similar to PostgreSQL in available techniques and best practices [1]. MySQL also has the LOAD DATA command [2], which is the fastest data load method, greatly reducing parsing and network overhead. MySQL likewise advises disabling indexes and unique and FK constraints. But as MySQL's default InnoDB engine is index-organized and the key-value MyISAM engine is also available, there are two unique points:

  • MyISAM first saves KV pairs in a separate memory-optimized buffer based on a red-black tree; once full, it is flushed to disk in a series of sequential writes [3]
  • Data load in PK order is much faster than in arbitrary order (Oracle has the same recommendation for secondary indexes)
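The MyISAM buffering point above can be illustrated with a small Java sketch (all names are illustrative, not MyISAM internals): entries are collected in a sorted in-memory structure, and each full buffer is emitted in key order, turning random index writes into sequential runs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy illustration of MyISAM-style bulk-insert buffering: KV pairs are
 *  collected in a sorted in-memory structure (a red-black tree under the
 *  hood of TreeMap) and flushed in key order once the buffer is full. */
public class SortedInsertBuffer {
    private final TreeMap<Integer, String> buf = new TreeMap<>();
    private final int capacity;
    private final List<List<Integer>> flushedRuns = new ArrayList<>();

    public SortedInsertBuffer(int capacity) { this.capacity = capacity; }

    public void put(int key, String val) {
        buf.put(key, val);
        if (buf.size() >= capacity)
            flush();
    }

    /** Emits the buffered keys as one sorted sequential run (a real engine
     *  would write the corresponding index blocks to disk here). */
    public void flush() {
        if (buf.isEmpty())
            return;
        flushedRuns.add(new ArrayList<>(buf.keySet())); // already sorted
        buf.clear();
    }

    public List<List<Integer>> runs() { return flushedRuns; }

    public static void main(String[] args) {
        SortedInsertBuffer b = new SortedInsertBuffer(3);
        for (int k : new int[] {5, 1, 4, 2, 6, 3})
            b.put(k, "v" + k);
        b.flush();
        System.out.println(b.runs()); // [[1, 4, 5], [2, 3, 6]]
    }
}
```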

[1] https://dev.mysql.com/doc/refman/8.0/en/insert-optimization.html
[2] https://dev.mysql.com/doc/refman/8.0/en/load-data.html
[3] https://dev.mysql.com/doc/internals/en/bulk-insert.html

Proposed Changes

  1. Implement a distributed cache (table) lock - while it is held, no other operation on the cache is possible; the PME (partition map exchange) infrastructure will be used for this
  2. Implement external sort - acquire a memory buffer, collect values there in sorted order, flush to disk if needed, merge in the end
  3. Implement direct data load - write data to data blocks directly, then re-create indexes with external sort and other necessary data structures at the end
  4. Add the necessary API to control direct data load mode

Distributed Cache Lock

Index updates and integrity checks will be delayed during direct data load. In order to maintain data consistency during this time, it is essential to prevent any concurrent updates. This can be done with a new cache lock concept. We will initiate an exchange which will wait for all cache operations to finish, and then put a special flag on that cache. If the flag is set, an exception will be thrown on any attempt to access the cache.
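A simplified single-node model of this drain-then-flag protocol is sketched below; class and method names are invented for illustration, and the real implementation would run over the distributed PME infrastructure rather than a local flag and counter.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Local model of the proposed cache lock: once the flag is set and all
 *  in-flight operations drain, subsequent cache operations fail fast. */
public class CacheLock {
    private final AtomicBoolean locked = new AtomicBoolean();
    private final AtomicInteger inFlight = new AtomicInteger();

    /** Called at the start of every regular cache operation. */
    public void enter() {
        if (locked.get())
            throw new IllegalStateException("Cache is locked for direct data load");
        inFlight.incrementAndGet();
        // Re-check after registering to close the race with lock().
        if (locked.get()) {
            inFlight.decrementAndGet();
            throw new IllegalStateException("Cache is locked for direct data load");
        }
    }

    /** Called when a regular cache operation completes. */
    public void exit() { inFlight.decrementAndGet(); }

    /** The "exchange": set the flag, then wait for in-flight operations to drain. */
    public void lock() {
        locked.set(true);
        while (inFlight.get() > 0)
            Thread.onSpinWait();
    }

    public void unlock() { locked.set(false); }
}
```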

External Sort

We will not update PK and secondary indexes during the data load, so it is necessary to rebuild them at the end. The most efficient way to build indexes is the bottom-up approach, when the lowest level of the BTree is built first and the root is built last. We will need a buffer where indexed values and the respective links are sorted in index order. If the buffer is big enough and all the data fits into it, the index is created in one pass. Otherwise it is necessary to sort indexed values in several runs using an external sort. Users must be able to configure the sort parameters: buffer size (ideally in bytes) and the file system path where temp files will be stored. The latter is critical - typically users want to keep temp files on a separate disk, so that WAL and checkpoint operations are not affected.
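The sort-runs-then-merge scheme can be sketched in a few lines of Java. This is a minimal model with plain integers standing in for indexed values and runs kept in memory; the real implementation would spill each run to a temp file and stream the k-way merge from disk.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/** Sketch of the external-sort phase: values are accumulated in a bounded
 *  buffer, each full buffer is sorted and emitted as a run, and all runs
 *  are then k-way merged into the final index order. */
public class ExternalSort {
    /** Splits the input into bufSize-sized chunks and sorts each chunk. */
    public static List<List<Integer>> makeRuns(List<Integer> input, int bufSize) {
        List<List<Integer>> runs = new ArrayList<>();
        for (int i = 0; i < input.size(); i += bufSize) {
            List<Integer> run =
                new ArrayList<>(input.subList(i, Math.min(i + bufSize, input.size())));
            Collections.sort(run);
            runs.add(run);
        }
        return runs;
    }

    /** K-way merge: a heap of run indices keeps the run with the smallest head on top. */
    public static List<Integer> merge(List<List<Integer>> runs) {
        List<Integer> out = new ArrayList<>();
        int[] pos = new int[runs.size()];
        PriorityQueue<Integer> heap = new PriorityQueue<>(
            (a, b) -> Integer.compare(runs.get(a).get(pos[a]), runs.get(b).get(pos[b])));
        for (int r = 0; r < runs.size(); r++)
            if (!runs.get(r).isEmpty())
                heap.add(r);
        while (!heap.isEmpty()) {
            int r = heap.poll();
            out.add(runs.get(r).get(pos[r]++));
            if (pos[r] < runs.get(r).size())
                heap.add(r); // re-insert with the advanced position
        }
        return out;
    }
}
```

In the bottom-up index build, the merged stream would be consumed sequentially to fill BTree leaf pages, with upper levels created afterwards.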

Direct Data Load

We will have a small in-memory buffer for several consecutive data blocks. Data being injected is put into these blocks, bypassing the page memory. When the buffer is full, we could issue a multi-buffer async disk write and continue filling the buffer with new data. As data loading typically affects several partitions, multiple buffers and/or some additional synchronization may be required. 

Data will be inserted into the new blocks only. We will have to track the start and end positions of the inserted data blocks. These positions will be used to scan new data during index rebuilding.
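The buffering and block-range tracking described above can be modeled as follows. All names here are hypothetical, flushing is synchronous and in-memory for brevity, and a single partition is assumed; the real design would issue async multi-block disk writes and handle multiple partitions.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the direct load path: rows are packed into fixed-size blocks,
 *  full blocks are flushed, and the range of newly written blocks is
 *  tracked so the index-rebuild phase knows what to scan. */
public class DirectLoadBuffer {
    private static final int BLOCK_CAPACITY = 4; // rows per block, illustrative

    private final List<List<String>> flushedBlocks = new ArrayList<>();
    private List<String> curBlock = new ArrayList<>();
    private final long firstBlockId;
    private long nextBlockId;

    public DirectLoadBuffer(long startBlockId) {
        firstBlockId = startBlockId;
        nextBlockId = startBlockId;
    }

    /** Appends a row, bypassing the page memory; flushes when the block fills up. */
    public void add(String row) {
        curBlock.add(row);
        if (curBlock.size() == BLOCK_CAPACITY)
            flush();
    }

    /** Flushes the current block; a real implementation would issue an
     *  async multi-block write here instead of appending to a list. */
    public void flush() {
        if (curBlock.isEmpty())
            return;
        flushedBlocks.add(curBlock);
        curBlock = new ArrayList<>();
        nextBlockId++;
    }

    /** [start, end) range of new blocks to scan when rebuilding indexes. */
    public long[] newBlockRange() { return new long[] {firstBlockId, nextBlockId}; }

    public List<List<String>> blocks() { return flushedBlocks; }
}
```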

TBD

  • Interaction with WAL
  • Rebalance
  • Crash recovery

Tickets

