ID: IEP-1
Author:
Sponsor:
Created: 16 Sep 2017
Status: COMPLETED

Motivation

One frequent usage pattern for Ignite is bulk data loading. Users need to be able to load data into Ignite from external sources as fast as possible. Ignite is not optimized for this use case at the moment, because the bulk data loading process goes through the same code paths as normal cache updates. This IEP aims to improve bulk data loading performance.

Proposed changes

All proposed changes can be split into two groups: infrastructure improvements and index improvements. Note that some proposals conflict with each other, so careful evaluation is a must.

Infrastructure improvements:

Description

WAL optimization

During the initial data load it is sometimes acceptable to relax crash-recovery guarantees. We can disable WAL for a particular cache, cache group or data region, load the data, and then enable WAL again. This mode could speed up data loading by a factor of 2x-4x.
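
The disable, load, enable sequence can be sketched as follows. This is only an illustration of the call order: `WalSwitch`, `BulkLoadWithoutWal` and `RecordingWalSwitch` are hypothetical names introduced here, not Ignite APIs. The point of the sketch is that WAL should be turned back on even when the load fails.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the component that toggles WAL; not an Ignite API. */
interface WalSwitch {
    void disableWal(String cacheName);
    void enableWal(String cacheName);
}

final class BulkLoadWithoutWal {
    /** Disables WAL, runs the load job, and re-enables WAL even if the job throws. */
    static void load(WalSwitch wal, String cacheName, Runnable loadJob) {
        wal.disableWal(cacheName);
        try {
            loadJob.run();
        } finally {
            // Crash-recovery guarantees are relaxed only for the duration of the load.
            wal.enableWal(cacheName);
        }
    }
}

/** In-memory recorder used only to make the call order visible. */
final class RecordingWalSwitch implements WalSwitch {
    final List<String> events = new ArrayList<>();

    @Override public void disableWal(String cacheName) { events.add("disable:" + cacheName); }
    @Override public void enableWal(String cacheName) { events.add("enable:" + cacheName); }
}
```

A real implementation would also have to decide what happens to data loaded while WAL was off if a node crashes before WAL is enabled again; the sketch does not cover that.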

Duplicate PK indexes

Currently we have a single PK index per physical cache plus one additional PK index per table. This means that in the typical case, when a cache doesn't belong to any group, we have two PK indexes instead of one. This slows down updates. We should try removing the H2 PK index altogether. This should be done carefully, so that the inline optimization feature is not lost.

Optimize CREATE INDEX

Secondary indexes negatively affect write performance. A common pattern is to drop indexes, load data and then create the indexes again. This doesn't work for Ignite at the moment because index creation is slow. First, we create an index by adding entries one by one, and every addition requires walking the B+Tree from the top. Instead, we can create sorted batches of entries and add multiple entries to the index in one hop. Second, the index is created by iterating over the primary index. This is less than efficient, especially for persistent caches, due to the additional jumps from the primary index to data pages. Instead, we can try iterating through data pages directly. Last, we can try creating the index from multiple threads, where every thread processes a predefined set of partitions.
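
Two of the ideas above, sorted batches and one predefined set of partitions per thread, can be sketched in plain Java. This is a simplified model under stated assumptions: a sorted list stands in for the B+Tree, each partition is a list of already extracted index keys, and the class name `ParallelIndexBuild` is hypothetical. It does not show Ignite's page or tree structures.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelIndexBuild {
    /** Each inner list holds the index keys of one partition, in arbitrary order. */
    static List<Integer> build(List<List<Integer>> partitions, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<Integer>>> runs = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int worker = t;
                runs.add(pool.submit(() -> {
                    // Worker t owns partitions t, t + threads, t + 2 * threads, ...
                    List<Integer> batch = new ArrayList<>();
                    for (int p = worker; p < partitions.size(); p += threads)
                        batch.addAll(partitions.get(p));
                    Collections.sort(batch); // one sorted batch per worker
                    return batch;
                }));
            }
            // Merge the sorted batches instead of inserting keys one by one.
            List<Integer> index = new ArrayList<>();
            for (Future<List<Integer>> run : runs)
                index = merge(index, run.get());
            return index;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("index build interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("index build failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    /** Standard two-way merge of two sorted lists. */
    static List<Integer> merge(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>(a.size() + b.size());
        int i = 0, j = 0;
        while (i < a.size() && j < b.size())
            out.add(a.get(i) <= b.get(j) ? a.get(i++) : b.get(j++));
        while (i < a.size()) out.add(a.get(i++));
        while (j < b.size()) out.add(b.get(j++));
        return out;
    }
}
```

In a real B+Tree the benefit of a sorted batch is that consecutive keys tend to land in the same leaf page, so one descent from the root can serve many insertions; the merge step here only models that effect.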

Optimize IgniteDataStreamer performance

The data streamer is the main tool for fast data loading into Ignite. Currently it is not very efficient because every call to the {{igniteDataStreamer.addData(K, V)}} method requires a lot of work. As a result, a single thread cannot feed the data streamer fast enough, and users have to create many threads to mitigate this. We should optimize this and make the data streamer fast out of the box.
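
The multi-threaded workaround mentioned above might look like the sketch below. The `Streamer` interface is a hypothetical stand-in that mirrors only the `addData(K, V)` call, and `MultiThreadedFeed` is an illustrative name; the sketch assumes the streamer is safe to call from several threads at once.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in that mirrors only the addData(K, V) call. */
interface Streamer<K, V> {
    void addData(K key, V val);
}

final class MultiThreadedFeed {
    /** Feeds keys [0, total) into the streamer, one stripe of keys per thread. */
    static void feed(Streamer<Integer, String> streamer, int total, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int first = t;
            pool.execute(() -> {
                // Thread t handles keys t, t + threads, t + 2 * threads, ...
                for (int key = first; key < total; key += threads)
                    streamer.addData(key, "value-" + key);
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES))
                throw new IllegalStateException("feed did not finish in time");
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("feed interrupted", e);
        }
    }
}
```

For a quick local check, a `ConcurrentHashMap` can play the role of the sink, for example `MultiThreadedFeed.feed(map::put, 1000, 4)`. The goal stated above is to make this kind of user-side threading unnecessary.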

Experimental

Experimental improvements are either hard or nearly impossible to implement without serious changes to the architecture. However, it is worth at least estimating the positive impact of these changes.

IGNITE-6412 Bypass GridCacheMapEntry altogether when doing data load
IGNITE-6410 Add
Disable WAL for some caches when doing bulk data load
Add data to new pages rather than to existing pages to minimize free-list overhead
Perform cache scan through data pages rather than primary index pages

Index improvements:

...

Risks and Assumptions

Binary compatibility should be preserved to allow startup with persistent data created by previous versions. The page format should either be left unchanged, or changed with the ability to disable the new optimizations and roll back to the previous format.

Discussion Links

N/A

Reference Links

N/A

...

Tickets

ASF JIRA issues matching the query: project = Ignite AND labels IN (iep-1) ORDER BY status
