ID	IEP-62
Author	Alex Plehanov
Sponsor	Alex Plehanov
Created	25 Nov 2020
Status	DRAFT

Motivation

Currently, for page replacement (page rotation between page-memory and disk) we use Random-LRU algorithm. It has a low maintenance cost and relatively simple implementation, but it has many disadvantages and affects performance very much when replacement is started. We even have warnings in the log when page replacement started and a special event for this. On some deployments, administrators even force to restart cluster nodes periodically to avoid page replacement.

Description

Proposals to improve page replacement in Ignite:

Batch page replacement

Main idea: in some cases start background task to evict cold pages from page-memory (for example, pages, last touched more than 12 hours ago).

The task can be started:

Automatically, triggered by some events, for example, when we expect a start of Random-LRU page replacing soon (allocated more than 90% of page-memory) + we have enough amount of cold pages (we need some metric to calculate the number of cold pages) + some time passed since last batch page replacement (to avoid too much resource consumption by background batch replacement).
Manually (JMX or control.sh), if an administrator wants to control the time of batch replacement more precisely (for example, to avoid the start of this task during peak time).

Batch page replacement will be helpful in some workloads (when some data much colder than another). For example, when OLAP request over historical data raised pages to page-memory, and after such request, this data is not needed for a long time. Or when OLTP transactions mostly add new data and process recent data but rarely touch historical data. In these cases with the current approach, we will enter "page replacement mode" after some period of time and never leave it. Batch page replacement here can prevent the starting of Random-LRU page replacement, or if Random-LRU already started it can provide conditions to stop it.

Change the page replacement algorithm

Good page replacement algorithm should satisfy the requirements:

low page-fault rates for typical workload
low maintenance cost (low resource consumption to maintain additional structures required for page replacement)
fast searching of next page for replacement
sequential scans resistance (one sequential scan should not evict all relatively hot pages from page-memory)

Our Random-LRU has low maintenance cost and sequential scan resistant, but to find the next page for replacement in the best case we scan 5 pages, in the worst case we can scan all data region segment. Also, due to random nature, it's not very effective in predicting the right page for replacement to minimize the page-fault rate. And it's much time required to totally evict old cold data.

Usually, database management systems and operating systems use modifications of LRU algorithms. These algorithms have higher maintenance costs (pages list should be modified on each page access), but often they are effective from a "page-fault rate" point of view and have O(1) complexity for a searching page to replace. Simple LRU is not sequential scan resistant, but modifications that utilize page access frequency are resistant to sequential scan. Some databases uses CLOCK as page replacement algorithm.

We can try one of the modifications of LRU as well (for example, "segmented LRU" seems suitable for Ignite) or CLOCK algorithm (or one of its modifications).

Random-LRU algorithm

Random-LRU algorithm. Every time page is accessed, its timestamp gets updated. When a page fault occurs and it's required to replace some pages, the algorithm randomly chooses 5 pages from the page memory and evicts a page with the latest timestamp.

Segmented-LRU algorithm

Segmented-LRU algorithm is a scan-resistant variation of the Least Recently Used (LRU) algorithm. Segmented-LRU pages list is divided into two segments, a probationary segment, and a protected segment. Pages in each segment are ordered from the least to the most recently accessed. New pages are added to the most recently accessed end (tail) of the probationary segment. Existing pages are removed from wherever they currently reside and added to the most recently accessed end of the protected segment. Pages in the protected segment have thus been accessed at least twice. The protected segment is finite, so migration of a page from the probationary segment to the protected segment may force the migration of the LRU page in the protected segment to the most recently used end of the probationary segment, giving this page another chance to be accessed before being replaced. Page to replace is polled from the least recently accessed end (head) of the probationary segment.

CLOCK algorithm

The clock algorithm keeps a circular list of pages in memory, with the "hand" pointing to the last examined page frame in the list. When a page fault occurs and no empty frames exist, then the hit flag of the page is inspected at the hand's location. If the hit flag is 0, the new page is put in place of the page the "hand" points to, and the hand is advanced one position. Otherwise, the hit flag is cleared, then the clock hand is incremented and the process is repeated until a page is replaced.

Risks and Assumptions

Ignite is a memory-centric product, so "low maintenance cost" is very critical. And there is a risk that page replacement algorithm can affect workloads, where page replacement is not used (enough RAM to store all data). Any page replacement solution should be carefully benchmarked with different workloads.

Discussion Links

http://apache-ignite-developers.2346864.n4.nabble.com/DISCUSS-Page-replacement-improvement-tp50204.html

Reference Links

https://en.wikipedia.org/wiki/Page_replacement_algorithm

https://en.wikipedia.org/wiki/Cache_replacement_policies

Tickets

key	summary	type	created	assignee	reporter	priority	status	resolution
JQL and issue key arguments for this macro require at least one Jira application link to be configured

Page tree

IEP-62 Page replacement improvements