When a dump is created, each partition's data is stored in a separate file. This is performed concurrently by several dump executor threads. At the same time, user updates may trigger storing dump entries from transaction threads to any partition, not necessarily the partitions currently being processed by the dump executor. From the disk's point of view, frequent random switches between several files hurt overall throughput, and throughput is one of the most important characteristics of the dump creation process.
The main idea is to write a large number of bytes per disk write operation, minimizing switches between files and minimizing other overhead work between writes.
The file write operation should send a large number of bytes. There are 3 options for this
Need to investigate: assuming we want to write the same total number of bytes, over what range of ByteBuffer array sizes do write(ByteBuffer) and write(ByteBuffer[]) perform the same?
The ByteBuffer trip cycle will look like this:
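One way to start this investigation is a small timing harness. The following is a minimal sketch, not a rigorous benchmark: `WriteComparison` and its methods are hypothetical names, and only `FileChannel.write(ByteBuffer)` and `FileChannel.write(ByteBuffer[])` are actual JDK calls. It does no warm-up and no fsync, so the timings are only indicative; a JMH benchmark on the target storage would be needed for real conclusions.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: writes the same bytes buffer by buffer and as one gathering write. */
final class WriteComparison {
    /** Returns {bytesOneByOne, bytesGathering, nanosOneByOne, nanosGathering}. */
    static long[] compare(int count, int size) {
        try {
            Path single = Files.createTempFile("dump-single", ".bin");
            Path gather = Files.createTempFile("dump-gather", ".bin");
            try {
                long t1 = writeOneByOne(single, buffers(count, size));
                long t2 = writeGathering(gather, buffers(count, size));
                return new long[] {Files.size(single), Files.size(gather), t1, t2};
            } finally {
                Files.deleteIfExists(single);
                Files.deleteIfExists(gather);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** One write(ByteBuffer) call sequence per buffer. */
    static long writeOneByOne(Path file, ByteBuffer[] bufs) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
            long start = System.nanoTime();
            for (ByteBuffer buf : bufs) {
                while (buf.hasRemaining())
                    ch.write(buf);
            }
            return System.nanoTime() - start;
        }
    }

    /** A gathering write(ByteBuffer[]) over the whole array, repeated until all bytes are written. */
    static long writeGathering(Path file, ByteBuffer[] bufs) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
            long remaining = 0;
            for (ByteBuffer buf : bufs)
                remaining += buf.remaining();
            long start = System.nanoTime();
            while (remaining > 0)
                remaining -= ch.write(bufs);
            return System.nanoTime() - start;
        }
    }

    /** Allocates zero-filled heap buffers that are fully readable (position 0, limit = size). */
    private static ByteBuffer[] buffers(int count, int size) {
        ByteBuffer[] bufs = new ByteBuffer[count];
        for (int i = 0; i < count; i++)
            bufs[i] = ByteBuffer.allocate(size);
        return bufs;
    }
}
```

Calling `WriteComparison.compare(count, size)` for a fixed total (for example 64 MB) while varying `count` and `size` would show where the two timings start to diverge.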
ByteBufferPool → partition/transaction thread → Partition Queue → Partition Queue Dispatcher Thread → Disk Writer Queue → Disk Writer Thread → Free Queue → ByteBuffer Releaser Thread → ByteBufferPool
| ByteBufferPool | → | partition thread / transaction thread | → | Partition Queue |
| ↑ | | | | ↓ |
| ByteBuffer Releaser Thread | | | | Partition Queue Dispatcher Thread |
| ↑ | | | | ↓ |
| Free Queue | ← | Disk Writer Thread | ← | Disk Writer Queue |
The current solution uses thread-local ByteBuffers with expanding size. This is fine for single-threaded usage, but not suitable for passing buffers to other threads. It is also not a good fit for transaction threads.
We can use a pool of ByteBuffers that provides newly allocated or reused ByteBuffers and doesn't exceed its predefined limit. For example,
class ByteBufferPool
ByteBufferPool(size) - constructor taking the maximum number of bytes that all buffers together may allocate.
Using only buffers whose size is a power of 2 simplifies buffer maintenance and lookup. Internal structure to use:
List<ByteBuffer>[]
At position i there will be a list of available buffers of size 2^i.
On average the buffers will be filled to about 75%, assuming requested sizes are spread uniformly between two consecutive powers of 2.
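The bucketed structure described above could be sketched as follows. This is only an illustration under stated assumptions: the `acquire`/`release` signatures and the `bucket` helper are hypothetical, heap buffers are used, and `acquire` returns `null` when the limit would be exceeded instead of waiting or combining smaller buffers.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Sketch of a size-limited pool that hands out buffers with power-of-2 capacities. */
final class ByteBufferPool {
    private static final int MAX_POWER = 30; // largest capacity is 2^30 bytes

    private final long maxBytes;           // limit for all buffers together
    private long allocatedBytes;           // bytes allocated so far (in use + idle)
    private final List<ByteBuffer>[] free; // free[i] holds idle buffers of capacity 2^i

    @SuppressWarnings("unchecked")
    ByteBufferPool(long maxBytes) {
        this.maxBytes = maxBytes;
        free = new List[MAX_POWER + 1];
        for (int i = 0; i < free.length; i++)
            free[i] = new ArrayList<>();
    }

    /** Returns a buffer with capacity >= minSize, or null if the pool limit would be exceeded. */
    synchronized ByteBuffer acquire(int minSize) {
        int idx = bucket(minSize);
        // Reuse the smallest idle buffer that is large enough.
        for (int i = idx; i < free.length; i++) {
            List<ByteBuffer> list = free[i];
            if (!list.isEmpty())
                return list.remove(list.size() - 1);
        }
        int cap = 1 << idx;
        if (allocatedBytes + cap > maxBytes)
            return null; // assumption: the caller decides whether to wait or combine buffers
        allocatedBytes += cap;
        return ByteBuffer.allocate(cap);
    }

    /** Resets the buffer and puts it back into the list for its capacity. */
    synchronized void release(ByteBuffer buf) {
        buf.clear();
        free[bucket(buf.capacity())].add(buf);
    }

    /** Index i of the smallest power of 2 such that 2^i >= size. */
    static int bucket(int size) {
        if (size < 1 || size > (1 << MAX_POWER))
            throw new IllegalArgumentException("Unsupported size: " + size);
        return size == 1 ? 0 : 32 - Integer.numberOfLeadingZeros(size - 1);
    }
}
```

With this layout a request for 200k lands in the 256k bucket, and looking up a reusable buffer is a scan over at most 31 lists.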
Let's assume there is a request for a 200k buffer, and many 128k buffers are allocated, but no buffer larger than 128k is allocated in the pool and the pool has no remaining capacity. In this case we will take two 128k ByteBuffers and use them in a ByteBuffersWrapper.
Let's assume there is a request for an 11Mb buffer and the pool limit is 10Mb. In this case we will wait until all buffers return to the pool, take them all, allocate a new 1Mb HeapByteBuffer, and wrap them all into a ByteBuffersWrapper. When the buffers are released, we will return only the 10Mb of pooled buffers to the pool. The new 1Mb buffer will be left to the GC.
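ByteBuffersWrapper wraps several ByteBuffers and extends ByteBuffer. It is created in ByteBufferPool#acquire and destroyed in ByteBufferPool#release when all internal buffers are returned to the pool. Note that java.nio.ByteBuffer cannot be subclassed outside the JDK (its constructors are not accessible), so in practice the wrapper may have to expose a ByteBuffer-like API or a ByteBuffer[] instead.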
We often see that disk IO is the bottleneck in Ignite, so we should try to make writing to disk efficient.
There should be a separate thread that saves data to disk, and it should do minimal work besides writing to disk. For example, it could take buffers from a queue and write them to a file. The buffers should be prepared by another thread, and returning buffers to the pool could also be delegated to another thread.
For desktop PCs it doesn't make sense to use more than one writer thread. But for servers using RAID storage, writing to several files in parallel could be faster overall. The solution should be built on the assumption that there could be several writers. We need to make sure that writers don't pick up data for the same file.
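A minimal sketch of such a writer step is shown below. The names `DiskWriteJob`, `DiskWriter` and `writeNext` are hypothetical; the sketch assumes each job already carries an open channel and buffers that are ready for reading, and it writes buffer by buffer through `WritableByteChannel` (which `FileChannel` implements) rather than with a gathering write.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;
import java.util.concurrent.BlockingQueue;

/** Hypothetical unit of work: buffers prepared by another thread plus the target channel. */
final class DiskWriteJob {
    final WritableByteChannel channel;
    final ByteBuffer[] buffers;

    DiskWriteJob(WritableByteChannel channel, ByteBuffer[] buffers) {
        this.channel = channel;
        this.buffers = buffers;
    }
}

/** Sketch of a disk writer that only writes and hands buffers over for release. */
final class DiskWriter {
    private final BlockingQueue<DiskWriteJob> jobs;
    private final BlockingQueue<ByteBuffer> freeQueue;

    DiskWriter(BlockingQueue<DiskWriteJob> jobs, BlockingQueue<ByteBuffer> freeQueue) {
        this.jobs = jobs;
        this.freeQueue = freeQueue;
    }

    /** Writes one queued job if present; returns bytes written, or -1 if the queue is empty. */
    long writeNext() {
        DiskWriteJob job = jobs.poll();
        if (job == null)
            return -1;
        long written = 0;
        try {
            for (ByteBuffer buf : job.buffers) {
                while (buf.hasRemaining())
                    written += job.channel.write(buf);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Returning buffers to the pool is left to whoever consumes the free queue.
        for (ByteBuffer buf : job.buffers)
            freeQueue.add(buf);
        return written;
    }
}
```

A writer thread would call `writeNext()` in a loop (or use a blocking `take()` instead of `poll()`); keeping the step free of pool bookkeeping is the point of the design described above.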
There will be a separate blocking queue for each partition, providing fast information about its fullness.
The thread will periodically check the disk writer queue size, making sure that the disk writer always has enough data to process. So, the minimum number of elements will be 2 for a single disk writer. Once the size goes below this minimum, the thread will scan the sizes of the partition queues and choose the winner with the maximum number of bytes. After that it will take all buffers from that queue and put them, together with file info (channel), into the disk writer queue.
In the case of multiple disk writers, a flag should additionally be passed; it will be taken into account when choosing the largest partition queue and will be reset on buffer release.
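The selection step described above could look roughly like this. It is a sketch under assumptions, not a proposed implementation: `PartitionQueue`, `WriteTask`, `PartitionQueueDispatcher` and their members are hypothetical names, a partition id stands in for the file info (channel), the per-partition queue is a plain synchronized deque rather than a blocking queue, and the "busy" flag is applied regardless of the number of writers.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.BlockingQueue;

/** Hypothetical per-partition queue that tracks how many bytes it holds. */
final class PartitionQueue {
    final int partId;
    private final Deque<ByteBuffer> buffers = new ArrayDeque<>();
    private long bytes;
    private boolean busy; // true while a disk writer owns this partition's data

    PartitionQueue(int partId) { this.partId = partId; }

    synchronized void add(ByteBuffer buf) {
        buffers.add(buf);
        bytes += buf.remaining();
    }

    synchronized long bytes() { return bytes; }

    synchronized boolean busy() { return busy; }

    /** Takes all queued buffers and marks the partition as being written. */
    synchronized List<ByteBuffer> drainForWrite() {
        List<ByteBuffer> out = new ArrayList<>(buffers);
        buffers.clear();
        bytes = 0;
        busy = true;
        return out;
    }

    /** To be called on buffer release, once the writer is done with this partition. */
    synchronized void writeFinished() { busy = false; }
}

/** Hypothetical batch handed to a disk writer. */
final class WriteTask {
    final int partId;
    final List<ByteBuffer> buffers;

    WriteTask(int partId, List<ByteBuffer> buffers) {
        this.partId = partId;
        this.buffers = buffers;
    }
}

final class PartitionQueueDispatcher {
    static final int MIN_WRITER_QUEUE_SIZE = 2; // minimum for a single disk writer

    private final List<PartitionQueue> partitions;
    private final BlockingQueue<WriteTask> writerQueue;

    PartitionQueueDispatcher(List<PartitionQueue> partitions, BlockingQueue<WriteTask> writerQueue) {
        this.partitions = partitions;
        this.writerQueue = writerQueue;
    }

    /** One dispatch step; returns true if a batch was moved to the disk writer queue. */
    boolean dispatchOnce() {
        if (writerQueue.size() >= MIN_WRITER_QUEUE_SIZE)
            return false;
        PartitionQueue winner = null;
        long winnerBytes = 0;
        for (PartitionQueue q : partitions) {
            long b = q.bytes();
            if (!q.busy() && b > winnerBytes) {
                winner = q;
                winnerBytes = b;
            }
        }
        if (winner == null)
            return false;
        writerQueue.add(new WriteTask(winner.partId, winner.drainForWrite()));
        return true;
    }
}
```

The dispatcher thread would call `dispatchOnce()` periodically; the linear scan is cheap for the default 1024 partitions, though that is an expectation rather than a measured result.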
They could be simple ArrayBlockingQueues.
Open question: does it make sense to use a non-blocking queue here for faster put/take operations?
Does 3 operations
Takes buffers from the queue and returns them to the pool. This operation will probably be fast enough that an extra thread could be redundant.
The solution assumes that all partition dump files are open simultaneously. This could break when the number of partitions (default 1024) is bigger than the OS limit for open files (ulimit -n).
We can address this issue like this
In case of encryption, 2 buffers will be required. The first one, as usual, will hold the serialized entry, and the second one will store the encrypted data from the first buffer. After encryption, the first buffer will be returned to the pool. The second buffer will go to the partition queue.
Note: Avoid deadlock when processing large objects not fitting into pool size.
Compression is done after encryption (if enabled).
This can remain in the same thread (partition/transaction thread). But because of high CPU usage and decreased output size, the bottleneck might move from disk writing to compression. If this is observed, we should move compression to another thread.
Compression output is a sequence of buffers that can't be reordered. Compression for a single partition can't be done in 2 threads simultaneously.
The required buffer size isn't known when requesting a buffer from the pool. It is preferable to use medium-size buffers. It is ok to get a slightly smaller buffer, fill it, send it to the queue, and request another medium-size buffer from the pool.
Once dump creation is over, all resources will be cleaned up.
In case of an error the whole execution will stop. Resuming dump creation isn't considered; the only option is to rerun dump creation.
What happens if a dump entry doesn't fit into pool size?
Is the order of entries in the dump important? Is it acceptable to write ver 3 (from a transaction thread) before ver 2 (from a partition thread)?
HeapByteBuffer or DirectByteBuffer?
Is there protection against multiple concurrent dump creations? Do we need one?
There are several important points to consider when writing data to SSD storage
A typical throughput versus write block size chart will look like this.
One of the configuration aims is to find the minimum block size at which random and sequential writes show the same throughput.