...
Need to investigate: assuming the same total number of bytes is written, over what range of ByteBuffer array sizes do write(ByteBuffer) and write(ByteBuffer[]) cost the same?
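To make the question above concrete, here is a minimal sketch of the two write styles against a FileChannel. The class WriteStyles and its methods are hypothetical helpers for an experiment, not existing Ignite code, and the choice of direct buffers is an assumption. Timing each method for different buffer counts and sizes would be one way to find the range in question.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper for comparing the two write styles; not part of Ignite. */
final class WriteStyles {
    /** Creates {@code count} zero-filled direct buffers of {@code size} bytes each. */
    static ByteBuffer[] chunks(int count, int size) {
        ByteBuffer[] bufs = new ByteBuffer[count];
        for (int i = 0; i < count; i++)
            bufs[i] = ByteBuffer.allocateDirect(size);
        return bufs;
    }

    /** Writes each buffer with its own write(ByteBuffer) calls. Returns bytes written. */
    static long writeOneByOne(FileChannel ch, ByteBuffer[] bufs) throws IOException {
        long total = 0;
        for (ByteBuffer buf : bufs) {
            // A single call may write fewer bytes than remaining, so loop.
            while (buf.hasRemaining())
                total += ch.write(buf);
        }
        return total;
    }

    /** Writes all buffers with gathering write(ByteBuffer[]) calls. Returns bytes written. */
    static long writeGathering(FileChannel ch, ByteBuffer[] bufs) throws IOException {
        long expected = 0;
        for (ByteBuffer buf : bufs)
            expected += buf.remaining();

        long total = 0;
        // The gathering write may also be partial, so repeat until everything is written.
        while (total < expected)
            total += ch.write(bufs);
        return total;
    }
}
```

Both methods produce the same bytes on disk; only the number of write invocations differs, which is the variable the investigation is about.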
The ByteBuffer trip cycle will look like this:
ByteBufferPool → partition/transaction thread → Partition Queue → Partition Queue Dispatcher Thread → Disk Writer Queue → Disk Writer Thread → Free Queue → ByteBuffer Releaser Thread → ByteBufferPool
| ByteBufferPool | → | partition thread / transaction thread | → | Partition Queue |
| ↑ | | | | ↓ |
| ByteBuffer Releaser Thread | | | | Partition Queue Dispatcher Thread |
| ↑ | | | | ↓ |
| Free Queue | ← | Disk Writer Thread | ← | Disk Writer Queue |
The current solution uses thread-local ByteBuffers that expand as needed. This is fine for single-threaded usage, but such buffers are not suitable for passing to other threads, and they are also a poor fit for transaction threads.
...
ByteBufferPool(size) - constructor taking the maximum total number of bytes that all buffers together may allocate.
Using only buffers whose size is a power of 2 simplifies buffer maintenance and lookup. Internal structure to use:
List<ByteBuffer>[]
At position i there will be a list of available buffers of size 2^i.
On average, the buffers will be filled to 75%, assuming requested sizes are spread uniformly between one power of 2 and the next.
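The pool described above could be sketched roughly as follows. Only the ByteBufferPool(size) constructor and the List<ByteBuffer>[] structure come from the design; the method names acquire, release and bucket, the use of heap buffers, the 2^30 size cap, and returning null when the byte limit would be exceeded are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Minimal sketch of a power-of-two buffer pool; method names and policies are assumptions. */
final class ByteBufferPool {
    /** Largest supported buffer size is 2^30 bytes (assumption). */
    private static final int MAX_POWER = 30;

    private final long maxBytes;
    private long allocatedBytes;

    /** At position i: available buffers of capacity 2^i. */
    private final List<ByteBuffer>[] free;

    @SuppressWarnings("unchecked")
    ByteBufferPool(long maxBytes) {
        this.maxBytes = maxBytes;
        free = new List[MAX_POWER + 1];
        for (int i = 0; i < free.length; i++)
            free[i] = new ArrayList<>();
    }

    /** Returns the smallest i such that 2^i >= size, for size >= 1. */
    static int bucket(int size) {
        return 32 - Integer.numberOfLeadingZeros(size - 1);
    }

    /** Returns a buffer with capacity >= size, or null if the byte limit would be exceeded. */
    synchronized ByteBuffer acquire(int size) {
        if (size < 1 || size > (1 << MAX_POWER))
            throw new IllegalArgumentException("Unsupported size: " + size);

        int i = bucket(size);
        List<ByteBuffer> available = free[i];
        if (!available.isEmpty())
            return available.remove(available.size() - 1);

        int capacity = 1 << i;
        if (allocatedBytes + capacity > maxBytes)
            return null; // Assumption: the caller decides whether to wait, retry or fail.

        allocatedBytes += capacity;
        return ByteBuffer.allocate(capacity);
    }

    /** Resets the buffer and makes it available again in its size class. */
    synchronized void release(ByteBuffer buf) {
        buf.clear();
        free[bucket(buf.capacity())].add(buf);
    }
}
```

Because every capacity is a power of 2, finding the right list is a single bit operation and no search across size classes is needed. This sketch never moves memory between size classes, so a real implementation would still need a policy for that.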
We can often see that disk I/O is the bottleneck in Ignite, so we should try to make writing to disk efficient.
...
For desktop PCs it doesn't make sense to use more than one writer thread. But for servers using RAID storage, writing to several files in parallel could be faster overall. The solution should be built on the assumption that there could be several writers. We need to make sure that writers don't pick up data for the same file.
There will be a separate blocking queue for each partition, providing fast access to information about how full it is.
The thread will periodically check the disk writer queue size to make sure that the disk writer always has enough data to process. So the minimum number of elements will be 2 for a single disk writer. Once the size drops below this minimum, the thread will scan the sizes of the partition queues and choose the winner with the maximum number of bytes. After that it will take all buffers from that queue and put them, together with file info (channel), into the disk writer queue.
In the case of multiple disk writers, a flag should additionally be passed; it will be taken into account when choosing the largest partition queue and will be reset on buffer release.
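One dispatcher pass, as described above, might look like the sketch below. The classes PartitionQueue, WriteTask and Dispatcher, the inFlight flag name, and the use of LinkedBlockingQueue are assumptions for illustration. The sketch also assumes a single dispatcher thread, and it hands the partition queue itself to the writer as a stand-in for the file info (channel).

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical per-partition queue that tracks its pending byte count. */
final class PartitionQueue {
    final int partId;
    final BlockingQueue<ByteBuffer> buffers = new LinkedBlockingQueue<>();
    final AtomicLong pendingBytes = new AtomicLong();

    /** Set while a disk writer owns this partition's file; to be reset on buffer release. */
    final AtomicBoolean inFlight = new AtomicBoolean();

    PartitionQueue(int partId) {
        this.partId = partId;
    }

    void put(ByteBuffer buf) {
        pendingBytes.addAndGet(buf.remaining());
        buffers.add(buf);
    }
}

/** Hypothetical unit of work handed to a disk writer. */
final class WriteTask {
    final PartitionQueue source;
    final List<ByteBuffer> buffers;

    WriteTask(PartitionQueue source, List<ByteBuffer> buffers) {
        this.source = source;
        this.buffers = buffers;
    }
}

final class Dispatcher {
    /** Minimum number of tasks to keep queued for a single disk writer. */
    static final int MIN_QUEUED_TASKS = 2;

    /** Runs one dispatcher pass. Returns the task that was queued, or null if nothing was done. */
    static WriteTask dispatchOnce(List<PartitionQueue> parts, BlockingQueue<WriteTask> diskWriterQueue) {
        if (diskWriterQueue.size() >= MIN_QUEUED_TASKS)
            return null; // The disk writer still has enough data to process.

        // Choose the partition with the most pending bytes that no writer currently owns.
        PartitionQueue winner = null;
        long maxBytes = 0;
        for (PartitionQueue q : parts) {
            long bytes = q.pendingBytes.get();
            if (!q.inFlight.get() && bytes > maxBytes) {
                winner = q;
                maxBytes = bytes;
            }
        }
        if (winner == null)
            return null;

        winner.inFlight.set(true);

        // Take all buffers currently in the winner's queue and hand them over together.
        List<ByteBuffer> drained = new ArrayList<>();
        winner.buffers.drainTo(drained);
        long drainedBytes = 0;
        for (ByteBuffer buf : drained)
            drainedBytes += buf.remaining();
        winner.pendingBytes.addAndGet(-drainedBytes);

        WriteTask task = new WriteTask(winner, drained);
        diskWriterQueue.add(task);
        return task;
    }
}
```

Keeping a running byte counter next to each queue means the scan only reads one number per partition instead of walking the queued buffers.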
They could be simple ArrayBlockingQueues.
Open question: Does it make sense to use a non-blocking queue here for faster read/write operations on the queue?
Does 3 operations
Takes buffers from the queue and returns them to the pool. This operation will probably be fast enough that an extra thread would be redundant.
What happens if a dump entry doesn't fit into the pool size?
One of the proposed ideas was to switch from writing to several partition-specific files to writing a single dump file. This idea wasn't considered in depth because of the complexity of the change and because it limits multi-threaded I/O, which could be beneficial on some server storage. It is also still possible to achieve sequential writes with multiple partition files.