Status

...

Page properties

Discussion thread

...

	https://lists.apache.org/thread

...

JIRA
/rm69c7wdfmqgz6k851cq59txy15c3f5z
Vote thread

...

Jira	FLINK-19582 (ASF JIRA)
Jira	FLINK-19614 (ASF JIRA)

...

Release

1.13

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Hash-based blocking shuffle and sort-merge based blocking shuffle are the two main blocking shuffle implementations widely adopted by existing distributed data processing frameworks. The hash-based implementation writes data sent to different reducer tasks into separate files concurrently, while the sort-merge based approach writes that data together into a single file and merges small files into bigger ones. Compared to the sort-merge based approach, the hash-based approach has several weak points when it comes to running large scale batch jobs:

Stability: For high parallelism (tens of thousands) batch jobs, the current hash-based blocking shuffle implementation writes too many files concurrently, which puts high pressure on the file system, for example, maintenance of too many file metas and exhaustion of inodes or file descriptors. All of these are potential stability issues. Sort-merge based blocking shuffle does not have this problem because, for one result partition, only one file is written at a time.
Performance: Large numbers of small shuffle files and random IO can hurt shuffle performance a lot, especially on HDD (on SSD, sequential read is also important because of read-ahead and caching). For batch jobs processing massive data, a small amount of data per subpartition is common because of high parallelism. Besides, data skew is another cause of small subpartition files. By writing data of all subpartitions together into one file and leveraging IO scheduling, more sequential reads can be achieved.
Resource: For the current hash-based implementation, each subpartition needs at least one buffer. For large scale batch shuffles, the memory consumption can be huge. For example, at least 320M of network memory is needed per result partition if the parallelism is set to 10000. Because of this huge network memory consumption, it is hard to configure the network memory for large scale batch jobs, and sometimes the parallelism cannot be increased just because of insufficient network memory, which leads to a bad user experience.
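The 320M figure in the Resource point follows from needing one buffer per subpartition. A minimal Java sketch of that arithmetic, assuming a network buffer size of 32 KB (the buffer size is not stated in this document, and the class and method names are hypothetical, not Flink APIs):

```java
// Hypothetical helper, not part of Flink: estimates the minimum network memory
// a hash-based blocking result partition needs (one buffer per subpartition).
final class HashShuffleMemoryEstimate {

    // Assumption: 32 KB per network buffer.
    static final int BUFFER_SIZE_BYTES = 32 * 1024;

    static long minNetworkMemoryBytes(int numSubpartitions, int bufferSizeBytes) {
        // Widen to long before multiplying to avoid int overflow at high parallelism.
        return (long) numSubpartitions * bufferSizeBytes;
    }

    public static void main(String[] args) {
        long bytes = minNetworkMemoryBytes(10_000, BUFFER_SIZE_BYTES);
        // 10000 * 32768 = 327680000 bytes, i.e. roughly 320M per result partition.
        System.out.println(bytes + " bytes");
    }
}
```

Because the cost grows linearly with the number of subpartitions, doubling the parallelism doubles the minimum network memory of every result partition.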

By introducing a sort-merge based blocking shuffle implementation to Flink, we can improve Flink’s capability of running large scale batch jobs.

Public Interfaces

Several new config options will be added to control the behavior of the sort-merge based blocking shuffle. Sort-merge based blocking shuffle is disabled by default, so the default behavior of blocking shuffle stays unchanged.

For small parallelism, hash-based blocking shuffle will be used, and for large parallelism, sort-merge based blocking shuffle will be used.

Config Option	Description
taskmanager.network.sort-shuffle.min-buffers	Minimum number of network buffers required per sort-merge blocking result partition.
taskmanager.network.sort-shuffle.min-parallelism	Parallelism threshold to switch between sort-merge based blocking shuffle and the default hash-based blocking shuffle.
taskmanager.memory.framework.off-heap.batch-shuffle.size	Size of direct memory used by blocking shuffle for shuffle data read.

A fixed number of network buffers per result partition decouples the memory consumption from the parallelism, which is friendlier to large scale batch jobs.
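The switch between the two implementations and its effect on buffer demand can be sketched as follows. This is only an illustration: the class, enum and method names are hypothetical, the threshold is assumed to be inclusive, and the numeric values in main are made up rather than Flink defaults.

```java
// Hypothetical sketch, not Flink code: picks a blocking shuffle implementation
// from the parallelism threshold and reports the minimum buffers it would need.
final class ShuffleSelectionSketch {

    enum ShuffleType { HASH, SORT_MERGE }

    // Assumption: parallelism equal to the threshold already selects sort-merge.
    static ShuffleType select(int parallelism, int minParallelism) {
        return parallelism >= minParallelism ? ShuffleType.SORT_MERGE : ShuffleType.HASH;
    }

    // Hash-based: at least one buffer per subpartition.
    // Sort-merge: a fixed minimum, independent of the parallelism.
    static int minBuffers(ShuffleType type, int parallelism, int sortShuffleMinBuffers) {
        return type == ShuffleType.HASH ? parallelism : sortShuffleMinBuffers;
    }

    public static void main(String[] args) {
        int minParallelism = 1000; // illustrative value
        int sortMinBuffers = 512;  // illustrative value
        for (int parallelism : new int[] {100, 10_000}) {
            ShuffleType type = select(parallelism, minParallelism);
            System.out.println(parallelism + " -> " + type
                    + ", min buffers " + minBuffers(type, parallelism, sortMinBuffers));
        }
    }
}
```

With these illustrative values, a job with parallelism 100 stays on the hash-based path and needs at least 100 buffers, while a job with parallelism 10000 switches to sort-merge and needs only the fixed 512.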

Proposed Changes

Data Shuffle Process

Each result partition holds a SortBuffer. Serialized records and events will be appended to the SortBuffer until it is full or EOF is reached.
Then the PartitionedFileWriter will spill all data in the SortBuffer as one PartitionedFile in subpartition index order, and at the same time the partition offset information will also be saved.
MergePolicy will collect information about all spilled PartitionedFiles and select a subset or all of the files to be merged according to the number of files and the file size.
PartitionedFileMerger then merges all the selected PartitionedFiles into one PartitionedFile.
After the SortMergeResultPartition is finished, the consumer task can request the partition data, and a SortMergeSubpartitionReader will be created to read the corresponding data.
The IO scheduler will schedule all the shuffle data reads in an IO friendly order, i.e. reading the shuffle data file sequentially.
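The spill step above, writing data in subpartition index order while saving offset information, can be sketched with an in-memory stand-in for the file. Everything in this block (class names, the byte-array "file", the offsets layout) is a hypothetical simplification and not the actual PartitionedFile format:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory stand-in for the spill step; not Flink code.
final class SpillSketch {

    /** Spilled bytes plus the starting offset of every subpartition. */
    static final class SpilledFile {
        final byte[] data;
        final int[] offsets; // offsets[i] = start of subpartition i; last entry = total size

        SpilledFile(byte[] data, int[] offsets) {
            this.data = data;
            this.offsets = offsets;
        }

        /** Reads all bytes of one subpartition as a single contiguous range. */
        byte[] read(int subpartition) {
            return Arrays.copyOfRange(data, offsets[subpartition], offsets[subpartition + 1]);
        }
    }

    /** buffered.get(i) holds the records appended for subpartition i. */
    static SpilledFile spill(List<List<byte[]>> buffered) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int[] offsets = new int[buffered.size() + 1];
        for (int sub = 0; sub < buffered.size(); sub++) {
            offsets[sub] = out.size(); // remember where this subpartition starts
            for (byte[] record : buffered.get(sub)) {
                out.write(record, 0, record.length);
            }
        }
        offsets[buffered.size()] = out.size();
        return new SpilledFile(out.toByteArray(), offsets);
    }

    public static void main(String[] args) {
        List<List<byte[]>> buffered = new ArrayList<>();
        buffered.add(Arrays.asList(bytes("aa"), bytes("b")));
        buffered.add(new ArrayList<>());
        buffered.add(Arrays.asList(bytes("ccc")));
        SpilledFile file = spill(buffered);
        System.out.println(new String(file.read(2), StandardCharsets.UTF_8));
    }

    private static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}
```

Because each subpartition occupies one contiguous range, a reader needs only two offsets to fetch all of its data with a single sequential read.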

Main Components

SortBuffer: Data of different channels can be appended to a SortBuffer and after the SortBuffer is finished, the appended data can be copied from it in channel index order.

public interface SortBuffer {

    /**
     * Appends data of the specified channel to this {@link SortBuffer} and returns true if all
     * bytes of the source buffer is copied to this {@link SortBuffer} successfully, otherwise if
     * returns false, nothing will be copied.
     */
    boolean append(ByteBuffer source, int targetChannel, Buffer.DataType dataType)
            throws IOException;

    /**
     * Copies data in this {@link SortBuffer} to the target {@link MemorySegment} in channel index
     * order and returns {@link BufferWithChannel} which contains the copied data and the
     * corresponding channel index.
     */
    BufferWithChannel copyIntoSegment(MemorySegment target);

    /** Returns the number of records written to this {@link SortBuffer}. */
    long numRecords();

    /** Returns the number of bytes written to this {@link SortBuffer}. */
    long numBytes();

    /** Returns true if there is still data can be consumed in this {@link SortBuffer}. */
    boolean hasRemaining();

    /** Finishes this {@link SortBuffer} which means no record can be appended any more. */
    void finish();

    /** Whether this {@link SortBuffer} is finished or not. */
    boolean isFinished();

    /** Releases this {@link SortBuffer} which releases all resources. */
    void release();

    /** Whether this {@link SortBuffer} is released or not. */
    boolean isReleased();
}

PartitionedFile: Persistent file type of SortMergeResultPartition and it stores data of all subpartitions in subpartition index order.

public class PartitionedFile {

    public Path getDataFilePath();

    public Path getIndexFilePath();

    public int getNumRegions();

    /**
     * Gets the index entry of the target region and subpartition either from the index data cache
     * or the index data file.
     */
    public void getIndexEntry(FileChannel indexFile, ByteBuffer target, int region, int subpartition)
            throws IOException;

    public void deleteQuietly();
}

PartitionedFileWriter: File writer to write buffers to PartitionedFile in subpartition order.

public class PartitionedFileWriter implements AutoCloseable {

    /**
     * Persists the region index of the current data region and starts a new region to write.
     *
     * <p>Note: The caller is responsible for releasing the failed {@link PartitionedFile} if any
     * exception occurs.
     *
     * @param isBroadcastRegion Whether it's a broadcast region. See {@link #isBroadcastRegion}.
     */
    public void startNewRegion(boolean isBroadcastRegion) throws IOException;

    /**
     * Writes a list of {@link Buffer}s to this {@link PartitionedFile}. It guarantees that after
     * the return of this method, the target buffers can be released. In a data region, all data of
     * the same subpartition must be written together.
     *
     * <p>Note: The caller is responsible for recycling the target buffers and releasing the failed
     * {@link PartitionedFile} if any exception occurs.
     */
    public void writeBuffers(List<BufferWithChannel> bufferWithChannels) throws IOException;

    /**
     * Finishes writing the {@link PartitionedFile} which closes the file channel and returns the
     * corresponding {@link PartitionedFile}.
     *
     * <p>Note: The caller is responsible for releasing the failed {@link PartitionedFile} if any
     * exception occurs.
     */
    public PartitionedFile finish() throws IOException;

    /** Used to close and delete the failed {@link PartitionedFile} when any exception occurs. */
    public void releaseQuietly();

    @Override
    public void close() throws IOException;
}

PartitionedFileReader: Reader which can read all data of the target subpartition from a PartitionedFile.

class PartitionedFileReader {

    /**
     * Reads a buffer from the current region of the target {@link PartitionedFile} and moves the
     * read position forward.
     *
     * <p>Note: The caller is responsible for recycling the target buffer if any exception occurs.
     *
     * @param target The target {@link MemorySegment} to read data to.
     * @param recycler The {@link BufferRecycler} which is responsible to recycle the target buffer.
     * @return A {@link Buffer} containing the data read.
     */
    @Nullable
    public Buffer readCurrentRegion(MemorySegment target, BufferRecycler recycler) throws IOException;

    public boolean hasRemaining() throws IOException;

    /** Gets read priority of this file reader. Smaller value indicates higher priority. */
    public long getPriority();
}

MergePolicy: It is responsible for selecting the PartitionedFiles to be merged into one file.

public interface MergePolicy {

...

SortMergeResultPartition: Entry point of sort-merge based blocking shuffle. (Override methods are inherited from ResultPartition)

public class SortMergeResultPartition extends ResultPartition {

    @Override
    public void setup() throws IOException;

    @Override
    protected void releaseInternal();

    @Override
    public void emitRecord(ByteBuffer record, int targetSubpartition) throws IOException;

    @Override
    public void broadcastRecord(ByteBuffer record) throws IOException;

    @Override
    public void broadcastEvent(AbstractEvent event, boolean isPriorityEvent) throws IOException;

    /**
     * Spills the large record into the target {@link PartitionedFile} as a separate data region.
     */
    private void writeLargeRecord(
            ByteBuffer record, int targetSubpartition, DataType dataType, boolean isBroadcast)
            throws IOException;

    @Override
    public void finish() throws IOException;

    @Override
    public void close();

    @Override
    public ResultSubpartitionView createSubpartitionView(
            int subpartitionIndex, BufferAvailabilityListener availabilityListener)
            throws IOException;

    @Override
    public void flushAll();

    @Override
    public void flush(int subpartitionIndex);

    @Override
    public CompletableFuture<?> getAvailableFuture();

    @Override
    public int getNumberOfQueuedBuffers();

    @Override
    public int getNumberOfQueuedBuffers(int targetSubpartition);
}

SortMergeSubpartitionReader: Subpartition data reader for SortMergeResultPartition. (Override methods are mainly inherited from ResultSubpartitionView)

public class SortMergeSubpartitionReader
        implements ResultSubpartitionView, Comparable<SortMergeSubpartitionReader> {

    @Nullable
    @Override
    public BufferAndBacklog getNextBuffer();

    /** This method is called by the IO thread of {@link SortMergeResultPartitionReadScheduler}. */
    public boolean readBuffers(Queue<MemorySegment> buffers, BufferRecycler recycler) throws IOException;

    public CompletableFuture<?> getReleaseFuture();

    public void fail(Throwable throwable);

    @Override
    public void notifyDataAvailable();

    @Override
    public int compareTo(SortMergeSubpartitionReader that);

    @Override
    public void releaseAllResources();

    @Override
    public boolean isReleased();

    @Override
    public void resumeConsumption();

    @Override
    public Throwable getFailureCause();

    @Override
    public boolean isAvailable(int numCreditsAvailable);

    @Override
    public int unsynchronizedGetNumberOfQueuedBuffers();
}

The SortBuffer interface is flexible enough that new requirements, such as sorting by record, can also be implemented easily if needed.

PartitionedFileMerger: It is responsible for merging the selected list of PartitionedFiles into one file.

public interface PartitionedFileMerger {

...

Further Optimization

IO Scheduling

As discussed above, writing data of all subpartitions together in one file makes it friendlier to sequential reads and writes, which can already improve the IO performance a lot. Besides, we can further improve the IO performance by scheduling the reading and writing IO requests (especially helpful for reading). When shuffling data, sequential reading is restricted by the amount of data of each subpartition, the size of the read buffer and the available credits of the consumer task. The data read pattern can be summarized as reading chunks of data from different subpartitions in parallel. After data of all subpartitions is spilled to one file in subpartition index order, we can rearrange the data read requests, always serve the data in subpartition order and read as much data as possible in one request. By scheduling the read requests, more sequential reads can be achieved and, in the best case, a data file can be read entirely sequentially.
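One way to picture the rearrangement of read requests is a priority queue keyed by the file offset each reader wants next, so that pending reads are served in on-disk order. The sketch below is a hypothetical simplification (class and field names are invented here) and does not model credits or buffer limits:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch, not Flink code: serves pending subpartition reads in
// file-offset order so that the data file is read as sequentially as possible.
final class ReadSchedulingSketch {

    /** A pending read; a smaller next offset means a higher priority. */
    static final class PendingRead implements Comparable<PendingRead> {
        final int subpartition;
        final long nextOffset;

        PendingRead(int subpartition, long nextOffset) {
            this.subpartition = subpartition;
            this.nextOffset = nextOffset;
        }

        @Override
        public int compareTo(PendingRead that) {
            return Long.compare(nextOffset, that.nextOffset);
        }
    }

    /** Returns the subpartition indexes in the order their reads would be served. */
    static List<Integer> schedule(List<PendingRead> pending) {
        PriorityQueue<PendingRead> queue = new PriorityQueue<>(pending);
        List<Integer> order = new ArrayList<>();
        while (!queue.isEmpty()) {
            order.add(queue.poll().subpartition);
        }
        return order;
    }

    public static void main(String[] args) {
        // Requests arrive in arbitrary order, as consumers request data in parallel.
        List<PendingRead> pending = Arrays.asList(
                new PendingRead(2, 4096L), new PendingRead(0, 0L), new PendingRead(1, 1024L));
        System.out.println(schedule(pending)); // [0, 1, 2]
    }
}
```

Since data is stored in subpartition index order, ordering by offset is the same as ordering by subpartition index within one data region.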

Data Compression

Data compression has been implemented for the default hash-based blocking shuffle, which improves the TPC-DS benchmark performance by about 30%. We can also implement data compression for sort-merge based blocking shuffle.

Broadcast Optimization

For the result partition using the broadcast partitioner, we can copy the serialized record only once to the SortBuffer and write only one copy of the data to disk which can reduce CPU usage and file IO a lot.

Multiple Disks Load Balance

If there are multiple disks, load balance is important for good performance. The simplest way to achieve load balance is to rebalance disk selection across the disks.
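Reading "rebalance" as round-robin selection (an interpretation, since the text does not spell out the policy), disk selection could look like the sketch below. The class name and the directory paths are hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not Flink code: rotates new shuffle files across the
// configured directories so that each disk receives a similar number of files.
final class RoundRobinDiskSelector {

    private final List<String> dirs;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinDiskSelector(List<String> dirs) {
        if (dirs.isEmpty()) {
            throw new IllegalArgumentException("at least one directory is required");
        }
        this.dirs = new ArrayList<>(dirs);
    }

    /** Thread-safe: picks the next directory in rotation. */
    String nextDir() {
        // floorMod keeps the index valid even after the counter overflows.
        int index = Math.floorMod(next.getAndIncrement(), dirs.size());
        return dirs.get(index);
    }

    public static void main(String[] args) {
        RoundRobinDiskSelector selector =
                new RoundRobinDiskSelector(Arrays.asList("/disk1/tmp", "/disk2/tmp"));
        for (int i = 0; i < 4; i++) {
            System.out.println(selector.nextDir());
        }
    }
}
```

Round-robin balances the number of files rather than the number of bytes, so skewed partitions can still load disks unevenly.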

Restrict Concurrent Partition Requests (Not implemented in FLIP)

For large scale batch jobs, a large number of network connections will be established, which may incur stability issues. We can restrict the number of concurrent partition requests to relieve the issue. Besides, restricting concurrent partition requests can increase the number of network buffers that can be used per remote channel, that is, more credits per channel, which helps the shuffle reader to read sequentially. (As mentioned above, the number of available credits can influence sequential reads because we cannot read more buffers than the consumer can process.)
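Since this optimization is not implemented in the FLIP, the following is only a sketch of one possible mechanism: a semaphore caps the number of in-flight partition requests, and the available buffers are divided among the active channels. All names here are hypothetical.

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch, not Flink code: caps the number of concurrent
// partition requests and shows the effect on buffers per channel.
final class PartitionRequestLimiter {

    private final Semaphore permits;

    PartitionRequestLimiter(int maxConcurrentRequests) {
        this.permits = new Semaphore(maxConcurrentRequests);
    }

    /** Returns true if a new partition request may be sent now. */
    boolean tryStartRequest() {
        return permits.tryAcquire();
    }

    /** Must be called once a started request has been fully consumed. */
    void finishRequest() {
        permits.release();
    }

    /** Fewer concurrent channels means more buffers, i.e. credits, per channel. */
    static int buffersPerChannel(int totalBuffers, int concurrentChannels) {
        return totalBuffers / concurrentChannels;
    }

    public static void main(String[] args) {
        System.out.println(buffersPerChannel(1000, 1000)); // 1 buffer per channel
        System.out.println(buffersPerChannel(1000, 100));  // 10 buffers per channel
    }
}
```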

Implement External/Remote Shuffle Service (Not implemented in FLIP)

Implementing a stand-alone shuffle service can further improve the shuffle IO performance because it is a centralized service and can collect more information, which can lead to more optimized actions, for example, better node-level load balance, better disk-level load balance, further file merging, node-level IO scheduling, and shared read/write buffers and thread pools. It can be introduced in a separate FLIP.

Implementation and Test Plan

Step #1: Implement Basic Shuffle Logic and Data Compression

Basic shuffle logic and data compression will be implemented first, which makes the sort-merge based blocking shuffle available for usage. Main components include 1) SortBuffer and a hash-based data clustering implementation; 2) PartitionedFile together with the corresponding writer (PartitionedFileWriter) and reader (PartitionedFileReader); 3) SortMergeResultPartition and the subpartition data reader SortMergeSubpartitionReader. We will introduce these components separately. For data compression, by reusing the facilities implemented for the existing BoundedBlockingResultPartition, only a very small change is needed. Tests will include unit tests, IT cases and real job tests on a cluster.

Step #2: Implement IO Scheduling and Other Optimizations

IO scheduling and other optimizations can be implemented as the second step. Main components include the IO scheduler and the buffer pool for batch shuffle. Tests will include unit tests, IT cases and real job tests on a cluster.

Compatibility, Deprecation, and Migration Plan

The default behavior of Flink stays unchanged. Nothing needs to be done when migrating to the new Flink version.

Appendix

Sort by Subpartition Index

Our goal is to cluster data belonging to the same subpartition together, and sorting is a natural approach. However, we do not need a generic sort implementation. Given that the subpartition indexes are a sequence of continuous integers starting from 0, bucket sort combined with a linked list is a simpler and more efficient way. Each subpartition takes a bucket and each bucket points to the first record of that subpartition in the binary SortBuffer. Each record also has a pointer to the next record belonging to the same subpartition. The following picture shows how it works:

...
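The bucket plus linked-list layout can also be sketched in code. This is a simplified illustration, not the binary SortBuffer layout: records are kept as separate byte arrays, and the per-bucket tail pointer is an addition of this sketch to make appends constant time. All names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not Flink code: clusters records by subpartition using
// one bucket per subpartition and a "next" pointer per record.
final class BucketLinkedListSketch {

    private static final int NONE = -1;

    private final int[] head; // per subpartition: index of its first record
    private final int[] tail; // per subpartition: index of its last record (sketch-only addition)
    private final List<byte[]> records = new ArrayList<>();
    private final List<Integer> next = new ArrayList<>(); // per record: next record of the same subpartition

    BucketLinkedListSketch(int numSubpartitions) {
        head = new int[numSubpartitions];
        tail = new int[numSubpartitions];
        Arrays.fill(head, NONE);
        Arrays.fill(tail, NONE);
    }

    /** Appends a record in arrival order and links it into its subpartition's chain. */
    void append(byte[] record, int subpartition) {
        int index = records.size();
        records.add(record);
        next.add(NONE);
        if (head[subpartition] == NONE) {
            head[subpartition] = index;
        } else {
            next.set(tail[subpartition], index);
        }
        tail[subpartition] = index;
    }

    /** Walks the buckets from 0 upwards and follows each chain; no comparison sort is needed. */
    List<byte[]> readInSubpartitionOrder() {
        List<byte[]> out = new ArrayList<>();
        for (int sub = 0; sub < head.length; sub++) {
            for (int i = head[sub]; i != NONE; i = next.get(i)) {
                out.add(records.get(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        BucketLinkedListSketch buffer = new BucketLinkedListSketch(2);
        buffer.append("b1".getBytes(StandardCharsets.UTF_8), 1);
        buffer.append("a1".getBytes(StandardCharsets.UTF_8), 0);
        buffer.append("b2".getBytes(StandardCharsets.UTF_8), 1);
        for (byte[] record : buffer.readInSubpartitionOrder()) {
            System.out.println(new String(record, StandardCharsets.UTF_8)); // a1, b1, b2
        }
    }
}
```

Appending and reading are both linear in the number of records, and records of the same subpartition keep their arrival order.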
