Status
...
...
...
...
...
...
Jira: FLINK-27919 (ASF JIRA)
...
Motivation
FLIP-27 sources are non-trivial to implement. At the same time, it is frequently necessary to generate arbitrary events with a "mock" source. This need arises both for Flink users, in the scope of demo/PoC projects, and for Flink developers when writing tests. So far, the go-to solution for these purposes has been to use pre-FLIP-27 APIs and implement data generators as SourceFunctions.
While the new FLIP-27 Source interface introduces important additional functionality, it comes with significant complexity that presents a hurdle for Flink users who want to implement drop-in replacements of SourceFunction-based data generators. Meanwhile, SourceFunction is effectively superseded by the Source interface and will eventually need to be deprecated. To fill this gap, this FLIP proposes the introduction of a generic data generator source based on the FLIP-27 API.
Since it is frequently required to control the rate at which generated events are produced, this FLIP also extends the basic event generation functionality with native support for rate limiting.
Public Interfaces
A new class with the following API will be introduced. Under the hood, it wraps and delegates to the NumberSequenceSource utilities.
Code Block |
---|
language | java |
---|
title | DataGeneratorSource |
---|
|
package org.apache.flink.api.connector.source.lib;

/**
 * A data source that produces N events of an arbitrary type in parallel. This source is useful for
 * testing and for cases that just need a stream of N events of any kind.
 *
 * <p>The source splits the sequence into as many parallel sub-sequences as there are parallel
 * source readers. Each sub-sequence will be produced in order. Consequently, if the parallelism is
 * limited to one, this will produce one sequence in order.
 *
 * <p>This source is always bounded. For very long sequences, users may want to consider executing
 * the application in a streaming manner because, although the produced stream is bounded, the end
 * bound is very far away.
 */
@Public
public class DataGeneratorSource<OUT>
        implements Source<
                        OUT,
                        NumberSequenceSource.NumberSequenceSplit,
                        Collection<NumberSequenceSource.NumberSequenceSplit>>,
                ResultTypeQueryable<OUT> {

    /**
     * Creates a new {@code DataGeneratorSource} that produces {@code count} records in
     * parallel.
     *
     * @param sourceReaderFactory The factory for instantiating the readers of type
     *     {@code SourceReader<OUT, NumberSequenceSplit>}.
     * @param count The number of events to be produced.
     * @param typeInfo The type information of the returned events.
     */
    public DataGeneratorSource(
            SourceReaderFactory<OUT, NumberSequenceSplit> sourceReaderFactory,
            long count,
            TypeInformation<OUT> typeInfo) {
        this.sourceReaderFactory = checkNotNull(sourceReaderFactory);
        this.typeInfo = checkNotNull(typeInfo);
        // NumberSequenceSource bounds are inclusive, so [0, count - 1] yields exactly count records.
        this.numberSource = new NumberSequenceSource(0, count - 1);
    }

    /**
     * Creates a new {@code DataGeneratorSource} that produces {@code count} records in
     * parallel.
     *
     * @param generatorFunction The generator function that receives index numbers and translates
     *     them into events of the output type.
     * @param count The number of events to be produced.
     * @param typeInfo The type information of the returned events.
     */
    public DataGeneratorSource(
            GeneratorFunction<Long, OUT> generatorFunction,
            long count,
            TypeInformation<OUT> typeInfo) {...}

    /**
     * Creates a new {@code DataGeneratorSource} that produces {@code count} records in
     * parallel.
     *
     * @param generatorFunction The generator function that receives index numbers and translates
     *     them into events of the output type.
     * @param count The number of events to be produced.
     * @param sourceRatePerSecond The maximum number of events per second that this generator aims
     *     to produce. This is a target number for the whole source; the individual parallel
     *     source instances automatically adjust their rate based on the {@code
     *     sourceRatePerSecond} and the source parallelism.
     * @param typeInfo The type information of the returned events.
     */
    public DataGeneratorSource(
            GeneratorFunction<Long, OUT> generatorFunction,
            long count,
            double sourceRatePerSecond,
            TypeInformation<OUT> typeInfo) {...}
}
|
Where GeneratorFunction supports initialization of class fields via the open() method with access to the local SourceReaderContext.
Code Block |
---|
language | java |
---|
title | GeneratorFunction |
---|
|
@Public
public interface GeneratorFunction<T, O> extends Function {
/**
* Initialization method for the function. It is called once before the actual working process
* methods.
*/
default void open(SourceReaderContext readerContext) throws Exception {}
/** Tear-down method for the function. */
default void close() throws Exception {}
O map(T value) throws Exception;
} |
A new SourceReaderFactory interface is introduced.
Code Block |
---|
language | java |
---|
title | SourceReaderFactory |
---|
|
public interface SourceReaderFactory<OUT, SplitT extends SourceSplit> extends Serializable {
SourceReader<OUT, SplitT> newSourceReader(SourceReaderContext readerContext);
} |
The generator source delegates the SourceReaders' creation to the factory.
Code Block |
---|
language | java |
---|
title | DataGeneratorSource |
---|
|
@Public
public class DataGeneratorSource<OUT>
implements Source<
OUT,
NumberSequenceSource.NumberSequenceSplit,
Collection<NumberSequenceSource.NumberSequenceSplit>>,
ResultTypeQueryable<OUT> {
private final SourceReaderFactory<OUT, NumberSequenceSplit> sourceReaderFactory;
@Override
public SourceReader<OUT, NumberSequenceSplit> createReader(SourceReaderContext readerContext)
throws Exception {
return sourceReaderFactory.newSourceReader(readerContext);
}
} |
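The Javadoc of DataGeneratorSource states that the index sequence is split into as many sub-sequences as there are parallel source readers. The following plain-Java sketch illustrates one way such an even split could look. The class SequenceSplitSketch and its method are hypothetical and only serve as an illustration; in the proposal the actual splitting is delegated to NumberSequenceSource, whose exact strategy may differ.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of the proposed API. */
final class SequenceSplitSketch {
    private SequenceSplitSketch() {}

    /**
     * Splits the index range [0, count - 1] into at most {@code parallelism} contiguous
     * sub-ranges. Each entry is a {from, to} pair with both bounds inclusive.
     */
    static List<long[]> split(long count, int parallelism) {
        if (count < 0 || parallelism <= 0) {
            throw new IllegalArgumentException("count must be >= 0 and parallelism > 0");
        }
        List<long[]> ranges = new ArrayList<>();
        long base = count / parallelism;      // minimum size of every sub-range
        long remainder = count % parallelism; // the first 'remainder' sub-ranges get one extra
        long from = 0;
        for (int i = 0; i < parallelism; i++) {
            long size = base + (i < remainder ? 1 : 0);
            if (size > 0) {
                ranges.add(new long[] {from, from + size - 1});
            }
            from += size;
        }
        return ranges;
    }
}
```

With a parallelism of one, the sketch returns a single range covering the whole sequence, which matches the statement that the source then produces one sequence in order.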
Proposed Changes
In order to deliver convenient rate-limiting functionality to the users of the new API, a small addition to the SourceReaderContext is required.
The sum of rates of all parallel readers has to approximate the optional user-defined sourceRatePerSecond parameter. Currently, there is no way for the SourceReaders to acquire the current parallelism of the job they are part of. To overcome this limitation, this FLIP proposes an extension of the SourceReaderContext interface with the currentParallelism() method:
Code Block |
---|
language | java |
---|
title | SourceReaderContext |
---|
|
package org.apache.flink.api.connector.source;
/** The class that exposes some context from runtime to the {@link SourceReader}. */
@Public
public interface SourceReaderContext {
...
/**
* Get the current parallelism of this Source.
*
* @return the parallelism of the Source.
*/
int currentParallelism();
} |
With the parallelism accessible via SourceReaderContext, initialization of the rate-limiting data generating readers can be taken care of by the SourceReaderFactories. For example:
Code Block |
---|
language | java |
---|
title | GeneratorSourceReaderFactory |
---|
|
public class GeneratorSourceReaderFactory<OUT>
implements SourceReaderFactory<OUT, NumberSequenceSource.NumberSequenceSplit> {
public GeneratorSourceReaderFactory(
GeneratorFunction<Long, OUT> generatorFunction, long sourceRatePerSecond){...}
@Override
public SourceReader<OUT, NumberSequenceSource.NumberSequenceSplit> newSourceReader(
SourceReaderContext readerContext) {
if (sourceRatePerSecond > 0) {
int parallelism = readerContext.currentParallelism();
RateLimiter rateLimiter = new GuavaRateLimiter(sourceRatePerSecond, parallelism);
return new RateLimitedSourceReader<>(
new GeneratingIteratorSourceReader<>(readerContext, generatorFunction),
rateLimiter);
} else {
return new GeneratingIteratorSourceReader<>(readerContext, generatorFunction);
}
}
} |
Where RateLimiter is defined as follows:
Code Block |
---|
language | java |
---|
title | RateLimiter |
---|
|
/** The interface that can be used to throttle execution of methods. */
interface RateLimiter extends Serializable {
/**
* Acquire method is a blocking call that is intended to be used in places where it is required
* to limit the rate at which results are produced or other functions are called.
*
* @return The number of milliseconds this call blocked its caller.
* @throws InterruptedException The interrupted exception.
*/
int acquire() throws InterruptedException;
} |
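To make the rate-splitting idea more concrete, here is a minimal plain-Java sketch with no Flink or Guava dependency. It assumes the global sourceRatePerSecond is divided evenly by the parallelism and that each reader then spaces its acquire() calls at a fixed interval. The names PerReaderRate and FixedIntervalRateLimiter are hypothetical and are not part of this proposal; the GuavaRateLimiter mentioned above may behave differently.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: splits a global target rate evenly across parallel readers. */
final class PerReaderRate {
    private PerReaderRate() {}

    static double of(double sourceRatePerSecond, int parallelism) {
        if (sourceRatePerSecond <= 0 || parallelism <= 0) {
            throw new IllegalArgumentException("rate and parallelism must be positive");
        }
        return sourceRatePerSecond / parallelism;
    }
}

/**
 * Hypothetical blocking limiter that mirrors the shape of RateLimiter#acquire() but does not
 * implement the Flink interface, so that the sketch stays self-contained.
 */
final class FixedIntervalRateLimiter {
    private final long intervalNanos;
    private long nextFreeNanos;

    FixedIntervalRateLimiter(double permitsPerSecond) {
        this.intervalNanos = (long) (TimeUnit.SECONDS.toNanos(1) / permitsPerSecond);
        this.nextFreeNanos = System.nanoTime();
    }

    /** Blocks until the next permit is due and returns the milliseconds spent waiting. */
    synchronized int acquire() throws InterruptedException {
        long now = System.nanoTime();
        long waitNanos = Math.max(0L, nextFreeNanos - now);
        // Schedule the following permit one interval after this one becomes available.
        nextFreeNanos = Math.max(now, nextFreeNanos) + intervalNanos;
        if (waitNanos > 0) {
            TimeUnit.NANOSECONDS.sleep(waitNanos);
        }
        return (int) TimeUnit.NANOSECONDS.toMillis(waitNanos);
    }
}
```

Under these assumptions, a source with a target of 10 events per second and a parallelism of 4 would give each reader a limiter created with PerReaderRate.of(10, 4), so the sum over all readers approximates the requested rate.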
---
It is desirable to reuse the functionality of IteratorSourceReader for cases where the input data type is different from the output (IN: Long from the wrapped NumberSequenceSplit, OUT: the result of applying GeneratorFunction<Long, OUT> provided by the user). For that purpose, the following changes are proposed:
- A new IteratorSourceReaderBase is introduced, parameterized with both input and output data types.
- All methods of IteratorSourceReader apart from pollNext() are "pulled up" to the *Base class.
- The IteratorSourceReader API remains the same; it extends IteratorSourceReaderBase with identical input and output types.
- A new GeneratingIteratorSourceReader is introduced, where input and output types differ (the output is the result of applying GeneratorFunction).
- GeneratingIteratorSourceReader initializes the GeneratorFunction (if needed) by calling its open() method within start().
Code Block |
---|
language | java |
---|
title | IteratorSourceReaderBase |
---|
|
package org.apache.flink.api.connector.source.lib.util;

@Experimental
abstract class IteratorSourceReaderBase<
                E, O, IterT extends Iterator<E>, SplitT extends IteratorSourceSplit<E, IterT>>
        implements SourceReader<O, SplitT> {...} |
Reader:
Code Block |
---|
language | java |
---|
title | IteratorSourceReader |
---|
|
package org.apache.flink.api.connector.source.lib.util;

@Public
public class IteratorSourceReader<
                E, IterT extends Iterator<E>, SplitT extends IteratorSourceSplit<E, IterT>>
        extends IteratorSourceReaderBase<E, E, IterT, SplitT> {

    public IteratorSourceReader(SourceReaderContext context) {
        super(context);
    }

    @Override
    public InputStatus pollNext(ReaderOutput<E> output) {...}
} |
Code Block |
---|
language | java |
---|
title | GeneratingIteratorSourceReader |
---|
|
package org.apache.flink.api.connector.source.lib.util;

@Experimental
public class GeneratingIteratorSourceReader<
                E, O, IterT extends Iterator<E>, SplitT extends IteratorSourceSplit<E, IterT>>
        extends IteratorSourceReaderBase<E, O, IterT, SplitT> {

    public GeneratingIteratorSourceReader(
            SourceReaderContext context, GeneratorFunction<E, O> generatorFunction) {...}

    @Override
    public InputStatus pollNext(ReaderOutput<O> output) {...}
} |
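The division of labor between the iterator and the generator function can be pictured with a simplified model in plain Java. This is only a sketch under stated assumptions: it ignores splits, checkpointing and the SourceReaderContext, and the names SketchGeneratorFunction, SketchStatus and SketchGeneratingReader are hypothetical stand-ins rather than the Flink types proposed above.

```java
import java.util.Iterator;
import java.util.function.Consumer;

/** Simplified stand-in for the proposed GeneratorFunction (open() takes no context here). */
interface SketchGeneratorFunction<T, O> {
    default void open() throws Exception {}

    O map(T value) throws Exception;
}

/** Simplified stand-in for Flink's InputStatus, reduced to two states. */
enum SketchStatus {
    MORE_AVAILABLE,
    END_OF_INPUT
}

/** Pulls input elements of type E from an iterator and emits generated elements of type O. */
final class SketchGeneratingReader<E, O> {
    private final Iterator<E> iterator;
    private final SketchGeneratorFunction<E, O> generatorFunction;

    SketchGeneratingReader(Iterator<E> iterator, SketchGeneratorFunction<E, O> generatorFunction) {
        this.iterator = iterator;
        this.generatorFunction = generatorFunction;
    }

    /** Initializes the generator function once, before any element is produced. */
    void start() throws Exception {
        generatorFunction.open();
    }

    /** Emits at most one generated element per call. */
    SketchStatus pollNext(Consumer<O> output) throws Exception {
        if (iterator.hasNext()) {
            output.accept(generatorFunction.map(iterator.next()));
            return SketchStatus.MORE_AVAILABLE;
        }
        return SketchStatus.END_OF_INPUT;
    }
}
```

In this model the input type (for example Long indices) and the output type (for example String events) are independent, which is the reason the proposal parameterizes the base reader with both.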
RateLimitedSourceReader wraps another SourceReader (delegates to its methods) while rate-limiting the pollNext() calls.
Code Block |
---|
language | java |
---|
title | RateLimitedSourceReader |
---|
|
package org.apache.flink.api.connector.source.lib.util;

@Experimental
class RateLimitedSourceReader<E, SplitT extends SourceSplit>
        implements SourceReader<E, SplitT> {

    private final SourceReader<E, SplitT> sourceReader;
    private final RateLimiter rateLimiter;

    public RateLimitedSourceReader(SourceReader<E, SplitT> sourceReader, RateLimiter rateLimiter) {
        this.sourceReader = sourceReader;
        this.rateLimiter = rateLimiter;
    }

    @Override
    public void start() {
        sourceReader.start();
    }

    @Override
    public InputStatus pollNext(ReaderOutput<E> output) throws Exception {
        // Block until the limiter grants a permit, then delegate to the wrapped reader.
        rateLimiter.acquire();
        return sourceReader.pollNext(output);
    }
    ...
} |
Usage:
The envisioned usage for functions that do not contain any class fields that need initialization looks like this:
Code Block |
---|
|
int count = 1000;
int sourceRatePerSecond = 2;
GeneratorFunction<Long, String> generator = index -> "Event from index: " + index;
DataGeneratorSource<String> source =
        new DataGeneratorSource<>(generator, count, sourceRatePerSecond, Types.STRING);
DataStreamSource<String> watermarked =
        env.fromSource(
                source,
                WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(1)),
                "watermarked"); |
Scenarios where GeneratorFunction requires initialization of non-serializable fields are supported as follows:
Code Block |
---|
|
GeneratorFunction<Long, String> generator =
new GeneratorFunction<Long, String>() {
transient SourceReaderMetricGroup sourceReaderMetricGroup;
@Override
public void open(SourceReaderContext readerContext) {
sourceReaderMetricGroup = readerContext.metricGroup();
}
@Override
public String map(Long value) {
return "Generated: >> "
+ value.toString()
+ "; local metric group: "
+ sourceReaderMetricGroup.hashCode();
}
};
DataGeneratorSource<String> source = new DataGeneratorSource<>(generator, count, sourceRatePerSecond, Types.STRING); |
...
...
- addition of a utility method
...
- to StreamExecutionEnvironment with default watermarking might also be desirable (similar to env.fromSequence(long from, long to)).
- To be able to reuse the existing functionality of NumberSequenceSource, it is required to change the visibility of NumberSequenceSource.CheckpointSerializer from private to package-private.
Compatibility, Deprecation, and Migration Plan
...
- Unit tests will be added to verify the behavior of the Source's splits in relation to the SourceReader
- Integration tests will be added to verify correct functioning with different levels of parallelism
Rejected Alternatives
It is possible to use a NumberSequenceSource followed by a map function to achieve similar results; however, this has two disadvantages:
- It introduces another level of indirection and is less intuitive to use
- It does not promote best practices of assigning watermarks (see this discussion)