Status

Current state: "Under Discussion"

Discussion thread: https://lists.apache.org/thread/7gjxto1rmkpff4kl54j8nlg5db2rqhkt

Released: TBD

Motivation

FLIP-27 sources are non-trivial to implement. At the same time, users frequently need to generate arbitrary events with a "mock" source. This need arises both for Flink users, in the scope of demo/PoC projects, and for Flink developers when writing tests. So far, the go-to solution has been to use pre-FLIP-27 APIs and implement data generators as SourceFunctions.

While the new FLIP-27 Source interface introduces important additional functionality, its significant complexity is a hurdle for Flink users who want to implement drop-in replacements for SourceFunction-based data generators. Meanwhile, SourceFunction is effectively superseded by the Source interface and will eventually need to be deprecated. To fill this gap, this FLIP proposes introducing a generic data generator source based on the FLIP-27 API.

Since users frequently need to control the rate at which generated events are produced, this FLIP also extends the basic event generation functionality with native support for rate limiting.


Public Interfaces

A new class with the following API will be introduced. Under the hood, it wraps and delegates to the NumberSequenceSource utilities.

DataGeneratorSource
package org.apache.flink.api.connector.source.lib;

/**
 * A data source that produces N events of an arbitrary type in parallel.
 * This source is useful for testing and for cases that just need a stream of N events of any kind.
 *
 * <p>The source splits the sequence into as many parallel sub-sequences as there are parallel
 * source readers. Each sub-sequence will be produced in order. Consequently, if the parallelism is
 * limited to one, this will produce one sequence in order.
 *
 * <p>This source is always bounded. For very long sequences, users may want to consider executing
 * the application in a streaming manner because, although the produced stream is bounded,
 * the end bound is far away.
 */

@Public
public class DataGeneratorSource<OUT>
        implements Source<
                        OUT,
                        NumberSequenceSource.NumberSequenceSplit,
                        Collection<NumberSequenceSource.NumberSequenceSplit>>,
                ResultTypeQueryable<OUT> {


    /**
     * Creates a new {@code DataGeneratorSource} that produces {@code count} records in
     * parallel.
     *
     * @param generatorFunction The generator function that receives index numbers and 
     *                          translates them into events of the output type.
     * @param count The number of events to be produced.
     * @param typeInfo The type information of the returned events.
     */
    public DataGeneratorSource(
            MapFunction<Long, OUT> generatorFunction, long count, TypeInformation<OUT> typeInfo) {
        ...
    }
	
    /** A split of the source, representing a number sub-sequence. */
    public static class GeneratorSequenceSplit<T>
            implements IteratorSourceSplit<T, GeneratorSequenceIterator<T>> {
	 
        public GeneratorSequenceSplit(
                NumberSequenceSplit numberSequenceSplit, MapFunction<Long, T> generatorFunction) {
            this.numberSequenceSplit = numberSequenceSplit;
            this.generatorFunction = generatorFunction;
        }
        ...
    }
}
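
To make the splitting behavior described in the Javadoc above concrete, the following is a minimal plain-Java sketch, not the proposed Flink implementation. It assumes that the index range 0..count-1 is divided into contiguous sub-ranges, one per parallel reader, and that the generator function is applied to each index in order. The names GeneratorSplitSketch, IndexRange, split, and generate are hypothetical, and java.util.function.Function stands in for Flink's MapFunction. The exact way NumberSequenceSource distributes indices across readers may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Plain-Java illustration only; none of these names are Flink API. */
public class GeneratorSplitSketch {

    /** Inclusive index range [from, to], standing in for a number-sequence split. */
    public static final class IndexRange {
        public final long from;
        public final long to;

        public IndexRange(long from, long to) {
            this.from = from;
            this.to = to;
        }
    }

    /** Divides the indices 0..count-1 into at most {@code parallelism} contiguous ranges. */
    public static List<IndexRange> split(long count, int parallelism) {
        List<IndexRange> ranges = new ArrayList<>();
        long base = count / parallelism;
        long remainder = count % parallelism;
        long next = 0;
        for (int i = 0; i < parallelism && next < count; i++) {
            // The first `remainder` ranges receive one extra index.
            long size = base + (i < remainder ? 1 : 0);
            ranges.add(new IndexRange(next, next + size - 1));
            next += size;
        }
        return ranges;
    }

    /** Applies the generator to every index of one range, in ascending order. */
    public static <T> List<T> generate(IndexRange range, Function<Long, T> generator) {
        List<T> events = new ArrayList<>();
        for (long i = range.from; i <= range.to; i++) {
            events.add(generator.apply(i));
        }
        return events;
    }

    public static void main(String[] args) {
        Function<Long, String> generator = index -> "Event from index: " + index;
        for (IndexRange range : split(10, 3)) {
            System.out.println(
                    "[" + range.from + ", " + range.to + "] -> " + generate(range, generator));
        }
    }
}
```

With count = 10 and three readers, this sketch yields the ranges [0, 3], [4, 6], and [7, 9]. Events within one range are ordered, but no ordering holds across ranges, which mirrors the per-sub-sequence ordering guarantee stated in the Javadoc.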

Proposed Changes

This FLIP introduces a new DataGeneratorSource class.

The envisioned usage looks like this:

usage
// env is the job's StreamExecutionEnvironment.
MapFunction<Long, String> generator = index -> "Event from index: " + index;
DataGeneratorSource<String> source = new DataGeneratorSource<>(generator, 10, Types.STRING);
DataStreamSource<String> watermarked =
        env.fromSource(
                source,
                WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(1)),
                "watermarked");

It is up for discussion whether an additional utility method on StreamExecutionEnvironment with default watermarking would also be desirable (similar to env.fromSequence(long from, long to)).

Compatibility, Deprecation, and Migration Plan

This feature is a stepping stone toward deprecating the SourceFunction API (see this discussion). 

  1. After this feature is introduced, it will be documented and promoted as the recommended way to write data generators.
  2. A list of Flink tests that currently use the SourceFunction API will be compiled and follow-up tickets for migration will be created.

Test Plan

  • Unit tests will be added to verify the behavior of the source's splits in relation to the SourceReader
  • Integration tests will be added to verify correct functioning with different levels of parallelism

Rejected Alternatives

It is possible to use a NumberSequenceSource followed by a map function to achieve similar results. However, this approach has two disadvantages:

  • It introduces another level of indirection and is less intuitive to use
  • It does not promote best practices of assigning watermarks (see this discussion)

