Broadcast Variables

This variant takes a Flink DataSet and makes it available to all function instances that implement RichFunction (JavaDoc).

The following example illustrates that:

DataSet<Point> points = env.readCsv(...);

DataSet<Centroid> centroids = ... ; // some computation

points.map(new RichMapFunction<Point, Integer>() {

    // Holds the broadcast data set once open() has run
    private List<Centroid> centroids;

    @Override
    public void open(Configuration parameters) {
        // Look up the broadcast set by the name given to withBroadcastSet()
        this.centroids = getRuntimeContext().getBroadcastVariable("centroids");
    }

    @Override
    public Integer map(Point p) {
        return selectCentroid(centroids, p);
    }

}).withBroadcastSet(centroids, "centroids"); // data set first, then its name
 


The withBroadcastSet call attaches the centroids data set explicitly to the RichMapFunction under the name "centroids". At runtime, the data is sent in parallel to all instances of the function through the same Flink network stack that is used for shuffling data.
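
The example above calls selectCentroid without defining it. The following is a minimal sketch of what such a helper might look like, not the original implementation. It assumes hypothetical Point and Centroid classes with two-dimensional coordinates and an integer centroid id, and it returns the id of the centroid closest to the point by squared Euclidean distance.

```java
import java.util.Arrays;
import java.util.List;

public class NearestCentroid {

    // Hypothetical 2-D point (assumption: the original does not define Point).
    public static class Point {
        public final double x;
        public final double y;

        public Point(double x, double y) {
            this.x = x;
            this.y = y;
        }
    }

    // Hypothetical centroid with an integer id (assumption).
    public static class Centroid {
        public final int id;
        public final double x;
        public final double y;

        public Centroid(int id, double x, double y) {
            this.id = id;
            this.x = x;
            this.y = y;
        }
    }

    // Returns the id of the centroid nearest to p. Only reads the list.
    public static Integer selectCentroid(List<Centroid> centroids, Point p) {
        if (centroids.isEmpty()) {
            throw new IllegalArgumentException("no centroids available");
        }
        int bestId = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (Centroid c : centroids) {
            double dx = c.x - p.x;
            double dy = c.y - p.y;
            double distance = dx * dx + dy * dy; // squared distance is enough for comparison
            if (distance < bestDistance) {
                bestDistance = distance;
                bestId = c.id;
            }
        }
        return bestId;
    }

    public static void main(String[] args) {
        List<Centroid> centroids = Arrays.asList(
                new Centroid(1, 0.0, 0.0),
                new Centroid(2, 10.0, 10.0));
        // The point (1, 2) lies closer to centroid 1 than to centroid 2.
        System.out.println(selectCentroid(centroids, new Point(1.0, 2.0)));
    }
}
```

The helper never modifies the list it receives, so it can be called from map() on every record without copying the broadcast data.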


Note that the collection obtained via getRuntimeContext().getBroadcastVariable("centroids") is actually shared between all RichFunctions on the same TaskManager.
That way, the system can handle larger broadcast variables efficiently.
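
Because the collection is shared, a change made through one function instance would be visible to the others on the same TaskManager. One possible precaution is to hold the list behind a read-only view. This is our suggestion rather than something the text above prescribes. The sketch below uses plain Java only; the Worker class is a hypothetical stand-in for a function instance and is not part of Flink.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SharedBroadcastList {

    // Hypothetical stand-in for one function instance (not a Flink class).
    public static class Worker {
        private final List<String> centroids;

        public Worker(List<String> shared) {
            // Read-only view over the same underlying list: nothing is copied,
            // but writes through this reference throw UnsupportedOperationException.
            this.centroids = Collections.unmodifiableList(shared);
        }

        public int count() {
            return centroids.size();
        }

        // Returns true if the write succeeded, false if the view rejected it.
        public boolean tryAdd(String name) {
            try {
                centroids.add(name);
                return true;
            } catch (UnsupportedOperationException e) {
                return false;
            }
        }
    }

    public static void main(String[] args) {
        List<String> shared = new ArrayList<>();
        shared.add("c1");
        shared.add("c2");

        // Two workers share one list, as instances on one TaskManager would.
        Worker a = new Worker(shared);
        Worker b = new Worker(shared);

        System.out.println(a.tryAdd("c3"));              // the write is rejected
        System.out.println(a.count() + " " + b.count()); // both still see the original entries
    }
}
```

In a RichFunction, the same wrapping could be applied to the list returned in open(), at the cost of one extra object per function instance.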

When to use?

Distribute data with a broadcast variable when

  • The data is large
  • The data has been produced by some form of computation and is already a DataSet (distributed result)
  • Typical use case: Redistribute intermediate results, such as trained models
