Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Broadcast Variables

This variant takes a Flink DataSet and makes it available to all function instances.

The following example illustrates that:

Code Block
languagejava
 
DataSet<Point> points = env.readCsv(...);
 
DataSet<Centroid> centroids = ... ; // some computation
 
points.map(new MapFunction<PointRichMapFunction<Point, Integer>() {
 
    private Collection<Centroid> centroids;

    @Override
    public void open(Configuration parameters) {
        this.centroids = getRuntimeContext().getBroadcastVariable("centroids");
    }
		
    @Override
    public Integer map(Point p) {
        return selectCentroid(centroids, p);
    }
 
}).withBroadcastSet("centroids", centroids);
 


The data flow attaches the centroid data set explicitly to the MapFunction. The data will be sent to all MapFunctions in parallel via the Flink network stack that is also used for shuffling the data.


Note that the collection obtained via "getRuntimeContext().getBroadcastVariable("centroids");" is actually shared between all MapFunction on the same TaskManager.
That way, the system can handle larger broadcast variables efficiently.

When to use?

Distribute data with a broadcast variable when

  • The data is large
  • The data has been produced by some form of computation and is already a DataSet (distributed result)
  • Typical use case: Redistribute intermediate results, such as trained models

...