THIS IS A TEST INSTANCE. ALL YOUR CHANGES WILL BE LOST!!!!
...
Broadcast Variables
This variant takes a Flink DataSet and makes it available to all function instances that implement RichFunction (JavaDoc).
The following example illustrates that:
Code Block | ||
---|---|---|
| ||
DataSet<Point> points = env.readCsv(...); DataSet<Centroid> centroids = ... ; // some computation points.map(new MapFunction<PointRichMapFunction<Point, Integer>() { private Collection<Centroid>List<Centroid> centroids; @Override public void open(Configuration parameters) { this.centroids = getRuntimeContext().getBroadcastVariable("centroids"); } @Override public Integer map(Point p) { return selectCentroid(centroids, p); } }).withBroadcastSet("centroids", centroids); |
The data flow attaches the centroid
centroids
data set explicitly to the MapFunctionRichMapFunction. The data will be sent to all MapFunctions RichFunctions in parallel via the Flink network stack that is also used for shuffling the data.
"getRuntimeContext().getBroadcastVariable("centroids");"
is actually shared between all MapFunction RichFunctions on the same TaskManager.That way, the system can handle larger broadcast variables efficiently.
When to use?
Distribute data with a broadcast variable when
- The data is large
- The data has been produced by some form of computation and is already a DataSet (distributed result)
- Typical use case: Redistribute intermediate results, such as trained models
...