...
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
...
Key | Type | Default Value | Description |
---|---|---|---|
table.optimizer.runtime-filter.enabled | Boolean | false | A flag to enable or disable the runtime filter. |
table.optimizer.runtime-filter.max-build-data-size | MemorySize | 150 MB | Data volume threshold for the runtime filter build side. The estimated build-side data volume must be below this value for the planner to try to inject a runtime filter. |
table.optimizer.runtime-filter.min-probe-data-size | MemorySize | 10 GB | Data volume threshold for the runtime filter probe side. The estimated probe-side data volume must be above this value for the planner to try to inject a runtime filter. |
table.optimizer.runtime-filter.min-filter-ratio | Double | 0.5 | Filter ratio threshold for the runtime filter. The estimated filter ratio must be above this value for the planner to try to inject a runtime filter. |
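For reference, enabling the runtime filter and adjusting its thresholds could look like the following in the Flink configuration file. The values below are purely illustrative, not recommendations; the defaults are listed in the table above.

```yaml
# Illustrative values only -- tune per workload.
table.optimizer.runtime-filter.enabled: true
table.optimizer.runtime-filter.max-build-data-size: 128 MB
table.optimizer.runtime-filter.min-probe-data-size: 5 GB
table.optimizer.runtime-filter.min-filter-ratio: 0.6
```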
...
The runtime filter can work well with all shuffle modes: pipeline shuffle, blocking shuffle, and hybrid shuffle.
LocalRuntimeFilterBuilderOperator
...
In this version, the underlying implementation of the runtime filter is a bloom filter. In the future, we can introduce more underlying implementations for further optimization. For example, when the input data volume on the build side is small enough, we can use an in-filter to reduce the building overhead and avoid the false-positive problem. We can also introduce a min-max filter, which can easily be pushed down to the source to reduce scan IO.
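To make the min-max idea concrete, here is a minimal, hypothetical sketch (none of these class or method names are Flink APIs): the build side tracks only the key range, and the probe side can discard any record, or prune any source split, whose key falls outside that range.

```java
// Hypothetical sketch, not Flink's actual runtime filter implementation:
// a min-max filter over long join keys.
public class MinMaxFilter {
    private long min = Long.MAX_VALUE;
    private long max = Long.MIN_VALUE;

    // Build side: fold each join key into the running range.
    public void add(long key) {
        min = Math.min(min, key);
        max = Math.max(max, key);
    }

    // Probe side: anything outside [min, max] definitely has no match.
    public boolean mightContain(long key) {
        return key >= min && key <= max;
    }

    public static void main(String[] args) {
        MinMaxFilter filter = new MinMaxFilter();
        filter.add(10);
        filter.add(42);
        System.out.println(filter.mightContain(7));   // below build-side min
        System.out.println(filter.mightContain(20));  // inside [10, 42]
    }
}
```

Because a min-max filter is just a pair of bounds, it can be serialized cheaply and evaluated as an ordinary range predicate, which is what makes source pushdown straightforward.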
...
We need to specify the expected number of records when creating a bloom filter. Currently, this number is estimated in the planning phase. A better solution would be to let the RuntimeFilterBuilder know the real number of records on the build side during the execution phase; we may do this in the future.
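The reason the record count matters follows from the standard bloom filter sizing formulas: for n expected records and a target false-positive rate p, the optimal bit count is m = -n·ln(p)/(ln 2)², and the optimal number of hash functions is k = (m/n)·ln 2. A small self-contained sketch of this arithmetic (not Flink code):

```java
public class BloomFilterSizing {
    // Optimal number of bits: m = -n * ln(p) / (ln 2)^2
    static long optimalBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // Optimal number of hash functions: k = (m / n) * ln 2
    static int optimalHashes(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 1_000_000;            // estimated build-side record count
        long m = optimalBits(n, 0.03); // target ~3% false-positive rate
        System.out.println(m + " bits, " + optimalHashes(n, m) + " hash functions");
    }
}
```

If the planning-phase estimate of n is far below the real record count, the filter becomes over-full and its false-positive rate degrades, which is exactly why knowing the real build-side count at execution time would help.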
...
When the join type is hash join, we can reuse the hash table built in the join operator to build the bloom filter. The keyset of the hash table gives us exact NDV counts and deduplicated keys, which helps avoid inserting records twice into the bloom filter. This idea comes from the discussion on the mailing list; see that thread for more details.
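A toy illustration of the benefit (using a plain LinkedHashSet as a stand-in for the join hash table's keyset; this is not Flink's hash table implementation): the deduplicated keyset yields the exact NDV and cuts the number of bloom filter insertions from one per record to one per distinct key.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class KeysetReuseSketch {
    // Stand-in for reading the join hash table's keyset: deduplicated
    // keys, whose size is the exact NDV of the build side.
    static Set<String> dedup(List<String> buildKeys) {
        return new LinkedHashSet<>(buildKeys);
    }

    public static void main(String[] args) {
        // Build-side join keys as they arrive at the operator, with duplicates.
        List<String> buildKeys = Arrays.asList("a", "b", "a", "c", "b", "a");
        Set<String> keyset = dedup(buildKeys);
        // The bloom filter needs only keyset.size() insertions (exact NDV)
        // instead of buildKeys.size() insertions.
        System.out.println(buildKeys.size() + " records, NDV " + keyset.size());
    }
}
```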
Use blocked bloom filters to improve cache efficiency
To improve cache efficiency when building larger filters, we could structure them as blocked bloom filters, where the filter is split into blocks and all bits of a key go into a single block. This idea also comes from the discussion on the mailing list; see that thread for more details.
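A hypothetical sketch of the blocked layout (class and method names are illustrative, not Flink internals): one hash selects a cache-line-sized block, and all k bits for a key are set and probed within that single block, so each lookup touches one cache line instead of k scattered ones.

```java
import java.util.BitSet;

// Illustrative blocked bloom filter: each key maps to exactly one
// 512-bit block (one 64-byte cache line), and all of its bits live there.
public class BlockedBloomFilter {
    private static final long BLOCK_BITS = 512;
    private final BitSet[] blocks;
    private final int numHashes;

    public BlockedBloomFilter(int numBlocks, int numHashes) {
        this.blocks = new BitSet[numBlocks];
        for (int i = 0; i < numBlocks; i++) {
            blocks[i] = new BitSet((int) BLOCK_BITS);
        }
        this.numHashes = numHashes;
    }

    // 64-bit mixer (splitmix64-style finalizer) used to derive hashes.
    private static long mix(long h) {
        h ^= h >>> 33; h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33; h *= 0xc4ceb9fe1a85ec53L;
        return h ^ (h >>> 33);
    }

    public void add(long key) {
        long h = mix(key);
        BitSet block = blocks[(int) Math.floorMod(h, (long) blocks.length)];
        for (int i = 1; i <= numHashes; i++) {
            block.set((int) Math.floorMod(mix(h + i), BLOCK_BITS));
        }
    }

    public boolean mightContain(long key) {
        long h = mix(key);
        BitSet block = blocks[(int) Math.floorMod(h, (long) blocks.length)];
        for (int i = 1; i <= numHashes; i++) {
            if (!block.get((int) Math.floorMod(mix(h + i), BLOCK_BITS))) {
                return false;
            }
        }
        return true;
    }
}
```

The trade-off is a slightly higher false-positive rate than a flat bloom filter of the same total size, because each key's bits are confined to one block; the win is far better cache locality on large filters.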
...