...

Discussion thread: https://lists.apache.org/thread/mm0o8fv7x7k13z11htt88zhy7lo8npmg
Vote thread: https://lists.apache.org/thread/60c0obrgxrcxb7qv9pqywzxvtt5phnhy
JIRA: FLINK-32486
Release: 1.18.0

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

...

Key: table.optimizer.runtime-filter.enabled
Type: Boolean
Default Value: false
Description: A flag to enable or disable the runtime filter.

Key: table.optimizer.runtime-filter.max-build-data-size
Type: MemorySize
Default Value: 150 MB
Description: Data volume threshold of the runtime filter build side. The estimated data volume of the build side must be below this value for a runtime filter to be injected.

Key: table.optimizer.runtime-filter.min-probe-data-size
Type: MemorySize
Default Value: 10 GB
Description: Data volume threshold of the runtime filter probe side. The estimated data volume of the probe side must be above this value for a runtime filter to be injected.

Key: table.optimizer.runtime-filter.min-filter-ratio
Type: Double
Default Value: 0.5
Description: Filter ratio threshold of the runtime filter. The estimated filter ratio must be above this value for a runtime filter to be injected.
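As an illustration only, a minimal sketch of how these options could be applied to a batch TableEnvironment; the option keys are the ones listed above, and setting them through the TableConfig's underlying Configuration is just one possible way to do it:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class RuntimeFilterConfigExample {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inBatchMode());

        // The runtime filter is a batch optimizer feature and is disabled by default.
        tEnv.getConfig().getConfiguration()
                .setString("table.optimizer.runtime-filter.enabled", "true");

        // Optionally tune the injection thresholds (the values below are the defaults).
        tEnv.getConfig().getConfiguration()
                .setString("table.optimizer.runtime-filter.max-build-data-size", "150 mb");
        tEnv.getConfig().getConfiguration()
                .setString("table.optimizer.runtime-filter.min-probe-data-size", "10 gb");
        tEnv.getConfig().getConfiguration()
                .setString("table.optimizer.runtime-filter.min-filter-ratio", "0.5");
    }
}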

...

The runtime filter works well with all shuffle modes: pipelined shuffle, blocking shuffle, and hybrid shuffle.

LocalRuntimeFilterBuilderOperator

...

In this version, the underlying implementation of the runtime filter is a bloom filter. In the future, we can introduce more underlying implementations for further optimization. For example, when the build-side input data volume is small enough, we can use an in-filter to reduce the build overhead and avoid false positives entirely. We can also introduce a min-max filter, which can easily be pushed down to the source to reduce scan IO.
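As a rough sketch of what such alternatives could look like, consider the following; the interface and class names are hypothetical, not Flink's actual classes:

import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of alternative runtime filter implementations (not Flink classes). */
interface RuntimeFilter {
    void add(long joinKey);
    boolean mightContain(long joinKey);
}

/** In-filter: stores the small build side exactly, so there are no false positives. */
class InFilter implements RuntimeFilter {
    private final Set<Long> keys = new HashSet<>();
    public void add(long joinKey) { keys.add(joinKey); }
    public boolean mightContain(long joinKey) { return keys.contains(joinKey); }
}

/** Min-max filter: only tracks the key range, which makes it easy to push down to the source scan. */
class MinMaxFilter implements RuntimeFilter {
    private long min = Long.MAX_VALUE;
    private long max = Long.MIN_VALUE;
    public void add(long joinKey) { min = Math.min(min, joinKey); max = Math.max(max, joinKey); }
    public boolean mightContain(long joinKey) { return joinKey >= min && joinKey <= max; }
}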

...

We need to give the number of expected records when creating a bloom filter. Currently, this number is estimated in the planning phase. However, a better solution would be to let the RuntimeFilterBuilder know the real number of records on the build side at execution time; we may do this in the future.
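To illustrate why the estimate matters, the textbook sizing formulas below derive the bit-array size and hash-function count from the expected record count and a target false-positive probability (the class name and the 5% target are assumptions for the example, not Flink code); if the planner underestimates the record count, the real false-positive rate ends up higher than the target.

/** Hypothetical sizing helper using the standard bloom filter formulas. */
public class BloomFilterSizing {
    static long optimalNumOfBits(long expectedRecords, double fpp) {
        // m = -n * ln(p) / (ln 2)^2
        return (long) Math.ceil(-expectedRecords * Math.log(fpp) / (Math.log(2) * Math.log(2)));
    }

    static int optimalNumOfHashFunctions(long expectedRecords, long numBits) {
        // k = (m / n) * ln 2
        return Math.max(1, (int) Math.round((double) numBits / expectedRecords * Math.log(2)));
    }

    public static void main(String[] args) {
        long expectedRecords = 1_000_000L; // planner estimate of build-side records
        double fpp = 0.05;                 // target false-positive probability
        long bits = optimalNumOfBits(expectedRecords, fpp);
        int hashes = optimalNumOfHashFunctions(expectedRecords, bits);
        System.out.println("bits=" + bits + ", hashFunctions=" + hashes);
    }
}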

...

When the join type is hash join, we can reuse the hash table built in the join operator to build the bloom filter. The key set of the hash table gives us an exact NDV count and deduplicated keys, which avoids inserting the same record into the bloom filter twice. This idea comes from the discussion on the mailing list; see the mailing list for more details.
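A sketch of the idea with stand-in types rather than Flink's actual operators: iterating the hash table's key set inserts each distinct key exactly once, and the key-set size is the exact NDV of the build side.

import java.util.HashMap;
import java.util.Map;
import java.util.function.LongConsumer;

/** Illustrative only: build the filter from a join hash table's distinct keys. */
public class BuildFilterFromHashTable {
    static void buildFromKeySet(Map<Long, Object> joinHashTable, LongConsumer bloomFilterInsert) {
        // Distinct keys only: no key is hashed into the bloom filter twice.
        for (long key : joinHashTable.keySet()) {
            bloomFilterInsert.accept(key);
        }
    }

    public static void main(String[] args) {
        Map<Long, Object> hashTable = new HashMap<>();
        hashTable.put(1L, "row-a");
        hashTable.put(2L, "row-b");
        buildFromKeySet(hashTable, key -> System.out.println("bloom filter insert: " + key));
    }
}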

Use blocked bloom filters to improve cache efficiency

If we want to improve cache efficiency for the build of larger filters, we could structure them as blocked bloom filters, where the filter is separated into blocks and all bits of one key go into only one block. This idea also comes from the discussion on the mailing list; see the mailing list for more details.
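A simplified sketch of the idea (hypothetical class with simplified hashing, not Flink's implementation): each key is mapped to one cache-line-sized block, and all of its bits are set within that block, so a lookup touches a single cache line.

import java.util.BitSet;

/** Hypothetical blocked bloom filter: all bits of one key live in a single block. */
public class BlockedBloomFilter {
    private static final int BITS_PER_BLOCK = 512; // one 64-byte cache line
    private static final int HASHES_PER_KEY = 4;

    private final BitSet bits;
    private final int numBlocks;

    public BlockedBloomFilter(int numBlocks) {
        this.numBlocks = numBlocks;
        this.bits = new BitSet(numBlocks * BITS_PER_BLOCK);
    }

    public void add(long key) {
        int block = blockIndex(key);
        long h = mix(key);
        for (int i = 0; i < HASHES_PER_KEY; i++) {
            int bitInBlock = (int) ((h >>> (i * 9)) & (BITS_PER_BLOCK - 1));
            bits.set(block * BITS_PER_BLOCK + bitInBlock);
        }
    }

    public boolean mightContain(long key) {
        int block = blockIndex(key);
        long h = mix(key);
        for (int i = 0; i < HASHES_PER_KEY; i++) {
            int bitInBlock = (int) ((h >>> (i * 9)) & (BITS_PER_BLOCK - 1));
            if (!bits.get(block * BITS_PER_BLOCK + bitInBlock)) {
                return false;
            }
        }
        return true;
    }

    private int blockIndex(long key) {
        // Mask the sign bit so the modulo result is always non-negative.
        return (int) ((mix(key ^ 0x9E3779B97F4A7C15L) & Long.MAX_VALUE) % numBlocks);
    }

    private static long mix(long x) { // splitmix64-style 64-bit finalizer
        x ^= x >>> 33; x *= 0xFF51AFD7ED558CCDL;
        x ^= x >>> 33; x *= 0xC4CEB9FE1A85EC53L;
        x ^= x >>> 33;
        return x;
    }
}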

...