...
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
...
Key | Type | Default Value | Description |
---|---|---|---|
table.optimizer.runtime-filter.enabled | Boolean | false | A flag to enable or disable the runtime filter. |
table.optimizer.runtime-filter.max-build-data-size | MemorySize | 150 MB | Data volume threshold for the runtime filter build side. The estimated build-side data volume must be below this value for the planner to try to inject a runtime filter. |
table.optimizer.runtime-filter.min-probe-data-size | MemorySize | 10 GB | Data volume threshold for the runtime filter probe side. The estimated probe-side data volume must be above this value for the planner to try to inject a runtime filter. |
table.optimizer.runtime-filter.min-filter-ratio | Double | 0.5 | Filter ratio threshold for the runtime filter. The estimated filter ratio must be above this value for the planner to try to inject a runtime filter. |
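For reference, enabling the runtime filter and adjusting its thresholds could look like the following in the Flink configuration file. The values below are purely illustrative, not recommendations; the defaults are listed in the table above.

```yaml
# Illustrative values only -- tune per workload.
table.optimizer.runtime-filter.enabled: true
table.optimizer.runtime-filter.max-build-data-size: 128 MB
table.optimizer.runtime-filter.min-probe-data-size: 5 GB
table.optimizer.runtime-filter.min-filter-ratio: 0.6
```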
...
The runtime filter can work well with all shuffle modes: pipeline shuffle, blocking shuffle, and hybrid shuffle.
LocalRuntimeFilterBuilderOperator
...
In this version, the underlying implementation of the runtime filter is a bloom filter. In the future, we can introduce more underlying implementations for further optimization. For example, when the input data volume on the build side is small enough, we can use an in-filter to reduce the building overhead and avoid the false-positive problem. We can also introduce a min-max filter, which can easily be pushed down to the source to reduce scan IO.
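To make the min-max idea concrete, here is a minimal, hypothetical sketch (none of these class or method names are Flink APIs): the build side tracks only the key range, and the probe side can discard any record, or prune any source split, whose key falls outside that range.

```java
// Hypothetical sketch, not Flink's actual runtime filter implementation:
// a min-max filter over long join keys.
public class MinMaxFilter {
    private long min = Long.MAX_VALUE;
    private long max = Long.MIN_VALUE;

    // Build side: fold each join key into the running range.
    public void add(long key) {
        min = Math.min(min, key);
        max = Math.max(max, key);
    }

    // Probe side: anything outside [min, max] definitely has no match.
    public boolean mightContain(long key) {
        return key >= min && key <= max;
    }

    public static void main(String[] args) {
        MinMaxFilter filter = new MinMaxFilter();
        filter.add(10);
        filter.add(42);
        System.out.println(filter.mightContain(7));   // below build-side min
        System.out.println(filter.mightContain(20));  // inside [10, 42]
    }
}
```

Because a min-max filter is just a pair of bounds, it can be serialized cheaply and evaluated as an ordinary range predicate, which is what makes source pushdown straightforward.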
...
We need to specify the expected number of records when creating a bloom filter. Currently, this number is estimated in the planning phase. A better solution would be to let the RuntimeFilterBuilder know the real number of records on the build side during the execution phase; we may do this in the future.
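The reason the record count matters follows from the standard bloom filter sizing formulas: for n expected records and a target false-positive rate p, the optimal bit count is m = -n·ln(p)/(ln 2)², and the optimal number of hash functions is k = (m/n)·ln 2. A small self-contained sketch of this arithmetic (not Flink code):

```java
public class BloomFilterSizing {
    // Optimal number of bits: m = -n * ln(p) / (ln 2)^2
    static long optimalBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // Optimal number of hash functions: k = (m / n) * ln 2
    static int optimalHashes(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 1_000_000;            // estimated build-side record count
        long m = optimalBits(n, 0.03); // target ~3% false-positive rate
        System.out.println(m + " bits, " + optimalHashes(n, m) + " hash functions");
    }
}
```

If the planning-phase estimate of n is far below the real record count, the filter becomes over-full and its false-positive rate degrades, which is exactly why knowing the real build-side count at execution time would help.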
...
When the join type is hash join, we can reuse the hash table built in the join operator to build the bloom filter. The keyset of the hash table gives us exact NDV counts and deduplicated keys, which helps avoid inserting records twice into the bloom filter. This idea comes from the discussion on the mailing list; see that thread for more details.
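A toy illustration of the benefit (using a plain LinkedHashSet as a stand-in for the join hash table's keyset; this is not Flink's hash table implementation): the deduplicated keyset yields the exact NDV and cuts the number of bloom filter insertions from one per record to one per distinct key.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class KeysetReuseSketch {
    // Stand-in for reading the join hash table's keyset: deduplicated
    // keys, whose size is the exact NDV of the build side.
    static Set<String> dedup(List<String> buildKeys) {
        return new LinkedHashSet<>(buildKeys);
    }

    public static void main(String[] args) {
        // Build-side join keys as they arrive at the operator, with duplicates.
        List<String> buildKeys = Arrays.asList("a", "b", "a", "c", "b", "a");
        Set<String> keyset = dedup(buildKeys);
        // The bloom filter needs only keyset.size() insertions (exact NDV)
        // instead of buildKeys.size() insertions.
        System.out.println(buildKeys.size() + " records, NDV " + keyset.size());
    }
}
```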
Use blocked bloom filters to improve cache efficiency
To improve cache efficiency when building larger filters, we could structure them as blocked bloom filters, where the filter is split into blocks and all bits of a key go into a single block. This idea also comes from the discussion on the mailing list; see that thread for more details.
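A hypothetical sketch of the blocked layout (class and method names are illustrative, not Flink internals): one hash selects a cache-line-sized block, and all k bits for a key are set and probed within that single block, so each lookup touches one cache line instead of k scattered ones.

```java
import java.util.BitSet;

// Illustrative blocked bloom filter: each key maps to exactly one
// 512-bit block (one 64-byte cache line), and all of its bits live there.
public class BlockedBloomFilter {
    private static final long BLOCK_BITS = 512;
    private final BitSet[] blocks;
    private final int numHashes;

    public BlockedBloomFilter(int numBlocks, int numHashes) {
        this.blocks = new BitSet[numBlocks];
        for (int i = 0; i < numBlocks; i++) {
            blocks[i] = new BitSet((int) BLOCK_BITS);
        }
        this.numHashes = numHashes;
    }

    // 64-bit mixer (splitmix64-style finalizer) used to derive hashes.
    private static long mix(long h) {
        h ^= h >>> 33; h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33; h *= 0xc4ceb9fe1a85ec53L;
        return h ^ (h >>> 33);
    }

    public void add(long key) {
        long h = mix(key);
        BitSet block = blocks[(int) Math.floorMod(h, (long) blocks.length)];
        for (int i = 1; i <= numHashes; i++) {
            block.set((int) Math.floorMod(mix(h + i), BLOCK_BITS));
        }
    }

    public boolean mightContain(long key) {
        long h = mix(key);
        BitSet block = blocks[(int) Math.floorMod(h, (long) blocks.length)];
        for (int i = 1; i <= numHashes; i++) {
            if (!block.get((int) Math.floorMod(mix(h + i), BLOCK_BITS))) {
                return false;
            }
        }
        return true;
    }
}
```

The trade-off is a slightly higher false-positive rate than a flat bloom filter of the same total size, because each key's bits are confined to one block; the win is far better cache locality on large filters.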
...