Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

The runtime filter can work well with all shuffle modes: pipeline shuffle, blocking shuffle, and hybrid shuffle.

LocalRuntimeFilterBuilderOperator

...

In this version, the underlying implementation of the runtime filter is a bloom-filter. In the future, we can introduce more underlying implementations for further optimization. For example, when the input data volume on the build side is small enough, we can use an in-filter to reduce building overhead and avoid the false positive problem. We can also introduce a min-max filter so that the filter can be easily pushed down to the source to reduce the scan IO.

...

We need to give the number of expected records when creating a bloom filter. Currently, the number is  is estimated at in the planning phase. However, a better solution would be to let the RuntimeFilterBuilder know the real number of records on the build side at execution phase, we may do it in the future.

...

When the join type is hash join, we can reuse the hash table built in the join operator to build the bloom filter. The keyset of the hash table gives us exact NDV counts and deduplicated keys, which helps avoid inserting records twice into the bloom filter. This idea comes  comes from the discussion on the mailing list, you can check the mailing list for more details.

Use blocked bloom filters to improve

...

cache efficiency

If we want to improve cache efficiency for the build of larger filters, we could structure them as blocked bloom filters, where the filter is separated into blocks and all bits of one key go only into one block. This idea comes  from the discussion on the mailing list, you can check the mailing list for more details.

...