Status

Discussion thread: https://lists.apache.org/thread/9mt07mnbwf1rwftzsbxz3jkcrp8dvkl5
Vote thread
JIRA

Release

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

HyperLogLog is a probabilistic data structure for estimating the cardinality of a dataset, that is, the number of unique elements in a set. In production, HyperLogLog is used in a wide range of scenarios, such as counting unique visitors. More information on this data structure and its algorithm is available at: https://redis.io/docs/data-types/probabilistic/hyperloglogs/
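
To make the idea concrete, the following is a minimal HyperLogLog sketch in plain Java. It is for illustration only and is not Redis's implementation: the class name TinyHyperLogLog, the 12-bit register index, and the FNV-1a plus splitmix64 hash are all assumptions chosen for brevity. Here add plays the role of PFADD and estimate the role of PFCOUNT.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative HyperLogLog; names and parameters are assumptions, not Redis internals. */
final class TinyHyperLogLog {
    private static final int P = 12;       // number of index bits (assumed)
    private static final int M = 1 << P;   // 4096 registers
    private final byte[] registers = new byte[M];

    /** Records one element; duplicates never change the state. */
    void add(String value) {
        long h = hash64(value);
        int idx = (int) (h >>> (64 - P));  // top P bits select a register
        long rest = h << P;                // remaining bits, left-aligned
        // Rank = position of the first 1-bit in the remaining bits, capped.
        int rank = Math.min(Long.numberOfLeadingZeros(rest) + 1, 64 - P + 1);
        if (rank > registers[idx]) {
            registers[idx] = (byte) rank;
        }
    }

    /** Returns the estimated number of distinct elements added so far. */
    long estimate() {
        double sum = 0.0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.pow(2.0, -r);
            if (r == 0) {
                zeros++;
            }
        }
        double alpha = 0.7213 / (1.0 + 1.079 / M);
        double raw = alpha * M * M / sum;
        // Small-range correction: fall back to linear counting.
        if (raw <= 2.5 * M && zeros > 0) {
            return Math.round(M * Math.log((double) M / zeros));
        }
        return Math.round(raw);
    }

    /** FNV-1a over the UTF-8 bytes, followed by a splitmix64 finalizer. */
    private static long hash64(String s) {
        long h = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        h ^= h >>> 30;
        h *= 0xbf58476d1ce4e5b9L;
        h ^= h >>> 27;
        h *= 0x94d049bb133111ebL;
        h ^= h >>> 31;
        return h;
    }
}
```

With 4096 registers the expected relative error is roughly 1.04/sqrt(4096), about 1.6%, and memory use stays fixed no matter how many elements are added.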

A connector for Redis Streams (another Redis data structure) is already in progress; see FLIP-254. I think it is possible to create a sink connector for HyperLogLog as well.

Each Redis data structure is used in a different way. Rather than maintaining a single multi-purpose connector, it is more appropriate to develop a dedicated connector for each feasible data structure.

Public Interfaces

HyperLogLog is an irreversible data structure: the original data cannot be read back from it. Therefore, the Redis HyperLogLog connector will consist of a sink only. The following interface will be used:

Proposed Changes

The Redis HyperLogLog connector will be based on the Async Sink (FLIP-171) and will support both bounded (batch) and unbounded (streaming) execution, in both the DataStream API and the Table API/SQL.

Redis officially recommends several Java clients, including Jedis and Lettuce. Jedis is the older Java client and provides comprehensive support for Redis commands. However, Jedis uses blocking I/O and its method calls are synchronous: the calling thread must wait until the socket finishes the I/O before it can continue, and asynchronous calls are not supported, so Jedis consumes more resources under concurrency. In addition, a Jedis client instance is not thread-safe. To ensure thread safety, a connection pool must be used: each thread takes a connection from the pool and returns it once the operation completes or an exception occurs. As the number of connections grows with the workload, the consumption of physical connections also becomes a potential risk to performance and stability.

The Lettuce client is built on Netty's NIO framework. For most Redis operations it needs only a single connection to serve concurrent requests efficiently, which is very different from Jedis's connection pooling model. Lettuce also supports a more comprehensive feature set, and its performance is on par with, or even better than, that of Jedis.

For end users, it is common practice to write different records to different Redis keys depending on the source data. For example, visitor information from a log can be written to the HyperLogLog for the corresponding month in order to count the unique visitors of each month.

Therefore, we need to add a 'redis-key-field' option to the config so that users can write to a custom key based on the contents of a table field, as in the following example:

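CREATE TABLE MyTable (
	dataField STRING,
	-- Assumed: the column named by 'redis-key-field' must be part of the schema.
	redisKey STRING
) WITH (
	'connector' = 'redis-hyperloglog',
	'host' = 'xxx',
	'port' = 'xxx',
	'redis-key-field' = 'redisKey'
);

-- Route each record to a Redis key derived from its content.
INSERT INTO MyTable
SELECT
	dataField,
	CASE
		WHEN dataField < '100' THEN 'redis-key1'
		WHEN dataField >= '100' THEN 'redis-key2'
		ELSE 'default-key'
	END AS redisKey
FROM xxx;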
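
On the sink side, routing by 'redis-key-field' could be sketched roughly as follows, assuming each buffered row carries its target key (redisKey) and the value to count (dataField). PfAddBatcher, Row, and the textual command format are hypothetical names used only for illustration and are not part of the proposed interface. A real implementation would issue PFADD through the Redis client instead of building strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: groups rows by Redis key so that each key needs one PFADD. */
final class PfAddBatcher {

    /** A row as the sink might see it: the routing key and the value to count. */
    static final class Row {
        final String redisKey;
        final String dataField;

        Row(String redisKey, String dataField) {
            this.redisKey = redisKey;
            this.dataField = dataField;
        }
    }

    /** Groups values by target key, preserving first-seen key order. */
    static Map<String, List<String>> groupByKey(List<Row> rows) {
        Map<String, List<String>> batches = new LinkedHashMap<>();
        for (Row row : rows) {
            batches.computeIfAbsent(row.redisKey, k -> new ArrayList<>())
                   .add(row.dataField);
        }
        return batches;
    }

    /** Renders one PFADD command per key, as text, for illustration only. */
    static List<String> toCommands(Map<String, List<String>> batches) {
        List<String> commands = new ArrayList<>();
        batches.forEach((key, values) ->
                commands.add("PFADD " + key + " " + String.join(" ", values)));
        return commands;
    }
}
```

Grouping before writing means a batch that touches a few keys results in a few multi-value PFADD calls rather than one call per record.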


Compatibility, Deprecation, and Migration Plan

This is a new feature, so no compatibility, deprecation, or migration plan is needed.

Test Plan

We will add the following tests:

  • Unit tests
  • Integration tests that perform end-to-end tests against a Redis HyperLogLog test container

Rejected Alternatives

N/A