...

When the configuration columnar.encoding is set to a value other than 'none', the producer must call the new setSchema() API to register a schema. A consumer without the proposed change can still poll records as before. Optionally, the consumer can call the new pollBuffer() method to retrieve the whole segment in the columnar encoding format.
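To make the call pattern concrete, here is a minimal, self-contained sketch of the proposed API shape. The ColumnarProducer and ColumnarConsumer classes, the schema string, and the placeholder segment bytes are illustrative stand-ins, not the actual Kafka client types or the real implementation.

```java
import java.nio.ByteBuffer;
import java.util.Properties;

public class ColumnarSketch {

    // Hypothetical producer exposing the proposed setSchema() hook.
    static class ColumnarProducer {
        private String schema;
        void setSchema(String schema) { this.schema = schema; }
        String schema() { return schema; }
    }

    // Hypothetical consumer exposing the proposed pollBuffer() method,
    // which returns a whole segment in the columnar encoding.
    static class ColumnarConsumer {
        ByteBuffer pollBuffer() {
            // Placeholder bytes standing in for an encoded segment.
            return ByteBuffer.wrap(new byte[] {1, 2, 3});
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Any value other than "none" enables the columnar path.
        props.put("columnar.encoding", "parquet");

        ColumnarProducer producer = new ColumnarProducer();
        producer.setSchema("{\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}");

        ColumnarConsumer consumer = new ColumnarConsumer();
        ByteBuffer segment = consumer.pollBuffer();
        System.out.println(producer.schema() != null && segment.remaining() == 3);
    }
}
```

An unmodified consumer would simply never call pollBuffer() and keep using the existing poll() path.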

Test Plan

...

Functional tests 

  1. Regression test - With the configuration 'columnar.encoding' set to 'none', run all the tests in the Uber staging environment, including but not limited to read/write tests at different scales.
  2. Added feature - With the configuration 'columnar.encoding' set to 'parquet':
    1. Verify the data is encoded in Parquet format.
    2. The producer, broker, and consumer all work functionally as before; no exceptions are expected.
    3. The newly added consumer API should be able to return the whole segment in Parquet format directly.
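For reference, the two runs above differ only in one client property; the fragment below is illustrative, with the property name taken from this proposal:

```
# Regression run: feature disabled
columnar.encoding=none

# Added-feature run: Parquet encoding enabled
columnar.encoding=parquet
```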

Performance tests 

Run tests against topics with different data types and scales.

  1. Benchmark the encoded data size as the number of rows per batch changes.
  2. Benchmark CPU utilization on the producer and consumer.

Compatibility tests

Test compatibility among the producer, consumer, and replicator with and without the proposed changes.

  1. Both producer and consumer have the proposed changes
    1. With the feature turned off in configuration, all regression tests should pass as before.
    2. When the producer turns on this feature, the consumer and replicator can consume as before.
  2. Producer has the proposed changes, but the consumer doesn't
    1. With the feature turned off in configuration, all regression tests should pass as before.
    2. When the producer turns on this feature, the consumer and replicator throw an exception.
  3. Producer doesn't have the proposed changes, but the consumer does
    1. All the regression tests should pass.

Rejected Alternatives

The alternative is to apply columnar encoding and compression outside the Kafka clients. The application can buffer records to form a batch, apply columnar encoding and compression, and then put the result into the (K, V) of the ProducerRecord. The benefit of this approach is that it avoids changes to the Kafka client, but it has the problems outlined below:
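The application-side batching described above can be sketched as follows. This is a self-contained illustration: GZIP stands in for a real columnar codec such as Parquet, the row format is invented, and no actual Kafka client is involved; the encoded bytes would become the value of an ordinary ProducerRecord.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

public class AppSideBatching {

    // Batch rows, "encode", and compress entirely in the application.
    // A real implementation would pivot the rows into a columnar layout
    // (e.g. Parquet) here; we simply concatenate for illustration.
    static byte[] encodeBatch(List<String> rows) throws IOException {
        String batch = String.join("\n", rows);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(batch.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        List<String> buffer = new ArrayList<>();
        buffer.add("id=1,city=SF");
        buffer.add("id=2,city=NYC");

        // This byte[] would be shipped as the V of a ProducerRecord(K, V),
        // leaving the Kafka client itself unchanged.
        byte[] value = encodeBatch(buffer);
        System.out.println(value.length > 0);
    }
}
```

The Kafka client sees only opaque bytes in this scheme, which is exactly why the proposal rejects it: the broker and standard consumers lose all visibility into the batch.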

...