...

We divide consumers into two categories: messaging consumers, which need only minimal changes to read a Parquet segment, and ingestion consumers, for which we propose adding a new consuming method. This division and the two terms are used solely for the purposes of this proposal.

...

For applications expecting a batch of records, or even the entire segment, to write to a sink (such as a data lake), the Kafka consumer client can simply return a byte buffer containing the entire segment, allowing the application to dump it directly into the sink. The necessary change is a new consumer API that returns the segment byte buffer and bypasses decompression.


We refer to the first type of use case as a messaging consumer and the second one as an ingestion consumer. 
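
As a minimal sketch of the ingestion path, the example below assumes the proposal's 'fetch.raw.bytes' flag and that each returned value is a whole Parquet segment; neither exists in the stock Apache Kafka client today, and the class and topic names are illustrative.

    import java.nio.ByteBuffer;
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class IngestionConsumerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "parquet-ingestion");
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.ByteBufferDeserializer");
            // Proposed flag from this document, not a standard Kafka config:
            // return the fetched bytes as-is, bypassing decompression and decoding.
            props.put("fetch.raw.bytes", "true");

            try (KafkaConsumer<byte[], ByteBuffer> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("events"));
                while (true) {
                    ConsumerRecords<byte[], ByteBuffer> records =
                            consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<byte[], ByteBuffer> record : records) {
                        // Under the proposal, the value is a whole Parquet
                        // segment that can be dumped straight into the sink.
                        writeToDataLake(record.value());
                    }
                }
            }
        }

        private static void writeToDataLake(ByteBuffer parquetSegment) {
            // Sink write omitted; e.g., append the buffer to a data-lake file.
        }
    }

Keeping the standard poll() loop means offset management and group membership behave as in a normal consumer; only the handling of the record value changes.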

...

  1. Regression test - With the configuration 'columnar.encoding' set to 'none', run all the tests in the Uber staging environment, including but not limited to reads and writes at different scales.
  2. Added feature - With the configuration 'columnar.encoding' set to 'parquet':
    1. Verify the data is encoded in Parquet format.
    2. The producer, broker, and consumer all work as before functionality-wise; no exceptions are expected.
    3. With 'fetch.raw.bytes' set, the consumer fetches the whole segment as a byte buffer in Parquet format (see the sketch after this list).
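
One way to implement check 2.a, and to sanity-check the buffer returned in 2.c: every Parquet file begins and ends with the 4-byte ASCII magic "PAR1", so a test can recognize a Parquet segment without fully parsing it. The helper below is an illustrative sketch, not part of the proposed API.

    import java.nio.ByteBuffer;
    import java.util.Arrays;

    public class ParquetSegmentCheck {
        // A Parquet file is framed by the ASCII magic "PAR1" at both ends.
        private static final byte[] MAGIC = {'P', 'A', 'R', '1'};

        static boolean looksLikeParquet(ByteBuffer segment) {
            ByteBuffer dup = segment.duplicate(); // leave the caller's position untouched
            dup.rewind();                         // read from the start of the segment
            if (dup.remaining() < 8) {            // too short for header + footer magic
                return false;
            }
            byte[] head = new byte[4];
            byte[] tail = new byte[4];
            dup.get(head);                        // first four bytes
            dup.position(dup.limit() - 4);
            dup.get(tail);                        // last four bytes
            return Arrays.equals(head, MAGIC) && Arrays.equals(tail, MAGIC);
        }
    }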

Performance tests 

Run tests against different topics covering a range of data types and scales.

...