Comment: Added rejected alternative for standalone tool

...

Current state: Under Discussion

Discussion thread: here [Change the link from the KIP proposal email archive to your own email thread]

JIRA: KAFKA-10281

...

Parameter   Required   Description
--logs      Yes        The comma-separated list of log files to be analyzed.
--verbose   No         If set, display verbose per-batch analysis information.

Output

The tool prints its results to standard output. It reports information about the batches in the log segment (more batching often improves the effectiveness of compression), the breakdown of compression types found in the log segment, and the results of applying each compression type. A sample output:

Sample Output
Analyzing /kafka/test-topic-0/00000000000525233956.log
Original log size: 536793767 bytes
Uncompressed log size: 536793767 bytes
Original compression ratio: 1.00
Original space savings: 0.00%


Batch stats:
  16593/20220 batches contain >1 message
  Avg number of messages per batch: 3.68
  Avg batch size (original): 5180 bytes
  Avg batch size (uncompressed): 5180 bytes

Number of input batches by compression type:
  none: 20220

COMPRESSION-TYPE  COMPRESSED-SIZE  SPACE-SAVINGS  COMPRESSION-RATIO  AVG-RATIO/BATCH  TOTAL-TIME        SPEED
gzip                    118159324         77.99%              4.543            1.795     13875ms   36.90 MB/s
snappy                  160597012         70.08%              3.342            1.549      2678ms  191.16 MB/s
lz4                     161711232         69.87%              3.319            1.576      2616ms  195.69 MB/s
zstd                    112737048         79.00%              4.761            1.775      5103ms  100.32 MB/s


Sample Output 2
Analyzing /kafka/test-topic-1/00000000000000000000.log
Original log size: 14510269 bytes
Uncompressed log size: 16080153 bytes
Original compression ratio: 1.11
Original space savings: 9.76%

Batch stats:
  6/2875 batches contain >1 message
  Avg messages/batch: 1.01
  Avg batch size (original): 1255 bytes
  Avg batch size (uncompressed): 3125 bytes

Number of input batches by compression type:
  none: 1784
  gzip: 525
  snappy: 275
  lz4: 291

COMPRESSION-TYPE  COMPRESSED-SIZE  SPACE-SAVINGS  TOTAL-RATIO  AVG-RATIO/BATCH  TOTAL-TIME        SPEED
gzip                       422829         97.37%        38.03            21.43       168ms   91.28 MB/s
snappy                    1103867         93.14%        14.57            10.30        45ms  340.78 MB/s
lz4                        423965         97.36%        37.93            21.46       195ms   78.64 MB/s
zstd                       352861         97.81%        45.57            25.46       251ms   61.10 MB/s

Breakdown of outputs:

Compression Type - the compression type applied to the log segment for that row
Compressed Size - size in bytes of the log segment after compression
Space Savings - the reduction in size relative to the uncompressed size, i.e. 1 - (compressed size / uncompressed size)
Compression Ratio (or Total Ratio) - the ratio of the uncompressed size to the compressed size
Avg Ratio/Batch - the mean compression ratio on a per-batch basis
Total Time - how long it took to compress all batches with the given compression type
Speed - the average rate, in MB/s, at which the compression type compresses the log segment
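
The relationship between these columns can be sketched in a few lines of Java. This is only an illustration of the definitions above, not the tool's implementation: the class and method names are hypothetical, the speed calculation assumes 1 MB = 1024 * 1024 bytes, and gzip (via the JDK's `java.util.zip`) stands in for whichever codec is being measured.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper illustrating the output columns; not part of the proposed tool.
public class CompressionMetrics {

    // Compression ratio: uncompressed size divided by compressed size.
    public static double compressionRatio(long uncompressedBytes, long compressedBytes) {
        return (double) uncompressedBytes / compressedBytes;
    }

    // Space savings: reduction in size relative to the uncompressed size, in percent.
    public static double spaceSavingsPercent(long uncompressedBytes, long compressedBytes) {
        return 100.0 * (1.0 - (double) compressedBytes / uncompressedBytes);
    }

    // Speed in MB/s, assuming 1 MB = 1024 * 1024 bytes (an assumption of this sketch).
    public static double speedMbPerSec(long uncompressedBytes, long elapsedMs) {
        double megabytes = uncompressedBytes / (1024.0 * 1024.0);
        return megabytes / (elapsedMs / 1000.0);
    }

    // Compresses a single payload with gzip and returns the compressed size in bytes.
    public static long gzipSize(byte[] payload) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(sink)) {
            gzip.write(payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.size();
    }

    public static void main(String[] args) {
        // Sizes and time taken from the gzip row of the second sample output.
        long uncompressed = 16080153L;
        long compressed = 422829L;
        long elapsedMs = 168L;
        System.out.printf("ratio=%.2f savings=%.2f%% speed=%.2f MB/s%n",
                compressionRatio(uncompressed, compressed),
                spaceSavingsPercent(uncompressed, compressed),
                speedMbPerSec(uncompressed, elapsedMs));
    }
}
```

Fed with the uncompressed size, compressed size, and total time from the gzip row of the second sample, these formulas should reproduce that row's ratio, space savings, and speed to within rounding.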

...

There may be situations where it is not desirable for all batches to be compressed with a single compression type. For this reason, it may eventually be useful to provide a way to restrict which batches are compressed for the analysis. For example, batches already compressed with a certain compression type could be excluded from recompression, so that only the remaining subset of the log is analyzed. However, this can be implemented as a follow-up addition once there is clearer motivation for which mechanisms are needed and how they should work.

Rejected Alternatives

Another approach could be to run the tool as a consumer-like process that would fetch batches from the Kafka cluster and perform the compression measurements directly on those batches. This would require the tool to be provided with the appropriate authentication information for the topic/partition being analyzed. It would also require batches of records to be exposed to the tool, which the consumer's interface and internals (specifically the fetcher) do not currently do.