...

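  1. The synthetic dataset used in this experiment is not representative: its timestamps start at 1 and increase with a step of 1, and its values are random integers in [0,100). As a result, encoding and compression are unusually effective on the timestamp column and unusually ineffective on the value column. For the value column, PLAIN encoding even produces the smallest size of all the encodings tried.
  2. Results on the CRRC ZT11529 dataset:
    • The real dataset compresses well, so the amount of data on disk is relatively small. In this case [time to load Chunk data from disk] is less than [time to decompress and decode Page data], i.e. disk I/O is not the overall bottleneck.
    • Within step D-1, the bottleneck is the sub-step 7_2_data_decompress_PageDataByteArray. Note: in the synthetic-data experiment another sub-step, 7_1_data_ByteBuffer_to_ByteArray(us), also had a high share. The explanation is that the synthetic data compresses poorly, so 7_2_data_decompress_PageDataByteArray(us) takes relatively little time, which inflates the share of 7_1_data_ByteBuffer_to_ByteArray(us).
    • Within D-2, no single sub-step stands out as the bottleneck.
    • GZIP achieves the highest compression ratio of the compression methods tested, but there is a tradeoff between disk-load I/O cost and decompression cost, and the overall read time under GZIP is not the lowest.
  3. Follow-up
    1. Rerun the experiment with a larger real dataset; the CRRC data used so far is on the order of ten million points.
    2. The relationship between D-1 decompression and D-2 decoding, in both space savings and time cost, remains to be explored.
    3. Write time could be measured as well.
    4. Note that RLE encoding is lossy for floating-point values.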
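
One way an integer-oriented encoding such as RLE can become lossy for floating-point data is by scaling each value by a power of ten and rounding it to an integer before encoding. The sketch below models only that scale-and-round step. It is an assumption made for illustration: this page does not describe how IoTDB v0.13.1 implements RLE for floats, and the names `DECIMAL_DIGITS`, `to_scaled_int` and `from_scaled_int` are hypothetical.

```python
# Illustrative model of a scale-and-round step in front of an integer encoder.
# DECIMAL_DIGITS = 2 is an assumed setting, not a value taken from the experiments.
DECIMAL_DIGITS = 2
SCALE = 10 ** DECIMAL_DIGITS


def to_scaled_int(value: float) -> int:
    """Scale a float and round it to the nearest integer (this is where precision is lost)."""
    return round(value * SCALE)


def from_scaled_int(scaled: int) -> float:
    """Undo the scaling; digits beyond DECIMAL_DIGITS cannot be recovered."""
    return scaled / SCALE


original = 3.14159
restored = from_scaled_int(to_scaled_int(original))
print(original, "->", restored)
```

Under this assumption, any digits beyond the retained decimal places are dropped on a write/read round trip, so experiments that compare encodings on DOUBLE data should check whether the decoded values still match the input.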

Experimental Setup

IoTDB version

  • v0.13.1

Experimental environment

  • FIT building, 166.111.130.101 / 192.168.130.31
  • CPU: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz (6 cores, 12 threads)
  • L1 cache 284KB, L2 cache 1536KB, L3 cache 12MB
  • Memory: 16 GB
  • Disk: 1.8T HDD /dev/sdb1 mounted on /disk
  • Operating system: Ubuntu 16.04.7 LTS
  • Working directory: /disk/rl/tsfileReadExp/

...

| Compression method | GZIP | LZ4 | SNAPPY | UNCOMPRESSED |
|---|---|---|---|---|
| dataset | /disk/rl/zc_data/ZT11529.csv | /disk/rl/zc_data/ZT11529.csv | /disk/rl/zc_data/ZT11529.csv | /disk/rl/zc_data/ZT11529.csv |
| pagePointNum(ppn) | 10000 | 10000 | 10000 | 10000 |
| numOfPagesInChunk(pic) | 100 | 100 | 100 | 100 |
| chunksWritten(cw) | 13 | 13 | 13 | 13 |
| timeEncoding(te) | PLAIN | PLAIN | PLAIN | PLAIN |
| valueDataType(vt) | DOUBLE | DOUBLE | DOUBLE | DOUBLE |
| valueEncoding(ve) | PLAIN | PLAIN | PLAIN | PLAIN |
| compression(co) | GZIP | LZ4 | SNAPPY | UNCOMPRESSED |
| totalPointNum | 12780287 | 12780287 | 12780287 | 12780287 |
| tsfileSize(MB) | 50.7578783 | 74.71146107 | 75.81546879 | 195.0946026 |
| chunkDataSize_stats_mean(MB) | 3.977135579 | 5.846096277 | 5.932730039 | 15.26517868 |
| compressedPageSize_stats_mean(B) | 41639.28917 | 61236.7625 | 62145.18333 | 160003 |
| uncompressedPageSize_stats_mean(B) | 160003 | 160003 | 160003 | 160003 |
| timeBufferSize_stats_mean(B) | 80000 | 80000 | 80000 | 80000 |
| valueBufferSize_stats_mean(B) | 80000 | 80000 | 80000 | 80000 |
| total_time(us) | 2096685.5981 | 1244193.0404 | 1195582.1173 | 1895095.384 |
| [2] category: (A)get ChunkStatistic->(B)load on-disk Chunk->(C)get PageStatistics->(D)load in-memory PageData | | | | |
| [Avg&Per] (A)get_chunkMetadatas | 86791.3859 us - 4.1844839300349514% | 100869.9527 us - 8.731470875680406% | 99547.05960000001 us - 7.911992556094877% | 88166.6213 us - 4.659966496653105% |
| [Avg&Per] (B)load_on_disk_chunk | 349328.55549999996 us - 16.84222128306944% | 452828.32 us - 39.197572537012675% | 450015.7795 us - 35.76721916082827% | 1155773.8993 us - 61.08737716190697% |
| [Avg&Per] (C)get_pageHeader | 2818.293900000001 us - 0.13587875585087916% | 3913.898699999998 us - 0.3387935811871694% | 3502.114700000002 us - 0.27834780402685505% | 4450.306 us - 0.23521687180559223% |
| [Avg&Per] (D_1)decompress_pageData | 1350175.8349000001 us - 65.09619618668371% | 261417.90119999985 us - 22.628768326063632% | 395144.6617 us - 31.40606698493612% | 173785.97769999996 us - 9.185299626198828% |
| [Avg&Per] (D_2)decode_pageData | 285009.94010000007 us - 13.74121984436101% | 336215.7518000001 us - 29.103394680056127% | 309969.7735 us - 24.636373494113883% | 469824.3798999997 us - 24.832139843435506% |
| SUM | 2074124.0103 | 1155245.8243999998 | 1258179.389 | 1892001.1841999996 |
| [3] D_1 compare each step inside | | | | |
| [Avg&Per] (D-1)7_1_data_ByteBuffer_to_ByteArray(us) | 4312.080999999998 us - 0.3365746911741179% | 10247.3701 us - 5.205625954425964% | 9355.254500000003 us - 2.648386471251286% | 33088.566800000015 us - 56.71920599957147% |
| [Avg&Per] (D-1)7_2_data_decompress_PageDataByteArray(us) | 1274619.9271000002 us - 99.48904214184739% | 183796.93149999995 us - 93.36815862249874% | 341381.81159999996 us - 96.64204980983628% | 21318.842899999996 us - 36.54397754446109% |
| [Avg&Per] (D-1)7_3_data_ByteArray_to_ByteBuffer(us) | 454.68920000000026 us - 0.03549026028736633% | 659.1047999999998 us - 0.334822790636471% | 604.4583000000001 us - 0.17111658310904862% | 973.5250000000001 us - 1.6687808013713306% |
| [Avg&Per] (D-1)7_4_data_split_time_value_Buffer(us) | 1779.4489000000008 us - 0.1388929066911369% | 2148.4263999999994 us - 1.091392632438828% | 1902.0298000000007 us - 0.5384471358033915% | 2956.5653000000016 us - 5.068035654596102% |
| [3] D_2 compare each step inside | | | | |
| [Avg&Per] (D-2)8_1_createBatchData(us) | 3343.3923 us - 0.259976288059772% | 3522.578 us - 0.27212097540020236% | 3458.9225 us - 0.2672966393395521% | 3730.5984 us - 0.2816327187407449% |
| [Avg&Per] (D-2)8_2_timeDecoder_hasNext(us) | 232202.511 us - 18.05565768873081% | 231677.2947 us - 17.897191037883086% | 231911.8133 us - 17.921548782382853% | 237390.8697 us - 17.921263258420133% |
| [Avg&Per] (D-2)8_3_timeDecoder_readLong(us) | 254086.67 us - 19.75733129255225% | 255389.3804 us - 19.72896194244707% | 255976.5501 us - 19.78120978179259% | 261059.1494 us - 19.70804415658043% |
| [Avg&Per] (D-2)8_4_valueDecoder_read(us) | 241634.0221 us - 18.78903535624908% | 242640.7293 us - 18.744122040429612% | 242328.9765 us - 18.726560376256852% | 246610.5964 us - 18.6172847590372% |
| [Avg&Per] (D-2)8_5_checkValueSatisfyOrNot(us) | 230100.908 us - 17.892240746329144% | 231053.5736 us - 17.849008259784295% | 231043.8535 us - 17.85447508020484% | 235200.0462 us - 17.755872210542634% |
| [Avg&Per] (D-2)8_6_putIntoBatchData(us) | 324669.8983 us - 25.24575862807894% | 330206.1447 us - 25.508595744055732% | 329318.78729999997 us - 25.448909340023306% | 340641.1962 us - 25.715902896678855% |
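
The size rows above can be cross-checked with a little arithmetic. The sketch below recomputes the PLAIN buffer sizes from pagePointNum and derives a per-page compression ratio for each compression method. Two things in it are assumptions rather than statements from the report: that a PLAIN-encoded timestamp or DOUBLE value occupies 8 bytes, and that "compression ratio" means uncompressed size divided by compressed size.

```python
# Mean page sizes in bytes, copied from the table above.
POINTS_PER_PAGE = 10000   # pagePointNum(ppn)
BYTES_PER_VALUE = 8       # assumption: PLAIN stores each long/double in 8 bytes

time_buffer = POINTS_PER_PAGE * BYTES_PER_VALUE    # compare with timeBufferSize_stats_mean
value_buffer = POINTS_PER_PAGE * BYTES_PER_VALUE   # compare with valueBufferSize_stats_mean
uncompressed_page = 160003                         # uncompressedPageSize_stats_mean

# Bytes in the uncompressed page that belong to neither buffer.
overhead = uncompressed_page - time_buffer - value_buffer

compressed_page = {
    "GZIP": 41639.28917,
    "LZ4": 61236.7625,
    "SNAPPY": 62145.18333,
    "UNCOMPRESSED": 160003.0,
}

# Ratio convention assumed here: uncompressed bytes / compressed bytes.
ratios = {codec: uncompressed_page / size for codec, size in compressed_page.items()}

print("overhead bytes per page:", overhead)
for codec, ratio in ratios.items():
    print(f"{codec:<13} ratio = {ratio:.2f}")
```

With these numbers the two buffers account for all but 3 bytes of each uncompressed page, and GZIP has the largest ratio, followed by LZ4 and SNAPPY.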


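  • When both the timestamp column and the value column use PLAIN encoding, compression alone is responsible for all of the size reduction, and the time share of D-1 rises noticeably. Even so, for every compression method other than GZIP the D-1 share stays below 60%, and D-2 decoding still carries a substantial baseline cost.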
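
The tradeoff between disk loading and decompression can be read off the table by adding steps (B) and (D_1) for each compression method. The sketch below does that with the mean times copied from the table and also recomputes the D-1 share against the sum of the five measured steps. The dictionary layout and variable names are illustrative choices, not part of the experiment scripts.

```python
# Mean step times in microseconds, copied from the table above.
times_us = {
    "GZIP": {"A": 86791.3859, "B": 349328.55549999996, "C": 2818.293900000001,
             "D_1": 1350175.8349000001, "D_2": 285009.94010000007},
    "LZ4": {"A": 100869.9527, "B": 452828.32, "C": 3913.898699999998,
            "D_1": 261417.90119999985, "D_2": 336215.7518000001},
    "SNAPPY": {"A": 99547.05960000001, "B": 450015.7795, "C": 3502.114700000002,
               "D_1": 395144.6617, "D_2": 309969.7735},
    "UNCOMPRESSED": {"A": 88166.6213, "B": 1155773.8993, "C": 4450.306,
                     "D_1": 173785.97769999996, "D_2": 469824.3798999997},
}

load_plus_decompress = {}
d1_share = {}
for codec, steps in times_us.items():
    step_sum = sum(steps.values())                        # the SUM row of the table
    load_plus_decompress[codec] = steps["B"] + steps["D_1"]
    d1_share[codec] = 100.0 * steps["D_1"] / step_sum     # D-1 share in percent
    print(f"{codec:<13} B+D_1 = {load_plus_decompress[codec]:12.1f} us, "
          f"D_1 share = {d1_share[codec]:5.1f}%")

cheapest_load = min(times_us, key=lambda c: times_us[c]["B"])
costliest_combined = max(load_plus_decompress, key=load_plus_decompress.get)
print("lowest (B):", cheapest_load, "| highest (B)+(D_1):", costliest_combined)
```

On these figures GZIP has the lowest disk-load time but the highest combined load-plus-decompress time, and it is the only method whose D-1 share exceeds 60%.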

Varying the value-column encoding

RLValueEncodingRealExpScripts.sh

...