(I) Experiment on the necessity of TimeseriesMetadata

Jira (ASF JIRA): IOTDB-1831

...

  1. Although an index area without TimeseriesMetadata is slightly faster for raw data queries,
    it is much slower for aggregation queries. => We should keep TimeseriesMetadata.
  2. The time cost in the data area of TsFile does not change.


(II) Experiment on combining Chunk and Page

Jira (ASF JIRA): IOTDB-1832

...

1. For a scenario that generates 5 data points per second (5 Hz, one chunk per day)

One day generates 432,000 points (about 54 pages), so one chunk has 54 pages (about 3.4M). In scenarios like this, both chunk and page are necessary.

2. For a scenario that generates one data point per second (1 Hz, one chunk per day)

One day generates 86,400 points (about 11 pages), so one chunk has 11 pages (about 693K). In this scenario, both chunk and page are necessary.

3. For a scenario that generates 5 data points per minute (1/12 Hz, one chunk per day)

One day generates 7,200 points (about 1 page), so one chunk has 1 page (about 56.6K). In this scenario, only one of chunk and page needs to be kept.

4. For a scenario that generates one data point per minute (1/60 Hz, one chunk per week)

One week generates 10,080 points (about 1.3 pages), so one chunk has 1~2 pages (about 79.3K). In this scenario, only one of chunk and page needs to be kept.
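
The point and page counts in the four scenarios can be reproduced with a short calculation. This is only a sketch: the page capacity of 8,000 points is an assumption inferred from the ratio 432,000 points to about 54 pages, and the names `POINTS_PER_PAGE` and `chunk_stats` are hypothetical, not taken from IoTDB.

```python
# Assumption: about 8,000 points fit in one page, inferred from the
# first scenario (432,000 points ~ 54 pages). Not an IoTDB constant.
POINTS_PER_PAGE = 8000
DAY = 24 * 60 * 60   # seconds in one day
WEEK = 7 * DAY       # seconds in one week


def chunk_stats(points, period_seconds, chunk_span_seconds):
    """Return (points per chunk, pages per chunk) for a series that emits
    `points` data points every `period_seconds`, with one chunk covering
    `chunk_span_seconds`."""
    total = points * chunk_span_seconds // period_seconds
    return total, total / POINTS_PER_PAGE


scenarios = {
    "5 Hz, one chunk per day": (5, 1, DAY),
    "1 Hz, one chunk per day": (1, 1, DAY),
    "1/12 Hz, one chunk per day": (5, 60, DAY),
    "1/60 Hz, one chunk per week": (1, 60, WEEK),
}

for name, args in scenarios.items():
    total, pages = chunk_stats(*args)
    print(f"{name}: {total} points, {pages:.2f} pages")
```

Under that assumption the two low-frequency scenarios come out at roughly one page per chunk, which is where the chunk and page levels become redundant.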


Keep both chunk and page:

  • Chunk and Page are two levels of index in one TsFile, suitable for aggregation and time filtering at different granularities.
  • Chunk is the unit of I/O and Page is the unit of query, which provides multiple levels of I/O.
  • Suitable for all kinds of query scenarios, whether aggregation query or raw data query.
  • This structure is better when one Chunk has multiple pages.

Keep only page:

  • One level of index in one TsFile.
  • Suitable for the APM scenario.
  • Simple structure, which removes one level of I/O.

...

  • Small-chunk (massive timeseries) scenario, in which one chunk has only 1~2 pages
    (Note: since 0.12, if a Chunk has only one Page, its PageStatistics are removed and statistics are stored only in ChunkMetadata)


(III) Experiment on how to store PageHeader

Jira (ASF JIRA): IOTDB-1833

(a) store PageHeader with PageData (current design)

(b) combine PageHeader with ChunkHeader


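The current read path loads the entire Chunk into memory and then reads from it.

With this approach, (a) and (b) take the same amount of time.


If the Chunk is instead read piece by piece in a fine-grained way, the analysis is as follows: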


For raw data query in a Chunk:

(1) time > t:

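Analysis: suppose the Chunk contains n Pages, of which m satisfy the time filter. Reading one PageHeader costs th, reading one PageData costs td, and one seek costs ts.

(a) Read the leading Pages in order, then read the following PageData sequentially

This reads n PageHeaders and m PageData, and seeks (n - m) times. The cost is n * (th + ts) + m * (td - ts)

(b) Read the leading PageHeaders sequentially, then read the following PageData sequentially

This reads (n - m) PageHeaders and m PageData, and seeks once. The cost is n * th + m * (td - th) + ts


(a) costs more than (b) by Δt = (n - 1) * ts + m * (th - ts). Since n >= 1 and th > ts (reading a PageHeader also requires a seek, so th > ts),

Δt > 0, so (b) always takes less time than (a).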


Example:

Suppose a Chunk has 6 Pages, and the first two Pages do not satisfy the time filter

For (a), we need to read 6 PageHeaders and 4 PageData, and seek 2 times

For (b), we need to read 2 PageHeaders and 4 PageData, and seek once
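
The `time > t` comparison can be checked numerically with a small cost model. This is a sketch only: the function names are hypothetical, and the values chosen for th, td and ts are arbitrary illustrative units (picked so that th > ts), not measurements.

```python
def cost_a(n, m, th, td, ts):
    """Layout (a), time > t: n PageHeaders, m PageData, (n - m) seeks."""
    return n * th + m * td + (n - m) * ts


def cost_b(n, m, th, td, ts):
    """Layout (b), time > t: (n - m) PageHeaders, m PageData, 1 seek."""
    return (n - m) * th + m * td + ts


def delta(n, m, th, ts):
    """Closed form for cost_a - cost_b: (n - 1) * ts + m * (th - ts)."""
    return (n - 1) * ts + m * (th - ts)


# 6 Pages, the first 2 filtered out, so m = 4 Pages match.
# th, td, ts are made-up units with th > ts.
n, m, th, td, ts = 6, 4, 3, 50, 2
a = cost_a(n, m, th, td, ts)
b = cost_b(n, m, th, td, ts)
print(a, b, a - b, delta(n, m, th, ts))
```

With these inputs the difference between the two costs matches the closed-form Δt, and layout (b) is the cheaper one.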




(2) time < t:

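Analysis: suppose the Chunk contains n Pages, of which m satisfy the time filter. Reading one PageHeader costs th, reading one PageData costs td, and one seek costs ts.

(a) Sequentially read the leading Pages that satisfy the time filter

This reads m PageHeaders and m PageData, with 0 seeks. The cost is m * (th + td)

(b) Sequentially read the leading PageHeaders, then sequentially read part of the PageData

This reads m PageHeaders and m PageData, and seeks once. The cost is m * (th + td) + ts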


Example:

Suppose a Chunk has 6 Pages, and the first two Pages satisfy the time filter

For (a), we need to read 2 PageHeaders and 2 PageData, with 0 seeks

For (b), we need to read 2 PageHeaders and 2 PageData, and seek once
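
For the `time < t` case the two layouts differ by exactly one seek, which the following sketch makes explicit. The function names are hypothetical and the values of th, td and ts are arbitrary illustrative units, not measurements.

```python
def cost_a_prefix(m, th, td):
    """Layout (a), time < t: m PageHeaders and m PageData, no seek."""
    return m * (th + td)


def cost_b_prefix(m, th, td, ts):
    """Layout (b), time < t: m PageHeaders, one seek, then m PageData."""
    return m * (th + td) + ts


# The first 2 Pages of the Chunk match the filter; units are made up.
m, th, td, ts = 2, 3, 50, 2
print(cost_a_prefix(m, th, td), cost_b_prefix(m, th, td, ts))
```

In this case layout (b) pays one extra seek (ts) regardless of how many Pages match.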


For aggregation query in a Chunk:

(1) time > t:

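  • (a) Skip-read all PageHeaders to obtain the aggregation result
  • (b) Sequentially read all PageHeaders to obtain the aggregation result

(2) time < t:

  • (a) Skip-read the leading PageHeaders that satisfy the time filter to obtain the aggregation result
  • (b) Sequentially read the leading PageHeaders that satisfy the time filter to obtain the aggregation result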


Conclusion:

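According to the theoretical analysis, scheme (b) performs better in both raw data queries and aggregation queries.

Scheme (a) is used only because storing each PageHeader together with its PageData is easier to understand.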