(I) Experiment on the necessity of TimeseriesMetadata
...
- Although the index area structure without TimeseriesMetadata is slightly faster for raw data queries, it is much slower for aggregation queries. => We should keep TimeseriesMetadata.
- The time cost in the data area of TsFile does not change.
(II) Experiment on combining Chunk and Page
...
1. A scenario that generates 5 data points per second (5 Hz; one chunk per day).
One day generates 432,000 points (about 54 pages), so 1 chunk has 54 pages (about 3.4M). In this scenario, both chunk and page are necessary.
2. A scenario that generates 1 data point per second (1 Hz; one chunk per day).
One day generates 86,400 points (about 11 pages), so 1 chunk has 11 pages (about 693K). In this scenario, both chunk and page are necessary.
3. A scenario that generates 5 data points per minute (1/12 Hz; one chunk per day).
One day generates 7,200 points (about 1 page), so 1 chunk has 1 page (about 56.6K). In this scenario, only one of chunk and page needs to be kept.
4. A scenario that generates 1 data point per minute (1/60 Hz; one chunk per week).
One week generates 10,080 points (about 1.3 pages), so 1 chunk has 1~2 pages (about 79.3K). In this scenario, only one of chunk and page needs to be kept.
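The point and page counts in the four scenarios can be reproduced with a short calculation. This is a minimal sketch: the figure of about 8,000 points per page is an assumption inferred from "432,000 points (about 54 pages)", not a value stated on this page, and `chunk_shape` is a hypothetical helper name.

```python
from fractions import Fraction

# Assumption (not stated in the text): about 8,000 points per page,
# inferred from "432,000 points (about 54 pages)".
POINTS_PER_PAGE = 8000

DAY = 24 * 60 * 60  # seconds in one day
WEEK = 7 * DAY      # seconds in one week


def chunk_shape(points_per_second, chunk_span_seconds):
    """Return (points per chunk, pages per chunk) for one timeseries."""
    points = Fraction(points_per_second) * chunk_span_seconds
    pages = points / POINTS_PER_PAGE
    return points, pages


scenarios = [
    ("5 points/s, one chunk per day", Fraction(5), DAY),
    ("1 point/s, one chunk per day", Fraction(1), DAY),
    ("5 points/min, one chunk per day", Fraction(5, 60), DAY),
    ("1 point/min, one chunk per week", Fraction(1, 60), WEEK),
]

for name, rate, span in scenarios:
    points, pages = chunk_shape(rate, span)
    print(f"{name}: {int(points)} points, about {float(pages):.2f} pages")
```

Under that assumption the two low-frequency scenarios come out at roughly one page per chunk, which is the case where the chunk level and the page level carry the same information.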
Keep both chunk and page:
- Chunk and Page are two levels of index in one TsFile, suitable for aggregation and time filtering at different granularities.
- Chunk is the unit of I/O and Page is the unit of query, which provides multiple levels of I/O
- Suitable for all kinds of query scenarios, whether aggregation query or raw data query
- When one Chunk has multiple pages, this structure is better.
Keep only page:
- One level of index in one TsFile.
- Suitable for APM scenarios
- Simple structure, which could reduce one level of I/O
...
- small Chunk (Mass Timeseries) scenario, in which 1 chunk has only 1~2 pages
(Note: Since 0.12, if one Chunk has only one Page, PageStatistics is removed and statistics are stored only in ChunkMetadata.)
(III) Experiment on how to store PageHeader
(a) store PageHeader with PageData (current design)
(b) combine PageHeader with ChunkHeader
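The current read path loads the whole Chunk into memory and then reads from it.
Under that approach, (a) and (b) take the same time.
If the Chunk is instead read piece by piece at a finer granularity, the analysis is as follows: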
For raw data query in a Chunk:
(1) time > t:
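Analysis: suppose the Chunk has n Pages in total, m of which satisfy the time filter; reading a PageHeader costs th, reading a PageData costs td, and a seek costs ts.
(a) Read through the first few Pages in order, then sequentially read the PageData that follows.
This reads n PageHeaders and m PageData, and seeks (n - m) times. The cost is n * (th + ts) + m * (td - ts)
(b) Sequentially read the first few PageHeaders, then sequentially read the PageData that follows.
This reads (n - m) PageHeaders and m PageData, and seeks once. The cost is n * th + m * (td - th) + ts
(a) costs more than (b) by Δt = (n - 1) * ts + m * (th - ts). Since n >= 1 and th > ts (reading a PageHeader also requires a seek, so th > ts),
Δt > 0 whenever at least one Page matches (m >= 1), so (b) always takes less time than (a) in that case.
Example:
Suppose the Chunk has 6 Pages, and the first two Pages do not satisfy the time filter.
For (a), 6 PageHeaders and 4 PageData are read, with 2 seeks.
For (b), 2 PageHeaders and 4 PageData are read, with 1 seek.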
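The two cost expressions for the time > t case can be checked numerically. The sketch below encodes only the counts given in the analysis; the function names and the unit costs in the example (th = 3, td = 50, ts = 2) are hypothetical values chosen for illustration, not measurements.

```python
def raw_gt_cost_a(n, m, th, td, ts):
    """Layout (a), time > t: n PageHeaders, m PageData, (n - m) seeks."""
    return n * th + m * td + (n - m) * ts


def raw_gt_cost_b(n, m, th, td, ts):
    """Layout (b), time > t: (n - m) PageHeaders, m PageData, 1 seek."""
    return (n - m) * th + m * td + ts


def raw_gt_delta(n, m, th, ts):
    """Closed form from the analysis: extra cost of (a) over (b)."""
    return (n - 1) * ts + m * (th - ts)


# Example from the text: 6 Pages, the first 2 fail the filter, so m = 4.
# Hypothetical unit costs, chosen only so that th > ts.
n, m, th, td, ts = 6, 4, 3, 50, 2
cost_a = raw_gt_cost_a(n, m, th, td, ts)
cost_b = raw_gt_cost_b(n, m, th, td, ts)
print(cost_a, cost_b, cost_a - cost_b, raw_gt_delta(n, m, th, ts))
```

With these illustrative numbers the difference between the two layouts equals the closed-form Δt, and the PageData term cancels out, so the gap depends only on header and seek costs.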
(2) time < t:
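Analysis: suppose the Chunk has n Pages in total, m of which satisfy the time filter; reading a PageHeader costs th, reading a PageData costs td, and a seek costs ts.
(a) Sequentially read the first few Pages that satisfy the time filter.
This reads m PageHeaders and m PageData, with 0 seeks. The cost is m * (th + td)
(b) Sequentially read the first few PageHeaders, then sequentially read part of the PageData.
This reads m PageHeaders and m PageData, with 1 seek. The cost is m * (th + td) + ts
Example:
Suppose the Chunk has 6 Pages, and the first two Pages satisfy the time filter.
For (a), 2 PageHeaders and 2 PageData are read, with 0 seeks.
For (b), 2 PageHeaders and 2 PageData are read, with 1 seek.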
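The time < t case can be sketched the same way. Again the function names and the unit costs (th = 3, td = 50, ts = 2) are hypothetical and serve only to illustrate the formulas above.

```python
def raw_lt_cost_a(m, th, td):
    """Layout (a), time < t: m PageHeaders, m PageData, no seek."""
    return m * (th + td)


def raw_lt_cost_b(m, th, td, ts):
    """Layout (b), time < t: m PageHeaders, m PageData, 1 seek."""
    return m * (th + td) + ts


# Example from the text: 6 Pages, the first 2 satisfy the filter, so m = 2.
# Hypothetical unit costs.
m, th, td, ts = 2, 3, 50, 2
cost_a = raw_lt_cost_a(m, th, td)
cost_b = raw_lt_cost_b(m, th, td, ts)
print(cost_a, cost_b, cost_b - cost_a)
```

In this case the two layouts differ by exactly one seek, independent of m.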
For aggregation query in a Chunk:
(1) time > t:
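- (a) Read all PageHeaders with skips (jumping from one PageHeader to the next) to get the aggregation result
- (b) Read all PageHeaders sequentially to get the aggregation result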
(2) time < t:
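- (a) Read the first few PageHeaders that satisfy the time filter with skips to get the aggregation result
- (b) Read the first few PageHeaders that satisfy the time filter sequentially to get the aggregation result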
Conclusion:
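From the theoretical analysis, scheme (b) would perform well in both raw data queries and aggregation queries.
Scheme (a) is used only because storing each PageHeader together with its PageData is easier to understand.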