(I) Experiment on the necessity of TimeseriesMetadata
...
- Although an index area structure without TimeseriesMetadata is slightly faster for raw data queries, it is much slower for aggregation queries. => We should reserve TimeseriesMetadata.
- The time cost in the data area of TsFile does not change.
(II) Experiment on combining Chunk and Page
Do we need both Chunk and Page, or is it enough to reserve only one of them?
How many points can a chunk hold when the chunk size is 64K, 1M, 2M, 3M, or 4M?
(1) Write one timeseries into one TsFile, using the long data type and random data.
(2) Adjust the number of points to reach the target chunk size.
Code Block:
File f = new File("test.tsfile"); // output TsFile (file name is illustrative)
Random random = new Random();     // source of random long values

try (TsFileWriter tsFileWriter = new TsFileWriter(f)) {
  // only one timeseries
  tsFileWriter.registerTimeseries(
      new Path(Constant.DEVICE_PREFIX, Constant.SENSOR_1),
      new UnaryMeasurementSchema(Constant.SENSOR_1, TSDataType.INT64, TSEncoding.RLE));
  // construct TSRecord
  for (int i = 1; i <= 7977; i++) { // change here to adjust the chunk size
    TSRecord tsRecord = new TSRecord(i, Constant.DEVICE_PREFIX);
    DataPoint dPoint1 = new LongDataPoint(Constant.SENSOR_1, random.nextLong());
    tsRecord.addTuple(dPoint1);
    // write TSRecord
    tsFileWriter.write(tsRecord);
  }
}
Here are the results:
chunk size | ~64K | ~1M | ~2M | ~3M | ~4M |
number of points | 7,977 | 125,000 | 260,000 | 390,000 | 520,000 |
number of pages | 1 | 16 | 32 | 49 | 66 |
page size (uncompressed, bytes) | 65,398 | 65,398 | 65,398 | 65,398 | 65,398 |
page size (compressed, bytes) | 64,275 | 64,275 | 64,275 | 64,275 | 64,275 |
Consider the scenarios below (only one timeseries); the arithmetic is sketched in the code after the list:
1. A scenario that generates 5 data points per second (one chunk per day, 5 Hz).
One day generates 432,000 points (about 54 pages). Therefore, 1 chunk has 54 pages (about 3.4M).
2. A scenario that generates one data point per second (one chunk per day, 1 Hz).
One day generates 86,400 points (about 11 pages). Therefore, 1 chunk has 11 pages (about 693K).
3. A scenario that generates 5 data points per minute (one chunk per day, 1/12 Hz).
One day generates 7,200 points (about 1 page). Therefore, 1 chunk has 1 page (about 56.6K).
4. A scenario that generates one data point per minute (one chunk per week, 1/60 Hz).
One week generates 10,080 points (about 1.3 pages). Therefore, 1 chunk has 1~2 pages (about 79.3K).
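The arithmetic above can be reproduced with a small sketch, assuming roughly 7,977 long points per ~64 KB page as measured in the table; the class and method names below are illustrative only, not IoTDB code.
Code Block:
// Illustrative arithmetic for the four scenarios above (not IoTDB code).
public class ChunkSizeEstimator {

  // From the experiment: one ~64 KB page holds about 7,977 long points.
  private static final double POINTS_PER_PAGE = 7_977;
  private static final double BYTES_PER_PAGE = 65_398;

  static void estimate(String name, double pointsPerSecond, long seconds) {
    double points = pointsPerSecond * seconds;
    double pages = points / POINTS_PER_PAGE;
    double chunkBytes = pages * BYTES_PER_PAGE;
    System.out.printf("%s: %.0f points, %.1f pages, ~%.0f KB per chunk%n",
        name, points, pages, chunkBytes / 1024);
  }

  public static void main(String[] args) {
    long day = 24 * 3600;
    long week = 7 * day;
    estimate("5 Hz, one chunk per day", 5, day);                  // ~432,000 points, ~54 pages
    estimate("1 Hz, one chunk per day", 1, day);                  // ~86,400 points, ~11 pages
    estimate("5 points/min, one chunk per day", 5.0 / 60, day);   // ~7,200 points, ~1 page
    estimate("1 point/min, one chunk per week", 1.0 / 60, week);  // ~10,080 points, ~1.3 pages
  }
}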
Reserve both Chunk and Page:
- Chunk and Page form two levels of indexes in one TsFile, suitable for aggregation and time filters at different granularities (see the sketch after these two options).
- Chunk is the unit of I/O and Page is the unit of query.
- When one Chunk has multiple Pages, this structure is better.
Reserve only Page:
- Only one level of index in one TsFile.
- Suitable for the small-chunk (massive-timeseries) scenario, in which one chunk has only 1~2 pages.
(Note: Since 0.12, if a Chunk has only one Page, its PageStatistics is removed and the statistics are stored only in the ChunkMetadata.)
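To illustrate how the two index levels serve aggregation queries with time filters at different granularities, here is a simplified sketch. The classes and methods below (SimpleStatistics, SimpleChunkIndex, countAfter, ...) are hypothetical and only mirror the idea of chunk-level and page-level statistics; they are not the real TsFile classes.
Code Block:
// Hypothetical two-level index, illustrating the trade-off above (not real TsFile classes).
import java.util.List;

class SimpleStatistics {
  long startTime;
  long endTime;
  long count;
}

class SimplePageIndex {
  SimpleStatistics statistics; // page-level statistics (omitted since 0.12 when a chunk has one page)
}

class SimpleChunkIndex {
  SimpleStatistics statistics; // chunk-level statistics kept in ChunkMetadata
  List<SimplePageIndex> pages;

  // count(time >= t): use the coarse chunk index when the whole chunk qualifies,
  // fall back to page statistics, and only decode data when a page partially overlaps.
  long countAfter(long t) {
    if (statistics.startTime >= t) {
      return statistics.count;               // whole chunk satisfies the filter
    }
    long count = 0;
    for (SimplePageIndex page : pages) {
      if (page.statistics.startTime >= t) {
        count += page.statistics.count;      // whole page satisfies the filter
      } else if (page.statistics.endTime >= t) {
        count += decodeAndCount(page, t);    // partial overlap: must read PageData
      }
    }
    return count;
  }

  long decodeAndCount(SimplePageIndex page, long t) {
    return 0; // placeholder: decode the page and count points with time >= t
  }
}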
(III) Experiment on how to store PageHeader
(a) Store each PageHeader together with its PageData (the current design).
(b) Store all PageHeaders together with the ChunkHeader.
The current reading method is to load the whole Chunk into memory and then read from it.
With this reading method, (a) and (b) take the same amount of time.
If a finer-grained, page-by-page reading method is used instead, the analysis is as follows (a cost-model sketch follows the two examples below):
For raw data query in a Chunk:
(1) time > t:
Analysis: suppose the Chunk contains n Pages in total and m of them satisfy the time filter; reading one PageHeader costs th, reading one PageData costs td, and one seek costs ts.
(a) Read the PageHeaders in order, seeking over the PageData of the non-matching Pages, then sequentially read the PageData of the matching Pages at the end.
This reads n PageHeaders and m PageData, with (n - m) seeks. Time cost: n * (th + ts) + m * (td - ts).
(b) Sequentially read the first few PageHeaders, then sequentially read the later PageData.
This reads (n - m) PageHeaders and m PageData, with 1 seek. Time cost: n * th + m * (td - th) + ts.
The former costs Δt more than the latter, where Δt = [n * (th + ts) + m * (td - ts)] - [n * th + m * (td - th) + ts] = (n - 1) * ts + m * (th - ts). Since n >= 1 and th > ts (reading a PageHeader also requires a seek, so th > ts), Δt >= 0, and Δt > 0 whenever m >= 1, so (b) never costs more than (a).
Example:
Suppose the Chunk has 6 Pages and the first two Pages do not satisfy the time filter.
For (a), it reads 6 PageHeaders and 4 PageData, with 2 seeks.
For (b), it reads 2 PageHeaders and 4 PageData, with 1 seek.
(2) time < t:
Analysis: suppose the Chunk contains n Pages in total and m of them satisfy the time filter; reading one PageHeader costs th, reading one PageData costs td, and one seek costs ts.
(a) Sequentially read the first few Pages, which are the ones that satisfy the time filter.
This reads m PageHeaders and m PageData, with 0 seeks. Time cost: m * (th + td).
(b) Sequentially read the first few PageHeaders, then sequentially read the corresponding PageData.
This reads m PageHeaders and m PageData, with 1 seek. Time cost: m * (th + td) + ts.
Example:
Suppose the Chunk has 6 Pages and the first two Pages satisfy the time filter.
For (a), it reads 2 PageHeaders and 2 PageData, with 0 seeks.
For (b), it reads 2 PageHeaders and 2 PageData, with 1 seek.
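The four cost formulas above can be encoded in a small cost model; th, td, and ts are abstract per-operation costs, the function names are illustrative, and the values in main() simply replay the 6-page examples.
Code Block:
// Cost model for the two PageHeader layouts, following the formulas above.
public class PageHeaderCostModel {

  // n: pages in the chunk, m: pages matching the time filter,
  // th/td/ts: cost of reading one PageHeader / one PageData / one seek.

  // time > t, design (a): each PageHeader stored with its PageData.
  static double rawGreaterA(int n, int m, double th, double td, double ts) {
    return n * (th + ts) + m * (td - ts); // n headers, m data, (n - m) seeks
  }

  // time > t, design (b): all PageHeaders stored with the ChunkHeader.
  static double rawGreaterB(int n, int m, double th, double td, double ts) {
    return n * th + m * (td - th) + ts;   // (n - m) headers, m data, 1 seek
  }

  // time < t, design (a).
  static double rawLessA(int m, double th, double td) {
    return m * (th + td);                 // m headers, m data, no seek
  }

  // time < t, design (b).
  static double rawLessB(int m, double th, double td, double ts) {
    return m * (th + td) + ts;            // m headers, m data, 1 seek
  }

  public static void main(String[] args) {
    double th = 2, td = 20, ts = 1;       // arbitrary relative costs, with th > ts
    int n = 6, m = 4;                     // 6-page example: first 2 pages filtered out
    System.out.println("time > t, (a): " + rawGreaterA(n, m, th, td, ts));
    System.out.println("time > t, (b): " + rawGreaterB(n, m, th, td, ts));
    // Difference is (n - 1) * ts + m * (th - ts), so (b) is never slower here.
    System.out.println("time < t, (a): " + rawLessA(2, th, td));
    System.out.println("time < t, (b): " + rawLessB(2, th, td, ts));
  }
}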
For aggregation query in a Chunk:
(1) time > t:
- (a) Skip-read all PageHeaders (seeking over each PageData) to obtain the aggregation result.
- (b) Sequentially read all PageHeaders to obtain the aggregation result.
(2) time < t:
- (a) Skip-read the first few PageHeaders that satisfy the time filter to obtain the aggregation result.
- (b) Sequentially read the first few PageHeaders that satisfy the time filter to obtain the aggregation result.
Conclusion:
According to the theoretical analysis, design (b) performs better in both raw data queries and aggregation queries.
Design (a) is used only because storing each PageHeader together with its PageData is easy to understand.