(I) Experiment on the necessity of TimeseriesMetadata
Now that TimeseriesMetadata is stored together with ChunkMetadata, the necessity of TimeseriesMetadata needs to be reconsidered. The experiments below are meant to inform that decision.
We compare aggregation queries and raw data queries on one timeseries in each TsFile, with and without TimeseriesMetadata, under different circumstances.
Each chunk has 100 points. Each query covers 500 TsFiles.
(1) with TimeseriesMetadata: the original TimeseriesMetadata
(2) without TimeseriesMetadata: TimeseriesMetadata carries no statistics
The query for 1 timeseries is tested separately on TsFiles that contain 1 timeseries and on TsFiles that contain 1000 timeseries.
Writing:
    String path =
        "/home/fit/szs/data/data/sequence/root.sg/0/" + chunkNum + "/test" + fileIndex + ".tsfile";
    File f = FSFactoryProducer.getFSFactory().getFile(path);
    if (f.exists()) {
      f.delete();
    }
    try (TsFileWriter tsFileWriter = new TsFileWriter(f)) {
      // only one timeseries
      tsFileWriter.registerTimeseries(
          new Path(Constant.DEVICE_PREFIX, Constant.SENSOR_1),
          new UnaryMeasurementSchema(Constant.SENSOR_1, TSDataType.INT64, TSEncoding.RLE));
      // construct TSRecord
      for (int i = 1; i <= chunkNum * 100; i++) {
        TSRecord tsRecord = new TSRecord(i, Constant.DEVICE_PREFIX);
        DataPoint dPoint1 = new LongDataPoint(Constant.SENSOR_1, i);
        tsRecord.addTuple(dPoint1);
        // write TSRecord
        tsFileWriter.write(tsRecord);
        // flush every 100 points so that each chunk holds exactly 100 points
        if (i % 100 == 0) {
          tsFileWriter.flushAllChunkGroups();
        }
      }
    }
Raw data query:
    for (int fileIndex = 0; fileIndex < fileNum; fileIndex++) {
      // file path
      String path =
          "/home/fit/szs/data/data/sequence/root.sg/0/" + chunkNum + "/test" + fileIndex + ".tsfile";
      // raw data query
      try (TsFileSequenceReader reader = new TsFileSequenceReader(path);
          ReadOnlyTsFile readTsFile = new ReadOnlyTsFile(reader)) {
        ArrayList<Path> paths = new ArrayList<>();
        paths.add(new Path(DEVICE1, "sensor_1"));
        QueryExpression queryExpression = QueryExpression.create(paths, null);
        long startTime = System.nanoTime();
        QueryDataSet queryDataSet = readTsFile.query(queryExpression);
        while (queryDataSet.hasNext()) {
          queryDataSet.next();
        }
        costTime += (System.nanoTime() - startTime);
      }
    }
Aggregation query:
    long totalStartTime = System.nanoTime();
    for (int fileIndex = 0; fileIndex < fileNum; fileIndex++) {
      // file path
      String path =
          "/home/fit/szs/data/data/sequence/root.sg/0/" + chunkNum + "/test" + fileIndex + ".tsfile";
      // aggregation query
      try (TsFileSequenceReader reader = new TsFileSequenceReader(path)) {
        Path seriesPath = new Path(DEVICE1, "sensor_1");
        long startTime = System.nanoTime();
        TimeseriesMetadata timeseriesMetadata = reader.readTimeseriesMetadata(seriesPath, false);
        long count = timeseriesMetadata.getStatistics().getCount();
        costTime += (System.nanoTime() - startTime);
      }
    }
    System.out.println(
        "Total raw read cost time: " + (System.nanoTime() - totalStartTime) / 1000_000 + "ms");
    System.out.println("Index area cost time: " + costTime / 1000_000 + "ms");
1 timeseries in one tsfile:
query | TimeseriesMetadata | measurement | chunk number: 1 | 2 | 3 | 5 | 8 | 10 | 15 | 20 | 25 |
raw | with | overall cost time (ms) | 210 | 230 | 237 | 250 | 276 | 297 | 309 | 344 | 374 |
raw | with | index area time (ms) | 116 | 131 | 142 | 156 | 185 | 197 | 220 | 255 | 282 |
raw | without | overall cost time (ms) | 219 | 223 | 242 | 267 | 287 | 302 | 334 | 357 |
raw | without | index area time (ms) | 131 | 136 | 155 | 182 | 200 | 219 | 251 | 274 |
count(*) | with | overall cost time (ms) | 89 | 90 | 91 | 93 | 93 | 93 | 94 | 97 | 97 |
count(*) | with | index area time (ms) | 15 | 16 | 16 | 16 | 16 | 16 | 16 | 17 | 17 |
count(*) | without | overall cost time (ms) | 122 | 123 | 127 | 127 | 127 | 127 | 128 | 130 |
count(*) | without | index area time (ms) | 50 | 50 | 50 | 50 | 51 | 52 | 52 | 53 |
1000 timeseries in one tsfile: (query for 1 timeseries as well)
query | TimeseriesMetadata | measurement | chunk number: 1 | 2 | 3 | 5 | 8 | 10 | 15 | 20 | 25 |
raw | with | overall cost time (ms) | 421 | 478 | 550 | 673 | 910 | 998 | 1394 | 1637 | 1966 |
raw | with | index area time (ms) | 274 | 332 | 403 | 528 | 763 | 853 | 1249 | 1496 | 1795 |
raw | without | overall cost time (ms) | 489 | 537 | 672 | 903 | 1010 | 1371 | 1650 | 1938 |
raw | without | index area time (ms) | 340 | 393 | 528 | 758 | 864 | 1232 | 1511 | 1789 |
count(*) | with | overall cost time (ms) | 260 | 271 | 290 | 331 | 399 | 397 | 562 | 609 | 647 |
count(*) | with | index area time (ms) | 133 | 142 | 158 | 197 | 265 | 267 | 427 | 472 | 513 |
count(*) | without | overall cost time (ms) | 307 | 326 | 359 | 428 | 447 | 583 | 620 | 713 |
count(*) | without | index area time (ms) | 177 | 195 | 227 | 296 | 315 | 447 | 486 | 553 |
Conclusion:
- Although the index area structure without TimeseriesMetadata is slightly faster for raw data queries, it is much slower for aggregation queries: with 1 timeseries per TsFile, the index area time for count(*) rises from 15-17 ms to 50-53 ms. => We should keep TimeseriesMetadata.
- The time spent in the data area of the TsFile does not change.
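One way to picture the aggregation result: with series-level statistics a count(*) is answered from a single value, whereas without them it presumably has to be assembled from the statistics of every chunk. The sketch below uses a plain Java stand-in class (SeriesMetadataSketch is a hypothetical name, not a TsFile class) and assumes that per-chunk counts remain available when the series-level statistics are dropped; it illustrates the idea and is not the code used in the experiment.

```java
import java.util.Arrays;
import java.util.List;

/** Stand-in for illustration only; this is not the TsFile TimeseriesMetadata class. */
final class SeriesMetadataSketch {
  // null models case (2): "TimeseriesMetadata has no statistics".
  private final Long seriesCount;
  // Point count of each chunk (assumed to stay available at chunk level).
  private final List<Long> chunkCounts;

  SeriesMetadataSketch(Long seriesCount, List<Long> chunkCounts) {
    this.seriesCount = seriesCount;
    this.chunkCounts = chunkCounts;
  }

  /** Answers count(*) from the series-level value if present, else by summing all chunks. */
  long count() {
    if (seriesCount != null) {
      return seriesCount; // one lookup, independent of the number of chunks
    }
    long total = 0;
    for (long chunkCount : chunkCounts) {
      total += chunkCount; // one visit per chunk
    }
    return total;
  }

  public static void main(String[] args) {
    // Three chunks of 100 points each, as in the experiment setup.
    List<Long> chunks = Arrays.asList(100L, 100L, 100L);
    System.out.println(new SeriesMetadataSketch(300L, chunks).count());
    System.out.println(new SeriesMetadataSketch(null, chunks).count());
  }
}
```

Both calls return 300; the difference is only in how much metadata has to be visited to get there.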
(II) Experiment on combining Chunk and Page
Do we need both Chunk and Page, or is keeping only one of them enough?
How many points can a chunk hold when the chunk size is 64K, 1M, 2M, 3M, and 4M?
(1) Write one timeseries in one TsFile, using the long data type and random data.
(2) Adjust the number of points to reach the target chunk size.
    try (TsFileWriter tsFileWriter = new TsFileWriter(f)) {
      // only one timeseries
      tsFileWriter.registerTimeseries(
          new Path(Constant.DEVICE_PREFIX, Constant.SENSOR_1),
          new UnaryMeasurementSchema(Constant.SENSOR_1, TSDataType.INT64, TSEncoding.RLE));
      // construct TSRecord
      for (int i = 1; i <= 7977; i++) { // change here
        TSRecord tsRecord = new TSRecord(i, Constant.DEVICE_PREFIX);
        DataPoint dPoint1 = new LongDataPoint(Constant.SENSOR_1, random.nextLong());
        tsRecord.addTuple(dPoint1);
        // write TSRecord
        tsFileWriter.write(tsRecord);
      }
    }
Here are the results:
chunk size | ~64K | ~1M | ~2M | ~3M | ~4M |
number of points | 7,977 | 125,000 | 260,000 | 390,000 | 520,000 |
number of pages | 1 | 16 | 32 | 49 | 66 |
page size, uncompressed (bytes) | 65398 | 65398 | 65398 | 65398 | 65398 |
page size, compressed (bytes) | 64275 | 64275 | 64275 | 64275 | 64275 |
Consider the scenarios below (one timeseries only):
1. A scenario that generates 5 data points per second (high frequency).
One day generates 432,000 points (about 54 pages). Therefore, 1 chunk has 54 pages (about 3.4M). In this scenario, both chunk and page are necessary.
2. A scenario that generates one data point per second (second frequency).
One day generates 86,400 points (about 11 pages). Therefore, 1 chunk has 11 pages (about 693K). In this scenario, both chunk and page are necessary.
3. A scenario that generates 5 data points per minute, with one chunk per day (low frequency).
One day generates 7,200 points (about 1 page). Therefore, 1 chunk has 1 page (about 56.6K). In this scenario, only one of chunk and page needs to be kept.
4. A scenario that generates one data point per minute, with one chunk per week (minute frequency).
One week generates 10,080 points (about 1.3 pages). Therefore, 1 chunk has 1~2 pages (about 79.3K). In this scenario, only one of chunk and page needs to be kept.
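The page and chunk estimates in these scenarios follow from the measurements above: about 7,977 points fill one page, and one compressed page takes 64,275 bytes. A minimal Java sketch of that arithmetic follows; ChunkEstimate and its methods are hypothetical names introduced for illustration, and the sketch assumes that the points-per-page figure measured for random INT64 data carries over to each scenario and that size scales linearly with the number of points.

```java
/** Hypothetical helper (not part of TsFile) that redoes the scenario arithmetic. */
final class ChunkEstimate {
  // Measured above for random INT64 data: points that fit in one ~64K page.
  static final int POINTS_PER_PAGE = 7_977;
  // Measured above: size of one compressed page in bytes.
  static final int COMPRESSED_PAGE_BYTES = 64_275;

  /** Fractional number of pages needed for the given number of points. */
  static double pages(long points) {
    return (double) points / POINTS_PER_PAGE;
  }

  /** Approximate compressed chunk size in KiB, assuming size grows linearly with points. */
  static double chunkKiB(long points) {
    return pages(points) * COMPRESSED_PAGE_BYTES / 1024.0;
  }

  public static void main(String[] args) {
    long[] pointsPerChunk = {
      5L * 86_400, // 5 points per second, one chunk per day
      86_400L,     // 1 point per second, one chunk per day
      5L * 1_440,  // 5 points per minute, one chunk per day
      10_080L      // 1 point per minute, one chunk per week
    };
    for (long points : pointsPerChunk) {
      System.out.printf(
          "%d points -> %.1f pages, about %.1f KiB%n", points, pages(points), chunkKiB(points));
    }
  }
}
```

For the two low-frequency scenarios the linear estimate lands close to the 56.6K and 79.3K quoted above. The two high-frequency figures in the text are based on whole pages, so they come out slightly higher than this estimate.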
Keep both chunk and page:
- Chunk is the unit for I/O and page is the unit for query, which provides multiple levels of I/O granularity
- Suitable for all kinds of query scenarios, whether aggregation query or raw data query
Keep only page:
- Suitable for APM scenarios
- Simpler structure, which removes one level of I/O
(III) Experiment on how to store PageHeader