Basic test parameters
Corresponding JIRA: https://issues.apache.org/jira/browse/IOTDB-1628
- OPERATION_PROPORTION=991:1:1:1:1:0:1:1:1:1:1
- client number = 20
- group number = 20
- device number = 100
- sensor number = 8000
- batch size per write = 10
- loop = 90000000
- point step = 200
- operation interval = 0
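For reference, the parameters above map onto an iotdb-benchmark `config.properties` roughly as follows (a sketch only; the key names are assumed from the benchmark's configuration file and should be checked against the benchmark version actually used):

```properties
# Sketch of the benchmark configuration used above
# (key names assumed, not copied from the actual run).
OPERATION_PROPORTION=991:1:1:1:1:0:1:1:1:1:1
CLIENT_NUMBER=20
GROUP_NUMBER=20
DEVICE_NUMBER=100
SENSOR_NUMBER=8000
BATCH_SIZE_PER_WRITE=10
LOOP=90000000
POINT_STEP=200
OP_INTERVAL=0
```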
First test
- Branch under test: 0.12.3RC1
- Result: the first mid consistency error appeared at 2021-10-23 03:06:21,518, followed by OOM errors; related logs below.
2021-10-23 18:43:57,171 ERROR cn.edu.tsinghua.iotdb.benchmark.iotdb012.IoTDB:610 - exception occurred when execute query=SELECT count(s_3717) FROM root.test.g_5.d_5 WHERE time >= 1633017614200 AND time <= 1633017864200
org.apache.iotdb.jdbc.IoTDBSQLException: 313: check consistency failed, error message=mid consistency, localAppliedId is smaller than the leaderCommitId
at org.apache.iotdb.jdbc.IoTDBStatement.executeQuerySQL(IoTDBStatement.java:396)
at org.apache.iotdb.jdbc.IoTDBStatement.executeQuery(IoTDBStatement.java:367)
at org.apache.iotdb.jdbc.IoTDBStatement.executeQuery(IoTDBStatement.java:357)
at cn.edu.tsinghua.iotdb.benchmark.iotdb012.IoTDB.lambda$executeQueryAndGetStatus$0(IoTDB.java:588)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
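The check that fails above is the mid-level consistency guard: before answering a query, a follower fetches the leader's commit index and waits a bounded time for its own applied index to catch up; if it is still behind when the wait expires, the query is rejected. A minimal sketch of that logic (illustrative only, not the real `RaftMember` code; the supplier, poll count, and method names here are assumptions):

```java
import java.util.function.LongSupplier;

/**
 * Minimal sketch of the "mid consistency" check (illustrative only, not
 * the real org.apache.iotdb.cluster.server.member.RaftMember code).
 */
public class MidConsistencySketch {

    /**
     * Polls the local applied index up to maxPolls times, returning true
     * once it catches up with the leader's commit index. The real code
     * waits on a condition variable with a wall-clock timeout (~20s,
     * matching the "Failed to synchronize with the leader after 20xxxms"
     * warnings) rather than counting polls.
     */
    public static boolean waitUntilCatchUp(
            LongSupplier localAppliedId, long leaderCommitId, int maxPolls) {
        for (int i = 0; i < maxPolls; i++) {
            if (localAppliedId.getAsLong() >= leaderCommitId) {
                return true; // caught up: the query may proceed
            }
        }
        // "check consistency failed ... localAppliedId is smaller than
        // the leaderCommitId" is raised at this point in the real code.
        return false;
    }

    public static void main(String[] args) {
        // Applier already caught up: the check passes immediately.
        System.out.println(waitUntilCatchUp(() -> 10L, 5L, 1));   // true
        // Applier stuck (e.g. blocked behind a slow apply): the check fails.
        System.out.println(waitUntilCatchUp(() -> 0L, 5L, 100));  // false
    }
}
```

In the runs below, the applied index stalls for 20000+ms (the DataLogApplier "applyPhysicalPlan slowly" warnings), so the bounded wait expires and concurrent queries fail this check.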
Second test (with added logging)
- Branch under test: cluster_sync_leader_bug, commit ID: 06693755099251aef01d2940d548144b53b129e2 https://github.com/apache/iotdb/commit/06693755099251aef01d2940d548144b53b129e2
- Result: the mid consistency error reappeared (after about 4 days). The local (standalone-engine) write path was slow in places; excerpts below.
2021-11-01 09:12:29,118 [ClusterClient-8] WARN o.a.i.c.s.m.RaftMember:950 - Data(172.20.70.9:9007): Failed to synchronize with the leader after 20043ms
2021-11-01 09:12:30,278 [Data(172.20.70.9:9007)-root.test.g_5] WARN o.a.i.c.l.a.DataLogApplier:149 - [ApplyInsert] applyPhysicalPlan slowly for 22413ms
2021-11-01 09:12:30,278 [Data(172.20.70.9:9007)-root.test.g_5] WARN o.a.i.c.l.a.AsyncDataLogApplier:208 - [ApplyInternal] Data(172.20.70.9:9007):apply log InsertTabletPlan {deviceId:root.test.g_5.d_45, timesRange[1633184004200,1633184006000]},term:1,index:10389821 slowly for 22413ms
2021-11-01 09:12:30,295 [check-log-applier-Node(internalIp:172.20.70.9, metaPort:9007, nodeIdentifier:1697295815, dataPort:40007, clientPort:6667, clientIp:172.20.70.9)] WARN o.a.i.c.l.m.RaftLogManager:960 - Node(internalIp:172.20.70.9, metaPort:9007, nodeIdentifier:1697295815, dataPort:40007, clientPort:6667, clientIp:172.20.70.9):wait log InsertTabletPlan {deviceId:root.test.g_5.d_45, timesRange[1633184004200,1633184006000]},term:1,index:10389821 to apply slowly for 22427ms
2021-11-01 09:13:10,160 [ClusterClient-14] WARN o.a.i.c.s.m.RaftMember:950 - Data(172.20.70.8:9005): Failed to synchronize with the leader after 20319ms
2021-11-01 09:13:10,343 [Data(172.20.70.8:9005)-root.test.g_17] WARN o.a.i.c.l.a.DataLogApplier:149 - [ApplyInsert] applyPhysicalPlan slowly for 21623ms
2021-11-01 09:13:10,344 [Data(172.20.70.8:9005)-root.test.g_17] WARN o.a.i.c.l.a.AsyncDataLogApplier:208 - [ApplyInternal] Data(172.20.70.8:9005):apply log InsertTabletPlan {deviceId:root.test.g_17.d_77, timesRange[1633168692200,1633168694000]},term:1,index:8093294 slowly for 21623ms
2021-11-01 09:13:10,352 [check-log-applier-Node(internalIp:172.20.70.8, metaPort:9005, nodeIdentifier:1697292808, dataPort:40005, clientPort:6667, clientIp:172.20.70.8)] WARN o.a.i.c.l.m.RaftLogManager:960 - Node(internalIp:172.20.70.8, metaPort:9005, nodeIdentifier:1697292808, dataPort:40005, clientPort:6667, clientIp:172.20.70.8):wait log InsertTabletPlan {deviceId:root.test.g_17.d_77, timesRange[1633168692200,1633168694000]},term:1,index:8093294 to apply slowly for 21627ms
2021-11-01 09:13:12,240 [ClusterClient-35] WARN o.a.i.c.s.m.RaftMember:950 - Data(172.20.70.8:9005): Failed to synchronize with the leader after 20093ms
2021-11-01 09:13:13,240 [ClusterClient-12] WARN o.a.i.c.s.m.RaftMember:950 - Data(172.20.70.8:9005): Failed to synchronize with the leader after 20186ms
Third test (with more detailed logging)
- Branch under test: cluster_sync_leader_bug (in the stormbroken repository), commit ID: b1f7ca95
- Result: the mid consistency error reappeared (after about 6 hours). None of the newly added detailed logs appeared. The local write path was still slow in places, but the time spent does not look like GC overhead.
Third test: WARN logs around the mid consistency error
2021-11-03 00:14:51,682 [ClusterClient-25] WARN o.a.i.c.s.m.RaftMember:950 - Data(172.20.70.9:9007): Failed to synchronize with the leader after 20223ms
2021-11-03 00:14:51,700 [ClusterClient-25] WARN o.a.i.d.s.TSServiceImpl:1926 - Error occurred in query process:
org.apache.iotdb.db.exception.StorageEngineException: org.apache.iotdb.cluster.exception.CheckConsistencyException: check consistency failed, error message=mid consistency, localAppliedId is smaller than the leaderCommitId
at org.apache.iotdb.cluster.query.reader.ClusterReaderFactory.getReaderByTimestamp(ClusterReaderFactory.java:1026)
at org.apache.iotdb.cluster.query.reader.ClusterReaderFactory.getSeriesReaderByTime(ClusterReaderFactory.java:161)
at org.apache.iotdb.cluster.query.reader.ClusterReaderFactory.getReaderByTimestamp(ClusterReaderFactory.java:130)
at org.apache.iotdb.cluster.query.aggregate.ClusterAggregateExecutor.getReaderByTime(ClusterAggregateExecutor.java:95)
at org.apache.iotdb.db.query.executor.AggregationExecutor.executeWithValueFilter(AggregationExecutor.java:357)
at org.apache.iotdb.db.query.executor.QueryRouter.aggregate(QueryRouter.java:137)
at org.apache.iotdb.db.qp.executor.PlanExecutor.processDataQuery(PlanExecutor.java:538)
at org.apache.iotdb.cluster.query.ClusterPlanExecutor.processQuery(ClusterPlanExecutor.java:101)
at org.apache.iotdb.db.service.TSServiceImpl.createQueryDataSet(TSServiceImpl.java:1098)
at org.apache.iotdb.db.service.TSServiceImpl.internalExecuteQueryStatement(TSServiceImpl.java:729)
at org.apache.iotdb.db.service.TSServiceImpl.executeQueryStatement(TSServiceImpl.java:627)
at org.apache.iotdb.service.rpc.thrift.TSIService$Processor$executeQueryStatement.getResult(TSIService.java:2423)
at org.apache.iotdb.service.rpc.thrift.TSIService$Processor$executeQueryStatement.getResult(TSIService.java:2403)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:248)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.iotdb.cluster.exception.CheckConsistencyException: check consistency failed, error message=mid consistency, localAppliedId is smaller than the leaderCommitId
at org.apache.iotdb.cluster.exception.CheckConsistencyException.<clinit>(CheckConsistencyException.java:34)
at org.apache.iotdb.cluster.server.member.RaftMember$MidCheckConsistency.postCheckConsistency(RaftMember.java:810)
at org.apache.iotdb.cluster.server.member.RaftMember.waitUntilCatchUp(RaftMember.java:909)
at org.apache.iotdb.cluster.server.member.RaftMember.syncLeader(RaftMember.java:858)
at org.apache.iotdb.cluster.server.member.RaftMember.syncLeaderWithConsistencyCheck(RaftMember.java:765)
at org.apache.iotdb.cluster.query.reader.ClusterReaderFactory.getReaderByTimestamp(ClusterReaderFactory.java:1024)
... 18 common frames omitted
2021-11-03 00:14:51,713 [Data(172.20.70.9:9007)-root.test.g_8] WARN o.a.i.c.l.a.DataLogApplier:149 - [ApplyInsert] applyPhysicalPlan slowly for 21319ms
2021-11-03 00:14:51,713 [Data(172.20.70.9:9007)-root.test.g_8] WARN o.a.i.c.l.a.AsyncDataLogApplier:208 - [ApplyInternal] Data(172.20.70.9:9007):apply log InsertTabletPlan {deviceId:root.test.g_8.d_88, timesRange[1633037436200,1633037438000]},term:1,index:1239624 slowly for 21320ms
2021-11-03 00:14:51,713 [check-log-applier-Node(internalIp:172.20.70.9, metaPort:9007, nodeIdentifier:2041883899, dataPort:40007, clientPort:6667, clientIp:172.20.70.9)] WARN o.a.i.c.l.m.RaftLogManager:960 - Node(internalIp:172.20.70.9, metaPort:9007, nodeIdentifier:2041883899, dataPort:40007, clientPort:6667, clientIp:172.20.70.9):wait log InsertTabletPlan {deviceId:root.test.g_8.d_88, timesRange[1633037436200,1633037438000]},term:1,index:1239624 to apply slowly for 21319ms
Third test: GC logs around the mid consistency error
2021-11-03T00:14:48.074+0800: 23319.008: [GC (Allocation Failure) [PSYoungGen: 4631301K->114123K(5746688K)] 17062396K->12808796K(19727872K), 0.3338259 secs] [Times: user=1.19 sys=0.01, real=0.54 secs]
2021-11-03T00:14:48.617+0800: 23319.552: Total time for which application threads were stopped: 0.5504849 seconds, Stopping threads took: 0.0020039 seconds
2021-11-03T00:14:49.149+0800: 23320.084: [GC (Allocation Failure) [PSYoungGen: 4647371K->103713K(5758464K)] 17342044K->12891978K(19739648K), 0.0640482 secs] [Times: user=0.81 sys=0.00, real=0.07 secs]
2021-11-03T00:14:49.214+0800: 23320.148: Total time for which application threads were stopped: 0.0707036 seconds, Stopping threads took: 0.0013424 seconds
2021-11-03T00:14:49.236+0800: 23320.170: Total time for which application threads were stopped: 0.0054960 seconds, Stopping threads took: 0.0004315 seconds
2021-11-03T00:14:49.464+0800: 23320.398: Total time for which application threads were stopped: 0.0058205 seconds, Stopping threads took: 0.0009434 seconds
2021-11-03T00:14:49.749+0800: 23320.684: [GC (Allocation Failure) [PSYoungGen: 4636961K->91474K(5731840K)] 17425226K->12973431K(19713024K), 0.0486890 secs] [Times: user=0.61 sys=0.01, real=0.05 secs]
2021-11-03T00:14:49.798+0800: 23320.733: Total time for which application threads were stopped: 0.0542169 seconds, Stopping threads took: 0.0003450 seconds
2021-11-03T00:14:50.100+0800: 23321.035: Total time for which application threads were stopped: 0.0128498 seconds, Stopping threads took: 0.0033714 seconds
2021-11-03T00:14:50.451+0800: 23321.385: Total time for which application threads were stopped: 0.0081008 seconds, Stopping threads took: 0.0033660 seconds
2021-11-03T00:14:50.460+0800: 23321.394: [GC (GCLocker Initiated GC) [PSYoungGen: 4637010K->198209K(5751296K)] 17525934K->13087132K(19732480K), 0.0455184 secs] [Times: user=0.48 sys=0.00, real=0.05 secs]
2021-11-03T00:14:50.506+0800: 23321.440: Total time for which application threads were stopped: 0.0508556 seconds, Stopping threads took: 0.0003779 seconds
2021-11-03T00:14:50.970+0800: 23321.905: Total time for which application threads were stopped: 0.0063016 seconds, Stopping threads took: 0.0012185 seconds
2021-11-03T00:14:50.997+0800: 23321.932: [GC (Allocation Failure) [PSYoungGen: 4743745K->331393K(5675008K)] 17632669K->13220317K(19656192K), 0.0647358 secs] [Times: user=0.59 sys=0.01, real=0.07 secs]
2021-11-03T00:14:51.062+0800: 23321.997: Total time for which application threads were stopped: 0.0704747 seconds, Stopping threads took: 0.0006351 seconds
2021-11-03T00:14:51.502+0800: 23322.437: Total time for which application threads were stopped: 0.0067273 seconds, Stopping threads took: 0.0018121 seconds
2021-11-03T00:14:51.518+0800: 23322.453: Total time for which application threads were stopped: 0.0057247 seconds, Stopping threads took: 0.0009592 seconds
2021-11-03T00:14:51.607+0800: 23322.542: Total time for which application threads were stopped: 0.0060769 seconds, Stopping threads took: 0.0011555 seconds
2021-11-03T00:14:51.618+0800: 23322.552: [GC (GCLocker Initiated GC) [PSYoungGen: 4950145K->493210K(5719552K)] 17841579K->13384644K(19700736K), 0.0637473 secs] [Times: user=0.78 sys=0.02, real=0.07 secs]
2021-11-03T00:14:51.682+0800: 23322.617: Total time for which application threads were stopped: 0.0695298 seconds, Stopping threads took: 0.0002446 seconds
2021-11-03T00:14:51.700+0800: 23322.635: Total time for which application threads were stopped: 0.0057423 seconds, Stopping threads took: 0.0005822 seconds
2021-11-03T00:14:52.079+0800: 23323.013: Total time for which application threads were stopped: 0.0060501 seconds, Stopping threads took: 0.0009866 seconds
2021-11-03T00:14:52.284+0800: 23323.219: [GC (Allocation Failure) [PSYoungGen: 5111962K->551491K(5856256K)] 18003429K->13442958K(19837440K), 0.1040987 secs] [Times: user=1.19 sys=0.01, real=0.10 secs]
2021-11-03T00:14:52.389+0800: 23323.323: Total time for which application threads were stopped: 0.1123434 seconds, Stopping threads took: 0.0029986 seconds
2021-11-03T00:14:52.923+0800: 23323.858: [GC (Allocation Failure) [PSYoungGen: 5374019K->671541K(5878272K)] 18265486K->13563008K(19859456K), 0.1139545 secs] [Times: user=1.42 sys=0.04, real=0.11 secs]
Third test: all logs around the mid consistency error
2021-11-03 00:14:51,510 [pool-15-IoTDB-FlushTask-Submit-Pool-1] INFO o.a.i.d.e.s.TsFileProcessor:615 - Memtable PrimitiveMemTable{planIndex=[949947,950649]} has been added to flushing list
2021-11-03 00:14:51,506 [DataClientThread-131] ERROR o.a.i.c.l.m.RaftLogManager:615 - Trigger compact:numToReserveForNew:1000,sizeToReserveForNew:736,sizeToReserveForConfig:1000
2021-11-03 00:14:51,512 [pool-15-IoTDB-FlushTask-Submit-Pool-1] INFO o.a.i.d.e.s.StorageGroupProcessor:1257 - close a sequence tsfile processor root.test.g_7-0
2021-11-03 00:14:51,512 [DataClientThread-131] INFO o.a.i.c.l.m.RaftLogManager:873 - Node(internalIp:172.20.70.8, metaPort:9005, nodeIdentifier:2041880328, dataPort:40005, clientPort:6667, clientIp:172.20.70.8): Before compaction index 949958-950698, compactIndex 949959, removeSize 2, committedLogSize 738, maxAppliedLog 950693
2021-11-03 00:14:51,512 [DataClientThread-131] INFO o.a.i.c.l.m.RaftLogManager:887 - Node(internalIp:172.20.70.8, metaPort:9005, nodeIdentifier:2041880328, dataPort:40005, clientPort:6667, clientIp:172.20.70.8): After compaction index 949960-950698, committedLogSize 736
2021-11-03 00:14:51,512 [pool-3-IoTDB-Flush-3] INFO o.a.i.d.e.s.TsFileProcessor:995 - Ended file /data/cluster/first-rest-test/apache-iotdb/sbin/../data/data/sequence/root.test.g_15/0/0/1635869637646-459-0-0.tsfile
2021-11-03 00:14:51,519 [pool-3-IoTDB-Flush-3] INFO o.a.i.d.e.s.StorageGroupProcessor:1997 - signal closing storage group condition in root.test.g_15-0
2021-11-03 00:14:51,519 [pool-3-IoTDB-Flush-3] INFO o.a.i.d.e.s.TsFileProcessor:1009 - Storage group root.test.g_15/0 close the file /data/cluster/first-rest-test/apache-iotdb/sbin/../data/data/sequence/root.test.g_15/0/0/1635869637646-459-0-0.tsfile, TsFile size is 52079411, time consumption of flushing metadata is 708ms
2021-11-03 00:14:51,519 [pool-3-IoTDB-Flush-3] INFO o.a.i.d.e.f.MemTableFlushTask:88 - The memTable size of SG root.test.g_7/0 is 170240000, the avg series points num in chunk is 88, total timeseries number is 120000
2021-11-03 00:14:51,510 [ClusterClient-56] ERROR o.a.i.c.l.m.RaftLogManager:615 - Trigger compact:numToReserveForNew:1000,sizeToReserveForNew:735,sizeToReserveForConfig:1000
2021-11-03 00:14:51,530 [ClusterClient-56] INFO o.a.i.c.l.m.RaftLogManager:873 - Node(internalIp:172.20.70.7, metaPort:9003, nodeIdentifier:2041893354, dataPort:40003, clientPort:6667, clientIp:172.20.70.7): Before compaction index 505128-505869, compactIndex 505130, removeSize 3, committedLogSize 738, maxAppliedLog 505854
2021-11-03 00:14:51,530 [ClusterClient-56] INFO o.a.i.c.l.m.RaftLogManager:887 - Node(internalIp:172.20.70.7, metaPort:9003, nodeIdentifier:2041893354, dataPort:40003, clientPort:6667, clientIp:172.20.70.7): After compaction index 505131-505869, committedLogSize 735
2021-11-03 00:14:51,531 [DataClientThread-243] ERROR o.a.i.c.l.m.RaftLogManager:615 - Trigger compact:numToReserveForNew:1998,sizeToReserveForNew:736,sizeToReserveForConfig:1000
2021-11-03 00:14:51,531 [DataClientThread-243] INFO o.a.i.c.l.m.RaftLogManager:873 - Node(internalIp:172.20.70.9, metaPort:9007, nodeIdentifier:2041883899, dataPort:40007, clientPort:6667, clientIp:172.20.70.9): Before compaction index 1239623-1241878, compactIndex 1239622, removeSize 1476, committedLogSize 2212, maxAppliedLog 1239623
2021-11-03 00:14:51,532 [DataClientThread-243] INFO o.a.i.c.l.m.RaftLogManager:887 - Node(internalIp:172.20.70.9, metaPort:9007, nodeIdentifier:2041883899, dataPort:40007, clientPort:6667, clientIp:172.20.70.9): After compaction index 1239623-1241878, committedLogSize 2212
2021-11-03 00:14:51,538 [ClusterClient-58] ERROR o.a.i.c.l.m.RaftLogManager:615 - Trigger compact:numToReserveForNew:1000,sizeToReserveForNew:737,sizeToReserveForConfig:1000
2021-11-03 00:14:51,538 [ClusterClient-58] INFO o.a.i.c.l.m.RaftLogManager:873 - Node(internalIp:172.20.70.7, metaPort:9003, nodeIdentifier:2041893354, dataPort:40003, clientPort:6667, clientIp:172.20.70.7): Before compaction index 505131-505871, compactIndex 505131, removeSize 1, committedLogSize 738, maxAppliedLog 505854
2021-11-03 00:14:51,538 [ClusterClient-58] INFO o.a.i.c.l.m.RaftLogManager:887 - Node(internalIp:172.20.70.7, metaPort:9003, nodeIdentifier:2041893354, dataPort:40003, clientPort:6667, clientIp:172.20.70.7): After compaction index 505132-505871, committedLogSize 737
2021-11-03 00:14:51,541 [Data(172.20.70.7:9003)-root.test.g_16] INFO o.a.i.d.w.n.ExclusiveWriteLogNode:127 - WAL BufferOverflow !
2021-11-03 00:14:51,547 [Data(172.20.70.8:9005)-root.test.g_17] INFO o.a.i.d.w.n.ExclusiveWriteLogNode:127 - WAL BufferOverflow !
2021-11-03 00:14:51,559 [ClusterClient-59] ERROR o.a.i.c.l.m.RaftLogManager:615 - Trigger compact:numToReserveForNew:1000,sizeToReserveForNew:737,sizeToReserveForConfig:1000
2021-11-03 00:14:51,559 [ClusterClient-59] INFO o.a.i.c.l.m.RaftLogManager:873 - Node(internalIp:172.20.70.7, metaPort:9003, nodeIdentifier:2041893354, dataPort:40003, clientPort:6667, clientIp:172.20.70.7): Before compaction index 505132-505871, compactIndex 505132, removeSize 1, committedLogSize 738, maxAppliedLog 505855
2021-11-03 00:14:51,559 [ClusterClient-59] INFO o.a.i.c.l.m.RaftLogManager:887 - Node(internalIp:172.20.70.7, metaPort:9003, nodeIdentifier:2041893354, dataPort:40003, clientPort:6667, clientIp:172.20.70.7): After compaction index 505133-505871, committedLogSize 737
2021-11-03 00:14:51,562 [ClusterClient-38] ERROR o.a.i.c.l.m.RaftLogManager:615 - Trigger compact:numToReserveForNew:1000,sizeToReserveForNew:737,sizeToReserveForConfig:1000
2021-11-03 00:14:51,562 [ClusterClient-38] INFO o.a.i.c.l.m.RaftLogManager:873 - Node(internalIp:172.20.70.7, metaPort:9003, nodeIdentifier:2041893354, dataPort:40003, clientPort:6667, clientIp:172.20.70.7): Before compaction index 505133-505871, compactIndex 505133, removeSize 1, committedLogSize 738, maxAppliedLog 505855
2021-11-03 00:14:51,562 [ClusterClient-38] INFO o.a.i.c.l.m.RaftLogManager:887 - Node(internalIp:172.20.70.7, metaPort:9003, nodeIdentifier:2041893354, dataPort:40003, clientPort:6667, clientIp:172.20.70.7): After compaction index 505134-505871, committedLogSize 737
2021-11-03 00:14:51,588 [ClusterClient-47] ERROR o.a.i.c.l.m.RaftLogManager:615 - Trigger compact:numToReserveForNew:1000,sizeToReserveForNew:737,sizeToReserveForConfig:1000
2021-11-03 00:14:51,588 [ClusterClient-47] INFO o.a.i.c.l.m.RaftLogManager:873 - Node(internalIp:172.20.70.7, metaPort:9003, nodeIdentifier:2041893354, dataPort:40003, clientPort:6667, clientIp:172.20.70.7): Before compaction index 505134-505872, compactIndex 505134, removeSize 1, committedLogSize 738, maxAppliedLog 505856
2021-11-03 00:14:51,588 [ClusterClient-47] INFO o.a.i.c.l.m.RaftLogManager:887 - Node(internalIp:172.20.70.7, metaPort:9003, nodeIdentifier:2041893354, dataPort:40003, clientPort:6667, clientIp:172.20.70.7): After compaction index 505135-505872, committedLogSize 737
2021-11-03 00:14:51,596 [DataClientThread-131] ERROR o.a.i.c.l.m.RaftLogManager:615 - Trigger compact:numToReserveForNew:1000,sizeToReserveForNew:737,sizeToReserveForConfig:1000
2021-11-03 00:14:51,596 [DataClientThread-131] INFO o.a.i.c.l.m.RaftLogManager:873 - Node(internalIp:172.20.70.8, metaPort:9005, nodeIdentifier:2041880328, dataPort:40005, clientPort:6667, clientIp:172.20.70.8): Before compaction index 949960-950699, compactIndex 949960, removeSize 1, committedLogSize 738, maxAppliedLog 950697
2021-11-03 00:14:51,596 [DataClientThread-131] INFO o.a.i.c.l.m.RaftLogManager:887 - Node(internalIp:172.20.70.8, metaPort:9005, nodeIdentifier:2041880328, dataPort:40005, clientPort:6667, clientIp:172.20.70.8): After compaction index 949961-950699, committedLogSize 737
2021-11-03 00:14:51,682 [ClusterClient-25] WARN o.a.i.c.s.m.RaftMember:950 - Data(172.20.70.9:9007): Failed to synchronize with the leader after 20223ms
2021-11-03 00:14:51,690 [ClusterClient-25] INFO o.a.i.d.s.TSServiceImpl:812 - Cost: 43063 ms, sql is SELECT count(s_2292) FROM root.test.g_8.d_28 WHERE time >= 1633017602400 AND time <= 1633017852400 AND root.test.g_8.d_28.s_2292 > -5
2021-11-03 00:14:51,700 [ClusterClient-25] WARN o.a.i.d.s.TSServiceImpl:1926 - Error occurred in query process:
org.apache.iotdb.db.exception.StorageEngineException: org.apache.iotdb.cluster.exception.CheckConsistencyException: check consistency failed, error message=mid consistency, localAppliedId is smaller than the leaderCommitId
at org.apache.iotdb.cluster.query.reader.ClusterReaderFactory.getReaderByTimestamp(ClusterReaderFactory.java:1026)
at org.apache.iotdb.cluster.query.reader.ClusterReaderFactory.getSeriesReaderByTime(ClusterReaderFactory.java:161)
at org.apache.iotdb.cluster.query.reader.ClusterReaderFactory.getReaderByTimestamp(ClusterReaderFactory.java:130)
at org.apache.iotdb.cluster.query.aggregate.ClusterAggregateExecutor.getReaderByTime(ClusterAggregateExecutor.java:95)
at org.apache.iotdb.db.query.executor.AggregationExecutor.executeWithValueFilter(AggregationExecutor.java:357)
at org.apache.iotdb.db.query.executor.QueryRouter.aggregate(QueryRouter.java:137)
at org.apache.iotdb.db.qp.executor.PlanExecutor.processDataQuery(PlanExecutor.java:538)
at org.apache.iotdb.cluster.query.ClusterPlanExecutor.processQuery(ClusterPlanExecutor.java:101)
at org.apache.iotdb.db.service.TSServiceImpl.createQueryDataSet(TSServiceImpl.java:1098)
at org.apache.iotdb.db.service.TSServiceImpl.internalExecuteQueryStatement(TSServiceImpl.java:729)
at org.apache.iotdb.db.service.TSServiceImpl.executeQueryStatement(TSServiceImpl.java:627)
at org.apache.iotdb.service.rpc.thrift.TSIService$Processor$executeQueryStatement.getResult(TSIService.java:2423)
at org.apache.iotdb.service.rpc.thrift.TSIService$Processor$executeQueryStatement.getResult(TSIService.java:2403)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:248)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.iotdb.cluster.exception.CheckConsistencyException: check consistency failed, error message=mid consistency, localAppliedId is smaller than the leaderCommitId
at org.apache.iotdb.cluster.exception.CheckConsistencyException.<clinit>(CheckConsistencyException.java:34)
at org.apache.iotdb.cluster.server.member.RaftMember$MidCheckConsistency.postCheckConsistency(RaftMember.java:810)
at org.apache.iotdb.cluster.server.member.RaftMember.waitUntilCatchUp(RaftMember.java:909)
at org.apache.iotdb.cluster.server.member.RaftMember.syncLeader(RaftMember.java:858)
at org.apache.iotdb.cluster.server.member.RaftMember.syncLeaderWithConsistencyCheck(RaftMember.java:765)
at org.apache.iotdb.cluster.query.reader.ClusterReaderFactory.getReaderByTimestamp(ClusterReaderFactory.java:1024)
... 18 common frames omitted
2021-11-03 00:14:51,703 [ClusterClient-10] ERROR o.a.i.c.l.m.RaftLogManager:615 - Trigger compact:numToReserveForNew:1000,sizeToReserveForNew:737,sizeToReserveForConfig:1000
2021-11-03 00:14:51,703 [ClusterClient-10] INFO o.a.i.c.l.m.RaftLogManager:873 - Node(internalIp:172.20.70.7, metaPort:9003, nodeIdentifier:2041893354, dataPort:40003, clientPort:6667, clientIp:172.20.70.7): Before compaction index 505135-505873, compactIndex 505135, removeSize 1, committedLogSize 738, maxAppliedLog 505867
2021-11-03 00:14:51,703 [ClusterClient-10] INFO o.a.i.c.l.m.RaftLogManager:887 - Node(internalIp:172.20.70.7, metaPort:9003, nodeIdentifier:2041893354, dataPort:40003, clientPort:6667, clientIp:172.20.70.7): After compaction index 505136-505873, committedLogSize 737
2021-11-03 00:14:51,704 [DataClientThread-131] ERROR o.a.i.c.l.m.RaftLogManager:615 - Trigger compact:numToReserveForNew:1000,sizeToReserveForNew:736,sizeToReserveForConfig:1000
2021-11-03 00:14:51,704 [DataClientThread-131] INFO o.a.i.c.l.m.RaftLogManager:873 - Node(internalIp:172.20.70.8, metaPort:9005, nodeIdentifier:2041880328, dataPort:40005, clientPort:6667, clientIp:172.20.70.8): Before compaction index 949961-950701, compactIndex 949962, removeSize 2, committedLogSize 738, maxAppliedLog 950697
2021-11-03 00:14:51,704 [DataClientThread-131] INFO o.a.i.c.l.m.RaftLogManager:887 - Node(internalIp:172.20.70.8, metaPort:9005, nodeIdentifier:2041880328, dataPort:40005, clientPort:6667, clientIp:172.20.70.8): After compaction index 949963-950701, committedLogSize 736
2021-11-03 00:14:51,710 [Data(172.20.70.8:9005)-root.test.g_7] INFO o.a.i.d.e.s.TsFileProcessor:149 - create a new tsfile processor /data/cluster/first-rest-test/apache-iotdb/sbin/../data/data/sequence/root.test.g_7/0/0/1635869691710-578-0-0.tsfile
2021-11-03 00:14:51,713 [Data(172.20.70.9:9007)-root.test.g_8] WARN o.a.i.c.l.a.DataLogApplier:149 - [ApplyInsert] applyPhysicalPlan slowly for 21319ms
2021-11-03 00:14:51,713 [Data(172.20.70.9:9007)-root.test.g_8] WARN o.a.i.c.l.a.AsyncDataLogApplier:208 - [ApplyInternal] Data(172.20.70.9:9007):apply log InsertTabletPlan {deviceId:root.test.g_8.d_88, timesRange[1633037436200,1633037438000]},term:1,index:1239624 slowly for 21320ms
2021-11-03 00:14:51,713 [check-log-applier-Node(internalIp:172.20.70.9, metaPort:9007, nodeIdentifier:2041883899, dataPort:40007, clientPort:6667, clientIp:172.20.70.9)] WARN o.a.i.c.l.m.RaftLogManager:960 - Node(internalIp:172.20.70.9, metaPort:9007, nodeIdentifier:2041883899, dataPort:40007, clientPort:6667, clientIp:172.20.70.9):wait log InsertTabletPlan {deviceId:root.test.g_8.d_88, timesRange[1633037436200,1633037438000]},term:1,index:1239624 to apply slowly for 21319ms
2021-11-03 00:14:51,717 [Data(172.20.70.8:9005)-root.test.g_7] INFO o.a.i.d.w.n.ExclusiveWriteLogNode:90 - create the WAL folder /data/cluster/first-rest-test/apache-iotdb/sbin/../data/wal/root.test.g_7/0-1635869691710-578-0-0.tsfile.
2021-11-03 00:14:51,717 [Data(172.20.70.8:9005)-root.test.g_7] INFO o.a.i.d.c.IoTDBThreadPoolFactory:87 - new single thread pool: Flush-WAL-Thread-root.test.g_7/0-1635869691710-578-0-0.tsfile
2021-11-03 00:14:51,750 [DataClientThread-30] ERROR o.a.i.c.l.m.RaftLogManager:615 - Trigger compact:numToReserveForNew:1999,sizeToReserveForNew:737,sizeToReserveForConfig:1000
2021-11-03 00:14:51,750 [DataClientThread-30] INFO o.a.i.c.l.m.RaftLogManager:873 - Node(internalIp:172.20.70.9, metaPort:9007, nodeIdentifier:2041883899, dataPort:40007, clientPort:6667, clientIp:172.20.70.9): Before compaction index 1239623-1241879, compactIndex 1239625, removeSize 1477, committedLogSize 2214, maxAppliedLog 1239626
2021-11-03 00:14:51,750 [DataClientThread-30] INFO o.a.i.c.l.m.RaftLogManager:887 - Node(internalIp:172.20.70.9, metaPort:9007, nodeIdentifier:2041883899, dataPort:40007, clientPort:6667, clientIp:172.20.70.9): After compaction index 1239626-1241879, committedLogSize 2211
2021-11-03 00:14:51,840 [ClusterClient-67] ERROR o.a.i.c.l.m.RaftLogManager:615 - Trigger compact:numToReserveForNew:1000,sizeToReserveForNew:737,sizeToReserveForConfig:1000
2021-11-03 00:14:51,857 [ClusterClient-67] INFO o.a.i.c.l.m.RaftLogManager:873 - Node(internalIp:172.20.70.7, metaPort:9003, nodeIdentifier:2041893354, dataPort:40003, clientPort:6667, clientIp:172.20.70.7): Before compaction index 505136-505874, compactIndex 505136, removeSize 1, committedLogSize 738, maxAppliedLog 505873
2021-11-03 00:14:51,857 [ClusterClient-67] INFO o.a.i.c.l.m.RaftLogManager:887 - Node(internalIp:172.20.70.7, metaPort:9003, nodeIdentifier:2041893354, dataPort:40003, clientPort:6667, clientIp:172.20.70.7): After compaction index 505137-505874, committedLogSize 737
2021-11-03 00:14:51,860 [Data(172.20.70.9:9007)-root.test.g_8] INFO o.a.i.d.w.n.ExclusiveWriteLogNode:127 - WAL BufferOverflow !
2021-11-03 00:14:51,922 [DataClientThread-314] ERROR o.a.i.c.l.m.RaftLogManager:615 - Trigger compact:numToReserveForNew:1000,sizeToReserveForNew:737,sizeToReserveForConfig:1000
Logging code added for the third test
Fourth test (ruling out the only remaining unlogged possibility)
- Branch under test: cluster_sync_leader_bug (in the stormbroken repository), commit ID: 6c0425e
- Start time: 2021-11-04 10:49:42
- Node 7
- First mid consistency error: 2021-11-04 11:22:16, lasting until 2021-11-04 11:52:17.
- First OOM error: 2021-11-04 11:06:24, followed by a large number of OOM errors
- Intermittent "Failed to process no query InsertTabletPlan" errors began about 2 minutes after startup. The first mid consistency error appeared at 11:22:16 (typically preceded by a "Failed to synchronize with leader after 20000+ms" warning), the first "Current query is time out xxx" at 11:29:02, then mid consistency errors recurred intermittently; cluster nodes began losing contact at 16:37:26.
- Inspecting the full logs, "Raft log buffer overflow" and "WAL BufferOverflow" messages appear near the "Failed to process no query plan" errors.
- Node 8
- No mid consistency errors
- No OOM errors
- Node 9
- No mid consistency errors
- First OOM error: 2021-11-04 16:45:51, followed by a large number of OOM errors
- Node 7
- Result: none of the newly added detailed logs appeared; the local write path was slow in places, but the time spent does not look like GC overhead.
- None of the newly added logs appeared
Partial log screenshots
Server monitoring for the same time period
Details can be viewed in Zabbix at http://172.20.70.11/zabbix.php?name=&ip=&dns=&port=&status=-1&evaltype=0&tags%5B0%5D%5Btag%5D=&tags%5B0%5D%5Boperator%5D=0&tags%5B0%5D%5Bvalue%5D=&maintenance_status=1&filter_name=&filter_show_counter=0&filter_custom_time=0&sort=name&sortorder=ASC&show_suppressed=0&action=host.view (user Admin, password iotdb2019)
Roughly: after node 07 hit OOM, the mid consistency error appeared; node 09 then showed mid consistency as well.
Fifth test (changed 3 nodes / 3 replicas to 3 nodes / 1 replica)
- Branch under test: cluster_sync_leader_bug (in the stormbroken repository), commit ID: 6c0425e
- No mid consistency, OOM, or similar problems appeared, and no operation slowed beyond 10s.
After extensive retries and final confirmation, we tentatively conclude that the cluster crash is caused by an excessively long DataSnapshot.
Sixth test (targeted test, with extensive logging added around snapshots)
- Branch under test: cluster_sync_leader_bug (in the stormbroken repository), commit ID: ce9d0992
- After some time the mid consistency error was reproduced; the problem is now fairly certainly in snapshots.
- Basic configuration for this test: 3 nodes, 3 replicas
- Node 7
- Same thread: 2022-01-04 22:55:47, sending a Snapshot took 19.0s; executing SnapshotCatchUp took 19.8s
- Node 8
- Same thread: 2022-01-04 23:14:16-30, three slow single-slot PartitionedSnapshot installs (4.0s, 4.1s, 7.4s), one of which contained a slow FileSnapshot tsfile install (4.9s); these culminated at 2022-01-04 23:14:34 in a slow overall PartitionedSnapshot install taking 26.3s
- Node 9
- Same thread: 2022-01-04 22:54:48 to 22:55:20, six slow single-slot PartitionedSnapshot installs (4.3s, 3.4s, 3.7s, 9.4s, 10.0s, 3.0s), culminating at 22:55:20 in a slow overall PartitionedSnapshot install taking 39.4s
- Followed by further slow overall PartitionedSnapshot installs: 24.7s at 22:55:30 and 18.5s at 22:55:47
- Common characteristics:
- load New TsFile does not hold the lock for long
- MetaSimpleSnapshot never timed out
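The node 9 numbers can be sanity-checked: the six slow per-slot installs alone account for most of the 39.4s overall install, which comfortably outlasts the ~20s leader-sync wait and would therefore explain the mid consistency failures. A small arithmetic sketch (durations taken from the summary above; the 20s timeout is an assumption based on the "Failed to synchronize with the leader after 20xxxms" warnings):

```java
/** Sanity-check arithmetic for the slow PartitionedSnapshot install on node 9. */
public class SnapshotTimingSketch {

    /** Sums per-slot install durations in milliseconds. */
    public static long totalMillis(long[] slotMillis) {
        long total = 0;
        for (long m : slotMillis) {
            total += m;
        }
        return total;
    }

    /** True if a stall of this length outlasts the leader-sync wait. */
    public static boolean exceedsSyncLeaderTimeout(long stallMillis, long timeoutMillis) {
        return stallMillis > timeoutMillis;
    }

    public static void main(String[] args) {
        // The six slow single-slot installs reported for node 9 (22:54:48-22:55:20).
        long[] slowSlots = {4300, 3400, 3700, 9400, 10000, 3000};
        long total = totalMillis(slowSlots); // 33800 ms of the 39.4 s overall install
        System.out.println(total);
        // Assumed ~20 s sync-leader wait, matching the warnings in the logs.
        System.out.println(exceedsSyncLeaderTimeout(total, 20000)); // true
    }
}
```

While a snapshot install of this length is in progress, log application stalls, the applied index falls behind the leader's commit index, and every query that runs the mid consistency check within the window fails.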