Attachment: logs.zip

Basic test parameters

Related JIRA: https://issues.apache.org/jira/browse/IOTDB-1628

  1. OPERATION_PROPORTION=991:1:1:1:1:0:1:1:1:1:1

  2. client number = 20
  3. group number = 20
  4. device number = 100
  5. sensor number = 8000
  6. batch size per write = 10
  7. loop = 90000000
  8. point step = 200
  9. operation interval = 0
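
For reference, these settings would look roughly as follows in the benchmark's config.properties. This is a hedged sketch: the key names are assumed from iotdb-benchmark conventions and should be checked against the benchmark version actually used; the values are the ones listed above.

# assumed iotdb-benchmark key names; values from the parameter list above
OPERATION_PROPORTION=991:1:1:1:1:0:1:1:1:1:1
CLIENT_NUMBER=20
GROUP_NUMBER=20
DEVICE_NUMBER=100
SENSOR_NUMBER=8000
BATCH_SIZE_PER_WRITE=10
LOOP=90000000
POINT_STEP=200
OP_INTERVAL=0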

First test

  1. Branch under test: 0.12.3RC1
  2. Test result: the first mid consistency error appeared at 2021-10-23 03:06:21,518, after which OOM errors began to appear. The relevant logs follow.


2021-10-23 18:43:57,171 ERROR cn.edu.tsinghua.iotdb.benchmark.iotdb012.IoTDB:610 - exception occurred when execute query=SELECT count(s_3717) FROM root.test.g_5.d_5 WHERE time >= 1633017614200 AND time <= 1633017864200
org.apache.iotdb.jdbc.IoTDBSQLException: 313: check consistency failed, error message=mid consistency, localAppliedId is smaller than the leaderCommitId
at org.apache.iotdb.jdbc.IoTDBStatement.executeQuerySQL(IoTDBStatement.java:396)
at org.apache.iotdb.jdbc.IoTDBStatement.executeQuery(IoTDBStatement.java:367)
at org.apache.iotdb.jdbc.IoTDBStatement.executeQuery(IoTDBStatement.java:357)
at cn.edu.tsinghua.iotdb.benchmark.iotdb012.IoTDB.lambda$executeQueryAndGetStatus$0(IoTDB.java:588)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
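
For context, the check that raises this error (RaftMember$MidCheckConsistency.postCheckConsistency in the later, fuller trace) amounts to comparing the locally applied log index against the leader's commit index after waiting for the follower to catch up. A minimal sketch follows; only the exception message is taken from the logs, while the threshold name, value, and surrounding structure are assumptions.

// Sketch of the mid-consistency check implied by the stack traces on this page.
public class MidConsistencySketch {

  // Hypothetical stand-in for org.apache.iotdb.cluster.exception.CheckConsistencyException.
  static class CheckConsistencyException extends Exception {
    CheckConsistencyException(String message) {
      super(message);
    }
  }

  // Assumed: the maximum index lag a reader may tolerate under mid consistency.
  static final long MAX_READ_LOG_LAG = 1000;

  // After waiting for the follower to catch up (waitUntilCatchUp in the trace),
  // verify the locally applied index is close enough to the leader's commit index.
  static void postCheckConsistency(long leaderCommitId, long localAppliedId)
      throws CheckConsistencyException {
    if (localAppliedId < leaderCommitId - MAX_READ_LOG_LAG) {
      // The failure seen throughout this page: log application fell so far
      // behind that the follower could not catch up within the sync timeout.
      throw new CheckConsistencyException(
          "mid consistency, localAppliedId is smaller than the leaderCommitId");
    }
  }

  public static void main(String[] args) {
    try {
      postCheckConsistency(10_389_821L, 10_389_800L); // within the lag bound: passes
      postCheckConsistency(10_389_821L, 1_239_624L); // far behind: throws
    } catch (CheckConsistencyException e) {
      System.err.println(e.getMessage());
    }
  }
}

In other words, a query fails this way whenever log application on the queried replica lags the leader beyond the allowed window for longer than the sync-leader wait, which is why each occurrence below is paired with a "Failed to synchronize with the leader after 20000+ms" warning.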


Second test (with added logging)

  1. Branch under test: cluster_sync_leader_bug, commit ID 06693755099251aef01d2940d548144b53b129e2 https://github.com/apache/iotdb/commit/06693755099251aef01d2940d548144b53b129e2
  2. Test result: the mid consistency error appeared again (after roughly 4 days). Some writes at the standalone-engine level were noticeably slow; excerpts follow:


2021-11-01 09:12:29,118 [ClusterClient-8] WARN o.a.i.c.s.m.RaftMember:950 - Data(172.20.70.9:9007): Failed to synchronize with the leader after 20043ms
2021-11-01 09:12:30,278 [Data(172.20.70.9:9007)-root.test.g_5] WARN o.a.i.c.l.a.DataLogApplier:149 - [ApplyInsert] applyPhysicalPlan slowly for 22413ms
2021-11-01 09:12:30,278 [Data(172.20.70.9:9007)-root.test.g_5] WARN o.a.i.c.l.a.AsyncDataLogApplier:208 - [ApplyInternal] Data(172.20.70.9:9007):apply log InsertTabletPlan {deviceId:root.test.g_5.d_45, timesRange[1633184004200,1633184006000]},term:1,index:10389821 slowly for 22413ms
2021-11-01 09:12:30,295 [check-log-applier-Node(internalIp:172.20.70.9, metaPort:9007, nodeIdentifier:1697295815, dataPort:40007, clientPort:6667, clientIp:172.20.70.9)] WARN o.a.i.c.l.m.RaftLogManager:960 - Node(internalIp:172.20.70.9, metaPort:9007, nodeIdentifier:1697295815, dataPort:40007, clientPort:6667, clientIp:172.20.70.9):wait log InsertTabletPlan {deviceId:root.test.g_5.d_45, timesRange[1633184004200,1633184006000]},term:1,index:10389821 to apply slowly for 22427ms
2021-11-01 09:13:10,160 [ClusterClient-14] WARN o.a.i.c.s.m.RaftMember:950 - Data(172.20.70.8:9005): Failed to synchronize with the leader after 20319ms
2021-11-01 09:13:10,343 [Data(172.20.70.8:9005)-root.test.g_17] WARN o.a.i.c.l.a.DataLogApplier:149 - [ApplyInsert] applyPhysicalPlan slowly for 21623ms
2021-11-01 09:13:10,344 [Data(172.20.70.8:9005)-root.test.g_17] WARN o.a.i.c.l.a.AsyncDataLogApplier:208 - [ApplyInternal] Data(172.20.70.8:9005):apply log InsertTabletPlan {deviceId:root.test.g_17.d_77, timesRange[1633168692200,1633168694000]},term:1,index:8093294 slowly for 21623ms
2021-11-01 09:13:10,352 [check-log-applier-Node(internalIp:172.20.70.8, metaPort:9005, nodeIdentifier:1697292808, dataPort:40005, clientPort:6667, clientIp:172.20.70.8)] WARN o.a.i.c.l.m.RaftLogManager:960 - Node(internalIp:172.20.70.8, metaPort:9005, nodeIdentifier:1697292808, dataPort:40005, clientPort:6667, clientIp:172.20.70.8):wait log InsertTabletPlan {deviceId:root.test.g_17.d_77, timesRange[1633168692200,1633168694000]},term:1,index:8093294 to apply slowly for 21627ms
2021-11-01 09:13:12,240 [ClusterClient-35] WARN o.a.i.c.s.m.RaftMember:950 - Data(172.20.70.8:9005): Failed to synchronize with the leader after 20093ms
2021-11-01 09:13:13,240 [ClusterClient-12] WARN o.a.i.c.s.m.RaftMember:950 - Data(172.20.70.8:9005): Failed to synchronize with the leader after 20186ms


Third test (with more detailed logging)

  1. Branch under test: cluster_sync_leader_bug (in the stormbroken repo), commit ID b1f7ca95
  2. Test result: the mid consistency error appeared again (after about 6 hours). The newly added, more detailed logs never fired. Some standalone-engine writes were still slow, but the time spent does not look like it went to GC.


Third test: WARN logs around the mid consistency error

2021-11-03 00:14:51,682 [ClusterClient-25] WARN o.a.i.c.s.m.RaftMember:950 - Data(172.20.70.9:9007): Failed to synchronize with the leader after 20223ms
2021-11-03 00:14:51,700 [ClusterClient-25] WARN o.a.i.d.s.TSServiceImpl:1926 - Error occurred in query process:
org.apache.iotdb.db.exception.StorageEngineException: org.apache.iotdb.cluster.exception.CheckConsistencyException: check consistency failed, error message=mid consistency, localAppliedId is smaller than the leaderCommitId
at org.apache.iotdb.cluster.query.reader.ClusterReaderFactory.getReaderByTimestamp(ClusterReaderFactory.java:1026)
at org.apache.iotdb.cluster.query.reader.ClusterReaderFactory.getSeriesReaderByTime(ClusterReaderFactory.java:161)
at org.apache.iotdb.cluster.query.reader.ClusterReaderFactory.getReaderByTimestamp(ClusterReaderFactory.java:130)
at org.apache.iotdb.cluster.query.aggregate.ClusterAggregateExecutor.getReaderByTime(ClusterAggregateExecutor.java:95)
at org.apache.iotdb.db.query.executor.AggregationExecutor.executeWithValueFilter(AggregationExecutor.java:357)
at org.apache.iotdb.db.query.executor.QueryRouter.aggregate(QueryRouter.java:137)
at org.apache.iotdb.db.qp.executor.PlanExecutor.processDataQuery(PlanExecutor.java:538)
at org.apache.iotdb.cluster.query.ClusterPlanExecutor.processQuery(ClusterPlanExecutor.java:101)
at org.apache.iotdb.db.service.TSServiceImpl.createQueryDataSet(TSServiceImpl.java:1098)
at org.apache.iotdb.db.service.TSServiceImpl.internalExecuteQueryStatement(TSServiceImpl.java:729)
at org.apache.iotdb.db.service.TSServiceImpl.executeQueryStatement(TSServiceImpl.java:627)
at org.apache.iotdb.service.rpc.thrift.TSIService$Processor$executeQueryStatement.getResult(TSIService.java:2423)
at org.apache.iotdb.service.rpc.thrift.TSIService$Processor$executeQueryStatement.getResult(TSIService.java:2403)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:248)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.iotdb.cluster.exception.CheckConsistencyException: check consistency failed, error message=mid consistency, localAppliedId is smaller than the leaderCommitId
at org.apache.iotdb.cluster.exception.CheckConsistencyException.<clinit>(CheckConsistencyException.java:34)
at org.apache.iotdb.cluster.server.member.RaftMember$MidCheckConsistency.postCheckConsistency(RaftMember.java:810)
at org.apache.iotdb.cluster.server.member.RaftMember.waitUntilCatchUp(RaftMember.java:909)
at org.apache.iotdb.cluster.server.member.RaftMember.syncLeader(RaftMember.java:858)
at org.apache.iotdb.cluster.server.member.RaftMember.syncLeaderWithConsistencyCheck(RaftMember.java:765)
at org.apache.iotdb.cluster.query.reader.ClusterReaderFactory.getReaderByTimestamp(ClusterReaderFactory.java:1024)
... 18 common frames omitted
2021-11-03 00:14:51,713 [Data(172.20.70.9:9007)-root.test.g_8] WARN o.a.i.c.l.a.DataLogApplier:149 - [ApplyInsert] applyPhysicalPlan slowly for 21319ms
2021-11-03 00:14:51,713 [Data(172.20.70.9:9007)-root.test.g_8] WARN o.a.i.c.l.a.AsyncDataLogApplier:208 - [ApplyInternal] Data(172.20.70.9:9007):apply log InsertTabletPlan {deviceId:root.test.g_8.d_88, timesRange[1633037436200,1633037438000]},term:1,index:1239624 slowly for 21320ms
2021-11-03 00:14:51,713 [check-log-applier-Node(internalIp:172.20.70.9, metaPort:9007, nodeIdentifier:2041883899, dataPort:40007, clientPort:6667, clientIp:172.20.70.9)] WARN o.a.i.c.l.m.RaftLogManager:960 - Node(internalIp:172.20.70.9, metaPort:9007, nodeIdentifier:2041883899, dataPort:40007, clientPort:6667, clientIp:172.20.70.9):wait log InsertTabletPlan {deviceId:root.test.g_8.d_88, timesRange[1633037436200,1633037438000]},term:1,index:1239624 to apply slowly for 21319ms

Third test: GC logs around the mid consistency error

2021-11-03T00:14:48.074+0800: 23319.008: [GC (Allocation Failure) [PSYoungGen: 4631301K->114123K(5746688K)] 17062396K->12808796K(19727872K), 0.3338259 secs] [Times: user=1.19 sys=0.01, real=0.54 secs]
2021-11-03T00:14:48.617+0800: 23319.552: Total time for which application threads were stopped: 0.5504849 seconds, Stopping threads took: 0.0020039 seconds
2021-11-03T00:14:49.149+0800: 23320.084: [GC (Allocation Failure) [PSYoungGen: 4647371K->103713K(5758464K)] 17342044K->12891978K(19739648K), 0.0640482 secs] [Times: user=0.81 sys=0.00, real=0.07 secs]
2021-11-03T00:14:49.214+0800: 23320.148: Total time for which application threads were stopped: 0.0707036 seconds, Stopping threads took: 0.0013424 seconds
2021-11-03T00:14:49.236+0800: 23320.170: Total time for which application threads were stopped: 0.0054960 seconds, Stopping threads took: 0.0004315 seconds
2021-11-03T00:14:49.464+0800: 23320.398: Total time for which application threads were stopped: 0.0058205 seconds, Stopping threads took: 0.0009434 seconds
2021-11-03T00:14:49.749+0800: 23320.684: [GC (Allocation Failure) [PSYoungGen: 4636961K->91474K(5731840K)] 17425226K->12973431K(19713024K), 0.0486890 secs] [Times: user=0.61 sys=0.01, real=0.05 secs]
2021-11-03T00:14:49.798+0800: 23320.733: Total time for which application threads were stopped: 0.0542169 seconds, Stopping threads took: 0.0003450 seconds
2021-11-03T00:14:50.100+0800: 23321.035: Total time for which application threads were stopped: 0.0128498 seconds, Stopping threads took: 0.0033714 seconds
2021-11-03T00:14:50.451+0800: 23321.385: Total time for which application threads were stopped: 0.0081008 seconds, Stopping threads took: 0.0033660 seconds
2021-11-03T00:14:50.460+0800: 23321.394: [GC (GCLocker Initiated GC) [PSYoungGen: 4637010K->198209K(5751296K)] 17525934K->13087132K(19732480K), 0.0455184 secs] [Times: user=0.48 sys=0.00, real=0.05 secs]
2021-11-03T00:14:50.506+0800: 23321.440: Total time for which application threads were stopped: 0.0508556 seconds, Stopping threads took: 0.0003779 seconds
2021-11-03T00:14:50.970+0800: 23321.905: Total time for which application threads were stopped: 0.0063016 seconds, Stopping threads took: 0.0012185 seconds
2021-11-03T00:14:50.997+0800: 23321.932: [GC (Allocation Failure) [PSYoungGen: 4743745K->331393K(5675008K)] 17632669K->13220317K(19656192K), 0.0647358 secs] [Times: user=0.59 sys=0.01, real=0.07 secs]
2021-11-03T00:14:51.062+0800: 23321.997: Total time for which application threads were stopped: 0.0704747 seconds, Stopping threads took: 0.0006351 seconds
2021-11-03T00:14:51.502+0800: 23322.437: Total time for which application threads were stopped: 0.0067273 seconds, Stopping threads took: 0.0018121 seconds
2021-11-03T00:14:51.518+0800: 23322.453: Total time for which application threads were stopped: 0.0057247 seconds, Stopping threads took: 0.0009592 seconds
2021-11-03T00:14:51.607+0800: 23322.542: Total time for which application threads were stopped: 0.0060769 seconds, Stopping threads took: 0.0011555 seconds
2021-11-03T00:14:51.618+0800: 23322.552: [GC (GCLocker Initiated GC) [PSYoungGen: 4950145K->493210K(5719552K)] 17841579K->13384644K(19700736K), 0.0637473 secs] [Times: user=0.78 sys=0.02, real=0.07 secs]
2021-11-03T00:14:51.682+0800: 23322.617: Total time for which application threads were stopped: 0.0695298 seconds, Stopping threads took: 0.0002446 seconds
2021-11-03T00:14:51.700+0800: 23322.635: Total time for which application threads were stopped: 0.0057423 seconds, Stopping threads took: 0.0005822 seconds
2021-11-03T00:14:52.079+0800: 23323.013: Total time for which application threads were stopped: 0.0060501 seconds, Stopping threads took: 0.0009866 seconds
2021-11-03T00:14:52.284+0800: 23323.219: [GC (Allocation Failure) [PSYoungGen: 5111962K->551491K(5856256K)] 18003429K->13442958K(19837440K), 0.1040987 secs] [Times: user=1.19 sys=0.01, real=0.10 secs]
2021-11-03T00:14:52.389+0800: 23323.323: Total time for which application threads were stopped: 0.1123434 seconds, Stopping threads took: 0.0029986 seconds
2021-11-03T00:14:52.923+0800: 23323.858: [GC (Allocation Failure) [PSYoungGen: 5374019K->671541K(5878272K)] 18265486K->13563008K(19859456K), 0.1139545 secs] [Times: user=1.42 sys=0.04, real=0.11 secs]


Third test: all logs around the mid consistency error

2021-11-03 00:14:51,510 [pool-15-IoTDB-FlushTask-Submit-Pool-1] INFO o.a.i.d.e.s.TsFileProcessor:615 - Memtable PrimitiveMemTable{planIndex=[949947,950649]} has been added to flushing list
2021-11-03 00:14:51,506 [DataClientThread-131] ERROR o.a.i.c.l.m.RaftLogManager:615 - Trigger compact:numToReserveForNew:1000,sizeToReserveForNew:736,sizeToReserveForConfig:1000
2021-11-03 00:14:51,512 [pool-15-IoTDB-FlushTask-Submit-Pool-1] INFO o.a.i.d.e.s.StorageGroupProcessor:1257 - close a sequence tsfile processor root.test.g_7-0
2021-11-03 00:14:51,512 [DataClientThread-131] INFO o.a.i.c.l.m.RaftLogManager:873 - Node(internalIp:172.20.70.8, metaPort:9005, nodeIdentifier:2041880328, dataPort:40005, clientPort:6667, clientIp:172.20.70.8): Before compaction index 949958-950698, compactIndex 949959, removeSize 2, committedLogSize 738, maxAppliedLog 950693
2021-11-03 00:14:51,512 [DataClientThread-131] INFO o.a.i.c.l.m.RaftLogManager:887 - Node(internalIp:172.20.70.8, metaPort:9005, nodeIdentifier:2041880328, dataPort:40005, clientPort:6667, clientIp:172.20.70.8): After compaction index 949960-950698, committedLogSize 736
2021-11-03 00:14:51,512 [pool-3-IoTDB-Flush-3] INFO o.a.i.d.e.s.TsFileProcessor:995 - Ended file /data/cluster/first-rest-test/apache-iotdb/sbin/../data/data/sequence/root.test.g_15/0/0/1635869637646-459-0-0.tsfile
2021-11-03 00:14:51,519 [pool-3-IoTDB-Flush-3] INFO o.a.i.d.e.s.StorageGroupProcessor:1997 - signal closing storage group condition in root.test.g_15-0
2021-11-03 00:14:51,519 [pool-3-IoTDB-Flush-3] INFO o.a.i.d.e.s.TsFileProcessor:1009 - Storage group root.test.g_15/0 close the file /data/cluster/first-rest-test/apache-iotdb/sbin/../data/data/sequence/root.test.g_15/0/0/1635869637646-459-0-0.tsfile, TsFile size is 52079411, time consumption of flushing metadata is 708ms
2021-11-03 00:14:51,519 [pool-3-IoTDB-Flush-3] INFO o.a.i.d.e.f.MemTableFlushTask:88 - The memTable size of SG root.test.g_7/0 is 170240000, the avg series points num in chunk is 88, total timeseries number is 120000
2021-11-03 00:14:51,510 [ClusterClient-56] ERROR o.a.i.c.l.m.RaftLogManager:615 - Trigger compact:numToReserveForNew:1000,sizeToReserveForNew:735,sizeToReserveForConfig:1000
2021-11-03 00:14:51,530 [ClusterClient-56] INFO o.a.i.c.l.m.RaftLogManager:873 - Node(internalIp:172.20.70.7, metaPort:9003, nodeIdentifier:2041893354, dataPort:40003, clientPort:6667, clientIp:172.20.70.7): Before compaction index 505128-505869, compactIndex 505130, removeSize 3, committedLogSize 738, maxAppliedLog 505854
2021-11-03 00:14:51,530 [ClusterClient-56] INFO o.a.i.c.l.m.RaftLogManager:887 - Node(internalIp:172.20.70.7, metaPort:9003, nodeIdentifier:2041893354, dataPort:40003, clientPort:6667, clientIp:172.20.70.7): After compaction index 505131-505869, committedLogSize 735
2021-11-03 00:14:51,531 [DataClientThread-243] ERROR o.a.i.c.l.m.RaftLogManager:615 - Trigger compact:numToReserveForNew:1998,sizeToReserveForNew:736,sizeToReserveForConfig:1000
2021-11-03 00:14:51,531 [DataClientThread-243] INFO o.a.i.c.l.m.RaftLogManager:873 - Node(internalIp:172.20.70.9, metaPort:9007, nodeIdentifier:2041883899, dataPort:40007, clientPort:6667, clientIp:172.20.70.9): Before compaction index 1239623-1241878, compactIndex 1239622, removeSize 1476, committedLogSize 2212, maxAppliedLog 1239623
2021-11-03 00:14:51,532 [DataClientThread-243] INFO o.a.i.c.l.m.RaftLogManager:887 - Node(internalIp:172.20.70.9, metaPort:9007, nodeIdentifier:2041883899, dataPort:40007, clientPort:6667, clientIp:172.20.70.9): After compaction index 1239623-1241878, committedLogSize 2212
2021-11-03 00:14:51,538 [ClusterClient-58] ERROR o.a.i.c.l.m.RaftLogManager:615 - Trigger compact:numToReserveForNew:1000,sizeToReserveForNew:737,sizeToReserveForConfig:1000
2021-11-03 00:14:51,538 [ClusterClient-58] INFO o.a.i.c.l.m.RaftLogManager:873 - Node(internalIp:172.20.70.7, metaPort:9003, nodeIdentifier:2041893354, dataPort:40003, clientPort:6667, clientIp:172.20.70.7): Before compaction index 505131-505871, compactIndex 505131, removeSize 1, committedLogSize 738, maxAppliedLog 505854
2021-11-03 00:14:51,538 [ClusterClient-58] INFO o.a.i.c.l.m.RaftLogManager:887 - Node(internalIp:172.20.70.7, metaPort:9003, nodeIdentifier:2041893354, dataPort:40003, clientPort:6667, clientIp:172.20.70.7): After compaction index 505132-505871, committedLogSize 737
2021-11-03 00:14:51,541 [Data(172.20.70.7:9003)-root.test.g_16] INFO o.a.i.d.w.n.ExclusiveWriteLogNode:127 - WAL BufferOverflow !
2021-11-03 00:14:51,547 [Data(172.20.70.8:9005)-root.test.g_17] INFO o.a.i.d.w.n.ExclusiveWriteLogNode:127 - WAL BufferOverflow !
2021-11-03 00:14:51,559 [ClusterClient-59] ERROR o.a.i.c.l.m.RaftLogManager:615 - Trigger compact:numToReserveForNew:1000,sizeToReserveForNew:737,sizeToReserveForConfig:1000
2021-11-03 00:14:51,559 [ClusterClient-59] INFO o.a.i.c.l.m.RaftLogManager:873 - Node(internalIp:172.20.70.7, metaPort:9003, nodeIdentifier:2041893354, dataPort:40003, clientPort:6667, clientIp:172.20.70.7): Before compaction index 505132-505871, compactIndex 505132, removeSize 1, committedLogSize 738, maxAppliedLog 505855
2021-11-03 00:14:51,559 [ClusterClient-59] INFO o.a.i.c.l.m.RaftLogManager:887 - Node(internalIp:172.20.70.7, metaPort:9003, nodeIdentifier:2041893354, dataPort:40003, clientPort:6667, clientIp:172.20.70.7): After compaction index 505133-505871, committedLogSize 737
2021-11-03 00:14:51,562 [ClusterClient-38] ERROR o.a.i.c.l.m.RaftLogManager:615 - Trigger compact:numToReserveForNew:1000,sizeToReserveForNew:737,sizeToReserveForConfig:1000
2021-11-03 00:14:51,562 [ClusterClient-38] INFO o.a.i.c.l.m.RaftLogManager:873 - Node(internalIp:172.20.70.7, metaPort:9003, nodeIdentifier:2041893354, dataPort:40003, clientPort:6667, clientIp:172.20.70.7): Before compaction index 505133-505871, compactIndex 505133, removeSize 1, committedLogSize 738, maxAppliedLog 505855
2021-11-03 00:14:51,562 [ClusterClient-38] INFO o.a.i.c.l.m.RaftLogManager:887 - Node(internalIp:172.20.70.7, metaPort:9003, nodeIdentifier:2041893354, dataPort:40003, clientPort:6667, clientIp:172.20.70.7): After compaction index 505134-505871, committedLogSize 737
2021-11-03 00:14:51,588 [ClusterClient-47] ERROR o.a.i.c.l.m.RaftLogManager:615 - Trigger compact:numToReserveForNew:1000,sizeToReserveForNew:737,sizeToReserveForConfig:1000
2021-11-03 00:14:51,588 [ClusterClient-47] INFO o.a.i.c.l.m.RaftLogManager:873 - Node(internalIp:172.20.70.7, metaPort:9003, nodeIdentifier:2041893354, dataPort:40003, clientPort:6667, clientIp:172.20.70.7): Before compaction index 505134-505872, compactIndex 505134, removeSize 1, committedLogSize 738, maxAppliedLog 505856
2021-11-03 00:14:51,588 [ClusterClient-47] INFO o.a.i.c.l.m.RaftLogManager:887 - Node(internalIp:172.20.70.7, metaPort:9003, nodeIdentifier:2041893354, dataPort:40003, clientPort:6667, clientIp:172.20.70.7): After compaction index 505135-505872, committedLogSize 737
2021-11-03 00:14:51,596 [DataClientThread-131] ERROR o.a.i.c.l.m.RaftLogManager:615 - Trigger compact:numToReserveForNew:1000,sizeToReserveForNew:737,sizeToReserveForConfig:1000
2021-11-03 00:14:51,596 [DataClientThread-131] INFO o.a.i.c.l.m.RaftLogManager:873 - Node(internalIp:172.20.70.8, metaPort:9005, nodeIdentifier:2041880328, dataPort:40005, clientPort:6667, clientIp:172.20.70.8): Before compaction index 949960-950699, compactIndex 949960, removeSize 1, committedLogSize 738, maxAppliedLog 950697
2021-11-03 00:14:51,596 [DataClientThread-131] INFO o.a.i.c.l.m.RaftLogManager:887 - Node(internalIp:172.20.70.8, metaPort:9005, nodeIdentifier:2041880328, dataPort:40005, clientPort:6667, clientIp:172.20.70.8): After compaction index 949961-950699, committedLogSize 737
2021-11-03 00:14:51,682 [ClusterClient-25] WARN o.a.i.c.s.m.RaftMember:950 - Data(172.20.70.9:9007): Failed to synchronize with the leader after 20223ms
2021-11-03 00:14:51,690 [ClusterClient-25] INFO o.a.i.d.s.TSServiceImpl:812 - Cost: 43063 ms, sql is SELECT count(s_2292) FROM root.test.g_8.d_28 WHERE time >= 1633017602400 AND time <= 1633017852400 AND root.test.g_8.d_28.s_2292 > -5
2021-11-03 00:14:51,700 [ClusterClient-25] WARN o.a.i.d.s.TSServiceImpl:1926 - Error occurred in query process:
org.apache.iotdb.db.exception.StorageEngineException: org.apache.iotdb.cluster.exception.CheckConsistencyException: check consistency failed, error message=mid consistency, localAppliedId is smaller than the leaderCommitId
at org.apache.iotdb.cluster.query.reader.ClusterReaderFactory.getReaderByTimestamp(ClusterReaderFactory.java:1026)
at org.apache.iotdb.cluster.query.reader.ClusterReaderFactory.getSeriesReaderByTime(ClusterReaderFactory.java:161)
at org.apache.iotdb.cluster.query.reader.ClusterReaderFactory.getReaderByTimestamp(ClusterReaderFactory.java:130)
at org.apache.iotdb.cluster.query.aggregate.ClusterAggregateExecutor.getReaderByTime(ClusterAggregateExecutor.java:95)
at org.apache.iotdb.db.query.executor.AggregationExecutor.executeWithValueFilter(AggregationExecutor.java:357)
at org.apache.iotdb.db.query.executor.QueryRouter.aggregate(QueryRouter.java:137)
at org.apache.iotdb.db.qp.executor.PlanExecutor.processDataQuery(PlanExecutor.java:538)
at org.apache.iotdb.cluster.query.ClusterPlanExecutor.processQuery(ClusterPlanExecutor.java:101)
at org.apache.iotdb.db.service.TSServiceImpl.createQueryDataSet(TSServiceImpl.java:1098)
at org.apache.iotdb.db.service.TSServiceImpl.internalExecuteQueryStatement(TSServiceImpl.java:729)
at org.apache.iotdb.db.service.TSServiceImpl.executeQueryStatement(TSServiceImpl.java:627)
at org.apache.iotdb.service.rpc.thrift.TSIService$Processor$executeQueryStatement.getResult(TSIService.java:2423)
at org.apache.iotdb.service.rpc.thrift.TSIService$Processor$executeQueryStatement.getResult(TSIService.java:2403)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:248)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.iotdb.cluster.exception.CheckConsistencyException: check consistency failed, error message=mid consistency, localAppliedId is smaller than the leaderCommitId
at org.apache.iotdb.cluster.exception.CheckConsistencyException.<clinit>(CheckConsistencyException.java:34)
at org.apache.iotdb.cluster.server.member.RaftMember$MidCheckConsistency.postCheckConsistency(RaftMember.java:810)
at org.apache.iotdb.cluster.server.member.RaftMember.waitUntilCatchUp(RaftMember.java:909)
at org.apache.iotdb.cluster.server.member.RaftMember.syncLeader(RaftMember.java:858)
at org.apache.iotdb.cluster.server.member.RaftMember.syncLeaderWithConsistencyCheck(RaftMember.java:765)
at org.apache.iotdb.cluster.query.reader.ClusterReaderFactory.getReaderByTimestamp(ClusterReaderFactory.java:1024)
... 18 common frames omitted
2021-11-03 00:14:51,703 [ClusterClient-10] ERROR o.a.i.c.l.m.RaftLogManager:615 - Trigger compact:numToReserveForNew:1000,sizeToReserveForNew:737,sizeToReserveForConfig:1000
2021-11-03 00:14:51,703 [ClusterClient-10] INFO o.a.i.c.l.m.RaftLogManager:873 - Node(internalIp:172.20.70.7, metaPort:9003, nodeIdentifier:2041893354, dataPort:40003, clientPort:6667, clientIp:172.20.70.7): Before compaction index 505135-505873, compactIndex 505135, removeSize 1, committedLogSize 738, maxAppliedLog 505867
2021-11-03 00:14:51,703 [ClusterClient-10] INFO o.a.i.c.l.m.RaftLogManager:887 - Node(internalIp:172.20.70.7, metaPort:9003, nodeIdentifier:2041893354, dataPort:40003, clientPort:6667, clientIp:172.20.70.7): After compaction index 505136-505873, committedLogSize 737
2021-11-03 00:14:51,704 [DataClientThread-131] ERROR o.a.i.c.l.m.RaftLogManager:615 - Trigger compact:numToReserveForNew:1000,sizeToReserveForNew:736,sizeToReserveForConfig:1000
2021-11-03 00:14:51,704 [DataClientThread-131] INFO o.a.i.c.l.m.RaftLogManager:873 - Node(internalIp:172.20.70.8, metaPort:9005, nodeIdentifier:2041880328, dataPort:40005, clientPort:6667, clientIp:172.20.70.8): Before compaction index 949961-950701, compactIndex 949962, removeSize 2, committedLogSize 738, maxAppliedLog 950697
2021-11-03 00:14:51,704 [DataClientThread-131] INFO o.a.i.c.l.m.RaftLogManager:887 - Node(internalIp:172.20.70.8, metaPort:9005, nodeIdentifier:2041880328, dataPort:40005, clientPort:6667, clientIp:172.20.70.8): After compaction index 949963-950701, committedLogSize 736
2021-11-03 00:14:51,710 [Data(172.20.70.8:9005)-root.test.g_7] INFO o.a.i.d.e.s.TsFileProcessor:149 - create a new tsfile processor /data/cluster/first-rest-test/apache-iotdb/sbin/../data/data/sequence/root.test.g_7/0/0/1635869691710-578-0-0.tsfile
2021-11-03 00:14:51,713 [Data(172.20.70.9:9007)-root.test.g_8] WARN o.a.i.c.l.a.DataLogApplier:149 - [ApplyInsert] applyPhysicalPlan slowly for 21319ms
2021-11-03 00:14:51,713 [Data(172.20.70.9:9007)-root.test.g_8] WARN o.a.i.c.l.a.AsyncDataLogApplier:208 - [ApplyInternal] Data(172.20.70.9:9007):apply log InsertTabletPlan {deviceId:root.test.g_8.d_88, timesRange[1633037436200,1633037438000]},term:1,index:1239624 slowly for 21320ms
2021-11-03 00:14:51,713 [check-log-applier-Node(internalIp:172.20.70.9, metaPort:9007, nodeIdentifier:2041883899, dataPort:40007, clientPort:6667, clientIp:172.20.70.9)] WARN o.a.i.c.l.m.RaftLogManager:960 - Node(internalIp:172.20.70.9, metaPort:9007, nodeIdentifier:2041883899, dataPort:40007, clientPort:6667, clientIp:172.20.70.9):wait log InsertTabletPlan {deviceId:root.test.g_8.d_88, timesRange[1633037436200,1633037438000]},term:1,index:1239624 to apply slowly for 21319ms
2021-11-03 00:14:51,717 [Data(172.20.70.8:9005)-root.test.g_7] INFO o.a.i.d.w.n.ExclusiveWriteLogNode:90 - create the WAL folder /data/cluster/first-rest-test/apache-iotdb/sbin/../data/wal/root.test.g_7/0-1635869691710-578-0-0.tsfile.
2021-11-03 00:14:51,717 [Data(172.20.70.8:9005)-root.test.g_7] INFO o.a.i.d.c.IoTDBThreadPoolFactory:87 - new single thread pool: Flush-WAL-Thread-root.test.g_7/0-1635869691710-578-0-0.tsfile
2021-11-03 00:14:51,750 [DataClientThread-30] ERROR o.a.i.c.l.m.RaftLogManager:615 - Trigger compact:numToReserveForNew:1999,sizeToReserveForNew:737,sizeToReserveForConfig:1000
2021-11-03 00:14:51,750 [DataClientThread-30] INFO o.a.i.c.l.m.RaftLogManager:873 - Node(internalIp:172.20.70.9, metaPort:9007, nodeIdentifier:2041883899, dataPort:40007, clientPort:6667, clientIp:172.20.70.9): Before compaction index 1239623-1241879, compactIndex 1239625, removeSize 1477, committedLogSize 2214, maxAppliedLog 1239626
2021-11-03 00:14:51,750 [DataClientThread-30] INFO o.a.i.c.l.m.RaftLogManager:887 - Node(internalIp:172.20.70.9, metaPort:9007, nodeIdentifier:2041883899, dataPort:40007, clientPort:6667, clientIp:172.20.70.9): After compaction index 1239626-1241879, committedLogSize 2211
2021-11-03 00:14:51,840 [ClusterClient-67] ERROR o.a.i.c.l.m.RaftLogManager:615 - Trigger compact:numToReserveForNew:1000,sizeToReserveForNew:737,sizeToReserveForConfig:1000
2021-11-03 00:14:51,857 [ClusterClient-67] INFO o.a.i.c.l.m.RaftLogManager:873 - Node(internalIp:172.20.70.7, metaPort:9003, nodeIdentifier:2041893354, dataPort:40003, clientPort:6667, clientIp:172.20.70.7): Before compaction index 505136-505874, compactIndex 505136, removeSize 1, committedLogSize 738, maxAppliedLog 505873
2021-11-03 00:14:51,857 [ClusterClient-67] INFO o.a.i.c.l.m.RaftLogManager:887 - Node(internalIp:172.20.70.7, metaPort:9003, nodeIdentifier:2041893354, dataPort:40003, clientPort:6667, clientIp:172.20.70.7): After compaction index 505137-505874, committedLogSize 737
2021-11-03 00:14:51,860 [Data(172.20.70.9:9007)-root.test.g_8] INFO o.a.i.d.w.n.ExclusiveWriteLogNode:127 - WAL BufferOverflow !
2021-11-03 00:14:51,922 [DataClientThread-314] ERROR o.a.i.c.l.m.RaftLogManager:615 - Trigger compact:numToReserveForNew:1000,sizeToReserveForNew:737,sizeToReserveForConfig:1000

Third test: the added logging code
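
A minimal sketch of the kind of timing check that produces the "[ApplyInsert] applyPhysicalPlan slowly for ...ms" warnings quoted above; the 10s threshold and all names outside the quoted message format are assumptions.

// Sketch of the slow-apply warning seen in the second- and third-test logs.
public class ApplyTimingSketch {

  static final long SLOW_APPLY_THRESHOLD_MS = 10_000; // assumed warn threshold

  // Wraps applying one physical plan (DataLogApplier.applyPhysicalPlan in the
  // logs) and warns when it runs longer than the threshold.
  static void applyWithTiming(Runnable applyPhysicalPlan) {
    long start = System.currentTimeMillis();
    applyPhysicalPlan.run();
    long elapsedMs = System.currentTimeMillis() - start;
    if (elapsedMs > SLOW_APPLY_THRESHOLD_MS) {
      // Message format copied from the logs.
      System.err.printf("[ApplyInsert] applyPhysicalPlan slowly for %dms%n", elapsedMs);
    }
  }

  public static void main(String[] args) {
    applyWithTiming(() -> { /* apply an InsertTabletPlan here */ });
  }
}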

Fourth test (ruling out the remaining code paths without logging)

  1. Branch under test: cluster_sync_leader_bug (in the stormbroken repo), commit ID 6c0425e
  2. Start time: 2021-11-04 10:49:42
    1. Node 7
      1. First mid consistency error at 2021-11-04 11:22:16, persisting until 11:52:17.
      2. OOM errors starting at 2021-11-04 11:06:24, then in large numbers.
      3. About 2 minutes after startup, "Failed to process no query InsertTabletPlan" errors began to appear intermittently. The first mid consistency error followed at 11:22:16 (usually preceded by a "Failed to synchronize with leader after 20000+ms"), the first "Current query is time out xxx" appeared at 11:29:02, mid consistency errors then recurred intermittently, and from 16:37:26 cluster nodes began losing contact.
      4. Inspecting the full logs, "Raft log buffer overflow" and "WAL BufferOverflow" appear near the "Failed to process no query plan" errors.
    2. Node 8
      1. No mid consistency errors.
      2. No OOM errors.
    3. Node 9
      1. No mid consistency errors.
      2. OOM errors starting at 2021-11-04 16:45:51, then in large numbers.
  3. Test result: the newly added, more detailed logs never fired. Some standalone-engine writes were still slow, but the time spent does not look like it went to GC.
  4. None of the newly added log statements appeared.

Selected log screenshots

Server monitoring for the corresponding time window

The details can be inspected in Zabbix at http://172.20.70.11/zabbix.php?name=&ip=&dns=&port=&status=-1&evaltype=0&tags%5B0%5D%5Btag%5D=&tags%5B0%5D%5Boperator%5D=0&tags%5B0%5D%5Bvalue%5D=&maintenance_status=1&filter_name=&filter_show_counter=0&filter_custom_time=0&sort=name&sortorder=ASC&show_suppressed=0&action=host.view (user Admin, password iotdb2019).

Broadly, the monitoring shows that after node 07 hit OOM, the mid consistency error appeared, and node 09 then hit mid consistency as well.


Fifth test (changing 3 nodes / 3 replicas to 3 nodes / 1 replica)

  1. Branch under test: cluster_sync_leader_bug (in the stormbroken repo), commit ID 6c0425e
  2. No mid consistency or OOM problems appeared, and no operation slowed beyond 10s.


After extensive trials and final confirmation: since the single-replica setup removes Raft log synchronization and snapshot shipping between data replicas, and the problem disappeared with it, the preliminary conclusion is that the cluster crashes are caused by excessively long DataSnapshot operations.


Sixth test (targeted diagnosis, with extensive logging added around snapshots)

  1. Branch under test: cluster_sync_leader_bug (in the stormbroken repo), commit ID ce9d0992
  2. After some time, the mid consistency error was reproduced; it is now fairly certain the problem lies in snapshotting.
  3. Configuration for this test: 3 nodes, 3 replicas
    1. Node 7
      1. LogCatchUp timeouts starting at 2022-01-04 22:37:04, 31 in total, peaking at 452.2s (at 23:17:38), most above 10s
      2. 2022-01-04 22:54:39: doSnapshot took 15.6s
      3. 2022-01-04 22:55:47: sending the snapshot took 19.0s, SnapshotCatchUp took 19.8s, and the overall snapshot catch-up call took 19.8s
      4. 2022-01-04 23:16:53: OOM: Java heap space
      5. 2022-01-04 23:50:34: OOM: GC overhead limit exceeded
    2. Node 8
      1. 2022-01-04 22:42:34: LogCatchUp timeout, taking 350.3s
      2. 2022-01-04 22:54:41 to 23:02:16: repeated OOM, alternating Java heap space and GC overhead limit exceeded
      3. 2022-01-04 23:02:29: LogSnapshotCatchUp timeout, taking 1068.4s
      4. 2022-01-04 23:14:16 to 23:14:30: 3 slow PartitionedSnapshot installs into a single slot (4.0s, 4.1s, 7.4s), within which one FileSnapshot installed a tsfile slowly (4.9s), culminating at 23:14:34 in a slow overall PartitionedSnapshot install taking 26.3s
    3. Node 9
      1. 2022-01-04 22:38:28 to 23:51:44: scattered LogCatchUp timeouts, mostly in the 3-10s range
      2. 2022-01-04 22:54:48 to 22:55:20: 6 slow PartitionedSnapshot installs into a single slot (4.3s, 3.4s, 3.7s, 9.4s, 10.0s, 3.0s), culminating at 22:55:20 in a slow overall PartitionedSnapshot install taking 39.4s
      3. Two more slow overall PartitionedSnapshot installs followed: 24.7s at 22:55:30 and 18.5s at 22:55:47
      4. 2022-01-04 23:14:06: doSnapshot took 26.4s
    4. Common traits:
      1. load New TsFile never held its lock for long
      2. MetaSimpleSnapshot never timed out

Testing confirms that the leader node hits massive LogSnapshotCatchUp timeouts while the follower nodes mainly hit PartitionedSnapshot installation timeouts; requests pile up behind these stalls until the nodes finally go OOM.
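
To make the two timeout kinds concrete: when a follower lags, the leader first tries LogCatchUp (replaying retained logs); if the needed logs have already been compacted away, it falls back to taking a snapshot (doSnapshot) and shipping it (SnapshotCatchUp), which the follower installs slot by slot (PartitionedSnapshot/FileSnapshot). A sketch of that decision follows; the class and phase names come from the logs, but the interface and control flow here are assumptions.

// Sketch of the leader-side catch-up decision behind the timings above.
public class CatchUpSketch {

  interface FollowerPeer {
    long matchIndex(); // last log index known to be replicated on the follower

    void appendLogs(long fromIndex); // LogCatchUp: replay retained logs

    void installSnapshot(); // SnapshotCatchUp: ship and install a snapshot
  }

  // Oldest log index the leader still retains after Raft log compaction.
  long firstRetainedLogIndex;

  void catchUp(FollowerPeer follower) {
    if (follower.matchIndex() + 1 >= firstRetainedLogIndex) {
      // Follower is close enough: replay logs. In this test these calls
      // repeatedly timed out (up to 452.2s on node 7).
      follower.appendLogs(follower.matchIndex() + 1);
    } else {
      // The needed logs were compacted away: take and ship a snapshot.
      // Installing a PartitionedSnapshot slot by slot is what the follower-side
      // warnings measure; while it runs, new writes keep queueing behind it,
      // which is how requests piled up until OOM.
      follower.installSnapshot();
    }
  }

  public static void main(String[] args) {
    CatchUpSketch leader = new CatchUpSketch();
    leader.firstRetainedLogIndex = 505_131; // e.g. after the compactions logged above
    leader.catchUp(new FollowerPeer() {
      public long matchIndex() { return 505_000; } // lags behind the compaction point
      public void appendLogs(long fromIndex) { System.out.println("LogCatchUp from " + fromIndex); }
      public void installSnapshot() { System.out.println("SnapshotCatchUp"); }
    });
  }
}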


Added logging
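
A minimal sketch of the snapshot-phase timing instrumentation this test added; the phase names come from the results above, while the helper, threshold, and message format are assumptions.

import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Sketch of timing instrumentation around the snapshot phases named above
// (doSnapshot, sending the snapshot, SnapshotCatchUp, per-slot installs).
public class SnapshotTimingSketch {

  static final long SLOW_PHASE_THRESHOLD_MS = 3_000; // assumed warn threshold

  // Runs one snapshot phase and warns when it exceeds the threshold.
  static <T> T timed(String phase, Supplier<T> operation) {
    long startNanos = System.nanoTime();
    try {
      return operation.get();
    } finally {
      long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
      if (elapsedMs > SLOW_PHASE_THRESHOLD_MS) {
        System.err.printf("[Snapshot] %s took %dms%n", phase, elapsedMs);
      }
    }
  }

  public static void main(String[] args) {
    timed("doSnapshot", () -> null); // wrap the real snapshot call here
  }
}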


Selected log output

Duration overview

Log details

