Attachment: logs.zip

Basic test parameters

Related JIRA: https://issues.apache.org/jira/browse/IOTDB-1628

...

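After many attempts and a final confirmation, we tentatively conclude that the distributed crash is caused by DataSnapshot operations that take too long.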

Sixth test (diagnostic test, with extensive logging added around snapshot)

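Branch under test: cluster_sync_leader_bug (in the stormbroken repository), commit ID ce9d0992
After running for some time, the mid consistency problem was reproduced, and it is now fairly certain that the problem lies in snapshot.
The basic configuration for this test is 3 nodes with 3 replicas.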
1. Node 7
  1. 2022.1.4 22:37:04: LogCatchUp timeouts began, 31 in total; the longest took 452.2s (2022.1.4 23:17:38), and most took more than 10s
  2. 2022.1.4 22:54:39: doSnapshot took 15.6s
  3. 2022.1.4 22:55:47: sending the Snapshot took 19.0s, executing SnapShotCatchUp took 19.8s, Call Snapshot CatchUp took 19.8s
  4. 2022.1.4 23:16:53: OOM: Java heap space
  5. 2022.1.4 23:50:34: OOM: GC overhead limit exceeded
2. Node 8
  1. 2022.1.4 22:42:34: LogCatchUp timeout, taking 350.3s
  2. 2022.1.4 22:54:41 - 2022.1.4 23:02:16: repeated OOM errors, each either Java heap space or GC overhead limit exceeded
  3. 2022.1.4 23:02:29: LogSnapshotCatchUp timeout, taking 1068.4s
  4. 2022.1.4 23:14:16-30: 3 slow installs of a PartitionedSnapshot into a single slot, taking 4.0s, 4.1s, and 7.4s; during these 3 installs a FileSnapshot was slow to install a tsfile (4.9s). This eventually led to a slow overall PartitionSnapshot install at 2022.1.4 23:14:34, taking 26.3s
3. Node 9
  1. 2022.1.4 22:38:28 - 2022.1.4 23:51:44: sporadic LogCatchUp timeouts, most taking between 3s and 10s
  2. 2022.1.4 22:54:48 - 2022.1.4 22:55:20: 6 slow installs of a PartitionSnapshot into a single slot, taking 4.3s, 3.4s, 3.7s, 9.4s, 10.0s, and 3.0s. This eventually led to a slow overall PartitionSnapshot install at 2022.1.4 22:55:20, taking 39.4s
  3. This was followed by further slow overall PartitionSnapshot installs at 2022.1.4 22:55:30 (24.7s) and 2022.1.4 22:55:47 (18.5s)
  4. 2022.1.4 23:14:06: doSnapshot took 26.4s
4. Common characteristics:
  1. load New TsFile did not hold the lock for long
  2. MetaSimpleSnapshot never timed out
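
As a rough cross-check of the node 8 and node 9 timings above, the sketch below sums the slow single-slot installs and compares each sum with the overall install time reported for the same window. The durations are copied from this report; the dictionary layout and the `slot_share` helper are hypothetical names used only for illustration. It assumes the slots are installed one after another, which the logs do not confirm, and only installs logged as slow are listed, so each sum is a lower bound.

```python
# Durations in seconds, copied from the node 8 and node 9 entries in this report.
# Only installs that were logged as slow appear here, so each sum is a lower
# bound on the total time spent installing slots.
slow_installs = {
    "node 8": {"slots": [4.0, 4.1, 7.4], "overall": 26.3},
    "node 9": {"slots": [4.3, 3.4, 3.7, 9.4, 10.0, 3.0], "overall": 39.4},
}


def slot_share(slots, overall):
    """Return (sum of slow slot installs, fraction of the overall install time)."""
    total = sum(slots)
    return total, total / overall


for node, timings in slow_installs.items():
    total, share = slot_share(timings["slots"], timings["overall"])
    print(
        f"{node}: slow slots {total:.1f}s "
        f"of {timings['overall']:.1f}s overall ({share:.0%})"
    )
```

Under these assumptions, the logged slow slot installs cover roughly 59% of the overall install time on node 8 and roughly 86% on node 9, so part of the overall time, especially on node 8, is not explained by the logged slot installs alone.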