ID | IEP-31 |
Author | |
Sponsor | |
Created | 4 March 2019 |
Status | ACTIVE |
Apache Ignite provides a consistency guarantee: each backup should contain the same value for the same key, at least eventually.
But this guarantee can be violated because of ...
but ... most likely because of bugs.
And yes, we had some (e.g. IGNITE-10078) and fixed them.
But is there any chance that we fixed them all and will not write new ones? While Ignite is distributed ... such bugs are possible.
The main idea is to provide a special read mode that reads a value from the primary and all backups and checks that the values are the same.
If the values differ, they should be repaired according to the appropriate strategy.
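The check-then-repair flow can be sketched in plain Java as follows. This is only an illustration under assumptions: the class `ReplicaCheckSketch`, its methods, and the node names are hypothetical and are not Ignite APIs, and the value to repair with is simply passed in, because choosing it is the job of the strategy.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Hypothetical sketch; none of these names are Ignite APIs. */
public class ReplicaCheckSketch {
    /** Returns true when every copy (primary and backups) holds the same value. */
    public static <V> boolean isConsistent(Map<String, V> valuesByNode) {
        V first = null;
        boolean seen = false;
        for (V v : valuesByNode.values()) {
            if (!seen) {
                first = v;
                seen = true;
            } else if (!Objects.equals(first, v)) {
                return false;
            }
        }
        return true;
    }

    /** Overwrites every differing copy with the chosen value; returns the repaired node ids. */
    public static <V> List<String> repair(Map<String, V> valuesByNode, V chosen) {
        List<String> repaired = new ArrayList<>();
        for (Map.Entry<String, V> e : valuesByNode.entrySet()) {
            if (!Objects.equals(e.getValue(), chosen)) {
                e.setValue(chosen);
                repaired.add(e.getKey());
            }
        }
        return repaired;
    }

    public static void main(String[] args) {
        // Values read for one key from the primary and two backups (made-up data).
        Map<String, Integer> copies = new LinkedHashMap<>();
        copies.put("primary", 42);
        copies.put("backup-1", 42);
        copies.put("backup-2", 41); // stale copy

        if (!isConsistent(copies)) {
            // Assumption: the strategy picked 42; how it is picked is out of scope here.
            System.out.println("repaired: " + repair(copies, 42));
        }
        System.out.println("consistent: " + isConsistent(copies));
    }
}
```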
So the final goal is to have the ability to detect and fix consistency issues.
Currently, we are able to use the "idle_verify" feature to detect broken partitions.
Once we have detected them, we should be able to fix them.
Another way is to use this feature on each get request.
One more way is to check all entries permanently in a loop.
But what if we have 3+ different values for the same key in the topology?
It's not true.
Bugs are able to make every node outdated.
Not a 100% guarantee, but simple!
This seems suitable because each value is associated with a GridCacheVersion, which is comparable.
Best case.
Can be implemented as an add-on.
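The version-based choice mentioned above (each value carries a comparable version, and the latest one wins) could look roughly like this. It is a minimal sketch under assumptions: `LwwSketch` and `Versioned` are hypothetical names, and a plain `long` stands in for GridCacheVersion, whose real comparison rules are not reproduced here.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of a "latest version wins" choice; not Ignite code. */
public class LwwSketch {
    /** A copy of a value plus its version; the long is a stand-in for GridCacheVersion. */
    public static final class Versioned<V> {
        public final V value;
        public final long version;

        public Versioned(V value, long version) {
            this.value = value;
            this.version = version;
        }
    }

    /** Picks the copy with the highest version; throws NoSuchElementException if the list is empty. */
    public static <V> Versioned<V> latest(List<Versioned<V>> copies) {
        return Collections.max(copies, Comparator.comparingLong((Versioned<V> c) -> c.version));
    }

    public static void main(String[] args) {
        // Three copies of the same key read from different nodes (made-up data).
        List<Versioned<String>> copies = Arrays.asList(
            new Versioned<String>("old", 1L),
            new Versioned<String>("newest", 3L),
            new Versioned<String>("newer", 2L));

        Versioned<String> winner = latest(copies);
        System.out.println(winner.value + " @ version " + winner.version);
    }
}
```

As item 1) below points out, the highest version is not guaranteed to be the correct value, so a choice like this only makes sense together with a record of what was seen and what was chosen.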
1) Neither LWW nor any other strategy guarantees that the correct value will be chosen.
We have to record an event that contains all the values and the chosen one.
The event will allow us to
- learn that an inconsistent state was encountered
- manually investigate which value is correct and re-repair if necessary
1.1) It seems that not every case can be fixed.
For any transactional cache, we are able to perform a pessimistic serializable transaction per key.
But atomic caches cannot be fixed this way.
2) The consistency guard can cause the following problems
- replace hot data with cold data
We need a special configuration property that allows or forbids checking (reading) data available only at the persistence layer
- degrade throughput/latency metrics
Some throttling feature should be implemented
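One possible shape for the throttle mentioned above is a fixed-window limiter that caps how many checks may run per time window. This is only a sketch under assumptions: `CheckThrottleSketch` and its parameters are hypothetical, nothing here is an Ignite API, and the proposal does not say which throttling scheme should be used. The current time is passed in explicitly so the behavior is easy to reason about.

```java
/** Hypothetical fixed-window throttle for consistency checks; not Ignite code. */
public class CheckThrottleSketch {
    private final int maxChecksPerWindow;
    private final long windowMillis;
    private long windowStart;
    private int used;

    public CheckThrottleSketch(int maxChecksPerWindow, long windowMillis) {
        this.maxChecksPerWindow = maxChecksPerWindow;
        this.windowMillis = windowMillis;
    }

    /** Returns true if one more check may run at the given time, false if it should be skipped. */
    public synchronized boolean tryAcquire(long nowMillis) {
        // Start a new window once the current one has elapsed.
        if (nowMillis - windowStart >= windowMillis) {
            windowStart = nowMillis;
            used = 0;
        }
        if (used < maxChecksPerWindow) {
            used++;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Allow at most 2 checks per 1000 ms window (made-up limits).
        CheckThrottleSketch throttle = new CheckThrottleSketch(2, 1000);
        long now = System.currentTimeMillis();
        for (int i = 0; i < 4; i++) {
            System.out.println("check " + i + " allowed: " + throttle.tryAcquire(now));
        }
    }
}
```

A caller that gets `false` would serve the read without the extra check, trading detection coverage for throughput.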
Initial review request - http://apache-ignite-developers.2346864.n4.nabble.com/Consistency-check-and-fix-review-request-td41629.html
"Idle verify" to "Online verify" discussion - http://apache-ignite-developers.2346864.n4.nabble.com/quot-Idle-verify-quot-to-quot-Online-verify-quot-td41928.html
Second review request - http://apache-ignite-developers.2346864.n4.nabble.com/Read-Repair-ex-Consistency-Check-review-request-2-td42421.html
Cassandra's Read Repair feature - https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsRepairNodesReadRepair.html
Idle_verify - https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums