...

How does ReplicationWorker handle this data loss scenario?
Scenario: The last fragment of a ledger is under-replicated. The replication worker re-replicates it and updates the ledger metadata with the local Bookie's address. Immediately afterwards, the failed Bookie restarts. The client then resumes adding entries and can continue writing to the old Bookie, but ReplicationWorker has already replaced that Bookie with the local Bookie in the metadata for that fragment. The client is therefore adding entries to a Bookie whose address has already been removed from the fragment ensemble. If another Bookie then goes down, this can cause data loss even though the old Bookie is running fine.
To prevent this situation, ReplicationWorker postpones the replication by adding the ledger to the pending replications. The pending ReplicationWorker checks the timeout of the ledgers held in the pending replications; this timeout is configurable. Once the timeout expires, it checks whether the last fragment of the ledger is still in the open state. If so, it schedules a timer task for that ledger to delay its replication. The timer task's scheduling period is configurable, with a default of 30000 ms. When the timer fires, it force-fences the ledger if it is still open and releases the ledger lock, so that the replication worker can pick the ledger up again. This triggers re-replication automatically, because the ReplicationWorker (RW) loops to fetch under-replicated ledgers. As a result, a ledger whose last fragment is under-replicated is not kept open for long when the client is idle and does not re-form the ensemble for longer than the pending replication timeout.
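The deferral described above can be sketched as follows. This is a minimal illustration only, not the BookKeeper implementation: LedgerSketch, InMemoryLedger, DeferredReplicationSketch and replicateOrDefer are hypothetical names, the sketch models only the timer task (not the pending-replications timeout), and the only value taken from the text is the 30000 ms default.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical view of a ledger; this is NOT a BookKeeper API. */
interface LedgerSketch {
    boolean isLastFragmentOpen();
    void fence();        // stands in for force-fencing the ledger
    void releaseLock();  // stands in for releasing the under-replication lock
}

/** Trivial in-memory stand-in, used only to exercise the sketch. */
final class InMemoryLedger implements LedgerSketch {
    volatile boolean open;
    volatile boolean fenced;
    volatile boolean lockReleased;

    InMemoryLedger(boolean open) { this.open = open; }

    public boolean isLastFragmentOpen() { return open; }
    public void fence() { fenced = true; open = false; }
    public void releaseLock() { lockReleased = true; }
}

final class DeferredReplicationSketch {
    /** Default delay quoted in the text above. */
    static final long DEFAULT_DELAY_MS = 30_000L;

    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();
    private final long delayMs;

    DeferredReplicationSketch(long delayMs) { this.delayMs = delayMs; }

    /**
     * Returns true if the caller may replicate right away (last fragment
     * already closed). Returns false if replication was deferred: a timer
     * task will fence the ledger if it is still open and then release the
     * lock, so a worker loop can pick the ledger up again.
     */
    boolean replicateOrDefer(LedgerSketch ledger) {
        if (!ledger.isLastFragmentOpen()) {
            return true;
        }
        timer.schedule(() -> {
            if (ledger.isLastFragmentOpen()) {
                ledger.fence();
            }
            ledger.releaseLock();
        }, delayMs, TimeUnit.MILLISECONDS);
        return false;
    }

    /** Stops accepting work and waits for already scheduled timer tasks. */
    void close() {
        timer.shutdown();
        try {
            timer.awaitTermination(delayMs + 5_000L, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

With an idle client, a call such as new DeferredReplicationSketch(DeferredReplicationSketch.DEFAULT_DELAY_MS).replicateOrDefer(ledger) returns false, and the ledger is fenced and its lock released only after the delay. If the last fragment is already closed, the call returns true and nothing is scheduled.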

...