...

This document discusses the new API, configuration, and other aspects of the HDFS persistence layer in Geode. We assume the reader is familiar with basic Geode constructs such as Regions, Members, and gfsh.

Operational Data Tier

Geode can cache key-value (KV) sets in memory. For big-data use cases, the entire data set is assumed to be too large to manage in memory, so Geode will provide a configurable KV retention/eviction policy. The data in memory is available for fast querying and is referred to as Operational Data. The operational data set typically consists of recently accessed records. Whenever a key lookup fails in the operational data, Geode will execute the lookup on HDFS and add the record to the operational data set if it meets the retention criteria. Used this way, Geode will provide reliable, fast, and easy access to HDFS data.
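The lookup behaviour described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `OperationalTierSketch`, `hdfsTier`, `retentionCriteria`, and `isInMemory` are hypothetical names, not Geode API, and plain in-memory maps stand in for both the region and HDFS.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiPredicate;

// Hypothetical sketch of a read-through lookup; none of these names are Geode API.
class OperationalTierSketch {
    // Stand-in for the HDFS tier, which holds every record.
    private final Map<String, String> hdfsTier;
    // Stand-in for Operational Data, the in-memory subset.
    private final Map<String, String> operationalData = new HashMap<>();
    // Assumed shape of the configurable retention criteria.
    private final BiPredicate<String, String> retentionCriteria;

    OperationalTierSketch(Map<String, String> hdfsTier,
                          BiPredicate<String, String> retentionCriteria) {
        this.hdfsTier = hdfsTier;
        this.retentionCriteria = retentionCriteria;
    }

    String get(String key) {
        // Fast path: the record is already part of the operational data.
        String value = operationalData.get(key);
        if (value != null) {
            return value;
        }
        // Lookup failed in memory, so fall back to the HDFS tier.
        value = hdfsTier.get(key);
        // Keep the record in memory only if it meets the retention criteria.
        if (value != null && retentionCriteria.test(key, value)) {
            operationalData.put(key, value);
        }
        return value;
    }

    boolean isInMemory(String key) {
        return operationalData.containsKey(key);
    }
}
```

In this sketch a record that fails the retention criteria is still returned to the caller; it is simply not added to the in-memory set.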

...

Geode collects all write operations and persists them on HDFS. These records are never evicted from HDFS unless deleted by the user. Hence a full record of all data collected by Geode is present on HDFS, referred to as the HDFS Tier. The update log managed on HDFS is similar to the oplog (operational log) maintained on local disk. The data on HDFS will be visible “externally”, for example readable from an MR job or a Hive query. This way, data managed by Geode can be used for analytics. At any instant, Operational Data is a subset of the HDFS data.

...

  1. Each write operation will be streamed to the Operational Data store (in-memory region) and to the HDFS buffers simultaneously. In general, the data flows to HDFS and to Geode Regions will be independent of each other.

  2. Each new or updated record will go through the eviction logic test. Existing data will be checked again on an as-needed basis (heap limit trigger) or as configured by the user.
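
As a rough illustration of the two steps above, the following sketch buffers every write for HDFS and separately applies an eviction test before keeping the record in memory. All names here (`WritePathSketch`, `hdfsBuffer`, `retain`, `recheck`) are hypothetical and not Geode API. The sketch is single-threaded and uses a queue in place of the real asynchronous HDFS buffers. Dropping a stale in-memory value when an update fails the test is an assumption of this sketch, not something the design states.

```java
import java.util.AbstractMap;
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.function.BiPredicate;

// Hypothetical sketch of the write path; none of these names are Geode API.
class WritePathSketch {
    // Stand-in for the HDFS buffers; the real flush to HDFS would be asynchronous.
    private final Queue<Map.Entry<String, String>> hdfsBuffer = new ArrayDeque<>();
    // Stand-in for the Operational Data store (in-memory region).
    private final Map<String, String> operationalData = new HashMap<>();
    // Assumed shape of the eviction logic test: true means "keep in memory".
    private final BiPredicate<String, String> retain;

    WritePathSketch(BiPredicate<String, String> retain) {
        this.retain = retain;
    }

    void put(String key, String value) {
        // Step 1: every write is buffered for HDFS, independent of the eviction outcome.
        hdfsBuffer.add(new AbstractMap.SimpleImmutableEntry<>(key, value));
        // Step 2: the new or updated record goes through the eviction logic test.
        if (retain.test(key, value)) {
            operationalData.put(key, value);
        } else {
            // Assumption: remove any older in-memory value so reads never see stale data.
            operationalData.remove(key);
        }
    }

    // Re-checks existing in-memory data, e.g. on a heap limit trigger or a schedule.
    // Returns the number of records evicted from memory.
    int recheck() {
        int before = operationalData.size();
        operationalData.entrySet().removeIf(e -> !retain.test(e.getKey(), e.getValue()));
        return before - operationalData.size();
    }

    int bufferedWrites() {
        return hdfsBuffer.size();
    }

    boolean isInMemory(String key) {
        return operationalData.containsKey(key);
    }
}
```

Records evicted by `recheck()` are removed only from memory; their writes were already buffered for HDFS in `put`, which is what keeps the in-memory set a subset of the HDFS data.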


PlantUML
(*) --> [PUT] Handler
--> ===B1=== 
 
--> Buffer
--> HDFS
 
===B1=== --> "EvictionPolicy" 
--> HdfsRegion
--> Scheduler
--> "EvictionPolicy"
 
HdfsRegion ..> [cache miss] HDFS
 

Data Flow

Put KV

PlantUML
participant User
participant Handler
participant HdfsRegion
participant OperationalData
participant Filter
 
User->Handler: Put KV
activate Handler
Handler->HdfsRegion: Add to buffer
activate HdfsRegion
HdfsRegion->Handler:
HdfsRegion-->HDFS: Asynchronous
deactivate HdfsRegion
Handler->Filter: Test eviction logic
activate Filter
Filter->Handler: True/False
deactivate Filter
Handler->OperationalData: Put in cache
activate OperationalData
OperationalData->Handler: Return V*
deactivate OperationalData
Handler->User: Old V*
deactivate Handler

...