Overview

Geode is a main-memory-based data management solution. Given its shared-nothing architecture, each member node is self-sufficient and independent: each node manages its data set in memory and on local disk. While shared-nothing reduces contention across the system, it also has limitations. In many in-memory use cases, a small fraction of the data is accessed frequently, and keeping this subset in memory improves application performance. However, infrequent analytical jobs may need to access the full data set. Full data scans can evict critical data from memory and can increase response latency by contending for network resources. Ideally, analytical jobs should not impact the online application, which calls for an alternative access method for analytical workloads.

A persistence store based on HDFS could address some of the issues discussed above. HDFS provides an economical, reliable, scalable and high-performance storage layer, and has become synonymous with scalable storage. Together with the Hadoop ecosystem, it has transformed the way data is shared and analyzed by various data management systems. Today each Geode member node persists its data set on local disk, and this data is not shared with any other node or system. We plan to add HDFS as a configurable persistence layer for Geode Regions. Geode can depend on HDFS for scalable persistence and offer new capabilities by sharing data with HDFS-enabled systems. In return, HDFS can benefit from Geode's record-level caching (think kilobyte-sized records) and its ability to micro-ingest data efficiently and reliably.

...

Region data persisted on HDFS could be accessed directly from HDFS without impacting cluster performance.

Like the Geode DiskStore, we introduce HdfsStore as a means to persist region data on HDFS. This document discusses the HdfsStore management API and behavior. We assume the reader is familiar with

...

Regions, Members, gfsh etc.

...

Goals

Geode provides the ability to cache key-value (KV) sets in memory. For big-data use cases, it is assumed that the entire data set cannot be managed in memory, so Geode will provide a configurable KV retention/eviction policy. The data in memory is available for fast querying and is referred to as Operational Data; the operational data set typically consists of recently accessed records. Whenever a key lookup misses in the operational data, Geode will execute the lookup on HDFS and add the result to the operational data set if it meets the retention criteria. Used this way, Geode provides reliable, fast and easy access to HDFS data.
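
The read-through behavior described above can be summarized with a minimal sketch. The hdfsRead and retentionCriteria hooks below are hypothetical stand-ins for the HDFS tier and the retention/eviction policy; they are illustrative only and not part of the Geode API.

Java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiPredicate;
import java.util.function.Function;

// Illustrative only: a minimal read-through sketch of the proposed lookup behavior.
public class HdfsReadThroughSketch<K, V> {

  private final Map<K, V> operationalData = new ConcurrentHashMap<>(); // in-memory subset
  private final Function<K, V> hdfsRead;              // looks a key up in the HDFS tier (assumed hook)
  private final BiPredicate<K, V> retentionCriteria;  // decides what stays in memory (assumed hook)

  public HdfsReadThroughSketch(Function<K, V> hdfsRead, BiPredicate<K, V> retentionCriteria) {
    this.hdfsRead = hdfsRead;
    this.retentionCriteria = retentionCriteria;
  }

  public V get(K key) {
    V value = operationalData.get(key);
    if (value == null) {
      // Miss in operational data: fall back to the HDFS tier.
      value = hdfsRead.apply(key);
      if (value != null && retentionCriteria.test(key, value)) {
        // Promote into the operational data set only if it meets the retention criteria.
        operationalData.put(key, value);
      }
    }
    return value;
  }
}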

HDFS Data Tier

Geode collects all write operations and persists them on HDFS. These records are never evicted from HDFS unless deleted by the user, so a full record of all data collected by Geode is present on HDFS, referred to as the HDFS Tier. The update log managed on HDFS is similar to the oplog (operational log) maintained on local disk. The data on HDFS will be visible "externally", e.g. readable from a MapReduce job or Hive query, so data managed by Geode can be used for analytics. At any instant, operational data is a subset of HDFS data.

  1. Offline Geode data access, using a custom InputFormat designed for Geode (see the sketch after this list).
  2. Support all features using region iterators on HDFS regions.
  3. Use Geode for data ingestion while caching recent data in memory.
  4. HDFS data loader.
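
Goal 1 implies that persisted region files can be scanned by Hadoop jobs without going through Geode. The driver below is a rough sketch of what such a job might look like, assuming a hypothetical GeodeHdfsInputFormat (part of this proposal, not an existing class) that emits region keys and values as Text; the property name and key/value types are assumptions.

Java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import java.io.IOException;

// Hypothetical sketch of offline access to Geode data on HDFS from a MapReduce job.
public class OfflineRegionScan {

  // Copies each persisted region entry to the job output for downstream analysis.
  public static class EntryMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(key, value);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("geode.region.hdfs.homedir", args[0]); // assumed property: the region's HDFS home dir

    Job job = Job.getInstance(conf, "offline-geode-region-scan");
    job.setJarByClass(OfflineRegionScan.class);
    job.setInputFormatClass(GeodeHdfsInputFormat.class); // hypothetical Geode-provided InputFormat
    job.setMapperClass(EntryMapper.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}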

Anti-Goals

  1. Secondary indexes on data archived on HDFS.
  2. Eviction logic will be based on LRU.
  3. Eviction may remove keys from memory, so features that depend on all keys being in memory may not work.

Design

  1. Each write operation will be streamed to the operational data store (in-memory region) and the HDFS buffers simultaneously. In general, data flow to HDFS and to Geode Regions will be independent of each other.

  2. Each new/updated record will go through the eviction logic test. Existing data will be re-checked on a need basis (e.g. on a heap-limit trigger) or as configured by the user. The activity diagram below illustrates this flow, and a write-path sketch follows it.
PlantUML
(*) --> [PUT] Handler
--> ===B1=== 
 
--> Buffer
--> HDFS
 
===B1=== --> "Filter" 
--> OperationalData
--> Scheduler
--> "Filter"
 
OperationalData ..> [cache miss] HDFS
 

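To make design point 1 concrete, below is a minimal sketch of the dual write path, assuming a hypothetical in-process queue standing in for the HDFS buffer and a retention-filter hook. The class and field names are illustrative only; the actual implementation batches and dispatches buffer data asynchronously.

Java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.BiPredicate;

// Illustrative sketch of the dual write path: every put is appended to an HDFS-bound
// queue and, independently, applied to the in-memory operational data if it passes
// the retention filter.
public class DualWritePathSketch<K, V> {

  /** A record of a single write operation, as it would be batched to HDFS. */
  public static final class WriteEvent<K, V> {
    final K key;
    final V value;
    WriteEvent(K key, V value) { this.key = key; this.value = value; }
  }

  private final Map<K, V> operationalData = new ConcurrentHashMap<>();
  private final BlockingQueue<WriteEvent<K, V>> hdfsBuffer = new LinkedBlockingQueue<>();
  private final BiPredicate<K, V> retentionCriteria; // assumed hook for the eviction logic test

  public DualWritePathSketch(BiPredicate<K, V> retentionCriteria) {
    this.retentionCriteria = retentionCriteria;
  }

  public V put(K key, V value) {
    // 1. Always enqueue for the HDFS tier; a dispatcher thread flushes batches later.
    hdfsBuffer.offer(new WriteEvent<>(key, value));

    // 2. Independently, update the operational (in-memory) data if retained.
    if (retentionCriteria.test(key, value)) {
      return operationalData.put(key, value);
    }
    // Not retained in memory: drop any stale in-memory copy and return it.
    return operationalData.remove(key);
  }
}
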
...

Put KV

PlantUML
participant User
participant HdfsRegion
participant HdfsBuffer
participant HDFS
User->HdfsRegion: Put KV
activate HdfsRegion
HdfsRegion->HdfsBuffer: Get old value
activate HdfsBuffer
HdfsBuffer->HDFS: Read
HdfsBuffer-->HdfsRegion: Old V*
HdfsRegion->HdfsBuffer: Write New V
HdfsBuffer->HdfsRegion:
HdfsBuffer-->HDFS: Asynchronous write
deactivate HdfsBuffer
HdfsRegion->User: Return old V*
deactivate HdfsRegion
HdfsRegion-->HdfsRegion: Asynchronous LRU Eviction
 

Get K

PlantUML
participant User
participant HdfsRegion
participant HdfsBuffer
participant HDFS

User->HdfsRegion: Get K
activate HdfsRegion
alt if V not in memory
HdfsRegion->HdfsBuffer: Get K
activate HdfsBuffer
HdfsBuffer->HDFS: Read
HdfsBuffer->HdfsRegion:
deactivate HdfsBuffer
end
HdfsRegion->User: Return V
deactivate HdfsRegion
HdfsRegion-->HdfsRegion: Asynchronous LRU Eviction

HDFS Store

HDFS stores provide a means of persisting data on HDFS. There can be multiple instances of HDFS stores in a cluster. A user will normally perform the following steps to enable HDFS persistence for a region (a hypothetical sketch follows the list):

  1. [Optional] Creates a DiskStore for reliability. HDFS buffer data will use local persistence until it is persisted on HDFS.
  2. Creates an HDFS Store.
  3. Creates a Region connected to the HDFS Store.
  4. Uses the Region API to create and query data.
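
The sketch below walks through these four steps in Java. The HDFSStore-related calls (createHDFSStoreFactory, setNameNodeURL, setHomeDir, setDiskStoreName, setHDFSStoreName) are placeholders for the API proposed in this document, not a finalized interface; gfsh equivalents would follow the same structure.

Java
import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;
import org.apache.geode.cache.Region;
import org.apache.geode.cache.RegionShortcut;

// Hypothetical sketch of the four setup steps for HDFS persistence.
public class HdfsStoreSetupSketch {
  public static void main(String[] args) {
    Cache cache = new CacheFactory().create();

    // Step 1 (optional): a DiskStore used to persist HDFS buffer data locally
    // until it reaches HDFS.
    cache.createDiskStoreFactory()
        .create("bufferDiskStore");

    // Step 2: create the HDFS store (proposed API, shown for illustration).
    cache.createHDFSStoreFactory()                     // hypothetical factory
        .setNameNodeURL("hdfs://namenode:8020")        // HDFS cluster to persist to
        .setHomeDir("geode/orders")                    // relative home directory on HDFS
        .setDiskStoreName("bufferDiskStore")           // local persistence for buffers
        .create("ordersHdfsStore");

    // Step 3: create a region connected to the HDFS store.
    Region<String, String> orders = cache
        .<String, String>createRegionFactory(RegionShortcut.PARTITION)
        .setHDFSStoreName("ordersHdfsStore")           // hypothetical region attribute
        .create("orders");

    // Step 4: use the normal Region API; writes flow to memory and to HDFS.
    orders.put("o-1001", "{\"item\":\"book\",\"qty\":2}");
    String value = orders.get("o-1001");
    System.out.println(value);
  }
}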

HDFS Region Types

All the data coming into an HDFS region is written to HDFS asynchronously, in batches. Data is managed in HDFS in one of two ways, depending on whether it will be queried by Geode or not.

Read-Write (RW) HDFS region

Geode will try to fetch KV from HDFS files for a RW region if the requested KV is not found in memory. To make reads efficient, KVs in a batch are sorted and augmented with indexes and bloom filters.

Write-Only (WO) HDFS region

HDFS files for a WO region will never be read by Geode. The data events are "archived" on HDFS for offline analysis. A user can create a WO region to ingest small batches of user data in an HDFS-friendly way.
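
The difference between the two region types could surface as a single region attribute. The snippet below is a hypothetical illustration only; setHDFSStoreName and setHDFSWriteOnly are assumed attribute names for the proposed API.

Java
import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;
import org.apache.geode.cache.RegionShortcut;

// Hypothetical: choosing between a read-write and a write-only HDFS region.
public class RegionTypeSketch {
  public static void main(String[] args) {
    Cache cache = new CacheFactory().create();

    // Read-write region: misses in memory fall back to reads from HDFS files.
    cache.<String, String>createRegionFactory(RegionShortcut.PARTITION)
        .setHDFSStoreName("ordersHdfsStore")           // hypothetical region attribute
        .create("events");

    // Write-only region: data is only archived to HDFS, never read back by Geode.
    cache.<String, String>createRegionFactory(RegionShortcut.PARTITION)
        .setHDFSStoreName("ordersHdfsStore")
        .setHDFSWriteOnly(true)                        // hypothetical flag marking the region write-only
        .create("eventArchive");
  }
}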

Key attributes of a HDFS Store

 

  • Name (Alterable: no): A unique identifier for the HDFSStore.

  • NameNodeURL (Alterable: no): The HDFSStore persists data on an HDFS cluster identified by the cluster's NameNode URL or NameNode Service URL. The NameNode URL can also be provided via hdfs-site.xml (see HDFSClientConfigFile). If the NameNode URL is missing, HDFSStore creation will fail. The HDFS client can also load HDFS configuration files found in the classpath; a NameNode URL provided that way works as well.

  • HomeDir (Alterable: no): The HDFS directory path in which the HDFSStore stores files. The value must not contain the NameNode URL. The owner of this node's JVM process must have read and write access to this directory. The path can be absolute or relative. If a relative path is provided, the HomeDir is created relative to /user/JVM_owner_name or, if specified, relative to the directory specified by the hdfs-root-dir property. As a best practice, HDFS store directories should be created relative to a single HDFS root directory; alternatively, an absolute path beginning with "/" can be provided to override the default root location.

  • HDFSClientConfigFile (Alterable: no): The full path to the HDFS client configuration file, e.g. hdfs-site.xml or core-site.xml. This file must be accessible on any node where an instance of this HDFSStore will be created. If each node has a local copy of this configuration file, it is important that all copies be identical. Alternatively, the HDFS client can by default load HDFS configuration files found in the classpath.

  • MaxMemory (Alterable: yes): The maximum amount of memory in megabytes used by the HDFSStore. The HDFSStore buffers data in memory to optimize HDFS I/O operations. Once the configured memory is utilized, data may overflow to disk.

  • ReadCacheSize (Alterable: no): The maximum amount of memory in megabytes used by the HDFSStore read cache. The HDFSStore can cache data in memory to optimize HDFS I/O operations. The read cache shares the memory allocated to the HDFSStore. Increasing read-cache memory can improve read performance.

  • BatchSize (Alterable: yes): HDFSStore buffer data is persisted on HDFS in batches; BatchSize defines the maximum size (in megabytes) of each batch that is written to HDFS. This parameter, along with BatchInterval, determines the rate at which data is persisted on HDFS.

  • BatchInterval (Alterable: yes): HDFSStore buffer data is persisted on HDFS in batches; BatchInterval defines the maximum time that can elapse between writing batches to HDFS. This parameter, along with BatchSize, determines the rate at which data is persisted on HDFS.

  • DispatcherThreads (Alterable: no): The maximum number of threads (per region) used to write batches to HDFS. If a large number of clients add or update data in a region, the number of dispatcher threads may need to be increased to avoid bottlenecks when writing data to HDFS.

  • BufferPersistent (Alterable: no): Configures whether HDFSStore in-memory buffer data that has not yet been persisted on HDFS should be persisted to a local disk to prevent buffer data loss. Persisting data may impact write performance. If performance is critical and buffer data loss is acceptable, disable persistence.

  • DiskStore (Alterable: no): The named DiskStore to use for any local disk persistence needs of the HDFSStore, e.g. buffer persistence and buffer overflow. If a value is specified, the named DiskStore must exist. If a null value is specified or this option is omitted, the default DiskStore is used.

  • SynchronousDiskWrite (Alterable: no): Enables or disables synchronous writes to the local DiskStore.

  • PurgeInterval (Alterable: yes): The HDFSStore creates new files as part of periodic maintenance activity, and existing files are deleted asynchronously. PurgeInterval defines the amount of time old files remain available for MapReduce jobs; after this interval has passed, old files are deleted.

  • MajorCompaction (Alterable: yes): Major compaction removes old values of a key and deleted records from the HDFS files, which can save space in HDFS and improve performance when reading from HDFS. Because the major compaction process can be long-running and I/O-intensive, its performance is tuned using MajorCompactionInterval and MajorCompactionThreads.

  • MajorCompactionInterval (Alterable: yes): The amount of time after which the HDFSStore performs the next major compaction cycle.

  • MajorCompactionThreads (Alterable: yes): The maximum number of threads that the HDFSStore uses to perform major compaction. The number of threads used for compactions on different buckets can be increased as necessary to fully utilize the HDFS cluster and its disks.

  • MinorCompaction (Alterable: yes): Minor compaction reorganizes data in files to optimize read performance and reduce the number of files created on HDFS. Because the minor compaction process can be I/O-intensive, its performance is tuned using MinorCompactionThreads.

  • MinorCompactionThreads (Alterable: yes): The maximum number of threads that the HDFSStore uses to perform minor compaction. The number of threads used for compactions on different buckets can be increased as necessary to fully utilize the HDFS cluster and its disks.

  • MaxWriteOnlyFileSize (Alterable: yes): For HDFS write-only regions, this defines the maximum size (in megabytes) that an HDFS log file can reach before the HDFSStore closes the file and begins writing to a new one. This setting is ignored for HDFS read/write regions. Keep in mind that files are not available for MapReduce processing until they are closed; WriteOnlyFileRolloverInterval can also be set to specify the maximum amount of time an HDFS log file remains open.

  • WriteOnlyFileRolloverInterval (Alterable: yes): For HDFS write-only regions, this defines the maximum time that can elapse before the HDFSStore closes an HDFS file and begins writing to a new one. This setting is ignored for HDFS read/write regions.

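To make a few of these attributes concrete, the sketch below configures batching, buffer persistence and compaction on the proposed store factory. As before, createHDFSStoreFactory and the setter names mirror the attribute names above but are placeholders for the proposed API, and the interval units are assumptions; only attributes marked "Alterable: yes" would be changeable on a live store.

Java
import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;

// Hypothetical sketch: configuring HDFSStore attributes from the table above.
public class HdfsStoreTuningSketch {
  public static void main(String[] args) {
    Cache cache = new CacheFactory().create();

    cache.createHDFSStoreFactory()              // hypothetical factory
        .setNameNodeURL("hdfs://namenode:8020")
        .setHomeDir("geode/clickstream")
        .setBatchSize(32)                       // flush batches of up to 32 MB
        .setBatchInterval(60000)                // or at least once a minute (assumed unit: ms)
        .setBufferPersistent(true)              // spill unflushed buffers to the local DiskStore
        .setDiskStoreName("bufferDiskStore")
        .setMinorCompaction(true)               // keep the number of HDFS files in check
        .setMajorCompactionInterval(720)        // assumed unit: minutes
        .setPurgeInterval(30)                   // assumed unit: minutes old files stay readable
        .create("clickstreamHdfsStore");
  }
}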
 

 
