...
- Distribute/Partition/Replicate the NN functionality across multiple computers
- Read-only replicas of the name node
- What is the ratio of reads to writes? - get data from Simon
- Note: RO replicas can be useful for the HA solution and for checkpoint rolling
- Partition by function (also scales namespace and addressable storage space)
- E.g. move block management and processing to a slave NN.
- E.g. move replica management to a slave NN.
- Partition by name space - i.e. different parts of the name space are handled by different NNs (see below)
- this helps in scaling both the performance of the NN and the name space itself
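The read-only-replica idea above amounts to steering read RPCs to replicas while all mutations go to the primary. A minimal sketch of such client-side routing, with entirely hypothetical endpoint names and operation names (this is not an actual HDFS client API):

```python
import itertools

class NameNodeRouter:
    """Toy router: read operations round-robin across read-only NN
    replicas; writes always go to the primary. Names are illustrative."""

    READ_OPS = {"getFileInfo", "getBlockLocations", "listStatus"}

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replica_cycle = itertools.cycle(replicas) if replicas else None

    def route(self, op):
        if op in self.READ_OPS and self._replica_cycle:
            return next(self._replica_cycle)
        return self.primary  # writes and unknown ops go to the primary

router = NameNodeRouter("nn-primary:8020", ["nn-ro-1:8020", "nn-ro-2:8020"])
assert router.route("mkdirs") == "nn-primary:8020"
assert router.route("getFileInfo") in {"nn-ro-1:8020", "nn-ro-2:8020"}
```

The higher the read-to-write ratio (the "Rs to Ws" question above), the more load this offloads from the primary; stale reads are the price, which is why such replicas also suit checkpoint rolling.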
- RPC and Timeout issues
- When load spikes occur, clients time out and a spiral of death ensues (timeouts trigger retries, which add yet more load)
- See Hadoop Protocol RPC
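One standard way to damp the retry spiral described above is exponential backoff with jitter on the client side, so that a load spike does not produce a synchronized retry storm. A small sketch (parameter names and defaults are illustrative, not HDFS configuration):

```python
import random

def retry_delays(base=0.2, cap=30.0, attempts=6, rng=random.random):
    """Exponential backoff with full jitter: the k-th retry waits a
    random amount in [0, min(cap, base * 2**k)) seconds, spreading
    client retries out in time instead of stacking them."""
    return [rng() * min(cap, base * (2 ** a)) for a in range(attempts)]

delays = retry_delays()
assert len(delays) == 6
assert all(0 <= d < 30.0 for d in delays)
```

Jitter matters more than the exponent here: without it, all clients that timed out together retry together.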
- Higher concurrency in Namespace access (more sophisticated Namespace locking)
- This is probably an issue only on NN restart, not during normal operation
- Improving concurrency is hard since it will require redesign and testing
- Better to do this when NN is being redesigned for other reasons.
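To make the "more sophisticated Namespace locking" idea concrete: one direction is to replace the single namespace-wide lock with a lock per subtree. A toy sketch, assuming (hypothetically) one lock per top-level directory; the hard part the text alludes to - operations spanning partitions, such as a cross-directory rename - is exactly what this sketch does not solve:

```python
import threading

class PartitionedNamespaceLock:
    """Sketch of coarser-than-global locking: one lock per top-level
    directory instead of one lock for the whole namespace, so ops under
    /user and /tmp can proceed concurrently. Cross-partition operations
    would still need a separate ordering protocol."""

    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()  # protects the lock table itself

    def lock_for(self, path):
        top = "/" + path.strip("/").split("/", 1)[0]
        with self._guard:
            return self._locks.setdefault(top, threading.Lock())

ns = PartitionedNamespaceLock()
assert ns.lock_for("/user/simon/a") is ns.lock_for("/user/other/b")
assert ns.lock_for("/user/x") is not ns.lock_for("/tmp/x")
```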
- Journaling and Sync
- *Benefits*: improved latency, better client utilization, fewer timeouts, greater throughput
- Improve Remote syncs
- Approach 1 - NVRAM-backed NFS file system - investigate this
- Approach 2 - If flush on NFS pushes the data to the NFS server, this may be good enough if there is a local sync - investigate
- Lazy syncs - need to investigate the benefit and cost (latency)
- Delay the reply by a few milliseconds to allow for more bunching of syncs
- This increases latency
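The "delay the reply to bunch syncs" idea is essentially group commit. A minimal sketch, assuming a hypothetical journal wrapper (none of these names are HDFS classes): writers append to an in-memory batch and block; one sync covers the whole batch before all waiting writers are released, trading a few milliseconds of latency for throughput.

```python
import threading

class GroupCommitJournal:
    """Sketch of group commit: append_and_wait blocks until a single
    sync call has made the caller's entry (and everything batched with
    it) durable. One sync now covers many operations."""

    def __init__(self, sync_fn, delay=0.005):
        self._sync_fn = sync_fn   # e.g. write + fsync of the batch
        self._delay = delay       # max extra latency accepted per reply
        self._cond = threading.Condition()
        self._pending = []
        self._seq = 0             # entries appended so far
        self._synced = 0          # entries made durable so far

    def append_and_wait(self, entry):
        with self._cond:
            self._pending.append(entry)
            self._seq += 1
            my_seq = self._seq
            while self._synced < my_seq:
                self._cond.wait(timeout=self._delay)
                if self._synced < my_seq:
                    self._flush_locked()

    def _flush_locked(self):
        batch, self._pending = self._pending, []
        self._sync_fn(batch)             # one sync for the whole batch
        self._synced += len(batch)
        self._cond.notify_all()          # release every batched writer

log = []
journal = GroupCommitJournal(log.append, delay=0.001)
journal.append_and_wait("mkdir /a")
assert log == [["mkdir /a"]]
```

Under concurrent load, entries arriving within the delay window share one sync; with a single writer the delay is the only cost, which is the latency increase noted above.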
- NVRAM for journal
- Async syncs [No!!!]
- reply as soon as memory is updated
- This changes semantics
- If it is good enough for Unix then isn't it good enough for HDFS?
- For a single machine, its failure implies failure of the client and the fs *together*
- In a distributed file system there is partial failure; furthermore, one expects an HA'ed NN not to lose data
- Move more functionality to data node
- Distributed replica creation - not simple
- Improve Block report processing HADOOP-2448
- Currently: each DN sends a full BR, as an array of longs, every hour; 2K nodes mean a block report arriving every 3 sec. The initial BR has a random backoff (configurable)
- Incremental and Event based B-reports - HADOOP-1079
- E.g. when a disk is lost, or blocks are deleted, etc.
- The DN can determine what, if anything, has changed and send a report only if there are changes
- Send only checksums
- NN recalculates the checksum, OR has rolling checksum
- Make the initial block report's random backoff dynamically settable by the NN when DNs register - HADOOP-2444
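The "send only checksums" idea above can be sketched in a few lines: the DN keeps a digest of its block list and sends the full report only when the digest differs from what the NN last acknowledged. This is an illustrative sketch, not the HADOOP-1079 design; function names are made up.

```python
import hashlib

def block_report_digest(block_ids):
    """Digest over the sorted block-id list; order-independent for a
    given set of blocks."""
    h = hashlib.sha256()
    for b in sorted(block_ids):
        h.update(b.to_bytes(8, "big"))
    return h.hexdigest()

def maybe_send_report(block_ids, last_acked_digest):
    """Return (digest, full_report). full_report is None when nothing
    changed, so only the small checksum crosses the wire."""
    d = block_report_digest(block_ids)
    if d == last_acked_digest:
        return d, None                 # unchanged: digest only
    return d, sorted(block_ids)        # changed: send the full report

d1, full = maybe_send_report({101, 102, 103}, last_acked_digest=None)
assert full == [101, 102, 103]
d2, full = maybe_send_report({101, 102, 103}, last_acked_digest=d1)
assert full is None
```

On the NN side this pairs with either recomputing the checksum from its own view or maintaining a rolling checksum, as noted above.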
...
- Statically Partition the namespace hierarchically and mount the volumes
- In this scheme, there are multiple namespace volumes in a cluster.
- All the name space volumes share the physical block storage (i.e. One storage pool)
- Optionally, all namespaces (i.e. volumes) are mounted at the top level using an automounter-like approach
- A namespace can be explicitly mounted onto a node in another namespace (a la mount in POSIX)
- Note: the Ceph file system [ref] partitions the namespace automatically and mounts the partitions
- A truly distributed name service that partitions the namespace dynamically.
- Only keep part of the namespace in memory.
- This is like a traditional file system where the entire namespace is stored in secondary storage and paged in as needed
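The static-partitioning scheme above boils down to a client-side mount table: path prefixes map to namespace volumes (each served by its own NN) over one shared block-storage pool, with longest-prefix match picking the volume. A toy resolver, with hypothetical volume names:

```python
class MountTable:
    """Sketch of a client-side mount table for statically partitioned
    namespace volumes. Longest-prefix match decides which NN serves a
    given path; all volumes share the same block storage pool."""

    def __init__(self, mounts):
        # longest prefix first, so /user/project wins over /user
        self._mounts = sorted(mounts.items(), key=lambda kv: -len(kv[0]))

    def resolve(self, path):
        for prefix, volume in self._mounts:
            if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
                return volume
        raise KeyError(path)

mt = MountTable({"/user": "nn1", "/tmp": "nn2", "/user/project": "nn3"})
assert mt.resolve("/user/alice/f") == "nn1"
assert mt.resolve("/user/project/x") == "nn3"
```

Each volume's NN scales independently; operations crossing a mount boundary (e.g. rename) are the awkward case this sketch ignores.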
- Reduce accidental space growth - name space quotas
...