Geode Network Configuration Best Practices

Introduction

Geode is a data management platform that provides real-time, consistent access to data-intensive applications throughout widely distributed cloud architectures.

Geode pools memory, CPU, network resources, and optionally local disk storage across multiple processes to manage application objects and behavior. It uses dynamic replication and data partitioning techniques to implement high availability, improved performance, scalability, and fault tolerance. In addition to being a distributed data container, Geode is an in-memory data management system that provides reliable asynchronous event notifications and guaranteed message delivery.

Due to Geode’s distributed nature, network resources can have a significant impact on system performance and availability. Geode is designed to be fault tolerant and to handle network disruptions gracefully. However, proper network design and tuning are essential to achieving optimum performance and High Availability with Geode.

Purpose

The purpose of this paper is to provide best practice recommendations for configuring the network resources in a Geode solution. The recommendations in this paper are not intended to provide a comprehensive, one-size-fits-all guide to network design and implementation. However, they should serve to provide a working foundation to help guide Geode implementations.

Scope

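This paper covers topics related to the design and configuration of network components. The following topics are covered:

Network architecture goals
Network Interface Card (NIC) selection and configuration
Switch configuration considerations
General network infrastructure considerations
TCP vs. UDP protocol considerations
Socket communications and socket buffer settings
TCP settings: congestion control, window scaling, etc.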

Audience

This paper assumes a basic knowledge and understanding of Geode, virtualization concepts and networking. Its primary audience consists of:

Architects: who can use this paper to inform key decisions and design choices surrounding a Geode solution
System Engineers and Administrators: who can use this paper as a guide for system configuration

Geode: A Quick Review

Overview

A Geode distributed system is comprised of members distributed over a network to provide in-memory speed along with high availability, scalability, and fault tolerance. Each member consists of a Java virtual machine (JVM) that hosts data and/or compute logic and is connected to other Geode members over a network. Members hosting data maintain a cache consisting of one or more Regions that can be replicated or partitioned across the distributed system. Compute logic is deployed to members as needed by adding the appropriate Java JAR files to the member’s class path.

Companies using Geode have:

Reduced risk analysis time from 6 hours to 20 minutes, allowing for record profits in the flash crash of 2008 that other firms were not able to monetize.
Improved end-user response time from 3 seconds to 50 ms, worth 8 figures a year in new revenue from a project delivered in fewer than 6 months.
Tracked assets in real time to coordinate all the right persons and machinery into the right place at the right time to take advantage of immediate high-value opportunities.
Created end-user reservation systems that handle over a billion requests daily with no downtime.

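Geode Communications

Geode members use a combination of TCP, UDP unicast and UDP multicast for communications between members. Geode members maintain constant communications with other members for the purposes of distributing data and managing the distributed system.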

...

Note: This paper focuses primarily on network configuration for the topology.

Client server topology

Geode Network Characteristics

Geode is a distributed, in-memory data platform designed to provide extreme performance and high levels of availability. In its most common deployment configurations, Geode makes extensive use of network resources for data distribution, system management and client request processing. As a result, network performance and reliability can have a significant impact on Geode.

To obtain optimal Geode performance, the network needs to exhibit the following characteristics.

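Low Latency

Latency refers to the various types of delay incurred in processing data across a network. These delays include:

Propagation delay – the time required for a signal to travel to its destination, which depends on the distance covered and the medium through which the signal passes. It ranges from nanoseconds to microseconds on local area networks (LANs) up to about 0.25 seconds in satellite communication systems.
Transmission delay – the time required to push all of a packet's bits onto the link, which is a function of the packet's length and the data rate of the link. For example, transmitting a 10 Mb file over a 1 Mbps link takes 10 seconds, while over a 100 Mbps link it takes only 0.1 seconds.
Processing delay – the time needed to process the packet header, check for bit-level errors and determine the packet's destination. In high-speed router environments processing delays are usually minimal. However, for networks that perform complex encryption or deep packet inspection, processing delays can be significant. In addition, routers performing Network Address Translation (NAT) have higher than normal processing delays because they must examine and modify both incoming and outgoing packets.
Queuing delay – the time packets spend waiting in routing queues. The practical reality of network design is that some queuing delay will occur. Effective queue management techniques are critical to protecting the experience of high-priority traffic.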
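
The transmission and propagation delays described above follow directly from simple ratios, so they can be estimated before a link is chosen. The sketch below is illustrative only: the function names are hypothetical, and the default signal speed of 2e8 m/s (roughly two thirds of the speed of light) is an assumption rather than a figure from this paper.

```python
def transmission_delay_s(size_bits: float, link_rate_bps: float) -> float:
    """Time to push every bit of a message onto the link."""
    if link_rate_bps <= 0:
        raise ValueError("link rate must be positive")
    return size_bits / link_rate_bps


def propagation_delay_s(distance_m: float, signal_speed_mps: float = 2.0e8) -> float:
    """Time for a signal to cover the distance.

    The default speed is an assumed value for copper or fiber,
    not a measured property of any particular network.
    """
    if signal_speed_mps <= 0:
        raise ValueError("signal speed must be positive")
    return distance_m / signal_speed_mps


MEGABIT = 1_000_000

# The 10 Mb file from the transmission delay example above.
print(transmission_delay_s(10 * MEGABIT, 1 * MEGABIT))    # 10.0 seconds
print(transmission_delay_s(10 * MEGABIT, 100 * MEGABIT))  # 0.1 seconds

# Propagation over an assumed 100 m LAN cable run, in seconds.
print(propagation_delay_s(100))
```

Processing and queuing delays depend on device load and configuration, so they are better measured than computed.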

Best Practices

It should be noted that latency, not bandwidth, is the most common performance bottleneck for network dependent systems like websites. Therefore, one of the key design goals in architecting a Geode solution is to minimize network latency. Best practices for achieving this goal include:

Keep Geode members and clients on the same LAN – Keep all members of a Geode distributed system and their clients on the same LAN and preferably on the same LAN segment. The goal is to place all Geode cluster members and clients in close proximity to each other on the network. This not only minimizes propagation delays, it also serves to minimize other delays resulting from routing and traffic management. Geode members are in constant communication and so even relatively small changes in network delays can multiply, impacting overall performance.
Use network traffic encryption prudently – Distributed systems like Geode generate high volumes of network traffic, including a fair amount of system management traffic. Encrypting network traffic between the members of a Geode cluster will add processing delays even when the traffic contains no sensitive data. As an alternative, consider encrypting only the sensitive data itself. Or, if it is necessary to restrict access to data on the wire between Geode members, consider placing the Geode members in a separate network security zone that cordons off the Geode cluster from other systems.
Use the fastest link possible – Although bandwidth alone does not determine throughput, all things being equal a higher speed link will transmit more data in the same amount of time than a slower one. Distributed systems like Geode move high volumes of traffic through the network and can benefit from having the highest speed link available. While some Geode customers with exacting performance requirements make use of InfiniBand network technology that is capable of link speeds up to 40Gbps, 10GbE is sufficient for most applications and is generally recommended for production and performance/system testing environments. For development environments and less critical applications, 1GbE is often sufficient.

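High Throughput

In addition to low latency, the network for a Geode system needs high throughput. ISPs and the FCC often use the terms 'bandwidth' and 'speed' interchangeably, although they are not the same thing. In fact, bandwidth is only one of many factors that affect speed. Therefore, to be more precise: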

Bandwidth describes the capacity of a network, usually expressed in bits per second. More specifically, bandwidth refers to the data transfer rate (bits/s) supported by a network connection or interface. Throughput is usually less than the full transmission capacity of the network. Throughput, or usable link bandwidth, may be affected by the following factors:

Protocol inefficiency – TCP is an adaptive protocol that seeks to balance the demands placed on network resources with the need to make efficient use of the underlying network infrastructure. TCP detects and responds to current network conditions using a variety of feedback mechanisms and algorithms. The mechanisms and algorithms have evolved over the years but the core principles remain the same:
++ All TCP connections begin with a three-way handshake that introduces latency and makes TCP connection creation expensive.
++ TCP slow-start is applied to every new connection by default. This means that connections can’t immediately use the full capacity of the link. The time required to reach a specific throughput target is a function of both the round trip time between the client and server and the initial congestion window size.
++ TCP flow control and congestion control regulate the throughput of all TCP connections.
++ TCP throughput is regulated by the current congestion window size.
Congestion – this occurs when a link or node is loaded to the point that its quality of service degrades. Typical effects include queuing delay, packet loss or blocking of new connections. As a result, an incremental increase in offered load on a congested network may result in an actual reduction in network throughput. In extreme cases, networks may experience a congestion collapse where reduced throughput continues well after the congestion-inducing load has been eliminated and renders the network unusable. This condition was first documented by John Nagle in 1984 and by 1986 had become a reality for the Department of Defense’s ARPANET – the precursor to the modern Internet and the world’s first operational packet-switched network. These incidents saw sustained reductions in capacity, in some cases capacity dropped by a factor of 1,000! Modern networks use flow control, congestion control and congestion avoidance techniques to avoid congestion collapse. These techniques include: exponential backoff, TCP Window reduction and fair queuing in devices like routers. Packet prioritization is another method used to minimize the effects of congestion.
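
To make the slow-start point above concrete, the following sketch estimates how long a new connection needs to reach a target throughput. It is an idealized model under stated assumptions: the congestion window doubles once per round trip, no packets are lost, the receive window is never the limit, and the segment size is 1460 bytes. The function names and defaults are hypothetical.

```python
import math


def round_trips_to_reach(target_segments: int, initial_cwnd: int) -> int:
    """Round trips until a window that doubles each RTT reaches the target."""
    if initial_cwnd <= 0:
        raise ValueError("initial congestion window must be positive")
    rounds = 0
    cwnd = initial_cwnd
    while cwnd < target_segments:
        cwnd *= 2  # idealized slow-start: the window doubles every round trip
        rounds += 1
    return rounds


def time_to_reach_throughput_s(target_bps: float, rtt_s: float,
                               initial_cwnd: int = 10,
                               mss_bytes: int = 1460) -> float:
    """Approximate time for a new connection to sustain target_bps."""
    # Bytes that must be in flight per round trip to sustain the target rate.
    window_bytes = target_bps * rtt_s / 8
    target_segments = math.ceil(window_bytes / mss_bytes)
    return round_trips_to_reach(target_segments, initial_cwnd) * rtt_s


# Assumed example: 1 Gbps target over a 1 ms round trip.
for cwnd in (4, 10):
    seconds = time_to_reach_throughput_s(1e9, 0.001, initial_cwnd=cwnd)
    print(f"initial cwnd {cwnd}: about {seconds * 1000:.0f} ms")
```

In this model the ramp-up time grows linearly with the round trip time and shrinks as the initial congestion window grows, which is why both appear in the best practices for throughput.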

Best Practices

Geode systems are often called upon to handle extremely high transaction volumes and as a consequence move large amounts of traffic through the network. As a result, one of the primary design goals in architecting a Geode solution is to maximize network throughput.

Best practices for achieving this goal include:

Increasing TCP’s Initial Congestion Window – A larger starting congestion window allows TCP to transfer more data in the first round trip and significantly accelerates the window growth, an especially critical optimization for bursty and short-lived connections.
Disabling TCP Slow-Start After Idle – Disabling slow-start after idle will improve performance of long-lived TCP connections, which transfer data in bursts.
Enabling Window Scaling (RFC 1323) – Enabling window scaling increases the maximum receive window size and allows high-latency connections to achieve better throughput.
Enabling TCP Low Latency – Enabling TCP Low Latency effectively tells the operating system to sacrifice throughput for lower latency. For latency sensitive workloads like Geode, this is an acceptable tradeoff that can improve performance.
Enabling TCP Fast Open – Enabling TCP Fast Open (TFO) allows application data to be sent in the initial SYN packet in certain situations. TFO is a new optimization, which requires support on both clients and servers and may not be available on all operating systems.
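
One way to check a Linux host against the practices above is to compare its sysctl configuration with a list of desired values. The sketch below parses sysctl.conf-style text and reports differences. The mapping of each practice to a sysctl key and value is an assumption made for illustration and is not taken from this paper; the helper names are hypothetical. The initial congestion window is omitted because on Linux it is normally set per route rather than through sysctl.

```python
# Assumed mapping of the practices above to Linux sysctl keys and values.
RECOMMENDED = {
    "net.ipv4.tcp_slow_start_after_idle": "0",  # disable slow-start after idle
    "net.ipv4.tcp_window_scaling": "1",         # enable window scaling
    "net.ipv4.tcp_low_latency": "1",            # prefer latency over throughput
    "net.ipv4.tcp_fastopen": "3",               # TFO for clients and servers
}


def parse_sysctl(text: str) -> dict:
    """Parse 'key = value' lines, skipping blanks and comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            # Collapse internal whitespace so "a  b" and "a b" compare equal.
            settings[key.strip()] = " ".join(value.split())
    return settings


def find_mismatches(text: str, recommended: dict = RECOMMENDED) -> dict:
    """Return {key: (current, wanted)} for every setting that differs."""
    current = parse_sysctl(text)
    return {
        key: (current.get(key), wanted)
        for key, wanted in recommended.items()
        if current.get(key) != wanted
    }


sample = """
# example configuration fragment
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_slow_start_after_idle=1
"""

for key, (have, want) in find_mismatches(sample).items():
    print(f"{key}: have {have}, want {want}")
```

A missing key is reported with a current value of None, which only means the file does not set it; the running kernel may still use a suitable default.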

Fault Tolerance

Another network characteristic that is key to optimal Geode performance is fault tolerance. Geode operations are dependent on network services and network failures can have a significant impact on Geode system operations and performance. While fault tolerant network design is beyond the scope of this paper, there are some important considerations to bear in mind when designing Geode Solutions. For the purposes of this paper, these considerations are organized along the lines of the Cisco Hierarchical Network Design Model as illustrated below.

Fault tolerance diagram

This model uses a layered approach to network design, representing the network as a set of scalable building blocks, or layers. In designing Geode systems, network fault tolerance considerations include:

Access layer redundancy – The access layer is the first point of entry into the network for edge devices and end stations such as Geode servers. For Geode systems, this network layer should have attributes that support high availability including:
++ Operating system high-availability features, such as Link Aggregation (EtherChannel or 802.3ad), which provide higher effective bandwidth and resilience while reducing complexity.
++ Default gateway redundancy using dual connections to redundant systems (distribution layer switches) that use Gateway Load Balancing Protocol (GLBP), Hot Standby Router Protocol (HSRP), or Virtual Router Redundancy Protocol (VRRP). This provides fast failover from one switch to the backup switch at the distribution layer.
++ Switch redundancy using some form of Split Multi-Link Trunking (SMLT). The use of SMLT not only allows traffic to be load-balanced across all the links in an aggregation group but also allows traffic to be redistributed very quickly in the event of link or switch failure. In general the failure of any one component results in a traffic disruption lasting less than half a second (normally less than 100 milliseconds).
Distribution layer redundancy – The distribution layer aggregates access layer nodes and creates a fault boundary providing a logical isolation point in the event of a failure in the access layer. High availability for this layer comes from dual equal-cost paths from the distribution layer to the core and from the access layer to the distribution layer. This network layer is usually designed for high availability and doesn’t typically require changes for Geode systems.
Core layer redundancy – The core layer serves as the backbone for the network. The core needs to be fast and extremely resilient because everything depends on it for connectivity. This network layer is typically built as a high-speed, Layer 3 switching environment using only hardware-accelerated services and redundant point-to-point Layer 3 interconnections in the core. This layer is designed for high availability and doesn’t typically require changes for Geode systems.

Best Practices

Geode systems depend on network services and network failures can have a significant impact on Geode operations and performance. As a result, network fault tolerance is an important design goal for Geode solutions. Best practices for achieving this goal include:

Use Mode 6 Network Interface Card (NIC) Bonding – NIC bonding involves combining multiple network connections in parallel in order to increase throughput and provide redundancy should one of the links fail. Linux supports several modes of link aggregation, including the following six:
++ Mode 1 (active-backup): only one slave in the bond is active. A different slave becomes active if and only if the active slave fails.
++ Mode 2 (balance-xor): a slave is selected to transmit based on a simple XOR calculation that determines which slave to use. This mode provides both load balancing and fault tolerance.
++ Mode 3 (broadcast): this mode transmits everything on all slave interfaces. This mode provides fault tolerance.
++ Mode 4 (IEEE 802.3ad): this mode creates aggregation groups that share the same speed and duplex settings and utilizes all slaves in the active aggregator according to the 802.3ad specification.
++ Mode 5 (balance-tlb): this mode distributes outgoing traffic according to the load on each slave. One slave receives incoming traffic. If that slave fails, another slave takes over the MAC address of the failed receiving slave.
++ Mode 6 (balance-alb): this mode includes balance-tlb plus receive load balancing (rlb) for IPv4 traffic, and does not require any special switch support. The receive load balancing is achieved by ARP negotiation. The bonding driver intercepts the ARP Replies sent by the local system on their way out and overwrites the source hardware address with the unique hardware address of one of the slaves in the bond such that different peers use different hardware addresses for the server.

For Geode systems, Mode 6 is recommended. Mode 6 NIC Bonding (Adaptive Load Balancing) provides both link aggregation and fault tolerance. Mode 1 only provides fault tolerance while modes 2, 3 and 4 require that the link aggregation group reside on the same logical switch and this could introduce a single point of failure when the physical switch to which the links are connected goes offline.

Use SMLT for switch redundancy – the Split Multi-link Trunking (SMLT) protocol allows multiple Ethernet links to be split across multiple switches in a stack, preventing any single point of failure, and allowing switches to be load balanced across multiple aggregation switches from the single access stack. SMLT provides enhanced resiliency with sub-second failover and sub-second recovery for all speed trunks while operating transparently to end-devices. This allows for the creation of Active load sharing high availability network designs that meet five nines availability requirements.

Geode Network Settings

To achieve the goals of low latency, high throughput and fault tolerance, network settings in the operating system and Geode will need to be configured appropriately. The following sections outline recommended settings.

IPv4 vs. IPv6

By default, Geode uses Internet Protocol version 4 (IPv4). Testing with Geode has shown that IPv4 provides better performance than IPv6. Therefore, the general recommendation is to use IPv4 with Geode. However, Geode can be configured to use IPv6 if desired. If IPv6 is used, make sure that all Geode processes use IPv6. Do not mix IPv4 and IPv6 addresses.

Note: To use IPv6 for Geode addresses, set the following Java property: java.net.preferIPv6Addresses=true

TCP vs. UDP

Geode supports the use of both TCP and UDP for communications. Depending on the size and nature of the Geode system as well as the types of regions employed, either TCP or UDP may be more appropriate.

TCP Communications

TCP (Transmission Control Protocol) provides reliable in-order delivery of system messages. Geode uses TCP by default for inter-cache point-to-point messaging. TCP is generally more appropriate than UDP in the following situations:

Partitioned Data – For distributed systems that make extensive use of partitioned regions, TCP is generally a better choice as TCP provides more reliable communications and better performance than UDP.
Smaller Distributed Systems – TCP is preferable to UDP unicast in smaller distributed systems because it implements more reliable communications at the operating system level than UDP and its performance can be substantially faster than UDP.
Unpredictable Network Loads – TCP provides higher levels of fault tolerance and reliability than UDP. While Geode implements retransmission protocols to ensure proper delivery of messages over UDP, it cannot fully compensate for heavy congestion and unpredictable spikes in network loading.

Note: Geode always uses TCP communications in member failure detection. In this situation, Geode will attempt to establish a TCP/IP connection with the suspect member in order to determine if the member has failed.

UDP Communications

UDP (User Datagram Protocol) is a connectionless protocol, which uses far fewer resources than TCP. However, UDP has some important limitations that should be factored into a design, namely:

64K byte message size limit (including overhead for message headers)
Markedly slower performance on congested networks
Limited reliability (Geode compensates through retransmission protocols)
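
The size limit above can be checked with simple arithmetic before choosing UDP for a workload. The sketch below assumes IPv4 with no IP options, where the largest UDP payload is 65,535 bytes minus a 20-byte IP header and an 8-byte UDP header. The 100-byte allowance for message headers is an assumed figure for illustration, and the function name is hypothetical.

```python
IPV4_MAX_PACKET = 65_535   # largest IPv4 packet, in bytes
IPV4_HEADER = 20           # IPv4 header without options
UDP_HEADER = 8             # fixed UDP header size
MAX_UDP_PAYLOAD = IPV4_MAX_PACKET - IPV4_HEADER - UDP_HEADER  # 65507


def fits_in_one_datagram(payload_bytes: int,
                         message_header_bytes: int = 100) -> bool:
    """True if payload plus an assumed header allowance fits one datagram."""
    return payload_bytes + message_header_bytes <= MAX_UDP_PAYLOAD


print(MAX_UDP_PAYLOAD)                 # 65507
print(fits_in_one_datagram(60_000))    # True
print(fits_in_one_datagram(65_500))    # False
```

Messages that fail this check have to be split across datagrams or sent over TCP, which is one reason large objects tend to favor TCP.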

If a Geode system can operate within the limitations of UDP, then it may be a more appropriate choice than TCP in the following situations:

Replicated Data – In systems where most or all of the members use the same replicated regions, UDP multicast may be the most appropriate choice. UDP multicast provides an efficient means of distributing all events for a region. However, when multicast is enabled for a region, all processes in the distributed system receive all events for the region. Therefore, multicast is only suitable when most or all members have the region defined and the members are interested in most or all of the events for the region.

Note: Even when UDP multicast is used for a region, Geode will send unicast messages in some situations. Also, partitioned regions will use UDP unicast for almost all purposes.

Larger Distributed Systems

As the size of a distributed system increases, the relatively small overhead of UDP makes it the better choice. TCP adds new threads and sockets to every member, causing more overhead as the system grows.

Note: To configure Geode to use UDP for inter-cache point-to-point messaging, set the following Geode property: disable-tcp=true

TCP Settings

The following sections provide guidance on TCP settings recommended for Geode.

Geode Settings for TCP/IP Communications

Socket Buffer Size – In determining buffer size settings, the goal is to strike a balance between communication needs and other processing. Larger socket buffers allow Geode members to distribute data and events more quickly, but also reduce the memory available for other tasks. In some cases, particularly when storing very large data objects, finding the right socket buffer size can become critical to system performance.

Ideally, socket buffers should be large enough for the distribution of any single data object. This will avoid message fragmentation, which lowers performance. The socket buffers should be at least as large as the largest stored objects with their keys plus some overhead for message headers; 100 bytes should be sufficient.

If possible, the TCP/IP socket buffer settings should match across the Geode installation. At a minimum, follow the guidelines listed below.

++ Peer-to-peer. The socket-buffer-size setting in gemfire.properties should be the same throughout the distributed system.
++ Client/server. The client’s pool socket-buffer-size should match the setting for the servers that the pool uses.
++ Server. The server socket-buffer-size in the server’s cache configuration (e.g. cache.xml file) should match the values defined for the server’s clients.
++ Multisite (WAN). If the link between sites isn’t optimized for throughput, it can cause messages to back up in the queues. If a receiving queue buffer overflows, it will get out of sync with the sender and the receiver won’t know it. A gateway sender's socket-buffer-size should match the gateway receiver’s socket-buffer-size for all receivers that the sender connects to.

Note: OS TCP buffer size limits must be large enough to accommodate Geode socket buffer settings. If not, the Geode value will be set to the OS limit, not the requested value.
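
The sizing guidance in this section reduces to two small calculations: a lower bound for the socket buffer, and the value that actually takes effect once the operating system limit is applied. The sketch below is a minimal illustration with hypothetical function names. The object, key and OS limit sizes in the example are assumed values; only the 100-byte header allowance and the 64MB limit come from this paper.

```python
def min_socket_buffer_size(largest_value_bytes: int,
                           largest_key_bytes: int,
                           header_overhead_bytes: int = 100) -> int:
    """Smallest buffer that holds the largest entry plus message headers."""
    return largest_value_bytes + largest_key_bytes + header_overhead_bytes


def effective_buffer_size(requested_bytes: int, os_limit_bytes: int) -> int:
    """The OS limit caps the requested size."""
    return min(requested_bytes, os_limit_bytes)


# Assumed example: 1 MiB values with 64-byte keys.
requested = min_socket_buffer_size(1_048_576, 64)
print(requested)  # 1048740

# Compare an assumed small OS limit with the 64MB limit suggested for 10GbE.
for os_limit in (212_992, 67_108_864):
    effective = effective_buffer_size(requested, os_limit)
    status = "ok" if effective == requested else "capped by the OS limit"
    print(f"limit {os_limit}: effective {effective} ({status})")
```

When the result is capped, either the OS limit needs to be raised or the objects will be split across several buffer fills.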

TCP/IP Keep Alive

Geode supports TCP KeepAlive to prevent socket connections from being timed out.

The gemfire.enableTcpKeepAlive system property prevents connections that appear idle from being timed out (for example, by a firewall.) When configured to true, Geode enables the SO_KEEPALIVE option for individual sockets. This operating system-level setting allows the socket to send verification checks (ACK requests) to remote systems in order to determine whether or not to keep the socket connection alive.

Note: The time intervals for sending the first ACK KeepAlive request, the subsequent ACK requests and the number of requests to send before closing the socket is configured on the operating system level. See

By default, this system property is set to true.
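
Because the keepalive timing is controlled by the operating system, it helps to know how the three values combine. The sketch below is a simplified model with a hypothetical function name: it assumes the peer stops responding entirely, so the socket is closed after the idle period plus one interval for each unanswered probe.

```python
def keepalive_detection_time_s(idle_s: int, interval_s: int, probes: int) -> int:
    """Seconds from the last traffic until an unresponsive peer is dropped."""
    return idle_s + interval_s * probes


# Linux defaults: 2 hours idle, 75 s between probes, 9 probes.
print(keepalive_detection_time_s(7200, 75, 9))  # 7875

# Values recommended for Linux in this paper: 600 s idle, 30 s interval, 5 probes.
print(keepalive_detection_time_s(600, 30, 5))   # 750
```

With the recommended values the probe phase alone lasts 150 seconds, which matches the 2.5 minutes quoted for retry attempts in the Linux settings.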

TCP/IP Peer-to-Peer Handshake Timeouts

This property governs the amount of time a peer will wait to complete the TCP/IP handshake process. You can change the connection handshake timeouts for TCP/IP connections with the system property p2p.handshakeTimeoutMs.

The default setting is 59,000 milliseconds (59 seconds).

This sets the handshake timeout to 75,000 milliseconds for a Java application:

-Dp2p.handshakeTimeoutMs=75000

The properties are passed to the cache server on the gfsh command line:

 gfsh>start server --name=server1 --J=-Dp2p.handshakeTimeoutMs=75000

Linux Settings for TCP/IP Communications

The following table summarizes the recommended TCP/IP settings for Linux. These settings are in the /etc/sysctl.conf file.

Setting设置	Recommended Value推荐值	Rationale基本原理
net.core.netdev_max_backlog	30000	Set maximum number of packets, queued on the INPUT side, when the interface receives packets faster than kernel can process them. Recommended setting is for 10GbE links. For 1GbE links use 设置包的最大数, 在输入端进行排队, 当接口接收包比内核处理更快时. 推荐设置为10GbE 链路. 对于1GbE 链路使用 8000.
net.core.wmem_max	67108864Set	max to 对于 1GbE 链路, 设置最大数为 16MB (16777216) for 1GbE links and , 而对于10GbE链路为 64MB (67108864) for 10GbE links.
net.core.rmem_max	67108864Set	max to 对于 1GbE 链路, 设置最大数为 16MB (16777216) for 1GbE links and , 而对于10GbE链路为 64MB (67108864) for 10GbE links.
net.ipv4.tcp_congestion_control	htcp	There seem to be bugs in both bic and cubic (the default) for a number of versions of the Linux kernel up to version 这看起来是 bugs 在 bic 和 cubic 上(默认) , 对于 Linux 内核上到版本 2.6.33. The kernel version for Redhat 5.x is 内核版本是 2.6.18-x and , Redhat 6.x内核版本是 2.6.32-x for Redhat 6.x
net.ipv4.tcp_congestion_window	10	This is the default for Linux operating systems based on 默认情况下, Linux OS 是基于 Linux kernel 2.6.39 or later或以上版本.
net.ipv4.tcp_fin_timeout	10This	setting determines the time that must elapse before TCP/IP can release a closed connection and reuse its resources. During this 此设置确定了在 TCP/IP 释放一个关闭连接和重用资源之前时间必须超时.在这个 TIME_WAIT state, reopening the connection to the client costs less than establishing a new connection. By reducing the value of this entry状态中, 重新打开到客户端的连接的成本低于建立新连接的成本. 通过减少条目值, TCP/IP can release closed connections faster, making more resources available for new connections. The default value is 60. The recommened setting lowers its to 10. You can lower this even further, but too low, and you can run into socket close errors in networks with lots of jitter能够更快地释放关闭的连接, 对于新的连接让更多的资源可用. 默认值是 60. 推荐设置较低, 为10. 你能够进一步拉低这个值, 如果这个值太低, 将会在网络中得到 socket close errors , 并带有大量的抖动.
net.ipv4.tcp_keepalive_intvl	30	Determines the wait time between keepalive (isAlive) probes. The default value is 75. The recommended value reduces this in keeping with the reduction of the overall keepalive time.
net.ipv4.tcp_keepalive_probes	5	How many keepalive probes to send before the socket is timed out. The default value is 9. The recommended value reduces this to 5 so that retry attempts take 2.5 minutes.
net.ipv4.tcp_keepalive_time	600	Sets the TCP socket timeout value to 10 minutes instead of the 2-hour default. With an idle socket, the system waits tcp_keepalive_time seconds and then tries tcp_keepalive_probes times to send a TCP KEEPALIVE, at intervals of tcp_keepalive_intvl seconds. If the retry attempts fail, the socket times out.
net.ipv4.tcp_low_latency	1	Configures TCP for low latency, favoring low latency over throughput.
net.ipv4.tcp_max_orphans	16384	Limits the number of orphaned sockets; each orphan can consume up to 16M (max wmem) of unswappable memory.
net.ipv4.tcp_max_tw_buckets	1440000	Maximum number of time-wait sockets held by the system simultaneously. If this number is exceeded, the time-wait socket is immediately destroyed and a warning is printed. This limit exists to help prevent simple DoS attacks.
net.ipv4.tcp_no_metrics_save	1	Disables caching of TCP metrics on connection close.
net.ipv4.tcp_orphan_retries	0	Controls how many times the kernel retries before giving up on a TCP connection that has been closed locally (an orphan).
net.ipv4.tcp_rfc1337	1	Enables a fix for RFC1337, time-wait assassination hazards in TCP.
net.ipv4.tcp_rmem	10240 131072 33554432	Setting is min/default/max. Recommend increasing the Linux autotuning TCP buffer limit to 32MB.
net.ipv4.tcp_wmem	10240 131072 33554432	Setting is min/default/max. Recommend increasing the Linux autotuning TCP buffer limit to 32MB.
net.ipv4.tcp_sack	1	Enables selective acknowledgments.
net.ipv4.tcp_slow_start_after_idle	0	By default, TCP starts with a single small segment and gradually increases it by one each time. This results in unnecessary slowness that impacts the start of every request. Setting this to 0 disables the return to slow start after a connection has been idle.
net.ipv4.tcp_syncookies	0	Many default Linux installations use SYN cookies to protect the system against malicious attacks that flood TCP SYN packets. The use of SYN cookies dramatically reduces network bandwidth, and can be triggered by a running Geode cluster. If your Geode cluster is otherwise protected against such attacks, disable SYN cookies to ensure that Geode network throughput is not affected. NOTE: if SYN floods are an issue and SYN cookies can't be disabled, try the following: net.ipv4.tcp_max_syn_backlog="16384" net.ipv4.tcp_synack_retries="1" net.ipv4.tcp_max_orphans="400000"
net.ipv4.tcp_timestamps	1	Enables timestamps as defined in RFC1323.
net.ipv4.tcp_tw_recycle	1	Enables fast recycling of TIME_WAIT sockets. The default value is 0 (disabled). Use with caution with load balancers.
net.ipv4.tcp_tw_reuse	1	Allows reusing sockets in TIME_WAIT state for new connections when it is safe from a protocol viewpoint. The default value is 0 (disabled). It is generally a safer alternative to tcp_tw_recycle. The tcp_tw_reuse setting is particularly useful in environments where numerous short connections are opened and left in TIME_WAIT state, such as web servers and load balancers.
net.ipv4.tcp_window_scaling	1	Turns on window scaling, which can be used to enlarge the transfer window.
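
Two details in the table are easy to misread: the three keepalive settings combine into a single detection time, and three of the values depend on link speed. The sketch below works both out in shell. The function names are hypothetical helpers written for this illustration, not part of Geode or Linux, and the detection-time formula assumes the usual reading of the keepalive settings (idle time, then one interval per unanswered probe).

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of Geode or Linux.

# Worst-case time (seconds) before an idle, dead peer is detected:
# idle time before the first probe, plus one interval per unanswered probe.
keepalive_detect_seconds() {
    keepalive_time=$1
    probe_interval=$2
    probe_count=$3
    echo $(( keepalive_time + probe_interval * probe_count ))
}

# Print the speed-dependent sysctl.conf lines for a 1GbE or 10GbE link,
# using the values given in the table.
speed_dependent_settings() {
    case "$1" in
        1)  backlog=8000;  bufmax=16777216 ;;
        10) backlog=30000; bufmax=67108864 ;;
        *)  echo "unsupported link speed: $1" >&2; return 1 ;;
    esac
    printf 'net.core.netdev_max_backlog = %s\n' "$backlog"
    printf 'net.core.wmem_max = %s\n' "$bufmax"
    printf 'net.core.rmem_max = %s\n' "$bufmax"
}

# Recommended values: 600 s idle + 5 probes x 30 s = 750 s (12.5 minutes).
keepalive_detect_seconds 600 30 5
# Defaults quoted in the table: 7200 s idle + 9 probes x 75 s = 7875 s.
keepalive_detect_seconds 7200 75 9
# Lines to place in /etc/sysctl.conf for a 10GbE link.
speed_dependent_settings 10
```

With the recommended values, a dead idle peer should be noticed after roughly 12.5 minutes rather than a little over two hours with the defaults.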

In addition, increasing the size of the transmit queue can also help TCP throughput. Add the following command to /etc/rc.local to accomplish this.

/sbin/ifconfig eth0 txqueuelen 10000

NOTE: substitute the appropriate adapter name for eth0 in the above example.
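
Because the adapter name varies from host to host, it can help to generate the line for /etc/rc.local rather than type it by hand. The following is a minimal sketch: txqueuelen_cmd is a hypothetical helper written for this illustration, and it only prints the command so that it can be reviewed before it is added.

```shell
#!/bin/sh
# Sketch only: txqueuelen_cmd is a hypothetical helper, not a system tool.
# It prints the command instead of running it, so the adapter name
# can be checked before the line goes into /etc/rc.local.
txqueuelen_cmd() {
    adapter=$1
    queue_len=${2:-10000}   # 10000 is the value used in the example above
    case "$adapter" in
        # Reject empty names and names with unexpected characters.
        ''|*[!A-Za-z0-9._:-]*) echo "invalid adapter name" >&2; return 1 ;;
    esac
    printf '/sbin/ifconfig %s txqueuelen %s\n' "$adapter" "$queue_len"
}

txqueuelen_cmd eth1
txqueuelen_cmd bond0 10000
```

On systems that ship iproute2 instead of ifconfig, `ip link set dev eth1 txqueuelen 10000` makes the same change, and the current value can be read from /sys/class/net/eth1/tx_queue_len.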
