
Geode Network Configuration Best Practices

Introduction

Geode is a data management platform that provides real-time, consistent access to data-intensive applications throughout widely distributed cloud architectures. Geode pools memory, CPU, network resources, and optionally disk storage across multiple processes to manage application objects and behavior. It uses dynamic replication and data partitioning techniques to implement high availability, improved performance, scalability, and fault tolerance. In addition to being a distributed data container, Geode is an in-memory data management system that provides reliable asynchronous event notifications and guaranteed message delivery.

Due to Geode’s distributed nature, network resources can have a significant impact on system performance and availability. Geode is designed to be fault tolerant and to handle network disruptions gracefully. However, proper network design and tuning are essential to achieving optimum performance and High Availability with Geode.

Purpose

The purpose of this paper is to provide best practice recommendations for configuring the network resources in a Geode solution. The recommendations in this paper are not intended to provide a comprehensive, one-size-fits-all guide to network design and implementation. However, they should serve to provide a working foundation to help guide Geode implementations.

Scope

The topics in this paper relate to the design and configuration of network components. The following topics are covered:

  • Network architecture goals
  • NIC selection and configuration
  • Switch configuration considerations
  • General network infrastructure considerations
  • TCP vs. UDP protocol considerations
  • Socket communications and socket buffer settings
  • TCP settings, congestion control, window scaling, etc.

Audience

This paper assumes a basic knowledge and understanding of Geode, virtualization concepts and networking. Its primary audience consists of:

  • Architects: who can use this paper to inform key decisions and design choices surrounding a Geode solution
  • System Engineers and Administrators: who can use this paper as a guide for system configuration

Geode: A Quick Review

Overview

A Geode distributed system is comprised of members distributed over a network to provide in-memory speed along with high availability, scalability, and fault tolerance. Each member consists of a Java virtual machine (JVM) that hosts data and/or compute logic and is connected to other Geode members over a network. Members hosting data maintain a cache consisting of one or more Regions that can be replicated or partitioned across the distributed system. Compute logic is deployed to members as needed by adding the appropriate Java JAR files to the member’s class path.

Companies using Geode have:

  • Reduced risk analysis time from 6 hours to 20 minutes, allowing for record profits in the flash crash of 2008 that other firms were not able to monetize.
  • Improved end-user response time from 3 seconds to 50 ms, worth 8 figures a year in new revenue from a project delivered in fewer than 6 months.
  • Tracked assets in real time to coordinate all the right persons and machinery into the right place at the right time to take advantage of immediate high-value opportunities.
  • Created end-user reservation systems that handle over a billion requests daily with no downtime.

Geode Communications

Geode members use a combination of TCP, UDP unicast, and UDP multicast for communications among members. Members maintain constant communication with one another in order to distribute data and manage the distributed system.

Member Discovery Communications

Peer member discovery is what defines a distributed system. All applications and cache servers that use the same discovery configuration find each other in the same distributed system. Each system member has a unique ID and knows the IDs of the other members. A member can belong to only one distributed system. Once members have discovered each other, they communicate directly, independent of the discovery mechanism. For peer discovery, Geode uses a membership coordinator to manage members joining and leaving the distributed system. There are two main discovery options: UDP multicast and Geode locators.

  • UDP/IP Multicast – new members broadcast their connection information over a multicast address and port to all running members. Existing members respond to establish communication with the new member.
  • Geode Locators Using TCP/IP – peer locators manage a dynamic list of distributed system members. New members connect to a locator to retrieve the member list and then join the distributed system.
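As a rough sketch, locator-based discovery can be set up as follows (the member names, host, and port are examples, not requirements):

```shell
# Locator-based discovery: start a locator, then point new members at it
gfsh start locator --name=locator1 --port=10334
gfsh start server --name=server1 --locators=localhost[10334]

# Multicast discovery instead uses these gemfire.properties entries
# (the address and port shown are example values):
#   mcast-address=239.192.81.1
#   mcast-port=10334
```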

General Messaging and Region Operation Communications

Geode supports TCP, UDP unicast, and UDP multicast for distributing messages and performing region operations. The default is TCP; however, Geode can also be configured to use UDP.

Geode Topologies

Geode members can be configured in a variety of topologies to provide a highly flexible solution. The following sections summarize these topologies.

Peer-to-Peer Topology

The peer-to-peer topology is the most basic Geode deployment. In this configuration, each member communicates directly with every other member in the distributed system. New members broadcast their connection information to all running members, and existing members respond to the new member's request to establish a connection. A typical example of this configuration is an application server cluster in which each application instance is co-located with a Geode server in the same JVM. This configuration is illustrated below.

Peer to peer image

Client-Server Topology

The client-server topology is the most common configuration for Geode installations. In this configuration, applications communicate with Geode servers through a Geode client. The Geode client consists of code that runs in the same process as the application. The client defines a connection pool that manages connections to the Geode servers and can also provide a local cache to manage data for the application. When a new Geode server starts, it contacts a locator to join the distributed system and is added to the membership view. In a Geode system, locators coordinate membership for the entire system and provide load balancing for Geode client requests. This configuration is illustrated below.

Note: this paper focuses primarily on the network configuration for this topology.

Client server topology

Geode Network Characteristics

Geode is a distributed, in-memory data platform designed to provide extreme performance and high levels of availability. In its most common deployment configurations, Geode makes extensive use of network resources for data distribution, system management and client request processing. As a result, network performance and reliability can have a significant impact on Geode.

To obtain optimal Geode performance, the network needs to exhibit the following characteristics.

Low Latency

Latency encompasses the various kinds of delays incurred in moving data across a network. These delays include:

  • Propagation delay – this is a function of the distance the data must travel across the network to reach its destination and the medium through which the signal travels. Propagation delays range from nanoseconds to microseconds on local area networks (LANs) up to roughly 0.25 seconds on satellite communication systems.
  • Transmission delay – this is the time required to push all of a packet's bits onto the link, and is a function of the packet's length and the link's data rate. For example, transmitting a 10 Mb file over a 1 Mbps link takes 10 seconds, while the same file over a 100 Mbps link takes only 0.1 seconds.
  • Processing delay – this is the time required to process a packet's header, check for bit-level errors, and determine the packet's destination. In high-speed routing environments, processing delay is typically minimal. However, on networks performing complex encryption or deep packet inspection, processing delays can be significant. In addition, routers performing NAT have higher-than-normal processing delays because they must examine and modify both incoming and outgoing packets.
  • Queuing delay – this is the time packets spend waiting in routing queues. The practical reality of network design is that some queuing delay will occur. Effective queue management techniques are critical to ensuring that high-priority traffic experiences the service levels it requires.

Best Practices

It should be noted that latency, not bandwidth, is the most common performance bottleneck for network dependent systems like websites. Therefore, one of the key design goals in architecting a Geode solution is to minimize network latency. Best practices for achieving this goal include:

  • Keep Geode members and clients on the same LAN Keep all members of a Geode distributed system and their clients on the same LAN and preferably on the same LAN segment. The goal is to place all Geode cluster members and clients in close proximity to each other on the network. This not only minimizes propagation delays, it also serves to minimize other delays resulting from routing and traffic management. Geode members are in constant communication and so even relatively small changes in network delays can multiply, impacting overall performance.
  • Use network traffic encryption prudently Distributed systems like Geode generate high volumes of network traffic, including a fair amount of system management traffic. Encrypting network traffic between the members of a Geode cluster will add processing delays even when the traffic contains no sensitive data. As an alternative, consider encrypting only the sensitive data itself. Or, if it is necessary to restrict access to data on the wire between Geode members, consider placing the Geode members in a separate network security zone that cordons off the Geode cluster from other systems.
  • Use the fastest link possible Although bandwidth alone does not determine throughput - all things being equal, a higher speed link will transmit more data in the same amount of time than a slower one. Distributed systems like Geode move high volumes of traffic through the network and can benefit from having the highest speed link available. While some Geode customers with exacting performance requirements make use of InfiniBand network technology that is capable of link speeds up to 40Gbps, 10GbE is sufficient for most applications and is generally recommended for production and performance/system testing environments. For development environments and less critical applications, 1GbE is often sufficient.
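As a quick sanity check of network proximity, round-trip latency between prospective Geode hosts can be measured with standard tools (the host name below is an example):

```shell
# Measure round-trip time between two would-be cluster members;
# on a single LAN segment, averages should be well under a millisecond
ping -c 10 geode-server-1

# Show the path packets take; every extra hop adds propagation,
# processing, and queuing delay
traceroute geode-server-1
```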

High Throughput

In addition to low latency, the network underlying a Geode system needs to have high throughput. ISPs and the FCC often use the terms 'bandwidth' and 'speed' interchangeably although they are not the same thing. In fact, bandwidth is only one of several factors that affect the perceived speed of a network. Therefore, it is more accurate to say that bandwidth describes a network’s capacity, most often expressed in bits per second. Specifically, bandwidth refers to the data transfer rate (in bits per second) supported by a network connection or interface. Throughput, on the other hand, can often be significantly less than the network’s full capacity. Throughput, the useable link bandwidth, may be impacted by a number of factors including:

  • Protocol inefficiency – TCP is an adaptive protocol that seeks to balance the demands placed on network resources from all network peers while making efficient use of the underlying network infrastructure. TCP detects and responds to current network conditions using a variety of feedback mechanisms and algorithms. These mechanisms and algorithms have evolved over the years, but the core principles remain the same:
      - All TCP connections begin with a three-way handshake that introduces latency and makes TCP connection creation expensive.
      - TCP slow-start is applied to every new connection by default, which means that connections can't immediately use the full capacity of the link. The time required to reach a specific throughput target is a function of both the round-trip time between the client and server and the initial congestion window size.
      - TCP flow control and congestion control regulate the throughput of all TCP connections.
      - TCP throughput is regulated by the current congestion window size.
  • Congestion – this occurs when a link or node is loaded to the point that its quality of service degrades. Typical effects include queuing delay, packet loss, or blocking of new connections. As a result, an incremental increase in offered load on a congested network may result in an actual reduction in network throughput. In extreme cases, networks may experience a congestion collapse, where reduced throughput continues well after the congestion-inducing load has been eliminated, rendering the network unusable. This condition was first documented by John Nagle in 1984, and by 1986 it had become a reality for the Department of Defense's ARPANET – the precursor to the modern Internet and the world's first operational packet-switched network. These incidents saw sustained reductions in capacity, in some cases by a factor of 1,000. Modern networks use flow control, congestion control, and congestion avoidance techniques to avoid congestion collapse. These techniques include exponential backoff, TCP window reduction, and fair queuing in devices like routers. Packet prioritization is another method used to minimize the effects of congestion.
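To illustrate the slow-start behavior described above, the following sketch estimates how many round trips an idealized connection needs before its congestion window reaches a target size, assuming the window simply doubles every RTT (real implementations are more nuanced, and the window values below are examples):

```shell
#!/bin/sh
# Idealized TCP slow-start: the congestion window (cwnd) doubles each
# round trip until it reaches the target size.
init_cwnd=10     # initial congestion window, in segments (example)
target_cwnd=640  # window needed to sustain the desired throughput (example)
cwnd=$init_cwnd
rtts=0
while [ "$cwnd" -lt "$target_cwnd" ]; do
  cwnd=$((cwnd * 2))
  rtts=$((rtts + 1))
done
echo "RTTs to reach target window: $rtts"   # -> RTTs to reach target window: 6
```

With a 10-segment initial window, roughly six round trips are needed to reach a 640-segment window, which is why a larger initial congestion window and a low RTT both matter so much for short-lived connections.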

Best Practices

Geode systems are often called upon to handle extremely high transaction volumes and as a consequence move large amounts of traffic through the network. As a result, one of the primary design goals in architecting a Geode solution is to maximize network throughput.

Best practices for achieving this goal include:

  • Increasing TCP’s Initial Congestion Window A larger starting congestion window allows TCP to transfer more data in the first round trip and significantly accelerates window growth – an especially critical optimization for bursty and short-lived connections.
  • Disabling TCP Slow-Start After Idle Disabling slow-start after idle will improve the performance of long-lived TCP connections that transfer data in bursts.
  • Enabling Window Scaling (RFC 1323) Enabling window scaling increases the maximum receive window size and allows high-latency connections to achieve better throughput.
  • Enabling TCP Low Latency Enabling TCP Low Latency effectively tells the operating system to sacrifice throughput for lower latency. For latency-sensitive workloads like Geode, this is an acceptable tradeoff that can improve performance.
  • Enabling TCP Fast Open Enabling TCP Fast Open (TFO) allows application data to be sent in the initial SYN packet in certain situations. TFO is a newer optimization that requires support on both clients and servers and may not be available on all operating systems.
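On Linux, these optimizations can be sketched as follows (the device name, gateway address, and initcwnd value are examples; TFO support depends on the kernel version):

```shell
# Raise the initial congestion window on the default route
# (device and gateway below are examples)
ip route change default via 192.168.1.1 dev eth0 initcwnd 10

# Disable slow-start after idle for long-lived connections
sysctl -w net.ipv4.tcp_slow_start_after_idle=0

# Enable window scaling (RFC 1323) and low-latency mode
sysctl -w net.ipv4.tcp_window_scaling=1
sysctl -w net.ipv4.tcp_low_latency=1

# Enable TCP Fast Open for both clients and servers (3 = both),
# where the kernel supports it
sysctl -w net.ipv4.tcp_fastopen=3
```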

Fault Tolerance

Another network characteristic that is key to optimal Geode performance is fault tolerance. Geode operations are dependent on network services and network failures can have a significant impact on Geode system operations and performance. While fault tolerant network design is beyond the scope of this paper, there are some important considerations to bear in mind when designing Geode Solutions. For the purposes of this paper, these considerations are organized along the lines of the Cisco Hierarchical Network Design Model as illustrated below.

Fault tolerance diagram

This model uses a layered approach to network design, representing the network as a set of scalable building blocks, or layers. In designing Geode systems, network fault tolerance considerations include:

  • Access layer redundancy – The access layer is the first point of entry into the network for edge devices and end stations such as Geode servers. For Geode systems, this network layer should have attributes that support high availability, including:
      - Operating system high-availability features, such as Link Aggregation (EtherChannel or 802.3ad), which provide higher effective bandwidth and resilience while reducing complexity.
      - Default gateway redundancy using dual connections to redundant systems (distribution layer switches) that use Gateway Load Balancing Protocol (GLBP), Hot Standby Router Protocol (HSRP), or Virtual Router Redundancy Protocol (VRRP). This provides fast failover from one switch to the backup switch at the distribution layer.
      - Switch redundancy using some form of Split Multi-Link Trunking (SMLT). The use of SMLT not only allows traffic to be load-balanced across all the links in an aggregation group but also allows traffic to be redistributed very quickly in the event of link or switch failure. In general, the failure of any one component results in a traffic disruption lasting less than half a second (normally less than 100 milliseconds).
  • Distribution layer redundancy – The distribution layer aggregates access layer nodes and creates a fault boundary providing a logical isolation point in the event of a failure in the access layer. High availability for this layer comes from dual equal-cost paths from the distribution layer to the core and from the access layer to the distribution layer. This network layer is usually designed for high availability and doesn’t typically require changes for Geode systems.
  • Core layer redundancy – The core layer serves as the backbone for the network. The core needs to be fast and extremely resilient because everything depends on it for connectivity. This network layer is typically built as a high-speed, Layer 3 switching environment using only hardware-accelerated services and redundant point-to-point Layer 3 interconnections in the core. This layer is designed for high availability and doesn’t typically require changes for Geode systems.

最佳实践

Geode systems depend on network services and network failures can have a significant impact on Geode operations and performance. As a result, network fault tolerance is an important design goal for Geode solutions. Best practices for achieving this goal include:

  • Use Mode 6 Network Interface Card (NIC) Bonding – NIC bonding involves combining multiple network connections in parallel in order to increase throughput and provide redundancy should one of the links fail. Linux supports six modes of link aggregation:
      - Mode 1 (active-backup): only one slave in the bond is active. A different slave becomes active if, and only if, the active slave fails.
      - Mode 2 (balance-xor): a slave is selected to transmit based on a simple XOR calculation that determines which slave to use. This mode provides both load balancing and fault tolerance.
      - Mode 3 (broadcast): transmits everything on all slave interfaces. This mode provides fault tolerance.
      - Mode 4 (IEEE 802.3ad): creates aggregation groups that share the same speed and duplex settings and utilizes all slaves in the active aggregator according to the 802.3ad specification.
      - Mode 5 (balance-tlb): distributes outgoing traffic according to the load on each slave. One slave receives incoming traffic. If that slave fails, another slave takes over the MAC address of the failed receiving slave.
      - Mode 6 (balance-alb): includes balance-tlb plus receive load balancing (rlb) for IPv4 traffic, and does not require any special switch support. The receive load balancing is achieved by ARP negotiation. The bonding driver intercepts ARP replies sent by the local system on their way out and overwrites the source hardware address with the unique hardware address of one of the slaves in the bond, so that different peers use different hardware addresses for the server.

For Geode systems, Mode 6 is recommended. Mode 6 NIC Bonding (Adaptive Load Balancing) provides both link aggregation and fault tolerance. Mode 1 only provides fault tolerance while modes 2, 3 and 4 require that the link aggregation group reside on the same logical switch and this could introduce a single point of failure when the physical switch to which the links are connected goes offline.
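A minimal mode 6 bonding sketch for a Red Hat-style system (the interface name, addresses, and file path are examples; the exact mechanism varies by distribution and kernel version):

```shell
# /etc/sysconfig/network-scripts/ifcfg-bond0 (example values)
#   DEVICE=bond0
#   IPADDR=192.168.1.10
#   NETMASK=255.255.255.0
#   ONBOOT=yes
#   BOOTPROTO=none
#   BONDING_OPTS="mode=6 miimon=100"

# Once the interface is up, verify the active bonding mode
cat /proc/net/bonding/bond0
```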

  • Use SMLT for switch redundancy – the Split Multi-link Trunking (SMLT) protocol allows multiple Ethernet links to be split across multiple switches in a stack, preventing any single point of failure, and allowing switches to be load balanced across multiple aggregation switches from the single access stack. SMLT provides enhanced resiliency with sub-second failover and sub-second recovery for all speed trunks while operating transparently to end-devices. This allows for the creation of Active load sharing high availability network designs that meet five nines availability requirements.

Geode Network Settings

To achieve the goals of low latency, high throughput and fault tolerance, network settings in the operating system and Geode will need to be configured appropriately. The following sections outline recommended settings.

IPv4 vs. IPv6

By default, Geode uses Internet Protocol version 4 (IPv4). Testing with Geode has shown that IPv4 provides better performance than IPv6. Therefore, the general recommendation is to use IPv4 with Geode. However, Geode can be configured to use IPv6 if desired. If IPv6 is used, make sure that all Geode processes use IPv6. Do not mix IPv4 and IPv6 addresses.

Note: to use IPv6 for Geode addresses, set the following Java property: java.net.preferIPv6Addresses=true
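Following the gfsh convention used later in this paper, the property could be passed to a cache server like this (the member name is an example):

```shell
gfsh>start server --name=server1 --J=-Djava.net.preferIPv6Addresses=true
```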

TCP vs. UDP

Geode supports the use of both TCP and UDP for communications. Depending on the size and nature of the Geode system as well as the types of regions employed, either TCP or UDP may be more appropriate.

TCP Communications

TCP (Transmission Control Protocol) provides reliable in-order delivery of system messages. Geode uses TCP by default for inter-cache point-to-point messaging. TCP is generally more appropriate than UDP in the following situations:

  • Partitioned Data For distributed systems that make extensive use of partitioned regions, TCP is generally a better choice, as it provides more reliable communications and better performance than UDP.
  • Smaller Distributed Systems TCP is preferable to UDP unicast in smaller distributed systems because it implements more reliable communications at the operating system level than UDP and its performance can be substantially faster than UDP.
  • Unpredictable Network Loads TCP provides higher levels of fault tolerance and reliability than UDP. While Geode implements retransmission protocols to ensure proper delivery of messages over UDP, it cannot fully compensate for heavy congestion and unpredictable spikes in network loading.

Note: Geode always uses TCP communications in member failure detection. In this situation, Geode will attempt to establish a TCP/IP connection with the suspect member in order to determine if the member has failed.

UDP Communications

UDP (User Datagram Protocol) is a connectionless protocol, which uses far fewer resources than TCP. However, UDP has some important limitations that should be factored into a design, namely:

  • 64K byte message size limit (including overhead for message headers)
  • Markedly slower performance on congested networks
  • Limited reliability (Geode compensates through retransmission protocols)

If a Geode system can operate within the limitations of UDP, then it may be a more appropriate choice than TCP in the following situations:

  • Replicated Data In systems where most or all of the members use the same replicated regions, UDP multicast may be the most appropriate choice. UDP multicast provides an efficient means of distributing all events for a region. However, when multicast is enabled for a region, all processes in the distributed system receive all events for the region. Therefore, multicast is only suitable when most or all members have the region defined and the members are interested in most or all of the events for the region.

Note: Even when UDP multicast is used for a region, Geode will send unicast messages in some situations. Also, partitioned regions will use UDP unicast for almost all purposes.

  • Larger Distributed Systems As the size of a distributed system increases, the relatively small overhead of UDP makes it the better choice. TCP adds new threads and sockets for every member, causing more overhead as the system grows.

Note: to configure Geode to use UDP for inter-cache point-to-point messaging set the following Geode property: disable-tcp=true
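A sketch of the relevant gemfire.properties entries (the multicast address and port shown are example values):

```shell
# gemfire.properties
disable-tcp=true            # use UDP unicast for point-to-point messaging

# To distribute region events over UDP multicast, also configure:
mcast-address=239.192.81.1  # example multicast address
mcast-port=10334            # example multicast port
```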

TCP Settings

The following sections provide guidance on TCP settings recommended for Geode.

Geode TCP/IP Communication Settings

  • Socket Buffer Size In determining buffer size settings, the goal is to strike a balance between communication needs and other processing. Larger socket buffers allow Geode members to distribute data and events more quickly, but also reduce the memory available for other tasks. In some cases, particularly when storing very large data objects, finding the right socket buffer size can become critical to system performance.

Ideally, socket buffers should be large enough for the distribution of any single data object. This will avoid message fragmentation, which lowers performance. The socket buffers should be at least as large as the largest stored objects with their keys plus some overhead for message headers - 100 bytes should be sufficient.

If possible, the TCP/IP socket buffer settings should match across the Geode installation. At a minimum, follow the guidelines listed below.

  • Peer-to-peer: the socket-buffer-size setting in gemfire.properties should be the same throughout the distributed system.
  • Client/server: the client’s pool socket-buffer-size should match the setting for the servers that the pool uses.
  • Server: the server socket-buffer size in the server’s cache configuration (e.g. the cache.xml file) should match the values defined for the server’s clients.
  • Multisite (WAN): if the link between sites isn’t optimized for throughput, messages can back up in the queues. If a receiving queue buffer overflows, it will get out of sync with the sender and the receiver won’t know it. A gateway sender's socket-buffer-size should match the gateway receiver’s socket-buffer-size for all receivers that the sender connects to.

Note: OS TCP buffer size limits must be large enough to accommodate Geode socket buffer settings. If not, the Geode value will be set to the OS limit – not the requested value.
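As an illustration of matching buffer sizes across an installation (the 1 MB value is an example; size buffers to your largest objects plus header overhead):

```shell
# gemfire.properties - peer-to-peer: use the same value on every member
socket-buffer-size=1048576

# cache.xml - server side; should match the value used by its clients:
#   <cache-server port="40404" socket-buffer-size="1048576" />
```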

  • TCP/IP Keep Alive

Geode supports TCP KeepAlive to prevent socket connections from being timed out.

The gemfire.enableTcpKeepAlive system property prevents connections that appear idle from being timed out (for example, by a firewall.) When configured to true, Geode enables the SO_KEEPALIVE option for individual sockets. This operating system-level setting allows the socket to send verification checks (ACK requests) to remote systems in order to determine whether or not to keep the socket connection alive.

Note: The time interval before the first KeepAlive probe is sent, the interval between subsequent probes, and the number of probes to send before closing the socket are all configured at the operating system level.

By default, this system property is set to true.
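Because the property defaults to true, it only needs to be set explicitly to turn the behavior off; for example, on the gfsh command line (the member name is an example):

```shell
gfsh>start server --name=server1 --J=-Dgemfire.enableTcpKeepAlive=false
```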

  • TCP/IP Peer-to-Peer Handshake Timeouts

This property governs the amount of time a peer will wait to complete the TCP/IP handshake process. You can change the connection handshake timeouts for TCP/IP connections with the system property p2p.handshakeTimeoutMs.

The default setting is 59,000 milliseconds (59 seconds).

This sets the handshake timeout to 75,000 milliseconds for a Java application:

-Dp2p.handshakeTimeoutMs=75000

The properties are passed to the cache server on the gfsh command line:

 gfsh>start server --name=server1 --J=-Dp2p.handshakeTimeoutMs=75000 

Linux TCP/IP Settings

The following table summarizes the recommended TCP/IP settings for Linux. These settings go in the /etc/sysctl.conf file.

  • net.core.netdev_max_backlog = 30000 – the maximum number of packets queued on the INPUT side when the interface receives packets faster than the kernel can process them. The recommended value is for 10GbE links; for 1GbE links use 8000.
  • net.core.wmem_max = 67108864 – the maximum send socket buffer size. Set to 16MB (16777216) for 1GbE links and 64MB (67108864) for 10GbE links.
  • net.core.rmem_max = 67108864 – the maximum receive socket buffer size. Set to 16MB (16777216) for 1GbE links and 64MB (67108864) for 10GbE links.
  • net.ipv4.tcp_congestion_control = htcp – there appear to be bugs in both bic and cubic (the default) in a number of Linux kernel versions up to 2.6.33. The kernel version is 2.6.18-x for Red Hat 5.x and 2.6.32-x for Red Hat 6.x.
  • net.ipv4.tcp_congestion_window = 10 – this is the default for Linux operating systems based on kernel 2.6.39 or later.
  • net.ipv4.tcp_fin_timeout = 10 – the time that must elapse before TCP/IP can release a closed connection and reuse its resources. During this TIME_WAIT state, reopening the connection to the client costs less than establishing a new connection. Reducing this value lets TCP/IP release closed connections faster, making more resources available for new connections. The default is 60; the recommended setting lowers it to 10. It can be lowered further, but too low a value can cause socket close errors on networks with a lot of jitter.
  • net.ipv4.tcp_keepalive_intvl = 30 – the wait time between keepalive probes. The default is 75; the recommended value reduces this in keeping with the reduction of the overall keepalive time.
  • net.ipv4.tcp_keepalive_probes = 5 – how many keepalive probes to send before the socket is timed out. The default is 9; the recommended value reduces this to 5 so that retry attempts take 2.5 minutes.
  • net.ipv4.tcp_keepalive_time = 600 – sets the TCP socket keepalive timeout to 10 minutes instead of the 2-hour default. With an idle socket, the system waits tcp_keepalive_time seconds and then sends up to tcp_keepalive_probes probes at intervals of tcp_keepalive_intvl seconds. If all of the probes go unanswered, the socket times out.
  • net.ipv4.tcp_low_latency = 1 – configures TCP for low latency, favoring low latency over throughput.
  • net.ipv4.tcp_max_orphans = 16384 – limits the number of orphaned sockets; each orphan can consume up to 16MB (max wmem) of unswappable memory.
  • net.ipv4.tcp_max_tw_buckets = 1440000 – the maximum number of TIME_WAIT sockets held by the system simultaneously. If this number is exceeded, the TIME_WAIT socket is immediately destroyed and a warning is printed. This limit exists to help prevent simple DoS attacks.
  • net.ipv4.tcp_no_metrics_save = 1 – disables caching of TCP metrics on connection close.
  • net.ipv4.tcp_orphan_retries = 0 – the number of times to retry before killing a TCP connection that has been closed by our side.
  • net.ipv4.tcp_rfc1337 = 1 – enables the fix for the RFC 1337 time-wait assassination hazards in TCP.
  • net.ipv4.tcp_rmem = 10240 131072 33554432 – min/default/max receive buffer sizes. Recommended to increase the Linux autotuning TCP receive buffer limit to 32MB.
  • net.ipv4.tcp_wmem = 10240 131072 33554432 – min/default/max send buffer sizes. Recommended to increase the Linux autotuning TCP send buffer limit to 32MB.
  • net.ipv4.tcp_sack = 1 – enables selective acknowledgments.
  • net.ipv4.tcp_slow_start_after_idle = 0 – by default, TCP re-enters slow-start after a connection has been idle, starting again from a single small segment. Disabling this avoids unnecessary slowness at the start of every burst on long-lived connections.
  • net.ipv4.tcp_syncookies = 0 – many default Linux installations use SYN cookies to protect the system against malicious attacks that flood TCP SYN packets. The use of SYN cookies dramatically reduces network bandwidth and can be triggered by a running Geode cluster. If your Geode cluster is otherwise protected against such attacks, disable SYN cookies to ensure that Geode network throughput is not affected.
    NOTE: if SYN floods are an issue and SYN cookies can’t be disabled, try the following:
    net.ipv4.tcp_max_syn_backlog="16384"
    net.ipv4.tcp_synack_retries="1"
    net.ipv4.tcp_max_orphans="400000"
  • net.ipv4.tcp_timestamps = 1 – enables timestamps as defined in RFC 1323.
  • net.ipv4.tcp_tw_recycle = 1 – enables fast recycling of TIME_WAIT sockets. The default is 0 (disabled). Use with caution when load balancers are present.
  • net.ipv4.tcp_tw_reuse = 1 – allows TIME_WAIT sockets to be reused for new connections when it is safe from the protocol viewpoint. The default is 0 (disabled). This is generally a safer alternative to tcp_tw_recycle and is particularly useful in environments where numerous short-lived connections are left in TIME_WAIT state, such as web servers and load balancers.
  • net.ipv4.tcp_window_scaling = 1 – turns on window scaling to enlarge the transfer window.

In addition, increasing the size of the transmit queue can also help TCP throughput. Add the following command to /etc/rc.local to accomplish this.

/sbin/ifconfig eth0 txqueuelen 10000

NOTE: substitute the appropriate adapter name for eth0 in the above example.
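After editing /etc/sysctl.conf, the settings can be applied and spot-checked without a reboot; on newer distributions the ip utility replaces ifconfig for setting the transmit queue length (the adapter name is an example):

```shell
# Load the updated settings and verify one of them
sysctl -p
sysctl net.ipv4.tcp_window_scaling

# Modern equivalent of the ifconfig txqueuelen command above
ip link set dev eth0 txqueuelen 10000
```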

 
