RPS/RFS/GRO


GRO (generic receive offload): in the NAPI poll loop, small packets are coalesced into larger ones before being handed up to the protocol stack.
LRO: the hardware counterpart of GRO (implemented via the NIC's RSC feature).

Why not irqbalance?

We know that interrupt affinity can be set through /proc/irq/<irq>/smp_affinity, and that in user space irqbalance adjusts these affinities according to each CPU's real-time load, balancing interrupts across cores. But although irqbalance exploits the multi-core hardware, its cache utilization is clearly poor. A low-end NIC cannot recognize network flows: it only sees individual packets and cannot extract the tuple information that identifies a flow. If the first packet of a flow is dispatched to CPU1 and the second to CPU2, then for the flow's shared state, such as what nf_conntrack records, CPU cache utilization is low and cache thrashing becomes severe. For a TCP flow, the unpredictable latency of processing serialized TCP segments in parallel may additionally reorder packets. The most direct idea is therefore to dispatch all packets belonging to one flow to the same CPU.
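For reference, interrupt affinity is pinned by writing a hexadecimal CPU mask to the smp_affinity file. The sketch below is only a minimal illustration: the IRQ number 30 and the CPU list are hypothetical placeholders (check /proc/interrupts for your NIC), and the script must run as root.

```python
# Minimal sketch: pin an IRQ to a set of CPUs by writing a hex mask to
# /proc/irq/<irq>/smp_affinity. The IRQ number and CPU list below are
# placeholders; adjust them for your own NIC.

def cpu_mask(cpus):
    """Build a hexadecimal affinity mask from a list of CPU ids."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")

def set_irq_affinity(irq, cpus):
    path = f"/proc/irq/{irq}/smp_affinity"
    with open(path, "w") as f:
        f.write(cpu_mask(cpus))

if __name__ == "__main__":
    # Hypothetical example: steer IRQ 30 to CPUs 0-3 (mask 0xf).
    set_irq_affinity(30, [0, 1, 2, 3])
```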

Why RFS (Receive Flow Steering)?

With RPS, a received packet is handled in a software interrupt on the designated CPU and then processed in user space. If the CPU running the user-space application is not the CPU that handled the softirq, CPU cache misses result, which hurts performance significantly. RFS ensures that the softirq and the application run on the same CPU, which keeps the local cache hot and improves processing efficiency; RFS must be used together with RPS. In other words, although RPS exploits multiple cores, cache utilization still suffers whenever the CPU the application runs on differs from the CPU that RPS selected. RFS is therefore an extension of RPS that solves this problem: when the application issues a system call, the flow's hash is recorded in a global hash table as mapping to the current CPU, and when the next packet of that flow arrives, this global table is consulted.
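Conceptually this is just a hash-indexed table written on the application side and read on the packet-receive side. The toy sketch below (plain Python, with made-up names and sizes) only illustrates that idea; the real kernel structure is rps_sock_flow_table and its details differ.

```python
# Toy model of the RFS idea: a global table maps flow_hash -> CPU where the
# consuming application last ran. Names and sizes are illustrative only.

TABLE_SIZE = 4096                      # power of two, like rps_sock_flow_entries
sock_flow_table = [None] * TABLE_SIZE  # flow-hash slot -> desired CPU

def record_app_cpu(flow_hash, current_cpu):
    """Conceptually called on recvmsg()/sendmsg(): remember where the app runs."""
    sock_flow_table[flow_hash % TABLE_SIZE] = current_cpu

def steer_packet(flow_hash, default_cpu):
    """Called on packet arrival: prefer the CPU the application last used."""
    desired = sock_flow_table[flow_hash % TABLE_SIZE]
    return desired if desired is not None else default_cpu

# Example: the app on CPU 2 reads from a socket whose flow hash is 0xabcd1234;
# subsequent packets of that flow are steered to CPU 2.
record_app_cpu(0xabcd1234, 2)
print(steer_packet(0xabcd1234, 0))   # -> 2
```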

Out of order

When the scheduler moves the application to another CPU, the RFS algorithm says packets should now be delivered to that new CPU, while earlier packets of the flow are still being processed in the softirq of the old CPU. If packets of the same flow sit in the queues of two different CPUs at the same time, out-of-order (OOO) delivery results. RFS therefore introduces a second, per-rx-queue table, rps_dev_flow_table (described further below): in short, if the desired CPU changes but the old CPU still has packets of this flow outstanding, packets are not yet steered to the new CPU's queue.
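A rough sketch of that switch-only-when-drained rule, assuming each flow records the old CPU and how far that CPU's backlog had been filled when the flow's last packet was enqueued (loosely modelled on the kernel's last_qtail check; all names here are illustrative placeholders):

```python
# Illustrative decision logic for avoiding out-of-order delivery when the
# application migrates to a new CPU. Field names mimic the idea behind the
# kernel's per-rx-queue flow entries but are simplified placeholders.

class DevFlow:
    def __init__(self, cpu, last_qtail):
        self.cpu = cpu                # CPU the flow was last steered to
        self.last_qtail = last_qtail  # backlog tail when its last packet was queued

def choose_cpu(desired_cpu, flow, backlog_head):
    """
    desired_cpu : CPU where the application now runs (from the global table)
    flow        : per-rx-queue record for this flow
    backlog_head: how many packets the old CPU's backlog has consumed so far
    """
    if flow.cpu == desired_cpu:
        return flow.cpu
    # Only switch once the old CPU has processed everything this flow had
    # queued there; otherwise keep using the old CPU to preserve ordering.
    if backlog_head >= flow.last_qtail:
        flow.cpu = desired_cpu
        return desired_cpu
    return flow.cpu

flow = DevFlow(cpu=1, last_qtail=120)
print(choose_cpu(desired_cpu=3, flow=flow, backlog_head=100))  # still 1: old queue not drained
print(choose_cpu(desired_cpu=3, flow=flow, backlog_head=130))  # now 3: safe to switch
```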

Why RPS (Receive Packet Steering)?

RPS distributes the load of received packet processing across multiple CPUs.

Consider the following scheme: packets are distributed to the configured set of CPUs by hash value. The hash (skb->hash) is normally computed by the NIC directly from the four-tuple in the packet header, so all packets of the same flow are processed on the same CPU.
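A minimal sketch of that selection step, assuming the enabled CPU set is just a Python list (the kernel indexes its per-queue rps map rather than taking a modulo, so this only shows the shape of the idea):

```python
# Pick a CPU for a packet from its flow hash. The CPU list stands in for the
# rps_cpus mask configured in sysfs; the modulo is an illustrative stand-in
# for the kernel's map indexing.

def rps_select_cpu(skb_hash, enabled_cpus):
    return enabled_cpus[skb_hash % len(enabled_cpus)]

print(rps_select_cpu(0x9e3779b9, [0, 1, 2, 3]))  # same hash -> same CPU every time
```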

RSS

With RSS enabled, the NIC has multiple receive and transmit queues, and these queues are assigned to different CPUs for processing. RSS thus lets packet handling use multiple cores: multi-queue support is implemented at the hardware/driver level, and a hash function spreads packets across the queues. The hash is computed over source IP, destination IP, source port, and destination port, so packets of the same flow land in the same queue while the load stays reasonably balanced across queues.

RPS is a technique similar to RSS; the difference is that RSS is implemented in NIC hardware while RPS is implemented in kernel software. RPS lets a single-queue NIC spread the SoftIRQs it generates across multiple CPU cores. In this scheme, the CPU assigned to the NIC's single queue handles only the hardware interrupts; since hardware interrupt handling is fast, keeping it on one CPU has limited impact, while the expensive softirq processing is dispatched to different CPUs, which effectively avoids a processing bottleneck.

An introduction to RFS/RPS/XPS/RSS, with usage recommendations. The point I currently consider most important is the one below:

For a single queue device, a typical RPS configuration would be to set the rps_cpus to the CPUs in the same memory domain of the interrupting CPU. If NUMA locality is not an issue, this could also be all CPUs in the system. At high interrupt rate, it might be wise to exclude the interrupting CPU from the map since that already performs much work.

For a multi-queue system, if RSS is configured so that a hardware receive queue is mapped to each CPU, then RPS is probably redundant and unnecessary. If there are fewer hardware queues than CPUs, then RPS might be beneficial if the rps_cpus for each queue are the ones that share the same memory domain as the interrupting CPU for that queue.

RPS

RECEIVE PACKET STEERING (RPS)

Receive Packet Steering (RPS) is similar to RSS in that it is used to direct packets to specific CPUs for processing. However, RPS is implemented at the software level, and helps to prevent the hardware queue of a single network interface card from becoming a bottleneck in network traffic.

RPS has several advantages over hardware-based RSS:

- RPS can be used with any network interface card.
- It is easy to add software filters to RPS to deal with new protocols.
- RPS does not increase the hardware interrupt rate of the network device. However, it does introduce inter-processor interrupts.

RPS is configured per network device and receive queue, in the /sys/class/net/<device>/queues/<rx-queue>/rps_cpus file, where <device> is the name of the network device (such as eth0) and <rx-queue> is the name of the appropriate receive queue (such as rx-0).

The default value of the rps_cpus file is zero. This disables RPS, so the CPU that handles the network interrupt also processes the packet.

To enable RPS, configure the appropriate rps_cpus file with the CPUs that should process packets from the specified network device and receive queue.

The rps_cpus files use comma-delimited CPU bitmaps. Therefore, to allow a CPU to handle interrupts for the receive queue on an interface, set the value of its position in the bitmap to 1. For example, to handle interrupts with CPUs 0, 1, 2, and 3, set the value of rps_cpus to 00001111 (1+2+4+8), or f (the hexadecimal value for 15).
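For illustration, here is a small sketch that builds such a bitmap and writes it to the rps_cpus file. The interface name eth0 and queue rx-0 are placeholders, and writing to sysfs requires root; this is a sketch of the configuration step, not a production tool.

```python
# Sketch: enable RPS for one receive queue by writing a hex CPU bitmap to
# /sys/class/net/<device>/queues/<rx-queue>/rps_cpus. Device and queue names
# below are placeholders for your system.

def cpu_bitmap(cpus):
    """CPUs 0,1,2,3 -> 'f' (binary 1111)."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")

def enable_rps(device, rx_queue, cpus):
    path = f"/sys/class/net/{device}/queues/{rx_queue}/rps_cpus"
    with open(path, "w") as f:
        f.write(cpu_bitmap(cpus))

if __name__ == "__main__":
    enable_rps("eth0", "rx-0", [0, 1, 2, 3])   # writes "f"
```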

For network devices with a single receive queue, best performance can be achieved by configuring RPS to use CPUs in the same memory domain. On non-NUMA systems, this means that all available CPUs can be used. If the network interrupt rate is extremely high, excluding the CPU that handles network interrupts may also improve performance.

For network devices with multiple queues, there is typically no benefit to configuring both RPS and RSS, as RSS is configured to map a CPU to each receive queue by default. However, RPS may still be beneficial if there are fewer hardware queues than CPUs, and RPS is configured to use CPUs in the same memory domain.

RFS

RECEIVE FLOW STEERING (RFS)

Receive Flow Steering (RFS) extends RPS behavior to increase the CPU cache hit rate and thereby reduce network latency. Where RPS steers packets based solely on the packet hash, RFS uses the RPS back end to calculate the most appropriate CPU, then forwards packets based on the location of the application consuming the packet. This increases CPU cache efficiency.

RFS is disabled by default. To enable RFS, you must edit two files:

/proc/sys/net/core/rps_sock_flow_entries

Set the value of this file to the maximum expected number of concurrently active connections. We recommend a value of 32768 for moderate server loads. All values entered are rounded up to the nearest power of 2 in practice.

/sys/class/net/<device>/queues/<rx-queue>/rps_flow_cnt

Replace <device> with the name of the network device you wish to configure (for example, eth0), and <rx-queue> with the receive queue you wish to configure (for example, rx-0).

Set the value of this file to the value of rps_sock_flow_entries divided by N, where N is the number of receive queues on the device. For example, if rps_sock_flow_entries is set to 32768 and there are 16 configured receive queues, rps_flow_cnt should be set to 2048. For single-queue devices, the value of rps_flow_cnt is the same as the value of rps_sock_flow_entries.
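A sketch of that arithmetic and of writing both files follows. The device name eth0 and the queue count are placeholders, root is required, and rps_cpus still has to be configured as above for RFS to take effect.

```python
# Sketch: enable RFS by sizing the global socket-flow table and the
# per-queue flow tables. Device name and queue count are placeholders.

def enable_rfs(device, n_rx_queues, sock_flow_entries=32768):
    # Global table shared by all flows in the system.
    with open("/proc/sys/net/core/rps_sock_flow_entries", "w") as f:
        f.write(str(sock_flow_entries))

    # Per-queue tables: entries divided by the number of receive queues,
    # e.g. 32768 / 16 = 2048.
    flow_cnt = sock_flow_entries // n_rx_queues
    for q in range(n_rx_queues):
        path = f"/sys/class/net/{device}/queues/rx-{q}/rps_flow_cnt"
        with open(path, "w") as f:
            f.write(str(flow_cnt))

if __name__ == "__main__":
    enable_rfs("eth0", 16)
```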

Data received from a single sender is not sent to more than one CPU. If the amount of data received from a single sender is greater than a single CPU can handle, configure a larger frame size to reduce the number of interrupts and therefore the amount of processing work for the CPU. Alternatively, consider NIC offload options or faster CPUs.

Consider using numactl or taskset in conjunction with RFS to pin applications to specific cores, sockets, or NUMA nodes. This can help prevent packets from being processed out of order.
