| #
f2006e27 |
| 12-Jul-2013 |
Thomas Gleixner <tglx@linutronix.de> |
Merge branch 'linus' into timers/urgent
Get upstream changes so we can apply fixes against them
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
| #
496322bc |
| 10-Jul-2013 |
Linus Torvalds <torvalds@linux-foundation.org> |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller: "This is a re-do of the net-next pull request for the current merge window. The only difference from the one I made the other day is that this has Eliezer's interface renames and the timeout handling changes made based upon your feedback, as well as a few bug fixes that have trickled in.
Highlights:
1) Low latency device polling, eliminating the cost of interrupt handling and context switches. Allows direct polling of a network device from socket operations, such as recvmsg() and poll().
Currently ixgbe, mlx4, and bnx2x support this feature.
Full high level description, performance numbers, and design in commit 0a4db187a999 ("Merge branch 'll_poll'")
From Eliezer Tamir.
2) With the routing cache removed, ip_check_mc_rcu() gets exercised more than ever before in the case where we have lots of multicast addresses. Use a hash table instead of a simple linked list, from Eric Dumazet.
3) Add driver for Atheros QCA98xx 802.11ac wireless devices, from Bartosz Markowski, Janusz Dziedzic, Kalle Valo, Marek Kwaczynski, Marek Puzyniak, Michal Kazior, and Sujith Manoharan.
4) Support reporting the TUN device persist flag to userspace, from Pavel Emelyanov.
5) Allow controlling network device VF link state using netlink, from Rony Efraim.
6) Support GRE tunneling in openvswitch, from Pravin B Shelar.
7) Adjust SOCK_MIN_RCVBUF and SOCK_MIN_SNDBUF for modern times, from Daniel Borkmann and Eric Dumazet.
8) Allow controlling of TCP quickack behavior on a per-route basis, from Cong Wang.
9) Several bug fixes and improvements to vxlan from Stephen Hemminger, Pravin B Shelar, and Mike Rapoport. In particular, support receiving on multiple UDP ports.
10) Major cleanups, particularly in the area of debugging and cookie lifetime handling, to the SCTP protocol code. From Daniel Borkmann.
11) Allow packets to cross network namespaces when traversing tunnel devices. From Nicolas Dichtel.
12) Allow monitoring netlink traffic via AF_PACKET sockets, in a manner akin to how we monitor real network traffic via ptype_all. From Daniel Borkmann.
13) Several bug fixes and improvements for the new alx device driver, from Johannes Berg.
14) Fix scalability issues in the netem packet scheduler's time queue, by using an rbtree. From Eric Dumazet.
15) Several bug fixes in TCP loss recovery handling, from Yuchung Cheng.
16) Add support for GSO segmentation of MPLS packets, from Simon Horman.
17) Make network notifiers have a real data type for the opaque pointer that's passed into them. Use this to properly handle network device flag changes in arp_netdev_event(). From Jiri Pirko and Timo Teräs.
18) Convert several drivers over to module_pci_driver(), from Peter Huewe.
19) tcp_fixup_rcvbuf() can loop 500 times over loopback, just use a O(1) calculation instead. From Eric Dumazet.
20) Support setting of explicit tunnel peer addresses in ipv6, just like ipv4. From Nicolas Dichtel.
21) Protect x86 BPF JIT against spraying attacks, from Eric Dumazet.
22) Prevent a single high rate flow from overrunning an individual cpu during RX packet processing via selective flow shedding. From Willem de Bruijn.
23) Don't use spinlocks in TCP md5 signing fast paths, from Eric Dumazet.
24) Don't just drop GSO packets which are above the TBF scheduler's burst limit, chop them up so they are in-bounds instead. Also from Eric Dumazet.
25) VLAN offloads are missed when configured on top of a bridge, fix from Vlad Yasevich.
26) Support IPV6 in ping sockets. From Lorenzo Colitti.
27) Receive flow steering targets should be updated at poll() time too, from David Majnemer.
28) Fix several corner case regressions in PMTU/redirect handling due to the routing cache removal, from Timo Teräs.
29) We have to be mindful of ipv4 mapped ipv6 sockets in udp_v6_push_pending_frames(). From Hannes Frederic Sowa.
30) Fix L2TP sequence number handling bugs, from James Chapman."
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1214 commits) drivers/net: caif: fix wrong rtnl_is_locked() usage drivers/net: enic: release rtnl_lock on error-path vhost-net: fix use-after-free in vhost_net_flush net: mv643xx_eth: do not use port number as platform device id net: sctp: confirm route during forward progress virtio_net: fix race in RX VQ processing virtio: support unlocked queue poll net/cadence/macb: fix bug/typo in extracting gem_irq_read_clear bit Documentation: Fix references to defunct linux-net@vger.kernel.org net/fs: change busy poll time accounting net: rename low latency sockets functions to busy poll bridge: fix some kernel warning in multicast timer sfc: Fix memory leak when discarding scattered packets sit: fix tunnel update via netlink dt:net:stmmac: Add dt specific phy reset callback support. dt:net:stmmac: Add support to dwmac version 3.610 and 3.710 dt:net:stmmac: Allocate platform data only if its NULL. net:stmmac: fix memleak in the open method ipv6: rt6_check_neigh should successfully verify neigh if no NUD information are available net: ipv6: fix wrong ping_v6_sendmsg return value ...
|
| #
fe3c22bd |
| 02-Jul-2013 |
Linus Torvalds <torvalds@linux-foundation.org> |
Merge tag 'char-misc-3.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc updates from Greg KH: "Here's the big char/misc driver tree merge for 3.11-rc1
A variety of different driver patches here. All of these have been in linux-next for a while, and the networking patches were acked-by David Miller, as it made sense for those patches to come through this tree"
* tag 'char-misc-3.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (102 commits) Revert "char: misc: assign file->private_data in all cases" drivers: uio_pdrv_genirq: Use of_match_ptr() macro mei: check whether hw start has succeeded mei: check if the hardware reset succeeded mei: mei_cl_connect: don't multiply the timeout twice mei: do not override a client writing state when buffering mei: move mei_cl_irq_write_complete to client.c UIO: Fix concurrency issue drivers: uio_dmem_genirq: Use of_match_ptr() macro char: misc: assign file->private_data in all cases drivers: hv: allocate synic structures before hv_synic_init() drivers: hv: check interrupt mask before read_index vme: vme_tsi148.c: fix error return code in tsi148_probe() FMC: fix error handling in probe() function fmc: avoid readl/writel namespace conflict FMC: NULL dereference on allocation failure UIO: fix uio_pdrv_genirq with device tree but no interrupt UIO: allow binding uio_pdrv_genirq.c to devices using command line option FMC: add a char-device mezzanine driver FMC: add a driver to write mezzanine EEPROM ...
|
| #
27eb2c4b |
| 02-Jul-2013 |
Dmitry Torokhov <dmitry.torokhov@gmail.com> |
Merge branch 'next' into for-linus
Prepare first set of updates for 3.11 merge window.
|
| #
31881d74 |
| 28-Jun-2013 |
Dmitry Torokhov <dmitry.torokhov@gmail.com> |
Merge branch 'for-next' of git://github.com/rydberg/linux into next
Pull in changes from Henrik: "a trivial MT documentation fix".
|
| #
89bf1b5a |
| 14-Jun-2013 |
Eliezer Tamir <eliezer.tamir@linux.intel.com> |
net: remove NET_LL_RX_POLL config menu
Remove NET_LL_RX_POLL from the config menu. Change default to y. Busy polling still needs to be enabled at run time.
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
| #
9a3c71aa |
| 14-Jun-2013 |
Eliezer Tamir <eliezer.tamir@linux.intel.com> |
net: convert low latency sockets to sched_clock()
Use sched_clock() instead of get_cycles(). We can use sched_clock() because we don't care much about accuracy. Remove the dependency on X86_TSC.
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
| #
0a4db187 |
| 11-Jun-2013 |
David S. Miller <davem@davemloft.net> |
Merge branch 'll_poll'
Eliezer Tamir says:
==================== This patch set adds the ability for the socket layer code to poll directly on an Ethernet device's RX queue. This eliminates the cost of the interrupt and context switch and with proper tuning allows us to get very close to the HW latency.
This is a follow up to Jesse Brandeburg's Kernel Plumbers talk from last year http://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/09/2012-lpc-Low-Latency-Sockets-slides-brandeburg.pdf
Patch 1 adds a napi_id and a hashing mechanism to lookup a napi by id.
Patch 2 adds an ndo_ll_poll method and the code that supports it.
Patch 3 adds support for busy-polling on UDP sockets.
Patch 4 adds support for TCP.
Patch 5 adds the ixgbe driver code implementing ndo_ll_poll.
Patch 6 adds additional statistics to the ixgbe driver for ndo_ll_poll.
Performance numbers:
         setup                         TCP_RR                UDP_RR
kernel   Config     C3/6  rx-usecs    tps   cpu%  S.dem     tps   cpu%  S.dem
patched  optimized  on    100         87k   3.13  11.4      94K   3.17  10.7
patched  optimized  on    0           71k   3.12  14.0      84k   3.19  12.0
patched  optimized  on    adaptive    80k   3.13  12.5      90k   3.46  12.2
patched  typical    on    100         72    3.13  14.0      79k   3.17  12.8
patched  typical    on    0           60k   2.13  16.5      71k   3.18  14.0
patched  typical    on    adaptive    67k   3.51  16.7      75k   3.36  14.5
3.9      optimized  on    adaptive    25k   1.0   12.7      28k   0.98  11.2
3.9      typical    off   0           48k   1.09  7.3       52k   1.11  4.18
3.9      typical    off   adaptive    35k   1.12  4.08      38k   0.65  5.49
3.9      optimized  off   adaptive    40k   0.82  4.83      43k   0.70  5.23
3.9      optimized  off   0           57k   1.17  4.08      62k   1.04  3.95
Test setup details:
Machines: each with two Intel Xeon 2680 CPUs and X520 (82599) optical NICs
Tests: Netperf tcp_rr and udp_rr, 1 byte (round trips per second)
Kernel: unmodified 3.9 and patched 3.9
Config: typical is derived from RH6.2, optimized is a stripped down config.
Interrupt coalescing (ethtool rx-usecs) settings: 0=off, 1=adaptive, 100 us
When C3/6 states were turned on (via BIOS) the performance governor was used.
These performance numbers were measured with v2 of the patch set. Performance of the optimized config with an rx-usecs setting of 100 (the first line in the table above) was tracked during the evolution of the patches and has never varied by more than 1%.
Design: A global hash table that allows us to look up a struct napi by a unique id was added.
A napi_id field was added both to struct sk_buff and struct sk. This is used to track which NAPI we need to poll for a specific socket.
The device driver marks every incoming skb with this id. This is propagated to the sk when the socket is looked up in the protocol handler.
When the socket code does not find any more data on the socket queue, it now may call ndo_ll_poll which will crank the device's rx queue and feed incoming packets to the stack directly from the context of the socket.
A sysctl value (net.core.low_latency_poll) controls how many microseconds we busy-wait before giving up. Setting it to 0 globally disables busy-polling.
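The design described above can be modeled in user space. The following is only a rough sketch, not the kernel code: `Napi`, `Sock`, `napi_hash`, `ll_poll()` and `recv()` are hypothetical Python stand-ins for the structures named in the text, and `time.monotonic()` stands in for whatever clock the kernel uses. It shows the id being propagated from the device queue to the socket, and the socket polling that queue for a bounded number of microseconds when its own queue is empty.

```python
import time
from collections import deque

# Hypothetical model of the global table that maps an id to a NAPI context.
napi_hash = {}


class Napi:
    """Models one device RX queue that can be polled on demand."""

    def __init__(self, napi_id):
        self.napi_id = napi_id
        self.rx_ring = deque()          # (socket, payload) pairs still on the device
        napi_hash[napi_id] = self

    def ll_poll(self):
        """Drain the ring and feed packets to their sockets."""
        delivered = 0
        while self.rx_ring:
            sock, payload = self.rx_ring.popleft()
            sock.napi_id = self.napi_id  # id carried by the packet reaches the socket
            sock.queue.append(payload)
            delivered += 1
        return delivered


class Sock:
    def __init__(self):
        self.napi_id = 0                # 0: no device queue seen yet
        self.queue = deque()

    def recv(self, busy_poll_us):
        """Return queued data, busy-polling the device for up to busy_poll_us."""
        if self.queue:
            return self.queue.popleft()
        napi = napi_hash.get(self.napi_id)
        if busy_poll_us == 0 or napi is None:
            return None                 # feature off, or nothing to poll
        deadline = time.monotonic() + busy_poll_us / 1e6
        while True:
            napi.ll_poll()
            if self.queue:
                return self.queue.popleft()
            if time.monotonic() >= deadline:
                return None             # give up after the configured budget
```

A budget of 0 skips polling entirely, which mirrors the "0 disables busy-polling" rule; a socket that has not yet seen a packet has no id and therefore nothing to poll.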
Locking:
1. Locking between napi poll and ndo_ll_poll: Since what needs to be locked between a device's NAPI poll and ndo_ll_poll, is highly device / configuration dependent, we do this inside the Ethernet driver. For example, when packets for high priority connections are sent to separate rx queues, you might not need locking between napi poll and ndo_ll_poll at all.
For ixgbe we only lock the RX queue. ndo_ll_poll does not touch the interrupt state or the TX queues. (earlier versions of this patchset did touch them, but this design is simpler and works better.)
If a queue is actively polled by a socket (on another CPU) napi poll will not service it, but will wait until the queue can be locked and cleaned before doing a napi_complete(). If a socket can't lock the queue because another CPU has it, either from napi or from another socket polling on the queue, the socket code can busy wait on the socket's skb queue.
Ndo_ll_poll does not have preferential treatment for the data from the calling socket vs. data from others, so if another CPU is polling, you will see your data on this socket's queue when it arrives.
Ndo_ll_poll is called with local BHs disabled, so it won't race on the same CPU with net_rx_action, which calls the napi poll method.
2. Napi_hash The napi hash mechanism uses RCU. napi_by_id() must be called under rcu_read_lock(). After a call to napi_hash_del(), caller must take care to wait an rcu grace period before freeing the memory containing the napi struct. (Ixgbe already had this because the queue vector structure uses rcu to protect the statistics counters in it.)
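The queue ownership rule from point 1 can be illustrated with a small state holder. This is a hedged sketch under stated assumptions, not the ixgbe implementation: `QueueVector`, `lock_napi()`, `lock_poll()` and `unlock()` are hypothetical names, and a Python `threading.Lock` stands in for a driver spinlock. Whoever finds the queue idle takes it; a NAPI poll that loses records that it must come back, and a socket that loses simply falls back to watching its own queue.

```python
import threading


class QueueVector:
    """Hypothetical model of an RX queue shared by NAPI poll and socket polling."""

    IDLE, NAPI, POLL = "idle", "napi", "poll"

    def __init__(self):
        self._lock = threading.Lock()   # protects the two fields below
        self.owner = self.IDLE
        self.napi_yielded = False

    def lock_napi(self):
        """Called by the NAPI poll path; False means a socket owns the queue."""
        with self._lock:
            if self.owner != self.IDLE:
                self.napi_yielded = True    # NAPI must retry before completing
                return False
            self.owner = self.NAPI
            return True

    def lock_poll(self):
        """Called by the socket path; False means busy-wait on the socket queue."""
        with self._lock:
            if self.owner != self.IDLE:
                return False
            self.owner = self.POLL
            return True

    def unlock(self):
        """Release the queue; returns True if NAPI was turned away meanwhile."""
        with self._lock:
            self.owner = self.IDLE
            yielded, self.napi_yielded = self.napi_yielded, False
            return yielded
```

Keeping the decision inside the driver-level object matches the text's point that the required locking is device and configuration dependent.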
how to test:
1. The patchset should apply cleanly to net-next. (don't forget to configure NET_LL_RX_POLL).
2. The ethtool -c setting for rx-usecs should be on the order of 100.
3. Use ethtool -K to disable GRO and LRO (You are encouraged to try it both ways. If you find that your workload does better with GRO on do tell us.)
4. Sysctl value net.core.low_latency_poll controls how long (in us) to busy-wait for more data. You are encouraged to play with this and see what works for you. The default is now 0 so you need to set it to turn the feature on. I recommend a value around 50.
5. The benchmark thread and IRQ should be bound to separate cores. Both cores should be on the same CPU NUMA node as the NIC. When the app and the IRQ run on the same CPU you get a small penalty. If interrupt coalescing is set to a low value this penalty can be very large.
6. If you suspect that your machine is not configured properly, use numademo to make sure that the CPU to memory BW is OK. numademo 128m memcpy local copy numbers should be more than 8GB/s on a properly configured machine.
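For the sysctl step in the list above, a minimal sketch of setting the knob from a script follows. The helper names (`sysctl_path`, `set_sysctl`, `get_sysctl`) and the `root` parameter are illustrative assumptions; the only facts taken from the text are the knob name `net.core.low_latency_poll` and the suggested value of 50. Writing under `/proc/sys` needs root privileges, and the knob only exists on a kernel carrying this patch set.

```python
from pathlib import Path


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name to its procfs path."""
    return Path(root).joinpath(*name.split("."))


def set_sysctl(name, value, root="/proc/sys"):
    """Write an integer sysctl value (requires privileges on a real system)."""
    sysctl_path(name, root).write_text(f"{value}\n")


def get_sysctl(name, root="/proc/sys"):
    """Read an integer sysctl value back."""
    return int(sysctl_path(name, root).read_text())
```

On a test machine this would be called as `set_sysctl("net.core.low_latency_poll", 50)`; the `root` argument exists only so the helpers can be exercised against a scratch directory.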
Change log:
v10 - removed select/poll support. (we will work on this some more and try again)
v9 - correct sysctl proc_handler, reported by Eric Dumazet and Amir Vadai. - more int -> bool changes, reported by Eric Dumazet. - better mask testing in sock_poll(), reported by Eric Dumazet.
v8 - split out udp and select/poll into separate patches. what used to be patch 2/5 is now three patches. - type corrections from Amir Vadai and Cong Wang: one unsigned long that was left when changing to cycles_t int -> bool - more detailed patch descriptions.
v7 - suggested by Ben Hutchings and Eric Dumazet: type fixes, static for globals in net/core.c, avoid napi_id collisions in napi_hash_add()
v6 - many small fixes suggested by Eric Dumazet: data locality, typos, documentation protect napi_hash insert/delete with a spinlock (napi_gen_id is no longer atomic_t since it's only accessed with the spinlock held.) - added IPv6 TCP and UDP support (only minimally tested)
v5 - corrections suggested by Ben Hutchings: fixed typos, moved the config option and sysctl value from IPv4 to net - moved sk_mark_ll() to the protocol handlers - removed global id mechanism, replaced with a hashed napi_id. based on code sample from Eric Dumazet Note that ixgbe_free_q_vector() already waits an rcu grace period before freeing the q_vector, so nothing additional needs to be done when adding a call to napi_hash_del(). - simple poll/select support
v4 - removed separate config option for TCP as suggested by Eric Dumazet. - added linux mib counter for packets received through the low latency path, as suggested by Andi Kleen. - re-allow module unloading, remove module param, use a global generation id instead to prevent the use of a stale napi pointer, as suggested by Eric Dumazet - updated Documentation/networking/ip-sysctl.txt text
v3 - coding style changes suggested by Dave Miller
v2 - the sysctl knob is now in microseconds. The default value is now 0 (off). - for now the code depends at configure time on CONFIG_X86_TSC - the napi reference in struct skb is now a union with the dma cookie since the former is only used on RX and the latter on TX, as suggested by Eric Dumazet. - we do a better job at honoring non-blocking operations. - removed busy-polling support for tcp_read_sock() - remove dynamic disabling of GRO - coding style fixes - disallow unloading the device module after the feature has been used
Credit: Jesse Brandeburg, Arun Chekhov Ilango, Julie Cummings, Alexander Duyck, Eric Geisler, Jason Neighbors, Yadong Li, Mike Polehn, Anil Vasudevan, Don Wood Special thanks for finding bugs in earlier versions: Willem de Bruijn and Andi Kleen ====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| #
06021292 |
| 10-Jun-2013 |
Eliezer Tamir <eliezer.tamir@linux.intel.com> |
net: add low latency socket poll
Adds an ndo_ll_poll method and the code that supports it. This method can be used by low latency applications to busy-poll Ethernet device queues directly from the socket code. sysctl_net_ll_poll controls how many microseconds to poll. Default is zero (disabled). Individual protocol support will be added by subsequent patches.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Tested-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
| #
4cd5773a |
| 04-Jun-2013 |
Andy Shevchenko <andy.shevchenko@gmail.com> |
net: core: move mac_pton() to lib/net_utils.c
Since we have at least one user of this function outside of CONFIG_NET scope, we have to provide this function independently. The proposed solution is to move it under lib/net_utils.c with corresponding configuration variable and select wherever it is needed.
Signed-off-by: Andy Shevchenko <andy.shevchenko@gmail.com> Reported-by: Arnd Bergmann <arnd@arndb.de> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
| #
6e9041c6 |
| 28-May-2013 |
Jiri Kosina <jkosina@suse.cz> |
Merge branch 'master' into for-next
|
| #
51047840 |
| 28-May-2013 |
David S. Miller <davem@davemloft.net> |
Merge branch 'mpls_gso'
Simon Horman says:
==================== In the case where a non-MPLS packet is received and an MPLS stack is added it may well be the case that the original skb is GSO but the NIC used for transmit does not support GSO of MPLS packets.
The aim of this short series is to provide GSO in software for MPLS packets whose skbs are GSO.
Change since v4:
Update first patch of the series to use 16 bits for all *_headers rather than just inner_*_headers
Simon Horman (2): net: Use 16bits for *_headers fields of struct skbuff MPLS: Add limited GSO support ====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| #
0d89d203 |
| 23-May-2013 |
Simon Horman <horms@verge.net.au> |
MPLS: Add limited GSO support
In the case where a non-MPLS packet is received and an MPLS stack is added it may well be the case that the original skb is GSO but the NIC used for transmit does not support GSO of MPLS packets.
The aim of this code is to provide GSO in software for MPLS packets whose skbs are GSO.
SKB Usage:
When an implementation adds an MPLS stack to a non-MPLS packet it should do the following to skb metadata:
* Set skb->inner_protocol to the old non-MPLS ethertype of the packet. skb->inner_protocol is added by this patch.
* Set skb->protocol to the new MPLS ethertype of the packet.
* Set skb->network_header to correspond to the end of the L3 header, including the MPLS label stack.
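The three metadata steps above can be sketched on a toy packet model. This is an illustration under stated assumptions, not the datapath code: `Skb` and `push_mpls()` are hypothetical, the frame is assumed to have a plain 14-byte Ethernet header, and `network_header` is read here as the offset just past the label stack, which is one interpretation of the third bullet. The ethertype values and the label stack entry layout (20-bit label, 3-bit traffic class, bottom-of-stack bit, 8-bit TTL) are the standard ones.

```python
import struct
from dataclasses import dataclass

ETH_P_IP = 0x0800        # IPv4 ethertype
ETH_P_MPLS_UC = 0x8847   # MPLS unicast ethertype
ETH_HLEN = 14            # assumed: untagged Ethernet header
MPLS_HLEN = 4            # bytes per label stack entry


@dataclass
class Skb:
    """Hypothetical model holding only the fields discussed above."""
    protocol: int
    data: bytes
    inner_protocol: int = 0
    network_header: int = ETH_HLEN


def push_mpls(skb, label, ttl=64):
    """Insert one label stack entry after the Ethernet header."""
    bottom = skb.protocol != ETH_P_MPLS_UC      # first label pushed is bottom of stack
    if bottom:
        skb.inner_protocol = skb.protocol       # remember the old non-MPLS ethertype
    entry = (label << 12) | (int(bottom) << 8) | ttl
    skb.data = (skb.data[:12]
                + struct.pack("!H", ETH_P_MPLS_UC)
                + struct.pack("!I", entry)
                + skb.data[ETH_HLEN:])
    skb.protocol = ETH_P_MPLS_UC                # new MPLS ethertype
    skb.network_header += MPLS_HLEN             # now accounts for the label stack
    return skb
```

With `inner_protocol` preserved, a software segmenter can later tell which protocol's segmentation routine applies to the payload behind the labels.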
I have posted a patch, "[PATCH v3.29] datapath: Add basic MPLS support to kernel" which adds MPLS support to the kernel datapath of Open vSwitch. That patch sets the above requirements in datapath/actions.c:push_mpls() and was used to exercise this code. The datapath patch is against the Open vSwitch tree but it is intended that it be added to the Open vSwitch code present in the mainline Linux kernel at some point.
Features:
I believe that the approach that I have taken is at least partially consistent with the handling of other protocols. Jesse, I understand that you have some ideas here. I am more than happy to change my implementation.
This patch adds dev->mpls_features which may be used by devices to advertise features supported for MPLS packets.
A new NETIF_F_MPLS_GSO feature is added for devices which support hardware MPLS GSO offload. Currently no devices support this and MPLS GSO always falls back to software.
Alternate Implementation:
One possible alternate implementation is to teach netif_skb_features() and skb_network_protocol() about MPLS, in a similar way to their understanding of VLANs. I believe this would avoid the need for net/mpls/mpls_gso.c and in particular the calls to __skb_pull() and __skb_push() in mpls_gso_segment().
I have decided on the implementation in this patch as it should not introduce any overhead in the case where mpls_gso is not compiled into the kernel or inserted as a module.
MPLS GSO suggested by Jesse Gross. Based in part on "v4 GRE: Add TCP segmentation offload for GRE" by Pravin B Shelar.
Cc: Jesse Gross <jesse@nicira.com> Cc: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
| #
e1b73cba |
| 21-May-2013 |
Daniel Vetter <daniel.vetter@ffwll.ch> |
Merge tag 'v3.10-rc2' into drm-intel-next-queued
Backmerge Linux 3.10-rc2 since the various (rather trivial) conflicts grew a bit out of hand. intel_dp.c has the only real functional conflict since the logic changed while dev_priv->edp.bpp was moved around.
Also squash in a whitespace fixup from Ben Widawsky for i915_gem_gtt.c, git seems to do something pretty strange in there (which I don't fully understand tbh).
Conflicts: drivers/gpu/drm/i915/i915_reg.h drivers/gpu/drm/i915/intel_dp.c
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
|
| #
99bbc707 |
| 20-May-2013 |
Willem de Bruijn <willemb@google.com> |
rps: selective flow shedding during softnet overflow
A cpu executing the network receive path sheds packets when its input queue grows to netdev_max_backlog. A single high rate flow (such as a spoofed source DoS) can exceed a single cpu processing rate and will degrade throughput of other flows hashed onto the same cpu.
This patch adds a more fine grained hashtable. If the netdev backlog is above a threshold, IRQ cpus track the ratio of total traffic of each flow (using 4096 buckets, configurable). The ratio is measured by counting the number of packets per flow over the last 256 packets from the source cpu. Any flow that occupies a large fraction of this (set at 50%) will see packet drop while above the threshold.
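The accounting described in this paragraph can be approximated with a sliding window. The sketch below is a user-space simulation, not the kernel implementation: `FlowLimit` and `should_drop()` are hypothetical names, and the defaults (4096 buckets, a 256-packet history, a 50% share) are taken from the text. A packet is dropped only while the backlog is above the threshold and its flow's bucket holds more than half of the recent history.

```python
from collections import deque


class FlowLimit:
    """Track per-bucket packet counts over a window of recent packets."""

    def __init__(self, num_buckets=4096, history=256, max_share=0.5):
        self.num_buckets = num_buckets
        self.history_len = history
        self.history = deque()              # bucket ids of the most recent packets
        self.counts = [0] * num_buckets
        self.limit = history * max_share    # 128 packets with the defaults

    def should_drop(self, flow_hash, backlog_len, threshold):
        """Return True if this packet's flow dominates the recent window."""
        if backlog_len < threshold:
            return False                    # only track while the backlog is high
        bucket = flow_hash % self.num_buckets
        if len(self.history) == self.history_len:
            self.counts[self.history.popleft()] -= 1   # expire the oldest sample
        self.history.append(bucket)
        self.counts[bucket] += 1
        return self.counts[bucket] > self.limit
```

Because counts are kept per bucket rather than per flow, two flows that hash to the same bucket share a budget; a larger table makes that less likely at the cost of memory.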
Tested: Setup is a multi-threaded UDP echo server with network rx IRQ on cpu0, kernel receive (RPS) on cpu0 and application threads on cpus 2-7 each handling 20k req/s. Throughput halves when hit with a 400 kpps antagonist storm. With this patch applied, antagonist overload is dropped and the server processes its complete load.
The patch is effective when kernel receive processing is the bottleneck. The above RPS scenario is an extreme case, but the same is reached with RFS and sufficient kernel processing (iptables, packet socket tap, ...).
Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
| #
12e04ffc |
| 15-May-2013 |
Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> |
Merge tag 'v3.10-rc1' into stable/for-linus-3.10
Linux 3.10-rc1
* tag 'v3.10-rc1': (12273 commits) Linux 3.10-rc1 [SCSI] qla2xxx: Update firmware link in Kconfig file. [SCSI] iscsi class, qla4xxx: fix sess/conn refcounting when find fns are used [SCSI] sas: unify the pointlessly separated enums sas_dev_type and sas_device_type [SCSI] pm80xx: thermal, sas controller config and error handling update [SCSI] pm80xx: NCQ error handling changes [SCSI] pm80xx: WWN Modification for PM8081/88/89 controllers [SCSI] pm80xx: Changed module name and debug messages update [SCSI] pm80xx: Firmware flash memory free fix, with addition of new memory region for it [SCSI] pm80xx: SPC new firmware changes for device id 0x8081 alone [SCSI] pm80xx: Added SPCv/ve specific hardware functionalities and relevant changes in common files [SCSI] pm80xx: MSI-X implementation for using 64 interrupts [SCSI] pm80xx: Updated common functions common for SPC and SPCv/ve [SCSI] pm80xx: Multiple inbound/outbound queue configuration [SCSI] pm80xx: Added SPCv/ve specific ids, variables and modify for SPC [SCSI] lpfc: fix up Kconfig dependencies [SCSI] Handle MLQUEUE busy response in scsi_send_eh_cmnd dm cache: set config value dm cache: move config fns dm thin: generate event when metadata threshold passed ...
|
| #
4237c09a |
| 13-May-2013 |
Mauro Carvalho Chehab <mchehab@redhat.com> |
Merge tag 'v3.10-rc1' into patchwork
Linux 3.10-rc1
* tag 'v3.10-rc1': (11697 commits) Linux 3.10-rc1 [SCSI] qla2xxx: Update firmware link in Kconfig file. [SCSI] iscsi class, qla4xxx: fix sess/conn refcounting when find fns are used [SCSI] sas: unify the pointlessly separated enums sas_dev_type and sas_device_type [SCSI] pm80xx: thermal, sas controller config and error handling update [SCSI] pm80xx: NCQ error handling changes [SCSI] pm80xx: WWN Modification for PM8081/88/89 controllers [SCSI] pm80xx: Changed module name and debug messages update [SCSI] pm80xx: Firmware flash memory free fix, with addition of new memory region for it [SCSI] pm80xx: SPC new firmware changes for device id 0x8081 alone [SCSI] pm80xx: Added SPCv/ve specific hardware functionalities and relevant changes in common files [SCSI] pm80xx: MSI-X implementation for using 64 interrupts [SCSI] pm80xx: Updated common functions common for SPC and SPCv/ve [SCSI] pm80xx: Multiple inbound/outbound queue configuration [SCSI] pm80xx: Added SPCv/ve specific ids, variables and modify for SPC [SCSI] lpfc: fix up Kconfig dependencies [SCSI] Handle MLQUEUE busy response in scsi_send_eh_cmnd dm cache: set config value dm cache: move config fns dm thin: generate event when metadata threshold passed ...
|
| #
f99e44a7 |
| 05-May-2013 |
Thomas Gleixner <tglx@linutronix.de> |
Merge branch 'linus' into core/urgent
Update with Linus tree so fixes for the same can be applied.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
| #
048c9acc |
| 05-May-2013 |
David S. Miller <davem@davemloft.net> |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc
Merge sparc bug fixes that didn't make it into v3.9 into sparc-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| #
73287a43 |
| 01-May-2013 |
Linus Torvalds <torvalds@linux-foundation.org> |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller: "Highlights (1721 non-merge commits, this has to be a record of some sort):
1) Add 'random' mode to team driver, from Jiri Pirko and Eric Dumazet.
2) Make it so that any driver that supports configuration of multiple MAC addresses can provide the forwarding database add and del calls by providing a default implementation and hooking that up if the driver doesn't have an explicit set of handlers. From Vlad Yasevich.
3) Support GSO segmentation over tunnels and other encapsulating devices such as VXLAN, from Pravin B Shelar.
4) Support L2 GRE tunnels in the flow dissector, from Michael Dalton.
5) Implement Tail Loss Probe (TLP) detection in TCP, from Nandita Dukkipati.
6) In the PHY layer, allow supporting wake-on-lan in situations where the PHY registers have to be written for it to be configured.
Use it to support wake-on-lan in mv643xx_eth.
From Michael Stapelberg.
7) Significantly improve firewire IPV6 support, from YOSHIFUJI Hideaki.
8) Allow multiple packets to be sent in a single transmission using network coding in batman-adv, from Martin Hundebøll.
9) Add support for T5 cxgb4 chips, from Santosh Rastapur.
10) Generalize the VXLAN forwarding tables so that there is more flexibility in configuring various aspects of the endpoints. From David Stevens.
11) Support RSS and TSO in hardware over GRE tunnels in bnx2x driver, from Dmitry Kravkov.
12) Zero copy support in nfnetlink_queue, from Eric Dumazet and Pablo Neira Ayuso.
13) Start adding networking selftests.
14) In situations of overload on the same AF_PACKET fanout socket, or per-cpu packet receive queue, minimize drop by distributing the load to other cpus/fanouts. From Willem de Bruijn and Eric Dumazet.
15) Add support for new payload offset BPF instruction, from Daniel Borkmann.
16) Convert several drivers over to module_platform_driver(), from Sachin Kamat.
17) Provide a minimal BPF JIT image disassembler userspace tool, from Daniel Borkmann.
18) Rewrite F-RTO implementation in TCP to match the final specification of it in RFC4138 and RFC5682. From Yuchung Cheng.
19) Provide netlink socket diag of netlink sockets ("Yo dawg, I hear you like netlink, so I implemented netlink dumping of netlink sockets.") From Andrey Vagin.
20) Remove ugly passing of rtnetlink attributes into rtnl_doit functions, from Thomas Graf.
21) Allow userspace to be able to see if a configuration change occurs in the middle of an address or device list dump, from Nicolas Dichtel.
22) Support RFC3168 ECN protection for ipv6 fragments, from Hannes Frederic Sowa.
23) Increase accuracy of packet length used by packet scheduler, from Jason Wang.
24) Beginning set of changes to make ipv4/ipv6 fragment handling more scalable and less susceptible to overload and locking contention, from Jesper Dangaard Brouer.
25) Get rid of using non-type-safe NLMSG_* macros and use nlmsg_*() instead. From Hong Zhiguo.
26) Optimize route usage in IPVS by avoiding reference counting where possible, from Julian Anastasov.
27) Convert IPVS schedulers to RCU, also from Julian Anastasov.
28) Support cpu fanouts in xt_NFQUEUE netfilter target, from Holger Eitzenberger.
29) Network namespace support for nf_log, ebt_log, xt_LOG, ipt_ULOG, nfnetlink_log, and nfnetlink_queue. From Gao feng.
30) Implement RFC3168 ECN protection, from Hannes Frederic Sowa.
31) Support several new r8169 chips, from Hayes Wang.
32) Support tokenized interface identifiers in ipv6, from Daniel Borkmann.
33) Use usbnet_link_change() helper in USB net driver, from Ming Lei.
34) Add 802.1ad vlan offload support, from Patrick McHardy.
35) Support mmap() based netlink communication, also from Patrick McHardy.
36) Support HW timestamping in mlx4 driver, from Amir Vadai.
37) Rationalize AF_PACKET packet timestamping when transmitting, from Willem de Bruijn and Daniel Borkmann.
38) Bring parity to what's provided by /proc/net/packet socket dumping and the info provided by netlink socket dumping of AF_PACKET sockets. From Nicolas Dichtel.
39) Fix peeking beyond zero sized SKBs in AF_UNIX, from Benjamin Poirier"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits)
  filter: fix va_list build error
  af_unix: fix a fatal race with bit fields
  bnx2x: Prevent memory leak when cnic is absent
  bnx2x: correct reading of speed capabilities
  net: sctp: attribute printl with __printf for gcc fmt checks
  netlink: kconfig: move mmap i/o into netlink kconfig
  netpoll: convert mutex into a semaphore
  netlink: Fix skb ref counting.
  net_sched: act_ipt forward compat with xtables
  mlx4_en: fix a build error on 32bit arches
  Revert "bnx2x: allow nvram test to run when device is down"
  bridge: avoid OOPS if root port not found
  drivers: net: cpsw: fix kernel warn on cpsw irq enable
  sh_eth: use random MAC address if no valid one supplied
  3c509.c: call SET_NETDEV_DEV for all device types (ISA/ISAPnP/EISA)
  tg3: fix to append hardware time stamping flags
  unix/stream: fix peeking with an offset larger than data in queue
  unix/dgram: fix peeking with an offset larger than data in queue
  unix/dgram: peek beyond 0-sized skbs
  openvswitch: Remove unneeded ovs_netdev_get_ifindex()
  ...
|
| #
ee1bec9b |
| 01-May-2013 |
Daniel Borkmann <dborkman@redhat.com> |
netlink: kconfig: move mmap i/o into netlink kconfig
Currently, in menuconfig, Netlink's new mmaped IO is the very first entry under the ``Networking support'' item and comes even before ``Networking options'':
[ ] Netlink: mmaped IO
Networking options --->
...
Let's move this into ``Networking options'' under netlink's Kconfig, since this might be more appropriate. Introduced by commit ccdfcc398 (``netlink: mmaped netlink: ring setup'').
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| #
bf61c884 |
| 01-May-2013 |
Dmitry Torokhov <dmitry.torokhov@gmail.com> |
Merge branch 'next' into for-linus
Prepare first set of updates for 3.10 merge window.
|
| #
f53f292e |
| 20-Apr-2013 |
H. Peter Anvin <hpa@linux.intel.com> |
Merge remote-tracking branch 'efi/chainsaw' into x86/efi
Resolved Conflicts: drivers/firmware/efivars.c fs/efivarsfs/file.c
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
|
| #
42bbcb78 |
| 19-Apr-2013 |
David S. Miller <davem@davemloft.net> |
Merge branch 'netlink-mmap'
Patrick McHardy says:
==================== The following patches contain an implementation of memory mapped I/O for netlink. The implementation is modelled after AF_PACKET memory mapped I/O with a few differences:
- In order to perform memory mapped I/O to userspace, the kernel allocates skbs with the data area pointing to the data area of the mapped frames. All netlink subsystems assume a linear data area, so for the sake of simplicity, the mapped data area is not attached to the paged area but to skb->data. This requires introduction of a special skb allocation function that just allocates an skb head without the data area. Since this is a quite rare use case, I introduced a new function based on __alloc_skb instead of splitting it up into head and data allocation. The alternative would be to introduce an __alloc_skb_head and __alloc_skb_data function, which would actually be useful for a specific error case in memory mapped netlink, but would require a couple of extra instructions for the common skb allocation case, so it doesn't really seem worth it.
In order to get the destination memory area for skb->data before message construction, memory mapped netlink I/O needs to look up the destination socket during allocation instead of during transmission because the ring is owned by the receiving socket/process. A special skb allocation function (netlink_alloc_skb) taking the destination pid as an argument is used for this. All subsystems that want to support memory mapped I/O need to use this function; automatic fallback to the receive queue happens for unconverted subsystems. Dumps automatically use memory mapped I/O if the receiving socket has enabled it.
The visible effect of looking up the destination socket during allocation instead of transmission is that message ordering in userspace might change in case allocation and transmission aren't performed atomically. This usually doesn't matter since most subsystems have a BKL-like lock like the rtnl mutex. To my knowledge, the only currently existing case where it might matter is nfnetlink_queue combined with the recently introduced batched verdicts, but a) that subsystem already includes sequence numbers which allow userspace to reorder messages in case it cares to, and the reordering window is quite small, and b) with memory mapped transmission, batching can be performed in a subsystem-independent manner.
- AF_NETLINK contains flow control for database dumps. With regular I/O, dump continuations are triggered based on the socket's receive queue space and by recvmsg() calls. Since with memory mapped I/O there are no recvmsg() calls under normal operation, this is done in netlink_poll(), under the assumption that userspace has processed all pending frames before invoking poll(), thus the ring is expected to have room for new messages. Dumps currently don't benefit as much as they could from memory mapped I/O because each single continuation requires a poll() call. A more aggressive approach seems like a good idea to me, especially in case the socket is not subscribed to any multicast groups (IOW only receiving explicitly requested data).
Besides that, the memory mapped netlink implementation extends the states defined by AF_PACKET between userspace and the kernel by a SKIP status. This is intended for the case that userspace wants to queue frames (specifically when using nfnetlink_queue, an IDS and stream reassembly, requested by Eric Leblond) for a longer period of time. The kernel skips over all frames marked with SKIP when looking for unused frames and only fails when not finding a free frame or when having skipped the entire ring.
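The SKIP handling described above can be pictured with a small userspace simulation. This is only a sketch of one reading of that paragraph, not the kernel code: the `FrameStatus` names and the `find_unused_frame` helper are hypothetical and exist only for this illustration.

```python
from enum import Enum
from typing import List, Optional


class FrameStatus(Enum):
    # Hypothetical status names for this sketch, not the kernel's identifiers.
    UNUSED = 0  # free, the kernel may place a message here
    VALID = 1   # holds a message userspace has not released yet
    SKIP = 2    # userspace keeps the frame queued for a longer time


def find_unused_frame(ring: List[FrameStatus], head: int) -> Optional[int]:
    """Return the index of the first UNUSED frame at or after head.

    Frames marked SKIP are stepped over. The search fails (returns None)
    when it reaches a frame that is neither UNUSED nor SKIP, or when every
    frame in the ring has been skipped.
    """
    size = len(ring)
    for offset in range(size):
        idx = (head + offset) % size
        status = ring[idx]
        if status is FrameStatus.UNUSED:
            return idx
        if status is not FrameStatus.SKIP:
            return None  # next frame in ring order is still in use
    return None  # walked the whole ring and only saw SKIP frames
```

With a ring of SKIP, SKIP, UNUSED, VALID and the head at index 0, the search lands on index 2. A ring made up entirely of SKIP frames yields no frame, which corresponds to the "skipped the entire ring" failure case.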
Also noteworthy is memory mapped sendmsg: the kernel performs validation of messages before accepting and processing them. In order to prevent userspace from changing the message's contents after validation, the kernel checks that the ring is only mapped once and the file descriptor is not shared (in order to avoid having userspace set up another mapping after the first mentioned check). If either condition is not true, the message is copied to an allocated skb and processed as with regular I/O. I'd especially appreciate review of this part since I'm not really versed in memory, file and process management.
The remaining interesting details are included in the changelogs of the individual patches and the documentation, so I won't repeat them here.
As an example, nfnetlink_queue is converted to support memory mapped I/O. Other subsystems that would probably benefit are nfnetlink_log, audit and maybe ISCSI, not sure.
Following are some numbers collected by Florian Westphal based on a slightly older version, which included an experimental patch for the nfnetlink_queue ordering issue.
===
Test hardware is a 12-core machine, Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz. ixgbe interfaces are used (i.e., multiqueue nics). irqs are distributed across the cpus.
I've made several tests.
The simple one consists of 3GBit UDP traffic, packets are 1500 bytes in size (i.e., no fragmentation), with a single nfqueue and the test client programs in libmnl examples directory. Packets are sent from one /24 net to another /24 net, i.e. there are a few hundred flows active at any given time.
I've also tested with snort, but I disabled all rules. 6Gbit UDP traffic is generated in the snort case, and 6 nfqueues are used (i.e., 6 snorts run in parallel).
I've tested with 3 different kernels, all based on 3.7.1.
- 3.7.1, without the mmap patches
- 3.7.1, with Patrick's mmap patches
- 3.7.1, with mmap patches and extended spinlock to ensure packet ids are monotonically increasing and cannot be re-ordered. This is what we currently ship in our product.
[ the spinlock that is extended is the per nfqueue spinlock, it will be held from the time the netlink skb is allocated until the netlink skb is sent to userspace:
http://1984.lsi.us.es/git/nf-next/commit/?h=mmap-netlink3&id=b8eb19c46650fef4e9e4fe53f367f99bbf72afc9 ]
snort is normally used in "batch mode", i.e., after processing 25 packets a single "batch verdict" is sent to accept the packets seen so far. "mmap snort" means RX_RING + sendmsg(), i.e. TX_RING is not used at this time (except where noted below).
One reason is that snort has a reload thread, so kernel needs to copy; also in the snort case no payload rewrite takes place, so compared to the rx path the tx path is cheap.
Results:
3.7.1, without mmap patches, i.e. recv()+sendmsg() for everyone
  nfq-queue: 1.7 gbit out
  snort-recv-batch-25 5.1 gbit out
  snort-recv-no-batch 3.1 gbit out
3.7.1 + mmap + without extended spinlocked section
  nfq-queue: 1.7 gbit out (recv/sendmsg)
  nfq-queue-mmap: 2.4 gbit out
  snort-mmap-batch-25 5.6 gbit out (warning: since ids can be re-ordered, this version is "broken").
  snort-recv-batch-25 5.1 gbit out
  snort-mmap-no-batch 4.6 gbit out (i.e., one verdict per packet)
Kernel 3.7.1 + mmap + extended spinlock section:
  nfq-queue: 1.4 gbit out
  nfq-queue-mmap: 2.3 gbit out
  snort: 5.6 gbit out
Conclusions:
- The "extended spinlocked section" hurts performance in the single queue case; with 6 snorts there is no measurable slowdown.
- I tried to re-write the mmap-snort to work without batch verdicts, but results were not very encouraging:
kernel 3.7.1 + mmap (without extended spinlocked section):
  snort-mmap-batch-25 5.6 gbit out (what we currently ship)
  snort-recv-batch-25 5.1 gbit out (without using mmap)
  snort-mmap-batch-1 4.6 gbit out (with mmap but without batch verdicts)
  snort-mmap-txring-25 5.2 gbit out (with mmap but without batch verdicts)
  snort-mmap-txring-1 4.6 gbit out (with mmap but without batch verdicts)
The difference between the last two is that in the txring-25 case, we put a verdict into the tx ring after every packet, but will only invoke sendmsg(, NULL, 0) after processing 25 packets. So the only difference is the number of sendmsg calls/context switches.
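To make the difference in sendmsg() calls concrete, here is a rough model of the two txring variants. It is an illustration only: `count_flushes` is a hypothetical helper, and flushing a trailing partial batch is an assumption of this sketch rather than something stated in the test description.

```python
def count_flushes(packets: int, flush_every: int) -> int:
    """Count sendmsg() flushes when one verdict is queued per packet.

    Each packet puts one verdict into the TX ring; sendmsg() is only
    invoked once flush_every verdicts are pending.
    """
    if packets < 0 or flush_every <= 0:
        raise ValueError("packets must be >= 0 and flush_every must be > 0")
    pending = 0
    flushes = 0
    for _ in range(packets):
        pending += 1  # verdict written into the TX ring
        if pending == flush_every:
            flushes += 1  # one sendmsg() hands the queued verdicts to the kernel
            pending = 0
    if pending:
        flushes += 1  # assumption: a trailing partial batch is flushed as well
    return flushes
```

For 1000 packets, `count_flushes(1000, 25)` gives 40 calls while `count_flushes(1000, 1)` gives 1000, which is the 25x gap in system calls between the txring-25 and txring-1 cases.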
So, i.o.w, kernel 3.7.1 + mmap + the extra locking crap is faster than 3.7.1 + mmap-without-extra-locking and single-verdict-per packet. ====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| #
ccdfcc39 |
| 17-Apr-2013 |
Patrick McHardy <kaber@trash.net> |
netlink: mmaped netlink: ring setup
Add support for mmap'ed RX and TX ring setup and teardown based on the af_packet.c code. The following patches will use this to add the real mmap'ed receive and transmit functionality.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|