History log of /linux-5.15/net/Kconfig (Results 1 – 25 of 711)
Revision Date Author Comments
# f2006e27 12-Jul-2013 Thomas Gleixner <tglx@linutronix.de>

Merge branch 'linus' into timers/urgent

Get upstream changes so we can apply fixes against them

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 496322bc 10-Jul-2013 Linus Torvalds <torvalds@linux-foundation.org>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next

Pull networking updates from David Miller:
"This is a re-do of the net-next pull request for the current merge
window. The only difference from the one I made the other day is that
this has Eliezer's interface renames and the timeout handling changes
made based upon your feedback, as well as a few bug fixes that have
trickled in.

Highlights:

1) Low latency device polling, eliminating the cost of interrupt
handling and context switches. Allows direct polling of a network
device from socket operations, such as recvmsg() and poll().

Currently ixgbe, mlx4, and bnx2x support this feature.

Full high level description, performance numbers, and design in
commit 0a4db187a999 ("Merge branch 'll_poll'")

From Eliezer Tamir.

2) With the routing cache removed, ip_check_mc_rcu() gets exercised
more than ever before in the case where we have lots of multicast
addresses. Use a hash table instead of a simple linked list, from
Eric Dumazet.

3) Add driver for Atheros QCA98xx 802.11ac wireless devices, from
Bartosz Markowski, Janusz Dziedzic, Kalle Valo, Marek Kwaczynski,
Marek Puzyniak, Michal Kazior, and Sujith Manoharan.

4) Support reporting the TUN device persist flag to userspace, from
Pavel Emelyanov.

5) Allow controlling network device VF link state using netlink, from
Rony Efraim.

6) Support GRE tunneling in openvswitch, from Pravin B Shelar.

7) Adjust SOCK_MIN_RCVBUF and SOCK_MIN_SNDBUF for modern times, from
Daniel Borkmann and Eric Dumazet.

8) Allow controlling of TCP quickack behavior on a per-route basis,
from Cong Wang.

9) Several bug fixes and improvements to vxlan from Stephen
Hemminger, Pravin B Shelar, and Mike Rapoport. In particular,
support receiving on multiple UDP ports.

10) Major cleanups, particularly in the area of debugging and cookie
lifetime handling, to the SCTP protocol code. From Daniel
Borkmann.

11) Allow packets to cross network namespaces when traversing tunnel
devices. From Nicolas Dichtel.

12) Allow monitoring netlink traffic via AF_PACKET sockets, in a
manner akin to how we monitor real network traffic via ptype_all.
From Daniel Borkmann.

13) Several bug fixes and improvements for the new alx device driver,
from Johannes Berg.

14) Fix scalability issues in the netem packet scheduler's time queue,
by using an rbtree. From Eric Dumazet.

15) Several bug fixes in TCP loss recovery handling, from Yuchung
Cheng.

16) Add support for GSO segmentation of MPLS packets, from Simon
Horman.

17) Make network notifiers have a real data type for the opaque
pointer that's passed into them. Use this to properly handle
network device flag changes in arp_netdev_event(). From Jiri
Pirko and Timo Teräs.

18) Convert several drivers over to module_pci_driver(), from Peter
Huewe.

19) tcp_fixup_rcvbuf() can loop 500 times over loopback, just use an
O(1) calculation instead. From Eric Dumazet.

20) Support setting of explicit tunnel peer addresses in ipv6, just
like ipv4. From Nicolas Dichtel.

21) Protect x86 BPF JIT against spraying attacks, from Eric Dumazet.

22) Prevent a single high rate flow from overrunning an individual cpu
during RX packet processing via selective flow shedding. From
Willem de Bruijn.

23) Don't use spinlocks in TCP md5 signing fast paths, from Eric
Dumazet.

24) Don't just drop GSO packets which are above the TBF scheduler's
burst limit, chop them up so they are in-bounds instead. Also
from Eric Dumazet.

25) VLAN offloads are missed when configured on top of a bridge, fix
from Vlad Yasevich.

26) Support IPV6 in ping sockets. From Lorenzo Colitti.

27) Receive flow steering targets should be updated at poll() time
too, from David Majnemer.

28) Fix several corner case regressions in PMTU/redirect handling due
to the routing cache removal, from Timo Teräs.

29) We have to be mindful of ipv4 mapped ipv6 sockets in
udp_v6_push_pending_frames(). From Hannes Frederic Sowa.

30) Fix L2TP sequence number handling bugs, from James Chapman."

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1214 commits)
drivers/net: caif: fix wrong rtnl_is_locked() usage
drivers/net: enic: release rtnl_lock on error-path
vhost-net: fix use-after-free in vhost_net_flush
net: mv643xx_eth: do not use port number as platform device id
net: sctp: confirm route during forward progress
virtio_net: fix race in RX VQ processing
virtio: support unlocked queue poll
net/cadence/macb: fix bug/typo in extracting gem_irq_read_clear bit
Documentation: Fix references to defunct linux-net@vger.kernel.org
net/fs: change busy poll time accounting
net: rename low latency sockets functions to busy poll
bridge: fix some kernel warning in multicast timer
sfc: Fix memory leak when discarding scattered packets
sit: fix tunnel update via netlink
dt:net:stmmac: Add dt specific phy reset callback support.
dt:net:stmmac: Add support to dwmac version 3.610 and 3.710
dt:net:stmmac: Allocate platform data only if its NULL.
net:stmmac: fix memleak in the open method
ipv6: rt6_check_neigh should successfully verify neigh if no NUD information are available
net: ipv6: fix wrong ping_v6_sendmsg return value
...


# fe3c22bd 02-Jul-2013 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'char-misc-3.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char/misc updates from Greg KH:
"Here's the big char/misc driver tree merge for 3.11-rc1

A variety of different driver patches here. All of these have been in
linux-next for a while, and the networking patches were acked-by David
Miller, as it made sense for those patches to come through this tree"

* tag 'char-misc-3.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (102 commits)
Revert "char: misc: assign file->private_data in all cases"
drivers: uio_pdrv_genirq: Use of_match_ptr() macro
mei: check whether hw start has succeeded
mei: check if the hardware reset succeeded
mei: mei_cl_connect: don't multiply the timeout twice
mei: do not override a client writing state when buffering
mei: move mei_cl_irq_write_complete to client.c
UIO: Fix concurrency issue
drivers: uio_dmem_genirq: Use of_match_ptr() macro
char: misc: assign file->private_data in all cases
drivers: hv: allocate synic structures before hv_synic_init()
drivers: hv: check interrupt mask before read_index
vme: vme_tsi148.c: fix error return code in tsi148_probe()
FMC: fix error handling in probe() function
fmc: avoid readl/writel namespace conflict
FMC: NULL dereference on allocation failure
UIO: fix uio_pdrv_genirq with device tree but no interrupt
UIO: allow binding uio_pdrv_genirq.c to devices using command line option
FMC: add a char-device mezzanine driver
FMC: add a driver to write mezzanine EEPROM
...


# 27eb2c4b 02-Jul-2013 Dmitry Torokhov <dmitry.torokhov@gmail.com>

Merge branch 'next' into for-linus

Prepare first set of updates for 3.11 merge window.


# 31881d74 28-Jun-2013 Dmitry Torokhov <dmitry.torokhov@gmail.com>

Merge branch 'for-next' of git://github.com/rydberg/linux into next

Pull in changes from Henrik: "a trivial MT documentation fix".


# 89bf1b5a 14-Jun-2013 Eliezer Tamir <eliezer.tamir@linux.intel.com>

net: remove NET_LL_RX_POLL config menu

Remove NET_LL_RX_POLL from the config menu.
Change default to y.
Busy polling still needs to be enabled at run time.

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9a3c71aa 14-Jun-2013 Eliezer Tamir <eliezer.tamir@linux.intel.com>

net: convert low latency sockets to sched_clock()

Use sched_clock() instead of get_cycles().
We can use sched_clock() because we don't care much about accuracy.
Remove the dependency on X86_TSC

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
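
As a purely illustrative userspace analogue of the busy-wait deadline logic
this patch adjusts: CLOCK_MONOTONIC stands in for sched_clock(), and the
budget value and data_ready() callback are invented for the sketch, not taken
from the patch.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* A cheap, monotonic nanosecond counter; absolute accuracy does not matter,
 * only that it moves forward, which is why sched_clock() is good enough. */
static uint64_t now_ns(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/* Busy-wait until data_ready() reports data or the budget (in us) expires. */
static bool busy_wait(unsigned int budget_us, bool (*data_ready)(void))
{
        uint64_t end = now_ns() + (uint64_t)budget_us * 1000ull;

        while (!data_ready()) {
                if (now_ns() > end)
                        return false;   /* give up and fall back to sleeping */
        }
        return true;
}

static bool never_ready(void) { return false; }

int main(void)
{
        printf("budget expired: %s\n", busy_wait(50, never_ready) ? "no" : "yes");
        return 0;
}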


# 0a4db187 11-Jun-2013 David S. Miller <davem@davemloft.net>

Merge branch 'll_poll'

Eliezer Tamir says:

====================
This patch set adds the ability for the socket layer code to
poll directly on an Ethernet device's RX queue.
This eliminates the cost of the interrupt and context switch
and with proper tuning allows us to get very close to the HW latency.

This is a follow up to Jesse Brandeburg's Kernel Plumbers talk from
last year
http://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/09/2012-lpc-Low-Latency-Sockets-slides-brandeburg.pdf

Patch 1 adds a napi_id and a hashing mechanism to lookup a napi by id.
Patch 2 adds an ndo_ll_poll method and the code that supports it.
Patch 3 adds support for busy-polling on UDP sockets.
Patch 4 adds support for TCP.
Patch 5 adds the ixgbe driver code implementing ndo_ll_poll.
Patch 6 adds additional statistics to the ixgbe driver for ndo_ll_poll.

Performance numbers:
                setup                        TCP_RR               UDP_RR
kernel   Config     C3/6  rx-usecs     tps   cpu%  S.dem    tps   cpu%  S.dem
patched  optimized  on    100          87k   3.13  11.4     94k   3.17  10.7
patched  optimized  on    0            71k   3.12  14.0     84k   3.19  12.0
patched  optimized  on    adaptive     80k   3.13  12.5     90k   3.46  12.2
patched  typical    on    100          72k   3.13  14.0     79k   3.17  12.8
patched  typical    on    0            60k   2.13  16.5     71k   3.18  14.0
patched  typical    on    adaptive     67k   3.51  16.7     75k   3.36  14.5
3.9      optimized  on    adaptive     25k   1.0   12.7     28k   0.98  11.2
3.9      typical    off   0            48k   1.09   7.3     52k   1.11   4.18
3.9      typical    off   adaptive     35k   1.12   4.08    38k   0.65   5.49
3.9      optimized  off   adaptive     40k   0.82   4.83    43k   0.70   5.23
3.9      optimized  off   0            57k   1.17   4.08    62k   1.04   3.95

Test setup details:
Machines: each with two Intel Xeon 2680 CPUs and X520 (82599) optical
NICs
Tests: Netperf tcp_rr and udp_rr, 1 byte (round trips per second)
Kernel: unmodified 3.9 and patched 3.9
Config: typical is derived from RH6.2, optimized is a stripped down
config.
Interrupt coalescing (ethtool rx-usecs) settings: 0=off, 1=adaptive,
100 us
When C3/6 states were turned on (via BIOS) the performance governor
was used.

These performance numbers were measured with v2 of the patch set.
Performance of the optimized config with an rx-usecs setting of 100
(the first line in the table above) was tracked during the evolution
of the patches and has never varied by more than 1%.

Design:
A global hash table that allows us to look up a struct napi by a
unique id was added.

A napi_id field was added to both struct sk_buff and struct sock.
This is used to track which NAPI we need to poll for a specific
socket.

The device driver marks every incoming skb with this id.
This is propagated to the sk when the socket is looked up in the
protocol handler.

When the socket code does not find any more data on the socket queue,
it now may call ndo_ll_poll which will crank the device's rx queue and
feed incoming packets to the stack directly from the context of the
socket.

A sysctl value (net.core.low_latency_poll) controls how many
microseconds we busy-wait before giving up. (Setting it to 0 globally
disables busy-polling.)
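
As a rough userspace illustration of that knob, a minimal sketch that writes
the sysctl through procfs. The path is assumed from the sysctl name quoted
above; the knob was renamed in later kernels, so the exact file name may
differ. Typically needs root.

#include <stdio.h>

int main(void)
{
        const char *path = "/proc/sys/net/core/low_latency_poll";
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                return 1;
        }
        fprintf(f, "50\n");     /* busy-wait up to 50 microseconds */
        fclose(f);
        return 0;
}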

Locking:

1. Locking between napi poll and ndo_ll_poll:
Since what needs to be locked between a device's NAPI poll and
ndo_ll_poll is highly device/configuration dependent, we do this
inside the Ethernet driver.
For example, when packets for high priority connections are sent to
separate rx queues, you might not need locking between napi poll and
ndo_ll_poll at all.

For ixgbe we only lock the RX queue.
ndo_ll_poll does not touch the interrupt state or the TX queues.
(earlier versions of this patchset did touch them,
but this design is simpler and works better.)

If a queue is actively polled by a socket (on another CPU) napi poll
will not service it, but will wait until the queue can be locked
and cleaned before doing a napi_complete().
If a socket can't lock the queue because another CPU has it,
either from napi or from another socket polling on the queue,
the socket code can busy wait on the socket's skb queue.

Ndo_ll_poll does not have preferential treatment for the data from the
calling socket vs. data from others, so if another CPU is polling,
you will see your data on this socket's queue when it arrives.

Ndo_ll_poll is called with local BHs disabled, so it won't race on
the same CPU with net_rx_action, which calls the napi poll method.

2. Napi_hash
The napi hash mechanism uses RCU.
napi_by_id() must be called under rcu_read_lock().
After a call to napi_hash_del(), caller must take care to wait an rcu
grace period before freeing the memory containing the napi struct.
(Ixgbe already had this because the queue vector structure uses rcu to
protect the statistics counters in it.)
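
A toy model of the lookup-by-id idea, using plain chained hashing and no RCU;
this is only an illustration of the design, not the kernel's napi_hash code:

#include <stdio.h>

#define NR_BUCKETS 256

struct napi_model {
        unsigned int id;
        struct napi_model *next;        /* bucket chain */
};

static struct napi_model *buckets[NR_BUCKETS];

static void napi_model_add(struct napi_model *n)
{
        unsigned int b = n->id % NR_BUCKETS;

        n->next = buckets[b];
        buckets[b] = n;
}

static struct napi_model *napi_model_by_id(unsigned int id)
{
        struct napi_model *n;

        for (n = buckets[id % NR_BUCKETS]; n; n = n->next)
                if (n->id == id)
                        return n;
        return NULL;
}

int main(void)
{
        struct napi_model a = { .id = 42 };

        napi_model_add(&a);
        printf("lookup 42: %s\n", napi_model_by_id(42) == &a ? "found" : "missing");
        return 0;
}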

How to test:

1. The patchset should apply cleanly to net-next.
(don't forget to configure NET_LL_RX_POLL).

2. The ethtool -c setting for rx-usecs should be on the order of 100.

3. Use ethtool -K to disable GRO and LRO.
(You are encouraged to try it both ways. If you find that your
workload does better with GRO on, do tell us.)

4. The sysctl value net.core.low_latency_poll controls how long (in us)
to busy-wait for more data. You are encouraged to play with this and see
what works for you. The default is now 0, so you need to set it to turn
the feature on. I recommend a value around 50.

5. The benchmark thread and the IRQ should be bound to separate cores.
Both cores should be on the same NUMA node as the NIC.
When the app and the IRQ run on the same CPU you get a small penalty.
If interrupt coalescing is set to a low value this penalty can be very
large. (See the CPU-pinning sketch after this list.)

6. If you suspect that your machine is not configured properly,
use numademo to make sure that the CPU-to-memory bandwidth is OK.
numademo 128m memcpy local copy numbers should be more than
8GB/s on a properly configured machine.
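
For item 5 above, a small sketch of one way to pin the benchmark process to a
single core from C; the core number is only an example, and the IRQ side is
pinned separately (e.g. via /proc/irq/<n>/smp_affinity):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(2, &set);               /* run on CPU 2 only */
        if (sched_setaffinity(0, sizeof(set), &set)) {
                perror("sched_setaffinity");
                return 1;
        }
        /* ... run the netperf-style benchmark loop from here ... */
        return 0;
}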

Change log:
v10
- removed select/poll support. (we will work on this some more and try again)
v9
- correct sysctl proc_handler, reported by Eric Dumazet and Amir Vadai.
- more int -> bool changes, reported by Eric Dumazet.
- better mask testing in sock_poll(), reported by Eric Dumazet.

v8
- split out udp and select/poll into separate patches.
what used to be patch 2/5 is now three patches.
- type corrections from Amir Vadai and Cong Wang:
one unsigned long that was left when changing to cycles_t
int -> bool
- more detailed patch descriptions.

v7
- suggested by Ben Hutchings and Eric Dumazet:
type fixes, static for globals in net/core.c,
avoid napi_id collisions in napi_hash_add()

v6
- many small fixes suggested by Eric Dumazet:
data locality, typos, documentation
protect napi_hash insert/delete with a spinlock (napi_gen_id is no
longer atomic_t since it's only accessed with the spinlock held.)
- added IPv6 TCP and UDP support (only minimally tested)

v5
- corrections suggested by Ben Hutchings:
fixed typos, moved the config option and sysctl value from IPv4 to net
- moved sk_mark_ll() to the protocol handlers
- removed global id mechanism, replaced with a hashed napi_id.
based on code sample from Eric Dumazet
Note that ixgbe_free_q_vector() already waits an rcu grace period
before freeing the q_vector, so nothing additional needs to be done
when adding a call to napi_hash_del().
- simple poll/select support

v4
- removed separate config option for TCP as suggested by Eric Dumazet.
- added linux mib counter for packets received through the low latency path,
as suggested by Andi Kleen.
- re-allow module unloading, remove module param, use a global generation id
instead to prevent the use of a stale napi pointer, as suggested
by Eric Dumazet
- updated Documentation/networking/ip-sysctl.txt text

v3
- coding style changes suggested by Dave Miller

v2
- the sysctl knob is now in microseconds. The default value is now 0 (off).
- for now the code depends at configure time on CONFIG_X86_TSC
- the napi reference in struct skb is now a union with the dma cookie
since the former is only used on RX and the latter on TX,
as suggested by Eric Dumazet.
- we do a better job at honoring non-blocking operations.
- removed busy-polling support for tcp_read_sock()
- remove dynamic disabling of GRO
- coding style fixes
- disallow unloading the device module after the feature has been used

Credit:
Jesse Brandeburg, Arun Chekhov Ilango, Julie Cummings,
Alexander Duyck, Eric Geisler, Jason Neighbors, Yadong Li,
Mike Polehn, Anil Vasudevan, Don Wood
Special thanks for finding bugs in earlier versions:
Willem de Bruijn and Andi Kleen
====================

Signed-off-by: David S. Miller <davem@davemloft.net>


# 06021292 10-Jun-2013 Eliezer Tamir <eliezer.tamir@linux.intel.com>

net: add low latency socket poll

Adds an ndo_ll_poll method and the code that supports it.
This method can be used by low latency applications to busy-poll
Ethernet device queues directly from the socket code.
sysctl_net_ll_poll controls how many microseconds to poll.
Default is zero (disabled).
Individual protocol support will be added by subsequent patches.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
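
To show the shape of such an optional per-device hook without quoting kernel
code, a small self-contained model of a callback table with an ll_poll-style
member; all names here are invented for illustration, the real hook being a
new member of struct net_device_ops:

#include <stdio.h>

struct dev_model;

struct dev_ops_model {
        /* Returns number of packets fed to the stack, 0 if none. */
        int (*ll_poll)(struct dev_model *dev);
};

struct dev_model {
        const struct dev_ops_model *ops;
};

static int maybe_busy_poll(struct dev_model *dev)
{
        if (!dev->ops || !dev->ops->ll_poll)
                return 0;               /* driver has no busy-poll support */
        return dev->ops->ll_poll(dev);
}

static int demo_ll_poll(struct dev_model *dev)
{
        (void)dev;
        return 1;                       /* pretend one packet was processed */
}

static const struct dev_ops_model demo_ops = { .ll_poll = demo_ll_poll };

int main(void)
{
        struct dev_model dev = { .ops = &demo_ops };

        printf("polled %d packet(s)\n", maybe_busy_poll(&dev));
        return 0;
}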


# 4cd5773a 04-Jun-2013 Andy Shevchenko <andy.shevchenko@gmail.com>

net: core: move mac_pton() to lib/net_utils.c

Since we have at least one user of this function outside of CONFIG_NET
scope, we have to provide this function independently. The proposed
solution is to move it under lib/net_utils.c with a corresponding
configuration variable and select it wherever it is needed.

Signed-off-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
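
For reference, the contract mac_pton() provides is parsing a textual MAC
address ("xx:xx:xx:xx:xx:xx") into six bytes. A hedged userspace
re-implementation of that contract, for illustration only and not the code
being moved:

#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>

static bool mac_pton_sketch(const char *s, unsigned char mac[6])
{
        for (int i = 0; i < 6; i++) {
                unsigned int byte;

                if (!isxdigit((unsigned char)s[0]) ||
                    !isxdigit((unsigned char)s[1]))
                        return false;
                if (sscanf(s, "%2x", &byte) != 1)
                        return false;
                mac[i] = (unsigned char)byte;
                s += 2;
                if (i < 5 && *s++ != ':')       /* require ':' separators */
                        return false;
        }
        return *s == '\0';
}

int main(void)
{
        unsigned char mac[6];

        if (mac_pton_sketch("00:11:22:33:44:55", mac))
                printf("last byte: %02x\n", mac[5]);
        return 0;
}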


# 6e9041c6 28-May-2013 Jiri Kosina <jkosina@suse.cz>

Merge branch 'master' into for-next


# 51047840 28-May-2013 David S. Miller <davem@davemloft.net>

Merge branch 'mpls_gso'

Simon Horman says:

====================
In the case where a non-MPLS packet is received and an MPLS stack is
added it may well be the case that the original skb is GSO but the
NIC used for transmit does not support GSO of MPLS packets.

The aim of this short series is to provide GSO in software for MPLS packets
whose skbs are GSO.

Change since v4:

Update first patch of the series to use 16 bits for all *_headers
rather than just inner_*_headers

Simon Horman (2):
net: Use 16bits for *_headers fields of struct skbuff
MPLS: Add limited GSO support
====================

Signed-off-by: David S. Miller <davem@davemloft.net>


# 0d89d203 23-May-2013 Simon Horman <horms@verge.net.au>

MPLS: Add limited GSO support

In the case where a non-MPLS packet is received and an MPLS stack is
added it may well be the case that the original skb is GSO but the
NIC used for transmit does not support GSO of MPLS packets.

The aim of this code is to provide GSO in software for MPLS packets
whose skbs are GSO.

SKB Usage:

When an implementation adds an MPLS stack to a non-MPLS packet it should do
the following to skb metadata:

* Set skb->inner_protocol to the old non-MPLS ethertype of the packet.
skb->inner_protocol is added by this patch.

* Set skb->protocol to the new MPLS ethertype of the packet.

* Set skb->network_header to correspond to the
end of the L3 header, including the MPLS label stack.
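
To make "the MPLS label stack" concrete, a small illustrative program that
packs one label stack entry (20-bit label, 3-bit traffic class,
bottom-of-stack bit, TTL) in network byte order. This shows the wire format
only; it is not code from this patch:

#include <arpa/inet.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* One MPLS label stack entry as it sits in front of the L3 header. */
static uint32_t mpls_lse(uint32_t label, uint8_t tc, int bos, uint8_t ttl)
{
        uint32_t lse = ((label & 0xfffffu) << 12) |
                       ((uint32_t)(tc & 0x7) << 9) |
                       ((uint32_t)(bos ? 1 : 0) << 8) |
                       ttl;

        return htonl(lse);
}

int main(void)
{
        /* Label 16, TC 0, bottom of stack, TTL 64. */
        printf("LSE (host order): 0x%08" PRIx32 "\n",
               ntohl(mpls_lse(16, 0, 1, 64)));
        return 0;
}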

I have posted a patch, "[PATCH v3.29] datapath: Add basic MPLS support to
kernel" which adds MPLS support to the kernel datapath of Open vSwitch.
That patch sets the above requirements in datapath/actions.c:push_mpls()
and was used to exercise this code. The datapath patch is against the Open
vSwitch tree but it is intended that it be added to the Open vSwitch code
present in the mainline Linux kernel at some point.

Features:

I believe that the approach that I have taken is at least partially
consistent with the handling of other protocols. Jesse, I understand that
you have some ideas here. I am more than happy to change my implementation.

This patch adds dev->mpls_features which may be used by devices
to advertise features supported for MPLS packets.

A new NETIF_F_MPLS_GSO feature is added for devices which support
hardware MPLS GSO offload. Currently no devices support this
and MPLS GSO always falls back to software.

Alternate Implementation:

One possible alternate implementation is to teach netif_skb_features()
and skb_network_protocol() about MPLS, in a similar way to their
understanding of VLANs. I believe this would avoid the need
for net/mpls/mpls_gso.c and in particular the calls to
__skb_push() and __skb_pull() in mpls_gso_segment().

I have decided on the implementation in this patch as it should
not introduce any overhead in the case where mpls_gso is not compiled
into the kernel or inserted as a module.

MPLS GSO suggested by Jesse Gross.
Based in part on "v4 GRE: Add TCP segmentation offload for GRE"
by Pravin B Shelar.

Cc: Jesse Gross <jesse@nicira.com>
Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e1b73cba 21-May-2013 Daniel Vetter <daniel.vetter@ffwll.ch>

Merge tag 'v3.10-rc2' into drm-intel-next-queued

Backmerge Linux 3.10-rc2 since the various (rather trivial) conflicts
grew a bit out of hand. intel_dp.c has the only real functional
conflict since the logic changed while dev_priv->edp.bpp was moved
around.

Also squash in a whitespace fixup from Ben Widawsky for
i915_gem_gtt.c, git seems to do something pretty strange in there
(which I don't fully understand tbh).

Conflicts:
drivers/gpu/drm/i915/i915_reg.h
drivers/gpu/drm/i915/intel_dp.c

Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>


# 99bbc707 20-May-2013 Willem de Bruijn <willemb@google.com>

rps: selective flow shedding during softnet overflow

A cpu executing the network receive path sheds packets when its input
queue grows to netdev_max_backlog. A single high rate flow (such as a
spoofed source DoS) can exceed a single cpu processing rate and will
degrade throughput of other flows hashed onto the same cpu.

This patch adds a more fine-grained hashtable. If the netdev backlog
is above a threshold, IRQ cpus track the ratio of total traffic of
each flow (using 4096 buckets, configurable). The ratio is measured
by counting the number of packets per flow over the last 256 packets
from the source cpu. Any flow that occupies a large fraction of this
(set at 50%) will see packet drops while above the threshold.
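
A self-contained toy model of that accounting, using the bucket count (4096)
and window size (256 packets) quoted above. The kernel's data structures and
naming differ, so treat this purely as an illustration of the counting scheme:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define FLOW_BUCKETS    4096            /* bucket count quoted above */
#define WINDOW          256             /* packets in the sliding window */

struct flow_limit_model {
        uint16_t count[FLOW_BUCKETS];   /* packets per bucket in the window */
        uint16_t history[WINDOW];       /* bucket of each recent packet */
        unsigned int pos;               /* next history slot to overwrite */
};

/* Account one packet; returns true if it should be dropped. */
static bool flow_over_limit(struct flow_limit_model *fl, uint32_t flow_hash,
                            bool backlog_above_threshold)
{
        uint16_t bucket = flow_hash % FLOW_BUCKETS;
        uint16_t evicted = fl->history[fl->pos];

        if (fl->count[evicted])
                fl->count[evicted]--;           /* slide the window forward */
        fl->history[fl->pos] = bucket;
        fl->pos = (fl->pos + 1) % WINDOW;
        fl->count[bucket]++;

        /* Drop only while overloaded and the flow owns >50% of the window. */
        return backlog_above_threshold && fl->count[bucket] > WINDOW / 2;
}

int main(void)
{
        static struct flow_limit_model fl;
        bool dropped = false;

        for (int i = 0; i < 300; i++)           /* one flow hogs the cpu */
                dropped = flow_over_limit(&fl, 0x12345678u, true);
        printf("dominant flow dropped: %s\n", dropped ? "yes" : "no");
        return 0;
}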

Tested:
Setup is a multi-threaded UDP echo server with network rx IRQ on cpu0,
kernel receive (RPS) on cpu0, and application threads on cpus 2--7,
each handling 20k req/s. Throughput halves when hit with a 400 kpps
antagonist storm. With this patch applied, the antagonist overload is
dropped and the server processes its complete load.

The patch is effective when kernel receive processing is the
bottleneck. The above RPS scenario is an extreme one, but the same point
is reached with RFS and sufficient kernel processing (iptables, packet
socket tap, ..).

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 12e04ffc 15-May-2013 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Merge tag 'v3.10-rc1' into stable/for-linus-3.10

Linux 3.10-rc1

* tag 'v3.10-rc1': (12273 commits)
Linux 3.10-rc1
[SCSI] qla2xxx: Update firmware link in Kconfig file.
[SCSI] iscsi class, qla4xxx: fix sess/conn refcounting when find fns are used
[SCSI] sas: unify the pointlessly separated enums sas_dev_type and sas_device_type
[SCSI] pm80xx: thermal, sas controller config and error handling update
[SCSI] pm80xx: NCQ error handling changes
[SCSI] pm80xx: WWN Modification for PM8081/88/89 controllers
[SCSI] pm80xx: Changed module name and debug messages update
[SCSI] pm80xx: Firmware flash memory free fix, with addition of new memory region for it
[SCSI] pm80xx: SPC new firmware changes for device id 0x8081 alone
[SCSI] pm80xx: Added SPCv/ve specific hardware functionalities and relevant changes in common files
[SCSI] pm80xx: MSI-X implementation for using 64 interrupts
[SCSI] pm80xx: Updated common functions common for SPC and SPCv/ve
[SCSI] pm80xx: Multiple inbound/outbound queue configuration
[SCSI] pm80xx: Added SPCv/ve specific ids, variables and modify for SPC
[SCSI] lpfc: fix up Kconfig dependencies
[SCSI] Handle MLQUEUE busy response in scsi_send_eh_cmnd
dm cache: set config value
dm cache: move config fns
dm thin: generate event when metadata threshold passed
...


# 4237c09a 13-May-2013 Mauro Carvalho Chehab <mchehab@redhat.com>

Merge tag 'v3.10-rc1' into patchwork

Linux 3.10-rc1

* tag 'v3.10-rc1': (11697 commits)
Linux 3.10-rc1
[SCSI] qla2xxx: Update firmware link in Kconfig file.
[SCSI] iscsi class, qla4xxx: fix sess/conn refcounting when find fns are used
[SCSI] sas: unify the pointlessly separated enums sas_dev_type and sas_device_type
[SCSI] pm80xx: thermal, sas controller config and error handling update
[SCSI] pm80xx: NCQ error handling changes
[SCSI] pm80xx: WWN Modification for PM8081/88/89 controllers
[SCSI] pm80xx: Changed module name and debug messages update
[SCSI] pm80xx: Firmware flash memory free fix, with addition of new memory region for it
[SCSI] pm80xx: SPC new firmware changes for device id 0x8081 alone
[SCSI] pm80xx: Added SPCv/ve specific hardware functionalities and relevant changes in common files
[SCSI] pm80xx: MSI-X implementation for using 64 interrupts
[SCSI] pm80xx: Updated common functions common for SPC and SPCv/ve
[SCSI] pm80xx: Multiple inbound/outbound queue configuration
[SCSI] pm80xx: Added SPCv/ve specific ids, variables and modify for SPC
[SCSI] lpfc: fix up Kconfig dependencies
[SCSI] Handle MLQUEUE busy response in scsi_send_eh_cmnd
dm cache: set config value
dm cache: move config fns
dm thin: generate event when metadata threshold passed
...


# f99e44a7 05-May-2013 Thomas Gleixner <tglx@linutronix.de>

Merge branch 'linus' into core/urgent

Update with Linus tree so fixes for the same can be applied.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 048c9acc 05-May-2013 David S. Miller <davem@davemloft.net>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc

Merge sparc bug fixes that didn't make it into v3.9 into
sparc-next.

Signed-off-by: David S. Miller <davem@davemloft.net>


# 73287a43 01-May-2013 Linus Torvalds <torvalds@linux-foundation.org>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next

Pull networking updates from David Miller:
"Highlights (1721 non-merge commits, this has to be a record of some
sort):

1) Add 'random' mode to team driver, from Jiri Pirko and Eric
Dumazet.

2) Make it so that any driver that supports configuration of multiple
MAC addresses can provide the forwarding database add and del
calls by providing a default implementation and hooking that up if
the driver doesn't have an explicit set of handlers. From Vlad
Yasevich.

3) Support GSO segmentation over tunnels and other encapsulating
devices such as VXLAN, from Pravin B Shelar.

4) Support L2 GRE tunnels in the flow dissector, from Michael Dalton.

5) Implement Tail Loss Probe (TLP) detection in TCP, from Nandita
Dukkipati.

6) In the PHY layer, allow supporting wake-on-lan in situations where
the PHY registers have to be written for it to be configured.

Use it to support wake-on-lan in mv643xx_eth.

From Michael Stapelberg.

7) Significantly improve firewire IPV6 support, from YOSHIFUJI
Hideaki.

8) Allow multiple packets to be sent in a single transmission using
network coding in batman-adv, from Martin Hundebøll.

9) Add support for T5 cxgb4 chips, from Santosh Rastapur.

10) Generalize the VXLAN forwarding tables so that there is more
flexibility in configuring various aspects of the endpoints.
From David Stevens.

11) Support RSS and TSO in hardware over GRE tunnels in the bnx2x driver,
from Dmitry Kravkov.

12) Zero copy support in nfnetlink_queue, from Eric Dumazet and Pablo
Neira Ayuso.

13) Start adding networking selftests.

14) In situations of overload on the same AF_PACKET fanout socket, or
per-cpu packet receive queue, minimize drop by distributing the
load to other cpus/fanouts. From Willem de Bruijn and Eric
Dumazet.

15) Add support for new payload offset BPF instruction, from Daniel
Borkmann.

16) Convert several drivers over to module_platform_driver(), from
Sachin Kamat.

17) Provide a minimal BPF JIT image disassembler userspace tool, from
Daniel Borkmann.

18) Rewrite F-RTO implementation in TCP to match the final
specification of it in RFC4138 and RFC5682. From Yuchung Cheng.

19) Provide netlink socket diag of netlink sockets ("Yo dawg, I hear
you like netlink, so I implemented netlink dumping of netlink
sockets.") From Andrey Vagin.

20) Remove ugly passing of rtnetlink attributes into rtnl_doit
functions, from Thomas Graf.

21) Allow userspace to see whether a configuration change occurs
in the middle of an address or device list dump, from Nicolas
Dichtel.

22) Support RFC3168 ECN protection for ipv6 fragments, from Hannes
Frederic Sowa.

23) Increase accuracy of packet length used by packet scheduler, from
Jason Wang.

24) Beginning set of changes to make ipv4/ipv6 fragment handling more
scalable and less susceptible to overload and locking contention,
from Jesper Dangaard Brouer.

25) Get rid of using non-type-safe NLMSG_* macros and use nlmsg_*()
instead. From Hong Zhiguo.

26) Optimize route usage in IPVS by avoiding reference counting where
possible, from Julian Anastasov.

27) Convert IPVS schedulers to RCU, also from Julian Anastasov.

28) Support cpu fanouts in xt_NFQUEUE netfilter target, from Holger
Eitzenberger.

29) Network namespace support for nf_log, ebt_log, xt_LOG, ipt_ULOG,
nfnetlink_log, and nfnetlink_queue. From Gao feng.

30) Implement RFC3168 ECN protection, from Hannes Frederic Sowa.

31) Support several new r8169 chips, from Hayes Wang.

32) Support tokenized interface identifiers in ipv6, from Daniel
Borkmann.

33) Use usbnet_link_change() helper in USB net driver, from Ming Lei.

34) Add 802.1ad vlan offload support, from Patrick McHardy.

35) Support mmap() based netlink communication, also from Patrick
McHardy.

36) Support HW timestamping in mlx4 driver, from Amir Vadai.

37) Rationalize AF_PACKET packet timestamping when transmitting, from
Willem de Bruijn and Daniel Borkmann.

38) Bring parity to what's provided by /proc/net/packet socket dumping
and the info provided by netlink socket dumping of AF_PACKET
sockets. From Nicolas Dichtel.

39) Fix peeking beyond zero sized SKBs in AF_UNIX, from Benjamin
Poirier"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits)
filter: fix va_list build error
af_unix: fix a fatal race with bit fields
bnx2x: Prevent memory leak when cnic is absent
bnx2x: correct reading of speed capabilities
net: sctp: attribute printl with __printf for gcc fmt checks
netlink: kconfig: move mmap i/o into netlink kconfig
netpoll: convert mutex into a semaphore
netlink: Fix skb ref counting.
net_sched: act_ipt forward compat with xtables
mlx4_en: fix a build error on 32bit arches
Revert "bnx2x: allow nvram test to run when device is down"
bridge: avoid OOPS if root port not found
drivers: net: cpsw: fix kernel warn on cpsw irq enable
sh_eth: use random MAC address if no valid one supplied
3c509.c: call SET_NETDEV_DEV for all device types (ISA/ISAPnP/EISA)
tg3: fix to append hardware time stamping flags
unix/stream: fix peeking with an offset larger than data in queue
unix/dgram: fix peeking with an offset larger than data in queue
unix/dgram: peek beyond 0-sized skbs
openvswitch: Remove unneeded ovs_netdev_get_ifindex()
...


# ee1bec9b 01-May-2013 Daniel Borkmann <dborkman@redhat.com>

netlink: kconfig: move mmap i/o into netlink kconfig

Currently, in menuconfig, Netlink's new mmaped IO is the very first
entry under the ``Networking support'' item and comes even before
``Networking options'':

[ ] Netlink: mmaped IO
Networking options --->
...

Let's move this into ``Networking options'' under netlink's Kconfig,
since this might be more appropriate. Introduced by commit ccdfcc398
(``netlink: mmaped netlink: ring setup'').

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bf61c884 01-May-2013 Dmitry Torokhov <dmitry.torokhov@gmail.com>

Merge branch 'next' into for-linus

Prepare first set of updates for 3.10 merge window.


# f53f292e 20-Apr-2013 H. Peter Anvin <hpa@linux.intel.com>

Merge remote-tracking branch 'efi/chainsaw' into x86/efi

Resolved Conflicts:
drivers/firmware/efivars.c
fs/efivarfs/file.c

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# 42bbcb78 19-Apr-2013 David S. Miller <davem@davemloft.net>

Merge branch 'netlink-mmap'

Patrick McHardy says:

====================
The following patches contain an implementation of memory mapped I/O for
netlink. The implementation is modelled after AF_PACKET memory mapped I/O
with a few differences:

- In order to perform memory mapped I/O to userspace, the kernel allocates
skbs with the data area pointing to the data area of the mapped frames.
All netlink subsystems assume a linear data area, so for the sake of
simplicity, the mapped data area is not attached to the paged area but
to skb->data. This requires introduction of a special skb allocation
function that just allocates an skb head without the data area. Since this
is quite a rare use case, I introduced a new function based on __alloc_skb
instead of splitting it up into head and data allocation. The alternative
would be to introduce an __alloc_skb_head and __alloc_skb_data function,
which would actually be useful for a specific error case in memory mapped
netlink, but would require a couple of extra instructions for the common
skb allocation case, so it doesn't really seem worth it.

In order to get the destination memory area for skb->data before message
construction, memory mapped netlink I/O needs to look up the destination
socket during allocation instead of during transmission because the
ring is owned by the receiving socket/process. A special skb allocation
function (netlink_alloc_skb) taking the destination pid as an argument is
used for this, all subsystems that want to support memory mapped I/O need
to use this function, automatic fallback to the receive queue happens
for unconverted subsystems. Dumps automatically use memory mapped I/O if
the receiving socket has enabled it.

The visible effect of looking up the destination socket during allocation
instead of transmission is that message ordering in userspace might
change in case allocation and transmission aren't performed atomically.
This usually doesn't matter since most subsystems have a BKL-like lock
like the rtnl mutex; to my knowledge the only currently existing case
where it might matter is nfnetlink_queue combined with the recently
introduced batched verdicts, but a) that subsystem already includes
sequence numbers which allow userspace to reorder messages in case it
cares to, and the reordering window is quite small, and b) with memory
mapped transmission batching can be performed in a subsystem-independent
manner.

- AF_NETLINK contains flow control for database dumps, with regular I/O
dump continuations are triggered based on the socket's receive queue space
and by recvmsg() calls. Since with memory mapped I/O there are no
recvmsg() calls under normal operation, this is done in netlink_poll(),
under the assumption that userspace has processed all pending frames
before invoking poll(), thus the ring is expected to have room for new
messages. Dumps currently don't benefit as much as they could from
memory mapped I/O because each single continuation requires a poll()
call. A more aggressive approach seems like a good idea to me, especially
in case the socket is not subscribed to any multicast groups (IOW only
receiving explicitly requested data).

Besides that, the memory mapped netlink implementation extends the states
defined by AF_PACKET between userspace and the kernel by a SKIP status; this
is intended for the case that userspace wants to queue frames (specifically
when using nfnetlink_queue, an IDS and stream reassembly, requested by
Eric Leblond) for a longer period of time. The kernel skips over all frames
marked with SKIP when looking for unused frames and only fails when not
finding a free frame or when having skipped the entire ring.
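
A toy model of that frame walk; the status names are invented for the sketch
and only loosely follow the AF_PACKET-style states the text refers to:

#include <stdio.h>

enum frame_status { FRAME_UNUSED, FRAME_VALID, FRAME_SKIP };

/* Walk the ring once starting at 'start'; step over VALID and SKIP frames
 * and return the index of the first unused frame, or -1 if the whole ring
 * was walked without finding one. */
static int find_free_frame(const enum frame_status *ring, unsigned int nframes,
                           unsigned int start)
{
        for (unsigned int i = 0; i < nframes; i++) {
                unsigned int idx = (start + i) % nframes;

                if (ring[idx] == FRAME_UNUSED)
                        return (int)idx;
        }
        return -1;
}

int main(void)
{
        enum frame_status ring[4] = {
                FRAME_SKIP, FRAME_VALID, FRAME_UNUSED, FRAME_SKIP
        };

        printf("next free frame: %d\n", find_free_frame(ring, 4, 0));
        return 0;
}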

Also noteworthy is memory mapped sendmsg: the kernel performs validation
of messages before accepting and processing them. In order to prevent
userspace from changing the message contents after validation, the
kernel checks that the ring is only mapped once and the file descriptor
is not shared (in order to avoid having userspace set up another mapping
after the first mentioned check). If either condition does not hold, the
message is copied to an allocated skb and processed as with regular I/O.
I'd especially appreciate review of this part since I'm not really versed
in memory, file and process management.

The remaining interesting details are included in the changelogs of the
individual patches and the documentation, so I won't repeat them here.

As an example, nfnetlink_queue is converted to support memory mapped
I/O. Other subsystems that would probably benefit are nfnetlink_log,
audit and maybe ISCSI, not sure.

Following are some numbers collected by Florian Westphal based on a
slightly older version, which included an experimental patch for the
nfnetlink_queue ordering issue.

===

Test hardware is a 12-core machine (Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz).
ixgbe interfaces are used (i.e., multiqueue NICs).
IRQs are distributed across the cpus.

I've made several tests.

The simple one consists of 3GBit UDP traffic, packets are 1500 bytes
in size (i.e., no fragmentation), with a single nfqueue
and the test client programs in libmnl examples directory.
Packets are sent from one /24 net to another /24 net, i.e.
there are a few hundred flows active at any given time.

I've also tested with snort, but I disabled all rules.
6Gbit UDP traffic is generated in the snort case, and
6 nfqueues are used (i.e., 6 snorts run in parallel).

I've tested with 3 different kernels, all based on 3.7.1.
- 3.7.1, without the mmap patches
- 3.7.1, with Patricks mmap patches
- 3.7.1, with mmap patches and extended spinlock to ensure packet ids are
monotonically increasing and cannot be re-ordered. This is what we
currently ship in our product.

[ the spinlock that is extended is the per nfqueue spinlock, it will
be held from the time the netlink skb is allocated until the netlink
skb is sent to userspace:

http://1984.lsi.us.es/git/nf-next/commit/?h=mmap-netlink3&id=b8eb19c46650fef4e9e4fe53f367f99bbf72afc9
]

snort is normally used in "batch mode", i.e., after processing 25 packets
a single "batch verdict" is sent to accept the packets seen so far.
"mmap snort" means RX_RING + sendmsg(), i.e. TX_RING is not used at this
time (except where noted below).

One reason is that snort has a reload thread, so the kernel needs to copy;
also in the snort case no payload rewrite takes place, so compared
to the rx path the tx path is cheap.

Results:

3.7.1, without mmap patches, i.e. recv()+sendmsg() for everyone
nfq-queue: 1.7 gbit out
snort-recv-batch-25 5.1 gbit out
snort-recv-no-batch 3.1 gbit out

3.7.1 + mmap + without extended spinlocked section
nfq-queue: 1.7 gbit out (recv/sendmsg)
nfq-queue-mmap: 2.4 gbit out
snort-mmap-batch-25 5.6 gbit out (warning: since ids can be
re-ordered, this version is "broken").
snort-recv-batch-25 5.1 gbit out
snort-mmap-no-batch 4.6 gbit out (i.e., one verdict per packet)

Kernel 3.7.1 + mmap + extended spinlock section:
nfq-queue: 1.4 gbit out
nfq-queue-mmap: 2.3 gbit out
snort: 5.6 gbit out

Conclusions:
- The "extended spinlocked section" hurts performance in the
single queue case; with 6 snorts there is no measurable slowdown.
- I tried to re-write the mmap-snort to work without batch verdicts, but
results were not very encouraging:

kernel 3.7.1 + mmap (without extended spinlocked section):

snort-mmap-batch-25 5.6 gbit out (what we currently ship)
snort-recv-batch-25 5.1 gbit out (without using mmap)
snort-mmap-batch-1 4.6 gbit out (with mmap but without batch verdicts)
snort-mmap-txring-25 5.2 gbit out (with mmap but without batch verdicts)
snort-mmap-txring-1 4.6 gbit out (with mmap but without batch verdicts)

The difference between the last two is that in the txring-25 case, we
put a verdict into the tx ring after every packet, but will only
invoke sendmsg(, NULL, 0) after processing 25 packets. So the only
difference is the number of sendmsg calls/context switches.

So, i.o.w., kernel 3.7.1 + mmap + the extra locking crap is faster
than 3.7.1 + mmap-without-extra-locking and single-verdict-per-packet.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>


# ccdfcc39 17-Apr-2013 Patrick McHardy <kaber@trash.net>

netlink: mmaped netlink: ring setup

Add support for mmap'ed RX and TX ring setup and teardown based on the
af_packet.c code. The following patches will use this to add the real
mmap'ed receive and transmit functionality.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

