tcpdump mailing list archives

[PATCH] enable memory mapped access to ethernet device for linux


From: Alexander Dupuy <alex.dupuy () counterstorm com>
Date: Thu, 6 Dec 2007 12:54:48 -0500

Paolo Abeni writes:
It does not use environment variables to control the memory mapped ring
parameters; instead the requested snap len is used: the low order bytes
are used to select the ring frame size and the high order bytes are used
to select the ring frame number. If the high order bytes is 0, like in
every current libpcap usage, a reasonable default is used.

Using the snaplen for the ring frame size certainly makes sense, but I'm uncomfortable with overloading the high order bytes to specify the number of ring slots. The right way to do this is to provide a general interface to set the buffering size (WinPcap has a system-specific extension) but the problem in the past has been that some systems, like *BSD, require the buffer size to be set when opening the underlying (in this case, BPF) device. Guy Harris has in the past talked about an extensible key/value list parameter to pcap_open, but as far as I know, nothing has ever come of it. In our version, we use a pcap_setbufsize function, like WinPcap's pcap_setbuff, but this only works on Linux (packet socket read or mmap), Windows (WinPcap), and SunOS 3.x or SGI Irix (the only other systems to use sockets for packet capture, and which support SO_RCVBUF).
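
For the socket-based platforms, a buffer-size setter of this kind can be little more than a wrapper around SO_RCVBUF. The following is a minimal sketch under that assumption, not the pcap_setbufsize code described above: the name sk_set_rcvbuf is made up for illustration, and it takes a raw socket descriptor rather than a pcap_t.

```c
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (not a libpcap API): ask the kernel for a receive
 * buffer of `bytes` on socket `fd` and report what was actually granted.
 * Returns the effective size in bytes, or -1 on error (errno is set).
 * On Linux the kernel doubles the requested value and caps the request at
 * net.core.rmem_max, so the result need not match what was asked for.
 */
static int
sk_set_rcvbuf(int fd, int bytes)
{
    int granted = 0;
    socklen_t len = sizeof(granted);

    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0)
        return -1;
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len) < 0)
        return -1;
    return granted;
}
```

Reading the value back matters because a silently clamped buffer is easy to mistake for a successful request.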

Rounding the ring frame size up to the nearest power of two wastes quite a bit of memory for full capture on standard Ethernet (a 1514-byte frame in a 2048-byte slot leaves 26% wasted) and even more for typical jumbo frames (9000 bytes in a 16384-byte slot, 45% wasted). How exactly does this simplify ring navigation? I don't recall seeing this in any other pcap-mmap implementation (admittedly, I never looked too closely at Phil Woods' code). Also, what do you do when the snaplen is zero, implying max packet size? I have code that gets the interface MTU and uses that (for the "any" interface, it defaults to 65535, which is safe, but wasteful), but this is not dealt with in your code.
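
The waste figures can be reproduced with a few lines of C. This is only an illustration: the helper names are hypothetical, the per-frame tpacket header overhead is ignored, and sk_effective_snaplen is a simplified guess at the zero-snaplen fallback described above, not the actual code.

```c
#include <stddef.h>

/* Smallest power of two >= n. Assumes n <= SIZE_MAX / 2 + 1. */
static size_t
sk_round_up_pow2(size_t n)
{
    size_t p = 1;

    while (p < n)
        p <<= 1;
    return p;
}

/*
 * Percentage of a power-of-two slot left unused by a frame of `len`
 * bytes, rounded to the nearest integer.
 */
static unsigned
sk_pow2_waste_percent(size_t len)
{
    size_t slot = sk_round_up_pow2(len);

    return (unsigned)(((slot - len) * 100 + slot / 2) / slot);
}

/*
 * Hypothetical fallback for snaplen == 0: use the interface MTU when it
 * is known, else 65535 (mtu == 0 stands for "unknown", e.g. "any").
 */
static size_t
sk_effective_snaplen(size_t snaplen, size_t mtu)
{
    if (snaplen != 0)
        return snaplen;
    return mtu != 0 ? mtu : 65535;
}
```

With these definitions, sk_pow2_waste_percent(1514) gives 26 and sk_pow2_waste_percent(9000) gives 45.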

Is the ring navigation you're referring to the computation of the exact ring/frame structure size? I have the following comment in the version I use:

/*
* Compute framesz, frames_per_pg, pgs, ct from snaplen, bufsize, cooked mode
*
* The extra +16/+2 factor for cooked/non-cooked mode is needed for additional
* "gap" so that captured data after MAC header (14 bytes for ethernet,
* nominally 0 for cooked mode, but *some* gap is required) will be aligned to
* TPACKET_ALIGNMENT.  For interfaces where MAC header != 14 bytes, the
* computation gets trickier, and you need to get the arp hardware type of the
* interface to decide what the offset should be. The best solution for those
* is to incorporate C. Philip Woods' libpcap with "mmap" support, which has a
* map_arphrd_to_dlt() function that handles the offset computation. For now,
* we only support ethernet and cooked interfaces.
*/
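
A computation along the lines of that comment might look like the sketch below. It is a guess at the shape of the arithmetic, not the code the comment belongs to: every name is hypothetical, the packet header length is passed in as hdrlen instead of being taken from TPACKET_HDRLEN, the +16/+2 gap is assumed to be added to the aligned header length, and blocks are assumed to be whole pages holding at least one frame.

```c
#include <stddef.h>

#define SK_ALIGNMENT 16u  /* same value as the kernel's TPACKET_ALIGNMENT */
#define SK_ALIGN(x)  (((x) + SK_ALIGNMENT - 1) & ~(size_t)(SK_ALIGNMENT - 1))

/* Hypothetical result structure; field names are illustrative only. */
struct sk_ring_geom {
    size_t frame_size;        /* bytes per ring frame, aligned */
    size_t block_size;        /* bytes per block, a multiple of page_size */
    size_t frames_per_block;
    size_t block_count;
    size_t frame_count;       /* block_count * frames_per_block */
};

/*
 * Derive a ring layout from snaplen, the requested buffer size and
 * cooked mode. Only ethernet (14-byte MAC header) and cooked capture are
 * covered, as in the comment above. Returns 0 on success, -1 on bad
 * arguments.
 */
static int
sk_ring_geometry(size_t snaplen, size_t bufsize, int cooked,
                 size_t hdrlen, size_t page_size, struct sk_ring_geom *g)
{
    size_t gap = cooked ? 16 : 2;   /* the +16/+2 factor */

    if (snaplen == 0 || page_size == 0 || g == NULL)
        return -1;

    g->frame_size = SK_ALIGN(SK_ALIGN(hdrlen) + gap + snaplen);
    /* A block is a whole number of pages, big enough for one frame. */
    g->block_size = ((g->frame_size + page_size - 1) / page_size) * page_size;
    g->frames_per_block = g->block_size / g->frame_size;
    g->block_count = bufsize / g->block_size;
    if (g->block_count == 0)
        g->block_count = 1;
    g->frame_count = g->block_count * g->frames_per_block;
    return 0;
}
```

No power-of-two rounding is involved; the frame size only has to be a multiple of the alignment, and the frame count has to equal the block count times the frames per block.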

I wonder if your "power-of-two" approach is just covering up some memory overflow problems. I also notice that you are limiting the number of ring slots to 128K (MAX_BLOCK_NR). While this is correct for 32-bit i386 Linux 2.4 (and earlier) kernels, the values are different on other architectures, and the kmalloc limit no longer applies for 2.6 kernels (there are other limits, though). My version uses a binary search to find a working buffer size if the requested one fails when allocating the ring buffer. This isn't ideal, but it is more practical across different kernels, and simplifies application programming considerably.
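
The fallback search can be sketched as follows. This is an illustration of the idea rather than the actual implementation: the callback type, the function names and the stand-in allocator are all hypothetical, and the search assumes that if a given size can be allocated, every smaller size can be too.

```c
#include <stddef.h>

/* Hypothetical probe: returns 0 if a ring of `bytes` can be allocated. */
typedef int (*sk_try_alloc_fn)(size_t bytes, void *ctx);

/*
 * Stand-in for the real allocation attempt (e.g. a PACKET_RX_RING
 * setsockopt): succeeds for any size up to the limit stored in ctx.
 */
static int
sk_fake_alloc(size_t bytes, void *ctx)
{
    const size_t *limit = ctx;

    return bytes <= *limit ? 0 : -1;
}

/*
 * Find the largest multiple of `granule` in [min, want] for which
 * try_alloc succeeds; `want` itself is tried first. Returns 0 if nothing
 * in range works. The last probe is not necessarily the successful one,
 * so a real caller would request the returned size once more.
 */
static size_t
sk_find_ring_size(size_t want, size_t min, size_t granule,
                  sk_try_alloc_fn try_alloc, void *ctx)
{
    size_t lo, hi, best = 0;

    if (granule == 0 || want < min)
        return 0;
    if (try_alloc(want, ctx) == 0)
        return want;

    lo = (min + granule - 1) / granule;   /* bounds in units of granule */
    hi = want / granule;
    while (lo <= hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (try_alloc(mid * granule, ctx) == 0) {
            best = mid * granule;
            lo = mid + 1;
        } else {
            if (mid == 0)
                break;
            hi = mid - 1;
        }
    }
    return best;
}
```

The application asks for the size it would like and gets the nearest size the kernel will grant, without having to know any per-kernel limits.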

Using a zero timeout to indicate "wait forever" introduces some compatibility and consistency problems. The original (and probably best) use of the timeout is for in-kernel delays; the application-level read timeout (or not) is better taken from the pcap_setnonblock() call (i.e. wait forever is the default, unless nonblock is set). If I'm not mistaken, this is the current behavior with the socket read() implementation on Linux. You also have to be much more careful about multiple calls to poll() within the loop: they have to cope with interrupts and with the interface going down, and handle pcap_breakloop() correctly.
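
A wait loop with those properties could be structured roughly as below. This is a sketch under assumptions, not code from either implementation: sk_break_requested stands in for whatever flag pcap_breakloop() sets, the blocking choice is passed in directly instead of being read from the pcap_t, and the -2 return value is chosen only to mirror what libpcap returns when a loop is broken.

```c
#include <errno.h>
#include <poll.h>
#include <signal.h>

/* Hypothetical stand-in for the flag that pcap_breakloop() sets. */
static volatile sig_atomic_t sk_break_requested;

/*
 * Wait until `fd` is readable.
 *   block != 0: wait forever (the default, blocking mode)
 *   block == 0: return immediately (non-blocking mode)
 * Returns 1 if readable, 0 if nothing is ready (non-blocking only),
 * -2 if a break was requested, -1 on error.
 */
static int
sk_wait_readable(int fd, int block)
{
    struct pollfd pfd;
    int rc;

    for (;;) {
        /* Checked on every pass, so a signal handler can break the loop. */
        if (sk_break_requested) {
            sk_break_requested = 0;
            return -2;
        }

        pfd.fd = fd;
        pfd.events = POLLIN;
        pfd.revents = 0;
        rc = poll(&pfd, 1, block ? -1 : 0);

        if (rc < 0) {
            if (errno == EINTR)
                continue;   /* interrupted: recheck the flag, then retry */
            return -1;
        }
        if (rc == 0) {
            if (!block)
                return 0;
            continue;
        }
        if (pfd.revents & (POLLERR | POLLHUP | POLLNVAL))
            return -1;      /* e.g. the interface went down */
        if (pfd.revents & POLLIN)
            return 1;
    }
}
```

The point of the structure is that EINTR never surfaces as an error, and the break flag is rechecked before every poll() call rather than only once on entry.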

There's also an issue that with the ringbuffer, the initial contents can be quite substantial in the fraction of a second between the pcap_open and the application's call to pcap_setfilter. For some reason this is not so much an issue for the socket read() interface, although buffering takes place there as well; perhaps the kernel (re-)filters the socket buffer when the filter is changed? Anyhow, I've found it necessary to apply user-level filtering to the contents of the ring buffer from startup until the ring is empty the first time. There's also a (smaller) window between the packet socket() and bind() calls where packets from *any* interface may be queued in the ringbuffer; I also filter these out if the pcap_open was not for the "any" interface. (This one seems to apply in the socket read case as well, and I think I stole that code from there.)
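
The startup filtering amounts to a small piece of state that is consulted per frame and cleared the first time the ring runs dry. The sketch below shows only that logic; the structure and function names are hypothetical, the interface index is assumed to come from the sockaddr_ll stored with each frame, and matches_filter is assumed to be the result of running the compiled filter over the frame in user space (for instance with bpf_filter()).

```c
/* Hypothetical per-handle state for the startup window. */
struct sk_startup_filter {
    int draining;      /* nonzero until the ring has been seen empty once */
    int want_ifindex;  /* index of the bound interface, 0 for "any" */
};

/*
 * Decide whether a frame taken from the ring should be delivered.
 * Returns nonzero to deliver, zero to skip.
 */
static int
sk_accept_frame(const struct sk_startup_filter *st,
                int frame_ifindex, int matches_filter)
{
    if (!st->draining)
        return 1;   /* past startup: the kernel has already filtered */
    if (st->want_ifindex != 0 && frame_ifindex != st->want_ifindex)
        return 0;   /* queued between socket() and bind() */
    return matches_filter != 0;   /* queued before the filter was set */
}

/* Call when the ring is found empty; ends the startup window. */
static void
sk_ring_drained(struct sk_startup_filter *st)
{
    st->draining = 0;
}
```

Once sk_ring_drained() has been called, every later frame was queued after the kernel filter and the bind took effect, so the user-level checks cost nothing in steady state.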

Something else the version I use does is to update the packet stats whenever the ring is empty and TP_STATUS_LOSING (indicating dropped packets) was seen in the ringbuffer. I'm not quite sure why it does this (I don't have access to my CVS repository right now, so I can't even say where that bit of code came from).

I'd also like to see some way to enable/disable use of the ringbuffer at runtime (again, Guy's key/value config suggestion would be ideal for this). That's not a show-stopper for me, but it may be necessary for compatibility purposes on some systems.

In summary, getting the ring buffer to work correctly (and compatibly/consistently) is not quite so straightforward as it may seem. I don't want to discourage your attempt to get this included, though; I would very much like to see this in the mainline libpcap, as I noted earlier.

I also wonder whether it might make sense to look at a libpcap-ng development effort; this would be an upwards-compatible replacement for libpcap that would also offer new APIs with the key/value extensions, and support reading/dumping both classic pcap savefile and ntar formats (possibly using the ntar library, or new code). Clearly, this would not be for the libpcap 1.0 branch, but rather a new libpcap 2.0. Support for Linux mmap and PF_RING could be folded into this as well.

@alex

--
mailto:alex.dupuy () counterstorm com


