tcpdump mailing list archives

Re: [PATCH] enable memory mapped access to ethernet device for linux


From: Guy Harris <guy () alum mit edu>
Date: Thu, 6 Dec 2007 16:09:01 -0800


On Dec 6, 2007, at 9:54 AM, Alexander Dupuy wrote:

Paolo Abeni writes:
It does not use environment variables to control the memory mapped ring parameters; instead the requested snap len is used: the low order bytes select the ring frame size and the high order bytes select the number of ring frames. If the high order bytes are 0, as in every current libpcap usage, a reasonable default is used.
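
For illustration only, the encoding described above might be decoded roughly as follows. The patch description does not say where the low-order/high-order boundary falls, so the 16/16 split, the default frame count, and all names here are assumptions rather than the patch's actual code.

```c
#include <stdint.h>

/* Assumed layout: low 16 bits = ring frame size, high 16 bits = frame count.
 * The real split used by the patch is not stated in the message. */
#define RING_FRAME_SIZE_MASK 0xFFFFu
#define RING_FRAME_NR_SHIFT  16
#define DEFAULT_FRAME_NR     1024u  /* placeholder default, not from the patch */

struct ring_params {
    uint32_t frame_size;  /* bytes per ring frame */
    uint32_t frame_nr;    /* number of frames in the ring */
};

/* Split an overloaded snaplen into ring parameters. */
static struct ring_params decode_snaplen(uint32_t snaplen)
{
    struct ring_params p;

    p.frame_size = snaplen & RING_FRAME_SIZE_MASK;
    p.frame_nr = snaplen >> RING_FRAME_NR_SHIFT;
    if (p.frame_nr == 0)            /* what every existing caller passes */
        p.frame_nr = DEFAULT_FRAME_NR;
    return p;
}
```

Under this assumed layout an ordinary snaplen such as 1514 or 65535 leaves the high bits clear, so existing callers would get the default frame count.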

Using the snaplen for the ring frame size certainly makes sense, but I'm uncomfortable with overloading the high order bytes to specify the number of ring slots. The right way to do this is to provide a general interface to set the buffering size (WinPcap has a system-specific extension), but the problem in the past has been that some systems, like *BSD, require the buffer size to be set when opening the underlying (in this case, BPF) device. Guy Harris has in the past talked about an extensible key/value list parameter to pcap_open, but as far as I know, nothing has ever come of it. In our version, we use a pcap_setbufsize function, like WinPcap's pcap_setbuff, but this only works on Linux (packet socket read or mmap), Windows (WinPcap), and SunOS 3.x or SGI Irix (the only other systems to use sockets for packet capture, and which support SO_RCVBUF).

How does pcap_setbufsize() differ from pcap_setbuff()?

Rounding the ring size to the nearest power of two wastes quite a bit of memory for full capture on standard Ethernet (a 1514-byte frame in a 2048-byte slot leaves 26% wasted) and even more for typical jumbo frames (a 9000-byte frame in a 16384-byte slot leaves 45% wasted). How exactly does this simplify ring navigation? I don't recall seeing this in any other pcap-mmap implementation (admittedly, I never looked too closely at Phil Woods' code). Also, what do you do when the snaplen is zero - implying max packet size?
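
The waste figures above can be checked with a few lines of C. This is only a sketch of the arithmetic: it ignores any per-frame header the kernel places in each ring slot, and the function names are made up for the example.

```c
#include <stddef.h>

/* Round n up to the next power of two.
 * Assumes n is small enough that the shift cannot overflow. */
static size_t round_up_pow2(size_t n)
{
    size_t p = 1;

    while (p < n)
        p <<= 1;
    return p;
}

/* Percentage of a power-of-two slot left unused by a frame of frame_len bytes. */
static double wasted_percent(size_t frame_len)
{
    size_t slot = round_up_pow2(frame_len);

    return 100.0 * (double)(slot - frame_len) / (double)slot;
}
```

For 1514 bytes this gives a 2048-byte slot with about 26% unused; for 9000 bytes, a 16384-byte slot with about 45% unused.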

0 is not a valid snapshot length value in pcap_open_live(); it's valid in current versions of tcpdump, but that's because it maps it to 65535. 65535 is also the default in Wireshark.

If that wastes wired-down ring-buffer memory, the right thing to do is probably, as you note, to use the interface MTU (although you have to add on not only the maximum link-layer header size, but also sizes for things such as the radio header for 802.11 adapters).

I wonder if your "power-of-two" approach is just covering up some memory overflow problems. I also notice that you are limiting the number of ring slots to 128K (MAX_BLOCK_NR). While this is correct for 32-bit i386 Linux 2.4 (and earlier) kernels, the values are different on other architectures, and the kmalloc limit no longer applies for 2.6 kernels (there are other limits, though). My version uses a binary search to find a working buffer size if the requested one fails when allocating the ring buffer - this isn't ideal, but is more practical across different kernels, and simplifies application programming considerably.

...and is similar to what's done for the BPF buffer size.
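
A binary-search fallback of the kind described above could look something like the sketch below. It is not the code from either implementation: the callback type, the function names, and the demo allocator with its fixed limit are all hypothetical, and the search assumes that if a given size can be allocated then every smaller size can be as well.

```c
#include <stddef.h>

/* Hypothetical callback: try to set up a buffer of `size` bytes.
 * Returns 0 on success, nonzero on failure. A real one would also have
 * to release each successful trial allocation before the next attempt. */
typedef int (*try_alloc_fn)(size_t size, void *ctx);

/* Return `requested` if it works; otherwise the largest size in
 * [minimum, requested) that does, or 0 if even `minimum` fails. */
static size_t find_working_size(size_t requested, size_t minimum,
                                try_alloc_fn try_alloc, void *ctx)
{
    size_t lo, hi;

    if (try_alloc(requested, ctx) == 0)
        return requested;
    if (minimum > requested || try_alloc(minimum, ctx) != 0)
        return 0;

    /* Invariant: lo is known to work, hi is known to fail. */
    lo = minimum;
    hi = requested;
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;

        if (try_alloc(mid, ctx) == 0)
            lo = mid;
        else
            hi = mid;
    }
    return lo;
}

/* Stand-in allocator for demonstration: pretends the kernel refuses
 * anything above DEMO_LIMIT bytes. */
#define DEMO_LIMIT ((size_t)3000000)

static int demo_alloc(size_t size, void *ctx)
{
    (void)ctx;
    return size <= DEMO_LIMIT ? 0 : -1;
}
```

With the demo allocator, a request for 4 MiB and a 4096-byte minimum settles on DEMO_LIMIT after roughly twenty trial allocations.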

Using a zero timeout to indicate "wait forever" introduces some compatibility and consistency problems; the original (and probably best) use of the timeout is for in-kernel delays, while the application-level read timeout (or lack of one) is better taken from the pcap_setnonblock() call (i.e. wait forever is the default, unless nonblock is set). If I'm not mistaken, this is the current behavior with the socket read() implementation on Linux. You also have to be much more careful about multiple calls to poll() within the loop, because of interrupts and the interface going down, and you have to handle pcap_breakloop() correctly.
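
As a rough illustration of the care needed around poll(), a wait loop might be structured as below. This is a sketch under assumptions, not libpcap's code: the function, the result codes, and the break flag (standing in for whatever pcap_breakloop() sets) are hypothetical, and a finite timeout is simply restarted in full after an interrupted call.

```c
#include <errno.h>
#include <poll.h>
#include <signal.h>

enum wait_result {
    WAIT_ERROR = -1,   /* poll() failed, or POLLERR/POLLHUP/POLLNVAL */
    WAIT_TIMEOUT = 0,  /* finite timeout expired with nothing to read */
    WAIT_READY = 1,    /* fd is readable */
    WAIT_BREAK = 2     /* caller asked the loop to stop */
};

/* Wait until fd is readable. timeout_ms == 0 is treated as "wait forever". */
static enum wait_result wait_readable(int fd, int timeout_ms,
                                      const volatile sig_atomic_t *break_flag)
{
    struct pollfd pfd;
    int n;

    for (;;) {
        /* Check before every poll() so a break request is never missed. */
        if (*break_flag)
            return WAIT_BREAK;

        pfd.fd = fd;
        pfd.events = POLLIN;
        pfd.revents = 0;

        n = poll(&pfd, 1, timeout_ms > 0 ? timeout_ms : -1);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* signal: loop and re-check the flag */
            return WAIT_ERROR;
        }
        if (n == 0)
            return WAIT_TIMEOUT;
        if (pfd.revents & POLLIN)
            return WAIT_READY;
        return WAIT_ERROR;         /* e.g. the interface went away */
    }
}
```

With an infinite wait, the break flag is only noticed when a signal interrupts poll(), which is one reason the break handling needs attention in a poll()-based reader.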

On platforms where the timeout is supported, 0 means "wait forever"; to quote the pcap man page's description of pcap_open_live():

to_ms specifies the read timeout in milliseconds. The read timeout is used to arrange that the read not necessarily return immediately when a packet is seen, but that it wait for some amount of time to allow more packets to arrive and to read multiple packets from the OS kernel in one operation. Not all platforms support a read timeout; on platforms that don't, the read timeout is ignored. A zero value for to_ms, on platforms that support a read timeout, will cause a read to wait forever to allow enough packets to arrive, with no timeout.

Linux is one of the platforms that doesn't support a read timeout. Note that it is *NOT* guaranteed that a read will complete within "to_ms" milliseconds; on Solaris, for example, the timer doesn't start until at least one packet is seen, so the read could block forever if no packets arrive. (Applications should *NOT* be using the timeout to, for example, allow them to do other things if no packets arrive.)

There's also an issue with the ring buffer: the initial contents can be quite substantial in the fraction of a second between pcap_open and the application's call to pcap_setfilter. For some reason this is not so much an issue for the socket read() interface, although buffering takes place there as well; perhaps the kernel (re-)filters the socket buffer when the filter is changed?

With BPF and Digital UNIX's packetfilter, changing the filter flushes the buffer. With Linux, changing the filter doesn't flush the buffer - so current versions of libpcap purge the buffer themselves, so that, after you change a filter, you don't get any packets that wouldn't have passed the filter. (On platforms where filtering is done in userland, that's not an issue.)
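
The "purge the buffer" step for a socket can be sketched in isolation as a non-blocking drain loop. This shows only the general technique; libpcap's own code may differ, the function name is made up, and a real implementation would also have to keep newly arriving unfiltered packets from being queued while the drain is in progress.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Discard every datagram currently queued on sock, without blocking.
 * Intended for datagram-style sockets, where reading into a one-byte
 * buffer throws away the rest of each packet.
 * Returns the number of datagrams discarded, or -1 on error. */
static int drain_socket(int sock)
{
    char dummy;
    int discarded = 0;

    for (;;) {
        ssize_t n = recv(sock, &dummy, sizeof dummy, MSG_DONTWAIT);

        if (n >= 0) {
            discarded++;
            continue;
        }
        if (errno == EINTR)
            continue;
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return discarded;      /* queue is empty */
        return -1;
    }
}
```

For a memory-mapped ring the equivalent step would be walking the ring and handing every filled frame back to the kernel, which is not shown here.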

I also wonder whether it might make sense to look at a libpcap-ng development effort; this would be an upwards-compatible replacement for libpcap that would also offer new APIs with the key/value extensions, and support reading/dumping both classic pcap savefile and ntar formats (possibly using the ntar library, or new code). Clearly, this would not be for the libpcap 1.0 branch, but rather a new libpcap 2.0.

...or libpcap 1.1; calling it libpcap 2.0 might imply binary incompatibility, and if the library is renamed libpcap.2.so on ELF-based systems, would strongly imply binary incompatibility (i.e., programs linked with libpcap 1.x wouldn't work with 2.x even if 2.x *is* binary-compatible).

(That's also an issue for going to libpcap 1.0 from 0.x.)
-
This is the tcpdump-workers list.
Visit https://cod.sandelman.ca/ to unsubscribe.

