tcpdump mailing list archives

[PATCH] enable memory mapped access to ethernet device for linux


From: Alexander Dupuy <alex.dupuy () mac com>
Date: Fri, 07 Dec 2007 11:22:42 -0500

[I sent this from my other account but it seems not to have gone through;
resending and apologies if you receive it twice.]

Paolo Abeni writes:
> It does not use environment variables to control the memory mapped ring
> parameters; instead the requested snap len is used: the low order bytes
> are used to select the ring frame size and the high order bytes are used
> to select the ring frame number. If the high order bytes is 0, like in
> every current libpcap usage, a reasonable default is used.
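
For readers following along, the encoding described in the quote might look roughly like the sketch below. The quoted text does not say where the low-order/high-order boundary lies, so the 16/16-bit split, the default of 1024 frames, and the names `decode_snaplen` and `ring_request` are all assumptions made for illustration, not details of the patch.

```c
#include <stdint.h>

/* Hypothetical default; the patch's actual "reasonable default" is not given. */
#define SKETCH_DEFAULT_FRAME_COUNT 1024u

struct ring_request {
    uint32_t frame_size;   /* taken from the low-order half of snaplen */
    uint32_t frame_count;  /* taken from the high-order half of snaplen */
};

/* Assumes a 16/16-bit split, which the quoted description does not confirm. */
static struct ring_request decode_snaplen(uint32_t snaplen)
{
    struct ring_request r;

    r.frame_size = snaplen & 0xFFFFu;
    r.frame_count = snaplen >> 16;
    if (r.frame_count == 0)
        r.frame_count = SKETCH_DEFAULT_FRAME_COUNT;  /* ordinary snaplen */
    return r;
}
```

Under this assumed split, an ordinary snaplen such as 65535 keeps its meaning, but any caller passing a larger value would silently have part of it reinterpreted as a frame count.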

Using the snaplen for the ring frame size certainly makes sense, but I'm
uncomfortable with overloading the high order bytes to specify the
number of ring slots.  The right way to do this is to provide a general
interface to set the buffering size (WinPcap has a system-specific
extension) but the problem in the past has been that some systems, like
*BSD, require the buffer size to be set when opening the underlying (in
this case, BPF) device.  Guy Harris has in the past talked about an
extensible key/value list parameter to pcap_open, but as far as I know,
nothing has ever come of it.  In our version, we use a pcap_setbufsize
function, like WinPcap's pcap_setbuff, but this only works on Linux
(packet socket read or mmap), Windows (WinPcap), and SunOS 3.x or SGI
Irix (the only other systems to use sockets for packet capture, and
which support SO_RCVBUF).

Rounding the ring frame size up to the next power of two wastes quite a
bit of memory for full capture on standard Ethernet (1514 bytes in a
2048-byte frame = 26% wasted) and even more for typical jumbo frames
(9000 bytes in a 16384-byte frame = 45% wasted).  How
exactly does this simplify ring navigation?  I don't recall seeing this
in any other pcap-mmap implementation (admittedly, I never looked too
closely at Phil Woods' code).  Also, what do you do when the snaplen is
zero - implying max packet size?  I have code that gets the interface
MTU and uses that (for the "any" interface, it defaults to 65535, which
is safe, but wasteful) - but this is not dealt with in your code.
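
The waste figures above can be reproduced with a small sketch. It mirrors the arithmetic in the paragraph only: it ignores the per-frame header that a real ring slot also carries, and the helper names are made up for illustration.

```c
#include <stdint.h>

/* Smallest power of two >= n; returns 0 if that would not fit in 32 bits. */
static uint32_t round_up_pow2(uint32_t n)
{
    uint32_t p = 1;

    if (n > 0x80000000u)
        return 0;
    while (p < n)
        p <<= 1;
    return p;
}

/* Percentage of a power-of-two frame left unused by snaplen bytes,
 * rounded to the nearest whole percent. */
static unsigned wasted_percent(uint32_t snaplen)
{
    uint32_t frame = round_up_pow2(snaplen);

    if (frame == 0)
        return 0;
    return (unsigned)(((uint64_t)(frame - snaplen) * 100 + frame / 2) / frame);
}
```

With these helpers, `wasted_percent(1514)` gives 26 and `wasted_percent(9000)` gives 45, matching the numbers quoted above.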

Is the ring navigation you're referring to the computation of the exact
ring/frame structure size?  I have the following comment in the version
I use:

/*
 * Compute framesz, frames_per_pg, pgs, ct from snaplen, bufsize, cooked mode
 *
 * The extra +16/+2 factor for cooked/non-cooked mode is needed for additional
 * "gap" so that captured data after MAC header (14 bytes for ethernet,
 * nominally 0 for cooked mode, but *some* gap is required) will be aligned to
 * TPACKET_ALIGNMENT.  For interfaces where MAC header != 14 bytes, the
 * computation gets trickier, and you need to get the arp hardware type of the
 * interface to decide what the offset should be.  The best solution for those
 * is to incorporate C. Philip Woods' libpcap with "mmap" support, which has a
 * map_arphrd_to_dlt() function that handles the offset computation.  For now,
 * we only support ethernet and cooked interfaces.
 */
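
The code that goes with this comment is not shown here, so the following is only one plausible reading of it, written as a sketch. It uses local stand-ins rather than the kernel headers: `SKETCH_ALIGNMENT` mirrors TPACKET_ALIGNMENT (16), `hdrlen` stands for the per-frame header length, and the block size is assumed to be a power-of-two multiple of the page size. The struct and function names are hypothetical.

```c
#include <stddef.h>

#define SKETCH_ALIGNMENT 16u  /* stand-in for TPACKET_ALIGNMENT */
#define SKETCH_ALIGN(x) \
    (((x) + SKETCH_ALIGNMENT - 1u) & ~(SKETCH_ALIGNMENT - 1u))

struct ring_geometry {
    unsigned framesz;           /* bytes per ring frame, aligned */
    unsigned block_size;        /* bytes per block, whole pages */
    unsigned frames_per_block;
    unsigned blocks;
    unsigned frame_count;       /* frames_per_block * blocks */
};

/* Returns 0 on success, -1 if bufsize cannot hold even one block.
 * Only ethernet (14-byte MAC header) and cooked mode are handled. */
static int compute_ring_geometry(unsigned hdrlen, unsigned snaplen,
                                 unsigned bufsize, int cooked,
                                 unsigned page_size,
                                 struct ring_geometry *g)
{
    unsigned maclen = cooked ? 0u : 14u;
    unsigned gap = cooked ? 16u : 2u;   /* the +16/+2 from the comment */
    /* Offset of the data after the MAC header, aligned. */
    unsigned netoff = SKETCH_ALIGN(hdrlen + maclen + gap);

    g->framesz = SKETCH_ALIGN(netoff - maclen + snaplen);

    /* Grow the block until at least one frame fits. */
    g->block_size = page_size;
    while (g->block_size < g->framesz)
        g->block_size *= 2u;

    g->frames_per_block = g->block_size / g->framesz;
    g->blocks = bufsize / g->block_size;
    if (g->blocks == 0)
        return -1;
    g->frame_count = g->frames_per_block * g->blocks;
    return 0;
}
```

Because frames are laid out from exact sizes and not from powers of two, a 1514-byte snaplen here fits two frames in a 4096-byte page, the same density a 2048-byte power-of-two frame gives; the saving only shows up with larger blocks.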

I wonder if your "power-of-two" approach is just covering up some memory
overflow problems.  I also notice that you are limiting the number of
ring slots to 128K (MAX_BLOCK_NR).  While this is correct for 32-bit
i386 Linux 2.4 (and earlier) kernels, the values are different on other
architectures, and the kmalloc limit no longer applies for 2.6 kernels
(there are other limits, though).  My version uses a binary search
to find a working buffer size if the requested one fails when
allocating the ring buffer - this isn't ideal, but is more practical
across different kernels, and simplifies application programming
considerably.
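
A minimal sketch of that fallback idea follows. It is not the code from my version: `try_alloc_fn` is a hypothetical callback standing in for whatever actually requests the ring from the kernel, `limit_alloc` is a test stand-in, and the search assumes that if a given size fails, every larger size fails too.

```c
#include <stddef.h>

/* Hypothetical callback: returns nonzero if a ring of `frames` frames
 * could be allocated. */
typedef int (*try_alloc_fn)(size_t frames, void *ctx);

/* Largest frame count <= requested that try_alloc accepts, or 0 if none.
 * A real caller would re-request the returned size afterwards, since the
 * last probe made here is not necessarily the successful one. */
static size_t find_working_frames(size_t requested, try_alloc_fn try_alloc,
                                  void *ctx)
{
    size_t lo = 0;          /* largest count known to work (0 = none) */
    size_t hi = requested;  /* smallest count known to fail */

    if (requested == 0)
        return 0;
    if (try_alloc(requested, ctx))
        return requested;
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;

        if (try_alloc(mid, ctx))
            lo = mid;
        else
            hi = mid;
    }
    return lo;
}

/* Stand-in allocator for experiments: succeeds up to a fixed limit. */
static int limit_alloc(size_t frames, void *ctx)
{
    return frames <= *(const size_t *)ctx;
}
```

With a limit of 1000 frames, a request for 4096 settles on 1000 after about a dozen probes.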

Using a zero timeout to indicate "wait forever" introduces some
compatibility and consistency problems; the original (and best,
probably) use of the timeout is for in-kernel delays - the
application-level read timeout (or not) is better taken from the
pcap_setnonblock() call (i.e. wait forever is the default, unless
nonblock is set).  If I'm not mistaken, this is the current behavior
with the socket read() implementation on Linux.  You also have to be
much more careful about multiple calls to poll() within the loop
(because of interrupts or the interface going down), and you have to
handle pcap_breakloop() correctly.
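
To make the poll() concerns concrete, here is a rough sketch of a wait helper. It is an illustration, not libpcap code: the function name, the return codes, and the `break_flag` parameter (standing in for whatever pcap_breakloop() sets) are assumptions.

```c
#include <errno.h>
#include <poll.h>
#include <signal.h>

enum { WAIT_READY = 1, WAIT_TIMEOUT = 0, WAIT_ERROR = -1, WAIT_BREAK = -2 };

/* Wait until fd is readable.  timeout_ms < 0 means wait forever.
 * Note that a retry after EINTR restarts the full timeout. */
static int wait_readable(int fd, int timeout_ms,
                         volatile sig_atomic_t *break_flag)
{
    struct pollfd pfd;

    pfd.fd = fd;
    pfd.events = POLLIN;
    for (;;) {
        int rc;

        if (*break_flag)
            return WAIT_BREAK;       /* honor a pending break request */
        pfd.revents = 0;
        rc = poll(&pfd, 1, timeout_ms);
        if (rc < 0) {
            if (errno == EINTR)
                continue;            /* interrupted: re-check the flag */
            return WAIT_ERROR;
        }
        if (rc == 0)
            return WAIT_TIMEOUT;
        if (pfd.revents & POLLIN)
            return WAIT_READY;
        if (pfd.revents & (POLLERR | POLLHUP | POLLNVAL))
            return WAIT_ERROR;       /* e.g. the interface went down */
    }
}
```

The break flag is checked before every poll() call, so a signal handler that sets it and interrupts poll() causes a prompt return instead of another full wait.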

There's also an issue with the ringbuffer: the initial contents can
be quite substantial in the fraction of a second between the pcap_open
and the application's call to pcap_setfilter.  For some reason this is
not so much an issue for the socket read() interface, although buffering
takes place there as well; perhaps the kernel (re-)filters the socket
buffer when the filter is changed?  Anyhow, I've found it necessary to apply
user-level filtering to the contents of the ring buffer from startup
until the ring is empty the first time.  There's also a (smaller) window
between the packet socket() and bind() calls where packets from *any*
interface may be queued in the ringbuffer; I also filter these out if
the pcap_open was not for the "any" interface.  (This one seems to apply
in the socket read case as well, and I think I stole that code from there.)
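
The startup filtering described above could be organized along these lines. This is a sketch under assumptions, not the code I use: the struct, the function names, and the `match_fn` callback (standing in for running the user's filter program in user space) are all hypothetical.

```c
#include <stddef.h>

/* Hypothetical user-level filter: nonzero means the packet matches. */
typedef int (*match_fn)(const unsigned char *pkt, size_t len, void *ctx);

struct startup_filter {
    int ring_drained;    /* set once the ring has been seen empty */
    int bound_ifindex;   /* interface we opened, or -1 for "any" */
    match_fn match;      /* user-level copy of the filter, may be NULL */
    void *ctx;
};

/* Decide whether a frame taken from the ring should reach the application. */
static int accept_frame(const struct startup_filter *sf, int frame_ifindex,
                        const unsigned char *pkt, size_t len)
{
    if (sf->ring_drained)
        return 1;  /* everything queued before setup is gone by now */
    if (sf->bound_ifindex >= 0 && frame_ifindex != sf->bound_ifindex)
        return 0;  /* queued between socket() and bind() */
    return sf->match == NULL || sf->match(pkt, len, sf->ctx);
}

/* Call when the ring is found empty for the first time. */
static void note_ring_empty(struct startup_filter *sf)
{
    sf->ring_drained = 1;
}

/* Stand-in predicate for experiments: match on the first byte. */
static int first_byte_is(const unsigned char *pkt, size_t len, void *ctx)
{
    return len > 0 && pkt[0] == *(const unsigned char *)ctx;
}
```

Once the ring has drained, the extra checks stop, so the per-packet cost is paid only for the frames that were queued before the filter and bind took effect.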

Something else the version I use does is to update the packet stats
whenever the ring is empty and TP_STATUS_LOSING (indicating dropped
packets) was seen in the ringbuffer.  I'm not quite sure why it does this
(I don't have access to my CVS repository right now, so I can't even say
where that bit of code came from).

I'd also like to see some way to enable/disable use of the ringbuffer at
runtime (again, Guy's key/value config suggestion would be ideal for
this).  That's not a show-stopper for me, but it may be necessary
for compatibility purposes on some systems.

In summary, getting the ring buffer to work correctly (and
compatibly/consistently) is not quite so straightforward as it may
seem.  I don't want to discourage your attempt to get this included,
though - I would very much like to see this in the mainline libpcap as I
noted earlier.

I also wonder whether it might make sense to look at a libpcap-ng
development effort; this would be an upwards-compatible replacement for
libpcap that would also offer new APIs with the key/value extensions,
and support reading/dumping both classic pcap savefile and ntar formats
(possibly using the ntar library, or new code). Clearly, this would not
be for the libpcap 1.0 branch, but rather a new libpcap 2.0.  Support
for Linux mmap and PF_RING could be folded into this as well.

@alex

--
mailto:alex.dupuy () counterstorm com



--
mailto:alex.dupuy () mac com

-
This is the tcpdump-workers list.
Visit https://cod.sandelman.ca/ to unsubscribe.

