tcpdump mailing list archives

Re: [PATCH] enable memory mapped access to ethernet device for linux


From: "Paolo Abeni" <paolo.abeni () telecomitalia it>
Date: Fri, 07 Dec 2007 09:51:02 +0100

hello,

First, thanks for the detailed review. It seems that some of the
relevant points have been addressed by Guy Harris, so I'll try to catch
the others...

On Thu, 2007-12-06 at 12:54 -0500, Alexander Dupuy wrote:

Rounding the ring size to nearest power of two wastes quite a bit of 
memory for full capture on standard Ethernet (2048/1514 = 26% wasted) 
and even more for typical jumbo frames (16384/9000 = 45% wasted).  How 
exactly does this simplify ring navigation?  

There are a couple of constraints to take care of. The ring blocks must
be page aligned, but are allocated internally with a size that is a
power of 2, so choosing a non-power-of-2 size for the ring's blocks will
waste memory. The ring frames can't run across a block boundary, so if
the block size is not a multiple of the frame size, some memory is
wasted at the end of each block. This gap can be reduced by choosing a
(possibly big) block size that matches the frame size, but this leads
to the allocation of possibly big chunks of kernel memory, which is
strongly discouraged by the Linux kernel developers.
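
As a rough illustration of the arithmetic (this is not code from the
patch; the function names and the block sizes in the usage note are
hypothetical), the per-block waste can be computed like this:

```c
#include <stddef.h>

/* Number of whole frames that fit in one block; frames never straddle
 * a block boundary. Returns 0 for a zero frame size. */
static size_t frames_per_block(size_t block_size, size_t frame_size)
{
    return frame_size ? block_size / frame_size : 0;
}

/* Bytes left unused at the end of each block. */
static size_t block_tail_waste(size_t block_size, size_t frame_size)
{
    return block_size - frames_per_block(block_size, frame_size) * frame_size;
}
```

For example, with an assumed 4096-byte block, 1514-byte frames leave
1068 bytes unused per block, while 2048-byte frames leave none.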

Making the frame size a power of two solves the above issue and
simplifies walking the ring, because there is no need to handle the end
of each ring block in a special way; otherwise we would need to keep the
block size in the pcap handle (which would require adding a field to the
handle structure) and check for the end of the block after processing
each frame (to skip the gap at the block's end).
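
A minimal sketch of the two navigation schemes, to show the difference
(again not taken from the patch; all names are hypothetical, and it
assumes the blocks are mapped contiguously in user space and that the
frame size is nonzero and not larger than the block size):

```c
#include <stddef.h>
#include <stdint.h>

/* Round v up to the next power of two; stops early instead of
 * overflowing for very large v. */
static size_t round_up_pow2(size_t v)
{
    size_t p = 1;
    while (p < v && p <= SIZE_MAX / 2)
        p <<= 1;
    return p;
}

/* Power-of-two frames tile every block exactly, so the byte offset of
 * frame `index` needs no knowledge of the block size. */
static size_t frame_offset_pow2(size_t index, size_t frame_size)
{
    return index * frame_size;
}

/* General case: the block size must be kept around, and the gap at the
 * end of each block is skipped explicitly. */
static size_t frame_offset_general(size_t index, size_t block_size,
                                   size_t frame_size)
{
    size_t per_block = block_size / frame_size;

    return (index / per_block) * block_size
         + (index % per_block) * frame_size;
}
```

When the frame size is a power of two that divides the block size, both
functions give the same offsets, so the simpler one is enough.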

You also have to be much more careful 
about multiple calls to poll() within the loop, due to interrupts, 
interface down, and handle pcap_breakloop() correctly.

Honestly, I miss the point here. Currently, if the poll() call is
interrupted by a signal, the call is invoked again, as is done on
other platforms. The interface going down will cause the read call to
return with an error, and I suppose this is the standard behavior.
Finally, I explicitly check for break_loop after each non-fatal
termination of poll().
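
The behavior described above can be sketched roughly as follows. This
is a simplified illustration, not the patch code: the function name,
the return codes and the way the break flag is passed in are all
assumptions made for the example.

```c
#include <errno.h>
#include <poll.h>
#include <signal.h>

/* Wait until fd is readable.
 * Returns 1 if readable, 0 on timeout, -2 if a break was requested,
 * -1 on a fatal error (including an error condition on the fd). */
static int wait_for_packets(int fd, int timeout_ms,
                            volatile sig_atomic_t *break_loop)
{
    struct pollfd pfd;
    int ret;

    pfd.fd = fd;
    pfd.events = POLLIN;

    for (;;) {
        pfd.revents = 0;
        ret = poll(&pfd, 1, timeout_ms);
        if (ret < 0 && errno != EINTR)
            return -1;              /* fatal poll() failure */

        /* Checked after every non-fatal return, EINTR included. */
        if (*break_loop) {
            *break_loop = 0;
            return -2;
        }
        if (ret < 0)
            continue;               /* interrupted by a signal: retry */
        if (ret == 0)
            return 0;               /* timeout */
        if (pfd.revents & (POLLERR | POLLHUP | POLLNVAL))
            return -1;              /* e.g. the interface went down */
        return 1;
    }
}
```

Checking the flag before retrying means a signal handler that requests
a break is honored even when poll() itself would simply be restarted.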

Obviously there could be many bugs in my code, so if you can pinpoint
something in it, that will help a lot!

I wonder if your "power-of-two" approach is just covering up some memory 
overflow problems.  I also notice that you are limiting the number of 
ring slots to 128K (MAX_BLOCK_NR).  While this is correct for 32-bit 
i386 Linux 2.4 (and earlier) kernels, the values are different on other 
architectures, and the kmalloc limit no longer applies for 2.6 kernels 
(there are other limits, though).  

Please note that in both versions of the patch I submitted, the effective
ring size is selected using a binary search approach, starting from said
limit.
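
The idea can be sketched as below. This is only an illustration under
stated assumptions, not the patch code: the callback type, the function
names and the stand-in limit_try() are hypothetical, and the search
assumes that if a given block count can be allocated, every smaller
count can be too. In a real capture path the callback would attempt the
actual ring setup request.

```c
/* Returns nonzero if a ring with block_nr blocks could be set up. */
typedef int (*ring_try_fn)(unsigned int block_nr, void *ctx);

/* Find the largest block count <= max_blocks accepted by try_alloc.
 * Returns 0 if no size works. The last attempt is not necessarily the
 * winning one, so the caller should request the returned size again. */
static unsigned int find_ring_blocks(unsigned int max_blocks,
                                     ring_try_fn try_alloc, void *ctx)
{
    unsigned int lo = 0;            /* largest size known to work */
    unsigned int hi;                /* largest size still possible */

    if (max_blocks == 0)
        return 0;
    if (try_alloc(max_blocks, ctx))
        return max_blocks;          /* the starting limit just works */

    hi = max_blocks - 1;
    while (lo < hi) {
        unsigned int mid = lo + (hi - lo + 1) / 2;

        if (try_alloc(mid, ctx))
            lo = mid;
        else
            hi = mid - 1;
    }
    return lo;
}

/* Stand-in for a real allocation attempt: succeeds up to *ctx blocks. */
static int limit_try(unsigned int block_nr, void *ctx)
{
    return block_nr <= *(unsigned int *)ctx;
}
```

Raising the starting point only changes max_blocks; the number of
attempts grows with its logarithm.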

With the MAX_BLOCK_NR block limit on a 64-bit platform, the ring will
hold by default 16K jumbo frames (or 32K standard ethernet frames),
which in my experience is more than enough to handle at least a Gb
ethernet link, even if it's filled with very small packets (but most
NICs are simply unable to deliver such a load to the host). On 32-bit
platforms the maximum number of ring frames is doubled.

Anyway the starting point for the binary search can be changed (i.e.
increased) if necessary.

There's also an issue that with the ringbuffer, the initial contents can 
be quite substantial in the fraction of a second between the pcap_open 
and application call to pcap_setfilter; for some reason this is not so 
much an issue for the socket read() interface, although buffering takes 
place there as well, perhaps the kernel (re-)filters the socket buffer 
when the filter is changed?  Anyhow, I've found it necessary to apply 
user-level filtering to the contents of the ring buffer from startup 
until the ring is empty the first time.  There's also a (smaller) window 
between the packet socket() and bind() calls where packets from *any* 
interface may be queued in the ringbuffer; I also filter these out if 
the pcap_open was not for the "any" interface.  (This one seems to apply 
in the socket read case as well, and I think I stole that code from there.)

The ring is created only after the 'bind' of the socket to the requested
interface, and packets are delivered to the ring only after its
creation, so the second issue should not arise.

On the other hand, it seems that the ring buffer isn't flushed by the
kernel when a pcap filter is attached, so the first issue must be
handled. An alternative, very simple solution would be to manually flush
the ring buffer after setting the filter. It will cause the loss of some
frames, but the same happens right now with standard, non-memory-mapped
access.
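
A manual flush could look roughly like the sketch below. It is only an
illustration: struct frame_hdr and FRAME_STATUS_KERNEL are simplified
stand-ins for the kernel's struct tpacket_hdr (tp_status field) and
TP_STATUS_KERNEL, the function name is hypothetical, and the frames are
assumed to tile the mapped ring contiguously, as with a power-of-two
frame size.

```c
#include <stddef.h>

#define FRAME_STATUS_KERNEL 0UL     /* stand-in for TP_STATUS_KERNEL */

struct frame_hdr {                  /* stand-in for struct tpacket_hdr */
    unsigned long status;
};

/* Hand every frame that currently belongs to user space back to the
 * kernel, discarding whatever was captured before the filter was set.
 * Returns the number of frames dropped. */
static size_t flush_ring(unsigned char *ring, size_t frame_count,
                         size_t frame_size)
{
    size_t i, dropped = 0;

    for (i = 0; i < frame_count; i++) {
        struct frame_hdr *h = (struct frame_hdr *)(ring + i * frame_size);

        if (h->status != FRAME_STATUS_KERNEL) {
            h->status = FRAME_STATUS_KERNEL;
            dropped++;
        }
    }
    return dropped;
}
```

The flush would be called right after the filter has been attached, so
that only frames that passed the new filter are seen afterwards.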

In conclusion, I'll try to repost asap a modified version of the patch
that will handle the filter issue. If some other change is required
(e.g. a bigger default ring size) please let me know.

ciao,

Paolo