tcpdump mailing list archives

Re: [PATCH] enable memory mapped access to ethernet device for linux


From: Alexander Dupuy <alex.dupuy () mac com>
Date: Mon, 10 Dec 2007 13:42:18 -0500

Guy Harris asked:

How does pcap_setbufsize() differ from pcap_setbuff()?

The WinPcap pcap_setbuff function is defined as:
int pcap_setbuff(pcap_t *p, int dim);

I declared it as follows:

extern int pcap_setbufsize(pcap_t *p, int bufsize, char *errbuf);
extern int pcap_getbufsize(pcap_t *p, char *errbuf);

where errbuf can return a warning message even when the call succeeds (e.g., on partial allocation of the ring, it reports the maximum size to the user); the getbufsize function lets the caller determine programmatically the actual size allocated (when partial allocation was possible).

Paolo Abeni writes:

Having the frame size a power of two solves the above issue and simplifies
walking the ring, because there is no need to handle the end of each ring
block in a special way; otherwise we need to keep the block size in the
pcap handle (which would require adding a field to the handle structure)
and check for end of block after processing each frame (to skip the gap
at the block's end)


Okay, I guess I see why you're doing this - you are trying not to add any additional fields to the struct pcap (which is why you're re-using the otherwise unused bp/cc members to track the ring). However, the struct pcap is an internal structure, allocated and managed within libpcap, so there really isn't any reason not to add additional tracking fields. The version I have uses a separately allocated struct iovec array to track the ring slots, so that once this array is set up, the read code just iterates through the array of pointers; this is partly an artifact of supporting the PACKET_TRECV ring buffer in patched versions of the 2.2 kernel, which was replaced by the simpler and cleaner PACKET_RX_RING ioctl.

Whether you use struct iovec or something better, though, changing the size of the internal struct pcap will not affect binary compatibility, and it is probably a better approach than forcing the ring frame size to a power of two. Remember that the ring frame size not only determines how much memory is allocated, but also causes the kernel to actually copy data (for packets larger than the snapshot), so you'd be wasting not just memory but CPU cycles as well.

Currently, if the poll() call is interrupted by a signal, the call is
invoked again, as on other platforms. Taking the interface down will
cause the read call to return with an error, and I suppose this is the
standard behavior.

I must have missed the breakloop check in your first version; your current version does have it, but there is a minor problem: the breakloop field is not reset in the after-poll check (and I wonder whether it might not be better to check before calling poll rather than after?), whereas the ring-read check does reset the field.

With the MAX_BLOCK_NR block limit on a 64bits platform the ring will
hold by default 16K jumbo frames (or 32K standard ethernet frames) that
in my experience is more than enough to handle at least a Gb ethernet


I don't really see the need for a hard-coded upper limit, though. With kernels before 2.6.4, there was a kmalloc restriction that limited the number of ring slots to 128K/sizeof(pointer) (i.e., 16,384 slots on a 64-bit platform; the total ring size was not limited in the same way), but this is not the case in recent kernels, so there's no reason to impose an arbitrary upper limit. As for it being "enough to handle at least a Gb ethernet", that really depends on the application's processing speed and the maximum burst rate; I don't think you can generalize from any particular application.

Also, your "binary search" only reduces the size by halves, which will not discover the actual limit. The implementation I have also tries increasing the size again after a successful allocation (if the successful size was smaller than the original request), with a loop like the following, where maxct is initialized to the original request size and minct to 1. This loop is only used if the original allocation fails, and there is a slightly different and more complex one for the PACKET_RX_RING case; I'm providing this just to illustrate the basic idea:

                 /* Bisect between a size known to fail (maxct) and one
                  * known to succeed (minct, initially 1) until they meet. */
                 do
                 {
                   ct = (maxct + minct) / 2;
                   if (setsockopt(p->fd, SOL_PACKET, PACKET_TRECV,
                                  (void*)ring, ct*sizeof(struct iovec)))
                     maxct = ct;     /* still too big: lower the ceiling */
                   else
                     minct = ct;     /* fits: raise the floor, try larger */
                 } while (maxct - minct > 1);

On the other hand, it seems that the ring buffer isn't flushed by the
kernel when a pcap filter is attached, so the first issue must be
handled. An alternative, very simple solution would be to manually flush
the ring buffer after setting the filter. It will cause the loss of some
frames, but the same happens right now with standard, non-memory-mapped
access.


Guy's answer indicates that this is the current implementation for the socket read() version (and other kernel versions), but just as Andy Howell noted, I also have applications that adjust the filter dynamically, so I would prefer not to flush the ring after setting the filter (although I guess this would be acceptable the very first time the filter is set for a given pcap handle/fd).

@alex

--
mailto:alex.dupuy () mac com

-
This is the tcpdump-workers list.
Visit https://cod.sandelman.ca/ to unsubscribe.

