tcpdump mailing list archives

Re: pcap_dispatch on linux 2.6 with libpcap 1.1.1


From: Jim Lloyd <jlloyd () silvertailsystems com>
Date: Mon, 23 Aug 2010 15:54:36 -0700

On Sun, Aug 22, 2010 at 11:44 PM, Guy Harris <guy () alum mit edu> wrote:


On Aug 21, 2010, at 3:30 PM, Jim Lloyd wrote:

I have tested with the above logic while sniffing traffic on a GigE
Ethernet NIC (eth0) and on the loopback device (lo). The test machine is
an 8-core Opteron with 32 GB of RAM running CentOS 5.5 with kernel
2.6.18. The traffic generator is a small program that uses libcurl to
repeatedly download a mix of static content from Apache 2.2 over 4
concurrent connections. The test results are:

          pps    Mbps    avg packets/dispatch
eth0      30K     850    3.009
lo        23K    1700    3.5

The total throughput here is excellent, so I'm not complaining. But why
is the packets-per-dispatch figure so small? I was under the impression
that at these data rates pcap_dispatch should process the requested 1000
packets per call instead of just ~3.

I know of no mechanism with PF_PACKET sockets - either when reading from
them or when looking at a memory-mapped buffer - to delay wakeups until a
certain number of packets, or a certain amount of packet data, is queued up
on the socket buffer or in the memory-mapped ring buffer, so, if the buffer
in question is currently empty, and a packet is added to it, any process
reading from the buffer, or blocked in a select() on the socket, will be
woken up.  If, for example, two additional packets are delivered into the
buffer while the process is being woken up and processing the first packet,
and then the process handles the remaining two packets before any additional
packets are added to the buffer, and thus goes to sleep again, you'll get 3
packets in that dispatch.
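
If you want to see the batching directly, a loop like this (a minimal
sketch; Thunk is your callback) will show it, since pcap_dispatch()
returns the number of packets it processed on each call:

    /* Minimal sketch: measure the batch size pcap_dispatch() delivers.
     * "handle" is an activated pcap_t; Thunk is the per-packet callback.
     * pcap_dispatch() returns the number of packets processed, 0 if the
     * timeout expired with nothing to read, and < 0 on error. */
    long packets = 0, calls = 0;
    for (;;) {
        int n = pcap_dispatch(handle, 1000, Thunk, NULL);
        if (n < 0)
            break;              /* error, or pcap_breakloop() */
        if (n > 0) {
            packets += n;
            calls++;
        }
    }
    /* average batch size = packets / (double)calls */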


Ahh, thank you. That makes perfect sense.


How much work does Thunk do?  If you're getting 30K packets per second,
that's 1 packet every 1/30 of a millisecond, or 1 packet every .0333...ms,
or 1 packet every 33 or so microseconds.  A batch of 3 packets, at 1/30 of a
millisecond per packet, takes 1/10 of a millisecond, so if the sum of "time
to wake up the process" and "3*time to process a packet" is under 100
microseconds, a batch of 3 packets can be processed in less time than it
takes for those 3 packets to arrive.


Yes, now the 3 or 3.5 packets per call makes perfect sense. Thunk
basically just makes a copy of the packet and then puts the copy on a
queue for a worker thread to process asynchronously. There are multiple
worker threads, and each thread must see its own shard of connections,
so Thunk also hashes the connection 4-tuple and computes the shard from
the hash.
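
Roughly, the sharding step looks like this (a simplified sketch, not the
production code):

    #include <stdint.h>

    /* Sketch: map a connection 4-tuple to a worker shard.  XORing the
     * endpoints makes the hash symmetric, so both directions of a
     * connection land on the same worker. */
    unsigned shard_for(uint32_t src_ip, uint32_t dst_ip,
                       uint16_t src_port, uint16_t dst_port,
                       unsigned nshards)
    {
        uint32_t h = src_ip ^ dst_ip;
        h ^= (uint32_t)(src_port ^ dst_port) << 16;
        h ^= h >> 16;           /* cheap integer mixing */
        h *= 0x45d9f3bU;
        h ^= h >> 16;
        return h % nshards;
    }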


Does this mean the 512 MB memory buffer is huge overkill?

For this application, it might be.

Ok, my understanding is improving rapidly. One gap in that understanding
remains. What is the relationship between the socket receive buffer and
the mmap buffer? Does the mmap buffer replace the socket receive buffer,
or are both buffers used? If both are used, when are packets transferred
from the socket receive buffer to the mmap buffer? I currently have my
primary testing machine configured with

net.core.rmem_default = 4194304
net.core.rmem_max = 16777216

Do we expect the rmem_default setting to be significant or not?
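
For reference, the handle itself is set up more or less like this (a
simplified sketch of a typical libpcap 1.x configuration, not my exact
code; pcap_set_buffer_size() is what sizes the mmap ring):

    #include <pcap/pcap.h>

    /* Sketch: typical libpcap 1.x memory-mapped capture setup. */
    char errbuf[PCAP_ERRBUF_SIZE];
    pcap_t *handle = pcap_create("eth0", errbuf);
    pcap_set_snaplen(handle, 65535);
    pcap_set_promisc(handle, 1);
    pcap_set_timeout(handle, 100);                    /* milliseconds */
    pcap_set_buffer_size(handle, 512 * 1024 * 1024);  /* 512 MB ring */
    if (pcap_activate(handle) != 0) {
        /* report pcap_geterr(handle) and bail */
    }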

Also, note that pcap_stats is not reporting any dropped packets, but I
have a little bit of evidence that some packet loss may be occurring
when sniffing Ethernet.

You have a recent version of libpcap, and a recent kernel, so
pcap_stats() should be getting the dropped-packet statistics by calling
getsockopt(PF_PACKET socket, SOL_PACKET, PACKET_STATISTICS, &statistics
buffer, ...).  The PF_PACKET socket code should increment the count of
dropped packets any time it fails to put a packet into the buffer
because the buffer is full.
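
That is, roughly the following (a sketch of the underlying call;
tpacket_stats is declared in <linux/if_packet.h>):

    #include <sys/socket.h>
    #include <linux/if_packet.h>

    /* Sketch of the call libpcap makes under the hood on Linux. */
    struct tpacket_stats st;
    socklen_t len = sizeof st;
    if (getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, &st, &len) == 0) {
        /* st.tp_packets: packets delivered to the socket
         * st.tp_drops:   packets dropped because the buffer was full
         * (the kernel resets both counters on each read) */
    }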

So the only time the drop count will increment is when the buffer is
full? I rarely see drops, but I've seen enough that this doesn't feel
right. I did recently increase the size of the mmap buffer from 16 MB to
512 MB, so perhaps 16 MB was simply too small.


The evidence is that my application occasionally fails to reconstruct a
TCP stream when sniffing Ethernet, but never fails to reconstruct any
TCP streams when sniffing loopback. However, I wouldn't be surprised if
this is due to my TCP reconstruction code failing to handle some rare
corner case that occurs with real TCP packets on the wire but does not
happen with loopback.

You might want to see why the failure is happening.  Presumably your
reconstruction code can handle out-of-order delivery of TCP segments, as
well as overlapping segments.  If a packet that was sent by the remote
host gets dropped on the network before it reaches the machine doing the
capturing, then there will eventually be a retransmission of some or all
of that data; but, at least with SACK, the remote host might be told
"I'm missing data in this range" when the receiver has gotten data on
both sides of that range, so you might get a retransmission that
includes data that comes before data you've already seen.
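
In other words, the reassembly code has to trim the overlap - something
like this sketch, where rcv_nxt is the next byte you expect and sequence
numbers are compared modulo 2^32:

    #include <stdint.h>

    /* Sketch: trim an arriving segment [seq, seq+len) against data
     * already delivered.  Sequence numbers wrap, so compare mod 2^32. */
    static int seq_before(uint32_t a, uint32_t b)
    {
        return (int32_t)(a - b) < 0;
    }

    void handle_segment(uint32_t rcv_nxt, uint32_t seq,
                        const uint8_t *data, uint32_t len)
    {
        if (seq_before(seq, rcv_nxt)) {
            uint32_t overlap = rcv_nxt - seq;
            if (overlap >= len)
                return;         /* pure retransmission; nothing new */
            data += overlap;    /* partial overlap: keep only the tail */
            len  -= overlap;
            seq   = rcv_nxt;
        }
        /* ... buffer or deliver [seq, seq+len) as usual ... */
    }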


I have logic to handle out-of-order segments, but no, I don't yet handle
SACK. So far this application generally runs on traffic entirely within
one datacenter, sniffing HTTP traffic between load balancers and web
servers, where I expect that selective ACKs are rare. But perhaps I am
being naive. I added instrumentation today to find out whether this is a
bad assumption, and in my test lab I haven't been able to make it
happen. Can you recommend any traffic generators that can produce test
traffic with SACKs?


In addition, even if the packet *does* arrive at the Ethernet adapter, it
might get dropped if, for example, the adapter driver's ring buffer is full;
from the point of view of the networking stack, including both TCP and
PF_PACKET sockets, that's indistinguishable from a packet lost on the
network, except that the driver can increment a "packets dropped" count -
but that count isn't available through a PACKET_STATISTICS call, so it won't
show up in the results of pcap_stats().  That ring buffer is separate from
the ring buffer used on PF_PACKET sockets in memory-mapped mode, so you can
make the latter ring buffer as big as you want, and it won't prevent packet
drops at the driver level.


Is there any way to detect these kinds of drops on Linux? It looks like
the receive drops column in /proc/net/dev might be this count. Can you
confirm?
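
In the meantime, this is the sort of check I have in mind (a sketch; the
column layout is assumed from the header line /proc/net/dev prints,
where the fourth receive field is "drop"):

    #include <stdio.h>
    #include <string.h>

    /* Sketch: read the RX "drop" counter for one interface from
     * /proc/net/dev.  Receive fields: bytes, packets, errs, drop, ... */
    long rx_drops(const char *ifname)
    {
        FILE *f = fopen("/proc/net/dev", "r");
        char line[512];
        long drops = -1;
        if (!f)
            return -1;
        while (fgets(line, sizeof line, f)) {
            char *colon = strchr(line, ':');
            if (!colon)
                continue;       /* the two header lines have no ':' */
            *colon = '\0';
            char *name = line;
            while (*name == ' ')
                name++;         /* interface names are space-padded */
            if (strcmp(name, ifname) != 0)
                continue;
            unsigned long bytes, packets, errs, drop;
            if (sscanf(colon + 1, "%lu %lu %lu %lu",
                       &bytes, &packets, &errs, &drop) == 4)
                drops = (long)drop;
            break;
        }
        fclose(f);
        return drops;
    }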
-
This is the tcpdump-workers list.
Visit https://cod.sandelman.ca/ to unsubscribe.

