Nmap Development mailing list archives

Re: Live Capture Performance to Rival Wireshark


From: Daniel Miller <bonsaiviking () gmail com>
Date: Sun, 18 Dec 2022 10:06:24 -0600

Matt,

Thanks for your interest in Npcap! These are very good questions, and we
hope to be able to improve Npcap's documentation to answer them soon. In
the meantime, here are some answers that may help you:

A recent survey of our log files from the field indicates that we are
missing packets.  Specifically,  converting our log files to .pcapng and
opening them in Wireshark, we see about 1% of the packets showing the [TCP
Previous segment not captured] message.  Due to the nature of this data,
this 1% loss is unacceptable to our users.  As expected, this loss gets
more dramatic with additional network traffic.  Testing in the lab shows
that Wireshark v3.6.7 captures the packets from a stress test with no
apparent packet loss, so I know the problem is on our end.


Wireshark's "TCP Previous segment not captured" message does not
necessarily mean that Npcap or your application was unable to capture a
packet that otherwise made it to the system, though the direct stress test
you mention does make it more likely that is the case. It is also possible
that the packets were dropped by some other participant in the data path,
such as an upstream router, switch, or another component of the NDIS stack
like a firewall. A better measurement is Npcap's own internal statistics,
obtained with the pcap_stats() function. This returns a struct pcap_stat
whose ps_recv member shows the number of packets delivered on the adapter
(regardless of whether your application captured them, due to BPF
filtering, buffer size limitations, etc.) and whose ps_drop member shows
the number of packets dropped by this capture handle, usually due to
buffer size limits but potentially also due to memory allocation failures.
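
For reference, a minimal sketch of reading those counters from an open
handle looks something like this (the output formatting is just for
illustration):

    #include <pcap.h>
    #include <stdio.h>

    /* Print Npcap's per-handle statistics: ps_recv counts packets seen on
     * the adapter for this handle, ps_drop counts packets dropped because
     * the kernel buffer was full or could not be allocated. */
    void report_stats(pcap_t *p)
    {
        struct pcap_stat st;
        if (pcap_stats(p, &st) == 0)
            printf("received: %u  dropped: %u\n", st.ps_recv, st.ps_drop);
        else
            fprintf(stderr, "pcap_stats failed: %s\n", pcap_geterr(p));
    }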


- We call pcap_open() with a snap length of 65536, promiscuous mode
enabled, and a read timeout of 500ms.

The recommended functions to open a capture handle are pcap_create() and
pcap_activate(), which allow finer-grained control over capture
parameters via a number of pcap_set_*() functions. Modern systems with
Receive Segment Coalescing can indicate packets larger than the MTU/MSS
of the adapter, so if your intent is to capture the entire packet, do not
set a snaplen at all; the default is the maximum value. Promiscuous mode
may not be supported on all adapters and, on most switched networks, will
not necessarily result in more data being captured, so review your
application's needs to be sure it is appropriate. The read timeout can be
tuned to your application's needs, which may change depending on the
other adjustments you make based on this guidance.
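
As a rough sketch of the create-set-activate pattern (the parameter
values and error handling here are illustrative, not recommendations):

    #include <pcap.h>
    #include <stdio.h>

    /* Open an adapter with pcap_create()/pcap_activate(). */
    pcap_t *open_capture(const char *device)
    {
        char errbuf[PCAP_ERRBUF_SIZE];
        pcap_t *p = pcap_create(device, errbuf);
        if (p == NULL) {
            fprintf(stderr, "pcap_create: %s\n", errbuf);
            return NULL;
        }
        /* No pcap_set_snaplen() call: the default is the maximum, which
         * also accommodates coalesced packets larger than the MTU. */
        pcap_set_promisc(p, 1);   /* only if you really need promiscuous mode */
        pcap_set_timeout(p, 500); /* read timeout in milliseconds */
        if (pcap_activate(p) < 0) {
            fprintf(stderr, "pcap_activate: %s\n", pcap_geterr(p));
            pcap_close(p);
            return NULL;
        }
        return p;
    }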


- We call pcap_setbuff() to increase the size of the kernel buffer to 16MB.

pcap_setbuff() is a WinPcap extension that should not be used in new
programs. Use pcap_set_buffer_size() instead.
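
In the sketch above, this is just one more call between pcap_create()
and pcap_activate(); note that in standard libpcap this call is only
accepted on a handle that has not yet been activated:

    /* Request a 16 MB kernel buffer limit; the size is illustrative.
     * In standard libpcap this must come before pcap_activate(). */
    pcap_set_buffer_size(p, 16 * 1024 * 1024);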


- We then call pcap_next_ex() inside a for loop to get the next capture.
- Upon successful return, we allocate a byte array using
pkt_header.caplen, copy the pkt_data into the byte array, and add the byte
array to a pre-allocated list.
- We execute this for loop until the pre-allocated list is filled (to
avoid reallocation) or a predetermined timeout is exceeded on the
application side.
- When either of these conditions is satisfied, we hand the pre-allocated
list off to another thread, allocate a new list, and do the loop again.

Here are my questions.

1.  Is pcap_next_ex() the most efficient way of transferring captures to
the application?  It looks like pcap_loop() or pcap_dispatch() might allow
multiple captures to be returned via a single callback.  Is that correct?
And if so, would that be the recommended way to get the captures in a high
data rate environment?


The advantage of pcap_loop() or pcap_dispatch() is that they handle the
looping and offer better control over when to stop processing packets.
pcap_dispatch(), in particular, will process packets until it is time to
issue another request for packet data to the kernel (the Npcap driver in
this case). This can be combined with the Windows Event returned by the
pcap_getevent() function, which is signaled when a batch of packets is
"ready" for the application to process, as defined by parameters set via
pcap_setmintocopy(), pcap_set_timeout(), pcap_set_immediate_mode(), etc.
So an application will typically call WaitForSingleObject() (or another
API function for synchronizing on an Event) until the event is signaled,
then call pcap_dispatch() to run the callback on all received packets.
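
A rough sketch of that event-driven loop (the mintocopy value, wait
timeout, and error handling are illustrative):

    #include <winsock2.h>
    #include <windows.h>
    #include <pcap.h>
    #include <stdio.h>

    static void handle_packet(u_char *user, const struct pcap_pkthdr *h,
                              const u_char *bytes)
    {
        /* Copy h->caplen bytes from `bytes` into application storage here;
         * the data is only valid for the duration of the callback. */
        (void)user; (void)h; (void)bytes;
    }

    /* Wait until the driver signals that a batch of packets is ready,
     * then drain it with pcap_dispatch(). `running` is an application flag. */
    void capture_loop(pcap_t *p, volatile int *running)
    {
        pcap_setmintocopy(p, 64 * 1024);  /* batch threshold in bytes (illustrative) */
        HANDLE ev = pcap_getevent(p);
        while (*running) {
            DWORD wait = WaitForSingleObject(ev, 500); /* ms */
            if (wait == WAIT_FAILED)
                break;
            /* cnt = -1: process everything currently buffered */
            if (pcap_dispatch(p, -1, handle_packet, NULL) < 0) {
                fprintf(stderr, "pcap_dispatch: %s\n", pcap_geterr(p));
                break;
            }
        }
    }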


2.  Our understanding is that the kernel buffer *IS* the ring buffer that
must be read from at least as fast as the data is received in order to
minimize/eliminate the occurrence of dropped packets.  We understand the
size of the buffer won't prevent dropped packets if the application can't
keep up (it merely delays the moment when that occurs).  But a bigger ring
buffer can accommodate data spikes, allowing the application to catch up
during data lulls.

This is correct.

To this end, how big can we make the kernel buffer via pcap_setbuff()?  Is
there a practical or rule-of-thumb limit?

The kernel buffer space is allocated from the NonPagedPool, which is a very
precious resource. On my laptop currently running Windows 11 with 4GB of
RAM, the NonPagedPool is 768MB. Fortunately, since Npcap 1.00 the "kernel
buffer size" is interpreted as a limit, not allocated all at once as it was
in WinPcap. This means that setting a ridiculously large buffer size will
not immediately crash the system, and as long as you continue to read from
it, it will likely never attain the full size. However, it does open up the
possibility of running out of resources later, especially if you stop
processing packets without closing the handle.

Is the ring buffer associated with each handle???

The size limit is tracked per handle and refers to the amount of packet
data that particular handle is waiting on. If multiple handles are
waiting on the same data, each one counts the storage against its own
limit, but the data will not actually be duplicated, and it will not be
freed until the last handle retrieves the associated packet.


  If we collect simultaneously with Npcap on multiple NICs, does the size
of each ring buffer need to be limited in any way?

Each NIC (technically: each Npcap filter module, which is an instance of
Npcap in a particular NDIS stack) stores packet data that a handle is
waiting to retrieve. Multiple handles on the same NIC can share that data
as described above, meaning that the actual amount of NonPagedPool used is
most likely less than the sum of their buffer sizes. Handles on multiple
NICs will not share data in this way, so it is more likely that they can
consume an amount of NonPagedPool equal to the sum of their buffer sizes.
Measurement is the best way to determine how much network data your
application can process. Setting a small snaplen and putting as much
filtering logic into the kernel BPF filter as possible are good ways to
reduce the amount of kernel buffer that is needed.
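
As an illustration of the latter point, a kernel BPF filter can be
compiled and installed on an activated handle so that uninteresting
packets never take up kernel buffer space (the filter expression below
is just an example):

    #include <pcap.h>
    #include <stdio.h>

    /* Compile and install a kernel BPF filter on an activated handle. */
    int install_filter(pcap_t *p, const char *expr)
    {
        struct bpf_program prog;
        if (pcap_compile(p, &prog, expr, 1 /* optimize */,
                         PCAP_NETMASK_UNKNOWN) < 0) {
            fprintf(stderr, "pcap_compile: %s\n", pcap_geterr(p));
            return -1;
        }
        int ret = pcap_setfilter(p, &prog);
        if (ret < 0)
            fprintf(stderr, "pcap_setfilter: %s\n", pcap_geterr(p));
        pcap_freecode(&prog);
        return ret;
    }

    /* Usage: install_filter(p, "tcp and host 192.0.2.1"); */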


Should we use pcap_set_buffer_size() instead of pcap_setbuff()?  Can
pcap_set_buffer_size() be called after pcap_open() like pcap_setbuff()
can?  Or do we need to use the create-set-activate pattern?

Both of these functions achieve the same result, but pcap_set_buffer_size()
is preferred because it is standard libpcap API and will work on non-Npcap
platforms.


3.  We are confused about the difference between the kernel buffer and the
user buffer. How does the user buffer work with pcap_next_ex()?  Since
pcap_next_ex() only returns a single packet at a time, does the user buffer
even matter?  Perhaps it comes into play with pcap_loop() or
pcap_dispatch() being able to return more data in the callback?

The "user buffer" is used to transfer packet data between the kernel
(Npcap's NDIS filter driver) and userspace (wpcap.dll). Tuning this
parameter may help reduce overhead, but other parameters should probably be
adjusted first. pcap_next_ex() reads from this buffer until it is empty,
then it issues a Read call to retrieve more packets.
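
If measurement shows this transfer is a bottleneck, the WinPcap/Npcap
extension pcap_setuserbuffer() can resize that buffer on an open handle;
a minimal sketch (the 1 MB value is arbitrary):

    /* Optional: enlarge the user-space transfer buffer on an open handle.
     * This is a WinPcap/Npcap extension; the size here is arbitrary. */
    if (pcap_setuserbuffer(p, 1024 * 1024) < 0)
        fprintf(stderr, "pcap_setuserbuffer: %s\n", pcap_geterr(p));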


4.  How does pcap_setmintocopy() work with pcap_next_ex()?  Again, since
pcap_next_ex() only returns a single packet at a time, does this even
apply?  Perhaps with pcap_loop() or pcap_dispatch()?

The same mechanism retrieves packets for all of these functions. The
MinToCopy parameter serves to reduce the number of Read calls, each of
which requires the user buffer to be mapped and locked. It works best using the
Windows Event synchronization mechanism along with pcap_dispatch().


5.  Finally, can stats be enabled for the same handle that is capturing
the data?  If we want to monitor, for example, the number of dropped
packets seen during a capture, how do we do that with the pcap_t returned
by pcap_open()?  If we need two pcap_t handles, one for capture and one for
stats, does that imply a single ring buffer under the hood for a given NIC?

"Statistics mode" using pcap_setmode() with MODE_STAT is a WinPcap
extension, and we have not done much to alter or fix it. It is mutually
exclusive with capture mode, so two handles are required, but it does not
buffer packets, only counts them. The preferred way to check statistics is
the pcap_stats() function mentioned earlier. This can be used on any open
capture handle regardless of mode.

Hopefully this response, though a bit late, will be helpful.
_______________________________________________
Sent through the dev mailing list
https://nmap.org/mailman/listinfo/dev
Archived at https://seclists.org/nmap-dev/
