tcpdump mailing list archives

Re: some questions about TPACKET3


From: Mario Rugiero via tcpdump-workers <tcpdump-workers () lists tcpdump org>
Date: Sun, 28 Jun 2020 16:23:23 -0300

--- Begin Message --- From: Mario Rugiero <mrugiero () gmail com>
Date: Sun, 28 Jun 2020 16:23:23 -0300
On Sat, Jun 27, 2020 at 23:56, Michael Richardson
(<mcr () sandelman ca>) wrote:


Mario, can you confirm my understanding here?

Hi Michael.

In TPACKET3 mode, there are tp_block_nr pools of memory.
Each block is tp_block_size bytes in size, which can be large,
e.g. 4M (2^22 in the kernel documentation example).
(We, however, seem to pick a block size which is only just big enough to hold
the maximum snaplen.)

Each block contains a linked list of tp3_hdr entries, interleaved with the
packet data itself.  The "next" pointer is tp_next_offset.
It seems from my reading of the code that the kernel returns an entire chain
of tp3_hdr to us, controlled by a *single* block_status bit.
That is, we get entire chains of tp3_hdr from the kernel, and we return
them to the kernel a whole block at a time.

I think this was not the case with tp2: there, packets were passed
to/from the kernel one at a time, each with its own TP_STATUS_KERNEL
bit.

AFAIK all of this is correct.
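
To spell that out in code, this is roughly how one block is consumed and
then handed back (a sketch based on the packet_mmap documentation's example;
walk_block is just a name I made up, not anything in libpcap):

#include <linux/if_packet.h>
#include <stdint.h>

/* Walk every tpacket3_hdr chained inside one block, then release the
 * whole block back to the kernel with a single status store. */
static void walk_block(struct tpacket_block_desc *bd)
{
    uint32_t num_pkts = bd->hdr.bh1.num_pkts;
    struct tpacket3_hdr *ppd = (struct tpacket3_hdr *)
        ((uint8_t *)bd + bd->hdr.bh1.offset_to_first_pkt);

    for (uint32_t i = 0; i < num_pkts; i++) {
        uint8_t *frame = (uint8_t *)ppd + ppd->tp_mac;
        /* frame points at ppd->tp_snaplen bytes of packet data */
        (void)frame;
        ppd = (struct tpacket3_hdr *)
            ((uint8_t *)ppd + ppd->tp_next_offset);
    }

    /* one status word covers the whole chain of packets */
    bd->hdr.bh1.block_status = TP_STATUS_KERNEL;
}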

For a contract, I am trying to improve write performance by using
async I/O.  {I also need to associate requests and responses, which makes the
ordering of operations non-sequential.}
I therefore do not want to give the blocks back to the kernel until the
write has concluded, and for this I'm working on a variation of
linux_mmap_v3() which will call back with groups of packets through
a pipeline of "processors", each of which may steal a packet and
return it later.

I am realizing that I have to keep track of the blocks, not just the
packets.  I guess my original conceptual thinking was too heavily
influenced by V2, and I was thinking that V3 had changed things by
splitting the hdr from the packet, putting the constant-size hdrs
into a fixed-size ring while the packet content was allocated
as needed.
I see that I am mistaken, but I'd sure love confirmation.
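
(Roughly, the per-block bookkeeping that seems to be required looks like
the following; the names are hypothetical and none of it is existing pcap
API: each block carries a count of stolen packets, and only when the count
drops back to zero is block_status flipped to TP_STATUS_KERNEL.)

#include <linux/if_packet.h>
#include <stdatomic.h>

/* Hypothetical per-block bookkeeping: the block is handed back to the
 * kernel only once every stolen packet has been returned. */
struct block_ref {
    struct tpacket_block_desc *bd;
    atomic_uint outstanding;    /* packets still held by processors */
};

static void steal_packet(struct block_ref *br)
{
    atomic_fetch_add(&br->outstanding, 1);
}

static void return_packet(struct block_ref *br)
{
    if (atomic_fetch_sub(&br->outstanding, 1) == 1) {
        /* last outstanding packet: release the whole block */
        br->bd->hdr.bh1.block_status = TP_STATUS_KERNEL;
    }
}

(The block walker itself would hold one reference for the duration of the
walk, so a block with no stolen packets is still released only after the
walk finishes.)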

I believe you may be thinking of AF_XDP. As you probably know, libpcap doesn't
have support for it (yet), but I don't think you'll have trouble using
it directly.
I worked briefly with the RX side of it, so I may be able to help you with that.
As you said, it splits headers from packets, sort of.
The packet contents are stored in blocks of a buffer called UMEM. Unlike
PACKET_MMAP, you work with two queues per path, both containing descriptors
used to find the data in UMEM. These descriptors fill the role of the headers.
On the RX side you have the FILL queue, where you store descriptors to indicate
to the kernel that a given block is free to use, and the RX queue, where the
kernel gives these blocks back when a packet passes the filter[0].
The TX side has a TX queue, where you store descriptors pointing to the data you
want to send in the UMEM buffer, and a COMPLETION queue, where the kernel
gives you the blocks back for reuse after the data has been sent.
IIRC, AF_XDP allows queuing packets to later send in a burst on request,
but since I didn't work with that path I'm not 100% certain.

Since the UMEM blocks are fixed-size and one block is used for each packet, they
consume more memory, but they are much simpler to use for this and allow
out-of-order release of resources.

[0]: AF_XDP requires eBPF filters to be installed in the kernel.
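
To make the RX flow concrete, here is roughly what it looks like with the
xsk_* helpers from libbpf/libxdp (written from memory, so treat names such
as umem_area and free_frame_addr as placeholders rather than a definitive
recipe):

#include <xdp/xsk.h>    /* <bpf/xsk.h> with older libbpf */
#include <stdint.h>

/* Post one free UMEM frame on the FILL ring, then consume one completed
 * descriptor from the RX ring, if any. */
static void rx_once(struct xsk_ring_prod *fill, struct xsk_ring_cons *rx,
                    void *umem_area, uint64_t free_frame_addr)
{
    uint32_t idx;

    /* FILL: tell the kernel this UMEM frame may be used for reception */
    if (xsk_ring_prod__reserve(fill, 1, &idx) == 1) {
        *xsk_ring_prod__fill_addr(fill, idx) = free_frame_addr;
        xsk_ring_prod__submit(fill, 1);
    }

    /* RX: the kernel hands the frame back once a packet passed the filter */
    if (xsk_ring_cons__peek(rx, 1, &idx) == 1) {
        const struct xdp_desc *d = xsk_ring_cons__rx_desc(rx, idx);
        uint8_t *pkt = xsk_umem__get_data(umem_area, d->addr);
        /* pkt points at d->len bytes of packet data */
        (void)pkt;
        xsk_ring_cons__release(rx, 1);
    }
}

As far as I remember, the TX/COMPLETION side is symmetric: descriptors go
out on the TX ring (followed by a sendto() to kick the kernel) and the
frames come back on the COMPLETION ring once sent.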

I am also considering rewriting packet_mmap.txt :-)

--
]               Never tell me the odds!                 | ipv6 mesh networks [
]   Michael Richardson, Sandelman Software Works        |    IoT architect   [
]     mcr () sandelman ca  http://www.sandelman.ca/        |   ruby on rails    [





--- End Message ---
_______________________________________________
tcpdump-workers mailing list
tcpdump-workers () lists tcpdump org
https://lists.sandelman.ca/mailman/listinfo/tcpdump-workers
