Wireshark mailing list archives

Proposal to improve filtration speed by caching fields that are queried recently


From: Sidhant Bansal <sidhbansal () gmail com>
Date: Mon, 15 Jun 2020 11:38:49 +0800

Hi all,

I want to propose an improvement that speeds up display filters by not
re-dissecting every packet each time a filter is applied, and instead
maintaining a cache of the fields that have been queried recently.

Motivation: Benchmarking filtering on capture files > 100 MB shows that
re-dissection, i.e. the time spent inside the dissectors, accounts for
roughly 40-50% or more of the total filtering time. I believe we can make
large savings here.

Example:
1st filter applied: tcp.srcport >= 1200 && tcp.dstport <= 1500
This filter runs as it does today, and additionally stores tcp.srcport
and tcp.dstport for every packet in Wireshark's memory.
2nd filter applied: tcp.srcport == 80
We don't need to re-dissect the packets; we can apply the filter using
the stored values.
3rd filter applied: tcp.srcport == 120 || udp.srcport == 80
Since udp.srcport is not in the cache, we need to re-dissect, and we
also store udp.srcport for every packet (to speed up future filters).
4th filter applied: tcp.srcport == 40 || udp.srcport >= 1000 || tcp.dstport
<= 500
Since all of these fields are in the cache, we can read them directly
from memory and don't need to re-dissect any of the packets.
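
The four steps above can be traced with a small sketch. This is only an
illustration of the bookkeeping, not Wireshark code: the regex-based field
extraction and the names fields_in and needs_redissection are assumptions
made for the example (the real display filter compiler already knows which
fields a filter references).

```python
import re

# Hypothetical stand-in for the display filter parser: pick out dotted
# field names such as "tcp.srcport" and ignore operators and numbers.
FIELD_RE = re.compile(r"[a-z_][a-z0-9_]*(?:\.[a-z0-9_]+)+")


def fields_in(filter_text):
    """Return the set of field names referenced by a filter string."""
    return set(FIELD_RE.findall(filter_text))


def needs_redissection(filter_text, cached_fields):
    """Report whether the filter needs a re-dissection pass.

    Any field missing from the cache forces a re-dissection, after which
    that field is recorded as cached for later filters.
    """
    missing = fields_in(filter_text) - cached_fields
    cached_fields |= missing  # updated in place
    return bool(missing)


cached = set()
filters = [
    "tcp.srcport >= 1200 && tcp.dstport <= 1500",
    "tcp.srcport == 80",
    "tcp.srcport == 120 || udp.srcport == 80",
    "tcp.srcport == 40 || udp.srcport >= 1000 || tcp.dstport <= 500",
]
results = [needs_redissection(f, cached) for f in filters]
print(results)  # [True, False, True, False]
```

Only the 1st and 3rd filters trigger a re-dissection, matching the
walk-through above.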

We can limit the number of fields held in memory at any given time
depending on how many packets we have and how much memory we can afford
to allocate. Fields can be evicted from the cache according to a cache
replacement policy (I haven't thought about which one would be most apt;
input is welcome).
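
As one candidate to discuss, here is a minimal sketch of a
least-recently-used (LRU) policy over whole fields. LRU is my assumption
for illustration, not a settled choice, and the class name LRUFieldCache
and its methods are hypothetical.

```python
from collections import OrderedDict


class LRUFieldCache:
    """Keeps at most max_fields cached fields, evicting the least recently used."""

    def __init__(self, max_fields):
        self.max_fields = max_fields
        # field name -> list of per-packet values, oldest use first
        self._columns = OrderedDict()

    def get(self, field):
        """Return the cached values for a field, or None if not cached."""
        if field not in self._columns:
            return None
        self._columns.move_to_end(field)  # mark as most recently used
        return self._columns[field]

    def put(self, field, values):
        """Store a field's values and return the names of any evicted fields."""
        self._columns[field] = values
        self._columns.move_to_end(field)
        evicted = []
        while len(self._columns) > self.max_fields:
            name, _ = self._columns.popitem(last=False)  # drop the oldest
            evicted.append(name)
        return evicted


cache = LRUFieldCache(max_fields=2)
cache.put("tcp.srcport", [80, 1250])
cache.put("tcp.dstport", [1300, 1400])
cache.get("tcp.srcport")                    # tcp.dstport is now the oldest
evicted = cache.put("udp.srcport", [None, None])
print(evicted)  # ['tcp.dstport']
```

The limit could be derived from the memory budget, for example the budget
divided by (number of packets x 8 bytes per value).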

Most fields are fixed-length and small, i.e. <= 8 bytes. Fields such as
strings, which are variable-length and can be arbitrarily large, can be
excluded from caching; if a filter expression contains such a field, we
re-dissect all the packets as we do today.
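
A rough sketch of that eligibility check follows. The FIELD_WIDTHS table
is a hypothetical stand-in for the field type information Wireshark
already has for registered fields; None marks a variable-length field, and
fields missing from the table are treated as not cacheable.

```python
MAX_CACHED_WIDTH = 8  # bytes; the "<= 8 bytes" threshold from above

# Hypothetical width table (bytes). None means variable-length.
FIELD_WIDTHS = {
    "tcp.srcport": 2,
    "tcp.dstport": 2,
    "udp.srcport": 2,
    "ip.src": 4,
    "http.host": None,  # string field, arbitrarily long
}


def is_cacheable(field):
    """A field is cacheable if it is fixed-length and small enough."""
    width = FIELD_WIDTHS.get(field)
    return width is not None and width <= MAX_CACHED_WIDTH


def filter_can_use_cache(fields):
    """A filter may use the cache only if every field it references is cacheable."""
    return all(is_cacheable(f) for f in fields)


print(filter_can_use_cache(["tcp.srcport", "ip.src"]))     # True
print(filter_can_use_cache(["tcp.srcport", "http.host"]))  # False
```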

From an implementation point of view: the cached field values can be
stored inside frame_data, since that persists for as long as a capture
file is open. Whenever we encounter a new filter, we check whether all of
its fields are in the cache. If yes, then after the filter's abstract
syntax tree has been compiled to DFVM, evaluation looks the values up in
the cache instead of re-dissecting. If no, we do what we do currently,
i.e. re-dissect, but we also store the new fields in the cache (subject
to the replacement policy).
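
The flow above could look roughly like this. It is a simplified model
under several assumptions: Frame stands in for frame_data, dissect is a
placeholder for a real dissection pass, the filter is a plain Python
predicate rather than DFVM bytecode, and the replacement policy is left
out. A field absent from a packet is cached as None, so that absence is
remembered too.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """Hypothetical stand-in for frame_data."""
    raw: dict                                  # pretend packet contents
    cache: dict = field(default_factory=dict)  # field name -> cached value


stats = {"dissections": 0}


def dissect(frame, wanted):
    """Placeholder for the expensive dissection of one packet."""
    stats["dissections"] += 1
    return {name: frame.raw.get(name) for name in wanted}


def apply_filter(frames, wanted, predicate):
    """Return the 1-based numbers of frames matching the predicate."""
    matches = []
    for number, frame in enumerate(frames, start=1):
        # Re-dissect only if some requested field is not cached yet.
        if not all(name in frame.cache for name in wanted):
            frame.cache.update(dissect(frame, wanted))
        values = {name: frame.cache[name] for name in wanted}
        if predicate(values):
            matches.append(number)
    return matches


frames = [
    Frame({"tcp.srcport": 80, "tcp.dstport": 1300}),
    Frame({"tcp.srcport": 1250, "tcp.dstport": 1400}),
]
first = apply_filter(
    frames, ["tcp.srcport", "tcp.dstport"],
    lambda v: v["tcp.srcport"] >= 1200 and v["tcp.dstport"] <= 1500)
second = apply_filter(
    frames, ["tcp.srcport"], lambda v: v["tcp.srcport"] == 80)
print(first, second, stats["dissections"])  # [2] [1] 2
```

The second filter is answered entirely from the cache, so the dissection
count stays at 2 (one per packet, from the first filter).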

I'd welcome any feedback on, or objections to, this optimization.
___________________________________________________________________________
Sent via:    Wireshark-dev mailing list <wireshark-dev () wireshark org>
Archives:    https://www.wireshark.org/lists/wireshark-dev
