Wireshark mailing list archives

Re: Proposal to improve filtration speed by caching fields that are queried recently


From: Jaap Keuter <jaap.keuter () xs4all nl>
Date: Mon, 15 Jun 2020 06:24:18 +0200

Hi,

I'm not sure when the filtering system was last worked on in this depth, but I suspect it has been a while. Finding 
someone completely up to speed on it may be a challenge.

Thanks,
Jaap


On 15 Jun 2020, at 05:38, Sidhant Bansal <sidhbansal () gmail com> wrote:

Hi all,

I want to propose speeding up display filters by maintaining a cache of the fields that have been queried recently, 
so that we avoid re-dissecting all the packets again and again when it is not required.

Motivation: Benchmarking filtering on capture files > 100 MB shows that a lot of the time is spent inside the 
dissectors: re-dissection consumes more than ~40-50% of the total filtering time. I believe we can make huge savings 
here.

Example:
1st Filter applied: tcp.srcport >= 1200 && tcp.dstport <= 1500
This filter runs as it does right now AND stores tcp.srcport and tcp.dstport for all the packets in memory in 
Wireshark.
2nd Filter applied: tcp.srcport == 80
We don't need to re-dissect the packets and can simply refer to the stored information to apply the filter.
3rd Filter applied: tcp.srcport == 120 || udp.srcport == 80
Since we haven't stored "udp.srcport" in our cache, we need to re-dissect AND we also store udp.srcport for all the 
packets (to speed up future filter queries).
4th Filter applied: tcp.srcport == 40 || udp.srcport >= 1000 || tcp.dstport <= 500
Since all of these fields are in the cache, we can read them directly from the information stored in memory and 
don't need to re-dissect any of the packets.
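
To make the example concrete, here is a minimal Python sketch that traces which of the four filters would trigger a 
re-dissection. It only tracks field names; the function apply_filter and the set-based bookkeeping are hypothetical 
and are not Wireshark APIs. It also assumes nothing is evicted between filters.

```python
# Hypothetical bookkeeping: which filters force a re-dissection pass?
def apply_filter(needed_fields, cached_fields):
    """Return (redissect, cached_fields_after) for one filter."""
    missing = set(needed_fields) - cached_fields
    redissect = bool(missing)  # any uncached field forces a full pass
    return redissect, cached_fields | missing

# Fields referenced by the four filters in the example, in order.
filters = [
    {"tcp.srcport", "tcp.dstport"},
    {"tcp.srcport"},
    {"tcp.srcport", "udp.srcport"},
    {"tcp.srcport", "udp.srcport", "tcp.dstport"},
]

cached = set()
history = []
for needed in filters:
    redissect, cached = apply_filter(needed, cached)
    history.append(redissect)

print(history)  # one boolean per filter: True means "had to re-dissect"
```

In this trace only the 1st and 3rd filters need a dissection pass; the 2nd and 4th are answered from the cache.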

We can limit the number of fields held in memory at any given moment depending on how many packets we have and how 
much memory we can afford to allocate. Fields can be evicted from the cache according to a cache replacement policy 
(I haven't thought about which one would be the most apt; input is welcome).
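
As one candidate for discussion, a least-recently-used (LRU) policy could look like the sketch below. LRU is an 
assumption on my part, not a decision, and the FieldCache class with its max_fields limit is purely illustrative. 
Each cached field is treated as one column of per-frame values, and the whole column is evicted at once.

```python
from collections import OrderedDict


class FieldCache:
    """Illustrative LRU cache keyed by field name (not a Wireshark API)."""

    def __init__(self, max_fields):
        self.max_fields = max_fields
        # field name -> {frame number: value}; oldest use comes first
        self._columns = OrderedDict()

    def __contains__(self, field):
        return field in self._columns

    def get(self, field):
        """Return the cached column, marking the field as recently used."""
        column = self._columns.get(field)
        if column is not None:
            self._columns.move_to_end(field)
        return column

    def put(self, field, column):
        """Store a column, evicting least recently used fields if full."""
        self._columns[field] = column
        self._columns.move_to_end(field)
        while len(self._columns) > self.max_fields:
            self._columns.popitem(last=False)


cache = FieldCache(max_fields=2)
cache.put("tcp.srcport", {1: 80, 2: 1500})
cache.put("tcp.dstport", {1: 1300, 2: 443})
cache.get("tcp.srcport")              # touch: tcp.dstport is now the oldest
cache.put("udp.srcport", {3: 53})     # evicts tcp.dstport
```

A size-aware policy (evicting by bytes rather than by field count) may fit better once the per-field memory cost is 
known.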

Most fields are fixed-length and small, i.e. <= 8 bytes. For variable-length fields such as strings, which can be 
arbitrarily large, we can skip caching and instead re-dissect all the packets whenever the filter expression 
contains such a field.
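
A rough way to reason about the memory cost, as a sketch: the 8-byte threshold comes from the paragraph above, while 
the helper names and the assumption of exactly one value per packet per field are mine (a real packet can carry a 
field zero or several times).

```python
MAX_CACHED_WIDTH = 8  # bytes; mirrors the "<= 8 bytes" remark above


def is_cacheable(width):
    """width is the fixed size in bytes, or None for variable-length fields."""
    return width is not None and width <= MAX_CACHED_WIDTH


def cache_bytes(num_packets, widths):
    """Raw payload size of the cache, ignoring container overhead."""
    return num_packets * sum(widths)


# tcp.srcport and tcp.dstport are 16-bit values, i.e. 2 bytes each.
port_cache_size = cache_bytes(1_000_000, [2, 2])
print(is_cacheable(2), is_cacheable(None), port_cache_size)
```

Under these assumptions, caching both TCP ports for a million packets costs about 4 MB of raw values.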

From an implementation point of view: the cached field values can be stored inside frame_data, since that persists 
for as long as a capture file is open in Wireshark. Whenever we encounter a new filter query, we check whether all 
of its fields are in the cache. If yes, then once the abstract syntax tree of the filter query has been converted to 
DFVM, evaluation looks the values up in the cache instead of re-dissecting. If no, we do what we do currently, i.e. 
re-dissect, but we also store the new fields in our cache (according to the chosen replacement policy).
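
The control flow described above can be sketched as follows. This is a Python model only: Frame stands in for 
frame_data, dissect() stands in for a full dissection, and the filter is a plain predicate rather than DFVM 
bytecode. All of these names are hypothetical. Fields absent from a packet are stored as None.

```python
class Frame:
    """Stand-in for frame_data: persists while the capture file is open."""

    def __init__(self, raw):
        self.raw = raw      # stand-in for the packet bytes
        self.cached = {}    # field name -> value (None if field is absent)


def dissect(frame):
    """Stand-in for a full (expensive) dissection of one packet."""
    return dict(frame.raw)


def run_filter(frames, fields, predicate, stats):
    """Evaluate predicate per frame, dissecting only on a cache miss."""
    matches = []
    for frame in frames:
        if all(f in frame.cached for f in fields):
            values = frame.cached               # cache hit: no dissection
        else:
            stats["dissections"] += 1
            values = dissect(frame)
            for f in fields:                    # remember for later filters
                frame.cached[f] = values.get(f)
        matches.append(predicate({f: values.get(f) for f in fields}))
    return matches


frames = [
    Frame({"tcp.srcport": 80, "tcp.dstport": 1300}),
    Frame({"tcp.srcport": 1500, "tcp.dstport": 443}),
]
stats = {"dissections": 0}
first = run_filter(frames, ["tcp.srcport"],
                   lambda v: v["tcp.srcport"] >= 1200, stats)
after_first = stats["dissections"]
second = run_filter(frames, ["tcp.srcport"],
                    lambda v: v["tcp.srcport"] == 80, stats)
```

Here the second filter is answered entirely from the per-frame cache, so the dissection counter does not increase.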

I would like to hear any feedback on or objections to this optimization.

___________________________________________________________________________
Sent via:    Wireshark-dev mailing list <wireshark-dev () wireshark org>
Archives:    https://www.wireshark.org/lists/wireshark-dev
Unsubscribe: https://www.wireshark.org/mailman/options/wireshark-dev
            mailto:wireshark-dev-request () wireshark org?subject=unsubscribe

