Wireshark mailing list archives

Re: filter for ONLY initial get request


From: Jeffs <jeffs () speakeasy net>
Date: Wed, 11 Aug 2010 09:06:53 -0400

On 8/11/2010 6:12 AM, Sake Blok wrote:
On 10 aug 2010, at 16:48, Jeffs wrote:
   
I have come up with the following tshark formula, which seems to address my needs.  I am not interested in the URLs
of advertising agencies, videos and other embedded links in web pages, only in the top-level domain, so I use this.
Please let me know if anyone sees any gotchas or potential problems with this formula; I'm very new to regular
expressions and could use advice.  This formula returns only the top-level domains and strips out links such as
admin.brightcove.com, advertisingserver.amazon.com, tubemogel.videos.com:

tshark -r test.cap -R http.request -T fields -e http.host | sed -e 's/?.*$//' | sed -e 's#^\(.*\)\t\(.*\)$#http://\1\2#' | sort | uniq -c | sort -rn | head -n 300 | sed -n -e '/www/p'
     
If you're only interested in an overview of the visited top-level domains, without caring which specific hosts and/or
URIs were visited, you could use something like:

tshark -r test.cap -R http.request -T fields -e http.host | sed -e 's/^.*\.\([^.]*\.[^.]*\)$/\1/' | sort | uniq -c | 
sort -rn | head -n 100

for the top-100 top-level domains (based on individual hits, not user sessions).
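
To see what the sed expression above does, here is a minimal sketch run on a few made-up hostnames instead of a capture file. The sample hosts and the reduce_host function name are illustrative assumptions, not part of the original command. The expression keeps the last two dot-separated labels of each host and leaves hosts with fewer than two dots unchanged:

```shell
# reduce_host: hypothetical wrapper around the sed expression from the pipeline above.
# It keeps only the last two dot-separated labels of each hostname on stdin.
reduce_host() {
  sed -e 's/^.*\.\([^.]*\.[^.]*\)$/\1/'
}

# Made-up hostnames standing in for the output of "tshark ... -e http.host".
printf '%s\n' \
  www.nytimes.com \
  graphics8.nytimes.com \
  static.ak.fbcdn.net \
  nytimes.com |
  reduce_host | sort | uniq -c | sort -rn
```

With this input, three of the four hosts collapse to nytimes.com and one to fbcdn.net. One limit worth knowing: for hosts under a two-part suffix the same expression returns only the suffix, so www.bbc.co.uk becomes co.uk.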

Cheers,


Sake


___________________________________________________________________________
Sent via:    Wireshark-users mailing list<wireshark-users () wireshark org>
Archives:    http://www.wireshark.org/lists/wireshark-users
Unsubscribe: https://wireshark.org/mailman/options/wireshark-users
              mailto:wireshark-users-request () wireshark org?subject=unsubscribe


   
Thank you for your reply.  The issue I am having, which also occurs
with the formula you provided above, is that the reported domains
include links (mostly advertising and image links) embedded in the web
page.  I do not want these because they pollute my results.  I only
want the domain of the link that was clicked or the domain that was
typed into the browser's address bar.  For example, the formula you
provided above returns:

      71 nytimes.com
      15 propertyshark.com
      13 fbcdn.net
       5 voicefive.com
       5 2mdn.net
       4 brightcove.com
       2 google-analytics.com
       2 doubleclick.net
       1 yahoo.com
       1 imrworldwide.com
       1 facebook.com

The doubleclick.net, brightcove.com, 2mdn.net, and fbcdn.net entries
above come from advertising and other links embedded in the landing
page of the domain that was typed or clicked.  They pollute my results.


This formula, however, returns results without the links and images
embedded in the web page:

tshark -r test.cap -T fields -e http.host | sed 's/?.*$//' | sed -n '/www./p' | sort | uniq -c | sort -rn | head -n 100

      15 www.propertyshark.com
       8 www.nytimes.com
       2 www.google-analytics.com
       1 www.facebook.com


However, I am new to regex, so I may well be missing something or
losing some links.
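
As a rough check of the www filter above, the sketch below runs it on a few made-up hostnames rather than on a capture. The sample hosts, the function names, and the stricter anchored variant are assumptions for illustration, not something from this thread. It shows two things: an unescaped "." in a sed pattern matches any character, and a www-based filter drops sites that were visited without a www prefix.

```shell
# Made-up http.host values, one per line.
hosts='www.nytimes.com
nytimes.com
mail.google.com
www.google-analytics.com
ad.www2.example.net'

# Filter as written in the formula above: unanchored, and "." matches any character,
# so "www2" in the middle of a hostname also matches.
loose_www() { sed -n '/www./p'; }

# Stricter variant (an assumption): host must start with a literal "www.".
strict_www() { sed -n '/^www\./p'; }

echo "loose:";  printf '%s\n' "$hosts" | loose_www
echo "strict:"; printf '%s\n' "$hosts" | strict_www
```

Both variants drop nytimes.com and mail.google.com, and both keep www.google-analytics.com, which matches the tracker entry still present in the output above. Matching on www alone therefore narrows the list but does not by itself separate typed or clicked pages from embedded content.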

Thank you.


