Wireshark mailing list archives

Strings containing characters that don't map to printable ASCII


From: Guy Harris <guy () alum mit edu>
Date: Sun, 20 May 2012 14:03:55 -0700


On May 20, 2012, at 11:32 AM, darkjames () wireshark org wrote:

> http://anonsvn.wireshark.org/viewvc/viewvc.cgi?view=rev&revision=42727
>
> User: darkjames
> Date: 2012/05/20 11:32 AM
>
> Log:
> Revert r35131 fix bug #5738
>
> g_unichar_isprint() is for *wide characters*.
> For UTF-8 multibyte characters we could
> use g_utf8_validate() and g_utf8_next_char(),
> but IMHO format_text_* should be ASCII-only.

I'm not sure it should always be ASCII-only.  Somebody might want, for example, to see file names as they would appear 
in the UI.

However, in other circumstances, somebody might want to see the raw octets of non-ASCII characters if, for example, 
they're dealing with encoding issues (e.g., SMB servers sending Normalization Form D Unicode strings over the wire to 
Windows clients that expect Normalization Form C strings - this is not, BTW, a hypothetical case... - or strings sent 
over the wire that aren't valid {UTF-8,UTF-16,UCS-2,...}).

So perhaps at least two ways of displaying strings are needed, perhaps settable via a preference.  The first might, for 
example, display invalid sequences and characters that don't exist in Unicode as the Unicode REPLACEMENT CHARACTER:

        http://unicode.org/charts/nameslist/n_FFF0.html

and display non-printable characters as either REPLACEMENT CHARACTER or, for C0 control characters, the corresponding 
Unicode SYMBOL FOR XXX character:

        http://unicode.org/charts/nameslist/n_2400.html
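
To make that first mode concrete, here's a rough sketch using GLib; format_text_unicode() is a name made up for 
illustration, not an existing Wireshark routine:

        #include <glib.h>

        /* Illustrative: render a UTF-8 string, mapping invalid sequences and
         * unprintable characters to U+FFFD REPLACEMENT CHARACTER, and C0
         * controls to the corresponding U+24xx SYMBOL FOR XXX character. */
        static gchar *
        format_text_unicode(const gchar *str, gsize len)
        {
            GString *out = g_string_sized_new(len);
            const gchar *p = str, *end = str + len;

            while (p < end) {
                gunichar c = g_utf8_get_char_validated(p, end - p);

                if (c == (gunichar)-1 || c == (gunichar)-2) {
                    /* Invalid or truncated sequence: U+FFFD, skip one octet. */
                    g_string_append_unichar(out, 0xFFFD);
                    p++;
                    continue;
                }
                if (c < 0x20)
                    g_string_append_unichar(out, 0x2400 + c);  /* SYMBOL FOR XXX */
                else if (g_unichar_isprint(c))
                    g_string_append_unichar(out, c);
                else
                    g_string_append_unichar(out, 0xFFFD);
                p = g_utf8_next_char(p);
            }
            return g_string_free(out, FALSE);
        }

(This is also the context where g_unichar_isprint() is actually applicable - a decoded code point, not a raw octet.)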

The second might, for example, display octets that don't correspond to printable ASCII characters as C-style backslash 
escapes, e.g. CR as \r, LF as \n, etc., and octets that don't have specific C-style backslash escapes as either octal 
or hex escapes (we're currently using octal, but I suspect most of us don't deal with PDP-11s on a daily basis, so 
perhaps hex would be better).
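
A sketch of that second mode, with hex rather than octal escapes (again, the function name is made up):

        #include <glib.h>

        /* Illustrative: escape octets that aren't printable ASCII, C-style,
         * using \xNN rather than \NNN for octets with no named escape. */
        static gchar *
        format_text_escaped(const guint8 *buf, gsize len)
        {
            GString *out = g_string_sized_new(len);
            gsize i;

            for (i = 0; i < len; i++) {
                switch (buf[i]) {
                case '\r': g_string_append(out, "\\r");  break;
                case '\n': g_string_append(out, "\\n");  break;
                case '\t': g_string_append(out, "\\t");  break;
                case '\\': g_string_append(out, "\\\\"); break;
                default:
                    if (buf[i] >= 0x20 && buf[i] < 0x7f)
                        g_string_append_c(out, buf[i]);
                    else
                        g_string_append_printf(out, "\\x%02x", buf[i]);
                    break;
                }
            }
            return g_string_free(out, FALSE);
        }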

All of the GUI toolkits we're likely to care about use Unicode, in some encoding, for strings, so we don't need to 
worry about translating from Unicode to ISO 8859/x or some flavor of EUC or... in the GUI - we can just hand the GUI 
Unicode strings.

For writing to files and to the "terminal", we might have to determine the user's code page/character encoding and map 
to that.  I think most UN*Xes support UTF-8 as the character encoding in the LANG environment variable these days, and 
sufficiently recent versions of Windows have code page 65001, a/k/a UTF-8:

        http://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx

(I don't know whether that dates back to W2K, which I think is the oldest version of Windows supported by current 
versions of Wireshark), so it is, at least in theory, possible for a user to configure their system so that 
non-UCS-2/non-UTF-16 text files and their terminal emulator/console program can handle Unicode.  In practice, some 
users might have reasons why they can't or wouldn't want to do that, however.
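
GLib already wraps that determination, so the check could be as simple as this sketch:

        #include <glib.h>
        #include <stdio.h>

        int
        main(void)
        {
            const char *charset;

            /* g_get_charset() returns TRUE if the locale's encoding is
             * UTF-8; otherwise we'd have to convert before writing. */
            if (g_get_charset(&charset))
                printf("can hand the terminal UTF-8 directly\n");
            else
                printf("must convert to %s first (e.g. with g_convert())\n",
                       charset);
            return 0;
        }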

> We rather need to store encoding of FT_STRING[Z]
> and in proto_item_fill_label() call appropriate
> function.
> For ENC_ASCII use format_text(),
> for unicode (ENC_UTF*, ENC_UCS*) use format_text_utf(),
> etc.
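
That dispatch might look something like the following hypothetical sketch - the encoding enum here is illustrative 
(the real ENC_* flags live in epan/proto.h), format_text() exists today, and format_text_utf() is the proposed routine:

        #include <glib.h>

        typedef enum {            /* illustrative stand-in for the ENC_* flags */
            STR_ENC_ASCII,
            STR_ENC_UTF_8,
            STR_ENC_UTF_16,
            STR_ENC_UCS_2
        } string_encoding;

        /* Prototypes approximate; format_text_utf() is the proposal above. */
        extern gchar *format_text(const guchar *string, size_t len);
        extern gchar *format_text_utf(const guchar *string, size_t len);

        /* Hypothetical: pick the formatter in proto_item_fill_label()
         * from the encoding stored with the FT_STRING[Z] value. */
        static gchar *
        format_string_value(const guchar *string, size_t len, string_encoding enc)
        {
            switch (enc) {
            case STR_ENC_UTF_8:
            case STR_ENC_UTF_16:
            case STR_ENC_UCS_2:
                return format_text_utf(string, len);
            case STR_ENC_ASCII:
            default:
                return format_text(string, len);
            }
        }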

This also raises some other questions.

For example, presumably if the user enters a Unicode string in a display filter comparison expression, they'd want it 
to match the field in question if it has that value, regardless of whether it's encoded as UTF-8 or UTF-16 or ISO 
8859/1 or {fill in your flavor of EBCDIC} or....  (They might even want it to match regardless of whether characters 
are composed or not:

        http://unicode.org/reports/tr15/

I would argue that it should, given that the OS with the largest market share on the desktop prefers composed 
characters, the UN*X with the largest market share on the desktop prefers decomposed characters, and all the other 
UN*Xes prefer composed characters, but that's another matter.)  Thus, the comparison that should be done should be a 
comparison between the string the user specified and the value of the field as converted to Unicode (UTF-8, as that's 
the way we're internally encoding Unicode).  If the field's raw value *can't* be converted to Unicode, the comparison 
would fail.
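
As to the composed-or-not aside: GLib can already do the normalization half of that. An illustrative sketch of a 
composition-insensitive comparison:

        #include <glib.h>
        #include <string.h>

        /* Illustrative: compare two UTF-8 strings ignoring whether their
         * characters are precomposed (NFC) or decomposed (NFD). */
        static gboolean
        utf8_equal_normalized(const gchar *a, const gchar *b)
        {
            gchar *na = g_utf8_normalize(a, -1, G_NORMALIZE_NFC);
            gchar *nb = g_utf8_normalize(b, -1, G_NORMALIZE_NFC);
            gboolean equal = (na != NULL && nb != NULL &&
                              strcmp(na, nb) == 0);

            g_free(na);
            g_free(nb);
            return equal;
        }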

However, if the user constructs a filter from a field and its value with Apply As Filter -> Selected, and the field is 
a string field and has a value that *can't* be represented in Unicode, the filter should probably do a match on the raw 
value of the field, not on the value of the field as converted to Unicode.

The latter could perhaps be represented as

        example.name == 48:65:6c:6c:6f:20:ff:ff:ff:ff:ff:ff:ff:ff

or something such as that.

This might mean we'd store, for a string field, the raw value and specified encoding.  When doing a comparison against 
a value specified as an octet string, we'd compare the raw values; when doing a comparison against a value specified as 
a Unicode string, we'd attempt to convert the raw value to UTF-8 and:

        if that fails, have the comparison fail;

        if that succeeds, compare against the converted value.
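
An illustrative sketch of those two rules, using g_convert(); the field representation here is made up:

        #include <glib.h>
        #include <string.h>

        /* Illustrative: match a field's raw value, in its wire encoding
         * (e.g. "UTF-16LE"), against a UTF-8 string from a display filter.
         * If the raw value can't be converted to Unicode, there's no match. */
        static gboolean
        string_field_matches(const guint8 *raw, gsize raw_len,
                             const gchar *wire_charset, const gchar *filter_utf8)
        {
            gsize written;
            gchar *converted = g_convert((const gchar *)raw, raw_len,
                                         "UTF-8", wire_charset,
                                         NULL, &written, NULL);
            gboolean match;

            if (converted == NULL)
                return FALSE;                   /* conversion failed */
            match = (written == strlen(filter_utf8) &&
                     memcmp(converted, filter_utf8, written) == 0);
            g_free(converted);
            return match;
        }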

In addition, when getting the value of a field for some other code to process, what should be done if the field can't 
be mapped to Unicode?

And what about non-printable characters?  We could use %-encoding for the XML formats (PDML, PSML), but for TShark's 
"-e" option, or for "export as CSV", or other non-XML formats, what should be done?

