Wireshark mailing list archives

Re: UTF8 vs. locale in error messages (bug 5715)

From: Graham Bloice <graham.bloice () trihedral com>
Date: Wed, 29 Jun 2011 10:37:33 +0100

On 28/06/2011 18:27, Guy Harris wrote:

On Jun 28, 2011, at 6:10 AM, Stig Bjørlykke wrote:

On Tue, Jun 28, 2011 at 2:58 AM, Guy Harris <guy () alum mit edu> wrote:

       1) UN*Xes where LANG etc. aren't set to a locale with UTF-8 as the encoding (are you seeing the issue with 
Norwegian characters on your system?  If so, what's the setting of LANG?);

I only had issues with Norwegian characters in file names reported via
simple_dialog(), and my LANG is empty.

OK, what OS are you using?  If it's a UN*X, try compiling and running the attached C program; does it print your name 
correctly on your terminal/terminal emulator (it writes it out in UTF-8), and does the file it creates (your name is 
its name - yeah, complete with a space between "Stig" and "Bjørlykke", and with no ".txt" at the end) have a name 
that shows up correctly if you do "ls"?  If it's Windows, then you're probably just seeing bug 5715.

Another problem is that we still have issues regarding UTF-8 strings
in packets.  We should really fix that...

We have an issue regarding strings in packets in general.  Strings might be in a number of encodings, including ASCII 
(meaning that any byte with the 8th bit set is something that shouldn't be there), other national variants of ISO 
646, UTF-8, UTF-16, UCS-2 (meaning "only the Basic Multilingual Plane, with no surrogate pairs"), ISO 8859/x for 
various values of x, various ISO 2022-based encodings (e.g., the EUC encodings), various national standards, various 
DOS and Windows code pages, various Mac OS encodings, EBCDIC, whatever encodings are used for SMS, etc., etc., etc., 
etc.:

      http://en.wikipedia.org/wiki/Template:Character_encoding

I don't know whether all of the encodings in question can be mapped to Unicode without information loss.  An 
arbitrary string of octets definitely can't be mapped to UTF-8 without information loss; consider a putatively 
UTF-8-encoded string that contains an octet sequence that's not valid in UTF-8.
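
To make the information-loss point concrete, here is a small Python sketch. The helper name is_valid_utf8 and the sample octets are made up for illustration and are not taken from Wireshark. It shows an octet that is never valid in UTF-8, and two different invalid inputs that become indistinguishable once they are forced into UTF-8 with a replacement character.

```python
def is_valid_utf8(octets: bytes) -> bool:
    """Return True if the octets decode as strict, well-formed UTF-8."""
    try:
        octets.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# The same name in two encodings. In ISO 8859-1 the "ø" is the single
# octet 0xF8, which can never appear in well-formed UTF-8.
utf8_name = "Bjørlykke".encode("utf-8")
latin1_name = "Bjørlykke".encode("iso-8859-1")

# Two different invalid octet strings, both decoded "for display" with
# the replacement character. The results are identical, so the original
# octets cannot be recovered from the UTF-8 text.
lossy_a = b"Bj\xf8rlykke".decode("utf-8", errors="replace")
lossy_b = b"Bj\xffrlykke".decode("utf-8", errors="replace")
```

Because lossy_a and lossy_b compare equal, anything that needs the original octets has to keep them alongside (or instead of) the converted text.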

Perhaps, in the Wireshark dissection engine, we should initially store string values as a pair {encoding, counted 
octet string} (counted so that octets with the value 0 don't cause problems), and:

      when putting them into a textual representation of the protocol tree or into columns or something else to be 
shown to humans, map them to UTF-8, with anything that can't be mapped to UTF-8 - including, if the encoding is 
putatively UTF-8, octet sequences that aren't valid UTF-8 sequences - shown as the Unicode replacement character 
U+FFFD;

      when comparing them in a display filter, attempt to map them to UTF-8 (and save the result), and:

              if the mapping fails, treat *all* comparisons except for inequality as failing, and treat comparisons 
for inequality as succeeding;

              if the mapping succeeds, compare the two strings;

      when making them available to software inside *Shark (C/C++ code, Lua code, Python code, etc.), attempt to 
convert them to whatever the appropriate representation is (presumably UTF-8), and have the routines to fetch those 
values support returning a "conversion failed" indication (or perhaps offer both a "convert for display to humans" 
version that uses U+FFFD for failure and a "convert for processing" version that returns "can't do it" for failure).
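
A rough Python sketch of the scheme described above might look like the following. It is only an illustration under stated assumptions: PacketString, for_display, for_processing and filter_compare are hypothetical names, not existing Wireshark APIs; Python codec names stand in for whatever encoding identifiers the dissection engine would use; and caching of the converted result ("save the result") is left out for brevity.

```python
import operator
from dataclasses import dataclass
from typing import Optional

# Comparison operators a display filter might apply to a string field
# (hypothetical subset, for illustration only).
_OPS = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}


@dataclass(frozen=True)
class PacketString:
    """A string value kept as {encoding, counted octet string}."""
    encoding: str  # a Python codec name, e.g. "utf-8" or "iso-8859-1"
    octets: bytes  # bytes carries its own length, so 0 octets are harmless

    def for_display(self) -> str:
        """Convert for humans: unmappable octets become U+FFFD."""
        return self.octets.decode(self.encoding, errors="replace")

    def for_processing(self) -> Optional[str]:
        """Convert for code: return None to signal 'conversion failed'."""
        try:
            return self.octets.decode(self.encoding, errors="strict")
        except UnicodeDecodeError:
            return None


def filter_compare(field: PacketString, op: str, literal: str) -> bool:
    """Compare a field against a filter literal using operator `op`."""
    text = field.for_processing()
    if text is None:
        # Mapping failed: only an inequality test succeeds.
        return op == "!="
    return _OPS[op](text, literal)
```

With this sketch, PacketString("utf-8", b"Bj\xf8rlykke") displays as "Bj" + U+FFFD + "rlykke", returns None from for_processing(), fails an "==" comparison even against the displayed text, and passes any "!=" comparison.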

Here's the program I mentioned above:

For reference, here's the test executable output on Win7, using the SDK 7.0
build environment (a cmd.exe prompt):

c:\temp>test
Stig Bj├©rlykke
Now creating a file with Stig's name as its name

c:\temp>dir
 Volume in drive C has no label.
 Volume Serial Number is D845-44D4

 Directory of c:\temp

29/06/2011  10:30    <DIR>          .
29/06/2011  10:30    <DIR>          ..
29/06/2011  10:30                17 Stig BjÃ¸rlykke
29/06/2011  10:28            77,312 test.exe
               2 File(s)         77,329 bytes
               2 Dir(s)  65,078,947,840 bytes free

The output of the executable was the same using PowerShell.
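
One plausible reading of the two garbled forms above, sketched in Python: the program emits the name as UTF-8 octets, and those octets are then interpreted in two different Windows code pages. The specific code pages are an assumption on my part (OEM code page 850 for the console, ANSI code page 1252 for the file name); the output above is consistent with them but nothing in it states them.

```python
# "ø" (U+00F8) is the two octets 0xC3 0xB8 in UTF-8.
name_utf8 = "Stig Bjørlykke".encode("utf-8")

# Assumption: the console interprets the octets as code page 850,
# where 0xC3 is a box-drawing character and 0xB8 is the copyright sign.
as_console = name_utf8.decode("cp850")

# Assumption: the file name octets are interpreted as code page 1252,
# where 0xC3 is A-tilde and 0xB8 is a cedilla.
as_file_name = name_utf8.decode("cp1252")

print(as_console)
print(as_file_name)
```

If those assumptions hold, as_console matches what the program printed and as_file_name matches the name shown by dir.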

-- 
Regards,

Graham Bloice

___________________________________________________________________________
Sent via:    Wireshark-dev mailing list <wireshark-dev () wireshark org>
Archives:    http://www.wireshark.org/lists/wireshark-dev
Unsubscribe: https://wireshark.org/mailman/options/wireshark-dev
             mailto:wireshark-dev-request () wireshark org?subject=unsubscribe
