Wireshark mailing list archives

Re: UTF8 vs. locale in error messages (bug 5715)


From: Guy Harris <guy () alum mit edu>
Date: Wed, 29 Jun 2011 09:44:14 -0700


On Jun 29, 2011, at 2:37 AM, Graham Bloice wrote:

> For reference, here's the test executable output on Win7, using the SDK 7.0 build environment (a cmd.prompt):

Not surprisingly, it doesn't work.

Microsoft introduced Unicode support when they introduced Win32; as they were introducing a new API, they could make 
the versions of the API that support Unicode take UCS-2 (later UTF-16) strings as arguments.  They also offered "ASCII" 
versions (what Microsoft calls the "ANSI" versions), which took strings in the local code page as arguments.  The same 
split applies to the C library's routines: open()/_open() take strings in the local code page, and wide-character 
counterparts such as _wopen() take UTF-16.

UN*X systems already had a well-established API when they introduced Unicode support, and they had what amounted to 
code pages (the various ISO 8859/x encodings, the EUC encodings, assorted other encodings); so, instead of adding a 
second set of APIs, they added a new "code page", with UTF-8 encoding.

The program was written for UN*X, to test whether UTF-8 strings work in the user's locale.  On Windows, the "ASCII" API 
it was using to create a file takes strings in your local code page, not UTF-8, and I suspect cmd.exe also expects 
"ASCII" output from programs, such as the test program printing Stig's name, to be in the local code page, not UTF-8.
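
To make the mismatch concrete, here is a minimal sketch (not part of the test program; the function names are made up for illustration) that compares how many characters a UTF-8-aware reader and a single-byte code page reader see in the same bytes. It assumes valid UTF-8 input and a single-byte code page; multi-byte code pages would behave differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Count the code points in a UTF-8 string by counting every byte that is
 * not a continuation byte (continuation bytes look like 10xxxxxx).
 * Assumes the input is valid UTF-8; this sketch does no validation.
 */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/*
 * A single-byte code page treats every byte as one character, so its
 * "character count" is simply the byte count.
 */
size_t codepage_chars(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}
```

For a sample string such as "Bj\xc3\xb8rlykke" (the letter o-with-stroke is the two bytes 0xC3 0xB8 in UTF-8), utf8_codepoints() returns 9 while codepage_chars() returns 10: a code page consumer sees two unrelated characters where a UTF-8 consumer sees one.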

This is why GLib has file functions that do mapping on file names; the page at

        http://developer.gnome.org/glib/stable/glib-File-Utilities.html

says

        There is a group of functions which wrap the common POSIX functions dealing with filenames (g_open(),
        g_rename(), g_mkdir(), g_stat(), g_unlink(), g_remove(), g_fopen(), g_freopen()). The point of these
        wrappers is to make it possible to handle file names with any Unicode characters in them on Windows
        without having to use ifdefs and the wide character API in the application code.

        The pathname argument should be in the GLib file name encoding. On POSIX this is the actual on-disk
        encoding which might correspond to the locale settings of the process (or the G_FILENAME_ENCODING
        environment variable), or not.

        On Windows the GLib file name encoding is UTF-8. Note that the Microsoft C library does not use UTF-8,
        but has separate APIs for current system code page and wide characters (UTF-16). The GLib wrappers call
        the wide character API if present (on modern Windows systems), otherwise convert to/from the system
        code page.

        Another group of functions allows to open and read directories in the GLib file name encoding. These
        are g_dir_open(), g_dir_read_name(), g_dir_rewind(), g_dir_close().

This is also why we have our own copies of some of those functions on Windows, and wrap them ourselves (so that we 
don't require GLib 2.6, which introduced them, for all platforms).
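
A wrapper of this kind might look roughly like the sketch below. This is not the actual Wireshark or GLib code: the name utf8_open() is hypothetical, and the error handling is only one plausible choice. On Windows it converts the UTF-8 path to UTF-16 with MultiByteToWideChar() and calls _wopen(); elsewhere it passes the path straight to open().

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#ifdef _WIN32
#include <windows.h>
#include <io.h>
#else
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#endif

/*
 * Hypothetical wrapper: open a file whose name is given in UTF-8.
 * Returns a file descriptor, or -1 with errno set on failure.
 */
int utf8_open(const char *utf8_path, int flags, int mode)
{
    assert(utf8_path != NULL);
#ifdef _WIN32
    wchar_t *wpath;
    int wlen, fd, saved_errno;

    /* First call: ask how many UTF-16 code units are needed,
     * including the terminating NUL (input length -1 means
     * "NUL-terminated"). */
    wlen = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                               utf8_path, -1, NULL, 0);
    if (wlen == 0) {
        errno = EINVAL;     /* not valid UTF-8 */
        return -1;
    }
    wpath = malloc((size_t)wlen * sizeof(wchar_t));
    if (wpath == NULL) {
        errno = ENOMEM;
        return -1;
    }
    /* Second call: do the actual conversion. */
    if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            utf8_path, -1, wpath, wlen) == 0) {
        free(wpath);
        errno = EINVAL;
        return -1;
    }
    /* Use the wide-character C library routine, not _open(). */
    fd = _wopen(wpath, flags, mode);
    saved_errno = errno;    /* free() must not clobber errno */
    free(wpath);
    errno = saved_errno;
    return fd;
#else
    /* On UN*X the path is passed through unchanged; the bytes are
     * whatever encoding the file system and locale agree on. */
    return open(utf8_path, flags, (mode_t)mode);
#endif
}
```

Application code can then call utf8_open() with a UTF-8 path on every platform, which keeps the ifdefs and the wide-character API in one place.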

