Nmap Development mailing list archives

Interesting Zenmap encoding bug


From: David Fifield <david () bamsoftware com>
Date: Sun, 11 Oct 2009 21:46:03 -0600

Hi,

I had gotten some Zenmap crash reports that were variations on this
theme:

File "zenmapGUI\ScanNotebook.pyo", line 184, in _target_entry_changed
File "zenmapCore\NmapOptions.pyo", line 719, in render_string
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 1: unexpected end of data

It looks like a UTF-8 string got truncated, because bytes from 0xC2
through 0xF4 are lead bytes of multibyte UTF-8 sequences. It was
happening when something was entered in the target box, after splitting
the target string on whitespace. PyGTK returns the text content of its
widgets in UTF-8, so that part wasn't surprising. I tried entering all
kinds of characters that have multibyte UTF-8 representations, but I
couldn't reproduce the crash. Then I got one report saying that the
character à (which is shift-0 on a French keyboard) triggered it.
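
To make the "unexpected end of data" error concrete, here is a minimal sketch in Python 3 (the crash itself was under Python 2). It cuts a two-byte UTF-8 sequence down to its lead byte and decodes both versions. The helper name try_decode is made up for this illustration and is not part of Zenmap.

```python
# Python 3 sketch: a UTF-8 lead byte whose continuation byte has been cut off.
full = "\u00e0".encode("utf-8")   # two bytes: lead byte 0xC3, continuation 0xA0
truncated = full[:1]              # keep only the lead byte 0xC3

def try_decode(data):
    """Return the decoded text, or the decoder's error reason on failure.

    Hypothetical helper for this example only.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return "error: " + exc.reason

print(ascii(try_decode(full)))       # decodes cleanly
print(ascii(try_decode(truncated)))  # fails: the sequence is incomplete
```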

I could reproduce the crash with à (U+00E0), but what's interesting is
that it wouldn't happen with á (U+00E1). The key is in their UTF-8
representations: à is C3 A0 while á is C3 A1. Taken as a single-byte
character, A0 (U+00A0) happens to be NO-BREAK SPACE while A1 (U+00A1)
is INVERTED EXCLAMATION MARK. The NO-BREAK SPACE was being treated as
whitespace. à in the target box was becoming the UTF-8 encoded byte
string "\xc3\xa0", which the split function was turning into ["\xc3"],
which when decoded led to an error because of the truncated sequence.
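
The difference between the two characters can be sketched as follows, in Python 3. Its bytes.split() only ever splits on ASCII whitespace, so the locale-aware behaviour is simulated here with a hypothetical helper, split_on_bytes, and an assumed separator set that includes 0xA0. This is only a rough model of the effect, not the code Zenmap or Python 2 actually runs.

```python
import unicodedata

# Assumed separator set: ASCII space, tab, LF, CR, plus 0xA0 (NO-BREAK SPACE
# when the byte is read as a single-byte Windows-1252 character).
SEPARATORS = {0x20, 0x09, 0x0A, 0x0D, 0xA0}

def split_on_bytes(data, separators=SEPARATORS):
    """Split a byte string on single-byte separators, dropping empty fields."""
    fields, current = [], bytearray()
    for byte in data:
        if byte in separators:
            if current:
                fields.append(bytes(current))
                current = bytearray()
        else:
            current.append(byte)
    if current:
        fields.append(bytes(current))
    return fields

for ch in ("\u00e0", "\u00e1"):
    encoded = ch.encode("utf-8")
    trail = encoded[1:].decode("cp1252")  # continuation byte as a cp1252 char
    print(ascii(ch), encoded, unicodedata.name(trail), split_on_bytes(encoded))
```

Only the first character loses its continuation byte, because only 0xA0 is in the separator set.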

I was surprised that the split function would split on a non-ASCII
character. In fact it doesn't by default, but apparently it does once
the locale has been set from the environment on Windows. In other
words,

>>> u'\u00e0'.encode('UTF-8')
'\xc3\xa0'
>>> '\xc3\xa0'.split()
['\xc3\xa0']
>>> import locale
>>> locale.setlocale(locale.LC_ALL, '')
'English_United States.1252'
>>> '\xc3\xa0'.split()
['\xc3']

The problem is fixed by decoding the byte string returned by PyGTK
into a Unicode string before processing it. (However, Nmap will likely
choke once you try to run the scan because it's going to see the UTF-8
bytes in the host specification once it is serialized for the command
line.)
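
A minimal sketch of the decode-first approach, written for Python 3 rather than the Python 2 that Zenmap used at the time. The function name split_targets and the sample input are invented for this illustration, and this is not the actual Zenmap patch.

```python
def split_targets(raw):
    """Decode the widget's UTF-8 bytes, then split the text on whitespace.

    Hypothetical helper: splitting happens on whole characters, so a
    multibyte sequence can no longer be cut in half.
    """
    text = raw.decode("utf-8")
    return text.split()

# Bytes as a UTF-8 text widget might hand them over (assumed sample input).
raw = "\u00e0 example.com".encode("utf-8")
print([ascii(field) for field in split_targets(raw)])
```

A real U+00A0 character in the decoded text is still treated as whitespace by str.split(), but that splits between characters rather than inside one.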

David Fifield
