Wireshark mailing list archives

Re: tvb_get_string_enc() doesn't always return valid UTF-8

From: Evan Huus <eapache () gmail com>
Date: Mon, 20 Jan 2014 13:23:20 -0500

There was a bug where Guy and I discussed strings in depth (though I
can't find it at the moment).

I think we'd agreed that the right thing to do is to convert most of
our string functions to handle and return counted strings
(wmem_strbuf_t or something) and then do the replacement as you
suggest. There are several other outstanding encoding issues
(especially around embedded NULLs) where string length cannot be
reliably managed without explicitly counting it.

Unfortunately it's a relatively large API change, but I think it's the
right thing going forward, especially since we already use a
wmem_strbuf_t in most of the _get_string functions already (we just
don't return it).

Evan

On Mon, Jan 20, 2014 at 12:22 PM, Martin Kaiser <lists () kaiser cx> wrote:

Hi,

if I have a tvbuff that starts with 0x86 and I call

a = tvb_get_string_enc(tvb, 0, ENC_ASCII)
proto_tree_add_string(..., a);

I can trigger the DISSECTOR_ASSERT since a is not a valid unicode string.

Comments in the code suggest that tvb_get_string() should replace
chars>=0x80 with the unicode replacement char, which is two bytes long.
This would look like

guint8 *
tvb_get_string(wmem_allocator_t *scope, tvbuff_t *tvb, gint offset, gint length)
{
        wmem_strbuf_t *str;

        tvb_ensure_bytes_exist(tvb, offset, length);
        str = wmem_strbuf_new(scope, "");

        while (length > 0) {
                guint8 ch = tvb_get_guint8(tvb, offset);

                if (ch < 0x80)
                        wmem_strbuf_append_c(str, ch);
                else {
                        wmem_strbuf_append_unichar(str, UNREPL);
                }
                offset++;
                length--;
        }
        wmem_strbuf_append_c(str, '\0');

        return (guint8 *) wmem_strbuf_get_str(str);
}


The resulting string would still contain len+1 chars but not necessarily
len+1 bytes. Would that be a problem, i.e. is it ok to do sth like

b = tvb_get_string(NULL, tvb, offset, len_b);
copy_of_b = g_malloc(len_b+1);
memcpy(copy_of_b, b, len_b+1);

?

If that should work, we'd need a separate function for get string &
replace 8bit chars.

Thoughts?

   Martin
___________________________________________________________________________
Sent via:    Wireshark-dev mailing list <wireshark-dev () wireshark org>
Archives:    http://www.wireshark.org/lists/wireshark-dev
Unsubscribe: https://wireshark.org/mailman/options/wireshark-dev
             mailto:wireshark-dev-request () wireshark org?subject=unsubscribe

___________________________________________________________________________
Sent via:    Wireshark-dev mailing list <wireshark-dev () wireshark org>
Archives:    http://www.wireshark.org/lists/wireshark-dev
Unsubscribe: https://wireshark.org/mailman/options/wireshark-dev
             mailto:wireshark-dev-request () wireshark org?subject=unsubscribe

Current thread:

tvb_get_string_enc() doesn't always return valid UTF-8 Martin Kaiser (Jan 20)
- Re: tvb_get_string_enc() doesn't always return valid UTF-8 Evan Huus (Jan 20)
  - Re: tvb_get_string_enc() doesn't always return valid UTF-8 Martin Kaiser (Jan 20)
    - Re: tvb_get_string_enc() doesn't always return valid UTF-8 Guy Harris (Jan 20)
    - Re: tvb_get_string_enc() doesn't always return valid UTF-8 Evan Huus (Jan 20)
    - Re: tvb_get_string_enc() doesn't always return valid UTF-8 Guy Harris (Jan 20)
    - Re: tvb_get_string_enc() doesn't always return valid UTF-8 Evan Huus (Jan 21)
    - Re: tvb_get_string_enc() doesn't always return valid UTF-8 Jakub Zawadzki (Jan 26)
    - Re: tvb_get_string_enc() doesn't always return valid UTF-8 Guy Harris (Jan 26)
    - Re: tvb_get_string_enc() doesn't always return valid UTF-8 Evan Huus (Jan 27)
    - Re: tvb_get_string_enc() doesn't always return valid UTF-8 Evan Huus (Jan 29)
    - Re: tvb_get_string_enc() doesn't always return valid UTF-8 Guy Harris (Jan 26)

(Thread continues...)