Wireshark mailing list archives
Re: tvb_get_string_enc() doesn't always return valid UTF-8
From: Evan Huus <eapache () gmail com>
Date: Mon, 20 Jan 2014 13:23:20 -0500
There was a bug where Guy and I discussed strings in depth (though I can't find it at the moment). I think we'd agreed that the right thing to do is to convert most of our string functions to handle and return counted strings (wmem_strbuf_t or something) and then do the replacement as you suggest. There are several other outstanding encoding issues (especially around embedded NULs) where string length cannot be reliably managed without explicitly counting it.

Unfortunately it's a relatively large API change, but I think it's the right thing going forward, especially since we already use a wmem_strbuf_t in most of the _get_string functions (we just don't return it).

Evan

On Mon, Jan 20, 2014 at 12:22 PM, Martin Kaiser <lists () kaiser cx> wrote:
Hi,

if I have a tvbuff that starts with 0x86 and I call

    a = tvb_get_string_enc(tvb, 0, ENC_ASCII);
    proto_tree_add_string(..., a);

I can trigger the DISSECTOR_ASSERT since a is not a valid Unicode string.

Comments in the code suggest that tvb_get_string() should replace chars >= 0x80 with the Unicode replacement char, which is three bytes long in UTF-8. This would look like

    guint8 *
    tvb_get_string(wmem_allocator_t *scope, tvbuff_t *tvb, gint offset, gint length)
    {
        wmem_strbuf_t *str;

        tvb_ensure_bytes_exist(tvb, offset, length);
        str = wmem_strbuf_new(scope, "");

        while (length > 0) {
            guint8 ch = tvb_get_guint8(tvb, offset);

            if (ch < 0x80)
                wmem_strbuf_append_c(str, ch);
            else {
                wmem_strbuf_append_unichar(str, UNREPL);
            }
            offset++;
            length--;
        }
        wmem_strbuf_append_c(str, '\0');

        return (guint8 *) wmem_strbuf_get_str(str);
    }

The resulting string would still contain len+1 chars but not necessarily len+1 bytes. Would that be a problem, i.e. is it ok to do something like

    b = tvb_get_string(NULL, tvb, offset, len_b);
    copy_of_b = g_malloc(len_b+1);
    memcpy(copy_of_b, b, len_b+1);

?

If that should work, we'd need a separate function for "get string and replace 8-bit chars".

Thoughts?

   Martin
___________________________________________________________________________
Sent via:    Wireshark-dev mailing list <wireshark-dev () wireshark org>
Archives:    http://www.wireshark.org/lists/wireshark-dev
Unsubscribe: https://wireshark.org/mailman/options/wireshark-dev
             mailto:wireshark-dev-request () wireshark org?subject=unsubscribe
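To make the byte-count question in Martin's message concrete, here is a minimal standalone sketch in plain C. It does not use tvbuff or wmem, and the function name ascii_to_utf8_replace() is hypothetical, not a Wireshark API. It assumes the replacement is U+FFFD encoded as UTF-8 (the three bytes EF BF BD) and returns the resulting byte length explicitly, since that length can exceed the input length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* UTF-8 encoding of U+FFFD REPLACEMENT CHARACTER (three bytes). */
static const unsigned char UTF8_REPLACEMENT[3] = { 0xEF, 0xBF, 0xBD };

/*
 * Hypothetical helper, for illustration only (not a Wireshark function).
 * Copies len bytes from src, replacing every byte >= 0x80 with U+FFFD.
 * Returns a malloc'd, NUL-terminated buffer and stores its length in
 * bytes (excluding the terminator) in *out_len.
 * Returns NULL if the allocation fails or the size would overflow.
 */
char *ascii_to_utf8_replace(const unsigned char *src, size_t len,
                            size_t *out_len)
{
    assert(src != NULL || len == 0);
    assert(out_len != NULL);

    /* Worst case: every input byte expands to three output bytes. */
    if (len > (SIZE_MAX - 1) / 3)
        return NULL;
    char *dst = malloc(len * 3 + 1);
    if (dst == NULL)
        return NULL;

    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (src[i] < 0x80) {
            dst[n++] = (char)src[i];   /* 7-bit byte: copy unchanged */
        } else {
            memcpy(dst + n, UTF8_REPLACEMENT, sizeof UTF8_REPLACEMENT);
            n += sizeof UTF8_REPLACEMENT;
        }
    }
    dst[n] = '\0';
    *out_len = n;
    return dst;
}
```

With the three input bytes 0x86, 'a', 'b', the output is five bytes long plus the terminator, so a caller that copies only len+1 = 4 bytes would cut the string short and lose its terminating NUL. That is why the sketch hands the length back to the caller instead of leaving it to be inferred from the input length.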
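Evan's point about embedded NULs can be shown with a small standalone sketch of a counted string. The counted_str type and its functions are hypothetical and only loosely in the spirit of wmem_strbuf_t. They are not the wmem API. The assumption is simply that the length is stored next to the data rather than recovered with strlen().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical counted-string type, for illustration only. */
typedef struct {
    char  *data;  /* len bytes followed by a terminating NUL */
    size_t len;   /* byte count, not including the terminator */
} counted_str;

/*
 * Initialise s with a copy of len bytes, which may include NUL bytes.
 * Returns 0 on success, -1 if the allocation fails.
 * (A production version would also guard len + 1 against overflow.)
 */
int counted_str_init(counted_str *s, const void *bytes, size_t len)
{
    assert(s != NULL);
    assert(bytes != NULL || len == 0);

    s->data = malloc(len + 1);
    if (s->data == NULL) {
        s->len = 0;
        return -1;
    }
    if (len > 0)
        memcpy(s->data, bytes, len);
    s->data[len] = '\0';   /* terminator is a convenience, not the length */
    s->len = len;
    return 0;
}

/* Release the buffer and reset the fields. */
void counted_str_free(counted_str *s)
{
    assert(s != NULL);
    free(s->data);
    s->data = NULL;
    s->len = 0;
}
```

For the five bytes 'a', 'b', NUL, 'c', 'd', the stored len is 5 while strlen() on the same buffer stops at 2. Any function that returns only a char pointer therefore loses the bytes after the first NUL, which is the case where the length has to be counted explicitly.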