Wireshark mailing list archives

Re: FT_STRING, FT_STRINGZPAD, and null padding


From: John Thacker <johnthacker () gmail com>
Date: Sat, 5 Sep 2020 14:09:54 -0400

So for the problem with calling format_text() on a string that has had
replacement characters applied, I see that packet-ieee80211.c calls
format_text() on a string that has already been obtained with
tvb_get_string_enc() in order to get something printable.
This is pretty common, a lot of dissectors do this, because format_text()
expects UTF-8, tvb_format_text() doesn't take an encoding at all and just
assumes UTF-8 (and works on ASCII), and it makes sense to try to call
something that does charset conversion to UTF-8 first for any field that
might have an encoding other than ASCII or UTF-8.

However, tvb_get_string_enc() for ASCII strings replaces characters with
the high bit set with the Unicode Replacement character, which can change
the length of the string. It's not clear what length to pass into
format_text() - the original length can be wrong, but strlen() doesn't work
for the cases where there's a NUL in the middle of the string, as it won't
get the trailing characters. A similar problem would occur if
tvb_get_string_enc() actually did replacement characters for UTF-8 invalid
characters (it doesn't do that now, see the comments of
https://gitlab.com/wireshark/wireshark/-/issues/14948 ), and does occur
with all the other encodings, where the replacement character substitution
already does occur.

So I don't particularly see any easy way to handle the counted strings that
might have replacement characters substituted in but also might have a NUL
in the middle and we still want to display any trailing characters in the
field after the NUL and warn about them.

For the dissectors that only have either ASCII or UTF-8 as the encoding, a
workaround is to call tvb_format_text() instead of calling format_text() on
the output of tvb_get_string_enc(). For dissectors with any other encoding,
that's not an option and as a result it doesn't feel great as a solution.

To handle other encodings, I can only see:
1) Changing tvb_get_string_enc - and the encoding specific helper functions
it calls - to pass back the true length as a output parameter
2) Changing tvb_get_string_enc to pass back a wmem_strbuf_t type that
handles embedded nulls and has length, use that type more widely, pass it
into the format function
3) Having some kind of tvb_format_text_enc() function (and variations for
whitespace) that takes different encodings, and differs from
tvb_get_string_enc in that it produces a printable output.

3) alone seems like a bad solution, because it would probably be called in
addition to tvb_get_string_enc for the same string in the same dissector,
and that's hitting some very similar tvbuff access and conversion code
twice. It also would mean some more tricky and error prone conversions. It
could be added as syntactic sugar for the function composition of
tvb_get_string_enc() and format_text() if 1) or 2) was implemented.

John Thacker
___________________________________________________________________________
Sent via:    Wireshark-dev mailing list <wireshark-dev () wireshark org>
Archives:    https://www.wireshark.org/lists/wireshark-dev
Unsubscribe: https://www.wireshark.org/mailman/options/wireshark-dev
             mailto:wireshark-dev-request () wireshark org?subject=unsubscribe

Current thread: