[Bug libstdc++/108976] New: codecvt for Unicode allows surrogate code points

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug libstdc++/108976] New: codecvt for Unicode allows surrogate code points
@ 2023-02-28 23:20 dmjpp at hotmail dot com
  2023-03-02 11:03 ` [Bug libstdc++/108976] " redi at gcc dot gnu.org
                   ` (11 more replies)
  0 siblings, 12 replies; 13+ messages in thread
From: dmjpp at hotmail dot com @ 2023-02-28 23:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108976

            Bug ID: 108976
           Summary: codecvt for Unicode allows surrogate code points
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libstdc++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: dmjpp at hotmail dot com
  Target Milestone: ---

Text in valid Unicode should never contain surrogate code POINTS. Those are
only allowed in UTF-16, but only as code UNITS and must be properly paired.

UTF-8 text in its strictest form must not contain surrogates but in a slightly
relaxed form surrogates can be easily encoded as 3-byte sequences. Same can be
said for UTF-32 and UCS-2. Only UTF-16 is immune to the error of surrogate code
POINT (they are treated as UNITS).

Codecvts in libstdc++ currently allow surrogate code points in some cases. Here
is a minimal reproduction (asserts are the correct behavior):

#include <locale>
#include <cassert>

void u32()
{
    using namespace std;
    auto& f = use_facet<codecvt<char32_t, char, mbstate_t>>(locale::classic());

    char u8str[] = "\uC800\uCBFF\uCC00\uCFFF";
    u8str[0] = u8str[3] = u8str[6] = u8str[9] = 0xED; // turn the C into D.
    // now the string is D800, DBFF, DC00 and DFFF encoded in relaxed UTF-8
    // that allows surrogate code points.
    char32_t u32str[] = {0xD800, 0xDBFF, 0xDC00, 0xDFFF, 0};

    char32_t u32out[1];
    const char* from_next;
    char32_t* to_next;
    mbstate_t st = {};
    auto res = f.in(st, u8str, u8str+3, from_next, u32out, u32out+1, to_next);
    assert(res == f.error);
    assert(from_next == u8str);
    assert(to_next == u32out);

    st = {};
    auto l = f.length(st, u8str, u8str+3, 1);
    assert(l == 0);

    char u8out[3];
    const char32_t* from_next2;
    char* to_next2;
    st = {};
    res = f.out(st, u32str, u32str+1, from_next2, u8out, u8out+3, to_next2);
    assert(res == f.error);
    assert(from_next2 == u32str);
    assert(to_next2 == u8out);
}
void u16()
{
    using namespace std;
    auto& f = use_facet<codecvt<char16_t, char, mbstate_t>>(locale::classic());

    char u8str[] = "\uC800\uCBFF\uCC00\uCFFF";
    u8str[0] = u8str[3] = u8str[6] = u8str[9] = 0xED; // turn the C into D.
    // now the string is D800, DBFF, DC00 and DFFF encoded in relaxed UTF-8
    // that allows surrogates.

    char16_t u16out[1];
    const char* from_next;
    char16_t* to_next;
    mbstate_t st = {};
    auto res = f.in(st, u8str, u8str+3, from_next, u16out, u16out+1, to_next);
    assert(res == f.error);
    assert(from_next == u8str);
    assert(to_next == u16out);

    st = {};
    auto l = f.length(st, u8str, u8str+3, 1);
    assert(l == 0);
}
int main()
{
    u32();
    u16();
}

From reading the file codecvt.cc the following conversions have the bug:

- From UTF-8 to any other encoding.
- From UTF-32/UCS-4 to any other encoding.

Those that read from UCS-2 seem to me like they properly report the error.
Reading from UTF-16 can not have this bug by definition. From what I checked,
the functions for reading UTF-16 properly treat unpaired surrogate code units
as error.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug libstdc++/108976] codecvt for Unicode allows surrogate code points
  2023-02-28 23:20 [Bug libstdc++/108976] New: codecvt for Unicode allows surrogate code points dmjpp at hotmail dot com
@ 2023-03-02 11:03 ` redi at gcc dot gnu.org
  2023-03-02 11:08 ` dmjpp at hotmail dot com
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: redi at gcc dot gnu.org @ 2023-03-02 11:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108976

Jonathan Wakely <redi at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2023-03-02
     Ever confirmed|0                           |1

--- Comment #1 from Jonathan Wakely <redi at gcc dot gnu.org> ---
Thanks for the testcase, I'd been meaning to look into exactly this issue.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug libstdc++/108976] codecvt for Unicode allows surrogate code points
  2023-02-28 23:20 [Bug libstdc++/108976] New: codecvt for Unicode allows surrogate code points dmjpp at hotmail dot com
  2023-03-02 11:03 ` [Bug libstdc++/108976] " redi at gcc dot gnu.org
@ 2023-03-02 11:08 ` dmjpp at hotmail dot com
  2023-03-02 11:17 ` redi at gcc dot gnu.org
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: dmjpp at hotmail dot com @ 2023-03-02 11:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108976

--- Comment #2 from Dimitrij Mijoski <dmjpp at hotmail dot com> ---
(In reply to Dimitrij Mijoski from comment #0)
> Those that read from UCS-2 seem to me like they properly report the error.
> Reading from UTF-16 can not have this bug by definition. From what I
> checked, the functions for reading UTF-16 properly treat unpaired surrogate
> code units as error.

Seems like the conversion from UCS-2 to UTF-16BE/LE is also affected. This
conversions is called via codecvt_utf16<char16_t>::out(). See line
https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libstdc%2B%2B-v3/src/c%2B%2B11/codecvt.cc;h=02f05752de84139a7eb7c3d40946b61f4c0334cf;hb=HEAD#l656
it only checks for high surrogate but should also check for low.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug libstdc++/108976] codecvt for Unicode allows surrogate code points
  2023-02-28 23:20 [Bug libstdc++/108976] New: codecvt for Unicode allows surrogate code points dmjpp at hotmail dot com
  2023-03-02 11:03 ` [Bug libstdc++/108976] " redi at gcc dot gnu.org
  2023-03-02 11:08 ` dmjpp at hotmail dot com
@ 2023-03-02 11:17 ` redi at gcc dot gnu.org
  2023-03-07 20:17 ` dmjpp at hotmail dot com
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: redi at gcc dot gnu.org @ 2023-03-02 11:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108976

--- Comment #3 from Jonathan Wakely <redi at gcc dot gnu.org> ---
I have some new code for handling UTF-8 for std::print, and using that code
your relaxed u8str gets converted to 12 U+FFFD code points when printed to a
terminal, which I think is correct.

#include <print>

int main()
{
  char u8str[] = "\uC800\uCBFF\uCC00\uCFFF";
  std::println("valid UTF-8: {}", u8str);

  u8str[0] = u8str[3] = u8str[6] = u8str[9] = 0xED; // turn the C into D.
  // now the string is D800, DBFF, DC00 and DFFF encoded in relaxed UTF-8
  // that allows surrogate code points.
  std::vprint_nonunicode("invalid UTF-8 printed raw: {}\n",
std::make_format_args(u8str));
  std::println("invalid UTF-8 printed safely: {}", u8str);
}
$ g++ -std=c++23 surr.cc && ./a.out && ./a.out | xxd
valid UTF-8: 저쯿찀쿿
invalid UTF-8 printed raw: ������������
invalid UTF-8 printed safely: ������������
00000000: 7661 6c69 6420 5554 462d 383a 20ec a080  valid UTF-8: ...
00000010: ecaf bfec b080 ecbf bf0a 696e 7661 6c69  ..........invali
00000020: 6420 5554 462d 3820 7072 696e 7465 6420  d UTF-8 printed 
00000030: 7261 773a 20ed a080 edaf bfed b080 edbf  raw: ...........
00000040: bf0a 696e 7661 6c69 6420 5554 462d 3820  ..invalid UTF-8 
00000050: 7072 696e 7465 6420 7361 6665 6c79 3a20  printed safely: 
00000060: efbf bdef bfbd efbf bdef bfbd efbf bdef  ................
00000070: bfbd efbf bdef bfbd efbf bdef bfbd efbf  ................
00000080: bdef bfbd 0a                             .....


The new code is also much faster, so I'm thinking of rewriting some of the
src/c++11/codecvt.cc facets to use it. But that's a longer term project, we
should fix this bug first.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug libstdc++/108976] codecvt for Unicode allows surrogate code points
  2023-02-28 23:20 [Bug libstdc++/108976] New: codecvt for Unicode allows surrogate code points dmjpp at hotmail dot com
                   ` (2 preceding siblings ...)
  2023-03-02 11:17 ` redi at gcc dot gnu.org
@ 2023-03-07 20:17 ` dmjpp at hotmail dot com
  2023-03-07 21:43 ` redi at gcc dot gnu.org
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: dmjpp at hotmail dot com @ 2023-03-07 20:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108976

--- Comment #4 from Dimitrij Mijoski <dmjpp at hotmail dot com> ---
I have fixed this, added large testsute and discovered another bug in
codecvt_utf16 when the input [from, from_end) contains odd number of bytes.
Error was returned instead of partial.

Here are the changes in 8 commits
https://github.com/gcc-mirror/gcc/compare/master...dimztimz:gcc:codecvt (read
from top to bottom). Do you want everything squashed in one commit or maybe
some other combination?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug libstdc++/108976] codecvt for Unicode allows surrogate code points
  2023-02-28 23:20 [Bug libstdc++/108976] New: codecvt for Unicode allows surrogate code points dmjpp at hotmail dot com
                   ` (3 preceding siblings ...)
  2023-03-07 20:17 ` dmjpp at hotmail dot com
@ 2023-03-07 21:43 ` redi at gcc dot gnu.org
  2023-03-08 14:11 ` dmjpp at hotmail dot com
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: redi at gcc dot gnu.org @ 2023-03-07 21:43 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108976

--- Comment #5 from Jonathan Wakely <redi at gcc dot gnu.org> ---
Excellent, thanks! I'll look at the commits tomorrow and get back to you.

A patch will need to be submitted to the mailing lists, but it might be OK to
send a single diff for the whole series, but then I could merge the 8 commits
(without squashing them). I'll see if that makes sense (and would be OK by our
policies) and let you know.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug libstdc++/108976] codecvt for Unicode allows surrogate code points
  2023-02-28 23:20 [Bug libstdc++/108976] New: codecvt for Unicode allows surrogate code points dmjpp at hotmail dot com
                   ` (4 preceding siblings ...)
  2023-03-07 21:43 ` redi at gcc dot gnu.org
@ 2023-03-08 14:11 ` dmjpp at hotmail dot com
  2023-04-18 13:45 ` dmjpp at hotmail dot com
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: dmjpp at hotmail dot com @ 2023-03-08 14:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108976

--- Comment #6 from Dimitrij Mijoski <dmjpp at hotmail dot com> ---
I sent a single patch to the mailing list with a good detailed commit message.
I think that is better than multiple patches.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug libstdc++/108976] codecvt for Unicode allows surrogate code points
  2023-02-28 23:20 [Bug libstdc++/108976] New: codecvt for Unicode allows surrogate code points dmjpp at hotmail dot com
                   ` (5 preceding siblings ...)
  2023-03-08 14:11 ` dmjpp at hotmail dot com
@ 2023-04-18 13:45 ` dmjpp at hotmail dot com
  2023-09-29 15:01 ` cvs-commit at gcc dot gnu.org
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: dmjpp at hotmail dot com @ 2023-04-18 13:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108976

--- Comment #7 from Dimitrij Mijoski <dmjpp at hotmail dot com> ---
I put a second version of the patch
https://gcc.gnu.org/pipermail/libstdc++/2023-March/055667.html about a month
ago.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug libstdc++/108976] codecvt for Unicode allows surrogate code points
  2023-02-28 23:20 [Bug libstdc++/108976] New: codecvt for Unicode allows surrogate code points dmjpp at hotmail dot com
                   ` (6 preceding siblings ...)
  2023-04-18 13:45 ` dmjpp at hotmail dot com
@ 2023-09-29 15:01 ` cvs-commit at gcc dot gnu.org
  2024-01-05 16:34 ` dmjpp at hotmail dot com
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-09-29 15:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108976

--- Comment #8 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Jonathan Wakely <redi@gcc.gnu.org>:

https://gcc.gnu.org/g:a8b9c32da787ea0bfbfc9118ac816fa7be4b1bc8

commit r14-4335-ga8b9c32da787ea0bfbfc9118ac816fa7be4b1bc8
Author: Dimitrij Mijoski <dmjpp@hotmail.com>
Date:   Thu Sep 28 21:38:11 2023 +0200

    libstdc++: Fix handling of surrogate CP in codecvt [PR108976]

    This patch fixes the handling of surrogate code points in all standard
    facets for transcoding Unicode that are based on std::codecvt. Surrogate
    code points should always be treated as error. On the other hand
    surrogate code units can only appear in UTF-16 and only when they come
    in a proper pair.

    Additionally, it fixes a bug in std::codecvt_utf16::in() when odd number
    of bytes were given in the range [from, from_end), error was returned
    always. The last byte in such range does not form a full UTF-16 code
    unit and we can not make any decisions for error, instead partial should
    be returned.

    The testsuite for testing these facets was updated in the following
    order:

    1. All functions that test codecvts that work with UTF-8 were refactored
       and made more generic so they accept codecvt that works with the char
       type char8_t.
    2. The same functions were updated with new test cases for transcoding
       errors and now additionally test for surrogates, overlong UTF-8
       sequences, code points out of the Unicode range, and more tests for
       missing leading and trailing code units.
    3. New tests were added to test codecvt_utf16 in both of its variants,
       UTF-16 <-> UTF-32/UCS-4 and UTF-16 <-> UCS-2.

    libstdc++-v3/ChangeLog:

            PR libstdc++/108976
            * src/c++11/codecvt.cc (read_utf8_code_point): Fix handing of
            surrogates in UTF-8.
            (ucs4_out): Fix handling of surrogates in UCS-4 -> UTF-8.
            (ucs4_in): Fix handling of range with odd number of bytes.
            (ucs4_out): Fix handling of surrogates in UCS-4 -> UTF-16.
            (ucs2_out): Fix handling of surrogates in UCS-2 -> UTF-16.
            (ucs2_in): Fix handling of range with odd number of bytes.
            (__codecvt_utf16_base<char16_t>::do_in): Likewise.
            (__codecvt_utf16_base<char32_t>::do_in): Likewise.
            (__codecvt_utf16_base<wchar_t>::do_in): Likewise.
            * testsuite/22_locale/codecvt/codecvt_unicode.cc: Renames, add
            tests for codecvt_utf16<char16_t> and codecvt_utf16<char32_t>.
            * testsuite/22_locale/codecvt/codecvt_unicode.h: Refactor UTF-8
            testing functions for char8_t, add more test cases for errors,
            add testing functions for codecvt_utf16.
            * testsuite/22_locale/codecvt/codecvt_unicode_wchar_t.cc:
            Renames, add tests for codecvt_utf16<whchar_t>.
            * testsuite/22_locale/codecvt/codecvt_utf16/79980.cc (test06):
            Fix test.
            * testsuite/22_locale/codecvt/codecvt_unicode_char8_t.cc: New
            test.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug libstdc++/108976] codecvt for Unicode allows surrogate code points
  2023-02-28 23:20 [Bug libstdc++/108976] New: codecvt for Unicode allows surrogate code points dmjpp at hotmail dot com
                   ` (7 preceding siblings ...)
  2023-09-29 15:01 ` cvs-commit at gcc dot gnu.org
@ 2024-01-05 16:34 ` dmjpp at hotmail dot com
  2024-01-05 18:57 ` redi at gcc dot gnu.org
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: dmjpp at hotmail dot com @ 2024-01-05 16:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108976

--- Comment #9 from Dimitrij Mijoski <dmjpp at hotmail dot com> ---
I believe this bug report should closed as resolved. Are there maybe plans for
back-porting?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug libstdc++/108976] codecvt for Unicode allows surrogate code points
  2023-02-28 23:20 [Bug libstdc++/108976] New: codecvt for Unicode allows surrogate code points dmjpp at hotmail dot com
                   ` (8 preceding siblings ...)
  2024-01-05 16:34 ` dmjpp at hotmail dot com
@ 2024-01-05 18:57 ` redi at gcc dot gnu.org
  2024-01-13 11:56 ` dmjpp at hotmail dot com
  2024-01-13 12:25 ` redi at gcc dot gnu.org
  11 siblings, 0 replies; 13+ messages in thread
From: redi at gcc dot gnu.org @ 2024-01-05 18:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108976

--- Comment #10 from Jonathan Wakely <redi at gcc dot gnu.org> ---
I think it would be good to backport it, what do you think?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug libstdc++/108976] codecvt for Unicode allows surrogate code points
  2023-02-28 23:20 [Bug libstdc++/108976] New: codecvt for Unicode allows surrogate code points dmjpp at hotmail dot com
                   ` (9 preceding siblings ...)
  2024-01-05 18:57 ` redi at gcc dot gnu.org
@ 2024-01-13 11:56 ` dmjpp at hotmail dot com
  2024-01-13 12:25 ` redi at gcc dot gnu.org
  11 siblings, 0 replies; 13+ messages in thread
From: dmjpp at hotmail dot com @ 2024-01-13 11:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108976

--- Comment #11 from Dimitrij Mijoski <dmjpp at hotmail dot com> ---
(In reply to Jonathan Wakely from comment #10)
> I think it would be good to backport it, what do you think?

I don't really have strong need. Maybe porting only to v13 as that is pretty
straightforward (simple cherry-picking). Porting to v12 and v11 will require
porting other patches first.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug libstdc++/108976] codecvt for Unicode allows surrogate code points
  2023-02-28 23:20 [Bug libstdc++/108976] New: codecvt for Unicode allows surrogate code points dmjpp at hotmail dot com
                   ` (10 preceding siblings ...)
  2024-01-13 11:56 ` dmjpp at hotmail dot com
@ 2024-01-13 12:25 ` redi at gcc dot gnu.org
  11 siblings, 0 replies; 13+ messages in thread
From: redi at gcc dot gnu.org @ 2024-01-13 12:25 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108976

Jonathan Wakely <redi at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |13.3

--- Comment #12 from Jonathan Wakely <redi at gcc dot gnu.org> ---
Thanks, I think that makes sense.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2024-01-13 12:25 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-28 23:20 [Bug libstdc++/108976] New: codecvt for Unicode allows surrogate code points dmjpp at hotmail dot com
2023-03-02 11:03 ` [Bug libstdc++/108976] " redi at gcc dot gnu.org
2023-03-02 11:08 ` dmjpp at hotmail dot com
2023-03-02 11:17 ` redi at gcc dot gnu.org
2023-03-07 20:17 ` dmjpp at hotmail dot com
2023-03-07 21:43 ` redi at gcc dot gnu.org
2023-03-08 14:11 ` dmjpp at hotmail dot com
2023-04-18 13:45 ` dmjpp at hotmail dot com
2023-09-29 15:01 ` cvs-commit at gcc dot gnu.org
2024-01-05 16:34 ` dmjpp at hotmail dot com
2024-01-05 18:57 ` redi at gcc dot gnu.org
2024-01-13 11:56 ` dmjpp at hotmail dot com
2024-01-13 12:25 ` redi at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).