public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug libstdc++/108976] New: codecvt for Unicode allows surrogate code points
@ 2023-02-28 23:20 dmjpp at hotmail dot com
2023-03-02 11:03 ` [Bug libstdc++/108976] " redi at gcc dot gnu.org
` (14 more replies)
0 siblings, 15 replies; 16+ messages in thread
From: dmjpp at hotmail dot com @ 2023-02-28 23:20 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108976
Bug ID: 108976
Summary: codecvt for Unicode allows surrogate code points
Product: gcc
Version: 13.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: libstdc++
Assignee: unassigned at gcc dot gnu.org
Reporter: dmjpp at hotmail dot com
Target Milestone: ---
Text in valid Unicode should never contain surrogate code POINTS. Those are
only allowed in UTF-16, but only as code UNITS and must be properly paired.
UTF-8 text in its strictest form must not contain surrogates but in a slightly
relaxed form surrogates can be easily encoded as 3-byte sequences. Same can be
said for UTF-32 and UCS-2. Only UTF-16 is immune to the error of surrogate code
POINT (they are treated as UNITS).
Codecvts in libstdc++ currently allow surrogate code points in some cases. Here
is a minimal reproduction (asserts are the correct behavior):
#include <locale>
#include <cassert>
void u32()
{
using namespace std;
auto& f = use_facet<codecvt<char32_t, char, mbstate_t>>(locale::classic());
char u8str[] = "\uC800\uCBFF\uCC00\uCFFF";
u8str[0] = u8str[3] = u8str[6] = u8str[9] = 0xED; // turn the C into D.
// now the string is D800, DBFF, DC00 and DFFF encoded in relaxed UTF-8
// that allows surrogate code points.
char32_t u32str[] = {0xD800, 0xDBFF, 0xDC00, 0xDFFF, 0};
char32_t u32out[1];
const char* from_next;
char32_t* to_next;
mbstate_t st = {};
auto res = f.in(st, u8str, u8str+3, from_next, u32out, u32out+1, to_next);
assert(res == f.error);
assert(from_next == u8str);
assert(to_next == u32out);
st = {};
auto l = f.length(st, u8str, u8str+3, 1);
assert(l == 0);
char u8out[3];
const char32_t* from_next2;
char* to_next2;
st = {};
res = f.out(st, u32str, u32str+1, from_next2, u8out, u8out+3, to_next2);
assert(res == f.error);
assert(from_next2 == u32str);
assert(to_next2 == u8out);
}
void u16()
{
using namespace std;
auto& f = use_facet<codecvt<char16_t, char, mbstate_t>>(locale::classic());
char u8str[] = "\uC800\uCBFF\uCC00\uCFFF";
u8str[0] = u8str[3] = u8str[6] = u8str[9] = 0xED; // turn the C into D.
// now the string is D800, DBFF, DC00 and DFFF encoded in relaxed UTF-8
// that allows surrogates.
char16_t u16out[1];
const char* from_next;
char16_t* to_next;
mbstate_t st = {};
auto res = f.in(st, u8str, u8str+3, from_next, u16out, u16out+1, to_next);
assert(res == f.error);
assert(from_next == u8str);
assert(to_next == u16out);
st = {};
auto l = f.length(st, u8str, u8str+3, 1);
assert(l == 0);
}
int main()
{
u32();
u16();
}
From reading the file codecvt.cc the following conversions have the bug:
- From UTF-8 to any other encoding.
- From UTF-32/UCS-4 to any other encoding.
Those that read from UCS-2 seem to me like they properly report the error.
Reading from UTF-16 can not have this bug by definition. From what I checked,
the functions for reading UTF-16 properly treat unpaired surrogate code units
as error.
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Bug libstdc++/108976] codecvt for Unicode allows surrogate code points
2023-02-28 23:20 [Bug libstdc++/108976] New: codecvt for Unicode allows surrogate code points dmjpp at hotmail dot com
@ 2023-03-02 11:03 ` redi at gcc dot gnu.org
2023-03-02 11:08 ` dmjpp at hotmail dot com
` (13 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: redi at gcc dot gnu.org @ 2023-03-02 11:03 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108976
Jonathan Wakely <redi at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|UNCONFIRMED |NEW
Last reconfirmed| |2023-03-02
Ever confirmed|0 |1
--- Comment #1 from Jonathan Wakely <redi at gcc dot gnu.org> ---
Thanks for the testcase, I'd been meaning to look into exactly this issue.
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Bug libstdc++/108976] codecvt for Unicode allows surrogate code points
2023-02-28 23:20 [Bug libstdc++/108976] New: codecvt for Unicode allows surrogate code points dmjpp at hotmail dot com
2023-03-02 11:03 ` [Bug libstdc++/108976] " redi at gcc dot gnu.org
@ 2023-03-02 11:08 ` dmjpp at hotmail dot com
2023-03-02 11:17 ` redi at gcc dot gnu.org
` (12 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: dmjpp at hotmail dot com @ 2023-03-02 11:08 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108976
--- Comment #2 from Dimitrij Mijoski <dmjpp at hotmail dot com> ---
(In reply to Dimitrij Mijoski from comment #0)
> Those that read from UCS-2 seem to me like they properly report the error.
> Reading from UTF-16 can not have this bug by definition. From what I
> checked, the functions for reading UTF-16 properly treat unpaired surrogate
> code units as error.
Seems like the conversion from UCS-2 to UTF-16BE/LE is also affected. This
conversions is called via codecvt_utf16<char16_t>::out(). See line
https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libstdc%2B%2B-v3/src/c%2B%2B11/codecvt.cc;h=02f05752de84139a7eb7c3d40946b61f4c0334cf;hb=HEAD#l656
it only checks for high surrogate but should also check for low.
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Bug libstdc++/108976] codecvt for Unicode allows surrogate code points
2023-02-28 23:20 [Bug libstdc++/108976] New: codecvt for Unicode allows surrogate code points dmjpp at hotmail dot com
2023-03-02 11:03 ` [Bug libstdc++/108976] " redi at gcc dot gnu.org
2023-03-02 11:08 ` dmjpp at hotmail dot com
@ 2023-03-02 11:17 ` redi at gcc dot gnu.org
2023-03-07 20:17 ` dmjpp at hotmail dot com
` (11 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: redi at gcc dot gnu.org @ 2023-03-02 11:17 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108976
--- Comment #3 from Jonathan Wakely <redi at gcc dot gnu.org> ---
I have some new code for handling UTF-8 for std::print, and using that code
your relaxed u8str gets converted to 12 U+FFFD code points when printed to a
terminal, which I think is correct.
#include <print>
int main()
{
char u8str[] = "\uC800\uCBFF\uCC00\uCFFF";
std::println("valid UTF-8: {}", u8str);
u8str[0] = u8str[3] = u8str[6] = u8str[9] = 0xED; // turn the C into D.
// now the string is D800, DBFF, DC00 and DFFF encoded in relaxed UTF-8
// that allows surrogate code points.
std::vprint_nonunicode("invalid UTF-8 printed raw: {}\n",
std::make_format_args(u8str));
std::println("invalid UTF-8 printed safely: {}", u8str);
}
$ g++ -std=c++23 surr.cc && ./a.out && ./a.out | xxd
valid UTF-8: 저쯿찀쿿
invalid UTF-8 printed raw: ������������
invalid UTF-8 printed safely: ������������
00000000: 7661 6c69 6420 5554 462d 383a 20ec a080 valid UTF-8: ...
00000010: ecaf bfec b080 ecbf bf0a 696e 7661 6c69 ..........invali
00000020: 6420 5554 462d 3820 7072 696e 7465 6420 d UTF-8 printed
00000030: 7261 773a 20ed a080 edaf bfed b080 edbf raw: ...........
00000040: bf0a 696e 7661 6c69 6420 5554 462d 3820 ..invalid UTF-8
00000050: 7072 696e 7465 6420 7361 6665 6c79 3a20 printed safely:
00000060: efbf bdef bfbd efbf bdef bfbd efbf bdef ................
00000070: bfbd efbf bdef bfbd efbf bdef bfbd efbf ................
00000080: bdef bfbd 0a .....
The new code is also much faster, so I'm thinking of rewriting some of the
src/c++11/codecvt.cc facets to use it. But that's a longer term project, we
should fix this bug first.
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Bug libstdc++/108976] codecvt for Unicode allows surrogate code points
2023-02-28 23:20 [Bug libstdc++/108976] New: codecvt for Unicode allows surrogate code points dmjpp at hotmail dot com
` (2 preceding siblings ...)
2023-03-02 11:17 ` redi at gcc dot gnu.org
@ 2023-03-07 20:17 ` dmjpp at hotmail dot com
2023-03-07 21:43 ` redi at gcc dot gnu.org
` (10 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: dmjpp at hotmail dot com @ 2023-03-07 20:17 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108976
--- Comment #4 from Dimitrij Mijoski <dmjpp at hotmail dot com> ---
I have fixed this, added large testsute and discovered another bug in
codecvt_utf16 when the input [from, from_end) contains odd number of bytes.
Error was returned instead of partial.
Here are the changes in 8 commits
https://github.com/gcc-mirror/gcc/compare/master...dimztimz:gcc:codecvt (read
from top to bottom). Do you want everything squashed in one commit or maybe
some other combination?
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Bug libstdc++/108976] codecvt for Unicode allows surrogate code points
2023-02-28 23:20 [Bug libstdc++/108976] New: codecvt for Unicode allows surrogate code points dmjpp at hotmail dot com
` (3 preceding siblings ...)
2023-03-07 20:17 ` dmjpp at hotmail dot com
@ 2023-03-07 21:43 ` redi at gcc dot gnu.org
2023-03-08 14:11 ` dmjpp at hotmail dot com
` (9 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: redi at gcc dot gnu.org @ 2023-03-07 21:43 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108976
--- Comment #5 from Jonathan Wakely <redi at gcc dot gnu.org> ---
Excellent, thanks! I'll look at the commits tomorrow and get back to you.
A patch will need to be submitted to the mailing lists, but it might be OK to
send a single diff for the whole series, but then I could merge the 8 commits
(without squashing them). I'll see if that makes sense (and would be OK by our
policies) and let you know.
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Bug libstdc++/108976] codecvt for Unicode allows surrogate code points
2023-02-28 23:20 [Bug libstdc++/108976] New: codecvt for Unicode allows surrogate code points dmjpp at hotmail dot com
` (4 preceding siblings ...)
2023-03-07 21:43 ` redi at gcc dot gnu.org
@ 2023-03-08 14:11 ` dmjpp at hotmail dot com
2023-04-18 13:45 ` dmjpp at hotmail dot com
` (8 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: dmjpp at hotmail dot com @ 2023-03-08 14:11 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108976
--- Comment #6 from Dimitrij Mijoski <dmjpp at hotmail dot com> ---
I sent a single patch to the mailing list with a good detailed commit message.
I think that is better than multiple patches.
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Bug libstdc++/108976] codecvt for Unicode allows surrogate code points
2023-02-28 23:20 [Bug libstdc++/108976] New: codecvt for Unicode allows surrogate code points dmjpp at hotmail dot com
` (5 preceding siblings ...)
2023-03-08 14:11 ` dmjpp at hotmail dot com
@ 2023-04-18 13:45 ` dmjpp at hotmail dot com
2023-09-29 15:01 ` cvs-commit at gcc dot gnu.org
` (7 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: dmjpp at hotmail dot com @ 2023-04-18 13:45 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108976
--- Comment #7 from Dimitrij Mijoski <dmjpp at hotmail dot com> ---
I put a second version of the patch
https://gcc.gnu.org/pipermail/libstdc++/2023-March/055667.html about a month
ago.
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Bug libstdc++/108976] codecvt for Unicode allows surrogate code points
2023-02-28 23:20 [Bug libstdc++/108976] New: codecvt for Unicode allows surrogate code points dmjpp at hotmail dot com
` (6 preceding siblings ...)
2023-04-18 13:45 ` dmjpp at hotmail dot com
@ 2023-09-29 15:01 ` cvs-commit at gcc dot gnu.org
2024-01-05 16:34 ` dmjpp at hotmail dot com
` (6 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-09-29 15:01 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108976
--- Comment #8 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Jonathan Wakely <redi@gcc.gnu.org>:
https://gcc.gnu.org/g:a8b9c32da787ea0bfbfc9118ac816fa7be4b1bc8
commit r14-4335-ga8b9c32da787ea0bfbfc9118ac816fa7be4b1bc8
Author: Dimitrij Mijoski <dmjpp@hotmail.com>
Date: Thu Sep 28 21:38:11 2023 +0200
libstdc++: Fix handling of surrogate CP in codecvt [PR108976]
This patch fixes the handling of surrogate code points in all standard
facets for transcoding Unicode that are based on std::codecvt. Surrogate
code points should always be treated as error. On the other hand
surrogate code units can only appear in UTF-16 and only when they come
in a proper pair.
Additionally, it fixes a bug in std::codecvt_utf16::in() when odd number
of bytes were given in the range [from, from_end), error was returned
always. The last byte in such range does not form a full UTF-16 code
unit and we can not make any decisions for error, instead partial should
be returned.
The testsuite for testing these facets was updated in the following
order:
1. All functions that test codecvts that work with UTF-8 were refactored
and made more generic so they accept codecvt that works with the char
type char8_t.
2. The same functions were updated with new test cases for transcoding
errors and now additionally test for surrogates, overlong UTF-8
sequences, code points out of the Unicode range, and more tests for
missing leading and trailing code units.
3. New tests were added to test codecvt_utf16 in both of its variants,
UTF-16 <-> UTF-32/UCS-4 and UTF-16 <-> UCS-2.
libstdc++-v3/ChangeLog:
PR libstdc++/108976
* src/c++11/codecvt.cc (read_utf8_code_point): Fix handing of
surrogates in UTF-8.
(ucs4_out): Fix handling of surrogates in UCS-4 -> UTF-8.
(ucs4_in): Fix handling of range with odd number of bytes.
(ucs4_out): Fix handling of surrogates in UCS-4 -> UTF-16.
(ucs2_out): Fix handling of surrogates in UCS-2 -> UTF-16.
(ucs2_in): Fix handling of range with odd number of bytes.
(__codecvt_utf16_base<char16_t>::do_in): Likewise.
(__codecvt_utf16_base<char32_t>::do_in): Likewise.
(__codecvt_utf16_base<wchar_t>::do_in): Likewise.
* testsuite/22_locale/codecvt/codecvt_unicode.cc: Renames, add
tests for codecvt_utf16<char16_t> and codecvt_utf16<char32_t>.
* testsuite/22_locale/codecvt/codecvt_unicode.h: Refactor UTF-8
testing functions for char8_t, add more test cases for errors,
add testing functions for codecvt_utf16.
* testsuite/22_locale/codecvt/codecvt_unicode_wchar_t.cc:
Renames, add tests for codecvt_utf16<whchar_t>.
* testsuite/22_locale/codecvt/codecvt_utf16/79980.cc (test06):
Fix test.
* testsuite/22_locale/codecvt/codecvt_unicode_char8_t.cc: New
test.
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Bug libstdc++/108976] codecvt for Unicode allows surrogate code points
2023-02-28 23:20 [Bug libstdc++/108976] New: codecvt for Unicode allows surrogate code points dmjpp at hotmail dot com
` (7 preceding siblings ...)
2023-09-29 15:01 ` cvs-commit at gcc dot gnu.org
@ 2024-01-05 16:34 ` dmjpp at hotmail dot com
2024-01-05 18:57 ` redi at gcc dot gnu.org
` (5 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: dmjpp at hotmail dot com @ 2024-01-05 16:34 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108976
--- Comment #9 from Dimitrij Mijoski <dmjpp at hotmail dot com> ---
I believe this bug report should closed as resolved. Are there maybe plans for
back-porting?
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Bug libstdc++/108976] codecvt for Unicode allows surrogate code points
2023-02-28 23:20 [Bug libstdc++/108976] New: codecvt for Unicode allows surrogate code points dmjpp at hotmail dot com
` (8 preceding siblings ...)
2024-01-05 16:34 ` dmjpp at hotmail dot com
@ 2024-01-05 18:57 ` redi at gcc dot gnu.org
2024-01-13 11:56 ` dmjpp at hotmail dot com
` (4 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: redi at gcc dot gnu.org @ 2024-01-05 18:57 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108976
--- Comment #10 from Jonathan Wakely <redi at gcc dot gnu.org> ---
I think it would be good to backport it, what do you think?
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Bug libstdc++/108976] codecvt for Unicode allows surrogate code points
2023-02-28 23:20 [Bug libstdc++/108976] New: codecvt for Unicode allows surrogate code points dmjpp at hotmail dot com
` (9 preceding siblings ...)
2024-01-05 18:57 ` redi at gcc dot gnu.org
@ 2024-01-13 11:56 ` dmjpp at hotmail dot com
2024-01-13 12:25 ` redi at gcc dot gnu.org
` (3 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: dmjpp at hotmail dot com @ 2024-01-13 11:56 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108976
--- Comment #11 from Dimitrij Mijoski <dmjpp at hotmail dot com> ---
(In reply to Jonathan Wakely from comment #10)
> I think it would be good to backport it, what do you think?
I don't really have strong need. Maybe porting only to v13 as that is pretty
straightforward (simple cherry-picking). Porting to v12 and v11 will require
porting other patches first.
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Bug libstdc++/108976] codecvt for Unicode allows surrogate code points
2023-02-28 23:20 [Bug libstdc++/108976] New: codecvt for Unicode allows surrogate code points dmjpp at hotmail dot com
` (10 preceding siblings ...)
2024-01-13 11:56 ` dmjpp at hotmail dot com
@ 2024-01-13 12:25 ` redi at gcc dot gnu.org
2024-05-21 9:14 ` jakub at gcc dot gnu.org
` (2 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: redi at gcc dot gnu.org @ 2024-01-13 12:25 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108976
Jonathan Wakely <redi at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|--- |13.3
--- Comment #12 from Jonathan Wakely <redi at gcc dot gnu.org> ---
Thanks, I think that makes sense.
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Bug libstdc++/108976] codecvt for Unicode allows surrogate code points
2023-02-28 23:20 [Bug libstdc++/108976] New: codecvt for Unicode allows surrogate code points dmjpp at hotmail dot com
` (11 preceding siblings ...)
2024-01-13 12:25 ` redi at gcc dot gnu.org
@ 2024-05-21 9:14 ` jakub at gcc dot gnu.org
2024-05-21 22:03 ` cvs-commit at gcc dot gnu.org
2024-05-21 22:03 ` redi at gcc dot gnu.org
14 siblings, 0 replies; 16+ messages in thread
From: jakub at gcc dot gnu.org @ 2024-05-21 9:14 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108976
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|13.3 |13.4
--- Comment #13 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 13.3 is being released, retargeting bugs to GCC 13.4.
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Bug libstdc++/108976] codecvt for Unicode allows surrogate code points
2023-02-28 23:20 [Bug libstdc++/108976] New: codecvt for Unicode allows surrogate code points dmjpp at hotmail dot com
` (12 preceding siblings ...)
2024-05-21 9:14 ` jakub at gcc dot gnu.org
@ 2024-05-21 22:03 ` cvs-commit at gcc dot gnu.org
2024-05-21 22:03 ` redi at gcc dot gnu.org
14 siblings, 0 replies; 16+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2024-05-21 22:03 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108976
--- Comment #14 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The releases/gcc-13 branch has been updated by Jonathan Wakely
<redi@gcc.gnu.org>:
https://gcc.gnu.org/g:bd5e672303f5f777e8927a746d3ee42db21d871b
commit r13-8788-gbd5e672303f5f777e8927a746d3ee42db21d871b
Author: Dimitrij Mijoski <dmjpp@hotmail.com>
Date: Thu Sep 28 21:38:11 2023 +0200
libstdc++: Fix handling of surrogate CP in codecvt [PR108976]
This patch fixes the handling of surrogate code points in all standard
facets for transcoding Unicode that are based on std::codecvt. Surrogate
code points should always be treated as error. On the other hand
surrogate code units can only appear in UTF-16 and only when they come
in a proper pair.
Additionally, it fixes a bug in std::codecvt_utf16::in() when odd number
of bytes were given in the range [from, from_end), error was returned
always. The last byte in such range does not form a full UTF-16 code
unit and we can not make any decisions for error, instead partial should
be returned.
The testsuite for testing these facets was updated in the following
order:
1. All functions that test codecvts that work with UTF-8 were refactored
and made more generic so they accept codecvt that works with the char
type char8_t.
2. The same functions were updated with new test cases for transcoding
errors and now additionally test for surrogates, overlong UTF-8
sequences, code points out of the Unicode range, and more tests for
missing leading and trailing code units.
3. New tests were added to test codecvt_utf16 in both of its variants,
UTF-16 <-> UTF-32/UCS-4 and UTF-16 <-> UCS-2.
libstdc++-v3/ChangeLog:
PR libstdc++/108976
* src/c++11/codecvt.cc (read_utf8_code_point): Fix handing of
surrogates in UTF-8.
(ucs4_out): Fix handling of surrogates in UCS-4 -> UTF-8.
(ucs4_in): Fix handling of range with odd number of bytes.
(ucs4_out): Fix handling of surrogates in UCS-4 -> UTF-16.
(ucs2_out): Fix handling of surrogates in UCS-2 -> UTF-16.
(ucs2_in): Fix handling of range with odd number of bytes.
(__codecvt_utf16_base<char16_t>::do_in): Likewise.
(__codecvt_utf16_base<char32_t>::do_in): Likewise.
(__codecvt_utf16_base<wchar_t>::do_in): Likewise.
* testsuite/22_locale/codecvt/codecvt_unicode.cc: Renames, add
tests for codecvt_utf16<char16_t> and codecvt_utf16<char32_t>.
* testsuite/22_locale/codecvt/codecvt_unicode.h: Refactor UTF-8
testing functions for char8_t, add more test cases for errors,
add testing functions for codecvt_utf16.
* testsuite/22_locale/codecvt/codecvt_unicode_wchar_t.cc:
Renames, add tests for codecvt_utf16<whchar_t>.
* testsuite/22_locale/codecvt/codecvt_utf16/79980.cc (test06):
Fix test.
* testsuite/22_locale/codecvt/codecvt_unicode_char8_t.cc: New
test.
(cherry picked from commit a8b9c32da787ea0bfbfc9118ac816fa7be4b1bc8)
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Bug libstdc++/108976] codecvt for Unicode allows surrogate code points
2023-02-28 23:20 [Bug libstdc++/108976] New: codecvt for Unicode allows surrogate code points dmjpp at hotmail dot com
` (13 preceding siblings ...)
2024-05-21 22:03 ` cvs-commit at gcc dot gnu.org
@ 2024-05-21 22:03 ` redi at gcc dot gnu.org
14 siblings, 0 replies; 16+ messages in thread
From: redi at gcc dot gnu.org @ 2024-05-21 22:03 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108976
Jonathan Wakely <redi at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Resolution|--- |FIXED
Status|NEW |RESOLVED
--- Comment #15 from Jonathan Wakely <redi at gcc dot gnu.org> ---
Fixed for 13.4 and 14.1
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2024-05-21 22:03 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-28 23:20 [Bug libstdc++/108976] New: codecvt for Unicode allows surrogate code points dmjpp at hotmail dot com
2023-03-02 11:03 ` [Bug libstdc++/108976] " redi at gcc dot gnu.org
2023-03-02 11:08 ` dmjpp at hotmail dot com
2023-03-02 11:17 ` redi at gcc dot gnu.org
2023-03-07 20:17 ` dmjpp at hotmail dot com
2023-03-07 21:43 ` redi at gcc dot gnu.org
2023-03-08 14:11 ` dmjpp at hotmail dot com
2023-04-18 13:45 ` dmjpp at hotmail dot com
2023-09-29 15:01 ` cvs-commit at gcc dot gnu.org
2024-01-05 16:34 ` dmjpp at hotmail dot com
2024-01-05 18:57 ` redi at gcc dot gnu.org
2024-01-13 11:56 ` dmjpp at hotmail dot com
2024-01-13 12:25 ` redi at gcc dot gnu.org
2024-05-21 9:14 ` jakub at gcc dot gnu.org
2024-05-21 22:03 ` cvs-commit at gcc dot gnu.org
2024-05-21 22:03 ` redi at gcc dot gnu.org
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).