public inbox for gcc-bugs@sourceware.org
From: "redi at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug libstdc++/108976] codecvt for Unicode allows surrogate code points
Date: Thu, 02 Mar 2023 11:17:29 +0000
Message-ID: <bug-108976-4-RUv1hDmxC2@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-108976-4@http.gcc.gnu.org/bugzilla/>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108976

--- Comment #3 from Jonathan Wakely <redi at gcc dot gnu.org> ---
I have some new code for handling UTF-8 for std::print, and using that code
your relaxed u8str gets converted to 12 U+FFFD code points when printed to a
terminal, which I think is correct.

#include <print>

int main()
{
  char u8str[] = "\uC800\uCBFF\uCC00\uCFFF";
  std::println("valid UTF-8: {}", u8str);
  u8str[0] = u8str[3] = u8str[6] = u8str[9] = 0xED; // turn the C into D.
  // now the string is D800, DBFF, DC00 and DFFF encoded in relaxed UTF-8
  // that allows surrogate code points.
  std::vprint_nonunicode("invalid UTF-8 printed raw: {}\n",
                         std::make_format_args(u8str));
  std::println("invalid UTF-8 printed safely: {}", u8str);
}

$ g++ -std=c++23 surr.cc && ./a.out && ./a.out | xxd
valid UTF-8: 저쯿찀쿿
invalid UTF-8 printed raw: ������������
invalid UTF-8 printed safely: ������������
00000000: 7661 6c69 6420 5554 462d 383a 20ec a080  valid UTF-8: ...
00000010: ecaf bfec b080 ecbf bf0a 696e 7661 6c69  ..........invali
00000020: 6420 5554 462d 3820 7072 696e 7465 6420  d UTF-8 printed
00000030: 7261 773a 20ed a080 edaf bfed b080 edbf  raw: ...........
00000040: bf0a 696e 7661 6c69 6420 5554 462d 3820  ..invalid UTF-8
00000050: 7072 696e 7465 6420 7361 6665 6c79 3a20  printed safely:
00000060: efbf bdef bfbd efbf bdef bfbd efbf bdef  ................
00000070: bfbd efbf bdef bfbd efbf bdef bfbd efbf  ................
00000080: bdef bfbd 0a                             .....

The new code is also much faster, so I'm thinking of rewriting some of the
src/c++11/codecvt.cc facets to use it. But that's a longer term project, we
should fix this bug first.
Thread overview: 16+ messages
2023-02-28 23:20 [Bug libstdc++/108976] New: " dmjpp at hotmail dot com
2023-03-02 11:03 ` [Bug libstdc++/108976] " redi at gcc dot gnu.org
2023-03-02 11:08 ` dmjpp at hotmail dot com
2023-03-02 11:17 ` redi at gcc dot gnu.org [this message]
2023-03-07 20:17 ` dmjpp at hotmail dot com
2023-03-07 21:43 ` redi at gcc dot gnu.org
2023-03-08 14:11 ` dmjpp at hotmail dot com
2023-04-18 13:45 ` dmjpp at hotmail dot com
2023-09-29 15:01 ` cvs-commit at gcc dot gnu.org
2024-01-05 16:34 ` dmjpp at hotmail dot com
2024-01-05 18:57 ` redi at gcc dot gnu.org
2024-01-13 11:56 ` dmjpp at hotmail dot com
2024-01-13 12:25 ` redi at gcc dot gnu.org
2024-05-21  9:14 ` jakub at gcc dot gnu.org
2024-05-21 22:03 ` cvs-commit at gcc dot gnu.org
2024-05-21 22:03 ` redi at gcc dot gnu.org