public inbox for gcc-bugs@sourceware.org
From: "redi at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug libstdc++/108976] codecvt for Unicode allows surrogate code points
Date: Thu, 02 Mar 2023 11:17:29 +0000
Message-ID: <bug-108976-4-RUv1hDmxC2@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-108976-4@http.gcc.gnu.org/bugzilla/>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108976

--- Comment #3 from Jonathan Wakely <redi at gcc dot gnu.org> ---
I have some new code for handling UTF-8 for std::print, and using that code
your relaxed u8str gets converted to 12 U+FFFD code points when printed to a
terminal, which I think is correct.

#include <print>

int main()
{
  char u8str[] = "\uC800\uCBFF\uCC00\uCFFF";
  std::println("valid UTF-8: {}", u8str);

  // Change each lead byte from 0xEC to 0xED (turn the C into D).
  u8str[0] = u8str[3] = u8str[6] = u8str[9] = 0xED;
  // Now the string is D800, DBFF, DC00 and DFFF encoded in relaxed UTF-8
  // that allows surrogate code points.
  std::vprint_nonunicode("invalid UTF-8 printed raw: {}\n",
                         std::make_format_args(u8str));
  std::println("invalid UTF-8 printed safely: {}", u8str);
}

$ g++ -std=c++23 surr.cc && ./a.out && ./a.out | xxd
valid UTF-8: 저쯿찀쿿
invalid UTF-8 printed raw: ������������
invalid UTF-8 printed safely: ������������
00000000: 7661 6c69 6420 5554 462d 383a 20ec a080  valid UTF-8: ...
00000010: ecaf bfec b080 ecbf bf0a 696e 7661 6c69  ..........invali
00000020: 6420 5554 462d 3820 7072 696e 7465 6420  d UTF-8 printed 
00000030: 7261 773a 20ed a080 edaf bfed b080 edbf  raw: ...........
00000040: bf0a 696e 7661 6c69 6420 5554 462d 3820  ..invalid UTF-8 
00000050: 7072 696e 7465 6420 7361 6665 6c79 3a20  printed safely: 
00000060: efbf bdef bfbd efbf bdef bfbd efbf bdef  ................
00000070: bfbd efbf bdef bfbd efbf bdef bfbd efbf  ................
00000080: bdef bfbd 0a                             .....
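
To illustrate why the safe output contains exactly 12 U+FFFD code points, here is a minimal sketch of a sanitizer that follows the Unicode well-formed byte sequence table and replaces each maximal ill-formed subsequence with U+FFFD. The name sanitize_utf8 is hypothetical and this is not the libstdc++ code used by std::print; it only shows the general technique. Because a lead byte of 0xED may only be followed by 0x80..0x9F, each surrogate's three bytes (0xED, then two stray continuation bytes) are replaced one by one.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not libstdc++ code): copy well-formed UTF-8 through
// and replace each maximal ill-formed subsequence with U+FFFD (EF BF BD).
std::string sanitize_utf8(std::string_view in)
{
  std::string out;
  std::size_t i = 0;
  while (i < in.size())
  {
    const unsigned char b = static_cast<unsigned char>(in[i]);
    int extra = 0;                       // continuation bytes expected
    unsigned char lo = 0x80, hi = 0xBF;  // allowed range for the second byte

    if (b <= 0x7F) extra = 0;
    else if (b >= 0xC2 && b <= 0xDF) extra = 1;
    else if (b == 0xE0) { extra = 2; lo = 0xA0; }  // reject overlong forms
    else if (b == 0xED) { extra = 2; hi = 0x9F; }  // reject surrogates
    else if (b >= 0xE1 && b <= 0xEF) extra = 2;
    else if (b == 0xF0) { extra = 3; lo = 0x90; }  // reject overlong forms
    else if (b >= 0xF1 && b <= 0xF3) extra = 3;
    else if (b == 0xF4) { extra = 3; hi = 0x8F; }  // reject > U+10FFFF
    else extra = -1;                               // never a valid lead byte

    std::size_t len = 1;  // bytes consumed, well-formed or not
    bool ok = extra >= 0;
    for (int k = 0; ok && k < extra; ++k)
    {
      if (i + len >= in.size()) { ok = false; break; }  // truncated input
      const unsigned char c = static_cast<unsigned char>(in[i + len]);
      if (c < lo || c > hi) { ok = false; break; }      // offending byte
      ++len;
      lo = 0x80; hi = 0xBF;  // later bytes use the plain continuation range
    }

    if (ok) out.append(in.substr(i, len));
    else out += "\xEF\xBF\xBD";
    i += len;
  }
  return out;
}
```

Passing the 12 bytes ed a0 80 ed af bf ed b0 80 ed bf bf from the raw output above through this sketch gives 36 bytes, 12 copies of ef bf bd, which matches the hex dump of the safe output.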


The new code is also much faster, so I'm thinking of rewriting some of the
src/c++11/codecvt.cc facets to use it. But that's a longer-term project; we
should fix this bug first.
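
For the encoding direction, the kind of check this bug is about can be sketched as follows. This is only an illustration under the assumption that a strict encoder should refuse surrogates; encode_utf8 is a hypothetical name, and this is neither the existing codecvt.cc code nor the eventual fix.

```cpp
#include <string>

// Hypothetical helper, not the libstdc++ fix: append the UTF-8 encoding of
// `c` to `out`, refusing surrogates and values above U+10FFFF.
bool encode_utf8(char32_t c, std::string& out)
{
  if (c >= 0xD800 && c <= 0xDFFF) return false;  // surrogate code point
  if (c > 0x10FFFF) return false;                // outside the Unicode range

  if (c < 0x80) {
    out += static_cast<char>(c);
  } else if (c < 0x800) {
    out += static_cast<char>(0xC0 | (c >> 6));
    out += static_cast<char>(0x80 | (c & 0x3F));
  } else if (c < 0x10000) {
    out += static_cast<char>(0xE0 | (c >> 12));
    out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (c & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (c >> 18));
    out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (c & 0x3F));
  }
  return true;
}
```

With this sketch, U+C800 encodes to ec a0 80 as in the hex dump above, while U+D800 is rejected. Without the first check, the three-byte branch would produce ed a0 80, the relaxed form used in the test program.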
