From: Jason Merrill <jason@redhat.com>
To: Ben Boeckel <ben.boeckel@kitware.com>, gcc-patches@gcc.gnu.org
Cc: nathan@acm.org, fortran@gcc.gnu.org, gcc@gcc.gnu.org,
brad.king@kitware.com
Subject: Re: [PATCH v5 1/5] libcpp: reject codepoints above 0x10FFFF
Date: Mon, 13 Feb 2023 10:53:17 -0500
Message-ID: <6427dfd9-9ccd-c313-9251-75b9de8bc0af@redhat.com>
In-Reply-To: <20230125210636.2960049-2-ben.boeckel@kitware.com>
[-- Attachment #1: Type: text/plain, Size: 546 bytes --]
On 1/25/23 13:06, Ben Boeckel wrote:
> Unicode does not support such values because they are unrepresentable in
> UTF-16.
>
> libcpp/
>
> * charset.cc: Reject encodings of codepoints above 0x10FFFF.
> UTF-16 does not support such codepoints and therefore all
> Unicode rejects such values.
It seems that this causes a bunch of testsuite failures from tests that
expect this limit to be checked elsewhere with a different diagnostic,
so I think the easiest thing is to fold this into _cpp_valid_utf8_str
instead, i.e.:
Make sense?
Jason
[-- Attachment #2: 0001-libcpp-add-a-function-to-determine-UTF-8-validity-of.patch --]
[-- Type: text/x-patch, Size: 2250 bytes --]
From 296e9d1e16533979d12bd98db2937e396a0796f3 Mon Sep 17 00:00:00 2001
From: Ben Boeckel <ben.boeckel@kitware.com>
Date: Sat, 10 Dec 2022 17:20:49 -0500
Subject: [PATCH] libcpp: add a function to determine UTF-8 validity of a C
string
To: gcc-patches@gcc.gnu.org
This simplifies the interface for other UTF-8 validity detections when a
simple "yes" or "no" answer is sufficient.
libcpp/
* charset.cc: Add `_cpp_valid_utf8_str` which determines whether
a C string is valid UTF-8 or not.
* internal.h: Add prototype for `_cpp_valid_utf8_str`.
Signed-off-by: Ben Boeckel <ben.boeckel@kitware.com>
---
libcpp/internal.h | 2 ++
libcpp/charset.cc | 24 ++++++++++++++++++++++++
2 files changed, 26 insertions(+)
diff --git a/libcpp/internal.h b/libcpp/internal.h
index 9724676a8cd..48520901b2d 100644
--- a/libcpp/internal.h
+++ b/libcpp/internal.h
@@ -834,6 +834,8 @@ extern bool _cpp_valid_utf8 (cpp_reader *pfile,
struct normalize_state *nst,
cppchar_t *cp);
+extern bool _cpp_valid_utf8_str (const char *str);
+
extern void _cpp_destroy_iconv (cpp_reader *);
extern unsigned char *_cpp_convert_input (cpp_reader *, const char *,
unsigned char *, size_t, size_t,
diff --git a/libcpp/charset.cc b/libcpp/charset.cc
index 3c47d4f868b..42a1b596c06 100644
--- a/libcpp/charset.cc
+++ b/libcpp/charset.cc
@@ -1864,6 +1864,30 @@ _cpp_valid_utf8 (cpp_reader *pfile,
return true;
}
+/* Detect whether a C-string is a valid UTF-8-encoded set of bytes.  Returns
+   `false` if any contained byte sequence encodes an invalid Unicode codepoint
+   or is not a valid UTF-8 sequence.  Returns `true` otherwise.  */
+
+extern bool
+_cpp_valid_utf8_str (const char *name)
+{
+  const uchar* in = (const uchar*)name;
+  size_t len = strlen (name);
+  cppchar_t cp;
+
+  while (*in)
+    {
+      if (one_utf8_to_cppchar (&in, &len, &cp))
+        return false;
+
+      /* one_utf8_to_cppchar doesn't check this limit.  */
+      if (cp > UCS_LIMIT)
+        return false;
+    }
+
+  return true;
+}
+
/* Subroutine of convert_hex and convert_oct. N is the representation
in the execution character set of a numeric escape; write it into the
string buffer TBUF and update the end-of-string pointer therein. WIDE
--
2.31.1
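
For readers who want to try the check from the patch above outside a GCC
tree, here is a minimal standalone sketch. It is not libcpp code:
`decode_one`, `valid_utf8_str`, and `kUcsLimit` are hypothetical names
standing in for `one_utf8_to_cppchar`, `_cpp_valid_utf8_str`, and
`UCS_LIMIT`. The stand-in decoder returns `true` on success, whereas the
patch treats a nonzero return from `one_utf8_to_cppchar` as failure. The
sketch assumes a decoder that accepts any well-formed sequence of up to four
bytes, so values up to 0x1FFFFF can come back and the caller has to apply
the 0x10FFFF limit itself, as the patch does.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Highest Unicode code point. It is also the largest value a UTF-16
// surrogate pair can encode: 0x10000 + (0x3FF << 10) + 0x3FF.
constexpr std::uint32_t kUcsLimit = 0x10FFFF;
static_assert(0x10000u + (0x3FFu << 10) + 0x3FFu == kUcsLimit,
              "UTF-16 ceiling matches the limit");

// Hypothetical stand-in for libcpp's one_utf8_to_cppchar.  Decodes one
// sequence of 1 to 4 bytes, advances IN, and shrinks LEN.  It rejects
// malformed, overlong, and surrogate encodings, but deliberately does not
// check the upper limit.
static bool decode_one(const unsigned char *&in, std::size_t &len,
                       std::uint32_t &cp)
{
  if (len == 0)
    return false;

  const unsigned char lead = in[0];
  std::size_t n;  // total bytes in this sequence
  if (lead < 0x80)                { cp = lead;        n = 1; }
  else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; n = 2; }
  else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; n = 3; }
  else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; n = 4; }
  else
    return false;  // stray continuation byte or invalid lead byte

  if (n > len)
    return false;  // truncated sequence

  for (std::size_t i = 1; i < n; ++i)
    {
      if ((in[i] & 0xC0) != 0x80)
        return false;  // not a continuation byte
      cp = (cp << 6) | (in[i] & 0x3F);
    }

  // Smallest code point that legitimately needs N bytes.
  static const std::uint32_t min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };
  if (cp < min_cp[n])
    return false;  // overlong encoding
  if (cp >= 0xD800 && cp <= 0xDFFF)
    return false;  // UTF-16 surrogate, not a valid scalar value

  in += n;
  len -= n;
  return true;
}

// Same shape as the function in the patch: a yes/no answer for a whole
// NUL-terminated string, with the upper limit checked by the caller.
bool valid_utf8_str(const char *str)
{
  const unsigned char *in = reinterpret_cast<const unsigned char *>(str);
  std::size_t len = std::strlen(str);
  std::uint32_t cp;

  while (len > 0)
    {
      if (!decode_one(in, len, cp))
        return false;

      // The decoder above doesn't check this limit.
      if (cp > kUcsLimit)
        return false;
    }

  return true;
}
```

With this sketch, the bytes F4 8F BF BF (U+10FFFF) are accepted, while
F4 90 80 80, which would decode to 0x110000, is rejected by the limit check
rather than by the decoder.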