public inbox for fortran@gcc.gnu.org
 help / color / mirror / Atom feed
From: Jason Merrill <jason@redhat.com>
To: Ben Boeckel <ben.boeckel@kitware.com>, gcc-patches@gcc.gnu.org
Cc: nathan@acm.org, fortran@gcc.gnu.org, gcc@gcc.gnu.org,
	brad.king@kitware.com
Subject: Re: [PATCH v5 1/5] libcpp: reject codepoints above 0x10FFFF
Date: Mon, 13 Feb 2023 10:53:17 -0500	[thread overview]
Message-ID: <6427dfd9-9ccd-c313-9251-75b9de8bc0af@redhat.com> (raw)
In-Reply-To: <20230125210636.2960049-2-ben.boeckel@kitware.com>

[-- Attachment #1: Type: text/plain, Size: 546 bytes --]

On 1/25/23 13:06, Ben Boeckel wrote:
> Unicode does not support such values because they are unrepresentable in
> UTF-16.
> 
> libcpp/
> 
> 	* charset.cc: Reject encodings of codepoints above 0x10FFFF.
> 	UTF-16 does not support such codepoints and therefore all
> 	Unicode rejects such values.

It seems that this causes a bunch of testsuite failures from tests that 
expect this limit to be checked elsewhere with a different diagnostic, 
so I think the easiest thing is to fold this into _cpp_valid_utf8_str 
instead, i.e.:

Make sense?

Jason

[-- Attachment #2: 0001-libcpp-add-a-function-to-determine-UTF-8-validity-of.patch --]
[-- Type: text/x-patch, Size: 2250 bytes --]

From 296e9d1e16533979d12bd98db2937e396a0796f3 Mon Sep 17 00:00:00 2001
From: Ben Boeckel <ben.boeckel@kitware.com>
Date: Sat, 10 Dec 2022 17:20:49 -0500
Subject: [PATCH] libcpp: add a function to determine UTF-8 validity of a C
 string
To: gcc-patches@gcc.gnu.org

This simplifies the interface for other UTF-8 validity detections when a
simple "yes" or "no" answer is sufficient.

libcpp/

	* charset.cc: Add `_cpp_valid_utf8_str` which determines whether
	a C string is valid UTF-8 or not.
	* internal.h: Add prototype for `_cpp_valid_utf8_str`.

Signed-off-by: Ben Boeckel <ben.boeckel@kitware.com>
---
 libcpp/internal.h |  2 ++
 libcpp/charset.cc | 24 ++++++++++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/libcpp/internal.h b/libcpp/internal.h
index 9724676a8cd..48520901b2d 100644
--- a/libcpp/internal.h
+++ b/libcpp/internal.h
@@ -834,6 +834,8 @@ extern bool _cpp_valid_utf8 (cpp_reader *pfile,
 			     struct normalize_state *nst,
 			     cppchar_t *cp);
 
+extern bool _cpp_valid_utf8_str (const char *str);
+
 extern void _cpp_destroy_iconv (cpp_reader *);
 extern unsigned char *_cpp_convert_input (cpp_reader *, const char *,
 					  unsigned char *, size_t, size_t,
diff --git a/libcpp/charset.cc b/libcpp/charset.cc
index 3c47d4f868b..42a1b596c06 100644
--- a/libcpp/charset.cc
+++ b/libcpp/charset.cc
@@ -1864,6 +1864,30 @@ _cpp_valid_utf8 (cpp_reader *pfile,
   return true;
 }
 
+/*  Detect whether a C-string is a valid UTF-8-encoded set of bytes. Returns
+    `false` if any contained byte sequence encodes an invalid Unicode codepoint
+    or is not a valid UTF-8 sequence. Returns `true` otherwise. */
+
+extern bool
+_cpp_valid_utf8_str (const char *name)
+{
+  const uchar* in = (const uchar*)name;
+  size_t len = strlen (name);
+  cppchar_t cp;
+
+  while (*in)
+    {
+      if (one_utf8_to_cppchar (&in, &len, &cp))
+	return false;
+
+      /* one_utf8_to_cppchar doesn't check this limit.  */
+      if (cp > UCS_LIMIT)
+	return false;
+    }
+
+  return true;
+}
+
 /* Subroutine of convert_hex and convert_oct.  N is the representation
    in the execution character set of a numeric escape; write it into the
    string buffer TBUF and update the end-of-string pointer therein.  WIDE
-- 
2.31.1


  reply	other threads:[~2023-02-13 15:53 UTC|newest]

Thread overview: 44+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-01-25 21:06 [PATCH v5 0/5] P1689R5 support Ben Boeckel
2023-01-25 21:06 ` [PATCH v5 1/5] libcpp: reject codepoints above 0x10FFFF Ben Boeckel
2023-02-13 15:53   ` Jason Merrill [this message]
2023-05-12 14:26     ` Ben Boeckel
2023-01-25 21:06 ` [PATCH v5 2/5] libcpp: add a function to determine UTF-8 validity of a C string Ben Boeckel
2023-10-23 15:16   ` David Malcolm
2023-10-23 15:24     ` Jason Merrill
2023-10-23 15:28       ` David Malcolm
2023-01-25 21:06 ` [PATCH v5 3/5] p1689r5: initial support Ben Boeckel
2023-02-14 21:50   ` Jason Merrill
2023-05-12 14:24     ` Ben Boeckel
2023-06-19 21:33       ` Jason Merrill
2023-06-20 16:51         ` Ben Boeckel
2023-06-20 19:46     ` Ben Boeckel
2023-06-23 18:31       ` Jason Merrill
2023-06-25 17:08         ` Ben Boeckel
2023-01-25 21:06 ` [PATCH v5 4/5] c++modules: report imported CMI files as dependencies Ben Boeckel
2023-02-13 18:33   ` Jason Merrill
2023-05-12 14:26     ` Ben Boeckel
2023-06-22 21:21   ` Jason Merrill
2023-06-23  2:45     ` Ben Boeckel
2023-06-23 12:12       ` Nathan Sidwell
2023-06-25 16:36         ` Ben Boeckel
2023-07-18 20:52           ` Jason Merrill
2023-07-18 21:12             ` Nathan Sidwell
2023-07-19  0:01             ` Ben Boeckel
2023-07-19 21:11               ` Nathan Sidwell
2023-07-20  0:47                 ` Ben Boeckel
2023-07-20 21:00                   ` Nathan Sidwell
2023-07-21 14:57                     ` Ben Boeckel
2023-07-21 20:23                       ` Nathan Sidwell
2023-07-24  0:26                         ` Ben Boeckel
2023-07-28  1:13                           ` Jason Merrill
2023-07-29 14:25                             ` Ben Boeckel
2023-01-25 21:06 ` [PATCH v5 5/5] c++modules: report module mapper files as a dependency Ben Boeckel
2023-06-23 14:44   ` Jason Merrill
2023-06-25 16:42     ` Ben Boeckel
2023-02-02 14:04 ` [PATCH v5 0/5] P1689R5 support Ben Boeckel
2023-02-02 20:24 ` Harald Anlauf
2023-02-03  4:00   ` Ben Boeckel
2023-02-03  4:07 ` Andrew Pinski
2023-02-03  8:58   ` Jonathan Wakely
2023-02-03  9:10     ` Jonathan Wakely
2023-02-03 14:52       ` Ben Boeckel

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=6427dfd9-9ccd-c313-9251-75b9de8bc0af@redhat.com \
    --to=jason@redhat.com \
    --cc=ben.boeckel@kitware.com \
    --cc=brad.king@kitware.com \
    --cc=fortran@gcc.gnu.org \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=gcc@gcc.gnu.org \
    --cc=nathan@acm.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).