Date: Tue, 15 Nov 2022 19:00:54 -0500
Subject: Re: [PATCH v3 1/3] libcpp: reject codepoints above 0x10FFFF
To: Ben Boeckel, gcc-patches@gcc.gnu.org
Cc: nathan@acm.org, fortran@gcc.gnu.org, gcc@gcc.gnu.org, brad.king@kitware.com
References: <20221109021048.2123704-1-ben.boeckel@kitware.com> <20221109021048.2123704-2-ben.boeckel@kitware.com>
From: Jason Merrill
In-Reply-To: <20221109021048.2123704-2-ben.boeckel@kitware.com>

On 11/8/22 16:10, Ben Boeckel wrote:
> Unicode does not support such values because they are unrepresentable in
> UTF-16.
>
> libcpp/
>
> 	* charset.cc: Reject encodings of codepoints above 0x10FFFF.
> 	UTF-16 does not support such codepoints and therefore all
> 	Unicode rejects such values.

OK.

> Signed-off-by: Ben Boeckel
> ---
>  libcpp/charset.cc | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/libcpp/charset.cc b/libcpp/charset.cc
> index 12a398e7527..324b5b19136 100644
> --- a/libcpp/charset.cc
> +++ b/libcpp/charset.cc
> @@ -158,6 +158,10 @@ struct _cpp_strbuf
>     encoded as any of DF 80, E0 9F 80, F0 80 9F 80, F8 80 80 9F 80, or
>     FC 80 80 80 9F 80.  Only the first is valid.
>
> +   Additionally, Unicode declares that all codepoints above 0010FFFF are
> +   invalid because they cannot be represented in UTF-16.  As such, all 5- and
> +   6-byte encodings are invalid.
> +
>     An implementation note: the transformation from UTF-16 to UTF-8, or
>     vice versa, is easiest done by using UTF-32 as an intermediary.  */
>
> @@ -216,7 +220,7 @@ one_utf8_to_cppchar (const uchar **inbufp, size_t *inbytesleftp,
>    if (c <= 0x3FFFFFF && nbytes > 5) return EILSEQ;
>
>    /* Make sure the character is valid.  */
> -  if (c > 0x7FFFFFFF || (c >= 0xD800 && c <= 0xDFFF)) return EILSEQ;
> +  if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return EILSEQ;
>
>    *cp = c;
>    *inbufp = inbuf;
> @@ -320,7 +324,7 @@ one_utf32_to_utf8 (iconv_t bigend, const uchar **inbufp, size_t *inbytesleftp,
>    s += inbuf[bigend ? 2 : 1] << 8;
>    s += inbuf[bigend ? 3 : 0];
>
> -  if (s >= 0x7FFFFFFF || (s >= 0xD800 && s <= 0xDFFF))
> +  if (s > 0x10FFFF || (s >= 0xD800 && s <= 0xDFFF))
>      return EILSEQ;
>
>    rval = one_cppchar_to_utf8 (s, outbufp, outbytesleftp);
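
For reference, a minimal standalone sketch (not part of the patch) of where the 0x10FFFF ceiling comes from and what the new condition accepts. Both function names are hypothetical and do not exist in libcpp; the second one only mirrors the range test from the hunks above, on the assumption that it is evaluated on an unsigned 32-bit value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not from libcpp: the largest value a UTF-16
   surrogate pair can encode.  Each surrogate carries 10 payload bits,
   and the pair is offset by 0x10000.  */
static uint32_t
utf16_max_codepoint (void)
{
  const uint32_t hi = 0xDBFF;  /* highest high (leading) surrogate */
  const uint32_t lo = 0xDFFF;  /* highest low (trailing) surrogate */
  return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
}

/* Hypothetical helper mirroring the patched condition: true when C is
   neither above 0x10FFFF nor inside the surrogate range.  */
static bool
is_unicode_scalar (uint32_t c)
{
  return !(c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF));
}
```

Under these assumptions utf16_max_codepoint () evaluates to 0x10FFFF, and is_unicode_scalar () accepts 0x10FFFF while rejecting 0x110000 and anything in 0xD800..0xDFFF.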