From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by sourceware.org (Postfix) with ESMTPS id 5E00B385E017 for ; Mon, 4 Jul 2022 19:54:27 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 5E00B385E017 Received: from mail-io1-f70.google.com (mail-io1-f70.google.com [209.85.166.70]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-79-slREkFgpMeiWiy_1-EvXfQ-1; Mon, 04 Jul 2022 15:54:20 -0400 X-MC-Unique: slREkFgpMeiWiy_1-EvXfQ-1 Received: by mail-io1-f70.google.com with SMTP id o11-20020a6bcf0b000000b0067328c4275bso5865442ioa.8 for ; Mon, 04 Jul 2022 12:54:20 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:message-id:date:mime-version:user-agent:subject :content-language:to:references:from:organization:in-reply-to :content-transfer-encoding; bh=3S9GzcMYs9JN/qRrJjtqljsBwRZ4zYLPg+FTHvPWlbE=; b=75vr7bnrRGLSQzvJ/XRQDFw2hrwZEi2OUbOjg4jwfe2sOLkSEa/eCdhg+De1lSxphY OrczwpG8rcIm/fbAtC+DSocMvEdPiZ+hQs7wUbPum5CkkYX1xeWGcWoQ/UEWwQlKGY+G XAIcu4auKQxTVONlEM+lYJNYhf5MePI5Ez4Gfla/1aBM45ULqWsZkTvs/dZYgafGVcAt +B7a++grd+bf4c05fN6mf9q0RTCyCRCVkDcnqgSI3avJQGEEpnF5GMZFCfmsIdYNDYXa LcWu72OLmTJZQWzubOb92rgoNopgjVnrf6tqHx8InIb7zzk/C3PvpyP8IGW7pG9KcDjh RELA== X-Gm-Message-State: AJIora9uyabEveuGU8LiLbK4pCOducHf9E07HUZSLj5x2Ku2Fw5ffKzh LQt25+0A3KCPZ1Cwu1eVdNfhrhoWeVxPoRKDAZC+O8nhLDQ3MeQL5A4Y26owOy1zaGpl+DtDpZ3 snUWTfiZwCnymuQLszm8Q X-Received: by 2002:a6b:7845:0:b0:64c:9acc:9f1a with SMTP id h5-20020a6b7845000000b0064c9acc9f1amr16817530iop.103.1656964459580; Mon, 04 Jul 2022 12:54:19 -0700 (PDT) X-Google-Smtp-Source: AGRyM1s8xqYlRGF2Fz4oX/KX+zGoA0q/SzB19ntUltTFfLfJcErandJpL3vG4mBMWGJ5okhZjcMyCw== X-Received: by 2002:a6b:7845:0:b0:64c:9acc:9f1a with SMTP id h5-20020a6b7845000000b0064c9acc9f1amr16817524iop.103.1656964459344; Mon, 04 Jul 2022 12:54:19 -0700 (PDT) Received: from [192.168.0.241] (135-23-175-80.cpe.pppoe.ca. [135.23.175.80]) by smtp.gmail.com with ESMTPSA id d68-20020a026247000000b0033ec0c74fc7sm2814159jac.134.2022.07.04.12.54.18 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Mon, 04 Jul 2022 12:54:19 -0700 (PDT) Message-ID: <83a19826-57db-cdb6-b64d-510544d96e9d@redhat.com> Date: Mon, 4 Jul 2022 15:54:18 -0400 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.9.0 Subject: Re: [PATCH 4/5] locale: localdef input files are now encoded in UTF-8 To: Florian Weimer , libc-alpha@sourceware.org References: From: Carlos O'Donell Organization: Red Hat In-Reply-To: X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Language: en-US Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Spam-Status: No, score=-16.3 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, KAM_NUMSUBJECT, NICE_REPLY_A, RCVD_IN_DNSWL_LOW, SPF_HELO_NONE, SPF_NONE, TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 04 Jul 2022 19:54:29 -0000 On 5/19/22 17:06, Florian Weimer via Libc-alpha wrote: > Previously, they were assumed to be in ISO-8859-1, and that the output > charset overlapped with ISO-8859-1 for the characters actually used. > However, this did not work as intended on many architectures even for > an ISO-8859-1 output encoding because of the char signedness bug in > lr_getc. Therefore, this commit switches to UTF-8 without making > provisions for backwards compatibility. That sounds acceptable to me. LGTM. Reviewed-by: Carlos O'Donell Tested-by: Carlos O'Donell > > The following Elisp code can be used to convert locale definition files > to UTF-8: > > (defun glibc/convert-localedef (from to) > (interactive "r") > (save-excursion > (save-restriction > (narrow-to-region from to) > (goto-char (point-min)) > (save-match-data > (while (re-search-forward "" nil t) > (let* ((codepoint (string-to-number (match-string 1) 16)) > (converted > (cond > ((memq codepoint '(?/ ?\ ?< ?>)) > (string ?/ codepoint)) > ((= codepoint ?\") "") > (t (string codepoint))))) > (replace-match converted t))))))) OK. > --- > NEWS | 4 + > locale/programs/linereader.c | 144 ++++++++++++++++++++++++++++++++--- > 2 files changed, 137 insertions(+), 11 deletions(-) > > diff --git a/NEWS b/NEWS > index ad0c08d8ca..7ce0d8a135 100644 > --- a/NEWS > +++ b/NEWS > @@ -20,6 +20,10 @@ Major new features: > have been added. The pidfd functionality provides access to a process > while avoiding the issue of PID reuse on tranditional Unix systems. > > +* localedef now accepts locale definition files encoded in UTF-8. > + Previously, input bytes not within the ASCII range resulted in > + unpredictable output. OK. > + > Deprecated and removed features, and other changes affecting compatibility: > > * Support for prelink will be removed in the next release; this includes > diff --git a/locale/programs/linereader.c b/locale/programs/linereader.c > index f7292f0102..b484327969 100644 > --- a/locale/programs/linereader.c > +++ b/locale/programs/linereader.c > @@ -42,6 +42,7 @@ static struct token *get_string (struct linereader *lr, > struct localedef_t *locale, > const struct repertoire_t *repertoire, > int verbose); > +static bool utf8_decode (struct linereader *lr, uint8_t ch1, uint32_t *wch); OK. > > > struct linereader * > @@ -327,6 +328,17 @@ lr_token (struct linereader *lr, const struct charmap_t *charmap, > } > lr_ungetn (lr, 2); > break; > + > + case 0x80 ... 0xff: /* UTF-8 sequence. */ > + uint32_t wch; > + if (!utf8_decode (lr, ch, &wch)) > + { > + lr->token.tok = tok_error; > + return &lr->token; > + } > + lr->token.tok = tok_ucs4; > + lr->token.val.ucs4 = wch; > + return &lr->token; OK. > } > > return get_ident (lr); > @@ -673,6 +685,87 @@ translate_unicode_codepoint (struct localedef_t *locale, > return false; > } > > +/* Returns true if ch is not EOF (that is, non-negative) and a valid > + UTF-8 trailing byte. */ > +static bool > +utf8_valid_trailing (int ch) > +{ > + return ch >= 0 && (ch & 0xc0) == 0x80; > +} OK. > + > +/* Reports an error for a broken UTF-8 sequence. CH2 to CH4 may be > + EOF. Always returns false. */ > +static bool > +utf8_sequence_error (struct linereader *lr, uint8_t ch1, int ch2, int ch3, > + int ch4) > +{ > + char buf[30]; > + > + if (ch2 < 0) > + snprintf (buf, sizeof (buf), "0x%02x", ch1); > + else if (ch3 < 0) > + snprintf (buf, sizeof (buf), "0x%02x 0x%02x", ch1, ch2); > + else if (ch4 < 0) > + snprintf (buf, sizeof (buf), "0x%02x 0x%02x 0x%02x", ch1, ch2, ch3); > + else > + snprintf (buf, sizeof (buf), "0x%02x 0x%02x 0x%02x 0x%02x", > + ch1, ch2, ch3, ch4); > + > + lr_error (lr, _("invalid UTF-8 sequence %s"), buf); > + return false; OK. > +} > + > +/* Reads a UTF-8 sequence from LR, with the leading byte CH1, and > + stores the decoded codepoint in *WCH. Returns false on failure and > + reports an error. */ > +static bool > +utf8_decode (struct linereader *lr, uint8_t ch1, uint32_t *wch) > +{ > + /* See RFC 3629 section 4 and __gconv_transform_utf8_internal. */ > + if (ch1 < 0xc2) > + return utf8_sequence_error (lr, ch1, -1, -1, -1); OK. While ASCII goes to 7F the standard says ch2 would be > c2, so cover the whole range here. > + > + int ch2 = lr_getc (lr); > + if (!utf8_valid_trailing (ch2)) > + return utf8_sequence_error (lr, ch1, ch2, -1, -1); > + > + if (ch1 <= 0xdf) > + { > + uint32_t result = ((ch1 & 0x1f) << 6) | (ch2 & 0x3f); > + if (result < 0x80) > + return utf8_sequence_error (lr, ch1, ch2, -1, -1); > + *wch = result; > + return true; > + } OK. > + > + int ch3 = lr_getc (lr); > + if (!utf8_valid_trailing (ch3) || ch1 < 0xe0) > + return utf8_sequence_error (lr, ch1, ch2, ch3, -1); > + > + if (ch1 <= 0xef) > + { > + uint32_t result = (((ch1 & 0x0f) << 12) > + | ((ch2 & 0x3f) << 6) > + | (ch3 & 0x3f)); > + if (result < 0x800) > + return utf8_sequence_error (lr, ch1, ch2, ch3, -1); > + *wch = result; > + return true; > + } > + > + int ch4 = lr_getc (lr); > + if (!utf8_valid_trailing (ch4) || ch1 < 0xf0 || ch1 > 0xf4) > + return utf8_sequence_error (lr, ch1, ch2, ch3, ch4); > + > + uint32_t result = (((ch1 & 0x07) << 18) > + | ((ch2 & 0x3f) << 12) > + | ((ch3 & 0x3f) << 6) > + | (ch4 & 0x3f)); > + if (result < 0x10000) > + return utf8_sequence_error (lr, ch1, ch2, ch3, ch4); > + *wch = result; > + return true; > +} OK. > > static struct token * > get_string (struct linereader *lr, const struct charmap_t *charmap, > @@ -696,7 +789,11 @@ get_string (struct linereader *lr, const struct charmap_t *charmap, > > buf2 = NULL; > while ((ch = lr_getc (lr)) != '"' && ch != '\n' && ch != EOF) > - addc (&lrb, ch); > + { > + if (ch >= 0x80) > + lr_error (lr, _("illegal 8-bit character in untranslated string")); > + addc (&lrb, ch); > + } OK. > > /* Catch errors with trailing escape character. */ > if (lrb.act > 0 && lrb.buf[lrb.act - 1] == lr->escape_char > @@ -730,24 +827,49 @@ get_string (struct linereader *lr, const struct charmap_t *charmap, > > if (ch != '<') > { > - /* The standards leave it up to the implementation to decide > - what to do with character which stand for themself. We > - could jump through hoops to find out the value relative to > - the charmap and the repertoire map, but instead we leave > - it up to the locale definition author to write a better > - definition. We assume here that every character which > - stands for itself is encoded using ISO 8859-1. Using the > - escape character is allowed. */ > + /* The standards leave it up to the implementation to > + decide what to do with characters which stand for > + themselves. This implementation treats the input > + file as encoded in UTF-8. */ OK. This is the right direction. > if (ch == lr->escape_char) > { > ch = lr_getc (lr); > + if (ch >= 0x80) > + { > + lr_error (lr, _("illegal 8-bit escape sequence")); > + illegal_string = true; > + break; > + } OK. > if (ch == '\n' || ch == EOF) > break; > + addc (&lrb, ch); > + wch = ch; > + } > + else if (ch < 0x80) > + { > + wch = ch; > + addc (&lrb, ch); > + } > + else /* UTF-8 sequence. */ > + { > + if (!utf8_decode (lr, ch, &wch)) > + { > + illegal_string = true; > + break; > + } > + if (!translate_unicode_codepoint (locale, charmap, > + repertoire, wch, &lrb)) > + { > + /* Ignore the rest of the string. Callers may > + skip this string because it cannot be encoded > + in the output character set. */ > + illegal_string = true; > + continue; > + } > } > > - addc (&lrb, ch); > if (return_widestr) > - ADDWC ((uint32_t) ch); > + ADDWC (wch); > > continue; > } -- Cheers, Carlos.