From: Carlos O'Donell
Organization: Red Hat
Date: Mon, 4 Jul 2022 15:54:15 -0400
Subject: Re: [PATCH 3/5] locale: Introduce translate_unicode_codepoint into linereader.c
To: Florian Weimer, libc-alpha@sourceware.org

On 5/19/22 17:06, Florian Weimer via Libc-alpha wrote:
> This will permit reusing the Unicode character processing for
> different character encodings, not just the current encoding.

LGTM. Straightforward refactor.

Reviewed-by: Carlos O'Donell
Tested-by: Carlos O'Donell

> ---
>  locale/programs/linereader.c | 167 ++++++++++++++++++-----------------
>  1 file changed, 85 insertions(+), 82 deletions(-)
>
> diff --git a/locale/programs/linereader.c b/locale/programs/linereader.c
> index d5367e0a1e..f7292f0102 100644
> --- a/locale/programs/linereader.c
> +++ b/locale/programs/linereader.c
> @@ -596,6 +596,83 @@ get_ident (struct linereader *lr)
>    return &lr->token;
>  }
>
> +/* Process a decoded Unicode codepoint WCH in a string, placing the
> +   multibyte sequence into LRB. Return false if the character is not
> +   found in CHARMAP/REPERTOIRE. */
> +static bool
> +translate_unicode_codepoint (struct localedef_t *locale,
> +                             const struct charmap_t *charmap,
> +                             const struct repertoire_t *repertoire,
> +                             uint32_t wch, struct lr_buffer *lrb)
> +{
> +  /* See whether the charmap contains the Uxxxxxxxx names. */
> +  char utmp[10];
> +  snprintf (utmp, sizeof (utmp), "U%08X", wch);
> +  struct charseq *seq = charmap_find_value (charmap, utmp, 9);
> +
> +  if (seq == NULL)
> +    {
> +      /* No, this isn't the case. Now determine from
> +         the repertoire the name of the character and
> +         find it in the charmap. */
> +      if (repertoire != NULL)
> +        {
> +          const char *symbol = repertoire_find_symbol (repertoire, wch);
> +          if (symbol != NULL)
> +            seq = charmap_find_value (charmap, symbol, strlen (symbol));
> +        }
> +
> +      if (seq == NULL)
> +        {
> +#ifndef NO_TRANSLITERATION
> +          /* Transliterate if possible. */
> +          if (locale != NULL)
> +            {
> +              if ((locale->avail & CTYPE_LOCALE) == 0)
> +                {
> +                  /* Load the CTYPE data now. */
> +                  int old_needed = locale->needed;
> +
> +                  locale->needed = 0;
> +                  locale = load_locale (LC_CTYPE, locale->name,
> +                                        locale->repertoire_name,
> +                                        charmap, locale);
> +                  locale->needed = old_needed;
> +                }
> +
> +              uint32_t *translit;
> +              if ((locale->avail & CTYPE_LOCALE) != 0
> +                  && ((translit = find_translit (locale, charmap, wch))
> +                      != NULL))
> +                /* The CTYPE data contains a matching
> +                   transliteration. */
> +                {
> +                  for (int i = 0; translit[i] != 0; ++i)
> +                    {
> +                      snprintf (utmp, sizeof (utmp), "U%08X", translit[i]);
> +                      seq = charmap_find_value (charmap, utmp, 9);
> +                      assert (seq != NULL);
> +                      adds (lrb, seq->bytes, seq->nbytes);
> +                    }
> +                  return true;
> +                }
> +            }
> +#endif /* NO_TRANSLITERATION */
> +
> +          /* Not a known name. */
> +          return false;
> +        }
> +    }
> +
> +  if (seq != NULL)
> +    {
> +      adds (lrb, seq->bytes, seq->nbytes);
> +      return true;
> +    }
> +  else
> +    return false;
> +}
> +
>
>  static struct token *
>  get_string (struct linereader *lr, const struct charmap_t *charmap,
> @@ -635,7 +712,7 @@ get_string (struct linereader *lr, const struct charmap_t *charmap,
>      }
>    else
>      {
> -      int illegal_string = 0;
> +      bool illegal_string = false;
>        size_t buf2act = 0;
>        size_t buf2max = 56 * sizeof (uint32_t);
>        int ch;
> @@ -695,7 +772,7 @@ get_string (struct linereader *lr, const struct charmap_t *charmap,
>                  {
>                    /* <> is no correct name. Ignore it and also signal an
>                       error. */
> -                  illegal_string = 1;
> +                  illegal_string = true;
>                    continue;
>                  }
>
> @@ -709,8 +786,6 @@ get_string (struct linereader *lr, const struct charmap_t *charmap,
>
>                if (cp == &lrb.buf[lrb.act])
>                  {
> -                  char utmp[10];
> -
>                    /* Yes, it is. */
>                    addc (&lrb, '\0');
>                    wch = strtoul (lrb.buf + startidx + 1, NULL, 16);
> @@ -721,81 +796,9 @@ get_string (struct linereader *lr, const struct charmap_t *charmap,
>                    if (return_widestr)
>                      ADDWC (wch);
>
> -                  /* See whether the charmap contains the Uxxxxxxxx names. */
> -                  snprintf (utmp, sizeof (utmp), "U%08X", wch);
> -                  seq = charmap_find_value (charmap, utmp, 9);
> -
> -                  if (seq == NULL)
> -                    {
> -                      /* No, this isn't the case. Now determine from
> -                         the repertoire the name of the character and
> -                         find it in the charmap. */
> -                      if (repertoire != NULL)
> -                        {
> -                          const char *symbol;
> -
> -                          symbol = repertoire_find_symbol (repertoire, wch);
> -
> -                          if (symbol != NULL)
> -                            seq = charmap_find_value (charmap, symbol,
> -                                                      strlen (symbol));
> -                        }
> -
> -                      if (seq == NULL)
> -                        {
> -#ifndef NO_TRANSLITERATION
> -                          /* Transliterate if possible. */
> -                          if (locale != NULL)
> -                            {
> -                              uint32_t *translit;
> -
> -                              if ((locale->avail & CTYPE_LOCALE) == 0)
> -                                {
> -                                  /* Load the CTYPE data now. */
> -                                  int old_needed = locale->needed;
> -
> -                                  locale->needed = 0;
> -                                  locale = load_locale (LC_CTYPE,
> -                                                        locale->name,
> -                                                        locale->repertoire_name,
> -                                                        charmap, locale);
> -                                  locale->needed = old_needed;
> -                                }
> -
> -                              if ((locale->avail & CTYPE_LOCALE) != 0
> -                                  && ((translit = find_translit (locale,
> -                                                                 charmap, wch))
> -                                      != NULL))
> -                                /* The CTYPE data contains a matching
> -                                   transliteration. */
> -                                {
> -                                  int i;
> -
> -                                  for (i = 0; translit[i] != 0; ++i)
> -                                    {
> -                                      char utmp[10];
> -
> -                                      snprintf (utmp, sizeof (utmp), "U%08X",
> -                                                translit[i]);
> -                                      seq = charmap_find_value (charmap, utmp,
> -                                                                9);
> -                                      assert (seq != NULL);
> -                                      adds (&lrb, seq->bytes, seq->nbytes);
> -                                    }
> -
> -                                  continue;
> -                                }
> -                            }
> -#endif /* NO_TRANSLITERATION */
> -
> -                          /* Not a known name. */
> -                          illegal_string = 1;
> -                        }
> -                    }
> -
> -                  if (seq != NULL)
> -                    adds (&lrb, seq->bytes, seq->nbytes);
> -
> +                  if (!translate_unicode_codepoint (locale, charmap,
> +                                                    repertoire, wch, &lrb))
> +                    illegal_string = true;

OK. Refactor.

>                    continue;
>                  }
>              }
> @@ -812,7 +815,7 @@ get_string (struct linereader *lr, const struct charmap_t *charmap,
>                    /* This name is not in the charmap. */
>                    lr_error (lr, _("symbol `%.*s' not in charmap"),
>                              (int) (lrb.act - startidx), &lrb.buf[startidx]);
> -                  illegal_string = 1;
> +                  illegal_string = true;
>                  }
>
>                if (return_widestr)
> @@ -833,7 +836,7 @@ get_string (struct linereader *lr, const struct charmap_t *charmap,
>                    /* This name is not in the repertoire map. */
>                    lr_error (lr, _("symbol `%.*s' not in repertoire map"),
>                              (int) (lrb.act - startidx), &lrb.buf[startidx]);
> -                  illegal_string = 1;
> +                  illegal_string = true;
>                  }
>                else
>                  ADDWC (wch);
> @@ -850,7 +853,7 @@ get_string (struct linereader *lr, const struct charmap_t *charmap,
>        if (ch == '\n' || ch == EOF)
>          {
>            lr_error (lr, _("unterminated string"));
> -          illegal_string = 1;
> +          illegal_string = true;
>          }
>
>        if (illegal_string)

OK.

-- 
Cheers,
Carlos.
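
For anyone reading the thread later, the lookup order that the new helper encapsulates (Uxxxxxxxx name first, then the repertoire symbol, then a transliteration) can be sketched in standalone form. This is only an illustration under stated assumptions, not the glibc code: every toy_* name and table below is hypothetical, the tables are tiny fixed arrays, transliterations are limited to a single replacement codepoint, and a missing transliteration target is reported as a failure instead of tripping an assert.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define N(a) (sizeof (a) / sizeof ((a)[0]))

/* Hypothetical stand-in for the charmap: symbolic name -> byte sequence.  */
struct toy_entry { const char *name; const char *bytes; };
static const struct toy_entry toy_charmap[] = {
  { "U00000041", "A" },     /* reachable directly by its Uxxxxxxxx name */
  { "euro-sign", "\xA4" },  /* reachable only through the repertoire */
  { "U0000003F", "?" },     /* target of the transliteration below */
};

/* Hypothetical stand-in for the repertoire: codepoint -> symbolic name.  */
struct toy_symbol { uint32_t wch; const char *symbol; };
static const struct toy_symbol toy_repertoire[] = { { 0x20AC, "euro-sign" } };

/* Hypothetical single-codepoint transliteration table.  */
struct toy_translit { uint32_t from; uint32_t to; };
static const struct toy_translit toy_translits[] = { { 0x2026, 0x3F } };

static const char *
toy_charmap_find (const char *name)
{
  for (size_t i = 0; i < N (toy_charmap); ++i)
    if (strcmp (toy_charmap[i].name, name) == 0)
      return toy_charmap[i].bytes;
  return NULL;
}

/* Copy the byte sequence for WCH into OUT.  Return false if none of the
   three lookup steps finds one.  */
static bool
toy_translate (uint32_t wch, char *out, size_t outlen)
{
  char utmp[10];  /* 'U' + 8 hex digits + NUL */
  const char *bytes;

  /* Step 1: try the Uxxxxxxxx name.  */
  snprintf (utmp, sizeof utmp, "U%08X", (unsigned int) wch);
  bytes = toy_charmap_find (utmp);

  /* Step 2: map the codepoint to a symbol via the repertoire.  */
  for (size_t i = 0; bytes == NULL && i < N (toy_repertoire); ++i)
    if (toy_repertoire[i].wch == wch)
      bytes = toy_charmap_find (toy_repertoire[i].symbol);

  /* Step 3: fall back to a transliteration.  */
  for (size_t i = 0; bytes == NULL && i < N (toy_translits); ++i)
    if (toy_translits[i].from == wch)
      {
        snprintf (utmp, sizeof utmp, "U%08X",
                  (unsigned int) toy_translits[i].to);
        bytes = toy_charmap_find (utmp);
      }

  if (bytes == NULL)
    return false;  /* Not a known name.  */

  snprintf (out, outlen, "%s", bytes);
  return true;
}
```

A caller would treat a false return the same way get_string does in the patch, by flagging the string as illegal and carrying on with the next character.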