From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <carlos@redhat.com>
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.133.124])
 by sourceware.org (Postfix) with ESMTPS id 343D03853556
 for <libc-alpha@sourceware.org>; Mon,  4 Jul 2022 19:54:21 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 343D03853556
Received: from mail-io1-f71.google.com (mail-io1-f71.google.com
 [209.85.166.71]) by relay.mimecast.com with ESMTP with STARTTLS
 (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id
 us-mta-196-e-w46o2zN9OZVDELbFZhIA-1; Mon, 04 Jul 2022 15:54:17 -0400
X-MC-Unique: e-w46o2zN9OZVDELbFZhIA-1
Received: by mail-io1-f71.google.com with SMTP id
 h73-20020a6bb74c000000b0067275998ba8so5844633iof.2
 for <libc-alpha@sourceware.org>; Mon, 04 Jul 2022 12:54:17 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20210112;
 h=x-gm-message-state:message-id:date:mime-version:user-agent:subject
 :content-language:to:references:from:organization:in-reply-to
 :content-transfer-encoding;
 bh=te2w5a29NYt4hed2BG+Z7GIz23K4wZKjqThDsZf4tr0=;
 b=oOx+5TLSXkhfDiKpUIT8/YPxAdRUoaBAaV77jwI0agqme4SZPpye4LI6Vj89eb5MKI
 7C2ObhpCa/EGcrfksLyBy6EpjAMKldckefZPglwc2MupLAkQfFGg99HFhePlkfRAG4Ki
 F4VFocQxQcb/e4sAQLQFf5eW1Q29U9/K0HNCAt8BwyCgCGp7mkj6v6W/SZVT2D7BZkJT
 hJHSEl1/ph5y7TvP/3k6U0TPPsLcdYwxV7tqk0/jssOkWthXVbNSTBj99B5Jdn+TC3F9
 EnLmQ2cIFUCkOX8Gq6NIb0iKkRBY0jUr4GTxyKk7ZZeid5zzEGHucqJXD7th2H4nADd0
 84uQ==
X-Gm-Message-State: AJIora+vx1Lsyb0bQ+Jf0qJbwlQFgfH7htHdAg/ISAg/3+rtmRiylPL1
 7IUTchdaJ4MHwu1U/rpXKTJmss/06GFyxZW4Avxpr8i+/x1H0EkzLXsShuWl6r3ZVucZUemkVfq
 ueRiMCw9Tbv+BBhUvnshZ
X-Received: by 2002:a05:6638:1493:b0:33e:c04e:56e4 with SMTP id
 j19-20020a056638149300b0033ec04e56e4mr7344744jak.282.1656964456908; 
 Mon, 04 Jul 2022 12:54:16 -0700 (PDT)
X-Google-Smtp-Source: AGRyM1uLTTMSKrqngx+19FDELQa0yHkshujteGGF2QPeBUWqPxqb2dfSbr7ar+N6kIwy/p5fw11gWQ==
X-Received: by 2002:a05:6638:1493:b0:33e:c04e:56e4 with SMTP id
 j19-20020a056638149300b0033ec04e56e4mr7344737jak.282.1656964456678; 
 Mon, 04 Jul 2022 12:54:16 -0700 (PDT)
Received: from [192.168.0.241] (135-23-175-80.cpe.pppoe.ca. [135.23.175.80])
 by smtp.gmail.com with ESMTPSA id
 k30-20020a02335e000000b00339e6f88235sm13877873jak.61.2022.07.04.12.54.16
 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128);
 Mon, 04 Jul 2022 12:54:16 -0700 (PDT)
Message-ID: <b7235300-0704-065f-689b-8ac074230522@redhat.com>
Date: Mon, 4 Jul 2022 15:54:15 -0400
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
 Thunderbird/91.9.0
Subject: Re: [PATCH 3/5] locale: Introduce translate_unicode_codepoint into
 linereader.c
To: Florian Weimer <fweimer@redhat.com>, libc-alpha@sourceware.org
References: <cover.1652994079.git.fweimer@redhat.com>
 <a89cee054d28d43cf8f7e5f171e876326e4af96e.1652994079.git.fweimer@redhat.com>
From: Carlos O'Donell <carlos@redhat.com>
Organization: Red Hat
In-Reply-To: <a89cee054d28d43cf8f7e5f171e876326e4af96e.1652994079.git.fweimer@redhat.com>
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
Content-Language: en-US
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
X-Spam-Status: No, score=-16.2 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH,
 DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0,
 NICE_REPLY_A, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_NONE, TXREP,
 T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Mon, 04 Jul 2022 19:54:22 -0000

On 5/19/22 17:06, Florian Weimer via Libc-alpha wrote:
> This will permit reusing the Unicode character processing for
> different character encodings, not just the current <U...> encoding.

LGTM. Straightforward refactor.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>

> ---
>  locale/programs/linereader.c | 167 ++++++++++++++++++-----------------
>  1 file changed, 85 insertions(+), 82 deletions(-)
> 
> diff --git a/locale/programs/linereader.c b/locale/programs/linereader.c
> index d5367e0a1e..f7292f0102 100644
> --- a/locale/programs/linereader.c
> +++ b/locale/programs/linereader.c
> @@ -596,6 +596,83 @@ get_ident (struct linereader *lr)
>    return &lr->token;
>  }
>  
> +/* Process a decoded Unicode codepoint WCH in a string, placing the
> +   multibyte sequence into LRB.  Return false if the character is not
> +   found in CHARMAP/REPERTOIRE.  */
> +static bool
> +translate_unicode_codepoint (struct localedef_t *locale,
> +			     const struct charmap_t *charmap,
> +			     const struct repertoire_t *repertoire,
> +			     uint32_t wch, struct lr_buffer *lrb)
> +{
> +  /* See whether the charmap contains the Uxxxxxxxx names.  */
> +  char utmp[10];
> +  snprintf (utmp, sizeof (utmp), "U%08X", wch);
> +  struct charseq *seq = charmap_find_value (charmap, utmp, 9);
> +
> +  if (seq == NULL)
> +    {
> +      /* No, this isn't the case.  Now determine from
> +	 the repertoire the name of the character and
> +	 find it in the charmap.  */
> +      if (repertoire != NULL)
> +	{
> +	  const char *symbol = repertoire_find_symbol (repertoire, wch);
> +	  if (symbol != NULL)
> +	    seq = charmap_find_value (charmap, symbol, strlen (symbol));
> +	}
> +
> +      if (seq == NULL)
> +	{
> +#ifndef NO_TRANSLITERATION
> +	  /* Transliterate if possible.  */
> +	  if (locale != NULL)
> +	    {
> +	      if ((locale->avail & CTYPE_LOCALE) == 0)
> +		{
> +		  /* Load the CTYPE data now.  */
> +		  int old_needed = locale->needed;
> +
> +		  locale->needed = 0;
> +		  locale = load_locale (LC_CTYPE, locale->name,
> +					locale->repertoire_name,
> +					charmap, locale);
> +		  locale->needed = old_needed;
> +		}
> +
> +	      uint32_t *translit;
> +	      if ((locale->avail & CTYPE_LOCALE) != 0
> +		  && ((translit = find_translit (locale, charmap, wch))
> +		      != NULL))
> +		/* The CTYPE data contains a matching
> +		   transliteration.  */
> +		{
> +		  for (int i = 0; translit[i] != 0; ++i)
> +		    {
> +		      snprintf (utmp, sizeof (utmp), "U%08X", translit[i]);
> +		      seq = charmap_find_value (charmap, utmp, 9);
> +		      assert (seq != NULL);
> +		      adds (lrb, seq->bytes, seq->nbytes);
> +		    }
> +		  return true;
> +		}
> +	    }
> +#endif	/* NO_TRANSLITERATION */
> +
> +	  /* Not a known name.  */
> +	  return false;
> +	}
> +    }
> +
> +  if (seq != NULL)
> +    {
> +      adds (lrb, seq->bytes, seq->nbytes);
> +      return true;
> +    }
> +  else
> +    return false;
> +}
> +
>  
>  static struct token *
>  get_string (struct linereader *lr, const struct charmap_t *charmap,
> @@ -635,7 +712,7 @@ get_string (struct linereader *lr, const struct charmap_t *charmap,
>      }
>    else
>      {
> -      int illegal_string = 0;
> +      bool illegal_string = false;
>        size_t buf2act = 0;
>        size_t buf2max = 56 * sizeof (uint32_t);
>        int ch;
> @@ -695,7 +772,7 @@ get_string (struct linereader *lr, const struct charmap_t *charmap,
>  	    {
>  	      /* <> is no correct name.  Ignore it and also signal an
>  		 error.  */
> -	      illegal_string = 1;
> +	      illegal_string = true;
>  	      continue;
>  	    }
>  
> @@ -709,8 +786,6 @@ get_string (struct linereader *lr, const struct charmap_t *charmap,
>  
>  	      if (cp == &lrb.buf[lrb.act])
>  		{
> -		  char utmp[10];
> -
>  		  /* Yes, it is.  */
>  		  addc (&lrb, '\0');
>  		  wch = strtoul (lrb.buf + startidx + 1, NULL, 16);
> @@ -721,81 +796,9 @@ get_string (struct linereader *lr, const struct charmap_t *charmap,
>  		  if (return_widestr)
>  		    ADDWC (wch);
>  
> -		  /* See whether the charmap contains the Uxxxxxxxx names.  */
> -		  snprintf (utmp, sizeof (utmp), "U%08X", wch);
> -		  seq = charmap_find_value (charmap, utmp, 9);
> -
> -		  if (seq == NULL)
> -		    {
> -		     /* No, this isn't the case.  Now determine from
> -			the repertoire the name of the character and
> -			find it in the charmap.  */
> -		      if (repertoire != NULL)
> -			{
> -			  const char *symbol;
> -
> -			  symbol = repertoire_find_symbol (repertoire, wch);
> -
> -			  if (symbol != NULL)
> -			    seq = charmap_find_value (charmap, symbol,
> -						      strlen (symbol));
> -			}
> -
> -		      if (seq == NULL)
> -			{
> -#ifndef NO_TRANSLITERATION
> -			  /* Transliterate if possible.  */
> -			  if (locale != NULL)
> -			    {
> -			      uint32_t *translit;
> -
> -			      if ((locale->avail & CTYPE_LOCALE) == 0)
> -				{
> -				  /* Load the CTYPE data now.  */
> -				  int old_needed = locale->needed;
> -
> -				  locale->needed = 0;
> -				  locale = load_locale (LC_CTYPE,
> -							locale->name,
> -							locale->repertoire_name,
> -							charmap, locale);
> -				  locale->needed = old_needed;
> -				}
> -
> -			      if ((locale->avail & CTYPE_LOCALE) != 0
> -				  && ((translit = find_translit (locale,
> -								 charmap, wch))
> -				      != NULL))
> -				/* The CTYPE data contains a matching
> -				   transliteration.  */
> -				{
> -				  int i;
> -
> -				  for (i = 0; translit[i] != 0; ++i)
> -				    {
> -				      char utmp[10];
> -
> -				      snprintf (utmp, sizeof (utmp), "U%08X",
> -						translit[i]);
> -				      seq = charmap_find_value (charmap, utmp,
> -								9);
> -				      assert (seq != NULL);
> -				      adds (&lrb, seq->bytes, seq->nbytes);
> -				    }
> -
> -				  continue;
> -				}
> -			    }
> -#endif	/* NO_TRANSLITERATION */
> -
> -			  /* Not a known name.  */
> -			  illegal_string = 1;
> -			}
> -		    }
> -
> -		  if (seq != NULL)
> -		    adds (&lrb, seq->bytes, seq->nbytes);
> -
> +		  if (!translate_unicode_codepoint (locale, charmap,
> +						    repertoire, wch, &lrb))
> +		    illegal_string = true;

OK. Refactor.
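
For readers following along, the lookup order that the new helper keeps can be sketched roughly as below. This is only an illustration under stated assumptions: the tables, the struct names, and the functions find_value_sketch, find_symbol_sketch and translate_sketch are hypothetical stand-ins, not the localedef charmap/repertoire API, and the transliteration fallback is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a charmap: symbolic name -> byte sequence.  */
struct entry_sketch { const char *name; const char *bytes; };
static const struct entry_sketch charmap_sketch[] = {
  { "U00000041", "A" },       /* Listed under its Uxxxxxxxx name.  */
  { "EURO-SIGN", "\xA4" },    /* Listed only under a symbolic name.  */
};

/* Hypothetical stand-in for a repertoire map: code point -> symbol.  */
struct rep_sketch { uint32_t wch; const char *symbol; };
static const struct rep_sketch repertoire_sketch[] = {
  { 0x20AC, "EURO-SIGN" },
};

/* Return the bytes for NAME, or NULL if the charmap has no such name.  */
static const char *
find_value_sketch (const char *name)
{
  for (size_t i = 0; i < sizeof charmap_sketch / sizeof charmap_sketch[0]; ++i)
    if (strcmp (charmap_sketch[i].name, name) == 0)
      return charmap_sketch[i].bytes;
  return NULL;
}

/* Return the symbolic name for WCH, or NULL if it is not in the repertoire.  */
static const char *
find_symbol_sketch (uint32_t wch)
{
  for (size_t i = 0;
       i < sizeof repertoire_sketch / sizeof repertoire_sketch[0]; ++i)
    if (repertoire_sketch[i].wch == wch)
      return repertoire_sketch[i].symbol;
  return NULL;
}

/* Try the Uxxxxxxxx name first, then the repertoire symbol.  Copy the
   bytes into OUT and return true on success; return false if neither
   lookup finds the character.  */
static bool
translate_sketch (uint32_t wch, char *out, size_t outlen)
{
  assert (out != NULL && outlen > 0);

  /* 'U' plus eight hex digits plus the terminating NUL is 10 bytes.  */
  char utmp[10];
  snprintf (utmp, sizeof (utmp), "U%08X", (unsigned int) wch);
  const char *bytes = find_value_sketch (utmp);

  if (bytes == NULL)
    {
      const char *symbol = find_symbol_sketch (wch);
      if (symbol != NULL)
        bytes = find_value_sketch (symbol);
    }

  if (bytes == NULL)
    return false;

  snprintf (out, outlen, "%s", bytes);
  return true;
}
```

In this sketch a false return plays the same role as in the patch: the caller decides what to do with the failure (get_string sets illegal_string), and the helper itself reports no error.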

>  		  continue;
>  		}
>  	    }
> @@ -812,7 +815,7 @@ get_string (struct linereader *lr, const struct charmap_t *charmap,
>  	      /* This name is not in the charmap.  */
>  	      lr_error (lr, _("symbol `%.*s' not in charmap"),
>  			(int) (lrb.act - startidx), &lrb.buf[startidx]);
> -	      illegal_string = 1;
> +	      illegal_string = true;
>  	    }
>  
>  	  if (return_widestr)
> @@ -833,7 +836,7 @@ get_string (struct linereader *lr, const struct charmap_t *charmap,
>  		  /* This name is not in the repertoire map.  */
>  		  lr_error (lr, _("symbol `%.*s' not in repertoire map"),
>  			    (int) (lrb.act - startidx), &lrb.buf[startidx]);
> -		  illegal_string = 1;
> +		  illegal_string = true;
>  		}
>  	      else
>  		ADDWC (wch);
> @@ -850,7 +853,7 @@ get_string (struct linereader *lr, const struct charmap_t *charmap,
>        if (ch == '\n' || ch == EOF)
>  	{
>  	  lr_error (lr, _("unterminated string"));
> -	  illegal_string = 1;
> +	  illegal_string = true;
>  	}
>  
>        if (illegal_string)

OK.

-- 
Cheers,
Carlos.