From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <adhemerval.zanella@linaro.org>
Message-ID: <93566715-f32e-713b-197e-c33332beb37e@linaro.org>
Date: Mon, 21 Mar 2022 09:24:35 -0300
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
 Thunderbird/91.7.0
Subject: Re: [PATCH v5 4/4] iconv: Add UTF-7-IMAP variant in utf-7.c
Content-Language: en-US
To: Max Gautier <mg@max.gautier.name>, libc-alpha@sourceware.org
References: <87blcw9ptq.fsf@oldenburg.str.redhat.com>
 <20211209093152.313872-1-mg@max.gautier.name>
 <20211209093152.313872-5-mg@max.gautier.name> <YjdZmCDxCmsDwy5S@ol-mgautier>
From: Adhemerval Zanella <adhemerval.zanella@linaro.org>
In-Reply-To: <YjdZmCDxCmsDwy5S@ol-mgautier>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>


On 20/03/2022 13:43, Max Gautier via Libc-alpha wrote:
> UTF-7-IMAP differs from UTF-7 in the following ways (see RFC 3501[1]
> for reference):
> 
> - The shift character is '&' instead of '+'
> - There are no "optional direct characters" and the "direct characters"
>   set is different
> - The BASE64 alphabet uses ',' instead of '/'
> - There is no implicit shift back to US-ASCII from BASE64, all BASE64
>   sequences MUST be terminated with '-'
> 
> [1]: https://datatracker.ietf.org/doc/html/rfc3501#section-5.1.3
> 
> Signed-off-by: Max Gautier <mg@max.gautier.name>

LGTM, thanks.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
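
For readers following the thread, here is a minimal standalone sketch of
the encoding rules listed in the commit message above. It is not the
code from utf-7.c and does not use the gconv framework. The names
imap_utf7_encode, mb64 and PUT are hypothetical, and the input is assumed
to be already available as UTF-16 code units (surrogates are passed
through as-is, with no validation).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Modified BASE64 alphabet (RFC 3501, section 5.1.3): ',' replaces '/'.  */
static const char mb64[] =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+,";

/* Append C to OUT, always leaving room for the trailing NUL.  */
#define PUT(c) \
  do { if (o + 1 >= outsz) return -1; out[o++] = (char) (c); } while (0)

/* Hypothetical helper: encode N UTF-16 code units from IN into OUT
   (capacity OUTSZ bytes, NUL included).  Returns the number of bytes
   written, not counting the NUL, or -1 if OUT is too small.  */
static long
imap_utf7_encode (const uint16_t *in, size_t n, char *out, size_t outsz)
{
  size_t o = 0;
  uint32_t bits = 0;	/* Pending bits, right-aligned.  */
  int nbits = 0;	/* Number of pending bits (always < 6 between units).  */
  int shifted = 0;	/* Nonzero while inside a BASE64 run.  */

  assert (out != NULL && outsz > 0);

  for (size_t i = 0; i < n; i++)
    {
      uint16_t u = in[i];
      if (u >= 0x20 && u <= 0x7e)
	{
	  /* Printable US-ASCII.  Leaving BASE64 always needs an
	     explicit '-', there is no implicit shift back.  */
	  if (shifted)
	    {
	      if (nbits > 0)
		PUT (mb64[(bits << (6 - nbits)) & 0x3f]);
	      PUT ('-');
	      nbits = 0;
	      shifted = 0;
	    }
	  PUT (u);
	  if (u == '&')		/* A literal '&' is written as "&-".  */
	    PUT ('-');
	}
      else
	{
	  if (!shifted)
	    {
	      PUT ('&');	/* Shift character is '&', not '+'.  */
	      shifted = 1;
	      bits = 0;
	    }
	  bits = (bits << 16) | u;
	  nbits += 16;
	  while (nbits >= 6)
	    {
	      nbits -= 6;
	      PUT (mb64[(bits >> nbits) & 0x3f]);
	    }
	  bits &= (1u << nbits) - 1;
	}
    }

  /* Empty the bit buffer and return to the initial state at the end.  */
  if (shifted)
    {
      if (nbits > 0)
	PUT (mb64[(bits << (6 - nbits)) & 0x3f]);
      PUT ('-');
    }
  out[o] = '\0';
  return (long) o;
}
```

Feeding it the code units 'A', U+2262, U+0391 gives "A&ImIDkQ-", which
matches the last line of the new testdata/UTF-7-IMAP file in this patch.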

> ---
>  iconvdata/TESTS                     |  1 +
>  iconvdata/gconv-modules             |  4 ++++
>  iconvdata/testdata/UTF-7-IMAP       |  1 +
>  iconvdata/testdata/UTF-7-IMAP..UTF8 | 32 +++++++++++++++++++++++++++++
>  iconvdata/utf-7.c                   | 30 +++++++++++++++++++++------
>  5 files changed, 62 insertions(+), 6 deletions(-)
>  create mode 100644 iconvdata/testdata/UTF-7-IMAP
>  create mode 100644 iconvdata/testdata/UTF-7-IMAP..UTF8
> 
> diff --git a/iconvdata/TESTS b/iconvdata/TESTS
> index a0157c3350..3cc043c21b 100644
> --- a/iconvdata/TESTS
> +++ b/iconvdata/TESTS
> @@ -94,6 +94,7 @@ EUC-TW			EUC-TW			Y	UTF8
>  GBK			GBK			Y	UTF8
>  BIG5HKSCS		BIG5HKSCS		Y	UTF8
>  UTF-7			UTF-7			N	UTF8
> +UTF-7-IMAP		UTF-7-IMAP		N	UTF8
>  IBM856			IBM856			N	UTF8
>  IBM922			IBM922			Y	UTF8
>  IBM930			IBM930			N	UTF8
> diff --git a/iconvdata/gconv-modules b/iconvdata/gconv-modules
> index 4acbba062f..d120699394 100644
> --- a/iconvdata/gconv-modules
> +++ b/iconvdata/gconv-modules
> @@ -113,3 +113,7 @@ module	INTERNAL		UTF-32BE//		UTF-32		1
>  alias	UTF7//			UTF-7//
>  module	UTF-7//			INTERNAL		UTF-7		1
>  module	INTERNAL		UTF-7//			UTF-7		1
> +
> +#	from			to			module		cost
> +module	UTF-7-IMAP//		INTERNAL		UTF-7		1
> +module	INTERNAL		UTF-7-IMAP//		UTF-7		1
> diff --git a/iconvdata/testdata/UTF-7-IMAP b/iconvdata/testdata/UTF-7-IMAP
> new file mode 100644
> index 0000000000..6b5dada63c
> --- /dev/null
> +++ b/iconvdata/testdata/UTF-7-IMAP
> @@ -0,0 +1 @@
> +&EqASGxItEps-       Amharic&AAoBDQ-esky      Czech&AAo-Dansk      Danish&AAo-English    English&AAo-Suomi      Finnish&AAo-Fran&AOc-ais   French&AAo-Deutsch    German&AAoDlQO7A7sDtwO9A7kDugOs-   Greek&AAoF4gXRBegF2QXq-      Hebrew&AAo-Italiano   Italian&AAo-Norsk      Norwegian&AAoEIARDBEEEQQQ6BDgEOQ-    Russian&AAo-Espa&APE-ol    Spanish&AAo-Svenska    Swedish&AAoOIA4yDikOMg5EDhcOIg-    Thai&AAo-T&APw-rk&AOc-e     Turkish&AAo-Ti&Hr8-ng Vi&Hsc-t Vietnamese&AApl5Wcsip4-     Japanese&AApOLWWH-       Chinese&AArVXK4A-       Korean&AAoACg-// Checking for correct handling of shift characters ('&-', '-') after base64 sequences&AArVXK4A-&-&AArVXK4A--&AAoACg-// Checking for correct handling of litteral '&-' and '-'&AAo----&-&--&AAoACg-// The last line of this file is missing the end-of-line terminator&AAo-// on purpose, in order to test that the conversion empties the bit buffer&AAo-// and shifts back to the initial state at the end of the conversion.&AAo-A&ImIDkQ-
> \ No newline at end of file
> diff --git a/iconvdata/testdata/UTF-7-IMAP..UTF8 b/iconvdata/testdata/UTF-7-IMAP..UTF8
> new file mode 100644
> index 0000000000..8b9add3670
> --- /dev/null
> +++ b/iconvdata/testdata/UTF-7-IMAP..UTF8
> @@ -0,0 +1,32 @@
> +አማርኛ       Amharic
> +česky      Czech
> +Dansk      Danish
> +English    English
> +Suomi      Finnish
> +Français   French
> +Deutsch    German
> +Ελληνικά   Greek
> +עברית      Hebrew
> +Italiano   Italian
> +Norsk      Norwegian
> +Русский    Russian
> +Español    Spanish
> +Svenska    Swedish
> +ภาษาไทย    Thai
> +Türkçe     Turkish
> +Tiếng Việt Vietnamese
> +日本語     Japanese
> +中文       Chinese
> +한글       Korean
> +
> +// Checking for correct handling of shift characters ('&', '-') after base64 sequences
> +한글&
> +한글-
> +
> +// Checking for correct handling of litteral '&' and '-'
> +---&&-
> +
> +// The last line of this file is missing the end-of-line terminator
> +// on purpose, in order to test that the conversion empties the bit buffer
> +// and shifts back to the initial state at the end of the conversion.
> +A≢Α
> \ No newline at end of file
> diff --git a/iconvdata/utf-7.c b/iconvdata/utf-7.c
> index b639d8ff3e..5c2e17e50c 100644
> --- a/iconvdata/utf-7.c
> +++ b/iconvdata/utf-7.c
> @@ -32,11 +32,13 @@
>  enum variant
>  {
>    UTF7,
> +  UTF_7_IMAP
>  };
>  
>  /* Must be in the same order as enum variant above.  */
>  static const char names[] =
>    "UTF-7//\0"
> +  "UTF-7-IMAP//\0"
>    "\0";
>  
>  static uint32_t
> @@ -44,6 +46,8 @@ shift_character (enum variant const var)
>  {
>    if (var == UTF7)
>      return '+';
> +  else if (var == UTF_7_IMAP)
> +    return '&';
>    else
>      abort ();
>  }
> @@ -58,6 +62,9 @@ between (uint32_t const ch,
>  /* The set of "direct characters":
>     FOR UTF-7
>     A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr
> +   FOR UTF-7-IMAP
> +   A-Z a-z 0-9 ' ( ) , - . / : ? space
> +   ! " # $ % + * ; < = > @ [ \ ] ^ _ ` { | } ~
>  */
>  
>  static bool
> @@ -71,6 +78,8 @@ isdirect (uint32_t ch, enum variant var)
>  	    || between (ch, ',', '/')
>  	    || ch == ':' || ch == '?'
>  	    || ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r');
> +  else if (var == UTF_7_IMAP)
> +    return (ch != '&' && between (ch, ' ', '~'));
>    abort ();
>  }
>  
> @@ -124,6 +133,8 @@ base64 (unsigned int i, enum variant var)
>      return '+';
>    else if (i == 63 && var == UTF7)
>      return '/';
> +  else if (i == 63 && var == UTF_7_IMAP)
> +    return ',';
>    else
>      abort ();
>  }
> @@ -308,7 +319,8 @@ gconv_end (struct __gconv_step *data)
>  	  i = ch - '0' + 52;						      \
>  	else if (ch == '+')						      \
>  	  i = 62;							      \
> -	else if (ch == '/')						      \
> +	else if ((var == UTF7 && ch == '/')                                   \
> +		  || (var == UTF_7_IMAP && ch == ','))			      \
>  	  i = 63;							      \
>  	else								      \
>  	  {								      \
> @@ -316,8 +328,10 @@ gconv_end (struct __gconv_step *data)
>  									      \
>  	    /* If accumulated data is nonzero, the input is invalid.  */      \
>  	    /* Also, partial UTF-16 characters are invalid.  */		      \
> -	    if (__builtin_expect (statep->__value.__wch != 0, 0)	      \
> -		|| __builtin_expect ((statep->__count >> 3) <= 26, 0))	      \
> +	    /* In IMAP variant, must be terminated by '-'.  */		      \
> +	    if (__glibc_unlikely (statep->__value.__wch != 0)		      \
> +		|| __glibc_unlikely ((statep->__count >> 3) <= 26)	      \
> +		|| __glibc_unlikely (var == UTF_7_IMAP && ch != '-'))	      \
>  	      {								      \
>  		STANDARD_FROM_LOOP_ERR_HANDLER ((statep->__count = 0, 1));    \
>  	      }								      \
> @@ -474,13 +488,15 @@ gconv_end (struct __gconv_step *data)
>      else								      \
>        {									      \
>  	/* base64 encoding active */					      \
> -	if (isdirect (ch, var))						      \
> +	if ((var == UTF_7_IMAP && ch == '&') || isdirect (ch, var))	      \
>  	  {								      \
>  	    /* deactivate base64 encoding */				      \
>  	    size_t count;						      \
>  									      \
>  	    count = ((statep->__count & 0x18) >= 0x10)			      \
> -	      + needs_explicit_shift (ch) + 1;				      \
> +	      + (var == UTF_7_IMAP || needs_explicit_shift (ch))	      \
> +	      + (var == UTF_7_IMAP && ch == '&')			      \
> +	      + 1;							      \
>  	    if (__glibc_unlikely (outptr + count > outend))		      \
>  	      {								      \
>  		result = __GCONV_FULL_OUTPUT;				      \
> @@ -489,9 +505,11 @@ gconv_end (struct __gconv_step *data)
>  									      \
>  	    if ((statep->__count & 0x18) >= 0x10)			      \
>  	      *outptr++ = base64 ((statep->__count >> 3) & ~3, var);	      \
> -	    if (needs_explicit_shift (ch))				      \
> +	    if (var == UTF_7_IMAP || needs_explicit_shift (ch))		      \
>  	      *outptr++ = '-';						      \
>  	    *outptr++ = (unsigned char) ch;				      \
> +	    if (var == UTF_7_IMAP && ch == '&')				      \
> +	      *outptr++ = '-';						      \
>  	    statep->__count = 0;					      \
>  	  }								      \
>  	else								      \