From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <mg@max.gautier.name>
Received: from mout-p-101.mailbox.org (mout-p-101.mailbox.org
 [IPv6:2001:67c:2050::465:101])
 by sourceware.org (Postfix) with ESMTPS id BAC893854829
 for <libc-alpha@sourceware.org>; Sun,  7 Feb 2021 12:29:21 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org BAC893854829
Received: from smtp1.mailbox.org (smtp1.mailbox.org
 [IPv6:2001:67c:2050:105:465:1:1:0])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange ECDHE (P-384) server-signature RSA-PSS (4096 bits) server-digest
 SHA256) (No client certificate requested)
 by mout-p-101.mailbox.org (Postfix) with ESMTPS id 4DYT3z5qfbzQjgj;
 Sun,  7 Feb 2021 13:29:19 +0100 (CET)
X-Virus-Scanned: amavisd-new at heinlein-support.de
Received: from smtp1.mailbox.org ([80.241.60.240])
 by spamfilter05.heinlein-hosting.de (spamfilter05.heinlein-hosting.de
 [80.241.56.123]) (amavisd-new, port 10030)
 with ESMTP id gODXitKgdg4E; Sun,  7 Feb 2021 13:29:16 +0100 (CET)
Date: Sun, 7 Feb 2021 13:29:15 +0100
From: Max Gautier <mg@max.gautier.name>
To: Florian Weimer via Libc-alpha <libc-alpha@sourceware.org>
Cc: Max Gautier <mg@max.gautier.name>
Subject: Re: [PATCH v3 1/5] Copy utf-7 module to modified-utf-7
Message-ID: <YB/dG6Fxxr/tve51@ol-mgautier.localdomain>
Mail-Followup-To: Max Gautier <mg@max.gautier.name>,
 Florian Weimer via Libc-alpha <libc-alpha@sourceware.org>
References: <87y2m9agmm.fsf@mid.deneb.enyo.de>
 <20210125090226.39967-1-mg@max.gautier.name>
 <20210125090226.39967-2-mg@max.gautier.name>
 <878s8h2whg.fsf@igel.home>
 <YA7M9M3KvGiH/eUP@ol-mgautier.localdomain>
 <87blcw9ptq.fsf@oldenburg.str.redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <87blcw9ptq.fsf@oldenburg.str.redhat.com>
X-MBO-SPAM-Probability: 
X-Rspamd-Score: -9.50 / 15.00 / 15.00
X-Rspamd-Queue-Id: D61181850
X-Rspamd-UID: b97fb1
X-Spam-Status: No, score=-7.2 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, KAM_NUMSUBJECT, RCVD_IN_DNSWL_LOW, SPF_HELO_NONE,
 SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.2
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on
 server2.sourceware.org
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Sun, 07 Feb 2021 12:29:24 -0000

* Florian Weimer via Libc-alpha:
> Given that UTF-7 conversion (either variant) is not
> performance-critical, I suggest to have just one implementation file.
> 
> You can use step->__data to keep track of which variant is active.  See
> iconvdata/iso646.c for an example.  There is no need to allocate a
> separate object; you can store the flag directly in the __data member.

I'll work on that.
I might use some advice on specific parts. About the classification of
characters (direct or not etc) :
- utf-7.c
+ utf-7-imap.c
> -static const unsigned char direct_tab[128 / 8] =
> -  {
> -    0x00, 0x26, 0x00, 0x00, 0x81, 0xf3, 0xff, 0x87,
> -    0xfe, 0xff, 0xff, 0x07, 0xfe, 0xff, 0xff, 0x07
> -  };
> -
>  static int
>  isdirect (uint32_t ch)
>  {
> -  return (ch < 128 && ((direct_tab[ch >> 3] >> (ch & 7)) & 1));
> -}
> -
> -
> -/* The set of "direct and optional direct characters":
> -   A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr
> -   ! " # $ % & * ; < = > @ [ ] ^ _ ` { | }
> -*/
> -
> -static const unsigned char xdirect_tab[128 / 8] =
> -  {
> -    0x00, 0x26, 0x00, 0x00, 0xff, 0xf7, 0xff, 0xff,
> -    0xff, 0xff, 0xff, 0xef, 0xff, 0xff, 0xff, 0x3f
> -  };
> -
> -static int
> -isxdirect (uint32_t ch)
> -{
> -  return (ch < 128 && ((xdirect_tab[ch >> 3] >> (ch & 7)) & 1));
> +  return ((ch == '\n' || ch == '\t' || ch == '\r')
> +		  || (ch >= 0x20 && ch <= 0x7e && ch != '&'));
>  }
>  
> -
> -/* The set of "extended base64 characters":
> -   A-Z a-z 0-9 + / -
> +/* The set of "modified base64 characters":
> +   A-Z a-z 0-9 + , -
>  */
>  
> -static const unsigned char xbase64_tab[128 / 8] =
> -  {
> -    0x00, 0x00, 0x00, 0x00, 0x00, 0xa8, 0xff, 0x03,
> -    0xfe, 0xff, 0xff, 0x07, 0xfe, 0xff, 0xff, 0x07
> -  };
> -
>  static int
> -isxbase64 (uint32_t ch)
> +ismbase64 (uint32_t ch)
>  {
> -  return (ch < 128 && ((xbase64_tab[ch >> 3] >> (ch & 7)) & 1));
> +  return ((ch >= 'a' && ch <= 'z')
> +			  || (ch >= 'A' && ch <= 'Z')
> +			  || (ch >= '0' && ch <= '9')
> +			  || (ch == '+' || ch == ','));
>  }

When I initially looked at utf-7.c, the use of the _tab arrays with
magic values and the subsequent shifting didn't make a lot of sense to
me, which is why I modified them like this for utf-7-imap. If they are
in the same file, it's probably better to use the same method.
So do you see any benefits to keeping the old method ?
Testing directly for the actual characters seems a lot more readable to
me, is shorter, and can be mapped to the RFC definition.

I didn't find the reason for using the current method by looking at the
git history. The only one I could think of is performance, but I don't
see how or what it would improve. If someone has some hints, let me
know.

 
> You could remove the UTF7_ENCODE_OPTIONAL_CHARS from the existing UTF-7
> codec in a first, separate patch.

Do you mean I should modify the utf-7 conversion to not encode the
optional chars ?  That would change the result of utf-7 conversions,
wouldn't it ? I'm not opposed to it, but isn't that going to break
things ?

Thanks

-- 
Max Gautier
mg@max.gautier.name