From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mout-p-101.mailbox.org (mout-p-101.mailbox.org [IPv6:2001:67c:2050::465:101]) by sourceware.org (Postfix) with ESMTPS id BAC893854829 for ; Sun, 7 Feb 2021 12:29:21 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org BAC893854829 Received: from smtp1.mailbox.org (smtp1.mailbox.org [IPv6:2001:67c:2050:105:465:1:1:0]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange ECDHE (P-384) server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by mout-p-101.mailbox.org (Postfix) with ESMTPS id 4DYT3z5qfbzQjgj; Sun, 7 Feb 2021 13:29:19 +0100 (CET) X-Virus-Scanned: amavisd-new at heinlein-support.de Received: from smtp1.mailbox.org ([80.241.60.240]) by spamfilter05.heinlein-hosting.de (spamfilter05.heinlein-hosting.de [80.241.56.123]) (amavisd-new, port 10030) with ESMTP id gODXitKgdg4E; Sun, 7 Feb 2021 13:29:16 +0100 (CET) Date: Sun, 7 Feb 2021 13:29:15 +0100 From: Max Gautier To: Florian Weimer via Libc-alpha Cc: Max Gautier Subject: Re: [PATCH v3 1/5] Copy utf-7 module to modified-utf-7 Message-ID: Mail-Followup-To: Max Gautier , Florian Weimer via Libc-alpha References: <87y2m9agmm.fsf@mid.deneb.enyo.de> <20210125090226.39967-1-mg@max.gautier.name> <20210125090226.39967-2-mg@max.gautier.name> <878s8h2whg.fsf@igel.home> <87blcw9ptq.fsf@oldenburg.str.redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <87blcw9ptq.fsf@oldenburg.str.redhat.com> X-MBO-SPAM-Probability: X-Rspamd-Score: -9.50 / 15.00 / 15.00 X-Rspamd-Queue-Id: D61181850 X-Rspamd-UID: b97fb1 X-Spam-Status: No, score=-7.2 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, KAM_NUMSUBJECT, RCVD_IN_DNSWL_LOW, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.2 X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 07 Feb 2021 12:29:24 -0000 * Florian Weimer via Libc-alpha: > Given that UTF-7 conversion (either variant) is not > performance-critical, I suggest to have just one implementation file. > > You can use step->__data to keep track of which variant is active. See > iconvdata/iso646.c for an example. There is no need to allocate a > separate object; you can store the flag directly in the __data member. I'll work on that. I might use some advice on specific parts. About the classification of characters (direct or not etc) : - utf-7.c + utf-7-imap.c > -static const unsigned char direct_tab[128 / 8] = > - { > - 0x00, 0x26, 0x00, 0x00, 0x81, 0xf3, 0xff, 0x87, > - 0xfe, 0xff, 0xff, 0x07, 0xfe, 0xff, 0xff, 0x07 > - }; > - > static int > isdirect (uint32_t ch) > { > - return (ch < 128 && ((direct_tab[ch >> 3] >> (ch & 7)) & 1)); > -} > - > - > -/* The set of "direct and optional direct characters": > - A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr > - ! " # $ % & * ; < = > @ [ ] ^ _ ` { | } > -*/ > - > -static const unsigned char xdirect_tab[128 / 8] = > - { > - 0x00, 0x26, 0x00, 0x00, 0xff, 0xf7, 0xff, 0xff, > - 0xff, 0xff, 0xff, 0xef, 0xff, 0xff, 0xff, 0x3f > - }; > - > -static int > -isxdirect (uint32_t ch) > -{ > - return (ch < 128 && ((xdirect_tab[ch >> 3] >> (ch & 7)) & 1)); > + return ((ch == '\n' || ch == '\t' || ch == '\r') > + || (ch >= 0x20 && ch <= 0x7e && ch != '&')); > } > > - > -/* The set of "extended base64 characters": > - A-Z a-z 0-9 + / - > +/* The set of "modified base64 characters": > + A-Z a-z 0-9 + , - > */ > > -static const unsigned char xbase64_tab[128 / 8] = > - { > - 0x00, 0x00, 0x00, 0x00, 0x00, 0xa8, 0xff, 0x03, > - 0xfe, 0xff, 0xff, 0x07, 0xfe, 0xff, 0xff, 0x07 > - }; > - > static int > -isxbase64 (uint32_t ch) > +ismbase64 (uint32_t ch) > { > - return (ch < 128 && ((xbase64_tab[ch >> 3] >> (ch & 7)) & 1)); > + return ((ch >= 'a' && ch <= 'z') > + || (ch >= 'A' && ch <= 'Z') > + || (ch >= '0' && ch <= '9') > + || (ch == '+' || ch == ',')); > } When I initially looked at utf-7.c, the use of the _tab arrays with magic values and the subsequent shifting didn't make a lot of sense to me, which is why I modified them like this for utf-7-imap. If they are in the same file, it's probably better to use the same method. So do you see any benefits to keeping the old method ? Testing directly for the actual characters seems a lot more readable to me, is shorter, and can be mapped to the RFC definition. I didn't find the reason for using the current method by looking at the git history. The only one I could think of is performance, but I don't see how or what it would improve. If someone has some hints, let me know. > You could remove the UTF7_ENCODE_OPTIONAL_CHARS from the existing UTF-7 > codec in a first, separate patch. Do you mean I should modify the utf-7 conversion to not encode the optional chars ? That would change the result of utf-7 conversions, wouldn't it ? I'm not opposed to it, but isn't that going to break things ? Thanks -- Max Gautier mg@max.gautier.name