From: Max Gautier
To: libc-alpha@sourceware.org
Subject: [PATCH v4 0/4] iconv: Add support for UTF-7-IMAP
Date: Thu, 9 Dec 2021 10:31:48 +0100
Message-Id: <20211209093152.313872-1-mg@max.gautier.name>
In-Reply-To: <87blcw9ptq.fsf@oldenburg.str.redhat.com>
References: <87blcw9ptq.fsf@oldenburg.str.redhat.com>

I finally took the time to work on this again.

This new series implements UTF-7-IMAP in the UTF-7 module, using, as
advised, the same approach as iso646.c.

Unresolved issues (I would appreciate advice on these):

- There is a slight inconsistency (to me) in the UTF-7 RFC[1], and the
  current implementation does not follow it exactly.

  In "UTF-7 Definition/Rule 2":
    "The '+' signals that subsequent octets are to be interpreted as
    elements of the Modified Base64 alphabet until a character not in
    that alphabet is encountered. Such characters include control
    characters such as carriage returns and line feeds"

  The UTF-7 module implements this by making the characters '\n', '\r'
  and '\t' part of the "direct characters" set, even though they are
  not part of it according to the definition given by the RFC. So these
  characters should be encoded, but should also be interpreted
  literally, and they implicitly terminate base64 sequences.

  On this, I'm inclined to leave the current behavior as is. Changing
  it might mean breaking things, and I don't see many benefits.

- For UTF-7-IMAP: the IMAPv4 RFC (UTF-7-IMAP definition)[2] specifies
  that:

    - The character "&" (0x26) is represented by the two-octet
      sequence "&-"
    - null shifts ("-&" while in BASE64; note that "&-" while in
      US-ASCII means "&") are not permitted
    - The purpose of these modifications is to correct the following
      problems with UTF-7:
      ...
      5) UTF-7 permits multiple alternate forms to represent the same
         string; in particular, printable US-ASCII characters can be
         represented in encoded form.

  Consider the following cases:

  A- When encoding to UTF-7-IMAP, if we encounter '&' while in base64
     mode, should we:
     1) encode it in base64
     2) terminate the encoding with '-' and use "&-"

  B- When encoding to UTF-7-IMAP, if we encounter "&&" while in
     US-ASCII mode, should we:
     1) start base64 mode and encode the two '&'
     2) encode them as "&-&-"

  It seems to me that for A and B, solution 2 allows null shifts, and
  solution 1 allows multiple representations. However, A-2 and B-2
  still feel cleaner to me, since they avoid alternate forms for '&'.
  The argument can be made that the resulting sequences are not null
  shifts, merely a special case in US-ASCII. I've used that approach in
  PATCH 4/4, but it should be quite easy to change if necessary.

- Also, I'm not sure how to add negative test cases, i.e. invalid
  sequences which need to trigger an iconv error.

Thanks for your time.

[1]: https://datatracker.ietf.org/doc/html/rfc2152
[2]: https://datatracker.ietf.org/doc/html/rfc3501#section-5.1.3
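
To make options A-2 and B-2 from the UTF-7-IMAP item above concrete,
here is a minimal Python sketch of an encoder that always writes '&' as
"&-" and closes any open base64 run first. It is only an illustration
under stated assumptions, not the code from PATCH 4/4: the names
`encode_utf7_imap` and `_b64_run` are hypothetical, and the sketch
assumes that exactly the printable US-ASCII range 0x20-0x7e is passed
through directly, as described in [2].

```python
import base64


def _b64_run(chars: str) -> str:
    """Encode a run of non-direct characters as '&' + modified base64 + '-'."""
    raw = chars.encode("utf-16-be")
    # Modified base64 per RFC 3501: ',' replaces '/', and no '=' padding.
    b64 = base64.b64encode(raw, altchars=b"+,").decode("ascii")
    return "&" + b64.rstrip("=") + "-"


def encode_utf7_imap(text: str) -> str:
    """Hypothetical encoder following options A-2 and B-2."""
    out = []
    run = []  # pending characters that must go through base64
    for ch in text:
        if 0x20 <= ord(ch) <= 0x7E:
            # Direct character: close any open base64 run first (A-2).
            if run:
                out.append(_b64_run("".join(run)))
                run = []
            # '&' is never base64-encoded; it is always "&-" (A-2, B-2).
            out.append("&-" if ch == "&" else ch)
        else:
            run.append(ch)
    if run:
        out.append(_b64_run("".join(run)))
    return "".join(out)
```

With this approach "&&" becomes "&-&-", and an '&' that follows an
encoded character yields "-&-" in the output (the '-' that closes the
base64 run, then "&-"), which is the sequence whose status as a null
shift is in question.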