Message-ID: <93566715-f32e-713b-197e-c33332beb37e@linaro.org>
Date: Mon, 21 Mar 2022 09:24:35 -0300
Subject: Re: [PATCH v5 4/4] iconv: Add UTF-7-IMAP variant in utf-7.c
To: Max Gautier, libc-alpha@sourceware.org
References: <87blcw9ptq.fsf@oldenburg.str.redhat.com> <20211209093152.313872-1-mg@max.gautier.name> <20211209093152.313872-5-mg@max.gautier.name>
From: Adhemerval Zanella

On 20/03/2022 13:43, Max Gautier via Libc-alpha wrote:
> UTF-7-IMAP differs from UTF-7 in the following ways (see RFC 3501[1]
> for reference):
>
> - The shift character is '&' instead of '+'
> - There are no "optional direct characters" and the "direct characters"
>   set is different
> - There is no implicit shift back to US-ASCII from BASE64, all BASE64
>   sequences MUST be terminated with '-'
>
> [1]: https://datatracker.ietf.org/doc/html/rfc3501#section-5.1.3
>
> Signed-off-by: Max Gautier

LGTM, thanks.
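
For anyone following the thread, the three rules in the commit message can be illustrated with a small standalone encoder. This is only a sketch under stated assumptions, not the code in utf-7.c: the names utf7imap_encode, end_base64, struct outbuf and put are hypothetical, and the input is assumed to already be a sequence of UTF-16 code units.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Modified base64 alphabet from RFC 3501: ',' replaces '/' at index 63.  */
static const char imap_b64[] =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+,";

/* Hypothetical bounded output buffer; not part of glibc.  */
struct outbuf
{
  char *data;
  size_t size;
  size_t len;
  int overflow;
};

static void
put (struct outbuf *b, char c)
{
  /* Always keep one byte free for the terminating NUL.  */
  if (b->len + 1 < b->size)
    b->data[b->len++] = c;
  else
    b->overflow = 1;
}

/* Flush leftover bits (zero-padded) and emit the mandatory '-'.  */
static void
end_base64 (struct outbuf *b, uint32_t bits, int nbits)
{
  if (nbits > 0)
    put (b, imap_b64[(bits << (6 - nbits)) & 0x3f]);
  put (b, '-');
}

/* Encode N UTF-16 code units from IN as UTF-7-IMAP into OUT (OUTSZ bytes,
   NUL-terminated).  Returns 0 on success, -1 if OUT is too small.  */
int
utf7imap_encode (const uint16_t *in, size_t n, char *out, size_t outsz)
{
  struct outbuf b = { out, outsz, 0, 0 };
  uint32_t bits = 0;   /* Pending bits not yet emitted as base64.  */
  int nbits = 0;       /* Number of valid low bits in BITS.  */
  int shifted = 0;     /* Nonzero while inside a base64 sequence.  */

  assert (out != NULL && outsz > 0);

  for (size_t i = 0; i < n; i++)
    {
      uint16_t u = in[i];

      if (u >= 0x20 && u <= 0x7e)
        {
          /* Printable US-ASCII: always close an open base64 run with '-'.  */
          if (shifted)
            {
              end_base64 (&b, bits, nbits);
              bits = 0;
              nbits = 0;
              shifted = 0;
            }
          put (&b, (char) u);
          if (u == '&')
            put (&b, '-');      /* A literal '&' is written as "&-".  */
        }
      else
        {
          if (!shifted)
            {
              put (&b, '&');    /* Shift character is '&', not '+'.  */
              shifted = 1;
            }
          bits = (bits << 16) | u;
          nbits += 16;
          while (nbits >= 6)
            {
              nbits -= 6;
              put (&b, imap_b64[(bits >> nbits) & 0x3f]);
            }
          bits &= (1u << nbits) - 1;
        }
    }

  if (shifted)
    end_base64 (&b, bits, nbits);
  b.data[b.len] = '\0';
  return b.overflow ? -1 : 0;
}
```

Feeding it the code units of "Français" (F r a n U+00E7 a i s) gives "Fran&AOc-ais", and U+000A U+D55C U+AE00 followed by '&' gives "&AArVXK4A-&-", which agrees with the strings in iconvdata/testdata/UTF-7-IMAP from this patch.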
Reviewed-by: Adhemerval Zanella

> ---
>  iconvdata/TESTS                     |  1 +
>  iconvdata/gconv-modules             |  4 ++++
>  iconvdata/testdata/UTF-7-IMAP       |  1 +
>  iconvdata/testdata/UTF-7-IMAP..UTF8 | 32 +++++++++++++++++++++++++++++
>  iconvdata/utf-7.c                   | 30 +++++++++++++++++++++------
>  5 files changed, 62 insertions(+), 6 deletions(-)
>  create mode 100644 iconvdata/testdata/UTF-7-IMAP
>  create mode 100644 iconvdata/testdata/UTF-7-IMAP..UTF8
>
> diff --git a/iconvdata/TESTS b/iconvdata/TESTS
> index a0157c3350..3cc043c21b 100644
> --- a/iconvdata/TESTS
> +++ b/iconvdata/TESTS
> @@ -94,6 +94,7 @@ EUC-TW EUC-TW Y UTF8
>  GBK GBK Y UTF8
>  BIG5HKSCS BIG5HKSCS Y UTF8
>  UTF-7 UTF-7 N UTF8
> +UTF-7-IMAP UTF-7-IMAP N UTF8
>  IBM856 IBM856 N UTF8
>  IBM922 IBM922 Y UTF8
>  IBM930 IBM930 N UTF8
> diff --git a/iconvdata/gconv-modules b/iconvdata/gconv-modules
> index 4acbba062f..d120699394 100644
> --- a/iconvdata/gconv-modules
> +++ b/iconvdata/gconv-modules
> @@ -113,3 +113,7 @@ module INTERNAL UTF-32BE// UTF-32 1
>  alias UTF7// UTF-7//
>  module UTF-7// INTERNAL UTF-7 1
>  module INTERNAL UTF-7// UTF-7 1
> +
> +# from to module cost
> +module UTF-7-IMAP// INTERNAL UTF-7 1
> +module INTERNAL UTF-7-IMAP// UTF-7 1
> diff --git a/iconvdata/testdata/UTF-7-IMAP b/iconvdata/testdata/UTF-7-IMAP
> new file mode 100644
> index 0000000000..6b5dada63c
> --- /dev/null
> +++ b/iconvdata/testdata/UTF-7-IMAP
> @@ -0,0 +1 @@
> +&EqASGxItEps- Amharic&AAoBDQ-esky Czech&AAo-Dansk Danish&AAo-English English&AAo-Suomi Finnish&AAo-Fran&AOc-ais French&AAo-Deutsch German&AAoDlQO7A7sDtwO9A7kDugOs- Greek&AAoF4gXRBegF2QXq- Hebrew&AAo-Italiano Italian&AAo-Norsk Norwegian&AAoEIARDBEEEQQQ6BDgEOQ- Russian&AAo-Espa&APE-ol Spanish&AAo-Svenska Swedish&AAoOIA4yDikOMg5EDhcOIg- Thai&AAo-T&APw-rk&AOc-e Turkish&AAo-Ti&Hr8-ng Vi&Hsc-t Vietnamese&AApl5Wcsip4- Japanese&AApOLWWH- Chinese&AArVXK4A- Korean&AAoACg-// Checking for correct handling of shift characters ('&-', '-') after base64 sequences&AArVXK4A-&-&AArVXK4A--&AAoACg-// Checking for correct handling of litteral '&-' and '-'&AAo----&-&--&AAoACg-// The last line of this file is missing the end-of-line terminator&AAo-// on purpose, in order to test that the conversion empties the bit buffer&AAo-// and shifts back to the initial state at the end of the conversion.&AAo-A&ImIDkQ-
> \ No newline at end of file
> diff --git a/iconvdata/testdata/UTF-7-IMAP..UTF8 b/iconvdata/testdata/UTF-7-IMAP..UTF8
> new file mode 100644
> index 0000000000..8b9add3670
> --- /dev/null
> +++ b/iconvdata/testdata/UTF-7-IMAP..UTF8
> @@ -0,0 +1,32 @@
> +አማርኛ Amharic
> +česky Czech
> +Dansk Danish
> +English English
> +Suomi Finnish
> +Français French
> +Deutsch German
> +Ελληνικά Greek
> +עברית Hebrew
> +Italiano Italian
> +Norsk Norwegian
> +Русский Russian
> +Español Spanish
> +Svenska Swedish
> +ภาษาไทย Thai
> +Türkçe Turkish
> +Tiếng Việt Vietnamese
> +日本語 Japanese
> +中文 Chinese
> +한글 Korean
> +
> +// Checking for correct handling of shift characters ('&', '-') after base64 sequences
> +한글&
> +한글-
> +
> +// Checking for correct handling of litteral '&' and '-'
> +---&&-
> +
> +// The last line of this file is missing the end-of-line terminator
> +// on purpose, in order to test that the conversion empties the bit buffer
> +// and shifts back to the initial state at the end of the conversion.
> +A≢Α
> \ No newline at end of file
> diff --git a/iconvdata/utf-7.c b/iconvdata/utf-7.c
> index b639d8ff3e..5c2e17e50c 100644
> --- a/iconvdata/utf-7.c
> +++ b/iconvdata/utf-7.c
> @@ -32,11 +32,13 @@
>  enum variant
>  {
>    UTF7,
> +  UTF_7_IMAP
>  };
>
>  /* Must be in the same order as enum variant above.  */
>  static const char names[] =
>    "UTF-7//\0"
> +  "UTF-7-IMAP//\0"
>    "\0";
>
>  static uint32_t
> @@ -44,6 +46,8 @@ shift_character (enum variant const var)
>  {
>    if (var == UTF7)
>      return '+';
> +  else if (var == UTF_7_IMAP)
> +    return '&';
>    else
>      abort ();
>  }
> @@ -58,6 +62,9 @@ between (uint32_t const ch,
>  /* The set of "direct characters":
>     FOR UTF-7
>     A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr
> +   FOR UTF-7-IMAP
> +   A-Z a-z 0-9 ' ( ) , - . / : ? space
> +   ! " # $ % + * ; < = > @ [ \ ] ^ _ ` { | } ~
>  */
>
>  static bool
> @@ -71,6 +78,8 @@ isdirect (uint32_t ch, enum variant var)
>              || between (ch, ',', '/')
>              || ch == ':' || ch == '?'
>              || ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r');
> +  else if (var == UTF_7_IMAP)
> +    return (ch != '&' && between (ch, ' ', '~'));
>    abort ();
>  }
>
> @@ -124,6 +133,8 @@ base64 (unsigned int i, enum variant var)
>      return '+';
>    else if (i == 63 && var == UTF7)
>      return '/';
> +  else if (i == 63 && var == UTF_7_IMAP)
> +    return ',';
>    else
>      abort ();
>  }
> @@ -308,7 +319,8 @@ gconv_end (struct __gconv_step *data)
>              i = ch - '0' + 52; \
>            else if (ch == '+') \
>              i = 62; \
> -          else if (ch == '/') \
> +          else if ((var == UTF7 && ch == '/') \
> +                   || (var == UTF_7_IMAP && ch == ',')) \
>              i = 63; \
>            else \
>              { \
> @@ -316,8 +328,10 @@ gconv_end (struct __gconv_step *data)
>   \
>                /* If accumulated data is nonzero, the input is invalid.  */ \
>                /* Also, partial UTF-16 characters are invalid.  */ \
> -              if (__builtin_expect (statep->__value.__wch != 0, 0) \
> -                  || __builtin_expect ((statep->__count >> 3) <= 26, 0)) \
> +              /* In IMAP variant, must be terminated by '-'.  */ \
> +              if (__glibc_unlikely (statep->__value.__wch != 0) \
> +                  || __glibc_unlikely ((statep->__count >> 3) <= 26) \
> +                  || __glibc_unlikely (var == UTF_7_IMAP && ch != '-')) \
>                  { \
>                    STANDARD_FROM_LOOP_ERR_HANDLER ((statep->__count = 0, 1)); \
>                  } \
> @@ -474,13 +488,15 @@ gconv_end (struct __gconv_step *data)
>      else \
>        { \
>          /* base64 encoding active */ \
> -        if (isdirect (ch, var)) \
> +        if ((var == UTF_7_IMAP && ch == '&') || isdirect (ch, var)) \
>            { \
>              /* deactivate base64 encoding */ \
>              size_t count; \
>   \
>              count = ((statep->__count & 0x18) >= 0x10) \
> -              + needs_explicit_shift (ch) + 1; \
> +              + (var == UTF_7_IMAP || needs_explicit_shift (ch)) \
> +              + (var == UTF_7_IMAP && ch == '&') \
> +              + 1; \
>              if (__glibc_unlikely (outptr + count > outend)) \
>                { \
>                  result = __GCONV_FULL_OUTPUT; \
> @@ -489,9 +505,11 @@ gconv_end (struct __gconv_step *data)
>   \
>              if ((statep->__count & 0x18) >= 0x10) \
>                *outptr++ = base64 ((statep->__count >> 3) & ~3, var); \
> -            if (needs_explicit_shift (ch)) \
> +            if (var == UTF_7_IMAP || needs_explicit_shift (ch)) \
>                *outptr++ = '-'; \
>              *outptr++ = (unsigned char) ch; \
> +            if (var == UTF_7_IMAP && ch == '&') \
> +              *outptr++ = '-'; \
>              statep->__count = 0; \
>            } \
>          else \