[PATCH 0/5] iconv: module for MODIFIED-UTF-7

public inbox for libc-alpha@sourceware.org

* [PATCH 0/5] iconv: module for MODIFIED-UTF-7
@ 2020-08-19 23:06 Max Gautier
  2020-08-19 23:06 ` [PATCH 1/5] Copy utf-7 module to modified-utf-7 Max Gautier
                   ` (6 more replies)
  0 siblings, 7 replies; 60+ messages in thread
From: Max Gautier @ 2020-08-19 23:06 UTC
  To: libc-alpha

These patches implement a conversion module for "modified UTF-7",
described in RFC 3501 as part of the IMAP4rev1 specification
(section 5.1.3 [1]).
By convention, IMAP servers use this encoding to represent
internationalized mailbox names.
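
To make the encoding concrete, here is a minimal standalone sketch of an
encoder that follows the rules of RFC 3501 section 5.1.3. It is not the
code from these patches: the name mutf7_encode and its interface are made
up for illustration, it takes UTF-16 code units as input instead of going
through gconv, and it treats only 0x20-0x7e (minus '&') as direct
characters, as the RFC does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Modified base64 alphabet from RFC 3501: ',' replaces '/'.  */
static const char mb64[] =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+,";

/* Hypothetical helper (not part of the patches): encode N UTF-16 code
   units from IN into OUT as a NUL-terminated modified UTF-7 string.
   Returns the string length, or (size_t) -1 if OUT might be too small.
   The space check is deliberately conservative.  */
static size_t
mutf7_encode (const uint16_t *in, size_t n, char *out, size_t outsz)
{
  size_t o = 0;
  uint32_t acc = 0;	/* Pending bits, right-aligned.  */
  int nbits = 0;	/* Number of pending bits in ACC (0, 2 or 4).  */
  int in_b64 = 0;	/* Nonzero inside an '&' ... '-' run.  */

  assert (out != NULL && (in != NULL || n == 0));

  for (size_t k = 0; k < n; k++)
    {
      uint16_t u = in[k];

      /* At most 4 bytes per code unit, plus 3 for the final flush.  */
      if (o + 7 > outsz)
	return (size_t) -1;

      if (u >= 0x20 && u <= 0x7e)
	{
	  if (in_b64)
	    {
	      /* Flush leftover bits, zero-padded, then terminate.  */
	      if (nbits > 0)
		out[o++] = mb64[(acc << (6 - nbits)) & 0x3f];
	      out[o++] = '-';
	      in_b64 = 0;
	      acc = 0;
	      nbits = 0;
	    }
	  out[o++] = (char) u;
	  if (u == '&')
	    out[o++] = '-';	/* A literal '&' is written as "&-".  */
	}
      else
	{
	  if (!in_b64)
	    {
	      out[o++] = '&';
	      in_b64 = 1;
	    }
	  acc = (acc << 16) | u;
	  nbits += 16;
	  while (nbits >= 6)
	    {
	      nbits -= 6;
	      out[o++] = mb64[(acc >> nbits) & 0x3f];
	    }
	  acc &= (1u << nbits) - 1;
	}
    }

  if (o + 3 > outsz)
    return (size_t) -1;
  if (in_b64)
    {
      if (nbits > 0)
	out[o++] = mb64[(acc << (6 - nbits)) & 0x3f];
      out[o++] = '-';
    }
  out[o] = '\0';
  return o;
}
```

With the mailbox name example from RFC 3501, the code units U+53F0 U+5317
come out as "&U,BTFw-", and a literal '&' comes out as "&-".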

I'm trying to make isync[2] (an IMAP synchronizer) support that
encoding. Implementing it in glibc (rather than as a custom gconv module)
seems a sensible move to me, since it lets other IMAP clients reuse the
work. It is also easier to reuse the existing boilerplate than to write
my own.

The conversion is based on the existing UTF-7 module: I copied it, then
changed the necessary parts to turn UTF-7 into MODIFIED-UTF-7.

I am unaware of an official name for the encoding, so I used
"MODIFIED-UTF-7". There might be better choices, if someone has insights
on that.

I added test files (last patch) but I'm not sure `make check` actually
tests the stateful character sets (I'm not very familiar with iconv or
the glibc build system).

I would appreciate feedback, even if it is only to say that you think
this module does not belong in glibc.

Thank you,

Max Gautier

[1]: https://tools.ietf.org/html/rfc3501#section-5.1.3
[2]: https://isync.sourceforge.io/


* [PATCH 1/5] Copy utf-7 module to modified-utf-7
  2020-08-19 23:06 [PATCH 0/5] iconv: module for MODIFIED-UTF-7 Max Gautier
@ 2020-08-19 23:06 ` Max Gautier
  2020-08-19 23:06 ` [PATCH 2/5] Update gconv-modules file Max Gautier
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 60+ messages in thread
From: Max Gautier @ 2020-08-19 23:06 UTC
  To: libc-alpha; +Cc: Max Gautier

---
 iconvdata/Makefile         |   2 +-
 iconvdata/modified-utf-7.c | 531 +++++++++++++++++++++++++++++++++++++
 2 files changed, 532 insertions(+), 1 deletion(-)
 create mode 100644 iconvdata/modified-utf-7.c

diff --git a/iconvdata/Makefile b/iconvdata/Makefile
index 4ec2741cdc..02fd805234 100644
--- a/iconvdata/Makefile
+++ b/iconvdata/Makefile
@@ -61,7 +61,7 @@ modules	:= ISO8859-1 ISO8859-2 ISO8859-3 ISO8859-4 ISO8859-5		 \
 	   IBM5347 IBM9030 IBM9066 IBM9448 IBM12712 IBM16804             \
 	   IBM1364 IBM1371 IBM1388 IBM1390 IBM1399 ISO_11548-1 MIK BRF	 \
 	   MAC-CENTRALEUROPE KOI8-RU ISO8859-9E				 \
-	   CP770 CP771 CP772 CP773 CP774
+	   CP770 CP771 CP772 CP773 CP774 MODIFIED-UTF-7
 
 # If lazy binding is disabled, use BIND_NOW for the gconv modules.
 ifeq ($(bind-now),yes)
diff --git a/iconvdata/modified-utf-7.c b/iconvdata/modified-utf-7.c
new file mode 100644
index 0000000000..fc6a8dfcfd
--- /dev/null
+++ b/iconvdata/modified-utf-7.c
@@ -0,0 +1,531 @@
+/* Conversion module for UTF-7.
+   Copyright (C) 2000-2020 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+/* UTF-7 is a legacy encoding used for transmitting Unicode within the
+   ASCII character set, used primarily by mail agents.  New programs
+   are encouraged to use UTF-8 instead.
+
+   UTF-7 is specified in RFC 2152 (and old RFC 1641, RFC 1642).  The
+   original Base64 encoding is defined in RFC 2045.  */
+
+#include <dlfcn.h>
+#include <gconv.h>
+#include <stdint.h>
+#include <stdlib.h>
+
+
+/* Define this to 1 if you want the so-called "optional direct" characters
+      ! " # $ % & * ; < = > @ [ ] ^ _ ` { | }
+   to be encoded. Define to 0 if you want them to be passed straight
+   through, like the so-called "direct" characters.
+   We set this to 1 because it's safer.
+ */
+#define UTF7_ENCODE_OPTIONAL_CHARS 1
+
+
+/* The set of "direct characters":
+   A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr
+*/
+
+static const unsigned char direct_tab[128 / 8] =
+  {
+    0x00, 0x26, 0x00, 0x00, 0x81, 0xf3, 0xff, 0x87,
+    0xfe, 0xff, 0xff, 0x07, 0xfe, 0xff, 0xff, 0x07
+  };
+
+static int
+isdirect (uint32_t ch)
+{
+  return (ch < 128 && ((direct_tab[ch >> 3] >> (ch & 7)) & 1));
+}
+
+
+/* The set of "direct and optional direct characters":
+   A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr
+   ! " # $ % & * ; < = > @ [ ] ^ _ ` { | }
+*/
+
+static const unsigned char xdirect_tab[128 / 8] =
+  {
+    0x00, 0x26, 0x00, 0x00, 0xff, 0xf7, 0xff, 0xff,
+    0xff, 0xff, 0xff, 0xef, 0xff, 0xff, 0xff, 0x3f
+  };
+
+static int
+isxdirect (uint32_t ch)
+{
+  return (ch < 128 && ((xdirect_tab[ch >> 3] >> (ch & 7)) & 1));
+}
+
+
+/* The set of "extended base64 characters":
+   A-Z a-z 0-9 + / -
+*/
+
+static const unsigned char xbase64_tab[128 / 8] =
+  {
+    0x00, 0x00, 0x00, 0x00, 0x00, 0xa8, 0xff, 0x03,
+    0xfe, 0xff, 0xff, 0x07, 0xfe, 0xff, 0xff, 0x07
+  };
+
+static int
+isxbase64 (uint32_t ch)
+{
+  return (ch < 128 && ((xbase64_tab[ch >> 3] >> (ch & 7)) & 1));
+}
+
+
+/* Converts a value in the range 0..63 to a base64 encoded char.  */
+static unsigned char
+base64 (unsigned int i)
+{
+  if (i < 26)
+    return i + 'A';
+  else if (i < 52)
+    return i - 26 + 'a';
+  else if (i < 62)
+    return i - 52 + '0';
+  else if (i == 62)
+    return '+';
+  else if (i == 63)
+    return '/';
+  else
+    abort ();
+}
+
+
+/* Definitions used in the body of the `gconv' function.  */
+#define CHARSET_NAME		"UTF-7//"
+#define DEFINE_INIT		1
+#define DEFINE_FINI		1
+#define FROM_LOOP		from_utf7_loop
+#define TO_LOOP			to_utf7_loop
+#define MIN_NEEDED_FROM		1
+#define MAX_NEEDED_FROM		6
+#define MIN_NEEDED_TO		4
+#define MAX_NEEDED_TO		4
+#define ONE_DIRECTION		0
+#define PREPARE_LOOP \
+  mbstate_t saved_state;						      \
+  mbstate_t *statep = data->__statep;
+#define EXTRA_LOOP_ARGS		, statep
+
+
+/* Since we might have to reset input pointer we must be able to save
+   and restore the state.  */
+#define SAVE_RESET_STATE(Save) \
+  if (Save)								      \
+    saved_state = *statep;						      \
+  else									      \
+    *statep = saved_state
+
+
+/* First define the conversion function from UTF-7 to UCS4.
+   The state is structured as follows:
+     __count bit 2..0: zero
+     __count bit 8..3: shift
+     __wch: data
+   Precise meaning:
+     shift      data
+       0         --          not inside base64 encoding
+     1..32  XX..XX00..00     inside base64, (32 - shift) bits pending
+   This state layout is simpler than relying on STORE_REST/UNPACK_BYTES.
+
+   When shift = 0, __wch needs to store at most one lookahead byte (see
+   __GCONV_INCOMPLETE_INPUT below).
+*/
+#define MIN_NEEDED_INPUT	MIN_NEEDED_FROM
+#define MAX_NEEDED_INPUT	MAX_NEEDED_FROM
+#define MIN_NEEDED_OUTPUT	MIN_NEEDED_TO
+#define MAX_NEEDED_OUTPUT	MAX_NEEDED_TO
+#define LOOPFCT			FROM_LOOP
+#define BODY \
+  {									      \
+    uint_fast8_t ch = *inptr;						      \
+									      \
+    if ((statep->__count >> 3) == 0)					      \
+      {									      \
+	/* base64 encoding inactive.  */				      \
+	if (isxdirect (ch))						      \
+	  {								      \
+	    inptr++;							      \
+	    put32 (outptr, ch);						      \
+	    outptr += 4;						      \
+	  }								      \
+	else if (__glibc_likely (ch == '+'))				      \
+	  {								      \
+	    if (__glibc_unlikely (inptr + 2 > inend))			      \
+	      {								      \
+		/* Not enough input available.  */			      \
+		result = __GCONV_INCOMPLETE_INPUT;			      \
+		break;							      \
+	      }								      \
+	    if (inptr[1] == '-')					      \
+	      {								      \
+		inptr += 2;						      \
+		put32 (outptr, ch);					      \
+		outptr += 4;						      \
+	      }								      \
+	    else							      \
+	      {								      \
+		/* Switch into base64 mode.  */				      \
+		inptr++;						      \
+		statep->__count = (32 << 3);				      \
+		statep->__value.__wch = 0;				      \
+	      }								      \
+	  }								      \
+	else								      \
+	  {								      \
+	    /* The input is invalid.  */				      \
+	    STANDARD_FROM_LOOP_ERR_HANDLER (1);				      \
+	  }								      \
+      }									      \
+    else								      \
+      {									      \
+	/* base64 encoding active.  */					      \
+	uint32_t i;							      \
+	int shift;							      \
+									      \
+	if (ch >= 'A' && ch <= 'Z')					      \
+	  i = ch - 'A';							      \
+	else if (ch >= 'a' && ch <= 'z')				      \
+	  i = ch - 'a' + 26;						      \
+	else if (ch >= '0' && ch <= '9')				      \
+	  i = ch - '0' + 52;						      \
+	else if (ch == '+')						      \
+	  i = 62;							      \
+	else if (ch == '/')						      \
+	  i = 63;							      \
+	else								      \
+	  {								      \
+	    /* Terminate base64 encoding.  */				      \
+									      \
+	    /* If accumulated data is nonzero, the input is invalid.  */      \
+	    /* Also, partial UTF-16 characters are invalid.  */		      \
+	    if (__builtin_expect (statep->__value.__wch != 0, 0)	      \
+		|| __builtin_expect ((statep->__count >> 3) <= 26, 0))	      \
+	      {								      \
+		STANDARD_FROM_LOOP_ERR_HANDLER ((statep->__count = 0, 1));    \
+	      }								      \
+									      \
+	    if (ch == '-')						      \
+	      inptr++;							      \
+									      \
+	    statep->__count = 0;					      \
+	    continue;							      \
+	  }								      \
+									      \
+	/* Concatenate the base64 integer i to the accumulator.  */	      \
+	shift = (statep->__count >> 3);					      \
+	if (shift > 6)							      \
+	  {								      \
+	    uint32_t wch;						      \
+									      \
+	    shift -= 6;							      \
+	    wch = statep->__value.__wch | (i << shift);			      \
+									      \
+	    if (shift <= 16 && shift > 10)				      \
+	      {								      \
+		/* An UTF-16 character has just been completed.  */	      \
+		uint32_t wc1 = wch >> 16;				      \
+									      \
+		/* UTF-16: When we see a High Surrogate, we must also decode  \
+		   the following Low Surrogate. */			      \
+		if (!(wc1 >= 0xd800 && wc1 < 0xdc00))			      \
+		  {							      \
+		    wch = wch << 16;					      \
+		    shift += 16;					      \
+		    put32 (outptr, wc1);				      \
+		    outptr += 4;					      \
+		  }							      \
+	      }								      \
+	    else if (shift <= 10 && shift > 4)				      \
+	      {								      \
+		/* After a High Surrogate, verify that the next 16 bit	      \
+		   indeed form a Low Surrogate.  */			      \
+		uint32_t wc2 = wch & 0xffff;				      \
+									      \
+		if (! __builtin_expect (wc2 >= 0xdc00 && wc2 < 0xe000, 1))    \
+		  {							      \
+		    STANDARD_FROM_LOOP_ERR_HANDLER ((statep->__count = 0, 1));\
+		  }							      \
+	      }								      \
+									      \
+	    statep->__value.__wch = wch;				      \
+	  }								      \
+	else								      \
+	  {								      \
+	    /* An UTF-16 surrogate pair has just been completed.  */	      \
+	    uint32_t wc1 = (uint32_t) statep->__value.__wch >> 16;	      \
+	    uint32_t wc2 = ((uint32_t) statep->__value.__wch & 0xffff)	      \
+			   | (i >> (6 - shift));			      \
+									      \
+	    statep->__value.__wch = (i << shift) << 26;			      \
+	    shift += 26;						      \
+									      \
+	    assert (wc1 >= 0xd800 && wc1 < 0xdc00);			      \
+	    assert (wc2 >= 0xdc00 && wc2 < 0xe000);			      \
+	    put32 (outptr,						      \
+		   0x10000 + ((wc1 - 0xd800) << 10) + (wc2 - 0xdc00));	      \
+	    outptr += 4;						      \
+	  }								      \
+									      \
+	statep->__count = shift << 3;					      \
+									      \
+	/* Now that we digested the input increment the input pointer.  */    \
+	inptr++;							      \
+      }									      \
+  }
+#define LOOP_NEED_FLAGS
+#define EXTRA_LOOP_DECLS	, mbstate_t *statep
+#include <iconv/loop.c>
+
+
+/* Next, define the conversion from UCS4 to UTF-7.
+   The state is structured as follows:
+     __count bit 2..0: zero
+     __count bit 4..3: shift
+     __count bit 8..5: data
+   Precise meaning:
+     shift      data
+       0         0           not inside base64 encoding
+       1         0           inside base64, no pending bits
+       2       XX00          inside base64, 2 bits known for next byte
+       3       XXXX          inside base64, 4 bits known for next byte
+
+   __count bit 2..0 and __wch are always zero, because this direction
+   never returns __GCONV_INCOMPLETE_INPUT.
+*/
+#define MIN_NEEDED_INPUT	MIN_NEEDED_TO
+#define MAX_NEEDED_INPUT	MAX_NEEDED_TO
+#define MIN_NEEDED_OUTPUT	MIN_NEEDED_FROM
+#define MAX_NEEDED_OUTPUT	MAX_NEEDED_FROM
+#define LOOPFCT			TO_LOOP
+#define BODY \
+  {									      \
+    uint32_t ch = get32 (inptr);					      \
+									      \
+    if ((statep->__count & 0x18) == 0)					      \
+      {									      \
+	/* base64 encoding inactive */					      \
+	if (UTF7_ENCODE_OPTIONAL_CHARS ? isdirect (ch) : isxdirect (ch))      \
+	  {								      \
+	    *outptr++ = (unsigned char) ch;				      \
+	  }								      \
+	else								      \
+	  {								      \
+	    size_t count;						      \
+									      \
+	    if (ch == '+')						      \
+	      count = 2;						      \
+	    else if (ch < 0x10000)					      \
+	      count = 3;						      \
+	    else if (ch < 0x110000)					      \
+	      count = 6;						      \
+	    else							      \
+	      STANDARD_TO_LOOP_ERR_HANDLER (4);				      \
+									      \
+	    if (__glibc_unlikely (outptr + count > outend))		      \
+	      {								      \
+		result = __GCONV_FULL_OUTPUT;				      \
+		break;							      \
+	      }								      \
+									      \
+	    *outptr++ = '+';						      \
+	    if (ch == '+')						      \
+	      *outptr++ = '-';						      \
+	    else if (ch < 0x10000)					      \
+	      {								      \
+		*outptr++ = base64 (ch >> 10);				      \
+		*outptr++ = base64 ((ch >> 4) & 0x3f);			      \
+		statep->__count = ((ch & 15) << 5) | (3 << 3);		      \
+	      }								      \
+	    else if (ch < 0x110000)					      \
+	      {								      \
+		uint32_t ch1 = 0xd800 + ((ch - 0x10000) >> 10);		      \
+		uint32_t ch2 = 0xdc00 + ((ch - 0x10000) & 0x3ff);	      \
+									      \
+		ch = (ch1 << 16) | ch2;					      \
+		*outptr++ = base64 (ch >> 26);				      \
+		*outptr++ = base64 ((ch >> 20) & 0x3f);			      \
+		*outptr++ = base64 ((ch >> 14) & 0x3f);			      \
+		*outptr++ = base64 ((ch >> 8) & 0x3f);			      \
+		*outptr++ = base64 ((ch >> 2) & 0x3f);			      \
+		statep->__count = ((ch & 3) << 7) | (2 << 3);		      \
+	      }								      \
+	    else							      \
+	      abort ();							      \
+	  }								      \
+      }									      \
+    else								      \
+      {									      \
+	/* base64 encoding active */					      \
+	if (UTF7_ENCODE_OPTIONAL_CHARS ? isdirect (ch) : isxdirect (ch))      \
+	  {								      \
+	    /* deactivate base64 encoding */				      \
+	    size_t count;						      \
+									      \
+	    count = ((statep->__count & 0x18) >= 0x10) + isxbase64 (ch) + 1;  \
+	    if (__glibc_unlikely (outptr + count > outend))		      \
+	      {								      \
+		result = __GCONV_FULL_OUTPUT;				      \
+		break;							      \
+	      }								      \
+									      \
+	    if ((statep->__count & 0x18) >= 0x10)			      \
+	      *outptr++ = base64 ((statep->__count >> 3) & ~3);		      \
+	    if (isxbase64 (ch))						      \
+	      *outptr++ = '-';						      \
+	    *outptr++ = (unsigned char) ch;				      \
+	    statep->__count = 0;					      \
+	  }								      \
+	else								      \
+	  {								      \
+	    size_t count;						      \
+									      \
+	    if (ch < 0x10000)						      \
+	      count = ((statep->__count & 0x18) >= 0x10 ? 3 : 2);	      \
+	    else if (ch < 0x110000)					      \
+	      count = ((statep->__count & 0x18) >= 0x18 ? 6 : 5);	      \
+	    else							      \
+	      STANDARD_TO_LOOP_ERR_HANDLER (4);				      \
+									      \
+	    if (__glibc_unlikely (outptr + count > outend))		      \
+	      {								      \
+		result = __GCONV_FULL_OUTPUT;				      \
+		break;							      \
+	      }								      \
+									      \
+	    if (ch < 0x10000)						      \
+	      {								      \
+		switch ((statep->__count >> 3) & 3)			      \
+		  {							      \
+		  case 1:						      \
+		    *outptr++ = base64 (ch >> 10);			      \
+		    *outptr++ = base64 ((ch >> 4) & 0x3f);		      \
+		    statep->__count = ((ch & 15) << 5) | (3 << 3);	      \
+		    break;						      \
+		  case 2:						      \
+		    *outptr++ =						      \
+		      base64 (((statep->__count >> 3) & ~3) | (ch >> 12));    \
+		    *outptr++ = base64 ((ch >> 6) & 0x3f);		      \
+		    *outptr++ = base64 (ch & 0x3f);			      \
+		    statep->__count = (1 << 3);				      \
+		    break;						      \
+		  case 3:						      \
+		    *outptr++ =						      \
+		      base64 (((statep->__count >> 3) & ~3) | (ch >> 14));    \
+		    *outptr++ = base64 ((ch >> 8) & 0x3f);		      \
+		    *outptr++ = base64 ((ch >> 2) & 0x3f);		      \
+		    statep->__count = ((ch & 3) << 7) | (2 << 3);	      \
+		    break;						      \
+		  default:						      \
+		    abort ();						      \
+		  }							      \
+	      }								      \
+	    else if (ch < 0x110000)					      \
+	      {								      \
+		uint32_t ch1 = 0xd800 + ((ch - 0x10000) >> 10);		      \
+		uint32_t ch2 = 0xdc00 + ((ch - 0x10000) & 0x3ff);	      \
+									      \
+		ch = (ch1 << 16) | ch2;					      \
+		switch ((statep->__count >> 3) & 3)			      \
+		  {							      \
+		  case 1:						      \
+		    *outptr++ = base64 (ch >> 26);			      \
+		    *outptr++ = base64 ((ch >> 20) & 0x3f);		      \
+		    *outptr++ = base64 ((ch >> 14) & 0x3f);		      \
+		    *outptr++ = base64 ((ch >> 8) & 0x3f);		      \
+		    *outptr++ = base64 ((ch >> 2) & 0x3f);		      \
+		    statep->__count = ((ch & 3) << 7) | (2 << 3);	      \
+		    break;						      \
+		  case 2:						      \
+		    *outptr++ =						      \
+		      base64 (((statep->__count >> 3) & ~3) | (ch >> 28));    \
+		    *outptr++ = base64 ((ch >> 22) & 0x3f);		      \
+		    *outptr++ = base64 ((ch >> 16) & 0x3f);		      \
+		    *outptr++ = base64 ((ch >> 10) & 0x3f);		      \
+		    *outptr++ = base64 ((ch >> 4) & 0x3f);		      \
+		    statep->__count = ((ch & 15) << 5) | (3 << 3);	      \
+		    break;						      \
+		  case 3:						      \
+		    *outptr++ =						      \
+		      base64 (((statep->__count >> 3) & ~3) | (ch >> 30));    \
+		    *outptr++ = base64 ((ch >> 24) & 0x3f);		      \
+		    *outptr++ = base64 ((ch >> 18) & 0x3f);		      \
+		    *outptr++ = base64 ((ch >> 12) & 0x3f);		      \
+		    *outptr++ = base64 ((ch >> 6) & 0x3f);		      \
+		    *outptr++ = base64 (ch & 0x3f);			      \
+		    statep->__count = (1 << 3);				      \
+		    break;						      \
+		  default:						      \
+		    abort ();						      \
+		  }							      \
+	      }								      \
+	    else							      \
+	      abort ();							      \
+	  }								      \
+      }									      \
+									      \
+    /* Now that we wrote the output increment the input pointer.  */	      \
+    inptr += 4;								      \
+  }
+#define LOOP_NEED_FLAGS
+#define EXTRA_LOOP_DECLS	, mbstate_t *statep
+#include <iconv/loop.c>
+
+
+/* Since this is a stateful encoding we have to provide code which resets
+   the output state to the initial state.  This has to be done during the
+   flushing.  */
+#define EMIT_SHIFT_TO_INIT \
+  if (FROM_DIRECTION)							      \
+    /* Nothing to emit.  */						      \
+    memset (data->__statep, '\0', sizeof (mbstate_t));			      \
+  else									      \
+    {									      \
+      /* The "to UTF-7" direction.  Flush the remaining bits and terminate    \
+	 with a '-' byte.  This will guarantee correct decoding if more	      \
+	 UTF-7 encoded text is added afterwards.  */			      \
+      int state = data->__statep->__count;				      \
+									      \
+      if (state & 0x18)							      \
+	{								      \
+	  /* Deactivate base64 encoding.  */				      \
+	  size_t count = ((state & 0x18) >= 0x10) + 1;			      \
+									      \
+	  if (__glibc_unlikely (outbuf + count > outend))		      \
+	    /* We don't have enough room in the output buffer.  */	      \
+	    status = __GCONV_FULL_OUTPUT;				      \
+	  else								      \
+	    {								      \
+	      /* Write out the shift sequence.  */			      \
+	      if ((state & 0x18) >= 0x10)				      \
+		*outbuf++ = base64 ((state >> 3) & ~3);			      \
+	      *outbuf++ = '-';						      \
+									      \
+	      data->__statep->__count = 0;				      \
+	    }								      \
+	}								      \
+      else								      \
+	data->__statep->__count = 0;					      \
+    }
+
+
+/* Now define the toplevel functions.  */
+#include <iconv/skeleton.c>
-- 
2.28.0



* [PATCH 2/5] Update gconv-modules file
  2020-08-19 23:06 [PATCH 0/5] iconv: module for MODIFIED-UTF-7 Max Gautier
  2020-08-19 23:06 ` [PATCH 1/5] Copy utf-7 module to modified-utf-7 Max Gautier
@ 2020-08-19 23:06 ` Max Gautier
  2020-08-19 23:07 ` [PATCH 3/5] Transform UTF-7 to MODIFIED-UTF-7 Max Gautier
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 60+ messages in thread
From: Max Gautier @ 2020-08-19 23:06 UTC
  To: libc-alpha; +Cc: Max Gautier

---
 iconvdata/gconv-modules | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/iconvdata/gconv-modules b/iconvdata/gconv-modules
index 16d09eb98d..c8b7f747f0 100644
--- a/iconvdata/gconv-modules
+++ b/iconvdata/gconv-modules
@@ -1534,6 +1534,14 @@ alias	UTF7//			UTF-7//
 module	UTF-7//			INTERNAL		UTF-7		1
 module	INTERNAL		UTF-7//			UTF-7		1
 
+#	from			to			module		cost
+alias	M-UTF7//		MODIFIED-UTF-7//
+alias	M-UTF-7//		MODIFIED-UTF-7//
+alias	IMAP-UTF-7//		MODIFIED-UTF-7//
+alias	IMAP-UTF7//		MODIFIED-UTF-7//
+module	MODIFIED-UTF-7//	INTERNAL		MODIFIED-UTF-7	1
+module	INTERNAL		MODIFIED-UTF-7//	MODIFIED-UTF-7	1
+
 #	from			to			module		cost
 module	GB18030//		INTERNAL		GB18030		1
 module	INTERNAL		GB18030//		GB18030		1
-- 
2.28.0



* [PATCH 3/5] Transform UTF-7 to MODIFIED-UTF-7
  2020-08-19 23:06 [PATCH 0/5] iconv: module for MODIFIED-UTF-7 Max Gautier
  2020-08-19 23:06 ` [PATCH 1/5] Copy utf-7 module to modified-utf-7 Max Gautier
  2020-08-19 23:06 ` [PATCH 2/5] Update gconv-modules file Max Gautier
@ 2020-08-19 23:07 ` Max Gautier
  2020-08-19 23:07 ` [PATCH 4/5] Make terminating base64 sequences mandatory Max Gautier
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 60+ messages in thread
From: Max Gautier @ 2020-08-19 23:07 UTC
  To: libc-alpha; +Cc: Max Gautier

* Shift character is '&' instead of '+'
* No "optional direct characters" set
* Modified base64 character set (',' replaces '/')
* Use direct comparisons instead of lookup tables and bitwise operations
---
Regarding the fourth item: if there are reasons to prefer the bitwise
approach, please let me know.
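
For comparison, the two styles can be put side by side. The sketch below
hand-derives a bitmap for the same direct set as the comparison used in
this patch and checks that both forms agree for every code point. The
table values and the names isdirect_cmp, isdirect_tab and
direct_forms_agree are illustrative only; they are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Direct comparison, same expression as in this patch.  */
static int
isdirect_cmp (uint32_t ch)
{
  return ((ch == '\n' || ch == '\t' || ch == '\r')
	  || (ch >= 0x20 && ch <= 0x7e && ch != '&'));
}

/* The same set as a 128-bit bitmap, one bit per ASCII code:
   byte 1 has tab, lf and cr; byte 4 has 0x20..0x27 without '&';
   byte 15 has 0x78..0x7e (0x7f is excluded).  */
static const unsigned char direct_tab[128 / 8] =
  {
    0x00, 0x26, 0x00, 0x00, 0xbf, 0xff, 0xff, 0xff,
    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x7f
  };

/* Table lookup in the style of the original UTF-7 module.  */
static int
isdirect_tab (uint32_t ch)
{
  return (ch < 128 && ((direct_tab[ch >> 3] >> (ch & 7)) & 1));
}

/* Returns 1 if both predicates agree on every Unicode code point.  */
static int
direct_forms_agree (void)
{
  for (uint32_t ch = 0; ch < 0x110000; ch++)
    if (!!isdirect_cmp (ch) != !!isdirect_tab (ch))
      return 0;
  return 1;
}
```

Both forms accept exactly the same characters, so the choice between them
is about readability and generated code, not behaviour.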
 iconvdata/modified-utf-7.c | 97 ++++++++++++--------------------------
 1 file changed, 31 insertions(+), 66 deletions(-)

diff --git a/iconvdata/modified-utf-7.c b/iconvdata/modified-utf-7.c
index fc6a8dfcfd..e6eb784891 100644
--- a/iconvdata/modified-utf-7.c
+++ b/iconvdata/modified-utf-7.c
@@ -1,4 +1,4 @@
-/* Conversion module for UTF-7.
+/* Conversion module for Modified UTF-7.
    Copyright (C) 2000-2020 Free Software Foundation, Inc.
    This file is part of the GNU C Library.
 
@@ -16,12 +16,12 @@
    License along with the GNU C Library; if not, see
    <https://www.gnu.org/licenses/>.  */
 
-/* UTF-7 is a legacy encoding used for transmitting Unicode within the
-   ASCII character set, used primarily by mail agents.  New programs
-   are encouraged to use UTF-8 instead.
+/* Modified UTF-7 is a legacy encoding used for transmitting Unicode within the
+   ASCII character set, used primarily by IMAP server and clients agents.
+   New programs are encouraged to use UTF-8 instead.
 
-   UTF-7 is specified in RFC 2152 (and old RFC 1641, RFC 1642).  The
-   original Base64 encoding is defined in RFC 2045.  */
+   Modified UTF-7 is specified in RFC 3501 as part of the IMAPv4 specification.
+   The original Base64 encoding is defined in RFC 2045.  */
 
 #include <dlfcn.h>
 #include <gconv.h>
@@ -29,64 +29,29 @@
 #include <stdlib.h>
 
 
-/* Define this to 1 if you want the so-called "optional direct" characters
-      ! " # $ % & * ; < = > @ [ ] ^ _ ` { | }
-   to be encoded. Define to 0 if you want them to be passed straight
-   through, like the so-called "direct" characters.
-   We set this to 1 because it's safer.
- */
-#define UTF7_ENCODE_OPTIONAL_CHARS 1
-
-
 /* The set of "direct characters":
    A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr
+   ! " # $ % + * ; < = > @ [ ] ^ _ ` { | }
 */
 
-static const unsigned char direct_tab[128 / 8] =
-  {
-    0x00, 0x26, 0x00, 0x00, 0x81, 0xf3, 0xff, 0x87,
-    0xfe, 0xff, 0xff, 0x07, 0xfe, 0xff, 0xff, 0x07
-  };
-
 static int
 isdirect (uint32_t ch)
 {
-  return (ch < 128 && ((direct_tab[ch >> 3] >> (ch & 7)) & 1));
-}
-
-
-/* The set of "direct and optional direct characters":
-   A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr
-   ! " # $ % & * ; < = > @ [ ] ^ _ ` { | }
-*/
-
-static const unsigned char xdirect_tab[128 / 8] =
-  {
-    0x00, 0x26, 0x00, 0x00, 0xff, 0xf7, 0xff, 0xff,
-    0xff, 0xff, 0xff, 0xef, 0xff, 0xff, 0xff, 0x3f
-  };
-
-static int
-isxdirect (uint32_t ch)
-{
-  return (ch < 128 && ((xdirect_tab[ch >> 3] >> (ch & 7)) & 1));
+  return ((ch == '\n' || ch == '\t' || ch == '\r')
+		  || (ch >= 0x20 && ch <= 0x7e && ch != '&'));
 }
 
-
-/* The set of "extended base64 characters":
-   A-Z a-z 0-9 + / -
+/* The set of "modified base64 characters":
+   A-Z a-z 0-9 + , -
 */
 
-static const unsigned char xbase64_tab[128 / 8] =
-  {
-    0x00, 0x00, 0x00, 0x00, 0x00, 0xa8, 0xff, 0x03,
-    0xfe, 0xff, 0xff, 0x07, 0xfe, 0xff, 0xff, 0x07
-  };
-
 static int
-isxbase64 (uint32_t ch)
+ismbase64 (uint32_t ch)
 {
-  return (ch < 128 && ((xbase64_tab[ch >> 3] >> (ch & 7)) & 1));
+  return ((ch >= 'a' && ch <= 'z')
+			  || (ch >= 'A' && ch <= 'Z')
+			  || (ch >= '0' && ch <= '9')
+			  || (ch == '+' || ch == ','));
 }
 
 
@@ -103,18 +68,18 @@ base64 (unsigned int i)
   else if (i == 62)
     return '+';
   else if (i == 63)
-    return '/';
+    return ',';
   else
     abort ();
 }
 
 
 /* Definitions used in the body of the `gconv' function.  */
-#define CHARSET_NAME		"UTF-7//"
+#define CHARSET_NAME		"MODIFIED-UTF-7//"
 #define DEFINE_INIT		1
 #define DEFINE_FINI		1
-#define FROM_LOOP		from_utf7_loop
-#define TO_LOOP			to_utf7_loop
+#define FROM_LOOP		from_m_utf7_loop
+#define TO_LOOP			to_m_utf7_loop
 #define MIN_NEEDED_FROM		1
 #define MAX_NEEDED_FROM		6
 #define MIN_NEEDED_TO		4
@@ -161,13 +126,13 @@ base64 (unsigned int i)
     if ((statep->__count >> 3) == 0)					      \
       {									      \
 	/* base64 encoding inactive.  */				      \
-	if (isxdirect (ch))						      \
+	if (isdirect (ch))						      \
 	  {								      \
 	    inptr++;							      \
 	    put32 (outptr, ch);						      \
 	    outptr += 4;						      \
 	  }								      \
-	else if (__glibc_likely (ch == '+'))				      \
+	else if (__glibc_likely (ch == '&'))				      \
 	  {								      \
 	    if (__glibc_unlikely (inptr + 2 > inend))			      \
 	      {								      \
@@ -209,7 +174,7 @@ base64 (unsigned int i)
 	  i = ch - '0' + 52;						      \
 	else if (ch == '+')						      \
 	  i = 62;							      \
-	else if (ch == '/')						      \
+	else if (ch == ',')						      \
 	  i = 63;							      \
 	else								      \
 	  {								      \
@@ -323,7 +288,7 @@ base64 (unsigned int i)
     if ((statep->__count & 0x18) == 0)					      \
       {									      \
 	/* base64 encoding inactive */					      \
-	if (UTF7_ENCODE_OPTIONAL_CHARS ? isdirect (ch) : isxdirect (ch))      \
+	if (isdirect (ch))      \
 	  {								      \
 	    *outptr++ = (unsigned char) ch;				      \
 	  }								      \
@@ -331,7 +296,7 @@ base64 (unsigned int i)
 	  {								      \
 	    size_t count;						      \
 									      \
-	    if (ch == '+')						      \
+	    if (ch == '&')						      \
 	      count = 2;						      \
 	    else if (ch < 0x10000)					      \
 	      count = 3;						      \
@@ -346,8 +311,8 @@ base64 (unsigned int i)
 		break;							      \
 	      }								      \
 									      \
-	    *outptr++ = '+';						      \
-	    if (ch == '+')						      \
+	    *outptr++ = '&';						      \
+	    if (ch == '&')						      \
 	      *outptr++ = '-';						      \
 	    else if (ch < 0x10000)					      \
 	      {								      \
@@ -375,12 +340,12 @@ base64 (unsigned int i)
     else								      \
       {									      \
 	/* base64 encoding active */					      \
-	if (UTF7_ENCODE_OPTIONAL_CHARS ? isdirect (ch) : isxdirect (ch))      \
+	if (isdirect (ch))      \
 	  {								      \
 	    /* deactivate base64 encoding */				      \
 	    size_t count;						      \
 									      \
-	    count = ((statep->__count & 0x18) >= 0x10) + isxbase64 (ch) + 1;  \
+	    count = ((statep->__count & 0x18) >= 0x10) + ismbase64 (ch) + 1;  \
 	    if (__glibc_unlikely (outptr + count > outend))		      \
 	      {								      \
 		result = __GCONV_FULL_OUTPUT;				      \
@@ -389,7 +354,7 @@ base64 (unsigned int i)
 									      \
 	    if ((statep->__count & 0x18) >= 0x10)			      \
 	      *outptr++ = base64 ((statep->__count >> 3) & ~3);		      \
-	    if (isxbase64 (ch))						      \
+	    if (ismbase64 (ch))						      \
 	      *outptr++ = '-';						      \
 	    *outptr++ = (unsigned char) ch;				      \
 	    statep->__count = 0;					      \
@@ -499,7 +464,7 @@ base64 (unsigned int i)
     memset (data->__statep, '\0', sizeof (mbstate_t));			      \
   else									      \
     {									      \
-      /* The "to UTF-7" direction.  Flush the remaining bits and terminate    \
+      /* The "to M-UTF-7" direction.  Flush the remaining bits and terminate    \
 	 with a '-' byte.  This will guarantee correct decoding if more	      \
 	 UTF-7 encoded text is added afterwards.  */			      \
       int state = data->__statep->__count;				      \
-- 
2.28.0



* [PATCH 4/5] Make terminating base64 sequences mandatory
  2020-08-19 23:06 [PATCH 0/5] iconv: module for MODIFIED-UTF-7 Max Gautier
                   ` (2 preceding siblings ...)
  2020-08-19 23:07 ` [PATCH 3/5] Transform UTF-7 to MODIFIED-UTF-7 Max Gautier
@ 2020-08-19 23:07 ` Max Gautier
  2020-08-19 23:07 ` [PATCH 5/5] Add test case for MODIFIED-UTF-7 Max Gautier
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 60+ messages in thread
From: Max Gautier @ 2020-08-19 23:07 UTC
  To: libc-alpha; +Cc: Max Gautier

In the modified UTF-7 encoding, unlike in UTF-7, every base64 sequence
MUST be terminated with the '-' character.
MODIFIED-UTF-7 -> INTERNAL: reject unterminated sequences
INTERNAL -> MODIFIED-UTF-7: always terminate the sequences
---
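
The decoding rule can be illustrated outside of gconv with a minimal
sketch. The helper names mb64_value and mutf7_decode are hypothetical, the
output is raw UTF-16 code units, and surrogate pairing is not validated;
the sketch only shows how a missing '-' turns into an error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Value of a modified base64 character, or -1 if C is not one.  */
static int
mb64_value (unsigned char c)
{
  if (c >= 'A' && c <= 'Z') return c - 'A';
  if (c >= 'a' && c <= 'z') return c - 'a' + 26;
  if (c >= '0' && c <= '9') return c - '0' + 52;
  if (c == '+') return 62;
  if (c == ',') return 63;
  return -1;
}

/* Hypothetical helper: decode the NUL-terminated string IN into at most
   OUTCAP UTF-16 code units.  Returns the number of units written, or -1
   on invalid input or lack of space.  */
static long
mutf7_decode (const char *in, uint16_t *out, size_t outcap)
{
  size_t o = 0;

  while (*in != '\0')
    {
      unsigned char c = (unsigned char) *in++;

      if (c != '&')
	{
	  /* Only printable ASCII may appear unencoded.  */
	  if (c < 0x20 || c > 0x7e || o == outcap)
	    return -1;
	  out[o++] = c;
	  continue;
	}
      if (*in == '-')
	{
	  /* "&-" stands for a literal '&'.  */
	  if (o == outcap)
	    return -1;
	  out[o++] = '&';
	  in++;
	  continue;
	}

      uint32_t acc = 0;	/* Bits not yet emitted, right-aligned.  */
      int nbits = 0;	/* Number of valid bits in ACC.  */
      int v;

      while ((v = mb64_value ((unsigned char) *in)) >= 0)
	{
	  in++;
	  acc = (acc << 6) | (uint32_t) v;
	  nbits += 6;
	  if (nbits >= 16)
	    {
	      nbits -= 16;
	      if (o == outcap)
		return -1;
	      out[o++] = (uint16_t) (acc >> nbits);
	      acc &= (1u << nbits) - 1;
	    }
	}

      /* The terminating '-' is required.  Leftover bits must be zero
	 padding only, and shorter than one base64 character.  */
      if (*in != '-' || acc != 0 || nbits >= 6)
	return -1;
      in++;
    }
  return (long) o;
}
```

Here "&U,BTFw-" decodes to U+53F0 U+5317, while the same input without the
trailing '-' is rejected.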
 iconvdata/modified-utf-7.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/iconvdata/modified-utf-7.c b/iconvdata/modified-utf-7.c
index e6eb784891..27cc4a88c3 100644
--- a/iconvdata/modified-utf-7.c
+++ b/iconvdata/modified-utf-7.c
@@ -176,7 +176,7 @@ base64 (unsigned int i)
 	  i = 62;							      \
 	else if (ch == ',')						      \
 	  i = 63;							      \
-	else								      \
+	else if (ch == '-')								      \
 	  {								      \
 	    /* Terminate base64 encoding.  */				      \
 									      \
@@ -188,12 +188,14 @@ base64 (unsigned int i)
 		STANDARD_FROM_LOOP_ERR_HANDLER ((statep->__count = 0, 1));    \
 	      }								      \
 									      \
-	    if (ch == '-')						      \
-	      inptr++;							      \
+		inptr++;							      \
 									      \
 	    statep->__count = 0;					      \
 	    continue;							      \
 	  }								      \
+	else  \
+		STANDARD_FROM_LOOP_ERR_HANDLER ((statep->__count = 0, 1));    \
+		/* Terminating '-' is required */  \
 									      \
 	/* Concatenate the base64 integer i to the accumulator.  */	      \
 	shift = (statep->__count >> 3);					      \
@@ -354,8 +356,7 @@ base64 (unsigned int i)
 									      \
 	    if ((statep->__count & 0x18) >= 0x10)			      \
 	      *outptr++ = base64 ((statep->__count >> 3) & ~3);		      \
-	    if (ismbase64 (ch))						      \
-	      *outptr++ = '-';						      \
+	    *outptr++ = '-';						      \
 	    *outptr++ = (unsigned char) ch;				      \
 	    statep->__count = 0;					      \
 	  }								      \
-- 
2.28.0


^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 5/5] Add test case for MODIFIED-UTF-7
  2020-08-19 23:06 [PATCH 0/5] iconv: module for MODIFIED-UTF-7 Max Gautier
                   ` (3 preceding siblings ...)
  2020-08-19 23:07 ` [PATCH 4/5] Make terminating base64 sequences mandatory Max Gautier
@ 2020-08-19 23:07 ` Max Gautier
  2020-08-20  7:18   ` Andreas Schwab
  2020-08-20  8:03 ` [PATCH 0/5] iconv: module " Florian Weimer
  2021-01-12  9:12 ` [PATCH 0/5] iconv: module for MODIFIED-UTF-7 Florian Weimer
  6 siblings, 1 reply; 60+ messages in thread
From: Max Gautier @ 2020-08-19 23:07 UTC (permalink / raw)
  To: libc-alpha; +Cc: Max Gautier

---
Not sure whether make check tests stateful character sets.
I'm not very familiar with iconv and the glibc build system.
 iconvdata/TESTS                         |  1 +
 iconvdata/testdata/MODIFIED-UTF-7       | 25 +++++++++++++++++++++++++
 iconvdata/testdata/MODIFIED-UTF-7..UTF8 | 25 +++++++++++++++++++++++++
 3 files changed, 51 insertions(+)
 create mode 100644 iconvdata/testdata/MODIFIED-UTF-7
 create mode 100644 iconvdata/testdata/MODIFIED-UTF-7..UTF8

diff --git a/iconvdata/TESTS b/iconvdata/TESTS
index ef3bd43454..096598f20e 100644
--- a/iconvdata/TESTS
+++ b/iconvdata/TESTS
@@ -95,6 +95,7 @@ EUC-TW			EUC-TW			Y	UTF8
 GBK			GBK			Y	UTF8
 BIG5HKSCS		BIG5HKSCS		Y	UTF8
 UTF-7			UTF-7			N	UTF8
+MODIFIED-UTF-7		MODIFIED-UTF-7		N	UTF8
 IBM856			IBM856			N	UTF8
 IBM922			IBM922			Y	UTF8
 IBM930			IBM930			N	UTF8
diff --git a/iconvdata/testdata/MODIFIED-UTF-7 b/iconvdata/testdata/MODIFIED-UTF-7
new file mode 100644
index 0000000000..02d5681dac
--- /dev/null
+++ b/iconvdata/testdata/MODIFIED-UTF-7
@@ -0,0 +1,25 @@
+&EqASGxItEps-       Amharic
+&AQ0-esky      Czech
+Dansk      Danish
+English    English
+Suomi      Finnish
+Fran&AOc-ais   French
+Deutsch    German
+&A5UDuwO7A7cDvQO5A7oDrA-   Greek
+&BeIF0QXoBdkF6g-      Hebrew
+Italiano   Italian
+Norsk      Norwegian
+&BCAEQwRBBEEEOgQ4BDk-    Russian
+Espa&APE-ol    Spanish
+Svenska    Swedish
+&DiAOMg4pDjIORA4XDiI-    Thai
+T&APw-rk&AOc-e     Turkish
+Ti&Hr8-ng Vi&Hsc-t Vietnamese
+&ZeVnLIqe-     Japanese
+&Ti1lhw-       Chinese
+&1VyuAA-       Korean
+
+// The last line of this file is missing the end-of-line terminator
+// on purpose, in order to test that the conversion empties the bit buffer
+// and shifts back to the initial state at the end of the conversion.
+A&ImIDkQ-
diff --git a/iconvdata/testdata/MODIFIED-UTF-7..UTF8 b/iconvdata/testdata/MODIFIED-UTF-7..UTF8
new file mode 100644
index 0000000000..3d6ba8b535
--- /dev/null
+++ b/iconvdata/testdata/MODIFIED-UTF-7..UTF8
@@ -0,0 +1,25 @@
+አማርኛ       Amharic
+česky      Czech
+Dansk      Danish
+English    English
+Suomi      Finnish
+Français   French
+Deutsch    German
+Ελληνικά   Greek
+עברית      Hebrew
+Italiano   Italian
+Norsk      Norwegian
+Русский    Russian
+Español    Spanish
+Svenska    Swedish
+ภาษาไทย    Thai
+Türkçe     Turkish
+Tiếng Việt Vietnamese
+日本語     Japanese
+中文       Chinese
+한글       Korean
+
+// The last line of this file is missing the end-of-line terminator
+// on purpose, in order to test that the conversion empties the bit buffer
+// and shifts back to the initial state at the end of the conversion.
+A≢Α
-- 
2.28.0


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 5/5] Add test case for MODIFIED-UTF-7
  2020-08-19 23:07 ` [PATCH 5/5] Add test case for MODIFIED-UTF-7 Max Gautier
@ 2020-08-20  7:18   ` Andreas Schwab
  2020-08-20 15:40     ` [PATCH v2 " Max Gautier
  0 siblings, 1 reply; 60+ messages in thread
From: Andreas Schwab @ 2020-08-20  7:18 UTC (permalink / raw)
  To: Max Gautier via Libc-alpha; +Cc: Max Gautier

On Aug 20 2020, Max Gautier via Libc-alpha wrote:

> +// The last line of this file is missing the end-of-line terminator
> +// on purpose, in order to test that the conversion empties the bit buffer
> +// and shifts back to the initial state at the end of the conversion.
> +A&ImIDkQ-

That didn't work out; the newline is present nevertheless.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH v2 5/5] Add test case for MODIFIED-UTF-7
  2020-08-20  7:18   ` Andreas Schwab
@ 2020-08-20 15:40     ` Max Gautier
  0 siblings, 0 replies; 60+ messages in thread
From: Max Gautier @ 2020-08-20 15:40 UTC (permalink / raw)
  To: libc-alpha; +Cc: Max Gautier

---
Indeed, I inadvertently added the final newline. Here is the fixed
version.

 iconvdata/TESTS                         |  1 +
 iconvdata/testdata/MODIFIED-UTF-7       | 25 +++++++++++++++++++++++++
 iconvdata/testdata/MODIFIED-UTF-7..UTF8 | 25 +++++++++++++++++++++++++
 3 files changed, 51 insertions(+)
 create mode 100644 iconvdata/testdata/MODIFIED-UTF-7
 create mode 100644 iconvdata/testdata/MODIFIED-UTF-7..UTF8

diff --git a/iconvdata/TESTS b/iconvdata/TESTS
index ef3bd43454..096598f20e 100644
--- a/iconvdata/TESTS
+++ b/iconvdata/TESTS
@@ -95,6 +95,7 @@ EUC-TW			EUC-TW			Y	UTF8
 GBK			GBK			Y	UTF8
 BIG5HKSCS		BIG5HKSCS		Y	UTF8
 UTF-7			UTF-7			N	UTF8
+MODIFIED-UTF-7		MODIFIED-UTF-7		N	UTF8
 IBM856			IBM856			N	UTF8
 IBM922			IBM922			Y	UTF8
 IBM930			IBM930			N	UTF8
diff --git a/iconvdata/testdata/MODIFIED-UTF-7 b/iconvdata/testdata/MODIFIED-UTF-7
new file mode 100644
index 0000000000..4b03e4ae57
--- /dev/null
+++ b/iconvdata/testdata/MODIFIED-UTF-7
@@ -0,0 +1,25 @@
+&EqASGxItEps-       Amharic
+&AQ0-esky      Czech
+Dansk      Danish
+English    English
+Suomi      Finnish
+Fran&AOc-ais   French
+Deutsch    German
+&A5UDuwO7A7cDvQO5A7oDrA-   Greek
+&BeIF0QXoBdkF6g-      Hebrew
+Italiano   Italian
+Norsk      Norwegian
+&BCAEQwRBBEEEOgQ4BDk-    Russian
+Espa&APE-ol    Spanish
+Svenska    Swedish
+&DiAOMg4pDjIORA4XDiI-    Thai
+T&APw-rk&AOc-e     Turkish
+Ti&Hr8-ng Vi&Hsc-t Vietnamese
+&ZeVnLIqe-     Japanese
+&Ti1lhw-       Chinese
+&1VyuAA-       Korean
+
+// The last line of this file is missing the end-of-line terminator
+// on purpose, in order to test that the conversion empties the bit buffer
+// and shifts back to the initial state at the end of the conversion.
+A&ImIDkQ-
\ No newline at end of file
diff --git a/iconvdata/testdata/MODIFIED-UTF-7..UTF8 b/iconvdata/testdata/MODIFIED-UTF-7..UTF8
new file mode 100644
index 0000000000..3b362e578c
--- /dev/null
+++ b/iconvdata/testdata/MODIFIED-UTF-7..UTF8
@@ -0,0 +1,25 @@
+አማርኛ       Amharic
+česky      Czech
+Dansk      Danish
+English    English
+Suomi      Finnish
+Français   French
+Deutsch    German
+Ελληνικά   Greek
+עברית      Hebrew
+Italiano   Italian
+Norsk      Norwegian
+Русский    Russian
+Español    Spanish
+Svenska    Swedish
+ภาษาไทย    Thai
+Türkçe     Turkish
+Tiếng Việt Vietnamese
+日本語     Japanese
+中文       Chinese
+한글       Korean
+
+// The last line of this file is missing the end-of-line terminator
+// on purpose, in order to test that the conversion empties the bit buffer
+// and shifts back to the initial state at the end of the conversion.
+A≢Α
\ No newline at end of file
-- 
2.28.0


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 0/5] iconv: module for MODIFIED-UTF-7
  2020-08-19 23:06 [PATCH 0/5] iconv: module for MODIFIED-UTF-7 Max Gautier
                   ` (4 preceding siblings ...)
  2020-08-19 23:07 ` [PATCH 5/5] Add test case for MODIFIED-UTF-7 Max Gautier
@ 2020-08-20  8:03 ` Florian Weimer
  2020-08-20 15:19   ` Max Gautier
                     ` (2 more replies)
  2021-01-12  9:12 ` [PATCH 0/5] iconv: module for MODIFIED-UTF-7 Florian Weimer
  6 siblings, 3 replies; 60+ messages in thread
From: Florian Weimer @ 2020-08-20  8:03 UTC (permalink / raw)
  To: Max Gautier via Libc-alpha; +Cc: Max Gautier

* Max Gautier via Libc-alpha:

> I am unaware of an official name for the encoding, so I used
> "MODIFIED-UTF-7". There might be better choices, if someone has insights
> on that.

Let's try to get it added to the IANA registry?  It's odd that a
charset defined in an RFC is not already contained in it.

The contact information for the registry seems to have atrophied a
bit.  I will try to figure out the current process.  Historically,
it's been Expert Review, so we shouldn't have to write an RFC for
this.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 0/5] iconv: module for MODIFIED-UTF-7
  2020-08-20  8:03 ` [PATCH 0/5] iconv: module " Florian Weimer
@ 2020-08-20 15:19   ` Max Gautier
  2020-08-20 15:58     ` Florian Weimer
  2020-09-02 15:24   ` Max Gautier
  2021-01-25  9:02   ` [PATCH v3 0/5] iconv: module for IMAP-UTF-7 Max Gautier
  2 siblings, 1 reply; 60+ messages in thread
From: Max Gautier @ 2020-08-20 15:19 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Max Gautier via Libc-alpha

Florian Weimer:
> * Max Gautier via Libc-alpha:
> 
> > I am unaware of an official name for the encoding, so I used
> > "MODIFIED-UTF-7". There might be better choices, if someone has insights
> > on that.
> 
> Let's try to get it added to the IANA registry?  It's odd that a
> charset defined in an RFC is not already contained in it.
> 
> The contact information for the registry seems to have atrophied a
> bit.  I will try to figure out the current process.  Historically,
> it's been Expert Review, so we shouldn't have to write an RFC for
> this.

Is the list here (and the linked RFC) not accurate ? 

[1]: https://www.iana.org/assignments/charset-info/charset-info.xhtml

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 0/5] iconv: module for MODIFIED-UTF-7
  2020-08-20 15:19   ` Max Gautier
@ 2020-08-20 15:58     ` Florian Weimer
  0 siblings, 0 replies; 60+ messages in thread
From: Florian Weimer @ 2020-08-20 15:58 UTC (permalink / raw)
  To: Max Gautier; +Cc: Max Gautier via Libc-alpha

* Max Gautier:

> Florian Weimer:
>> * Max Gautier via Libc-alpha:
>> 
>> > I am unaware of an official name for the encoding, so I used
>> > "MODIFIED-UTF-7". There might be better choices, if someone has insights
>> > on that.
>> 
>> Let's try to get it added to the IANA registry?  It's odd that a
>> charset defined in an RFC is not already contained in it.
>> 
>> The contact information for the registry seems to have atrophied a
>> bit.  I will try to figure out the current process.  Historically,
>> it's been Expert Review, so we shouldn't have to write an RFC for
>> this.
>
> Is the list here (and the linked RFC) not accurate ? 
>
> [1]: https://www.iana.org/assignments/charset-info/charset-info.xhtml

I tried to subscribe earlier today, but have yet to receive a response
from the mailing list software.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 0/5] iconv: module for MODIFIED-UTF-7
  2020-08-20  8:03 ` [PATCH 0/5] iconv: module " Florian Weimer
  2020-08-20 15:19   ` Max Gautier
@ 2020-09-02 15:24   ` Max Gautier
  2020-09-02 20:01     ` Adhemerval Zanella
  2021-01-25  9:02   ` [PATCH v3 0/5] iconv: module for IMAP-UTF-7 Max Gautier
  2 siblings, 1 reply; 60+ messages in thread
From: Max Gautier @ 2020-09-02 15:24 UTC (permalink / raw)
  To: Libc-alpha; +Cc: Max Gautier

* Florian Weimer:
> Let's try to get it added to the IANA registry?  It's odd that a
> charset defined in an RFC is not already contained in it.

While we do that, is there someone who would have time to review the
code itself, so we'll be able to proceed once the charset name gets
registered?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 0/5] iconv: module for MODIFIED-UTF-7
  2020-09-02 15:24   ` Max Gautier
@ 2020-09-02 20:01     ` Adhemerval Zanella
  2020-09-03  9:47       ` Max Gautier
  0 siblings, 1 reply; 60+ messages in thread
From: Adhemerval Zanella @ 2020-09-02 20:01 UTC (permalink / raw)
  To: libc-alpha, Max Gautier

On 02/09/2020 12:24, Max Gautier via Libc-alpha wrote:
> * Florian Weimer:
>> Let's try to get it added to the IANA registry?  It's odd that a
>> charset defined in an RFC is not already contained in it.
> 
> While we do that, is there someone who would have time to review the
> code itself, so we'll be able to proceed once the charset name gets
> registered?
> 

I haven't read the RFC, so I can't comment on whether the resulting patch
matches the standard, but the patchset structure looks good.  Although I am
not very fond of duplicating the utf-7 module, the resulting patch does
make clear how the modified encoding differs from default UTF-7 (and I am
not sure that trying to parametrize iconvdata/utf-7.c really pays off
here).  I noticed only some minor style issues.

My only worry is whether this encoding is used in the wild enough to
justify its inclusion in glibc.  The fact that it has been defined for
about 17 years without anyone taking the trouble to register it with IANA
makes me doubtful that it is really that useful.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 0/5] iconv: module for MODIFIED-UTF-7
  2020-09-02 20:01     ` Adhemerval Zanella
@ 2020-09-03  9:47       ` Max Gautier
  2020-09-03 10:56         ` Andreas Schwab
  0 siblings, 1 reply; 60+ messages in thread
From: Max Gautier @ 2020-09-03  9:47 UTC (permalink / raw)
  To: libc-alpha

On Wed, Sep 02, 2020 at 05:01:14PM -0300, Adhemerval Zanella wrote:
> My only worry is whether this encoding is used in the wild enough to
> justify its inclusion in glibc.  The fact that it has been defined for
> about 17 years without anyone taking the trouble to register it with IANA
> makes me doubtful that it is really that useful.

AFAIK, it's only used by IMAP clients and servers. Searching Google for
"IMAP4 utf-7 usage" shows that some implementations already exist
(PHP, Python, Perl (I think)), along with several questions on Stack
Overflow (over the years, some of them recent) on how to deal with that
UTF-7 variant. So it seems to be a narrow use case, but one that many
people run into. Since I suppose that many languages can interface with
glibc, including modified UTF-7 would avoid workarounds such as converting
to original UTF-7 and then replacing the shift character, replacing '/'
with ',', and that kind of thing.
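
To make that kind of workaround concrete, here is a rough sketch of the
post-processing step, assuming the text has already been converted to
plain UTF-7 by some other means. utf7_to_imap and is_b64 are hypothetical
names, not part of glibc or of this series. The sketch also does not
produce canonical output in every case: plain UTF-7 encoders may
base64-encode printable ASCII characters that RFC 3501 expects to be
written literally.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: standard base64 alphabet as used by UTF-7.  */
static int
is_b64 (unsigned char c)
{
  return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
         || (c >= '0' && c <= '9') || c == '+' || c == '/';
}

/* Hypothetical helper: rewrite the UTF-7 string IN into modified UTF-7
   in OUT (capacity OUTSIZE, including the NUL).  Returns the output
   length, or (size_t) -1 if OUT is too small.  */
static size_t
utf7_to_imap (const char *in, char *out, size_t outsize)
{
  size_t n = 0;
  int in_b64 = 0;

  assert (in != NULL && out != NULL);
  if (outsize == 0)
    return (size_t) -1;

  size_t len = strlen (in);
  for (size_t i = 0; i < len; i++)
    {
      unsigned char c = (unsigned char) in[i];
      char buf[3];
      size_t k = 0;

      if (!in_b64)
        {
          if (c == '+' && in[i + 1] == '-')
            {
              buf[k++] = '+';   /* "+-" is a literal '+'.  */
              i++;
            }
          else if (c == '+')
            {
              buf[k++] = '&';   /* Shift character changes.  */
              in_b64 = 1;
            }
          else if (c == '&')
            {
              buf[k++] = '&';   /* A literal '&' becomes "&-".  */
              buf[k++] = '-';
            }
          else
            buf[k++] = (char) c;
        }
      else if (is_b64 (c))
        buf[k++] = (c == '/') ? ',' : (char) c;
      else
        {
          /* End of the base64 run: the '-' is always written.  */
          buf[k++] = '-';
          in_b64 = 0;
          if (c == '&')
            {
              buf[k++] = '&';
              buf[k++] = '-';
            }
          else if (c != '-')
            buf[k++] = (char) c;
        }

      if (n + k >= outsize)
        return (size_t) -1;
      memcpy (out + n, buf, k);
      n += k;
    }

  if (in_b64)
    {
      /* Unterminated run at end of input: close it explicitly.  */
      if (n + 1 >= outsize)
        return (size_t) -1;
      out[n++] = '-';
    }
  out[n] = '\0';
  return n;
}
```

For example, the UTF-7 string "Fran+AOc-ais" becomes "Fran&AOc-ais", and
an unterminated "+AOk" becomes "&AOk-". Having to carry this around in
every client is what a native module would avoid.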

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 0/5] iconv: module for MODIFIED-UTF-7
  2020-09-03  9:47       ` Max Gautier
@ 2020-09-03 10:56         ` Andreas Schwab
  0 siblings, 0 replies; 60+ messages in thread
From: Andreas Schwab @ 2020-09-03 10:56 UTC (permalink / raw)
  To: Max Gautier; +Cc: libc-alpha

On Sep 03 2020, Max Gautier via Libc-alpha wrote:

> AFAIK, it's only used by IMAP clients and servers. Searching Google for
> "IMAP4 utf-7 usage" shows that some implementations already exist
> (PHP, Python, Perl (I think))

Emacs also implements it (as utf-7-imap), and Gnus uses it.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH v3 0/5] iconv: module for IMAP-UTF-7
  2020-08-20  8:03 ` [PATCH 0/5] iconv: module " Florian Weimer
  2020-08-20 15:19   ` Max Gautier
  2020-09-02 15:24   ` Max Gautier
@ 2021-01-25  9:02   ` Max Gautier
  2021-01-25  9:02     ` [PATCH v3 1/5] Copy utf-7 module to modified-utf-7 Max Gautier
                       ` (6 more replies)
  2 siblings, 7 replies; 60+ messages in thread
From: Max Gautier @ 2021-01-25  9:02 UTC (permalink / raw)
  To: libc-alpha

Here are the updated patches, using the name IMAP-UTF-7.
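
As a rough usage sketch, a client could reach the module through the
standard iconv(3) interface. to_utf8 is a hypothetical helper name. The
charset name "IMAP-UTF-7" is only known to iconv_open once this series is
applied and the module is installed; on other systems the call fails and
the helper returns -1.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert the NUL-terminated string IN from the
   charset FROM to UTF-8, writing a NUL-terminated result to OUT
   (capacity OUTSIZE).  Returns 0 on success, -1 on failure (unknown
   charset, invalid input, or OUT too small).  */
static int
to_utf8 (const char *from, const char *in, char *out, size_t outsize)
{
  assert (from != NULL && in != NULL && out != NULL);
  if (outsize == 0)
    return -1;

  iconv_t cd = iconv_open ("UTF-8", from);
  if (cd == (iconv_t) -1)
    return -1;

  char *inp = (char *) in;      /* iconv does not modify the input.  */
  size_t inleft = strlen (in);
  char *outp = out;
  size_t outleft = outsize - 1; /* Keep room for the NUL.  */
  int rc = 0;

  /* Convert, then flush any pending shift state.  */
  if (iconv (cd, &inp, &inleft, &outp, &outleft) == (size_t) -1
      || iconv (cd, NULL, NULL, &outp, &outleft) == (size_t) -1)
    rc = -1;

  *outp = '\0';
  iconv_close (cd);
  return rc;
}
```

A mailbox name would then be decoded with something like
to_utf8 ("IMAP-UTF-7", "Fran&AOc-ais", buf, sizeof buf); the same helper
works today with "UTF-7" and "Fran+AOc-ais" where that module is
installed.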


^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH v3 1/5] Copy utf-7 module to modified-utf-7
  2021-01-25  9:02   ` [PATCH v3 0/5] iconv: module for IMAP-UTF-7 Max Gautier
@ 2021-01-25  9:02     ` Max Gautier
  2021-01-25  9:31       ` Andreas Schwab
  2021-01-25  9:02     ` [PATCH v3 2/5] Update gconv-modules file Max Gautier
                       ` (5 subsequent siblings)
  6 siblings, 1 reply; 60+ messages in thread
From: Max Gautier @ 2021-01-25  9:02 UTC (permalink / raw)
  To: libc-alpha; +Cc: Max Gautier

---
 iconvdata/Makefile         |   2 +-
 iconvdata/modified-utf-7.c | 531 +++++++++++++++++++++++++++++++++++++
 2 files changed, 532 insertions(+), 1 deletion(-)
 create mode 100644 iconvdata/modified-utf-7.c

diff --git a/iconvdata/Makefile b/iconvdata/Makefile
index c8c532a3e4..7f932e10ed 100644
--- a/iconvdata/Makefile
+++ b/iconvdata/Makefile
@@ -61,7 +61,7 @@ modules	:= ISO8859-1 ISO8859-2 ISO8859-3 ISO8859-4 ISO8859-5		 \
 	   IBM5347 IBM9030 IBM9066 IBM9448 IBM12712 IBM16804             \
 	   IBM1364 IBM1371 IBM1388 IBM1390 IBM1399 ISO_11548-1 MIK BRF	 \
 	   MAC-CENTRALEUROPE KOI8-RU ISO8859-9E				 \
-	   CP770 CP771 CP772 CP773 CP774
+	   CP770 CP771 CP772 CP773 CP774 MODIFIED-UTF-7
 
 # If lazy binding is disabled, use BIND_NOW for the gconv modules.
 ifeq ($(bind-now),yes)
diff --git a/iconvdata/modified-utf-7.c b/iconvdata/modified-utf-7.c
new file mode 100644
index 0000000000..fc6a8dfcfd
--- /dev/null
+++ b/iconvdata/modified-utf-7.c
@@ -0,0 +1,531 @@
+/* Conversion module for UTF-7.
+   Copyright (C) 2000-2020 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+/* UTF-7 is a legacy encoding used for transmitting Unicode within the
+   ASCII character set, used primarily by mail agents.  New programs
+   are encouraged to use UTF-8 instead.
+
+   UTF-7 is specified in RFC 2152 (and old RFC 1641, RFC 1642).  The
+   original Base64 encoding is defined in RFC 2045.  */
+
+#include <dlfcn.h>
+#include <gconv.h>
+#include <stdint.h>
+#include <stdlib.h>
+
+
+/* Define this to 1 if you want the so-called "optional direct" characters
+      ! " # $ % & * ; < = > @ [ ] ^ _ ` { | }
+   to be encoded. Define to 0 if you want them to be passed straight
+   through, like the so-called "direct" characters.
+   We set this to 1 because it's safer.
+ */
+#define UTF7_ENCODE_OPTIONAL_CHARS 1
+
+
+/* The set of "direct characters":
+   A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr
+*/
+
+static const unsigned char direct_tab[128 / 8] =
+  {
+    0x00, 0x26, 0x00, 0x00, 0x81, 0xf3, 0xff, 0x87,
+    0xfe, 0xff, 0xff, 0x07, 0xfe, 0xff, 0xff, 0x07
+  };
+
+static int
+isdirect (uint32_t ch)
+{
+  return (ch < 128 && ((direct_tab[ch >> 3] >> (ch & 7)) & 1));
+}
+
+
+/* The set of "direct and optional direct characters":
+   A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr
+   ! " # $ % & * ; < = > @ [ ] ^ _ ` { | }
+*/
+
+static const unsigned char xdirect_tab[128 / 8] =
+  {
+    0x00, 0x26, 0x00, 0x00, 0xff, 0xf7, 0xff, 0xff,
+    0xff, 0xff, 0xff, 0xef, 0xff, 0xff, 0xff, 0x3f
+  };
+
+static int
+isxdirect (uint32_t ch)
+{
+  return (ch < 128 && ((xdirect_tab[ch >> 3] >> (ch & 7)) & 1));
+}
+
+
+/* The set of "extended base64 characters":
+   A-Z a-z 0-9 + / -
+*/
+
+static const unsigned char xbase64_tab[128 / 8] =
+  {
+    0x00, 0x00, 0x00, 0x00, 0x00, 0xa8, 0xff, 0x03,
+    0xfe, 0xff, 0xff, 0x07, 0xfe, 0xff, 0xff, 0x07
+  };
+
+static int
+isxbase64 (uint32_t ch)
+{
+  return (ch < 128 && ((xbase64_tab[ch >> 3] >> (ch & 7)) & 1));
+}
+
+
+/* Converts a value in the range 0..63 to a base64 encoded char.  */
+static unsigned char
+base64 (unsigned int i)
+{
+  if (i < 26)
+    return i + 'A';
+  else if (i < 52)
+    return i - 26 + 'a';
+  else if (i < 62)
+    return i - 52 + '0';
+  else if (i == 62)
+    return '+';
+  else if (i == 63)
+    return '/';
+  else
+    abort ();
+}
+
+
+/* Definitions used in the body of the `gconv' function.  */
+#define CHARSET_NAME		"UTF-7//"
+#define DEFINE_INIT		1
+#define DEFINE_FINI		1
+#define FROM_LOOP		from_utf7_loop
+#define TO_LOOP			to_utf7_loop
+#define MIN_NEEDED_FROM		1
+#define MAX_NEEDED_FROM		6
+#define MIN_NEEDED_TO		4
+#define MAX_NEEDED_TO		4
+#define ONE_DIRECTION		0
+#define PREPARE_LOOP \
+  mbstate_t saved_state;						      \
+  mbstate_t *statep = data->__statep;
+#define EXTRA_LOOP_ARGS		, statep
+
+
+/* Since we might have to reset input pointer we must be able to save
+   and restore the state.  */
+#define SAVE_RESET_STATE(Save) \
+  if (Save)								      \
+    saved_state = *statep;						      \
+  else									      \
+    *statep = saved_state
+
+
+/* First define the conversion function from UTF-7 to UCS4.
+   The state is structured as follows:
+     __count bit 2..0: zero
+     __count bit 8..3: shift
+     __wch: data
+   Precise meaning:
+     shift      data
+       0         --          not inside base64 encoding
+     1..32  XX..XX00..00     inside base64, (32 - shift) bits pending
+   This state layout is simpler than relying on STORE_REST/UNPACK_BYTES.
+
+   When shift = 0, __wch needs to store at most one lookahead byte (see
+   __GCONV_INCOMPLETE_INPUT below).
+*/
+#define MIN_NEEDED_INPUT	MIN_NEEDED_FROM
+#define MAX_NEEDED_INPUT	MAX_NEEDED_FROM
+#define MIN_NEEDED_OUTPUT	MIN_NEEDED_TO
+#define MAX_NEEDED_OUTPUT	MAX_NEEDED_TO
+#define LOOPFCT			FROM_LOOP
+#define BODY \
+  {									      \
+    uint_fast8_t ch = *inptr;						      \
+									      \
+    if ((statep->__count >> 3) == 0)					      \
+      {									      \
+	/* base64 encoding inactive.  */				      \
+	if (isxdirect (ch))						      \
+	  {								      \
+	    inptr++;							      \
+	    put32 (outptr, ch);						      \
+	    outptr += 4;						      \
+	  }								      \
+	else if (__glibc_likely (ch == '+'))				      \
+	  {								      \
+	    if (__glibc_unlikely (inptr + 2 > inend))			      \
+	      {								      \
+		/* Not enough input available.  */			      \
+		result = __GCONV_INCOMPLETE_INPUT;			      \
+		break;							      \
+	      }								      \
+	    if (inptr[1] == '-')					      \
+	      {								      \
+		inptr += 2;						      \
+		put32 (outptr, ch);					      \
+		outptr += 4;						      \
+	      }								      \
+	    else							      \
+	      {								      \
+		/* Switch into base64 mode.  */				      \
+		inptr++;						      \
+		statep->__count = (32 << 3);				      \
+		statep->__value.__wch = 0;				      \
+	      }								      \
+	  }								      \
+	else								      \
+	  {								      \
+	    /* The input is invalid.  */				      \
+	    STANDARD_FROM_LOOP_ERR_HANDLER (1);				      \
+	  }								      \
+      }									      \
+    else								      \
+      {									      \
+	/* base64 encoding active.  */					      \
+	uint32_t i;							      \
+	int shift;							      \
+									      \
+	if (ch >= 'A' && ch <= 'Z')					      \
+	  i = ch - 'A';							      \
+	else if (ch >= 'a' && ch <= 'z')				      \
+	  i = ch - 'a' + 26;						      \
+	else if (ch >= '0' && ch <= '9')				      \
+	  i = ch - '0' + 52;						      \
+	else if (ch == '+')						      \
+	  i = 62;							      \
+	else if (ch == '/')						      \
+	  i = 63;							      \
+	else								      \
+	  {								      \
+	    /* Terminate base64 encoding.  */				      \
+									      \
+	    /* If accumulated data is nonzero, the input is invalid.  */      \
+	    /* Also, partial UTF-16 characters are invalid.  */		      \
+	    if (__builtin_expect (statep->__value.__wch != 0, 0)	      \
+		|| __builtin_expect ((statep->__count >> 3) <= 26, 0))	      \
+	      {								      \
+		STANDARD_FROM_LOOP_ERR_HANDLER ((statep->__count = 0, 1));    \
+	      }								      \
+									      \
+	    if (ch == '-')						      \
+	      inptr++;							      \
+									      \
+	    statep->__count = 0;					      \
+	    continue;							      \
+	  }								      \
+									      \
+	/* Concatenate the base64 integer i to the accumulator.  */	      \
+	shift = (statep->__count >> 3);					      \
+	if (shift > 6)							      \
+	  {								      \
+	    uint32_t wch;						      \
+									      \
+	    shift -= 6;							      \
+	    wch = statep->__value.__wch | (i << shift);			      \
+									      \
+	    if (shift <= 16 && shift > 10)				      \
+	      {								      \
+		/* An UTF-16 character has just been completed.  */	      \
+		uint32_t wc1 = wch >> 16;				      \
+									      \
+		/* UTF-16: When we see a High Surrogate, we must also decode  \
+		   the following Low Surrogate. */			      \
+		if (!(wc1 >= 0xd800 && wc1 < 0xdc00))			      \
+		  {							      \
+		    wch = wch << 16;					      \
+		    shift += 16;					      \
+		    put32 (outptr, wc1);				      \
+		    outptr += 4;					      \
+		  }							      \
+	      }								      \
+	    else if (shift <= 10 && shift > 4)				      \
+	      {								      \
+		/* After a High Surrogate, verify that the next 16 bit	      \
+		   indeed form a Low Surrogate.  */			      \
+		uint32_t wc2 = wch & 0xffff;				      \
+									      \
+		if (! __builtin_expect (wc2 >= 0xdc00 && wc2 < 0xe000, 1))    \
+		  {							      \
+		    STANDARD_FROM_LOOP_ERR_HANDLER ((statep->__count = 0, 1));\
+		  }							      \
+	      }								      \
+									      \
+	    statep->__value.__wch = wch;				      \
+	  }								      \
+	else								      \
+	  {								      \
+	    /* An UTF-16 surrogate pair has just been completed.  */	      \
+	    uint32_t wc1 = (uint32_t) statep->__value.__wch >> 16;	      \
+	    uint32_t wc2 = ((uint32_t) statep->__value.__wch & 0xffff)	      \
+			   | (i >> (6 - shift));			      \
+									      \
+	    statep->__value.__wch = (i << shift) << 26;			      \
+	    shift += 26;						      \
+									      \
+	    assert (wc1 >= 0xd800 && wc1 < 0xdc00);			      \
+	    assert (wc2 >= 0xdc00 && wc2 < 0xe000);			      \
+	    put32 (outptr,						      \
+		   0x10000 + ((wc1 - 0xd800) << 10) + (wc2 - 0xdc00));	      \
+	    outptr += 4;						      \
+	  }								      \
+									      \
+	statep->__count = shift << 3;					      \
+									      \
+	/* Now that we digested the input increment the input pointer.  */    \
+	inptr++;							      \
+      }									      \
+  }
+#define LOOP_NEED_FLAGS
+#define EXTRA_LOOP_DECLS	, mbstate_t *statep
+#include <iconv/loop.c>
+
+
+/* Next, define the conversion from UCS4 to UTF-7.
+   The state is structured as follows:
+     __count bit 2..0: zero
+     __count bit 4..3: shift
+     __count bit 8..5: data
+   Precise meaning:
+     shift      data
+       0         0           not inside base64 encoding
+       1         0           inside base64, no pending bits
+       2       XX00          inside base64, 2 bits known for next byte
+       3       XXXX          inside base64, 4 bits known for next byte
+
+   __count bit 2..0 and __wch are always zero, because this direction
+   never returns __GCONV_INCOMPLETE_INPUT.
+*/
+#define MIN_NEEDED_INPUT	MIN_NEEDED_TO
+#define MAX_NEEDED_INPUT	MAX_NEEDED_TO
+#define MIN_NEEDED_OUTPUT	MIN_NEEDED_FROM
+#define MAX_NEEDED_OUTPUT	MAX_NEEDED_FROM
+#define LOOPFCT			TO_LOOP
+#define BODY \
+  {									      \
+    uint32_t ch = get32 (inptr);					      \
+									      \
+    if ((statep->__count & 0x18) == 0)					      \
+      {									      \
+	/* base64 encoding inactive */					      \
+	if (UTF7_ENCODE_OPTIONAL_CHARS ? isdirect (ch) : isxdirect (ch))      \
+	  {								      \
+	    *outptr++ = (unsigned char) ch;				      \
+	  }								      \
+	else								      \
+	  {								      \
+	    size_t count;						      \
+									      \
+	    if (ch == '+')						      \
+	      count = 2;						      \
+	    else if (ch < 0x10000)					      \
+	      count = 3;						      \
+	    else if (ch < 0x110000)					      \
+	      count = 6;						      \
+	    else							      \
+	      STANDARD_TO_LOOP_ERR_HANDLER (4);				      \
+									      \
+	    if (__glibc_unlikely (outptr + count > outend))		      \
+	      {								      \
+		result = __GCONV_FULL_OUTPUT;				      \
+		break;							      \
+	      }								      \
+									      \
+	    *outptr++ = '+';						      \
+	    if (ch == '+')						      \
+	      *outptr++ = '-';						      \
+	    else if (ch < 0x10000)					      \
+	      {								      \
+		*outptr++ = base64 (ch >> 10);				      \
+		*outptr++ = base64 ((ch >> 4) & 0x3f);			      \
+		statep->__count = ((ch & 15) << 5) | (3 << 3);		      \
+	      }								      \
+	    else if (ch < 0x110000)					      \
+	      {								      \
+		uint32_t ch1 = 0xd800 + ((ch - 0x10000) >> 10);		      \
+		uint32_t ch2 = 0xdc00 + ((ch - 0x10000) & 0x3ff);	      \
+									      \
+		ch = (ch1 << 16) | ch2;					      \
+		*outptr++ = base64 (ch >> 26);				      \
+		*outptr++ = base64 ((ch >> 20) & 0x3f);			      \
+		*outptr++ = base64 ((ch >> 14) & 0x3f);			      \
+		*outptr++ = base64 ((ch >> 8) & 0x3f);			      \
+		*outptr++ = base64 ((ch >> 2) & 0x3f);			      \
+		statep->__count = ((ch & 3) << 7) | (2 << 3);		      \
+	      }								      \
+	    else							      \
+	      abort ();							      \
+	  }								      \
+      }									      \
+    else								      \
+      {									      \
+	/* base64 encoding active */					      \
+	if (UTF7_ENCODE_OPTIONAL_CHARS ? isdirect (ch) : isxdirect (ch))      \
+	  {								      \
+	    /* deactivate base64 encoding */				      \
+	    size_t count;						      \
+									      \
+	    count = ((statep->__count & 0x18) >= 0x10) + isxbase64 (ch) + 1;  \
+	    if (__glibc_unlikely (outptr + count > outend))		      \
+	      {								      \
+		result = __GCONV_FULL_OUTPUT;				      \
+		break;							      \
+	      }								      \
+									      \
+	    if ((statep->__count & 0x18) >= 0x10)			      \
+	      *outptr++ = base64 ((statep->__count >> 3) & ~3);		      \
+	    if (isxbase64 (ch))						      \
+	      *outptr++ = '-';						      \
+	    *outptr++ = (unsigned char) ch;				      \
+	    statep->__count = 0;					      \
+	  }								      \
+	else								      \
+	  {								      \
+	    size_t count;						      \
+									      \
+	    if (ch < 0x10000)						      \
+	      count = ((statep->__count & 0x18) >= 0x10 ? 3 : 2);	      \
+	    else if (ch < 0x110000)					      \
+	      count = ((statep->__count & 0x18) >= 0x18 ? 6 : 5);	      \
+	    else							      \
+	      STANDARD_TO_LOOP_ERR_HANDLER (4);				      \
+									      \
+	    if (__glibc_unlikely (outptr + count > outend))		      \
+	      {								      \
+		result = __GCONV_FULL_OUTPUT;				      \
+		break;							      \
+	      }								      \
+									      \
+	    if (ch < 0x10000)						      \
+	      {								      \
+		switch ((statep->__count >> 3) & 3)			      \
+		  {							      \
+		  case 1:						      \
+		    *outptr++ = base64 (ch >> 10);			      \
+		    *outptr++ = base64 ((ch >> 4) & 0x3f);		      \
+		    statep->__count = ((ch & 15) << 5) | (3 << 3);	      \
+		    break;						      \
+		  case 2:						      \
+		    *outptr++ =						      \
+		      base64 (((statep->__count >> 3) & ~3) | (ch >> 12));    \
+		    *outptr++ = base64 ((ch >> 6) & 0x3f);		      \
+		    *outptr++ = base64 (ch & 0x3f);			      \
+		    statep->__count = (1 << 3);				      \
+		    break;						      \
+		  case 3:						      \
+		    *outptr++ =						      \
+		      base64 (((statep->__count >> 3) & ~3) | (ch >> 14));    \
+		    *outptr++ = base64 ((ch >> 8) & 0x3f);		      \
+		    *outptr++ = base64 ((ch >> 2) & 0x3f);		      \
+		    statep->__count = ((ch & 3) << 7) | (2 << 3);	      \
+		    break;						      \
+		  default:						      \
+		    abort ();						      \
+		  }							      \
+	      }								      \
+	    else if (ch < 0x110000)					      \
+	      {								      \
+		uint32_t ch1 = 0xd800 + ((ch - 0x10000) >> 10);		      \
+		uint32_t ch2 = 0xdc00 + ((ch - 0x10000) & 0x3ff);	      \
+									      \
+		ch = (ch1 << 16) | ch2;					      \
+		switch ((statep->__count >> 3) & 3)			      \
+		  {							      \
+		  case 1:						      \
+		    *outptr++ = base64 (ch >> 26);			      \
+		    *outptr++ = base64 ((ch >> 20) & 0x3f);		      \
+		    *outptr++ = base64 ((ch >> 14) & 0x3f);		      \
+		    *outptr++ = base64 ((ch >> 8) & 0x3f);		      \
+		    *outptr++ = base64 ((ch >> 2) & 0x3f);		      \
+		    statep->__count = ((ch & 3) << 7) | (2 << 3);	      \
+		    break;						      \
+		  case 2:						      \
+		    *outptr++ =						      \
+		      base64 (((statep->__count >> 3) & ~3) | (ch >> 28));    \
+		    *outptr++ = base64 ((ch >> 22) & 0x3f);		      \
+		    *outptr++ = base64 ((ch >> 16) & 0x3f);		      \
+		    *outptr++ = base64 ((ch >> 10) & 0x3f);		      \
+		    *outptr++ = base64 ((ch >> 4) & 0x3f);		      \
+		    statep->__count = ((ch & 15) << 5) | (3 << 3);	      \
+		    break;						      \
+		  case 3:						      \
+		    *outptr++ =						      \
+		      base64 (((statep->__count >> 3) & ~3) | (ch >> 30));    \
+		    *outptr++ = base64 ((ch >> 24) & 0x3f);		      \
+		    *outptr++ = base64 ((ch >> 18) & 0x3f);		      \
+		    *outptr++ = base64 ((ch >> 12) & 0x3f);		      \
+		    *outptr++ = base64 ((ch >> 6) & 0x3f);		      \
+		    *outptr++ = base64 (ch & 0x3f);			      \
+		    statep->__count = (1 << 3);				      \
+		    break;						      \
+		  default:						      \
+		    abort ();						      \
+		  }							      \
+	      }								      \
+	    else							      \
+	      abort ();							      \
+	  }								      \
+      }									      \
+									      \
+    /* Now that we wrote the output increment the input pointer.  */	      \
+    inptr += 4;								      \
+  }
+#define LOOP_NEED_FLAGS
+#define EXTRA_LOOP_DECLS	, mbstate_t *statep
+#include <iconv/loop.c>
+
+
+/* Since this is a stateful encoding we have to provide code which resets
+   the output state to the initial state.  This has to be done during the
+   flushing.  */
+#define EMIT_SHIFT_TO_INIT \
+  if (FROM_DIRECTION)							      \
+    /* Nothing to emit.  */						      \
+    memset (data->__statep, '\0', sizeof (mbstate_t));			      \
+  else									      \
+    {									      \
+      /* The "to UTF-7" direction.  Flush the remaining bits and terminate    \
+	 with a '-' byte.  This will guarantee correct decoding if more	      \
+	 UTF-7 encoded text is added afterwards.  */			      \
+      int state = data->__statep->__count;				      \
+									      \
+      if (state & 0x18)							      \
+	{								      \
+	  /* Deactivate base64 encoding.  */				      \
+	  size_t count = ((state & 0x18) >= 0x10) + 1;			      \
+									      \
+	  if (__glibc_unlikely (outbuf + count > outend))		      \
+	    /* We don't have enough room in the output buffer.  */	      \
+	    status = __GCONV_FULL_OUTPUT;				      \
+	  else								      \
+	    {								      \
+	      /* Write out the shift sequence.  */			      \
+	      if ((state & 0x18) >= 0x10)				      \
+		*outbuf++ = base64 ((state >> 3) & ~3);			      \
+	      *outbuf++ = '-';						      \
+									      \
+	      data->__statep->__count = 0;				      \
+	    }								      \
+	}								      \
+      else								      \
+	data->__statep->__count = 0;					      \
+    }
+
+
+/* Now define the toplevel functions.  */
+#include <iconv/skeleton.c>
-- 
2.30.0



* Re: [PATCH v3 1/5] Copy utf-7 module to modified-utf-7
  2021-01-25  9:02     ` [PATCH v3 1/5] Copy utf-7 module to modified-utf-7 Max Gautier
@ 2021-01-25  9:31       ` Andreas Schwab
  2021-01-25 13:51         ` Max Gautier
  0 siblings, 1 reply; 60+ messages in thread
From: Andreas Schwab @ 2021-01-25  9:31 UTC (permalink / raw)
  To: Max Gautier via Libc-alpha; +Cc: Max Gautier

Why are you still using MODIFIED-UTF-7?

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."


* Re: [PATCH v3 1/5] Copy utf-7 module to modified-utf-7
  2021-01-25  9:31       ` Andreas Schwab
@ 2021-01-25 13:51         ` Max Gautier
  2021-02-07  9:42           ` Florian Weimer
  0 siblings, 1 reply; 60+ messages in thread
From: Max Gautier @ 2021-01-25 13:51 UTC (permalink / raw)
  To: Andreas Schwab via Libc-alpha

I merged the name change into the other patches. I can extract it back
out and merge the relevant bits into this one if needed.


* Re: [PATCH v3 1/5] Copy utf-7 module to modified-utf-7
  2021-01-25 13:51         ` Max Gautier
@ 2021-02-07  9:42           ` Florian Weimer
  2021-02-07 12:29             ` Max Gautier
  2021-12-09  9:31             ` [PATCH v4 0/4] iconv: Add support for UTF-7-IMAP Max Gautier
  0 siblings, 2 replies; 60+ messages in thread
From: Florian Weimer @ 2021-02-07  9:42 UTC (permalink / raw)
  To: Max Gautier; +Cc: Andreas Schwab via Libc-alpha

* Max Gautier via Libc-alpha:

> I merged the name change on the other patches. I can extract them back
> and merged the relevant bits on this one if needed. 

Given that UTF-7 conversion (either variant) is not
performance-critical, I suggest having just one implementation file.

You can use step->__data to keep track of which variant is active.  See
iconvdata/iso646.c for an example.  There is no need to allocate a
separate object; you can store the flag directly in the __data member.
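
As a minimal sketch of storing the flag directly in a pointer-sized
member (this is not the actual glibc code): `struct step_sketch` below
is a stand-in for the real step structure, and the enumerator names and
the two helper functions are hypothetical.

```c
#include <stdint.h>

/* Hypothetical enumerators; the real names may differ.  */
enum variant
{
  UTF7,
  UTF_7_IMAP
};

/* Stand-in for the gconv step structure.  Only a pointer-sized data
   member matters for this illustration.  */
struct step_sketch
{
  void *data;
};

/* Store the variant in the pointer itself, so that no separate object
   has to be allocated or freed.  */
static void
set_variant (struct step_sketch *step, enum variant var)
{
  step->data = (void *) (uintptr_t) var;
}

/* Recover the variant from the pointer value.  */
static enum variant
get_variant (const struct step_sketch *step)
{
  return (enum variant) (uintptr_t) step->data;
}
```

Because nothing is allocated, such a scheme would not need a matching
cleanup step; the conversion loops would only read the value back.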

You could remove the UTF7_ENCODE_OPTIONAL_CHARS from the existing UTF-7
codec in a first, separate patch.

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill


* Re: [PATCH v3 1/5] Copy utf-7 module to modified-utf-7
  2021-02-07  9:42           ` Florian Weimer
@ 2021-02-07 12:29             ` Max Gautier
  2021-02-07 12:34               ` Florian Weimer
  2021-12-09  9:31             ` [PATCH v4 0/4] iconv: Add support for UTF-7-IMAP Max Gautier
  1 sibling, 1 reply; 60+ messages in thread
From: Max Gautier @ 2021-02-07 12:29 UTC (permalink / raw)
  To: Florian Weimer via Libc-alpha; +Cc: Max Gautier

* Florian Weimer via Libc-alpha:
> Given that UTF-7 conversion (either variant) is not
> performance-critical, I suggest to have just one implementation file.
> 
> You can use step->__data to keep track of which variant is active.  See
> iconvdata/iso646.c for an example.  There is no need to allocate a
> separate object; you can store the flag directly in the __data member.

I'll work on that.
I could use some advice on specific parts, starting with the
classification of characters (direct or not, etc.):
- utf-7.c
+ utf-7-imap.c
> -static const unsigned char direct_tab[128 / 8] =
> -  {
> -    0x00, 0x26, 0x00, 0x00, 0x81, 0xf3, 0xff, 0x87,
> -    0xfe, 0xff, 0xff, 0x07, 0xfe, 0xff, 0xff, 0x07
> -  };
> -
>  static int
>  isdirect (uint32_t ch)
>  {
> -  return (ch < 128 && ((direct_tab[ch >> 3] >> (ch & 7)) & 1));
> -}
> -
> -
> -/* The set of "direct and optional direct characters":
> -   A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr
> -   ! " # $ % & * ; < = > @ [ ] ^ _ ` { | }
> -*/
> -
> -static const unsigned char xdirect_tab[128 / 8] =
> -  {
> -    0x00, 0x26, 0x00, 0x00, 0xff, 0xf7, 0xff, 0xff,
> -    0xff, 0xff, 0xff, 0xef, 0xff, 0xff, 0xff, 0x3f
> -  };
> -
> -static int
> -isxdirect (uint32_t ch)
> -{
> -  return (ch < 128 && ((xdirect_tab[ch >> 3] >> (ch & 7)) & 1));
> +  return ((ch == '\n' || ch == '\t' || ch == '\r')
> +		  || (ch >= 0x20 && ch <= 0x7e && ch != '&'));
>  }
>  
> -
> -/* The set of "extended base64 characters":
> -   A-Z a-z 0-9 + / -
> +/* The set of "modified base64 characters":
> +   A-Z a-z 0-9 + , -
>  */
>  
> -static const unsigned char xbase64_tab[128 / 8] =
> -  {
> -    0x00, 0x00, 0x00, 0x00, 0x00, 0xa8, 0xff, 0x03,
> -    0xfe, 0xff, 0xff, 0x07, 0xfe, 0xff, 0xff, 0x07
> -  };
> -
>  static int
> -isxbase64 (uint32_t ch)
> +ismbase64 (uint32_t ch)
>  {
> -  return (ch < 128 && ((xbase64_tab[ch >> 3] >> (ch & 7)) & 1));
> +  return ((ch >= 'a' && ch <= 'z')
> +			  || (ch >= 'A' && ch <= 'Z')
> +			  || (ch >= '0' && ch <= '9')
> +			  || (ch == '+' || ch == ','));
>  }

When I initially looked at utf-7.c, the use of the _tab arrays with
magic values and the subsequent shifting didn't make a lot of sense to
me, which is why I modified them like this for utf-7-imap. If they are
in the same file, it's probably better to use the same method.
So do you see any benefits to keeping the old method ?
Testing directly for the actual characters seems a lot more readable to
me, is shorter, and can be mapped to the RFC definition.

I didn't find the reason for using the current method by looking at the
git history. The only one I could think of is performance, but I don't
see how or what it would improve. If someone has some hints, let me
know.
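
For reference, here is a sketch of what the _tab lookup does, using the
UTF-7 "direct characters" table: each array is a 128-bit bitmap in
which bit (ch & 7) of byte (ch >> 3) is set when ch belongs to the set.
The table and the lookup expression are taken from utf-7.c; the names
`isdirect_table`, `isdirect_explicit`, `between` and `first_mismatch`
are made up for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bitmap from utf-7.c: one bit per ASCII code point.  */
static const unsigned char direct_tab[128 / 8] =
  {
    0x00, 0x26, 0x00, 0x00, 0x81, 0xf3, 0xff, 0x87,
    0xfe, 0xff, 0xff, 0x07, 0xfe, 0xff, 0xff, 0x07
  };

/* Table-based test: select the byte, then the bit inside it.  */
static bool
isdirect_table (uint32_t ch)
{
  return ch < 128 && ((direct_tab[ch >> 3] >> (ch & 7)) & 1);
}

static bool
between (uint32_t ch, uint32_t lower_bound, uint32_t upper_bound)
{
  return ch >= lower_bound && ch <= upper_bound;
}

/* The same set spelled out as comparisons:
   A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr  */
static bool
isdirect_explicit (uint32_t ch)
{
  return (between (ch, 'A', 'Z')
          || between (ch, 'a', 'z')
          || between (ch, '0', '9')
          || ch == '\'' || ch == '(' || ch == ')'
          || between (ch, ',', '/')
          || ch == ':' || ch == '?'
          || ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r');
}

/* Return the first code point on which the two tests disagree,
   or -1 if they agree on every value below 256.  */
static int
first_mismatch (void)
{
  for (uint32_t ch = 0; ch < 256; ch++)
    if (isdirect_table (ch) != isdirect_explicit (ch))
      return (int) ch;
  return -1;
}
```

For example, 0x26 in the second byte sets bits 1, 2 and 5, which
correspond to code points 9, 10 and 13 (tab, lf, cr).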

 
> You could remove the UTF7_ENCODE_OPTIONAL_CHARS from the existing UTF-7
> codec in a first, separate patch.

Do you mean I should modify the utf-7 conversion to not encode the
optional chars ?  That would change the result of utf-7 conversions,
wouldn't it ? I'm not opposed to it, but isn't that going to break
things ?

Thanks

-- 
Max Gautier
mg@max.gautier.name


* Re: [PATCH v3 1/5] Copy utf-7 module to modified-utf-7
  2021-02-07 12:29             ` Max Gautier
@ 2021-02-07 12:34               ` Florian Weimer
  0 siblings, 0 replies; 60+ messages in thread
From: Florian Weimer @ 2021-02-07 12:34 UTC (permalink / raw)
  To: Max Gautier; +Cc: Florian Weimer via Libc-alpha

* Max Gautier via Libc-alpha:

> When I initially looked at utf-7.c, the use of the _tab arrays with
> magic values and the subsequent shifting didn't make a lot of sense to
> me, which is why I modified them like this for utf-7-imap. If they are
> in the same file, it's probably better to use the same method.
> So do you see any benefits to keeping the old method ?

I think the old method was an optimization in the long-gone times when
memory access was relatively fast compared to computation.

GCC has also improved since: it can use shifts and bit lookups to
combine comparisons against multiple constants into one check.

> Testing directly for the actual characters seems a lot more readable to
> me, is shorter, and can be mapped to the RFC definition.

Agreed.

>> You could remove the UTF7_ENCODE_OPTIONAL_CHARS from the existing UTF-7
>> codec in a first, separate patch.
>
> Do you mean I should modify the utf-7 conversion to not encode the
> optional chars ?  That would change the result of utf-7 conversions,
> wouldn't it ? I'm not opposed to it, but isn't that going to break
> things ?

Sorry, I meant to remove the constructs that enable this type of
compile-time configurability.  That is, remove the macro, but do not
change the generated code.

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



* [PATCH v4 0/4] iconv: Add support for UTF-7-IMAP
  2021-02-07  9:42           ` Florian Weimer
  2021-02-07 12:29             ` Max Gautier
@ 2021-12-09  9:31             ` Max Gautier
  2021-12-09  9:31               ` [PATCH v4 1/4] iconv: Always encode "optional direct" UTF-7 characters Max Gautier
                                 ` (6 more replies)
  1 sibling, 7 replies; 60+ messages in thread
From: Max Gautier @ 2021-12-09  9:31 UTC (permalink / raw)
  To: libc-alpha

I finally took the time to work on this again.

This new series implements UTF-7-IMAP in the UTF-7 module, using, as
advised, the same approach as in iso646.c.

Unresolved issues (would appreciate advice on those):
- There is a slight inconsistency (as I read it) in the UTF-7 RFC[1],
  and the current implementation does not follow it exactly:
  In the "UTF-7 Definition/Rule 2":

  "The '+' signals that subsequent octets are to be interpreted as
  elements of the Modified Base64 alphabet until a character not in that
  alphabet is encountered. Such characters include control characters
  such as carriage returns and line feeds"

  The UTF-7 module implements this by making characters '\n', '\r', '\t'
  part of the "direct characters" set, even though they are not
  according to the definition given by the RFC.

  So these characters should be encoded, but should also be interpreted
  literally and implicitly terminate base64 sequences.

  On this, I'm inclined to leave the current behavior as is. Changing it
  might mean breaking things; and I don't see many benefits.

- For UTF-7-IMAP:
  The IMAPv4 RFC (UTF-7-IMAP definition)[2] specifies that :

  - The character "&" (0x26) is represented by the two-octet sequence "&-"
  - null shifts ("-&" while in BASE64; note that "&-" while in US-ASCII
    means "&") are not permitted
  - The purpose of these modifications is to correct the following
    problems with UTF-7:
      ...

      5) UTF-7 permits multiple alternate forms to represent the same
         string; in particular, printable US-ASCII characters can be
         represented in encoded form.

   Consider the following cases:

   A- When encoding to UTF-7-IMAP, if we encounter '&' while in base64
   mode, should we:
       1) encode it in base64
       2) terminate the encoding with '-' and use "&-"
   B- When encoding to UTF-7-IMAP, if we encounter "&&" while in
   us-ascii mode, should we:
       1) start base64 mode and encode the two '&' 
       2) encode them as "&-&-"
   It seems to me that for both A and B, solution 2 allows null shifts,
   and solution 1 allows multiple representations.

   However, A-2 and B-2 still feel cleaner to me, since they avoid
   alternate forms for '&'. The argument can be made that the resulting
   sequences are not null shifts, merely a special case in US-ASCII.
   I've used that approach in PATCH 4/4, but that should be quite easy to
   change if necessary.

- Also, I'm not sure how to add negative test cases, i.e. invalid
  sequences that need to trigger an iconv error.
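
To make options A-2 and B-2 from the UTF-7-IMAP item above concrete,
here is a standalone sketch of an encoder that always writes '&' as
"&-", leaving base64 mode first if needed. It is not the code from
PATCH 4/4: the function names, the UTF-16 input format and the
conservative buffer check are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Modified base64 alphabet from RFC 3501: ',' replaces '/'.  */
static const char mb64[] =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+,";

/* Hypothetical helper, not glibc code.  Encode N UTF-16 code units as a
   NUL-terminated UTF-7-IMAP string.  Return the length written, or
   (size_t) -1 if OUT might be too small (the bound is conservative).  */
static size_t
imap_utf7_encode_sketch (const uint16_t *in, size_t n,
                         char *out, size_t outsz)
{
  size_t o = 0;
  uint32_t bits = 0;    /* Pending bits, right-aligned.  */
  int nbits = 0;        /* Number of pending bits (0, 2 or 4).  */
  bool in_b64 = false;

  assert (out != NULL);
  if (outsz == 0)
    return (size_t) -1;

  for (size_t i = 0; i < n; i++)
    {
      uint16_t c = in[i];

      /* At most 4 bytes per unit, plus 3 for the final flush and NUL.  */
      if (outsz < 8 || o > outsz - 8)
        return (size_t) -1;

      if (c >= 0x20 && c <= 0x7e)
        {
          if (in_b64)
            {
              /* Flush leftover bits, zero-padded, then shift out.  */
              if (nbits > 0)
                out[o++] = mb64[(bits << (6 - nbits)) & 0x3f];
              out[o++] = '-';
              in_b64 = false;
              bits = 0;
              nbits = 0;
            }
          out[o++] = (char) c;
          if (c == '&')
            out[o++] = '-';     /* A-2 and B-2: '&' is always "&-".  */
        }
      else
        {
          if (!in_b64)
            {
              out[o++] = '&';
              in_b64 = true;
            }
          bits = (bits << 16) | c;
          nbits += 16;
          while (nbits >= 6)
            {
              nbits -= 6;
              out[o++] = mb64[(bits >> nbits) & 0x3f];
            }
          bits &= (1u << nbits) - 1;
        }
    }

  if (in_b64)
    {
      if (nbits > 0)
        out[o++] = mb64[(bits << (6 - nbits)) & 0x3f];
      out[o++] = '-';
    }
  out[o] = '\0';
  return o;
}

/* Convenience check: does IN encode to EXPECTED?  */
static bool
encodes_to (const uint16_t *in, size_t n, const char *expected)
{
  char buf[64];
  return (imap_utf7_encode_sketch (in, n, buf, sizeof buf) != (size_t) -1
          && strcmp (buf, expected) == 0);
}
```

With this sketch, the two code units '&' '&' come out as "&-&-" (case
B-2), and 'a' U+53F0 U+5317 '&' comes out as "a&U,BTFw-&-" (case A-2):
the base64 run is closed with '-' before the "&-" is written.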

Thanks for your time.

[1]: https://datatracker.ietf.org/doc/html/rfc2152
[2]: https://datatracker.ietf.org/doc/html/rfc3501#section-5.1.3


* [PATCH v4 1/4] iconv: Always encode "optional direct" UTF-7 characters
  2021-12-09  9:31             ` [PATCH v4 0/4] iconv: Add support for UTF-7-IMAP Max Gautier
@ 2021-12-09  9:31               ` Max Gautier
  2022-03-07 12:10                 ` Adhemerval Zanella
  2021-12-09  9:31               ` [PATCH v4 2/4] iconv: Better mapping to RFC for UTF-7 Max Gautier
                                 ` (5 subsequent siblings)
  6 siblings, 1 reply; 60+ messages in thread
From: Max Gautier @ 2021-12-09  9:31 UTC (permalink / raw)
  To: libc-alpha; +Cc: Max Gautier

Signed-off-by: Max Gautier <mg@max.gautier.name>
---
 iconvdata/utf-7.c | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/iconvdata/utf-7.c b/iconvdata/utf-7.c
index 0ed46c948d..9ba0974959 100644
--- a/iconvdata/utf-7.c
+++ b/iconvdata/utf-7.c
@@ -29,14 +29,6 @@
 #include <stdlib.h>
 
 
-/* Define this to 1 if you want the so-called "optional direct" characters
-      ! " # $ % & * ; < = > @ [ ] ^ _ ` { | }
-   to be encoded. Define to 0 if you want them to be passed straight
-   through, like the so-called "direct" characters.
-   We set this to 1 because it's safer.
- */
-#define UTF7_ENCODE_OPTIONAL_CHARS 1
-
 
 /* The set of "direct characters":
    A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr
@@ -323,7 +315,7 @@ base64 (unsigned int i)
     if ((statep->__count & 0x18) == 0)					      \
       {									      \
 	/* base64 encoding inactive */					      \
-	if (UTF7_ENCODE_OPTIONAL_CHARS ? isdirect (ch) : isxdirect (ch))      \
+	if (isdirect (ch))   						      \
 	  {								      \
 	    *outptr++ = (unsigned char) ch;				      \
 	  }								      \
@@ -375,7 +367,7 @@ base64 (unsigned int i)
     else								      \
       {									      \
 	/* base64 encoding active */					      \
-	if (UTF7_ENCODE_OPTIONAL_CHARS ? isdirect (ch) : isxdirect (ch))      \
+	if (isdirect (ch))						      \
 	  {								      \
 	    /* deactivate base64 encoding */				      \
 	    size_t count;						      \
-- 
2.34.1



* Re: [PATCH v4 1/4] iconv: Always encode "optional direct" UTF-7 characters
  2021-12-09  9:31               ` [PATCH v4 1/4] iconv: Always encode "optional direct" UTF-7 characters Max Gautier
@ 2022-03-07 12:10                 ` Adhemerval Zanella
  0 siblings, 0 replies; 60+ messages in thread
From: Adhemerval Zanella @ 2022-03-07 12:10 UTC (permalink / raw)
  To: Max Gautier, libc-alpha



On 09/12/2021 06:31, Max Gautier via Libc-alpha wrote:
> Signed-off-by: Max Gautier <mg@max.gautier.name>

LGTM, thanks.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>

> ---
>  iconvdata/utf-7.c | 12 ++----------
>  1 file changed, 2 insertions(+), 10 deletions(-)
> 
> diff --git a/iconvdata/utf-7.c b/iconvdata/utf-7.c
> index 0ed46c948d..9ba0974959 100644
> --- a/iconvdata/utf-7.c
> +++ b/iconvdata/utf-7.c
> @@ -29,14 +29,6 @@
>  #include <stdlib.h>
>  
>  
> -/* Define this to 1 if you want the so-called "optional direct" characters
> -      ! " # $ % & * ; < = > @ [ ] ^ _ ` { | }
> -   to be encoded. Define to 0 if you want them to be passed straight
> -   through, like the so-called "direct" characters.
> -   We set this to 1 because it's safer.
> - */
> -#define UTF7_ENCODE_OPTIONAL_CHARS 1
> -
>  
>  /* The set of "direct characters":
>     A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr
> @@ -323,7 +315,7 @@ base64 (unsigned int i)
>      if ((statep->__count & 0x18) == 0)					      \
>        {									      \
>  	/* base64 encoding inactive */					      \
> -	if (UTF7_ENCODE_OPTIONAL_CHARS ? isdirect (ch) : isxdirect (ch))      \
> +	if (isdirect (ch))   						      \
>  	  {								      \
>  	    *outptr++ = (unsigned char) ch;				      \
>  	  }								      \
> @@ -375,7 +367,7 @@ base64 (unsigned int i)
>      else								      \
>        {									      \
>  	/* base64 encoding active */					      \
> -	if (UTF7_ENCODE_OPTIONAL_CHARS ? isdirect (ch) : isxdirect (ch))      \
> +	if (isdirect (ch))						      \
>  	  {								      \
>  	    /* deactivate base64 encoding */				      \
>  	    size_t count;						      \


* [PATCH v4 2/4] iconv: Better mapping to RFC for UTF-7
  2021-12-09  9:31             ` [PATCH v4 0/4] iconv: Add support for UTF-7-IMAP Max Gautier
  2021-12-09  9:31               ` [PATCH v4 1/4] iconv: Always encode "optional direct" UTF-7 characters Max Gautier
@ 2021-12-09  9:31               ` Max Gautier
  2022-03-07 12:14                 ` Adhemerval Zanella
  2022-03-20 16:41                 ` [PATCH v5 " Max Gautier
  2021-12-09  9:31               ` [PATCH v4 3/4] iconv: make utf-7.c able to use variants Max Gautier
                                 ` (4 subsequent siblings)
  6 siblings, 2 replies; 60+ messages in thread
From: Max Gautier @ 2021-12-09  9:31 UTC (permalink / raw)
  To: libc-alpha; +Cc: Max Gautier

- Direct use of characters instead of arcane arrays
- isxbase64 is not the Modified BASE64 alphabet, but the set of characters
  that need to trigger an explicit shift back to US-ASCII. Make that clearer

Signed-off-by: Max Gautier <mg@max.gautier.name>
---
 iconvdata/utf-7.c | 56 +++++++++++++++++++++++++++--------------------
 1 file changed, 32 insertions(+), 24 deletions(-)

diff --git a/iconvdata/utf-7.c b/iconvdata/utf-7.c
index 9ba0974959..ac7d78141a 100644
--- a/iconvdata/utf-7.c
+++ b/iconvdata/utf-7.c
@@ -30,20 +30,27 @@
 
 
 
+static int
+between(uint32_t const ch,
+        uint32_t const lower_bound, uint32_t const upper_bound)
+{
+    return (ch >= lower_bound && ch <= upper_bound);
+}
+
 /* The set of "direct characters":
    A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr
 */
 
-static const unsigned char direct_tab[128 / 8] =
-  {
-    0x00, 0x26, 0x00, 0x00, 0x81, 0xf3, 0xff, 0x87,
-    0xfe, 0xff, 0xff, 0x07, 0xfe, 0xff, 0xff, 0x07
-  };
-
 static int
 isdirect (uint32_t ch)
 {
-  return (ch < 128 && ((direct_tab[ch >> 3] >> (ch & 7)) & 1));
+    return (between(ch, 'A', 'Z')
+	    || between(ch, 'a', 'z')
+	    || between(ch, '0', '9')
+	    || ch == '\'' || ch == '(' || ch == ')'
+	    || between(ch, ',', '/')
+	    || ch == ':' || ch == '?'
+	    || ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r');
 }
 
 
@@ -52,33 +59,33 @@ isdirect (uint32_t ch)
    ! " # $ % & * ; < = > @ [ ] ^ _ ` { | }
 */
 
-static const unsigned char xdirect_tab[128 / 8] =
-  {
-    0x00, 0x26, 0x00, 0x00, 0xff, 0xf7, 0xff, 0xff,
-    0xff, 0xff, 0xff, 0xef, 0xff, 0xff, 0xff, 0x3f
-  };
 
 static int
 isxdirect (uint32_t ch)
 {
-  return (ch < 128 && ((xdirect_tab[ch >> 3] >> (ch & 7)) & 1));
+    return (ch == '\t'
+            || ch == '\n'
+            || ch == '\r'
+            || (between(ch, ' ','}')
+                && ch != '+' && ch != '\\')
+           );
 }
 
 
-/* The set of "extended base64 characters":
+/* Characters which needs to trigger an explicit shift back to US-ASCII (UTF-7
+   only): Modified base64 + '-' (shift back character)
    A-Z a-z 0-9 + / -
 */
 
-static const unsigned char xbase64_tab[128 / 8] =
-  {
-    0x00, 0x00, 0x00, 0x00, 0x00, 0xa8, 0xff, 0x03,
-    0xfe, 0xff, 0xff, 0x07, 0xfe, 0xff, 0xff, 0x07
-  };
-
 static int
-isxbase64 (uint32_t ch)
+needs_explicit_shift (uint32_t ch)
 {
-  return (ch < 128 && ((xbase64_tab[ch >> 3] >> (ch & 7)) & 1));
+  return (between(ch, 'A', 'Z')
+          || between(ch, 'a', 'z')
+          || between(ch, '/', '9')
+          || ch == '+'
+          || ch == '-'
+          );
 }
 
 
@@ -372,7 +379,8 @@ base64 (unsigned int i)
 	    /* deactivate base64 encoding */				      \
 	    size_t count;						      \
 									      \
-	    count = ((statep->__count & 0x18) >= 0x10) + isxbase64 (ch) + 1;  \
+	    count = ((statep->__count & 0x18) >= 0x10)			      \
+	      + needs_explicit_shift (ch) + 1;				      \
 	    if (__glibc_unlikely (outptr + count > outend))		      \
 	      {								      \
 		result = __GCONV_FULL_OUTPUT;				      \
@@ -381,7 +389,7 @@ base64 (unsigned int i)
 									      \
 	    if ((statep->__count & 0x18) >= 0x10)			      \
 	      *outptr++ = base64 ((statep->__count >> 3) & ~3);		      \
-	    if (isxbase64 (ch))						      \
+	    if (needs_explicit_shift (ch))				      \
 	      *outptr++ = '-';						      \
 	    *outptr++ = (unsigned char) ch;				      \
 	    statep->__count = 0;					      \
-- 
2.34.1



* Re: [PATCH v4 2/4] iconv: Better mapping to RFC for UTF-7
  2021-12-09  9:31               ` [PATCH v4 2/4] iconv: Better mapping to RFC for UTF-7 Max Gautier
@ 2022-03-07 12:14                 ` Adhemerval Zanella
  2022-03-20 16:41                 ` [PATCH v5 " Max Gautier
  1 sibling, 0 replies; 60+ messages in thread
From: Adhemerval Zanella @ 2022-03-07 12:14 UTC (permalink / raw)
  To: Max Gautier, libc-alpha



On 09/12/2021 06:31, Max Gautier via Libc-alpha wrote:
> - Direct use of characters instead of arcane arrays
> - isxbase64 is not the Modified BASE64 alphabet, but the characters who
>   needs to trigger an explicit shift back to US-ASCII. Make that clearer
> 
> Signed-off-by: Max Gautier <mg@max.gautier.name>

LGTM with style fixes below.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>

> ---
>  iconvdata/utf-7.c | 56 +++++++++++++++++++++++++++--------------------
>  1 file changed, 32 insertions(+), 24 deletions(-)
> 
> diff --git a/iconvdata/utf-7.c b/iconvdata/utf-7.c
> index 9ba0974959..ac7d78141a 100644
> --- a/iconvdata/utf-7.c
> +++ b/iconvdata/utf-7.c
> @@ -30,20 +30,27 @@
>  
>  
>  
> +static int
> +between(uint32_t const ch,

Space before '(' here and in the other usages below.  Also, 'const' does
not change much here.

> +        uint32_t const lower_bound, uint32_t const upper_bound)
> +{
> +    return (ch >= lower_bound && ch <= upper_bound);
> +}
> +
>  /* The set of "direct characters":
>     A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr
>  */
>  
> -static const unsigned char direct_tab[128 / 8] =
> -  {
> -    0x00, 0x26, 0x00, 0x00, 0x81, 0xf3, 0xff, 0x87,
> -    0xfe, 0xff, 0xff, 0x07, 0xfe, 0xff, 0xff, 0x07
> -  };
> -
>  static int
>  isdirect (uint32_t ch)
>  {
> -  return (ch < 128 && ((direct_tab[ch >> 3] >> (ch & 7)) & 1));
> +    return (between(ch, 'A', 'Z')

Ok, it is indeed clear.

> +	    || between(ch, 'a', 'z')
> +	    || between(ch, '0', '9')
> +	    || ch == '\'' || ch == '(' || ch == ')'
> +	    || between(ch, ',', '/')
> +	    || ch == ':' || ch == '?'
> +	    || ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r');
>  }
>  
>  
> @@ -52,33 +59,33 @@ isdirect (uint32_t ch)
>     ! " # $ % & * ; < = > @ [ ] ^ _ ` { | }
>  */
>  
> -static const unsigned char xdirect_tab[128 / 8] =
> -  {
> -    0x00, 0x26, 0x00, 0x00, 0xff, 0xf7, 0xff, 0xff,
> -    0xff, 0xff, 0xff, 0xef, 0xff, 0xff, 0xff, 0x3f
> -  };
>  
>  static int
>  isxdirect (uint32_t ch)
>  {
> -  return (ch < 128 && ((xdirect_tab[ch >> 3] >> (ch & 7)) & 1));
> +    return (ch == '\t'
> +            || ch == '\n'
> +            || ch == '\r'
> +            || (between(ch, ' ','}')
> +                && ch != '+' && ch != '\\')
> +           );
>  }
>  
>  

Ok.

> -/* The set of "extended base64 characters":
> +/* Characters which needs to trigger an explicit shift back to US-ASCII (UTF-7
> +   only): Modified base64 + '-' (shift back character)
>     A-Z a-z 0-9 + / -
>  */
>  
> -static const unsigned char xbase64_tab[128 / 8] =
> -  {
> -    0x00, 0x00, 0x00, 0x00, 0x00, 0xa8, 0xff, 0x03,
> -    0xfe, 0xff, 0xff, 0x07, 0xfe, 0xff, 0xff, 0x07
> -  };
> -
>  static int
> -isxbase64 (uint32_t ch)
> +needs_explicit_shift (uint32_t ch)
>  {
> -  return (ch < 128 && ((xbase64_tab[ch >> 3] >> (ch & 7)) & 1));
> +  return (between(ch, 'A', 'Z')
> +          || between(ch, 'a', 'z')
> +          || between(ch, '/', '9')
> +          || ch == '+'
> +          || ch == '-'
> +          );
>  }
>  
>  

Ok.

> @@ -372,7 +379,8 @@ base64 (unsigned int i)
>  	    /* deactivate base64 encoding */				      \
>  	    size_t count;						      \
>  									      \
> -	    count = ((statep->__count & 0x18) >= 0x10) + isxbase64 (ch) + 1;  \
> +	    count = ((statep->__count & 0x18) >= 0x10)			      \
> +	      + needs_explicit_shift (ch) + 1;				      \
>  	    if (__glibc_unlikely (outptr + count > outend))		      \
>  	      {								      \
>  		result = __GCONV_FULL_OUTPUT;				      \
> @@ -381,7 +389,7 @@ base64 (unsigned int i)
>  									      \
>  	    if ((statep->__count & 0x18) >= 0x10)			      \
>  	      *outptr++ = base64 ((statep->__count >> 3) & ~3);		      \
> -	    if (isxbase64 (ch))						      \
> +	    if (needs_explicit_shift (ch))				      \
>  	      *outptr++ = '-';						      \
>  	    *outptr++ = (unsigned char) ch;				      \
>  	    statep->__count = 0;					      \

Ok, it just changes the function name.


* [PATCH v5 2/4] iconv: Better mapping to RFC for UTF-7
  2021-12-09  9:31               ` [PATCH v4 2/4] iconv: Better mapping to RFC for UTF-7 Max Gautier
  2022-03-07 12:14                 ` Adhemerval Zanella
@ 2022-03-20 16:41                 ` Max Gautier
  2022-03-21 11:53                   ` Adhemerval Zanella
  1 sibling, 1 reply; 60+ messages in thread
From: Max Gautier @ 2022-03-20 16:41 UTC (permalink / raw)
  To: libc-alpha; +Cc: mg

- Direct use of characters instead of arcane arrays
- isxbase64 is not the Modified BASE64 alphabet, but the set of characters
  that need to trigger an explicit shift back to US-ASCII. Make that clearer

Signed-off-by: Max Gautier <mg@max.gautier.name>
---
 iconvdata/utf-7.c | 64 ++++++++++++++++++++++++-----------------------
 1 file changed, 33 insertions(+), 31 deletions(-)

diff --git a/iconvdata/utf-7.c b/iconvdata/utf-7.c
index 9ba0974959..15f3669ac8 100644
--- a/iconvdata/utf-7.c
+++ b/iconvdata/utf-7.c
@@ -30,20 +30,27 @@
 
 
 
+static bool
+between (uint32_t const ch,
+	 uint32_t const lower_bound, uint32_t const upper_bound)
+{
+  return (ch >= lower_bound && ch <= upper_bound);
+}
+
 /* The set of "direct characters":
    A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr
 */
 
-static const unsigned char direct_tab[128 / 8] =
-  {
-    0x00, 0x26, 0x00, 0x00, 0x81, 0xf3, 0xff, 0x87,
-    0xfe, 0xff, 0xff, 0x07, 0xfe, 0xff, 0xff, 0x07
-  };
-
-static int
-isdirect (uint32_t ch)
+static bool
+isdirect (uint32_t ch, enum variant var)
 {
-  return (ch < 128 && ((direct_tab[ch >> 3] >> (ch & 7)) & 1));
+  return (between (ch, 'A', 'Z')
+	  || between (ch, 'a', 'z')
+	  || between (ch, '0', '9')
+	  || ch == '\'' || ch == '(' || ch == ')'
+	  || between (ch, ',', '/')
+	  || ch == ':' || ch == '?'
+	  || ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r');
 }
 
 
@@ -52,33 +59,27 @@ isdirect (uint32_t ch)
    ! " # $ % & * ; < = > @ [ ] ^ _ ` { | }
 */
 
-static const unsigned char xdirect_tab[128 / 8] =
-  {
-    0x00, 0x26, 0x00, 0x00, 0xff, 0xf7, 0xff, 0xff,
-    0xff, 0xff, 0xff, 0xef, 0xff, 0xff, 0xff, 0x3f
-  };
-
-static int
-isxdirect (uint32_t ch)
+static bool
+isxdirect (uint32_t ch, enum variant var)
 {
-  return (ch < 128 && ((xdirect_tab[ch >> 3] >> (ch & 7)) & 1));
+  return (ch == '\t'
+	  || ch == '\n'
+	  || ch == '\r'
+	  || (between (ch, ' ', '}') && ch != '+' && ch != '\\'));
 }
 
 
-/* The set of "extended base64 characters":
+/* Characters which needs to trigger an explicit shift back to US-ASCII (UTF-7
+   only): Modified base64 + '-' (shift back character)
    A-Z a-z 0-9 + / -
 */
 
-static const unsigned char xbase64_tab[128 / 8] =
-  {
-    0x00, 0x00, 0x00, 0x00, 0x00, 0xa8, 0xff, 0x03,
-    0xfe, 0xff, 0xff, 0x07, 0xfe, 0xff, 0xff, 0x07
-  };
-
-static int
-isxbase64 (uint32_t ch)
+static bool
+needs_explicit_shift (uint32_t ch)
 {
-  return (ch < 128 && ((xbase64_tab[ch >> 3] >> (ch & 7)) & 1));
+  return (between (ch, 'A', 'Z')
+	  || between (ch, 'a', 'z')
+	  || between (ch, '/', '9') || ch == '+' || ch == '-');
 }
 
 
@@ -252,7 +253,7 @@ base64 (unsigned int i)
 		   indeed form a Low Surrogate.  */			      \
 		uint32_t wc2 = wch & 0xffff;				      \
 									      \
-		if (! __builtin_expect (wc2 >= 0xdc00 && wc2 < 0xe000, 1))    \
+		if (! __glibc_likely (wc2 >= 0xdc00 && wc2 < 0xe000))	      \
 		  {							      \
 		    STANDARD_FROM_LOOP_ERR_HANDLER ((statep->__count = 0, 1));\
 		  }							      \
@@ -372,7 +373,8 @@ base64 (unsigned int i)
 	    /* deactivate base64 encoding */				      \
 	    size_t count;						      \
 									      \
-	    count = ((statep->__count & 0x18) >= 0x10) + isxbase64 (ch) + 1;  \
+	    count = ((statep->__count & 0x18) >= 0x10)			      \
+	      + needs_explicit_shift (ch) + 1;				      \
 	    if (__glibc_unlikely (outptr + count > outend))		      \
 	      {								      \
 		result = __GCONV_FULL_OUTPUT;				      \
@@ -381,7 +383,7 @@ base64 (unsigned int i)
 									      \
 	    if ((statep->__count & 0x18) >= 0x10)			      \
 	      *outptr++ = base64 ((statep->__count >> 3) & ~3);		      \
-	    if (isxbase64 (ch))						      \
+	    if (needs_explicit_shift (ch))				      \
 	      *outptr++ = '-';						      \
 	    *outptr++ = (unsigned char) ch;				      \
 	    statep->__count = 0;					      \
-- 
2.35.1
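
The refactoring above replaces bitmap lookups with explicit range tests, so the two forms should accept exactly the same characters. The following is a minimal standalone sketch of how that could be checked, not part of the patch: the `_old`/`_new` suffixes and `check_equivalence` are hypothetical names, and the predicates are written without the `enum variant` parameter since their bodies do not use it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bitmaps removed by the patch: bit (ch & 7) of byte (ch >> 3) is set
   when CH belongs to the set.  */
static const unsigned char direct_tab[128 / 8] =
  {
    0x00, 0x26, 0x00, 0x00, 0x81, 0xf3, 0xff, 0x87,
    0xfe, 0xff, 0xff, 0x07, 0xfe, 0xff, 0xff, 0x07
  };

static const unsigned char xbase64_tab[128 / 8] =
  {
    0x00, 0x00, 0x00, 0x00, 0x00, 0xa8, 0xff, 0x03,
    0xfe, 0xff, 0xff, 0x07, 0xfe, 0xff, 0xff, 0x07
  };

static bool
isdirect_old (uint32_t ch)
{
  return ch < 128 && ((direct_tab[ch >> 3] >> (ch & 7)) & 1);
}

static bool
isxbase64_old (uint32_t ch)
{
  return ch < 128 && ((xbase64_tab[ch >> 3] >> (ch & 7)) & 1);
}

/* Range-based replacements, as introduced by the patch.  */
static bool
between (uint32_t ch, uint32_t lower_bound, uint32_t upper_bound)
{
  return ch >= lower_bound && ch <= upper_bound;
}

static bool
isdirect_new (uint32_t ch)
{
  return (between (ch, 'A', 'Z')
	  || between (ch, 'a', 'z')
	  || between (ch, '0', '9')
	  || ch == '\'' || ch == '(' || ch == ')'
	  || between (ch, ',', '/')
	  || ch == ':' || ch == '?'
	  || ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r');
}

static bool
needs_explicit_shift (uint32_t ch)
{
  return (between (ch, 'A', 'Z')
	  || between (ch, 'a', 'z')
	  || between (ch, '/', '9') || ch == '+' || ch == '-');
}

/* Hypothetical helper: aborts on the first code point where an old
   table and its replacement disagree.  */
static void
check_equivalence (void)
{
  for (uint32_t ch = 0; ch < 0x110000; ch++)
    {
      assert (isdirect_old (ch) == isdirect_new (ch));
      assert (isxbase64_old (ch) == needs_explicit_shift (ch));
    }
}
```

Calling `check_equivalence ()` from a small test program walks the whole Unicode range; the sketch assumes an ASCII execution character set, as the character literals in the patch do.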



* Re: [PATCH v5 2/4] iconv: Better mapping to RFC for UTF-7
  2022-03-20 16:41                 ` [PATCH v5 " Max Gautier
@ 2022-03-21 11:53                   ` Adhemerval Zanella
  2022-03-21 11:59                     ` Adhemerval Zanella
  0 siblings, 1 reply; 60+ messages in thread
From: Adhemerval Zanella @ 2022-03-21 11:53 UTC (permalink / raw)
  To: Max Gautier, libc-alpha



On 20/03/2022 13:41, Max Gautier via Libc-alpha wrote:
> - Direct use of characters instead of arcane arrays
> - isxbase64 is not the Modified BASE64 alphabet, but the characters who
>   needs to trigger an explicit shift back to US-ASCII. Make that clearer
> 
> Signed-off-by: Max Gautier <mg@max.gautier.name>


LGTM, thanks.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>

> ---
>  iconvdata/utf-7.c | 64 ++++++++++++++++++++++++-----------------------
>  1 file changed, 33 insertions(+), 31 deletions(-)
> 
> diff --git a/iconvdata/utf-7.c b/iconvdata/utf-7.c
> index 9ba0974959..15f3669ac8 100644
> --- a/iconvdata/utf-7.c
> +++ b/iconvdata/utf-7.c
> @@ -30,20 +30,27 @@
>  
>  
>  
> +static bool
> +between (uint32_t const ch,
> +	 uint32_t const lower_bound, uint32_t const upper_bound)
> +{
> +  return (ch >= lower_bound && ch <= upper_bound);
> +}
> +
>  /* The set of "direct characters":
>     A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr
>  */
>  
> -static const unsigned char direct_tab[128 / 8] =
> -  {
> -    0x00, 0x26, 0x00, 0x00, 0x81, 0xf3, 0xff, 0x87,
> -    0xfe, 0xff, 0xff, 0x07, 0xfe, 0xff, 0xff, 0x07
> -  };
> -
> -static int
> -isdirect (uint32_t ch)
> +static bool
> +isdirect (uint32_t ch, enum variant var)
>  {
> -  return (ch < 128 && ((direct_tab[ch >> 3] >> (ch & 7)) & 1));
> +  return (between (ch, 'A', 'Z')
> +	  || between (ch, 'a', 'z')
> +	  || between (ch, '0', '9')
> +	  || ch == '\'' || ch == '(' || ch == ')'
> +	  || between (ch, ',', '/')
> +	  || ch == ':' || ch == '?'
> +	  || ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r');
>  }
>  
>  
> @@ -52,33 +59,27 @@ isdirect (uint32_t ch)
>     ! " # $ % & * ; < = > @ [ ] ^ _ ` { | }
>  */
>  
> -static const unsigned char xdirect_tab[128 / 8] =
> -  {
> -    0x00, 0x26, 0x00, 0x00, 0xff, 0xf7, 0xff, 0xff,
> -    0xff, 0xff, 0xff, 0xef, 0xff, 0xff, 0xff, 0x3f
> -  };
> -
> -static int
> -isxdirect (uint32_t ch)
> +static bool
> +isxdirect (uint32_t ch, enum variant var)
>  {
> -  return (ch < 128 && ((xdirect_tab[ch >> 3] >> (ch & 7)) & 1));
> +  return (ch == '\t'
> +	  || ch == '\n'
> +	  || ch == '\r'
> +	  || (between (ch, ' ', '}') && ch != '+' && ch != '\\'));
>  }
>  
>  
> -/* The set of "extended base64 characters":
> +/* Characters which need to trigger an explicit shift back to US-ASCII (UTF-7
> +   only): Modified base64 + '-' (shift back character)
>     A-Z a-z 0-9 + / -
>  */
>  
> -static const unsigned char xbase64_tab[128 / 8] =
> -  {
> -    0x00, 0x00, 0x00, 0x00, 0x00, 0xa8, 0xff, 0x03,
> -    0xfe, 0xff, 0xff, 0x07, 0xfe, 0xff, 0xff, 0x07
> -  };
> -
> -static int
> -isxbase64 (uint32_t ch)
> +static bool
> +needs_explicit_shift (uint32_t ch)
>  {
> -  return (ch < 128 && ((xbase64_tab[ch >> 3] >> (ch & 7)) & 1));
> +  return (between (ch, 'A', 'Z')
> +	  || between (ch, 'a', 'z')
> +	  || between (ch, '/', '9') || ch == '+' || ch == '-');
>  }
>  
>  
> @@ -252,7 +253,7 @@ base64 (unsigned int i)
>  		   indeed form a Low Surrogate.  */			      \
>  		uint32_t wc2 = wch & 0xffff;				      \
>  									      \
> -		if (! __builtin_expect (wc2 >= 0xdc00 && wc2 < 0xe000, 1))    \
> +		if (! __glibc_likely (wc2 >= 0xdc00 && wc2 < 0xe000))	      \
>  		  {							      \
>  		    STANDARD_FROM_LOOP_ERR_HANDLER ((statep->__count = 0, 1));\
>  		  }							      \
> @@ -372,7 +373,8 @@ base64 (unsigned int i)
>  	    /* deactivate base64 encoding */				      \
>  	    size_t count;						      \
>  									      \
> -	    count = ((statep->__count & 0x18) >= 0x10) + isxbase64 (ch) + 1;  \
> +	    count = ((statep->__count & 0x18) >= 0x10)			      \
> +	      + needs_explicit_shift (ch) + 1;				      \
>  	    if (__glibc_unlikely (outptr + count > outend))		      \
>  	      {								      \
>  		result = __GCONV_FULL_OUTPUT;				      \
> @@ -381,7 +383,7 @@ base64 (unsigned int i)
>  									      \
>  	    if ((statep->__count & 0x18) >= 0x10)			      \
>  	      *outptr++ = base64 ((statep->__count >> 3) & ~3);		      \
> -	    if (isxbase64 (ch))						      \
> +	    if (needs_explicit_shift (ch))				      \
>  	      *outptr++ = '-';						      \
>  	    *outptr++ = (unsigned char) ch;				      \
>  	    statep->__count = 0;					      \


* Re: [PATCH v5 2/4] iconv: Better mapping to RFC for UTF-7
  2022-03-21 11:53                   ` Adhemerval Zanella
@ 2022-03-21 11:59                     ` Adhemerval Zanella
  2022-03-21 12:06                       ` Adhemerval Zanella
  2022-03-21 14:07                       ` Max Gautier
  0 siblings, 2 replies; 60+ messages in thread
From: Adhemerval Zanella @ 2022-03-21 11:59 UTC (permalink / raw)
  To: Max Gautier, libc-alpha



On 21/03/2022 08:53, Adhemerval Zanella wrote:
> 
> 
> On 20/03/2022 13:41, Max Gautier via Libc-alpha wrote:
>> - Direct use of characters instead of arcane arrays
>> - isxbase64 is not the Modified BASE64 alphabet, but the characters who
>>   needs to trigger an explicit shift back to US-ASCII. Make that clearer
>>
>> Signed-off-by: Max Gautier <mg@max.gautier.name>
> 
> 
> LGTM, thanks.
> 
> Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
> 
>> ---
>>  iconvdata/utf-7.c | 64 ++++++++++++++++++++++++-----------------------
>>  1 file changed, 33 insertions(+), 31 deletions(-)
>>
>> diff --git a/iconvdata/utf-7.c b/iconvdata/utf-7.c
>> index 9ba0974959..15f3669ac8 100644
>> --- a/iconvdata/utf-7.c
>> +++ b/iconvdata/utf-7.c
>> @@ -30,20 +30,27 @@
>>  
>>  
>>  
>> +static bool
>> +between (uint32_t const ch,
>> +	 uint32_t const lower_bound, uint32_t const upper_bound)
>> +{
>> +  return (ch >= lower_bound && ch <= upper_bound);
>> +}
>> +
>>  /* The set of "direct characters":
>>     A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr
>>  */
>>  
>> -static const unsigned char direct_tab[128 / 8] =
>> -  {
>> -    0x00, 0x26, 0x00, 0x00, 0x81, 0xf3, 0xff, 0x87,
>> -    0xfe, 0xff, 0xff, 0x07, 0xfe, 0xff, 0xff, 0x07
>> -  };
>> -
>> -static int
>> -isdirect (uint32_t ch)
>> +static bool
>> +isdirect (uint32_t ch, enum variant var)
>>  {

In fact I am seeing this failure:

utf-7.c:45:29: error: ‘enum variant’ declared inside parameter list will not be visible outside of this definition or declaration [-Werror]
   45 | isdirect (uint32_t ch, enum variant var)
      |                             ^~~~~~~

This happens because 'enum variant' is only defined in the next patch.
Usually the best practice is to keep each patch self-consistent, so could
you move the definition into this patch?

Or I can fix it for you before installing, it is up to you.
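
For context on the diagnostic above: in C, an enum tag that first appears inside a parameter list only has prototype scope, so GCC warns that the type will not be visible elsewhere, and -Werror turns that into a build failure. A minimal sketch of the ordering that avoids it follows; `isdirect_demo` and its deliberately reduced character set are hypothetical and only illustrate the declaration order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Declaring the enum at file scope, before any function that takes it
   as a parameter, makes the type visible to every later declaration.  */
enum variant
{
  UTF7,
};

/* Hypothetical, simplified predicate: only lowercase letters and space
   count as direct here.  */
static bool
isdirect_demo (uint32_t ch, enum variant var)
{
  assert (var == UTF7);	/* Only one variant exists in this sketch.  */
  return ch == ' ' || (ch >= 'a' && ch <= 'z');
}
```

If the `enum variant` definition were moved below `isdirect_demo`, GCC would report the same "declared inside parameter list" warning quoted above.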

>> -  return (ch < 128 && ((direct_tab[ch >> 3] >> (ch & 7)) & 1));
>> +  return (between (ch, 'A', 'Z')
>> +	  || between (ch, 'a', 'z')
>> +	  || between (ch, '0', '9')
>> +	  || ch == '\'' || ch == '(' || ch == ')'
>> +	  || between (ch, ',', '/')
>> +	  || ch == ':' || ch == '?'
>> +	  || ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r');
>>  }
>>  
>>  
>> @@ -52,33 +59,27 @@ isdirect (uint32_t ch)
>>     ! " # $ % & * ; < = > @ [ ] ^ _ ` { | }
>>  */
>>  
>> -static const unsigned char xdirect_tab[128 / 8] =
>> -  {
>> -    0x00, 0x26, 0x00, 0x00, 0xff, 0xf7, 0xff, 0xff,
>> -    0xff, 0xff, 0xff, 0xef, 0xff, 0xff, 0xff, 0x3f
>> -  };
>> -
>> -static int
>> -isxdirect (uint32_t ch)
>> +static bool
>> +isxdirect (uint32_t ch, enum variant var)
>>  {
>> -  return (ch < 128 && ((xdirect_tab[ch >> 3] >> (ch & 7)) & 1));
>> +  return (ch == '\t'
>> +	  || ch == '\n'
>> +	  || ch == '\r'
>> +	  || (between (ch, ' ', '}') && ch != '+' && ch != '\\'));
>>  }
>>  
>>  
>> -/* The set of "extended base64 characters":
>> +/* Characters which need to trigger an explicit shift back to US-ASCII (UTF-7
>> +   only): Modified base64 + '-' (shift back character)
>>     A-Z a-z 0-9 + / -
>>  */
>>  
>> -static const unsigned char xbase64_tab[128 / 8] =
>> -  {
>> -    0x00, 0x00, 0x00, 0x00, 0x00, 0xa8, 0xff, 0x03,
>> -    0xfe, 0xff, 0xff, 0x07, 0xfe, 0xff, 0xff, 0x07
>> -  };
>> -
>> -static int
>> -isxbase64 (uint32_t ch)
>> +static bool
>> +needs_explicit_shift (uint32_t ch)
>>  {
>> -  return (ch < 128 && ((xbase64_tab[ch >> 3] >> (ch & 7)) & 1));
>> +  return (between (ch, 'A', 'Z')
>> +	  || between (ch, 'a', 'z')
>> +	  || between (ch, '/', '9') || ch == '+' || ch == '-');
>>  }
>>  
>>  
>> @@ -252,7 +253,7 @@ base64 (unsigned int i)
>>  		   indeed form a Low Surrogate.  */			      \
>>  		uint32_t wc2 = wch & 0xffff;				      \
>>  									      \
>> -		if (! __builtin_expect (wc2 >= 0xdc00 && wc2 < 0xe000, 1))    \
>> +		if (! __glibc_likely (wc2 >= 0xdc00 && wc2 < 0xe000))	      \
>>  		  {							      \
>>  		    STANDARD_FROM_LOOP_ERR_HANDLER ((statep->__count = 0, 1));\
>>  		  }							      \
>> @@ -372,7 +373,8 @@ base64 (unsigned int i)
>>  	    /* deactivate base64 encoding */				      \
>>  	    size_t count;						      \
>>  									      \
>> -	    count = ((statep->__count & 0x18) >= 0x10) + isxbase64 (ch) + 1;  \
>> +	    count = ((statep->__count & 0x18) >= 0x10)			      \
>> +	      + needs_explicit_shift (ch) + 1;				      \
>>  	    if (__glibc_unlikely (outptr + count > outend))		      \
>>  	      {								      \
>>  		result = __GCONV_FULL_OUTPUT;				      \
>> @@ -381,7 +383,7 @@ base64 (unsigned int i)
>>  									      \
>>  	    if ((statep->__count & 0x18) >= 0x10)			      \
>>  	      *outptr++ = base64 ((statep->__count >> 3) & ~3);		      \
>> -	    if (isxbase64 (ch))						      \
>> +	    if (needs_explicit_shift (ch))				      \
>>  	      *outptr++ = '-';						      \
>>  	    *outptr++ = (unsigned char) ch;				      \
>>  	    statep->__count = 0;					      \


* Re: [PATCH v5 2/4] iconv: Better mapping to RFC for UTF-7
  2022-03-21 11:59                     ` Adhemerval Zanella
@ 2022-03-21 12:06                       ` Adhemerval Zanella
  2022-03-21 14:07                       ` Max Gautier
  1 sibling, 0 replies; 60+ messages in thread
From: Adhemerval Zanella @ 2022-03-21 12:06 UTC (permalink / raw)
  To: Max Gautier, libc-alpha



On 21/03/2022 08:59, Adhemerval Zanella wrote:
> 
> 
> On 21/03/2022 08:53, Adhemerval Zanella wrote:
>>
>>
>> On 20/03/2022 13:41, Max Gautier via Libc-alpha wrote:
>>> - Direct use of characters instead of arcane arrays
>>> - isxbase64 is not the Modified BASE64 alphabet, but the characters who
>>>   needs to trigger an explicit shift back to US-ASCII. Make that clearer
>>>
>>> Signed-off-by: Max Gautier <mg@max.gautier.name>
>>
>>
>> LGTM, thanks.
>>
>> Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
>>
>>> ---
>>>  iconvdata/utf-7.c | 64 ++++++++++++++++++++++++-----------------------
>>>  1 file changed, 33 insertions(+), 31 deletions(-)
>>>
>>> diff --git a/iconvdata/utf-7.c b/iconvdata/utf-7.c
>>> index 9ba0974959..15f3669ac8 100644
>>> --- a/iconvdata/utf-7.c
>>> +++ b/iconvdata/utf-7.c
>>> @@ -30,20 +30,27 @@
>>>  
>>>  
>>>  
>>> +static bool
>>> +between (uint32_t const ch,
>>> +	 uint32_t const lower_bound, uint32_t const upper_bound)
>>> +{
>>> +  return (ch >= lower_bound && ch <= upper_bound);
>>> +}
>>> +
>>>  /* The set of "direct characters":
>>>     A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr
>>>  */
>>>  
>>> -static const unsigned char direct_tab[128 / 8] =
>>> -  {
>>> -    0x00, 0x26, 0x00, 0x00, 0x81, 0xf3, 0xff, 0x87,
>>> -    0xfe, 0xff, 0xff, 0x07, 0xfe, 0xff, 0xff, 0x07
>>> -  };
>>> -
>>> -static int
>>> -isdirect (uint32_t ch)
>>> +static bool
>>> +isdirect (uint32_t ch, enum variant var)
>>>  {
> 
> In fact I am seeing this failure:
> 
> utf-7.c:45:29: error: ‘enum variant’ declared inside parameter list will not be visible outside of this definition or declaration [-Werror]
>    45 | isdirect (uint32_t ch, enum variant var)
>       |                             ^~~~~~~
> 
> This happens because 'enum variant' is only defined in the next patch.
> Usually the best practice is to keep each patch self-consistent, so could
> you move the definition into this patch?
> 
> Or I can fix it for you before installing, it is up to you.

Also, neither isdirect nor isxdirect actually requires the variant argument
in this patch.  The obvious fix is:

diff --git a/iconvdata/utf-7.c b/iconvdata/utf-7.c
index 4a89de235a..815b1891c7 100644
--- a/iconvdata/utf-7.c
+++ b/iconvdata/utf-7.c
@@ -42,7 +42,7 @@ between (uint32_t const ch,
 */
 
 static bool
-isdirect (uint32_t ch, enum variant var)
+isdirect (uint32_t ch)
 {
   return (between (ch, 'A', 'Z')
          || between (ch, 'a', 'z')
@@ -60,7 +60,7 @@ isdirect (uint32_t ch, enum variant var)
 */
 
 static bool
-isxdirect (uint32_t ch, enum variant var)
+isxdirect (uint32_t ch)
 {
   return (ch == '\t'
          || ch == '\n'



* Re: [PATCH v5 2/4] iconv: Better mapping to RFC for UTF-7
  2022-03-21 11:59                     ` Adhemerval Zanella
  2022-03-21 12:06                       ` Adhemerval Zanella
@ 2022-03-21 14:07                       ` Max Gautier
  1 sibling, 0 replies; 60+ messages in thread
From: Max Gautier @ 2022-03-21 14:07 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: libc-alpha

On Mon, Mar 21, 2022 at 08:59:27AM -0300, Adhemerval Zanella wrote:
> 
> 
> On 21/03/2022 08:53, Adhemerval Zanella wrote:
> > 
> > 
> > On 20/03/2022 13:41, Max Gautier via Libc-alpha wrote:
> >> - Direct use of characters instead of arcane arrays
> >> - isxbase64 is not the Modified BASE64 alphabet, but the characters who
> >>   needs to trigger an explicit shift back to US-ASCII. Make that clearer
> >>
> >> Signed-off-by: Max Gautier <mg@max.gautier.name>
> > 
> > 
> > LGTM, thanks.
> > 
> > Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
> > 
> >> ---
> >>  iconvdata/utf-7.c | 64 ++++++++++++++++++++++++-----------------------
> >>  1 file changed, 33 insertions(+), 31 deletions(-)
> >>
> >> diff --git a/iconvdata/utf-7.c b/iconvdata/utf-7.c
> >> index 9ba0974959..15f3669ac8 100644
> >> --- a/iconvdata/utf-7.c
> >> +++ b/iconvdata/utf-7.c
> >> @@ -30,20 +30,27 @@
> >>  
> >>  
> >>  
> >> +static bool
> >> +between (uint32_t const ch,
> >> +	 uint32_t const lower_bound, uint32_t const upper_bound)
> >> +{
> >> +  return (ch >= lower_bound && ch <= upper_bound);
> >> +}
> >> +
> >>  /* The set of "direct characters":
> >>     A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr
> >>  */
> >>  
> >> -static const unsigned char direct_tab[128 / 8] =
> >> -  {
> >> -    0x00, 0x26, 0x00, 0x00, 0x81, 0xf3, 0xff, 0x87,
> >> -    0xfe, 0xff, 0xff, 0x07, 0xfe, 0xff, 0xff, 0x07
> >> -  };
> >> -
> >> -static int
> >> -isdirect (uint32_t ch)
> >> +static bool
> >> +isdirect (uint32_t ch, enum variant var)
> >>  {
> 
> In fact I am seeing this failure:
> 
> > utf-7.c:45:29: error: ‘enum variant’ declared inside parameter list will not be visible outside of this definition or declaration [-Werror]
>    45 | isdirect (uint32_t ch, enum variant var)
>       |                             ^~~~~~~
> 
> > This happens because 'enum variant' is only defined in the next patch.
> > Usually the best practice is to keep each patch self-consistent, so could
> > you move the definition into this patch?
> 
> Or I can fix it for you before installing, it is up to you.
> 

I think I mixed up my patches while integrating the corrections and
style fixes you mentioned, sorry.
No problem for me if you fix it before applying.

Thanks !

-- 
Max Gautier


* [PATCH v4 3/4] iconv: make utf-7.c able to use variants
  2021-12-09  9:31             ` [PATCH v4 0/4] iconv: Add support for UTF-7-IMAP Max Gautier
  2021-12-09  9:31               ` [PATCH v4 1/4] iconv: Always encode "optional direct" UTF-7 characters Max Gautier
  2021-12-09  9:31               ` [PATCH v4 2/4] iconv: Better mapping to RFC for UTF-7 Max Gautier
@ 2021-12-09  9:31               ` Max Gautier
  2022-03-07 12:34                 ` Adhemerval Zanella
  2022-03-20 16:42                 ` [PATCH v5 " Max Gautier
  2021-12-09  9:31               ` [PATCH v4 4/4] iconv: Add UTF-7-IMAP variant in utf-7.c Max Gautier
                                 ` (3 subsequent siblings)
  6 siblings, 2 replies; 60+ messages in thread
From: Max Gautier @ 2021-12-09  9:31 UTC (permalink / raw)
  To: libc-alpha; +Cc: Max Gautier

Add infrastructure in utf-7.c to handle variants. The approach comes from
iso646.c.
The variant is determined at gconv_init time and is passed to the
conversion loops as an extra argument.

Signed-off-by: Max Gautier <mg@max.gautier.name>
---
 iconvdata/utf-7.c | 239 +++++++++++++++++++++++++++++++++-------------
 1 file changed, 174 insertions(+), 65 deletions(-)

diff --git a/iconvdata/utf-7.c b/iconvdata/utf-7.c
index ac7d78141a..965d4220f1 100644
--- a/iconvdata/utf-7.c
+++ b/iconvdata/utf-7.c
@@ -29,6 +29,24 @@
 #include <stdlib.h>
 
 
+enum variant
+{
+    UTF7,
+};
+
+/* Must be in the same order as enum variant above.  */
+static const char names[] =
+  "UTF-7//\0"
+  "\0";
+
+static uint32_t
+shift_character(enum variant const var)
+{
+    if (var == UTF7)
+        return '+';
+    else
+        abort();
+}
 
 static int
 between(uint32_t const ch,
@@ -38,37 +56,43 @@ between(uint32_t const ch,
 }
 
 /* The set of "direct characters":
+   FOR UTF-7
    A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr
 */
 
 static int
-isdirect (uint32_t ch)
+isdirect (uint32_t ch, enum variant var)
 {
-    return (between(ch, 'A', 'Z')
-	    || between(ch, 'a', 'z')
-	    || between(ch, '0', '9')
-	    || ch == '\'' || ch == '(' || ch == ')'
-	    || between(ch, ',', '/')
-	    || ch == ':' || ch == '?'
-	    || ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r');
+    if (var == UTF7)
+        return (between(ch, 'A', 'Z')
+                || between(ch, 'a', 'z')
+                || between(ch, '0', '9')
+                || ch == '\'' || ch == '(' || ch == ')'
+                || between(ch, ',', '/')
+                || ch == ':' || ch == '?'
+                || ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r');
+    abort();
 }
 
 
 /* The set of "direct and optional direct characters":
+   (UTF-7 only)
    A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr
    ! " # $ % & * ; < = > @ [ ] ^ _ ` { | }
 */
 
-
 static int
-isxdirect (uint32_t ch)
+isxdirect (uint32_t ch, enum variant var)
 {
-    return (ch == '\t'
-            || ch == '\n'
-            || ch == '\r'
-            || (between(ch, ' ','}')
-                && ch != '+' && ch != '\\')
-           );
+    return(isdirect(ch, var)
+            || (var == UTF7 &&
+                (between(ch, '!', '&')
+                 || ch == '*'
+                 || between(ch, ';', '@')
+                 || (between(ch, '[', '`') && ch != '\\')
+                 || between(ch, '{', '}'))
+                )
+          );
 }
 
 
@@ -91,7 +115,7 @@ needs_explicit_shift (uint32_t ch)
 
 /* Converts a value in the range 0..63 to a base64 encoded char.  */
 static unsigned char
-base64 (unsigned int i)
+base64 (unsigned int i, enum variant var)
 {
   if (i < 26)
     return i + 'A';
@@ -101,7 +125,7 @@ base64 (unsigned int i)
     return i - 52 + '0';
   else if (i == 62)
     return '+';
-  else if (i == 63)
+  else if (i == 63 && var == UTF7)
     return '/';
   else
     abort ();
@@ -109,9 +133,8 @@ base64 (unsigned int i)
 
 
 /* Definitions used in the body of the `gconv' function.  */
-#define CHARSET_NAME		"UTF-7//"
-#define DEFINE_INIT		1
-#define DEFINE_FINI		1
+#define DEFINE_INIT		0
+#define DEFINE_FINI		0
 #define FROM_LOOP		from_utf7_loop
 #define TO_LOOP			to_utf7_loop
 #define MIN_NEEDED_FROM		1
@@ -119,11 +142,27 @@ base64 (unsigned int i)
 #define MIN_NEEDED_TO		4
 #define MAX_NEEDED_TO		4
 #define ONE_DIRECTION		0
+#define FROM_DIRECTION      (dir == from_utf7)
 #define PREPARE_LOOP \
   mbstate_t saved_state;						      \
-  mbstate_t *statep = data->__statep;
-#define EXTRA_LOOP_ARGS		, statep
+  mbstate_t *statep = data->__statep;					      \
+  enum direction dir = ((struct utf7_data *) step->__data)->dir;	      \
+  enum variant var = ((struct utf7_data *) step->__data)->var;
+#define EXTRA_LOOP_ARGS		, statep, var
+
+
+enum direction
+{
+   illegal_dir,
+   from_utf7,
+   to_utf7
+};
 
+struct utf7_data
+{
+  enum direction dir;
+  enum variant var;
+};
 
 /* Since we might have to reset input pointer we must be able to save
    and restore the state.  */
@@ -133,6 +172,72 @@ base64 (unsigned int i)
   else									      \
     *statep = saved_state
 
+extern int gconv_init (struct __gconv_step *step);
+int
+gconv_init (struct __gconv_step *step)
+{
+  /* Determine which direction.  */
+  struct utf7_data *new_data;
+  enum direction dir = illegal_dir;
+
+  enum variant var = 0;
+  for (const char *name = names; *name != '\0';
+       name = __rawmemchr (name, '\0') + 1)
+    {
+      if (__strcasecmp (step->__from_name, name) == 0)
+	{
+	  dir = from_utf7;
+	  break;
+	}
+      else if (__strcasecmp (step->__to_name, name) == 0)
+	{
+	  dir = to_utf7;
+	  break;
+	}
+      ++var;
+    }
+
+  if (__builtin_expect (dir, from_utf7) != illegal_dir)
+  {
+      new_data = malloc (sizeof (*new_data));
+      if (new_data == NULL)
+          return __GCONV_NOMEM;
+
+      new_data->dir = dir;
+      new_data->var = var;
+      step->__data = new_data;
+
+      if (dir == from_utf7)
+      {
+          step->__min_needed_from = MIN_NEEDED_FROM;
+          step->__max_needed_from = MAX_NEEDED_FROM;
+          step->__min_needed_to = MIN_NEEDED_TO;
+          step->__max_needed_to = MAX_NEEDED_TO;
+      }
+      else
+      {
+          step->__min_needed_from = MIN_NEEDED_TO;
+          step->__max_needed_from = MAX_NEEDED_TO;
+          step->__min_needed_to = MIN_NEEDED_FROM;
+          step->__max_needed_to = MAX_NEEDED_FROM;
+      }
+  }
+  else
+    return __GCONV_NOCONV;
+
+  step->__stateful = 1;
+
+  return __GCONV_OK;
+}
+
+extern void gconv_end (struct __gconv_step *data);
+void
+gconv_end (struct __gconv_step *data)
+{
+  free (data->__data);
+}
+
+
 
 /* First define the conversion function from UTF-7 to UCS4.
    The state is structured as follows:
@@ -160,13 +265,13 @@ base64 (unsigned int i)
     if ((statep->__count >> 3) == 0)					      \
       {									      \
 	/* base64 encoding inactive.  */				      \
-	if (isxdirect (ch))						      \
+	if (isxdirect (ch, var))					      \
 	  {								      \
 	    inptr++;							      \
 	    put32 (outptr, ch);						      \
 	    outptr += 4;						      \
 	  }								      \
-	else if (__glibc_likely (ch == '+'))				      \
+	else if (__glibc_likely (ch == shift_character(var)))		      \
 	  {								      \
 	    if (__glibc_unlikely (inptr + 2 > inend))			      \
 	      {								      \
@@ -291,7 +396,7 @@ base64 (unsigned int i)
       }									      \
   }
 #define LOOP_NEED_FLAGS
-#define EXTRA_LOOP_DECLS	, mbstate_t *statep
+#define EXTRA_LOOP_DECLS	, mbstate_t *statep, enum variant var
 #include <iconv/loop.c>
 
 
@@ -322,7 +427,7 @@ base64 (unsigned int i)
     if ((statep->__count & 0x18) == 0)					      \
       {									      \
 	/* base64 encoding inactive */					      \
-	if (isdirect (ch))   						      \
+	if (isdirect (ch, var))						      \
 	  {								      \
 	    *outptr++ = (unsigned char) ch;				      \
 	  }								      \
@@ -330,7 +435,7 @@ base64 (unsigned int i)
 	  {								      \
 	    size_t count;						      \
 									      \
-	    if (ch == '+')						      \
+	    if (ch == shift_character(var))				      \
 	      count = 2;						      \
 	    else if (ch < 0x10000)					      \
 	      count = 3;						      \
@@ -345,13 +450,13 @@ base64 (unsigned int i)
 		break;							      \
 	      }								      \
 									      \
-	    *outptr++ = '+';						      \
-	    if (ch == '+')						      \
+	    *outptr++ = shift_character(var);				      \
+	    if (ch == shift_character(var))				      \
 	      *outptr++ = '-';						      \
 	    else if (ch < 0x10000)					      \
 	      {								      \
-		*outptr++ = base64 (ch >> 10);				      \
-		*outptr++ = base64 ((ch >> 4) & 0x3f);			      \
+		*outptr++ = base64 (ch >> 10, var);			      \
+		*outptr++ = base64 ((ch >> 4) & 0x3f, var);		      \
 		statep->__count = ((ch & 15) << 5) | (3 << 3);		      \
 	      }								      \
 	    else if (ch < 0x110000)					      \
@@ -360,11 +465,11 @@ base64 (unsigned int i)
 		uint32_t ch2 = 0xdc00 + ((ch - 0x10000) & 0x3ff);	      \
 									      \
 		ch = (ch1 << 16) | ch2;					      \
-		*outptr++ = base64 (ch >> 26);				      \
-		*outptr++ = base64 ((ch >> 20) & 0x3f);			      \
-		*outptr++ = base64 ((ch >> 14) & 0x3f);			      \
-		*outptr++ = base64 ((ch >> 8) & 0x3f);			      \
-		*outptr++ = base64 ((ch >> 2) & 0x3f);			      \
+		*outptr++ = base64 (ch >> 26, var);			      \
+		*outptr++ = base64 ((ch >> 20) & 0x3f, var);		      \
+		*outptr++ = base64 ((ch >> 14) & 0x3f, var);		      \
+		*outptr++ = base64 ((ch >> 8) & 0x3f, var);		      \
+		*outptr++ = base64 ((ch >> 2) & 0x3f, var);		      \
 		statep->__count = ((ch & 3) << 7) | (2 << 3);		      \
 	      }								      \
 	    else							      \
@@ -374,7 +479,7 @@ base64 (unsigned int i)
     else								      \
       {									      \
 	/* base64 encoding active */					      \
-	if (isdirect (ch))						      \
+	if (isdirect (ch, var))						      \
 	  {								      \
 	    /* deactivate base64 encoding */				      \
 	    size_t count;						      \
@@ -388,7 +493,7 @@ base64 (unsigned int i)
 	      }								      \
 									      \
 	    if ((statep->__count & 0x18) >= 0x10)			      \
-	      *outptr++ = base64 ((statep->__count >> 3) & ~3);		      \
+	      *outptr++ = base64 ((statep->__count >> 3) & ~3, var);	      \
 	    if (needs_explicit_shift (ch))				      \
 	      *outptr++ = '-';						      \
 	    *outptr++ = (unsigned char) ch;				      \
@@ -416,22 +521,24 @@ base64 (unsigned int i)
 		switch ((statep->__count >> 3) & 3)			      \
 		  {							      \
 		  case 1:						      \
-		    *outptr++ = base64 (ch >> 10);			      \
-		    *outptr++ = base64 ((ch >> 4) & 0x3f);		      \
+		    *outptr++ = base64 (ch >> 10, var);			      \
+		    *outptr++ = base64 ((ch >> 4) & 0x3f, var);		      \
 		    statep->__count = ((ch & 15) << 5) | (3 << 3);	      \
 		    break;						      \
 		  case 2:						      \
 		    *outptr++ =						      \
-		      base64 (((statep->__count >> 3) & ~3) | (ch >> 12));    \
-		    *outptr++ = base64 ((ch >> 6) & 0x3f);		      \
-		    *outptr++ = base64 (ch & 0x3f);			      \
+		      base64 (((statep->__count >> 3) & ~3) | (ch >> 12),     \
+			      var);					      \
+		    *outptr++ = base64 ((ch >> 6) & 0x3f, var);		      \
+		    *outptr++ = base64 (ch & 0x3f, var);		      \
 		    statep->__count = (1 << 3);				      \
 		    break;						      \
 		  case 3:						      \
 		    *outptr++ =						      \
-		      base64 (((statep->__count >> 3) & ~3) | (ch >> 14));    \
-		    *outptr++ = base64 ((ch >> 8) & 0x3f);		      \
-		    *outptr++ = base64 ((ch >> 2) & 0x3f);		      \
+		      base64 (((statep->__count >> 3) & ~3) | (ch >> 14),     \
+			      var);					      \
+		    *outptr++ = base64 ((ch >> 8) & 0x3f, var);		      \
+		    *outptr++ = base64 ((ch >> 2) & 0x3f, var);		      \
 		    statep->__count = ((ch & 3) << 7) | (2 << 3);	      \
 		    break;						      \
 		  default:						      \
@@ -447,30 +554,32 @@ base64 (unsigned int i)
 		switch ((statep->__count >> 3) & 3)			      \
 		  {							      \
 		  case 1:						      \
-		    *outptr++ = base64 (ch >> 26);			      \
-		    *outptr++ = base64 ((ch >> 20) & 0x3f);		      \
-		    *outptr++ = base64 ((ch >> 14) & 0x3f);		      \
-		    *outptr++ = base64 ((ch >> 8) & 0x3f);		      \
-		    *outptr++ = base64 ((ch >> 2) & 0x3f);		      \
+		    *outptr++ = base64 (ch >> 26, var);			      \
+		    *outptr++ = base64 ((ch >> 20) & 0x3f, var);	      \
+		    *outptr++ = base64 ((ch >> 14) & 0x3f, var);	      \
+		    *outptr++ = base64 ((ch >> 8) & 0x3f, var);		      \
+		    *outptr++ = base64 ((ch >> 2) & 0x3f, var);		      \
 		    statep->__count = ((ch & 3) << 7) | (2 << 3);	      \
 		    break;						      \
 		  case 2:						      \
 		    *outptr++ =						      \
-		      base64 (((statep->__count >> 3) & ~3) | (ch >> 28));    \
-		    *outptr++ = base64 ((ch >> 22) & 0x3f);		      \
-		    *outptr++ = base64 ((ch >> 16) & 0x3f);		      \
-		    *outptr++ = base64 ((ch >> 10) & 0x3f);		      \
-		    *outptr++ = base64 ((ch >> 4) & 0x3f);		      \
+		      base64 (((statep->__count >> 3) & ~3) | (ch >> 28),     \
+			      var);					      \
+		    *outptr++ = base64 ((ch >> 22) & 0x3f, var);	      \
+		    *outptr++ = base64 ((ch >> 16) & 0x3f, var);	      \
+		    *outptr++ = base64 ((ch >> 10) & 0x3f, var);	      \
+		    *outptr++ = base64 ((ch >> 4) & 0x3f, var);		      \
 		    statep->__count = ((ch & 15) << 5) | (3 << 3);	      \
 		    break;						      \
 		  case 3:						      \
 		    *outptr++ =						      \
-		      base64 (((statep->__count >> 3) & ~3) | (ch >> 30));    \
-		    *outptr++ = base64 ((ch >> 24) & 0x3f);		      \
-		    *outptr++ = base64 ((ch >> 18) & 0x3f);		      \
-		    *outptr++ = base64 ((ch >> 12) & 0x3f);		      \
-		    *outptr++ = base64 ((ch >> 6) & 0x3f);		      \
-		    *outptr++ = base64 (ch & 0x3f);			      \
+		      base64 (((statep->__count >> 3) & ~3) | (ch >> 30),     \
+			      var);					      \
+		    *outptr++ = base64 ((ch >> 24) & 0x3f, var);	      \
+		    *outptr++ = base64 ((ch >> 18) & 0x3f, var);	      \
+		    *outptr++ = base64 ((ch >> 12) & 0x3f, var);	      \
+		    *outptr++ = base64 ((ch >> 6) & 0x3f, var);		      \
+		    *outptr++ = base64 (ch & 0x3f, var);		      \
 		    statep->__count = (1 << 3);				      \
 		    break;						      \
 		  default:						      \
@@ -486,7 +595,7 @@ base64 (unsigned int i)
     inptr += 4;								      \
   }
 #define LOOP_NEED_FLAGS
-#define EXTRA_LOOP_DECLS	, mbstate_t *statep
+#define EXTRA_LOOP_DECLS	, mbstate_t *statep, enum variant var
 #include <iconv/loop.c>
 
 
@@ -516,7 +625,7 @@ base64 (unsigned int i)
 	    {								      \
 	      /* Write out the shift sequence.  */			      \
 	      if ((state & 0x18) >= 0x10)				      \
-		*outbuf++ = base64 ((state >> 3) & ~3);			      \
+		*outbuf++ = base64 ((state >> 3) & ~3, var);		      \
 	      *outbuf++ = '-';						      \
 									      \
 	      data->__statep->__count = 0;				      \
-- 
2.34.1


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 3/4] iconv: make utf-7.c able to use variants
  2021-12-09  9:31               ` [PATCH v4 3/4] iconv: make utf-7.c able to use variants Max Gautier
@ 2022-03-07 12:34                 ` Adhemerval Zanella
  2022-03-12 11:07                   ` Max Gautier
  2022-03-20 16:42                 ` [PATCH v5 " Max Gautier
  1 sibling, 1 reply; 60+ messages in thread
From: Adhemerval Zanella @ 2022-03-07 12:34 UTC (permalink / raw)
  To: Max Gautier, libc-alpha



On 09/12/2021 06:31, Max Gautier via Libc-alpha wrote:
> Add infrastructure in utf-7.c to handle variants. The approach comes from
> iso646.c
> The variant is defined at gconv_init time and is passed as a
> supplementary variable.
> 
> Signed-off-by: Max Gautier <mg@max.gautier.name>

The patch looks OK in general; there are style issues that need to be fixed
and some minor suggestions below.
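
For readers following along, the variant lookup that the commit message
describes can be sketched outside of glibc roughly as below. This is only an
illustration: find_variant and the second name "UTF-7-EXAMPLE//" are
hypothetical, and the public strlen/strcasecmp stand in for the internal
__rawmemchr/__strcasecmp used in the patch.

```c
#include <string.h>
#include <strings.h>

/* UTF7_EXAMPLE is a hypothetical second variant, for illustration only.  */
enum variant { UTF7, UTF7_EXAMPLE };

/* Must be in the same order as enum variant.  Every name ends with a NUL
   and an empty string closes the list.  */
static const char names[] =
  "UTF-7//\0"
  "UTF-7-EXAMPLE//\0"
  "\0";

/* Return the index of CHARSET in NAMES (which is also its enum variant
   value), or -1 if CHARSET is not listed.  */
static int
find_variant (const char *charset)
{
  int var = 0;
  for (const char *name = names; *name != '\0';
       name += strlen (name) + 1)
    {
      if (strcasecmp (charset, name) == 0)
        return var;
      ++var;
    }
  return -1;
}
```

In the patch the same walk is done twice per entry, once against
step->__from_name and once against step->__to_name, so that it yields the
direction as well as the variant.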

> ---
>  iconvdata/utf-7.c | 239 +++++++++++++++++++++++++++++++++-------------
>  1 file changed, 174 insertions(+), 65 deletions(-)
> 
> diff --git a/iconvdata/utf-7.c b/iconvdata/utf-7.c
> index ac7d78141a..965d4220f1 100644
> --- a/iconvdata/utf-7.c
> +++ b/iconvdata/utf-7.c
> @@ -29,6 +29,24 @@
>  #include <stdlib.h>
>  
>  
> +enum variant
> +{
> +    UTF7,
> +};
> +
> +/* Must be in the same order as enum variant above.  */
> +static const char names[] =
> +  "UTF-7//\0"
> +  "\0";
> +
> +static uint32_t
> +shift_character(enum variant const var)
> +{
> +    if (var == UTF7)
> +        return '+';
> +    else
> +        abort();
> +}

Please use the indentation glibc expects [1], here and in other places as well.

[1] https://sourceware.org/glibc/wiki/Style_and_Conventions

>  
>  static int
>  between(uint32_t const ch,
> @@ -38,37 +56,43 @@ between(uint32_t const ch,
>  }
>  
>  /* The set of "direct characters":
> +   FOR UTF-7
>     A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr
>  */
>  
>  static int
> -isdirect (uint32_t ch)
> +isdirect (uint32_t ch, enum variant var)
>  {
> -    return (between(ch, 'A', 'Z')
> -	    || between(ch, 'a', 'z')
> -	    || between(ch, '0', '9')
> -	    || ch == '\'' || ch == '(' || ch == ')'
> -	    || between(ch, ',', '/')
> -	    || ch == ':' || ch == '?'
> -	    || ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r');
> +    if (var == UTF7)
> +        return (between(ch, 'A', 'Z')
> +                || between(ch, 'a', 'z')
> +                || between(ch, '0', '9')
> +                || ch == '\'' || ch == '(' || ch == ')'
> +                || between(ch, ',', '/')
> +                || ch == ':' || ch == '?'
> +                || ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r');
> +    abort();
>  }
>  
>  
>  /* The set of "direct and optional direct characters":
> +   (UTF-7 only)
>     A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr
>     ! " # $ % & * ; < = > @ [ ] ^ _ ` { | }
>  */
>  
> -
>  static int
> -isxdirect (uint32_t ch)
> +isxdirect (uint32_t ch, enum variant var)
>  {
> -    return (ch == '\t'
> -            || ch == '\n'
> -            || ch == '\r'
> -            || (between(ch, ' ','}')
> -                && ch != '+' && ch != '\\')
> -           );
> +    return(isdirect(ch, var)
> +            || (var == UTF7 &&
> +                (between(ch, '!', '&')
> +                 || ch == '*'
> +                 || between(ch, ';', '@')
> +                 || (between(ch, '[', '`') && ch != '\\')
> +                 || between(ch, '{', '}'))
> +                )
> +          );
>  }
>  

The change is OK, but maybe moving the variant check out front makes it clearer:

   if (var != UTF7)
     return 0;
   [...]

Also, since you are changing it, I think it would be better to use 'bool' as
the return type.

>  
> @@ -91,7 +115,7 @@ needs_explicit_shift (uint32_t ch)
>  
>  /* Converts a value in the range 0..63 to a base64 encoded char.  */
>  static unsigned char
> -base64 (unsigned int i)
> +base64 (unsigned int i, enum variant var)
>  {
>    if (i < 26)
>      return i + 'A';
> @@ -101,7 +125,7 @@ base64 (unsigned int i)
>      return i - 52 + '0';
>    else if (i == 62)
>      return '+';
> -  else if (i == 63)
> +  else if (i == 63 && var == UTF7)
>      return '/';
>    else
>      abort ();
> @@ -109,9 +133,8 @@ base64 (unsigned int i)
>  
>  

Ok.

>  /* Definitions used in the body of the `gconv' function.  */
> -#define CHARSET_NAME		"UTF-7//"
> -#define DEFINE_INIT		1
> -#define DEFINE_FINI		1
> +#define DEFINE_INIT		0
> +#define DEFINE_FINI		0
>  #define FROM_LOOP		from_utf7_loop
>  #define TO_LOOP			to_utf7_loop
>  #define MIN_NEEDED_FROM		1
> @@ -119,11 +142,27 @@ base64 (unsigned int i)
>  #define MIN_NEEDED_TO		4
>  #define MAX_NEEDED_TO		4
>  #define ONE_DIRECTION		0
> +#define FROM_DIRECTION      (dir == from_utf7)
>  #define PREPARE_LOOP \
>    mbstate_t saved_state;						      \
> -  mbstate_t *statep = data->__statep;
> -#define EXTRA_LOOP_ARGS		, statep
> +  mbstate_t *statep = data->__statep;					      \
> +  enum direction dir = ((struct utf7_data *) step->__data)->dir;	      \
> +  enum direction var = ((struct utf7_data *) step->__data)->var;
> +#define EXTRA_LOOP_ARGS		, statep, var
> +
> +
> +enum direction
> +{
> +   illegal_dir,
> +   from_utf7,
> +   to_utf7
> +};

Style: use two spaces.

>  
> +struct utf7_data
> +{
> +  enum direction dir;
> +  enum variant var;
> +};
>  
>  /* Since we might have to reset input pointer we must be able to save
>     and restore the state.  */
> @@ -133,6 +172,72 @@ base64 (unsigned int i)
>    else									      \
>      *statep = saved_state
>  
> +extern int gconv_init (struct __gconv_step *step);

I think there is no need to add the prototype here.

> +int
> +gconv_init (struct __gconv_step *step)
> +{
> +  /* Determine which direction.  */
> +  struct utf7_data *new_data;
> +  enum direction dir = illegal_dir;
> +
> +  enum variant var = 0;
> +  for (const char *name = names; *name != '\0';
> +       name = __rawmemchr (name, '\0') + 1)
> +    {
> +      if (__strcasecmp (step->__from_name, name) == 0)
> +	{
> +	  dir = from_utf7;
> +	  break;
> +	}
> +      else if (__strcasecmp (step->__to_name, name) == 0)
> +	{
> +	  dir = to_utf7;
> +	  break;
> +	}
> +      ++var;
> +    }
> +
> +  if (__builtin_expect (dir, from_utf7) != illegal_dir)

Use __glibc_likely.
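
To spell out why: as written, __builtin_expect (dir, from_utf7) hints that
dir itself is expected to equal from_utf7, not that the comparison with
illegal_dir is expected to be true. __glibc_likely (cond) expands to
__builtin_expect ((cond), 1), so it wraps the whole condition. A minimal
sketch, where the likely macro is a local stand-in for __glibc_likely and
direction_is_valid is a hypothetical helper:

```c
/* Local stand-in for glibc's __glibc_likely; needs GCC or Clang.  */
#define likely(cond) __builtin_expect ((cond), 1)

enum direction { illegal_dir, from_utf7, to_utf7 };

/* Return 1 when DIR names a usable direction, 0 otherwise.  The branch
   hint covers the whole comparison, not just the value of DIR.  */
static int
direction_is_valid (enum direction dir)
{
  if (likely (dir != illegal_dir))
    return 1;
  return 0;
}
```

The hint only affects code layout; the result of the test is the same
either way.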

> +  {
> +      new_data = malloc (sizeof (*new_data));
> +      if (new_data == NULL)
> +          return __GCONV_NOMEM;
> +
> +      new_data->dir = dir;
> +      new_data->var = var;
> +      step->__data = new_data;
> +
> +      if (dir == from_utf7)
> +      {
> +          step->__min_needed_from = MIN_NEEDED_FROM;
> +          step->__max_needed_from = MAX_NEEDED_FROM;
> +          step->__min_needed_to = MIN_NEEDED_TO;
> +          step->__max_needed_to = MAX_NEEDED_TO;
> +      }
> +      else
> +      {
> +          step->__min_needed_from = MIN_NEEDED_TO;
> +          step->__max_needed_from = MAX_NEEDED_TO;
> +          step->__min_needed_to = MIN_NEEDED_FROM;
> +          step->__max_needed_to = MAX_NEEDED_FROM;
> +      }
> +  }
> +  else
> +    return __GCONV_NOCONV;
> +
> +  step->__stateful = 1;
> +
> +  return __GCONV_OK;
> +}
> +
> +extern void gconv_end (struct __gconv_step *data);
> +void
> +gconv_end (struct __gconv_step *data)
> +{
> +  free (data->__data);
> +}
> +
> +
>  
>  /* First define the conversion function from UTF-7 to UCS4.
>     The state is structured as follows:
> @@ -160,13 +265,13 @@ base64 (unsigned int i)
>      if ((statep->__count >> 3) == 0)					      \
>        {									      \
>  	/* base64 encoding inactive.  */				      \
> -	if (isxdirect (ch))						      \
> +	if (isxdirect (ch, var))					      \
>  	  {								      \
>  	    inptr++;							      \
>  	    put32 (outptr, ch);						      \
>  	    outptr += 4;						      \
>  	  }								      \
> -	else if (__glibc_likely (ch == '+'))				      \
> +	else if (__glibc_likely (ch == shift_character(var)))		      \
>  	  {								      \
>  	    if (__glibc_unlikely (inptr + 2 > inend))			      \
>  	      {								      \
> @@ -291,7 +396,7 @@ base64 (unsigned int i)
>        }									      \
>    }
>  #define LOOP_NEED_FLAGS
> -#define EXTRA_LOOP_DECLS	, mbstate_t *statep
> +#define EXTRA_LOOP_DECLS	, mbstate_t *statep, enum variant var
>  #include <iconv/loop.c>
>  
>  
> @@ -322,7 +427,7 @@ base64 (unsigned int i)
>      if ((statep->__count & 0x18) == 0)					      \
>        {									      \
>  	/* base64 encoding inactive */					      \
> -	if (isdirect (ch))   						      \
> +	if (isdirect (ch, var))						      \
>  	  {								      \
>  	    *outptr++ = (unsigned char) ch;				      \
>  	  }								      \
> @@ -330,7 +435,7 @@ base64 (unsigned int i)
>  	  {								      \
>  	    size_t count;						      \
>  									      \
> -	    if (ch == '+')						      \
> +	    if (ch == shift_character(var))				      \
>  	      count = 2;						      \
>  	    else if (ch < 0x10000)					      \
>  	      count = 3;						      \
> @@ -345,13 +450,13 @@ base64 (unsigned int i)
>  		break;							      \
>  	      }								      \
>  									      \
> -	    *outptr++ = '+';						      \
> -	    if (ch == '+')						      \
> +	    *outptr++ = shift_character(var);				      \
> +	    if (ch == shift_character(var))				      \
>  	      *outptr++ = '-';						      \
>  	    else if (ch < 0x10000)					      \
>  	      {								      \
> -		*outptr++ = base64 (ch >> 10);				      \
> -		*outptr++ = base64 ((ch >> 4) & 0x3f);			      \
> +		*outptr++ = base64 (ch >> 10, var);			      \
> +		*outptr++ = base64 ((ch >> 4) & 0x3f, var);		      \
>  		statep->__count = ((ch & 15) << 5) | (3 << 3);		      \
>  	      }								      \
>  	    else if (ch < 0x110000)					      \
> @@ -360,11 +465,11 @@ base64 (unsigned int i)
>  		uint32_t ch2 = 0xdc00 + ((ch - 0x10000) & 0x3ff);	      \
>  									      \
>  		ch = (ch1 << 16) | ch2;					      \
> -		*outptr++ = base64 (ch >> 26);				      \
> -		*outptr++ = base64 ((ch >> 20) & 0x3f);			      \
> -		*outptr++ = base64 ((ch >> 14) & 0x3f);			      \
> -		*outptr++ = base64 ((ch >> 8) & 0x3f);			      \
> -		*outptr++ = base64 ((ch >> 2) & 0x3f);			      \
> +		*outptr++ = base64 (ch >> 26, var);			      \
> +		*outptr++ = base64 ((ch >> 20) & 0x3f, var);		      \
> +		*outptr++ = base64 ((ch >> 14) & 0x3f, var);		      \
> +		*outptr++ = base64 ((ch >> 8) & 0x3f, var);		      \
> +		*outptr++ = base64 ((ch >> 2) & 0x3f, var);		      \
>  		statep->__count = ((ch & 3) << 7) | (2 << 3);		      \
>  	      }								      \
>  	    else							      \
> @@ -374,7 +479,7 @@ base64 (unsigned int i)
>      else								      \
>        {									      \
>  	/* base64 encoding active */					      \
> -	if (isdirect (ch))						      \
> +	if (isdirect (ch, var))						      \
>  	  {								      \
>  	    /* deactivate base64 encoding */				      \
>  	    size_t count;						      \
> @@ -388,7 +493,7 @@ base64 (unsigned int i)
>  	      }								      \
>  									      \
>  	    if ((statep->__count & 0x18) >= 0x10)			      \
> -	      *outptr++ = base64 ((statep->__count >> 3) & ~3);		      \
> +	      *outptr++ = base64 ((statep->__count >> 3) & ~3, var);	      \
>  	    if (needs_explicit_shift (ch))				      \
>  	      *outptr++ = '-';						      \
>  	    *outptr++ = (unsigned char) ch;				      \
> @@ -416,22 +521,24 @@ base64 (unsigned int i)
>  		switch ((statep->__count >> 3) & 3)			      \
>  		  {							      \
>  		  case 1:						      \
> -		    *outptr++ = base64 (ch >> 10);			      \
> -		    *outptr++ = base64 ((ch >> 4) & 0x3f);		      \
> +		    *outptr++ = base64 (ch >> 10, var);			      \
> +		    *outptr++ = base64 ((ch >> 4) & 0x3f, var);		      \
>  		    statep->__count = ((ch & 15) << 5) | (3 << 3);	      \
>  		    break;						      \
>  		  case 2:						      \
>  		    *outptr++ =						      \
> -		      base64 (((statep->__count >> 3) & ~3) | (ch >> 12));    \
> -		    *outptr++ = base64 ((ch >> 6) & 0x3f);		      \
> -		    *outptr++ = base64 (ch & 0x3f);			      \
> +		      base64 (((statep->__count >> 3) & ~3) | (ch >> 12),     \
> +			      var);					      \
> +		    *outptr++ = base64 ((ch >> 6) & 0x3f, var);		      \
> +		    *outptr++ = base64 (ch & 0x3f, var);		      \
>  		    statep->__count = (1 << 3);				      \
>  		    break;						      \
>  		  case 3:						      \
>  		    *outptr++ =						      \
> -		      base64 (((statep->__count >> 3) & ~3) | (ch >> 14));    \
> -		    *outptr++ = base64 ((ch >> 8) & 0x3f);		      \
> -		    *outptr++ = base64 ((ch >> 2) & 0x3f);		      \
> +		      base64 (((statep->__count >> 3) & ~3) | (ch >> 14),     \
> +			      var);					      \
> +		    *outptr++ = base64 ((ch >> 8) & 0x3f, var);		      \
> +		    *outptr++ = base64 ((ch >> 2) & 0x3f, var);		      \
>  		    statep->__count = ((ch & 3) << 7) | (2 << 3);	      \
>  		    break;						      \
>  		  default:						      \
> @@ -447,30 +554,32 @@ base64 (unsigned int i)
>  		switch ((statep->__count >> 3) & 3)			      \
>  		  {							      \
>  		  case 1:						      \
> -		    *outptr++ = base64 (ch >> 26);			      \
> -		    *outptr++ = base64 ((ch >> 20) & 0x3f);		      \
> -		    *outptr++ = base64 ((ch >> 14) & 0x3f);		      \
> -		    *outptr++ = base64 ((ch >> 8) & 0x3f);		      \
> -		    *outptr++ = base64 ((ch >> 2) & 0x3f);		      \
> +		    *outptr++ = base64 (ch >> 26, var);			      \
> +		    *outptr++ = base64 ((ch >> 20) & 0x3f, var);	      \
> +		    *outptr++ = base64 ((ch >> 14) & 0x3f, var);	      \
> +		    *outptr++ = base64 ((ch >> 8) & 0x3f, var);		      \
> +		    *outptr++ = base64 ((ch >> 2) & 0x3f, var);		      \
>  		    statep->__count = ((ch & 3) << 7) | (2 << 3);	      \
>  		    break;						      \
>  		  case 2:						      \
>  		    *outptr++ =						      \
> -		      base64 (((statep->__count >> 3) & ~3) | (ch >> 28));    \
> -		    *outptr++ = base64 ((ch >> 22) & 0x3f);		      \
> -		    *outptr++ = base64 ((ch >> 16) & 0x3f);		      \
> -		    *outptr++ = base64 ((ch >> 10) & 0x3f);		      \
> -		    *outptr++ = base64 ((ch >> 4) & 0x3f);		      \
> +		      base64 (((statep->__count >> 3) & ~3) | (ch >> 28),     \
> +			      var);					      \
> +		    *outptr++ = base64 ((ch >> 22) & 0x3f, var);	      \
> +		    *outptr++ = base64 ((ch >> 16) & 0x3f, var);	      \
> +		    *outptr++ = base64 ((ch >> 10) & 0x3f, var);	      \
> +		    *outptr++ = base64 ((ch >> 4) & 0x3f, var);		      \
>  		    statep->__count = ((ch & 15) << 5) | (3 << 3);	      \
>  		    break;						      \
>  		  case 3:						      \
>  		    *outptr++ =						      \
> -		      base64 (((statep->__count >> 3) & ~3) | (ch >> 30));    \
> -		    *outptr++ = base64 ((ch >> 24) & 0x3f);		      \
> -		    *outptr++ = base64 ((ch >> 18) & 0x3f);		      \
> -		    *outptr++ = base64 ((ch >> 12) & 0x3f);		      \
> -		    *outptr++ = base64 ((ch >> 6) & 0x3f);		      \
> -		    *outptr++ = base64 (ch & 0x3f);			      \
> +		      base64 (((statep->__count >> 3) & ~3) | (ch >> 30),     \
> +			      var);					      \
> +		    *outptr++ = base64 ((ch >> 24) & 0x3f, var);	      \
> +		    *outptr++ = base64 ((ch >> 18) & 0x3f, var);	      \
> +		    *outptr++ = base64 ((ch >> 12) & 0x3f, var);	      \
> +		    *outptr++ = base64 ((ch >> 6) & 0x3f, var);		      \
> +		    *outptr++ = base64 (ch & 0x3f, var);		      \
>  		    statep->__count = (1 << 3);				      \
>  		    break;						      \
>  		  default:						      \
> @@ -486,7 +595,7 @@ base64 (unsigned int i)
>      inptr += 4;								      \
>    }


Ok

>  #define LOOP_NEED_FLAGS
> -#define EXTRA_LOOP_DECLS	, mbstate_t *statep
> +#define EXTRA_LOOP_DECLS	, mbstate_t *statep, enum variant var
>  #include <iconv/loop.c>
>  
>  
> @@ -516,7 +625,7 @@ base64 (unsigned int i)
>  	    {								      \
>  	      /* Write out the shift sequence.  */			      \
>  	      if ((state & 0x18) >= 0x10)				      \
> -		*outbuf++ = base64 ((state >> 3) & ~3);			      \
> +		*outbuf++ = base64 ((state >> 3) & ~3, var);		      \
>  	      *outbuf++ = '-';						      \
>  									      \
>  	      data->__statep->__count = 0;				      \

Ok.
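
As a cross-check of the bit handling reviewed above, here is a standalone
sketch that encodes a single BMP character as one complete shifted
sequence. It is not how the module works in general: the real loop keeps
base64 active across consecutive characters and carries the leftover bits
in statep->__count, whereas this hypothetical encode_bmp_char flushes the
4 leftover bits immediately, which is what the "shift sequence" code does
at reset time. The variant argument is left out for brevity.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Convert a value in the range 0..63 to its base64 character.  */
static unsigned char
base64 (unsigned int i)
{
  if (i < 26)
    return i + 'A';
  else if (i < 52)
    return i - 26 + 'a';
  else if (i < 62)
    return i - 52 + '0';
  else if (i == 62)
    return '+';
  else if (i == 63)
    return '/';
  else
    abort ();
}

/* Encode one character CH, which must be below 0x10000 and not a direct
   character, as a self-contained UTF-7 sequence.  OUT must have room for
   6 bytes.  Returns the length written, not counting the final NUL.  */
static size_t
encode_bmp_char (uint32_t ch, char *out)
{
  size_t n = 0;
  out[n++] = '+';                        /* Shift in.  */
  out[n++] = base64 (ch >> 10);          /* Top 6 bits.  */
  out[n++] = base64 ((ch >> 4) & 0x3f);  /* Middle 6 bits.  */
  out[n++] = base64 ((ch & 15) << 2);    /* Last 4 bits, zero padded.  */
  out[n++] = '-';                        /* Shift out.  */
  out[n] = '\0';
  return n;
}
```

With this, U+00E9 comes out as "+AOk-" and U+20AC as "+IKw-". The padded
last digit matches base64 ((state >> 3) & ~3) in the patch, since the state
was stored as ((ch & 15) << 5) | (3 << 3).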

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 3/4] iconv: make utf-7.c able to use variants
  2022-03-07 12:34                 ` Adhemerval Zanella
@ 2022-03-12 11:07                   ` Max Gautier
  2022-03-14 12:17                     ` Adhemerval Zanella
  0 siblings, 1 reply; 60+ messages in thread
From: Max Gautier @ 2022-03-12 11:07 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: libc-alpha

On Mon, Mar 07, 2022 at 09:34:47AM -0300, Adhemerval Zanella wrote:
> >  static int
> > -isxdirect (uint32_t ch)
> > +isxdirect (uint32_t ch, enum variant var)
> >  {
> > -    return (ch == '\t'
> > -            || ch == '\n'
> > -            || ch == '\r'
> > -            || (between(ch, ' ','}')
> > -                && ch != '+' && ch != '\\')
> > -           );
> > +    return(isdirect(ch, var)
> > +            || (var == UTF7 &&
> > +                (between(ch, '!', '&')
> > +                 || ch == '*'
> > +                 || between(ch, ';', '@')
> > +                 || (between(ch, '[', '`') && ch != '\\')
> > +                 || between(ch, '{', '}'))
> > +                )
> > +          );
> >  }
> >  
> 
> The change is ok, but maybe adding the variant out makes it more clear:
> 
>    if (var != UTF7)
>      return 0;
>    [...]
> 
Something like this?

if (isdirect(ch, var))
   return true;
if (var != UTF7)
   return false;
return (between(ch, '!', '&')
  || ch == '*'
  || between(ch, ';', '@')
  || (between(ch, '[', '`') && ch != '\\')
  || between(ch, '{', '}')
 );

I'd prefer the single expression form, but that works too.

> Also I think since you change it, it would be better to use 'bool' as
> return type.

OK. I was not sure whether using the bool type was acceptable.

Thanks for the review.

-- 
Max Gautier

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 3/4] iconv: make utf-7.c able to use variants
  2022-03-12 11:07                   ` Max Gautier
@ 2022-03-14 12:17                     ` Adhemerval Zanella
  0 siblings, 0 replies; 60+ messages in thread
From: Adhemerval Zanella @ 2022-03-14 12:17 UTC (permalink / raw)
  To: Max Gautier, libc-alpha



On 12/03/2022 08:07, Max Gautier wrote:
> On Mon, Mar 07, 2022 at 09:34:47AM -0300, Adhemerval Zanella wrote:
>>>  static int
>>> -isxdirect (uint32_t ch)
>>> +isxdirect (uint32_t ch, enum variant var)
>>>  {
>>> -    return (ch == '\t'
>>> -            || ch == '\n'
>>> -            || ch == '\r'
>>> -            || (between(ch, ' ','}')
>>> -                && ch != '+' && ch != '\\')
>>> -           );
>>> +    return(isdirect(ch, var)
>>> +            || (var == UTF7 &&
>>> +                (between(ch, '!', '&')
>>> +                 || ch == '*'
>>> +                 || between(ch, ';', '@')
>>> +                 || (between(ch, '[', '`') && ch != '\\')
>>> +                 || between(ch, '{', '}'))
>>> +                )
>>> +          );
>>>  }
>>>  
>>
>> The change is ok, but maybe adding the variant out makes it more clear:
>>
>>    if (var != UTF7)
>>      return 0;
>>    [...]
>>
> something like this ? 
> 
> if (isdirect(ch, var))
>    return true;
> if (var != UTF7)
>    return false;
> return (between(ch, '!', '&')
>   || ch == '*'
>   || between(ch, ';', '@')
>   || (between(ch, '[', '`') && ch != '\\')
>   || between(ch, '{', '}'))
>  );
> 

Yes, it is slightly better for readability (don't forget the space before '(',
and the extra parentheses in the return are not required).
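
Put together with both remarks applied, a self-contained version could look
like the sketch below. The between helper is not part of the quoted hunk, so
its definition here is an assumption; isdirect is copied from the patch.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

enum variant { UTF7 };

/* Assumed helper: true if CH lies in [LOWER_BOUND, UPPER_BOUND].  */
static bool
between (uint32_t ch, uint32_t lower_bound, uint32_t upper_bound)
{
  return ch >= lower_bound && ch <= upper_bound;
}

/* The set of "direct characters" for UTF-7.  */
static bool
isdirect (uint32_t ch, enum variant var)
{
  if (var == UTF7)
    return (between (ch, 'A', 'Z')
            || between (ch, 'a', 'z')
            || between (ch, '0', '9')
            || ch == '\'' || ch == '(' || ch == ')'
            || between (ch, ',', '/')
            || ch == ':' || ch == '?'
            || ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r');
  abort ();
}

/* Direct characters plus, for UTF-7 only, the optional direct ones.  */
static bool
isxdirect (uint32_t ch, enum variant var)
{
  if (isdirect (ch, var))
    return true;
  if (var != UTF7)
    return false;
  return between (ch, '!', '&')
    || ch == '*'
    || between (ch, ';', '@')
    || (between (ch, '[', '`') && ch != '\\')
    || between (ch, '{', '}');
}
```

With UTF7 as the only variant, the var != UTF7 branch cannot be reached
(isdirect would abort first); it only starts to matter once isdirect learns
about another variant.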

> I'd prefer the single expression form, but that works too.
> 
>> Also I think since you change it, it would be better to use 'bool' as
>> return type.
> 
> Ok. I was not sure whether it was ok to use bool type or not.

We build internally with gnu11, so we can use most C11 facilities (there
are some spots, like atomics, that we are still migrating and that require
extra care to avoid calling libatomic).

> 
> Thanks for the review.
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH v5 3/4] iconv: make utf-7.c able to use variants
  2021-12-09  9:31               ` [PATCH v4 3/4] iconv: make utf-7.c able to use variants Max Gautier
  2022-03-07 12:34                 ` Adhemerval Zanella
@ 2022-03-20 16:42                 ` Max Gautier
  2022-03-21 12:24                   ` Adhemerval Zanella
  1 sibling, 1 reply; 60+ messages in thread
From: Max Gautier @ 2022-03-20 16:42 UTC (permalink / raw)
  To: libc-alpha; +Cc: Max Gautier

Add infrastructure in utf-7.c to handle variants. The approach comes from
iso646.c.
The variant is determined at gconv_init time and is passed to the loops as
an extra argument.

Signed-off-by: Max Gautier <mg@max.gautier.name>
---
 iconvdata/utf-7.c | 230 ++++++++++++++++++++++++++++++++++------------
 1 file changed, 170 insertions(+), 60 deletions(-)

diff --git a/iconvdata/utf-7.c b/iconvdata/utf-7.c
index 15f3669ac8..b639d8ff3e 100644
--- a/iconvdata/utf-7.c
+++ b/iconvdata/utf-7.c
@@ -29,6 +29,24 @@
 #include <stdlib.h>
 
 
+enum variant
+{
+  UTF7,
+};
+
+/* Must be in the same order as enum variant above.  */
+static const char names[] =
+  "UTF-7//\0"
+  "\0";
+
+static uint32_t
+shift_character (enum variant const var)
+{
+  if (var == UTF7)
+    return '+';
+  else
+    abort ();
+}
 
 static bool
 between (uint32_t const ch,
@@ -38,23 +56,27 @@ between (uint32_t const ch,
 }
 
 /* The set of "direct characters":
+   FOR UTF-7
    A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr
 */
 
 static bool
 isdirect (uint32_t ch, enum variant var)
 {
-  return (between (ch, 'A', 'Z')
-	  || between (ch, 'a', 'z')
-	  || between (ch, '0', '9')
-	  || ch == '\'' || ch == '(' || ch == ')'
-	  || between (ch, ',', '/')
-	  || ch == ':' || ch == '?'
-	  || ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r');
+  if (var == UTF7)
+    return (between (ch, 'A', 'Z')
+	    || between (ch, 'a', 'z')
+	    || between (ch, '0', '9')
+	    || ch == '\'' || ch == '(' || ch == ')'
+	    || between (ch, ',', '/')
+	    || ch == ':' || ch == '?'
+	    || ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r');
+  abort ();
 }
 
 
 /* The set of "direct and optional direct characters":
+   (UTF-7 only)
    A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr
    ! " # $ % & * ; < = > @ [ ] ^ _ ` { | }
 */
@@ -62,10 +84,15 @@ isdirect (uint32_t ch, enum variant var)
 static bool
 isxdirect (uint32_t ch, enum variant var)
 {
-  return (ch == '\t'
-	  || ch == '\n'
-	  || ch == '\r'
-	  || (between (ch, ' ', '}') && ch != '+' && ch != '\\'));
+  if (isdirect (ch, var))
+    return true;
+  if (var != UTF7)
+    return false;
+  return between (ch, '!', '&')
+    || ch == '*'
+    || between (ch, ';', '@')
+    || (between (ch, '[', '`') && ch != '\\')
+    || between (ch, '{', '}');
 }
 
 
@@ -85,7 +112,7 @@ needs_explicit_shift (uint32_t ch)
 
 /* Converts a value in the range 0..63 to a base64 encoded char.  */
 static unsigned char
-base64 (unsigned int i)
+base64 (unsigned int i, enum variant var)
 {
   if (i < 26)
     return i + 'A';
@@ -95,7 +122,7 @@ base64 (unsigned int i)
     return i - 52 + '0';
   else if (i == 62)
     return '+';
-  else if (i == 63)
+  else if (i == 63 && var == UTF7)
     return '/';
   else
     abort ();
@@ -103,9 +130,8 @@ base64 (unsigned int i)
 
 
 /* Definitions used in the body of the `gconv' function.  */
-#define CHARSET_NAME		"UTF-7//"
-#define DEFINE_INIT		1
-#define DEFINE_FINI		1
+#define DEFINE_INIT		0
+#define DEFINE_FINI		0
 #define FROM_LOOP		from_utf7_loop
 #define TO_LOOP			to_utf7_loop
 #define MIN_NEEDED_FROM		1
@@ -113,11 +139,27 @@ base64 (unsigned int i)
 #define MIN_NEEDED_TO		4
 #define MAX_NEEDED_TO		4
 #define ONE_DIRECTION		0
+#define FROM_DIRECTION      (dir == from_utf7)
 #define PREPARE_LOOP \
   mbstate_t saved_state;						      \
-  mbstate_t *statep = data->__statep;
-#define EXTRA_LOOP_ARGS		, statep
+  mbstate_t *statep = data->__statep;					      \
+  enum direction dir = ((struct utf7_data *) step->__data)->dir;	      \
+  enum variant var = ((struct utf7_data *) step->__data)->var;
+#define EXTRA_LOOP_ARGS		, statep, var
+
 
+enum direction
+{
+  illegal_dir,
+  from_utf7,
+  to_utf7
+};
+
+struct utf7_data
+{
+  enum direction dir;
+  enum variant var;
+};
 
 /* Since we might have to reset input pointer we must be able to save
    and restore the state.  */
@@ -127,6 +169,70 @@ base64 (unsigned int i)
   else									      \
     *statep = saved_state
 
+int
+gconv_init (struct __gconv_step *step)
+{
+  /* Determine which direction.  */
+  struct utf7_data *new_data;
+  enum direction dir = illegal_dir;
+
+  enum variant var = 0;
+  for (const char *name = names; *name != '\0';
+       name = __rawmemchr (name, '\0') + 1)
+    {
+      if (__strcasecmp (step->__from_name, name) == 0)
+	{
+	  dir = from_utf7;
+	  break;
+	}
+      else if (__strcasecmp (step->__to_name, name) == 0)
+	{
+	  dir = to_utf7;
+	  break;
+	}
+      ++var;
+    }
+
+  if (__glibc_likely (dir != illegal_dir))
+    {
+      new_data = malloc (sizeof (*new_data));
+      if (new_data == NULL)
+	return __GCONV_NOMEM;
+
+      new_data->dir = dir;
+      new_data->var = var;
+      step->__data = new_data;
+
+      if (dir == from_utf7)
+	{
+	  step->__min_needed_from = MIN_NEEDED_FROM;
+	  step->__max_needed_from = MAX_NEEDED_FROM;
+	  step->__min_needed_to = MIN_NEEDED_TO;
+	  step->__max_needed_to = MAX_NEEDED_TO;
+	}
+      else
+	{
+	  step->__min_needed_from = MIN_NEEDED_TO;
+	  step->__max_needed_from = MAX_NEEDED_TO;
+	  step->__min_needed_to = MIN_NEEDED_FROM;
+	  step->__max_needed_to = MAX_NEEDED_FROM;
+	}
+    }
+  else
+    return __GCONV_NOCONV;
+
+  step->__stateful = 1;
+
+  return __GCONV_OK;
+}
+
+void
+gconv_end (struct __gconv_step *data)
+{
+  free (data->__data);
+}
+
+
 
 /* First define the conversion function from UTF-7 to UCS4.
    The state is structured as follows:
@@ -154,13 +260,13 @@ base64 (unsigned int i)
     if ((statep->__count >> 3) == 0)					      \
       {									      \
 	/* base64 encoding inactive.  */				      \
-	if (isxdirect (ch))						      \
+	if (isxdirect (ch, var))					      \
 	  {								      \
 	    inptr++;							      \
 	    put32 (outptr, ch);						      \
 	    outptr += 4;						      \
 	  }								      \
-	else if (__glibc_likely (ch == '+'))				      \
+	else if (__glibc_likely (ch == shift_character (var)))		      \
 	  {								      \
 	    if (__glibc_unlikely (inptr + 2 > inend))			      \
 	      {								      \
@@ -285,7 +391,7 @@ base64 (unsigned int i)
       }									      \
   }
 #define LOOP_NEED_FLAGS
-#define EXTRA_LOOP_DECLS	, mbstate_t *statep
+#define EXTRA_LOOP_DECLS	, mbstate_t *statep, enum variant var
 #include <iconv/loop.c>
 
 
@@ -316,7 +422,7 @@ base64 (unsigned int i)
     if ((statep->__count & 0x18) == 0)					      \
       {									      \
 	/* base64 encoding inactive */					      \
-	if (isdirect (ch))   						      \
+	if (isdirect (ch, var))						      \
 	  {								      \
 	    *outptr++ = (unsigned char) ch;				      \
 	  }								      \
@@ -324,7 +430,7 @@ base64 (unsigned int i)
 	  {								      \
 	    size_t count;						      \
 									      \
-	    if (ch == '+')						      \
+	    if (ch == shift_character (var))				      \
 	      count = 2;						      \
 	    else if (ch < 0x10000)					      \
 	      count = 3;						      \
@@ -339,13 +445,13 @@ base64 (unsigned int i)
 		break;							      \
 	      }								      \
 									      \
-	    *outptr++ = '+';						      \
-	    if (ch == '+')						      \
+	    *outptr++ = shift_character (var);				      \
+	    if (ch == shift_character (var))				      \
 	      *outptr++ = '-';						      \
 	    else if (ch < 0x10000)					      \
 	      {								      \
-		*outptr++ = base64 (ch >> 10);				      \
-		*outptr++ = base64 ((ch >> 4) & 0x3f);			      \
+		*outptr++ = base64 (ch >> 10, var);			      \
+		*outptr++ = base64 ((ch >> 4) & 0x3f, var);		      \
 		statep->__count = ((ch & 15) << 5) | (3 << 3);		      \
 	      }								      \
 	    else if (ch < 0x110000)					      \
@@ -354,11 +460,11 @@ base64 (unsigned int i)
 		uint32_t ch2 = 0xdc00 + ((ch - 0x10000) & 0x3ff);	      \
 									      \
 		ch = (ch1 << 16) | ch2;					      \
-		*outptr++ = base64 (ch >> 26);				      \
-		*outptr++ = base64 ((ch >> 20) & 0x3f);			      \
-		*outptr++ = base64 ((ch >> 14) & 0x3f);			      \
-		*outptr++ = base64 ((ch >> 8) & 0x3f);			      \
-		*outptr++ = base64 ((ch >> 2) & 0x3f);			      \
+		*outptr++ = base64 (ch >> 26, var);			      \
+		*outptr++ = base64 ((ch >> 20) & 0x3f, var);		      \
+		*outptr++ = base64 ((ch >> 14) & 0x3f, var);		      \
+		*outptr++ = base64 ((ch >> 8) & 0x3f, var);		      \
+		*outptr++ = base64 ((ch >> 2) & 0x3f, var);		      \
 		statep->__count = ((ch & 3) << 7) | (2 << 3);		      \
 	      }								      \
 	    else							      \
@@ -368,7 +474,7 @@ base64 (unsigned int i)
     else								      \
       {									      \
 	/* base64 encoding active */					      \
-	if (isdirect (ch))						      \
+	if (isdirect (ch, var))						      \
 	  {								      \
 	    /* deactivate base64 encoding */				      \
 	    size_t count;						      \
@@ -382,7 +488,7 @@ base64 (unsigned int i)
 	      }								      \
 									      \
 	    if ((statep->__count & 0x18) >= 0x10)			      \
-	      *outptr++ = base64 ((statep->__count >> 3) & ~3);		      \
+	      *outptr++ = base64 ((statep->__count >> 3) & ~3, var);	      \
 	    if (needs_explicit_shift (ch))				      \
 	      *outptr++ = '-';						      \
 	    *outptr++ = (unsigned char) ch;				      \
@@ -410,22 +516,24 @@ base64 (unsigned int i)
 		switch ((statep->__count >> 3) & 3)			      \
 		  {							      \
 		  case 1:						      \
-		    *outptr++ = base64 (ch >> 10);			      \
-		    *outptr++ = base64 ((ch >> 4) & 0x3f);		      \
+		    *outptr++ = base64 (ch >> 10, var);			      \
+		    *outptr++ = base64 ((ch >> 4) & 0x3f, var);		      \
 		    statep->__count = ((ch & 15) << 5) | (3 << 3);	      \
 		    break;						      \
 		  case 2:						      \
 		    *outptr++ =						      \
-		      base64 (((statep->__count >> 3) & ~3) | (ch >> 12));    \
-		    *outptr++ = base64 ((ch >> 6) & 0x3f);		      \
-		    *outptr++ = base64 (ch & 0x3f);			      \
+		      base64 (((statep->__count >> 3) & ~3) | (ch >> 12),     \
+			      var);					      \
+		    *outptr++ = base64 ((ch >> 6) & 0x3f, var);		      \
+		    *outptr++ = base64 (ch & 0x3f, var);		      \
 		    statep->__count = (1 << 3);				      \
 		    break;						      \
 		  case 3:						      \
 		    *outptr++ =						      \
-		      base64 (((statep->__count >> 3) & ~3) | (ch >> 14));    \
-		    *outptr++ = base64 ((ch >> 8) & 0x3f);		      \
-		    *outptr++ = base64 ((ch >> 2) & 0x3f);		      \
+		      base64 (((statep->__count >> 3) & ~3) | (ch >> 14),     \
+			      var);					      \
+		    *outptr++ = base64 ((ch >> 8) & 0x3f, var);		      \
+		    *outptr++ = base64 ((ch >> 2) & 0x3f, var);		      \
 		    statep->__count = ((ch & 3) << 7) | (2 << 3);	      \
 		    break;						      \
 		  default:						      \
@@ -441,30 +549,32 @@ base64 (unsigned int i)
 		switch ((statep->__count >> 3) & 3)			      \
 		  {							      \
 		  case 1:						      \
-		    *outptr++ = base64 (ch >> 26);			      \
-		    *outptr++ = base64 ((ch >> 20) & 0x3f);		      \
-		    *outptr++ = base64 ((ch >> 14) & 0x3f);		      \
-		    *outptr++ = base64 ((ch >> 8) & 0x3f);		      \
-		    *outptr++ = base64 ((ch >> 2) & 0x3f);		      \
+		    *outptr++ = base64 (ch >> 26, var);			      \
+		    *outptr++ = base64 ((ch >> 20) & 0x3f, var);	      \
+		    *outptr++ = base64 ((ch >> 14) & 0x3f, var);	      \
+		    *outptr++ = base64 ((ch >> 8) & 0x3f, var);		      \
+		    *outptr++ = base64 ((ch >> 2) & 0x3f, var);		      \
 		    statep->__count = ((ch & 3) << 7) | (2 << 3);	      \
 		    break;						      \
 		  case 2:						      \
 		    *outptr++ =						      \
-		      base64 (((statep->__count >> 3) & ~3) | (ch >> 28));    \
-		    *outptr++ = base64 ((ch >> 22) & 0x3f);		      \
-		    *outptr++ = base64 ((ch >> 16) & 0x3f);		      \
-		    *outptr++ = base64 ((ch >> 10) & 0x3f);		      \
-		    *outptr++ = base64 ((ch >> 4) & 0x3f);		      \
+		      base64 (((statep->__count >> 3) & ~3) | (ch >> 28),     \
+			      var);					      \
+		    *outptr++ = base64 ((ch >> 22) & 0x3f, var);	      \
+		    *outptr++ = base64 ((ch >> 16) & 0x3f, var);	      \
+		    *outptr++ = base64 ((ch >> 10) & 0x3f, var);	      \
+		    *outptr++ = base64 ((ch >> 4) & 0x3f, var);		      \
 		    statep->__count = ((ch & 15) << 5) | (3 << 3);	      \
 		    break;						      \
 		  case 3:						      \
 		    *outptr++ =						      \
-		      base64 (((statep->__count >> 3) & ~3) | (ch >> 30));    \
-		    *outptr++ = base64 ((ch >> 24) & 0x3f);		      \
-		    *outptr++ = base64 ((ch >> 18) & 0x3f);		      \
-		    *outptr++ = base64 ((ch >> 12) & 0x3f);		      \
-		    *outptr++ = base64 ((ch >> 6) & 0x3f);		      \
-		    *outptr++ = base64 (ch & 0x3f);			      \
+		      base64 (((statep->__count >> 3) & ~3) | (ch >> 30),     \
+			      var);					      \
+		    *outptr++ = base64 ((ch >> 24) & 0x3f, var);	      \
+		    *outptr++ = base64 ((ch >> 18) & 0x3f, var);	      \
+		    *outptr++ = base64 ((ch >> 12) & 0x3f, var);	      \
+		    *outptr++ = base64 ((ch >> 6) & 0x3f, var);		      \
+		    *outptr++ = base64 (ch & 0x3f, var);		      \
 		    statep->__count = (1 << 3);				      \
 		    break;						      \
 		  default:						      \
@@ -480,7 +590,7 @@ base64 (unsigned int i)
     inptr += 4;								      \
   }
 #define LOOP_NEED_FLAGS
-#define EXTRA_LOOP_DECLS	, mbstate_t *statep
+#define EXTRA_LOOP_DECLS	, mbstate_t *statep, enum variant var
 #include <iconv/loop.c>
 
 
@@ -510,7 +620,7 @@ base64 (unsigned int i)
 	    {								      \
 	      /* Write out the shift sequence.  */			      \
 	      if ((state & 0x18) >= 0x10)				      \
-		*outbuf++ = base64 ((state >> 3) & ~3);			      \
+		*outbuf++ = base64 ((state >> 3) & ~3, var);		      \
 	      *outbuf++ = '-';						      \
 									      \
 	      data->__statep->__count = 0;				      \
-- 
2.35.1



* Re: [PATCH v5 3/4] iconv: make utf-7.c able to use variants
  2022-03-20 16:42                 ` [PATCH v5 " Max Gautier
@ 2022-03-21 12:24                   ` Adhemerval Zanella
  0 siblings, 0 replies; 60+ messages in thread
From: Adhemerval Zanella @ 2022-03-21 12:24 UTC (permalink / raw)
  To: Max Gautier, libc-alpha



On 20/03/2022 13:42, Max Gautier via Libc-alpha wrote:
> Add infrastructure in utf-7.c to handle variants. The approach comes from
> iso646.c.
> The variant is defined at gconv_init time and is passed as a
> supplementary variable.
> 
> Signed-off-by: Max Gautier <mg@max.gautier.name>

Patch looks ok, although it should be refactored to add the 'enum variant'
argument to isdirect and isxdirect in this patch instead of relying on the
previous patch (to keep each patch self-consistent).
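
For readers unfamiliar with the iso646.c approach mentioned in the commit
message, the variant lookup can be sketched standalone as below. This is an
illustration only, not the code from the patch: lookup_variant is a
hypothetical helper name, strlen and strcasecmp stand in for glibc's internal
__rawmemchr and __strcasecmp, and the UTF_7_IMAP entry comes from the later
patch 4/4.

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>

/* Variants, in the same order as the entries of `names' below.  */
enum variant
{
  UTF7,
  UTF_7_IMAP
};

/* NUL-separated list of charset names, terminated by an empty string.  */
static const char names[] =
  "UTF-7//\0"
  "UTF-7-IMAP//\0"
  "\0";

/* Hypothetical helper: return the index of CHARSET in `names' (which is
   also its enum variant value), or -1 if the name is not listed.  */
static int
lookup_variant (const char *charset)
{
  int var = 0;

  /* Step from one NUL-terminated entry to the next until the empty
     entry that closes the list.  */
  for (const char *name = names; *name != '\0'; name += strlen (name) + 1)
    {
      if (strcasecmp (charset, name) == 0)
	return var;
      ++var;
    }
  return -1;
}
```

Because the position in the list doubles as the enum value, adding a variant
means adding one enumerator and one string in the same position, which is why
the comment in the patch insists on keeping the two in the same order.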

> ---
>  iconvdata/utf-7.c | 230 ++++++++++++++++++++++++++++++++++------------
>  1 file changed, 170 insertions(+), 60 deletions(-)
> 
> diff --git a/iconvdata/utf-7.c b/iconvdata/utf-7.c
> index 15f3669ac8..b639d8ff3e 100644
> --- a/iconvdata/utf-7.c
> +++ b/iconvdata/utf-7.c
> @@ -29,6 +29,24 @@
>  #include <stdlib.h>
>  
>  
> +enum variant
> +{
> +  UTF7,
> +};
> +
> +/* Must be in the same order as enum variant above.  */
> +static const char names[] =
> +  "UTF-7//\0"
> +  "\0";
> +
> +static uint32_t
> +shift_character (enum variant const var)
> +{
> +  if (var == UTF7)
> +    return '+';
> +  else
> +    abort ();
> +}
>  
>  static bool
>  between (uint32_t const ch,
> @@ -38,23 +56,27 @@ between (uint32_t const ch,
>  }
>  
>  /* The set of "direct characters":
> +   FOR UTF-7
>     A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr
>  */
>  
>  static bool
>  isdirect (uint32_t ch, enum variant var)
>  {
> -  return (between (ch, 'A', 'Z')
> -	  || between (ch, 'a', 'z')
> -	  || between (ch, '0', '9')
> -	  || ch == '\'' || ch == '(' || ch == ')'
> -	  || between (ch, ',', '/')
> -	  || ch == ':' || ch == '?'
> -	  || ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r');
> +  if (var == UTF7)
> +    return (between (ch, 'A', 'Z')
> +	    || between (ch, 'a', 'z')
> +	    || between (ch, '0', '9')
> +	    || ch == '\'' || ch == '(' || ch == ')'
> +	    || between (ch, ',', '/')
> +	    || ch == ':' || ch == '?'
> +	    || ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r');
> +  abort ();
>  }
>  
>  
>  /* The set of "direct and optional direct characters":
> +   (UTF-7 only)
>     A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr
>     ! " # $ % & * ; < = > @ [ ] ^ _ ` { | }
>  */
> @@ -62,10 +84,15 @@ isdirect (uint32_t ch, enum variant var)
>  static bool
>  isxdirect (uint32_t ch, enum variant var)
>  {
> -  return (ch == '\t'
> -	  || ch == '\n'
> -	  || ch == '\r'
> -	  || (between (ch, ' ', '}') && ch != '+' && ch != '\\'));
> +  if (isdirect (ch, var))
> +    return true;
> +  if (var != UTF7)
> +    return false;
> +  return between (ch, '!', '&')
> +    || ch == '*'
> +    || between (ch, ';', '@')
> +    || (between (ch, '[', '`') && ch != '\\')
> +    || between (ch, '{', '}');
>  }
>  
>  
> @@ -85,7 +112,7 @@ needs_explicit_shift (uint32_t ch)
>  
>  /* Converts a value in the range 0..63 to a base64 encoded char.  */
>  static unsigned char
> -base64 (unsigned int i)
> +base64 (unsigned int i, enum variant var)
>  {
>    if (i < 26)
>      return i + 'A';
> @@ -95,7 +122,7 @@ base64 (unsigned int i)
>      return i - 52 + '0';
>    else if (i == 62)
>      return '+';
> -  else if (i == 63)
> +  else if (i == 63 && var == UTF7)
>      return '/';
>    else
>      abort ();
> @@ -103,9 +130,8 @@ base64 (unsigned int i)
>  
>  
>  /* Definitions used in the body of the `gconv' function.  */
> -#define CHARSET_NAME		"UTF-7//"
> -#define DEFINE_INIT		1
> -#define DEFINE_FINI		1
> +#define DEFINE_INIT		0
> +#define DEFINE_FINI		0
>  #define FROM_LOOP		from_utf7_loop
>  #define TO_LOOP			to_utf7_loop
>  #define MIN_NEEDED_FROM		1
> @@ -113,11 +139,27 @@ base64 (unsigned int i)
>  #define MIN_NEEDED_TO		4
>  #define MAX_NEEDED_TO		4
>  #define ONE_DIRECTION		0
> +#define FROM_DIRECTION      (dir == from_utf7)
>  #define PREPARE_LOOP \
>    mbstate_t saved_state;						      \
> -  mbstate_t *statep = data->__statep;
> -#define EXTRA_LOOP_ARGS		, statep
> +  mbstate_t *statep = data->__statep;					      \
> +  enum direction dir = ((struct utf7_data *) step->__data)->dir;	      \
> +  enum variant var = ((struct utf7_data *) step->__data)->var;
> +#define EXTRA_LOOP_ARGS		, statep, var
> +
>  
> +enum direction
> +{
> +  illegal_dir,
> +  from_utf7,
> +  to_utf7
> +};
> +
> +struct utf7_data
> +{
> +  enum direction dir;
> +  enum variant var;
> +};
>  
>  /* Since we might have to reset input pointer we must be able to save
>     and restore the state.  */
> @@ -127,6 +169,70 @@ base64 (unsigned int i)
>    else									      \
>      *statep = saved_state
>  
> +int
> +gconv_init (struct __gconv_step *step)
> +{
> +  /* Determine which direction.  */
> +  struct utf7_data *new_data;
> +  enum direction dir = illegal_dir;
> +
> +  enum variant var = 0;
> +  for (const char *name = names; *name != '\0';
> +       name = __rawmemchr (name, '\0') + 1)
> +    {
> +      if (__strcasecmp (step->__from_name, name) == 0)
> +	{
> +	  dir = from_utf7;
> +	  break;
> +	}
> +      else if (__strcasecmp (step->__to_name, name) == 0)
> +	{
> +	  dir = to_utf7;
> +	  break;
> +	}
> +      ++var;
> +    }
> +
> +  if (__glibc_likely (dir != illegal_dir))
> +    {
> +      new_data = malloc (sizeof (*new_data));
> +      if (new_data == NULL)
> +	return __GCONV_NOMEM;
> +
> +      new_data->dir = dir;
> +      new_data->var = var;
> +      step->__data = new_data;
> +
> +      if (dir == from_utf7)
> +	{
> +	  step->__min_needed_from = MIN_NEEDED_FROM;
> +	  step->__max_needed_from = MAX_NEEDED_FROM;
> +	  step->__min_needed_to = MIN_NEEDED_TO;
> +	  step->__max_needed_to = MAX_NEEDED_TO;
> +	}
> +      else
> +	{
> +	  step->__min_needed_from = MIN_NEEDED_TO;
> +	  step->__max_needed_from = MAX_NEEDED_TO;
> +	  step->__min_needed_to = MIN_NEEDED_FROM;
> +	  step->__max_needed_to = MAX_NEEDED_FROM;
> +	}
> +    }
> +  else
> +    return __GCONV_NOCONV;
> +
> +  step->__stateful = 1;
> +
> +  return __GCONV_OK;
> +}
> +
> +void
> +gconv_end (struct __gconv_step *data)
> +{
> +  free (data->__data);
> +}
> +
> +
>  
>  /* First define the conversion function from UTF-7 to UCS4.
>     The state is structured as follows:
> @@ -154,13 +260,13 @@ base64 (unsigned int i)
>      if ((statep->__count >> 3) == 0)					      \
>        {									      \
>  	/* base64 encoding inactive.  */				      \
> -	if (isxdirect (ch))						      \
> +	if (isxdirect (ch, var))					      \
>  	  {								      \
>  	    inptr++;							      \
>  	    put32 (outptr, ch);						      \
>  	    outptr += 4;						      \
>  	  }								      \
> -	else if (__glibc_likely (ch == '+'))				      \
> +	else if (__glibc_likely (ch == shift_character (var)))		      \
>  	  {								      \
>  	    if (__glibc_unlikely (inptr + 2 > inend))			      \
>  	      {								      \
> @@ -285,7 +391,7 @@ base64 (unsigned int i)
>        }									      \
>    }
>  #define LOOP_NEED_FLAGS
> -#define EXTRA_LOOP_DECLS	, mbstate_t *statep
> +#define EXTRA_LOOP_DECLS	, mbstate_t *statep, enum variant var
>  #include <iconv/loop.c>
>  
>  
> @@ -316,7 +422,7 @@ base64 (unsigned int i)
>      if ((statep->__count & 0x18) == 0)					      \
>        {									      \
>  	/* base64 encoding inactive */					      \
> -	if (isdirect (ch))   						      \
> +	if (isdirect (ch, var))						      \
>  	  {								      \
>  	    *outptr++ = (unsigned char) ch;				      \
>  	  }								      \
> @@ -324,7 +430,7 @@ base64 (unsigned int i)
>  	  {								      \
>  	    size_t count;						      \
>  									      \
> -	    if (ch == '+')						      \
> +	    if (ch == shift_character (var))				      \
>  	      count = 2;						      \
>  	    else if (ch < 0x10000)					      \
>  	      count = 3;						      \
> @@ -339,13 +445,13 @@ base64 (unsigned int i)
>  		break;							      \
>  	      }								      \
>  									      \
> -	    *outptr++ = '+';						      \
> -	    if (ch == '+')						      \
> +	    *outptr++ = shift_character (var);				      \
> +	    if (ch == shift_character (var))				      \
>  	      *outptr++ = '-';						      \
>  	    else if (ch < 0x10000)					      \
>  	      {								      \
> -		*outptr++ = base64 (ch >> 10);				      \
> -		*outptr++ = base64 ((ch >> 4) & 0x3f);			      \
> +		*outptr++ = base64 (ch >> 10, var);			      \
> +		*outptr++ = base64 ((ch >> 4) & 0x3f, var);		      \
>  		statep->__count = ((ch & 15) << 5) | (3 << 3);		      \
>  	      }								      \
>  	    else if (ch < 0x110000)					      \
> @@ -354,11 +460,11 @@ base64 (unsigned int i)
>  		uint32_t ch2 = 0xdc00 + ((ch - 0x10000) & 0x3ff);	      \
>  									      \
>  		ch = (ch1 << 16) | ch2;					      \
> -		*outptr++ = base64 (ch >> 26);				      \
> -		*outptr++ = base64 ((ch >> 20) & 0x3f);			      \
> -		*outptr++ = base64 ((ch >> 14) & 0x3f);			      \
> -		*outptr++ = base64 ((ch >> 8) & 0x3f);			      \
> -		*outptr++ = base64 ((ch >> 2) & 0x3f);			      \
> +		*outptr++ = base64 (ch >> 26, var);			      \
> +		*outptr++ = base64 ((ch >> 20) & 0x3f, var);		      \
> +		*outptr++ = base64 ((ch >> 14) & 0x3f, var);		      \
> +		*outptr++ = base64 ((ch >> 8) & 0x3f, var);		      \
> +		*outptr++ = base64 ((ch >> 2) & 0x3f, var);		      \
>  		statep->__count = ((ch & 3) << 7) | (2 << 3);		      \
>  	      }								      \
>  	    else							      \
> @@ -368,7 +474,7 @@ base64 (unsigned int i)
>      else								      \
>        {									      \
>  	/* base64 encoding active */					      \
> -	if (isdirect (ch))						      \
> +	if (isdirect (ch, var))						      \
>  	  {								      \
>  	    /* deactivate base64 encoding */				      \
>  	    size_t count;						      \
> @@ -382,7 +488,7 @@ base64 (unsigned int i)
>  	      }								      \
>  									      \
>  	    if ((statep->__count & 0x18) >= 0x10)			      \
> -	      *outptr++ = base64 ((statep->__count >> 3) & ~3);		      \
> +	      *outptr++ = base64 ((statep->__count >> 3) & ~3, var);	      \
>  	    if (needs_explicit_shift (ch))				      \
>  	      *outptr++ = '-';						      \
>  	    *outptr++ = (unsigned char) ch;				      \
> @@ -410,22 +516,24 @@ base64 (unsigned int i)
>  		switch ((statep->__count >> 3) & 3)			      \
>  		  {							      \
>  		  case 1:						      \
> -		    *outptr++ = base64 (ch >> 10);			      \
> -		    *outptr++ = base64 ((ch >> 4) & 0x3f);		      \
> +		    *outptr++ = base64 (ch >> 10, var);			      \
> +		    *outptr++ = base64 ((ch >> 4) & 0x3f, var);		      \
>  		    statep->__count = ((ch & 15) << 5) | (3 << 3);	      \
>  		    break;						      \
>  		  case 2:						      \
>  		    *outptr++ =						      \
> -		      base64 (((statep->__count >> 3) & ~3) | (ch >> 12));    \
> -		    *outptr++ = base64 ((ch >> 6) & 0x3f);		      \
> -		    *outptr++ = base64 (ch & 0x3f);			      \
> +		      base64 (((statep->__count >> 3) & ~3) | (ch >> 12),     \
> +			      var);					      \
> +		    *outptr++ = base64 ((ch >> 6) & 0x3f, var);		      \
> +		    *outptr++ = base64 (ch & 0x3f, var);		      \
>  		    statep->__count = (1 << 3);				      \
>  		    break;						      \
>  		  case 3:						      \
>  		    *outptr++ =						      \
> -		      base64 (((statep->__count >> 3) & ~3) | (ch >> 14));    \
> -		    *outptr++ = base64 ((ch >> 8) & 0x3f);		      \
> -		    *outptr++ = base64 ((ch >> 2) & 0x3f);		      \
> +		      base64 (((statep->__count >> 3) & ~3) | (ch >> 14),     \
> +			      var);					      \
> +		    *outptr++ = base64 ((ch >> 8) & 0x3f, var);		      \
> +		    *outptr++ = base64 ((ch >> 2) & 0x3f, var);		      \
>  		    statep->__count = ((ch & 3) << 7) | (2 << 3);	      \
>  		    break;						      \
>  		  default:						      \
> @@ -441,30 +549,32 @@ base64 (unsigned int i)
>  		switch ((statep->__count >> 3) & 3)			      \
>  		  {							      \
>  		  case 1:						      \
> -		    *outptr++ = base64 (ch >> 26);			      \
> -		    *outptr++ = base64 ((ch >> 20) & 0x3f);		      \
> -		    *outptr++ = base64 ((ch >> 14) & 0x3f);		      \
> -		    *outptr++ = base64 ((ch >> 8) & 0x3f);		      \
> -		    *outptr++ = base64 ((ch >> 2) & 0x3f);		      \
> +		    *outptr++ = base64 (ch >> 26, var);			      \
> +		    *outptr++ = base64 ((ch >> 20) & 0x3f, var);	      \
> +		    *outptr++ = base64 ((ch >> 14) & 0x3f, var);	      \
> +		    *outptr++ = base64 ((ch >> 8) & 0x3f, var);		      \
> +		    *outptr++ = base64 ((ch >> 2) & 0x3f, var);		      \
>  		    statep->__count = ((ch & 3) << 7) | (2 << 3);	      \
>  		    break;						      \
>  		  case 2:						      \
>  		    *outptr++ =						      \
> -		      base64 (((statep->__count >> 3) & ~3) | (ch >> 28));    \
> -		    *outptr++ = base64 ((ch >> 22) & 0x3f);		      \
> -		    *outptr++ = base64 ((ch >> 16) & 0x3f);		      \
> -		    *outptr++ = base64 ((ch >> 10) & 0x3f);		      \
> -		    *outptr++ = base64 ((ch >> 4) & 0x3f);		      \
> +		      base64 (((statep->__count >> 3) & ~3) | (ch >> 28),     \
> +			      var);					      \
> +		    *outptr++ = base64 ((ch >> 22) & 0x3f, var);	      \
> +		    *outptr++ = base64 ((ch >> 16) & 0x3f, var);	      \
> +		    *outptr++ = base64 ((ch >> 10) & 0x3f, var);	      \
> +		    *outptr++ = base64 ((ch >> 4) & 0x3f, var);		      \
>  		    statep->__count = ((ch & 15) << 5) | (3 << 3);	      \
>  		    break;						      \
>  		  case 3:						      \
>  		    *outptr++ =						      \
> -		      base64 (((statep->__count >> 3) & ~3) | (ch >> 30));    \
> -		    *outptr++ = base64 ((ch >> 24) & 0x3f);		      \
> -		    *outptr++ = base64 ((ch >> 18) & 0x3f);		      \
> -		    *outptr++ = base64 ((ch >> 12) & 0x3f);		      \
> -		    *outptr++ = base64 ((ch >> 6) & 0x3f);		      \
> -		    *outptr++ = base64 (ch & 0x3f);			      \
> +		      base64 (((statep->__count >> 3) & ~3) | (ch >> 30),     \
> +			      var);					      \
> +		    *outptr++ = base64 ((ch >> 24) & 0x3f, var);	      \
> +		    *outptr++ = base64 ((ch >> 18) & 0x3f, var);	      \
> +		    *outptr++ = base64 ((ch >> 12) & 0x3f, var);	      \
> +		    *outptr++ = base64 ((ch >> 6) & 0x3f, var);		      \
> +		    *outptr++ = base64 (ch & 0x3f, var);		      \
>  		    statep->__count = (1 << 3);				      \
>  		    break;						      \
>  		  default:						      \
> @@ -480,7 +590,7 @@ base64 (unsigned int i)
>      inptr += 4;								      \
>    }
>  #define LOOP_NEED_FLAGS
> -#define EXTRA_LOOP_DECLS	, mbstate_t *statep
> +#define EXTRA_LOOP_DECLS	, mbstate_t *statep, enum variant var
>  #include <iconv/loop.c>
>  
>  
> @@ -510,7 +620,7 @@ base64 (unsigned int i)
>  	    {								      \
>  	      /* Write out the shift sequence.  */			      \
>  	      if ((state & 0x18) >= 0x10)				      \
> -		*outbuf++ = base64 ((state >> 3) & ~3);			      \
> +		*outbuf++ = base64 ((state >> 3) & ~3, var);		      \
>  	      *outbuf++ = '-';						      \
>  									      \
>  	      data->__statep->__count = 0;				      \


* [PATCH v4 4/4] iconv: Add UTF-7-IMAP variant in utf-7.c
  2021-12-09  9:31             ` [PATCH v4 0/4] iconv: Add support for UTF-7-IMAP Max Gautier
                                 ` (2 preceding siblings ...)
  2021-12-09  9:31               ` [PATCH v4 3/4] iconv: make utf-7.c able to use variants Max Gautier
@ 2021-12-09  9:31               ` Max Gautier
  2022-03-07 12:46                 ` Adhemerval Zanella
  2022-03-20 16:43                 ` [PATCH v5 " Max Gautier
  2021-12-17 13:15               ` [PATCH v4 0/4] iconv: Add support for UTF-7-IMAP Max Gautier
                                 ` (2 subsequent siblings)
  6 siblings, 2 replies; 60+ messages in thread
From: Max Gautier @ 2021-12-09  9:31 UTC (permalink / raw)
  To: libc-alpha; +Cc: Max Gautier

UTF-7-IMAP differs from UTF-7 in the following ways (see RFC 3501[1]
for reference):

- The shift character is '&' instead of '+'
- There are no "optional direct characters" and the "direct characters"
  set is different
- ',' replaces '/' in the Modified Base64 alphabet
- There is no implicit shift back to US-ASCII from BASE64, all BASE64
  sequences MUST be terminated with '-'

[1]: https://datatracker.ietf.org/doc/html/rfc3501#section-5.1.3
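
The rules listed above can be illustrated with a minimal standalone encoder.
This is only a sketch under stated assumptions, not the code in utf-7.c: the
function name imap_utf7_encode is hypothetical, the input is assumed to be
already split into UTF-16 code units (the glibc module works on UCS4 and
produces surrogates itself), and the caller is assumed to provide an output
buffer of at least 5 * n + 1 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Modified base64 alphabet: ',' takes the place of '/'.  */
static const char b64[] =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+,";

/* In this sketch, printable US-ASCII (0x20..0x7e) is written directly;
   '&' is handled separately because it is the shift character.  */
static int
is_printable_ascii (uint16_t u)
{
  return u >= 0x20 && u <= 0x7e;
}

/* Hypothetical helper: encode N UTF-16 code units from IN into OUT as
   modified UTF-7.  Assumes OUT has room for 5 * N + 1 bytes.  Returns
   the number of bytes written, not counting the terminating NUL.  */
static size_t
imap_utf7_encode (const uint16_t *in, size_t n, char *out)
{
  size_t o = 0;
  size_t i = 0;

  while (i < n)
    {
      if (in[i] == '&')
	{
	  /* A literal '&' is written as "&-".  */
	  out[o++] = '&';
	  out[o++] = '-';
	  i++;
	}
      else if (is_printable_ascii (in[i]))
	out[o++] = (char) in[i++];
      else
	{
	  uint32_t bits = 0;	/* Bit buffer; only the low NBITS matter.  */
	  int nbits = 0;

	  out[o++] = '&';	/* Shift in.  */
	  while (i < n && !is_printable_ascii (in[i]))
	    {
	      bits = (bits << 16) | in[i++];
	      nbits += 16;
	      while (nbits >= 6)
		{
		  nbits -= 6;
		  out[o++] = b64[(bits >> nbits) & 0x3f];
		}
	    }
	  /* Flush leftover bits, zero-padded on the right; no '='.  */
	  if (nbits > 0)
	    out[o++] = b64[(bits << (6 - nbits)) & 0x3f];
	  out[o++] = '-';	/* Explicit shift out, always required.  */
	}
    }
  out[o] = '\0';
  return o;
}
```

With this sketch, the code units of "Français" encode to "Fran&AOc-ais",
which is the form that appears in the test data added by this patch.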

Signed-off-by: Max Gautier <mg@max.gautier.name>
---
 iconvdata/TESTS                     |  1 +
 iconvdata/gconv-modules             |  4 ++++
 iconvdata/testdata/UTF-7-IMAP       |  1 +
 iconvdata/testdata/UTF-7-IMAP..UTF8 | 32 +++++++++++++++++++++++++++++
 iconvdata/utf-7.c                   | 28 ++++++++++++++++++++-----
 5 files changed, 61 insertions(+), 5 deletions(-)
 create mode 100644 iconvdata/testdata/UTF-7-IMAP
 create mode 100644 iconvdata/testdata/UTF-7-IMAP..UTF8

diff --git a/iconvdata/TESTS b/iconvdata/TESTS
index a0157c3350..3cc043c21b 100644
--- a/iconvdata/TESTS
+++ b/iconvdata/TESTS
@@ -94,6 +94,7 @@ EUC-TW			EUC-TW			Y	UTF8
 GBK			GBK			Y	UTF8
 BIG5HKSCS		BIG5HKSCS		Y	UTF8
 UTF-7			UTF-7			N	UTF8
+UTF-7-IMAP		UTF-7-IMAP		N	UTF8
 IBM856			IBM856			N	UTF8
 IBM922			IBM922			Y	UTF8
 IBM930			IBM930			N	UTF8
diff --git a/iconvdata/gconv-modules b/iconvdata/gconv-modules
index 4acbba062f..d120699394 100644
--- a/iconvdata/gconv-modules
+++ b/iconvdata/gconv-modules
@@ -113,3 +113,7 @@ module	INTERNAL		UTF-32BE//		UTF-32		1
 alias	UTF7//			UTF-7//
 module	UTF-7//			INTERNAL		UTF-7		1
 module	INTERNAL		UTF-7//			UTF-7		1
+
+#	from			to			module		cost
+module	UTF-7-IMAP//		INTERNAL		UTF-7		1
+module	INTERNAL		UTF-7-IMAP//		UTF-7		1
diff --git a/iconvdata/testdata/UTF-7-IMAP b/iconvdata/testdata/UTF-7-IMAP
new file mode 100644
index 0000000000..6b5dada63c
--- /dev/null
+++ b/iconvdata/testdata/UTF-7-IMAP
@@ -0,0 +1 @@
+&EqASGxItEps-       Amharic&AAoBDQ-esky      Czech&AAo-Dansk      Danish&AAo-English    English&AAo-Suomi      Finnish&AAo-Fran&AOc-ais   French&AAo-Deutsch    German&AAoDlQO7A7sDtwO9A7kDugOs-   Greek&AAoF4gXRBegF2QXq-      Hebrew&AAo-Italiano   Italian&AAo-Norsk      Norwegian&AAoEIARDBEEEQQQ6BDgEOQ-    Russian&AAo-Espa&APE-ol    Spanish&AAo-Svenska    Swedish&AAoOIA4yDikOMg5EDhcOIg-    Thai&AAo-T&APw-rk&AOc-e     Turkish&AAo-Ti&Hr8-ng Vi&Hsc-t Vietnamese&AApl5Wcsip4-     Japanese&AApOLWWH-       Chinese&AArVXK4A-       Korean&AAoACg-// Checking for correct handling of shift characters ('&-', '-') after base64 sequences&AArVXK4A-&-&AArVXK4A--&AAoACg-// Checking for correct handling of litteral '&-' and '-'&AAo----&-&--&AAoACg-// The last line of this file is missing the end-of-line terminator&AAo-// on purpose, in order to test that the conversion empties the bit buffer&AAo-// and shifts back to the initial state at the end of the conversion.&AAo-A&ImIDkQ-
\ No newline at end of file
diff --git a/iconvdata/testdata/UTF-7-IMAP..UTF8 b/iconvdata/testdata/UTF-7-IMAP..UTF8
new file mode 100644
index 0000000000..8b9add3670
--- /dev/null
+++ b/iconvdata/testdata/UTF-7-IMAP..UTF8
@@ -0,0 +1,32 @@
+አማርኛ       Amharic
+česky      Czech
+Dansk      Danish
+English    English
+Suomi      Finnish
+Français   French
+Deutsch    German
+Ελληνικά   Greek
+עברית      Hebrew
+Italiano   Italian
+Norsk      Norwegian
+Русский    Russian
+Español    Spanish
+Svenska    Swedish
+ภาษาไทย    Thai
+Türkçe     Turkish
+Tiếng Việt Vietnamese
+日本語     Japanese
+中文       Chinese
+한글       Korean
+
+// Checking for correct handling of shift characters ('&', '-') after base64 sequences
+한글&
+한글-
+
+// Checking for correct handling of litteral '&' and '-'
+---&&-
+
+// The last line of this file is missing the end-of-line terminator
+// on purpose, in order to test that the conversion empties the bit buffer
+// and shifts back to the initial state at the end of the conversion.
+A≢Α
\ No newline at end of file
diff --git a/iconvdata/utf-7.c b/iconvdata/utf-7.c
index 965d4220f1..553636e324 100644
--- a/iconvdata/utf-7.c
+++ b/iconvdata/utf-7.c
@@ -32,11 +32,13 @@
 enum variant
 {
     UTF7,
+    UTF_7_IMAP
 };
 
 /* Must be in the same order as enum variant above.  */
 static const char names[] =
   "UTF-7//\0"
+  "UTF-7-IMAP//\0"
   "\0";
 
 static uint32_t
@@ -44,6 +46,8 @@ shift_character(enum variant const var)
 {
     if (var == UTF7)
         return '+';
+    else if (var == UTF_7_IMAP)
+        return '&';
     else
         abort();
 }
@@ -58,6 +62,9 @@ between(uint32_t const ch,
 /* The set of "direct characters":
    FOR UTF-7
    A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr
+   FOR UTF-7-IMAP
+   A-Z a-z 0-9 ' ( ) , - . / : ? space
+   ! " # $ % + * ; < = > @ [ \ ] ^ _ ` { | } ~
 */
 
 static int
@@ -71,6 +78,8 @@ isdirect (uint32_t ch, enum variant var)
                 || between(ch, ',', '/')
                 || ch == ':' || ch == '?'
                 || ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r');
+    else if (var == UTF_7_IMAP)
+        return (ch != '&' && between(ch, ' ', '~'));
     abort();
 }
 
@@ -127,6 +136,8 @@ base64 (unsigned int i, enum variant var)
     return '+';
   else if (i == 63 && var == UTF7)
     return '/';
+  else if (i == 63 && var == UTF_7_IMAP)
+    return ',';
   else
     abort ();
 }
@@ -313,7 +324,8 @@ gconv_end (struct __gconv_step *data)
 	  i = ch - '0' + 52;						      \
 	else if (ch == '+')						      \
 	  i = 62;							      \
-	else if (ch == '/')						      \
+	else if ((var == UTF7 && ch == '/')                                   \
+		  || (var == UTF_7_IMAP && ch == ','))			      \
 	  i = 63;							      \
 	else								      \
 	  {								      \
@@ -321,8 +333,10 @@ gconv_end (struct __gconv_step *data)
 									      \
 	    /* If accumulated data is nonzero, the input is invalid.  */      \
 	    /* Also, partial UTF-16 characters are invalid.  */		      \
+	    /* In IMAP variant, must be terminated by '-'.  */		      \
 	    if (__builtin_expect (statep->__value.__wch != 0, 0)	      \
-		|| __builtin_expect ((statep->__count >> 3) <= 26, 0))	      \
+		|| __builtin_expect ((statep->__count >> 3) <= 26, 0)	      \
+		|| __builtin_expect (var == UTF_7_IMAP && ch != '-', 0))      \
 	      {								      \
 		STANDARD_FROM_LOOP_ERR_HANDLER ((statep->__count = 0, 1));    \
 	      }								      \
@@ -479,13 +493,15 @@ gconv_end (struct __gconv_step *data)
     else								      \
       {									      \
 	/* base64 encoding active */					      \
-	if (isdirect (ch, var))						      \
+	if ((var == UTF_7_IMAP && ch == '&') || isdirect (ch, var))	      \
 	  {								      \
 	    /* deactivate base64 encoding */				      \
 	    size_t count;						      \
 									      \
 	    count = ((statep->__count & 0x18) >= 0x10)			      \
-	      + needs_explicit_shift (ch) + 1;				      \
+	      + (var == UTF_7_IMAP || needs_explicit_shift (ch))	      \
+	      + (var == UTF_7_IMAP && ch == '&')			      \
+	      + 1;							      \
 	    if (__glibc_unlikely (outptr + count > outend))		      \
 	      {								      \
 		result = __GCONV_FULL_OUTPUT;				      \
@@ -494,9 +510,11 @@ gconv_end (struct __gconv_step *data)
 									      \
 	    if ((statep->__count & 0x18) >= 0x10)			      \
 	      *outptr++ = base64 ((statep->__count >> 3) & ~3, var);	      \
-	    if (needs_explicit_shift (ch))				      \
+	    if (var == UTF_7_IMAP || needs_explicit_shift (ch))		      \
 	      *outptr++ = '-';						      \
 	    *outptr++ = (unsigned char) ch;				      \
+	    if (var == UTF_7_IMAP && ch == '&')				      \
+	      *outptr++ = '-';						      \
 	    statep->__count = 0;					      \
 	  }								      \
 	else								      \
-- 
2.34.1



* Re: [PATCH v4 4/4] iconv: Add UTF-7-IMAP variant in utf-7.c
  2021-12-09  9:31               ` [PATCH v4 4/4] iconv: Add UTF-7-IMAP variant in utf-7.c Max Gautier
@ 2022-03-07 12:46                 ` Adhemerval Zanella
  2022-03-20 16:43                 ` [PATCH v5 " Max Gautier
  1 sibling, 0 replies; 60+ messages in thread
From: Adhemerval Zanella @ 2022-03-07 12:46 UTC (permalink / raw)
  To: Max Gautier, libc-alpha



On 09/12/2021 06:31, Max Gautier via Libc-alpha wrote:
> UTF-7-IMAP differs from UTF-7 in the following ways (see RFC 3501[1]
> for reference):
> 
> - The shift character is '&' instead of '+'
> - There are no "optional direct characters" and the "direct characters"
>   set is different
> - ',' replaces '/' in the Modified Base64 alphabet
> - There is no implicit shift back to US-ASCII from BASE64, all BASE64
>   sequences MUST be terminated with '-'
> 
> [1]: https://datatracker.ietf.org/doc/html/rfc3501#section-5.1.3
> 
> Signed-off-by: Max Gautier <mg@max.gautier.name>

Patch looks ok, apart from some minor style issues (as in the other patches).

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>

> ---
>  iconvdata/TESTS                     |  1 +
>  iconvdata/gconv-modules             |  4 ++++
>  iconvdata/testdata/UTF-7-IMAP       |  1 +
>  iconvdata/testdata/UTF-7-IMAP..UTF8 | 32 +++++++++++++++++++++++++++++
>  iconvdata/utf-7.c                   | 28 ++++++++++++++++++++-----
>  5 files changed, 61 insertions(+), 5 deletions(-)
>  create mode 100644 iconvdata/testdata/UTF-7-IMAP
>  create mode 100644 iconvdata/testdata/UTF-7-IMAP..UTF8
> 
> diff --git a/iconvdata/TESTS b/iconvdata/TESTS
> index a0157c3350..3cc043c21b 100644
> --- a/iconvdata/TESTS
> +++ b/iconvdata/TESTS
> @@ -94,6 +94,7 @@ EUC-TW			EUC-TW			Y	UTF8
>  GBK			GBK			Y	UTF8
>  BIG5HKSCS		BIG5HKSCS		Y	UTF8
>  UTF-7			UTF-7			N	UTF8
> +UTF-7-IMAP		UTF-7-IMAP		N	UTF8
>  IBM856			IBM856			N	UTF8
>  IBM922			IBM922			Y	UTF8
>  IBM930			IBM930			N	UTF8

Ok.

> diff --git a/iconvdata/gconv-modules b/iconvdata/gconv-modules
> index 4acbba062f..d120699394 100644
> --- a/iconvdata/gconv-modules
> +++ b/iconvdata/gconv-modules
> @@ -113,3 +113,7 @@ module	INTERNAL		UTF-32BE//		UTF-32		1
>  alias	UTF7//			UTF-7//
>  module	UTF-7//			INTERNAL		UTF-7		1
>  module	INTERNAL		UTF-7//			UTF-7		1
> +
> +#	from			to			module		cost
> +module	UTF-7-IMAP//		INTERNAL		UTF-7		1
> +module	INTERNAL		UTF-7-IMAP//		UTF-7		1

Ok.

> diff --git a/iconvdata/testdata/UTF-7-IMAP b/iconvdata/testdata/UTF-7-IMAP
> new file mode 100644
> index 0000000000..6b5dada63c
> --- /dev/null
> +++ b/iconvdata/testdata/UTF-7-IMAP
> @@ -0,0 +1 @@
> +&EqASGxItEps-       Amharic&AAoBDQ-esky      Czech&AAo-Dansk      Danish&AAo-English    English&AAo-Suomi      Finnish&AAo-Fran&AOc-ais   French&AAo-Deutsch    German&AAoDlQO7A7sDtwO9A7kDugOs-   Greek&AAoF4gXRBegF2QXq-      Hebrew&AAo-Italiano   Italian&AAo-Norsk      Norwegian&AAoEIARDBEEEQQQ6BDgEOQ-    Russian&AAo-Espa&APE-ol    Spanish&AAo-Svenska    Swedish&AAoOIA4yDikOMg5EDhcOIg-    Thai&AAo-T&APw-rk&AOc-e     Turkish&AAo-Ti&Hr8-ng Vi&Hsc-t Vietnamese&AApl5Wcsip4-     Japanese&AApOLWWH-       Chinese&AArVXK4A-       Korean&AAoACg-// Checking for correct handling of shift characters ('&-', '-') after base64 sequences&AArVXK4A-&-&AArVXK4A--&AAoACg-// Checking for correct handling of litteral '&-' and '-'&AAo----&-&--&AAoACg-// The last line of this file is missing the end-of-line terminator&AAo-// on purpose, in order to test that the conversion empties the bit buffer&AAo-// and shifts back to the initial state at the end of the conversion.&AAo-A&ImIDkQ-
> \ No newline at end of file

Ok.

> diff --git a/iconvdata/testdata/UTF-7-IMAP..UTF8 b/iconvdata/testdata/UTF-7-IMAP..UTF8
> new file mode 100644
> index 0000000000..8b9add3670
> --- /dev/null
> +++ b/iconvdata/testdata/UTF-7-IMAP..UTF8
> @@ -0,0 +1,32 @@
> +አማርኛ       Amharic
> +česky      Czech
> +Dansk      Danish
> +English    English
> +Suomi      Finnish
> +Français   French
> +Deutsch    German
> +Ελληνικά   Greek
> +עברית      Hebrew
> +Italiano   Italian
> +Norsk      Norwegian
> +Русский    Russian
> +Español    Spanish
> +Svenska    Swedish
> +ภาษาไทย    Thai
> +Türkçe     Turkish
> +Tiếng Việt Vietnamese
> +日本語     Japanese
> +中文       Chinese
> +한글       Korean
> +
> +// Checking for correct handling of shift characters ('&', '-') after base64 sequences
> +한글&
> +한글-
> +
> +// Checking for correct handling of litteral '&' and '-'
> +---&&-
> +
> +// The last line of this file is missing the end-of-line terminator
> +// on purpose, in order to test that the conversion empties the bit buffer
> +// and shifts back to the initial state at the end of the conversion.
> +A≢Α
> \ No newline at end of file

Ok.

> diff --git a/iconvdata/utf-7.c b/iconvdata/utf-7.c
> index 965d4220f1..553636e324 100644
> --- a/iconvdata/utf-7.c
> +++ b/iconvdata/utf-7.c
> @@ -32,11 +32,13 @@
>  enum variant
>  {
>      UTF7,
> +    UTF_7_IMAP
>  };
>  
>  /* Must be in the same order as enum variant above.  */
>  static const char names[] =
>    "UTF-7//\0"
> +  "UTF-7-IMAP//\0"
>    "\0";
>  
>  static uint32_t
> @@ -44,6 +46,8 @@ shift_character(enum variant const var)
>  {
>      if (var == UTF7)
>          return '+';
> +    else if (var == UTF_7_IMAP)
> +        return '&';
>      else
>          abort();
>  }
> @@ -58,6 +62,9 @@ between(uint32_t const ch,
>  /* The set of "direct characters":
>     FOR UTF-7
>     A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr
> +   FOR UTF-7-IMAP
> +   A-Z a-z 0-9 ' ( ) , - . / : ? space
> +   ! " # $ % + * ; < = > @ [ \ ] ^ _ ` { | } ~
>  */
>  
>  static int
> @@ -71,6 +78,8 @@ isdirect (uint32_t ch, enum variant var)
>                  || between(ch, ',', '/')
>                  || ch == ':' || ch == '?'
>                  || ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r');
> +    else if (var == UTF_7_IMAP)
> +        return (ch != '&' && between(ch, ' ', '~'));
>      abort();
>  }
>  

Some style issues as before.

> @@ -127,6 +136,8 @@ base64 (unsigned int i, enum variant var)
>      return '+';
>    else if (i == 63 && var == UTF7)
>      return '/';
> +  else if (i == 63 && var == UTF_7_IMAP)
> +    return ',';
>    else
>      abort ();
>  }
> @@ -313,7 +324,8 @@ gconv_end (struct __gconv_step *data)
>  	  i = ch - '0' + 52;						      \
>  	else if (ch == '+')						      \
>  	  i = 62;							      \
> -	else if (ch == '/')						      \
> +	else if ((var == UTF7 && ch == '/')                                   \
> +		  || (var == UTF_7_IMAP && ch == ','))			      \
>  	  i = 63;							      \
>  	else								      \
>  	  {								      \
> @@ -321,8 +333,10 @@ gconv_end (struct __gconv_step *data)
>  									      \
>  	    /* If accumulated data is nonzero, the input is invalid.  */      \
>  	    /* Also, partial UTF-16 characters are invalid.  */		      \
> +	    /* In IMAP variant, must be terminated by '-'.  */		      \
>  	    if (__builtin_expect (statep->__value.__wch != 0, 0)	      \
> -		|| __builtin_expect ((statep->__count >> 3) <= 26, 0))	      \
> +		|| __builtin_expect ((statep->__count >> 3) <= 26, 0)	      \
> +		|| __builtin_expect (var == UTF_7_IMAP && ch != '-', 0))      \

Use __glibc_unlikely here instead of __builtin_expect (..., 0).
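
For readers not familiar with these macros: to my knowledge, glibc's
__glibc_likely (cond) and __glibc_unlikely (cond) are thin wrappers around
__builtin_expect ((cond), 1) and __builtin_expect ((cond), 0), so the
request changes spelling only, not behaviour. The sketch below is a
standalone illustration and not code from utf-7.c: the my_* macros and
bad_base64_end are hypothetical names, the boolean parameters stand in for
the real state checks, and it assumes GCC or Clang.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for glibc's __glibc_likely and
   __glibc_unlikely, so that the sketch builds outside the glibc tree.
   __builtin_expect requires GCC or Clang.  */
#define my_likely(cond)   __builtin_expect ((cond), 1)
#define my_unlikely(cond) __builtin_expect ((cond), 0)

/* Same shape as the condition in the hunk above.  The three boolean
   parameters are placeholders for the checks on STATEP and VAR.  */
static bool
bad_base64_end (bool leftover_data, bool partial_utf16,
                bool imap_variant, int ch)
{
  assert (ch >= 0 && ch <= 0xff);  /* CH is a single input byte.  */
  return (my_unlikely (leftover_data)
          || my_unlikely (partial_utf16)
          || my_unlikely (imap_variant && ch != '-'));
}
```

With clean state, only the IMAP variant rejects a terminator other than
'-'; plain UTF-7 accepts it.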

>  	      {								      \
>  		STANDARD_FROM_LOOP_ERR_HANDLER ((statep->__count = 0, 1));    \
>  	      }								      \
> @@ -479,13 +493,15 @@ gconv_end (struct __gconv_step *data)
>      else								      \
>        {									      \
>  	/* base64 encoding active */					      \
> -	if (isdirect (ch, var))						      \
> +	if ((var == UTF_7_IMAP && ch == '&') || isdirect (ch, var))	      \
>  	  {								      \
>  	    /* deactivate base64 encoding */				      \
>  	    size_t count;						      \
>  									      \
>  	    count = ((statep->__count & 0x18) >= 0x10)			      \
> -	      + needs_explicit_shift (ch) + 1;				      \
> +	      + (var == UTF_7_IMAP || needs_explicit_shift (ch))	      \
> +	      + (var == UTF_7_IMAP && ch == '&')			      \
> +	      + 1;							      \
>  	    if (__glibc_unlikely (outptr + count > outend))		      \
>  	      {								      \
>  		result = __GCONV_FULL_OUTPUT;				      \
> @@ -494,9 +510,11 @@ gconv_end (struct __gconv_step *data)
>  									      \
>  	    if ((statep->__count & 0x18) >= 0x10)			      \
>  	      *outptr++ = base64 ((statep->__count >> 3) & ~3, var);	      \
> -	    if (needs_explicit_shift (ch))				      \
> +	    if (var == UTF_7_IMAP || needs_explicit_shift (ch))		      \
>  	      *outptr++ = '-';						      \
>  	    *outptr++ = (unsigned char) ch;				      \
> +	    if (var == UTF_7_IMAP && ch == '&')				      \
> +	      *outptr++ = '-';						      \
>  	    statep->__count = 0;					      \
>  	  }								      \
>  	else								      \


* [PATCH v5 4/4] iconv: Add UTF-7-IMAP variant in utf-7.c
  2021-12-09  9:31               ` [PATCH v4 4/4] iconv: Add UTF-7-IMAP variant in utf-7.c Max Gautier
  2022-03-07 12:46                 ` Adhemerval Zanella
@ 2022-03-20 16:43                 ` Max Gautier
  2022-03-21 12:24                   ` Adhemerval Zanella
  1 sibling, 1 reply; 60+ messages in thread
From: Max Gautier @ 2022-03-20 16:43 UTC (permalink / raw)
  To: libc-alpha; +Cc: Max Gautier

UTF-7-IMAP differs from UTF-7 in the following ways (see RFC 3501[1]
for reference):

- The shift character is '&' instead of '+'.
- There are no "optional direct characters", and the set of "direct
  characters" is different.
- The modified BASE64 alphabet uses ',' instead of '/'.
- There is no implicit shift back to US-ASCII from BASE64; every BASE64
  sequence MUST be terminated with '-'.

[1]: https://datatracker.ietf.org/doc/html/rfc3501#section-5.1.3
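
To make the rules above concrete, here is a minimal standalone encoder
sketch. It is not the glibc implementation (which runs inside the gconv
state machine in utf-7.c). The names mbase64, is_direct and
encode_utf7_imap are hypothetical. It takes UTF-16 code units as input,
so characters outside the BMP are assumed to arrive as surrogate pairs,
and the caller is assumed to supply an output buffer of at least
5 * n + 1 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Modified base64 alphabet: ',' takes the place of '/'.  */
static const char mbase64[] =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+,";

/* Printable US-ASCII other than '&' represents itself.  */
static int
is_direct (uint16_t ch)
{
  return ch >= 0x20 && ch <= 0x7e && ch != '&';
}

/* Encode N UTF-16 code units from IN into OUT as a NUL-terminated
   string and return its length.  OUT must hold 5 * N + 1 bytes.  */
static size_t
encode_utf7_imap (const uint16_t *in, size_t n, char *out)
{
  size_t i = 0, o = 0;

  assert (in != NULL && out != NULL);
  while (i < n)
    {
      if (is_direct (in[i]))
        out[o++] = (char) in[i++];
      else if (in[i] == '&')
        {
          /* A literal '&' is written as "&-".  */
          out[o++] = '&';
          out[o++] = '-';
          i++;
        }
      else
        {
          uint32_t bits = 0;  /* Bit buffer, newest bits at the bottom.  */
          int nbits = 0;      /* Number of bits not yet written out.  */

          out[o++] = '&';     /* Shift into base64.  */
          while (i < n && !is_direct (in[i]) && in[i] != '&')
            {
              bits = (bits << 16) | in[i++];
              nbits += 16;
              while (nbits >= 6)
                {
                  nbits -= 6;
                  out[o++] = mbase64[(bits >> nbits) & 0x3f];
                }
            }
          /* Flush leftover bits, zero-padded on the right.  */
          if (nbits > 0)
            out[o++] = mbase64[(bits << (6 - nbits)) & 0x3f];
          out[o++] = '-';     /* Explicit shift back, always.  */
        }
    }
  out[o] = '\0';
  return o;
}
```

For example, the code units for "Français" encode to "Fran&AOc-ais",
which matches the corresponding entry in the test data added by this
patch.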

Signed-off-by: Max Gautier <mg@max.gautier.name>
---
 iconvdata/TESTS                     |  1 +
 iconvdata/gconv-modules             |  4 ++++
 iconvdata/testdata/UTF-7-IMAP       |  1 +
 iconvdata/testdata/UTF-7-IMAP..UTF8 | 32 +++++++++++++++++++++++++++++
 iconvdata/utf-7.c                   | 30 +++++++++++++++++++++------
 5 files changed, 62 insertions(+), 6 deletions(-)
 create mode 100644 iconvdata/testdata/UTF-7-IMAP
 create mode 100644 iconvdata/testdata/UTF-7-IMAP..UTF8

diff --git a/iconvdata/TESTS b/iconvdata/TESTS
index a0157c3350..3cc043c21b 100644
--- a/iconvdata/TESTS
+++ b/iconvdata/TESTS
@@ -94,6 +94,7 @@ EUC-TW			EUC-TW			Y	UTF8
 GBK			GBK			Y	UTF8
 BIG5HKSCS		BIG5HKSCS		Y	UTF8
 UTF-7			UTF-7			N	UTF8
+UTF-7-IMAP		UTF-7-IMAP		N	UTF8
 IBM856			IBM856			N	UTF8
 IBM922			IBM922			Y	UTF8
 IBM930			IBM930			N	UTF8
diff --git a/iconvdata/gconv-modules b/iconvdata/gconv-modules
index 4acbba062f..d120699394 100644
--- a/iconvdata/gconv-modules
+++ b/iconvdata/gconv-modules
@@ -113,3 +113,7 @@ module	INTERNAL		UTF-32BE//		UTF-32		1
 alias	UTF7//			UTF-7//
 module	UTF-7//			INTERNAL		UTF-7		1
 module	INTERNAL		UTF-7//			UTF-7		1
+
+#	from			to			module		cost
+module	UTF-7-IMAP//		INTERNAL		UTF-7		1
+module	INTERNAL		UTF-7-IMAP//		UTF-7		1
diff --git a/iconvdata/testdata/UTF-7-IMAP b/iconvdata/testdata/UTF-7-IMAP
new file mode 100644
index 0000000000..6b5dada63c
--- /dev/null
+++ b/iconvdata/testdata/UTF-7-IMAP
@@ -0,0 +1 @@
+&EqASGxItEps-       Amharic&AAoBDQ-esky      Czech&AAo-Dansk      Danish&AAo-English    English&AAo-Suomi      Finnish&AAo-Fran&AOc-ais   French&AAo-Deutsch    German&AAoDlQO7A7sDtwO9A7kDugOs-   Greek&AAoF4gXRBegF2QXq-      Hebrew&AAo-Italiano   Italian&AAo-Norsk      Norwegian&AAoEIARDBEEEQQQ6BDgEOQ-    Russian&AAo-Espa&APE-ol    Spanish&AAo-Svenska    Swedish&AAoOIA4yDikOMg5EDhcOIg-    Thai&AAo-T&APw-rk&AOc-e     Turkish&AAo-Ti&Hr8-ng Vi&Hsc-t Vietnamese&AApl5Wcsip4-     Japanese&AApOLWWH-       Chinese&AArVXK4A-       Korean&AAoACg-// Checking for correct handling of shift characters ('&-', '-') after base64 sequences&AArVXK4A-&-&AArVXK4A--&AAoACg-// Checking for correct handling of litteral '&-' and '-'&AAo----&-&--&AAoACg-// The last line of this file is missing the end-of-line terminator&AAo-// on purpose, in order to test that the conversion empties the bit buffer&AAo-// and shifts back to the initial state at the end of the conversion.&AAo-A&ImIDkQ-
\ No newline at end of file
diff --git a/iconvdata/testdata/UTF-7-IMAP..UTF8 b/iconvdata/testdata/UTF-7-IMAP..UTF8
new file mode 100644
index 0000000000..8b9add3670
--- /dev/null
+++ b/iconvdata/testdata/UTF-7-IMAP..UTF8
@@ -0,0 +1,32 @@
+አማርኛ       Amharic
+česky      Czech
+Dansk      Danish
+English    English
+Suomi      Finnish
+Français   French
+Deutsch    German
+Ελληνικά   Greek
+עברית      Hebrew
+Italiano   Italian
+Norsk      Norwegian
+Русский    Russian
+Español    Spanish
+Svenska    Swedish
+ภาษาไทย    Thai
+Türkçe     Turkish
+Tiếng Việt Vietnamese
+日本語     Japanese
+中文       Chinese
+한글       Korean
+
+// Checking for correct handling of shift characters ('&', '-') after base64 sequences
+한글&
+한글-
+
+// Checking for correct handling of litteral '&' and '-'
+---&&-
+
+// The last line of this file is missing the end-of-line terminator
+// on purpose, in order to test that the conversion empties the bit buffer
+// and shifts back to the initial state at the end of the conversion.
+A≢Α
\ No newline at end of file
diff --git a/iconvdata/utf-7.c b/iconvdata/utf-7.c
index b639d8ff3e..5c2e17e50c 100644
--- a/iconvdata/utf-7.c
+++ b/iconvdata/utf-7.c
@@ -32,11 +32,13 @@
 enum variant
 {
   UTF7,
+  UTF_7_IMAP
 };
 
 /* Must be in the same order as enum variant above.  */
 static const char names[] =
   "UTF-7//\0"
+  "UTF-7-IMAP//\0"
   "\0";
 
 static uint32_t
@@ -44,6 +46,8 @@ shift_character (enum variant const var)
 {
   if (var == UTF7)
     return '+';
+  else if (var == UTF_7_IMAP)
+    return '&';
   else
     abort ();
 }
@@ -58,6 +62,9 @@ between (uint32_t const ch,
 /* The set of "direct characters":
    FOR UTF-7
    A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr
+   FOR UTF-7-IMAP
+   A-Z a-z 0-9 ' ( ) , - . / : ? space
+   ! " # $ % + * ; < = > @ [ \ ] ^ _ ` { | } ~
 */
 
 static bool
@@ -71,6 +78,8 @@ isdirect (uint32_t ch, enum variant var)
 	    || between (ch, ',', '/')
 	    || ch == ':' || ch == '?'
 	    || ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r');
+  else if (var == UTF_7_IMAP)
+    return (ch != '&' && between (ch, ' ', '~'));
   abort ();
 }
 
@@ -124,6 +133,8 @@ base64 (unsigned int i, enum variant var)
     return '+';
   else if (i == 63 && var == UTF7)
     return '/';
+  else if (i == 63 && var == UTF_7_IMAP)
+    return ',';
   else
     abort ();
 }
@@ -308,7 +319,8 @@ gconv_end (struct __gconv_step *data)
 	  i = ch - '0' + 52;						      \
 	else if (ch == '+')						      \
 	  i = 62;							      \
-	else if (ch == '/')						      \
+	else if ((var == UTF7 && ch == '/')                                   \
+		  || (var == UTF_7_IMAP && ch == ','))			      \
 	  i = 63;							      \
 	else								      \
 	  {								      \
@@ -316,8 +328,10 @@ gconv_end (struct __gconv_step *data)
 									      \
 	    /* If accumulated data is nonzero, the input is invalid.  */      \
 	    /* Also, partial UTF-16 characters are invalid.  */		      \
-	    if (__builtin_expect (statep->__value.__wch != 0, 0)	      \
-		|| __builtin_expect ((statep->__count >> 3) <= 26, 0))	      \
+	    /* In IMAP variant, must be terminated by '-'.  */		      \
+	    if (__glibc_unlikely (statep->__value.__wch != 0)		      \
+		|| __glibc_unlikely ((statep->__count >> 3) <= 26)	      \
+		|| __glibc_unlikely (var == UTF_7_IMAP && ch != '-'))	      \
 	      {								      \
 		STANDARD_FROM_LOOP_ERR_HANDLER ((statep->__count = 0, 1));    \
 	      }								      \
@@ -474,13 +488,15 @@ gconv_end (struct __gconv_step *data)
     else								      \
       {									      \
 	/* base64 encoding active */					      \
-	if (isdirect (ch, var))						      \
+	if ((var == UTF_7_IMAP && ch == '&') || isdirect (ch, var))	      \
 	  {								      \
 	    /* deactivate base64 encoding */				      \
 	    size_t count;						      \
 									      \
 	    count = ((statep->__count & 0x18) >= 0x10)			      \
-	      + needs_explicit_shift (ch) + 1;				      \
+	      + (var == UTF_7_IMAP || needs_explicit_shift (ch))	      \
+	      + (var == UTF_7_IMAP && ch == '&')			      \
+	      + 1;							      \
 	    if (__glibc_unlikely (outptr + count > outend))		      \
 	      {								      \
 		result = __GCONV_FULL_OUTPUT;				      \
@@ -489,9 +505,11 @@ gconv_end (struct __gconv_step *data)
 									      \
 	    if ((statep->__count & 0x18) >= 0x10)			      \
 	      *outptr++ = base64 ((statep->__count >> 3) & ~3, var);	      \
-	    if (needs_explicit_shift (ch))				      \
+	    if (var == UTF_7_IMAP || needs_explicit_shift (ch))		      \
 	      *outptr++ = '-';						      \
 	    *outptr++ = (unsigned char) ch;				      \
+	    if (var == UTF_7_IMAP && ch == '&')				      \
+	      *outptr++ = '-';						      \
 	    statep->__count = 0;					      \
 	  }								      \
 	else								      \
-- 
2.35.1



* Re: [PATCH v5 4/4] iconv: Add UTF-7-IMAP variant in utf-7.c
  2022-03-20 16:43                 ` [PATCH v5 " Max Gautier
@ 2022-03-21 12:24                   ` Adhemerval Zanella
  0 siblings, 0 replies; 60+ messages in thread
From: Adhemerval Zanella @ 2022-03-21 12:24 UTC (permalink / raw)
  To: Max Gautier, libc-alpha



On 20/03/2022 13:43, Max Gautier via Libc-alpha wrote:
> UTF-7-IMAP differs from UTF-7 in the followings ways (see RFC 3501[1]
> for reference) :
> 
> - The shift character is '&' instead of '+'
> - There is no "optional direct characters" and the "direct characters"
>   set is different
> - There is no implicit shift back to US-ASCII from BASE64, all BASE64
>   sequences MUST be terminated with '-'
> 
> [1]: https://datatracker.ietf.org/doc/html/rfc3501#section-5.1.3
> 
> Signed-off-by: Max Gautier <mg@max.gautier.name>

LGTM, thanks.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>

> ---
>  iconvdata/TESTS                     |  1 +
>  iconvdata/gconv-modules             |  4 ++++
>  iconvdata/testdata/UTF-7-IMAP       |  1 +
>  iconvdata/testdata/UTF-7-IMAP..UTF8 | 32 +++++++++++++++++++++++++++++
>  iconvdata/utf-7.c                   | 30 +++++++++++++++++++++------
>  5 files changed, 62 insertions(+), 6 deletions(-)
>  create mode 100644 iconvdata/testdata/UTF-7-IMAP
>  create mode 100644 iconvdata/testdata/UTF-7-IMAP..UTF8
> 
> diff --git a/iconvdata/TESTS b/iconvdata/TESTS
> index a0157c3350..3cc043c21b 100644
> --- a/iconvdata/TESTS
> +++ b/iconvdata/TESTS
> @@ -94,6 +94,7 @@ EUC-TW			EUC-TW			Y	UTF8
>  GBK			GBK			Y	UTF8
>  BIG5HKSCS		BIG5HKSCS		Y	UTF8
>  UTF-7			UTF-7			N	UTF8
> +UTF-7-IMAP		UTF-7-IMAP		N	UTF8
>  IBM856			IBM856			N	UTF8
>  IBM922			IBM922			Y	UTF8
>  IBM930			IBM930			N	UTF8
> diff --git a/iconvdata/gconv-modules b/iconvdata/gconv-modules
> index 4acbba062f..d120699394 100644
> --- a/iconvdata/gconv-modules
> +++ b/iconvdata/gconv-modules
> @@ -113,3 +113,7 @@ module	INTERNAL		UTF-32BE//		UTF-32		1
>  alias	UTF7//			UTF-7//
>  module	UTF-7//			INTERNAL		UTF-7		1
>  module	INTERNAL		UTF-7//			UTF-7		1
> +
> +#	from			to			module		cost
> +module	UTF-7-IMAP//		INTERNAL		UTF-7		1
> +module	INTERNAL		UTF-7-IMAP//		UTF-7		1
> diff --git a/iconvdata/testdata/UTF-7-IMAP b/iconvdata/testdata/UTF-7-IMAP
> new file mode 100644
> index 0000000000..6b5dada63c
> --- /dev/null
> +++ b/iconvdata/testdata/UTF-7-IMAP
> @@ -0,0 +1 @@
> +&EqASGxItEps-       Amharic&AAoBDQ-esky      Czech&AAo-Dansk      Danish&AAo-English    English&AAo-Suomi      Finnish&AAo-Fran&AOc-ais   French&AAo-Deutsch    German&AAoDlQO7A7sDtwO9A7kDugOs-   Greek&AAoF4gXRBegF2QXq-      Hebrew&AAo-Italiano   Italian&AAo-Norsk      Norwegian&AAoEIARDBEEEQQQ6BDgEOQ-    Russian&AAo-Espa&APE-ol    Spanish&AAo-Svenska    Swedish&AAoOIA4yDikOMg5EDhcOIg-    Thai&AAo-T&APw-rk&AOc-e     Turkish&AAo-Ti&Hr8-ng Vi&Hsc-t Vietnamese&AApl5Wcsip4-     Japanese&AApOLWWH-       Chinese&AArVXK4A-       Korean&AAoACg-// Checking for correct handling of shift characters ('&-', '-') after base64 sequences&AArVXK4A-&-&AArVXK4A--&AAoACg-// Checking for correct handling of litteral '&-' and '-'&AAo----&-&--&AAoACg-// The last line of this file is missing the end-of-line terminator&AAo-// on purpose, in order to test that the conversion empties the bit buffer&AAo-// and shifts back to the initial state at the end of the conversion.&AAo-A&ImIDkQ-
> \ No newline at end of file
> diff --git a/iconvdata/testdata/UTF-7-IMAP..UTF8 b/iconvdata/testdata/UTF-7-IMAP..UTF8
> new file mode 100644
> index 0000000000..8b9add3670
> --- /dev/null
> +++ b/iconvdata/testdata/UTF-7-IMAP..UTF8
> @@ -0,0 +1,32 @@
> +አማርኛ       Amharic
> +česky      Czech
> +Dansk      Danish
> +English    English
> +Suomi      Finnish
> +Français   French
> +Deutsch    German
> +Ελληνικά   Greek
> +עברית      Hebrew
> +Italiano   Italian
> +Norsk      Norwegian
> +Русский    Russian
> +Español    Spanish
> +Svenska    Swedish
> +ภาษาไทย    Thai
> +Türkçe     Turkish
> +Tiếng Việt Vietnamese
> +日本語     Japanese
> +中文       Chinese
> +한글       Korean
> +
> +// Checking for correct handling of shift characters ('&', '-') after base64 sequences
> +한글&
> +한글-
> +
> +// Checking for correct handling of litteral '&' and '-'
> +---&&-
> +
> +// The last line of this file is missing the end-of-line terminator
> +// on purpose, in order to test that the conversion empties the bit buffer
> +// and shifts back to the initial state at the end of the conversion.
> +A≢Α
> \ No newline at end of file
> diff --git a/iconvdata/utf-7.c b/iconvdata/utf-7.c
> index b639d8ff3e..5c2e17e50c 100644
> --- a/iconvdata/utf-7.c
> +++ b/iconvdata/utf-7.c
> @@ -32,11 +32,13 @@
>  enum variant
>  {
>    UTF7,
> +  UTF_7_IMAP
>  };
>  
>  /* Must be in the same order as enum variant above.  */
>  static const char names[] =
>    "UTF-7//\0"
> +  "UTF-7-IMAP//\0"
>    "\0";
>  
>  static uint32_t
> @@ -44,6 +46,8 @@ shift_character (enum variant const var)
>  {
>    if (var == UTF7)
>      return '+';
> +  else if (var == UTF_7_IMAP)
> +    return '&';
>    else
>      abort ();
>  }
> @@ -58,6 +62,9 @@ between (uint32_t const ch,
>  /* The set of "direct characters":
>     FOR UTF-7
>     A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr
> +   FOR UTF-7-IMAP
> +   A-Z a-z 0-9 ' ( ) , - . / : ? space
> +   ! " # $ % + * ; < = > @ [ \ ] ^ _ ` { | } ~
>  */
>  
>  static bool
> @@ -71,6 +78,8 @@ isdirect (uint32_t ch, enum variant var)
>  	    || between (ch, ',', '/')
>  	    || ch == ':' || ch == '?'
>  	    || ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r');
> +  else if (var == UTF_7_IMAP)
> +    return (ch != '&' && between (ch, ' ', '~'));
>    abort ();
>  }
>  
> @@ -124,6 +133,8 @@ base64 (unsigned int i, enum variant var)
>      return '+';
>    else if (i == 63 && var == UTF7)
>      return '/';
> +  else if (i == 63 && var == UTF_7_IMAP)
> +    return ',';
>    else
>      abort ();
>  }
> @@ -308,7 +319,8 @@ gconv_end (struct __gconv_step *data)
>  	  i = ch - '0' + 52;						      \
>  	else if (ch == '+')						      \
>  	  i = 62;							      \
> -	else if (ch == '/')						      \
> +	else if ((var == UTF7 && ch == '/')                                   \
> +		  || (var == UTF_7_IMAP && ch == ','))			      \
>  	  i = 63;							      \
>  	else								      \
>  	  {								      \
> @@ -316,8 +328,10 @@ gconv_end (struct __gconv_step *data)
>  									      \
>  	    /* If accumulated data is nonzero, the input is invalid.  */      \
>  	    /* Also, partial UTF-16 characters are invalid.  */		      \
> -	    if (__builtin_expect (statep->__value.__wch != 0, 0)	      \
> -		|| __builtin_expect ((statep->__count >> 3) <= 26, 0))	      \
> +	    /* In IMAP variant, must be terminated by '-'.  */		      \
> +	    if (__glibc_unlikely (statep->__value.__wch != 0)		      \
> +		|| __glibc_unlikely ((statep->__count >> 3) <= 26)	      \
> +		|| __glibc_unlikely (var == UTF_7_IMAP && ch != '-'))	      \
>  	      {								      \
>  		STANDARD_FROM_LOOP_ERR_HANDLER ((statep->__count = 0, 1));    \
>  	      }								      \
> @@ -474,13 +488,15 @@ gconv_end (struct __gconv_step *data)
>      else								      \
>        {									      \
>  	/* base64 encoding active */					      \
> -	if (isdirect (ch, var))						      \
> +	if ((var == UTF_7_IMAP && ch == '&') || isdirect (ch, var))	      \
>  	  {								      \
>  	    /* deactivate base64 encoding */				      \
>  	    size_t count;						      \
>  									      \
>  	    count = ((statep->__count & 0x18) >= 0x10)			      \
> -	      + needs_explicit_shift (ch) + 1;				      \
> +	      + (var == UTF_7_IMAP || needs_explicit_shift (ch))	      \
> +	      + (var == UTF_7_IMAP && ch == '&')			      \
> +	      + 1;							      \
>  	    if (__glibc_unlikely (outptr + count > outend))		      \
>  	      {								      \
>  		result = __GCONV_FULL_OUTPUT;				      \
> @@ -489,9 +505,11 @@ gconv_end (struct __gconv_step *data)
>  									      \
>  	    if ((statep->__count & 0x18) >= 0x10)			      \
>  	      *outptr++ = base64 ((statep->__count >> 3) & ~3, var);	      \
> -	    if (needs_explicit_shift (ch))				      \
> +	    if (var == UTF_7_IMAP || needs_explicit_shift (ch))		      \
>  	      *outptr++ = '-';						      \
>  	    *outptr++ = (unsigned char) ch;				      \
> +	    if (var == UTF_7_IMAP && ch == '&')				      \
> +	      *outptr++ = '-';						      \
>  	    statep->__count = 0;					      \
>  	  }								      \
>  	else								      \


* Re: [PATCH v4 0/4] iconv: Add support for UTF-7-IMAP
  2021-12-09  9:31             ` [PATCH v4 0/4] iconv: Add support for UTF-7-IMAP Max Gautier
                                 ` (3 preceding siblings ...)
  2021-12-09  9:31               ` [PATCH v4 4/4] iconv: Add UTF-7-IMAP variant in utf-7.c Max Gautier
@ 2021-12-17 13:15               ` Max Gautier
  2022-01-24 14:19                 ` Adhemerval Zanella
  2022-01-17 14:07               ` Max Gautier
  2022-01-24  9:17               ` Max Gautier
  6 siblings, 1 reply; 60+ messages in thread
From: Max Gautier @ 2021-12-17 13:15 UTC (permalink / raw)
  To: libc-alpha; +Cc: mg

Hi,

The contribution checklist on the wiki says to keep pinging weekly, so I
am doing that.

Cheers

-- 
Max Gautier


* Re: [PATCH v4 0/4] iconv: Add support for UTF-7-IMAP
  2021-12-17 13:15               ` [PATCH v4 0/4] iconv: Add support for UTF-7-IMAP Max Gautier
@ 2022-01-24 14:19                 ` Adhemerval Zanella
  2022-02-10 13:16                   ` Max Gautier
  0 siblings, 1 reply; 60+ messages in thread
From: Adhemerval Zanella @ 2022-01-24 14:19 UTC (permalink / raw)
  To: Max Gautier, libc-alpha



On 17/12/2021 10:15, Max Gautier via Libc-alpha wrote:
> Hi,
> 
> The contribution checklist on the wiki says to keep pinging weekly, so,
> doing that.
> 
> Cheers
> 

Thanks for your patience.  I think it is too late for 2.35, but I want to
get back to this for 2.36.


* Re: [PATCH v4 0/4] iconv: Add support for UTF-7-IMAP
  2022-01-24 14:19                 ` Adhemerval Zanella
@ 2022-02-10 13:16                   ` Max Gautier
  2022-02-10 13:17                     ` Adhemerval Zanella
  0 siblings, 1 reply; 60+ messages in thread
From: Max Gautier @ 2022-02-10 13:16 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: libc-alpha

On Mon, Jan 24, 2022 at 11:19:46AM -0300, Adhemerval Zanella wrote:
> ... 
> I think it is late for 2.35, but I want to get back on this for 2.36.  

Since 2.35 has been released, any chance of tackling this some time soon?

Cheers

-- 
Max Gautier


* Re: [PATCH v4 0/4] iconv: Add support for UTF-7-IMAP
  2022-02-10 13:16                   ` Max Gautier
@ 2022-02-10 13:17                     ` Adhemerval Zanella
  2022-03-04  8:53                       ` Max Gautier
  0 siblings, 1 reply; 60+ messages in thread
From: Adhemerval Zanella @ 2022-02-10 13:17 UTC (permalink / raw)
  To: Max Gautier, libc-alpha



On 10/02/2022 10:16, Max Gautier wrote:
> On Mon, Jan 24, 2022 at 11:19:46AM -0300, Adhemerval Zanella wrote:
>> ... 
>> I think it is late for 2.35, but I want to get back on this for 2.36.  
> 
> Since 2.35 has sailed, any chances to tackle this some time soon ?
> 
> Cheers
> 

Thanks for reminding me; I will try to set aside some time to look at this.


* Re: [PATCH v4 0/4] iconv: Add support for UTF-7-IMAP
  2022-02-10 13:17                     ` Adhemerval Zanella
@ 2022-03-04  8:53                       ` Max Gautier
  0 siblings, 0 replies; 60+ messages in thread
From: Max Gautier @ 2022-03-04  8:53 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: libc-alpha

Pinging for UTF-7-IMAP


Cheers

-- 
Max Gautier


* Re: [PATCH v4 0/4] iconv: Add support for UTF-7-IMAP
  2021-12-09  9:31             ` [PATCH v4 0/4] iconv: Add support for UTF-7-IMAP Max Gautier
                                 ` (4 preceding siblings ...)
  2021-12-17 13:15               ` [PATCH v4 0/4] iconv: Add support for UTF-7-IMAP Max Gautier
@ 2022-01-17 14:07               ` Max Gautier
  2022-01-24  9:17               ` Max Gautier
  6 siblings, 0 replies; 60+ messages in thread
From: Max Gautier @ 2022-01-17 14:07 UTC (permalink / raw)
  To: libc-alpha; +Cc: Max Gautier

Pinging again.

-- 
Max Gautier


* Re: [PATCH v4 0/4] iconv: Add support for UTF-7-IMAP
  2021-12-09  9:31             ` [PATCH v4 0/4] iconv: Add support for UTF-7-IMAP Max Gautier
                                 ` (5 preceding siblings ...)
  2022-01-17 14:07               ` Max Gautier
@ 2022-01-24  9:17               ` Max Gautier
  6 siblings, 0 replies; 60+ messages in thread
From: Max Gautier @ 2022-01-24  9:17 UTC (permalink / raw)
  To: libc-alpha

Pinging the patch.

-- 
Max Gautier


* [PATCH v3 2/5] Update gconv-modules file
  2021-01-25  9:02   ` [PATCH v3 0/5] iconv: module for IMAP-UTF-7 Max Gautier
  2021-01-25  9:02     ` [PATCH v3 1/5] Copy utf-7 module to modified-utf-7 Max Gautier
@ 2021-01-25  9:02     ` Max Gautier
  2021-02-07  9:49       ` Florian Weimer
  2021-01-25  9:02     ` [PATCH v3 3/5] Transform UTF-7 to IMAP-UTF-7 Max Gautier
                       ` (4 subsequent siblings)
  6 siblings, 1 reply; 60+ messages in thread
From: Max Gautier @ 2021-01-25  9:02 UTC (permalink / raw)
  To: libc-alpha; +Cc: Max Gautier

---
 iconvdata/gconv-modules | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/iconvdata/gconv-modules b/iconvdata/gconv-modules
index 8540225b1c..da107a1a56 100644
--- a/iconvdata/gconv-modules
+++ b/iconvdata/gconv-modules
@@ -1534,6 +1534,10 @@ alias	UTF7//			UTF-7//
 module	UTF-7//			INTERNAL		UTF-7		1
 module	INTERNAL		UTF-7//			UTF-7		1
 
+#	from			to			module		cost
+module	IMAP-UTF-7//		INTERNAL		IMAP-UTF-7	1
+module	INTERNAL		IMAP-UTF-7//		IMAP-UTF-7	1
+
 #	from			to			module		cost
 module	GB18030//		INTERNAL		GB18030		1
 module	INTERNAL		GB18030//		GB18030		1
-- 
2.30.0



* Re: [PATCH v3 2/5] Update gconv-modules file
  2021-01-25  9:02     ` [PATCH v3 2/5] Update gconv-modules file Max Gautier
@ 2021-02-07  9:49       ` Florian Weimer
  0 siblings, 0 replies; 60+ messages in thread
From: Florian Weimer @ 2021-02-07  9:49 UTC (permalink / raw)
  To: Max Gautier via Libc-alpha; +Cc: Max Gautier

Please fold this into the main patch, and use UTF-7-IMAP as the charset
name.  Thanks.

Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



* [PATCH v3 3/5] Transform UTF-7 to IMAP-UTF-7
  2021-01-25  9:02   ` [PATCH v3 0/5] iconv: module for IMAP-UTF-7 Max Gautier
  2021-01-25  9:02     ` [PATCH v3 1/5] Copy utf-7 module to modified-utf-7 Max Gautier
  2021-01-25  9:02     ` [PATCH v3 2/5] Update gconv-modules file Max Gautier
@ 2021-01-25  9:02     ` Max Gautier
  2021-01-25  9:02     ` [PATCH v3 4/5] Make terminating base64 sequences mandatory Max Gautier
                       ` (3 subsequent siblings)
  6 siblings, 0 replies; 60+ messages in thread
From: Max Gautier @ 2021-01-25  9:02 UTC (permalink / raw)
  To: libc-alpha; +Cc: Max Gautier

* The shift character is '&' instead of '+'
* No "optional direct characters" set
* Modified base64 character set
* Use direct comparisons instead of lookup arrays and bitwise operations
  (if there is a good reason to use those, please let me know)
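
For comparison, here is a standalone sketch of the two forms side by side.
It is an illustration only, not code from either module: the names
isdirect_cmp, isdirect_tab, init_direct_tab and check_equivalence are
hypothetical, and the table is filled at run time from the comparison
predicate instead of being written out by hand as in utf-7.c. It shows
that the bitmap lookup and the direct comparison describe the same
character set. It takes no position on which form is faster.

```c
#include <assert.h>
#include <stdint.h>

/* Direct-comparison form, following isdirect in this patch.  */
static int
isdirect_cmp (uint32_t ch)
{
  return ((ch == '\n' || ch == '\t' || ch == '\r')
          || (ch >= 0x20 && ch <= 0x7e && ch != '&'));
}

/* Table form: one bit per ASCII code point, stored as bit (ch & 7)
   of byte (ch >> 3).  */
static unsigned char direct_tab[128 / 8];

/* Fill the table from the predicate so that the two forms cannot
   drift apart.  Calling this more than once is harmless.  */
static void
init_direct_tab (void)
{
  for (uint32_t ch = 0; ch < 128; ch++)
    if (isdirect_cmp (ch))
      direct_tab[ch >> 3] |= (unsigned char) (1u << (ch & 7));
}

/* Same lookup expression as the table-based functions being removed.  */
static int
isdirect_tab (uint32_t ch)
{
  return ch < 128 && ((direct_tab[ch >> 3] >> (ch & 7)) & 1);
}

/* Check that both forms agree on every code point below 256.  */
static void
check_equivalence (void)
{
  init_direct_tab ();
  for (uint32_t ch = 0; ch < 256; ch++)
    assert (isdirect_cmp (ch) == isdirect_tab (ch));
}
```

After init_direct_tab, byte 4 of the table (code points 0x20 to 0x27) is
0xbf: every bit is set except bit 6, which corresponds to '&'.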
---
 iconvdata/Makefile                           |  2 +-
 iconvdata/{modified-utf-7.c => imap-utf-7.c} | 97 +++++++-------------
 2 files changed, 32 insertions(+), 67 deletions(-)
 rename iconvdata/{modified-utf-7.c => imap-utf-7.c} (87%)

diff --git a/iconvdata/Makefile b/iconvdata/Makefile
index 7f932e10ed..c9c1a3a006 100644
--- a/iconvdata/Makefile
+++ b/iconvdata/Makefile
@@ -61,7 +61,7 @@ modules	:= ISO8859-1 ISO8859-2 ISO8859-3 ISO8859-4 ISO8859-5		 \
 	   IBM5347 IBM9030 IBM9066 IBM9448 IBM12712 IBM16804             \
 	   IBM1364 IBM1371 IBM1388 IBM1390 IBM1399 ISO_11548-1 MIK BRF	 \
 	   MAC-CENTRALEUROPE KOI8-RU ISO8859-9E				 \
-	   CP770 CP771 CP772 CP773 CP774 MODIFIED-UTF-7
+	   CP770 CP771 CP772 CP773 CP774 IMAP-UTF-7
 
 # If lazy binding is disabled, use BIND_NOW for the gconv modules.
 ifeq ($(bind-now),yes)
diff --git a/iconvdata/modified-utf-7.c b/iconvdata/imap-utf-7.c
similarity index 87%
rename from iconvdata/modified-utf-7.c
rename to iconvdata/imap-utf-7.c
index fc6a8dfcfd..ebd66d3388 100644
--- a/iconvdata/modified-utf-7.c
+++ b/iconvdata/imap-utf-7.c
@@ -1,4 +1,4 @@
-/* Conversion module for UTF-7.
+/* Conversion module for IMAP-UTF-7.
    Copyright (C) 2000-2020 Free Software Foundation, Inc.
    This file is part of the GNU C Library.
 
@@ -16,12 +16,12 @@
    License along with the GNU C Library; if not, see
    <https://www.gnu.org/licenses/>.  */
 
-/* UTF-7 is a legacy encoding used for transmitting Unicode within the
-   ASCII character set, used primarily by mail agents.  New programs
-   are encouraged to use UTF-8 instead.
+/* IMAP-UTF-7 is a legacy encoding used for transmitting Unicode within the
+   ASCII character set, used primarily by IMAP server and clients agents.
+   New programs are encouraged to use UTF-8 instead.
 
-   UTF-7 is specified in RFC 2152 (and old RFC 1641, RFC 1642).  The
-   original Base64 encoding is defined in RFC 2045.  */
+   IMAP-UTF-7 is specified in RFC 3501 as part of the IMAPv4 specification.
+   The original Base64 encoding is defined in RFC 2045.  */
 
 #include <dlfcn.h>
 #include <gconv.h>
@@ -29,64 +29,29 @@
 #include <stdlib.h>
 
 
-/* Define this to 1 if you want the so-called "optional direct" characters
-      ! " # $ % & * ; < = > @ [ ] ^ _ ` { | }
-   to be encoded. Define to 0 if you want them to be passed straight
-   through, like the so-called "direct" characters.
-   We set this to 1 because it's safer.
- */
-#define UTF7_ENCODE_OPTIONAL_CHARS 1
-
-
 /* The set of "direct characters":
    A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr
+   ! " # $ % + * ; < = > @ [ ] ^ _ ` { | }
 */
 
-static const unsigned char direct_tab[128 / 8] =
-  {
-    0x00, 0x26, 0x00, 0x00, 0x81, 0xf3, 0xff, 0x87,
-    0xfe, 0xff, 0xff, 0x07, 0xfe, 0xff, 0xff, 0x07
-  };
-
 static int
 isdirect (uint32_t ch)
 {
-  return (ch < 128 && ((direct_tab[ch >> 3] >> (ch & 7)) & 1));
-}
-
-
-/* The set of "direct and optional direct characters":
-   A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr
-   ! " # $ % & * ; < = > @ [ ] ^ _ ` { | }
-*/
-
-static const unsigned char xdirect_tab[128 / 8] =
-  {
-    0x00, 0x26, 0x00, 0x00, 0xff, 0xf7, 0xff, 0xff,
-    0xff, 0xff, 0xff, 0xef, 0xff, 0xff, 0xff, 0x3f
-  };
-
-static int
-isxdirect (uint32_t ch)
-{
-  return (ch < 128 && ((xdirect_tab[ch >> 3] >> (ch & 7)) & 1));
+  return ((ch == '\n' || ch == '\t' || ch == '\r')
+		  || (ch >= 0x20 && ch <= 0x7e && ch != '&'));
 }
 
-
-/* The set of "extended base64 characters":
-   A-Z a-z 0-9 + / -
+/* The set of "modified base64 characters":
+   A-Z a-z 0-9 + , -
 */
 
-static const unsigned char xbase64_tab[128 / 8] =
-  {
-    0x00, 0x00, 0x00, 0x00, 0x00, 0xa8, 0xff, 0x03,
-    0xfe, 0xff, 0xff, 0x07, 0xfe, 0xff, 0xff, 0x07
-  };
-
 static int
-isxbase64 (uint32_t ch)
+ismbase64 (uint32_t ch)
 {
-  return (ch < 128 && ((xbase64_tab[ch >> 3] >> (ch & 7)) & 1));
+  return ((ch >= 'a' && ch <= 'z')
+			  || (ch >= 'A' && ch <= 'Z')
+			  || (ch >= '0' && ch <= '9')
+			  || (ch == '+' || ch == ','));
 }
 
 
@@ -103,18 +68,18 @@ base64 (unsigned int i)
   else if (i == 62)
     return '+';
   else if (i == 63)
-    return '/';
+    return ',';
   else
     abort ();
 }
 
 
 /* Definitions used in the body of the `gconv' function.  */
-#define CHARSET_NAME		"UTF-7//"
+#define CHARSET_NAME		"IMAP-UTF-7//"
 #define DEFINE_INIT		1
 #define DEFINE_FINI		1
-#define FROM_LOOP		from_utf7_loop
-#define TO_LOOP			to_utf7_loop
+#define FROM_LOOP		from_imap_utf7_loop
+#define TO_LOOP			to_imap_utf7_loop
 #define MIN_NEEDED_FROM		1
 #define MAX_NEEDED_FROM		6
 #define MIN_NEEDED_TO		4
@@ -161,13 +126,13 @@ base64 (unsigned int i)
     if ((statep->__count >> 3) == 0)					      \
       {									      \
 	/* base64 encoding inactive.  */				      \
-	if (isxdirect (ch))						      \
+	if (isdirect (ch))						      \
 	  {								      \
 	    inptr++;							      \
 	    put32 (outptr, ch);						      \
 	    outptr += 4;						      \
 	  }								      \
-	else if (__glibc_likely (ch == '+'))				      \
+	else if (__glibc_likely (ch == '&'))				      \
 	  {								      \
 	    if (__glibc_unlikely (inptr + 2 > inend))			      \
 	      {								      \
@@ -209,7 +174,7 @@ base64 (unsigned int i)
 	  i = ch - '0' + 52;						      \
 	else if (ch == '+')						      \
 	  i = 62;							      \
-	else if (ch == '/')						      \
+	else if (ch == ',')						      \
 	  i = 63;							      \
 	else								      \
 	  {								      \
@@ -323,7 +288,7 @@ base64 (unsigned int i)
     if ((statep->__count & 0x18) == 0)					      \
       {									      \
 	/* base64 encoding inactive */					      \
-	if (UTF7_ENCODE_OPTIONAL_CHARS ? isdirect (ch) : isxdirect (ch))      \
+	if (isdirect (ch))      \
 	  {								      \
 	    *outptr++ = (unsigned char) ch;				      \
 	  }								      \
@@ -331,7 +296,7 @@ base64 (unsigned int i)
 	  {								      \
 	    size_t count;						      \
 									      \
-	    if (ch == '+')						      \
+	    if (ch == '&')						      \
 	      count = 2;						      \
 	    else if (ch < 0x10000)					      \
 	      count = 3;						      \
@@ -346,8 +311,8 @@ base64 (unsigned int i)
 		break;							      \
 	      }								      \
 									      \
-	    *outptr++ = '+';						      \
-	    if (ch == '+')						      \
+	    *outptr++ = '&';						      \
+	    if (ch == '&')						      \
 	      *outptr++ = '-';						      \
 	    else if (ch < 0x10000)					      \
 	      {								      \
@@ -375,12 +340,12 @@ base64 (unsigned int i)
     else								      \
       {									      \
 	/* base64 encoding active */					      \
-	if (UTF7_ENCODE_OPTIONAL_CHARS ? isdirect (ch) : isxdirect (ch))      \
+	if (isdirect (ch))      \
 	  {								      \
 	    /* deactivate base64 encoding */				      \
 	    size_t count;						      \
 									      \
-	    count = ((statep->__count & 0x18) >= 0x10) + isxbase64 (ch) + 1;  \
+	    count = ((statep->__count & 0x18) >= 0x10) + ismbase64 (ch) + 1;  \
 	    if (__glibc_unlikely (outptr + count > outend))		      \
 	      {								      \
 		result = __GCONV_FULL_OUTPUT;				      \
@@ -389,7 +354,7 @@ base64 (unsigned int i)
 									      \
 	    if ((statep->__count & 0x18) >= 0x10)			      \
 	      *outptr++ = base64 ((statep->__count >> 3) & ~3);		      \
-	    if (isxbase64 (ch))						      \
+	    if (ismbase64 (ch))						      \
 	      *outptr++ = '-';						      \
 	    *outptr++ = (unsigned char) ch;				      \
 	    statep->__count = 0;					      \
@@ -499,7 +464,7 @@ base64 (unsigned int i)
     memset (data->__statep, '\0', sizeof (mbstate_t));			      \
   else									      \
     {									      \
-      /* The "to UTF-7" direction.  Flush the remaining bits and terminate    \
+      /* The "to M-UTF-7" direction.  Flush the remaining bits and terminate    \
 	 with a '-' byte.  This will guarantee correct decoding if more	      \
 	 UTF-7 encoded text is added afterwards.  */			      \
       int state = data->__statep->__count;				      \
-- 
2.30.0



* [PATCH v3 4/5] Make terminating base64 sequences mandatory
  2021-01-25  9:02   ` [PATCH v3 0/5] iconv: module for IMAP-UTF-7 Max Gautier
                       ` (2 preceding siblings ...)
  2021-01-25  9:02     ` [PATCH v3 3/5] Transform UTF-7 to IMAP-UTF-7 Max Gautier
@ 2021-01-25  9:02     ` Max Gautier
  2021-02-07  9:45       ` Florian Weimer
  2021-01-25  9:02     ` [PATCH v3 5/5] Add test case for IMAP-UTF-7 Max Gautier
                       ` (2 subsequent siblings)
  6 siblings, 1 reply; 60+ messages in thread
From: Max Gautier @ 2021-01-25  9:02 UTC (permalink / raw)
  To: libc-alpha; +Cc: Max Gautier

In the modified UTF-7 encoding, unlike in UTF-7, one MUST terminate all
base64 sequences with the '-' character.
MODIFIED-UTF-7 -> INTERNAL: make unterminated sequences illegal
INTERNAL -> MODIFIED-UTF-7: always terminate the sequences
---
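For reviewers, here is a rough sketch of what "always terminate" means
for a single character.  The names encode_one and mbase64_alphabet are
hypothetical and used only for illustration; this is not the code in
imap-utf-7.c.  Unlike the module, the sketch handles one BMP character
at a time and keeps no shift state, and it assumes that only printable
ASCII (0x20-0x7e, except '&') is written directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Modified base64 alphabet from RFC 3501: ',' replaces '/'.  */
static const char mbase64_alphabet[] =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+,";

/* Hypothetical helper, not part of the patch: encode one BMP character
   (a single UTF-16 code unit) as a self-contained modified UTF-7
   sequence.  OUT must have room for 6 bytes.  Returns the number of
   bytes written, excluding the terminating NUL.  */
static size_t
encode_one (uint16_t ch, char *out)
{
  size_t n = 0;

  assert (out != NULL);

  if (ch == '&')
    {
      /* '&' is the shift character, so it is escaped as "&-".  */
      out[n++] = '&';
      out[n++] = '-';
    }
  else if (ch >= 0x20 && ch <= 0x7e)
    {
      /* Printable ASCII is written directly.  */
      out[n++] = (char) ch;
    }
  else
    {
      /* 16 bits become three base64 digits (6 + 6 + 4 bits, the last
         digit padded with two zero bits), always closed by '-'.  */
      out[n++] = '&';
      out[n++] = mbase64_alphabet[ch >> 10];
      out[n++] = mbase64_alphabet[(ch >> 4) & 0x3f];
      out[n++] = mbase64_alphabet[(ch & 0x0f) << 2];
      out[n++] = '-';
    }
  out[n] = '\0';
  return n;
}
```

With this sketch, U+00E7 comes out as "&AOc-", which matches the
"Fran&AOc-ais" line in the test data of patch 5/5.
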
 iconvdata/imap-utf-7.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/iconvdata/imap-utf-7.c b/iconvdata/imap-utf-7.c
index ebd66d3388..45629a48f8 100644
--- a/iconvdata/imap-utf-7.c
+++ b/iconvdata/imap-utf-7.c
@@ -176,7 +176,7 @@ base64 (unsigned int i)
 	  i = 62;							      \
 	else if (ch == ',')						      \
 	  i = 63;							      \
-	else								      \
+	else if (ch == '-')								      \
 	  {								      \
 	    /* Terminate base64 encoding.  */				      \
 									      \
@@ -188,12 +188,14 @@ base64 (unsigned int i)
 		STANDARD_FROM_LOOP_ERR_HANDLER ((statep->__count = 0, 1));    \
 	      }								      \
 									      \
-	    if (ch == '-')						      \
-	      inptr++;							      \
+		inptr++;							      \
 									      \
 	    statep->__count = 0;					      \
 	    continue;							      \
 	  }								      \
+	else  \
+		STANDARD_FROM_LOOP_ERR_HANDLER ((statep->__count = 0, 1));    \
+		/* Terminating '-' is required */  \
 									      \
 	/* Concatenate the base64 integer i to the accumulator.  */	      \
 	shift = (statep->__count >> 3);					      \
@@ -354,8 +356,7 @@ base64 (unsigned int i)
 									      \
 	    if ((statep->__count & 0x18) >= 0x10)			      \
 	      *outptr++ = base64 ((statep->__count >> 3) & ~3);		      \
-	    if (ismbase64 (ch))						      \
-	      *outptr++ = '-';						      \
+	    *outptr++ = '-';						      \
 	    *outptr++ = (unsigned char) ch;				      \
 	    statep->__count = 0;					      \
 	  }								      \
-- 
2.30.0



* Re: [PATCH v3 4/5] Make terminating base64 sequences mandatory
  2021-01-25  9:02     ` [PATCH v3 4/5] Make terminating base64 sequences mandatory Max Gautier
@ 2021-02-07  9:45       ` Florian Weimer
  0 siblings, 0 replies; 60+ messages in thread
From: Florian Weimer @ 2021-02-07  9:45 UTC (permalink / raw)
  To: Max Gautier via Libc-alpha; +Cc: Max Gautier

* Max Gautier via Libc-alpha:

> In the modified UTF-7 encoding, unlike in UTF-7, one MUST terminate all
> base64 sequences with the '-' character.
> MODIFIED-UTF-7 -> INTERNAL: make unterminated sequences illegal
> INTERNAL -> MODIFIED-UTF-7: always terminate the sequences

Still mentions MODIFIED-UTF-7.  This should be folded into the main
patch (conditional for UTF-7-IMAP).  The explanation in the commit
message should be put into a source code comment.

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



* [PATCH v3 5/5] Add test case for IMAP-UTF-7
  2021-01-25  9:02   ` [PATCH v3 0/5] iconv: module for IMAP-UTF-7 Max Gautier
                       ` (3 preceding siblings ...)
  2021-01-25  9:02     ` [PATCH v3 4/5] Make terminating base64 sequences mandatory Max Gautier
@ 2021-01-25  9:02     ` Max Gautier
  2021-02-07  9:49       ` Florian Weimer
  2021-03-16 14:39     ` [PATCH v3 5/5][pw utf test] " Siddhesh Poyarekar
  2022-03-21 12:28     ` [PATCH v3 0/5] iconv: module " Adhemerval Zanella
  6 siblings, 1 reply; 60+ messages in thread
From: Max Gautier @ 2021-01-25  9:02 UTC (permalink / raw)
  To: libc-alpha; +Cc: Max Gautier

---
 iconvdata/TESTS                     |  1 +
 iconvdata/testdata/IMAP-UTF-7       | 25 +++++++++++++++++++++++++
 iconvdata/testdata/IMAP-UTF-7..UTF8 | 25 +++++++++++++++++++++++++
 3 files changed, 51 insertions(+)
 create mode 100644 iconvdata/testdata/IMAP-UTF-7
 create mode 100644 iconvdata/testdata/IMAP-UTF-7..UTF8

diff --git a/iconvdata/TESTS b/iconvdata/TESTS
index 74b82f1409..96d425219f 100644
--- a/iconvdata/TESTS
+++ b/iconvdata/TESTS
@@ -95,6 +95,7 @@ EUC-TW			EUC-TW			Y	UTF8
 GBK			GBK			Y	UTF8
 BIG5HKSCS		BIG5HKSCS		Y	UTF8
 UTF-7			UTF-7			N	UTF8
+IMAP-UTF-7		IMAP-UTF-7		N	UTF8
 IBM856			IBM856			N	UTF8
 IBM922			IBM922			Y	UTF8
 IBM930			IBM930			N	UTF8
diff --git a/iconvdata/testdata/IMAP-UTF-7 b/iconvdata/testdata/IMAP-UTF-7
new file mode 100644
index 0000000000..4b03e4ae57
--- /dev/null
+++ b/iconvdata/testdata/IMAP-UTF-7
@@ -0,0 +1,25 @@
+&EqASGxItEps-       Amharic
+&AQ0-esky      Czech
+Dansk      Danish
+English    English
+Suomi      Finnish
+Fran&AOc-ais   French
+Deutsch    German
+&A5UDuwO7A7cDvQO5A7oDrA-   Greek
+&BeIF0QXoBdkF6g-      Hebrew
+Italiano   Italian
+Norsk      Norwegian
+&BCAEQwRBBEEEOgQ4BDk-    Russian
+Espa&APE-ol    Spanish
+Svenska    Swedish
+&DiAOMg4pDjIORA4XDiI-    Thai
+T&APw-rk&AOc-e     Turkish
+Ti&Hr8-ng Vi&Hsc-t Vietnamese
+&ZeVnLIqe-     Japanese
+&Ti1lhw-       Chinese
+&1VyuAA-       Korean
+
+// The last line of this file is missing the end-of-line terminator
+// on purpose, in order to test that the conversion empties the bit buffer
+// and shifts back to the initial state at the end of the conversion.
+A&ImIDkQ-
\ No newline at end of file
diff --git a/iconvdata/testdata/IMAP-UTF-7..UTF8 b/iconvdata/testdata/IMAP-UTF-7..UTF8
new file mode 100644
index 0000000000..3b362e578c
--- /dev/null
+++ b/iconvdata/testdata/IMAP-UTF-7..UTF8
@@ -0,0 +1,25 @@
+አማርኛ       Amharic
+česky      Czech
+Dansk      Danish
+English    English
+Suomi      Finnish
+Français   French
+Deutsch    German
+Ελληνικά   Greek
+עברית      Hebrew
+Italiano   Italian
+Norsk      Norwegian
+Русский    Russian
+Español    Spanish
+Svenska    Swedish
+ภาษาไทย    Thai
+Türkçe     Turkish
+Tiếng Việt Vietnamese
+日本語     Japanese
+中文       Chinese
+한글       Korean
+
+// The last line of this file is missing the end-of-line terminator
+// on purpose, in order to test that the conversion empties the bit buffer
+// and shifts back to the initial state at the end of the conversion.
+A≢Α
\ No newline at end of file
-- 
2.30.0



* Re: [PATCH v3 5/5] Add test case for IMAP-UTF-7
  2021-01-25  9:02     ` [PATCH v3 5/5] Add test case for IMAP-UTF-7 Max Gautier
@ 2021-02-07  9:49       ` Florian Weimer
  0 siblings, 0 replies; 60+ messages in thread
From: Florian Weimer @ 2021-02-07  9:49 UTC (permalink / raw)
  To: Max Gautier via Libc-alpha; +Cc: Max Gautier

* Max Gautier via Libc-alpha:

> ---
>  iconvdata/TESTS                     |  1 +
>  iconvdata/testdata/IMAP-UTF-7       | 25 +++++++++++++++++++++++++
>  iconvdata/testdata/IMAP-UTF-7..UTF8 | 25 +++++++++++++++++++++++++
>  3 files changed, 51 insertions(+)
>  create mode 100644 iconvdata/testdata/IMAP-UTF-7
>  create mode 100644 iconvdata/testdata/IMAP-UTF-7..UTF8

The charset name should be UTF-7-IMAP.

I'd recommend adding coverage for the mandatory-to-encode ASCII
characters (“/” etc.).

You could also add a test that checks for the presence of the -
terminator.
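
Such a check could be sketched roughly as follows.  The name
shifts_terminated is hypothetical; it is not something in the patch or
in glibc.  The sketch only scans the encoded bytes and assumes the
alphabet A-Z a-z 0-9 + , inside a shift sequence; it does not validate
that the base64 payload decodes to well-formed UTF-16.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if CH may appear inside a modified base64 sequence.  */
static int
is_mbase64_char (unsigned char ch)
{
  return ((ch >= 'a' && ch <= 'z')
          || (ch >= 'A' && ch <= 'Z')
          || (ch >= '0' && ch <= '9')
          || ch == '+' || ch == ',');
}

/* Hypothetical test helper: return 1 if every '&' shift sequence in the
   NUL-terminated string S is closed by '-' and contains only modified
   base64 characters, and 0 otherwise.  */
static int
shifts_terminated (const char *s)
{
  int in_base64 = 0;

  assert (s != NULL);

  for (; *s != '\0'; s++)
    {
      unsigned char ch = (unsigned char) *s;

      if (!in_base64)
        {
          /* '&' opens a shift sequence; everything else is direct.  */
          if (ch == '&')
            in_base64 = 1;
        }
      else if (ch == '-')
        in_base64 = 0;          /* Explicit terminator found.  */
      else if (!is_mbase64_char (ch))
        return 0;               /* Sequence ended without '-'.  */
    }

  /* Reaching the end of input while still shifted is an error.  */
  return !in_base64;
}
```

Applied to the test data, "A&ImIDkQ-" passes, while the same line
without the final '-' is rejected.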

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



* [PATCH v3 5/5][pw utf test] Add test case for IMAP-UTF-7
  2021-01-25  9:02   ` [PATCH v3 0/5] iconv: module for IMAP-UTF-7 Max Gautier
                       ` (4 preceding siblings ...)
  2021-01-25  9:02     ` [PATCH v3 5/5] Add test case for IMAP-UTF-7 Max Gautier
@ 2021-03-16 14:39     ` Siddhesh Poyarekar
  2022-03-21 12:28     ` [PATCH v3 0/5] iconv: module " Adhemerval Zanella
  6 siblings, 0 replies; 60+ messages in thread
From: Siddhesh Poyarekar @ 2021-03-16 14:39 UTC (permalink / raw)
  To: libc-alpha

---
  iconvdata/TESTS                     |  1 +
  iconvdata/testdata/IMAP-UTF-7       | 25 +++++++++++++++++++++++++
  iconvdata/testdata/IMAP-UTF-7..UTF8 | 25 +++++++++++++++++++++++++
  3 files changed, 51 insertions(+)
  create mode 100644 iconvdata/testdata/IMAP-UTF-7
  create mode 100644 iconvdata/testdata/IMAP-UTF-7..UTF8

diff --git a/iconvdata/TESTS b/iconvdata/TESTS
index 74b82f1409..96d425219f 100644
--- a/iconvdata/TESTS
+++ b/iconvdata/TESTS
@@ -95,6 +95,7 @@ EUC-TW			EUC-TW			Y	UTF8
  GBK			GBK			Y	UTF8
  BIG5HKSCS		BIG5HKSCS		Y	UTF8
  UTF-7			UTF-7			N	UTF8
+IMAP-UTF-7		IMAP-UTF-7		N	UTF8
  IBM856			IBM856			N	UTF8
  IBM922			IBM922			Y	UTF8
  IBM930			IBM930			N	UTF8
diff --git a/iconvdata/testdata/IMAP-UTF-7 b/iconvdata/testdata/IMAP-UTF-7
new file mode 100644
index 0000000000..4b03e4ae57
--- /dev/null
+++ b/iconvdata/testdata/IMAP-UTF-7
@@ -0,0 +1,25 @@
+&EqASGxItEps-       Amharic
+&AQ0-esky      Czech
+Dansk      Danish
+English    English
+Suomi      Finnish
+Fran&AOc-ais   French
+Deutsch    German
+&A5UDuwO7A7cDvQO5A7oDrA-   Greek
+&BeIF0QXoBdkF6g-      Hebrew
+Italiano   Italian
+Norsk      Norwegian
+&BCAEQwRBBEEEOgQ4BDk-    Russian
+Espa&APE-ol    Spanish
+Svenska    Swedish
+&DiAOMg4pDjIORA4XDiI-    Thai
+T&APw-rk&AOc-e     Turkish
+Ti&Hr8-ng Vi&Hsc-t Vietnamese
+&ZeVnLIqe-     Japanese
+&Ti1lhw-       Chinese
+&1VyuAA-       Korean
+
+// The last line of this file is missing the end-of-line terminator
+// on purpose, in order to test that the conversion empties the bit buffer
+// and shifts back to the initial state at the end of the conversion.
+A&ImIDkQ-
\ No newline at end of file
diff --git a/iconvdata/testdata/IMAP-UTF-7..UTF8 b/iconvdata/testdata/IMAP-UTF-7..UTF8
new file mode 100644
index 0000000000..3b362e578c
--- /dev/null
+++ b/iconvdata/testdata/IMAP-UTF-7..UTF8
@@ -0,0 +1,25 @@
+አማርኛ       Amharic
+česky      Czech
+Dansk      Danish
+English    English
+Suomi      Finnish
+Français   French
+Deutsch    German
+Ελληνικά   Greek
+עברית      Hebrew
+Italiano   Italian
+Norsk      Norwegian
+Русский    Russian
+Español    Spanish
+Svenska    Swedish
+ภาษาไทย    Thai
+Türkçe     Turkish
+Tiếng Việt Vietnamese
+日本語     Japanese
+中文       Chinese
+한글       Korean
+
+// The last line of this file is missing the end-of-line terminator
+// on purpose, in order to test that the conversion empties the bit buffer
+// and shifts back to the initial state at the end of the conversion.
+A≢Α
\ No newline at end of file
-- 
2.30.0



* Re: [PATCH v3 0/5] iconv: module for IMAP-UTF-7
  2021-01-25  9:02   ` [PATCH v3 0/5] iconv: module for IMAP-UTF-7 Max Gautier
                       ` (5 preceding siblings ...)
  2021-03-16 14:39     ` [PATCH v3 5/5][pw utf test] " Siddhesh Poyarekar
@ 2022-03-21 12:28     ` Adhemerval Zanella
  2022-03-21 14:09       ` Max Gautier
  6 siblings, 1 reply; 60+ messages in thread
From: Adhemerval Zanella @ 2022-03-21 12:28 UTC (permalink / raw)
  To: Max Gautier, libc-alpha



On 25/01/2021 06:02, Max Gautier via Libc-alpha wrote:
> Here are the updated patches, using the name IMAP-UTF-7
> 

Hi Max,

I have fixed the 'enum variant' issue in patch 2 [1]. If you are OK
with that, I can install this patchset; alternatively, you can resend it
so that I can apply it directly from the mailing list.

[1] https://sourceware.org/git/?p=glibc.git;a=shortlog;h=refs/heads/azanella/imap-utf7


* Re: [PATCH v3 0/5] iconv: module for IMAP-UTF-7
  2022-03-21 12:28     ` [PATCH v3 0/5] iconv: module " Adhemerval Zanella
@ 2022-03-21 14:09       ` Max Gautier
  0 siblings, 0 replies; 60+ messages in thread
From: Max Gautier @ 2022-03-21 14:09 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: libc-alpha

On Mon, Mar 21, 2022 at 09:28:26AM -0300, Adhemerval Zanella wrote:
> 
> 
> On 25/01/2021 06:02, Max Gautier via Libc-alpha wrote:
> > Here are the updated patches, using the name IMAP-UTF-7
> > 
> 
> Hi Max,
> 
> I have fixed the 'enum variant' issue in patch 2 [1]. If you are OK
> with that, I can install this patchset; alternatively, you can resend it
> so that I can apply it directly from the mailing list.
> 
> [1] https://sourceware.org/git/?p=glibc.git;a=shortlog;h=refs/heads/azanella/imap-utf7

I'm totally OK with it, thanks, and sorry for the mistake on my side.


-- 
Max Gautier


* Re: [PATCH 0/5] iconv: module for MODIFIED-UTF-7
  2020-08-19 23:06 [PATCH 0/5] iconv: module for MODIFIED-UTF-7 Max Gautier
                   ` (5 preceding siblings ...)
  2020-08-20  8:03 ` [PATCH 0/5] iconv: module " Florian Weimer
@ 2021-01-12  9:12 ` Florian Weimer
  6 siblings, 0 replies; 60+ messages in thread
From: Florian Weimer @ 2021-01-12  9:12 UTC (permalink / raw)
  To: Max Gautier via Libc-alpha; +Cc: Max Gautier

* Max Gautier via Libc-alpha:

> These patches implement a conversion module for "modified UTF-7"
> described by RFC 3501 as part of the IMAP4rev1 specification (in
> section 5.1.3[1]).  This is the encoding used by convention by IMAP
> servers to describe internationalized mailbox names.

UTF-7-IMAP is now an official IMAP charset.  I suggest using this name.

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



end of thread, other threads:[~2022-03-21 14:09 UTC | newest]

Thread overview: 60+ messages
2020-08-19 23:06 [PATCH 0/5] iconv: module for MODIFIED-UTF-7 Max Gautier
2020-08-19 23:06 ` [PATCH 1/5] Copy utf-7 module to modified-utf-7 Max Gautier
2020-08-19 23:06 ` [PATCH 2/5] Update gconv-modules file Max Gautier
2020-08-19 23:07 ` [PATCH 3/5] Transform UTF-7 to MODIFIED-UTF-7 Max Gautier
2020-08-19 23:07 ` [PATCH 4/5] Make terminating base64 sequences mandatory Max Gautier
2020-08-19 23:07 ` [PATCH 5/5] Add test case for MODIFIED-UTF-7 Max Gautier
2020-08-20  7:18   ` Andreas Schwab
2020-08-20 15:40     ` [PATCH v2 " Max Gautier
2020-08-20  8:03 ` [PATCH 0/5] iconv: module " Florian Weimer
2020-08-20 15:19   ` Max Gautier
2020-08-20 15:58     ` Florian Weimer
2020-09-02 15:24   ` Max Gautier
2020-09-02 20:01     ` Adhemerval Zanella
2020-09-03  9:47       ` Max Gautier
2020-09-03 10:56         ` Andreas Schwab
2021-01-25  9:02   ` [PATCH v3 0/5] iconv: module for IMAP-UTF-7 Max Gautier
2021-01-25  9:02     ` [PATCH v3 1/5] Copy utf-7 module to modified-utf-7 Max Gautier
2021-01-25  9:31       ` Andreas Schwab
2021-01-25 13:51         ` Max Gautier
2021-02-07  9:42           ` Florian Weimer
2021-02-07 12:29             ` Max Gautier
2021-02-07 12:34               ` Florian Weimer
2021-12-09  9:31             ` [PATCH v4 0/4] iconv: Add support for UTF-7-IMAP Max Gautier
2021-12-09  9:31               ` [PATCH v4 1/4] iconv: Always encode "optional direct" UTF-7 characters Max Gautier
2022-03-07 12:10                 ` Adhemerval Zanella
2021-12-09  9:31               ` [PATCH v4 2/4] iconv: Better mapping to RFC for UTF-7 Max Gautier
2022-03-07 12:14                 ` Adhemerval Zanella
2022-03-20 16:41                 ` [PATCH v5 " Max Gautier
2022-03-21 11:53                   ` Adhemerval Zanella
2022-03-21 11:59                     ` Adhemerval Zanella
2022-03-21 12:06                       ` Adhemerval Zanella
2022-03-21 14:07                       ` Max Gautier
2021-12-09  9:31               ` [PATCH v4 3/4] iconv: make utf-7.c able to use variants Max Gautier
2022-03-07 12:34                 ` Adhemerval Zanella
2022-03-12 11:07                   ` Max Gautier
2022-03-14 12:17                     ` Adhemerval Zanella
2022-03-20 16:42                 ` [PATCH v5 " Max Gautier
2022-03-21 12:24                   ` Adhemerval Zanella
2021-12-09  9:31               ` [PATCH v4 4/4] iconv: Add UTF-7-IMAP variant in utf-7.c Max Gautier
2022-03-07 12:46                 ` Adhemerval Zanella
2022-03-20 16:43                 ` [PATCH v5 " Max Gautier
2022-03-21 12:24                   ` Adhemerval Zanella
2021-12-17 13:15               ` [PATCH v4 0/4] iconv: Add support for UTF-7-IMAP Max Gautier
2022-01-24 14:19                 ` Adhemerval Zanella
2022-02-10 13:16                   ` Max Gautier
2022-02-10 13:17                     ` Adhemerval Zanella
2022-03-04  8:53                       ` Max Gautier
2022-01-17 14:07               ` Max Gautier
2022-01-24  9:17               ` Max Gautier
2021-01-25  9:02     ` [PATCH v3 2/5] Update gconv-modules file Max Gautier
2021-02-07  9:49       ` Florian Weimer
2021-01-25  9:02     ` [PATCH v3 3/5] Transform UTF-7 to IMAP-UTF-7 Max Gautier
2021-01-25  9:02     ` [PATCH v3 4/5] Make terminating base64 sequences mandatory Max Gautier
2021-02-07  9:45       ` Florian Weimer
2021-01-25  9:02     ` [PATCH v3 5/5] Add test case for IMAP-UTF-7 Max Gautier
2021-02-07  9:49       ` Florian Weimer
2021-03-16 14:39     ` [PATCH v3 5/5][pw utf test] " Siddhesh Poyarekar
2022-03-21 12:28     ` [PATCH v3 0/5] iconv: module " Adhemerval Zanella
2022-03-21 14:09       ` Max Gautier
2021-01-12  9:12 ` [PATCH 0/5] iconv: module for MODIFIED-UTF-7 Florian Weimer
