Re: Question about iconv, UTF 8/16/32 and error reporting due to UTF-16 surrogates.

public inbox for libc-alpha@sourceware.org

* Re: Question about iconv, UTF 8/16/32 and error reporting due to UTF-16 surrogates.
       [not found] <565EDF7C.9020808@linux.vnet.ibm.com>
@ 2015-12-02 12:29 ` Florian Weimer
  2015-12-03 21:44   ` Rich Felker
  0 siblings, 1 reply; 5+ messages in thread
From: Florian Weimer @ 2015-12-02 12:29 UTC (permalink / raw)
  To: Stefan Liebler; +Cc: Joseph S. Myers, Carlos O'Donell, GNU C Library

On 12/02/2015 01:09 PM, Stefan Liebler wrote:

> What is the reason for reporting an error in the direction from UTF-8 to
> UTF-32, but not in the direction from UTF-32 to UTF-8?
> Or is it a bug?

It's a bug.  When processing UTF-* encodings, iconv needs to detect
invalid source sequences and avoid creating invalid destination sequences.

There are various legacy encodings which treat surrogate code points as
if they were regular characters: CESU-8 corresponding to UTF-8, and
UCS-2 corresponding to UTF-16.  But if the user-visible identifier
contains the string "UTF", it really should conform to the current (?)
specification.

UTF-8, UTF-32 (and perhaps UCS-4, I do not have access to the ISO
standard) were changed fairly recently to restrict valid code points to
the first 17 planes (those that can be encoded in UTF-16).  This is
another source of decoding and encoding failures (which are also
required by the Unicode specification).
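
As a minimal sketch of the encoding side (an illustration only, not the
glibc converter; encode_utf8 is a hypothetical name), a UTF-8 encoder
following the current definition would refuse both surrogate code points
and anything beyond the 17 planes:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not glibc code: encode one code point as UTF-8.
   Returns the number of bytes written (1..4), or 0 if CP is a surrogate
   code point or lies above U+10FFFF and so must not be encoded.  */
static size_t
encode_utf8 (uint32_t cp, unsigned char out[4])
{
  if (cp >= 0xd800 && cp <= 0xdfff)
    return 0;                   /* Surrogate code point.  */
  if (cp > 0x10ffff)
    return 0;                   /* Beyond the first 17 planes.  */

  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xc0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3f));
      return 2;
    }
  if (cp < 0x10000)
    {
      out[0] = (unsigned char) (0xe0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
      out[2] = (unsigned char) (0x80 | (cp & 0x3f));
      return 3;
    }
  out[0] = (unsigned char) (0xf0 | (cp >> 18));
  out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3f));
  out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
  out[3] = (unsigned char) (0x80 | (cp & 0x3f));
  return 4;
}
```

A real converter would report the rejected inputs through its own error
path; the 0 return value here is just a stand-in for that.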

I wrote “current (?)” above because it's a bit annoying that the
definition of UTF-32 was changed retroactively without changing its
identifier.  But I don't think there is anything glibc can do except to
adopt the new behavior for the old identifier.

glibc iconv seems to treat UCS-2 as UTF-16 (checking for surrogate
characters, which looks like a bug), but UCS-4 as a superset of UTF-32
(which could be correct, depending on what the last version of ISO 10646
says).

> There is a further issue in utf-16.c when converting from UTF-16 to
> internal. If a uint16_t value is in the range 0xd800 .. 0xdfff, the
> next uint16_t value is checked to see whether it is in the range of a
> low surrogate (0xdc00 .. 0xdfff). Afterwards these two uint16_t values
> are interpreted as a high- and low-surrogate pair.
> But there is no test whether the first uint16_t value is really in the
> range of a high surrogate (0xd800 .. 0xdbff).
> If there are two uint16_t values in the range of a low surrogate,
> they will be treated as a valid high- and low-surrogate pair.
> Should iconv() report the error "invalid multibyte sequence" in such a
> case?

Yes, this is a bug, and it should report an error.
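
For illustration, a sketch of the check being asked for could look like
the following.  This is not the code in utf-16.c; decode_utf16 and its
return convention are assumptions made for the example.  The point is
that the first unit is tested against the high-surrogate range before
the second unit is even looked at:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not the code in glibc's utf-16.c.  Decodes one
   code point from IN[0..LEN).  Returns the number of 16-bit units
   consumed (1 or 2), or 0 for an invalid or truncated sequence.  */
static size_t
decode_utf16 (const uint16_t *in, size_t len, uint32_t *cp)
{
  if (len == 0)
    return 0;

  uint16_t u1 = in[0];
  if (u1 < 0xd800 || u1 > 0xdfff)
    {
      /* Not a surrogate: the unit is the code point.  */
      *cp = u1;
      return 1;
    }

  /* The check described as missing in the report: a unit in
     0xdc00 .. 0xdfff cannot start a surrogate pair.  */
  if (u1 > 0xdbff)
    return 0;

  if (len < 2)
    return 0;                   /* High surrogate with nothing after it.  */

  uint16_t u2 = in[1];
  if (u2 < 0xdc00 || u2 > 0xdfff)
    return 0;                   /* High surrogate not followed by a low one.  */

  *cp = 0x10000 + (((uint32_t) (u1 - 0xd800) << 10)
                   | (uint32_t) (u2 - 0xdc00));
  return 2;
}
```

The sketch folds "invalid" and "incomplete" input into one failure
value; iconv itself distinguishes the two cases (EILSEQ versus EINVAL).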

Florian


* Re: Question about iconv, UTF 8/16/32 and error reporting due to UTF-16 surrogates.
  2015-12-02 12:29 ` Question about iconv, UTF 8/16/32 and error reporting due to UTF-16 surrogates Florian Weimer
@ 2015-12-03 21:44   ` Rich Felker
  2015-12-03 22:33     ` Florian Weimer
  0 siblings, 1 reply; 5+ messages in thread
From: Rich Felker @ 2015-12-03 21:44 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Stefan Liebler, Joseph S. Myers, Carlos O'Donell, GNU C Library

On Wed, Dec 02, 2015 at 01:29:40PM +0100, Florian Weimer wrote:
> On 12/02/2015 01:09 PM, Stefan Liebler wrote:
> 
> > What is the reason for reporting an error in the direction from UTF-8 to
> > UTF-32, but not in the direction from UTF-32 to UTF-8?
> > Or is it a bug?
> 
> It's a bug.  When processing UTF-* encodings, iconv needs to detect
> invalid source sequences and avoid creating invalid destination sequences.
> 
> There are various legacy encodings which treat surrogate code points as
> if they were regular characters: CESU-8 corresponding to UTF-8, and
> UCS-2 corresponding to UTF-16.  But if the user-visible identifier
> contains the string "UTF", it really should conform to the current (?)
> specification.
> 
> UTF-8, UTF-32 (and perhaps UCS-4, I do not have access to the ISO
> standard) were changed fairly recently to restrict valid code points to
> the first 17 planes (those that can be encoded in UTF-16).  This is
> another source of decoding and encoding failures (which are also
> required by the Unicode specification).
> 
> I wrote “current (?)” above because it's a bit annoying that the
> definition of UTF-32 was changed retroactively without changing its
> identifier.  But I don't think there is anything glibc can do except to
> adopt the new behavior for the old identifier.
> 
> glibc iconv seems to treat UCS-2 as UTF-16 (checking for surrogate
> characters, which looks like a bug), but UCS-4 as a superset of UTF-32
> (which could be correct, depending on what the last version of ISO 10646
> says).

The relevant term is "Unicode Scalar Values", and these are exactly
the integers 0-0xd7ff and 0xe000-0x10ffff. UTFs assign a unique
encoding (in terms of code units) to each Unicode Scalar Value,
and are not defined for any other integers. Likewise, UCS (16 or 32)
does not include values which are not Unicode Scalar Values.
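
Written as a predicate, that range is a one-liner.  This is only a
sketch; the function name is made up for the example and is not a glibc
interface:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if V is a Unicode Scalar Value, i.e. in
   0 .. 0xd7ff or 0xe000 .. 0x10ffff.  */
static bool
is_unicode_scalar_value (uint32_t v)
{
  return v <= 0xd7ff || (v >= 0xe000 && v <= 0x10ffff);
}
```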

Rich


* Re: Question about iconv, UTF 8/16/32 and error reporting due to UTF-16 surrogates.
  2015-12-03 21:44   ` Rich Felker
@ 2015-12-03 22:33     ` Florian Weimer
  2015-12-03 22:50       ` Joseph Myers
  0 siblings, 1 reply; 5+ messages in thread
From: Florian Weimer @ 2015-12-03 22:33 UTC (permalink / raw)
  To: Rich Felker
  Cc: Stefan Liebler, Joseph S. Myers, Carlos O'Donell, GNU C Library

On 12/03/2015 10:44 PM, Rich Felker wrote:

> The relevant term is "Unicode Scalar Values", and these are exactly
> the integers 0-0xd7ff and 0xe000-0x10ffff. UTFs assign a unique
> encoding (in terms of code units) to each Unicode Scalar Value,
> and are not defined for any other integers. Likewise, UCS (16 or 32)
> does not include values which are not Unicode Scalar Values.

The term Unicode Scalar Value did not exist when Unicode support was
added to glibc.  For example, all the references I have readily at hand
(I can't find the 10646 CD right now) imply that UCS-4 in ISO/IEC
10646:2000 still had 31 bits and not the range restriction you gave.

The question is what glibc should do -- implement historic definitions,
preserving the meaning of charset names for backwards compatibility, or
tweak the implementations as the definitions evolve.

Florian


* Re: Question about iconv, UTF 8/16/32 and error reporting due to UTF-16 surrogates.
  2015-12-03 22:33     ` Florian Weimer
@ 2015-12-03 22:50       ` Joseph Myers
  2016-05-08 14:22         ` Florian Weimer
  0 siblings, 1 reply; 5+ messages in thread
From: Joseph Myers @ 2015-12-03 22:50 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Rich Felker, Stefan Liebler, Carlos O'Donell, GNU C Library


On Thu, 3 Dec 2015, Florian Weimer wrote:

> On 12/03/2015 10:44 PM, Rich Felker wrote:
> 
> > The relevant term is "Unicode Scalar Values", and these are exactly
> > the integers 0-0xd7ff and 0xe000-0x10ffff. UTFs assign a unique
> > encoding (in terms of code units) to each Unicode Scalar Value,
> > and are not defined for any other integers. Likewise, UCS (16 or 32)
> > does not include values which are not Unicode Scalar Values.
> 
> The term Unicode Scalar Value did not exist when Unicode support was
> added to glibc.  For example, all the references I have readily at hand
> (I can't find the 10646 CD right now) imply that UCS-4 in ISO/IEC
> 10646:2000 still had 31 bits and not the range restriction you gave.

My previous look at that question in
<https://sourceware.org/ml/libc-alpha/2012-09/msg00112.html> indicates
that the restriction was introduced some time between 2008 and 2011.  I
haven't gone further into SC2 documents to narrow down the date.

> The question is what glibc should do -- implement historic definitions,
> preserving the meaning of charset names for backwards compatibility, or
> tweak the implementations as the definitions evolve.

I think we need to implement the current meanings of those names.

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: Question about iconv, UTF 8/16/32 and error reporting due to UTF-16 surrogates.
  2015-12-03 22:50       ` Joseph Myers
@ 2016-05-08 14:22         ` Florian Weimer
  0 siblings, 0 replies; 5+ messages in thread
From: Florian Weimer @ 2016-05-08 14:22 UTC (permalink / raw)
  To: Joseph Myers
  Cc: Rich Felker, Stefan Liebler, Carlos O'Donell, GNU C Library

On 12/03/2015 11:49 PM, Joseph Myers wrote:
> On Thu, 3 Dec 2015, Florian Weimer wrote:
>> The question is what glibc should do -- implement historic definitions,
>> preserving the meaning of charset names for backwards compatibility, or
>> tweak the implementations as the definitions evolve.
>
> I think we need to implement the current meanings of those names.

Makes sense.  I have reopened bug 2373.

Thanks,
Florian

