public inbox for glibc-bugs@sourceware.org
* [Bug libc/11418] New: iconv/gconv: "illegal input sequence at position"/incomplete implementation
@ 2010-03-22 21:14 svenboden at hotmail dot com
  2010-03-22 22:37 ` [Bug libc/11418] " svenboden at hotmail dot com
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: svenboden at hotmail dot com @ 2010-03-22 21:14 UTC (permalink / raw)
  To: glibc-bugs

Doing a conversion from HP-UX to Linux (Red Hat/Ubuntu) shows that some
encodings on Linux are incomplete, or at least show unexpected behaviour.

The test input is "\xc3\xbc\x73", the UTF-8 encoding of "u umlaut" followed by
an "s".

The following works:
printf "\xc3\xbc\x73" | iconv -f utf8 -t ISO-8859-15

The following doesn't work:
printf "\xc3\xbc\x73" | iconv -f utf8 -t EUC-KR
It outputs "iconv: illegal input sequence at position 0".

While the following works:
printf "\xc3\xbc\x73" | iconv -f utf8 -t EUC-CN

On HP-UX all of the above generate proper output. Since UTF-8 is used as input
in all cases, it seems strange that iconv/gconv thinks the input is wrong
(errno 84, EILSEQ) in the EUC-KR case. Converting to US-ASCII has the same
problem as converting to EUC-KR.

-- 
           Summary: iconv/gconv: "illegal input sequence at
                    position"/incomplete implementation
           Product: glibc
           Version: unspecified
            Status: NEW
          Severity: normal
          Priority: P2
         Component: libc
        AssignedTo: drepper at redhat dot com
        ReportedBy: svenboden at hotmail dot com
                CC: glibc-bugs at sources dot redhat dot com


http://sourceware.org/bugzilla/show_bug.cgi?id=11418

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.


* [Bug libc/11418] iconv/gconv: "illegal input sequence at position"/incomplete implementation
  2010-03-22 21:14 [Bug libc/11418] New: iconv/gconv: "illegal input sequence at position"/incomplete implementation svenboden at hotmail dot com
@ 2010-03-22 22:37 ` svenboden at hotmail dot com
  2010-03-23  6:57 ` drepper at redhat dot com
  2010-04-04  2:28 ` drepper at redhat dot com
  2 siblings, 0 replies; 5+ messages in thread
From: svenboden at hotmail dot com @ 2010-03-22 22:37 UTC (permalink / raw)
  To: glibc-bugs


------- Additional Comments From svenboden at hotmail dot com  2010-03-22 22:37 -------
From Colin Watson on Ubuntu Launchpad:

Well, it's entirely unsurprising that converting to US-ASCII would fail,
as U+00FC "ü" has no representation in US-ASCII. The error code is just
a slightly awkward way to say that conversion is impossible; the iconv()
function doesn't intrinsically distinguish between "this text isn't
valid in your input encoding" and "this text can't be converted to your
output encoding". Thus this just means that iconv thinks that there's
no mapping for U+00FC in EUC-KR.

So, the question is, what byte sequence does iconv on HP-UX output for
this string? And does it actually match the Korean character set
standards? That is, there are two possibilities here: either this is a
bug in glibc for failing to perform a correct conversion, or it's
actually a bug in HP-UX for performing an incorrect conversion rather
than returning an error.
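
To make the point about the error code concrete, here is a minimal C sketch
(not from the thread) of what a caller sees when the target charset has no
mapping for a character. It assumes glibc's iconv(); the helper name
try_convert is hypothetical. For the string in the report and US-ASCII as
target, it should report EILSEQ with zero input bytes consumed, which matches
the "position 0" in the command-line message.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: try to convert the UTF-8 string `in` to `tocode`
   with a single iconv() call. Returns 0 on success, otherwise the errno
   value left by iconv_open() or iconv(). *consumed receives the number
   of input bytes iconv() accepted before it stopped. */
int try_convert(const char *tocode, const char *in, size_t *consumed)
{
    assert(tocode != NULL && in != NULL && consumed != NULL);
    *consumed = 0;

    iconv_t cd = iconv_open(tocode, "UTF-8");
    if (cd == (iconv_t)-1)
        return errno;

    char out[64];
    char *inp = (char *)in;      /* iconv() wants char **, input is not modified */
    char *outp = out;
    size_t inleft = strlen(in);
    size_t outleft = sizeof out;

    int err = 0;
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
        err = errno;             /* EILSEQ covers both invalid and unmappable input */

    *consumed = (size_t)(inp - in);
    iconv_close(cd);
    return err;
}
```

The same errno value would come back for genuinely malformed UTF-8, so the
caller cannot tell the two cases apart from the error code alone.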

-----------------

And my reply:

On HP-UX, a utf8 to US-ASCII conversion replaces a non-mappable character with
a galley character (if the conversion has a galley character defined). So for
my original example the output would be "?s" on HP-UX from utf8 to US-ASCII
(where ? is the galley character). In Korean the galley character is a y umlaut
(or something that looks very similar).

So my problem is that:
- the input in my case is always utf-8 (my example is a simplification)
- a user can select the "to conversion"
- my input is dynamic, so I don't have a clue on possible input combinations, I
only know all input is valid utf-8 (from an Oracle database using utf-8).

On HP-UX iconv (once the iconv_open() succeeds) always returns something on
valid utf-8 input. On Linux it seems to stop at the first unmappable
character, and I don't have a clue e.g. how many bytes to "skip" in general upon
error. This is what I consider incomplete on the Linux side, but maybe the
question is whether there are any standards on iconv conversion.

Would a short-term solution be to write my own conversion "plugin" for gconv
that handles all utf8 inputs?
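
On the "how many bytes to skip" question: if the input is known to be valid
UTF-8, the length of the offending character can be read off its lead byte, so
a caller can emulate the HP-UX galley-character behaviour without a gconv
plugin. The following is only a sketch under those assumptions, not something
proposed in the thread. The names utf8_seq_len and convert_with_subst are
hypothetical, and it assumes glibc's iconv() leaves the input pointer at the
start of the unconvertible character and that the target charset is stateless
and ASCII-compatible, so a single substitute byte can be written directly.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Byte length of the UTF-8 sequence that starts with `lead`.
   Only meaningful for valid UTF-8 input, as in the report. */
static size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Hypothetical helper: convert valid UTF-8 `in` to `tocode`, writing
   `subst` for each character the target cannot represent. Returns the
   number of bytes written (output is NUL-terminated), or (size_t)-1 on
   any error other than an unmappable character. */
size_t convert_with_subst(const char *tocode, const char *in,
                          char *out, size_t outsize, char subst)
{
    assert(tocode != NULL && in != NULL && out != NULL && outsize > 0);

    iconv_t cd = iconv_open(tocode, "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inp = (char *)in;          /* input bytes are not modified */
    size_t inleft = strlen(in);
    char *outp = out;
    size_t outleft = outsize - 1;    /* keep one byte for the NUL */

    while (inleft > 0) {
        if (iconv(cd, &inp, &inleft, &outp, &outleft) != (size_t)-1)
            break;                   /* everything converted */
        if (errno != EILSEQ || outleft == 0) {
            iconv_close(cd);         /* E2BIG, EINVAL, or no room left */
            return (size_t)-1;
        }
        /* Skip exactly one UTF-8 character and emit the substitute. */
        size_t skip = utf8_seq_len((unsigned char)*inp);
        if (skip > inleft)
            skip = inleft;
        inp += skip;
        inleft -= skip;
        *outp++ = subst;
        outleft--;
    }

    iconv_close(cd);
    *outp = '\0';
    return (size_t)(outp - out);
}
```

With the example from the report, calling
convert_with_subst("US-ASCII", "\xc3\xbc\x73", buf, sizeof buf, '?') would be
expected to leave "?s" in buf. A stateful target encoding would need extra
care, since writing a raw byte could bypass its shift sequences.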



* [Bug libc/11418] iconv/gconv: "illegal input sequence at position"/incomplete implementation
  2010-03-22 21:14 [Bug libc/11418] New: iconv/gconv: "illegal input sequence at position"/incomplete implementation svenboden at hotmail dot com
  2010-03-22 22:37 ` [Bug libc/11418] " svenboden at hotmail dot com
@ 2010-03-23  6:57 ` drepper at redhat dot com
  2010-04-04  2:28 ` drepper at redhat dot com
  2 siblings, 0 replies; 5+ messages in thread
From: drepper at redhat dot com @ 2010-03-23  6:57 UTC (permalink / raw)
  To: glibc-bugs


------- Additional Comments From joseph at codesourcery dot com  2010-03-22 23:14 -------
Subject: Re:  iconv/gconv: "illegal input sequence at
 position"/incomplete implementation

You can use US-ASCII//TRANSLIT to transliterate.  That produces "?s" here.
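
For callers using the C API rather than the command line, the same suffix can
be passed in the target name given to iconv_open(). A minimal sketch, assuming
glibc (the //TRANSLIT suffix is not guaranteed by other iconv implementations);
the helper name to_ascii_translit is hypothetical. What gets substituted for an
unmappable character may depend on the active locale, so the code makes no
assumption about it.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert the UTF-8 string `in` to US-ASCII,
   asking glibc to transliterate characters that have no ASCII mapping.
   Returns the number of bytes written (output is NUL-terminated), or
   (size_t)-1 if the conversion could not be set up or failed. */
size_t to_ascii_translit(const char *in, char *out, size_t outsize)
{
    assert(in != NULL && out != NULL && outsize > 0);

    iconv_t cd = iconv_open("US-ASCII//TRANSLIT", "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inp = (char *)in;          /* input bytes are not modified */
    size_t inleft = strlen(in);
    char *outp = out;
    size_t outleft = outsize - 1;    /* keep one byte for the NUL */

    /* On success iconv() returns the count of non-reversible conversions,
       which may be greater than zero when transliteration took place. */
    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;

    *outp = '\0';
    return (size_t)(outp - out);
}
```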


------- Additional Comments From drepper at redhat dot com  2010-03-23 06:56 -------
There is no bug, as you've already admitted, and this is no place to ask for
programming advice.


* [Bug libc/11418] iconv/gconv: "illegal input sequence at position"/incomplete implementation
  2010-03-22 21:14 [Bug libc/11418] New: iconv/gconv: "illegal input sequence at position"/incomplete implementation svenboden at hotmail dot com
  2010-03-22 22:37 ` [Bug libc/11418] " svenboden at hotmail dot com
  2010-03-23  6:57 ` drepper at redhat dot com
@ 2010-04-04  2:28 ` drepper at redhat dot com
  2 siblings, 0 replies; 5+ messages in thread
From: drepper at redhat dot com @ 2010-04-04  2:28 UTC (permalink / raw)
  To: glibc-bugs


------- Additional Comments From drepper at redhat dot com  2010-04-04 02:28 -------
As mentioned before, there is no bug.

-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |INVALID


* [Bug libc/11418] iconv/gconv: "illegal input sequence at position"/incomplete implementation
       [not found] <bug-11418-131@http.sourceware.org/bugzilla/>
@ 2014-06-30 18:25 ` fweimer at redhat dot com
  0 siblings, 0 replies; 5+ messages in thread
From: fweimer at redhat dot com @ 2014-06-30 18:25 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=11418

Florian Weimer <fweimer at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|                            |security-
