From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 27925 invoked by alias); 22 Mar 2010 22:37:21 -0000
Received: (qmail 27852 invoked by uid 48); 22 Mar 2010 22:37:08 -0000
Date: Mon, 22 Mar 2010 22:37:00 -0000
Message-ID: <20100322223708.27851.qmail@sourceware.org>
From: "svenboden at hotmail dot com"
To: glibc-bugs@sources.redhat.com
In-Reply-To: <20100322211440.11418.svenboden@hotmail.com>
References: <20100322211440.11418.svenboden@hotmail.com>
Reply-To: sourceware-bugzilla@sourceware.org
Subject: [Bug libc/11418] iconv/gconv: "illegal input sequence at position"/incomplete implementation
X-Bugzilla-Reason: CC
Mailing-List: contact glibc-bugs-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id:
List-Subscribe:
List-Post:
List-Help: ,
Sender: glibc-bugs-owner@sourceware.org
X-SW-Source: 2010-03/txt/msg00041.txt.bz2

------- Additional Comments From svenboden at hotmail dot com 2010-03-22 22:37 -------
>>From Colin Watson on Ubuntu Launchpad:

Well, it's entirely unsurprising that converting to US-ASCII would fail,
as U+00FC "ü" has no representation in US-ASCII. The error code is just
a slightly awkward way to say that conversion is impossible; the iconv()
function doesn't intrinsically distinguish between "this text isn't
valid in your input encoding" and "this text can't be converted to your
output encoding". Thus this just means that iconv thinks that there's no
mapping for U+00FC in EUC-KR.

So, the question is, what byte sequence does iconv on HP-UX output for
this string? And does it actually match the Korean character set
standards? That is, there are two possibilities here: either this is a
bug in glibc for failing to perform a correct conversion, or it's
actually a bug in HP-UX for performing an incorrect conversion rather
than returning an error.

-----------------

And my reply:

On HP-UX using UTF-8 to US-ASCII conversion a galley character would
replace a non-mappable character (if the conversion has a galley
character defined).
So for my original example the output on HP-UX would be "?s" when
converting from UTF-8 to US-ASCII (where ? is the galley character). For
the Korean conversion the galley character is a y with umlaut (or
something that looks very similar).

So my problem is that:
- the input in my case is always UTF-8 (my example is a simplification)
- the user can select the target encoding (the "to" side of the
  conversion)
- my input is dynamic, so I cannot predict which characters will occur;
  I only know that all input is valid UTF-8 (it comes from an Oracle
  database using UTF-8).

On HP-UX, once iconv_open() succeeds, iconv always returns some output
for valid UTF-8 input. On Linux it seems to stop at the first unmappable
character, and I don't know how many input bytes to skip, in general,
after such an error. This is what I consider incomplete on the Linux
side, but maybe the real question is whether any standard specifies how
iconv should behave here.

Would a short-term solution be to write my own conversion "plugin" for
gconv that handles all UTF-8 input?

-- 
http://sourceware.org/bugzilla/show_bug.cgi?id=11418
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
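
On the question above of how many bytes to skip after an error: when the
input is known to be valid UTF-8, the length of the offending character
can be read from its lead byte, so the caller can step over it and write
a substitute itself. The following is only a minimal sketch of that idea,
not how glibc or HP-UX implement anything. The names utf8_seq_len and
convert_with_subst are hypothetical, and it assumes a stateless target
encoding in which the substitute (for example '?') is a valid single
byte, as in US-ASCII or EUC-KR.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Byte length of the UTF-8 sequence that starts with `lead`.
 * Returns 1 for a stray continuation or invalid lead byte so that the
 * caller always makes progress. */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)           return 1;  /* 0xxxxxxx: ASCII      */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx             */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx             */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx             */
    return 1;
}

/* Hypothetical helper: convert the NUL-terminated, valid UTF-8 string
 * `in` to `tocode`, writing `subst` in place of every character that
 * iconv() rejects with EILSEQ. Returns the number of bytes written
 * (excluding the terminating NUL), or (size_t)-1 on any other error. */
size_t convert_with_subst(const char *tocode, const char *in,
                          char *out, size_t outsize, char subst)
{
    assert(tocode != NULL && in != NULL && out != NULL && outsize > 0);

    iconv_t cd = iconv_open(tocode, "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *ip = (char *)in;        /* iconv() takes char **, input is not modified */
    size_t ileft = strlen(in);
    char *op = out;
    size_t oleft = outsize - 1;   /* keep one byte for the NUL */
    int failed = 0;

    while (ileft > 0) {
        if (iconv(cd, &ip, &ileft, &op, &oleft) != (size_t)-1)
            break;                /* remaining input converted */
        if (errno != EILSEQ || oleft == 0) {
            failed = 1;           /* E2BIG, EINVAL, or no room for subst */
            break;
        }
        /* ip points at the start of the rejected character: skip it. */
        size_t skip = utf8_seq_len((unsigned char)*ip);
        if (skip > ileft)
            skip = ileft;
        ip += skip;
        ileft -= skip;
        *op++ = subst;
        oleft--;
    }

    iconv_close(cd);
    if (failed)
        return (size_t)-1;
    *op = '\0';
    return (size_t)(op - out);
}
```

Called as convert_with_subst("ASCII", "\xC3\xBCs", buf, sizeof buf, '?'),
this should yield "?s" on glibc, the same result described for HP-UX
above. Other iconv implementations may substitute on their own instead of
returning EILSEQ, so the first output byte can differ. glibc also accepts
the "//TRANSLIT" and "//IGNORE" suffixes on the target name passed to
iconv_open(), which may be enough if a fixed substitute character is not
required.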