From: Corinna Vinschen <corinna-cygwin@cygwin.com>
To: cygwin@cygwin.com
Subject: Bug in libiconv?
Date: Tue, 25 Jan 2011 06:36:00 -0000
Message-ID: <20110124154158.GA15279@calimero.vinschen.de>

Hi Chuck,
hi everyone else,


In a twisted turn of events, I'm trying to get the orphaned catgets
package to work correctly on Cygwin 1.7.  As you might know, the package
is derived from the glibc package.  Apart from other portability issues
of this *very* glibc-centric piece of code, I found a problem which
appears to point to two bugs in Cygwin's libiconv2.

For some reason, the iconv conversion seems to be overly dependent on
the usage of setlocale, and the value returned in the last parameter
(outbytesleft) appears to be incorrect if the output codeset is "WCHAR_T".

Here's a simple testcase:

==== SNIP ====
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <iconv.h>
#include <locale.h>
#include <wchar.h>

iconv_t
open_iconv ()
{
  iconv_t cd_towcp = iconv_open ("WCHAR_T", "UTF-8");
  if (cd_towcp == (iconv_t) -1)
    {
      fprintf (stderr, "iconv_open: %d <%s>\n", errno, strerror (errno));
      exit (1);
    }
  return cd_towcp;
}

void
run_iconv (iconv_t cd_towcp, char *input)
{
  wchar_t out[256];

  char *inbuf = input;
  size_t inbytesleft = strlen (inbuf);
  char *outbuf = (char *) out;
  size_t outbytesleft = sizeof (out);
  size_t ret = iconv (cd_towcp, &inbuf, &inbytesleft, &outbuf, &outbytesleft);
  if (ret == (size_t) -1)
    fprintf (stderr, "iconv: %d <%s>\n", errno, strerror (errno));
  printf ("in = <%s>, inbuf = <%s>, inbytesleft = %zu, outbytesleft = %zu\n",
          input, inbuf, inbytesleft, outbytesleft);
}

int
main ()
{
  iconv_t cd_towcp;
  char *finnish = "Liian pitk\303\244 sana";  // Umlaut-a
  
  setlocale (LC_ALL, "C");
  cd_towcp = open_iconv ();
  setlocale (LC_ALL, "C");
  run_iconv (cd_towcp, finnish);
  setlocale (LC_ALL, "C.UTF-8");
  run_iconv (cd_towcp, finnish);
  iconv_close (cd_towcp);
  
  setlocale (LC_ALL, "C.UTF-8");
  cd_towcp = open_iconv ();
  setlocale (LC_ALL, "C");
  run_iconv (cd_towcp, finnish);
  setlocale (LC_ALL, "C.UTF-8");
  run_iconv (cd_towcp, finnish);
  iconv_close (cd_towcp);

  return 0;
}
==== SNAP ====

Here are the important details:

- The input string is a fixed Finnish UTF-8 sentence containing a
  single non-ASCII char.

- The testcase always calls setlocale before calling iconv_open(),
  and then calls setlocale again before calling iconv().

- So the application tests converting a UTF-8 string to a WCHAR_T string
  in four combinations of the current locale, in this order:

  - iconv_open "C",       iconv "C"
  - iconv_open "C",       iconv "C.UTF-8"
  - iconv_open "C.UTF-8", iconv "C"
  - iconv_open "C.UTF-8", iconv "C.UTF-8"
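
The four combinations differ only in the codeset that setlocale selects.
To make that visible, one could print the active codeset before each
iconv_open() and iconv() call.  A minimal sketch, assuming nl_langinfo()
from <langinfo.h> is available; the helper name codeset_for_locale is
hypothetical and not part of the testcase:

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: switch LC_ALL to the named locale and return the
   codeset it selects, or NULL if the locale is not available.  The
   returned string is owned by the C library and may be overwritten by
   later calls to nl_langinfo or setlocale. */
const char *
codeset_for_locale (const char *name)
{
  assert (name != NULL);
  if (setlocale (LC_ALL, name) == NULL)
    return NULL;
  return nl_langinfo (CODESET);
}
```

Calling codeset_for_locale ("C") and codeset_for_locale ("C.UTF-8") in
place of the plain setlocale calls would show which codeset is active in
each combination.  The exact names returned are platform-specific.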

Here's what happens on Linux:

  $ gcc -g -o ic ic.c
  $ ./ic
  in = <Liian pitkä sana>, inbuf = <>, inbytesleft = 0, outbytesleft = 960
  in = <Liian pitkä sana>, inbuf = <>, inbytesleft = 0, outbytesleft = 960
  in = <Liian pitkä sana>, inbuf = <>, inbytesleft = 0, outbytesleft = 960
  in = <Liian pitkä sana>, inbuf = <>, inbytesleft = 0, outbytesleft = 960

Here's what happens on Cygwin:

  $ gcc -g -o ic ic.c -liconv
  $ ./ic
  iconv: 138 <Invalid or incomplete multibyte or wide character>
  in = <Liian pitkä sana>, inbuf = <ä sana>, inbytesleft = 7, outbytesleft = 492
  iconv: 138 <Invalid or incomplete multibyte or wide character>
  in = <Liian pitkä sana>, inbuf = <ä sana>, inbytesleft = 7, outbytesleft = 492
  iconv: 138 <Invalid or incomplete multibyte or wide character>
  in = <Liian pitkä sana>, inbuf = <ä sana>, inbytesleft = 7, outbytesleft = 492
  in = <Liian pitkä sana>, inbuf = <>, inbytesleft = 0, outbytesleft = 480

So, AFAICS, there are two problems:

  - Even though iconv_open has been called explicitly with "UTF-8" as
    input codeset, the conversion still depends on the current application
    codeset.  That doesn't make sense.

  - Even though the last parameter to iconv is defined in bytes, the
    value of outbytesleft after the conversion is the number of remaining
    wchar_t's, not the number of remaining bytes.  That's contrary to what
    POSIX defines, see
    POSIX defines, see
    http://pubs.opengroup.org/onlinepubs/9699919799/functions/iconv.html
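
Since POSIX counts outbytesleft in bytes, the number of wide characters
written can be derived by dividing the bytes consumed by sizeof (wchar_t).
A minimal sketch of that accounting, assuming an iconv that accepts
"WCHAR_T" as the target codeset; the helper name utf8_to_wchar is
hypothetical:

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert a UTF-8 string to wchar_t and return the
   number of wide characters written, or (size_t) -1 on failure.
   out_size is the size of the out buffer in BYTES, which is the unit
   iconv uses for outbytesleft. */
size_t
utf8_to_wchar (const char *input, wchar_t *out, size_t out_size)
{
  assert (input != NULL && out != NULL);

  iconv_t cd = iconv_open ("WCHAR_T", "UTF-8");
  if (cd == (iconv_t) -1)
    return (size_t) -1;

  char *inbuf = (char *) input;   /* iconv only reads the input bytes */
  size_t inbytesleft = strlen (input);
  char *outbuf = (char *) out;
  size_t outbytesleft = out_size;

  size_t ret = iconv (cd, &inbuf, &inbytesleft, &outbuf, &outbytesleft);
  iconv_close (cd);
  if (ret == (size_t) -1)
    return (size_t) -1;

  /* Bytes consumed in the output buffer, converted to a character count. */
  return (out_size - outbytesleft) / sizeof (wchar_t);
}
```

With a 4-byte wchar_t, the 16-character test string occupies 64 bytes,
leaving 1024 - 64 = 960, as in the Linux output.  With a 2-byte wchar_t,
the same accounting would give 512 - 32 = 480.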

Is this analysis correct?  Is there by any chance a newer version of
libiconv2 which does not have these problems?


Thanks,
Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
