[Bug libc/4335] New: EastAsianAmbiguous character width is always 1 in UTF-8 locale

public inbox for glibc-bugs@sourceware.org

* [Bug libc/4335] New: EastAsianAmbiguous character width is always 1 in UTF-8 locale
@ 2007-04-08 12:18 d+bugzilla at vdr dot jp
  2007-06-02 23:43 ` [Bug libc/4335] " bruno at clisp dot org
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: d+bugzilla at vdr dot jp @ 2007-04-08 12:18 UTC (permalink / raw)
  To: glibc-bugs

According to /usr/share/i18n/charmaps/UTF-8.gz, the character width is 1 by
default, and W (Wide) and F (Fullwidth) characters have width 2:

% Character width according to Unicode 3.2.
% - Default width is 1.
% - Double-width characters have width 2; generated from
%        "grep '^[^;]*;[WF]' EastAsianWidth.txt"
%   and  "grep '^[^;]*;[^WF]' EastAsianWidth.txt"
% - Non-spacing characters have width 0; generated from PropList.txt or
%   "grep '^[^;]*;[^;]*;[^;]*;[^;]*;NSM;' UnicodeData.txt"
% - Format control characters have width 0; generated from
%   "grep '^[^;]*;[^;]*;Cf;' UnicodeData.txt"
% - Zero width characters have width 0; generated from
%   "grep '^[^;]*;ZERO WIDTH ' UnicodeData.txt"

The width of A (Ambiguous) characters is expected to be context-sensitive,
but it is always 1 regardless of context.

According to http://www.unicode.org/reports/tr11/#Recommendations:

> When mapping Unicode to East Asian legacy character encodings
> 
>     * Wide Unicode characters always map to fullwidth characters.
>     * Narrow (and neutral) Unicode characters always map to halfwidth characters.
>     * Halfwidth Unicode characters always map to halfwidth characters.
>     * Ambiguous Unicode characters always map to fullwidth characters.

I think the width of EastAsianAmbiguous characters should be 2 in CJK UTF-8
locales.
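
To illustrate the rule proposed above, here is a minimal sketch (not glibc
code) of how a column width could be derived from a UAX #11 East_Asian_Width
class. The helper name width_from_eaw and the ambiguous_is_wide flag are
assumptions made for this example only.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: column width for a UAX #11 East_Asian_Width class
   given as a string ("W", "F", "A", "H", "Na" or "N").
   ambiguous_is_wide selects the CJK interpretation of class "A". */
static int width_from_eaw(const char *eaw, int ambiguous_is_wide)
{
    assert(eaw != NULL);
    if (strcmp(eaw, "W") == 0 || strcmp(eaw, "F") == 0)
        return 2;                           /* Wide and Fullwidth */
    if (strcmp(eaw, "A") == 0)
        return ambiguous_is_wide ? 2 : 1;   /* Ambiguous: depends on context */
    return 1;                               /* H, Na, N: default width */
}
```

With this sketch, width_from_eaw("A", 1) gives 2 (the behaviour requested for
CJK locales) and width_from_eaw("A", 0) gives 1 (the current behaviour).
Zero-width characters are left out to keep the example short.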

-- 
           Summary: EastAsianAmbiguous character width is always 1 in UTF-8
                    locale
           Product: glibc
           Version: unspecified
            Status: NEW
          Severity: normal
          Priority: P2
         Component: libc
        AssignedTo: drepper at redhat dot com
        ReportedBy: d+bugzilla at vdr dot jp
                CC: glibc-bugs at sources dot redhat dot com


http://sourceware.org/bugzilla/show_bug.cgi?id=4335

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.



* [Bug libc/4335] EastAsianAmbiguous character width is always 1 in UTF-8 locale
  2007-04-08 12:18 [Bug libc/4335] New: EastAsianAmbiguous character width is always 1 in UTF-8 locale d+bugzilla at vdr dot jp
@ 2007-06-02 23:43 ` bruno at clisp dot org
  2007-06-10 13:05 ` d+bugzilla at vdr dot jp
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: bruno at clisp dot org @ 2007-06-02 23:43 UTC (permalink / raw)
  To: glibc-bugs

------- Additional Comments From bruno at clisp dot org  2007-06-02 23:43 -------
The "character width" is mostly useful when dealing with cell-based
terminal emulators.

IMO it makes no sense to make such a change in glibc (i.e. to create an
alternative charmap UTF-8-CJK and to build locales like ja_JP.UTF-8 against
it) in isolation. What needs to be considered is the majority of the terminal
emulators; see for example the list at
  http://packages.debian.org/stable/virtual/x-terminal-emulator
If you change the most important among these terminal emulators to choose
their font configuration according to the locale, in such a way that in CJK
locales the Ambiguous Width characters have width 2, and in other locales they
have width 1, _then_ IMO the change also makes sense in glibc.
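
As a rough sketch of what "according to the locale" could mean for a terminal
emulator, the following hypothetical helper checks whether a locale name
starts with a CJK language code. The name locale_is_cjk and the list of
language codes are assumptions for illustration; no terminal emulator is
claimed to do exactly this.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: returns 1 when a locale name such as "ja_JP.UTF-8"
   starts with one of the language codes "ja", "ko" or "zh", else 0. */
static int locale_is_cjk(const char *name)
{
    static const char *const langs[] = { "ja", "ko", "zh" };
    size_t i;

    assert(name != NULL);
    for (i = 0; i < sizeof langs / sizeof langs[0]; i++) {
        /* Accept "ja", "ja_JP", "ja.UTF-8", "ja@modifier", but not "japanese". */
        if (strncmp(name, langs[i], 2) == 0 &&
            (name[2] == '\0' || name[2] == '_' ||
             name[2] == '.'  || name[2] == '@'))
            return 1;
    }
    return 0;
}
```

A program could pass the result of setlocale(LC_CTYPE, NULL) to this helper
after calling setlocale(LC_CTYPE, "").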


* [Bug libc/4335] EastAsianAmbiguous character width is always 1 in UTF-8 locale
  2007-04-08 12:18 [Bug libc/4335] New: EastAsianAmbiguous character width is always 1 in UTF-8 locale d+bugzilla at vdr dot jp
  2007-06-02 23:43 ` [Bug libc/4335] " bruno at clisp dot org
@ 2007-06-10 13:05 ` d+bugzilla at vdr dot jp
  2007-11-27 16:04 ` d+bugzilla at vdr dot jp
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: d+bugzilla at vdr dot jp @ 2007-06-10 13:05 UTC (permalink / raw)
  To: glibc-bugs


------- Additional Comments From d+bugzilla at vdr dot jp  2007-06-10 13:05 -------
I created UTF-8-CJK (EastAsianAmbiguous character width 2) and built ja_JP.UTF-8
against it.
Then I tested the terminal emulators in Debian's x-terminal-emulator list.
The terminal emulators that can handle UTF-8 work well and choose fonts
correctly.
(I left terminal emulators that cannot handle UTF-8 out of consideration.)

works well:

gnome-terminal
konsole
mlterm (mlterm-tiny)
rxvt (rxvt-ml)
rxvt-beta
rxvt-unicode (rxvt-unicode-ml, rxvt-unicode-lite)
tilda
xfce4-terminal
xterm

does not handle UTF-8:

aterm (aterm-ml)
eterm
kterm
mrxvt (mrxvt-cjk, mrxvt-mini)
multi-gnome-terminal
wterm (wterm-ml)

does not handle ja_JP.eucJP:

hanterm-xf
powershell
pterm
terminal.app
xvt


* [Bug libc/4335] EastAsianAmbiguous character width is always 1 in UTF-8 locale
  2007-04-08 12:18 [Bug libc/4335] New: EastAsianAmbiguous character width is always 1 in UTF-8 locale d+bugzilla at vdr dot jp
  2007-06-02 23:43 ` [Bug libc/4335] " bruno at clisp dot org
  2007-06-10 13:05 ` d+bugzilla at vdr dot jp
@ 2007-11-27 16:04 ` d+bugzilla at vdr dot jp
  2008-11-25 17:28 ` d+bugzilla at vdr dot jp
  2009-02-28  7:38 ` d+bugzilla at vdr dot jp
  4 siblings, 0 replies; 6+ messages in thread
From: d+bugzilla at vdr dot jp @ 2007-11-27 16:04 UTC (permalink / raw)
  To: glibc-bugs



------- Additional Comments From d+bugzilla at vdr dot jp  2007-11-27 16:04 -------
Any progress?
The problem is still present in glibc 2.7 (Debian).

% /lib/libc.so.6
GNU C Library stable release version 2.7, by Roland McGrath et al.
Copyright (C) 2007 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 4.2.3 20071123 (prerelease) (Debian 4.2.2-4).
Compiled on a Linux >>2.6.22.12<< system on 2007-11-26.
Available extensions:
	crypt add-on version 2.1 by Michael Glad and others
	GNU Libidn by Simon Josefsson
	Native POSIX Threads Library by Ulrich Drepper et al
	BIND-8.2.3-T5B
For bug reporting instructions, please see:
<http://www.gnu.org/software/libc/bugs.html>.

% cat test.c
#define _XOPEN_SOURCE 700   /* must precede all includes to expose wcwidth() */
#include <stdio.h>
#include <locale.h>
#include <wchar.h>

int main( void ) {
  wchar_t i;
  int euc, utf8;   /* wcwidth() returns int */

  for( i = 0x00; i < 0x100; i++ ) {
    setlocale( LC_CTYPE, "ja_JP.eucJP" );
    euc = wcwidth( i );
    setlocale( LC_CTYPE, "ja_JP.UTF-8" );
    utf8 = wcwidth( i );

    /* Report printable characters whose width differs between the locales. */
    if( euc > 0 && euc != utf8 ) {
      fprintf( stdout, "%02x : %d : %d : [%lc]\n",
               (unsigned int)i, euc, utf8, (wint_t)i );
    }
  }

  return 0;
}

Using default UTF-8 locale:

% ./a.out
a1 : 2 : 1 : [¡]
a2 : 2 : 1 : [¢]
a3 : 2 : 1 : [£]
a4 : 2 : 1 : [¤]
a6 : 2 : 1 : [¦]
a7 : 2 : 1 : [§]
a8 : 2 : 1 : [¨]
a9 : 2 : 1 : [©]
aa : 2 : 1 : [ª]
ac : 2 : 1 : [¬]
ae : 2 : 1 : [®]
af : 2 : 1 : [¯]
b0 : 2 : 1 : [°]
b1 : 2 : 1 : [±]
b4 : 2 : 1 : [´]
b6 : 2 : 1 : [¶]
b8 : 2 : 1 : [¸]
ba : 2 : 1 : [º]
bf : 2 : 1 : [¿]
c0 : 2 : 1 : [À]
c1 : 2 : 1 : [Á]
c2 : 2 : 1 : [Â]
c3 : 2 : 1 : [Ã]
c4 : 2 : 1 : [Ä]
c5 : 2 : 1 : [Å]
c6 : 2 : 1 : [Æ]
c7 : 2 : 1 : [Ç]
c8 : 2 : 1 : [È]
c9 : 2 : 1 : [É]
ca : 2 : 1 : [Ê]
cb : 2 : 1 : [Ë]
cc : 2 : 1 : [Ì]
cd : 2 : 1 : [Í]
ce : 2 : 1 : [Î]
cf : 2 : 1 : [Ï]
d1 : 2 : 1 : [Ñ]
d2 : 2 : 1 : [Ò]
d3 : 2 : 1 : [Ó]
d4 : 2 : 1 : [Ô]
d5 : 2 : 1 : [Õ]
d6 : 2 : 1 : [Ö]
d7 : 2 : 1 : [×]
d8 : 2 : 1 : [Ø]
d9 : 2 : 1 : [Ù]
da : 2 : 1 : [Ú]
db : 2 : 1 : [Û]
dc : 2 : 1 : [Ü]
dd : 2 : 1 : [Ý]
de : 2 : 1 : [Þ]
df : 2 : 1 : [ß]
e0 : 2 : 1 : [à]
e1 : 2 : 1 : [á]
e2 : 2 : 1 : [â]
e3 : 2 : 1 : [ã]
e4 : 2 : 1 : [ä]
e5 : 2 : 1 : [å]
e6 : 2 : 1 : [æ]
e7 : 2 : 1 : [ç]
e8 : 2 : 1 : [è]
e9 : 2 : 1 : [é]
ea : 2 : 1 : [ê]
eb : 2 : 1 : [ë]
ec : 2 : 1 : [ì]
ed : 2 : 1 : [í]
ee : 2 : 1 : [î]
ef : 2 : 1 : [ï]
f0 : 2 : 1 : [ð]
f1 : 2 : 1 : [ñ]
f2 : 2 : 1 : [ò]
f3 : 2 : 1 : [ó]
f4 : 2 : 1 : [ô]
f5 : 2 : 1 : [õ]
f6 : 2 : 1 : [ö]
f7 : 2 : 1 : [÷]
f8 : 2 : 1 : [ø]
f9 : 2 : 1 : [ù]
fa : 2 : 1 : [ú]
fb : 2 : 1 : [û]
fc : 2 : 1 : [ü]
fd : 2 : 1 : [ý]
fe : 2 : 1 : [þ]
ff : 2 : 1 : [ÿ]

Using modified (EastAsianAmbiguous character width == 2,
according to EastAsianWidth-5.0.0.txt) UTF-8 locale:

% ./a.out
a2 : 2 : 1 : [¢]
a3 : 2 : 1 : [£]
a6 : 2 : 1 : [¦]
a9 : 2 : 1 : [©]
ac : 2 : 1 : [¬]
af : 2 : 1 : [¯]
c0 : 2 : 1 : [À]
c1 : 2 : 1 : [Á]
c2 : 2 : 1 : [Â]
c3 : 2 : 1 : [Ã]
c4 : 2 : 1 : [Ä]
c5 : 2 : 1 : [Å]
c7 : 2 : 1 : [Ç]
c8 : 2 : 1 : [È]
c9 : 2 : 1 : [É]
ca : 2 : 1 : [Ê]
cb : 2 : 1 : [Ë]
cc : 2 : 1 : [Ì]
cd : 2 : 1 : [Í]
ce : 2 : 1 : [Î]
cf : 2 : 1 : [Ï]
d1 : 2 : 1 : [Ñ]
d2 : 2 : 1 : [Ò]
d3 : 2 : 1 : [Ó]
d4 : 2 : 1 : [Ô]
d5 : 2 : 1 : [Õ]
d6 : 2 : 1 : [Ö]
d9 : 2 : 1 : [Ù]
da : 2 : 1 : [Ú]
db : 2 : 1 : [Û]
dc : 2 : 1 : [Ü]
dd : 2 : 1 : [Ý]
e2 : 2 : 1 : [â]
e3 : 2 : 1 : [ã]
e4 : 2 : 1 : [ä]
e5 : 2 : 1 : [å]
e7 : 2 : 1 : [ç]
eb : 2 : 1 : [ë]
ee : 2 : 1 : [î]
ef : 2 : 1 : [ï]
f1 : 2 : 1 : [ñ]
f4 : 2 : 1 : [ô]
f5 : 2 : 1 : [õ]
f6 : 2 : 1 : [ö]
fb : 2 : 1 : [û]
fd : 2 : 1 : [ý]
ff : 2 : 1 : [ÿ]

% diff -u utf8-cjk-default utf8-cjk-modified
--- utf8-cjk-default	2007-11-28 01:03:07.000000000 +0900
+++ utf8-cjk-modified	2007-11-28 01:02:55.000000000 +0900
@@ -1,29 +1,15 @@
-a1 : 2 : 1 : [¡]
 a2 : 2 : 1 : [¢]
 a3 : 2 : 1 : [£]
-a4 : 2 : 1 : [¤]
 a6 : 2 : 1 : [¦]
-a7 : 2 : 1 : [§]
-a8 : 2 : 1 : [¨]
 a9 : 2 : 1 : [©]
-aa : 2 : 1 : [ª]
 ac : 2 : 1 : [¬]
-ae : 2 : 1 : [®]
 af : 2 : 1 : [¯]
-b0 : 2 : 1 : [°]
-b1 : 2 : 1 : [±]
-b4 : 2 : 1 : [´]
-b6 : 2 : 1 : [¶]
-b8 : 2 : 1 : [¸]
-ba : 2 : 1 : [º]
-bf : 2 : 1 : [¿]
 c0 : 2 : 1 : [À]
 c1 : 2 : 1 : [Á]
 c2 : 2 : 1 : [Â]
 c3 : 2 : 1 : [Ã]
 c4 : 2 : 1 : [Ä]
 c5 : 2 : 1 : [Å]
-c6 : 2 : 1 : [Æ]
 c7 : 2 : 1 : [Ç]
 c8 : 2 : 1 : [È]
 c9 : 2 : 1 : [É]
@@ -39,44 +25,23 @@
 d4 : 2 : 1 : [Ô]
 d5 : 2 : 1 : [Õ]
 d6 : 2 : 1 : [Ö]
-d7 : 2 : 1 : [×]
-d8 : 2 : 1 : [Ø]
 d9 : 2 : 1 : [Ù]
 da : 2 : 1 : [Ú]
 db : 2 : 1 : [Û]
 dc : 2 : 1 : [Ü]
 dd : 2 : 1 : [Ý]
-de : 2 : 1 : [Þ]
-df : 2 : 1 : [ß]
-e0 : 2 : 1 : [à]
-e1 : 2 : 1 : [á]
 e2 : 2 : 1 : [â]
 e3 : 2 : 1 : [ã]
 e4 : 2 : 1 : [ä]
 e5 : 2 : 1 : [å]
-e6 : 2 : 1 : [æ]
 e7 : 2 : 1 : [ç]
-e8 : 2 : 1 : [è]
-e9 : 2 : 1 : [é]
-ea : 2 : 1 : [ê]
 eb : 2 : 1 : [ë]
-ec : 2 : 1 : [ì]
-ed : 2 : 1 : [í]
 ee : 2 : 1 : [î]
 ef : 2 : 1 : [ï]
-f0 : 2 : 1 : [ð]
 f1 : 2 : 1 : [ñ]
-f2 : 2 : 1 : [ò]
-f3 : 2 : 1 : [ó]
 f4 : 2 : 1 : [ô]
 f5 : 2 : 1 : [õ]
 f6 : 2 : 1 : [ö]
-f7 : 2 : 1 : [÷]
-f8 : 2 : 1 : [ø]
-f9 : 2 : 1 : [ù]
-fa : 2 : 1 : [ú]
 fb : 2 : 1 : [û]
-fc : 2 : 1 : [ü]
 fd : 2 : 1 : [ý]
-fe : 2 : 1 : [þ]
 ff : 2 : 1 : [ÿ]


* [Bug libc/4335] EastAsianAmbiguous character width is always 1 in UTF-8 locale
  2007-04-08 12:18 [Bug libc/4335] New: EastAsianAmbiguous character width is always 1 in UTF-8 locale d+bugzilla at vdr dot jp
                   ` (2 preceding siblings ...)
  2007-11-27 16:04 ` d+bugzilla at vdr dot jp
@ 2008-11-25 17:28 ` d+bugzilla at vdr dot jp
  2009-02-28  7:38 ` d+bugzilla at vdr dot jp
  4 siblings, 0 replies; 6+ messages in thread
From: d+bugzilla at vdr dot jp @ 2008-11-25 17:28 UTC (permalink / raw)
  To: glibc-bugs


------- Additional Comments From d+bugzilla at vdr dot jp  2008-11-25 17:27 -------
Here is the rxvt-unicode author's opinion:

http://lists.schmorp.de/pipermail/rxvt-unicode/2007q1/000402.html

> > > > ja_JP.eucJP locale is fixed by src/rxvt.h r1.265.
> > > > But ja_JP.UTF-8 locale is still weird.
> > >
> > > No, its correct, thats what the locale specified.
> >
> > Do you mean that ja_JP.UTF-8 locale specifies
> > "0xd7" (EastAsianAmbiguous) is HALFWIDTH and
> > rxvt-unicode simply respects it?
> 
> Basically, yes. At least that is how it *should* be: urxvt always respects
> your locale, as should all other programs do too. If your locale says
> something and urxvt doesn't follow that, that is considered a bug in
> urxvt.
> 
> > > > Do you plan to merge doc/solaris9.patch?
> > >
> > > No, thats an ugly hack around solaris being broken.
> >
> > Uh, I mean mk_wcwidth() that is a part of doc/solaris9.patch.
> > mk_wcwidth() variant with configurable option is imported into vim,
> > xterm and so on.
> 
> Yes, they are all buggy as long as they use that.
> 
> > Yes, rxvt-unicode respects what the locale says.
> > But vim, xterm, etc. have options that give EastAsianAmbiguous characters
> > special treatment, making their width 2:
> > vim has the ambiwidth=double option, xterm has the -cjk_width option.
> 
> Yes, I know. But its stupid to add such hacks to each and every program
> and force the user to enable them. The right way is to use or modify the
> locale, then suddenly all well-written programs with or without such hacks
> just magically work.
> 
> Ignoring the locale is just wrong. It leads to interoperability
> problems between programs that simply wouldn't exist if everybody just
> respected the locale instead of relying on their own hacks.
> 
> The only justification for adding hacks is for systems that do not support
> required locales (such as one providing utf-8), but those systems either
> die or get upgraded, so the time is much better spent improving the locale
> system on those rare systems rather than adding hacks to each and every
> program.
> 
> > Do you mean locale is wrong/broken then programs do not need to
> 
> If the locale specifies a character width that you do not want, then the
> locale is pretty much broken from your perspective, isn't it? At least its
> not the locale you want.
> 
> > Do I need to ask not rxvt-unicode but glibc?
> 
> I think glibc (or any software distribution either using it or something
> else) should provide the means to configure it regarding such details such
> as character width, at least for commonly wanted cases such as east asian
> widths.
> 
> I am open to reasoning against my arguments, but to change my mind one
> would have to overcome the arguments above. It just plain makes no
> sense to hack each and every program in the world to work around locale
> limitations: there are far more editors and terminals around than libcs.
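
To make the "respect the locale" position concrete, here is a minimal sketch
of a program-side helper that takes its column count entirely from the
current LC_CTYPE locale. The name display_width and the fixed 256-character
buffer are assumptions for this example; it is not code from rxvt-unicode.

```c
#define _XOPEN_SOURCE 700   /* exposes wcswidth(); must precede all includes */
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: terminal columns needed for the multibyte string s,
   as defined by the current LC_CTYPE locale. Returns -1 if s is not valid
   in that locale, is too long for the buffer, or contains a non-printable
   character. */
static int display_width(const char *s)
{
    wchar_t buf[256];
    const size_t cap = sizeof buf / sizeof buf[0];
    size_t n;

    assert(s != NULL);
    n = mbstowcs(buf, s, cap);
    if (n == (size_t)-1 || n == cap)
        return -1;              /* invalid sequence or string too long */
    return wcswidth(buf, n);    /* widths come from the locale's charmap */
}
```

A program would call setlocale(LC_CTYPE, "") once at startup; any width
change made in the locale definition would then reach it without
program-specific options.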


* [Bug libc/4335] EastAsianAmbiguous character width is always 1 in UTF-8 locale
  2007-04-08 12:18 [Bug libc/4335] New: EastAsianAmbiguous character width is always 1 in UTF-8 locale d+bugzilla at vdr dot jp
                   ` (3 preceding siblings ...)
  2008-11-25 17:28 ` d+bugzilla at vdr dot jp
@ 2009-02-28  7:38 ` d+bugzilla at vdr dot jp
  4 siblings, 0 replies; 6+ messages in thread
From: d+bugzilla at vdr dot jp @ 2009-02-28  7:38 UTC (permalink / raw)
  To: glibc-bugs


------- Additional Comments From d+bugzilla at vdr dot jp  2009-02-28 07:38 -------
For now, each application has to implement its own approach
to the EastAsianAmbiguous character width issue,
for example its own implementation or Markus Kuhn's wcwidth()
(http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c).

If glibc's current wcwidth() implementation and locale definitions
cannot be extended, could glibc offer a common method
for this issue?
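
As a sketch of the per-application approach described above, the following
adjusts a locale-supplied width using a small interval table and a binary
search. The names cjk_width and is_ambiguous are assumptions for this example,
and the table is only an excerpt covering a few of the Latin-1 code points
that the earlier test output shows changing with the modified locale; a real
table would be generated from EastAsianWidth.txt.

```c
#include <assert.h>
#include <stddef.h>

struct range { wchar_t first, last; };

/* Excerpt of East Asian Ambiguous ranges, sorted by code point.
   Illustration only, far from complete. */
static const struct range ambiguous[] = {
    { 0x00A1, 0x00A1 }, { 0x00A4, 0x00A4 }, { 0x00A7, 0x00A8 },
    { 0x00B0, 0x00B4 }, { 0x00D7, 0x00D8 }, { 0x00F7, 0x00FA }
};

/* Binary search over the sorted, non-overlapping ranges above. */
static int is_ambiguous(wchar_t wc)
{
    size_t lo = 0, hi = sizeof ambiguous / sizeof ambiguous[0];

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (wc > ambiguous[mid].last)
            lo = mid + 1;
        else if (wc < ambiguous[mid].first)
            hi = mid;
        else
            return 1;
    }
    return 0;
}

/* Hypothetical wrapper: locale_width is what wcwidth(wc) returned.
   Only a width of 1 is ever widened, so non-printable (-1), zero-width (0)
   and already wide (2) characters pass through unchanged. */
static int cjk_width(wchar_t wc, int locale_width, int ambiguous_is_wide)
{
    assert(locale_width >= -1);
    if (ambiguous_is_wide && locale_width == 1 && is_ambiguous(wc))
        return 2;
    return locale_width;
}
```

A caller would write cjk_width(wc, wcwidth(wc), flag), where flag comes from
a user option comparable to vim's ambiwidth=double or xterm's -cjk_width.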

