public inbox for glibc-bugs@sourceware.org
* [Bug libc/4335] New: EastAsianAmbiguous character width is always 1 in UTF-8 locale
@ 2007-04-08 12:18 d+bugzilla at vdr dot jp
2007-06-02 23:43 ` [Bug libc/4335] " bruno at clisp dot org
` (4 more replies)
0 siblings, 5 replies; 6+ messages in thread
From: d+bugzilla at vdr dot jp @ 2007-04-08 12:18 UTC (permalink / raw)
To: glibc-bugs
According to /usr/share/i18n/charmaps/UTF-8.gz,
the character width is 1 by default; W (Wide) and F (Fullwidth) characters have width 2.
% Character width according to Unicode 3.2.
% - Default width is 1.
% - Double-width characters have width 2; generated from
% "grep '^[^;]*;[WF]' EastAsianWidth.txt"
% and "grep '^[^;]*;[^WF]' EastAsianWidth.txt"
% - Non-spacing characters have width 0; generated from PropList.txt or
% "grep '^[^;]*;[^;]*;[^;]*;[^;]*;NSM;' UnicodeData.txt"
% - Format control characters have width 0; generated from
% "grep '^[^;]*;[^;]*;Cf;' UnicodeData.txt"
% - Zero width characters have width 0; generated from
% "grep '^[^;]*;ZERO WIDTH ' UnicodeData.txt"
A (Ambiguous) characters are expected to have a context-sensitive width,
but their width is always 1 regardless of context.
According to http://www.unicode.org/reports/tr11/#Recommendations:
> When mapping Unicode to East Asian legacy character encodings
>
> * Wide Unicode characters always map to fullwidth characters.
> * Narrow (and neutral) Unicode characters always map to halfwidth characters.
> * Halfwidth Unicode characters always map to halfwidth characters.
> * Ambiguous Unicode characters always map to fullwidth characters.
I think the EastAsianAmbiguous character width should be 2 in CJK UTF-8 locales.
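To check a single character's width, a helper along the following lines may be useful. It is only a minimal sketch: the names `width_in_locale` and `WIDTH_NO_LOCALE` are hypothetical and not part of glibc, and it assumes the locale being queried is installed on the system.

```c
#define _XOPEN_SOURCE 700   /* needed for the wcwidth() declaration */
#include <assert.h>
#include <locale.h>
#include <wchar.h>

/* Hypothetical sentinel for this sketch: the requested locale is not
   installed. wcwidth() itself only returns -1 for non-printable input. */
#define WIDTH_NO_LOCALE (-2)

/* Return wcwidth(wc) as reported under the LC_CTYPE locale `name`,
   or WIDTH_NO_LOCALE if that locale cannot be selected.
   The process-wide LC_CTYPE setting is changed as a side effect. */
int width_in_locale(const char *name, wchar_t wc)
{
    assert(name != NULL);
    if (setlocale(LC_CTYPE, name) == NULL)
        return WIDTH_NO_LOCALE;
    return wcwidth(wc);
}
```

Calling `width_in_locale("ja_JP.UTF-8", 0x00D7)` asks for the width of U+00D7 (MULTIPLICATION SIGN), an East Asian Ambiguous character; with the charmap described in this report the expected answer is 1.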
--
Summary: EastAsianAmbiguous character width is always 1 in UTF-8
locale
Product: glibc
Version: unspecified
Status: NEW
Severity: normal
Priority: P2
Component: libc
AssignedTo: drepper at redhat dot com
ReportedBy: d+bugzilla at vdr dot jp
CC: glibc-bugs at sources dot redhat dot com
http://sourceware.org/bugzilla/show_bug.cgi?id=4335
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
* [Bug libc/4335] EastAsianAmbiguous character width is always 1 in UTF-8 locale
From: bruno at clisp dot org @ 2007-06-02 23:43 UTC (permalink / raw)
To: glibc-bugs
------- Additional Comments From bruno at clisp dot org 2007-06-02 23:43 -------
The "character width" is mostly useful when dealing with cell-based
terminal emulators.
IMO it makes no sense to make such a change in glibc (i.e. to create an
alternative charmap UTF-8-CJK and to build locales like ja_JP.UTF-8 against
it) in isolation. What needs to be considered is the majority of the terminal
emulators; see for example the list at
http://packages.debian.org/stable/virtual/x-terminal-emulator
If you change the most important among these terminal emulators to choose
their font configuration according to the locale, in such a way that in CJK
locales the Ambiguous Width characters have width 2, and in other locales they
have width 1, _then_ IMO the change also makes sense in glibc.
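The locale-dependent choice described in this comment could look roughly like the sketch below. It is an illustration, not code from any terminal emulator: the function name `ambiguous_width_for_locale` is hypothetical, and the policy (width 2 for the language codes `ja`, `ko` and `zh`, width 1 for everything else) is an assumption made for the example.

```c
#include <assert.h>
#include <string.h>

/* Return the cell width to use for East Asian Ambiguous characters,
   given a locale name such as "ja_JP.UTF-8".
   Assumed policy for this sketch: 2 for Japanese, Korean and Chinese
   locales, 1 for all other locales. */
int ambiguous_width_for_locale(const char *name)
{
    static const char *const cjk_languages[] = { "ja", "ko", "zh" };
    size_t i;

    assert(name != NULL);
    for (i = 0; i < sizeof cjk_languages / sizeof cjk_languages[0]; i++) {
        const char *lang = cjk_languages[i];
        size_t len = strlen(lang);

        /* Accept the language code only when it is followed by the end
           of the name or by a territory, codeset or modifier separator. */
        if (strncmp(name, lang, len) == 0 &&
            (name[len] == '\0' || name[len] == '_' ||
             name[len] == '.' || name[len] == '@'))
            return 2;
    }
    return 1;
}
```

An application would typically pass the name returned by `setlocale(LC_CTYPE, NULL)` after selecting the user's locale.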
* [Bug libc/4335] EastAsianAmbiguous character width is always 1 in UTF-8 locale
From: d+bugzilla at vdr dot jp @ 2007-06-10 13:05 UTC (permalink / raw)
To: glibc-bugs
------- Additional Comments From d+bugzilla at vdr dot jp 2007-06-10 13:05 -------
I created a UTF-8-CJK charmap (EastAsianAmbiguous character width 2) and built
ja_JP.UTF-8 against it.
Then I tested the terminal emulators in Debian's x-terminal-emulator list.
The terminal emulators that can handle UTF-8 work well and choose fonts
correctly.
(I left the terminal emulators that cannot handle UTF-8 out of consideration.)
works well:
  gnome-terminal
  konsole
  mlterm (mlterm-tiny)
  rxvt (rxvt-ml)
  rxvt-beta
  rxvt-unicode (rxvt-unicode-ml, rxvt-unicode-lite)
  tilda
  xfce4-terminal
  xterm
does not handle UTF-8:
  aterm (aterm-ml)
  eterm
  kterm
  mrxvt (mrxvt-cjk, mrxvt-mini)
  multi-gnome-terminal
  wterm (wterm-ml)
does not handle ja_JP.eucJP:
  hanterm-xf
  powershell
  pterm
  terminal.app
  xvt
* [Bug libc/4335] EastAsianAmbiguous character width is always 1 in UTF-8 locale
From: d+bugzilla at vdr dot jp @ 2007-11-27 16:04 UTC (permalink / raw)
To: glibc-bugs
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 6272 bytes --]
------- Additional Comments From d+bugzilla at vdr dot jp 2007-11-27 16:04 -------
Any progress?
It is still present in glibc 2.7 (Debian).
% /lib/libc.so.6
GNU C Library stable release version 2.7, by Roland McGrath et al.
Copyright (C) 2007 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 4.2.3 20071123 (prerelease) (Debian 4.2.2-4).
Compiled on a Linux >>2.6.22.12<< system on 2007-11-26.
Available extensions:
crypt add-on version 2.1 by Michael Glad and others
GNU Libidn by Simon Josefsson
Native POSIX Threads Library by Ulrich Drepper et al
BIND-8.2.3-T5B
For bug reporting instructions, please see:
<http://www.gnu.org/software/libc/bugs.html>.
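% cat test.c
/* Request wcwidth() through the public feature-test macro;
   it must be defined before any header is included. */
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <locale.h>
#include <wchar.h>

int main( void ) {
    wchar_t i;
    int euc, utf8;    /* wcwidth() returns int */

    for( i = 0x00; i < 0x100; i++ ) {
        /* Width of code point i under the EUC-JP locale. */
        setlocale( LC_CTYPE, "ja_JP.eucJP" );
        euc = wcwidth( i );
        /* Width of the same code point under the UTF-8 locale. */
        setlocale( LC_CTYPE, "ja_JP.UTF-8" );
        utf8 = wcwidth( i );
        /* Report printable characters whose widths disagree. */
        if( euc > 0 && euc != utf8 ) {
            fprintf( stdout, "%02x : %d : %d : [%lc]\n",
                     (unsigned int) i, euc, utf8, (wint_t) i );
        }
    }
    return 0;
}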
Using default UTF-8 locale:
% ./a.out
a1 : 2 : 1 : [¡]
a2 : 2 : 1 : [¢]
a3 : 2 : 1 : [£]
a4 : 2 : 1 : [¤]
a6 : 2 : 1 : [¦]
a7 : 2 : 1 : [§]
a8 : 2 : 1 : [¨]
a9 : 2 : 1 : [©]
aa : 2 : 1 : [ª]
ac : 2 : 1 : [¬]
ae : 2 : 1 : [®]
af : 2 : 1 : [¯]
b0 : 2 : 1 : [°]
b1 : 2 : 1 : [±]
b4 : 2 : 1 : [´]
b6 : 2 : 1 : [¶]
b8 : 2 : 1 : [¸]
ba : 2 : 1 : [º]
bf : 2 : 1 : [¿]
c0 : 2 : 1 : [À]
c1 : 2 : 1 : [Á]
c2 : 2 : 1 : [Â]
c3 : 2 : 1 : [Ã]
c4 : 2 : 1 : [Ä]
c5 : 2 : 1 : [Å]
c6 : 2 : 1 : [Æ]
c7 : 2 : 1 : [Ç]
c8 : 2 : 1 : [È]
c9 : 2 : 1 : [É]
ca : 2 : 1 : [Ê]
cb : 2 : 1 : [Ë]
cc : 2 : 1 : [Ì]
cd : 2 : 1 : [Í]
ce : 2 : 1 : [Î]
cf : 2 : 1 : [Ï]
d1 : 2 : 1 : [Ñ]
d2 : 2 : 1 : [Ò]
d3 : 2 : 1 : [Ó]
d4 : 2 : 1 : [Ô]
d5 : 2 : 1 : [Õ]
d6 : 2 : 1 : [Ö]
d7 : 2 : 1 : [×]
d8 : 2 : 1 : [Ø]
d9 : 2 : 1 : [Ù]
da : 2 : 1 : [Ú]
db : 2 : 1 : [Û]
dc : 2 : 1 : [Ü]
dd : 2 : 1 : [Ý]
de : 2 : 1 : [Þ]
df : 2 : 1 : [ß]
e0 : 2 : 1 : [à]
e1 : 2 : 1 : [á]
e2 : 2 : 1 : [â]
e3 : 2 : 1 : [ã]
e4 : 2 : 1 : [ä]
e5 : 2 : 1 : [å]
e6 : 2 : 1 : [æ]
e7 : 2 : 1 : [ç]
e8 : 2 : 1 : [è]
e9 : 2 : 1 : [é]
ea : 2 : 1 : [ê]
eb : 2 : 1 : [ë]
ec : 2 : 1 : [ì]
ed : 2 : 1 : [í]
ee : 2 : 1 : [î]
ef : 2 : 1 : [ï]
f0 : 2 : 1 : [ð]
f1 : 2 : 1 : [ñ]
f2 : 2 : 1 : [ò]
f3 : 2 : 1 : [ó]
f4 : 2 : 1 : [ô]
f5 : 2 : 1 : [õ]
f6 : 2 : 1 : [ö]
f7 : 2 : 1 : [÷]
f8 : 2 : 1 : [ø]
f9 : 2 : 1 : [ù]
fa : 2 : 1 : [ú]
fb : 2 : 1 : [û]
fc : 2 : 1 : [ü]
fd : 2 : 1 : [ý]
fe : 2 : 1 : [þ]
ff : 2 : 1 : [ÿ]
Using modified (EastAsianAmbiguous character width == 2,
according to EastAsianWidth-5.0.0.txt) UTF-8 locale:
% ./a.out
a2 : 2 : 1 : [¢]
a3 : 2 : 1 : [£]
a6 : 2 : 1 : [¦]
a9 : 2 : 1 : [©]
ac : 2 : 1 : [¬]
af : 2 : 1 : [¯]
c0 : 2 : 1 : [À]
c1 : 2 : 1 : [Á]
c2 : 2 : 1 : [Â]
c3 : 2 : 1 : [Ã]
c4 : 2 : 1 : [Ä]
c5 : 2 : 1 : [Å]
c7 : 2 : 1 : [Ç]
c8 : 2 : 1 : [È]
c9 : 2 : 1 : [É]
ca : 2 : 1 : [Ê]
cb : 2 : 1 : [Ë]
cc : 2 : 1 : [Ì]
cd : 2 : 1 : [Í]
ce : 2 : 1 : [Î]
cf : 2 : 1 : [Ï]
d1 : 2 : 1 : [Ñ]
d2 : 2 : 1 : [Ò]
d3 : 2 : 1 : [Ó]
d4 : 2 : 1 : [Ô]
d5 : 2 : 1 : [Õ]
d6 : 2 : 1 : [Ö]
d9 : 2 : 1 : [Ù]
da : 2 : 1 : [Ú]
db : 2 : 1 : [Û]
dc : 2 : 1 : [Ü]
dd : 2 : 1 : [Ý]
e2 : 2 : 1 : [â]
e3 : 2 : 1 : [ã]
e4 : 2 : 1 : [ä]
e5 : 2 : 1 : [å]
e7 : 2 : 1 : [ç]
eb : 2 : 1 : [ë]
ee : 2 : 1 : [î]
ef : 2 : 1 : [ï]
f1 : 2 : 1 : [ñ]
f4 : 2 : 1 : [ô]
f5 : 2 : 1 : [õ]
f6 : 2 : 1 : [ö]
fb : 2 : 1 : [û]
fd : 2 : 1 : [ý]
ff : 2 : 1 : [ÿ]
% diff -u utf8-cjk-default utf8-cjk-modified
--- utf8-cjk-default 2007-11-28 01:03:07.000000000 +0900
+++ utf8-cjk-modified 2007-11-28 01:02:55.000000000 +0900
@@ -1,29 +1,15 @@
-a1 : 2 : 1 : [¡]
 a2 : 2 : 1 : [¢]
 a3 : 2 : 1 : [£]
-a4 : 2 : 1 : [¤]
 a6 : 2 : 1 : [¦]
-a7 : 2 : 1 : [§]
-a8 : 2 : 1 : [¨]
 a9 : 2 : 1 : [©]
-aa : 2 : 1 : [ª]
 ac : 2 : 1 : [¬]
-ae : 2 : 1 : [®]
 af : 2 : 1 : [¯]
-b0 : 2 : 1 : [°]
-b1 : 2 : 1 : [±]
-b4 : 2 : 1 : [´]
-b6 : 2 : 1 : [¶]
-b8 : 2 : 1 : [¸]
-ba : 2 : 1 : [º]
-bf : 2 : 1 : [¿]
 c0 : 2 : 1 : [À]
 c1 : 2 : 1 : [Á]
 c2 : 2 : 1 : [Â]
 c3 : 2 : 1 : [Ã]
 c4 : 2 : 1 : [Ä]
 c5 : 2 : 1 : [Å]
-c6 : 2 : 1 : [Æ]
 c7 : 2 : 1 : [Ç]
 c8 : 2 : 1 : [È]
 c9 : 2 : 1 : [É]
@@ -39,44 +25,23 @@
 d4 : 2 : 1 : [Ô]
 d5 : 2 : 1 : [Õ]
 d6 : 2 : 1 : [Ö]
-d7 : 2 : 1 : [×]
-d8 : 2 : 1 : [Ø]
 d9 : 2 : 1 : [Ù]
 da : 2 : 1 : [Ú]
 db : 2 : 1 : [Û]
 dc : 2 : 1 : [Ü]
 dd : 2 : 1 : [Ý]
-de : 2 : 1 : [Þ]
-df : 2 : 1 : [ß]
-e0 : 2 : 1 : [à]
-e1 : 2 : 1 : [á]
 e2 : 2 : 1 : [â]
 e3 : 2 : 1 : [ã]
 e4 : 2 : 1 : [ä]
 e5 : 2 : 1 : [å]
-e6 : 2 : 1 : [æ]
 e7 : 2 : 1 : [ç]
-e8 : 2 : 1 : [è]
-e9 : 2 : 1 : [é]
-ea : 2 : 1 : [ê]
 eb : 2 : 1 : [ë]
-ec : 2 : 1 : [ì]
-ed : 2 : 1 : [í]
 ee : 2 : 1 : [î]
 ef : 2 : 1 : [ï]
-f0 : 2 : 1 : [ð]
 f1 : 2 : 1 : [ñ]
-f2 : 2 : 1 : [ò]
-f3 : 2 : 1 : [ó]
 f4 : 2 : 1 : [ô]
 f5 : 2 : 1 : [õ]
 f6 : 2 : 1 : [ö]
-f7 : 2 : 1 : [÷]
-f8 : 2 : 1 : [ø]
-f9 : 2 : 1 : [ù]
-fa : 2 : 1 : [ú]
 fb : 2 : 1 : [û]
-fc : 2 : 1 : [ü]
 fd : 2 : 1 : [ý]
-fe : 2 : 1 : [þ]
 ff : 2 : 1 : [ÿ]
* [Bug libc/4335] EastAsianAmbiguous character width is always 1 in UTF-8 locale
From: d+bugzilla at vdr dot jp @ 2008-11-25 17:28 UTC (permalink / raw)
To: glibc-bugs
------- Additional Comments From d+bugzilla at vdr dot jp 2008-11-25 17:27 -------
Here is the rxvt-unicode author's opinion:
http://lists.schmorp.de/pipermail/rxvt-unicode/2007q1/000402.html
> > > > ja_JP.eucJP locale is fixed by src/rxvt.h r1.265.
> > > > But ja_JP.UTF-8 locale is still weird.
> > >
> > > No, its correct, thats what the locale specified.
> >
> > Do you mean that ja_JP.UTF-8 locale specifies
> > "0xd7" (EastAsianAmbiguous) is HALFWIDTH and
> > rxvt-unicode simply respects it?
>
> Basically, yes. At least that is how it *should* be: urxvt always respects
> your locale, as should all other programs do too. If your locale says
> something and urxvt doesn't follow that, that is considered a bug in
> urxvt.
>
> > > > Do you plan to merge doc/solaris9.patch?
> > >
> > > No, thats an ugly hack around solaris being broken.
> >
> > Uh, I mean mk_wcwidth() that is a part of doc/solaris9.patch.
> > mk_wcwidth() variant with configurable option is imported into vim,
> > xterm and so on.
>
> Yes, they are all buggy as long as they use that.
>
> > Yes, rxvt-unicode respects that locale tells.
> > But vim, xterm, etc have option that gives EastAsianAmbiguous
> > special treatment that EastAsianAmbiguous char width is 2.
> > vim has ambiwidth=double option, xterm has -cjk_width option.
>
> Yes, I know. But its stupid to add such hacks to each and every program
> and force the user to enable them. The right way is to use or modify the
> locale, then suddenly all well-written programs with or without such hacks
> just magically work.
>
> Ignoring the locale is just wrong. It leads to interoperability
> problems between programs that simply wouldn't exist if everybody just
> respected the locale instead of relying on their own hacks.
>
> The only justification for adding hacks is for systems that do not support
> required locales (such as one providing utf-8), but those systems either
> die or get upgraded, so the time is much better spent improving the locale
> system on those rare systems rather than adding hacks to each and every
> program.
>
> > Do you mean locale is wrong/broken then programs do not need to
>
> If the locale specifies a character width that you do not want, then the
> locale is pretty much broken from your perspective, isn't it? At least its
> not the locale you want.
>
> > Do I need to ask not rxvt-unicode but glibc?
>
> I think glibc (or any software distribution either using it or something
> else) should provide the means to configure it regarding such details such
> as character width, at least for commonly wanted cases such as east asian
> widths.
>
> I am open to reasoning against my arguments, but to change my mind one
> would have to overcome the arguments above. It just plain makes no
> sense to hack each and every program on the world to workaround locale
> limitations: there are far more editors and terminals around than libcs.
* [Bug libc/4335] EastAsianAmbiguous character width is always 1 in UTF-8 locale
From: d+bugzilla at vdr dot jp @ 2009-02-28 7:38 UTC (permalink / raw)
To: glibc-bugs
------- Additional Comments From d+bugzilla at vdr dot jp 2009-02-28 07:38 -------
At present, each application has to implement its own approach to the
EastAsianAmbiguous character width issue, for example its own
implementation or Markus Kuhn's wcwidth()
(http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c).
If glibc's current wcwidth() implementation and locale definitions
cannot be extended, could glibc offer a common method
for handling this issue?
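For illustration, an application-side workaround of the kind mentioned here might be sketched as follows. This is not Markus Kuhn's code and not a glibc interface: the names `app_wcwidth` and `is_ambiguous` are hypothetical, and the table is only an assumed subset, covering the Latin-1 code points whose width changed in the diff posted earlier in this thread. A real implementation would generate the full table from EastAsianWidth.txt.

```c
#define _XOPEN_SOURCE 700   /* needed for the wcwidth() declaration */
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

struct range { wchar_t first, last; };

/* Illustrative subset only: Latin-1 code points reported as changing
   width in the diff earlier in this thread. */
static const struct range ambiguous[] = {
    { 0xA1, 0xA1 }, { 0xA4, 0xA4 }, { 0xA7, 0xA8 }, { 0xAA, 0xAA },
    { 0xAE, 0xAE }, { 0xB0, 0xB1 }, { 0xB4, 0xB4 }, { 0xB6, 0xB6 },
    { 0xB8, 0xB8 }, { 0xBA, 0xBA }, { 0xBF, 0xBF }, { 0xC6, 0xC6 },
    { 0xD7, 0xD8 }, { 0xDE, 0xE1 }, { 0xE6, 0xE6 }, { 0xE8, 0xEA },
    { 0xEC, 0xED }, { 0xF0, 0xF0 }, { 0xF2, 0xF3 }, { 0xF7, 0xFA },
    { 0xFC, 0xFC }, { 0xFE, 0xFE }
};

/* Return 1 if wc falls inside one of the ranges listed above. */
static int is_ambiguous(wchar_t wc)
{
    size_t i;

    for (i = 0; i < sizeof ambiguous / sizeof ambiguous[0]; i++)
        if (wc >= ambiguous[i].first && wc <= ambiguous[i].last)
            return 1;
    return 0;
}

/* Like wcwidth(), except that characters in the table get the
   caller-chosen width (1 or 2) instead of the locale's value. */
int app_wcwidth(wchar_t wc, int ambiguous_width)
{
    assert(ambiguous_width == 1 || ambiguous_width == 2);
    if (is_ambiguous(wc))
        return ambiguous_width;
    return wcwidth(wc);
}
```

Every program that wants the setting has to carry such a table and expose its own switch, which is the duplication this comment asks glibc to remove.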