* [Fwd: [1.7] wcwidth failing configure tests]
@ 2009-05-12 16:54 Corinna Vinschen
2009-05-12 16:56 ` Andy Koppe
0 siblings, 1 reply; 36+ messages in thread
From: Corinna Vinschen @ 2009-05-12 16:54 UTC (permalink / raw)
To: newlib; +Cc: cygwin
Forwarded to newlib.
----- Forwarded message from Eric Blake -----
> Date: Tue, 12 May 2009 16:02:04 +0000 (UTC)
> From: Eric Blake
> Subject: [1.7] wcwidth failing configure tests
> To: cygwin AT cygwin DOT com
>
> I noticed this failure in various configure scripts (findutils, coreutils, ...):
>
> checking whether wcwidth works reasonably in UTF-8 locales... no
>
> I've reduced it to a STC:
>
> #include <locale.h>
> #include <wchar.h>
> int main ()
> {
> int i = 0;
> if (setlocale (LC_ALL, "fr_FR.UTF-8") != NULL)
> {
> if (wcwidth (0x0301) > 0)
> i |= 1;
> if (wcwidth (0x200B) > 0)
> i |= 2;
> }
> return i;
> }
>
> The return value should be 0 but is coming back as 3; 0x0301 is a combining
> mark which should occupy no space on its own, and 0x200b is a 0-width space,
> according to Unicode 5.1 (and earlier, to some extent). And that probably
> means that other places within wcwidth() are broken.
----- End forwarded message -----
wcwidth returns 1 if iswprint returns true. A quick debugging session
showed that the entire range 0x0300..0x034f is marked as printable in
the u3 array in libc/ctype/utf8print.h. The characters in that range
are combining characters, which are printable but have zero width.
0x200b..0x200d are likewise zero-width characters which are also
printable.
Scanning the Unicode 5.1 standard, I see a couple of these characters,
which are printable but have zero width:
0300..036f
0483..0489
200b..200f
20d0..20ea
3099..309a
fe20..fe23 (not sure about them. Each of them is the half of a full combined
char which doesn't make sense alone, afaics)
feff
and a couple of musical symbols in the 0x1d1xx range
How can we fix this problem? Should we hardcode a check for the above
character values in wcwidth?
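As a rough illustration of the hardcoding idea, a sorted range table with a binary search could look like the sketch below. It is only a sketch under stated assumptions: the names zero_width, is_zero_width and sketch_wcwidth are made up for this example and are not newlib code, the table holds just the ranges listed above (without the musical symbols), and the printable flag stands in for whatever iswprint would report.

```c
#include <stddef.h>

/* Hypothetical table: only the zero-width ranges listed in the mail above.
   It is not a complete Unicode 5.1 list and must stay sorted. */
struct range { unsigned long first, last; };

const struct range zero_width[] = {
  { 0x0300, 0x036f }, { 0x0483, 0x0489 }, { 0x200b, 0x200f },
  { 0x20d0, 0x20ea }, { 0x3099, 0x309a }, { 0xfe20, 0xfe23 },
  { 0xfeff, 0xfeff }
};

/* Binary search over the sorted table; returns 1 if c is in any range. */
int is_zero_width (unsigned long c)
{
  size_t lo = 0, hi = sizeof zero_width / sizeof zero_width[0];

  while (lo < hi)
    {
      size_t mid = lo + (hi - lo) / 2;
      if (c < zero_width[mid].first)
        hi = mid;
      else if (c > zero_width[mid].last)
        lo = mid + 1;
      else
        return 1;
    }
  return 0;
}

/* Hypothetical wrapper: check the table first, then fall back on the
   "printable means width 1" rule.  `printable` is the caller's
   iswprint result. */
int sketch_wcwidth (unsigned long c, int printable)
{
  if (c == 0 || is_zero_width (c))
    return 0;
  return printable ? 1 : -1;
}
```

With such a table, both characters from the test case (0x0301 and 0x200B) would get width 0 even though they are marked printable.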
And here's another question. The utf8*.h files claim they have been
generated from the unicode.txt file of the Unicode 3.2 standard. Do we
have the script which generated the utf8*.h files? Can we regenerate
the files to match the current Unicode 5.1 standard?
Corinna
--
Corinna Vinschen Please, send mails regarding Cygwin to
Cygwin Project Co-Leader cygwin AT cygwin DOT com
Red Hat
--
Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple
Problem reports: http://cygwin.com/problems.html
Documentation: http://cygwin.com/docs.html
FAQ: http://cygwin.com/faq/
* Re: [Fwd: [1.7] wcwidth failing configure tests]
2009-05-12 16:54 [Fwd: [1.7] wcwidth failing configure tests] Corinna Vinschen
@ 2009-05-12 16:56 ` Andy Koppe
2009-05-12 17:32 ` Corinna Vinschen
0 siblings, 1 reply; 36+ messages in thread
From: Andy Koppe @ 2009-05-12 16:56 UTC (permalink / raw)
To: newlib, cygwin
> And here's another question. The utf8*.h files claim they have been
> generated from the unicode.txt file of the Unicode 3.2 standard. Do we
> have the script which generated the utf8*.h files? Can we regenerate
> the files to match the current Unicode 5.1 standard?
There's Markus Kuhn's wcwidth implementation, which says it's based on
Unicode 5.0:
http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
Trouble is, there's the thorny issue of the "CJK Ambiguous Width"
category of characters, which consists of things like Greek and
Cyrillic letters as well as line drawing symbols. Those have a width
of 1 in Western use, yet with CJK fonts they have a width of 2. That's
why Markus Kuhn's code includes the mk_wcswidth_cjk() variant.
Andy
* Re: [Fwd: [1.7] wcwidth failing configure tests]
2009-05-12 16:56 ` Andy Koppe
@ 2009-05-12 17:32 ` Corinna Vinschen
2009-05-13 19:04 ` Andy Koppe
2009-05-14 15:58 ` IWAMURO Motonori
0 siblings, 2 replies; 36+ messages in thread
From: Corinna Vinschen @ 2009-05-12 17:32 UTC (permalink / raw)
To: newlib, cygwin
On May 12 17:56, Andy Koppe wrote:
> > And here's another question. The utf8*.h files claim they have been
> > generated from the unicode.txt file of the Unicode 3.2 standard. Do we
> > have the script which generated the utf8*.h files? Can we regenerate
> > the files to match the current Unicode 5.1 standard?
>
> There's Markus Kuhn's wcwidth implementation, which says it's based on
> Unicode 5.0:
>
> http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
This looks nice.
> Trouble is, there's the thorny issue of the "CJK Ambiguous Width"
> category of characters, which consists of things like Greek and
> Cyrillic letters as well as line drawing symbols. Those have a width
> of 1 in Western use, yet with CJK fonts they have a width of 2. That's
> why Markus Kuhn's code includes the mk_wcswidth_cjk() variant.
We should use the standard variation alone, imho.
And we need some workaround for UTF-16 systems like Cygwin.
Unfortunately, surrogate pairs only work well as part of a string, not
as standalone chars. So wcwidth would return -1 for each single char,
but wcswidth could be tweaked to handle them gracefully.
Corinna
--
Corinna Vinschen
Cygwin Project Co-Leader
Red Hat
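The wcswidth tweak mentioned above might be sketched roughly as follows. Everything here is an assumption for illustration: utf16_width and demo_width are hypothetical names, the per-code-point width function is passed in by the caller, and demo_width only knows two zero-width ranges so that the example stays small.

```c
#include <stddef.h>

/* Hypothetical stand-in for a real per-code-point width lookup:
   zero width for 0x0300..0x036f and 0x10A01..0x10A03, width 1 otherwise. */
int demo_width (unsigned long cp)
{
  if ((cp >= 0x0300 && cp <= 0x036f) || (cp >= 0x10A01 && cp <= 0x10A03))
    return 0;
  return 1;
}

/* Sum the column widths of up to n UTF-16 units, pairing surrogates
   before asking `width`.  Returns -1 on an unpaired surrogate or when
   `width` reports -1 for a character. */
int utf16_width (const unsigned short *s, size_t n,
                 int (*width) (unsigned long))
{
  int total = 0;
  size_t i = 0;

  while (i < n && s[i] != 0)
    {
      unsigned long cp = s[i++];
      int w;

      if (cp >= 0xD800 && cp <= 0xDBFF)
        {
          /* High surrogate: a low surrogate must follow. */
          if (i >= n || s[i] < 0xDC00 || s[i] > 0xDFFF)
            return -1;
          cp = 0x10000UL + ((cp - 0xD800) << 10) + (s[i++] - 0xDC00);
        }
      else if (cp >= 0xDC00 && cp <= 0xDFFF)
        return -1;              /* low surrogate without a high one */

      w = width (cp);
      if (w < 0)
        return -1;
      total += w;
    }
  return total;
}
```

A lone surrogate makes the whole string invalid here, which mirrors the idea that wcwidth can only return -1 for half a pair while the string function can still get the pair right.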
* Re: [Fwd: [1.7] wcwidth failing configure tests]
2009-05-12 17:32 ` Corinna Vinschen
@ 2009-05-13 19:04 ` Andy Koppe
2009-05-13 19:40 ` Corinna Vinschen
2009-05-14 15:58 ` IWAMURO Motonori
1 sibling, 1 reply; 36+ messages in thread
From: Andy Koppe @ 2009-05-13 19:04 UTC (permalink / raw)
To: newlib, cygwin
2009/5/12 Corinna Vinschen:
>> Trouble is, there's the thorny issue of the "CJK Ambiguous Width"
>> category of characters, which consists of things like Greek and
>> Cyrillic letters as well as line drawing symbols. Those have a width
>> of 1 in Western use, yet with CJK fonts they have a width of 2. That's
>> why Markus Kuhn's code includes the mk_wcswidth_cjk() variant.
>
> We should use the standard variation alone, imho.
I'm not sure that CJK users would be happy with that. See MinTTY issue
88 for my misguided attempts to dismiss this as a legacy issue:
http://code.google.com/p/mintty/issues/detail?id=88
In comment 8 on that, "deenheart" mentioned that he was working on a
fix for wcwidth(). I don't know what he had in mind, but I'd suspect
something based on an environment variable setting.
> And we need some workaround for UTF-16 systems like Cygwin.
> Unfortunately, surrogate pairs only work well as part of a string, not
> as standalone chars. So wcwidth would return -1 for each single char,
> but wcswidth could be tweaked to handle them gracefully.
Looking at the ranges in wcwidth.c, it might be possible to decide the
width of a surrogate pair based on the high surrogate only, and then
treat the low surrogate as a combining character with length 0.
Andy
* Re: [Fwd: [1.7] wcwidth failing configure tests]
2009-05-13 19:04 ` Andy Koppe
@ 2009-05-13 19:40 ` Corinna Vinschen
2009-05-13 19:55 ` Andy Koppe
0 siblings, 1 reply; 36+ messages in thread
From: Corinna Vinschen @ 2009-05-13 19:40 UTC (permalink / raw)
To: newlib, cygwin
On May 13 20:04, Andy Koppe wrote:
> 2009/5/12 Corinna Vinschen:
> >> Trouble is, there's the thorny issue of the "CJK Ambiguous Width"
> >> category of characters, which consists of things like Greek and
> >> Cyrillic letters as well as line drawing symbols. Those have a width
> >> of 1 in Western use, yet with CJK fonts they have a width of 2. That's
> >> why Markus Kuhn's code includes the mk_wcswidth_cjk() variant.
> >
> > We should use the standard variation alone, imho.
>
> I'm not sure that CJK users would be happy with that. See MinTTY issue
> 88 for my misguided attempts to dismiss this as a legacy issue:
> http://code.google.com/p/mintty/issues/detail?id=88
>
> In comment 8 on that, "deenheart" mentioned that he was working on a
> fix for wcwidth(). I don't know what he had in mind, but I'd suspect
> something based on an environment variable setting.
>
> > And we need some workaround for UTF-16 systems like Cygwin.
> > Unfortunately, surrogate pairs only work well as part of a string, not
> > as standalone chars. So wcwidth would return -1 for each single char,
> > but wcswidth could be tweaked to handle them gracefully.
>
> Looking at the ranges in wcwidth.c, it might be possible to decide the
> width of a surrogate pair based on the high surrogate only, and then
> treat the low surrogate as a combining character with length 0.
How should that work? The first half of the surrogate pair does not
carry enough information to decide that. For instance, take the ranges
{ 0x10A01, 0x10A03 }, { 0x10A05, 0x10A06 }. The information about the low
10 bits of the Unicode value is in the second half of the pair. From
the first half you don't know if the char is perhaps the 0x10A04 value
or one of the others. So you need both halves to make a decision.
A surrogate pair half alone is also always invalid. That's something
you can't handle in wcwidth.
Corinna
--
Corinna Vinschen
Cygwin Project Co-Leader
Red Hat
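The point about the low 10 bits can be checked with the standard UTF-16 arithmetic. The helper names below (high_surrogate, low_surrogate, combine_surrogates) are invented for this sketch and do not come from wcwidth.c or newlib.

```c
/* High surrogate of a supplementary-plane code point
   (0x10000..0x10FFFF): carries the upper 10 bits of cp - 0x10000. */
unsigned high_surrogate (long cp)
{
  return 0xD800u + (unsigned) ((cp - 0x10000L) >> 10);
}

/* Low surrogate: carries the lower 10 bits of cp - 0x10000. */
unsigned low_surrogate (long cp)
{
  return 0xDC00u + (unsigned) ((cp - 0x10000L) & 0x3FF);
}

/* Combine a surrogate pair into a code point; -1 if the two units do
   not form a valid pair. */
long combine_surrogates (unsigned hi, unsigned lo)
{
  if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
    return -1;
  return 0x10000L + ((long) (hi - 0xD800) << 10) + (long) (lo - 0xDC00);
}
```

For the ranges quoted above, 0x10A03 (inside a combining range) and 0x10A04 (outside it) both have the high surrogate 0xD802 and differ only in the low surrogate (0xDE03 versus 0xDE04), so the first half alone cannot tell them apart.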
* Re: [Fwd: [1.7] wcwidth failing configure tests]
2009-05-13 19:40 ` Corinna Vinschen
@ 2009-05-13 19:55 ` Andy Koppe
0 siblings, 0 replies; 36+ messages in thread
From: Andy Koppe @ 2009-05-13 19:55 UTC (permalink / raw)
To: newlib, cygwin
> How should that work? The first half of the surrogate pair has not
> enough information to decide that. For instance, take the ranges
> { 0x10A01, 0x10A03 }, { 0x10A05, 0x10A06 }. The information about the low
> 10 bits of the Unicode value is in the second half of the pair. From
> the first half you don't know if the char is perhaps the 0x10A04 value
> or one of the other. So you need both halves to make a decision.
You're right. I'd somehow overlooked the end of the combining[] array.
Andy
* Re: [Fwd: [1.7] wcwidth failing configure tests]
2009-05-12 17:32 ` Corinna Vinschen
2009-05-13 19:04 ` Andy Koppe
@ 2009-05-14 15:58 ` IWAMURO Motonori
2009-05-14 17:26 ` Corinna Vinschen
` (2 more replies)
1 sibling, 3 replies; 36+ messages in thread
From: IWAMURO Motonori @ 2009-05-14 15:58 UTC (permalink / raw)
To: newlib, cygwin
2009/5/13 Corinna Vinschen <vinschen@redhat.com>:
>> http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
>
> This looks nice.
Will you import Markus Kuhn's wcwidth implementation?
>> Trouble is, there's the thorny issue of the "CJK Ambiguous Width"
>> category of characters, which consists of things like Greek and
>> Cyrillic letters as well as line drawing symbols. Those have a width
>> of 1 in Western use, yet with CJK fonts they have a width of 2. That's
>> why Markus Kuhn's code includes the mk_wcswidth_cjk() variant.
>
> We should use the standard variation alone, imho.
I don't think so.
1) It is very very inconvenient for me :-)
(Currently, I apply a local patch for CJK width support to cygwin1.dll
in my environment.)
2) Unicode Standard Annex #11
http://www.unicode.org/unicode/reports/tr11/ recommends:
> 5 Recommendations
(snip)
> When processing or displaying data
(snip)
> Ambiguous characters behave like wide or narrow characters depending
> on the context (language tag, script identification, associated
> font, source of data, or explicit markup; all can provide the
> context). If the context cannot be established reliably, they should
> be treated as narrow characters by default.
The recommendation is independent of legacy encoding.
I think that a new locale category that specifies the "context" is
necessary, because the "context" influences only the display or text
layout.
However, there is no such standard now.
Therefore, I propose to use *_cjk() when the language part of LC_CTYPE
is 'ja', 'ko', 'vi' or 'zh'.
--
IWAMURO Motnori <http://vmi.jp/>
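A test on the language part of the locale name, as proposed above, could be sketched like this. The function name locale_language_is_cjk is hypothetical, and the sketch assumes the usual language[_territory][.codeset][@modifier] form of locale names; a caller would typically pass the string returned by setlocale (LC_CTYPE, NULL).

```c
#include <string.h>

/* Return 1 if the language part of a locale name such as "ja_JP.UTF-8"
   is one of ja, ko, vi or zh, otherwise 0.  Hypothetical helper, not
   part of newlib. */
int locale_language_is_cjk (const char *locale)
{
  static const char *const langs[] = { "ja", "ko", "vi", "zh" };
  size_t len, i;

  if (locale == NULL)
    return 0;

  /* The language part ends at '_', '.', '@' or the end of the string. */
  len = strcspn (locale, "_.@");

  for (i = 0; i < sizeof langs / sizeof langs[0]; i++)
    if (len == strlen (langs[i]) && strncmp (locale, langs[i], len) == 0)
      return 1;
  return 0;
}
```

The result would then decide whether the ambiguous-width characters are looked up with the plain or the *_cjk() variant.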
* Re: [Fwd: [1.7] wcwidth failing configure tests]
2009-05-14 15:58 ` IWAMURO Motonori
@ 2009-05-14 17:26 ` Corinna Vinschen
2009-05-14 21:51 ` Jeff Johnston
2009-05-20 16:52 ` Thomas Wolff
2009-05-26 16:46 ` IWAMURO Motonori
2 siblings, 1 reply; 36+ messages in thread
From: Corinna Vinschen @ 2009-05-14 17:26 UTC (permalink / raw)
To: newlib, cygwin
On May 15 00:58, IWAMURO Motonori wrote:
> 2009/5/13 Corinna Vinschen <vinschen@redhat.com>:
> >> http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
> >
> > This looks nice.
>
> Do you import Markus Kuhn's wcwidth implementation?
>
> >> Trouble is, there's the thorny issue of the "CJK Ambiguous Width"
> >> category of characters, which consists of things like Greek and
> >> Cyrillic letters as well as line drawing symbols. Those have a width
> >> of 1 in Western use, yet with CJK fonts they have a width of 2. That's
> >> why Markus Kuhn's code includes the mk_wcswidth_cjk() variant.
> >
> > We should use the standard variation alone, imho.
>
> I don't think so.
>
> 1) It is very very inconvenient for me :-)
> (Now, I apply the local patch of CJK width support to cygwin1.dll in
> my environment.)
>
> 2) Unicode Standard Annex #11
> http://www.unicode.org/unicode/reports/tr11/ recommends:
> > 5 Recommendations
> (snip)
> > When processing or displaying data
> (snip)
> > Ambiguous characters behave like wide or narrow characters depending
> > on the context (language tag, script identification, associated
> > font, source of data, or explicit markup; all can provide the
> > context). If the context cannot be established reliably, they should
> > be treated as narrow characters by default.
>
> The recommendation is independent of legacy encoding.
>
> I think that a new locale category that specifies the "context" is necessary.
> Because the "context" influences only the display or text layout.
>
> However, there is no such standard now.
>
> Therefore, I propose to use *_cjk() when the language part of LC_CTYPE
> is 'ja', 'ko', 'vi' or 'zh'.
That would be fine with me, but tests for the actual language are not
used anywhere in newlib, so that's something very new. Can we check in
my patch for the time being and extend it with the CJK variation later?
I will not be available for the next two weeks, but I'd be glad if at
least the default variation can go in so I can create another Cygwin
test release before I'm offline.
Corinna
--
Corinna Vinschen
Cygwin Project Co-Leader
Red Hat
* Re: [Fwd: [1.7] wcwidth failing configure tests]
2009-05-14 17:26 ` Corinna Vinschen
@ 2009-05-14 21:51 ` Jeff Johnston
2009-05-15 11:43 ` Corinna Vinschen
0 siblings, 1 reply; 36+ messages in thread
From: Jeff Johnston @ 2009-05-14 21:51 UTC (permalink / raw)
To: newlib, cygwin
Corinna Vinschen wrote:
> On May 15 00:58, IWAMURO Motonori wrote:
>
>> 2009/5/13 Corinna Vinschen <vinschen@redhat.com>:
>>
>>>> http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
>>>>
>>> This looks nice.
>>>
>> Do you import Markus Kuhn's wcwidth implementation?
>>
>>
>>>> Trouble is, there's the thorny issue of the "CJK Ambiguous Width"
>>>> category of characters, which consists of things like Greek and
>>>> Cyrillic letters as well as line drawing symbols. Those have a width
>>>> of 1 in Western use, yet with CJK fonts they have a width of 2. That's
>>>> why Markus Kuhn's code includes the mk_wcswidth_cjk() variant.
>>>>
>>> We should use the standard variation alone, imho.
>>>
>> I don't think so.
>>
>> 1) It is very very inconvenient for me :-)
>> (Now, I apply the local patch of CJK width support to cygwin1.dll in
>> my environment.)
>>
>> 2) Unicode Standard Annex #11
>> http://www.unicode.org/unicode/reports/tr11/ recommends:
>>
>>> 5 Recommendations
>>>
>> (snip)
>>
>>> When processing or displaying data
>>>
>> (snip)
>>
>>> Ambiguous characters behave like wide or narrow characters depending
>>> on the context (language tag, script identification, associated
>>> font, source of data, or explicit markup; all can provide the
>>> context). If the context cannot be established reliably, they should
>>> be treated as narrow characters by default.
>>>
>> The recommendation is independent of legacy encoding.
>>
>> I think that a new locale category that specifies the "context" is necessary.
>> Because the "context" influences only the display or text layout.
>>
>> However, there is no such standard now.
>>
>> Therefore, I propose to use *_cjk() when the language part of LC_CTYPE
>> is 'ja', 'ko', 'vi' or 'zh'.
>>
>
> That would be fine with me, but tests for the actual language are not
> used anywhere in newlib, so that's something very new. Can we check in
> my patch for the time being and
> extend it with the CJK variation later? I will not be available for the
> next two weeks, but I'd be glad if at least the default variation can go
> in so I can create another Cygwin test release before I'm offline.
>
>
>
Corinna, I have no problem with checking the new patch in and extending
this later, assuming you have thoroughly tested this implementation.
-- Jeff J.
> Corinna
>
>
* Re: [Fwd: [1.7] wcwidth failing configure tests]
2009-05-14 21:51 ` Jeff Johnston
@ 2009-05-15 11:43 ` Corinna Vinschen
0 siblings, 0 replies; 36+ messages in thread
From: Corinna Vinschen @ 2009-05-15 11:43 UTC (permalink / raw)
To: newlib, cygwin
On May 14 17:51, Jeff Johnston wrote:
> Corinna, I have no problem with checking the new patch in and extending
> this later, assuming you have thoroughly tested this implementation.
I tested it with _MB_CAPABLE defined and with _MB_CAPABLE undefined.
Both variations worked as expected, the latter using the old newlib
implementation based on iswprint/iswcntrl.
Patch applied. I have added the CJK variation to my todo list for
when I'm back from vacation.
Thanks,
Corinna
--
Corinna Vinschen
Cygwin Project Co-Leader
Red Hat
* Re: [Fwd: [1.7] wcwidth failing configure tests]
2009-05-14 15:58 ` IWAMURO Motonori
2009-05-14 17:26 ` Corinna Vinschen
@ 2009-05-20 16:52 ` Thomas Wolff
2009-05-20 19:41 ` IWAMURO Motonori
2009-06-05 16:25 ` Thomas Wolff
2009-05-26 16:46 ` IWAMURO Motonori
2 siblings, 2 replies; 36+ messages in thread
From: Thomas Wolff @ 2009-05-20 16:52 UTC (permalink / raw)
To: newlib, cygwin
Corinna Vinschen wrote:
> On May 12 17:56, Andy Koppe wrote:
> > > And here's another question. The utf8*.h files claim they have been
> > > generated from the unicode.txt file of the Unicode 3.2 standard. Do we
> > > have the script which generated the utf8*.h files? Can we regenerate
> > > the files to match the current Unicode 5.1 standard?
I've updated my editor mined to Unicode 5.1 data already. I can provide
a corresponding wcwidth function if that's desired. I also have scripts
for semi-automatic generation of this information, however "semi" as I
said, to be improved.
> > There's Markus Kuhn's wcwidth implementation, which says it's based on
> > Unicode 5.0:
> >
> > http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
>
> This looks nice.
I'm sure Markus will update to 5.1 one day too...
> > Trouble is, there's the thorny issue of the "CJK Ambiguous Width"
> > category of characters, which consists of things like Greek and
> > Cyrillic letters as well as line drawing symbols. Those have a width
> > of 1 in Western use, yet with CJK fonts they have a width of 2. That's
> > why Markus Kuhn's code includes the mk_wcswidth_cjk() variant.
>
> We should use the standard variation alone, imho.
>
> And we need some workaround for UTF-16 systems like Cygwin.
> Unfortunately, surrogate pairs only work well as part of a string, not
> as standalone chars. So wcwidth would return -1 for each single char,
> but wcswidth could be tweaked to handle them gracefully.
This gets me to the related question of how to output non-BMP
characters; currently, the cygwin console displays them all as two
square boxes, using two screen columns. This indicates that probably
just the single surrogate characters are being output. Could proper
non-BMP character display be achieved by simply combining the
surrogates and outputting them to Windows as a true Unicode character?
(The Windows function would need to be 32 bit, which I don't know; the
string elements could stay as they are.)
Just an idea which might lead to a simple solution.
> On May 15 00:58, IWAMURO Motonori wrote:
> > 2009/5/13 Corinna Vinschen <vinschen@redhat.com>:
> > >> Trouble is, there's the thorny issue of the "CJK Ambiguous Width"
> > >> ... (see above)
> > > We should use the standard variation alone, imho.
> > I don't think so.
> >
> > 1) It is very very inconvenient for me :-)
> >
> > 2) Unicode Standard Annex #11
> > http://www.unicode.org/unicode/reports/tr11/ recommends:
> > > 5 Recommendations
> > (snip)
> > > When processing or displaying data
> > (snip)
> > > Ambiguous characters behave like wide or narrow characters depending
> > > on the context (language tag, script identification, associated
> > > font, source of data, or explicit markup; all can provide the
> > > context). If the context cannot be established reliably, they should
> > > be treated as narrow characters by default.
> >
> > The recommendation is independent of legacy encoding.
> >
> > I think that a new locale category that specifies the "context" is necessary.
> > Because the "context" influences only the display or text layout.
> >
> > However, there is no such standard now.
> >
> > Therefore, I propose to use *_cjk() when the language part of LC_CTYPE
> > is 'ja', 'ko', 'vi' or 'zh'.
The problem with this is
1. As you say, there is no standard.
2. If you wish to handle character widths compliant with the terminal
   your application is running in, there is no guarantee that your
   assumption of CJK width (or the actual locale setting if that model
   would be implemented) does indeed reflect the terminal's width properties.
3. In mintty, you can dynamically change width properties by selecting
   different fonts; mintty changes CJK width behaviour according to certain
   font properties. "static" configuration in your shell using a locale
   variable would not reflect this change.
I see two ways to handle this:
 a) Ask Andy (author of mintty) to not do this switching; however, I
    don't know what display consequences that might have. On the
    other hand, other terminals don't switch either. Or maybe mintty
    could at least issue a warning on CJK width switching, or maintain
    two separate font lists, or...
 b) Determine the actual CJK width behaviour dynamically. That's what
    mined does (in addition to other width property detection in general).
    That's why it can handle the alternative quite seamlessly.
> That would be fine with me, but tests for the actual language are not
> used anywhere in newlib, so that's something very new.
So I would suggest not to introduce it before the concept is
sufficiently discussed. And I'm not happy with the idea of a
cygwin-specific solution (or workaround).
Kind regards,
Thomas
* Re: [Fwd: [1.7] wcwidth failing configure tests]
2009-05-20 16:52 ` Thomas Wolff
@ 2009-05-20 19:41 ` IWAMURO Motonori
2009-06-05 16:25 ` Thomas Wolff
1 sibling, 0 replies; 36+ messages in thread
From: IWAMURO Motonori @ 2009-05-20 19:41 UTC (permalink / raw)
To: newlib, cygwin
2009/5/21 Thomas Wolff <towo@towo.net>:
>> > Therefore, I propose to use *_cjk() when the language part of LC_CTYPE
>> > is 'ja', 'ko', 'vi' or 'zh'.
> The problem with this is
> 1. As you say, there is no standard.
But,
- I think that my proposal doesn't violate any specification.
- I heard that there is an existing implementation that behaves like my
  proposal. (Sorry, I didn't hear the system name.)
> 2. If you wish to handle character widths compliant with the terminal
> your application is running in, there is no guarantee that your
> assumption of CJK width (or the actual locale setting if that model
> would be implemented) does indeed reflect the terminal's width properties.
Yes, I understand that, too. My proposal is purely a workaround.
But it is the best solution because we have no specification/standard
for my wish.
> 3. In mintty, you can dynamically change width properties by selecting
> different fonts; mintty changes CJK width behaviour according to certain
> font properties. "static" configuration in your shell using a locale
> variable would not reflect this change
It is no problem because we -- most Japanese language users -- need
not change the settings of mintty and locale after first setup.
We set LANG=ja_JP.UTF-8 and select a Japanese font for mintty.
> I see two ways to handle this:
> a) Ask Andy (author of mintty) to not do this switching;
It is not necessary because the mechanism is based on another proposal
of mine. ("deenheart" is my handle on google code.)
> other terminals don't switch either.
If we use other terminals, we need to switch the CJK width option
manually. (xterm, mlterm, putty, ...)
> b) Determine the actual CJK width behaviour dynamically. That's what
> mined does (in addition to other width property detection in general).
It is the best solution. I think that we need to specify the following:
- the escape sequence about language context for terminal emulators.
-- setting language context
-- getting language context
-- getting capability of language context
   (context is fixed, static or dynamic / acceptable languages)
- new multilingualized string/terminal API for terminal based applications.
And, we need to rewrite too many applications for the new API.
> I'm not happy with the idea of a cygwin-specific solution (or workaround).
I think that it is not a cygwin-specific solution.
--
IWAMURO Motnori <http://vmi.jp/>
* Re: [Fwd: [1.7] wcwidth failing configure tests]
2009-05-20 16:52 ` Thomas Wolff
2009-05-20 19:41 ` IWAMURO Motonori
@ 2009-06-05 16:25 ` Thomas Wolff
2009-06-06 7:24 ` Andy Koppe
` (3 more replies)
1 sibling, 4 replies; 36+ messages in thread
From: Thomas Wolff @ 2009-06-05 16:25 UTC (permalink / raw)
To: newlib, cygwin
IWAMURO Motonori wrote:
> 2009/5/21 Thomas Wolff <towo@towo.net>:
> >> > Therefore, I propose to use *_cjk() when the language part of LC_CTYPE
> >> > is 'ja', 'ko', 'vi' or 'zh'.
> > The problem with this is
> > 1. As you say, there is no standard.
> But,
> - I think that my proposal doesn't violate any specification.
I think it does. Part of the locale information is the "charmap" (called
"codepage" on DOS/Windows). It may be implicit like with LC_CTYPE=zh_CN
which defines "GB2312" as its charmap, but it is typically explicit like
in en_US.UTF-8 - the intention is that the "codepage" information should
be the same for all locales having the "UTF-8" (or any other) charmap.
So you cannot freely change width information among locales with the
same charmap.
Also, if ja_JP.UTF-8 would mean "CJK width", how would you specify a
working locale setting for a terminal that does not run a CJK width font
but should yet use other Japanese settings? E.g. with rxvt which does
not support CJK width.
However, there is one resort within the locale mechanism that can be used;
the locale syntax allows for an optional "modifier" which can be used to
specify deviations, e.g.
de_DE has charmap ISO-8859-1
de_DE@euro has charmap ISO-8859-15
uz_UZ has charmap ISO-8859-1
uz_UZ@cyrillic has charmap UTF-8
aa_ER and aa_ER@saaho both have charmap UTF-8 (with some other difference).
Thus you could define e.g.
ja_JP.UTF-8@cjk
or
ja_JP.UTF-8@cjkwidth
to indicate CJK width properties. I guess this is the most compliant way to go.
> - I heard that there is an existing implementation that behave like my
> proposal. (Sorry, I didn't hear the system name.)
Even if so, I think the way I described is more compatible with the
locale mechanism as used elsewhere.
> > 2. If you wish to handle character widths compliant with the terminal
> >   your application is running in, there is no guarantee that your
> >   assumption of CJK width (or the actual locale setting if that model
> >   would be implemented) does indeed reflect the terminal's width properties.
> Yes, I understand it, too. My proposal is completely workaround.
> But it is the best solution because we have no specification/standard
> for my wish.
A well-chosen option like above, that stays within the described
standard options, would be best accepted by other communities, I think,
and could be established for this purpose.
> > 3. In mintty, you can dynamically change width properties by selecting
> >   different fonts; mintty changes CJK width behaviour according to certain
> >   font properties. "static" configuration in your shell using a locale
> >   variable would not reflect this change
> It is no problem because we -- most Japanese language users -- need
> not change the settings of mintty and locale after first setup.
> We set LANG=ja_JP.UTF-8 and select a Japanese font for mintty.
In any case, mined running in mintty will detect CJK width itself,
regardless of locale setting, with coming versions of both programs even
when it gets changed on-the-fly :)
> >   b) Determine the actual CJK width behaviour dynamically. That's what
> >      mined does (in addition to other width property detection in general).
> It is the best solution. I think that we need specify the following:
> - the escape sequence about language context for terminal emulater.
> -- setting language context
> -- getting language context
> -- getting capability of language context
> (context is fixed, static or dynamic / acceptable languages)
> - new multilingualized string/terminal API for terminal based applications.
This sounds complicated. With my proposal, an application that wishes to
auto-adjust on width properties (maybe even when changing) and which
(unlike mined) uses the system wcwidth functions could proceed as follows:
 * Detect CJK width by using a simple test string width detection.
 * (Optional) When receiving a SIGWINCH signal (future version of MinTTY),
   repeat this detection.
 * If e.g. LC_CTYPE starts with "ja_JP.UTF-8", call setlocale with either
   "ja_JP.UTF-8@cjkwidth" or "ja_JP.UTF-8".
The application would need to stay with the same locale prefix "ja_JP..."
because there is no reasonable way to choose a completely different
locale, which is another reason to just use the modifier suffix, rather
than reserving the complete "ja_JP..." setting for CJK width.
Advantage of this approach: The system does not have to care about this
issue and can just follow the locale setting.
> And, we need rewrite too many applications by new API.
Well, alternatively, the system could follow the approach outlined
above, but maybe that's not the proper level to do it (?)
> > I'm not happy with the idea of a cygwin-specific solution (or workaround).
> I think that it is not cygwin-specific solution.
As I tried to suggest above, using "UTF-8" for different width data on
one system would be quite specific, using the "@" modifier syntax would not.
Kind regards,
Thomas
* Re: [Fwd: [1.7] wcwidth failing configure tests]
2009-06-05 16:25 ` Thomas Wolff
@ 2009-06-06 7:24 ` Andy Koppe
2009-06-06 12:53 ` IWAMURO Motonori
2009-06-06 9:31 ` Corinna Vinschen
` (2 subsequent siblings)
3 siblings, 1 reply; 36+ messages in thread
From: Andy Koppe @ 2009-06-06 7:24 UTC (permalink / raw)
To: cygwin; +Cc: newlib

2009/6/5 Thomas Wolff:
> the locale syntax allows for an optional "modifier" which can be used to
> specify deviations, e.g.
> de_DE has charmap ISO-8859-1
> de_DE@euro has charmap ISO-8859-15
> uz_UZ has charmap ISO-8859-1
> uz_UZ@cyrillic has charmap UTF-8
> aa_ER and aa_ER@saaho both have charmap UTF-8 (with some other difference).
> Thus you could define e.g.
> ja_JP.UTF-8@cjk
> or
> ja_JP.UTF-8@cjkwidth
> to indicate CJK width properties. I guess this is the most compliant way to go.

This looks the right approach to me.

However, to make the locale setting more convenient for CJK users,
there could be modifiers for both widths. Without modifier, the CJK
locales would default to "Ambiguous Wide", while everything else would
default to "Ambiguous Narrow".

In the time-honoured tradition of keeping Unix identifiers brief and
obscure, I propose the modifiers should be "@aw" and "@an". Otherwise,
how about "@ambigwide" and "@ambignarrow"?

Calling it something like "cjkwide" has the problem that it gives the
impression that the actual CJK ideographs are affected by this,
whereas this really concerns things like line drawing characters and
non-latin non-CJK letters. That confused me to start with anyway.

Puzzled that this hasn't been solved in glibc years ago ...

Andy
* Re: [Fwd: [1.7] wcwidth failing configure tests]
2009-06-06 7:24 ` Andy Koppe
@ 2009-06-06 12:53 ` IWAMURO Motonori
0 siblings, 0 replies; 36+ messages in thread
From: IWAMURO Motonori @ 2009-06-06 12:53 UTC (permalink / raw)
To: cygwin; +Cc: newlib

2009/6/6 Andy Koppe <andy.koppe@gmail.com>:
> However, to make the locale setting more convenient for CJK users,
> there could be modifiers for both widths. Without modifier, the CJK
> locales would default to "Ambiguous Wide", while everything else would
> default to "Ambiguous Narrow".

It is acceptable for me.

> Puzzled that this hasn't been solved in glibc years ago ...

I also examined it. But, I was not able to discover the reason.
One Debian user is trying to fix it, but it doesn't progress...

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=471021
http://sourceware.org/bugzilla/show_bug.cgi?id=4335

--
IWAMURO Motnori <http://vmi.jp/>
* Re: [Fwd: [1.7] wcwidth failing configure tests]
2009-06-05 16:25 ` Thomas Wolff
2009-06-06 7:24 ` Andy Koppe
@ 2009-06-06 9:31 ` Corinna Vinschen
2009-06-06 9:56 ` Andy Koppe
2009-06-06 13:06 ` IWAMURO Motonori
[not found] ` <3f0ad08d0906060242t275a78e7tb9913bf78d1c5e83@mail.gmail.com>
2009-06-06 12:22 ` [Fwd: [1.7] wcwidth failing configure tests] IWAMURO Motonori
3 siblings, 2 replies; 36+ messages in thread
From: Corinna Vinschen @ 2009-06-06 9:31 UTC (permalink / raw)
To: cygwin, newlib

On Jun 5 18:25, Thomas Wolff wrote:
> IWAMURO Motonori wrote:
> > 2009/5/21 Thomas Wolff <towo@towo.net>:
> > >> > Therefore, I propose to use *_cjk() when the language part of LC_CTYPE
> > >> > is 'ja', 'ko', 'vi' or 'zh'.
> > > The problem with this is
> > > 1. As you say, there is no standard.
>
> > But,
> > - I think that my proposal doesn't violate any specification.
> I think it does. Part of the locale information is the "charmap"
> (called "codepage" on DOS/Windows). It may be implicit like
> with LC_CTYPE=zh_CN which defines "GB2312" as its charmap, but it
> is typically explicit like in en_US.UTF-8 - the intention is
> that the "codepage" information should be the same for all locales
> having the "UTF-8" (or any other) charmap. So you cannot freely
> change width information among locales with the same charmap.
> Also, if ja_JP.UTF-8 would mean "CJK width", how would you specify
> a working locale setting for a terminal that does not run a CJK width
> font but should yet use other Japanese settings? E.g. with rxvt which
> does not support CJK width.
>
> However, there is one resort within the locale mechanism that can be used;
> the locale syntax allows for an optional "modifier" which can be used to
> specify deviations, e.g.
> de_DE has charmap ISO-8859-1
> de_DE@euro has charmap ISO-8859-15
> uz_UZ has charmap ISO-8859-1
> uz_UZ@cyrillic has charmap UTF-8
> aa_ER and aa_ER@saaho both have charmap UTF-8 (with some other difference).
> Thus you could define e.g.
> ja_JP.UTF-8@cjk
> or
> ja_JP.UTF-8@cjkwidth
> to indicate CJK width properties. I guess this is the most compliant way to go.

I like this approach. It's also more flexible than using the language
specifier.

<nit-picking>
Thomas, couldn't you have discussed this in the two weeks I was on
vacation? Why did you wait until I implemented the language-based
approach?
</nit-picking>

Now, we just have to agree on the modifier and somebody has to implement
this in newlib/libc/locale/locale.c. So far the modifier is ignored
entirely (de_DE@euro will still use ISO-8859-1).

I vote for @cjkwide, regardless of Andy's objection. People using CJK
will know the meaning and it has the additional advantage to be a rather
simple to memorize identifier.

Corinna

--
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
* Re: [Fwd: [1.7] wcwidth failing configure tests]
2009-06-06 9:31 ` Corinna Vinschen
@ 2009-06-06 9:56 ` Andy Koppe
2009-06-06 13:06 ` IWAMURO Motonori
1 sibling, 0 replies; 36+ messages in thread
From: Andy Koppe @ 2009-06-06 9:56 UTC (permalink / raw)
To: cygwin, newlib

> <nit-picking>
> Thomas, couldn't you have discussed this in the two weeks I was on
> vacation? Why did you wait until I implemented the language-based
> approach?
> </nit-picking>

Sorry, that's largely my fault. Among a bunch of other MinTTY issues we
were privately discussing various more or less mad schemes to
communicate the ambiguous width between terminal and application and
so it took a while for us to realise that a locale-based scheme really
is the best approach.

Andy
* Re: [Fwd: [1.7] wcwidth failing configure tests]
2009-06-06 9:31 ` Corinna Vinschen
2009-06-06 9:56 ` Andy Koppe
@ 2009-06-06 13:06 ` IWAMURO Motonori
1 sibling, 0 replies; 36+ messages in thread
From: IWAMURO Motonori @ 2009-06-06 13:06 UTC (permalink / raw)
To: cygwin, newlib

2009/6/6 Corinna Vinschen <corinna-cygwin@cygwin.com>:
> I vote for @cjkwide, regardless of Andy's objection. People using CJK
> will know the meaning and it has the additional advantage to be a rather
> simple to memorize identifier.

I oppose @cjkwide approach because I don't think that I need make
special cases give priority more than general cases.

I think that Andy's approach is better.

--
IWAMURO Motnori <http://vmi.jp/>
[parent not found: <3f0ad08d0906060242t275a78e7tb9913bf78d1c5e83@mail.gmail.com>]
* Re: [Fwd: [1.7] wcwidth failing configure tests]
[not found] ` <3f0ad08d0906060242t275a78e7tb9913bf78d1c5e83@mail.gmail.com>
@ 2009-06-06 9:46 ` IWAMURO Motonori
2009-06-12 18:56 ` Thomas Wolff
1 sibling, 0 replies; 36+ messages in thread
From: IWAMURO Motonori @ 2009-06-06 9:46 UTC (permalink / raw)
To: cygwin, newlib

I oppose your proposal because I think that it is useless for us.

2009/6/6 Thomas Wolff <towo@towo.net>:
> the intention is that the "codepage" information should be the same
> for all locales having the "UTF-8" (or any other) charmap. So you
> cannot freely change width information among locales with the same
> charmap.

I don't think that there is such a restriction.
The standard of the character doesn't provide for the width of the
character as a standard.

> Also, if ja_JP.UTF-8 would mean "CJK width", how would you specify a
> working locale setting for a terminal that does not run a CJK width
> font but should yet use other Japanese settings? E.g. with rxvt
> which does not support CJK width.

Oh, we ALWAYS have a hard time in this problem VERY VERY VERY much.

case1: We use only the application that treats the width of the
character without locale.
case2: We make the patch that solves the character width problem, and
throw it out up-stream.
case3: We make the patch, and apply it locally.
case4: We tearfully give up the correct display of the screen.
case5: We tearfully give up using the application.

I selected case5 for rxvt.

> Thus you could define e.g.
> ja_JP.UTF-8@cjk
> or
> ja_JP.UTF-8@cjkwidth
> to indicate CJK width properties. I guess this is the most compliant way to go.

I don't think that it is the good idea because:

- It is "a cygwin-specific solution (or workaround)".
- In NetBSD, the change to which wcwidth of East Asian Ambiguous
  Characters returns 2 by CJK locale is planned.

# to be continued.

--
IWAMURO Motnori <http://vmi.jp/>
* Re: [Fwd: [1.7] wcwidth failing configure tests]
[not found] ` <3f0ad08d0906060242t275a78e7tb9913bf78d1c5e83@mail.gmail.com>
2009-06-06 9:46 ` IWAMURO Motonori
@ 2009-06-12 18:56 ` Thomas Wolff
2009-06-12 19:12 ` Corinna Vinschen
2009-06-15 0:30 ` IWAMURO Motonori
1 sibling, 2 replies; 36+ messages in thread
From: Thomas Wolff @ 2009-06-12 18:56 UTC (permalink / raw)
To: newlib, cygwin, IWAMURO Motonori

IWAMURO Motonori wrote to me by private mail:
> I oppose your proposal because I think that it is useless for us.
>
> 2009/6/6 Thomas Wolff <towo@towo.net>:
>> the intention is that the "codepage" information should be the same
>> for all locales having the "UTF-8" (or any other) charmap. So you
>> cannot freely change width information among locales with the same
>> charmap.
>
> I don't think that there is such a restriction.
> The standard of the character doesn't provide for the width of the
> character as a standard.

I'm not sure which "standard" you are referring to.
I have checked source data files in /usr/share/i18n/charmaps on my Linux system, e.g. "UTF-8.gz".
These files are used when creating a new locale with the "localedef" command.
They contain not only the mapping but also (by the end of the file) a
list of combining and double-width characters. So obviously, even
stronger than I had argued, this would imply a scheme of predefined
character widths defined by each such "charmap", thus assuming that
character widths are the same for all locales with the same "charmap".

>> Also, if ja_JP.UTF-8 would mean "CJK width", how would you specify a
>> working locale setting for a terminal that does not run a CJK width
>> font but should yet use other Japanese settings? E.g. with rxvt
>> which does not support CJK width.
>
> Oh, we ALWAYS have a hard time in this problem VERY VERY VERY much.
>
> case1: We use only the application that treats the width of the
> character without locale.

No problem.

> case2: We make the patch that solves the character width problem, and
> throw it out up-stream.

Yes, you should go ahead "up-stream", whatever that means in the
case of locales.

> case3: We make the patch, and apply it locally.

No, bad idea. All locale-dogmatic people (I'm not one, just warning)
will bash you for this. What is the situation after remote login?
The remote system will assume its own locale setting (e.g. "ja_JP.UTF-8")
to indicate the actual behaviour of its environment properly, which is
not the case after local implementation of a solution.

> case4: We tearfully give up the correct display of the screen.
> case5: We tearfully give up using the application.
> I selected case5 for rxvt.
>

No reason to give up.
The approach I've taken in mined is quite successful. The other
approach, via locale names, will also have limited success provided it
is taken "up-stream".

>> Thus you could define e.g.
>> ja_JP.UTF-8@cjk
>> or
>> ja_JP.UTF-8@cjkwidth
>> to indicate CJK width properties. I guess this is the most compliant way to go.
>
> I don't think that it is the good idea because:
>
> - It is "a cygwin-specific solution (or workaround)".

Apparently we agree that a solution should be found that is not
cygwin-specific, but should be established "up-stream".
The question is thus which of the discussed mechanisms has a better chance
to get accepted up-stream:
- ja_JP.UTF-8 meaning different width data than en_US.UTF-8
or
- ja_JP.UTF-8@cjkwidth meaning different width data than ja_JP.UTF-8
My assumption is that the second proposal (that I made) has a better
chance, given the existing paradigms of the locale community.
But that's speculative. If you think you can get your proposal passed "up-stream",
go ahead and try it, please! If you succeed, everything is fine.

> - In NetBSD, the change to which wcwidth of East Asian Ambiguous Characters returns 2 by CJK locale is planned.

So the same issue (of compliance and portability, especially in the
remote case) should be discussed in the NetBSD community.
(Is there a suitable forum or mailing list to check?)

> - and, I don't think that I need make special cases give priority more
> than general cases.

> >> - I heard that there is an existing implementation that behave like my
> >> proposal. (Sorry, I didn't hear the system name.)
> > Even if so, I think the way I described is more compatible with the locale
> > mechanism as used elsewhere.
> I think that ALL locale implementations should treat East Asian
> Ambiguous Character Width as 2 for CJK locale.

Again, I agree that IF you manage to get ALL implementations to follow
this approach, the solution is fine. Please go ahead.

> >> It is no problem because we -- most Japanese language users -- need
> >> not change the settings of mintty and locale after first setup.
> >> We set LANG=ja_JP.UTF-8 and select a Japanese font for mintty.
> > In any case, mined running in mintty will detect CJK width itself,
> > regardless of locale setting, with coming versions of both programs
> > even when it gets changed on-the-fly :)
> Sorry, I can't understand above because I am not good at English.

Well, even if your proposal would finally be implemented, MinTTY will
still be able to choose different fonts and depending on which font is
selected, run in locale-width-compliant or width-breaking mode.
* My solution could be tweaked to handle this.
* Auto-detection (of mined) can handle it already.
* Your solution could probably not handle it.

> I don't think so. I think that we should consider the following issues
> if a new mechanism is introduced.
> The existing locale / terminal API don't support:
> - Unicode BiDi.
> - Unicode control characters.
> - Unicode combining characters.
> - Multilingualization. (*)
> - Detect font/fontset information selected with terminal emulator.
> (including, need to consider the case of no-tty)

Not sure what you intend to say with these remarks. Locale and terminal
APIs are actually two different things. And locale API can e.g. handle
combining characters (by wcwidth returning 0).

> * Now, we can't use Japanese, Chinese, and Korean at the same time
> even if we use Unicode.
> Because many font glyphs are quite different even if the code point
> is the same in each language.

This is a completely different issue and it should be easy to solve it
by simply choosing an appropriate font.

> > With my proposal, an application that wishes to auto-adjust on width
> > properties (maybe even when changing) and which (unlike mined) uses
> > the system wcwidth functions could proceed as follows:
> > * Detect CJK width by using a simple test string width detection.
> > * (Optional) When receiving a SIGWINCH signal (future version of MinTTY),
> > repeat this detection.
> > * If e.g. LC_CTYPE starts with "ja_JP.UTF-8", call setlocale with
> > either "ja_JP.UTF-8@cjkwidth" or "ja_JP.UTF-8".
> How to detect it? The application using wcwidth is not necessarily
> executed with terminal emulator. (e.g. text formatter)

OK, my arguments refer to an interactive application that wants to
control the precise representation of text on the screen.
If for example a text formatter formats for paper printing, it would
need to apply completely different assumptions anyway. The dreadful
single/double width issue of cell-based terminals isn't relevant at
all in that case.

> >> > I'm not happy with the idea of a cygwin-specific solution (or workaround).
> >> I think that it is not cygwin-specific solution.
> > As I tried to suggest above, using "UTF-8" for different width data on one
> > system would be quite specific, using the "@" modifier syntax would not.
> "UTF-8" is only an encoding scheme. It does not specify the character width.

OK, we had this argument above, and we were both not quite right before.
The essence is that whatever you get established up-stream may turn out
to be a working solution, so I would appreciate if you go ahead and
persuade some "up-stream" people...

Best regards,
Thomas
* Re: [Fwd: [1.7] wcwidth failing configure tests]
2009-06-12 18:56 ` Thomas Wolff
@ 2009-06-12 19:12 ` Corinna Vinschen
2009-06-15 0:30 ` IWAMURO Motonori
1 sibling, 0 replies; 36+ messages in thread
From: Corinna Vinschen @ 2009-06-12 19:12 UTC (permalink / raw)
To: newlib, cygwin

On Jun 12 17:38, Thomas Wolff wrote:
> IWAMURO Motonori wrote to me by private mail:
> > I oppose your proposal because I think that it is useless for us.
> >
> > 2009/6/6 Thomas Wolff <towo@towo.net>:
> >> the intention is that the "codepage" information should be the same
> >> for all locales having the "UTF-8" (or any other) charmap. So you
> >> cannot freely change width information among locales with the same
> >> charmap.
> >
> > I don't think that there is such a restriction.
> > The standard of the character doesn't provide for the width of the
> > character as a standard.
> I'm not sure which "standard" you are referring to.

The problem appears to be that there is no standard for the handling
of ambiguous characters.

> I have checked source data files in /usr/share/i18n/charmaps on my Linux system, e.g. "UTF-8.gz".
> These files are used when creating a new locale with the "localedef" command.
> They contain not only the mapping but also (by the end of the file) a
> list of combining and double-width characters. So obviously, even
> stronger than I had argued, this would imply a scheme of predefined
> character widths defined by each such "charmap", thus assuming that
> character widths are the same for all locales with the same "charmap".

I'm not sure the Linux solution is overly flexible. AFAICS, when using
the UTF-8 charset, the ambiguous characters always have width 1. Only
when switching to GB18030, the width of these chars is two. That seems
to be a bit unsatisfying for CJK users.

> >> Also, if ja_JP.UTF-8 would mean "CJK width", how would you specify a
> >> working locale setting for a terminal that does not run a CJK width
> >> font but should yet use other Japanese settings? E.g. with rxvt
> >> which does not support CJK width.

Wouldn't that be covered by using your own proposal just backwards?
Define the default for ja, ko, and zh to use width = 2, with a
@cjknarrow (or whatever) modifier to use width = 1.

> The approach I've taken in mined is quite successful. The other
> approach, via locale names, will also have limited success provided it
> is taken "up-stream".

Whatever "upstream" means.

Corinna

--
Corinna Vinschen
Cygwin Project Co-Leader
Red Hat
* Re: [Fwd: [1.7] wcwidth failing configure tests]
2009-06-12 18:56 ` Thomas Wolff
2009-06-12 19:12 ` Corinna Vinschen
@ 2009-06-15 0:30 ` IWAMURO Motonori
2009-06-15 4:34 ` IWAMURO Motonori
1 sibling, 1 reply; 36+ messages in thread
From: IWAMURO Motonori @ 2009-06-15 0:30 UTC (permalink / raw)
To: newlib, cygwin

2009/6/13 Thomas Wolff <towo@towo.net>:
> I have checked source data files in /usr/share/i18n/charmaps on my Linux system, e.g. "UTF-8.gz".
<snip>
> character widths are the same for all locales with the same "charmap".

It was reported as a bug, but it isn't fixed now...X-(
http://sourceware.org/bugzilla/show_bug.cgi?id=4335
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=471021

> If you think you can get your proposal passed "up-stream",
> go ahead and try it, please! If you succeed, everything is fine.

Hmmm, I think that you have misunderstood something because my
explanation is bad.

I called "up-stream" as the maintenance team of each OS, library, or
application. I don't think that there is something single "up-stream".

Japanese language users have tried to fix of the problem for many
years, but it doesn't progress so much now.

>> - In NetBSD, the change to which wcwidth of East Asian Ambiguous Characters returns 2 by CJK locale is planned.
> So the same issue (of compliance and portability, especially in the
> remote case) should be discussed in the NetBSD community.
> (Is there a suitable forum or mailing list to check?)

Sorry, I don't know it because I was personally advised by one of the
NetBSD maintainer ( http://www.hi-matic.org/ (written in Japanese) ).

>> I think that ALL locale implementations should treat East Asian
>> Ambiguous Character Width as 2 for CJK locale.
> Again, I agree that IF you manage to get ALL implementations to follow
> this approach, the solution is fine. Please go ahead.

I will do so, but I want to solve the problem on Cygwin first of all.

>> How to detect it? The application using wcwidth is not necessarily
>> executed with terminal emulator. (e.g. text formatter)
> OK, my arguments refer to an interactive application that wants to
> control the precise representation of text on the screen.
> If for example a text formatter formats for paper printing, it would
> need to apply completely different assumptions anyway. The dreadful
> single/double width issue of cell-based terminals isn't relevant at
> all in that case.

I am assuming the application that depends on the fixed-pitch font as
text-formatter. (like 'indent' command)

I hope the following two results become the same.
- the auto-format filter program using 'wcwidth'.
- run auto-format command on editor. (e.g. "fill-paragraph",
  "indent-region", etc on Emacs)

--
IWAMURO Motnori <http://vmi.jp/>
* Re: [Fwd: [1.7] wcwidth failing configure tests]
2009-06-15 0:30 ` IWAMURO Motonori
@ 2009-06-15 4:34 ` IWAMURO Motonori
2009-06-15 11:43 ` [PATCH] Add "@cjknarrow" modifier (was Re: [Fwd: [1.7] wcwidth failing configure tests]) Corinna Vinschen
0 siblings, 1 reply; 36+ messages in thread
From: IWAMURO Motonori @ 2009-06-15 4:34 UTC (permalink / raw)
To: newlib, cygwin

2009/6/13 Corinna Vinschen <vinschen@redhat.com>:
>> I'm not sure which "standard" you are referring to.
>
> The problem appears to be that there is no standard for the handling
> of ambiguous characters.

Yes, but the guideline exists.

http://cygwin.com/ml/cygwin/2009-05/msg00444.html
> 2) Unicode Standard Annex #11
> http://www.unicode.org/unicode/reports/tr11/ recommends:
> > 5 Recommendations
> (snip)
> > When processing or displaying data
> (snip)
> > Ambiguous characters behave like wide or narrow characters depending
> > on the context (language tag, script identification, associated
> > font, source of data, or explicit markup; all can provide the
> > context). If the context cannot be established reliably, they should
> > be treated as narrow characters by default.

> Define the default for ja, ko, and zh to use width = 2, with a
> @cjknarrow (or whatever) modifier to use width = 1.

I think it is good idea.

--
IWAMURO Motnori <http://vmi.jp/>
* [PATCH] Add "@cjknarrow" modifier (was Re: [Fwd: [1.7] wcwidth failing configure tests])
2009-06-15 4:34 ` IWAMURO Motonori
@ 2009-06-15 11:43 ` Corinna Vinschen
2009-06-15 15:58 ` IWAMURO Motonori
2009-06-27 22:03 ` Andy Koppe
0 siblings, 2 replies; 36+ messages in thread
From: Corinna Vinschen @ 2009-06-15 11:43 UTC (permalink / raw)
To: cygwin, newlib

On Jun 14 22:18, IWAMURO Motonori wrote:
> 2009/6/13 Corinna Vinschen
> > The problem appears to be that there is no standard for the handling
> > of ambiguous characters.
>
> Yes, but the guideline exists.
> http://cygwin.com/ml/cygwin/2009-05/msg00444.html

A single mail in a single mailing list of a single project. That's rather
a suggestion than a guideline...

> > > Ambiguous characters behave like wide or narrow characters depending
> > > on the context (language tag, script identification, associated
> > > font, source of data, or explicit markup; all can provide the
> > > context). If the context cannot be established reliably, they should
> > > be treated as narrow characters by default.
>
> > Define the default for ja, ko, and zh to use width = 2, with a
> > @cjknarrow (or whatever) modifier to use width = 1.
>
> I think it is good idea.

If everybody agrees to this suggestion, here's the patch. Tested with
various combinations like

LANG=ja_JP.UTF-8@cjknarrow
LANG=ja_JP@cjknarrow
LANG=ja.UTF-8@cjknarrow
LANG=ja@cjknarrow


Corinna


* libc/locale/locale.c (loadlocale): Add handling of "@cjknarrow"
modifier on _MB_CAPABLE targets. Add comment to explain.


Index: libc/locale/locale.c
===================================================================
RCS file: /cvs/src/src/newlib/libc/locale/locale.c,v
retrieving revision 1.20
diff -u -p -r1.20 locale.c
--- libc/locale/locale.c 3 Jun 2009 19:28:22 -0000 1.20
+++ libc/locale/locale.c 15 Jun 2009 08:40:46 -0000
@@ -397,6 +397,9 @@ loadlocale(struct _reent *p, int categor
   int (*l_wctomb) (struct _reent *, char *, wchar_t, const char *, mbstate_t *);
   int (*l_mbtowc) (struct _reent *, wchar_t *, const char *, size_t,
                    const char *, mbstate_t *);
+#ifdef _MB_CAPABLE
+  int cjknarrow = 0;
+#endif
 
   /* "POSIX" is translated to "C", as on Linux. */
   if (!strcmp (locale, "POSIX"))
@@ -427,10 +430,14 @@ loadlocale(struct _reent *p, int categor
       if (c[0] == '.')
         {
           /* Charset */
-          strcpy (charset, c + 1);
-          if ((c = strchr (charset, '@')))
+          char *chp;
+
+          ++c;
+          strcpy (charset, c);
+          if ((chp = strchr (charset, '@')))
             /* Strip off modifier */
-            *c = '\0';
+            *chp = '\0';
+          c += strlen (charset);
         }
       else if (c[0] == '\0' || c[0] == '@')
         /* End of string or just a modifier */
@@ -442,6 +449,17 @@ loadlocale(struct _reent *p, int categor
       else
         /* Invalid string */
         return NULL;
+#ifdef _MB_CAPABLE
+      if (c[0] == '@')
+        {
+          /* Modifier */
+          /* Only one modifier is recognized right now. "cjknarrow" is used
+             to modify the behaviour of wcwidth() for East Asian languages.
+             For details see the comment at the end of this function. */
+          if (!strcmp (c + 1, "cjknarrow"))
+            cjknarrow = 1;
+        }
+#endif
     }
   /* We only support this subset of charsets. */
   switch (charset[0])
@@ -604,13 +622,15 @@ loadlocale(struct _reent *p, int categor
       __mbtowc = l_mbtowc;
       __set_ctype (charset);
       /* Check for the language part of the locale specifier. In case
-         of "ja", "ko", or "zh", assume the use of CJK fonts. This is
-         stored in lc_ctype_cjk_lang and tested in wcwidth() to figure
-         out the width to return (1 or 2) for the "CJK Ambiguous Width"
-         category of characters. */
-      lc_ctype_cjk_lang = (strncmp (locale, "ja", 2) == 0
-                           || strncmp (locale, "ko", 2) == 0
-                           || strncmp (locale, "zh", 2) == 0);
+         of "ja", "ko", or "zh", assume the use of CJK fonts, unless the
+         "@cjknarrow" modifier has been specifed.
+         The result is stored in lc_ctype_cjk_lang and tested in wcwidth()
+         to figure out the width to return (1 or 2) for the "CJK Ambiguous
+         Width" category of characters. */
+      lc_ctype_cjk_lang = !cjknarrow
+                          && ((strncmp (locale, "ja", 2) == 0
+                               || strncmp (locale, "ko", 2) == 0
+                               || strncmp (locale, "zh", 2) == 0));
 #endif
     }
   else if (category == LC_MESSAGES)


--
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
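For readers who want to try the rule from this patch outside newlib, the decision it makes can be approximated by a small standalone C function. This is a simplified sketch and not the newlib code: the function name ambiguous_is_wide is invented for illustration, and unlike loadlocale it does not validate the language, territory or charset parts of the string.

```c
#include <string.h>

/* Return 1 if East Asian Ambiguous characters should be treated as
   width 2 for the given locale name, 0 otherwise.  Mirrors the rule in
   the patch above: "ja", "ko" and "zh" default to wide unless the
   "@cjknarrow" modifier is present.  (Hypothetical helper name.) */
int ambiguous_is_wide(const char *locale)
{
    const char *mod = strchr(locale, '@');   /* start of modifier, if any */
    int cjknarrow = (mod != NULL && strcmp(mod + 1, "cjknarrow") == 0);
    int cjk_lang = strncmp(locale, "ja", 2) == 0
                   || strncmp(locale, "ko", 2) == 0
                   || strncmp(locale, "zh", 2) == 0;

    return !cjknarrow && cjk_lang;
}
```

With this rule all four LANG values listed in the message above yield 0 (narrow), while a plain "ja_JP.UTF-8" yields 1 (wide).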
* Re: [PATCH] Add "@cjknarrow" modifier (was Re: [Fwd: [1.7] wcwidth failing configure tests])
From: IWAMURO Motonori @ 2009-06-15 15:58 UTC
To: cygwin, newlib

2009/6/15 Corinna Vinschen <corinna-cygwin@cygwin.com>:
>> Yes, but the guideline exists.
>> http://cygwin.com/ml/cygwin/2009-05/msg00444.html
>
> A single mail in a single mailing list of a single project. That's rather
> a suggestion than a guideline...

Sorry, my wording was poor. My quotation is taken from Unicode Standard
Annex #11, "East Asian Width". Please see "When processing or displaying
data" under "5 Recommendations" at
http://www.unicode.org/unicode/reports/tr11/ .

> If everybody agrees to this suggestion, here's the patch.

Is "cjk" a good prefix for the modifier name? It affects not CJK
characters but some symbols and European characters.
Please refer to Andy's opinion:
http://cygwin.com/ml/cygwin/2009-06/msg00240.html

Personally, I propose "ambinarrow", because the corresponding Vim option
is "ambiwidth".

Also, I don't think it is symmetrical. How about the following patch?
(I have not changed the modifier prefix.)

--- libc/locale/locale.c.ORIG 2009-06-15 23:05:40.812500000 +0900
+++ libc/locale/locale.c 2009-06-15 22:56:35.546875000 +0900
@@ -398,7 +398,8 @@
   int (*l_mbtowc) (struct _reent *, wchar_t *, const char *, size_t,
                    const char *, mbstate_t *);
 #ifdef _MB_CAPABLE
-  int cjknarrow = 0;
+#define CJK_DEFAULT -1
+  int cjk_lang = CJK_DEFAULT;
 #endif

   /* "POSIX" is translated to "C", as on Linux. */
@@ -453,11 +454,14 @@
       if (c[0] == '@')
         {
           /* Modifier */
-          /* Only one modifier is recognized right now. "cjknarrow" is used
-             to modify the behaviour of wcwidth() for East Asian languages.
-             For details see the comment at the end of this function. */
+          /* Only one modifier is recognized right now. "cjknarrow" and
+             "cjkwide" are used to modify the behaviour of wcwidth() for
+             East Asian languages. For details see the comment at the
+             end of this function. */
           if (!strcmp (c + 1, "cjknarrow"))
-            cjknarrow = 1;
+            cjk_lang = 0;
+          else if (!strcmp (c + 1, "cjkwide"))
+            cjk_lang = 1;
         }
 #endif
     }
@@ -627,10 +631,11 @@
          The result is stored in lc_ctype_cjk_lang and tested in wcwidth()
          to figure out the width to return (1 or 2) for the "CJK Ambiguous
          Width" category of characters. */
-      lc_ctype_cjk_lang = !cjknarrow
-                          && ((strncmp (locale, "ja", 2) == 0
-                               || strncmp (locale, "ko", 2) == 0
-                               || strncmp (locale, "zh", 2) == 0));
+      lc_ctype_cjk_lang = cjk_lang != CJK_DEFAULT
+                          ? cjk_lang
+                          : ((strncmp (locale, "ja", 2) == 0
+                              || strncmp (locale, "ko", 2) == 0
+                              || strncmp (locale, "zh", 2) == 0));
 #endif
     }
   else if (category == LC_MESSAGES)

--
IWAMURO Motonori <http://vmi.jp/>
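Editorial sketch: the tri-state logic proposed in the patch above can be tried out in isolation roughly as follows. This is only an illustration under stated assumptions, not newlib code: the function name cjk_wide_for_locale is hypothetical, the modifier is located with strchr() rather than by the parsing loop in locale.c, and the real result would be stored in lc_ctype_cjk_lang.

```c
#include <string.h>

/* "No modifier given": fall back to the language-based default. */
#define CJK_DEFAULT (-1)

/* Hypothetical helper (not part of newlib): returns 1 if "CJK Ambiguous
   Width" characters should be treated as wide for this locale string,
   0 if they should be treated as narrow. */
int
cjk_wide_for_locale (const char *locale)
{
  int cjk_lang = CJK_DEFAULT;
  const char *modifier = strchr (locale, '@');

  if (modifier != NULL)
    {
      /* An explicit modifier overrides the language-based default. */
      if (strcmp (modifier + 1, "cjknarrow") == 0)
        cjk_lang = 0;
      else if (strcmp (modifier + 1, "cjkwide") == 0)
        cjk_lang = 1;
    }

  if (cjk_lang != CJK_DEFAULT)
    return cjk_lang;

  /* Default: wide only for Japanese, Korean and Chinese locales. */
  return strncmp (locale, "ja", 2) == 0
         || strncmp (locale, "ko", 2) == 0
         || strncmp (locale, "zh", 2) == 0;
}
```

With this sketch, "ja_JP.UTF-8" yields 1, "ja_JP.UTF-8@cjknarrow" yields 0, and "en_US.UTF-8@cjkwide" yields 1. The last case is exactly the symmetry that the patch adds and that the earlier "@cjknarrow"-only patch does not provide.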
* Re: [PATCH] Add "@cjknarrow" modifier (was Re: [Fwd: [1.7] wcwidth failing configure tests])
From: Corinna Vinschen @ 2009-06-15 17:08 UTC
To: cygwin, newlib

On Jun 15 23:35, IWAMURO Motonori wrote:
> 2009/6/15 Corinna Vinschen:
> > If everybody agrees to this suggestion, here's the patch.
>
> Is "cjk" a good prefix for the modifier name? It affects not CJK
> characters but some symbols and European characters.
> Please refer to Andy's opinion:
> http://cygwin.com/ml/cygwin/2009-06/msg00240.html
>
> Personally, I propose "ambinarrow", because the corresponding Vim option
> is "ambiwidth".

I think "cjk" in the name is the right choice. There are no ambiguous
characters in western languages (well, probably there are, but the
ambiguity is not on the level of character widths). This is a problem
which only has a meaning in these so-called CJK languages. It makes
sense to me to use this in the modifier name.

> Also, I don't think it is symmetrical. How about the following
> patch? (I have not changed the modifier prefix.)

I'm not convinced that we need symmetry. It looks like a nice idea for
Cygwin or newlib, given that the setlocale language string is checked
and picked to pieces hardcoded in the loadlocale function.

However, besides being unnecessary, other systems like Linux or BSD
use the language string as a directory name relative to the
/usr/share/locale directory. If this ever gets used on non-Cygwin
systems, the symmetry (which has no precedent in the locale arena) would
require these systems to create yet another subdirectory or symlink for
the same purpose. Even worse, if you propose that @cjkwide is a valid
modifier for *any* language, you would make the whole mechanism on
non-newlib based systems more complicated for no apparent reason.


Corinna
* Re: [PATCH] Add "@cjknarrow" modifier (was Re: [Fwd: [1.7] wcwidth failing configure tests])
From: IWAMURO Motonori @ 2009-06-15 17:14 UTC
To: cygwin, newlib

OK. I withdraw my proposal.

--
IWAMURO Motonori <http://vmi.jp/>
* Re: [PATCH] Add "@cjknarrow" modifier (was Re: [Fwd: [1.7] wcwidth failing configure tests])
From: Thomas.Wolff @ 2009-06-18 15:57 UTC
To: cygwin, newlib; +Cc: IWAMURO Motonori, Andy Koppe

2009/6/16 Corinna Vinschen <corinna-cygwin@cygwin.com>:
> On Jun 15 23:35, IWAMURO Motonori wrote:
>> 2009/6/15 Corinna Vinschen:
>> > If everybody agrees to this suggestion, here's the patch.
>>
>> Is "cjk" a good prefix for the modifier name? It affects not CJK
>> characters but some symbols and European characters.
>> Please refer to Andy's opinion:
>> http://cygwin.com/ml/cygwin/2009-06/msg00240.html
>>
>> Personally, I propose "ambinarrow", because the corresponding Vim option
>> is "ambiwidth".
>
> I think "cjk" in the name is the right choice. There are no ambiguous
> characters in western languages (well, probably there are, but the
> ambiguity is not on the level of character widths). This is a problem
> which only has a meaning in these so-called CJK languages. It makes
> sense to me to use this in the modifier name.

I agree with keeping "cjk" in the modifier name (also because the xterm
option is called -cjk_width), but as for the historical background, it is
actually quite the other way round:
In traditional CJK character encodings, fonts, and terminal applications,
basically ALL characters were wide, including a subset of Latin characters
that happened to be included in those character sets, and sometimes
even including the ASCII range. These are the ones considered "ambiguous",
since they used to be wide, while in all non-CJK environments they are not
(excluding ASCII, which is therefore mirrored in the range "Halfwidth and
Fullwidth Forms", U+FF00 ... U+FF5E).
This also explains the chaotic mix of wide and narrow characters in ranges
like Latin-1 Supplement, Latin Extended, Greek and Cyrillic, which is in no
way useful for any user; it's just a legacy compatibility issue.
I think the major usage for CJK users nowadays is about ranges like
Arrows, Enclosed Alphanumerics (with circled digits), Box Drawing etc.

>> Also, I don't think it is symmetrical. How about the following
>> patch? (I have not changed the modifier prefix.)
>
> I'm not convinced that we need symmetry. It looks like a nice idea for
> Cygwin or newlib, given that the setlocale language string is checked
> and picked to pieces hardcoded in the loadlocale function.

Despite IWAMURO Motonori's withdrawal, I think symmetry would be the
right approach to take. The major aspect is how to reflect the actual
behaviour of existing terminal environments. And as a matter of fact,
you can run both xterm and MinTTY with a non-CJK locale and ambiguous
characters being wide. This is achieved by invoking xterm -cjk_width or
by selecting an appropriate font in MinTTY, e.g. Ming, SimSun, MS Mincho,
or even just the popular Lucida Typewriter.
(Although it occurs to me that in the case of Lucida Typewriter this
might be a bug, since the wideness of ambiguous characters is just
simulated in this configuration rather than using wide font characters -
Andy, can you please check this?)

> However, besides being unnecessary, other systems like Linux or BSD
> use the language string as a directory name relative to the
> /usr/share/locale directory. If this ever gets used on non-Cygwin
> systems, the symmetry (which has no precedent in the locale arena) would
> require these systems to create yet another subdirectory or symlink for
> the same purpose. Even worse, if you propose that @cjkwide is a valid
> modifier for *any* language, you would make the whole mechanism on
> non-newlib based systems more complicated for no apparent reason.

The silly unmodular way that some systems implement the locale mechanism
(the worst of them being SunOS)
should not be an argument to not propagate a reasonable solution.
[Who was in favour of these double negations?]

The "locale interface" (syntax and semantics of LC_* strings) is defined
in a modular way and so the implementations should be - let them fix it.

Thomas
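Editorial sketch: the remark above about ASCII being mirrored in the "Halfwidth and Fullwidth Forms" block can be illustrated with a few lines of C. In Unicode the fullwidth variants of the printable ASCII characters U+0021 ... U+007E occupy U+FF01 ... U+FF5E, at a constant offset of 0xFEE0, and the wide counterpart of the space is U+3000 IDEOGRAPHIC SPACE. The function name to_fullwidth is hypothetical and does not appear anywhere in this thread.

```c
/* Hypothetical helper: map a printable ASCII code point to its fullwidth
   counterpart.  Any other code point is returned unchanged. */
unsigned long
to_fullwidth (unsigned long c)
{
  if (c >= 0x21 && c <= 0x7E)
    return c + 0xFEE0;          /* U+FF01 ... U+FF5E */
  if (c == 0x20)
    return 0x3000;              /* IDEOGRAPHIC SPACE */
  return c;
}
```

For example, 'A' (U+0041) maps to U+FF21 FULLWIDTH LATIN CAPITAL LETTER A. Since ASCII thus has explicit wide twins, the plain ASCII range itself does not need to be classified as ambiguous.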
* Re: [PATCH] Add "@cjknarrow" modifier (was Re: [Fwd: [1.7] wcwidth failing configure tests])
From: Corinna Vinschen @ 2009-06-18 16:49 UTC
To: cygwin, newlib

On Jun 18 14:09, Thomas.Wolff@nsn.com wrote:
> 2009/6/16 Corinna Vinschen
> > However, besides being unnecessary, other systems like Linux or BSD
> > use the language string as a directory name relative to the
> > /usr/share/locale directory. If this ever gets used on non-Cygwin
> > systems, the symmetry (which has no precedent in the locale arena) would
> > require these systems to create yet another subdirectory or symlink for
> > the same purpose. Even worse, if you propose that @cjkwide is a valid
> > modifier for *any* language, you would make the whole mechanism on
> > non-newlib based systems more complicated for no apparent reason.
> The silly unmodular way that some systems implement the locale mechanism
> (the worst of them being SunOS)
> should not be an argument to not propagate a reasonable solution.
> [Who was in favour of these double negations?]
>
> The "locale interface" (syntax and semantics of LC_* strings) is defined
> in a modular way and so the implementations should be - let them fix it.

What do you think, how big will be the acceptance of this approach
outside of newlib/Cygwin?


Corinna
* Re: [PATCH] Add "@cjknarrow" modifier (was Re: [Fwd: [1.7] wcwidth failing configure tests])
From: Andy Koppe @ 2009-06-19 0:08 UTC
To: cygwin

2009/6/18 Thomas.Wolff:
> And as a matter of fact,
> you can run both xterm and MinTTY with a non-CJK locale and ambiguous
> characters being wide. This is achieved by invoking xterm -cjk_width or
> by selecting an according font in MinTTY, e.g. Ming, SimSun, MS Mincho,
> or even just the popular Lucida Typewriter.
> (Although it occurs to me that in the case of Lucida Typewriter this
> might be a bug since the wideness of ambiguous characters is just
> simulated in this configuration rather than using wide font characters -
> Andy, can you please check this?)

Yep, there's a problem here, thanks. I haven't got Lucida Typewriter,
but found my Vista install has Lucida Sans Typewriter. That font
doesn't actually have Greek or box drawing characters, so all I'm
getting is the square replacement character, but it does indeed take
up two cells for those.

Turns out that's because Latin characters are reported as having a
width of 0.5 (of whatever unit) whereas the replacement character is
reported as being 0.625 wide. I'll adjust the ambiguous-width
detection.

Andy
* Re: [PATCH] Add "@cjknarrow" modifier (was Re: [Fwd: [1.7] wcwidth failing configure tests])
From: Thomas Wolff @ 2009-06-19 14:45 UTC
To: cygwin, newlib

I wrote:
> Despite IWAMURO Motonori's withdrawal, I think symmetry would be the
> right approach to take. The major aspect is how to reflect the actual
> behaviour of existing terminal environments. ...
> ...
> The "locale interface" (syntax and semantics of LC_* strings) is defined
> in a modular way and so the implementations should be - let them fix it.

Corinna Vinschen wrote:
> What do you think, how big will be the acceptance of this approach
> outside of newlib/Cygwin?

I have no idea about the acceptance of the whole concept, especially
(as I had warned) about changing the width of the CJK locales
WITHOUT modifier as IWAMURO Motonori insisted.
But I guess a general solution of the width issue will be more
appreciated than one that handles only the CJK locales.

Thomas
* Re: [PATCH] Add "@cjknarrow" modifier (was Re: [Fwd: [1.7] wcwidth failing configure tests])
From: Corinna Vinschen @ 2009-06-19 14:49 UTC
To: cygwin, newlib

On Jun 19 13:02, Thomas Wolff wrote:
> I wrote:
> > Despite IWAMURO Motonori's withdrawal, I think symmetry would be the
> > right approach to take. The major aspect is how to reflect the actual
> > behaviour of existing terminal environments. ...
> > ...
> > The "locale interface" (syntax and semantics of LC_* strings) is defined
> > in a modular way and so the implementations should be - let them fix it.
>
> Corinna Vinschen wrote:
> > What do you think, how big will be the acceptance of this approach
> > outside of newlib/Cygwin?
>
> I have no idea about the acceptance of the whole concept, especially
> (as I had warned) about changing the width of the CJK locales
> WITHOUT modifier as IWAMURO Motonori insisted.
> But I guess a general solution of the width issue will be more
> appreciated than one that handles only the CJK locales.

Well, if your proposal is accepted by other projects, we can easily
extend our own implementation without changing existing functionality.


Corinna
* Re: [PATCH] Add "@cjknarrow" modifier (was Re: [Fwd: [1.7] wcwidth failing configure tests])
From: Andy Koppe @ 2009-06-27 22:03 UTC
To: cygwin, newlib

2009/6/15 Corinna Vinschen:
>> > Define the default for ja, ko, and zh to use width = 2, with a
>> > @cjknarrow (or whatever) modifier to use width = 1.
>>
>> I think it is good idea.
>
> If everybody agrees to this suggestion, here's the patch. Tested
> with various combinations like
>
> LANG=ja_JP.UTF-8@cjknarrow
> LANG=ja_JP@cjknarrow
> LANG=ja.UTF-8@cjknarrow
> LANG=ja@cjknarrow

Apologies for harping on about this, especially as it was me who
suggested the @narrow scheme in the first place, but I do think this
is the wrong way to go.

MinTTY currently ignores POSIX locales completely, so I've been
pondering how to deal with locales and codepages more properly. One
thing I'd like to do is to automatically set LANG depending on the
Windows locale and the codepage and font settings in MinTTY (if LANG
isn't set already, that is).

Trouble is, what do I do if a cjkwide font is selected, yet the
Windows locale is not East Asian? I can't just randomly stick the user
into one of the three CJK countries, because people don't always take
kindly to being put into the wrong country.

That could be addressed by adding the @cjkwide modifier for non-CJK
languages, as discussed previously, but then MinTTY would still need
to parse the language setting to decide which modifier (if any) needs
to be used. Having the @cjkwide modifier only, independent of the
selected language, would keep things much easier to use and explain.

And then there's the Linux compatibility angle, where ja_JP.UTF-8
means ambiguous width 1, not 2.

To try to help with changing this, here's some text for the user guide.
Replace this:

"Right now the language and territory, as well as the modifier, are not
important to Cygwin, except to fix a single problem. There's a class of
characters in the Unicode character set, called the "CJK Ambiguous Width
Character set". For these characters the width returned by the
wcwidth/wcswidth function is usually 1. This is often a problem in
East-Asian languages, which historically use character sets in which
these characters have a width of 2. Kind of explains why they are called
"ambiguous"... The problem has been fixed for now like this.
wcwidth/wcswidth usually return 1 as the width of these characters.
However, if the language is specifed as "ja" (Japanese), "ko" (Korean),
or "zh" (Chinese), wcwidth returns 2 for these characters.
Unfortunately this isn't correct in all circumstances, so the user can
specify the modifier "@cjknarrow", which modifies the behaviour of
wcwidth/wcswidth to return 1 for the ambiguous width characters to
return 1 even in those languages."

With this:

"Right now the language and territory are not important to Cygwin, but
the modifier is used to deal with the issue of "CJK Ambiguous Width"
characters. For these characters the width returned by the wcwidth
function is usually 1. This is often a problem in East Asian languages,
which historically use character sets in which these characters have a
width of 2. Kind of explains why they are called "ambiguous"... (See
http://unicode.org/reports/tr11/ for a full explanation.) Therefore, if
the modifier "@cjkwide" is specified, wcwidth returns 2 for these
characters. For example, with ja_JP.UTF-8 their width is 1, whereas with
ja_JP.UTF-8@cjkwide it is 2."

Andy
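Editorial sketch: the behaviour described in the proposed user guide text (ambiguous-width characters are 1 cell wide by default and 2 cells wide when "@cjkwide" is in effect) could look roughly like this. It is a standalone illustration, not newlib's wcwidth(): the names ambiguous_ranges and ambiguous_width are hypothetical, and the table is only a small excerpt of the ranges that Unicode Standard Annex #11 classifies as East Asian Ambiguous (some Greek letters, arrows, enclosed alphanumerics and box drawing characters).

```c
#include <stddef.h>

struct cp_range
{
  unsigned long first;
  unsigned long last;
};

/* Illustrative excerpt only, NOT the complete East Asian Ambiguous set. */
static const struct cp_range ambiguous_ranges[] = {
  { 0x0391, 0x03A1 },   /* Greek capital letters Alpha..Rho */
  { 0x03B1, 0x03C1 },   /* Greek small letters alpha..rho */
  { 0x2190, 0x2199 },   /* Arrows */
  { 0x2460, 0x24E9 },   /* Enclosed Alphanumerics */
  { 0x2500, 0x254B },   /* Box Drawing */
};

/* Hypothetical helper: returns 2 or 1 for a code point found in the
   excerpt above, depending on whether the wide interpretation is
   selected, and -1 if the code point is not in the excerpt (a real
   implementation would fall through to the regular width logic). */
int
ambiguous_width (unsigned long cp, int cjk_wide)
{
  size_t i;

  for (i = 0; i < sizeof ambiguous_ranges / sizeof ambiguous_ranges[0]; i++)
    if (cp >= ambiguous_ranges[i].first && cp <= ambiguous_ranges[i].last)
      return cjk_wide ? 2 : 1;
  return -1;
}
```

Here cjk_wide stands for whatever flag the locale code derives from the modifier (lc_ctype_cjk_lang in the patches discussed in this thread). U+2192 RIGHTWARDS ARROW, for instance, would be reported as 1 cell wide without the flag and 2 cells wide with it.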
* Re: [PATCH] Add "@cjknarrow" modifier (was Re: [Fwd: [1.7] wcwidth failing configure tests])
From: IWAMURO Motonori @ 2009-06-28 8:18 UTC
To: cygwin

Hi.

2009/6/27 Andy Koppe <andy.koppe@gmail.com>:
> And then there's the Linux compatibility angle, where ja_JP.UTF-8
> means ambiguous width 1 not 2.

Please do not judge this based on the current behaviour of Linux, because:

- I don't think that behaviour is correct.
- I am currently writing a patch for the problem.

--
IWAMURO Motonori <http://vmi.jp/>
* Re: [Fwd: [1.7] wcwidth failing configure tests]
From: IWAMURO Motonori @ 2009-06-06 12:22 UTC
To: newlib, cygwin

# Continuation of the discussion.
#
# I hope that all applications will work correctly just by setting
# "LANG=ja_JP.UTF-8".
# I do not want to give up using the binary packages, or to keep
# applying many local patches.

> I don't think that it is the good idea because:
>
> - It is "a cygwin-specific solution (or workaround)".
> - In NetBSD, the change to which wcwidth of East Asian Ambiguous
>   Characters returns 2 by CJK locale is planned.

- Also, I don't think special cases should be given priority over
  general cases.

>> - I heard that there is an existing implementation that behave like my
>> proposal. (Sorry, I didn't hear the system name.)
> Even if so, I think the way I described is more compatible with the locale
> mechanism as used elsewhere.

I think that ALL locale implementations should treat the width of East
Asian Ambiguous characters as 2 in CJK locales.

>> It is no problem because we -- most Japanese language users -- need
>> not change the settings of mintty and locale after first setup.
>> We set LANG=ja_JP.UTF-8 and select a Japanese font for mintty.
> In any case, mined running in mintty will detect CJK width itself,
> regardless of locale setting, with coming versions of both programs
> even when it gets changed on-the-fly :)

Sorry, I can't understand the above because I am not good at English.

> This sounds complicated.

I don't think so.

I think that we should consider the following issues if a new mechanism
is introduced. The existing locale / terminal APIs don't support:

- Unicode BiDi.
- Unicode control characters.
- Unicode combining characters.
- Multilingualization. (*)
- Detecting the font/fontset information selected in the terminal
  emulator (including the case where there is no tty).

* Currently we can't use Japanese, Chinese, and Korean at the same time
  even if we use Unicode, because many font glyphs are quite different in
  each language even when the code point is the same.

> With my proposal, an application that wishes to auto-adjust on width
> properties (maybe even when changing) and which (unlike mined) uses
> the system wcwidth functions could proceed as follows:
> * Detect CJK width by using a simple test string width detection.
> * (Optional) When receiving a SIGWINCH signal (future version of MinTTY),
>   repeat this detection.
> * If e.g. LC_CTYPE starts with "ja_JP.UTF-8", call setlocale with
>   either "ja_JP.UTF-8@cjkwidth" or "ja_JP.UTF-8".

How would it be detected? An application that uses wcwidth is not
necessarily run in a terminal emulator (e.g. a text formatter).

>> > I'm not happy with the idea of a cygwin-specific solution (or workaround).
>> I think that it is not cygwin-specific solution.
> As I tried to suggest above, using "UTF-8" for different width data on one
> system would be quite specific, using the "@" modifier syntax would not.

"UTF-8" is only an encoding scheme. It does not specify the character width.

--
IWAMURO Motonori <http://vmi.jp/>
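Editorial sketch: the last step of the procedure quoted above (choosing between a plain locale name and one carrying a width modifier before calling setlocale) might be written as follows. The helper name locale_with_modifier is hypothetical, and the modifier name is passed in as a parameter because the thread uses "@cjknarrow", "@cjkwide" and "@cjkwidth" in different proposals. How the width is detected in the first place, which is the question raised in the message above, is deliberately left out.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy LOCALE into OUT with any existing "@modifier"
   removed and, if MODIFIER is not NULL, "@MODIFIER" appended.
   Returns 0 on success, -1 if OUT is too small. */
int
locale_with_modifier (const char *locale, const char *modifier,
                      char *out, size_t outsize)
{
  /* Length of the part before any existing '@'. */
  size_t base_len = strcspn (locale, "@");
  int n;

  if (modifier != NULL)
    n = snprintf (out, outsize, "%.*s@%s", (int) base_len, locale, modifier);
  else
    n = snprintf (out, outsize, "%.*s", (int) base_len, locale);

  return (n >= 0 && (size_t) n < outsize) ? 0 : -1;
}
```

An application following the quoted steps would then pass the resulting string to setlocale (LC_CTYPE, ...), for example "ja_JP.UTF-8@cjkwide" when wide rendering was detected and "ja_JP.UTF-8" otherwise. Whether setlocale accepts the modifier depends on the C library in use.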
* Re: [Fwd: [1.7] wcwidth failing configure tests]
From: IWAMURO Motonori @ 2009-05-26 16:46 UTC
To: newlib, cygwin

I would like to correct my proposal.

2009/5/15 IWAMURO Motonori <deenheart@gmail.com>:
> I propose to use *_cjk() when the language part of LC_CTYPE
> is 'ja', 'ko', 'vi' or 'zh'.

The corrected condition is that the language part of LC_CTYPE is 'ja',
'ko', or 'zh'. I have removed 'vi' (on the advice of a maintainer of the
NetBSD locale code).

--
IWAMURO Motonori <http://vmi.jp/>
Thread overview: 36+ messages
2009-05-12 16:54 [Fwd: [1.7] wcwidth failing configure tests] Corinna Vinschen
2009-05-12 16:56 ` Andy Koppe
2009-05-12 17:32 ` Corinna Vinschen
2009-05-13 19:04 ` Andy Koppe
2009-05-13 19:40 ` Corinna Vinschen
2009-05-13 19:55 ` Andy Koppe
2009-05-14 15:58 ` IWAMURO Motonori
2009-05-14 17:26 ` Corinna Vinschen
2009-05-14 21:51 ` Jeff Johnston
2009-05-15 11:43 ` Corinna Vinschen
2009-05-20 16:52 ` Thomas Wolff
2009-05-20 19:41 ` IWAMURO Motonori
2009-06-05 16:25 ` Thomas Wolff
2009-06-06 7:24 ` Andy Koppe
2009-06-06 12:53 ` IWAMURO Motonori
2009-06-06 9:31 ` Corinna Vinschen
2009-06-06 9:56 ` Andy Koppe
2009-06-06 13:06 ` IWAMURO Motonori
[not found] ` <3f0ad08d0906060242t275a78e7tb9913bf78d1c5e83@mail.gmail.com>
2009-06-06 9:46 ` IWAMURO Motonori
2009-06-12 18:56 ` Thomas Wolff
2009-06-12 19:12 ` Corinna Vinschen
2009-06-15 0:30 ` IWAMURO Motonori
2009-06-15 4:34 ` IWAMURO Motonori
2009-06-15 11:43 ` [PATCH] Add "@cjknarrow" modifier (was Re: [Fwd: [1.7] wcwidth failing configure tests]) Corinna Vinschen
2009-06-15 15:58 ` IWAMURO Motonori
2009-06-15 17:08 ` Corinna Vinschen
2009-06-15 17:14 ` IWAMURO Motonori
2009-06-18 15:57 ` Thomas.Wolff
2009-06-18 16:49 ` Corinna Vinschen
2009-06-19 0:08 ` Andy Koppe
2009-06-19 14:45 ` Thomas Wolff
2009-06-19 14:49 ` Corinna Vinschen
2009-06-27 22:03 ` Andy Koppe
2009-06-28 8:18 ` IWAMURO Motonori
2009-06-06 12:22 ` [Fwd: [1.7] wcwidth failing configure tests] IWAMURO Motonori
2009-05-26 16:46 ` IWAMURO Motonori