public inbox for cygwin@cygwin.com
* non-BMP character width
@ 2009-09-16 11:48 Thomas Wolff
  2009-09-21 16:34 ` Corinna Vinschen
  0 siblings, 1 reply; 8+ messages in thread
From: Thomas Wolff @ 2009-09-16 11:48 UTC (permalink / raw)
  To: cygwin

Hi,
I see one small remaining glitch with Unicode display; non-BMP characters 
(those with Unicode value > 0xFFFF) are displayed as two boxes.
The reason is probably related to their representation as two 
surrogates at some point.
I do not expect to have visible display of non-BMP in the cygwin 
console, esp. as the two available console fonts, Raster and Lucida, 
don't support them anyway. But they should at least have the proper 
width, i.e. one box instead of two.
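
For readers unfamiliar with the surrogate representation mentioned above, here is a minimal sketch of how one non-BMP code point is split into two UTF-16 code units. The function name to_surrogates is made up for illustration and is not taken from Cygwin or the Windows API.

```c
#include <assert.h>
#include <stdint.h>

/* Split a non-BMP code point (U+10000..U+10FFFF) into a UTF-16
   surrogate pair.  Hypothetical helper, for illustration only. */
void to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    cp -= 0x10000;                             /* 20 significant bits remain */
    *hi = (uint16_t)(0xD800 | (cp >> 10));     /* upper 10 bits: high surrogate */
    *lo = (uint16_t)(0xDC00 | (cp & 0x3FF));   /* lower 10 bits: low surrogate */
}
```

For U+20000 this gives the pair D840 DC00, i.e. two 16-bit units for a single character, which is presumably where the two boxes come from.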

Kind regards,
Thomas

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple


* Re: non-BMP character width
  2009-09-16 11:48 non-BMP character width Thomas Wolff
@ 2009-09-21 16:34 ` Corinna Vinschen
  2009-09-21 16:53   ` Lapo Luchini
  0 siblings, 1 reply; 8+ messages in thread
From: Corinna Vinschen @ 2009-09-21 16:34 UTC (permalink / raw)
  To: cygwin

On Sep 16 13:48, Thomas Wolff wrote:
> Hi,
> I see one small remaining glitch with Unicode display; non-BMP characters 
> (those with Unicode value > 0xFFFF) are displayed as two boxes.
> The reason is probably related to their representation as two 
> surrogates at some point.
> I do not expect to have visible display of non-BMP in the cygwin 
> console, esp. as the two available console fonts, Raster and Lucida, 
> don't support them anyway. But they should at least have the proper 
> width, i.e. one box instead of two.

Can you please create a simple self-contained testcase?  I'm not exactly
sure how this is supposed to work and if a solution exists.  Is that a
problem for the non-UTF-8 case, too, or for UTF-8 only?


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat


* Re: non-BMP character width
  2009-09-21 16:34 ` Corinna Vinschen
@ 2009-09-21 16:53   ` Lapo Luchini
  2009-09-21 17:58     ` Corinna Vinschen
  0 siblings, 1 reply; 8+ messages in thread
From: Lapo Luchini @ 2009-09-21 16:53 UTC (permalink / raw)
  To: cygwin

Corinna Vinschen wrote:
> On Sep 16 13:48, Thomas Wolff wrote:
>> Hi,
>> I see one small remaining glitch with Unicode display; non-BMP characters 
>> (those with Unicode value > 0xFFFF) are displayed as two boxes.
> 
> Can you please create a simple self-contained testcase?  I'm not exactly
> sure how this is supposed to work and if a solution exists.  Is that a
> problem for the non-UTF-8 case, too, or for UTF-8 only?

I guess he meant something like U+10001, which seems to be assigned to
the Linear B script in the DecodeUnicode database:

𐀁 = http://www.decodeunicode.org/U+10001
UTF-8 as F0 90 80 81

Or this example (I guess that's traditional Chinese?) taken from en.wiki:
𤭢 = http://www.decodeunicode.org/U+24B62
UTF-8 as F0 A4 AD A2
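
The byte sequences quoted above can be checked with a small encoder. This is only a sketch of standard UTF-8 encoding; the name utf8_encode is hypothetical and nothing here is taken from Cygwin's sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 into out[] and return the
   number of bytes written (1..4).  Hypothetical helper. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(cp <= 0x10FFFF);
    assert(cp < 0xD800 || cp > 0xDFFF);   /* surrogates are not characters */
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    /* non-BMP: always four bytes, lead byte 0xF0..0xF4 */
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

Calling utf8_encode(0x10001, buf) fills buf with F0 90 80 81, and utf8_encode(0x24B62, buf) with F0 A4 AD A2, matching the two examples.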

-- 
Lapo Luchini - http://lapo.it/



* Re: non-BMP character width
  2009-09-21 16:53   ` Lapo Luchini
@ 2009-09-21 17:58     ` Corinna Vinschen
  2009-09-22  4:57       ` Lapo Luchini
  0 siblings, 1 reply; 8+ messages in thread
From: Corinna Vinschen @ 2009-09-21 17:58 UTC (permalink / raw)
  To: cygwin

On Sep 21 18:52, Lapo Luchini wrote:
> Corinna Vinschen wrote:
> > On Sep 16 13:48, Thomas Wolff wrote:
> >> Hi,
> >> I see one small remaining glitch with Unicode display; non-BMP characters 
> >> (those with Unicode value > 0xFFFF) are displayed as two boxes.
> > 
> > Can you please create a simple self-contained testcase?  I'm not exactly
> > sure how this is supposed to work and if a solution exists.  Is that a
> > problem for the non-UTF-8 case, too, or for UTF-8 only?
> 
> I guess he meant something like U+10001, which seems to be assigned to
> the Linear B script in the DecodeUnicode database:

Sure.  I was specifically asking for a testcase, preferably in
plain C, which allows reproducing this under a debugger.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat


* Re: non-BMP character width
  2009-09-21 17:58     ` Corinna Vinschen
@ 2009-09-22  4:57       ` Lapo Luchini
  2009-09-22  9:57         ` Corinna Vinschen
  2009-09-24 15:14         ` Thomas Wolff
  0 siblings, 2 replies; 8+ messages in thread
From: Lapo Luchini @ 2009-09-22  4:57 UTC (permalink / raw)
  To: [ML] CygWin, Thomas Wolff

Corinna Vinschen wrote:
> Sure.  I was specifically asking for a testcase, preferably in
> plain C, which allows reproducing this under a debugger.

Actually, I can't reproduce that, but I guess it's a problem of the
specific console he's using (Thomas, which one is that?): on mintty it
works ok (I'm not really sure it outputs U+10001, but it surely shows a
single box) and on rxvt it just shows as four ISO-8859-1 chars:
(as expected, as native rxvt doesn't support Unicode)

mintty% echo "-\xF0\x90\x80\x81-"
-�-
rxvt% echo "-\xF0\x90\x80\x81-"
-𐀁-

Also ok on `ls`:

% cat s.c
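#include <stdio.h>
int main(void) {
    /* create an empty file whose name contains U+10001 as UTF-8 */
    FILE *f = fopen("a-\xF0\x90\x80\x81", "w");
    if (f) fclose(f);
    return 0;
}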
% ./s
% ls -l|fgrep a-
-rw-r--r-- 1 lapo None     0 22 Sep 06:50 a-�

-- 
Lapo Luchini - http://lapo.it/

“The future is not google-able.” (William Gibson, 2004-02-05)


* Re: non-BMP character width
  2009-09-22  4:57       ` Lapo Luchini
@ 2009-09-22  9:57         ` Corinna Vinschen
  2009-09-24 15:14         ` Thomas Wolff
  1 sibling, 0 replies; 8+ messages in thread
From: Corinna Vinschen @ 2009-09-22  9:57 UTC (permalink / raw)
  To: cygwin

On Sep 22 06:57, Lapo Luchini wrote:
> Corinna Vinschen wrote:
> > Sure.  I was specifically asking for a testcase, preferably in
> > plain C, which allows reproducing this under a debugger.
> 
> Actually, I can't reproduce that, but I guess it's a problem of the
> specific console he's using (Thomas, which one is that?): on mintty it
> works ok (I'm not really sure it outputs U+10001, but it surely shows a
> single box) and on rxvt it just shows as four ISO-8859-1 chars:
> (as expected, as native rxvt doesn't support Unicode)
> 
> mintty% echo "-\xF0\x90\x80\x81-"
> -???-
> rxvt% echo "-\xF0\x90\x80\x81-"
> -ð???-
> 
> Also ok on `ls`:
> 
> % cat s.c
> int main() {
>     fopen("a-\xF0\x90\x80\x81", "w");
>     return 0;
> }
> % ./s
> % ls -l|fgrep a-
> -rw-r--r-- 1 lapo None     0 22 Sep 06:50 a-???

Uh, I see.  That occurs in the normal Windows console.  This is not
Cygwin's fault.  Cygwin's console code converts the multibyte string to
the WCHAR representation and prints it to the console using the
WriteConsoleW function.  That function prints two blocks/question marks
for a surrogate pair.  Look at the file in a cmd shell, it will also
print two blocks/question marks for the surrogate pair.
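
As a rough illustration of the conversion step described above: after a UTF-8 string has been converted to 16-bit WCHARs, each four-byte sequence occupies two units. The sketch below only counts units; it assumes well-formed UTF-8 input, the name utf16_units is hypothetical, and it is not Cygwin's actual console code.

```c
#include <assert.h>
#include <stddef.h>

/* Count how many UTF-16 code units a well-formed, NUL-terminated
   UTF-8 string needs.  Hypothetical helper; no validation is done. */
size_t utf16_units(const char *s)
{
    size_t units = 0;
    assert(s != NULL);
    for (const unsigned char *p = (const unsigned char *)s; *p; p++) {
        if ((*p & 0xC0) == 0x80)
            continue;                     /* continuation byte, already counted */
        units += (*p >= 0xF0) ? 2 : 1;    /* 4-byte lead => surrogate pair */
    }
    return units;
}
```

For the test string "-\xF0\x90\x80\x81-" this returns 4, not 3: the single non-BMP character accounts for two of the units handed to the console.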


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat


* Re: non-BMP character width
  2009-09-22  4:57       ` Lapo Luchini
  2009-09-22  9:57         ` Corinna Vinschen
@ 2009-09-24 15:14         ` Thomas Wolff
  2009-09-24 15:33           ` andy.koppe
  1 sibling, 1 reply; 8+ messages in thread
From: Thomas Wolff @ 2009-09-24 15:14 UTC (permalink / raw)
  To: cygwin

Corinna Vinschen wrote:
> Can you please create a simple self-contained testcase?  I'm not exactly
> sure how this is supposed to work and if a solution exists.  Is that a
> problem for the non-UTF-8 case, too, or for UTF-8 only?

Sorry for the late response; I see you reproduced the case meanwhile -
anyway, here is a test case, to be used with gcc or just with cat:

/* print U+20000 𠀀 */
#include <stdio.h>

int main (void) {
  printf ("<U+20000> is <𠀀>\n");
  return 0;
}

where you could enter the character in mined with Control-V #20000 Enter :)

About non-UTF-8, I tried to test in Big5, using character 0x8750 which is U+242BF,
and the test suggests it's OK (in cygwin console, mintty, and rxvt-unicode); 
however, that may not be significant since although its Unicode code 
point is non-BMP, the Big5 character is only 16 bits and Windows, 
having supported CJK before Unicode, probably doesn't handle this via Unicode.
I also tried to test eucJP, but that doesn't seem to work at all and mintty crashes...

See my other comment below, please.


On Sep 22 06:57, Lapo Luchini wrote:
> ...
> Actually, I can't reproduce that, but I guess it's a problem of the
> specific console he's using (Thomas, which one is that?): on mintty it
> works ok (I'm not really sure it outputs U+10001, but it surely shows a
> single box)...
The problem used to be in mintty as well until I pointed it out and 
Andy was ambitious enough to find a workaround - maybe he could supply a 
code snippet which would fix this in the cygwin console too, despite 
the bug originating in the Windows API...

> and on rxvt it just shows as four ISO-8859-1 chars:
> (as expected, as native rxvt doesn't support Unicode)
You would have to test this with rxvt-unicode (urxvt in cygwin) 
where the test case passes (one box). (Not very relevant maybe, 
if reports are true that rxvt is not maintained anymore.)

Corinna wrote:
> > ...
> Uh, I see.  That occurs in the normal Windows console.  This is not
> Cygwin's fault.  Cygwin's console code converts the multibyte string to
> the WCHAR representation and prints it to the console using the
> WriteConsoleW function.  That function prints two blocks/question marks
> for a surrogate pair.  Look at the file in a cmd shell, it will also
> print two blocks/question marks for the surrogate pair.
I was assuming that, as with mintty, the fault was not in the cygwin domain; 
however, as there is a workaround, I thought it would be nice to have it in 
the cygwin console as well.

Kind regards,
Thomas


* Re: non-BMP character width
  2009-09-24 15:14         ` Thomas Wolff
@ 2009-09-24 15:33           ` andy.koppe
  0 siblings, 0 replies; 8+ messages in thread
From: andy.koppe @ 2009-09-24 15:33 UTC (permalink / raw)
  To: cygwin

2009/9/24 Thomas Wolff:
> I also tried to test eucJP, but that doesn't seem to work at all and mintty crashes...

Ouch. Details?

> The problem used to be in mintty as well until I pointed it out and
> Andy was ambitious enough to find a workaround

Yep, given a font that actually supports them, e.g. SimSunExtB,
non-BMP chars should display correctly in mintty 0.5.

> - maybe he could supply a
> code snippet which would fix this in the cygwin console too, despite
> the bug originating in the Windows API...

'fraid not. Mintty uses the Win32 GUI function ExtTextOut to paint
characters in its window, and that function does support surrogates.
The Cygwin DLL uses WriteConsole, which apparently doesn't support
them, and only MS can change that.
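
To make the difference concrete, here is a sketch of what "supporting surrogates" means when counting character cells in a UTF-16 buffer: a high surrogate followed by a low surrogate is treated as one character. This is not mintty's code, and the name count_cells is hypothetical. It also assumes one cell per code point, so it ignores double-width and combining characters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count character cells in a UTF-16 buffer, treating each valid
   surrogate pair as a single character.  Hypothetical helper;
   assumes every code point is one cell wide. */
size_t count_cells(const uint16_t *buf, size_t len)
{
    size_t cells = 0;
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        int is_high = buf[i] >= 0xD800 && buf[i] <= 0xDBFF;
        if (is_high && i + 1 < len &&
            buf[i + 1] >= 0xDC00 && buf[i + 1] <= 0xDFFF)
            i++;            /* low surrogate belongs to the same character */
        cells++;            /* a lone surrogate still takes one cell */
    }
    return cells;
}
```

For the units 002D D800 DC01 002D (that is, "-", U+10001, "-") this yields 3 cells, whereas a per-unit count gives 4.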

Andy

