public inbox for cygwin@cygwin.com
* Encoding of German 'umlauts' - please explain
@ 2009-09-24  8:20 Ronald Fischer
  2009-09-24 10:26 ` Matthias Andree
  2009-09-24 15:31 ` Thomas Wolff
  0 siblings, 2 replies; 3+ messages in thread
From: Ronald Fischer @ 2009-09-24  8:20 UTC (permalink / raw)
  To: cygwin

Maybe someone could enlighten me about the following:

On Cygwin bash I see

$ echo ü | od -cx
0000000 374  \n
        0afc
0000002

That means the German letter ü is encoded as 0xFC. If I do the same in the CMD shell
(the 'od' used here comes from the GNU utilities for Windows), I see:

  echo ü | od -cx
0000000 201      \r  \n
        2081 0a0d
0000004

That is, ü is encoded as 0x81. Why is this different?

I am aware that, for historical reasons, different encodings exist (the old
DOS encoding, the Windows ANSI encoding, etc.). I wouldn't have expected those
differences, however, when comparing bash.exe vs. cmd.exe.
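
The 'od -cx' output above can be read as follows: -x prints the input as
16-bit words, and on a little-endian machine the first byte of each pair
ends up in the low half of the word. A minimal Python sketch of that
grouping, assuming a little-endian host (the function name od_x_words is
made up for illustration, and the byte strings are taken from the two dumps
above):

```python
import struct

def od_x_words(data):
    """Group bytes into 16-bit little-endian words, the way od -x prints them."""
    if len(data) % 2:
        data += b"\x00"  # pad an odd trailing byte with zero
    count = len(data) // 2
    return ["%04x" % word for word in struct.unpack("<%dH" % count, data)]

bash_bytes = b"\xfc\n"     # bytes seen under Cygwin bash: ü, newline
cmd_bytes = b"\x81 \r\n"   # bytes seen under CMD: ü, space, CR, LF

print(od_x_words(bash_bytes))  # ['0afc']
print(od_x_words(cmd_bytes))   # ['2081', '0a0d']
```

So "0afc" is the byte 0xFC followed by a newline, and "2081 0a0d" is the
byte 0x81 followed by a space, a carriage return and a newline.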



--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple


* Re: Encoding of German 'umlauts' - please explain
  2009-09-24  8:20 Encoding of German 'umlauts' - please explain Ronald Fischer
@ 2009-09-24 10:26 ` Matthias Andree
  2009-09-24 15:31 ` Thomas Wolff
  1 sibling, 0 replies; 3+ messages in thread
From: Matthias Andree @ 2009-09-24 10:26 UTC (permalink / raw)
  To: cygwin

Ronald Fischer wrote:
> Maybe someone could enlighten me about the following:
> 
> On Cygwin bash I see
> 
> $ echo ü | od -cx
> 0000000 374  \n
>         0afc
> 0000002
> 
> That means, the German letter ü has encoding 0xFC. If I do the same on CMD shell
> (the 'od' used here comes from the Gnu Utilities for Windows), I see:
> 
>   echo ü | od -cx
> 0000000 201      \r  \n
>         2081 0a0d
> 0000004
> 
> That is, ü is encoded as 0x81. Why is this different?

Because the code pages differ. 0xFC is ü in ISO-8859-1 ("Latin 1"), in
ISO-8859-15 ("Latin 9") and in CP1252/Windows-1252 (an extension of Latin 1
that allocates 0x80...0x9f differently than ISO-8859-1), whereas CMD uses
CP437 or CP850, both of which encode ü as 0x81.
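
To check the code page claim without depending on any terminal, one can
encode the same character with each code page. This is only a sketch that
relies on Python's built-in codecs; the function name encode_in_code_pages
is made up for illustration:

```python
# Code pages named in this thread, spelled as Python codec names.
CODE_PAGES = ["latin-1", "iso8859-15", "cp1252", "cp437", "cp850"]

def encode_in_code_pages(char, code_pages=CODE_PAGES):
    """Return {code page name: hex string of the encoded bytes}."""
    return {name: char.encode(name).hex() for name in code_pages}

encodings = encode_in_code_pages("\u00fc")  # U+00FC is ü
for name, hex_bytes in encodings.items():
    print(f"{name:12s} 0x{hex_bytes.upper()}")
```

The first three code pages give 0xFC, while cp437 and cp850 give 0x81.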



* Re: Encoding of German 'umlauts' - please explain
  2009-09-24  8:20 Encoding of German 'umlauts' - please explain Ronald Fischer
  2009-09-24 10:26 ` Matthias Andree
@ 2009-09-24 15:31 ` Thomas Wolff
  1 sibling, 0 replies; 3+ messages in thread
From: Thomas Wolff @ 2009-09-24 15:31 UTC (permalink / raw)
  To: cygwin

Ronald Fischer wrote:
> Maybe someone could enlighten me about the following:
> ...
> That means, the German letter ü has encoding 0xFC. If I do the same on CMD shell
> (the 'od' used here comes from the Gnu Utilities for Windows), I see:
> ...
> That is, ü is encoded as 0x81. Why is this different?

> I am aware that, for historic reason, different encodings exist (the old
> DOS encoding, Windows ANSI encoding etc.).
So you answered your question yourself :)
> I wouldn't have expected those
> differences, however, when comparing bash.exe vs. cmd.exe.

The encoding is applied by the terminal, not the application. For bash, 
the letter ü is only a sequence of one or two bytes. The terminal decides 
which bytes your keyboard sends to the application when you enter ü, and 
what to display when your program outputs those bytes. (That is the 
traditional picture, at least; in the age of locales things may sometimes 
get more complicated :( )
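
The display side of this can be sketched in Python: the same byte value 
turns into a different character depending on which code page is used to 
interpret it. The helper name show_byte is made up for illustration, and 
the three code pages are just the ones discussed in this thread:

```python
import unicodedata

def show_byte(value, code_pages=("latin-1", "cp437", "cp850")):
    """Map one byte value to the character each code page assigns to it."""
    return {name: bytes([value]).decode(name) for name in code_pages}

for value in (0xFC, 0x81):
    for name, char in show_byte(value).items():
        # Control characters have no Unicode name, hence the fallback label.
        label = unicodedata.name(char, "<control or unnamed>")
        print(f"0x{value:02X} in {name}: U+{ord(char):04X} {label}")
```

Only the latin-1 interpretation of 0xFC, and the cp437 and cp850 
interpretations of 0x81, come out as LATIN SMALL LETTER U WITH DIAERESIS.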

Having said this, I also need to adjust the following response:

Matthias Andree wrote:
> Because the code pages differ. 0xFC is ISO-8859-1 ("Latin 1") or -15 ("Latin 9")
> or CP1252/Windows-1252 (Latin 1 Extended; the latter allocates 0x80...0x9f
> differently than ISO-8859-1) and CMD uses CP437 or CP850.

This is not really correct; like bash, CMD does not use a codepage itself.
If you start CMD from Windows, it will implicitly be embedded in a Windows 
console which uses CP437 (American), CP850 (Western European) or some other 
default of your system configuration.

However, you could also run CMD from a cygwin bash. In this case, maximising 
the confusion, there are two different situations:
* Run mintty, start CMD from bash there: CMD will see the same codepage as 
  bash since it is the one configured for mintty. So echo ü would produce 
  0xFC even in CMD (assuming mintty runs one of the codepages which map 
  ü to 0xFC).
* Run cygwin console, observe this: The cygwin console is a hybrid, because 
  the cygwin dll emulates the encoding within a Windows console. Unlike in 
  all other terminals, the effective "codepage" therefore varies with the 
  application: a cygwin application will use the encoding configured for the 
  cygwin session, while any non-cygwin application will use the native 
  Windows console codepage. So you may echo ü from bash, then start CMD from 
  there, echo ü again, and get different codes for the same key!
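
The cygwin console case can be pictured with a toy model in Python. This is 
not how the cygwin dll is implemented; the function name bytes_for_key and 
the two default encodings (latin-1 for the cygwin session, cp850 for the 
Windows console) are assumptions chosen to match the numbers in this thread:

```python
def bytes_for_key(char, is_cygwin_app,
                  session_encoding="latin-1",   # assumed cygwin session setting
                  console_codepage="cp850"):    # assumed Windows console code page
    """Toy model: which bytes an application receives for a typed character."""
    encoding = session_encoding if is_cygwin_app else console_codepage
    return char.encode(encoding)

key = "\u00fc"  # the ü key
print("bash (cygwin app):    ", bytes_for_key(key, is_cygwin_app=True).hex())
print("CMD (non-cygwin app): ", bytes_for_key(key, is_cygwin_app=False).hex())
```

With these assumed settings the model prints fc for bash and 81 for CMD, 
i.e. two different codes for the same key within one console window.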

Kind regards,
Thomas

