Re: UTF-8 character encoding

From: Thomas Wolff <towo@towo.net>
To: cygwin@cygwin.com
Subject: Re: UTF-8 character encoding
Date: Tue, 26 Jun 2018 21:39:00 -0000
Message-ID: <981ba1fe-7961-5ed0-e3c7-a5717af8c141@towo.net>
In-Reply-To: <CAD8GWsuevQX6fBUzkEvUs5rBPehhG7-ht+FPZU=eOaACF5uCPg@mail.gmail.com>

On 25.06.2018 at 20:33, Lee wrote:
> On 6/24/18, L A Walsh <cygwin@tlinx.org> wrote:
>> Lee wrote:
>>> So... keep it simple, set
>>>    LANG=en_US.UTF-8
>>> and use vi or something else that comes with cygwin to create the file
>>> and I'll have a file with UTF-8 character encoding - correct?
>> ---
>> 	The first 128 characters of UTF-8 (0x00-0x7f) are identical to
>> the first 128 characters of ASCII and of latin1 (iso-8859-1).
>>
>> If you don't use any characters that need accents or special symbols,
>> then the file's bytes will be the same as plain ASCII, because it's only
>> the characters beyond the first 128 that UTF-8 encodes as multiple bytes
>> (see chart @ http://www.babelstone.co.uk/Unicode/babelmap.html).
> I'm still trying to figure utf-8 out, but it seems to me that 0x0 -
> 0xff is part of the utf-8 encoding.  This chart makes things clearer
> ... at least for me :)
>      http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
>   The proposed UCS transformation format encodes UCS values in the range
>   [0,0x7fffffff] using multibyte characters of lengths 1, 2, 3, 4, and 5
>   bytes.  For all encodings of more than one byte, the initial byte
>   determines the number of bytes used and the high-order bit in each byte
>   is set.
>
>   An easy way to remember this transformation format is to note that the
>   number of high-order 1's in the first byte is the same as the number of
>   subsequent bytes in the multibyte character:
>
>      Bits  Hex Min  Hex Max         Byte Sequence in Binary
>   1    7  00000000 0000007f 0zzzzzzz
>   2   13  00000080 0000207f 10zzzzzz 1yyyyyyy
>   3   19  00002080 0008207f 110zzzzz 1yyyyyyy 1xxxxxxx
>   4   25  00082080 0208207f 1110zzzz 1yyyyyyy 1xxxxxxx 1wwwwwww
>   5   31  02082080 7fffffff 11110zzz 1yyyyyyy 1xxxxxxx 1wwwwwww 1vvvvvvv
This encoding scheme is wrong; where did you get it from? Maybe it's the 
obsolete UTF-8...
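
For comparison, here is a minimal sketch of the byte layout that current
UTF-8 (as specified in RFC 3629) uses, written in Python. The function name
utf8_encode is made up for this illustration; the loop checks the result
against Python's built-in encoder. Note that continuation bytes always
start with the bits 10, and that lead bytes of multibyte sequences start
with 110, 1110 or 11110, which is not what the table quoted above shows.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point by hand (illustrative helper, RFC 3629 layout)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:
        # 1 byte:  0xxxxxxx  (identical to ASCII)
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    encoded = utf8_encode(cp)
    # Cross-check the hand-written layout against Python's own encoder.
    assert encoded == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} ->", " ".join(f"{b:02x}" for b in encoded))
```

U+0041 comes out as the single byte 41, the same as in ASCII and latin1,
while U+00E9 becomes the two bytes c3 a9 rather than the single latin1
byte e9.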

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

