UTF-8 character encoding

public inbox for cygwin@cygwin.com

* UTF-8 character encoding
@ 2018-06-21  7:20 Lee
  2018-06-21 10:12 ` Stefan Weil
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Lee @ 2018-06-21  7:20 UTC (permalink / raw)
  To: cygwin

I'm looking at
  https://cygwin.com/packaging-hint-files.html#pvr.hint
and it starts off with
  Use UTF-8 character encoding.

How do I do that and how do I check that I actually did use UTF-8
character encoding _without_ using file?

for whatever it's worth:
$ file unicode.html
unicode.html: HTML document, UTF-8 Unicode text

$ file test.c
test.c: C source, ASCII text

I used vi to create both files & I'd like to understand why file says
one is ascii & the other is utf-8

Thanks,
Lee

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple


* Re: UTF-8 character encoding
  2018-06-21  7:20 UTF-8 character encoding Lee
@ 2018-06-21 10:12 ` Stefan Weil
  2018-06-21 10:39 ` Andrey Repin
  2018-06-21 18:49 ` Houder
  2 siblings, 0 replies; 13+ messages in thread
From: Stefan Weil @ 2018-06-21 10:12 UTC (permalink / raw)
  To: cygwin

On 20.06.2018 at 20:09, Lee wrote:
> I'm looking at
>   https://cygwin.com/packaging-hint-files.html#pvr.hint
> and it starts off with
>   Use UTF-8 character encoding.
> 
> How do I do that and how do I check that I actually did use UTF-8
> character encoding _without_ using file?
> 
> for whatever it's worth:
> $ file unicode.html
> unicode.html: HTML document, UTF-8 Unicode text
> 
> $ file test.c
> test.c: C source, ASCII text
> 
> I used vi to create both files & I'd like to understand why file says
> one is ascii & the other is utf-8
> 
> Thanks,
> Lee

ASCII is a subset of UTF-8, so that's fine.

The file command will report ASCII as long as your text does not contain
any non-ASCII characters. If you add some (for example ÄÖÜ), it should
report UTF-8.
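
A minimal sketch of that rule in Python, using only the standard library: bytes that are all below 0x80 are ASCII (and therefore also valid UTF-8), while anything else counts as UTF-8 only if it decodes cleanly. The function name `classify` is a hypothetical helper, not part of `file`, and it mimics only this one aspect of what `file` reports.

```python
def classify(data: bytes) -> str:
    """Roughly mimic the ASCII / UTF-8 distinction described above."""
    if all(b < 0x80 for b in data):
        return "ASCII"          # pure 7-bit data is also valid UTF-8
    try:
        data.decode("utf-8")    # strict decoding raises on malformed input
    except UnicodeDecodeError:
        return "not UTF-8"
    return "UTF-8"


if __name__ == "__main__":
    umlauts = "\u00c4\u00d6\u00dc"                   # the letters A, O, U with diaeresis
    print(classify(b"int main(void) { return 0; }\n"))
    print(classify(umlauts.encode("utf-8")))
    print(classify(umlauts.encode("iso-8859-1")))    # same letters, different bytes
```

To check a real file, read it in binary mode (`open(path, "rb").read()`) and pass the bytes to `classify`.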

Regards,
Stefan



* Re: UTF-8 character encoding
  2018-06-21  7:20 UTF-8 character encoding Lee
  2018-06-21 10:12 ` Stefan Weil
@ 2018-06-21 10:39 ` Andrey Repin
  2018-06-22  7:31   ` Lee
  2018-06-21 18:49 ` Houder
  2 siblings, 1 reply; 13+ messages in thread
From: Andrey Repin @ 2018-06-21 10:39 UTC (permalink / raw)
  To: Lee, cygwin

Greetings, Lee!

> I'm looking at
>   https://cygwin.com/packaging-hint-files.html#pvr.hint
> and it starts off with
>   Use UTF-8 character encoding.

> How do I do that and how do I check that I actually did use UTF-8
> character encoding _without_ using file?

https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

> for whatever it's worth:
> $ file unicode.html
> unicode.html: HTML document, UTF-8 Unicode text

> $ file test.c
> test.c: C source, ASCII text

> I used vi to create both files & I'd like to understand why file says
> one is ascii & the other is utf-8


-- 
With best regards,
Andrey Repin
Thursday, June 21, 2018 4:25:27

Sorry for my terrible english...



* Re: UTF-8 character encoding
  2018-06-21  7:20 UTF-8 character encoding Lee
  2018-06-21 10:12 ` Stefan Weil
  2018-06-21 10:39 ` Andrey Repin
@ 2018-06-21 18:49 ` Houder
  2018-06-21 20:46   ` Houder
  2 siblings, 1 reply; 13+ messages in thread
From: Houder @ 2018-06-21 18:49 UTC (permalink / raw)
  To: cygwin

On Wed, 20 Jun 2018 14:09:59, Lee wrote:
> I'm looking at
>   https://cygwin.com/packaging-hint-files.html#pvr.hint
> and it starts off with
>   Use UTF-8 character encoding.
> 
> How do I do that and how do I check that I actually did use UTF-8
> character encoding _without_ using file?
[snip]

> I used vi to create both files & I'd like to understand why file says
> one is ascii & the other is utf-8

vim can tell you that in the statusline ...

:help statusline
:help encoding

Ask Google to help you with the details: GS: "vim show encoding in status".

E.g.

 - http://vim.wikia.com/wiki/Show_fileencoding_and_bomb_in_the_status_line
   (Show fileencoding and bomb in the status line)

As an example:

set laststatus=2
"set statusline=...
set statusline+=\ en:\ %{strlen(&enc)\ ?\ &enc\ :\ 'x'}
"set statusline+...

Henri



* Re: UTF-8 character encoding
  2018-06-21 18:49 ` Houder
@ 2018-06-21 20:46   ` Houder
  0 siblings, 0 replies; 13+ messages in thread
From: Houder @ 2018-06-21 20:46 UTC (permalink / raw)
  To: cygwin

On Thu, 21 Jun 2018 12:12:39, Houder wrote:
> On Wed, 20 Jun 2018 14:09:59, Lee wrote:
> > I'm looking at
> >   https://cygwin.com/packaging-hint-files.html#pvr.hint
> > and it starts off with
> >   Use UTF-8 character encoding.
> > 
> > How do I do that and how do I check that I actually did use UTF-8
> > character encoding _without_ using file?
> [snip]
> 
> > I used vi to create both files & I'd like to understand why file says
> > one is ascii & the other is utf-8
> 
> vim can tell you that in the statusline ...
> 
> :help statusline
> :help encoding
> 
> Ask Google to help you with the details: GS: "vim show encoding in status".
> 
> E.g.
> 
>  - http://vim.wikia.com/wiki/Show_fileencoding_and_bomb_in_the_status_line
>    (Show fileencoding and bomb in the status line)
> 
> As an example:
> 
> set laststatus=2
> "set statusline=...
> set statusline+=\ en:\ %{strlen(&enc)\ ?\ &enc\ :\ 'x'}
> "set statusline+...

Also read:

 - https://unix.stackexchange.com/questions/23389/how-can-i-set-vims-default-encoding-to-utf-8
   (How can I set VIM's default encoding to UTF-8?)

for a "quickstart" on the subject of character encoding/vim.

Henri



* Re: UTF-8 character encoding
  2018-06-21 10:39 ` Andrey Repin
@ 2018-06-22  7:31   ` Lee
  2018-06-22 17:30     ` Andrey Repin
  2018-06-25  9:56     ` L A Walsh
  0 siblings, 2 replies; 13+ messages in thread
From: Lee @ 2018-06-22  7:31 UTC (permalink / raw)
  To: cygwin

On 6/20/18, Andrey Repin wrote:
> Greetings, Lee!
>
>> I'm looking at
>>   https://cygwin.com/packaging-hint-files.html#pvr.hint
>> and it starts off with
>>   Use UTF-8 character encoding.
>
>> How do I do that and how do I check that I actually did use UTF-8
>> character encoding _without_ using file?
>
> https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

I think I don't know enough to ask the right question.  A quick search
yesterday on byte order markers turned up
  https://msdn.microsoft.com/en-us/library/windows/desktop/dd374101(v=vs.85).aspx
with this bit
  Note   Microsoft uses UTF-16, little endian byte order.

So... keep it simple, set
  LANG=en_US.UTF-8
and use vi or something else that comes with cygwin to create the file
and I'll have a file with UTF-8 character encoding - correct?

Thanks,
Lee


* Re: UTF-8 character encoding
  2018-06-22  7:31   ` Lee
@ 2018-06-22 17:30     ` Andrey Repin
  2018-06-25  9:56     ` L A Walsh
  1 sibling, 0 replies; 13+ messages in thread
From: Andrey Repin @ 2018-06-22 17:30 UTC (permalink / raw)
  To: Lee, cygwin

Greetings, Lee!

> On 6/20/18, Andrey Repin wrote:
>> Greetings, Lee!
>>
>>> I'm looking at
>>>   https://cygwin.com/packaging-hint-files.html#pvr.hint
>>> and it starts off with
>>>   Use UTF-8 character encoding.
>>
>>> How do I do that and how do I check that I actually did use UTF-8
>>> character encoding _without_ using file?
>>
>> https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

> I think I don't know enough to ask the right question.  A quick search
> yesterday on byte order markers turned up
>  
> https://msdn.microsoft.com/en-us/library/windows/desktop/dd374101(v=vs.85).aspx
> with this bit
>   Note   Microsoft uses UTF-16, little endian byte order.

Yes, the default multibyte Windows encoding is UTF-16LE.
But in general, this is application-specific.

> So... keep it simple, set
>   LANG=en_US.UTF-8
> and use vi or something else that comes with cygwin to create the file
> and I'll have a file with UTF-8 character encoding - correct?

I'm not familiar with vi, but this is true of the other *NIX editors I know: they
use the current locale settings by default, unless something else is specified in
their configuration or prompted by other cues (such as a byte order mark).
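
As a rough illustration of "use the current locale settings by default", a program can ask which encoding the active locale implies. The sketch below uses Python's standard `locale` module; the helper name `locale_encoding` is an assumption made for illustration, and what it returns depends entirely on the environment (for instance on `LANG`).

```python
import locale


def locale_encoding() -> str:
    """Return the text encoding implied by the current locale settings."""
    try:
        # Adopt the locale named in the environment (LANG, LC_ALL, LC_CTYPE).
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        pass  # unsupported locale name: stay with the default "C" locale
    return locale.getpreferredencoding(False)


if __name__ == "__main__":
    print("default text encoding:", locale_encoding())
```

Running it once with and once without `LANG=en_US.UTF-8` set is a quick way to see whether the setting is being picked up.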

IMO, your best bet is to use an editor that explicitly supports saving text in
the desired encoding.
And please, no BOM for UTF-8 files.
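
A small sketch for checking the "no BOM" part, assuming the file contents have been read as bytes. The UTF-8 byte order mark is the three bytes EF BB BF, which Python exposes as `codecs.BOM_UTF8`; the helper names `has_utf8_bom` and `strip_utf8_bom` are hypothetical.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """True if the data starts with the UTF-8 byte order mark (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 BOM, if there is one."""
    return data[len(codecs.BOM_UTF8):] if has_utf8_bom(data) else data


if __name__ == "__main__":
    with_bom = "hint file text\n".encode("utf-8-sig")   # this codec prepends a BOM
    without_bom = "hint file text\n".encode("utf-8")
    print(has_utf8_bom(with_bom), has_utf8_bom(without_bom))
    print(strip_utf8_bom(with_bom) == without_bom)
```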


-- 
With best regards,
Andrey Repin
Friday, June 22, 2018 14:13:14

Sorry for my terrible english...



* Re: UTF-8 character encoding
  2018-06-22  7:31   ` Lee
  2018-06-22 17:30     ` Andrey Repin
@ 2018-06-25  9:56     ` L A Walsh
  2018-06-25 20:52       ` Lee
  1 sibling, 1 reply; 13+ messages in thread
From: L A Walsh @ 2018-06-25  9:56 UTC (permalink / raw)
  To: cygwin

Lee wrote:
> So... keep it simple, set
>   LANG=en_US.UTF-8
> and use vi or something else that comes with cygwin to create the file
> and I'll have a file with UTF-8 character encoding - correct?
---
	The first 128 characters of UTF-8 (code points 0-127) are identical to
the first 128 characters of ASCII and of latin1 (ISO-8859-1).

If you don't use any characters that need accents or special symbols,
then nothing will need a multibyte UTF-8 encoding, because it's only
the characters beyond those first 128 that are encoded differently
(see chart @ http://www.babelstone.co.uk/Unicode/babelmap.html).
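
A short sketch that makes this visible, assuming nothing beyond Python's built-in codecs: code points up to 127 take one byte in UTF-8 and produce the same bytes as ASCII and latin-1, while anything above that needs two or more bytes. The helper name `utf8_len` is made up for this example.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


if __name__ == "__main__":
    # U+0041 'A', U+007F DEL, U+00E9 e-acute, U+20AC euro sign, U+1F600 emoji
    for ch in ("\u0041", "\u007f", "\u00e9", "\u20ac", "\U0001f600"):
        print(f"U+{ord(ch):04X}: {utf8_len(ch)} byte(s), {ch.encode('utf-8').hex()}")

    # Plain ASCII text is byte-for-byte identical in all three encodings.
    text = "hello"
    print(text.encode("ascii") == text.encode("latin-1") == text.encode("utf-8"))

    # Above 127 the encodings diverge: latin-1 uses one byte, UTF-8 uses two.
    print("\u00e9".encode("latin-1").hex(), "\u00e9".encode("utf-8").hex())
```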

The site also has a software utility (http://www.babelstone.co.uk/Software/BabelMap.html)
that displays the characters in Unicode and helps configure fonts
to display them all, though it hasn't been updated for the changes
that came out last month or so (Unicode 11).

It's a cool little, *free*, utility...though if you find it useful
you can always send in your registration.


* Re: UTF-8 character encoding
  2018-06-25  9:56     ` L A Walsh
@ 2018-06-25 20:52       ` Lee
  2018-06-26 21:39         ` Thomas Wolff
  2018-06-27  7:50         ` Michael Enright
  0 siblings, 2 replies; 13+ messages in thread
From: Lee @ 2018-06-25 20:52 UTC (permalink / raw)
  To: cygwin

On 6/24/18, L A Walsh <cygwin@tlinx.org> wrote:
> Lee wrote:
>> So... keep it simple, set
>>   LANG=en_US.UTF-8
>> and use vi or something else that comes with cygwin to create the file
>> and I'll have a file with UTF-8 character encoding - correct?
> ---
> 	The first 127 characters of UTF-8 are identical to the
> first 127 characters of ASCII, and latin1 and iso-8859-1.
>
> If you don't use any characters that need accents or special symbols,
> then nothing will be encoded in UTF-8, because its only
> the characters OVER the first 127
> (see chart @ http://www.babelstone.co.uk/Unicode/babelmap.html).

I'm still trying to figure utf-8 out, but it seems to me that 0x0 -
0xff is part of the utf-8 encoding.  This chart makes things clearer
... at least for me :)
    http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
 The proposed UCS transformation format encodes UCS values in the range
 [0,0x7fffffff] using multibyte characters of lengths 1, 2, 3, 4, and 5
 bytes.  For all encodings of more than one byte, the initial byte
 determines the number of bytes used and the high-order bit in each byte
 is set.

 An easy way to remember this transformation format is to note that the
 number of high-order 1's in the first byte is the same as the number of
 subsequent bytes in the multibyte character:

    Bits  Hex Min  Hex Max         Byte Sequence in Binary
 1    7  00000000 0000007f 0zzzzzzz
 2   13  00000080 0000207f 10zzzzzz 1yyyyyyy
 3   19  00002080 0008207f 110zzzzz 1yyyyyyy 1xxxxxxx
 4   25  00082080 0208207f 1110zzzz 1yyyyyyy 1xxxxxxx 1wwwwwww
 5   31  02082080 7fffffff 11110zzz 1yyyyyyy 1xxxxxxx 1wwwwwww 1vvvvvvv

Thanks
Lee


* Re: UTF-8 character encoding
  2018-06-25 20:52       ` Lee
@ 2018-06-26 21:39         ` Thomas Wolff
  2018-06-27  9:31           ` Lee
  2018-06-27  7:50         ` Michael Enright
  1 sibling, 1 reply; 13+ messages in thread
From: Thomas Wolff @ 2018-06-26 21:39 UTC (permalink / raw)
  To: cygwin

On 25.06.2018 at 20:33, Lee wrote:
> On 6/24/18, L A Walsh <cygwin@tlinx.org> wrote:
>> Lee wrote:
>>> So... keep it simple, set
>>>    LANG=en_US.UTF-8
>>> and use vi or something else that comes with cygwin to create the file
>>> and I'll have a file with UTF-8 character encoding - correct?
>> ---
>> 	The first 127 characters of UTF-8 are identical to the
>> first 127 characters of ASCII, and latin1 and iso-8859-1.
>>
>> If you don't use any characters that need accents or special symbols,
>> then nothing will be encoded in UTF-8, because its only
>> the characters OVER the first 127
>> (see chart @ http://www.babelstone.co.uk/Unicode/babelmap.html).
> I'm still trying to figure utf-8 out, but it seems to me that 0x0 -
> 0xff is part of the utf-8 encoding.  This chart makes things clearer
> ... at least for me :)
>      http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
>   The proposed UCS transformation format encodes UCS values in the range
>   [0,0x7fffffff] using multibyte characters of lengths 1, 2, 3, 4, and 5
>   bytes.  For all encodings of more than one byte, the initial byte
>   determines the number of bytes used and the high-order bit in each byte
>   is set.
>
>   An easy way to remember this transformation format is to note that the
>   number of high-order 1's in the first byte is the same as the number of
>   subsequent bytes in the multibyte character:
>
>      Bits  Hex Min  Hex Max         Byte Sequence in Binary
>   1    7  00000000 0000007f 0zzzzzzz
>   2   13  00000080 0000207f 10zzzzzz 1yyyyyyy
>   3   19  00002080 0008207f 110zzzzz 1yyyyyyy 1xxxxxxx
>   4   25  00082080 0208207f 1110zzzz 1yyyyyyy 1xxxxxxx 1wwwwwww
>   5   31  02082080 7fffffff 11110zzz 1yyyyyyy 1xxxxxxx 1wwwwwww 1vvvvvvv
This encoding scheme is wrong; where did you get it from? Maybe it's the 
obsolete UTF-8...


* Re: UTF-8 character encoding
  2018-06-25 20:52       ` Lee
  2018-06-26 21:39         ` Thomas Wolff
@ 2018-06-27  7:50         ` Michael Enright
  2018-06-27  9:34           ` Lee
  1 sibling, 1 reply; 13+ messages in thread
From: Michael Enright @ 2018-06-27  7:50 UTC (permalink / raw)
  To: cygwin

On Mon, Jun 25, 2018 at 11:33 AM, Lee <ler762@gmail.com> wrote:
> I'm still trying to figure utf-8 out, but it seems to me that 0x0 -
> 0xff is part of the utf-8 encoding.

I don't see how you arrived at this. An initial byte of 0xFF is not
the initial byte of any valid UTF-8 byte sequence. And it doesn't
conform with the statement you have later:

>  An easy way to remember this transformation format is to note that the
>  number of high-order 1's in the first byte is the same as the number of
>  subsequent bytes in the multibyte character:

This is true, but there is also a zero bit that ends the
high-order-1's bit string, which means that 0xFF is not a valid lead
byte. 0x7F is the highest byte value that you can have as a
single-byte UTF8 string.
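
A sketch of how a first byte can be classified under RFC 3629, where the number of leading 1 bits in a multibyte lead byte gives the total length of the sequence. The function name `sequence_length` is hypothetical; the byte ranges follow the RFC's syntax rules (lead bytes C2-DF, E0-EF and F0-F4).

```python
def sequence_length(first: int) -> int:
    """Total length of the UTF-8 sequence that starts with this byte,
    or 0 if the byte can never start a valid sequence."""
    if first <= 0x7F:
        return 1                # 0xxxxxxx: single-byte (ASCII) character
    if 0xC2 <= first <= 0xDF:
        return 2                # 110xxxxx (C0 and C1 would be overlong)
    if 0xE0 <= first <= 0xEF:
        return 3                # 1110xxxx
    if 0xF0 <= first <= 0xF4:
        return 4                # 11110xxx, capped so results stay <= U+10FFFF
    return 0                    # continuation bytes 80-BF, C0, C1, F5-FF


if __name__ == "__main__":
    for b in (0x41, 0x7F, 0x80, 0xC3, 0xE2, 0xF0, 0xFF):
        print(f"0x{b:02X}: {sequence_length(b)}")
```

The sketch looks only at the first byte; a complete validator would also check the continuation bytes that follow.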

Perhaps your statement about 0-0xFF was meant to be read differently.

Thomas Wolff's note seems to be objecting to the inclusion of
characters above U+10FFFF, which aren't legal in UTF-8 but were in the
original proposal. Otherwise your table rows 1-4 are correct.

The standards such as IETF RFC-3629 are easy enough to read, so I
recommend using them and citing them to others instead of trying to
summarize.


* Re: UTF-8 character encoding
  2018-06-26 21:39         ` Thomas Wolff
@ 2018-06-27  9:31           ` Lee
  0 siblings, 0 replies; 13+ messages in thread
From: Lee @ 2018-06-27  9:31 UTC (permalink / raw)
  To: cygwin

On 6/26/18, Thomas Wolff wrote:

> This encoding scheme is wrong; where did you get it from? Maybe it's the
> obsolete UTF-8...

http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

I thought I saw something about utf-8 being able to handle a 31 bit
value..  is that also obsolete/wrong?

how about this for the current encoding scheme:
http://www.unicode.org/versions/Unicode11.0.0/ch03.pdf

Table 3-6.  UTF-8 Bit Distribution
Bits  Scalar Value                First Byte  Second Byte  Third Byte  Fourth Byte
   7  00000000 0xxxxxxx           0xxxxxxx
  11  00000yyy yyxxxxxx           110yyyyy    10xxxxxx
  16  zzzzyyyy yyxxxxxx           1110zzzz    10yyyyyy     10xxxxxx
  21  000uuuuu zzzzyyyy yyxxxxxx  11110uuu    10uuzzzz     10yyyyyy    10xxxxxx
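
The bit distribution in that table can be sketched as a hand-written encoder and compared against Python's built-in UTF-8 codec. The function name `encode_utf8` is made up for this illustration. It rejects surrogates and anything above U+10FFFF, which is the limit in RFC 3629, so the 31-bit range of the original proposal is not covered.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one Unicode scalar value following the bit layout of Table 3-6."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:                                   # 7 bits: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                                  # 11 bits: 110yyyyy 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                                # 16 bits: three bytes
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),                # 21 bits: four bytes
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


if __name__ == "__main__":
    for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
        mine = encode_utf8(cp)
        print(f"U+{cp:04X}: {mine.hex()} matches built-in: {mine == chr(cp).encode('utf-8')}")
```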

Lee


* Re: UTF-8 character encoding
  2018-06-27  7:50         ` Michael Enright
@ 2018-06-27  9:34           ` Lee
  0 siblings, 0 replies; 13+ messages in thread
From: Lee @ 2018-06-27  9:34 UTC (permalink / raw)
  To: cygwin

On 6/26/18, Michael Enright wrote:
> On Mon, Jun 25, 2018 at 11:33 AM, Lee wrote:
>> I'm still trying to figure utf-8 out, but it seems to me that 0x0 -
>> 0xff is part of the utf-8 encoding.
>
> I don't see how you arrived at this.

I screwed up trying to do hex in my head.  For whatever reason I
didn't want to write 0 - 127

> An initial byte of 0xFF is not
> the initial byte of any valid UTF-8 byte sequence. And it doesn't
> conform with the statement you have later:

right, I screwed up :)

> The standards such as IETF RFC-3629 are easy enough to read, so I
> recommend using them and citing them to others instead of trying to
> summarize.

Thanks for the RFC reference - I hadn't come across that one yet.

Lee


Thread overview: 13+ messages
2018-06-21  7:20 UTF-8 character encoding Lee
2018-06-21 10:12 ` Stefan Weil
2018-06-21 10:39 ` Andrey Repin
2018-06-22  7:31   ` Lee
2018-06-22 17:30     ` Andrey Repin
2018-06-25  9:56     ` L A Walsh
2018-06-25 20:52       ` Lee
2018-06-26 21:39         ` Thomas Wolff
2018-06-27  9:31           ` Lee
2018-06-27  7:50         ` Michael Enright
2018-06-27  9:34           ` Lee
2018-06-21 18:49 ` Houder
2018-06-21 20:46   ` Houder
