* UTF-8 character encoding
From: Lee @ 2018-06-21 7:20 UTC (permalink / raw)
To: cygwin
I'm looking at
https://cygwin.com/packaging-hint-files.html#pvr.hint
and it starts off with
Use UTF-8 character encoding.
How do I do that and how do I check that I actually did use UTF-8
character encoding _without_ using file?
for whatever it's worth:
$ file unicode.html
unicode.html: HTML document, UTF-8 Unicode text
$ file test.c
test.c: C source, ASCII text
I used vi to create both files & I'd like to understand why file says
one is ascii & the other is utf-8
Thanks,
Lee
--
Problem reports: http://cygwin.com/problems.html
FAQ: http://cygwin.com/faq/
Documentation: http://cygwin.com/docs.html
Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple
* Re: UTF-8 character encoding
From: Stefan Weil @ 2018-06-21 10:12 UTC (permalink / raw)
To: cygwin
On 20.06.2018 at 20:09, Lee wrote:
> I'm looking at
> https://cygwin.com/packaging-hint-files.html#pvr.hint
> and it starts off with
> Use UTF-8 character encoding.
>
> How do I do that and how do I check that I actually did use UTF-8
> character encoding _without_ using file?
>
> for whatever it's worth:
> $ file unicode.html
> unicode.html: HTML document, UTF-8 Unicode text
>
> $ file test.c
> test.c: C source, ASCII text
>
> I used vi to create both files & I'd like to understand why file says
> one is ascii & the other is utf-8
>
> Thanks,
> Lee
ASCII is a subset of UTF-8, so that's fine.
The file command will report ASCII as long as your text does not contain
any non-ASCII characters. If you add some (for example äöü), it should
report UTF-8.
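One way to check this without file is to try decoding the raw bytes yourself. Below is a minimal Python 3 sketch; the function name classify_bytes and the sample inputs are illustrative assumptions, not something from this thread, and it only approximates the ASCII/UTF-8 distinction that file reports.

```python
def classify_bytes(data: bytes) -> str:
    """Return 'ASCII', 'UTF-8', or 'not UTF-8' for a byte string."""
    try:
        # bytes.decode is strict by default and raises on invalid UTF-8.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not UTF-8"
    # Every ASCII file is also valid UTF-8; it simply has no byte above 0x7F.
    if all(byte < 0x80 for byte in data):
        return "ASCII"
    return "UTF-8"


if __name__ == "__main__":
    print(classify_bytes(b"int main(void) { return 0; }\n"))
    print(classify_bytes("\u00e4\u00f6\u00fc".encode("utf-8")))
    print(classify_bytes(b"\xe4\xf6\xfc"))  # the same letters as Latin-1 bytes
```

To check a real file, read it in binary mode and pass the bytes in, e.g. classify_bytes(open("test.c", "rb").read()).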
Regards,
Stefan
* Re: UTF-8 character encoding
From: Andrey Repin @ 2018-06-21 10:39 UTC (permalink / raw)
To: Lee, cygwin
Greetings, Lee!
> I'm looking at
> https://cygwin.com/packaging-hint-files.html#pvr.hint
> and it starts off with
> Use UTF-8 character encoding.
> How do I do that and how do I check that I actually did use UTF-8
> character encoding _without_ using file?
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
> for whatever it's worth:
> $ file unicode.html
> unicode.html: HTML document, UTF-8 Unicode text
> $ file test.c
> test.c: C source, ASCII text
> I used vi to create both files & I'd like to understand why file says
> one is ascii & the other is utf-8
--
With best regards,
Andrey Repin
Thursday, June 21, 2018 4:25:27
Sorry for my terrible english...
* Re: UTF-8 character encoding
From: Houder @ 2018-06-21 18:49 UTC (permalink / raw)
To: cygwin
On Wed, 20 Jun 2018 14:09:59, Lee wrote:
> I'm looking at
> https://cygwin.com/packaging-hint-files.html#pvr.hint
> and it starts off with
> Use UTF-8 character encoding.
>
> How do I do that and how do I check that I actually did use UTF-8
> character encoding _without_ using file?
[snip]
> I used vi to create both files & I'd like to understand why file says
> one is ascii & the other is utf-8
vim can tell you that in the statusline ...
:help statusline
:help encoding
Ask Google to help you with the details: GS: "vim show encoding in status".
E.g.
- http://vim.wikia.com/wiki/Show_fileencoding_and_bomb_in_the_status_line
(Show fileencoding and bomb in the status line)
As an example:
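" Always show the status line, even with a single window.
set laststatus=2
"set statusline=...
" Append Vim's internal encoding (&enc); use &fenc for the file's own encoding.
set statusline+=\ en:\ %{strlen(&enc)\ ?\ &enc\ :\ 'x'}
"set statusline+...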
Henri
* Re: UTF-8 character encoding
From: Houder @ 2018-06-21 20:46 UTC (permalink / raw)
To: cygwin
On Thu, 21 Jun 2018 12:12:39, Houder wrote:
> On Wed, 20 Jun 2018 14:09:59, Lee wrote:
> > I'm looking at
> > https://cygwin.com/packaging-hint-files.html#pvr.hint
> > and it starts off with
> > Use UTF-8 character encoding.
> >
> > How do I do that and how do I check that I actually did use UTF-8
> > character encoding _without_ using file?
> [snip]
>
> > I used vi to create both files & I'd like to understand why file says
> > one is ascii & the other is utf-8
>
> vim can tell you that in the statusline ...
>
> :help statusline
> :help encoding
>
> Ask Google to help you with the details: GS: "vim show encoding in status".
>
> E.g.
>
> - http://vim.wikia.com/wiki/Show_fileencoding_and_bomb_in_the_status_line
> (Show fileencoding and bomb in the status line)
>
> As an example:
>
> set laststatus=2
> "set statusline=...
> set statusline+=\ en:\ %{strlen(&enc)\ ?\ &enc\ :\ 'x'}
> "set statusline+...
Also read:
- https://unix.stackexchange.com/questions/23389/how-can-i-set-vims-default-encoding-to-utf-8
(How can I set VIM's default encoding to UTF-8?)
for a "quickstart" on the subject of character encoding/vim.
Henri
* Re: UTF-8 character encoding
From: Lee @ 2018-06-22 7:31 UTC (permalink / raw)
To: cygwin
On 6/20/18, Andrey Repin wrote:
> Greetings, Lee!
>
>> I'm looking at
>> https://cygwin.com/packaging-hint-files.html#pvr.hint
>> and it starts off with
>> Use UTF-8 character encoding.
>
>> How do I do that and how do I check that I actually did use UTF-8
>> character encoding _without_ using file?
>
> https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
I think I don't know enough to ask the right question. A quick search
yesterday on byte order markers turned up
https://msdn.microsoft.com/en-us/library/windows/desktop/dd374101(v=vs.85).aspx
with this bit
Note Microsoft uses UTF-16, little endian byte order.
So... keep it simple, set
LANG=en_US.UTF-8
and use vi or something else that comes with cygwin to create the file
and I'll have a file with UTF-8 character encoding - correct?
Thanks,
Lee
* Re: UTF-8 character encoding
From: Andrey Repin @ 2018-06-22 17:30 UTC (permalink / raw)
To: Lee, cygwin
Greetings, Lee!
> On 6/20/18, Andrey Repin wrote:
>> Greetings, Lee!
>>
>>> I'm looking at
>>> https://cygwin.com/packaging-hint-files.html#pvr.hint
>>> and it starts off with
>>> Use UTF-8 character encoding.
>>
>>> How do I do that and how do I check that I actually did use UTF-8
>>> character encoding _without_ using file?
>>
>> https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
> I think I don't know enough to ask the right question. A quick search
> yesterday on byte order markers turned up
>
> https://msdn.microsoft.com/en-us/library/windows/desktop/dd374101(v=vs.85).aspx
> with this bit
> Note Microsoft uses UTF-16, little endian byte order.
Yes, the default Unicode (wide-character) encoding on Windows is UTF-16LE.
But in general, this is application-specific.
> So... keep it simple, set
> LANG=en_US.UTF-8
> and use vi or something else that comes with cygwin to create the file
> and I'll have a file with UTF-8 character encoding - correct?
I'm not familiar with vi, but this is true for the other *NIX editors I know:
they use the current locale settings by default, unless something else is
specified in their configuration or indicated by the file itself (for example,
by a byte order mark).
IMO, your best chance is to use an editor that explicitly supports saving text
in the desired encoding.
And please no BOM for UTF-8 files.
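If you would rather not depend on the editor or the locale at all, you can write the file with an explicit encoding and then look at the first bytes for a BOM. A small Python 3 sketch using only the standard library; the helper names, the file name, and the sample hint line are hypothetical:

```python
import codecs
import os
import tempfile


def write_utf8(path: str, text: str) -> None:
    """Write text as UTF-8 regardless of the current locale, without a BOM."""
    # encoding="utf-8" never emits a BOM; "utf-8-sig" would add one.
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(text)


def has_utf8_bom(path: str) -> bool:
    """Return True if the file starts with the UTF-8 BOM (EF BB BF)."""
    with open(path, "rb") as handle:
        return handle.read(3) == codecs.BOM_UTF8


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.hint")  # made-up file name
        write_utf8(path, "sdesc: \"caf\u00e9 tools\"\n")  # made-up content
        print(has_utf8_bom(path))
```

Passing the encoding explicitly keeps the result independent of LANG, which is the point of the advice above.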
--
With best regards,
Andrey Repin
Friday, June 22, 2018 14:13:14
Sorry for my terrible english...
* Re: UTF-8 character encoding
From: L A Walsh @ 2018-06-25 9:56 UTC (permalink / raw)
To: cygwin
Lee wrote:
> So... keep it simple, set
> LANG=en_US.UTF-8
> and use vi or something else that comes with cygwin to create the file
> and I'll have a file with UTF-8 character encoding - correct?
---
The first 128 characters (0-127) of UTF-8 are identical to ASCII, which is
also the first 128 characters of latin1 (iso-8859-1).
If you don't use any characters that need accents or special symbols,
then nothing will need a multi-byte UTF-8 encoding, because it's only
the characters OVER the first 128 that are encoded differently
(see chart @ http://www.babelstone.co.uk/Unicode/babelmap.html).
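To see the overlap concretely, here is a short Python 3 sketch (the function name is made up for illustration) that encodes one character in all three encodings. Characters 0-127 come out as the same single byte everywhere; anything above that differs or is not encodable at all.

```python
def encodings_of(char: str) -> dict:
    """Encode one character as ASCII, Latin-1 and UTF-8 (None if not encodable)."""
    result = {}
    for name in ("ascii", "latin-1", "utf-8"):
        try:
            result[name] = char.encode(name)
        except UnicodeEncodeError:
            result[name] = None
    return result


if __name__ == "__main__":
    for ch in ("A", "\u00e9", "\u20ac"):  # A, e with acute accent, euro sign
        print(repr(ch), encodings_of(ch))
```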
The site also has a sw util (http://www.babelstone.co.uk/Software/BabelMap.html),
that displays and helps config fonts
to display all the characters in unicode, though it hasn't
been updated to the changes that came out last month or so
(Unicode 11).
It's a cool little, *free*, utility...though if you find it useful
you can always send in your registration.
* Re: UTF-8 character encoding
From: Lee @ 2018-06-25 20:52 UTC (permalink / raw)
To: cygwin
On 6/24/18, L A Walsh <cygwin@tlinx.org> wrote:
> Lee wrote:
>> So... keep it simple, set
>> LANG=en_US.UTF-8
>> and use vi or something else that comes with cygwin to create the file
>> and I'll have a file with UTF-8 character encoding - correct?
> ---
> The first 128 characters (0-127) of UTF-8 are identical to ASCII, which is
> also the first 128 characters of latin1 (iso-8859-1).
>
> If you don't use any characters that need accents or special symbols,
> then nothing will need a multi-byte UTF-8 encoding, because it's only
> the characters OVER the first 128 that are encoded differently
> (see chart @ http://www.babelstone.co.uk/Unicode/babelmap.html).
I'm still trying to figure utf-8 out, but it seems to me that 0x0 -
0xff is part of the utf-8 encoding. This chart makes things clearer
... at least for me :)
http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
The proposed UCS transformation format encodes UCS values in the range
[0,0x7fffffff] using multibyte characters of lengths 1, 2, 3, 4, and 5
bytes. For all encodings of more than one byte, the initial byte
determines the number of bytes used and the high-order bit in each byte
is set.
An easy way to remember this transformation format is to note that the
number of high-order 1's in the first byte is the same as the number of
subsequent bytes in the multibyte character:
Bytes  Bits  Hex Min   Hex Max   Byte Sequence in Binary
1       7    00000000  0000007f  0zzzzzzz
2      13    00000080  0000207f  10zzzzzz 1yyyyyyy
3      19    00002080  0008207f  110zzzzz 1yyyyyyy 1xxxxxxx
4      25    00082080  0208207f  1110zzzz 1yyyyyyy 1xxxxxxx 1wwwwwww
5      31    02082080  7fffffff  11110zzz 1yyyyyyy 1xxxxxxx 1wwwwwww 1vvvvvvv
Thanks
Lee
* Re: UTF-8 character encoding
From: Thomas Wolff @ 2018-06-26 21:39 UTC (permalink / raw)
To: cygwin
On 25.06.2018 at 20:33, Lee wrote:
> On 6/24/18, L A Walsh <cygwin@tlinx.org> wrote:
>> Lee wrote:
>>> So... keep it simple, set
>>> LANG=en_US.UTF-8
>>> and use vi or something else that comes with cygwin to create the file
>>> and I'll have a file with UTF-8 character encoding - correct?
>> ---
>> The first 128 characters (0-127) of UTF-8 are identical to ASCII, which is
>> also the first 128 characters of latin1 (iso-8859-1).
>>
>> If you don't use any characters that need accents or special symbols,
>> then nothing will need a multi-byte UTF-8 encoding, because it's only
>> the characters OVER the first 128 that are encoded differently
>> (see chart @ http://www.babelstone.co.uk/Unicode/babelmap.html).
> I'm still trying to figure utf-8 out, but it seems to me that 0x0 -
> 0xff is part of the utf-8 encoding. This chart makes things clearer
> ... at least for me :)
> http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
> The proposed UCS transformation format encodes UCS values in the range
> [0,0x7fffffff] using multibyte characters of lengths 1, 2, 3, 4, and 5
> bytes. For all encodings of more than one byte, the initial byte
> determines the number of bytes used and the high-order bit in each byte
> is set.
>
> An easy way to remember this transformation format is to note that the
> number of high-order 1's in the first byte is the same as the number of
> subsequent bytes in the multibyte character:
>
> Bytes  Bits  Hex Min   Hex Max   Byte Sequence in Binary
> 1       7    00000000  0000007f  0zzzzzzz
> 2      13    00000080  0000207f  10zzzzzz 1yyyyyyy
> 3      19    00002080  0008207f  110zzzzz 1yyyyyyy 1xxxxxxx
> 4      25    00082080  0208207f  1110zzzz 1yyyyyyy 1xxxxxxx 1wwwwwww
> 5      31    02082080  7fffffff  11110zzz 1yyyyyyy 1xxxxxxx 1wwwwwww 1vvvvvvv
This encoding scheme is wrong; where did you get it from? Maybe it's the
obsolete UTF-8...
* Re: UTF-8 character encoding
From: Michael Enright @ 2018-06-27 7:50 UTC (permalink / raw)
To: cygwin
On Mon, Jun 25, 2018 at 11:33 AM, Lee <ler762@gmail.com> wrote:
> I'm still trying to figure utf-8 out, but it seems to me that 0x0 -
> 0xff is part of the utf-8 encoding.
I don't see how you arrived at this. An initial byte of 0xFF is not
the initial byte of any valid UTF-8 byte sequence. And it doesn't
conform with the statement you have later:
> An easy way to remember this transformation format is to note that the
> number of high-order 1's in the first byte is the same as the number of
> subsequent bytes in the multibyte character:
This is true, but there is also a zero bit that ends the
high-order-1's bit string, which means that 0xFF is not a valid lead
byte. 0x7F is the highest byte value that you can have as a
single-byte UTF8 string.
Perhaps your statement about 0-0xFF was meant to be read differently.
Thomas Wolff's note seems to be objecting to the inclusion of
characters above U+10FFFF, which aren't legal UTF-8 but were in the
original proposal. Otherwise rows 1-4 of your table are correct.
The standards such as IETF RFC-3629 are easy enough to read, so I
recommend using them and citing them to others instead of trying to
summarize.
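The lead-byte rule can be sketched in a few lines of Python 3. The byte ranges follow RFC 3629; the function name is an assumption made for illustration.

```python
from typing import Optional


def utf8_sequence_length(lead: int) -> Optional[int]:
    """Return the length of the UTF-8 sequence a lead byte starts, or None."""
    if not 0 <= lead <= 0xFF:
        raise ValueError("not a byte value")
    if lead <= 0x7F:
        return 1  # 0xxxxxxx: single-byte (ASCII) character
    if 0xC2 <= lead <= 0xDF:
        return 2  # 110xxxxx; 0xC0 and 0xC1 would only form overlong encodings
    if 0xE0 <= lead <= 0xEF:
        return 3  # 1110xxxx
    if 0xF0 <= lead <= 0xF4:
        return 4  # 11110xxx, capped at 0xF4 because of the U+10FFFF limit
    return None  # 0x80-0xBF are continuation bytes; 0xF5-0xFF never appear


if __name__ == "__main__":
    for value in (0x41, 0x7F, 0x80, 0xC3, 0xE2, 0xF0, 0xFF):
        print(hex(value), utf8_sequence_length(value))
```

Returning None rather than raising for bytes such as 0xFF keeps the function usable as a simple validity check on a first byte.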
* Re: UTF-8 character encoding
From: Lee @ 2018-06-27 9:31 UTC (permalink / raw)
To: cygwin
On 6/26/18, Thomas Wolff wrote:
> This encoding scheme is wrong; where did you get it from? Maybe it's the
> obsolete UTF-8...
http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
I thought I saw something about utf-8 being able to handle a 31 bit
value.. is that also obsolete/wrong?
how about this for the current encoding scheme:
http://www.unicode.org/versions/Unicode11.0.0/ch03.pdf
Table 3-6. UTF-8 Bit Distribution
Bits  Scalar Value                First Byte  Second Byte  Third Byte  Fourth Byte
 7    00000000 0xxxxxxx           0xxxxxxx
11    00000yyy yyxxxxxx           110yyyyy    10xxxxxx
16    zzzzyyyy yyxxxxxx           1110zzzz    10yyyyyy     10xxxxxx
21    000uuuuu zzzzyyyy yyxxxxxx  11110uuu    10uuzzzz     10yyyyyy    10xxxxxx
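The bit distribution in Table 3-6 translates fairly directly into code. A minimal Python 3 sketch, with a hypothetical function name, whose output can be compared against Python's built-in encoder:

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode scalar value following the Table 3-6 bit layout."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("outside the Unicode range U+0000..U+10FFFF")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates are not scalar values")
    if code_point <= 0x7F:          # 7 bits  -> 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:         # 11 bits -> 110yyyyy 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:        # 16 bits -> 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 21 bits -> 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


if __name__ == "__main__":
    for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
        print(hex(cp), encode_utf8(cp).hex(), chr(cp).encode("utf-8").hex())
```

With this layout the largest value is 21 bits, and RFC 3629 further limits it to U+10FFFF, so the 31-bit range from the older proposal no longer applies.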
Lee
* Re: UTF-8 character encoding
From: Lee @ 2018-06-27 9:34 UTC (permalink / raw)
To: cygwin
On 6/26/18, Michael Enright wrote:
> On Mon, Jun 25, 2018 at 11:33 AM, Lee wrote:
>> I'm still trying to figure utf-8 out, but it seems to me that 0x0 -
>> 0xff is part of the utf-8 encoding.
>
> I don't see how you arrived at this.
I screwed up trying to do hex in my head. For whatever reason I
didn't want to write 0 - 127
> An initial byte of 0xFF is not
> the initial byte of any valid UTF-8 byte sequence. And it doesn't
> conform with the statement you have later:
right, I screwed up :)
> The standards such as IETF RFC-3629 are easy enough to read, so I
> recommend using them and citing them to others instead of trying to
> summarize.
Thanks for the RFC reference - I hadn't come across that one yet.
Lee