public inbox for cygwin@cygwin.com
* Need help with multibyte UTF-8 characters
@ 2017-12-05  1:24 Thomas Taylor
  2017-12-05  3:48 ` Brian Inglis
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Thomas Taylor @ 2017-12-05  1:24 UTC (permalink / raw)
  To: cygwin

I want to use multibyte UTF-8 characters in 64-bit Cygwin under Windows 
7.  The "vim" editor running in mintty displays the two-byte characters 
correctly, but not the three- (and I assume four-) byte characters, 
which instead display as rectangular filled-in blocks.  The "less" 
program doesn't even display two-byte characters correctly, but instead 
displays them as <A1> to <FF>, depending on the character in question, 
in reverse color in the terminal window.  The "cat" program is even 
worse, replacing every two-byte character with a character that looks 
like three horizontal bars stacked one above the other.  I've read the 
"Internationalization" page in the Cygwin online manual, but am still 
baffled.  My LANG environment variable is set to "en_US.UTF-8".  Can 
anyone help?


--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple


* Re: Need help with multibyte UTF-8 characters
  2017-12-05  1:24 Need help with multibyte UTF-8 characters Thomas Taylor
@ 2017-12-05  3:48 ` Brian Inglis
  2017-12-12  3:43 ` Thomas Taylor
  2017-12-13 13:07 ` Brian Inglis
  2 siblings, 0 replies; 18+ messages in thread
From: Brian Inglis @ 2017-12-05  3:48 UTC (permalink / raw)
  To: cygwin

On 2017-12-04 18:23, Thomas Taylor wrote:
> I want to use multibyte UTF-8 characters in 64-bit Cygwin under Windows 7.  The
> "vim" editor running in mintty displays the two-byte characters correctly, but
> not the three- (and I assume four-) byte characters, which instead display as
> rectangular filled-in blocks.  The "less" program doesn't even display two-byte
> characters correctly, but instead displays them as <A1> to <FF>, depending on
> the character in question, in reverse color in the terminal window.  The "cat"
> program is even worse, replacing every two-byte character with a character that
> looks like three horizontal bars stacked one above the other.  I've read the
> "Internationalization" page in the Cygwin online manual, but am still baffled. 
> My LANG environment variable is set to "en_US.UTF-8".  Can anyone help?

Check mintty/Options/Text/Locale[en_US]/Character set[UTF-8]/Apply/Save.
Then exit and restart mintty and your shell.
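
To confirm that the options above were actually saved, the stored values can
be read back from mintty's configuration file.  This is only a sketch: it
assumes the settings live as Key=Value lines in ~/.minttyrc and that the keys
are named Locale, Charset and Font; the function name is made up for
illustration.

```shell
# Print the locale, character set and font that mintty has saved.
# Assumption: settings are stored as Key=Value lines in ~/.minttyrc
# and the relevant keys are named Locale, Charset and Font.
mintty_text_settings() {
    grep -E '^(Locale|Charset|Font)=' "${1:-$HOME/.minttyrc}"
}
```

Called without an argument it reads ~/.minttyrc; a different file can be
passed as the first argument.  The Font line is worth a look too, since a
font without the needed glyphs shows boxes even when the locale is right.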

To see what locale Cygwin thinks you are set to, run:
	$ locale

To check all Windows locale settings, you can run:
	$ for o in -s -u -n -i -f ''; do locale $o; done

The first two should show your Windows install locale, the rest should show
anything you have set up, or the same locale.
If any settings don't match LANG, you may have to set LC_ALL=$LANG to force the
setting.
I use the following profile stanza across all systems for consistency:

# Set user-defined locale - use regional settings if available
locale -fU > /dev/null 2>&1     \
        && LC_ALL=$(locale -fU) \
        || LC_ALL=$(locale |    \
                /bin/sed '/^LANG=\|^LC_CTYPE=\|^LC_ALL=/{s///;h};$!d;x;s/"//g')
export LC_ALL
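
The "settings that don't match LANG" check described above can be sketched as
a small filter over the output of locale.  The function name is hypothetical,
and the sketch assumes the usual NAME=value / NAME="value" output format.

```shell
# Print every locale category whose value differs from LANG.
# Reads `locale` output on stdin so it can be tried on canned text.
# An empty LC_ALL is skipped, since that only means it is unset.
# Usage: locale | locale_mismatches
locale_mismatches() {
    awk -F= '
        { gsub(/"/, "", $2) }                  # drop the quotes locale adds
        $1 == "LANG" { lang = $2; next }       # remember the reference value
        $1 == "LC_ALL" && $2 == "" { next }    # unset LC_ALL is not a mismatch
        $2 != lang { print $1 "=" $2 }
    '
}
```

No output means every category agrees with LANG; any line printed names a
category that would need LC_ALL (or its own variable) to bring it into line.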

-- 
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada


* Re: Need help with multibyte UTF-8 characters
  2017-12-05  1:24 Need help with multibyte UTF-8 characters Thomas Taylor
  2017-12-05  3:48 ` Brian Inglis
@ 2017-12-12  3:43 ` Thomas Taylor
  2017-12-12 20:00   ` Doug Henderson
                     ` (3 more replies)
  2017-12-13 13:07 ` Brian Inglis
  2 siblings, 4 replies; 18+ messages in thread
From: Thomas Taylor @ 2017-12-12  3:43 UTC (permalink / raw)
  To: cygwin

Thank you for your advice on setting my locale to en_US.UTF-8.  
Unfortunately, Cygwin still seems to have trouble displaying some 
three-byte UTF-8 encoded characters correctly.  For example, see the 
following snippet from a "sed" file.  This file attempts to convert 
XML-encoded filenames to UTF-8.  As you can see, it converts one- and 
two-byte encodings correctly, but fails on some three-byte encodings 
(the en dash, the em dash, and the ellipsis, all of which are displayed 
as a filled-in rectangle):

# Match longest strings first

# Three-byte encodings:

# En dash
s/%[Ee]2%80%93/–/g

# Em dash
s/%[Ee]2%80%94/—/g

# Horizontal ellipsis
s/%[Ee]2%80%[Aa]6/…/g

# Less-than-or-equal sign
s/%[Ee]2%89%[Aa]4/≤/g

# Euro symbol
s/%[Ee]2%82%[Aa][Cc]/€/g

# Two-byte encodings:

# Non-break space
#s/%[Cc]2%[Aa]0/⎵/g

# Lowercase a with acute accent
s/%[Cc]3%[Aa]1/á/g

# Lowercase a with umlaut (a.k.a. diaeresis)
s/%[Cc]3%[Aa]4/ä/g

# Lowercase e with acute accent
s/%[Cc]3%[Aa]9/é/g

# Lowercase i with acute accent
s/%[Cc]3%[Aa]D/í/g

# Lowercase o with acute accent
s/%[Cc]3%[Bb]3/ó/g




* Re: Need help with multibyte UTF-8 characters
  2017-12-12  3:43 ` Thomas Taylor
@ 2017-12-12 20:00   ` Doug Henderson
  2017-12-12 20:17   ` Thomas Taylor
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 18+ messages in thread
From: Doug Henderson @ 2017-12-12 20:00 UTC (permalink / raw)
  To: cygwin

On 11 December 2017 at 16:36, Thomas Taylor wrote:
> Thank you for your advice on setting my locale to en_US.UTF-8.
> Unfortunately, Cygwin still seems to have trouble displaying some three-byte
> UTF-8 encoded characters correctly.  For example, see the following snippet
> from a "sed" file.  This file attempts to convert XML-encoded filenames to
> UTF-8.  As you can see, it converts one- and two-byte encodings correctly,
> but fails on some three-byte encodings (the en dash, the em dash, and the
> ellipsis, all of which are displayed as a filled-in rectangle):
>

Your sed script works for me. I copy/pasted your sample script into
"cvt_script.sed" and also into "cvt_input.txt". My sed command looks
like: "sed --file=cvt_script.sed < cvt_input.txt > cvt_output.txt". It
correctly translates all the encoded utf-8 strings.

Your display may appear different if you are using different fonts in
mintty or the windows console. I am using Lucida Console, 10pt and
Consolas 16, respectively. They display different glyphs for the
non-breaking space, but are otherwise identical. In mintty, I have
LANG and all the LC_* variables set to en_CA.UTF-8, and in the windows
console, to en_US.UTF-8.
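
One way to separate the conversion from the display is to look at the bytes
sed produces instead of the glyphs the terminal draws.  The sketch below
applies the en dash rule from the script to a sample string and dumps the
result in hex; the function names and the sample string are made up for
illustration.

```shell
# Dump stdin as a run of lowercase hex byte values with no separators.
hex_bytes() {
    od -An -tx1 | tr -d ' \n'
}

# Apply the thread's en dash rule to stdin.
# The replacement is built with printf octal escapes so that this
# snippet itself stays pure ASCII.
convert_en_dash() {
    dash=$(printf '\342\200\223')   # U+2013 EN DASH encoded as UTF-8
    sed "s/%[Ee]2%80%93/$dash/g"
}

# Prints 61e28093620a: "a", the three en dash bytes, "b", newline.
printf '%s\n' 'a%E2%80%93b' | convert_en_dash | hex_bytes; echo
```

If the bytes e2 80 93 are present, the conversion is correct and a box on
screen points at the font rather than at sed or the Cygwin runtime.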

I am running Win 10 and cygwin setup was last updated a couple or
three days ago.

Check the output of the "locale" command. All variables should have
the same value.

Is your cygwin installation up to date, or fairly close to current?
What Windows version are you using?

HTH,
Doug


-- 
Doug Henderson, Calgary, Alberta, Canada - from gmail.com


* Re: Need help with multibyte UTF-8 characters
  2017-12-12  3:43 ` Thomas Taylor
  2017-12-12 20:00   ` Doug Henderson
@ 2017-12-12 20:17   ` Thomas Taylor
  2017-12-14 19:50     ` Andrey Repin
  2017-12-15  2:51     ` Brian Inglis
  2017-12-13  3:06   ` Thomas Wolff
  2017-12-14 19:32   ` Brian Inglis
  3 siblings, 2 replies; 18+ messages in thread
From: Thomas Taylor @ 2017-12-12 20:17 UTC (permalink / raw)
  To: cygwin

[-- Attachment #1: Type: text/plain, Size: 1296 bytes --]

I believe that Cygwin displays certain UTF-8 characters incorrectly.  To 
see the problem, first save the attached "utf-8_test.sed" text file to 
your desktop.  Then run "mintty," and set its options by right clicking 
in its title bar, selecting "Options" and then "Text."  On the Text page 
set "Locale" to "en_US" and "Character set" to "UTF-8," and then 
"Save."  Now exit and restart mintty.  Change directory to your desktop 
and run the editor "vim" on the utf-8_test.sed file.  Once inside vim do 
a ":set fileencoding=utf-8".  You should now see that vim displays 
correctly a sample of one-, two-, and three-byte UTF-8 character 
encodings in the test file.  Vim fails, however, on the three-byte 
encodings for the "en" dash, the "em" dash, and the ellipsis, each of 
which displays incorrectly as a filled-in rectangle.  Now exit vim and 
do a "less" or "cat" on the utf-8_test.sed file.  You should see most of 
the sample UTF-8 encoded characters displayed correctly, except once 
again for the en dash, em dash, and ellipsis.  So it looks like a 
problem in the underlying Cygwin run-time libraries rather than in vim, 
less, or cat.  I haven't tested this on four-byte UTF-8 character 
encodings, but assume Cygwin will have similar problems.


[-- Attachment #2: utf-8_test.sed --]
[-- Type: text/plain, Size: 1140 bytes --]

# This is file "utf-8_test.sed"
#
# It's used by the "sed" utility program
# to convert XML-encoded filenames to UTF-8

# Match longest strings first

# Three-byte encodings:

# En dash
s/%[Ee]2%80%93/–/g

# Em dash
s/%[Ee]2%80%94/—/g

# Horizontal ellipsis
s/%[Ee]2%80%[Aa]6/…/g

# Less-than-or-equal sign
s/%[Ee]2%89%[Aa]4/≤/g

# Euro symbol
s/%[Ee]2%82%[Aa][Cc]/€/g

# Two-byte encodings:

# Non-break space
s/%[Cc]2%[Aa]0/⎵/g

# Lowercase a with acute accent
s/%[Cc]3%[Aa]1/á/g

# Lowercase a with umlaut (a.k.a. diaeresis)
s/%[Cc]3%[Aa]4/ä/g

# Lowercase e with acute accent
s/%[Cc]3%[Aa]9/é/g

# Lowercase i with acute accent
s/%[Cc]3%[Aa]D/í/g

# Lowercase o with acute accent
s/%[Cc]3%[Bb]3/ó/g

# Lowercase n with tilde
s/%[Cc]3%[Bb]1/ñ/g

# Lowercase c with acute accent 
s/%[Cc]4%87/ć/g

# Lowercase o with long accent (a.k.a. macron)
s/%[Cc]5%8[Dd]/ō/g

# One-byte encodings:

# "And" sign (a.k.a. ampersand)
s/&#38;/\&/g

# Space
s/%20/ /g

# Sharp (or pound) sign
s/%23/#/g

# Percent sign
s/%25/%/g

# Left square bracket
s/%5[Bb]/[/g

# Right square bracket
s/%5[Dd]/]/g

# End of file "utf-8_test.sed"



* Re: Need help with multibyte UTF-8 characters
  2017-12-12  3:43 ` Thomas Taylor
  2017-12-12 20:00   ` Doug Henderson
  2017-12-12 20:17   ` Thomas Taylor
@ 2017-12-13  3:06   ` Thomas Wolff
  2017-12-14 19:32   ` Brian Inglis
  3 siblings, 0 replies; 18+ messages in thread
From: Thomas Wolff @ 2017-12-13  3:06 UTC (permalink / raw)
  To: cygwin

Am 12.12.2017 um 00:36 schrieb Thomas Taylor:
> ... This file attempts to convert XML-encoded filenames to UTF-8.  ...
How about a generic script, like:
sed -e 's,%,\\x,g' -e "s,^,echo $'," -e "s,$,'," | sh
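
To see what the pipeline above hands to the shell, the same three sed
expressions can be run without the final "| sh".  This is only an
illustration; the function name and the sample input are made up.

```shell
# Turn each input line into the echo command that the pipeline above
# would pass to the shell, without executing it.
show_decode_command() {
    sed -e 's,%,\\x,g' -e "s,^,echo \$'," -e "s,\$,',"
}

# Prints: echo $'caf\xC3\xA9'
printf '%s\n' 'caf%C3%A9' | show_decode_command
```

The generated command relies on $'...' quoting, which bash understands, so
the shell at the end of the pipe needs to support it.  A single quote in the
input would end the quoting early, so this is best kept to trusted input.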


* Re: Need help with multibyte UTF-8 characters
  2017-12-05  1:24 Need help with multibyte UTF-8 characters Thomas Taylor
  2017-12-05  3:48 ` Brian Inglis
  2017-12-12  3:43 ` Thomas Taylor
@ 2017-12-13 13:07 ` Brian Inglis
  2017-12-13 13:28   ` Thomas Wolff
  2 siblings, 1 reply; 18+ messages in thread
From: Brian Inglis @ 2017-12-13 13:07 UTC (permalink / raw)
  To: cygwin

On 2017-12-04 18:23, Thomas Taylor wrote:
> I want to use multibyte UTF-8 characters in 64-bit Cygwin under Windows 7.  The
> "vim" editor running in mintty displays the two-byte characters correctly, but
> not the three- (and I assume four-) byte characters, which instead display as
> rectangular filled-in blocks.  The "less" program doesn't even display two-byte
> characters correctly, but instead displays them as <A1> to <FF>, depending on
> the character in question, in reverse color in the terminal window.  The "cat"
> program is even worse, replacing every two-byte character with a character that
> looks like three horizontal bars stacked one above the other.  I've read the
> "Internationalization" page in the Cygwin online manual, but am still baffled. 
> My LANG environment variable is set to "en_US.UTF-8".  Can anyone help?

Your Windows Regional settings and your mintty/Options/Text/Language and
Character Set should be set to match.
The profile commands below set Cygwin locale to your Windows Regional settings
and charset to UTF-8, or Unix locale to your system locale.
Otherwise your system or mintty is going to be doing conversions on each character.

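# Set user-defined locale: prefer the Windows regional format settings with
# UTF-8, otherwise fall back to the locale already in effect
locale -fU > /dev/null 2>&1     \
        && LC_ALL=$(locale -fU) \
        || LC_ALL=$(locale |    \
                sed '/^LANG=\|^LC_CTYPE=\|^LC_ALL=/{s///;h};$!d;x;s/"//g')
export LC_ALL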

-- 
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada


* Re: Need help with multibyte UTF-8 characters
  2017-12-13 13:07 ` Brian Inglis
@ 2017-12-13 13:28   ` Thomas Wolff
  2017-12-14  1:15     ` cyg Simple
  2017-12-14  7:36     ` Brian Inglis
  0 siblings, 2 replies; 18+ messages in thread
From: Thomas Wolff @ 2017-12-13 13:28 UTC (permalink / raw)
  To: cygwin

Hi Brian,

Am 13.12.2017 um 06:21 schrieb Brian Inglis:
> On 2017-12-04 18:23, Thomas Taylor wrote:
>> I want to use multibyte UTF-8 characters in 64-bit Cygwin under Windows 7.  The
>> "vim" editor running in mintty displays the two-byte characters correctly, but
>> not the three- (and I assume four-) byte characters, which instead display as
>> rectangular filled-in blocks.  The "less" program doesn't even display two-byte
>> characters correctly, but instead displays them as <A1> to <FF>, depending on
>> the character in question, in reverse color in the terminal window.  The "cat"
>> program is even worse, replacing every two-byte character with a character that
>> looks like three horizontal bars stacked one above the other.  I've read the
>> "Internationalization" page in the Cygwin online manual, but am still baffled.
>> My LANG environment variable is set to "en_US.UTF-8".  Can anyone help?
> Your Windows Regional settings and your mintty/Options/Text/Language and
> Character Set should be set to match.
> The profile commands below set Cygwin locale to your Windows Regional settings
> and charset to UTF-8, or Unix locale to your system locale.
> Otherwise your system or mintty is going to be doing conversions on each character.
I am not aware that mintty character display and Windows regional 
settings would interfere in any way you indicated.
Can you elaborate on this please?
Thomas

> # Set user-defined locale
> locale -fU > /dev/null 2>&1     \
>          && LC_ALL=$(locale -fU) \
>          || LC_ALL=$(locale |    \
>                  sed '/^LANG=\|^LC_CTYPE=\|^LC_ALL=/{s///;h};$!d;x;s/"//g')
>



* Re: Need help with multibyte UTF-8 characters
  2017-12-13 13:28   ` Thomas Wolff
@ 2017-12-14  1:15     ` cyg Simple
  2017-12-14  7:36     ` Brian Inglis
  1 sibling, 0 replies; 18+ messages in thread
From: cyg Simple @ 2017-12-14  1:15 UTC (permalink / raw)
  To: cygwin

On 12/13/2017 2:50 AM, Thomas Wolff wrote:
> Hi Brian,
> 
> Am 13.12.2017 um 06:21 schrieb Brian Inglis:
>> On 2017-12-04 18:23, Thomas Taylor wrote:
>>> I want to use multibyte UTF-8 characters in 64-bit Cygwin under Windows 7.  The
>>> "vim" editor running in mintty displays the two-byte characters correctly, but
>>> not the three- (and I assume four-) byte characters, which instead display as
>>> rectangular filled-in blocks.  The "less" program doesn't even display two-byte
>>> characters correctly, but instead displays them as <A1> to <FF>, depending on
>>> the character in question, in reverse color in the terminal window.  The "cat"
>>> program is even worse, replacing every two-byte character with a character that
>>> looks like three horizontal bars stacked one above the other.  I've read the
>>> "Internationalization" page in the Cygwin online manual, but am still baffled.
>>> My LANG environment variable is set to "en_US.UTF-8".  Can anyone help?
>> Your Windows Regional settings and your mintty/Options/Text/Language and
>> Character Set should be set to match.
>> The profile commands below set Cygwin locale to your Windows Regional settings
>> and charset to UTF-8, or Unix locale to your system locale.
>> Otherwise your system or mintty is going to be doing conversions on each character.
> I am not aware that mintty character display and Windows regional
> settings would interfere in any way you indicated.
> Can you elaborate on this please?
> Thomas
> 
>> # Set user-defined locale
>> locale -fU > /dev/null 2>&1     \
>>          && LC_ALL=$(locale -fU) \
>>          || LC_ALL=$(locale |    \
>>                  sed '/^LANG=\|^LC_CTYPE=\|^LC_ALL=/{s///;h};$!d;x;s/"//g')
>>

I was having an issue with git changing the encoding of files from
ISO-8859-1 to UTF-8 because of this.  I modified my $HOME/.profile and
changed:

# Set user-defined locale
export LANG=$(locale -uU)

to:

# Set user-defined locale
export LANG=$(locale -u).ISO-8859-1

which sets every locale category within Cygwin except LC_ALL:

$ locale
LANG=en_US.ISO-8859-1
LC_CTYPE="en_US.ISO-8859-1"
LC_NUMERIC="en_US.ISO-8859-1"
LC_TIME="en_US.ISO-8859-1"
LC_COLLATE="en_US.ISO-8859-1"
LC_MONETARY="en_US.ISO-8859-1"
LC_MESSAGES="en_US.ISO-8859-1"
LC_ALL=
$

-- 
cyg Simple


* Re: Need help with multibyte UTF-8 characters
  2017-12-13 13:28   ` Thomas Wolff
  2017-12-14  1:15     ` cyg Simple
@ 2017-12-14  7:36     ` Brian Inglis
  2017-12-14 16:21       ` Thomas Wolff
  2017-12-14 16:55       ` cyg Simple
  1 sibling, 2 replies; 18+ messages in thread
From: Brian Inglis @ 2017-12-14  7:36 UTC (permalink / raw)
  To: cygwin

On 2017-12-13 00:50, Thomas Wolff wrote:
> Am 13.12.2017 um 06:21 schrieb Brian Inglis:
>> On 2017-12-04 18:23, Thomas Taylor wrote:
>> Your Windows Regional settings and your mintty/Options/Text/Language and
>> Character Set should be set to match.
>> The profile commands below set Cygwin locale to your Windows Regional settings
>> and charset to UTF-8, or Unix locale to your system locale.
>> Otherwise your system or mintty is going to be doing conversions on each
>> character.
> I am not aware that mintty character display and Windows regional settings would
> interfere in any way you indicated.
> Can you elaborate on this please?

Maybe I'm just too optimistic that software will DTRT to ensure that output is
faithfully passed thru, or converted for the next layer of software, if it has
different settings.
I set all of my locales the same so characters should pass thru transparently
and I can see output faithfully rendered, given adequate font configurations.

What happens when your system, terminal, and shell locales and charsets differ?
Either one or more components have to do conversion to provide readable output,
which is my expectation given the requirement to specify locales and charsets,
or you could end up with garbled output if nothing is doing any conversion.
Does one override others to pass thru readable output, does conversion occur, or
do you just see junk in some or all cases when locales and charsets differ?

I am ignoring here the effect on text content, input and output formatting of
selecting languages, territories, and scripts.
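
As a small illustration of the "garbled output" case raised above, the sketch
below takes the two UTF-8 bytes of a lowercase e with acute accent and treats
them as if they were ISO-8859-1, which is roughly what a layer with the wrong
charset setting would do.  It assumes iconv is installed; the function name is
made up.

```shell
# Reinterpret the bytes on stdin as ISO-8859-1 and re-encode them as
# UTF-8, mimicking a layer that assumes the wrong input charset.
misread_as_latin1() {
    iconv -f ISO-8859-1 -t UTF-8
}

# c3 a9 (one character) comes out as c3 83 c2 a9 (two characters).
printf '\303\251' | misread_as_latin1 | od -An -tx1
```

The one accented letter turns into two unrelated characters, so a mismatch of
this kind shows up as extra wrong glyphs rather than as a single box.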

-- 
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada


* Re: Need help with multibyte UTF-8 characters
  2017-12-14  7:36     ` Brian Inglis
@ 2017-12-14 16:21       ` Thomas Wolff
  2017-12-14 18:09         ` cyg Simple
  2017-12-14 16:55       ` cyg Simple
  1 sibling, 1 reply; 18+ messages in thread
From: Thomas Wolff @ 2017-12-14 16:21 UTC (permalink / raw)
  To: cygwin

Am 14.12.2017 um 05:40 schrieb Brian Inglis:
> On 2017-12-13 00:50, Thomas Wolff wrote:
>> Am 13.12.2017 um 06:21 schrieb Brian Inglis:
>>> On 2017-12-04 18:23, Thomas Taylor wrote:
>>> Your Windows Regional settings and your mintty/Options/Text/Language and
>>> Character Set should be set to match.
>>> The profile commands below set Cygwin locale to your Windows Regional settings
>>> and charset to UTF-8, or Unix locale to your system locale.
>>> Otherwise your system or mintty is going to be doing conversions on each
>>> character.
>> I am not aware that mintty character display and Windows regional settings would
>> interfere in any way you indicated.
>> Can you elaborate on this please?
> Maybe I'm just too optimistic that software will DTRT to ensure that output is
> faithfully passed thru, or converted for the next layer of software, if it has
> different settings.
> I set all of my locales the same so characters should pass thru transparently
> and I can see output faithfully rendered, given adequate font configurations.
Mintty interfaces to Windows using the Unicode/UTF-16 API, so there is 
no dependency on the Windows system locale.
I assume the original poster's problem is a font issue, unless a test 
case would demonstrate anything else.
Thomas

> What happens when your system, terminal, and shell locales and charsets differ?
> Either some component/-s has/have to do conversion to provide readable output,
> which is my expectation given the requirement to specify locales and charsets,
> or you could end up with garbled output if nothing is doing any conversion.
> Does one override others to pass thru readable output, does conversion occur, or
> do you just see junk in some or all cases when locales and charsets differ?
>
> I am ignoring here the effect on text content, input and output formatting of
> selecting languages, territories, and scripts.


* Re: Need help with multibyte UTF-8 characters
  2017-12-14  7:36     ` Brian Inglis
  2017-12-14 16:21       ` Thomas Wolff
@ 2017-12-14 16:55       ` cyg Simple
  1 sibling, 0 replies; 18+ messages in thread
From: cyg Simple @ 2017-12-14 16:55 UTC (permalink / raw)
  To: cygwin

On 12/13/2017 11:40 PM, Brian Inglis wrote:
> On 2017-12-13 00:50, Thomas Wolff wrote:
>> Am 13.12.2017 um 06:21 schrieb Brian Inglis:
>>> On 2017-12-04 18:23, Thomas Taylor wrote:
>>> Your Windows Regional settings and your mintty/Options/Text/Language and
>>> Character Set should be set to match.
>>> The profile commands below set Cygwin locale to your Windows Regional settings
>>> and charset to UTF-8, or Unix locale to your system locale.
>>> Otherwise your system or mintty is going to be doing conversions on each
>>> character.
>> I am not aware that mintty character display and Windows regional settings would
>> interfere in any way you indicated.
>> Can you elaborate on this please?
> 
> Maybe I'm just too optimistic that software will DTRT to ensure that output is
> faithfully passed thru, or converted for the next layer of software, if it has
> different settings.
> I set all of my locales the same so characters should pass thru transparently
> and I can see output faithfully rendered, given adequate font configurations.
> 
> What happens when your system, terminal, and shell locales and charsets differ?
> Either some component/-s has/have to do conversion to provide readable output,
> which is my expectation given the requirement to specify locales and charsets,
> or you could end up with garbled output if nothing is doing any conversion.
> Does one override others to pass thru readable output, does conversion occur, or
> do you just see junk in some or all cases when locales and charsets differ?
> 
> I am ignoring here the effect on text content, input and output formatting of
> selecting languages, territories, and scripts.
> 

For my working environment I need Cygwin and Windows to be different.  I
have other requirements for en_US.UTF-8 within the Windows environment.
I use Netbeans IDE and it allows me to set the locale per project.  It
doesn't matter what the OS level states; that is just the default.

-- 
cyg Simple


* Re: Need help with multibyte UTF-8 characters
  2017-12-14 16:21       ` Thomas Wolff
@ 2017-12-14 18:09         ` cyg Simple
  2017-12-14 19:20           ` Thomas Wolff
  0 siblings, 1 reply; 18+ messages in thread
From: cyg Simple @ 2017-12-14 18:09 UTC (permalink / raw)
  To: cygwin

On 12/14/2017 3:55 AM, Thomas Wolff wrote:
> Mintty interfaces to Windows using the Unicode/UTF-16 API, so there is
> no dependency on the Windows system locale.
> I assume the original poster's problem is a font issue, unless a test
> case would demonstrate anything else.
> Thomas
> 

I seem to remember the OP giving a test case already. Mail from 12/12
has an attachment.

-- 
cyg Simple


* Re: Need help with multibyte UTF-8 characters
  2017-12-14 18:09         ` cyg Simple
@ 2017-12-14 19:20           ` Thomas Wolff
  0 siblings, 0 replies; 18+ messages in thread
From: Thomas Wolff @ 2017-12-14 19:20 UTC (permalink / raw)
  To: cygwin

Am 14.12.2017 um 17:21 schrieb cyg Simple:
> On 12/14/2017 3:55 AM, Thomas Wolff wrote:
>> Mintty interfaces to Windows using the Unicode/UTF-16 API, so there is
>> no dependency on the Windows system locale.
>> I assume the original poster's problem is a font issue, unless a test
>> case would demonstrate anything else.
>> Thomas
>>
> I seem to remember the OP giving a test case already. Mail from 12/12 has an attachment.
My idea of a test case is somewhat more focussed, but anyway: there are 
no problems with that file here.
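
A more focused test case could be as small as printing the three characters
from the report directly, with no file, editor or sed involved.  This is a
sketch, not necessarily what was meant above; the function name is made up
and the octal escapes are the UTF-8 bytes of U+2013, U+2014 and U+2026.

```shell
# Print the three characters from the report, one per line, labelled
# with their code points.  Bytes are given as octal escapes so the
# snippet does not depend on how this text itself is encoded.
show_glyphs() {
    printf 'U+2013 en dash:  \342\200\223\n'
    printf 'U+2014 em dash:  \342\200\224\n'
    printf 'U+2026 ellipsis: \342\200\246\n'
}

show_glyphs
```

If these show as boxes in mintty but look right after switching to another
font under Options/Text, the font is the likely cause.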


* Re: Need help with multibyte UTF-8 characters
  2017-12-12  3:43 ` Thomas Taylor
                     ` (2 preceding siblings ...)
  2017-12-13  3:06   ` Thomas Wolff
@ 2017-12-14 19:32   ` Brian Inglis
  3 siblings, 0 replies; 18+ messages in thread
From: Brian Inglis @ 2017-12-14 19:32 UTC (permalink / raw)
  To: cygwin

On 2017-12-11 16:36, Thomas Taylor wrote:
> Thank you for your advice on setting my locale to en_US.UTF-8.  Unfortunately,
> Cygwin still seems to have trouble displaying some three-byte UTF-8 encoded
> characters correctly.  For example, see the following snippet from a "sed"
> file.  This file attempts to convert XML-encoded filenames to UTF-8.  As you can
> see, it converts one- and two-byte encodings correctly, but fails on some
> three-byte encodings (the en dash, the em dash, and the ellipsis, all of which
> are displayed as a filled-in rectangle):

Going back to first principles - what is your script encoded as and run as?
What characters are in your script?
	$ wc -lwmc ...
What does vim say for that script:
	:set enc? tenc? fenc? fencs? eol? bomb?
What does locale say sed runs as:
	$ locale
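
The "what is your script encoded as" question can also be checked from the
shell.  The sketch below assumes iconv is installed and that it exits
non-zero on an invalid sequence; both function names are made up.

```shell
# Report (by exit status) whether a file decodes cleanly as UTF-8.
# Assumption: iconv is installed and fails on an invalid sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Print "chars bytes" for a file.  The two numbers differ when the file
# holds multibyte characters and the current locale is a UTF-8 one.
char_and_byte_counts() {
    echo "$(wc -m < "$1" | tr -d ' ') $(wc -c < "$1" | tr -d ' ')"
}
```

For example, "is_valid_utf8 utf-8_test.sed && echo ok" prints ok only if the
whole file is well-formed UTF-8, and char_and_byte_counts on the same file
gives the numbers the wc command above would show.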

-- 
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada


* Re: Need help with multibyte UTF-8 characters
  2017-12-12 20:17   ` Thomas Taylor
@ 2017-12-14 19:50     ` Andrey Repin
  2017-12-15  2:51     ` Brian Inglis
  1 sibling, 0 replies; 18+ messages in thread
From: Andrey Repin @ 2017-12-14 19:50 UTC (permalink / raw)
  To: Thomas Taylor, cygwin

Greetings, Thomas Taylor!

> I believe that Cygwin displays certain UTF-8 characters incorrectly.  To 
> see the problem, first save the attached "utf-8_test.sed" text file to 
> your desktop. 

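First, your "NBSP" is actually U+23B5 BOTTOM SQUARE BRACKET, a three-byte
character (E2 8E B5) rather than the two-byte no-break space (C2 A0):
http://www.fileformat.info/info/unicode/char/23b5/index.htm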

> Then run "mintty," and set its options by right clicking
> in its title bar, selecting "Options" and then "Text." 

I just leave them empty.

> On the Text page
> set "Locale" to "en_US" and "Character set" to "UTF-8," and then 
> "Save."  Now exit and restart mintty.  Change directory to your desktop 
> and run the editor "vim" on the utf-8_test.sed file.  Once inside vim do 
> a ":set fileencoding=utf-8".  You should now see that vim displays 
> correctly a sample of one-, two-, and three-byte UTF-8 character 
> encodings in the test file.  Vim fails, however, on the three-byte 
> encodings for the "en" dash, the "em" dash, and the ellipsis, each of 
> which displays incorrectly as a filled-in rectangle.  Now exit vim and 
> do a "less" or "cat" on the utf-8_test.sed file.  You should see most of 
> the sample UTF-8 encoded characters displayed correctly, except once 
> again for the en dash, em dash, and ellipsis. 

All displayed correctly. Lucida Console 11pt.

> So it looks like a problem in the underlying Cygwin run-time libraries
> rather than in vim, less, or cat.  I haven't tested this on four-byte UTF-8
> character encodings, but assume Cygwin will have similar problems.

I don't have a good console font for mb4, but I presume it will be displayed
just fine.


-- 
With best regards,
Andrey Repin
Thursday, December 14, 2017 21:59:07

Sorry for my terrible english...

* Re: Need help with multibyte UTF-8 characters
  2017-12-12 20:17   ` Thomas Taylor
  2017-12-14 19:50     ` Andrey Repin
@ 2017-12-15  2:51     ` Brian Inglis
  2017-12-16  1:50       ` Thomas Wolff
  1 sibling, 1 reply; 18+ messages in thread
From: Brian Inglis @ 2017-12-15  2:51 UTC (permalink / raw)
  To: cygwin

On 2017-12-12 12:42, Thomas Taylor wrote:
> I believe that Cygwin displays certain UTF-8 characters incorrectly.  To see the
> problem, first save the attached "utf-8_test.sed" text file to your desktop. 
> Then run "mintty," and set its options by right clicking in its title bar,
> selecting "Options" and then "Text."  On the Text page set "Locale" to "en_US"
> and "Character set" to "UTF-8," and then "Save."  Now exit and restart mintty. 
> Change directory to your desktop and run the editor "vim" on the utf-8_test.sed
> file.  Once inside vim do a ":set fileencoding=utf-8".  You should now see that
> vim displays correctly a sample of one-, two-, and three-byte UTF-8 character
> encodings in the test file.  Vim fails, however, on the three-byte encodings for
> the "en" dash, the "em" dash, and the ellipsis, each of which displays
> incorrectly as a filled-in rectangle.  Now exit vim and do a "less" or "cat" on
> the utf-8_test.sed file.  You should see most of the sample UTF-8 encoded
> characters displayed correctly, except once again for the en dash, em dash, and
> ellipsis.  So it looks like a problem in the underlying Cygwin run-time
> libraries rather than in vim, less, or cat.  I haven't tested this on four-byte
> UTF-8 character encodings, but assume Cygwin will have similar problems.

Like many others -- no problems visible -- all UTF-8 characters displayed
correctly in gvim/X, vim, less, cat from mintty.

-- 
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Need help with multibyte UTF-8 characters
  2017-12-15  2:51     ` Brian Inglis
@ 2017-12-16  1:50       ` Thomas Wolff
  0 siblings, 0 replies; 18+ messages in thread
From: Thomas Wolff @ 2017-12-16  1:50 UTC (permalink / raw)
  To: cygwin

On 15.12.2017 at 01:32, Brian Inglis wrote:
> On 2017-12-12 12:42, Thomas Taylor wrote:
>> I believe that Cygwin displays certain UTF-8 characters incorrectly.  To see the
>> problem, first save the attached "utf-8_test.sed" text file to your desktop.
>> Then run "mintty," and set its options by right clicking in its title bar,
>> selecting "Options" and then "Text."  On the Text page set "Locale" to "en_US"
>> and "Character set" to "UTF-8," and then "Save."  Now exit and restart mintty.
>> Change directory to your desktop and run the editor "vim" on the utf-8_test.sed
>> file.  Once inside vim do a ":set fileencoding=utf-8".  You should now see that
>> vim displays correctly a sample of one-, two-, and three-byte UTF-8 character
>> encodings in the test file.  Vim fails, however, on the three-byte encodings for
>> the "en" dash, the "em" dash, and the ellipsis, each of which displays
>> incorrectly as a filled-in rectangle.  Now exit vim and do a "less" or "cat" on
>> the utf-8_test.sed file.  You should see most of the sample UTF-8 encoded
>> characters displayed correctly, except once again for the en dash, em dash, and
>> ellipsis.  So it looks like a problem in the underlying Cygwin run-time
>> libraries rather than in vim, less, or cat.  I haven't tested this on four-byte
>> UTF-8 character encodings, but assume Cygwin will have similar problems.
> Like many others -- no problems visible -- all UTF-8 characters displayed
> correctly in gvim/X, vim, less, cat from mintty.
It seems nobody asked you so far which font you use. So please report that.
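
One way to tell an encoding problem from a font problem is to check the bytes
of the characters in question. The following is only a minimal sketch in
Python, not something posted in this thread; the helper name describe() and
the sample string are assumptions made for illustration. It lists the code
point, Unicode name and UTF-8 bytes of the characters mentioned here (NBSP,
U+23B5, en dash, em dash, ellipsis). If the bytes in the test file match and
the terminal still shows a filled-in rectangle, the font probably lacks the
glyph.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name, UTF-8 bytes in hex) for each character."""
    rows = []
    for ch in text:
        encoded = ch.encode("utf-8")  # one to four bytes per character
        rows.append((
            f"U+{ord(ch):04X}",
            unicodedata.name(ch, "<unnamed>"),
            encoded.hex(" ").upper(),  # bytes.hex(sep) needs Python 3.8+
        ))
    return rows


if __name__ == "__main__":
    # Hypothetical sample: NBSP, U+23B5, en dash, em dash, ellipsis.
    sample = "\u00a0\u23b5\u2013\u2014\u2026"
    for code_point, name, utf8_hex in describe(sample):
        print(f"{code_point}  {utf8_hex:<11}  {name}")
```

Comparing this output with a hex dump of the test file (for example from
"od -c" or "xxd") shows whether the file really contains the expected
three-byte sequences.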


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2017-12-15 15:20 UTC | newest]

Thread overview: 18+ messages
2017-12-05  1:24 Need help with multibyte UTF-8 characters Thomas Taylor
2017-12-05  3:48 ` Brian Inglis
2017-12-12  3:43 ` Thomas Taylor
2017-12-12 20:00   ` Doug Henderson
2017-12-12 20:17   ` Thomas Taylor
2017-12-14 19:50     ` Andrey Repin
2017-12-15  2:51     ` Brian Inglis
2017-12-16  1:50       ` Thomas Wolff
2017-12-13  3:06   ` Thomas Wolff
2017-12-14 19:32   ` Brian Inglis
2017-12-13 13:07 ` Brian Inglis
2017-12-13 13:28   ` Thomas Wolff
2017-12-14  1:15     ` cyg Simple
2017-12-14  7:36     ` Brian Inglis
2017-12-14 16:21       ` Thomas Wolff
2017-12-14 18:09         ` cyg Simple
2017-12-14 19:20           ` Thomas Wolff
2017-12-14 16:55       ` cyg Simple
