public inbox for cygwin-patches@cygwin.com
 help / color / mirror / Atom feed
* [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
@ 2020-09-01 16:19 Johannes Schindelin
  2020-09-02  6:06 ` Johannes Schindelin
                   ` (2 more replies)
  0 siblings, 3 replies; 60+ messages in thread
From: Johannes Schindelin @ 2020-09-01 16:19 UTC (permalink / raw)
  To: cygwin-patches

When `LANG=en_US.UTF-8`, the detected `LCID` is 0x0409, which is
correct, but after that (at least if Pseudo Console support is enabled),
we try to find the default code page for that `LCID`, which is ASCII
(437). Subsequently, we set the Console output code page to that value,
completely ignoring that we wanted to use UTF-8.

Let's not ignore the specifically asked-for UTF-8 character set.

While at it, let's also set the Console output code page even if Pseudo
Console support is disabled; contrary to the behavior of v3.0.7, the
Console output code page is not ignored in that case.

The most common symptom would be that console applications which do not
specifically call `SetConsoleOutputCP()` but output UTF-8-encoded text
seem to be broken with v3.1.x when they worked plenty fine with v3.0.x.

This fixes https://github.com/msys2/MSYS2-packages/issues/1974,
https://github.com/msys2/MSYS2-packages/issues/2012,
https://github.com/rust-lang/cargo/issues/8369,
https://github.com/git-for-windows/git/issues/2734,
https://github.com/git-for-windows/git/issues/2793,
https://github.com/git-for-windows/git/issues/2792, and possibly quite a
few others.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 winsup/cygwin/fhandler_tty.cc | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/winsup/cygwin/fhandler_tty.cc b/winsup/cygwin/fhandler_tty.cc
index 06789a500..414c26992 100644
--- a/winsup/cygwin/fhandler_tty.cc
+++ b/winsup/cygwin/fhandler_tty.cc
@@ -2859,6 +2859,15 @@ fhandler_pty_slave::setup_locale (void)
   char charset[ENCODING_LEN + 1] = "ASCII";
   LCID lcid = get_langinfo (locale, charset);

+  /* Special-case the UTF-8 character set */
+  if (strcasecmp (charset, "UTF-8") == 0)
+    {
+      get_ttyp ()->term_code_page = CP_UTF8;
+      SetConsoleCP (CP_UTF8);
+      SetConsoleOutputCP (CP_UTF8);
+      return;
+    }
+
   /* Set console code page from locale */
   if (get_pseudo_console ())
     {
--
2.27.0


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-01 16:19 [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8" Johannes Schindelin
@ 2020-09-02  6:06 ` Johannes Schindelin
  2020-09-02  8:30 ` Corinna Vinschen
  2020-09-04 10:03 ` Takashi Yano
  2 siblings, 0 replies; 60+ messages in thread
From: Johannes Schindelin @ 2020-09-02  6:06 UTC (permalink / raw)
  To: cygwin-patches

Hi,

On Tue, 1 Sep 2020, Johannes Schindelin wrote:

> When `LANG=en_US.UTF-8`, the detected `LCID` is 0x0409, which is
> correct, but after that (at least if Pseudo Console support is enabled),
> we try to find the default code page for that `LCID`, which is ASCII
> (437). Subsequently, we set the Console output code page to that value,
> completely ignoring that we wanted to use UTF-8.
>
> Let's not ignore the specifically asked-for UTF-8 character set.
>
> While at it, let's also set the Console output code page even if Pseudo
> Console support is disabled; contrary to the behavior of v3.0.7, the
> Console output code page is not ignored in that case.
>
> The most common symptom would be that console applications which do not
> specifically call `SetConsoleOutputCP()` but output UTF-8-encoded text
> seem to be broken with v3.1.x when they worked plenty fine with v3.0.x.
>
> This fixes https://github.com/msys2/MSYS2-packages/issues/1974,
> https://github.com/msys2/MSYS2-packages/issues/2012,
> https://github.com/rust-lang/cargo/issues/8369,
> https://github.com/git-for-windows/git/issues/2734,
> https://github.com/git-for-windows/git/issues/2793,
> https://github.com/git-for-windows/git/issues/2792, and possibly quite a
> few others.
>
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> ---
>  winsup/cygwin/fhandler_tty.cc | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/winsup/cygwin/fhandler_tty.cc b/winsup/cygwin/fhandler_tty.cc
> index 06789a500..414c26992 100644
> --- a/winsup/cygwin/fhandler_tty.cc
> +++ b/winsup/cygwin/fhandler_tty.cc
> @@ -2859,6 +2859,15 @@ fhandler_pty_slave::setup_locale (void)
>    char charset[ENCODING_LEN + 1] = "ASCII";
>    LCID lcid = get_langinfo (locale, charset);
>
> +  /* Special-case the UTF-8 character set */
> +  if (strcasecmp (charset, "UTF-8") == 0)
> +    {
> +      get_ttyp ()->term_code_page = CP_UTF8;
> +      SetConsoleCP (CP_UTF8);
> +      SetConsoleOutputCP (CP_UTF8);
> +      return;
> +    }
> +

Just a word of warning: while this patch can be ported to a634adda5
(libm/machine/arm: Rename s*_fma.c -> s*_fma_arm.c, 2020-09-01) relatively
easily (and the first two patches of this patch series cannot, as they are
no longer applicable after the complete redesign of the Pseudo Console
support), it only works as intended in the `disable_pcon` mode.

The new design calls for Pseudo Consoles to be created per spawned console
application.

And I have not found any way to convince my local version of the runtime
to change the code page of these Pseudo Consoles away from the rather
unfortunate default 437.

This is a problem.

Take for example https://github.com/git-for-windows/git/issues/2793.
Telling the users that they should patch node.js and recompile is probably
not going to fly.

Hopefully there is a way to fix this, otherwise Pseudo Console support
will continue to be quite the support burden.

Ciao,
Johannes

>    /* Set console code page from locale */
>    if (get_pseudo_console ())
>      {
> --
> 2.27.0
>
>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-02  9:41   ` Takashi Yano
@ 2020-09-02  6:26     ` Johannes Schindelin
  2020-09-02 13:06       ` Takashi Yano
  0 siblings, 1 reply; 60+ messages in thread
From: Johannes Schindelin @ 2020-09-02  6:26 UTC (permalink / raw)
  To: Takashi Yano; +Cc: cygwin-patches

Hi Takashi,

On Wed, 2 Sep 2020, Takashi Yano via Cygwin-patches wrote:

> On Wed, 2 Sep 2020 10:30:14 +0200
> Corinna Vinschen wrote:
> > On Sep  1 18:19, Johannes Schindelin wrote:
> > > When `LANG=en_US.UTF-8`, the detected `LCID` is 0x0409, which is
> > > correct, but after that (at least if Pseudo Console support is enabled),
> > > we try to find the default code page for that `LCID`, which is ASCII
> > > (437). Subsequently, we set the Console output code page to that value,
> > > completely ignoring that we wanted to use UTF-8.
> > >
> > > Let's not ignore the specifically asked-for UTF-8 character set.
> > >
> > > While at it, let's also set the Console output code page even if Pseudo
> > > Console support is disabled; contrary to the behavior of v3.0.7, the
> > > Console output code page is not ignored in that case.
> > >
> > > The most common symptom would be that console applications which do not
> > > specifically call `SetConsoleOutputCP()` but output UTF-8-encoded text
> > > seem to be broken with v3.1.x when they worked plenty fine with v3.0.x.
> > >
> > > This fixes https://github.com/msys2/MSYS2-packages/issues/1974,
> > > https://github.com/msys2/MSYS2-packages/issues/2012,
> > > https://github.com/rust-lang/cargo/issues/8369,
> > > https://github.com/git-for-windows/git/issues/2734,
> > > https://github.com/git-for-windows/git/issues/2793,
> > > https://github.com/git-for-windows/git/issues/2792, and possibly quite a
> > > few others.
> > >
> > > Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> > > ---
> > >  winsup/cygwin/fhandler_tty.cc | 9 +++++++++
> > >  1 file changed, 9 insertions(+)
> >
> > Ok guys, I'm not opposed to this change in terms of its result,
>
> I am sorry, but I cannot agree with Johannes's patch.
>
> For example, code page in Japan is CP932 by default.
> In this case, cmd.exe, netsh.exe and so on are generate
> messages in Japanese.
>
> If the code page is set to CP_UTF8, messages from such
> commands changes to english. I guess similar things happen
> for other locales.
>
> I do not prefer this result.
>
> Furthermore, I looked into the issue:
> https://github.com/git-for-windows/git/issues/2734
> and I found that git-for-windows always use utf-8
> encoding even if the locale is ja_JP.CP932.
> It does not change coding based on locale or code
> page.
>
> Even with Johannes's patch, if mintty is started with
> locale ja_JP.CP932, the file name will be garbled
> bacause SetConsoleOutputCP(CP_UTF8) will not be called.
>
> IMHO, it is the problem of git-for-windows rather
> than cygwin and msys2.
>
> To make current version of git-for-windows work, it is
> necessary to set code page to CP_UTF8 regardless locale.
> This does not make sense at all.

You are misrepresenting the problem. It has nothing to do with Git for
Windows. For example, if you run tests in an Angular project inside
Cygwin's MinTTY, the output will be garbled because node.js (or the
Angular libraries, I don't know) will interpret `LANG` or something.

This is a much bigger problem than you make it sound. The console
applications that do _not_ call `SetConsoleOutputCP()` are sooooo
ubiquituous. I think you are underestimating that problem rather
dramatically.

Ciao,
Johannes

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-01 16:19 [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8" Johannes Schindelin
  2020-09-02  6:06 ` Johannes Schindelin
@ 2020-09-02  8:30 ` Corinna Vinschen
  2020-09-02  8:38   ` Corinna Vinschen
  2020-09-02  9:41   ` Takashi Yano
  2020-09-04 10:03 ` Takashi Yano
  2 siblings, 2 replies; 60+ messages in thread
From: Corinna Vinschen @ 2020-09-02  8:30 UTC (permalink / raw)
  To: cygwin-patches

On Sep  1 18:19, Johannes Schindelin wrote:
> When `LANG=en_US.UTF-8`, the detected `LCID` is 0x0409, which is
> correct, but after that (at least if Pseudo Console support is enabled),
> we try to find the default code page for that `LCID`, which is ASCII
> (437). Subsequently, we set the Console output code page to that value,
> completely ignoring that we wanted to use UTF-8.
> 
> Let's not ignore the specifically asked-for UTF-8 character set.
> 
> While at it, let's also set the Console output code page even if Pseudo
> Console support is disabled; contrary to the behavior of v3.0.7, the
> Console output code page is not ignored in that case.
> 
> The most common symptom would be that console applications which do not
> specifically call `SetConsoleOutputCP()` but output UTF-8-encoded text
> seem to be broken with v3.1.x when they worked plenty fine with v3.0.x.
> 
> This fixes https://github.com/msys2/MSYS2-packages/issues/1974,
> https://github.com/msys2/MSYS2-packages/issues/2012,
> https://github.com/rust-lang/cargo/issues/8369,
> https://github.com/git-for-windows/git/issues/2734,
> https://github.com/git-for-windows/git/issues/2793,
> https://github.com/git-for-windows/git/issues/2792, and possibly quite a
> few others.
> 
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> ---
>  winsup/cygwin/fhandler_tty.cc | 9 +++++++++
>  1 file changed, 9 insertions(+)

Ok guys, I'm not opposed to this change in terms of its result,
but I'm starting to wonder why all this locale code in fhandler_tty
is necessary at all.

I see that get_langinfo() calls __loadlocale and performs a lot of stuff
on the charsets which looks like duplicates of the initial_setlocale()
call performed at DLL startup.

If there's anything missing in the initial_setlocale() call which would
be required by the pseudo tty code?  What exactly is it?  The codepage?
And why can't we just add the info to cygheap->locale at initial_setlocale()
time so it's available at exec time without going through all this hassle
every time?

Apart from that, all this locale/charset/lcid stuff should be concentrated
in nlsfunc.cc ideally.


Thanks,
Corinna

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-02  8:30 ` Corinna Vinschen
@ 2020-09-02  8:38   ` Corinna Vinschen
  2020-09-02 10:54     ` Takashi Yano
  2020-09-02  9:41   ` Takashi Yano
  1 sibling, 1 reply; 60+ messages in thread
From: Corinna Vinschen @ 2020-09-02  8:38 UTC (permalink / raw)
  To: cygwin-patches

On Sep  2 10:30, Corinna Vinschen wrote:
> On Sep  1 18:19, Johannes Schindelin wrote:
> > When `LANG=en_US.UTF-8`, the detected `LCID` is 0x0409, which is
> > correct, but after that (at least if Pseudo Console support is enabled),
> > we try to find the default code page for that `LCID`, which is ASCII
> > (437). Subsequently, we set the Console output code page to that value,
> > completely ignoring that we wanted to use UTF-8.
> > 
> > Let's not ignore the specifically asked-for UTF-8 character set.
> > 
> > While at it, let's also set the Console output code page even if Pseudo
> > Console support is disabled; contrary to the behavior of v3.0.7, the
> > Console output code page is not ignored in that case.
> > 
> > The most common symptom would be that console applications which do not
> > specifically call `SetConsoleOutputCP()` but output UTF-8-encoded text
> > seem to be broken with v3.1.x when they worked plenty fine with v3.0.x.
> > 
> > This fixes https://github.com/msys2/MSYS2-packages/issues/1974,
> > https://github.com/msys2/MSYS2-packages/issues/2012,
> > https://github.com/rust-lang/cargo/issues/8369,
> > https://github.com/git-for-windows/git/issues/2734,
> > https://github.com/git-for-windows/git/issues/2793,
> > https://github.com/git-for-windows/git/issues/2792, and possibly quite a
> > few others.
> > 
> > Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> > ---
> >  winsup/cygwin/fhandler_tty.cc | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> 
> Ok guys, I'm not opposed to this change in terms of its result,
> but I'm starting to wonder why all this locale code in fhandler_tty
> is necessary at all.
> 
> I see that get_langinfo() calls __loadlocale and performs a lot of stuff
> on the charsets which looks like duplicates of the initial_setlocale()
> call performed at DLL startup.
> 
> If there's anything missing in the initial_setlocale() call which would
> be required by the pseudo tty code?  What exactly is it?  The codepage?
> And why can't we just add the info to cygheap->locale at initial_setlocale()
> time so it's available at exec time without going through all this hassle
> every time?
> 
> Apart from that, all this locale/charset/lcid stuff should be concentrated
> in nlsfunc.cc ideally.

get_locale_from_env() and get_langinfo() should go away.  If we just
need a codepage for get_ttyp ()->term_code_page, we should really find a
way to do this from within internal_setlocale().


Corinna

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-02 13:06       ` Takashi Yano
@ 2020-09-02  9:12         ` Johannes Schindelin
  2020-09-02 14:52           ` Takashi Yano
  0 siblings, 1 reply; 60+ messages in thread
From: Johannes Schindelin @ 2020-09-02  9:12 UTC (permalink / raw)
  To: Takashi Yano; +Cc: cygwin-patches

Hi Takashi,


On Wed, 2 Sep 2020, Takashi Yano wrote:

> On Wed, 2 Sep 2020 08:26:04 +0200 (CEST)
> Johannes Schindelin wrote:
> > Hi Takashi,
> >
> > On Wed, 2 Sep 2020, Takashi Yano via Cygwin-patches wrote:
> >
> > > On Wed, 2 Sep 2020 10:30:14 +0200
> > > Corinna Vinschen wrote:
> > > > On Sep  1 18:19, Johannes Schindelin wrote:
> > > > > When `LANG=en_US.UTF-8`, the detected `LCID` is 0x0409, which is
> > > > > correct, but after that (at least if Pseudo Console support is enabled),
> > > > > we try to find the default code page for that `LCID`, which is ASCII
> > > > > (437). Subsequently, we set the Console output code page to that value,
> > > > > completely ignoring that we wanted to use UTF-8.
> > > > >
> > > > > Let's not ignore the specifically asked-for UTF-8 character set.
> > > > >
> > > > > While at it, let's also set the Console output code page even if Pseudo
> > > > > Console support is disabled; contrary to the behavior of v3.0.7, the
> > > > > Console output code page is not ignored in that case.
> > > > >
> > > > > The most common symptom would be that console applications which do not
> > > > > specifically call `SetConsoleOutputCP()` but output UTF-8-encoded text
> > > > > seem to be broken with v3.1.x when they worked plenty fine with v3.0.x.
> > > > >
> > > > > This fixes https://github.com/msys2/MSYS2-packages/issues/1974,
> > > > > https://github.com/msys2/MSYS2-packages/issues/2012,
> > > > > https://github.com/rust-lang/cargo/issues/8369,
> > > > > https://github.com/git-for-windows/git/issues/2734,
> > > > > https://github.com/git-for-windows/git/issues/2793,
> > > > > https://github.com/git-for-windows/git/issues/2792, and possibly quite a
> > > > > few others.
> > > > >
> > > > > Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> > > > > ---
> > > > >  winsup/cygwin/fhandler_tty.cc | 9 +++++++++
> > > > >  1 file changed, 9 insertions(+)
> > > >
> > > > Ok guys, I'm not opposed to this change in terms of its result,
> > >
> > > I am sorry, but I cannot agree with Johannes's patch.
> > >
> > > For example, code page in Japan is CP932 by default.
> > > In this case, cmd.exe, netsh.exe and so on are generate
> > > messages in Japanese.
> > >
> > > If the code page is set to CP_UTF8, messages from such
> > > commands changes to english. I guess similar things happen
> > > for other locales.
> > >
> > > I do not prefer this result.
> > >
> > > Furthermore, I looked into the issue:
> > > https://github.com/git-for-windows/git/issues/2734
> > > and I found that git-for-windows always use utf-8
> > > encoding even if the locale is ja_JP.CP932.
> > > It does not change coding based on locale or code
> > > page.
> > >
> > > Even with Johannes's patch, if mintty is started with
> > > locale ja_JP.CP932, the file name will be garbled
> > > bacause SetConsoleOutputCP(CP_UTF8) will not be called.
> > >
> > > IMHO, it is the problem of git-for-windows rather
> > > than cygwin and msys2.
> > >
> > > To make current version of git-for-windows work, it is
> > > necessary to set code page to CP_UTF8 regardless locale.
> > > This does not make sense at all.
> >
> > You are misrepresenting the problem. It has nothing to do with Git for
> > Windows. For example, if you run tests in an Angular project inside
> > Cygwin's MinTTY, the output will be garbled because node.js (or the
> > Angular libraries, I don't know) will interpret `LANG` or something.
>
> You listed these issues in git-for-windows:
> > > > > https://github.com/git-for-windows/git/issues/2734,
> > > > > https://github.com/git-for-windows/git/issues/2792,
> didn't you? Therefore, I looked into them.
>
> OK, I will check Angular/CLI next. But I am not familier with
> Agnular/CLI. Could you please provide simple steps to reproduce
> the problem?

Here is a report: https://github.com/git-for-windows/git/issues/2793

But why do you want to look into Angular/CLI? Do you really think that it
makes sense to try to fix every console app out there? Really? Wouldn't it
make more sense to just accept that there are many console applications
that are broken by the recent change, and accommodate them in the Cygwin
runtime? That would take substantially less time.

Ciao,
Johannes

>
> > This is a much bigger problem than you make it sound. The console
> > applications that do _not_ call `SetConsoleOutputCP()` are sooooo
> > ubiquituous. I think you are underestimating that problem rather
> > dramatically.
> >
> > Ciao,
> > Johannes
>
>
> --
> Takashi Yano <takashi.yano@nifty.ne.jp>
>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-02  8:30 ` Corinna Vinschen
  2020-09-02  8:38   ` Corinna Vinschen
@ 2020-09-02  9:41   ` Takashi Yano
  2020-09-02  6:26     ` Johannes Schindelin
  1 sibling, 1 reply; 60+ messages in thread
From: Takashi Yano @ 2020-09-02  9:41 UTC (permalink / raw)
  To: cygwin-patches

On Wed, 2 Sep 2020 10:30:14 +0200
Corinna Vinschen wrote:
> On Sep  1 18:19, Johannes Schindelin wrote:
> > When `LANG=en_US.UTF-8`, the detected `LCID` is 0x0409, which is
> > correct, but after that (at least if Pseudo Console support is enabled),
> > we try to find the default code page for that `LCID`, which is ASCII
> > (437). Subsequently, we set the Console output code page to that value,
> > completely ignoring that we wanted to use UTF-8.
> > 
> > Let's not ignore the specifically asked-for UTF-8 character set.
> > 
> > While at it, let's also set the Console output code page even if Pseudo
> > Console support is disabled; contrary to the behavior of v3.0.7, the
> > Console output code page is not ignored in that case.
> > 
> > The most common symptom would be that console applications which do not
> > specifically call `SetConsoleOutputCP()` but output UTF-8-encoded text
> > seem to be broken with v3.1.x when they worked plenty fine with v3.0.x.
> > 
> > This fixes https://github.com/msys2/MSYS2-packages/issues/1974,
> > https://github.com/msys2/MSYS2-packages/issues/2012,
> > https://github.com/rust-lang/cargo/issues/8369,
> > https://github.com/git-for-windows/git/issues/2734,
> > https://github.com/git-for-windows/git/issues/2793,
> > https://github.com/git-for-windows/git/issues/2792, and possibly quite a
> > few others.
> > 
> > Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> > ---
> >  winsup/cygwin/fhandler_tty.cc | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> 
> Ok guys, I'm not opposed to this change in terms of its result,

I am sorry, but I cannot agree with Johannes's patch.

For example, code page in Japan is CP932 by default.
In this case, cmd.exe, netsh.exe and so on are generate
messages in Japanese.

If the code page is set to CP_UTF8, messages from such
commands changes to english. I guess similar things happen
for other locales.

I do not prefer this result.

Furthermore, I looked into the issue:
https://github.com/git-for-windows/git/issues/2734
and I found that git-for-windows always use utf-8
encoding even if the locale is ja_JP.CP932.
It does not change coding based on locale or code
page.

Even with Johannes's patch, if mintty is started with
locale ja_JP.CP932, the file name will be garbled
bacause SetConsoleOutputCP(CP_UTF8) will not be called.

IMHO, it is the problem of git-for-windows rather
than cygwin and msys2.

To make current version of git-for-windows work, it is
necessary to set code page to CP_UTF8 regardless locale.
This does not make sense at all.

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-02  8:38   ` Corinna Vinschen
@ 2020-09-02 10:54     ` Takashi Yano
  2020-09-02 15:24       ` Corinna Vinschen
  0 siblings, 1 reply; 60+ messages in thread
From: Takashi Yano @ 2020-09-02 10:54 UTC (permalink / raw)
  To: cygwin-patches

Hi Corinna,

On Wed, 2 Sep 2020 10:38:18 +0200
Corinna Vinschen wrote:
> On Sep  2 10:30, Corinna Vinschen wrote:
> > Ok guys, I'm not opposed to this change in terms of its result,
> > but I'm starting to wonder why all this locale code in fhandler_tty
> > is necessary at all.
> > 
> > I see that get_langinfo() calls __loadlocale and performs a lot of stuff
> > on the charsets which looks like duplicates of the initial_setlocale()
> > call performed at DLL startup.
> > 
> > If there's anything missing in the initial_setlocale() call which would
> > be required by the pseudo tty code?  What exactly is it?  The codepage?
> > And why can't we just add the info to cygheap->locale at initial_setlocale()
> > time so it's available at exec time without going through all this hassle
> > every time?
> > 
> > Apart from that, all this locale/charset/lcid stuff should be concentrated
> > in nlsfunc.cc ideally.
> 
> get_locale_from_env() and get_langinfo() should go away.  If we just
> need a codepage for get_ttyp ()->term_code_page, we should really find a
> way to do this from within internal_setlocale().

I looked into internal_setlocale() code, but I could not found
the code which handles thecode page. I found the code handling
the code page in __set_charset_from_locale() function in nlsfuncs.cc,
but it does not return code page itself. Could you please explain
more detail of your idea?

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-02  6:26     ` Johannes Schindelin
@ 2020-09-02 13:06       ` Takashi Yano
  2020-09-02  9:12         ` Johannes Schindelin
  0 siblings, 1 reply; 60+ messages in thread
From: Takashi Yano @ 2020-09-02 13:06 UTC (permalink / raw)
  To: cygwin-patches

On Wed, 2 Sep 2020 08:26:04 +0200 (CEST)
Johannes Schindelin wrote:
> Hi Takashi,
> 
> On Wed, 2 Sep 2020, Takashi Yano via Cygwin-patches wrote:
> 
> > On Wed, 2 Sep 2020 10:30:14 +0200
> > Corinna Vinschen wrote:
> > > On Sep  1 18:19, Johannes Schindelin wrote:
> > > > When `LANG=en_US.UTF-8`, the detected `LCID` is 0x0409, which is
> > > > correct, but after that (at least if Pseudo Console support is enabled),
> > > > we try to find the default code page for that `LCID`, which is ASCII
> > > > (437). Subsequently, we set the Console output code page to that value,
> > > > completely ignoring that we wanted to use UTF-8.
> > > >
> > > > Let's not ignore the specifically asked-for UTF-8 character set.
> > > >
> > > > While at it, let's also set the Console output code page even if Pseudo
> > > > Console support is disabled; contrary to the behavior of v3.0.7, the
> > > > Console output code page is not ignored in that case.
> > > >
> > > > The most common symptom would be that console applications which do not
> > > > specifically call `SetConsoleOutputCP()` but output UTF-8-encoded text
> > > > seem to be broken with v3.1.x when they worked plenty fine with v3.0.x.
> > > >
> > > > This fixes https://github.com/msys2/MSYS2-packages/issues/1974,
> > > > https://github.com/msys2/MSYS2-packages/issues/2012,
> > > > https://github.com/rust-lang/cargo/issues/8369,
> > > > https://github.com/git-for-windows/git/issues/2734,
> > > > https://github.com/git-for-windows/git/issues/2793,
> > > > https://github.com/git-for-windows/git/issues/2792, and possibly quite a
> > > > few others.
> > > >
> > > > Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> > > > ---
> > > >  winsup/cygwin/fhandler_tty.cc | 9 +++++++++
> > > >  1 file changed, 9 insertions(+)
> > >
> > > Ok guys, I'm not opposed to this change in terms of its result,
> >
> > I am sorry, but I cannot agree with Johannes's patch.
> >
> > For example, code page in Japan is CP932 by default.
> > In this case, cmd.exe, netsh.exe and so on are generate
> > messages in Japanese.
> >
> > If the code page is set to CP_UTF8, messages from such
> > commands changes to english. I guess similar things happen
> > for other locales.
> >
> > I do not prefer this result.
> >
> > Furthermore, I looked into the issue:
> > https://github.com/git-for-windows/git/issues/2734
> > and I found that git-for-windows always use utf-8
> > encoding even if the locale is ja_JP.CP932.
> > It does not change coding based on locale or code
> > page.
> >
> > Even with Johannes's patch, if mintty is started with
> > locale ja_JP.CP932, the file name will be garbled
> > bacause SetConsoleOutputCP(CP_UTF8) will not be called.
> >
> > IMHO, it is the problem of git-for-windows rather
> > than cygwin and msys2.
> >
> > To make current version of git-for-windows work, it is
> > necessary to set code page to CP_UTF8 regardless locale.
> > This does not make sense at all.
> 
> You are misrepresenting the problem. It has nothing to do with Git for
> Windows. For example, if you run tests in an Angular project inside
> Cygwin's MinTTY, the output will be garbled because node.js (or the
> Angular libraries, I don't know) will interpret `LANG` or something.

You listed these issues in git-for-windows:
> > > > https://github.com/git-for-windows/git/issues/2734,
> > > > https://github.com/git-for-windows/git/issues/2792,
didn't you? Therefore, I looked into them.

OK, I will check Angular/CLI next. But I am not familier with
Agnular/CLI. Could you please provide simple steps to reproduce
the problem?

> This is a much bigger problem than you make it sound. The console
> applications that do _not_ call `SetConsoleOutputCP()` are sooooo
> ubiquituous. I think you are underestimating that problem rather
> dramatically.
> 
> Ciao,
> Johannes


-- 
Takashi Yano <takashi.yano@nifty.ne.jp>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-02  9:12         ` Johannes Schindelin
@ 2020-09-02 14:52           ` Takashi Yano
  0 siblings, 0 replies; 60+ messages in thread
From: Takashi Yano @ 2020-09-02 14:52 UTC (permalink / raw)
  To: cygwin-patches

On Wed, 2 Sep 2020 11:12:53 +0200 (CEST)
Johannes Schindelin wrote:
> On Wed, 2 Sep 2020, Takashi Yano wrote:
> > OK, I will check Angular/CLI next. But I am not familier with
> > Agnular/CLI. Could you please provide simple steps to reproduce
> > the problem?
> 
> Here is a report: https://github.com/git-for-windows/git/issues/2793

I already read that thread, and I have try following step.

1) Install node.js
2) npm install --global @angular/cli
3) ng new test-app
4) cd test-app
5) ng test --code-coverage

However, the output is very differnt from the bug report,
and there seems to be no garbled output.

The output is something like:

Compiling @angular/core : es2015 as esm2015
Compiling @angular/animations : es2015 as esm2015
Compiling @angular/compiler/testing : es2015 as esm2015
Compiling @angular/common : es2015 as esm2015
Compiling @angular/animations/browser : es2015 as esm2015
Compiling @angular/core/testing : es2015 as esm2015
Compiling @angular/animations/browser/testing : es2015 as esm2015
Compiling @angular/platform-browser : es2015 as esm2015
Compiling @angular/common/http : es2015 as esm2015
Compiling @angular/common/testing : es2015 as esm2015
Compiling @angular/router : es2015 as esm2015
Compiling @angular/forms : es2015 as esm2015
Compiling @angular/platform-browser-dynamic : es2015 as esm2015
Compiling @angular/platform-browser/testing : es2015 as esm2015
Compiling @angular/platform-browser/animations : es2015 as esm2015
Compiling @angular/common/http/testing : es2015 as esm2015
Compiling @angular/platform-browser-dynamic/testing : es2015 as esm2015
Compiling @angular/router/testing : es2015 as esm2015
10% building 2/2 modules 0 active02 09 2020 23:49:25.110:WARN [karma]: No captur
ed browser, open http://localhost:9876/
02 09 2020 23:49:25.118:INFO [karma-server]: Karma v5.0.9 server started at http
://0.0.0.0:9876/
02 09 2020 23:49:25.119:INFO [launcher]: Launching browsers Chrome with concurre
ncy unlimited
02 09 2020 23:49:25.125:INFO [launcher]: Starting browser Chrome
92% additional asset processing copy-webpack-plugin02 09 2020 23:49:34.003:WARN
[karma]: No captured browser, open http://localhost:9876/
02 09 2020 23:49:34.907:INFO [Chrome 85.0.4183.83 (Windows 10)]: Connected on so
cket z7khH23JfW4v-_TtAAAA with id 59608191
Chrome 85.0.4183.83 (Windows 10): Executed 3 of 3 SUCCESS (0.62 secs / 0.273 sec
s)
TOTAL: 3 SUCCESS
TOTAL: 3 SUCCESS
TOTAL: 3 SUCCESS

=============================== Coverage summary ===============================
Statements   : 100% ( 6/6 )
Branches     : 100% ( 0/0 )
Functions    : 100% ( 0/0 )
Lines        : 100% ( 5/5 )
================================================================================

What's wrong?

> But why do you want to look into Angular/CLI? Do you really think that it
> makes sense to try to fix every console app out there? Really? Wouldn't it
> make more sense to just accept that there are many console applications
> that are broken by the recent change, and accommodate them in the Cygwin
> runtime? That would take substantially less time.

It is important to know what's happning actually to fix the issue.
Superficial patch makes the problem more complicated.

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-02 10:54     ` Takashi Yano
@ 2020-09-02 15:24       ` Corinna Vinschen
  2020-09-02 16:09         ` Corinna Vinschen
  2020-09-02 16:25         ` Takashi Yano
  0 siblings, 2 replies; 60+ messages in thread
From: Corinna Vinschen @ 2020-09-02 15:24 UTC (permalink / raw)
  To: cygwin-patches

On Sep  2 19:54, Takashi Yano via Cygwin-patches wrote:
> Hi Corinna,
> 
> On Wed, 2 Sep 2020 10:38:18 +0200
> Corinna Vinschen wrote:
> > On Sep  2 10:30, Corinna Vinschen wrote:
> > > Ok guys, I'm not opposed to this change in terms of its result,
> > > but I'm starting to wonder why all this locale code in fhandler_tty
> > > is necessary at all.
> > > 
> > > I see that get_langinfo() calls __loadlocale and performs a lot of stuff
> > > on the charsets which looks like duplicates of the initial_setlocale()
> > > call performed at DLL startup.
> > > 
> > > If there's anything missing in the initial_setlocale() call which would
> > > be required by the pseudo tty code?  What exactly is it?  The codepage?
> > > And why can't we just add the info to cygheap->locale at initial_setlocale()
> > > time so it's available at exec time without going through all this hassle
> > > every time?
> > > 
> > > Apart from that, all this locale/charset/lcid stuff should be concentrated
> > > in nlsfunc.cc ideally.
> > 
> > get_locale_from_env() and get_langinfo() should go away.  If we just
> > need a codepage for get_ttyp ()->term_code_page, we should really find a
> > way to do this from within internal_setlocale().
> 
> I looked into internal_setlocale() code, but I could not found
> the code which handles thecode page. I found the code handling
> the code page in __set_charset_from_locale() function in nlsfuncs.cc,
> but it does not return code page itself. Could you please explain
> more detail of your idea?

I had none yet :)  I was just musing, without actually thinking about a
solution.  But I think this isn't very complicated.  Given this is
inside Cygwin, nothing keeps the function to have a well-defined
side-effect, as in setting a (not yet existing) member "term_code_page"
of cygheap->locale.

Kind of like this:

diff --git a/winsup/cygwin/cygheap.h b/winsup/cygwin/cygheap.h
index 8877cc358c39..2b84f4252071 100644
--- a/winsup/cygwin/cygheap.h
+++ b/winsup/cygwin/cygheap.h
@@ -341,6 +341,7 @@ struct cygheap_debug
 struct cygheap_locale
 {
   mbtowc_p mbtowc;
+  UINT term_code_page;
 };
 
 struct user_heap_info
diff --git a/winsup/cygwin/nlsfuncs.cc b/winsup/cygwin/nlsfuncs.cc
index 668d7eb9e778..752f4239d911 100644
--- a/winsup/cygwin/nlsfuncs.cc
+++ b/winsup/cygwin/nlsfuncs.cc
@@ -1298,6 +1298,9 @@ __set_charset_from_locale (const char *locale, char *charset)
 			    LOCALE_IDEFAULTANSICODEPAGE | LOCALE_RETURN_NUMBER,
 			    (PWCHAR) &cp, sizeof cp))
     cp = 0;
+  /* Store codepage in cygheap->locale so fhandler_tty can switch the
+     pseudo console to the correct codepage. */
+  cygheap->locale.term_code_page = cp ?: CP_UTF8;
   /* Translate codepage and lcid to a charset closely aligned with the default
      charsets defined in Glibc. */
   const char *cs;

Make sense?


Thanks,
Corinna

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-02 15:24       ` Corinna Vinschen
@ 2020-09-02 16:09         ` Corinna Vinschen
  2020-09-02 16:25         ` Takashi Yano
  1 sibling, 0 replies; 60+ messages in thread
From: Corinna Vinschen @ 2020-09-02 16:09 UTC (permalink / raw)
  To: cygwin-patches

On Sep  2 17:24, Corinna Vinschen wrote:
> On Sep  2 19:54, Takashi Yano via Cygwin-patches wrote:
> > Hi Corinna,
> > 
> > On Wed, 2 Sep 2020 10:38:18 +0200
> > Corinna Vinschen wrote:
> > > On Sep  2 10:30, Corinna Vinschen wrote:
> > > > Ok guys, I'm not opposed to this change in terms of its result,
> > > > but I'm starting to wonder why all this locale code in fhandler_tty
> > > > is necessary at all.
> > > > 
> > > > I see that get_langinfo() calls __loadlocale and performs a lot of stuff
> > > > on the charsets which looks like duplicates of the initial_setlocale()
> > > > call performed at DLL startup.
> > > > 
> > > > If there's anything missing in the initial_setlocale() call which would
> > > > be required by the pseudo tty code?  What exactly is it?  The codepage?
> > > > And why can't we just add the info to cygheap->locale at initial_setlocale()
> > > > time so it's available at exec time without going through all this hassle
> > > > every time?
> > > > 
> > > > Apart from that, all this locale/charset/lcid stuff should be concentrated
> > > > in nlsfunc.cc ideally.
> > > 
> > > get_locale_from_env() and get_langinfo() should go away.  If we just
> > > need a codepage for get_ttyp ()->term_code_page, we should really find a
> > > way to do this from within internal_setlocale().
> > 
> > I looked into internal_setlocale() code, but I could not found
> > the code which handles thecode page. I found the code handling
> > the code page in __set_charset_from_locale() function in nlsfuncs.cc,
> > but it does not return code page itself. Could you please explain
> > more detail of your idea?
> 
> I had none yet :)  I was just musing, without actually thinking about a
> solution.  But I think this isn't very complicated.  Given this is
> inside Cygwin, nothing keeps the function to have a well-defined
> side-effect, as in setting a (not yet existing) member "term_code_page"
> of cygheap->locale.
> 
> Kind of like this:

Actually, this is a bit too simple, but you get the idea.  We need to
align the terminal codepage with the actual Cygwin charset, along the
lines of what your setup_locale is doing standalone yet.  Except in case
of ASCII, where we default to UTF-8 internally.  The important part here
is that we do this once, and that we don't have unnecessary code
duplication.


Corinna

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-02 15:24       ` Corinna Vinschen
  2020-09-02 16:09         ` Corinna Vinschen
@ 2020-09-02 16:25         ` Takashi Yano
  2020-09-02 16:38           ` Corinna Vinschen
  1 sibling, 1 reply; 60+ messages in thread
From: Takashi Yano @ 2020-09-02 16:25 UTC (permalink / raw)
  To: cygwin-patches

Hi Corinna,

On Wed, 2 Sep 2020 17:24:50 +0200
Corinna Vinschen  wrote:
> On Sep  2 19:54, Takashi Yano via Cygwin-patches wrote:
> > Hi Corinna,
> > 
> > On Wed, 2 Sep 2020 10:38:18 +0200
> > Corinna Vinschen wrote:
> > > On Sep  2 10:30, Corinna Vinschen wrote:
> > > > Ok guys, I'm not opposed to this change in terms of its result,
> > > > but I'm starting to wonder why all this locale code in fhandler_tty
> > > > is necessary at all.
> > > > 
> > > > I see that get_langinfo() calls __loadlocale and performs a lot of stuff
> > > > on the charsets which looks like duplicates of the initial_setlocale()
> > > > call performed at DLL startup.
> > > > 
> > > > If there's anything missing in the initial_setlocale() call which would
> > > > be required by the pseudo tty code?  What exactly is it?  The codepage?
> > > > And why can't we just add the info to cygheap->locale at initial_setlocale()
> > > > time so it's available at exec time without going through all this hassle
> > > > every time?
> > > > 
> > > > Apart from that, all this locale/charset/lcid stuff should be concentrated
> > > > in nlsfunc.cc ideally.
> > > 
> > > get_locale_from_env() and get_langinfo() should go away.  If we just
> > > need a codepage for get_ttyp ()->term_code_page, we should really find a
> > > way to do this from within internal_setlocale().
> > 
> > I looked into internal_setlocale() code, but I could not found
> > the code which handles thecode page. I found the code handling
> > the code page in __set_charset_from_locale() function in nlsfuncs.cc,
> > but it does not return code page itself. Could you please explain
> > more detail of your idea?
> 
> I had none yet :)  I was just musing, without actually thinking about a
> solution.  But I think this isn't very complicated.  Given this is
> inside Cygwin, nothing keeps the function to have a well-defined
> side-effect, as in setting a (not yet existing) member "term_code_page"
> of cygheap->locale.
> 
> Kind of like this:
> 
> diff --git a/winsup/cygwin/cygheap.h b/winsup/cygwin/cygheap.h
> index 8877cc358c39..2b84f4252071 100644
> --- a/winsup/cygwin/cygheap.h
> +++ b/winsup/cygwin/cygheap.h
> @@ -341,6 +341,7 @@ struct cygheap_debug
>  struct cygheap_locale
>  {
>    mbtowc_p mbtowc;
> +  UINT term_code_page;
>  };
>  
>  struct user_heap_info
> diff --git a/winsup/cygwin/nlsfuncs.cc b/winsup/cygwin/nlsfuncs.cc
> index 668d7eb9e778..752f4239d911 100644
> --- a/winsup/cygwin/nlsfuncs.cc
> +++ b/winsup/cygwin/nlsfuncs.cc
> @@ -1298,6 +1298,9 @@ __set_charset_from_locale (const char *locale, char *charset)
>  			    LOCALE_IDEFAULTANSICODEPAGE | LOCALE_RETURN_NUMBER,
>  			    (PWCHAR) &cp, sizeof cp))
>      cp = 0;
> +  /* Store codepage in cygheap->locale so fhandler_tty can switch the
> +     pseudo console to the correct codepage. */
> +  cygheap->locale.term_code_page = cp ?: CP_UTF8;
>    /* Translate codepage and lcid to a charset closely aligned with the default
>       charsets defined in Glibc. */
>    const char *cs;
> 
> Make sense?

I have tried your code, however, it does not work as expected.
It seems that __set_charset_from_locale() is not called.
cygheap->locale.term_code_page is always 0.

I have added following lines into setup_locale() to make sure
to call __set_charset_from_locale() for a test,

  setlocale (LC_ALL, "");
  __set_charset_from_locale (__get_global_locale()->categories[LC_CTYPE], charset);
  get_ttyp ()->term_code_page = cygheap->locale.term_code_page;

however, term_code_page is set to 932 if locale is ja_JP.UTF-8.
In this case term_code_page should be CP_UTF8 (65001).

The code page retrieved in __set_charset_from_locale() is not
based on "UTF-8" but "ja_JP".

Let me consider a while.

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-02 16:25         ` Takashi Yano
@ 2020-09-02 16:38           ` Corinna Vinschen
  2020-09-03 17:59             ` Corinna Vinschen
  0 siblings, 1 reply; 60+ messages in thread
From: Corinna Vinschen @ 2020-09-02 16:38 UTC (permalink / raw)
  To: cygwin-patches

Hi Takashi,

On Sep  3 01:25, Takashi Yano via Cygwin-patches wrote:
> Hi Corinna,
> 
> On Wed, 2 Sep 2020 17:24:50 +0200
> Corinna Vinschen  wrote:
> > > > get_locale_from_env() and get_langinfo() should go away.  If we just
> > > > need a codepage for get_ttyp ()->term_code_page, we should really find a
> > > > way to do this from within internal_setlocale().
> > > 
> > > I looked into internal_setlocale() code, but I could not found
> > > the code which handles thecode page. I found the code handling
> > > the code page in __set_charset_from_locale() function in nlsfuncs.cc,
> > > but it does not return code page itself. Could you please explain
> > > more detail of your idea?
> > 
> > I had none yet :)  I was just musing, without actually thinking about a
> > solution.  But I think this isn't very complicated.  Given this is
> > inside Cygwin, nothing keeps the function to have a well-defined
> > side-effect, as in setting a (not yet existing) member "term_code_page"
> > of cygheap->locale.
> > 
> > Kind of like this:
> > [...]
> I have tried your code, however, it does not work as expected.
> It seems that __set_charset_from_locale() is not called.
> cygheap->locale.term_code_page is always 0.

Ah, right!  Take a look into newlib/libc/locale/locale.c, function
__loadlocale().  This function is called from _setlocale_r().  However,
it calls __set_charset_from_locale() *only* if the charset isn't already
given explicitely in the LC_* or LANG string, because otherwise we
already know the charset, after all.

Darn!  That foils my plans for world domination...

> Let me consider a while.

Thanks, I'll do the same.  I'd really like to simplify this stuff
and doing the locale shuffle in two entirely different locations
at different times is prone to getting out of sync.


Corinna

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-02 16:38           ` Corinna Vinschen
@ 2020-09-03 17:59             ` Corinna Vinschen
  2020-09-04  9:21               ` Takashi Yano
  0 siblings, 1 reply; 60+ messages in thread
From: Corinna Vinschen @ 2020-09-03 17:59 UTC (permalink / raw)
  To: cygwin-patches

On Sep  2 18:38, Corinna Vinschen wrote:
> Hi Takashi,
> 
> On Sep  3 01:25, Takashi Yano via Cygwin-patches wrote:
> > Hi Corinna,
> > 
> > On Wed, 2 Sep 2020 17:24:50 +0200
> > Corinna Vinschen  wrote:
> > > > > get_locale_from_env() and get_langinfo() should go away.  If we just
> > > > > need a codepage for get_ttyp ()->term_code_page, we should really find a
> > > > > way to do this from within internal_setlocale().
> > > > 
> > > > I looked into internal_setlocale() code, but I could not found
> > > > the code which handles thecode page. I found the code handling
> > > > the code page in __set_charset_from_locale() function in nlsfuncs.cc,
> > > > but it does not return code page itself. Could you please explain
> > > > more detail of your idea?
> > > 
> > > I had none yet :)  I was just musing, without actually thinking about a
> > > solution.  But I think this isn't very complicated.  Given this is
> > > inside Cygwin, nothing keeps the function to have a well-defined
> > > side-effect, as in setting a (not yet existing) member "term_code_page"
> > > of cygheap->locale.
> > > 
> > > Kind of like this:
> > > [...]
> > I have tried your code, however, it does not work as expected.
> > It seems that __set_charset_from_locale() is not called.
> > cygheap->locale.term_code_page is always 0.
> 
> Ah, right!  Take a look into newlib/libc/locale/locale.c, function
> __loadlocale().  This function is called from _setlocale_r().  However,
> it calls __set_charset_from_locale() *only* if the charset isn't already
> given explicitely in the LC_* or LANG string, because otherwise we
> already know the charset, after all.
> 
> Darn!  That foils my plans for world domination...
> 
> > Let me consider a while.
> 
> Thanks, I'll do the same.  I'd really like to simplify this stuff
> and doing the locale shuffle in two entirely different locations
> at different times is prone to getting out of sync.

The only idea I had so far was, changing the way __set_charset_from_locale
works from within _setlocale_r:

We could add a Cygwin-specific function only fetching the codepage and
call it unconditionally from _setlocale_r.  __set_charset_from_locale is
called with a new parameter "codepage", so it doesn't have to fetch the
CP by itself, but it's still only called from _setlocale_r if necessary.

Would that be sufficient?  The CP conversion from 20127/ASCII to 65001/UTF8
could be done at the point the codepage is actually required.


Corinna

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-04 10:03 ` Takashi Yano
@ 2020-09-04  6:23   ` Johannes Schindelin
  2020-09-04 15:03     ` Takashi Yano
  0 siblings, 1 reply; 60+ messages in thread
From: Johannes Schindelin @ 2020-09-04  6:23 UTC (permalink / raw)
  To: Takashi Yano; +Cc: cygwin-patches

Hi Takashi,

On Fri, 4 Sep 2020, Takashi Yano via Cygwin-patches wrote:

> Hi Johannes and Corinna,
>
> On Tue, 1 Sep 2020 18:19:16 +0200 (CEST)
> Johannes Schindelin wrote:
>
> > When `LANG=en_US.UTF-8`, the detected `LCID` is 0x0409, which is
> > correct, but after that (at least if Pseudo Console support is enabled),
> > we try to find the default code page for that `LCID`, which is ASCII
> > (437). Subsequently, we set the Console output code page to that value,
> > completely ignoring that we wanted to use UTF-8.
> >
> > Let's not ignore the specifically asked-for UTF-8 character set.
> >
> > While at it, let's also set the Console output code page even if Pseudo
> > Console support is disabled; contrary to the behavior of v3.0.7, the
> > Console output code page is not ignored in that case.
> >
> > The most common symptom would be that console applications which do not
> > specifically call `SetConsoleOutputCP()` but output UTF-8-encoded text
> > seem to be broken with v3.1.x when they worked plenty fine with v3.0.x.
> >
> > This fixes https://github.com/msys2/MSYS2-packages/issues/1974,
> > https://github.com/msys2/MSYS2-packages/issues/2012,
> > https://github.com/rust-lang/cargo/issues/8369,
> > https://github.com/git-for-windows/git/issues/2734,
> > https://github.com/git-for-windows/git/issues/2793,
> > https://github.com/git-for-windows/git/issues/2792, and possibly quite a
> > few others.
> >
> > Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> > ---
> >  winsup/cygwin/fhandler_tty.cc | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> >
> > diff --git a/winsup/cygwin/fhandler_tty.cc b/winsup/cygwin/fhandler_tty.cc
> > index 06789a500..414c26992 100644
> > --- a/winsup/cygwin/fhandler_tty.cc
> > +++ b/winsup/cygwin/fhandler_tty.cc
> > @@ -2859,6 +2859,15 @@ fhandler_pty_slave::setup_locale (void)
> >    char charset[ENCODING_LEN + 1] = "ASCII";
> >    LCID lcid = get_langinfo (locale, charset);
> >
> > +  /* Special-case the UTF-8 character set */
> > +  if (strcasecmp (charset, "UTF-8") == 0)
> > +    {
> > +      get_ttyp ()->term_code_page = CP_UTF8;
> > +      SetConsoleCP (CP_UTF8);
> > +      SetConsoleOutputCP (CP_UTF8);
> > +      return;
> > +    }
> > +
> >    /* Set console code page from locale */
> >    if (get_pseudo_console ())
> >      {
> > --
> > 2.27.0
>
> I would like to propose a counter patch attached.
> What do you think of this patch?
>
> This patch does not treat UTF-8 as special.

Sure, but it only fixes the issue in `disable_pcon` mode in the current
tip commit. That's because a new Pseudo Console is created for every
spawned non-Cygwin console application, and that new Pseudo Console does
_not_ have the code page set by your patch.

I verified that this patch works around the issue in `disable_pcon` mode,
which is good.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-03 17:59             ` Corinna Vinschen
@ 2020-09-04  9:21               ` Takashi Yano
  2020-09-04 12:44                 ` Corinna Vinschen
  0 siblings, 1 reply; 60+ messages in thread
From: Takashi Yano @ 2020-09-04  9:21 UTC (permalink / raw)
  To: cygwin-patches

[-- Attachment #1: Type: text/plain, Size: 2929 bytes --]

Hi Corinna,

On Thu, 3 Sep 2020 19:59:12 +0200
Corinna Vinschen wrote:
> On Sep  2 18:38, Corinna Vinschen wrote:
> > Hi Takashi,
> > 
> > On Sep  3 01:25, Takashi Yano via Cygwin-patches wrote:
> > > Hi Corinna,
> > > 
> > > On Wed, 2 Sep 2020 17:24:50 +0200
> > > Corinna Vinschen  wrote:
> > > > > > get_locale_from_env() and get_langinfo() should go away.  If we just
> > > > > > need a codepage for get_ttyp ()->term_code_page, we should really find a
> > > > > > way to do this from within internal_setlocale().
> > > > > 
> > > > > I looked into internal_setlocale() code, but I could not found
> > > > > the code which handles thecode page. I found the code handling
> > > > > the code page in __set_charset_from_locale() function in nlsfuncs.cc,
> > > > > but it does not return code page itself. Could you please explain
> > > > > more detail of your idea?
> > > > 
> > > > I had none yet :)  I was just musing, without actually thinking about a
> > > > solution.  But I think this isn't very complicated.  Given this is
> > > > inside Cygwin, nothing keeps the function to have a well-defined
> > > > side-effect, as in setting a (not yet existing) member "term_code_page"
> > > > of cygheap->locale.
> > > > 
> > > > Kind of like this:
> > > > [...]
> > > I have tried your code, however, it does not work as expected.
> > > It seems that __set_charset_from_locale() is not called.
> > > cygheap->locale.term_code_page is always 0.
> > 
> > Ah, right!  Take a look into newlib/libc/locale/locale.c, function
> > __loadlocale().  This function is called from _setlocale_r().  However,
> > it calls __set_charset_from_locale() *only* if the charset isn't already
> > given explicitely in the LC_* or LANG string, because otherwise we
> > already know the charset, after all.
> > 
> > Darn!  That foils my plans for world domination...
> > 
> > > Let me consider a while.
> > 
> > Thanks, I'll do the same.  I'd really like to simplify this stuff
> > and doing the locale shuffle in two entirely different locations
> > at different times is prone to getting out of sync.
> 
> The only idea I had so far was, changing the way __set_charset_from_locale
> works from within _setlocale_r:
> 
> We could add a Cygwin-specific function only fetching the codepage and
> call it unconditionally from _setlocale_r.  __set_charset_from_locale is
> called with a new parameter "codepage", so it doesn't have to fetch the
> CP by itself, but it's still only called from _setlocale_r if necessary.
> 
> Would that be sufficient?  The CP conversion from 20127/ASCII to 65001/UTF8
> could be done at the point the codepage is actually required.

I think I have found the answer to your request.
Patch attached. What do you think of this patch?

Calling initial_setlocale() is necessary because
nl_langinfo() always returns "ANSI_X3.4-1968"
regardless locale setting if this is not called.

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>

[-- Attachment #2: 0001-Cygwin-pty-Replace-pty-specific-locale-functions-wit.patch --]
[-- Type: application/octet-stream, Size: 5000 bytes --]

From de56230876d4e06fa78a41a3af6a1c439dfe8f41 Mon Sep 17 00:00:00 2001
From: Takashi Yano <takashi.yano@nifty.ne.jp>
Date: Fri, 4 Sep 2020 18:11:01 +0900
Subject: [PATCH] Cygwin: pty: Replace pty specific locale functions with
 nl_langinfo().

- Remove pty specific functions get_locale_from_env() and get_langinfo(),
  which are called from setup_locale(), and use nl_langinfo() instead.
---
 winsup/cygwin/fhandler_tty.cc | 119 ++--------------------------------
 1 file changed, 4 insertions(+), 115 deletions(-)

diff --git a/winsup/cygwin/fhandler_tty.cc b/winsup/cygwin/fhandler_tty.cc
index 8bf39c3e6..fd74a9f3d 100644
--- a/winsup/cygwin/fhandler_tty.cc
+++ b/winsup/cygwin/fhandler_tty.cc
@@ -26,6 +26,7 @@ details. */
 #include <asm/socket.h>
 #include "cygwait.h"
 #include "registry.h"
+#include <langinfo.h>
 
 #ifndef PROC_THREAD_ATTRIBUTE_PSEUDOCONSOLE
 #define PROC_THREAD_ATTRIBUTE_PSEUDOCONSOLE 0x00020016
@@ -1813,127 +1814,15 @@ cs_names[] = {
   {       0, "NULL"}
 };
 
-static void
-get_locale_from_env (char *locale)
-{
-  const char *env = NULL;
-  char lang[ENCODING_LEN + 1] = {0, }, country[ENCODING_LEN + 1] = {0, };
-  env = getenv ("LC_ALL");
-  if (env == NULL || !*env)
-    env = getenv ("LC_CTYPE");
-  if (env == NULL || !*env)
-    env = getenv ("LANG");
-  if (env == NULL || !*env)
-    {
-      if (GetLocaleInfo (LOCALE_CUSTOM_UI_DEFAULT,
-			  LOCALE_SISO639LANGNAME,
-			  lang, sizeof (lang)))
-	GetLocaleInfo (LOCALE_CUSTOM_UI_DEFAULT,
-		       LOCALE_SISO3166CTRYNAME,
-		       country, sizeof (country));
-      else if (GetLocaleInfo (LOCALE_CUSTOM_DEFAULT,
-			      LOCALE_SISO639LANGNAME,
-			      lang, sizeof (lang)))
-	  GetLocaleInfo (LOCALE_CUSTOM_DEFAULT,
-			 LOCALE_SISO3166CTRYNAME,
-			 country, sizeof (country));
-      else if (GetLocaleInfo (LOCALE_USER_DEFAULT,
-			      LOCALE_SISO639LANGNAME,
-			      lang, sizeof (lang)))
-	  GetLocaleInfo (LOCALE_USER_DEFAULT,
-			 LOCALE_SISO3166CTRYNAME,
-			 country, sizeof (country));
-      else if (GetLocaleInfo (LOCALE_SYSTEM_DEFAULT,
-			      LOCALE_SISO639LANGNAME,
-			      lang, sizeof (lang)))
-	  GetLocaleInfo (LOCALE_SYSTEM_DEFAULT,
-			 LOCALE_SISO3166CTRYNAME,
-			 country, sizeof (country));
-      if (strlen (lang) && strlen (country))
-	__small_sprintf (lang + strlen (lang), "_%s.UTF-8", country);
-      else
-	strcpy (lang , "C.UTF-8");
-      env = lang;
-    }
-  strcpy (locale, env);
-}
-
-static void
-get_langinfo (char *locale_out, char *charset_out)
-{
-  /* Get locale from environment */
-  char new_locale[ENCODING_LEN + 1];
-  get_locale_from_env (new_locale);
-
-  __locale_t loc;
-  memset (&loc, 0, sizeof (loc));
-  const char *locale = __loadlocale (&loc, LC_CTYPE, new_locale);
-  if (!locale)
-    locale = "C";
-
-  const char *charset;
-  struct lc_ctype_T *lc_ctype = (struct lc_ctype_T *) loc.lc_cat[LC_CTYPE].ptr;
-  if (!lc_ctype)
-    charset = "ASCII";
-  else
-    charset = lc_ctype->codeset;
-
-  /* The following code is borrowed from nl_langinfo()
-     in newlib/libc/locale/nl_langinfo.c */
-  /* Convert charset to Linux compatible codeset string. */
-  if (charset[0] == 'A'/*SCII*/)
-    charset = "ANSI_X3.4-1968";
-  else if (charset[0] == 'E')
-    {
-      if (strcmp (charset, "EUCJP") == 0)
-	charset = "EUC-JP";
-      else if (strcmp (charset, "EUCKR") == 0)
-	charset = "EUC-KR";
-      else if (strcmp (charset, "EUCCN") == 0)
-	charset = "GB2312";
-    }
-  else if (charset[0] == 'C'/*Pxxxx*/)
-    {
-      if (strcmp (charset + 2, "874") == 0)
-	charset = "TIS-620";
-      else if (strcmp (charset + 2, "20866") == 0)
-	charset = "KOI8-R";
-      else if (strcmp (charset + 2, "21866") == 0)
-	charset = "KOI8-U";
-      else if (strcmp (charset + 2, "101") == 0)
-	charset = "GEORGIAN-PS";
-      else if (strcmp (charset + 2, "102") == 0)
-	charset = "PT154";
-    }
-  else if (charset[0] == 'S'/*JIS*/)
-    {
-      /* Cygwin uses MSFT's implementation of SJIS, which differs
-	 in some codepoints from the real thing, especially
-	 0x5c: yen sign instead of backslash,
-	 0x7e: overline instead of tilde.
-	 We can't use the real SJIS since otherwise Win32
-	 pathnames would become invalid.  OTOH, if we return
-	 "SJIS" here, then libiconv will do mb<->wc conversion
-	 differently to our internal functions.  Therefore we
-	 return what we really implement, CP932.  This is handled
-	 fine by libiconv. */
-      charset = "CP932";
-    }
-
-  /* Set results */
-  strcpy (locale_out, new_locale);
-  strcpy (charset_out, charset);
-}
-
 void
 fhandler_pty_slave::setup_locale (void)
 {
   if (get_ttyp ()->term_code_page != 0)
     return;
 
-  char locale[ENCODING_LEN + 1] = "C";
-  char charset[ENCODING_LEN + 1] = "ASCII";
-  get_langinfo (locale, charset);
+  extern void initial_setlocale ();
+  initial_setlocale ();
+  const char *charset = nl_langinfo (_NL_CTYPE_CODESET_NAME);
 
   /* Set terminal code page from locale */
   /* This code is borrowed from mintty: charset.c */
-- 
2.28.0


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-01 16:19 [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8" Johannes Schindelin
  2020-09-02  6:06 ` Johannes Schindelin
  2020-09-02  8:30 ` Corinna Vinschen
@ 2020-09-04 10:03 ` Takashi Yano
  2020-09-04  6:23   ` Johannes Schindelin
  2 siblings, 1 reply; 60+ messages in thread
From: Takashi Yano @ 2020-09-04 10:03 UTC (permalink / raw)
  To: cygwin-patches

[-- Attachment #1: Type: text/plain, Size: 2341 bytes --]

Hi Johannes and Corinna,

On Tue, 1 Sep 2020 18:19:16 +0200 (CEST)
Johannes Schindelin wrote:

> When `LANG=en_US.UTF-8`, the detected `LCID` is 0x0409, which is
> correct, but after that (at least if Pseudo Console support is enabled),
> we try to find the default code page for that `LCID`, which is ASCII
> (437). Subsequently, we set the Console output code page to that value,
> completely ignoring that we wanted to use UTF-8.
> 
> Let's not ignore the specifically asked-for UTF-8 character set.
> 
> While at it, let's also set the Console output code page even if Pseudo
> Console support is disabled; contrary to the behavior of v3.0.7, the
> Console output code page is not ignored in that case.
> 
> The most common symptom would be that console applications which do not
> specifically call `SetConsoleOutputCP()` but output UTF-8-encoded text
> seem to be broken with v3.1.x when they worked plenty fine with v3.0.x.
> 
> This fixes https://github.com/msys2/MSYS2-packages/issues/1974,
> https://github.com/msys2/MSYS2-packages/issues/2012,
> https://github.com/rust-lang/cargo/issues/8369,
> https://github.com/git-for-windows/git/issues/2734,
> https://github.com/git-for-windows/git/issues/2793,
> https://github.com/git-for-windows/git/issues/2792, and possibly quite a
> few others.
> 
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> ---
>  winsup/cygwin/fhandler_tty.cc | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/winsup/cygwin/fhandler_tty.cc b/winsup/cygwin/fhandler_tty.cc
> index 06789a500..414c26992 100644
> --- a/winsup/cygwin/fhandler_tty.cc
> +++ b/winsup/cygwin/fhandler_tty.cc
> @@ -2859,6 +2859,15 @@ fhandler_pty_slave::setup_locale (void)
>    char charset[ENCODING_LEN + 1] = "ASCII";
>    LCID lcid = get_langinfo (locale, charset);
> 
> +  /* Special-case the UTF-8 character set */
> +  if (strcasecmp (charset, "UTF-8") == 0)
> +    {
> +      get_ttyp ()->term_code_page = CP_UTF8;
> +      SetConsoleCP (CP_UTF8);
> +      SetConsoleOutputCP (CP_UTF8);
> +      return;
> +    }
> +
>    /* Set console code page from locale */
>    if (get_pseudo_console ())
>      {
> --
> 2.27.0

I would like to propose a counter patch attached.
What do you think of this patch?

This patch does not treat UTF-8 as special.

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>

[-- Attachment #2: 0001-Cygwin-pty-Prevent-garbled-output-when-pseudo-consol.patch --]
[-- Type: application/octet-stream, Size: 1145 bytes --]

From 05013fe7cc5bbc3f20a840bc44a49c1e07702bf4 Mon Sep 17 00:00:00 2001
From: Takashi Yano <takashi.yano@nifty.ne.jp>
Date: Fri, 4 Sep 2020 18:31:35 +0900
Subject: [PATCH] Cygwin: pty: Prevent garbled output when pseudo console is
 disabled.

- If pseudo console is disabled, non-cygwin apps do not detect
  console device. In this case, the some apps may output messages
  based on the locale. In addition, some apps output UTF-8 string
  regardless of the locale setting. At least git-for-windows and
  node.js do that. Even in this case, garbled output is prevented
  with this patch in most cases because mintty uses UTF-8 by default.
---
 winsup/cygwin/fhandler_tty.cc | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/winsup/cygwin/fhandler_tty.cc b/winsup/cygwin/fhandler_tty.cc
index fd74a9f3d..3b33e82f7 100644
--- a/winsup/cygwin/fhandler_tty.cc
+++ b/winsup/cygwin/fhandler_tty.cc
@@ -1848,6 +1848,8 @@ fhandler_pty_slave::setup_locale (void)
 	  get_ttyp ()->term_code_page = cs_names[i].cp;
 	  break;
 	}
+  SetConsoleCP (get_ttyp ()->term_code_page);
+  SetConsoleOutputCP (get_ttyp ()->term_code_page);
 }
 
 void
-- 
2.28.0


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-04  9:21               ` Takashi Yano
@ 2020-09-04 12:44                 ` Corinna Vinschen
  2020-09-04 14:05                   ` Brian Inglis
                                     ` (2 more replies)
  0 siblings, 3 replies; 60+ messages in thread
From: Corinna Vinschen @ 2020-09-04 12:44 UTC (permalink / raw)
  To: cygwin-patches

[-- Attachment #1: Type: text/plain, Size: 2620 bytes --]

Hi Takashi,

On Sep  4 18:21, Takashi Yano via Cygwin-patches wrote:
> Hi Corinna,
> 
> On Thu, 3 Sep 2020 19:59:12 +0200
> Corinna Vinschen wrote:
> > The only idea I had so far was, changing the way __set_charset_from_locale
> > works from within _setlocale_r:
> > 
> > We could add a Cygwin-specific function only fetching the codepage and
> > call it unconditionally from _setlocale_r.  __set_charset_from_locale is
> > called with a new parameter "codepage", so it doesn't have to fetch the
> > CP by itself, but it's still only called from _setlocale_r if necessary.
> > 
> > Would that be sufficient?  The CP conversion from 20127/ASCII to 65001/UTF8
> > could be done at the point the codepage is actually required.
> 
> I think I have found the answer to your request.
> Patch attached. What do you think of this patch?
> 
> Calling initial_setlocale() is necessary because
> nl_langinfo() always returns "ANSI_X3.4-1968"
> regardless locale setting if this is not called.

Well, this is correct.  Per POSIX, a standard-conformant application is
running in the "C" locale unless it calls setlocale() explicitely.
That's one reason Cygwin defaults to UTF-8 internally.  Everything,
including the terminal, is supposed to default to UTF-8.  That's the
most sane default, even if an application is not locale-aware.

So, to follow POSIX, initial_setlocale() is used to set up the
environment and command line stuff and then, before calling the
application's main, Cygwin calls _setlocale_r (_REENT, LC_CTYPE, "C");
to reset the apps default locale to "C".  That's why nl_langinfo()
returns "ANSI_X3.4-1968".

However, the initial_setlocale() call in dll_crt0_1 calls
internal_setlocale(), and *that* function sets the conversion functions
for the internal conversions.  What it *doesn't* do yet at the moment is
to store the charset name itself or, better, the equivalent codepage.

If we change that, setup_locale can simply go away.  Below is a patch
doing just that.  Can you please check if that works in your test
scenarios?

However, there's something which worries me.  Why do we need or set the
pseudo terminal codepage at all?  I see that you convert from MB charset
to MB charset and then use the result in WriteFile to the connecting
pipes.  Question is this: Why not just converting the strings via
sys_mbstowcs to a UTF-16 string and then send that over the line, using
WriteConsoleW for the final output to the console?  That would simplify
this stuff quite a bit, wouldn't it?  After all, for writing UTF-16 to
the console, we simply don't need to know or care for the console CP.


Thanks,
Corinna

[-- Attachment #2: 0001-Cygwin-fetch-codepage-for-pseudo-console-at-initial_.patch --]
[-- Type: text/plain, Size: 11743 bytes --]

From d5dff1690d1a228579eef472441d67fb6ef20b5e Mon Sep 17 00:00:00 2001
From: Corinna Vinschen <corinna@vinschen.de>
Date: Fri, 4 Sep 2020 14:39:58 +0200
Subject: [PATCH] Cygwin: fetch codepage for pseudo console at
 initial_setlocale time

drop locale checking in fhandler_tty code. Create function
__eval_codepage_from_internal_charset called from
internal_setlocale to store term_code_page in cygheap,
rather than in tty.  Drop now unneeded setup_locale calls
during fork and execve.

Signed-off-by: Corinna Vinschen <corinna@vinschen.de>
---
 winsup/cygwin/cygheap.h       |   1 +
 winsup/cygwin/fhandler.h      |   1 -
 winsup/cygwin/fhandler_tty.cc | 195 ++--------------------------------
 winsup/cygwin/nlsfuncs.cc     |  54 +++++++++-
 winsup/cygwin/spawn.cc        |  12 ---
 5 files changed, 60 insertions(+), 203 deletions(-)

diff --git a/winsup/cygwin/cygheap.h b/winsup/cygwin/cygheap.h
index 8877cc358c39..2b84f4252071 100644
--- a/winsup/cygwin/cygheap.h
+++ b/winsup/cygwin/cygheap.h
@@ -341,6 +341,7 @@ struct cygheap_debug
 struct cygheap_locale
 {
   mbtowc_p mbtowc;
+  UINT term_code_page;
 };
 
 struct user_heap_info
diff --git a/winsup/cygwin/fhandler.h b/winsup/cygwin/fhandler.h
index b4ba9428aa90..af619df5f9b1 100644
--- a/winsup/cygwin/fhandler.h
+++ b/winsup/cygwin/fhandler.h
@@ -2336,7 +2336,6 @@ class fhandler_pty_slave: public fhandler_pty_common
   void set_switch_to_pcon (void);
   void reset_switch_to_pcon (void);
   void mask_switch_to_pcon_in (bool mask);
-  void setup_locale (void);
 };
 
 #define __ptsname(buf, unit) __small_sprintf ((buf), "/dev/pty%d", (unit))
diff --git a/winsup/cygwin/fhandler_tty.cc b/winsup/cygwin/fhandler_tty.cc
index 8bf39c3e6cf2..f207a0b27892 100644
--- a/winsup/cygwin/fhandler_tty.cc
+++ b/winsup/cygwin/fhandler_tty.cc
@@ -1614,8 +1614,9 @@ fhandler_pty_master::write (const void *ptr, size_t len)
   if (to_be_read_from_pcon () && get_ttyp ()->h_pseudo_console)
     {
       size_t nlen;
-      char *buf = convert_mb_str
-	(CP_UTF8, &nlen, get_ttyp ()->term_code_page, (const char *) ptr, len);
+      char *buf = convert_mb_str (CP_UTF8, &nlen,
+				  cygheap->locale.term_code_page,
+				  (const char *) ptr, len);
 
       WaitForSingleObject (input_mutex, INFINITE);
 
@@ -1782,185 +1783,6 @@ fhandler_pty_common::set_close_on_exec (bool val)
   close_on_exec (val);
 }
 
-/* This table is borrowed from mintty: charset.c */
-static const struct {
-  UINT cp;
-  const char *name;
-}
-cs_names[] = {
-  { CP_UTF8, "UTF-8"},
-  { CP_UTF8, "UTF8"},
-  {   20127, "ASCII"},
-  {   20127, "US-ASCII"},
-  {   20127, "ANSI_X3.4-1968"},
-  {   20866, "KOI8-R"},
-  {   20866, "KOI8R"},
-  {   20866, "KOI8"},
-  {   21866, "KOI8-U"},
-  {   21866, "KOI8U"},
-  {   20932, "EUCJP"},
-  {   20932, "EUC-JP"},
-  {     874, "TIS620"},
-  {     874, "TIS-620"},
-  {     932, "SJIS"},
-  {     936, "GBK"},
-  {     936, "GB2312"},
-  {     936, "EUCCN"},
-  {     936, "EUC-CN"},
-  {     949, "EUCKR"},
-  {     949, "EUC-KR"},
-  {     950, "BIG5"},
-  {       0, "NULL"}
-};
-
-static void
-get_locale_from_env (char *locale)
-{
-  const char *env = NULL;
-  char lang[ENCODING_LEN + 1] = {0, }, country[ENCODING_LEN + 1] = {0, };
-  env = getenv ("LC_ALL");
-  if (env == NULL || !*env)
-    env = getenv ("LC_CTYPE");
-  if (env == NULL || !*env)
-    env = getenv ("LANG");
-  if (env == NULL || !*env)
-    {
-      if (GetLocaleInfo (LOCALE_CUSTOM_UI_DEFAULT,
-			  LOCALE_SISO639LANGNAME,
-			  lang, sizeof (lang)))
-	GetLocaleInfo (LOCALE_CUSTOM_UI_DEFAULT,
-		       LOCALE_SISO3166CTRYNAME,
-		       country, sizeof (country));
-      else if (GetLocaleInfo (LOCALE_CUSTOM_DEFAULT,
-			      LOCALE_SISO639LANGNAME,
-			      lang, sizeof (lang)))
-	  GetLocaleInfo (LOCALE_CUSTOM_DEFAULT,
-			 LOCALE_SISO3166CTRYNAME,
-			 country, sizeof (country));
-      else if (GetLocaleInfo (LOCALE_USER_DEFAULT,
-			      LOCALE_SISO639LANGNAME,
-			      lang, sizeof (lang)))
-	  GetLocaleInfo (LOCALE_USER_DEFAULT,
-			 LOCALE_SISO3166CTRYNAME,
-			 country, sizeof (country));
-      else if (GetLocaleInfo (LOCALE_SYSTEM_DEFAULT,
-			      LOCALE_SISO639LANGNAME,
-			      lang, sizeof (lang)))
-	  GetLocaleInfo (LOCALE_SYSTEM_DEFAULT,
-			 LOCALE_SISO3166CTRYNAME,
-			 country, sizeof (country));
-      if (strlen (lang) && strlen (country))
-	__small_sprintf (lang + strlen (lang), "_%s.UTF-8", country);
-      else
-	strcpy (lang , "C.UTF-8");
-      env = lang;
-    }
-  strcpy (locale, env);
-}
-
-static void
-get_langinfo (char *locale_out, char *charset_out)
-{
-  /* Get locale from environment */
-  char new_locale[ENCODING_LEN + 1];
-  get_locale_from_env (new_locale);
-
-  __locale_t loc;
-  memset (&loc, 0, sizeof (loc));
-  const char *locale = __loadlocale (&loc, LC_CTYPE, new_locale);
-  if (!locale)
-    locale = "C";
-
-  const char *charset;
-  struct lc_ctype_T *lc_ctype = (struct lc_ctype_T *) loc.lc_cat[LC_CTYPE].ptr;
-  if (!lc_ctype)
-    charset = "ASCII";
-  else
-    charset = lc_ctype->codeset;
-
-  /* The following code is borrowed from nl_langinfo()
-     in newlib/libc/locale/nl_langinfo.c */
-  /* Convert charset to Linux compatible codeset string. */
-  if (charset[0] == 'A'/*SCII*/)
-    charset = "ANSI_X3.4-1968";
-  else if (charset[0] == 'E')
-    {
-      if (strcmp (charset, "EUCJP") == 0)
-	charset = "EUC-JP";
-      else if (strcmp (charset, "EUCKR") == 0)
-	charset = "EUC-KR";
-      else if (strcmp (charset, "EUCCN") == 0)
-	charset = "GB2312";
-    }
-  else if (charset[0] == 'C'/*Pxxxx*/)
-    {
-      if (strcmp (charset + 2, "874") == 0)
-	charset = "TIS-620";
-      else if (strcmp (charset + 2, "20866") == 0)
-	charset = "KOI8-R";
-      else if (strcmp (charset + 2, "21866") == 0)
-	charset = "KOI8-U";
-      else if (strcmp (charset + 2, "101") == 0)
-	charset = "GEORGIAN-PS";
-      else if (strcmp (charset + 2, "102") == 0)
-	charset = "PT154";
-    }
-  else if (charset[0] == 'S'/*JIS*/)
-    {
-      /* Cygwin uses MSFT's implementation of SJIS, which differs
-	 in some codepoints from the real thing, especially
-	 0x5c: yen sign instead of backslash,
-	 0x7e: overline instead of tilde.
-	 We can't use the real SJIS since otherwise Win32
-	 pathnames would become invalid.  OTOH, if we return
-	 "SJIS" here, then libiconv will do mb<->wc conversion
-	 differently to our internal functions.  Therefore we
-	 return what we really implement, CP932.  This is handled
-	 fine by libiconv. */
-      charset = "CP932";
-    }
-
-  /* Set results */
-  strcpy (locale_out, new_locale);
-  strcpy (charset_out, charset);
-}
-
-void
-fhandler_pty_slave::setup_locale (void)
-{
-  if (get_ttyp ()->term_code_page != 0)
-    return;
-
-  char locale[ENCODING_LEN + 1] = "C";
-  char charset[ENCODING_LEN + 1] = "ASCII";
-  get_langinfo (locale, charset);
-
-  /* Set terminal code page from locale */
-  /* This code is borrowed from mintty: charset.c */
-  get_ttyp ()->term_code_page = 20127; /* Default ASCII */
-  char charset_u[ENCODING_LEN + 1] = {0, };
-  for (int i=0; charset[i] && i<ENCODING_LEN; i++)
-    charset_u[i] = toupper (charset[i]);
-  unsigned int iso;
-  UINT cp = 20127; /* Default for fallback */
-  if (sscanf (charset_u, "ISO-8859-%u", &iso) == 1
-      || sscanf (charset_u, "ISO8859-%u", &iso) == 1
-      || sscanf (charset_u, "ISO8859%u", &iso) == 1)
-    {
-      if (iso && iso <= 16 && iso !=12)
-	get_ttyp ()->term_code_page = 28590 + iso;
-    }
-  else if (sscanf (charset_u, "CP%u", &cp) == 1)
-    get_ttyp ()->term_code_page = cp;
-  else
-    for (int i=0; cs_names[i].cp; i++)
-      if (strcasecmp (charset_u, cs_names[i].name) == 0)
-	{
-	  get_ttyp ()->term_code_page = cs_names[i].cp;
-	  break;
-	}
-}
-
 void
 fhandler_pty_slave::fixup_after_fork (HANDLE parent)
 {
@@ -1977,9 +1799,6 @@ fhandler_pty_slave::fixup_after_exec ()
   if (!close_on_exec ())
     fixup_after_fork (NULL);	/* No parent handle required. */
 
-  /* Set locale */
-  setup_locale ();
-
   /* Hook Console API */
 #define DO_HOOK(module, name) \
   if (!name##_Orig) \
@@ -2205,8 +2024,8 @@ fhandler_pty_master::pty_master_fwd_thread ()
 	      state = 0;
 
 	  size_t nlen;
-	  char *buf = convert_mb_str
-	    (get_ttyp ()->term_code_page, &nlen, CP_UTF8, ptr, wlen);
+	  char *buf = convert_mb_str (cygheap->locale.term_code_page,
+				      &nlen, CP_UTF8, ptr, wlen);
 
 	  ptr = buf;
 	  wlen = rlen = nlen;
@@ -2228,8 +2047,8 @@ fhandler_pty_master::pty_master_fwd_thread ()
 	  continue;
 	}
       size_t nlen;
-      char *buf = convert_mb_str
-	(get_ttyp ()->term_code_page, &nlen, GetConsoleOutputCP (), ptr, wlen);
+      char *buf = convert_mb_str (cygheap->locale.term_code_page, &nlen,
+				  GetConsoleOutputCP (), ptr, wlen);
 
       ptr = buf;
       wlen = rlen = nlen;
diff --git a/winsup/cygwin/nlsfuncs.cc b/winsup/cygwin/nlsfuncs.cc
index 668d7eb9e778..10a3a0705142 100644
--- a/winsup/cygwin/nlsfuncs.cc
+++ b/winsup/cygwin/nlsfuncs.cc
@@ -1448,6 +1448,53 @@ __set_charset_from_locale (const char *locale, char *charset)
   stpcpy (charset, cs);
 }
 
+/* Called from internal_setlocale.  Set a codepage which reflects the
+   internal charset setting.  This is *not* necessarily the Windows
+   codepage connected to a locale by default, so we have to set this
+   up explicitely. */
+static UINT
+__eval_codepage_from_internal_charset (const char *charset)
+{
+  UINT codepage = CP_UTF8; /* Default UTF8 */
+
+  /* The internal charset names are well defined, so we can use shortcuts. */
+  switch (charset[0])
+    {
+    case 'B': /* BIG5 */
+      codepage = 950;
+      break;
+    case 'C': /* CPxxx */
+      codepage = strtoul (charset + 2, NULL, 10);
+      break;
+    case 'E': /* EUCxx */
+      switch (charset[3])
+	{
+	case 'J': /* EUCJP */
+	  codepage = 20932;
+	  break;
+	case 'K': /* EUCKR */
+	  codepage = 949;
+	  break;
+	case 'C': /* EUCCN */
+	  codepage = 936;
+	  break;
+	}
+      break;
+    case 'G': /* GBK/GB2312 */
+      codepage = 936;
+      break;
+    case 'I': /* ISO-8859-x */
+      codepage = strtoul (charset + 9, NULL, 10);
+      break;
+    case 'S': /* SJIS */
+      codepage = 932;
+      break;
+    default: /* All set to UTF8 already */
+      break;
+    }
+  return codepage;
+}
+
 /* This function is called from newlib's loadlocale if the locale identifier
    was invalid, one way or the other.  It looks for the file
 
@@ -1535,8 +1582,11 @@ internal_setlocale ()
   if (cygheap->locale.mbtowc == __get_global_locale ()->mbtowc)
     return;
 
-  debug_printf ("Global charset set to %s",
-		__locale_charset (__get_global_locale ()));
+  const char *charset = __locale_charset (__get_global_locale ());
+  debug_printf ("Global charset set to %s", charset);
+  /* Store codepage to be utilized by pseudo console code. */
+  cygheap->locale.term_code_page =
+			__eval_codepage_from_internal_charset (charset);
   /* Fetch PATH and CWD and convert to wchar_t in previous charset. */
   path = getenv ("PATH");
   if (path && *path)	/* $PATH can be potentially unset. */
diff --git a/winsup/cygwin/spawn.cc b/winsup/cygwin/spawn.cc
index 92d190d1a764..02c4207abae9 100644
--- a/winsup/cygwin/spawn.cc
+++ b/winsup/cygwin/spawn.cc
@@ -619,18 +619,6 @@ child_info_spawn::worker (const char *prog_arg, const char *const *argv,
 	    }
 	}
 
-      if (!iscygwin ())
-	{
-	  cfd.rewind ();
-	  while (cfd.next () >= 0)
-	    if (cfd->get_major () == DEV_PTYS_MAJOR)
-	      {
-		fhandler_pty_slave *ptys =
-		  (fhandler_pty_slave *)(fhandler_base *) cfd;
-		ptys->setup_locale ();
-	      }
-	}
-
       /* Set up needed handles for stdio */
       si.dwFlags = STARTF_USESTDHANDLES;
       si.hStdInput = handle ((in__stdin < 0 ? 0 : in__stdin), false);
-- 
2.26.2


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-04 12:44                 ` Corinna Vinschen
@ 2020-09-04 14:05                   ` Brian Inglis
  2020-09-04 14:50                   ` Takashi Yano
  2020-09-06 10:28                   ` Takashi Yano
  2 siblings, 0 replies; 60+ messages in thread
From: Brian Inglis @ 2020-09-04 14:05 UTC (permalink / raw)
  To: cygwin-patches

On 2020-09-04 06:44, Corinna Vinschen wrote:
> Hi Takashi,
> 
> On Sep  4 18:21, Takashi Yano via Cygwin-patches wrote:
>> Hi Corinna,
>>
>> On Thu, 3 Sep 2020 19:59:12 +0200
>> Corinna Vinschen wrote:
>>> The only idea I had so far was, changing the way __set_charset_from_locale
>>> works from within _setlocale_r:
>>>
>>> We could add a Cygwin-specific function only fetching the codepage and
>>> call it unconditionally from _setlocale_r.  __set_charset_from_locale is
>>> called with a new parameter "codepage", so it doesn't have to fetch the
>>> CP by itself, but it's still only called from _setlocale_r if necessary.
>>>
>>> Would that be sufficient?  The CP conversion from 20127/ASCII to 65001/UTF8
>>> could be done at the point the codepage is actually required.
>>
>> I think I have found the answer to your request.
>> Patch attached. What do you think of this patch?
>>
>> Calling initial_setlocale() is necessary because
>> nl_langinfo() always returns "ANSI_X3.4-1968"
>> regardless locale setting if this is not called.
> 
> Well, this is correct.  Per POSIX, a standard-conformant application is
> running in the "C" locale unless it calls setlocale() explicitely.
> That's one reason Cygwin defaults to UTF-8 internally.  Everything,
> including the terminal, is supposed to default to UTF-8.  That's the
> most sane default, even if an application is not locale-aware.
> 
> So, to follow POSIX, initial_setlocale() is used to set up the
> environment and command line stuff and then, before calling the
> application's main, Cygwin calls _setlocale_r (_REENT, LC_CTYPE, "C");
> to reset the apps default locale to "C".  That's why nl_langinfo()
> returns "ANSI_X3.4-1968".
> 
> However, the initial_setlocale() call in dll_crt0_1 calls
> internal_setlocale(), and *that* function sets the conversion functions
> for the internal conversions.  What it *doesn't* do yet at the moment is
> to store the charset name itself or, better, the equivalent codepage.
> 
> If we change that, setup_locale can simply go away.  Below is a patch
> doing just that.  Can you please check if that works in your test
> scenarios?
> 
> However, there's something which worries me.  Why do we need or set the
> pseudo terminal codepage at all?  I see that you convert from MB charset
> to MB charset and then use the result in WriteFile to the connecting
> pipes.  Question is this: Why not just converting the strings via
> sys_mbstowcs to a UTF-16 string and then send that over the line, using
> WriteConsoleW for the final output to the console?  That would simplify
> this stuff quite a bit, wouldn't it?  After all, for writing UTF-16 to
> the console, we simply don't need to know or care for the console CP.

IIRC his locale was ja_JP.UTF-8 but he got English messages on CP 932!

-- 
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada

This email may be disturbing to some readers as it contains
too much technical detail. Reader discretion is advised.
[Data in IEC units and prefixes, physical quantities in SI.]

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-04 12:44                 ` Corinna Vinschen
  2020-09-04 14:05                   ` Brian Inglis
@ 2020-09-04 14:50                   ` Takashi Yano
  2020-09-04 19:22                     ` Corinna Vinschen
  2020-09-06 10:28                   ` Takashi Yano
  2 siblings, 1 reply; 60+ messages in thread
From: Takashi Yano @ 2020-09-04 14:50 UTC (permalink / raw)
  To: cygwin-patches

Hi Corinna,

On Fri, 4 Sep 2020 14:44:00 +0200
Corinna Vinschen wrote:
> Hi Takashi,
> 
> On Sep  4 18:21, Takashi Yano via Cygwin-patches wrote:
> > Hi Corinna,
> > 
> > On Thu, 3 Sep 2020 19:59:12 +0200
> > Corinna Vinschen wrote:
> > > The only idea I had so far was, changing the way __set_charset_from_locale
> > > works from within _setlocale_r:
> > > 
> > > We could add a Cygwin-specific function only fetching the codepage and
> > > call it unconditionally from _setlocale_r.  __set_charset_from_locale is
> > > called with a new parameter "codepage", so it doesn't have to fetch the
> > > CP by itself, but it's still only called from _setlocale_r if necessary.
> > > 
> > > Would that be sufficient?  The CP conversion from 20127/ASCII to 65001/UTF8
> > > could be done at the point the codepage is actually required.
> > 
> > I think I have found the answer to your request.
> > Patch attached. What do you think of this patch?
> > 
> > Calling initial_setlocale() is necessary because
> > nl_langinfo() always returns "ANSI_X3.4-1968"
> > regardless locale setting if this is not called.
> 
> Well, this is correct.  Per POSIX, a standard-conformant application is
> running in the "C" locale unless it calls setlocale() explicitely.
> That's one reason Cygwin defaults to UTF-8 internally.  Everything,
> including the terminal, is supposed to default to UTF-8.  That's the
> most sane default, even if an application is not locale-aware.
> 
> So, to follow POSIX, initial_setlocale() is used to set up the
> environment and command line stuff and then, before calling the
> application's main, Cygwin calls _setlocale_r (_REENT, LC_CTYPE, "C");
> to reset the apps default locale to "C".  That's why nl_langinfo()
> returns "ANSI_X3.4-1968".
> 
> However, the initial_setlocale() call in dll_crt0_1 calls
> internal_setlocale(), and *that* function sets the conversion functions
> for the internal conversions.  What it *doesn't* do yet at the moment is
> to store the charset name itself or, better, the equivalent codepage.
> 
> If we change that, setup_locale can simply go away.  Below is a patch
> doing just that.  Can you please check if that works in your test
> scenarios?

I tried your patch, but unfortunately it does not work.
cygheap->locale.term_code_page is 0 in pty master.

If the following lines are moved in internal_setlocale(),

  const char *charset = __locale_charset (__get_global_locale ());
  debug_printf ("Global charset set to %s", charset);
  /* Store codepage to be utilized by pseudo console code. */
  cygheap->locale.term_code_page =
            __eval_codepage_from_internal_charset (charset);

in internal_setlocale() before

  /* Don't do anything if the charset hasn't actually changed. */
  if (cygheap->locale.mbtowc == __get_global_locale ()->mbtowc)
    return;

cygheap->locale.term_code_page is always 65001 even if mintty is
startted by
mintty -o locale=ja_JP -o charset=CP932
or
mintty -o locale=ja_JP -o charset=EUCJP

Perhaps, this is because LANG is not set properly yet when mintty
is started.

> However, there's something which worries me.  Why do we need or set the
> pseudo terminal codepage at all?  I see that you convert from MB charset
> to MB charset and then use the result in WriteFile to the connecting
> pipes.  Question is this: Why not just converting the strings via
> sys_mbstowcs to a UTF-16 string and then send that over the line, using
> WriteConsoleW for the final output to the console?  That would simplify
> this stuff quite a bit, wouldn't it?  After all, for writing UTF-16 to
> the console, we simply don't need to know or care for the console CP.

WriteConsoleW() cannot be used because the handle to_master_cyg is
not a console handle.

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-04  6:23   ` Johannes Schindelin
@ 2020-09-04 15:03     ` Takashi Yano
  2020-09-07 21:17       ` Johannes Schindelin
  0 siblings, 1 reply; 60+ messages in thread
From: Takashi Yano @ 2020-09-04 15:03 UTC (permalink / raw)
  To: cygwin-patches

Hi Johannes,

On Fri, 4 Sep 2020 08:23:42 +0200 (CEST)
Johannes Schindelin wrote:
> Hi Takashi,
> 
> On Fri, 4 Sep 2020, Takashi Yano via Cygwin-patches wrote:
> 
> > Hi Johannes and Corinna,
> >
> > On Tue, 1 Sep 2020 18:19:16 +0200 (CEST)
> > Johannes Schindelin wrote:
> >
> > > When `LANG=en_US.UTF-8`, the detected `LCID` is 0x0409, which is
> > > correct, but after that (at least if Pseudo Console support is enabled),
> > > we try to find the default code page for that `LCID`, which is ASCII
> > > (437). Subsequently, we set the Console output code page to that value,
> > > completely ignoring that we wanted to use UTF-8.
> > >
> > > Let's not ignore the specifically asked-for UTF-8 character set.
> > >
> > > While at it, let's also set the Console output code page even if Pseudo
> > > Console support is disabled; contrary to the behavior of v3.0.7, the
> > > Console output code page is not ignored in that case.
> > >
> > > The most common symptom would be that console applications which do not
> > > specifically call `SetConsoleOutputCP()` but output UTF-8-encoded text
> > > seem to be broken with v3.1.x when they worked plenty fine with v3.0.x.
> > >
> > > This fixes https://github.com/msys2/MSYS2-packages/issues/1974,
> > > https://github.com/msys2/MSYS2-packages/issues/2012,
> > > https://github.com/rust-lang/cargo/issues/8369,
> > > https://github.com/git-for-windows/git/issues/2734,
> > > https://github.com/git-for-windows/git/issues/2793,
> > > https://github.com/git-for-windows/git/issues/2792, and possibly quite a
> > > few others.
> > >
> > > Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> > > ---
> > >  winsup/cygwin/fhandler_tty.cc | 9 +++++++++
> > >  1 file changed, 9 insertions(+)
> > >
> > > diff --git a/winsup/cygwin/fhandler_tty.cc b/winsup/cygwin/fhandler_tty.cc
> > > index 06789a500..414c26992 100644
> > > --- a/winsup/cygwin/fhandler_tty.cc
> > > +++ b/winsup/cygwin/fhandler_tty.cc
> > > @@ -2859,6 +2859,15 @@ fhandler_pty_slave::setup_locale (void)
> > >    char charset[ENCODING_LEN + 1] = "ASCII";
> > >    LCID lcid = get_langinfo (locale, charset);
> > >
> > > +  /* Special-case the UTF-8 character set */
> > > +  if (strcasecmp (charset, "UTF-8") == 0)
> > > +    {
> > > +      get_ttyp ()->term_code_page = CP_UTF8;
> > > +      SetConsoleCP (CP_UTF8);
> > > +      SetConsoleOutputCP (CP_UTF8);
> > > +      return;
> > > +    }
> > > +
> > >    /* Set console code page from locale */
> > >    if (get_pseudo_console ())
> > >      {
> > > --
> > > 2.27.0
> >
> > I would like to propose a counter patch attached.
> > What do you think of this patch?
> >
> > This patch does not treat UTF-8 as special.
> 
> Sure, but it only fixes the issue in `disable_pcon` mode in the current
> tip commit. That's because a new Pseudo Console is created for every
> spawned non-Cygwin console application, and that new Pseudo Console does
> _not_ have the code page set by your patch.

You are right. However, if pseudo console is enabled, the app
which works correclty in command prompt should work as well in
pseudo console. Therefore, there is nothing to be fixed.

> I verified that this patch works around the issue in `disable_pcon` mode,
> which is good.

Thanks for testing.

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-04 14:50                   ` Takashi Yano
@ 2020-09-04 19:22                     ` Corinna Vinschen
  2020-09-05  8:43                       ` Takashi Yano
  0 siblings, 1 reply; 60+ messages in thread
From: Corinna Vinschen @ 2020-09-04 19:22 UTC (permalink / raw)
  To: cygwin-patches

[-- Attachment #1: Type: text/plain, Size: 4386 bytes --]

Hi Takashi,

On Sep  4 23:50, Takashi Yano via Cygwin-patches wrote:
> Hi Corinna,
> 
> On Fri, 4 Sep 2020 14:44:00 +0200
> Corinna Vinschen wrote:
> > On Sep  4 18:21, Takashi Yano via Cygwin-patches wrote:
> > > I think I have found the answer to your request.
> > > Patch attached. What do you think of this patch?
> > > 
> > > Calling initial_setlocale() is necessary because
> > > nl_langinfo() always returns "ANSI_X3.4-1968"
> > > regardless locale setting if this is not called.
> > [...]
> > However, the initial_setlocale() call in dll_crt0_1 calls
> > internal_setlocale(), and *that* function sets the conversion functions
> > for the internal conversions.  What it *doesn't* do yet at the moment is
> > to store the charset name itself or, better, the equivalent codepage.
> > 
> > If we change that, setup_locale can simply go away.  Below is a patch
> > doing just that.  Can you please check if that works in your test
> > scenarios?
> 
> I tried your patch, but unfortunately it does not work.
> cygheap->locale.term_code_page is 0 in pty master.
> 
> If the following lines are moved in internal_setlocale(),
> 
>   const char *charset = __locale_charset (__get_global_locale ());
>   debug_printf ("Global charset set to %s", charset);
>   /* Store codepage to be utilized by pseudo console code. */
>   cygheap->locale.term_code_page =
>             __eval_codepage_from_internal_charset (charset);
> 
> in internal_setlocale() before
> 
>   /* Don't do anything if the charset hasn't actually changed. */
>   if (cygheap->locale.mbtowc == __get_global_locale ()->mbtowc)
>     return;

Uh, that makes sense.

> cygheap->locale.term_code_page is always 65001 even if mintty is
> startted by
> mintty -o locale=ja_JP -o charset=CP932
> or
> mintty -o locale=ja_JP -o charset=EUCJP
> 
> Perhaps, this is because LANG is not set properly yet when mintty
> is started.

Yeah, that's the reason.  The above settings of locale and charset on
the CLI should only take over when mintty calls setlocale() with a
matching string.  The fact that it sets the matching value in the
environment, too, should only affect child processes, not mintty itself.

But it's incorrect to call initial_setlocale() from setup_locale()
without resetting it to its original value.

Unfortunately that doesn't solve any problem with the pseudo console
codepage.  Drat.  It sounds like you need the terminal's charset,
rather than the one set in the environment.

So this boils down to the fact that term_code_page must be set
after the application is already running and as soo as it creates
the pty, me thinks.  What if __eval_codepage_from_internal_charset()
is called at pty creation?  Or even on reading from /writing to
the pty the first time?  That should always be late enough to fetch
the correct codepage.

Patch attached.  Does that work as expected?

> > However, there's something which worries me.  Why do we need or set the
> > pseudo terminal codepage at all?  I see that you convert from MB charset
> > to MB charset and then use the result in WriteFile to the connecting
> > pipes.  Question is this: Why not just converting the strings via
> > sys_mbstowcs to a UTF-16 string and then send that over the line, using
> > WriteConsoleW for the final output to the console?  That would simplify
> > this stuff quite a bit, wouldn't it?  After all, for writing UTF-16 to
> > the console, we simply don't need to know or care for the console CP.
> 
> WriteConsoleW() cannot be used because the handle to_master_cyg is
> not a console handle.

Sigh, of course.  It's all just pipes.  I forgot that, sorry.

Btw., the main loop in fhandler_pty_master::pty_master_fwd_thread()
calls 

  char *buf = convert_mb_str (cygheap->locale.term_code_page,
                              &nlen, CP_UTF8, ptr, wlen);
                                     ^^^^^^^
  [...]
  WriteFile (to_master_cyg, ...

But then, after the code breaks from that loop, it calls

  char *buf = convert_mb_str (cygheap->locale.term_code_page, &nlen,
                              GetConsoleOutputCP (), ptr, wlen);
                              ^^^^^^^^^^^^^^^^^^^^^
  [...]
  process_opost_output (to_master_cyg, ...

process_opost_output then calls WriteFile on that to_master_cyg handle,
just like the WriteFile call above.

Is that really correct?  Shouldn't the second invocation use CP_UTF8 as
well?


Corinna

[-- Attachment #2: 0001-Cygwin-try-calling-__eval_codepage_from_internal_cha.patch --]
[-- Type: text/plain, Size: 3462 bytes --]

From 89acd62e88871a89b9bcb2964924f8960c7673ec Mon Sep 17 00:00:00 2001
From: Corinna Vinschen <corinna@vinschen.de>
Date: Fri, 4 Sep 2020 21:20:35 +0200
Subject: [PATCH] Cygwin: try calling __eval_codepage_from_internal_charset in
 pty code

the idea being, that we need the tty's locale and charset, not the
environment locale settings when starting the tty.
---
 winsup/cygwin/fhandler_tty.cc | 17 +++++++++++++++++
 winsup/cygwin/nlsfuncs.cc     | 12 +++++-------
 2 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/winsup/cygwin/fhandler_tty.cc b/winsup/cygwin/fhandler_tty.cc
index f207a0b27892..feb014175123 100644
--- a/winsup/cygwin/fhandler_tty.cc
+++ b/winsup/cygwin/fhandler_tty.cc
@@ -39,6 +39,8 @@ extern "C" {
   VOID WINAPI ClosePseudoConsole (HPCON);
 }
 
+extern UINT __eval_codepage_from_internal_charset ();
+
 #define close_maybe(h) \
   do { \
     if (h && h != INVALID_HANDLE_VALUE) \
@@ -633,6 +635,11 @@ fhandler_pty_slave::open (int flags, mode_t)
 
   fhandler_console::need_invisible ();
 
+#if 1 /* Let's try this first */
+  if (!cygheap->locale.term_code_page)
+    cygheap->locale.term_code_page = __eval_codepage_from_internal_charset ();
+#endif
+
   set_open_status ();
   return 1;
 
@@ -1607,6 +1614,11 @@ fhandler_pty_master::write (const void *ptr, size_t len)
   if (bg <= bg_eof)
     return (ssize_t) bg;
 
+#if 0 /* Let's try this if setting codepage at pty open time is not enough */
+  if (!cygheap->locale.term_code_page)
+    cygheap->locale.term_code_page = __eval_codepage_from_internal_charset ();
+#endif
+
   push_process_state process_state (PID_TTYOU);
 
   /* Write terminal input to to_slave pipe instead of output_handle
@@ -1969,6 +1981,11 @@ fhandler_pty_master::pty_master_fwd_thread ()
   DWORD rlen;
   char outbuf[65536];
 
+#if 0 /* Let's try this if setting codepage at pty open time is not enough */
+  if (!cygheap->locale.term_code_page)
+    cygheap->locale.term_code_page = __eval_codepage_from_internal_charset ();
+#endif
+
   termios_printf ("Started.");
   for (;;)
     {
diff --git a/winsup/cygwin/nlsfuncs.cc b/winsup/cygwin/nlsfuncs.cc
index 10a3a0705142..6bffc77c91eb 100644
--- a/winsup/cygwin/nlsfuncs.cc
+++ b/winsup/cygwin/nlsfuncs.cc
@@ -1452,9 +1452,10 @@ __set_charset_from_locale (const char *locale, char *charset)
    internal charset setting.  This is *not* necessarily the Windows
    codepage connected to a locale by default, so we have to set this
    up explicitely. */
-static UINT
-__eval_codepage_from_internal_charset (const char *charset)
+UINT
+__eval_codepage_from_internal_charset ()
 {
+  const char *charset = __locale_charset (__get_global_locale ());
   UINT codepage = CP_UTF8; /* Default UTF8 */
 
   /* The internal charset names are well defined, so we can use shortcuts. */
@@ -1582,11 +1583,8 @@ internal_setlocale ()
   if (cygheap->locale.mbtowc == __get_global_locale ()->mbtowc)
     return;
 
-  const char *charset = __locale_charset (__get_global_locale ());
-  debug_printf ("Global charset set to %s", charset);
-  /* Store codepage to be utilized by pseudo console code. */
-  cygheap->locale.term_code_page =
-			__eval_codepage_from_internal_charset (charset);
+  debug_printf ("Global charset set to %s",
+		__locale_charset (__get_global_locale ()));
   /* Fetch PATH and CWD and convert to wchar_t in previous charset. */
   path = getenv ("PATH");
   if (path && *path)	/* $PATH can be potentially unset. */
-- 
2.26.2


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-04 19:22                     ` Corinna Vinschen
@ 2020-09-05  8:43                       ` Takashi Yano
  2020-09-05 11:15                         ` Takashi Yano
  2020-09-07  8:26                         ` Corinna Vinschen
  0 siblings, 2 replies; 60+ messages in thread
From: Takashi Yano @ 2020-09-05  8:43 UTC (permalink / raw)
  To: cygwin-patches

[-- Attachment #1: Type: text/plain, Size: 5235 bytes --]

Hi Corinna,

On Fri, 4 Sep 2020 21:22:35 +0200
Corinna Vinschen wrote:
> Hi Takashi,
> 
> On Sep  4 23:50, Takashi Yano via Cygwin-patches wrote:
> > Hi Corinna,
> > 
> > On Fri, 4 Sep 2020 14:44:00 +0200
> > Corinna Vinschen wrote:
> > > On Sep  4 18:21, Takashi Yano via Cygwin-patches wrote:
> > > > I think I have found the answer to your request.
> > > > Patch attached. What do you think of this patch?
> > > > 
> > > > Calling initial_setlocale() is necessary because
> > > > nl_langinfo() always returns "ANSI_X3.4-1968"
> > > > regardless locale setting if this is not called.
> > > [...]
> > > However, the initial_setlocale() call in dll_crt0_1 calls
> > > internal_setlocale(), and *that* function sets the conversion functions
> > > for the internal conversions.  What it *doesn't* do yet at the moment is
> > > to store the charset name itself or, better, the equivalent codepage.
> > > 
> > > If we change that, setup_locale can simply go away.  Below is a patch
> > > doing just that.  Can you please check if that works in your test
> > > scenarios?
> > 
> > I tried your patch, but unfortunately it does not work.
> > cygheap->locale.term_code_page is 0 in pty master.
> > 
> > If the following lines are moved in internal_setlocale(),
> > 
> >   const char *charset = __locale_charset (__get_global_locale ());
> >   debug_printf ("Global charset set to %s", charset);
> >   /* Store codepage to be utilized by pseudo console code. */
> >   cygheap->locale.term_code_page =
> >             __eval_codepage_from_internal_charset (charset);
> > 
> > in internal_setlocale() before
> > 
> >   /* Don't do anything if the charset hasn't actually changed. */
> >   if (cygheap->locale.mbtowc == __get_global_locale ()->mbtowc)
> >     return;
> 
> Uh, that makes sense.
> 
> > cygheap->locale.term_code_page is always 65001 even if mintty is
> > startted by
> > mintty -o locale=ja_JP -o charset=CP932
> > or
> > mintty -o locale=ja_JP -o charset=EUCJP
> > 
> > Perhaps, this is because LANG is not set properly yet when mintty
> > is started.
> 
> Yeah, that's the reason.  The above settings of locale and charset on
> the CLI should only take over when mintty calls setlocale() with a
> matching string.  The fact that it sets the matching value in the
> environment, too, should only affect child processes, not mintty itself.
> 
> But it's incorrect to call initial_setlocale() from setup_locale()
> without resetting it to its original value.
> 
> Unfortunately that doesn't solve any problem with the pseudo console
> codepage.  Drat.  It sounds like you need the terminal's charset,
> rather than the one set in the environment.
> 
> So this boils down to the fact that term_code_page must be set
> after the application is already running and as soo as it creates
> the pty, me thinks.  What if __eval_codepage_from_internal_charset()
> is called at pty creation?  Or even on reading from /writing to
> the pty the first time?  That should always be late enough to fetch
> the correct codepage.
> 
> Patch attached.  Does that work as expected?

Thank you very much for the patch.

Your new additional patch works well except the test case such as:

  int pm = getpt();
  if (fork()) {
    [do the master operations]
  } else {
    int ps = open(ptsname(pm), O_RDWR|O_NOCTTY);
    close(pm);
    setsid();
    ioctl(ps, TIOCSCTTY, 1);
    dup2(ps, 0);
    dup2(ps, 1);
    dup2(ps, 2);
    close(ps);
    [exec non-cygwin process]
  }

If this test case is run in cygwin console (command prompt),
it causes garbled output due to term_code_page == 0.

The second additional patch attached fixes the isseu.

> Btw., the main loop in fhandler_pty_master::pty_master_fwd_thread()
> calls 
> 
>   char *buf = convert_mb_str (cygheap->locale.term_code_page,
>                               &nlen, CP_UTF8, ptr, wlen);
>                                      ^^^^^^^
>   [...]
>   WriteFile (to_master_cyg, ...
> 
> But then, after the code breaks from that loop, it calls
> 
>   char *buf = convert_mb_str (cygheap->locale.term_code_page, &nlen,
>                               GetConsoleOutputCP (), ptr, wlen);
>                               ^^^^^^^^^^^^^^^^^^^^^
>   [...]
>   process_opost_output (to_master_cyg, ...
> 
> process_opost_output then calls WriteFile on that to_master_cyg handle,
> just like the WriteFile call above.
> 
> Is that really correct?  Shouldn't the second invocation use CP_UTF8 as
> well?

That is correct. The first conversion is for the case that pseudo
console is enabled, and the second one is for the case that pseudo
console is disabled.

Pseudo console converts charset from console code page to UTF-8.
Therefore, data read from from_slave is always UTF-8 when pseudo
console is enabled. Moreover, OPOST processing is done in pseudo
console, so write data simply by WriteFile() is enough.

If pseudo console is disabled, cmd.exe and so on uses console
code page, so the code page of data read from from_slave is
GetConsoleOutputCP(). In this case, OPOST processing is necessary.

If the second conversion and process_opost_output() are in the
"else {}" block for "if (get_ttyp ()->h_pseudo_console) {}",
the intent whould be clearer.

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>

[-- Attachment #2: set-term-code-page-in-_ti.diff --]
[-- Type: text/plain, Size: 2731 bytes --]

diff --git a/winsup/cygwin/cygheap.h b/winsup/cygwin/cygheap.h
index 2b84f4252..8877cc358 100644
--- a/winsup/cygwin/cygheap.h
+++ b/winsup/cygwin/cygheap.h
@@ -341,7 +341,6 @@ struct cygheap_debug
 struct cygheap_locale
 {
   mbtowc_p mbtowc;
-  UINT term_code_page;
 };
 
 struct user_heap_info
diff --git a/winsup/cygwin/fhandler_tty.cc b/winsup/cygwin/fhandler_tty.cc
index feb014175..8dc9d3df9 100644
--- a/winsup/cygwin/fhandler_tty.cc
+++ b/winsup/cygwin/fhandler_tty.cc
@@ -636,8 +636,8 @@ fhandler_pty_slave::open (int flags, mode_t)
   fhandler_console::need_invisible ();
 
 #if 1 /* Let's try this first */
-  if (!cygheap->locale.term_code_page)
-    cygheap->locale.term_code_page = __eval_codepage_from_internal_charset ();
+  if (!get_ttyp ()->term_code_page)
+    get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset ();
 #endif
 
   set_open_status ();
@@ -1615,8 +1615,8 @@ fhandler_pty_master::write (const void *ptr, size_t len)
     return (ssize_t) bg;
 
 #if 0 /* Let's try this if setting codepage at pty open time is not enough */
-  if (!cygheap->locale.term_code_page)
-    cygheap->locale.term_code_page = __eval_codepage_from_internal_charset ();
+  if (!get_ttyp ()->term_code_page)
+    get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset ();
 #endif
 
   push_process_state process_state (PID_TTYOU);
@@ -1627,7 +1627,7 @@ fhandler_pty_master::write (const void *ptr, size_t len)
     {
       size_t nlen;
       char *buf = convert_mb_str (CP_UTF8, &nlen,
-				  cygheap->locale.term_code_page,
+				  get_ttyp ()->term_code_page,
 				  (const char *) ptr, len);
 
       WaitForSingleObject (input_mutex, INFINITE);
@@ -1982,8 +1982,8 @@ fhandler_pty_master::pty_master_fwd_thread ()
   char outbuf[65536];
 
 #if 0 /* Let's try this if setting codepage at pty open time is not enough */
-  if (!cygheap->locale.term_code_page)
-    cygheap->locale.term_code_page = __eval_codepage_from_internal_charset ();
+  if (!get_ttyp ()->term_code_page)
+    get_tty ()->term_code_page = __eval_codepage_from_internal_charset ();
 #endif
 
   termios_printf ("Started.");
@@ -2041,7 +2041,7 @@ fhandler_pty_master::pty_master_fwd_thread ()
 	      state = 0;
 
 	  size_t nlen;
-	  char *buf = convert_mb_str (cygheap->locale.term_code_page,
+	  char *buf = convert_mb_str (get_ttyp ()->term_code_page,
 				      &nlen, CP_UTF8, ptr, wlen);
 
 	  ptr = buf;
@@ -2064,7 +2064,7 @@ fhandler_pty_master::pty_master_fwd_thread ()
 	  continue;
 	}
       size_t nlen;
-      char *buf = convert_mb_str (cygheap->locale.term_code_page, &nlen,
+      char *buf = convert_mb_str (get_ttyp ()->term_code_page, &nlen,
 				  GetConsoleOutputCP (), ptr, wlen);
 
       ptr = buf;

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-05  8:43                       ` Takashi Yano
@ 2020-09-05 11:15                         ` Takashi Yano
  2020-09-05 14:15                           ` Takashi Yano
  2020-09-07  8:27                           ` Corinna Vinschen
  2020-09-07  8:26                         ` Corinna Vinschen
  1 sibling, 2 replies; 60+ messages in thread
From: Takashi Yano @ 2020-09-05 11:15 UTC (permalink / raw)
  To: cygwin-patches

On Sat, 5 Sep 2020 17:43:01 +0900
Takashi Yano via Cygwin-patches <cygwin-patches@cygwin.com> wrote:
> Hi Corinna,
> 
> On Fri, 4 Sep 2020 21:22:35 +0200
> Corinna Vinschen wrote:
> > Hi Takashi,
> > 
> > On Sep  4 23:50, Takashi Yano via Cygwin-patches wrote:
> > > Hi Corinna,
> > > 
> > > On Fri, 4 Sep 2020 14:44:00 +0200
> > > Corinna Vinschen wrote:
> > > > On Sep  4 18:21, Takashi Yano via Cygwin-patches wrote:
> > > > > I think I have found the answer to your request.
> > > > > Patch attached. What do you think of this patch?
> > > > > 
> > > > > Calling initial_setlocale() is necessary because
> > > > > nl_langinfo() always returns "ANSI_X3.4-1968"
> > > > > regardless locale setting if this is not called.
> > > > [...]
> > > > However, the initial_setlocale() call in dll_crt0_1 calls
> > > > internal_setlocale(), and *that* function sets the conversion functions
> > > > for the internal conversions.  What it *doesn't* do yet at the moment is
> > > > to store the charset name itself or, better, the equivalent codepage.
> > > > 
> > > > If we change that, setup_locale can simply go away.  Below is a patch
> > > > doing just that.  Can you please check if that works in your test
> > > > scenarios?
> > > 
> > > I tried your patch, but unfortunately it does not work.
> > > cygheap->locale.term_code_page is 0 in pty master.
> > > 
> > > If the following lines are moved in internal_setlocale(),
> > > 
> > >   const char *charset = __locale_charset (__get_global_locale ());
> > >   debug_printf ("Global charset set to %s", charset);
> > >   /* Store codepage to be utilized by pseudo console code. */
> > >   cygheap->locale.term_code_page =
> > >             __eval_codepage_from_internal_charset (charset);
> > > 
> > > in internal_setlocale() before
> > > 
> > >   /* Don't do anything if the charset hasn't actually changed. */
> > >   if (cygheap->locale.mbtowc == __get_global_locale ()->mbtowc)
> > >     return;
> > 
> > Uh, that makes sense.
> > 
> > > cygheap->locale.term_code_page is always 65001 even if mintty is
> > > startted by
> > > mintty -o locale=ja_JP -o charset=CP932
> > > or
> > > mintty -o locale=ja_JP -o charset=EUCJP
> > > 
> > > Perhaps, this is because LANG is not set properly yet when mintty
> > > is started.
> > 
> > Yeah, that's the reason.  The above settings of locale and charset on
> > the CLI should only take over when mintty calls setlocale() with a
> > matching string.  The fact that it sets the matching value in the
> > environment, too, should only affect child processes, not mintty itself.
> > 
> > But it's incorrect to call initial_setlocale() from setup_locale()
> > without resetting it to its original value.
> > 
> > Unfortunately that doesn't solve any problem with the pseudo console
> > codepage.  Drat.  It sounds like you need the terminal's charset,
> > rather than the one set in the environment.
> > 
> > So this boils down to the fact that term_code_page must be set
> > after the application is already running and as soo as it creates
> > the pty, me thinks.  What if __eval_codepage_from_internal_charset()
> > is called at pty creation?  Or even on reading from /writing to
> > the pty the first time?  That should always be late enough to fetch
> > the correct codepage.
> > 
> > Patch attached.  Does that work as expected?
> 
> Thank you very much for the patch.
> 
> Your new additional patch works well except the test case such as:
> 
>   int pm = getpt();
>   if (fork()) {
>     [do the master operations]
>   } else {
>     int ps = open(ptsname(pm), O_RDWR|O_NOCTTY);
>     close(pm);
>     setsid();
>     ioctl(ps, TIOCSCTTY, 1);
>     dup2(ps, 0);
>     dup2(ps, 1);
>     dup2(ps, 2);
>     close(ps);
>     [exec non-cygwin process]
>   }
> 
> If this test case is run in cygwin console (command prompt),
> it causes garbled output due to term_code_page == 0.
> 
> The second additional patch attached fixes the isseu.

No. This does not fix enough.

In the test case above, if it does not call setlocale(),
__eval_codepage_from_internal_charset() always returns "ASCII"
regardless of locale setting. Therefore, output is garbled if
the terminal charset is not UTF-8.

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-05 11:15                         ` Takashi Yano
@ 2020-09-05 14:15                           ` Takashi Yano
  2020-09-06  8:57                             ` Takashi Yano
  2020-09-07  8:27                           ` Corinna Vinschen
  1 sibling, 1 reply; 60+ messages in thread
From: Takashi Yano @ 2020-09-05 14:15 UTC (permalink / raw)
  To: cygwin-patches

[-- Attachment #1: Type: text/plain, Size: 4695 bytes --]

Hi Corinna,

On Sat, 5 Sep 2020 20:15:06 +0900
Takashi Yano via Cygwin-patches <cygwin-patches@cygwin.com> wrote:
> On Sat, 5 Sep 2020 17:43:01 +0900
> Takashi Yano via Cygwin-patches <cygwin-patches@cygwin.com> wrote:
> > Hi Corinna,
> > 
> > On Fri, 4 Sep 2020 21:22:35 +0200
> > Corinna Vinschen wrote:
> > > Hi Takashi,
> > > 
> > > On Sep  4 23:50, Takashi Yano via Cygwin-patches wrote:
> > > > Hi Corinna,
> > > > 
> > > > On Fri, 4 Sep 2020 14:44:00 +0200
> > > > Corinna Vinschen wrote:
> > > > > On Sep  4 18:21, Takashi Yano via Cygwin-patches wrote:
> > > > > > I think I have found the answer to your request.
> > > > > > Patch attached. What do you think of this patch?
> > > > > > 
> > > > > > Calling initial_setlocale() is necessary because
> > > > > > nl_langinfo() always returns "ANSI_X3.4-1968"
> > > > > > regardless locale setting if this is not called.
> > > > > [...]
> > > > > However, the initial_setlocale() call in dll_crt0_1 calls
> > > > > internal_setlocale(), and *that* function sets the conversion functions
> > > > > for the internal conversions.  What it *doesn't* do yet at the moment is
> > > > > to store the charset name itself or, better, the equivalent codepage.
> > > > > 
> > > > > If we change that, setup_locale can simply go away.  Below is a patch
> > > > > doing just that.  Can you please check if that works in your test
> > > > > scenarios?
> > > > 
> > > > I tried your patch, but unfortunately it does not work.
> > > > cygheap->locale.term_code_page is 0 in pty master.
> > > > 
> > > > If the following lines are moved in internal_setlocale(),
> > > > 
> > > >   const char *charset = __locale_charset (__get_global_locale ());
> > > >   debug_printf ("Global charset set to %s", charset);
> > > >   /* Store codepage to be utilized by pseudo console code. */
> > > >   cygheap->locale.term_code_page =
> > > >             __eval_codepage_from_internal_charset (charset);
> > > > 
> > > > in internal_setlocale() before
> > > > 
> > > >   /* Don't do anything if the charset hasn't actually changed. */
> > > >   if (cygheap->locale.mbtowc == __get_global_locale ()->mbtowc)
> > > >     return;
> > > 
> > > Uh, that makes sense.
> > > 
> > > > cygheap->locale.term_code_page is always 65001 even if mintty is
> > > > startted by
> > > > mintty -o locale=ja_JP -o charset=CP932
> > > > or
> > > > mintty -o locale=ja_JP -o charset=EUCJP
> > > > 
> > > > Perhaps, this is because LANG is not set properly yet when mintty
> > > > is started.
> > > 
> > > Yeah, that's the reason.  The above settings of locale and charset on
> > > the CLI should only take over when mintty calls setlocale() with a
> > > matching string.  The fact that it sets the matching value in the
> > > environment, too, should only affect child processes, not mintty itself.
> > > 
> > > But it's incorrect to call initial_setlocale() from setup_locale()
> > > without resetting it to its original value.
> > > 
> > > Unfortunately that doesn't solve any problem with the pseudo console
> > > codepage.  Drat.  It sounds like you need the terminal's charset,
> > > rather than the one set in the environment.
> > > 
> > > So this boils down to the fact that term_code_page must be set
> > > after the application is already running and as soo as it creates
> > > the pty, me thinks.  What if __eval_codepage_from_internal_charset()
> > > is called at pty creation?  Or even on reading from /writing to
> > > the pty the first time?  That should always be late enough to fetch
> > > the correct codepage.
> > > 
> > > Patch attached.  Does that work as expected?
> > 
> > Thank you very much for the patch.
> > 
> > Your new additional patch works well except the test case such as:
> > 
> >   int pm = getpt();
> >   if (fork()) {
> >     [do the master operations]
> >   } else {
> >     int ps = open(ptsname(pm), O_RDWR|O_NOCTTY);
> >     close(pm);
> >     setsid();
> >     ioctl(ps, TIOCSCTTY, 1);
> >     dup2(ps, 0);
> >     dup2(ps, 1);
> >     dup2(ps, 2);
> >     close(ps);
> >     [exec non-cygwin process]
> >   }
> > 
> > If this test case is run in cygwin console (command prompt),
> > it causes garbled output due to term_code_page == 0.
> > 
> > The second additional patch attached fixes the isseu.
> 
> No. This does not fix enough.
> 
> In the test case above, if it does not call setlocale(),
> __eval_codepage_from_internal_charset() always returns "ASCII"
> regardless of locale setting. Therefore, output is garbled if
> the terminal charset is not UTF-8.

New additional patch against your patches is attached.
In the end, essentially what was being done in the original
code has been restored.

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>

[-- Attachment #2: set-term-code-page-in-_ti.diff --]
[-- Type: text/plain, Size: 5408 bytes --]

diff --git a/winsup/cygwin/cygheap.h b/winsup/cygwin/cygheap.h
index 2b84f4252..8877cc358 100644
--- a/winsup/cygwin/cygheap.h
+++ b/winsup/cygwin/cygheap.h
@@ -341,7 +341,6 @@ struct cygheap_debug
 struct cygheap_locale
 {
   mbtowc_p mbtowc;
-  UINT term_code_page;
 };
 
 struct user_heap_info
diff --git a/winsup/cygwin/fhandler.h b/winsup/cygwin/fhandler.h
index af619df5f..b4ba9428a 100644
--- a/winsup/cygwin/fhandler.h
+++ b/winsup/cygwin/fhandler.h
@@ -2336,6 +2336,7 @@ class fhandler_pty_slave: public fhandler_pty_common
   void set_switch_to_pcon (void);
   void reset_switch_to_pcon (void);
   void mask_switch_to_pcon_in (bool mask);
+  void setup_locale (void);
 };
 
 #define __ptsname(buf, unit) __small_sprintf ((buf), "/dev/pty%d", (unit))
diff --git a/winsup/cygwin/fhandler_tty.cc b/winsup/cygwin/fhandler_tty.cc
index feb014175..ed1c34a5f 100644
--- a/winsup/cygwin/fhandler_tty.cc
+++ b/winsup/cygwin/fhandler_tty.cc
@@ -635,9 +635,9 @@ fhandler_pty_slave::open (int flags, mode_t)
 
   fhandler_console::need_invisible ();
 
-#if 1 /* Let's try this first */
-  if (!cygheap->locale.term_code_page)
-    cygheap->locale.term_code_page = __eval_codepage_from_internal_charset ();
+#if 0 /* Let's try this first */
+  if (!get_ttyp ()->term_code_page)
+    get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset ();
 #endif
 
   set_open_status ();
@@ -1615,8 +1615,8 @@ fhandler_pty_master::write (const void *ptr, size_t len)
     return (ssize_t) bg;
 
 #if 0 /* Let's try this if setting codepage at pty open time is not enough */
-  if (!cygheap->locale.term_code_page)
-    cygheap->locale.term_code_page = __eval_codepage_from_internal_charset ();
+  if (!get_ttyp ()->term_code_page)
+    get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset ();
 #endif
 
   push_process_state process_state (PID_TTYOU);
@@ -1627,7 +1627,7 @@ fhandler_pty_master::write (const void *ptr, size_t len)
     {
       size_t nlen;
       char *buf = convert_mb_str (CP_UTF8, &nlen,
-				  cygheap->locale.term_code_page,
+				  get_ttyp ()->term_code_page,
 				  (const char *) ptr, len);
 
       WaitForSingleObject (input_mutex, INFINITE);
@@ -1795,6 +1795,13 @@ fhandler_pty_common::set_close_on_exec (bool val)
   close_on_exec (val);
 }
 
+void
+fhandler_pty_slave::setup_locale (void)
+{
+  if (!get_ttyp ()->term_code_page)
+    get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset ();
+}
+
 void
 fhandler_pty_slave::fixup_after_fork (HANDLE parent)
 {
@@ -1811,6 +1818,9 @@ fhandler_pty_slave::fixup_after_exec ()
   if (!close_on_exec ())
     fixup_after_fork (NULL);	/* No parent handle required. */
 
+  /* Set locale */
+  setup_locale ();
+
   /* Hook Console API */
 #define DO_HOOK(module, name) \
   if (!name##_Orig) \
@@ -1982,8 +1992,8 @@ fhandler_pty_master::pty_master_fwd_thread ()
   char outbuf[65536];
 
 #if 0 /* Let's try this if setting codepage at pty open time is not enough */
-  if (!cygheap->locale.term_code_page)
-    cygheap->locale.term_code_page = __eval_codepage_from_internal_charset ();
+  if (!get_ttyp ()->term_code_page)
+    get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset ();
 #endif
 
   termios_printf ("Started.");
@@ -2041,7 +2051,7 @@ fhandler_pty_master::pty_master_fwd_thread ()
 	      state = 0;
 
 	  size_t nlen;
-	  char *buf = convert_mb_str (cygheap->locale.term_code_page,
+	  char *buf = convert_mb_str (get_ttyp ()->term_code_page,
 				      &nlen, CP_UTF8, ptr, wlen);
 
 	  ptr = buf;
@@ -2064,7 +2074,7 @@ fhandler_pty_master::pty_master_fwd_thread ()
 	  continue;
 	}
       size_t nlen;
-      char *buf = convert_mb_str (cygheap->locale.term_code_page, &nlen,
+      char *buf = convert_mb_str (get_ttyp ()->term_code_page, &nlen,
 				  GetConsoleOutputCP (), ptr, wlen);
 
       ptr = buf;
diff --git a/winsup/cygwin/nlsfuncs.cc b/winsup/cygwin/nlsfuncs.cc
index 6bffc77c9..dd82b7f49 100644
--- a/winsup/cygwin/nlsfuncs.cc
+++ b/winsup/cygwin/nlsfuncs.cc
@@ -1455,7 +1455,17 @@ __set_charset_from_locale (const char *locale, char *charset)
 UINT
 __eval_codepage_from_internal_charset ()
 {
-  const char *charset = __locale_charset (__get_global_locale ());
+  const char *env = __get_locale_env (_REENT, LC_CTYPE);
+  char locale[ENCODING_LEN + 1];
+  strncpy (locale, env, ENCODING_LEN);
+  locale[ENCODING_LEN] = '\0';
+  __locale_t loc;
+  memset (&loc, 0, sizeof (loc));
+  const char *charset;
+  if (__loadlocale (&loc, LC_CTYPE, locale))
+    charset = __locale_charset (&loc);
+  else
+    charset = __locale_charset (__get_global_locale ());
   UINT codepage = CP_UTF8; /* Default UTF8 */
 
   /* The internal charset names are well defined, so we can use shortcuts. */
diff --git a/winsup/cygwin/spawn.cc b/winsup/cygwin/spawn.cc
index 02c4207ab..92d190d1a 100644
--- a/winsup/cygwin/spawn.cc
+++ b/winsup/cygwin/spawn.cc
@@ -619,6 +619,18 @@ child_info_spawn::worker (const char *prog_arg, const char *const *argv,
 	    }
 	}
 
+      if (!iscygwin ())
+	{
+	  cfd.rewind ();
+	  while (cfd.next () >= 0)
+	    if (cfd->get_major () == DEV_PTYS_MAJOR)
+	      {
+		fhandler_pty_slave *ptys =
+		  (fhandler_pty_slave *)(fhandler_base *) cfd;
+		ptys->setup_locale ();
+	      }
+	}
+
       /* Set up needed handles for stdio */
       si.dwFlags = STARTF_USESTDHANDLES;
       si.hStdInput = handle ((in__stdin < 0 ? 0 : in__stdin), false);

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-05 14:15                           ` Takashi Yano
@ 2020-09-06  8:57                             ` Takashi Yano
  2020-09-06 10:15                               ` Takashi Yano
  0 siblings, 1 reply; 60+ messages in thread
From: Takashi Yano @ 2020-09-06  8:57 UTC (permalink / raw)
  To: cygwin-patches

[-- Attachment #1: Type: text/plain, Size: 5059 bytes --]

On Sat, 5 Sep 2020 23:15:16 +0900
Takashi Yano via Cygwin-patches <cygwin-patches@cygwin.com> wrote:
> Hi Corinna,
> 
> On Sat, 5 Sep 2020 20:15:06 +0900
> Takashi Yano via Cygwin-patches <cygwin-patches@cygwin.com> wrote:
> > On Sat, 5 Sep 2020 17:43:01 +0900
> > Takashi Yano via Cygwin-patches <cygwin-patches@cygwin.com> wrote:
> > > Hi Corinna,
> > > 
> > > On Fri, 4 Sep 2020 21:22:35 +0200
> > > Corinna Vinschen wrote:
> > > > Hi Takashi,
> > > > 
> > > > On Sep  4 23:50, Takashi Yano via Cygwin-patches wrote:
> > > > > Hi Corinna,
> > > > > 
> > > > > On Fri, 4 Sep 2020 14:44:00 +0200
> > > > > Corinna Vinschen wrote:
> > > > > > On Sep  4 18:21, Takashi Yano via Cygwin-patches wrote:
> > > > > > > I think I have found the answer to your request.
> > > > > > > Patch attached. What do you think of this patch?
> > > > > > > 
> > > > > > > Calling initial_setlocale() is necessary because
> > > > > > > nl_langinfo() always returns "ANSI_X3.4-1968"
> > > > > > > regardless locale setting if this is not called.
> > > > > > [...]
> > > > > > However, the initial_setlocale() call in dll_crt0_1 calls
> > > > > > internal_setlocale(), and *that* function sets the conversion functions
> > > > > > for the internal conversions.  What it *doesn't* do yet at the moment is
> > > > > > to store the charset name itself or, better, the equivalent codepage.
> > > > > > 
> > > > > > If we change that, setup_locale can simply go away.  Below is a patch
> > > > > > doing just that.  Can you please check if that works in your test
> > > > > > scenarios?
> > > > > 
> > > > > I tried your patch, but unfortunately it does not work.
> > > > > cygheap->locale.term_code_page is 0 in pty master.
> > > > > 
> > > > > If the following lines are moved in internal_setlocale(),
> > > > > 
> > > > >   const char *charset = __locale_charset (__get_global_locale ());
> > > > >   debug_printf ("Global charset set to %s", charset);
> > > > >   /* Store codepage to be utilized by pseudo console code. */
> > > > >   cygheap->locale.term_code_page =
> > > > >             __eval_codepage_from_internal_charset (charset);
> > > > > 
> > > > > in internal_setlocale() before
> > > > > 
> > > > >   /* Don't do anything if the charset hasn't actually changed. */
> > > > >   if (cygheap->locale.mbtowc == __get_global_locale ()->mbtowc)
> > > > >     return;
> > > > 
> > > > Uh, that makes sense.
> > > > 
> > > > > cygheap->locale.term_code_page is always 65001 even if mintty is
> > > > > startted by
> > > > > mintty -o locale=ja_JP -o charset=CP932
> > > > > or
> > > > > mintty -o locale=ja_JP -o charset=EUCJP
> > > > > 
> > > > > Perhaps, this is because LANG is not set properly yet when mintty
> > > > > is started.
> > > > 
> > > > Yeah, that's the reason.  The above settings of locale and charset on
> > > > the CLI should only take over when mintty calls setlocale() with a
> > > > matching string.  The fact that it sets the matching value in the
> > > > environment, too, should only affect child processes, not mintty itself.
> > > > 
> > > > But it's incorrect to call initial_setlocale() from setup_locale()
> > > > without resetting it to its original value.
> > > > 
> > > > Unfortunately that doesn't solve any problem with the pseudo console
> > > > codepage.  Drat.  It sounds like you need the terminal's charset,
> > > > rather than the one set in the environment.
> > > > 
> > > > So this boils down to the fact that term_code_page must be set
> > > > after the application is already running and as soo as it creates
> > > > the pty, me thinks.  What if __eval_codepage_from_internal_charset()
> > > > is called at pty creation?  Or even on reading from /writing to
> > > > the pty the first time?  That should always be late enough to fetch
> > > > the correct codepage.
> > > > 
> > > > Patch attached.  Does that work as expected?
> > > 
> > > Thank you very much for the patch.
> > > 
> > > Your new additional patch works well except the test case such as:
> > > 
> > >   int pm = getpt();
> > >   if (fork()) {
> > >     [do the master operations]
> > >   } else {
> > >     int ps = open(ptsname(pm), O_RDWR|O_NOCTTY);
> > >     close(pm);
> > >     setsid();
> > >     ioctl(ps, TIOCSCTTY, 1);
> > >     dup2(ps, 0);
> > >     dup2(ps, 1);
> > >     dup2(ps, 2);
> > >     close(ps);
> > >     [exec non-cygwin process]
> > >   }
> > > 
> > > If this test case is run in cygwin console (command prompt),
> > > it causes garbled output due to term_code_page == 0.
> > > 
> > > The second additional patch attached fixes the isseu.
> > 
> > No. This does not fix enough.
> > 
> > In the test case above, if it does not call setlocale(),
> > __eval_codepage_from_internal_charset() always returns "ASCII"
> > regardless of locale setting. Therefore, output is garbled if
> > the terminal charset is not UTF-8.
> 
> New additional patch against your patches is attached.
> In the end, essentially what was being done in the original
> code has been restored.

The patch revised a little.

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>

[-- Attachment #2: set-term-code-page-in-_ti.diff --]
[-- Type: text/plain, Size: 5499 bytes --]

diff --git a/winsup/cygwin/cygheap.h b/winsup/cygwin/cygheap.h
index 2b84f4252..8877cc358 100644
--- a/winsup/cygwin/cygheap.h
+++ b/winsup/cygwin/cygheap.h
@@ -341,7 +341,6 @@ struct cygheap_debug
 struct cygheap_locale
 {
   mbtowc_p mbtowc;
-  UINT term_code_page;
 };
 
 struct user_heap_info
diff --git a/winsup/cygwin/fhandler.h b/winsup/cygwin/fhandler.h
index af619df5f..b4ba9428a 100644
--- a/winsup/cygwin/fhandler.h
+++ b/winsup/cygwin/fhandler.h
@@ -2336,6 +2336,7 @@ class fhandler_pty_slave: public fhandler_pty_common
   void set_switch_to_pcon (void);
   void reset_switch_to_pcon (void);
   void mask_switch_to_pcon_in (bool mask);
+  void setup_locale (void);
 };
 
 #define __ptsname(buf, unit) __small_sprintf ((buf), "/dev/pty%d", (unit))
diff --git a/winsup/cygwin/fhandler_tty.cc b/winsup/cygwin/fhandler_tty.cc
index feb014175..4d8c10830 100644
--- a/winsup/cygwin/fhandler_tty.cc
+++ b/winsup/cygwin/fhandler_tty.cc
@@ -635,9 +635,9 @@ fhandler_pty_slave::open (int flags, mode_t)
 
   fhandler_console::need_invisible ();
 
-#if 1 /* Let's try this first */
-  if (!cygheap->locale.term_code_page)
-    cygheap->locale.term_code_page = __eval_codepage_from_internal_charset ();
+#if 0 /* Let's try this first */
+  if (!get_ttyp ()->term_code_page)
+    get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset ();
 #endif
 
   set_open_status ();
@@ -1615,8 +1615,8 @@ fhandler_pty_master::write (const void *ptr, size_t len)
     return (ssize_t) bg;
 
 #if 0 /* Let's try this if setting codepage at pty open time is not enough */
-  if (!cygheap->locale.term_code_page)
-    cygheap->locale.term_code_page = __eval_codepage_from_internal_charset ();
+  if (!get_ttyp ()->term_code_page)
+    get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset ();
 #endif
 
   push_process_state process_state (PID_TTYOU);
@@ -1627,7 +1627,7 @@ fhandler_pty_master::write (const void *ptr, size_t len)
     {
       size_t nlen;
       char *buf = convert_mb_str (CP_UTF8, &nlen,
-				  cygheap->locale.term_code_page,
+				  get_ttyp ()->term_code_page,
 				  (const char *) ptr, len);
 
       WaitForSingleObject (input_mutex, INFINITE);
@@ -1795,6 +1795,13 @@ fhandler_pty_common::set_close_on_exec (bool val)
   close_on_exec (val);
 }
 
+void
+fhandler_pty_slave::setup_locale (void)
+{
+  if (!get_ttyp ()->term_code_page)
+    get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset ();
+}
+
 void
 fhandler_pty_slave::fixup_after_fork (HANDLE parent)
 {
@@ -1811,6 +1818,9 @@ fhandler_pty_slave::fixup_after_exec ()
   if (!close_on_exec ())
     fixup_after_fork (NULL);	/* No parent handle required. */
 
+  /* Set locale */
+  setup_locale ();
+
   /* Hook Console API */
 #define DO_HOOK(module, name) \
   if (!name##_Orig) \
@@ -1982,8 +1992,8 @@ fhandler_pty_master::pty_master_fwd_thread ()
   char outbuf[65536];
 
 #if 0 /* Let's try this if setting codepage at pty open time is not enough */
-  if (!cygheap->locale.term_code_page)
-    cygheap->locale.term_code_page = __eval_codepage_from_internal_charset ();
+  if (!get_ttyp ()->term_code_page)
+    get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset ();
 #endif
 
   termios_printf ("Started.");
@@ -2041,7 +2051,7 @@ fhandler_pty_master::pty_master_fwd_thread ()
 	      state = 0;
 
 	  size_t nlen;
-	  char *buf = convert_mb_str (cygheap->locale.term_code_page,
+	  char *buf = convert_mb_str (get_ttyp ()->term_code_page,
 				      &nlen, CP_UTF8, ptr, wlen);
 
 	  ptr = buf;
@@ -2064,7 +2074,7 @@ fhandler_pty_master::pty_master_fwd_thread ()
 	  continue;
 	}
       size_t nlen;
-      char *buf = convert_mb_str (cygheap->locale.term_code_page, &nlen,
+      char *buf = convert_mb_str (get_ttyp ()->term_code_page, &nlen,
 				  GetConsoleOutputCP (), ptr, wlen);
 
       ptr = buf;
diff --git a/winsup/cygwin/nlsfuncs.cc b/winsup/cygwin/nlsfuncs.cc
index 6bffc77c9..f5af26982 100644
--- a/winsup/cygwin/nlsfuncs.cc
+++ b/winsup/cygwin/nlsfuncs.cc
@@ -1455,7 +1455,13 @@ __set_charset_from_locale (const char *locale, char *charset)
 UINT
 __eval_codepage_from_internal_charset ()
 {
-  const char *charset = __locale_charset (__get_global_locale ());
+  const char *env = __get_locale_env (_REENT, LC_CTYPE);
+  char locale[ENCODING_LEN + 1];
+  strncpy (locale, env, ENCODING_LEN);
+  locale[ENCODING_LEN] = '\0';
+  __locale_t *loc = duplocale (__get_global_locale ());
+  __loadlocale (loc, LC_CTYPE, locale);
+  const char *charset = __locale_charset (loc);
   UINT codepage = CP_UTF8; /* Default UTF8 */
 
   /* The internal charset names are well defined, so we can use shortcuts. */
@@ -1493,6 +1499,7 @@ __eval_codepage_from_internal_charset ()
     default: /* All set to UTF8 already */
       break;
     }
+  freelocale (loc);
   return codepage;
 }
 
diff --git a/winsup/cygwin/spawn.cc b/winsup/cygwin/spawn.cc
index 02c4207ab..92d190d1a 100644
--- a/winsup/cygwin/spawn.cc
+++ b/winsup/cygwin/spawn.cc
@@ -619,6 +619,18 @@ child_info_spawn::worker (const char *prog_arg, const char *const *argv,
 	    }
 	}
 
+      if (!iscygwin ())
+	{
+	  cfd.rewind ();
+	  while (cfd.next () >= 0)
+	    if (cfd->get_major () == DEV_PTYS_MAJOR)
+	      {
+		fhandler_pty_slave *ptys =
+		  (fhandler_pty_slave *)(fhandler_base *) cfd;
+		ptys->setup_locale ();
+	      }
+	}
+
       /* Set up needed handles for stdio */
       si.dwFlags = STARTF_USESTDHANDLES;
       si.hStdInput = handle ((in__stdin < 0 ? 0 : in__stdin), false);

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-06  8:57                             ` Takashi Yano
@ 2020-09-06 10:15                               ` Takashi Yano
  2020-09-06 16:04                                 ` Takashi Yano
  0 siblings, 1 reply; 60+ messages in thread
From: Takashi Yano @ 2020-09-06 10:15 UTC (permalink / raw)
  To: cygwin-patches

[-- Attachment #1: Type: text/plain, Size: 5459 bytes --]

On Sun, 6 Sep 2020 17:57:03 +0900
Takashi Yano via Cygwin-patches <cygwin-patches@cygwin.com> wrote:
> On Sat, 5 Sep 2020 23:15:16 +0900
> Takashi Yano via Cygwin-patches <cygwin-patches@cygwin.com> wrote:
> > Hi Corinna,
> > 
> > On Sat, 5 Sep 2020 20:15:06 +0900
> > Takashi Yano via Cygwin-patches <cygwin-patches@cygwin.com> wrote:
> > > On Sat, 5 Sep 2020 17:43:01 +0900
> > > Takashi Yano via Cygwin-patches <cygwin-patches@cygwin.com> wrote:
> > > > Hi Corinna,
> > > > 
> > > > On Fri, 4 Sep 2020 21:22:35 +0200
> > > > Corinna Vinschen wrote:
> > > > > Hi Takashi,
> > > > > 
> > > > > On Sep  4 23:50, Takashi Yano via Cygwin-patches wrote:
> > > > > > Hi Corinna,
> > > > > > 
> > > > > > On Fri, 4 Sep 2020 14:44:00 +0200
> > > > > > Corinna Vinschen wrote:
> > > > > > > On Sep  4 18:21, Takashi Yano via Cygwin-patches wrote:
> > > > > > > > I think I have found the answer to your request.
> > > > > > > > Patch attached. What do you think of this patch?
> > > > > > > > 
> > > > > > > > Calling initial_setlocale() is necessary because
> > > > > > > > nl_langinfo() always returns "ANSI_X3.4-1968"
> > > > > > > > regardless locale setting if this is not called.
> > > > > > > [...]
> > > > > > > However, the initial_setlocale() call in dll_crt0_1 calls
> > > > > > > internal_setlocale(), and *that* function sets the conversion functions
> > > > > > > for the internal conversions.  What it *doesn't* do yet at the moment is
> > > > > > > to store the charset name itself or, better, the equivalent codepage.
> > > > > > > 
> > > > > > > If we change that, setup_locale can simply go away.  Below is a patch
> > > > > > > doing just that.  Can you please check if that works in your test
> > > > > > > scenarios?
> > > > > > 
> > > > > > I tried your patch, but unfortunately it does not work.
> > > > > > cygheap->locale.term_code_page is 0 in pty master.
> > > > > > 
> > > > > > If the following lines are moved in internal_setlocale(),
> > > > > > 
> > > > > >   const char *charset = __locale_charset (__get_global_locale ());
> > > > > >   debug_printf ("Global charset set to %s", charset);
> > > > > >   /* Store codepage to be utilized by pseudo console code. */
> > > > > >   cygheap->locale.term_code_page =
> > > > > >             __eval_codepage_from_internal_charset (charset);
> > > > > > 
> > > > > > in internal_setlocale() before
> > > > > > 
> > > > > >   /* Don't do anything if the charset hasn't actually changed. */
> > > > > >   if (cygheap->locale.mbtowc == __get_global_locale ()->mbtowc)
> > > > > >     return;
> > > > > 
> > > > > Uh, that makes sense.
> > > > > 
> > > > > > cygheap->locale.term_code_page is always 65001 even if mintty is
> > > > > > startted by
> > > > > > mintty -o locale=ja_JP -o charset=CP932
> > > > > > or
> > > > > > mintty -o locale=ja_JP -o charset=EUCJP
> > > > > > 
> > > > > > Perhaps, this is because LANG is not set properly yet when mintty
> > > > > > is started.
> > > > > 
> > > > > Yeah, that's the reason.  The above settings of locale and charset on
> > > > > the CLI should only take over when mintty calls setlocale() with a
> > > > > matching string.  The fact that it sets the matching value in the
> > > > > environment, too, should only affect child processes, not mintty itself.
> > > > > 
> > > > > But it's incorrect to call initial_setlocale() from setup_locale()
> > > > > without resetting it to its original value.
> > > > > 
> > > > > Unfortunately that doesn't solve any problem with the pseudo console
> > > > > codepage.  Drat.  It sounds like you need the terminal's charset,
> > > > > rather than the one set in the environment.
> > > > > 
> > > > > So this boils down to the fact that term_code_page must be set
> > > > > after the application is already running and as soo as it creates
> > > > > the pty, me thinks.  What if __eval_codepage_from_internal_charset()
> > > > > is called at pty creation?  Or even on reading from /writing to
> > > > > the pty the first time?  That should always be late enough to fetch
> > > > > the correct codepage.
> > > > > 
> > > > > Patch attached.  Does that work as expected?
> > > > 
> > > > Thank you very much for the patch.
> > > > 
> > > > Your new additional patch works well except the test case such as:
> > > > 
> > > >   int pm = getpt();
> > > >   if (fork()) {
> > > >     [do the master operations]
> > > >   } else {
> > > >     int ps = open(ptsname(pm), O_RDWR|O_NOCTTY);
> > > >     close(pm);
> > > >     setsid();
> > > >     ioctl(ps, TIOCSCTTY, 1);
> > > >     dup2(ps, 0);
> > > >     dup2(ps, 1);
> > > >     dup2(ps, 2);
> > > >     close(ps);
> > > >     [exec non-cygwin process]
> > > >   }
> > > > 
> > > > If this test case is run in cygwin console (command prompt),
> > > > it causes garbled output due to term_code_page == 0.
> > > > 
> > > > The second additional patch attached fixes the isseu.
> > > 
> > > No. This does not fix enough.
> > > 
> > > In the test case above, if it does not call setlocale(),
> > > __eval_codepage_from_internal_charset() always returns "ASCII"
> > > regardless of locale setting. Therefore, output is garbled if
> > > the terminal charset is not UTF-8.
> > 
> > New additional patch against your patches is attached.
> > In the end, essentially what was being done in the original
> > code has been restored.
> 
> The patch revised a little.

Chages:
- If global locale is set, it takes precedence.

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>

[-- Attachment #2: set-term-code-page-in-_ti.diff --]
[-- Type: text/plain, Size: 5701 bytes --]

diff --git a/winsup/cygwin/cygheap.h b/winsup/cygwin/cygheap.h
index 2b84f4252..8877cc358 100644
--- a/winsup/cygwin/cygheap.h
+++ b/winsup/cygwin/cygheap.h
@@ -341,7 +341,6 @@ struct cygheap_debug
 struct cygheap_locale
 {
   mbtowc_p mbtowc;
-  UINT term_code_page;
 };
 
 struct user_heap_info
diff --git a/winsup/cygwin/fhandler.h b/winsup/cygwin/fhandler.h
index af619df5f..b4ba9428a 100644
--- a/winsup/cygwin/fhandler.h
+++ b/winsup/cygwin/fhandler.h
@@ -2336,6 +2336,7 @@ class fhandler_pty_slave: public fhandler_pty_common
   void set_switch_to_pcon (void);
   void reset_switch_to_pcon (void);
   void mask_switch_to_pcon_in (bool mask);
+  void setup_locale (void);
 };
 
 #define __ptsname(buf, unit) __small_sprintf ((buf), "/dev/pty%d", (unit))
diff --git a/winsup/cygwin/fhandler_tty.cc b/winsup/cygwin/fhandler_tty.cc
index feb014175..4d8c10830 100644
--- a/winsup/cygwin/fhandler_tty.cc
+++ b/winsup/cygwin/fhandler_tty.cc
@@ -635,9 +635,9 @@ fhandler_pty_slave::open (int flags, mode_t)
 
   fhandler_console::need_invisible ();
 
-#if 1 /* Let's try this first */
-  if (!cygheap->locale.term_code_page)
-    cygheap->locale.term_code_page = __eval_codepage_from_internal_charset ();
+#if 0 /* Let's try this first */
+  if (!get_ttyp ()->term_code_page)
+    get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset ();
 #endif
 
   set_open_status ();
@@ -1615,8 +1615,8 @@ fhandler_pty_master::write (const void *ptr, size_t len)
     return (ssize_t) bg;
 
 #if 0 /* Let's try this if setting codepage at pty open time is not enough */
-  if (!cygheap->locale.term_code_page)
-    cygheap->locale.term_code_page = __eval_codepage_from_internal_charset ();
+  if (!get_ttyp ()->term_code_page)
+    get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset ();
 #endif
 
   push_process_state process_state (PID_TTYOU);
@@ -1627,7 +1627,7 @@ fhandler_pty_master::write (const void *ptr, size_t len)
     {
       size_t nlen;
       char *buf = convert_mb_str (CP_UTF8, &nlen,
-				  cygheap->locale.term_code_page,
+				  get_ttyp ()->term_code_page,
 				  (const char *) ptr, len);
 
       WaitForSingleObject (input_mutex, INFINITE);
@@ -1795,6 +1795,13 @@ fhandler_pty_common::set_close_on_exec (bool val)
   close_on_exec (val);
 }
 
+void
+fhandler_pty_slave::setup_locale (void)
+{
+  if (!get_ttyp ()->term_code_page)
+    get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset ();
+}
+
 void
 fhandler_pty_slave::fixup_after_fork (HANDLE parent)
 {
@@ -1811,6 +1818,9 @@ fhandler_pty_slave::fixup_after_exec ()
   if (!close_on_exec ())
     fixup_after_fork (NULL);	/* No parent handle required. */
 
+  /* Set locale */
+  setup_locale ();
+
   /* Hook Console API */
 #define DO_HOOK(module, name) \
   if (!name##_Orig) \
@@ -1982,8 +1992,8 @@ fhandler_pty_master::pty_master_fwd_thread ()
   char outbuf[65536];
 
 #if 0 /* Let's try this if setting codepage at pty open time is not enough */
-  if (!cygheap->locale.term_code_page)
-    cygheap->locale.term_code_page = __eval_codepage_from_internal_charset ();
+  if (!get_ttyp ()->term_code_page)
+    get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset ();
 #endif
 
   termios_printf ("Started.");
@@ -2041,7 +2051,7 @@ fhandler_pty_master::pty_master_fwd_thread ()
 	      state = 0;
 
 	  size_t nlen;
-	  char *buf = convert_mb_str (cygheap->locale.term_code_page,
+	  char *buf = convert_mb_str (get_ttyp ()->term_code_page,
 				      &nlen, CP_UTF8, ptr, wlen);
 
 	  ptr = buf;
@@ -2064,7 +2074,7 @@ fhandler_pty_master::pty_master_fwd_thread ()
 	  continue;
 	}
       size_t nlen;
-      char *buf = convert_mb_str (cygheap->locale.term_code_page, &nlen,
+      char *buf = convert_mb_str (get_ttyp ()->term_code_page, &nlen,
 				  GetConsoleOutputCP (), ptr, wlen);
 
       ptr = buf;
diff --git a/winsup/cygwin/nlsfuncs.cc b/winsup/cygwin/nlsfuncs.cc
index 6bffc77c9..203a22e87 100644
--- a/winsup/cygwin/nlsfuncs.cc
+++ b/winsup/cygwin/nlsfuncs.cc
@@ -1455,7 +1455,20 @@ __set_charset_from_locale (const char *locale, char *charset)
 UINT
 __eval_codepage_from_internal_charset ()
 {
-  const char *charset = __locale_charset (__get_global_locale ());
+  const char *charset;
+  __locale_t *loc = NULL;
+  if (__get_global_locale ()->lc_cat[LC_CTYPE].buf)
+    charset = __locale_charset (__get_global_locale ());
+  else
+    {
+      const char *env = __get_locale_env (_REENT, LC_CTYPE);
+      char locale[ENCODING_LEN + 1];
+      strncpy (locale, env, ENCODING_LEN);
+      locale[ENCODING_LEN] = '\0';
+      loc = duplocale (__get_global_locale ());
+      __loadlocale (loc, LC_CTYPE, locale);
+      charset = __locale_charset (loc);
+    }
   UINT codepage = CP_UTF8; /* Default UTF8 */
 
   /* The internal charset names are well defined, so we can use shortcuts. */
@@ -1493,6 +1506,8 @@ __eval_codepage_from_internal_charset ()
     default: /* All set to UTF8 already */
       break;
     }
+  if (loc)
+    freelocale (loc);
   return codepage;
 }
 
diff --git a/winsup/cygwin/spawn.cc b/winsup/cygwin/spawn.cc
index 02c4207ab..92d190d1a 100644
--- a/winsup/cygwin/spawn.cc
+++ b/winsup/cygwin/spawn.cc
@@ -619,6 +619,18 @@ child_info_spawn::worker (const char *prog_arg, const char *const *argv,
 	    }
 	}
 
+      if (!iscygwin ())
+	{
+	  cfd.rewind ();
+	  while (cfd.next () >= 0)
+	    if (cfd->get_major () == DEV_PTYS_MAJOR)
+	      {
+		fhandler_pty_slave *ptys =
+		  (fhandler_pty_slave *)(fhandler_base *) cfd;
+		ptys->setup_locale ();
+	      }
+	}
+
       /* Set up needed handles for stdio */
       si.dwFlags = STARTF_USESTDHANDLES;
       si.hStdInput = handle ((in__stdin < 0 ? 0 : in__stdin), false);

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-04 12:44                 ` Corinna Vinschen
  2020-09-04 14:05                   ` Brian Inglis
  2020-09-04 14:50                   ` Takashi Yano
@ 2020-09-06 10:28                   ` Takashi Yano
  2020-09-07  8:33                     ` Corinna Vinschen
  2 siblings, 1 reply; 60+ messages in thread
From: Takashi Yano @ 2020-09-06 10:28 UTC (permalink / raw)
  To: cygwin-patches

Hi Corinna,

On Fri, 4 Sep 2020 14:44:00 +0200
Corinna Vinschen <corinna-cygwin@cygwin.com> wrote:
> +    case 'I': /* ISO-8859-x */
> +      codepage = strtoul (charset + 9, NULL, 10);
> +      break;

This should be:
codepage = strtoul (charset + 9, NULL, 10) + 28590;
shouldn't it?

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-06 10:15                               ` Takashi Yano
@ 2020-09-06 16:04                                 ` Takashi Yano
  2020-09-07  4:45                                   ` Takashi Yano
  0 siblings, 1 reply; 60+ messages in thread
From: Takashi Yano @ 2020-09-06 16:04 UTC (permalink / raw)
  To: cygwin-patches

[-- Attachment #1: Type: text/plain, Size: 5921 bytes --]

On Sun, 6 Sep 2020 19:15:30 +0900
Takashi Yano via Cygwin-patches <cygwin-patches@cygwin.com> wrote:
> On Sun, 6 Sep 2020 17:57:03 +0900
> Takashi Yano via Cygwin-patches <cygwin-patches@cygwin.com> wrote:
> > On Sat, 5 Sep 2020 23:15:16 +0900
> > Takashi Yano via Cygwin-patches <cygwin-patches@cygwin.com> wrote:
> > > Hi Corinna,
> > > 
> > > On Sat, 5 Sep 2020 20:15:06 +0900
> > > Takashi Yano via Cygwin-patches <cygwin-patches@cygwin.com> wrote:
> > > > On Sat, 5 Sep 2020 17:43:01 +0900
> > > > Takashi Yano via Cygwin-patches <cygwin-patches@cygwin.com> wrote:
> > > > > Hi Corinna,
> > > > > 
> > > > > On Fri, 4 Sep 2020 21:22:35 +0200
> > > > > Corinna Vinschen wrote:
> > > > > > Hi Takashi,
> > > > > > 
> > > > > > On Sep  4 23:50, Takashi Yano via Cygwin-patches wrote:
> > > > > > > Hi Corinna,
> > > > > > > 
> > > > > > > On Fri, 4 Sep 2020 14:44:00 +0200
> > > > > > > Corinna Vinschen wrote:
> > > > > > > > On Sep  4 18:21, Takashi Yano via Cygwin-patches wrote:
> > > > > > > > > I think I have found the answer to your request.
> > > > > > > > > Patch attached. What do you think of this patch?
> > > > > > > > > 
> > > > > > > > > Calling initial_setlocale() is necessary because
> > > > > > > > > nl_langinfo() always returns "ANSI_X3.4-1968"
> > > > > > > > > regardless locale setting if this is not called.
> > > > > > > > [...]
> > > > > > > > However, the initial_setlocale() call in dll_crt0_1 calls
> > > > > > > > internal_setlocale(), and *that* function sets the conversion functions
> > > > > > > > for the internal conversions.  What it *doesn't* do yet at the moment is
> > > > > > > > to store the charset name itself or, better, the equivalent codepage.
> > > > > > > > 
> > > > > > > > If we change that, setup_locale can simply go away.  Below is a patch
> > > > > > > > doing just that.  Can you please check if that works in your test
> > > > > > > > scenarios?
> > > > > > > 
> > > > > > > I tried your patch, but unfortunately it does not work.
> > > > > > > cygheap->locale.term_code_page is 0 in pty master.
> > > > > > > 
> > > > > > > If the following lines are moved in internal_setlocale(),
> > > > > > > 
> > > > > > >   const char *charset = __locale_charset (__get_global_locale ());
> > > > > > >   debug_printf ("Global charset set to %s", charset);
> > > > > > >   /* Store codepage to be utilized by pseudo console code. */
> > > > > > >   cygheap->locale.term_code_page =
> > > > > > >             __eval_codepage_from_internal_charset (charset);
> > > > > > > 
> > > > > > > in internal_setlocale() before
> > > > > > > 
> > > > > > >   /* Don't do anything if the charset hasn't actually changed. */
> > > > > > >   if (cygheap->locale.mbtowc == __get_global_locale ()->mbtowc)
> > > > > > >     return;
> > > > > > 
> > > > > > Uh, that makes sense.
> > > > > > 
> > > > > > > cygheap->locale.term_code_page is always 65001 even if mintty is
> > > > > > > startted by
> > > > > > > mintty -o locale=ja_JP -o charset=CP932
> > > > > > > or
> > > > > > > mintty -o locale=ja_JP -o charset=EUCJP
> > > > > > > 
> > > > > > > Perhaps, this is because LANG is not set properly yet when mintty
> > > > > > > is started.
> > > > > > 
> > > > > > Yeah, that's the reason.  The above settings of locale and charset on
> > > > > > the CLI should only take over when mintty calls setlocale() with a
> > > > > > matching string.  The fact that it sets the matching value in the
> > > > > > environment, too, should only affect child processes, not mintty itself.
> > > > > > 
> > > > > > But it's incorrect to call initial_setlocale() from setup_locale()
> > > > > > without resetting it to its original value.
> > > > > > 
> > > > > > Unfortunately that doesn't solve any problem with the pseudo console
> > > > > > codepage.  Drat.  It sounds like you need the terminal's charset,
> > > > > > rather than the one set in the environment.
> > > > > > 
> > > > > > So this boils down to the fact that term_code_page must be set
> > > > > > after the application is already running and as soo as it creates
> > > > > > the pty, me thinks.  What if __eval_codepage_from_internal_charset()
> > > > > > is called at pty creation?  Or even on reading from /writing to
> > > > > > the pty the first time?  That should always be late enough to fetch
> > > > > > the correct codepage.
> > > > > > 
> > > > > > Patch attached.  Does that work as expected?
> > > > > 
> > > > > Thank you very much for the patch.
> > > > > 
> > > > > Your new additional patch works well except the test case such as:
> > > > > 
> > > > >   int pm = getpt();
> > > > >   if (fork()) {
> > > > >     [do the master operations]
> > > > >   } else {
> > > > >     int ps = open(ptsname(pm), O_RDWR|O_NOCTTY);
> > > > >     close(pm);
> > > > >     setsid();
> > > > >     ioctl(ps, TIOCSCTTY, 1);
> > > > >     dup2(ps, 0);
> > > > >     dup2(ps, 1);
> > > > >     dup2(ps, 2);
> > > > >     close(ps);
> > > > >     [exec non-cygwin process]
> > > > >   }
> > > > > 
> > > > > If this test case is run in cygwin console (command prompt),
> > > > > it causes garbled output due to term_code_page == 0.
> > > > > 
> > > > > The second additional patch attached fixes the isseu.
> > > > 
> > > > No. This does not fix enough.
> > > > 
> > > > In the test case above, if it does not call setlocale(),
> > > > __eval_codepage_from_internal_charset() always returns "ASCII"
> > > > regardless of locale setting. Therefore, output is garbled if
> > > > the terminal charset is not UTF-8.
> > > 
> > > New additional patch against your patches is attached.
> > > In the end, essentially what was being done in the original
> > > code has been restored.
> > 
> > The patch revised a little.
> 
> Chages:
> - If global locale is set, it takes precedence.

Changes:
- Use __get_current_locale() instead of __get_global_locale().
- Fix a bug for ISO-8859-* charset.

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>

[-- Attachment #2: set-term-code-page.diff --]
[-- Type: text/plain, Size: 6009 bytes --]

diff --git a/winsup/cygwin/cygheap.h b/winsup/cygwin/cygheap.h
index 2b84f4252..8877cc358 100644
--- a/winsup/cygwin/cygheap.h
+++ b/winsup/cygwin/cygheap.h
@@ -341,7 +341,6 @@ struct cygheap_debug
 struct cygheap_locale
 {
   mbtowc_p mbtowc;
-  UINT term_code_page;
 };
 
 struct user_heap_info
diff --git a/winsup/cygwin/fhandler.h b/winsup/cygwin/fhandler.h
index af619df5f..b4ba9428a 100644
--- a/winsup/cygwin/fhandler.h
+++ b/winsup/cygwin/fhandler.h
@@ -2336,6 +2336,7 @@ class fhandler_pty_slave: public fhandler_pty_common
   void set_switch_to_pcon (void);
   void reset_switch_to_pcon (void);
   void mask_switch_to_pcon_in (bool mask);
+  void setup_locale (void);
 };
 
 #define __ptsname(buf, unit) __small_sprintf ((buf), "/dev/pty%d", (unit))
diff --git a/winsup/cygwin/fhandler_tty.cc b/winsup/cygwin/fhandler_tty.cc
index feb014175..4d8c10830 100644
--- a/winsup/cygwin/fhandler_tty.cc
+++ b/winsup/cygwin/fhandler_tty.cc
@@ -635,9 +635,9 @@ fhandler_pty_slave::open (int flags, mode_t)
 
   fhandler_console::need_invisible ();
 
-#if 1 /* Let's try this first */
-  if (!cygheap->locale.term_code_page)
-    cygheap->locale.term_code_page = __eval_codepage_from_internal_charset ();
+#if 0 /* Let's try this first */
+  if (!get_ttyp ()->term_code_page)
+    get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset ();
 #endif
 
   set_open_status ();
@@ -1615,8 +1615,8 @@ fhandler_pty_master::write (const void *ptr, size_t len)
     return (ssize_t) bg;
 
 #if 0 /* Let's try this if setting codepage at pty open time is not enough */
-  if (!cygheap->locale.term_code_page)
-    cygheap->locale.term_code_page = __eval_codepage_from_internal_charset ();
+  if (!get_ttyp ()->term_code_page)
+    get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset ();
 #endif
 
   push_process_state process_state (PID_TTYOU);
@@ -1627,7 +1627,7 @@ fhandler_pty_master::write (const void *ptr, size_t len)
     {
       size_t nlen;
       char *buf = convert_mb_str (CP_UTF8, &nlen,
-				  cygheap->locale.term_code_page,
+				  get_ttyp ()->term_code_page,
 				  (const char *) ptr, len);
 
       WaitForSingleObject (input_mutex, INFINITE);
@@ -1795,6 +1795,13 @@ fhandler_pty_common::set_close_on_exec (bool val)
   close_on_exec (val);
 }
 
+void
+fhandler_pty_slave::setup_locale (void)
+{
+  if (!get_ttyp ()->term_code_page)
+    get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset ();
+}
+
 void
 fhandler_pty_slave::fixup_after_fork (HANDLE parent)
 {
@@ -1811,6 +1818,9 @@ fhandler_pty_slave::fixup_after_exec ()
   if (!close_on_exec ())
     fixup_after_fork (NULL);	/* No parent handle required. */
 
+  /* Set locale */
+  setup_locale ();
+
   /* Hook Console API */
 #define DO_HOOK(module, name) \
   if (!name##_Orig) \
@@ -1982,8 +1992,8 @@ fhandler_pty_master::pty_master_fwd_thread ()
   char outbuf[65536];
 
 #if 0 /* Let's try this if setting codepage at pty open time is not enough */
-  if (!cygheap->locale.term_code_page)
-    cygheap->locale.term_code_page = __eval_codepage_from_internal_charset ();
+  if (!get_ttyp ()->term_code_page)
+    get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset ();
 #endif
 
   termios_printf ("Started.");
@@ -2041,7 +2051,7 @@ fhandler_pty_master::pty_master_fwd_thread ()
 	      state = 0;
 
 	  size_t nlen;
-	  char *buf = convert_mb_str (cygheap->locale.term_code_page,
+	  char *buf = convert_mb_str (get_ttyp ()->term_code_page,
 				      &nlen, CP_UTF8, ptr, wlen);
 
 	  ptr = buf;
@@ -2064,7 +2074,7 @@ fhandler_pty_master::pty_master_fwd_thread ()
 	  continue;
 	}
       size_t nlen;
-      char *buf = convert_mb_str (cygheap->locale.term_code_page, &nlen,
+      char *buf = convert_mb_str (get_ttyp ()->term_code_page, &nlen,
 				  GetConsoleOutputCP (), ptr, wlen);
 
       ptr = buf;
diff --git a/winsup/cygwin/nlsfuncs.cc b/winsup/cygwin/nlsfuncs.cc
index 6bffc77c9..eef4239e7 100644
--- a/winsup/cygwin/nlsfuncs.cc
+++ b/winsup/cygwin/nlsfuncs.cc
@@ -1455,7 +1455,20 @@ __set_charset_from_locale (const char *locale, char *charset)
 UINT
 __eval_codepage_from_internal_charset ()
 {
-  const char *charset = __locale_charset (__get_global_locale ());
+  const char *charset;
+  __locale_t *loc = NULL;
+  if (__get_current_locale ()->lc_cat[LC_CTYPE].buf)
+    charset = __locale_charset (__get_current_locale ());
+  else
+    {
+      const char *env = __get_locale_env (_REENT, LC_CTYPE);
+      char locale[ENCODING_LEN + 1];
+      strncpy (locale, env, ENCODING_LEN);
+      locale[ENCODING_LEN] = '\0';
+      loc = duplocale (__get_current_locale ());
+      __loadlocale (loc, LC_CTYPE, locale);
+      charset = __locale_charset (loc);
+    }
   UINT codepage = CP_UTF8; /* Default UTF8 */
 
   /* The internal charset names are well defined, so we can use shortcuts. */
@@ -1485,7 +1498,7 @@ __eval_codepage_from_internal_charset ()
       codepage = 936;
       break;
     case 'I': /* ISO-8859-x */
-      codepage = strtoul (charset + 9, NULL, 10);
+      codepage = strtoul (charset + 9, NULL, 10) + 28590;
       break;
     case 'S': /* SJIS */
       codepage = 932;
@@ -1493,6 +1506,8 @@ __eval_codepage_from_internal_charset ()
     default: /* All set to UTF8 already */
       break;
     }
+  if (loc)
+    freelocale (loc);
   return codepage;
 }
 
diff --git a/winsup/cygwin/spawn.cc b/winsup/cygwin/spawn.cc
index 02c4207ab..92d190d1a 100644
--- a/winsup/cygwin/spawn.cc
+++ b/winsup/cygwin/spawn.cc
@@ -619,6 +619,18 @@ child_info_spawn::worker (const char *prog_arg, const char *const *argv,
 	    }
 	}
 
+      if (!iscygwin ())
+	{
+	  cfd.rewind ();
+	  while (cfd.next () >= 0)
+	    if (cfd->get_major () == DEV_PTYS_MAJOR)
+	      {
+		fhandler_pty_slave *ptys =
+		  (fhandler_pty_slave *)(fhandler_base *) cfd;
+		ptys->setup_locale ();
+	      }
+	}
+
       /* Set up needed handles for stdio */
       si.dwFlags = STARTF_USESTDHANDLES;
       si.hStdInput = handle ((in__stdin < 0 ? 0 : in__stdin), false);

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-06 16:04                                 ` Takashi Yano
@ 2020-09-07  4:45                                   ` Takashi Yano
  2020-09-07  9:08                                     ` Corinna Vinschen
  2020-09-08  8:40                                     ` Corinna Vinschen
  0 siblings, 2 replies; 60+ messages in thread
From: Takashi Yano @ 2020-09-07  4:45 UTC (permalink / raw)
  To: cygwin-patches

[-- Attachment #1: Type: text/plain, Size: 6357 bytes --]

On Mon, 7 Sep 2020 01:04:13 +0900
Takashi Yano via Cygwin-patches <cygwin-patches@cygwin.com> wrote:
> On Sun, 6 Sep 2020 19:15:30 +0900
> Takashi Yano via Cygwin-patches <cygwin-patches@cygwin.com> wrote:
> > On Sun, 6 Sep 2020 17:57:03 +0900
> > Takashi Yano via Cygwin-patches <cygwin-patches@cygwin.com> wrote:
> > > On Sat, 5 Sep 2020 23:15:16 +0900
> > > Takashi Yano via Cygwin-patches <cygwin-patches@cygwin.com> wrote:
> > > > Hi Corinna,
> > > > 
> > > > On Sat, 5 Sep 2020 20:15:06 +0900
> > > > Takashi Yano via Cygwin-patches <cygwin-patches@cygwin.com> wrote:
> > > > > On Sat, 5 Sep 2020 17:43:01 +0900
> > > > > Takashi Yano via Cygwin-patches <cygwin-patches@cygwin.com> wrote:
> > > > > > Hi Corinna,
> > > > > > 
> > > > > > On Fri, 4 Sep 2020 21:22:35 +0200
> > > > > > Corinna Vinschen wrote:
> > > > > > > Hi Takashi,
> > > > > > > 
> > > > > > > On Sep  4 23:50, Takashi Yano via Cygwin-patches wrote:
> > > > > > > > Hi Corinna,
> > > > > > > > 
> > > > > > > > On Fri, 4 Sep 2020 14:44:00 +0200
> > > > > > > > Corinna Vinschen wrote:
> > > > > > > > > On Sep  4 18:21, Takashi Yano via Cygwin-patches wrote:
> > > > > > > > > > I think I have found the answer to your request.
> > > > > > > > > > Patch attached. What do you think of this patch?
> > > > > > > > > > 
> > > > > > > > > > Calling initial_setlocale() is necessary because
> > > > > > > > > > nl_langinfo() always returns "ANSI_X3.4-1968"
> > > > > > > > > > regardless locale setting if this is not called.
> > > > > > > > > [...]
> > > > > > > > > However, the initial_setlocale() call in dll_crt0_1 calls
> > > > > > > > > internal_setlocale(), and *that* function sets the conversion functions
> > > > > > > > > for the internal conversions.  What it *doesn't* do yet at the moment is
> > > > > > > > > to store the charset name itself or, better, the equivalent codepage.
> > > > > > > > > 
> > > > > > > > > If we change that, setup_locale can simply go away.  Below is a patch
> > > > > > > > > doing just that.  Can you please check if that works in your test
> > > > > > > > > scenarios?
> > > > > > > > 
> > > > > > > > I tried your patch, but unfortunately it does not work.
> > > > > > > > cygheap->locale.term_code_page is 0 in pty master.
> > > > > > > > 
> > > > > > > > If the following lines are moved in internal_setlocale(),
> > > > > > > > 
> > > > > > > >   const char *charset = __locale_charset (__get_global_locale ());
> > > > > > > >   debug_printf ("Global charset set to %s", charset);
> > > > > > > >   /* Store codepage to be utilized by pseudo console code. */
> > > > > > > >   cygheap->locale.term_code_page =
> > > > > > > >             __eval_codepage_from_internal_charset (charset);
> > > > > > > > 
> > > > > > > > in internal_setlocale() before
> > > > > > > > 
> > > > > > > >   /* Don't do anything if the charset hasn't actually changed. */
> > > > > > > >   if (cygheap->locale.mbtowc == __get_global_locale ()->mbtowc)
> > > > > > > >     return;
> > > > > > > 
> > > > > > > Uh, that makes sense.
> > > > > > > 
> > > > > > > > cygheap->locale.term_code_page is always 65001 even if mintty is
> > > > > > > > startted by
> > > > > > > > mintty -o locale=ja_JP -o charset=CP932
> > > > > > > > or
> > > > > > > > mintty -o locale=ja_JP -o charset=EUCJP
> > > > > > > > 
> > > > > > > > Perhaps, this is because LANG is not set properly yet when mintty
> > > > > > > > is started.
> > > > > > > 
> > > > > > > Yeah, that's the reason.  The above settings of locale and charset on
> > > > > > > the CLI should only take over when mintty calls setlocale() with a
> > > > > > > matching string.  The fact that it sets the matching value in the
> > > > > > > environment, too, should only affect child processes, not mintty itself.
> > > > > > > 
> > > > > > > But it's incorrect to call initial_setlocale() from setup_locale()
> > > > > > > without resetting it to its original value.
> > > > > > > 
> > > > > > > Unfortunately that doesn't solve any problem with the pseudo console
> > > > > > > codepage.  Drat.  It sounds like you need the terminal's charset,
> > > > > > > rather than the one set in the environment.
> > > > > > > 
> > > > > > > So this boils down to the fact that term_code_page must be set
> > > > > > > after the application is already running and as soo as it creates
> > > > > > > the pty, me thinks.  What if __eval_codepage_from_internal_charset()
> > > > > > > is called at pty creation?  Or even on reading from /writing to
> > > > > > > the pty the first time?  That should always be late enough to fetch
> > > > > > > the correct codepage.
> > > > > > > 
> > > > > > > Patch attached.  Does that work as expected?
> > > > > > 
> > > > > > Thank you very much for the patch.
> > > > > > 
> > > > > > Your new additional patch works well except the test case such as:
> > > > > > 
> > > > > >   int pm = getpt();
> > > > > >   if (fork()) {
> > > > > >     [do the master operations]
> > > > > >   } else {
> > > > > >     int ps = open(ptsname(pm), O_RDWR|O_NOCTTY);
> > > > > >     close(pm);
> > > > > >     setsid();
> > > > > >     ioctl(ps, TIOCSCTTY, 1);
> > > > > >     dup2(ps, 0);
> > > > > >     dup2(ps, 1);
> > > > > >     dup2(ps, 2);
> > > > > >     close(ps);
> > > > > >     [exec non-cygwin process]
> > > > > >   }
> > > > > > 
> > > > > > If this test case is run in cygwin console (command prompt),
> > > > > > it causes garbled output due to term_code_page == 0.
> > > > > > 
> > > > > > The second additional patch attached fixes the isseu.
> > > > > 
> > > > > No. This does not fix enough.
> > > > > 
> > > > > In the test case above, if it does not call setlocale(),
> > > > > __eval_codepage_from_internal_charset() always returns "ASCII"
> > > > > regardless of locale setting. Therefore, output is garbled if
> > > > > the terminal charset is not UTF-8.
> > > > 
> > > > New additional patch against your patches is attached.
> > > > In the end, essentially what was being done in the original
> > > > code has been restored.
> > > 
> > > The patch revised a little.
> > 
> > Chages:
> > - If global locale is set, it takes precedence.
> 
> Changes:
> - Use __get_current_locale() instead of __get_global_locale().
> - Fix a bug for ISO-8859-* charset.

Changes:
- Use envblock if it is passed to CreateProcess in spawn.cc.

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>

[-- Attachment #2: set-term-code-page.diff --]
[-- Type: text/plain, Size: 7126 bytes --]

diff --git a/winsup/cygwin/cygheap.h b/winsup/cygwin/cygheap.h
index 2b84f4252..8877cc358 100644
--- a/winsup/cygwin/cygheap.h
+++ b/winsup/cygwin/cygheap.h
@@ -341,7 +341,6 @@ struct cygheap_debug
 struct cygheap_locale
 {
   mbtowc_p mbtowc;
-  UINT term_code_page;
 };
 
 struct user_heap_info
diff --git a/winsup/cygwin/fhandler.h b/winsup/cygwin/fhandler.h
index af619df5f..17e03785f 100644
--- a/winsup/cygwin/fhandler.h
+++ b/winsup/cygwin/fhandler.h
@@ -2336,6 +2336,7 @@ class fhandler_pty_slave: public fhandler_pty_common
   void set_switch_to_pcon (void);
   void reset_switch_to_pcon (void);
   void mask_switch_to_pcon_in (bool mask);
+  void setup_locale (const WCHAR *env);
 };
 
 #define __ptsname(buf, unit) __small_sprintf ((buf), "/dev/pty%d", (unit))
diff --git a/winsup/cygwin/fhandler_tty.cc b/winsup/cygwin/fhandler_tty.cc
index feb014175..fc9f4c44d 100644
--- a/winsup/cygwin/fhandler_tty.cc
+++ b/winsup/cygwin/fhandler_tty.cc
@@ -39,7 +39,7 @@ extern "C" {
   VOID WINAPI ClosePseudoConsole (HPCON);
 }
 
-extern UINT __eval_codepage_from_internal_charset ();
+extern UINT __eval_codepage_from_internal_charset (const WCHAR *env);
 
 #define close_maybe(h) \
   do { \
@@ -635,9 +635,9 @@ fhandler_pty_slave::open (int flags, mode_t)
 
   fhandler_console::need_invisible ();
 
-#if 1 /* Let's try this first */
-  if (!cygheap->locale.term_code_page)
-    cygheap->locale.term_code_page = __eval_codepage_from_internal_charset ();
+#if 0 /* Let's try this first */
+  if (!get_ttyp ()->term_code_page)
+    get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset (NULL);
 #endif
 
   set_open_status ();
@@ -1615,8 +1615,8 @@ fhandler_pty_master::write (const void *ptr, size_t len)
     return (ssize_t) bg;
 
 #if 0 /* Let's try this if setting codepage at pty open time is not enough */
-  if (!cygheap->locale.term_code_page)
-    cygheap->locale.term_code_page = __eval_codepage_from_internal_charset ();
+  if (!get_ttyp ()->term_code_page)
+    get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset (NULL);
 #endif
 
   push_process_state process_state (PID_TTYOU);
@@ -1627,7 +1627,7 @@ fhandler_pty_master::write (const void *ptr, size_t len)
     {
       size_t nlen;
       char *buf = convert_mb_str (CP_UTF8, &nlen,
-				  cygheap->locale.term_code_page,
+				  get_ttyp ()->term_code_page,
 				  (const char *) ptr, len);
 
       WaitForSingleObject (input_mutex, INFINITE);
@@ -1795,6 +1795,13 @@ fhandler_pty_common::set_close_on_exec (bool val)
   close_on_exec (val);
 }
 
+void
+fhandler_pty_slave::setup_locale (const WCHAR *env)
+{
+  if (!get_ttyp ()->term_code_page)
+    get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset (env);
+}
+
 void
 fhandler_pty_slave::fixup_after_fork (HANDLE parent)
 {
@@ -1811,6 +1818,9 @@ fhandler_pty_slave::fixup_after_exec ()
   if (!close_on_exec ())
     fixup_after_fork (NULL);	/* No parent handle required. */
 
+  /* Set locale */
+  setup_locale (NULL);
+
   /* Hook Console API */
 #define DO_HOOK(module, name) \
   if (!name##_Orig) \
@@ -1982,8 +1992,8 @@ fhandler_pty_master::pty_master_fwd_thread ()
   char outbuf[65536];
 
 #if 0 /* Let's try this if setting codepage at pty open time is not enough */
-  if (!cygheap->locale.term_code_page)
-    cygheap->locale.term_code_page = __eval_codepage_from_internal_charset ();
+  if (!get_ttyp ()->term_code_page)
+    get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset (NULL);
 #endif
 
   termios_printf ("Started.");
@@ -2041,7 +2051,7 @@ fhandler_pty_master::pty_master_fwd_thread ()
 	      state = 0;
 
 	  size_t nlen;
-	  char *buf = convert_mb_str (cygheap->locale.term_code_page,
+	  char *buf = convert_mb_str (get_ttyp ()->term_code_page,
 				      &nlen, CP_UTF8, ptr, wlen);
 
 	  ptr = buf;
@@ -2064,7 +2074,7 @@ fhandler_pty_master::pty_master_fwd_thread ()
 	  continue;
 	}
       size_t nlen;
-      char *buf = convert_mb_str (cygheap->locale.term_code_page, &nlen,
+      char *buf = convert_mb_str (get_ttyp ()->term_code_page, &nlen,
 				  GetConsoleOutputCP (), ptr, wlen);
 
       ptr = buf;
diff --git a/winsup/cygwin/nlsfuncs.cc b/winsup/cygwin/nlsfuncs.cc
index 6bffc77c9..120b55c24 100644
--- a/winsup/cygwin/nlsfuncs.cc
+++ b/winsup/cygwin/nlsfuncs.cc
@@ -1453,9 +1453,42 @@ __set_charset_from_locale (const char *locale, char *charset)
    codepage connected to a locale by default, so we have to set this
    up explicitely. */
 UINT
-__eval_codepage_from_internal_charset ()
+__eval_codepage_from_internal_charset (const WCHAR *envblock)
 {
-  const char *charset = __locale_charset (__get_global_locale ());
+  const char *charset;
+  __locale_t *loc = NULL;
+  if (__get_current_locale ()->lc_cat[LC_CTYPE].buf)
+    charset = __locale_charset (__get_current_locale ());
+  else
+    {
+      char locale[ENCODING_LEN + 1] = {0, };
+      if (envblock)
+	{
+	  const WCHAR *lc_all = NULL, *lc_ctype = NULL, *lang = NULL;
+	  for (const WCHAR *p = envblock; *p != L'\0'; p += wcslen (p) + 1)
+	    if (wcsncmp (p, L"LC_ALL=", 7) == 0)
+	      lc_all = p + 7;
+	    else if (wcsncmp (p, L"LC_CTYPE=", 9) == 0)
+	      lc_ctype = p + 9;
+	    else if (wcsncmp (p, L"LANG=", 5) == 0)
+	      lang = p + 5;
+	  if (lc_all && *lc_all)
+	    snprintf (locale, ENCODING_LEN + 1, "%ls", lc_all);
+	  else if (lc_ctype && *lc_ctype)
+	    snprintf (locale, ENCODING_LEN + 1, "%ls", lc_ctype);
+	  else if (lang && *lang)
+	    snprintf (locale, ENCODING_LEN + 1, "%ls", lang);
+	}
+      if (!*locale)
+	{
+	  const char *env = __get_locale_env (_REENT, LC_CTYPE);
+	  strncpy (locale, env, ENCODING_LEN);
+	  locale[ENCODING_LEN] = '\0';
+	}
+      loc = duplocale (__get_current_locale ());
+      __loadlocale (loc, LC_CTYPE, locale);
+      charset = __locale_charset (loc);
+    }
   UINT codepage = CP_UTF8; /* Default UTF8 */
 
   /* The internal charset names are well defined, so we can use shortcuts. */
@@ -1485,7 +1518,7 @@ __eval_codepage_from_internal_charset ()
       codepage = 936;
       break;
     case 'I': /* ISO-8859-x */
-      codepage = strtoul (charset + 9, NULL, 10);
+      codepage = strtoul (charset + 9, NULL, 10) + 28590;
       break;
     case 'S': /* SJIS */
       codepage = 932;
@@ -1493,6 +1526,8 @@ __eval_codepage_from_internal_charset ()
     default: /* All set to UTF8 already */
       break;
     }
+  if (loc)
+    freelocale (loc);
   return codepage;
 }
 
diff --git a/winsup/cygwin/spawn.cc b/winsup/cygwin/spawn.cc
index 02c4207ab..e9b808688 100644
--- a/winsup/cygwin/spawn.cc
+++ b/winsup/cygwin/spawn.cc
@@ -619,6 +619,18 @@ child_info_spawn::worker (const char *prog_arg, const char *const *argv,
 	    }
 	}
 
+      if (!iscygwin ())
+	{
+	  cfd.rewind ();
+	  while (cfd.next () >= 0)
+	    if (cfd->get_major () == DEV_PTYS_MAJOR)
+	      {
+		fhandler_pty_slave *ptys =
+		  (fhandler_pty_slave *)(fhandler_base *) cfd;
+		ptys->setup_locale (envblock);
+	      }
+	}
+
       /* Set up needed handles for stdio */
       si.dwFlags = STARTF_USESTDHANDLES;
       si.hStdInput = handle ((in__stdin < 0 ? 0 : in__stdin), false);

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-05  8:43                       ` Takashi Yano
  2020-09-05 11:15                         ` Takashi Yano
@ 2020-09-07  8:26                         ` Corinna Vinschen
  2020-09-07  9:36                           ` Takashi Yano
  2020-09-07 10:27                           ` Takashi Yano
  1 sibling, 2 replies; 60+ messages in thread
From: Corinna Vinschen @ 2020-09-07  8:26 UTC (permalink / raw)
  To: cygwin-patches

Hi Takashi,

On Sep  5 17:43, Takashi Yano via Cygwin-patches wrote:
> On Fri, 4 Sep 2020 21:22:35 +0200
> Corinna Vinschen wrote:
> > So this boils down to the fact that term_code_page must be set
> > after the application is already running and as soo as it creates
> > the pty, me thinks.  What if __eval_codepage_from_internal_charset()
> > is called at pty creation?  Or even on reading from /writing to
> > the pty the first time?  That should always be late enough to fetch
> > the correct codepage.
> > 
> > Patch attached.  Does that work as expected?
> 
> Thank you very much for the patch.
> 
> Your new additional patch works well except the test case such as:
> 
>   int pm = getpt();
>   if (fork()) {
>     [do the master operations]
>   } else {
>     int ps = open(ptsname(pm), O_RDWR|O_NOCTTY);
>     close(pm);
>     setsid();
>     ioctl(ps, TIOCSCTTY, 1);
>     dup2(ps, 0);
>     dup2(ps, 1);
>     dup2(ps, 2);
>     close(ps);
>     [exec non-cygwin process]
>   }
> 
> If this test case is run in cygwin console (command prompt),
> it causes garbled output due to term_code_page == 0.

term_code_page is set on fhandler_pty_slave::open, which is what
you call above.  How can term_code_page be 0 after that?  Are you
talking about the forking parent process being master?  If so, either
switching `#if 1/#if 0' blocks in the patch might fix this (by setting
term_code_page on first read/write), or alternatively adding an
__eval_codepage_from_internal_charset() call to the master creation
as well.  Did you try enabling the #if 0'd blocks?

Either way, the call to __eval_codepage_from_internal_charset() should
take place as soon as the first pty gets created or on first pty
read/write.

__eval_codepage_from_internal_charset() could also simply be called
on *every* invocation of pty read/write, given how low profile it is.

Apart from that, I don't see anything wrong with the above scenario.
If the application creating the pty does *not* switch locale, we're
in "C" locale territory, and we should default to UTF-8.  That's
what a call to __eval_codepage_from_internal_charset() would do,
because it defaults to UTF-8 and never returns the ASCII codepage.

> > Btw., the main loop in fhandler_pty_master::pty_master_fwd_thread()
> > calls 
> > 
> >   char *buf = convert_mb_str (cygheap->locale.term_code_page,
> >                               &nlen, CP_UTF8, ptr, wlen);
> >                                      ^^^^^^^
> >   [...]
> >   WriteFile (to_master_cyg, ...
> > 
> > But then, after the code breaks from that loop, it calls
> > 
> >   char *buf = convert_mb_str (cygheap->locale.term_code_page, &nlen,
> >                               GetConsoleOutputCP (), ptr, wlen);
> >                               ^^^^^^^^^^^^^^^^^^^^^
> >   [...]
> >   process_opost_output (to_master_cyg, ...
> > 
> > process_opost_output then calls WriteFile on that to_master_cyg handle,
> > just like the WriteFile call above.
> > 
> > Is that really correct?  Shouldn't the second invocation use CP_UTF8 as
> > well?
> 
> That is correct. The first conversion is for the case that pseudo
> console is enabled, and the second one is for the case that pseudo
> console is disabled.
> 
> Pseudo console converts charset from console code page to UTF-8.
> Therefore, data read from from_slave is always UTF-8 when pseudo
> console is enabled. Moreover, OPOST processing is done in pseudo
> console, so write data simply by WriteFile() is enough.
> 
> If pseudo console is disabled, cmd.exe and so on uses console
> code page, so the code page of data read from from_slave is
> GetConsoleOutputCP(). In this case, OPOST processing is necessary.

This is really confusing me.  We never set the console codepage in the
old pty code before, it was just pipes transmitting bytes.  Why do we
suddenly have to handle native apps running in a console in this case?!?

> diff --git a/winsup/cygwin/cygheap.h b/winsup/cygwin/cygheap.h
> index 2b84f4252..8877cc358 100644
> --- a/winsup/cygwin/cygheap.h
> +++ b/winsup/cygwin/cygheap.h
> @@ -341,7 +341,6 @@ struct cygheap_debug
>  struct cygheap_locale
>  {
>    mbtowc_p mbtowc;
> -  UINT term_code_page;

No, wait.  Just reverting this change without checking the alternatives
doesn't make sense.

Why would we want to store the codepage in every single tty, given
term_code_page is set to the same value in every one of them?  AFAICS
it's sufficient to have a single term_code_page shared with the child
processes via cygheap.  The idea is to get rid of the complex
setup_locale code in every execve call and just set it once in a process
tree starting at the process creating the ptys.


Corinna

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-05 11:15                         ` Takashi Yano
  2020-09-05 14:15                           ` Takashi Yano
@ 2020-09-07  8:27                           ` Corinna Vinschen
  2020-09-07  8:38                             ` Takashi Yano
  1 sibling, 1 reply; 60+ messages in thread
From: Corinna Vinschen @ 2020-09-07  8:27 UTC (permalink / raw)
  To: cygwin-patches

On Sep  5 20:15, Takashi Yano via Cygwin-patches wrote:
> On Sat, 5 Sep 2020 17:43:01 +0900
> Takashi Yano via Cygwin-patches <cygwin-patches@cygwin.com> wrote:
> No. This does not fix enough.
> 
> In the test case above, if it does not call setlocale(),
> __eval_codepage_from_internal_charset() always returns "ASCII"

??? __eval_codepage_from_internal_charset() never returns ASCII.
It returns UTF-8 by default.


Corinna

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-06 10:28                   ` Takashi Yano
@ 2020-09-07  8:33                     ` Corinna Vinschen
  0 siblings, 0 replies; 60+ messages in thread
From: Corinna Vinschen @ 2020-09-07  8:33 UTC (permalink / raw)
  To: cygwin-patches

On Sep  6 19:28, Takashi Yano via Cygwin-patches wrote:
> Hi Corinna,
> 
> On Fri, 4 Sep 2020 14:44:00 +0200
> Corinna Vinschen <corinna-cygwin@cygwin.com> wrote:
> > +    case 'I': /* ISO-8859-x */
> > +      codepage = strtoul (charset + 9, NULL, 10);
> > +      break;
> 
> This should be:
> codepage = strtoul (charset + 9, NULL, 10) + 28590;
> shouldn't it?

Thanks for catching!


Corinna

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-07  8:27                           ` Corinna Vinschen
@ 2020-09-07  8:38                             ` Takashi Yano
  2020-09-07  9:09                               ` Corinna Vinschen
  0 siblings, 1 reply; 60+ messages in thread
From: Takashi Yano @ 2020-09-07  8:38 UTC (permalink / raw)
  To: cygwin-patches

On Mon, 7 Sep 2020 10:27:38 +0200
Corinna Vinschen wrote:
> On Sep  5 20:15, Takashi Yano via Cygwin-patches wrote:
> > On Sat, 5 Sep 2020 17:43:01 +0900
> > Takashi Yano via Cygwin-patches <cygwin-patches@cygwin.com> wrote:
> > No. This does not fix enough.
> > 
> > In the test case above, if it does not call setlocale(),
> > __eval_codepage_from_internal_charset() always returns "ASCII"
> 
> ??? __eval_codepage_from_internal_charset() never returns ASCII.
> It returns UTF-8 by default.

Sorry, I meant __locale_charset (__get_global_locale ()) returns
"ASCII".

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-07  4:45                                   ` Takashi Yano
@ 2020-09-07  9:08                                     ` Corinna Vinschen
  2020-09-07  9:54                                       ` Takashi Yano
  2020-09-08  8:40                                     ` Corinna Vinschen
  1 sibling, 1 reply; 60+ messages in thread
From: Corinna Vinschen @ 2020-09-07  9:08 UTC (permalink / raw)
  To: cygwin-patches

Hi Takashi,

On Sep  7 13:45, Takashi Yano via Cygwin-patches wrote:
>  #if 0 /* Let's try this if setting codepage at pty open time is not enough */
> -  if (!cygheap->locale.term_code_page)
> -    cygheap->locale.term_code_page = __eval_codepage_from_internal_charset ();
> +  if (!get_ttyp ()->term_code_page)
> +    get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset (NULL);
>  #endif

*If* we revert back to using setup_locale, these #if blocks would
go away.

> -__eval_codepage_from_internal_charset ()
> +__eval_codepage_from_internal_charset (const WCHAR *envblock)
>  {
> -  const char *charset = __locale_charset (__get_global_locale ());
> +  const char *charset;
> +  __locale_t *loc = NULL;
> +  if (__get_current_locale ()->lc_cat[LC_CTYPE].buf)
> +    charset = __locale_charset (__get_current_locale ());
> +  else
> +    {
> +      char locale[ENCODING_LEN + 1] = {0, };
> +      if (envblock)
> +	{
> +	  const WCHAR *lc_all = NULL, *lc_ctype = NULL, *lang = NULL;
> +	  for (const WCHAR *p = envblock; *p != L'\0'; p += wcslen (p) + 1)
> +	    if (wcsncmp (p, L"LC_ALL=", 7) == 0)
> +	      lc_all = p + 7;
> +	    else if (wcsncmp (p, L"LC_CTYPE=", 9) == 0)
> +	      lc_ctype = p + 9;
> +	    else if (wcsncmp (p, L"LANG=", 5) == 0)
> +	      lang = p + 5;
> +	  if (lc_all && *lc_all)
> +	    snprintf (locale, ENCODING_LEN + 1, "%ls", lc_all);
	    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	    sys_wcstombs (locale, ENCODING_LEN + 1, lc_all);

OTOH, if you read these environment vars right from our current POSIX
env, you don't have to convert from mbs to wcs at all.  Just call
getenv("LC_ALL"), etc.  After all, envblock is just the wide char
copy of our current POSIX env.

> +	  else if (lc_ctype && *lc_ctype)
> +	    snprintf (locale, ENCODING_LEN + 1, "%ls", lc_ctype);
> +	  else if (lang && *lang)
> +	    snprintf (locale, ENCODING_LEN + 1, "%ls", lang);
> +	}
> +      if (!*locale)
> +	{
> +	  const char *env = __get_locale_env (_REENT, LC_CTYPE);
> +	  strncpy (locale, env, ENCODING_LEN);
> +	  locale[ENCODING_LEN] = '\0';
> +	}
> +      loc = duplocale (__get_current_locale ());
> +      __loadlocale (loc, LC_CTYPE, locale);
> +      charset = __locale_charset (loc);
> +    }

Oh, boy, this is really a lot.  I have some doubts this complexity is
really necessary.  It's a bit weird to go to such great lengths for
native applications.  Still, why not just do this once in the process
creating the pty rather than trying on every execve?

>      case 'I': /* ISO-8859-x */
> -      codepage = strtoul (charset + 9, NULL, 10);
> +      codepage = strtoul (charset + 9, NULL, 10) + 28590;

Oops, I just fixed that in my original patch already.


Corinna

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-07  8:38                             ` Takashi Yano
@ 2020-09-07  9:09                               ` Corinna Vinschen
  0 siblings, 0 replies; 60+ messages in thread
From: Corinna Vinschen @ 2020-09-07  9:09 UTC (permalink / raw)
  To: cygwin-patches

On Sep  7 17:38, Takashi Yano via Cygwin-patches wrote:
> On Mon, 7 Sep 2020 10:27:38 +0200
> Corinna Vinschen wrote:
> > On Sep  5 20:15, Takashi Yano via Cygwin-patches wrote:
> > > On Sat, 5 Sep 2020 17:43:01 +0900
> > > Takashi Yano via Cygwin-patches <cygwin-patches@cygwin.com> wrote:
> > > No. This does not fix enough.
> > > 
> > > In the test case above, if it does not call setlocale(),
> > > __eval_codepage_from_internal_charset() always returns "ASCII"
> > 
> > ??? __eval_codepage_from_internal_charset() never returns ASCII.
> > It returns UTF-8 by default.
> 
> Sorry, I meant __locale_charset (__get_global_locale ()) returns
> "ASCII".

Which is correct and translates to UTF8 as default charset.


Corinna

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-07  8:26                         ` Corinna Vinschen
@ 2020-09-07  9:36                           ` Takashi Yano
  2020-09-07 18:24                             ` Takashi Yano
  2020-09-07 21:08                             ` Johannes Schindelin
  2020-09-07 10:27                           ` Takashi Yano
  1 sibling, 2 replies; 60+ messages in thread
From: Takashi Yano @ 2020-09-07  9:36 UTC (permalink / raw)
  To: cygwin-patches

On Mon, 7 Sep 2020 10:26:33 +0200
Corinna Vinschen wrote:
> Hi Takashi,
> 
> On Sep  5 17:43, Takashi Yano via Cygwin-patches wrote:
> > On Fri, 4 Sep 2020 21:22:35 +0200
> > Corinna Vinschen wrote:
> > > So this boils down to the fact that term_code_page must be set
> > > after the application is already running and as soo as it creates
> > > the pty, me thinks.  What if __eval_codepage_from_internal_charset()
> > > is called at pty creation?  Or even on reading from /writing to
> > > the pty the first time?  That should always be late enough to fetch
> > > the correct codepage.
> > > 
> > > Patch attached.  Does that work as expected?
> > 
> > Thank you very much for the patch.
> > 
> > Your new additional patch works well except the test case such as:
> > 
> >   int pm = getpt();
> >   if (fork()) {
> >     [do the master operations]
> >   } else {
> >     int ps = open(ptsname(pm), O_RDWR|O_NOCTTY);
> >     close(pm);
> >     setsid();
> >     ioctl(ps, TIOCSCTTY, 1);
> >     dup2(ps, 0);
> >     dup2(ps, 1);
> >     dup2(ps, 2);
> >     close(ps);
> >     [exec non-cygwin process]
> >   }
> > 
> > If this test case is run in cygwin console (command prompt),
> > it causes garbled output due to term_code_page == 0.
> 
> term_code_page is set on fhandler_pty_slave::open, which is what
> you call above.  How can term_code_page be 0 after that?  Are you
> talking about the forking parent process being master?  If so, either
> switching `#if 1/#if 0' blocks in the patch might fix this (by setting
> term_code_page on first read/write), or alternatively adding an
> __eval_codepage_from_internal_charset() call to the master creation
> as well.  Did you try enabling the #if 0'd blocks?

IIUC, it is not shared with its parent process (master) if the pty
slave is opned in the child slave process.

When #if 0 block in master::write is enabled the problem is avoidable
only if master calls write. If the slave process is a output-only
process, master::write is never called.

> Either way, the call to __eval_codepage_from_internal_charset() should
> take place as soon as the first pty gets created or on first pty
> read/write.

Hmm, this sounds reasonable...

> __eval_codepage_from_internal_charset() could also simply be called
> on *every* invocation of pty read/write, given how low profile it is.

If the slave process execed is non-cygwin process in above test case,
slave::read and slave::write are not called, so this also cannot solve
the problem.

> Apart from that, I don't see anything wrong with the above scenario.
> If the application creating the pty does *not* switch locale, we're
> in "C" locale territory, and we should default to UTF-8.  That's
> what a call to __eval_codepage_from_internal_charset() would do,
> because it defaults to UTF-8 and never returns the ASCII codepage.

The above test case is simplifed version of the code of cygterm.
It is a part of teraterm https://ttssh2.osdn.jp/index.html.en.
Cygterm does not call setlocale() even with LANG=ja_JP.CP932.
So I think environment should be checked if setlocale() is not
called. In cygterm, environment LANG is set only in slave side.
Therefore, term_code_page should be determine in slave side.

> > > Btw., the main loop in fhandler_pty_master::pty_master_fwd_thread()
> > > calls 
> > > 
> > >   char *buf = convert_mb_str (cygheap->locale.term_code_page,
> > >                               &nlen, CP_UTF8, ptr, wlen);
> > >                                      ^^^^^^^
> > >   [...]
> > >   WriteFile (to_master_cyg, ...
> > > 
> > > But then, after the code breaks from that loop, it calls
> > > 
> > >   char *buf = convert_mb_str (cygheap->locale.term_code_page, &nlen,
> > >                               GetConsoleOutputCP (), ptr, wlen);
> > >                               ^^^^^^^^^^^^^^^^^^^^^
> > >   [...]
> > >   process_opost_output (to_master_cyg, ...
> > > 
> > > process_opost_output then calls WriteFile on that to_master_cyg handle,
> > > just like the WriteFile call above.
> > > 
> > > Is that really correct?  Shouldn't the second invocation use CP_UTF8 as
> > > well?
> > 
> > That is correct. The first conversion is for the case that pseudo
> > console is enabled, and the second one is for the case that pseudo
> > console is disabled.
> > 
> > Pseudo console converts charset from console code page to UTF-8.
> > Therefore, data read from from_slave is always UTF-8 when pseudo
> > console is enabled. Moreover, OPOST processing is done in pseudo
> > console, so write data simply by WriteFile() is enough.
> > 
> > If pseudo console is disabled, cmd.exe and so on uses console
> > code page, so the code page of data read from from_slave is
> > GetConsoleOutputCP(). In this case, OPOST processing is necessary.
> 
> This is really confusing me.  We never set the console codepage in the
> old pty code before, it was just pipes transmitting bytes.  Why do we
> suddenly have to handle native apps running in a console in this case?!?

This is actually not related to pseudo console. In Japanese environment,
cmd.exe output CP932 string by default. This caused gabled output in old
cygwin such as 3.0.7. The code for the case that pseudo console is
disabled is to fix this.

> > diff --git a/winsup/cygwin/cygheap.h b/winsup/cygwin/cygheap.h
> > index 2b84f4252..8877cc358 100644
> > --- a/winsup/cygwin/cygheap.h
> > +++ b/winsup/cygwin/cygheap.h
> > @@ -341,7 +341,6 @@ struct cygheap_debug
> >  struct cygheap_locale
> >  {
> >    mbtowc_p mbtowc;
> > -  UINT term_code_page;
> 
> No, wait.  Just reverting this change without checking the alternatives
> doesn't make sense.
> 
> Why would we want to store the codepage in every single tty, given
> term_code_page is set to the same value in every one of them?  AFAICS
> it's sufficient to have a single term_code_page shared with the child
> processes via cygheap.  The idea is to get rid of the complex
> setup_locale code in every execve call and just set it once in a process
> tree starting at the process creating the ptys.

You are right if term_code_page can be determined in master process.
However, as for the above test case, I cannot imagine the solution.
What cygterm essentially does is as follows. It does not call setlocale()
and not set LANG in master process.

  int pm = getpt();
  if (fork()) {
    [do the master operations]
  } else {
    setsid();
    ps = open(ptsname(pm), O_RDWR);
    close(pm);
    dup2(ps, 0);
    dup2(ps, 1);
    dup2(ps, 2);
    close(ps);
    setenv("LANG", "ja_JP.SJIS", 1);
    [exec shell]
  }

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-07  9:08                                     ` Corinna Vinschen
@ 2020-09-07  9:54                                       ` Takashi Yano
  2020-09-07  9:59                                         ` Takashi Yano
  0 siblings, 1 reply; 60+ messages in thread
From: Takashi Yano @ 2020-09-07  9:54 UTC (permalink / raw)
  To: cygwin-patches

On Mon, 7 Sep 2020 11:08:23 +0200
Corinna Vinschen wrote:
> Hi Takashi,
> 
> On Sep  7 13:45, Takashi Yano via Cygwin-patches wrote:
> >  #if 0 /* Let's try this if setting codepage at pty open time is not enough */
> > -  if (!cygheap->locale.term_code_page)
> > -    cygheap->locale.term_code_page = __eval_codepage_from_internal_charset ();
> > +  if (!get_ttyp ()->term_code_page)
> > +    get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset (NULL);
> >  #endif
> 
> *If* we revert back to using setup_locale, these #if blocks would
> go away.
> 
> > -__eval_codepage_from_internal_charset ()
> > +__eval_codepage_from_internal_charset (const WCHAR *envblock)
> >  {
> > -  const char *charset = __locale_charset (__get_global_locale ());
> > +  const char *charset;
> > +  __locale_t *loc = NULL;
> > +  if (__get_current_locale ()->lc_cat[LC_CTYPE].buf)
> > +    charset = __locale_charset (__get_current_locale ());
> > +  else
> > +    {
> > +      char locale[ENCODING_LEN + 1] = {0, };
> > +      if (envblock)
> > +	{
> > +	  const WCHAR *lc_all = NULL, *lc_ctype = NULL, *lang = NULL;
> > +	  for (const WCHAR *p = envblock; *p != L'\0'; p += wcslen (p) + 1)
> > +	    if (wcsncmp (p, L"LC_ALL=", 7) == 0)
> > +	      lc_all = p + 7;
> > +	    else if (wcsncmp (p, L"LC_CTYPE=", 9) == 0)
> > +	      lc_ctype = p + 9;
> > +	    else if (wcsncmp (p, L"LANG=", 5) == 0)
> > +	      lang = p + 5;
> > +	  if (lc_all && *lc_all)
> > +	    snprintf (locale, ENCODING_LEN + 1, "%ls", lc_all);
> 	    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 	    sys_wcstombs (locale, ENCODING_LEN + 1, lc_all);
> 
> OTOH, if you read these environment vars right from our current POSIX
> env, you don't have to convert from mbs to wcs at all.  Just call
> getenv("LC_ALL"), etc.  After all, envblock is just the wide char
> copy of our current POSIX env.

IIUC, envblock is not a copy of environment if exec*e() is used.
In this case, getenv() cannot retrieve the new environment values
passed to exec*e(). This is needed by the test case bellow.

  int pm = getpt();
  if (fork()) {
    [do the master operations]
  } else {
    char *env[] = {"LANG=ja_JP.SJIS", ...., NULL};
    setsid();
    ps = open(ptsname(pm), O_RDWR);
    close(pm);
    dup2(ps, 0);
    dup2(ps, 1);
    dup2(ps, 2);
    close(ps);
    execle("/bin/tcsh", "/bin/tcsh", "-l", NULL, env);
  }

> > +	  else if (lc_ctype && *lc_ctype)
> > +	    snprintf (locale, ENCODING_LEN + 1, "%ls", lc_ctype);
> > +	  else if (lang && *lang)
> > +	    snprintf (locale, ENCODING_LEN + 1, "%ls", lang);
> > +	}
> > +      if (!*locale)
> > +	{
> > +	  const char *env = __get_locale_env (_REENT, LC_CTYPE);
> > +	  strncpy (locale, env, ENCODING_LEN);
> > +	  locale[ENCODING_LEN] = '\0';
> > +	}
> > +      loc = duplocale (__get_current_locale ());
> > +      __loadlocale (loc, LC_CTYPE, locale);
> > +      charset = __locale_charset (loc);
> > +    }
> 
> Oh, boy, this is really a lot.  I have some doubts this complexity is
> really necessary.  It's a bit weird to go to such great lengths for
> native applications.  Still, why not just do this once in the process
> creating the pty rather than trying on every execve?

This is executed just once for a pty. Because
__eval_codepage_from_internal_charset() is called only when
get_ttyp ()->term_code_page is not set yet.

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-07  9:54                                       ` Takashi Yano
@ 2020-09-07  9:59                                         ` Takashi Yano
  0 siblings, 0 replies; 60+ messages in thread
From: Takashi Yano @ 2020-09-07  9:59 UTC (permalink / raw)
  To: cygwin-patches

On Mon, 7 Sep 2020 18:54:45 +0900
Takashi Yano via Cygwin-patches <cygwin-patches@cygwin.com> wrote:
> On Mon, 7 Sep 2020 11:08:23 +0200
> Corinna Vinschen wrote:
> > Hi Takashi,
> > 
> > On Sep  7 13:45, Takashi Yano via Cygwin-patches wrote:
> > >  #if 0 /* Let's try this if setting codepage at pty open time is not enough */
> > > -  if (!cygheap->locale.term_code_page)
> > > -    cygheap->locale.term_code_page = __eval_codepage_from_internal_charset ();
> > > +  if (!get_ttyp ()->term_code_page)
> > > +    get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset (NULL);
> > >  #endif
> > 
> > *If* we revert back to using setup_locale, these #if blocks would
> > go away.
> > 
> > > -__eval_codepage_from_internal_charset ()
> > > +__eval_codepage_from_internal_charset (const WCHAR *envblock)
> > >  {
> > > -  const char *charset = __locale_charset (__get_global_locale ());
> > > +  const char *charset;
> > > +  __locale_t *loc = NULL;
> > > +  if (__get_current_locale ()->lc_cat[LC_CTYPE].buf)
> > > +    charset = __locale_charset (__get_current_locale ());
> > > +  else
> > > +    {
> > > +      char locale[ENCODING_LEN + 1] = {0, };
> > > +      if (envblock)
> > > +	{
> > > +	  const WCHAR *lc_all = NULL, *lc_ctype = NULL, *lang = NULL;
> > > +	  for (const WCHAR *p = envblock; *p != L'\0'; p += wcslen (p) + 1)
> > > +	    if (wcsncmp (p, L"LC_ALL=", 7) == 0)
> > > +	      lc_all = p + 7;
> > > +	    else if (wcsncmp (p, L"LC_CTYPE=", 9) == 0)
> > > +	      lc_ctype = p + 9;
> > > +	    else if (wcsncmp (p, L"LANG=", 5) == 0)
> > > +	      lang = p + 5;
> > > +	  if (lc_all && *lc_all)
> > > +	    snprintf (locale, ENCODING_LEN + 1, "%ls", lc_all);
> > 	    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > 	    sys_wcstombs (locale, ENCODING_LEN + 1, lc_all);
> > 
> > OTOH, if you read these environment vars right from our current POSIX
> > env, you don't have to convert from mbs to wcs at all.  Just call
> > getenv("LC_ALL"), etc.  After all, envblock is just the wide char
> > copy of our current POSIX env.
> 
> IIUC, envblock is not a copy of environment if exec*e() is used.
> In this case, getenv() cannot retrieve the new environment values
> passed to exec*e(). This is needed by the test case bellow.
> 
>   int pm = getpt();
>   if (fork()) {
>     [do the master operations]
>   } else {
>     char *env[] = {"LANG=ja_JP.SJIS", ...., NULL};
>     setsid();
>     ps = open(ptsname(pm), O_RDWR);
>     close(pm);
>     dup2(ps, 0);
>     dup2(ps, 1);
>     dup2(ps, 2);
>     close(ps);
>     execle("/bin/tcsh", "/bin/tcsh", "-l", NULL, env);
>   }

Sorry, if the process execed is cygwin process, envblock is
not used becase setup_locale() is called from fixup_after_exec()
with NULL.

-    execle("/bin/tcsh", "/bin/tcsh", "-l", NULL, env);
+    execle("/cygdrive/c/windows/system32/cmd.exe", "cmd", NULL, env);


-- 
Takashi Yano <takashi.yano@nifty.ne.jp>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-07  8:26                         ` Corinna Vinschen
  2020-09-07  9:36                           ` Takashi Yano
@ 2020-09-07 10:27                           ` Takashi Yano
  2020-09-07 13:40                             ` Takashi Yano
  1 sibling, 1 reply; 60+ messages in thread
From: Takashi Yano @ 2020-09-07 10:27 UTC (permalink / raw)
  To: cygwin-patches

On Mon, 7 Sep 2020 10:26:33 +0200
Corinna Vinschen wrote:
> processes via cygheap.  The idea is to get rid of the complex
> setup_locale code in every execve call and just set it once in a process
> tree starting at the process creating the ptys.

This cause a problem if mintty is started from another
mintty using diffrent charset. If the child mintty
inherits cygheap, term_code_page is not changed.

1) Start mintty (UTF-8).
2) Start another mintty by 
     mintty -o charset=SJIS
   from the first mintty.

In this case, the second mintty inherits cygheap having
locale.term_code_page == UTF-8, while the correct
term_code_page is SJIS (CP932).

Therefore, term_code_page should be determined per pty.

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-07 10:27                           ` Takashi Yano
@ 2020-09-07 13:40                             ` Takashi Yano
  2020-09-08  7:55                               ` Corinna Vinschen
  0 siblings, 1 reply; 60+ messages in thread
From: Takashi Yano @ 2020-09-07 13:40 UTC (permalink / raw)
  To: cygwin-patches

Here is a summary of my points:

[Senario 1]
1) Start mintty (UTF-8).
2) Start another mintty by 
     mintty -o charset=SJIS
   from the first mintty.

[Senario 2]
  int pm = getpt();
  if (fork()) {
    [do the master operations]
  } else {
    setsid();
    ps = open(ptsname(pm), O_RDWR);
    close(pm);
    dup2(ps, 0);
    dup2(ps, 1);
    dup2(ps, 2);
    close(ps);
    setenv("LANG", "ja_JP.SJIS", 1);
    [exec shell]
  }


Q1) cygheap or tty shared memory?

Consider senario 1. If the term_code_page is in cygheap,
it is inherited to child process. Therefore, the second
mintty cannot update term_code_page while locale is changed.

Consider senario 2. Since only the child process knows the
locale, master (parent process) cannot get term_code_page
if it is in cygheap.

Q2) Is checking environment necessary?

As for senario 2, setlocale() is not called. So it is
necessary to check environment to know locale.

Q3) Where and when should term_code_page be set?

In senario 2, LANG is set just before exec() in the CHILD
process. Therefore, term_code_page should be determined
in the child_info_spawn:worker or in the execed process.

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-07  9:36                           ` Takashi Yano
@ 2020-09-07 18:24                             ` Takashi Yano
  2020-09-07 21:08                             ` Johannes Schindelin
  1 sibling, 0 replies; 60+ messages in thread
From: Takashi Yano @ 2020-09-07 18:24 UTC (permalink / raw)
  To: cygwin-patches

On Mon, 7 Sep 2020 18:36:59 +0900
Takashi Yano via Cygwin-patches <cygwin-patches@cygwin.com> wrote:
> > This is really confusing me.  We never set the console codepage in the
> > old pty code before, it was just pipes transmitting bytes.  Why do we
> > suddenly have to handle native apps running in a console in this case?!?
> 
> This is actually not related to pseudo console. In Japanese environment,
> cmd.exe output CP932 string by default. This caused gabled output in old
> cygwin such as 3.0.7. The code for the case that pseudo console is
> disabled is to fix this.

If your question is "Why is GetConsoleOutputCP() used for
non-console handle?", the answer is "I am not sure why, but
it affected by codepage". Even with cygwin 3.0.7 which does
not support pseudo console, chcp.com works.

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-07  9:36                           ` Takashi Yano
  2020-09-07 18:24                             ` Takashi Yano
@ 2020-09-07 21:08                             ` Johannes Schindelin
  2020-09-08  4:52                               ` Brian Inglis
  1 sibling, 1 reply; 60+ messages in thread
From: Johannes Schindelin @ 2020-09-07 21:08 UTC (permalink / raw)
  To: Takashi Yano; +Cc: cygwin-patches

Hi,

On Mon, 7 Sep 2020, Takashi Yano via Cygwin-patches wrote:

> On Mon, 7 Sep 2020 10:26:33 +0200
> Corinna Vinschen wrote:
> > Hi Takashi,
> > On Sep  5 17:43, Takashi Yano via Cygwin-patches wrote:
> > > On Fri, 4 Sep 2020 21:22:35 +0200
> > > Corinna Vinschen wrote:
> > >
> > > > Btw., the main loop in
> > > > fhandler_pty_master::pty_master_fwd_thread() calls
> > > >
> > > >   char *buf = convert_mb_str (cygheap->locale.term_code_page,
> > > >                               &nlen, CP_UTF8, ptr, wlen);
> > > >                                      ^^^^^^^
> > > >   [...]
> > > >   WriteFile (to_master_cyg, ...
> > > >
> > > > But then, after the code breaks from that loop, it calls
> > > >
> > > >   char *buf = convert_mb_str (cygheap->locale.term_code_page, &nlen,
> > > >                               GetConsoleOutputCP (), ptr, wlen);
> > > >                               ^^^^^^^^^^^^^^^^^^^^^
> > > >   [...]
> > > >   process_opost_output (to_master_cyg, ...
> > > >
> > > > process_opost_output then calls WriteFile on that to_master_cyg handle,
> > > > just like the WriteFile call above.
> > > >
> > > > Is that really correct?  Shouldn't the second invocation use CP_UTF8 as
> > > > well?
> > >
> > > That is correct. The first conversion is for the case that pseudo
> > > console is enabled, and the second one is for the case that pseudo
> > > console is disabled.
> > >
> > > Pseudo console converts charset from console code page to UTF-8.
> > > Therefore, data read from from_slave is always UTF-8 when pseudo
> > > console is enabled. Moreover, OPOST processing is done in pseudo
> > > console, so write data simply by WriteFile() is enough.
> > >
> > > If pseudo console is disabled, cmd.exe and so on uses console
> > > code page, so the code page of data read from from_slave is
> > > GetConsoleOutputCP(). In this case, OPOST processing is necessary.
> >
> > This is really confusing me.  We never set the console codepage in the
> > old pty code before, it was just pipes transmitting bytes.  Why do we
> > suddenly have to handle native apps running in a console in this case?!?
>
> This is actually not related to pseudo console. In Japanese environment,
> cmd.exe output CP932 string by default. This caused gabled output in old
> cygwin such as 3.0.7. The code for the case that pseudo console is
> disabled is to fix this.

It is related to Pseudo Console insofar as it was slipped in as part of
the Pseudo Console patches.

And what Takashi reports as a bug fix is the underlying reason for the
tickets in MSYS2 (and elsewhere) that I mentioned.

In fact, I even suggested in
https://github.com/msys2/MSYS2-packages/issues/1974#issuecomment-685475967
to revert that change.

What Takashi describes as "correct behavior" unfortunately seems not to be
very common in practice, which is why I contend that from the users' point
of view, it could not matter less whether the console applications are
"correct" or not. From the point of view of users who have their `LANG`
set to something like `en_US.UTF-8`, the encoding was correct before, and
now it is no longer correct. And _that_ is the correctness users actually
care about.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-04 15:03     ` Takashi Yano
@ 2020-09-07 21:17       ` Johannes Schindelin
  2020-09-08  8:16         ` Takashi Yano
  0 siblings, 1 reply; 60+ messages in thread
From: Johannes Schindelin @ 2020-09-07 21:17 UTC (permalink / raw)
  To: Takashi Yano; +Cc: cygwin-patches

Hi Takashi,

On Sat, 5 Sep 2020, Takashi Yano wrote:

> On Fri, 4 Sep 2020 08:23:42 +0200 (CEST)
> Johannes Schindelin wrote:
> >
> > On Fri, 4 Sep 2020, Takashi Yano via Cygwin-patches wrote:
> >
> > > On Tue, 1 Sep 2020 18:19:16 +0200 (CEST)
> > > Johannes Schindelin wrote:
> > >
> > > > When `LANG=en_US.UTF-8`, the detected `LCID` is 0x0409, which is
> > > > correct, but after that (at least if Pseudo Console support is enabled),
> > > > we try to find the default code page for that `LCID`, which is ASCII
> > > > (437). Subsequently, we set the Console output code page to that value,
> > > > completely ignoring that we wanted to use UTF-8.
> > > >
> > > > Let's not ignore the specifically asked-for UTF-8 character set.
> > > >
> > > > While at it, let's also set the Console output code page even if Pseudo
> > > > Console support is disabled; contrary to the behavior of v3.0.7, the
> > > > Console output code page is not ignored in that case.
> > > >
> > > > The most common symptom would be that console applications which do not
> > > > specifically call `SetConsoleOutputCP()` but output UTF-8-encoded text
> > > > seem to be broken with v3.1.x when they worked plenty fine with v3.0.x.
> > > >
> > > > This fixes https://github.com/msys2/MSYS2-packages/issues/1974,
> > > > https://github.com/msys2/MSYS2-packages/issues/2012,
> > > > https://github.com/rust-lang/cargo/issues/8369,
> > > > https://github.com/git-for-windows/git/issues/2734,
> > > > https://github.com/git-for-windows/git/issues/2793,
> > > > https://github.com/git-for-windows/git/issues/2792, and possibly quite a
> > > > few others.
> > > >
> > > > Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> > > > ---
> > > >  winsup/cygwin/fhandler_tty.cc | 9 +++++++++
> > > >  1 file changed, 9 insertions(+)
> > > >
> > > > diff --git a/winsup/cygwin/fhandler_tty.cc b/winsup/cygwin/fhandler_tty.cc
> > > > index 06789a500..414c26992 100644
> > > > --- a/winsup/cygwin/fhandler_tty.cc
> > > > +++ b/winsup/cygwin/fhandler_tty.cc
> > > > @@ -2859,6 +2859,15 @@ fhandler_pty_slave::setup_locale (void)
> > > >    char charset[ENCODING_LEN + 1] = "ASCII";
> > > >    LCID lcid = get_langinfo (locale, charset);
> > > >
> > > > +  /* Special-case the UTF-8 character set */
> > > > +  if (strcasecmp (charset, "UTF-8") == 0)
> > > > +    {
> > > > +      get_ttyp ()->term_code_page = CP_UTF8;
> > > > +      SetConsoleCP (CP_UTF8);
> > > > +      SetConsoleOutputCP (CP_UTF8);
> > > > +      return;
> > > > +    }
> > > > +
> > > >    /* Set console code page from locale */
> > > >    if (get_pseudo_console ())
> > > >      {
> > > > --
> > > > 2.27.0
> > >
> > > I would like to propose a counter patch attached.
> > > What do you think of this patch?
> > >
> > > This patch does not treat UTF-8 as special.
> >
> > Sure, but it only fixes the issue in `disable_pcon` mode in the current
> > tip commit. That's because a new Pseudo Console is created for every
> > spawned non-Cygwin console application, and that new Pseudo Console does
> > _not_ have the code page set by your patch.
>
> You are right. However, if pseudo console is enabled, the app
> which works correclty in command prompt should work as well in
> pseudo console. Therefore, there is nothing to be fixed.

I am coming to the conclusion that your definition what is correct differs
from my definition of what is correct.

For me, it matters what users see. And what users actually see is the
output of UTF-8 encoded text that is now interpreted via the default code
page of their LCID, i.e. it is incorrect.

Sure, you can argue all you want that those console applications are _all
wrong_. _All of them_.

In practice, that matters very little, as many users have
`LANG=en_US.UTF-8` (meaning your patches force their console applications'
output to be interpreted with code page 437) and therefore for those
users, things looked fine before, and now they don't.

Note that I am not talking about developers who develop said console
applications. I am talking about users who use those console applications.
In other words, I am talking about a vastly larger group of affected
people.

All of those people (or at least a substantial majority) will now have to
be told to please disable Pseudo Console support in v3.2.0 because they
would have to patch and rebuild those console applications that don't call
`SetConsoleOutputCP()`, and that is certainly unreasonable to expect of
the majority of users. Not even the `cmd /c chcp 65001` work-around (that
helps with v3.1.7) will work with v3.2.0 when Pseudo Console support is
enabled.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-07 21:08                             ` Johannes Schindelin
@ 2020-09-08  4:52                               ` Brian Inglis
  0 siblings, 0 replies; 60+ messages in thread
From: Brian Inglis @ 2020-09-08  4:52 UTC (permalink / raw)
  To: cygwin-patches

On 2020-09-07 15:08, Johannes Schindelin wrote:
> On Mon, 7 Sep 2020, Takashi Yano via Cygwin-patches wrote:
>> On Mon, 7 Sep 2020 10:26:33 +0200
>> Corinna Vinschen wrote:
>>> Hi Takashi,
>>> On Sep  5 17:43, Takashi Yano via Cygwin-patches wrote:
>>>> On Fri, 4 Sep 2020 21:22:35 +0200
>>>> Corinna Vinschen wrote:
>>>>
>>>>> Btw., the main loop in
>>>>> fhandler_pty_master::pty_master_fwd_thread() calls
>>>>>
>>>>>   char *buf = convert_mb_str (cygheap->locale.term_code_page,
>>>>>                               &nlen, CP_UTF8, ptr, wlen);
>>>>>                                      ^^^^^^^
>>>>>   [...]
>>>>>   WriteFile (to_master_cyg, ...
>>>>>
>>>>> But then, after the code breaks from that loop, it calls
>>>>>
>>>>>   char *buf = convert_mb_str (cygheap->locale.term_code_page, &nlen,
>>>>>                               GetConsoleOutputCP (), ptr, wlen);
>>>>>                               ^^^^^^^^^^^^^^^^^^^^^
>>>>>   [...]
>>>>>   process_opost_output (to_master_cyg, ...
>>>>>
>>>>> process_opost_output then calls WriteFile on that to_master_cyg handle,
>>>>> just like the WriteFile call above.
>>>>>
>>>>> Is that really correct?  Shouldn't the second invocation use CP_UTF8 as
>>>>> well?
>>>>
>>>> That is correct. The first conversion is for the case that pseudo
>>>> console is enabled, and the second one is for the case that pseudo
>>>> console is disabled.
>>>>
>>>> Pseudo console converts charset from console code page to UTF-8.
>>>> Therefore, data read from from_slave is always UTF-8 when pseudo
>>>> console is enabled. Moreover, OPOST processing is done in pseudo
>>>> console, so write data simply by WriteFile() is enough.
>>>>
>>>> If pseudo console is disabled, cmd.exe and so on uses console
>>>> code page, so the code page of data read from from_slave is
>>>> GetConsoleOutputCP(). In this case, OPOST processing is necessary.
>>>
>>> This is really confusing me.  We never set the console codepage in the
>>> old pty code before, it was just pipes transmitting bytes.  Why do we
>>> suddenly have to handle native apps running in a console in this case?!?
>>
>> This is actually not related to pseudo console. In Japanese environment,
>> cmd.exe output CP932 string by default. This caused gabled output in old
>> cygwin such as 3.0.7. The code for the case that pseudo console is
>> disabled is to fix this.
> 
> It is related to Pseudo Console insofar as it was slipped in as part of
> the Pseudo Console patches.
> 
> And what Takashi reports as a bug fix is the underlying reason for the
> tickets in MSYS2 (and elsewhere) that I mentioned.
> 
> In fact, I even suggested in
> https://github.com/msys2/MSYS2-packages/issues/1974#issuecomment-685475967
> to revert that change.
> 
> What Takashi describes as "correct behavior" unfortunately seems not to be
> very common in practice, which is why I contend that from the users' point
> of view, it could not matter less whether the console applications are
> "correct" or not. From the point of view of users who have their `LANG`
> set to something like `en_US.UTF-8`, the encoding was correct before, and
> now it is no longer correct. And _that_ is the correctness users actually
> care about.

But also for users running locales and localization using non-Latin scripts, it
is important that messages be generated in languages they understand and output
in characters they can read.
It has been for some years (at least since the EU was formed in 1993) inadequate
and erroneous to support only en_US.ASCII.

-- 
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada

This email may be disturbing to some readers as it contains
too much technical detail. Reader discretion is advised.
[Data in IEC units and prefixes, physical quantities in SI.]

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-07 13:40                             ` Takashi Yano
@ 2020-09-08  7:55                               ` Corinna Vinschen
  0 siblings, 0 replies; 60+ messages in thread
From: Corinna Vinschen @ 2020-09-08  7:55 UTC (permalink / raw)
  To: cygwin-patches

On Sep  7 22:40, Takashi Yano via Cygwin-patches wrote:
> Here is a summary of my points:
> 
> [Senario 1]
> 1) Start mintty (UTF-8).
> 2) Start another mintty by 
>      mintty -o charset=SJIS
>    from the first mintty.
> 
> [Senario 2]
>   int pm = getpt();
>   if (fork()) {
>     [do the master operations]
>   } else {
>     setsid();
>     ps = open(ptsname(pm), O_RDWR);
>     close(pm);
>     dup2(ps, 0);
>     dup2(ps, 1);
>     dup2(ps, 2);
>     close(ps);
>     setenv("LANG", "ja_JP.SJIS", 1);
>     [exec shell]
>   }
> 
> 
> Q1) cygheap or tty shared memory?
> 
> Consider senario 1. If the term_code_page is in cygheap,
> it is inherited to child process. Therefore, the second
> mintty cannot update term_code_page while locale is changed.
> 
> Consider senario 2. Since only the child process knows the
> locale, master (parent process) cannot get term_code_page
> if it is in cygheap.
> 
> Q2) Is checking environment necessary?
> 
> As for senario 2, setlocale() is not called. So it is
> necessary to check environment to know locale.
> 
> Q3) Where and when should term_code_page be set?
> 
> In senario 2, LANG is set just before exec() in the CHILD
> process. Therefore, term_code_page should be determined
> in the child_info_spawn:worker or in the execed process.

What bugs me here is that in scenario 1, the codeset of the master side
is the defining factor, while in case 2 the slave side is the defining
factor.

Actually, the only defining factor is the codeset of the master side of
the pty.  If the master side interprets all input as utf-8, then the
slave side should either send utf-8, or live with the consequences.
It's the task of the *user* on the slave side, to set the env setting
matching to the master side.

I tend to agree with Johannes.  We should not enforce a codepage setting
inside Cygwin.  The bytes should go to the master side and the master
side interprets them.  If native apps produce garbage, well, I'm not
overly sympathetic.  Especially given the fact that even Microsoft is
now doing a lot to switch to UTF-8 as much as possible.  It's the only
sane option anyway.


Corinna

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-07 21:17       ` Johannes Schindelin
@ 2020-09-08  8:16         ` Takashi Yano
  2020-09-09  7:21           ` Corinna Vinschen
  0 siblings, 1 reply; 60+ messages in thread
From: Takashi Yano @ 2020-09-08  8:16 UTC (permalink / raw)
  To: cygwin-patches

On Mon, 7 Sep 2020 23:17:36 +0200 (CEST)
Johannes Schindelin wrote:
> Hi Takashi,
> 
> On Sat, 5 Sep 2020, Takashi Yano wrote:
> 
> > On Fri, 4 Sep 2020 08:23:42 +0200 (CEST)
> > Johannes Schindelin wrote:
> > >
> > > On Fri, 4 Sep 2020, Takashi Yano via Cygwin-patches wrote:
> > >
> > > > On Tue, 1 Sep 2020 18:19:16 +0200 (CEST)
> > > > Johannes Schindelin wrote:
> > > >
> > > > > When `LANG=en_US.UTF-8`, the detected `LCID` is 0x0409, which is
> > > > > correct, but after that (at least if Pseudo Console support is enabled),
> > > > > we try to find the default code page for that `LCID`, which is ASCII
> > > > > (437). Subsequently, we set the Console output code page to that value,
> > > > > completely ignoring that we wanted to use UTF-8.
> > > > >
> > > > > Let's not ignore the specifically asked-for UTF-8 character set.
> > > > >
> > > > > While at it, let's also set the Console output code page even if Pseudo
> > > > > Console support is disabled; contrary to the behavior of v3.0.7, the
> > > > > Console output code page is not ignored in that case.
> > > > >
> > > > > The most common symptom would be that console applications which do not
> > > > > specifically call `SetConsoleOutputCP()` but output UTF-8-encoded text
> > > > > seem to be broken with v3.1.x when they worked plenty fine with v3.0.x.
> > > > >
> > > > > This fixes https://github.com/msys2/MSYS2-packages/issues/1974,
> > > > > https://github.com/msys2/MSYS2-packages/issues/2012,
> > > > > https://github.com/rust-lang/cargo/issues/8369,
> > > > > https://github.com/git-for-windows/git/issues/2734,
> > > > > https://github.com/git-for-windows/git/issues/2793,
> > > > > https://github.com/git-for-windows/git/issues/2792, and possibly quite a
> > > > > few others.
> > > > >
> > > > > Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> > > > > ---
> > > > >  winsup/cygwin/fhandler_tty.cc | 9 +++++++++
> > > > >  1 file changed, 9 insertions(+)
> > > > >
> > > > > diff --git a/winsup/cygwin/fhandler_tty.cc b/winsup/cygwin/fhandler_tty.cc
> > > > > index 06789a500..414c26992 100644
> > > > > --- a/winsup/cygwin/fhandler_tty.cc
> > > > > +++ b/winsup/cygwin/fhandler_tty.cc
> > > > > @@ -2859,6 +2859,15 @@ fhandler_pty_slave::setup_locale (void)
> > > > >    char charset[ENCODING_LEN + 1] = "ASCII";
> > > > >    LCID lcid = get_langinfo (locale, charset);
> > > > >
> > > > > +  /* Special-case the UTF-8 character set */
> > > > > +  if (strcasecmp (charset, "UTF-8") == 0)
> > > > > +    {
> > > > > +      get_ttyp ()->term_code_page = CP_UTF8;
> > > > > +      SetConsoleCP (CP_UTF8);
> > > > > +      SetConsoleOutputCP (CP_UTF8);
> > > > > +      return;
> > > > > +    }
> > > > > +
> > > > >    /* Set console code page from locale */
> > > > >    if (get_pseudo_console ())
> > > > >      {
> > > > > --
> > > > > 2.27.0
> > > >
> > > > I would like to propose a counter patch attached.
> > > > What do you think of this patch?
> > > >
> > > > This patch does not treat UTF-8 as special.
> > >
> > > Sure, but it only fixes the issue in `disable_pcon` mode in the current
> > > tip commit. That's because a new Pseudo Console is created for every
> > > spawned non-Cygwin console application, and that new Pseudo Console does
> > > _not_ have the code page set by your patch.
> >
> > You are right. However, if pseudo console is enabled, the app
> > which works correclty in command prompt should work as well in
> > pseudo console. Therefore, there is nothing to be fixed.
> 
> I am coming to the conclusion that your definition what is correct differs
> from my definition of what is correct.
> 
> For me, it matters what users see. And what users actually see is the
> output of UTF-8 encoded text that is now interpreted via the default code
> page of their LCID, i.e. it is incorrect.
> 
> Sure, you can argue all you want that those console applications are _all
> wrong_. _All of them_.
> 
> In practice, that matters very little, as many users have
> `LANG=en_US.UTF-8` (meaning your patches force their console applications'
> output to be interpreted with code page 437) and therefore for those
> users, things looked fine before, and now they don't.
> 
> Note that I am not talking about developers who develop said console
> applications. I am talking about users who use those console applications.
> In other words, I am talking about a vastly larger group of affected
> people.
> 
> All of those people (or at least a substantial majority) will now have to
> be told to please disable Pseudo Console support in v3.2.0 because they
> would have to patch and rebuild those console applications that don't call
> `SetConsoleOutputCP()`, and that is certainly unreasonable to expect of
> the majority of users. Not even the `cmd /c chcp 65001` work-around (that
> helps with v3.1.7) will work with v3.2.0 when Pseudo Console support is
> enabled.

In the case where pseudo console is disabled, the patch I proposed in
https://cygwin.com/pipermail/cygwin-patches/2020q3/010548.html
will solve the issue. I mean apps which work correctly in cygwin 3.0.7
work in cygwin 3.2.0 as well with that patch.

OTOH, in the case where pseudo console is enabled, non-cygwin apps which
work correctly in command prompt, work in cygwin 3.2.0 as well.

It is enough for all, isn't it?

You may think that all non-cygwin apps which work in cygwin 3.0.7 must
work in cygwin 3.2.0 even when pseudo console is enabled, but it is
against the purpose of the pseudo console support. The goal of pseudo
console support is to make non-cygwin apps work as if it is executed in
command prompt.

By the way, you complained regarding garbled output of the program which
outputs UTF-8 string regardless of locale.
https://cygwin.com/pipermail/cygwin-developers/2020-August/011951.html
However, many Japanese programmers, for example, make programs which
output SJIS (CP932) string regardless of locale.

Why do you think the former must work while the latter deos not have to?
Is there any reasonable reason other than backward compatibility?

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-07  4:45                                   ` Takashi Yano
  2020-09-07  9:08                                     ` Corinna Vinschen
@ 2020-09-08  8:40                                     ` Corinna Vinschen
  2020-09-08  9:45                                       ` Takashi Yano
  1 sibling, 1 reply; 60+ messages in thread
From: Corinna Vinschen @ 2020-09-08  8:40 UTC (permalink / raw)
  To: cygwin-patches

On Sep  7 13:45, Takashi Yano via Cygwin-patches wrote:
> On Mon, 7 Sep 2020 01:04:13 +0900
> > > Chages:
> > > - If global locale is set, it takes precedence.
> > 
> > Changes:
> > - Use __get_current_locale() instead of __get_global_locale().
> > - Fix a bug for ISO-8859-* charset.
> 
> Changes:
> - Use envblock if it is passed to CreateProcess in spawn.cc.

For the time being and to make at least *some* progress and with my
upcoming "away from keyboard"-time , I pushed the gist of my patch,
replacing the locale evaluating code in fhandler_tty with the function
__eval_codepage_from_internal_charset in its most simple form.
I didn't touch anything else, given that this discussion is still
ongoing.


Corinna

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-08  8:40                                     ` Corinna Vinschen
@ 2020-09-08  9:45                                       ` Takashi Yano
  2020-09-08 19:16                                         ` Corinna Vinschen
  2020-09-10 13:08                                         ` Takashi Yano
  0 siblings, 2 replies; 60+ messages in thread
From: Takashi Yano @ 2020-09-08  9:45 UTC (permalink / raw)
  To: cygwin-patches

Hi Corinna,

On Tue, 8 Sep 2020 10:40:34 +0200
Corinna Vinschen wrote:
> On Sep  7 13:45, Takashi Yano via Cygwin-patches wrote:
> > On Mon, 7 Sep 2020 01:04:13 +0900
> > > > Chages:
> > > > - If global locale is set, it takes precedence.
> > > 
> > > Changes:
> > > - Use __get_current_locale() instead of __get_global_locale().
> > > - Fix a bug for ISO-8859-* charset.
> > 
> > Changes:
> > - Use envblock if it is passed to CreateProcess in spawn.cc.
> 
> For the time being and to make at least *some* progress and with my
> upcoming "away from keyboard"-time , I pushed the gist of my patch,
> replacing the locale evaluating code in fhandler_tty with the function
> __eval_codepage_from_internal_charset in its most simple form.
> I didn't touch anything else, given that this discussion is still
> ongoing.

Your patch pushed does the magic!

Even cygterm works even though the code does not check environment.

The point is here.

@@ -1977,9 +1807,6 @@ fhandler_pty_slave::fixup_after_exec ()
   if (!close_on_exec ())
     fixup_after_fork (NULL);   /* No parent handle required. */

-  /* Set locale */
-  setup_locale ();
-
   /* Hook Console API */
 #define DO_HOOK(module, name) \
   if (!name##_Orig) \

Without this deletion, term_code_page is determined when
cygwin shell is executed. Since it is in fixup_after_exec(),
setlocale() does not called yet in the shell. As a result,
term_code_page cannot be determined correctly.

In your new patch, term_code_page is determined when the first
non-cygwin program is execed in the shell. The shell process
already calls setlocale(), so term_code_page can be determined
using global locale.

Thanks for the excellent idea!

Only the problem I noticed is that cygterm does not work if the
shell does not call setlocale(). This happens if the shell is
non-cygwin program, for example, cmd.exe, however, it is unusual
case.

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-08  9:45                                       ` Takashi Yano
@ 2020-09-08 19:16                                         ` Corinna Vinschen
  2020-09-10 13:08                                         ` Takashi Yano
  1 sibling, 0 replies; 60+ messages in thread
From: Corinna Vinschen @ 2020-09-08 19:16 UTC (permalink / raw)
  To: cygwin-patches

On Sep  8 18:45, Takashi Yano via Cygwin-patches wrote:
> Hi Corinna,
> 
> On Tue, 8 Sep 2020 10:40:34 +0200
> Corinna Vinschen wrote:
> > On Sep  7 13:45, Takashi Yano via Cygwin-patches wrote:
> > > On Mon, 7 Sep 2020 01:04:13 +0900
> > > > > Chages:
> > > > > - If global locale is set, it takes precedence.
> > > > 
> > > > Changes:
> > > > - Use __get_current_locale() instead of __get_global_locale().
> > > > - Fix a bug for ISO-8859-* charset.
> > > 
> > > Changes:
> > > - Use envblock if it is passed to CreateProcess in spawn.cc.
> > 
> > For the time being and to make at least *some* progress and with my
> > upcoming "away from keyboard"-time , I pushed the gist of my patch,
> > replacing the locale evaluating code in fhandler_tty with the function
> > __eval_codepage_from_internal_charset in its most simple form.
> > I didn't touch anything else, given that this discussion is still
> > ongoing.
> 
> Your patch pushed does the magic!
> 
> Even cygterm works even though the code does not check environment.
> 
> The point is here.
> 
> @@ -1977,9 +1807,6 @@ fhandler_pty_slave::fixup_after_exec ()
>    if (!close_on_exec ())
>      fixup_after_fork (NULL);   /* No parent handle required. */
> 
> -  /* Set locale */
> -  setup_locale ();
> -
>    /* Hook Console API */
>  #define DO_HOOK(module, name) \
>    if (!name##_Orig) \
> 
> Without this deletion, term_code_page is determined when
> cygwin shell is executed. Since it is in fixup_after_exec(),
> setlocale() does not called yet in the shell. As a result,
> term_code_page cannot be determined correctly.
> 
> In your new patch, term_code_page is determined when the first
> non-cygwin program is execed in the shell. The shell process
> already calls setlocale(), so term_code_page can be determined
> using global locale.
> 
> Thanks for the excellent idea!
> 
> Only the problem I noticed is that cygterm does not work if the
> shell does not call setlocale(). This happens if the shell is
> non-cygwin program, for example, cmd.exe, however, it is unusual
> case.

This is unexpected, but I'm glad this could be much simplfied.


Thanks,
Corinna

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-08  8:16         ` Takashi Yano
@ 2020-09-09  7:21           ` Corinna Vinschen
  2020-09-10  0:15             ` Takashi Yano
  0 siblings, 1 reply; 60+ messages in thread
From: Corinna Vinschen @ 2020-09-09  7:21 UTC (permalink / raw)
  To: cygwin-patches

On Sep  8 17:16, Takashi Yano via Cygwin-patches wrote:
> On Mon, 7 Sep 2020 23:17:36 +0200 (CEST)
> Johannes Schindelin wrote:
> > Hi Takashi,
> > 
> > On Sat, 5 Sep 2020, Takashi Yano wrote:
> > 
> > > On Fri, 4 Sep 2020 08:23:42 +0200 (CEST)
> > > Johannes Schindelin wrote:
> > > >
> > > > On Fri, 4 Sep 2020, Takashi Yano via Cygwin-patches wrote:
> > > >
> > > > > On Tue, 1 Sep 2020 18:19:16 +0200 (CEST)
> > > > > Johannes Schindelin wrote:
> > > > > [...]
> > `LANG=en_US.UTF-8` (meaning your patches force their console applications'
> > output to be interpreted with code page 437) and therefore for those
> > users, things looked fine before, and now they don't.
> > 
> > Note that I am not talking about developers who develop said console
> > applications. I am talking about users who use those console applications.
> > In other words, I am talking about a vastly larger group of affected
> > people.
> > 
> > All of those people (or at least a substantial majority) will now have to
> > be told to please disable Pseudo Console support in v3.2.0 because they
> > would have to patch and rebuild those console applications that don't call
> > `SetConsoleOutputCP()`, and that is certainly unreasonable to expect of
> > the majority of users. Not even the `cmd /c chcp 65001` work-around (that
> > helps with v3.1.7) will work with v3.2.0 when Pseudo Console support is
> > enabled.
> 
> In the case where pseudo console is disabled, the patch I proposed in
> https://cygwin.com/pipermail/cygwin-patches/2020q3/010548.html
> will solve the issue. I mean apps which work correctly in cygwin 3.0.7
> work in cygwin 3.2.0 as well with that patch.
> 
> OTOH, in the case where pseudo console is enabled, non-cygwin apps which
> work correctly in command prompt, work in cygwin 3.2.0 as well.
> 
> It is enough for all, isn't it?
> 
> You may think that all non-cygwin apps which work in cygwin 3.0.7 must
> work in cygwin 3.2.0 even when pseudo console is enabled, but it is
> against the purpose of the pseudo console support. The goal of pseudo
> console support is to make non-cygwin apps work as if it is executed in
> command prompt.
> 
> By the way, you complained regarding garbled output of the program which
> outputs UTF-8 string regardless of locale.
> https://cygwin.com/pipermail/cygwin-developers/2020-August/011951.html
> However, many Japanese programmers, for example, make programs which
> output SJIS (CP932) string regardless of locale.
> 
> Why do you think the former must work while the latter deos not have to?
> Is there any reasonable reason other than backward compatibility?

Is that still a concern with the latest from master?  There's
a snapshot for testing, Johannes.

Takashi, does the patch from
https://cygwin.com/pipermail/cygwin-developers/2020-August/011951.html
still apply to the latest from master?  Question is, shouldn't the
Windows calls setting the codepage be only called if started from
child_info_spawn::worker for non-Cygwin executables?


Corinna

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-09  7:21           ` Corinna Vinschen
@ 2020-09-10  0:15             ` Takashi Yano
  2020-09-10 12:34               ` Takashi Yano
  2020-09-10 14:04               ` Corinna Vinschen
  0 siblings, 2 replies; 60+ messages in thread
From: Takashi Yano @ 2020-09-10  0:15 UTC (permalink / raw)
  To: cygwin-patches

Hi Corinna,

On Wed, 9 Sep 2020 09:21:23 +0200
Corinna Vinschen wrote:
> On Sep  8 17:16, Takashi Yano via Cygwin-patches wrote:
> > On Mon, 7 Sep 2020 23:17:36 +0200 (CEST)
> > Johannes Schindelin wrote:
> > > Hi Takashi,
> > > 
> > > On Sat, 5 Sep 2020, Takashi Yano wrote:
> > > 
> > > > On Fri, 4 Sep 2020 08:23:42 +0200 (CEST)
> > > > Johannes Schindelin wrote:
> > > > >
> > > > > On Fri, 4 Sep 2020, Takashi Yano via Cygwin-patches wrote:
> > > > >
> > > > > > On Tue, 1 Sep 2020 18:19:16 +0200 (CEST)
> > > > > > Johannes Schindelin wrote:
> > > > > > [...]
> > > `LANG=en_US.UTF-8` (meaning your patches force their console applications'
> > > output to be interpreted with code page 437) and therefore for those
> > > users, things looked fine before, and now they don't.
> > > 
> > > Note that I am not talking about developers who develop said console
> > > applications. I am talking about users who use those console applications.
> > > In other words, I am talking about a vastly larger group of affected
> > > people.
> > > 
> > > All of those people (or at least a substantial majority) will now have to
> > > be told to please disable Pseudo Console support in v3.2.0 because they
> > > would have to patch and rebuild those console applications that don't call
> > > `SetConsoleOutputCP()`, and that is certainly unreasonable to expect of
> > > the majority of users. Not even the `cmd /c chcp 65001` work-around (that
> > > helps with v3.1.7) will work with v3.2.0 when Pseudo Console support is
> > > enabled.
> > 
> > In the case where pseudo console is disabled, the patch I proposed in
> > https://cygwin.com/pipermail/cygwin-patches/2020q3/010548.html
> > will solve the issue. I mean apps which work correctly in cygwin 3.0.7
> > work in cygwin 3.2.0 as well with that patch.
> > 
> > OTOH, in the case where pseudo console is enabled, non-cygwin apps which
> > work correctly in command prompt, work in cygwin 3.2.0 as well.
> > 
> > It is enough for all, isn't it?
> > 
> > You may think that all non-cygwin apps which work in cygwin 3.0.7 must
> > work in cygwin 3.2.0 even when pseudo console is enabled, but it is
> > against the purpose of the pseudo console support. The goal of pseudo
> > console support is to make non-cygwin apps work as if it is executed in
> > command prompt.
> > 
> > By the way, you complained regarding garbled output of the program which
> > outputs UTF-8 string regardless of locale.
> > https://cygwin.com/pipermail/cygwin-developers/2020-August/011951.html
> > However, many Japanese programmers, for example, make programs which
> > output SJIS (CP932) string regardless of locale.
> > 
> > Why do you think the former must work while the latter deos not have to?
> > Is there any reasonable reason other than backward compatibility?
> 
> Is that still a concern with the latest from master?  There's
> a snapshot for testing, Johannes.

We are still discussing about that.
https://github.com/msys2/MSYS2-packages/issues/1974

> Takashi, does the patch from
> https://cygwin.com/pipermail/cygwin-developers/2020-August/011951.html
> still apply to the latest from master?  Question is, shouldn't the
> Windows calls setting the codepage be only called if started from
> child_info_spawn::worker for non-Cygwin executables?

I'd propose the patch:

diff --git a/winsup/cygwin/fhandler_tty.cc b/winsup/cygwin/fhandler_tty.cc
index 37d033bbe..95b28c3da 100644
--- a/winsup/cygwin/fhandler_tty.cc
+++ b/winsup/cygwin/fhandler_tty.cc
@@ -1830,7 +1830,11 @@ fhandler_pty_slave::setup_locale (void)
   extern UINT __eval_codepage_from_internal_charset ();

   if (!get_ttyp ()->term_code_page)
-    get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset ();
+    {
+      get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset ();
+      SetConsoleCP (get_ttyp ()->term_code_page);
+      SetConsoleOutputCP (get_ttyp ()->term_code_page);
+    }
 }

 void

However, Johannes insists setting codepage for non-cygwin apps
even when pseudo console is enabled, which I cannot agree.

Actually, I hesitate even the patch above, however, it seems to
be necessary for msys apps in terms of backward compatibility.

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-10  0:15             ` Takashi Yano
@ 2020-09-10 12:34               ` Takashi Yano
  2020-09-11  9:05                 ` Corinna Vinschen
  2020-09-10 14:04               ` Corinna Vinschen
  1 sibling, 1 reply; 60+ messages in thread
From: Takashi Yano @ 2020-09-10 12:34 UTC (permalink / raw)
  To: cygwin-patches

On Thu, 10 Sep 2020 09:15:00 +0900
Takashi Yano via Cygwin-patches <cygwin-patches@cygwin.com> wrote:
> Hi Corinna,
> 
> On Wed, 9 Sep 2020 09:21:23 +0200
> Corinna Vinschen wrote:
> > On Sep  8 17:16, Takashi Yano via Cygwin-patches wrote:
> > > On Mon, 7 Sep 2020 23:17:36 +0200 (CEST)
> > > Johannes Schindelin wrote:
> > > > Hi Takashi,
> > > > 
> > > > On Sat, 5 Sep 2020, Takashi Yano wrote:
> > > > 
> > > > > On Fri, 4 Sep 2020 08:23:42 +0200 (CEST)
> > > > > Johannes Schindelin wrote:
> > > > > >
> > > > > > On Fri, 4 Sep 2020, Takashi Yano via Cygwin-patches wrote:
> > > > > >
> > > > > > > On Tue, 1 Sep 2020 18:19:16 +0200 (CEST)
> > > > > > > Johannes Schindelin wrote:
> > > > > > > [...]
> > > > `LANG=en_US.UTF-8` (meaning your patches force their console applications'
> > > > output to be interpreted with code page 437) and therefore for those
> > > > users, things looked fine before, and now they don't.
> > > > 
> > > > Note that I am not talking about developers who develop said console
> > > > applications. I am talking about users who use those console applications.
> > > > In other words, I am talking about a vastly larger group of affected
> > > > people.
> > > > 
> > > > All of those people (or at least a substantial majority) will now have to
> > > > be told to please disable Pseudo Console support in v3.2.0 because they
> > > > would have to patch and rebuild those console applications that don't call
> > > > `SetConsoleOutputCP()`, and that is certainly unreasonable to expect of
> > > > the majority of users. Not even the `cmd /c chcp 65001` work-around (that
> > > > helps with v3.1.7) will work with v3.2.0 when Pseudo Console support is
> > > > enabled.
> > > 
> > > In the case where pseudo console is disabled, the patch I proposed in
> > > https://cygwin.com/pipermail/cygwin-patches/2020q3/010548.html
> > > will solve the issue. I mean apps which work correctly in cygwin 3.0.7
> > > work in cygwin 3.2.0 as well with that patch.
> > > 
> > > OTOH, in the case where pseudo console is enabled, non-cygwin apps which
> > > work correctly in command prompt, work in cygwin 3.2.0 as well.
> > > 
> > > It is enough for all, isn't it?
> > > 
> > > You may think that all non-cygwin apps which work in cygwin 3.0.7 must
> > > work in cygwin 3.2.0 even when pseudo console is enabled, but it is
> > > against the purpose of the pseudo console support. The goal of pseudo
> > > console support is to make non-cygwin apps work as if it is executed in
> > > command prompt.
> > > 
> > > By the way, you complained regarding garbled output of the program which
> > > outputs UTF-8 string regardless of locale.
> > > https://cygwin.com/pipermail/cygwin-developers/2020-August/011951.html
> > > However, many Japanese programmers, for example, make programs which
> > > output SJIS (CP932) string regardless of locale.
> > > 
> > > Why do you think the former must work while the latter deos not have to?
> > > Is there any reasonable reason other than backward compatibility?
> > 
> > Is that still a concern with the latest from master?  There's
> > a snapshot for testing, Johannes.
> 
> We are still discussing about that.
> https://github.com/msys2/MSYS2-packages/issues/1974
> 
> > Takashi, does the patch from
> > https://cygwin.com/pipermail/cygwin-developers/2020-August/011951.html
> > still apply to the latest from master?  Question is, shouldn't the
> > Windows calls setting the codepage be only called if started from
> > child_info_spawn::worker for non-Cygwin executables?
> 
> I'd propose the patch:
> 
> diff --git a/winsup/cygwin/fhandler_tty.cc b/winsup/cygwin/fhandler_tty.cc
> index 37d033bbe..95b28c3da 100644
> --- a/winsup/cygwin/fhandler_tty.cc
> +++ b/winsup/cygwin/fhandler_tty.cc
> @@ -1830,7 +1830,11 @@ fhandler_pty_slave::setup_locale (void)
>    extern UINT __eval_codepage_from_internal_charset ();
> 
>    if (!get_ttyp ()->term_code_page)
> -    get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset ();
> +    {
> +      get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset ();
> +      SetConsoleCP (get_ttyp ()->term_code_page);
> +      SetConsoleOutputCP (get_ttyp ()->term_code_page);
> +    }
>  }
> 
>  void
> 
> However, Johannes insists setting codepage for non-cygwin apps
> even when pseudo console is enabled, which I cannot agree.
> 
> Actually, I hesitate even the patch above, however, it seems to
> be necessary for msys apps in terms of backward compatibility.

I found that output of Oracle java.exe and javac.exe are garbled
if the patch above is applied. This is because java.exe and javac.exe
output SJIS code unconditionally in my environment.

OTOH, rust-based program such as cargo and ripgrep output UTF-8
unconditionally. node.js also seems to output UTF-8 string by
default.

I think there is no way for both apps to work properly if pseudo
console is disabled. As far as I tested, both of them works when
pseudo console is enabled. IMHO, the best way to achieve maximum
compatibility, is enabling pseudo console, which is disabled in
MSYS2 by default.

As for the case with pseudo console disabled:

If backward compatibility is important, we should apply the patch
above. If compatibility with the behaviour in command prompt is
important, we should leave the codepage to the system default.

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-08  9:45                                       ` Takashi Yano
  2020-09-08 19:16                                         ` Corinna Vinschen
@ 2020-09-10 13:08                                         ` Takashi Yano
  1 sibling, 0 replies; 60+ messages in thread
From: Takashi Yano @ 2020-09-10 13:08 UTC (permalink / raw)
  To: cygwin-patches

On Tue, 8 Sep 2020 18:45:36 +0900
Takashi Yano via Cygwin-patches <cygwin-patches@cygwin.com> wrote:

> Hi Corinna,
> 
> On Tue, 8 Sep 2020 10:40:34 +0200
> Corinna Vinschen wrote:
> > On Sep  7 13:45, Takashi Yano via Cygwin-patches wrote:
> > > On Mon, 7 Sep 2020 01:04:13 +0900
> > > > > Chages:
> > > > > - If global locale is set, it takes precedence.
> > > > 
> > > > Changes:
> > > > - Use __get_current_locale() instead of __get_global_locale().
> > > > - Fix a bug for ISO-8859-* charset.
> > > 
> > > Changes:
> > > - Use envblock if it is passed to CreateProcess in spawn.cc.
> > 
> > For the time being and to make at least *some* progress and with my
> > upcoming "away from keyboard"-time , I pushed the gist of my patch,
> > replacing the locale evaluating code in fhandler_tty with the function
> > __eval_codepage_from_internal_charset in its most simple form.
> > I didn't touch anything else, given that this discussion is still
> > ongoing.
> 
> Your patch pushed does the magic!
> 
> Even cygterm works even though the code does not check environment.
> 
> The point is here.
> 
> @@ -1977,9 +1807,6 @@ fhandler_pty_slave::fixup_after_exec ()
>    if (!close_on_exec ())
>      fixup_after_fork (NULL);   /* No parent handle required. */
> 
> -  /* Set locale */
> -  setup_locale ();
> -
>    /* Hook Console API */
>  #define DO_HOOK(module, name) \
>    if (!name##_Orig) \
> 
> Without this deletion, term_code_page is determined when
> cygwin shell is executed. Since it is in fixup_after_exec(),
> setlocale() does not called yet in the shell. As a result,
> term_code_page cannot be determined correctly.
> 
> In your new patch, term_code_page is determined when the first
> non-cygwin program is execed in the shell. The shell process
> already calls setlocale(), so term_code_page can be determined
> using global locale.
> 
> Thanks for the excellent idea!
> 
> Only the problem I noticed is that cygterm does not work if the
> shell does not call setlocale(). This happens if the shell is
> non-cygwin program, for example, cmd.exe, however, it is unusual
> case.

I noticed this also happens when shell is /bin/ash (/bin/dash).
However, it's still acceptable for me.

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-10  0:15             ` Takashi Yano
  2020-09-10 12:34               ` Takashi Yano
@ 2020-09-10 14:04               ` Corinna Vinschen
  2020-09-10 14:16                 ` Takashi Yano
  1 sibling, 1 reply; 60+ messages in thread
From: Corinna Vinschen @ 2020-09-10 14:04 UTC (permalink / raw)
  To: cygwin-patches

Hi Takashi,

On Sep 10 09:15, Takashi Yano via Cygwin-patches wrote:
> On Wed, 9 Sep 2020 09:21:23 +0200
> Corinna Vinschen wrote:
> > Takashi, does the patch from
> > https://cygwin.com/pipermail/cygwin-developers/2020-August/011951.html
> > still apply to the latest from master?  Question is, shouldn't the
> > Windows calls setting the codepage be only called if started from
> > child_info_spawn::worker for non-Cygwin executables?
> 
> I'd propose the patch:
> 
> diff --git a/winsup/cygwin/fhandler_tty.cc b/winsup/cygwin/fhandler_tty.cc
> index 37d033bbe..95b28c3da 100644
> --- a/winsup/cygwin/fhandler_tty.cc
> +++ b/winsup/cygwin/fhandler_tty.cc
> @@ -1830,7 +1830,11 @@ fhandler_pty_slave::setup_locale (void)
>    extern UINT __eval_codepage_from_internal_charset ();
> 
>    if (!get_ttyp ()->term_code_page)
> -    get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset ();
> +    {
> +      get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset ();
> +      SetConsoleCP (get_ttyp ()->term_code_page);
> +      SetConsoleOutputCP (get_ttyp ()->term_code_page);
> +    }
>  }
> 
>  void
> 
> However, Johannes insists setting codepage for non-cygwin apps
> even when pseudo console is enabled, which I cannot agree.
> 
> Actually, I hesitate even the patch above, however, it seems to
> be necessary for msys apps in terms of backward compatibility.

If we do as above, doesn't that mean the invocation of convert_mb_str in
fhandler_pty_master::accept_input, as well as the second invocation of
convert_mb_str in fhandler_pty_master::pty_master_fwd_thread are
redundant?  Both are only called if get_ttyp ()->term_code_page differs
from the input or output console codepage.  Given the above setting of
the console CP to term_code_page, this would never be the case, right?


Corinna

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-10 14:04               ` Corinna Vinschen
@ 2020-09-10 14:16                 ` Takashi Yano
  2020-09-10 14:18                   ` Takashi Yano
  0 siblings, 1 reply; 60+ messages in thread
From: Takashi Yano @ 2020-09-10 14:16 UTC (permalink / raw)
  To: cygwin-patches

On Thu, 10 Sep 2020 16:04:07 +0200
Corinna Vinschen wrote:

> Hi Takashi,
> 
> On Sep 10 09:15, Takashi Yano via Cygwin-patches wrote:
> > On Wed, 9 Sep 2020 09:21:23 +0200
> > Corinna Vinschen wrote:
> > > Takashi, does the patch from
> > > https://cygwin.com/pipermail/cygwin-developers/2020-August/011951.html
> > > still apply to the latest from master?  Question is, shouldn't the
> > > Windows calls setting the codepage be only called if started from
> > > child_info_spawn::worker for non-Cygwin executables?
> > 
> > I'd propose the patch:
> > 
> > diff --git a/winsup/cygwin/fhandler_tty.cc b/winsup/cygwin/fhandler_tty.cc
> > index 37d033bbe..95b28c3da 100644
> > --- a/winsup/cygwin/fhandler_tty.cc
> > +++ b/winsup/cygwin/fhandler_tty.cc
> > @@ -1830,7 +1830,11 @@ fhandler_pty_slave::setup_locale (void)
> >    extern UINT __eval_codepage_from_internal_charset ();
> > 
> >    if (!get_ttyp ()->term_code_page)
> > -    get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset ();
> > +    {
> > +      get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset ();
> > +      SetConsoleCP (get_ttyp ()->term_code_page);
> > +      SetConsoleOutputCP (get_ttyp ()->term_code_page);
> > +    }
> >  }
> > 
> >  void
> > 
> > However, Johannes insists setting codepage for non-cygwin apps
> > even when pseudo console is enabled, which I cannot agree.
> > 
> > Actually, I hesitate even the patch above, however, it seems to
> > be necessary for msys apps in terms of backward compatibility.
> 
> If we do as above, doesn't that mean the invocation of convert_mb_str in
> fhandler_pty_master::accept_input, as well as the second invocation of
> convert_mb_str in fhandler_pty_master::pty_master_fwd_thread are
> redundant?  Both are only called if get_ttyp ()->term_code_page differs
> from the input or output console codepage.  Given the above setting of
> the console CP to term_code_page, this would never be the case, right?

You are right by default. However, if user change the code page using
chcp.com, conversion does work.

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-10 14:16                 ` Takashi Yano
@ 2020-09-10 14:18                   ` Takashi Yano
  0 siblings, 0 replies; 60+ messages in thread
From: Takashi Yano @ 2020-09-10 14:18 UTC (permalink / raw)
  To: cygwin-patches

On Thu, 10 Sep 2020 23:16:10 +0900
Takashi Yano  wrote:

> On Thu, 10 Sep 2020 16:04:07 +0200
> Corinna Vinschen wrote:
> 
> > Hi Takashi,
> > 
> > On Sep 10 09:15, Takashi Yano via Cygwin-patches wrote:
> > > On Wed, 9 Sep 2020 09:21:23 +0200
> > > Corinna Vinschen wrote:
> > > > Takashi, does the patch from
> > > > https://cygwin.com/pipermail/cygwin-developers/2020-August/011951.html
> > > > still apply to the latest from master?  Question is, shouldn't the
> > > > Windows calls setting the codepage be only called if started from
> > > > child_info_spawn::worker for non-Cygwin executables?
> > > 
> > > I'd propose the patch:
> > > 
> > > diff --git a/winsup/cygwin/fhandler_tty.cc b/winsup/cygwin/fhandler_tty.cc
> > > index 37d033bbe..95b28c3da 100644
> > > --- a/winsup/cygwin/fhandler_tty.cc
> > > +++ b/winsup/cygwin/fhandler_tty.cc
> > > @@ -1830,7 +1830,11 @@ fhandler_pty_slave::setup_locale (void)
> > >    extern UINT __eval_codepage_from_internal_charset ();
> > > 
> > >    if (!get_ttyp ()->term_code_page)
> > > -    get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset ();
> > > +    {
> > > +      get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset ();
> > > +      SetConsoleCP (get_ttyp ()->term_code_page);
> > > +      SetConsoleOutputCP (get_ttyp ()->term_code_page);
> > > +    }
> > >  }
> > > 
> > >  void
> > > 
> > > However, Johannes insists setting codepage for non-cygwin apps
> > > even when pseudo console is enabled, which I cannot agree.
> > > 
> > > Actually, I hesitate even the patch above, however, it seems to
> > > be necessary for msys apps in terms of backward compatibility.
> > 
> > If we do as above, doesn't that mean the invocation of convert_mb_str in
> > fhandler_pty_master::accept_input, as well as the second invocation of
> > convert_mb_str in fhandler_pty_master::pty_master_fwd_thread are
> > redundant?  Both are only called if get_ttyp ()->term_code_page differs
> > from the input or output console codepage.  Given the above setting of
> > the console CP to term_code_page, this would never be the case, right?
> 
> You are right by default. However, if user change the code page using
> chcp.com, conversion does work.

Not only user, but also app may change the codepage by SetConsoleCP()
and SetConsoleOutputCP().

-- 
Takashi Yano <takashi.yano@nifty.ne.jp>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-10 12:34               ` Takashi Yano
@ 2020-09-11  9:05                 ` Corinna Vinschen
  2020-09-11  9:23                   ` Corinna Vinschen
  0 siblings, 1 reply; 60+ messages in thread
From: Corinna Vinschen @ 2020-09-11  9:05 UTC (permalink / raw)
  To: cygwin-patches

On Sep 10 21:34, Takashi Yano via Cygwin-patches wrote:
> On Thu, 10 Sep 2020 09:15:00 +0900
> Takashi Yano via Cygwin-patches <cygwin-patches@cygwin.com> wrote:
> > I'd propose the patch:
> > 
> > diff --git a/winsup/cygwin/fhandler_tty.cc b/winsup/cygwin/fhandler_tty.cc
> > index 37d033bbe..95b28c3da 100644
> > --- a/winsup/cygwin/fhandler_tty.cc
> > +++ b/winsup/cygwin/fhandler_tty.cc
> > @@ -1830,7 +1830,11 @@ fhandler_pty_slave::setup_locale (void)
> >    extern UINT __eval_codepage_from_internal_charset ();
> > 
> >    if (!get_ttyp ()->term_code_page)
> > -    get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset ();
> > +    {
> > +      get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset ();
> > +      SetConsoleCP (get_ttyp ()->term_code_page);
> > +      SetConsoleOutputCP (get_ttyp ()->term_code_page);
> > +    }
> >  }
> > 
> >  void
> > 
> > However, Johannes insists setting codepage for non-cygwin apps
> > even when pseudo console is enabled, which I cannot agree.
> > 
> > Actually, I hesitate even the patch above, however, it seems to
> > be necessary for msys apps in terms of backward compatibility.
> 
> I found that output of Oracle java.exe and javac.exe are garbled
> if the patch above is applied. This is because java.exe and javac.exe
> output SJIS code unconditionally in my environment.
> 
> OTOH, rust-based program such as cargo and ripgrep output UTF-8
> unconditionally. node.js also seems to output UTF-8 string by
> default.
> 
> I think there is no way for both apps to work properly if pseudo
> console is disabled. As far as I tested, both of them works when
> pseudo console is enabled. IMHO, the best way to achieve maximum
> compatibility, is enabling pseudo console, which is disabled in
> MSYS2 by default.
> 
> As for the case with pseudo console disabled:
> 
> If backward compatibility is important, we should apply the patch
> above. If compatibility with the behaviour in command prompt is
> important, we should leave the codepage to the system default.

Pseudo console is probbaly the way to go in future anyway.  For
older OSes and older apps, we might better opt for backward compat.
I'll apply your patch for the time being.


Thanks,
Corinna

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8"
  2020-09-11  9:05                 ` Corinna Vinschen
@ 2020-09-11  9:23                   ` Corinna Vinschen
  0 siblings, 0 replies; 60+ messages in thread
From: Corinna Vinschen @ 2020-09-11  9:23 UTC (permalink / raw)
  To: cygwin-patches

On Sep 11 11:05, Corinna Vinschen wrote:
> On Sep 10 21:34, Takashi Yano via Cygwin-patches wrote:
> > On Thu, 10 Sep 2020 09:15:00 +0900
> > Takashi Yano via Cygwin-patches <cygwin-patches@cygwin.com> wrote:
> > > I'd propose the patch:
> > > 
> > > diff --git a/winsup/cygwin/fhandler_tty.cc b/winsup/cygwin/fhandler_tty.cc
> > > index 37d033bbe..95b28c3da 100644
> > > --- a/winsup/cygwin/fhandler_tty.cc
> > > +++ b/winsup/cygwin/fhandler_tty.cc
> > > @@ -1830,7 +1830,11 @@ fhandler_pty_slave::setup_locale (void)
> > >    extern UINT __eval_codepage_from_internal_charset ();
> > > 
> > >    if (!get_ttyp ()->term_code_page)
> > > -    get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset ();
> > > +    {
> > > +      get_ttyp ()->term_code_page = __eval_codepage_from_internal_charset ();
> > > +      SetConsoleCP (get_ttyp ()->term_code_page);
> > > +      SetConsoleOutputCP (get_ttyp ()->term_code_page);
> > > +    }
> > >  }
> > > 
> > >  void
> > > 
> > > However, Johannes insists setting codepage for non-cygwin apps
> > > even when pseudo console is enabled, which I cannot agree.
> > > 
> > > Actually, I hesitate even the patch above, however, it seems to
> > > be necessary for msys apps in terms of backward compatibility.
> > 
> > I found that output of Oracle java.exe and javac.exe are garbled
> > if the patch above is applied. This is because java.exe and javac.exe
> > output SJIS code unconditionally in my environment.
> > 
> > OTOH, rust-based program such as cargo and ripgrep output UTF-8
> > unconditionally. node.js also seems to output UTF-8 string by
> > default.
> > 
> > I think there is no way for both apps to work properly if pseudo
> > console is disabled. As far as I tested, both of them works when
> > pseudo console is enabled. IMHO, the best way to achieve maximum
> > compatibility, is enabling pseudo console, which is disabled in
> > MSYS2 by default.
> > 
> > As for the case with pseudo console disabled:
> > 
> > If backward compatibility is important, we should apply the patch
> > above. If compatibility with the behaviour in command prompt is
> > important, we should leave the codepage to the system default.
> 
> Pseudo console is probbaly the way to go in future anyway.  For
> older OSes and older apps, we might better opt for backward compat.
> I'll apply your patch for the time being.

Oh, right, can you please send it again in git style?


Thanks,
Corinna

^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2020-09-11  9:23 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-01 16:19 [PATCH 3/3] fhandler_pty_slave::setup_locale: respect charset == "UTF-8" Johannes Schindelin
2020-09-02  6:06 ` Johannes Schindelin
2020-09-02  8:30 ` Corinna Vinschen
2020-09-02  8:38   ` Corinna Vinschen
2020-09-02 10:54     ` Takashi Yano
2020-09-02 15:24       ` Corinna Vinschen
2020-09-02 16:09         ` Corinna Vinschen
2020-09-02 16:25         ` Takashi Yano
2020-09-02 16:38           ` Corinna Vinschen
2020-09-03 17:59             ` Corinna Vinschen
2020-09-04  9:21               ` Takashi Yano
2020-09-04 12:44                 ` Corinna Vinschen
2020-09-04 14:05                   ` Brian Inglis
2020-09-04 14:50                   ` Takashi Yano
2020-09-04 19:22                     ` Corinna Vinschen
2020-09-05  8:43                       ` Takashi Yano
2020-09-05 11:15                         ` Takashi Yano
2020-09-05 14:15                           ` Takashi Yano
2020-09-06  8:57                             ` Takashi Yano
2020-09-06 10:15                               ` Takashi Yano
2020-09-06 16:04                                 ` Takashi Yano
2020-09-07  4:45                                   ` Takashi Yano
2020-09-07  9:08                                     ` Corinna Vinschen
2020-09-07  9:54                                       ` Takashi Yano
2020-09-07  9:59                                         ` Takashi Yano
2020-09-08  8:40                                     ` Corinna Vinschen
2020-09-08  9:45                                       ` Takashi Yano
2020-09-08 19:16                                         ` Corinna Vinschen
2020-09-10 13:08                                         ` Takashi Yano
2020-09-07  8:27                           ` Corinna Vinschen
2020-09-07  8:38                             ` Takashi Yano
2020-09-07  9:09                               ` Corinna Vinschen
2020-09-07  8:26                         ` Corinna Vinschen
2020-09-07  9:36                           ` Takashi Yano
2020-09-07 18:24                             ` Takashi Yano
2020-09-07 21:08                             ` Johannes Schindelin
2020-09-08  4:52                               ` Brian Inglis
2020-09-07 10:27                           ` Takashi Yano
2020-09-07 13:40                             ` Takashi Yano
2020-09-08  7:55                               ` Corinna Vinschen
2020-09-06 10:28                   ` Takashi Yano
2020-09-07  8:33                     ` Corinna Vinschen
2020-09-02  9:41   ` Takashi Yano
2020-09-02  6:26     ` Johannes Schindelin
2020-09-02 13:06       ` Takashi Yano
2020-09-02  9:12         ` Johannes Schindelin
2020-09-02 14:52           ` Takashi Yano
2020-09-04 10:03 ` Takashi Yano
2020-09-04  6:23   ` Johannes Schindelin
2020-09-04 15:03     ` Takashi Yano
2020-09-07 21:17       ` Johannes Schindelin
2020-09-08  8:16         ` Takashi Yano
2020-09-09  7:21           ` Corinna Vinschen
2020-09-10  0:15             ` Takashi Yano
2020-09-10 12:34               ` Takashi Yano
2020-09-11  9:05                 ` Corinna Vinschen
2020-09-11  9:23                   ` Corinna Vinschen
2020-09-10 14:04               ` Corinna Vinschen
2020-09-10 14:16                 ` Takashi Yano
2020-09-10 14:18                   ` Takashi Yano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).