Subject: Re: New implementation of pseudo console support (experimental)
To: Johannes Schindelin, cygwin-developers@cygwin.com
From: Thomas Wolff
Message-ID: <131b9bdf-4b89-051b-f9a3-5332f837d56b@towo.net>
Date: Mon, 31 Aug 2020 23:07:03 +0200
List-Id: Cygwin core component developers mailing list

On 31.08.2020 at 21:17, Johannes Schindelin wrote:
> Hi Thomas,
>
> On Mon, 31 Aug 2020, Thomas Wolff wrote:
>
>> On 31.08.2020 at 18:12, Thomas Wolff wrote:
>>> On 31.08.2020 at 17:56, Johannes Schindelin wrote:
>>>
>>>> On Mon, 31 Aug 2020, Takashi Yano wrote:
>>>>
>>>>> On Mon, 31 Aug 2020 16:22:20 +0200 (CEST)
>>>>> Johannes Schindelin wrote:
>>>>>> On Mon, 31 Aug 2020, Takashi Yano wrote:
>>>>>>
>>>>>>> On Mon, 31 Aug 2020 14:49:04 +0200 (CEST)
>>>>>>> Johannes Schindelin wrote:
>>>>>>>
>>>>>>>> Sorry to latch onto this thread with something slightly
>>>>>>>> different, but we do see pretty serious encoding problems
>>>>>>>> (both with and without `CYGWIN=disable_pcon`) in the Git for
>>>>>>>> Windows and the MSYS2 projects. For example, in
>>>>>>>> https://github.com/msys2/MSYS2-packages/issues/1974 the
>>>>>>>> following issue was reported. If you compile a _MINGW_
>>>>>>>> program from this source code:
>>>>>>>>
>>>>>>>> -- snip --
>>>>>>>> #include <stdio.h>
>>>>>>>>
>>>>>>>> int main(){
>>>>>>>>    puts("Привет мир! Hello world!");
>>>>>>>>    return 0;
>>>>>>>> }
>>>>>>>> -- snap --
>>>>>>>>
>>>>>>>> and then execute it, you will see this output:
>>>>>>>>
>>>>>>>> -- snip --
>>>>>>>> ╨ƒ╤Ç╨╕╨▓╨╡╤é ╨╝╨╕╤Ç! Hello world!
>>>>>>>> -- snap --
>>>>>>> I guess your program (binary exe) does not work as you expect
>>>>>>> in command prompt as well. If you want to use UTF-8 coding in
>>>>>>> output, you should add a SetConsoleOutputCP(CP_UTF8) call before
>>>>>>> puts().
>>>>>> That may be, but I would like to point out that the very same
>>>>>> executable worked quite well in a MinTTY using v3.0.7...
>>> Assuming the test program source file is encoded in UTF-8 when
>>> compiling with x86_64-w64-mingw32-gcc, the string would be output byte
>>> by byte, which happened to be interpreted in UTF-8 when run in a
>>> terminal on cygwin 3.0.7, although the program was not set up to use
>>> UTF-8. The "correct" output was actually buggy behaviour, so current
>>> cygwin has "fixed" it, to your disadvantage in this case. With ConPTY
>>> support, matching encoding on Windows and terminal side needs to be
>>> taken care of.
>> My wording was misleading.
>> Maybe it's proper to say it this way:
>> Matching encoding on each side between application and respective system
>> is needed, as ConPTY transforms encoding properly on system level.
> Well, I just wonder how your wording (misleading or not) relates to the
> issue at hand: there are programs out there that simply do not take care
> of calling `SetConsoleOutputCP()`.
Those would use the pre-set system codepage. Unlike POSIX functions, which
need an initial dummy call to setlocale to work, the Windows API always has
a codepage set, typically 850 in European Windows installations.
>
> What you are telling me is that those programs are wrong, which I can kind
> of get behind.
No, but ConsoleOutput functions would involve the current codepage, which
is usually *not* 65001 (the UTF-8 codepage). So if those programs output
UTF-8 strings, they would actually be byte strings in the respective
codepage (e.g. 850) by definition of the Windows API. Ignoring that in
previous cygwin versions and just sending the bytes to a UTF-8 terminal
would have given you the expected result, but it's unfortunately not
really correct.
> However, what I do not understand is what you argue should happen with the
> output of such programs (if you address that concern at all, which I am
> not really sure of).
I'm afraid I think the proper way is to show the respective CP850 (or
whichever) interpretation that you saw; I'm puzzled though that the output
is changed by piping through cat.
Note that you can set previous/expected behaviour consistently with chcp
as follows:

  > chcp.com
  Aktive Codepage: 850.
  > ./conming
  ðƒÐÇð©ð▓ðÁÐé ð╝ð©ÐÇ! Hello world!
  > ./conming | cat
  ?????? ???! Hello world!
  > chcp.com 65001
  Aktive Codepage: 65001.
  > ./conming
  Привет мир! Hello world!
  > ./conming | cat
  Привет мир! Hello world!

>
> Previously, we assumed the output to be in UTF-8 (although I frankly have
> no idea how that worked).
Just by chance, as I described above.
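
The "by chance" effect can be reproduced without a console. What follows is
a minimal sketch, not code from Cygwin or from the test program: it takes
the UTF-8 bytes of the test string and renders each byte the way code page
850 would. The names `cp850_glyph` and `cp850_view` are hypothetical, the
lookup covers only the nine high bytes that occur in this one string (any
other high byte is shown as `?`), and the source file is assumed to be
saved as UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, for illustration only: the UTF-8 spelling of the
   glyph that code page 850 assigns to byte B.  Only the high bytes that
   occur in the UTF-8 encoding of "Привет мир!" are covered.  */
static const char *
cp850_glyph (unsigned char b)
{
  switch (b)
    {
    case 0x80: return "Ç";
    case 0x82: return "é";
    case 0x9F: return "ƒ";
    case 0xB2: return "▓";
    case 0xB5: return "Á";
    case 0xB8: return "©";
    case 0xBC: return "╝";
    case 0xD0: return "ð";
    case 0xD1: return "Ð";
    default:   return "?";	/* not in this partial table */
    }
}

/* Render the bytes of SRC as a CP850 console would show them, writing the
   result to OUT as UTF-8.  Returns 0 on success, -1 if OUT is too small.  */
int
cp850_view (const char *src, char *out, size_t cap)
{
  size_t used = 0;

  assert (src != NULL && out != NULL && cap > 0);
  for (; *src != '\0'; src++)
    {
      unsigned char b = (unsigned char) *src;
      char ascii[2] = { (char) b, '\0' };
      /* ASCII bytes look the same in CP850 and UTF-8; high bytes do not.  */
      const char *glyph = b < 0x80 ? ascii : cp850_glyph (b);
      size_t len = strlen (glyph);

      if (used + len + 1 > cap)
	return -1;
      memcpy (out + used, glyph, len);
      used += len;
    }
  out[used] = '\0';
  return 0;
}
```

Called on "Привет мир! Hello world!", `cp850_view` produces
"ðƒÐÇð©ð▓ðÁÐé ð╝ð©ÐÇ! Hello world!", the same text as the `./conming`
output under code page 850 above. Each Cyrillic letter is two UTF-8 bytes
(a 0xD0 or 0xD1 lead byte plus one continuation byte), so a terminal that
reads the raw bytes as UTF-8 shows the intended text, while a reader that
treats them as CP850 shows two glyphs per letter.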
> Starting with v3.1.0 (or at least v3.1.4, I have
> not _really_ verified with earlier versions), the output is assumed to use
> code page 437.
Or whatever the system / you have set.
> With seemingly everybody and their sister switching to UTF-8, I wonder
> whether that even makes sense.
When using the Windows API, the modern way would be to use UTF-16, i.e. all
functions ending with "W", like WriteConsoleW.
If you prefer 8-bit functions and want to support Unicode, set the codepage
to 65001.
You may still use 8-bit codepages if desired, like CP1252 for Windows
European ANSI.
A "DOS mode" program using 8-bit output and not setting a codepage is
really doing something undefined and cannot expect specific output beyond
ASCII.
>
> So I had a look at the code, and it seems that
> `fhandler_pty_slave::setup_locale()` forces the output encoding to
> C.ASCII if Pseudo Console support is enabled:
>
>    char locale[ENCODING_LEN + 1] = "C";
>    char charset[ENCODING_LEN + 1] = "ASCII";
>    LCID lcid = get_langinfo (locale, charset);
>
>    /* Set console code page from locale */
>    if (get_pseudo_console ())
>      {
>        UINT code_page;
>        if (lcid == 0 || lcid == (LCID) -1)
>          code_page = 20127; /* ASCII */
>        else if (!GetLocaleInfo (lcid,
>                                 LOCALE_IDEFAULTCODEPAGE | LOCALE_RETURN_NUMBER,
>                                 (char *) &code_page, sizeof (code_page)))
>          code_page = 20127; /* ASCII */
>        SetConsoleCP (code_page);
>        SetConsoleOutputCP (code_page);
>      }
>
> Please note that this essentially forces the console output code page to
> ASCII (in my case, the fall-back to 20127 seems not to kick in, but 437 is
> used instead, as LCID 0x0409 is used).
As seen, the output "ðƒÐÇð©ð▓ðÁÐé ð╝ð©ÐÇ" is not confined to ASCII. I doubt
the branch with 20127 is taken in the test case, as lcid is likely to be
something other than 0 or -1.
> However, there is no overriding call to `SetConsoleOutputCP()` later in
> that method, not even when the `charset` is correctly identified as
> `UTF-8` (because my `LANG=en_US.UTF-8`).
I don't know how the ConPTY support code works, but I'd say
SetConsoleOutputCP is rather to be called on the client side of the pty, in
the Windows program, if it wants. It might have been an alternative way to
support Windows codepages from cygwin, before the age of ConPTY, as I had
once considered.
> Now, what I _really_ do not understand is why Cygwin insists on using the
> console output code page when running in `CYGWIN=disable_pcon` mode...
Because it is proper to interpret output in the way it would be intended by
the original program if that was correct. Writing Windows programs so that
they could nicely output to UTF-8 terminals was a neat trick but
unfortunately not correct.
>
> Otherwise, this patch would be enough to fix it for me:
>
> -- snip --
> diff --git a/winsup/cygwin/fhandler_tty.cc b/winsup/cygwin/fhandler_tty.cc
> index 43eebc174..2ce8dae9a 100644
> --- a/winsup/cygwin/fhandler_tty.cc
> +++ b/winsup/cygwin/fhandler_tty.cc
> @@ -2867,11 +2867,13 @@ fhandler_pty_slave::setup_locale (void)
>    char charset[ENCODING_LEN + 1] = "ASCII";
>    LCID lcid = get_langinfo (locale, charset);
>
> -  /* Set console code page form locale */
> +  /* Set console code page from locale */
>    if (get_pseudo_console ())
>      {
>        UINT code_page;
> -      if (lcid == 0 || lcid == (LCID) -1)
> +      if (!strcasecmp (charset, "utf-8"))
> +        code_page = CP_UTF8;
> +      else if (lcid == 0 || lcid == (LCID) -1)
>          code_page = 20127; /* ASCII */
>        else if (!GetLocaleInfo (lcid,
>                                 LOCALE_IDEFAULTCODEPAGE | LOCALE_RETURN_NUMBER,
> -- snap --
>
> But that does _not_ reinstate the previous behavior when Pseudo Console
> support is disabled.
>
> Now, I would call that a regression (the entire idea of `disable_pcon` was
> to fall back to the previous behavior, no?). And I do not really
> understand where it comes from, that regression.
I wouldn't quite call it a regression as it disables buggy behaviour which
was used as a workaround for a buggy system environment.
But arguably you could expect such an option to fall back to previous buggy
behaviour.
Thomas

> Where does the code path
> differ from the previous one when Pseudo Console support is disabled, and
> how does that relate to the current console output code page?
>
> Ciao,
> Johannes
>
>>> Thomas
>>>
>>>>> at the expense of garbled output for apps which use native
>>>>> code page of the system in the correct manner.
>>>> Are you referring to apps that call the SetConsoleOutputCP() function? If
>>>> so, I am asking myself what would be broken. Because apps that do _not_
>>>> call that function (expecting UTF-8 to be active) would be fixed, while
>>>> apps that _do_ call that function would not care if the Cygwin runtime
>>>> changed it.
>>>>
>>>> Ciao,
>>>> Johannes
>>