From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: Trouble with output character sets from Win32 applications running under mksh
To: cygwin@cygwin.com
References: <1314865780.20200803204249@yandex.ru> <6263e211-8751-8d61-7ceb-e9af59f0e5ce@SystematicSw.ab.ca>
From: Thomas Wolff
Message-ID: <085a1a1f-944e-5b31-2203-a48caa496db3@towo.net>
Date: Wed, 5 Aug 2020 04:10:45 +0200
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
List-Id: General Cygwin discussions and problem reports

On 04.08.2020 at 23:19, Michael Shay via Cygwin wrote:
> Michael
The contents of your mail responses are not recognizable due to utterly
broken formatting.
[This is not top-posting as there's nothing to respond to]
>
> From: "Brian Inglis"
> To: cygwin@cygwin.com
> Date: 08/04/2020 08:32 AM
> Subject: Re: Trouble with output character sets from Win32 applications running under mksh
> Sent by: "Cygwin"
>
> On 2020-08-03 16:05, Michael Shay via Cygwin wrote:
>> On 2020-08-03 11:42, Andrey Repin wrote:
>>>>> Doesn't help. I tried 65001 (UTF-8):
>>>> Because you're confusing things.
>>>> chcp has nothing to do with LANG or LC_*.
>>>> Et vice versa.
>>>>
>>> chcp sets console code page for native console applications.
>>>> Only for those supporting it. Many do not.
>>>> LANG sets output parameters for Cygwin applications (and other programs
>>>> that look for it, but these are few).
>>> You cut the significant statement at the top of the OP:
>>>>> I'm having a problem with Cygwin 3.1.4, changing the character set on
>>>>> the fly. It seems to work with Cygwin applications, but not with Win32
>>>>> applications.
>>> He has problems with invalid characters only running win32 console
>>> applications: I changed the subject to hopefully better reflect the issue.
>>> I am unsure where Cygwin 3.1.4 comes into Win32 applications - you have to
>>> use the Windows codepage conversion routines.
>>>
>>> You can only change input character sets on the fly; output character sets
>>> will depend on mintty support of xterm-compatible character set support
>>> and switching escape sequences; if you set up UTF16LE console output,
>>> Windows and mintty should handle it.
>>>
>>> Perhaps a better description of your environment, build tools, what you
>>> are trying to do, what you expect as output, and what you are getting as
>>> output, could help us better understand and help with the issue you see.
>
>> The script I sent changes the locale information i.e. LANG and LC_ALL are
>> set to en_US.CP1252. i.e.
>>
>> export LANG="en_US.CP1252"
>> export LC_ALL=en_US.CP1252
> FYI the normal sequence and order to check is LANG, LC_CTYPE, LC_ALL, where
> the last var set wins, or the reverse where the first var set wins; the
> default locale may be POSIX C.ASCII or the effective Windows locale,
> depending on your startup.
>>> Thanks, that's good to know.
>> Then, it runs a simple Win32 program that takes a single input argument, ZÇ,
>> the second character being C-cedilla, an 8-bit character, hex value 0xc7.
>> The Win32 program transcodes the input Unicode argument using the Cygwin
>> character set to determine the codepage, 1252.
> Do you mean using the environment variables to determine the codepage?
>>> Yes. Our code does try to fetch the character set information from the
>>> environment.
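For readers following along, here is a minimal sketch of the lookup order
described above (the first variable set among LC_ALL, LC_CTYPE, LANG wins),
and of pulling the character set out of a value such as en_US.CP1252. The
helper name effective_charset is hypothetical; it is not part of Cygwin or
of the code discussed in this thread.

```python
def effective_charset(env):
    """Return the charset of the first set variable among LC_ALL, LC_CTYPE, LANG.

    Hypothetical helper for illustration only; returns None when the winning
    locale names no explicit charset or when no variable is set.
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:  # unset or empty variables are skipped
            # A locale value looks like language_TERRITORY.charset@modifier
            locale_part = value.split("@", 1)[0]
            if "." in locale_part:
                return locale_part.split(".", 1)[1]
            return None  # locale is set but carries no charset
    return None


# The settings from the script quoted above.
env = {"LANG": "en_US.CP1252", "LC_ALL": "en_US.CP1252"}
print(effective_charset(env))
```

With the quoted settings both variables agree, so the order only matters when
they differ, for example LANG=en_US.UTF-8 together with LC_ALL=en_US.CP1252.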
>
> FYI the default character set if none is specified is the Unix equivalent
> of the default Windows "ANSI"/OEM code page, in English or many European
> locales that will be ISO-8859-1.
>
> You may have to use cygpath -C OEM chars... or cygpath -C ANSI chars... to
> convert a string to the required character set for console or GUI programs.
>>> Our production code uses the console to display error information in the
>>> appropriate character set, but our command-line utilities expect to be
>>> able to take input strings encoded in the character set in use, which
>>> may be an 8-bit SBCS like ISO-8859-1, Windows 1252, or a MBCS, like UTF-8
>>> or e.g. Windows 932. Using 'cygpath' isn't an option.
> Please specify what you mean by "Unicode" in each context; that term means
> a standard for representing scripts in many writing systems with a large
> character glyph repertoire and a number of encodings, representations, and
> handling rules: in each use case, do you mean a char/wchar representation,
> and/or an encoding UTF16LE or UTF-8?
> Similarly when MS uses "ANSI" they may mean an SBCS OEM code page.
>
>>> Unicode == UTF-16 in all cases. This is the wide-character set used by
>>> Microsoft as far as I can tell in the wide-char version of their Win32
>>> API functions e.g. CreateProcessW() vs. CreateProcessA().
> To check what is available and what is in effect in Cygwin, try e.g.:
>
> $ for o in system user no-unicode input format; do echo `locale --$o` $o; done
> en_US system
> en_GB user
> en_CA no-unicode
> en_CA input
> en_CA format
> $ locale
>
> on both Cygwin versions.
>
>>> 1.7.28 output
>>> $for o in system user no-unicode input format; do echo `locale --$o` $o; done
>>> en_US system
>>> en_US user
>>> en_US no-unicode
>>> locale: unknown option -- input
>>> Try `locale --help' for more information.
>>> input
>>> en_US format
>>> 3.1.4 output
>>> $for o in system user no-unicode input format; do echo `locale --$o` $o; done
>>> en_US system
>>> en_US user
>>> en_US no-unicode
>>> en_US input
>>> en_US format
> FYI see:
>
> https://cygwin.com/cygwin-ug-net/setup-locale.html
>
>> It then prints the transcoded characters to stdout, and the result should
>> be ZÇ, identical to the input argument.
>> This works fine using Cygwin 1.7.28.
> Which Windows version are you running Cygwin 1.7.28 on?
> Please show output from cmd /c ver.
>>> $cmd /c ver
>>> Microsoft Windows [Version 10.0.18363.959]
> That Cygwin version 1.7.28 is from 2014-02 and has been unsupported for
> years.
> That version may not have completely supported international character sets
> and may just assume that everything is in ISO-8859-1/Latin-1, which is
> similar to CP1252, so that may work, or your system default OEM codepage
> e.g. 437 or 850, and pass it along.
>>> Our code supports dozens of character sets, for international sales, and
>>> that includes many SBCS, and MBCS, as well as UTF-8. I can use any of the
>>> codepages supported by Windows and Cygwin and 1.7.28 handles them just
>>> fine.
>> Cygwin 3.1.4 is launching the Win32 application, and is responsible for
>> transcoding the arguments passed to it by mksh, in this case CP1252
>> characters ZÇ, into Unicode.
> Do you mean you believe Cygwin should recode argument strings, and what do
> you mean by Unicode in this context?
>>> When I launch a Win32 application that is using a character set other
>>> than 7-bit ASCII in a Cygwin shell, the shell passes the command and
>>> arguments in the input character set. So, for example, using CP 1252 as
>>> the character set, and passing 8-bit single-byte characters like e.g. ZÇ,
>>> the shell doesn't change the characters, it passes them through to Cygwin
>>> to launch the process. In my test, using gdb ($gdb --version GNU gdb
>>> (GDB) (Cygwin 8.2.1-1) 8.2.1) i.e. "gdb ksh.exe", then
>>> "(gdb) start -c 'cygtest.exe ZÇ'", I can step into spawnve() in spawn.cc.
>>> At this point, examining the input arguments confirms that the input
>>> argument 'ZÇ' is still in the correct encoding i.e. 0x5a 0xc7. The real
>>> work of launching the process is done in child_info_spawn::worker().
>>> Eventually, the code invokes CreateProcessW(). The executable path is
>>> already in UTF-16 format, so the only transcoding left to be done is the
>>> argument string. This is done in the linebuf::wcs() function (winf.h).
>>> This small method invokes sys_mbstowcs(), in strfuncs.cc. So yes, I do
>>> believe Cygwin should transcode the argument strings from whatever their
>>> current character set is to UTF-16. This is what the ancient 1.7.28 did.
>> That means Cygwin has to use the mb-to-uc function for transcoding codepage
>> 1252 to Unicode.
> I am unsure if Cygwin does any recoding internally except for input typed
> on the terminal console interface.
> CP1252 is an SBCS not an MBCS so MB functions are not required.
> What do you expect when you use Unicode here?
>>> If Cygwin no longer does this internal transcoding, that's a significant
>>> change from previous versions. I only know 1.7.28 did the transcoding
>>> correctly, and it's certainly possible that at some point between that
>>> version and 3.1.4, the behavior changed. Yes, CP1252 is a SBCS, but it
>>> supports 8-bit characters, unlike 7-bit ASCII, so it requires a different
>>> mapping from UTF-16. Using either CP 1252 or 7-bit ASCII though would
>>> require a different transcoding routine than the UTF-8 -> UTF-16 that
>>> gets used.
>> It does not. It uses the UTF-8 to Unicode function (I've seen this using
>> gdb). That function flags the Ç as an invalid UTF-8 sequence, not
>> surprisingly since it's not a UTF-8 character.
> What Windows, Cygwin, gdb versions are you seeing this on and what is the
> name of the function you are seeing?
>>> Windows - Microsoft Windows [Version 10.0.18363.959]
>>> Cygwin - CYGWIN_NT-10.0 engr-cygwin-10vm 3.1.4(0.340/5/3) 2020-02-19 08:49 x86_64 Cygwin
>>> gdb - GNU gdb (GDB) (Cygwin 8.2.1-1) 8.2.1
>>> As described above, spawnve() calls child_info_spawn::worker() to do the
>>> real work of launching a process, a Win32 or a Cygwin process. The
>>> conversion of the process arguments into UTF-16 is done through
>>> linebuf::wcs(), into sys_mbstowcs(). In the latter function the only work
>>> done is to check if the pointer to the MBCS-to-WCS function is
>>> '__ascii_mbtowc' and if so, to instead set it to '__utf8_mbtowc'. It then
>>> invokes sys_cp_mbstowcs() to do the work.
>>> However, the problem, if there is one, must be occurring very early on.
>>> dll_crt0_1(), which according to the comments "Take over from libc's
>>> crt0.o and start the application.", fetches the locale from the
>>> environment:
>>> /* Set internal locale to the environment settings. */
>>> initial_setlocale ();
>>> I suspect that it's here where either there's a problem, or Cygwin
>>> behavior has changed from 1.7.28. I haven't tried to use gdb to step into
>>> that initialization code.
>
>> No matter what character set I use in 'export LANG...' and 'export
>> LC_ALL...', Cygwin 3.1.4 always uses the utf8-to-wc transcoding function in
>> sys
> ... what should be there and what is the name of the function used?
>
>> 1.7.28 Uses the correct function.
> What is the name of that function?
>>> The function is sys_cp_mbstowcs(), which is invoked by sys_mbstowcs() as
>>> it is in 3.1.4. But the older version doesn't get the pointer to the
>>> mb-to-wc transcoding function passed to it; it fetches the pointer and
>>> the character set from cygheap->locale and passes those to
>>> sys_cp_mbstowcs().
>> I'm not using mintty, I'm using mksh, a requirement since our software uses
>> lots of shell scripts, and for legacy support, that means using a Korn
>> shell.
>
> So that means that the mksh is running on the Windows console, and you are
> not running mintty.
>>> Correct.
>> I could understand it if 1.7.28 didn't do the proper transcoding, but it
>> does.
> You may just be seeing Cygwin 1.7.28 passing the character codes along
> verbatim.
>>> I don't think so. child_info_spawn::worker() has to translate the CP1252
>>> characters into UTF-16. And it does, as I've seen using Windbg on the
>>> Windows side of this.
>
>> I used:
>>
>> gdb mksh
>>
>> to load mksh into the debugger, then started it with
>>
>> start -c 'cygtest.exe ZÇ'
> Windows, Cygwin, and gdb versions?
>>> Windows - Microsoft Windows [Version 10.0.18363.959]
>>> Cygwin - CYGWIN_NT-10.0 engr-cygwin-10vm 3.1.4(0.340/5/3) 2020-02-19 08:49 x86_64 Cygwin
>>> gdb - GNU gdb (GDB) (Cygwin 8.2.1-1) 8.2.1
>> That allowed me to step into child_info_spawn::worker() and stop at the
>> call to CreateProcess(), where the command line (cygtest.exe) and argument
>> (ZÇ) are translated into Unicode.
> In this case you mean into a UTF16LE string?
>>> Yes.
>> This is the code to which I'm referring, in strfuncs.cc, which is supposed
>> to translate the command line and arguments from CP 1252 into Unicode.
>>
>> size_t __reg3
>> sys_mbstowcs (wchar_t * dst, size_t dlen, const char *src, size_t nms)
>> {
>>   mbtowc_p f_mbtowc = __MBTOWC;
>>   if (f_mbtowc == __ascii_mbtowc)
>>     {
>>       /* <<<< THE CODE CHANGES THE '__ascii_mbtowc' TO '__utf8_mbtowc'
>>          EVERY TIME, REGARDLESS OF THE CODEPAGE. */
>>       f_mbtowc = __utf8_mbtowc;
>>     }
>>   return sys_cp_mbstowcs (f_mbtowc, dst, dlen, src, nms);
>> }
>>
>> So 'f_mbtowc' is set to __ascii_mbtowc, the default. You said:
> UTF-8 contains ASCII as the first 128 code points, so that is valid, unless
> the "ASCII" used isn't really, and has character codes > 127!
>>> CP1252 supports 8-bit single-byte characters such as C-cedilla. The UTF-8
>>> representation is a 2-byte sequence that is not correct if the character
>>> set in use is CP1252.
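To make the encodings under discussion concrete, here is a small sketch that
takes the two argument bytes reported from the debugger (0x5a 0xc7) and shows
how they read as CP1252, what the same text looks like in UTF-16LE and UTF-8,
and that the raw bytes are not valid UTF-8. It uses only Python's standard
codecs; it does not call or reproduce any Cygwin code.

```python
# The argument bytes reported from spawnve() in the thread.
arg = bytes([0x5A, 0xC7])

# Read as CP1252, the bytes are 'Z' followed by C-cedilla (U+00C7).
text = arg.decode("cp1252")
print(ascii(text))

# The same two characters as the UTF-16LE string CreateProcessW() expects.
utf16 = text.encode("utf-16-le")
print(utf16.hex())

# For comparison, the same two characters encoded as UTF-8.
print(text.encode("utf-8").hex())

# The raw CP1252 bytes are not valid UTF-8: 0xc7 starts a multi-byte
# sequence, but no continuation byte follows it.
try:
    arg.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)
```

This is why the choice of conversion function matters: the bytes only map to
the intended characters when they are decoded with the code page they were
written in.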
>> You can only change input character sets on the fly;
>>
>> The input character set to Cygwin should have been changed to CP 1252, as
>> it was in 1.7.28. At least, that's what I would expect to happen. If it
>> does not, or if mintty is required, then that's a regression from 1.7.28.
>
> As Cygwin packages are rolling releases, old releases are unsupported, and
> you must upgrade to the latest release, reproduce the problem with a simple
> test case, and other examples if you wish, and post that with a copy of the
> output from:
>
> $ cygcheck -hrsv > cygcheck.out
>
> as a plain text attachment to your post.
>>> I understand. We do not ship a stock Cygwin installation. I happen to
>>> have an unmodified 3.1.4 on a development machine and was able to
>>> reproduce the problem with it. But we cannot take frequent Cygwin
>>> updates, as it takes far too long to find and fix problems between Cygwin
>>> and our code. The version has to be stable for months before we can use
>>> it.
>>> Thanks for the helpful suggestions and information. I'll send updates, in
>>> case anyone else sees a similar problem.
>>> Michael Shay
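As a rough model of the behaviour suspected in the thread, the sketch below
mimics the check quoted from sys_mbstowcs(): when the internal character set
is still ASCII, the UTF-8 conversion is used instead. Everything here is an
assumption for illustration: pick_codec and argv_to_utf16le are hypothetical
names, Python codecs stand in for the mbtowc function pointers, and
errors="replace" is only a placeholder for however Cygwin actually treats an
invalid sequence.

```python
def pick_codec(internal_charset):
    # Hypothetical model of the quoted check: an ASCII conversion
    # function is swapped for the UTF-8 one; anything else is kept.
    if internal_charset.upper() in ("ASCII", "US-ASCII"):
        return "utf-8"
    return internal_charset


def argv_to_utf16le(arg_bytes, internal_charset):
    # errors="replace" is a stand-in only; Cygwin's real handling of
    # invalid sequences is not modelled here.
    text = arg_bytes.decode(pick_codec(internal_charset), errors="replace")
    return text.encode("utf-16-le")


arg = bytes([0x5A, 0xC7])
# Internal locale picked up as CP1252: both characters survive.
print(argv_to_utf16le(arg, "CP1252").hex())
# Internal locale left at ASCII: 0xc7 is not valid UTF-8 and is lost.
print(argv_to_utf16le(arg, "ASCII").hex())
```

If the model is right, the difference between the two versions would come
down to which character set the conversion sees at spawn time, which fits the
suspicion about the early locale initialization, but that remains unverified.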