From: "Michael Shay"
To: cygwin@cygwin.com
Subject: Re: Trouble with output character sets from Win32 applications running under mksh
Date: Tue, 4 Aug 2020 17:19:00 -0400
In-Reply-To: <6263e211-8751-8d61-7ceb-e9af59f0e5ce@SystematicSw.ab.ca>
References: <1314865780.20200803204249@yandex.ru> <6263e211-8751-8d61-7ceb-e9af59f0e5ce@SystematicSw.ab.ca>

Michael

From: "Brian Inglis"
To: cygwin@cygwin.com
Date: 08/04/2020 08:32 AM
Subject: Re: Trouble with output character sets from Win32 applications running under mksh
Sent by: "Cygwin"

On 2020-08-03 16:05, Michael Shay via Cygwin wrote:
> On 2020-08-03 11:42, Andrey Repin wrote:
>>>> Doesn't help. I tried 65001 (UTF-8):
>>> Because you're confusing things.
>>> chcp has nothing to do with LANG or LC_*.
>>> Et vice versa.
>>>
>> chcp sets console code page for native console applications.
>>> Only for those supporting it. Many do not.
>>> LANG sets output parameters for Cygwin applications (and other programs
>>> that look for it, but these are few).

>> You cut the significant statement at the top of the OP:

>>>> I'm having a problem with Cygwin 3.1.4, changing the character set on
>>>> the fly. It seems to work with Cygwin applications, but not with Win32
>>>> applications.

>> He has problems with invalid characters only running win32 console
>> applications: I changed the subject to hopefully better reflect the issue.
>>
>> I am unsure where Cygwin 3.1.4 comes into Win32 applications - you have to
>> use the Windows codepage conversion routines.
>>
>> You can only change input character sets on the fly; output character sets
>> will depend on mintty support of xterm-compatible character set support
>> and switching escape sequences; if you set up UTF16LE console output,
>> Windows and mintty should handle it.
>>
>> Perhaps a better description of your environment, build tools, what you
>> are trying to do, what you expect as output, and what you are getting as
>> output, could help us better understand and help with the issue you see.

> The script I sent changes the locale information i.e. LANG and LC_ALL are
> set to en_US.CP1252. i.e.
>
> export LANG="en_US.CP1252"
> export LC_ALL=en_US.CP1252

FYI the normal sequence and order to check is LANG, LC_CTYPE, LC_ALL, where
the last var set wins, or the reverse where the first var set wins; the
default locale may be POSIX C.ASCII or the effective Windows locale,
depending on your startup.

>> Thanks, that's good to know.

> Then, it runs a simple Win32 program that takes a single input argument, ZÇ,
> the second character being C-cedilla, an 8-bit character, hex value 0xc7.
> The Win32 program transcodes the input Unicode argument using the Cygwin
> character set to determine the codepage, 1252.

Do you mean using the environment variables to determine the codepage?

>> Yes. Our code does try to fetch the character set information from the
>> environment.

FYI the default character set if none is specified is the Unix equivalent
of the default Windows "ANSI"/OEM code page, in English or many European
locales that will be ISO-8859-1.
You may have to use cygpath -C OEM chars... or cygpath -C ANSI chars... to
convert a string to the required character set for console or GUI programs.

>> Our production code uses the console to display error information in the
>> appropriate character set, but our command-line utilities expect to be
>> able to take input strings encoded in the character set in use, which
>> may be an 8-bit SBCS like ISO-8859-1, Windows 1252, or a MBCS, like UTF-8
>> or e.g. Windows 932. Using 'cygpath' isn't an option.

Please specify what you mean by "Unicode" in each context; that term means a
standard for representing scripts in many writing systems with a large
character glyph repertoire and a number of encodings, representations, and
handling rules: in each use case, do you mean a char/wchar representation,
and/or an encoding UTF16LE or UTF-8?
Similarly when MS uses "ANSI" they may mean an SBCS OEM code page.

>> Unicode == UTF-16 in all cases. This is the wide-character set used by
>> Microsoft as far as I can tell in the wide-char version of their Win32 API
>> functions e.g. CreateProcessW() vs. CreateProcessA().

To check what is available and what is in effect in Cygwin, try e.g.:

$ for o in system user no-unicode input format; do echo `locale --$o` $o; done
en_US system
en_GB user
en_CA no-unicode
en_CA input
en_CA format
$ locale

on both Cygwin versions.

>>1.7.28 output
>>$for o in system user no-unicode input format; do echo `locale --$o` $o; done
>>en_US system
>>en_US user
>>en_US no-unicode
>>locale: unknown option -- input
>>Try `locale --help' for more information.
>>input
>>en_US format

>>3.1.4 output
>>$for o in system user no-unicode input format; do echo `locale --$o` $o; done
>>en_US system
>>en_US user
>>en_US no-unicode
>>en_US input
>>en_US format

FYI see:

https://cygwin.com/cygwin-ug-net/setup-locale.html

> It then prints the transcoded characters to stdout, and the result should be
> ZÇ, identical to the input argument.
> This works fine using Cygwin 1.7.28.

Which Windows version are you running Cygwin 1.7.28 on?
Please show output from cmd /c ver.

>>$cmd /c ver
>>Microsoft Windows [Version 10.0.18363.959]

That Cygwin version 1.7.28 is from 2014-02 and has been unsupported for years.
That version may not have completely supported international character sets
and may just assume that everything is in ISO-8859-1/Latin-1, which is similar
to CP1252, so that may work, or your system default OEM codepage e.g. 437 or
850, and pass it along.

>> Our code supports dozens of character sets, for international sales, and
>> that includes many SBCS, and MBCS, as well as UTF-8. I can use any of the
>> codepages supported by Windows and Cygwin and 1.7.28 handles them just fine.

> Cygwin 3.1.4 is launching the Win32 application, and is responsible for
> transcoding the arguments passed to it by mksh, in this case CP1252
> characters ZÇ, into Unicode.

Do you mean you believe Cygwin should recode argument strings, and what do
you mean by Unicode in this context?

>> When I launch a Win32 application that is using a character set other than
>> 7-bit ASCII in a Cygwin shell, the shell passes the command and arguments
>> in the input character set. So, for example, using CP 1252 as the character
>> set, and passing 8-bit single-byte characters like e.g. ZÇ, the shell
>> doesn't change the characters, it passes them through to Cygwin to launch
>> the process. In my test, using gdb ($gdb --version GNU gdb (GDB)
>> (Cygwin 8.2.1-1) 8.2.1) i.e. "gdb ksh.exe", then
>> "(gdb) start -c 'cygtest.exe ZÇ'", I can step into spawnve() in spawn.cc.
>> At this point, examining the input arguments confirms that the input
>> argument 'ZÇ' is still in the correct encoding i.e. 0x5a 0xc7. The real
>> work of launching the process is done in child_info_spawn::worker().
>> Eventually, the code invokes CreateProcessW(). The executable path is
>> already in UTF-16 format, so the only transcoding left to be done is the
>> argument string. This is done in the linebuf::wcs() function (winf.h).
>> This small method invokes sys_mbstowcs(), in strfuncs.cc. So yes, I do
>> believe Cygwin should transcode the argument strings from whatever their
>> current character set is to UTF-16. This is what the ancient 1.7.28 did.

> That means Cygwin has to use the mb-to-uc function for transcoding codepage
> 1252 to Unicode.

I am unsure if Cygwin does any recoding internally except for input typed on
the terminal console interface.
CP1252 is an SBCS not an MBCS so MB functions are not required.
What do you expect when you use Unicode here?

>> If Cygwin no longer does this internal transcoding, that's a significant
>> change from previous versions. I only know 1.7.28 did the transcoding
>> correctly, and it's certainly possible that at some point between that
>> version and 3.1.4, the behavior changed. Yes, CP1252 is a SBCS, but it
>> supports 8-bit characters, unlike 7-bit ASCII so requires a different
>> mapping from UTF-16. Using either CP 1252 or 7-bit ASCII though would
>> require a different transcoding routine than the UTF-8 -> UTF-16 that
>> gets used.

> It does not. It uses the UTF-8 to Unicode function (I've seen this using
> gdb). That function flags the Ç as an invalid UTF-8 sequence, not
> surprisingly since it's not a UTF-8 character.

What Windows, Cygwin, gdb versions are you seeing this on and what is the
name of the function you are seeing?

>> Windows - Microsoft Windows [Version 10.0.18363.959]
>> Cygwin - CYGWIN_NT-10.0 engr-cygwin-10vm 3.1.4(0.340/5/3) 2020-02-19 08:49 x86_64 Cygwin
>> gdb - GNU gdb (GDB) (Cygwin 8.2.1-1) 8.2.1

>> As described above, spawnve() calls child_info_spawn::worker() to do the
>> real work of launching a process, a Win32 or a Cygwin process. The
>> conversion of the process arguments into UTF-16 is done through
>> linebuf::wcs(), into sys_mbstowcs(). In the latter function the only work
>> done is to check if the pointer to the MBCS-to-WCS function is
>> '__ascii_mbtowc' and if so, to instead set it to '__utf8_mbtowc'. It then
>> invokes sys_cp_mbstowcs() to do the work.

>> However, the problem if there is one, must be occurring very early on.
>> dll_crt0_1() which according to the comments "Take over from libc's crt0.o
>> and start the application." fetches the locale from the environment:

>>   /* Set internal locale to the environment settings. */
>>   initial_setlocale ();

>> I suspect that it's here where either there's a problem, or Cygwin
>> behavior has changed from 1.7.28. I haven't tried to use gdb to step into
>> that initialization code.

> No matter what character set I use in 'export LANG...' and 'export
> LC_ALL...', Cygwin 3.1.4 always uses the utf8-to-wc transcoding function in
> sys

... what should be there and what is the name of the function used?

> 1.7.28 Uses the correct function.

What is the name of that function?

>> The function is sys_cp_mbstowcs(), which is invoked by sys_mbstowcs() as
>> it is in 3.1.4. But the older version doesn't get the pointer to the
>> mb-to-wc transcoding function passed to it; it fetches the pointer and the
>> character set from cygheap->locale and passes those to sys_cp_mbstowcs().

> I'm not using mintty, I'm using mksh, a requirement since our software uses
> lots of shell scripts, and for legacy support, that means using a Korn shell.

So that means that the mksh is running on the Windows console, and you are
not running mintty.

>> Correct.

> I could understand it if 1.7.28 didn't do the proper transcoding, but it
> does.

You may just be seeing Cygwin 1.7.28 passing the character codes along
verbatim.

>> I don't think so. child_info_spawn::worker() has to translate the CP1252
>> characters into UTF-16. And it does, as I've seen using WinDbg on the
>> Windows side of this.

> I used:
>
> gdb mksh
>
> to load mksh into the debugger, then started it with
>
> start -c 'cygtest.exe ZÇ'

Windows, Cygwin, and gdb versions?

>> Windows - Microsoft Windows [Version 10.0.18363.959]
>> Cygwin - CYGWIN_NT-10.0 engr-cygwin-10vm 3.1.4(0.340/5/3) 2020-02-19 08:49 x86_64 Cygwin
>> gdb - GNU gdb (GDB) (Cygwin 8.2.1-1) 8.2.1

> That allowed me to step into child_info_spawn::worker() and stop at the
> call to CreateProcess(), where the command line (cygtest.exe) and argument
> (ZÇ) are translated into Unicode.

In this case you mean into a UTF16LE string?

>> Yes.

> This is the code to which I'm referring, in strfuncs.cc, which is supposed
> to translate the command line and arguments from CP 1252 into Unicode.
>
> size_t __reg3
> sys_mbstowcs (wchar_t * dst, size_t dlen, const char *src, size_t nms)
> {
>   mbtowc_p f_mbtowc = __MBTOWC;
>   if (f_mbtowc == __ascii_mbtowc)
>     {
>       f_mbtowc = __utf8_mbtowc;  <<<< THE CODE CHANGES THE
> '__ascii_mbtowc' TO '__utf8_mbtowc' EVERY TIME, REGARDLESS OF THE
> CODEPAGE.
>     }
>   return sys_cp_mbstowcs (f_mbtowc, dst, dlen, src, nms);
> }
>
> So 'f_mbtowc' is set to __ascii_mbtowc, the default. You said:

UTF-8 contains ASCII as the first 128 code points, so that is valid, unless
the "ASCII" used isn't really, and has character codes > 127!

>> CP1252 supports 8-bit single-byte characters such as C-cedilla.
>> The UTF-8 representation of that character is a 2-byte sequence
>> (0xc3 0x87), which is not correct if the character set in use is CP1252.

> You can only change input character sets on the fly;
>
> The input character set to Cygwin should have been changed to CP 1252, as
> it was in 1.7.28. At least, that's what I would expect to happen. If it
> does not, or if mintty is required, then that's a regression from 1.7.28.

As Cygwin packages are rolling releases, old releases are unsupported, and
you must upgrade to the latest release, reproduce the problem with a simple
test case, and other examples if you wish, and post that with a copy of the
output from:

$ cygcheck -hrsv > cygcheck.out

as a plain text attachment to your post.

>> I understand. We do not ship a stock Cygwin installation. I happen to have
>> an unmodified 3.1.4 on a development machine and was able to reproduce the
>> problem with it. But we cannot take frequent Cygwin updates, as it takes
>> far too long to find and fix problems between Cygwin and our code. The
>> version has to be stable for months before we can use it.

>> Thanks for the helpful suggestions and information. I'll send updates, in
>> case anyone else sees a similar problem.

>> Michael Shay

--
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada

This email may be disturbing to some readers as it contains
too much technical detail. Reader discretion is advised.
[Data in IEC units and prefixes, physical quantities in SI.]
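
For readers following the byte-level argument in this thread, here is a minimal Python sketch of the behavior being described. It is not Cygwin code. The names `to_utf16le` and `pick_charset` are hypothetical, and `pick_charset` only models the substitution reported for sys_mbstowcs() (an ASCII converter is replaced by the UTF-8 one). It assumes, as the thread suspects but does not confirm, that the converter is still the ASCII default at spawn time.

```python
# Bytes handed to Cygwin for the argument "ZÇ" under CP1252,
# as reported from gdb in the thread: 0x5a 0xc7.
ARG = bytes([0x5A, 0xC7])


def to_utf16le(raw: bytes, charset: str) -> bytes:
    """Decode raw bytes in the given charset and re-encode as UTF-16LE.

    This is the end result a multibyte-to-wide conversion has to produce
    before a wide-character API such as CreateProcessW() is called.
    """
    return raw.decode(charset).encode("utf-16-le")


def pick_charset(locale_charset: str) -> str:
    """Hypothetical model of the selection described for sys_mbstowcs().

    If the active converter is the ASCII one, the UTF-8 converter is used
    instead; any other converter is kept as it is.
    """
    return "utf-8" if locale_charset.lower() == "ascii" else locale_charset


# With CP1252 in effect, 0xc7 maps to U+00C7 and the conversion succeeds.
good = to_utf16le(ARG, pick_charset("cp1252"))

# With the ASCII default in effect, the UTF-8 converter is used. 0xc7 is a
# UTF-8 lead byte with no continuation byte after it, so decoding fails.
try:
    to_utf16le(ARG, pick_charset("ascii"))
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

print(good.hex())                      # 5a00c700
print(utf8_failed)                     # True
print("\u00c7".encode("utf-8").hex())  # c387
```

The last line shows why the two encodings cannot be mixed: C-cedilla is the single byte 0xc7 in CP1252 but the two bytes 0xc3 0x87 in UTF-8.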