From: "Michael Shay"
To: cygwin@cygwin.com
Subject: Re: Trouble with output character sets from Win32 applications running under mintty
Date: Mon, 3 Aug 2020 18:05:18 -0400
References: <1314865780.20200803204249@yandex.ru>

Michael

From: "Brian Inglis"
To: cygwin@cygwin.com
Date: 08/03/2020 05:23 PM
Subject: Re: Trouble with output character sets from Win32 applications running under mintty
Sent by: "Cygwin"

On 2020-08-03 11:42, Andrey Repin wrote:
>> Doesn't help. I tried 65001 (UTF-8):
>
> Because you're confusing things.
> chcp has nothing to do with LANG or LC_*.
> Et vice versa.
>
> chcp sets console code page for native console applications. Only for those
> supporting it. Many do not.
> LANG sets output parameters for Cygwin applications (and other programs that
> look for it, but these are few).

You cut the significant statement at the top of the OP:

>> I'm having a problem with Cygwin 3.1.4, changing the character set on the
>> fly. It seems to work with Cygwin applications, but not with Win32
>> applications.

He has problems with invalid characters only running win32 console
applications: I changed the subject to hopefully better reflect the issue.

I am unsure where Cygwin 3.1.4 comes into Win32 applications - you have to use
the Windows codepage conversion routines.

You can only change input character sets on the fly; output character sets
will depend on mintty support of xterm-compatible character set support and
switching escape sequences; if you set up UCS16LE console output, Windows and
mintty should handle it.

Perhaps a better description of your environment, build tools, what you are
trying to do, what you expect as output, and what you are getting as output,
could help us better understand and help with the issue you see.

-- 
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada

This email may be disturbing to some readers as it contains
too much technical detail. Reader discretion is advised.
[Data in IEC units and prefixes, physical quantities in SI.]


The script I sent changes the locale information, i.e. LANG and LC_ALL are
set to en_US.CP1252:

    export LANG="en_US.CP1252"
    export LC_ALL=en_US.CP1252

Then it runs a simple Win32 program that takes a single input argument, ZÇ.
The second character is C-cedilla, an 8-bit character with hex value 0xc7.
The Win32 program transcodes the input Unicode argument, using the Cygwin
character set to determine the codepage, 1252. It then prints the transcoded
characters to stdout, and the result should be ZÇ, identical to the input
argument.
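
As a rough illustration of the transcoding step just described (this is not
the source of the actual test program, which is not shown in this thread), a
portable C sketch of a Unicode-to-CP1252 conversion might look like the
following. The names unicode_to_cp1252 and transcode_arg are hypothetical, and
the sketch assumes only the part of CP1252 that coincides with Unicode code
points (ASCII and 0xA0..0xFF); a real Win32 program would more likely call
WideCharToMultiByte() with code page 1252.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as a single CP1252 byte.
   Returns 1 on success, 0 if this sketch cannot represent the code point.
   Assumption: only ASCII and U+00A0..U+00FF are handled, where the CP1252
   byte value equals the code point. The 0x80..0x9F block of CP1252 needs a
   lookup table and is left out. */
int unicode_to_cp1252(uint32_t cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80 || (cp >= 0xA0 && cp <= 0xFF)) {
        *out = (unsigned char)cp;
        return 1;
    }
    return 0;
}

/* Hypothetical helper: transcode n code points into n CP1252 bytes.
   Returns the number of bytes written, or -1 if a code point is not
   representable in this sketch. */
int transcode_arg(const uint32_t *src, size_t n, unsigned char *dst)
{
    size_t i;

    assert(src != NULL && dst != NULL);
    for (i = 0; i < n; i++) {
        if (!unicode_to_cp1252(src[i], &dst[i]))
            return -1;
    }
    return (int)n;
}
```

For the argument ZÇ, i.e. the code points U+005A and U+00C7, this yields the
two bytes 0x5A 0xC7, which is the output expected under en_US.CP1252.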

This works fine using Cygwin 1.7.28. Cygwin 3.1.4 is launching the Win32
application, and is responsible for transcoding the arguments passed to it by
mksh, in this case the CP1252 characters ZÇ, into Unicode. That means Cygwin
has to use the mb-to-uc function for transcoding codepage 1252 to Unicode.
It does not. It uses the UTF-8 to Unicode function (I've seen this using gdb).
That function flags the Ç as an invalid UTF-8 sequence, not surprisingly,
since it's not a UTF-8 character. No matter what character set I use in
'export LANG...' and 'export LC_ALL...', Cygwin 3.1.4 always uses the
utf8-to-wc transcoding function in sys... 1.7.28 uses the correct function.

I'm not using mintty; I'm using mksh, a requirement since our software uses
lots of shell scripts, and for legacy support that means using a Korn shell.
I could understand it if 1.7.28 didn't do the proper transcoding, but it does.

I used:

    gdb mksh

to load mksh into the debugger, then started it with

    start -c 'cygtest.exe ZÇ'

That allowed me to step into child_info_spawn::worker() and stop at the call
to CreateProcess(), where the command line (cygtest.exe) and argument (ZÇ)
are translated into Unicode.

This is the code to which I'm referring, in strfuncs.cc, which is supposed to
translate the command line and arguments from CP 1252 into Unicode.

    size_t __reg3
    sys_mbstowcs (wchar_t * dst, size_t dlen, const char *src, size_t nms)
    {
      mbtowc_p f_mbtowc = __MBTOWC;
      if (f_mbtowc == __ascii_mbtowc)
        {
          /* <<<< THE CODE CHANGES '__ascii_mbtowc' TO '__utf8_mbtowc'
             EVERY TIME, REGARDLESS OF THE CODEPAGE. */
          f_mbtowc = __utf8_mbtowc;
        }
      return sys_cp_mbstowcs (f_mbtowc, dst, dlen, src, nms);
    }

So 'f_mbtowc' is set to __ascii_mbtowc, the default.

You said:

> You can only change input character sets on the fly;

The input character set to Cygwin should have been changed to CP 1252, as it
was in 1.7.28. At least, that's what I would expect to happen. If it does
not, or if mintty is required, then that's a regression from 1.7.28.

Mike Shay
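
For reference, the difference between the two decoders discussed above can be
reproduced outside Cygwin with a small, self-contained C sketch. This is not
Cygwin's code: decode_utf8, decode_cp1252 and to_unicode are hypothetical
stand-ins for the idea of passing a per-charset decoder to a common conversion
loop, and the CP1252 decoder assumes only the bytes that map straight to the
same Unicode code point (everything except 0x80..0x9F).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s[0..n). Returns the number of bytes
   consumed, or -1 if the bytes are not well formed. Overlong-form and
   surrogate checks are omitted to keep the sketch short. */
int decode_utf8(const unsigned char *s, size_t n, uint32_t *cp)
{
    int len, i;
    uint32_t v;

    if (n == 0) return -1;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if (s[0] >= 0xC2 && s[0] <= 0xDF)      { len = 2; v = s[0] & 0x1F; }
    else if (s[0] >= 0xE0 && s[0] <= 0xEF) { len = 3; v = s[0] & 0x0F; }
    else if (s[0] >= 0xF0 && s[0] <= 0xF4) { len = 4; v = s[0] & 0x07; }
    else return -1;                 /* stray continuation or invalid lead byte */
    if ((size_t)len > n) return -1; /* lead byte without its continuation bytes */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return -1;
        v = (v << 6) | (s[i] & 0x3F);
    }
    *cp = v;
    return len;
}

/* Decode one CP1252 byte. Assumption: bytes 0x80..0x9F need a lookup table
   and are rejected here; all other bytes equal their Unicode code point. */
int decode_cp1252(const unsigned char *s, size_t n, uint32_t *cp)
{
    if (n == 0 || (s[0] >= 0x80 && s[0] <= 0x9F)) return -1;
    *cp = s[0];
    return 1;
}

typedef int (*decode_fn)(const unsigned char *, size_t, uint32_t *);

/* Convert n bytes to code points using the supplied decoder. Returns the
   number of code points written, or -1 at the first invalid sequence. */
int to_unicode(decode_fn f, const unsigned char *src, size_t n, uint32_t *dst)
{
    size_t pos = 0;
    int count = 0;

    assert(f != NULL && src != NULL && dst != NULL);
    while (pos < n) {
        int used = f(src + pos, n - pos, &dst[count]);
        if (used < 0) return -1;
        pos += (size_t)used;
        count++;
    }
    return count;
}
```

With the bytes 0x5A 0xC7 (ZÇ in CP1252), to_unicode(decode_cp1252, ...) gives
U+005A U+00C7, while to_unicode(decode_utf8, ...) fails, because 0xC7 is a
UTF-8 lead byte that must be followed by a continuation byte. The UTF-8
encoding of the same text would be 0x5A 0xC3 0x87.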