* Trouble with character sets
@ 2020-08-03 15:36 Michael Shay
2020-08-03 16:31 ` Brian Inglis
0 siblings, 1 reply; 11+ messages in thread
From: Michael Shay @ 2020-08-03 15:36 UTC (permalink / raw)
To: cygwin
I'm having a problem with Cygwin 3.1.4, changing the character set on the
fly. It seems to work with Cygwin applications, but not with Win32
applications.
I have a Korn shell script:
#!/bin/ksh
OLD_LANG="$LANG"
OLD_LC_ALL="$LC_ALL"
echo "locale on entry"
locale
echo ""
export LANG="en_US.CP1252"
export LC_ALL=en_US.CP1252
echo "locale changed to"
locale
echo ""
# Default is to run the Win32 program. Input any argument other than 'WIN32'
# to run '/bin/echo'.
case $# in
0 ) echo "Running WIN32 pgm"
ksh -c 'cygtest.exe ZÇ'
;;
1 ) echo "Running Cygwin 'echo'"
ksh -c '/bin/echo ZÇ'
;;
2 ) echo "Running WIN32 pgm"
ksh -c 'cygtest.exe ZÇ'
echo ""
echo "Running Cygwin 'echo'"
ksh -c '/bin/echo ZÇ'
;;
* ) ;;
esac
LC_ALL="$OLD_LC_ALL"
LANG="$OLD_LANG"
and a Win32 application (attached file cygtest.cpp)
I used gdb to see what was happening in child_info_spawn::worker(), when a
Win32 program is started using:
rc = CreateProcessW (runpath, /* image name w/ full path */
cmd.wcs (wcmd), /* what was passed to exec */
sa, /* process security attrs */
sa, /* thread security attrs */
TRUE, /* inherit handles */
c_flags,
envblock, /* environment */
NULL,
&si,
&pi);
Specifically, 'cmd.wcs(wcmd)' invokes:
wchar_t *wcs (wchar_t *wbuf, size_t n)
{
if (n == 1)
wbuf[0] = L'\0';
else
sys_mbstowcs (wbuf, n, buf);
return wbuf;
}
and sys_mbstowcs():
size_t __reg3
sys_mbstowcs (wchar_t * dst, size_t dlen, const char *src, size_t nms)
{
mbtowc_p f_mbtowc = __MBTOWC;
if (f_mbtowc == __ascii_mbtowc)
{
f_mbtowc = __utf8_mbtowc; /* <<<<< this is ALWAYS done, no matter what charset is in use. */
}
return sys_cp_mbstowcs (f_mbtowc, dst, dlen, src, nms);
}
Since CP1252 is a single-byte character set that uses byte values >= 0x80,
a lone 0xc7 byte is not a valid UTF-8 sequence, so it is always translated
to the wide character 0xF0C7 (bytes '0xc7 0xf0' in memory), with the '0xf0'
byte indicating an invalid character in the string.
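To make the byte-level point concrete, here is a minimal sketch in portable
C++. It is not Cygwin's code: the helper names (utf8_trailing_bytes,
looks_like_utf8, cp1252_to_codepoint) are hypothetical, and the CP1252 helper
only covers the bytes where CP1252 coincides with ISO-8859-1. It shows that a
lone 0xC7 byte is an ordinary CP1252 character but a truncated sequence when
read as UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of continuation bytes announced by a UTF-8
// lead byte, or -1 if the byte cannot start a sequence.
int utf8_trailing_bytes(unsigned char lead)
{
    if (lead < 0x80) return 0;                   // ASCII
    if (lead >= 0xC2 && lead <= 0xDF) return 1;  // two-byte sequence
    if (lead >= 0xE0 && lead <= 0xEF) return 2;  // three-byte sequence
    if (lead >= 0xF0 && lead <= 0xF4) return 3;  // four-byte sequence
    return -1;                                   // stray continuation or invalid lead
}

// Structural check only: each lead byte must be followed by the announced
// number of 10xxxxxx bytes. Overlong and surrogate encodings are not rejected.
bool looks_like_utf8(const std::string &s)
{
    std::size_t i = 0;
    while (i < s.size()) {
        int n = utf8_trailing_bytes(static_cast<unsigned char>(s[i]));
        if (n < 0 || i + static_cast<std::size_t>(n) >= s.size())
            return false;                        // bad lead or truncated sequence
        for (int k = 1; k <= n; ++k)
            if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80)
                return false;                    // not a continuation byte
        i += static_cast<std::size_t>(n) + 1;
    }
    return true;
}

// Hypothetical helper: decode one CP1252 byte. For 0x00..0x7F and 0xA0..0xFF
// CP1252 matches ISO-8859-1, so the code point equals the byte value.
// The 0x80..0x9F block is not mapped in this sketch and yields U+FFFD.
char32_t cp1252_to_codepoint(unsigned char b)
{
    if (b >= 0x80 && b <= 0x9F)
        return 0xFFFD;
    return static_cast<char32_t>(b);
}
```

With these helpers, looks_like_utf8("Z\xC7") is false, looks_like_utf8("Z\xC3\x87")
is true, and cp1252_to_codepoint(0xC7) is 0x00C7 (Ç).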
This doesn't seem to happen when e.g. '/bin/echo' is run, although I
haven't stepped into the code to see what's happening.
I do not think this is a Cygwin bug, but since the User's Guide says the
locale and charset can be changed on the fly, I don't know what's going
awry.
Any suggestions? If you need more information, I'm happy to provide it.
Mike Shay
Here's the source for the Win32 program. I built it with Visual Studio
2015, to get something running quickly.
NOTICE from Ab Initio: This email (including any attachments) may contain information that is subject to confidentiality obligations or is legally privileged, and sender does not waive confidentiality or privilege. If received in error, please notify the sender, delete this email, and make no further use, disclosure, or distribution.
[-- Attachment #2: cygtest.cpp --]
[-- Type: application/octet-stream, Size: 4428 bytes --]
// cygtest.cpp : Defines the entry point for the console application.
//
#include <SDKDDKVer.h>
#include <stdio.h>
#include <windows.h>
#include <string>
using namespace std;
LPSTR __stdcall UnicodeToMByteHelper(LPSTR lpa, int nBytes, LPCWSTR lpw, int nChars, int codepage);
static UINT cyg_codepage_string_to_CP(const string &cp)
{
const string UTF8 = "UTF-8";
const string utf8 = "utf-8";
const string ANSI = "ANSI";
const string ansi = "ansi";
const string ISO88591 = "ISO-8859-1";
const string iso88591 = "iso-8859-1";
const string OEM = "OEM";
const string oem = "oem";
const string WINDOWS = "WINDOWS";
const string windows = "windows";
const string CODEPAGE = "CP";
const string codepage = "cp";
UINT shell_cp{ 0 };
if (NULL == cp.c_str() || cp.length() == 0)
return 0;
if ((cp.compare(utf8) == 0) || (cp.compare(UTF8) == 0))
shell_cp = 65001;
else if ((cp.compare(ansi) == 0) || (cp.compare(ANSI) == 0)
|| (cp.compare(ISO88591) == 0) || (cp.compare(iso88591) == 0))
shell_cp = 1252;
// oem is also standard cygwin nomenclature
else if ((cp.compare(oem) == 0) || (cp.compare(OEM) == 0))
shell_cp = 437;
// cpXXX, windows-XXX and windows_XXX are all recognized by
// the Ab Initio extensions to cygwin. Not sure if they are
// known to standard cygwin, but I don't think they are.
else if ((cp.compare(0, 2, codepage) == 0) ||
(cp.compare(0, 2, CODEPAGE) == 0) ||
(cp.compare(0, 7, windows) == 0) ||
(cp.compare(0, 7, WINDOWS) == 0)) {
// If the prefix is "CP" or "cp" then get the number after that
// else it's "windows{-,_}" or "WINDOWS{-,_}"
int offset = ((cp.compare(0, 2, codepage) == 0) || (cp.compare(0, 2, CODEPAGE) == 0)) ? 2 : 8;
shell_cp = atoi(cp.substr(offset).c_str());
}
return shell_cp;
}
static UINT get_cygwin_codepage()
{
string default_cyg_charset = "C.UTF-8"; // Cygwin default character set
string cyg_locale;
UINT shell_cp{ 0 };
UINT default_cp{ 65001 };
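// LC_ALL overrides LC_CTYPE, which overrides LANG; an empty value counts as unset.
char *envptr = ::getenv("LC_ALL");
if (NULL == envptr || '\0' == *envptr)
envptr = ::getenv("LC_CTYPE");
if (NULL == envptr || '\0' == *envptr)
envptr = ::getenv("LANG");
cyg_locale = ((NULL == envptr || '\0' == *envptr) ? default_cyg_charset : envptr);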
// The 'value' field of the environment string "var_name=value"
// will be of the form: <language ID>.<codepage ID>
// We want the substring after the '.'
int dotPos = cyg_locale.find_first_of('.');
if (dotPos >= 0) {
// The character set string, if specified, starts AFTER the '.'.
// If NOT specified, return the input default.
string page = cyg_locale.substr(++dotPos);
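// 0 means the charset name was not recognized; fall through to the default.
// (The original '0 <=' test was always true for an unsigned value.)
if (0 != (shell_cp = cyg_codepage_string_to_CP(page))) {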
return shell_cp;
} // end SHELL_CP
} // end EQPOS
return default_cp;
}
LPSTR __stdcall UnicodeToMByteHelper(LPSTR lpa, int nBytes, LPCWSTR lpw, int nChars, int codepage)
{
static int printInfo = 0;
int nOut = 0;
if (NULL == lpa) {
printf("NULL input string\n");
return NULL;
}
if (printInfo) {
printf("Transcoding using Cygwin codepage: %d\nInput widechar string:\n", codepage);
for (int i = 0; i < nChars; i++)
printf("\tlpw[%d] = %C - %02X\n", i, lpw[i], lpw[i]);
}
++printInfo;
if (nChars > 0) {
if (0 == (nOut = WideCharToMultiByte(codepage, 0, lpw, nChars, lpa, nBytes, NULL, NULL))) {
DWORD dwErr = GetLastError();
printf("WideCharToMultiByte(%d, %S) failed, error %d\n", codepage, lpw, dwErr);
return NULL;
}
}
lpa[nOut] = '\0';
return lpa;
}
int wmain(int argc, wchar_t** wargv)
{
try {
char *pNull = "NULL";
char** argv = new char*[(argc)+1];
int _argi;
int codepage = get_cygwin_codepage();
for (_argi = 0; _argi < (argc); _argi++) {
if (wargv[_argi]) {
LPWSTR utf_lpw = wargv[_argi];
int utf_len = lstrlenW(utf_lpw);
int utf_convert = utf_len * 3 + 1;
LPSTR utf_lpa = (LPSTR)_alloca(utf_convert);
argv[_argi] = UnicodeToMByteHelper(utf_lpa, utf_convert, utf_lpw, utf_len, codepage);
}
else {
argv[_argi] = pNull;
}
}
argv[(argc)] = NULL;
// Now print the transcoded string.
for (int i = 1; i < argc; i++)
printf("%s: %s\n", __FUNCTION__, argv[i]);
return 0;
}
catch (...) {
printf("Caught unhandled exception\n");
}
}
* Re: Trouble with character sets
2020-08-03 15:36 Trouble with character sets Michael Shay
@ 2020-08-03 16:31 ` Brian Inglis
2020-08-03 17:10 ` Michael Shay
0 siblings, 1 reply; 11+ messages in thread
From: Brian Inglis @ 2020-08-03 16:31 UTC (permalink / raw)
To: cygwin
Try:
$ chcp.com
Active code page: 850
$ chcp.com 65001
Active code page: 65001
$ chcp.com
Active code page: 65001
--
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada
This email may be disturbing to some readers as it contains
too much technical detail. Reader discretion is advised.
[Data in IEC units and prefixes, physical quantities in SI.]
* Re: Trouble with character sets
2020-08-03 16:31 ` Brian Inglis
@ 2020-08-03 17:10 ` Michael Shay
2020-08-03 17:42 ` Andrey Repin
0 siblings, 1 reply; 11+ messages in thread
From: Michael Shay @ 2020-08-03 17:10 UTC (permalink / raw)
To: cygwin
Doesn't help. I tried 65001 (UTF-8):
### SET CP TO UTF-8, 65001
$cygwin_charset_test.ksh
Old CP 65001
locale on entry
LANG=
LC_CTYPE="C.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_TIME="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_ALL=
### CP SET TO 65001
Active code page: 65001
locale changed to
LANG=en_US.CP1252
LC_CTYPE="en_US.CP1252"
LC_NUMERIC="en_US.CP1252"
LC_TIME="en_US.CP1252"
LC_COLLATE="en_US.CP1252"
LC_MONETARY="en_US.CP1252"
LC_MESSAGES="en_US.CP1252"
LC_ALL=en_US.CP1252
Running WIN32 pgm
Transcoding using Cygwin codepage: 1252
Input widechar string:
lpw[0] = Z - 5A
lpw[1] = - F0C7
wmain: Z?
Active code page: 65001
and 1252
### SET CP TO 1252
$cygwin_charset_test.ksh
Old CP 65001
locale on entry
LANG=
LC_CTYPE="C.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_TIME="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_ALL=
### CP SET TO 1252
Active code page: 1252
locale changed to
LANG=en_US.CP1252
LC_CTYPE="en_US.CP1252"
LC_NUMERIC="en_US.CP1252"
LC_TIME="en_US.CP1252"
LC_COLLATE="en_US.CP1252"
LC_MONETARY="en_US.CP1252"
LC_MESSAGES="en_US.CP1252"
LC_ALL=en_US.CP1252
Running WIN32 pgm
Transcoding using Cygwin codepage: 1252
Input widechar string:
lpw[0] = Z - 5A
lpw[1] = - F0C7
wmain: Z?
Active code page: 65001
Michael
* Re: Trouble with character sets
2020-08-03 17:10 ` Michael Shay
@ 2020-08-03 17:42 ` Andrey Repin
2020-08-03 18:15 ` Michael Shay
2020-08-03 21:23 ` Trouble with output character sets from Win32 applications running under mintty Brian Inglis
0 siblings, 2 replies; 11+ messages in thread
From: Andrey Repin @ 2020-08-03 17:42 UTC (permalink / raw)
To: Michael Shay, cygwin
Greetings, Michael Shay!
Please bottom post in this mailing list.
> Doesn't help. I tried 65001 (UTF-8):
Because you're confusing things.
chcp has nothing to do with LANG or LC_*.
Et vice versa.
chcp sets the console code page for native console applications, and only for
those that support it. Many do not.
LANG sets output parameters for Cygwin applications (and other programs that
look for it, but these are few).
--
With best regards,
Andrey Repin
Monday, August 3, 2020 20:36:16
Sorry for my terrible english...
* Re: Trouble with character sets
2020-08-03 17:42 ` Andrey Repin
@ 2020-08-03 18:15 ` Michael Shay
2020-08-03 21:23 ` Trouble with output character sets from Win32 applications running under mintty Brian Inglis
1 sibling, 0 replies; 11+ messages in thread
From: Michael Shay @ 2020-08-03 18:15 UTC (permalink / raw)
To: cygwin
Thanks for the feedback. I wasn't aware of the protocol.
Mike Shay
* Re: Trouble with output character sets from Win32 applications running under mintty
2020-08-03 17:42 ` Andrey Repin
2020-08-03 18:15 ` Michael Shay
@ 2020-08-03 21:23 ` Brian Inglis
2020-08-03 22:05 ` Michael Shay
1 sibling, 1 reply; 11+ messages in thread
From: Brian Inglis @ 2020-08-03 21:23 UTC (permalink / raw)
To: cygwin
On 2020-08-03 11:42, Andrey Repin wrote:
>> Doesn't help. I tried 65001 (UTF-8):
>
> Because you're confusing things.
> chcp has nothing to do with LANG or LC_*.
> Et vice versa.
>
> chcp sets console code page for native console applications. Only for those
> supporting it. Many do not.
> LANG sets output parameters for Cygwin applications (and other programs that
> look for it, but these are few).
You cut the significant statement at the top of the OP:
>> I'm having a problem with Cygwin 3.1.4, changing the character set on the
>> fly. It seems to work with Cygwin applications, but not with Win32
>> applications.
He has problems with invalid characters only when running Win32 console
applications; I changed the subject to hopefully better reflect the issue.
I am unsure where Cygwin 3.1.4 comes into Win32 applications - you have to use
the Windows codepage conversion routines.
You can only change input character sets on the fly; output character sets will
depend on mintty's support for xterm-compatible character sets and for the
escape sequences that switch between them; if you set up UTF-16LE console
output, Windows and mintty should handle it.
Perhaps a better description of your environment, build tools, what you are
trying to do, what you expect as output, and what you are getting as output,
could help us better understand and help with the issue you see.
--
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada
This email may be disturbing to some readers as it contains
too much technical detail. Reader discretion is advised.
[Data in IEC units and prefixes, physical quantities in SI.]
* Re: Trouble with output character sets from Win32 applications running under mintty
2020-08-03 21:23 ` Trouble with output character sets from Win32 applications running under mintty Brian Inglis
@ 2020-08-03 22:05 ` Michael Shay
2020-08-04 12:32 ` Trouble with output character sets from Win32 applications running under mksh Brian Inglis
0 siblings, 1 reply; 11+ messages in thread
From: Michael Shay @ 2020-08-03 22:05 UTC (permalink / raw)
To: cygwin
The script I sent changes the locale information, i.e. LANG and LC_ALL are
set to en_US.CP1252:
export LANG="en_US.CP1252"
export LC_ALL=en_US.CP1252
Then, it runs a simple Win32 program that takes a single input argument,
ZÇ, the second character being C-cedilla, an 8-bit character, hex value
0xc7. The Win32 program transcodes the input Unicode argument using the
Cygwin character set to determine the codepage, 1252. It then prints the
transcoded characters to stdout, and the result should be ZÇ, identical to
the input argument. This works fine using Cygwin 1.7.28.
Cygwin 3.1.4 is launching the Win32 application, and is responsible for
transcoding the arguments passed to it by mksh, in this case CP1252
characters ZÇ, into Unicode. That means Cygwin has to use the mb-to-uc
function for transcoding codepage 1252 to Unicode. It does not. It uses the
UTF-8 to Unicode function (I've seen this using gdb). That function flags
the Ç as an invalid UTF-8 sequence, not surprisingly since it's not a UTF-8
character. No matter what character set I use in 'export LANG...' and
'export LC_ALL...', Cygwin 3.1.4 always uses the utf8-to-wc transcoding
function in sys
1.7.28 uses the correct function.
I'm not using mintty, I'm using mksh, a requirement since our software uses
lots of shell scripts, and for legacy support, that means using a Korn
shell. I could understand it if 1.7.28 didn't do the proper transcoding,
but it does.
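The "wmain: Z?" line in the output posted earlier is consistent with the
wide-to-CP1252 step receiving 0xF0C7 instead of 0x00C7. The following is a
rough sketch of that step in portable C++, not the Win32 WideCharToMultiByte
implementation: the helper names are hypothetical, only the ranges where
CP1252 equals ISO-8859-1 are mapped, and '?' is assumed as the substitution
character because that is what the posted output shows.

```cpp
#include <string>

// Hypothetical helper: encode one UTF-16 code unit as a CP1252 byte.
// Only 0x00..0x7F and 0xA0..0xFF are mapped (CP1252 equals ISO-8859-1 there);
// every other code unit is replaced by '?' in this sketch.
unsigned char to_cp1252_or_question(char16_t unit)
{
    if (unit <= 0x7F || (unit >= 0xA0 && unit <= 0xFF))
        return static_cast<unsigned char>(unit);
    return '?';
}

// Hypothetical helper: transcode a whole wide string, one unit at a time.
std::string transcode_to_cp1252(const std::u16string &wide)
{
    std::string out;
    for (char16_t unit : wide)
        out.push_back(static_cast<char>(to_cp1252_or_question(unit)));
    return out;
}
```

Feeding this the two code units 0x005A 0x00C7 gives the bytes 5A C7 (ZÇ in
CP1252), while 0x005A 0xF0C7 gives "Z?".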
I used:
gdb mksh
to load mksh into the debugger, then started it with
start -c 'cygtest.exe ZÇ'
That allowed me to step into child_info_spawn::worker() and stop at the
call to CreateProcess(), where the command line (cygtest.exe) and argument
(ZÇ) are translated into Unicode.
This is the code to which I'm referring, in strfuncs.cc, which is supposed
to translate the command line and arguments from CP 1252 into Unicode.
size_t __reg3
sys_mbstowcs (wchar_t * dst, size_t dlen, const char *src, size_t nms)
{
mbtowc_p f_mbtowc = __MBTOWC;
if (f_mbtowc == __ascii_mbtowc)
{
f_mbtowc = __utf8_mbtowc; /* <<<< THE CODE CHANGES THE '__ascii_mbtowc' TO
'__utf8_mbtowc' EVERY TIME, REGARDLESS OF THE CODEPAGE. */
}
return sys_cp_mbstowcs (f_mbtowc, dst, dlen, src, nms);
}
So 'f_mbtowc' is set to __ascii_mbtowc, the default. You said:
> You can only change input character sets on the fly;
The input character set to Cygwin should have been changed to CP 1252, as
it was in 1.7.28. At least, that's what I would expect to happen. If it
does not, or if mintty is required, then that's a regression from 1.7.28.
Mike Shay
* Re: Trouble with output character sets from Win32 applications running under mksh
2020-08-03 22:05 ` Michael Shay
@ 2020-08-04 12:32 ` Brian Inglis
2020-08-04 21:19 ` Michael Shay
0 siblings, 1 reply; 11+ messages in thread
From: Brian Inglis @ 2020-08-04 12:32 UTC (permalink / raw)
To: cygwin
On 2020-08-03 16:05, Michael Shay via Cygwin wrote:
> On 2020-08-03 11:42, Andrey Repin wrote:
>>>> Doesn't help. I tried 65001 (UTF-8):
>>> Because you're confusing things.
>>> chcp has nothing to do with LANG or LC_*.
>>> Et vice versa.
>>>
>>> chcp sets console code page for native console applications.
>>> Only for those supporting it. Many do not.
>>> LANG sets output parameters for Cygwin applications (and other programs
>>> that look for it, but these are few).
>> You cut the significant statement at the top of the OP:
>>>> I'm having a problem with Cygwin 3.1.4, changing the character set on
>>>> the fly. It seems to work with Cygwin applications, but not with Win32
>>>> applications.
>> He has problems with invalid characters only running win32 console
>> applications: I changed the subject to hopefully better reflect the issue.
>>
>> I am unsure where Cygwin 3.1.4 comes into Win32 applications - you have to
>> use the Windows codepage conversion routines.
>>
>> You can only change input character sets on the fly; output character sets
>> will depend on mintty support of xterm-compatible character set support
>> and switching escape sequences; if you set up UTF16LE console output,
>> Windows and mintty should handle it.
>>
>> Perhaps a better description of your environment, build tools, what you
>> are trying to do, what you expect as output, and what you are getting as
>> output, could help us better understand and help with the issue you see.
> The script I sent changes the locale information i.e. LANG and LC_ALL are
> set to en_US.CP1252. i.e.
>
> export LANG="en_US.CP1252"
> export LC_ALL=en_US.CP1252
FYI the normal order in which to check is LANG, then LC_CTYPE, then LC_ALL,
where the last variable set wins, or the reverse order, where the first
variable set wins; the default locale may be POSIX C.ASCII or the effective
Windows locale, depending on your startup.
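As a rough illustration of that lookup order, here is a small sketch in
portable C++. The function names are hypothetical and the values are passed in
as arguments rather than read from the environment, so it is only an outline of
the precedence rule, not what Cygwin or any C library actually runs.

```cpp
#include <string>

// Hypothetical helper: return the first of LC_ALL, LC_CTYPE, LANG that is set
// and non-empty. An empty result means none was set, in which case the
// default locale is up to the implementation.
std::string effective_ctype_locale(const char *lc_all, const char *lc_ctype,
                                   const char *lang)
{
    const char *candidates[] = { lc_all, lc_ctype, lang };
    for (const char *value : candidates)
        if (value != nullptr && *value != '\0')
            return value;
    return std::string();
}

// Hypothetical helper: the charset part of "language_TERRITORY.charset@modifier",
// or an empty string if the locale name carries no charset.
std::string charset_of(const std::string &locale)
{
    const std::string::size_type dot = locale.find('.');
    if (dot == std::string::npos)
        return std::string();
    const std::string::size_type at = locale.find('@', dot);
    const std::string::size_type len =
        (at == std::string::npos) ? std::string::npos : at - dot - 1;
    return locale.substr(dot + 1, len);
}
```

A caller would pass the results of getenv("LC_ALL"), getenv("LC_CTYPE") and
getenv("LANG"); for the script in this thread both LANG and LC_ALL are
en_US.CP1252, so charset_of() would yield CP1252 either way.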
> Then, it runs a simple Win32 program that takes a single input argument, ZÇ,
> the second character being C-cedilla, an 8-bit character, hex value 0xc7.
> The Win32 program transcodes the input Unicode argument using the Cygwin
> character set to determine the codepage, 1252.
Do you mean using the environment variables to determine the codepage?
FYI the default character set, if none is specified, is the Unix equivalent of
the default Windows "ANSI"/OEM code page; in English and many European locales
that will be ISO-8859-1.
You may have to use cygpath -C OEM chars... or cygpath -C ANSI chars... to
convert a string to the required character set for console or GUI programs.
Please specify what you mean by "Unicode" in each context; that term means a
standard for representing scripts in many writing systems with a large character
glyph repertoire and a number of encodings, representations, and handling rules:
in each use case, do you mean a char/wchar representation, and/or an encoding
UTF16LE or UTF-8?
Similarly when MS uses "ANSI" they may mean an SBCS OEM code page.
To check what is available and what is in effect in Cygwin, try e.g.:
$ for o in system user no-unicode input format; do echo `locale --$o` $o; done
en_US system
en_GB user
en_CA no-unicode
en_CA input
en_CA format
$ locale
on both Cygwin versions.
FYI see:
https://cygwin.com/cygwin-ug-net/setup-locale.html
> It then prints the transcoded characters to stdout, and the result should be
> ZÇ, identical to the input argument.
> This works fine using Cygwin 1.7.28.
Which Windows version are you running Cygwin 1.7.28 on?
Please show output from cmd /c ver.
That Cygwin version 1.7.28 is from 2014-02 and has been unsupported for years.
That version may not have completely supported international character sets and
may just assume that everything is in ISO-8859-1/Latin-1, which is similar to
CP1252, so that may work, or your system default OEM codepage e.g. 437 or 850,
and pass it along.
> Cygwin 3.1.4 is launching the Win32 application, and is responsible for
> transcoding the arguments passed to it by mksh, in this case CP1252
> characters ZÇ, into Unicode.
Do you mean you believe Cygwin should recode argument strings, and what do you
mean by Unicode in this context?
> That means Cygwin has to use the mb-to-uc function for transcoding codepage
> 1252 to Unicode.
I am unsure if Cygwin does any recoding internally except for input typed on the
terminal console interface.
CP1252 is an SBCS not an MBCS so MB functions are not required.
What do you expect when you use Unicode here?
> It does not. It uses the UTF-8 to Unicode function (I've seen this using
> gdb). That function flags the Ç as an invalid UTF-8 sequence, not
> surprisingly since it's not a UTF-8 character.
What Windows, Cygwin, gdb versions are you seeing this on and what is the name
of the function you are seeing?
> No matter what character set I use in 'export LANG...' and 'export
> LC_ALL...', Cygwin 3.1.4 always uses the utf8-to-wc transcoding function in
> sys
... what should be there and what is the name of the function used?
> 1.7.28 Uses the correct function.
What is the name of that function?
> I'm not using mintty, I'm using mksh, a requirement since our software uses
> lots of shell scripts, and for legacy support, that means using a Korn shell.
So that means that the mksh is running on the Windows console, and you are not
running mintty.
> I could understand it if 1.7.28 didn't do the proper transcoding, but it
> does.
You may just be seeing Cygwin 1.7.28 passing the character codes along verbatim.
> I used:
>
> gdb mksh
>
> to load mksh into the debugger, then started it with
>
> start -c 'cygtest.exe ZÇ'
Windows, Cygwin, and gdb versions?
> That allowed me to step into child_info_spawn::worker() and stop at the
> call to CreateProcess(), where the command line (cygtest.exe) and argument
> (ZÇ) are translated into Unicode.
In this case you mean into a UTF16LE string?
> This is the code to which I'm referring, in strfuncs.cc, which is supposed
> to translate the command line and arguments from CP 1252 into Unicode.
>
> size_t __reg3
> sys_mbstowcs (wchar_t * dst, size_t dlen, const char *src, size_t nms)
> {
>   mbtowc_p f_mbtowc = __MBTOWC;
>   if (f_mbtowc == __ascii_mbtowc)
>     {
>       /* <<<< THE CODE CHANGES THE '__ascii_mbtowc' TO '__utf8_mbtowc'
>          EVERY TIME, REGARDLESS OF THE CODEPAGE. */
>       f_mbtowc = __utf8_mbtowc;
>     }
>   return sys_cp_mbstowcs (f_mbtowc, dst, dlen, src, nms);
> }
>
> So 'f_mbtowc' is set to __ascii_mbtowc, the default. You said:
UTF-8 contains ASCII as the first 128 code points, so that is valid, unless the
"ASCII" used isn't really, and has character codes > 127!
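The point about character codes above 127 can be checked outside Cygwin. The following is a minimal sketch, assuming a POSIX shell with `iconv`, `od`, and `tr` available (none of these appear in the thread); the helper name `hex_of` is hypothetical. It shows that the bytes 0x5a 0xc7 convert cleanly when declared as CP1252, but are rejected when declared as UTF-8, because 0xc7 is a UTF-8 lead byte with no continuation byte after it.

```shell
#!/bin/sh
# hex_of FROM TO: convert stdin from encoding FROM to encoding TO
# and print the resulting bytes as a compact hex string.
hex_of() {
    iconv -f "$1" -t "$2" | od -An -tx1 | tr -d ' \n'
}

# "ZÇ" in CP1252 is the two bytes 0x5a 0xc7 (octal 132 307).
as_utf8=$(printf '\132\307' | hex_of CP1252 UTF-8)
echo "CP1252 -> UTF-8: $as_utf8"

# The same two bytes are not a valid UTF-8 string, so iconv fails.
if printf '\132\307' | iconv -f UTF-8 -t UTF-16LE >/dev/null 2>&1; then
    utf8_ok=yes
else
    utf8_ok=no
fi
echo "valid as UTF-8: $utf8_ok"
```

This only illustrates the encodings themselves; it says nothing about which conversion function Cygwin selects internally.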
> You can only change input character sets on the fly;
>
> The input character set to Cygwin should have been changed to CP 1252, as
> it was in 1.7.28. At least, that's what I would expect to happen. If it
> does not, or if mintty is required, then that's a regression from 1.7.28.
As Cygwin packages are rolling releases, old releases are unsupported, and you
must upgrade to the latest release, reproduce the problem with a simple test
case, and other examples if you wish, and post that with a copy of the output from:
$ cygcheck -hrsv > cygcheck.out
as a plain text attachment to your post.
--
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada
This email may be disturbing to some readers as it contains
too much technical detail. Reader discretion is advised.
[Data in IEC units and prefixes, physical quantities in SI.]
* Re: Trouble with output character sets from Win32 applications running under mksh
2020-08-04 12:32 ` Trouble with output character sets from Win32 applications running under mksh Brian Inglis
@ 2020-08-04 21:19 ` Michael Shay
2020-08-05 2:10 ` Thomas Wolff
2020-08-05 5:22 ` Brian Inglis
0 siblings, 2 replies; 11+ messages in thread
From: Michael Shay @ 2020-08-04 21:19 UTC (permalink / raw)
To: cygwin
Michael
From: "Brian Inglis" <Brian.Inglis@SystematicSw.ab.ca>
To: cygwin@cygwin.com
Date: 08/04/2020 08:32 AM
Subject: Re: Trouble with output character sets from Win32
applications running under mksh
Sent by: "Cygwin" <cygwin-bounces@cygwin.com>
On 2020-08-03 16:05, Michael Shay via Cygwin wrote:
> On 2020-08-03 11:42, Andrey Repin wrote:
>>>> Doesn't help. I tried 65001 (UTF-8):
>>> Because you're confusing things.
>>> chcp has nothing to do with LANG or LC_*.
>>> Et vice versa.
>>>
>> chcp sets console code page for native console applications.
>>> Only for those supporting it. Many do not.
>>> LANG sets output parameters for Cygwin applications (and other
programs
>>> that look for it, but these are few).
>> You cut the significant statement at the top of the OP:
>>>> I'm having a problem with Cygwin 3.1.4, changing the character set on
>>>> the fly. It seems to work with Cygwin applications, but not with
Win32
>>>> applications.
>> He has problems with invalid characters only running win32 console
>> applications: I changed the subject to hopefully better reflect the
issue.
>>
>> I am unsure where Cygwin 3.1.4 comes into Win32 applications - you have
to
>> use the Windows codepage conversion routines.
>>
>> You can only change input character sets on the fly; output character
sets
>> will depend on mintty support of xterm-compatible character set support
>> and switching escape sequences; if you set up UTF16LE console output,
>> Windows and mintty should handle it.
>>
>> Perhaps a better description of your environment, build tools, what you
>> are trying to do, what you expect as output, and what you are getting
as
>> output, could help us better understand and help with the issue you
see.
> The script I sent changes the locale information i.e. LANG and LC_ALL
are
> set to en_US.CP1252. i.e.
>
> export LANG="en_US.CP1252"
> export LC_ALL=en_US.CP1252
FYI the normal sequence and order to check is LANG, LC_CTYPE, LC_ALL,
where the
last var set wins, or the reverse where the first var set wins; the
default
locale may be POSIX C.ASCII or the effective Windows locale, depending on
your
startup.
>> Thanks, that's good to know.
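The lookup order described above can be sketched as a small shell function. This is an illustration under stated assumptions, not Cygwin's implementation: the function name `effective_ctype` is hypothetical, it takes the three values as arguments rather than reading the environment, and the final fallback to "C" is the POSIX default, which may differ from what Cygwin actually uses at startup.

```shell
#!/bin/sh
# effective_ctype LC_ALL LC_CTYPE LANG
# Print the locale that applies to character handling: the first
# non-empty value among LC_ALL, LC_CTYPE, LANG, in that priority order.
# Falls back to "C" (assumed POSIX default) when all three are empty.
effective_ctype() {
    lc_all=$1
    lc_ctype=$2
    lang=$3
    echo "${lc_all:-${lc_ctype:-${lang:-C}}}"
}

# Only LANG set: LANG applies.
effective_ctype "" "" "en_US.CP1252"
# LC_ALL set as well: LC_ALL overrides everything else.
effective_ctype "en_US.UTF-8" "" "en_US.CP1252"
# LC_CTYPE overrides LANG but not LC_ALL.
effective_ctype "" "en_US.ISO-8859-1" "en_US.CP1252"
```

Setting both LANG and LC_ALL to the same value, as the script in this thread does, is therefore redundant but harmless.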
> Then, it runs a simple Win32 program that takes a single input argument,
ZÇ,
> the second character being C-cedilla, an 8-bit character, hex value
0xc7.
> The Win32 program transcodes the input Unicode argument using the Cygwin
> character set to determine the codepage, 1252.
Do you mean using the environment variables to determine the codepage?
>> Yes. Our code does try to fetch the character set information from the
>> environment.
FYI the default character set if none is specified is the Unix equivalent
of the
default Windows "ANSI"/OEM code page, in English or many European locales
that
will be ISO-8859-1.
You may have to use cygpath -C OEM chars... or cygpath -C ANSI chars... to
convert a string to the required character set for console or GUI
programs.
>> Our production code uses the console to display error information in
the
>> appropriate character set, but our command-line utilities expect to be
>> able to take input strings encoded in the character set in use, which
>> may be an 8-bit SBCS like ISO-8859-1, Windows 1252, or a MBCS, like
UTF-8
>> or e.g. Windows 932. Using 'cygpath' isn't an option.
Please specify what you mean by "Unicode" in each context; that term means
a
standard for representing scripts in many writing systems with a large
character
glyph repertoire and a number of encodings, representations, and handling
rules:
in each use case, do you mean a char/wchar representation, and/or an
encoding
UTF16LE or UTF-8?
Similarly when MS uses "ANSI" they may mean an SBCS OEM code page.
>> Unicode == UTF-16 in all cases. This is the wide-character set used by
Microsoft
>> as far as I can tell in the wide-char version of their Win32 API
functions e.g.
>> CreateProcessW() vs. CreateProcessA().
To check what is available and what is in effect in Cygwin, try e.g.:
$ for o in system user no-unicode input format; do echo `locale --$o` $o;
done
en_US system
en_GB user
en_CA no-unicode
en_CA input
en_CA format
$ locale
on both Cygwin versions.
>>1.7.28 output
>>$for o in system user no-unicode input format; do echo `locale --$o` $o;
done
>>en_US system
>>en_US user
>>en_US no-unicode
>>locale: unknown option -- input
>>Try `locale --help' for more information.
>>input
>>en_US format
>>3.1.4 output
>>$for o in system user no-unicode input format; do echo `locale --$o` $o;
done
>>en_US system
>>en_US user
>>en_US no-unicode
>>en_US input
>>en_US format
FYI see:
https://cygwin.com/cygwin-ug-net/setup-locale.html
> It then prints the transcoded characters to stdout, and the result
should be
> ZÇ, identical to the input argument.
> This works fine using Cygwin 1.7.28.
Which Windows version are you running Cygwin 1.7.28 on?
Please show output from cmd /c ver.
>>$cmd /c ver
>>Microsoft Windows [Version 10.0.18363.959]
That Cygwin version 1.7.28 is from 2014-02 and has been unsupported for
years.
That version may not have completely supported international character
sets and
may just assume that everything is in ISO-8859-1/Latin-1, which is similar
to
CP1252, so that may work, or your system default OEM codepage e.g. 437 or
850,
and pass it along.
>> Our code supports dozens of character sets, for international sales,
and that
>> includes many SBCS, and MBCS, as well as UTF-8. I can use any of the
codepages
>> supported by Windows and Cygwin and 1.7.28 handles them just fine.
> Cygwin 3.1.4 is launching the Win32 application, and is responsible for
> transcoding the arguments passed to it by mksh, in this case CP1252
> characters ZÇ, into Unicode.
Do you mean you believe Cygwin should recode argument strings, and what do
you
mean by Unicode in this context?
>> When I launch a Win32 application that is using a character set other
than 7-bit ASCII
>> in a Cygwin shell, the shell passes the command and arguments in the
input character set.
>> So, for example, using CP 1252 as the character set, and passing 8-bit
single-byte characters
>> like e.g. ZÇ, the shell doesn't change the characters, it passes them
through to Cygwin
>> to launch the process. In my test, using gdb ($gdb --version GNU gdb
(GDB) (Cygwin 8.2.1-1) 8.2.1)
>> i.e. "gdb ksh.exe", then "(gdb) start -c 'cygtest.exe ZÇ'", I can step
into spawnve() in spawn.cc.
>> At this point, examining the input arguments confirms that the input
argument 'ZÇ' is still
>> in the correct encoding i.e. 0x5a 0xc7. The real work of launching the
process is done in
>> child_info_spawn::worker(). Eventually, the code invokes
CreateProcessW(). The executable
>> path is already in UTF-16 format, so the only transcoding left to be
done is the
>> argument string. This is done in linebuf::wcs() function (winf.h) This
small method
>> invokes sys_mbstowcs(), in strfuncs.cc. So yes, I do believe Cygwin
should transcode
>> the argument strings from whatever their current character set is to
UTF-16. This is
>> what the ancient 1.7.28 did.
> That means Cygwin has to use the mb-to-uc function for transcoding
codepage
> 1252 to Unicode.
I am unsure if Cygwin does any recoding internally except for input typed
on the
terminal console interface.
CP1252 is an SBCS not an MBCS so MB functions are not required.
What do you expect when you use Unicode here?
>> If Cygwin no longer does this internal transcoding, that's a
significant change
>> from previous versions. I only know 1.7.28 did the transcoding
correctly, and it's
>> certainly possible that at some point between that version and 3.1.4,
the behavior
>> changed. Yes, CP1252 is a SBCS, but it supports 8-bit characters,
unlike 7-bit ASCII
>> so requires a different mapping to UTF-16. Using either CP 1252 or
7-bit ASCII
>> though would require a different transcoding routine than the UTF-8 ->
UTF-16 that
>> gets used.
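For reference, the UTF-16LE code units that a correct CP1252 conversion of the argument would produce can be computed independently of Cygwin. This is a minimal sketch assuming a POSIX shell with `iconv`, `od`, and `tr` (tools not mentioned in the thread); the variable names are hypothetical. It does not exercise Cygwin's code path; it only shows the expected result to compare against in a debugger.

```shell
#!/bin/sh
# "ZÇ" as CP1252 bytes: 0x5a 0xc7 (octal 132 307).
# Convert to UTF-16LE and print the bytes as a compact hex string.
expected_wide=$(printf '\132\307' \
    | iconv -f CP1252 -t UTF-16LE \
    | od -An -tx1 \
    | tr -d ' \n')
echo "CP1252 'ZÇ' as UTF-16LE: $expected_wide"

# A pure 7-bit ASCII argument gives the same result whether it is
# declared as CP1252 or as UTF-8, which is why the problem only
# appears for bytes >= 0x80.
ascii_via_cp1252=$(printf '\132' | iconv -f CP1252 -t UTF-16LE | od -An -tx1 | tr -d ' \n')
ascii_via_utf8=$(printf '\132' | iconv -f UTF-8 -t UTF-16LE | od -An -tx1 | tr -d ' \n')
echo "ASCII 'Z' via CP1252: $ascii_via_cp1252, via UTF-8: $ascii_via_utf8"
```

In little-endian UTF-16, Z (U+005A) is the byte pair 5a 00 and Ç (U+00C7) is c7 00.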
> It does not. It uses the UTF-8 to Unicode function (I've seen this using
> gdb). That function flags the Ç as an invalid UTF-8 sequence, not
> surprisingly since it's not a UTF-8 character.
What Windows, Cygwin, gdb versions are you seeing this on and what is the
name
of the function you are seeing?
>> Windows - Microsoft Windows [Version 10.0.18363.959]
>> Cygwin - CYGWIN_NT-10.0 engr-cygwin-10vm 3.1.4(0.340/5/3) 2020-02-19
08:49 x86_64 Cygwin
>> gdb - GNU gdb (GDB) (Cygwin 8.2.1-1) 8.2.1)
>> As described above, spawnve() calls child_info_spawn::worker() to do
the real work of
>> launching a process, a Win32 or a Cygwin process. The conversion of the
process arguments
>> into UTF-16 is done through linebuf::wcs(), into sys_mbstowcs(). In the
latter function
>> the only work done is to check if the pointer to the MBCS to WCS is '
__ascii_mbtowc' and
>> if so, to instead set it to '__utf8_mbtowc'. It then invokes
sys_cp_mbstowcs() to do the
>> work.
>> However, the problem if there is one, must be occurring very early on.
dll_crt0_1()
>> which according to the comments "Take over from libc's crt0.o and start
the application."
>> fetches the locale from the environment:
>> /* Set internal locale to the environment settings. */
>> initial_setlocale ();
>> I suspect that it's here where either there's a problem, or Cygwin
behavior has changed from
>> 1.7.28. I haven't tried to use gdb to step into that initialization
code.
> No matter what character set I use in 'export LANG...' and 'export
> LC_ALL...', Cygwin 3.1.4 always uses the utf8-to-wc transcoding function
in
> sys
... what should be there and what is the name of the function used?
> 1.7.28 Uses the correct function.
What is the name of that function?
>> The function is sys_cp_mbstowcs(), which is invoked by sys_mbstowcs()
as it is in 3.1.4.
>> But the older version doesn't get the pointer to the mb-to-wc
transcoding function passed
>> it, it fetches the pointer and the character set from cygheap->locale
and passes those
>> to sys_cp_mbstowcs().
> I'm not using mintty, I'm using mksh, a requirement since our software
uses
> lots of shell scripts, and for legacy support, that means using a Korn
shell.
So that means that the mksh is running on the Windows console, and you are
not
running mintty.
>> Correct.
> I could understand it if 1.7.28 didn't do the proper transcoding, but it
> does.
You may just be seeing Cygwin 1.7.28 passing the character codes along
verbatim.
>> I don't think so. child_info_spawn::worker() has to translate the
CP1252 characters
>> into UTF-16. And it does, as I've seen using Windbg on the Windows side
of this.
> I used:
>
> gdb mksh
>
> to load mksh into the debugger, then started it with
>
> start -c 'cygtest.exe ZÇ'
Windows, Cygwin, and gdb versions?
>> Windows - Microsoft Windows [Version 10.0.18363.959]
>> Cygwin - CYGWIN_NT-10.0 engr-cygwin-10vm 3.1.4(0.340/5/3) 2020-02-19
08:49 x86_64 Cygwin
>> gdb - GNU gdb (GDB) (Cygwin 8.2.1-1) 8.2.1)
> That allowed me to step into child_info_spawn::worker() and stop at the
> call to CreateProcess(), where the command line (cygtest.exe) and
argument
> (ZÇ) are translated into Unicode.
In this case you mean into a UTF16LE string?
>> Yes.
> This is the code to which I'm referring, in strfuncs.cc, which is
supposed
> to translate the command line and arguments from CP 1252 into Unicode.
>
> size_t __reg3
> sys_mbstowcs (wchar_t * dst, size_t dlen, const char *src, size_t nms)
> {
>   mbtowc_p f_mbtowc = __MBTOWC;
>   if (f_mbtowc == __ascii_mbtowc)
>     {
>       /* <<<< THE CODE CHANGES THE '__ascii_mbtowc' TO '__utf8_mbtowc'
>          EVERY TIME, REGARDLESS OF THE CODEPAGE. */
>       f_mbtowc = __utf8_mbtowc;
>     }
>   return sys_cp_mbstowcs (f_mbtowc, dst, dlen, src, nms);
> }
>
> So 'f_mbtowc' is set to __ascii_mbtowc, the default. You said:
UTF-8 contains ASCII as the first 128 code points, so that is valid,
unless the
"ASCII" used isn't really, and has character codes > 127!
>> CP1252 supports 8-bit single-byte characters such as C-cedilla. The
UTF-8
>> representation is a 2-byte sequence (0xc3 0x87) that is not correct if the
character
>> set in use is CP1252.
> You can only change input character sets on the fly;
>
> The input character set to Cygwin should have been changed to CP 1252,
as
> it was in 1.7.28. At least, that's what I would expect to happen. If it
> does not, or if mintty is required, then that's a regression from
1.7.28.
As Cygwin packages are rolling releases, old releases are unsupported, and
you
must upgrade to the latest release, reproduce the problem with a simple
test
case, and other examples if you wish, and post that with a copy of the
output from:
$ cygcheck -hrsv > cygcheck.out
as a plain text attachment to your post.
>> I understand. We do not ship a stock Cygwin installation. I happen to
have an
>> unmodified 3.1.4 on a development machine and was able to reproduce the
problem
>> with it. But we cannot take frequent Cygwin updates, as it takes far
too long
>> to find and fix problems between Cygwin and our code. The version has
to be
>> stable for months before we can use it.
>> Thanks for the helpful suggestions and information. I'll send updates,
in case
>> anyone else sees a similar problem.
>> Michael Shay
* Re: Trouble with output character sets from Win32 applications running under mksh
2020-08-04 21:19 ` Michael Shay
@ 2020-08-05 2:10 ` Thomas Wolff
2020-08-05 5:22 ` Brian Inglis
1 sibling, 0 replies; 11+ messages in thread
From: Thomas Wolff @ 2020-08-05 2:10 UTC (permalink / raw)
To: cygwin
On 2020-08-04 at 23:19, Michael Shay via Cygwin wrote:
> Michael
The content of your mail responses is not recognizable due to utterly
broken formatting.
[This is not top-posting as there's nothing to respond to]
* Re: Trouble with output character sets from Win32 applications running under mksh
2020-08-04 21:19 ` Michael Shay
2020-08-05 2:10 ` Thomas Wolff
@ 2020-08-05 5:22 ` Brian Inglis
1 sibling, 0 replies; 11+ messages in thread
From: Brian Inglis @ 2020-08-05 5:22 UTC (permalink / raw)
To: cygwin
What mail client are you using?
Could you please fix whatever settings you are using to Reply, to provide
conventional nested quoting of previous content, either keep the existing
wrapping or rewrap properly, and not quote your own new reply content?
It makes the resulting email content unreadable: please see the mailing list
archives for this thread, e.g.
https://cygwin.com/pipermail/cygwin/2020-August/245762.html
Failing that, please do some heavy trimming and editing on your replies, so that
it comes across readably in the archives and in others' email clients.
--
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada
This email may be disturbing to some readers as it contains
too much technical detail. Reader discretion is advised.
[Data in IEC units and prefixes, physical quantities in SI.]
Thread overview: 11+ messages
2020-08-03 15:36 Trouble with character sets Michael Shay
2020-08-03 16:31 ` Brian Inglis
2020-08-03 17:10 ` Michael Shay
2020-08-03 17:42 ` Andrey Repin
2020-08-03 18:15 ` Michael Shay
2020-08-03 21:23 ` Trouble with output character sets from Win32 applications running under mintty Brian Inglis
2020-08-03 22:05 ` Michael Shay
2020-08-04 12:32 ` Trouble with output character sets from Win32 applications running under mksh Brian Inglis
2020-08-04 21:19 ` Michael Shay
2020-08-05 2:10 ` Thomas Wolff
2020-08-05 5:22 ` Brian Inglis