public inbox for cygwin@cygwin.com
From: Brian Inglis <Brian.Inglis@SystematicSw.ab.ca>
To: cygwin@cygwin.com
Subject: Re: UTF-8 quoted args passed to program include quotes when run from cmd
Date: Wed, 14 Oct 2020 23:14:45 -0600
Message-ID: <05455675-7d66-fb00-9973-0a53c33ee796@SystematicSw.ab.ca>
In-Reply-To: <CAFC9CLCx3nAQu6aMYTTL1syr9zyXgHYY0vKCKSCXAf=HpYXDiQ@mail.gmail.com>

[changed subject]

On 2020-10-14 15:47, Jérôme Froissart wrote:
>> (As evidence of this: the Cygwin command line parser was able to break the
>> command line into arguments correctly, but chose to retain the double
>> quotes.)
>>
>>     #include <stdio.h>
>>
>>     int main(int argc, char *argv[])
>>     {
>>         for (int i = 0; argc > i; i++)
>>             printf("%d=%s\n", i, argv[i]);
>>
>>         return 0;
>>     }
>>
>> I compiled this program under Cygwin to produce cyg.exe and ran it under
>> Cygwin and CMD.EXE.

Please post your compile and link command lines, as Cygwin can create native
Windows executables as well as its own Unix-like executables, and the command
line parsing may differ between the two.

>> Cygwin run:
>> billziss@xps:~/Projects/t$ locale
>> LANG=en_US.UTF-8
>> LC_CTYPE="en_US.UTF-8"
>> LC_NUMERIC="en_US.UTF-8"
>> LC_TIME="en_US.UTF-8"
>> LC_COLLATE="en_US.UTF-8"
>> LC_MONETARY="en_US.UTF-8"
>> LC_MESSAGES="en_US.UTF-8"
>> LC_ALL=
>> billziss@xps:~/Projects/t$ ./cyg.exe "foo bar" "Domain\Jérôme"
>> 0=./cyg
>> 1=foo bar
>> 2=Domain\Jérôme

>> CMD.EXE run:
>> C:\Users\billziss\Projects\t>cyg.exe "foo bar" "Domain\Jérôme"
>> 0=cyg
>> 1=foo bar
>> 2="Domain\Jérôme"

>>> Now, let's start a Windows shell (cmd.exe)
>>> Note that I had to copy cygwin1.dll from my Cygwin installation
>>> directory, otherwise binary.exe would not start.
>>> I do not know whether there is a `locale` equivalent in Windows
>>> command prompt, so I merely ran my program.
>>>     C:\Users\Public>binary.exe "foo bar" "Jérôme"
>>>     0=binary
>>>     1=foo bar
>>>     2="Jérôme"

Your Windows CommandLineA/W outputs were confusing.

The point is that Cygwin programs run from cmd shell appear to receive UTF-8
arguments with the surrounding double quotes included intact, whereas the double
quotes are stripped when run from a Cygwin shell.

I think the charset needs to be verified by dumping each arg as hex bytes, e.g.

//!/usr/bin/gcc -g -Og -Wall -Wextra -o quoted-arg-dump quoted-arg-dump.c
// quoted-arg-dump.c - dump quoted args under Cygwin and Windows shells
// outputs:
// $ ./quoted-arg-dump "foo bar" "Jérôme"
// 0 './quoted-arg-dump' 2e 2f 71 75 6f 74 65 64 2d 61 72 67 2d 64 75 6d 70
// 1 'foo bar' 66 6f 6f 20 62 61 72
// 2 'Jérôme' 4a c3 a9 72 c3 b4 6d 65
// >quoted-arg-dump "foo bar" "Jérôme"
// 0 'quoted-arg-dump' 71 75 6f 74 65 64 2d 61 72 67 2d 64 75 6d 70
// 1 'foo bar' 66 6f 6f 20 62 61 72
// 2 '"Jérôme"' 22 4a c3 a9 72 c3 b4 6d 65 22
// checks:
// $ grep -a '[éô]' unicode-symbols.txt
// é  U+00E9  LATIN SMALL LETTER E WITH ACUTE
// ô  U+00F4  LATIN SMALL LETTER O WITH CIRCUMFLEX
// $ grep -a '[éô]' unicode-symbols.txt | od -An -tx1z -w11
// c3 a9 20 20 55 2b 30 30 45 39 20  >..  U+00E9 <
// 20 4c 41 54 49 4e 20 53 4d 41 4c  > LATIN SMAL<
// 4c 20 4c 45 54 54 45 52 20 45 20  >L LETTER E <
// 57 49 54 48 20 41 43 55 54 45 0a  >WITH ACUTE.<
// c3 b4 20 20 55 2b 30 30 46 34 20  >..  U+00F4 <
// 20 4c 41 54 49 4e 20 53 4d 41 4c  > LATIN SMAL<
// 4c 20 4c 45 54 54 45 52 20 4f 20  >L LETTER O <
// 57 49 54 48 20 43 49 52 43 55 4d  >WITH CIRCUM<
// 46 4c 45 58 0a                    >FLEX.<
#include <stdio.h>
int
main(int argc, char *argv[]) {
	for (int a = 0; a < argc; ++a) {
		printf("%d '%s'", a, argv[a]);

		for (char *p = argv[a]; *p; ++p) {
			printf(" %.2hhx", *p);
		} // for chars

		printf("\n");
	} // for args
} // main()

This verifies that, when run from Windows cmd, Cygwin does not strip the double
quotes from an arg containing non-ASCII UTF-8 characters (the plain ASCII arg
"foo bar" is unquoted as expected), and that the args are received and output as
UTF-8 bytes.
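
To make the comparison between shells less error-prone than reading hex by eye,
the same checks could be done in code. The following is only a minimal sketch,
not something from the original test program: the helper names
has_outer_quotes(), has_non_ascii() and hex_bytes() are hypothetical, and it
assumes args arrive as NUL-terminated byte strings, as in the dump above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: true if s begins and ends with a literal '"' (0x22). */
static bool has_outer_quotes(const char *s)
{
    size_t n = strlen(s);
    return n >= 2 && s[0] == '"' && s[n - 1] == '"';
}

/* Hypothetical helper: true if s contains any byte >= 0x80 (not pure ASCII). */
static bool has_non_ascii(const char *s)
{
    for (const unsigned char *p = (const unsigned char *)s; *p; ++p)
        if (*p >= 0x80)
            return true;
    return false;
}

/* Hypothetical helper: format the bytes of s into out as space-separated
 * two-digit hex values, stopping early if out is too small. Returns out. */
static char *hex_bytes(const char *s, char *out, size_t outsz)
{
    size_t used = 0;

    if (outsz == 0)
        return out;
    out[0] = '\0';

    for (const unsigned char *p = (const unsigned char *)s; *p; ++p) {
        int n = snprintf(out + used, outsz - used,
                         used ? " %02x" : "%02x", *p);
        if (n < 0 || (size_t)n >= outsz - used) {
            out[used] = '\0';   /* drop the partially written byte */
            break;
        }
        used += (size_t)n;
    }
    return out;
}
```

Calling these on each argv[a] inside the loop of the dump program would flag
any arg that is both non-ASCII and still wrapped in quotes, which is the case
seen under cmd but not under the Cygwin shell.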

It might be interesting if you could also run from PowerShell and/or Terminal
for comparison to see if the Windows cmd behaviour is reproduced there.

-- 
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada

This email may be disturbing to some readers as it contains
too much technical detail. Reader discretion is advised.
[Data in binary units and prefixes, physical quantities in SI.]
