public inbox for cygwin@cygwin.com
* Unconsistent command-line parsing in case of UTF-8 quoted arguments
@ 2020-10-02 21:40 Jérôme Froissart
  2020-10-03  2:22 ` Doug Henderson
  2020-10-04 11:18 ` Andrey Repin
  0 siblings, 2 replies; 18+ messages in thread
From: Jérôme Froissart @ 2020-10-02 21:40 UTC (permalink / raw)
  To: cygwin

Hello,

While discussing a merge request on another project [1], I think
billziss-gh found an oddity in the way Cygwin parses command-line
arguments when non-ASCII characters come into play.

EXPECTED BEHAVIOUR:
cygwin should parse the following command line
    binary.exe --non-ascii "charaçtérs" --ascii "nothing-fancy-here"
as
    argv = ["binary.exe",
            "--non-ascii",
            "chara\xXX\xXXt\xXX\xXXrs",
            "--ascii",
            "nothing-fancy-here"]
    // \xXX\xXX being the UTF-8 encoding of the special characters,
    // but this does not really matter here
before calling main()

ACTUAL BEHAVIOUR:
it parses it as
    argv = ["binary.exe",
            "--non-ascii",
            "\"chara\xXX\xXXt\xXX\xXXrs\"", // mind the unstripped quotes here...
            "--ascii",
            "nothing-fancy-here" // ...but not here
    ]

It looks like words containing UTF-8 characters do not have their
surrounding quotes stripped, unlike ASCII words.

More examples and a better description are available at [1] (thanks to
billziss-gh for his analysis, much more thorough than mine).
For the record, we wrote a work-around in our specific program, but
handling this issue in Cygwin might be a better way to solve it.
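
For illustration, a work-around of the kind mentioned above could look
like the following. This is a minimal sketch, not the code from [1]: the
helper name strip_outer_quotes is hypothetical, and it assumes the
argument is a writable, NUL-terminated string such as an argv element.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not taken from sshfs-win or Cygwin): if arg both
 * starts and ends with a double quote, remove that pair in place.
 * It works on bytes, so UTF-8 content between the quotes is untouched. */
char *strip_outer_quotes(char *arg)
{
    assert(arg != NULL);

    size_t len = strlen(arg);
    if (len >= 2 && arg[0] == '"' && arg[len - 1] == '"') {
        memmove(arg, arg + 1, len - 2);  /* shift content left by one byte */
        arg[len - 2] = '\0';             /* drop the closing quote */
    }
    return arg;
}
```

Calling it on each argv[i] at the top of main() leaves arguments that
are not wrapped in quotes unchanged, so it should be harmless when the
program is started from bash.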

[1]: https://github.com/billziss-gh/sshfs-win/pull/208 (Checking for
quotes around non-ascii usernames passed by Windows)

Thanks for your help! In case you don't have time, please tell me
where to look, and I might try to fix it myself and send a patch
proposal if that is easy enough (I have never read Cygwin's code).
Jérôme

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Unconsistent command-line parsing in case of UTF-8 quoted arguments
  2020-10-02 21:40 Unconsistent command-line parsing in case of UTF-8 quoted arguments Jérôme Froissart
@ 2020-10-03  2:22 ` Doug Henderson
  2020-10-04 11:18 ` Andrey Repin
  1 sibling, 0 replies; 18+ messages in thread
From: Doug Henderson @ 2020-10-03  2:22 UTC (permalink / raw)
  To: cygwin

On Fri, 2 Oct 2020 at 15:41, Jérôme Froissart wrote:
>
> By discussing a merge request on another project [1], I think
> billziss-gh found a weirdness in the way Cygwin parses the command
> line arguments when non-ASCII characters come into play.
>
> EXPECTED BEHAVIOUR:
> cygwin should parse the following command line
>     binary.exe --non-ascii "charaçtérs" --ascii "nothing-fancy-here"

Please show us the output from "uname -a" and "locale" run from the bash prompt.

Tell us more about "binary.exe". Is it compiled for Cygwin with gcc,
for Windows with mingw64, or for Windows with a native toolchain? Also,
are you running it from a bash prompt or a cmd.exe prompt?

-- 
Doug Henderson, Calgary, Alberta, Canada - from gmail.com


* Re: Unconsistent command-line parsing in case of UTF-8 quoted arguments
  2020-10-02 21:40 Unconsistent command-line parsing in case of UTF-8 quoted arguments Jérôme Froissart
  2020-10-03  2:22 ` Doug Henderson
@ 2020-10-04 11:18 ` Andrey Repin
  2020-10-06 21:36   ` Jérôme Froissart
  1 sibling, 1 reply; 18+ messages in thread
From: Andrey Repin @ 2020-10-04 11:18 UTC (permalink / raw)
  To: Jérôme Froissart, cygwin

Greetings, Jérôme Froissart!

> By discussing a merge request on another project [1], I think
> billziss-gh found a weirdness in the way Cygwin parses the command
> line arguments when non-ASCII characters come into play.

> EXPECTED BEHAVIOUR:
> cygwin should parse the following command line
>     binary.exe --non-ascii "charaçtérs" --ascii "nothing-fancy-here"
> as
>     argv = ["binary.exe",
>             "--non-ascii",
>             "chara\xXX\xXXt\xXX\xXXrs",
>             "--ascii",
>             "nothing-fancy-here"]
>     // \xXX\xXX being the UTF-8 encoding of the special characters,
> but this does not really matter here
> before calling main()

> ACTUAL BEHAVIOUR:
> it parses it as
>     argv = ["binary.exe",
>             "--non-ascii",
>             "\"chara\xXX\xXXt\xXX\xXXrs\"", // mind the unstripped
> quotes here...
>             "--ascii",
>             "nothing-fancy-here" // ...but not here
>     ]

> It looks like words containing UTF-8 characters do not have their
> surrounding quotes stripped, unlike ASCII words.

> More examples and a better description are available at [1] (thanks to
> billziss-gh for his analysis, much more thorough than mine).
> For the record, we wrote a work-around in our specific program, but
> handling this issue in Cygwin might be a better way to solve it.

> [1]: https://github.com/billziss-gh/sshfs-win/pull/208 (Checking for
> quotes around non-ascii usernames passed by Windows)

> Thanks for your help! In case you didn't have time, please tell me
> where to look at, and I might try to fix it myself and send a patch
> proposal if that is easy enough (I have never read Cygwin's code yet).

It looks as if the Cygwin command was launched from a non-Cygwin terminal or
from a terminal where the locale was not set to a Unicode one.

Please provide the output of the "locale" command right before running your
test binary.


-- 
With best regards,
Andrey Repin
Sunday, October 4, 2020 14:16:17

Sorry for my terrible english...


* Re: Unconsistent command-line parsing in case of UTF-8 quoted arguments
  2020-10-04 11:18 ` Andrey Repin
@ 2020-10-06 21:36   ` Jérôme Froissart
  2020-10-07  1:10     ` Andrey Repin
                       ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Jérôme Froissart @ 2020-10-06 21:36 UTC (permalink / raw)
  To: cygwin; +Cc: Jérôme Froissart

Thanks for your replies.
This issue only happens when a program is run from cmd.exe, not from a
Cygwin bash shell.
This is important for me, since I discovered this bug in a project
that must be run from the Windows graphical shell (i.e. there is no
sensible way to run it through Cygwin and Bash).

> Please show us the output from "uname -a" and "locale" run from the bash prompt.

> Please provide the results of "locale" command right before running your test
> binary.
Here are the more detailed steps to reproduce the issue (along with
answers to your requests about `uname`, `locale`, etc.).
(I mostly reproduced what billziss-gh had done before, so I do not
take all the credit :D)

Here is an example C file
    $ cat example.c
    #include <stdio.h>

    const char *GetCommandLineA(void);

    int main(int argc, char *argv[])
    {
        const char *s = GetCommandLineA();
        printf("C=%s\n", s);

        for (int i = 0; argc > i; i++)
            printf("%d=%s\n", i, argv[i]);

        return 0;
    }

I have built it with gcc from Cygwin
    $ gcc -o binary example.c

Running it from the same Cygwin bash prompt works as expected
    $ uname -a
    CYGWIN_NT-10.0 XPS 3.1.5(0.340/5/3) 2020-06-01 08:59 x86_64 Cygwin
    # (XPS is my Windows machine name)

    $ locale
    LANG=fr_FR.UTF-8
    LC_CTYPE="fr_FR.UTF-8"
    LC_NUMERIC="fr_FR.UTF-8"
    LC_TIME="fr_FR.UTF-8"
    LC_COLLATE="fr_FR.UTF-8"
    LC_MONETARY="fr_FR.UTF-8"
    LC_MESSAGES="fr_FR.UTF-8"
    LC_ALL=

    $ which gcc
    /usr/bin/gcc

    # The following runs as expected
    $ ./binary.exe "foo bar" "Jérôme"
    C="C:\Users\Public\binary.exe"
    0=./binary
    1=foo bar
    2=Jérôme

Now, let's start a Windows shell (cmd.exe)
Note that I had to copy cygwin1.dll from my Cygwin installation
directory, otherwise binary.exe would not start.
I do not know whether there is a `locale` equivalent in Windows
command prompt, so I merely ran my program.
    C:\Users\Public>binary.exe "foo bar" "Jérôme"
    C=binary.exe  "foo bar" "J□r□me"
    0=binary
    1=foo bar
    2="Jérôme"

This behaviour is not expected and is quite inconsistent with what
happened through Bash.
Besides the "strange squares" that appear on the first line, and the
extra space after binary.exe, I especially did not expect "Jérôme" to
remain quoted as a second argument.

Sorry for the delay in my answer. I hope this is now clear, please ask
me for more examples or investigation if you need.
Thanks for your help.

Jérôme


* Re: Unconsistent command-line parsing in case of UTF-8 quoted arguments
  2020-10-06 21:36   ` Jérôme Froissart
@ 2020-10-07  1:10     ` Andrey Repin
  2020-10-07 22:21       ` Jérôme Froissart
  2020-10-07  2:20     ` Brian Inglis
                       ` (3 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Andrey Repin @ 2020-10-07  1:10 UTC (permalink / raw)
  To: Jérôme Froissart, cygwin

Greetings, Jérôme Froissart!

> Now, let's start a Windows shell (cmd.exe)

That explains it.

> Note that I had to copy cygwin1.dll from my Cygwin installation
> directory, otherwise binary.exe would not start.
> I do not know whether there is a `locale` equivalent in Windows

We've specifically asked to run Cygwin's /bin/locale.exe tool.

> command prompt, so I merely ran my program.
>     C:\Users\Public>binary.exe "foo bar" "Jérôme"
>     C=binary.exe  "foo bar" "J□r□me"
>     0=binary
>     1=foo bar
>     2="Jérôme"

> This behaviour is not expected and is quite inconsistent with what
> happened through Bash.
> Besides the "strange squares" that appear on the first line, and the

1. Run CMD in a more capable terminal: either use M$ Terminal 1.0, or select
a TrueType font for your console.

> extra space after binary.exe, I especially did not expect "Jérôme" to
> remain quoted as a second argument.

2. Then you are parsing the command line wrong. In Windows, it is up to the
called program to parse the command line.
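
To illustrate what "parsing the command line" means here, the sketch
below splits a single command-line string into arguments and drops the
double quotes. It is a simplified assumption of the behaviour, not
Cygwin's or Microsoft's actual parser: the name split_cmdline is
hypothetical and backslash escaping is ignored. Because it only compares
bytes against the quote, space and tab characters, UTF-8 bytes inside
the quotes pass through unchanged.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified sketch: split cmdline in place into at most max_args
 * arguments. Spaces and tabs separate arguments unless inside double
 * quotes; the quotes themselves are dropped. Backslash escapes, which
 * the real Windows rules include, are deliberately ignored here.
 * Returns the number of arguments stored in argv. */
int split_cmdline(char *cmdline, char **argv, int max_args)
{
    assert(cmdline != NULL && argv != NULL);

    int argc = 0;
    char *src = cmdline;

    while (argc < max_args) {
        while (*src == ' ' || *src == '\t')   /* skip separators */
            src++;
        if (*src == '\0')
            break;

        char *dst = src;      /* compact the argument over its own text */
        argv[argc++] = dst;
        int in_quotes = 0;

        while (*src != '\0' && (in_quotes || (*src != ' ' && *src != '\t'))) {
            if (*src == '"')
                in_quotes = !in_quotes;       /* toggle, do not copy */
            else
                *dst++ = *src;                /* bytes >= 0x80 copied as-is */
            src++;
        }

        if (*src != '\0')
            src++;            /* step past the separator before terminating */
        *dst = '\0';
    }
    return argc;
}
```

With this sketch, a quoted argument containing non-ASCII characters is
treated exactly like a quoted ASCII one, which is the behaviour the
original report expected.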

> Sorry for the delay in my answer. I hope this is now clear, please ask
> me for more examples or investigation if you need.
> Thanks for your help.


-- 
With best regards,
Andrey Repin
Wednesday, October 7, 2020 1:02:59

Sorry for my terrible english...


* Re: Unconsistent command-line parsing in case of UTF-8 quoted arguments
  2020-10-06 21:36   ` Jérôme Froissart
  2020-10-07  1:10     ` Andrey Repin
@ 2020-10-07  2:20     ` Brian Inglis
  2020-10-07  5:17     ` Thomas Wolff
                       ` (2 subsequent siblings)
  4 siblings, 0 replies; 18+ messages in thread
From: Brian Inglis @ 2020-10-07  2:20 UTC (permalink / raw)
  To: cygwin

On 2020-10-06 15:36, Jérôme Froissart wrote:
> Thanks for your replies.
> This issue only happens when a program is run from cmd.exe, not from a
> Cygwin bash shell.
> This is important for me, since I discovered this bug in a project
> that must be run from Windows graphical shell (i.e. there is no
> sensible way to run it through Cygwin and Bash).
> 
>> Please show us the output from "uname -a" and "locale" run from the bash prompt.
> 
>> Please provide the results of "locale" command right before running your test
>> binary.
> Here are the more detailed steps to reproduce the issue (along with
> answers to your requests about `uname`, `locale`, etc.).
> (I mostly reproduced what billziss-gh had done before, I do not take
> all the credits :D)
> 
> Here is an example C file

> I have built it with gcc from Cygwin
>     $ gcc -o binary example.c
> 
> Running it from the same Cygwin bash prompt works as expected
>     $ uname -a
>     CYGWIN_NT-10.0 XPS 3.1.5(0.340/5/3) 2020-06-01 08:59 x86_64 Cygwin
>     # (XPS is my Windows machine name)
> 
>     $ locale
>     LANG=fr_FR.UTF-8
>     LC_CTYPE="fr_FR.UTF-8"
>     LC_NUMERIC="fr_FR.UTF-8"
>     LC_TIME="fr_FR.UTF-8"
>     LC_COLLATE="fr_FR.UTF-8"
>     LC_MONETARY="fr_FR.UTF-8"
>     LC_MESSAGES="fr_FR.UTF-8"
>     LC_ALL=
> 
>     $ which gcc
>     /usr/bin/gcc
> 
>     # The following runs as expected
>     $ ./binary.exe "foo bar" "Jérôme"
>     C="C:\Users\Public\binary.exe"
>     0=./binary
>     1=foo bar
>     2=Jérôme
> 
> Now, let's start a Windows shell (cmd.exe)
> Note that I had to copy cygwin1.dll from my Cygwin installation
> directory, otherwise binary.exe would not start.
> I do not know whether there is a `locale` equivalent in Windows
> command prompt, so I merely ran my program.
>     C:\Users\Public>binary.exe "foo bar" "Jérôme"
>     C=binary.exe  "foo bar" "J□r□me"
>     0=binary
>     1=foo bar
>     2="Jérôme"
> 
> This behaviour is not expected and is quite inconsistent with what
> happened through Bash.
> Besides the "strange squares" that appear on the first line, and the
> extra space after binary.exe, I especially did not expect "Jérôme" to
> remain quoted as a second argument.
> 
> Sorry for the delay in my answer. I hope this is now clear, please ask
> me for more examples or investigation if you need.
> Thanks for your help.

Create a new or change your current Command Prompt shortcut to run:

	"%windir%\system32\cmd /u"

"/U Causes the output of internal commands to a pipe or file to be Unicode"

and add "chcp 65001":

	"%windir%\system32\cmd /u /k chcp 65001"

or set

	HKEY_LOCAL_MACHINE\Software\Microsoft\Command Processor\Autorun

or

	HKEY_CURRENT_USER\Software\Microsoft\Command Processor\AutoRun

to command

	"@chcp 65001 > nul"

e.g.

	> reg add "HKEY_CURRENT_USER\Software\Microsoft\Command Processor" ^
		/v AutoRun /d "@chcp 65001 > nul" /f

-- 
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada

This email may be disturbing to some readers as it contains
too much technical detail. Reader discretion is advised.
[Data in binary units and prefixes, physical quantities in SI.]


* Re: Unconsistent command-line parsing in case of UTF-8 quoted arguments
  2020-10-06 21:36   ` Jérôme Froissart
  2020-10-07  1:10     ` Andrey Repin
  2020-10-07  2:20     ` Brian Inglis
@ 2020-10-07  5:17     ` Thomas Wolff
  2020-10-07 23:32       ` Brian Inglis
  2020-10-13 16:30     ` Kaz Kylheku (Cygwin)
  2020-10-13 17:34     ` Brian Inglis
  4 siblings, 1 reply; 18+ messages in thread
From: Thomas Wolff @ 2020-10-07  5:17 UTC (permalink / raw)
  To: cygwin



On 06.10.2020 at 23:36, Jérôme Froissart wrote:
> Thanks for your replies.
> This issue only happens when a program is run from cmd.exe, not from a
> Cygwin bash shell.
> This is important for me, since I discovered this bug in a project
> that must be run from Windows graphical shell (i.e. there is no
> sensible way to run it through Cygwin and Bash).
>
>> Please show us the output from "uname -a" and "locale" run from the bash prompt.
> Running it from the same Cygwin bash prompt works as expected
>      $ uname -a
>      CYGWIN_NT-10.0 XPS 3.1.5(0.340/5/3) 2020-06-01 08:59 x86_64 Cygwin
Please update to Cygwin 3.1.7; there were issues with command-line
quoting before, and I'm not sure whether there has already been a tweak
since 3.1.5.


* Re: Unconsistent command-line parsing in case of UTF-8 quoted arguments
  2020-10-07  1:10     ` Andrey Repin
@ 2020-10-07 22:21       ` Jérôme Froissart
  2020-10-11 18:55         ` Andrey Repin
  0 siblings, 1 reply; 18+ messages in thread
From: Jérôme Froissart @ 2020-10-07 22:21 UTC (permalink / raw)
  To: cygwin; +Cc: Jérôme Froissart

Thanks for your reply.

Andrey Repin wrote:
> 1. Run CMD in a more capable terminal. Either M$ Terminal 1.0, or select true
> type font for your console.
I tried Windows Terminal 1.3, but this did not change anything :-(
Besides, I think my cmd.exe was already using True Type fonts (if I
understand the icons from the settings window correctly)

Anyway, I now understand that the terminal I use matters. In my case
however, I do not intend to run the binary (built with Cygwin) in a
terminal at all.
I am using sshfs-win [2]. It is built with Cygwin, but it is then used
as a standalone executable, without any GUI. It is called by a Windows
component/driver (with a command line that contains quoted UTF-8
arguments), invoked by some clicks and actions from the 'My computer'
window. What could I do so that this program correctly handles the
command line?
> 2. Then you are parsing the command line wrong. In Windows, it is up to called
> program to parse the command line.
Right, but my program starts at `int main(int argc, char *argv[])`,
where the parsing is already handled (by some Cygwin runtime
component?). How could I parse it differently?
And would it even make sense for me to parse it in a custom way? Since
(I suppose) every C program built by Cygwin faces the same issue,
wouldn't we rather want a "universal" change in how the Cygwin runtime
parses command lines?
For the record, this is what I have done in this program [1], but that
feels more like a work-around for some UTF-8-related bug than proper,
custom command-line parsing :-S

...or maybe I'm completely mistaken about how Cygwin works, in which
case I'd be happy to be told :-)

[1] https://github.com/billziss-gh/sshfs-win/pull/208
[2] https://github.com/billziss-gh/sshfs-win

Thanks for your help
Jérôme


* Re: Unconsistent command-line parsing in case of UTF-8 quoted arguments
  2020-10-07  5:17     ` Thomas Wolff
@ 2020-10-07 23:32       ` Brian Inglis
  2020-10-08  0:59         ` Eliot Moss
  0 siblings, 1 reply; 18+ messages in thread
From: Brian Inglis @ 2020-10-07 23:32 UTC (permalink / raw)
  To: cygwin

On 2020-10-06 23:17, Thomas Wolff wrote:
> 
> 
> On 06.10.2020 at 23:36, Jérôme Froissart wrote:
>> Thanks for your replies.
>> This issue only happens when a program is run from cmd.exe, not from a
>> Cygwin bash shell.
>> This is important for me, since I discovered this bug in a project
>> that must be run from Windows graphical shell (i.e. there is no
>> sensible way to run it through Cygwin and Bash).
>>
>>> Please show us the output from "uname -a" and "locale" run from the bash prompt.
>> Running it from the same Cygwin bash prompt works as expected
>>      $ uname -a
>>      CYGWIN_NT-10.0 XPS 3.1.5(0.340/5/3) 2020-06-01 08:59 x86_64 Cygwin
> Please update to cygwin 3.1.7; there were issues about command line quoting
> before, I'm not sure whether there was a tweak since 3.1.5 already.

[PATCH] Cygwin: console: Replace WriteConsoleA() with WriteConsoleW():
	https://cygwin.com/pipermail/cygwin-patches/2020q3/010495.html

[PATCH v4 1/3] Cygwin: rewrite and make public cmdline parser:
	https://cygwin.com/pipermail/cygwin-patches/2020q3/010577.html
Issues raised and no v5 response so far

-- 
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada

This email may be disturbing to some readers as it contains
too much technical detail. Reader discretion is advised.
[Data in binary units and prefixes, physical quantities in SI.]


* Re: Unconsistent command-line parsing in case of UTF-8 quoted arguments
  2020-10-07 23:32       ` Brian Inglis
@ 2020-10-08  0:59         ` Eliot Moss
  2020-10-08  6:22           ` Brian Inglis
  0 siblings, 1 reply; 18+ messages in thread
From: Eliot Moss @ 2020-10-08  0:59 UTC (permalink / raw)
  To: cygwin, Brian Inglis

I think what we mean is that, under Windows cmd, some things the shell does for you under Linux and 
Cygwin will not have been done.  For example, there is "glob" expansion of filenames.  If I write 
*.txt under bash, it gets expanded to a space-separated list of names of files that match that 
pattern.  This happens _before_ calling my program.  If the program is run from Windows cmd.exe, the 
program will receive an argument *.txt, and it will have to do the "globbing" itself.  Etc.
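
As a rough sketch of what "doing the globbing itself" could look like in
C, the following uses the POSIX glob() function to expand a pattern that
the program received literally. The function name print_matches is
hypothetical and error handling is kept minimal.

```c
#include <assert.h>
#include <glob.h>
#include <stdio.h>

/* Expand one wildcard pattern with POSIX glob() and print each match.
 * Returns the number of matches, 0 if nothing matched, or -1 on any
 * other glob() error. */
int print_matches(const char *pattern)
{
    assert(pattern != NULL);

    glob_t results;
    int count;
    int rc = glob(pattern, 0, NULL, &results);

    if (rc == 0) {
        for (size_t i = 0; i < results.gl_pathc; i++)
            printf("%s\n", results.gl_pathv[i]);
        count = (int)results.gl_pathc;
    } else if (rc == GLOB_NOMATCH) {
        count = 0;
    } else {
        count = -1;
    }

    globfree(&results);   /* release whatever glob() allocated */
    return count;
}
```

A program could call this on each argument that still contains a
wildcard character such as * or ?.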

Regards - Eliot Moss


* Re: Unconsistent command-line parsing in case of UTF-8 quoted arguments
  2020-10-08  0:59         ` Eliot Moss
@ 2020-10-08  6:22           ` Brian Inglis
  0 siblings, 0 replies; 18+ messages in thread
From: Brian Inglis @ 2020-10-08  6:22 UTC (permalink / raw)
  To: cygwin

On 2020-10-07 18:59, Eliot Moss wrote:
> I think what we mean is that, under Windows cmd, some things the shell does for
> you under Linux and Cygwin will not have been done.  For example, there is
> "glob" expansion of filenames.  If I write *.txt under bash, it gets expanded to
> a space-separated list of names of files that match that pattern.  This happens
> _before_ calling my program.  If the program is run from Windows cmd.exe, the
> program will receive an argument *.txt, and it will have to do the "globbing"
> itself.  Etc.

That's handled automatically by the Cygwin program startup command line parser
if it is not passed a "Cygwin" command line: that avoids the startup expanding
quoted args that contain wildcards passed from another Cygwin program or shell.

-- 
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada

This email may be disturbing to some readers as it contains
too much technical detail. Reader discretion is advised.
[Data in binary units and prefixes, physical quantities in SI.]


* Re: Unconsistent command-line parsing in case of UTF-8 quoted arguments
  2020-10-07 22:21       ` Jérôme Froissart
@ 2020-10-11 18:55         ` Andrey Repin
  0 siblings, 0 replies; 18+ messages in thread
From: Andrey Repin @ 2020-10-11 18:55 UTC (permalink / raw)
  To: Jérôme Froissart, cygwin

Greetings, Jérôme Froissart!

> Andrey Repin wrote:
>> 1. Run CMD in a more capable terminal. Either M$ Terminal 1.0, or select true
>> type font for your console.
> I tried Windows Terminal 1.3, but this did not change anything :-(
> Besides, I think my cmd.exe was already using True Type fonts (if I
> understand the icons from the settings window correctly)

> Anyway, I now understand that the terminal I use matters. In my case
> however, I do not intend to run the binary (built with Cygwin) in a
> terminal at all.
> I am using win-sshfs [2]. It is built from Cygwin, but it is then used
> as a standalone executable, without any GUI. It is called by a Windows
> component/driver (with a command line that contains quoted UTF-8
> arguments), invoked by some clicks and actions from the 'My computer'
> window. What could I do so that this program correctly handles the
> command line?

I would at least run it with the LANG environment variable set,
e.g. LANG=ru_RU.UTF-8 in my case.
I further tweak it with
LC_MESSAGES=en_US.UTF-8
LC_NUMERIC=en_US.UTF-8

so that program output is more consistent to parse.

>> 2. Then you are parsing the command line wrong. In Windows, it is up to called
>> program to parse the command line.
> Right, but my program starts at `int main(int argc, char *argv[])`,
> where the parsing is already handled (by some Cygwin runtime
> component?). How could I parse it differently?
> And would that even make sense that I parse it in a custom way? Since
> -I suppose- every C program built by Cygwin faces the same issues,
> wouldn't we rather want a "universal" change on how the Cygwin runtime
> parses command lines?
> For the record, this is what I have done in this program [1], but that
> feels more like a work around some UTF-8-related bug than a proper,
> custom command line parsing :-S

> ...or maybe I'm completely mistaken in how Cygwin works, in case I'd
> be happy to be told :-)

> [1] https://github.com/billziss-gh/sshfs-win/pull/208
> [2] https://github.com/billziss-gh/sshfs-win

P.S.

I suggest
ln -fs /proc/cygdrive/c/Windows/System32/chcp.com /usr/local/bin/chcp


-- 
With best regards,
Andrey Repin
Sunday, October 11, 2020 21:51:57

Sorry for my terrible english...


* Re: Unconsistent command-line parsing in case of UTF-8 quoted arguments
  2020-10-06 21:36   ` Jérôme Froissart
                       ` (2 preceding siblings ...)
  2020-10-07  5:17     ` Thomas Wolff
@ 2020-10-13 16:30     ` Kaz Kylheku (Cygwin)
  2020-10-14 21:47       ` Jérôme Froissart
  2020-10-13 17:34     ` Brian Inglis
  4 siblings, 1 reply; 18+ messages in thread
From: Kaz Kylheku (Cygwin) @ 2020-10-13 16:30 UTC (permalink / raw)
  To: Jérôme Froissart; +Cc: cygwin

On 2020-10-06 14:36, Jérôme Froissart wrote:
> Here is an example C file
>     $ cat example.c
>     #include <stdio.h>
> 
>     const char *GetCommandLineA(void);
> 
>     int main(int argc, char *argv[])
>     {
>         const char *s = GetCommandLineA();
>         printf("C=%s\n", s);
> 
>         for (int i = 0; argc > i; i++)
>             printf("%d=%s\n", i, argv[i]);
> 
>         return 0;
>     }

Your program's comparison seems to be based on the
hypothesis that Cygwin parses the GetCommandLineA() command line.

But this hypothesis is almost certainly wrong.

> Now, let's start a Windows shell (cmd.exe)
> Note that I had to copy cygwin1.dll from my Cygwin installation
> directory, otherwise binary.exe would not start.
> I do not know whether there is a `locale` equivalent in Windows
> command prompt, so I merely ran my program.
>     C:\Users\Public>binary.exe "foo bar" "Jérôme"
>     C=binary.exe  "foo bar" "J□r□me"
>     0=binary
>     1=foo bar
>     2="Jérôme"

The "A" command line from GetCommandLineA has "tofu"
characters: é and ô were not decoded properly.

The é and ô characters we see in the Cygwin-parsed
arguments coming into main could not have been recovered
from these "tofu" replacement characters.

What is actually being parsed must be the WCHAR command line
corresponding to what comes from GetCommandLineW().

It's necessary to show that one to get a more complete understanding.
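
As an illustration of why the arguments seen by main() can contain
proper é and ô bytes even though the "A" command line does not, the
sketch below converts a UTF-16 string (the encoding of the WCHAR
command line on Windows) to UTF-8. This is only a sketch under that
assumption, not Cygwin's conversion code; the name utf16_to_utf8 is
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert a NUL-terminated UTF-16 string to UTF-8.
 * Writes at most out_size bytes including the terminating NUL and
 * returns the number of bytes written before the NUL, or (size_t)-1
 * if the output buffer is too small or a surrogate is unpaired. */
size_t utf16_to_utf8(const uint16_t *in, char *out, size_t out_size)
{
    assert(in != NULL && out != NULL);

    if (out_size == 0)
        return (size_t)-1;

    size_t n = 0;
    while (*in != 0) {
        uint32_t cp = *in++;

        if (cp >= 0xD800 && cp <= 0xDBFF) {           /* high surrogate */
            if (*in < 0xDC00 || *in > 0xDFFF)
                return (size_t)-1;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(*in++ - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {    /* stray low surrogate */
            return (size_t)-1;
        }

        size_t need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (n + need >= out_size)
            return (size_t)-1;                        /* keep room for NUL */

        if (need == 1) {
            out[n++] = (char)cp;
        } else if (need == 2) {
            out[n++] = (char)(0xC0 | (cp >> 6));
            out[n++] = (char)(0x80 | (cp & 0x3F));
        } else if (need == 3) {
            out[n++] = (char)(0xE0 | (cp >> 12));
            out[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (char)(0x80 | (cp & 0x3F));
        } else {
            out[n++] = (char)(0xF0 | (cp >> 18));
            out[n++] = (char)(0x80 | ((cp >> 12) & 0x3F));
            out[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (char)(0x80 | (cp & 0x3F));
        }
    }
    out[n] = '\0';
    return n;
}
```

For example, the code unit 0x00E9 (é) becomes the two bytes 0xC3 0xA9,
which is the kind of \xXX\xXX sequence a UTF-8 argv would carry.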



* Re: Unconsistent command-line parsing in case of UTF-8 quoted arguments
  2020-10-06 21:36   ` Jérôme Froissart
                       ` (3 preceding siblings ...)
  2020-10-13 16:30     ` Kaz Kylheku (Cygwin)
@ 2020-10-13 17:34     ` Brian Inglis
  4 siblings, 0 replies; 18+ messages in thread
From: Brian Inglis @ 2020-10-13 17:34 UTC (permalink / raw)
  To: cygwin

On 2020-10-06 15:36, Jérôme Froissart wrote:
> Here are the more detailed steps to reproduce the issue (along with
> answers to your requests about `uname`, `locale`, etc.).
> (I mostly reproduced what billziss-gh had done before, I do not take
> all the credits :D)
> 
> Here is an example C file
>     $ cat example.c
>     #include <stdio.h>
> 
>     const char *GetCommandLineA(void);
> 
>     int main(int argc, char *argv[])
>     {
>         const char *s = GetCommandLineA();
>         printf("C=%s\n", s);
> 
>         for (int i = 0; argc > i; i++)
>             printf("%d=%s\n", i, argv[i]);
> 
>         return 0;
>     }
> 
> I have built it with gcc from Cygwin
>     $ gcc -o binary example.c
> 
> Running it from the same Cygwin bash prompt works as expected
>     $ uname -a
>     CYGWIN_NT-10.0 XPS 3.1.5(0.340/5/3) 2020-06-01 08:59 x86_64 Cygwin
>     # (XPS is my Windows machine name)
> 
>     $ locale
>     LANG=fr_FR.UTF-8
>     LC_CTYPE="fr_FR.UTF-8"
>     LC_NUMERIC="fr_FR.UTF-8"
>     LC_TIME="fr_FR.UTF-8"
>     LC_COLLATE="fr_FR.UTF-8"
>     LC_MONETARY="fr_FR.UTF-8"
>     LC_MESSAGES="fr_FR.UTF-8"
>     LC_ALL=
> 
>     $ which gcc
>     /usr/bin/gcc
> 
>     # The following runs as expected
>     $ ./binary.exe "foo bar" "Jérôme"
>     C="C:\Users\Public\binary.exe"
>     0=./binary
>     1=foo bar
>     2=Jérôme
> 
> Now, let's start a Windows shell (cmd.exe)
> Note that I had to copy cygwin1.dll from my Cygwin installation
> directory, otherwise binary.exe would not start.
> I do not know whether there is a `locale` equivalent in Windows
> command prompt, so I merely ran my program.
>     C:\Users\Public>binary.exe "foo bar" "Jérôme"
>     C=binary.exe  "foo bar" "J□r□me"
>     0=binary
>     1=foo bar
>     2="Jérôme"
> 
> This behaviour is not expected and is quite inconsistent with what
> happened through Bash.
> Besides the "strange squares" that appear on the first line, and the
> extra space after binary.exe, I especially did not expect "Jérôme" to
> remain quoted as a second argument.

Don't call inappropriate Windows functions without understanding the limitations
of Windows and its APIs.
Cygwin args are consistent with what you ran and what we would all expect.
I don't see any Cygwin problems here.

-- 
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada

This email may be disturbing to some readers as it contains
too much technical detail. Reader discretion is advised.
[Data in binary units and prefixes, physical quantities in SI.]


* Re: Unconsistent command-line parsing in case of UTF-8 quoted arguments
  2020-10-13 16:30     ` Kaz Kylheku (Cygwin)
@ 2020-10-14 21:47       ` Jérôme Froissart
  2020-10-14 22:14         ` Jérôme Froissart
                           ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Jérôme Froissart @ 2020-10-14 21:47 UTC (permalink / raw)
  To: Kaz Kylheku (Cygwin); +Cc: cygwin

Thank you everyone, I now have a better understanding of how Windows
and Cygwin work (being rather a Linux guy, I was not really aware of
all of this).

However, one question still puzzles me. I now
understand _why_ things happen that way, but I am still wondering
whether this is really what we _want_. I mean, keeping the double
quotes around a UTF-8 argument just because the program is not run from
Cygwin's bash sounds like a bug to me (though I definitely
understand the reasons that explain this behaviour). Since I cannot
run my program from bash, I have to resort to manually trimming the
quotes, which I would have liked to avoid.

I'd like to share a message that the maintainer of sshfs-win has
posted on GitHub [1], which is a follow-up to our discussions (he did
not know whether he was able to post to the mailing list without
subscribing first).
(Besides, I unfortunately don't have much time at the moment to
investigate this issue; for instance, I have not yet managed to repeat
the same experiments with the very latest version of Cygwin, so
having his feedback is very valuable.)

Here is what he says:
> It seems to me that the list is missing the important point
> about the double quote characters that should NOT be there
> regardless of how the é and ô characters are being interpreted.
> (As evidence of this: the Cygwin command line parser was able
> to break the command line into arguments correctly, but chose
> to retain the double quotes.)
>
> The choice of GetCommandLineA was for illustration purposes;
> had I used GetCommandLineW I would not be able to printf
> using %ls under CMD.EXE, because of code page issues. However
> here is a modified version of the test program that uses
> GetCommandLineW.
>
>     #include <stdio.h>
>
>     wchar_t *GetCommandLineW(void);
>
>     int main(int argc, char *argv[])
>     {
>         wchar_t *s = GetCommandLineW();
>
>         for (wchar_t *p = s; *p; p++)
>             printf("%04x %c%s",
>                 *p,
>                 32 <= *p && *p < 127 ? *p : '.',
>                 (p - s) % 8 + 1 != 8 ? "   " : "\n");
>         printf("\n");
>
>         for (int i = 0; argc > i; i++)
>             printf("%d=%s\n", i, argv[i]);
>
>         return 0;
>     }
>
> I compiled this program under Cygwin to produce cyg.exe and ran
> it under Cygwin and CMD.EXE.
>
> Cygwin run:
> billziss@xps:~/Projects/t$ locale
> LANG=en_US.UTF-8
> LC_CTYPE="en_US.UTF-8"
> LC_NUMERIC="en_US.UTF-8"
> LC_TIME="en_US.UTF-8"
> LC_COLLATE="en_US.UTF-8"
> LC_MONETARY="en_US.UTF-8"
> LC_MESSAGES="en_US.UTF-8"
> LC_ALL=
> billziss@xps:~/Projects/t$ ./cyg.exe "foo bar" "Domain\Jérôme"
> 0022 "   0043 C   003a :   005c \   0055 U   0073 s   0065 e   0072 r
> 0073 s   005c \   0062 b   0069 i   006c l   006c l   007a z   0069 i
> 0073 s   0073 s   005c \   0050 P   0072 r   006f o   006a j   0065 e
> 0063 c   0074 t   0073 s   005c \   0074 t   005c \   0063 c   0079 y
> 0067 g   002e .   0065 e   0078 x   0065 e   0022 "
> 0=./cyg
> 1=foo bar
> 2=Domain\Jérôme
>
> CMD.EXE run:
>
> C:\Users\billziss\Projects\t>\Windows\System32\chcp.com
> Active code page: 437
>
> C:\Users\billziss\Projects\t>cyg.exe "foo bar" "Domain\Jérôme"
> 0063 c   0079 y   0067 g   002e .   0065 e   0078 x   0065 e   0020
> 0020     0022 "   0066 f   006f o   006f o   0020     0062 b   0061 a
> 0072 r   0022 "   0020     0022 "   0044 D   006f o   006d m   0061 a
> 0069 i   006e n   005c \   004a J   00e9 .   0072 r   00f4 .   006d m
> 0065 e   0022 "
> 0=cyg
> 1=foo bar
> 2="Domain\Jérôme"


[1] https://github.com/billziss-gh/sshfs-win/pull/208

Thank you very much
Jérôme

On Tue, 13 Oct 2020 at 18:30, Kaz Kylheku (Cygwin)
<743-406-3965@kylheku.com> wrote:
>
> On 2020-10-06 14:36, Jérôme Froissart wrote:
> > Here is an example C file
> >     $ cat example.c
> >     #include <stdio.h>
> >
> >     const char *GetCommandLineA(void);
> >
> >     int main(int argc, char *argv[])
> >     {
> >         const char *s = GetCommandLineA();
> >         printf("C=%s\n", s);
> >
> >         for (int i = 0; argc > i; i++)
> >             printf("%d=%s\n", i, argv[i]);
> >
> >         return 0;
> >     }
>
> Your program's comparison seems to be based on the
> hypothesis that Cygwin parses the GetCommandLineA() command line.
>
> But this hypothesis is almost certainly wrong.
>
> > Now, let's start a Windows shell (cmd.exe)
> > Note that I had to copy cygwin1.dll from my Cygwin installation
> > directory, otherwise binary.exe would not start.
> > I do not know whether there is a `locale` equivalent in Windows
> > command prompt, so I merely ran my program.
> >     C:\Users\Public>binary.exe "foo bar" "Jérôme"
> >     C=binary.exe  "foo bar" "J□r□me"
> >     0=binary
> >     1=foo bar
> >     2="Jérôme"
>
> The "A" command line from GetCommandLineA has "tofu"
> characters: é and ô were not decoded properly.
>
> The é and ô characters we see in the Cygwin-parsed
> arguments coming into main could not have been recovered
> from these "tofu" replacement characters.
>
> What is actually being parsed must be the WCHAR command line
> corresponding to what comes from GetCommandLineW().
>
> It's necessary to show that one to get a more complete understanding.
>


* Re: Unconsistent command-line parsing in case of UTF-8 quoted arguments
  2020-10-14 21:47       ` Jérôme Froissart
@ 2020-10-14 22:14         ` Jérôme Froissart
  2020-10-15  5:14         ` UTF-8 quoted args passed to program include quotes when run from cmd Brian Inglis
  2020-10-19  2:32         ` Unconsistent command-line parsing in case of UTF-8 quoted arguments Kaz Kylheku (Cygwin)
  2 siblings, 0 replies; 18+ messages in thread
From: Jérôme Froissart @ 2020-10-14 22:14 UTC (permalink / raw)
  To: Jérôme Froissart; +Cc: Kaz Kylheku (Cygwin), cygwin

On Wed, 14 Oct 2020 at 23:47, Jérôme Froissart <software@froissart.eu> wrote:
> However, there is still a question that is puzzling me. I now
> understand _why_ things happen that way, but I am still wondering
> whether this is really what we _want_. I mean, keeping the double
> quotes around a UTF-8 argument just because the program is not run from
> Cygwin's bash sounds like a bug to me, doesn't it? (yet I definitely
> understand the reasons that explain this behaviour). Since I cannot
> run my program from bash, I have to resort to manually trimming the
> quotes, which I would have liked to avoid.

Just to rephrase what is puzzling me:
When I understood that sshfs-win had a bug when an argument contained
diacritics, I expected any of several possible causes: mismatched code
pages, poorly handled encodings, implicit conversions between UTF-8 and
Latin-1, etc., any of which would make some sense.
But I definitely did not expect that "double quotes were not properly
removed by the runtime", which (imho) does not make any sense.

I hope I have managed to rephrase my problem clearly :D
Thanks to all of you for your help!


* Re: UTF-8 quoted args passed to program include quotes when run from cmd
  2020-10-14 21:47       ` Jérôme Froissart
  2020-10-14 22:14         ` Jérôme Froissart
@ 2020-10-15  5:14         ` Brian Inglis
  2020-10-19  2:32         ` Unconsistent command-line parsing in case of UTF-8 quoted arguments Kaz Kylheku (Cygwin)
  2 siblings, 0 replies; 18+ messages in thread
From: Brian Inglis @ 2020-10-15  5:14 UTC (permalink / raw)
  To: cygwin

[changed subject]

On 2020-10-14 15:47, Jérôme Froissart wrote:
>> (As evidence of this: the Cygwin command line parser was able to break the
>> command line into arguments correctly, but chose to retain the double
>> quotes.)
>>
>>     #include <stdio.h>
>>
>>     int main(int argc, char *argv[])
>>     {
>>         for (int i = 0; argc > i; i++)
>>             printf("%d=%s\n", i, argv[i]);
>>
>>         return 0;
>>     }
>>
>> I compiled this program under Cygwin to produce cyg.exe and ran it under
>> Cygwin and CMD.EXE.

Please post your compile and link command lines, as Cygwin can create native
Windows executables as well as its own Unix-like ones, and the command line
parsing may vary between them.

>> Cygwin run:
>>> billziss@xps:~/Projects/t$ locale
>> LANG=en_US.UTF-8
>> LC_CTYPE="en_US.UTF-8"
>> LC_NUMERIC="en_US.UTF-8"
>> LC_TIME="en_US.UTF-8"
>> LC_COLLATE="en_US.UTF-8"
>> LC_MONETARY="en_US.UTF-8"
>> LC_MESSAGES="en_US.UTF-8"
>> LC_ALL=
>> billziss@xps:~/Projects/t$ ./cyg.exe "foo bar" "Domain\Jérôme"
>> 0=./cyg
>> 1=foo bar
>> 2=Domain\Jérôme

>> CMD.EXE run:
>> C:\Users\billziss\Projects\t>cyg.exe "foo bar" "Domain\Jérôme"
>> 0=cyg
>> 1=foo bar
>> 2="Domain\Jérôme"

>>> Now, let's start a Windows shell (cmd.exe)
>>> Note that I had to copy cygwin1.dll from my Cygwin installation
>>> directory, otherwise binary.exe would not start.
>>> I do not know whether there is a `locale` equivalent in Windows
>>> command prompt, so I merely ran my program.
>>>     C:\Users\Public>binary.exe "foo bar" "Jérôme"
>>>     0=binary
>>>     1=foo bar
>>>     2="Jérôme"

Your Windows CommandLineA/W outputs were confusing.

The point is that Cygwin programs run from cmd shell appear to receive UTF-8
arguments with the surrounding double quotes included intact, whereas the double
quotes are stripped when run from a Cygwin shell.

I think the charset needs to be verified by dumping each arg as hex bytes, e.g.

//!/usr/bin/gcc -g -Og -Wall -Wextra -o quoted-arg-dump quoted-arg-dump.c
// quoted-arg-dump.c - dump quoted args under Cygwin and Windows shells
// outputs:
// $ ./quoted-arg-dump "foo bar" "Jérôme"
// 0 './quoted-arg-dump' 2e 2f 71 75 6f 74 65 64 2d 61 72 67 2d 64 75 6d 70
// 1 'foo bar' 66 6f 6f 20 62 61 72
// 2 'Jérôme' 4a c3 a9 72 c3 b4 6d 65
// >quoted-arg-dump "foo bar" "Jérôme"
// 0 'quoted-arg-dump' 71 75 6f 74 65 64 2d 61 72 67 2d 64 75 6d 70
// 1 'foo bar' 66 6f 6f 20 62 61 72
// 2 '"Jérôme"' 22 4a c3 a9 72 c3 b4 6d 65 22
// checks:
// $ grep -a '[éô]' unicode-symbols.txt
// é  U+00E9  LATIN SMALL LETTER E WITH ACUTE
// ô  U+00F4  LATIN SMALL LETTER O WITH CIRCUMFLEX
// $ grep -a '[éô]' unicode-symbols.txt | od -An -tx1z -w11
// c3 a9 20 20 55 2b 30 30 45 39 20  >..  U+00E9 <
// 20 4c 41 54 49 4e 20 53 4d 41 4c  > LATIN SMAL<
// 4c 20 4c 45 54 54 45 52 20 45 20  >L LETTER E <
// 57 49 54 48 20 41 43 55 54 45 0a  >WITH ACUTE.<
// c3 b4 20 20 55 2b 30 30 46 34 20  >..  U+00F4 <
// 20 4c 41 54 49 4e 20 53 4d 41 4c  > LATIN SMAL<
// 4c 20 4c 45 54 54 45 52 20 4f 20  >L LETTER O <
// 57 49 54 48 20 43 49 52 43 55 4d  >WITH CIRCUM<
// 46 4c 45 58 0a                    >FLEX.<
#include <stdio.h>
int
main(int argc, char *argv[]) {
	for (int a = 0; a < argc; ++a) {
		printf("%d '%s'", a, argv[a]);

		for (char *p = argv[a]; *p; ++p) {
			printf(" %.2hhx", *p);
		} // for chars

		printf("\n");
	} // for args
} // main()

This verifies that Cygwin does not strip double quotes from UTF-8 args when run
from Windows cmd, and the args are received and output as UTF-8 characters.

It might be interesting if you could also run from PowerShell and/or Terminal
for comparison to see if the Windows cmd behaviour is reproduced there.

-- 
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada

This email may be disturbing to some readers as it contains
too much technical detail. Reader discretion is advised.
[Data in binary units and prefixes, physical quantities in SI.]


* Re: Unconsistent command-line parsing in case of UTF-8 quoted arguments
  2020-10-14 21:47       ` Jérôme Froissart
  2020-10-14 22:14         ` Jérôme Froissart
  2020-10-15  5:14         ` UTF-8 quoted args passed to program include quotes when run from cmd Brian Inglis
@ 2020-10-19  2:32         ` Kaz Kylheku (Cygwin)
  2 siblings, 0 replies; 18+ messages in thread
From: Kaz Kylheku (Cygwin) @ 2020-10-19  2:32 UTC (permalink / raw)
  To: Jérôme Froissart; +Cc: cygwin

On 2020-10-14 14:47, Jérôme Froissart wrote:
>> The choice of GetCommandLineA was for illustration purposes;
>> had I used GetCommandLineW I would not be able to printf
>> using %ls under CMD.EXE, because of code page issues. However
>> here is a modified version of the test program that uses
>> GetCommandLineW.

[ ... ]

>> billziss@xps:~/Projects/t$ ./cyg.exe "foo bar" "Domain\Jérôme"
>> 0022 "   0043 C   003a :   005c \   0055 U   0073 s   0065 e   0072 r
>> 0073 s   005c \   0062 b   0069 i   006c l   006c l   007a z   0069 i
>> 0073 s   0073 s   005c \   0050 P   0072 r   006f o   006a j   0065 e
>> 0063 c   0074 t   0073 s   005c \   0074 t   005c \   0063 c   0079 y
>> 0067 g   002e .   0065 e   0078 x   0065 e   0022 "

[ ... ]

>> C:\Users\billziss\Projects\t>cyg.exe "foo bar" "Domain\Jérôme"
>> 0063 c   0079 y   0067 g   002e .   0065 e   0078 x   0065 e   0020
>> 0020     0022 "   0066 f   006f o   006f o   0020     0062 b   0061 a
>> 0072 r   0022 "   0020     0022 "   0044 D   006f o   006d m   0061 a
>> 0069 i   006e n   005c \   004a J   00e9 .   0072 r   00f4 .   006d m
>> 0065 e   0022 "

Aha! There is a hint of a problem here. Firstly, the command lines
are obviously different.

The Cygwin one starts with a quote that we did not see, wrapping
the full path to the executable:

   "C:\Users\billziss\Projects\t\cyg.exe"

It ends there. Why is that? I'm guessing that the command line was
tokenized destructively; a null character was written.

But under cmd.exe, we see the whole command line, without any null
character having been written in it. Moreover, the program name just
appears as the original relative path cyg.exe with no quotes.

What a mess. :)





end of thread, other threads:[~2020-10-19  2:32 UTC | newest]

Thread overview: 18+ messages
2020-10-02 21:40 Unconsistent command-line parsing in case of UTF-8 quoted arguments Jérôme Froissart
2020-10-03  2:22 ` Doug Henderson
2020-10-04 11:18 ` Andrey Repin
2020-10-06 21:36   ` Jérôme Froissart
2020-10-07  1:10     ` Andrey Repin
2020-10-07 22:21       ` Jérôme Froissart
2020-10-11 18:55         ` Andrey Repin
2020-10-07  2:20     ` Brian Inglis
2020-10-07  5:17     ` Thomas Wolff
2020-10-07 23:32       ` Brian Inglis
2020-10-08  0:59         ` Eliot Moss
2020-10-08  6:22           ` Brian Inglis
2020-10-13 16:30     ` Kaz Kylheku (Cygwin)
2020-10-14 21:47       ` Jérôme Froissart
2020-10-14 22:14         ` Jérôme Froissart
2020-10-15  5:14         ` UTF-8 quoted args passed to program include quotes when run from cmd Brian Inglis
2020-10-19  2:32         ` Unconsistent command-line parsing in case of UTF-8 quoted arguments Kaz Kylheku (Cygwin)
2020-10-13 17:34     ` Brian Inglis
