public inbox for cygwin@cygwin.com
From: "Jérôme Froissart" <software@froissart.eu>
To: "Kaz Kylheku (Cygwin)" <743-406-3965@kylheku.com>
Cc: cygwin@cygwin.com
Subject: Re: Unconsistent command-line parsing in case of UTF-8 quoted arguments
Date: Wed, 14 Oct 2020 23:47:37 +0200
Message-ID: <CAFC9CLCx3nAQu6aMYTTL1syr9zyXgHYY0vKCKSCXAf=HpYXDiQ@mail.gmail.com>
In-Reply-To: <d4f283fe85c31be76dcfc01b20bb375e@mail.kylheku.com>

Thank you everyone, I now have a better understanding of how Windows
and Cygwin work (being rather a Linux guy, I was not really aware of
all of this).

However, one question is still puzzling me. I now
understand _why_ things happen that way, but I am still wondering
whether this is really what we _want_. Keeping the double
quotes around a UTF-8 argument just because the program is not run
from Cygwin's bash sounds like a bug to me (even though I do
understand the reasons behind this behaviour). Since I cannot
run my program from bash, I have to resort to manually trimming the
quotes, which I would have liked to avoid.
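
A minimal sketch of such a manual workaround, assuming the argument arrives wrapped in exactly one pair of double quotes. The name `strip_outer_quotes` is hypothetical (it comes from neither Cygwin nor sshfs-win). Working on bytes is safe here because `"` is plain ASCII and never occurs inside a multi-byte UTF-8 sequence:

```c
#include <string.h>

/* Hypothetical helper: if arg is wrapped in one pair of double quotes,
   remove them in place. Any other string is left untouched.
   Returns arg for convenience. */
char *strip_outer_quotes(char *arg)
{
    size_t len = strlen(arg);

    if (len >= 2 && arg[0] == '"' && arg[len - 1] == '"') {
        /* Shift the inner bytes left by one, then terminate. */
        memmove(arg, arg + 1, len - 2);
        arg[len - 2] = '\0';
    }
    return arg;
}
```

It would be called on each `argv[i]` before use; arguments that were already unquoted pass through unchanged.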

I'd like to share a message that the maintainer of sshfs-win
posted on GitHub [1] as a follow-up to our discussion (he did
not know whether he could post to the mailing list without
subscribing first).
Unfortunately, I don't have much time at the moment to
investigate this issue (for instance, I have not yet managed to
repeat the experiments with the very latest version of Cygwin), so
his feedback is very valuable.

Here is what he says:
> It seems to me that the list is missing the important point
> about the double quote characters that should NOT be there
> regardless of how the é and ô characters are being interpreted.
> (As evidence of this: the Cygwin command line parser was able
> to break the command line into arguments correctly, but chose
> to retain the double quotes.)
>
> The choice of GetCommandLineA was for illustration purposes;
> had I used GetCommandLineW I would not be able to printf
> using %ls under CMD.EXE, because of code page issues. However
> here is a modified version of the test program that uses
> GetCommandLineW.
>
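>     #include <stdio.h>
>     #include <wchar.h>   /* wchar_t */
>
>     /* Win32 API, declared by hand instead of including <windows.h>. */
>     wchar_t *GetCommandLineW(void);
>
>     int main(int argc, char *argv[])
>     {
>         wchar_t *s = GetCommandLineW();
>
>         /* Dump each UTF-16 code unit as hex plus its ASCII form
>            ('.' if it is not printable ASCII), eight per line. */
>         for (wchar_t *p = s; *p; p++)
>             printf("%04x %c%s",
>                 (unsigned)*p,
>                 32 <= *p && *p < 127 ? *p : '.',
>                 (p - s) % 8 + 1 != 8 ? "   " : "\n");
>         printf("\n");
>
>         /* Print the arguments as they reach main(). */
>         for (int i = 0; argc > i; i++)
>             printf("%d=%s\n", i, argv[i]);
>
>         return 0;
>     }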
>
> I compiled this program under Cygwin to produce cyg.exe and ran
> it under Cygwin and CMD.EXE.
>
> Cygwin run:
> billziss@xps:~/Projects/t$ locale
> LANG=en_US.UTF-8
> LC_CTYPE="en_US.UTF-8"
> LC_NUMERIC="en_US.UTF-8"
> LC_TIME="en_US.UTF-8"
> LC_COLLATE="en_US.UTF-8"
> LC_MONETARY="en_US.UTF-8"
> LC_MESSAGES="en_US.UTF-8"
> LC_ALL=
> billziss@xps:~/Projects/t$ ./cyg.exe "foo bar" "Domain\Jérôme"
> 0022 "   0043 C   003a :   005c \   0055 U   0073 s   0065 e   0072 r
> 0073 s   005c \   0062 b   0069 i   006c l   006c l   007a z   0069 i
> 0073 s   0073 s   005c \   0050 P   0072 r   006f o   006a j   0065 e
> 0063 c   0074 t   0073 s   005c \   0074 t   005c \   0063 c   0079 y
> 0067 g   002e .   0065 e   0078 x   0065 e   0022 "
> 0=./cyg
> 1=foo bar
> 2=Domain\Jérôme
>
> CMD.EXE run:
>
> C:\Users\billziss\Projects\t>\Windows\System32\chcp.com
> Active code page: 437
>
> C:\Users\billziss\Projects\t>cyg.exe "foo bar" "Domain\Jérôme"
> 0063 c   0079 y   0067 g   002e .   0065 e   0078 x   0065 e   0020
> 0020     0022 "   0066 f   006f o   006f o   0020     0062 b   0061 a
> 0072 r   0022 "   0020     0022 "   0044 D   006f o   006d m   0061 a
> 0069 i   006e n   005c \   004a J   00e9 .   0072 r   00f4 .   006d m
> 0065 e   0022 "
> 0=cyg
> 1=foo bar
> 2="Domain\Jérôme"
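
To illustrate the point quoted above: the double quote (U+0022) is plain ASCII, so a splitter working on the wide-character command line can drop it without caring how é and ô are encoded. The following is only a minimal sketch and is not Cygwin's actual parser. The names `split_quoted` and `ARG_CAP` are made up for this example, and backslash escapes are deliberately ignored:

```c
#include <stddef.h>
#include <wchar.h>

#define ARG_CAP 64  /* hypothetical fixed capacity per argument */

/* Hypothetical helper: split cmdline on unquoted blanks into at most
   max_args arguments, copying each one WITHOUT its double quotes into
   out[i]. Characters that do not fit in ARG_CAP are dropped.
   Returns the number of arguments found. */
size_t split_quoted(const wchar_t *cmdline,
                    wchar_t out[][ARG_CAP], size_t max_args)
{
    size_t argc = 0;
    const wchar_t *p = cmdline;

    while (argc < max_args) {
        while (*p == L' ' || *p == L'\t')
            p++;                          /* skip separators */
        if (*p == L'\0')
            break;

        size_t n = 0;
        int in_quotes = 0;
        for (; *p != L'\0'; p++) {
            if (*p == L'"') {
                in_quotes = !in_quotes;   /* drop the quote itself */
            } else if (!in_quotes && (*p == L' ' || *p == L'\t')) {
                break;                    /* unquoted blank ends the argument */
            } else if (n + 1 < ARG_CAP) {
                out[argc][n++] = *p;      /* any other character, ASCII or not */
            }
        }
        out[argc][n] = L'\0';
        argc++;
    }
    return argc;
}
```

Fed the command line from the CMD.EXE run above, this yields `cyg.exe`, `foo bar` and `Domain\Jérôme`, with no quotes left in the last argument.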


[1] https://github.com/billziss-gh/sshfs-win/pull/208

Thank you very much
Jérôme

On Tue, 13 Oct 2020 at 18:30, Kaz Kylheku (Cygwin)
<743-406-3965@kylheku.com> wrote:
>
> On 2020-10-06 14:36, Jérôme Froissart wrote:
> > Here is an example C file
> >     $ cat example.c
> >     #include <stdio.h>
> >
> >     const char *GetCommandLineA(void);
> >
> >     int main(int argc, char *argv[])
> >     {
> >         const char *s = GetCommandLineA();
> >         printf("C=%s\n", s);
> >
> >         for (int i = 0; argc > i; i++)
> >             printf("%d=%s\n", i, argv[i]);
> >
> >         return 0;
> >     }
>
> Your program's comparison seems to be based on the
> hypothesis that Cygwin parses the GetCommandLineA() command line.
>
> But this hypothesis is almost certainly wrong.
>
> > Now, let's start a Windows shell (cmd.exe)
> > Note that I had to copy cygwin1.dll from my Cygwin installation
> > directory, otherwise binary.exe would not start.
> > I do not know whether there is a `locale` equivalent in Windows
> > command prompt, so I merely ran my program.
> >     C:\Users\Public>binary.exe "foo bar" "Jérôme"
> >     C=binary.exe  "foo bar" "J□r□me"
> >     0=binary
> >     1=foo bar
> >     2="Jérôme"
>
> The "A" command line from GetCommandLineA has "tofu"
> characters: é and ô were not decoded properly.
>
> The é and ô characters we see in the Cygwin-parsed
> arguments coming into main could not have been recovered
> from these "tofu" replacement characters.
>
> What is actually being parsed must be the WCHAR command line
> corresponding to what comes from GetCommandLineW().
>
> It's necessary to show that one to get a more complete understanding.
>
