From: Jérôme Froissart
Date: Wed, 14 Oct 2020 23:47:37 +0200
Subject: Re: Unconsistent command-line parsing in case of UTF-8 quoted arguments
To: "Kaz Kylheku (Cygwin)" <743-406-3965@kylheku.com>
Cc: cygwin@cygwin.com
List-Id: General Cygwin discussions and problem reports

Thank you everyone,

I now have a better understanding of how Windows and Cygwin work
(being rather a Linux guy, I was not really aware of all of this).

However, there is still a question that is puzzling me. I now
understand _why_ things happen that way, but I am still wondering
whether this is really what we _want_. I mean, keeping the double
quotes around a UTF-8 argument just because the program is not run
from Cygwin's bash sounds like a bug to me, doesn't it? (Yet I
definitely understand the reasons that explain this behaviour.)
Since I cannot run my program from bash, I have to resort to manually
trimming the quotes, which I would have liked to avoid.

I'd like to share a message that the maintainer of sshfs-win has
posted on GitHub [1], which is a follow-up to our discussions (he did
not know whether he was able to post to the mailing list without
subscribing first).
(Besides, I unfortunately don't have much time currently to
investigate this issue (for instance, I have not yet succeeded in
doing the same experiments with the very latest version of Cygwin), so
having his feedback is very valuable.)

Here is what he says:
> It seems to me that the list is missing the important point
> about the double quote characters that should NOT be there
> regardless of how the é and ô characters are being interpreted.
> (As evidence of this: the Cygwin command line parser was able
> to break the command line into arguments correctly, but chose
> to retain the double quotes.)
>
> The choice of GetCommandLineA was for illustration purposes;
> had I used GetCommandLineW I would not be able to printf
> using %ls under CMD.EXE, because of code page issues. However
> here is a modified version of the test program that uses
> GetCommandLineW.
>
> #include <stdio.h>
>
> wchar_t *GetCommandLineW(void);
>
> int main(int argc, char *argv[])
> {
>     wchar_t *s = GetCommandLineW();
>
>     for (wchar_t *p = s; *p; p++)
>         printf("%04x %c%s",
>             *p,
>             32 <= *p && *p < 127 ? *p : '.',
>             (p - s) % 8 + 1 != 8 ? " " : "\n");
>     printf("\n");
>
>     for (int i = 0; argc > i; i++)
>         printf("%d=%s\n", i, argv[i]);
>
>     return 0;
> }
>
> I compiled this program under Cygwin to produce cyg.exe and ran
> it under Cygwin and CMD.EXE.
>
> Cygwin run:
>
> billziss@xps:~/Projects/t$ locale
> LANG=en_US.UTF-8
> LC_CTYPE="en_US.UTF-8"
> LC_NUMERIC="en_US.UTF-8"
> LC_TIME="en_US.UTF-8"
> LC_COLLATE="en_US.UTF-8"
> LC_MONETARY="en_US.UTF-8"
> LC_MESSAGES="en_US.UTF-8"
> LC_ALL=
> billziss@xps:~/Projects/t$ ./cyg.exe "foo bar" "Domain\Jérôme"
> 0022 " 0043 C 003a : 005c \ 0055 U 0073 s 0065 e 0072 r
> 0073 s 005c \ 0062 b 0069 i 006c l 006c l 007a z 0069 i
> 0073 s 0073 s 005c \ 0050 P 0072 r 006f o 006a j 0065 e
> 0063 c 0074 t 0073 s 005c \ 0074 t 005c \ 0063 c 0079 y
> 0067 g 002e . 0065 e 0078 x 0065 e 0022 "
> 0=./cyg
> 1=foo bar
> 2=Domain\Jérôme
>
> CMD.EXE run:
>
> C:\Users\billziss\Projects\t>\Windows\System32\chcp.com
> Active code page: 437
>
> C:\Users\billziss\Projects\t>cyg.exe "foo bar" "Domain\Jérôme"
> 0063 c 0079 y 0067 g 002e . 0065 e 0078 x 0065 e 0020
> 0020   0022 " 0066 f 006f o 006f o 0020   0062 b 0061 a
> 0072 r 0022 " 0020   0022 " 0044 D 006f o 006d m 0061 a
> 0069 i 006e n 005c \ 004a J 00e9 . 0072 r 00f4 . 006d m
> 0065 e 0022 "
> 0=cyg
> 1=foo bar
> 2="Domain\Jérôme"

[1] https://github.com/billziss-gh/sshfs-win/pull/208

Thank you very much
Jérôme

On Tue, 13 Oct 2020 at 18:30, Kaz Kylheku (Cygwin)
<743-406-3965@kylheku.com> wrote:
>
> On 2020-10-06 14:36, Jérôme Froissart wrote:
> > Here is an example C file
> > $ cat example.c
> > #include <stdio.h>
> >
> > const char *GetCommandLineA(void);
> >
> > int main(int argc, char *argv[])
> > {
> >     const char *s = GetCommandLineA();
> >     printf("C=%s\n", s);
> >
> >     for (int i = 0; argc > i; i++)
> >         printf("%d=%s\n", i, argv[i]);
> >
> >     return 0;
> > }
>
> Your program's comparison seems to be based on the
> hypothesis that Cygwin parses the GetCommandLineA() command line.
>
> But this hypothesis is almost certainly wrong.
>
> > Now, let's start a Windows shell (cmd.exe)
> > Note that I had to copy cygwin1.dll from my Cygwin installation
> > directory, otherwise binary.exe would not start.
> > I do not know whether there is a `locale` equivalent in Windows
> > command prompt, so I merely ran my program.
> > C:\Users\Public>binary.exe "foo bar" "Jérôme"
> > C=binary.exe "foo bar" "J□r□me"
> > 0=binary
> > 1=foo bar
> > 2="Jérôme"
>
> The "A" command line from GetCommandLineA has "tofu"
> characters: é and ô were not decoded properly.
>
> The é and ô characters we see in the Cygwin-parsed
> arguments coming into main could not have been recovered
> from these "tofu" replacement characters.
>
> What is actually being parsed must be the WCHAR command line
> corresponding to what comes from GetCommandLineW().
>
> It's necessary to show that one to get a more complete understanding.
>
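
PS: for reference, the workaround mentioned at the top of this message
(manually trimming the quotes) can be sketched roughly as follows. This
is only an illustration: the helper name strip_outer_quotes is
hypothetical and is not part of Cygwin or sshfs-win. It assumes that the
quotes wrap the whole argument, as in the CMD.EXE run above, and it does
not try to reproduce any real command-line parsing rules (escaped or
embedded quotes are left alone).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, for illustration only.
 * If arg both starts and ends with a double quote, remove that single
 * outer pair in place. Otherwise leave arg untouched. Returns arg. */
char *strip_outer_quotes(char *arg)
{
    size_t len = strlen(arg);

    if (len >= 2 && arg[0] == '"' && arg[len - 1] == '"') {
        /* Shift the inner bytes left by one. UTF-8 sequences are moved
         * as opaque bytes, so characters such as é and ô are preserved. */
        memmove(arg, arg + 1, len - 2);
        arg[len - 2] = '\0';
    }
    return arg;
}
```

It could be called on each argv[i] at the start of main; arguments that
arrive without surrounding quotes (as in the Cygwin bash run) are
returned unchanged.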