public inbox for gcc-bugs@sourceware.org
* [Bug c++/111244] New: std::filesystem::path encoding mismatches locale on Windows
From: thiago at kde dot org @ 2023-08-30 19:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111244

            Bug ID: 111244
           Summary: std::filesystem::path encoding mismatches locale on
                    Windows
           Product: gcc
           Version: 13.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: thiago at kde dot org
  Target Milestone: ---

Test:
$ cat fstest.cpp 
#include <filesystem>
#include <stdio.h>

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; ++i) {
        std::filesystem::path p(argv[i]);
        if (std::filesystem::exists(p)) {
            printf("%s %llu\n", argv[1], (unsigned long long)std::filesystem::file_size(p));
        } else {
            printf("%s does not exist\n", argv[1]);
        }
    }
}
$ touch filæ
$ g++ fstest.cpp
$ ./a.out fstest.cpp filæ

On Linux (and any other Unix):
fstest.cpp 377
fstest.cpp 0

On Windows with libc++ or MS STL:
fstest.cpp 377
fstest.cpp 0

On Windows with libstdc++:
fstest.cpp 377
terminate called after throwing an instance of 'std::filesystem::__cxx11::filesystem_error'
  what():  filesystem error: Cannot convert character sequence: Illegal byte sequence

This is caused by std::filesystem::path interpreting the input as UTF-8. On
Windows, the input is not UTF-8; it must be decoded using the locale codec.

Strictly speaking, the same should apply to the conversion to Unicode on Unix
systems too, but a) they're almost all UTF-8 these days, so the corner cases
may be ignored by a policy decision, and b) a mismatched input encoding does
not make it impossible to refer to files through fs::path alone.
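
To illustrate the failure, here is a minimal sketch of a UTF-8 well-formedness
check. It assumes the Windows locale encodes "æ" as the single byte 0xE6 (as
code page 1252 does). The function name is_valid_utf8 is made up for this
example; it is not how libstdc++ implements its conversion, only a way to see
why the byte sequence is rejected.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: returns true if `s` is well-formed UTF-8
// (rejects truncated sequences, overlong forms, surrogates and
// code points above U+10FFFF).
bool is_valid_utf8(std::string_view s)
{
    // Smallest code point that legitimately needs a sequence of each length.
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};

    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < 0x80) {          // plain ASCII
            ++i;
            continue;
        }

        std::size_t len;
        unsigned long cp;
        if ((c & 0xE0) == 0xC0)      { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return false;           // stray continuation or invalid lead byte

        if (i + len > s.size())
            return false;            // sequence is cut short

        for (std::size_t k = 1; k < len; ++k) {
            unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80)
                return false;        // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < min_cp[len])
            return false;            // overlong encoding
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;            // out of range or surrogate
        i += len;
    }
    return true;
}
```

Under that assumption, "fil\xE6" (the name as a code page 1252 process would
receive it) fails the check because 0xE6 looks like the lead byte of a
three-byte sequence that never arrives, while "fil\xC3\xA6" (the UTF-8 form
seen on Linux) passes.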


* [Bug libstdc++/111244] std::filesystem::path encoding mismatches locale on Windows
From: pinskia at gcc dot gnu.org @ 2023-08-30 19:25 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111244

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           See Also|                            |https://gcc.gnu.org/bugzill
                   |                            |a/show_bug.cgi?id=108865

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
On Windows it is even a bit more complex than that.

> This is caused by std::filesystem::path interpreting the input as UTF-8.
> On Windows, it's not; it must be decoded using the locale codec. 

Except that the code page can also be changed via a manifest file.
For example, GCC embeds a manifest into its own compiler to work around this
issue and always use UTF-8.

So ...


* [Bug libstdc++/111244] std::filesystem::path encoding mismatches locale on Windows
From: thiago at kde dot org @ 2023-08-30 19:31 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111244

--- Comment #2 from Thiago Macieira <thiago at kde dot org> ---
(In reply to Andrew Pinski from comment #1)
> Except that the code page can also be changed via a manifest file.
> For example, GCC embeds a manifest into its own compiler to work around
> this issue and always use UTF-8.
> 
> So ...

Indeed, but won't MultiByteToWideChar() adapt to that and correctly convert
from UTF-8 to UTF-16?
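
A sketch of the conversion this question refers to, under the assumption that
CP_ACP resolves to the process's active code page (and therefore to UTF-8 when
a UTF-8 manifest is in effect). The helper path_from_active_code_page is
hypothetical and not part of libstdc++; on non-Windows systems it simply
passes the string through.

```cpp
#include <cstddef>
#include <filesystem>
#include <stdexcept>
#include <string>
#ifdef _WIN32
#  include <windows.h>
#endif

// Hypothetical helper: build a path from a narrow string that is encoded
// in the process's active code page rather than assumed to be UTF-8.
std::filesystem::path path_from_active_code_page(const std::string &narrow)
{
#ifdef _WIN32
    if (narrow.empty())
        return {};

    // First call: ask how many UTF-16 code units the result needs.
    int len = MultiByteToWideChar(CP_ACP, MB_ERR_INVALID_CHARS,
                                  narrow.data(), static_cast<int>(narrow.size()),
                                  nullptr, 0);
    if (len <= 0)
        throw std::runtime_error("cannot convert from the active code page");

    // Second call: convert into a buffer of exactly that size.
    std::wstring wide(static_cast<std::size_t>(len), L'\0');
    MultiByteToWideChar(CP_ACP, MB_ERR_INVALID_CHARS,
                        narrow.data(), static_cast<int>(narrow.size()),
                        wide.data(), len);

    // Constructing from a wide string skips the narrow-to-wide conversion
    // inside std::filesystem::path.
    return std::filesystem::path(wide);
#else
    // POSIX: the native path encoding is already narrow, no conversion.
    return std::filesystem::path(narrow);
#endif
}
```

With such a helper, the test program would construct its path as
path_from_active_code_page(argv[i]) instead of passing argv[i] directly.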


* [Bug libstdc++/111244] std::filesystem::path encoding mismatches locale on Windows
From: redi at gcc dot gnu.org @ 2023-08-30 19:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111244

--- Comment #3 from Jonathan Wakely <redi at gcc dot gnu.org> ---
Somebody else will have to fix this; I've already wasted too much of my life
making std::filesystem (mostly) work on Windows.


* [Bug libstdc++/111244] std::filesystem::path encoding mismatches locale on Windows
From: costas.argyris at gmail dot com @ 2023-08-30 20:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111244

Costas Argyris <costas.argyris at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |costas.argyris at gmail dot com

--- Comment #4 from Costas Argyris <costas.argyris at gmail dot com> ---
I'm wondering if it will work after embedding a UTF-8 manifest into your a.out
executable, as described here:

https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page


* [Bug libstdc++/111244] std::filesystem::path encoding mismatches locale on Windows
From: thiago at kde dot org @ 2023-08-30 20:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111244

--- Comment #5 from Thiago Macieira <thiago at kde dot org> ---
(In reply to Jonathan Wakely from comment #3)
> Somebody else will have to fix this; I've already wasted too much of my life
> making std::filesystem (mostly) work on Windows.

Same here.

(In reply to Costas Argyris from comment #4)
> I'm wondering if it will work after embedding a UTF-8 manifest into your
> a.out executable, as described here:
> 
> https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page

I can't embed a UTF-8 manifest in my DLL and much less in my .a. As a library
writer (I'm the QtCore maintainer), that's out of my hands - it is an
application decision.

If the GCC+Binutils team wants to enforce that for the future, be my guest. I'd
support your decision; I think it's high time this happened. But I'm sure there
would be a lot of push-back from people who can't do that because their
existing Windows applications rely on the legacy encodings, or from those who
deploy to Windows versions that don't have such support. I have a vague memory
of discussing this on the Qt development mailing list, but can't find it.

A softer approach is for std::filesystem to declare that it only supports
UTF-8-manifested applications (closing this bug as WONTFIX /
working-as-designed). I'd again support your decision and will simply pass the
requirement along to my users.


* [Bug libstdc++/111244] std::filesystem::path encoding mismatches locale on Windows
From: costas.argyris at gmail dot com @ 2023-08-30 20:50 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111244

--- Comment #6 from Costas Argyris <costas.argyris at gmail dot com> ---
> I can't embed a UTF-8 manifest in my DLL and much less in my .a. As a
> library writer (I'm the QtCore maintainer), that's out of my hands - it is
> an application decision.

At this point I just meant embedding it in your example a.out executable file,
just to check if it will work correctly.

But FYI, you don't embed the UTF-8 manifest into every static/dynamic library,
just into the executable. It is essentially just a new object file that you
link into your executable, whose purpose is to make the resulting
executable use UTF-8 as its active code page.

But yes, assuming this even works, embedding the UTF-8 manifest is part of the
build process of the application, so it would have to be accounted for in the
Makefiles etc.
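
Since the manifest is an application-level build decision, a library can at
most observe its effect at run time. A minimal sketch, assuming GetACP()
reports the active code page that the manifest influences; the function name
narrow_encoding_is_utf8 is made up for this example, and the POSIX branch
simply assumes a UTF-8 locale.

```cpp
#ifdef _WIN32
#  include <windows.h>
#endif

// Hypothetical helper: reports whether narrow strings handed to this
// process can be treated as UTF-8.
bool narrow_encoding_is_utf8()
{
#ifdef _WIN32
    // CP_UTF8 is 65001; this is what GetACP() returns when, for example,
    // a UTF-8 manifest is in effect for the executable.
    return GetACP() == CP_UTF8;
#else
    // Assumption of this sketch: POSIX systems use a UTF-8 locale.
    return true;
#endif
}
```

A library could use a check like this to decide whether narrow input needs to
be converted from a legacy code page before it is handed to
std::filesystem::path.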


* [Bug libstdc++/111244] std::filesystem::path encoding mismatches locale on Windows
From: thiago at kde dot org @ 2023-08-30 20:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111244

--- Comment #7 from Thiago Macieira <thiago at kde dot org> ---
(In reply to Costas Argyris from comment #6)
> At this point I just meant embedding it in your example a.out executable
> file, just to check if it will work correctly.

Ah, got it. But those are not the conditions of the issue at hand, so proving it
works doesn't help me in the conditions that do apply.

> But yes, assuming this even works, embedding the UTF-8 manifest is part of
> the build process of the application, so it would have to be accounted for
> in the Makefiles etc.

And I can't force my users to do that.

If libstdc++ wants to enforce that or require it for use of std::filesystem,
it's your choice.
