public inbox for gcc-bugs@sourceware.org
* [Bug fortran/48972] New: OPEN with Unicode file name
@ 2011-05-11 21:53 burnus at gcc dot gnu.org
  2011-05-12  6:59 ` [Bug fortran/48972] " burnus at gcc dot gnu.org
                   ` (11 more replies)
  0 siblings, 12 replies; 13+ messages in thread
From: burnus at gcc dot gnu.org @ 2011-05-11 21:53 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48972

           Summary: OPEN with Unicode file name
           Product: gcc
           Version: 4.7.0
            Status: UNCONFIRMED
          Keywords: accepts-invalid, diagnostic
          Severity: normal
          Priority: P3
         Component: fortran
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: burnus@gcc.gnu.org
                CC: jvdelisle@gcc.gnu.org


This PR is motivated by the thread which started at
https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=COMP-FORTRAN-90;59308f3c.1105


GNU Fortran happily accepts kind=4 character strings as the FILE= argument of
the OPEN statement - and probably also as the other string arguments.

However, the Fortran 2008 standard has:

  R905 connect-spec is
            ...
                   or   FILE = file-name-expr
with
  R906 file-name-expr is scalar-default-char-expr

Thus, such strings should be rejected -- at least with -std=f2008.

 * * *

Independent of that, it would be convenient if passing a UCS-4 string were
allowed as a vendor extension. The only problem is how it should be handled in
the library.

For Unix systems, I think converting the UCS-4 to UTF-8 and using it in the
normal file open should work.
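
A minimal sketch of that conversion step, assuming the file name arrives as an
array of 32-bit code points. The function names are hypothetical and are not
taken from libgfortran; the result would then be handed to the ordinary open().

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one UCS-4 code point as UTF-8 into OUT (room for 4 bytes).
   Returns the number of bytes written, or 0 for an invalid code point
   (surrogates U+D800..U+DFFF and values above U+10FFFF).  */
static size_t
ucs4_to_utf8_char (uint32_t c, unsigned char *out)
{
  if (c < 0x80)
    {
      out[0] = (unsigned char) c;
      return 1;
    }
  if (c < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (c >> 6));
      out[1] = (unsigned char) (0x80 | (c & 0x3F));
      return 2;
    }
  if (c >= 0xD800 && c <= 0xDFFF)
    return 0;
  if (c < 0x10000)
    {
      out[0] = (unsigned char) (0xE0 | (c >> 12));
      out[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (c & 0x3F));
      return 3;
    }
  if (c <= 0x10FFFF)
    {
      out[0] = (unsigned char) (0xF0 | (c >> 18));
      out[1] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
      out[2] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
      out[3] = (unsigned char) (0x80 | (c & 0x3F));
      return 4;
    }
  return 0;
}

/* Convert LEN code points from SRC into a NUL-terminated UTF-8 string
   in DST (capacity DSTSIZE bytes).  Returns the UTF-8 length in bytes,
   or (size_t) -1 on an invalid code point or insufficient space.  */
size_t
ucs4_to_utf8 (const uint32_t *src, size_t len, char *dst, size_t dstsize)
{
  unsigned char buf[4];
  size_t i, k, n, pos = 0;

  if (dstsize == 0)
    return (size_t) -1;
  for (i = 0; i < len; i++)
    {
      n = ucs4_to_utf8_char (src[i], buf);
      /* Keep one byte in reserve for the terminating NUL.  */
      if (n == 0 || pos + n >= dstsize)
        return (size_t) -1;
      for (k = 0; k < n; k++)
        dst[pos++] = (char) buf[k];
    }
  dst[pos] = '\0';
  return pos;
}
```

With this encoding the first character of the sample file name, U+30D5,
becomes the three bytes E3 83 95.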

However, for Windows, I think one needs a special solution as Windows seems to
use UTF-16 everywhere [1]. Thus, one should be able to directly pass the UTF-16
file name to CreateFileW [2].

[1] http://msdn.microsoft.com/en-us/library/dd374081%28v=vs.85%29.aspx
[2] http://msdn.microsoft.com/en-us/library/aa363858%28v=vs.85%29.aspx
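
For the Windows path, a UCS-4 name cannot be passed as it stands; it first has
to be re-encoded as UTF-16, with surrogate pairs for code points above U+FFFF.
The following is only a sketch under that assumption, with a hypothetical
function name; the resulting buffer is what a call such as CreateFileW would
expect.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert LEN code points from SRC into a 0-terminated UTF-16 string in
   DST (capacity DSTSIZE 16-bit units).  Returns the number of units
   written, not counting the terminator, or (size_t) -1 on an invalid
   code point or insufficient space.  */
size_t
ucs4_to_utf16 (const uint32_t *src, size_t len, uint16_t *dst,
               size_t dstsize)
{
  size_t i, pos = 0;

  if (dstsize == 0)
    return (size_t) -1;
  for (i = 0; i < len; i++)
    {
      uint32_t c = src[i];

      /* Surrogates are not valid code points; nothing exists above
         U+10FFFF.  */
      if ((c >= 0xD800 && c <= 0xDFFF) || c > 0x10FFFF)
        return (size_t) -1;
      if (c < 0x10000)
        {
          if (pos + 1 >= dstsize)
            return (size_t) -1;
          dst[pos++] = (uint16_t) c;
        }
      else
        {
          /* Split the 20 bits above U+10000 into a surrogate pair.  */
          if (pos + 2 >= dstsize)
            return (size_t) -1;
          c -= 0x10000;
          dst[pos++] = (uint16_t) (0xD800 | (c >> 10));
          dst[pos++] = (uint16_t) (0xDC00 | (c & 0x3FF));
        }
    }
  dst[pos] = 0;
  return pos;
}
```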



Example program. Sample usage:
  $ gfortran test.f90
  $ ./a.out 
  Enter filename: ファイル
  $

Should create "ファイル.dat" with the content "Hello World and Ni Hao -- 你好".
The content is written correctly, but with the input shown above the file name
comes out as "?" (= \343). If one passes "44", the created file is just "4".

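One plausible reading of the "44" case, assuming a little-endian target and
that the kind=4 buffer reaches the C library unconverted: each character
occupies four bytes, so the first character is followed by three zero bytes
and a byte-oriented call such as open() sees a one-character name. The helper
names below are hypothetical and only illustrate the byte layout.

```c
#include <stddef.h>
#include <string.h>

/* Lay out the ASCII string S as little-endian UCS-4 in BUF: four bytes
   per character, low byte first, followed by a four-byte terminator.
   BUF must hold at least 4 * (strlen (S) + 1) bytes.  */
void
ascii_to_ucs4le (const char *s, unsigned char *buf)
{
  size_t i, n = strlen (s);

  memset (buf, 0, 4 * (n + 1));
  for (i = 0; i < n; i++)
    buf[4 * i] = (unsigned char) s[i];
}

/* Length of the name as a byte-oriented API would see it: everything
   up to the first zero byte.  */
size_t
length_seen_by_open (const unsigned char *buf)
{
  return strlen ((const char *) buf);
}
```

For the input "44" the buffer starts with the bytes 34 00 00 00, so the name
ends after the first "4".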

use iso_fortran_env
implicit none
integer, parameter :: ucs4  = selected_char_kind ('ISO_10646')
character(len=30, kind=ucs4) :: str
integer :: unit

open(unit=INPUT_UNIT, encoding='utf-8')
write(*, '(a)', advance='no') 'Enter filename: '
read(*,*) str
open(newunit=unit, file=trim(str)//ucs4_'.dat', encoding='utf-8')
write(unit, '(a)') ucs4_'Hello World and Ni Hao -- ' &
                   // char (int (z'4F60'), ucs4)     &
                   // char (int (z'597D'), ucs4)
close(unit)
end



* [Bug fortran/48972] OPEN with Unicode file name
  2011-05-11 21:53 [Bug fortran/48972] New: OPEN with Unicode file name burnus at gcc dot gnu.org
@ 2011-05-12  6:59 ` burnus at gcc dot gnu.org
  2011-05-12 13:42 ` burnus at gcc dot gnu.org
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: burnus at gcc dot gnu.org @ 2011-05-12  6:59 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48972

--- Comment #1 from Tobias Burnus <burnus at gcc dot gnu.org> 2011-05-12 06:15:57 UTC ---
For the diagnostic, the following untested patch should do. For Unicode
file-name support more work needs to be done ...

--- a/gcc/fortran/io.c
+++ b/gcc/fortran/io.c
@@ -1478,6 +1478,13 @@ resolve_tag (const io_tag *tag, gfc_expr *e)
       return FAILURE;
     }

+  if (e->ts.type == BT_CHARACTER && e->ts.kind != gfc_default_character_kind)
+    {
+      gfc_error ("%s tag at %L must be a character string of default kind",
+                tag->name, &e->where);
+      return FAILURE;
+    }
+
   if (e->rank != 0)
     {
       gfc_error ("%s tag at %L must be scalar", tag->name, &e->where);



* [Bug fortran/48972] OPEN with Unicode file name
  2011-05-11 21:53 [Bug fortran/48972] New: OPEN with Unicode file name burnus at gcc dot gnu.org
  2011-05-12  6:59 ` [Bug fortran/48972] " burnus at gcc dot gnu.org
@ 2011-05-12 13:42 ` burnus at gcc dot gnu.org
  2011-05-12 14:23 ` jb at gcc dot gnu.org
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: burnus at gcc dot gnu.org @ 2011-05-12 13:42 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48972

--- Comment #2 from Tobias Burnus <burnus at gcc dot gnu.org> 2011-05-12 12:39:32 UTC ---
Created attachment 24238
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=24238
Test case

(In reply to comment #1)
> For the diagnostic, the following untested patch should do.

Well, almost. It fails for FORMAT/fmt=; I have to admit that I do not quite
understand why resolve_tag_format in io.c tests for a default-kind character
only when e->expr_type == EXPR_CONSTANT.

Jerry, could you have a look? I am a bit lost.



* [Bug fortran/48972] OPEN with Unicode file name
  2011-05-11 21:53 [Bug fortran/48972] New: OPEN with Unicode file name burnus at gcc dot gnu.org
  2011-05-12  6:59 ` [Bug fortran/48972] " burnus at gcc dot gnu.org
  2011-05-12 13:42 ` burnus at gcc dot gnu.org
@ 2011-05-12 14:23 ` jb at gcc dot gnu.org
  2011-05-12 14:26 ` burnus at gcc dot gnu.org
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: jb at gcc dot gnu.org @ 2011-05-12 14:23 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48972

Janne Blomqvist <jb at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jb at gcc dot gnu.org

--- Comment #3 from Janne Blomqvist <jb at gcc dot gnu.org> 2011-05-12 12:56:07 UTC ---
Wouldn't a standard-conforming way to support Unicode file names be for
gfortran to 

- Specify that the default character set is UTF-8. 

- Then an internal read or write could be used to do a UTF-8 <->  UTF-32
conversion, if the user program uses kind=4 characters. Or if the user program
stuffs utf-8 data into default character variables, nothing needs to be done.

- When passing a filename in the open statement, on POSIX this can be passed
as-is to open(), on mingw the library would need to do a utf-8 -> utf-16
conversion, then call _wopen(). And similarly for other syscalls where we pass
path names (e.g. stat(), access() and so on).
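
The mingw step in the last item could look roughly like the sketch below,
assuming the default-kind file name really does hold UTF-8. The function names
are hypothetical; malformed input (overlong forms, stray continuation bytes,
encoded surrogates) is rejected rather than passed on.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence at S with N bytes available.  Stores the
   code point in *CP and returns the number of bytes consumed, or 0 if
   the sequence is truncated, malformed, overlong, a surrogate, or
   above U+10FFFF.  */
static size_t
utf8_decode (const unsigned char *s, size_t n, uint32_t *cp)
{
  size_t i, len;
  uint32_t c, min;

  if (n == 0)
    return 0;
  if (s[0] < 0x80)
    {
      *cp = s[0];
      return 1;
    }
  else if ((s[0] & 0xE0) == 0xC0)
    { len = 2; c = s[0] & 0x1F; min = 0x80; }
  else if ((s[0] & 0xF0) == 0xE0)
    { len = 3; c = s[0] & 0x0F; min = 0x800; }
  else if ((s[0] & 0xF8) == 0xF0)
    { len = 4; c = s[0] & 0x07; min = 0x10000; }
  else
    return 0;
  if (n < len)
    return 0;
  for (i = 1; i < len; i++)
    {
      if ((s[i] & 0xC0) != 0x80)
        return 0;
      c = (c << 6) | (s[i] & 0x3F);
    }
  if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
    return 0;
  *cp = c;
  return len;
}

/* Convert LEN bytes of UTF-8 in SRC into a 0-terminated UTF-16 string
   in DST (capacity DSTSIZE 16-bit units).  Returns the number of units
   written, or (size_t) -1 on bad input or insufficient space.  */
size_t
utf8_to_utf16 (const char *src, size_t len, uint16_t *dst, size_t dstsize)
{
  const unsigned char *s = (const unsigned char *) src;
  size_t i = 0, pos = 0;

  if (dstsize == 0)
    return (size_t) -1;
  while (i < len)
    {
      uint32_t c = 0;
      size_t used = utf8_decode (s + i, len - i, &c);

      if (used == 0)
        return (size_t) -1;
      i += used;
      if (c < 0x10000)
        {
          if (pos + 1 >= dstsize)
            return (size_t) -1;
          dst[pos++] = (uint16_t) c;
        }
      else
        {
          /* Code points above U+FFFF need a surrogate pair.  */
          if (pos + 2 >= dstsize)
            return (size_t) -1;
          c -= 0x10000;
          dst[pos++] = (uint16_t) (0xD800 | (c >> 10));
          dst[pos++] = (uint16_t) (0xDC00 | (c & 0x3FF));
        }
    }
  dst[pos] = 0;
  return pos;
}
```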

In any case, initially something like your patch in #c1 looks good; regardless
of how/if we decide to support Unicode filenames, currently we don't do
anything sensible for kind=4 file names.
And as you say, it's a standard violation.

Similarly to specifying the default character set as UTF-8, we could specify
the default encoding as UTF-8 (see ENCODING= in OPEN (9.5.6.9) and INQUIRE
(9.10.2.10)). That way we wouldn't need to handle the non-Unicode cases in
10.7.1 at all. I think we're mostly there already, really, what's lacking is
perhaps a "GFortran and Unicode" chapter in the manual.



* [Bug fortran/48972] OPEN with Unicode file name
  2011-05-11 21:53 [Bug fortran/48972] New: OPEN with Unicode file name burnus at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2011-05-12 14:23 ` jb at gcc dot gnu.org
@ 2011-05-12 14:26 ` burnus at gcc dot gnu.org
  2011-05-12 17:51 ` burnus at gcc dot gnu.org
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: burnus at gcc dot gnu.org @ 2011-05-12 14:26 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48972

--- Comment #4 from Tobias Burnus <burnus at gcc dot gnu.org> 2011-05-12 13:37:34 UTC ---
(In reply to comment #3)
> Wouldn't a standard-conforming way to support Unicode file names be for
> gfortran to

I am admittedly a bit lost.

> - Specify that the default character set is UTF-8.

What do you mean by that? I know 1 byte and 4 byte character variables, but I
do not see where UTF-8 fits in there. (One can place UTF-8 into
character(kind=1) - and it also kind of works OK. But if one wants to use
len(), string manipulation ("change 3 character to ..."), or tabulated I/O that
will fail. But as quirky workaround, one can use UTF-8 file names with kind=1
character variables - at least under Unix/Linux.)

Regarding the ENCODING= specifier: That's already used for the encoding of the
file content - one shan't use it to also modify the interpretation of the FILE
string.

I still think that the default character encoding should remain 1 byte
(kind=1), which is simply passed as is to "open()". And UCS-4 as FILE= argument
should simply be supported as vendor extension. One just needs to tell the
library that the string is in UCS-4. This wide string could then be used directly
for Windows' _wopen or converted to UTF-8 for Unix/Linux. (The conversion
routine exists for UCS-4 <-> UTF-8 I/O.)



* [Bug fortran/48972] OPEN with Unicode file name
  2011-05-11 21:53 [Bug fortran/48972] New: OPEN with Unicode file name burnus at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2011-05-12 14:26 ` burnus at gcc dot gnu.org
@ 2011-05-12 17:51 ` burnus at gcc dot gnu.org
  2011-05-12 18:34 ` burnus at gcc dot gnu.org
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: burnus at gcc dot gnu.org @ 2011-05-12 17:51 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48972

--- Comment #5 from Tobias Burnus <burnus at gcc dot gnu.org> 2011-05-12 17:40:32 UTC ---
Author: burnus
Date: Thu May 12 17:40:29 2011
New Revision: 173708

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=173708
Log:
2011-05-12  Tobias Burnus  <burnus@net-b.de>

        PR fortran/48972
        * resolve.c (resolve_intrinsic): Don't resolve module
        intrinsics multiple times.

2011-05-12  Tobias Burnus  <burnus@net-b.de>

        PR fortran/48972
        * gfortran.dg/iso_c_binding_compiler_3.f90: New.


Added:
    trunk/gcc/testsuite/gfortran.dg/iso_c_binding_compiler_3.f90
Modified:
    trunk/gcc/fortran/ChangeLog
    trunk/gcc/fortran/resolve.c
    trunk/gcc/testsuite/ChangeLog



* [Bug fortran/48972] OPEN with Unicode file name
  2011-05-11 21:53 [Bug fortran/48972] New: OPEN with Unicode file name burnus at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2011-05-12 17:51 ` burnus at gcc dot gnu.org
@ 2011-05-12 18:34 ` burnus at gcc dot gnu.org
  2011-05-12 19:00 ` jvdelisle at gcc dot gnu.org
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: burnus at gcc dot gnu.org @ 2011-05-12 18:34 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48972

--- Comment #6 from Tobias Burnus <burnus at gcc dot gnu.org> 2011-05-12 17:44:21 UTC ---
(In reply to comment #5)
> New Revision: 173708

Wrong PR number - supposed to go to PR 45823



* [Bug fortran/48972] OPEN with Unicode file name
  2011-05-11 21:53 [Bug fortran/48972] New: OPEN with Unicode file name burnus at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2011-05-12 18:34 ` burnus at gcc dot gnu.org
@ 2011-05-12 19:00 ` jvdelisle at gcc dot gnu.org
  2011-05-12 21:09 ` jb at gcc dot gnu.org
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: jvdelisle at gcc dot gnu.org @ 2011-05-12 19:00 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48972

--- Comment #7 from Jerry DeLisle <jvdelisle at gcc dot gnu.org> 2011-05-12 18:39:19 UTC ---
In reply to comment #2: some tags are constants and some are variable
expressions, so you have to resolve the correct one. I have not looked for a
while, but I think there is a resolve_tag_e or some such.



* [Bug fortran/48972] OPEN with Unicode file name
  2011-05-11 21:53 [Bug fortran/48972] New: OPEN with Unicode file name burnus at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2011-05-12 19:00 ` jvdelisle at gcc dot gnu.org
@ 2011-05-12 21:09 ` jb at gcc dot gnu.org
  2011-05-13 18:34 ` burnus at gcc dot gnu.org
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: jb at gcc dot gnu.org @ 2011-05-12 21:09 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48972

--- Comment #8 from Janne Blomqvist <jb at gcc dot gnu.org> 2011-05-12 21:02:40 UTC ---
(In reply to comment #4)
> (In reply to comment #3)
> > - Specify that the default character set is UTF-8.
> 
> What do you mean by that? I know 1 byte and 4 byte character variables, but I
> do not see where UTF-8 fits in there. (One can place UTF-8 into
> character(kind=1) - and it also kind of works OK. But if one wants to use
> len(), string manipulation ("change 3 character to ..."), or tabulated I/O that
> will fail. But as quirky workaround, one can use UTF-8 file names with kind=1
> character variables - at least under Unix/Linux.)

Well, for backwards compatibility I strongly think we should keep kind=1 the
default. What I meant was that for bytes whose values are not part of the 7-bit
ASCII character set, we can interpret it as UTF-8, as UTF-8 is backwards
compatible with ASCII. In most cases this won't matter, but it matters e.g. as
discussed in this PR on mingw as we need to convert the default character
filename to utf-16.

The other option, I suppose, would be to regard the default character set as
some locale-dependent charset, and then use some char->wchar_t conversion
routines from the MS libc, assuming such things exist.

FWIW, the issue that the length of a string does not equal the width when
printed is not unique to utf-8. The same issue is seen with kind=4 (utf-32) as
well e.g. if one uses diacritic characters. So regardless of whether one uses
UTF-8, UTF-16 or UTF-32, with unicode one needs to be prepared for the fact
that the number of code points in a string might not be the same as the width.
Fortran is not really prepared for this, so I suppose that making the string
intrinsics etc. consider bytes==characters (for kind=1) is the best we can do
in any case.
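
To make the three notions concrete (bytes, code points, printed width), here
is a small sketch with a hypothetical helper that counts code points in a
UTF-8 string. It assumes the input is already valid UTF-8 and simply counts
the bytes that are not continuation bytes.

```c
#include <stddef.h>

/* Count the code points in LEN bytes of valid UTF-8 at S.  Every code
   point contributes exactly one byte that is not of the form 10xxxxxx,
   so counting those bytes counts the code points.  */
size_t
utf8_code_points (const char *s, size_t len)
{
  size_t i, count = 0;

  for (i = 0; i < len; i++)
    if (((unsigned char) s[i] & 0xC0) != 0x80)
      count++;
  return count;
}
```

For "e" followed by the combining acute accent U+0301 (bytes 65 CC 81) this
gives 3 bytes and 2 code points, while a terminal would normally show a single
column. With kind=1, len() reports the 3; with kind=4 it reports the 2; neither
is the width.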

> Regarding the ENCODING= specifier: That's already used for the encoding of the
> file content - one shan't use it to also modify the interpretation of the FILE
> string.

Yes, the point was not related to the FILE= issue. Rather, that if we make
utf-8 the default charset then it makes sense to also make the default file
encoding utf-8.

> I still think that the default character encoding should remain 1 byte
> (kind=1), which is simply passed as is to "open()". 

Yes, I agree, at least for Unix. What about mingw, then, if the string contains
characters not part of the 7-bit ASCII charset? Will MS libc convert it into
UTF-16 assuming the encoding is according to the current locale, or what?

> And UCS-4 as FILE= argument
> should simply be supported as vendor extension. One just needs to tell the
> library that the string is in UCS-4. 

I'm not convinced of the value of such an extension. Fortran already suffers
from too many vendor extensions.

> This wide string could then be used directly
> for Windows' _wopen

Not really, since wchar_t on windows is a 16-bit type (utf-16), not a 32-bit
one.

> or converted to UTF-8 for Unix/Linux.

Well, that is also a choice that needs to be made, analogous to how to convert
default char file names to utf-16 on mingw. That is, do we convert the name to
UTF-8 or to whatever the charset of the current locale is (LC_CTYPE)?

So in one way it would be nice to make gfortran respect the current locale
charset, but OTOH Unicode was invented because the locale charset system is a
failure, and just using unicode everywhere would in some respects be simpler.
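
For reference, on a POSIX system the charset of the current locale can be
queried as sketched below. The wrapper name is hypothetical; setlocale() and
nl_langinfo() are the standard POSIX calls, and the string returned depends
entirely on the environment (LC_ALL, LC_CTYPE, LANG).

```c
#include <langinfo.h>
#include <locale.h>

/* Return the name of the character encoding selected by the LC_CTYPE
   category of the user's environment, e.g. "UTF-8".  The returned
   string belongs to the C library and must not be freed.  */
const char *
current_locale_codeset (void)
{
  /* Adopt the LC_CTYPE setting from the environment; without this the
     program stays in the "C" locale.  */
  setlocale (LC_CTYPE, "");
  return nl_langinfo (CODESET);
}
```

A library that wanted to respect the locale would compare this name against
"UTF-8" and convert the file name only when it differs.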



* [Bug fortran/48972] OPEN with Unicode file name
  2011-05-11 21:53 [Bug fortran/48972] New: OPEN with Unicode file name burnus at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2011-05-12 21:09 ` jb at gcc dot gnu.org
@ 2011-05-13 18:34 ` burnus at gcc dot gnu.org
  2011-05-13 21:14 ` burnus at gcc dot gnu.org
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: burnus at gcc dot gnu.org @ 2011-05-13 18:34 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48972

--- Comment #9 from Tobias Burnus <burnus at gcc dot gnu.org> 2011-05-13 18:16:40 UTC ---
Author: burnus
Date: Fri May 13 18:16:37 2011
New Revision: 173736

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=173736
Log:
2011-05-12  Tobias Burnus  <burnus@net-b.de>

        PR fortran/48972
        * io.c (resolve_tag_format, resolve_tag): Make sure
        that the string is of default kind.
        (gfc_resolve_inquire): Also resolve decimal tag.

2011-05-12  Tobias Burnus  <burnus@net-b.de>

        PR fortran/48972
        * gfortran.dg/io_constraints_8.f90: New.
        * gfortran.dg/io_constraints_9.f90: New.


Added:
    trunk/gcc/testsuite/gfortran.dg/io_constraints_8.f90
    trunk/gcc/testsuite/gfortran.dg/io_constraints_9.f90
Modified:
    trunk/gcc/fortran/ChangeLog
    trunk/gcc/fortran/io.c
    trunk/gcc/testsuite/ChangeLog



* [Bug fortran/48972] OPEN with Unicode file name
  2011-05-11 21:53 [Bug fortran/48972] New: OPEN with Unicode file name burnus at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2011-05-13 18:34 ` burnus at gcc dot gnu.org
@ 2011-05-13 21:14 ` burnus at gcc dot gnu.org
  2011-05-14 12:45 ` burnus at gcc dot gnu.org
  2011-11-07 22:37 ` fxcoudert at gcc dot gnu.org
  11 siblings, 0 replies; 13+ messages in thread
From: burnus at gcc dot gnu.org @ 2011-05-13 21:14 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48972

--- Comment #10 from Tobias Burnus <burnus at gcc dot gnu.org> 2011-05-13 20:59:09 UTC ---
Author: burnus
Date: Fri May 13 20:59:07 2011
New Revision: 173738

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=173738
Log:
2011-05-13  Tobias Burnus  <burnus@net-b.de>

        PR fortran/48972
        PR fortran/48991
        * gfortran.dg/assign_8.f90: Update dg-error.


Modified:
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gfortran.dg/assign_8.f90



* [Bug fortran/48972] OPEN with Unicode file name
  2011-05-11 21:53 [Bug fortran/48972] New: OPEN with Unicode file name burnus at gcc dot gnu.org
                   ` (9 preceding siblings ...)
  2011-05-13 21:14 ` burnus at gcc dot gnu.org
@ 2011-05-14 12:45 ` burnus at gcc dot gnu.org
  2011-11-07 22:37 ` fxcoudert at gcc dot gnu.org
  11 siblings, 0 replies; 13+ messages in thread
From: burnus at gcc dot gnu.org @ 2011-05-14 12:45 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48972

--- Comment #11 from Tobias Burnus <burnus at gcc dot gnu.org> 2011-05-14 11:55:14 UTC ---
Done: Constraint diagnostic of the Fortran standard.

To be done: Adding vendor extension to support UCS-4 arguments to OPEN's and
INQUIRE's file argument.



* [Bug fortran/48972] OPEN with Unicode file name
  2011-05-11 21:53 [Bug fortran/48972] New: OPEN with Unicode file name burnus at gcc dot gnu.org
                   ` (10 preceding siblings ...)
  2011-05-14 12:45 ` burnus at gcc dot gnu.org
@ 2011-11-07 22:37 ` fxcoudert at gcc dot gnu.org
  11 siblings, 0 replies; 13+ messages in thread
From: fxcoudert at gcc dot gnu.org @ 2011-11-07 22:37 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48972

Francois-Xavier Coudert <fxcoudert at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2011-11-07
                 CC|                            |fxcoudert at gcc dot
                   |                            |gnu.org
     Ever Confirmed|0                           |1
           Severity|normal                      |enhancement



