From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
 id 5AB323857C70; Sun, 28 Feb 2021 03:25:35 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 5AB323857C70
From: "jvdelisle at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug libfortran/99210] X editing for reading file with
 encoding='utf-8'
Date: Sun, 28 Feb 2021 03:25:35 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: libfortran
X-Bugzilla-Version: 10.2.0
X-Bugzilla-Keywords: 
X-Bugzilla-Severity: normal
X-Bugzilla-Who: jvdelisle at gcc dot gnu.org
X-Bugzilla-Status: ASSIGNED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: jvdelisle at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-99210-4-npFOKuf5Uy@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-99210-4@http.gcc.gnu.org/bugzilla/>
References: <bug-99210-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: gcc-bugs@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-bugs mailing list <gcc-bugs.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Sun, 28 Feb 2021 03:25:35 -0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99210

--- Comment #3 from Jerry DeLisle <jvdelisle at gcc dot gnu.org> ---
Here is the real issue. The X format specifier is a position modifier. UTF-8
is a variable-length character encoding, so moving one character could mean
moving 1, 2, 3, or 4 bytes depending on the content of the file.

Up to now we have chosen to move "position" by 1 byte.
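
To illustrate why a 1-byte move is not enough, here is a minimal sketch in C
of how the length of a UTF-8 sequence follows from its lead byte. The helper
names utf8_seq_len and utf8_is_continuation are made up for this example and
are not libgfortran functions; the sketch also does not reject overlong or
out-of-range lead bytes.

```c
#include <stddef.h>

/* Number of bytes in the UTF-8 sequence that starts with LEAD, or 0 if
   LEAD is a continuation byte or otherwise cannot start a sequence.
   Hypothetical helper for illustration only; it does not reject
   overlong (0xC0, 0xC1) or out-of-range (0xF5..0xF7) lead bytes.  */
static size_t
utf8_seq_len (unsigned char lead)
{
  if (lead < 0x80)
    return 1;                   /* 0xxxxxxx: ASCII */
  if ((lead & 0xE0) == 0xC0)
    return 2;                   /* 110xxxxx */
  if ((lead & 0xF0) == 0xE0)
    return 3;                   /* 1110xxxx */
  if ((lead & 0xF8) == 0xF0)
    return 4;                   /* 11110xxx */
  return 0;                     /* 10xxxxxx or 11111xxx */
}

/* Nonzero if B is a continuation byte (10xxxxxx), i.e. not a place
   where decoding of a character can start.  */
static int
utf8_is_continuation (unsigned char b)
{
  return (b & 0xC0) == 0x80;
}
```

For a record holding the bytes C3 A9 61 ("é" followed by "a"), moving one byte
from the start lands on A9, a continuation byte, whereas a character-based 1X
has to move utf8_seq_len (0xC3) bytes, that is 2.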

13.8.1.1 Position editing

1 The position edit descriptors T, TL, TR, and X, specify the position at which
the next character will be transmitted to or from the record. If any character
skipped by a position edit descriptor is of type nondefault character,
and the unit is a default character internal file or an external non-Unicode
file, the result of that position editing is processor dependent.

Our interpretation of this has been that the example provided in this PR is
processor dependent. However, the file is opened with encoding='UTF-8', so it
is a Unicode file and that exception does not apply.

So, we have to use UTF-8 based skips for READs.  The following patch does this:
diff --git a/libgfortran/io/read.c b/libgfortran/io/read.c
index 7515d912c51..30ff0e0deb7 100644
--- a/libgfortran/io/read.c
+++ b/libgfortran/io/read.c
@@ -1255,6 +1255,23 @@ read_x (st_parameter_dt *dtp, size_t n)

   if (n == 0)
     return;
+
+  if (dtp->u.p.current_unit->flags.encoding == ENCODING_UTF8)
+    {
+      gfc_char4_t c;
+      size_t nbytes, j;
+
+      /* Proceed with decoding one character at a time.  */
+      for (j = 0; j < n; j++)
+       {
+         c = read_utf8 (dtp, &nbytes);
+
+         /* Check for a short read and if so, break out.  */
+         if (nbytes == 0 || c == (gfc_char4_t)0)
+           break;
+       }
+      return;
+    }

   length = n;
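
For anyone who wants to try the idea outside the library, the loop in the
patch can be approximated on an in-memory buffer. This is only a sketch under
stated assumptions: skip_utf8_chars is a hypothetical stand-alone function,
it works on a byte array instead of going through read_utf8 and the unit's
stream, it does not validate continuation bytes, and it does not reproduce
the NUL-character check from the patch.

```c
#include <stddef.h>

/* Skip up to N UTF-8 characters in BUF (LEN bytes long), starting at
   byte offset POS, and return the new byte offset.  Stops early on an
   invalid lead byte or a truncated sequence, loosely mirroring the
   "short read" break in the patch.  Hypothetical helper, not
   libgfortran code; continuation bytes are not validated.  */
static size_t
skip_utf8_chars (const unsigned char *buf, size_t len, size_t pos, size_t n)
{
  size_t j;

  for (j = 0; j < n && pos < len; j++)
    {
      unsigned char lead = buf[pos];
      size_t nbytes;

      if (lead < 0x80)
        nbytes = 1;
      else if ((lead & 0xE0) == 0xC0)
        nbytes = 2;
      else if ((lead & 0xF0) == 0xE0)
        nbytes = 3;
      else if ((lead & 0xF8) == 0xF0)
        nbytes = 4;
      else
        break;                  /* Not a valid lead byte.  */

      if (nbytes > len - pos)
        break;                  /* Truncated sequence: short read.  */

      pos += nbytes;
    }
  return pos;
}
```

With the 7-byte record C3 A9 E2 82 AC 61 62 ("é", "€", "a", "b"), skipping two
characters from offset 0 advances five bytes, where the old byte-based X
would have advanced only two.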

The remaining part of this is what to do for end-of-file conditions.  So, I am
doing a little more testing.