From mboxrd@z Thu Jan 1 00:00:00 1970
From: "jvdelisle at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug libfortran/99210] X editing for reading file with encoding='utf-8'
Date: Sun, 28 Feb 2021 03:25:35 +0000
X-Bugzilla-Product: gcc
X-Bugzilla-Component: libfortran
X-Bugzilla-Version: 10.2.0
X-Bugzilla-Severity: normal
X-Bugzilla-Who: jvdelisle at gcc dot gnu.org
X-Bugzilla-Status: ASSIGNED
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: jvdelisle at gcc dot gnu.org
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99210

--- Comment #3 from Jerry DeLisle ---
Here is the real issue. The X format specifier is a position modifier. UTF-8
is a variable-length character encoding, so moving one character could mean
moving 1, 2, 3, or 4 bytes, depending on the content of the file. Up to now we
have chosen to move the "position" by 1 byte.

13.8.1.1 Position editing

1 The position edit descriptors T, TL, TR, and X, specify the position at which
the next character will be transmitted to or from the record. If any character
skipped by a position edit descriptor is of type nondefault character, and the
unit is a default character internal file or an external non-Unicode file, the
result of that position editing is processor dependent.

Our interpretation of this has been that the example provided in this PR is
processor dependent. However, the file is opened with encoding='UTF-8', so it
is not a non-Unicode file and we have to use UTF-8 based skips for READs. The
following patch does this:

diff --git a/libgfortran/io/read.c b/libgfortran/io/read.c
index 7515d912c51..30ff0e0deb7 100644
--- a/libgfortran/io/read.c
+++ b/libgfortran/io/read.c
@@ -1255,6 +1255,23 @@ read_x (st_parameter_dt *dtp, size_t n)

   if (n == 0)
     return;
+
+  if (dtp->u.p.current_unit->flags.encoding == ENCODING_UTF8)
+    {
+      gfc_char4_t c;
+      size_t nbytes, j;
+
+      /* Proceed with decoding one character at a time.  */
+      for (j = 0; j < n; j++)
+        {
+          c = read_utf8 (dtp, &nbytes);
+
+          /* Check for a short read and if so, break out.  */
+          if (nbytes == 0 || c == (gfc_char4_t)0)
+            break;
+        }
+      return;
+    }

   length = n;

The remaining part of this is what to do for end-of-file conditions, so I am
doing a little more testing.
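
To illustrate why skipping by bytes and skipping by characters differ, here is a minimal standalone sketch in C. It is not the libgfortran code: utf8_seq_len and utf8_skip_chars are hypothetical helpers made up for this example, they work on an in-memory buffer instead of a unit, and they only inspect the lead byte of each sequence (continuation bytes are not validated).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of libgfortran): number of bytes in the
   UTF-8 sequence introduced by lead byte B, or 0 if B cannot start one.
   Simplified: overlong lead bytes such as 0xC0 are not rejected.  */
static size_t
utf8_seq_len (unsigned char b)
{
  if (b < 0x80)
    return 1;                   /* 0xxxxxxx: ASCII */
  if ((b & 0xE0) == 0xC0)
    return 2;                   /* 110xxxxx */
  if ((b & 0xF0) == 0xE0)
    return 3;                   /* 1110xxxx */
  if ((b & 0xF8) == 0xF0)
    return 4;                   /* 11110xxx */
  return 0;                     /* continuation or invalid lead byte */
}

/* Hypothetical helper: skip up to N characters in BUF (LEN bytes long)
   and return the number of bytes consumed.  Stops early on a truncated
   or invalid sequence, loosely mirroring the short-read check in the
   patch above.  */
static size_t
utf8_skip_chars (const unsigned char *buf, size_t len, size_t n)
{
  size_t pos = 0, j;

  assert (buf != NULL || len == 0);

  for (j = 0; j < n; j++)
    {
      size_t k;

      if (pos >= len)
        break;                  /* end of data */
      k = utf8_seq_len (buf[pos]);
      if (k == 0 || k > len - pos)
        break;                  /* invalid or truncated sequence */
      pos += k;
    }
  return pos;
}
```

With the bytes for 'a', U+00E9 (0xC3 0xA9) and U+20AC (0xE2 0x82 0xAC), skipping three characters consumes 6 bytes, whereas a byte-wise skip of 3 would stop in the middle of the third character.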