From: Mark Wielaard
To: gcc-rust@gcc.gnu.org
Cc: Mark Wielaard
Subject: [PATCH 1/2] Handle UTF-8 BOM in lexer
Date: Mon, 5 Jul 2021 21:37:47 +0200
Message-Id: <20210705193748.124938-2-mark@klomp.org>
In-Reply-To: <20210705193748.124938-1-mark@klomp.org>
References: <20210705193748.124938-1-mark@klomp.org>

The very first thing in a Rust source file might be the optional
UTF-8 BOM. This is the three bytes 0xEF 0xBB 0xBF. They can simply be
skipped; they just mark the file as UTF-8. Add some testcases to show
we now handle such files.
---
 gcc/rust/lex/rust-lex.cc                            | 13 +++++++++++++
 gcc/testsuite/rust/compile/torture/bom.rs           |  1 +
 gcc/testsuite/rust/compile/torture/bom_comment.rs   |  2 ++
 gcc/testsuite/rust/compile/torture/bom_shebang.rs   |  2 ++
 .../rust/compile/torture/bom_whitespace.rs          |  2 ++
 5 files changed, 20 insertions(+)
 create mode 100644 gcc/testsuite/rust/compile/torture/bom.rs
 create mode 100644 gcc/testsuite/rust/compile/torture/bom_comment.rs
 create mode 100644 gcc/testsuite/rust/compile/torture/bom_shebang.rs
 create mode 100644 gcc/testsuite/rust/compile/torture/bom_whitespace.rs

diff --git a/gcc/rust/lex/rust-lex.cc b/gcc/rust/lex/rust-lex.cc
index ebd69de0fd1..617dd69a080 100644
--- a/gcc/rust/lex/rust-lex.cc
+++ b/gcc/rust/lex/rust-lex.cc
@@ -237,6 +237,19 @@ Lexer::build_token ()
       current_char = peek_input ();
       skip_input ();
 
+      // detect UTF8 bom
+      //
+      // Must be the first thing on the first line.
+      // There might be an optional BOM (Byte Order Mark), which for UTF-8 is
+      // the three bytes 0xEF, 0xBB and 0xBF. These can simply be skipped.
+      if (current_line == 1 && current_column == 1 && current_char == 0xef
+          && peek_input () == 0xbb && peek_input (1) == 0xbf)
+        {
+          skip_input (1);
+          current_char = peek_input ();
+          skip_input ();
+        }
+
       // detect shebang
       // Must be the first thing on the first line, starting with #!
       // But since an attribute can also start with an #! we don't count it as a
diff --git a/gcc/testsuite/rust/compile/torture/bom.rs b/gcc/testsuite/rust/compile/torture/bom.rs
new file mode 100644
index 00000000000..5edcab227ee
--- /dev/null
+++ b/gcc/testsuite/rust/compile/torture/bom.rs
@@ -0,0 +1 @@
+pub fn main () { }
diff --git a/gcc/testsuite/rust/compile/torture/bom_comment.rs b/gcc/testsuite/rust/compile/torture/bom_comment.rs
new file mode 100644
index 00000000000..020e1707b55
--- /dev/null
+++ b/gcc/testsuite/rust/compile/torture/bom_comment.rs
@@ -0,0 +1,2 @@
+// UTF8 BOM
+pub fn main () { }
diff --git a/gcc/testsuite/rust/compile/torture/bom_shebang.rs b/gcc/testsuite/rust/compile/torture/bom_shebang.rs
new file mode 100644
index 00000000000..4c552e8d71d
--- /dev/null
+++ b/gcc/testsuite/rust/compile/torture/bom_shebang.rs
@@ -0,0 +1,2 @@
+#!/usr/bin/cat
+pub fn main () { }
diff --git a/gcc/testsuite/rust/compile/torture/bom_whitespace.rs b/gcc/testsuite/rust/compile/torture/bom_whitespace.rs
new file mode 100644
index 00000000000..b10d5654473
--- /dev/null
+++ b/gcc/testsuite/rust/compile/torture/bom_whitespace.rs
@@ -0,0 +1,2 @@
+
+pub fn main () { }
-- 
2.32.0
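
The BOM check in the rust-lex.cc hunk can also be tried in isolation. The following is only a minimal sketch operating on an in-memory buffer rather than on the lexer's input queue; utf8_bom_length and strip_utf8_bom are hypothetical helper names for illustration and are not part of gccrs or of this patch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not gccrs code): returns how many leading bytes
// belong to a UTF-8 BOM, i.e. 3 if the buffer starts with 0xEF 0xBB 0xBF
// and 0 otherwise.
std::size_t
utf8_bom_length (const std::string &src)
{
  // Compare as unsigned char so the result does not depend on whether
  // plain char is signed on the target.
  if (src.size () >= 3
      && static_cast<unsigned char> (src[0]) == 0xEF
      && static_cast<unsigned char> (src[1]) == 0xBB
      && static_cast<unsigned char> (src[2]) == 0xBF)
    return 3;
  return 0;
}

// Hypothetical helper: returns the source text without a leading BOM.
// BOM bytes anywhere other than the very start are left untouched.
std::string
strip_utf8_bom (const std::string &src)
{
  return src.substr (utf8_bom_length (src));
}

// Small self-check mirroring the bom.rs testcase from the patch.
void
bom_self_check ()
{
  const std::string with_bom = std::string ("\xEF\xBB\xBF") + "pub fn main () { }";
  assert (strip_utf8_bom (with_bom) == "pub fn main () { }");
  assert (strip_utf8_bom ("pub fn main () { }") == "pub fn main () { }");
}
```

Like the lexer change, the sketch only looks at the very start of the input, so a shebang line or comment that follows the BOM is left for later stages to handle.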