* [PATCH] libcpp: Implement -Wbidirectional for CVE-2021-42574 [PR103026]
From: Marek Polacek @ 2021-11-01 16:36 UTC
To: GCC Patches, Joseph Myers; +Cc: Jason Merrill, Jakub Jelinek

From a link below:
"An issue was discovered in the Bidirectional Algorithm in the Unicode
Specification through 14.0. It permits the visual reordering of
characters via control sequences, which can be used to craft source code
that renders different logic than the logical ordering of tokens
ingested by compilers and interpreters. Adversaries can leverage this to
encode source code for compilers accepting Unicode such that targeted
vulnerabilities are introduced invisibly to human reviewers."

More info:
https://nvd.nist.gov/vuln/detail/CVE-2021-42574
https://trojansource.codes/

This is not a compiler bug.  However, to mitigate the problem, this patch
implements -Wbidirectional=[none|unpaired|any] to warn about possibly
misleading Unicode bidirectional characters the preprocessor may encounter.

The default is =unpaired, which warns about improperly terminated
bidirectional characters; e.g. an LRE without its appertaining PDF.  The
level =any warns about any use of bidirectional characters.

This patch handles both UCNs and UTF-8 characters.  UCNs designating
bidi characters in identifiers have been accepted since r204886.  Then
r217144 enabled -fextended-identifiers by default.  Extended characters in
C/C++ identifiers have been accepted since r275979.  However, this patch
still warns about mixing UTF-8 and UCN bidi characters; there seems to be
no good reason to allow mixing them.

We warn in different contexts: comments (both C and C++-style), string
literals, character constants, and identifiers.  As expected, UCNs are
ignored in comments and raw string literals.
The bidirectional characters can nest, so this patch handles that as well.

I have neither included nor tested this with Fortran (which also has
string literals and line comments).

Dave M. posted patches improving diagnostics involving Unicode characters.
This patch does not make use of this new infrastructure yet.

Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk?

	PR preprocessor/103026

gcc/c-family/ChangeLog:

	* c.opt (Wbidirectional, Wbidirectional=): New option.

gcc/ChangeLog:

	* doc/invoke.texi: Document -Wbidirectional.

libcpp/ChangeLog:

	* include/cpplib.h (enum cpp_bidirectional_level): New.
	(struct cpp_options): Add cpp_warn_bidirectional.
	(enum cpp_warning_reason): Add CPP_W_BIDIRECTIONAL.
	* init.c (cpp_create_reader): Set cpp_warn_bidirectional.
	* lex.c (bidi): New namespace.
	(get_bidi_utf8): New function.
	(get_bidi_ucn): Likewise.
	(maybe_warn_bidi_on_close): Likewise.
	(maybe_warn_bidi_on_char): Likewise.
	(_cpp_skip_block_comment): Implement warning about bidirectional
	characters.
	(skip_line_comment): Likewise.
	(forms_identifier_p): Likewise.
	(lex_identifier): Likewise.
	(lex_string): Likewise.
	(lex_raw_string): Likewise.

gcc/testsuite/ChangeLog:

	* c-c++-common/Wbidirectional-1.c: New test.
	* c-c++-common/Wbidirectional-2.c: New test.
	* c-c++-common/Wbidirectional-3.c: New test.
	* c-c++-common/Wbidirectional-4.c: New test.
	* c-c++-common/Wbidirectional-5.c: New test.
	* c-c++-common/Wbidirectional-6.c: New test.
	* c-c++-common/Wbidirectional-7.c: New test.
	* c-c++-common/Wbidirectional-8.c: New test.
	* c-c++-common/Wbidirectional-9.c: New test.
	* c-c++-common/Wbidirectional-10.c: New test.
	* c-c++-common/Wbidirectional-11.c: New test.
	* c-c++-common/Wbidirectional-12.c: New test.
	* c-c++-common/Wbidirectional-13.c: New test.
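
For readers who want the =unpaired rule in a compact form, here is a
rough, standalone sketch.  It is not the code added to libcpp/lex.c; the
names (classify, has_unpaired_bidi, the bidi_kind enum) are made up for
illustration.  It assumes a simple stack rule: PDF closes the innermost
context only if that context is an embedding or override, PDI closes it
only if it is an isolate, and a closer that matches nothing is ignored.
That assumption appears consistent with the expectations in
Wbidirectional-6.c and Wbidirectional-10.c that I checked by hand, but it
is only an inference from the tests.

```cpp
#include <string>
#include <vector>

// The nine bidi control characters named in the cover letter.
constexpr char32_t LRE = 0x202A, RLE = 0x202B, PDF = 0x202C;
constexpr char32_t LRO = 0x202D, RLO = 0x202E;
constexpr char32_t LRI = 0x2066, RLI = 0x2067, FSI = 0x2068, PDI = 0x2069;

// Hypothetical classification; the patch's own enum may look different.
enum class bidi_kind { none, embedding_open, isolate_open, pdf, pdi };

bidi_kind
classify (char32_t c)
{
  switch (c)
    {
    case LRE: case RLE: case LRO: case RLO:
      return bidi_kind::embedding_open;
    case LRI: case RLI: case FSI:
      return bidi_kind::isolate_open;
    case PDF:
      return bidi_kind::pdf;
    case PDI:
      return bidi_kind::pdi;
    default:
      return bidi_kind::none;
    }
}

// Return true if TEXT (one comment, string, or identifier, already
// decoded to code points) still has an open bidi context at its end.
bool
has_unpaired_bidi (const std::u32string &text)
{
  std::vector<bidi_kind> open;  // innermost context is open.back ()
  for (char32_t c : text)
    switch (classify (c))
      {
      case bidi_kind::embedding_open:
      case bidi_kind::isolate_open:
	open.push_back (classify (c));
	break;
      case bidi_kind::pdf:
	// Assumed rule: PDF only closes an embedding/override on top.
	if (!open.empty () && open.back () == bidi_kind::embedding_open)
	  open.pop_back ();
	break;
      case bidi_kind::pdi:
	// Assumed rule: PDI only closes an isolate on top.
	if (!open.empty () && open.back () == bidi_kind::isolate_open)
	  open.pop_back ();
	break;
      case bidi_kind::none:
	break;
      }
  return !open.empty ();
}
```

Under these assumptions, has_unpaired_bidi ({LRE, PDF}) is false and
has_unpaired_bidi ({LRE, PDI}) is true, matching the "terminated by the
wrong char" cases in the testsuite.  The =any level needs no stack at all:
it would fire whenever classify returns something other than
bidi_kind::none.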
---
 gcc/c-family/c.opt                            |  24 ++
 gcc/doc/invoke.texi                           |  19 +-
 gcc/testsuite/c-c++-common/Wbidirectional-1.c |  12 +
 .../c-c++-common/Wbidirectional-10.c          |  28 ++
 .../c-c++-common/Wbidirectional-11.c          |  13 +
 .../c-c++-common/Wbidirectional-12.c          |  19 +
 .../c-c++-common/Wbidirectional-13.c          |  17 +
 gcc/testsuite/c-c++-common/Wbidirectional-2.c |   9 +
 gcc/testsuite/c-c++-common/Wbidirectional-3.c |  11 +
 gcc/testsuite/c-c++-common/Wbidirectional-4.c | 166 ++++++++
 gcc/testsuite/c-c++-common/Wbidirectional-5.c | 166 ++++++++
 gcc/testsuite/c-c++-common/Wbidirectional-6.c | 155 +++++++
 gcc/testsuite/c-c++-common/Wbidirectional-7.c |   9 +
 gcc/testsuite/c-c++-common/Wbidirectional-8.c |  13 +
 gcc/testsuite/c-c++-common/Wbidirectional-9.c |  29 ++
 libcpp/include/cpplib.h                       |  18 +-
 libcpp/init.c                                 |   1 +
 libcpp/lex.c                                  | 391 +++++++++++++++++-
 18 files changed, 1085 insertions(+), 15 deletions(-)
 create mode 100644 gcc/testsuite/c-c++-common/Wbidirectional-1.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidirectional-10.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidirectional-11.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidirectional-12.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidirectional-13.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidirectional-2.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidirectional-3.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidirectional-4.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidirectional-5.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidirectional-6.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidirectional-7.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidirectional-8.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidirectional-9.c

diff --git a/gcc/c-family/c.opt b/gcc/c-family/c.opt
index 06457ac739e..09391824676 100644
--- a/gcc/c-family/c.opt
+++ b/gcc/c-family/c.opt
@@ -374,6 +374,30 @@ Wbad-function-cast
 C ObjC Var(warn_bad_function_cast) Warning
 Warn about casting functions to incompatible types.
 
+Wbidirectional
+C ObjC C++ ObjC++ Warning Alias(Wbidirectional=,any,none)
+;
+
+Wbidirectional=
+C ObjC C++ ObjC++ RejectNegative Joined Warning CPP(cpp_warn_bidirectional) CppReason(CPP_W_BIDIRECTIONAL) Var(warn_bidirectional) Init(bidirectional_unpaired) Enum(cpp_bidirectional_level)
+-Wbidirectional=[none|unpaired|any] Warn about UTF-8 bidirectional characters.
+
+; Required for these enum values.
+SourceInclude
+cpplib.h
+
+Enum
+Name(cpp_bidirectional_level) Type(int) UnknownError(argument %qs to %<-Wbidirectional%> not recognized)
+
+EnumValue
+Enum(cpp_bidirectional_level) String(none) Value(bidirectional_none)
+
+EnumValue
+Enum(cpp_bidirectional_level) String(unpaired) Value(bidirectional_unpaired)
+
+EnumValue
+Enum(cpp_bidirectional_level) String(any) Value(bidirectional_any)
+
 Wbool-compare
 C ObjC C++ ObjC++ Var(warn_bool_compare) Warning LangEnabledBy(C ObjC C++ ObjC++,Wall)
 Warn about boolean expression compared with an integer value different from true/false.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index c5730228821..9dfb95dc24c 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -327,7 +327,9 @@ Objective-C and Objective-C++ Dialects}.
 -Warith-conversion @gol
 -Warray-bounds -Warray-bounds=@var{n} -Warray-compare @gol
 -Wno-attributes -Wattribute-alias=@var{n} -Wno-attribute-alias @gol
--Wno-attribute-warning -Wbool-compare -Wbool-operation @gol
+-Wno-attribute-warning @gol
+-Wbidirectional=@r{[}none@r{|}unpaired@r{|}any@r{]} @gol
+-Wbool-compare -Wbool-operation @gol
 -Wno-builtin-declaration-mismatch @gol
 -Wno-builtin-macro-redefined -Wc90-c99-compat -Wc99-c11-compat @gol
 -Wc11-c2x-compat @gol
@@ -7674,6 +7676,21 @@ Attributes considered include @code{alloc_align}, @code{alloc_size},
 This is the default. You can disable these warnings with either
 @option{-Wno-attribute-alias} or @option{-Wattribute-alias=0}.
+@item -Wbidirectional=@r{[}none@r{|}unpaired@r{|}any@r{]} +@opindex Wbidirectional= +@opindex Wbidirectional +@opindex Wno-bidirectional +Warn about UTF-8 bidirectional characters. Such characters can change +left-to-right writing direction into right-to-left (and vice versa), +which can cause confusion between the logical order and visual order. +This may be dangerous; for instance, it may seem that a piece of code +is not commented out, whereas it in fact is. + +There are three levels of warning supported by GCC@. The default is +@option{-Wbidirectional=unpaired}, which warns about improperly terminated +bidi contexts. @option{-Wbidirectional=none} turns the warning off. +@option{-Wbidirectional=any} warns about any use of bidirectional characters. + @item -Wbool-compare @opindex Wno-bool-compare @opindex Wbool-compare diff --git a/gcc/testsuite/c-c++-common/Wbidirectional-1.c b/gcc/testsuite/c-c++-common/Wbidirectional-1.c new file mode 100644 index 00000000000..34f5ac19271 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidirectional-1.c @@ -0,0 +1,12 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ + +int main() { + int isAdmin = 0; + /* } if (isAdmin) begin admins only */ +/* { dg-warning "bidirectional" "" { target *-*-* } .-1 } */ + __builtin_printf("You are an admin.\n"); + /* end admins only { */ +/* { dg-warning "bidirectional" "" { target *-*-* } .-1 } */ + return 0; +} diff --git a/gcc/testsuite/c-c++-common/Wbidirectional-10.c b/gcc/testsuite/c-c++-common/Wbidirectional-10.c new file mode 100644 index 00000000000..fea1e4d034b --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidirectional-10.c @@ -0,0 +1,28 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidirectional=unpaired" } */ +/* More nesting testing. 
*/ + +/* RLE LRI PDF PDI*/ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int LRE_\u202a_PDF_\u202c; +int LRE_\u202a_PDF_\u202c_LRE_\u202a_PDF_\u202c; +int LRE_\u202a_LRI_\u2066_PDF_\u202c_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int RLE_\u202b_RLI_\u2067_PDF_\u202c_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int RLE_\u202b_RLI_\u2067_PDI_\u2069_PDF_\u202c; +int FSI_\u2068_LRO_\u202d_PDI_\u2069_PDF_\u202c; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int FSI_\u2068; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int FSI_\u2068_PDI_\u2069; +int FSI_\u2068_FSI_\u2068_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069; +int RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDF_\u202c; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_FSI_\u2068_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidirectional-11.c b/gcc/testsuite/c-c++-common/Wbidirectional-11.c new file mode 100644 index 00000000000..7d91527148e --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidirectional-11.c @@ -0,0 +1,13 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidirectional=unpaired" } */ +/* Test that we warn when mixing UCN and UTF-8. 
*/ + +int LRE__PDF_\u202c; +/* { dg-warning "mismatch" "" { target *-*-* } .-1 } */ +int LRE_\u202a_PDF__; +/* { dg-warning "mismatch" "" { target *-*-* } .-1 } */ +const char *s1 = "LRE__PDF_\u202c"; +/* { dg-warning "mismatch" "" { target *-*-* } .-1 } */ +const char *s2 = "LRE_\u202a_PDF_"; +/* { dg-warning "mismatch" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidirectional-12.c b/gcc/testsuite/c-c++-common/Wbidirectional-12.c new file mode 100644 index 00000000000..2030bb73fe9 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidirectional-12.c @@ -0,0 +1,19 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile { target { c || c++11 } } } */ +/* { dg-options "-Wbidirectional=any" } */ +/* Test raw strings. */ + +const char *s1 = R"(a b c LRE 1 2 3 PDF x y z)"; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ +const char *s2 = R"(a b c RLE 1 2 3 PDF x y z)"; +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ +const char *s3 = R"(a b c LRO 1 2 3 PDF x y z)"; +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ +const char *s4 = R"(a b c RLO 1 2 3 PDF x y z)"; +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ +const char *s7 = R"(a b c FSI 1 2 3 PDI x y) z"; +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */ +const char *s8 = R"(a b c PDI x y )z"; +/* { dg-warning "U\\+2069" "" { target *-*-* } .-1 } */ +const char *s9 = R"(a b c PDF x y z)"; +/* { dg-warning "U\\+202C" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidirectional-13.c b/gcc/testsuite/c-c++-common/Wbidirectional-13.c new file mode 100644 index 00000000000..8e13bf53595 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidirectional-13.c @@ -0,0 +1,17 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile { target { c || c++11 } } } */ +/* { dg-options "-Wbidirectional=unpaired" } */ +/* Test raw strings. 
*/ + +const char *s1 = R"(a b c LRE 1 2 3)"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +const char *s2 = R"(a b c RLE 1 2 3)"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +const char *s3 = R"(a b c LRO 1 2 3)"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +const char *s4 = R"(a b c FSI 1 2 3)"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +const char *s5 = R"(a b c LRI 1 2 3)"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +const char *s6 = R"(a b c RLI 1 2 3)"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidirectional-2.c b/gcc/testsuite/c-c++-common/Wbidirectional-2.c new file mode 100644 index 00000000000..2340374f276 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidirectional-2.c @@ -0,0 +1,9 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ + +int main() { + /* Say hello; newline/*/ return 0 ; +/* { dg-warning "bidirectional" "" { target *-*-* } .-1 } */ + __builtin_printf("Hello world.\n"); + return 0; +} diff --git a/gcc/testsuite/c-c++-common/Wbidirectional-3.c b/gcc/testsuite/c-c++-common/Wbidirectional-3.c new file mode 100644 index 00000000000..9dc7edb6e64 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidirectional-3.c @@ -0,0 +1,11 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ + +int main() { + const char* access_level = "user"; + if (__builtin_strcmp(access_level, "user // Check if admin ")) { +/* { dg-warning "bidirectional" "" { target *-*-* } .-1 } */ + __builtin_printf("You are an admin.\n"); + } + return 0; +} diff --git a/gcc/testsuite/c-c++-common/Wbidirectional-4.c b/gcc/testsuite/c-c++-common/Wbidirectional-4.c new file mode 100644 index 00000000000..0fbc0dad6ab --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidirectional-4.c @@ -0,0 +1,166 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidirectional=any -Wno-multichar -Wno-overflow" } */ +/* Test all bidi chars in various 
contexts (identifiers, comments, + string literals, character constants), both UCN and UTF-8. The bidi + chars here are properly terminated, except for the character constants. */ + +/* a b c LRE 1 2 3 PDF x y z */ +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ +/* a b c RLE 1 2 3 PDF x y z */ +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ +/* a b c LRO 1 2 3 PDF x y z */ +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ +/* a b c RLO 1 2 3 PDF x y z */ +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ +/* a b c LRI 1 2 3 PDI x y z */ +/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */ +/* a b c RLI 1 2 3 PDI x y */ +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */ +/* a b c FSI 1 2 3 PDI x y z */ +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */ + +/* Same but C++ comments instead. */ +// a b c LRE 1 2 3 PDF x y z +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ +// a b c RLE 1 2 3 PDF x y z +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ +// a b c LRO 1 2 3 PDF x y z +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ +// a b c RLO 1 2 3 PDF x y z +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ +// a b c LRI 1 2 3 PDI x y z +/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */ +// a b c RLI 1 2 3 PDI x y +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */ +// a b c FSI 1 2 3 PDI x y z +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */ + +/* Here we're closing an unopened context, warn when =any. 
*/ +/* a b c PDI x y z */ +/* { dg-warning "U\\+2069" "" { target *-*-* } .-1 } */ +/* a b c PDF x y z */ +/* { dg-warning "U\\+202C" "" { target *-*-* } .-1 } */ +// a b c PDI x y z +/* { dg-warning "U\\+2069" "" { target *-*-* } .-1 } */ +// a b c PDF x y z +/* { dg-warning "U\\+202C" "" { target *-*-* } .-1 } */ + +void +g1 () +{ + const char *s1 = "a b c LRE 1 2 3 PDF x y z"; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ + const char *s2 = "a b c RLE 1 2 3 PDF x y z"; +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ + const char *s3 = "a b c LRO 1 2 3 PDF x y z"; +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ + const char *s4 = "a b c RLO 1 2 3 PDF x y z"; +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ + const char *s5 = "a b c LRI 1 2 3 PDI x y z"; +/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */ + const char *s6 = "a b c RLI 1 2 3 PDI x y z"; +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */ + const char *s7 = "a b c FSI 1 2 3 PDI x y z"; +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */ + const char *s8 = "a b c PDI x y z"; +/* { dg-warning "U\\+2069" "" { target *-*-* } .-1 } */ + const char *s9 = "a b c PDF x y z"; +/* { dg-warning "U\\+202C" "" { target *-*-* } .-1 } */ + + const char *s10 = "a b c LRE\u202a 1 2 3 PDF\u202c x y z"; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ + const char *s11 = "a b c LRE\u202A 1 2 3 PDF\u202c x y z"; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ + const char *s12 = "a b c RLE\u202b 1 2 3 PDF\u202c x y z"; +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ + const char *s13 = "a b c RLE\u202B 1 2 3 PDF\u202c x y z"; +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ + const char *s14 = "a b c LRO\u202d 1 2 3 PDF\u202c x y z"; +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ + const char *s15 = "a b c LRO\u202D 1 2 3 PDF\u202c x y z"; +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ + const char *s16 = "a b c 
RLO\u202e 1 2 3 PDF\u202c x y z"; +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ + const char *s17 = "a b c RLO\u202E 1 2 3 PDF\u202c x y z"; +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ + const char *s18 = "a b c LRI\u2066 1 2 3 PDI\u2069 x y z"; +/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */ + const char *s19 = "a b c RLI\u2067 1 2 3 PDI\u2069 x y z"; +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */ + const char *s20 = "a b c FSI\u2068 1 2 3 PDI\u2069 x y z"; +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */ +} + +void +g2 () +{ + const char c1 = '\u202a'; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ + const char c2 = '\u202A'; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ + const char c3 = '\u202b'; +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ + const char c4 = '\u202B'; +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ + const char c5 = '\u202d'; +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ + const char c6 = '\u202D'; +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ + const char c7 = '\u202e'; +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ + const char c8 = '\u202E'; +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ + const char c9 = '\u2066'; +/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */ + const char c10 = '\u2067'; +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */ + const char c11 = '\u2068'; +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */ +} + +int abc; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */ +int 
AX; +/* { dg-warning "U\\+202C" "" { target *-*-* } .-1 } */ +int A\u202cY; +/* { dg-warning "U\\+202C" "" { target *-*-* } .-1 } */ +int A\u202CY2; +/* { dg-warning "U\\+202C" "" { target *-*-* } .-1 } */ + +int d\u202ae\u202cf; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ +int d\u202Ae\u202cf2; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ +int d\u202be\u202cf; +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ +int d\u202Be\u202cf2; +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ +int d\u202de\u202cf; +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ +int d\u202De\u202cf2; +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ +int d\u202ee\u202cf; +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ +int d\u202Ee\u202cf2; +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ +int d\u2066e\u2069f; +/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */ +int d\u2067e\u2069f; +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */ +int d\u2068e\u2069f; +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */ +int X\u2069; +/* { dg-warning "U\\+2069" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidirectional-5.c b/gcc/testsuite/c-c++-common/Wbidirectional-5.c new file mode 100644 index 00000000000..800d273dbe5 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidirectional-5.c @@ -0,0 +1,166 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidirectional=unpaired -Wno-multichar -Wno-overflow" } */ +/* Test all bidi chars in various contexts (identifiers, comments, + string literals, character constants), both UCN and UTF-8. The bidi + chars here are properly terminated, except for the character constants. 
*/ + +/* a b c LRE 1 2 3 PDF x y z */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +/* a b c RLE 1 2 3 PDF x y z */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +/* a b c LRO 1 2 3 PDF x y z */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +/* a b c RLO 1 2 3 PDF x y z */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +/* a b c LRI 1 2 3 PDI x y z */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +/* a b c RLI 1 2 3 PDI x y */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +/* a b c FSI 1 2 3 PDI x y z */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + +/* Same but C++ comments instead. */ +// a b c LRE 1 2 3 PDF x y z +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +// a b c RLE 1 2 3 PDF x y z +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +// a b c LRO 1 2 3 PDF x y z +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +// a b c RLO 1 2 3 PDF x y z +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +// a b c LRI 1 2 3 PDI x y z +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +// a b c RLI 1 2 3 PDI x y +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +// a b c FSI 1 2 3 PDI x y z +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + +/* Here we're closing an unopened context, warn when =any. 
*/ +/* a b c PDI x y z */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +/* a b c PDF x y z */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +// a b c PDI x y z +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +// a b c PDF x y z +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + +void +g1 () +{ + const char *s1 = "a b c LRE 1 2 3 PDF x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s2 = "a b c RLE 1 2 3 PDF x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s3 = "a b c LRO 1 2 3 PDF x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s4 = "a b c RLO 1 2 3 PDF x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s5 = "a b c LRI 1 2 3 PDI x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s6 = "a b c RLI 1 2 3 PDI x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s7 = "a b c FSI 1 2 3 PDI x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s8 = "a b c PDI x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s9 = "a b c PDF x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + + const char *s10 = "a b c LRE\u202a 1 2 3 PDF\u202c x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s11 = "a b c LRE\u202A 1 2 3 PDF\u202c x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s12 = "a b c RLE\u202b 1 2 3 PDF\u202c x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s13 = "a b c RLE\u202B 1 2 3 PDF\u202c x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s14 = "a b c LRO\u202d 1 2 3 PDF\u202c x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s15 = "a b c LRO\u202D 1 2 3 PDF\u202c x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s16 = "a b c RLO\u202e 1 2 3 PDF\u202c x y z"; +/* { 
dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s17 = "a b c RLO\u202E 1 2 3 PDF\u202c x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s18 = "a b c LRI\u2066 1 2 3 PDI\u2069 x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s19 = "a b c RLI\u2067 1 2 3 PDI\u2069 x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s20 = "a b c FSI\u2068 1 2 3 PDI\u2069 x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +} + +void +g2 () +{ + const char c1 = '\u202a'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c2 = '\u202A'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c3 = '\u202b'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c4 = '\u202B'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c5 = '\u202d'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c6 = '\u202D'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c7 = '\u202e'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c8 = '\u202E'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c9 = '\u2066'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c10 = '\u2067'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c11 = '\u2068'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +} + +int abc; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int AX; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int 
A\u202cY; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int A\u202CY2; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + +int d\u202ae\u202cf; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u202Ae\u202cf2; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u202be\u202cf; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u202Be\u202cf2; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u202de\u202cf; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u202De\u202cf2; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u202ee\u202cf; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u202Ee\u202cf2; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u2066e\u2069f; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u2067e\u2069f; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u2068e\u2069f; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int X\u2069; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidirectional-6.c b/gcc/testsuite/c-c++-common/Wbidirectional-6.c new file mode 100644 index 00000000000..bf8b6104d43 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidirectional-6.c @@ -0,0 +1,155 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidirectional=unpaired" } */ +/* Test nesting of bidi chars in various contexts. 
*/ + +/* Terminated by the wrong char: */ +/* a b c LRE 1 2 3 PDI x y z */ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* a b c RLE 1 2 3 PDI x y z*/ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* a b c LRO 1 2 3 PDI x y z */ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* a b c RLO 1 2 3 PDI x y z */ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* a b c LRI 1 2 3 PDF x y z */ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* a b c RLI 1 2 3 PDF x y z */ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* a b c FSI 1 2 3 PDF x y z*/ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + +/* LRE PDF */ +/* LRE LRE PDF PDF */ +/* PDF LRE PDF */ +/* LRE PDF LRE PDF */ +/* LRE LRE PDF */ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* PDF LRE */ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + +// a b c LRE 1 2 3 PDI x y z +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +// a b c RLE 1 2 3 PDI x y z*/ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +// a b c LRO 1 2 3 PDI x y z +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +// a b c RLO 1 2 3 PDI x y z +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +// a b c LRI 1 2 3 PDF x y z +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +// a b c RLI 1 2 3 PDF x y z +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +// a b c FSI 1 2 3 PDF x y z +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + +// LRE PDF +// LRE LRE PDF PDF +// PDF LRE PDF +// LRE PDF LRE PDF +// LRE LRE PDF +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +// PDF LRE +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + +void +g1 () +{ + const char *s1 = "a b c LRE 1 2 3 PDI x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s2 = "a b c LRE\u202a 1 2 3 PDI\u2069 x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s3 = "a b c RLE 1 2 
3 PDI x y "; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s4 = "a b c RLE\u202b 1 2 3 PDI\u2069 x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s5 = "a b c LRO 1 2 3 PDI x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s6 = "a b c LRO\u202d 1 2 3 PDI\u2069 x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s7 = "a b c RLO 1 2 3 PDI x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s8 = "a b c RLO\u202e 1 2 3 PDI\u2069 x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s9 = "a b c LRI 1 2 3 PDF x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s10 = "a b c LRI\u2066 1 2 3 PDF\u202c x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s11 = "a b c RLI 1 2 3 PDF x y z\ + "; +/* { dg-warning "unpaired" "" { target *-*-* } .-2 } */ + const char *s12 = "a b c RLI\u2067 1 2 3 PDF\u202c x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s13 = "a b c FSI 1 2 3 PDF x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s14 = "a b c FSI\u2068 1 2 3 PDF\u202c x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s15 = "PDF LRE"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s16 = "PDF\u202c LRE\u202a"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s17 = "LRE PDF"; + const char *s18 = "LRE\u202a PDF\u202c"; + const char *s19 = "LRE LRE PDF PDF"; + const char *s20 = "LRE\u202a LRE\u202a PDF\u202c PDF\u202c"; + const char *s21 = "PDF LRE PDF"; + const char *s22 = "PDF\u202c LRE\u202a PDF\u202c"; + const char *s23 = "LRE LRE PDF"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s24 = "LRE\u202a LRE\u202a PDF\u202c"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s25 = "PDF LRE"; +/* { 
dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s26 = "PDF\u202c LRE\u202a"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s27 = "PDF LRE\u202a"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s28 = "PDF\u202c LRE"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +} + +int aLREbPDI; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int A\u202aB\u2069C; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int aRLEbPDI; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int a\u202bB\u2069c; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int aLRObPDI; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int a\u202db\u2069c2; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int aRLObPDI; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int a\u202eb\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int aLRIbPDF; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int a\u2066b\u202c; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int aRLIbPDFc +; +/* { dg-warning "unpaired" "" { target *-*-* } .-2 } */ +int a\u2067b\u202c; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int aFSIbPDF; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int a\u2068b\u202c; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int aFSIbPD\u202C; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int aFSI\u2068bPDF_; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int aLREbPDFb; +int A\u202aB\u202c; +int a_LRE_LRE_b_PDF_PDF; +int A\u202aA\u202aB\u202cB\u202c; +int aPDFbLREadPDF; +int a_\u202C_\u202a_\u202c; +int a_LRE_b_PDF_c_LRE_PDF; +int a_\u202a_\u202c_\u202a_\u202c_; +int a_LRE_b_PDF_c_LRE; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int a_\u202a_\u202c_\u202a_; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidirectional-7.c 
b/gcc/testsuite/c-c++-common/Wbidirectional-7.c new file mode 100644 index 00000000000..bb973ab7c27 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidirectional-7.c @@ -0,0 +1,9 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidirectional=any" } */ +/* Test we ignore UCNs in comments. */ + +// a b c \u202a 1 2 3 +// a b c \u202A 1 2 3 +/* a b c \u202a 1 2 3 */ +/* a b c \u202A 1 2 3 */ diff --git a/gcc/testsuite/c-c++-common/Wbidirectional-8.c b/gcc/testsuite/c-c++-common/Wbidirectional-8.c new file mode 100644 index 00000000000..23f5c92418b --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidirectional-8.c @@ -0,0 +1,13 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidirectional=any" } */ +/* Test \u vs \U. */ + +int a_\u202A; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ +int a_\u202a_2; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ +int a_\U0000202A_3; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ +int a_\U0000202a_4; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidirectional-9.c b/gcc/testsuite/c-c++-common/Wbidirectional-9.c new file mode 100644 index 00000000000..dfcd6814fcb --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidirectional-9.c @@ -0,0 +1,29 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidirectional=unpaired" } */ +/* Test that we properly separate bidi contexts (comment/identifier/character + constant/string literal). 
*/ + +/* LRE -><- */ int pdf_\u202c_1; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* RLE -><- */ int pdf_\u202c_2; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* LRO -><- */ int pdf_\u202c_3; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* RLO -><- */ int pdf_\u202c_4; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* LRI -><-*/ int pdi_\u2069_1; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* RLI -><- */ int pdi_\u2069_12; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* FSI -><- */ int pdi_\u2069_3; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + +const char *s1 = "LRE\u202a"; /* PDF -><- */ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* LRE -><- */ const char *s2 = "PDF\u202c"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +const char *s3 = "LRE\u202a"; int pdf_\u202c_5; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int lre_\u202a; const char *s4 = "PDF\u202c"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ diff --git a/libcpp/include/cpplib.h b/libcpp/include/cpplib.h index 176f8c5bbce..60cf08ddd35 100644 --- a/libcpp/include/cpplib.h +++ b/libcpp/include/cpplib.h @@ -319,6 +319,17 @@ enum cpp_main_search CMS_system, /* Search the system INCLUDE path. */ }; +/* The possible bidirectional characters checking levels, from least + restrictive to most. */ +enum cpp_bidirectional_level { + /* No checking. */ + bidirectional_none, + /* Only detect unpaired uses of bidirectional characters. */ + bidirectional_unpaired, + /* Detect any use of bidirectional characters. */ + bidirectional_any +}; + /* This structure is nested inside struct cpp_reader, and carries all the options visible to the command line. */ struct cpp_options @@ -539,6 +550,10 @@ struct cpp_options /* True if warn about differences between C++98 and C++11. */ bool cpp_warn_cxx11_compat; + /* Nonzero of bidirectional characters checking is on. 
See enum + cpp_bidirectional_level. */ + unsigned char cpp_warn_bidirectional; + /* Dependency generation. */ struct { @@ -643,7 +658,8 @@ enum cpp_warning_reason { CPP_W_C90_C99_COMPAT, CPP_W_C11_C2X_COMPAT, CPP_W_CXX11_COMPAT, - CPP_W_EXPANSION_TO_DEFINED + CPP_W_EXPANSION_TO_DEFINED, + CPP_W_BIDIRECTIONAL }; /* Callback for header lookup for HEADER, which is the name of a diff --git a/libcpp/init.c b/libcpp/init.c index 5a424e23553..f9a8f5f088f 100644 --- a/libcpp/init.c +++ b/libcpp/init.c @@ -223,6 +223,7 @@ cpp_create_reader (enum c_lang lang, cpp_hash_table *table, = ENABLE_CANONICAL_SYSTEM_HEADERS; CPP_OPTION (pfile, ext_numeric_literals) = 1; CPP_OPTION (pfile, warn_date_time) = 0; + CPP_OPTION (pfile, cpp_warn_bidirectional) = bidirectional_unpaired; /* Default CPP arithmetic to something sensible for the host for the benefit of dumb users like fix-header. */ diff --git a/libcpp/lex.c b/libcpp/lex.c index fa2253d41c3..f7a86fbe4b5 100644 --- a/libcpp/lex.c +++ b/libcpp/lex.c @@ -1164,6 +1164,284 @@ _cpp_process_line_notes (cpp_reader *pfile, int in_comment) } } +namespace bidi { + enum class kind { + NONE, LRE, RLE, LRO, RLO, LRI, RLI, FSI, PDF, PDI + }; + + /* All the UTF-8 encodings of bidi characters start with E2. */ + constexpr uchar utf8_start = 0xe2; + + /* A vector holding currently open bidi contexts. We use a char for + each context, its LSB is 1 if it represents a PDF context, 0 if it + represents a PDI context. The next bit is 1 if this context was open + by a bidi character written as a UCN, and 0 when it was UTF-8. */ + semi_embedded_vec <unsigned char, 16> vec; + + /* Close the whole comment/identifier/string literal/character constant + context. */ + void on_close () + { + vec.truncate (0); + } + + /* Pop the last element in the vector. */ + void pop () + { + unsigned int len = vec.count (); + gcc_checking_assert (len > 0); + vec.truncate (len - 1); + } + + /* Return which context is currently opened. 
*/ + kind current_ctx () + { + unsigned int len = vec.count (); + if (len == 0) + return kind::NONE; + return (vec[len - 1] & 1) ? kind::PDF : kind::PDI; + } + + /* Return true if the current context comes from a UCN origin, that is, + the bidi char which started this bidi context was written as a UCN. */ + bool current_ctx_ucn_p () + { + unsigned int len = vec.count (); + gcc_checking_assert (len > 0); + return (vec[len - 1] >> 1) & 1; + } + + /* We've read a bidi char, update the current vector as necessary. */ + void on_char (kind k, bool ucn_p) + { + switch (k) + { + case kind::LRE: + case kind::RLE: + case kind::LRO: + case kind::RLO: + vec.push (ucn_p ? 3u : 1u); + break; + case kind::LRI: + case kind::RLI: + case kind::FSI: + vec.push (ucn_p ? 2u : 0u); + break; + case kind::PDF: + if (current_ctx () == kind::PDF) + pop (); + break; + case kind::PDI: + if (current_ctx () == kind::PDI) + pop (); + break; + [[likely]] case kind::NONE: + break; + default: + abort (); + } + } + + /* Return a descriptive string for K. */ + const char *to_str (kind k) + { + switch (k) + { + case kind::LRE: + return "U+202A (LEFT-TO-RIGHT EMBEDDING)"; + case kind::RLE: + return "U+202B (RIGHT-TO-LEFT EMBEDDING)"; + case kind::LRO: + return "U+202D (LEFT-TO-RIGHT OVERRIDE)"; + case kind::RLO: + return "U+202E (RIGHT-TO-LEFT OVERRIDE)"; + case kind::LRI: + return "U+2066 (LEFT-TO-RIGHT ISOLATE)"; + case kind::RLI: + return "U+2067 (RIGHT-TO-LEFT ISOLATE)"; + case kind::FSI: + return "U+2068 (FIRST STRONG ISOLATE)"; + case kind::PDF: + return "U+202C (POP DIRECTIONAL FORMATTING)"; + case kind::PDI: + return "U+2069 (POP DIRECTIONAL ISOLATE)"; + default: + abort (); + } + } +} + +/* Parse a sequence of 3 bytes starting with P and return its bidi code. 
*/ + +static bidi::kind +get_bidi_utf8 (const unsigned char *const p) +{ + gcc_checking_assert (p[0] == bidi::utf8_start); + + if (p[1] == 0x80) + switch (p[2]) + { + case 0xaa: + return bidi::kind::LRE; + case 0xab: + return bidi::kind::RLE; + case 0xac: + return bidi::kind::PDF; + case 0xad: + return bidi::kind::LRO; + case 0xae: + return bidi::kind::RLO; + default: + break; + } + else if (p[1] == 0x81) + switch (p[2]) + { + case 0xa6: + return bidi::kind::LRI; + case 0xa7: + return bidi::kind::RLI; + case 0xa8: + return bidi::kind::FSI; + case 0xa9: + return bidi::kind::PDI; + default: + break; + } + + return bidi::kind::NONE; +} + +/* Parse a UCN where P points just past \u or \U and return its bidi code. */ + +static bidi::kind +get_bidi_ucn (const unsigned char *p, bool is_U) +{ + /* 6.4.3 Universal Character Names + \u hex-quad + \U hex-quad hex-quad + where \unnnn means \U0000nnnn. */ + + if (is_U) + { + if (p[0] != '0' || p[1] != '0' || p[2] != '0' || p[3] != '0') + return bidi::kind::NONE; + /* Skip 4B so we can treat \u and \U the same below. */ + p += 4; + } + + /* All code points we are looking for start with 20xx. */ + if (p[0] != '2' || p[1] != '0') + return bidi::kind::NONE; + else if (p[2] == '2') + switch (p[3]) + { + case 'a': + case 'A': + return bidi::kind::LRE; + case 'b': + case 'B': + return bidi::kind::RLE; + case 'c': + case 'C': + return bidi::kind::PDF; + case 'd': + case 'D': + return bidi::kind::LRO; + case 'e': + case 'E': + return bidi::kind::RLO; + default: + break; + } + else if (p[2] == '6') + switch (p[3]) + { + case '6': + return bidi::kind::LRI; + case '7': + return bidi::kind::RLI; + case '8': + return bidi::kind::FSI; + case '9': + return bidi::kind::PDI; + default: + break; + } + + return bidi::kind::NONE; +} + +/* We're closing a bidi context, that is, we've encountered a newline, + are closing a C-style comment, or are at the end of a string literal, + character constant, or identifier. 
Warn if this context was not + properly terminated by a PDI or PDF. P points to the last character + in this context. */ + +static void +maybe_warn_bidi_on_close (cpp_reader *pfile, const uchar *p) +{ + if (CPP_OPTION (pfile, cpp_warn_bidirectional) == bidirectional_unpaired + && bidi::vec.count () > 0) + { + const location_t loc + = linemap_position_for_column (pfile->line_table, + CPP_BUF_COLUMN (pfile->buffer, p)); + cpp_warning_with_line (pfile, CPP_W_BIDIRECTIONAL, loc, 0, + "unpaired UTF-8 bidirectional character " + "detected"); + } + /* We're done with this context. */ + bidi::on_close (); +} + +/* We're at the beginning or in the middle of an identifier/comment/string + literal/character constant. Warn if we've encountered a bidi character. + KIND says which bidi character it was; P points to it in the character + stream. UCN_P is true iff this bidi character was written as a UCN. */ + +static void +maybe_warn_bidi_on_char (cpp_reader *pfile, const uchar *p, bidi::kind kind, + bool ucn_p) +{ + if (__builtin_expect (kind == bidi::kind::NONE, 1)) + return; + + const auto warn_bidi = CPP_OPTION (pfile, cpp_warn_bidirectional); + + if (warn_bidi != bidirectional_none) + { + const location_t loc + = linemap_position_for_column (pfile->line_table, + CPP_BUF_COLUMN (pfile->buffer, p)); + /* It seems excessive to warn about a PDI/PDF that is closing + an opened context because we've already warned about the + opening character. Except warn when we have a UCN x UTF-8 + mismatch. 
*/ + if (kind == bidi::current_ctx ()) + { + if (warn_bidi == bidirectional_unpaired + && bidi::current_ctx_ucn_p () != ucn_p) + cpp_warning_with_line (pfile, CPP_W_BIDIRECTIONAL, loc, 0, + "UTF-8 vs UCN mismatch when closing " + "a context by \"%s\"", bidi::to_str (kind)); + } + else if (warn_bidi == bidirectional_any) + { + if (kind == bidi::kind::PDF || kind == bidi::kind::PDI) + cpp_warning_with_line (pfile, CPP_W_BIDIRECTIONAL, loc, 0, + "\"%s\" is closing an unopened context", + bidi::to_str (kind)); + else + cpp_warning_with_line (pfile, CPP_W_BIDIRECTIONAL, loc, 0, + "found problematic Unicode character \"%s\"", + bidi::to_str (kind)); + } + } + /* We're done with this context. */ + bidi::on_char (kind, ucn_p); +} + /* Skip a C-style block comment. We find the end of the comment by seeing if an asterisk is before every '/' we encounter. Returns nonzero if comment terminated by EOF, zero otherwise. @@ -1175,7 +1453,8 @@ _cpp_skip_block_comment (cpp_reader *pfile) cpp_buffer *buffer = pfile->buffer; const uchar *cur = buffer->cur; uchar c; - + const bool warn_bidi_p = (CPP_OPTION (pfile, cpp_warn_bidirectional) + != bidirectional_none); cur++; if (*cur == '/') cur++; @@ -1189,7 +1468,11 @@ _cpp_skip_block_comment (cpp_reader *pfile) if (c == '/') { if (cur[-2] == '*') - break; + { + if (warn_bidi_p) + maybe_warn_bidi_on_close (pfile, cur); + break; + } /* Warn about potential nested comments, but not if the '/' comes immediately before the true comment delimiter. @@ -1208,6 +1491,8 @@ _cpp_skip_block_comment (cpp_reader *pfile) { unsigned int cols; buffer->cur = cur - 1; + if (warn_bidi_p) + maybe_warn_bidi_on_close (pfile, cur); _cpp_process_line_notes (pfile, true); if (buffer->next_line >= buffer->rlimit) return true; @@ -1218,6 +1503,13 @@ _cpp_skip_block_comment (cpp_reader *pfile) cur = buffer->cur; } + /* If this is a beginning of a UTF-8 encoding, it might be + a bidirectional character. 
*/ + else if (__builtin_expect (c == bidi::utf8_start, 0) && warn_bidi_p) + { + bidi::kind kind = get_bidi_utf8 (cur - 1); + maybe_warn_bidi_on_char (pfile, cur, kind, /*ucn_p=*/false); + } } buffer->cur = cur; @@ -1233,9 +1525,32 @@ skip_line_comment (cpp_reader *pfile) { cpp_buffer *buffer = pfile->buffer; location_t orig_line = pfile->line_table->highest_line; + const bool warn_bidi_p = (CPP_OPTION (pfile, cpp_warn_bidirectional) + != bidirectional_none); - while (*buffer->cur != '\n') - buffer->cur++; + if (!warn_bidi_p) + while (*buffer->cur != '\n') + buffer->cur++; + else + { + while (*buffer->cur != '\n' + && *buffer->cur != bidi::utf8_start) + buffer->cur++; + if (__builtin_expect (*buffer->cur == bidi::utf8_start, 0)) + { + while (*buffer->cur != '\n') + { + if (__builtin_expect (*buffer->cur == bidi::utf8_start, 0)) + { + bidi::kind kind = get_bidi_utf8 (buffer->cur); + maybe_warn_bidi_on_char (pfile, buffer->cur, kind, + /*ucn_p=*/false); + } + buffer->cur++; + } + maybe_warn_bidi_on_close (pfile, buffer->cur); + } + } _cpp_process_line_notes (pfile, true); return orig_line != pfile->line_table->highest_line; @@ -1346,11 +1661,14 @@ static const cppchar_t utf8_signifier = 0xC0; /* Returns TRUE if the sequence starting at buffer->cur is valid in an identifier. FIRST is TRUE if this starts an identifier. 
*/ + static bool forms_identifier_p (cpp_reader *pfile, int first, struct normalize_state *state) { cpp_buffer *buffer = pfile->buffer; + const bool warn_bidi_p = (CPP_OPTION (pfile, cpp_warn_bidirectional) + != bidirectional_none); if (*buffer->cur == '$') { @@ -1373,6 +1691,13 @@ forms_identifier_p (cpp_reader *pfile, int first, cppchar_t s; if (*buffer->cur >= utf8_signifier) { + if (__builtin_expect (*buffer->cur == bidi::utf8_start, 0) + && warn_bidi_p) + { + bidi::kind kind = get_bidi_utf8 (buffer->cur); + maybe_warn_bidi_on_char (pfile, buffer->cur, kind, + /*ucn_p=*/false); + } if (_cpp_valid_utf8 (pfile, &buffer->cur, buffer->rlimit, 1 + !first, state, &s)) return true; @@ -1381,6 +1706,13 @@ forms_identifier_p (cpp_reader *pfile, int first, && (buffer->cur[1] == 'u' || buffer->cur[1] == 'U')) { buffer->cur += 2; + if (warn_bidi_p) + { + bidi::kind kind = get_bidi_ucn (buffer->cur, + buffer->cur[-1] == 'U'); + maybe_warn_bidi_on_char (pfile, buffer->cur, kind, + /*ucn_p=*/true); + } if (_cpp_valid_ucn (pfile, &buffer->cur, buffer->rlimit, 1 + !first, state, &s, NULL, NULL)) return true; @@ -1489,6 +1821,8 @@ lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn, const uchar *cur; unsigned int len; unsigned int hash = HT_HASHSTEP (0, *base); + const bool warn_bidi_p = (CPP_OPTION (pfile, cpp_warn_bidirectional) + != bidirectional_none); cur = pfile->buffer->cur; if (! starts_ucn) @@ -1505,13 +1839,17 @@ lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn, { /* Slower version for identifiers containing UCNs or extended chars (including $). 
*/ - do { - while (ISIDNUM (*pfile->buffer->cur)) - { - NORMALIZE_STATE_UPDATE_IDNUM (nst, *pfile->buffer->cur); - pfile->buffer->cur++; - } - } while (forms_identifier_p (pfile, false, nst)); + do + { + while (ISIDNUM (*pfile->buffer->cur)) + { + NORMALIZE_STATE_UPDATE_IDNUM (nst, *pfile->buffer->cur); + pfile->buffer->cur++; + } + } + while (forms_identifier_p (pfile, false, nst)); + if (warn_bidi_p) + maybe_warn_bidi_on_close (pfile, pfile->buffer->cur); result = _cpp_interpret_identifier (pfile, base, pfile->buffer->cur - base); *spelling = cpp_lookup (pfile, base, pfile->buffer->cur - base); @@ -1758,6 +2096,8 @@ static void lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base) { const uchar *pos = base; + const bool warn_bidi_p = (CPP_OPTION (pfile, cpp_warn_bidirectional) + != bidirectional_none); /* 'tis a pity this information isn't passed down from the lexer's initial categorization of the token. */ @@ -1994,8 +2334,15 @@ lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base) pos = base = pfile->buffer->cur; note = &pfile->buffer->notes[pfile->buffer->cur_note]; } + else if (__builtin_expect ((unsigned char) c == bidi::utf8_start, 0) + && warn_bidi_p) + maybe_warn_bidi_on_char (pfile, pos - 1, get_bidi_utf8 (pos - 1), + /*ucn_p=*/false); } + if (warn_bidi_p) + maybe_warn_bidi_on_close (pfile, pos); + if (CPP_OPTION (pfile, user_literals)) { /* If a string format macro, say from inttypes.h, is placed touching @@ -2090,15 +2437,28 @@ lex_string (cpp_reader *pfile, cpp_token *token, const uchar *base) else terminator = '>', type = CPP_HEADER_NAME; + const bool warn_bidi_p = (CPP_OPTION (pfile, cpp_warn_bidirectional) + != bidirectional_none); for (;;) { cppchar_t c = *cur++; /* In #include-style directives, terminators are not escapable. 
*/ if (c == '\\' && !pfile->state.angled_headers && *cur != '\n') - cur++; + { + if ((cur[0] == 'u' || cur[0] == 'U') && warn_bidi_p) + { + bidi::kind kind = get_bidi_ucn (cur + 1, cur[0] == 'U'); + maybe_warn_bidi_on_char (pfile, cur, kind, /*ucn_p=*/true); + } + cur++; + } else if (c == terminator) - break; + { + if (warn_bidi_p) + maybe_warn_bidi_on_close (pfile, cur - 1); + break; + } else if (c == '\n') { cur--; @@ -2115,6 +2475,11 @@ lex_string (cpp_reader *pfile, cpp_token *token, const uchar *base) } else if (c == '\0') saw_NUL = true; + else if (__builtin_expect (c == bidi::utf8_start, 0) && warn_bidi_p) + { + bidi::kind kind = get_bidi_utf8 (cur - 1); + maybe_warn_bidi_on_char (pfile, cur - 1, kind, /*ucn_p=*/false); + } } if (saw_NUL && !pfile->state.skipping) base-commit: a11c53985a7080f9bf6143788ccb455dc9b0da21 -- 2.31.1 ^ permalink raw reply [flat|nested] 27+ messages in thread
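
A side note on the byte values that get_bidi_utf8 matches in the patch above: they follow from the standard three-byte UTF-8 encoding of U+202A..U+202E and U+2066..U+2069. The sketch below is only an independent cross-check, not part of the patch; the helper names utf8_3byte and is_bidi_control are hypothetical, and the encoder deliberately skips surrogate validation.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper (not in libcpp): encode a code point in the
// range [0x800, 0xFFFF] as three UTF-8 bytes.  No surrogate check is
// done; that is an assumption made to keep the sketch short.
inline std::array<unsigned char, 3> utf8_3byte (std::uint32_t cp)
{
  assert (cp >= 0x800 && cp <= 0xFFFF);
  return {static_cast<unsigned char> (0xE0 | (cp >> 12)),
	  static_cast<unsigned char> (0x80 | ((cp >> 6) & 0x3F)),
	  static_cast<unsigned char> (0x80 | (cp & 0x3F))};
}

// Hypothetical helper: true for the nine code points the patch looks
// for (LRE, RLE, PDF, LRO, RLO, LRI, RLI, FSI, PDI).
inline bool is_bidi_control (std::uint32_t cp)
{
  return (cp >= 0x202A && cp <= 0x202E) || (cp >= 0x2066 && cp <= 0x2069);
}
```

For example, utf8_3byte (0x202A) yields E2 80 AA and utf8_3byte (0x2069) yields E2 81 A9, so every one of the nine characters starts with 0xE2 and has 0x80 or 0x81 as its second byte, which is the shape get_bidi_utf8 tests for.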
* Re: [PATCH] libcpp: Implement -Wbidirectional for CVE-2021-42574 [PR103026]
From: Joseph Myers @ 2021-11-01 22:10 UTC
To: Marek Polacek; +Cc: GCC Patches, Jakub Jelinek

On Mon, 1 Nov 2021, Marek Polacek via Gcc-patches wrote:

> +  /* We've read a bidi char, update the current vector as necessary.  */
> +  void on_char (kind k, bool ucn_p)
> +  {
> +    switch (k)
> +      {
> +      case kind::LRE:
> +      case kind::RLE:
> +      case kind::LRO:
> +      case kind::RLO:
> +	vec.push (ucn_p ? 3u : 1u);
> +	break;
> +      case kind::LRI:
> +      case kind::RLI:
> +      case kind::FSI:
> +	vec.push (ucn_p ? 2u : 0u);
> +	break;
> +      case kind::PDF:
> +	if (current_ctx () == kind::PDF)
> +	  pop ();
> +	break;
> +      case kind::PDI:
> +	if (current_ctx () == kind::PDI)
> +	  pop ();

My understanding is that PDI should pop all intermediate PDF contexts
outward to a PDI context, which it also pops.  (But if it's embedded only
in PDF contexts, with no PDI context containing it, it doesn't pop
anything.)

I think failing to handle that only means libcpp sometimes models there
as being more bidirectional contexts open than there should be, so it
might give spurious warnings when in fact all such contexts had been
closed by end of string or comment.

-- 
Joseph S. Myers
joseph@codesourcery.com

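
To make the PDI behaviour described in this review concrete, here is a minimal, self-contained sketch of a context stack in which PDI closes the innermost open isolate together with any embeddings or overrides opened inside it, while PDF only ever closes the innermost context. This is an illustration under assumptions, not the libcpp code: the names Kind, Closer, BidiStack and leaves_unpaired are hypothetical, and the UCN vs UTF-8 origin bit tracked by the patch is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <vector>

// Hypothetical names throughout; this only models the pairing rules.
enum class Kind { LRE, RLE, LRO, RLO, LRI, RLI, FSI, PDF, PDI };

// Which terminator an open context is waiting for.
enum class Closer { PDF, PDI };

struct BidiStack
{
  std::vector<Closer> open;

  void on_char (Kind k)
  {
    switch (k)
      {
      case Kind::LRE: case Kind::RLE: case Kind::LRO: case Kind::RLO:
	open.push_back (Closer::PDF);
	break;
      case Kind::LRI: case Kind::RLI: case Kind::FSI:
	open.push_back (Closer::PDI);
	break;
      case Kind::PDF:
	// PDF closes only the innermost context, and only if that
	// context is an embedding or override.
	if (!open.empty () && open.back () == Closer::PDF)
	  open.pop_back ();
	break;
      case Kind::PDI:
	// PDI closes the innermost open isolate and everything opened
	// after it.  With no open isolate it closes nothing.
	for (std::size_t i = open.size (); i-- > 0;)
	  if (open[i] == Closer::PDI)
	    {
	      open.erase (open.begin () + i, open.end ());
	      break;
	    }
	break;
      }
  }
};

// Feed a sequence of bidi characters and report whether any context
// is still open at the end, i.e. whether "unpaired" would be reported.
inline bool leaves_unpaired (std::initializer_list<Kind> seq)
{
  BidiStack s;
  for (Kind k : seq)
    s.on_char (k);
  return !s.open.empty ();
}
```

Under this model the sequence LRI LRE PDI leaves nothing open, whereas a top-of-stack-only check would leave both contexts open and report them as unpaired. LRE PDI still leaves the LRE open, because there is no isolate for the PDI to terminate.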
* [PATCH v2] libcpp: Implement -Wbidirectional for CVE-2021-42574 [PR103026]
From: Marek Polacek @ 2021-11-02 17:18 UTC
To: Joseph Myers; +Cc: GCC Patches, Jakub Jelinek, Jason Merrill

On Mon, Nov 01, 2021 at 10:10:40PM +0000, Joseph Myers wrote:
> On Mon, 1 Nov 2021, Marek Polacek via Gcc-patches wrote:
>
> > +  /* We've read a bidi char, update the current vector as necessary.  */
> > +  void on_char (kind k, bool ucn_p)
> > +  {
> > +    switch (k)
> > +      {
> > +      case kind::LRE:
> > +      case kind::RLE:
> > +      case kind::LRO:
> > +      case kind::RLO:
> > +	vec.push (ucn_p ? 3u : 1u);
> > +	break;
> > +      case kind::LRI:
> > +      case kind::RLI:
> > +      case kind::FSI:
> > +	vec.push (ucn_p ? 2u : 0u);
> > +	break;
> > +      case kind::PDF:
> > +	if (current_ctx () == kind::PDF)
> > +	  pop ();
> > +	break;
> > +      case kind::PDI:
> > +	if (current_ctx () == kind::PDI)
> > +	  pop ();
>
> My understanding is that PDI should pop all intermediate PDF contexts
> outward to a PDI context, which it also pops.  (But if it's embedded only
> in PDF contexts, with no PDI context containing it, it doesn't pop
> anything.)
>
> I think failing to handle that only means libcpp sometimes models there
> as being more bidirectional contexts open than there should be, so it
> might give spurious warnings when in fact all such contexts had been
> closed by end of string or comment.

Ah, you're right.
https://www.unicode.org/reports/tr9/#Terminating_Explicit_Directional_Isolates
says that
"[PDI] terminates the scope of the last LRI, RLI, or FSI whose
scope has not yet been terminated, as well as the scopes of any subsequent LREs,
RLEs, LROs, or RLOs whose scopes have not yet been terminated."
but PDF doesn't have the latter quirk.

Fixed in the below: I added a suitable truncate into on_char.  The new test
Wbidirectional-14.c exercises the handling of PDI.

Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk? -- >8 -- From a link below: "An issue was discovered in the Bidirectional Algorithm in the Unicode Specification through 14.0. It permits the visual reordering of characters via control sequences, which can be used to craft source code that renders different logic than the logical ordering of tokens ingested by compilers and interpreters. Adversaries can leverage this to encode source code for compilers accepting Unicode such that targeted vulnerabilities are introduced invisibly to human reviewers." More info: https://nvd.nist.gov/vuln/detail/CVE-2021-42574 https://trojansource.codes/ This is not a compiler bug. However, to mitigate the problem, this patch implements -Wbidirectional=[none|unpaired|any] to warn about possibly misleading Unicode bidirectional characters the preprocessor may encounter. The default is =unpaired, which warns about improperly terminated bidirectional characters; e.g. a LRE without its appertaining PDF. The level =any warns about any use of bidirectional characters. This patch handles both UCNs and UTF-8 characters. UCNs designating bidi characters in identifiers are accepted since r204886. Then r217144 enabled -fextended-identifiers by default. Extended characters in C/C++ identifiers have been accepted since r275979. However, this patch still warns about mixing UTF-8 and UCN bidi characters; there seems to be no good reason to allow mixing them. We warn in different contexts: comments (both C and C++-style), string literals, character constants, and identifiers. Expectedly, UCNs are ignored in comments and raw string literals. The bidirectional characters can nest so this patch handles that as well. I have not included nor tested this at all with Fortran (which also has string literals and line comments). Dave M. posted patches improving diagnostic involving Unicode characters. This patch does not make use of this new infrastructure yet. 
PR preprocessor/103026 gcc/c-family/ChangeLog: * c.opt (Wbidirectional, Wbidirectional=): New option. gcc/ChangeLog: * doc/invoke.texi: Document -Wbidirectional. libcpp/ChangeLog: * include/cpplib.h (enum cpp_bidirectional_level): New. (struct cpp_options): Add cpp_warn_bidirectional. (enum cpp_warning_reason): Add CPP_W_BIDIRECTIONAL. * init.c (cpp_create_reader): Set cpp_warn_bidirectional. * lex.c (bidi): New namespace. (get_bidi_utf8): New function. (get_bidi_ucn): Likewise. (maybe_warn_bidi_on_close): Likewise. (maybe_warn_bidi_on_char): Likewise. (_cpp_skip_block_comment): Implement warning about bidirectional characters. (skip_line_comment): Likewise. (forms_identifier_p): Likewise. (lex_identifier): Likewise. (lex_string): Likewise. (lex_raw_string): Likewise. gcc/testsuite/ChangeLog: * c-c++-common/Wbidirectional-1.c: New test. * c-c++-common/Wbidirectional-2.c: New test. * c-c++-common/Wbidirectional-3.c: New test. * c-c++-common/Wbidirectional-4.c: New test. * c-c++-common/Wbidirectional-5.c: New test. * c-c++-common/Wbidirectional-6.c: New test. * c-c++-common/Wbidirectional-7.c: New test. * c-c++-common/Wbidirectional-8.c: New test. * c-c++-common/Wbidirectional-9.c: New test. * c-c++-common/Wbidirectional-10.c: New test. * c-c++-common/Wbidirectional-11.c: New test. * c-c++-common/Wbidirectional-12.c: New test. * c-c++-common/Wbidirectional-13.c: New test. * c-c++-common/Wbidirectional-14.c: New test. 
--- gcc/c-family/c.opt | 24 ++ gcc/doc/invoke.texi | 19 +- gcc/testsuite/c-c++-common/Wbidirectional-1.c | 12 + .../c-c++-common/Wbidirectional-10.c | 27 ++ .../c-c++-common/Wbidirectional-11.c | 13 + .../c-c++-common/Wbidirectional-12.c | 19 + .../c-c++-common/Wbidirectional-13.c | 17 + .../c-c++-common/Wbidirectional-14.c | 38 ++ gcc/testsuite/c-c++-common/Wbidirectional-2.c | 9 + gcc/testsuite/c-c++-common/Wbidirectional-3.c | 11 + gcc/testsuite/c-c++-common/Wbidirectional-4.c | 166 +++++++ gcc/testsuite/c-c++-common/Wbidirectional-5.c | 166 +++++++ gcc/testsuite/c-c++-common/Wbidirectional-6.c | 155 +++++++ gcc/testsuite/c-c++-common/Wbidirectional-7.c | 9 + gcc/testsuite/c-c++-common/Wbidirectional-8.c | 13 + gcc/testsuite/c-c++-common/Wbidirectional-9.c | 29 ++ libcpp/include/cpplib.h | 18 +- libcpp/init.c | 1 + libcpp/lex.c | 407 +++++++++++++++++- 19 files changed, 1138 insertions(+), 15 deletions(-) create mode 100644 gcc/testsuite/c-c++-common/Wbidirectional-1.c create mode 100644 gcc/testsuite/c-c++-common/Wbidirectional-10.c create mode 100644 gcc/testsuite/c-c++-common/Wbidirectional-11.c create mode 100644 gcc/testsuite/c-c++-common/Wbidirectional-12.c create mode 100644 gcc/testsuite/c-c++-common/Wbidirectional-13.c create mode 100644 gcc/testsuite/c-c++-common/Wbidirectional-14.c create mode 100644 gcc/testsuite/c-c++-common/Wbidirectional-2.c create mode 100644 gcc/testsuite/c-c++-common/Wbidirectional-3.c create mode 100644 gcc/testsuite/c-c++-common/Wbidirectional-4.c create mode 100644 gcc/testsuite/c-c++-common/Wbidirectional-5.c create mode 100644 gcc/testsuite/c-c++-common/Wbidirectional-6.c create mode 100644 gcc/testsuite/c-c++-common/Wbidirectional-7.c create mode 100644 gcc/testsuite/c-c++-common/Wbidirectional-8.c create mode 100644 gcc/testsuite/c-c++-common/Wbidirectional-9.c diff --git a/gcc/c-family/c.opt b/gcc/c-family/c.opt index 06457ac739e..09391824676 100644 --- a/gcc/c-family/c.opt +++ b/gcc/c-family/c.opt @@ -374,6 +374,30 @@ 
Wbad-function-cast C ObjC Var(warn_bad_function_cast) Warning Warn about casting functions to incompatible types. +Wbidirectional +C ObjC C++ ObjC++ Warning Alias(Wbidirectional=,any,none) +; + +Wbidirectional= +C ObjC C++ ObjC++ RejectNegative Joined Warning CPP(cpp_warn_bidirectional) CppReason(CPP_W_BIDIRECTIONAL) Var(warn_bidirectional) Init(bidirectional_unpaired) Enum(cpp_bidirectional_level) +-Wbidirectional=[none|unpaired|any] Warn about UTF-8 bidirectional characters. + +; Required for these enum values. +SourceInclude +cpplib.h + +Enum +Name(cpp_bidirectional_level) Type(int) UnknownError(argument %qs to %<-Wbidirectional%> not recognized) + +EnumValue +Enum(cpp_bidirectional_level) String(none) Value(bidirectional_none) + +EnumValue +Enum(cpp_bidirectional_level) String(unpaired) Value(bidirectional_unpaired) + +EnumValue +Enum(cpp_bidirectional_level) String(any) Value(bidirectional_any) + Wbool-compare C ObjC C++ ObjC++ Var(warn_bool_compare) Warning LangEnabledBy(C ObjC C++ ObjC++,Wall) Warn about boolean expression compared with an integer value different from true/false. diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index c5730228821..9dfb95dc24c 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -327,7 +327,9 @@ Objective-C and Objective-C++ Dialects}. -Warith-conversion @gol -Warray-bounds -Warray-bounds=@var{n} -Warray-compare @gol -Wno-attributes -Wattribute-alias=@var{n} -Wno-attribute-alias @gol --Wno-attribute-warning -Wbool-compare -Wbool-operation @gol +-Wno-attribute-warning @gol +-Wbidirectional=@r{[}none@r{|}unpaired@r{|}any@r{]} @gol +-Wbool-compare -Wbool-operation @gol -Wno-builtin-declaration-mismatch @gol -Wno-builtin-macro-redefined -Wc90-c99-compat -Wc99-c11-compat @gol -Wc11-c2x-compat @gol @@ -7674,6 +7676,21 @@ Attributes considered include @code{alloc_align}, @code{alloc_size}, This is the default. You can disable these warnings with either @option{-Wno-attribute-alias} or @option{-Wattribute-alias=0}. 
+@item -Wbidirectional=@r{[}none@r{|}unpaired@r{|}any@r{]} +@opindex Wbidirectional= +@opindex Wbidirectional +@opindex Wno-bidirectional +Warn about UTF-8 bidirectional characters. Such characters can change +left-to-right writing direction into right-to-left (and vice versa), +which can cause confusion between the logical order and visual order. +This may be dangerous; for instance, it may seem that a piece of code +is not commented out, whereas it in fact is. + +There are three levels of warning supported by GCC@. The default is +@option{-Wbidirectional=unpaired}, which warns about improperly terminated +bidi contexts. @option{-Wbidirectional=none} turns the warning off. +@option{-Wbidirectional=any} warns about any use of bidirectional characters. + @item -Wbool-compare @opindex Wno-bool-compare @opindex Wbool-compare diff --git a/gcc/testsuite/c-c++-common/Wbidirectional-1.c b/gcc/testsuite/c-c++-common/Wbidirectional-1.c new file mode 100644 index 00000000000..34f5ac19271 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidirectional-1.c @@ -0,0 +1,12 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ + +int main() { + int isAdmin = 0; + /* } if (isAdmin) begin admins only */ +/* { dg-warning "bidirectional" "" { target *-*-* } .-1 } */ + __builtin_printf("You are an admin.\n"); + /* end admins only { */ +/* { dg-warning "bidirectional" "" { target *-*-* } .-1 } */ + return 0; +} diff --git a/gcc/testsuite/c-c++-common/Wbidirectional-10.c b/gcc/testsuite/c-c++-common/Wbidirectional-10.c new file mode 100644 index 00000000000..2647e44db03 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidirectional-10.c @@ -0,0 +1,27 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidirectional=unpaired" } */ +/* More nesting testing. 
*/ + +/* RLE LRI PDF PDI*/ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int LRE_\u202a_PDF_\u202c; +int LRE_\u202a_PDF_\u202c_LRE_\u202a_PDF_\u202c; +int LRE_\u202a_LRI_\u2066_PDF_\u202c_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int RLE_\u202b_RLI_\u2067_PDF_\u202c_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int RLE_\u202b_RLI_\u2067_PDI_\u2069_PDF_\u202c; +int FSI_\u2068_LRO_\u202d_PDI_\u2069_PDF_\u202c; +int FSI_\u2068; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int FSI_\u2068_PDI_\u2069; +int FSI_\u2068_FSI_\u2068_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069; +int RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDF_\u202c; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_FSI_\u2068_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidirectional-11.c b/gcc/testsuite/c-c++-common/Wbidirectional-11.c new file mode 100644 index 00000000000..7d91527148e --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidirectional-11.c @@ -0,0 +1,13 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidirectional=unpaired" } */ +/* Test that we warn when mixing UCN and UTF-8. 
*/ + +int LRE__PDF_\u202c; +/* { dg-warning "mismatch" "" { target *-*-* } .-1 } */ +int LRE_\u202a_PDF__; +/* { dg-warning "mismatch" "" { target *-*-* } .-1 } */ +const char *s1 = "LRE__PDF_\u202c"; +/* { dg-warning "mismatch" "" { target *-*-* } .-1 } */ +const char *s2 = "LRE_\u202a_PDF_"; +/* { dg-warning "mismatch" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidirectional-12.c b/gcc/testsuite/c-c++-common/Wbidirectional-12.c new file mode 100644 index 00000000000..2030bb73fe9 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidirectional-12.c @@ -0,0 +1,19 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile { target { c || c++11 } } } */ +/* { dg-options "-Wbidirectional=any" } */ +/* Test raw strings. */ + +const char *s1 = R"(a b c LRE 1 2 3 PDF x y z)"; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ +const char *s2 = R"(a b c RLE 1 2 3 PDF x y z)"; +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ +const char *s3 = R"(a b c LRO 1 2 3 PDF x y z)"; +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ +const char *s4 = R"(a b c RLO 1 2 3 PDF x y z)"; +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ +const char *s7 = R"(a b c FSI 1 2 3 PDI x y) z"; +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */ +const char *s8 = R"(a b c PDI x y )z"; +/* { dg-warning "U\\+2069" "" { target *-*-* } .-1 } */ +const char *s9 = R"(a b c PDF x y z)"; +/* { dg-warning "U\\+202C" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidirectional-13.c b/gcc/testsuite/c-c++-common/Wbidirectional-13.c new file mode 100644 index 00000000000..8e13bf53595 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidirectional-13.c @@ -0,0 +1,17 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile { target { c || c++11 } } } */ +/* { dg-options "-Wbidirectional=unpaired" } */ +/* Test raw strings. 
*/ + +const char *s1 = R"(a b c LRE 1 2 3)"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +const char *s2 = R"(a b c RLE 1 2 3)"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +const char *s3 = R"(a b c LRO 1 2 3)"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +const char *s4 = R"(a b c FSI 1 2 3)"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +const char *s5 = R"(a b c LRI 1 2 3)"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +const char *s6 = R"(a b c RLI 1 2 3)"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidirectional-14.c b/gcc/testsuite/c-c++-common/Wbidirectional-14.c new file mode 100644 index 00000000000..eb4a0a30d58 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidirectional-14.c @@ -0,0 +1,38 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidirectional=unpaired" } */ +/* Test PDI handling, which also pops any subsequent LREs, RLEs, LROs, + or RLOs. 
*/ + +/* LRI__LRI__RLE__RLE__RLE__PDI_*/ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +// LRI__RLE__RLE__RLE__PDI_ +// LRI__RLO__RLE__RLE__PDI_ +// LRI__RLO__RLE__PDI_ +// FSI__RLO__PDI_ +// FSI__FSI__RLO__PDI_ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + +int LRI_\u2066_LRI_\u2066_LRE_\u202a_LRE_\u202a_LRE_\u202a_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int LRI_\u2066_LRI_\u2066_LRE_\u202a_LRE_\u202a_LRE_\u202a_PDI_\u2069_PDI_\u2069; +int LRI_\u2066_LRI_\u2066_LRI_\u2066_LRE_\u202a_LRE_\u202a_LRE_\u202a_PDI_\u2069_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int PDI_\u2069; +int LRI_\u2066_PDI_\u2069; +int RLI_\u2067_PDI_\u2069; +int LRE_\u202a_LRI_\u2066_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int LRI_\u2066_LRE_\u202a_PDF_\u202c_PDI_\u2069; +int LRI_\u2066_LRE_\u202a_LRE_\u202a_PDF_\u202c_PDI_\u2069; +int RLI_\u2067_LRI_\u2066_LRE_\u202a_LRE_\u202a_PDF_\u202c_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int FSI_\u2068_LRI_\u2066_LRE_\u202a_LRE_\u202a_PDF_\u202c_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int RLO_\u202e_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int RLI_\u2067_PDI_\u2069_RLI_\u2067; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int FSI_\u2068_PDF_\u202c_PDI_\u2069; +int FSI_\u2068_FSI_\u2068_PDF_\u202c_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidirectional-2.c b/gcc/testsuite/c-c++-common/Wbidirectional-2.c new file mode 100644 index 00000000000..2340374f276 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidirectional-2.c @@ -0,0 +1,9 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ + +int main() { + /* Say hello; newline/*/ return 0 ; +/* { dg-warning "bidirectional" "" { target *-*-* } .-1 } */ + __builtin_printf("Hello world.\n"); + return 0; +} diff --git 
a/gcc/testsuite/c-c++-common/Wbidirectional-3.c b/gcc/testsuite/c-c++-common/Wbidirectional-3.c new file mode 100644 index 00000000000..9dc7edb6e64 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidirectional-3.c @@ -0,0 +1,11 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ + +int main() { + const char* access_level = "user"; + if (__builtin_strcmp(access_level, "user // Check if admin ")) { +/* { dg-warning "bidirectional" "" { target *-*-* } .-1 } */ + __builtin_printf("You are an admin.\n"); + } + return 0; +} diff --git a/gcc/testsuite/c-c++-common/Wbidirectional-4.c b/gcc/testsuite/c-c++-common/Wbidirectional-4.c new file mode 100644 index 00000000000..0fbc0dad6ab --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidirectional-4.c @@ -0,0 +1,166 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidirectional=any -Wno-multichar -Wno-overflow" } */ +/* Test all bidi chars in various contexts (identifiers, comments, + string literals, character constants), both UCN and UTF-8. The bidi + chars here are properly terminated, except for the character constants. */ + +/* a b c LRE 1 2 3 PDF x y z */ +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ +/* a b c RLE 1 2 3 PDF x y z */ +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ +/* a b c LRO 1 2 3 PDF x y z */ +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ +/* a b c RLO 1 2 3 PDF x y z */ +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ +/* a b c LRI 1 2 3 PDI x y z */ +/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */ +/* a b c RLI 1 2 3 PDI x y */ +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */ +/* a b c FSI 1 2 3 PDI x y z */ +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */ + +/* Same but C++ comments instead. 
*/ +// a b c LRE 1 2 3 PDF x y z +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ +// a b c RLE 1 2 3 PDF x y z +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ +// a b c LRO 1 2 3 PDF x y z +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ +// a b c RLO 1 2 3 PDF x y z +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ +// a b c LRI 1 2 3 PDI x y z +/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */ +// a b c RLI 1 2 3 PDI x y +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */ +// a b c FSI 1 2 3 PDI x y z +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */ + +/* Here we're closing an unopened context, warn when =any. */ +/* a b c PDI x y z */ +/* { dg-warning "U\\+2069" "" { target *-*-* } .-1 } */ +/* a b c PDF x y z */ +/* { dg-warning "U\\+202C" "" { target *-*-* } .-1 } */ +// a b c PDI x y z +/* { dg-warning "U\\+2069" "" { target *-*-* } .-1 } */ +// a b c PDF x y z +/* { dg-warning "U\\+202C" "" { target *-*-* } .-1 } */ + +void +g1 () +{ + const char *s1 = "a b c LRE 1 2 3 PDF x y z"; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ + const char *s2 = "a b c RLE 1 2 3 PDF x y z"; +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ + const char *s3 = "a b c LRO 1 2 3 PDF x y z"; +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ + const char *s4 = "a b c RLO 1 2 3 PDF x y z"; +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ + const char *s5 = "a b c LRI 1 2 3 PDI x y z"; +/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */ + const char *s6 = "a b c RLI 1 2 3 PDI x y z"; +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */ + const char *s7 = "a b c FSI 1 2 3 PDI x y z"; +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */ + const char *s8 = "a b c PDI x y z"; +/* { dg-warning "U\\+2069" "" { target *-*-* } .-1 } */ + const char *s9 = "a b c PDF x y z"; +/* { dg-warning "U\\+202C" "" { target *-*-* } .-1 } */ + + const char *s10 = "a b c LRE\u202a 1 2 3 PDF\u202c x y z"; +/* 
{ dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ + const char *s11 = "a b c LRE\u202A 1 2 3 PDF\u202c x y z"; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ + const char *s12 = "a b c RLE\u202b 1 2 3 PDF\u202c x y z"; +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ + const char *s13 = "a b c RLE\u202B 1 2 3 PDF\u202c x y z"; +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ + const char *s14 = "a b c LRO\u202d 1 2 3 PDF\u202c x y z"; +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ + const char *s15 = "a b c LRO\u202D 1 2 3 PDF\u202c x y z"; +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ + const char *s16 = "a b c RLO\u202e 1 2 3 PDF\u202c x y z"; +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ + const char *s17 = "a b c RLO\u202E 1 2 3 PDF\u202c x y z"; +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ + const char *s18 = "a b c LRI\u2066 1 2 3 PDI\u2069 x y z"; +/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */ + const char *s19 = "a b c RLI\u2067 1 2 3 PDI\u2069 x y z"; +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */ + const char *s20 = "a b c FSI\u2068 1 2 3 PDI\u2069 x y z"; +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */ +} + +void +g2 () +{ + const char c1 = '\u202a'; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ + const char c2 = '\u202A'; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ + const char c3 = '\u202b'; +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ + const char c4 = '\u202B'; +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ + const char c5 = '\u202d'; +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ + const char c6 = '\u202D'; +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ + const char c7 = '\u202e'; +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ + const char c8 = '\u202E'; +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ + const char c9 = '\u2066'; +/* { dg-warning "U\\+2066" "" { 
target *-*-* } .-1 } */ + const char c10 = '\u2067'; +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */ + const char c11 = '\u2068'; +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */ +} + +int abc; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */ +int AX; +/* { dg-warning "U\\+202C" "" { target *-*-* } .-1 } */ +int A\u202cY; +/* { dg-warning "U\\+202C" "" { target *-*-* } .-1 } */ +int A\u202CY2; +/* { dg-warning "U\\+202C" "" { target *-*-* } .-1 } */ + +int d\u202ae\u202cf; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ +int d\u202Ae\u202cf2; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ +int d\u202be\u202cf; +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ +int d\u202Be\u202cf2; +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ +int d\u202de\u202cf; +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ +int d\u202De\u202cf2; +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ +int d\u202ee\u202cf; +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ +int d\u202Ee\u202cf2; +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ +int d\u2066e\u2069f; +/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */ +int d\u2067e\u2069f; +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */ +int d\u2068e\u2069f; +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */ +int X\u2069; +/* { dg-warning "U\\+2069" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidirectional-5.c b/gcc/testsuite/c-c++-common/Wbidirectional-5.c new file mode 100644 index 00000000000..800d273dbe5 --- /dev/null 
+++ b/gcc/testsuite/c-c++-common/Wbidirectional-5.c @@ -0,0 +1,166 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidirectional=unpaired -Wno-multichar -Wno-overflow" } */ +/* Test all bidi chars in various contexts (identifiers, comments, + string literals, character constants), both UCN and UTF-8. The bidi + chars here are properly terminated, except for the character constants. */ + +/* a b c LRE 1 2 3 PDF x y z */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +/* a b c RLE 1 2 3 PDF x y z */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +/* a b c LRO 1 2 3 PDF x y z */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +/* a b c RLO 1 2 3 PDF x y z */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +/* a b c LRI 1 2 3 PDI x y z */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +/* a b c RLI 1 2 3 PDI x y */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +/* a b c FSI 1 2 3 PDI x y z */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + +/* Same but C++ comments instead. */ +// a b c LRE 1 2 3 PDF x y z +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +// a b c RLE 1 2 3 PDF x y z +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +// a b c LRO 1 2 3 PDF x y z +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +// a b c RLO 1 2 3 PDF x y z +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +// a b c LRI 1 2 3 PDI x y z +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +// a b c RLI 1 2 3 PDI x y +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +// a b c FSI 1 2 3 PDI x y z +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + +/* Here we're closing an unopened context, warn when =any. 
*/ +/* a b c PDI x y z */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +/* a b c PDF x y z */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +// a b c PDI x y z +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +// a b c PDF x y z +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + +void +g1 () +{ + const char *s1 = "a b c LRE 1 2 3 PDF x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s2 = "a b c RLE 1 2 3 PDF x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s3 = "a b c LRO 1 2 3 PDF x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s4 = "a b c RLO 1 2 3 PDF x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s5 = "a b c LRI 1 2 3 PDI x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s6 = "a b c RLI 1 2 3 PDI x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s7 = "a b c FSI 1 2 3 PDI x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s8 = "a b c PDI x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s9 = "a b c PDF x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + + const char *s10 = "a b c LRE\u202a 1 2 3 PDF\u202c x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s11 = "a b c LRE\u202A 1 2 3 PDF\u202c x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s12 = "a b c RLE\u202b 1 2 3 PDF\u202c x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s13 = "a b c RLE\u202B 1 2 3 PDF\u202c x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s14 = "a b c LRO\u202d 1 2 3 PDF\u202c x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s15 = "a b c LRO\u202D 1 2 3 PDF\u202c x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s16 = "a b c RLO\u202e 1 2 3 PDF\u202c x y z"; +/* { 
dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s17 = "a b c RLO\u202E 1 2 3 PDF\u202c x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s18 = "a b c LRI\u2066 1 2 3 PDI\u2069 x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s19 = "a b c RLI\u2067 1 2 3 PDI\u2069 x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s20 = "a b c FSI\u2068 1 2 3 PDI\u2069 x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +} + +void +g2 () +{ + const char c1 = '\u202a'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c2 = '\u202A'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c3 = '\u202b'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c4 = '\u202B'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c5 = '\u202d'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c6 = '\u202D'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c7 = '\u202e'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c8 = '\u202E'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c9 = '\u2066'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c10 = '\u2067'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c11 = '\u2068'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +} + +int abc; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int AX; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int 
A\u202cY; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int A\u202CY2; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + +int d\u202ae\u202cf; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u202Ae\u202cf2; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u202be\u202cf; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u202Be\u202cf2; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u202de\u202cf; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u202De\u202cf2; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u202ee\u202cf; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u202Ee\u202cf2; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u2066e\u2069f; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u2067e\u2069f; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u2068e\u2069f; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int X\u2069; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidirectional-6.c b/gcc/testsuite/c-c++-common/Wbidirectional-6.c new file mode 100644 index 00000000000..bf8b6104d43 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidirectional-6.c @@ -0,0 +1,155 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidirectional=unpaired" } */ +/* Test nesting of bidi chars in various contexts. 
*/ + +/* Terminated by the wrong char: */ +/* a b c LRE 1 2 3 PDI x y z */ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* a b c RLE 1 2 3 PDI x y z*/ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* a b c LRO 1 2 3 PDI x y z */ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* a b c RLO 1 2 3 PDI x y z */ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* a b c LRI 1 2 3 PDF x y z */ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* a b c RLI 1 2 3 PDF x y z */ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* a b c FSI 1 2 3 PDF x y z*/ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + +/* LRE PDF */ +/* LRE LRE PDF PDF */ +/* PDF LRE PDF */ +/* LRE PDF LRE PDF */ +/* LRE LRE PDF */ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* PDF LRE */ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + +// a b c LRE 1 2 3 PDI x y z +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +// a b c RLE 1 2 3 PDI x y z*/ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +// a b c LRO 1 2 3 PDI x y z +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +// a b c RLO 1 2 3 PDI x y z +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +// a b c LRI 1 2 3 PDF x y z +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +// a b c RLI 1 2 3 PDF x y z +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +// a b c FSI 1 2 3 PDF x y z +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + +// LRE PDF +// LRE LRE PDF PDF +// PDF LRE PDF +// LRE PDF LRE PDF +// LRE LRE PDF +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +// PDF LRE +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + +void +g1 () +{ + const char *s1 = "a b c LRE 1 2 3 PDI x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s2 = "a b c LRE\u202a 1 2 3 PDI\u2069 x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s3 = "a b c RLE 1 2 
3 PDI x y "; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s4 = "a b c RLE\u202b 1 2 3 PDI\u2069 x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s5 = "a b c LRO 1 2 3 PDI x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s6 = "a b c LRO\u202d 1 2 3 PDI\u2069 x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s7 = "a b c RLO 1 2 3 PDI x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s8 = "a b c RLO\u202e 1 2 3 PDI\u2069 x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s9 = "a b c LRI 1 2 3 PDF x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s10 = "a b c LRI\u2066 1 2 3 PDF\u202c x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s11 = "a b c RLI 1 2 3 PDF x y z\ + "; +/* { dg-warning "unpaired" "" { target *-*-* } .-2 } */ + const char *s12 = "a b c RLI\u2067 1 2 3 PDF\u202c x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s13 = "a b c FSI 1 2 3 PDF x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s14 = "a b c FSI\u2068 1 2 3 PDF\u202c x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s15 = "PDF LRE"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s16 = "PDF\u202c LRE\u202a"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s17 = "LRE PDF"; + const char *s18 = "LRE\u202a PDF\u202c"; + const char *s19 = "LRE LRE PDF PDF"; + const char *s20 = "LRE\u202a LRE\u202a PDF\u202c PDF\u202c"; + const char *s21 = "PDF LRE PDF"; + const char *s22 = "PDF\u202c LRE\u202a PDF\u202c"; + const char *s23 = "LRE LRE PDF"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s24 = "LRE\u202a LRE\u202a PDF\u202c"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s25 = "PDF LRE"; +/* { 
dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s26 = "PDF\u202c LRE\u202a"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s27 = "PDF LRE\u202a"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s28 = "PDF\u202c LRE"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +} + +int aLREbPDI; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int A\u202aB\u2069C; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int aRLEbPDI; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int a\u202bB\u2069c; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int aLRObPDI; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int a\u202db\u2069c2; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int aRLObPDI; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int a\u202eb\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int aLRIbPDF; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int a\u2066b\u202c; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int aRLIbPDFc +; +/* { dg-warning "unpaired" "" { target *-*-* } .-2 } */ +int a\u2067b\u202c; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int aFSIbPDF; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int a\u2068b\u202c; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int aFSIbPD\u202C; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int aFSI\u2068bPDF_; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int aLREbPDFb; +int A\u202aB\u202c; +int a_LRE_LRE_b_PDF_PDF; +int A\u202aA\u202aB\u202cB\u202c; +int aPDFbLREadPDF; +int a_\u202C_\u202a_\u202c; +int a_LRE_b_PDF_c_LRE_PDF; +int a_\u202a_\u202c_\u202a_\u202c_; +int a_LRE_b_PDF_c_LRE; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int a_\u202a_\u202c_\u202a_; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidirectional-7.c 
b/gcc/testsuite/c-c++-common/Wbidirectional-7.c new file mode 100644 index 00000000000..bb973ab7c27 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidirectional-7.c @@ -0,0 +1,9 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidirectional=any" } */ +/* Test we ignore UCNs in comments. */ + +// a b c \u202a 1 2 3 +// a b c \u202A 1 2 3 +/* a b c \u202a 1 2 3 */ +/* a b c \u202A 1 2 3 */ diff --git a/gcc/testsuite/c-c++-common/Wbidirectional-8.c b/gcc/testsuite/c-c++-common/Wbidirectional-8.c new file mode 100644 index 00000000000..23f5c92418b --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidirectional-8.c @@ -0,0 +1,13 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidirectional=any" } */ +/* Test \u vs \U. */ + +int a_\u202A; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ +int a_\u202a_2; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ +int a_\U0000202A_3; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ +int a_\U0000202a_4; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidirectional-9.c b/gcc/testsuite/c-c++-common/Wbidirectional-9.c new file mode 100644 index 00000000000..dfcd6814fcb --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidirectional-9.c @@ -0,0 +1,29 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidirectional=unpaired" } */ +/* Test that we properly separate bidi contexts (comment/identifier/character + constant/string literal). 
*/ + +/* LRE -><- */ int pdf_\u202c_1; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* RLE -><- */ int pdf_\u202c_2; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* LRO -><- */ int pdf_\u202c_3; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* RLO -><- */ int pdf_\u202c_4; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* LRI -><-*/ int pdi_\u2069_1; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* RLI -><- */ int pdi_\u2069_12; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* FSI -><- */ int pdi_\u2069_3; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + +const char *s1 = "LRE\u202a"; /* PDF -><- */ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* LRE -><- */ const char *s2 = "PDF\u202c"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +const char *s3 = "LRE\u202a"; int pdf_\u202c_5; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int lre_\u202a; const char *s4 = "PDF\u202c"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ diff --git a/libcpp/include/cpplib.h b/libcpp/include/cpplib.h index 176f8c5bbce..60cf08ddd35 100644 --- a/libcpp/include/cpplib.h +++ b/libcpp/include/cpplib.h @@ -319,6 +319,17 @@ enum cpp_main_search CMS_system, /* Search the system INCLUDE path. */ }; +/* The possible bidirectional characters checking levels, from least + restrictive to most. */ +enum cpp_bidirectional_level { + /* No checking. */ + bidirectional_none, + /* Only detect unpaired uses of bidirectional characters. */ + bidirectional_unpaired, + /* Detect any use of bidirectional characters. */ + bidirectional_any +}; + /* This structure is nested inside struct cpp_reader, and carries all the options visible to the command line. */ struct cpp_options @@ -539,6 +550,10 @@ struct cpp_options /* True if warn about differences between C++98 and C++11. */ bool cpp_warn_cxx11_compat; + /* Nonzero of bidirectional characters checking is on. 
See enum
+     cpp_bidirectional_level.  */
+  unsigned char cpp_warn_bidirectional;
+
   /* Dependency generation.  */
   struct
   {
@@ -643,7 +658,8 @@ enum cpp_warning_reason {
   CPP_W_C90_C99_COMPAT,
   CPP_W_C11_C2X_COMPAT,
   CPP_W_CXX11_COMPAT,
-  CPP_W_EXPANSION_TO_DEFINED
+  CPP_W_EXPANSION_TO_DEFINED,
+  CPP_W_BIDIRECTIONAL
 };
 
 /* Callback for header lookup for HEADER, which is the name of a
diff --git a/libcpp/init.c b/libcpp/init.c
index 5a424e23553..f9a8f5f088f 100644
--- a/libcpp/init.c
+++ b/libcpp/init.c
@@ -223,6 +223,7 @@ cpp_create_reader (enum c_lang lang, cpp_hash_table *table,
     = ENABLE_CANONICAL_SYSTEM_HEADERS;
   CPP_OPTION (pfile, ext_numeric_literals) = 1;
   CPP_OPTION (pfile, warn_date_time) = 0;
+  CPP_OPTION (pfile, cpp_warn_bidirectional) = bidirectional_unpaired;
 
   /* Default CPP arithmetic to something sensible for the host for the
      benefit of dumb users like fix-header.  */
diff --git a/libcpp/lex.c b/libcpp/lex.c
index fa2253d41c3..3fb518e202b 100644
--- a/libcpp/lex.c
+++ b/libcpp/lex.c
@@ -1164,6 +1164,300 @@ _cpp_process_line_notes (cpp_reader *pfile, int in_comment)
     }
 }
 
+namespace bidi {
+  enum class kind {
+    NONE, LRE, RLE, LRO, RLO, LRI, RLI, FSI, PDF, PDI
+  };
+
+  /* All the UTF-8 encodings of bidi characters start with E2.  */
+  constexpr uchar utf8_start = 0xe2;
+
+  /* A vector holding currently open bidi contexts.  We use a char for
+     each context, its LSB is 1 if it represents a PDF context, 0 if it
+     represents a PDI context.  The next bit is 1 if this context was open
+     by a bidi character written as a UCN, and 0 when it was UTF-8.  */
+  semi_embedded_vec <unsigned char, 16> vec;
+
+  /* Close the whole comment/identifier/string literal/character constant
+     context.  */
+  void on_close ()
+  {
+    vec.truncate (0);
+  }
+
+  /* Pop the last element in the vector.  */
+  void pop ()
+  {
+    unsigned int len = vec.count ();
+    gcc_checking_assert (len > 0);
+    vec.truncate (len - 1);
+  }
+
+  /* Return the context of the Ith element.  */
+  kind ctx_at (unsigned int i)
+  {
+    return (vec[i] & 1) ? kind::PDF : kind::PDI;
+  }
+
+  /* Return which context is currently opened.  */
+  kind current_ctx ()
+  {
+    unsigned int len = vec.count ();
+    if (len == 0)
+      return kind::NONE;
+    return ctx_at (len - 1);
+  }
+
+  /* Return true if the current context comes from a UCN origin, that is,
+     the bidi char which started this bidi context was written as a UCN.  */
+  bool current_ctx_ucn_p ()
+  {
+    unsigned int len = vec.count ();
+    gcc_checking_assert (len > 0);
+    return (vec[len - 1] >> 1) & 1;
+  }
+
+  /* We've read a bidi char, update the current vector as necessary.  */
+  void on_char (kind k, bool ucn_p)
+  {
+    switch (k)
+      {
+      case kind::LRE:
+      case kind::RLE:
+      case kind::LRO:
+      case kind::RLO:
+        vec.push (ucn_p ? 3u : 1u);
+        break;
+      case kind::LRI:
+      case kind::RLI:
+      case kind::FSI:
+        vec.push (ucn_p ? 2u : 0u);
+        break;
+      /* PDF terminates the scope of the last LRE, RLE, LRO, or RLO
+         whose scope has not yet been terminated.  */
+      case kind::PDF:
+        if (current_ctx () == kind::PDF)
+          pop ();
+        break;
+      /* PDI terminates the scope of the last LRI, RLI, or FSI whose
+         scope has not yet been terminated, as well as the scopes of
+         any subsequent LREs, RLEs, LROs, or RLOs whose scopes have not
+         yet been terminated.  */
+      case kind::PDI:
+        for (int i = vec.count () - 1; i >= 0; --i)
+          if (ctx_at (i) == kind::PDI)
+            {
+              vec.truncate (i);
+              break;
+            }
+        break;
+      [[likely]] case kind::NONE:
+        break;
+      default:
+        abort ();
+      }
+  }
+
+  /* Return a descriptive string for K.  */
+  const char *to_str (kind k)
+  {
+    switch (k)
+      {
+      case kind::LRE:
+        return "U+202A (LEFT-TO-RIGHT EMBEDDING)";
+      case kind::RLE:
+        return "U+202B (RIGHT-TO-LEFT EMBEDDING)";
+      case kind::LRO:
+        return "U+202D (LEFT-TO-RIGHT OVERRIDE)";
+      case kind::RLO:
+        return "U+202E (RIGHT-TO-LEFT OVERRIDE)";
+      case kind::LRI:
+        return "U+2066 (LEFT-TO-RIGHT ISOLATE)";
+      case kind::RLI:
+        return "U+2067 (RIGHT-TO-LEFT ISOLATE)";
+      case kind::FSI:
+        return "U+2068 (FIRST STRONG ISOLATE)";
+      case kind::PDF:
+        return "U+202C (POP DIRECTIONAL FORMATTING)";
+      case kind::PDI:
+        return "U+2069 (POP DIRECTIONAL ISOLATE)";
+      default:
+        abort ();
+      }
+  }
+}
+
+/* Parse a sequence of 3 bytes starting with P and return its bidi code.  */
+
+static bidi::kind
+get_bidi_utf8 (const unsigned char *const p)
+{
+  gcc_checking_assert (p[0] == bidi::utf8_start);
+
+  if (p[1] == 0x80)
+    switch (p[2])
+      {
+      case 0xaa:
+        return bidi::kind::LRE;
+      case 0xab:
+        return bidi::kind::RLE;
+      case 0xac:
+        return bidi::kind::PDF;
+      case 0xad:
+        return bidi::kind::LRO;
+      case 0xae:
+        return bidi::kind::RLO;
+      default:
+        break;
+      }
+  else if (p[1] == 0x81)
+    switch (p[2])
+      {
+      case 0xa6:
+        return bidi::kind::LRI;
+      case 0xa7:
+        return bidi::kind::RLI;
+      case 0xa8:
+        return bidi::kind::FSI;
+      case 0xa9:
+        return bidi::kind::PDI;
+      default:
+        break;
+      }
+
+  return bidi::kind::NONE;
+}
+
+/* Parse a UCN where P points just past \u or \U and return its bidi code.  */
+
+static bidi::kind
+get_bidi_ucn (const unsigned char *p, bool is_U)
+{
+  /* 6.4.3 Universal Character Names
+      \u hex-quad
+      \U hex-quad hex-quad
+     where \unnnn means \U0000nnnn.  */
+
+  if (is_U)
+    {
+      if (p[0] != '0' || p[1] != '0' || p[2] != '0' || p[3] != '0')
+        return bidi::kind::NONE;
+      /* Skip 4B so we can treat \u and \U the same below.  */
+      p += 4;
+    }
+
+  /* All code points we are looking for start with 20xx.
*/ + if (p[0] != '2' || p[1] != '0') + return bidi::kind::NONE; + else if (p[2] == '2') + switch (p[3]) + { + case 'a': + case 'A': + return bidi::kind::LRE; + case 'b': + case 'B': + return bidi::kind::RLE; + case 'c': + case 'C': + return bidi::kind::PDF; + case 'd': + case 'D': + return bidi::kind::LRO; + case 'e': + case 'E': + return bidi::kind::RLO; + default: + break; + } + else if (p[2] == '6') + switch (p[3]) + { + case '6': + return bidi::kind::LRI; + case '7': + return bidi::kind::RLI; + case '8': + return bidi::kind::FSI; + case '9': + return bidi::kind::PDI; + default: + break; + } + + return bidi::kind::NONE; +} + +/* We're closing a bidi context, that is, we've encountered a newline, + are closing a C-style comment, or are at the end of a string literal, + character constant, or identifier. Warn if this context was not + properly terminated by a PDI or PDF. P points to the last character + in this context. */ + +static void +maybe_warn_bidi_on_close (cpp_reader *pfile, const uchar *p) +{ + if (CPP_OPTION (pfile, cpp_warn_bidirectional) == bidirectional_unpaired + && bidi::vec.count () > 0) + { + const location_t loc + = linemap_position_for_column (pfile->line_table, + CPP_BUF_COLUMN (pfile->buffer, p)); + cpp_warning_with_line (pfile, CPP_W_BIDIRECTIONAL, loc, 0, + "unpaired UTF-8 bidirectional character " + "detected"); + } + /* We're done with this context. */ + bidi::on_close (); +} + +/* We're at the beginning or in the middle of an identifier/comment/string + literal/character constant. Warn if we've encountered a bidi character. + KIND says which bidi character it was; P points to it in the character + stream. UCN_P is true iff this bidi character was written as a UCN. 
*/ + +static void +maybe_warn_bidi_on_char (cpp_reader *pfile, const uchar *p, bidi::kind kind, + bool ucn_p) +{ + if (__builtin_expect (kind == bidi::kind::NONE, 1)) + return; + + const auto warn_bidi = CPP_OPTION (pfile, cpp_warn_bidirectional); + + if (warn_bidi != bidirectional_none) + { + const location_t loc + = linemap_position_for_column (pfile->line_table, + CPP_BUF_COLUMN (pfile->buffer, p)); + /* It seems excessive to warn about a PDI/PDF that is closing + an opened context because we've already warned about the + opening character. Except warn when we have a UCN x UTF-8 + mismatch. */ + if (kind == bidi::current_ctx ()) + { + if (warn_bidi == bidirectional_unpaired + && bidi::current_ctx_ucn_p () != ucn_p) + cpp_warning_with_line (pfile, CPP_W_BIDIRECTIONAL, loc, 0, + "UTF-8 vs UCN mismatch when closing " + "a context by \"%s\"", bidi::to_str (kind)); + } + else if (warn_bidi == bidirectional_any) + { + if (kind == bidi::kind::PDF || kind == bidi::kind::PDI) + cpp_warning_with_line (pfile, CPP_W_BIDIRECTIONAL, loc, 0, + "\"%s\" is closing an unopened context", + bidi::to_str (kind)); + else + cpp_warning_with_line (pfile, CPP_W_BIDIRECTIONAL, loc, 0, + "found problematic Unicode character \"%s\"", + bidi::to_str (kind)); + } + } + /* We're done with this context. */ + bidi::on_char (kind, ucn_p); +} + /* Skip a C-style block comment. We find the end of the comment by seeing if an asterisk is before every '/' we encounter. Returns nonzero if comment terminated by EOF, zero otherwise. 
@@ -1175,7 +1469,8 @@ _cpp_skip_block_comment (cpp_reader *pfile) cpp_buffer *buffer = pfile->buffer; const uchar *cur = buffer->cur; uchar c; - + const bool warn_bidi_p = (CPP_OPTION (pfile, cpp_warn_bidirectional) + != bidirectional_none); cur++; if (*cur == '/') cur++; @@ -1189,7 +1484,11 @@ _cpp_skip_block_comment (cpp_reader *pfile) if (c == '/') { if (cur[-2] == '*') - break; + { + if (warn_bidi_p) + maybe_warn_bidi_on_close (pfile, cur); + break; + } /* Warn about potential nested comments, but not if the '/' comes immediately before the true comment delimiter. @@ -1208,6 +1507,8 @@ _cpp_skip_block_comment (cpp_reader *pfile) { unsigned int cols; buffer->cur = cur - 1; + if (warn_bidi_p) + maybe_warn_bidi_on_close (pfile, cur); _cpp_process_line_notes (pfile, true); if (buffer->next_line >= buffer->rlimit) return true; @@ -1218,6 +1519,13 @@ _cpp_skip_block_comment (cpp_reader *pfile) cur = buffer->cur; } + /* If this is a beginning of a UTF-8 encoding, it might be + a bidirectional character. 
*/ + else if (__builtin_expect (c == bidi::utf8_start, 0) && warn_bidi_p) + { + bidi::kind kind = get_bidi_utf8 (cur - 1); + maybe_warn_bidi_on_char (pfile, cur, kind, /*ucn_p=*/false); + } } buffer->cur = cur; @@ -1233,9 +1541,32 @@ skip_line_comment (cpp_reader *pfile) { cpp_buffer *buffer = pfile->buffer; location_t orig_line = pfile->line_table->highest_line; + const bool warn_bidi_p = (CPP_OPTION (pfile, cpp_warn_bidirectional) + != bidirectional_none); - while (*buffer->cur != '\n') - buffer->cur++; + if (!warn_bidi_p) + while (*buffer->cur != '\n') + buffer->cur++; + else + { + while (*buffer->cur != '\n' + && *buffer->cur != bidi::utf8_start) + buffer->cur++; + if (__builtin_expect (*buffer->cur == bidi::utf8_start, 0)) + { + while (*buffer->cur != '\n') + { + if (__builtin_expect (*buffer->cur == bidi::utf8_start, 0)) + { + bidi::kind kind = get_bidi_utf8 (buffer->cur); + maybe_warn_bidi_on_char (pfile, buffer->cur, kind, + /*ucn_p=*/false); + } + buffer->cur++; + } + maybe_warn_bidi_on_close (pfile, buffer->cur); + } + } _cpp_process_line_notes (pfile, true); return orig_line != pfile->line_table->highest_line; @@ -1346,11 +1677,14 @@ static const cppchar_t utf8_signifier = 0xC0; /* Returns TRUE if the sequence starting at buffer->cur is valid in an identifier. FIRST is TRUE if this starts an identifier. 
*/ + static bool forms_identifier_p (cpp_reader *pfile, int first, struct normalize_state *state) { cpp_buffer *buffer = pfile->buffer; + const bool warn_bidi_p = (CPP_OPTION (pfile, cpp_warn_bidirectional) + != bidirectional_none); if (*buffer->cur == '$') { @@ -1373,6 +1707,13 @@ forms_identifier_p (cpp_reader *pfile, int first, cppchar_t s; if (*buffer->cur >= utf8_signifier) { + if (__builtin_expect (*buffer->cur == bidi::utf8_start, 0) + && warn_bidi_p) + { + bidi::kind kind = get_bidi_utf8 (buffer->cur); + maybe_warn_bidi_on_char (pfile, buffer->cur, kind, + /*ucn_p=*/false); + } if (_cpp_valid_utf8 (pfile, &buffer->cur, buffer->rlimit, 1 + !first, state, &s)) return true; @@ -1381,6 +1722,13 @@ forms_identifier_p (cpp_reader *pfile, int first, && (buffer->cur[1] == 'u' || buffer->cur[1] == 'U')) { buffer->cur += 2; + if (warn_bidi_p) + { + bidi::kind kind = get_bidi_ucn (buffer->cur, + buffer->cur[-1] == 'U'); + maybe_warn_bidi_on_char (pfile, buffer->cur, kind, + /*ucn_p=*/true); + } if (_cpp_valid_ucn (pfile, &buffer->cur, buffer->rlimit, 1 + !first, state, &s, NULL, NULL)) return true; @@ -1489,6 +1837,8 @@ lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn, const uchar *cur; unsigned int len; unsigned int hash = HT_HASHSTEP (0, *base); + const bool warn_bidi_p = (CPP_OPTION (pfile, cpp_warn_bidirectional) + != bidirectional_none); cur = pfile->buffer->cur; if (! starts_ucn) @@ -1505,13 +1855,17 @@ lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn, { /* Slower version for identifiers containing UCNs or extended chars (including $). 
*/ - do { - while (ISIDNUM (*pfile->buffer->cur)) - { - NORMALIZE_STATE_UPDATE_IDNUM (nst, *pfile->buffer->cur); - pfile->buffer->cur++; - } - } while (forms_identifier_p (pfile, false, nst)); + do + { + while (ISIDNUM (*pfile->buffer->cur)) + { + NORMALIZE_STATE_UPDATE_IDNUM (nst, *pfile->buffer->cur); + pfile->buffer->cur++; + } + } + while (forms_identifier_p (pfile, false, nst)); + if (warn_bidi_p) + maybe_warn_bidi_on_close (pfile, pfile->buffer->cur); result = _cpp_interpret_identifier (pfile, base, pfile->buffer->cur - base); *spelling = cpp_lookup (pfile, base, pfile->buffer->cur - base); @@ -1758,6 +2112,8 @@ static void lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base) { const uchar *pos = base; + const bool warn_bidi_p = (CPP_OPTION (pfile, cpp_warn_bidirectional) + != bidirectional_none); /* 'tis a pity this information isn't passed down from the lexer's initial categorization of the token. */ @@ -1994,8 +2350,15 @@ lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base) pos = base = pfile->buffer->cur; note = &pfile->buffer->notes[pfile->buffer->cur_note]; } + else if (__builtin_expect ((unsigned char) c == bidi::utf8_start, 0) + && warn_bidi_p) + maybe_warn_bidi_on_char (pfile, pos - 1, get_bidi_utf8 (pos - 1), + /*ucn_p=*/false); } + if (warn_bidi_p) + maybe_warn_bidi_on_close (pfile, pos); + if (CPP_OPTION (pfile, user_literals)) { /* If a string format macro, say from inttypes.h, is placed touching @@ -2090,15 +2453,28 @@ lex_string (cpp_reader *pfile, cpp_token *token, const uchar *base) else terminator = '>', type = CPP_HEADER_NAME; + const bool warn_bidi_p = (CPP_OPTION (pfile, cpp_warn_bidirectional) + != bidirectional_none); for (;;) { cppchar_t c = *cur++; /* In #include-style directives, terminators are not escapable. 
*/ if (c == '\\' && !pfile->state.angled_headers && *cur != '\n') - cur++; + { + if ((cur[0] == 'u' || cur[0] == 'U') && warn_bidi_p) + { + bidi::kind kind = get_bidi_ucn (cur + 1, cur[0] == 'U'); + maybe_warn_bidi_on_char (pfile, cur, kind, /*ucn_p=*/true); + } + cur++; + } else if (c == terminator) - break; + { + if (warn_bidi_p) + maybe_warn_bidi_on_close (pfile, cur - 1); + break; + } else if (c == '\n') { cur--; @@ -2115,6 +2491,11 @@ lex_string (cpp_reader *pfile, cpp_token *token, const uchar *base) } else if (c == '\0') saw_NUL = true; + else if (__builtin_expect (c == bidi::utf8_start, 0) && warn_bidi_p) + { + bidi::kind kind = get_bidi_utf8 (cur - 1); + maybe_warn_bidi_on_char (pfile, cur - 1, kind, /*ucn_p=*/false); + } } if (saw_NUL && !pfile->state.skipping) base-commit: 6cc8aa65fdeaefe9774d5e0d4e72c91f52313be1 -- 2.31.1 ^ permalink raw reply [flat|nested] 27+ messages in thread
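For readers who want to try the byte matching from get_bidi_utf8 outside libcpp, here is a minimal standalone sketch. It is not the patch's code: the names bidi_kind and classify_utf8 are hypothetical, and the only facts taken from the patch are the byte values, which follow from the UTF-8 encodings of U+202A..U+202E (E2 80 AA..AE) and U+2066..U+2069 (E2 81 A6..A9).

```cpp
#include <cassert>

// Hypothetical standalone copy of the patch's bidi::kind enumeration.
enum class bidi_kind { NONE, LRE, RLE, LRO, RLO, LRI, RLI, FSI, PDF, PDI };

// Classify the UTF-8 sequence starting at P.  P must be NUL-terminated or
// have at least three readable bytes.  Every bidi control character the
// patch looks for is a 3-byte sequence whose first byte is 0xE2.
bidi_kind classify_utf8 (const unsigned char *p)
{
  if (p[0] != 0xe2)
    return bidi_kind::NONE;

  if (p[1] == 0x80)
    {
      switch (p[2])
        {
        case 0xaa: return bidi_kind::LRE;
        case 0xab: return bidi_kind::RLE;
        case 0xac: return bidi_kind::PDF;
        case 0xad: return bidi_kind::LRO;
        case 0xae: return bidi_kind::RLO;
        default: break;
        }
    }
  else if (p[1] == 0x81)
    {
      switch (p[2])
        {
        case 0xa6: return bidi_kind::LRI;
        case 0xa7: return bidi_kind::RLI;
        case 0xa8: return bidi_kind::FSI;
        case 0xa9: return bidi_kind::PDI;
        default: break;
        }
    }

  return bidi_kind::NONE;
}

// Convenience overload for plain string literals (an assumption of this
// sketch; libcpp works on unsigned char buffers directly).
bidi_kind classify_utf8 (const char *s)
{
  return classify_utf8 (reinterpret_cast<const unsigned char *> (s));
}
```

Calling classify_utf8 ("\xe2\x80\xae") yields bidi_kind::RLO, while a non-bidi character that happens to start with E2, such as the euro sign (E2 82 AC), yields bidi_kind::NONE. That second case is why the lexer hooks in the patch can test for the E2 lead byte cheaply and only then classify.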
* Re: [PATCH v2] libcpp: Implement -Wbidirectional for CVE-2021-42574 [PR103026]
  2021-11-02 17:18 ` [PATCH v2] " Marek Polacek
@ 2021-11-02 19:20   ` Martin Sebor
  2021-11-02 19:52     ` Marek Polacek
  0 siblings, 1 reply; 27+ messages in thread
From: Martin Sebor @ 2021-11-02 19:20 UTC (permalink / raw)
  To: Marek Polacek, Joseph Myers; +Cc: Jakub Jelinek, GCC Patches

On 11/2/21 11:18 AM, Marek Polacek via Gcc-patches wrote:
> On Mon, Nov 01, 2021 at 10:10:40PM +0000, Joseph Myers wrote:
>> On Mon, 1 Nov 2021, Marek Polacek via Gcc-patches wrote:
>>
>>> +  /* We've read a bidi char, update the current vector as necessary. */
>>> +  void on_char (kind k, bool ucn_p)
>>> +  {
>>> +    switch (k)
>>> +      {
>>> +      case kind::LRE:
>>> +      case kind::RLE:
>>> +      case kind::LRO:
>>> +      case kind::RLO:
>>> +        vec.push (ucn_p ? 3u : 1u);
>>> +        break;
>>> +      case kind::LRI:
>>> +      case kind::RLI:
>>> +      case kind::FSI:
>>> +        vec.push (ucn_p ? 2u : 0u);
>>> +        break;
>>> +      case kind::PDF:
>>> +        if (current_ctx () == kind::PDF)
>>> +          pop ();
>>> +        break;
>>> +      case kind::PDI:
>>> +        if (current_ctx () == kind::PDI)
>>> +          pop ();
>>
>> My understanding is that PDI should pop all intermediate PDF contexts
>> outward to a PDI context, which it also pops. (But if it's embedded only
>> in PDF contexts, with no PDI context containing it, it doesn't pop
>> anything.)
>>
>> I think failing to handle that only means libcpp sometimes models there
>> as being more bidirectional contexts open than there should be, so it
>> might give spurious warnings when in fact all such contexts had been
>> closed by end of string or comment.
>
> Ah, you're right.
> https://www.unicode.org/reports/tr9/#Terminating_Explicit_Directional_Isolates
> says that "[PDI] terminates the scope of the last LRI, RLI, or FSI whose
> scope has not yet been terminated, as well as the scopes of any subsequent
> LREs, RLEs, LROs, or RLOs whose scopes have not yet been terminated."
> but PDF doesn't have the latter quirk.
>
> Fixed in the below: I added a suitable truncate into on_char. The new test
> Wbidirectional-14.c exercises the handling of PDI.
>
> Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk?
>
> -- >8 --
> From a link below:
> "An issue was discovered in the Bidirectional Algorithm in the Unicode
> Specification through 14.0. It permits the visual reordering of
> characters via control sequences, which can be used to craft source code
> that renders different logic than the logical ordering of tokens
> ingested by compilers and interpreters. Adversaries can leverage this to
> encode source code for compilers accepting Unicode such that targeted
> vulnerabilities are introduced invisibly to human reviewers."
>
> More info:
> https://nvd.nist.gov/vuln/detail/CVE-2021-42574
> https://trojansource.codes/
>
> This is not a compiler bug. However, to mitigate the problem, this patch
> implements -Wbidirectional=[none|unpaired|any] to warn about possibly
> misleading Unicode bidirectional characters the preprocessor may encounter.

Bidirectional sounds very general. Can we come up with a name
that's a bit more descriptive of the problem the warning reports?
From skimming the docs and the tests it looks like the warning
points out uses of bidirectional characters in the program source
code as well as comments. Would -Wbidirectional-text be better?
Or -Wbidirectional-chars? (If Clang is also adding a warning
for this, syncing up with them one way or the other might be
helpful.)

...
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index c5730228821..9dfb95dc24c 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -327,7 +327,9 @@ Objective-C and Objective-C++ Dialects}.
>  -Warith-conversion @gol
>  -Warray-bounds -Warray-bounds=@var{n} -Warray-compare @gol
>  -Wno-attributes -Wattribute-alias=@var{n} -Wno-attribute-alias @gol
> --Wno-attribute-warning -Wbool-compare -Wbool-operation @gol
> +-Wno-attribute-warning @gol
> +-Wbidirectional=@r{[}none@r{|}unpaired@r{|}any@r{]} @gol
> +-Wbool-compare -Wbool-operation @gol
>  -Wno-builtin-declaration-mismatch @gol
>  -Wno-builtin-macro-redefined -Wc90-c99-compat -Wc99-c11-compat @gol
>  -Wc11-c2x-compat @gol
> @@ -7674,6 +7676,21 @@ Attributes considered include @code{alloc_align}, @code{alloc_size},
>  This is the default. You can disable these warnings with either
>  @option{-Wno-attribute-alias} or @option{-Wattribute-alias=0}.
>
> +@item -Wbidirectional=@r{[}none@r{|}unpaired@r{|}any@r{]}
> +@opindex Wbidirectional=
> +@opindex Wbidirectional
> +@opindex Wno-bidirectional
> +Warn about UTF-8 bidirectional characters.

I suggest mentioning where. If everywhere, enumerate the most
common contexts to make it clear it means everywhere:

   Warn about UTF-8 bidirectional characters in source code,
   including string literals, identifiers, and comments.

Martin
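The PDI behaviour raised in the quoted review (PDI closes the innermost open isolate together with any embeddings or overrides opened after it, while PDF closes only an innermost embedding or override) can be checked against a toy model. The following is a minimal sketch under stated assumptions, not the libcpp implementation: it replaces the patch's semi_embedded_vec of flag bytes with a std::vector of expected closers, it drops the UCN/UTF-8 origin bit, and the names bidi_stack and closers are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class kind { NONE, LRE, RLE, LRO, RLO, LRI, RLI, FSI, PDF, PDI };

// Hypothetical model of the open bidi contexts.  Each entry records which
// character is expected to close that context: kind::PDF for embeddings
// and overrides, kind::PDI for isolates.
struct bidi_stack
{
  std::vector<kind> closers;

  void on_char (kind k)
  {
    switch (k)
      {
      case kind::LRE:
      case kind::RLE:
      case kind::LRO:
      case kind::RLO:
        closers.push_back (kind::PDF);
        break;
      case kind::LRI:
      case kind::RLI:
      case kind::FSI:
        closers.push_back (kind::PDI);
        break;
      case kind::PDF:
        // PDF only closes an innermost embedding or override.
        if (!closers.empty () && closers.back () == kind::PDF)
          closers.pop_back ();
        break;
      case kind::PDI:
        // PDI closes the innermost isolate and everything opened after
        // it.  With no open isolate, it closes nothing.
        for (std::size_t i = closers.size (); i-- > 0; )
          if (closers[i] == kind::PDI)
            {
              closers.resize (i);
              break;
            }
        break;
      case kind::NONE:
        break;
      }
  }

  // True if some context is still open, i.e. the "unpaired" situation.
  bool unpaired () const { return !closers.empty (); }
};
```

In this model the sequence LRI, LRE, PDI leaves nothing open, whereas LRE, PDI leaves the embedding open because there is no isolate for the PDI to terminate. The v1 code quoted above would have kept both contexts open in the first case, which is the source of the spurious warnings mentioned in the review.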
* Re: [PATCH v2] libcpp: Implement -Wbidirectional for CVE-2021-42574 [PR103026]
  2021-11-02 19:20 ` Martin Sebor
@ 2021-11-02 19:52   ` Marek Polacek
  2021-11-08 21:33     ` Marek Polacek
  0 siblings, 1 reply; 27+ messages in thread
From: Marek Polacek @ 2021-11-02 19:52 UTC (permalink / raw)
  To: Martin Sebor; +Cc: Joseph Myers, Jakub Jelinek, GCC Patches

On Tue, Nov 02, 2021 at 01:20:03PM -0600, Martin Sebor wrote:
> On 11/2/21 11:18 AM, Marek Polacek via Gcc-patches wrote:
> > On Mon, Nov 01, 2021 at 10:10:40PM +0000, Joseph Myers wrote:
> > > On Mon, 1 Nov 2021, Marek Polacek via Gcc-patches wrote:
> > >
> > > > +  /* We've read a bidi char, update the current vector as necessary. */
> > > > +  void on_char (kind k, bool ucn_p)
> > > > +  {
> > > > +    switch (k)
> > > > +      {
> > > > +      case kind::LRE:
> > > > +      case kind::RLE:
> > > > +      case kind::LRO:
> > > > +      case kind::RLO:
> > > > +        vec.push (ucn_p ? 3u : 1u);
> > > > +        break;
> > > > +      case kind::LRI:
> > > > +      case kind::RLI:
> > > > +      case kind::FSI:
> > > > +        vec.push (ucn_p ? 2u : 0u);
> > > > +        break;
> > > > +      case kind::PDF:
> > > > +        if (current_ctx () == kind::PDF)
> > > > +          pop ();
> > > > +        break;
> > > > +      case kind::PDI:
> > > > +        if (current_ctx () == kind::PDI)
> > > > +          pop ();
> > >
> > > My understanding is that PDI should pop all intermediate PDF contexts
> > > outward to a PDI context, which it also pops. (But if it's embedded only
> > > in PDF contexts, with no PDI context containing it, it doesn't pop
> > > anything.)
> > >
> > > I think failing to handle that only means libcpp sometimes models there
> > > as being more bidirectional contexts open than there should be, so it
> > > might give spurious warnings when in fact all such contexts had been
> > > closed by end of string or comment.
> >
> > Ah, you're right.
> > https://www.unicode.org/reports/tr9/#Terminating_Explicit_Directional_Isolates
> > says that "[PDI] terminates the scope of the last LRI, RLI, or FSI whose
> > scope has not yet been terminated, as well as the scopes of any subsequent
> > LREs, RLEs, LROs, or RLOs whose scopes have not yet been terminated."
> > but PDF doesn't have the latter quirk.
> >
> > Fixed in the below: I added a suitable truncate into on_char. The new test
> > Wbidirectional-14.c exercises the handling of PDI.
> >
> > Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk?
> >
> > -- >8 --
> > From a link below:
> > "An issue was discovered in the Bidirectional Algorithm in the Unicode
> > Specification through 14.0. It permits the visual reordering of
> > characters via control sequences, which can be used to craft source code
> > that renders different logic than the logical ordering of tokens
> > ingested by compilers and interpreters. Adversaries can leverage this to
> > encode source code for compilers accepting Unicode such that targeted
> > vulnerabilities are introduced invisibly to human reviewers."
> >
> > More info:
> > https://nvd.nist.gov/vuln/detail/CVE-2021-42574
> > https://trojansource.codes/
> >
> > This is not a compiler bug. However, to mitigate the problem, this patch
> > implements -Wbidirectional=[none|unpaired|any] to warn about possibly
> > misleading Unicode bidirectional characters the preprocessor may encounter.
>
> Bidirectional sounds very general. Can we come up with a name
> that's a bit more descriptive of the problem the warning reports?
> From skimming the docs and the tests it looks like the warning
> points out uses of bidirectional characters in the program source
> code as well as comments. Would -Wbidirectional-text be better?
> Or -Wbidirectional-chars? (If Clang is also adding a warning
> for this, syncing up with them one way or the other might be
> helpful.)

I dunno, I could go with -Wbidirectional-chars.
Does anyone else think I should rename the current name to
-Wbidirectional-chars?

Other ideas: -Wunicode-bidi / -Wmultibyte-chars / -Wmisleading-bidirectional.

The patch for clang-tidy I saw called this misleading-bidirectional.

> ...
> > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > index c5730228821..9dfb95dc24c 100644
> > --- a/gcc/doc/invoke.texi
> > +++ b/gcc/doc/invoke.texi
> > @@ -327,7 +327,9 @@ Objective-C and Objective-C++ Dialects}.
> >  -Warith-conversion @gol
> >  -Warray-bounds -Warray-bounds=@var{n} -Warray-compare @gol
> >  -Wno-attributes -Wattribute-alias=@var{n} -Wno-attribute-alias @gol
> > --Wno-attribute-warning -Wbool-compare -Wbool-operation @gol
> > +-Wno-attribute-warning @gol
> > +-Wbidirectional=@r{[}none@r{|}unpaired@r{|}any@r{]} @gol
> > +-Wbool-compare -Wbool-operation @gol
> >  -Wno-builtin-declaration-mismatch @gol
> >  -Wno-builtin-macro-redefined -Wc90-c99-compat -Wc99-c11-compat @gol
> >  -Wc11-c2x-compat @gol
> > @@ -7674,6 +7676,21 @@ Attributes considered include @code{alloc_align}, @code{alloc_size},
> >  This is the default. You can disable these warnings with either
> >  @option{-Wno-attribute-alias} or @option{-Wattribute-alias=0}.
> > +@item -Wbidirectional=@r{[}none@r{|}unpaired@r{|}any@r{]}
> > +@opindex Wbidirectional=
> > +@opindex Wbidirectional
> > +@opindex Wno-bidirectional
> > +Warn about UTF-8 bidirectional characters.
>
> I suggest mentioning where. If everywhere, enumerate the most
> common contexts to make it clear it means everywhere:
>
>    Warn about UTF-8 bidirectional characters in source code,
>    including string literals, identifiers, and comments.

OK, I've updated the text to:

Warn about possibly misleading UTF-8 bidirectional characters in comments,
string literals, character constants, and identifiers.

Marek
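Whatever the option ends up being called, the three levels differ in when a diagnostic is issued. As a rough sketch of the decisions made by maybe_warn_bidi_on_char and maybe_warn_bidi_on_close in the v1 patch, with hypothetical names (level, warn_on_char, warn_on_close) and with the libcpp reader state reduced to plain parameters:

```cpp
#include <cassert>

// Hypothetical stand-in for enum cpp_bidirectional_level.
enum class level { none, unpaired, any };

// Decide whether seeing one bidi control character produces a warning.
// CLOSES_OPEN_CONTEXT is true when the character is a PDF or PDI that
// matches the innermost open context.  SPELLING_MISMATCH is true when the
// opener and the closer were not both UCNs or both UTF-8.
bool warn_on_char (level l, bool closes_open_context, bool spelling_mismatch)
{
  if (l == level::none)
    return false;
  if (closes_open_context)
    // A proper closer is silent, except for a UCN vs UTF-8 mismatch,
    // which the v1 patch reports at the "unpaired" level.
    return l == level::unpaired && spelling_mismatch;
  // Any other bidi character (an opener, or a closer with nothing to
  // close) is only reported at the "any" level.
  return l == level::any;
}

// Decide whether reaching the end of a comment, string literal, character
// constant, or identifier produces the "unpaired" warning.
bool warn_on_close (level l, unsigned open_contexts)
{
  return l == level::unpaired && open_contexts > 0;
}
```

Under these assumptions, an LRE that is never closed is silent when first seen at the default level and is reported once at the end of the enclosing token, while at the "any" level it is reported immediately and not again at the end.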
* Re: [PATCH v2] libcpp: Implement -Wbidirectional for CVE-2021-42574 [PR103026]
  2021-11-02 19:52 ` Marek Polacek
@ 2021-11-08 21:33   ` Marek Polacek
  2021-11-15 17:28     ` [PATCH] libcpp: Implement -Wbidi-chars " Marek Polacek
  0 siblings, 1 reply; 27+ messages in thread
From: Marek Polacek @ 2021-11-08 21:33 UTC (permalink / raw)
  To: Martin Sebor; +Cc: Joseph Myers, Jakub Jelinek, GCC Patches

Ping, can we conclude on the name? IMHO, -Wbidirectional is just fine,
but changing the name is a trivial operation.

On Tue, Nov 02, 2021 at 03:52:05PM -0400, Marek Polacek wrote:
> On Tue, Nov 02, 2021 at 01:20:03PM -0600, Martin Sebor wrote:
> > On 11/2/21 11:18 AM, Marek Polacek via Gcc-patches wrote:
> > > On Mon, Nov 01, 2021 at 10:10:40PM +0000, Joseph Myers wrote:
> > > > On Mon, 1 Nov 2021, Marek Polacek via Gcc-patches wrote:
> > > >
> > > > > +  /* We've read a bidi char, update the current vector as necessary. */
> > > > > +  void on_char (kind k, bool ucn_p)
> > > > > +  {
> > > > > +    switch (k)
> > > > > +      {
> > > > > +      case kind::LRE:
> > > > > +      case kind::RLE:
> > > > > +      case kind::LRO:
> > > > > +      case kind::RLO:
> > > > > +        vec.push (ucn_p ? 3u : 1u);
> > > > > +        break;
> > > > > +      case kind::LRI:
> > > > > +      case kind::RLI:
> > > > > +      case kind::FSI:
> > > > > +        vec.push (ucn_p ? 2u : 0u);
> > > > > +        break;
> > > > > +      case kind::PDF:
> > > > > +        if (current_ctx () == kind::PDF)
> > > > > +          pop ();
> > > > > +        break;
> > > > > +      case kind::PDI:
> > > > > +        if (current_ctx () == kind::PDI)
> > > > > +          pop ();
> > > >
> > > > My understanding is that PDI should pop all intermediate PDF contexts
> > > > outward to a PDI context, which it also pops. (But if it's embedded only
> > > > in PDF contexts, with no PDI context containing it, it doesn't pop
> > > > anything.)
> > > >
> > > > I think failing to handle that only means libcpp sometimes models there
> > > > as being more bidirectional contexts open than there should be, so it
> > > > might give spurious warnings when in fact all such contexts had been
> > > > closed by end of string or comment.
> > >
> > > Ah, you're right.
> > > https://www.unicode.org/reports/tr9/#Terminating_Explicit_Directional_Isolates
> > > says that "[PDI] terminates the scope of the last LRI, RLI, or FSI whose
> > > scope has not yet been terminated, as well as the scopes of any subsequent
> > > LREs, RLEs, LROs, or RLOs whose scopes have not yet been terminated."
> > > but PDF doesn't have the latter quirk.
> > >
> > > Fixed in the below: I added a suitable truncate into on_char. The new test
> > > Wbidirectional-14.c exercises the handling of PDI.
> > >
> > > Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk?
> > >
> > > -- >8 --
> > > From a link below:
> > > "An issue was discovered in the Bidirectional Algorithm in the Unicode
> > > Specification through 14.0. It permits the visual reordering of
> > > characters via control sequences, which can be used to craft source code
> > > that renders different logic than the logical ordering of tokens
> > > ingested by compilers and interpreters. Adversaries can leverage this to
> > > encode source code for compilers accepting Unicode such that targeted
> > > vulnerabilities are introduced invisibly to human reviewers."
> > >
> > > More info:
> > > https://nvd.nist.gov/vuln/detail/CVE-2021-42574
> > > https://trojansource.codes/
> > >
> > > This is not a compiler bug. However, to mitigate the problem, this patch
> > > implements -Wbidirectional=[none|unpaired|any] to warn about possibly
> > > misleading Unicode bidirectional characters the preprocessor may encounter.
> >
> > Bidirectional sounds very general. Can we come up with a name
> > that's a bit more descriptive of the problem the warning reports?
> > From skimming the docs and the tests it looks like the warning
> > points out uses of bidirectional characters in the program source
> > code as well as comments. Would -Wbidirectional-text be better?
> > Or -Wbidirectional-chars? (If Clang is also adding a warning
> > for this, syncing up with them one way or the other might be
> > helpful.)
>
> I dunno, I could go with -Wbidirectional-chars. Does anyone else
> think I should rename the current name to -Wbidirectional-chars?
>
> Other ideas: -Wunicode-bidi / -Wmultibyte-chars / -Wmisleading-bidirectional.
>
> The patch for clang-tidy I saw called this misleading-bidirectional.
>
> > ...
> > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > > index c5730228821..9dfb95dc24c 100644
> > > --- a/gcc/doc/invoke.texi
> > > +++ b/gcc/doc/invoke.texi
> > > @@ -327,7 +327,9 @@ Objective-C and Objective-C++ Dialects}.
> > >  -Warith-conversion @gol
> > >  -Warray-bounds -Warray-bounds=@var{n} -Warray-compare @gol
> > >  -Wno-attributes -Wattribute-alias=@var{n} -Wno-attribute-alias @gol
> > > --Wno-attribute-warning -Wbool-compare -Wbool-operation @gol
> > > +-Wno-attribute-warning @gol
> > > +-Wbidirectional=@r{[}none@r{|}unpaired@r{|}any@r{]} @gol
> > > +-Wbool-compare -Wbool-operation @gol
> > >  -Wno-builtin-declaration-mismatch @gol
> > >  -Wno-builtin-macro-redefined -Wc90-c99-compat -Wc99-c11-compat @gol
> > >  -Wc11-c2x-compat @gol
> > > @@ -7674,6 +7676,21 @@ Attributes considered include @code{alloc_align}, @code{alloc_size},
> > >  This is the default. You can disable these warnings with either
> > >  @option{-Wno-attribute-alias} or @option{-Wattribute-alias=0}.
> > > +@item -Wbidirectional=@r{[}none@r{|}unpaired@r{|}any@r{]}
> > > +@opindex Wbidirectional=
> > > +@opindex Wbidirectional
> > > +@opindex Wno-bidirectional
> > > +Warn about UTF-8 bidirectional characters.
> >
> > I suggest mentioning where. If everywhere, enumerate the most
> > common contexts to make it clear it means everywhere:
> >
> >    Warn about UTF-8 bidirectional characters in source code,
> >    including string literals, identifiers, and comments.
>
> OK, I've updated the text to:
>
> Warn about possibly misleading UTF-8 bidirectional characters in comments,
> string literals, character constants, and identifiers.
>
> Marek

Marek
* [PATCH] libcpp: Implement -Wbidi-chars for CVE-2021-42574 [PR103026] 2021-11-08 21:33 ` Marek Polacek @ 2021-11-15 17:28 ` Marek Polacek 2021-11-15 23:15 ` David Malcolm 2021-11-30 8:38 ` [PATCH] libcpp: Implement -Wbidi-chars for CVE-2021-42574 [PR103026] Stephan Bergmann 0 siblings, 2 replies; 27+ messages in thread From: Marek Polacek @ 2021-11-15 17:28 UTC (permalink / raw) To: Joseph Myers; +Cc: Jakub Jelinek, GCC Patches, Martin Sebor On Mon, Nov 08, 2021 at 04:33:43PM -0500, Marek Polacek wrote: > Ping, can we conclude on the name? IMHO, -Wbidirectional is just fine, > but changing the name is a trivial operation. Here's a patch with a better name (suggested by Jonathan W.). Otherwise no changes. Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk? -- >8 -- From a link below: "An issue was discovered in the Bidirectional Algorithm in the Unicode Specification through 14.0. It permits the visual reordering of characters via control sequences, which can be used to craft source code that renders different logic than the logical ordering of tokens ingested by compilers and interpreters. Adversaries can leverage this to encode source code for compilers accepting Unicode such that targeted vulnerabilities are introduced invisibly to human reviewers." More info: https://nvd.nist.gov/vuln/detail/CVE-2021-42574 https://trojansource.codes/ This is not a compiler bug. However, to mitigate the problem, this patch implements -Wbidi-chars=[none|unpaired|any] to warn about possibly misleading Unicode bidirectional characters the preprocessor may encounter. The default is =unpaired, which warns about improperly terminated bidirectional characters; e.g. a LRE without its appertaining PDF. The level =any warns about any use of bidirectional characters. This patch handles both UCNs and UTF-8 characters. UCNs designating bidi characters in identifiers are accepted since r204886. Then r217144 enabled -fextended-identifiers by default. 
Extended characters in C/C++ identifiers have been accepted since r275979. However, this patch still warns about mixing UTF-8 and UCN bidi characters; there seems to be no good reason to allow mixing them. We warn in different contexts: comments (both C and C++-style), string literals, character constants, and identifiers. Expectedly, UCNs are ignored in comments and raw string literals. The bidirectional characters can nest so this patch handles that as well. I have not included nor tested this at all with Fortran (which also has string literals and line comments). Dave M. posted patches improving diagnostic involving Unicode characters. This patch does not make use of this new infrastructure yet. PR preprocessor/103026 gcc/c-family/ChangeLog: * c.opt (Wbidi-chars, Wbidi-chars=): New option. gcc/ChangeLog: * doc/invoke.texi: Document -Wbidi-chars. libcpp/ChangeLog: * include/cpplib.h (enum cpp_bidirectional_level): New. (struct cpp_options): Add cpp_warn_bidirectional. (enum cpp_warning_reason): Add CPP_W_BIDIRECTIONAL. * init.c (cpp_create_reader): Set cpp_warn_bidirectional. * lex.c (bidi): New namespace. (get_bidi_utf8): New function. (get_bidi_ucn): Likewise. (maybe_warn_bidi_on_close): Likewise. (maybe_warn_bidi_on_char): Likewise. (_cpp_skip_block_comment): Implement warning about bidirectional characters. (skip_line_comment): Likewise. (forms_identifier_p): Likewise. (lex_identifier): Likewise. (lex_string): Likewise. (lex_raw_string): Likewise. gcc/testsuite/ChangeLog: * c-c++-common/Wbidi-chars-1.c: New test. * c-c++-common/Wbidi-chars-2.c: New test. * c-c++-common/Wbidi-chars-3.c: New test. * c-c++-common/Wbidi-chars-4.c: New test. * c-c++-common/Wbidi-chars-5.c: New test. * c-c++-common/Wbidi-chars-6.c: New test. * c-c++-common/Wbidi-chars-7.c: New test. * c-c++-common/Wbidi-chars-8.c: New test. * c-c++-common/Wbidi-chars-9.c: New test. * c-c++-common/Wbidi-chars-10.c: New test. * c-c++-common/Wbidi-chars-11.c: New test. 
	* c-c++-common/Wbidi-chars-12.c: New test.
	* c-c++-common/Wbidi-chars-13.c: New test.
	* c-c++-common/Wbidi-chars-14.c: New test.
---
 gcc/c-family/c.opt                          |  24 ++
 gcc/doc/invoke.texi                         |  20 +-
 gcc/testsuite/c-c++-common/Wbidi-chars-1.c  |  12 +
 gcc/testsuite/c-c++-common/Wbidi-chars-10.c |  27 ++
 gcc/testsuite/c-c++-common/Wbidi-chars-11.c |  13 +
 gcc/testsuite/c-c++-common/Wbidi-chars-12.c |  19 +
 gcc/testsuite/c-c++-common/Wbidi-chars-13.c |  17 +
 gcc/testsuite/c-c++-common/Wbidi-chars-14.c |  38 ++
 gcc/testsuite/c-c++-common/Wbidi-chars-2.c  |   9 +
 gcc/testsuite/c-c++-common/Wbidi-chars-3.c  |  11 +
 gcc/testsuite/c-c++-common/Wbidi-chars-4.c  | 166 ++++++++
 gcc/testsuite/c-c++-common/Wbidi-chars-5.c  | 166 ++++++++
 gcc/testsuite/c-c++-common/Wbidi-chars-6.c  | 155 ++++++++
 gcc/testsuite/c-c++-common/Wbidi-chars-7.c  |   9 +
 gcc/testsuite/c-c++-common/Wbidi-chars-8.c  |  13 +
 gcc/testsuite/c-c++-common/Wbidi-chars-9.c  |  29 ++
 libcpp/include/cpplib.h                     |  18 +-
 libcpp/init.c                               |   1 +
 libcpp/lex.c                                | 407 +++++++++++++++++++-
 19 files changed, 1139 insertions(+), 15 deletions(-)
 create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-1.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-10.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-11.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-12.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-13.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-14.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-2.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-3.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-4.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-5.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-6.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-7.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-8.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-9.c

diff --git a/gcc/c-family/c.opt b/gcc/c-family/c.opt
index
06457ac739e..b047df0f125 100644 --- a/gcc/c-family/c.opt +++ b/gcc/c-family/c.opt @@ -374,6 +374,30 @@ Wbad-function-cast C ObjC Var(warn_bad_function_cast) Warning Warn about casting functions to incompatible types. +Wbidi-chars +C ObjC C++ ObjC++ Warning Alias(Wbidi-chars=,any,none) +; + +Wbidi-chars= +C ObjC C++ ObjC++ RejectNegative Joined Warning CPP(cpp_warn_bidirectional) CppReason(CPP_W_BIDIRECTIONAL) Var(warn_bidirectional) Init(bidirectional_unpaired) Enum(cpp_bidirectional_level) +-Wbidi-chars=[none|unpaired|any] Warn about UTF-8 bidirectional characters. + +; Required for these enum values. +SourceInclude +cpplib.h + +Enum +Name(cpp_bidirectional_level) Type(int) UnknownError(argument %qs to %<-Wbidi-chars%> not recognized) + +EnumValue +Enum(cpp_bidirectional_level) String(none) Value(bidirectional_none) + +EnumValue +Enum(cpp_bidirectional_level) String(unpaired) Value(bidirectional_unpaired) + +EnumValue +Enum(cpp_bidirectional_level) String(any) Value(bidirectional_any) + Wbool-compare C ObjC C++ ObjC++ Var(warn_bool_compare) Warning LangEnabledBy(C ObjC C++ ObjC++,Wall) Warn about boolean expression compared with an integer value different from true/false. diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index 6070288856c..c4473bc8971 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -327,7 +327,9 @@ Objective-C and Objective-C++ Dialects}. -Warith-conversion @gol -Warray-bounds -Warray-bounds=@var{n} -Warray-compare @gol -Wno-attributes -Wattribute-alias=@var{n} -Wno-attribute-alias @gol --Wno-attribute-warning -Wbool-compare -Wbool-operation @gol +-Wno-attribute-warning @gol +-Wbidi-chars=@r{[}none@r{|}unpaired@r{|}any@r{]} @gol +-Wbool-compare -Wbool-operation @gol -Wno-builtin-declaration-mismatch @gol -Wno-builtin-macro-redefined -Wc90-c99-compat -Wc99-c11-compat @gol -Wc11-c2x-compat @gol @@ -7678,6 +7680,22 @@ Attributes considered include @code{alloc_align}, @code{alloc_size}, This is the default. 
You can disable these warnings with either @option{-Wno-attribute-alias} or @option{-Wattribute-alias=0}. +@item -Wbidi-chars=@r{[}none@r{|}unpaired@r{|}any@r{]} +@opindex Wbidi-chars= +@opindex Wbidi-chars +@opindex Wno-bidi-chars +Warn about possibly misleading UTF-8 bidirectional characters in comments, +string literals, character constants, and identifiers. Such characters can +change left-to-right writing direction into right-to-left (and vice versa), +which can cause confusion between the logical order and visual order. This +may be dangerous; for instance, it may seem that a piece of code is not +commented out, whereas it in fact is. + +There are three levels of warning supported by GCC@. The default is +@option{-Wbidi-chars=unpaired}, which warns about improperly terminated +bidi contexts. @option{-Wbidi-chars=none} turns the warning off. +@option{-Wbidi-chars=any} warns about any use of bidirectional characters. + @item -Wbool-compare @opindex Wno-bool-compare @opindex Wbool-compare diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-1.c b/gcc/testsuite/c-c++-common/Wbidi-chars-1.c new file mode 100644 index 00000000000..34f5ac19271 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-1.c @@ -0,0 +1,12 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ + +int main() { + int isAdmin = 0; + /* } if (isAdmin) begin admins only */ +/* { dg-warning "bidirectional" "" { target *-*-* } .-1 } */ + __builtin_printf("You are an admin.\n"); + /* end admins only { */ +/* { dg-warning "bidirectional" "" { target *-*-* } .-1 } */ + return 0; +} diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-10.c b/gcc/testsuite/c-c++-common/Wbidi-chars-10.c new file mode 100644 index 00000000000..3f851b69e65 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-10.c @@ -0,0 +1,27 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidi-chars=unpaired" } */ +/* More nesting testing. 
*/ + +/* RLE LRI PDF PDI*/ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int LRE_\u202a_PDF_\u202c; +int LRE_\u202a_PDF_\u202c_LRE_\u202a_PDF_\u202c; +int LRE_\u202a_LRI_\u2066_PDF_\u202c_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int RLE_\u202b_RLI_\u2067_PDF_\u202c_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int RLE_\u202b_RLI_\u2067_PDI_\u2069_PDF_\u202c; +int FSI_\u2068_LRO_\u202d_PDI_\u2069_PDF_\u202c; +int FSI_\u2068; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int FSI_\u2068_PDI_\u2069; +int FSI_\u2068_FSI_\u2068_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069; +int RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDF_\u202c; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_FSI_\u2068_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-11.c b/gcc/testsuite/c-c++-common/Wbidi-chars-11.c new file mode 100644 index 00000000000..270ce2368a9 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-11.c @@ -0,0 +1,13 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidi-chars=unpaired" } */ +/* Test that we warn when mixing UCN and UTF-8. 
*/ + +int LRE__PDF_\u202c; +/* { dg-warning "mismatch" "" { target *-*-* } .-1 } */ +int LRE_\u202a_PDF__; +/* { dg-warning "mismatch" "" { target *-*-* } .-1 } */ +const char *s1 = "LRE__PDF_\u202c"; +/* { dg-warning "mismatch" "" { target *-*-* } .-1 } */ +const char *s2 = "LRE_\u202a_PDF_"; +/* { dg-warning "mismatch" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-12.c b/gcc/testsuite/c-c++-common/Wbidi-chars-12.c new file mode 100644 index 00000000000..b07eec1da91 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-12.c @@ -0,0 +1,19 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile { target { c || c++11 } } } */ +/* { dg-options "-Wbidi-chars=any" } */ +/* Test raw strings. */ + +const char *s1 = R"(a b c LRE 1 2 3 PDF x y z)"; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ +const char *s2 = R"(a b c RLE 1 2 3 PDF x y z)"; +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ +const char *s3 = R"(a b c LRO 1 2 3 PDF x y z)"; +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ +const char *s4 = R"(a b c RLO 1 2 3 PDF x y z)"; +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ +const char *s7 = R"(a b c FSI 1 2 3 PDI x y) z"; +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */ +const char *s8 = R"(a b c PDI x y )z"; +/* { dg-warning "U\\+2069" "" { target *-*-* } .-1 } */ +const char *s9 = R"(a b c PDF x y z)"; +/* { dg-warning "U\\+202C" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-13.c b/gcc/testsuite/c-c++-common/Wbidi-chars-13.c new file mode 100644 index 00000000000..b2dd9fde752 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-13.c @@ -0,0 +1,17 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile { target { c || c++11 } } } */ +/* { dg-options "-Wbidi-chars=unpaired" } */ +/* Test raw strings. 
*/ + +const char *s1 = R"(a b c LRE 1 2 3)"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +const char *s2 = R"(a b c RLE 1 2 3)"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +const char *s3 = R"(a b c LRO 1 2 3)"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +const char *s4 = R"(a b c FSI 1 2 3)"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +const char *s5 = R"(a b c LRI 1 2 3)"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +const char *s6 = R"(a b c RLI 1 2 3)"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-14.c b/gcc/testsuite/c-c++-common/Wbidi-chars-14.c new file mode 100644 index 00000000000..ba5f75d9553 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-14.c @@ -0,0 +1,38 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidi-chars=unpaired" } */ +/* Test PDI handling, which also pops any subsequent LREs, RLEs, LROs, + or RLOs. 
*/ + +/* LRI__LRI__RLE__RLE__RLE__PDI_*/ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +// LRI__RLE__RLE__RLE__PDI_ +// LRI__RLO__RLE__RLE__PDI_ +// LRI__RLO__RLE__PDI_ +// FSI__RLO__PDI_ +// FSI__FSI__RLO__PDI_ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + +int LRI_\u2066_LRI_\u2066_LRE_\u202a_LRE_\u202a_LRE_\u202a_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int LRI_\u2066_LRI_\u2066_LRE_\u202a_LRE_\u202a_LRE_\u202a_PDI_\u2069_PDI_\u2069; +int LRI_\u2066_LRI_\u2066_LRI_\u2066_LRE_\u202a_LRE_\u202a_LRE_\u202a_PDI_\u2069_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int PDI_\u2069; +int LRI_\u2066_PDI_\u2069; +int RLI_\u2067_PDI_\u2069; +int LRE_\u202a_LRI_\u2066_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int LRI_\u2066_LRE_\u202a_PDF_\u202c_PDI_\u2069; +int LRI_\u2066_LRE_\u202a_LRE_\u202a_PDF_\u202c_PDI_\u2069; +int RLI_\u2067_LRI_\u2066_LRE_\u202a_LRE_\u202a_PDF_\u202c_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int FSI_\u2068_LRI_\u2066_LRE_\u202a_LRE_\u202a_PDF_\u202c_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int RLO_\u202e_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int RLI_\u2067_PDI_\u2069_RLI_\u2067; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int FSI_\u2068_PDF_\u202c_PDI_\u2069; +int FSI_\u2068_FSI_\u2068_PDF_\u202c_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-2.c b/gcc/testsuite/c-c++-common/Wbidi-chars-2.c new file mode 100644 index 00000000000..2340374f276 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-2.c @@ -0,0 +1,9 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ + +int main() { + /* Say hello; newline/*/ return 0 ; +/* { dg-warning "bidirectional" "" { target *-*-* } .-1 } */ + __builtin_printf("Hello world.\n"); + return 0; +} diff --git 
a/gcc/testsuite/c-c++-common/Wbidi-chars-3.c b/gcc/testsuite/c-c++-common/Wbidi-chars-3.c new file mode 100644 index 00000000000..9dc7edb6e64 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-3.c @@ -0,0 +1,11 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ + +int main() { + const char* access_level = "user"; + if (__builtin_strcmp(access_level, "user // Check if admin ")) { +/* { dg-warning "bidirectional" "" { target *-*-* } .-1 } */ + __builtin_printf("You are an admin.\n"); + } + return 0; +} diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-4.c b/gcc/testsuite/c-c++-common/Wbidi-chars-4.c new file mode 100644 index 00000000000..9fd4bc535ca --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-4.c @@ -0,0 +1,166 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidi-chars=any -Wno-multichar -Wno-overflow" } */ +/* Test all bidi chars in various contexts (identifiers, comments, + string literals, character constants), both UCN and UTF-8. The bidi + chars here are properly terminated, except for the character constants. */ + +/* a b c LRE 1 2 3 PDF x y z */ +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ +/* a b c RLE 1 2 3 PDF x y z */ +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ +/* a b c LRO 1 2 3 PDF x y z */ +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ +/* a b c RLO 1 2 3 PDF x y z */ +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ +/* a b c LRI 1 2 3 PDI x y z */ +/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */ +/* a b c RLI 1 2 3 PDI x y */ +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */ +/* a b c FSI 1 2 3 PDI x y z */ +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */ + +/* Same but C++ comments instead. 
*/ +// a b c LRE 1 2 3 PDF x y z +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ +// a b c RLE 1 2 3 PDF x y z +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ +// a b c LRO 1 2 3 PDF x y z +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ +// a b c RLO 1 2 3 PDF x y z +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ +// a b c LRI 1 2 3 PDI x y z +/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */ +// a b c RLI 1 2 3 PDI x y +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */ +// a b c FSI 1 2 3 PDI x y z +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */ + +/* Here we're closing an unopened context, warn when =any. */ +/* a b c PDI x y z */ +/* { dg-warning "U\\+2069" "" { target *-*-* } .-1 } */ +/* a b c PDF x y z */ +/* { dg-warning "U\\+202C" "" { target *-*-* } .-1 } */ +// a b c PDI x y z +/* { dg-warning "U\\+2069" "" { target *-*-* } .-1 } */ +// a b c PDF x y z +/* { dg-warning "U\\+202C" "" { target *-*-* } .-1 } */ + +void +g1 () +{ + const char *s1 = "a b c LRE 1 2 3 PDF x y z"; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ + const char *s2 = "a b c RLE 1 2 3 PDF x y z"; +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ + const char *s3 = "a b c LRO 1 2 3 PDF x y z"; +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ + const char *s4 = "a b c RLO 1 2 3 PDF x y z"; +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ + const char *s5 = "a b c LRI 1 2 3 PDI x y z"; +/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */ + const char *s6 = "a b c RLI 1 2 3 PDI x y z"; +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */ + const char *s7 = "a b c FSI 1 2 3 PDI x y z"; +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */ + const char *s8 = "a b c PDI x y z"; +/* { dg-warning "U\\+2069" "" { target *-*-* } .-1 } */ + const char *s9 = "a b c PDF x y z"; +/* { dg-warning "U\\+202C" "" { target *-*-* } .-1 } */ + + const char *s10 = "a b c LRE\u202a 1 2 3 PDF\u202c x y z"; +/* 
{ dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ + const char *s11 = "a b c LRE\u202A 1 2 3 PDF\u202c x y z"; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ + const char *s12 = "a b c RLE\u202b 1 2 3 PDF\u202c x y z"; +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ + const char *s13 = "a b c RLE\u202B 1 2 3 PDF\u202c x y z"; +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ + const char *s14 = "a b c LRO\u202d 1 2 3 PDF\u202c x y z"; +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ + const char *s15 = "a b c LRO\u202D 1 2 3 PDF\u202c x y z"; +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ + const char *s16 = "a b c RLO\u202e 1 2 3 PDF\u202c x y z"; +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ + const char *s17 = "a b c RLO\u202E 1 2 3 PDF\u202c x y z"; +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ + const char *s18 = "a b c LRI\u2066 1 2 3 PDI\u2069 x y z"; +/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */ + const char *s19 = "a b c RLI\u2067 1 2 3 PDI\u2069 x y z"; +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */ + const char *s20 = "a b c FSI\u2068 1 2 3 PDI\u2069 x y z"; +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */ +} + +void +g2 () +{ + const char c1 = '\u202a'; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ + const char c2 = '\u202A'; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ + const char c3 = '\u202b'; +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ + const char c4 = '\u202B'; +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ + const char c5 = '\u202d'; +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ + const char c6 = '\u202D'; +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ + const char c7 = '\u202e'; +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ + const char c8 = '\u202E'; +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ + const char c9 = '\u2066'; +/* { dg-warning "U\\+2066" "" { 
target *-*-* } .-1 } */ + const char c10 = '\u2067'; +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */ + const char c11 = '\u2068'; +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */ +} + +int abc; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */ +int AX; +/* { dg-warning "U\\+202C" "" { target *-*-* } .-1 } */ +int A\u202cY; +/* { dg-warning "U\\+202C" "" { target *-*-* } .-1 } */ +int A\u202CY2; +/* { dg-warning "U\\+202C" "" { target *-*-* } .-1 } */ + +int d\u202ae\u202cf; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ +int d\u202Ae\u202cf2; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ +int d\u202be\u202cf; +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ +int d\u202Be\u202cf2; +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ +int d\u202de\u202cf; +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ +int d\u202De\u202cf2; +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ +int d\u202ee\u202cf; +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ +int d\u202Ee\u202cf2; +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ +int d\u2066e\u2069f; +/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */ +int d\u2067e\u2069f; +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */ +int d\u2068e\u2069f; +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */ +int X\u2069; +/* { dg-warning "U\\+2069" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-5.c b/gcc/testsuite/c-c++-common/Wbidi-chars-5.c new file mode 100644 index 00000000000..efb26309b68 --- /dev/null +++ 
b/gcc/testsuite/c-c++-common/Wbidi-chars-5.c @@ -0,0 +1,166 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidi-chars=unpaired -Wno-multichar -Wno-overflow" } */ +/* Test all bidi chars in various contexts (identifiers, comments, + string literals, character constants), both UCN and UTF-8. The bidi + chars here are properly terminated, except for the character constants. */ + +/* a b c LRE 1 2 3 PDF x y z */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +/* a b c RLE 1 2 3 PDF x y z */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +/* a b c LRO 1 2 3 PDF x y z */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +/* a b c RLO 1 2 3 PDF x y z */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +/* a b c LRI 1 2 3 PDI x y z */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +/* a b c RLI 1 2 3 PDI x y */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +/* a b c FSI 1 2 3 PDI x y z */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + +/* Same but C++ comments instead. */ +// a b c LRE 1 2 3 PDF x y z +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +// a b c RLE 1 2 3 PDF x y z +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +// a b c LRO 1 2 3 PDF x y z +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +// a b c RLO 1 2 3 PDF x y z +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +// a b c LRI 1 2 3 PDI x y z +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +// a b c RLI 1 2 3 PDI x y +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +// a b c FSI 1 2 3 PDI x y z +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + +/* Here we're closing an unopened context, warn when =any. 
*/ +/* a b c PDI x y z */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +/* a b c PDF x y z */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +// a b c PDI x y z +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +// a b c PDF x y z +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + +void +g1 () +{ + const char *s1 = "a b c LRE 1 2 3 PDF x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s2 = "a b c RLE 1 2 3 PDF x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s3 = "a b c LRO 1 2 3 PDF x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s4 = "a b c RLO 1 2 3 PDF x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s5 = "a b c LRI 1 2 3 PDI x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s6 = "a b c RLI 1 2 3 PDI x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s7 = "a b c FSI 1 2 3 PDI x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s8 = "a b c PDI x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s9 = "a b c PDF x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + + const char *s10 = "a b c LRE\u202a 1 2 3 PDF\u202c x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s11 = "a b c LRE\u202A 1 2 3 PDF\u202c x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s12 = "a b c RLE\u202b 1 2 3 PDF\u202c x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s13 = "a b c RLE\u202B 1 2 3 PDF\u202c x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s14 = "a b c LRO\u202d 1 2 3 PDF\u202c x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s15 = "a b c LRO\u202D 1 2 3 PDF\u202c x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s16 = "a b c RLO\u202e 1 2 3 PDF\u202c x y z"; +/* { 
dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s17 = "a b c RLO\u202E 1 2 3 PDF\u202c x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s18 = "a b c LRI\u2066 1 2 3 PDI\u2069 x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s19 = "a b c RLI\u2067 1 2 3 PDI\u2069 x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s20 = "a b c FSI\u2068 1 2 3 PDI\u2069 x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +} + +void +g2 () +{ + const char c1 = '\u202a'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c2 = '\u202A'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c3 = '\u202b'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c4 = '\u202B'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c5 = '\u202d'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c6 = '\u202D'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c7 = '\u202e'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c8 = '\u202E'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c9 = '\u2066'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c10 = '\u2067'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c11 = '\u2068'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +} + +int abc; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int AX; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int 
A\u202cY; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int A\u202CY2; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + +int d\u202ae\u202cf; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u202Ae\u202cf2; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u202be\u202cf; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u202Be\u202cf2; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u202de\u202cf; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u202De\u202cf2; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u202ee\u202cf; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u202Ee\u202cf2; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u2066e\u2069f; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u2067e\u2069f; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u2068e\u2069f; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int X\u2069; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-6.c b/gcc/testsuite/c-c++-common/Wbidi-chars-6.c new file mode 100644 index 00000000000..0ce6fff2dee --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-6.c @@ -0,0 +1,155 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidi-chars=unpaired" } */ +/* Test nesting of bidi chars in various contexts. 
*/ + +/* Terminated by the wrong char: */ +/* a b c LRE 1 2 3 PDI x y z */ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* a b c RLE 1 2 3 PDI x y z*/ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* a b c LRO 1 2 3 PDI x y z */ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* a b c RLO 1 2 3 PDI x y z */ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* a b c LRI 1 2 3 PDF x y z */ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* a b c RLI 1 2 3 PDF x y z */ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* a b c FSI 1 2 3 PDF x y z*/ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + +/* LRE PDF */ +/* LRE LRE PDF PDF */ +/* PDF LRE PDF */ +/* LRE PDF LRE PDF */ +/* LRE LRE PDF */ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* PDF LRE */ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + +// a b c LRE 1 2 3 PDI x y z +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +// a b c RLE 1 2 3 PDI x y z*/ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +// a b c LRO 1 2 3 PDI x y z +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +// a b c RLO 1 2 3 PDI x y z +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +// a b c LRI 1 2 3 PDF x y z +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +// a b c RLI 1 2 3 PDF x y z +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +// a b c FSI 1 2 3 PDF x y z +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + +// LRE PDF +// LRE LRE PDF PDF +// PDF LRE PDF +// LRE PDF LRE PDF +// LRE LRE PDF +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +// PDF LRE +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + +void +g1 () +{ + const char *s1 = "a b c LRE 1 2 3 PDI x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s2 = "a b c LRE\u202a 1 2 3 PDI\u2069 x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s3 = "a b c RLE 1 2 
3 PDI x y "; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s4 = "a b c RLE\u202b 1 2 3 PDI\u2069 x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s5 = "a b c LRO 1 2 3 PDI x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s6 = "a b c LRO\u202d 1 2 3 PDI\u2069 x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s7 = "a b c RLO 1 2 3 PDI x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s8 = "a b c RLO\u202e 1 2 3 PDI\u2069 x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s9 = "a b c LRI 1 2 3 PDF x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s10 = "a b c LRI\u2066 1 2 3 PDF\u202c x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s11 = "a b c RLI 1 2 3 PDF x y z\ + "; +/* { dg-warning "unpaired" "" { target *-*-* } .-2 } */ + const char *s12 = "a b c RLI\u2067 1 2 3 PDF\u202c x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s13 = "a b c FSI 1 2 3 PDF x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s14 = "a b c FSI\u2068 1 2 3 PDF\u202c x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s15 = "PDF LRE"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s16 = "PDF\u202c LRE\u202a"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s17 = "LRE PDF"; + const char *s18 = "LRE\u202a PDF\u202c"; + const char *s19 = "LRE LRE PDF PDF"; + const char *s20 = "LRE\u202a LRE\u202a PDF\u202c PDF\u202c"; + const char *s21 = "PDF LRE PDF"; + const char *s22 = "PDF\u202c LRE\u202a PDF\u202c"; + const char *s23 = "LRE LRE PDF"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s24 = "LRE\u202a LRE\u202a PDF\u202c"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s25 = "PDF LRE"; +/* { 
dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s26 = "PDF\u202c LRE\u202a"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s27 = "PDF LRE\u202a"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s28 = "PDF\u202c LRE"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +} + +int aLREbPDI; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int A\u202aB\u2069C; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int aRLEbPDI; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int a\u202bB\u2069c; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int aLRObPDI; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int a\u202db\u2069c2; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int aRLObPDI; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int a\u202eb\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int aLRIbPDF; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int a\u2066b\u202c; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int aRLIbPDFc +; +/* { dg-warning "unpaired" "" { target *-*-* } .-2 } */ +int a\u2067b\u202c; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int aFSIbPDF; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int a\u2068b\u202c; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int aFSIbPD\u202C; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int aFSI\u2068bPDF_; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int aLREbPDFb; +int A\u202aB\u202c; +int a_LRE_LRE_b_PDF_PDF; +int A\u202aA\u202aB\u202cB\u202c; +int aPDFbLREadPDF; +int a_\u202C_\u202a_\u202c; +int a_LRE_b_PDF_c_LRE_PDF; +int a_\u202a_\u202c_\u202a_\u202c_; +int a_LRE_b_PDF_c_LRE; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int a_\u202a_\u202c_\u202a_; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-7.c 
b/gcc/testsuite/c-c++-common/Wbidi-chars-7.c new file mode 100644 index 00000000000..d012d420ec0 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-7.c @@ -0,0 +1,9 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidi-chars=any" } */ +/* Test we ignore UCNs in comments. */ + +// a b c \u202a 1 2 3 +// a b c \u202A 1 2 3 +/* a b c \u202a 1 2 3 */ +/* a b c \u202A 1 2 3 */ diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-8.c b/gcc/testsuite/c-c++-common/Wbidi-chars-8.c new file mode 100644 index 00000000000..4f54c5092ec --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-8.c @@ -0,0 +1,13 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidi-chars=any" } */ +/* Test \u vs \U. */ + +int a_\u202A; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ +int a_\u202a_2; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ +int a_\U0000202A_3; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ +int a_\U0000202a_4; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-9.c b/gcc/testsuite/c-c++-common/Wbidi-chars-9.c new file mode 100644 index 00000000000..e2af1b1ca97 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-9.c @@ -0,0 +1,29 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidi-chars=unpaired" } */ +/* Test that we properly separate bidi contexts (comment/identifier/character + constant/string literal). 
*/ + +/* LRE -><- */ int pdf_\u202c_1; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* RLE -><- */ int pdf_\u202c_2; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* LRO -><- */ int pdf_\u202c_3; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* RLO -><- */ int pdf_\u202c_4; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* LRI -><-*/ int pdi_\u2069_1; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* RLI -><- */ int pdi_\u2069_12; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* FSI -><- */ int pdi_\u2069_3; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + +const char *s1 = "LRE\u202a"; /* PDF -><- */ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* LRE -><- */ const char *s2 = "PDF\u202c"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +const char *s3 = "LRE\u202a"; int pdf_\u202c_5; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int lre_\u202a; const char *s4 = "PDF\u202c"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ diff --git a/libcpp/include/cpplib.h b/libcpp/include/cpplib.h index 176f8c5bbce..60cf08ddd35 100644 --- a/libcpp/include/cpplib.h +++ b/libcpp/include/cpplib.h @@ -319,6 +319,17 @@ enum cpp_main_search CMS_system, /* Search the system INCLUDE path. */ }; +/* The possible bidirectional characters checking levels, from least + restrictive to most. */ +enum cpp_bidirectional_level { + /* No checking. */ + bidirectional_none, + /* Only detect unpaired uses of bidirectional characters. */ + bidirectional_unpaired, + /* Detect any use of bidirectional characters. */ + bidirectional_any +}; + /* This structure is nested inside struct cpp_reader, and carries all the options visible to the command line. */ struct cpp_options @@ -539,6 +550,10 @@ struct cpp_options /* True if warn about differences between C++98 and C++11. */ bool cpp_warn_cxx11_compat; + /* Nonzero of bidirectional characters checking is on. 
See enum + cpp_bidirectional_level. */ + unsigned char cpp_warn_bidirectional; + /* Dependency generation. */ struct { @@ -643,7 +658,8 @@ enum cpp_warning_reason { CPP_W_C90_C99_COMPAT, CPP_W_C11_C2X_COMPAT, CPP_W_CXX11_COMPAT, - CPP_W_EXPANSION_TO_DEFINED + CPP_W_EXPANSION_TO_DEFINED, + CPP_W_BIDIRECTIONAL }; /* Callback for header lookup for HEADER, which is the name of a diff --git a/libcpp/init.c b/libcpp/init.c index 5a424e23553..f9a8f5f088f 100644 --- a/libcpp/init.c +++ b/libcpp/init.c @@ -223,6 +223,7 @@ cpp_create_reader (enum c_lang lang, cpp_hash_table *table, = ENABLE_CANONICAL_SYSTEM_HEADERS; CPP_OPTION (pfile, ext_numeric_literals) = 1; CPP_OPTION (pfile, warn_date_time) = 0; + CPP_OPTION (pfile, cpp_warn_bidirectional) = bidirectional_unpaired; /* Default CPP arithmetic to something sensible for the host for the benefit of dumb users like fix-header. */ diff --git a/libcpp/lex.c b/libcpp/lex.c index fa2253d41c3..3fb518e202b 100644 --- a/libcpp/lex.c +++ b/libcpp/lex.c @@ -1164,6 +1164,300 @@ _cpp_process_line_notes (cpp_reader *pfile, int in_comment) } } +namespace bidi { + enum class kind { + NONE, LRE, RLE, LRO, RLO, LRI, RLI, FSI, PDF, PDI + }; + + /* All the UTF-8 encodings of bidi characters start with E2. */ + constexpr uchar utf8_start = 0xe2; + + /* A vector holding currently open bidi contexts. We use a char for + each context, its LSB is 1 if it represents a PDF context, 0 if it + represents a PDI context. The next bit is 1 if this context was open + by a bidi character written as a UCN, and 0 when it was UTF-8. */ + semi_embedded_vec <unsigned char, 16> vec; + + /* Close the whole comment/identifier/string literal/character constant + context. */ + void on_close () + { + vec.truncate (0); + } + + /* Pop the last element in the vector. */ + void pop () + { + unsigned int len = vec.count (); + gcc_checking_assert (len > 0); + vec.truncate (len - 1); + } + + /* Return the context of the Ith element. 
*/ + kind ctx_at (unsigned int i) + { + return (vec[i] & 1) ? kind::PDF : kind::PDI; + } + + /* Return which context is currently opened. */ + kind current_ctx () + { + unsigned int len = vec.count (); + if (len == 0) + return kind::NONE; + return ctx_at (len - 1); + } + + /* Return true if the current context comes from a UCN origin, that is, + the bidi char which started this bidi context was written as a UCN. */ + bool current_ctx_ucn_p () + { + unsigned int len = vec.count (); + gcc_checking_assert (len > 0); + return (vec[len - 1] >> 1) & 1; + } + + /* We've read a bidi char, update the current vector as necessary. */ + void on_char (kind k, bool ucn_p) + { + switch (k) + { + case kind::LRE: + case kind::RLE: + case kind::LRO: + case kind::RLO: + vec.push (ucn_p ? 3u : 1u); + break; + case kind::LRI: + case kind::RLI: + case kind::FSI: + vec.push (ucn_p ? 2u : 0u); + break; + /* PDF terminates the scope of the last LRE, RLE, LRO, or RLO + whose scope has not yet been terminated. */ + case kind::PDF: + if (current_ctx () == kind::PDF) + pop (); + break; + /* PDI terminates the scope of the last LRI, RLI, or FSI whose + scope has not yet been terminated, as well as the scopes of + any subsequent LREs, RLEs, LROs, or RLOs whose scopes have not + yet been terminated. */ + case kind::PDI: + for (int i = vec.count () - 1; i >= 0; --i) + if (ctx_at (i) == kind::PDI) + { + vec.truncate (i); + break; + } + break; + [[likely]] case kind::NONE: + break; + default: + abort (); + } + } + + /* Return a descriptive string for K. 
*/ + const char *to_str (kind k) + { + switch (k) + { + case kind::LRE: + return "U+202A (LEFT-TO-RIGHT EMBEDDING)"; + case kind::RLE: + return "U+202B (RIGHT-TO-LEFT EMBEDDING)"; + case kind::LRO: + return "U+202D (LEFT-TO-RIGHT OVERRIDE)"; + case kind::RLO: + return "U+202E (RIGHT-TO-LEFT OVERRIDE)"; + case kind::LRI: + return "U+2066 (LEFT-TO-RIGHT ISOLATE)"; + case kind::RLI: + return "U+2067 (RIGHT-TO-LEFT ISOLATE)"; + case kind::FSI: + return "U+2068 (FIRST STRONG ISOLATE)"; + case kind::PDF: + return "U+202C (POP DIRECTIONAL FORMATTING)"; + case kind::PDI: + return "U+2069 (POP DIRECTIONAL ISOLATE)"; + default: + abort (); + } + } +} + +/* Parse a sequence of 3 bytes starting with P and return its bidi code. */ + +static bidi::kind +get_bidi_utf8 (const unsigned char *const p) +{ + gcc_checking_assert (p[0] == bidi::utf8_start); + + if (p[1] == 0x80) + switch (p[2]) + { + case 0xaa: + return bidi::kind::LRE; + case 0xab: + return bidi::kind::RLE; + case 0xac: + return bidi::kind::PDF; + case 0xad: + return bidi::kind::LRO; + case 0xae: + return bidi::kind::RLO; + default: + break; + } + else if (p[1] == 0x81) + switch (p[2]) + { + case 0xa6: + return bidi::kind::LRI; + case 0xa7: + return bidi::kind::RLI; + case 0xa8: + return bidi::kind::FSI; + case 0xa9: + return bidi::kind::PDI; + default: + break; + } + + return bidi::kind::NONE; +} + +/* Parse a UCN where P points just past \u or \U and return its bidi code. */ + +static bidi::kind +get_bidi_ucn (const unsigned char *p, bool is_U) +{ + /* 6.4.3 Universal Character Names + \u hex-quad + \U hex-quad hex-quad + where \unnnn means \U0000nnnn. */ + + if (is_U) + { + if (p[0] != '0' || p[1] != '0' || p[2] != '0' || p[3] != '0') + return bidi::kind::NONE; + /* Skip 4B so we can treat \u and \U the same below. */ + p += 4; + } + + /* All code points we are looking for start with 20xx. 
*/ + if (p[0] != '2' || p[1] != '0') + return bidi::kind::NONE; + else if (p[2] == '2') + switch (p[3]) + { + case 'a': + case 'A': + return bidi::kind::LRE; + case 'b': + case 'B': + return bidi::kind::RLE; + case 'c': + case 'C': + return bidi::kind::PDF; + case 'd': + case 'D': + return bidi::kind::LRO; + case 'e': + case 'E': + return bidi::kind::RLO; + default: + break; + } + else if (p[2] == '6') + switch (p[3]) + { + case '6': + return bidi::kind::LRI; + case '7': + return bidi::kind::RLI; + case '8': + return bidi::kind::FSI; + case '9': + return bidi::kind::PDI; + default: + break; + } + + return bidi::kind::NONE; +} + +/* We're closing a bidi context, that is, we've encountered a newline, + are closing a C-style comment, or are at the end of a string literal, + character constant, or identifier. Warn if this context was not + properly terminated by a PDI or PDF. P points to the last character + in this context. */ + +static void +maybe_warn_bidi_on_close (cpp_reader *pfile, const uchar *p) +{ + if (CPP_OPTION (pfile, cpp_warn_bidirectional) == bidirectional_unpaired + && bidi::vec.count () > 0) + { + const location_t loc + = linemap_position_for_column (pfile->line_table, + CPP_BUF_COLUMN (pfile->buffer, p)); + cpp_warning_with_line (pfile, CPP_W_BIDIRECTIONAL, loc, 0, + "unpaired UTF-8 bidirectional character " + "detected"); + } + /* We're done with this context. */ + bidi::on_close (); +} + +/* We're at the beginning or in the middle of an identifier/comment/string + literal/character constant. Warn if we've encountered a bidi character. + KIND says which bidi character it was; P points to it in the character + stream. UCN_P is true iff this bidi character was written as a UCN. 
*/ + +static void +maybe_warn_bidi_on_char (cpp_reader *pfile, const uchar *p, bidi::kind kind, + bool ucn_p) +{ + if (__builtin_expect (kind == bidi::kind::NONE, 1)) + return; + + const auto warn_bidi = CPP_OPTION (pfile, cpp_warn_bidirectional); + + if (warn_bidi != bidirectional_none) + { + const location_t loc + = linemap_position_for_column (pfile->line_table, + CPP_BUF_COLUMN (pfile->buffer, p)); + /* It seems excessive to warn about a PDI/PDF that is closing + an opened context because we've already warned about the + opening character. Except warn when we have a UCN x UTF-8 + mismatch. */ + if (kind == bidi::current_ctx ()) + { + if (warn_bidi == bidirectional_unpaired + && bidi::current_ctx_ucn_p () != ucn_p) + cpp_warning_with_line (pfile, CPP_W_BIDIRECTIONAL, loc, 0, + "UTF-8 vs UCN mismatch when closing " + "a context by \"%s\"", bidi::to_str (kind)); + } + else if (warn_bidi == bidirectional_any) + { + if (kind == bidi::kind::PDF || kind == bidi::kind::PDI) + cpp_warning_with_line (pfile, CPP_W_BIDIRECTIONAL, loc, 0, + "\"%s\" is closing an unopened context", + bidi::to_str (kind)); + else + cpp_warning_with_line (pfile, CPP_W_BIDIRECTIONAL, loc, 0, + "found problematic Unicode character \"%s\"", + bidi::to_str (kind)); + } + } + /* We're done with this context. */ + bidi::on_char (kind, ucn_p); +} + /* Skip a C-style block comment. We find the end of the comment by seeing if an asterisk is before every '/' we encounter. Returns nonzero if comment terminated by EOF, zero otherwise. 
@@ -1175,7 +1469,8 @@ _cpp_skip_block_comment (cpp_reader *pfile) cpp_buffer *buffer = pfile->buffer; const uchar *cur = buffer->cur; uchar c; - + const bool warn_bidi_p = (CPP_OPTION (pfile, cpp_warn_bidirectional) + != bidirectional_none); cur++; if (*cur == '/') cur++; @@ -1189,7 +1484,11 @@ _cpp_skip_block_comment (cpp_reader *pfile) if (c == '/') { if (cur[-2] == '*') - break; + { + if (warn_bidi_p) + maybe_warn_bidi_on_close (pfile, cur); + break; + } /* Warn about potential nested comments, but not if the '/' comes immediately before the true comment delimiter. @@ -1208,6 +1507,8 @@ _cpp_skip_block_comment (cpp_reader *pfile) { unsigned int cols; buffer->cur = cur - 1; + if (warn_bidi_p) + maybe_warn_bidi_on_close (pfile, cur); _cpp_process_line_notes (pfile, true); if (buffer->next_line >= buffer->rlimit) return true; @@ -1218,6 +1519,13 @@ _cpp_skip_block_comment (cpp_reader *pfile) cur = buffer->cur; } + /* If this is a beginning of a UTF-8 encoding, it might be + a bidirectional character. 
*/ + else if (__builtin_expect (c == bidi::utf8_start, 0) && warn_bidi_p) + { + bidi::kind kind = get_bidi_utf8 (cur - 1); + maybe_warn_bidi_on_char (pfile, cur, kind, /*ucn_p=*/false); + } } buffer->cur = cur; @@ -1233,9 +1541,32 @@ skip_line_comment (cpp_reader *pfile) { cpp_buffer *buffer = pfile->buffer; location_t orig_line = pfile->line_table->highest_line; + const bool warn_bidi_p = (CPP_OPTION (pfile, cpp_warn_bidirectional) + != bidirectional_none); - while (*buffer->cur != '\n') - buffer->cur++; + if (!warn_bidi_p) + while (*buffer->cur != '\n') + buffer->cur++; + else + { + while (*buffer->cur != '\n' + && *buffer->cur != bidi::utf8_start) + buffer->cur++; + if (__builtin_expect (*buffer->cur == bidi::utf8_start, 0)) + { + while (*buffer->cur != '\n') + { + if (__builtin_expect (*buffer->cur == bidi::utf8_start, 0)) + { + bidi::kind kind = get_bidi_utf8 (buffer->cur); + maybe_warn_bidi_on_char (pfile, buffer->cur, kind, + /*ucn_p=*/false); + } + buffer->cur++; + } + maybe_warn_bidi_on_close (pfile, buffer->cur); + } + } _cpp_process_line_notes (pfile, true); return orig_line != pfile->line_table->highest_line; @@ -1346,11 +1677,14 @@ static const cppchar_t utf8_signifier = 0xC0; /* Returns TRUE if the sequence starting at buffer->cur is valid in an identifier. FIRST is TRUE if this starts an identifier. 
*/ + static bool forms_identifier_p (cpp_reader *pfile, int first, struct normalize_state *state) { cpp_buffer *buffer = pfile->buffer; + const bool warn_bidi_p = (CPP_OPTION (pfile, cpp_warn_bidirectional) + != bidirectional_none); if (*buffer->cur == '$') { @@ -1373,6 +1707,13 @@ forms_identifier_p (cpp_reader *pfile, int first, cppchar_t s; if (*buffer->cur >= utf8_signifier) { + if (__builtin_expect (*buffer->cur == bidi::utf8_start, 0) + && warn_bidi_p) + { + bidi::kind kind = get_bidi_utf8 (buffer->cur); + maybe_warn_bidi_on_char (pfile, buffer->cur, kind, + /*ucn_p=*/false); + } if (_cpp_valid_utf8 (pfile, &buffer->cur, buffer->rlimit, 1 + !first, state, &s)) return true; @@ -1381,6 +1722,13 @@ forms_identifier_p (cpp_reader *pfile, int first, && (buffer->cur[1] == 'u' || buffer->cur[1] == 'U')) { buffer->cur += 2; + if (warn_bidi_p) + { + bidi::kind kind = get_bidi_ucn (buffer->cur, + buffer->cur[-1] == 'U'); + maybe_warn_bidi_on_char (pfile, buffer->cur, kind, + /*ucn_p=*/true); + } if (_cpp_valid_ucn (pfile, &buffer->cur, buffer->rlimit, 1 + !first, state, &s, NULL, NULL)) return true; @@ -1489,6 +1837,8 @@ lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn, const uchar *cur; unsigned int len; unsigned int hash = HT_HASHSTEP (0, *base); + const bool warn_bidi_p = (CPP_OPTION (pfile, cpp_warn_bidirectional) + != bidirectional_none); cur = pfile->buffer->cur; if (! starts_ucn) @@ -1505,13 +1855,17 @@ lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn, { /* Slower version for identifiers containing UCNs or extended chars (including $). 
*/ - do { - while (ISIDNUM (*pfile->buffer->cur)) - { - NORMALIZE_STATE_UPDATE_IDNUM (nst, *pfile->buffer->cur); - pfile->buffer->cur++; - } - } while (forms_identifier_p (pfile, false, nst)); + do + { + while (ISIDNUM (*pfile->buffer->cur)) + { + NORMALIZE_STATE_UPDATE_IDNUM (nst, *pfile->buffer->cur); + pfile->buffer->cur++; + } + } + while (forms_identifier_p (pfile, false, nst)); + if (warn_bidi_p) + maybe_warn_bidi_on_close (pfile, pfile->buffer->cur); result = _cpp_interpret_identifier (pfile, base, pfile->buffer->cur - base); *spelling = cpp_lookup (pfile, base, pfile->buffer->cur - base); @@ -1758,6 +2112,8 @@ static void lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base) { const uchar *pos = base; + const bool warn_bidi_p = (CPP_OPTION (pfile, cpp_warn_bidirectional) + != bidirectional_none); /* 'tis a pity this information isn't passed down from the lexer's initial categorization of the token. */ @@ -1994,8 +2350,15 @@ lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base) pos = base = pfile->buffer->cur; note = &pfile->buffer->notes[pfile->buffer->cur_note]; } + else if (__builtin_expect ((unsigned char) c == bidi::utf8_start, 0) + && warn_bidi_p) + maybe_warn_bidi_on_char (pfile, pos - 1, get_bidi_utf8 (pos - 1), + /*ucn_p=*/false); } + if (warn_bidi_p) + maybe_warn_bidi_on_close (pfile, pos); + if (CPP_OPTION (pfile, user_literals)) { /* If a string format macro, say from inttypes.h, is placed touching @@ -2090,15 +2453,28 @@ lex_string (cpp_reader *pfile, cpp_token *token, const uchar *base) else terminator = '>', type = CPP_HEADER_NAME; + const bool warn_bidi_p = (CPP_OPTION (pfile, cpp_warn_bidirectional) + != bidirectional_none); for (;;) { cppchar_t c = *cur++; /* In #include-style directives, terminators are not escapable. 
*/ if (c == '\\' && !pfile->state.angled_headers && *cur != '\n') - cur++; + { + if ((cur[0] == 'u' || cur[0] == 'U') && warn_bidi_p) + { + bidi::kind kind = get_bidi_ucn (cur + 1, cur[0] == 'U'); + maybe_warn_bidi_on_char (pfile, cur, kind, /*ucn_p=*/true); + } + cur++; + } else if (c == terminator) - break; + { + if (warn_bidi_p) + maybe_warn_bidi_on_close (pfile, cur - 1); + break; + } else if (c == '\n') { cur--; @@ -2115,6 +2491,11 @@ lex_string (cpp_reader *pfile, cpp_token *token, const uchar *base) } else if (c == '\0') saw_NUL = true; + else if (__builtin_expect (c == bidi::utf8_start, 0) && warn_bidi_p) + { + bidi::kind kind = get_bidi_utf8 (cur - 1); + maybe_warn_bidi_on_char (pfile, cur - 1, kind, /*ucn_p=*/false); + } } if (saw_NUL && !pfile->state.skipping) base-commit: 82ec4cb3c43c7429be6b902d96770a6435fa068b -- 2.33.1 ^ permalink raw reply [flat|nested] 27+ messages in thread
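For readers following the thread, the nesting rules implemented by bidi::on_char in the patch above can be modelled in isolation. The following is only a sketch, not libcpp code: it assumes a plain std::vector in place of semi_embedded_vec, drops the UCN/UTF-8 origin bit, and the names bidi_tracker and unpaired_p are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <vector>

// Hypothetical stand-alone model of the patch's bidi context tracking.
enum class kind { NONE, LRE, RLE, LRO, RLO, LRI, RLI, FSI, PDF, PDI };

class bidi_tracker
{
public:
  // Feed one bidi control character into the tracker.
  void on_char (kind k)
  {
    switch (k)
      {
      case kind::LRE: case kind::RLE: case kind::LRO: case kind::RLO:
        m_open.push_back (kind::PDF);  // this context is closed by a PDF
        break;
      case kind::LRI: case kind::RLI: case kind::FSI:
        m_open.push_back (kind::PDI);  // this context is closed by a PDI
        break;
      case kind::PDF:
        // PDF only closes the innermost context, and only if that
        // context was opened by an embedding or override.
        if (!m_open.empty () && m_open.back () == kind::PDF)
          m_open.pop_back ();
        break;
      case kind::PDI:
        // PDI closes the innermost isolate plus anything opened after it.
        for (std::size_t i = m_open.size (); i-- > 0; )
          if (m_open[i] == kind::PDI)
            {
              m_open.resize (i);
              break;
            }
        break;
      case kind::NONE:
        break;
      }
  }

  // End of a comment/identifier/literal: report whether anything was
  // left open, then reset for the next context.
  bool close ()
  {
    const bool unpaired = !m_open.empty ();
    m_open.clear ();
    return unpaired;
  }

private:
  std::vector<kind> m_open;
};

// Convenience wrapper (hypothetical): run a whole sequence through a
// fresh tracker and say whether it would count as unpaired.
inline bool unpaired_p (std::initializer_list<kind> seq)
{
  bidi_tracker t;
  for (kind k : seq)
    t.on_char (k);
  return t.close ();
}
```

Under this model, "LRE PDF" and "PDF LRE PDF" come out as paired while "LRE LRE PDF", "PDF LRE" and "LRI PDF" come out as unpaired, which agrees with the dg-warning expectations in the tests quoted in the patch.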
* Re: [PATCH] libcpp: Implement -Wbidi-chars for CVE-2021-42574 [PR103026]
  2021-11-15 17:28 ` [PATCH] libcpp: Implement -Wbidi-chars " Marek Polacek
@ 2021-11-15 23:15   ` David Malcolm
  2021-11-16 19:50     ` [PATCH v2] " Marek Polacek
  2021-11-30  8:38   ` [PATCH] libcpp: Implement -Wbidi-chars for CVE-2021-42574 [PR103026] Stephan Bergmann
  1 sibling, 1 reply; 27+ messages in thread
From: David Malcolm @ 2021-11-15 23:15 UTC (permalink / raw)
  To: Marek Polacek, Joseph Myers; +Cc: Jakub Jelinek, Martin Sebor, GCC Patches

> On Mon, Nov 08, 2021 at 04:33:43PM -0500, Marek Polacek wrote:
> > Ping, can we conclude on the name? IMHO, -Wbidirectional is just fine,
> > but changing the name is a trivial operation.
>
> Here's a patch with a better name (suggested by Jonathan W.). Otherwise no
> changes.

Thanks for implementing this.

>
> Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk?
>
> -- >8 --
> From a link below:
> "An issue was discovered in the Bidirectional Algorithm in the Unicode
> Specification through 14.0. It permits the visual reordering of
> characters via control sequences, which can be used to craft source code
> that renders different logic than the logical ordering of tokens
> ingested by compilers and interpreters. Adversaries can leverage this to
> encode source code for compilers accepting Unicode such that targeted
> vulnerabilities are introduced invisibly to human reviewers."
>
> More info:
> https://nvd.nist.gov/vuln/detail/CVE-2021-42574
> https://trojansource.codes/
>
> This is not a compiler bug. However, to mitigate the problem, this patch
> implements -Wbidi-chars=[none|unpaired|any] to warn about possibly
> misleading Unicode bidirectional characters the preprocessor may encounter.
>
> The default is =unpaired, which warns about improperly terminated
> bidirectional characters; e.g. a LRE without its appertaining PDF. The

I like the default.

Wording nit: maybe use "corresponding" rather than "appertaining"; I
believe the latter has a sense that one is part of the other, when they
are more like peers.

> level =any warns about any use of bidirectional characters.

Terminology nit:
The patch is referring to "bidirectional characters", but I think the
term "bidirectional control characters" would be better.

For example, a passage of text containing both numbers and characters
in a right-to-left script could be considered "bidirectional", since
the numbers are written from left-to-right.

Specifically, the patch looks for these specific characters:
* U+202A LEFT-TO-RIGHT EMBEDDING
* U+202B RIGHT-TO-LEFT EMBEDDING
* U+202C POP DIRECTIONAL FORMATTING
* U+202D LEFT-TO-RIGHT OVERRIDE
* U+202E RIGHT-TO-LEFT OVERRIDE
* U+2066 LEFT-TO-RIGHT ISOLATE
* U+2067 RIGHT-TO-LEFT ISOLATE
* U+2068 FIRST STRONG ISOLATE
* U+2069 POP DIRECTIONAL ISOLATE

However, the following characters could also be considered as
"bidirectional control characters":
* U+200E LEFT-TO-RIGHT MARK (UTF-8: E2 80 8E)
* U+200F RIGHT-TO-LEFT MARK (UTF-8: E2 80 8F)
but aren't checked for in the patch. Should they be? I can imagine
ways in which they could be abused, so I think so.

[...snip...]

> diff --git a/gcc/c-family/c.opt b/gcc/c-family/c.opt
> index 06457ac739e..b047df0f125 100644
> --- a/gcc/c-family/c.opt
> +++ b/gcc/c-family/c.opt
> @@ -374,6 +374,30 @@ Wbad-function-cast
>  C ObjC Var(warn_bad_function_cast) Warning
>  Warn about casting functions to incompatible types.
>
> +Wbidi-chars
> +C ObjC C++ ObjC++ Warning Alias(Wbidi-chars=,any,none)
> +;
> +
> +Wbidi-chars=
> +C ObjC C++ ObjC++ RejectNegative Joined Warning CPP(cpp_warn_bidirectional) CppReason(CPP_W_BIDIRECTIONAL) Var(warn_bidirectional) Init(bidirectional_unpaired) Enum(cpp_bidirectional_level)
> +-Wbidi-chars=[none|unpaired|any] Warn about UTF-8 bidirectional characters.

"control characters"

[...snip...]

> > +@item -Wbidi-chars=@r{[}none@r{|}unpaired@r{|}any@r{]} > +@opindex Wbidi-chars= > +@opindex Wbidi-chars > +@opindex Wno-bidi-chars > +Warn about possibly misleading UTF-8 bidirectional characters in comments, (and here again) > +string literals, character constants, and identifiers. Such characters can > +change left-to-right writing direction into right-to-left (and vice versa), > +which can cause confusion between the logical order and visual order. This > +may be dangerous; for instance, it may seem that a piece of code is not > +commented out, whereas it in fact is. > + > +There are three levels of warning supported by GCC@. The default is > +@option{-Wbidi-chars=unpaired}, which warns about improperly terminated > +bidi contexts. @option{-Wbidi-chars=none} turns the warning off. > +@option{-Wbidi-chars=any} warns about any use of bidirectional characters. (and again) [...snip...] > diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-4.c b/gcc/testsuite/c-c++-common/Wbidi-chars-4.c > new file mode 100644 > index 00000000000..9fd4bc535ca > --- /dev/null > +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-4.c > @@ -0,0 +1,166 @@ > +/* PR preprocessor/103026 */ > +/* { dg-do compile } */ > +/* { dg-options "-Wbidi-chars=any -Wno-multichar -Wno-overflow" } */ > +/* Test all bidi chars in various contexts (identifiers, comments, > + string literals, character constants), both UCN and UTF-8. The bidi > + chars here are properly terminated, except for the character constants. 
*/
> +
> +/* a b c LRE 1 2 3 PDF x y z */
> +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */
> +/* a b c RLE 1 2 3 PDF x y z */
> +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */
> +/* a b c LRO 1 2 3 PDF x y z */
> +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */
> +/* a b c RLO 1 2 3 PDF x y z */
> +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */
> +/* a b c LRI 1 2 3 PDI x y z */
> +/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */
> +/* a b c RLI 1 2 3 PDI x y */
> +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */
> +/* a b c FSI 1 2 3 PDI x y z */
> +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */

AIUI the Unicode bidirectionality algorithm works at the line level,
and so each line in a block comment should be checked individually for
unclosed bidi control chars, rather than a block comment as a whole.
Hence I think the test case needs to have block comment test coverage
for:
- single line blocks
- first line of a multiline block comment
- middle line of a multiline block comment
- final line of a multiline block comment
but I think the patch as it stands is only checking for the first of
these four cases.

[...snip...]

> diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-5.c b/gcc/testsuite/c-c++-common/Wbidi-chars-5.c
> new file mode 100644
> index 00000000000..efb26309b68
> --- /dev/null
> +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-5.c
> @@ -0,0 +1,166 @@
> +/* PR preprocessor/103026 */
> +/* { dg-do compile } */
> +/* { dg-options "-Wbidi-chars=unpaired -Wno-multichar -Wno-overflow" } */
> +/* Test all bidi chars in various contexts (identifiers, comments,
> +   string literals, character constants), both UCN and UTF-8. The bidi
> +   chars here are properly terminated, except for the character constants. */

Similar comment as above re block comments

[...snip...]

> diff --git a/libcpp/include/cpplib.h b/libcpp/include/cpplib.h > index 176f8c5bbce..60cf08ddd35 100644 > --- a/libcpp/include/cpplib.h > +++ b/libcpp/include/cpplib.h > @@ -319,6 +319,17 @@ enum cpp_main_search > CMS_system, /* Search the system INCLUDE path. */ > }; > > +/* The possible bidirectional characters checking levels, from least > + restrictive to most. */ > +enum cpp_bidirectional_level { > + /* No checking. */ > + bidirectional_none, > + /* Only detect unpaired uses of bidirectional characters. */ > + bidirectional_unpaired, > + /* Detect any use of bidirectional characters. */ > + bidirectional_any > +}; As before, "control characters". [...snip...] > @@ -539,6 +550,10 @@ struct cpp_options > /* True if warn about differences between C++98 and C++11. */ > bool cpp_warn_cxx11_compat; > > + /* Nonzero of bidirectional characters checking is on. See enum s/of/if/ and usual nit about "control characters". > + cpp_bidirectional_level. */ > + unsigned char cpp_warn_bidirectional; > + > /* Dependency generation. */ > struct > { [...snip...] > diff --git a/libcpp/lex.c b/libcpp/lex.c > index fa2253d41c3..3fb518e202b 100644 > --- a/libcpp/lex.c > +++ b/libcpp/lex.c > @@ -1164,6 +1164,300 @@ _cpp_process_line_notes (cpp_reader *pfile, int in_comment) > } > } > > +namespace bidi { > + enum class kind { > + NONE, LRE, RLE, LRO, RLO, LRI, RLI, FSI, PDF, PDI > + }; > + > + /* All the UTF-8 encodings of bidi characters start with E2. */ > + constexpr uchar utf8_start = 0xe2; Is there a difference between "constexpr" vs "const" here? (sorry for my ignorance) > + > + /* A vector holding currently open bidi contexts. We use a char for > + each context, its LSB is 1 if it represents a PDF context, 0 if it > + represents a PDI context. The next bit is 1 if this context was open > + by a bidi character written as a UCN, and 0 when it was UTF-8. 
*/
> +  semi_embedded_vec <unsigned char, 16> vec;
> +
> +  /* Close the whole comment/identifier/string literal/character constant
> +     context. */
> +  void on_close ()
> +  {
> +    vec.truncate (0);
> +  }
> +
> +  /* Pop the last element in the vector. */
> +  void pop ()
> +  {
> +    unsigned int len = vec.count ();
> +    gcc_checking_assert (len > 0);
> +    vec.truncate (len - 1);
> +  }
> +
> +  /* Return the context of the Ith element. */
> +  kind ctx_at (unsigned int i)
> +  {
> +    return (vec[i] & 1) ? kind::PDF : kind::PDI;
> +  }
> +
> +  /* Return which context is currently opened. */
> +  kind current_ctx ()
> +  {
> +    unsigned int len = vec.count ();
> +    if (len == 0)
> +      return kind::NONE;
> +    return ctx_at (len - 1);
> +  }
> +
> +  /* Return true if the current context comes from a UCN origin, that is,
> +     the bidi char which started this bidi context was written as a UCN. */
> +  bool current_ctx_ucn_p ()
> +  {
> +    unsigned int len = vec.count ();
> +    gcc_checking_assert (len > 0);
> +    return (vec[len - 1] >> 1) & 1;
> +  }
> +
> +  /* We've read a bidi char, update the current vector as necessary. */
> +  void on_char (kind k, bool ucn_p)
> +  {
> +    switch (k)
> +      {
> +      case kind::LRE:
> +      case kind::RLE:
> +      case kind::LRO:
> +      case kind::RLO:
> +        vec.push (ucn_p ? 3u : 1u);
> +        break;
> +      case kind::LRI:
> +      case kind::RLI:
> +      case kind::FSI:
> +        vec.push (ucn_p ? 2u : 0u);
> +        break;

I don't like the hand-coded bit fields here, where bit 1 and bit 2 in
the above have special meaning, but aren't clearly labelled as such.

Please can you at least use some kind of constant/define to make clear
the meaning of the bits. Even clearer would be bitfields; is there a
performance reason for not using them? (though this code is only
called on bidi control chars, which presumably is a rare occurrence).

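To make the suggestion concrete, labelled bitfields for the two flags might look roughly like the sketch below. This is an illustration only and is not taken from either patch in this thread: the struct name context, its member names, and the helper legacy_bits are all hypothetical, and legacy_bits exists purely to show how the fields map onto the 0u/1u/2u/3u values pushed in the quoted code.

```cpp
#include <cassert>

// Hypothetical replacement for the raw 'unsigned char' vector entries.
struct context
{
  context (bool pdf_p, bool ucn_p)
    : m_pdf_p (pdf_p), m_ucn_p (ucn_p)
  {
  }

  // 1 if this context is closed by PDF (opened by LRE/RLE/LRO/RLO),
  // 0 if it is closed by PDI (opened by LRI/RLI/FSI).
  unsigned m_pdf_p : 1;

  // 1 if the opening character was written as a UCN, 0 if it was UTF-8.
  unsigned m_ucn_p : 1;
};

// For comparison only: the hand-coded value the quoted patch would have
// pushed for the same information (bit 0 = PDF context, bit 1 = UCN).
inline unsigned char legacy_bits (const context &c)
{
  return static_cast<unsigned char> ((c.m_ucn_p << 1) | c.m_pdf_p);
}
```

With something like this, a call such as vec.push (ucn_p ? 3u : 1u) would become a push of context (/*pdf_p=*/true, ucn_p), and the readers ctx_at and current_ctx_ucn_p would test a named field instead of shifting and masking.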
My patch here:
  "[PATCH 2/2] Capture locations of bidi chars and underline ranges"
    https://gcc.gnu.org/pipermail/gcc-patches/2021-November/583160.html
did some refactoring of this patch, replacing hand-coded bit
manipulation with bitfields in a struct (as well as then using that as
a good place to stash location_t values, and then using these
locations).

Would it be helpful if I split that part of my patch out?

[...snip...]

> +/* Parse a sequence of 3 bytes starting with P and return its bidi code. */
> +
> +static bidi::kind
> +get_bidi_utf8 (const unsigned char *const p)
> +{
> +  gcc_checking_assert (p[0] == bidi::utf8_start);
> +
> +  if (p[1] == 0x80)
> +    switch (p[2])

get_bidi_utf8 accesses up to 2 bytes beyond "p"...

[...snip...]

...and is called in various places such as...

> @@ -1218,6 +1519,13 @@ _cpp_skip_block_comment (cpp_reader *pfile)
>
>            cur = buffer->cur;
>          }
> +      /* If this is a beginning of a UTF-8 encoding, it might be
> +         a bidirectional character. */
> +      else if (__builtin_expect (c == bidi::utf8_start, 0) && warn_bidi_p)
> +        {
> +          bidi::kind kind = get_bidi_utf8 (cur - 1);
> +          maybe_warn_bidi_on_char (pfile, cur, kind, /*ucn_p=*/false);
> +        }

Are we guaranteed to have a '\n' at the end of the buffer? (even for
the final line of the file) That would ensure that we don't read past
the end of the buffer.

Can we have testcases involving malformed UTF-8, in which, say:
- the final byte of the input file is 0xe2
- the final two bytes of the input file are 0xe2 0x80
for each of block comment, C++-style comment, string-literal,
identifier, etc? (or is that overkill?)

[...snip...]

> @@ -1505,13 +1855,17 @@ lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn,
>      {
>        /* Slower version for identifiers containing UCNs
>           or extended chars (including $).
*/
> -          do {
> -            while (ISIDNUM (*pfile->buffer->cur))
> -              {
> -                NORMALIZE_STATE_UPDATE_IDNUM (nst, *pfile->buffer->cur);
> -                pfile->buffer->cur++;
> -              }
> -          } while (forms_identifier_p (pfile, false, nst));
> +          do
> +            {
> +              while (ISIDNUM (*pfile->buffer->cur))
> +                {
> +                  NORMALIZE_STATE_UPDATE_IDNUM (nst, *pfile->buffer->cur);
> +                  pfile->buffer->cur++;
> +                }
> +            }
> +          while (forms_identifier_p (pfile, false, nst));

Is the above purely a whitespace change?

> +      if (warn_bidi_p)
> +        maybe_warn_bidi_on_close (pfile, pfile->buffer->cur);
>        result = _cpp_interpret_identifier (pfile, base,
>                                            pfile->buffer->cur - base);
>        *spelling = cpp_lookup (pfile, base, pfile->buffer->cur - base);
> @@ -1758,6 +2112,8 @@ static void
>  lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base)
>  {
>    const uchar *pos = base;
> +  const bool warn_bidi_p = (CPP_OPTION (pfile, cpp_warn_bidirectional)
> +                            != bidirectional_none);

There are lots of places where the patch uses:

  const bool warn_bidi_p = (CPP_OPTION (pfile, cpp_warn_bidirectional)
> +                            != bidirectional_none);

Maybe make it an inline member function:

   bool warn_bidi_p () const
   {
     return CPP_OPTION (this, cpp_warn_bidirectional) != bidirectional_none;
   }

so that these can all be:

   const bool warn_bidi_p = pfile->warn_bidi_p ();

?

[...snip...]

Hope this is constructive; thanks again for the patch
Dave

^ permalink raw reply [flat|nested] 27+ messages in thread
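On the question raised in the review above about get_bidi_utf8 reading up to two bytes past its argument: one way to make the classifier safe regardless of how the buffer is terminated is to pass it an explicit end pointer. The sketch below is only an assumption about how that could look. It is not what the patch does, the name classify_bidi_utf8 and its limit parameter are hypothetical, and it does not settle whether libcpp's buffers already guarantee a trailing newline.

```cpp
#include <cassert>
#include <cstddef>

enum class bidi_kind { NONE, LRE, RLE, LRO, RLO, LRI, RLI, FSI, PDF, PDI };

// Hypothetical bounds-checked variant of get_bidi_utf8.  P points at the
// candidate lead byte; LIMIT points one past the last readable byte.
inline bidi_kind
classify_bidi_utf8 (const unsigned char *p, const unsigned char *limit)
{
  // All nine characters are three-byte sequences starting with 0xE2, so
  // a truncated sequence can never be one of them.
  if (limit - p < 3 || p[0] != 0xe2)
    return bidi_kind::NONE;

  if (p[1] == 0x80)
    switch (p[2])
      {
      case 0xaa: return bidi_kind::LRE;  // U+202A
      case 0xab: return bidi_kind::RLE;  // U+202B
      case 0xac: return bidi_kind::PDF;  // U+202C
      case 0xad: return bidi_kind::LRO;  // U+202D
      case 0xae: return bidi_kind::RLO;  // U+202E
      default: break;
      }
  else if (p[1] == 0x81)
    switch (p[2])
      {
      case 0xa6: return bidi_kind::LRI;  // U+2066
      case 0xa7: return bidi_kind::RLI;  // U+2067
      case 0xa8: return bidi_kind::FSI;  // U+2068
      case 0xa9: return bidi_kind::PDI;  // U+2069
      default: break;
      }

  return bidi_kind::NONE;
}
```

The two malformed inputs mentioned in the review (a file ending in 0xe2, or in 0xe2 0x80) both fall into the early return here, so neither would be classified as a bidi control character nor cause a read beyond LIMIT.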
* [PATCH v2] libcpp: Implement -Wbidi-chars for CVE-2021-42574 [PR103026]
  2021-11-15 23:15 ` David Malcolm
@ 2021-11-16 19:50 ` Marek Polacek
  2021-11-16 23:00 ` David Malcolm
  0 siblings, 1 reply; 27+ messages in thread
From: Marek Polacek @ 2021-11-16 19:50 UTC
To: David Malcolm; +Cc: Joseph Myers, Jakub Jelinek, Martin Sebor, GCC Patches

On Mon, Nov 15, 2021 at 06:15:40PM -0500, David Malcolm wrote:
> > On Mon, Nov 08, 2021 at 04:33:43PM -0500, Marek Polacek wrote:
> > > Ping, can we conclude on the name? IMHO, -Wbidirectional is just fine,
> > > but changing the name is a trivial operation.
> >
> > Here's a patch with a better name (suggested by Jonathan W.). Otherwise no
> > changes.
>
> Thanks for implementing this.
>
> >
> > Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk?
> >
> > -- >8 --
> > From a link below:
> > "An issue was discovered in the Bidirectional Algorithm in the Unicode
> > Specification through 14.0. It permits the visual reordering of
> > characters via control sequences, which can be used to craft source code
> > that renders different logic than the logical ordering of tokens
> > ingested by compilers and interpreters. Adversaries can leverage this to
> > encode source code for compilers accepting Unicode such that targeted
> > vulnerabilities are introduced invisibly to human reviewers."
> >
> > More info:
> > https://nvd.nist.gov/vuln/detail/CVE-2021-42574
> > https://trojansource.codes/
> >
> > This is not a compiler bug. However, to mitigate the problem, this patch
> > implements -Wbidi-chars=[none|unpaired|any] to warn about possibly
> > misleading Unicode bidirectional characters the preprocessor may encounter.
> >
> > The default is =unpaired, which warns about improperly terminated
> > bidirectional characters; e.g. a LRE without its appertaining PDF. The
>
> I like the default.

Great.

> Wording nit: maybe use "corresponding" rather than "appertaining"; I
> believe the latter has a sense that one is part of the other, when they
> are more like peers.

OK, fixed.

> > level =any warns about any use of bidirectional characters.
>
> Terminology nit:
> The patch is referring to "bidirectional characters", but I think the
> term "bidirectional control characters" would be better.

Adjusted.

> For example, a passage of text containing both numbers and characters
> in a right-to-left script could be considered "bidirectional", since
> the numbers are written from left-to-right.
>
> Specifically, the patch looks for these specific characters:
> * U+202A LEFT-TO-RIGHT EMBEDDING
> * U+202B RIGHT-TO-LEFT EMBEDDING
> * U+202C POP DIRECTIONAL FORMATTING
> * U+202D LEFT-TO-RIGHT OVERRIDE
> * U+202E RIGHT-TO-LEFT OVERRIDE
> * U+2066 LEFT-TO-RIGHT ISOLATE
> * U+2067 RIGHT-TO-LEFT ISOLATE
> * U+2068 FIRST STRONG ISOLATE
> * U+2069 POP DIRECTIONAL ISOLATE
>
> However, the following characters could also be considered as
> "bidirectional control characters":
> * U+200E LEFT-TO-RIGHT MARK (UTF-8: E2 80 8E)
> * U+200F RIGHT-TO-LEFT MARK (UTF-8: E2 80 8F)
> but aren't checked for in the patch. Should they be? I can imagine
> ways in which they could be abused, so I think so.

I'd only intended to check the bidi chars described in the original
trojan source pdf, but I added checking for U+200E/U+200F too, since it
was easy enough. AFAIK they aren't popped by a PDF/PDI like the rest,
so don't need to go on the vec, and so we only warn with =any.
Tests: Wbidi-chars-16.c + Wbidi-chars-17.c

> [...snip...]
>
> > diff --git a/gcc/c-family/c.opt b/gcc/c-family/c.opt
> > index 06457ac739e..b047df0f125 100644
> > --- a/gcc/c-family/c.opt
> > +++ b/gcc/c-family/c.opt
> > @@ -374,6 +374,30 @@ Wbad-function-cast
> >  C ObjC Var(warn_bad_function_cast) Warning
> >  Warn about casting functions to incompatible types.
> >
> > +Wbidi-chars
> > +C ObjC C++ ObjC++ Warning Alias(Wbidi-chars=,any,none)
> > +;
> > +
> > +Wbidi-chars=
> > +C ObjC C++ ObjC++ RejectNegative Joined Warning CPP(cpp_warn_bidirectional) CppReason(CPP_W_BIDIRECTIONAL) Var(warn_bidirectional) Init(bidirectional_unpaired) Enum(cpp_bidirectional_level)
> > +-Wbidi-chars=[none|unpaired|any] Warn about UTF-8 bidirectional characters.
>
> "control characters"

Fixed.

> [...snip...]
>
> >
> > +@item -Wbidi-chars=@r{[}none@r{|}unpaired@r{|}any@r{]}
> > +@opindex Wbidi-chars=
> > +@opindex Wbidi-chars
> > +@opindex Wno-bidi-chars
> > +Warn about possibly misleading UTF-8 bidirectional characters in comments,
>
> (and here again)

Fixed.

> > +string literals, character constants, and identifiers. Such characters can
> > +change left-to-right writing direction into right-to-left (and vice versa),
> > +which can cause confusion between the logical order and visual order. This
> > +may be dangerous; for instance, it may seem that a piece of code is not
> > +commented out, whereas it in fact is.
> > +
> > +There are three levels of warning supported by GCC@. The default is
> > +@option{-Wbidi-chars=unpaired}, which warns about improperly terminated
> > +bidi contexts. @option{-Wbidi-chars=none} turns the warning off.
> > +@option{-Wbidi-chars=any} warns about any use of bidirectional characters.
>
> (and again)

Fixed.

> [...snip...]
> >
> > diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-4.c b/gcc/testsuite/c-c++-common/Wbidi-chars-4.c
> > new file mode 100644
> > index 00000000000..9fd4bc535ca
> > --- /dev/null
> > +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-4.c
> > @@ -0,0 +1,166 @@
> > +/* PR preprocessor/103026 */
> > +/* { dg-do compile } */
> > +/* { dg-options "-Wbidi-chars=any -Wno-multichar -Wno-overflow" } */
> > +/* Test all bidi chars in various contexts (identifiers, comments,
> > +   string literals, character constants), both UCN and UTF-8.
The bidi > > + chars here are properly terminated, except for the character constants. */ > > + > > +/* a b c LRE 1 2 3 PDF x y z */ > > +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ > > +/* a b c RLE 1 2 3 PDF x y z */ > > +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ > > +/* a b c LRO 1 2 3 PDF x y z */ > > +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ > > +/* a b c RLO 1 2 3 PDF x y z */ > > +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ > > +/* a b c LRI 1 2 3 PDI x y z */ > > +/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */ > > +/* a b c RLI 1 2 3 PDI x y */ > > +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */ > > +/* a b c FSI 1 2 3 PDI x y z */ > > +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */ > > AIUI the Unicode bidirectionality algorithm works at the line level, > and so each line in a block comment should be checked individually for > unclossed bidi control chars, rather than a block comment as a whole. > Hence I think the test case needs to have block comment test coverage > for: > - single line blocks > - first line of a multiline block comment > - middle line of a multiline block comment > - final line of a multiline block comment > but I think the patch as it stands is only checking for the first of > these four cases. The patch handles all of them, because of: 1534 if (warn_bidi_p) 1535 maybe_warn_bidi_on_close (pfile, cur); in _cpp_skip_block_comment, but I was lacking some more testing, so I've added some testing, and included a new test: Wbidi-chars-15.c. 
> > diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-5.c b/gcc/testsuite/c-c++-common/Wbidi-chars-5.c
> > new file mode 100644
> > index 00000000000..efb26309b68
> > --- /dev/null
> > +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-5.c
> > @@ -0,0 +1,166 @@
> > +/* PR preprocessor/103026 */
> > +/* { dg-do compile } */
> > +/* { dg-options "-Wbidi-chars=unpaired -Wno-multichar -Wno-overflow" } */
> > +/* Test all bidi chars in various contexts (identifiers, comments,
> > +   string literals, character constants), both UCN and UTF-8. The bidi
> > +   chars here are properly terminated, except for the character constants. */
>
> Similar comment as above re block comments

I've extended the test too.

> [...snip...]
>
> > diff --git a/libcpp/include/cpplib.h b/libcpp/include/cpplib.h
> > index 176f8c5bbce..60cf08ddd35 100644
> > --- a/libcpp/include/cpplib.h
> > +++ b/libcpp/include/cpplib.h
> > @@ -319,6 +319,17 @@ enum cpp_main_search
> >    CMS_system, /* Search the system INCLUDE path. */
> >  };
> >
> > +/* The possible bidirectional characters checking levels, from least
> > +   restrictive to most. */
> > +enum cpp_bidirectional_level {
> > +  /* No checking. */
> > +  bidirectional_none,
> > +  /* Only detect unpaired uses of bidirectional characters. */
> > +  bidirectional_unpaired,
> > +  /* Detect any use of bidirectional characters. */
> > +  bidirectional_any
> > +};
>
> As before, "control characters".

Fixed.

> [...snip...]
>
> > @@ -539,6 +550,10 @@ struct cpp_options
> >    /* True if warn about differences between C++98 and C++11. */
> >    bool cpp_warn_cxx11_compat;
> >
> > +  /* Nonzero of bidirectional characters checking is on. See enum
>
> s/of/if/
>
> and usual nit about "control characters".

Fixed.

> > +     cpp_bidirectional_level. */
> > +  unsigned char cpp_warn_bidirectional;
> > +
> >    /* Dependency generation. */
> >    struct
> >    {
>
> [...snip...]
>
> > diff --git a/libcpp/lex.c b/libcpp/lex.c
> > index fa2253d41c3..3fb518e202b 100644
> > --- a/libcpp/lex.c
> > +++ b/libcpp/lex.c
> > @@ -1164,6 +1164,300 @@ _cpp_process_line_notes (cpp_reader *pfile, int in_comment)
> >      }
> >  }
> >
> > +namespace bidi {
> > +  enum class kind {
> > +    NONE, LRE, RLE, LRO, RLO, LRI, RLI, FSI, PDF, PDI
> > +  };
> > +
> > +  /* All the UTF-8 encodings of bidi characters start with E2. */
> > +  constexpr uchar utf8_start = 0xe2;
>
> Is there a difference between "constexpr" vs "const" here? (sorry for
> my ignorance)

I just wanted to make sure that utf8_start will be usable in contexts
where an integral constant expression is required. 'const' objects need
not be initialized with compile-time constants, but 'constexpr' objects
do.

> > +
> > +  /* A vector holding currently open bidi contexts. We use a char for
> > +     each context, its LSB is 1 if it represents a PDF context, 0 if it
> > +     represents a PDI context. The next bit is 1 if this context was open
> > +     by a bidi character written as a UCN, and 0 when it was UTF-8. */
> > +  semi_embedded_vec <unsigned char, 16> vec;
> > +
> > +  /* Close the whole comment/identifier/string literal/character constant
> > +     context. */
> > +  void on_close ()
> > +  {
> > +    vec.truncate (0);
> > +  }
> > +
> > +  /* Pop the last element in the vector. */
> > +  void pop ()
> > +  {
> > +    unsigned int len = vec.count ();
> > +    gcc_checking_assert (len > 0);
> > +    vec.truncate (len - 1);
> > +  }
> > +
> > +  /* Return the context of the Ith element. */
> > +  kind ctx_at (unsigned int i)
> > +  {
> > +    return (vec[i] & 1) ? kind::PDF : kind::PDI;
> > +  }
> > +
> > +  /* Return which context is currently opened.
> > +  */
> > +  kind current_ctx ()
> > +  {
> > +    unsigned int len = vec.count ();
> > +    if (len == 0)
> > +      return kind::NONE;
> > +    return ctx_at (len - 1);
> > +  }
> > +
> > +  /* Return true if the current context comes from a UCN origin, that is,
> > +     the bidi char which started this bidi context was written as a UCN. */
> > +  bool current_ctx_ucn_p ()
> > +  {
> > +    unsigned int len = vec.count ();
> > +    gcc_checking_assert (len > 0);
> > +    return (vec[len - 1] >> 1) & 1;
> > +  }
> > +
> > +  /* We've read a bidi char, update the current vector as necessary. */
> > +  void on_char (kind k, bool ucn_p)
> > +  {
> > +    switch (k)
> > +      {
> > +      case kind::LRE:
> > +      case kind::RLE:
> > +      case kind::LRO:
> > +      case kind::RLO:
> > +        vec.push (ucn_p ? 3u : 1u);
> > +        break;
> > +      case kind::LRI:
> > +      case kind::RLI:
> > +      case kind::FSI:
> > +        vec.push (ucn_p ? 2u : 0u);
> > +        break;
>
> I don't like the hand-coded bit fields here, where bit 1 and bit 2 in
> the above have special meaning, but aren't clearly labelled as such.
>
> Please can you at least use some kind of constant/define to make clear
> the meaning of the bits. Even clearer would be bitfields; is there a
> performance reason for not using them? (though this code is only
> called on bidi control chars, which presumably is a rare occurrence).
> My patch here:
> "[PATCH 2/2] Capture locations of bidi chars and underline ranges"
> https://gcc.gnu.org/pipermail/gcc-patches/2021-November/583160.html
> did some refactoring of this patch, replacing hand-coded bit
> manipulation with bitfields in a struct (as well as then using that as
> a good place to stash location_t values, and then using these
> locations).
> Would it be helpful if I split that part of my patch out?

I think they are just fine here, given they are used only in bidi:: and
not outside of it. And I could just use a simple unsigned char in
semi_embedded_vec instead of inventing a new struct.
Your diagnostic patch changes it because you need to remember a location
too, so we're changing it anyway, so I left it be so that you have fewer
conflicts.

> [...snip...]
>
> > +/* Parse a sequence of 3 bytes starting with P and return its bidi code. */
> > +
> > +static bidi::kind
> > +get_bidi_utf8 (const unsigned char *const p)
> > +{
> > +  gcc_checking_assert (p[0] == bidi::utf8_start);
> > +
> > +  if (p[1] == 0x80)
> > +    switch (p[2])
>
> get_bidi_utf8 accesses up to 2 bytes beyond "p"...
>
> [...snip...]
>
> ...and is called in various places such as...
>
> > @@ -1218,6 +1519,13 @@ _cpp_skip_block_comment (cpp_reader *pfile)
> >
> >            cur = buffer->cur;
> >          }
> > +      /* If this is a beginning of a UTF-8 encoding, it might be
> > +         a bidirectional character. */
> > +      else if (__builtin_expect (c == bidi::utf8_start, 0) && warn_bidi_p)
> > +        {
> > +          bidi::kind kind = get_bidi_utf8 (cur - 1);
> > +          maybe_warn_bidi_on_char (pfile, cur, kind, /*ucn_p=*/false);
> > +        }
>
> Are we guaranteed to have a '\n' at the end of the buffer? (even for
> the final line of the file) That would ensure that we don't read past
> the end of the buffer.

We've discussed this in our internal thread; I think I said that you
will always get a '\n' so this is not going to read past the end of the
buffer. To be sure...

> Can we have testcases involving malformed UTF-8, in which, say:
> - the final byte of the input file is 0xe2
> - the final two bytes of the input file are 0xe2 0x80
> for each of block comment, C++-style comment, string-literal,
> identifier, etc?
> (or is that overkill?)

...I'd crafted a malformed text file using hexedit but couldn't get it
to crash. I'd rather not include it in the testsuite though.

> [...snip...]
>
> > @@ -1505,13 +1855,17 @@ lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn,
> >      {
> >        /* Slower version for identifiers containing UCNs
> >           or extended chars (including $).
> >          */
> > -      do {
> > -        while (ISIDNUM (*pfile->buffer->cur))
> > -          {
> > -            NORMALIZE_STATE_UPDATE_IDNUM (nst, *pfile->buffer->cur);
> > -            pfile->buffer->cur++;
> > -          }
> > -      } while (forms_identifier_p (pfile, false, nst));
> > +      do
> > +        {
> > +          while (ISIDNUM (*pfile->buffer->cur))
> > +            {
> > +              NORMALIZE_STATE_UPDATE_IDNUM (nst, *pfile->buffer->cur);
> > +              pfile->buffer->cur++;
> > +            }
> > +        }
> > +      while (forms_identifier_p (pfile, false, nst));
>
> Is the above purely a whitespace change?

Yes.

> > +      if (warn_bidi_p)
> > +        maybe_warn_bidi_on_close (pfile, pfile->buffer->cur);
> >        result = _cpp_interpret_identifier (pfile, base,
> >                                            pfile->buffer->cur - base);
> >        *spelling = cpp_lookup (pfile, base, pfile->buffer->cur - base);
> > @@ -1758,6 +2112,8 @@ static void
> >  lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base)
> >  {
> >    const uchar *pos = base;
> > +  const bool warn_bidi_p = (CPP_OPTION (pfile, cpp_warn_bidirectional)
> > +                            != bidirectional_none);
>
> There are lots of places where the patch uses:
>
>   const bool warn_bidi_p = (CPP_OPTION (pfile, cpp_warn_bidirectional)
>                             != bidirectional_none);
>
> Maybe make it an inline member function:
>
>   bool warn_bidi_p () const
>   {
>     return CPP_OPTION (this, cpp_warn_bidirectional) != bidirectional_none;
>   }
>
> so that these can all be:
>
>   const bool warn_bidi_p = pfile->warn_bidi_p ();
>
> ?

OK, I guess pfile->warn_bidi_p () is visually more appealing. Since
adding a member function won't change the PODness of cpp_reader, I went
with your suggestion and added warn_bidi_p.

> Hope this is constructive; thanks again for the patch

Thanks a lot for taking a look!

Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk?

-- >8 --
From a link below:
"An issue was discovered in the Bidirectional Algorithm in the Unicode
Specification through 14.0.
It permits the visual reordering of characters via control sequences,
which can be used to craft source code that renders different logic
than the logical ordering of tokens ingested by compilers and
interpreters. Adversaries can leverage this to encode source code for
compilers accepting Unicode such that targeted vulnerabilities are
introduced invisibly to human reviewers."

More info:
https://nvd.nist.gov/vuln/detail/CVE-2021-42574
https://trojansource.codes/

This is not a compiler bug. However, to mitigate the problem, this patch
implements -Wbidi-chars=[none|unpaired|any] to warn about possibly
misleading Unicode bidirectional control characters the preprocessor may
encounter.

The default is =unpaired, which warns about improperly terminated
bidirectional control characters; e.g. a LRE without its corresponding
PDF. The level =any warns about any use of bidirectional control
characters.

This patch handles both UCNs and UTF-8 characters. UCNs designating
bidi characters in identifiers are accepted since r204886. Then r217144
enabled -fextended-identifiers by default. Extended characters in C/C++
identifiers have been accepted since r275979. However, this patch still
warns about mixing UTF-8 and UCN bidi characters; there seems to be no
good reason to allow mixing them.

We warn in different contexts: comments (both C and C++-style), string
literals, character constants, and identifiers. Expectedly, UCNs are
ignored in comments and raw string literals. The bidirectional control
characters can nest so this patch handles that as well.

I have not included nor tested this at all with Fortran (which also has
string literals and line comments).

Dave M. posted patches improving diagnostic involving Unicode
characters. This patch does not make use of this new infrastructure yet.

	PR preprocessor/103026

gcc/c-family/ChangeLog:

	* c.opt (Wbidi-chars, Wbidi-chars=): New option.

gcc/ChangeLog:

	* doc/invoke.texi: Document -Wbidi-chars.

libcpp/ChangeLog:

	* include/cpplib.h (enum cpp_bidirectional_level): New.
	(struct cpp_options): Add cpp_warn_bidirectional.
	(enum cpp_warning_reason): Add CPP_W_BIDIRECTIONAL.
	* internal.h (struct cpp_reader): Add warn_bidi_p member
	function.
	* init.c (cpp_create_reader): Set cpp_warn_bidirectional.
	* lex.c (bidi): New namespace.
	(get_bidi_utf8): New function.
	(get_bidi_ucn): Likewise.
	(maybe_warn_bidi_on_close): Likewise.
	(maybe_warn_bidi_on_char): Likewise.
	(_cpp_skip_block_comment): Implement warning about bidirectional
	control characters.
	(skip_line_comment): Likewise.
	(forms_identifier_p): Likewise.
	(lex_identifier): Likewise.
	(lex_string): Likewise.
	(lex_raw_string): Likewise.

gcc/testsuite/ChangeLog:

	* c-c++-common/Wbidi-chars-1.c: New test.
	* c-c++-common/Wbidi-chars-2.c: New test.
	* c-c++-common/Wbidi-chars-3.c: New test.
	* c-c++-common/Wbidi-chars-4.c: New test.
	* c-c++-common/Wbidi-chars-5.c: New test.
	* c-c++-common/Wbidi-chars-6.c: New test.
	* c-c++-common/Wbidi-chars-7.c: New test.
	* c-c++-common/Wbidi-chars-8.c: New test.
	* c-c++-common/Wbidi-chars-9.c: New test.
	* c-c++-common/Wbidi-chars-10.c: New test.
	* c-c++-common/Wbidi-chars-11.c: New test.
	* c-c++-common/Wbidi-chars-12.c: New test.
	* c-c++-common/Wbidi-chars-13.c: New test.
	* c-c++-common/Wbidi-chars-14.c: New test.
	* c-c++-common/Wbidi-chars-15.c: New test.
	* c-c++-common/Wbidi-chars-16.c: New test.
	* c-c++-common/Wbidi-chars-17.c: New test.
--- gcc/c-family/c.opt | 24 ++ gcc/doc/invoke.texi | 21 +- gcc/testsuite/c-c++-common/Wbidi-chars-1.c | 12 + gcc/testsuite/c-c++-common/Wbidi-chars-10.c | 27 ++ gcc/testsuite/c-c++-common/Wbidi-chars-11.c | 13 + gcc/testsuite/c-c++-common/Wbidi-chars-12.c | 19 + gcc/testsuite/c-c++-common/Wbidi-chars-13.c | 17 + gcc/testsuite/c-c++-common/Wbidi-chars-14.c | 38 ++ gcc/testsuite/c-c++-common/Wbidi-chars-15.c | 33 ++ gcc/testsuite/c-c++-common/Wbidi-chars-16.c | 26 ++ gcc/testsuite/c-c++-common/Wbidi-chars-17.c | 30 ++ gcc/testsuite/c-c++-common/Wbidi-chars-2.c | 9 + gcc/testsuite/c-c++-common/Wbidi-chars-3.c | 11 + gcc/testsuite/c-c++-common/Wbidi-chars-4.c | 188 +++++++++ gcc/testsuite/c-c++-common/Wbidi-chars-5.c | 188 +++++++++ gcc/testsuite/c-c++-common/Wbidi-chars-6.c | 155 +++++++ gcc/testsuite/c-c++-common/Wbidi-chars-7.c | 9 + gcc/testsuite/c-c++-common/Wbidi-chars-8.c | 13 + gcc/testsuite/c-c++-common/Wbidi-chars-9.c | 29 ++ libcpp/include/cpplib.h | 18 +- libcpp/init.c | 1 + libcpp/internal.h | 7 + libcpp/lex.c | 424 +++++++++++++++++++- 23 files changed, 1298 insertions(+), 14 deletions(-) create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-1.c create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-10.c create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-11.c create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-12.c create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-13.c create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-14.c create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-15.c create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-16.c create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-17.c create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-2.c create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-3.c create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-4.c create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-5.c create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-6.c create 
mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-7.c create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-8.c create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-9.c diff --git a/gcc/c-family/c.opt b/gcc/c-family/c.opt index 8a4cd634f77..3976fc368db 100644 --- a/gcc/c-family/c.opt +++ b/gcc/c-family/c.opt @@ -374,6 +374,30 @@ Wbad-function-cast C ObjC Var(warn_bad_function_cast) Warning Warn about casting functions to incompatible types. +Wbidi-chars +C ObjC C++ ObjC++ Warning Alias(Wbidi-chars=,any,none) +; + +Wbidi-chars= +C ObjC C++ ObjC++ RejectNegative Joined Warning CPP(cpp_warn_bidirectional) CppReason(CPP_W_BIDIRECTIONAL) Var(warn_bidirectional) Init(bidirectional_unpaired) Enum(cpp_bidirectional_level) +-Wbidi-chars=[none|unpaired|any] Warn about UTF-8 bidirectional control characters. + +; Required for these enum values. +SourceInclude +cpplib.h + +Enum +Name(cpp_bidirectional_level) Type(int) UnknownError(argument %qs to %<-Wbidi-chars%> not recognized) + +EnumValue +Enum(cpp_bidirectional_level) String(none) Value(bidirectional_none) + +EnumValue +Enum(cpp_bidirectional_level) String(unpaired) Value(bidirectional_unpaired) + +EnumValue +Enum(cpp_bidirectional_level) String(any) Value(bidirectional_any) + Wbool-compare C ObjC C++ ObjC++ Var(warn_bool_compare) Warning LangEnabledBy(C ObjC C++ ObjC++,Wall) Warn about boolean expression compared with an integer value different from true/false. diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index 6070288856c..a22758d18ee 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -327,7 +327,9 @@ Objective-C and Objective-C++ Dialects}. 
-Warith-conversion @gol -Warray-bounds -Warray-bounds=@var{n} -Warray-compare @gol -Wno-attributes -Wattribute-alias=@var{n} -Wno-attribute-alias @gol --Wno-attribute-warning -Wbool-compare -Wbool-operation @gol +-Wno-attribute-warning @gol +-Wbidi-chars=@r{[}none@r{|}unpaired@r{|}any@r{]} @gol +-Wbool-compare -Wbool-operation @gol -Wno-builtin-declaration-mismatch @gol -Wno-builtin-macro-redefined -Wc90-c99-compat -Wc99-c11-compat @gol -Wc11-c2x-compat @gol @@ -7678,6 +7680,23 @@ Attributes considered include @code{alloc_align}, @code{alloc_size}, This is the default. You can disable these warnings with either @option{-Wno-attribute-alias} or @option{-Wattribute-alias=0}. +@item -Wbidi-chars=@r{[}none@r{|}unpaired@r{|}any@r{]} +@opindex Wbidi-chars= +@opindex Wbidi-chars +@opindex Wno-bidi-chars +Warn about possibly misleading UTF-8 bidirectional control characters in +comments, string literals, character constants, and identifiers. Such +characters can change left-to-right writing direction into right-to-left +(and vice versa), which can cause confusion between the logical order and +visual order. This may be dangerous; for instance, it may seem that a piece +of code is not commented out, whereas it in fact is. + +There are three levels of warning supported by GCC@. The default is +@option{-Wbidi-chars=unpaired}, which warns about improperly terminated +bidi contexts. @option{-Wbidi-chars=none} turns the warning off. +@option{-Wbidi-chars=any} warns about any use of bidirectional control +characters. 
+ @item -Wbool-compare @opindex Wno-bool-compare @opindex Wbool-compare diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-1.c b/gcc/testsuite/c-c++-common/Wbidi-chars-1.c new file mode 100644 index 00000000000..34f5ac19271 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-1.c @@ -0,0 +1,12 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ + +int main() { + int isAdmin = 0; + /* } if (isAdmin) begin admins only */ +/* { dg-warning "bidirectional" "" { target *-*-* } .-1 } */ + __builtin_printf("You are an admin.\n"); + /* end admins only { */ +/* { dg-warning "bidirectional" "" { target *-*-* } .-1 } */ + return 0; +} diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-10.c b/gcc/testsuite/c-c++-common/Wbidi-chars-10.c new file mode 100644 index 00000000000..3f851b69e65 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-10.c @@ -0,0 +1,27 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidi-chars=unpaired" } */ +/* More nesting testing. 
*/ + +/* RLE LRI PDF PDI*/ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int LRE_\u202a_PDF_\u202c; +int LRE_\u202a_PDF_\u202c_LRE_\u202a_PDF_\u202c; +int LRE_\u202a_LRI_\u2066_PDF_\u202c_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int RLE_\u202b_RLI_\u2067_PDF_\u202c_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int RLE_\u202b_RLI_\u2067_PDI_\u2069_PDF_\u202c; +int FSI_\u2068_LRO_\u202d_PDI_\u2069_PDF_\u202c; +int FSI_\u2068; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int FSI_\u2068_PDI_\u2069; +int FSI_\u2068_FSI_\u2068_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069; +int RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDF_\u202c; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_FSI_\u2068_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-11.c b/gcc/testsuite/c-c++-common/Wbidi-chars-11.c new file mode 100644 index 00000000000..270ce2368a9 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-11.c @@ -0,0 +1,13 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidi-chars=unpaired" } */ +/* Test that we warn when mixing UCN and UTF-8. 
*/ + +int LRE__PDF_\u202c; +/* { dg-warning "mismatch" "" { target *-*-* } .-1 } */ +int LRE_\u202a_PDF__; +/* { dg-warning "mismatch" "" { target *-*-* } .-1 } */ +const char *s1 = "LRE__PDF_\u202c"; +/* { dg-warning "mismatch" "" { target *-*-* } .-1 } */ +const char *s2 = "LRE_\u202a_PDF_"; +/* { dg-warning "mismatch" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-12.c b/gcc/testsuite/c-c++-common/Wbidi-chars-12.c new file mode 100644 index 00000000000..b07eec1da91 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-12.c @@ -0,0 +1,19 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile { target { c || c++11 } } } */ +/* { dg-options "-Wbidi-chars=any" } */ +/* Test raw strings. */ + +const char *s1 = R"(a b c LRE 1 2 3 PDF x y z)"; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ +const char *s2 = R"(a b c RLE 1 2 3 PDF x y z)"; +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ +const char *s3 = R"(a b c LRO 1 2 3 PDF x y z)"; +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ +const char *s4 = R"(a b c RLO 1 2 3 PDF x y z)"; +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ +const char *s7 = R"(a b c FSI 1 2 3 PDI x y) z"; +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */ +const char *s8 = R"(a b c PDI x y )z"; +/* { dg-warning "U\\+2069" "" { target *-*-* } .-1 } */ +const char *s9 = R"(a b c PDF x y z)"; +/* { dg-warning "U\\+202C" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-13.c b/gcc/testsuite/c-c++-common/Wbidi-chars-13.c new file mode 100644 index 00000000000..b2dd9fde752 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-13.c @@ -0,0 +1,17 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile { target { c || c++11 } } } */ +/* { dg-options "-Wbidi-chars=unpaired" } */ +/* Test raw strings. 
*/ + +const char *s1 = R"(a b c LRE 1 2 3)"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +const char *s2 = R"(a b c RLE 1 2 3)"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +const char *s3 = R"(a b c LRO 1 2 3)"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +const char *s4 = R"(a b c FSI 1 2 3)"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +const char *s5 = R"(a b c LRI 1 2 3)"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +const char *s6 = R"(a b c RLI 1 2 3)"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-14.c b/gcc/testsuite/c-c++-common/Wbidi-chars-14.c new file mode 100644 index 00000000000..ba5f75d9553 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-14.c @@ -0,0 +1,38 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidi-chars=unpaired" } */ +/* Test PDI handling, which also pops any subsequent LREs, RLEs, LROs, + or RLOs. 
*/ + +/* LRI__LRI__RLE__RLE__RLE__PDI_*/ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +// LRI__RLE__RLE__RLE__PDI_ +// LRI__RLO__RLE__RLE__PDI_ +// LRI__RLO__RLE__PDI_ +// FSI__RLO__PDI_ +// FSI__FSI__RLO__PDI_ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + +int LRI_\u2066_LRI_\u2066_LRE_\u202a_LRE_\u202a_LRE_\u202a_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int LRI_\u2066_LRI_\u2066_LRE_\u202a_LRE_\u202a_LRE_\u202a_PDI_\u2069_PDI_\u2069; +int LRI_\u2066_LRI_\u2066_LRI_\u2066_LRE_\u202a_LRE_\u202a_LRE_\u202a_PDI_\u2069_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int PDI_\u2069; +int LRI_\u2066_PDI_\u2069; +int RLI_\u2067_PDI_\u2069; +int LRE_\u202a_LRI_\u2066_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int LRI_\u2066_LRE_\u202a_PDF_\u202c_PDI_\u2069; +int LRI_\u2066_LRE_\u202a_LRE_\u202a_PDF_\u202c_PDI_\u2069; +int RLI_\u2067_LRI_\u2066_LRE_\u202a_LRE_\u202a_PDF_\u202c_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int FSI_\u2068_LRI_\u2066_LRE_\u202a_LRE_\u202a_PDF_\u202c_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int RLO_\u202e_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int RLI_\u2067_PDI_\u2069_RLI_\u2067; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int FSI_\u2068_PDF_\u202c_PDI_\u2069; +int FSI_\u2068_FSI_\u2068_PDF_\u202c_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-15.c b/gcc/testsuite/c-c++-common/Wbidi-chars-15.c new file mode 100644 index 00000000000..68e65001a01 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-15.c @@ -0,0 +1,33 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidi-chars=unpaired" } */ +/* Test unpaired bidi control chars in multiline comments. 
*/
+
+/*
+ * LRE end
+ */
+/* { dg-warning "unpaired" "" { target *-*-* } .-2 } */
+/*
+ * RLE end
+ */
+/* { dg-warning "unpaired" "" { target *-*-* } .-2 } */
+/*
+ * LRO end
+ */
+/* { dg-warning "unpaired" "" { target *-*-* } .-2 } */
+/*
+ * RLO end
+ */
+/* { dg-warning "unpaired" "" { target *-*-* } .-2 } */
+/*
+ * LRI end
+ */
+/* { dg-warning "unpaired" "" { target *-*-* } .-2 } */
+/*
+ * RLI end
+ */
+/* { dg-warning "unpaired" "" { target *-*-* } .-2 } */
+/*
+ * FSI end
+ */
+/* { dg-warning "unpaired" "" { target *-*-* } .-2 } */
diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-16.c b/gcc/testsuite/c-c++-common/Wbidi-chars-16.c
new file mode 100644
index 00000000000..baa0159861c
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/Wbidi-chars-16.c
@@ -0,0 +1,26 @@
+/* PR preprocessor/103026 */
+/* { dg-do compile } */
+/* { dg-options "-Wbidi-chars=any" } */
+/* Test LTR/RTL chars. */
+
+/* LTR<> */
+/* { dg-warning "U\\+200E" "" { target *-*-* } .-1 } */
+// LTR<>
+/* { dg-warning "U\\+200E" "" { target *-*-* } .-1 } */
+/* RTL<> */
+/* { dg-warning "U\\+200F" "" { target *-*-* } .-1 } */
+// RTL<>
+/* { dg-warning "U\\+200F" "" { target *-*-* } .-1 } */
+
+const char *s1 = "LTR<>";
+/* { dg-warning "U\\+200E" "" { target *-*-* } .-1 } */
+const char *s2 = "LTR\u200e";
+/* { dg-warning "U\\+200E" "" { target *-*-* } .-1 } */
+const char *s3 = "LTR\u200E";
+/* { dg-warning "U\\+200E" "" { target *-*-* } .-1 } */
+const char *s4 = "RTL<>";
+/* { dg-warning "U\\+200F" "" { target *-*-* } .-1 } */
+const char *s5 = "RTL\u200f";
+/* { dg-warning "U\\+200F" "" { target *-*-* } .-1 } */
+const char *s6 = "RTL\u200F";
+/* { dg-warning "U\\+200F" "" { target *-*-* } .-1 } */
diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-17.c b/gcc/testsuite/c-c++-common/Wbidi-chars-17.c
new file mode 100644
index 00000000000..07cb4321f96
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/Wbidi-chars-17.c
@@ -0,0 +1,30 @@
+/* PR preprocessor/103026 */
+/* { dg-do compile } */
+/* { dg-options "-Wbidi-chars=unpaired" } */
+/* Test LTR/RTL chars. */
+
+/* LTR<> */
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+// LTR<>
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+/* RTL<> */
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+// RTL<>
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+int ltr_\u200e;
+/* { dg-error "universal character " "" { target *-*-* } .-1 } */
+int rtl_\u200f;
+/* { dg-error "universal character " "" { target *-*-* } .-1 } */
+
+const char *s1 = "LTR<>";
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+const char *s2 = "LTR\u200e";
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+const char *s3 = "LTR\u200E";
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+const char *s4 = "RTL<>";
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+const char *s5 = "RTL\u200f";
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+const char *s6 = "RTL\u200F";
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-2.c b/gcc/testsuite/c-c++-common/Wbidi-chars-2.c
new file mode 100644
index 00000000000..2340374f276
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/Wbidi-chars-2.c
@@ -0,0 +1,9 @@
+/* PR preprocessor/103026 */
+/* { dg-do compile } */
+
+int main() {
+    /* Say hello; newline/*/ return 0 ;
+/* { dg-warning "bidirectional" "" { target *-*-* } .-1 } */
+    __builtin_printf("Hello world.\n");
+    return 0;
+}
diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-3.c b/gcc/testsuite/c-c++-common/Wbidi-chars-3.c
new file mode 100644
index 00000000000..9dc7edb6e64
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/Wbidi-chars-3.c
@@ -0,0 +1,11 @@
+/* PR preprocessor/103026 */
+/* { dg-do compile } */
+
+int main() {
+    const char* access_level = "user";
+    if (__builtin_strcmp(access_level, "user // Check if admin ")) {
+/* { dg-warning "bidirectional" "" { target *-*-* } .-1 } */
+        __builtin_printf("You are an admin.\n");
+    }
+    return 0;
+}
diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-4.c b/gcc/testsuite/c-c++-common/Wbidi-chars-4.c
new file mode 100644
index 00000000000..639e5c62e88
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/Wbidi-chars-4.c
@@ -0,0 +1,188 @@
+/* PR preprocessor/103026 */
+/* { dg-do compile } */
+/* { dg-options "-Wbidi-chars=any -Wno-multichar -Wno-overflow" } */
+/* Test all bidi chars in various contexts (identifiers, comments,
+   string literals, character constants), both UCN and UTF-8. The bidi
+   chars here are properly terminated, except for the character constants. */
+
+/* a b c LRE 1 2 3 PDF x y z */
+/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */
+/* a b c RLE 1 2 3 PDF x y z */
+/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */
+/* a b c LRO 1 2 3 PDF x y z */
+/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */
+/* a b c RLO 1 2 3 PDF x y z */
+/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */
+/* a b c LRI 1 2 3 PDI x y z */
+/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */
+/* a b c RLI 1 2 3 PDI x y */
+/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */
+/* a b c FSI 1 2 3 PDI x y z */
+/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */
+
+/* Same but C++ comments instead. */
+// a b c LRE 1 2 3 PDF x y z
+/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */
+// a b c RLE 1 2 3 PDF x y z
+/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */
+// a b c LRO 1 2 3 PDF x y z
+/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */
+// a b c RLO 1 2 3 PDF x y z
+/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */
+// a b c LRI 1 2 3 PDI x y z
+/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */
+// a b c RLI 1 2 3 PDI x y
+/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */
+// a b c FSI 1 2 3 PDI x y z
+/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */
+
+/* Here we're closing an unopened context, warn when =any.
*/
+/* a b c PDI x y z */
+/* { dg-warning "U\\+2069" "" { target *-*-* } .-1 } */
+/* a b c PDF x y z */
+/* { dg-warning "U\\+202C" "" { target *-*-* } .-1 } */
+// a b c PDI x y z
+/* { dg-warning "U\\+2069" "" { target *-*-* } .-1 } */
+// a b c PDF x y z
+/* { dg-warning "U\\+202C" "" { target *-*-* } .-1 } */
+
+/* Multiline comments. */
+/* a b c PDI x y z
+   */
+/* { dg-warning "U\\+2069" "" { target *-*-* } .-2 } */
+/* a b c PDF x y z
+   */
+/* { dg-warning "U\\+202C" "" { target *-*-* } .-2 } */
+/* first
+   a b c PDI x y z
+   */
+/* { dg-warning "U\\+2069" "" { target *-*-* } .-2 } */
+/* first
+   a b c PDF x y z
+   */
+/* { dg-warning "U\\+202C" "" { target *-*-* } .-2 } */
+/* first
+   a b c PDI x y z */
+/* { dg-warning "U\\+2069" "" { target *-*-* } .-1 } */
+/* first
+   a b c PDF x y z */
+/* { dg-warning "U\\+202C" "" { target *-*-* } .-1 } */
+
+void
+g1 ()
+{
+  const char *s1 = "a b c LRE 1 2 3 PDF x y z";
+/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */
+  const char *s2 = "a b c RLE 1 2 3 PDF x y z";
+/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */
+  const char *s3 = "a b c LRO 1 2 3 PDF x y z";
+/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */
+  const char *s4 = "a b c RLO 1 2 3 PDF x y z";
+/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */
+  const char *s5 = "a b c LRI 1 2 3 PDI x y z";
+/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */
+  const char *s6 = "a b c RLI 1 2 3 PDI x y z";
+/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */
+  const char *s7 = "a b c FSI 1 2 3 PDI x y z";
+/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */
+  const char *s8 = "a b c PDI x y z";
+/* { dg-warning "U\\+2069" "" { target *-*-* } .-1 } */
+  const char *s9 = "a b c PDF x y z";
+/* { dg-warning "U\\+202C" "" { target *-*-* } .-1 } */
+
+  const char *s10 = "a b c LRE\u202a 1 2 3 PDF\u202c x y z";
+/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */
+  const char *s11 = "a b c LRE\u202A 1 2 3 PDF\u202c x y z";
+/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */
+  const char *s12 = "a b c RLE\u202b 1 2 3 PDF\u202c x y z";
+/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */
+  const char *s13 = "a b c RLE\u202B 1 2 3 PDF\u202c x y z";
+/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */
+  const char *s14 = "a b c LRO\u202d 1 2 3 PDF\u202c x y z";
+/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */
+  const char *s15 = "a b c LRO\u202D 1 2 3 PDF\u202c x y z";
+/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */
+  const char *s16 = "a b c RLO\u202e 1 2 3 PDF\u202c x y z";
+/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */
+  const char *s17 = "a b c RLO\u202E 1 2 3 PDF\u202c x y z";
+/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */
+  const char *s18 = "a b c LRI\u2066 1 2 3 PDI\u2069 x y z";
+/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */
+  const char *s19 = "a b c RLI\u2067 1 2 3 PDI\u2069 x y z";
+/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */
+  const char *s20 = "a b c FSI\u2068 1 2 3 PDI\u2069 x y z";
+/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */
+}
+
+void
+g2 ()
+{
+  const char c1 = '\u202a';
+/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */
+  const char c2 = '\u202A';
+/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */
+  const char c3 = '\u202b';
+/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */
+  const char c4 = '\u202B';
+/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */
+  const char c5 = '\u202d';
+/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */
+  const char c6 = '\u202D';
+/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */
+  const char c7 = '\u202e';
+/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */
+  const char c8 = '\u202E';
+/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */
+  const char c9 = '\u2066';
+/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */
+  const char c10 = '\u2067';
+/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */
+  const char c11 = '\u2068';
+/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */
+}
+
+int abc;
+/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */
+int abc;
+/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */
+int abc;
+/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */
+int abc;
+/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */
+int abc;
+/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */
+int abc;
+/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */
+int abc;
+/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */
+int AX;
+/* { dg-warning "U\\+202C" "" { target *-*-* } .-1 } */
+int A\u202cY;
+/* { dg-warning "U\\+202C" "" { target *-*-* } .-1 } */
+int A\u202CY2;
+/* { dg-warning "U\\+202C" "" { target *-*-* } .-1 } */
+
+int d\u202ae\u202cf;
+/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */
+int d\u202Ae\u202cf2;
+/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */
+int d\u202be\u202cf;
+/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */
+int d\u202Be\u202cf2;
+/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */
+int d\u202de\u202cf;
+/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */
+int d\u202De\u202cf2;
+/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */
+int d\u202ee\u202cf;
+/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */
+int d\u202Ee\u202cf2;
+/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */
+int d\u2066e\u2069f;
+/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */
+int d\u2067e\u2069f;
+/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */
+int d\u2068e\u2069f;
+/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */
+int X\u2069;
+/* { dg-warning "U\\+2069" "" { target *-*-* } .-1 } */
diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-5.c b/gcc/testsuite/c-c++-common/Wbidi-chars-5.c
new file mode 100644
index 00000000000..68cb053144b
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/Wbidi-chars-5.c
@@ -0,0 +1,188 @@
+/* PR preprocessor/103026 */
+/* { dg-do compile } */
+/* { dg-options "-Wbidi-chars=unpaired -Wno-multichar -Wno-overflow" } */
+/* Test all bidi chars in various contexts (identifiers, comments,
+   string literals, character constants), both UCN and UTF-8. The bidi
+   chars here are properly terminated, except for the character constants. */
+
+/* a b c LRE 1 2 3 PDF x y z */
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+/* a b c RLE 1 2 3 PDF x y z */
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+/* a b c LRO 1 2 3 PDF x y z */
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+/* a b c RLO 1 2 3 PDF x y z */
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+/* a b c LRI 1 2 3 PDI x y z */
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+/* a b c RLI 1 2 3 PDI x y */
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+/* a b c FSI 1 2 3 PDI x y z */
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+
+/* Same but C++ comments instead. */
+// a b c LRE 1 2 3 PDF x y z
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+// a b c RLE 1 2 3 PDF x y z
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+// a b c LRO 1 2 3 PDF x y z
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+// a b c RLO 1 2 3 PDF x y z
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+// a b c LRI 1 2 3 PDI x y z
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+// a b c RLI 1 2 3 PDI x y
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+// a b c FSI 1 2 3 PDI x y z
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+
+/* Here we're closing an unopened context, warn when =any. */
+/* a b c PDI x y z */
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+/* a b c PDF x y z */
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+// a b c PDI x y z
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+// a b c PDF x y z
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+
+/* Multiline comments.
*/
+/* a b c PDI x y z
+   */
+/* { dg-bogus "unpaired" "" { target *-*-* } .-2 } */
+/* a b c PDF x y z
+   */
+/* { dg-bogus "unpaired" "" { target *-*-* } .-2 } */
+/* first
+   a b c PDI x y z
+   */
+/* { dg-bogus "unpaired" "" { target *-*-* } .-2 } */
+/* first
+   a b c PDF x y z
+   */
+/* { dg-bogus "unpaired" "" { target *-*-* } .-2 } */
+/* first
+   a b c PDI x y z */
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+/* first
+   a b c PDF x y z */
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+
+void
+g1 ()
+{
+  const char *s1 = "a b c LRE 1 2 3 PDF x y z";
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+  const char *s2 = "a b c RLE 1 2 3 PDF x y z";
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+  const char *s3 = "a b c LRO 1 2 3 PDF x y z";
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+  const char *s4 = "a b c RLO 1 2 3 PDF x y z";
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+  const char *s5 = "a b c LRI 1 2 3 PDI x y z";
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+  const char *s6 = "a b c RLI 1 2 3 PDI x y z";
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+  const char *s7 = "a b c FSI 1 2 3 PDI x y z";
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+  const char *s8 = "a b c PDI x y z";
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+  const char *s9 = "a b c PDF x y z";
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+
+  const char *s10 = "a b c LRE\u202a 1 2 3 PDF\u202c x y z";
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+  const char *s11 = "a b c LRE\u202A 1 2 3 PDF\u202c x y z";
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+  const char *s12 = "a b c RLE\u202b 1 2 3 PDF\u202c x y z";
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+  const char *s13 = "a b c RLE\u202B 1 2 3 PDF\u202c x y z";
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+  const char *s14 = "a b c LRO\u202d 1 2 3 PDF\u202c x y z";
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+  const char *s15 = "a b c LRO\u202D 1 2 3 PDF\u202c x y z";
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+  const char *s16 = "a b c RLO\u202e 1 2 3 PDF\u202c x y z";
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+  const char *s17 = "a b c RLO\u202E 1 2 3 PDF\u202c x y z";
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+  const char *s18 = "a b c LRI\u2066 1 2 3 PDI\u2069 x y z";
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+  const char *s19 = "a b c RLI\u2067 1 2 3 PDI\u2069 x y z";
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+  const char *s20 = "a b c FSI\u2068 1 2 3 PDI\u2069 x y z";
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+}
+
+void
+g2 ()
+{
+  const char c1 = '\u202a';
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+  const char c2 = '\u202A';
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+  const char c3 = '\u202b';
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+  const char c4 = '\u202B';
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+  const char c5 = '\u202d';
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+  const char c6 = '\u202D';
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+  const char c7 = '\u202e';
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+  const char c8 = '\u202E';
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+  const char c9 = '\u2066';
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+  const char c10 = '\u2067';
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+  const char c11 = '\u2068';
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+}
+
+int abc;
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+int abc;
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+int abc;
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+int abc;
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+int abc;
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+int abc;
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+int abc;
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+int AX;
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+int A\u202cY;
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+int A\u202CY2;
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+
+int d\u202ae\u202cf;
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+int d\u202Ae\u202cf2;
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+int d\u202be\u202cf;
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+int d\u202Be\u202cf2;
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+int d\u202de\u202cf;
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+int d\u202De\u202cf2;
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+int d\u202ee\u202cf;
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+int d\u202Ee\u202cf2;
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+int d\u2066e\u2069f;
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+int d\u2067e\u2069f;
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+int d\u2068e\u2069f;
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
+int X\u2069;
+/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */
diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-6.c b/gcc/testsuite/c-c++-common/Wbidi-chars-6.c
new file mode 100644
index 00000000000..0ce6fff2dee
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/Wbidi-chars-6.c
@@ -0,0 +1,155 @@
+/* PR preprocessor/103026 */
+/* { dg-do compile } */
+/* { dg-options "-Wbidi-chars=unpaired" } */
+/* Test nesting of bidi chars in various contexts.
*/
+
+/* Terminated by the wrong char: */
+/* a b c LRE 1 2 3 PDI x y z */
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+/* a b c RLE 1 2 3 PDI x y z*/
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+/* a b c LRO 1 2 3 PDI x y z */
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+/* a b c RLO 1 2 3 PDI x y z */
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+/* a b c LRI 1 2 3 PDF x y z */
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+/* a b c RLI 1 2 3 PDF x y z */
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+/* a b c FSI 1 2 3 PDF x y z*/
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+
+/* LRE PDF */
+/* LRE LRE PDF PDF */
+/* PDF LRE PDF */
+/* LRE PDF LRE PDF */
+/* LRE LRE PDF */
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+/* PDF LRE */
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+
+// a b c LRE 1 2 3 PDI x y z
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+// a b c RLE 1 2 3 PDI x y z*/
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+// a b c LRO 1 2 3 PDI x y z
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+// a b c RLO 1 2 3 PDI x y z
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+// a b c LRI 1 2 3 PDF x y z
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+// a b c RLI 1 2 3 PDF x y z
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+// a b c FSI 1 2 3 PDF x y z
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+
+// LRE PDF
+// LRE LRE PDF PDF
+// PDF LRE PDF
+// LRE PDF LRE PDF
+// LRE LRE PDF
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+// PDF LRE
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+
+void
+g1 ()
+{
+  const char *s1 = "a b c LRE 1 2 3 PDI x y z";
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+  const char *s2 = "a b c LRE\u202a 1 2 3 PDI\u2069 x y z";
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+  const char *s3 = "a b c RLE 1 2 3 PDI x y ";
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+  const char *s4 = "a b c RLE\u202b 1 2 3 PDI\u2069 x y z";
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+  const char *s5 = "a b c LRO 1 2 3 PDI x y z";
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+  const char *s6 = "a b c LRO\u202d 1 2 3 PDI\u2069 x y z";
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+  const char *s7 = "a b c RLO 1 2 3 PDI x y z";
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+  const char *s8 = "a b c RLO\u202e 1 2 3 PDI\u2069 x y z";
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+  const char *s9 = "a b c LRI 1 2 3 PDF x y z";
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+  const char *s10 = "a b c LRI\u2066 1 2 3 PDF\u202c x y z";
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+  const char *s11 = "a b c RLI 1 2 3 PDF x y z\
+    ";
+/* { dg-warning "unpaired" "" { target *-*-* } .-2 } */
+  const char *s12 = "a b c RLI\u2067 1 2 3 PDF\u202c x y z";
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+  const char *s13 = "a b c FSI 1 2 3 PDF x y z";
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+  const char *s14 = "a b c FSI\u2068 1 2 3 PDF\u202c x y z";
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+  const char *s15 = "PDF LRE";
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+  const char *s16 = "PDF\u202c LRE\u202a";
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+  const char *s17 = "LRE PDF";
+  const char *s18 = "LRE\u202a PDF\u202c";
+  const char *s19 = "LRE LRE PDF PDF";
+  const char *s20 = "LRE\u202a LRE\u202a PDF\u202c PDF\u202c";
+  const char *s21 = "PDF LRE PDF";
+  const char *s22 = "PDF\u202c LRE\u202a PDF\u202c";
+  const char *s23 = "LRE LRE PDF";
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+  const char *s24 = "LRE\u202a LRE\u202a PDF\u202c";
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+  const char *s25 = "PDF LRE";
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+  const char *s26 = "PDF\u202c LRE\u202a";
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+  const char *s27 = "PDF LRE\u202a";
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+  const char *s28 = "PDF\u202c LRE";
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+}
+
+int aLREbPDI;
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+int A\u202aB\u2069C;
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+int aRLEbPDI;
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+int a\u202bB\u2069c;
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+int aLRObPDI;
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+int a\u202db\u2069c2;
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+int aRLObPDI;
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+int a\u202eb\u2069;
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+int aLRIbPDF;
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+int a\u2066b\u202c;
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+int aRLIbPDFc
+;
+/* { dg-warning "unpaired" "" { target *-*-* } .-2 } */
+int a\u2067b\u202c;
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+int aFSIbPDF;
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+int a\u2068b\u202c;
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+int aFSIbPD\u202C;
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+int aFSI\u2068bPDF_;
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+int aLREbPDFb;
+int A\u202aB\u202c;
+int a_LRE_LRE_b_PDF_PDF;
+int A\u202aA\u202aB\u202cB\u202c;
+int aPDFbLREadPDF;
+int a_\u202C_\u202a_\u202c;
+int a_LRE_b_PDF_c_LRE_PDF;
+int a_\u202a_\u202c_\u202a_\u202c_;
+int a_LRE_b_PDF_c_LRE;
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+int a_\u202a_\u202c_\u202a_;
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-7.c b/gcc/testsuite/c-c++-common/Wbidi-chars-7.c
new file mode 100644
index 00000000000..d012d420ec0
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/Wbidi-chars-7.c
@@ -0,0 +1,9 @@
+/* PR preprocessor/103026 */
+/* { dg-do compile } */
+/* { dg-options "-Wbidi-chars=any" } */
+/* Test we ignore UCNs in comments. */
+
+// a b c \u202a 1 2 3
+// a b c \u202A 1 2 3
+/* a b c \u202a 1 2 3 */
+/* a b c \u202A 1 2 3 */
diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-8.c b/gcc/testsuite/c-c++-common/Wbidi-chars-8.c
new file mode 100644
index 00000000000..4f54c5092ec
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/Wbidi-chars-8.c
@@ -0,0 +1,13 @@
+/* PR preprocessor/103026 */
+/* { dg-do compile } */
+/* { dg-options "-Wbidi-chars=any" } */
+/* Test \u vs \U. */
+
+int a_\u202A;
+/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */
+int a_\u202a_2;
+/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */
+int a_\U0000202A_3;
+/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */
+int a_\U0000202a_4;
+/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */
diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-9.c b/gcc/testsuite/c-c++-common/Wbidi-chars-9.c
new file mode 100644
index 00000000000..e2af1b1ca97
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/Wbidi-chars-9.c
@@ -0,0 +1,29 @@
+/* PR preprocessor/103026 */
+/* { dg-do compile } */
+/* { dg-options "-Wbidi-chars=unpaired" } */
+/* Test that we properly separate bidi contexts (comment/identifier/character
+   constant/string literal).
*/
+
+/* LRE -><- */ int pdf_\u202c_1;
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+/* RLE -><- */ int pdf_\u202c_2;
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+/* LRO -><- */ int pdf_\u202c_3;
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+/* RLO -><- */ int pdf_\u202c_4;
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+/* LRI -><-*/ int pdi_\u2069_1;
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+/* RLI -><- */ int pdi_\u2069_12;
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+/* FSI -><- */ int pdi_\u2069_3;
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+
+const char *s1 = "LRE\u202a"; /* PDF -><- */
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+/* LRE -><- */ const char *s2 = "PDF\u202c";
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+const char *s3 = "LRE\u202a"; int pdf_\u202c_5;
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
+int lre_\u202a; const char *s4 = "PDF\u202c";
+/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */
diff --git a/libcpp/include/cpplib.h b/libcpp/include/cpplib.h
index 176f8c5bbce..112b9c24751 100644
--- a/libcpp/include/cpplib.h
+++ b/libcpp/include/cpplib.h
@@ -319,6 +319,17 @@ enum cpp_main_search
   CMS_system, /* Search the system INCLUDE path. */
 };
 
+/* The possible bidirectional control characters checking levels, from least
+   restrictive to most. */
+enum cpp_bidirectional_level {
+  /* No checking. */
+  bidirectional_none,
+  /* Only detect unpaired uses of bidirectional control characters. */
+  bidirectional_unpaired,
+  /* Detect any use of bidirectional control characters. */
+  bidirectional_any
+};
+
 /* This structure is nested inside struct cpp_reader, and carries all
    the options visible to the command line. */
 struct cpp_options
@@ -539,6 +550,10 @@ struct cpp_options
   /* True if warn about differences between C++98 and C++11.
*/
   bool cpp_warn_cxx11_compat;
 
+  /* Nonzero if bidirectional control characters checking is on. See enum
+     cpp_bidirectional_level. */
+  unsigned char cpp_warn_bidirectional;
+
   /* Dependency generation. */
   struct
   {
@@ -643,7 +658,8 @@ enum cpp_warning_reason {
   CPP_W_C90_C99_COMPAT,
   CPP_W_C11_C2X_COMPAT,
   CPP_W_CXX11_COMPAT,
-  CPP_W_EXPANSION_TO_DEFINED
+  CPP_W_EXPANSION_TO_DEFINED,
+  CPP_W_BIDIRECTIONAL
 };
 
 /* Callback for header lookup for HEADER, which is the name of a
diff --git a/libcpp/init.c b/libcpp/init.c
index 5a424e23553..f9a8f5f088f 100644
--- a/libcpp/init.c
+++ b/libcpp/init.c
@@ -223,6 +223,7 @@ cpp_create_reader (enum c_lang lang, cpp_hash_table *table,
     = ENABLE_CANONICAL_SYSTEM_HEADERS;
   CPP_OPTION (pfile, ext_numeric_literals) = 1;
   CPP_OPTION (pfile, warn_date_time) = 0;
+  CPP_OPTION (pfile, cpp_warn_bidirectional) = bidirectional_unpaired;
 
   /* Default CPP arithmetic to something sensible for the host for the
      benefit of dumb users like fix-header. */
diff --git a/libcpp/internal.h b/libcpp/internal.h
index 8577cab6c83..0ce0246c5a2 100644
--- a/libcpp/internal.h
+++ b/libcpp/internal.h
@@ -597,6 +597,13 @@ struct cpp_reader
   /* Location identifying the main source file -- intended to be line
      zero of said file. */
   location_t main_loc;
+
+  /* Returns true iff we should warn about UTF-8 bidirectional control
+     characters. */
+  bool warn_bidi_p () const
+  {
+    return CPP_OPTION (this, cpp_warn_bidirectional) != bidirectional_none;
+  }
 };
 
 /* Character classes. Based on the more primitive macros in safe-ctype.h.
diff --git a/libcpp/lex.c b/libcpp/lex.c
index fa2253d41c3..f5907ce8f08 100644
--- a/libcpp/lex.c
+++ b/libcpp/lex.c
@@ -1164,6 +1164,324 @@ _cpp_process_line_notes (cpp_reader *pfile, int in_comment)
     }
 }
 
+namespace bidi {
+  enum class kind {
+    NONE, LRE, RLE, LRO, RLO, LRI, RLI, FSI, PDF, PDI, LTR, RTL
+  };
+
+  /* All the UTF-8 encodings of bidi characters start with E2.
*/
+  constexpr uchar utf8_start = 0xe2;
+
+  /* A vector holding currently open bidi contexts. We use a char for
+     each context, its LSB is 1 if it represents a PDF context, 0 if it
+     represents a PDI context. The next bit is 1 if this context was open
+     by a bidi character written as a UCN, and 0 when it was UTF-8. */
+  semi_embedded_vec <unsigned char, 16> vec;
+
+  /* Close the whole comment/identifier/string literal/character constant
+     context. */
+  void on_close ()
+  {
+    vec.truncate (0);
+  }
+
+  /* Pop the last element in the vector. */
+  void pop ()
+  {
+    unsigned int len = vec.count ();
+    gcc_checking_assert (len > 0);
+    vec.truncate (len - 1);
+  }
+
+  /* Return the context of the Ith element. */
+  kind ctx_at (unsigned int i)
+  {
+    return (vec[i] & 1) ? kind::PDF : kind::PDI;
+  }
+
+  /* Return which context is currently opened. */
+  kind current_ctx ()
+  {
+    unsigned int len = vec.count ();
+    if (len == 0)
+      return kind::NONE;
+    return ctx_at (len - 1);
+  }
+
+  /* Return true if the current context comes from a UCN origin, that is,
+     the bidi char which started this bidi context was written as a UCN. */
+  bool current_ctx_ucn_p ()
+  {
+    unsigned int len = vec.count ();
+    gcc_checking_assert (len > 0);
+    return (vec[len - 1] >> 1) & 1;
+  }
+
+  /* We've read a bidi char, update the current vector as necessary. */
+  void on_char (kind k, bool ucn_p)
+  {
+    switch (k)
+      {
+      case kind::LRE:
+      case kind::RLE:
+      case kind::LRO:
+      case kind::RLO:
+        vec.push (ucn_p ? 3u : 1u);
+        break;
+      case kind::LRI:
+      case kind::RLI:
+      case kind::FSI:
+        vec.push (ucn_p ? 2u : 0u);
+        break;
+      /* PDF terminates the scope of the last LRE, RLE, LRO, or RLO
+         whose scope has not yet been terminated. */
+      case kind::PDF:
+        if (current_ctx () == kind::PDF)
+          pop ();
+        break;
+      /* PDI terminates the scope of the last LRI, RLI, or FSI whose
+         scope has not yet been terminated, as well as the scopes of
+         any subsequent LREs, RLEs, LROs, or RLOs whose scopes have not
+         yet been terminated.
*/ + case kind::PDI: + for (int i = vec.count () - 1; i >= 0; --i) + if (ctx_at (i) == kind::PDI) + { + vec.truncate (i); + break; + } + break; + case kind::LTR: + case kind::RTL: + /* These aren't popped by a PDF/PDI. */ + break; + [[likely]] case kind::NONE: + break; + default: + abort (); + } + } + + /* Return a descriptive string for K. */ + const char *to_str (kind k) + { + switch (k) + { + case kind::LRE: + return "U+202A (LEFT-TO-RIGHT EMBEDDING)"; + case kind::RLE: + return "U+202B (RIGHT-TO-LEFT EMBEDDING)"; + case kind::LRO: + return "U+202D (LEFT-TO-RIGHT OVERRIDE)"; + case kind::RLO: + return "U+202E (RIGHT-TO-LEFT OVERRIDE)"; + case kind::LRI: + return "U+2066 (LEFT-TO-RIGHT ISOLATE)"; + case kind::RLI: + return "U+2067 (RIGHT-TO-LEFT ISOLATE)"; + case kind::FSI: + return "U+2068 (FIRST STRONG ISOLATE)"; + case kind::PDF: + return "U+202C (POP DIRECTIONAL FORMATTING)"; + case kind::PDI: + return "U+2069 (POP DIRECTIONAL ISOLATE)"; + case kind::LTR: + return "U+200E (LEFT-TO-RIGHT MARK)"; + case kind::RTL: + return "U+200F (RIGHT-TO-LEFT MARK)"; + default: + abort (); + } + } +} + +/* Parse a sequence of 3 bytes starting with P and return its bidi code. */ + +static bidi::kind +get_bidi_utf8 (const unsigned char *const p) +{ + gcc_checking_assert (p[0] == bidi::utf8_start); + + if (p[1] == 0x80) + switch (p[2]) + { + case 0xaa: + return bidi::kind::LRE; + case 0xab: + return bidi::kind::RLE; + case 0xac: + return bidi::kind::PDF; + case 0xad: + return bidi::kind::LRO; + case 0xae: + return bidi::kind::RLO; + case 0x8e: + return bidi::kind::LTR; + case 0x8f: + return bidi::kind::RTL; + default: + break; + } + else if (p[1] == 0x81) + switch (p[2]) + { + case 0xa6: + return bidi::kind::LRI; + case 0xa7: + return bidi::kind::RLI; + case 0xa8: + return bidi::kind::FSI; + case 0xa9: + return bidi::kind::PDI; + default: + break; + } + + return bidi::kind::NONE; +} + +/* Parse a UCN where P points just past \u or \U and return its bidi code. 
*/ + +static bidi::kind +get_bidi_ucn (const unsigned char *p, bool is_U) +{ + /* 6.4.3 Universal Character Names + \u hex-quad + \U hex-quad hex-quad + where \unnnn means \U0000nnnn. */ + + if (is_U) + { + if (p[0] != '0' || p[1] != '0' || p[2] != '0' || p[3] != '0') + return bidi::kind::NONE; + /* Skip 4B so we can treat \u and \U the same below. */ + p += 4; + } + + /* All code points we are looking for start with 20xx. */ + if (p[0] != '2' || p[1] != '0') + return bidi::kind::NONE; + else if (p[2] == '2') + switch (p[3]) + { + case 'a': + case 'A': + return bidi::kind::LRE; + case 'b': + case 'B': + return bidi::kind::RLE; + case 'c': + case 'C': + return bidi::kind::PDF; + case 'd': + case 'D': + return bidi::kind::LRO; + case 'e': + case 'E': + return bidi::kind::RLO; + default: + break; + } + else if (p[2] == '6') + switch (p[3]) + { + case '6': + return bidi::kind::LRI; + case '7': + return bidi::kind::RLI; + case '8': + return bidi::kind::FSI; + case '9': + return bidi::kind::PDI; + default: + break; + } + else if (p[2] == '0') + switch (p[3]) + { + case 'e': + case 'E': + return bidi::kind::LTR; + case 'f': + case 'F': + return bidi::kind::RTL; + default: + break; + } + + return bidi::kind::NONE; +} + +/* We're closing a bidi context, that is, we've encountered a newline, + are closing a C-style comment, or are at the end of a string literal, + character constant, or identifier. Warn if this context was not + properly terminated by a PDI or PDF. P points to the last character + in this context. */ + +static void +maybe_warn_bidi_on_close (cpp_reader *pfile, const uchar *p) +{ + if (CPP_OPTION (pfile, cpp_warn_bidirectional) == bidirectional_unpaired + && bidi::vec.count () > 0) + { + const location_t loc + = linemap_position_for_column (pfile->line_table, + CPP_BUF_COLUMN (pfile->buffer, p)); + cpp_warning_with_line (pfile, CPP_W_BIDIRECTIONAL, loc, 0, + "unpaired UTF-8 bidirectional character " + "detected"); + } + /* We're done with this context. 
*/ + bidi::on_close (); +} + +/* We're at the beginning or in the middle of an identifier/comment/string + literal/character constant. Warn if we've encountered a bidi character. + KIND says which bidi character it was; P points to it in the character + stream. UCN_P is true iff this bidi character was written as a UCN. */ + +static void +maybe_warn_bidi_on_char (cpp_reader *pfile, const uchar *p, bidi::kind kind, + bool ucn_p) +{ + if (__builtin_expect (kind == bidi::kind::NONE, 1)) + return; + + const auto warn_bidi = CPP_OPTION (pfile, cpp_warn_bidirectional); + + if (warn_bidi != bidirectional_none) + { + const location_t loc + = linemap_position_for_column (pfile->line_table, + CPP_BUF_COLUMN (pfile->buffer, p)); + /* It seems excessive to warn about a PDI/PDF that is closing + an opened context because we've already warned about the + opening character. Except warn when we have a UCN x UTF-8 + mismatch. */ + if (kind == bidi::current_ctx ()) + { + if (warn_bidi == bidirectional_unpaired + && bidi::current_ctx_ucn_p () != ucn_p) + cpp_warning_with_line (pfile, CPP_W_BIDIRECTIONAL, loc, 0, + "UTF-8 vs UCN mismatch when closing " + "a context by \"%s\"", bidi::to_str (kind)); + } + else if (warn_bidi == bidirectional_any) + { + if (kind == bidi::kind::PDF || kind == bidi::kind::PDI) + cpp_warning_with_line (pfile, CPP_W_BIDIRECTIONAL, loc, 0, + "\"%s\" is closing an unopened context", + bidi::to_str (kind)); + else + cpp_warning_with_line (pfile, CPP_W_BIDIRECTIONAL, loc, 0, + "found problematic Unicode character \"%s\"", + bidi::to_str (kind)); + } + } + /* We're done with this context. */ + bidi::on_char (kind, ucn_p); +} + /* Skip a C-style block comment. We find the end of the comment by seeing if an asterisk is before every '/' we encounter. Returns nonzero if comment terminated by EOF, zero otherwise. 
@@ -1175,6 +1493,7 @@ _cpp_skip_block_comment (cpp_reader *pfile) cpp_buffer *buffer = pfile->buffer; const uchar *cur = buffer->cur; uchar c; + const bool warn_bidi_p = pfile->warn_bidi_p (); cur++; if (*cur == '/') @@ -1189,7 +1508,11 @@ _cpp_skip_block_comment (cpp_reader *pfile) if (c == '/') { if (cur[-2] == '*') - break; + { + if (warn_bidi_p) + maybe_warn_bidi_on_close (pfile, cur); + break; + } /* Warn about potential nested comments, but not if the '/' comes immediately before the true comment delimiter. @@ -1208,6 +1531,8 @@ _cpp_skip_block_comment (cpp_reader *pfile) { unsigned int cols; buffer->cur = cur - 1; + if (warn_bidi_p) + maybe_warn_bidi_on_close (pfile, cur); _cpp_process_line_notes (pfile, true); if (buffer->next_line >= buffer->rlimit) return true; @@ -1218,6 +1543,13 @@ _cpp_skip_block_comment (cpp_reader *pfile) cur = buffer->cur; } + /* If this is a beginning of a UTF-8 encoding, it might be + a bidirectional character. */ + else if (__builtin_expect (c == bidi::utf8_start, 0) && warn_bidi_p) + { + bidi::kind kind = get_bidi_utf8 (cur - 1); + maybe_warn_bidi_on_char (pfile, cur, kind, /*ucn_p=*/false); + } } buffer->cur = cur; @@ -1233,9 +1565,31 @@ skip_line_comment (cpp_reader *pfile) { cpp_buffer *buffer = pfile->buffer; location_t orig_line = pfile->line_table->highest_line; + const bool warn_bidi_p = pfile->warn_bidi_p (); - while (*buffer->cur != '\n') - buffer->cur++; + if (!warn_bidi_p) + while (*buffer->cur != '\n') + buffer->cur++; + else + { + while (*buffer->cur != '\n' + && *buffer->cur != bidi::utf8_start) + buffer->cur++; + if (__builtin_expect (*buffer->cur == bidi::utf8_start, 0)) + { + while (*buffer->cur != '\n') + { + if (__builtin_expect (*buffer->cur == bidi::utf8_start, 0)) + { + bidi::kind kind = get_bidi_utf8 (buffer->cur); + maybe_warn_bidi_on_char (pfile, buffer->cur, kind, + /*ucn_p=*/false); + } + buffer->cur++; + } + maybe_warn_bidi_on_close (pfile, buffer->cur); + } + } _cpp_process_line_notes (pfile, true); 
return orig_line != pfile->line_table->highest_line; @@ -1346,11 +1700,13 @@ static const cppchar_t utf8_signifier = 0xC0; /* Returns TRUE if the sequence starting at buffer->cur is valid in an identifier. FIRST is TRUE if this starts an identifier. */ + static bool forms_identifier_p (cpp_reader *pfile, int first, struct normalize_state *state) { cpp_buffer *buffer = pfile->buffer; + const bool warn_bidi_p = pfile->warn_bidi_p (); if (*buffer->cur == '$') { @@ -1373,6 +1729,13 @@ forms_identifier_p (cpp_reader *pfile, int first, cppchar_t s; if (*buffer->cur >= utf8_signifier) { + if (__builtin_expect (*buffer->cur == bidi::utf8_start, 0) + && warn_bidi_p) + { + bidi::kind kind = get_bidi_utf8 (buffer->cur); + maybe_warn_bidi_on_char (pfile, buffer->cur, kind, + /*ucn_p=*/false); + } if (_cpp_valid_utf8 (pfile, &buffer->cur, buffer->rlimit, 1 + !first, state, &s)) return true; @@ -1381,6 +1744,13 @@ forms_identifier_p (cpp_reader *pfile, int first, && (buffer->cur[1] == 'u' || buffer->cur[1] == 'U')) { buffer->cur += 2; + if (warn_bidi_p) + { + bidi::kind kind = get_bidi_ucn (buffer->cur, + buffer->cur[-1] == 'U'); + maybe_warn_bidi_on_char (pfile, buffer->cur, kind, + /*ucn_p=*/true); + } if (_cpp_valid_ucn (pfile, &buffer->cur, buffer->rlimit, 1 + !first, state, &s, NULL, NULL)) return true; @@ -1489,6 +1859,7 @@ lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn, const uchar *cur; unsigned int len; unsigned int hash = HT_HASHSTEP (0, *base); + const bool warn_bidi_p = pfile->warn_bidi_p (); cur = pfile->buffer->cur; if (! starts_ucn) @@ -1505,13 +1876,17 @@ lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn, { /* Slower version for identifiers containing UCNs or extended chars (including $). 
*/ - do { - while (ISIDNUM (*pfile->buffer->cur)) - { - NORMALIZE_STATE_UPDATE_IDNUM (nst, *pfile->buffer->cur); - pfile->buffer->cur++; - } - } while (forms_identifier_p (pfile, false, nst)); + do + { + while (ISIDNUM (*pfile->buffer->cur)) + { + NORMALIZE_STATE_UPDATE_IDNUM (nst, *pfile->buffer->cur); + pfile->buffer->cur++; + } + } + while (forms_identifier_p (pfile, false, nst)); + if (warn_bidi_p) + maybe_warn_bidi_on_close (pfile, pfile->buffer->cur); result = _cpp_interpret_identifier (pfile, base, pfile->buffer->cur - base); *spelling = cpp_lookup (pfile, base, pfile->buffer->cur - base); @@ -1758,6 +2133,7 @@ static void lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base) { const uchar *pos = base; + const bool warn_bidi_p = pfile->warn_bidi_p (); /* 'tis a pity this information isn't passed down from the lexer's initial categorization of the token. */ @@ -1994,8 +2370,15 @@ lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base) pos = base = pfile->buffer->cur; note = &pfile->buffer->notes[pfile->buffer->cur_note]; } + else if (__builtin_expect ((unsigned char) c == bidi::utf8_start, 0) + && warn_bidi_p) + maybe_warn_bidi_on_char (pfile, pos - 1, get_bidi_utf8 (pos - 1), + /*ucn_p=*/false); } + if (warn_bidi_p) + maybe_warn_bidi_on_close (pfile, pos); + if (CPP_OPTION (pfile, user_literals)) { /* If a string format macro, say from inttypes.h, is placed touching @@ -2090,15 +2473,27 @@ lex_string (cpp_reader *pfile, cpp_token *token, const uchar *base) else terminator = '>', type = CPP_HEADER_NAME; + const bool warn_bidi_p = pfile->warn_bidi_p (); for (;;) { cppchar_t c = *cur++; /* In #include-style directives, terminators are not escapable. 
*/ if (c == '\\' && !pfile->state.angled_headers && *cur != '\n') - cur++; + { + if ((cur[0] == 'u' || cur[0] == 'U') && warn_bidi_p) + { + bidi::kind kind = get_bidi_ucn (cur + 1, cur[0] == 'U'); + maybe_warn_bidi_on_char (pfile, cur, kind, /*ucn_p=*/true); + } + cur++; + } else if (c == terminator) - break; + { + if (warn_bidi_p) + maybe_warn_bidi_on_close (pfile, cur - 1); + break; + } else if (c == '\n') { cur--; @@ -2115,6 +2510,11 @@ lex_string (cpp_reader *pfile, cpp_token *token, const uchar *base) } else if (c == '\0') saw_NUL = true; + else if (__builtin_expect (c == bidi::utf8_start, 0) && warn_bidi_p) + { + bidi::kind kind = get_bidi_utf8 (cur - 1); + maybe_warn_bidi_on_char (pfile, cur - 1, kind, /*ucn_p=*/false); + } } if (saw_NUL && !pfile->state.skipping) base-commit: 4cdf7db9a39d18bd536d816a5751d4d3cf23808b -- 2.33.1 ^ permalink raw reply [flat|nested] 27+ messages in thread
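For readers following the thread, the nesting rules that bidi::on_char implements can be modelled outside libcpp. The following is only a minimal sketch, not the patch's code: Kind, Closer, BidiStack and open_contexts_after are hypothetical names, std::vector stands in for semi_embedded_vec, and the UCN/UTF-8 origin bit is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <vector>

// Hypothetical stand-in for the patch's bidi::kind enumeration.
enum class Kind { NONE, LRE, RLE, LRO, RLO, LRI, RLI, FSI, PDF, PDI, LTR, RTL };

// Which character is expected to terminate an open context.
enum class Closer { PDF, PDI };

class BidiStack {
public:
  // Update the stack of open contexts for one bidi control character.
  void on_char(Kind k) {
    switch (k) {
    case Kind::LRE: case Kind::RLE: case Kind::LRO: case Kind::RLO:
      open_.push_back(Closer::PDF);   // embeddings and overrides end at PDF
      break;
    case Kind::LRI: case Kind::RLI: case Kind::FSI:
      open_.push_back(Closer::PDI);   // isolates end at PDI
      break;
    case Kind::PDF:
      // PDF closes only the innermost context, and only if that context
      // is an embedding or override.
      if (!open_.empty() && open_.back() == Closer::PDF)
        open_.pop_back();
      break;
    case Kind::PDI:
      // PDI closes the innermost isolate together with everything that
      // was opened inside it.
      for (std::size_t i = open_.size(); i-- > 0;)
        if (open_[i] == Closer::PDI) {
          open_.resize(i);
          break;
        }
      break;
    default:
      // LTR/RTL marks and NONE neither open nor close a context.
      break;
    }
  }

  // Closer expected by the innermost open context; requires depth() > 0.
  Closer innermost() const {
    assert(!open_.empty());
    return open_.back();
  }

  std::size_t depth() const { return open_.size(); }

  // End of comment/identifier/literal: forget all open contexts.
  void on_close() { open_.clear(); }

private:
  std::vector<Closer> open_;
};

// Number of contexts still open after feeding SEQ to a fresh stack.
inline std::size_t open_contexts_after(std::initializer_list<Kind> seq) {
  BidiStack s;
  for (Kind k : seq)
    s.on_char(k);
  return s.depth();
}
```

A nonzero result from open_contexts_after corresponds to the situation in which maybe_warn_bidi_on_close would warn under =unpaired.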
* Re: [PATCH v2] libcpp: Implement -Wbidi-chars for CVE-2021-42574 [PR103026] 2021-11-16 19:50 ` [PATCH v2] " Marek Polacek @ 2021-11-16 23:00 ` David Malcolm 2021-11-17 0:37 ` [PATCH v3] " Marek Polacek 0 siblings, 1 reply; 27+ messages in thread From: David Malcolm @ 2021-11-16 23:00 UTC (permalink / raw) To: Marek Polacek; +Cc: Joseph Myers, Jakub Jelinek, Martin Sebor, GCC Patches > On Mon, Nov 15, 2021 at 06:15:40PM -0500, David Malcolm wrote: > > > On Mon, Nov 08, 2021 at 04:33:43PM -0500, Marek Polacek wrote: > > > > Ping, can we conclude on the name? IMHO, -Wbidirectional is just fine, > > > > but changing the name is a trivial operation. > > > > > > Here's a patch with a better name (suggested by Jonathan W.). Otherwise no > > > changes. > > > > Thanks for implementing this. > > > > > > > > Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk? > > > > > > -- >8 -- > > > From a link below: > > > "An issue was discovered in the Bidirectional Algorithm in the Unicode > > > Specification through 14.0. It permits the visual reordering of > > > characters via control sequences, which can be used to craft source code > > > that renders different logic than the logical ordering of tokens > > > ingested by compilers and interpreters. Adversaries can leverage this to > > > encode source code for compilers accepting Unicode such that targeted > > > vulnerabilities are introduced invisibly to human reviewers." > > > > > > More info: > > > https://nvd.nist.gov/vuln/detail/CVE-2021-42574 > > > https://trojansource.codes/ > > > > > > This is not a compiler bug. However, to mitigate the problem, this patch > > > implements -Wbidi-chars=[none|unpaired|any] to warn about possibly > > > misleading Unicode bidirectional characters the preprocessor may encounter. [...snip...] > > > > Terminology nit: > > The patch is referring to "bidirectional characters", but I think the > > term "bidirectional control characters" would be better. > > Adjusted. Thanks. 
I wonder if the warning should be -Wbidi-control-chars, but I don't
care enough to insist on it being changed.

>
> > For example, a passage of text containing both numbers and characters
> > in a right-to-left script could be considered "bidirectional", since
> > the numbers are written from left-to-right.
> >
> > Specifically, the patch looks for these specific characters:
> > * U+202A LEFT-TO-RIGHT EMBEDDING
> > * U+202B RIGHT-TO-LEFT EMBEDDING
> > * U+202C POP DIRECTIONAL FORMATTING
> > * U+202D LEFT-TO-RIGHT OVERRIDE
> > * U+202E RIGHT-TO-LEFT OVERRIDE
> > * U+2066 LEFT-TO-RIGHT ISOLATE
> > * U+2067 RIGHT-TO-LEFT ISOLATE
> > * U+2068 FIRST STRONG ISOLATE
> > * U+2069 POP DIRECTIONAL ISOLATE
> >
> > However, the following characters could also be considered as
> > "bidirectional control characters":
> > * U+200E LEFT-TO-RIGHT MARK (UTF-8: E2 80 8E)
> > * U+200F RIGHT-TO-LEFT MARK (UTF-8: E2 80 8F)
> > but aren't checked for in the patch. Should they be? I can imagine
> > ways in which they could be abused, so I think so.
>
> I'd only intended to check the bidi chars described in the original
> trojan source pdf, but I added checking for U+200E/U+200F too, since
> it was easy enough. AFAIK they aren't popped by a PDF/PDI like the
> rest, so don't need to go on the vec, and so we only warn with =any.
> Tests: Wbidi-chars-16.c + Wbidi-chars-17.c

Thanks. I took a look through the revised patch and I think you
updated things correctly.

[...snip...]
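As an aside on the byte sequences quoted above: all eleven code points under discussion fall in U+2000..U+2FFF, which is why their UTF-8 encodings share the lead byte E2. A small standalone sketch, assuming hypothetical helper names decode_utf8_3 and is_bidi_control (neither is part of the patch, which switches on the raw bytes instead), shows the arithmetic:

```cpp
#include <cassert>
#include <cstdint>

// Decode the three-byte UTF-8 sequence at P.  Returns 0 if P does not
// start with a three-byte lead byte followed by two continuation bytes.
// This is a loose check: it does not reject overlong forms or surrogates.
inline std::uint32_t decode_utf8_3(const unsigned char *p) {
  assert(p != nullptr);
  // Short-circuit evaluation stops at the first byte that does not fit,
  // so a '\n' right after the lead byte is never read past.
  if ((p[0] & 0xF0) != 0xE0 || (p[1] & 0xC0) != 0x80 || (p[2] & 0xC0) != 0x80)
    return 0;
  return ((p[0] & 0x0Fu) << 12) | ((p[1] & 0x3Fu) << 6) | (p[2] & 0x3Fu);
}

// True for the nine embedding/override/isolate controls and the two marks.
inline bool is_bidi_control(std::uint32_t cp) {
  return (cp >= 0x202A && cp <= 0x202E)   // LRE, RLE, PDF, LRO, RLO
         || (cp >= 0x2066 && cp <= 0x2069) // LRI, RLI, FSI, PDI
         || cp == 0x200E || cp == 0x200F;  // LRM, RLM
}
```

For instance, E2 80 8E decodes to (0x2 << 12) | (0x00 << 6) | 0x0E = 0x200E, matching the LEFT-TO-RIGHT MARK entry in the list.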
> > > diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-4.c b/gcc/testsuite/c-c++-common/Wbidi-chars-4.c > > > new file mode 100644 > > > index 00000000000..9fd4bc535ca > > > --- /dev/null > > > +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-4.c > > > @@ -0,0 +1,166 @@ > > > +/* PR preprocessor/103026 */ > > > +/* { dg-do compile } */ > > > +/* { dg-options "-Wbidi-chars=any -Wno-multichar -Wno-overflow" } */ > > > +/* Test all bidi chars in various contexts (identifiers, comments, > > > + string literals, character constants), both UCN and UTF-8. The bidi > > > + chars here are properly terminated, except for the character constants. */ > > > + > > > +/* a b c LRE 1 2 3 PDF x y z */ > > > +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ > > > +/* a b c RLE 1 2 3 PDF x y z */ > > > +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ > > > +/* a b c LRO 1 2 3 PDF x y z */ > > > +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ > > > +/* a b c RLO 1 2 3 PDF x y z */ > > > +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ > > > +/* a b c LRI 1 2 3 PDI x y z */ > > > +/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */ > > > +/* a b c RLI 1 2 3 PDI x y */ > > > +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */ > > > +/* a b c FSI 1 2 3 PDI x y z */ > > > +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */ > > > > AIUI the Unicode bidirectionality algorithm works at the line level, > > and so each line in a block comment should be checked individually for > > unclossed bidi control chars, rather than a block comment as a whole. > > Hence I think the test case needs to have block comment test coverage > > for: > > - single line blocks > > - first line of a multiline block comment > > - middle line of a multiline block comment > > - final line of a multiline block comment > > but I think the patch as it stands is only checking for the first of > > these four cases. 
> > The patch handles all of them, because of: > 1534 if (warn_bidi_p) > 1535 maybe_warn_bidi_on_close (pfile, cur); > in _cpp_skip_block_comment, but I was lacking some more testing, so I've > added some testing, and included a new test: Wbidi-chars-15.c. All of the cases in Wbidi-chars-15.c only test for unpaired chars in a middle line of a multiline block comment; I don't think the patch has any explicit coverage for unpaired control chars happening in the first and last lines of *multiline* block comments. So it would be good if Wbidi-chars-15.c could gain some coverage for that (don't have to handle all the different chars). [...snip...] > > > diff --git a/libcpp/lex.c b/libcpp/lex.c > > > index fa2253d41c3..3fb518e202b 100644 > > > --- a/libcpp/lex.c > > > +++ b/libcpp/lex.c > > > @@ -1164,6 +1164,300 @@ _cpp_process_line_notes (cpp_reader *pfile, int in_comment) > > > } > > > } > > > > > > +namespace bidi { > > > + enum class kind { > > > + NONE, LRE, RLE, LRO, RLO, LRI, RLI, FSI, PDF, PDI > > > + }; > > > + > > > + /* All the UTF-8 encodings of bidi characters start with E2. */ > > > + constexpr uchar utf8_start = 0xe2; > > > > Is there a difference between "constexpr" vs "const" here? (sorry for > > my ignorance) > > I just wanted to make sure that utf8_start will be usable in contexts where > an integral constant expression is required. 'const' objects need not be > initialized with compile-time constants, but 'constexpr' objects do. Thanks. > > > + > > > + /* A vector holding currently open bidi contexts. We use a char for > > > + each context, its LSB is 1 if it represents a PDF context, 0 if it > > > + represents a PDI context. The next bit is 1 if this context was open > > > + by a bidi character written as a UCN, and 0 when it was UTF-8. */ > > > + semi_embedded_vec <unsigned char, 16> vec; > > > + > > > + /* Close the whole comment/identifier/string literal/character constant > > > + context.
*/ > > > + void on_close () > > > + { > > > + vec.truncate (0); > > > + } > > > + > > > + /* Pop the last element in the vector. */ > > > + void pop () > > > + { > > > + unsigned int len = vec.count (); > > > + gcc_checking_assert (len > 0); > > > + vec.truncate (len - 1); > > > + } > > > + > > > + /* Return the context of the Ith element. */ > > > + kind ctx_at (unsigned int i) > > > + { > > > + return (vec[i] & 1) ? kind::PDF : kind::PDI; > > > + } > > > + > > > + /* Return which context is currently opened. */ > > > + kind current_ctx () > > > + { > > > + unsigned int len = vec.count (); > > > + if (len == 0) > > > + return kind::NONE; > > > + return ctx_at (len - 1); > > > + } > > > + > > > + /* Return true if the current context comes from a UCN origin, that is, > > > + the bidi char which started this bidi context was written as a UCN. */ > > > + bool current_ctx_ucn_p () > > > + { > > > + unsigned int len = vec.count (); > > > + gcc_checking_assert (len > 0); > > > + return (vec[len - 1] >> 1) & 1; > > > + } > > > + > > > + /* We've read a bidi char, update the current vector as necessary. */ > > > + void on_char (kind k, bool ucn_p) > > > + { > > > + switch (k) > > > + { > > > + case kind::LRE: > > > + case kind::RLE: > > > + case kind::LRO: > > > + case kind::RLO: > > > + vec.push (ucn_p ? 3u : 1u); > > > + break; > > > + case kind::LRI: > > > + case kind::RLI: > > > + case kind::FSI: > > > + vec.push (ucn_p ? 2u : 0u); > > > + break; > > > > I don't like the hand-coded bit fields here, where bit 1 and bit 2 in > > the above have special meaning, but aren't clearly labelled as such. > > > > Please can you at least use some kind of constant/define to make clear > > the meaning of the bits. Even clearer would be bitfields; is there a > > performance reason for not using them? (though this code is only > > called on bidi control chars, which presumably is a rare occurrence). 
> > My patch here: > > "[PATCH 2/2] Capture locations of bidi chars and underline ranges" > > https://gcc.gnu.org/pipermail/gcc-patches/2021-November/583160.html > > did some refactoring of this patch, replacing hand-coded bit > > manipulation with bitfields in a struct (as well as then using that as > > a good place to stash location_t values, and then using these > > locations). > > Would it be helpful if I split that part of my patch out? > > I think they are just fine here, given they are used only in bidi:: and > not outside of it. And I could just use a simple unsigned char in > semi_embedded_vec instead of inventing a new struct. > > Your diagnostic patch changes it because you need to remember a location > too, so we're changing it anyway, so I left it be so that you have fewer > conflicts. Fair enough; I'll post an updated version of my followup once yours goes in. > > [...snip...] > > > > > +/* Parse a sequence of 3 bytes starting with P and return its bidi code. */ > > > + > > > +static bidi::kind > > > +get_bidi_utf8 (const unsigned char *const p) > > > +{ > > > + gcc_checking_assert (p[0] == bidi::utf8_start); > > > + > > > + if (p[1] == 0x80) > > > + switch (p[2]) > > > > get_bidi_utf8 accesses up to 2 bytes beyond "p"... > > > > [...snip...] > > > > ...and is called in various places such as... > > > > > @@ -1218,6 +1519,13 @@ _cpp_skip_block_comment (cpp_reader *pfile) > > > > > > cur = buffer->cur; > > > } > > > + /* If this is a beginning of a UTF-8 encoding, it might be > > > + a bidirectional character. */ > > > + else if (__builtin_expect (c == bidi::utf8_start, 0) && warn_bidi_p) > > > + { > > > + bidi::kind kind = get_bidi_utf8 (cur - 1); > > > + maybe_warn_bidi_on_char (pfile, cur, kind, /*ucn_p=*/false); > > > + } > > > > Are we guaranteed to have a '\n' at the end of the buffer? (even for > > the final line of the file) That would ensure that we don't read past > > the end of the buffer.
> > We've discussed this in our internal thread; I think I said that you > will always get a '\n' so this is not going to read past the end of > the buffer. To be sure... > > > Can we have testcases involving malformed UTF-8, in which, say: > > - the final byte of the input file is 0xe2 > > - the final two bytes of the input file are 0xe2 0x80 > > for each of block comment, C++-style comment, string-literal, > > identifier, etc? > > (or is that overkill?) > > ...I'd crafted a malformed text file using hexedit but couldn't get > it to crash. I'd rather not include it in the testsuite though. Fair enough. > > [...snip...] > > > > > @@ -1505,13 +1855,17 @@ lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn, > > > { > > > /* Slower version for identifiers containing UCNs > > > or extended chars (including $). */ > > > - do { > > > - while (ISIDNUM (*pfile->buffer->cur)) > > > - { > > > - NORMALIZE_STATE_UPDATE_IDNUM (nst, *pfile->buffer->cur); > > > - pfile->buffer->cur++; > > > - } > > > - } while (forms_identifier_p (pfile, false, nst)); > > > + do > > > + { > > > + while (ISIDNUM (*pfile->buffer->cur)) > > > + { > > > + NORMALIZE_STATE_UPDATE_IDNUM (nst, *pfile->buffer->cur); > > > + pfile->buffer->cur++; > > > + } > > > + } > > > + while (forms_identifier_p (pfile, false, nst)); > > > > Is the above purely a whitespace change? > > Yes. If I'm reading things correctly, these lines in the existing code were correctly indented, so is there a purpose to this change? If not, please can you remove this change from the patch (to minimize the change to the history). [...snip...] > +/* We're closing a bidi context, that is, we've encountered a newline, > + are closing a C-style comment, or are at the end of a string literal, > + character constant, or identifier. Warn if this context was not > + properly terminated by a PDI or PDF. P points to the last character > + in this context. 
*/ > + > +static void > +maybe_warn_bidi_on_close (cpp_reader *pfile, const uchar *p) > +{ > + if (CPP_OPTION (pfile, cpp_warn_bidirectional) == bidirectional_unpaired > + && bidi::vec.count () > 0) > + { > + const location_t loc > + = linemap_position_for_column (pfile->line_table, > + CPP_BUF_COLUMN (pfile->buffer, p)); > + cpp_warning_with_line (pfile, CPP_W_BIDIRECTIONAL, loc, 0, > + "unpaired UTF-8 bidirectional character " > + "detected"); > + } Sorry, I missed this one in my initial review, should be "control character" here. [...snip...] OK for trunk with the above nits fixed. Thanks Dave
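On the review point about the unlabelled 3u/1u/2u/0u values: one way the two bits could be given names, while keeping the one-byte-per-context storage, is sketched below. This is purely illustrative and is neither the patch's code nor the follow-up patch mentioned in the review; Context, PDF_BIT, UCN_BIT, pack and unpack are all hypothetical names.

```cpp
#include <cassert>

// Named view of one open bidi context.
struct Context {
  bool pdf_p : 1;  // true: terminated by PDF; false: terminated by PDI
  bool ucn_p : 1;  // true: opener was written as a UCN; false: as UTF-8
};

// Bit positions matching the encoding described in the patch's comment:
// the LSB selects PDF vs PDI, the next bit records a UCN origin.
constexpr unsigned char PDF_BIT = 1u << 0;
constexpr unsigned char UCN_BIT = 1u << 1;

// Convert to the byte layout used by the vector of open contexts.
inline unsigned char pack(Context c) {
  return static_cast<unsigned char>((c.pdf_p ? PDF_BIT : 0)
                                    | (c.ucn_p ? UCN_BIT : 0));
}

// Convert a stored byte back into named fields.
inline Context unpack(unsigned char byte) {
  // No other bits are expected to be set.
  assert((byte & ~(PDF_BIT | UCN_BIT)) == 0);
  return Context{(byte & PDF_BIT) != 0, (byte & UCN_BIT) != 0};
}
```

With these names, the `vec.push (ucn_p ? 3u : 1u)` in the LRE/RLE/LRO/RLO case would correspond to pack ({true, ucn_p}), and the isolate case to pack ({false, ucn_p}).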
* [PATCH v3] libcpp: Implement -Wbidi-chars for CVE-2021-42574 [PR103026] 2021-11-16 23:00 ` David Malcolm @ 2021-11-17 0:37 ` Marek Polacek 2021-11-17 2:28 ` David Malcolm 0 siblings, 1 reply; 27+ messages in thread From: Marek Polacek @ 2021-11-17 0:37 UTC (permalink / raw) To: David Malcolm; +Cc: Joseph Myers, Jakub Jelinek, Martin Sebor, GCC Patches On Tue, Nov 16, 2021 at 06:00:58PM -0500, David Malcolm wrote: > > On Mon, Nov 15, 2021 at 06:15:40PM -0500, David Malcolm wrote: > > > > On Mon, Nov 08, 2021 at 04:33:43PM -0500, Marek Polacek wrote: > > > > > Ping, can we conclude on the name? IMHO, -Wbidirectional is just fine, > > > > > but changing the name is a trivial operation. > > > > > > > > Here's a patch with a better name (suggested by Jonathan W.). Otherwise no > > > > changes. > > > > > > Thanks for implementing this. > > > > > > > > > > > Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk? > > > > > > > > -- >8 -- > > > > From a link below: > > > > "An issue was discovered in the Bidirectional Algorithm in the Unicode > > > > Specification through 14.0. It permits the visual reordering of > > > > characters via control sequences, which can be used to craft source code > > > > that renders different logic than the logical ordering of tokens > > > > ingested by compilers and interpreters. Adversaries can leverage this to > > > > encode source code for compilers accepting Unicode such that targeted > > > > vulnerabilities are introduced invisibly to human reviewers." > > > > > > > > More info: > > > > https://nvd.nist.gov/vuln/detail/CVE-2021-42574 > > > > https://trojansource.codes/ > > > > > > > > This is not a compiler bug. However, to mitigate the problem, this patch > > > > implements -Wbidi-chars=[none|unpaired|any] to warn about possibly > > > > misleading Unicode bidirectional characters the preprocessor may encounter. > > [...snip...] 
> > > > > > > Terminology nit: > > > The patch is referring to "bidirectional characters", but I think the > > > term "bidirectional control characters" would be better. > > > > Adjusted. > > Thanks. > > I wonder if the warning should be -Wbidi-control-chars, but I don't > care enough to insist on it being changed. > > > > > > For example, a passage of text containing both numbers and characters > > > in a right-to-left script could be considered "bidirectional", since > > > the numbers are written from left-to-right. > > > > > > Specifically, the patch looks for these specific characters: > > > * U+202A LEFT-TO-RIGHT EMBEDDING > > > * U+202B RIGHT-TO-LEFT EMBEDDING > > > * U+202C POP DIRECTIONAL FORMATTING > > > * U+202D LEFT-TO-RIGHT OVERRIDE > > > * U+202E RIGHT-TO-LEFT OVERRIDE > > > * U+2066 LEFT-TO-RIGHT ISOLATE > > > * U+2067 RIGHT-TO-LEFT ISOLATE > > > * U+2068 FIRST STRONG ISOLATE > > > * U+2069 POP DIRECTIONAL ISOLATE > > > > > > However, the following characters could also be considered as > > > "bidirectional control characters": > > > * U+200E LEFT-TO-RIGHT MARK (UTF-8: E2 80 8E) > > > * U+200F RIGHT-TO-LEFT MARK (UTF-8: E2 80 8F) > > > but aren't checked for in the patch. Should they be? I can imagine > > > ways in which they could be abused, so I think so. > > > > I'd only intended to check the bidi chars described in the original > > trojan source pdf, but I added checking for U+200E/U+200F too, since > > it was easy enough. AFAIK they aren't popped by a PDF/PDI like the > > rest, so don't need to go on the vec, and so we only warn with =any. > > Tests: Wbidi-chars-16.c + Wbidi-chars-17.c > > Thanks. I took a look through the revised patch and I think you > updated things correctly. > > [...snip...] 
> > > > > diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-4.c b/gcc/testsuite/c-c++-common/Wbidi-chars-4.c > > > > new file mode 100644 > > > > index 00000000000..9fd4bc535ca > > > > --- /dev/null > > > > +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-4.c > > > > @@ -0,0 +1,166 @@ > > > > +/* PR preprocessor/103026 */ > > > > +/* { dg-do compile } */ > > > > +/* { dg-options "-Wbidi-chars=any -Wno-multichar -Wno-overflow" } */ > > > > +/* Test all bidi chars in various contexts (identifiers, comments, > > > > + string literals, character constants), both UCN and UTF-8. The bidi > > > > + chars here are properly terminated, except for the character constants. */ > > > > + > > > > +/* a b c LRE 1 2 3 PDF x y z */ > > > > +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ > > > > +/* a b c RLE 1 2 3 PDF x y z */ > > > > +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ > > > > +/* a b c LRO 1 2 3 PDF x y z */ > > > > +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ > > > > +/* a b c RLO 1 2 3 PDF x y z */ > > > > +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ > > > > +/* a b c LRI 1 2 3 PDI x y z */ > > > > +/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */ > > > > +/* a b c RLI 1 2 3 PDI x y */ > > > > +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */ > > > > +/* a b c FSI 1 2 3 PDI x y z */ > > > > +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */ > > > > > > AIUI the Unicode bidirectionality algorithm works at the line level, > > > and so each line in a block comment should be checked individually for > > > unclossed bidi control chars, rather than a block comment as a whole. 
> > > Hence I think the test case needs to have block comment test coverage > > > for: > > > - single line blocks > > > - first line of a multiline block comment > > > - middle line of a multiline block comment > > > - final line of a multiline block comment > > > but I think the patch as it stands is only checking for the first of > > > these four cases. > > > > The patch handles all of them, because of: > > 1534 if (warn_bidi_p) > > 1535 maybe_warn_bidi_on_close (pfile, cur); > > in _cpp_skip_block_comment, but I was lacking some more testing, so I've > > added some testing, and included a new test: Wbidi-chars-15.c. > > All of the cases in Wbidi-chars-15.c only test for unparired chars in a > middle line of a multiline block comment; I don't think the patch has > any explicit coverage for unpaired control chars happening in the first > line and last lines of *multiline* block comments. So it would be good > if Wbidi-chars-15.c could gain some coverage for that (don't have to > handle all the different chars). Sorry for a dumb question, but is this what you have in mind? /* LRE PDF */ /* FSI PDI */ and check that we warn for these? > > > > @@ -1505,13 +1855,17 @@ lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn, > > > > { > > > > /* Slower version for identifiers containing UCNs > > > > or extended chars (including $). */ > > > > - do { > > > > - while (ISIDNUM (*pfile->buffer->cur)) > > > > - { > > > > - NORMALIZE_STATE_UPDATE_IDNUM (nst, *pfile->buffer->cur); > > > > - pfile->buffer->cur++; > > > > - } > > > > - } while (forms_identifier_p (pfile, false, nst)); > > > > + do > > > > + { > > > > + while (ISIDNUM (*pfile->buffer->cur)) > > > > + { > > > > + NORMALIZE_STATE_UPDATE_IDNUM (nst, *pfile->buffer->cur); > > > > + pfile->buffer->cur++; > > > > + } > > > > + } > > > > + while (forms_identifier_p (pfile, false, nst)); > > > > > > Is the above purely a whitespace change? > > > > Yes. 
>
> If I'm reading things correctly, these lines in the existing code were
> correctly indented, so is there a purpose to this change? If not,
> please can you remove this change from the patch (to minimize the
> change to the history).

I dropped that change then. Sometimes it's hard to resist fixing formatting. ;)

> [...snip...]
>
> > +/* We're closing a bidi context, that is, we've encountered a newline,
> > +   are closing a C-style comment, or are at the end of a string literal,
> > +   character constant, or identifier. Warn if this context was not
> > +   properly terminated by a PDI or PDF. P points to the last character
> > +   in this context. */
> > +
> > +static void
> > +maybe_warn_bidi_on_close (cpp_reader *pfile, const uchar *p)
> > +{
> > +  if (CPP_OPTION (pfile, cpp_warn_bidirectional) == bidirectional_unpaired
> > +      && bidi::vec.count () > 0)
> > +    {
> > +      const location_t loc
> > +        = linemap_position_for_column (pfile->line_table,
> > +                                       CPP_BUF_COLUMN (pfile->buffer, p));
> > +      cpp_warning_with_line (pfile, CPP_W_BIDIRECTIONAL, loc, 0,
> > +                             "unpaired UTF-8 bidirectional character "
> > +                             "detected");
> > +    }
>
> Sorry, I missed this one in my initial review, should be "control
> character" here.

Fixed.

> [...snip...]
>
> OK for trunk with the above nits fixed.

Thanks again for the review. I'll push this once the test question above is resolved.

-- >8 --
From a link below:
"An issue was discovered in the Bidirectional Algorithm in the Unicode
Specification through 14.0. It permits the visual reordering of
characters via control sequences, which can be used to craft source code
that renders different logic than the logical ordering of tokens
ingested by compilers and interpreters. Adversaries can leverage this to
encode source code for compilers accepting Unicode such that targeted
vulnerabilities are introduced invisibly to human reviewers."

More info:
https://nvd.nist.gov/vuln/detail/CVE-2021-42574
https://trojansource.codes/

This is not a compiler bug.
However, to mitigate the problem, this patch
implements -Wbidi-chars=[none|unpaired|any] to warn about possibly
misleading Unicode bidirectional control characters the preprocessor may
encounter.

The default is =unpaired, which warns about improperly terminated
bidirectional control characters; e.g. an LRE without its corresponding PDF.
The level =any warns about any use of bidirectional control characters.

This patch handles both UCNs and UTF-8 characters. UCNs designating
bidi characters in identifiers have been accepted since r204886. Then r217144
enabled -fextended-identifiers by default. Extended characters in C/C++
identifiers have been accepted since r275979. However, this patch still
warns about mixing UTF-8 and UCN bidi characters; there seems to be no
good reason to allow mixing them.

We warn in different contexts: comments (both C and C++-style), string
literals, character constants, and identifiers. Expectedly, UCNs are ignored
in comments and raw string literals. The bidirectional control characters
can nest, so this patch handles that as well.

I have not included nor tested this at all with Fortran (which also has
string literals and line comments).

Dave M. posted patches improving diagnostics involving Unicode characters.
This patch does not make use of this new infrastructure yet.

	PR preprocessor/103026

gcc/c-family/ChangeLog:

	* c.opt (Wbidi-chars, Wbidi-chars=): New option.

gcc/ChangeLog:

	* doc/invoke.texi: Document -Wbidi-chars.

libcpp/ChangeLog:

	* include/cpplib.h (enum cpp_bidirectional_level): New.
	(struct cpp_options): Add cpp_warn_bidirectional.
	(enum cpp_warning_reason): Add CPP_W_BIDIRECTIONAL.
	* internal.h (struct cpp_reader): Add warn_bidi_p member
	function.
	* init.c (cpp_create_reader): Set cpp_warn_bidirectional.
	* lex.c (bidi): New namespace.
	(get_bidi_utf8): New function.
	(get_bidi_ucn): Likewise.
	(maybe_warn_bidi_on_close): Likewise.
	(maybe_warn_bidi_on_char): Likewise.
	(_cpp_skip_block_comment): Implement warning about bidirectional
	control characters.
	(skip_line_comment): Likewise.
	(forms_identifier_p): Likewise.
	(lex_identifier): Likewise.
	(lex_string): Likewise.
	(lex_raw_string): Likewise.

gcc/testsuite/ChangeLog:

	* c-c++-common/Wbidi-chars-1.c: New test.
	* c-c++-common/Wbidi-chars-2.c: New test.
	* c-c++-common/Wbidi-chars-3.c: New test.
	* c-c++-common/Wbidi-chars-4.c: New test.
	* c-c++-common/Wbidi-chars-5.c: New test.
	* c-c++-common/Wbidi-chars-6.c: New test.
	* c-c++-common/Wbidi-chars-7.c: New test.
	* c-c++-common/Wbidi-chars-8.c: New test.
	* c-c++-common/Wbidi-chars-9.c: New test.
	* c-c++-common/Wbidi-chars-10.c: New test.
	* c-c++-common/Wbidi-chars-11.c: New test.
	* c-c++-common/Wbidi-chars-12.c: New test.
	* c-c++-common/Wbidi-chars-13.c: New test.
	* c-c++-common/Wbidi-chars-14.c: New test.
	* c-c++-common/Wbidi-chars-15.c: New test.
	* c-c++-common/Wbidi-chars-16.c: New test.
	* c-c++-common/Wbidi-chars-17.c: New test.
---
 gcc/c-family/c.opt                          |  24 ++
 gcc/doc/invoke.texi                         |  21 +-
 gcc/testsuite/c-c++-common/Wbidi-chars-1.c  |  12 +
 gcc/testsuite/c-c++-common/Wbidi-chars-10.c |  27 ++
 gcc/testsuite/c-c++-common/Wbidi-chars-11.c |  13 +
 gcc/testsuite/c-c++-common/Wbidi-chars-12.c |  19 +
 gcc/testsuite/c-c++-common/Wbidi-chars-13.c |  17 +
 gcc/testsuite/c-c++-common/Wbidi-chars-14.c |  38 ++
 gcc/testsuite/c-c++-common/Wbidi-chars-15.c |  39 ++
 gcc/testsuite/c-c++-common/Wbidi-chars-16.c |  26 ++
 gcc/testsuite/c-c++-common/Wbidi-chars-17.c |  30 ++
 gcc/testsuite/c-c++-common/Wbidi-chars-2.c  |   9 +
 gcc/testsuite/c-c++-common/Wbidi-chars-3.c  |  11 +
 gcc/testsuite/c-c++-common/Wbidi-chars-4.c  | 188 +++++++++
 gcc/testsuite/c-c++-common/Wbidi-chars-5.c  | 188 +++++++++
 gcc/testsuite/c-c++-common/Wbidi-chars-6.c  | 155 ++++++++
 gcc/testsuite/c-c++-common/Wbidi-chars-7.c  |   9 +
 gcc/testsuite/c-c++-common/Wbidi-chars-8.c  |  13 +
 gcc/testsuite/c-c++-common/Wbidi-chars-9.c  |  29 ++
 libcpp/include/cpplib.h                     |  18 +-
 libcpp/init.c                               |   1 +
 libcpp/internal.h                           |   7 +
 libcpp/lex.c                                | 408 +++++++++++++++++++-
 23 files changed, 1295 insertions(+), 7 deletions(-)
 create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-1.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-10.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-11.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-12.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-13.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-14.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-15.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-16.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-17.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-2.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-3.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-4.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-5.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-6.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-7.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-8.c
 create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-9.c

diff --git a/gcc/c-family/c.opt b/gcc/c-family/c.opt
index 8a4cd634f77..3976fc368db 100644
--- a/gcc/c-family/c.opt
+++ b/gcc/c-family/c.opt
@@ -374,6 +374,30 @@ Wbad-function-cast
 C ObjC Var(warn_bad_function_cast) Warning
 Warn about casting functions to incompatible types.
 
+Wbidi-chars
+C ObjC C++ ObjC++ Warning Alias(Wbidi-chars=,any,none)
+;
+
+Wbidi-chars=
+C ObjC C++ ObjC++ RejectNegative Joined Warning CPP(cpp_warn_bidirectional) CppReason(CPP_W_BIDIRECTIONAL) Var(warn_bidirectional) Init(bidirectional_unpaired) Enum(cpp_bidirectional_level)
+-Wbidi-chars=[none|unpaired|any] Warn about UTF-8 bidirectional control characters.
+
+; Required for these enum values.
+SourceInclude +cpplib.h + +Enum +Name(cpp_bidirectional_level) Type(int) UnknownError(argument %qs to %<-Wbidi-chars%> not recognized) + +EnumValue +Enum(cpp_bidirectional_level) String(none) Value(bidirectional_none) + +EnumValue +Enum(cpp_bidirectional_level) String(unpaired) Value(bidirectional_unpaired) + +EnumValue +Enum(cpp_bidirectional_level) String(any) Value(bidirectional_any) + Wbool-compare C ObjC C++ ObjC++ Var(warn_bool_compare) Warning LangEnabledBy(C ObjC C++ ObjC++,Wall) Warn about boolean expression compared with an integer value different from true/false. diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index 6070288856c..a22758d18ee 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -327,7 +327,9 @@ Objective-C and Objective-C++ Dialects}. -Warith-conversion @gol -Warray-bounds -Warray-bounds=@var{n} -Warray-compare @gol -Wno-attributes -Wattribute-alias=@var{n} -Wno-attribute-alias @gol --Wno-attribute-warning -Wbool-compare -Wbool-operation @gol +-Wno-attribute-warning @gol +-Wbidi-chars=@r{[}none@r{|}unpaired@r{|}any@r{]} @gol +-Wbool-compare -Wbool-operation @gol -Wno-builtin-declaration-mismatch @gol -Wno-builtin-macro-redefined -Wc90-c99-compat -Wc99-c11-compat @gol -Wc11-c2x-compat @gol @@ -7678,6 +7680,23 @@ Attributes considered include @code{alloc_align}, @code{alloc_size}, This is the default. You can disable these warnings with either @option{-Wno-attribute-alias} or @option{-Wattribute-alias=0}. +@item -Wbidi-chars=@r{[}none@r{|}unpaired@r{|}any@r{]} +@opindex Wbidi-chars= +@opindex Wbidi-chars +@opindex Wno-bidi-chars +Warn about possibly misleading UTF-8 bidirectional control characters in +comments, string literals, character constants, and identifiers. Such +characters can change left-to-right writing direction into right-to-left +(and vice versa), which can cause confusion between the logical order and +visual order. 
This may be dangerous; for instance, it may seem that a piece +of code is not commented out, whereas it in fact is. + +There are three levels of warning supported by GCC@. The default is +@option{-Wbidi-chars=unpaired}, which warns about improperly terminated +bidi contexts. @option{-Wbidi-chars=none} turns the warning off. +@option{-Wbidi-chars=any} warns about any use of bidirectional control +characters. + @item -Wbool-compare @opindex Wno-bool-compare @opindex Wbool-compare diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-1.c b/gcc/testsuite/c-c++-common/Wbidi-chars-1.c new file mode 100644 index 00000000000..34f5ac19271 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-1.c @@ -0,0 +1,12 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ + +int main() { + int isAdmin = 0; + /* } if (isAdmin) begin admins only */ +/* { dg-warning "bidirectional" "" { target *-*-* } .-1 } */ + __builtin_printf("You are an admin.\n"); + /* end admins only { */ +/* { dg-warning "bidirectional" "" { target *-*-* } .-1 } */ + return 0; +} diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-10.c b/gcc/testsuite/c-c++-common/Wbidi-chars-10.c new file mode 100644 index 00000000000..3f851b69e65 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-10.c @@ -0,0 +1,27 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidi-chars=unpaired" } */ +/* More nesting testing. 
*/ + +/* RLE LRI PDF PDI*/ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int LRE_\u202a_PDF_\u202c; +int LRE_\u202a_PDF_\u202c_LRE_\u202a_PDF_\u202c; +int LRE_\u202a_LRI_\u2066_PDF_\u202c_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int RLE_\u202b_RLI_\u2067_PDF_\u202c_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int RLE_\u202b_RLI_\u2067_PDI_\u2069_PDF_\u202c; +int FSI_\u2068_LRO_\u202d_PDI_\u2069_PDF_\u202c; +int FSI_\u2068; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int FSI_\u2068_PDI_\u2069; +int FSI_\u2068_FSI_\u2068_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069; +int RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDF_\u202c; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int RLI_\u2067_RLI_\u2067_RLI_\u2067_RLI_\u2067_FSI_\u2068_PDI_\u2069_PDI_\u2069_PDI_\u2069_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-11.c b/gcc/testsuite/c-c++-common/Wbidi-chars-11.c new file mode 100644 index 00000000000..270ce2368a9 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-11.c @@ -0,0 +1,13 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidi-chars=unpaired" } */ +/* Test that we warn when mixing UCN and UTF-8. 
*/ + +int LRE__PDF_\u202c; +/* { dg-warning "mismatch" "" { target *-*-* } .-1 } */ +int LRE_\u202a_PDF__; +/* { dg-warning "mismatch" "" { target *-*-* } .-1 } */ +const char *s1 = "LRE__PDF_\u202c"; +/* { dg-warning "mismatch" "" { target *-*-* } .-1 } */ +const char *s2 = "LRE_\u202a_PDF_"; +/* { dg-warning "mismatch" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-12.c b/gcc/testsuite/c-c++-common/Wbidi-chars-12.c new file mode 100644 index 00000000000..b07eec1da91 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-12.c @@ -0,0 +1,19 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile { target { c || c++11 } } } */ +/* { dg-options "-Wbidi-chars=any" } */ +/* Test raw strings. */ + +const char *s1 = R"(a b c LRE 1 2 3 PDF x y z)"; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ +const char *s2 = R"(a b c RLE 1 2 3 PDF x y z)"; +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ +const char *s3 = R"(a b c LRO 1 2 3 PDF x y z)"; +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ +const char *s4 = R"(a b c RLO 1 2 3 PDF x y z)"; +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ +const char *s7 = R"(a b c FSI 1 2 3 PDI x y) z"; +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */ +const char *s8 = R"(a b c PDI x y )z"; +/* { dg-warning "U\\+2069" "" { target *-*-* } .-1 } */ +const char *s9 = R"(a b c PDF x y z)"; +/* { dg-warning "U\\+202C" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-13.c b/gcc/testsuite/c-c++-common/Wbidi-chars-13.c new file mode 100644 index 00000000000..b2dd9fde752 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-13.c @@ -0,0 +1,17 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile { target { c || c++11 } } } */ +/* { dg-options "-Wbidi-chars=unpaired" } */ +/* Test raw strings. 
*/ + +const char *s1 = R"(a b c LRE 1 2 3)"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +const char *s2 = R"(a b c RLE 1 2 3)"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +const char *s3 = R"(a b c LRO 1 2 3)"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +const char *s4 = R"(a b c FSI 1 2 3)"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +const char *s5 = R"(a b c LRI 1 2 3)"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +const char *s6 = R"(a b c RLI 1 2 3)"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-14.c b/gcc/testsuite/c-c++-common/Wbidi-chars-14.c new file mode 100644 index 00000000000..ba5f75d9553 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-14.c @@ -0,0 +1,38 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidi-chars=unpaired" } */ +/* Test PDI handling, which also pops any subsequent LREs, RLEs, LROs, + or RLOs. 
*/ + +/* LRI__LRI__RLE__RLE__RLE__PDI_*/ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +// LRI__RLE__RLE__RLE__PDI_ +// LRI__RLO__RLE__RLE__PDI_ +// LRI__RLO__RLE__PDI_ +// FSI__RLO__PDI_ +// FSI__FSI__RLO__PDI_ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + +int LRI_\u2066_LRI_\u2066_LRE_\u202a_LRE_\u202a_LRE_\u202a_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int LRI_\u2066_LRI_\u2066_LRE_\u202a_LRE_\u202a_LRE_\u202a_PDI_\u2069_PDI_\u2069; +int LRI_\u2066_LRI_\u2066_LRI_\u2066_LRE_\u202a_LRE_\u202a_LRE_\u202a_PDI_\u2069_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int PDI_\u2069; +int LRI_\u2066_PDI_\u2069; +int RLI_\u2067_PDI_\u2069; +int LRE_\u202a_LRI_\u2066_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int LRI_\u2066_LRE_\u202a_PDF_\u202c_PDI_\u2069; +int LRI_\u2066_LRE_\u202a_LRE_\u202a_PDF_\u202c_PDI_\u2069; +int RLI_\u2067_LRI_\u2066_LRE_\u202a_LRE_\u202a_PDF_\u202c_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int FSI_\u2068_LRI_\u2066_LRE_\u202a_LRE_\u202a_PDF_\u202c_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int RLO_\u202e_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int RLI_\u2067_PDI_\u2069_RLI_\u2067; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int FSI_\u2068_PDF_\u202c_PDI_\u2069; +int FSI_\u2068_FSI_\u2068_PDF_\u202c_PDI_\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-15.c b/gcc/testsuite/c-c++-common/Wbidi-chars-15.c new file mode 100644 index 00000000000..a16eec85493 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-15.c @@ -0,0 +1,39 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidi-chars=unpaired" } */ +/* Test unpaired bidi control chars in multiline comments. 
*/ + +/* + * LRE end + */ +/* { dg-warning "unpaired" "" { target *-*-* } .-2 } */ +/* + * RLE end + */ +/* { dg-warning "unpaired" "" { target *-*-* } .-2 } */ +/* + * LRO end + */ +/* { dg-warning "unpaired" "" { target *-*-* } .-2 } */ +/* + * RLO end + */ +/* { dg-warning "unpaired" "" { target *-*-* } .-2 } */ +/* + * LRI end + */ +/* { dg-warning "unpaired" "" { target *-*-* } .-2 } */ +/* + * RLI end + */ +/* { dg-warning "unpaired" "" { target *-*-* } .-2 } */ +/* + * FSI end + */ +/* { dg-warning "unpaired" "" { target *-*-* } .-2 } */ +/* LRE + PDF */ +/* { dg-warning "unpaired" "" { target *-*-* } .-2 } */ +/* FSI + PDI */ +/* { dg-warning "unpaired" "" { target *-*-* } .-2 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-16.c b/gcc/testsuite/c-c++-common/Wbidi-chars-16.c new file mode 100644 index 00000000000..baa0159861c --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-16.c @@ -0,0 +1,26 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidi-chars=any" } */ +/* Test LTR/RTL chars. 
*/ + +/* LTR<> */ +/* { dg-warning "U\\+200E" "" { target *-*-* } .-1 } */ +// LTR<> +/* { dg-warning "U\\+200E" "" { target *-*-* } .-1 } */ +/* RTL<> */ +/* { dg-warning "U\\+200F" "" { target *-*-* } .-1 } */ +// RTL<> +/* { dg-warning "U\\+200F" "" { target *-*-* } .-1 } */ + +const char *s1 = "LTR<>"; +/* { dg-warning "U\\+200E" "" { target *-*-* } .-1 } */ +const char *s2 = "LTR\u200e"; +/* { dg-warning "U\\+200E" "" { target *-*-* } .-1 } */ +const char *s3 = "LTR\u200E"; +/* { dg-warning "U\\+200E" "" { target *-*-* } .-1 } */ +const char *s4 = "RTL<>"; +/* { dg-warning "U\\+200F" "" { target *-*-* } .-1 } */ +const char *s5 = "RTL\u200f"; +/* { dg-warning "U\\+200F" "" { target *-*-* } .-1 } */ +const char *s6 = "RTL\u200F"; +/* { dg-warning "U\\+200F" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-17.c b/gcc/testsuite/c-c++-common/Wbidi-chars-17.c new file mode 100644 index 00000000000..07cb4321f96 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-17.c @@ -0,0 +1,30 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidi-chars=unpaired" } */ +/* Test LTR/RTL chars. 
*/ + +/* LTR<> */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +// LTR<> +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +/* RTL<> */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +// RTL<> +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int ltr_\u200e; +/* { dg-error "universal character " "" { target *-*-* } .-1 } */ +int rtl_\u200f; +/* { dg-error "universal character " "" { target *-*-* } .-1 } */ + +const char *s1 = "LTR<>"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +const char *s2 = "LTR\u200e"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +const char *s3 = "LTR\u200E"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +const char *s4 = "RTL<>"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +const char *s5 = "RTL\u200f"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +const char *s6 = "RTL\u200F"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-2.c b/gcc/testsuite/c-c++-common/Wbidi-chars-2.c new file mode 100644 index 00000000000..2340374f276 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-2.c @@ -0,0 +1,9 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ + +int main() { + /* Say hello; newline/*/ return 0 ; +/* { dg-warning "bidirectional" "" { target *-*-* } .-1 } */ + __builtin_printf("Hello world.\n"); + return 0; +} diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-3.c b/gcc/testsuite/c-c++-common/Wbidi-chars-3.c new file mode 100644 index 00000000000..9dc7edb6e64 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-3.c @@ -0,0 +1,11 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ + +int main() { + const char* access_level = "user"; + if (__builtin_strcmp(access_level, "user // Check if admin ")) { +/* { dg-warning "bidirectional" "" { target *-*-* } .-1 } */ + __builtin_printf("You are an admin.\n"); + } + return 0; +} diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-4.c 
b/gcc/testsuite/c-c++-common/Wbidi-chars-4.c new file mode 100644 index 00000000000..639e5c62e88 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-4.c @@ -0,0 +1,188 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidi-chars=any -Wno-multichar -Wno-overflow" } */ +/* Test all bidi chars in various contexts (identifiers, comments, + string literals, character constants), both UCN and UTF-8. The bidi + chars here are properly terminated, except for the character constants. */ + +/* a b c LRE 1 2 3 PDF x y z */ +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ +/* a b c RLE 1 2 3 PDF x y z */ +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ +/* a b c LRO 1 2 3 PDF x y z */ +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ +/* a b c RLO 1 2 3 PDF x y z */ +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ +/* a b c LRI 1 2 3 PDI x y z */ +/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */ +/* a b c RLI 1 2 3 PDI x y */ +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */ +/* a b c FSI 1 2 3 PDI x y z */ +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */ + +/* Same but C++ comments instead. */ +// a b c LRE 1 2 3 PDF x y z +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ +// a b c RLE 1 2 3 PDF x y z +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ +// a b c LRO 1 2 3 PDF x y z +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ +// a b c RLO 1 2 3 PDF x y z +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ +// a b c LRI 1 2 3 PDI x y z +/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */ +// a b c RLI 1 2 3 PDI x y +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */ +// a b c FSI 1 2 3 PDI x y z +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */ + +/* Here we're closing an unopened context, warn when =any. 
*/ +/* a b c PDI x y z */ +/* { dg-warning "U\\+2069" "" { target *-*-* } .-1 } */ +/* a b c PDF x y z */ +/* { dg-warning "U\\+202C" "" { target *-*-* } .-1 } */ +// a b c PDI x y z +/* { dg-warning "U\\+2069" "" { target *-*-* } .-1 } */ +// a b c PDF x y z +/* { dg-warning "U\\+202C" "" { target *-*-* } .-1 } */ + +/* Multiline comments. */ +/* a b c PDI x y z + */ +/* { dg-warning "U\\+2069" "" { target *-*-* } .-2 } */ +/* a b c PDF x y z + */ +/* { dg-warning "U\\+202C" "" { target *-*-* } .-2 } */ +/* first + a b c PDI x y z + */ +/* { dg-warning "U\\+2069" "" { target *-*-* } .-2 } */ +/* first + a b c PDF x y z + */ +/* { dg-warning "U\\+202C" "" { target *-*-* } .-2 } */ +/* first + a b c PDI x y z */ +/* { dg-warning "U\\+2069" "" { target *-*-* } .-1 } */ +/* first + a b c PDF x y z */ +/* { dg-warning "U\\+202C" "" { target *-*-* } .-1 } */ + +void +g1 () +{ + const char *s1 = "a b c LRE 1 2 3 PDF x y z"; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ + const char *s2 = "a b c RLE 1 2 3 PDF x y z"; +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ + const char *s3 = "a b c LRO 1 2 3 PDF x y z"; +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ + const char *s4 = "a b c RLO 1 2 3 PDF x y z"; +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ + const char *s5 = "a b c LRI 1 2 3 PDI x y z"; +/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */ + const char *s6 = "a b c RLI 1 2 3 PDI x y z"; +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */ + const char *s7 = "a b c FSI 1 2 3 PDI x y z"; +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */ + const char *s8 = "a b c PDI x y z"; +/* { dg-warning "U\\+2069" "" { target *-*-* } .-1 } */ + const char *s9 = "a b c PDF x y z"; +/* { dg-warning "U\\+202C" "" { target *-*-* } .-1 } */ + + const char *s10 = "a b c LRE\u202a 1 2 3 PDF\u202c x y z"; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ + const char *s11 = "a b c LRE\u202A 1 2 3 PDF\u202c x y z"; +/* { 
dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ + const char *s12 = "a b c RLE\u202b 1 2 3 PDF\u202c x y z"; +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ + const char *s13 = "a b c RLE\u202B 1 2 3 PDF\u202c x y z"; +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ + const char *s14 = "a b c LRO\u202d 1 2 3 PDF\u202c x y z"; +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ + const char *s15 = "a b c LRO\u202D 1 2 3 PDF\u202c x y z"; +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ + const char *s16 = "a b c RLO\u202e 1 2 3 PDF\u202c x y z"; +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ + const char *s17 = "a b c RLO\u202E 1 2 3 PDF\u202c x y z"; +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ + const char *s18 = "a b c LRI\u2066 1 2 3 PDI\u2069 x y z"; +/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */ + const char *s19 = "a b c RLI\u2067 1 2 3 PDI\u2069 x y z"; +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */ + const char *s20 = "a b c FSI\u2068 1 2 3 PDI\u2069 x y z"; +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */ +} + +void +g2 () +{ + const char c1 = '\u202a'; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ + const char c2 = '\u202A'; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ + const char c3 = '\u202b'; +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ + const char c4 = '\u202B'; +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ + const char c5 = '\u202d'; +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ + const char c6 = '\u202D'; +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ + const char c7 = '\u202e'; +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ + const char c8 = '\u202E'; +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ + const char c9 = '\u2066'; +/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */ + const char c10 = '\u2067'; +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */ + const char 
c11 = '\u2068'; +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */ +} + +int abc; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */ +int AX; +/* { dg-warning "U\\+202C" "" { target *-*-* } .-1 } */ +int A\u202cY; +/* { dg-warning "U\\+202C" "" { target *-*-* } .-1 } */ +int A\u202CY2; +/* { dg-warning "U\\+202C" "" { target *-*-* } .-1 } */ + +int d\u202ae\u202cf; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ +int d\u202Ae\u202cf2; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ +int d\u202be\u202cf; +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ +int d\u202Be\u202cf2; +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ +int d\u202de\u202cf; +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ +int d\u202De\u202cf2; +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ +int d\u202ee\u202cf; +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ +int d\u202Ee\u202cf2; +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ +int d\u2066e\u2069f; +/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */ +int d\u2067e\u2069f; +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */ +int d\u2068e\u2069f; +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */ +int X\u2069; +/* { dg-warning "U\\+2069" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-5.c b/gcc/testsuite/c-c++-common/Wbidi-chars-5.c new file mode 100644 index 00000000000..68cb053144b --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-5.c @@ -0,0 +1,188 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { 
dg-options "-Wbidi-chars=unpaired -Wno-multichar -Wno-overflow" } */ +/* Test all bidi chars in various contexts (identifiers, comments, + string literals, character constants), both UCN and UTF-8. The bidi + chars here are properly terminated, except for the character constants. */ + +/* a b c LRE 1 2 3 PDF x y z */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +/* a b c RLE 1 2 3 PDF x y z */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +/* a b c LRO 1 2 3 PDF x y z */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +/* a b c RLO 1 2 3 PDF x y z */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +/* a b c LRI 1 2 3 PDI x y z */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +/* a b c RLI 1 2 3 PDI x y */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +/* a b c FSI 1 2 3 PDI x y z */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + +/* Same but C++ comments instead. */ +// a b c LRE 1 2 3 PDF x y z +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +// a b c RLE 1 2 3 PDF x y z +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +// a b c LRO 1 2 3 PDF x y z +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +// a b c RLO 1 2 3 PDF x y z +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +// a b c LRI 1 2 3 PDI x y z +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +// a b c RLI 1 2 3 PDI x y +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +// a b c FSI 1 2 3 PDI x y z +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + +/* Here we're closing an unopened context, warn when =any. */ +/* a b c PDI x y z */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +/* a b c PDF x y z */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +// a b c PDI x y z +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +// a b c PDF x y z +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + +/* Multiline comments. 
*/ +/* a b c PDI x y z + */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-2 } */ +/* a b c PDF x y z + */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-2 } */ +/* first + a b c PDI x y z + */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-2 } */ +/* first + a b c PDF x y z + */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-2 } */ +/* first + a b c PDI x y z */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +/* first + a b c PDF x y z */ +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + +void +g1 () +{ + const char *s1 = "a b c LRE 1 2 3 PDF x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s2 = "a b c RLE 1 2 3 PDF x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s3 = "a b c LRO 1 2 3 PDF x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s4 = "a b c RLO 1 2 3 PDF x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s5 = "a b c LRI 1 2 3 PDI x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s6 = "a b c RLI 1 2 3 PDI x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s7 = "a b c FSI 1 2 3 PDI x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s8 = "a b c PDI x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s9 = "a b c PDF x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + + const char *s10 = "a b c LRE\u202a 1 2 3 PDF\u202c x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s11 = "a b c LRE\u202A 1 2 3 PDF\u202c x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s12 = "a b c RLE\u202b 1 2 3 PDF\u202c x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s13 = "a b c RLE\u202B 1 2 3 PDF\u202c x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s14 = "a b c LRO\u202d 1 2 3 PDF\u202c x y z"; +/* { dg-bogus "unpaired" "" { 
target *-*-* } .-1 } */ + const char *s15 = "a b c LRO\u202D 1 2 3 PDF\u202c x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s16 = "a b c RLO\u202e 1 2 3 PDF\u202c x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s17 = "a b c RLO\u202E 1 2 3 PDF\u202c x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s18 = "a b c LRI\u2066 1 2 3 PDI\u2069 x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s19 = "a b c RLI\u2067 1 2 3 PDI\u2069 x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + const char *s20 = "a b c FSI\u2068 1 2 3 PDI\u2069 x y z"; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +} + +void +g2 () +{ + const char c1 = '\u202a'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c2 = '\u202A'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c3 = '\u202b'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c4 = '\u202B'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c5 = '\u202d'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c6 = '\u202D'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c7 = '\u202e'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c8 = '\u202E'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c9 = '\u2066'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c10 = '\u2067'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char c11 = '\u2068'; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +} + +int abc; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-bogus "unpaired" "" { target *-*-* } 
.-1 } */ +int abc; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int abc; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int AX; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int A\u202cY; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int A\u202CY2; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ + +int d\u202ae\u202cf; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u202Ae\u202cf2; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u202be\u202cf; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u202Be\u202cf2; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u202de\u202cf; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u202De\u202cf2; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u202ee\u202cf; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u202Ee\u202cf2; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u2066e\u2069f; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u2067e\u2069f; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int d\u2068e\u2069f; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ +int X\u2069; +/* { dg-bogus "unpaired" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-6.c b/gcc/testsuite/c-c++-common/Wbidi-chars-6.c new file mode 100644 index 00000000000..0ce6fff2dee --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-6.c @@ -0,0 +1,155 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidi-chars=unpaired" } */ +/* Test nesting of bidi chars in various contexts. 
*/ + +/* Terminated by the wrong char: */ +/* a b c LRE 1 2 3 PDI x y z */ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* a b c RLE 1 2 3 PDI x y z*/ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* a b c LRO 1 2 3 PDI x y z */ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* a b c RLO 1 2 3 PDI x y z */ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* a b c LRI 1 2 3 PDF x y z */ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* a b c RLI 1 2 3 PDF x y z */ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* a b c FSI 1 2 3 PDF x y z*/ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + +/* LRE PDF */ +/* LRE LRE PDF PDF */ +/* PDF LRE PDF */ +/* LRE PDF LRE PDF */ +/* LRE LRE PDF */ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* PDF LRE */ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + +// a b c LRE 1 2 3 PDI x y z +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +// a b c RLE 1 2 3 PDI x y z*/ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +// a b c LRO 1 2 3 PDI x y z +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +// a b c RLO 1 2 3 PDI x y z +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +// a b c LRI 1 2 3 PDF x y z +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +// a b c RLI 1 2 3 PDF x y z +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +// a b c FSI 1 2 3 PDF x y z +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + +// LRE PDF +// LRE LRE PDF PDF +// PDF LRE PDF +// LRE PDF LRE PDF +// LRE LRE PDF +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +// PDF LRE +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + +void +g1 () +{ + const char *s1 = "a b c LRE 1 2 3 PDI x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s2 = "a b c LRE\u202a 1 2 3 PDI\u2069 x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s3 = "a b c RLE 1 2 
3 PDI x y "; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s4 = "a b c RLE\u202b 1 2 3 PDI\u2069 x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s5 = "a b c LRO 1 2 3 PDI x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s6 = "a b c LRO\u202d 1 2 3 PDI\u2069 x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s7 = "a b c RLO 1 2 3 PDI x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s8 = "a b c RLO\u202e 1 2 3 PDI\u2069 x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s9 = "a b c LRI 1 2 3 PDF x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s10 = "a b c LRI\u2066 1 2 3 PDF\u202c x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s11 = "a b c RLI 1 2 3 PDF x y z\ + "; +/* { dg-warning "unpaired" "" { target *-*-* } .-2 } */ + const char *s12 = "a b c RLI\u2067 1 2 3 PDF\u202c x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s13 = "a b c FSI 1 2 3 PDF x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s14 = "a b c FSI\u2068 1 2 3 PDF\u202c x y z"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s15 = "PDF LRE"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s16 = "PDF\u202c LRE\u202a"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s17 = "LRE PDF"; + const char *s18 = "LRE\u202a PDF\u202c"; + const char *s19 = "LRE LRE PDF PDF"; + const char *s20 = "LRE\u202a LRE\u202a PDF\u202c PDF\u202c"; + const char *s21 = "PDF LRE PDF"; + const char *s22 = "PDF\u202c LRE\u202a PDF\u202c"; + const char *s23 = "LRE LRE PDF"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s24 = "LRE\u202a LRE\u202a PDF\u202c"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s25 = "PDF LRE"; +/* { 
dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s26 = "PDF\u202c LRE\u202a"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s27 = "PDF LRE\u202a"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + const char *s28 = "PDF\u202c LRE"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +} + +int aLREbPDI; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int A\u202aB\u2069C; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int aRLEbPDI; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int a\u202bB\u2069c; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int aLRObPDI; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int a\u202db\u2069c2; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int aRLObPDI; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int a\u202eb\u2069; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int aLRIbPDF; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int a\u2066b\u202c; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int aRLIbPDFc +; +/* { dg-warning "unpaired" "" { target *-*-* } .-2 } */ +int a\u2067b\u202c; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int aFSIbPDF; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int a\u2068b\u202c; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int aFSIbPD\u202C; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int aFSI\u2068bPDF_; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int aLREbPDFb; +int A\u202aB\u202c; +int a_LRE_LRE_b_PDF_PDF; +int A\u202aA\u202aB\u202cB\u202c; +int aPDFbLREadPDF; +int a_\u202C_\u202a_\u202c; +int a_LRE_b_PDF_c_LRE_PDF; +int a_\u202a_\u202c_\u202a_\u202c_; +int a_LRE_b_PDF_c_LRE; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int a_\u202a_\u202c_\u202a_; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-7.c 
b/gcc/testsuite/c-c++-common/Wbidi-chars-7.c new file mode 100644 index 00000000000..d012d420ec0 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-7.c @@ -0,0 +1,9 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidi-chars=any" } */ +/* Test we ignore UCNs in comments. */ + +// a b c \u202a 1 2 3 +// a b c \u202A 1 2 3 +/* a b c \u202a 1 2 3 */ +/* a b c \u202A 1 2 3 */ diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-8.c b/gcc/testsuite/c-c++-common/Wbidi-chars-8.c new file mode 100644 index 00000000000..4f54c5092ec --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-8.c @@ -0,0 +1,13 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidi-chars=any" } */ +/* Test \u vs \U. */ + +int a_\u202A; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ +int a_\u202a_2; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ +int a_\U0000202A_3; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ +int a_\U0000202a_4; +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-9.c b/gcc/testsuite/c-c++-common/Wbidi-chars-9.c new file mode 100644 index 00000000000..e2af1b1ca97 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-9.c @@ -0,0 +1,29 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidi-chars=unpaired" } */ +/* Test that we properly separate bidi contexts (comment/identifier/character + constant/string literal). 
*/ + +/* LRE -><- */ int pdf_\u202c_1; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* RLE -><- */ int pdf_\u202c_2; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* LRO -><- */ int pdf_\u202c_3; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* RLO -><- */ int pdf_\u202c_4; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* LRI -><-*/ int pdi_\u2069_1; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* RLI -><- */ int pdi_\u2069_12; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* FSI -><- */ int pdi_\u2069_3; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ + +const char *s1 = "LRE\u202a"; /* PDF -><- */ +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +/* LRE -><- */ const char *s2 = "PDF\u202c"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +const char *s3 = "LRE\u202a"; int pdf_\u202c_5; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ +int lre_\u202a; const char *s4 = "PDF\u202c"; +/* { dg-warning "unpaired" "" { target *-*-* } .-1 } */ diff --git a/libcpp/include/cpplib.h b/libcpp/include/cpplib.h index 176f8c5bbce..112b9c24751 100644 --- a/libcpp/include/cpplib.h +++ b/libcpp/include/cpplib.h @@ -319,6 +319,17 @@ enum cpp_main_search CMS_system, /* Search the system INCLUDE path. */ }; +/* The possible bidirectional control characters checking levels, from least + restrictive to most. */ +enum cpp_bidirectional_level { + /* No checking. */ + bidirectional_none, + /* Only detect unpaired uses of bidirectional control characters. */ + bidirectional_unpaired, + /* Detect any use of bidirectional control characters. */ + bidirectional_any +}; + /* This structure is nested inside struct cpp_reader, and carries all the options visible to the command line. */ struct cpp_options @@ -539,6 +550,10 @@ struct cpp_options /* True if warn about differences between C++98 and C++11. 
*/ bool cpp_warn_cxx11_compat; + /* Nonzero if bidirectional control characters checking is on. See enum + cpp_bidirectional_level. */ + unsigned char cpp_warn_bidirectional; + /* Dependency generation. */ struct { @@ -643,7 +658,8 @@ enum cpp_warning_reason { CPP_W_C90_C99_COMPAT, CPP_W_C11_C2X_COMPAT, CPP_W_CXX11_COMPAT, - CPP_W_EXPANSION_TO_DEFINED + CPP_W_EXPANSION_TO_DEFINED, + CPP_W_BIDIRECTIONAL }; /* Callback for header lookup for HEADER, which is the name of a diff --git a/libcpp/init.c b/libcpp/init.c index 5a424e23553..f9a8f5f088f 100644 --- a/libcpp/init.c +++ b/libcpp/init.c @@ -223,6 +223,7 @@ cpp_create_reader (enum c_lang lang, cpp_hash_table *table, = ENABLE_CANONICAL_SYSTEM_HEADERS; CPP_OPTION (pfile, ext_numeric_literals) = 1; CPP_OPTION (pfile, warn_date_time) = 0; + CPP_OPTION (pfile, cpp_warn_bidirectional) = bidirectional_unpaired; /* Default CPP arithmetic to something sensible for the host for the benefit of dumb users like fix-header. */ diff --git a/libcpp/internal.h b/libcpp/internal.h index 8577cab6c83..0ce0246c5a2 100644 --- a/libcpp/internal.h +++ b/libcpp/internal.h @@ -597,6 +597,13 @@ struct cpp_reader /* Location identifying the main source file -- intended to be line zero of said file. */ location_t main_loc; + + /* Returns true iff we should warn about UTF-8 bidirectional control + characters. */ + bool warn_bidi_p () const + { + return CPP_OPTION (this, cpp_warn_bidirectional) != bidirectional_none; + } }; /* Character classes. Based on the more primitive macros in safe-ctype.h. diff --git a/libcpp/lex.c b/libcpp/lex.c index fa2253d41c3..6a4fbce6030 100644 --- a/libcpp/lex.c +++ b/libcpp/lex.c @@ -1164,6 +1164,324 @@ _cpp_process_line_notes (cpp_reader *pfile, int in_comment) } } +namespace bidi { + enum class kind { + NONE, LRE, RLE, LRO, RLO, LRI, RLI, FSI, PDF, PDI, LTR, RTL + }; + + /* All the UTF-8 encodings of bidi characters start with E2. 
*/ + constexpr uchar utf8_start = 0xe2; + + /* A vector holding currently open bidi contexts. We use a char for + each context, its LSB is 1 if it represents a PDF context, 0 if it + represents a PDI context. The next bit is 1 if this context was open + by a bidi character written as a UCN, and 0 when it was UTF-8. */ + semi_embedded_vec <unsigned char, 16> vec; + + /* Close the whole comment/identifier/string literal/character constant + context. */ + void on_close () + { + vec.truncate (0); + } + + /* Pop the last element in the vector. */ + void pop () + { + unsigned int len = vec.count (); + gcc_checking_assert (len > 0); + vec.truncate (len - 1); + } + + /* Return the context of the Ith element. */ + kind ctx_at (unsigned int i) + { + return (vec[i] & 1) ? kind::PDF : kind::PDI; + } + + /* Return which context is currently opened. */ + kind current_ctx () + { + unsigned int len = vec.count (); + if (len == 0) + return kind::NONE; + return ctx_at (len - 1); + } + + /* Return true if the current context comes from a UCN origin, that is, + the bidi char which started this bidi context was written as a UCN. */ + bool current_ctx_ucn_p () + { + unsigned int len = vec.count (); + gcc_checking_assert (len > 0); + return (vec[len - 1] >> 1) & 1; + } + + /* We've read a bidi char, update the current vector as necessary. */ + void on_char (kind k, bool ucn_p) + { + switch (k) + { + case kind::LRE: + case kind::RLE: + case kind::LRO: + case kind::RLO: + vec.push (ucn_p ? 3u : 1u); + break; + case kind::LRI: + case kind::RLI: + case kind::FSI: + vec.push (ucn_p ? 2u : 0u); + break; + /* PDF terminates the scope of the last LRE, RLE, LRO, or RLO + whose scope has not yet been terminated. */ + case kind::PDF: + if (current_ctx () == kind::PDF) + pop (); + break; + /* PDI terminates the scope of the last LRI, RLI, or FSI whose + scope has not yet been terminated, as well as the scopes of + any subsequent LREs, RLEs, LROs, or RLOs whose scopes have not + yet been terminated. 
*/ + case kind::PDI: + for (int i = vec.count () - 1; i >= 0; --i) + if (ctx_at (i) == kind::PDI) + { + vec.truncate (i); + break; + } + break; + case kind::LTR: + case kind::RTL: + /* These aren't popped by a PDF/PDI. */ + break; + [[likely]] case kind::NONE: + break; + default: + abort (); + } + } + + /* Return a descriptive string for K. */ + const char *to_str (kind k) + { + switch (k) + { + case kind::LRE: + return "U+202A (LEFT-TO-RIGHT EMBEDDING)"; + case kind::RLE: + return "U+202B (RIGHT-TO-LEFT EMBEDDING)"; + case kind::LRO: + return "U+202D (LEFT-TO-RIGHT OVERRIDE)"; + case kind::RLO: + return "U+202E (RIGHT-TO-LEFT OVERRIDE)"; + case kind::LRI: + return "U+2066 (LEFT-TO-RIGHT ISOLATE)"; + case kind::RLI: + return "U+2067 (RIGHT-TO-LEFT ISOLATE)"; + case kind::FSI: + return "U+2068 (FIRST STRONG ISOLATE)"; + case kind::PDF: + return "U+202C (POP DIRECTIONAL FORMATTING)"; + case kind::PDI: + return "U+2069 (POP DIRECTIONAL ISOLATE)"; + case kind::LTR: + return "U+200E (LEFT-TO-RIGHT MARK)"; + case kind::RTL: + return "U+200F (RIGHT-TO-LEFT MARK)"; + default: + abort (); + } + } +} + +/* Parse a sequence of 3 bytes starting with P and return its bidi code. */ + +static bidi::kind +get_bidi_utf8 (const unsigned char *const p) +{ + gcc_checking_assert (p[0] == bidi::utf8_start); + + if (p[1] == 0x80) + switch (p[2]) + { + case 0xaa: + return bidi::kind::LRE; + case 0xab: + return bidi::kind::RLE; + case 0xac: + return bidi::kind::PDF; + case 0xad: + return bidi::kind::LRO; + case 0xae: + return bidi::kind::RLO; + case 0x8e: + return bidi::kind::LTR; + case 0x8f: + return bidi::kind::RTL; + default: + break; + } + else if (p[1] == 0x81) + switch (p[2]) + { + case 0xa6: + return bidi::kind::LRI; + case 0xa7: + return bidi::kind::RLI; + case 0xa8: + return bidi::kind::FSI; + case 0xa9: + return bidi::kind::PDI; + default: + break; + } + + return bidi::kind::NONE; +} + +/* Parse a UCN where P points just past \u or \U and return its bidi code. 
*/ + +static bidi::kind +get_bidi_ucn (const unsigned char *p, bool is_U) +{ + /* 6.4.3 Universal Character Names + \u hex-quad + \U hex-quad hex-quad + where \unnnn means \U0000nnnn. */ + + if (is_U) + { + if (p[0] != '0' || p[1] != '0' || p[2] != '0' || p[3] != '0') + return bidi::kind::NONE; + /* Skip 4B so we can treat \u and \U the same below. */ + p += 4; + } + + /* All code points we are looking for start with 20xx. */ + if (p[0] != '2' || p[1] != '0') + return bidi::kind::NONE; + else if (p[2] == '2') + switch (p[3]) + { + case 'a': + case 'A': + return bidi::kind::LRE; + case 'b': + case 'B': + return bidi::kind::RLE; + case 'c': + case 'C': + return bidi::kind::PDF; + case 'd': + case 'D': + return bidi::kind::LRO; + case 'e': + case 'E': + return bidi::kind::RLO; + default: + break; + } + else if (p[2] == '6') + switch (p[3]) + { + case '6': + return bidi::kind::LRI; + case '7': + return bidi::kind::RLI; + case '8': + return bidi::kind::FSI; + case '9': + return bidi::kind::PDI; + default: + break; + } + else if (p[2] == '0') + switch (p[3]) + { + case 'e': + case 'E': + return bidi::kind::LTR; + case 'f': + case 'F': + return bidi::kind::RTL; + default: + break; + } + + return bidi::kind::NONE; +} + +/* We're closing a bidi context, that is, we've encountered a newline, + are closing a C-style comment, or are at the end of a string literal, + character constant, or identifier. Warn if this context was not + properly terminated by a PDI or PDF. P points to the last character + in this context. 
*/ + +static void +maybe_warn_bidi_on_close (cpp_reader *pfile, const uchar *p) +{ + if (CPP_OPTION (pfile, cpp_warn_bidirectional) == bidirectional_unpaired + && bidi::vec.count () > 0) + { + const location_t loc + = linemap_position_for_column (pfile->line_table, + CPP_BUF_COLUMN (pfile->buffer, p)); + cpp_warning_with_line (pfile, CPP_W_BIDIRECTIONAL, loc, 0, + "unpaired UTF-8 bidirectional control character " + "detected"); + } + /* We're done with this context. */ + bidi::on_close (); +} + +/* We're at the beginning or in the middle of an identifier/comment/string + literal/character constant. Warn if we've encountered a bidi character. + KIND says which bidi character it was; P points to it in the character + stream. UCN_P is true iff this bidi character was written as a UCN. */ + +static void +maybe_warn_bidi_on_char (cpp_reader *pfile, const uchar *p, bidi::kind kind, + bool ucn_p) +{ + if (__builtin_expect (kind == bidi::kind::NONE, 1)) + return; + + const auto warn_bidi = CPP_OPTION (pfile, cpp_warn_bidirectional); + + if (warn_bidi != bidirectional_none) + { + const location_t loc + = linemap_position_for_column (pfile->line_table, + CPP_BUF_COLUMN (pfile->buffer, p)); + /* It seems excessive to warn about a PDI/PDF that is closing + an opened context because we've already warned about the + opening character. Except warn when we have a UCN x UTF-8 + mismatch. 
*/ + if (kind == bidi::current_ctx ()) + { + if (warn_bidi == bidirectional_unpaired + && bidi::current_ctx_ucn_p () != ucn_p) + cpp_warning_with_line (pfile, CPP_W_BIDIRECTIONAL, loc, 0, + "UTF-8 vs UCN mismatch when closing " + "a context by \"%s\"", bidi::to_str (kind)); + } + else if (warn_bidi == bidirectional_any) + { + if (kind == bidi::kind::PDF || kind == bidi::kind::PDI) + cpp_warning_with_line (pfile, CPP_W_BIDIRECTIONAL, loc, 0, + "\"%s\" is closing an unopened context", + bidi::to_str (kind)); + else + cpp_warning_with_line (pfile, CPP_W_BIDIRECTIONAL, loc, 0, + "found problematic Unicode character \"%s\"", + bidi::to_str (kind)); + } + } + /* We're done with this context. */ + bidi::on_char (kind, ucn_p); +} + /* Skip a C-style block comment. We find the end of the comment by seeing if an asterisk is before every '/' we encounter. Returns nonzero if comment terminated by EOF, zero otherwise. @@ -1175,6 +1493,7 @@ _cpp_skip_block_comment (cpp_reader *pfile) cpp_buffer *buffer = pfile->buffer; const uchar *cur = buffer->cur; uchar c; + const bool warn_bidi_p = pfile->warn_bidi_p (); cur++; if (*cur == '/') @@ -1189,7 +1508,11 @@ _cpp_skip_block_comment (cpp_reader *pfile) if (c == '/') { if (cur[-2] == '*') - break; + { + if (warn_bidi_p) + maybe_warn_bidi_on_close (pfile, cur); + break; + } /* Warn about potential nested comments, but not if the '/' comes immediately before the true comment delimiter. @@ -1208,6 +1531,8 @@ _cpp_skip_block_comment (cpp_reader *pfile) { unsigned int cols; buffer->cur = cur - 1; + if (warn_bidi_p) + maybe_warn_bidi_on_close (pfile, cur); _cpp_process_line_notes (pfile, true); if (buffer->next_line >= buffer->rlimit) return true; @@ -1218,6 +1543,13 @@ _cpp_skip_block_comment (cpp_reader *pfile) cur = buffer->cur; } + /* If this is a beginning of a UTF-8 encoding, it might be + a bidirectional control character. 
*/ + else if (__builtin_expect (c == bidi::utf8_start, 0) && warn_bidi_p) + { + bidi::kind kind = get_bidi_utf8 (cur - 1); + maybe_warn_bidi_on_char (pfile, cur, kind, /*ucn_p=*/false); + } } buffer->cur = cur; @@ -1233,9 +1565,31 @@ skip_line_comment (cpp_reader *pfile) { cpp_buffer *buffer = pfile->buffer; location_t orig_line = pfile->line_table->highest_line; + const bool warn_bidi_p = pfile->warn_bidi_p (); - while (*buffer->cur != '\n') - buffer->cur++; + if (!warn_bidi_p) + while (*buffer->cur != '\n') + buffer->cur++; + else + { + while (*buffer->cur != '\n' + && *buffer->cur != bidi::utf8_start) + buffer->cur++; + if (__builtin_expect (*buffer->cur == bidi::utf8_start, 0)) + { + while (*buffer->cur != '\n') + { + if (__builtin_expect (*buffer->cur == bidi::utf8_start, 0)) + { + bidi::kind kind = get_bidi_utf8 (buffer->cur); + maybe_warn_bidi_on_char (pfile, buffer->cur, kind, + /*ucn_p=*/false); + } + buffer->cur++; + } + maybe_warn_bidi_on_close (pfile, buffer->cur); + } + } _cpp_process_line_notes (pfile, true); return orig_line != pfile->line_table->highest_line; @@ -1346,11 +1700,13 @@ static const cppchar_t utf8_signifier = 0xC0; /* Returns TRUE if the sequence starting at buffer->cur is valid in an identifier. FIRST is TRUE if this starts an identifier. 
*/ + static bool forms_identifier_p (cpp_reader *pfile, int first, struct normalize_state *state) { cpp_buffer *buffer = pfile->buffer; + const bool warn_bidi_p = pfile->warn_bidi_p (); if (*buffer->cur == '$') { @@ -1373,6 +1729,13 @@ forms_identifier_p (cpp_reader *pfile, int first, cppchar_t s; if (*buffer->cur >= utf8_signifier) { + if (__builtin_expect (*buffer->cur == bidi::utf8_start, 0) + && warn_bidi_p) + { + bidi::kind kind = get_bidi_utf8 (buffer->cur); + maybe_warn_bidi_on_char (pfile, buffer->cur, kind, + /*ucn_p=*/false); + } if (_cpp_valid_utf8 (pfile, &buffer->cur, buffer->rlimit, 1 + !first, state, &s)) return true; @@ -1381,6 +1744,13 @@ forms_identifier_p (cpp_reader *pfile, int first, && (buffer->cur[1] == 'u' || buffer->cur[1] == 'U')) { buffer->cur += 2; + if (warn_bidi_p) + { + bidi::kind kind = get_bidi_ucn (buffer->cur, + buffer->cur[-1] == 'U'); + maybe_warn_bidi_on_char (pfile, buffer->cur, kind, + /*ucn_p=*/true); + } if (_cpp_valid_ucn (pfile, &buffer->cur, buffer->rlimit, 1 + !first, state, &s, NULL, NULL)) return true; @@ -1489,6 +1859,7 @@ lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn, const uchar *cur; unsigned int len; unsigned int hash = HT_HASHSTEP (0, *base); + const bool warn_bidi_p = pfile->warn_bidi_p (); cur = pfile->buffer->cur; if (! 
starts_ucn) @@ -1512,6 +1883,8 @@ lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn, pfile->buffer->cur++; } } while (forms_identifier_p (pfile, false, nst)); + if (warn_bidi_p) + maybe_warn_bidi_on_close (pfile, pfile->buffer->cur); result = _cpp_interpret_identifier (pfile, base, pfile->buffer->cur - base); *spelling = cpp_lookup (pfile, base, pfile->buffer->cur - base); @@ -1758,6 +2131,7 @@ static void lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base) { const uchar *pos = base; + const bool warn_bidi_p = pfile->warn_bidi_p (); /* 'tis a pity this information isn't passed down from the lexer's initial categorization of the token. */ @@ -1994,8 +2368,15 @@ lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base) pos = base = pfile->buffer->cur; note = &pfile->buffer->notes[pfile->buffer->cur_note]; } + else if (__builtin_expect ((unsigned char) c == bidi::utf8_start, 0) + && warn_bidi_p) + maybe_warn_bidi_on_char (pfile, pos - 1, get_bidi_utf8 (pos - 1), + /*ucn_p=*/false); } + if (warn_bidi_p) + maybe_warn_bidi_on_close (pfile, pos); + if (CPP_OPTION (pfile, user_literals)) { /* If a string format macro, say from inttypes.h, is placed touching @@ -2090,15 +2471,27 @@ lex_string (cpp_reader *pfile, cpp_token *token, const uchar *base) else terminator = '>', type = CPP_HEADER_NAME; + const bool warn_bidi_p = pfile->warn_bidi_p (); for (;;) { cppchar_t c = *cur++; /* In #include-style directives, terminators are not escapable. 
*/ if (c == '\\' && !pfile->state.angled_headers && *cur != '\n') - cur++; + { + if ((cur[0] == 'u' || cur[0] == 'U') && warn_bidi_p) + { + bidi::kind kind = get_bidi_ucn (cur + 1, cur[0] == 'U'); + maybe_warn_bidi_on_char (pfile, cur, kind, /*ucn_p=*/true); + } + cur++; + } else if (c == terminator) - break; + { + if (warn_bidi_p) + maybe_warn_bidi_on_close (pfile, cur - 1); + break; + } else if (c == '\n') { cur--; @@ -2115,6 +2508,11 @@ lex_string (cpp_reader *pfile, cpp_token *token, const uchar *base) } else if (c == '\0') saw_NUL = true; + else if (__builtin_expect (c == bidi::utf8_start, 0) && warn_bidi_p) + { + bidi::kind kind = get_bidi_utf8 (cur - 1); + maybe_warn_bidi_on_char (pfile, cur - 1, kind, /*ucn_p=*/false); + } } if (saw_NUL && !pfile->state.skipping) base-commit: 6b1695f4a094f99575c9d067da6277bb4302fb89 -- 2.33.1 ^ permalink raw reply [flat|nested] 27+ messages in thread
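For readers following the nesting rules in the bidi namespace above, here is a minimal standalone sketch of the same push/pop behaviour. It is only an illustration, not the libcpp code: the names bidi_kind and open_after are hypothetical, std::vector stands in for semi_embedded_vec, and the UCN-vs-UTF-8 origin bit is left out. It assumes the rules as stated in the patch comments: PDF pops only a directly enclosing embedding/override, while PDI closes the nearest isolate together with anything opened after it.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <vector>

// Hypothetical stand-in for bidi::kind from the patch.
enum class bidi_kind { NONE, LRE, RLE, LRO, RLO, LRI, RLI, FSI, PDF, PDI, LTR, RTL };

// Hypothetical helper: feed a sequence of control characters through the
// stack rules and return how many contexts are still open at the end.
// A nonzero result corresponds to the "unpaired" case.
std::size_t open_after (std::initializer_list<bidi_kind> seq)
{
  // Each entry records which closer the context is waiting for:
  // bidi_kind::PDF or bidi_kind::PDI.
  std::vector<bidi_kind> open;
  for (bidi_kind k : seq)
    switch (k)
      {
      case bidi_kind::LRE:
      case bidi_kind::RLE:
      case bidi_kind::LRO:
      case bidi_kind::RLO:
        open.push_back (bidi_kind::PDF);
        break;
      case bidi_kind::LRI:
      case bidi_kind::RLI:
      case bidi_kind::FSI:
        open.push_back (bidi_kind::PDI);
        break;
      case bidi_kind::PDF:
        // PDF only closes the innermost context, and only if that
        // context is an embedding or override.
        if (!open.empty () && open.back () == bidi_kind::PDF)
          open.pop_back ();
        break;
      case bidi_kind::PDI:
        // PDI closes the nearest isolate and everything nested in it.
        for (std::size_t i = open.size (); i-- > 0;)
          if (open[i] == bidi_kind::PDI)
            {
              open.resize (i);
              break;
            }
        break;
      default:
        // LTR/RTL marks and NONE never open or close a context.
        break;
      }
  return open.size ();
}
```

Under these assumptions, "LRE PDF" and "LRI LRE PDI" leave nothing open, whereas "LRE PDI" and "LRI PDF" leave one context open, which matches the "terminated by the wrong char" cases in Wbidi-chars-6.c.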
* Re: [PATCH v3] libcpp: Implement -Wbidi-chars for CVE-2021-42574 [PR103026] 2021-11-17 0:37 ` [PATCH v3] " Marek Polacek @ 2021-11-17 2:28 ` David Malcolm 2021-11-17 3:05 ` Marek Polacek 0 siblings, 1 reply; 27+ messages in thread From: David Malcolm @ 2021-11-17 2:28 UTC (permalink / raw) To: Marek Polacek; +Cc: Joseph Myers, Jakub Jelinek, Martin Sebor, GCC Patches On Tue, 2021-11-16 at 19:37 -0500, Marek Polacek wrote: > On Tue, Nov 16, 2021 at 06:00:58PM -0500, David Malcolm wrote: > > > On Mon, Nov 15, 2021 at 06:15:40PM -0500, David Malcolm wrote: > > > > > On Mon, Nov 08, 2021 at 04:33:43PM -0500, Marek Polacek wrote: > > > > > > Ping, can we conclude on the name? IMHO, -Wbidirectional is > > > > > > just fine, > > > > > > but changing the name is a trivial operation. > > > > > > > > > > Here's a patch with a better name (suggested by Jonathan W.). > > > > > Otherwise no > > > > > changes. > > > > > > > > Thanks for implementing this. > > > > > > > > > > > > > > Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk? > > > > > > > > > > -- >8 -- > > > > > From a link below: > > > > > "An issue was discovered in the Bidirectional Algorithm in the > > > > > Unicode > > > > > Specification through 14.0. It permits the visual reordering of > > > > > characters via control sequences, which can be used to craft > > > > > source code > > > > > that renders different logic than the logical ordering of > > > > > tokens > > > > > ingested by compilers and interpreters. Adversaries can > > > > > leverage this to > > > > > encode source code for compilers accepting Unicode such that > > > > > targeted > > > > > vulnerabilities are introduced invisibly to human reviewers." > > > > > > > > > > More info: > > > > > https://nvd.nist.gov/vuln/detail/CVE-2021-42574 > > > > > https://trojansource.codes/ > > > > > > > > > > This is not a compiler bug. 
However, to mitigate the problem, > > > > > this patch > > > > > implements -Wbidi-chars=[none|unpaired|any] to warn about > > > > > possibly > > > > > misleading Unicode bidirectional characters the preprocessor > > > > > may encounter. > > > > [...snip...] > > > > > > > > > > Terminology nit: > > > > The patch is referring to "bidirectional characters", but I think > > > > the > > > > term "bidirectional control characters" would be better. > > > > > > Adjusted. > > > > Thanks. > > > > I wonder if the warning should be -Wbidi-control-chars, but I don't > > care enough to insist on it being changed. > > > > > > > > > For example, a passage of text containing both numbers and > > > > characters > > > > in a right-to-left script could be considered "bidirectional", > > > > since > > > > the numbers are written from left-to-right. > > > > > > > > Specifically, the patch looks for these specific characters: > > > > * U+202A LEFT-TO-RIGHT EMBEDDING > > > > * U+202B RIGHT-TO-LEFT EMBEDDING > > > > * U+202C POP DIRECTIONAL FORMATTING > > > > * U+202D LEFT-TO-RIGHT OVERRIDE > > > > * U+202E RIGHT-TO-LEFT OVERRIDE > > > > * U+2066 LEFT-TO-RIGHT ISOLATE > > > > * U+2067 RIGHT-TO-LEFT ISOLATE > > > > * U+2068 FIRST STRONG ISOLATE > > > > * U+2069 POP DIRECTIONAL ISOLATE > > > > > > > > However, the following characters could also be considered as > > > > "bidirectional control characters": > > > > * U+200E LEFT-TO-RIGHT MARK (UTF-8: E2 80 8E) > > > > * U+200F RIGHT-TO-LEFT MARK (UTF-8: E2 80 8F) > > > > but aren't checked for in the patch. Should they be? I can > > > > imagine > > > > ways in which they could be abused, so I think so. > > > > > > I'd only intended to check the bidi chars described in the original > > > trojan source pdf, but I added checking for U+200E/U+200F too, > > > since > > > it was easy enough. AFAIK they aren't popped by a PDF/PDI like the > > > rest, so don't need to go on the vec, and so we only warn with > > > =any. 
> > > Tests: Wbidi-chars-16.c + Wbidi-chars-17.c
> >
> > Thanks. I took a look through the revised patch and I think you
> > updated things correctly.
> >
> > [...snip...]
> >
> > > > > diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-4.c
> > > > > b/gcc/testsuite/c-c++-common/Wbidi-chars-4.c
> > > > > new file mode 100644
> > > > > index 00000000000..9fd4bc535ca
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-4.c
> > > > > @@ -0,0 +1,166 @@
> > > > > +/* PR preprocessor/103026 */
> > > > > +/* { dg-do compile } */
> > > > > +/* { dg-options "-Wbidi-chars=any -Wno-multichar -Wno-
> > > > > overflow" } */
> > > > > +/* Test all bidi chars in various contexts (identifiers,
> > > > > comments,
> > > > > +   string literals, character constants), both UCN and UTF-8.
> > > > > The bidi
> > > > > +   chars here are properly terminated, except for the
> > > > > character constants. */
> > > > > +
> > > > > +/* a b c LRE 1 2 3 PDF x y z */
> > > > > +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */
> > > > > +/* a b c RLE 1 2 3 PDF x y z */
> > > > > +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */
> > > > > +/* a b c LRO 1 2 3 PDF x y z */
> > > > > +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */
> > > > > +/* a b c RLO 1 2 3 PDF x y z */
> > > > > +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */
> > > > > +/* a b c LRI 1 2 3 PDI x y z */
> > > > > +/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */
> > > > > +/* a b c RLI 1 2 3 PDI x y */
> > > > > +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */
> > > > > +/* a b c FSI 1 2 3 PDI x y z */
> > > > > +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */
> > > >
> > > > AIUI the Unicode bidirectionality algorithm works at the line
> > > > level,
> > > > and so each line in a block comment should be checked
> > > > individually for
> > > > unclosed bidi control chars, rather than a block comment as a
> > > > whole.
> > > > Hence I think the test case needs to have block comment test
> > > > coverage
> > > > for:
> > > > - single line blocks
> > > > - first line of a multiline block comment
> > > > - middle line of a multiline block comment
> > > > - final line of a multiline block comment
> > > > but I think the patch as it stands is only checking for the first
> > > > of
> > > > these four cases.
> > >
> > > The patch handles all of them, because of:
> > > 1534     if (warn_bidi_p)
> > > 1535       maybe_warn_bidi_on_close (pfile, cur);
> > > in _cpp_skip_block_comment, but I was lacking some more testing, so
> > > I've
> > > added some testing, and included a new test: Wbidi-chars-15.c.
> >
> > All of the cases in Wbidi-chars-15.c only test for unpaired chars in
> > a
> > middle line of a multiline block comment; I don't think the patch has
> > any explicit coverage for unpaired control chars happening in the
> > first
> > line and last lines of *multiline* block comments. So it would be
> > good
> > if Wbidi-chars-15.c could gain some coverage for that (don't have to
> > handle all the different chars).
>
> Sorry for a dumb question, but is this what you have in mind?
>
> /* LRE
>    PDF */
> /* FSI
>    PDI */
> and check that we warn for these?

I mean something like the following multiline comments in which lines
within them at the start, middle and end have unpaired constructs
within a given line:

/* RLI
 *
 */

/*
 * RLI
 */

/*
 *
 * RLI */

and that we should warn for each case at the line containing the
unpaired control character.

(the above lines don't have the actual chars, just "RLI")

Mostly this is just me trying to think about it from a black-box
testing perspective, or in case we ever touch this code in the future
(perhaps it's obviously correct by inspection of the implementation
now, but let's have regression tests for these cases).
Sorry to add more work, but here's an idea for another test case: multiple comments on one line: /* RLI */ /* PDF */ where the closure of a comment should trigger closing a "context", so we should complain about the above. > > > > > > @@ -1505,13 +1855,17 @@ lex_identifier (cpp_reader *pfile, > > > > > const uchar *base, bool starts_ucn, > > > > > { > > > > > /* Slower version for identifiers containing UCNs > > > > > or extended chars (including $). */ > > > > > - do { > > > > > - while (ISIDNUM (*pfile->buffer->cur)) > > > > > - { > > > > > - NORMALIZE_STATE_UPDATE_IDNUM (nst, *pfile->buffer- > > > > > >cur); > > > > > - pfile->buffer->cur++; > > > > > - } > > > > > - } while (forms_identifier_p (pfile, false, nst)); > > > > > + do > > > > > + { > > > > > + while (ISIDNUM (*pfile->buffer->cur)) > > > > > + { > > > > > + NORMALIZE_STATE_UPDATE_IDNUM (nst, *pfile- > > > > > >buffer->cur); > > > > > + pfile->buffer->cur++; > > > > > + } > > > > > + } > > > > > + while (forms_identifier_p (pfile, false, nst)); > > > > > > > > Is the above purely a whitespace change? > > > > > > Yes. > > > > If I'm reading things correctly, these lines in the existing code > > were > > correctly indented, so is there a purpose to this change? If not, > > please can you remove this change from the patch (to minimize the > > change to the history). > > I dropped that change then. Sometimes it's hard to resist fixing > formatting. ;) Thanks. But I don't think the existing formatting in the code *was* broken; I thought the patch was taking correct formatting and breaking it (hence my objection to a whitespace change). If I misread this, sorry. > > > [...snip...] > > > > > +/* We're closing a bidi context, that is, we've encountered a > > > newline, > > > + are closing a C-style comment, or are at the end of a string > > > literal, > > > + character constant, or identifier. Warn if this context was > > > not > > > + properly terminated by a PDI or PDF. 
P points to the last > > > character > > > + in this context. */ > > > + > > > +static void > > > +maybe_warn_bidi_on_close (cpp_reader *pfile, const uchar *p) > > > +{ > > > + if (CPP_OPTION (pfile, cpp_warn_bidirectional) == > > > bidirectional_unpaired > > > + && bidi::vec.count () > 0) > > > + { > > > + const location_t loc > > > + = linemap_position_for_column (pfile->line_table, > > > + CPP_BUF_COLUMN (pfile- > > > >buffer, p)); > > > + cpp_warning_with_line (pfile, CPP_W_BIDIRECTIONAL, loc, 0, > > > + "unpaired UTF-8 bidirectional > > > character " > > > + "detected"); > > > + } > > > > Sorry, I missed this one in my initial review, should be "control > > character" here. > > Fixed. > > > [...snip...] > > > > OK for trunk with the above nits fixed. > > Thanks again for the review. > > I'll push this once the test question above is resolved. Hopefully the above makes sense and is constructive; let me know when you push your patch so that I can work on my followup. Thanks Dave ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v3] libcpp: Implement -Wbidi-chars for CVE-2021-42574 [PR103026]
  2021-11-17 2:28 ` David Malcolm
@ 2021-11-17 3:05 ` Marek Polacek
  2021-11-17 22:45 ` [committed] libcpp: escape non-ASCII source bytes in -Wbidi-chars= [PR103026] David Malcolm
  0 siblings, 1 reply; 27+ messages in thread
From: Marek Polacek @ 2021-11-17 3:05 UTC (permalink / raw)
  To: David Malcolm; +Cc: Joseph Myers, Jakub Jelinek, Martin Sebor, GCC Patches

On Tue, Nov 16, 2021 at 09:28:21PM -0500, David Malcolm wrote:
> On Tue, 2021-11-16 at 19:37 -0500, Marek Polacek wrote:
> > Sorry for a dumb question, but is this what you have in mind?
> >
> > /* LRE
> >    PDF */
> > /* FSI
> >    PDI */
> > and check that we warn for these?
>
> I mean something like the following multiline comments in which lines
> within them at the start, middle and end have unpaired constructs
> within a given line:
>
>
> /* RLI
>  *
>  */
>
> /*
>  * RLI
>  */
>
> /*
>  *
>  * RLI */
>
> and that we should warn for each case at the line containing the
> unpaired control character.
>
> (the above lines don't have the actual chars, just "RLI")
>
> Mostly this is just me trying to think about it from a black-box
> testing perspective, or in case we ever touch this code in the future
> (perhaps it's obviously correct by inspection of the implementation
> now, but let's have regression tests for these cases).
>
> Sorry to add more work, but here's an idea for another test case:
> multiple comments on one line:
>
> /* RLI */ /* PDF */
>
> where the closure of a comment should trigger closing a "context", so
> we should complain about the above.

No problem, I've added these.

> >
> > > > > > @@ -1505,13 +1855,17 @@ lex_identifier (cpp_reader *pfile,
> > > > > > const uchar *base, bool starts_ucn,
> > > > > >      {
> > > > > >        /* Slower version for identifiers containing UCNs
> > > > > >           or extended chars (including $).
*/ > > > > > > - do { > > > > > > - while (ISIDNUM (*pfile->buffer->cur)) > > > > > > - { > > > > > > - NORMALIZE_STATE_UPDATE_IDNUM (nst, *pfile->buffer- > > > > > > >cur); > > > > > > - pfile->buffer->cur++; > > > > > > - } > > > > > > - } while (forms_identifier_p (pfile, false, nst)); > > > > > > + do > > > > > > + { > > > > > > + while (ISIDNUM (*pfile->buffer->cur)) > > > > > > + { > > > > > > + NORMALIZE_STATE_UPDATE_IDNUM (nst, *pfile- > > > > > > >buffer->cur); > > > > > > + pfile->buffer->cur++; > > > > > > + } > > > > > > + } > > > > > > + while (forms_identifier_p (pfile, false, nst)); > > > > > > > > > > Is the above purely a whitespace change? > > > > > > > > Yes. > > > > > > If I'm reading things correctly, these lines in the existing code > > > were > > > correctly indented, so is there a purpose to this change? If not, > > > please can you remove this change from the patch (to minimize the > > > change to the history). > > > > I dropped that change then. Sometimes it's hard to resist fixing > > formatting. ;) > > Thanks. But I don't think the existing formatting in the code *was* > broken; I thought the patch was taking correct formatting and breaking > it (hence my objection to a whitespace change). If I misread this, > sorry. I think it was, we're supposed to format do-while as do { } while (...); but it's obviously not a big deal. > Hopefully the above makes sense and is constructive; let me know when > you push your patch so that I can work on my followup. Pushed now. Thanks! Marek ^ permalink raw reply [flat|nested] 27+ messages in thread
* [committed] libcpp: escape non-ASCII source bytes in -Wbidi-chars= [PR103026] 2021-11-17 3:05 ` Marek Polacek @ 2021-11-17 22:45 ` David Malcolm 2021-11-17 22:45 ` [PATCH 2/2] libcpp: capture and underline ranges " David Malcolm 0 siblings, 1 reply; 27+ messages in thread From: David Malcolm @ 2021-11-17 22:45 UTC (permalink / raw) To: Marek Polacek Cc: Joseph Myers, Jakub Jelinek, Martin Sebor, GCC Patches, David Malcolm This flags rich_locations associated with -Wbidi-chars= so that non-ASCII bytes will be escaped when printing the source lines (using the diagnostics support I added in r12-4825-gbd5e882cf6e0def3dd1bc106075d59a303fe0d1e). In particular, this ensures that the printed source lines will be pure ASCII, and thus the visual ordering of the characters will be the same as the logical ordering. Before: Wbidi-chars-1.c: In function ‘main’: Wbidi-chars-1.c:6:43: warning: unpaired UTF-8 bidirectional control character detected [-Wbidi-chars=] 6 | /* } if (isAdmin) begin admins only */ | ^ Wbidi-chars-1.c:9:28: warning: unpaired UTF-8 bidirectional control character detected [-Wbidi-chars=] 9 | /* end admins only { */ | ^ Wbidi-chars-11.c:6:15: warning: UTF-8 vs UCN mismatch when closing a context by "U+202C (POP DIRECTIONAL FORMATTING)" [-Wbidi-chars=] 6 | int LRE__PDF_\u202c; | ^ Wbidi-chars-11.c:8:19: warning: UTF-8 vs UCN mismatch when closing a context by "U+202C (POP DIRECTIONAL FORMATTING)" [-Wbidi-chars=] 8 | int LRE_\u202a_PDF__; | ^ Wbidi-chars-11.c:10:28: warning: UTF-8 vs UCN mismatch when closing a context by "U+202C (POP DIRECTIONAL FORMATTING)" [-Wbidi-chars=] 10 | const char *s1 = "LRE__PDF_\u202c"; | ^ Wbidi-chars-11.c:12:33: warning: UTF-8 vs UCN mismatch when closing a context by "U+202C (POP DIRECTIONAL FORMATTING)" [-Wbidi-chars=] 12 | const char *s2 = "LRE_\u202a_PDF_"; | ^ After: Wbidi-chars-1.c: In function ‘main’: Wbidi-chars-1.c:6:43: warning: unpaired UTF-8 bidirectional control character detected [-Wbidi-chars=] 6 | /*<U+202E> } 
<U+2066>if (isAdmin)<U+2069> <U+2066> begin admins only */ | ^ Wbidi-chars-1.c:9:28: warning: unpaired UTF-8 bidirectional control character detected [-Wbidi-chars=] 9 | /* end admins only <U+202E> { <U+2066>*/ | ^ Wbidi-chars-11.c:6:15: warning: UTF-8 vs UCN mismatch when closing a context by "U+202C (POP DIRECTIONAL FORMATTING)" [-Wbidi-chars=] 6 | int LRE_<U+202A>_PDF_\u202c; | ^ Wbidi-chars-11.c:8:19: warning: UTF-8 vs UCN mismatch when closing a context by "U+202C (POP DIRECTIONAL FORMATTING)" [-Wbidi-chars=] 8 | int LRE_\u202a_PDF_<U+202C>_; | ^ Wbidi-chars-11.c:10:28: warning: UTF-8 vs UCN mismatch when closing a context by "U+202C (POP DIRECTIONAL FORMATTING)" [-Wbidi-chars=] 10 | const char *s1 = "LRE_<U+202A>_PDF_\u202c"; | ^ Wbidi-chars-11.c:12:33: warning: UTF-8 vs UCN mismatch when closing a context by "U+202C (POP DIRECTIONAL FORMATTING)" [-Wbidi-chars=] 12 | const char *s2 = "LRE_\u202a_PDF_<U+202C>"; | ^ Successfully bootstrapped & regrtested on x86_64-pc-linux-gnu. Pushed to trunk as r12-5355-g1a7f2c0774129750fdf73e9f1b78f0ce983c9ab3. libcpp/ChangeLog: PR preprocessor/103026 * lex.c (maybe_warn_bidi_on_close): Use a rich_location and call set_escape_on_output (true) on it. (maybe_warn_bidi_on_char): Likewise. 
Signed-off-by: David Malcolm <dmalcolm@redhat.com> --- libcpp/lex.c | 29 +++++++++++++++++------------ 1 file changed, 17 insertions(+), 12 deletions(-) diff --git a/libcpp/lex.c b/libcpp/lex.c index 6a4fbce6030..8290bc637cd 100644 --- a/libcpp/lex.c +++ b/libcpp/lex.c @@ -1427,9 +1427,11 @@ maybe_warn_bidi_on_close (cpp_reader *pfile, const uchar *p) const location_t loc = linemap_position_for_column (pfile->line_table, CPP_BUF_COLUMN (pfile->buffer, p)); - cpp_warning_with_line (pfile, CPP_W_BIDIRECTIONAL, loc, 0, - "unpaired UTF-8 bidirectional control character " - "detected"); + rich_location rich_loc (pfile->line_table, loc); + rich_loc.set_escape_on_output (true); + cpp_warning_at (pfile, CPP_W_BIDIRECTIONAL, &rich_loc, + "unpaired UTF-8 bidirectional control character " + "detected"); } /* We're done with this context. */ bidi::on_close (); @@ -1454,6 +1456,9 @@ maybe_warn_bidi_on_char (cpp_reader *pfile, const uchar *p, bidi::kind kind, const location_t loc = linemap_position_for_column (pfile->line_table, CPP_BUF_COLUMN (pfile->buffer, p)); + rich_location rich_loc (pfile->line_table, loc); + rich_loc.set_escape_on_output (true); + /* It seems excessive to warn about a PDI/PDF that is closing an opened context because we've already warned about the opening character. 
Except warn when we have a UCN x UTF-8 @@ -1462,20 +1467,20 @@ maybe_warn_bidi_on_char (cpp_reader *pfile, const uchar *p, bidi::kind kind, { if (warn_bidi == bidirectional_unpaired && bidi::current_ctx_ucn_p () != ucn_p) - cpp_warning_with_line (pfile, CPP_W_BIDIRECTIONAL, loc, 0, - "UTF-8 vs UCN mismatch when closing " - "a context by \"%s\"", bidi::to_str (kind)); + cpp_warning_at (pfile, CPP_W_BIDIRECTIONAL, &rich_loc, + "UTF-8 vs UCN mismatch when closing " + "a context by \"%s\"", bidi::to_str (kind)); } else if (warn_bidi == bidirectional_any) { if (kind == bidi::kind::PDF || kind == bidi::kind::PDI) - cpp_warning_with_line (pfile, CPP_W_BIDIRECTIONAL, loc, 0, - "\"%s\" is closing an unopened context", - bidi::to_str (kind)); + cpp_warning_at (pfile, CPP_W_BIDIRECTIONAL, &rich_loc, + "\"%s\" is closing an unopened context", + bidi::to_str (kind)); else - cpp_warning_with_line (pfile, CPP_W_BIDIRECTIONAL, loc, 0, - "found problematic Unicode character \"%s\"", - bidi::to_str (kind)); + cpp_warning_at (pfile, CPP_W_BIDIRECTIONAL, &rich_loc, + "found problematic Unicode character \"%s\"", + bidi::to_str (kind)); } } /* We're done with this context. */ -- 2.26.3 ^ permalink raw reply [flat|nested] 27+ messages in thread
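The effect of escaping on the quoted source line can be sketched outside of GCC. The function below is a hypothetical stand-in written for this illustration; it is not the diagnostics code behind set_escape_on_output. It assumes UTF-8 input, renders each decoded non-ASCII character as <U+XXXX>, and renders any byte that does not start a well-formed sequence as <XX> in hex. Overlong forms and surrogates are not rejected in this simplified version.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper, not a GCC API: return LINE with every non-ASCII
// character replaced by an ASCII escape so that visual order equals
// logical order.
std::string escape_non_ascii (const std::string &line)
{
  std::string out;
  std::size_t i = 0;
  while (i < line.size ())
    {
      const unsigned char c = static_cast<unsigned char> (line[i]);
      if (c < 0x80)
        {
          out += static_cast<char> (c);
          ++i;
          continue;
        }

      // Work out the sequence length and the payload bits of the lead byte.
      std::size_t len = 0;
      unsigned long cp = 0;
      if (c >= 0xC2 && c <= 0xDF)      { len = 2; cp = c & 0x1F; }
      else if (c >= 0xE0 && c <= 0xEF) { len = 3; cp = c & 0x0F; }
      else if (c >= 0xF0 && c <= 0xF4) { len = 4; cp = c & 0x07; }

      bool ok = len != 0 && i + len <= line.size ();
      for (std::size_t k = 1; ok && k < len; ++k)
        {
          const unsigned char cc = static_cast<unsigned char> (line[i + k]);
          if ((cc & 0xC0) != 0x80)
            ok = false;                // not a continuation byte
          else
            cp = (cp << 6) | (cc & 0x3F);
        }

      char buf[16];
      if (ok)
        {
          std::snprintf (buf, sizeof buf, "<U+%04lX>", cp);
          i += len;
        }
      else
        {
          // Assumed fallback for this sketch: show the raw byte in hex.
          std::snprintf (buf, sizeof buf, "<%02X>", static_cast<unsigned> (c));
          i += 1;
        }
      out += buf;
    }
  return out;
}
```

Applied to the identifier line from Wbidi-chars-11.c, the raw E2 80 AA bytes become <U+202A> while the \u202c UCN, already plain ASCII, is left alone, which is the difference visible between the "Before" and "After" listings in this message.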
* [PATCH 2/2] libcpp: capture and underline ranges in -Wbidi-chars= [PR103026] 2021-11-17 22:45 ` [committed] libcpp: escape non-ASCII source bytes in -Wbidi-chars= [PR103026] David Malcolm @ 2021-11-17 22:45 ` David Malcolm 2021-11-17 23:01 ` Marek Polacek 0 siblings, 1 reply; 27+ messages in thread From: David Malcolm @ 2021-11-17 22:45 UTC (permalink / raw) To: Marek Polacek Cc: Joseph Myers, Jakub Jelinek, Martin Sebor, GCC Patches, David Malcolm This patch converts the bidi::vec to use a struct so that we can capture location_t values for the bidirectional control characters. Before: Wbidi-chars-1.c: In function ‘main’: Wbidi-chars-1.c:6:43: warning: unpaired UTF-8 bidirectional control character detected [-Wbidi-chars=] 6 | /*<U+202E> } <U+2066>if (isAdmin)<U+2069> <U+2066> begin admins only */ | ^ Wbidi-chars-1.c:9:28: warning: unpaired UTF-8 bidirectional control character detected [-Wbidi-chars=] 9 | /* end admins only <U+202E> { <U+2066>*/ | ^ After: Wbidi-chars-1.c: In function ‘main’: Wbidi-chars-1.c:6:43: warning: unpaired UTF-8 bidirectional control characters detected [-Wbidi-chars=] 6 | /*<U+202E> } <U+2066>if (isAdmin)<U+2069> <U+2066> begin admins only */ | ~~~~~~~~ ~~~~~~~~ ^ | | | | | | | end of bidirectional context | U+202E (RIGHT-TO-LEFT OVERRIDE) U+2066 (LEFT-TO-RIGHT ISOLATE) Wbidi-chars-1.c:9:28: warning: unpaired UTF-8 bidirectional control characters detected [-Wbidi-chars=] 9 | /* end admins only <U+202E> { <U+2066>*/ | ~~~~~~~~ ~~~~~~~~ ^ | | | | | | | end of bidirectional context | | U+2066 (LEFT-TO-RIGHT ISOLATE) | U+202E (RIGHT-TO-LEFT OVERRIDE) Successfully bootstrapped & regrtested on x86_64-pc-linux-gnu. Pushed to trunk as r12-5356-gbef32d4a28595e933f24fef378cf052a30b674a7. Signed-off-by: David Malcolm <dmalcolm@redhat.com> gcc/testsuite/ChangeLog: PR preprocessor/103026 * c-c++-common/Wbidi-chars-ranges.c: New test. libcpp/ChangeLog: PR preprocessor/103026 * lex.c (struct bidi::context): New. 
(bidi::vec): Convert to a vec of context rather than unsigned char. (bidi::ctx_at): Rename to... (bidi::pop_kind_at): ...this and reimplement for above change. (bidi::current_ctx): Update for change to vec. (bidi::current_ctx_ucn_p): Likewise. (bidi::current_ctx_loc): New. (bidi::on_char): Update for usage of context struct. Add "loc" param and pass it when pushing contexts. (get_location_for_byte_range_in_cur_line): New. (get_bidi_utf8): Rename to... (get_bidi_utf8_1): ...this, reintroducing... (get_bidi_utf8): ...as a wrapper, setting *OUT when the result is not NONE. (get_bidi_ucn): Rename to... (get_bidi_ucn_1): ...this, reintroducing... (get_bidi_ucn): ...as a wrapper, setting *OUT when the result is not NONE. (class unpaired_bidi_rich_location): New. (maybe_warn_bidi_on_close): Use unpaired_bidi_rich_location when reporting on unpaired bidi chars. Split into singular vs plural spellings. (maybe_warn_bidi_on_char): Pass in a location_t rather than a const uchar * and use it when emitting warnings, and when calling bidi::on_char. (_cpp_skip_block_comment): Capture location when kind is not NONE and pass it to maybe_warn_bidi_on_char. (skip_line_comment): Likewise. (forms_identifier_p): Likewise. (lex_raw_string): Likewise. (lex_string): Likewise. 
Signed-off-by: David Malcolm <dmalcolm@redhat.com> --- .../c-c++-common/Wbidi-chars-ranges.c | 54 ++++ libcpp/lex.c | 251 ++++++++++++++---- 2 files changed, 257 insertions(+), 48 deletions(-) create mode 100644 gcc/testsuite/c-c++-common/Wbidi-chars-ranges.c diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-ranges.c b/gcc/testsuite/c-c++-common/Wbidi-chars-ranges.c new file mode 100644 index 00000000000..298750a2a64 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-ranges.c @@ -0,0 +1,54 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidi-chars=unpaired -fdiagnostics-show-caret" } */ +/* Verify that we escape and underline pertinent bidirectional + control characters when quoting the source. */ + +int test_unpaired_bidi () { + int isAdmin = 0; + /* } if (isAdmin) begin admins only */ +/* { dg-warning "bidirectional" "" { target *-*-* } .-1 } */ +#if 0 + { dg-begin-multiline-output "" } + /*<U+202E> } <U+2066>if (isAdmin)<U+2069> <U+2066> begin admins only */ + ~~~~~~~~ ~~~~~~~~ ^ + | | | + | | end of bidirectional context + U+202E (RIGHT-TO-LEFT OVERRIDE) U+2066 (LEFT-TO-RIGHT ISOLATE) + { dg-end-multiline-output "" } +#endif + + __builtin_printf("You are an admin.\n"); + /* end admins only { */ +/* { dg-warning "bidirectional" "" { target *-*-* } .-1 } */ +#if 0 + { dg-begin-multiline-output "" } + /* end admins only <U+202E> { <U+2066>*/ + ~~~~~~~~ ~~~~~~~~ ^ + | | | + | | end of bidirectional context + | U+2066 (LEFT-TO-RIGHT ISOLATE) + U+202E (RIGHT-TO-LEFT OVERRIDE) + { dg-end-multiline-output "" } +#endif + + return 0; +} + +int LRE__PDF_\u202c; +/* { dg-warning "mismatch" "" { target *-*-* } .-1 } */ +#if 0 + { dg-begin-multiline-output "" } + int LRE_<U+202A>_PDF_\u202c; + ~~~~~~~~ ^~~~~~ + { dg-end-multiline-output "" } +#endif + +const char *s1 = "LRE__PDF_\u202c"; +/* { dg-warning "mismatch" "" { target *-*-* } .-1 } */ +#if 0 + { dg-begin-multiline-output "" } + const char *s1 = "LRE_<U+202A>_PDF_\u202c"; + 
~~~~~~~~ ^~~~~~ + { dg-end-multiline-output "" } +#endif diff --git a/libcpp/lex.c b/libcpp/lex.c index 8290bc637cd..a83392ab11c 100644 --- a/libcpp/lex.c +++ b/libcpp/lex.c @@ -1172,11 +1172,34 @@ namespace bidi { /* All the UTF-8 encodings of bidi characters start with E2. */ constexpr uchar utf8_start = 0xe2; + struct context + { + context () {} + context (location_t loc, kind k, bool pdf, bool ucn) + : m_loc (loc), m_kind (k), m_pdf (pdf), m_ucn (ucn) + { + } + + kind get_pop_kind () const + { + return m_pdf ? kind::PDF : kind::PDI; + } + bool ucn_p () const + { + return m_ucn; + } + + location_t m_loc; + kind m_kind; + unsigned m_pdf : 1; + unsigned m_ucn : 1; + }; + /* A vector holding currently open bidi contexts. We use a char for each context, its LSB is 1 if it represents a PDF context, 0 if it represents a PDI context. The next bit is 1 if this context was open by a bidi character written as a UCN, and 0 when it was UTF-8. */ - semi_embedded_vec <unsigned char, 16> vec; + semi_embedded_vec <context, 16> vec; /* Close the whole comment/identifier/string literal/character constant context. */ @@ -1193,19 +1216,19 @@ namespace bidi { vec.truncate (len - 1); } - /* Return the context of the Ith element. */ - kind ctx_at (unsigned int i) + /* Return the pop kind of the context of the Ith element. */ + kind pop_kind_at (unsigned int i) { - return (vec[i] & 1) ? kind::PDF : kind::PDI; + return vec[i].get_pop_kind (); } - /* Return which context is currently opened. */ + /* Return the pop kind of the context that is currently opened. 
*/ kind current_ctx () { unsigned int len = vec.count (); if (len == 0) return kind::NONE; - return ctx_at (len - 1); + return vec[len - 1].get_pop_kind (); } /* Return true if the current context comes from a UCN origin, that is, @@ -1214,11 +1237,19 @@ namespace bidi { { unsigned int len = vec.count (); gcc_checking_assert (len > 0); - return (vec[len - 1] >> 1) & 1; + return vec[len - 1].m_ucn; } - /* We've read a bidi char, update the current vector as necessary. */ - void on_char (kind k, bool ucn_p) + location_t current_ctx_loc () + { + unsigned int len = vec.count (); + gcc_checking_assert (len > 0); + return vec[len - 1].m_loc; + } + + /* We've read a bidi char, update the current vector as necessary. + LOC is only valid when K is not kind::NONE. */ + void on_char (kind k, bool ucn_p, location_t loc) { switch (k) { @@ -1226,12 +1257,12 @@ namespace bidi { case kind::RLE: case kind::LRO: case kind::RLO: - vec.push (ucn_p ? 3u : 1u); + vec.push (context (loc, k, true, ucn_p)); break; case kind::LRI: case kind::RLI: case kind::FSI: - vec.push (ucn_p ? 2u : 0u); + vec.push (context (loc, k, false, ucn_p)); break; /* PDF terminates the scope of the last LRE, RLE, LRO, or RLO whose scope has not yet been terminated. */ @@ -1245,7 +1276,7 @@ namespace bidi { yet been terminated. */ case kind::PDI: for (int i = vec.count () - 1; i >= 0; --i) - if (ctx_at (i) == kind::PDI) + if (pop_kind_at (i) == kind::PDI) { vec.truncate (i); break; @@ -1295,10 +1326,47 @@ namespace bidi { } } +/* Get location_t for the range of bytes [START, START + NUM_BYTES) + within the current line in FILE, with the caret at START. */ + +static location_t +get_location_for_byte_range_in_cur_line (cpp_reader *pfile, + const unsigned char *const start, + size_t num_bytes) +{ + gcc_checking_assert (num_bytes > 0); + + /* CPP_BUF_COLUMN and linemap_position_for_column both refer + to offsets in bytes, but CPP_BUF_COLUMN is 0-based, + whereas linemap_position_for_column is 1-based. 
*/ + + /* Get 0-based offsets within the line. */ + size_t start_offset = CPP_BUF_COLUMN (pfile->buffer, start); + size_t end_offset = start_offset + num_bytes - 1; + + /* Now convert to location_t, where "columns" are 1-based byte offsets. */ + location_t start_loc = linemap_position_for_column (pfile->line_table, + start_offset + 1); + location_t end_loc = linemap_position_for_column (pfile->line_table, + end_offset + 1); + + if (start_loc == end_loc) + return start_loc; + + source_range src_range; + src_range.m_start = start_loc; + src_range.m_finish = end_loc; + location_t combined_loc = COMBINE_LOCATION_DATA (pfile->line_table, + start_loc, + src_range, + NULL); + return combined_loc; +} + /* Parse a sequence of 3 bytes starting with P and return its bidi code. */ static bidi::kind -get_bidi_utf8 (const unsigned char *const p) +get_bidi_utf8_1 (const unsigned char *const p) { gcc_checking_assert (p[0] == bidi::utf8_start); @@ -1340,10 +1408,25 @@ get_bidi_utf8 (const unsigned char *const p) return bidi::kind::NONE; } +/* Parse a sequence of 3 bytes starting with P and return its bidi code. + If the kind is not NONE, write the location to *OUT.*/ + +static bidi::kind +get_bidi_utf8 (cpp_reader *pfile, const unsigned char *const p, location_t *out) +{ + bidi::kind result = get_bidi_utf8_1 (p); + if (result != bidi::kind::NONE) + { + /* We have a sequence of 3 bytes starting at P. */ + *out = get_location_for_byte_range_in_cur_line (pfile, p, 3); + } + return result; +} + /* Parse a UCN where P points just past \u or \U and return its bidi code. */ static bidi::kind -get_bidi_ucn (const unsigned char *p, bool is_U) +get_bidi_ucn_1 (const unsigned char *p, bool is_U) { /* 6.4.3 Universal Character Names \u hex-quad @@ -1412,6 +1495,62 @@ get_bidi_ucn (const unsigned char *p, bool is_U) return bidi::kind::NONE; } +/* Parse a UCN where P points just past \u or \U and return its bidi code. 
+ If the kind is not NONE, write the location to *OUT.*/ + +static bidi::kind +get_bidi_ucn (cpp_reader *pfile, const unsigned char *p, bool is_U, + location_t *out) +{ + bidi::kind result = get_bidi_ucn_1 (p, is_U); + if (result != bidi::kind::NONE) + { + const unsigned char *start = p - 2; + size_t num_bytes = 2 + (is_U ? 8 : 4); + *out = get_location_for_byte_range_in_cur_line (pfile, start, num_bytes); + } + return result; +} + +/* Subclass of rich_location for reporting on unpaired UTF-8 + bidirectional control character(s). + Escape the source lines on output, and show all unclosed + bidi context, labelling everything. */ + +class unpaired_bidi_rich_location : public rich_location +{ + public: + class custom_range_label : public range_label + { + public: + label_text get_text (unsigned range_idx) const FINAL OVERRIDE + { + /* range 0 is the primary location; each subsequent range i + 1 + is for bidi::vec[i]. */ + if (range_idx > 0) + { + const bidi::context &ctxt (bidi::vec[range_idx - 1]); + return label_text::borrow (bidi::to_str (ctxt.m_kind)); + } + else + return label_text::borrow (_("end of bidirectional context")); + } + }; + + unpaired_bidi_rich_location (cpp_reader *pfile, location_t loc) + : rich_location (pfile->line_table, loc, &m_custom_label) + { + set_escape_on_output (true); + for (unsigned i = 0; i < bidi::vec.count (); i++) + add_range (bidi::vec[i].m_loc, + SHOW_RANGE_WITHOUT_CARET, + &m_custom_label); + } + + private: + custom_range_label m_custom_label; +}; + /* We're closing a bidi context, that is, we've encountered a newline, are closing a C-style comment, or are at the end of a string literal, character constant, or identifier. 
Warn if this context was not @@ -1427,11 +1566,17 @@ maybe_warn_bidi_on_close (cpp_reader *pfile, const uchar *p) const location_t loc = linemap_position_for_column (pfile->line_table, CPP_BUF_COLUMN (pfile->buffer, p)); - rich_location rich_loc (pfile->line_table, loc); - rich_loc.set_escape_on_output (true); - cpp_warning_at (pfile, CPP_W_BIDIRECTIONAL, &rich_loc, - "unpaired UTF-8 bidirectional control character " - "detected"); + unpaired_bidi_rich_location rich_loc (pfile, loc); + /* cpp_callbacks doesn't yet have a way to handle singular vs plural + forms of a diagnostic, so fake it for now. */ + if (bidi::vec.count () > 1) + cpp_warning_at (pfile, CPP_W_BIDIRECTIONAL, &rich_loc, + "unpaired UTF-8 bidirectional control characters " + "detected"); + else + cpp_warning_at (pfile, CPP_W_BIDIRECTIONAL, &rich_loc, + "unpaired UTF-8 bidirectional control character " + "detected"); } /* We're done with this context. */ bidi::on_close (); @@ -1439,12 +1584,13 @@ maybe_warn_bidi_on_close (cpp_reader *pfile, const uchar *p) /* We're at the beginning or in the middle of an identifier/comment/string literal/character constant. Warn if we've encountered a bidi character. - KIND says which bidi character it was; P points to it in the character - stream. UCN_P is true iff this bidi character was written as a UCN. */ + KIND says which bidi control character it was; UCN_P is true iff this bidi + control character was written as a UCN. LOC is the location of the + character, but is only valid if KIND != bidi::kind::NONE. 
*/ static void -maybe_warn_bidi_on_char (cpp_reader *pfile, const uchar *p, bidi::kind kind, - bool ucn_p) +maybe_warn_bidi_on_char (cpp_reader *pfile, bidi::kind kind, + bool ucn_p, location_t loc) { if (__builtin_expect (kind == bidi::kind::NONE, 1)) return; @@ -1453,9 +1599,6 @@ maybe_warn_bidi_on_char (cpp_reader *pfile, const uchar *p, bidi::kind kind, if (warn_bidi != bidirectional_none) { - const location_t loc - = linemap_position_for_column (pfile->line_table, - CPP_BUF_COLUMN (pfile->buffer, p)); rich_location rich_loc (pfile->line_table, loc); rich_loc.set_escape_on_output (true); @@ -1467,9 +1610,12 @@ maybe_warn_bidi_on_char (cpp_reader *pfile, const uchar *p, bidi::kind kind, { if (warn_bidi == bidirectional_unpaired && bidi::current_ctx_ucn_p () != ucn_p) - cpp_warning_at (pfile, CPP_W_BIDIRECTIONAL, &rich_loc, - "UTF-8 vs UCN mismatch when closing " - "a context by \"%s\"", bidi::to_str (kind)); + { + rich_loc.add_range (bidi::current_ctx_loc ()); + cpp_warning_at (pfile, CPP_W_BIDIRECTIONAL, &rich_loc, + "UTF-8 vs UCN mismatch when closing " + "a context by \"%s\"", bidi::to_str (kind)); + } } else if (warn_bidi == bidirectional_any) { @@ -1484,7 +1630,7 @@ maybe_warn_bidi_on_char (cpp_reader *pfile, const uchar *p, bidi::kind kind, } } /* We're done with this context. */ - bidi::on_char (kind, ucn_p); + bidi::on_char (kind, ucn_p, loc); } /* Skip a C-style block comment. We find the end of the comment by @@ -1552,8 +1698,9 @@ _cpp_skip_block_comment (cpp_reader *pfile) a bidirectional control character. 
*/ else if (__builtin_expect (c == bidi::utf8_start, 0) && warn_bidi_p) { - bidi::kind kind = get_bidi_utf8 (cur - 1); - maybe_warn_bidi_on_char (pfile, cur, kind, /*ucn_p=*/false); + location_t loc; + bidi::kind kind = get_bidi_utf8 (pfile, cur - 1, &loc); + maybe_warn_bidi_on_char (pfile, kind, /*ucn_p=*/false, loc); } } @@ -1586,9 +1733,9 @@ skip_line_comment (cpp_reader *pfile) { if (__builtin_expect (*buffer->cur == bidi::utf8_start, 0)) { - bidi::kind kind = get_bidi_utf8 (buffer->cur); - maybe_warn_bidi_on_char (pfile, buffer->cur, kind, - /*ucn_p=*/false); + location_t loc; + bidi::kind kind = get_bidi_utf8 (pfile, buffer->cur, &loc); + maybe_warn_bidi_on_char (pfile, kind, /*ucn_p=*/false, loc); } buffer->cur++; } @@ -1737,9 +1884,9 @@ forms_identifier_p (cpp_reader *pfile, int first, if (__builtin_expect (*buffer->cur == bidi::utf8_start, 0) && warn_bidi_p) { - bidi::kind kind = get_bidi_utf8 (buffer->cur); - maybe_warn_bidi_on_char (pfile, buffer->cur, kind, - /*ucn_p=*/false); + location_t loc; + bidi::kind kind = get_bidi_utf8 (pfile, buffer->cur, &loc); + maybe_warn_bidi_on_char (pfile, kind, /*ucn_p=*/false, loc); } if (_cpp_valid_utf8 (pfile, &buffer->cur, buffer->rlimit, 1 + !first, state, &s)) @@ -1751,10 +1898,12 @@ forms_identifier_p (cpp_reader *pfile, int first, buffer->cur += 2; if (warn_bidi_p) { - bidi::kind kind = get_bidi_ucn (buffer->cur, - buffer->cur[-1] == 'U'); - maybe_warn_bidi_on_char (pfile, buffer->cur, kind, - /*ucn_p=*/true); + location_t loc; + bidi::kind kind = get_bidi_ucn (pfile, + buffer->cur, + buffer->cur[-1] == 'U', + &loc); + maybe_warn_bidi_on_char (pfile, kind, /*ucn_p=*/true, loc); } if (_cpp_valid_ucn (pfile, &buffer->cur, buffer->rlimit, 1 + !first, state, &s, NULL, NULL)) @@ -2375,8 +2524,11 @@ lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base) } else if (__builtin_expect ((unsigned char) c == bidi::utf8_start, 0) && warn_bidi_p) - maybe_warn_bidi_on_char (pfile, pos - 1, get_bidi_utf8 (pos - 
1), - /*ucn_p=*/false); + { + location_t loc; + bidi::kind kind = get_bidi_utf8 (pfile, pos - 1, &loc); + maybe_warn_bidi_on_char (pfile, kind, /*ucn_p=*/false, loc); + } } if (warn_bidi_p) @@ -2486,8 +2638,10 @@ lex_string (cpp_reader *pfile, cpp_token *token, const uchar *base) { if ((cur[0] == 'u' || cur[0] == 'U') && warn_bidi_p) { - bidi::kind kind = get_bidi_ucn (cur + 1, cur[0] == 'U'); - maybe_warn_bidi_on_char (pfile, cur, kind, /*ucn_p=*/true); + location_t loc; + bidi::kind kind = get_bidi_ucn (pfile, cur + 1, cur[0] == 'U', + &loc); + maybe_warn_bidi_on_char (pfile, kind, /*ucn_p=*/true, loc); } cur++; } @@ -2515,8 +2669,9 @@ lex_string (cpp_reader *pfile, cpp_token *token, const uchar *base) saw_NUL = true; else if (__builtin_expect (c == bidi::utf8_start, 0) && warn_bidi_p) { - bidi::kind kind = get_bidi_utf8 (cur - 1); - maybe_warn_bidi_on_char (pfile, cur - 1, kind, /*ucn_p=*/false); + location_t loc; + bidi::kind kind = get_bidi_utf8 (pfile, cur - 1, &loc); + maybe_warn_bidi_on_char (pfile, kind, /*ucn_p=*/false, loc); } } -- 2.26.3 ^ permalink raw reply [flat|nested] 27+ messages in thread
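As an illustration of what the call sites above rely on, here is a minimal, self-contained sketch of classifying a 3-byte UTF-8 sequence as a bidi control character. It is not the libcpp code: the names `bidi_kind` and `classify_bidi_utf8` are hypothetical, and only the nine embedding, override and isolate controls (U+202A..U+202E and U+2066..U+2069) are covered, on the assumption that those are the characters of interest.

```cpp
#include <cassert>

// Hypothetical stand-in for bidi::kind; the enumerator names follow the patch.
enum class bidi_kind { NONE, LRE, RLE, LRO, RLO, LRI, RLI, FSI, PDF, PDI };

// Classify the UTF-8 sequence starting at P.  All of the characters handled
// here encode as 0xE2 followed by two more bytes.  P is assumed to point into
// a terminated buffer; P[2] is only read once P[1] has matched, so the read
// never goes past a terminating byte.
bidi_kind
classify_bidi_utf8 (const unsigned char *p)
{
  if (p[0] != 0xe2)
    return bidi_kind::NONE;

  if (p[1] == 0x80)
    switch (p[2])
      {
      case 0xaa: return bidi_kind::LRE;   // U+202A
      case 0xab: return bidi_kind::RLE;   // U+202B
      case 0xac: return bidi_kind::PDF;   // U+202C
      case 0xad: return bidi_kind::LRO;   // U+202D
      case 0xae: return bidi_kind::RLO;   // U+202E
      default: break;
      }
  else if (p[1] == 0x81)
    switch (p[2])
      {
      case 0xa6: return bidi_kind::LRI;   // U+2066
      case 0xa7: return bidi_kind::RLI;   // U+2067
      case 0xa8: return bidi_kind::FSI;   // U+2068
      case 0xa9: return bidi_kind::PDI;   // U+2069
      default: break;
      }

  return bidi_kind::NONE;
}
```

Testing the lead byte first keeps the common case cheap, which is presumably why the callers wrap the comparison with `bidi::utf8_start` in `__builtin_expect (..., 0)` before decoding anything.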
* Re: [PATCH 2/2] libcpp: capture and underline ranges in -Wbidi-chars= [PR103026]
  2021-11-17 22:45 ` [PATCH 2/2] libcpp: capture and underline ranges " David Malcolm
@ 2021-11-17 23:01 ` Marek Polacek
  0 siblings, 0 replies; 27+ messages in thread
From: Marek Polacek @ 2021-11-17 23:01 UTC (permalink / raw)
  To: David Malcolm; +Cc: Joseph Myers, Jakub Jelinek, Martin Sebor, GCC Patches

On Wed, Nov 17, 2021 at 05:45:15PM -0500, David Malcolm wrote:
> This patch converts the bidi::vec to use a struct so that we can
> capture location_t values for the bidirectional control characters.

Thanks for these improvements. I noticed a few nits, but nothing that
needs to be fixed immediately.

> --- a/libcpp/lex.c
> +++ b/libcpp/lex.c
> @@ -1172,11 +1172,34 @@ namespace bidi {
>    /* All the UTF-8 encodings of bidi characters start with E2. */
>    constexpr uchar utf8_start = 0xe2;
>
> +  struct context
> +  {
> +    context () {}
> +    context (location_t loc, kind k, bool pdf, bool ucn)
> +    : m_loc (loc), m_kind (k), m_pdf (pdf), m_ucn (ucn)
> +    {
> +    }
> +
> +    kind get_pop_kind () const
> +    {
> +      return m_pdf ? kind::PDF : kind::PDI;
> +    }
> +    bool ucn_p () const
> +    {
> +      return m_ucn;
> +    }
> +
> +    location_t m_loc;
> +    kind m_kind;
> +    unsigned m_pdf : 1;
> +    unsigned m_ucn : 1;

Should these members be private:, since we have getters for them?

> +  };
> +
>    /* A vector holding currently open bidi contexts. We use a char for
>       each context, its LSB is 1 if it represents a PDF context, 0 if it
>       represents a PDI context. The next bit is 1 if this context was open
>       by a bidi character written as a UCN, and 0 when it was UTF-8. */

Looks like this comment needs to be updated now.

> -  semi_embedded_vec <unsigned char, 16> vec;
> +  semi_embedded_vec <context, 16> vec;
>
>    /* Close the whole comment/identifier/string literal/character constant
>       context. */
> @@ -1193,19 +1216,19 @@ namespace bidi {
>        vec.truncate (len - 1);
>    }
>
> -  /* Return the context of the Ith element.
*/ > - kind ctx_at (unsigned int i) > + /* Return the pop kind of the context of the Ith element. */ > + kind pop_kind_at (unsigned int i) > { > - return (vec[i] & 1) ? kind::PDF : kind::PDI; > + return vec[i].get_pop_kind (); > } > > - /* Return which context is currently opened. */ > + /* Return the pop kind of the context that is currently opened. */ > kind current_ctx () > { > unsigned int len = vec.count (); > if (len == 0) > return kind::NONE; > - return ctx_at (len - 1); > + return vec[len - 1].get_pop_kind (); > } > > /* Return true if the current context comes from a UCN origin, that is, > @@ -1214,11 +1237,19 @@ namespace bidi { > { > unsigned int len = vec.count (); > gcc_checking_assert (len > 0); > - return (vec[len - 1] >> 1) & 1; > + return vec[len - 1].m_ucn; > } > > - /* We've read a bidi char, update the current vector as necessary. */ > - void on_char (kind k, bool ucn_p) > + location_t current_ctx_loc () > + { > + unsigned int len = vec.count (); > + gcc_checking_assert (len > 0); > + return vec[len - 1].m_loc; > + } > + > + /* We've read a bidi char, update the current vector as necessary. > + LOC is only valid when K is not kind::NONE. */ > + void on_char (kind k, bool ucn_p, location_t loc) > { > switch (k) > { > @@ -1226,12 +1257,12 @@ namespace bidi { > case kind::RLE: > case kind::LRO: > case kind::RLO: > - vec.push (ucn_p ? 3u : 1u); > + vec.push (context (loc, k, true, ucn_p)); > break; > case kind::LRI: > case kind::RLI: > case kind::FSI: > - vec.push (ucn_p ? 2u : 0u); > + vec.push (context (loc, k, false, ucn_p)); > break; > /* PDF terminates the scope of the last LRE, RLE, LRO, or RLO > whose scope has not yet been terminated. */ > @@ -1245,7 +1276,7 @@ namespace bidi { > yet been terminated. 
*/ > case kind::PDI: > for (int i = vec.count () - 1; i >= 0; --i) > - if (ctx_at (i) == kind::PDI) > + if (pop_kind_at (i) == kind::PDI) > { > vec.truncate (i); > break; > @@ -1295,10 +1326,47 @@ namespace bidi { > } > } > > +/* Get location_t for the range of bytes [START, START + NUM_BYTES) > + within the current line in FILE, with the caret at START. */ > + > +static location_t > +get_location_for_byte_range_in_cur_line (cpp_reader *pfile, > + const unsigned char *const start, > + size_t num_bytes) > +{ > + gcc_checking_assert (num_bytes > 0); > + > + /* CPP_BUF_COLUMN and linemap_position_for_column both refer > + to offsets in bytes, but CPP_BUF_COLUMN is 0-based, > + whereas linemap_position_for_column is 1-based. */ > + > + /* Get 0-based offsets within the line. */ > + size_t start_offset = CPP_BUF_COLUMN (pfile->buffer, start); > + size_t end_offset = start_offset + num_bytes - 1; > + > + /* Now convert to location_t, where "columns" are 1-based byte offsets. */ > + location_t start_loc = linemap_position_for_column (pfile->line_table, > + start_offset + 1); > + location_t end_loc = linemap_position_for_column (pfile->line_table, > + end_offset + 1); > + > + if (start_loc == end_loc) > + return start_loc; > + > + source_range src_range; > + src_range.m_start = start_loc; > + src_range.m_finish = end_loc; > + location_t combined_loc = COMBINE_LOCATION_DATA (pfile->line_table, > + start_loc, > + src_range, > + NULL); > + return combined_loc; > +} > + > /* Parse a sequence of 3 bytes starting with P and return its bidi code. */ > > static bidi::kind > -get_bidi_utf8 (const unsigned char *const p) > +get_bidi_utf8_1 (const unsigned char *const p) > { > gcc_checking_assert (p[0] == bidi::utf8_start); > > @@ -1340,10 +1408,25 @@ get_bidi_utf8 (const unsigned char *const p) > return bidi::kind::NONE; > } > > +/* Parse a sequence of 3 bytes starting with P and return its bidi code. > + If the kind is not NONE, write the location to *OUT.*/ '. 
*/' at the end > + > +static bidi::kind > +get_bidi_utf8 (cpp_reader *pfile, const unsigned char *const p, location_t *out) > +{ > + bidi::kind result = get_bidi_utf8_1 (p); > + if (result != bidi::kind::NONE) > + { > + /* We have a sequence of 3 bytes starting at P. */ > + *out = get_location_for_byte_range_in_cur_line (pfile, p, 3); > + } > + return result; > +} > + > /* Parse a UCN where P points just past \u or \U and return its bidi code. */ > > static bidi::kind > -get_bidi_ucn (const unsigned char *p, bool is_U) > +get_bidi_ucn_1 (const unsigned char *p, bool is_U) > { > /* 6.4.3 Universal Character Names > \u hex-quad > @@ -1412,6 +1495,62 @@ get_bidi_ucn (const unsigned char *p, bool is_U) > return bidi::kind::NONE; > } > > +/* Parse a UCN where P points just past \u or \U and return its bidi code. > + If the kind is not NONE, write the location to *OUT.*/ > + > +static bidi::kind > +get_bidi_ucn (cpp_reader *pfile, const unsigned char *p, bool is_U, Two spaces before 'const unsigned char'. > + location_t *out) > +{ > + bidi::kind result = get_bidi_ucn_1 (p, is_U); > + if (result != bidi::kind::NONE) > + { > + const unsigned char *start = p - 2; > + size_t num_bytes = 2 + (is_U ? 8 : 4); > + *out = get_location_for_byte_range_in_cur_line (pfile, start, num_bytes); > + } > + return result; > +} > + > +/* Subclass of rich_location for reporting on unpaired UTF-8 > + bidirectional control character(s). > + Escape the source lines on output, and show all unclosed > + bidi context, labelling everything. */ > + > +class unpaired_bidi_rich_location : public rich_location > +{ > + public: > + class custom_range_label : public range_label > + { > + public: > + label_text get_text (unsigned range_idx) const FINAL OVERRIDE > + { > + /* range 0 is the primary location; each subsequent range i + 1 > + is for bidi::vec[i]. 
*/ > + if (range_idx > 0) > + { > + const bidi::context &ctxt (bidi::vec[range_idx - 1]); > + return label_text::borrow (bidi::to_str (ctxt.m_kind)); > + } > + else > + return label_text::borrow (_("end of bidirectional context")); > + } > + }; > + > + unpaired_bidi_rich_location (cpp_reader *pfile, location_t loc) > + : rich_location (pfile->line_table, loc, &m_custom_label) > + { > + set_escape_on_output (true); > + for (unsigned i = 0; i < bidi::vec.count (); i++) > + add_range (bidi::vec[i].m_loc, > + SHOW_RANGE_WITHOUT_CARET, > + &m_custom_label); > + } > + > + private: > + custom_range_label m_custom_label; > +}; > + > /* We're closing a bidi context, that is, we've encountered a newline, > are closing a C-style comment, or are at the end of a string literal, > character constant, or identifier. Warn if this context was not > @@ -1427,11 +1566,17 @@ maybe_warn_bidi_on_close (cpp_reader *pfile, const uchar *p) > const location_t loc > = linemap_position_for_column (pfile->line_table, > CPP_BUF_COLUMN (pfile->buffer, p)); > - rich_location rich_loc (pfile->line_table, loc); > - rich_loc.set_escape_on_output (true); > - cpp_warning_at (pfile, CPP_W_BIDIRECTIONAL, &rich_loc, > - "unpaired UTF-8 bidirectional control character " > - "detected"); > + unpaired_bidi_rich_location rich_loc (pfile, loc); > + /* cpp_callbacks doesn't yet have a way to handle singular vs plural > + forms of a diagnostic, so fake it for now. */ > + if (bidi::vec.count () > 1) > + cpp_warning_at (pfile, CPP_W_BIDIRECTIONAL, &rich_loc, > + "unpaired UTF-8 bidirectional control characters " > + "detected"); > + else > + cpp_warning_at (pfile, CPP_W_BIDIRECTIONAL, &rich_loc, > + "unpaired UTF-8 bidirectional control character " > + "detected"); > } > /* We're done with this context. 
*/ > bidi::on_close (); > @@ -1439,12 +1584,13 @@ maybe_warn_bidi_on_close (cpp_reader *pfile, const uchar *p) > > /* We're at the beginning or in the middle of an identifier/comment/string > literal/character constant. Warn if we've encountered a bidi character. > - KIND says which bidi character it was; P points to it in the character > - stream. UCN_P is true iff this bidi character was written as a UCN. */ > + KIND says which bidi control character it was; UCN_P is true iff this bidi > + control character was written as a UCN. LOC is the location of the > + character, but is only valid if KIND != bidi::kind::NONE. */ > > static void > -maybe_warn_bidi_on_char (cpp_reader *pfile, const uchar *p, bidi::kind kind, > - bool ucn_p) > +maybe_warn_bidi_on_char (cpp_reader *pfile, bidi::kind kind, > + bool ucn_p, location_t loc) > { > if (__builtin_expect (kind == bidi::kind::NONE, 1)) > return; > @@ -1453,9 +1599,6 @@ maybe_warn_bidi_on_char (cpp_reader *pfile, const uchar *p, bidi::kind kind, > > if (warn_bidi != bidirectional_none) > { > - const location_t loc > - = linemap_position_for_column (pfile->line_table, > - CPP_BUF_COLUMN (pfile->buffer, p)); > rich_location rich_loc (pfile->line_table, loc); > rich_loc.set_escape_on_output (true); > > @@ -1467,9 +1610,12 @@ maybe_warn_bidi_on_char (cpp_reader *pfile, const uchar *p, bidi::kind kind, > { > if (warn_bidi == bidirectional_unpaired > && bidi::current_ctx_ucn_p () != ucn_p) > - cpp_warning_at (pfile, CPP_W_BIDIRECTIONAL, &rich_loc, > - "UTF-8 vs UCN mismatch when closing " > - "a context by \"%s\"", bidi::to_str (kind)); > + { > + rich_loc.add_range (bidi::current_ctx_loc ()); > + cpp_warning_at (pfile, CPP_W_BIDIRECTIONAL, &rich_loc, > + "UTF-8 vs UCN mismatch when closing " > + "a context by \"%s\"", bidi::to_str (kind)); > + } > } > else if (warn_bidi == bidirectional_any) > { > @@ -1484,7 +1630,7 @@ maybe_warn_bidi_on_char (cpp_reader *pfile, const uchar *p, bidi::kind kind, > } > } > /* We're done with 
this context. */ > - bidi::on_char (kind, ucn_p); > + bidi::on_char (kind, ucn_p, loc); > } > > /* Skip a C-style block comment. We find the end of the comment by > @@ -1552,8 +1698,9 @@ _cpp_skip_block_comment (cpp_reader *pfile) > a bidirectional control character. */ > else if (__builtin_expect (c == bidi::utf8_start, 0) && warn_bidi_p) > { > - bidi::kind kind = get_bidi_utf8 (cur - 1); > - maybe_warn_bidi_on_char (pfile, cur, kind, /*ucn_p=*/false); > + location_t loc; > + bidi::kind kind = get_bidi_utf8 (pfile, cur - 1, &loc); > + maybe_warn_bidi_on_char (pfile, kind, /*ucn_p=*/false, loc); > } > } > > @@ -1586,9 +1733,9 @@ skip_line_comment (cpp_reader *pfile) > { > if (__builtin_expect (*buffer->cur == bidi::utf8_start, 0)) > { > - bidi::kind kind = get_bidi_utf8 (buffer->cur); > - maybe_warn_bidi_on_char (pfile, buffer->cur, kind, > - /*ucn_p=*/false); > + location_t loc; > + bidi::kind kind = get_bidi_utf8 (pfile, buffer->cur, &loc); > + maybe_warn_bidi_on_char (pfile, kind, /*ucn_p=*/false, loc); > } > buffer->cur++; > } > @@ -1737,9 +1884,9 @@ forms_identifier_p (cpp_reader *pfile, int first, > if (__builtin_expect (*buffer->cur == bidi::utf8_start, 0) > && warn_bidi_p) > { > - bidi::kind kind = get_bidi_utf8 (buffer->cur); > - maybe_warn_bidi_on_char (pfile, buffer->cur, kind, > - /*ucn_p=*/false); > + location_t loc; > + bidi::kind kind = get_bidi_utf8 (pfile, buffer->cur, &loc); > + maybe_warn_bidi_on_char (pfile, kind, /*ucn_p=*/false, loc); > } > if (_cpp_valid_utf8 (pfile, &buffer->cur, buffer->rlimit, 1 + !first, > state, &s)) > @@ -1751,10 +1898,12 @@ forms_identifier_p (cpp_reader *pfile, int first, > buffer->cur += 2; > if (warn_bidi_p) > { > - bidi::kind kind = get_bidi_ucn (buffer->cur, > - buffer->cur[-1] == 'U'); > - maybe_warn_bidi_on_char (pfile, buffer->cur, kind, > - /*ucn_p=*/true); > + location_t loc; > + bidi::kind kind = get_bidi_ucn (pfile, > + buffer->cur, > + buffer->cur[-1] == 'U', > + &loc); > + maybe_warn_bidi_on_char (pfile, 
kind, /*ucn_p=*/true, loc); > } > if (_cpp_valid_ucn (pfile, &buffer->cur, buffer->rlimit, 1 + !first, > state, &s, NULL, NULL)) > @@ -2375,8 +2524,11 @@ lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base) > } > else if (__builtin_expect ((unsigned char) c == bidi::utf8_start, 0) > && warn_bidi_p) > - maybe_warn_bidi_on_char (pfile, pos - 1, get_bidi_utf8 (pos - 1), > - /*ucn_p=*/false); > + { > + location_t loc; > + bidi::kind kind = get_bidi_utf8 (pfile, pos - 1, &loc); > + maybe_warn_bidi_on_char (pfile, kind, /*ucn_p=*/false, loc); > + } > } > > if (warn_bidi_p) > @@ -2486,8 +2638,10 @@ lex_string (cpp_reader *pfile, cpp_token *token, const uchar *base) > { > if ((cur[0] == 'u' || cur[0] == 'U') && warn_bidi_p) > { > - bidi::kind kind = get_bidi_ucn (cur + 1, cur[0] == 'U'); > - maybe_warn_bidi_on_char (pfile, cur, kind, /*ucn_p=*/true); > + location_t loc; > + bidi::kind kind = get_bidi_ucn (pfile, cur + 1, cur[0] == 'U', > + &loc); > + maybe_warn_bidi_on_char (pfile, kind, /*ucn_p=*/true, loc); > } > cur++; > } > @@ -2515,8 +2669,9 @@ lex_string (cpp_reader *pfile, cpp_token *token, const uchar *base) > saw_NUL = true; > else if (__builtin_expect (c == bidi::utf8_start, 0) && warn_bidi_p) > { > - bidi::kind kind = get_bidi_utf8 (cur - 1); > - maybe_warn_bidi_on_char (pfile, cur - 1, kind, /*ucn_p=*/false); > + location_t loc; > + bidi::kind kind = get_bidi_utf8 (pfile, cur - 1, &loc); > + maybe_warn_bidi_on_char (pfile, kind, /*ucn_p=*/false, loc); > } > } > > -- > 2.26.3 > Marek ^ permalink raw reply [flat|nested] 27+ messages in thread
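The offset arithmetic in the quoted get_location_for_byte_range_in_cur_line and get_bidi_ucn can be sketched in isolation. The following is only an illustration under stated assumptions: `column_range`, `byte_range_to_columns` and `ucn_source_length` are hypothetical names, and a plain pair of 1-based columns stands in for the `location_t` that libcpp builds through the line map.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a source range: 1-based, inclusive byte columns.
struct column_range
{
  std::size_t first;
  std::size_t last;
};

// Convert the byte range [START, START + NUM_BYTES) on the line beginning at
// LINE_START into 1-based inclusive columns.  The offset within the line is
// 0-based, whereas the columns handed to the diagnostics are 1-based, hence
// the "+ 1" on both ends.
column_range
byte_range_to_columns (const unsigned char *line_start,
                       const unsigned char *start,
                       std::size_t num_bytes)
{
  assert (num_bytes > 0);
  std::size_t start_offset = static_cast<std::size_t> (start - line_start);
  std::size_t end_offset = start_offset + num_bytes - 1;
  return { start_offset + 1, end_offset + 1 };
}

// Number of source bytes a UCN occupies: the two bytes of "\u" or "\U"
// followed by 4 or 8 hex digits respectively.
std::size_t
ucn_source_length (bool is_U)
{
  return 2 + (is_U ? 8 : 4);
}
```

A UTF-8 bidi control character always spans 3 bytes, while a UCN spans 6 or 10 bytes counted from the backslash, which is why the quoted get_bidi_ucn steps back by 2 from P before computing the range.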
* Re: [PATCH] libcpp: Implement -Wbidi-chars for CVE-2021-42574 [PR103026] 2021-11-15 17:28 ` [PATCH] libcpp: Implement -Wbidi-chars " Marek Polacek 2021-11-15 23:15 ` David Malcolm @ 2021-11-30 8:38 ` Stephan Bergmann 2021-11-30 13:26 ` Marek Polacek 1 sibling, 1 reply; 27+ messages in thread From: Stephan Bergmann @ 2021-11-30 8:38 UTC (permalink / raw) To: Marek Polacek; +Cc: Jakub Jelinek, Martin Sebor, GCC Patches, Joseph Myers On 15/11/2021 18:28, Marek Polacek via Gcc-patches wrote: > On Mon, Nov 08, 2021 at 04:33:43PM -0500, Marek Polacek wrote: >> Ping, can we conclude on the name? IMHO, -Wbidirectional is just fine, >> but changing the name is a trivial operation. > > Here's a patch with a better name (suggested by Jonathan W.). Otherwise no > changes. > > Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk? > > -- >8 -- > From a link below: > "An issue was discovered in the Bidirectional Algorithm in the Unicode > Specification through 14.0. It permits the visual reordering of > characters via control sequences, which can be used to craft source code > that renders different logic than the logical ordering of tokens > ingested by compilers and interpreters. Adversaries can leverage this to > encode source code for compilers accepting Unicode such that targeted > vulnerabilities are introduced invisibly to human reviewers." > > More info: > https://nvd.nist.gov/vuln/detail/CVE-2021-42574 > https://trojansource.codes/ > > This is not a compiler bug. However, to mitigate the problem, this patch > implements -Wbidi-chars=[none|unpaired|any] to warn about possibly > misleading Unicode bidirectional characters the preprocessor may encounter. > > The default is =unpaired, which warns about improperly terminated > bidirectional characters; e.g. a LRE without its appertaining PDF. The > level =any warns about any use of bidirectional characters. > > This patch handles both UCNs and UTF-8 characters. 
UCNs designating > bidi characters in identifiers are accepted since r204886. Then r217144 > enabled -fextended-identifiers by default. Extended characters in C/C++ > identifiers have been accepted since r275979. However, this patch still > warns about mixing UTF-8 and UCN bidi characters; there seems to be no > good reason to allow mixing them. I wonder what the rationale is to warn about UCNs, like in > aText = u"\u202D" + aText; (as found in the LibreOffice source code). > We warn in different contexts: comments (both C and C++-style), string > literals, character constants, and identifiers. Expectedly, UCNs are ignored > in comments and raw string literals. The bidirectional characters can nest > so this patch handles that as well. > > I have not included nor tested this at all with Fortran (which also has > string literals and line comments). > > Dave M. posted patches improving diagnostic involving Unicode characters. > This patch does not make use of this new infrastructure yet. > > PR preprocessor/103026 [...] ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] libcpp: Implement -Wbidi-chars for CVE-2021-42574 [PR103026] 2021-11-30 8:38 ` [PATCH] libcpp: Implement -Wbidi-chars for CVE-2021-42574 [PR103026] Stephan Bergmann @ 2021-11-30 13:26 ` Marek Polacek 2021-11-30 15:00 ` Stephan Bergmann 0 siblings, 1 reply; 27+ messages in thread From: Marek Polacek @ 2021-11-30 13:26 UTC (permalink / raw) To: Stephan Bergmann; +Cc: Jakub Jelinek, Martin Sebor, GCC Patches, Joseph Myers On Tue, Nov 30, 2021 at 09:38:57AM +0100, Stephan Bergmann wrote: > On 15/11/2021 18:28, Marek Polacek via Gcc-patches wrote: > > On Mon, Nov 08, 2021 at 04:33:43PM -0500, Marek Polacek wrote: > > > Ping, can we conclude on the name? IMHO, -Wbidirectional is just fine, > > > but changing the name is a trivial operation. > > > > Here's a patch with a better name (suggested by Jonathan W.). Otherwise no > > changes. > > > > Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk? > > > > -- >8 -- > > From a link below: > > "An issue was discovered in the Bidirectional Algorithm in the Unicode > > Specification through 14.0. It permits the visual reordering of > > characters via control sequences, which can be used to craft source code > > that renders different logic than the logical ordering of tokens > > ingested by compilers and interpreters. Adversaries can leverage this to > > encode source code for compilers accepting Unicode such that targeted > > vulnerabilities are introduced invisibly to human reviewers." > > > > More info: > > https://nvd.nist.gov/vuln/detail/CVE-2021-42574 > > https://trojansource.codes/ > > > > This is not a compiler bug. However, to mitigate the problem, this patch > > implements -Wbidi-chars=[none|unpaired|any] to warn about possibly > > misleading Unicode bidirectional characters the preprocessor may encounter. > > > > The default is =unpaired, which warns about improperly terminated > > bidirectional characters; e.g. a LRE without its appertaining PDF. 
The > > level =any warns about any use of bidirectional characters. > > > > This patch handles both UCNs and UTF-8 characters. UCNs designating > > bidi characters in identifiers are accepted since r204886. Then r217144 > > enabled -fextended-identifiers by default. Extended characters in C/C++ > > identifiers have been accepted since r275979. However, this patch still > > warns about mixing UTF-8 and UCN bidi characters; there seems to be no > > good reason to allow mixing them. > > I wonder what the rationale is to warn about UCNs, like in > > > aText = u"\u202D" + aText; > > (as found in the LibreOffice source code). Is this line mixing a UCN and a UTF-8? Or is it just that you're prepending a LRO to aText? We warn because the LRO is not "closed" in the context of its string literal, which was part of the Trojan source attack. So "\u202D ... \u202C" would not warn. I'm not sure what workaround I could offer. Maybe provide an option not to warn about UCNs at all, though even that is potentially dangerous -- while you can see UCNs in the source code, if you print strings containing them, they won't be visible anymore. Marek ^ permalink raw reply [flat|nested] 27+ messages in thread
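To make the pairing rule in the reply above concrete, here is a minimal sketch of the context tracking that the `unpaired` level relies on. It is a simplification and not the libcpp implementation: `bidi_tracker` is a hypothetical name, a `std::vector` replaces `semi_embedded_vec`, locations and the UCN flag are omitted, and the handling of PDF (pop only when the innermost open context expects a PDF) is an assumption, since that part of the code is not shown in the quoted diff.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class kind { NONE, LRE, RLE, LRO, RLO, LRI, RLI, FSI, PDF, PDI };

// Hypothetical, simplified tracker of open bidi contexts within one
// comment, identifier, string literal or character constant.
class bidi_tracker
{
public:
  void on_char (kind k)
  {
    switch (k)
      {
      case kind::LRE: case kind::RLE: case kind::LRO: case kind::RLO:
        // Embeddings and overrides are closed by PDF.
        m_open.push_back (kind::PDF);
        break;
      case kind::LRI: case kind::RLI: case kind::FSI:
        // Isolates are closed by PDI.
        m_open.push_back (kind::PDI);
        break;
      case kind::PDF:
        // Assumption: PDF only closes the innermost context, and only if
        // that context was opened by an embedding or override.
        if (!m_open.empty () && m_open.back () == kind::PDF)
          m_open.pop_back ();
        break;
      case kind::PDI:
        // PDI closes the last open isolate and everything nested inside it.
        for (std::size_t i = m_open.size (); i-- > 0; )
          if (m_open[i] == kind::PDI)
            {
              m_open.resize (i);
              break;
            }
        break;
      case kind::NONE:
        break;
      }
  }

  // Called at the end of the enclosing token or comment.  Returns true if
  // some context was left open, i.e. the case that would be diagnosed.
  bool on_close ()
  {
    bool unpaired = !m_open.empty ();
    m_open.clear ();
    return unpaired;
  }

private:
  std::vector<kind> m_open;
};
```

Under this model a literal containing only an LRO leaves one context open when the literal ends, which corresponds to the `u"\u202D"` case from LibreOffice, whereas an LRO followed by a PDF inside the same literal closes cleanly.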
* Re: [PATCH] libcpp: Implement -Wbidi-chars for CVE-2021-42574 [PR103026] 2021-11-30 13:26 ` Marek Polacek @ 2021-11-30 15:00 ` Stephan Bergmann 2021-11-30 15:27 ` Marek Polacek 0 siblings, 1 reply; 27+ messages in thread From: Stephan Bergmann @ 2021-11-30 15:00 UTC (permalink / raw) To: Marek Polacek; +Cc: Jakub Jelinek, Martin Sebor, GCC Patches, Joseph Myers On 30/11/2021 14:26, Marek Polacek wrote: > On Tue, Nov 30, 2021 at 09:38:57AM +0100, Stephan Bergmann wrote: >> On 15/11/2021 18:28, Marek Polacek via Gcc-patches wrote: >>> On Mon, Nov 08, 2021 at 04:33:43PM -0500, Marek Polacek wrote: >>>> Ping, can we conclude on the name? IMHO, -Wbidirectional is just fine, >>>> but changing the name is a trivial operation. >>> >>> Here's a patch with a better name (suggested by Jonathan W.). Otherwise no >>> changes. >>> >>> Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk? >>> >>> -- >8 -- >>> From a link below: >>> "An issue was discovered in the Bidirectional Algorithm in the Unicode >>> Specification through 14.0. It permits the visual reordering of >>> characters via control sequences, which can be used to craft source code >>> that renders different logic than the logical ordering of tokens >>> ingested by compilers and interpreters. Adversaries can leverage this to >>> encode source code for compilers accepting Unicode such that targeted >>> vulnerabilities are introduced invisibly to human reviewers." >>> >>> More info: >>> https://nvd.nist.gov/vuln/detail/CVE-2021-42574 >>> https://trojansource.codes/ >>> >>> This is not a compiler bug. However, to mitigate the problem, this patch >>> implements -Wbidi-chars=[none|unpaired|any] to warn about possibly >>> misleading Unicode bidirectional characters the preprocessor may encounter. >>> >>> The default is =unpaired, which warns about improperly terminated >>> bidirectional characters; e.g. a LRE without its appertaining PDF. The >>> level =any warns about any use of bidirectional characters. 
>>> >>> This patch handles both UCNs and UTF-8 characters. UCNs designating >>> bidi characters in identifiers are accepted since r204886. Then r217144 >>> enabled -fextended-identifiers by default. Extended characters in C/C++ >>> identifiers have been accepted since r275979. However, this patch still >>> warns about mixing UTF-8 and UCN bidi characters; there seems to be no >>> good reason to allow mixing them. >> >> I wonder what the rationale is to warn about UCNs, like in >> >>> aText = u"\u202D" + aText; >> >> (as found in the LibreOffice source code). > > Is this line mixing a UCN and a UTF-8? Or is it just that you're > prepending a LRO to aText? We warn because the LRO is not "closed" > in the context of its string literal, which was part of the Trojan > source attack. So "\u202D ... \u202C" would not warn. > > I'm not sure what workaround I could offer. Maybe provide an option not to > warn about UCNs at all, though even that is potentially dangerous -- while > you can see UCNs in the source code, if you print strings containing them, > they won't be visible anymore. I'm not sure what you mean with "mixing a UCN and a UTF-8", but what the code apparently does is programmatically constructing a larger piece of text by prepending LRO to an existing piece of text. My understanding is that Trojan Source is concerned with presentation of program source code and not with properties of Unicode text constructed during the execution of such a program, and from the documentation quoted above I understand that -Wbidi-chars is meant to address Trojan Source, so I don't understand why you're concerned here with what happens "if you print strings containing [UCNs in the source code]". Short of a source code viewer that interprets UCNs in C/C++ source code and renders them in the same way as their corresponding Unicode characters, I don't think that UCNs are relevant for Trojan Source, and don't understand why -Wbidi-chars would warn about them. 
(Also, I noticed that it doesn't work to silence -Werror=bidi-chars= with a

> #pragma GCC diagnostic ignored "-Wbidi-chars"

?)

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] libcpp: Implement -Wbidi-chars for CVE-2021-42574 [PR103026] 2021-11-30 15:00 ` Stephan Bergmann @ 2021-11-30 15:27 ` Marek Polacek 2022-01-14 9:23 ` Stephan Bergmann 0 siblings, 1 reply; 27+ messages in thread From: Marek Polacek @ 2021-11-30 15:27 UTC (permalink / raw) To: Stephan Bergmann; +Cc: Jakub Jelinek, Martin Sebor, GCC Patches, Joseph Myers On Tue, Nov 30, 2021 at 04:00:01PM +0100, Stephan Bergmann wrote: > On 30/11/2021 14:26, Marek Polacek wrote: > > On Tue, Nov 30, 2021 at 09:38:57AM +0100, Stephan Bergmann wrote: > > > On 15/11/2021 18:28, Marek Polacek via Gcc-patches wrote: > > > > On Mon, Nov 08, 2021 at 04:33:43PM -0500, Marek Polacek wrote: > > > > > Ping, can we conclude on the name? IMHO, -Wbidirectional is just fine, > > > > > but changing the name is a trivial operation. > > > > > > > > Here's a patch with a better name (suggested by Jonathan W.). Otherwise no > > > > changes. > > > > > > > > Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk? > > > > > > > > -- >8 -- > > > > From a link below: > > > > "An issue was discovered in the Bidirectional Algorithm in the Unicode > > > > Specification through 14.0. It permits the visual reordering of > > > > characters via control sequences, which can be used to craft source code > > > > that renders different logic than the logical ordering of tokens > > > > ingested by compilers and interpreters. Adversaries can leverage this to > > > > encode source code for compilers accepting Unicode such that targeted > > > > vulnerabilities are introduced invisibly to human reviewers." > > > > > > > > More info: > > > > https://nvd.nist.gov/vuln/detail/CVE-2021-42574 > > > > https://trojansource.codes/ > > > > > > > > This is not a compiler bug. However, to mitigate the problem, this patch > > > > implements -Wbidi-chars=[none|unpaired|any] to warn about possibly > > > > misleading Unicode bidirectional characters the preprocessor may encounter. 
> > > > > > > > The default is =unpaired, which warns about improperly terminated > > > > bidirectional characters; e.g. a LRE without its appertaining PDF. The > > > > level =any warns about any use of bidirectional characters. > > > > > > > > This patch handles both UCNs and UTF-8 characters. UCNs designating > > > > bidi characters in identifiers are accepted since r204886. Then r217144 > > > > enabled -fextended-identifiers by default. Extended characters in C/C++ > > > > identifiers have been accepted since r275979. However, this patch still > > > > warns about mixing UTF-8 and UCN bidi characters; there seems to be no > > > > good reason to allow mixing them. > > > > > > I wonder what the rationale is to warn about UCNs, like in > > > > > > > aText = u"\u202D" + aText; > > > > > > (as found in the LibreOffice source code). > > > > Is this line mixing a UCN and a UTF-8? Or is it just that you're > > prepending a LRO to aText? We warn because the LRO is not "closed" > > in the context of its string literal, which was part of the Trojan > > source attack. So "\u202D ... \u202C" would not warn. > > > > I'm not sure what workaround I could offer. Maybe provide an option not to > > warn about UCNs at all, though even that is potentially dangerous -- while > > you can see UCNs in the source code, if you print strings containing them, > > they won't be visible anymore. > > I'm not sure what you mean with "mixing a UCN and a UTF-8", but what the > code apparently does is programmatically constructing a larger piece of text > by prepending LRO to an existing piece of text. 
> > My understanding is that Trojan Source is concerned with presentation of > program source code and not with properties of Unicode text constructed > during the execution of such a program, and from the documentation quoted > above I understand that -Wbidi-chars is meant to address Trojan Source, so I > don't understand why you're concerned here with what happens "if you print > strings containing [UCNs in the source code]". > > Short of a source code viewer that interprets UCNs in C/C++ source code and > renders them in the same way as their corresponding Unicode characters, I > don't think that UCNs are relevant for Trojan Source, and don't understand > why -Wbidi-chars would warn about them. I guess we were concerned with programs that generate other programs. Maybe UCNs should be ignored by default. There's still time to adjust the behavior. > (Also, I noticed that it doesn't work to silence -Werror=bidi-chars= with a > > > #pragma GCC diagnostic ignored "-Wbidi-chars" Yeah, it doesn't work with C++, it's https://gcc.gnu.org/PR53431 :( Marek ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] libcpp: Implement -Wbidi-chars for CVE-2021-42574 [PR103026]
  2021-11-30 15:27 ` Marek Polacek
@ 2022-01-14 9:23 ` Stephan Bergmann
  2022-01-14 13:28 ` Marek Polacek
  0 siblings, 1 reply; 27+ messages in thread
From: Stephan Bergmann @ 2022-01-14 9:23 UTC (permalink / raw)
  To: Marek Polacek; +Cc: Jakub Jelinek, Martin Sebor, GCC Patches, Joseph Myers

On 30/11/2021 16:27, Marek Polacek wrote:
> I guess we were concerned with programs that generate other programs.
> Maybe UCNs should be ignored by default. There's still time to adjust
> the behavior.

Is there any update on this? Shall I file a bug? As-is, -Wbidi-chars is
unusable for building LibreOffice and (esp. in combination with
<https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53431> "C++ preprocessor
ignores #pragma GCC diagnostic") has to be explicitly disabled globally.

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] libcpp: Implement -Wbidi-chars for CVE-2021-42574 [PR103026] 2022-01-14 9:23 ` Stephan Bergmann @ 2022-01-14 13:28 ` Marek Polacek 2022-01-14 14:52 ` Stephan Bergmann 0 siblings, 1 reply; 27+ messages in thread From: Marek Polacek @ 2022-01-14 13:28 UTC (permalink / raw) To: Stephan Bergmann; +Cc: Jakub Jelinek, Martin Sebor, GCC Patches, Joseph Myers On Fri, Jan 14, 2022 at 10:23:16AM +0100, Stephan Bergmann wrote: > On 30/11/2021 16:27, Marek Polacek wrote: > > I guess we were concerned with programs that generate other programs. > > Maybe UCNs should be ignored by default. There's still time to adjust > > the behavior. > > Is there any update on this? Shall I file a bug? As-is, -Wbidi-chars is > unusable for building LibreOffice and (esp. in combination with > <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53431> "C++ preprocessor > ignores #pragma GCC diagnostic") has to be explicitly disabled globally. No update, it wasn't clear to me what the action should be here. Please do file a bug. I think I'll just have to adjust the warning to ignore UCNs. Marek ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] libcpp: Implement -Wbidi-chars for CVE-2021-42574 [PR103026] 2022-01-14 13:28 ` Marek Polacek @ 2022-01-14 14:52 ` Stephan Bergmann 0 siblings, 0 replies; 27+ messages in thread From: Stephan Bergmann @ 2022-01-14 14:52 UTC (permalink / raw) To: Marek Polacek; +Cc: Jakub Jelinek, Martin Sebor, GCC Patches, Joseph Myers On 14/01/2022 14:28, Marek Polacek wrote: > On Fri, Jan 14, 2022 at 10:23:16AM +0100, Stephan Bergmann wrote: >> On 30/11/2021 16:27, Marek Polacek wrote: >>> I guess we were concerned with programs that generate other programs. >>> Maybe UCNs should be ignored by default. There's still time to adjust >>> the behavior. >> >> Is there any update on this? Shall I file a bug? As-is, -Wbidi-chars is >> unusable for building LibreOffice and (esp. in combination with >> <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53431> "C++ preprocessor >> ignores #pragma GCC diagnostic") has to be explicitly disabled globally. > > No update, it wasn't clear to me what the action should be here. > > Please do file a bug. I think I'll just have to adjust the warning to ignore > UCNs. <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104030> "-Wbidi-chars should not warn about UCNs" ^ permalink raw reply [flat|nested] 27+ messages in thread
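To make the UCN point in the exchange above concrete, here is a minimal C++ sketch that looks for bidi control characters only where they appear as raw UTF-8 bytes in source text. It is not GCC code: `bidi_code_point_at` and `count_raw_bidi` are hypothetical names, and the set of controls is assumed to be U+202A..U+202E and U+2066..U+2069. A UCN spelling such as `\u202c` is plain ASCII in the file, so a scanner like this sees nothing that a source viewer could reorder.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not GCC's get_bidi_utf8): if the three bytes at
// s[i..i+2] are the UTF-8 encoding of a bidi control character, return its
// code point; otherwise return 0.
unsigned bidi_code_point_at(const std::string &s, std::size_t i)
{
  assert(i < s.size());
  if (i + 2 >= s.size())
    return 0;
  const unsigned char b0 = static_cast<unsigned char>(s[i]);
  const unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
  const unsigned char b2 = static_cast<unsigned char>(s[i + 2]);
  // All of these controls encode as E2 xx xx, with two continuation bytes.
  if (b0 != 0xE2 || (b1 & 0xC0) != 0x80 || (b2 & 0xC0) != 0x80)
    return 0;
  const unsigned cp = ((b0 & 0x0Fu) << 12) | ((b1 & 0x3Fu) << 6) | (b2 & 0x3Fu);
  // Assumed set: LRE, RLE, PDF, LRO, RLO and LRI, RLI, FSI, PDI.
  const bool is_bidi = (cp >= 0x202A && cp <= 0x202E)
                       || (cp >= 0x2066 && cp <= 0x2069);
  return is_bidi ? cp : 0;
}

// Count bidi controls that are physically present as UTF-8 bytes.
// UCN spellings (backslash, 'u', hex digits) are ASCII and are not counted.
std::size_t count_raw_bidi(const std::string &src)
{
  std::size_t n = 0;
  for (std::size_t i = 0; i < src.size(); ++i)
    if (bidi_code_point_at(src, i) != 0)
      {
        ++n;
        i += 2; // skip the two continuation bytes
      }
  return n;
}
```

Under these assumptions, a comment containing RLO, LRI, PDI and LRI as raw bytes counts four controls, while a declaration spelled entirely with UCNs counts zero.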
* [PATCH 0/2] Re: [PATCH] libcpp: Implement -Wbidirectional for CVE-2021-42574 [PR103026] 2021-11-01 16:36 [PATCH] libcpp: Implement -Wbidirectional for CVE-2021-42574 [PR103026] Marek Polacek 2021-11-01 22:10 ` Joseph Myers @ 2021-11-02 20:57 ` David Malcolm 2021-11-02 20:58 ` [PATCH 1/2] Flag CPP_W_BIDIRECTIONAL so that source lines are escaped David Malcolm 2021-11-02 20:58 ` [PATCH 2/2] Capture locations of bidi chars and underline ranges David Malcolm 1 sibling, 2 replies; 27+ messages in thread From: David Malcolm @ 2021-11-02 20:57 UTC (permalink / raw) To: Marek Polacek, GCC Patches; +Cc: Joseph Myers, Jakub Jelinek, David Malcolm On Mon, 2021-11-01 at 12:36 -0400, Marek Polacek via Gcc-patches wrote: > From a link below: > "An issue was discovered in the Bidirectional Algorithm in the > Unicode > Specification through 14.0. It permits the visual reordering of > characters via control sequences, which can be used to craft source > code > that renders different logic than the logical ordering of tokens > ingested by compilers and interpreters. Adversaries can leverage this > to > encode source code for compilers accepting Unicode such that targeted > vulnerabilities are introduced invisibly to human reviewers." > > More info: > https://nvd.nist.gov/vuln/detail/CVE-2021-42574 > https://trojansource.codes/ > > This is not a compiler bug. However, to mitigate the problem, this > patch > implements -Wbidirectional=[none|unpaired|any] to warn about possibly > misleading Unicode bidirectional characters the preprocessor may > encounter. > > The default is =unpaired, which warns about improperly terminated > bidirectional characters; e.g. a LRE without its appertaining PDF. > The > level =any warns about any use of bidirectional characters. > > This patch handles both UCNs and UTF-8 characters. UCNs designating > bidi characters in identifiers are accepted since r204886. Then > r217144 > enabled -fextended-identifiers by default. 
Extended characters in > C/C++ > identifiers have been accepted since r275979. However, this patch > still > warns about mixing UTF-8 and UCN bidi characters; there seems to be > no > good reason to allow mixing them. > > We warn in different contexts: comments (both C and C++-style), > string > literals, character constants, and identifiers. Expectedly, UCNs are > ignored > in comments and raw string literals. The bidirectional characters > can nest > so this patch handles that as well. > > I have not included nor tested this at all with Fortran (which also > has > string literals and line comments). > > Dave M. posted patches improving diagnostic involving Unicode > characters. > This patch does not make use of this new infrastructure yet. Challenge accepted :) Here are a couple of patches on top of the v1 version of your patch to make use of that new infrastructure. The first patch is relatively non-invasive; the second patch reworks things quite a bit to capture location_t values for the bidirectional control characters, and use them in the diagnostics, with labelled ranges, giving e.g.: $ ./xgcc -B. -S ../../src/gcc/testsuite/c-c++-common/Wbidirectional-2.c -fdiagnostics-escape-format=bytes ../../src/gcc/testsuite/c-c++-common/Wbidirectional-2.c: In function ‘main’: ../../src/gcc/testsuite/c-c++-common/Wbidirectional-2.c:5:28: warning: unpaired UTF-8 bidirectional character detected [-Wbidirectional=] 5 | /* Say hello; newline<e2><81><a7>/*/ return 0 ; | ~~~~~~~~~~~~ ^ | | | | | end of bidirectional context | U+2067 (RIGHT-TO-LEFT ISOLATE) There's a more complicated example in the test case. Not yet bootstrapped, but hopefully gives you some ideas on future versions of the patch. Note that the precise location_t values aren't going to make much sense without the escaping feature [1], and I don't think that's backportable to GCC 11, so these UX tweaks might be for GCC 12+ only. 
Hope this is constructive Dave [1] what is a "column number" in a line of bidirectional text? Right now it's a 1-based offset w.r.t. the logical ordering of the characters, but respecting tabs and counting certain characters as occupying two columns, but it's not at all clear to me that there's such a thing as a "column number" in bidirectional text. David Malcolm (2): Flag CPP_W_BIDIRECTIONAL so that source lines are escaped Capture locations of bidi chars and underline ranges .../c-c++-common/Wbidirectional-ranges.c | 54 ++++ libcpp/lex.c | 254 ++++++++++++++---- 2 files changed, 261 insertions(+), 47 deletions(-) create mode 100644 gcc/testsuite/c-c++-common/Wbidirectional-ranges.c -- 2.26.3 ^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH 1/2] Flag CPP_W_BIDIRECTIONAL so that source lines are escaped 2021-11-02 20:57 ` [PATCH 0/2] Re: [PATCH] libcpp: Implement -Wbidirectional " David Malcolm @ 2021-11-02 20:58 ` David Malcolm 2021-11-02 21:07 ` David Malcolm 2021-11-02 20:58 ` [PATCH 2/2] Capture locations of bidi chars and underline ranges David Malcolm 1 sibling, 1 reply; 27+ messages in thread From: David Malcolm @ 2021-11-02 20:58 UTC (permalink / raw) To: Marek Polacek, GCC Patches; +Cc: Joseph Myers, Jakub Jelinek, David Malcolm Before: Wbidirectional-1.c: In function ‘main’: Wbidirectional-1.c:6:43: warning: unpaired UTF-8 bidirectional character detected [-Wbidirectional=] 6 | /* } if (isAdmin) begin admins only */ | ^ Wbidirectional-1.c:9:28: warning: unpaired UTF-8 bidirectional character detected [-Wbidirectional=] 9 | /* end admins only { */ | ^ Wbidirectional-11.c:6:15: warning: UTF-8 vs UCN mismatch when closing a context by "U+202C (POP DIRECTIONAL FORMATTING)" [-Wbidirectional=] 6 | int LRE__PDF_\u202c; | ^ After setting rich_loc.set_escape_on_output (true): Wbidirectional-1.c:6:43: warning: unpaired UTF-8 bidirectional character detected [-Wbidirectional=] 6 | /*<U+202E> } <U+2066>if (isAdmin)<U+2069> <U+2066> begin admins only */ | ^ Wbidirectional-1.c:9:28: warning: unpaired UTF-8 bidirectional character detected [-Wbidirectional=] 9 | /* end admins only <U+202E> { <U+2066>*/ | ^ Wbidirectional-11.c:6:15: warning: UTF-8 vs UCN mismatch when closing a context by "U+202C (POP DIRECTIONAL FORMATTING)" [-Wbidirectional=] 6 | int LRE_<U+202A>_PDF_\u202c; | ^ libcpp/ChangeLog: * lex.c (maybe_warn_bidi_on_close): Use a rich_location and call set_escape_on_output (true) on it. (maybe_warn_bidi_on_char): Likewise. 
Signed-off-by: David Malcolm <dmalcolm@redhat.com>
---
 libcpp/lex.c | 29 +++++++++++++++++------------
 1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/libcpp/lex.c b/libcpp/lex.c
index f7a86fbe4b5..88aba307991 100644
--- a/libcpp/lex.c
+++ b/libcpp/lex.c
@@ -1387,9 +1387,11 @@ maybe_warn_bidi_on_close (cpp_reader *pfile, const uchar *p)
       const location_t loc
         = linemap_position_for_column (pfile->line_table,
                                        CPP_BUF_COLUMN (pfile->buffer, p));
-      cpp_warning_with_line (pfile, CPP_W_BIDIRECTIONAL, loc, 0,
-                             "unpaired UTF-8 bidirectional character "
-                             "detected");
+      rich_location rich_loc (pfile->line_table, loc);
+      rich_loc.set_escape_on_output (true);
+      cpp_warning_at (pfile, CPP_W_BIDIRECTIONAL, &rich_loc,
+                      "unpaired UTF-8 bidirectional character "
+                      "detected");
     }
   /* We're done with this context.  */
   bidi::on_close ();
@@ -1414,6 +1416,9 @@ maybe_warn_bidi_on_char (cpp_reader *pfile, const uchar *p, bidi::kind kind,
       const location_t loc
         = linemap_position_for_column (pfile->line_table,
                                        CPP_BUF_COLUMN (pfile->buffer, p));
+      rich_location rich_loc (pfile->line_table, loc);
+      rich_loc.set_escape_on_output (true);
+
       /* It seems excessive to warn about a PDI/PDF that is closing
          an opened context because we've already warned about the
          opening character.  Except warn when we have a UCN x UTF-8
@@ -1422,20 +1427,20 @@ maybe_warn_bidi_on_char (cpp_reader *pfile, const uchar *p, bidi::kind kind,
         {
           if (warn_bidi == bidirectional_unpaired
               && bidi::current_ctx_ucn_p () != ucn_p)
-            cpp_warning_with_line (pfile, CPP_W_BIDIRECTIONAL, loc, 0,
-                                   "UTF-8 vs UCN mismatch when closing "
-                                   "a context by \"%s\"", bidi::to_str (kind));
+            cpp_warning_at (pfile, CPP_W_BIDIRECTIONAL, &rich_loc,
+                            "UTF-8 vs UCN mismatch when closing "
+                            "a context by \"%s\"", bidi::to_str (kind));
         }
       else if (warn_bidi == bidirectional_any)
         {
           if (kind == bidi::kind::PDF || kind == bidi::kind::PDI)
-            cpp_warning_with_line (pfile, CPP_W_BIDIRECTIONAL, loc, 0,
-                                   "\"%s\" is closing an unopened context",
-                                   bidi::to_str (kind));
+            cpp_warning_at (pfile, CPP_W_BIDIRECTIONAL, &rich_loc,
+                            "\"%s\" is closing an unopened context",
+                            bidi::to_str (kind));
           else
-            cpp_warning_with_line (pfile, CPP_W_BIDIRECTIONAL, loc, 0,
-                                   "found problematic Unicode character \"%s\"",
-                                   bidi::to_str (kind));
+            cpp_warning_at (pfile, CPP_W_BIDIRECTIONAL, &rich_loc,
+                            "found problematic Unicode character \"%s\"",
+                            bidi::to_str (kind));
         }
     }
   /* We're done with this context.  */
-- 
2.26.3

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] Flag CPP_W_BIDIRECTIONAL so that source lines are escaped 2021-11-02 20:58 ` [PATCH 1/2] Flag CPP_W_BIDIRECTIONAL so that source lines are escaped David Malcolm @ 2021-11-02 21:07 ` David Malcolm 0 siblings, 0 replies; 27+ messages in thread From: David Malcolm @ 2021-11-02 21:07 UTC (permalink / raw) To: Marek Polacek, GCC Patches; +Cc: Joseph Myers, Jakub Jelinek On Tue, 2021-11-02 at 16:58 -0400, David Malcolm wrote: > Before: > > Wbidirectional-1.c: In function ‘main’: > Wbidirectional-1.c:6:43: warning: unpaired UTF-8 bidirectional > character detected [-Wbidirectional=] > 6 | /* } if (isAdmin) begin admins only */ > | ^ > Wbidirectional-1.c:9:28: warning: unpaired UTF-8 bidirectional > character detected [-Wbidirectional=] > 9 | /* end admins only { */ > | ^ > > Wbidirectional-11.c:6:15: warning: UTF-8 vs UCN mismatch when > closing a context by "U+202C (POP DIRECTIONAL FORMATTING)" [- > Wbidirectional=] > 6 | int LRE__PDF_\u202c; > | ^ > > After setting rich_loc.set_escape_on_output (true): > > Wbidirectional-1.c:6:43: warning: unpaired UTF-8 bidirectional > character detected [-Wbidirectional=] > 6 | /*<U+202E> } <U+2066>if (isAdmin)<U+2069> <U+2066> > begin admins only */ > > | > ^ > Wbidirectional-1.c:9:28: warning: unpaired UTF-8 bidirectional > character detected [-Wbidirectional=] > 9 | /* end admins only <U+202E> { <U+2066>*/ > | ^ > > Wbidirectional-11.c:6:15: warning: UTF-8 vs UCN mismatch when > closing a context by "U+202C (POP DIRECTIONAL FORMATTING)" [- > Wbidirectional=] > 6 | int LRE_<U+202A>_PDF_\u202c; > | ^ > > libcpp/ChangeLog: > * lex.c (maybe_warn_bidi_on_close): Use a rich_location > and call set_escape_on_output (true) on it. > (maybe_warn_bidi_on_char): Likewise. > > Signed-off-by: David Malcolm <dmalcolm@redhat.com> [...snip...] 
To be more explicit: part of the benefit of escaping non-ASCII bytes in the source line is that it further mitigates CVE-2021-42574, since it "defangs" the bidi control characters by turning everything into ASCII, so that the user can see the logical ordering of the characters directly. A similar consideration applies to homoglyph attacks.

Dave

^ permalink raw reply	[flat|nested] 27+ messages in thread
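The "defanging" described above can be sketched in a few lines of C++. This is an illustration under stated assumptions, not the libcpp implementation: `escape_non_ascii` and `escape_format` are hypothetical names, the `<U+XXXX>` and `<xx>` spellings are copied from the diagnostics quoted in this thread, and the decoder does not reject overlong encodings or surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical stand-in for the two escape styles seen in the thread:
// <U+202E> (code points) and <e2><80><ae> (raw bytes).
enum class escape_format { unicode, bytes };

// Append one byte as "<xx>" in lowercase hex.
static void append_byte(std::string &out, unsigned char b)
{
  char buf[8];
  const int n = std::snprintf(buf, sizeof buf, "<%02x>", static_cast<unsigned>(b));
  assert(n > 0 && static_cast<std::size_t>(n) < sizeof buf);
  (void) n;
  out += buf;
}

// Return LINE with every non-ASCII UTF-8 sequence replaced by an ASCII
// escape, so the logical order of the characters is visible as written.
std::string escape_non_ascii(const std::string &line,
                             escape_format fmt = escape_format::unicode)
{
  std::string out;
  std::size_t i = 0;
  while (i < line.size())
    {
      const unsigned char b0 = static_cast<unsigned char>(line[i]);
      if (b0 < 0x80)
        {
          out += static_cast<char>(b0); // ASCII passes through unchanged
          ++i;
          continue;
        }

      // Work out the sequence length from the lead byte.
      std::size_t len = 0;
      unsigned cp = 0;
      if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1Fu; }
      else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0Fu; }
      else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07u; }

      bool ok = len != 0 && i + len <= line.size();
      for (std::size_t k = 1; ok && k < len; ++k)
        {
          const unsigned char b = static_cast<unsigned char>(line[i + k]);
          if ((b & 0xC0) != 0x80)
            ok = false;
          else
            cp = (cp << 6) | (b & 0x3Fu);
        }

      if (!ok || fmt == escape_format::bytes)
        {
          // Malformed input: escape just the offending byte.
          // Byte format: escape every byte of the sequence.
          const std::size_t n = ok ? len : 1;
          for (std::size_t k = 0; k < n; ++k)
            append_byte(out, static_cast<unsigned char>(line[i + k]));
          i += n;
        }
      else
        {
          char buf[16];
          const int n = std::snprintf(buf, sizeof buf, "<U+%04X>", cp);
          assert(n > 0 && static_cast<std::size_t>(n) < sizeof buf);
          (void) n;
          out += buf;
          i += len;
        }
    }
  return out;
}
```

Because the output is pure ASCII, a terminal or viewer has no bidi controls left to act on, which is the property the message above relies on.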
* [PATCH 2/2] Capture locations of bidi chars and underline ranges
  2021-11-02 20:57 ` [PATCH 0/2] Re: [PATCH] libcpp: Implement -Wbidirectional " David Malcolm
  2021-11-02 20:58   ` [PATCH 1/2] Flag CPP_W_BIDIRECTIONAL so that source lines are escaped David Malcolm
@ 2021-11-02 20:58   ` David Malcolm
  1 sibling, 0 replies; 27+ messages in thread
From: David Malcolm @ 2021-11-02 20:58 UTC (permalink / raw)
  To: Marek Polacek, GCC Patches; +Cc: Joseph Myers, Jakub Jelinek, David Malcolm

This patch converts the bidi::vec to use a struct so that we can capture
location_t values for the bidirectional control characters, and uses these
to label source ranges in the diagnostics.

The effect on the output can be seen in the new testcase.

gcc/testsuite/ChangeLog:
        * c-c++-common/Wbidirectional-ranges.c: New test.

libcpp/ChangeLog:
        * lex.c (struct bidi::context): New.
        (bidi::vec): Convert to a vec of context rather than unsigned char.
        (bidi::current_ctx): Update for above change.
        (bidi::current_ctx_ucn_p): Likewise.
        (bidi::current_ctx_loc): New.
        (bidi::on_char): Update for usage of context struct.  Add "loc" param
        and pass it when pushing contexts.
        (get_location_for_byte_range_in_cur_line): New.
        (get_bidi_utf8): Rename to...
        (get_bidi_utf8_1): ...this, reintroducing...
        (get_bidi_utf8): ...as a wrapper, setting *OUT when the result is not
        NONE.
        (get_bidi_ucn): Rename to...
        (get_bidi_ucn_1): ...this, reintroducing...
        (get_bidi_ucn): ...as a wrapper, setting *OUT when the result is not
        NONE.
        (class unpaired_bidi_rich_location): New.
        (maybe_warn_bidi_on_close): Use unpaired_bidi_rich_location when
        reporting on unpaired bidi chars.  Split into singular vs plural
        spellings.
        (maybe_warn_bidi_on_char): Pass in a location_t rather than a
        const uchar * and use it when emitting warnings, and when calling
        bidi::on_char.
        (_cpp_skip_block_comment): Capture location when kind is not NONE
        and pass it to maybe_warn_bidi_on_char.
        (skip_line_comment): Likewise.
        (forms_identifier_p): Likewise.
(lex_raw_string): Likewise. (lex_string): Likewise. Signed-off-by: David Malcolm <dmalcolm@redhat.com> --- .../c-c++-common/Wbidirectional-ranges.c | 54 ++++ libcpp/lex.c | 241 ++++++++++++++---- 2 files changed, 252 insertions(+), 43 deletions(-) create mode 100644 gcc/testsuite/c-c++-common/Wbidirectional-ranges.c diff --git a/gcc/testsuite/c-c++-common/Wbidirectional-ranges.c b/gcc/testsuite/c-c++-common/Wbidirectional-ranges.c new file mode 100644 index 00000000000..a41ae47dc30 --- /dev/null +++ b/gcc/testsuite/c-c++-common/Wbidirectional-ranges.c @@ -0,0 +1,54 @@ +/* PR preprocessor/103026 */ +/* { dg-do compile } */ +/* { dg-options "-Wbidirectional=unpaired -fdiagnostics-show-caret" } */ +/* Verify that we escape and underline pertinent bidirectional characters + when quoting the source. */ + +int test_unpaired_bidi () { + int isAdmin = 0; + /* } if (isAdmin) begin admins only */ +/* { dg-warning "bidirectional" "" { target *-*-* } .-1 } */ +#if 0 + { dg-begin-multiline-output "" } + /*<U+202E> } <U+2066>if (isAdmin)<U+2069> <U+2066> begin admins only */ + ~~~~~~~~ ~~~~~~~~ ^ + | | | + | | end of bidirectional context + U+202E (RIGHT-TO-LEFT OVERRIDE) U+2066 (LEFT-TO-RIGHT ISOLATE) + { dg-end-multiline-output "" } +#endif + + __builtin_printf("You are an admin.\n"); + /* end admins only { */ +/* { dg-warning "bidirectional" "" { target *-*-* } .-1 } */ +#if 0 + { dg-begin-multiline-output "" } + /* end admins only <U+202E> { <U+2066>*/ + ~~~~~~~~ ~~~~~~~~ ^ + | | | + | | end of bidirectional context + | U+2066 (LEFT-TO-RIGHT ISOLATE) + U+202E (RIGHT-TO-LEFT OVERRIDE) + { dg-end-multiline-output "" } +#endif + + return 0; +} + +int LRE__PDF_\u202c; +/* { dg-warning "mismatch" "" { target *-*-* } .-1 } */ +#if 0 + { dg-begin-multiline-output "" } + int LRE_<U+202A>_PDF_\u202c; + ~~~~~~~~ ^~~~~~ + { dg-end-multiline-output "" } +#endif + +const char *s1 = "LRE__PDF_\u202c"; +/* { dg-warning "mismatch" "" { target *-*-* } .-1 } */ +#if 0 + { 
dg-begin-multiline-output "" } + const char *s1 = "LRE_<U+202A>_PDF_\u202c"; + ~~~~~~~~ ^~~~~~ + { dg-end-multiline-output "" } +#endif diff --git a/libcpp/lex.c b/libcpp/lex.c index 88aba307991..9e5531fb125 100644 --- a/libcpp/lex.c +++ b/libcpp/lex.c @@ -1172,11 +1172,34 @@ namespace bidi { /* All the UTF-8 encodings of bidi characters start with E2. */ constexpr uchar utf8_start = 0xe2; + struct context + { + context () {} + context (location_t loc, kind k, bool pdf, bool ucn) + : m_loc (loc), m_kind (k), m_pdf (pdf), m_ucn (ucn) + { + } + + kind get_pop_kind () const + { + return m_pdf ? kind::PDF : kind::PDI; + } + bool ucn_p () const + { + return m_ucn; + } + + location_t m_loc; + kind m_kind; + unsigned m_pdf : 1; + unsigned m_ucn : 1; + }; + /* A vector holding currently open bidi contexts. We use a char for each context, its LSB is 1 if it represents a PDF context, 0 if it represents a PDI context. The next bit is 1 if this context was open by a bidi character written as a UCN, and 0 when it was UTF-8. */ - semi_embedded_vec <unsigned char, 16> vec; + semi_embedded_vec <context, 16> vec; /* Close the whole comment/identifier/string literal/character constant context. */ @@ -1199,7 +1222,7 @@ namespace bidi { unsigned int len = vec.count (); if (len == 0) return kind::NONE; - return (vec[len - 1] & 1) ? kind::PDF : kind::PDI; + return vec[len - 1].get_pop_kind (); } /* Return true if the current context comes from a UCN origin, that is, @@ -1208,11 +1231,19 @@ namespace bidi { { unsigned int len = vec.count (); gcc_checking_assert (len > 0); - return (vec[len - 1] >> 1) & 1; + return vec[len - 1].m_ucn; } - /* We've read a bidi char, update the current vector as necessary. */ - void on_char (kind k, bool ucn_p) + location_t current_ctx_loc () + { + unsigned int len = vec.count (); + gcc_checking_assert (len > 0); + return vec[len - 1].m_loc; + } + + /* We've read a bidi char, update the current vector as necessary. 
+ LOC is only valid when K is not kind::NONE. */ + void on_char (kind k, bool ucn_p, location_t loc) { switch (k) { @@ -1220,12 +1251,12 @@ namespace bidi { case kind::RLE: case kind::LRO: case kind::RLO: - vec.push (ucn_p ? 3u : 1u); + vec.push (context (loc, k, true, ucn_p)); break; case kind::LRI: case kind::RLI: case kind::FSI: - vec.push (ucn_p ? 2u : 0u); + vec.push (context (loc, k, false, ucn_p)); break; case kind::PDF: if (current_ctx () == kind::PDF) @@ -1271,10 +1302,47 @@ namespace bidi { } } +/* Get location_t for the range of bytes [START, START + NUM_BYTES) + within the current line in FILE, with the caret at START. */ + +static location_t +get_location_for_byte_range_in_cur_line (cpp_reader *pfile, + const unsigned char *const start, + size_t num_bytes) +{ + gcc_checking_assert (num_bytes > 0); + + /* CPP_BUF_COLUMN and linemap_position_for_column both refer + to offsets in bytes, but CPP_BUF_COLUMN is 0-based, + whereas linemap_position_for_column is 1-based. */ + + /* Get 0-based offsets within the line. */ + size_t start_offset = CPP_BUF_COLUMN (pfile->buffer, start); + size_t end_offset = start_offset + num_bytes - 1; + + /* Now convert to location_t, where "columns" are 1-based byte offsets. */ + location_t start_loc = linemap_position_for_column (pfile->line_table, + start_offset + 1); + location_t end_loc = linemap_position_for_column (pfile->line_table, + end_offset + 1); + + if (start_loc == end_loc) + return start_loc; + + source_range src_range; + src_range.m_start = start_loc; + src_range.m_finish = end_loc; + location_t combined_loc = COMBINE_LOCATION_DATA (pfile->line_table, + start_loc, + src_range, + NULL); + return combined_loc; +} + /* Parse a sequence of 3 bytes starting with P and return its bidi code. 
*/ static bidi::kind -get_bidi_utf8 (const unsigned char *const p) +get_bidi_utf8_1 (const unsigned char *const p) { gcc_checking_assert (p[0] == bidi::utf8_start); @@ -1312,10 +1380,25 @@ get_bidi_utf8 (const unsigned char *const p) return bidi::kind::NONE; } +/* Parse a sequence of 3 bytes starting with P and return its bidi code. + If the kind is not NONE, write the location to *OUT.*/ + +static bidi::kind +get_bidi_utf8 (cpp_reader *pfile, const unsigned char *const p, location_t *out) +{ + bidi::kind result = get_bidi_utf8_1 (p); + if (result != bidi::kind::NONE) + { + /* We have a sequence of 3 bytes starting at P. */ + *out = get_location_for_byte_range_in_cur_line (pfile, p, 3); + } + return result; +} + /* Parse a UCN where P points just past \u or \U and return its bidi code. */ static bidi::kind -get_bidi_ucn (const unsigned char *p, bool is_U) +get_bidi_ucn_1 (const unsigned char *p, bool is_U) { /* 6.4.3 Universal Character Names \u hex-quad @@ -1372,6 +1455,62 @@ get_bidi_ucn (const unsigned char *p, bool is_U) return bidi::kind::NONE; } +/* Parse a UCN where P points just past \u or \U and return its bidi code. + If the kind is not NONE, write the location to *OUT.*/ + +static bidi::kind +get_bidi_ucn (cpp_reader *pfile, const unsigned char *p, bool is_U, + location_t *out) +{ + bidi::kind result = get_bidi_ucn_1 (p, is_U); + if (result != bidi::kind::NONE) + { + const unsigned char *start = p - 2; + size_t num_bytes = 2 + (is_U ? 8 : 4); + *out = get_location_for_byte_range_in_cur_line (pfile, start, num_bytes); + } + return result; +} + +/* Subclass of rich_location for reporting on unpaired UTF-8 + bidirectional character(s). + Escape the source lines on output, and show all unclosed + bidi context, labelling everything. 
*/ + +class unpaired_bidi_rich_location : public rich_location +{ + public: + class custom_range_label : public range_label + { + public: + label_text get_text (unsigned range_idx) const FINAL OVERRIDE + { + /* range 0 is the primary location; each subsequent range i + 1 + is for bidi::vec[i]. */ + if (range_idx > 0) + { + const bidi::context &ctxt (bidi::vec[range_idx - 1]); + return label_text::borrow (bidi::to_str (ctxt.m_kind)); + } + else + return label_text::borrow (_("end of bidirectional context")); + } + }; + + unpaired_bidi_rich_location (cpp_reader *pfile, location_t loc) + : rich_location (pfile->line_table, loc, &m_custom_label) + { + set_escape_on_output (true); + for (unsigned i = 0; i < bidi::vec.count (); i++) + add_range (bidi::vec[i].m_loc, + SHOW_RANGE_WITHOUT_CARET, + &m_custom_label); + } + + private: + custom_range_label m_custom_label; +}; + /* We're closing a bidi context, that is, we've encountered a newline, are closing a C-style comment, or are at the end of a string literal, character constant, or identifier. Warn if this context was not @@ -1387,11 +1526,17 @@ maybe_warn_bidi_on_close (cpp_reader *pfile, const uchar *p) const location_t loc = linemap_position_for_column (pfile->line_table, CPP_BUF_COLUMN (pfile->buffer, p)); - rich_location rich_loc (pfile->line_table, loc); - rich_loc.set_escape_on_output (true); - cpp_warning_at (pfile, CPP_W_BIDIRECTIONAL, &rich_loc, - "unpaired UTF-8 bidirectional character " - "detected"); + unpaired_bidi_rich_location rich_loc (pfile, loc); + /* cpp_callbacks doesn't yet have a way to handle singular vs plural + forms of a diagnostic, so fake it for now. */ + if (bidi::vec.count () > 1) + cpp_warning_at (pfile, CPP_W_BIDIRECTIONAL, &rich_loc, + "unpaired UTF-8 bidirectional characters " + "detected"); + else + cpp_warning_at (pfile, CPP_W_BIDIRECTIONAL, &rich_loc, + "unpaired UTF-8 bidirectional character " + "detected"); } /* We're done with this context. 
*/ bidi::on_close (); @@ -1399,12 +1544,13 @@ maybe_warn_bidi_on_close (cpp_reader *pfile, const uchar *p) /* We're at the beginning or in the middle of an identifier/comment/string literal/character constant. Warn if we've encountered a bidi character. - KIND says which bidi character it was; P points to it in the character - stream. UCN_P is true iff this bidi character was written as a UCN. */ + KIND says which bidi character it was; UCN_P is true iff this bidi + character was written as a UCN. LOC is the location of the character, + but is only valid if KIND != bidi::kind::NONE. */ static void -maybe_warn_bidi_on_char (cpp_reader *pfile, const uchar *p, bidi::kind kind, - bool ucn_p) +maybe_warn_bidi_on_char (cpp_reader *pfile, bidi::kind kind, + bool ucn_p, location_t loc) { if (__builtin_expect (kind == bidi::kind::NONE, 1)) return; @@ -1413,9 +1559,6 @@ maybe_warn_bidi_on_char (cpp_reader *pfile, const uchar *p, bidi::kind kind, if (warn_bidi != bidirectional_none) { - const location_t loc - = linemap_position_for_column (pfile->line_table, - CPP_BUF_COLUMN (pfile->buffer, p)); rich_location rich_loc (pfile->line_table, loc); rich_loc.set_escape_on_output (true); @@ -1427,9 +1570,12 @@ maybe_warn_bidi_on_char (cpp_reader *pfile, const uchar *p, bidi::kind kind, { if (warn_bidi == bidirectional_unpaired && bidi::current_ctx_ucn_p () != ucn_p) - cpp_warning_at (pfile, CPP_W_BIDIRECTIONAL, &rich_loc, - "UTF-8 vs UCN mismatch when closing " - "a context by \"%s\"", bidi::to_str (kind)); + { + rich_loc.add_range (bidi::current_ctx_loc ()); + cpp_warning_at (pfile, CPP_W_BIDIRECTIONAL, &rich_loc, + "UTF-8 vs UCN mismatch when closing " + "a context by \"%s\"", bidi::to_str (kind)); + } } else if (warn_bidi == bidirectional_any) { @@ -1444,7 +1590,7 @@ maybe_warn_bidi_on_char (cpp_reader *pfile, const uchar *p, bidi::kind kind, } } /* We're done with this context. 
*/ - bidi::on_char (kind, ucn_p); + bidi::on_char (kind, ucn_p, loc); } /* Skip a C-style block comment. We find the end of the comment by @@ -1512,8 +1658,9 @@ _cpp_skip_block_comment (cpp_reader *pfile) a bidirectional character. */ else if (__builtin_expect (c == bidi::utf8_start, 0) && warn_bidi_p) { - bidi::kind kind = get_bidi_utf8 (cur - 1); - maybe_warn_bidi_on_char (pfile, cur, kind, /*ucn_p=*/false); + location_t loc; + bidi::kind kind = get_bidi_utf8 (pfile, cur - 1, &loc); + maybe_warn_bidi_on_char (pfile, kind, /*ucn_p=*/false, loc); } } @@ -1547,9 +1694,9 @@ skip_line_comment (cpp_reader *pfile) { if (__builtin_expect (*buffer->cur == bidi::utf8_start, 0)) { - bidi::kind kind = get_bidi_utf8 (buffer->cur); - maybe_warn_bidi_on_char (pfile, buffer->cur, kind, - /*ucn_p=*/false); + location_t loc; + bidi::kind kind = get_bidi_utf8 (pfile, buffer->cur, &loc); + maybe_warn_bidi_on_char (pfile, kind, /*ucn_p=*/false, loc); } buffer->cur++; } @@ -1699,9 +1846,9 @@ forms_identifier_p (cpp_reader *pfile, int first, if (__builtin_expect (*buffer->cur == bidi::utf8_start, 0) && warn_bidi_p) { - bidi::kind kind = get_bidi_utf8 (buffer->cur); - maybe_warn_bidi_on_char (pfile, buffer->cur, kind, - /*ucn_p=*/false); + location_t loc; + bidi::kind kind = get_bidi_utf8 (pfile, buffer->cur, &loc); + maybe_warn_bidi_on_char (pfile, kind, /*ucn_p=*/false, loc); } if (_cpp_valid_utf8 (pfile, &buffer->cur, buffer->rlimit, 1 + !first, state, &s)) @@ -1713,10 +1860,12 @@ forms_identifier_p (cpp_reader *pfile, int first, buffer->cur += 2; if (warn_bidi_p) { - bidi::kind kind = get_bidi_ucn (buffer->cur, - buffer->cur[-1] == 'U'); - maybe_warn_bidi_on_char (pfile, buffer->cur, kind, - /*ucn_p=*/true); + location_t loc; + bidi::kind kind = get_bidi_ucn (pfile, + buffer->cur, + buffer->cur[-1] == 'U', + &loc); + maybe_warn_bidi_on_char (pfile, kind, /*ucn_p=*/true, loc); } if (_cpp_valid_ucn (pfile, &buffer->cur, buffer->rlimit, 1 + !first, state, &s, NULL, NULL)) @@ -2341,8 
+2490,11 @@ lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base) } else if (__builtin_expect ((unsigned char) c == bidi::utf8_start, 0) && warn_bidi_p) - maybe_warn_bidi_on_char (pfile, pos - 1, get_bidi_utf8 (pos - 1), - /*ucn_p=*/false); + { + location_t loc; + bidi::kind kind = get_bidi_utf8 (pfile, pos - 1, &loc); + maybe_warn_bidi_on_char (pfile, kind, /*ucn_p=*/false, loc); + } } if (warn_bidi_p) @@ -2453,8 +2605,10 @@ lex_string (cpp_reader *pfile, cpp_token *token, const uchar *base) { if ((cur[0] == 'u' || cur[0] == 'U') && warn_bidi_p) { - bidi::kind kind = get_bidi_ucn (cur + 1, cur[0] == 'U'); - maybe_warn_bidi_on_char (pfile, cur, kind, /*ucn_p=*/true); + location_t loc; + bidi::kind kind = get_bidi_ucn (pfile, cur + 1, cur[0] == 'U', + &loc); + maybe_warn_bidi_on_char (pfile, kind, /*ucn_p=*/true, loc); } cur++; } @@ -2482,8 +2636,9 @@ lex_string (cpp_reader *pfile, cpp_token *token, const uchar *base) saw_NUL = true; else if (__builtin_expect (c == bidi::utf8_start, 0) && warn_bidi_p) { - bidi::kind kind = get_bidi_utf8 (cur - 1); - maybe_warn_bidi_on_char (pfile, cur - 1, kind, /*ucn_p=*/false); + location_t loc; + bidi::kind kind = get_bidi_utf8 (pfile, cur - 1, &loc); + maybe_warn_bidi_on_char (pfile, kind, /*ucn_p=*/false, loc); } } -- 2.26.3 ^ permalink raw reply [flat|nested] 27+ messages in thread
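The context bookkeeping that the patch above moves from packed bytes to a struct can be modelled outside libcpp. The following is a simplified, self-contained C++ sketch, not the patch itself: `bidi_tracker` is a hypothetical name, `std::vector` stands in for `semi_embedded_vec`, a plain `std::size_t` stands in for `location_t`, and the PDI branch (which is not visible in the quoted hunks) is written from the Unicode Bidirectional Algorithm's rule that PDI closes the most recent open isolate together with any embeddings or overrides opened after it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class kind { NONE, LRE, RLE, LRO, RLO, LRI, RLI, FSI, PDF, PDI };

// One open bidi context: where it was opened, by which character, which
// character would close it, and whether it was written as a UCN.
struct context
{
  std::size_t loc; // stand-in for location_t
  kind k;
  bool pdf;        // true: closed by PDF; false: closed by PDI
  bool ucn;

  kind pop_kind() const { return pdf ? kind::PDF : kind::PDI; }
};

class bidi_tracker
{
public:
  // Kind of character that would close the innermost context, or NONE.
  kind current() const
  {
    return stack_.empty() ? kind::NONE : stack_.back().pop_kind();
  }

  // Location of the innermost open context; requires one to be open.
  std::size_t current_loc() const
  {
    assert(!stack_.empty());
    return stack_.back().loc;
  }

  // True if K would close the innermost context but is spelled differently
  // (UCN vs UTF-8) from the character that opened it.
  bool mismatch_on_close(kind k, bool ucn) const
  {
    return !stack_.empty() && k == current() && stack_.back().ucn != ucn;
  }

  // Record a bidi character seen at LOC.
  void on_char(kind k, bool ucn, std::size_t loc)
  {
    switch (k)
      {
      case kind::LRE: case kind::RLE: case kind::LRO: case kind::RLO:
        stack_.push_back({loc, k, true, ucn});
        break;
      case kind::LRI: case kind::RLI: case kind::FSI:
        stack_.push_back({loc, k, false, ucn});
        break;
      case kind::PDF:
        if (current() == kind::PDF)
          stack_.pop_back();
        break;
      case kind::PDI:
        // Assumption (from UAX #9): drop the nearest open isolate and
        // everything opened after it; ignore PDI if no isolate is open.
        for (std::size_t i = stack_.size(); i-- > 0;)
          if (!stack_[i].pdf)
            {
              stack_.erase(stack_.begin() + static_cast<std::ptrdiff_t>(i),
                           stack_.end());
              break;
            }
        break;
      case kind::NONE:
        break;
      }
  }

  // Contexts still open; at the end of a comment, string or identifier
  // these are the ones an "unpaired" warning would underline.
  const std::vector<context> &unclosed() const { return stack_; }

  // End of the enclosing comment/string/identifier: forget everything.
  void on_close() { stack_.clear(); }

private:
  std::vector<context> stack_;
};
```

Feeding this model the sequence RLO, LRI, PDI, LRI from the first comment in the new test case leaves two open contexts (the RLO and the second LRI), which corresponds to the two labelled ranges in the expected output there.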
end of thread, other threads:[~2022-01-14 14:52 UTC | newest] Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-11-01 16:36 [PATCH] libcpp: Implement -Wbidirectional for CVE-2021-42574 [PR103026] Marek Polacek 2021-11-01 22:10 ` Joseph Myers 2021-11-02 17:18 ` [PATCH v2] " Marek Polacek 2021-11-02 19:20 ` Martin Sebor 2021-11-02 19:52 ` Marek Polacek 2021-11-08 21:33 ` Marek Polacek 2021-11-15 17:28 ` [PATCH] libcpp: Implement -Wbidi-chars " Marek Polacek 2021-11-15 23:15 ` David Malcolm 2021-11-16 19:50 ` [PATCH v2] " Marek Polacek 2021-11-16 23:00 ` David Malcolm 2021-11-17 0:37 ` [PATCH v3] " Marek Polacek 2021-11-17 2:28 ` David Malcolm 2021-11-17 3:05 ` Marek Polacek 2021-11-17 22:45 ` [committed] libcpp: escape non-ASCII source bytes in -Wbidi-chars= [PR103026] David Malcolm 2021-11-17 22:45 ` [PATCH 2/2] libcpp: capture and underline ranges " David Malcolm 2021-11-17 23:01 ` Marek Polacek 2021-11-30 8:38 ` [PATCH] libcpp: Implement -Wbidi-chars for CVE-2021-42574 [PR103026] Stephan Bergmann 2021-11-30 13:26 ` Marek Polacek 2021-11-30 15:00 ` Stephan Bergmann 2021-11-30 15:27 ` Marek Polacek 2022-01-14 9:23 ` Stephan Bergmann 2022-01-14 13:28 ` Marek Polacek 2022-01-14 14:52 ` Stephan Bergmann 2021-11-02 20:57 ` [PATCH 0/2] Re: [PATCH] libcpp: Implement -Wbidirectional " David Malcolm 2021-11-02 20:58 ` [PATCH 1/2] Flag CPP_W_BIDIRECTIONAL so that source lines are escaped David Malcolm 2021-11-02 21:07 ` David Malcolm 2021-11-02 20:58 ` [PATCH 2/2] Capture locations of bidi chars and underline ranges David Malcolm