public inbox for binutils@sourceware.org
 help / color / mirror / Atom feed
* [PATCH v2] bpf: avoid creating wrong symbols while parsing
@ 2023-11-17 18:54 David Faust
  2023-11-17 19:29 ` Jose E. Marchesi
  0 siblings, 1 reply; 5+ messages in thread
From: David Faust @ 2023-11-17 18:54 UTC (permalink / raw)
  To: binutils; +Cc: jose.marchesi, jbeulich

[ WAS: gas,bpf: cleanup bad symbols created while parsing
  v1: https://sourceware.org/pipermail/binutils/2023-November/130556.html

  Changes from v1:
  - Rewrite patch to avoid inserting wrong symbols in the first place,
    rather than insert, remove, re-insert dance.
    Suggested by Jan Beulich. ]

To support the "pseudo-C" asm dialect in BPF, the BPF parser must often
attempt multiple different templates for a single instruction. In some
cases this can cause the parser to incorrectly parse part of the
instruction opcode as an expression, which leads to the creation of a
new undefined symbol.

Once the parser recognizes the error, the expression is discarded and it
tries again with a new instruction template. However, symbols created
during the process are added to the symbol table and are not removed
even if the expression is discarded.

This is a problem for BPF: generally the assembled object will be loaded
directly to the Linux kernel, without being linked. These erroneous
parser-created symbols are rejected by the kernel BPF loader, and the
entire object is refused.

This patch remedies the issue by tentatively creating symbols while
parsing instruction operands, and storing them in a temporary list
rather than immediately inserting them into the symbol table. Later,
after the parser is sure that it has correctly parsed the instruction,
those symbols are committed to the real symbol table.

This approach is modeled directly after Jan Beulich's patch for RISC-V:

  commit 7a29ee290307087e1749ce610207e93a15d0b78d
  RISC-V: adjust logic to avoid register name symbols

Many thanks to Jan for recognizing the problem as similar, and pointing
me to that patch.

Tested on x86_64-linux-gnu host for bpf-unknown-none target.

gas/

	* config/tc-bpf.c (parsing_insn_operands): New.
	(parse_expression): Set it here.
	(deferred_sym_rootP, deferred_sym_lastP): New.
	(orphan_sym_rootP, orphan_sym_lastP): New.
	(bpf_parse_name): New.
	(parse_error): Clear deferred symbol list on error.
	(md_assemble): Clear parsing_insn_operands. Commit deferred
	symbols to symbol table on successful parse.
	* config/tc-bpf.h (md_parse_name): Define to...
	(bpf_parse_name): ...this. New prototype.
	* testsuite/gas/bpf/asm-extra-sym-1.s: New test source.
	* testsuite/gas/bpf/asm-extra-sym-1.d: New test.
	* testsuite/gas/bpf/bpf.exp: Run new test.
---
 gas/config/tc-bpf.c                     | 92 +++++++++++++++++++++++++
 gas/config/tc-bpf.h                     |  4 ++
 gas/testsuite/gas/bpf/asm-extra-sym-1.d |  7 ++
 gas/testsuite/gas/bpf/asm-extra-sym-1.s |  1 +
 gas/testsuite/gas/bpf/bpf.exp           |  3 +
 5 files changed, 107 insertions(+)
 create mode 100644 gas/testsuite/gas/bpf/asm-extra-sym-1.d
 create mode 100644 gas/testsuite/gas/bpf/asm-extra-sym-1.s

diff --git a/gas/config/tc-bpf.c b/gas/config/tc-bpf.c
index fd4144a354b..3122f80804a 100644
--- a/gas/config/tc-bpf.c
+++ b/gas/config/tc-bpf.c
@@ -1223,6 +1223,7 @@ add_relaxed_insn (struct bpf_insn *insn, expressionS *exp)
    See md_operand below to see how exp_parse_failed is used.  */
 
 static int exp_parse_failed = 0;
+static bool parsing_insn_operands = false;
 
 static char *
 parse_expression (char *s, expressionS *exp)
@@ -1230,6 +1231,9 @@ parse_expression (char *s, expressionS *exp)
   char *saved_input_line_pointer = input_line_pointer;
   char *saved_s = s;
 
+  /* Wake up bpf_parse_name before the call to expression ().  */
+  parsing_insn_operands = true;
+
   exp_parse_failed = 0;
   input_line_pointer = s;
   expression (exp);
@@ -1251,6 +1255,71 @@ parse_expression (char *s, expressionS *exp)
   return s;
 }
 
+/* Symbols created by this parse, but not yet committed to the real
+   symbol table.  */
+static symbolS *deferred_sym_rootP;
+static symbolS *deferred_sym_lastP;
+
+/* Symbols discarded by a previous parse.  Symbols cannot easily be freed
+   after creation, so try to recycle.  */
+static symbolS *orphan_sym_rootP;
+static symbolS *orphan_sym_lastP;
+
+/* Implement md_parse_name hook.  Handles any symbol found in an expression.
+   This allows us to tentatively create symbols, before we know for sure
+   whether the parser is using the correct template for an instruction.
+   If we end up keeping the instruction, the deferred symbols are committed
+   to the real symbol table. This approach is modeled after the riscv port.  */
+
+bool
+bpf_parse_name (const char *name, expressionS *exp, enum expr_mode mode)
+{
+  symbolS *sym;
+
+  /* If we aren't currently parsing an instruction, don't do anything.
+     This prevents tampering with operands to directives.  */
+  if (!parsing_insn_operands)
+    return false;
+
+  gas_assert (mode == expr_normal);
+
+  if (symbol_find (name) != NULL)
+    return false;
+
+  for (sym = deferred_sym_rootP; sym; sym = symbol_next (sym))
+    if (strcmp (name, S_GET_NAME (sym)) == 0)
+      break;
+
+  /* Tentatively create a symbol.  */
+  if (!sym)
+    {
+      /* See if we can reuse a symbol discarded by a previous parse.
+	 This may be quite common, for example when trying multiple templates
+	 for an instruction with the first reference to a valid symbol.  */
+      for (sym = orphan_sym_rootP; sym; sym = symbol_next (sym))
+	if (strcmp (name, S_GET_NAME (sym)) == 0)
+	  {
+	    symbol_remove (sym, &orphan_sym_rootP, &orphan_sym_lastP);
+	    break;
+	  }
+
+      if (!sym)
+	  sym = symbol_create (name, undefined_section, &zero_address_frag, 0);
+
+      /* Add symbol to the deferred list.  If we commit to the isntruction,
+	 then the symbol will be inserted into to the real symbol table at
+	 that point (in md_assemble).  */
+      symbol_append (sym, deferred_sym_lastP, &deferred_sym_rootP,
+		     &deferred_sym_lastP);
+    }
+
+  exp->X_op = O_symbol;
+  exp->X_add_symbol = sym;
+  exp->X_add_number = 0;
+
+  return true;
+}
+
 /* Parse a BPF register name and return the corresponding register
    number.  Return NULL in case of parse error, or a pointer to the
    first character in S that is not part of the register name.  */
@@ -1317,6 +1386,16 @@ parse_error (int length, const char *fmt, ...)
       va_end (args);
       partial_match_length = length;
     }
+
+  /* Discard deferred symbols from the failed parse.  They may potentially
+     be reused in the future from the orphan list.  */
+  while (deferred_sym_rootP)
+    {
+      symbolS *sym = deferred_sym_rootP;
+      symbol_remove (sym, &deferred_sym_rootP, &deferred_sym_lastP);
+      symbol_append (sym, orphan_sym_lastP, &orphan_sym_rootP,
+		     &orphan_sym_lastP);
+    }
 }
 
 /* Assemble a machine instruction in STR and emit the frags/bytes it
@@ -1606,6 +1685,10 @@ md_assemble (char *str ATTRIBUTE_UNUSED)
         }
     }
 
+  /* Mark that we are no longer parsing an instruction, bpf_parse_name does
+     not interfere with symbols in e.g. assembler directives.  */
+  parsing_insn_operands = false;
+
   if (opcode == NULL)
     {
       as_bad (_("unrecognized instruction `%s'"), str);
@@ -1622,6 +1705,15 @@ md_assemble (char *str ATTRIBUTE_UNUSED)
 
 #undef PARSE_ERROR
 
+  /* Commit any symbols created while parsing the instruction.  */
+  while (deferred_sym_rootP)
+    {
+      symbolS *sym = deferred_sym_rootP;
+      symbol_remove (sym, &deferred_sym_rootP, &deferred_sym_lastP);
+      symbol_append (sym, symbol_lastP, &symbol_rootP, &symbol_lastP);
+      symbol_table_insert (sym);
+    }
+
   /* Generate the frags and fixups for the parsed instruction.  */
   if (do_relax && isa_spec >= BPF_V4 && insn.is_relaxable)
     {
diff --git a/gas/config/tc-bpf.h b/gas/config/tc-bpf.h
index 9fb71eddd14..06096ef5926 100644
--- a/gas/config/tc-bpf.h
+++ b/gas/config/tc-bpf.h
@@ -51,6 +51,10 @@
    a jump to offset 0 means jump to the next instruction.  */
 #define md_single_noop_insn "ja 0"
 
+#define md_parse_name(name, exp, mode, c) \
+  bpf_parse_name (name, exp, mode)
+bool bpf_parse_name (const char *, struct expressionS *, enum expr_mode);
+
 #define TC_EQUAL_IN_INSN(c, s) bpf_tc_equal_in_insn ((c), (s))
 extern bool bpf_tc_equal_in_insn (int, char *);
 
diff --git a/gas/testsuite/gas/bpf/asm-extra-sym-1.d b/gas/testsuite/gas/bpf/asm-extra-sym-1.d
new file mode 100644
index 00000000000..113750dd3fd
--- /dev/null
+++ b/gas/testsuite/gas/bpf/asm-extra-sym-1.d
@@ -0,0 +1,7 @@
+#as: -EL -mdialect=pseudoc
+#nm: --numeric-sort
+#source: asm-extra-sym-1.s
+#name: BPF pseudoc no extra symbols 1
+
+# Note: there should be no output from nm.
+# Previously a bug in the BPF parser created an UND '*' symbol.
diff --git a/gas/testsuite/gas/bpf/asm-extra-sym-1.s b/gas/testsuite/gas/bpf/asm-extra-sym-1.s
new file mode 100644
index 00000000000..2cfa605a259
--- /dev/null
+++ b/gas/testsuite/gas/bpf/asm-extra-sym-1.s
@@ -0,0 +1 @@
+    r2 = *(u32*)(r1 + 8)
diff --git a/gas/testsuite/gas/bpf/bpf.exp b/gas/testsuite/gas/bpf/bpf.exp
index 80f5a1dbc2d..fcbeccd8ecd 100644
--- a/gas/testsuite/gas/bpf/bpf.exp
+++ b/gas/testsuite/gas/bpf/bpf.exp
@@ -72,4 +72,7 @@ if {[istarget bpf*-*-*]} {
     run_dump_test disp16-overflow-relax
     run_dump_test disp32-overflow
     run_dump_test imm32-overflow
+
+    # Test that parser does not create undefined symbols
+    run_dump_test asm-extra-sym-1
 }
-- 
2.42.0


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v2] bpf: avoid creating wrong symbols while parsing
  2023-11-17 18:54 [PATCH v2] bpf: avoid creating wrong symbols while parsing David Faust
@ 2023-11-17 19:29 ` Jose E. Marchesi
  2023-11-17 20:09   ` David Faust
  0 siblings, 1 reply; 5+ messages in thread
From: Jose E. Marchesi @ 2023-11-17 19:29 UTC (permalink / raw)
  To: David Faust; +Cc: binutils, jbeulich


Hi David.

> [ WAS: gas,bpf: cleanup bad symbols created while parsing
>   v1: https://sourceware.org/pipermail/binutils/2023-November/130556.html
>
>   Changes from v1:
>   - Rewrite patch to avoid inserting wrong symbols in the first place,
>     rather than insert, remove, re-insert dance.
>     Suggested by Jan Beulich. ]
>
> To support the "pseudo-C" asm dialect in BPF, the BPF parser must often
> attempt multiple different templates for a single instruction. In some
> cases this can cause the parser to incorrectly parse part of the
> instruction opcode as an expression, which leads to the creation of a
> new undefined symbol.
>
> Once the parser recognizes the error, the expression is discarded and it
> tries again with a new instruction template. However, symbols created
> during the process are added to the symbol table and are not removed
> even if the expression is discarded.
>
> This is a problem for BPF: generally the assembled object will be loaded
> directly to the Linux kernel, without being linked. These erroneous
> parser-created symbols are rejected by the kernel BPF loader, and the
> entire object is refused.
>
> This patch remedies the issue by tentatively creating symbols while
> parsing instruction operands, and storing them in a temporary list
> rather than immediately inserting them into the symbol table. Later,
> after the parser is sure that it has correctly parsed the instruction,
> those symbols are committed to the real symbol table.
>
> This approach is modeled directly after Jan Beulich's patch for RISC-V:
>
>   commit 7a29ee290307087e1749ce610207e93a15d0b78d
>   RISC-V: adjust logic to avoid register name symbols
>
> Many thanks to Jan for recognizing the problem as similar, and pointing
> me to that patch.
>
> Tested on x86_64-linux-gnu host for bpf-unknown-none target.
>
> gas/
>
> 	* config/tc-bpf.c (parsing_insn_operands): New.
> 	(parse_expression): Set it here.
> 	(deferred_sym_rootP, deferred_sym_lastP): New.
> 	(orphan_sym_rootP, orphan_sym_lastP): New.
> 	(bpf_parse_name): New.
> 	(parse_error): Clear deferred symbol list on error.
> 	(md_assemble): Clear parsing_insn_operands. Commit deferred
> 	symbols to symbol table on successful parse.
> 	* config/tc-bpf.h (md_parse_name): Define to...
> 	(bpf_parse_name): ...this. New prototype.
> 	* testsuite/gas/bpf/asm-extra-sym-1.s: New test source.
> 	* testsuite/gas/bpf/asm-extra-sym-1.d: New test.
> 	* testsuite/gas/bpf/bpf.exp: Run new test.
> ---
>  gas/config/tc-bpf.c                     | 92 +++++++++++++++++++++++++
>  gas/config/tc-bpf.h                     |  4 ++
>  gas/testsuite/gas/bpf/asm-extra-sym-1.d |  7 ++
>  gas/testsuite/gas/bpf/asm-extra-sym-1.s |  1 +
>  gas/testsuite/gas/bpf/bpf.exp           |  3 +
>  5 files changed, 107 insertions(+)
>  create mode 100644 gas/testsuite/gas/bpf/asm-extra-sym-1.d
>  create mode 100644 gas/testsuite/gas/bpf/asm-extra-sym-1.s
>
> diff --git a/gas/config/tc-bpf.c b/gas/config/tc-bpf.c
> index fd4144a354b..3122f80804a 100644
> --- a/gas/config/tc-bpf.c
> +++ b/gas/config/tc-bpf.c
> @@ -1223,6 +1223,7 @@ add_relaxed_insn (struct bpf_insn *insn, expressionS *exp)
>     See md_operand below to see how exp_parse_failed is used.  */
>  
>  static int exp_parse_failed = 0;
> +static bool parsing_insn_operands = false;
>  
>  static char *
>  parse_expression (char *s, expressionS *exp)
> @@ -1230,6 +1231,9 @@ parse_expression (char *s, expressionS *exp)
>    char *saved_input_line_pointer = input_line_pointer;
>    char *saved_s = s;
>  
> +  /* Wake up bpf_parse_name before the call to expression ().  */
> +  parsing_insn_operands = true;
> +
>    exp_parse_failed = 0;
>    input_line_pointer = s;
>    expression (exp);
> @@ -1251,6 +1255,71 @@ parse_expression (char *s, expressionS *exp)
>    return s;
>  }
>  
> +/* Symbols created by this parse, but not yet committed to the real
> +   symbol table.  */
> +static symbolS *deferred_sym_rootP;
> +static symbolS *deferred_sym_lastP;
> +
> +/* Symbols discarded by a previous parse.  Symbols cannot easily be freed
> +   after creation, so try to recycle.  */
> +static symbolS *orphan_sym_rootP;
> +static symbolS *orphan_sym_lastP;
> +
> +/* Implement md_parse_name hook.  Handles any symbol found in an expression.
> +   This allows us to tentatively create symbols, before we know for sure
> +   whether the parser is using the correct template for an instruction.
> +   If we end up keeping the instruction, the deferred symbols are committed
> +   to the real symbol table. This approach is modeled after the riscv port.  */
> +
> +bool
> +bpf_parse_name (const char *name, expressionS *exp, enum expr_mode mode)
> +{
> +  symbolS *sym;
> +
> +  /* If we aren't currently parsing an instruction, don't do anything.
> +     This prevents tampering with operands to directives.  */
> +  if (!parsing_insn_operands)
> +    return false;
> +
> +  gas_assert (mode == expr_normal);
> +
> +  if (symbol_find (name) != NULL)
> +    return false;

Is this check correct?

I see that expr.c:operand calls symbol_find_or_make right after calling
md_parse_name (if the later is defined) and actually installs the
resulting symbol in expressionP even if it was already defined.  That is
not done if md_parse_name returns `false'.

> +
> +  for (sym = deferred_sym_rootP; sym; sym = symbol_next (sym))
> +    if (strcmp (name, S_GET_NAME (sym)) == 0)
> +      break;
> +
> +  /* Tentatively create a symbol.  */
> +  if (!sym)
> +    {
> +      /* See if we can reuse a symbol discarded by a previous parse.
> +	 This may be quite common, for example when trying multiple templates
> +	 for an instruction with the first reference to a valid symbol.  */
> +      for (sym = orphan_sym_rootP; sym; sym = symbol_next (sym))
> +	if (strcmp (name, S_GET_NAME (sym)) == 0)
> +	  {
> +	    symbol_remove (sym, &orphan_sym_rootP, &orphan_sym_lastP);
> +	    break;
> +	  }
> +
> +      if (!sym)
> +	  sym = symbol_create (name, undefined_section, &zero_address_frag, 0);
> +
> +      /* Add symbol to the deferred list.  If we commit to the isntruction,
> +	 then the symbol will be inserted into to the real symbol table at
> +	 that point (in md_assemble).  */
> +      symbol_append (sym, deferred_sym_lastP, &deferred_sym_rootP,
> +		     &deferred_sym_lastP);
> +    }
> +
> +  exp->X_op = O_symbol;
> +  exp->X_add_symbol = sym;
> +  exp->X_add_number = 0;
> +
> +  return true;
> +}
> +
>  /* Parse a BPF register name and return the corresponding register
>     number.  Return NULL in case of parse error, or a pointer to the
>     first character in S that is not part of the register name.  */
> @@ -1317,6 +1386,16 @@ parse_error (int length, const char *fmt, ...)
>        va_end (args);
>        partial_match_length = length;
>      }
> +
> +  /* Discard deferred symbols from the failed parse.  They may potentially
> +     be reused in the future from the orphan list.  */
> +  while (deferred_sym_rootP)
> +    {
> +      symbolS *sym = deferred_sym_rootP;
> +      symbol_remove (sym, &deferred_sym_rootP, &deferred_sym_lastP);
> +      symbol_append (sym, orphan_sym_lastP, &orphan_sym_rootP,
> +		     &orphan_sym_lastP);
> +    }
>  }
>  
>  /* Assemble a machine instruction in STR and emit the frags/bytes it
> @@ -1606,6 +1685,10 @@ md_assemble (char *str ATTRIBUTE_UNUSED)
>          }
>      }
>  
> +  /* Mark that we are no longer parsing an instruction, bpf_parse_name does
> +     not interfere with symbols in e.g. assembler directives.  */
> +  parsing_insn_operands = false;
> +
>    if (opcode == NULL)
>      {
>        as_bad (_("unrecognized instruction `%s'"), str);
> @@ -1622,6 +1705,15 @@ md_assemble (char *str ATTRIBUTE_UNUSED)
>  
>  #undef PARSE_ERROR
>  
> +  /* Commit any symbols created while parsing the instruction.  */
> +  while (deferred_sym_rootP)
> +    {
> +      symbolS *sym = deferred_sym_rootP;
> +      symbol_remove (sym, &deferred_sym_rootP, &deferred_sym_lastP);
> +      symbol_append (sym, symbol_lastP, &symbol_rootP, &symbol_lastP);
> +      symbol_table_insert (sym);
> +    }
> +
>    /* Generate the frags and fixups for the parsed instruction.  */
>    if (do_relax && isa_spec >= BPF_V4 && insn.is_relaxable)
>      {
> diff --git a/gas/config/tc-bpf.h b/gas/config/tc-bpf.h
> index 9fb71eddd14..06096ef5926 100644
> --- a/gas/config/tc-bpf.h
> +++ b/gas/config/tc-bpf.h
> @@ -51,6 +51,10 @@
>     a jump to offset 0 means jump to the next instruction.  */
>  #define md_single_noop_insn "ja 0"
>  
> +#define md_parse_name(name, exp, mode, c) \
> +  bpf_parse_name (name, exp, mode)
> +bool bpf_parse_name (const char *, struct expressionS *, enum expr_mode);
> +
>  #define TC_EQUAL_IN_INSN(c, s) bpf_tc_equal_in_insn ((c), (s))
>  extern bool bpf_tc_equal_in_insn (int, char *);
>  
> diff --git a/gas/testsuite/gas/bpf/asm-extra-sym-1.d b/gas/testsuite/gas/bpf/asm-extra-sym-1.d
> new file mode 100644
> index 00000000000..113750dd3fd
> --- /dev/null
> +++ b/gas/testsuite/gas/bpf/asm-extra-sym-1.d
> @@ -0,0 +1,7 @@
> +#as: -EL -mdialect=pseudoc
> +#nm: --numeric-sort
> +#source: asm-extra-sym-1.s
> +#name: BPF pseudoc no extra symbols 1
> +
> +# Note: there should be no output from nm.
> +# Previously a bug in the BPF parser created an UND '*' symbol.
> diff --git a/gas/testsuite/gas/bpf/asm-extra-sym-1.s b/gas/testsuite/gas/bpf/asm-extra-sym-1.s
> new file mode 100644
> index 00000000000..2cfa605a259
> --- /dev/null
> +++ b/gas/testsuite/gas/bpf/asm-extra-sym-1.s
> @@ -0,0 +1 @@
> +    r2 = *(u32*)(r1 + 8)
> diff --git a/gas/testsuite/gas/bpf/bpf.exp b/gas/testsuite/gas/bpf/bpf.exp
> index 80f5a1dbc2d..fcbeccd8ecd 100644
> --- a/gas/testsuite/gas/bpf/bpf.exp
> +++ b/gas/testsuite/gas/bpf/bpf.exp
> @@ -72,4 +72,7 @@ if {[istarget bpf*-*-*]} {
>      run_dump_test disp16-overflow-relax
>      run_dump_test disp32-overflow
>      run_dump_test imm32-overflow
> +
> +    # Test that parser does not create undefined symbols
> +    run_dump_test asm-extra-sym-1
>  }

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v2] bpf: avoid creating wrong symbols while parsing
  2023-11-17 19:29 ` Jose E. Marchesi
@ 2023-11-17 20:09   ` David Faust
  2023-11-17 20:27     ` Jose E. Marchesi
  0 siblings, 1 reply; 5+ messages in thread
From: David Faust @ 2023-11-17 20:09 UTC (permalink / raw)
  To: Jose E. Marchesi; +Cc: binutils, jbeulich



On 11/17/23 11:29, Jose E. Marchesi wrote:
> 
> Hi David.
> 
>> [ WAS: gas,bpf: cleanup bad symbols created while parsing
>>   v1: https://sourceware.org/pipermail/binutils/2023-November/130556.html
>>
>>   Changes from v1:
>>   - Rewrite patch to avoid inserting wrong symbols in the first place,
>>     rather than insert, remove, re-insert dance.
>>     Suggested by Jan Beulich. ]
>>
>> To support the "pseudo-C" asm dialect in BPF, the BPF parser must often
>> attempt multiple different templates for a single instruction. In some
>> cases this can cause the parser to incorrectly parse part of the
>> instruction opcode as an expression, which leads to the creation of a
>> new undefined symbol.
>>
>> Once the parser recognizes the error, the expression is discarded and it
>> tries again with a new instruction template. However, symbols created
>> during the process are added to the symbol table and are not removed
>> even if the expression is discarded.
>>
>> This is a problem for BPF: generally the assembled object will be loaded
>> directly to the Linux kernel, without being linked. These erroneous
>> parser-created symbols are rejected by the kernel BPF loader, and the
>> entire object is refused.
>>
>> This patch remedies the issue by tentatively creating symbols while
>> parsing instruction operands, and storing them in a temporary list
>> rather than immediately inserting them into the symbol table. Later,
>> after the parser is sure that it has correctly parsed the instruction,
>> those symbols are committed to the real symbol table.
>>
>> This approach is modeled directly after Jan Beulich's patch for RISC-V:
>>
>>   commit 7a29ee290307087e1749ce610207e93a15d0b78d
>>   RISC-V: adjust logic to avoid register name symbols
>>
>> Many thanks to Jan for recognizing the problem as similar, and pointing
>> me to that patch.
>>
>> Tested on x86_64-linux-gnu host for bpf-unknown-none target.
>>
>> gas/
>>
>> 	* config/tc-bpf.c (parsing_insn_operands): New.
>> 	(parse_expression): Set it here.
>> 	(deferred_sym_rootP, deferred_sym_lastP): New.
>> 	(orphan_sym_rootP, orphan_sym_lastP): New.
>> 	(bpf_parse_name): New.
>> 	(parse_error): Clear deferred symbol list on error.
>> 	(md_assemble): Clear parsing_insn_operands. Commit deferred
>> 	symbols to symbol table on successful parse.
>> 	* config/tc-bpf.h (md_parse_name): Define to...
>> 	(bpf_parse_name): ...this. New prototype.
>> 	* testsuite/gas/bpf/asm-extra-sym-1.s: New test source.
>> 	* testsuite/gas/bpf/asm-extra-sym-1.d: New test.
>> 	* testsuite/gas/bpf/bpf.exp: Run new test.
>> ---
>>  gas/config/tc-bpf.c                     | 92 +++++++++++++++++++++++++
>>  gas/config/tc-bpf.h                     |  4 ++
>>  gas/testsuite/gas/bpf/asm-extra-sym-1.d |  7 ++
>>  gas/testsuite/gas/bpf/asm-extra-sym-1.s |  1 +
>>  gas/testsuite/gas/bpf/bpf.exp           |  3 +
>>  5 files changed, 107 insertions(+)
>>  create mode 100644 gas/testsuite/gas/bpf/asm-extra-sym-1.d
>>  create mode 100644 gas/testsuite/gas/bpf/asm-extra-sym-1.s
>>
>> diff --git a/gas/config/tc-bpf.c b/gas/config/tc-bpf.c
>> index fd4144a354b..3122f80804a 100644
>> --- a/gas/config/tc-bpf.c
>> +++ b/gas/config/tc-bpf.c
>> @@ -1223,6 +1223,7 @@ add_relaxed_insn (struct bpf_insn *insn, expressionS *exp)
>>     See md_operand below to see how exp_parse_failed is used.  */
>>  
>>  static int exp_parse_failed = 0;
>> +static bool parsing_insn_operands = false;
>>  
>>  static char *
>>  parse_expression (char *s, expressionS *exp)
>> @@ -1230,6 +1231,9 @@ parse_expression (char *s, expressionS *exp)
>>    char *saved_input_line_pointer = input_line_pointer;
>>    char *saved_s = s;
>>  
>> +  /* Wake up bpf_parse_name before the call to expression ().  */
>> +  parsing_insn_operands = true;
>> +
>>    exp_parse_failed = 0;
>>    input_line_pointer = s;
>>    expression (exp);
>> @@ -1251,6 +1255,71 @@ parse_expression (char *s, expressionS *exp)
>>    return s;
>>  }
>>  
>> +/* Symbols created by this parse, but not yet committed to the real
>> +   symbol table.  */
>> +static symbolS *deferred_sym_rootP;
>> +static symbolS *deferred_sym_lastP;
>> +
>> +/* Symbols discarded by a previous parse.  Symbols cannot easily be freed
>> +   after creation, so try to recycle.  */
>> +static symbolS *orphan_sym_rootP;
>> +static symbolS *orphan_sym_lastP;
>> +
>> +/* Implement md_parse_name hook.  Handles any symbol found in an expression.
>> +   This allows us to tentatively create symbols, before we know for sure
>> +   whether the parser is using the correct template for an instruction.
>> +   If we end up keeping the instruction, the deferred symbols are committed
>> +   to the real symbol table. This approach is modeled after the riscv port.  */
>> +
>> +bool
>> +bpf_parse_name (const char *name, expressionS *exp, enum expr_mode mode)
>> +{
>> +  symbolS *sym;
>> +
>> +  /* If we aren't currently parsing an instruction, don't do anything.
>> +     This prevents tampering with operands to directives.  */
>> +  if (!parsing_insn_operands)
>> +    return false;
>> +
>> +  gas_assert (mode == expr_normal);
>> +
>> +  if (symbol_find (name) != NULL)
>> +    return false;
> 
> Is this check correct?
> 
> I see that expr.c:operand calls symbol_find_or_make right after calling
> md_parse_name (if the later is defined) and actually installs the
> resulting symbol in expressionP even if it was already defined.  That is
> not done if md_parse_name returns `false'.

Hmm, I think you may have the logic in operand () flipped.
If md_parse_name returns 'false' then symbol_find_or_make is called;
if it returns 'true' then the 'break' is hit:

#ifdef md_parse_name
	...
	  if (md_parse_name (name, expressionP, mode, &c))
	    {
	      restore_line_pointer (c);
	      break;
	    }
#endif
	  symbolP = symbol_find_or_make (name);
	...

That is what we want to happen - if the symbol_find in parse_name
returns non-NULL, then we know a real symbol already exists for the
name we are looking at, and we should use it. In that case,
md_parse_name returns false, so operand () calls symbol_find_or_make,
which finds the extant symbol and installs it in the expression.

> 
>> +
>> +  for (sym = deferred_sym_rootP; sym; sym = symbol_next (sym))
>> +    if (strcmp (name, S_GET_NAME (sym)) == 0)
>> +      break;
>> +
>> +  /* Tentatively create a symbol.  */
>> +  if (!sym)
>> +    {
>> +      /* See if we can reuse a symbol discarded by a previous parse.
>> +	 This may be quite common, for example when trying multiple templates
>> +	 for an instruction with the first reference to a valid symbol.  */
>> +      for (sym = orphan_sym_rootP; sym; sym = symbol_next (sym))
>> +	if (strcmp (name, S_GET_NAME (sym)) == 0)
>> +	  {
>> +	    symbol_remove (sym, &orphan_sym_rootP, &orphan_sym_lastP);
>> +	    break;
>> +	  }
>> +
>> +      if (!sym)
>> +	  sym = symbol_create (name, undefined_section, &zero_address_frag, 0);
>> +
>> +      /* Add symbol to the deferred list.  If we commit to the isntruction,
>> +	 then the symbol will be inserted into to the real symbol table at
>> +	 that point (in md_assemble).  */
>> +      symbol_append (sym, deferred_sym_lastP, &deferred_sym_rootP,
>> +		     &deferred_sym_lastP);
>> +    }
>> +
>> +  exp->X_op = O_symbol;
>> +  exp->X_add_symbol = sym;
>> +  exp->X_add_number = 0;
>> +
>> +  return true;
>> +}
>> +
>>  /* Parse a BPF register name and return the corresponding register
>>     number.  Return NULL in case of parse error, or a pointer to the
>>     first character in S that is not part of the register name.  */
>> @@ -1317,6 +1386,16 @@ parse_error (int length, const char *fmt, ...)
>>        va_end (args);
>>        partial_match_length = length;
>>      }
>> +
>> +  /* Discard deferred symbols from the failed parse.  They may potentially
>> +     be reused in the future from the orphan list.  */
>> +  while (deferred_sym_rootP)
>> +    {
>> +      symbolS *sym = deferred_sym_rootP;
>> +      symbol_remove (sym, &deferred_sym_rootP, &deferred_sym_lastP);
>> +      symbol_append (sym, orphan_sym_lastP, &orphan_sym_rootP,
>> +		     &orphan_sym_lastP);
>> +    }
>>  }
>>  
>>  /* Assemble a machine instruction in STR and emit the frags/bytes it
>> @@ -1606,6 +1685,10 @@ md_assemble (char *str ATTRIBUTE_UNUSED)
>>          }
>>      }
>>  
>> +  /* Mark that we are no longer parsing an instruction, bpf_parse_name does
>> +     not interfere with symbols in e.g. assembler directives.  */
>> +  parsing_insn_operands = false;
>> +
>>    if (opcode == NULL)
>>      {
>>        as_bad (_("unrecognized instruction `%s'"), str);
>> @@ -1622,6 +1705,15 @@ md_assemble (char *str ATTRIBUTE_UNUSED)
>>  
>>  #undef PARSE_ERROR
>>  
>> +  /* Commit any symbols created while parsing the instruction.  */
>> +  while (deferred_sym_rootP)
>> +    {
>> +      symbolS *sym = deferred_sym_rootP;
>> +      symbol_remove (sym, &deferred_sym_rootP, &deferred_sym_lastP);
>> +      symbol_append (sym, symbol_lastP, &symbol_rootP, &symbol_lastP);
>> +      symbol_table_insert (sym);
>> +    }
>> +
>>    /* Generate the frags and fixups for the parsed instruction.  */
>>    if (do_relax && isa_spec >= BPF_V4 && insn.is_relaxable)
>>      {
>> diff --git a/gas/config/tc-bpf.h b/gas/config/tc-bpf.h
>> index 9fb71eddd14..06096ef5926 100644
>> --- a/gas/config/tc-bpf.h
>> +++ b/gas/config/tc-bpf.h
>> @@ -51,6 +51,10 @@
>>     a jump to offset 0 means jump to the next instruction.  */
>>  #define md_single_noop_insn "ja 0"
>>  
>> +#define md_parse_name(name, exp, mode, c) \
>> +  bpf_parse_name (name, exp, mode)
>> +bool bpf_parse_name (const char *, struct expressionS *, enum expr_mode);
>> +
>>  #define TC_EQUAL_IN_INSN(c, s) bpf_tc_equal_in_insn ((c), (s))
>>  extern bool bpf_tc_equal_in_insn (int, char *);
>>  
>> diff --git a/gas/testsuite/gas/bpf/asm-extra-sym-1.d b/gas/testsuite/gas/bpf/asm-extra-sym-1.d
>> new file mode 100644
>> index 00000000000..113750dd3fd
>> --- /dev/null
>> +++ b/gas/testsuite/gas/bpf/asm-extra-sym-1.d
>> @@ -0,0 +1,7 @@
>> +#as: -EL -mdialect=pseudoc
>> +#nm: --numeric-sort
>> +#source: asm-extra-sym-1.s
>> +#name: BPF pseudoc no extra symbols 1
>> +
>> +# Note: there should be no output from nm.
>> +# Previously a bug in the BPF parser created an UND '*' symbol.
>> diff --git a/gas/testsuite/gas/bpf/asm-extra-sym-1.s b/gas/testsuite/gas/bpf/asm-extra-sym-1.s
>> new file mode 100644
>> index 00000000000..2cfa605a259
>> --- /dev/null
>> +++ b/gas/testsuite/gas/bpf/asm-extra-sym-1.s
>> @@ -0,0 +1 @@
>> +    r2 = *(u32*)(r1 + 8)
>> diff --git a/gas/testsuite/gas/bpf/bpf.exp b/gas/testsuite/gas/bpf/bpf.exp
>> index 80f5a1dbc2d..fcbeccd8ecd 100644
>> --- a/gas/testsuite/gas/bpf/bpf.exp
>> +++ b/gas/testsuite/gas/bpf/bpf.exp
>> @@ -72,4 +72,7 @@ if {[istarget bpf*-*-*]} {
>>      run_dump_test disp16-overflow-relax
>>      run_dump_test disp32-overflow
>>      run_dump_test imm32-overflow
>> +
>> +    # Test that parser does not create undefined symbols
>> +    run_dump_test asm-extra-sym-1
>>  }

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v2] bpf: avoid creating wrong symbols while parsing
  2023-11-17 20:09   ` David Faust
@ 2023-11-17 20:27     ` Jose E. Marchesi
  2023-11-17 21:41       ` David Faust
  0 siblings, 1 reply; 5+ messages in thread
From: Jose E. Marchesi @ 2023-11-17 20:27 UTC (permalink / raw)
  To: David Faust; +Cc: binutils, jbeulich


> On 11/17/23 11:29, Jose E. Marchesi wrote:
>> 
>> Hi David.
>> 
>>> [ WAS: gas,bpf: cleanup bad symbols created while parsing
>>>   v1: https://sourceware.org/pipermail/binutils/2023-November/130556.html
>>>
>>>   Changes from v1:
>>>   - Rewrite patch to avoid inserting wrong symbols in the first place,
>>>     rather than insert, remove, re-insert dance.
>>>     Suggested by Jan Beulich. ]
>>>
>>> To support the "pseudo-C" asm dialect in BPF, the BPF parser must often
>>> attempt multiple different templates for a single instruction. In some
>>> cases this can cause the parser to incorrectly parse part of the
>>> instruction opcode as an expression, which leads to the creation of a
>>> new undefined symbol.
>>>
>>> Once the parser recognizes the error, the expression is discarded and it
>>> tries again with a new instruction template. However, symbols created
>>> during the process are added to the symbol table and are not removed
>>> even if the expression is discarded.
>>>
>>> This is a problem for BPF: generally the assembled object will be loaded
>>> directly to the Linux kernel, without being linked. These erroneous
>>> parser-created symbols are rejected by the kernel BPF loader, and the
>>> entire object is refused.
>>>
>>> This patch remedies the issue by tentatively creating symbols while
>>> parsing instruction operands, and storing them in a temporary list
>>> rather than immediately inserting them into the symbol table. Later,
>>> after the parser is sure that it has correctly parsed the instruction,
>>> those symbols are committed to the real symbol table.
>>>
>>> This approach is modeled directly after Jan Beulich's patch for RISC-V:
>>>
>>>   commit 7a29ee290307087e1749ce610207e93a15d0b78d
>>>   RISC-V: adjust logic to avoid register name symbols
>>>
>>> Many thanks to Jan for recognizing the problem as similar, and pointing
>>> me to that patch.
>>>
>>> Tested on x86_64-linux-gnu host for bpf-unknown-none target.
>>>
>>> gas/
>>>
>>> 	* config/tc-bpf.c (parsing_insn_operands): New.
>>> 	(parse_expression): Set it here.
>>> 	(deferred_sym_rootP, deferred_sym_lastP): New.
>>> 	(orphan_sym_rootP, orphan_sym_lastP): New.
>>> 	(bpf_parse_name): New.
>>> 	(parse_error): Clear deferred symbol list on error.
>>> 	(md_assemble): Clear parsing_insn_operands. Commit deferred
>>> 	symbols to symbol table on successful parse.
>>> 	* config/tc-bpf.h (md_parse_name): Define to...
>>> 	(bpf_parse_name): ...this. New prototype.
>>> 	* testsuite/gas/bpf/asm-extra-sym-1.s: New test source.
>>> 	* testsuite/gas/bpf/asm-extra-sym-1.d: New test.
>>> 	* testsuite/gas/bpf/bpf.exp: Run new test.
>>> ---
>>>  gas/config/tc-bpf.c                     | 92 +++++++++++++++++++++++++
>>>  gas/config/tc-bpf.h                     |  4 ++
>>>  gas/testsuite/gas/bpf/asm-extra-sym-1.d |  7 ++
>>>  gas/testsuite/gas/bpf/asm-extra-sym-1.s |  1 +
>>>  gas/testsuite/gas/bpf/bpf.exp           |  3 +
>>>  5 files changed, 107 insertions(+)
>>>  create mode 100644 gas/testsuite/gas/bpf/asm-extra-sym-1.d
>>>  create mode 100644 gas/testsuite/gas/bpf/asm-extra-sym-1.s
>>>
>>> diff --git a/gas/config/tc-bpf.c b/gas/config/tc-bpf.c
>>> index fd4144a354b..3122f80804a 100644
>>> --- a/gas/config/tc-bpf.c
>>> +++ b/gas/config/tc-bpf.c
>>> @@ -1223,6 +1223,7 @@ add_relaxed_insn (struct bpf_insn *insn, expressionS *exp)
>>>     See md_operand below to see how exp_parse_failed is used.  */
>>>  
>>>  static int exp_parse_failed = 0;
>>> +static bool parsing_insn_operands = false;
>>>  
>>>  static char *
>>>  parse_expression (char *s, expressionS *exp)
>>> @@ -1230,6 +1231,9 @@ parse_expression (char *s, expressionS *exp)
>>>    char *saved_input_line_pointer = input_line_pointer;
>>>    char *saved_s = s;
>>>  
>>> +  /* Wake up bpf_parse_name before the call to expression ().  */
>>> +  parsing_insn_operands = true;
>>> +
>>>    exp_parse_failed = 0;
>>>    input_line_pointer = s;
>>>    expression (exp);
>>> @@ -1251,6 +1255,71 @@ parse_expression (char *s, expressionS *exp)
>>>    return s;
>>>  }
>>>  
>>> +/* Symbols created by this parse, but not yet committed to the real
>>> +   symbol table.  */
>>> +static symbolS *deferred_sym_rootP;
>>> +static symbolS *deferred_sym_lastP;
>>> +
>>> +/* Symbols discarded by a previous parse.  Symbols cannot easily be freed
>>> +   after creation, so try to recycle.  */
>>> +static symbolS *orphan_sym_rootP;
>>> +static symbolS *orphan_sym_lastP;
>>> +
>>> +/* Implement md_parse_name hook.  Handles any symbol found in an expression.
>>> +   This allows us to tentatively create symbols, before we know for sure
>>> +   whether the parser is using the correct template for an instruction.
>>> +   If we end up keeping the instruction, the deferred symbols are committed
>>> +   to the real symbol table. This approach is modeled after the riscv port.  */
>>> +
>>> +bool
>>> +bpf_parse_name (const char *name, expressionS *exp, enum expr_mode mode)
>>> +{
>>> +  symbolS *sym;
>>> +
>>> +  /* If we aren't currently parsing an instruction, don't do anything.
>>> +     This prevents tampering with operands to directives.  */
>>> +  if (!parsing_insn_operands)
>>> +    return false;
>>> +
>>> +  gas_assert (mode == expr_normal);
>>> +
>>> +  if (symbol_find (name) != NULL)
>>> +    return false;
>> 
>> Is this check correct?
>> 
>> I see that expr.c:operand calls symbol_find_or_make right after calling
>> md_parse_name (if the later is defined) and actually installs the
>> resulting symbol in expressionP even if it was already defined.  That is
>> not done if md_parse_name returns `false'.
>
> Hmm, I think you may have the logic in operand () flipped.
> If md_parse_name returns 'false' then symbol_find_or_make is called;
> if it returns 'true' then the 'break' is hit:
>
> #ifdef md_parse_name
> 	...
> 	  if (md_parse_name (name, expressionP, mode, &c))
> 	    {
> 	      restore_line_pointer (c);
> 	      break;
> 	    }
> #endif
> 	  symbolP = symbol_find_or_make (name);
> 	...
>
> That is what we want to happen - if the symbol_find in parse_name
> returns non-NULL, then we know a real symbol already exists for the
> name we are looking at, and we should use it. In that case,
> md_parse_name returns false, so operand () calls symbol_find_or_make,
> which finds the extant symbol and installs it in the expression.

Perfect then.
Please apply.
Thanks!

>
>> 
>>> +
>>> +  for (sym = deferred_sym_rootP; sym; sym = symbol_next (sym))
>>> +    if (strcmp (name, S_GET_NAME (sym)) == 0)
>>> +      break;
>>> +
>>> +  /* Tentatively create a symbol.  */
>>> +  if (!sym)
>>> +    {
>>> +      /* See if we can reuse a symbol discarded by a previous parse.
>>> +	 This may be quite common, for example when trying multiple templates
>>> +	 for an instruction with the first reference to a valid symbol.  */
>>> +      for (sym = orphan_sym_rootP; sym; sym = symbol_next (sym))
>>> +	if (strcmp (name, S_GET_NAME (sym)) == 0)
>>> +	  {
>>> +	    symbol_remove (sym, &orphan_sym_rootP, &orphan_sym_lastP);
>>> +	    break;
>>> +	  }
>>> +
>>> +      if (!sym)
>>> +	  sym = symbol_create (name, undefined_section, &zero_address_frag, 0);
>>> +
>>> +      /* Add symbol to the deferred list.  If we commit to the isntruction,
>>> +	 then the symbol will be inserted into to the real symbol table at
>>> +	 that point (in md_assemble).  */
>>> +      symbol_append (sym, deferred_sym_lastP, &deferred_sym_rootP,
>>> +		     &deferred_sym_lastP);
>>> +    }
>>> +
>>> +  exp->X_op = O_symbol;
>>> +  exp->X_add_symbol = sym;
>>> +  exp->X_add_number = 0;
>>> +
>>> +  return true;
>>> +}
>>> +
>>>  /* Parse a BPF register name and return the corresponding register
>>>     number.  Return NULL in case of parse error, or a pointer to the
>>>     first character in S that is not part of the register name.  */
>>> @@ -1317,6 +1386,16 @@ parse_error (int length, const char *fmt, ...)
>>>        va_end (args);
>>>        partial_match_length = length;
>>>      }
>>> +
>>> +  /* Discard deferred symbols from the failed parse.  They may potentially
>>> +     be reused in the future from the orphan list.  */
>>> +  while (deferred_sym_rootP)
>>> +    {
>>> +      symbolS *sym = deferred_sym_rootP;
>>> +      symbol_remove (sym, &deferred_sym_rootP, &deferred_sym_lastP);
>>> +      symbol_append (sym, orphan_sym_lastP, &orphan_sym_rootP,
>>> +		     &orphan_sym_lastP);
>>> +    }
>>>  }
>>>  
>>>  /* Assemble a machine instruction in STR and emit the frags/bytes it
>>> @@ -1606,6 +1685,10 @@ md_assemble (char *str ATTRIBUTE_UNUSED)
>>>          }
>>>      }
>>>  
>>> +  /* Mark that we are no longer parsing an instruction, bpf_parse_name does
>>> +     not interfere with symbols in e.g. assembler directives.  */
>>> +  parsing_insn_operands = false;
>>> +
>>>    if (opcode == NULL)
>>>      {
>>>        as_bad (_("unrecognized instruction `%s'"), str);
>>> @@ -1622,6 +1705,15 @@ md_assemble (char *str ATTRIBUTE_UNUSED)
>>>  
>>>  #undef PARSE_ERROR
>>>  
>>> +  /* Commit any symbols created while parsing the instruction.  */
>>> +  while (deferred_sym_rootP)
>>> +    {
>>> +      symbolS *sym = deferred_sym_rootP;
>>> +      symbol_remove (sym, &deferred_sym_rootP, &deferred_sym_lastP);
>>> +      symbol_append (sym, symbol_lastP, &symbol_rootP, &symbol_lastP);
>>> +      symbol_table_insert (sym);
>>> +    }
>>> +
>>>    /* Generate the frags and fixups for the parsed instruction.  */
>>>    if (do_relax && isa_spec >= BPF_V4 && insn.is_relaxable)
>>>      {
>>> diff --git a/gas/config/tc-bpf.h b/gas/config/tc-bpf.h
>>> index 9fb71eddd14..06096ef5926 100644
>>> --- a/gas/config/tc-bpf.h
>>> +++ b/gas/config/tc-bpf.h
>>> @@ -51,6 +51,10 @@
>>>     a jump to offset 0 means jump to the next instruction.  */
>>>  #define md_single_noop_insn "ja 0"
>>>  
>>> +#define md_parse_name(name, exp, mode, c) \
>>> +  bpf_parse_name (name, exp, mode)
>>> +bool bpf_parse_name (const char *, struct expressionS *, enum expr_mode);
>>> +
>>>  #define TC_EQUAL_IN_INSN(c, s) bpf_tc_equal_in_insn ((c), (s))
>>>  extern bool bpf_tc_equal_in_insn (int, char *);
>>>  
>>> diff --git a/gas/testsuite/gas/bpf/asm-extra-sym-1.d b/gas/testsuite/gas/bpf/asm-extra-sym-1.d
>>> new file mode 100644
>>> index 00000000000..113750dd3fd
>>> --- /dev/null
>>> +++ b/gas/testsuite/gas/bpf/asm-extra-sym-1.d
>>> @@ -0,0 +1,7 @@
>>> +#as: -EL -mdialect=pseudoc
>>> +#nm: --numeric-sort
>>> +#source: asm-extra-sym-1.s
>>> +#name: BPF pseudoc no extra symbols 1
>>> +
>>> +# Note: there should be no output from nm.
>>> +# Previously a bug in the BPF parser created an UND '*' symbol.
>>> diff --git a/gas/testsuite/gas/bpf/asm-extra-sym-1.s b/gas/testsuite/gas/bpf/asm-extra-sym-1.s
>>> new file mode 100644
>>> index 00000000000..2cfa605a259
>>> --- /dev/null
>>> +++ b/gas/testsuite/gas/bpf/asm-extra-sym-1.s
>>> @@ -0,0 +1 @@
>>> +    r2 = *(u32*)(r1 + 8)
>>> diff --git a/gas/testsuite/gas/bpf/bpf.exp b/gas/testsuite/gas/bpf/bpf.exp
>>> index 80f5a1dbc2d..fcbeccd8ecd 100644
>>> --- a/gas/testsuite/gas/bpf/bpf.exp
>>> +++ b/gas/testsuite/gas/bpf/bpf.exp
>>> @@ -72,4 +72,7 @@ if {[istarget bpf*-*-*]} {
>>>      run_dump_test disp16-overflow-relax
>>>      run_dump_test disp32-overflow
>>>      run_dump_test imm32-overflow
>>> +
>>> +    # Test that parser does not create undefined symbols
>>> +    run_dump_test asm-extra-sym-1
>>>  }

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v2] bpf: avoid creating wrong symbols while parsing
  2023-11-17 20:27     ` Jose E. Marchesi
@ 2023-11-17 21:41       ` David Faust
  0 siblings, 0 replies; 5+ messages in thread
From: David Faust @ 2023-11-17 21:41 UTC (permalink / raw)
  To: Jose E. Marchesi; +Cc: binutils, jbeulich



On 11/17/23 12:27, Jose E. Marchesi wrote:
> 
>> On 11/17/23 11:29, Jose E. Marchesi wrote:
>>>
>>> Hi David.
>>>
>>>> [ WAS: gas,bpf: cleanup bad symbols created while parsing
>>>>   v1: https://sourceware.org/pipermail/binutils/2023-November/130556.html
>>>>
>>>>   Changes from v1:
>>>>   - Rewrite patch to avoid inserting wrong symbols in the first place,
>>>>     rather than insert, remove, re-insert dance.
>>>>     Suggested by Jan Beulich. ]
>>>>
>>>> To support the "pseudo-C" asm dialect in BPF, the BPF parser must often
>>>> attempt multiple different templates for a single instruction. In some
>>>> cases this can cause the parser to incorrectly parse part of the
>>>> instruction opcode as an expression, which leads to the creation of a
>>>> new undefined symbol.
>>>>
>>>> Once the parser recognizes the error, the expression is discarded and it
>>>> tries again with a new instruction template. However, symbols created
>>>> during the process are added to the symbol table and are not removed
>>>> even if the expression is discarded.
>>>>
>>>> This is a problem for BPF: generally the assembled object will be loaded
>>>> directly to the Linux kernel, without being linked. These erroneous
>>>> parser-created symbols are rejected by the kernel BPF loader, and the
>>>> entire object is refused.
>>>>
>>>> This patch remedies the issue by tentatively creating symbols while
>>>> parsing instruction operands, and storing them in a temporary list
>>>> rather than immediately inserting them into the symbol table. Later,
>>>> after the parser is sure that it has correctly parsed the instruction,
>>>> those symbols are committed to the real symbol table.
>>>>
>>>> This approach is modeled directly after Jan Beulich's patch for RISC-V:
>>>>
>>>>   commit 7a29ee290307087e1749ce610207e93a15d0b78d
>>>>   RISC-V: adjust logic to avoid register name symbols
>>>>
>>>> Many thanks to Jan for recognizing the problem as similar, and pointing
>>>> me to that patch.
>>>>
>>>> Tested on x86_64-linux-gnu host for bpf-unknown-none target.
>>>>
>>>> gas/
>>>>
>>>> 	* config/tc-bpf.c (parsing_insn_operands): New.
>>>> 	(parse_expression): Set it here.
>>>> 	(deferred_sym_rootP, deferred_sym_lastP): New.
>>>> 	(orphan_sym_rootP, orphan_sym_lastP): New.
>>>> 	(bpf_parse_name): New.
>>>> 	(parse_error): Clear deferred symbol list on error.
>>>> 	(md_assemble): Clear parsing_insn_operands. Commit deferred
>>>> 	symbols to symbol table on successful parse.
>>>> 	* config/tc-bpf.h (md_parse_name): Define to...
>>>> 	(bpf_parse_name): ...this. New prototype.
>>>> 	* testsuite/gas/bpf/asm-extra-sym-1.s: New test source.
>>>> 	* testsuite/gas/bpf/asm-extra-sym-1.d: New test.
>>>> 	* testsuite/gas/bpf/bpf.exp: Run new test.
>>>> ---
>>>>  gas/config/tc-bpf.c                     | 92 +++++++++++++++++++++++++
>>>>  gas/config/tc-bpf.h                     |  4 ++
>>>>  gas/testsuite/gas/bpf/asm-extra-sym-1.d |  7 ++
>>>>  gas/testsuite/gas/bpf/asm-extra-sym-1.s |  1 +
>>>>  gas/testsuite/gas/bpf/bpf.exp           |  3 +
>>>>  5 files changed, 107 insertions(+)
>>>>  create mode 100644 gas/testsuite/gas/bpf/asm-extra-sym-1.d
>>>>  create mode 100644 gas/testsuite/gas/bpf/asm-extra-sym-1.s
>>>>
>>>> diff --git a/gas/config/tc-bpf.c b/gas/config/tc-bpf.c
>>>> index fd4144a354b..3122f80804a 100644
>>>> --- a/gas/config/tc-bpf.c
>>>> +++ b/gas/config/tc-bpf.c
>>>> @@ -1223,6 +1223,7 @@ add_relaxed_insn (struct bpf_insn *insn, expressionS *exp)
>>>>     See md_operand below to see how exp_parse_failed is used.  */
>>>>  
>>>>  static int exp_parse_failed = 0;
>>>> +static bool parsing_insn_operands = false;
>>>>  
>>>>  static char *
>>>>  parse_expression (char *s, expressionS *exp)
>>>> @@ -1230,6 +1231,9 @@ parse_expression (char *s, expressionS *exp)
>>>>    char *saved_input_line_pointer = input_line_pointer;
>>>>    char *saved_s = s;
>>>>  
>>>> +  /* Wake up bpf_parse_name before the call to expression ().  */
>>>> +  parsing_insn_operands = true;
>>>> +
>>>>    exp_parse_failed = 0;
>>>>    input_line_pointer = s;
>>>>    expression (exp);
>>>> @@ -1251,6 +1255,71 @@ parse_expression (char *s, expressionS *exp)
>>>>    return s;
>>>>  }
>>>>  
>>>> +/* Symbols created by this parse, but not yet committed to the real
>>>> +   symbol table.  */
>>>> +static symbolS *deferred_sym_rootP;
>>>> +static symbolS *deferred_sym_lastP;
>>>> +
>>>> +/* Symbols discarded by a previous parse.  Symbols cannot easily be freed
>>>> +   after creation, so try to recycle.  */
>>>> +static symbolS *orphan_sym_rootP;
>>>> +static symbolS *orphan_sym_lastP;
>>>> +
>>>> +/* Implement md_parse_name hook.  Handles any symbol found in an expression.
>>>> +   This allows us to tentatively create symbols, before we know for sure
>>>> +   whether the parser is using the correct template for an instruction.
>>>> +   If we end up keeping the instruction, the deferred symbols are committed
>>>> +   to the real symbol table. This approach is modeled after the riscv port.  */
>>>> +
>>>> +bool
>>>> +bpf_parse_name (const char *name, expressionS *exp, enum expr_mode mode)
>>>> +{
>>>> +  symbolS *sym;
>>>> +
>>>> +  /* If we aren't currently parsing an instruction, don't do anything.
>>>> +     This prevents tampering with operands to directives.  */
>>>> +  if (!parsing_insn_operands)
>>>> +    return false;
>>>> +
>>>> +  gas_assert (mode == expr_normal);
>>>> +
>>>> +  if (symbol_find (name) != NULL)
>>>> +    return false;
>>>
>>> Is this check correct?
>>>
>>> I see that expr.c:operand calls symbol_find_or_make right after calling
>>> md_parse_name (if the later is defined) and actually installs the
>>> resulting symbol in expressionP even if it was already defined.  That is
>>> not done if md_parse_name returns `false'.
>>
>> Hmm, I think you may have the logic in operand () flipped.
>> If md_parse_name returns 'false' then symbol_find_or_make is called;
>> if it returns 'true' then the 'break' is hit:
>>
>> #ifdef md_parse_name
>> 	...
>> 	  if (md_parse_name (name, expressionP, mode, &c))
>> 	    {
>> 	      restore_line_pointer (c);
>> 	      break;
>> 	    }
>> #endif
>> 	  symbolP = symbol_find_or_make (name);
>> 	...
>>
>> That is what we want to happen - if the symbol_find in parse_name
>> returns non-NULL, then we know a real symbol already exists for the
>> name we are looking at, and we should use it. In that case,
>> md_parse_name returns false, so operand () calls symbol_find_or_make,
>> which finds the extant symbol and installs it in the expression.
> 
> Perfect then.
> Please apply.
> Thanks!

Pushed, thanks.

> 
>>
>>>
>>>> +
>>>> +  for (sym = deferred_sym_rootP; sym; sym = symbol_next (sym))
>>>> +    if (strcmp (name, S_GET_NAME (sym)) == 0)
>>>> +      break;
>>>> +
>>>> +  /* Tentatively create a symbol.  */
>>>> +  if (!sym)
>>>> +    {
>>>> +      /* See if we can reuse a symbol discarded by a previous parse.
>>>> +	 This may be quite common, for example when trying multiple templates
>>>> +	 for an instruction with the first reference to a valid symbol.  */
>>>> +      for (sym = orphan_sym_rootP; sym; sym = symbol_next (sym))
>>>> +	if (strcmp (name, S_GET_NAME (sym)) == 0)
>>>> +	  {
>>>> +	    symbol_remove (sym, &orphan_sym_rootP, &orphan_sym_lastP);
>>>> +	    break;
>>>> +	  }
>>>> +
>>>> +      if (!sym)
>>>> +	  sym = symbol_create (name, undefined_section, &zero_address_frag, 0);
>>>> +
>>>> +      /* Add symbol to the deferred list.  If we commit to the isntruction,
>>>> +	 then the symbol will be inserted into to the real symbol table at
>>>> +	 that point (in md_assemble).  */
>>>> +      symbol_append (sym, deferred_sym_lastP, &deferred_sym_rootP,
>>>> +		     &deferred_sym_lastP);
>>>> +    }
>>>> +
>>>> +  exp->X_op = O_symbol;
>>>> +  exp->X_add_symbol = sym;
>>>> +  exp->X_add_number = 0;
>>>> +
>>>> +  return true;
>>>> +}
>>>> +
>>>>  /* Parse a BPF register name and return the corresponding register
>>>>     number.  Return NULL in case of parse error, or a pointer to the
>>>>     first character in S that is not part of the register name.  */
>>>> @@ -1317,6 +1386,16 @@ parse_error (int length, const char *fmt, ...)
>>>>        va_end (args);
>>>>        partial_match_length = length;
>>>>      }
>>>> +
>>>> +  /* Discard deferred symbols from the failed parse.  They may potentially
>>>> +     be reused in the future from the orphan list.  */
>>>> +  while (deferred_sym_rootP)
>>>> +    {
>>>> +      symbolS *sym = deferred_sym_rootP;
>>>> +      symbol_remove (sym, &deferred_sym_rootP, &deferred_sym_lastP);
>>>> +      symbol_append (sym, orphan_sym_lastP, &orphan_sym_rootP,
>>>> +		     &orphan_sym_lastP);
>>>> +    }
>>>>  }
>>>>  
>>>>  /* Assemble a machine instruction in STR and emit the frags/bytes it
>>>> @@ -1606,6 +1685,10 @@ md_assemble (char *str ATTRIBUTE_UNUSED)
>>>>          }
>>>>      }
>>>>  
>>>> +  /* Mark that we are no longer parsing an instruction, bpf_parse_name does
>>>> +     not interfere with symbols in e.g. assembler directives.  */
>>>> +  parsing_insn_operands = false;
>>>> +
>>>>    if (opcode == NULL)
>>>>      {
>>>>        as_bad (_("unrecognized instruction `%s'"), str);
>>>> @@ -1622,6 +1705,15 @@ md_assemble (char *str ATTRIBUTE_UNUSED)
>>>>  
>>>>  #undef PARSE_ERROR
>>>>  
>>>> +  /* Commit any symbols created while parsing the instruction.  */
>>>> +  while (deferred_sym_rootP)
>>>> +    {
>>>> +      symbolS *sym = deferred_sym_rootP;
>>>> +      symbol_remove (sym, &deferred_sym_rootP, &deferred_sym_lastP);
>>>> +      symbol_append (sym, symbol_lastP, &symbol_rootP, &symbol_lastP);
>>>> +      symbol_table_insert (sym);
>>>> +    }
>>>> +
>>>>    /* Generate the frags and fixups for the parsed instruction.  */
>>>>    if (do_relax && isa_spec >= BPF_V4 && insn.is_relaxable)
>>>>      {
>>>> diff --git a/gas/config/tc-bpf.h b/gas/config/tc-bpf.h
>>>> index 9fb71eddd14..06096ef5926 100644
>>>> --- a/gas/config/tc-bpf.h
>>>> +++ b/gas/config/tc-bpf.h
>>>> @@ -51,6 +51,10 @@
>>>>     a jump to offset 0 means jump to the next instruction.  */
>>>>  #define md_single_noop_insn "ja 0"
>>>>  
>>>> +#define md_parse_name(name, exp, mode, c) \
>>>> +  bpf_parse_name (name, exp, mode)
>>>> +bool bpf_parse_name (const char *, struct expressionS *, enum expr_mode);
>>>> +
>>>>  #define TC_EQUAL_IN_INSN(c, s) bpf_tc_equal_in_insn ((c), (s))
>>>>  extern bool bpf_tc_equal_in_insn (int, char *);
>>>>  
>>>> diff --git a/gas/testsuite/gas/bpf/asm-extra-sym-1.d b/gas/testsuite/gas/bpf/asm-extra-sym-1.d
>>>> new file mode 100644
>>>> index 00000000000..113750dd3fd
>>>> --- /dev/null
>>>> +++ b/gas/testsuite/gas/bpf/asm-extra-sym-1.d
>>>> @@ -0,0 +1,7 @@
>>>> +#as: -EL -mdialect=pseudoc
>>>> +#nm: --numeric-sort
>>>> +#source: asm-extra-sym-1.s
>>>> +#name: BPF pseudoc no extra symbols 1
>>>> +
>>>> +# Note: there should be no output from nm.
>>>> +# Previously a bug in the BPF parser created an UND '*' symbol.
>>>> diff --git a/gas/testsuite/gas/bpf/asm-extra-sym-1.s b/gas/testsuite/gas/bpf/asm-extra-sym-1.s
>>>> new file mode 100644
>>>> index 00000000000..2cfa605a259
>>>> --- /dev/null
>>>> +++ b/gas/testsuite/gas/bpf/asm-extra-sym-1.s
>>>> @@ -0,0 +1 @@
>>>> +    r2 = *(u32*)(r1 + 8)
>>>> diff --git a/gas/testsuite/gas/bpf/bpf.exp b/gas/testsuite/gas/bpf/bpf.exp
>>>> index 80f5a1dbc2d..fcbeccd8ecd 100644
>>>> --- a/gas/testsuite/gas/bpf/bpf.exp
>>>> +++ b/gas/testsuite/gas/bpf/bpf.exp
>>>> @@ -72,4 +72,7 @@ if {[istarget bpf*-*-*]} {
>>>>      run_dump_test disp16-overflow-relax
>>>>      run_dump_test disp32-overflow
>>>>      run_dump_test imm32-overflow
>>>> +
>>>> +    # Test that parser does not create undefined symbols
>>>> +    run_dump_test asm-extra-sym-1
>>>>  }

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2023-11-17 21:41 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-11-17 18:54 [PATCH v2] bpf: avoid creating wrong symbols while parsing David Faust
2023-11-17 19:29 ` Jose E. Marchesi
2023-11-17 20:09   ` David Faust
2023-11-17 20:27     ` Jose E. Marchesi
2023-11-17 21:41       ` David Faust

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).