public inbox for gcc-patches@gcc.gnu.org
* [PATCH] [AArch64] support -mfentry feature for arm64
  2016-03-14  8:13 [PATCH] [AArch64] support -mfentry feature for arm64 Li Bin
@ 2016-03-14  8:12 ` Li Bin
  2016-04-14 13:08 ` Maxim Kuvyrkov
  1 sibling, 0 replies; 38+ messages in thread
From: Li Bin @ 2016-03-14  8:12 UTC (permalink / raw)
  To: gcc-patches
  Cc: marcus.shawcroft, richard.earnshaw, andrew.wafaa, szabolcs.nagy,
	masami.hiramatsu.pt, geoff, takahiro.akashi, guohanjun,
	felix.yang, jiangjiji, huawei.libin

From: Jiangjiji <jiangjiji@huawei.com>

* gcc/config/aarch64/aarch64.opt (mfentry): New option.
* gcc/config/aarch64/aarch64.c (aarch64_function_profiler)
(aarch64_profile_before_prologue): New functions.
(TARGET_PROFILE_BEFORE_PROLOGUE): Define.
* gcc/config/aarch64/aarch64.h (PROFILE_HOOK): Do nothing for -mfentry.
(FUNCTION_PROFILER): Call aarch64_function_profiler.

Signed-off-by: Jiangjiji <jiangjiji@huawei.com>
Signed-off-by: Li Bin <huawei.libin@huawei.com>
---
 gcc/config/aarch64/aarch64.c   |   23 +++++++++++++++++++++++
 gcc/config/aarch64/aarch64.h   |   13 ++++++++-----
 gcc/config/aarch64/aarch64.opt |    4 ++++
 3 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 752df4e..c70b161 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -440,6 +440,17 @@ aarch64_is_long_call_p (rtx sym)
   return aarch64_decl_is_long_call_p (SYMBOL_REF_DECL (sym));
 }
 
+void
+aarch64_function_profiler (FILE *file, int labelno ATTRIBUTE_UNUSED)
+{
+  if (flag_fentry)
+    {
+      fprintf (file, "\tmov\tx9, x30\n");
+      fprintf (file, "\tbl\t__fentry__\n");
+      fprintf (file, "\tmov\tx30, x9\n");
+    }
+}
+
 /* Return true if the offsets to a zero/sign-extract operation
    represent an expression that matches an extend operation.  The
    operands represent the paramters from
@@ -7414,6 +7425,15 @@ aarch64_emit_unlikely_jump (rtx insn)
   add_int_reg_note (insn, REG_BR_PROB, very_unlikely);
 }
 
+/* Return true if profiling code should be emitted before the
+   prologue rather than after it, as it is for -mfentry on x86
+   (the hotpatching use case).  */
+static bool
+aarch64_profile_before_prologue (void)
+{
+  return flag_fentry != 0;
+}
+
 /* Expand a compare and swap pattern.  */
 
 void
@@ -8454,6 +8474,9 @@ aarch64_cannot_change_mode_class (enum machine_mode from,
 #undef TARGET_ASM_ALIGNED_SI_OP
 #define TARGET_ASM_ALIGNED_SI_OP "\t.word\t"
 
+#undef TARGET_PROFILE_BEFORE_PROLOGUE
+#define TARGET_PROFILE_BEFORE_PROLOGUE aarch64_profile_before_prologue
+
 #undef TARGET_ASM_CAN_OUTPUT_MI_THUNK
 #define TARGET_ASM_CAN_OUTPUT_MI_THUNK \
   hook_bool_const_tree_hwi_hwi_const_tree_true
diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
index 77b2bb9..65e34fc 100644
--- a/gcc/config/aarch64/aarch64.h
+++ b/gcc/config/aarch64/aarch64.h
@@ -804,13 +804,16 @@ do {									     \
 #define PROFILE_HOOK(LABEL)						\
   {									\
     rtx fun, lr;							\
-    lr = get_hard_reg_initial_val (Pmode, LR_REGNUM);			\
-    fun = gen_rtx_SYMBOL_REF (Pmode, MCOUNT_NAME);			\
-    emit_library_call (fun, LCT_NORMAL, VOIDmode, 1, lr, Pmode);	\
+    if (!flag_fentry)							\
+      {								\
+	lr = get_hard_reg_initial_val (Pmode, LR_REGNUM);		\
+	fun = gen_rtx_SYMBOL_REF (Pmode, MCOUNT_NAME);			\
+	emit_library_call (fun, LCT_NORMAL, VOIDmode, 1, lr, Pmode);	\
+      }								\
   }
 
-/* All the work done in PROFILE_HOOK, but still required.  */
-#define FUNCTION_PROFILER(STREAM, LABELNO) do { } while (0)
+#define FUNCTION_PROFILER(STREAM, LABELNO) \
+  aarch64_function_profiler (STREAM, LABELNO)
 
 /* For some reason, the Linux headers think they know how to define
    these macros.  They don't!!!  */
diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
index 266d873..9e4b408 100644
--- a/gcc/config/aarch64/aarch64.opt
+++ b/gcc/config/aarch64/aarch64.opt
@@ -124,3 +124,7 @@ Enum(aarch64_abi) String(ilp32) Value(AARCH64_ABI_ILP32)
 
 EnumValue
 Enum(aarch64_abi) String(lp64) Value(AARCH64_ABI_LP64)
+
+mfentry
+Target Report Var(flag_fentry) Init(0)
+Emit profiling counter call at function entry immediately after prologue.
-- 
1.7.1

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH] [AArch64] support -mfentry feature for arm64
@ 2016-03-14  8:13 Li Bin
  2016-03-14  8:12 ` Li Bin
  2016-04-14 13:08 ` Maxim Kuvyrkov
  0 siblings, 2 replies; 38+ messages in thread
From: Li Bin @ 2016-03-14  8:13 UTC (permalink / raw)
  To: gcc-patches
  Cc: marcus.shawcroft, richard.earnshaw, andrew.wafaa, szabolcs.nagy,
	masami.hiramatsu.pt, geoff, takahiro.akashi, guohanjun,
	felix.yang, jiangjiji, huawei.libin

As ARM64 enters the enterprise world, machines cannot be stopped in
some critical production environments; live patching, as one of the
RAS features, is therefore becoming increasingly important for the
ARM64 architecture.

The mainstream live-patching implementation that has been merged into
the Linux kernel (x86/s390) is based on the 'ftrace with regs'
feature, and that feature needs help from gcc.

This patch proposes -mfentry for arm64 gcc, following the example of
x86, mips, s390, etc.; on those archs the option has been used to
implement 'ftrace with regs' and, on top of it, live patching.

There is also another solution, from Linaro [1], which proposes a new
option -fprolog-pad=N that generates a pad of N nops at the beginning
of each function.  That solution is arch-independent on the gcc side,
but beyond the discussion in [2] there may be limitations, not yet
fully understood, in adapting the Linux kernel to it, particularly on
powerpc.  Furthermore, I see no good reason to push the archs that
already implement 'ftrace with regs' (such as x86) to switch to the
new option, which would require heavy target-dependent code
adaptation; the pad approach would then in effect become an
arm64-only solution, leaving the kernel with two different
implementations.

[1] https://gcc.gnu.org/ml/gcc/2015-10/msg00090.html
[2] http://lists.infradead.org/pipermail/linux-arm-kernel/2016-January/401854.html

Jiangjiji (1):
  [AArch64] support -mfentry feature for arm64

 gcc/config/aarch64/aarch64.c   |   23 +++++++++++++++++++++++
 gcc/config/aarch64/aarch64.h   |   13 ++++++++-----
 gcc/config/aarch64/aarch64.opt |    4 ++++
 3 files changed, 35 insertions(+), 5 deletions(-)


* Re: [PATCH] [AArch64] support -mfentry feature for arm64
  2016-03-14  8:13 [PATCH] [AArch64] support -mfentry feature for arm64 Li Bin
  2016-03-14  8:12 ` Li Bin
@ 2016-04-14 13:08 ` Maxim Kuvyrkov
  2016-04-14 13:15   ` Andrew Pinski
                     ` (2 more replies)
  1 sibling, 3 replies; 38+ messages in thread
From: Maxim Kuvyrkov @ 2016-04-14 13:08 UTC (permalink / raw)
  To: Li Bin
  Cc: GCC Patches, Marcus Shawcroft, richard.earnshaw, andrew.wafaa,
	szabolcs.nagy, masami.hiramatsu.pt, geoff, takahiro.akashi,
	guohanjun, felix.yang, jiangjiji

On Mar 14, 2016, at 11:14 AM, Li Bin <huawei.libin@huawei.com> wrote:
> 
> As ARM64 is entering enterprise world, machines can not be stopped for
> some critical enterprise production environment, that is, live patch as
> one of the RAS features is increasing more important for ARM64 arch now.
> 
> Now, the mainstream live patch implementation which has been merged in
> Linux kernel (x86/s390) is based on the 'ftrace with regs' feature, and
> this feature needs the help of gcc. 
> 
> This patch proposes a generic solution for arm64 gcc which called mfentry,
> following the example of x86, mips, s390, etc. and on these archs, this
> feature has been used to implement the ftrace feature 'ftrace with regs'
> to support live patch.
> 
> By now, there is an another solution from linaro [1], which proposes to
> implement a new option -fprolog-pad=N that generate a pad of N nops at the
> beginning of each function. This solution is a arch-independent way for gcc,
> but there may be some limitations which have not been recognized for Linux
> kernel to adapt to this solution besides the discussion on [2]

It appears that implementing -fprolog-pad=N option in GCC will not enable kernel live-patching support for AArch64.  The proposal for the option was to make GCC output a given number of NOPs at the beginning of each function, and then the kernel could use that NOP pad to insert whatever instructions it needs.  The modification of kernel instruction stream needs to be done atomically, and, unfortunately, it seems the kernel can use only architecture-provided atomicity primitives -- i.e., changing at most 8 bytes at a time.

From the kernel discussion thread it appears that the pad needs to be more than 8 bytes, and that the kernel can't update that atomically.  However if -mfentry approach is used, then we need to update only 4 (or 8) bytes of the pad, and we avoid the atomicity problem.

Therefore, [unless there is a clever multi-stage update process to atomically change NOPs to whatever we need,] I think we have to go with Li's -mfentry approach.

Comments?

--
Maxim Kuvyrkov
www.linaro.org


> , typically
> for powerpc archs. Furthermore I think there are no good reasons to promote
> the other archs (such as x86) which have implemented the feature 'ftrace with regs'
> to replace the current method with the new option, which may bring heavily
> target-dependent code adaption, as a result it becomes a arm64 dedicated
> solution, leaving kernel with two different forms of implementation. 
> 
> [1] https://gcc.gnu.org/ml/gcc/2015-10/msg00090.html
> [2] http://lists.infradead.org/pipermail/linux-arm-kernel/2016-January/401854.html


* Re: [PATCH] [AArch64] support -mfentry feature for arm64
  2016-04-14 13:08 ` Maxim Kuvyrkov
@ 2016-04-14 13:15   ` Andrew Pinski
  2016-04-14 15:58     ` Szabolcs Nagy
  2016-04-15 15:40   ` Michael Matz
  2016-04-19  6:08   ` AKASHI Takahiro
  2 siblings, 1 reply; 38+ messages in thread
From: Andrew Pinski @ 2016-04-14 13:15 UTC (permalink / raw)
  To: Maxim Kuvyrkov
  Cc: Li Bin, GCC Patches, Marcus Shawcroft, Richard Earnshaw,
	andrew.wafaa, Szabolcs Nagy, masami.hiramatsu.pt, geoff,
	Takahiro Akashi, guohanjun, Yangfei (Felix),
	jiangjiji

On Thu, Apr 14, 2016 at 9:08 PM, Maxim Kuvyrkov
<maxim.kuvyrkov@linaro.org> wrote:
> On Mar 14, 2016, at 11:14 AM, Li Bin <huawei.libin@huawei.com> wrote:
>>
>> As ARM64 is entering enterprise world, machines can not be stopped for
>> some critical enterprise production environment, that is, live patch as
>> one of the RAS features is increasing more important for ARM64 arch now.
>>
>> Now, the mainstream live patch implementation which has been merged in
>> Linux kernel (x86/s390) is based on the 'ftrace with regs' feature, and
>> this feature needs the help of gcc.
>>
>> This patch proposes a generic solution for arm64 gcc which called mfentry,
>> following the example of x86, mips, s390, etc. and on these archs, this
>> feature has been used to implement the ftrace feature 'ftrace with regs'
>> to support live patch.
>>
>> By now, there is an another solution from linaro [1], which proposes to
>> implement a new option -fprolog-pad=N that generate a pad of N nops at the
>> beginning of each function. This solution is a arch-independent way for gcc,
>> but there may be some limitations which have not been recognized for Linux
>> kernel to adapt to this solution besides the discussion on [2]
>
> It appears that implementing -fprolog-pad=N option in GCC will not enable kernel live-patching support for AArch64.  The proposal for the option was to make GCC output a given number of NOPs at the beginning of each function, and then the kernel could use that NOP pad to insert whatever instructions it needs.  The modification of kernel instruction stream needs to be done atomically, and, unfortunately, it seems the kernel can use only architecture-provided atomicity primitives -- i.e., changing at most 8 bytes at a time.
>

Can't we add a 16-byte atomic primitive for ARM64 to the kernel?
Though you would need to align all functions to a 16-byte boundary if
the -fprolog-pad=N approach is used.  Do you know the size that needs
to be modified?  It seems to be either 12 or 16 bytes.

> From the kernel discussion thread it appears that the pad needs to be more than 8 bytes, and that the kernel can't update that atomically.  However if -mfentry approach is used, then we need to update only 4 (or 8) bytes of the pad, and we avoid the atomicity problem.

I think you are incorrect, you could add a 16 byte atomic primitive if needed.

>
> Therefore, [unless there is a clever multi-stage update process to atomically change NOPs to whatever we need,] I think we have to go with Li's -mfentry approach.

Please consider the above: if 16-byte (128-bit) atomic instructions
were available, would that be enough?

Thanks,
Andrew

>
> Comments?
>
> --
> Maxim Kuvyrkov
> www.linaro.org
>
>
>> , typically
>> for powerpc archs. Furthermore I think there are no good reasons to promote
>> the other archs (such as x86) which have implemented the feature 'ftrace with regs'
>> to replace the current method with the new option, which may bring heavily
>> target-dependent code adaption, as a result it becomes a arm64 dedicated
>> solution, leaving kernel with two different forms of implementation.
>>
>> [1] https://gcc.gnu.org/ml/gcc/2015-10/msg00090.html
>> [2] http://lists.infradead.org/pipermail/linux-arm-kernel/2016-January/401854.html
>


* Re: [PATCH] [AArch64] support -mfentry feature for arm64
  2016-04-14 13:15   ` Andrew Pinski
@ 2016-04-14 15:58     ` Szabolcs Nagy
  2016-04-18 13:26       ` Alexander Monakov
  2016-04-19  6:13       ` AKASHI Takahiro
  0 siblings, 2 replies; 38+ messages in thread
From: Szabolcs Nagy @ 2016-04-14 15:58 UTC (permalink / raw)
  To: Andrew Pinski, Maxim Kuvyrkov
  Cc: Li Bin, GCC Patches, Marcus Shawcroft, Richard Earnshaw,
	andrew.wafaa, masami.hiramatsu.pt, geoff, Takahiro Akashi,
	guohanjun, Yangfei (Felix),
	jiangjiji, nd

On 14/04/16 14:15, Andrew Pinski wrote:
> On Thu, Apr 14, 2016 at 9:08 PM, Maxim Kuvyrkov
> <maxim.kuvyrkov@linaro.org> wrote:
>> On Mar 14, 2016, at 11:14 AM, Li Bin <huawei.libin@huawei.com> wrote:
>>>
>>> As ARM64 is entering enterprise world, machines can not be stopped for
>>> some critical enterprise production environment, that is, live patch as
>>> one of the RAS features is increasing more important for ARM64 arch now.
>>>
>>> Now, the mainstream live patch implementation which has been merged in
>>> Linux kernel (x86/s390) is based on the 'ftrace with regs' feature, and
>>> this feature needs the help of gcc.
>>>
>>> This patch proposes a generic solution for arm64 gcc which called mfentry,
>>> following the example of x86, mips, s390, etc. and on these archs, this
>>> feature has been used to implement the ftrace feature 'ftrace with regs'
>>> to support live patch.
>>>
>>> By now, there is an another solution from linaro [1], which proposes to
>>> implement a new option -fprolog-pad=N that generate a pad of N nops at the
>>> beginning of each function. This solution is a arch-independent way for gcc,
>>> but there may be some limitations which have not been recognized for Linux
>>> kernel to adapt to this solution besides the discussion on [2]
>>
>> It appears that implementing -fprolog-pad=N option in GCC will not enable kernel live-patching support for AArch64.  The proposal for the option was to make GCC output a given number of NOPs at the beginning of each function, and then the kernel could use that NOP pad to insert whatever instructions it needs.  The modification of kernel instruction stream needs to be done atomically, and, unfortunately, it seems the kernel can use only architecture-provided atomicity primitives -- i.e., changing at most 8 bytes at a time.
>>
> 
> Can't we add a 16byte atomic primitive for ARM64 to the kernel?
> Though you need to align all functions to a 16 byte boundary if the
> -fprolog-pag=N needs to happen.  Do you know what the size that needs
> to be modified?  It does seem to be either 12 or 16 bytes.
> 

looking at [2] i don't see why

func:
  mov x9, x30
  bl _tracefunc
  <function body>

is not good for the kernel.

mov x9, x30 is a nop at function entry, so in
theory 4 byte atomic write should be enough
to enable/disable tracing.

>> From the kernel discussion thread it appears that the pad needs to be more than 8 bytes, and that the kernel can't update that atomically.  However if -mfentry approach is used, then we need to update only 4 (or 8) bytes of the pad, and we avoid the atomicity problem.
> 
> I think you are incorrect, you could add a 16 byte atomic primitive if needed.
> 
>>
>> Therefore, [unless there is a clever multi-stage update process to atomically change NOPs to whatever we need,] I think we have to go with Li's -mfentry approach.
> 
> Please consider the above of having a 16 byte (128bit) atomic
> instructions be available would that be enough?
> 
> Thanks,
> Andrew
> 
>>
>> Comments?
>>
>> --
>> Maxim Kuvyrkov
>> www.linaro.org
>>
>>
>>> , typically
>>> for powerpc archs. Furthermore I think there are no good reasons to promote
>>> the other archs (such as x86) which have implemented the feature 'ftrace with regs'
>>> to replace the current method with the new option, which may bring heavily
>>> target-dependent code adaption, as a result it becomes a arm64 dedicated
>>> solution, leaving kernel with two different forms of implementation.
>>>
>>> [1] https://gcc.gnu.org/ml/gcc/2015-10/msg00090.html
>>> [2] http://lists.infradead.org/pipermail/linux-arm-kernel/2016-January/401854.html
>>
> 


* Re: [PATCH] [AArch64] support -mfentry feature for arm64
  2016-04-14 13:08 ` Maxim Kuvyrkov
  2016-04-14 13:15   ` Andrew Pinski
@ 2016-04-15 15:40   ` Michael Matz
  2016-04-15 17:29     ` Alexander Monakov
  2016-04-19  6:08   ` AKASHI Takahiro
  2 siblings, 1 reply; 38+ messages in thread
From: Michael Matz @ 2016-04-15 15:40 UTC (permalink / raw)
  To: Maxim Kuvyrkov
  Cc: Li Bin, GCC Patches, Marcus Shawcroft, richard.earnshaw,
	andrew.wafaa, szabolcs.nagy, masami.hiramatsu.pt, geoff,
	takahiro.akashi, guohanjun, felix.yang, jiangjiji

Hi,

On Thu, 14 Apr 2016, Maxim Kuvyrkov wrote:

> It appears that implementing -fprolog-pad=N option in GCC will not 
> enable kernel live-patching support for AArch64.  The proposal for the 
> option was to make GCC output a given number of NOPs at the beginning of 
> each function, and then the kernel could use that NOP pad to insert 
> whatever instructions it needs.  The modification of kernel instruction 
> stream needs to be done atomically, and, unfortunately, it seems the 
> kernel can use only architecture-provided atomicity primitives -- i.e., 
> changing at most 8 bytes at a time.

Replace first nop with a breakpoint, handle rest of patching in breakpoint 
handler, patch breakpoint insn last, no need to atomically patch multiple 
instructions.


Ciao,
Michael.


* Re: [PATCH] [AArch64] support -mfentry feature for arm64
  2016-04-15 15:40   ` Michael Matz
@ 2016-04-15 17:29     ` Alexander Monakov
  2016-04-17 15:06       ` Alexander Monakov
  2016-04-18 14:32       ` Andrew Haley
  0 siblings, 2 replies; 38+ messages in thread
From: Alexander Monakov @ 2016-04-15 17:29 UTC (permalink / raw)
  To: Michael Matz
  Cc: Maxim Kuvyrkov, Li Bin, GCC Patches, Marcus Shawcroft,
	richard.earnshaw, andrew.wafaa, szabolcs.nagy,
	masami.hiramatsu.pt, geoff, takahiro.akashi, guohanjun,
	felix.yang, jiangjiji

On Fri, 15 Apr 2016, Michael Matz wrote:
> On Thu, 14 Apr 2016, Maxim Kuvyrkov wrote:
> 
> > It appears that implementing -fprolog-pad=N option in GCC will not 
> > enable kernel live-patching support for AArch64.  The proposal for the 
> > option was to make GCC output a given number of NOPs at the beginning of 
> > each function, and then the kernel could use that NOP pad to insert 
> > whatever instructions it needs.  The modification of kernel instruction 
> > stream needs to be done atomically, and, unfortunately, it seems the 
> > kernel can use only architecture-provided atomicity primitives -- i.e., 
> > changing at most 8 bytes at a time.
> 
> Replace first nop with a breakpoint, handle rest of patching in breakpoint 
> handler, patch breakpoint insn last, no need to atomically patch multiple 
> instructions.

Alternatively: replace first nop with a short forward branch that jumps over
the rest of the pad, patch rest of the pad, patch the initial forward branch.

Alexander


* Re: [PATCH] [AArch64] support -mfentry feature for arm64
  2016-04-15 17:29     ` Alexander Monakov
@ 2016-04-17 15:06       ` Alexander Monakov
  2016-04-18 12:12         ` Michael Matz
  2016-04-18 14:32       ` Andrew Haley
  1 sibling, 1 reply; 38+ messages in thread
From: Alexander Monakov @ 2016-04-17 15:06 UTC (permalink / raw)
  To: Michael Matz
  Cc: Maxim Kuvyrkov, Li Bin, GCC Patches, Marcus Shawcroft,
	richard.earnshaw, andrew.wafaa, szabolcs.nagy,
	masami.hiramatsu.pt, geoff, takahiro.akashi, guohanjun,
	felix.yang, jiangjiji

On Fri, 15 Apr 2016, Alexander Monakov wrote:
> On Fri, 15 Apr 2016, Michael Matz wrote:
> > Replace first nop with a breakpoint, handle rest of patching in breakpoint 
> > handler, patch breakpoint insn last, no need to atomically patch multiple 
> > instructions.
> 
> Alternatively: replace first nop with a short forward branch that jumps over
> the rest of the pad, patch rest of the pad, patch the initial forward branch.

I've noticed an issue in my (and probably Michael's) solution: if there's a
thread that made it past the first nop, but is still executing the nop pad,
it's unsafe to replace the nops.  To solve that, it suffices to have a forward
branch in place of the first nop to begin with (i.e. have the compiler emit
it).  But if Szabolcs' two-instruction sequence in the adjacent subthread is
sufficient, this is moot.

Alexander


* Re: [PATCH] [AArch64] support -mfentry feature for arm64
  2016-04-17 15:06       ` Alexander Monakov
@ 2016-04-18 12:12         ` Michael Matz
  2016-04-19  6:26           ` AKASHI Takahiro
  2016-04-19 16:03           ` Torsten Duwe
  0 siblings, 2 replies; 38+ messages in thread
From: Michael Matz @ 2016-04-18 12:12 UTC (permalink / raw)
  To: Alexander Monakov
  Cc: Maxim Kuvyrkov, Li Bin, GCC Patches, Marcus Shawcroft,
	richard.earnshaw, andrew.wafaa, szabolcs.nagy,
	masami.hiramatsu.pt, geoff, takahiro.akashi, guohanjun,
	felix.yang, jiangjiji

Hi,

On Sun, 17 Apr 2016, Alexander Monakov wrote:

> I've noticed an issue in my (and probably Michael's) solution: if 
> there's a thread that made it past the first nop, but is still executing 
> the nop pad, it's unsafe to replace the nops.  To solve that, it 
> suffices to have a forward branch in place of the first nop to begin 
> with (i.e. have the compiler emit it).

True.  I wonder if the generic solution in GCC should do that always or if 
the patch infrastructure should do that to enable more freedom like doing 
this:

> But if Szabolcs' two-instruction 
> sequence in the adjacent subthread is sufficient, this is moot.

.  It can also be solved by having just one NOP after the function label, 
and a number of them before, then no thread can be in the nop pad.  That 
seems to indicate that GCC should not try to be too clever and simply 
leave the specified number of nops before and after the function label, 
leaving safety measures to the patching infrastructure.


Ciao,
Michael.


* Re: [PATCH] [AArch64] support -mfentry feature for arm64
  2016-04-14 15:58     ` Szabolcs Nagy
@ 2016-04-18 13:26       ` Alexander Monakov
  2016-04-18 13:34         ` Ramana Radhakrishnan
  2016-04-18 14:31         ` Szabolcs Nagy
  2016-04-19  6:13       ` AKASHI Takahiro
  1 sibling, 2 replies; 38+ messages in thread
From: Alexander Monakov @ 2016-04-18 13:26 UTC (permalink / raw)
  To: Szabolcs Nagy
  Cc: Andrew Pinski, Maxim Kuvyrkov, Li Bin, GCC Patches,
	Marcus Shawcroft, Richard Earnshaw, andrew.wafaa,
	masami.hiramatsu.pt, geoff, Takahiro Akashi, guohanjun,
	Yangfei (Felix),
	jiangjiji, nd

On Thu, 14 Apr 2016, Szabolcs Nagy wrote:
> looking at [2] i don't see why
> 
> func:
>   mov x9, x30
>   bl _tracefunc
>   <function body>
> 
> is not good for the kernel.
> 
> mov x9, x30 is a nop at function entry, so in
> theory 4 byte atomic write should be enough
> to enable/disable tracing.

Overwriting x9 can be problematic because GCC has gained the ability to track
register usage interprocedurally: if foo() calls bar(), and GCC has already
emitted code for bar() and knows that it cannot change x9, it can use that
knowledge to avoid saving/restoring x9 in foo() around calls to bar(). See
option '-fipa-ra'.

If there's no register that can be safely used in place of x9 here, then
the backend should emit the entry/pad appropriately (e.g. with an unspec that
clobbers the possibly-overwritten register).

Or, with Michael Matz' suggestion, there can be one nop at function start and
a big enough pad just before the function, and that pad can just push/pop x30
on its own.

Alexander


* Re: [PATCH] [AArch64] support -mfentry feature for arm64
  2016-04-18 13:26       ` Alexander Monakov
@ 2016-04-18 13:34         ` Ramana Radhakrishnan
  2016-04-18 13:44           ` Alexander Monakov
  2016-04-18 14:31         ` Szabolcs Nagy
  1 sibling, 1 reply; 38+ messages in thread
From: Ramana Radhakrishnan @ 2016-04-18 13:34 UTC (permalink / raw)
  To: Alexander Monakov
  Cc: Szabolcs Nagy, Andrew Pinski, Maxim Kuvyrkov, Li Bin,
	GCC Patches, Marcus Shawcroft, Richard Earnshaw, andrew.wafaa,
	masami.hiramatsu.pt, geoff, Takahiro Akashi, guohanjun,
	Yangfei (Felix),
	jiangjiji, nd

On Mon, Apr 18, 2016 at 2:26 PM, Alexander Monakov <amonakov@ispras.ru> wrote:
> On Thu, 14 Apr 2016, Szabolcs Nagy wrote:
>> looking at [2] i don't see why
>>
>> func:
>>   mov x9, x30
>>   bl _tracefunc
>>   <function body>
>>
>> is not good for the kernel.
>>
>> mov x9, x30 is a nop at function entry, so in
>> theory 4 byte atomic write should be enough
>> to enable/disable tracing.
>
> Overwriting x9 can be problematic because GCC has gained the ability to track
> register usage interprocedurally: if foo() calls bar(), and GCC has already
> emitted code for bar() and knows that it cannot change x9, it can use that
> knowledge to avoid saving/restoring x9 in foo() around calls to bar(). See
> option '-fipa-ra'.
>
> If there's no register that can be safely used in place of x9 here, then
> the backend should emit the entry/pad appropriately (e.g. with an unspec that
> clobbers the possibly-overwritten register).

Can you not use one of x16 / x17 the intra-procedure-call scratch
registers as per the PCS ?

regards
Ramana

>
> Or, with Michael Matz' suggestion, there can be one nop at function start and
> a big enough pad just before the function, and that pad can just push/pop x30
> on its own.
>
> Alexander


* Re: [PATCH] [AArch64] support -mfentry feature for arm64
  2016-04-18 13:34         ` Ramana Radhakrishnan
@ 2016-04-18 13:44           ` Alexander Monakov
  2016-04-18 13:57             ` Ramana Radhakrishnan
  0 siblings, 1 reply; 38+ messages in thread
From: Alexander Monakov @ 2016-04-18 13:44 UTC (permalink / raw)
  To: Ramana Radhakrishnan
  Cc: Szabolcs Nagy, Andrew Pinski, Maxim Kuvyrkov, Li Bin,
	GCC Patches, Marcus Shawcroft, Richard Earnshaw, andrew.wafaa,
	masami.hiramatsu.pt, geoff, Takahiro Akashi, guohanjun,
	Yangfei (Felix),
	jiangjiji, nd

On Mon, 18 Apr 2016, Ramana Radhakrishnan wrote:
> On Mon, Apr 18, 2016 at 2:26 PM, Alexander Monakov <amonakov@ispras.ru> wrote:
> > On Thu, 14 Apr 2016, Szabolcs Nagy wrote:
> >> looking at [2] i don't see why
> >>
> >> func:
> >>   mov x9, x30
> >>   bl _tracefunc
> >>   <function body>
> >>
> >> is not good for the kernel.
> >>
> >> mov x9, x30 is a nop at function entry, so in
> >> theory 4 byte atomic write should be enough
> >> to enable/disable tracing.
> >
> > Overwriting x9 can be problematic because GCC has gained the ability to track
> > register usage interprocedurally: if foo() calls bar(), and GCC has already
> > emitted code for bar() and knows that it cannot change x9, it can use that
> > knowledge to avoid saving/restoring x9 in foo() around calls to bar(). See
> > option '-fipa-ra'.
> >
> > If there's no register that can be safely used in place of x9 here, then
> > the backend should emit the entry/pad appropriately (e.g. with an unspec that
> > clobbers the possibly-overwritten register).
> 
> Can you not use one of x16 / x17 the intra-procedure-call scratch
> registers as per the PCS ?

As long as:

- there's a guarantee that the call to _tracefunc wouldn't need some kind of
  veneer that would clobber those; I don't know enough about AArch64 to say;

- and GCC is not smart enough to be aware that intra-TU calls to 'func' (the
  function we're instrumenting) don't touch x16/x17.  And GCC should be that
  smart, if it's not, it's a bug, right? :)

Thanks.
Alexander


* Re: [PATCH] [AArch64] support -mfentry feature for arm64
  2016-04-18 13:44           ` Alexander Monakov
@ 2016-04-18 13:57             ` Ramana Radhakrishnan
  2016-04-18 14:03               ` Alexander Monakov
  0 siblings, 1 reply; 38+ messages in thread
From: Ramana Radhakrishnan @ 2016-04-18 13:57 UTC (permalink / raw)
  To: Alexander Monakov, Ramana Radhakrishnan
  Cc: Szabolcs Nagy, Andrew Pinski, Maxim Kuvyrkov, Li Bin,
	GCC Patches, Marcus Shawcroft, Richard Earnshaw, andrew.wafaa,
	masami.hiramatsu.pt, geoff, Takahiro Akashi, guohanjun,
	Yangfei (Felix),
	jiangjiji, nd



On 18/04/16 14:44, Alexander Monakov wrote:
> On Mon, 18 Apr 2016, Ramana Radhakrishnan wrote:
>> On Mon, Apr 18, 2016 at 2:26 PM, Alexander Monakov <amonakov@ispras.ru> wrote:
>>> On Thu, 14 Apr 2016, Szabolcs Nagy wrote:
>>>> looking at [2] i don't see why
>>>>
>>>> func:
>>>>   mov x9, x30
>>>>   bl _tracefunc
>>>>   <function body>
>>>>
>>>> is not good for the kernel.
>>>>
>>>> mov x9, x30 is a nop at function entry, so in
>>>> theory 4 byte atomic write should be enough
>>>> to enable/disable tracing.
>>>
>>> Overwriting x9 can be problematic because GCC has gained the ability to track
>>> register usage interprocedurally: if foo() calls bar(), and GCC has already
>>> emitted code for bar() and knows that it cannot change x9, it can use that
>>> knowledge to avoid saving/restoring x9 in foo() around calls to bar(). See
>>> option '-fipa-ra'.
>>>
>>> If there's no register that can be safely used in place of x9 here, then
>>> the backend should emit the entry/pad appropriately (e.g. with an unspec that
>>> clobbers the possibly-overwritten register).
>>
>> Can you not use one of x16 / x17 the intra-procedure-call scratch
>> registers as per the PCS ?
> 
> As long as:
> 
> - there's a guarantee that the call to _tracefunc wouldn't need some kind of
>   veneer that would clobber those; I don't know enough about AArch64 to say;
> 

A veneer could clobber them, yeah.

> - and GCC is not smart enough to be aware that intra-TU calls to 'func' (the
>   function we're instrumenting) don't touch x16/x17.  And GCC should be that
>   smart, if it's not, it's a bug, right? :)
> 

That it already is, IIRC.  Otherwise -fipa-ra wouldn't work.

Alternatively, just add x9 to the list for -fipa-ra, the same way x16 / x17 are handled already, no?

Ramana

> Thanks.
> Alexander
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] [AArch64] support -mfentry feature for arm64
  2016-04-18 13:57             ` Ramana Radhakrishnan
@ 2016-04-18 14:03               ` Alexander Monakov
  0 siblings, 0 replies; 38+ messages in thread
From: Alexander Monakov @ 2016-04-18 14:03 UTC (permalink / raw)
  To: Ramana Radhakrishnan
  Cc: Ramana Radhakrishnan, Szabolcs Nagy, Andrew Pinski,
	Maxim Kuvyrkov, Li Bin, GCC Patches, Marcus Shawcroft,
	Richard Earnshaw, andrew.wafaa, masami.hiramatsu.pt, geoff,
	Takahiro Akashi, guohanjun, Yangfei (Felix),
	jiangjiji, nd

On Mon, 18 Apr 2016, Ramana Radhakrishnan wrote:
> > - and GCC is not smart enough to be aware that intra-TU calls to 'func' (the
> >   function we're instrumenting) don't touch x16/x17.  And GCC should be that
> >   smart, if it's not, it's a bug, right? :)
> > 
> 
> That it already is - IIRC. Otherwise -fipa-ra wouldn't work .

Only if the TU is huge, or -ffunction-sections is in effect, right? Otherwise
if the TU is sufficiently small, GCC can be sure that no veneer is needed. Or
am I missing something?

> Alternatively just add x9 to the list for fipa-ra the same way as x16 / x17
> are handled already , no ?

I wouldn't recommend that; I think making it apparent as a register clobber as
part of unspec signifying the pad is better.

Thanks.
Alexander

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] [AArch64] support -mfentry feature for arm64
  2016-04-18 13:26       ` Alexander Monakov
  2016-04-18 13:34         ` Ramana Radhakrishnan
@ 2016-04-18 14:31         ` Szabolcs Nagy
  2016-04-18 15:54           ` Alexander Monakov
  1 sibling, 1 reply; 38+ messages in thread
From: Szabolcs Nagy @ 2016-04-18 14:31 UTC (permalink / raw)
  To: Alexander Monakov
  Cc: Andrew Pinski, Maxim Kuvyrkov, Li Bin, GCC Patches,
	Marcus Shawcroft, Richard Earnshaw, andrew.wafaa,
	masami.hiramatsu.pt, geoff, Takahiro Akashi, guohanjun,
	Yangfei (Felix),
	jiangjiji, nd

On 18/04/16 14:26, Alexander Monakov wrote:
> On Thu, 14 Apr 2016, Szabolcs Nagy wrote:
>> looking at [2] i don't see why
>>
>> func:
>>   mov x9, x30
>>   bl _tracefunc
>>   <function body>
>>
>> is not good for the kernel.
>>
>> mov x9, x30 is a nop at function entry, so in
>> theory 4 byte atomic write should be enough
>> to enable/disable tracing.
> 
> Overwriting x9 can be problematic because GCC has gained the ability to track
> register usage interprocedurally: if foo() calls bar(), and GCC has already
> emitted code for bar() and knows that it cannot change x9, it can use that
> knowledge to avoid saving/restoring x9 in foo() around calls to bar(). See
> option '-fipa-ra'.
> 
> If there's no register that can be safely used in place of x9 here, then
> the backend should emit the entry/pad appropriately (e.g. with an unspec that
> clobbers the possibly-overwritten register).
> 

(1) nop padded function can be assumed to clobber all temp regs
(2) or _tracefunc must save/restore all temp regs, not just arg regs.

on x86_64, glibc and linux _mcount and __fentry__ don't
save %r11 (temp reg), only the arg regs, so i think nop
padding should behave the same way (1).

> Or, with Michael Matz' suggestion, there can be one nop at function start and
> a big enough pad just before the function, and that pad can just push/pop x30
> on its own.
> 
> Alexander
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] [AArch64] support -mfentry feature for arm64
  2016-04-15 17:29     ` Alexander Monakov
  2016-04-17 15:06       ` Alexander Monakov
@ 2016-04-18 14:32       ` Andrew Haley
  2016-04-18 17:13         ` Michael Matz
  1 sibling, 1 reply; 38+ messages in thread
From: Andrew Haley @ 2016-04-18 14:32 UTC (permalink / raw)
  To: gcc-patches

On 04/15/2016 06:29 PM, Alexander Monakov wrote:

> Alternatively: replace first nop with a short forward branch that
> jumps over the rest of the pad, patch rest of the pad, patch the
> initial forward branch.

That may not be safe.  Consider an implementation which looks ahead in
the instruction stream and decodes the instructions speculatively.  I
suppose you could begin the block of instructions after the branch with
an ISB.  On balance, the trap sounds like the best plan.  We do this
in Java all the time: every method begins with a NOP, and we patch it
either to a trap or to a call to the replacement code.

Andrew.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] [AArch64] support -mfentry feature for arm64
  2016-04-18 14:31         ` Szabolcs Nagy
@ 2016-04-18 15:54           ` Alexander Monakov
  2016-04-19  6:46             ` AKASHI Takahiro
  0 siblings, 1 reply; 38+ messages in thread
From: Alexander Monakov @ 2016-04-18 15:54 UTC (permalink / raw)
  To: Szabolcs Nagy
  Cc: Andrew Pinski, Maxim Kuvyrkov, Li Bin, GCC Patches,
	Marcus Shawcroft, Richard Earnshaw, andrew.wafaa,
	masami.hiramatsu.pt, geoff, Takahiro Akashi, guohanjun,
	Yangfei (Felix),
	jiangjiji, nd

On Mon, 18 Apr 2016, Szabolcs Nagy wrote:
> On 18/04/16 14:26, Alexander Monakov wrote:
> > On Thu, 14 Apr 2016, Szabolcs Nagy wrote:
> >> looking at [2] i don't see why
> >>
> >> func:
> >>   mov x9, x30
> >>   bl _tracefunc
> >>   <function body>
> >>
> >> is not good for the kernel.
> >>
> >> mov x9, x30 is a nop at function entry, so in
> >> theory 4 byte atomic write should be enough
> >> to enable/disable tracing.
> > 
> > Overwriting x9 can be problematic because GCC has gained the ability to track
> > register usage interprocedurally: if foo() calls bar(), and GCC has already
> > emitted code for bar() and knows that it cannot change x9, it can use that
> > knowledge to avoid saving/restoring x9 in foo() around calls to bar(). See
> > option '-fipa-ra'.
> > 
> > If there's no register that can be safely used in place of x9 here, then
> > the backend should emit the entry/pad appropriately (e.g. with an unspec that
> > clobbers the possibly-overwritten register).
> > 
> 
> (1) nop padded function can be assumed to clobber all temp regs

This may be undesirable if the nop pad is expected to be left untouched
most of the time, because it would penalize the common case.  If only
sufficiently complex functions (e.g. making other calls anyway) are expected
to be padded, it's moot.

> (2) or _tracefunc must save/restore all temp regs, not just arg regs.

This doesn't work: when _tracefunc starts executing, the old value of x9 is
already unrecoverable.

> on x86_64, glibc and linux _mcount and __fentry__ don't
> save %r11 (temp reg), only the arg regs, so i think nop
> padding should behave the same way (1).

That makes sense (modulo what I said above about penalizing tiny functions).

Heh, I started wondering if on x86 this is handled correctly when the calls
are nopped out, and it turns out -pg disables -fipa-ra (in toplev.c)! :)

Alexander

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] [AArch64] support -mfentry feature for arm64
  2016-04-18 14:32       ` Andrew Haley
@ 2016-04-18 17:13         ` Michael Matz
  2016-04-18 17:17           ` Andrew Haley
  0 siblings, 1 reply; 38+ messages in thread
From: Michael Matz @ 2016-04-18 17:13 UTC (permalink / raw)
  To: Andrew Haley; +Cc: gcc-patches

Hi,

On Mon, 18 Apr 2016, Andrew Haley wrote:

> On 04/15/2016 06:29 PM, Alexander Monakov wrote:
> 
> > Alternatively: replace first nop with a short forward branch that
> > jumps over the rest of the pad, patch rest of the pad, patch the
> > initial forward branch.
> 
> That may not be safe.  Consider an implementation which looks ahead in
> the instruction stream and decodes the instructions speculatively.

It should go without saying that patching instructions is followed by 
whatever means are necessary to flush any such caches on a particular 
implementation (here after patching the jump, after patching the rest, and 
after patching the first insn again, i.e. three times).  Nothing GCC 
needs to worry about; that's the patching infrastructure's job.


Ciao,
Michael.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] [AArch64] support -mfentry feature for arm64
  2016-04-18 17:13         ` Michael Matz
@ 2016-04-18 17:17           ` Andrew Haley
  2016-04-18 17:34             ` Michael Matz
  0 siblings, 1 reply; 38+ messages in thread
From: Andrew Haley @ 2016-04-18 17:17 UTC (permalink / raw)
  To: Michael Matz; +Cc: gcc-patches

On 04/18/2016 06:13 PM, Michael Matz wrote:

> On Mon, 18 Apr 2016, Andrew Haley wrote:
> 
>> On 04/15/2016 06:29 PM, Alexander Monakov wrote:
>>
>>> Alternatively: replace first nop with a short forward branch that
>>> jumps over the rest of the pad, patch rest of the pad, patch the
>>> initial forward branch.
>>
>> That may not be safe.  Consider an implementation which looks ahead in
>> the instruction stream and decodes the instructions speculatively.
> 
> It should go without saying that patching instructions is followed by 
> whatever means necessary to flush any such caches on a particular 
> implementation (here after patching the jump, after patching the rest, and 
> after patching the first insn again, i.e. three times).

That doesn't necessarily help you, though, without an ISB in the reading
thread.

Andrew.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] [AArch64] support -mfentry feature for arm64
  2016-04-18 17:17           ` Andrew Haley
@ 2016-04-18 17:34             ` Michael Matz
  2016-04-19  8:00               ` Andrew Haley
  0 siblings, 1 reply; 38+ messages in thread
From: Michael Matz @ 2016-04-18 17:34 UTC (permalink / raw)
  To: Andrew Haley; +Cc: gcc-patches

Hi,

On Mon, 18 Apr 2016, Andrew Haley wrote:

> >> That may not be safe.  Consider an implementation which looks ahead 
> >> in the instruction stream and decodes the instructions speculatively.
> > 
> > It should go without saying that patching instructions is followed by 
> > whatever means necessary to flush any such caches on a particular 
> > implementation (here after patching the jump, after patching the rest, 
> > and after patching the first insn again, i.e. three times).
> 
> That doesn't necessarily help you, though, without an ISB in the reading 
> thread.

I don't understand, which reading thread?  We're writing, not reading 
instructions.  You mean other executing threads?  I will happily declare 
any implementation where it's impossible to safely patch the 
instruction stream by flushing the respective buffers or other means 
completely under control of the patching machinery, to be broken by 
design.  What failure mode do you envision, exactly?


Ciao,
Michael.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] [AArch64] support -mfentry feature for arm64
  2016-04-14 13:08 ` Maxim Kuvyrkov
  2016-04-14 13:15   ` Andrew Pinski
  2016-04-15 15:40   ` Michael Matz
@ 2016-04-19  6:08   ` AKASHI Takahiro
  2 siblings, 0 replies; 38+ messages in thread
From: AKASHI Takahiro @ 2016-04-19  6:08 UTC (permalink / raw)
  To: Maxim Kuvyrkov
  Cc: Li Bin, GCC Patches, Marcus Shawcroft, richard.earnshaw,
	andrew.wafaa, szabolcs.nagy, masami.hiramatsu.pt, geoff,
	guohanjun, felix.yang, jiangjiji

On Thu, Apr 14, 2016 at 04:08:23PM +0300, Maxim Kuvyrkov wrote:
> On Mar 14, 2016, at 11:14 AM, Li Bin <huawei.libin@huawei.com> wrote:
> > 
> > As ARM64 is entering the enterprise world, machines cannot be stopped in
> > some critical enterprise production environments; that is, live patching,
> > as one of the RAS features, is increasingly important for the ARM64 arch now.
> > 
> > Now, the mainstream live patch implementation which has been merged in the
> > Linux kernel (x86/s390) is based on the 'ftrace with regs' feature, and
> > this feature needs the help of gcc.
> > 
> > This patch proposes a generic solution for arm64 gcc called -mfentry,
> > following the example of x86, mips, s390, etc.; on these archs, this
> > feature has been used to implement the ftrace feature 'ftrace with regs'
> > to support live patching.
> > 
> > By now, there is another solution from Linaro [1], which proposes to
> > implement a new option -fprolog-pad=N that generates a pad of N nops at the
> > beginning of each function. This solution is an arch-independent way for gcc,
> > but there may be some limitations which have not been recognized for the
> > Linux kernel to adapt to this solution, besides the discussion on [2]
> 
> It appears that implementing -fprolog-pad=N option in GCC will not enable kernel live-patching support for AArch64.  The proposal for the option was to make GCC output a given number of NOPs at the beginning of each function, and then the kernel could use that NOP pad to insert whatever instructions it needs.  The modification of kernel instruction stream needs to be done atomically, and, unfortunately, it seems the kernel can use only architecture-provided atomicity primitives -- i.e., changing at most 8 bytes at a time.

Let me clarify the issue with -fprolog-pad=N.
The kernel/ftrace has two chances of replacing prologue instructions:
 1) at boot time for all the "C" functions
 2) at run time for given functions

1) is done as part of kernel/ftrace initialization and executed while
no other threads (CPUs) are running, so we don't need atomicity there.
See [1].

For 2), we only have to replace one instruction (nop <-> bl) as [1] stated.
So we can guarantee atomicity.

Therefore, I still believe that the -fprolog-pad=N approach will work for
AArch64.
 
> From the kernel discussion thread it appears that the pad needs to be more than 8 bytes, and that the kernel can't update that atomically.  However if -mfentry approach is used, then we need to update only 4 (or 8) bytes of the pad, and we avoid the atomicity problem.
> 
> Therefore, [unless there is a clever multi-stage update process to atomically change NOPs to whatever we need,] I think we have to go with Li's -mfentry approach.

The reason that I gave up this approach is that it is not as generic
as we had expected. At least, PowerPC needs a specific instruction
(i.e. one saving the TOC) before the NOPs.
See discussions in [2].

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2016-January/401854.html
[2] http://lkml.iu.edu//hypermail/linux/kernel/1602.0/02257.html


Thanks,
-Takahiro AKASHI

> Comments?
> 
> --
> Maxim Kuvyrkov
> www.linaro.org
> 
> 
> > , typically
> > for powerpc archs. Furthermore, I think there are no good reasons to push
> > the other archs (such as x86) which have implemented the 'ftrace with regs'
> > feature to replace the current method with the new option, which may bring
> > heavy target-dependent code adaptation; as a result it becomes an
> > arm64-dedicated solution, leaving the kernel with two different forms of
> > implementation.
> > 
> > [1] https://gcc.gnu.org/ml/gcc/2015-10/msg00090.html
> > [2] http://lists.infradead.org/pipermail/linux-arm-kernel/2016-January/401854.html
> 

-- 
Thanks,
-Takahiro AKASHI

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] [AArch64] support -mfentry feature for arm64
  2016-04-14 15:58     ` Szabolcs Nagy
  2016-04-18 13:26       ` Alexander Monakov
@ 2016-04-19  6:13       ` AKASHI Takahiro
  2016-04-19  6:44         ` Alexander Monakov
  1 sibling, 1 reply; 38+ messages in thread
From: AKASHI Takahiro @ 2016-04-19  6:13 UTC (permalink / raw)
  To: Szabolcs Nagy
  Cc: Andrew Pinski, Maxim Kuvyrkov, Li Bin, GCC Patches,
	Marcus Shawcroft, Richard Earnshaw, andrew.wafaa,
	masami.hiramatsu.pt, geoff, guohanjun, Yangfei (Felix),
	jiangjiji, nd

On Thu, Apr 14, 2016 at 04:58:12PM +0100, Szabolcs Nagy wrote:
> On 14/04/16 14:15, Andrew Pinski wrote:
> > On Thu, Apr 14, 2016 at 9:08 PM, Maxim Kuvyrkov
> > <maxim.kuvyrkov@linaro.org> wrote:
> >> On Mar 14, 2016, at 11:14 AM, Li Bin <huawei.libin@huawei.com> wrote:
> >>>
> >>> As ARM64 is entering the enterprise world, machines cannot be stopped in
> >>> some critical enterprise production environments; that is, live patching,
> >>> as one of the RAS features, is increasingly important for the ARM64 arch now.
> >>>
> >>> Now, the mainstream live patch implementation which has been merged in the
> >>> Linux kernel (x86/s390) is based on the 'ftrace with regs' feature, and
> >>> this feature needs the help of gcc.
> >>>
> >>> This patch proposes a generic solution for arm64 gcc called -mfentry,
> >>> following the example of x86, mips, s390, etc.; on these archs, this
> >>> feature has been used to implement the ftrace feature 'ftrace with regs'
> >>> to support live patching.
> >>>
> >>> By now, there is another solution from Linaro [1], which proposes to
> >>> implement a new option -fprolog-pad=N that generates a pad of N nops at the
> >>> beginning of each function. This solution is an arch-independent way for gcc,
> >>> but there may be some limitations which have not been recognized for the
> >>> Linux kernel to adapt to this solution, besides the discussion on [2]
> >>
> >> It appears that implementing -fprolog-pad=N option in GCC will not enable kernel live-patching support for AArch64.  The proposal for the option was to make GCC output a given number of NOPs at the beginning of each function, and then the kernel could use that NOP pad to insert whatever instructions it needs.  The modification of kernel instruction stream needs to be done atomically, and, unfortunately, it seems the kernel can use only architecture-provided atomicity primitives -- i.e., changing at most 8 bytes at a time.
> >>
> > 
> > Can't we add a 16byte atomic primitive for ARM64 to the kernel?
> > Though you need to align all functions to a 16 byte boundary if the
> > -fprolog-pag=N needs to happen.  Do you know what the size that needs
> > to be modified?  It does seem to be either 12 or 16 bytes.
> > 
> 
> looking at [2] i don't see why
> 
> func:
>   mov x9, x30
>   bl _tracefunc
>   <function body>

Actually,
    mov x9, x30
    bl _tracefunc
    mov x30, x9
    <function body>
 
> is not good for the kernel.
> 
> mov x9, x30 is a nop at function entry, so in
> theory 4 byte atomic write should be enough
> to enable/disable tracing.

Please see my previous reply to Maxim.

Thanks,
-Takahiro AKASHI

> >> From the kernel discussion thread it appears that the pad needs to be more than 8 bytes, and that the kernel can't update that atomically.  However if -mfentry approach is used, then we need to update only 4 (or 8) bytes of the pad, and we avoid the atomicity problem.
> > 
> > I think you are incorrect, you could add a 16 byte atomic primitive if needed.
> > 
> >>
> >> Therefore, [unless there is a clever multi-stage update process to atomically change NOPs to whatever we need,] I think we have to go with Li's -mfentry approach.
> > 
> > Please consider the above of having a 16 byte (128bit) atomic
> > instructions be available would that be enough?
> > 
> > Thanks,
> > Andrew
> > 
> >>
> >> Comments?
> >>
> >> --
> >> Maxim Kuvyrkov
> >> www.linaro.org
> >>
> >>
> >>> , typically
> >>> for powerpc archs. Furthermore, I think there are no good reasons to push
> >>> the other archs (such as x86) which have implemented the 'ftrace with regs'
> >>> feature to replace the current method with the new option, which may bring
> >>> heavy target-dependent code adaptation; as a result it becomes an
> >>> arm64-dedicated solution, leaving the kernel with two different forms of
> >>> implementation.
> >>>
> >>> [1] https://gcc.gnu.org/ml/gcc/2015-10/msg00090.html
> >>> [2] http://lists.infradead.org/pipermail/linux-arm-kernel/2016-January/401854.html
> >>
> > 
> 

-- 
Thanks,
-Takahiro AKASHI

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] [AArch64] support -mfentry feature for arm64
  2016-04-18 12:12         ` Michael Matz
@ 2016-04-19  6:26           ` AKASHI Takahiro
  2016-04-19  6:39             ` Alexander Monakov
  2016-04-19 16:03           ` Torsten Duwe
  1 sibling, 1 reply; 38+ messages in thread
From: AKASHI Takahiro @ 2016-04-19  6:26 UTC (permalink / raw)
  To: Michael Matz
  Cc: Alexander Monakov, Maxim Kuvyrkov, Li Bin, GCC Patches,
	Marcus Shawcroft, richard.earnshaw, andrew.wafaa, szabolcs.nagy,
	masami.hiramatsu.pt, geoff, guohanjun, felix.yang, jiangjiji

On Mon, Apr 18, 2016 at 02:12:09PM +0200, Michael Matz wrote:
> Hi,
> 
> On Sun, 17 Apr 2016, Alexander Monakov wrote:
> 
> > I've noticed an issue in my (and probably Michael's) solution: if 
> > there's a thread that made it past the first nop, but is still executing 
> > the nop pad, it's unsafe to replace the nops.

Yeah, this issue also trapped me before :)
 
> To solve that, it 
> > suffices to have a forward branch in place of the first nop to begin 
> > with (i.e. have the compiler emit it).
>
> True.  I wonder if the generic solution in GCC should do that always or if 
> the patch infrastructure should do that to enable more freedom like doing 
> this:
> 
> > But if Szabolcs' two-instruction 
> > sequence in the adjacent subthread is sufficient, this is moot.
> 
> .  It can also be solved by having just one NOP after the function label, 
> and a number of them before, then no thread can be in the nop pad.  That 
> seems to indicate that GCC should not try to be too clever and simply 
> leave the specified number of nops before and after the function label, 
> leaving safety measures to the patching infrastructure.

I don't get this idea very well.
How can the instructions *before* a function label be executed
after branching into this function?

Thanks,
-Takahiro AKASHI

> 
> Ciao,
> Michael.

-- 
Thanks,
-Takahiro AKASHI

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] [AArch64] support -mfentry feature for arm64
  2016-04-19  6:26           ` AKASHI Takahiro
@ 2016-04-19  6:39             ` Alexander Monakov
  2016-04-20  1:23               ` AKASHI Takahiro
  0 siblings, 1 reply; 38+ messages in thread
From: Alexander Monakov @ 2016-04-19  6:39 UTC (permalink / raw)
  To: AKASHI Takahiro
  Cc: Michael Matz, Maxim Kuvyrkov, Li Bin, GCC Patches,
	Marcus Shawcroft, richard.earnshaw, andrew.wafaa, szabolcs.nagy,
	masami.hiramatsu.pt, geoff, guohanjun, felix.yang, jiangjiji

On Tue, 19 Apr 2016, AKASHI Takahiro wrote:
> > > But if Szabolcs' two-instruction 
> > > sequence in the adjacent subthread is sufficient, this is moot.
> > 
> > .  It can also be solved by having just one NOP after the function label, 
> > and a number of them before, then no thread can be in the nop pad.  That 
> > seems to indicate that GCC should not try to be too clever and simply 
> > leave the specified number of nops before and after the function label, 
> > leaving safety measures to the patching infrastructure.
> 
> I don't get this idea very well.
> How can the instructions *before* a function label be executed
> after branching into this function?

The single nop after the function label is changed to a short backwards branch
to the instructions just before the function label.

As a result, the last instruction in the pad would have to become a short
forward branch jumping over the backwards branch described above, to the first
real instruction of the function.

Alexander

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] [AArch64] support -mfentry feature for arm64
  2016-04-19  6:13       ` AKASHI Takahiro
@ 2016-04-19  6:44         ` Alexander Monakov
  2016-04-20  0:33           ` AKASHI Takahiro
  0 siblings, 1 reply; 38+ messages in thread
From: Alexander Monakov @ 2016-04-19  6:44 UTC (permalink / raw)
  To: AKASHI Takahiro
  Cc: Szabolcs Nagy, Andrew Pinski, Maxim Kuvyrkov, Li Bin,
	GCC Patches, Marcus Shawcroft, Richard Earnshaw, andrew.wafaa,
	masami.hiramatsu.pt, geoff, guohanjun, Yangfei (Felix),
	jiangjiji, nd

On Tue, 19 Apr 2016, AKASHI Takahiro wrote:
> > looking at [2] i don't see why
> > 
> > func:
> >   mov x9, x30
> >   bl _tracefunc
> >   <function body>
> 
> Actually,
>     mov x9, x30
>     bl _tracefunc
>     mov x30, x9
>     <function body>

I think here Szabolcs' point was that the last instruction can be eliminated:
_tracefunc can be responsible for restoring x30, and can use x9 to return to
its caller. It has a non-standard calling convention and needs to be
implemented in assembly anyway.

Alexander

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] [AArch64] support -mfentry feature for arm64
  2016-04-18 15:54           ` Alexander Monakov
@ 2016-04-19  6:46             ` AKASHI Takahiro
  0 siblings, 0 replies; 38+ messages in thread
From: AKASHI Takahiro @ 2016-04-19  6:46 UTC (permalink / raw)
  To: Alexander Monakov
  Cc: Szabolcs Nagy, Andrew Pinski, Maxim Kuvyrkov, Li Bin,
	GCC Patches, Marcus Shawcroft, Richard Earnshaw, andrew.wafaa,
	masami.hiramatsu.pt, geoff, guohanjun, Yangfei (Felix),
	jiangjiji, nd

On Mon, Apr 18, 2016 at 06:54:33PM +0300, Alexander Monakov wrote:
> On Mon, 18 Apr 2016, Szabolcs Nagy wrote:
> > On 18/04/16 14:26, Alexander Monakov wrote:
> > > On Thu, 14 Apr 2016, Szabolcs Nagy wrote:
> > >> looking at [2] i don't see why
> > >>
> > >> func:
> > >>   mov x9, x30
> > >>   bl _tracefunc
> > >>   <function body>
> > >>
> > >> is not good for the kernel.
> > >>
> > >> mov x9, x30 is a nop at function entry, so in
> > >> theory 4 byte atomic write should be enough
> > >> to enable/disable tracing.
> > > 
> > > Overwriting x9 can be problematic because GCC has gained the ability to track
> > > register usage interprocedurally: if foo() calls bar(), and GCC has already
> > > emitted code for bar() and knows that it cannot change x9, it can use that
> > > knowledge to avoid saving/restoring x9 in foo() around calls to bar(). See
> > > option '-fipa-ra'.
> > > 
> > > If there's no register that can be safely used in place of x9 here, then
> > > the backend should emit the entry/pad appropriately (e.g. with an unspec that
> > > clobbers the possibly-overwritten register).
> > > 
> > 
> > (1) nop padded function can be assumed to clobber all temp regs
> 
> This may be undesirable if the nop pad is expected to be left untouched
> most of the time, because it would penalize the common case.  If only
> sufficiently complex functions (e.g. making other calls anyway) are expected
> to be padded, it's moot.

Almost all of the C functions in the kernel will be compiled
with -mfentry, and later on we can dynamically turn tracing
on and off per function.

> > (2) or _tracefunc must save/restore all temp regs, not just arg regs.
> 
> This doesn't work: when _tracefunc starts executing, old value of x9 is
> already unrecoverable.

Yeah. We may, instead, be able to preserve the LR value on the stack,
but obviously with a performance penalty.
I wondered whether we could partially stop "instruction scheduling",
and always generate a fixed sequence of instructions like
    stp x29, x30, [sp, #-XX]!
    mov x29, sp
    bl _mcount
    <function body>
but Maxim said no :)

Thanks,
-Takahiro AKASHI

> > on x86_64, glibc and linux _mcount and __fentry__ don't
> > save %r11 (temp reg), only the arg regs, so i think nop
> > padding should behave the same way (1).
> 
> That makes sense (modulo what I said above about penalizing tiny functions).
> 
> Heh, I started wondering if on x86 this is handled correctly when the calls
> are nopped out, and it turns out -pg disables -fipa-ra (in toplev.c)! :)
> 
> Alexander

-- 
Thanks,
-Takahiro AKASHI

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] [AArch64] support -mfentry feature for arm64
  2016-04-18 17:34             ` Michael Matz
@ 2016-04-19  8:00               ` Andrew Haley
  2016-04-19 13:19                 ` Michael Matz
  0 siblings, 1 reply; 38+ messages in thread
From: Andrew Haley @ 2016-04-19  8:00 UTC (permalink / raw)
  To: Michael Matz; +Cc: gcc-patches

On 18/04/16 18:34, Michael Matz wrote:
> Hi,
> 
> On Mon, 18 Apr 2016, Andrew Haley wrote:
> 
>>>> That may not be safe.  Consider an implementation which looks
>>>> ahead in the instruction stream and decodes the instructions
>>>> speculatively.
>>>
>>> It should go without saying that patching instructions is followed
>>> by whatever means necessary to flush any such caches on a
>>> particular implementation (here after patching the jump, after
>>> patching the rest, and after patching the first insn again,
>>> i.e. three times).
>>
>> That doesn't necessarily help you, though, without an ISB in the reading 
>> thread.
> 
> I don't understand, which reading thread?  We're writing, not reading 
> instructions.  You mean other executing threads? 

Yes.

> I will happily declare any implementation where it's impossible to
> safely patch the instruction stream by flushing the respective
> buffers or other means completely under control of the patching
> machinery, to be broken by design. 

You can declare anything you want, but we have to program for the
architectural specification.

> What failure mode do you envision, exactly?

It's easiest just to quote from the spec:

    How far ahead of the current point of execution instructions are
    fetched from is IMPLEMENTATION DEFINED. Such prefetching can be
    either a fixed or a dynamically varying number of instructions,
    and can follow any or all possible future execution paths. For all
    types of memory:

       The PE might have fetched the instructions from memory at any
       time since the last Context synchronization operation on that
       PE.

       Any instructions fetched in this way might be executed multiple
       times, if this is required by the execution of the program,
       without being re-fetched from memory. In the absence of an ISB,
       there is no limit on the number of times such an instruction
       might be executed without being re-fetched from memory.

    The ARM architecture does not require the hardware to ensure
    coherency between instruction caches and memory, even for
    locations of shared memory.

So, if you write a bunch of instructions (which might have been
pre-fetched) and then rewrite a NOP to jump to those instructions you
need to make sure that the thread which might be running concurrently
does an ISB.

Note also:

    Memory accesses caused by instruction fetches are not required to
    be observed in program order, unless they are separated by an ISB
    or other context synchronization event.

So, if you modify instruction memory in one thread, other threads may
see those changes in a different order from the writing thread.  Sure,
the writing thread executes the cache maintenance instructions on its
side, but you also need to do something on the side which is executing
the instructions.

I have wondered if it might be a good idea to use an inter-processor
interrupt to force a context synchronization event across all PEs.

Andrew.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] [AArch64] support -mfentry feature for arm64
  2016-04-19  8:00               ` Andrew Haley
@ 2016-04-19 13:19                 ` Michael Matz
  2016-04-19 13:25                   ` Andrew Haley
  0 siblings, 1 reply; 38+ messages in thread
From: Michael Matz @ 2016-04-19 13:19 UTC (permalink / raw)
  To: Andrew Haley; +Cc: gcc-patches

Hi,

On Tue, 19 Apr 2016, Andrew Haley wrote:

> > I will happily declare any implementation where it's impossible to 
> > safely patch the instruction stream by flushing the respective buffers 
> > or other means completely under control of the patching machinery, to 
> > be broken by design.
> 
> You can declare anything you want, but we have to program for the 
> architectural specification.
> 
> > What failure mode do you envision, exactly?
> 
> It's easiest just to quote from the spec:
> 
>     How far ahead of the current point of execution instructions are
>     fetched from is IMPLEMENTATION DEFINED. Such prefetching can be
>     either a fixed or a dynamically varying number of instructions,
>     and can follow any or all possible future execution paths. For all
>     types of memory:
> 
>        The PE might have fetched the instructions from memory at any
>        time since the last Context synchronization operation on that
>        PE.
> 
>        Any instructions fetched in this way might be executed multiple
>        times, if this is required by the execution of the program,
>        without being re-fetched from memory. In the absence of an ISB,
>        there is no limit on the number of times such an instruction
>        might be executed without being re-fetched from memory.
> 
>     The ARM architecture does not require the hardware to ensure
>     coherency between instruction caches and memory, even for
>     locations of shared memory.

Well, yeah, that's traditional insn caches on multiple cores.  From user 
space you need kernel help for this, doing inter-processor interrupts to 
flush all such buffers on all cores (or at least those potentially 
fetching stuff in the patched region, if such granularity is possible).  
An implementation providing such is non-broken :)  Alternatively the 
various invalidate cache instructions need to have a form that invalidates 
the i$ on all cores.

That's just normal code patching on multi-core systems.  Nothing specific 
for aarch64 and nothing the GCC side needs to cater for.

> I have wondered if it might be a good idea to use an inter-processor 
> interrupt to force a context synchronization event across all PEs.

So, this, exactly.


Ciao,
Michael.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] [AArch64] support -mfentry feature for arm64
  2016-04-19 13:19                 ` Michael Matz
@ 2016-04-19 13:25                   ` Andrew Haley
  2016-04-19 14:38                     ` Pedro Alves
  0 siblings, 1 reply; 38+ messages in thread
From: Andrew Haley @ 2016-04-19 13:25 UTC (permalink / raw)
  To: Michael Matz; +Cc: gcc-patches

On 04/19/2016 02:19 PM, Michael Matz wrote:

> Well, yeah, that's traditional insn caches on multiple cores.  From
> user space you need kernel help for this, doing interprocess
> interrupts to flush all such buffers on all cores (or at least those
> potentially fetching stuff in the patched region, if such
> granularity is possible).  An implementation providing such is
> non-broken :)

Sure.  If you know of any such facility in Linux userspace, please let
me know.  :-)

But there are ways of doing patching sequences which don't require
IPIs across all the cores; which was my point.

> Alternatively the various invalidate cache instructions need to have
> a form that invalidates the i$ on all cores.

I'm fairly sure we haven't got that in the AArch64 architecture.

Andrew.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] [AArch64] support -mfentry feature for arm64
  2016-04-19 13:25                   ` Andrew Haley
@ 2016-04-19 14:38                     ` Pedro Alves
  2016-04-19 15:02                       ` Andrew Haley
  0 siblings, 1 reply; 38+ messages in thread
From: Pedro Alves @ 2016-04-19 14:38 UTC (permalink / raw)
  To: Andrew Haley, Michael Matz; +Cc: gcc-patches

On 04/19/2016 02:25 PM, Andrew Haley wrote:
> On 04/19/2016 02:19 PM, Michael Matz wrote:
> 
>> Well, yeah, that's traditional insn caches on multiple cores.  From
>> user space you need kernel help for this, doing inter-processor
>> interrupts to flush all such buffers on all cores (or at least those
>> potentially fetching stuff in the patched region, if such
>> granularity is possible).  An implementation providing such is
>> non-broken :)
> 
> Sure.  If you know of any such facility in Linux userspace, please let
> me know.  :-)

Sounds like a job for the sys_membarrier system call:

 https://lkml.org/lkml/2015/3/18/531
 https://lwn.net/Articles/369567/

I think it's available in Linux 4.3+.

Thanks,
Pedro Alves

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] [AArch64] support -mfentry feature for arm64
  2016-04-19 14:38                     ` Pedro Alves
@ 2016-04-19 15:02                       ` Andrew Haley
  0 siblings, 0 replies; 38+ messages in thread
From: Andrew Haley @ 2016-04-19 15:02 UTC (permalink / raw)
  To: Pedro Alves, Michael Matz; +Cc: gcc-patches

On 04/19/2016 03:37 PM, Pedro Alves wrote:
> On 04/19/2016 02:25 PM, Andrew Haley wrote:
>> On 04/19/2016 02:19 PM, Michael Matz wrote:
>>
>>> Well, yeah, that's traditional insn caches on multiple cores.  From
>>> user space you need kernel help for this, doing inter-processor
>>> interrupts to flush all such buffers on all cores (or at least those
>>> potentially fetching stuff in the patched region, if such
>>> granularity is possible).  An implementation providing such is
>>> non-broken :)
>>
>> Sure.  If you know of any such facility in Linux userspace, please let
>> me know.  :-)
> 
> Sounds like a job for the sys_membarrier system call:
> 
>  https://lkml.org/lkml/2015/3/18/531
>  https://lwn.net/Articles/369567/
> 
> I think it's available in Linux 4.3+.

So it is, thanks.  I'm guessing that might be good enough for full
instruction synchronization barriers, but from looking at the kernel
source I can't really tell.

Andrew.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] [AArch64] support -mfentry feature for arm64
  2016-04-18 12:12         ` Michael Matz
  2016-04-19  6:26           ` AKASHI Takahiro
@ 2016-04-19 16:03           ` Torsten Duwe
  1 sibling, 0 replies; 38+ messages in thread
From: Torsten Duwe @ 2016-04-19 16:03 UTC (permalink / raw)
  To: gcc-patches

On Mon, Apr 18, 2016 at 02:12:09PM +0200, Michael Matz wrote:
> 
> .  It can also be solved by having just one NOP after the function label, 
> and a number of them before, then no thread can be in the nop pad.  That 
> seems to indicate that GCC should not try to be too clever and simply 
> leave the specified number of nops before and after the function label, 
> leaving safety measures to the patching infrastructure.

Yes, please. Consistency is required to maintain a sane stream of instructions
for all CPUs involved; this gets particularly nasty with x86 byte code, where
an instruction can in theory cross a cache line boundary at any byte. ARM does
not have this problem. Even if it did, it would be a kernel problem, not the
compiler's.

All that kernel live patching needs is some space to place the patch calls.
I currently see the need for 2 NOPs at the beginning of each function; making
that configurable would be a plus; leaving additional configurable space
before the entry point would be even more flexible.

Fentry, like ppc64 profile-kernel, does more than necessary by generating
a call already. On ppc64 currently, that branch target isn't even actively
used, and the first thing the kernel does is patch these calls with NOPs.

So why not start with the NOPs right away? An architecture-independent,
variable number might stimulate use cases on other OSes and architectures
as well.
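A sketch of such a pad (the counts here are arbitrary placeholders, not a
proposal for specific defaults):

```
	nop			/* configurable space before the	*/
	nop			/* entry point, for patch code		*/
func:
	nop			/* patchable slots at the entry that	*/
	nop			/* the kernel can atomically rewrite	*/
	<prologue>
```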

	Torsten

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] [AArch64] support -mfentry feature for arm64
  2016-04-19  6:44         ` Alexander Monakov
@ 2016-04-20  0:33           ` AKASHI Takahiro
  2016-04-20 10:02             ` Szabolcs Nagy
  0 siblings, 1 reply; 38+ messages in thread
From: AKASHI Takahiro @ 2016-04-20  0:33 UTC (permalink / raw)
  To: Alexander Monakov
  Cc: Szabolcs Nagy, Andrew Pinski, Maxim Kuvyrkov, Li Bin,
	GCC Patches, Marcus Shawcroft, Richard Earnshaw, andrew.wafaa,
	masami.hiramatsu.pt, geoff, guohanjun, Yangfei (Felix),
	jiangjiji, nd

On Tue, Apr 19, 2016 at 09:44:37AM +0300, Alexander Monakov wrote:
> On Tue, 19 Apr 2016, AKASHI Takahiro wrote:
> > > looking at [2] i don't see why
> > > 
> > > func:
> > >   mov x9, x30
> > >   bl _tracefunc
> > >   <function body>
> > 
> > Actually,
> >     mov x9, x30
> >     bl _tracefunc
> >     mov x30, x9
> >     <function body>
> 
> I think here Szabolcs' point was that the last instruction can be eliminated:
> _tracefunc can be responsible for restoring x30, and can use x9 to return to
> its caller. It has a non-standard calling convention and needs to be
> implemented in assembly anyway.

OK, but in _tracefunc, x30 has been updated, and so we should
return as follows:
    mov xTMP, x30
    mov x30, x9
    ret xTMP

We need one more temp register here...

Thanks,
-Takahiro AKASHI

> Alexander

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] [AArch64] support -mfentry feature for arm64
  2016-04-19  6:39             ` Alexander Monakov
@ 2016-04-20  1:23               ` AKASHI Takahiro
  2016-04-20 16:45                 ` Szabolcs Nagy
  0 siblings, 1 reply; 38+ messages in thread
From: AKASHI Takahiro @ 2016-04-20  1:23 UTC (permalink / raw)
  To: Alexander Monakov
  Cc: Michael Matz, Maxim Kuvyrkov, Li Bin, GCC Patches,
	Marcus Shawcroft, richard.earnshaw, andrew.wafaa, szabolcs.nagy,
	geoff, guohanjun, felix.yang, jiangjiji

On Tue, Apr 19, 2016 at 09:39:39AM +0300, Alexander Monakov wrote:
> On Tue, 19 Apr 2016, AKASHI Takahiro wrote:
> > > > But if Szabolcs' two-instruction 
> > > > sequence in the adjacent subthread is sufficient, this is moot.
> > > 
> > > .  It can also be solved by having just one NOP after the function label, 
> > > and a number of them before, then no thread can be in the nop pad.  That 
> > > seems to indicate that GCC should not try to be too clever and simply 
> > > leave the specified number of nops before and after the function label, 
> > > leaving safety measures to the patching infrastructure.
> > 
> > I don't get this idea very well.
> > How can the instructions *before* a function label be executed
> > after branching into this function?
> 
> The single nop after the function label is changed to a short backwards branch
> to the instructions just before the function label.
> 
> As a result, the last instruction in the pad would have to become a short
> forward branch jumping over the backwards branch described above, to the first
> real instruction of the function.

So you mean something like:
1:
  str x30, [sp, #-8]!
  bl _tracefunc
  ldr x30, [sp], #8
  b 2f
.global <function label>
<function label>:
  b 1b
2:
  <function prologue/body>
  ...
(We will not have to use x9 or else to preserve x30 here.)

Interesting.
Livepatch code in the kernel assumes that the address of
"bl _tracefunc" is equal to <function label>, but a recent patch for
powerpc to support livepatching tries to ease this restriction [1],
so hopefully it won't be an issue.
(I will have to dig into the kernel code to be sure there are
no other issues, though.)

Thanks,
-Takahiro AKASHI

[1] http://lkml.iu.edu//hypermail/linux/kernel/1604.1/04111.html and
    http://lkml.iu.edu//hypermail/linux/kernel/1604.1/04112.html

> Alexander

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] [AArch64] support -mfentry feature for arm64
  2016-04-20  0:33           ` AKASHI Takahiro
@ 2016-04-20 10:02             ` Szabolcs Nagy
  0 siblings, 0 replies; 38+ messages in thread
From: Szabolcs Nagy @ 2016-04-20 10:02 UTC (permalink / raw)
  To: AKASHI Takahiro, Alexander Monakov
  Cc: Andrew Pinski, Maxim Kuvyrkov, Li Bin, GCC Patches,
	Marcus Shawcroft, Richard Earnshaw, andrew.wafaa,
	masami.hiramatsu.pt, geoff, guohanjun, Yangfei (Felix),
	jiangjiji, nd

On 20/04/16 01:36, AKASHI Takahiro wrote:
> On Tue, Apr 19, 2016 at 09:44:37AM +0300, Alexander Monakov wrote:
>> On Tue, 19 Apr 2016, AKASHI Takahiro wrote:
>>>> looking at [2] i don't see why
>>>>
>>>> func:
>>>>   mov x9, x30
>>>>   bl _tracefunc
>>>>   <function body>
>>>
>>> Actually,
>>>     mov x9, x30
>>>     bl _tracefunc
>>>     mov x30, x9
>>>     <function body>
>>
>> I think here Szabolcs' point was that the last instruction can be eliminated:
>> _tracefunc can be responsible for restoring x30, and can use x9 to return to
>> its caller. It has a non-standard calling convention and needs to be
>> implemented in assembly anyway.
> 
> OK, but in _tracefunc, x30 has been updated, and so we should
> return as follows:
>     mov xTMP, x30
>     mov x30, x9
>     ret xTMP
> 
> We need one more temp register here...
> 

you have to save/restore x9 and x30 around
the ftrace callback that is written in c anyway,
so i think you don't need more registers, just
restore from the stack differently.

and the instrumentation code sequence should
be optimized, not the trace function.
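a sketch of a _tracefunc doing exactly that (x16 as the extra
scratch is an assumption, it's a call-clobbered ip register):

```
_tracefunc:
	stp	x9, x30, [sp, #-16]!	/* x9 = caller's lr, x30 =
					   return addr into the func */
	/* ... set up args, call the C ftrace handler ... */
	ldp	x9, x16, [sp], #16	/* reload, putting the saved
					   x30 copy in a scratch reg */
	mov	x30, x9			/* restore the original lr */
	ret	x16			/* return into the function */
```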

> Thanks,
> -Takahiro AKASHI
> 
>> Alexander
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] [AArch64] support -mfentry feature for arm64
  2016-04-20  1:23               ` AKASHI Takahiro
@ 2016-04-20 16:45                 ` Szabolcs Nagy
  0 siblings, 0 replies; 38+ messages in thread
From: Szabolcs Nagy @ 2016-04-20 16:45 UTC (permalink / raw)
  To: AKASHI Takahiro, Alexander Monakov
  Cc: Michael Matz, Maxim Kuvyrkov, Li Bin, GCC Patches,
	Marcus Shawcroft, richard.earnshaw, andrew.wafaa, geoff,
	guohanjun, felix.yang, jiangjiji, nd

On 20/04/16 02:25, AKASHI Takahiro wrote:
> On Tue, Apr 19, 2016 at 09:39:39AM +0300, Alexander Monakov wrote:
>> On Tue, 19 Apr 2016, AKASHI Takahiro wrote:
>>>>> But if Szabolcs' two-instruction 
>>>>> sequence in the adjacent subthread is sufficient, this is moot.
>>>>
>>>> .  It can also be solved by having just one NOP after the function label, 
>>>> and a number of them before, then no thread can be in the nop pad.  That 
>>>> seems to indicate that GCC should not try to be too clever and simply 
>>>> leave the specified number of nops before and after the function label, 
>>>> leaving safety measures to the patching infrastructure.
>>>
>>> I don't get this idea very well.
>>> How can the instructions *before* a function label be executed
>>> after branching into this function?
>>
>> The single nop after the function label is changed to a short backwards branch
>> to the instructions just before the function label.
>>
>> As a result, the last instruction in the pad would have to become a short
>> forward branch jumping over the backwards branch described above, to the first
>> real instruction of the function.
> 
> So you mean something like:
> 1:
>   str x30, [sp, #-8]!
>   bl _tracefunc
>   ldr x30, [sp], #8
>   b 2f
> .global <function label>
> <function label>:
>   b 1b
> 2:
>   <function prologue/body>
>   ...
> (We will not have to use x9 or else to preserve x30 here.)
> 
> Interesting.
> Livepatch code in the kernel has an assumption that the address of
> "bl _tracefunc" be equal to <function label>, but a recent patch for
> power pc to support livepatch tries to ease this restriction [1],
> and so hopefully it won't be an issue.
> (I will have to dig into the kernel code to be sure that there is
> no other issues though.)
> 

i think ldr x30,[sp],#8 after the _tracefunc call is not ok for
livepatching, since _tracefunc will change the return
address to the new function to hijack the call, and the new
function will not restore the stack (this can be solved if the
new function can be instrumented, but it's fiddly).
and sp has to be 16 byte aligned, so the options are

  str x30,[sp,#-16]!
  bl _tracefunc

or

  mov x9,x30
  bl _tracefunc

where _tracefunc is responsible for restoring x30 and
sp, and this sequence can come before or after the
function symbol.

if it's before then

1:
  <save x30>
  bl _tracefunc
  b 2f
func:
  b 1b
2:
  <prologue>

the trace disabled case is better (only one nop), but i
think it would mean more kernel work (the current
code assumes bl _tracefunc is nopped, so whenever
tracing is enabled a different tracefunc target may be
used in the atomic update, i don't know if this is
necessary though).

it is probably only worth inventing something new for
aarch64 in gcc if the kernel can use that consistently
across targets or if that can cover other significant
use cases, but it's not clear if the various flexible nop
padding solutions can be more useful than the simple
two instruction sequence which kernel tools can already
deal with.

so it seems to me that

func:
  mov x9, x30
  bl __fentry__
  <prologue>

is still the best option, with a new -mfentry option for
aarch64 (then we can keep the default -pg behaviour
for backward compatibility and work similarly to x86
with -mfentry).  it does not solve the more general
instrumentation problem, but that would require more
analysis.
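with that scheme, the disabled state would presumably look like
(assuming the kernel nops out both instructions, as ftrace does
with the call on x86 -mfentry):

```
func:
	nop			/* was: mov x9, x30   */
	nop			/* was: bl __fentry__ */
	<prologue>
```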

(on x86, gcc also provides -mrecord-mcount and
-mnop-mcount to record the nopped-out mcount call
sites, but the kernel seems to use its own tool
to do that by looking for the mcount/fentry call
relocs, so they are probably not needed).

> Thanks,
> -Takahiro AKASHI
> 
> [1] http://lkml.iu.edu//hypermail/linux/kernel/1604.1/04111.html and
>     http://lkml.iu.edu//hypermail/linux/kernel/1604.1/04112.html
> 
>> Alexander
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] [AArch64] support -mfentry feature for arm64
  2015-10-22 13:24 Li Bin
@ 2015-10-22 13:53 ` Marcus Shawcroft
  0 siblings, 0 replies; 38+ messages in thread
From: Marcus Shawcroft @ 2015-10-22 13:53 UTC (permalink / raw)
  To: Li Bin
  Cc: gcc-patches, szabolcs.nagy, Geoff.Levand, Andrew.Wafaa,
	guohanjun, felix.yang, jiangjiji

On 22 October 2015 at 14:21, Li Bin <huawei.libin@huawei.com> wrote:
> From: Jiangjiji <jiangjiji@huawei.com>
>
> * gcc/config/aarch64/aarch64.opt: Add a new option.
> * gcc/config/aarch64/aarch64.c: Add some new functions and Macros.
> * gcc/config/aarch64/aarch64.h: Modify PROFILE_HOOK and FUNCTION_PROFILER.
>
> Signed-off-by: Jiangjiji <jiangjiji@huawei.com>
> Signed-off-by: Li Bin <huawei.libin@huawei.com>

Hi, please work with Maxim
(https://gcc.gnu.org/ml/gcc/2015-10/msg00090.html) towards a generic
solution to this issue.  If, in the future, it becomes apparent that
his proposal is not viable, then we can reconsider a backend-centric
solution.

Cheers
/Marcus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH] [AArch64] support -mfentry feature for arm64
@ 2015-10-22 13:24 Li Bin
  2015-10-22 13:53 ` Marcus Shawcroft
  0 siblings, 1 reply; 38+ messages in thread
From: Li Bin @ 2015-10-22 13:24 UTC (permalink / raw)
  To: gcc-patches
  Cc: szabolcs.nagy, Geoff.Levand, Andrew.Wafaa, guohanjun, felix.yang,
	jiangjiji, huawei.libin

From: Jiangjiji <jiangjiji@huawei.com>

* gcc/config/aarch64/aarch64.opt: Add a new option.
* gcc/config/aarch64/aarch64.c: Add some new functions and Macros.
* gcc/config/aarch64/aarch64.h: Modify PROFILE_HOOK and FUNCTION_PROFILER.

Signed-off-by: Jiangjiji <jiangjiji@huawei.com>
Signed-off-by: Li Bin <huawei.libin@huawei.com>
---
 gcc/config/aarch64/aarch64.c   |   23 +++++++++++++++++++++++
 gcc/config/aarch64/aarch64.h   |   13 ++++++++-----
 gcc/config/aarch64/aarch64.opt |    4 ++++
 3 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 752df4e..c70b161 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -440,6 +440,17 @@ aarch64_is_long_call_p (rtx sym)
   return aarch64_decl_is_long_call_p (SYMBOL_REF_DECL (sym));
 }
 
+void
+aarch64_function_profiler (FILE *file, int labelno ATTRIBUTE_UNUSED)
+{
+  if (flag_fentry)
+    {
+      fprintf (file, "\tmov\tx9, x30\n");
+      fprintf (file, "\tbl\t__fentry__\n");
+      fprintf (file, "\tmov\tx30, x9\n");
+    }
+}
+
 /* Return true if the offsets to a zero/sign-extract operation
    represent an expression that matches an extend operation.  The
    operands represent the paramters from
@@ -7414,6 +7425,15 @@ aarch64_emit_unlikely_jump (rtx insn)
   add_int_reg_note (insn, REG_BR_PROB, very_unlikely);
 }
 
+/* Return true if profiling code should be emitted before the
+   prologue (i.e. the -mfentry style used on x86); otherwise
+   return false.  */
+static bool
+aarch64_profile_before_prologue (void)
+{
+  return flag_fentry != 0;
+}
+
 /* Expand a compare and swap pattern.  */
 
 void
@@ -8454,6 +8474,9 @@ aarch64_cannot_change_mode_class (enum machine_mode from,
 #undef TARGET_ASM_ALIGNED_SI_OP
 #define TARGET_ASM_ALIGNED_SI_OP "\t.word\t"
 
+#undef TARGET_PROFILE_BEFORE_PROLOGUE
+#define TARGET_PROFILE_BEFORE_PROLOGUE aarch64_profile_before_prologue
+
 #undef TARGET_ASM_CAN_OUTPUT_MI_THUNK
 #define TARGET_ASM_CAN_OUTPUT_MI_THUNK \
   hook_bool_const_tree_hwi_hwi_const_tree_true
diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
index 77b2bb9..65e34fc 100644
--- a/gcc/config/aarch64/aarch64.h
+++ b/gcc/config/aarch64/aarch64.h
@@ -804,13 +804,16 @@ do {									     \
 #define PROFILE_HOOK(LABEL)						\
   {									\
     rtx fun, lr;							\
-    lr = get_hard_reg_initial_val (Pmode, LR_REGNUM);			\
-    fun = gen_rtx_SYMBOL_REF (Pmode, MCOUNT_NAME);			\
-    emit_library_call (fun, LCT_NORMAL, VOIDmode, 1, lr, Pmode);	\
+    if (!flag_fentry)							\
+      {									\
+	lr = get_hard_reg_initial_val (Pmode, LR_REGNUM);		\
+	fun = gen_rtx_SYMBOL_REF (Pmode, MCOUNT_NAME);			\
+	emit_library_call (fun, LCT_NORMAL, VOIDmode, 1, lr, Pmode);	\
+      }									\
   }
 
-/* All the work done in PROFILE_HOOK, but still required.  */
-#define FUNCTION_PROFILER(STREAM, LABELNO) do { } while (0)
+#define FUNCTION_PROFILER(STREAM, LABELNO) \
+  aarch64_function_profiler (STREAM, LABELNO)
 
 /* For some reason, the Linux headers think they know how to define
    these macros.  They don't!!!  */
diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
index 266d873..9e4b408 100644
--- a/gcc/config/aarch64/aarch64.opt
+++ b/gcc/config/aarch64/aarch64.opt
@@ -124,3 +124,7 @@ Enum(aarch64_abi) String(ilp32) Value(AARCH64_ABI_ILP32)
 
 EnumValue
 Enum(aarch64_abi) String(lp64) Value(AARCH64_ABI_LP64)
+
+mfentry
+Target Report Var(flag_fentry) Init(0)
+Emit profiling counter call at function entry before the prologue.
-- 
1.7.1

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2016-04-20 16:45 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-14  8:13 [PATCH] [AArch64] support -mfentry feature for arm64 Li Bin
2016-03-14  8:12 ` Li Bin
2016-04-14 13:08 ` Maxim Kuvyrkov
2016-04-14 13:15   ` Andrew Pinski
2016-04-14 15:58     ` Szabolcs Nagy
2016-04-18 13:26       ` Alexander Monakov
2016-04-18 13:34         ` Ramana Radhakrishnan
2016-04-18 13:44           ` Alexander Monakov
2016-04-18 13:57             ` Ramana Radhakrishnan
2016-04-18 14:03               ` Alexander Monakov
2016-04-18 14:31         ` Szabolcs Nagy
2016-04-18 15:54           ` Alexander Monakov
2016-04-19  6:46             ` AKASHI Takahiro
2016-04-19  6:13       ` AKASHI Takahiro
2016-04-19  6:44         ` Alexander Monakov
2016-04-20  0:33           ` AKASHI Takahiro
2016-04-20 10:02             ` Szabolcs Nagy
2016-04-15 15:40   ` Michael Matz
2016-04-15 17:29     ` Alexander Monakov
2016-04-17 15:06       ` Alexander Monakov
2016-04-18 12:12         ` Michael Matz
2016-04-19  6:26           ` AKASHI Takahiro
2016-04-19  6:39             ` Alexander Monakov
2016-04-20  1:23               ` AKASHI Takahiro
2016-04-20 16:45                 ` Szabolcs Nagy
2016-04-19 16:03           ` Torsten Duwe
2016-04-18 14:32       ` Andrew Haley
2016-04-18 17:13         ` Michael Matz
2016-04-18 17:17           ` Andrew Haley
2016-04-18 17:34             ` Michael Matz
2016-04-19  8:00               ` Andrew Haley
2016-04-19 13:19                 ` Michael Matz
2016-04-19 13:25                   ` Andrew Haley
2016-04-19 14:38                     ` Pedro Alves
2016-04-19 15:02                       ` Andrew Haley
2016-04-19  6:08   ` AKASHI Takahiro
  -- strict thread matches above, loose matches on Subject: below --
2015-10-22 13:24 Li Bin
2015-10-22 13:53 ` Marcus Shawcroft
