public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* Re: [PATCH] Frame pointer for arm with THUMB2 mode
@ 2018-08-28  0:48 ` Wilco Dijkstra
  2018-08-28 10:44   ` Denis Khalikov
  0 siblings, 1 reply; 11+ messages in thread
From: Wilco Dijkstra @ 2018-08-28  0:48 UTC (permalink / raw)
  To: GCC Patches, d.khalikov, v.barinov; +Cc: nd

Hi,

> But we still have an issue with performance, when we are using default
> unwinder, which uses unwind tables. It could be up to 10 times faster to 
> use frame based stack unwinder instead "default unwinder".

Switching on the frame pointer typically costs 1-2% performance, so it's a bad
idea to use it. However changing the frame pointer like in the proposed patch
will have a much larger cost - both in performance and codesize. You'd be 
lucky if it is less than 10%. This is due to placing the frame pointer at the top
rather than the bottom of the frame, and that is very inefficient in Thumb-2.

You would need to unwind ~100k times a second before you might see a 
performance gain. However you pay the performance cost all the time, even
when no unwinding is required. So I don't see this as being a good idea.

If unwind performance is an issue, it would make far more sense to solve that.
Profiling typically hits the same functions again and again. Callgraph profiling to
a fixed depth hits the same functions all the time. So caching is the obvious
solution.

Doing real unwinding is also far more accurate than frame pointer based
unwinding (the latter doesn't handle leaf functions correctly, entry/exit in
non-leaf functions and shrinkwrapped functions - and this breaks callgraph
profiling).

So my question is, is there any point in making code run significantly slower
all the time and in return get inaccurate unwinding?

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Frame pointer for arm with THUMB2 mode
  2018-08-28  0:48 ` [PATCH] Frame pointer for arm with THUMB2 mode Wilco Dijkstra
@ 2018-08-28 10:44   ` Denis Khalikov
  2018-09-05 12:11     ` Wilco Dijkstra
  0 siblings, 1 reply; 11+ messages in thread
From: Denis Khalikov @ 2018-08-28 10:44 UTC (permalink / raw)
  To: Wilco Dijkstra, GCC Patches; +Cc: v.barinov, nd

Hi,
thanks for the answer.

 > Switching on the frame pointer typically costs 1-2% performance, so 
it's a bad
 > idea to use it. However changing the frame pointer like in the 
proposed patch
 > will have a much larger cost - both in performance and codesize. 
You'd be
 > lucky if it is less than 10%. This is due to placing the frame 
pointer at the top
 > rather than the bottom of the frame, and that is very inefficient in 
Thumb-2.
 >
 > You would need to unwind ~100k times a second before you might see a
 > performance gain. However you pay the performance cost all the time, even
 > when no unwinding is required. So I don't see this as being a good idea.
 >
 > If unwind performance is an issue, it would make far more sense to 
solve that.
 > Profiling typically hits the same functions again and again. 
Callgraph profiling to
 > a fixed depth hits the same functions all the time. So caching is the 
obvious
 > solution.

We are working on applying Address/LeakSanitizer for the full Tizen OS
distribution. It's about ~1000 packages, ASan/LSan runtime is installed 
to ld.so.preload. As we know ASan/LSan has interceptors for 
allocators/deallocators such as (malloc/realloc/calloc/free) and so on.
On every allocation from user space program, ASan calls
GET_STACK_TRACE_MALLOC;
which unwinds the stack frame, and by default uses frame based stack
unwinder. So, it requires to build with "-fno-omit-frame-pointer", 
switching it to default unwinder really hits the performance in our case.

 > Doing real unwinding is also far more accurate than frame pointer based
 > unwinding (the latter doesn't handle leaf functions correctly, 
entry/exit in
 > non-leaf functions and shrinkwrapped functions - and this breaks 
callgraph
 > profiling).

I agree, but in our case, all interceptors for allocators are
leaf functions, so the frame based stack unwinder works well for us.

 > So my question is, is there any point in making code run 
significantly slower
 > all the time and in return get inaccurate unwinding?

By default we build packages with ("-marm" "-fno-omit-frame-pointer"),
because need frame based stack unwinder for every allocation, as I said
before. As we know GCC sets fp to lr on the stack with 
("-fno-omit-frame-pointer" and "-marm") and I don't really know why.
But the binary size is bigger than for thumb, so, we cannot use default 
thumb frame pointer and want to reduce binary size for the full 
sanitized image.

In other case clang works the same way, as I offered at the patch.
It has the same issue, but it was fixed at the end of 2017
https://bugs.llvm.org/show_bug.cgi?id=18505 (The topics starts from
discussion about APCS, but it is not the main point.)

Also, unresolved issue related to this
https://github.com/google/sanitizers/issues/640

Thanks.


On 08/28/2018 03:48 AM, Wilco Dijkstra wrote:
> Hi,
>
>> But we still have an issue with performance, when we are using default
>> unwinder, which uses unwind tables. It could be up to 10 times faster to
>> use frame based stack unwinder instead "default unwinder".
>
> Switching on the frame pointer typically costs 1-2% performance, so it's a bad
> idea to use it. However changing the frame pointer like in the proposed patch
> will have a much larger cost - both in performance and codesize. You'd be
> lucky if it is less than 10%. This is due to placing the frame pointer at the top
> rather than the bottom of the frame, and that is very inefficient in Thumb-2.
>
> You would need to unwind ~100k times a second before you might see a
> performance gain. However you pay the performance cost all the time, even
> when no unwinding is required. So I don't see this as being a good idea.
>
> If unwind performance is an issue, it would make far more sense to solve that.
> Profiling typically hits the same functions again and again. Callgraph profiling to
> a fixed depth hits the same functions all the time. So caching is the obvious
> solution.
>
> Doing real unwinding is also far more accurate than frame pointer based
> unwinding (the latter doesn't handle leaf functions correctly, entry/exit in
> non-leaf functions and shrinkwrapped functions - and this breaks callgraph
> profiling).
>
> So my question is, is there any point in making code run significantly slower
> all the time and in return get inaccurate unwinding?
>
> Cheers,
> Wilco
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Frame pointer for arm with THUMB2 mode
  2018-08-28 10:44   ` Denis Khalikov
@ 2018-09-05 12:11     ` Wilco Dijkstra
  2018-09-05 13:53       ` Denis Khalikov
  2018-09-05 13:56       ` Denis Khalikov
  0 siblings, 2 replies; 11+ messages in thread
From: Wilco Dijkstra @ 2018-09-05 12:11 UTC (permalink / raw)
  To: Denis Khalikov, GCC Patches; +Cc: v.barinov, nd

Hi Denis,

> We are working on applying Address/LeakSanitizer for the full Tizen OS
> distribution. It's about ~1000 packages, ASan/LSan runtime is installed 
> to ld.so.preload. As we know ASan/LSan has interceptors for 
> allocators/deallocators such as (malloc/realloc/calloc/free) and so on.
> On every allocation from user space program, ASan calls
> GET_STACK_TRACE_MALLOC;
> which unwinds the stack frame, and by default uses frame based stack
> unwinder. So, it requires to build with "-fno-omit-frame-pointer", 
> switching it to default unwinder really hits the performance in our case.

So this sounds like the first thing to do is reducing the size of the stack traces.
The default is 30 which is far larger than useful. Using 1 for example should
always be fast (since you can use __builtin_return_address(0)) and still get
the function that called malloc. Also if the unwinder happens to be too slow,
it should be optimized and caching added etc.

>> Doing real unwinding is also far more accurate than frame pointer based
> > unwinding (the latter doesn't handle leaf functions correctly, 
> entry/exit in
> > non-leaf functions and shrinkwrapped functions - and this breaks 
> callgraph
> > profiling).
>
> I agree, but in our case, all interceptors for allocators are
> leaf functions, so the frame based stack unwinder works well for us.

Yes a frame chain would work for this case. But it's not currently supported.

> By default we build packages with ("-marm" "-fno-omit-frame-pointer"),
> because need frame based stack unwinder for every allocation, as I said
> before. As we know GCC sets fp to lr on the stack with 
> ("-fno-omit-frame-pointer" and "-marm") and I don't really know why.
> But the binary size is bigger than for thumb, so, we cannot use default 
> thumb frame pointer and want to reduce binary size for the full 
> sanitized image.

The issue is that the frame pointer and frame chain always add a large
overhead even when you do not use any sanitizers. This is especially bad
for the proposed patch - you lose much of the benefit of using Thumb-2...

Using normal unwinding means your code runs at full speed and still can be
used by the sanitizer.

> In other case clang works the same way, as I offered at the patch.
> It has the same issue, but it was fixed at the end of 2017
> https://bugs.llvm.org/show_bug.cgi?id=18505 (The topics starts from
> discussion about APCS, but it is not the main point.)
>
> Also, unresolved issue related to this
> https://github.com/google/sanitizers/issues/640

Adding support for a frame chain would require an ABI change. It would have to 
work across GCC, LLVM, Arm, Thumb-1 and Thumb-2 - not a trivial amount of
effort.

Wilco

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Frame pointer for arm with THUMB2 mode
  2018-09-05 12:11     ` Wilco Dijkstra
@ 2018-09-05 13:53       ` Denis Khalikov
  2018-09-05 13:56       ` Denis Khalikov
  1 sibling, 0 replies; 11+ messages in thread
From: Denis Khalikov @ 2018-09-05 13:53 UTC (permalink / raw)
  To: d.khalikov, GCC Patches; +Cc: nd

Hi Wilco,
thanks for the answer.

 > Adding support for a frame chain would require an ABI change. It 
would have to
 > work across GCC, LLVM, Arm, Thumb-1 and Thumb-2 - not a trivial amount of
 > effort.

Clang already works that way.
Please look at this commit:
http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/ARM/ARMCallingConv.td?r1=269459&r2=269458&pathrev=269459


This is an example of code which clang generates
with -mthumb -fno-omit-frame-pointer -O2.

  @ %bb.0:
          push    {r4, r5, r6, r7, lr}
          add     r7, sp, #12
          push.w  {r8, r9, r10}
          sub     sp, #56
          mov     r8, r0
          movw    r0, :lower16:_ZTVN6sensor14sensor_handlerE
          movt    r0, :upper16:_ZTVN6sensor14sensor_handlerE
          mov     r9, r8
          adds    r0, #8
          str     r0, [r9], #4

The only difference is clang sets frame pointer to the r7 on the stack 
instead gcc sets to lr, but it already handles by the Sanitizers.

 >
 > So this sounds like the first thing to do is reducing the size of the 
stack traces.
 > The default is 30 which is far larger than useful. Using 1 for 
example should
 > always be fast (since you can use __builtin_return_address(0)) and 
still get
 > the function that called malloc. Also if the unwinder happens to be 
too slow,
 > it should be optimized and caching added etc.
 >
If we change the size of the traces to 2, it could be something like this:

0xb42a50a0 is located 0 bytes inside of 88-byte region 
[0xb42a50a0,0xb42a50f8)
freed by thread T0 here:
     #0 0xb6a35cc7 in __interceptor_free (/usr/lib/libasan.so+0x127cc7)
     #1 0xb5fa64e3 in ipc::message::unref() 
(/lib/libsensord-shared.so+0xe4e3)

previously allocated by thread T0 here:
     #0 0xb6a36157 in malloc (/usr/lib/libasan.so+0x128157)
     #1 0xb2f8852d in accel_device::get_data(unsigned int, 
sensor_data_t**, int*) (/usr/lib/sensor/libsensor-hal-tm1.so+0x1052d)

Instead this:

0xb42250a0 is located 0 bytes inside of 88-byte region 
[0xb42250a0,0xb42250f8)
freed by thread T0 here:
     #0 0xb6989cff in __interceptor_free (/usr/lib/libasan.so+0x127cff)
     #1 0xb5fa64e3 in ipc::message::unref() 
(/lib/libsensord-shared.so+0xe4e3)
     #2 0xb6efbf47 in sensor::sensor_handler::notify(char const*, 
sensor_data_t*, int) (/usr/bin/sensord+0x1ef47)
     #3 0xb6efad43 in sensor::sensor_event_handler::handle(int, unsigned 
int) (/usr/bin/sensord+0x1dd43)
     #4 0xb5fa3bbb  (/lib/libsensord-shared.so+0xbbbb)
     #5 0xb62d9a15 in g_main_context_dispatch 
(/lib/libglib-2.0.so.0+0x91a15)
     #6 0xb62da2d9  (/lib/libglib-2.0.so.0+0x922d9)
     #7 0xb62da9a9 in g_main_loop_run (/lib/libglib-2.0.so.0+0x929a9)
     #8 0xb5fa4e1b in ipc::event_loop::run(int) 
(/lib/libsensord-shared.so+0xce1b)
     #9 0xb6eec9a5 in main (/usr/bin/sensord+0xf9a5)
     #10 0xb5cb663b in __libc_start_main (/lib/libc.so.6+0x1663b)

previously allocated by thread T0 here:
     #0 0xb698a18f in malloc (/usr/lib/libasan.so+0x12818f)
     #1 0xb2f8852d in accel_device::get_data(unsigned int, 
sensor_data_t**, int*) (/usr/lib/sensor/libsensor-hal-tm1.so+0x1052d)
     #2 0xb6ef848f in 
sensor::physical_sensor_handler::get_data(sensor_data_t**, int*) 
(/usr/bin/sensord+0x1b48f)
     #3 0xb6efaa51 in sensor::sensor_event_handler::handle(int, unsigned 
int) (/usr/bin/sensord+0x1da51)
     #4 0xb5fa3bbb  (/lib/libsensord-shared.so+0xbbbb)
     #5 0xb62d9a15 in g_main_context_dispatch 
(/lib/libglib-2.0.so.0+0x91a15)
     #6 0xb62da2d9  (/lib/libglib-2.0.so.0+0x922d9)
     #7 0xb62da9a9 in g_main_loop_run (/lib/libglib-2.0.so.0+0x929a9)
     #8 0xb5fa4e1b in ipc::event_loop::run(int) 
(/lib/libsensord-shared.so+0xce1b)
     #9 0xb6eec9a5 in main (/usr/bin/sensord+0xf9a5)
     #10 0xb5cb663b in __libc_start_main (/lib/libc.so.6+0x1663b)


At the first example we lost the full context, from where the 
control/data flow comes from.

 >
 > The issue is that the frame pointer and frame chain always add a large
 > overhead even when you do not use any sanitizers. This is especially bad
 > for the proposed patch - you lose much of the benefit of using Thumb-2...
 >

The stack layout like this enables only with compile time flag 
(-mthumb-fp and works only together with -mthumb and
-fno-omit-frame-pointer). It does not affect other codegen.


Thanks.


On 09/05/2018 03:11 PM, Wilco Dijkstra wrote:
> Hi Denis,
>
>> We are working on applying Address/LeakSanitizer for the full Tizen OS
>> distribution. It's about ~1000 packages, ASan/LSan runtime is installed
>> to ld.so.preload. As we know ASan/LSan has interceptors for
>> allocators/deallocators such as (malloc/realloc/calloc/free) and so on.
>> On every allocation from user space program, ASan calls
>> GET_STACK_TRACE_MALLOC;
>> which unwinds the stack frame, and by default uses frame based stack
>> unwinder. So, it requires to build with "-fno-omit-frame-pointer",
>> switching it to default unwinder really hits the performance in our case.
>
> So this sounds like the first thing to do is reducing the size of the stack traces.
> The default is 30 which is far larger than useful. Using 1 for example should
> always be fast (since you can use __builtin_return_address(0)) and still get
> the function that called malloc. Also if the unwinder happens to be too slow,
> it should be optimized and caching added etc.
>
>>> Doing real unwinding is also far more accurate than frame pointer based
>>> unwinding (the latter doesn't handle leaf functions correctly,
>> entry/exit in
>>> non-leaf functions and shrinkwrapped functions - and this breaks
>> callgraph
>>> profiling).
>>
>> I agree, but in our case, all interceptors for allocators are
>> leaf functions, so the frame based stack unwinder works well for us.
>
> Yes a frame chain would work for this case. But it's not currently supported.
>
>> By default we build packages with ("-marm" "-fno-omit-frame-pointer"),
>> because need frame based stack unwinder for every allocation, as I said
>> before. As we know GCC sets fp to lr on the stack with
>> ("-fno-omit-frame-pointer" and "-marm") and I don't really know why.
>> But the binary size is bigger than for thumb, so, we cannot use default
>> thumb frame pointer and want to reduce binary size for the full
>> sanitized image.
>
> The issue is that the frame pointer and frame chain always add a large
> overhead even when you do not use any sanitizers. This is especially bad
> for the proposed patch - you lose much of the benefit of using Thumb-2...
>
> Using normal unwinding means your code runs at full speed and still can be
> used by the sanitizer.
>
>> In other case clang works the same way, as I offered at the patch.
>> It has the same issue, but it was fixed at the end of 2017
>> https://bugs.llvm.org/show_bug.cgi?id=18505 (The topics starts from
>> discussion about APCS, but it is not the main point.)
>>
>> Also, unresolved issue related to this
>> https://github.com/google/sanitizers/issues/640
>
> Adding support for a frame chain would require an ABI change. It would have to
> work across GCC, LLVM, Arm, Thumb-1 and Thumb-2 - not a trivial amount of
> effort.
>
> Wilco
>
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Frame pointer for arm with THUMB2 mode
  2018-09-05 12:11     ` Wilco Dijkstra
  2018-09-05 13:53       ` Denis Khalikov
@ 2018-09-05 13:56       ` Denis Khalikov
  2018-09-05 14:11         ` Richard Earnshaw (lists)
  2018-09-05 15:32         ` Wilco Dijkstra
  1 sibling, 2 replies; 11+ messages in thread
From: Denis Khalikov @ 2018-09-05 13:56 UTC (permalink / raw)
  To: Wilco Dijkstra, GCC Patches; +Cc: v.barinov, nd

Hi Wilco,
thanks for the answer.

 > Adding support for a frame chain would require an ABI change. It 
would have to
 > work across GCC, LLVM, Arm, Thumb-1 and Thumb-2 - not a trivial amount of
 > effort.

Clang already works that way.
Please look at this commit:
http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/ARM/ARMCallingConv.td?r1=269459&r2=269458&pathrev=269459

This is an example of code which clang generates
with -mthumb -fno-omit-frame-pointer -O2.

  @ %bb.0:
          push    {r4, r5, r6, r7, lr}
          add     r7, sp, #12
          push.w  {r8, r9, r10}
          sub     sp, #56
          mov     r8, r0
          movw    r0, :lower16:_ZTVN6sensor14sensor_handlerE
          movt    r0, :upper16:_ZTVN6sensor14sensor_handlerE
          mov     r9, r8
          adds    r0, #8
          str     r0, [r9], #4

The only difference is clang sets frame pointer to the r7 on the stack 
instead gcc sets to lr,
but it already handles by the Sanitizers.

 >
 > So this sounds like the first thing to do is reducing the size of the 
stack traces.
 > The default is 30 which is far larger than useful. Using 1 for 
example should
 > always be fast (since you can use __builtin_return_address(0)) and 
still get
 > the function that called malloc. Also if the unwinder happens to be 
too slow,
 > it should be optimized and caching added etc.
 >

If we change the size of the traces to 2, it could be something like this:

0xb42a50a0 is located 0 bytes inside of 88-byte region 
[0xb42a50a0,0xb42a50f8)
freed by thread T0 here:
     #0 0xb6a35cc7 in __interceptor_free (/usr/lib/libasan.so+0x127cc7)
     #1 0xb5fa64e3 in ipc::message::unref() 
(/lib/libsensord-shared.so+0xe4e3)

previously allocated by thread T0 here:
     #0 0xb6a36157 in malloc (/usr/lib/libasan.so+0x128157)
     #1 0xb2f8852d in accel_device::get_data(unsigned int, 
sensor_data_t**, int*) (/usr/lib/sensor/libsensor-hal-tm1.so+0x1052d)

Instead this:

0xb42250a0 is located 0 bytes inside of 88-byte region 
[0xb42250a0,0xb42250f8)
freed by thread T0 here:
     #0 0xb6989cff in __interceptor_free (/usr/lib/libasan.so+0x127cff)
     #1 0xb5fa64e3 in ipc::message::unref() 
(/lib/libsensord-shared.so+0xe4e3)
     #2 0xb6efbf47 in sensor::sensor_handler::notify(char const*, 
sensor_data_t*, int) (/usr/bin/sensord+0x1ef47)
     #3 0xb6efad43 in sensor::sensor_event_handler::handle(int, unsigned 
int) (/usr/bin/sensord+0x1dd43)
     #4 0xb5fa3bbb  (/lib/libsensord-shared.so+0xbbbb)
     #5 0xb62d9a15 in g_main_context_dispatch 
(/lib/libglib-2.0.so.0+0x91a15)
     #6 0xb62da2d9  (/lib/libglib-2.0.so.0+0x922d9)
     #7 0xb62da9a9 in g_main_loop_run (/lib/libglib-2.0.so.0+0x929a9)
     #8 0xb5fa4e1b in ipc::event_loop::run(int) 
(/lib/libsensord-shared.so+0xce1b)
     #9 0xb6eec9a5 in main (/usr/bin/sensord+0xf9a5)
     #10 0xb5cb663b in __libc_start_main (/lib/libc.so.6+0x1663b)

previously allocated by thread T0 here:
     #0 0xb698a18f in malloc (/usr/lib/libasan.so+0x12818f)
     #1 0xb2f8852d in accel_device::get_data(unsigned int, 
sensor_data_t**, int*) (/usr/lib/sensor/libsensor-hal-tm1.so+0x1052d)
     #2 0xb6ef848f in 
sensor::physical_sensor_handler::get_data(sensor_data_t**, int*) 
(/usr/bin/sensord+0x1b48f)
     #3 0xb6efaa51 in sensor::sensor_event_handler::handle(int, unsigned 
int) (/usr/bin/sensord+0x1da51)
     #4 0xb5fa3bbb  (/lib/libsensord-shared.so+0xbbbb)
     #5 0xb62d9a15 in g_main_context_dispatch 
(/lib/libglib-2.0.so.0+0x91a15)
     #6 0xb62da2d9  (/lib/libglib-2.0.so.0+0x922d9)
     #7 0xb62da9a9 in g_main_loop_run (/lib/libglib-2.0.so.0+0x929a9)
     #8 0xb5fa4e1b in ipc::event_loop::run(int) 
(/lib/libsensord-shared.so+0xce1b)
     #9 0xb6eec9a5 in main (/usr/bin/sensord+0xf9a5)
     #10 0xb5cb663b in __libc_start_main (/lib/libc.so.6+0x1663b)


At the first example we lost the full context, from where the 
control/data flow comes from.

 >
 > The issue is that the frame pointer and frame chain always add a large
 > overhead even when you do not use any sanitizers. This is especially bad
 > for the proposed patch - you lose much of the benefit of using Thumb-2...
 >

The stack layout like this enables only with compile time flag 
(-mthumb-fp and works only together with -mthumb and
-fno-omit-frame-pointer). It does not affect other codegen.

Thanks.

On 09/05/2018 03:11 PM, Wilco Dijkstra wrote:
> Hi Denis,
>
>> We are working on applying Address/LeakSanitizer for the full Tizen OS
>> distribution. It's about ~1000 packages, ASan/LSan runtime is installed
>> to ld.so.preload. As we know ASan/LSan has interceptors for
>> allocators/deallocators such as (malloc/realloc/calloc/free) and so on.
>> On every allocation from user space program, ASan calls
>> GET_STACK_TRACE_MALLOC;
>> which unwinds the stack frame, and by default uses frame based stack
>> unwinder. So, it requires to build with "-fno-omit-frame-pointer",
>> switching it to default unwinder really hits the performance in our case.
>
> So this sounds like the first thing to do is reducing the size of the stack traces.
> The default is 30 which is far larger than useful. Using 1 for example should
> always be fast (since you can use __builtin_return_address(0)) and still get
> the function that called malloc. Also if the unwinder happens to be too slow,
> it should be optimized and caching added etc.
>
>>> Doing real unwinding is also far more accurate than frame pointer based
>>> unwinding (the latter doesn't handle leaf functions correctly,
>> entry/exit in
>>> non-leaf functions and shrinkwrapped functions - and this breaks
>> callgraph
>>> profiling).
>>
>> I agree, but in our case, all interceptors for allocators are
>> leaf functions, so the frame based stack unwinder works well for us.
>
> Yes a frame chain would work for this case. But it's not currently supported.
>
>> By default we build packages with ("-marm" "-fno-omit-frame-pointer"),
>> because need frame based stack unwinder for every allocation, as I said
>> before. As we know GCC sets fp to lr on the stack with
>> ("-fno-omit-frame-pointer" and "-marm") and I don't really know why.
>> But the binary size is bigger than for thumb, so, we cannot use default
>> thumb frame pointer and want to reduce binary size for the full
>> sanitized image.
>
> The issue is that the frame pointer and frame chain always add a large
> overhead even when you do not use any sanitizers. This is especially bad
> for the proposed patch - you lose much of the benefit of using Thumb-2...
>
> Using normal unwinding means your code runs at full speed and still can be
> used by the sanitizer.
>
>> In other case clang works the same way, as I offered at the patch.
>> It has the same issue, but it was fixed at the end of 2017
>> https://bugs.llvm.org/show_bug.cgi?id=18505 (The topics starts from
>> discussion about APCS, but it is not the main point.)
>>
>> Also, unresolved issue related to this
>> https://github.com/google/sanitizers/issues/640
>
> Adding support for a frame chain would require an ABI change. It would have to
> work across GCC, LLVM, Arm, Thumb-1 and Thumb-2 - not a trivial amount of
> effort.
>
> Wilco
>
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Frame pointer for arm with THUMB2 mode
  2018-09-05 13:56       ` Denis Khalikov
@ 2018-09-05 14:11         ` Richard Earnshaw (lists)
  2018-09-05 15:32         ` Wilco Dijkstra
  1 sibling, 0 replies; 11+ messages in thread
From: Richard Earnshaw (lists) @ 2018-09-05 14:11 UTC (permalink / raw)
  To: Denis Khalikov, Wilco Dijkstra, GCC Patches; +Cc: v.barinov, nd

On 05/09/18 14:55, Denis Khalikov wrote:
> Hi Wilco,
> thanks for the answer.
> 
>> Adding support for a frame chain would require an ABI change. It would
> have to
>> work across GCC, LLVM, Arm, Thumb-1 and Thumb-2 - not a trivial amount of
>> effort.
> 
> Clang already works that way.


> Please look at this commit:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/ARM/ARMCallingConv.td?r1=269459&r2=269458&pathrev=269459
> 
> 
> This is an example of code which clang generates
> with -mthumb -fno-omit-frame-pointer -O2.
> 
>  @ %bb.0:
>          push    {r4, r5, r6, r7, lr}
>          add     r7, sp, #12
>          push.w  {r8, r9, r10}
>          sub     sp, #56
>          mov     r8, r0
>          movw    r0, :lower16:_ZTVN6sensor14sensor_handlerE
>          movt    r0, :upper16:_ZTVN6sensor14sensor_handlerE
>          mov     r9, r8
>          adds    r0, #8
>          str     r0, [r9], #4
> 
> The only difference is clang sets frame pointer to the r7 on the stack
> instead gcc sets to lr,
> but it already handles by the Sanitizers.

Then Clang is broken.  You can't have different frame pointers in Arm
and Thumb code.

I object to another hack going in for another ill-specified frame
pointer variant until such time as the ABI is updated to sort this out
properly.  The frame handling code is just too critical to overall
performance and the prologue and epilogue code itself is also quite
fragile in this respect.  Adding more mess on top of that is going to
make sorting this out even more tricky: and once a patch like this goes
on it's not easy to remove it again.

So until the ABI sanctions a proper inter-function frame chain record,
GCC will only support local use of the frame pointer and no chaining.

R.

> 
>>
>> So this sounds like the first thing to do is reducing the size of the
> stack traces.
>> The default is 30 which is far larger than useful. Using 1 for example
> should
>> always be fast (since you can use __builtin_return_address(0)) and
> still get
>> the function that called malloc. Also if the unwinder happens to be
> too slow,
>> it should be optimized and caching added etc.
>>
> 
> If we change the size of the traces to 2, it could be something like this:
> 
> 0xb42a50a0 is located 0 bytes inside of 88-byte region
> [0xb42a50a0,0xb42a50f8)
> freed by thread T0 here:
>     #0 0xb6a35cc7 in __interceptor_free (/usr/lib/libasan.so+0x127cc7)
>     #1 0xb5fa64e3 in ipc::message::unref()
> (/lib/libsensord-shared.so+0xe4e3)
> 
> previously allocated by thread T0 here:
>     #0 0xb6a36157 in malloc (/usr/lib/libasan.so+0x128157)
>     #1 0xb2f8852d in accel_device::get_data(unsigned int,
> sensor_data_t**, int*) (/usr/lib/sensor/libsensor-hal-tm1.so+0x1052d)
> 
> Instead this:
> 
> 0xb42250a0 is located 0 bytes inside of 88-byte region
> [0xb42250a0,0xb42250f8)
> freed by thread T0 here:
>     #0 0xb6989cff in __interceptor_free (/usr/lib/libasan.so+0x127cff)
>     #1 0xb5fa64e3 in ipc::message::unref()
> (/lib/libsensord-shared.so+0xe4e3)
>     #2 0xb6efbf47 in sensor::sensor_handler::notify(char const*,
> sensor_data_t*, int) (/usr/bin/sensord+0x1ef47)
>     #3 0xb6efad43 in sensor::sensor_event_handler::handle(int, unsigned
> int) (/usr/bin/sensord+0x1dd43)
>     #4 0xb5fa3bbb  (/lib/libsensord-shared.so+0xbbbb)
>     #5 0xb62d9a15 in g_main_context_dispatch
> (/lib/libglib-2.0.so.0+0x91a15)
>     #6 0xb62da2d9  (/lib/libglib-2.0.so.0+0x922d9)
>     #7 0xb62da9a9 in g_main_loop_run (/lib/libglib-2.0.so.0+0x929a9)
>     #8 0xb5fa4e1b in ipc::event_loop::run(int)
> (/lib/libsensord-shared.so+0xce1b)
>     #9 0xb6eec9a5 in main (/usr/bin/sensord+0xf9a5)
>     #10 0xb5cb663b in __libc_start_main (/lib/libc.so.6+0x1663b)
> 
> previously allocated by thread T0 here:
>     #0 0xb698a18f in malloc (/usr/lib/libasan.so+0x12818f)
>     #1 0xb2f8852d in accel_device::get_data(unsigned int,
> sensor_data_t**, int*) (/usr/lib/sensor/libsensor-hal-tm1.so+0x1052d)
>     #2 0xb6ef848f in
> sensor::physical_sensor_handler::get_data(sensor_data_t**, int*)
> (/usr/bin/sensord+0x1b48f)
>     #3 0xb6efaa51 in sensor::sensor_event_handler::handle(int, unsigned
> int) (/usr/bin/sensord+0x1da51)
>     #4 0xb5fa3bbb  (/lib/libsensord-shared.so+0xbbbb)
>     #5 0xb62d9a15 in g_main_context_dispatch
> (/lib/libglib-2.0.so.0+0x91a15)
>     #6 0xb62da2d9  (/lib/libglib-2.0.so.0+0x922d9)
>     #7 0xb62da9a9 in g_main_loop_run (/lib/libglib-2.0.so.0+0x929a9)
>     #8 0xb5fa4e1b in ipc::event_loop::run(int)
> (/lib/libsensord-shared.so+0xce1b)
>     #9 0xb6eec9a5 in main (/usr/bin/sensord+0xf9a5)
>     #10 0xb5cb663b in __libc_start_main (/lib/libc.so.6+0x1663b)
> 
> 
> At the first example we lost the full context, from where the
> control/data flow comes from.
> 
>>
>> The issue is that the frame pointer and frame chain always add a large
>> overhead even when you do not use any sanitizers. This is especially bad
>> for the proposed patch - you lose much of the benefit of using Thumb-2...
>>
> 
> The stack layout like this enables only with compile time flag
> (-mthumb-fp and works only together with -mthumb and
> -fno-omit-frame-pointer). It does not affect other codegen.
> 
> Thanks.
> 
> On 09/05/2018 03:11 PM, Wilco Dijkstra wrote:
>> Hi Denis,
>>
>>> We are working on applying Address/LeakSanitizer for the full Tizen OS
>>> distribution. It's about ~1000 packages, ASan/LSan runtime is installed
>>> to ld.so.preload. As we know ASan/LSan has interceptors for
>>> allocators/deallocators such as (malloc/realloc/calloc/free) and so on.
>>> On every allocation from user space program, ASan calls
>>> GET_STACK_TRACE_MALLOC;
>>> which unwinds the stack frame, and by default uses frame based stack
>>> unwinder. So, it requires to build with "-fno-omit-frame-pointer",
>>> switching it to default unwinder really hits the performance in our
>>> case.
>>
>> So this sounds like the first thing to do is reducing the size of the
>> stack traces.
>> The default is 30 which is far larger than useful. Using 1 for example
>> should
>> always be fast (since you can use __builtin_return_address(0)) and
>> still get
>> the function that called malloc. Also if the unwinder happens to be
>> too slow,
>> it should be optimized and caching added etc.
>>
>>>> Doing real unwinding is also far more accurate than frame pointer based
>>>> unwinding (the latter doesn't handle leaf functions correctly,
>>> entry/exit in
>>>> non-leaf functions and shrinkwrapped functions - and this breaks
>>> callgraph
>>>> profiling).
>>>
>>> I agree, but in our case, all interceptors for allocators are
>>> leaf functions, so the frame based stack unwinder works well for us.
>>
>> Yes a frame chain would work for this case. But it's not currently
>> supported.
>>
>>> By default we build packages with ("-marm" "-fno-omit-frame-pointer"),
>>> because need frame based stack unwinder for every allocation, as I said
>>> before. As we know GCC sets fp to lr on the stack with
>>> ("-fno-omit-frame-pointer" and "-marm") and I don't really know why.
>>> But the binary size is bigger than for thumb, so, we cannot use default
>>> thumb frame pointer and want to reduce binary size for the full
>>> sanitized image.
>>
>> The issue is that the frame pointer and frame chain always add a large
>> overhead even when you do not use any sanitizers. This is especially bad
>> for the proposed patch - you lose much of the benefit of using Thumb-2...
>>
>> Using normal unwinding means your code runs at full speed and still
>> can be
>> used by the sanitizer.
>>
>>> In other case clang works the same way, as I offered at the patch.
>>> It has the same issue, but it was fixed at the end of 2017
>>> https://bugs.llvm.org/show_bug.cgi?id=18505 (The topics starts from
>>> discussion about APCS, but it is not the main point.)
>>>
>>> Also, unresolved issue related to this
>>> https://github.com/google/sanitizers/issues/640
>>
>> Adding support for a frame chain would require an ABI change. It would
>> have to
>> work across GCC, LLVM, Arm, Thumb-1 and Thumb-2 - not a trivial amount of
>> effort.
>>
>> Wilco
>>
>>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Frame pointer for arm with THUMB2 mode
  2018-09-05 13:56       ` Denis Khalikov
  2018-09-05 14:11         ` Richard Earnshaw (lists)
@ 2018-09-05 15:32         ` Wilco Dijkstra
  2018-09-05 16:43           ` Denis Khalikov
  1 sibling, 1 reply; 11+ messages in thread
From: Wilco Dijkstra @ 2018-09-05 15:32 UTC (permalink / raw)
  To: Denis Khalikov, GCC Patches; +Cc: v.barinov, nd

Hi Denis,

>> Adding support for a frame chain would require an ABI change. It 
> would have to
> > work across GCC, LLVM, Arm, Thumb-1 and Thumb-2 - not a trivial amount of
> > effort.

> Clang already works that way.

No, that's incorrect like Richard pointed out. Only a single register can be used
as the frame chain.

> If we change the size of the traces to 2, it could be something like this:
...
> At the first example we lost the full context, from where the 
> control/data flow comes from.

If 2 is not sufficient, then try 3 or 4. It may also be feasible to only enable
deeper unwinding for particular libraries so you only pay an extra cost for 
leaks you are interested in.

> The stack layout like this enables only with compile time flag 
> (-mthumb-fp and works only together with -mthumb and
> -fno-omit-frame-pointer). It does not affect other codegen.

But any code built like that will *always* run slower even if you don't use
the sanitizer.

Wilco

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Frame pointer for arm with THUMB2 mode
  2018-09-05 15:32         ` Wilco Dijkstra
@ 2018-09-05 16:43           ` Denis Khalikov
  2018-09-05 16:52             ` Richard Earnshaw (lists)
  0 siblings, 1 reply; 11+ messages in thread
From: Denis Khalikov @ 2018-09-05 16:43 UTC (permalink / raw)
  To: Wilco Dijkstra, Richard Earnshaw (lists), GCC Patches; +Cc: v.barinov, nd

Thanks for the answers.

I understood that, this hack makes more mess in codegen for arm,
but can you please clarify what did you mean by

 >Only a single register can be used
 > as the frame chain.

As far as I understood, GCC for arm with THUMB2 mode uses r7 register as
frame pointer register by default (with -fno-omit-frame-pointer flag),
GCC for arm with ARM mode uses r11, so I didn't really propose
to change the frame pointer register.

Thanks.

On 09/05/2018 06:32 PM, Wilco Dijkstra wrote:
> Hi Denis,
>
>>> Adding support for a frame chain would require an ABI change. It
>> would have to
>>  > work across GCC, LLVM, Arm, Thumb-1 and Thumb-2 - not a trivial amount of
>>  > effort.
>
>> Clang already works that way.
>
> No, that's incorrect like Richard pointed out. Only a single register can be used
> as the frame chain.
>
>> If we change the size of the traces to 2, it could be something like this:
> ...
>> At the first example we lost the full context, from where the
>> control/data flow comes from.
>
> If 2 is not sufficient, then try 3 or 4. It may also be feasible to only enable
> deeper unwinding for particular libraries so you only pay an extra cost for
> leaks you are interested in.
>
>> The stack layout like this enables only with compile time flag
>> (-mthumb-fp and works only together with -mthumb and
>> -fno-omit-frame-pointer). It does not affect other codegen.
>
> But any code built like that will *always* run slower even if you don't use
> the sanitizer.
>
> Wilco
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Frame pointer for arm with THUMB2 mode
  2018-09-05 16:43           ` Denis Khalikov
@ 2018-09-05 16:52             ` Richard Earnshaw (lists)
  2018-09-05 17:01               ` Denis Khalikov
  0 siblings, 1 reply; 11+ messages in thread
From: Richard Earnshaw (lists) @ 2018-09-05 16:52 UTC (permalink / raw)
  To: Denis Khalikov, Wilco Dijkstra, GCC Patches; +Cc: v.barinov, nd

On 05/09/18 17:43, Denis Khalikov wrote:
> Thanks for the answers.
> 
> I understood that, this hack makes more mess in codegen for arm,
> but can you please clarify what did you mean by
> 
>>Only a single register can be used
>> as the frame chain.
> 
> As far as I understood, GCC for arm with THUMB2 mode uses r7 register as
> frame pointer register by default (with -fno-omit-frame-pointer flag),
> GCC for arm with ARM mode uses r11, so I didn't really propose
> to change the frame pointer register.
> 

On entry to a function the code has to save the existing frame register.
 It doesn't know (can't trivially know) whether the caller is code
compiled in Arm state or Thumb state.  So how can it save the caller's
frame register if they are not the same?

Furthermore, the 'other' frame register (ie r7 in Arm state, r11 in
Thumb) is available as a call-saved register, so can contain any random
value.  If you try to use that random value during a frame chain walk
your program will most like take an access violation.  It will certainly
give you a garbage frame chain.

R.

> Thanks.
> 
> On 09/05/2018 06:32 PM, Wilco Dijkstra wrote:
>> Hi Denis,
>>
>>>> Adding support for a frame chain would require an ABI change. It
>>> would have to
>>>  > work across GCC, LLVM, Arm, Thumb-1 and Thumb-2 - not a trivial
>>> amount of
>>>  > effort.
>>
>>> Clang already works that way.
>>
>> No, that's incorrect like Richard pointed out. Only a single register
>> can be used
>> as the frame chain.
>>
>>> If we change the size of the traces to 2, it could be something like
>>> this:
>> ...
>>> At the first example we lost the full context, from where the
>>> control/data flow comes from.
>>
>> If 2 is not sufficient, then try 3 or 4. It may also be feasible to
>> only enable
>> deeper unwinding for particular libraries so you only pay an extra
>> cost for
>> leaks you are interested in.
>>
>>> The stack layout like this enables only with compile time flag
>>> (-mthumb-fp and works only together with -mthumb and
>>> -fno-omit-frame-pointer). It does not affect other codegen.
>>
>> But any code built like that will *always* run slower even if you
>> don't use
>> the sanitizer.
>>
>> Wilco
>>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Frame pointer for arm with THUMB2 mode
  2018-09-05 16:52             ` Richard Earnshaw (lists)
@ 2018-09-05 17:01               ` Denis Khalikov
  0 siblings, 0 replies; 11+ messages in thread
From: Denis Khalikov @ 2018-09-05 17:01 UTC (permalink / raw)
  To: Richard Earnshaw (lists), Wilco Dijkstra, GCC Patches; +Cc: v.barinov, nd

Richard,
thanks for the answer, got it.

On 09/05/2018 07:51 PM, Richard Earnshaw (lists) wrote:
> On 05/09/18 17:43, Denis Khalikov wrote:
>> Thanks for the answers.
>>
>> I understood that, this hack makes more mess in codegen for arm,
>> but can you please clarify what did you mean by
>>
>>> Only a single register can be used
>>> as the frame chain.
>>
>> As far as I understood, GCC for arm with THUMB2 mode uses r7 register as
>> frame pointer register by default (with -fno-omit-frame-pointer flag),
>> GCC for arm with ARM mode uses r11, so I didn't really propose
>> to change the frame pointer register.
>>
>
> On entry to a function the code has to save the existing frame register.
>  It doesn't know (can't trivially know) whether the caller is code
> compiled in Arm state or Thumb state.  So how can it save the caller's
> frame register if they are not the same?
>
> Furthermore, the 'other' frame register (ie r7 in Arm state, r11 in
> Thumb) is available as a call-saved register, so can contain any random
> value.  If you try to use that random value during a frame chain walk
> your program will most like take an access violation.  It will certainly
> give you a garbage frame chain.
>
> R.
>
>> Thanks.
>>
>> On 09/05/2018 06:32 PM, Wilco Dijkstra wrote:
>>> Hi Denis,
>>>
>>>>> Adding support for a frame chain would require an ABI change. It
>>>> would have to
>>>>  > work across GCC, LLVM, Arm, Thumb-1 and Thumb-2 - not a trivial
>>>> amount of
>>>>  > effort.
>>>
>>>> Clang already works that way.
>>>
>>> No, that's incorrect like Richard pointed out. Only a single register
>>> can be used
>>> as the frame chain.
>>>
>>>> If we change the size of the traces to 2, it could be something like
>>>> this:
>>> ...
>>>> At the first example we lost the full context, from where the
>>>> control/data flow comes from.
>>>
>>> If 2 is not sufficient, then try 3 or 4. It may also be feasible to
>>> only enable
>>> deeper unwinding for particular libraries so you only pay an extra
>>> cost for
>>> leaks you are interested in.
>>>
>>>> The stack layout like this enables only with compile time flag
>>>> (-mthumb-fp and works only together with -mthumb and
>>>> -fno-omit-frame-pointer). It does not affect other codegen.
>>>
>>> But any code built like that will *always* run slower even if you
>>> don't use
>>> the sanitizer.
>>>
>>> Wilco
>>>
>
>
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH] Frame pointer for arm with THUMB2 mode
       [not found] <CGME20180827093530eucas1p2c931fb7d5bc0a73d2b657116b0288e1b@eucas1p2.samsung.com>
@ 2018-08-27  9:35 ` Denis Khalikov
  0 siblings, 0 replies; 11+ messages in thread
From: Denis Khalikov @ 2018-08-27  9:35 UTC (permalink / raw)
  To: GCC Patches; +Cc: Vyacheslav Barinov

[-- Attachment #1: Type: text/plain, Size: 4224 bytes --]

Hello everyone,
I've created the patch which sets the frame pointer to the predictable 
location in the stack frame for arm with THUMB2 and non-leaf functions.

Issue:
At this moment GCC emits frame layout for arm with THUMB2 mode,
"-fno-omit-frame-pointer" and AAPCS, and does not set frame pointer to
predictable location at the function frame for non-leaf functions. And
this is right thing to do regarding to answer on my
question https://gcc.gnu.org/ml/gcc-help/2018-07/msg00093.html.
But we still have an issue with performance, when we are using default
unwinder, which uses unwind tables. It could be up to 10 times faster to 
use frame based stack unwinder instead "default unwinder". As we know, 
the frame based stack unwinder works good for arm with ARM32 mode 
("-marm" compile-time option), because fp points to the lr on the stack, 
but the binary size could be about 30%-40% larger than with THUMB2 mode 
("-mthumb" compile time option). So, I think it
could be good to have an ability to use frame based stack unwinder for
thumb mode too. In this case AddressSanitizer, which uses frame based
stack unwinder, by default for every allocation/deallocation, could
benefit too.

Unresolved issue related to AddressSanitizer/LeakSanitzer:
https://github.com/google/sanitizers/issues/640

Problems:
By default arm with THUMB2 mode uses r7 register as frame pointer
register and it could be hard to set frame pointer to the predictable
location in the frame for some functions and some level of optimization.
For example interceptor for malloc in libasan built with
("-mthumb -fno-omit-frame-pointer -O2"):

000aa118 <__interceptor_malloc>:
    stmdb   sp!, {r4, r5, r6, r7, r8, r9, r10, r11, lr}
    subw    sp, sp, #1076
    add     r7, sp, #16
    ...
    function body
    ...
    addw    r7, r7, #1060
    mov     sp, r7
    ldmia.w sp!, {r4, r5, r6, r7, r8, r9, r10, r11, pc}

r7 points to sp + 16 after allocating 1076 bytes on the stack and
it's impossible to find the lr at the runtime.

In other way registers should be pushed in ascending order and we cannot 
change function prologue to this:

   push {r4, r5, r6, r8, r9, sl, r11, r7, lr }

And seems to me, the right solution is to make multiple push and 
multiple pop in the same order (clang works that way):

   push {r4, r5, r6, r7, lr}
   push {r8, r9, sl, r11 }
   add r7, sp, #32
   subw sp, sp #1076
   ...
   function body
   ...
   subw r7, #32
   mov sp, r7
   pop {r8, r9, sl, r11 }
   pop {r4, r5, r6, r7, pc }

So the r7 points to lr on the stack after function prologue and we can
find previous frame pointer by

subw r7, r7, #4
ldr r7, [r7]

at the runtime.

Verification steps:
1. I've build full Tizen OS distribution which consists of about ~1000
packages with that patch "-mthumb -fno-omit-frame-pointer -mthumb-fp 
-O2" and AddressSanitizer.
2. I've built the image from those packages and verified it on real
device with armv7 ISA, and it seems works. Also stack based frame
unwinder works.
3. It's hard to write correct test case, because gcc pushes {r8... r11}
registers only on some level of optimization and the source file should
have some amount of function. So, it's kinda hard to find the small test 
case. I have many preprocessed files with size about ~1kk bytes.
4. The frame layout for malloc interceptor in libasan which was built
with that patch and ("-fno-omit-frame-poiniter -O2 -mthumb -mthumb-fp")
looks like this:

000bb1f4 <__interceptor_malloc>:
   push    {r4, r5, r6, r7, lr}
   ldr     r6, [pc, #340]  ; (bb34c <__interceptor_malloc+0x158>)
   ldr     r3, [pc, #340]  ; (bb350 <__interceptor_malloc+0x15c>)
   add     r6, pc
   stmdb   sp!, {r8, r9, sl, fp}
   add     r7, sp, #32
   subw    sp, sp, #1076   ; 0x434
   ...
   function body
   ...
   subs    r7, #32
   mov     sp, r7
   ldmia.w sp!, {r8, r9, sl, fp}
   pop     {r4, r5, r6, r7, pc}

5. I've faced only one build fail related to this patch - glibc package,
but this is related to option name, some build machinery does not
recognize the "-mthumb-fp" option, and I'm not sure that glibc could be
build with frame layout like this.

It would be nice to get review for this patch by some experts. I could
miss a lot of corner cases.
Thanks.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: Frame-pointer-for-arm-with-THUMB2-mode.patch --]
[-- Type: text/x-patch; name="Frame-pointer-for-arm-with-THUMB2-mode.patch", Size: 13427 bytes --]

From: Denis Khalikov <d.khalikov@partner.samsung.com>
Date: Thu, 23 Aug 2018 16:33:58 +0300
Subject: [PATCH] Frame pointer for arm with THUMB2 mode.

Set frame pointer to the predictable location in the stack frame
for arm with THUMB2 mode.

2018-08-23  Denis Khalikov  <d.khalikov@partner.samsung.com>

  * config/arm/arm.c (arm_emit_multi_reg_pop_no_return): New function.
  (arm_compute_initial_elimination_offset): Add support for
  TARGET_THUMB_STACK_UNWIND.
  (arm_expand_prologue): Emit function prologue related to
  TARGET_THUMB_STACK_UNWIND.
  (thumb2_expand_return): Emit function epilogue related to
  TARGET_THUMB_STACK_UNWIND.
  (arm_expand_epilogue): Emit function epilogue related to
  TARGET_THUMB_STACK_UNWIND.
  * config/arm/arm.h (TARGET_THUMB_STACK_UNWIND): New define.
  (INITIAL_ELIMINATION_OFFSET): Add support for
  TARGET_THUMB_STACK_UNWIND.
  * config/arm/arm.opt: Add compile-time option THUMB_FP.
---
 gcc/config/arm/arm.c   | 206 +++++++++++++++++++++++++++++++++++++++++++------
 gcc/config/arm/arm.h   |  10 ++-
 gcc/config/arm/arm.opt |   4 +
 3 files changed, 191 insertions(+), 29 deletions(-)

diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index c081216d..344b715 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -20583,6 +20583,70 @@ arm_emit_multi_reg_pop (unsigned long saved_regs_mask)
 				 stack_pointer_rtx, stack_pointer_rtx);
 }
 
+/* This function hepls to handle situation when we need to make
+   multiple pop, but does not have lr or pc registers inside
+   saved_regs_mask and can not add cfa_notes.
+   This function is based on arm_emit_multi_reg_pop.  */
+static void
+arm_emit_multi_reg_pop_no_return (unsigned long saved_regs_mask)
+{
+  int num_regs = 0;
+  int i, j;
+  rtx par;
+  rtx dwarf = NULL_RTX;
+  rtx tmp, reg;
+  int emit_update = 1;
+
+  for (i = 0; i <= LAST_ARM_REGNUM; i++)
+    if (saved_regs_mask & (1 << i))
+      num_regs++;
+
+  gcc_assert (num_regs && num_regs <= 16);
+
+  /* If SP is in reglist, then we don't emit SP update insn.  */
+  emit_update = (saved_regs_mask & (1 << SP_REGNUM)) ? 0 : 1;
+
+  /* The parallel needs to hold num_regs SETs
+     and one SET for the stack update.  */
+  par = gen_rtx_PARALLEL (VOIDmode,
+			  rtvec_alloc (num_regs + emit_update));
+  if (emit_update)
+    {
+      /* Increment the stack pointer, based on there being
+	 num_regs 4-byte registers to restore.  */
+      tmp
+	= gen_rtx_SET (stack_pointer_rtx,
+		       plus_constant (Pmode, stack_pointer_rtx, 4 * num_regs));
+      RTX_FRAME_RELATED_P (tmp) = 1;
+      XVECEXP (par, 0, 0) = tmp;
+    }
+
+  for (j = 0, i = 0; j < num_regs; i++)
+    if (saved_regs_mask & (1 << i))
+      {
+	reg = gen_rtx_REG (SImode, i);
+	if ((num_regs == 1) && emit_update)
+	  {
+	    /* Emit single load with writeback.  */
+	    tmp = gen_frame_mem (SImode,
+				 gen_rtx_POST_INC (Pmode, stack_pointer_rtx));
+	    tmp = emit_insn (gen_rtx_SET (reg, tmp));
+	    REG_NOTES (tmp) = alloc_reg_note (REG_CFA_RESTORE, reg, dwarf);
+	    return;
+	  }
+
+	tmp = gen_rtx_SET (
+	  reg, gen_frame_mem (SImode,
+			      plus_constant (Pmode, stack_pointer_rtx, 4 * j)));
+	RTX_FRAME_RELATED_P (tmp) = 1;
+	XVECEXP (par, 0, j + emit_update) = tmp;
+	dwarf = alloc_reg_note (REG_CFA_RESTORE, reg, dwarf);
+	j++;
+      }
+
+  par = emit_insn (par);
+}
+
 /* Generate and emit an insn pattern that we will recognize as a pop_multi
    of NUM_REGS consecutive VFP regs, starting at FIRST_REG.
 
@@ -21233,6 +21297,10 @@ arm_compute_initial_elimination_offset (unsigned int from, unsigned int to)
       switch (to)
 	{
 	case THUMB_HARD_FRAME_POINTER_REGNUM:
+	  if (TARGET_THUMB_STACK_UNWIND)
+	    /* In this case the hard frame pointer points to the lr in the stack
+	       frame. So offset is frame - saved_args.  */
+	    return offsets->frame - offsets->saved_args;
 	  return 0;
 
 	case FRAME_POINTER_REGNUM:
@@ -21259,6 +21327,11 @@ arm_compute_initial_elimination_offset (unsigned int from, unsigned int to)
       switch (to)
 	{
 	case THUMB_HARD_FRAME_POINTER_REGNUM:
+	  if (TARGET_THUMB_STACK_UNWIND)
+	    /*  The hard frame_pointer points to the top entry in the
+		stack frame. The soft frame pointer points to the bottom
+		entry in the stack.  */
+	    return offsets->frame - offsets->soft_frame;
 	  return 0;
 
 	case ARM_HARD_FRAME_POINTER_REGNUM:
@@ -21832,7 +21905,7 @@ arm_expand_prologue (void)
   if ((func_type == ARM_FT_ISR || func_type == ARM_FT_FIQ)
       && (live_regs_mask & (1 << LR_REGNUM)) != 0
       && !(frame_pointer_needed && TARGET_APCS_FRAME)
-      && TARGET_ARM)
+      && (TARGET_ARM || TARGET_THUMB_STACK_UNWIND))
     {
       rtx lr = gen_rtx_REG (SImode, LR_REGNUM);
 
@@ -21879,20 +21952,57 @@ arm_expand_prologue (void)
           else
             {
 	      insn = emit_multi_reg_push (live_regs_mask, live_regs_mask);
-              RTX_FRAME_RELATED_P (insn) = 1;
-            }
-        }
+	       RTX_FRAME_RELATED_P (insn) = 1;
+	     }
+	 }
       else
-        {
-	  insn = emit_multi_reg_push (live_regs_mask, dwarf_regs_mask);
-          RTX_FRAME_RELATED_P (insn) = 1;
-        }
+	{
+	  /* In case we want to have frame pointer register to follow
+	     lr register on the stack, we need to modify stack layout.
+	     For example we have to save on the stack following registers
+	     {r4, ... r11, lr}, but with thumb mode frame pointer
+	     register is r7, so this layout is not correct. We will modify it
+	     from
+	     push {r4, ... r11, lr}
+	     to
+	     push {r4, ... r7, lr}
+	     push {r8, ..., r11}
+	     Registers should be pushed in ascending order.  */
+	  if (TARGET_THUMB_STACK_UNWIND && (live_regs_mask & (1 << 8))
+	      /* Check for non-leaf fucntion.  */
+	      && (live_regs_mask & (1 << LR_REGNUM)))
+	    {
+	      unsigned long first_regs_mask, dwarf_first_regs_mask,
+		second_regs_mask, dwarf_second_regs_mask;
+
+	      /* Save info about first 8 registers from r0 to r7 and
+		 add lr for this mask.  */
+	      first_regs_mask = dwarf_first_regs_mask
+		= (live_regs_mask & 0xff) | (1 << LR_REGNUM);
+	      /* Add all left registers from r8 and delete lr.  */
+	      second_regs_mask = dwarf_second_regs_mask
+		= (live_regs_mask & ~0xff) & ~(1 << LR_REGNUM);
+	      insn
+		= emit_multi_reg_push (first_regs_mask, dwarf_first_regs_mask);
+	      RTX_FRAME_RELATED_P (insn) = 1;
+	      insn = emit_multi_reg_push (second_regs_mask,
+					  dwarf_second_regs_mask);
+	      RTX_FRAME_RELATED_P (insn) = 1;
+	    }
+	  else
+	    {
+	      insn = emit_multi_reg_push (live_regs_mask, dwarf_regs_mask);
+	      RTX_FRAME_RELATED_P (insn) = 1;
+	    }
+	}
     }
 
   if (! IS_VOLATILE (func_type))
     saved_regs += arm_save_coproc_regs ();
 
-  if (frame_pointer_needed && TARGET_ARM)
+  /* Set frame pointer register (r7) to lr on the stack
+     for TARGET_THUMB_STACK_UNWIND.  */
+  if (frame_pointer_needed && (TARGET_ARM || TARGET_THUMB_STACK_UNWIND))
     {
       /* Create the new frame pointer.  */
       if (TARGET_APCS_FRAME)
@@ -21983,7 +22093,7 @@ arm_expand_prologue (void)
     }
 
 
-  if (frame_pointer_needed && TARGET_THUMB2)
+  if (frame_pointer_needed && (TARGET_THUMB2 && !TARGET_THUMB_FP))
     thumb_set_frame_pointer (offsets);
 
   if (flag_pic && arm_pic_register != INVALID_REGNUM)
@@ -25441,11 +25551,26 @@ thumb2_expand_return (bool simple_return)
           emit_jump_insn (par);
         }
       else
-        {
-          saved_regs_mask &= ~ (1 << LR_REGNUM);
-          saved_regs_mask |=   (1 << PC_REGNUM);
-          arm_emit_multi_reg_pop (saved_regs_mask);
-        }
+	{
+	  /* We have made multiple push for thumb, when frame pointer
+	     is needed. So, make multiple pop in the same order.  */
+	  if (TARGET_THUMB_STACK_UNWIND && (saved_regs_mask & (1 << 8)))
+	    {
+	      unsigned long first_regs_mask = saved_regs_mask & 0xff;
+	      unsigned long second_regs_mask = saved_regs_mask & ~0xff;
+
+	      first_regs_mask |= (1 << PC_REGNUM);
+	      second_regs_mask &= ~((1 << PC_REGNUM) | (1 << LR_REGNUM));
+	      arm_emit_multi_reg_pop_no_return (second_regs_mask);
+	      arm_emit_multi_reg_pop (first_regs_mask);
+	    }
+	  else
+	    {
+	      saved_regs_mask &= ~(1 << LR_REGNUM);
+	      saved_regs_mask |= (1 << PC_REGNUM);
+	      arm_emit_multi_reg_pop (saved_regs_mask);
+	    }
+	}
     }
   else
     {
@@ -25724,7 +25849,7 @@ arm_expand_epilogue (bool really_return)
       rtx_insn *insn;
       /* Restore stack pointer if necessary.  */
       if (TARGET_ARM)
-        {
+	{
           /* In ARM mode, frame pointer points to first saved register.
              Restore stack pointer to last saved register.  */
           amount = offsets->frame - offsets->saved_regs;
@@ -25745,9 +25870,15 @@ arm_expand_epilogue (bool really_return)
         }
       else
         {
-          /* In Thumb-2 mode, the frame pointer points to the last saved
-             register.  */
-	  amount = offsets->locals_base - offsets->saved_regs;
+	  /* For TARGET_THUMB_STACK_UNWIND the frame pointer points to the
+	     bottom of the stack.  */
+	  if (TARGET_THUMB_STACK_UNWIND)
+	    amount = offsets->frame - offsets->saved_regs;
+	  else
+	    /* In Thumb-2 mode, the frame pointer points to the last saved
+	        register.  */
+	    amount = offsets->locals_base - offsets->saved_regs;
+
 	  if (amount)
 	    {
 	      insn = emit_insn (gen_addsi3 (hard_frame_pointer_rtx,
@@ -25896,10 +26027,10 @@ arm_expand_epilogue (bool really_return)
         }
       else
         {
-          if (TARGET_LDRD
+	   if (TARGET_LDRD
 	      && current_tune->prefer_ldrd_strd
-              && !optimize_function_for_size_p (cfun))
-            {
+	       && !optimize_function_for_size_p (cfun))
+	     {
               if (TARGET_THUMB2)
                 thumb2_emit_ldrd_pop (saved_regs_mask);
               else if (TARGET_ARM && !IS_INTERRUPT (func_type))
@@ -25907,9 +26038,34 @@ arm_expand_epilogue (bool really_return)
               else
                 arm_emit_multi_reg_pop (saved_regs_mask);
             }
-          else
-            arm_emit_multi_reg_pop (saved_regs_mask);
-        }
+	  else
+	    {
+	      /* Make multiple pop in the same order as we done multiple push.
+	       */
+	      if (TARGET_THUMB_STACK_UNWIND && (saved_regs_mask & (1 << 8))
+		  && ((saved_regs_mask & (1 << LR_REGNUM))
+		      || (saved_regs_mask & (1 << PC_REGNUM))))
+		{
+		  unsigned long first_regs_mask = saved_regs_mask & 0xff;
+		  unsigned long second_regs_mask = saved_regs_mask & ~0xff;
+
+		  if (saved_regs_mask & (1 << PC_REGNUM))
+		    {
+		      first_regs_mask |= (1 << PC_REGNUM);
+		      second_regs_mask &= ~(1 << PC_REGNUM);
+		    }
+		  else
+		    {
+		      first_regs_mask |= (1 << LR_REGNUM);
+		      second_regs_mask &= ~(1 << LR_REGNUM);
+		    }
+		  arm_emit_multi_reg_pop_no_return (second_regs_mask);
+		  arm_emit_multi_reg_pop (first_regs_mask);
+		}
+	      else
+		arm_emit_multi_reg_pop (saved_regs_mask);
+	    }
+	}
 
       if (return_in_pc)
         return;
diff --git a/gcc/config/arm/arm.h b/gcc/config/arm/arm.h
index 34894c0..3e87064 100644
--- a/gcc/config/arm/arm.h
+++ b/gcc/config/arm/arm.h
@@ -117,6 +117,8 @@ extern tree arm_fp16_type_node;
 #define TARGET_THUMB1_P(flags) (TARGET_THUMB_P (flags) && !arm_arch_thumb2)
 #define TARGET_THUMB2_P(flags) (TARGET_THUMB_P (flags) && arm_arch_thumb2)
 #define TARGET_32BIT_P(flags)  (TARGET_ARM_P (flags) || TARGET_THUMB2_P (flags))
+#define TARGET_THUMB_STACK_UNWIND                                              \
+  (TARGET_THUMB2 && TARGET_THUMB_FP && frame_pointer_needed && arm_arch7)
 
 /* Run-time Target Specification.  */
 /* Use hardware floating point instructions. */
@@ -1564,10 +1566,10 @@ typedef struct
 
 /* Define the offset between two registers, one to be eliminated, and the
    other its replacement, at the start of a routine.  */
-#define INITIAL_ELIMINATION_OFFSET(FROM, TO, OFFSET)			\
-  if (TARGET_ARM)							\
-    (OFFSET) = arm_compute_initial_elimination_offset (FROM, TO);	\
-  else									\
+#define INITIAL_ELIMINATION_OFFSET(FROM, TO, OFFSET)                           \
+  if (TARGET_ARM || TARGET_THUMB_STACK_UNWIND)                                 \
+    (OFFSET) = arm_compute_initial_elimination_offset (FROM, TO);              \
+  else                                                                         \
     (OFFSET) = thumb_compute_initial_elimination_offset (FROM, TO)
 
 /* Special case handling of the location of arguments passed on the stack.  */
diff --git a/gcc/config/arm/arm.opt b/gcc/config/arm/arm.opt
index a1286a4..49e217f 100644
--- a/gcc/config/arm/arm.opt
+++ b/gcc/config/arm/arm.opt
@@ -198,6 +198,10 @@ mthumb
 Target Report RejectNegative Negative(marm) Mask(THUMB) Save
 Generate code for Thumb state.
 
+mthumb-fp
+Target Report Mask(THUMB_FP)
+Generate frame pointer which points to the lr register on the stack.
+
 mthumb-interwork
 Target Report Mask(INTERWORK)
 Support calls between Thumb and ARM instruction sets.
-- 
1.9.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2018-09-05 17:01 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CGME20180828004814epcas5p3fbd3af8edd93da263b6780ab3be11fe4@epcas5p3.samsung.com>
2018-08-28  0:48 ` [PATCH] Frame pointer for arm with THUMB2 mode Wilco Dijkstra
2018-08-28 10:44   ` Denis Khalikov
2018-09-05 12:11     ` Wilco Dijkstra
2018-09-05 13:53       ` Denis Khalikov
2018-09-05 13:56       ` Denis Khalikov
2018-09-05 14:11         ` Richard Earnshaw (lists)
2018-09-05 15:32         ` Wilco Dijkstra
2018-09-05 16:43           ` Denis Khalikov
2018-09-05 16:52             ` Richard Earnshaw (lists)
2018-09-05 17:01               ` Denis Khalikov
     [not found] <CGME20180827093530eucas1p2c931fb7d5bc0a73d2b657116b0288e1b@eucas1p2.samsung.com>
2018-08-27  9:35 ` Denis Khalikov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).