public inbox for gcc-bugs@sourceware.org
* [Bug regression/44281]  New: Global Register variable pessimisation and regression
@ 2010-05-26  5:12 adam at consulting dot net dot nz
  2010-06-07  5:36 ` [Bug regression/44281] " adam at consulting dot net dot nz
                   ` (7 more replies)
  0 siblings, 8 replies; 14+ messages in thread
From: adam at consulting dot net dot nz @ 2010-05-26  5:12 UTC (permalink / raw)
  To: gcc-bugs

I am aware that the developers have marked some global register variable
pessimisations as WONTFIX:
<http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42596>

GCC is copying registers for no good reason whatsoever. Below is a very simple
example where gcc 3.3.6 does a better job of optimising the code. Unnecessary
copying of registers may also occur with local register variables.

#include <stdint.h>

register uint64_t global_flag_stack __asm__("rbx");

void push_flag_into_global_reg_var(uint64_t a, uint64_t b) {
  uint64_t flag = (a==b);
  global_flag_stack <<= 8;
  global_flag_stack  |= flag;
}

uint64_t push_flag_into_local_var(uint64_t a, uint64_t b,
                                  uint64_t local_flag_stack) {
  uint64_t flag = (a==b);
  local_flag_stack <<= 8;
  return local_flag_stack | flag;
}

int main() {
}


gcc-3.3 (GCC) 3.3.6 (Debian 1:3.3.6-15):
$ gcc-3.3 -Os flags.c && objdump -d -m i386:x86-64:intel a.out|less
...
0000000000400478 <push_flag_into_global_reg_var>:
  400478:       31 c0                   xor    eax,eax
  40047a:       48 39 f7                cmp    rdi,rsi
  40047d:       0f 94 c0                sete   al
  400480:       48 c1 e3 08             shl    rbx,0x8
  400484:       48 09 c3                or     rbx,rax
  400487:       c3                      ret    

0000000000400488 <push_flag_into_local_var>:
  400488:       31 c0                   xor    eax,eax
  40048a:       48 39 f7                cmp    rdi,rsi
  40048d:       0f 94 c0                sete   al
  400490:       48 c1 e2 08             shl    rdx,0x8
  400494:       48 09 d0                or     rax,rdx
  400497:       c3                      ret  
...

gcc-4.1 (GCC) 4.1.3 20080704 (prerelease) (Debian 4.1.2-29):
$ gcc-4.1 -Os flags.c && objdump -d -m i386:x86-64:intel a.out|less
...
0000000000400448 <push_flag_into_global_reg_var>:
  400448:       48 89 da                mov    rdx,rbx
  40044b:       31 c0                   xor    eax,eax
  40044d:       48 c1 e2 08             shl    rdx,0x8
  400451:       48 39 f7                cmp    rdi,rsi
  400454:       0f 94 c0                sete   al
  400457:       48 89 d3                mov    rbx,rdx
  40045a:       48 09 c3                or     rbx,rax
  40045d:       c3                      ret    

000000000040045e <push_flag_into_local_var>:
  40045e:       48 c1 e2 08             shl    rdx,0x8
  400462:       31 c0                   xor    eax,eax
  400464:       48 39 f7                cmp    rdi,rsi
  400467:       0f 94 c0                sete   al
  40046a:       48 09 d0                or     rax,rdx
  40046d:       c3                      ret 
...

gcc-4.5 (Debian 4.5.0-1) 4.5.0:
$ gcc-4.5 -Os flags.c && objdump -d -m i386:x86-64:intel a.out|less
...
0000000000400494 <push_flag_into_global_reg_var>:
  400494:       31 d2                   xor    edx,edx
  400496:       48 39 f7                cmp    rdi,rsi
  400499:       48 89 d8                mov    rax,rbx
  40049c:       0f 94 c2                sete   dl
  40049f:       48 c1 e0 08             shl    rax,0x8
  4004a3:       48 89 d3                mov    rbx,rdx
  4004a6:       48 09 c3                or     rbx,rax
  4004a9:       c3                      ret    

00000000004004aa <push_flag_into_local_var>:
  4004aa:       48 89 d0                mov    rax,rdx
  4004ad:       31 d2                   xor    edx,edx
  4004af:       48 c1 e0 08             shl    rax,0x8
  4004b3:       48 39 f7                cmp    rdi,rsi
  4004b6:       0f 94 c2                sete   dl
  4004b9:       48 09 d0                or     rax,rdx
  4004bc:       c3                      ret   
...

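The object code that current GCC generates is embarrassing compared with GCC
3.3.6: push_flag_into_global_reg_var grows from 16 bytes to 22 bytes because of
two extra register-to-register mov instructions. Is it also necessary to
increase the code footprint of push_flag_into_local_var (16 bytes with gcc
3.3.6 and 4.1.3, 19 bytes with gcc 4.5.0) when optimising for size (-Os)?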


-- 
           Summary: Global Register variable pessimisation and regression
           Product: gcc
           Version: 4.5.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: regression
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: adam at consulting dot net dot nz


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44281


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Bug regression/44281] Global Register variable pessimisation and regression
  2010-05-26  5:12 [Bug regression/44281] New: Global Register variable pessimisation and regression adam at consulting dot net dot nz
@ 2010-06-07  5:36 ` adam at consulting dot net dot nz
  2010-07-20 22:53 ` [Bug rtl-optimization/44281] [4.3/4.4/4.5/4.6 Regression] Global Register variable pessimisation steven at gcc dot gnu dot org
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: adam at consulting dot net dot nz @ 2010-06-07  5:36 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #1 from adam at consulting dot net dot nz  2010-06-07 05:35 -------
Example-specific workaround discovered for global register variable
pessimisation with recent versions of GCC:

void push_flag_into_global_reg_var(uint64_t a, uint64_t b) {
  uint64_t flag = (a==b);
  global_flag_stack <<= 8;
  __asm__ __volatile__("" : : : "memory"); /* ??? */
  global_flag_stack  |= flag;
}

Every version of GCC tested (including gcc (Debian 20100530-1) 4.6.0 20100530
(experimental) [trunk revision 160047]) produces similarly compact code:

0000000000400494 <push_flag_into_global_reg_var>:
  400494:       48 c1 e3 08             shl    rbx,0x8
  400498:       31 c0                   xor    eax,eax
  40049a:       48 39 f7                cmp    rdi,rsi
  40049d:       0f 94 c0                sete   al
  4004a0:       48 09 c3                or     rbx,rax
  4004a3:       c3                      ret

Telling the compiler that memory may have changed between global register
variable assignments seems to have coaxed the compiler into treating the global
register variable as volatile.
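
A minimal sketch of the same barrier, assuming an ordinary file-scope variable
(flag_stack) in place of the global register variable so that it can be built
without reserving rbx. The macro name COMPILER_BARRIER and the helper
current_flags are hypothetical. The empty asm statement emits no instructions;
the "memory" clobber only tells GCC that memory may have changed at that point.
Whether it affects register copying for an ordinary variable is not something
this report establishes.

```c
#include <stdint.h>

/* Hypothetical helper name: an empty asm statement with a "memory" clobber.
   It generates no machine code; it only constrains the compiler. */
#define COMPILER_BARRIER() __asm__ __volatile__("" : : : "memory")

/* Ordinary variable standing in for the global register variable. */
static uint64_t flag_stack;

void push_flag(uint64_t a, uint64_t b) {
  uint64_t flag = (a == b);
  flag_stack <<= 8;
  COMPILER_BARRIER();   /* same placement as in the workaround above */
  flag_stack |= flag;
}

/* Hypothetical accessor so the result can be inspected. */
uint64_t current_flags(void) {
  return flag_stack;
}
```

Each call shifts the previous flags up by one byte and stores the new
comparison result in the low byte.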


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44281



* [Bug rtl-optimization/44281] [4.3/4.4/4.5/4.6 Regression] Global Register variable pessimisation
  2010-05-26  5:12 [Bug regression/44281] New: Global Register variable pessimisation and regression adam at consulting dot net dot nz
  2010-06-07  5:36 ` [Bug regression/44281] " adam at consulting dot net dot nz
@ 2010-07-20 22:53 ` steven at gcc dot gnu dot org
  2010-07-20 22:55 ` pinskia at gcc dot gnu dot org
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: steven at gcc dot gnu dot org @ 2010-07-20 22:53 UTC (permalink / raw)
  To: gcc-bugs



-- 

steven at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
          Component|regression                  |rtl-optimization
     Ever Confirmed|0                           |1
   Last reconfirmed|0000-00-00 00:00:00         |2010-07-20 22:52:46
               date|                            |
            Summary|Global Register variable    |[4.3/4.4/4.5/4.6 Regression]
                   |pessimisation and regression|Global Register variable
                   |                            |pessimisation


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44281



* [Bug rtl-optimization/44281] [4.3/4.4/4.5/4.6 Regression] Global Register variable pessimisation
  2010-05-26  5:12 [Bug regression/44281] New: Global Register variable pessimisation and regression adam at consulting dot net dot nz
  2010-06-07  5:36 ` [Bug regression/44281] " adam at consulting dot net dot nz
  2010-07-20 22:53 ` [Bug rtl-optimization/44281] [4.3/4.4/4.5/4.6 Regression] Global Register variable pessimisation steven at gcc dot gnu dot org
@ 2010-07-20 22:55 ` pinskia at gcc dot gnu dot org
  2010-07-22  8:48 ` rguenth at gcc dot gnu dot org
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: pinskia at gcc dot gnu dot org @ 2010-07-20 22:55 UTC (permalink / raw)
  To: gcc-bugs



-- 

pinskia at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
   Target Milestone|---                         |4.3.6


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44281



* [Bug rtl-optimization/44281] [4.3/4.4/4.5/4.6 Regression] Global Register variable pessimisation
  2010-05-26  5:12 [Bug regression/44281] New: Global Register variable pessimisation and regression adam at consulting dot net dot nz
                   ` (2 preceding siblings ...)
  2010-07-20 22:55 ` pinskia at gcc dot gnu dot org
@ 2010-07-22  8:48 ` rguenth at gcc dot gnu dot org
  2010-09-11 11:16 ` adam at consulting dot net dot nz
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2010-07-22  8:48 UTC (permalink / raw)
  To: gcc-bugs



-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44281



* [Bug rtl-optimization/44281] [4.3/4.4/4.5/4.6 Regression] Global Register variable pessimisation
  2010-05-26  5:12 [Bug regression/44281] New: Global Register variable pessimisation and regression adam at consulting dot net dot nz
                   ` (3 preceding siblings ...)
  2010-07-22  8:48 ` rguenth at gcc dot gnu dot org
@ 2010-09-11 11:16 ` adam at consulting dot net dot nz
  2010-09-11 13:50 ` hjl dot tools at gmail dot com
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: adam at consulting dot net dot nz @ 2010-09-11 11:16 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #2 from adam at consulting dot net dot nz  2010-09-11 11:15 -------
GCC snapshot has regressed compared to gcc-4.5:

#include <assert.h>
#include <stdint.h>

#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

register uint32_t *Iptr __asm__("rbp");

typedef void (*inst_t)(uint64_t types, uint64_t a, uint64_t b);

__attribute__ ((noinline)) void dec_helper(uint64_t types, uint64_t a, uint64_t
b) {
  assert("FIXME"=="");
}

void dec(uint64_t types, uint64_t a, uint64_t b) {
  if (LIKELY((types & 0xFF) == 1)) {
    uint32_t next = Iptr[1];
    --a;
    ++Iptr;
    ((inst_t) (uint64_t) next)(types, a, b);
  } else dec_helper(types, a, b);
}

int main() {
  return 0;
}

$ gcc-4.5 -O3 -std=gnu99 plain-32bit-direct-dispatch.c && objdump -d -m i386:x86-64:intel a.out|less

0000000000400520 <dec>:
  400520:       40 80 ff 01             cmp    dil,0x1
  400524:       75 0d                   jne    400533 <dec+0x13>
  400526:       8b 45 04                mov    eax,DWORD PTR [rbp+0x4]
  400529:       48 83 ee 01             sub    rsi,0x1
  40052d:       48 83 c5 04             add    rbp,0x4
  400531:       ff e0                   jmp    rax
  400533:       e9 c8 ff ff ff          jmp    400500 <dec_helper>
  400538:       eb 06                   jmp    400540 <main>
  40053a:       90                      nop
  40053b:       90                      nop
  40053c:       90                      nop
  40053d:       90                      nop
  40053e:       90                      nop
  40053f:       90                      nop

The above code generation is fine. Here is what GCC snapshot {gcc (Debian
20100828-1) 4.6.0 20100828 (experimental) [trunk revision 163616]} generates:

$ gcc-snapshot.sh -O3 -std=gnu99 plain-32bit-direct-dispatch.c && objdump -d -m i386:x86-64:intel a.out|less

0000000000400500 <dec>:
  400500:       48 83 ec 08             sub    rsp,0x8
  400504:       40 80 ff 01             cmp    dil,0x1
  400508:       75 14                   jne    40051e <dec+0x1e>
  40050a:       48 89 e8                mov    rax,rbp
  40050d:       48 83 ee 01             sub    rsi,0x1
  400511:       48 8d 6d 04             lea    rbp,[rbp+0x4]
  400515:       8b 40 04                mov    eax,DWORD PTR [rax+0x4]
  400518:       48 83 c4 08             add    rsp,0x8
  40051c:       ff e0                   jmp    rax
  40051e:       e8 bd ff ff ff          call   4004e0 <dec_helper>
  400523:       eb 0b                   jmp    400530 <main>
  400525:       90                      nop
  400526:       90                      nop
  400527:       90                      nop
  400528:       90                      nop
  400529:       90                      nop
  40052a:       90                      nop
  40052b:       90                      nop
  40052c:       90                      nop
  40052d:       90                      nop
  40052e:       90                      nop
  40052f:       90                      nop

Function size, rounded up to the 16-byte alignment boundary, has jumped from
32 bytes to 48 bytes. The tail call to dec_helper has been missed, leading to
insertion of stack alignment instructions. Global register variable RBP is
copied into RAX for no reason whatsoever, which defeats loading the next
instruction before recomputing the instruction pointer.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44281



* [Bug rtl-optimization/44281] [4.3/4.4/4.5/4.6 Regression] Global Register variable pessimisation
  2010-05-26  5:12 [Bug regression/44281] New: Global Register variable pessimisation and regression adam at consulting dot net dot nz
                   ` (4 preceding siblings ...)
  2010-09-11 11:16 ` adam at consulting dot net dot nz
@ 2010-09-11 13:50 ` hjl dot tools at gmail dot com
  2010-09-12 14:12 ` pinskia at gcc dot gnu dot org
  2010-09-13  0:24 ` adam at consulting dot net dot nz
  7 siblings, 0 replies; 14+ messages in thread
From: hjl dot tools at gmail dot com @ 2010-09-11 13:50 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #3 from hjl dot tools at gmail dot com  2010-09-11 13:49 -------
(In reply to comment #2)
> GCC snapshot has regressed compared to gcc-4.5:
> 
> #include <assert.h>
> #include <stdint.h>
> 
> #define LIKELY(x)   __builtin_expect(!!(x), 1)
> #define UNLIKELY(x) __builtin_expect(!!(x), 0)
> 
> register uint32_t *Iptr __asm__("rbp");
> 
> typedef void (*inst_t)(uint64_t types, uint64_t a, uint64_t b);
> 
> __attribute__ ((noinline)) void dec_helper(uint64_t types, uint64_t a, uint64_t
> b) {
>   assert("FIXME"=="");
> }
> 
> void dec(uint64_t types, uint64_t a, uint64_t b) {
>   if (LIKELY((types & 0xFF) == 1)) {
>     uint32_t next = Iptr[1];
>     --a;
>     ++Iptr;
>     ((inst_t) (uint64_t) next)(types, a, b);
>   } else dec_helper(types, a, b);
> }

This is caused by revision 160124:

http://gcc.gnu.org/ml/gcc-cvs/2010-06/msg00036.html


-- 

hjl dot tools at gmail dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hubicka at gcc dot gnu dot
                   |                            |org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44281



* [Bug rtl-optimization/44281] [4.3/4.4/4.5/4.6 Regression] Global Register variable pessimisation
  2010-05-26  5:12 [Bug regression/44281] New: Global Register variable pessimisation and regression adam at consulting dot net dot nz
                   ` (5 preceding siblings ...)
  2010-09-11 13:50 ` hjl dot tools at gmail dot com
@ 2010-09-12 14:12 ` pinskia at gcc dot gnu dot org
  2010-09-13  0:24 ` adam at consulting dot net dot nz
  7 siblings, 0 replies; 14+ messages in thread
From: pinskia at gcc dot gnu dot org @ 2010-09-12 14:12 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #4 from pinskia at gcc dot gnu dot org  2010-09-12 14:11 -------
>This is caused by revision 160124:

Not really, it is a noreturn function so the behavior is correct for our policy
of allowing a more correct backtrace for noreturn functions.  The testcase is an
incorrect one based on size and not really that interesting anymore with
respect to global register variables.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44281



* [Bug rtl-optimization/44281] [4.3/4.4/4.5/4.6 Regression] Global Register variable pessimisation
  2010-05-26  5:12 [Bug regression/44281] New: Global Register variable pessimisation and regression adam at consulting dot net dot nz
                   ` (6 preceding siblings ...)
  2010-09-12 14:12 ` pinskia at gcc dot gnu dot org
@ 2010-09-13  0:24 ` adam at consulting dot net dot nz
  7 siblings, 0 replies; 14+ messages in thread
From: adam at consulting dot net dot nz @ 2010-09-13  0:24 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #5 from adam at consulting dot net dot nz  2010-09-13 00:24 -------
Andrew Pinski wrote:

   >This is caused by revision 160124:

   Not really, it is a noreturn function so the behavior is correct for our
   policy of allowing a more correct backtrace for noreturn functions.

I'm not sure what you're trying to say here Andrew. Are you trying to justify
-O3 generating slower code to simplify debugging?

   The testcase is an incorrect one based on size

If you mean zero-extension of 32-bit function pointers, this is the x86-64
small code model.

If you mean that you don't care that the testcase increased in size without
further benchmarking then empirical analysis is actually unnecessary. The
generated assembly is clearly worse.

  and not really that interesting anymore with respect to global register
  variables.

It's another example of global register variables being copied for no good
reason whatsoever. RAX is free and the obvious translation of uint32_t next =
Iptr[1]; to x86-64 assembly is mov eax,DWORD PTR [rbp+0x4]; (Intel syntax,
where RBP is the global register variable). Generating mov rax,rbp; mov
eax,DWORD PTR [rax+0x4]; is just dumb.

I've been experimenting with optimal forms of virtual machine dispatch for a
long time and what you have is a fragment of a very fast direct threaded
interpreter. So fast in fact that a type-safe countdown will execute at 5
cycles per iteration on Intel Core 2:

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

register uint32_t *Iptr __asm__("rbp");

typedef void (*inst_t)(uint64_t types, uint64_t a, uint64_t b);

#define FUNC(x) ((inst_t) (uint64_t) x)
#define INST(x) ((uint32_t) (uint64_t) x)

__attribute__ ((noinline)) void dec_helper(uint64_t types, uint64_t a, uint64_t
b) {
  assert("FIXME"=="");
}

void dec(uint64_t types, uint64_t a, uint64_t b) {
  if (LIKELY((types & 0xFF) == 1)) {
    uint32_t next = Iptr[1];
    --a;
    ++Iptr;
    FUNC(next)(types, a, b);
  } else dec_helper(types, a, b);
}


__attribute__ ((noinline)) void if_not_equal_jump_back_1_helper(uint64_t types,
uint64_t a, uint64_t b) {
  assert("FIXME"=="");
}

void if_not_equal_jump_back_1(uint64_t types, uint64_t a, uint64_t b) {
  if (LIKELY((types & 0xFFFF) == 0x0101)) {
    if (LIKELY(a != b)) {
      uint32_t next = Iptr[-1];
      --Iptr;
      FUNC(next)(types, a, b);
    } else {
      uint32_t next = Iptr[1];
      ++Iptr;
      FUNC(next)(types, a, b);
    }
  } else if_not_equal_jump_back_1_helper(types, a, b);
}

void unconditional_exit(uint64_t types, uint64_t a, uint64_t b) {
  exit(0);
}

__attribute__ ((noinline, noclone)) void execute(uint32_t *code, uint64_t
types, uint64_t a, uint64_t b) {
  Iptr = code;
  FUNC(code[0])(types, a, b);
}

int main() {
  uint32_t code[]={INST(dec),
                   INST(if_not_equal_jump_back_1),
                   INST(unconditional_exit)};
  execute(code + 1, 0x0101, 3000000000, 0);
  return 0;
}

$ gcc-4.5 -O3 -std=gnu99 plain-32bit-direct-dispatch-countdown.c && time ./a.out

real    0m5.007s
user    0m4.996s
sys     0m0.004s

CPU is 3GHz. Code execution starts at the second instruction
(if_not_equal_jump_back_1). a==3000000000 of type==1 is not equal to b==0 of
type==1 (the two type comparisons are performed in parallel in one cycle
without masking since one can compare the low 8-, 16- or 32-bits of a 64-bit
register without masking and the two types are packed into the low 16-bits of
the types register).
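
A short sketch of the packed type check described above. The layout is an
assumption inferred from the masks in dec and if_not_equal_jump_back_1: the
type tag of a sits in bits 0-7 and the type tag of b in bits 8-15. The helper
names are hypothetical.

```c
#include <stdint.h>

/* Assumed layout: type tag of a in bits 0-7, type tag of b in bits 8-15. */
static inline uint64_t pack_types(uint8_t type_a, uint8_t type_b) {
  return (uint64_t)type_a | ((uint64_t)type_b << 8);
}

/* A single 16-bit comparison checks both tags at once. */
static inline int both_type_1(uint64_t types) {
  return (types & 0xFFFF) == 0x0101;
}

/* An 8-bit comparison checks only the tag of a. */
static inline int a_is_type_1(uint64_t types) {
  return (types & 0xFF) == 1;
}
```

The masks are written out in C, but on x86-64 the compiler is free to compare
the low 8 or 16 bits of the register directly, as the gcc-4.5 listing for dec
does with cmp dil,0x1.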

As a!=b the code jumps back to the dec instruction, which performs another type
check that a is of type==1 before decrementing a and jumping to
if_not_equal_jump_back_1. This continues until a==0 and program exit occurs.
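
The control flow just described can be modelled in portable C. This is only a
sketch under several assumptions that differ from the benchmark: full-width
function pointers replace the 32-bit truncated ones (so it does not depend on
the small code model), an ordinary static pointer ip replaces the global
register variable, a hypothetical halt instruction replaces exit(0), and
counters are added so the run can be inspected. It says nothing about
performance.

```c
#include <stdint.h>

typedef void (*inst_t)(uint64_t types, uint64_t a, uint64_t b);

/* Ordinary static pointer standing in for the global register variable Iptr. */
static const inst_t *ip;

/* Hypothetical bookkeeping so the run can be inspected afterwards. */
static uint64_t dec_executed;
static uint64_t final_a;
static int type_error;

static void dec(uint64_t types, uint64_t a, uint64_t b) {
  if ((types & 0xFF) == 1) {
    inst_t next = ip[1];   /* load the next instruction first ... */
    --a;
    ++ip;                  /* ... then advance the instruction pointer */
    ++dec_executed;
    next(types, a, b);
  } else {
    type_error = 1;
  }
}

static void if_not_equal_jump_back_1(uint64_t types, uint64_t a, uint64_t b) {
  if ((types & 0xFFFF) == 0x0101) {
    if (a != b) {
      inst_t next = ip[-1];
      --ip;
      next(types, a, b);
    } else {
      inst_t next = ip[1];
      ++ip;
      next(types, a, b);
    }
  } else {
    type_error = 1;
  }
}

/* Hypothetical replacement for unconditional_exit: record a and return. */
static void halt(uint64_t types, uint64_t a, uint64_t b) {
  (void)types;
  (void)b;
  final_a = a;
}

/* Runs the countdown from start and returns how many times dec executed. */
uint64_t run_countdown(uint64_t start) {
  static const inst_t code[] = { dec, if_not_equal_jump_back_1, halt };
  dec_executed = 0;
  final_a = start;
  type_error = 0;
  ip = code + 1;           /* execution starts at the second instruction */
  code[1](0x0101, start, 0);
  return dec_executed;
}
```

Standard C does not guarantee tail calls, so at low optimisation levels the
stack depth grows with the iteration count. Keep start small when trying this.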

While the generated assembly of GCC snapshot speaks for itself, here's some
empirical evidence of its inferiority:

$ gcc-snapshot.sh -O3 -std=gnu99 plain-32bit-direct-dispatch-countdown.c && time ./a.out

real    0m10.014s
user    0m10.009s
sys     0m0.000s

GCC snapshot has doubled the execution time of this virtual machine example
(compared to gcc-4.3, gcc-4.4 and gcc-4.5).


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44281



* [Bug rtl-optimization/44281] [4.3/4.4/4.5/4.6 Regression] Global Register variable pessimisation
       [not found] <bug-44281-4@http.gcc.gnu.org/bugzilla/>
                   ` (3 preceding siblings ...)
  2011-03-04 11:23 ` jakub at gcc dot gnu.org
@ 2011-03-05  2:01 ` adam at consulting dot net.nz
  4 siblings, 0 replies; 14+ messages in thread
From: adam at consulting dot net.nz @ 2011-03-05  2:01 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44281

--- Comment #10 from Adam Warner <adam at consulting dot net.nz> 2011-03-05 02:01:04 UTC ---
Jakub,

Thanks for the explanation [The "weird" saving/restoring of %rdi into/from %r10
is because the RA chose to use %rdi for a temporary used in incrementing of
REG7 and loading the next pointer from it, while postreload managed to remove
all needs for such a temporary register, it is too late for the save/restore
code not to be emitted.]

I've replaced the memory lookup and REG7 increment with equivalent inline
assembly to help clarify this explanation. With one remaining source code
variable (next of type fn_t) and everything else opaque assembly, the code
generation is worse.


#include <stdint.h>

/* Six caller-saved registers as input arguments */
#define CALLER_SAVED uint64_t REG0, uint64_t REG1, uint64_t REG2, \
                     uint64_t REG3, uint64_t REG4, uint64_t REG5
typedef void (*fn_t)(CALLER_SAVED);

/* Six callee-saved registers as global register variables */
register uint64_t REG6 __asm__("rbx");
register fn_t    *REG7 __asm__("rbp");
register uint64_t REG8 __asm__("r12");
register uint64_t REG9 __asm__("r13");
register uint64_t REG10 __asm__("r14");
register uint64_t REG11 __asm__("r15");

/* Free general purpose registers are RSP, RAX, R10 and R11 */

void optimal_code_generation(CALLER_SAVED) {
  fn_t next=REG7[1];
  next(REG0, REG1, REG2, REG3, REG4, REG5);
}

void unmodified_input_arg_is_copied(CALLER_SAVED) {
  fn_t next=REG7[1];
  ++REG7;
  next(REG0, REG1, REG2, REG3, REG4, REG5);
}

void unmodified_input_arg_is_copied_alt(CALLER_SAVED) {
  fn_t next=REG7[1];
  __asm__("add $8, %0" : "+r" (REG7));
  next(REG0, REG1, REG2, REG3, REG4, REG5);
}

void unmodified_input_arg_is_copied_alt2(CALLER_SAVED) {
  fn_t next;
  __asm__("mov 0x8(%[from]), %[to]" : [to] "=a" (next) : [from] "r" (REG7));
  __asm__("add $8, %0" : "+r" (REG7));
  next(REG0, REG1, REG2, REG3, REG4, REG5);
}

int main() {
  return 0;
}


$ gcc-4.6 -O3 unmodified_ordinary_register_is_copied_with_pure_asm.c && objdump -d -m i386:x86-64 a.out|less

00000000004004a0 <optimal_code_generation>:
  4004a0:       48 8b 45 08             mov    0x8(%rbp),%rax
  4004a4:       ff e0                   jmpq   *%rax
  4004a6:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  4004ad:       00 00 00 

00000000004004b0 <unmodified_input_arg_is_copied>:
  4004b0:       49 89 fa                mov    %rdi,%r10
  4004b3:       48 8b 45 08             mov    0x8(%rbp),%rax
  4004b7:       48 8d 6d 08             lea    0x8(%rbp),%rbp
  4004bb:       4c 89 d7                mov    %r10,%rdi
  4004be:       ff e0                   jmpq   *%rax

00000000004004c0 <unmodified_input_arg_is_copied_alt>:
  4004c0:       49 89 fa                mov    %rdi,%r10
  4004c3:       48 8b 45 08             mov    0x8(%rbp),%rax
  4004c7:       4c 89 d7                mov    %r10,%rdi
  4004ca:       48 83 c5 08             add    $0x8,%rbp
  4004ce:       ff e0                   jmpq   *%rax

00000000004004d0 <unmodified_input_arg_is_copied_alt2>:
  4004d0:       49 89 fa                mov    %rdi,%r10
  4004d3:       48 89 f7                mov    %rsi,%rdi
  4004d6:       48 89 d6                mov    %rdx,%rsi
  4004d9:       48 8b 45 08             mov    0x8(%rbp),%rax
  4004dd:       48 89 f2                mov    %rsi,%rdx
  4004e0:       48 89 fe                mov    %rdi,%rsi
  4004e3:       4c 89 d7                mov    %r10,%rdi
  4004e6:       48 83 c5 08             add    $0x8,%rbp
  4004ea:       ff e0                   jmpq   *%rax

unmodified_input_arg_is_copied_alt2() specifies a variable next of type fn_t.
The first assembly statement __asm__("mov 0x8(%[from]), %[to]" : [to] "=a"
(next) : [from] "r" (REG7)); directly translates to mov 0x8(%rbp),%rax. Note
use of the "=a" machine constraint to force use of the free %rax register.

The second assembly statement __asm__("add $8, %0" : "+r" (REG7)); directly
translates to add $0x8,%rbp. This is in-place register mutation which does not
require a temporary for incrementing.
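
For illustration, here is the same "+r" read-write constraint applied to an
ordinary local pointer rather than a global register variable. This is a
sketch, not part of the test case: the function name advance_slot is
hypothetical, the asm path assumes GCC-style inline assembly on x86-64 with the
default AT&T syntax, and other targets fall back to plain C.

```c
#include <stdint.h>

/* Advance a pointer by one 8-byte slot. */
uint64_t *advance_slot(uint64_t *p) {
#if defined(__GNUC__) && defined(__x86_64__)
  /* "+r": p is both read and written, in whatever register the compiler
     picks; "cc" notes that add changes the flags. */
  __asm__("add $8, %0" : "+r"(p) : : "cc");
#else
  ++p;  /* portable fallback with the same effect */
#endif
  return p;
}
```

Because the operand is read-write, the instruction updates the register in
place and no second register is named in the asm statement.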

While I suspected I might be able to work around the spurious saving/restoring
of unmodified registers with inline assembly, the results are far worse. mov
%rdi,%r10; mov %rsi,%rdi; mov %rdx,%rsi is maximally serialized. One cannot
move %rdx into %rsi until %rsi is moved into %rdi. But one cannot move %rsi
into %rdi until %rdi is moved into %r10. Restoring the unmodified registers is
also maximally serialized.
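
To make the net effect of those six moves concrete, here is a small C model
that replays them in listing order on a struct of register values. The struct
and function names are hypothetical, and the model only tracks values. It does
not model timing.

```c
#include <stdint.h>

/* Only the registers touched by the save/restore sequence are modelled. */
struct regs {
  uint64_t rdi, rsi, rdx, r10;
};

/* Replays the six register-to-register moves from
   unmodified_input_arg_is_copied_alt2 in listing order. */
struct regs replay_moves(struct regs r) {
  r.r10 = r.rdi;  /* mov %rdi,%r10 */
  r.rdi = r.rsi;  /* mov %rsi,%rdi : must follow, or the old %rdi is lost */
  r.rsi = r.rdx;  /* mov %rdx,%rsi : must follow, or the old %rsi is lost */
  /* the load of next into %rax and the add to %rbp sit around here */
  r.rdx = r.rsi;  /* mov %rsi,%rdx */
  r.rsi = r.rdi;  /* mov %rdi,%rsi */
  r.rdi = r.r10;  /* mov %r10,%rdi */
  return r;
}
```

After the replay %rdi, %rsi and %rdx hold exactly their original values, and
the only lasting change is that %r10 has been overwritten with a copy of %rdi.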



* [Bug rtl-optimization/44281] [4.3/4.4/4.5/4.6 Regression] Global Register variable pessimisation
       [not found] <bug-44281-4@http.gcc.gnu.org/bugzilla/>
                   ` (2 preceding siblings ...)
  2011-03-04 10:51 ` adam at consulting dot net.nz
@ 2011-03-04 11:23 ` jakub at gcc dot gnu.org
  2011-03-05  2:01 ` adam at consulting dot net.nz
  4 siblings, 0 replies; 14+ messages in thread
From: jakub at gcc dot gnu.org @ 2011-03-04 11:23 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44281

--- Comment #9 from Jakub Jelinek <jakub at gcc dot gnu.org> 2011-03-04 11:22:51 UTC ---
You are talking about this single testcase; I'm talking in general. If gcc on
x86_64 is tuned for a medium sized general purpose register file and you
suddenly turn it into a very limited size general purpose register file, you
can get non-optimal code.  Such bugreports are definitely much lower priority
than what you get with the common case where no global register vars are used,
or at most one or two.  The "weird" saving/restoring of %rdi into/from %r10 is
because the RA (register allocator) chose to use %rdi for a temporary used in
incrementing REG7 and loading the next pointer from it. Although postreload
managed to remove all need for such a temporary register, by then it is too
late for the save/restore code not to be emitted.



* [Bug rtl-optimization/44281] [4.3/4.4/4.5/4.6 Regression] Global Register variable pessimisation
       [not found] <bug-44281-4@http.gcc.gnu.org/bugzilla/>
  2011-03-04  7:23 ` adam at consulting dot net.nz
  2011-03-04  7:46 ` jakub at gcc dot gnu.org
@ 2011-03-04 10:51 ` adam at consulting dot net.nz
  2011-03-04 11:23 ` jakub at gcc dot gnu.org
  2011-03-05  2:01 ` adam at consulting dot net.nz
  4 siblings, 0 replies; 14+ messages in thread
From: adam at consulting dot net.nz @ 2011-03-04 10:51 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44281

--- Comment #8 from Adam Warner <adam at consulting dot net.nz> 2011-03-04 10:51:01 UTC ---
Jakub, I fail to see how your conclusion not to do this is supported by the
facts. There are:

(a) six global register variables (though the same effect can be observed with
one global register variable and -ffixed-rbx -ffixed-r12 -ffixed-r13
-ffixed-r14 -ffixed-r15)
(b) six function arguments
(c) one stack pointer

Therefore three registers remain free: %rax, %r10 and %r11. Only one free
register is required to generate the optimal code. GCC 4.5 can do this. GCC 4.6
can't.

The fact GCC outputs the assembly sequence "mov %rdi,%r10; mov %r10,%rdi" is
evidence of a bizarre cascade of bugs. Even rudimentary peephole optimisation
could elide that assembly sequence.

Are you able to explain why GCC outputs assembly code for a register that is
never modified? %rdi remains unmodified. This has nothing to do with a
"compiler has much more limited choices in generating close to optimal
code". The compiler has the choice to use %rax, %r10 or %r11 to store the
address to jump to without spilling. There is no register pressure in this
example. One register is required. Three are available.
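
The register count above can be checked mechanically. The sketch below assumes
the sixteen x86-64 general purpose registers and the System V AMD64 argument
registers. The enum, macro and function names are hypothetical.

```c
#include <stdint.h>

/* The sixteen x86-64 general purpose registers. */
enum gpr { RAX, RBX, RCX, RDX, RSI, RDI, RBP, RSP,
           R8, R9, R10, R11, R12, R13, R14, R15, GPR_COUNT };

#define BIT(r) (1u << (r))

/* (a) six global register variables, as declared in the test case */
static const uint32_t GLOBAL_REG_VARS =
    BIT(RBX) | BIT(RBP) | BIT(R12) | BIT(R13) | BIT(R14) | BIT(R15);

/* (b) six integer argument registers of the System V AMD64 ABI */
static const uint32_t ARG_REGS =
    BIT(RDI) | BIT(RSI) | BIT(RDX) | BIT(RCX) | BIT(R8) | BIT(R9);

/* (c) one stack pointer */
static const uint32_t STACK_PTR = BIT(RSP);

/* Registers that are neither reserved nor carrying an argument. */
uint32_t free_register_mask(void) {
  uint32_t all = (1u << GPR_COUNT) - 1u;
  return all & ~(GLOBAL_REG_VARS | ARG_REGS | STACK_PTR);
}

/* Counts the set bits in a mask. */
unsigned count_bits(uint32_t mask) {
  unsigned n = 0;
  while (mask) {
    mask &= mask - 1;  /* clear the lowest set bit */
    ++n;
  }
  return n;
}
```

The resulting mask contains exactly RAX, R10 and R11, matching the three free
registers listed above.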



* [Bug rtl-optimization/44281] [4.3/4.4/4.5/4.6 Regression] Global Register variable pessimisation
       [not found] <bug-44281-4@http.gcc.gnu.org/bugzilla/>
  2011-03-04  7:23 ` adam at consulting dot net.nz
@ 2011-03-04  7:46 ` jakub at gcc dot gnu.org
  2011-03-04 10:51 ` adam at consulting dot net.nz
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 14+ messages in thread
From: jakub at gcc dot gnu.org @ 2011-03-04  7:46 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44281

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #7 from Jakub Jelinek <jakub at gcc dot gnu.org> 2011-03-04 07:46:11 UTC ---
Using 6 global register variables is clearly self-inflicted pain, even on
x86_64, because if you take 6 registers away and another 6 registers are used
for parameter passing, you make the target very limited on number of registers
and the compiler has much more limited choices in generating close to optimal
code.
Just don't do this.



* [Bug rtl-optimization/44281] [4.3/4.4/4.5/4.6 Regression] Global Register variable pessimisation
       [not found] <bug-44281-4@http.gcc.gnu.org/bugzilla/>
@ 2011-03-04  7:23 ` adam at consulting dot net.nz
  2011-03-04  7:46 ` jakub at gcc dot gnu.org
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 14+ messages in thread
From: adam at consulting dot net.nz @ 2011-03-04  7:23 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44281

--- Comment #6 from Adam Warner <adam at consulting dot net.nz> 2011-03-04 07:22:47 UTC ---
Below is a very simple test case of an ordinary input argument to a function
being:

(a) copied to a spare register
(b) copied back from a spare register

When the input argument is:

(a) never modified; and
(b) an ordinary register (not a global register variable)

unmodified_ordinary_register_is_copied.c:


#include <stdint.h>

/* Six caller-saved registers as input arguments */
#define CALLER_SAVED uint64_t REG0, uint64_t REG1, uint64_t REG2, \
                     uint64_t REG3, uint64_t REG4, uint64_t REG5
typedef void (*fn_t)(CALLER_SAVED);

/* Six callee-saved registers as global register variables */
register uint64_t REG6 __asm__("rbx");
register fn_t    *REG7 __asm__("rbp");
register uint64_t REG8 __asm__("r12");
register uint64_t REG9 __asm__("r13");
register uint64_t REG10 __asm__("r14");
register uint64_t REG11 __asm__("r15");

/* Free general purpose registers are RSP, RAX, R10 and R11 */

void optimal_code_generation(CALLER_SAVED) {
  fn_t next=REG7[1];
  next(REG0, REG1, REG2, REG3, REG4, REG5);
}

void unmodified_input_arg_is_copied(CALLER_SAVED) {
  fn_t next=REG7[1];
  ++REG7;
  next(REG0, REG1, REG2, REG3, REG4, REG5);
}

int main() {
  return 0;
}


gcc-4.5 generates optimal code for both functions:
$ gcc-4.5 -O3 unmodified_ordinary_register_is_copied.c && objdump -d -m i386:x86-64 a.out|less
...
00000000004004a0 <optimal_code_generation>:
  4004a0:       48 8b 45 08             mov    0x8(%rbp),%rax
  4004a4:       ff e0                   jmpq   *%rax
...
00000000004004b0 <unmodified_input_arg_is_copied>:
  4004b0:       48 8b 45 08             mov    0x8(%rbp),%rax
  4004b4:       48 83 c5 08             add    $0x8,%rbp
  4004b8:       ff e0                   jmpq   *%rax
...

Compare with GCC 4.6:
$ gcc-4.6 --version
gcc-4.6 (Debian 4.6-20110227-1) 4.6.0 20110227 (experimental) [trunk revision 170543]
...

$ gcc-4.6 -O3 unmodified_ordinary_register_is_copied.c && objdump -d -m i386:x86-64 a.out|less
...
00000000004004a0 <optimal_code_generation>:
  4004a0:       48 8b 45 08             mov    0x8(%rbp),%rax
  4004a4:       ff e0                   jmpq   *%rax
...
00000000004004b0 <unmodified_input_arg_is_copied>:
  4004b0:       49 89 fa                mov    %rdi,%r10
  4004b3:       48 8b 45 08             mov    0x8(%rbp),%rax
  4004b7:       48 8d 6d 08             lea    0x8(%rbp),%rbp
  4004bb:       4c 89 d7                mov    %r10,%rdi
  4004be:       ff e0                   jmpq   *%rax
...

Under the x86-64 System V ABI used on Linux, %rdi carries the first integer
argument passed to a function. For some reason it is copied to %r10 and then
copied straight back from %r10 to %rdi. At no stage is %rdi modified.

(Minor aside:
lea 0x8(%rbp),%rbp has also replaced add $0x8,%rbp. My Intel Core 2 hardware
can execute a maximum of one LEA instruction per clock cycle compared to three
ADD instructions per clock cycle. If I add -march=core2 -mtune=core2 the code
generation becomes:
00000000004004b0 <unmodified_input_arg_is_copied>:
  4004b0:       48 8b 45 08             mov    0x8(%rbp),%rax
  4004b4:       48 8d 6d 08             lea    0x8(%rbp),%rbp
  4004b8:       49 89 fa                mov    %rdi,%r10
  4004bb:       4c 89 d7                mov    %r10,%rdi
  4004be:       ff e0                   jmpq   *%rax
)

This bizarre register copying goes away if I comment out one of the six global
register variables (i.e. five callee-saved global register variables instead of
six). For some reason GCC 4.6 cannot generate sensible code with %rsp, %rax,
%r10 and %r11 available---but can generate sensible code when an additional
register (%rbx, %r12, %r13, %r14 or %r15) is available.
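
For readers who cannot build with these global register variables, the control
flow of unmodified_input_arg_is_copied can be sketched portably. This is only an
illustration, not part of the test case: the names ip, step, last, seen_a0 and
run_demo are hypothetical, and ip is an ordinary global standing in for REG7, so
it says nothing about register allocation. It only shows that the function
fetches the next function pointer, advances the pointer, and passes all six
arguments through untouched.

```c
#include <stdint.h>

/* Same six-argument signature as in the report. */
#define ARGS uint64_t a0, uint64_t a1, uint64_t a2, \
             uint64_t a3, uint64_t a4, uint64_t a5
typedef void (*fn_t)(ARGS);

/* Hypothetical stand-in for REG7: an ordinary global rather than a
   global register variable, so this compiles on any target. */
static fn_t *ip;

/* Records the first argument seen by the final function in the chain. */
static uint64_t seen_a0;

static void last(ARGS) {
  (void)a1; (void)a2; (void)a3; (void)a4; (void)a5;
  seen_a0 = a0;
}

/* Mirrors unmodified_input_arg_is_copied: fetch, advance, call. */
static void step(ARGS) {
  fn_t next = ip[1];
  ++ip;
  next(a0, a1, a2, a3, a4, a5);
}

/* Runs step -> step -> last and returns what last received as a0. */
static uint64_t run_demo(void) {
  static fn_t table[] = { step, step, last };
  ip = table;
  step(42, 0, 0, 0, 0, 0);
  return seen_a0;
}
```

Calling run_demo() returns 42: the first argument reaches the end of the chain
unchanged, which is why no copy of %rdi should be needed in the real test case.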



end of thread, other threads:[~2011-03-05  2:01 UTC | newest]

Thread overview: 14+ messages
2010-05-26  5:12 [Bug regression/44281] New: Global Register variable pessimisation and regression adam at consulting dot net dot nz
2010-06-07  5:36 ` [Bug regression/44281] " adam at consulting dot net dot nz
2010-07-20 22:53 ` [Bug rtl-optimization/44281] [4.3/4.4/4.5/4.6 Regression] Global Register variable pessimisation steven at gcc dot gnu dot org
2010-07-20 22:55 ` pinskia at gcc dot gnu dot org
2010-07-22  8:48 ` rguenth at gcc dot gnu dot org
2010-09-11 11:16 ` adam at consulting dot net dot nz
2010-09-11 13:50 ` hjl dot tools at gmail dot com
2010-09-12 14:12 ` pinskia at gcc dot gnu dot org
2010-09-13  0:24 ` adam at consulting dot net dot nz
     [not found] <bug-44281-4@http.gcc.gnu.org/bugzilla/>
2011-03-04  7:23 ` adam at consulting dot net.nz
2011-03-04  7:46 ` jakub at gcc dot gnu.org
2011-03-04 10:51 ` adam at consulting dot net.nz
2011-03-04 11:23 ` jakub at gcc dot gnu.org
2011-03-05  2:01 ` adam at consulting dot net.nz
