* [Bug regression/44281] Global Register variable pessimisation and regression
2010-05-26 5:12 [Bug regression/44281] New: Global Register variable pessimisation and regression adam at consulting dot net dot nz
@ 2010-06-07 5:36 ` adam at consulting dot net dot nz
2010-07-20 22:53 ` [Bug rtl-optimization/44281] [4.3/4.4/4.5/4.6 Regression] Global Register variable pessimisation steven at gcc dot gnu dot org
` (6 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: adam at consulting dot net dot nz @ 2010-06-07 5:36 UTC (permalink / raw)
To: gcc-bugs
------- Comment #1 from adam at consulting dot net dot nz 2010-06-07 05:35 -------
Example-specific workaround discovered for global register variable
pessimisation with recent versions of GCC:
void push_flag_into_global_reg_var(uint64_t a, uint64_t b) {
  uint64_t flag = (a==b);
  global_flag_stack <<= 8;
  __asm__ __volatile__("" : : : "memory"); /* ??? */
  global_flag_stack |= flag;
}
Every version of GCC tested (including gcc (Debian 20100530-1) 4.6.0 20100530
(experimental) [trunk revision 160047]) produces similarly compact code:
0000000000400494 <push_flag_into_global_reg_var>:
  400494:  48 c1 e3 08     shl    rbx,0x8
  400498:  31 c0           xor    eax,eax
  40049a:  48 39 f7        cmp    rdi,rsi
  40049d:  0f 94 c0        sete   al
  4004a0:  48 09 c3        or     rbx,rax
  4004a3:  c3              ret
Telling the compiler that memory may have changed between global register
variable assignments seems to have coaxed the compiler into treating the global
register variable as volatile.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44281
* [Bug rtl-optimization/44281] [4.3/4.4/4.5/4.6 Regression] Global Register variable pessimisation
From: steven at gcc dot gnu dot org @ 2010-07-20 22:53 UTC (permalink / raw)
To: gcc-bugs
--
steven at gcc dot gnu dot org changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
          Component|regression                  |rtl-optimization
     Ever Confirmed|0                           |1
   Last reconfirmed|0000-00-00 00:00:00         |2010-07-20 22:52:46
               date|                            |
            Summary|Global Register variable    |[4.3/4.4/4.5/4.6 Regression]
                   |pessimisation and regression|Global Register variable
                   |                            |pessimisation
* [Bug rtl-optimization/44281] [4.3/4.4/4.5/4.6 Regression] Global Register variable pessimisation
From: pinskia at gcc dot gnu dot org @ 2010-07-20 22:55 UTC (permalink / raw)
To: gcc-bugs
--
pinskia at gcc dot gnu dot org changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
   Target Milestone|---                         |4.3.6
* [Bug rtl-optimization/44281] [4.3/4.4/4.5/4.6 Regression] Global Register variable pessimisation
From: rguenth at gcc dot gnu dot org @ 2010-07-22 8:48 UTC (permalink / raw)
To: gcc-bugs
--
rguenth at gcc dot gnu dot org changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2
* [Bug rtl-optimization/44281] [4.3/4.4/4.5/4.6 Regression] Global Register variable pessimisation
From: adam at consulting dot net dot nz @ 2010-09-11 11:16 UTC (permalink / raw)
To: gcc-bugs
------- Comment #2 from adam at consulting dot net dot nz 2010-09-11 11:15 -------
GCC snapshot has regressed compared to gcc-4.5:
#include <assert.h>
#include <stdint.h>

#define LIKELY(x) __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

register uint32_t *Iptr __asm__("rbp");

typedef void (*inst_t)(uint64_t types, uint64_t a, uint64_t b);

__attribute__ ((noinline)) void dec_helper(uint64_t types, uint64_t a, uint64_t b) {
  assert("FIXME"=="");
}

void dec(uint64_t types, uint64_t a, uint64_t b) {
  if (LIKELY((types & 0xFF) == 1)) {
    uint32_t next = Iptr[1];
    --a;
    ++Iptr;
    ((inst_t) (uint64_t) next)(types, a, b);
  } else dec_helper(types, a, b);
}

int main() {
  return 0;
}
$ gcc-4.5 -O3 -std=gnu99 plain-32bit-direct-dispatch.c && objdump -d -m i386:x86-64:intel a.out|less
0000000000400520 <dec>:
  400520:  40 80 ff 01     cmp    dil,0x1
  400524:  75 0d           jne    400533 <dec+0x13>
  400526:  8b 45 04        mov    eax,DWORD PTR [rbp+0x4]
  400529:  48 83 ee 01     sub    rsi,0x1
  40052d:  48 83 c5 04     add    rbp,0x4
  400531:  ff e0           jmp    rax
  400533:  e9 c8 ff ff ff  jmp    400500 <dec_helper>
  400538:  eb 06           jmp    400540 <main>
  40053a:  90              nop
  40053b:  90              nop
  40053c:  90              nop
  40053d:  90              nop
  40053e:  90              nop
  40053f:  90              nop
The above code generation is fine. Here is what GCC snapshot {gcc (Debian
20100828-1) 4.6.0 20100828 (experimental) [trunk revision 163616]} generates:
$ gcc-snapshot.sh -O3 -std=gnu99 plain-32bit-direct-dispatch.c && objdump -d -m i386:x86-64:intel a.out|less
0000000000400500 <dec>:
  400500:  48 83 ec 08     sub    rsp,0x8
  400504:  40 80 ff 01     cmp    dil,0x1
  400508:  75 14           jne    40051e <dec+0x1e>
  40050a:  48 89 e8        mov    rax,rbp
  40050d:  48 83 ee 01     sub    rsi,0x1
  400511:  48 8d 6d 04     lea    rbp,[rbp+0x4]
  400515:  8b 40 04        mov    eax,DWORD PTR [rax+0x4]
  400518:  48 83 c4 08     add    rsp,0x8
  40051c:  ff e0           jmp    rax
  40051e:  e8 bd ff ff ff  call   4004e0 <dec_helper>
  400523:  eb 0b           jmp    400530 <main>
  400525:  90              nop
  400526:  90              nop
  400527:  90              nop
  400528:  90              nop
  400529:  90              nop
  40052a:  90              nop
  40052b:  90              nop
  40052c:  90              nop
  40052d:  90              nop
  40052e:  90              nop
  40052f:  90              nop
Function size has jumped from rounded up to 32 bytes to rounded up to 48 bytes.
Tail call has been missed, leading to insertion of stack alignment
instructions. Global register variable RBP is copied into RAX for no reason
whatsoever, subverting loading the next instruction before recomputing the
instruction pointer.
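[Editorial sketch, not part of the original comment.] The 32-byte and 48-byte figures can be reproduced from the two listings, assuming that the bytes after the final jmp/call to dec_helper are alignment padding and that functions are aligned to 16 bytes (both are readings of the listings, not something the report states). The helper names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Round size up to the next multiple of alignment.
   alignment must be a non-zero power of two. */
size_t round_up(size_t size, size_t alignment) {
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return (size + alignment - 1) & ~(alignment - 1);
}

/* gcc-4.5: dec starts at 0x400520; its last real instruction
   (jmp dec_helper) ends at 0x400538, i.e. 24 bytes of code. */
size_t dec_footprint_gcc45(void) {
  return round_up(0x400538 - 0x400520, 16);
}

/* Snapshot: dec starts at 0x400500; its last real instruction
   (call dec_helper) ends at 0x400523, i.e. 35 bytes of code. */
size_t dec_footprint_snapshot(void) {
  return round_up(0x400523 - 0x400500, 16);
}
```

Under those assumptions the body of dec grows from 24 to 35 bytes, which crosses a 16-byte boundary and costs one extra aligned slot.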
* [Bug rtl-optimization/44281] [4.3/4.4/4.5/4.6 Regression] Global Register variable pessimisation
From: hjl dot tools at gmail dot com @ 2010-09-11 13:50 UTC (permalink / raw)
To: gcc-bugs
------- Comment #3 from hjl dot tools at gmail dot com 2010-09-11 13:49 -------
(In reply to comment #2)
> GCC snapshot has regressed compared to gcc-4.5:
>
> #include <assert.h>
> #include <stdint.h>
>
> #define LIKELY(x) __builtin_expect(!!(x), 1)
> #define UNLIKELY(x) __builtin_expect(!!(x), 0)
>
> register uint32_t *Iptr __asm__("rbp");
>
> typedef void (*inst_t)(uint64_t types, uint64_t a, uint64_t b);
>
> __attribute__ ((noinline)) void dec_helper(uint64_t types, uint64_t a, uint64_t
> b) {
> assert("FIXME"=="");
> }
>
> void dec(uint64_t types, uint64_t a, uint64_t b) {
> if (LIKELY((types & 0xFF) == 1)) {
> uint32_t next = Iptr[1];
> --a;
> ++Iptr;
> ((inst_t) (uint64_t) next)(types, a, b);
> } else dec_helper(types, a, b);
> }
This is caused by revision 160124:
http://gcc.gnu.org/ml/gcc-cvs/2010-06/msg00036.html
--
hjl dot tools at gmail dot com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hubicka at gcc dot gnu dot org
* [Bug rtl-optimization/44281] [4.3/4.4/4.5/4.6 Regression] Global Register variable pessimisation
From: pinskia at gcc dot gnu dot org @ 2010-09-12 14:12 UTC (permalink / raw)
To: gcc-bugs
------- Comment #4 from pinskia at gcc dot gnu dot org 2010-09-12 14:11 -------
>This is caused by revision 160124:
Not really, it is a noreturn function so the behavior is correct for our policy
of allowing a more correct backtrace for noreturn functions. The testcase is an
incorrect one based on size and not really that interesting anymore with
respect to global register variables.
* [Bug rtl-optimization/44281] [4.3/4.4/4.5/4.6 Regression] Global Register variable pessimisation
From: adam at consulting dot net dot nz @ 2010-09-13 0:24 UTC (permalink / raw)
To: gcc-bugs
------- Comment #5 from adam at consulting dot net dot nz 2010-09-13 00:24 -------
Andrew Pinski wrote:
> > This is caused by revision 160124:
> Not really, it is a noreturn function so the behavior is correct for our
> policy of allowing a more correct backtrace for noreturn functions.
I'm not sure what you're trying to say here, Andrew. Are you trying to justify
-O3 generating slower code to simplify debugging?
> The testcase is an incorrect one based on size
If you mean zero-extension of 32-bit function pointers, this is the x86-64
small code model.
If you mean that you don't care that the testcase increased in size without
further benchmarking, then empirical analysis is actually unnecessary. The
generated assembly is clearly worse.
> and not really that interesting anymore with respect to global register
> variables.
It's another example of global register variables being copied for no good
reason whatsoever. RAX is free and the obvious translation of uint32_t next =
Iptr[1]; to x86-64 assembly is mov eax,DWORD PTR [rbp+0x4]; (Intel syntax,
where RBP is the global register variable). Generating mov rax,rbp; mov
eax,DWORD PTR [rax+0x4]; is just dumb.
I've been experimenting with optimal forms of virtual machine dispatch for a
long time and what you have is a fragment of a very fast direct threaded
interpreter. So fast in fact that a type-safe countdown will execute at 5
cycles per iteration on Intel Core 2:
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define LIKELY(x) __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

register uint32_t *Iptr __asm__("rbp");

typedef void (*inst_t)(uint64_t types, uint64_t a, uint64_t b);

#define FUNC(x) ((inst_t) (uint64_t) x)
#define INST(x) ((uint32_t) (uint64_t) x)

__attribute__ ((noinline)) void dec_helper(uint64_t types, uint64_t a, uint64_t b) {
  assert("FIXME"=="");
}

void dec(uint64_t types, uint64_t a, uint64_t b) {
  if (LIKELY((types & 0xFF) == 1)) {
    uint32_t next = Iptr[1];
    --a;
    ++Iptr;
    FUNC(next)(types, a, b);
  } else dec_helper(types, a, b);
}

__attribute__ ((noinline)) void if_not_equal_jump_back_1_helper(uint64_t types, uint64_t a, uint64_t b) {
  assert("FIXME"=="");
}

void if_not_equal_jump_back_1(uint64_t types, uint64_t a, uint64_t b) {
  if (LIKELY((types & 0xFFFF) == 0x0101)) {
    if (LIKELY(a != b)) {
      uint32_t next = Iptr[-1];
      --Iptr;
      FUNC(next)(types, a, b);
    } else {
      uint32_t next = Iptr[1];
      ++Iptr;
      FUNC(next)(types, a, b);
    }
  } else if_not_equal_jump_back_1_helper(types, a, b);
}

void unconditional_exit(uint64_t types, uint64_t a, uint64_t b) {
  exit(0);
}

__attribute__ ((noinline, noclone)) void execute(uint32_t *code, uint64_t types, uint64_t a, uint64_t b) {
  Iptr = code;
  FUNC(code[0])(types, a, b);
}

int main() {
  uint32_t code[]={INST(dec),
                   INST(if_not_equal_jump_back_1),
                   INST(unconditional_exit)};
  execute(code + 1, 0x0101, 3000000000, 0);
  return 0;
}
$ gcc-4.5 -O3 -std=gnu99 plain-32bit-direct-dispatch-countdown.c && time ./a.out
real 0m5.007s
user 0m4.996s
sys 0m0.004s
CPU is 3GHz. Code execution starts at the second instruction
(if_not_equal_jump_back_1). a==3000000000 of type==1 is not equal to b==0 of
type==1 (the two type comparisons are performed in parallel in one cycle
without masking since one can compare the low 8-, 16- or 32-bits of a 64-bit
register without masking and the two types are packed into the low 16-bits of
the types register).
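[Editorial sketch, not part of the original comment.] The packed type check described above can be illustrated in portable C. The low byte holding the tag of a follows from dec() testing (types & 0xFF); placing the tag of b in bits 8-15 is an assumption, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two 8-bit type tags into the low 16 bits of one word:
   tag of a in bits 0-7, tag of b in bits 8-15 (assumed layout). */
uint64_t pack_types(uint8_t type_a, uint8_t type_b) {
  return (uint64_t)type_a | ((uint64_t)type_b << 8);
}

/* One 16-bit comparison checks both tags at once,
   as in if_not_equal_jump_back_1(). */
int both_are_type_1(uint64_t types) {
  return (types & 0xFFFF) == 0x0101;
}

/* One 8-bit comparison checks only the tag of a, as in dec(). */
int a_is_type_1(uint64_t types) {
  return (types & 0xFF) == 1;
}

/* Small usage example. */
void packed_type_example(void) {
  assert(both_are_type_1(pack_types(1, 1)));
  assert(!both_are_type_1(pack_types(1, 2)));
  assert(a_is_type_1(pack_types(1, 2)));
}
```

On x86-64 the masks need no separate instruction, because the low 8 or 16 bits of a register can be compared directly (the cmp dil,0x1 in the listings above).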
As a!=b the code jumps back to the dec instruction, which performs another type
check that a is of type==1 before decrementing a and jumping to
if_not_equal_jump_back_1. This continues until a==0 and program exit occurs.
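[Editorial sketch, not part of the original comment.] The control flow of the three-instruction program can be traced with a portable model. This is not the technique under discussion: it swaps the tail calls and the global register variable for an ordinary dispatch loop and a struct field, so it says nothing about code generation. All names below are hypothetical, and the slow-path helpers are replaced by assertions.

```c
#include <assert.h>
#include <stdint.h>

typedef struct vm vm_t;
typedef int (*op_t)(vm_t *vm);  /* returns 0 to stop, non-zero to continue */

struct vm {
  const op_t *iptr;     /* stands in for the global register variable Iptr */
  uint64_t types, a, b;
  uint64_t dispatches;  /* handlers run, not counting the final exit */
};

static int op_dec(vm_t *vm) {
  assert((vm->types & 0xFF) == 1);  /* slow path is not modelled */
  --vm->a;
  ++vm->iptr;                       /* fall through to the next instruction */
  return 1;
}

static int op_if_not_equal_jump_back_1(vm_t *vm) {
  assert((vm->types & 0xFFFF) == 0x0101);
  if (vm->a != vm->b) --vm->iptr;   /* back to dec */
  else ++vm->iptr;                  /* on to exit */
  return 1;
}

static int op_exit(vm_t *vm) {
  (void)vm;
  return 0;
}

/* Run the countdown from start to 0 and return the number of dispatches. */
uint64_t run_countdown(uint64_t start) {
  static const op_t code[] = { op_dec, op_if_not_equal_jump_back_1, op_exit };
  vm_t vm = { code + 1, 0x0101, start, 0, 0 };  /* start at the second slot */
  while ((*vm.iptr)(&vm)) ++vm.dispatches;
  assert(vm.a == 0);
  return vm.dispatches;
}
```

Starting from n, the model runs the comparison n + 1 times and the decrement n times, so each loop iteration in the timings corresponds to one dec plus one if_not_equal_jump_back_1.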
While the generated assembly of GCC snapshot speaks for itself, here's some
empirical evidence of its inferiority:
$ gcc-snapshot.sh -O3 -std=gnu99 plain-32bit-direct-dispatch-countdown.c && time ./a.out
real 0m10.014s
user 0m10.009s
sys 0m0.000s
GCC snapshot has doubled the execution time of this virtual machine example
(compared to gcc-4.3, gcc-4.4 and gcc-4.5).
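[Editorial sketch, not part of the original comment.] The cycles-per-iteration figures follow from the wall-clock times, assuming the clock runs at exactly 3 GHz and that one iteration means one decrement of a (3000000000 in total). The function name is hypothetical.

```c
#include <assert.h>

/* Convert a run time into cycles per loop iteration. */
double cycles_per_iteration(double seconds, double clock_hz, double iterations) {
  assert(iterations > 0.0);
  return seconds * clock_hz / iterations;
}
```

With the "real" times reported above, cycles_per_iteration(5.007, 3e9, 3e9) is about 5 for gcc-4.5 and cycles_per_iteration(10.014, 3e9, 3e9) is about 10 for the snapshot.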