public inbox for gcc-bugs@sourceware.org
* [Bug c/16961] New: Poor x86-64 performance
@ 2004-08-10 13:08 tomstdenis at iahu dot ca
  2004-08-10 13:27 ` [Bug c/16961] " falk at debian dot org
                   ` (9 more replies)
  0 siblings, 10 replies; 11+ messages in thread
From: tomstdenis at iahu dot ca @ 2004-08-10 13:08 UTC (permalink / raw)
  To: gcc-bugs

On the AMD64 using "-march=k8" we see really poor performance.  My beef is with two 
issues. 
 
First off, 128-bit unsigned additions are emulated when addq/adcq will do the 
job just fine.  For example, 
 
typedef unsigned long      mp_word __attribute__ ((mode(TI))); 
mp_word a, b; 
void test(void) { a += b; } 
 
Produces via [-O3 -fomit-frame-pointer -march=k8] 
 
        movq    a(%rip), %r10 
        movq    b(%rip), %r8 
        xorl    %ecx, %ecx 
        movq    a+8(%rip), %rdi 
        movq    b+8(%rip), %r9 
        leaq    (%r10,%r8), %rax 
        leaq    (%rdi,%r9), %rsi 
        cmpq    %r10, %rax 
        movq    %rax, a(%rip) 
        setb    %cl 
        leaq    (%rcx,%rsi), %rdx 
        movq    %rdx, a+8(%rip) 
        ret 
 
Which is insane. 
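What the wanted two-instruction lowering computes can be sketched as portable C with an explicit carry (my illustration, not compiler output): add the low words, and fold the carry out of that add, detectable as unsigned wraparound, into the high-word add. That is exactly what addq followed by adcq does.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch (not GCC output): a 128-bit add as two 64-bit words.
   The low-word add corresponds to addq; the high-word add plus the carry
   bit corresponds to adcq. */
typedef struct { uint64_t lo, hi; } u128;

static u128 add128(u128 x, u128 y) {
    u128 r;
    r.lo = x.lo + y.lo;                  /* addq: may wrap around */
    r.hi = x.hi + y.hi + (r.lo < x.lo);  /* adcq: wraparound implies carry */
    return r;
}
```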
 
The second beef is loop unrolling.  Somewhere between the 32-bit and 
64-bit targets it got WAY worse.   
 
In the old method loops could be handled with something like 
 
while (n&3) { do(); update_for_loop(); } 
while (n) { 
    do(); do(); do(); do(); 
    update_for_loop4x(); 
} 
 
Now I'm seeing  
 
top: goto off[n&7]; 
off_7:  do(); update_for_loop(); 
off_6:  do(); update_for_loop(); 
off_5:  do(); update_for_loop(); 
off_4:  do(); update_for_loop(); 
... 
if (n) goto top; 
 
In my case it updates the pointers inside "update_for_loop()" on every 
iteration when it really doesn't have to. 
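To make the old shape concrete, here is a small summing loop (my sketch, with a hypothetical sum4x() standing in for do()/update_for_loop()): the n%4 remainder iterations run once up front, and the main body then updates the pointer and counter only once per four elements.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for do()/update_for_loop(): sum n words.
   Peel n%4 iterations first, then run a 4x-unrolled body in which the
   pointer and counter are updated once per four elements. */
static uint64_t sum4x(const uint64_t *p, size_t n) {
    uint64_t s = 0;
    while (n & 3) { s += *p++; n--; }   /* peel the remainder */
    while (n) {
        s += p[0]; s += p[1]; s += p[2]; s += p[3];
        p += 4; n -= 4;                 /* one update per 4 iterations */
    }
    return s;
}
```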
 
Tom

-- 
           Summary: Poor x86-64 performance
           Product: gcc
           Version: 3.4.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: tomstdenis at iahu dot ca
                CC: gcc-bugs at gcc dot gnu dot org
 GCC build triplet: gcc version 3.4.1  (Gentoo Linux 3.4.1, ssp-3.4-2, pie-
                    8.7.6.3)
  GCC host triplet: Linux timmy 2.6.7-gentoo-r11 #1 Thu Aug 5 01:49:49 UTC
                    2004 x86_
GCC target triplet: gcc version 3.4.1  (Gentoo Linux 3.4.1, ssp-3.4-2, pie-
                    8.7.6.3)


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16961


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug c/16961] Poor x86-64 performance
  2004-08-10 13:08 [Bug c/16961] New: Poor x86-64 performance tomstdenis at iahu dot ca
@ 2004-08-10 13:27 ` falk at debian dot org
  2004-08-10 13:38 ` tomstdenis at iahu dot ca
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: falk at debian dot org @ 2004-08-10 13:27 UTC (permalink / raw)
  To: gcc-bugs


------- Additional Comments From falk at debian dot org  2004-08-10 13:27 -------
Please open a second bug report *with a complete example* for the second
problem; otherwise we cannot track the bug properly.


-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
  GCC build triplet|gcc version 3.4.1  (Gentoo  |x86_64-linux
                   |Linux 3.4.1, ssp-3.4-2, pie-|
                   |8.7.6.3)                    |
   GCC host triplet|Linux timmy 2.6.7-gentoo-r11|x86_64-linux
                   |#1 Thu Aug 5 01:49:49 UTC   |
                   |2004 x86_                   |
 GCC target triplet|gcc version 3.4.1  (Gentoo  |x86_64-linux
                   |Linux 3.4.1, ssp-3.4-2, pie-|
                   |8.7.6.3)                    |


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16961



* [Bug c/16961] Poor x86-64 performance
  2004-08-10 13:08 [Bug c/16961] New: Poor x86-64 performance tomstdenis at iahu dot ca
  2004-08-10 13:27 ` [Bug c/16961] " falk at debian dot org
@ 2004-08-10 13:38 ` tomstdenis at iahu dot ca
  2004-08-10 13:39 ` tomstdenis at iahu dot ca
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: tomstdenis at iahu dot ca @ 2004-08-10 13:38 UTC (permalink / raw)
  To: gcc-bugs


------- Additional Comments From tomstdenis at iahu dot ca  2004-08-10 13:38 -------
Here's a larger demo 
 
typedef unsigned long long mp_digit; 
typedef unsigned long      mp_word __attribute__ ((mode(TI))); 
mp_word a, b; 
 
// demo slow 128-bit add 
void test(void) 
{ 
   a += b; 
} 
 
// this unrolls right (but is otherwise inefficient because of the 128-bit add) 
void test2(mp_word *out, mp_digit x, mp_digit *y, int n) 
{ 
  int z; 
  for (z = 0; z < n; z++) out[z] += ((mp_word)x) * ((mp_word)*y++); 
} 
 
// this unrolls poorly 
void test3(mp_word *out, mp_digit x, mp_digit *y, int n) 
{ 
  int z; 
  for (z = 0; z < n; z++) { 
    asm("movq %0,%%rax\n" 
        "mulq (%1)\n" 
        "addq %%rax,(%2)\n" 
        "adcq %%rdx,8(%2)\n" 
        ::"r"(x), "r"(y), "r"(out) : "%rax", "%rdx"); 
    ++out; 
    ++y; 
  } 
} 
 
Which produces  
 
	.file	"test.c" 
	.text 
	.p2align 4,,15 
.globl test 
	.type	test, @function 
test: 
.LFB2: 
	movq	a(%rip), %r10 
	movq	b(%rip), %r8 
	xorl	%ecx, %ecx 
	movq	a+8(%rip), %rdi 
	movq	b+8(%rip), %r9 
	leaq	(%r10,%r8), %rax 
	leaq	(%rdi,%r9), %rsi 
	cmpq	%r10, %rax 
	movq	%rax, a(%rip) 
	setb	%cl 
	leaq	(%rcx,%rsi), %rdx 
	movq	%rdx, a+8(%rip) 
	ret 
.LFE2: 
	.size	test, .-test 
	.p2align 4,,15 
.globl test2 
	.type	test2, @function 
test2: 
.LFB3: 
	movq	%r13, -24(%rsp) 
.LCFI0: 
	movq	%r14, -16(%rsp) 
.LCFI1: 
	movq	%rdi, %r11 
	movq	%r15, -8(%rsp) 
.LCFI2: 
	movq	%rbx, -48(%rsp) 
.LCFI3: 
	movq	%rsi, %r13 
	movq	%rbp, -40(%rsp) 
.LCFI4: 
	movq	%r12, -32(%rsp) 
.LCFI5: 
	subq	$64, %rsp 
.LCFI6: 
	testl	%ecx, %ecx 
	movq	%rdx, %r14 
	movl	%ecx, %r15d 
	jle	.L8 
	movq	%rsi, %rax 
	movq	(%rdi), %r12 
	movq	8(%rdi), %rdi 
	mulq	(%rdx) 
	leal	-1(%r15), %r10d 
	xorl	%ecx, %ecx 
	leaq	8(%r14), %rbp 
	movl	%r10d, %ebx 
	andl	$3, %ebx 
	movq	%rdx, %r9 
	leaq	(%r12,%rax), %rdx 
	leaq	(%rdi,%r9), %rsi 
	cmpq	%r12, %rdx 
	movq	%rdx, -8(%rsp) 
	movq	-8(%rsp), %rax 
	setb	%cl 
	movq	%rsi, (%rsp) 
	addq	%rcx, (%rsp) 
	movq	(%rsp), %rdx 
	movl	%r10d, %r12d 
	movl	$16, %r10d 
	testl	%r12d, %r12d 
	movq	%rax, (%r11) 
	movq	%rdx, 8(%r11) 
	je	.L8 
	testl	%ebx, %ebx 
	je	.L6 
	cmpl	$1, %ebx 
	je	.L23 
	cmpl	$2, %ebx 
	.p2align 4,,5 
	je	.L24 
	movq	%r13, %rax 
	movq	16(%r11), %rsi 
	movq	24(%r11), %rdi 
	mulq	8(%r14) 
	leaq	16(%r14), %rbp 
	movb	$32, %r10b 
	leaq	(%rsi,%rax), %r12 
	leaq	(%rdi,%rdx), %rcx 
	xorl	%eax, %eax 
	cmpq	%rsi, %r12 
	movq	%rcx, -80(%rsp) 
	movq	%r12, -88(%rsp) 
	setb	%al 
	addq	%rax, -80(%rsp) 
	movq	-88(%rsp), %r14 
	movq	-80(%rsp), %rbx 
	leal	-2(%r15), %r12d 
	movq	%r14, 16(%r11) 
	movq	%rbx, 24(%r11) 
.L24: 
	movq	%r13, %rax 
	movq	(%r10,%r11), %rcx 
	xorl	%r8d, %r8d 
	mulq	(%rbp) 
	addq	$8, %rbp 
	movq	%rax, %rdi 
	movq	8(%r10,%r11), %rax 
	leaq	(%rcx,%rdi), %r9 
	leaq	(%rax,%rdx), %rbx 
	cmpq	%rcx, %r9 
	movq	%r9, -104(%rsp) 
	setb	%r8b 
	movq	-104(%rsp), %rdx 
	decl	%r12d 
	movq	%rbx, -96(%rsp) 
	addq	%r8, -96(%rsp) 
	movq	-96(%rsp), %r15 
	movq	%rdx, (%r10,%r11) 
	movq	%r15, 8(%r10,%r11) 
	addq	$16, %r10 
.L23: 
	movq	%r13, %rax 
	movq	8(%r10,%r11), %r14 
	xorl	%r8d, %r8d 
	mulq	(%rbp) 
	addq	$8, %rbp 
	movq	%rax, %r9 
	movq	(%r10,%r11), %rax 
	leaq	(%r14,%rdx), %rdx 
	movq	%rdx, -112(%rsp) 
	leaq	(%rax,%r9), %rcx 
	cmpq	%rax, %rcx 
	movq	%rcx, -120(%rsp) 
	movq	-120(%rsp), %r15 
	setb	%r8b 
	addq	%r8, -112(%rsp) 
	movq	-112(%rsp), %rsi 
	movq	%r15, (%r10,%r11) 
	movq	%rsi, 8(%r10,%r11) 
	addq	$16, %r10 
	decl	%r12d 
	je	.L8 
	.p2align 4,,7 
.L6: 
	movq	%r13, %rax 
	movq	(%r10,%r11), %rbx 
	movq	(%r10,%r11), %r15 
	mulq	(%rbp) 
	movq	8(%r10,%r11), %rsi 
	xorl	%r9d, %r9d 
	movq	16(%r10,%r11), %r8 
	movq	32(%r10,%r11), %r14 
	addq	%rax, %rbx 
	movq	%r13, %rax 
	cmpq	%r15, %rbx 
	movq	24(%r10,%r11), %r15 
	movq	%rbx, -24(%rsp) 
	setb	%r9b 
	addq	%rdx, %rsi 
	movq	-24(%rsp), %rcx 
	movq	%rsi, -16(%rsp) 
	addq	%r9, -16(%rsp) 
	xorl	%esi, %esi 
	movq	-16(%rsp), %rdx 
	movq	%rcx, (%r10,%r11) 
	movq	%rdx, 8(%r10,%r11) 
	mulq	8(%rbp) 
	addq	%rax, %r8 
	movq	16(%r10,%r11), %rax 
	movq	%r8, -40(%rsp) 
	movq	-40(%rsp), %rdi 
	cmpq	%rax, %r8 
	movq	%r13, %rax 
	setb	%sil 
	addq	%rdx, %r15 
	movq	%rdi, 16(%r10,%r11) 
	mulq	16(%rbp) 
	movq	%r15, -32(%rsp) 
	movq	40(%r10,%r11), %r15 
	addq	%rsi, -32(%rsp) 
	movq	-32(%rsp), %r9 
	movq	%r9, 24(%r10,%r11) 
	movq	%rdx, %rbx 
	movq	32(%r10,%r11), %rdx 
	movq	%rax, %rcx 
	addq	%rcx, %rdx 
	cmpq	%r14, %rdx 
	movq	%rdx, -56(%rsp) 
	movq	-56(%rsp), %rdi 
	setb	%r8b 
	addq	%rbx, %r15 
	movl	%r8d, %eax 
	movq	%r15, -48(%rsp) 
	xorl	%r15d, %r15d 
	movzbl	%al, %esi  
	addq	%rsi, -48(%rsp) 
	movq	%r13, %rax 
	movq	-48(%rsp), %r9 
	movq	%rdi, 32(%r10,%r11) 
	mulq	24(%rbp) 
	movq	56(%r10,%r11), %r14 
	addq	$32, %rbp 
	movq	%r9, 40(%r10,%r11) 
	movq	%rax, %rcx 
	movq	48(%r10,%r11), %rax 
	movq	%rdx, %rbx 
	leaq	(%r14,%rbx), %r8 
	leaq	(%rax,%rcx), %rdx 
	movq	%r8, -64(%rsp) 
	cmpq	%rax, %rdx 
	movq	%rdx, -72(%rsp) 
	movq	-72(%rsp), %rsi 
	setb	%r15b 
	addq	%r15, -64(%rsp) 
	movq	-64(%rsp), %rdi 
	movq	%rsi, 48(%r10,%r11) 
	movq	%rdi, 56(%r10,%r11) 
	addq	$64, %r10 
	subl	$4, %r12d 
	jne	.L6 
	.p2align 4,,7 
.L8: 
	movq	16(%rsp), %rbx 
	movq	24(%rsp), %rbp 
	movq	32(%rsp), %r12 
	movq	40(%rsp), %r13 
	movq	48(%rsp), %r14 
	movq	56(%rsp), %r15 
	addq	$64, %rsp 
	ret 
.LFE3: 
	.size	test2, .-test2 
	.p2align 4,,15 
.globl test3 
	.type	test3, @function 
test3: 
.LFB4: 
	pushq	%rbp 
.LCFI7: 
	testl	%ecx, %ecx 
	movq	%rsi, %r10 
	movl	%ecx, %ebp 
	pushq	%rbx 
.LCFI8: 
	movq	%rdi, %rbx 
	movq	%rdx, %rdi 
	jle	.L33 
	leal	-1(%rbp), %ecx 
	movl	%ecx, %esi 
	andl	$7, %esi 
#APP 
	movq %r10,%rax 
mulq (%rdi) 
addq %rax,(%rbx) 
adcq %rdx,8(%rbx) 
 
#NO_APP 
	testl	%ecx, %ecx 
	leaq	16(%rbx), %r9 
	leaq	8(%rdi), %r8 
	movl	%ecx, %r11d 
	je	.L33 
	testl	%esi, %esi 
	je	.L31 
	cmpl	$1, %esi 
	je	.L61 
	cmpl	$2, %esi 
	.p2align 4,,5 
	je	.L62 
	cmpl	$3, %esi 
	.p2align 4,,5 
	je	.L63 
	cmpl	$4, %esi 
	.p2align 4,,5 
	je	.L64 
	cmpl	$5, %esi 
	.p2align 4,,5 
	je	.L65 
	cmpl	$6, %esi 
	.p2align 4,,5 
	je	.L66 
#APP 
	movq %r10,%rax 
mulq (%r8) 
addq %rax,(%r9) 
adcq %rdx,8(%r9) 
 
#NO_APP 
	leaq	32(%rbx), %r9 
	leaq	16(%rdi), %r8 
	leal	-2(%rbp), %r11d 
.L66: 
#APP 
	movq %r10,%rax 
mulq (%r8) 
addq %rax,(%r9) 
adcq %rdx,8(%r9) 
 
#NO_APP 
	addq	$16, %r9 
	addq	$8, %r8 
	decl	%r11d 
.L65: 
#APP 
	movq %r10,%rax 
mulq (%r8) 
addq %rax,(%r9) 
adcq %rdx,8(%r9) 
 
#NO_APP 
	addq	$16, %r9 
	addq	$8, %r8 
	decl	%r11d 
.L64: 
#APP 
	movq %r10,%rax 
mulq (%r8) 
addq %rax,(%r9) 
adcq %rdx,8(%r9) 
 
#NO_APP 
	addq	$16, %r9 
	addq	$8, %r8 
	decl	%r11d 
.L63: 
#APP 
	movq %r10,%rax 
mulq (%r8) 
addq %rax,(%r9) 
adcq %rdx,8(%r9) 
 
#NO_APP 
	addq	$16, %r9 
	addq	$8, %r8 
	decl	%r11d 
.L62: 
#APP 
	movq %r10,%rax 
mulq (%r8) 
addq %rax,(%r9) 
adcq %rdx,8(%r9) 
 
#NO_APP 
	addq	$16, %r9 
	addq	$8, %r8 
	decl	%r11d 
.L61: 
#APP 
	movq %r10,%rax 
mulq (%r8) 
addq %rax,(%r9) 
adcq %rdx,8(%r9) 
 
#NO_APP 
	addq	$16, %r9 
	addq	$8, %r8 
	decl	%r11d 
	je	.L33 
.L31: 
#APP 
	movq %r10,%rax 
mulq (%r8) 
addq %rax,(%r9) 
adcq %rdx,8(%r9) 
 
#NO_APP 
	leaq	16(%r9), %rsi 
	leaq	8(%r8), %rbp 
#APP 
	movq %r10,%rax 
mulq (%rbp) 
addq %rax,(%rsi) 
adcq %rdx,8(%rsi) 
 
#NO_APP 
	leaq	32(%r9), %rdi 
	leaq	16(%r8), %rbx 
#APP 
	movq %r10,%rax 
mulq (%rbx) 
addq %rax,(%rdi) 
adcq %rdx,8(%rdi) 
 
#NO_APP 
	leaq	48(%r9), %rcx 
	leaq	24(%r8), %rbp 
#APP 
	movq %r10,%rax 
mulq (%rbp) 
addq %rax,(%rcx) 
adcq %rdx,8(%rcx) 
 
#NO_APP 
	leaq	64(%r9), %rsi 
	leaq	32(%r8), %rdi 
#APP 
	movq %r10,%rax 
mulq (%rdi) 
addq %rax,(%rsi) 
adcq %rdx,8(%rsi) 
 
#NO_APP 
	leaq	80(%r9), %rbx 
	leaq	40(%r8), %rcx 
#APP 
	movq %r10,%rax 
mulq (%rcx) 
addq %rax,(%rbx) 
adcq %rdx,8(%rbx) 
 
#NO_APP 
	leaq	96(%r9), %rbp 
	leaq	48(%r8), %rdi 
#APP 
	movq %r10,%rax 
mulq (%rdi) 
addq %rax,(%rbp) 
adcq %rdx,8(%rbp) 
 
#NO_APP 
	leaq	112(%r9), %rsi 
	leaq	56(%r8), %rbx 
#APP 
	movq %r10,%rax 
mulq (%rbx) 
addq %rax,(%rsi) 
adcq %rdx,8(%rsi) 
 
#NO_APP 
	subq	$-128, %r9 
	addq	$64, %r8 
	subl	$8, %r11d 
	jne	.L31 
.L33: 
	popq	%rbx 
	popq	%rbp 
	ret 
.LFE4: 
	.size	test3, .-test3 
	.comm	a,16,16 
	.comm	b,16,16 
	.section	.eh_frame,"a",@progbits 
.Lframe1: 
	.long	.LECIE1-.LSCIE1 
.LSCIE1: 
	.long	0x0 
	.byte	0x1 
	.string	"" 
	.uleb128 0x1 
	.sleb128 -8 
	.byte	0x10 
	.byte	0xc 
	.uleb128 0x7 
	.uleb128 0x8 
	.byte	0x90 
	.uleb128 0x1 
	.align 8 
.LECIE1: 
.LSFDE1: 
	.long	.LEFDE1-.LASFDE1 
.LASFDE1: 
	.long	.LASFDE1-.Lframe1 
	.quad	.LFB2 
	.quad	.LFE2-.LFB2 
	.align 8 
.LEFDE1: 
.LSFDE3: 
	.long	.LEFDE3-.LASFDE3 
.LASFDE3: 
	.long	.LASFDE3-.Lframe1 
	.quad	.LFB3 
	.quad	.LFE3-.LFB3 
	.byte	0x4 
	.long	.LCFI3-.LFB3 
	.byte	0x83 
	.uleb128 0x7 
	.byte	0x8f 
	.uleb128 0x2 
	.byte	0x8e 
	.uleb128 0x3 
	.byte	0x8d 
	.uleb128 0x4 
	.byte	0x4 
	.long	.LCFI6-.LCFI3 
	.byte	0xe 
	.uleb128 0x48 
	.byte	0x8c 
	.uleb128 0x5 
	.byte	0x86 
	.uleb128 0x6 
	.align 8 
.LEFDE3: 
.LSFDE5: 
	.long	.LEFDE5-.LASFDE5 
.LASFDE5: 
	.long	.LASFDE5-.Lframe1 
	.quad	.LFB4 
	.quad	.LFE4-.LFB4 
	.byte	0x4 
	.long	.LCFI7-.LFB4 
	.byte	0xe 
	.uleb128 0x10 
	.byte	0x86 
	.uleb128 0x2 
	.byte	0x4 
	.long	.LCFI8-.LCFI7 
	.byte	0xe 
	.uleb128 0x18 
	.byte	0x83 
	.uleb128 0x3 
	.align 8 
.LEFDE5: 
	.section	.note.GNU-stack,"",@progbits 
	.ident	"GCC: (GNU) 3.4.1  (Gentoo Linux 3.4.1, ssp-3.4-2, 
pie-8.7.6.3)" 
 

-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16961



* [Bug c/16961] Poor x86-64 performance
  2004-08-10 13:08 [Bug c/16961] New: Poor x86-64 performance tomstdenis at iahu dot ca
  2004-08-10 13:27 ` [Bug c/16961] " falk at debian dot org
  2004-08-10 13:38 ` tomstdenis at iahu dot ca
@ 2004-08-10 13:39 ` tomstdenis at iahu dot ca
  2004-08-10 13:58 ` [Bug target/16961] " falk at debian dot org
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: tomstdenis at iahu dot ca @ 2004-08-10 13:39 UTC (permalink / raw)
  To: gcc-bugs


------- Additional Comments From tomstdenis at iahu dot ca  2004-08-10 13:39 -------
I used  
 
gcc -O3 -fomit-frame-pointer -funroll-loops -march=k8 -m64 -S test.c 
 
to produce that asm code, btw. 

-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16961



* [Bug target/16961] Poor x86-64 performance
  2004-08-10 13:08 [Bug c/16961] New: Poor x86-64 performance tomstdenis at iahu dot ca
                   ` (2 preceding siblings ...)
  2004-08-10 13:39 ` tomstdenis at iahu dot ca
@ 2004-08-10 13:58 ` falk at debian dot org
  2004-08-10 14:09 ` tomstdenis at iahu dot ca
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: falk at debian dot org @ 2004-08-10 13:58 UTC (permalink / raw)
  To: gcc-bugs


------- Additional Comments From falk at debian dot org  2004-08-10 13:58 -------
Okay, as to the TImode problem, this is target-specific. I'm not familiar with
i386, but I have a very hard time believing that using the carry flag would lead
to a noticeable speedup here... oh well.

-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |minor
          Component|c                           |target
           Keywords|                            |missed-optimization


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16961



* [Bug target/16961] Poor x86-64 performance
  2004-08-10 13:08 [Bug c/16961] New: Poor x86-64 performance tomstdenis at iahu dot ca
                   ` (3 preceding siblings ...)
  2004-08-10 13:58 ` [Bug target/16961] " falk at debian dot org
@ 2004-08-10 14:09 ` tomstdenis at iahu dot ca
  2004-12-19 15:00 ` steven at gcc dot gnu dot org
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: tomstdenis at iahu dot ca @ 2004-08-10 14:09 UTC (permalink / raw)
  To: gcc-bugs


------- Additional Comments From tomstdenis at iahu dot ca  2004-08-10 14:09 -------
(In reply to comment #4) 
> Okay, as to the TImode problem, this is target specific. I'm not familiar 
with 
> i386, but I have a very hard time believing using the carry flag would lead 
> to a noticeable speedup here... oh well. 
 
Um, it is.  The 10 instructions GCC emits now consume decode bandwidth, require 
execution time, fill the cache, etc. 
 
Admittedly this isn't a "huge" problem, because most code won't be doing 128-bit 
math, but if the goal is to make GCC the best it can be, someone might as well 
fix this up. 
 

-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16961



* [Bug target/16961] Poor x86-64 performance
  2004-08-10 13:08 [Bug c/16961] New: Poor x86-64 performance tomstdenis at iahu dot ca
                   ` (4 preceding siblings ...)
  2004-08-10 14:09 ` tomstdenis at iahu dot ca
@ 2004-12-19 15:00 ` steven at gcc dot gnu dot org
  2005-07-18  7:52 ` [Bug target/16961] Poor x86-64 performance with 128bit ints steven at gcc dot gnu dot org
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: steven at gcc dot gnu dot org @ 2004-12-19 15:00 UTC (permalink / raw)
  To: gcc-bugs


------- Additional Comments From steven at gcc dot gnu dot org  2004-12-19 14:59 -------
This is similar to the "long long" problem for 32-bit x86 targets.  We 
keep the instructions in TImode all the way down until flow2, which made 
sense in the pre-GCC4 era, when this was the only way to enable optimization 
of arithmetic in machine modes not representable on the target 
machine.  With the new high-level optimizers we don't really need this 
anymore; we should just lower to machine instructions at expand time and let 
the RTL passes do their job of optimizing this better. 
 
 
 

-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jh at suse dot cz
             Status|UNCONFIRMED                 |NEW
     Ever Confirmed|                            |1
   Last reconfirmed|0000-00-00 00:00:00         |2004-12-19 14:59:53
               date|                            |


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16961



* [Bug target/16961] Poor x86-64 performance with 128bit ints
  2004-08-10 13:08 [Bug c/16961] New: Poor x86-64 performance tomstdenis at iahu dot ca
                   ` (5 preceding siblings ...)
  2004-12-19 15:00 ` steven at gcc dot gnu dot org
@ 2005-07-18  7:52 ` steven at gcc dot gnu dot org
  2005-07-18  8:47 ` steven at gcc dot gnu dot org
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: steven at gcc dot gnu dot org @ 2005-07-18  7:52 UTC (permalink / raw)
  To: gcc-bugs


------- Additional Comments From steven at gcc dot gnu dot org  2005-07-18 07:47 -------
The 128-bit arithmetic has improved now: 
 
typedef unsigned long      mp_word __attribute__ ((mode(TI)));  
mp_word a, b;  
void test(void) { a += b; }  
 
test: 
        movq    a(%rip), %rax 
        addq    b(%rip), %rax 
        movq    a+8(%rip), %rdx 
        adcq    b+8(%rip), %rdx 
        movq    %rax, a(%rip) 
        movq    %rdx, a+8(%rip) 
        ret 
 

-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16961



* [Bug target/16961] Poor x86-64 performance with 128bit ints
  2004-08-10 13:08 [Bug c/16961] New: Poor x86-64 performance tomstdenis at iahu dot ca
                   ` (6 preceding siblings ...)
  2005-07-18  7:52 ` [Bug target/16961] Poor x86-64 performance with 128bit ints steven at gcc dot gnu dot org
@ 2005-07-18  8:47 ` steven at gcc dot gnu dot org
  2005-07-18 13:42 ` jh at suse dot cz
  2005-07-19 15:06 ` falk at debian dot org
  9 siblings, 0 replies; 11+ messages in thread
From: steven at gcc dot gnu dot org @ 2005-07-18  8:47 UTC (permalink / raw)
  To: gcc-bugs


------- Additional Comments From steven at gcc dot gnu dot org  2005-07-18 07:56 -------
The code for the second test case is also much better.  The code produced 
for the test3 case does not look like what you want it to produce.  Probably 
the inline asm constraints are not correct. 
 
Note that I'm only looking at mainline. 

-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16961



* [Bug target/16961] Poor x86-64 performance with 128bit ints
  2004-08-10 13:08 [Bug c/16961] New: Poor x86-64 performance tomstdenis at iahu dot ca
                   ` (7 preceding siblings ...)
  2005-07-18  8:47 ` steven at gcc dot gnu dot org
@ 2005-07-18 13:42 ` jh at suse dot cz
  2005-07-19 15:06 ` falk at debian dot org
  9 siblings, 0 replies; 11+ messages in thread
From: jh at suse dot cz @ 2005-07-18 13:42 UTC (permalink / raw)
  To: gcc-bugs


------- Additional Comments From jh at suse dot cz  2005-07-18 12:45 -------
Subject: Re:  Poor x86-64 performance with 128bit ints

> 
> ------- Additional Comments From steven at gcc dot gnu dot org  2005-07-18 07:47 -------
> The 128-bit arithmetic has improved now: 
>  
> typedef unsigned long      mp_word __attribute__ ((mode(TI)));  
> mp_word a, b;  
> void test(void) { a += b; }  
>  
> test: 
>         movq    a(%rip), %rax 
>         addq    b(%rip), %rax 
>         movq    a+8(%rip), %rdx 
>         adcq    b+8(%rip), %rdx 
>         movq    %rax, a(%rip) 
>         movq    %rdx, a+8(%rip) 
>         ret 

I think the PR should be closed now that Jan has added the 128-bit arithmetic
patterns I originally skipped in the x86-64 port, as they were killing my
32-bit cross compiler at the time :)
At least we should now perform no worse than i386 does on 64-bit math (which
of course sucks ;)

Honza
>  
> 
> -- 
> 
> 
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16961
> 
> ------- You are receiving this mail because: -------
> You are on the CC list for the bug, or are watching someone who is.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16961



* [Bug target/16961] Poor x86-64 performance with 128bit ints
  2004-08-10 13:08 [Bug c/16961] New: Poor x86-64 performance tomstdenis at iahu dot ca
                   ` (8 preceding siblings ...)
  2005-07-18 13:42 ` jh at suse dot cz
@ 2005-07-19 15:06 ` falk at debian dot org
  9 siblings, 0 replies; 11+ messages in thread
From: falk at debian dot org @ 2005-07-19 15:06 UTC (permalink / raw)
  To: gcc-bugs


------- Additional Comments From falk at debian dot org  2005-07-19 14:12 -------
The unrolling part of the report was moved to PR 16962, and the 128-bit part is
fixed, so closing.


-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16961



