public inbox for gcc-bugs@sourceware.org
* [Bug c/16961] New: Poor x86-64 performance
@ 2004-08-10 13:08 tomstdenis at iahu dot ca
2004-08-10 13:27 ` [Bug c/16961] " falk at debian dot org
` (9 more replies)
0 siblings, 10 replies; 11+ messages in thread
From: tomstdenis at iahu dot ca @ 2004-08-10 13:08 UTC (permalink / raw)
To: gcc-bugs
On the AMD64 using "-march=k8" we see really poor performance. My beef comes
down to two issues.
First off, 128-bit unsigned additions are emulated when addq/adcq will do the
job just fine. For example,
typedef unsigned long mp_word __attribute__ ((mode(TI)));
mp_word a, b;
void test(void) { a += b; }
Produces via [-O3 -fomit-frame-pointer -march=k8]
movq a(%rip), %r10
movq b(%rip), %r8
xorl %ecx, %ecx
movq a+8(%rip), %rdi
movq b+8(%rip), %r9
leaq (%r10,%r8), %rax
leaq (%rdi,%r9), %rsi
cmpq %r10, %rax
movq %rax, a(%rip)
setb %cl
leaq (%rcx,%rsi), %rdx
movq %rdx, a+8(%rip)
ret
Which is insane.
The second beef is loop unrolling. Somewhere between the 32-bit and 64-bit
targets it got WAY worse.
In the old method loops could be handled with something like
while (n&3) { do(); update_for_loop(); }
while (n) {
do(); do(); do(); do();
update_for_loop4x();
}
Now I'm seeing
top: goto off[n&7];
off_7: do(); update_for_loop();
off_6: do(); update_for_loop();
off_5: do(); update_for_loop();
off_4: do(); update_for_loop();
...
if (n) goto top;
In my case it's updating pointers in the "update_for_loop()" when it really
doesn't have to.
Tom
--
Summary: Poor x86-64 performance
Product: gcc
Version: 3.4.1
Status: UNCONFIRMED
Severity: normal
Priority: P2
Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: tomstdenis at iahu dot ca
CC: gcc-bugs at gcc dot gnu dot org
GCC build triplet: gcc version 3.4.1 (Gentoo Linux 3.4.1, ssp-3.4-2, pie-
8.7.6.3)
GCC host triplet: Linux timmy 2.6.7-gentoo-r11 #1 Thu Aug 5 01:49:49 UTC
2004 x86_
GCC target triplet: gcc version 3.4.1 (Gentoo Linux 3.4.1, ssp-3.4-2, pie-
8.7.6.3)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16961
* [Bug c/16961] Poor x86-64 performance
2004-08-10 13:08 [Bug c/16961] New: Poor x86-64 performance tomstdenis at iahu dot ca
@ 2004-08-10 13:27 ` falk at debian dot org
2004-08-10 13:38 ` tomstdenis at iahu dot ca
` (8 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: falk at debian dot org @ 2004-08-10 13:27 UTC (permalink / raw)
To: gcc-bugs
------- Additional Comments From falk at debian dot org 2004-08-10 13:27 -------
Please open a second bug report *with a complete example* for the second
problem; we cannot track the bug properly otherwise.
--
What |Removed |Added
----------------------------------------------------------------------------
GCC build triplet|gcc version 3.4.1 (Gentoo |x86_64-linux
|Linux 3.4.1, ssp-3.4-2, pie-|
|8.7.6.3) |
GCC host triplet|Linux timmy 2.6.7-gentoo-r11|x86_64-linux
|#1 Thu Aug 5 01:49:49 UTC |
|2004 x86_ |
GCC target triplet|gcc version 3.4.1 (Gentoo |x86_64-linux
|Linux 3.4.1, ssp-3.4-2, pie-|
|8.7.6.3) |
* [Bug c/16961] Poor x86-64 performance
2004-08-10 13:08 [Bug c/16961] New: Poor x86-64 performance tomstdenis at iahu dot ca
2004-08-10 13:27 ` [Bug c/16961] " falk at debian dot org
@ 2004-08-10 13:38 ` tomstdenis at iahu dot ca
2004-08-10 13:39 ` tomstdenis at iahu dot ca
` (7 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: tomstdenis at iahu dot ca @ 2004-08-10 13:38 UTC (permalink / raw)
To: gcc-bugs
------- Additional Comments From tomstdenis at iahu dot ca 2004-08-10 13:38 -------
Here's a larger demo:
typedef unsigned long long mp_digit;
typedef unsigned long mp_word __attribute__ ((mode(TI)));
mp_word a, b;
// demo slow 128-bit add
void test(void)
{
a += b;
}
// this unrolls right (but is otherwise inefficient because of the 128-bit add)
void test2(mp_word *out, mp_digit x, mp_digit *y, int n)
{
int z;
for (z = 0; z < n; z++) out[z] += ((mp_word)x) * ((mp_word)*y++);
}
// this unrolls poorly
void test3(mp_word *out, mp_digit x, mp_digit *y, int n)
{
int z;
for (z = 0; z < n; z++) {
asm("movq %0,%%rax\n"
"mulq (%1)\n"
"addq %%rax,(%2)\n"
"adcq %%rdx,8(%2)\n"
::"r"(x), "r"(y), "r"(out) : "%rax", "%rdx");
++out;
++y;
}
}
Which produces
.file "test.c"
.text
.p2align 4,,15
.globl test
.type test, @function
test:
.LFB2:
movq a(%rip), %r10
movq b(%rip), %r8
xorl %ecx, %ecx
movq a+8(%rip), %rdi
movq b+8(%rip), %r9
leaq (%r10,%r8), %rax
leaq (%rdi,%r9), %rsi
cmpq %r10, %rax
movq %rax, a(%rip)
setb %cl
leaq (%rcx,%rsi), %rdx
movq %rdx, a+8(%rip)
ret
.LFE2:
.size test, .-test
.p2align 4,,15
.globl test2
.type test2, @function
test2:
.LFB3:
movq %r13, -24(%rsp)
.LCFI0:
movq %r14, -16(%rsp)
.LCFI1:
movq %rdi, %r11
movq %r15, -8(%rsp)
.LCFI2:
movq %rbx, -48(%rsp)
.LCFI3:
movq %rsi, %r13
movq %rbp, -40(%rsp)
.LCFI4:
movq %r12, -32(%rsp)
.LCFI5:
subq $64, %rsp
.LCFI6:
testl %ecx, %ecx
movq %rdx, %r14
movl %ecx, %r15d
jle .L8
movq %rsi, %rax
movq (%rdi), %r12
movq 8(%rdi), %rdi
mulq (%rdx)
leal -1(%r15), %r10d
xorl %ecx, %ecx
leaq 8(%r14), %rbp
movl %r10d, %ebx
andl $3, %ebx
movq %rdx, %r9
leaq (%r12,%rax), %rdx
leaq (%rdi,%r9), %rsi
cmpq %r12, %rdx
movq %rdx, -8(%rsp)
movq -8(%rsp), %rax
setb %cl
movq %rsi, (%rsp)
addq %rcx, (%rsp)
movq (%rsp), %rdx
movl %r10d, %r12d
movl $16, %r10d
testl %r12d, %r12d
movq %rax, (%r11)
movq %rdx, 8(%r11)
je .L8
testl %ebx, %ebx
je .L6
cmpl $1, %ebx
je .L23
cmpl $2, %ebx
.p2align 4,,5
je .L24
movq %r13, %rax
movq 16(%r11), %rsi
movq 24(%r11), %rdi
mulq 8(%r14)
leaq 16(%r14), %rbp
movb $32, %r10b
leaq (%rsi,%rax), %r12
leaq (%rdi,%rdx), %rcx
xorl %eax, %eax
cmpq %rsi, %r12
movq %rcx, -80(%rsp)
movq %r12, -88(%rsp)
setb %al
addq %rax, -80(%rsp)
movq -88(%rsp), %r14
movq -80(%rsp), %rbx
leal -2(%r15), %r12d
movq %r14, 16(%r11)
movq %rbx, 24(%r11)
.L24:
movq %r13, %rax
movq (%r10,%r11), %rcx
xorl %r8d, %r8d
mulq (%rbp)
addq $8, %rbp
movq %rax, %rdi
movq 8(%r10,%r11), %rax
leaq (%rcx,%rdi), %r9
leaq (%rax,%rdx), %rbx
cmpq %rcx, %r9
movq %r9, -104(%rsp)
setb %r8b
movq -104(%rsp), %rdx
decl %r12d
movq %rbx, -96(%rsp)
addq %r8, -96(%rsp)
movq -96(%rsp), %r15
movq %rdx, (%r10,%r11)
movq %r15, 8(%r10,%r11)
addq $16, %r10
.L23:
movq %r13, %rax
movq 8(%r10,%r11), %r14
xorl %r8d, %r8d
mulq (%rbp)
addq $8, %rbp
movq %rax, %r9
movq (%r10,%r11), %rax
leaq (%r14,%rdx), %rdx
movq %rdx, -112(%rsp)
leaq (%rax,%r9), %rcx
cmpq %rax, %rcx
movq %rcx, -120(%rsp)
movq -120(%rsp), %r15
setb %r8b
addq %r8, -112(%rsp)
movq -112(%rsp), %rsi
movq %r15, (%r10,%r11)
movq %rsi, 8(%r10,%r11)
addq $16, %r10
decl %r12d
je .L8
.p2align 4,,7
.L6:
movq %r13, %rax
movq (%r10,%r11), %rbx
movq (%r10,%r11), %r15
mulq (%rbp)
movq 8(%r10,%r11), %rsi
xorl %r9d, %r9d
movq 16(%r10,%r11), %r8
movq 32(%r10,%r11), %r14
addq %rax, %rbx
movq %r13, %rax
cmpq %r15, %rbx
movq 24(%r10,%r11), %r15
movq %rbx, -24(%rsp)
setb %r9b
addq %rdx, %rsi
movq -24(%rsp), %rcx
movq %rsi, -16(%rsp)
addq %r9, -16(%rsp)
xorl %esi, %esi
movq -16(%rsp), %rdx
movq %rcx, (%r10,%r11)
movq %rdx, 8(%r10,%r11)
mulq 8(%rbp)
addq %rax, %r8
movq 16(%r10,%r11), %rax
movq %r8, -40(%rsp)
movq -40(%rsp), %rdi
cmpq %rax, %r8
movq %r13, %rax
setb %sil
addq %rdx, %r15
movq %rdi, 16(%r10,%r11)
mulq 16(%rbp)
movq %r15, -32(%rsp)
movq 40(%r10,%r11), %r15
addq %rsi, -32(%rsp)
movq -32(%rsp), %r9
movq %r9, 24(%r10,%r11)
movq %rdx, %rbx
movq 32(%r10,%r11), %rdx
movq %rax, %rcx
addq %rcx, %rdx
cmpq %r14, %rdx
movq %rdx, -56(%rsp)
movq -56(%rsp), %rdi
setb %r8b
addq %rbx, %r15
movl %r8d, %eax
movq %r15, -48(%rsp)
xorl %r15d, %r15d
movzbl %al, %esi
addq %rsi, -48(%rsp)
movq %r13, %rax
movq -48(%rsp), %r9
movq %rdi, 32(%r10,%r11)
mulq 24(%rbp)
movq 56(%r10,%r11), %r14
addq $32, %rbp
movq %r9, 40(%r10,%r11)
movq %rax, %rcx
movq 48(%r10,%r11), %rax
movq %rdx, %rbx
leaq (%r14,%rbx), %r8
leaq (%rax,%rcx), %rdx
movq %r8, -64(%rsp)
cmpq %rax, %rdx
movq %rdx, -72(%rsp)
movq -72(%rsp), %rsi
setb %r15b
addq %r15, -64(%rsp)
movq -64(%rsp), %rdi
movq %rsi, 48(%r10,%r11)
movq %rdi, 56(%r10,%r11)
addq $64, %r10
subl $4, %r12d
jne .L6
.p2align 4,,7
.L8:
movq 16(%rsp), %rbx
movq 24(%rsp), %rbp
movq 32(%rsp), %r12
movq 40(%rsp), %r13
movq 48(%rsp), %r14
movq 56(%rsp), %r15
addq $64, %rsp
ret
.LFE3:
.size test2, .-test2
.p2align 4,,15
.globl test3
.type test3, @function
test3:
.LFB4:
pushq %rbp
.LCFI7:
testl %ecx, %ecx
movq %rsi, %r10
movl %ecx, %ebp
pushq %rbx
.LCFI8:
movq %rdi, %rbx
movq %rdx, %rdi
jle .L33
leal -1(%rbp), %ecx
movl %ecx, %esi
andl $7, %esi
#APP
movq %r10,%rax
mulq (%rdi)
addq %rax,(%rbx)
adcq %rdx,8(%rbx)
#NO_APP
testl %ecx, %ecx
leaq 16(%rbx), %r9
leaq 8(%rdi), %r8
movl %ecx, %r11d
je .L33
testl %esi, %esi
je .L31
cmpl $1, %esi
je .L61
cmpl $2, %esi
.p2align 4,,5
je .L62
cmpl $3, %esi
.p2align 4,,5
je .L63
cmpl $4, %esi
.p2align 4,,5
je .L64
cmpl $5, %esi
.p2align 4,,5
je .L65
cmpl $6, %esi
.p2align 4,,5
je .L66
#APP
movq %r10,%rax
mulq (%r8)
addq %rax,(%r9)
adcq %rdx,8(%r9)
#NO_APP
leaq 32(%rbx), %r9
leaq 16(%rdi), %r8
leal -2(%rbp), %r11d
.L66:
#APP
movq %r10,%rax
mulq (%r8)
addq %rax,(%r9)
adcq %rdx,8(%r9)
#NO_APP
addq $16, %r9
addq $8, %r8
decl %r11d
.L65:
#APP
movq %r10,%rax
mulq (%r8)
addq %rax,(%r9)
adcq %rdx,8(%r9)
#NO_APP
addq $16, %r9
addq $8, %r8
decl %r11d
.L64:
#APP
movq %r10,%rax
mulq (%r8)
addq %rax,(%r9)
adcq %rdx,8(%r9)
#NO_APP
addq $16, %r9
addq $8, %r8
decl %r11d
.L63:
#APP
movq %r10,%rax
mulq (%r8)
addq %rax,(%r9)
adcq %rdx,8(%r9)
#NO_APP
addq $16, %r9
addq $8, %r8
decl %r11d
.L62:
#APP
movq %r10,%rax
mulq (%r8)
addq %rax,(%r9)
adcq %rdx,8(%r9)
#NO_APP
addq $16, %r9
addq $8, %r8
decl %r11d
.L61:
#APP
movq %r10,%rax
mulq (%r8)
addq %rax,(%r9)
adcq %rdx,8(%r9)
#NO_APP
addq $16, %r9
addq $8, %r8
decl %r11d
je .L33
.L31:
#APP
movq %r10,%rax
mulq (%r8)
addq %rax,(%r9)
adcq %rdx,8(%r9)
#NO_APP
leaq 16(%r9), %rsi
leaq 8(%r8), %rbp
#APP
movq %r10,%rax
mulq (%rbp)
addq %rax,(%rsi)
adcq %rdx,8(%rsi)
#NO_APP
leaq 32(%r9), %rdi
leaq 16(%r8), %rbx
#APP
movq %r10,%rax
mulq (%rbx)
addq %rax,(%rdi)
adcq %rdx,8(%rdi)
#NO_APP
leaq 48(%r9), %rcx
leaq 24(%r8), %rbp
#APP
movq %r10,%rax
mulq (%rbp)
addq %rax,(%rcx)
adcq %rdx,8(%rcx)
#NO_APP
leaq 64(%r9), %rsi
leaq 32(%r8), %rdi
#APP
movq %r10,%rax
mulq (%rdi)
addq %rax,(%rsi)
adcq %rdx,8(%rsi)
#NO_APP
leaq 80(%r9), %rbx
leaq 40(%r8), %rcx
#APP
movq %r10,%rax
mulq (%rcx)
addq %rax,(%rbx)
adcq %rdx,8(%rbx)
#NO_APP
leaq 96(%r9), %rbp
leaq 48(%r8), %rdi
#APP
movq %r10,%rax
mulq (%rdi)
addq %rax,(%rbp)
adcq %rdx,8(%rbp)
#NO_APP
leaq 112(%r9), %rsi
leaq 56(%r8), %rbx
#APP
movq %r10,%rax
mulq (%rbx)
addq %rax,(%rsi)
adcq %rdx,8(%rsi)
#NO_APP
subq $-128, %r9
addq $64, %r8
subl $8, %r11d
jne .L31
.L33:
popq %rbx
popq %rbp
ret
.LFE4:
.size test3, .-test3
.comm a,16,16
.comm b,16,16
.section .eh_frame,"a",@progbits
.Lframe1:
.long .LECIE1-.LSCIE1
.LSCIE1:
.long 0x0
.byte 0x1
.string ""
.uleb128 0x1
.sleb128 -8
.byte 0x10
.byte 0xc
.uleb128 0x7
.uleb128 0x8
.byte 0x90
.uleb128 0x1
.align 8
.LECIE1:
.LSFDE1:
.long .LEFDE1-.LASFDE1
.LASFDE1:
.long .LASFDE1-.Lframe1
.quad .LFB2
.quad .LFE2-.LFB2
.align 8
.LEFDE1:
.LSFDE3:
.long .LEFDE3-.LASFDE3
.LASFDE3:
.long .LASFDE3-.Lframe1
.quad .LFB3
.quad .LFE3-.LFB3
.byte 0x4
.long .LCFI3-.LFB3
.byte 0x83
.uleb128 0x7
.byte 0x8f
.uleb128 0x2
.byte 0x8e
.uleb128 0x3
.byte 0x8d
.uleb128 0x4
.byte 0x4
.long .LCFI6-.LCFI3
.byte 0xe
.uleb128 0x48
.byte 0x8c
.uleb128 0x5
.byte 0x86
.uleb128 0x6
.align 8
.LEFDE3:
.LSFDE5:
.long .LEFDE5-.LASFDE5
.LASFDE5:
.long .LASFDE5-.Lframe1
.quad .LFB4
.quad .LFE4-.LFB4
.byte 0x4
.long .LCFI7-.LFB4
.byte 0xe
.uleb128 0x10
.byte 0x86
.uleb128 0x2
.byte 0x4
.long .LCFI8-.LCFI7
.byte 0xe
.uleb128 0x18
.byte 0x83
.uleb128 0x3
.align 8
.LEFDE5:
.section .note.GNU-stack,"",@progbits
.ident "GCC: (GNU) 3.4.1 (Gentoo Linux 3.4.1, ssp-3.4-2,
pie-8.7.6.3)"
* [Bug c/16961] Poor x86-64 performance
2004-08-10 13:08 [Bug c/16961] New: Poor x86-64 performance tomstdenis at iahu dot ca
2004-08-10 13:27 ` [Bug c/16961] " falk at debian dot org
2004-08-10 13:38 ` tomstdenis at iahu dot ca
@ 2004-08-10 13:39 ` tomstdenis at iahu dot ca
2004-08-10 13:58 ` [Bug target/16961] " falk at debian dot org
` (6 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: tomstdenis at iahu dot ca @ 2004-08-10 13:39 UTC (permalink / raw)
To: gcc-bugs
------- Additional Comments From tomstdenis at iahu dot ca 2004-08-10 13:39 -------
I used
gcc -O3 -fomit-frame-pointer -funroll-loops -march=k8 -m64 -S test.c
to produce that asm code, btw.
* [Bug target/16961] Poor x86-64 performance
2004-08-10 13:08 [Bug c/16961] New: Poor x86-64 performance tomstdenis at iahu dot ca
` (2 preceding siblings ...)
2004-08-10 13:39 ` tomstdenis at iahu dot ca
@ 2004-08-10 13:58 ` falk at debian dot org
2004-08-10 14:09 ` tomstdenis at iahu dot ca
` (5 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: falk at debian dot org @ 2004-08-10 13:58 UTC (permalink / raw)
To: gcc-bugs
------- Additional Comments From falk at debian dot org 2004-08-10 13:58 -------
Okay, as to the TImode problem, this is target-specific. I'm not familiar with
i386, but I have a very hard time believing using the carry flag would lead
to a noticeable speedup here... oh well.
--
What |Removed |Added
----------------------------------------------------------------------------
Severity|normal |minor
Component|c |target
Keywords| |missed-optimization
* [Bug target/16961] Poor x86-64 performance
2004-08-10 13:08 [Bug c/16961] New: Poor x86-64 performance tomstdenis at iahu dot ca
` (3 preceding siblings ...)
2004-08-10 13:58 ` [Bug target/16961] " falk at debian dot org
@ 2004-08-10 14:09 ` tomstdenis at iahu dot ca
2004-12-19 15:00 ` steven at gcc dot gnu dot org
` (4 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: tomstdenis at iahu dot ca @ 2004-08-10 14:09 UTC (permalink / raw)
To: gcc-bugs
------- Additional Comments From tomstdenis at iahu dot ca 2004-08-10 14:09 -------
(In reply to comment #4)
> Okay, as to the TImode problem, this is target specific. I'm not familiar
with
> i386, but I have a very hard time believing using the carry flag would lead
> to a noticeable speedup here... oh well.
Um, it is. The 10 instructions GCC emits now consume decode bandwidth, take
execution time, fill the cache, etc...
Admittedly this isn't a "huge" problem, because most code won't be doing 128-bit
math, but if the goal is to make GCC the best it can be, someone might as well
fix this up.
* [Bug target/16961] Poor x86-64 performance
2004-08-10 13:08 [Bug c/16961] New: Poor x86-64 performance tomstdenis at iahu dot ca
` (4 preceding siblings ...)
2004-08-10 14:09 ` tomstdenis at iahu dot ca
@ 2004-12-19 15:00 ` steven at gcc dot gnu dot org
2005-07-18 7:52 ` [Bug target/16961] Poor x86-64 performance with 128bit ints steven at gcc dot gnu dot org
` (3 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: steven at gcc dot gnu dot org @ 2004-12-19 15:00 UTC (permalink / raw)
To: gcc-bugs
------- Additional Comments From steven at gcc dot gnu dot org 2004-12-19 14:59 -------
This is similar to the "long long" problem for 32-bit x86 targets. We
keep the instructions in TImode all the way down until flow2, which made
sense in the pre-GCC 4 era, when this was the only way to enable
optimization of arithmetic in machine modes not representable on the target
machine. With the new high-level optimizers we don't really need this
anymore; we should just lower to machine instructions in expand and let
the RTL passes do their job and optimize this better.
--
What |Removed |Added
----------------------------------------------------------------------------
CC| |jh at suse dot cz
Status|UNCONFIRMED |NEW
Ever Confirmed| |1
Last reconfirmed|0000-00-00 00:00:00 |2004-12-19 14:59:53
date| |
* [Bug target/16961] Poor x86-64 performance with 128bit ints
2004-08-10 13:08 [Bug c/16961] New: Poor x86-64 performance tomstdenis at iahu dot ca
` (5 preceding siblings ...)
2004-12-19 15:00 ` steven at gcc dot gnu dot org
@ 2005-07-18 7:52 ` steven at gcc dot gnu dot org
2005-07-18 8:47 ` steven at gcc dot gnu dot org
` (2 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: steven at gcc dot gnu dot org @ 2005-07-18 7:52 UTC (permalink / raw)
To: gcc-bugs
------- Additional Comments From steven at gcc dot gnu dot org 2005-07-18 07:47 -------
The 128-bit arithmetic has improved now:
typedef unsigned long mp_word __attribute__ ((mode(TI)));
mp_word a, b;
void test(void) { a += b; }
test:
movq a(%rip), %rax
addq b(%rip), %rax
movq a+8(%rip), %rdx
adcq b+8(%rip), %rdx
movq %rax, a(%rip)
movq %rdx, a+8(%rip)
ret
* [Bug target/16961] Poor x86-64 performance with 128bit ints
2004-08-10 13:08 [Bug c/16961] New: Poor x86-64 performance tomstdenis at iahu dot ca
` (6 preceding siblings ...)
2005-07-18 7:52 ` [Bug target/16961] Poor x86-64 performance with 128bit ints steven at gcc dot gnu dot org
@ 2005-07-18 8:47 ` steven at gcc dot gnu dot org
2005-07-18 13:42 ` jh at suse dot cz
2005-07-19 15:06 ` falk at debian dot org
9 siblings, 0 replies; 11+ messages in thread
From: steven at gcc dot gnu dot org @ 2005-07-18 8:47 UTC (permalink / raw)
To: gcc-bugs
------- Additional Comments From steven at gcc dot gnu dot org 2005-07-18 07:56 -------
The code for the second test case is also much better. The code produced
for the test3 case does not look like what you want it to produce. Probably
the inline asm constraints are not correct.
Note that I'm only looking at mainline.
* [Bug target/16961] Poor x86-64 performance with 128bit ints
2004-08-10 13:08 [Bug c/16961] New: Poor x86-64 performance tomstdenis at iahu dot ca
` (7 preceding siblings ...)
2005-07-18 8:47 ` steven at gcc dot gnu dot org
@ 2005-07-18 13:42 ` jh at suse dot cz
2005-07-19 15:06 ` falk at debian dot org
9 siblings, 0 replies; 11+ messages in thread
From: jh at suse dot cz @ 2005-07-18 13:42 UTC (permalink / raw)
To: gcc-bugs
------- Additional Comments From jh at suse dot cz 2005-07-18 12:45 -------
Subject: Re: Poor x86-64 performance with 128bit ints
>
> ------- Additional Comments From steven at gcc dot gnu dot org 2005-07-18 07:47 -------
> The 128 bits arithmetic has improved now:
>
> typedef unsigned long mp_word __attribute__ ((mode(TI)));
> mp_word a, b;
> void test(void) { a += b; }
>
> test:
> movq a(%rip), %rax
> addq b(%rip), %rax
> movq a+8(%rip), %rdx
> adcq b+8(%rip), %rdx
> movq %rax, a(%rip)
> movq %rdx, a+8(%rip)
> ret
I think the PR should be closed now that Jan added the 128-bit arithmetic
patterns I originally skipped in the x86-64 port, as they were killing my
32-bit cross compiler at that time :)
At least we should now perform no worse than i386 on 64-bit math (which
sucks of course ;)
Honza
* [Bug target/16961] Poor x86-64 performance with 128bit ints
2004-08-10 13:08 [Bug c/16961] New: Poor x86-64 performance tomstdenis at iahu dot ca
` (8 preceding siblings ...)
2005-07-18 13:42 ` jh at suse dot cz
@ 2005-07-19 15:06 ` falk at debian dot org
9 siblings, 0 replies; 11+ messages in thread
From: falk at debian dot org @ 2005-07-19 15:06 UTC (permalink / raw)
To: gcc-bugs
------- Additional Comments From falk at debian dot org 2005-07-19 14:12 -------
The unrolling part of the report was moved to PR 16962, and the 128-bit part is
fixed, so closing.
--
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED