* [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
@ 2007-05-16 20:25 Uros Bizjak
2007-05-16 20:34 ` Chris Lattner
0 siblings, 1 reply; 36+ messages in thread
From: Uros Bizjak @ 2007-05-16 20:25 UTC (permalink / raw)
To: GCC Patches
[-- Attachment #1: Type: text/plain, Size: 736 bytes --]
Hello!
This patch adds 128bit operations for x86_64 to longlong.h to speed up
TImode and TFmode arithmetic. The patch also redefines i386's
definitions of count_trailing/leading_zeros from asm to
__builtin_ctz/__builtin_clz builtins, as provided by the i386 backend.
The patch was bootstrapped on x86_64-pc-linux-gnu and regression tested for
all default languages, with and without -m32.
2007-05-16 Uros Bizjak <ubizjak@gmail.com>
* longlong.h (__x86_64__): Add definitions for add_ssaaaa,
sub_ddmmss, umul_ppmm, udiv_qrnnd, count_leading_zeros and
count_trailing_zeros.
(__i386__): Implement count_leading_zeros using __builtin_clz().
Implement count_trailing_zeros using __builtin_ctz().
Uros.
[-- Attachment #2: x86_64-longlong.diff --]
[-- Type: text/x-patch, Size: 2243 bytes --]
Index: longlong.h
===================================================================
--- longlong.h (revision 124771)
+++ longlong.h (working copy)
@@ -341,19 +341,48 @@
: "0" ((USItype) (n0)), \
"1" ((USItype) (n1)), \
"rm" ((USItype) (dv)))
-#define count_leading_zeros(count, x) \
- do { \
- USItype __cbtmp; \
- __asm__ ("bsrl %1,%0" \
- : "=r" (__cbtmp) : "rm" ((USItype) (x))); \
- (count) = __cbtmp ^ 31; \
- } while (0)
-#define count_trailing_zeros(count, x) \
- __asm__ ("bsfl %1,%0" : "=r" (count) : "rm" ((USItype)(x)))
+#define count_leading_zeros(count, x) ((count) = __builtin_clz (x))
+#define count_trailing_zeros(count, x) ((count) = __builtin_ctz (x))
#define UMUL_TIME 40
#define UDIV_TIME 40
#endif /* 80x86 */
+#if (defined (__x86_64__) || defined (__i386__)) && W_TYPE_SIZE == 64
+#define add_ssaaaa(sh, sl, ah, al, bh, bl) \
+ __asm__ ("addq %5,%1\n\tadcq %3,%0" \
+ : "=r" ((UDItype) (sh)), \
+ "=&r" ((UDItype) (sl)) \
+ : "%0" ((UDItype) (ah)), \
+ "rem" ((UDItype) (bh)), \
+ "%1" ((UDItype) (al)), \
+ "rem" ((UDItype) (bl)))
+#define sub_ddmmss(sh, sl, ah, al, bh, bl) \
+ __asm__ ("subq %5,%1\n\tsbbq %3,%0" \
+ : "=r" ((UDItype) (sh)), \
+ "=&r" ((UDItype) (sl)) \
+ : "0" ((UDItype) (ah)), \
+ "rem" ((UDItype) (bh)), \
+ "1" ((UDItype) (al)), \
+ "rem" ((UDItype) (bl)))
+#define umul_ppmm(w1, w0, u, v) \
+ __asm__ ("mulq %3" \
+ : "=a" ((UDItype) (w0)), \
+ "=d" ((UDItype) (w1)) \
+ : "%0" ((UDItype) (u)), \
+ "rm" ((UDItype) (v)))
+#define udiv_qrnnd(q, r, n1, n0, dv) \
+ __asm__ ("divq %4" \
+ : "=a" ((UDItype) (q)), \
+ "=d" ((UDItype) (r)) \
+ : "0" ((UDItype) (n0)), \
+ "1" ((UDItype) (n1)), \
+ "rm" ((UDItype) (dv)))
+#define count_leading_zeros(count, x) ((count) = __builtin_clzl (x))
+#define count_trailing_zeros(count, x) ((count) = __builtin_ctzl (x))
+#define UMUL_TIME 40
+#define UDIV_TIME 40
+#endif /* x86_64 */
+
#if defined (__i960__) && W_TYPE_SIZE == 32
#define umul_ppmm(w1, w0, u, v) \
({union {UDItype __ll; \
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
2007-05-16 20:25 [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations Uros Bizjak
@ 2007-05-16 20:34 ` Chris Lattner
2007-05-17 5:57 ` Uros Bizjak
0 siblings, 1 reply; 36+ messages in thread
From: Chris Lattner @ 2007-05-16 20:34 UTC (permalink / raw)
To: Uros Bizjak; +Cc: GCC Patches
On May 16, 2007, at 1:23 PM, Uros Bizjak wrote:
> Hello!
> This patch adds 128bit operations for x86_64 to longlong.h to speed
> up TImode and TFmode arithmetic.
Why not implement these in terms of TImode operations themselves?
Inline asm should be avoided when it can be :)
-Chris
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
2007-05-16 20:34 ` Chris Lattner
@ 2007-05-17 5:57 ` Uros Bizjak
2007-05-17 6:09 ` Eric Christopher
2007-05-17 6:43 ` Chris Lattner
0 siblings, 2 replies; 36+ messages in thread
From: Uros Bizjak @ 2007-05-17 5:57 UTC (permalink / raw)
To: Chris Lattner; +Cc: GCC Patches
On 5/16/07, Chris Lattner <clattner@apple.com> wrote:
> > This patch adds 128bit operations for x86_64 to longlong.h to speed
> > up TImode and TFmode arithmetic.
>
> Why not implement these in terms of TImode operations themselves?
> Inline asm should be avoided when it can be :)
Unfortunately, longlong.h definitions require their operands to be in
DImode (for the x86_64 target). I guess it is not a coincidence that
these definitions can be implemented with exactly one x86 insn (two
for carry-propagating addition and subtraction). Asm with one insn is
not harmful.
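[Editorial note: for readers unfamiliar with the longlong.h conventions, the semantics of add_ssaaaa can be modeled in portable C. This is an illustrative sketch, not the longlong.h code itself; the function name is made up, and unsigned __int128 (a GCC extension) is used only to state what the one-addq-plus-one-adcq asm computes.]

```c
#include <stdint.h>

typedef unsigned __int128 u128;  /* GCC extension, for illustration only */

/* Model of add_ssaaaa: (sh:sl) = (ah:al) + (bh:bl), where each
   128-bit value is passed as two 64-bit (DImode) halves.  The asm
   macro does the same job with one addq plus one adcq.  */
static void add_ssaaaa_model (uint64_t *sh, uint64_t *sl,
                              uint64_t ah, uint64_t al,
                              uint64_t bh, uint64_t bl)
{
  u128 a = ((u128) ah << 64) | al;
  u128 b = ((u128) bh << 64) | bl;
  u128 s = a + b;

  *sl = (uint64_t) s;         /* low DImode half */
  *sh = (uint64_t) (s >> 64); /* high DImode half, with carry folded in */
}
```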
Uros.
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
2007-05-17 5:57 ` Uros Bizjak
@ 2007-05-17 6:09 ` Eric Christopher
2007-05-17 6:12 ` Andrew Pinski
2007-05-17 6:43 ` Chris Lattner
1 sibling, 1 reply; 36+ messages in thread
From: Eric Christopher @ 2007-05-17 6:09 UTC (permalink / raw)
To: Uros Bizjak; +Cc: Chris Lattner, GCC Patches
On May 16, 2007, at 10:57 PM, Uros Bizjak wrote:
> On 5/16/07, Chris Lattner <clattner@apple.com> wrote:
>
>> > This patch adds 128bit operations for x86_64 to longlong.h to speed
>> > up TImode and TFmode arithmetic.
>>
>> Why not implement these in terms of TImode operations themselves?
>> Inline asm should be avoided when it can be :)
>
> Unfortunately, longlong.h definitions require their operands to be in
> DImode (for the x86_64 target). I guess it is not a coincidence that
> these definitions can be implemented with exactly one x86 insn (two
> for carry-propagating addition and subtraction). Asm with one insn is
> not harmful.
Don't forget that asms are often barriers to optimization, so the fewer
asms, the better.
That said, I don't off the top of my head have a better
implementation :)
-eric
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
2007-05-17 6:09 ` Eric Christopher
@ 2007-05-17 6:12 ` Andrew Pinski
2007-05-17 6:22 ` Uros Bizjak
2007-05-17 11:13 ` Uros Bizjak
0 siblings, 2 replies; 36+ messages in thread
From: Andrew Pinski @ 2007-05-17 6:12 UTC (permalink / raw)
To: Eric Christopher; +Cc: Uros Bizjak, Chris Lattner, GCC Patches
On 5/16/07, Eric Christopher <echristo@apple.com> wrote:
>
> Don't forget that asms are often barriers to optimization, so the fewer
> asms, the better.
>
> That said, I don't off the top of my head have a better
> implementation :)
Oh, and longlong.h belongs to GMP, so you should submit your changes
upstream too. Plus, the uses of what is in longlong.h for GCC
are only in the support library, and who uses TImode anyway :).
Thanks,
Andrew Pinski
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
2007-05-17 6:12 ` Andrew Pinski
@ 2007-05-17 6:22 ` Uros Bizjak
2007-05-17 11:13 ` Uros Bizjak
1 sibling, 0 replies; 36+ messages in thread
From: Uros Bizjak @ 2007-05-17 6:22 UTC (permalink / raw)
To: Andrew Pinski; +Cc: Eric Christopher, Chris Lattner, GCC Patches
On 5/17/07, Andrew Pinski <pinskia@gmail.com> wrote:
> > Don't forget that asms are often barriers to optimization, so the
> > fewer asms, the better.
Yes, but please note that these are not __volatile__.
> > That said, I don't off the top of my head have a better
> > implementation :)
>
> Oh, and longlong.h belongs to GMP, so you should submit your changes
> upstream too. Plus, the uses of what is in longlong.h for GCC
Thanks, I'll submit it there.
> are only in the support library, and who uses TImode anyway :).
And 640k should be enough for everybody ;]
Uros.
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
2007-05-17 5:57 ` Uros Bizjak
2007-05-17 6:09 ` Eric Christopher
@ 2007-05-17 6:43 ` Chris Lattner
2007-05-17 7:22 ` Paolo Bonzini
1 sibling, 1 reply; 36+ messages in thread
From: Chris Lattner @ 2007-05-17 6:43 UTC (permalink / raw)
To: Uros Bizjak; +Cc: GCC Patches
On May 16, 2007, at 10:57 PM, Uros Bizjak wrote:
> On 5/16/07, Chris Lattner <clattner@apple.com> wrote:
>
>> > This patch adds 128bit operations for x86_64 to longlong.h to speed
>> > up TImode and TFmode arithmetic.
>>
>> Why not implement these in terms of TImode operations themselves?
>> Inline asm should be avoided when it can be :)
>
> Unfortunately, longlong.h definitions require their operands to be in
> DImode (for the x86_64 target). I guess it is not a coincidence that
> these definitions can be implemented with exactly one x86 insn (two
> for carry-propagating addition and subtraction). Asm with one insn is
> not harmful.
Just combine them with logical ops. Something like this:
typedef unsigned DI __attribute__((mode(DI)));
typedef unsigned TI __attribute__((mode(TI)));
void test_add(DI AL, DI AH, DI BL, DI BH, DI *RL, DI *RH) {
TI A = AL | ((TI)AH << 64);
TI B = BL | ((TI)BH << 64);
TI C = A + B;
*RL = (DI)C;
*RH = (DI)(C >> 64);
}
void test_sub(DI AL, DI AH, DI BL, DI BH, DI *RL, DI *RH) {
TI A = AL | ((TI)AH << 64);
TI B = BL | ((TI)BH << 64);
TI C = A - B;
*RL = (DI)C;
*RH = (DI)(C >> 64);
}
Should produce something like this:
_test_add:
addq %rdi, %rdi
adcq %rsi, %rcx
movq %rdi, (%r8)
movq %rcx, (%r9)
ret
.align 4
.globl _test_sub
_test_sub:
subq %rdx, %rdi
sbbq %rcx, %rsi
movq %rdi, (%r8)
movq %rsi, (%r9)
ret
.subsections_via_symbols
Inline asms are bad for a number of reasons: as Andrew mentioned,
they are scheduling barriers; the compiler also can't reason about
them, constant fold or simplify them, can't know the sizes of the
instructions precisely, etc.
If your compiler isn't producing decent code for simple idioms like
this, it seems profitable to fix the compiler, then write portable[1]
C code to express these things. The compiler already knows how to
efficiently implement 128-bit arithmetic; why re-implement the smarts
of the compiler in a header?
-Chris
[1] portable across 64-bit GCC targets, not across other compilers of
course. However, gcc-style inline asm isn't portable to other
compilers either.
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
2007-05-17 6:43 ` Chris Lattner
@ 2007-05-17 7:22 ` Paolo Bonzini
2007-05-17 9:25 ` Uros Bizjak
0 siblings, 1 reply; 36+ messages in thread
From: Paolo Bonzini @ 2007-05-17 7:22 UTC (permalink / raw)
To: Chris Lattner; +Cc: Uros Bizjak, GCC Patches
> void test_add(DI AL, DI AH, DI BL, DI BH, DI *RL, DI *RH) {
> TI A = AL | ((TI)AH << 64);
> TI B = BL | ((TI)BH << 64);
>
> TI C = A + B;
> *RL = (DI)C;
> *RH = (DI)(C >> 64);
> }
>
> Should produce something like this:
>
> _test_add:
> addq %rdi, %rdi
> adcq %rsi, %rcx
> movq %rdi, (%r8)
> movq %rcx, (%r9)
> ret
Right. This probably wasn't true before the lower-subreg pass.
Paolo
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
2007-05-17 7:22 ` Paolo Bonzini
@ 2007-05-17 9:25 ` Uros Bizjak
2007-05-17 11:15 ` Rask Ingemann Lambertsen
2007-05-17 16:49 ` Chris Lattner
0 siblings, 2 replies; 36+ messages in thread
From: Uros Bizjak @ 2007-05-17 9:25 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: Chris Lattner, GCC Patches
On 5/17/07, Paolo Bonzini <bonzini@gnu.org> wrote:
> > void test_add(DI AL, DI AH, DI BL, DI BH, DI *RL, DI *RH) {
> > TI A = AL | ((TI)AH << 64);
> > TI B = BL | ((TI)BH << 64);
> >
> > TI C = A + B;
> > *RL = (DI)C;
> > *RH = (DI)(C >> 64);
> > }
> >
> > Should produce something like this:
> >
> > _test_add:
> > addq %rdi, %rdi
> > adcq %rsi, %rcx
> > movq %rdi, (%r8)
> > movq %rcx, (%r9)
> > ret
>
> Right. This probably wasn't true before the lower-subreg pass.
Unfortunately, this approach confuses gcc a lot. The attached testcase
illustrates the problem on i686 (substitute SI with DI and DI with
TI on 64-bit, and change the shift values, for the same result):
--cut here--
typedef unsigned SI __attribute__ ((mode (SI)));
typedef unsigned DI __attribute__ ((mode (DI)));
#define add_ssaaaa_c(sh, sl, ah, al, bh, bl) \
{ \
DI __a = (al) | ((DI) ah << 32); \
DI __b = (bl) | ((DI) bh << 32); \
\
DI __c = __a + __b; \
\
(sl) = (SI) (__c & 0xffffffff); \
(sh) = (SI) (__c >> 32); \
}
#define add_ssaaaa_asm(sh, sl, ah, al, bh, bl) \
__asm__ ("addl %5,%1\n\tadcl %3,%0" \
: "=r" ((SI) (sh)), \
"=&r" ((SI) (sl)) \
: "%0" ((SI) (ah)), \
"g" ((SI) (bh)), \
"%1" ((SI) (al)), \
"g" ((SI) (bl)))
void test_c (SI a, SI b, SI c, SI d)
{
volatile SI x, y;
add_ssaaaa_c (x, y, a, b, c, d);
}
void test_asm (SI a, SI b, SI c, SI d)
{
volatile SI x, y;
add_ssaaaa_asm (x, y, a, b, c, d);
}
--cut here--
Comparing the two approaches head-to-head on the i686 target, gcc -O2
-fomit-frame-pointer generates:
test_c:
subl $28, %esp #,
xorl %edx, %edx #
movl 40(%esp), %eax # c, tmp66
movl 44(%esp), %ecx # d, d
movl %esi, 20(%esp) #,
movl %ebx, 16(%esp) #,
movl %eax, %edx # tmp66,
movl $0, %eax #, tmp66
movl %eax, %esi #, tmp74
movl 32(%esp), %eax # a, tmp70
movl %edx, %ebx # tmp75, __c
orl %ecx, %esi # d, tmp74
xorl %edx, %edx #
movl %esi, %ecx # tmp74, __c
movl 36(%esp), %esi # b, b
movl %edi, 24(%esp) #,
movl 24(%esp), %edi #,
movl %eax, %edx # tmp70,
movl $0, %eax #, tmp70
orl %esi, %eax # b, tmp72
movl 20(%esp), %esi #,
addl %eax, %ecx # tmp72, __c
adcl %edx, %ebx #, __c
movl %ecx, 8(%esp) # __c, y
movl %ebx, %ecx # __c, __c
xorl %ebx, %ebx # __c
movl 16(%esp), %ebx #,
movl %ecx, 12(%esp) # __c, x
addl $28, %esp #,
ret
and on the other hand for asm approach:
test_asm:
subl $16, %esp #,
movl 20(%esp), %eax # a, a
movl 24(%esp), %edx # b, b
#APP
addl 32(%esp),%edx # d, tmp63
adcl 28(%esp),%eax # c, tmp62
#NO_APP
movl %eax, 12(%esp) # tmp62, x
movl %edx, 8(%esp) # tmp63, y
addl $16, %esp #,
ret
Although the addition itself generates the expected code, assembling SImode
values into DImode really confuses gcc. Also, the restore of %ebx from the
stack interferes with the final DImode->SImode split in a really bad way.
Looking into this mess, I think that gcc still has a long way to go
w.r.t. DImode values on 32-bit targets (and TImode values on 64-bit
targets), and the only solution for the near future is using asm
here...
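[Editorial note: the C macro from the testcase is semantically fine — the complaint above is purely about codegen — and it can be checked host-side in isolation. A minimal harness, with the mode-attribute typedefs replaced by stdint.h types and the volatile qualifiers and the asm variant omitted:]

```c
#include <stdint.h>

typedef uint32_t SI;
typedef uint64_t DI;

/* The add_ssaaaa_c macro from the testcase, retyped with stdint.h
   types so it can be compiled and checked on any host.  */
#define add_ssaaaa_c(sh, sl, ah, al, bh, bl)            \
  {                                                     \
    DI __a = (al) | ((DI) (ah) << 32);                  \
    DI __b = (bl) | ((DI) (bh) << 32);                  \
    DI __c = __a + __b;                                 \
    (sl) = (SI) (__c & 0xffffffff);                     \
    (sh) = (SI) (__c >> 32);                            \
  }
```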
Uros.
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
2007-05-17 6:12 ` Andrew Pinski
2007-05-17 6:22 ` Uros Bizjak
@ 2007-05-17 11:13 ` Uros Bizjak
1 sibling, 0 replies; 36+ messages in thread
From: Uros Bizjak @ 2007-05-17 11:13 UTC (permalink / raw)
To: Andrew Pinski; +Cc: Eric Christopher, Chris Lattner, GCC Patches
On 5/17/07, Andrew Pinski <pinskia@gmail.com> wrote:
> Oh, and longlong.h belongs to GMP, so you should submit your changes
> upstream too. Plus, the uses of what is in longlong.h for GCC
> are only in the support library, and who uses TImode anyway :).
Hm, in gmp-4.2's longlong.h, the same code to support x86_64 is already present.
Perhaps it is time to update longlong.h from the GMP source.
Uros.
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
2007-05-17 9:25 ` Uros Bizjak
@ 2007-05-17 11:15 ` Rask Ingemann Lambertsen
2007-05-17 21:38 ` Rask Ingemann Lambertsen
2007-05-17 16:49 ` Chris Lattner
1 sibling, 1 reply; 36+ messages in thread
From: Rask Ingemann Lambertsen @ 2007-05-17 11:15 UTC (permalink / raw)
To: Uros Bizjak; +Cc: Paolo Bonzini, Chris Lattner, GCC Patches
On Thu, May 17, 2007 at 11:25:34AM +0200, Uros Bizjak wrote:
> Unfortunately, this approach confuses gcc a lot. The attached testcase
> illustrates the problem on i686 (substitute SI with DI and DI with
> TI on 64-bit, and change the shift values, for the same result):
[snip]
> Comparing the two approaches head-to-head on the i686 target, gcc -O2
> -fomit-frame-pointer generates:
>
> test_c:
> subl $28, %esp #,
> xorl %edx, %edx #
> movl 40(%esp), %eax # c, tmp66
> movl 44(%esp), %ecx # d, d
> movl %esi, 20(%esp) #,
> movl %ebx, 16(%esp) #,
> movl %eax, %edx # tmp66,
> movl $0, %eax #, tmp66
> movl %eax, %esi #, tmp74
> movl 32(%esp), %eax # a, tmp70
> movl %edx, %ebx # tmp75, __c
> orl %ecx, %esi # d, tmp74
> xorl %edx, %edx #
> movl %esi, %ecx # tmp74, __c
> movl 36(%esp), %esi # b, b
> movl %edi, 24(%esp) #,
> movl 24(%esp), %edi #,
> movl %eax, %edx # tmp70,
> movl $0, %eax #, tmp70
> orl %esi, %eax # b, tmp72
> movl 20(%esp), %esi #,
> addl %eax, %ecx # tmp72, __c
> adcl %edx, %ebx #, __c
> movl %ecx, 8(%esp) # __c, y
> movl %ebx, %ecx # __c, __c
> xorl %ebx, %ebx # __c
> movl 16(%esp), %ebx #,
> movl %ecx, 12(%esp) # __c, x
> addl $28, %esp #,
> ret
You must be running into some sort of target-specific problem, because
x86 16-bit -O2 -fno-omit-frame-pointer gives:
test_c:
pushw %cx ;# 62 *pushhi1
pushw %di ;# 63 *pushhi1
enterw $6-2, $0 ;# 64 _enter
movw 14(%bp),%ax ;# 18 *movhi/1
movw 12(%bp),%di ;# 19 *movhi/1
movw 10(%bp),%dx ;# 30 *movhi/1
movw 8(%bp), %cx ;# 31 *movhi/1
addw %dx, %ax ;# 56 _addhi3_cc_for_carry/1
movw %di, %dx ;# 61 *movhi/1
adcw %cx, %dx ;# 57 _addhi3_carry/1
movw %ax, -4(%bp) ;# 35 *movhi/2
movw %dx, -2(%bp) ;# 40 *movhi/2
leavew ;# 67 _leave
popw %di ;# 68 *pophi1
popw %cx ;# 69 *pophi1
ret ;# 70 *return
with the obvious s/SI/HI/, s/DI/SI/, etc. changes to your testcase. While
the above could be improved a little, notice the complete absence of "or"
and "xor" instructions.
(Why do you have both "movl $0, %reg" and "xorl %reg, %reg"?)
--
Rask Ingemann Lambertsen
741 unexpected failures, and counting...
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
2007-05-17 9:25 ` Uros Bizjak
2007-05-17 11:15 ` Rask Ingemann Lambertsen
@ 2007-05-17 16:49 ` Chris Lattner
2007-05-17 17:05 ` Eric Christopher
1 sibling, 1 reply; 36+ messages in thread
From: Chris Lattner @ 2007-05-17 16:49 UTC (permalink / raw)
To: Uros Bizjak; +Cc: Paolo Bonzini, GCC Patches
> Unfortunately, this approach confuses gcc a lot. The attached testcase
> illustrates the problem on i686 (substitute SI with DI and DI with
> TI on 64-bit, and change the shift values, for the same result):
> Comparing the two approaches head-to-head on the i686 target, gcc -O2
> -fomit-frame-pointer generates:
>
> test_c:
> subl $28, %esp #,
> xorl %edx, %edx #
> movl 40(%esp), %eax # c, tmp66
> movl 44(%esp), %ecx # d, d
> movl %esi, 20(%esp) #,
> movl %ebx, 16(%esp) #,
> movl %eax, %edx # tmp66,
> movl $0, %eax #, tmp66
> movl %eax, %esi #, tmp74
> movl 32(%esp), %eax # a, tmp70
> movl %edx, %ebx # tmp75, __c
> orl %ecx, %esi # d, tmp74
> xorl %edx, %edx #
> movl %esi, %ecx # tmp74, __c
> movl 36(%esp), %esi # b, b
> movl %edi, 24(%esp) #,
> movl 24(%esp), %edi #,
> movl %eax, %edx # tmp70,
> movl $0, %eax #, tmp70
> orl %esi, %eax # b, tmp72
> movl 20(%esp), %esi #,
> addl %eax, %ecx # tmp72, __c
> adcl %edx, %ebx #, __c
> movl %ecx, 8(%esp) # __c, y
> movl %ebx, %ecx # __c, __c
> xorl %ebx, %ebx # __c
> movl 16(%esp), %ebx #,
> movl %ecx, 12(%esp) # __c, x
> addl $28, %esp #,
> ret
>
> Looking into this mess, I think that gcc still has a long way to go
> w.r.t. DImode values on 32-bit targets (and TImode values on 64-bit
> targets), and the only solution for the near future is using asm
> here...
At some point, hacking around these limitations becomes less useful
than just fixing them once and for all. LLVM compiles this:
volatile SI x, y;
void test_c (SI a, SI b, SI c, SI d){
add_ssaaaa_c (x, y, a, b, c, d);
}
into this:
_test_c:
movl 12(%esp), %eax
movl 16(%esp), %ecx
addl 8(%esp), %ecx
adcl 4(%esp), %eax
movl %ecx, _y
movl %eax, _x
ret
with llvm-gcc t.c -S -O -static -fomit-frame-pointer
Inline asm really is bad. It's much better to fix the compiler so
that users don't have to rely on it to get performance.
-Chris
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
2007-05-17 16:49 ` Chris Lattner
@ 2007-05-17 17:05 ` Eric Christopher
2007-05-18 10:23 ` Uros Bizjak
2007-05-18 19:43 ` Uros Bizjak
0 siblings, 2 replies; 36+ messages in thread
From: Eric Christopher @ 2007-05-17 17:05 UTC (permalink / raw)
To: GCC Patches; +Cc: Uros Bizjak, Paolo Bonzini, Ian Lance Taylor, Chris Lattner
>
>
> Inline asm really is bad. It's much better to fix the compiler so
> that users don't have to rely on it to get performance.
Agreed, Ian's patches that take advantage of lower-subreg for x86 help
this quite a bit. He's waiting AFAIK on the df branch to finish them
up, but from what I recall they helped this particular case.
-eric
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
2007-05-17 11:15 ` Rask Ingemann Lambertsen
@ 2007-05-17 21:38 ` Rask Ingemann Lambertsen
2007-05-18 6:54 ` Ian Lance Taylor
0 siblings, 1 reply; 36+ messages in thread
From: Rask Ingemann Lambertsen @ 2007-05-17 21:38 UTC (permalink / raw)
To: Uros Bizjak; +Cc: Paolo Bonzini, Chris Lattner, GCC Patches
On Thu, May 17, 2007 at 01:15:48PM +0200, Rask Ingemann Lambertsen wrote:
> You must be running into some sort of target-specific problem,
The i386 back end is splitting adddi3 after reload, which is too late.
I'm not sure that splitting before reload is early enough, but that is the
easiest change to make.
--
Rask Ingemann Lambertsen
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
2007-05-17 21:38 ` Rask Ingemann Lambertsen
@ 2007-05-18 6:54 ` Ian Lance Taylor
2007-05-18 7:10 ` Eric Botcazou
0 siblings, 1 reply; 36+ messages in thread
From: Ian Lance Taylor @ 2007-05-18 6:54 UTC (permalink / raw)
To: Rask Ingemann Lambertsen
Cc: Uros Bizjak, Paolo Bonzini, Chris Lattner, GCC Patches
Rask Ingemann Lambertsen <rask@sygehus.dk> writes:
> On Thu, May 17, 2007 at 01:15:48PM +0200, Rask Ingemann Lambertsen wrote:
>
> > You must be running into some sort of target specific problem,
>
> The i386 back end is splitting adddi3 after reload, which is too late.
> I'm not sure that splitting before reload is early enough, but that is the
> easiest change to make.
With the new lower-subreg patch, insns like adddi3 should split before
reload when possible. But you can't use the existing split in
i386.md, because it relies on FLAGS_REG, and reload may accidentally
clobber FLAGS_REG. In my uncommitted follow-on patch to lower-subreg,
I did it like this:
(define_split
[(set (match_operand:DI 0 "nonimmediate_operand" "")
(plus:DI (match_operand:DI 1 "nonimmediate_operand" "")
(match_operand:DI 2 "general_operand" "")))
(clobber (reg:CC FLAGS_REG))]
"!TARGET_64BIT"
[(parallel [(set (match_dup 0) (plus:SI (match_dup 1) (match_dup 2)))
(set (match_dup 3)
(plus:SI (plus:SI (match_dup 4) (match_dup 5))
(truncate:SI
(lshiftrt:DI
(plus:DI
(zero_extend:DI (match_dup 1))
(zero_extend:DI (match_dup 2)))
(const_int 32)))))
(clobber (reg:CC FLAGS_REG))])]
{
split_di (operands + 0, 1, operands + 0, operands + 3);
split_di (operands + 1, 1, operands + 1, operands + 4);
split_di (operands + 2, 1, operands + 2, operands + 5);
})
(define_insn_and_split "*adddi3_2"
[(set (match_operand:SI 0 "nonimmediate_operand" "=rm,r,rm,r")
(plus:SI (match_operand:SI 1 "nonimmediate_operand" "%0,0,0,0")
(match_operand:SI 2 "general_operand" "ri,rm,ri,rm")))
(set (match_operand:SI 3 "nonimmediate_operand" "=rm,r,rm,r")
(plus:SI (plus:SI (match_operand:SI 4 "general_operand" "3,3,ri,rm")
(match_operand:SI 5 "general_operand" "ri,rm,3,3"))
(truncate:SI
(lshiftrt:DI
(plus:DI
(zero_extend:DI (match_dup 1))
(zero_extend:DI (match_dup 2)))
(const_int 32)))))
(clobber (reg:CC FLAGS_REG))]
"!TARGET_64BIT"
"#"
"&& reload_completed"
[(parallel [(set (reg:CC FLAGS_REG) (unspec:CC [(match_dup 1) (match_dup 2)]
UNSPEC_ADD_CARRY))
(set (match_dup 0) (plus:SI (match_dup 1) (match_dup 2)))])
(parallel [(set (match_dup 3)
(plus:SI (plus:SI (ltu:SI (reg:CC FLAGS_REG) (const_int 0))
(match_dup 4))
(match_dup 5)))
(clobber (reg:CC FLAGS_REG))])]
{
/* We can only have one pair of commutative operands in an insn, so
we commute operands 4 and 5 manually. */
if (rtx_equal_p (operands[5], operands[3]))
{
rtx tmp = operands[4];
operands[4] = operands[5];
operands[5] = tmp;
}
})
Ian
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
2007-05-18 6:54 ` Ian Lance Taylor
@ 2007-05-18 7:10 ` Eric Botcazou
2007-05-18 16:48 ` Ian Lance Taylor
0 siblings, 1 reply; 36+ messages in thread
From: Eric Botcazou @ 2007-05-18 7:10 UTC (permalink / raw)
To: Ian Lance Taylor
Cc: gcc-patches, Rask Ingemann Lambertsen, Uros Bizjak,
Paolo Bonzini, Chris Lattner
> With the new lower-subreg patch, insns like adddi3 should split before
> reload when possible.
If you haven't already done it, could you sketch the modifications that should
be made to a 32-bit back-end, now that the lower-subreg patch is integrated?
--
Eric Botcazou
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
2007-05-17 17:05 ` Eric Christopher
@ 2007-05-18 10:23 ` Uros Bizjak
2007-05-18 18:31 ` Chris Lattner
2007-05-18 19:43 ` Uros Bizjak
1 sibling, 1 reply; 36+ messages in thread
From: Uros Bizjak @ 2007-05-18 10:23 UTC (permalink / raw)
To: Eric Christopher
Cc: GCC Patches, Paolo Bonzini, Ian Lance Taylor, Chris Lattner
On 5/17/07, Eric Christopher <echristo@apple.com> wrote:
> > Inline asm really is bad. It's much better to fix the compiler so
> > that users don't have to rely on it to get performance.
>
> Agreed, Ian's patches that take advantage of lower-subreg for x86 help
> this quite a bit. He's waiting AFAIK on the df branch to finish them
> up, but from what I recall they helped this particular case.
OK, I agree with that. However, I'd like to propose a path to the
final solution:
- for now, we implement the asm solution, as it optimizes library code
_now_ and follows what the master copy of longlong.h does. Also, there is
no simple solution for the 128-bit "mul" and "div" insns, which are of
most interest here.
- an enhancement PR is opened with the comments from this thread (also
linking to Ian's follow-on patch), so we won't forget this issue.
- once the generated code is equal to or better than asm, we change
longlong.h, probably for both i386 and x86_64.
My opinion is that the asm solution is currently "good enough" for current
uses (libgcc), and code inspection confirms this. It is not "the best
solution", but it is definitely better than the current state. For the
record, 3 "imul" insns are replaced by just one "mul" insn.
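[Editorial note: to make the "imul" count concrete — without a umul_ppmm that maps onto the hardware widening multiply, longlong.h's generic C fallback assembles a double-width product from half-width pieces with several multiplies and carry fixups, while the asm version is a single mulq. A hedged C sketch of the two approaches, scaled down to a 64-bit product so it runs anywhere; function names are illustrative, not the actual libgcc symbols, and unsigned __int128 is a GCC extension standing in for the single-instruction path:]

```c
#include <stdint.h>

typedef unsigned __int128 u128;  /* GCC extension, for illustration only */

/* Generic-fallback style: build the 128-bit product of two 64-bit
   values from 32-bit halves, roughly the way longlong.h's portable
   C umul_ppmm does (several multiplies plus carry propagation).  */
static void mul_64x64_schoolbook (uint64_t *hi, uint64_t *lo,
                                  uint64_t u, uint64_t v)
{
  uint64_t ul = (uint32_t) u, uh = u >> 32;
  uint64_t vl = (uint32_t) v, vh = v >> 32;

  uint64_t t  = ul * vl;
  uint64_t w0 = (uint32_t) t;
  uint64_t k  = t >> 32;

  t = uh * vl + k;
  uint64_t w1 = (uint32_t) t;
  uint64_t w2 = t >> 32;

  t = ul * vh + w1;
  k = t >> 32;

  *hi = uh * vh + w2 + k;
  *lo = (t << 32) | w0;
}

/* What the asm umul_ppmm expresses directly: one widening
   multiply, i.e. a single mulq on x86_64.  */
static void mul_64x64_widening (uint64_t *hi, uint64_t *lo,
                                uint64_t u, uint64_t v)
{
  u128 p = (u128) u * v;
  *hi = (uint64_t) (p >> 64);
  *lo = (uint64_t) p;
}
```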
Uros.
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
2007-05-18 7:10 ` Eric Botcazou
@ 2007-05-18 16:48 ` Ian Lance Taylor
2007-05-18 20:37 ` Eric Botcazou
2007-05-22 17:27 ` Roman Zippel
0 siblings, 2 replies; 36+ messages in thread
From: Ian Lance Taylor @ 2007-05-18 16:48 UTC (permalink / raw)
To: Eric Botcazou
Cc: gcc-patches, Rask Ingemann Lambertsen, Uros Bizjak,
Paolo Bonzini, Chris Lattner
Eric Botcazou <ebotcazou@libertysurf.fr> writes:
> > With the new lower-subreg patch, insns like adddi3 should split before
> > reload when possible.
>
> If you haven't already done it, could you sketch the modifications that should
> be made to a 32-bit back-end, now that the lower-subreg patch is integrated?
First I'll say that the lower-subreg patch is not yet completely
integrated. There is another part of it, which I have working, but
which I'm holding in the hopes that it will work better in the DF
framework. That is the part which tracks subreg lifetimes during
register allocation.
From the backend perspective, the best way to take advantage of
lower-subreg is conceptually simple if sometimes awkward in practice:
for multi-word operations, use define_insn_and_split and split the
insns before reload.
For CC0-style machines, in which reload may clobber the flags
register, some multi-word operations will need two
define_insn_and_splits. The first split should run before reload, and
should generate one or more insns which express the computation
entirely in terms of register-sized subregs. The second split should
run after reload and should generate the actual instructions which use
the flags register.
The idea is that the first split into subregs will permit the
lower-subreg pass and the register allocator to allocate the values
involved as separate single registers rather than as a single
multi-word register. The second split, which will be similar to
existing code, will then generate the real machine instructions for
the second scheduling pass.
For a non-CC0 style machine things are simpler. Just split multi-word
operations before reload. There was never any particular need to wait
until after reload to split them in the past. And now you really
don't want to wait.
For clarity, the relevant distinction between a CC0 machine and a
non-CC0 machine is whether reload is going to generate loads and
stores which clobber the flags register. On a non-CC0 machine the
flags register is not set by memory moves. On a CC0 machine it is.
Ian
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
2007-05-18 10:23 ` Uros Bizjak
@ 2007-05-18 18:31 ` Chris Lattner
2007-05-18 19:58 ` Uros Bizjak
2007-05-18 20:07 ` Uros Bizjak
0 siblings, 2 replies; 36+ messages in thread
From: Chris Lattner @ 2007-05-18 18:31 UTC (permalink / raw)
To: Uros Bizjak
Cc: Eric Christopher, GCC Patches, Paolo Bonzini, Ian Lance Taylor
On May 18, 2007, at 3:22 AM, Uros Bizjak wrote:
> On 5/17/07, Eric Christopher <echristo@apple.com> wrote:
>
>> > Inline asm really is bad. It's much better to fix the compiler so
>> > that users don't have to rely on it to get performance.
>>
>> Agreed, Ian's patches that take advantage of lower-subreg for x86
>> help
>> this quite a bit. He's waiting AFAIK on the df branch to finish them
>> up, but from what I recall they helped this particular case.
>
> OK, I agree with that. However, I'd like to propose a path to the
> final solution:
>
> - for now, we implement the asm solution, as it optimizes library code
> - an enhancement PR is opened with the comments from this thread (also
> linking to Ian's follow-on patch), so we won't forget this issue.
> - once the generated code is equal to or better than asm, we change
> longlong.h, probably for both, i386 and x86_64.
This all sounds fine to me.
> _now_ and follows what the master copy of longlong.h does. Also, there is
> no simple solution for the 128-bit "mul" and "div" insns, which are of
> most interest here.
What is the issue here?
-Chris
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
2007-05-17 17:05 ` Eric Christopher
2007-05-18 10:23 ` Uros Bizjak
@ 2007-05-18 19:43 ` Uros Bizjak
1 sibling, 0 replies; 36+ messages in thread
From: Uros Bizjak @ 2007-05-18 19:43 UTC (permalink / raw)
To: Eric Christopher
Cc: GCC Patches, Paolo Bonzini, Ian Lance Taylor, Chris Lattner
Eric Christopher wrote:
>> Inline asm really is bad. It's much better to fix the compiler so
>> that users don't have to rely on it to get performance.
>
> Agreed, Ian's patches that take advantage of lower-subreg for x86 help
> this quite a bit. He's waiting AFAIK on the df branch to finish them
> up, but from what I recall they helped this particular case.
Here are some representative x86_64 code sizes with and without a patch
to longlong.h (these are actually TImode and TFmode functions):
(+++ are patched, --- are unpatched libgcc.a sizes)
- 238 0 0 238 ee _muldi3.o (ex libgcc.a)
+ 138 0 0 138 8a _muldi3.o (ex libgcc.a)
- 1529 0 0 1529 5f9 _divdi3.o (ex libgcc.a)
- 1650 0 0 1650 672 _moddi3.o (ex libgcc.a)
- 1322 0 0 1322 52a _udivdi3.o (ex libgcc.a)
- 1418 0 0 1418 58a _umoddi3.o (ex libgcc.a)
+ 541 0 0 541 21d _divdi3.o (ex libgcc.a)
+ 650 0 0 650 28a _moddi3.o (ex libgcc.a)
+ 358 0 0 358 166 _udivdi3.o (ex libgcc.a)
+ 445 0 0 445 1bd _umoddi3.o (ex libgcc.a)
- 1669 0 0 1669 685 _udivmoddi4.o (ex libgcc.a)
+ 610 0 0 610 262 _udivmoddi4.o (ex libgcc.a)
- 3030 0 0 3030 bd6 divtf3.o (ex libgcc.a)
+ 2604 0 0 2604 a2c divtf3.o (ex libgcc.a)
- 2772 0 0 2772 ad4 multf3.o (ex libgcc.a)
+ 2508 0 0 2508 9cc multf3.o (ex libgcc.a)
The current divdi/moddi code looks quite pathological...
I think that this clears the way for the patch to be committed to
mainline, even as an interim solution.
Uros.
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
2007-05-18 18:31 ` Chris Lattner
@ 2007-05-18 19:58 ` Uros Bizjak
2007-05-18 20:15 ` Chris Lattner
2007-05-18 20:07 ` Uros Bizjak
1 sibling, 1 reply; 36+ messages in thread
From: Uros Bizjak @ 2007-05-18 19:58 UTC (permalink / raw)
To: Chris Lattner
Cc: Eric Christopher, GCC Patches, Paolo Bonzini, Ian Lance Taylor
Chris Lattner wrote:
>> _now_ and follows what master copy of longlong.h does. Also, there is
>> no simple solution for 128bit "mul" and "div" insns, which are of most
>> interest here.
>
> What is the issue here?
unsigned long long int t(unsigned int a, unsigned int b)
{
return a * b;
}
doesn't compile to "mul" as one would expect (but to "imul"):
gcc -O2 -m32 -fomit-frame-pointer:
t:
movl 8(%esp), %eax
xorl %edx, %edx
imull 4(%esp), %eax
ret
Uros.
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
2007-05-18 18:31 ` Chris Lattner
2007-05-18 19:58 ` Uros Bizjak
@ 2007-05-18 20:07 ` Uros Bizjak
1 sibling, 0 replies; 36+ messages in thread
From: Uros Bizjak @ 2007-05-18 20:07 UTC (permalink / raw)
To: Chris Lattner
Cc: Eric Christopher, GCC Patches, Paolo Bonzini, Ian Lance Taylor
Chris Lattner wrote:
>> OK, I agree with that. However, I'd like to propose the path to this
>> final solution:
>>
>> - for now, we implement asm solution, as it optimizes library code
>> - an enhancement PR is opened with the comments from this thread (also
>> linking to Ian's follow-on patch), so we won't forget this issue.
>> - once the generated code is equal or better than asm, we change
>> longlong.h, probably for both, i386 and x86_64.
>
> This all sounds fine to me.
The bugreport is at PR target/31985.
(BTW: It would be nice if someone could confirm it, so it won't be
self-confirmed ;)
Uros.
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
2007-05-18 19:58 ` Uros Bizjak
@ 2007-05-18 20:15 ` Chris Lattner
2007-05-18 21:17 ` Uros Bizjak
2007-05-18 21:49 ` Uros Bizjak
0 siblings, 2 replies; 36+ messages in thread
From: Chris Lattner @ 2007-05-18 20:15 UTC (permalink / raw)
To: Uros Bizjak
Cc: Eric Christopher, GCC Patches, Paolo Bonzini, Ian Lance Taylor
On May 18, 2007, at 12:58 PM, Uros Bizjak wrote:
> Chris Lattner wrote:
>
>>> _now_ and follows what master copy of longlong.h does. Also,
>>> there is
>>> no simple solution for 128bit "mul" and "div" insns, which are of
>>> most
>>> interest here.
>>
>> What is the issue here?
> unsigned long long int t(unsigned int a, unsigned int b)
> {
> return a * b;
> }
>
> doesn't compile to "mul" as one would expect (but to "imul"):
>
> gcc -O2 -m32 -fomit-frame-pointer:
>
> t:
> movl 8(%esp), %eax
> xorl %edx, %edx
> imull 4(%esp), %eax
> ret
Don't you want:
unsigned long long int t2(unsigned int a, unsigned int b)
{
return (unsigned long long)a * b;
}
which produces (gcc 4.0.1):
_t2:
movl 8(%esp), %eax
mull 4(%esp)
ret
?
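(A quick standalone check of the widening-cast point, added as an illustrative sketch; the function names are just for the example, they are not from the patch:)

```c
#include <stdint.h>

/* Without the cast, a * b is a 32-bit multiply whose 32-bit result is
   then zero-extended, so GCC's "imull" is correct there.  Casting one
   operand first requests the full 32x32 -> 64 widening multiply, which
   is what "mull" implements. */
static uint64_t t1 (uint32_t a, uint32_t b)
{
  return a * b;                 /* product truncated to 32 bits */
}

static uint64_t t2 (uint32_t a, uint32_t b)
{
  return (uint64_t) a * b;      /* full 64-bit product */
}
```

For example, t1 (0x80000000u, 2) wraps to 0 while t2 (0x80000000u, 2) gives 0x100000000.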
-Chris
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
2007-05-18 16:48 ` Ian Lance Taylor
@ 2007-05-18 20:37 ` Eric Botcazou
2007-05-18 22:32 ` Ian Lance Taylor
2007-05-22 17:27 ` Roman Zippel
1 sibling, 1 reply; 36+ messages in thread
From: Eric Botcazou @ 2007-05-18 20:37 UTC (permalink / raw)
To: Ian Lance Taylor
Cc: gcc-patches, Rask Ingemann Lambertsen, Uros Bizjak,
Paolo Bonzini, Chris Lattner
> From the backend perspective, the best way to take advantage of
> lower-subreg is conceptually simple if sometimes awkward in practice:
> for multi-word operations, use define_insn_and_split and split the
> insns before reload.
OK, I see, thanks.
> For a non-CC0 style machine things are simpler. Just split multi-word
> operations before reload. There was never any particular need to wait
> until after reload to split them in the past. And now you really
> don't want to wait.
Of course SPARC-V8 is the perfect counter-example since you have a pairing
instruction "ldd" for consecutive integer registers. :-)
--
Eric Botcazou
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
2007-05-18 20:15 ` Chris Lattner
@ 2007-05-18 21:17 ` Uros Bizjak
2007-05-18 21:49 ` Uros Bizjak
1 sibling, 0 replies; 36+ messages in thread
From: Uros Bizjak @ 2007-05-18 21:17 UTC (permalink / raw)
To: Chris Lattner
Cc: Eric Christopher, GCC Patches, Paolo Bonzini, Ian Lance Taylor
Chris Lattner wrote:
> Don't you want:
>
> unsigned long long int t2(unsigned int a, unsigned int b)
> {
> return (unsigned long long)a * b;
> }
Ouch... yes, indeed.
Thanks,
Uros.
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
2007-05-18 20:15 ` Chris Lattner
2007-05-18 21:17 ` Uros Bizjak
@ 2007-05-18 21:49 ` Uros Bizjak
2007-05-18 22:49 ` Chris Lattner
1 sibling, 1 reply; 36+ messages in thread
From: Uros Bizjak @ 2007-05-18 21:49 UTC (permalink / raw)
To: Chris Lattner
Cc: Eric Christopher, GCC Patches, Paolo Bonzini, Ian Lance Taylor
Chris Lattner wrote:
> Don't you want:
>
> unsigned long long int t2(unsigned int a, unsigned int b)
> {
> return (unsigned long long)a * b;
> }
Please find a testcase for muls below, implemented in asm and C. Again,
a couple of strange moves are present at the end of the function
implemented in pure C.
--cut here--
typedef unsigned SI __attribute__ ((mode (SI)));
typedef unsigned DI __attribute__ ((mode (DI)));
#define umul_ppmm_c(w1, w0, u, v) \
{ \
DI __c = (DI) u * v; \
\
(w0) = (SI) (__c & 0xffffffff); \
(w1) = (SI) (__c >> 32); \
}
#define umul_ppmm_asm(w1, w0, u, v) \
__asm__ ("mull %3" \
: "=a" ((SI) (w0)), \
"=d" ((SI) (w1)) \
: "%0" ((SI) (u)), \
"rm" ((SI) (v)))
void test_c (SI a, SI b)
{
volatile SI x, y;
umul_ppmm_c (x, y, a, b);
}
void test_asm (SI a, SI b)
{
volatile SI x, y;
umul_ppmm_asm (x, y, a, b);
}
--cut here--
gcc -O2 -fomit-frame-pointer:
test_c:
subl $16, %esp
movl 24(%esp), %eax
mull 20(%esp)
movl %eax, 8(%esp)
movl %edx, %eax
xorl %edx, %edx
movl %eax, 12(%esp)
addl $16, %esp
ret
test_asm:
subl $16, %esp
movl 20(%esp), %eax
#APP
mull 24(%esp)
#NO_APP
movl %eax, 8(%esp)
movl %edx, 12(%esp)
addl $16, %esp
ret
So even for wide multiply, it is better to stay with the asm implementation ATM.
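(For reference, the x86_64 analogue of the C variant above can be written with GCC's unsigned __int128 extension - a sketch for comparison only, not part of the posted patch; the helper name is illustrative:)

```c
#include <stdint.h>

typedef unsigned __int128 UTItype;  /* GCC extension on 64-bit targets */

/* C version of umul_ppmm for 64-bit limbs: compute the full 128-bit
   product and split it into high and low 64-bit words. */
static void
umul_ppmm_c64 (uint64_t *w1, uint64_t *w0, uint64_t u, uint64_t v)
{
  UTItype c = (UTItype) u * v;
  *w0 = (uint64_t) c;
  *w1 = (uint64_t) (c >> 64);
}
```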
Uros.
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
2007-05-18 20:37 ` Eric Botcazou
@ 2007-05-18 22:32 ` Ian Lance Taylor
0 siblings, 0 replies; 36+ messages in thread
From: Ian Lance Taylor @ 2007-05-18 22:32 UTC (permalink / raw)
To: Eric Botcazou
Cc: gcc-patches, Rask Ingemann Lambertsen, Uros Bizjak,
Paolo Bonzini, Chris Lattner
Eric Botcazou <ebotcazou@libertysurf.fr> writes:
> > For a non-CC0 style machine things are simpler. Just split multi-word
> > operations before reload. There was never any particular need to wait
> > until after reload to split them in the past. And now you really
> > don't want to wait.
>
> Of course SPARC-V8 is the perfect counter-example since you have a pairing
> instruction "ldd" for consecutive integer registers. :-)
Yes, good point, you probably don't want to split a DImode load or
store there. You can still split everything else, though, and the
right thing should happen. Unfortunately this means that if one of
the halves of the load is unused, we won't convert back to an SImode
load.
Ian
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
2007-05-18 21:49 ` Uros Bizjak
@ 2007-05-18 22:49 ` Chris Lattner
0 siblings, 0 replies; 36+ messages in thread
From: Chris Lattner @ 2007-05-18 22:49 UTC (permalink / raw)
To: Uros Bizjak
Cc: Eric Christopher, GCC Patches, Paolo Bonzini, Ian Lance Taylor
On May 18, 2007, at 2:48 PM, Uros Bizjak wrote:
> Chris Lattner wrote:
>
>> Don't you want:
>>
>> unsigned long long int t2(unsigned int a, unsigned int b)
>> {
>> return (unsigned long long)a * b;
>> }
>
> Please find a testcase for muls below, implemented in asm and C.
> Again, a couple of strange moves are present at the end of a
> function, implemented in pure C.
>
> --cut here--
> typedef unsigned SI __attribute__ ((mode (SI)));
> typedef unsigned DI __attribute__ ((mode (DI)));
>
> #define umul_ppmm_c(w1, w0, u, v) \
> { \
> DI __c = (DI) u * v; \
> \
> (w0) = (SI) (__c & 0xffffffff); \
> (w1) = (SI) (__c >> 32); \
> }
>
> test_c:
> subl $16, %esp
> movl 24(%esp), %eax
> mull 20(%esp)
> movl %eax, 8(%esp)
> movl %edx, %eax
> xorl %edx, %edx
> movl %eax, 12(%esp)
> addl $16, %esp
> ret
>
> So even for wide multiply, it is better to stay with asm
> implementation ATM.
Yuck. Please file a bugzilla.
-Chris
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
2007-05-18 16:48 ` Ian Lance Taylor
2007-05-18 20:37 ` Eric Botcazou
@ 2007-05-22 17:27 ` Roman Zippel
2007-05-22 17:36 ` Ian Lance Taylor
2007-05-22 22:19 ` Rask Ingemann Lambertsen
1 sibling, 2 replies; 36+ messages in thread
From: Roman Zippel @ 2007-05-22 17:27 UTC (permalink / raw)
To: Ian Lance Taylor
Cc: Eric Botcazou, gcc-patches, Rask Ingemann Lambertsen,
Uros Bizjak, Paolo Bonzini, Chris Lattner
Hi,
On Fri, 18 May 2007, Ian Lance Taylor wrote:
> >From the backend perspective, the best way to take advantage of
> lower-subreg is conceptually simple if sometimes awkward in practice:
> for multi-word operations, use define_insn_and_split and split the
> insns before reload.
I'm playing around with it a little on m68k and there is sometimes a bit
of a conflict. E.g. I would split something like adddi3/subdi3 already
before combine, because the addx/subx instruction is much more limited in
its addressing modes (it only accepts registers), so combine can generate
the right instructions.
The problem is now that other optimizations can't do very much with these
instructions, for example something like this:
unsigned long long x, y;
return x + (y >> 63);
is better handled as:
unsigned long long x, y;
return x - ((long long)y >> 63);
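(The claimed equivalence of the two forms is easy to check in a standalone sketch; it relies on arithmetic right shift of negative values, which GCC guarantees for signed types:)

```c
/* y >> 63 on an unsigned 64-bit value is 0 or 1, while
   (long long) y >> 63 is 0 or -1, so adding the former is the same as
   subtracting the latter. */
static unsigned long long
form_add (unsigned long long x, unsigned long long y)
{
  return x + (y >> 63);
}

static unsigned long long
form_sub (unsigned long long x, unsigned long long y)
{
  return x - ((long long) y >> 63);
}
```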
It's not a very common thing, so it wouldn't be a big loss, but something
like this is currently part of the m68k description (and something very
early already converts the second form into the first form...).
The point is that as soon as we expose the two separate adds,
transformations like the one above, which require changing both adds to
subs, pretty much become impossible. The only alternative I can think of
is to have a port specific tree pass and change this before the RTL is
generated.
In this context I'm curious about another problem, how should it work in
general (i.e. independent of cc0) to combine a subdi/cmpdi or subdi/tstdi
pattern?
> For clarity, the relevant distinction between a CC0 machine and a
> non-CC0 machine is whether reload is going to generate loads and
> stores which clobber the flags register. On a non-CC0 machine the
> flags register is not set by memory moves. On a CC0 machine it is.
BTW I was thinking about this a little (possible conversion of m68k from
cc0 in general and this reload problem specifically).
I don't really like the idea of hiding cc0 until after reload, as it
prevents many optimizations (e.g. smi is heavily used for sign extensions,
but it needs the cc flags from the previous instructions, so one would
need many patterns to combine a <op>/set/smi sequence).
I think the reload problem may not be that bad: while a move instruction
sets the cc register, it often just sets it in the same way as the
previous instructions, especially the Z and M flag, which is enough for
the common test against zero. For these tests the move wouldn't really
clobber the flags, but just recreate them.
A bit more difficult is a sub/cmp combination; here a move would really
clobber the flags, but luckily on m68k the sub instruction is quite
flexible and e.g. also accepts memory arguments as destination, so it
would be possible to convert an output reload into one or two input
reloads.
So what basically would be needed is a port specific check in reload as to
whether an output reload is acceptable and possibly reject the
alternative and try it with another one.
That's currently the idea I would favour at least for m68k, does that
sound reasonable (or at least understandable :) ).
bye, Roman
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
2007-05-22 17:27 ` Roman Zippel
@ 2007-05-22 17:36 ` Ian Lance Taylor
2007-05-22 22:27 ` Rask Ingemann Lambertsen
2007-05-23 17:27 ` Roman Zippel
2007-05-22 22:19 ` Rask Ingemann Lambertsen
1 sibling, 2 replies; 36+ messages in thread
From: Ian Lance Taylor @ 2007-05-22 17:36 UTC (permalink / raw)
To: Roman Zippel
Cc: Eric Botcazou, gcc-patches, Rask Ingemann Lambertsen,
Uros Bizjak, Paolo Bonzini, Chris Lattner
Roman Zippel <zippel@linux-m68k.org> writes:
> I'm playing around with it a little on m68k and there is sometimes a bit
> of a conflict. E.g. I would split something like adddi3/subdi3 already
> before combine, because the addx/subx instruction is much more limited in
> its addressing mode (only accepts register), so combine can generate the
> right instructions.
That's right. That's why I recommend a define_insn_and_split. The
insn will be complete for most of the RTL optimization passes,
including combine. It will then be split before the second lower
subreg pass and before register allocation.
Of course, that won't help your case, because you want to split before
combine for other reasons.
> I think the reload problem may not be that bad, while a move instruction
> sets the cc register, it often just sets it in the same way as the
> previous instructions, especially the Z and M flag, which is enough for
> the common test against zero. For these tests the move wouldn't really
> clobber the flags, but just recreate them.
It's unfortunately not that simple. In some cases reload will need to
load some part of an address. When it does that the flags will no
longer have any relation to what they are supposed to have.
> A bit more difficult is a sub/cmp combination, here move would really
> clobber flags, but luckily on m68k the sub instruction is quite flexible
> and e.g. also accepts memory arguments as destination, so it would be
> possible to convert a output reload into one or two input reloads.
> So what basically would be needed is a port specific check in reload as to
> whether an output reload is acceptable and possibly reject the
> alternative and try it with another one.
> That's currently the idea I would favour at least for m68k, does that
> sound reasonable (or at least understandable :) ).
That sounds unreasonable. reload is already very complicated, too
complicated, and there isn't going to be any way to drop in something
like that without making it significantly more complicated.
Ian
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
2007-05-22 17:27 ` Roman Zippel
2007-05-22 17:36 ` Ian Lance Taylor
@ 2007-05-22 22:19 ` Rask Ingemann Lambertsen
1 sibling, 0 replies; 36+ messages in thread
From: Rask Ingemann Lambertsen @ 2007-05-22 22:19 UTC (permalink / raw)
To: Roman Zippel
Cc: Ian Lance Taylor, Eric Botcazou, gcc-patches, Uros Bizjak,
Paolo Bonzini, Chris Lattner
On Tue, May 22, 2007 at 07:26:50PM +0200, Roman Zippel wrote:
> Hi,
>
> On Fri, 18 May 2007, Ian Lance Taylor wrote:
>
> > >From the backend perspective, the best way to take advantage of
> > lower-subreg is conceptually simple if sometimes awkward in practice:
> > for multi-word operations, use define_insn_and_split and split the
> > insns before reload.
>
> I'm playing around with it a little on m68k and there is sometimes a bit
> of a conflict. E.g. I would split something like adddi3/subdi3 already
> before combine, because the addx/subx instruction is much more limited in
> its addressing mode (only accepts register), so combine can generate the
> right instructions.
It can be done:
1) Define predicates and constraints which match the available addressing
modes.
2) Define an insn(_and_split) which uses those predicates and constraints.
3) Define an insn similiar to the one in 2), but where the (mem ...)
expression is not an operand. Instead, you have something like
... (mem:SI (match_operand:SI "register_operand" "a"))
4) Split the insn in 2) to the one in 3) before reload.
5) Put the insn in 3) before the one in 2) in the .md file.
I think that since the address is so simple in this case, you can skip
parts 1), 2), 4) and 5). Just be careful that you don't use the template in
3) as an output template, or you'll lose the alias set.
(The danger of losing the alias set really ought to be documented.)
> unsigned long long x, y;
> return x + (y >> 63);
>
> is better handled as:
>
> unsigned long long x, y;
> return x - ((long long)y >> 63);
Interesting optimization when signed shifts are faster than unsigned
ones.
> In this context I'm curious about another problem, how should it work in
> general (i.e. independent of cc0) to combine a subdi/cmpdi or subdi/tstdi
> pattern?
Combine will do this automatically if you define the insn to do so. In
this case, it will just be a define_insn_and_split instead of the usual
define_insn, but that makes no difference to combine. If your question is
how to write the combined patterns in general, see
<URL:http://gcc.gnu.org/ml/gcc/2004-02/msg00903.html>.
You will have to define a CCmode for the multi-word instructions since
the zero flag won't be usable. You'll therefore also have to delay the
output of the comparison instruction until you know the comparison operator.
This is documented in the GCC Internals manual.
(Also check that your overflow flag is set to something useful.)
--
Rask Ingemann Lambertsen
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
2007-05-22 17:36 ` Ian Lance Taylor
@ 2007-05-22 22:27 ` Rask Ingemann Lambertsen
2007-05-22 22:32 ` Ian Lance Taylor
2007-05-23 17:27 ` Roman Zippel
1 sibling, 1 reply; 36+ messages in thread
From: Rask Ingemann Lambertsen @ 2007-05-22 22:27 UTC (permalink / raw)
To: Ian Lance Taylor
Cc: Roman Zippel, Eric Botcazou, gcc-patches, Uros Bizjak,
Paolo Bonzini, Chris Lattner
On Tue, May 22, 2007 at 10:35:25AM -0700, Ian Lance Taylor wrote:
> Roman Zippel <zippel@linux-m68k.org> writes:
>
> > I think the reload problem may not be that bad, while a move instruction
> > sets the cc register, it often just sets it in the same way as the
> > previous instructions, especially the Z and M flag, which is enough for
> > the common test against zero. For these tests the move wouldn't really
> > clobber the flags, but just recreate them.
>
> It's unfortunately not that simple. In some cases reload will need to
> load some part of an address. When it does that the flags will no
> longer have any relation to what they are supposed to have.
On the m68k, instructions with an address register destination don't set
the flags. It might actually work.
--
Rask Ingemann Lambertsen
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
2007-05-22 22:27 ` Rask Ingemann Lambertsen
@ 2007-05-22 22:32 ` Ian Lance Taylor
2007-05-24 20:00 ` Rask Ingemann Lambertsen
0 siblings, 1 reply; 36+ messages in thread
From: Ian Lance Taylor @ 2007-05-22 22:32 UTC (permalink / raw)
To: Rask Ingemann Lambertsen
Cc: Roman Zippel, Eric Botcazou, gcc-patches, Uros Bizjak,
Paolo Bonzini, Chris Lattner
Rask Ingemann Lambertsen <rask@sygehus.dk> writes:
> On Tue, May 22, 2007 at 10:35:25AM -0700, Ian Lance Taylor wrote:
> > Roman Zippel <zippel@linux-m68k.org> writes:
> >
> > > I think the reload problem may not be that bad, while a move instruction
> > > sets the cc register, it often just sets it in the same way as the
> > > previous instructions, especially the Z and M flag, which is enough for
> > > the common test against zero. For these tests the move wouldn't really
> > > clobber the flags, but just recreate them.
> >
> > It's unfortunately not that simple. In some cases reload will need to
> > load some part of an address. When it does that the flags will no
> > longer have any relation to what they are supposed to have.
>
> On the m68k, instructions with an address register destination don't set
> the flags. It might actually work.
That is true, but addresses can include data registers, as in
%a0@(%d0) (aka 0(a0,d0) ).
Ian
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
2007-05-22 17:36 ` Ian Lance Taylor
2007-05-22 22:27 ` Rask Ingemann Lambertsen
@ 2007-05-23 17:27 ` Roman Zippel
2007-05-23 17:48 ` Ian Lance Taylor
1 sibling, 1 reply; 36+ messages in thread
From: Roman Zippel @ 2007-05-23 17:27 UTC (permalink / raw)
To: Ian Lance Taylor
Cc: Eric Botcazou, gcc-patches, Rask Ingemann Lambertsen,
Uros Bizjak, Paolo Bonzini, Chris Lattner
Hi,
On Tue, 22 May 2007, Ian Lance Taylor wrote:
> Roman Zippel <zippel@linux-m68k.org> writes:
>
> > I'm playing around with it a little on m68k and there is sometimes a bit
> > of a conflict. E.g. I would split something like adddi3/subdi3 already
> > before combine, because the addx/subx instruction is much more limited in
> > its addressing mode (only accepts register), so combine can generate the
> > right instructions.
>
> That's right. That's why I recommend a define_insn_and_split. The
> insn will be complete for most of the RTL optimization passes,
> including combine. It will then be split before the second lower
> subreg pass and before register allocation.
Somehow I think this can't be everything. Let's take a different example,
which is also relevant for other ports: x + (y << 32). Here one has a
partially constant argument, but one doesn't really know it until after
splitting the shift. This could be optimized to a simple move/add (but
currently isn't).
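(The move/add identity is easy to state in C - an illustrative sketch only; the interesting question in the thread is which RTL pass could discover it:)

```c
#include <stdint.h>

/* x + (y << 32): the addend's low word is zero, so no carry is produced
   and the low word of the sum is just the low word of x; only the high
   word needs an add. */
static uint64_t
shifted_add (uint64_t x, uint64_t y)
{
  return x + (y << 32);
}

static uint64_t
move_add (uint64_t x, uint64_t y)
{
  uint32_t lo = (uint32_t) x;
  uint32_t hi = (uint32_t) (x >> 32) + (uint32_t) y;
  return ((uint64_t) hi << 32) | lo;
}
```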
With a post-combine split one could now generate a combined add/shift
pattern which could be split into the simpler operations, but this is
too late for combine. A pre-combine split is also problematic, as the
information that the carry flag is zero can't get to the next
instruction, as it's usually hidden in a rather complex parallel pattern.
To handle this one could either rerun combine after split (which may
be done more than once) or we need a combined combine/split pass (which
selectively just combines the newly created instructions with other
instructions).
> Of course, that won't help your case, because you want to split before
> combine for other reasons.
There are more reasons, e.g. splitting post_inc/pre_dec operations later
is rather painful; it would be so much easier if one only had to deal with
simple memory operands and leave the bulk work to combine.
> > I think the reload problem may not be that bad, while a move instruction
> > sets the cc register, it often just sets it in the same way as the
> > previous instructions, especially the Z and M flag, which is enough for
> > the common test against zero. For these tests the move wouldn't really
> > clobber the flags, but just recreate them.
>
> It's unfortunately not that simple. In some cases reload will need to
> load some part of an address. When it does that the flags will no
> longer have any relation to what they are supposed to have.
This requires an output reload, doesn't it? If we only have a single
output, it would mean the final move would recreate the right flags.
For floating point values this would be even easier, the post processing
for setting the condition codes is always the same and move to memory
doesn't even change them.
> > A bit more difficult is a sub/cmp combination, here move would really
> > clobber flags, but luckily on m68k the sub instruction is quite flexible
> > and e.g. also accepts memory arguments as destination, so it would be
> > possible to convert a output reload into one or two input reloads.
> > So what basically would be needed is a port specific check in reload as to
> > whether an output reload is acceptable and possibly reject the
> > alternative and try it with another one.
> > That's currently the idea I would favour at least for m68k, does that
> > sound reasonable (or at least understandable :) ).
>
> That sounds unreasonable. reload is already very complicated, too
> complicated, and there isn't going to be any way to drop in something
> like that without making it significantly more complicated.
Considering the alternatives, I don't want to dismiss this option too
quickly. What is suggested in the wiki would only trade one problem for
another: cc0 might be gone this way, but one also has none of the benefits
and needs a lot of extra work to get anywhere near where it was before.
The problem _is_ reload, so I'd prefer to look at the real problem instead
of wasting a lot of effort to work around it.
Why are you so sure that there is not "any way"? Yes, reload is
complicated, but I don't think these changes would make it significantly
more complicated than it already is and I'm pretty sure that it will be
less complicated than any of the alternatives.
BTW there is a move instruction which doesn't modify the flags - movem -
but I would really prefer to only use it as a last resort.
bye, Roman
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
2007-05-23 17:27 ` Roman Zippel
@ 2007-05-23 17:48 ` Ian Lance Taylor
0 siblings, 0 replies; 36+ messages in thread
From: Ian Lance Taylor @ 2007-05-23 17:48 UTC (permalink / raw)
To: Roman Zippel
Cc: Eric Botcazou, gcc-patches, Rask Ingemann Lambertsen,
Uros Bizjak, Paolo Bonzini, Chris Lattner
Roman Zippel <zippel@linux-m68k.org> writes:
> To handle this one could either rerun combine after split (which may
> be done more than once) or we need a combined combine/split pass (which
> selectively just combines the newly created instructions with other
> instructions).
I agree that we don't handle all cases at present.
> > > I think the reload problem may not be that bad, while a move instruction
> > > sets the cc register, it often just sets it in the same way as the
> > > previous instructions, especially the Z and M flag, which is enough for
> > > the common test against zero. For these tests the move wouldn't really
> > > clobber the flags, but just recreate them.
> >
> > It's unfortunately not that simple. In some cases reload will need to
> > load some part of an address. When it does that the flags will no
> > longer have any relation to what they are supposed to have.
>
> This requires an output reload, doesn't it? If we only have single output,
> it would mean the final move would recreate the right flags.
> For floating point values this would be even easier, the post processing
> for setting the condition codes is always the same and move to memory
> doesn't even change them.
No, reloading an address can be for either an input or an output. It
will typically but not exclusively be expressed as
RELOAD_FOR_INPUT_ADDRESS or RELOAD_FOR_OUTPUT_ADDRESS.
> > > A bit more difficult is a sub/cmp combination, here move would really
> > > clobber flags, but luckily on m68k the sub instruction is quite flexible
> > > and e.g. also accepts memory arguments as destination, so it would be
> > > possible to convert a output reload into one or two input reloads.
> > > So what basically would be needed is a port specific check in reload as to
> > > whether an output reload is acceptable and possibly reject the
> > > alternative and try it with another one.
> > > That's currently the idea I would favour at least for m68k, does that
> > > sound reasonable (or at least understandable :) ).
> >
> > That sounds unreasonable. reload is already very complicated, too
> > complicated, and there isn't going to be any way to drop in something
> > like that without making it significantly more complicated.
>
> Considering the alternatives I don't want dismiss this option too quickly.
> What is suggested in the wiki would only trade one problem with another,
> cc0 might be gone this way, but one also had none of the benefits and
> needs a lot of extra work to get anywhere near where it was before.
> The problem _is_ reload, so I'd prefer to look at the real problem instead
> of wasting a lot of effort to work around it.
> Why are you so sure that there is not "any way"? Yes, reload is
> complicated, but I don't think these changes would make it significantly
> more complicated than it already is and I'm pretty sure that it will be
> less complicated than any of the alternatives.
Certainly the problem is reload and the answer is to eliminate it.
That is extremely hard but will solve the problem for good.
If you want to look at target specific routines to guide instruction
selection in reload, that is your choice. I do not think it would be
a wise way to approach this problem.
Ian
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
2007-05-22 22:32 ` Ian Lance Taylor
@ 2007-05-24 20:00 ` Rask Ingemann Lambertsen
0 siblings, 0 replies; 36+ messages in thread
From: Rask Ingemann Lambertsen @ 2007-05-24 20:00 UTC (permalink / raw)
To: Ian Lance Taylor
Cc: Roman Zippel, Eric Botcazou, gcc-patches, Uros Bizjak,
Paolo Bonzini, Chris Lattner
On Tue, May 22, 2007 at 03:31:36PM -0700, Ian Lance Taylor wrote:
> Rask Ingemann Lambertsen <rask@sygehus.dk> writes:
>
> > On the m68k, instructions with an address register destination don't set
> > the flags. It might actually work.
>
> That is true, but addresses can include data registers, as in
> %a0@(%d0) (aka 0(a0,d0) ).
Yes, but you can tell reload not to use data registers in addresses.
IIRC, wherever disp(An,Dn) is valid, so is disp(An,Am).
--
Rask Ingemann Lambertsen
end of thread, other threads:[~2007-05-24 19:56 UTC | newest]
Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-05-16 20:25 [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations Uros Bizjak
2007-05-16 20:34 ` Chris Lattner
2007-05-17 5:57 ` Uros Bizjak
2007-05-17 6:09 ` Eric Christopher
2007-05-17 6:12 ` Andrew Pinski
2007-05-17 6:22 ` Uros Bizjak
2007-05-17 11:13 ` Uros Bizjak
2007-05-17 6:43 ` Chris Lattner
2007-05-17 7:22 ` Paolo Bonzini
2007-05-17 9:25 ` Uros Bizjak
2007-05-17 11:15 ` Rask Ingemann Lambertsen
2007-05-17 21:38 ` Rask Ingemann Lambertsen
2007-05-18 6:54 ` Ian Lance Taylor
2007-05-18 7:10 ` Eric Botcazou
2007-05-18 16:48 ` Ian Lance Taylor
2007-05-18 20:37 ` Eric Botcazou
2007-05-18 22:32 ` Ian Lance Taylor
2007-05-22 17:27 ` Roman Zippel
2007-05-22 17:36 ` Ian Lance Taylor
2007-05-22 22:27 ` Rask Ingemann Lambertsen
2007-05-22 22:32 ` Ian Lance Taylor
2007-05-24 20:00 ` Rask Ingemann Lambertsen
2007-05-23 17:27 ` Roman Zippel
2007-05-23 17:48 ` Ian Lance Taylor
2007-05-22 22:19 ` Rask Ingemann Lambertsen
2007-05-17 16:49 ` Chris Lattner
2007-05-17 17:05 ` Eric Christopher
2007-05-18 10:23 ` Uros Bizjak
2007-05-18 18:31 ` Chris Lattner
2007-05-18 19:58 ` Uros Bizjak
2007-05-18 20:15 ` Chris Lattner
2007-05-18 21:17 ` Uros Bizjak
2007-05-18 21:49 ` Uros Bizjak
2007-05-18 22:49 ` Chris Lattner
2007-05-18 20:07 ` Uros Bizjak
2007-05-18 19:43 ` Uros Bizjak