* [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
@ 2007-05-16 20:25 Uros Bizjak
  2007-05-16 20:34 ` Chris Lattner
  0 siblings, 1 reply; 36+ messages in thread
From: Uros Bizjak @ 2007-05-16 20:25 UTC (permalink / raw)
To: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 736 bytes --]

Hello!

This patch adds 128-bit operations for x86_64 to longlong.h to speed up
TImode and TFmode arithmetic.  The patch also redefines i386's definitions
of count_trailing/leading_zeros from asm to the __builtin_ctz/__builtin_clz
builtins, as provided by the i386 backend.

The patch was bootstrapped on x86_64-pc-linux-gnu and regression tested for
all default languages with and without -m32.

2007-05-16  Uros Bizjak  <ubizjak@gmail.com>

	* longlong.h (__x86_64__): Add definitions for add_ssaaaa,
	sub_ddmmss, umul_ppmm, udiv_qrnnd, count_leading_zeros and
	count_trailing_zeros.
	(__i386__): Implement count_leading_zeros using __builtin_clz ().
	Implement count_trailing_zeros using __builtin_ctz ().

Uros.

[-- Attachment #2: x86_64-longlong.diff --]
[-- Type: text/x-patch, Size: 2243 bytes --]

Index: longlong.h
===================================================================
--- longlong.h	(revision 124771)
+++ longlong.h	(working copy)
@@ -341,19 +341,48 @@
 	   : "0" ((USItype) (n0)), \
 	     "1" ((USItype) (n1)), \
 	     "rm" ((USItype) (dv)))
-#define count_leading_zeros(count, x) \
-  do { \
-    USItype __cbtmp; \
-    __asm__ ("bsrl %1,%0" \
-	     : "=r" (__cbtmp) : "rm" ((USItype) (x))); \
-    (count) = __cbtmp ^ 31; \
-  } while (0)
-#define count_trailing_zeros(count, x) \
-  __asm__ ("bsfl %1,%0" : "=r" (count) : "rm" ((USItype)(x)))
+#define count_leading_zeros(count, x) ((count) = __builtin_clz (x))
+#define count_trailing_zeros(count, x) ((count) = __builtin_ctz (x))
 #define UMUL_TIME 40
 #define UDIV_TIME 40
 #endif /* 80x86 */
 
+#if (defined (__x86_64__) || defined (__i386__)) && W_TYPE_SIZE == 64
+#define add_ssaaaa(sh, sl, ah, al, bh, bl) \
+  __asm__ ("addq %5,%1\n\tadcq %3,%0" \
+	   : "=r" ((UDItype) (sh)), \
+	     "=&r" ((UDItype) (sl)) \
+	   : "%0" ((UDItype) (ah)), \
+	     "rem" ((UDItype) (bh)), \
+	     "%1" ((UDItype) (al)), \
+	     "rem" ((UDItype) (bl)))
+#define sub_ddmmss(sh, sl, ah, al, bh, bl) \
+  __asm__ ("subq %5,%1\n\tsbbq %3,%0" \
+	   : "=r" ((UDItype) (sh)), \
+	     "=&r" ((UDItype) (sl)) \
+	   : "0" ((UDItype) (ah)), \
+	     "rem" ((UDItype) (bh)), \
+	     "1" ((UDItype) (al)), \
+	     "rem" ((UDItype) (bl)))
+#define umul_ppmm(w1, w0, u, v) \
+  __asm__ ("mulq %3" \
+	   : "=a" ((UDItype) (w0)), \
+	     "=d" ((UDItype) (w1)) \
+	   : "%0" ((UDItype) (u)), \
+	     "rm" ((UDItype) (v)))
+#define udiv_qrnnd(q, r, n1, n0, dv) \
+  __asm__ ("divq %4" \
+	   : "=a" ((UDItype) (q)), \
+	     "=d" ((UDItype) (r)) \
+	   : "0" ((UDItype) (n0)), \
+	     "1" ((UDItype) (n1)), \
+	     "rm" ((UDItype) (dv)))
+#define count_leading_zeros(count, x) ((count) = __builtin_clzl (x))
+#define count_trailing_zeros(count, x) ((count) = __builtin_ctzl (x))
+#define UMUL_TIME 40
+#define UDIV_TIME 40
+#endif /* x86_64 */
+
 #if defined (__i960__) && W_TYPE_SIZE == 32
 #define umul_ppmm(w1, w0, u, v) \
 ({union {UDItype __ll; \

^ permalink raw reply	[flat|nested] 36+ messages in thread
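The asm definitions above can be cross-checked against GCC's native 128-bit support. Below is a sketch of the reference semantics of two of the primitives in portable C, assuming a 64-bit GCC target with `unsigned __int128`; the `*_ref` names are mine, not part of longlong.h:

```c
#include <assert.h>
#include <stdint.h>

/* add_ssaaaa adds the double-word value (bh:bl) to (ah:al), producing
   (sh:sl); umul_ppmm forms the full 128-bit product of two 64-bit words. */

static void add_ssaaaa_ref(uint64_t *sh, uint64_t *sl,
                           uint64_t ah, uint64_t al,
                           uint64_t bh, uint64_t bl)
{
  unsigned __int128 s = (((unsigned __int128) ah << 64) | al)
                        + (((unsigned __int128) bh << 64) | bl);
  *sl = (uint64_t) s;          /* low word */
  *sh = (uint64_t) (s >> 64);  /* high word, carry included */
}

static void umul_ppmm_ref(uint64_t *w1, uint64_t *w0,
                          uint64_t u, uint64_t v)
{
  unsigned __int128 p = (unsigned __int128) u * v;
  *w0 = (uint64_t) p;          /* low word of the product */
  *w1 = (uint64_t) (p >> 64);  /* high word of the product */
}
```

On x86_64, GCC compiles each of these bodies down to essentially the same one or two instructions the asm macros name (add/adc, mul).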
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations 2007-05-16 20:25 [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations Uros Bizjak @ 2007-05-16 20:34 ` Chris Lattner 2007-05-17 5:57 ` Uros Bizjak 0 siblings, 1 reply; 36+ messages in thread From: Chris Lattner @ 2007-05-16 20:34 UTC (permalink / raw) To: Uros Bizjak; +Cc: GCC Patches On May 16, 2007, at 1:23 PM, Uros Bizjak wrote: > Hello! > This patch adds 128bit operations for x86_64 to longlong.h to speed > up TImode and TFmode arithmetic. Why not implement these in terms of TImode operations themselves? Inline asm should be avoided when it can be :) -Chris ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
  2007-05-16 20:34 ` Chris Lattner
@ 2007-05-17  5:57 ` Uros Bizjak
  2007-05-17  6:09   ` Eric Christopher
  2007-05-17  6:43   ` Chris Lattner
  0 siblings, 2 replies; 36+ messages in thread
From: Uros Bizjak @ 2007-05-17 5:57 UTC (permalink / raw)
To: Chris Lattner; +Cc: GCC Patches

On 5/16/07, Chris Lattner <clattner@apple.com> wrote:

> > This patch adds 128bit operations for x86_64 to longlong.h to speed
> > up TImode and TFmode arithmetic.
>
> Why not implement these in terms of TImode operations themselves?
> Inline asm should be avoided when it can be :)

Unfortunately, the longlong.h definitions require their operands to be in
DImode (for the x86_64 target).  I guess it is not a coincidence that these
definitions can be implemented with exactly one x86 insn (two for
carry-propagating addition and subtraction).  Asm with one insn is not
harmful.

Uros.

^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
  2007-05-17  5:57 ` Uros Bizjak
@ 2007-05-17  6:09 ` Eric Christopher
  2007-05-17  6:12   ` Andrew Pinski
  1 sibling, 1 reply; 36+ messages in thread
From: Eric Christopher @ 2007-05-17 6:09 UTC (permalink / raw)
To: Uros Bizjak; +Cc: Chris Lattner, GCC Patches

On May 16, 2007, at 10:57 PM, Uros Bizjak wrote:

> On 5/16/07, Chris Lattner <clattner@apple.com> wrote:
>
>> > This patch adds 128bit operations for x86_64 to longlong.h to speed
>> > up TImode and TFmode arithmetic.
>>
>> Why not implement these in terms of TImode operations themselves?
>> Inline asm should be avoided when it can be :)
>
> Unfortunately, the longlong.h definitions require their operands to be in
> DImode (for the x86_64 target). I guess it is not a coincidence that
> these definitions can be implemented with exactly one x86 insn (two
> for carry-propagating addition and subtraction). Asm with one insn is
> not harmful.

Don't forget that asms are often barriers to optimization, so as few asms
as possible is always better.

That said, I don't off the top of my head have a better implementation :)

-eric

^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
  2007-05-17  6:09 ` Eric Christopher
@ 2007-05-17  6:12 ` Andrew Pinski
  2007-05-17  6:22   ` Uros Bizjak
  2007-05-17 11:13   ` Uros Bizjak
  0 siblings, 2 replies; 36+ messages in thread
From: Andrew Pinski @ 2007-05-17 6:12 UTC (permalink / raw)
To: Eric Christopher; +Cc: Uros Bizjak, Chris Lattner, GCC Patches

On 5/16/07, Eric Christopher <echristo@apple.com> wrote:
>
> Don't forget that asms are often barriers to optimization, so as few asms
> as possible is always better.
>
> That said, I don't off the top of my head have a better
> implementation :)

Oh, and longlong.h belongs to GMP, so you should submit your changes
upstream too.  Plus, the uses of what is located in longlong.h for GCC are
only in the support library, and who uses TImode anyways :).

Thanks,
Andrew Pinski

^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations 2007-05-17 6:12 ` Andrew Pinski @ 2007-05-17 6:22 ` Uros Bizjak 2007-05-17 11:13 ` Uros Bizjak 1 sibling, 0 replies; 36+ messages in thread From: Uros Bizjak @ 2007-05-17 6:22 UTC (permalink / raw) To: Andrew Pinski; +Cc: Eric Christopher, Chris Lattner, GCC Patches On 5/17/07, Andrew Pinski <pinskia@gmail.com> wrote: > > Don't forget that asms are often barriers to optimization so as few asm > > as possible is always better. Yes, but please note that these are not __volatile__. > > That said, I don't off the top of my head have a better > > implementation :) > > Oh and longlong.h belongs to GMP so you should submit your changes > upstream too. Plus the uses of what is located in longlong.h for GCC Thanks, I'll submit it there. > is only in the support library and who uses TImode anyways :). And 640k should be enough for everybody ;] Uros. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations 2007-05-17 6:12 ` Andrew Pinski 2007-05-17 6:22 ` Uros Bizjak @ 2007-05-17 11:13 ` Uros Bizjak 1 sibling, 0 replies; 36+ messages in thread From: Uros Bizjak @ 2007-05-17 11:13 UTC (permalink / raw) To: Andrew Pinski; +Cc: Eric Christopher, Chris Lattner, GCC Patches On 5/17/07, Andrew Pinski <pinskia@gmail.com> wrote: > Oh and longlong.h belongs to GMP so you should submit your changes > upstream too. Plus the uses of what is located in longlong.h for GCC > is only in the support library and who uses TImode anyways :). Hm, in gmp-4.2 longlong.h, the same code to support x86_64 is already present. Perhaps it is time to update longlong.h from GMP source. Uros. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
  2007-05-17  5:57 ` Uros Bizjak
  2007-05-17  6:09   ` Eric Christopher
@ 2007-05-17  6:43 ` Chris Lattner
  2007-05-17  7:22   ` Paolo Bonzini
  1 sibling, 1 reply; 36+ messages in thread
From: Chris Lattner @ 2007-05-17 6:43 UTC (permalink / raw)
To: Uros Bizjak; +Cc: GCC Patches

On May 16, 2007, at 10:57 PM, Uros Bizjak wrote:

> On 5/16/07, Chris Lattner <clattner@apple.com> wrote:
>
>> > This patch adds 128bit operations for x86_64 to longlong.h to speed
>> > up TImode and TFmode arithmetic.
>>
>> Why not implement these in terms of TImode operations themselves?
>> Inline asm should be avoided when it can be :)
>
> Unfortunately, the longlong.h definitions require their operands to be in
> DImode (for the x86_64 target). I guess it is not a coincidence that
> these definitions can be implemented with exactly one x86 insn (two
> for carry-propagating addition and subtraction). Asm with one insn is
> not harmful.

Just combine them together with logical ops.  Something like this:

typedef unsigned DI __attribute__((mode(DI)));
typedef unsigned TI __attribute__((mode(TI)));

void test_add(DI AL, DI AH, DI BL, DI BH, DI *RL, DI *RH) {
  TI A = AL | ((TI)AH << 64);
  TI B = BL | ((TI)BH << 64);

  TI C = A + B;
  *RL = (DI)C;
  *RH = (DI)(C >> 64);
}

void test_sub(DI AL, DI AH, DI BL, DI BH, DI *RL, DI *RH) {
  TI A = AL | ((TI)AH << 64);
  TI B = BL | ((TI)BH << 64);

  TI C = A - B;
  *RL = (DI)C;
  *RH = (DI)(C >> 64);
}

Should produce something like this:

_test_add:
	addq	%rdx, %rdi
	adcq	%rcx, %rsi
	movq	%rdi, (%r8)
	movq	%rsi, (%r9)
	ret
	.align	4
	.globl	_test_sub
_test_sub:
	subq	%rdx, %rdi
	sbbq	%rcx, %rsi
	movq	%rdi, (%r8)
	movq	%rsi, (%r9)
	ret
	.subsections_via_symbols

Inline asms are bad for a number of reasons: as Andrew mentioned, they are
scheduling barriers; also, the compiler can't reason about them, constant
fold, simplify, can't know the sizes of the instructions precisely, etc.
If your compiler isn't producing decent code for simple idioms like this, it seems profitable to fix the compiler, then write portable[1] C code to express these things. The compiler already knows how to efficiently implement 128-bit arithmetic, why re-implement the smarts of the compiler in a header? -Chris [1] portable across 64-bit GCC targets, not across other compilers of course. However, gcc-style inline asm isn't portable to other compilers either. ^ permalink raw reply [flat|nested] 36+ messages in thread
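A self-contained variant of the 128-bit add above, written so that B is assembled from BL/BH (note B must be built from BL, not AL), ready to compile and check on any 64-bit GCC target:

```c
#include <assert.h>

typedef unsigned DI __attribute__((mode(DI)));   /* 64-bit word */
typedef unsigned TI __attribute__((mode(TI)));   /* 128-bit word */

/* 128-bit add expressed through native TImode arithmetic: widen the
   two halves, add, and split the result back into 64-bit words. */
static void test_add(DI AL, DI AH, DI BL, DI BH, DI *RL, DI *RH)
{
  TI A = AL | ((TI)AH << 64);
  TI B = BL | ((TI)BH << 64);

  TI C = A + B;
  *RL = (DI)C;           /* low 64 bits */
  *RH = (DI)(C >> 64);   /* high 64 bits, carry included */
}
```

At -O2 this compiles to the add/adc pair shown in the thread, with no inline asm involved.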
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
  2007-05-17  6:43 ` Chris Lattner
@ 2007-05-17  7:22 ` Paolo Bonzini
  2007-05-17  9:25   ` Uros Bizjak
  0 siblings, 1 reply; 36+ messages in thread
From: Paolo Bonzini @ 2007-05-17 7:22 UTC (permalink / raw)
To: Chris Lattner; +Cc: Uros Bizjak, GCC Patches

> void test_add(DI AL, DI AH, DI BL, DI BH, DI *RL, DI *RH) {
>   TI A = AL | ((TI)AH << 64);
>   TI B = BL | ((TI)BH << 64);
>
>   TI C = A + B;
>   *RL = (DI)C;
>   *RH = (DI)(C >> 64);
> }
>
> Should produce something like this:
>
> _test_add:
> 	addq	%rdx, %rdi
> 	adcq	%rcx, %rsi
> 	movq	%rdi, (%r8)
> 	movq	%rsi, (%r9)
> 	ret

Right.  This probably wasn't true before the lower-subreg pass.

Paolo

^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
  2007-05-17  7:22 ` Paolo Bonzini
@ 2007-05-17  9:25 ` Uros Bizjak
  2007-05-17 11:15   ` Rask Ingemann Lambertsen
  2007-05-17 16:49   ` Chris Lattner
  1 sibling, 2 replies; 36+ messages in thread
From: Uros Bizjak @ 2007-05-17 9:25 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: Chris Lattner, GCC Patches

On 5/17/07, Paolo Bonzini <bonzini@gnu.org> wrote:

> > void test_add(DI AL, DI AH, DI BL, DI BH, DI *RL, DI *RH) {
> >   TI A = AL | ((TI)AH << 64);
> >   TI B = BL | ((TI)BH << 64);
> >
> >   TI C = A + B;
> >   *RL = (DI)C;
> >   *RH = (DI)(C >> 64);
> > }
> >
> > Should produce something like this:
> >
> > _test_add:
> > 	addq	%rdx, %rdi
> > 	adcq	%rcx, %rsi
> > 	movq	%rdi, (%r8)
> > 	movq	%rsi, (%r9)
> > 	ret
>
> Right.  This probably wasn't true before the lower-subreg pass.

Unfortunately, this approach confuses gcc a lot.  The attached testcase
will illustrate the problem on i686 (substitute SI with DI and DI with TI
on 64bit and change shift values for the same result):

--cut here--
typedef unsigned SI __attribute__ ((mode (SI)));
typedef unsigned DI __attribute__ ((mode (DI)));

#define add_ssaaaa_c(sh, sl, ah, al, bh, bl) \
{ \
  DI __a = (al) | ((DI) ah << 32); \
  DI __b = (bl) | ((DI) bh << 32); \
 \
  DI __c = __a + __b; \
 \
  (sl) = (SI) (__c & 0xffffffff); \
  (sh) = (SI) (__c >> 32); \
}

#define add_ssaaaa_asm(sh, sl, ah, al, bh, bl) \
  __asm__ ("addl %5,%1\n\tadcl %3,%0" \
	   : "=r" ((SI) (sh)), \
	     "=&r" ((SI) (sl)) \
	   : "%0" ((SI) (ah)), \
	     "g" ((SI) (bh)), \
	     "%1" ((SI) (al)), \
	     "g" ((SI) (bl)))

void test_c (SI a, SI b, SI c, SI d)
{
  volatile SI x, y;
  add_ssaaaa_c (x, y, a, b, c, d);
}

void test_asm (SI a, SI b, SI c, SI d)
{
  volatile SI x, y;
  add_ssaaaa_asm (x, y, a, b, c, d);
}
--cut here--

Comparing the two approaches head-to-head on the i686 target, gcc -O2
-fomit-frame-pointer generates:

test_c:
	subl	$28, %esp	#,
	xorl	%edx, %edx	#
	movl	40(%esp), %eax	# c, tmp66
	movl	44(%esp), %ecx	# d, d
	movl	%esi, 20(%esp)	#,
	movl	%ebx, 16(%esp)	#,
	movl	%eax, %edx	# tmp66,
	movl	$0, %eax	#, tmp66
	movl	%eax, %esi	#, tmp74
	movl	32(%esp), %eax	# a, tmp70
	movl	%edx, %ebx	# tmp75, __c
	orl	%ecx, %esi	# d, tmp74
	xorl	%edx, %edx	#
	movl	%esi, %ecx	# tmp74, __c
	movl	36(%esp), %esi	# b, b
	movl	%edi, 24(%esp)	#,
	movl	24(%esp), %edi	#,
	movl	%eax, %edx	# tmp70,
	movl	$0, %eax	#, tmp70
	orl	%esi, %eax	# b, tmp72
	movl	20(%esp), %esi	#,
	addl	%eax, %ecx	# tmp72, __c
	adcl	%edx, %ebx	#, __c
	movl	%ecx, 8(%esp)	# __c, y
	movl	%ebx, %ecx	# __c, __c
	xorl	%ebx, %ebx	# __c
	movl	16(%esp), %ebx	#,
	movl	%ecx, 12(%esp)	# __c, x
	addl	$28, %esp	#,
	ret

and on the other hand for the asm approach:

test_asm:
	subl	$16, %esp	#,
	movl	20(%esp), %eax	# a, a
	movl	24(%esp), %edx	# b, b
#APP
	addl	32(%esp),%edx	# d, tmp63
	adcl	28(%esp),%eax	# c, tmp62
#NO_APP
	movl	%eax, 12(%esp)	# tmp62, x
	movl	%edx, 8(%esp)	# tmp63, y
	addl	$16, %esp	#,
	ret

Although the addition itself generates the expected code, assembling
SImode values into DImode really confuses gcc.  Also, the restore of %ebx
from the stack interferes with the final DImode->SImode split in a really
bad way.

Looking into this mess, I think that gcc still has a long way to go
w.r.t. DImode values on 32bit targets (and TImode values on 64bit targets),
and the only solution for the near future is using asm here...

Uros.

^ permalink raw reply	[flat|nested] 36+ messages in thread
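For contrast with both the wide-type and asm formulations above, the fully portable form that longlong.h-style headers fall back to when no machine-specific definition is available detects the carry out of the low word with a plain unsigned comparison. A sketch (the `_portable` name is mine):

```c
#include <assert.h>
#include <stdint.h>

/* Double-word add without any wide type or asm: the low words are added
   modulo 2^32, and the low add carried iff the wrapped sum is smaller
   than one of its operands. */
static void add_ssaaaa_portable(uint32_t *sh, uint32_t *sl,
                                uint32_t ah, uint32_t al,
                                uint32_t bh, uint32_t bl)
{
  uint32_t x = al + bl;        /* low word, wraps mod 2^32 */
  *sh = ah + bh + (x < al);    /* x < al exactly when the low add carried */
  *sl = x;
}
```

Modern GCC recognizes this idiom and usually turns the comparison back into an add/adc pair, which is part of why the asm-vs-C comparison above matters.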
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
  2007-05-17  9:25 ` Uros Bizjak
@ 2007-05-17 11:15 ` Rask Ingemann Lambertsen
  2007-05-17 21:38   ` Rask Ingemann Lambertsen
  1 sibling, 1 reply; 36+ messages in thread
From: Rask Ingemann Lambertsen @ 2007-05-17 11:15 UTC (permalink / raw)
To: Uros Bizjak; +Cc: Paolo Bonzini, Chris Lattner, GCC Patches

On Thu, May 17, 2007 at 11:25:34AM +0200, Uros Bizjak wrote:

> Unfortunately, this approach confuses gcc a lot.  The attached testcase
> will illustrate the problem on i686 (substitute SI with DI and DI with TI
> on 64bit and change shift values for the same result):

[snip]

> Comparing the two approaches head-to-head on the i686 target, gcc -O2
> -fomit-frame-pointer generates:
>
> test_c:
> 	subl	$28, %esp	#,
> 	xorl	%edx, %edx	#
> 	movl	40(%esp), %eax	# c, tmp66
> 	movl	44(%esp), %ecx	# d, d
> 	movl	%esi, 20(%esp)	#,
> 	movl	%ebx, 16(%esp)	#,
> 	movl	%eax, %edx	# tmp66,
> 	movl	$0, %eax	#, tmp66
> 	movl	%eax, %esi	#, tmp74
> 	movl	32(%esp), %eax	# a, tmp70
> 	movl	%edx, %ebx	# tmp75, __c
> 	orl	%ecx, %esi	# d, tmp74
> 	xorl	%edx, %edx	#
> 	movl	%esi, %ecx	# tmp74, __c
> 	movl	36(%esp), %esi	# b, b
> 	movl	%edi, 24(%esp)	#,
> 	movl	24(%esp), %edi	#,
> 	movl	%eax, %edx	# tmp70,
> 	movl	$0, %eax	#, tmp70
> 	orl	%esi, %eax	# b, tmp72
> 	movl	20(%esp), %esi	#,
> 	addl	%eax, %ecx	# tmp72, __c
> 	adcl	%edx, %ebx	#, __c
> 	movl	%ecx, 8(%esp)	# __c, y
> 	movl	%ebx, %ecx	# __c, __c
> 	xorl	%ebx, %ebx	# __c
> 	movl	16(%esp), %ebx	#,
> 	movl	%ecx, 12(%esp)	# __c, x
> 	addl	$28, %esp	#,
> 	ret

You must be running into some sort of target-specific problem, because
x86 16-bit -O2 -fno-omit-frame-pointer gets

test_c:
	pushw	%cx	;# 62	*pushhi1
	pushw	%di	;# 63	*pushhi1
	enterw	$6-2, $0	;# 64	_enter
	movw	14(%bp),%ax	;# 18	*movhi/1
	movw	12(%bp),%di	;# 19	*movhi/1
	movw	10(%bp),%dx	;# 30	*movhi/1
	movw	8(%bp), %cx	;# 31	*movhi/1
	addw	%dx, %ax	;# 56	_addhi3_cc_for_carry/1
	movw	%di, %dx	;# 61	*movhi/1
	adcw	%cx, %dx	;# 57	_addhi3_carry/1
	movw	%ax, -4(%bp)	;# 35	*movhi/2
	movw	%dx, -2(%bp)	;# 40	*movhi/2
	leavew	;# 67	_leave
	popw	%di	;# 68	*pophi1
	popw	%cx	;# 69	*pophi1
	ret	;# 70	*return

with the obvious s/SI/HI/, s/DI/SI/, etc. changes to your testcase.  While
the above could be improved a little, notice the complete absence of "or"
and "xor" instructions.  (Why do you have both "movl $0, %reg" and
"xorl %reg, %reg"?)

-- 
Rask Ingemann Lambertsen
741 unexpected failures, and counting...

^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations 2007-05-17 11:15 ` Rask Ingemann Lambertsen @ 2007-05-17 21:38 ` Rask Ingemann Lambertsen 2007-05-18 6:54 ` Ian Lance Taylor 0 siblings, 1 reply; 36+ messages in thread From: Rask Ingemann Lambertsen @ 2007-05-17 21:38 UTC (permalink / raw) To: Uros Bizjak; +Cc: Paolo Bonzini, Chris Lattner, GCC Patches On Thu, May 17, 2007 at 01:15:48PM +0200, Rask Ingemann Lambertsen wrote: > You must be running into some sort of target specific problem, The i386 back end is splitting adddi3 after reload, which is too late. I'm not sure that splitting before reload is early enough, but that is the easiest change to make. -- Rask Ingemann Lambertsen ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
  2007-05-17 21:38 ` Rask Ingemann Lambertsen
@ 2007-05-18  6:54 ` Ian Lance Taylor
  2007-05-18  7:10   ` Eric Botcazou
  0 siblings, 1 reply; 36+ messages in thread
From: Ian Lance Taylor @ 2007-05-18 6:54 UTC (permalink / raw)
To: Rask Ingemann Lambertsen
Cc: Uros Bizjak, Paolo Bonzini, Chris Lattner, GCC Patches

Rask Ingemann Lambertsen <rask@sygehus.dk> writes:

> On Thu, May 17, 2007 at 01:15:48PM +0200, Rask Ingemann Lambertsen wrote:
>
> > You must be running into some sort of target-specific problem,
>
> The i386 back end is splitting adddi3 after reload, which is too late.
> I'm not sure that splitting before reload is early enough, but that is the
> easiest change to make.

With the new lower-subreg patch, insns like adddi3 should split before
reload when possible.  But you can't use the existing split in i386.md,
because it relies on FLAGS_REG, and reload may accidentally clobber
FLAGS_REG.  In my uncommitted follow-on patch to lower-subreg, I did it
like this:

(define_split
  [(set (match_operand:DI 0 "nonimmediate_operand" "")
	(plus:DI (match_operand:DI 1 "nonimmediate_operand" "")
		 (match_operand:DI 2 "general_operand" "")))
   (clobber (reg:CC FLAGS_REG))]
  "!TARGET_64BIT"
  [(parallel [(set (match_dup 0)
		   (plus:SI (match_dup 1) (match_dup 2)))
	      (set (match_dup 3)
		   (plus:SI (plus:SI (match_dup 4) (match_dup 5))
			    (truncate:SI
			     (lshiftrt:DI
			      (plus:DI (zero_extend:DI (match_dup 1))
				       (zero_extend:DI (match_dup 2)))
			      (const_int 32)))))
	      (clobber (reg:CC FLAGS_REG))])]
{
  split_di (operands + 0, 1, operands + 0, operands + 3);
  split_di (operands + 1, 1, operands + 1, operands + 4);
  split_di (operands + 2, 1, operands + 2, operands + 5);
})

(define_insn_and_split "*adddi3_2"
  [(set (match_operand:SI 0 "nonimmediate_operand" "=rm,r,rm,r")
	(plus:SI (match_operand:SI 1 "nonimmediate_operand" "%0,0,0,0")
		 (match_operand:SI 2 "general_operand" "ri,rm,ri,rm")))
   (set (match_operand:SI 3 "nonimmediate_operand" "=rm,r,rm,r")
	(plus:SI (plus:SI (match_operand:SI 4 "general_operand" "3,3,ri,rm")
			  (match_operand:SI 5 "general_operand" "ri,rm,3,3"))
		 (truncate:SI
		  (lshiftrt:DI
		   (plus:DI (zero_extend:DI (match_dup 1))
			    (zero_extend:DI (match_dup 2)))
		   (const_int 32)))))
   (clobber (reg:CC FLAGS_REG))]
  "!TARGET_64BIT"
  "#"
  "&& reload_completed"
  [(parallel [(set (reg:CC FLAGS_REG)
		   (unspec:CC [(match_dup 1) (match_dup 2)]
			      UNSPEC_ADD_CARRY))
	      (set (match_dup 0)
		   (plus:SI (match_dup 1) (match_dup 2)))])
   (parallel [(set (match_dup 3)
		   (plus:SI (plus:SI (ltu:SI (reg:CC FLAGS_REG) (const_int 0))
				     (match_dup 4))
			    (match_dup 5)))
	      (clobber (reg:CC FLAGS_REG))])]
{
  /* We can only have one pair of commutative operands in an insn, so
     we commute operands 4 and 5 manually.  */
  if (rtx_equal_p (operands[5], operands[3]))
    {
      rtx tmp = operands[4];
      operands[4] = operands[5];
      operands[5] = tmp;
    }
})

Ian

^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations 2007-05-18 6:54 ` Ian Lance Taylor @ 2007-05-18 7:10 ` Eric Botcazou 2007-05-18 16:48 ` Ian Lance Taylor 0 siblings, 1 reply; 36+ messages in thread From: Eric Botcazou @ 2007-05-18 7:10 UTC (permalink / raw) To: Ian Lance Taylor Cc: gcc-patches, Rask Ingemann Lambertsen, Uros Bizjak, Paolo Bonzini, Chris Lattner > With the new lower-subreg patch, insns like adddi3 should split before > reload when possible. If you haven't already done it, could you sketch the modifications that should be made to a 32-bit back-end, now that the lower-subreg patch is integrated? -- Eric Botcazou ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations 2007-05-18 7:10 ` Eric Botcazou @ 2007-05-18 16:48 ` Ian Lance Taylor 2007-05-18 20:37 ` Eric Botcazou 2007-05-22 17:27 ` Roman Zippel 0 siblings, 2 replies; 36+ messages in thread From: Ian Lance Taylor @ 2007-05-18 16:48 UTC (permalink / raw) To: Eric Botcazou Cc: gcc-patches, Rask Ingemann Lambertsen, Uros Bizjak, Paolo Bonzini, Chris Lattner Eric Botcazou <ebotcazou@libertysurf.fr> writes: > > With the new lower-subreg patch, insns like adddi3 should split before > > reload when possible. > > If you haven't already done it, could you sketch the modifications that should > be made to a 32-bit back-end, now that the lower-subreg patch is integrated? First I'll say that the lower-subreg patch is not yet completely integrated. There is another part of it, which I have working, but which I'm holding in the hopes that it will work better in the DF framework. That is the part which tracks subreg lifetimes during register allocation. From the backend perspective, the best way to take advantage of lower-subreg is conceptually simple if sometimes awkward in practice: for multi-word operations, use define_insn_and_split and split the insns before reload. For CC0-style machines, in which reload may clobber the flags register, some multi-word operations will need two define_insn_and_splits. The first split should run before reload, and should generate one or more insns which express the computation entirely in terms of register-sized subregs. The second split should run after reload and should generate the actual instructions which use the flags register. The idea is that the first split into subregs will permit the lower-subreg pass and the register allocator to allocate the values involved as separate single registers rather than as a single multi-word register. 
The second split, which will be similar to existing code, will then generate the real machine instructions for the second scheduling pass. For a non-CC0 style machine things are simpler. Just split multi-word operations before reload. There was never any particular need to wait until after reload to split them in the past. And now you really don't want to wait. For clarity, the relevant distinction between a CC0 machine and a non-CC0 machine is whether reload is going to generate loads and stores which clobber the flags register. On a non-CC0 machine the flags register is not set by memory moves. On a CC0 machine it is. Ian ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations 2007-05-18 16:48 ` Ian Lance Taylor @ 2007-05-18 20:37 ` Eric Botcazou 2007-05-18 22:32 ` Ian Lance Taylor 2007-05-22 17:27 ` Roman Zippel 1 sibling, 1 reply; 36+ messages in thread From: Eric Botcazou @ 2007-05-18 20:37 UTC (permalink / raw) To: Ian Lance Taylor Cc: gcc-patches, Rask Ingemann Lambertsen, Uros Bizjak, Paolo Bonzini, Chris Lattner > From the backend perspective, the best way to take advantage of > lower-subreg is conceptually simple if sometimes awkward in practice: > for multi-word operations, use define_insn_and_split and split the > insns before reload. OK, I see, thanks. > For a non-CC0 style machine things are simpler. Just split multi-word > operations before reload. There was never any particular need to wait > until after reload to split them in the past. And now you really > don't want to wait. Of course SPARC-V8 is the perfect counter-example since you have a pairing instruction "ldd" for consecutive integer registers. :-) -- Eric Botcazou ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations 2007-05-18 20:37 ` Eric Botcazou @ 2007-05-18 22:32 ` Ian Lance Taylor 0 siblings, 0 replies; 36+ messages in thread From: Ian Lance Taylor @ 2007-05-18 22:32 UTC (permalink / raw) To: Eric Botcazou Cc: gcc-patches, Rask Ingemann Lambertsen, Uros Bizjak, Paolo Bonzini, Chris Lattner Eric Botcazou <ebotcazou@libertysurf.fr> writes: > > For a non-CC0 style machine things are simpler. Just split multi-word > > operations before reload. There was never any particular need to wait > > until after reload to split them in the past. And now you really > > don't want to wait. > > Of course SPARC-V8 is the perfect counter-example since you have a pairing > instruction "ldd" for consecutive integer registers. :-) Yes, good point, you probably don't want to split a DImode load or store there. You can still split everything else, though, and the right thing should happen. Unfortunately this means that if one of the halves of the load is unused, we won't convert back to an SImode load. Ian ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations
  2007-05-18 16:48 ` Ian Lance Taylor
  2007-05-18 20:37   ` Eric Botcazou
@ 2007-05-22 17:27 ` Roman Zippel
  2007-05-22 17:36   ` Ian Lance Taylor
  2007-05-22 22:19   ` Rask Ingemann Lambertsen
  1 sibling, 2 replies; 36+ messages in thread
From: Roman Zippel @ 2007-05-22 17:27 UTC (permalink / raw)
To: Ian Lance Taylor
Cc: Eric Botcazou, gcc-patches, Rask Ingemann Lambertsen, Uros Bizjak,
    Paolo Bonzini, Chris Lattner

Hi,

On Fri, 18 May 2007, Ian Lance Taylor wrote:

> From the backend perspective, the best way to take advantage of
> lower-subreg is conceptually simple if sometimes awkward in practice:
> for multi-word operations, use define_insn_and_split and split the
> insns before reload.

I'm playing around with it a little on m68k, and there is sometimes a bit
of a conflict.  E.g. I would split something like adddi3/subdi3 already
before combine, because the addx/subx instruction is much more limited in
its addressing modes (it only accepts registers), so combine can generate
the right instructions.  The problem is now that other optimizations can't
do very much with these instructions; for example, something like this:

	unsigned long long x, y;
	return x + (y >> 63);

is better handled as:

	unsigned long long x, y;
	return x - ((long long)y >> 63);

It's not a very common thing, so it wouldn't be a big loss, but something
like this is currently part of the m68k description (and something very
early already converts the second form into the first form...).
The point is that as soon as we expose the two separate adds,
transformations like the above, which require changing both adds to subs,
pretty much become impossible.  The only alternative I can think of is to
have a port-specific tree pass and change this before the RTL is
generated.

In this context I'm curious about another problem: how should it work in
general (i.e. independent of cc0) to combine a subdi/cmpdi or subdi/tstdi
pattern?
> For clarity, the relevant distinction between a CC0 machine and a
> non-CC0 machine is whether reload is going to generate loads and
> stores which clobber the flags register.  On a non-CC0 machine the
> flags register is not set by memory moves.  On a CC0 machine it is.

BTW, I was thinking about this a little (a possible conversion of m68k
away from cc0 in general, and this reload problem specifically).  I don't
really like the idea of hiding cc0 until after reload, as it prevents many
optimizations (e.g. smi is heavily used for sign extensions, but it needs
the cc flags from the previous instructions, so one would need many
patterns to combine a <op>/set/smi sequence).

I think the reload problem may not be that bad: while a move instruction
sets the cc register, it often just sets it in the same way as the
previous instructions, especially the Z and M flag, which is enough for
the common test against zero.  For these tests the move wouldn't really
clobber the flags, but just recreate them.
A bit more difficult is a sub/cmp combination; here a move would really
clobber the flags, but luckily on m68k the sub instruction is quite
flexible and e.g. also accepts memory arguments as the destination, so it
would be possible to convert an output reload into one or two input
reloads.
So what would basically be needed is a port-specific check in reload as to
whether an output reload is acceptable, possibly rejecting the alternative
and trying it with another one.
That's currently the idea I would favour, at least for m68k.  Does that
sound reasonable (or at least understandable :) )?

bye, Roman

^ permalink raw reply	[flat|nested] 36+ messages in thread
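The pair of forms Roman contrasts really do compute the same value: for an unsigned 64-bit y, the logical shift y >> 63 yields 0 or 1, while the arithmetic shift (long long)y >> 63 yields 0 or -1, so adding the former equals subtracting the latter. A quick check, relying on GCC's documented implementation-defined behaviour (wrapping signed conversion, arithmetic right shift of negative values):

```c
#include <assert.h>
#include <stdint.h>

/* Form 1: add the top bit of y (0 or 1). */
static uint64_t form_add(uint64_t x, uint64_t y)
{
  return x + (y >> 63);
}

/* Form 2: subtract the sign-extended top bit of y (0 or -1).
   Subtracting 0xffff...ff modulo 2^64 is the same as adding 1. */
static uint64_t form_sub(uint64_t x, uint64_t y)
{
  return x - (uint64_t)((int64_t)y >> 63);
}
```

The second form is attractive on m68k because the 0-or-minus-1 word is exactly what an asr/smi-style sequence produces, which is why a combiner pattern wants to flip both adds into subs at once.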
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations 2007-05-22 17:27 ` Roman Zippel @ 2007-05-22 17:36 ` Ian Lance Taylor 2007-05-22 22:27 ` Rask Ingemann Lambertsen 2007-05-23 17:27 ` Roman Zippel 2007-05-22 22:19 ` Rask Ingemann Lambertsen 1 sibling, 2 replies; 36+ messages in thread From: Ian Lance Taylor @ 2007-05-22 17:36 UTC (permalink / raw) To: Roman Zippel Cc: Eric Botcazou, gcc-patches, Rask Ingemann Lambertsen, Uros Bizjak, Paolo Bonzini, Chris Lattner Roman Zippel <zippel@linux-m68k.org> writes: > I'm playing around with it a little on m68k and there is sometimes a bit > of a conflict. E.g. I would split somthing like adddi3/subdi3 already > before combine, because the addx/subx instruction is much more limited in > its addressing mode (only accepts register), so combine can generate the > right instructions. That's right. That's why I recommend a define_insn_and_split. The insn will be complete for most of the RTL optimization passes, including combine. It will then be split before the second lower subreg pass and before register allocation. Of course, that won't help your case, because you want to split before combine for other reasons. > I think the reload problem may not be that bad, while a move instruction > sets the cc register, it often just sets it in the same way as the > previous instructions, especially the Z and M flag, which is enough for > the common test against zero. For these tests the move wouldn't really > clobber the flags, but just recreate them. It's unfortunately not that simple. In some cases reload will need to load some part of an address. When it does that the flags will no longer have any relation to what they are supposed to have. > A bit more difficult is a sub/cmp combination, here move would really > clobber flags, but luckily on m68k the sub instruction is quite flexible > and e.g. 
also accepts memory arguments as destination, so it would be > possible to convert a output reload into one or two input reloads. > So what basically would be needed is a port specific check in reload as to > whether an output reload is acceptable and possibly reject the > alternative and try it with another one. > That's currently the idea I would favour at least for m68k, does that > sound reasonable (or at least understandable :) ). That sounds unreasonable. reload is already very complicated, too complicated, and there isn't going to be any way to drop in something like that without making it significantly more complicated. Ian ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations 2007-05-22 17:36 ` Ian Lance Taylor @ 2007-05-22 22:27 ` Rask Ingemann Lambertsen 2007-05-22 22:32 ` Ian Lance Taylor 2007-05-23 17:27 ` Roman Zippel 1 sibling, 1 reply; 36+ messages in thread From: Rask Ingemann Lambertsen @ 2007-05-22 22:27 UTC (permalink / raw) To: Ian Lance Taylor Cc: Roman Zippel, Eric Botcazou, gcc-patches, Uros Bizjak, Paolo Bonzini, Chris Lattner On Tue, May 22, 2007 at 10:35:25AM -0700, Ian Lance Taylor wrote: > Roman Zippel <zippel@linux-m68k.org> writes: > > > I think the reload problem may not be that bad, while a move instruction > > sets the cc register, it often just sets it in the same way as the > > previous instructions, especially the Z and M flag, which is enough for > > the common test against zero. For these tests the move wouldn't really > > clobber the flags, but just recreate them. > > It's unfortunately not that simple. In some cases reload will need to > load some part of an address. When it does that the flags will no > longer have any relation to what they are supposed to have. On the m68k, instructions with an address register destination don't set the flags. It might actually work. -- Rask Ingemann Lambertsen ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations 2007-05-22 22:27 ` Rask Ingemann Lambertsen @ 2007-05-22 22:32 ` Ian Lance Taylor 2007-05-24 20:00 ` Rask Ingemann Lambertsen 0 siblings, 1 reply; 36+ messages in thread From: Ian Lance Taylor @ 2007-05-22 22:32 UTC (permalink / raw) To: Rask Ingemann Lambertsen Cc: Roman Zippel, Eric Botcazou, gcc-patches, Uros Bizjak, Paolo Bonzini, Chris Lattner Rask Ingemann Lambertsen <rask@sygehus.dk> writes: > On Tue, May 22, 2007 at 10:35:25AM -0700, Ian Lance Taylor wrote: > > Roman Zippel <zippel@linux-m68k.org> writes: > > > > > I think the reload problem may not be that bad, while a move instruction > > > sets the cc register, it often just sets it in the same way as the > > > previous instructions, especially the Z and M flag, which is enough for > > > the common test against zero. For these tests the move wouldn't really > > > clobber the flags, but just recreate them. > > > > It's unfortunately not that simple. In some cases reload will need to > > load some part of an address. When it does that the flags will no > > longer have any relation to what they are supposed to have. > > On the m68k, instructions with an address register destination don't set > the flags. It might actually work. That is true, but addresses can include data registers, as in %a0@(%d0) (aka 0(a0,d0) ). Ian ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations 2007-05-22 22:32 ` Ian Lance Taylor @ 2007-05-24 20:00 ` Rask Ingemann Lambertsen 0 siblings, 0 replies; 36+ messages in thread From: Rask Ingemann Lambertsen @ 2007-05-24 20:00 UTC (permalink / raw) To: Ian Lance Taylor Cc: Roman Zippel, Eric Botcazou, gcc-patches, Uros Bizjak, Paolo Bonzini, Chris Lattner On Tue, May 22, 2007 at 03:31:36PM -0700, Ian Lance Taylor wrote: > Rask Ingemann Lambertsen <rask@sygehus.dk> writes: > > > On the m68k, instructions with an address register destination don't set > > the flags. It might actually work. > > That is true, but addresses can include data registers, as in > %a0@(%d0) (aka 0(a0,d0) ). Yes, but you can tell reload not to use data registers in addresses. IIRC, wherever disp(An,Dn) is valid, so is disp(An,Am). -- Rask Ingemann Lambertsen ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations 2007-05-22 17:36 ` Ian Lance Taylor 2007-05-22 22:27 ` Rask Ingemann Lambertsen @ 2007-05-23 17:27 ` Roman Zippel 2007-05-23 17:48 ` Ian Lance Taylor 1 sibling, 1 reply; 36+ messages in thread From: Roman Zippel @ 2007-05-23 17:27 UTC (permalink / raw) To: Ian Lance Taylor Cc: Eric Botcazou, gcc-patches, Rask Ingemann Lambertsen, Uros Bizjak, Paolo Bonzini, Chris Lattner Hi, On Tue, 22 May 2007, Ian Lance Taylor wrote: > Roman Zippel <zippel@linux-m68k.org> writes: > > > I'm playing around with it a little on m68k and there is sometimes a bit > > of a conflict. E.g. I would split something like adddi3/subdi3 already > > before combine, because the addx/subx instruction is much more limited in > > its addressing mode (only accepts register), so combine can generate the > > right instructions. > > That's right. That's why I recommend a define_insn_and_split. The > insn will be complete for most of the RTL optimization passes, > including combine. It will then be split before the second lower > subreg pass and before register allocation. Somehow I think this can't be everything. Let's take a different example, which is also relevant for other ports: x + (y << 32). Here one has a partial constant argument, but one doesn't really know it until after splitting the shift. This could be optimized to a simple move/add (but currently isn't). With a post-combine-split one could now generate a combined add/shift pattern which could be split into the simpler operations, but this is too late for combine. A pre-combine-split is also problematic, as the information that the carry flag is zero can't get to the next instruction, as it's usually hidden in a rather complex parallel pattern. 
To handle this one could either rerun combine after split (which may be done more than once) or we need a combined combine/split pass (which selectively just combines the newly created instructions with other instructions). > Of course, that won't help your case, because you want to split before > combine for other reasons. There are more reasons, e.g. splitting post_inc/pre_dec operations later is rather painful, it would be so much easier if one only had to deal with simple memory operands and leave the bulk work to combine. > > I think the reload problem may not be that bad, while a move instruction > > sets the cc register, it often just sets it in the same way as the > > previous instructions, especially the Z and M flag, which is enough for > > the common test against zero. For these tests the move wouldn't really > > clobber the flags, but just recreate them. > > It's unfortunately not that simple. In some cases reload will need to > load some part of an address. When it does that the flags will no > longer have any relation to what they are supposed to have. This requires an output reload, doesn't it? If we only have a single output, it would mean the final move would recreate the right flags. For floating point values this would be even easier, the post processing for setting the condition codes is always the same and a move to memory doesn't even change them. > > A bit more difficult is a sub/cmp combination, here move would really > > clobber flags, but luckily on m68k the sub instruction is quite flexible > > and e.g. also accepts memory arguments as destination, so it would be > > possible to convert an output reload into one or two input reloads. > > So what basically would be needed is a port specific check in reload as to > > whether an output reload is acceptable and possibly reject the > > alternative and try it with another one. 
> > That's currently the idea I would favour at least for m68k, does that > > sound reasonable (or at least understandable :) ). > > That sounds unreasonable. reload is already very complicated, too > complicated, and there isn't going to be any way to drop in something > like that without making it significantly more complicated. Considering the alternatives I don't want to dismiss this option too quickly. What is suggested in the wiki would only trade one problem for another: cc0 might be gone this way, but one would also have none of the benefits and would need a lot of extra work to get anywhere near where it was before. The problem _is_ reload, so I'd prefer to look at the real problem instead of wasting a lot of effort to work around it. Why are you so sure that there is not "any way"? Yes, reload is complicated, but I don't think these changes would make it significantly more complicated than it already is and I'm pretty sure that it will be less complicated than any of the alternatives. BTW there is a move instruction which doesn't modify the flags - movem - but I would really prefer to only use it as a last resort. bye, Roman ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations 2007-05-23 17:27 ` Roman Zippel @ 2007-05-23 17:48 ` Ian Lance Taylor 0 siblings, 0 replies; 36+ messages in thread From: Ian Lance Taylor @ 2007-05-23 17:48 UTC (permalink / raw) To: Roman Zippel Cc: Eric Botcazou, gcc-patches, Rask Ingemann Lambertsen, Uros Bizjak, Paolo Bonzini, Chris Lattner Roman Zippel <zippel@linux-m68k.org> writes: > To handle this one could either rerun combine after split (which may > be done more than once) or we need a combined combine/split pass (which > selectively just combines the newly created instructions with other > instructions). I agree that we don't handle all cases at present. > > > I think the reload problem may not be that bad, while a move instruction > > > sets the cc register, it often just sets it in the same way as the > > > previous instructions, especially the Z and M flag, which is enough for > > > the common test against zero. For these tests the move wouldn't really > > > clobber the flags, but just recreate them. > > > > It's unfortunately not that simple. In some cases reload will need to > > load some part of an address. When it does that the flags will no > > longer have any relation to what they are supposed to have. > > This requires an output reload, doesn't it? If we only have single output, > it would mean the final move would recreate the right flags. > For floating point values this would be even easier, the post processing > for setting the condition codes is always the same and move to memory > doesn't even change them. No, reloading an address can be for either an input or an output. It will typically but not exclusively be expressed as RELOAD_FOR_INPUT_ADDRESS or RELOAD_FOR_OUTPUT_ADDRESS. > > > A bit more difficult is a sub/cmp combination, here move would really > > > clobber flags, but luckily on m68k the sub instruction is quite flexible > > > and e.g. 
also accepts memory arguments as destination, so it would be > > > possible to convert a output reload into one or two input reloads. > > > So what basically would be needed is a port specific check in reload as to > > > whether an output reload is acceptable and possibly reject the > > > alternative and try it with another one. > > > That's currently the idea I would favour at least for m68k, does that > > > sound reasonable (or at least understandable :) ). > > > > That sounds unreasonable. reload is already very complicated, too > > complicated, and there isn't going to be any way to drop in something > > like that without making it significantly more complicated. > > Considering the alternatives I don't want dismiss this option too quickly. > What is suggested in the wiki would only trade one problem with another, > cc0 might be gone this way, but one also had none of the benefits and > needs a lot of extra work to get anywhere near where it was before. > The problem _is_ reload, so I'd prefer to look at the real problem instead > of wasting a lot of effort to work around it. > Why are you so sure that there is not "any way"? Yes, reload is > complicated, but I don't think these changes would make it significantly > more complicated than it already is and I'm pretty sure that it will be > less complicated than any of the alternatives. Certainly the problem is reload and the answer is to eliminate it. That is extremely hard but will solve the problem for good. If you want to look at target specific routines to guide instruction selection in reload, that is your choice. I do not think it would be a wise way to approach this problem. Ian ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations 2007-05-22 17:27 ` Roman Zippel 2007-05-22 17:36 ` Ian Lance Taylor @ 2007-05-22 22:19 ` Rask Ingemann Lambertsen 1 sibling, 0 replies; 36+ messages in thread From: Rask Ingemann Lambertsen @ 2007-05-22 22:19 UTC (permalink / raw) To: Roman Zippel Cc: Ian Lance Taylor, Eric Botcazou, gcc-patches, Uros Bizjak, Paolo Bonzini, Chris Lattner On Tue, May 22, 2007 at 07:26:50PM +0200, Roman Zippel wrote: > Hi, > > On Fri, 18 May 2007, Ian Lance Taylor wrote: > > > > From the backend perspective, the best way to take advantage of > > lower-subreg is conceptually simple if sometimes awkward in practice: > > for multi-word operations, use define_insn_and_split and split the > > insns before reload. > > I'm playing around with it a little on m68k and there is sometimes a bit > of a conflict. E.g. I would split something like adddi3/subdi3 already > before combine, because the addx/subx instruction is much more limited in > its addressing mode (only accepts register), so combine can generate the > right instructions. It can be done: 1) Define predicates and constraints which match the available addressing modes. 2) Define an insn(_and_split) which uses those predicates and constraints. 3) Define an insn similar to the one in 2), but where the (mem ...) expression is not an operand. Instead, you have something like ... (mem:SI (match_operand:SI "register_operand" "a")) 4) Split the insn in 2) to the one in 3) before reload. 5) Put the insn in 3) before the one in 2) in the .md file. I think that since the address is so simple in this case, you can skip part 1), 2), 4) and 5). Just be careful that you don't use the template in 3) as an output template, or you'll lose the alias set. (The danger of losing the alias set really ought to be documented.) 
> unsigned long long x, y; > return x + (y >> 63); > > is better handled as: > > unsigned long long x, y; > return x - ((long long)y >> 63); Interesting optimization when signed shifts are faster than unsigned ones. > In this context I'm curious about another problem, how should it work in > general (i.e. independent of cc0) to combine a subdi/cmpdi or subdi/tstdi > pattern? Combine will do this automatically if you define the insn to do so. In this case, it will just be a define_insn_and_split instead of the usual define_insn, but that makes no difference to combine. If your question is how to write the combined patterns in general, see <URL:http://gcc.gnu.org/ml/gcc/2004-02/msg00903.html>. You will have to define a CCmode for the multi-word instructions since the zero flag won't be usable. You'll therefore also have to delay the output of the comparison instruction until you know the comparison operator. This is documented in the GCC Internals manual. (Also check that your overflow flag is set to something useful.) -- Rask Ingemann Lambertsen ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations 2007-05-17 9:25 ` Uros Bizjak 2007-05-17 11:15 ` Rask Ingemann Lambertsen @ 2007-05-17 16:49 ` Chris Lattner 2007-05-17 17:05 ` Eric Christopher 1 sibling, 1 reply; 36+ messages in thread From: Chris Lattner @ 2007-05-17 16:49 UTC (permalink / raw) To: Uros Bizjak; +Cc: Paolo Bonzini, GCC Patches > Unfortunatelly, this approach confuses gcc a lot. Attached testcase > will illustrate the problem on i686 (substitute SI with DI and DI with > TI on 64bit and change shift values for the same result): > Compared two approaches head-to-head on i686 target; gcc -O2 > -fomit-frame-pointer generates: > > test_c: > subl $28, %esp #, > xorl %edx, %edx # > movl 40(%esp), %eax # c, tmp66 > movl 44(%esp), %ecx # d, d > movl %esi, 20(%esp) #, > movl %ebx, 16(%esp) #, > movl %eax, %edx # tmp66, > movl $0, %eax #, tmp66 > movl %eax, %esi #, tmp74 > movl 32(%esp), %eax # a, tmp70 > movl %edx, %ebx # tmp75, __c > orl %ecx, %esi # d, tmp74 > xorl %edx, %edx # > movl %esi, %ecx # tmp74, __c > movl 36(%esp), %esi # b, b > movl %edi, 24(%esp) #, > movl 24(%esp), %edi #, > movl %eax, %edx # tmp70, > movl $0, %eax #, tmp70 > orl %esi, %eax # b, tmp72 > movl 20(%esp), %esi #, > addl %eax, %ecx # tmp72, __c > adcl %edx, %ebx #, __c > movl %ecx, 8(%esp) # __c, y > movl %ebx, %ecx # __c, __c > xorl %ebx, %ebx # __c > movl 16(%esp), %ebx #, > movl %ecx, 12(%esp) # __c, x > addl $28, %esp #, > ret > > Looking into this mess, I think that gcc has still a long way to go > w.r.t to DImode values on 32bit targets (and TImode values on 64bit > targets) and the only solution for the near future is using asm > here... At some point, hacking around these limitations becomes less useful than just fixing them once and for all. 
LLVM compiles this: volatile SI x, y; void test_asm (SI a, SI b, SI c, SI d){ add_ssaaaa_asm (x, y, a, b, c, d); } into this: _test_c: movl 12(%esp), %eax movl 16(%esp), %ecx addl 8(%esp), %ecx adcl 4(%esp), %eax movl %ecx, _y movl %eax, _x ret with llvm-gcc t.c -S -O -static -fomit-frame-pointer Inline asm really is bad. It's much better to fix the compiler so that users don't have to rely on it to get performance. -Chris ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations 2007-05-17 16:49 ` Chris Lattner @ 2007-05-17 17:05 ` Eric Christopher 2007-05-18 10:23 ` Uros Bizjak 2007-05-18 19:43 ` Uros Bizjak 0 siblings, 2 replies; 36+ messages in thread From: Eric Christopher @ 2007-05-17 17:05 UTC (permalink / raw) To: GCC Patches; +Cc: Uros Bizjak, Paolo Bonzini, Ian Lance Taylor, Chris Lattner > > > Inline asm really is bad. It's much better to fix the compiler so > that users don't have to rely on it to get performance. Agreed, Ian's patches that take advantage of lower-subreg for x86 help this quite a bit. He's waiting AFAIK on the df branch to finish them up, but from what I recall they helped this particular case. -eric ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations 2007-05-17 17:05 ` Eric Christopher @ 2007-05-18 10:23 ` Uros Bizjak 2007-05-18 18:31 ` Chris Lattner 2007-05-18 19:43 ` Uros Bizjak 0 siblings, 2 replies; 36+ messages in thread From: Uros Bizjak @ 2007-05-18 10:23 UTC (permalink / raw) To: Eric Christopher Cc: GCC Patches, Paolo Bonzini, Ian Lance Taylor, Chris Lattner On 5/17/07, Eric Christopher <echristo@apple.com> wrote: > > Inline asm really is bad. It's much better to fix the compiler so > > that users don't have to rely on it to get performance. > > Agreed, Ian's patches that take advantage of lower-subreg for x86 help > this quite a bit. He's waiting AFAIK on the df branch to finish them > up, but from what I recall they helped this particular case. OK, I agree with that. However, I'd like to propose the path to this final solution: - for now, we implement the asm solution, as it optimizes library code _now_ and follows what the master copy of longlong.h does. Also, there is no simple solution for 128bit "mul" and "div" insns, which are of most interest here. - an enhancement PR is opened with the comments from this thread (also linking to Ian's follow-on patch), so we won't forget this issue. - once the generated code is equal or better than asm, we change longlong.h, probably for both i386 and x86_64. My opinion is that the asm solution is currently "good enough" for current uses (libgcc) and code inspection confirms this. It is not "the best solution" but it is definitely better than the current state. For the record, 3 "imul" insns are substituted by just one "mul" insn. Uros. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations 2007-05-18 10:23 ` Uros Bizjak @ 2007-05-18 18:31 ` Chris Lattner 2007-05-18 19:58 ` Uros Bizjak 2007-05-18 20:07 ` Uros Bizjak 0 siblings, 2 replies; 36+ messages in thread From: Chris Lattner @ 2007-05-18 18:31 UTC (permalink / raw) To: Uros Bizjak Cc: Eric Christopher, GCC Patches, Paolo Bonzini, Ian Lance Taylor On May 18, 2007, at 3:22 AM, Uros Bizjak wrote: > On 5/17/07, Eric Christopher <echristo@apple.com> wrote: > >> > Inline asm really is bad. It's much better to fix the compiler so >> > that users don't have to rely on it to get performance. >> >> Agreed, Ian's patches that take advantage of lower-subreg for x86 >> help >> this quite a bit. He's waiting AFAIK on the df branch to finish them >> up, but from what I recall they helped this particular case. > > OK, I agree with that. However, I'd like too propose the path to this > final solution: > > - for now, we implement asm solution, as it optimizes library code > - an enhancement PR is opened with the comments from this thread (also > linking to Ian's follow-on patch), so we won't forget this issue. > - once the generated code is equal or better then asm, we change > longlong.h, probably for both, i386 and x86_64. This all sounds fine to me. > _now_ and follows what master copy of longlong.h does. Also, there is > no simple solution for 128bit "mul" and "div" insns, which are of most > interest here. What is the issue here? -Chris ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations 2007-05-18 18:31 ` Chris Lattner @ 2007-05-18 19:58 ` Uros Bizjak 2007-05-18 20:15 ` Chris Lattner 2007-05-18 20:07 ` Uros Bizjak 1 sibling, 1 reply; 36+ messages in thread From: Uros Bizjak @ 2007-05-18 19:58 UTC (permalink / raw) To: Chris Lattner Cc: Eric Christopher, GCC Patches, Paolo Bonzini, Ian Lance Taylor Chris Lattner wrote: >> _now_ and follows what master copy of longlong.h does. Also, there is >> no simple solution for 128bit "mul" and "div" insns, which are of most >> interest here. > > What is the issue here? unsigned long long int t(unsigned int a, unsigned int b) { return a * b; } doesn't compile to "mul" as one would expect (but to "imul"): gcc -O2 -m32 -fomit-frame-pointer: t: movl 8(%esp), %eax xorl %edx, %edx imull 4(%esp), %eax ret Uros. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations 2007-05-18 19:58 ` Uros Bizjak @ 2007-05-18 20:15 ` Chris Lattner 2007-05-18 21:17 ` Uros Bizjak 2007-05-18 21:49 ` Uros Bizjak 0 siblings, 2 replies; 36+ messages in thread From: Chris Lattner @ 2007-05-18 20:15 UTC (permalink / raw) To: Uros Bizjak Cc: Eric Christopher, GCC Patches, Paolo Bonzini, Ian Lance Taylor On May 18, 2007, at 12:58 PM, Uros Bizjak wrote: > Chris Lattner wrote: > >>> _now_ and follows what master copy of longlong.h does. Also, >>> there is >>> no simple solution for 128bit "mul" and "div" insns, which are of >>> most >>> interest here. >> >> What is the issue here? > unsigned long long int t(unsigned int a, unsigned int b) > { > return a * b; > } > > doesn't compile to "mul" as one would expect (but to "imul"): > > gcc -O2 -m32 -fomit-frame-pointer: > > t: > movl 8(%esp), %eax > xorl %edx, %edx > imull 4(%esp), %eax > ret Don't you want: unsigned long long int t2(unsigned int a, unsigned int b) { return (unsigned long long)a * b; } which produces (gcc 4.0.1): _t2: movl 8(%esp), %eax mull 4(%esp) ret ? -Chris ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations 2007-05-18 20:15 ` Chris Lattner @ 2007-05-18 21:17 ` Uros Bizjak 2007-05-18 21:49 ` Uros Bizjak 1 sibling, 0 replies; 36+ messages in thread From: Uros Bizjak @ 2007-05-18 21:17 UTC (permalink / raw) To: Chris Lattner Cc: Eric Christopher, GCC Patches, Paolo Bonzini, Ian Lance Taylor Chris Lattner wrote: > Don't you want: > > unsigned long long int t2(unsigned int a, unsigned int b) > { > return (unsigned long long)a * b; > } Ouch... yes, indeed. Thanks, Uros. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations 2007-05-18 20:15 ` Chris Lattner 2007-05-18 21:17 ` Uros Bizjak @ 2007-05-18 21:49 ` Uros Bizjak 2007-05-18 22:49 ` Chris Lattner 1 sibling, 1 reply; 36+ messages in thread From: Uros Bizjak @ 2007-05-18 21:49 UTC (permalink / raw) To: Chris Lattner Cc: Eric Christopher, GCC Patches, Paolo Bonzini, Ian Lance Taylor Chris Lattner wrote: > Don't you want: > > unsigned long long int t2(unsigned int a, unsigned int b) > { > return (unsigned long long)a * b; > } Please find a testcase for muls below, implemented in asm and C. Again, a couple of strange moves are present at the end of a function, implemented in pure C. --cut here-- typedef unsigned SI __attribute__ ((mode (SI))); typedef unsigned DI __attribute__ ((mode (DI))); #define umul_ppmm_c(w1, w0, u, v) \ { \ DI __c = (DI) u * v; \ \ (w0) = (SI) (__c & 0xffffffff); \ (w1) = (SI) (__c >> 32); \ } #define umul_ppmm_asm(w1, w0, u, v) \ __asm__ ("mull %3" \ : "=a" ((SI) (w0)), \ "=d" ((SI) (w1)) \ : "%0" ((SI) (u)), \ "rm" ((SI) (v))) void test_c (SI a, SI b) { volatile SI x, y; umul_ppmm_c (x, y, a, b); } void test_asm (SI a, SI b) { volatile SI x, y; umul_ppmm_asm (x, y, a, b); } --cut here-- gcc -O2 -fomit-frame-pointer: test_c: subl $16, %esp movl 24(%esp), %eax mull 20(%esp) movl %eax, 8(%esp) movl %edx, %eax xorl %edx, %edx movl %eax, 12(%esp) addl $16, %esp ret test_asm: subl $16, %esp movl 20(%esp), %eax #APP mull 24(%esp) #NO_APP movl %eax, 8(%esp) movl %edx, 12(%esp) addl $16, %esp ret So even for wide multiply, it is better to stay with asm implementation ATM. Uros. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations 2007-05-18 21:49 ` Uros Bizjak @ 2007-05-18 22:49 ` Chris Lattner 0 siblings, 0 replies; 36+ messages in thread From: Chris Lattner @ 2007-05-18 22:49 UTC (permalink / raw) To: Uros Bizjak Cc: Eric Christopher, GCC Patches, Paolo Bonzini, Ian Lance Taylor On May 18, 2007, at 2:48 PM, Uros Bizjak wrote: > Chris Lattner wrote: > >> Don't you want: >> >> unsigned long long int t2(unsigned int a, unsigned int b) >> { >> return (unsigned long long)a * b; >> } > > Please find a testcase for muls below, implemented in asm and C. > Again, a couple of strange moves are present at the end of a > function, implemented in pure C. > > --cut here-- > typedef unsigned SI __attribute__ ((mode (SI))); > typedef unsigned DI __attribute__ ((mode (DI))); > > #define umul_ppmm_c(w1, w0, u, v) \ > { \ > DI __c = (DI) u * v; \ > \ > (w0) = (SI) (__c & 0xffffffff); \ > (w1) = (SI) (__c >> 32); \ > } > > test_c: > subl $16, %esp > movl 24(%esp), %eax > mull 20(%esp) > movl %eax, 8(%esp) > movl %edx, %eax > xorl %edx, %edx > movl %eax, 12(%esp) > addl $16, %esp > ret > > So even for wide multiply, it is better to stay with asm > implementation ATM. Yuck. Please file a bugzilla. -Chris ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations 2007-05-18 18:31 ` Chris Lattner 2007-05-18 19:58 ` Uros Bizjak @ 2007-05-18 20:07 ` Uros Bizjak 1 sibling, 0 replies; 36+ messages in thread From: Uros Bizjak @ 2007-05-18 20:07 UTC (permalink / raw) To: Chris Lattner Cc: Eric Christopher, GCC Patches, Paolo Bonzini, Ian Lance Taylor Chris Lattner wrote: >> OK, I agree with that. However, I'd like too propose the path to this >> final solution: >> >> - for now, we implement asm solution, as it optimizes library code >> - an enhancement PR is opened with the comments from this thread (also >> linking to Ian's follow-on patch), so we won't forget this issue. >> - once the generated code is equal or better then asm, we change >> longlong.h, probably for both, i386 and x86_64. > > This all sounds fine to me. The bugreport is at PR target/31985. (BTW: It would be nice if someone could confirm it, so it won't be self-confirmed ;) Uros. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations 2007-05-17 17:05 ` Eric Christopher 2007-05-18 10:23 ` Uros Bizjak @ 2007-05-18 19:43 ` Uros Bizjak 1 sibling, 0 replies; 36+ messages in thread From: Uros Bizjak @ 2007-05-18 19:43 UTC (permalink / raw) To: Eric Christopher Cc: GCC Patches, Paolo Bonzini, Ian Lance Taylor, Chris Lattner Eric Christopher wrote: >> Inline asm really is bad. It's much better to fix the compiler so >> that users don't have to rely on it to get performance. > > Agreed, Ian's patches that take advantage of lower-subreg for x86 help > this quite a bit. He's waiting AFAIK on the df branch to finish them > up, but from what I recall they helped this particular case. Here are some representative x86_64 code sizes with and without a patch to longlong.h (these are actually TImode and TFmode functions): (+++ are patched, --- are unpatched libgcc.a sizes) - 238 0 0 238 ee _muldi3.o (ex libgcc.a) + 138 0 0 138 8a _muldi3.o (ex libgcc.a) - 1529 0 0 1529 5f9 _divdi3.o (ex libgcc.a) - 1650 0 0 1650 672 _moddi3.o (ex libgcc.a) - 1322 0 0 1322 52a _udivdi3.o (ex libgcc.a) - 1418 0 0 1418 58a _umoddi3.o (ex libgcc.a) + 541 0 0 541 21d _divdi3.o (ex libgcc.a) + 650 0 0 650 28a _moddi3.o (ex libgcc.a) + 358 0 0 358 166 _udivdi3.o (ex libgcc.a) + 445 0 0 445 1bd _umoddi3.o (ex libgcc.a) - 1669 0 0 1669 685 _udivmoddi4.o (ex libgcc.a) + 610 0 0 610 262 _udivmoddi4.o (ex libgcc.a) - 3030 0 0 3030 bd6 divtf3.o (ex libgcc.a) + 2604 0 0 2604 a2c divtf3.o (ex libgcc.a) - 2772 0 0 2772 ad4 multf3.o (ex libgcc.a) + 2508 0 0 2508 9cc multf3.o (ex libgcc.a) current divdi/moddi code looks quite pathological... I think that this clears the way for the patch to be committed to mainline, even as an interim solution. Uros. ^ permalink raw reply [flat|nested] 36+ messages in thread
end of thread, other threads:[~2007-05-24 19:56 UTC | newest] Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2007-05-16 20:25 [PATCH, x86_64]: Provide longlong.h definitions for 128bit operations Uros Bizjak 2007-05-16 20:34 ` Chris Lattner 2007-05-17 5:57 ` Uros Bizjak 2007-05-17 6:09 ` Eric Christopher 2007-05-17 6:12 ` Andrew Pinski 2007-05-17 6:22 ` Uros Bizjak 2007-05-17 11:13 ` Uros Bizjak 2007-05-17 6:43 ` Chris Lattner 2007-05-17 7:22 ` Paolo Bonzini 2007-05-17 9:25 ` Uros Bizjak 2007-05-17 11:15 ` Rask Ingemann Lambertsen 2007-05-17 21:38 ` Rask Ingemann Lambertsen 2007-05-18 6:54 ` Ian Lance Taylor 2007-05-18 7:10 ` Eric Botcazou 2007-05-18 16:48 ` Ian Lance Taylor 2007-05-18 20:37 ` Eric Botcazou 2007-05-18 22:32 ` Ian Lance Taylor 2007-05-22 17:27 ` Roman Zippel 2007-05-22 17:36 ` Ian Lance Taylor 2007-05-22 22:27 ` Rask Ingemann Lambertsen 2007-05-22 22:32 ` Ian Lance Taylor 2007-05-24 20:00 ` Rask Ingemann Lambertsen 2007-05-23 17:27 ` Roman Zippel 2007-05-23 17:48 ` Ian Lance Taylor 2007-05-22 22:19 ` Rask Ingemann Lambertsen 2007-05-17 16:49 ` Chris Lattner 2007-05-17 17:05 ` Eric Christopher 2007-05-18 10:23 ` Uros Bizjak 2007-05-18 18:31 ` Chris Lattner 2007-05-18 19:58 ` Uros Bizjak 2007-05-18 20:15 ` Chris Lattner 2007-05-18 21:17 ` Uros Bizjak 2007-05-18 21:49 ` Uros Bizjak 2007-05-18 22:49 ` Chris Lattner 2007-05-18 20:07 ` Uros Bizjak 2007-05-18 19:43 ` Uros Bizjak
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).