* [Bug rtl-optimization/17236] inefficient code for long long multiply on x86
[not found] <bug-17236-1008@http.gcc.gnu.org/bugzilla/>
@ 2007-02-02 0:17 ` roger at eyesopen dot com
2007-11-23 20:48 ` steven at gcc dot gnu dot org
` (27 subsequent siblings)
28 siblings, 0 replies; 33+ messages in thread
From: roger at eyesopen dot com @ 2007-02-02 0:17 UTC (permalink / raw)
To: gcc-bugs
------- Comment #5 from roger at eyesopen dot com 2007-02-02 00:17 -------
It looks like Ian's recent subreg lowering pass patch has improved code
generation on this testcase. Previously, we'd spill three integer registers to
the stack for "LLM", we're now down to two. [A significant improvement from
the five we spilled when this bug was reported]
Before:
LLM: subl $12, %esp
movl %ebx, (%esp)
movl 28(%esp), %edx
movl 20(%esp), %ebx
movl 16(%esp), %ecx
movl 24(%esp), %eax
movl %esi, 4(%esp)
movl %edx, %esi
movl %edi, 8(%esp)
movl %ebx, %edi
movl (%esp), %ebx
imull %ecx, %esi
imull %eax, %edi
mull %ecx
addl %edi, %esi
movl 8(%esp), %edi
leal (%esi,%edx), %edx
movl 4(%esp), %esi
addl $12, %esp
ret
After:
LLM: subl $8, %esp
movl %ebx, (%esp)
movl 20(%esp), %eax
movl %esi, 4(%esp)
movl 24(%esp), %ecx
movl 12(%esp), %esi
movl 16(%esp), %ebx
imull %esi, %ecx
imull %eax, %ebx
mull %esi
movl 4(%esp), %esi
addl %ebx, %ecx
movl (%esp), %ebx
addl $8, %esp
leal (%ecx,%edx), %edx
ret
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236
^ permalink raw reply [flat|nested] 33+ messages in thread
* [Bug rtl-optimization/17236] inefficient code for long long multiply on x86
[not found] <bug-17236-1008@http.gcc.gnu.org/bugzilla/>
2007-02-02 0:17 ` [Bug rtl-optimization/17236] inefficient code for long long multiply on x86 roger at eyesopen dot com
@ 2007-11-23 20:48 ` steven at gcc dot gnu dot org
2007-12-18 7:43 ` bonzini at gnu dot org
` (26 subsequent siblings)
28 siblings, 0 replies; 33+ messages in thread
From: steven at gcc dot gnu dot org @ 2007-11-23 20:48 UTC (permalink / raw)
To: gcc-bugs
------- Comment #6 from steven at gcc dot gnu dot org 2007-11-23 20:48 -------
*** Bug 6585 has been marked as a duplicate of this bug. ***
--
steven at gcc dot gnu dot org changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |bruno at clisp dot org
Bug 17236 depends on bug 6585, which changed state.
Bug 6585 Summary: Redundant store/load instruction pairs on ix86
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=6585
What |Old Value |New Value
----------------------------------------------------------------------------
Status|REOPENED |RESOLVED
Resolution| |DUPLICATE
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236
^ permalink raw reply [flat|nested] 33+ messages in thread
* [Bug rtl-optimization/17236] inefficient code for long long multiply on x86
[not found] <bug-17236-1008@http.gcc.gnu.org/bugzilla/>
2007-02-02 0:17 ` [Bug rtl-optimization/17236] inefficient code for long long multiply on x86 roger at eyesopen dot com
2007-11-23 20:48 ` steven at gcc dot gnu dot org
@ 2007-12-18 7:43 ` bonzini at gnu dot org
2007-12-18 7:53 ` bonzini at gnu dot org
` (25 subsequent siblings)
28 siblings, 0 replies; 33+ messages in thread
From: bonzini at gnu dot org @ 2007-12-18 7:43 UTC (permalink / raw)
To: gcc-bugs
------- Comment #7 from bonzini at gnu dot org 2007-12-18 07:43 -------
The generated code has changed a lot recently, though it still uses two spills:
pushl %esi
pushl %ebx
movl 12(%esp), %ebx ; load alow
movl 20(%esp), %esi ; load blow
movl 24(%esp), %ecx ; load bhigh
imull %ebx, %ecx ; bhigh = bhigh * alow
movl 16(%esp), %eax ; load ahigh
imull %esi, %eax ; ahigh = ahigh * blow
addl %eax, %ecx
movl %esi, %eax ; need to move a low part in %eax
mull %ebx
leal (%ecx,%edx), %esi ; what the heck, a simple addl could do!
movl %esi, %edx
popl %ebx
popl %esi
The important thing is that the adds are one before, and one after the
multiply. And the RTL before reload is very nice!
28 r64:SI=[argp:SI]
30 r66:SI=[argp:SI+0x8]
7 {r62:SI=[argp:SI+0xc]*r64:SI;clobber flags:CC;}
8 {r63:SI=[argp:SI+0x4]*r66:SI;clobber flags:CC;}
9 {r62:SI=r62:SI+r63:SI;clobber flags:CC;}
10 {r61:DI=zero_extend(r66:SI)*zero_extend(r64:SI);clobber flags:CC;}
12 {r61:DI#4=r62:SI+r61:DI#4;clobber flags:CC;}
That could be basically
mov (blow), %eax
mov (alow), %ecx
imul (bhigh), %eax, %ebx ;needs reload of (bh) in %ebx
imul (ahigh), %ecx, %edx ;needs reload of (ah) in %edx
addl %edx, %ebx
mul %ecx
addl %ebx, %edx
So, if we are content with a single spill, there is now mainly a register
allocation problem. For zero spills, we would need rematerialization of either
alow or blow, as in:
mov (blow), %eax
imul (bhigh), %eax, %ecx ;needs reload of (bh) in %ecx
imul (ahigh), (alow), %edx ;needs reload of (ah) in %edx
addl %edx, %ecx
mul (al)
addl %ebx, %edx
So, for now let's aim at having one spill. Then the main problem is that gcc
should try to keep one of alow or blow in %eax, in order to reuse it for the
mull. Then it could do with a single spill, like this:
pushl %ebx
movl 12(%esp), %eax ; load alow
movl 20(%esp), %ecx ; load blow
movl 24(%esp), %ebx ; load bhigh, input reload
imull %eax, %ebx
movl 16(%esp), %edx ; load ahigh, input reload
imull %ecx, %edx
addl %edx, %ebx
mul %ecx
addl %ebx, %edx
popl %ebx
--
bonzini at gnu dot org changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |bonzini at gnu dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236
^ permalink raw reply [flat|nested] 33+ messages in thread
* [Bug rtl-optimization/17236] inefficient code for long long multiply on x86
[not found] <bug-17236-1008@http.gcc.gnu.org/bugzilla/>
` (2 preceding siblings ...)
2007-12-18 7:43 ` bonzini at gnu dot org
@ 2007-12-18 7:53 ` bonzini at gnu dot org
2007-12-18 8:05 ` bonzini at gnu dot org
` (24 subsequent siblings)
28 siblings, 0 replies; 33+ messages in thread
From: bonzini at gnu dot org @ 2007-12-18 7:53 UTC (permalink / raw)
To: gcc-bugs
------- Comment #8 from bonzini at gnu dot org 2007-12-18 07:53 -------
For -mregparm=2 the code is this:
pushl %ebx
movl 8(%esp), %ebx
movl 12(%esp), %ecx
imull %ebx, %edx
imull %eax, %ecx
addl %edx, %ecx
mull %ebx
popl %ebx
leal (%ecx,%edx), %edx
ret
It is better only for luck: the low part of "a" is already in %eax, so that
regclass decides for the "right" preference.
Register 64 pref AREG, else GENERAL_REGS
Paolo
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236
^ permalink raw reply [flat|nested] 33+ messages in thread
* [Bug rtl-optimization/17236] inefficient code for long long multiply on x86
[not found] <bug-17236-1008@http.gcc.gnu.org/bugzilla/>
` (3 preceding siblings ...)
2007-12-18 7:53 ` bonzini at gnu dot org
@ 2007-12-18 8:05 ` bonzini at gnu dot org
2007-12-18 8:10 ` ebotcazou at gcc dot gnu dot org
` (23 subsequent siblings)
28 siblings, 0 replies; 33+ messages in thread
From: bonzini at gnu dot org @ 2007-12-18 8:05 UTC (permalink / raw)
To: gcc-bugs
------- Comment #9 from bonzini at gnu dot org 2007-12-18 08:05 -------
Uros mentioned offlist that he wanted to hijack fwprop to always propagate
stack slots into instructions.
It would be a relatively useful piece of infrastructure to have a flag in MEMs
that marks on-stack MEMs, because other MEMs definitely must not be propagated
blindly.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236
^ permalink raw reply [flat|nested] 33+ messages in thread
* [Bug rtl-optimization/17236] inefficient code for long long multiply on x86
[not found] <bug-17236-1008@http.gcc.gnu.org/bugzilla/>
` (4 preceding siblings ...)
2007-12-18 8:05 ` bonzini at gnu dot org
@ 2007-12-18 8:10 ` ebotcazou at gcc dot gnu dot org
2007-12-18 13:48 ` ubizjak at gmail dot com
` (22 subsequent siblings)
28 siblings, 0 replies; 33+ messages in thread
From: ebotcazou at gcc dot gnu dot org @ 2007-12-18 8:10 UTC (permalink / raw)
To: gcc-bugs
------- Comment #10 from ebotcazou at gcc dot gnu dot org 2007-12-18 08:10 -------
> It would be a relatively useful piece of infrastructure to have a flag in MEMs
> that marks on-stack MEMs, because other MEMs definitely must not be propagated
> blindly.
Depending on your needs, MEM_NOTRAP_P could be a good approximation.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236
^ permalink raw reply [flat|nested] 33+ messages in thread
* [Bug rtl-optimization/17236] inefficient code for long long multiply on x86
[not found] <bug-17236-1008@http.gcc.gnu.org/bugzilla/>
` (5 preceding siblings ...)
2007-12-18 8:10 ` ebotcazou at gcc dot gnu dot org
@ 2007-12-18 13:48 ` ubizjak at gmail dot com
2007-12-18 16:02 ` bonzini at gnu dot org
` (21 subsequent siblings)
28 siblings, 0 replies; 33+ messages in thread
From: ubizjak at gmail dot com @ 2007-12-18 13:48 UTC (permalink / raw)
To: gcc-bugs
------- Comment #11 from ubizjak at gmail dot com 2007-12-18 13:47 -------
Generated code for a similar example is just plain stupid:
--cut here--
int test(long long a, long long b)
{
return a * b;
}
--cut here--
gcc -O3:
test:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
movl %esi, 4(%esp)
movl 16(%ebp), %esi
movl %ebx, (%esp)
movl 8(%ebp), %ebx
movl %esi, %eax
movl 4(%esp), %esi
mull %ebx
movl (%esp), %ebx
movl %ebp, %esp
popl %ebp
ret
gcc-4.0 is a little less creative and creates:
test:
pushl %ebp
movl %esp, %ebp
pushl %edi
pushl %esi
pushl %ebx
movl 8(%ebp), %eax
mull 16(%ebp)
popl %ebx
popl %esi
popl %edi
leave
ret
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236
^ permalink raw reply [flat|nested] 33+ messages in thread
* [Bug rtl-optimization/17236] inefficient code for long long multiply on x86
[not found] <bug-17236-1008@http.gcc.gnu.org/bugzilla/>
` (6 preceding siblings ...)
2007-12-18 13:48 ` ubizjak at gmail dot com
@ 2007-12-18 16:02 ` bonzini at gnu dot org
2007-12-18 16:39 ` jakub at gcc dot gnu dot org
` (20 subsequent siblings)
28 siblings, 0 replies; 33+ messages in thread
From: bonzini at gnu dot org @ 2007-12-18 16:02 UTC (permalink / raw)
To: gcc-bugs
------- Comment #12 from bonzini at gnu dot org 2007-12-18 16:01 -------
The problem in comment #11 is that GCC generates a widening multiply, and
cannot remove the DImode operations until after register allocation (!). While
the root cause is a deficiency in RTL-level DCE, I suggest filing a separate
bug, because it could also be fixed on the tree-level.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236
^ permalink raw reply [flat|nested] 33+ messages in thread
* [Bug rtl-optimization/17236] inefficient code for long long multiply on x86
[not found] <bug-17236-1008@http.gcc.gnu.org/bugzilla/>
` (7 preceding siblings ...)
2007-12-18 16:02 ` bonzini at gnu dot org
@ 2007-12-18 16:39 ` jakub at gcc dot gnu dot org
2007-12-18 16:51 ` bonzini at gnu dot org
` (19 subsequent siblings)
28 siblings, 0 replies; 33+ messages in thread
From: jakub at gcc dot gnu dot org @ 2007-12-18 16:39 UTC (permalink / raw)
To: gcc-bugs
------- Comment #13 from jakub at gcc dot gnu dot org 2007-12-18 16:39 -------
I think tree level does the right thing, TER fixes this up and expand_expr
is called with
return (int) (b * a)
Later on expand_expr is called with
<mult_expr 0x2aaaae9032c0
type <integer_type 0x2aaaae937840 long long int DI
size <integer_cst 0x2aaaae927c60 constant invariant 64>
unit size <integer_cst 0x2aaaae927c90 constant invariant 8>
align 64 symtab 0 alias set 2 canonical type 0x2aaaae937840 precision
64 min <integer_cst 0x2aaaae927bd0 -9223372036854775808> max <integer_cst
0x2aaaae927c00 9223372036854775807>>
arg 0 <parm_decl 0x2aaaae92d2d0 b type <integer_type 0x2aaaae937840 long
long int>
used DI file pr17236.c line 2 col 30 size <integer_cst 0x2aaaae927c60
64> unit size <integer_cst 0x2aaaae927c90 8>
align 64 context <function_decl 0x2aaaae9e8a90 test> initial
<integer_type 0x2aaaae937840 long long int>
(reg/v:DI 60 [ b ]) arg-type <integer_type 0x2aaaae937840 long long
int>
incoming-rtl (mem/c/i:DI (plus:SI (reg/f:SI 53 virtual-incoming-args)
(const_int 8 [0x8])) [2 b+0 S8 A32])>
arg 1 <parm_decl 0x2aaaae92d240 a type <integer_type 0x2aaaae937840 long
long int>
used DI file pr17236.c line 2 col 17 size <integer_cst 0x2aaaae927c60
64> unit size <integer_cst 0x2aaaae927c90 8>
align 64 context <function_decl 0x2aaaae9e8a90 test> initial
<integer_type 0x2aaaae937840 long long int>
(reg/v:DI 59 [ a ]) arg-type <integer_type 0x2aaaae937840 long long
int>
incoming-rtl (mem/c/i:DI (reg/f:SI 53 virtual-incoming-args) [2 a+0 S8
A32]) chain <parm_decl 0x2aaaae92d2d0 b>>>
and tmode SImode, still enough info to choose a better multiply.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236
^ permalink raw reply [flat|nested] 33+ messages in thread
* [Bug rtl-optimization/17236] inefficient code for long long multiply on x86
[not found] <bug-17236-1008@http.gcc.gnu.org/bugzilla/>
` (8 preceding siblings ...)
2007-12-18 16:39 ` jakub at gcc dot gnu dot org
@ 2007-12-18 16:51 ` bonzini at gnu dot org
2007-12-18 18:20 ` ubizjak at gmail dot com
` (18 subsequent siblings)
28 siblings, 0 replies; 33+ messages in thread
From: bonzini at gnu dot org @ 2007-12-18 16:51 UTC (permalink / raw)
To: gcc-bugs
------- Comment #14 from bonzini at gnu dot org 2007-12-18 16:50 -------
The bug with 64*64->32 multiplication is now PR34522.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236
^ permalink raw reply [flat|nested] 33+ messages in thread
* [Bug rtl-optimization/17236] inefficient code for long long multiply on x86
[not found] <bug-17236-1008@http.gcc.gnu.org/bugzilla/>
` (9 preceding siblings ...)
2007-12-18 16:51 ` bonzini at gnu dot org
@ 2007-12-18 18:20 ` ubizjak at gmail dot com
2007-12-18 18:33 ` ubizjak at gmail dot com
` (17 subsequent siblings)
28 siblings, 0 replies; 33+ messages in thread
From: ubizjak at gmail dot com @ 2007-12-18 18:20 UTC (permalink / raw)
To: gcc-bugs
------- Comment #15 from ubizjak at gmail dot com 2007-12-18 18:20 -------
(In reply to comment #7)
> mull %ebx
> leal (%ecx,%edx), %esi ; what the heck, a simple addl could do!
> movl %esi, %edx
Something disturbs RA to emit two DImode moves:
(insn:HI 10 36 37 2 m.c:2 (parallel [
(set (reg:DI 0 ax)
(mult:DI (zero_extend:DI (reg:SI 0 ax))
(zero_extend:DI (reg:SI 3 bx [orig:64 a ] [64]))))
(clobber (reg:CC 17 flags))
]) 304 {*umulsidi3_insn} (nil))
(insn 37 10 11 2 m.c:2 (set (reg:DI 3 bx [61])
(reg:DI 0 ax)) 88 {*movdi_2} (nil))
(note:HI 11 37 12 2 NOTE_INSN_DELETED)
(insn:HI 12 11 18 2 m.c:2 (parallel [
(set (reg:SI 4 si [+4 ])
(plus:SI (reg:SI 2 cx [62])
(reg:SI 4 si [+4 ])))
(clobber (reg:CC 17 flags))
]) 249 {*addsi_1} (nil))
(insn:HI 18 12 24 2 m.c:4 (set (reg/i:DI 0 ax [ <result> ])
(reg:DI 3 bx [61])) 88 {*movdi_2} (nil))
(insn 24 18 33 2 m.c:4 (use (reg/i:DI 0 ax [ <result> ])) -1 (nil))
Note two moves [(insn 36) and (insn 37)] around (insn 12).
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236
^ permalink raw reply [flat|nested] 33+ messages in thread
* [Bug rtl-optimization/17236] inefficient code for long long multiply on x86
[not found] <bug-17236-1008@http.gcc.gnu.org/bugzilla/>
` (10 preceding siblings ...)
2007-12-18 18:20 ` ubizjak at gmail dot com
@ 2007-12-18 18:33 ` ubizjak at gmail dot com
2007-12-19 9:49 ` bonzini at gnu dot org
` (16 subsequent siblings)
28 siblings, 0 replies; 33+ messages in thread
From: ubizjak at gmail dot com @ 2007-12-18 18:33 UTC (permalink / raw)
To: gcc-bugs
------- Comment #16 from ubizjak at gmail dot com 2007-12-18 18:33 -------
(In reply to comment #15)
> Note two moves [(insn 36) and (insn 37)] around (insn 12).
Bah. This is the correct sequence, around (insn 10) that seems to be the root
of all problems:
(insn:HI 9 8 36 2 m.c:2 (parallel [
(set (reg:SI 2 cx [62])
(plus:SI (reg:SI 2 cx [62])
(reg:SI 0 ax [63])))
(clobber (reg:CC 17 flags))
]) 249 {*addsi_1} (nil))
(insn 36 9 10 2 m.c:2 (set (reg:SI 0 ax)
(reg:SI 4 si [orig:66 b ] [66])) 47 {*movsi_1} (nil))
(insn:HI 10 36 37 2 m.c:2 (parallel [
(set (reg:DI 0 ax)
(mult:DI (zero_extend:DI (reg:SI 0 ax))
(zero_extend:DI (reg:SI 3 bx [orig:64 a ] [64]))))
(clobber (reg:CC 17 flags))
]) 304 {*umulsidi3_insn} (nil))
(insn 37 10 11 2 m.c:2 (set (reg:DI 3 bx [61])
(reg:DI 0 ax)) 88 {*movdi_2} (nil))
DImode AX as found in (insn 10) could simply be propagated up and down RTL
stream as SImode destination of (insn 8) and SImode source of (insn 12). For
some reason, RA is afraid to allocate registers in mixed modes.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236
^ permalink raw reply [flat|nested] 33+ messages in thread
* [Bug rtl-optimization/17236] inefficient code for long long multiply on x86
[not found] <bug-17236-1008@http.gcc.gnu.org/bugzilla/>
` (11 preceding siblings ...)
2007-12-18 18:33 ` ubizjak at gmail dot com
@ 2007-12-19 9:49 ` bonzini at gnu dot org
2007-12-19 12:11 ` ubizjak at gmail dot com
` (15 subsequent siblings)
28 siblings, 0 replies; 33+ messages in thread
From: bonzini at gnu dot org @ 2007-12-19 9:49 UTC (permalink / raw)
To: gcc-bugs
------- Comment #17 from bonzini at gnu dot org 2007-12-19 09:49 -------
With this patch, GCC gets the preferences right, but it does not affect code
generation.
Index: regclass.c
===================================================================
--- regclass.c (revision 130928)
+++ regclass.c (working copy)
@@ -1651,9 +1651,15 @@ record_reg_classes (int n_alts, int n_op
[(unsigned char) reg_pref[REGNO (op)].prefclass]
[(int) classes[i]]);
- if (REGNO (ops[i]) != REGNO (ops[j])
- && ! find_reg_note (insn, REG_DEAD, op))
- alt_cost += 2;
+ if (REGNO (ops[i]) != REGNO (ops[j]))
+ {
+ /* If the pseudo dies, tying it to the duplicate
+ operand can be advantageous. */
+ if (find_reg_note (insn, REG_DEAD, op))
+ pp->cost[classes[j]]--;
+ else
+ alt_cost += 2;
+ }
/* This is in place of ordinary cost computation
for this operand, so skip to the end of the
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236
^ permalink raw reply [flat|nested] 33+ messages in thread
* [Bug rtl-optimization/17236] inefficient code for long long multiply on x86
[not found] <bug-17236-1008@http.gcc.gnu.org/bugzilla/>
` (12 preceding siblings ...)
2007-12-19 9:49 ` bonzini at gnu dot org
@ 2007-12-19 12:11 ` ubizjak at gmail dot com
2007-12-19 12:14 ` bonzini at gnu dot org
` (14 subsequent siblings)
28 siblings, 0 replies; 33+ messages in thread
From: ubizjak at gmail dot com @ 2007-12-19 12:11 UTC (permalink / raw)
To: gcc-bugs
------- Comment #18 from ubizjak at gmail dot com 2007-12-19 12:11 -------
Another baby step can be performed by:
Index: optabs.c
===================================================================
--- optabs.c (revision 131053)
+++ optabs.c (working copy)
@@ -1245,7 +1245,7 @@ expand_doubleword_mult (enum machine_mod
return NULL_RTX;
}
- adjust = expand_binop (word_mode, smul_optab, op0_high, op1_low,
+ adjust = expand_binop (word_mode, smul_optab, op1_low, op0_high,
NULL_RTX, 0, OPTAB_DIRECT);
if (!adjust)
return NULL_RTX;
@@ -1274,7 +1274,7 @@ expand_doubleword_mult (enum machine_mod
return NULL_RTX;
}
- temp = expand_binop (word_mode, smul_optab, op1_high, op0_low,
+ temp = expand_binop (word_mode, smul_optab, op0_low, op1_high,
NULL_RTX, 0, OPTAB_DIRECT);
if (!temp)
return NULL_RTX;
This patch generates (note mem operands to imull):
test:
subl $8, %esp
movl %ebx, (%esp)
movl 12(%esp), %ebx
movl %esi, 4(%esp)
movl 20(%esp), %esi
movl %ebx, %ecx
movl %esi, %eax
imull 24(%esp), %ecx
imull 16(%esp), %eax
addl %eax, %ecx
movl %esi, %eax
mull %ebx
movl (%esp), %ebx
leal (%ecx,%edx), %esi
movl %esi, %edx
movl 4(%esp), %esi
addl $8, %esp
ret
but since mulsi3 is defined as:
(define_insn "*mulsi3_1"
[(set (match_operand:SI 0 "register_operand" "=r,r,r")
(mult:SI (match_operand:SI 1 "nonimmediate_operand" "%rm,rm,0")
(match_operand:SI 2 "general_operand" "K,i,mr")))
(clobber (reg:CC FLAGS_REG))]
the question is, why didn't RA perform operand reversal by itself. Currently it
converts _both_ SI multiplies from:
(insn:HI 8 7 9 2 m.c:14 (parallel [
(set (reg:SI 63)
(mult:SI (mem/c/i:SI (plus:SI (reg/f:SI 16 argp)
(const_int 4 [0x4])) [2 a+4 S4 A32])
(reg:SI 66 [ b ])))
(clobber (reg:CC 17 flags))
]) 214 {*mulsi3_1} (expr_list:REG_UNUSED (reg:CC 17 flags)
(nil)))
into
--cut here--
(insn 35 7 8 2 m.c:14 (set (reg:SI 0 ax [63])
(mem/c/i:SI (plus:SI (reg/f:SI 7 sp)
(const_int 16 [0x10])) [2 a+4 S4 A32])) 41 {*movsi_1} (nil))
(insn:HI 8 35 9 2 m.c:14 (parallel [
(set (reg:SI 0 ax [63])
(mult:SI (reg:SI 0 ax [63])
(reg:SI 4 si [orig:66 b ] [66])))
(clobber (reg:CC 17 flags))
]) 214 {*mulsi3_1} (nil))
--cut here--
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236
^ permalink raw reply [flat|nested] 33+ messages in thread
* [Bug rtl-optimization/17236] inefficient code for long long multiply on x86
[not found] <bug-17236-1008@http.gcc.gnu.org/bugzilla/>
` (13 preceding siblings ...)
2007-12-19 12:11 ` ubizjak at gmail dot com
@ 2007-12-19 12:14 ` bonzini at gnu dot org
2007-12-19 12:32 ` bonzini at gnu dot org
` (13 subsequent siblings)
28 siblings, 0 replies; 33+ messages in thread
From: bonzini at gnu dot org @ 2007-12-19 12:14 UTC (permalink / raw)
To: gcc-bugs
------- Comment #19 from bonzini at gnu dot org 2007-12-19 12:13 -------
Created an attachment (id=14792)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=14792&action=view)
patch to almost fix the bug
With this patch:
1) local-alloc first tries to allocate registers that go into small classes
2) regclass tries to use the same class for an output and an input that dies
3) i386 modes_tiable_p allows tying SImode with DImode (but not vice versa)
I get now this:
pushl %esi
pushl %ebx
movl 12(%esp), %ecx
movl 20(%esp), %eax
movl 24(%esp), %ebx
imull %ecx, %ebx
movl 16(%esp), %esi
imull %eax, %esi
addl %esi, %ebx
mull %ecx
leal (%ebx,%edx), %edx
popl %ebx
popl %esi
which has no moves. It still spills two registers because it is not able to
use %edx (I'll look into it later).
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236
^ permalink raw reply [flat|nested] 33+ messages in thread
* [Bug rtl-optimization/17236] inefficient code for long long multiply on x86
[not found] <bug-17236-1008@http.gcc.gnu.org/bugzilla/>
` (14 preceding siblings ...)
2007-12-19 12:14 ` bonzini at gnu dot org
@ 2007-12-19 12:32 ` bonzini at gnu dot org
2007-12-19 12:43 ` bonzini at gnu dot org
` (12 subsequent siblings)
28 siblings, 0 replies; 33+ messages in thread
From: bonzini at gnu dot org @ 2007-12-19 12:32 UTC (permalink / raw)
To: gcc-bugs
------- Comment #20 from bonzini at gnu dot org 2007-12-19 12:32 -------
There is a big catch-22. When GCC ties one of regs 64/66 with reg 61, it
enlarges reg 61's live range to cover the live range of the tied range. When
it does this, it basically locks up %edx for the whole live range of reg 61.
:-(
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236
^ permalink raw reply [flat|nested] 33+ messages in thread
* [Bug rtl-optimization/17236] inefficient code for long long multiply on x86
[not found] <bug-17236-1008@http.gcc.gnu.org/bugzilla/>
` (15 preceding siblings ...)
2007-12-19 12:32 ` bonzini at gnu dot org
@ 2007-12-19 12:43 ` bonzini at gnu dot org
2007-12-19 13:12 ` ubizjak at gmail dot com
` (11 subsequent siblings)
28 siblings, 0 replies; 33+ messages in thread
From: bonzini at gnu dot org @ 2007-12-19 12:43 UTC (permalink / raw)
To: gcc-bugs
------- Comment #21 from bonzini at gnu dot org 2007-12-19 12:43 -------
Created an attachment (id=14793)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=14793&action=view)
two hunks only from the previous patch
Indeed, if I only use the regclass.c and local-alloc.c hunks, I get only one
spill!
pushl %ebx
movl 8(%esp), %edx
movl 16(%esp), %eax
movl 20(%esp), %ecx
imull %edx, %ecx
movl 12(%esp), %ebx
imull %eax, %ebx
addl %ebx, %ecx
mull %edx
leal (%ecx,%edx), %edx
popl %ebx
and no reg-reg move.
With Uros' patch it does use memory imull's, but also adds reg-reg moves. The
next step should be to figure out why (with my patch and without his) reload
feels the need to put the memory operands from insns 7 and 8 into regs.
--
bonzini at gnu dot org changed:
What |Removed |Added
----------------------------------------------------------------------------
Attachment #14792|0 |1
is obsolete| |
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236
^ permalink raw reply [flat|nested] 33+ messages in thread
* [Bug rtl-optimization/17236] inefficient code for long long multiply on x86
[not found] <bug-17236-1008@http.gcc.gnu.org/bugzilla/>
` (16 preceding siblings ...)
2007-12-19 12:43 ` bonzini at gnu dot org
@ 2007-12-19 13:12 ` ubizjak at gmail dot com
2007-12-19 13:36 ` bonzini at gnu dot org
` (10 subsequent siblings)
28 siblings, 0 replies; 33+ messages in thread
From: ubizjak at gmail dot com @ 2007-12-19 13:12 UTC (permalink / raw)
To: gcc-bugs
------- Comment #22 from ubizjak at gmail dot com 2007-12-19 13:11 -------
(In reply to comment #21)
> Indeed, if I only use the regclass.c and local-alloc.c hunks, I get only one
> spill!
>
> pushl %ebx
> movl 8(%esp), %edx
> movl 16(%esp), %eax
> movl 20(%esp), %ecx
> imull %edx, %ecx
> movl 12(%esp), %ebx
> imull %eax, %ebx
> addl %ebx, %ecx
> mull %edx
> leal (%ecx,%edx), %edx
> popl %ebx
>
> and no reg-reg move.
Excellent!
I think that we should fwprop stack arguments into the insns to further
optimize the code. According to pentopt.pdf, there is no penalty at all (at
least for P4+) when one of the insn operands is a memory reference. This would
relieve register pressure considerably.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236
^ permalink raw reply [flat|nested] 33+ messages in thread
* [Bug rtl-optimization/17236] inefficient code for long long multiply on x86
[not found] <bug-17236-1008@http.gcc.gnu.org/bugzilla/>
` (17 preceding siblings ...)
2007-12-19 13:12 ` ubizjak at gmail dot com
@ 2007-12-19 13:36 ` bonzini at gnu dot org
2007-12-19 13:49 ` steven at gcc dot gnu dot org
` (9 subsequent siblings)
28 siblings, 0 replies; 33+ messages in thread
From: bonzini at gnu dot org @ 2007-12-19 13:36 UTC (permalink / raw)
To: gcc-bugs
------- Comment #23 from bonzini at gnu dot org 2007-12-19 13:36 -------
Created an attachment (id=14794)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=14794&action=view)
teach combine that reg = op(reg, mem) is better
Since combine operates on the whole pattern, it can be taught the trick that
swap_commutative_operands_with_target does in optabs.c. This is the "right
way" to do the hack Uros proposed in comment #18.
Note that optabs.c does not get it because it creates reg=subreg*subreg. Then
CSE creates SImode memory loads from the subreg, and combine creates
reg=mem*reg. With this patch, combine creates reg=reg*mem instead.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236
^ permalink raw reply [flat|nested] 33+ messages in thread
* [Bug rtl-optimization/17236] inefficient code for long long multiply on x86
[not found] <bug-17236-1008@http.gcc.gnu.org/bugzilla/>
` (18 preceding siblings ...)
2007-12-19 13:36 ` bonzini at gnu dot org
@ 2007-12-19 13:49 ` steven at gcc dot gnu dot org
2007-12-19 13:50 ` bonzini at gnu dot org
` (8 subsequent siblings)
28 siblings, 0 replies; 33+ messages in thread
From: steven at gcc dot gnu dot org @ 2007-12-19 13:49 UTC (permalink / raw)
To: gcc-bugs
------- Comment #24 from steven at gcc dot gnu dot org 2007-12-19 13:48 -------
The patch in comment #23 might even be suitable for GCC 4.3 ...
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236
^ permalink raw reply [flat|nested] 33+ messages in thread
* [Bug rtl-optimization/17236] inefficient code for long long multiply on x86
[not found] <bug-17236-1008@http.gcc.gnu.org/bugzilla/>
` (19 preceding siblings ...)
2007-12-19 13:49 ` steven at gcc dot gnu dot org
@ 2007-12-19 13:50 ` bonzini at gnu dot org
2007-12-19 13:54 ` bonzini at gnu dot org
` (7 subsequent siblings)
28 siblings, 0 replies; 33+ messages in thread
From: bonzini at gnu dot org @ 2007-12-19 13:50 UTC (permalink / raw)
To: gcc-bugs
------- Comment #25 from bonzini at gnu dot org 2007-12-19 13:50 -------
Note that "fwprop" is not an exact term, because there *is* a memory load in
each multiplication, and propagating a second memory operand will create an
invalid insn. You may try to add a split from reg=op(mem1, mem2) to
reg=mem1;reg=reg*mem2, but anyway, it won't help propagating into the mull,
since it is not LOG_LINKS-related to the load.
The problem is that we end up with having to reload *anyway*. If we had a way
to express the relationship between operands 0 and 1 of the multiplication
*before reload*, probably the RTL optimizations could do much more.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236
^ permalink raw reply [flat|nested] 33+ messages in thread
* [Bug rtl-optimization/17236] inefficient code for long long multiply on x86
[not found] <bug-17236-1008@http.gcc.gnu.org/bugzilla/>
` (20 preceding siblings ...)
2007-12-19 13:50 ` bonzini at gnu dot org
@ 2007-12-19 13:54 ` bonzini at gnu dot org
2007-12-20 13:53 ` bonzini at gnu dot org
` (6 subsequent siblings)
28 siblings, 0 replies; 33+ messages in thread
From: bonzini at gnu dot org @ 2007-12-19 13:54 UTC (permalink / raw)
To: gcc-bugs
------- Comment #26 from bonzini at gnu dot org 2007-12-19 13:53 -------
I'm starting a SPEC run on the overall patch
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236
^ permalink raw reply [flat|nested] 33+ messages in thread
* [Bug rtl-optimization/17236] inefficient code for long long multiply on x86
[not found] <bug-17236-1008@http.gcc.gnu.org/bugzilla/>
` (21 preceding siblings ...)
2007-12-19 13:54 ` bonzini at gnu dot org
@ 2007-12-20 13:53 ` bonzini at gnu dot org
2007-12-20 14:16 ` bonzini at gnu dot org
` (5 subsequent siblings)
28 siblings, 0 replies; 33+ messages in thread
From: bonzini at gnu dot org @ 2007-12-20 13:53 UTC (permalink / raw)
To: gcc-bugs
------- Comment #27 from bonzini at gnu dot org 2007-12-20 13:53 -------
I screwed up so I have to rerun most of SPECfp2000, but the results seem a
wash. Anybody can fire the patch I'll attach soon on a wide range of machines?
--
Bug 17236 depends on bug 6585, which changed state.
Bug 6585 Summary: Redundant store/load instruction pairs on ix86
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=6585
What |Old Value |New Value
----------------------------------------------------------------------------
Status|RESOLVED |ASSIGNED
Resolution|DUPLICATE |
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236
^ permalink raw reply [flat|nested] 33+ messages in thread
* [Bug rtl-optimization/17236] inefficient code for long long multiply on x86
[not found] <bug-17236-1008@http.gcc.gnu.org/bugzilla/>
` (22 preceding siblings ...)
2007-12-20 13:53 ` bonzini at gnu dot org
@ 2007-12-20 14:16 ` bonzini at gnu dot org
2008-03-07 8:27 ` bonzini at gnu dot org
` (4 subsequent siblings)
28 siblings, 0 replies; 33+ messages in thread
From: bonzini at gnu dot org @ 2007-12-20 14:16 UTC (permalink / raw)
To: gcc-bugs
------- Comment #28 from bonzini at gnu dot org 2007-12-20 14:15 -------
Created an attachment (id=14800)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=14800&action=view)
combined patch
--
bonzini at gnu dot org changed:
What |Removed |Added
----------------------------------------------------------------------------
Attachment #14793|0 |1
is obsolete| |
Attachment #14794|0 |1
is obsolete| |
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236
^ permalink raw reply [flat|nested] 33+ messages in thread
* [Bug rtl-optimization/17236] inefficient code for long long multiply on x86
[not found] <bug-17236-1008@http.gcc.gnu.org/bugzilla/>
` (23 preceding siblings ...)
2007-12-20 14:16 ` bonzini at gnu dot org
@ 2008-03-07 8:27 ` bonzini at gnu dot org
2008-03-07 8:42 ` bonzini at gnu dot org
` (3 subsequent siblings)
28 siblings, 0 replies; 33+ messages in thread
From: bonzini at gnu dot org @ 2008-03-07 8:27 UTC (permalink / raw)
To: gcc-bugs
------- Comment #29 from bonzini at gnu dot org 2008-03-07 08:26 -------
ira branch produces the same code as with my patch.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236
^ permalink raw reply [flat|nested] 33+ messages in thread
* [Bug rtl-optimization/17236] inefficient code for long long multiply on x86
[not found] <bug-17236-1008@http.gcc.gnu.org/bugzilla/>
` (24 preceding siblings ...)
2008-03-07 8:27 ` bonzini at gnu dot org
@ 2008-03-07 8:42 ` bonzini at gnu dot org
2008-09-06 16:12 ` ubizjak at gmail dot com
` (2 subsequent siblings)
28 siblings, 0 replies; 33+ messages in thread
From: bonzini at gnu dot org @ 2008-03-07 8:42 UTC (permalink / raw)
To: gcc-bugs
--
bonzini at gnu dot org changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |SUSPENDED
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236
^ permalink raw reply [flat|nested] 33+ messages in thread
* [Bug rtl-optimization/17236] inefficient code for long long multiply on x86
[not found] <bug-17236-1008@http.gcc.gnu.org/bugzilla/>
` (25 preceding siblings ...)
2008-03-07 8:42 ` bonzini at gnu dot org
@ 2008-09-06 16:12 ` ubizjak at gmail dot com
2008-09-06 16:19 ` ubizjak at gmail dot com
2008-09-06 16:20 ` ubizjak at gmail dot com
28 siblings, 0 replies; 33+ messages in thread
From: ubizjak at gmail dot com @ 2008-09-06 16:12 UTC (permalink / raw)
To: gcc-bugs
------- Comment #30 from ubizjak at gmail dot com 2008-09-06 16:11 -------
Current mainline (4.4.0 20080906) produces:
pushl %ebx
movl 8(%ebp), %eax
movl 16(%ebp), %edx
movl 20(%ebp), %ecx
movl 12(%ebp), %ebx
imull %eax, %ecx
imull %edx, %ebx
mull %edx
addl %ebx, %ecx
popl %ebx
leal (%ecx,%edx), %edx
So, fixed.
--
ubizjak at gmail dot com changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|SUSPENDED |RESOLVED
Resolution| |FIXED
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236
^ permalink raw reply [flat|nested] 33+ messages in thread
* [Bug rtl-optimization/17236] inefficient code for long long multiply on x86
[not found] <bug-17236-1008@http.gcc.gnu.org/bugzilla/>
` (26 preceding siblings ...)
2008-09-06 16:12 ` ubizjak at gmail dot com
@ 2008-09-06 16:19 ` ubizjak at gmail dot com
2008-09-06 16:20 ` ubizjak at gmail dot com
28 siblings, 0 replies; 33+ messages in thread
From: ubizjak at gmail dot com @ 2008-09-06 16:19 UTC (permalink / raw)
To: gcc-bugs
------- Comment #31 from ubizjak at gmail dot com 2008-09-06 16:18 -------
*** Bug 6585 has been marked as a duplicate of this bug. ***
--
ubizjak at gmail dot com changed:
What |Removed |Added
----------------------------------------------------------------------------
BugsThisDependsOn|6585 |
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236
^ permalink raw reply [flat|nested] 33+ messages in thread
* [Bug rtl-optimization/17236] inefficient code for long long multiply on x86
[not found] <bug-17236-1008@http.gcc.gnu.org/bugzilla/>
` (27 preceding siblings ...)
2008-09-06 16:19 ` ubizjak at gmail dot com
@ 2008-09-06 16:20 ` ubizjak at gmail dot com
28 siblings, 0 replies; 33+ messages in thread
From: ubizjak at gmail dot com @ 2008-09-06 16:20 UTC (permalink / raw)
To: gcc-bugs
--
ubizjak at gmail dot com changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|--- |4.4.0
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236
^ permalink raw reply [flat|nested] 33+ messages in thread