public inbox for gcc-bugs@sourceware.org
* [Bug target/17619] New: Non-optimal code for -mfpmath=387,sse
@ 2004-09-22 19:19 bangerth at dealii dot org
2004-09-22 19:33 ` [Bug target/17619] " bangerth at dealii dot org
` (9 more replies)
0 siblings, 10 replies; 11+ messages in thread
From: bangerth at dealii dot org @ 2004-09-22 19:19 UTC (permalink / raw)
To: gcc-bugs
I know that -mfpmath=387,sse is not considered production quality.
Nevertheless, I thought I might give it a try. So here's some
example code that computes the scalar product between two
vectors of length 4:
--------------------------------
struct X { float array[4]; };
X a,b;
float foobar () {
  float s = 0;
  for (unsigned int d=0; d<4; ++d)
    s += a.array[d] * b.array[d];
  return s;
}
--------------------------
In the following, I will always use compile flags
-O3 -funroll-loops -msse3 -mtune=pentium4 -march=pentium4
in addition to whatever setting for -mfpmath is described.
With -mfpmath=387 we get this (reasonable) piece of code:
_Z6foobarv:
pushl %ebp
movl %esp, %ebp
flds b
fmuls a
fadds .LC0
flds b+4
fmuls a+4
faddp %st, %st(1)
flds b+8
fmuls a+8
faddp %st, %st(1)
flds b+12
fmuls a+12
faddp %st, %st(1)
popl %ebp
ret
Here, we load each pair of vector elements, multiply them, and add the
product to the accumulator. The only thing that's nonoptimal is that
the initial addition to zero in "fadds .LC0" could be avoided (LC0
is a label to a zero floating point number).
If one tries to compile with -mfpmath=sse, one gets very similar
code, with the exception that multiplications and additions are
performed in xmm? registers.
However, here comes the catch: I thought that if I specified -mfpmath=387,sse
it should produce code at least as good as with either unit alone. But I get this:
_Z6foobarv:
pushl %ebp
movl %esp, %ebp
subl $4, %esp
flds b
fmuls a
fadds .LC0
movss b+4, %xmm0
mulss a+4, %xmm0
movss %xmm0, -4(%ebp)
flds -4(%ebp)
faddp %st, %st(1)
movss b+8, %xmm0
mulss a+8, %xmm0
movss %xmm0, -4(%ebp)
flds -4(%ebp)
faddp %st, %st(1)
movss b+12, %xmm0
mulss a+12, %xmm0
movss %xmm0, -4(%ebp)
flds -4(%ebp)
faddp %st, %st(1)
leave
ret
That is decidedly not optimal: we compute the result of each multiplication
in xmm registers, but then we store it to a stack slot, reload it into
the st(?) registers and accumulate there. Surely the whole thing
can be done without these stack operations and be more efficient. In
particular, using just -mfpmath=sse shows that this is possible.
W.
--
Summary: Non-optimal code for -mfpmath=387,sse
Product: gcc
Version: 4.0.0
Status: UNCONFIRMED
Severity: normal
Priority: P2
Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: bangerth at dealii dot org
CC: gcc-bugs at gcc dot gnu dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17619
* [Bug target/17619] Non-optimal code for -mfpmath=387,sse
From: bangerth at dealii dot org @ 2004-09-22 19:33 UTC (permalink / raw)
To: gcc-bugs
------- Additional Comments From bangerth at dealii dot org 2004-09-22 19:33 -------
I should add that the code produced by 3.3.4 and 3.4.2 is significantly
different, though it also shows the basic problem of moves to and from
the stack.
W.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17619
* [Bug target/17619] Non-optimal code for -mfpmath=387,sse
From: pinskia at gcc dot gnu dot org @ 2004-09-22 21:16 UTC (permalink / raw)
To: gcc-bugs
------- Additional Comments From pinskia at gcc dot gnu dot org 2004-09-22 21:15 -------
This is wrong:
> The only thing that's nonoptimal is that the initial addition to zero in "fadds .LC0" could be avoided
> (LC0 is a label to a zero floating point number).
You cannot do this transformation except with -ffast-math.
Other than that, confirmed.
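One way to see why the addition cannot be dropped in general, assuming the concern here is IEEE 754 signed zeros: if the first product is -0.0, then 0.0 + (-0.0) yields +0.0 under the default rounding mode, whereas the product alone is -0.0. The sketch below is only an illustration; the helper names are hypothetical and not part of the original test case.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: mimics "float s = 0; s += p;" for the first loop
// iteration. The volatile keeps the compiler from folding the addition away.
float accumulate_from_zero(float p) {
    volatile float s = 0.0f;
    s = s + p;
    return s;
}

// Hypothetical helper: what would be computed if the addition of the
// zero constant were eliminated.
float accumulate_without_zero(float p) {
    return p;
}

// Self-check: the two variants differ in the sign of a zero result.
void check_signed_zero() {
    float p = -0.0f;  // for example, the product (-1.0f) * 0.0f
    assert(!std::signbit(accumulate_from_zero(p)));    // result is +0.0
    assert(std::signbit(accumulate_without_zero(p)));  // result is -0.0
}
```

Calling check_signed_zero() from a small main() built without -ffast-math should pass both assertions; with -ffast-math the compiler is allowed to ignore the sign of zero, which is consistent with the transformation being permitted only there.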
--
What |Removed |Added
----------------------------------------------------------------------------
Status|UNCONFIRMED |NEW
Ever Confirmed| |1
Last reconfirmed|0000-00-00 00:00:00 |2004-09-22 21:15:57
date| |
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17619
* [Bug target/17619] Non-optimal code for -mfpmath=387,sse
From: bangerth at dealii dot org @ 2004-09-22 21:22 UTC (permalink / raw)
To: gcc-bugs
------- Additional Comments From bangerth at dealii dot org 2004-09-22 21:22 -------
> You cannot do this transformation except with -ffast-math.
What do you mean by that? Certainly the addition of a zero floating point constant
can be avoided even without -ffast-math (or other unsafe math operations). If there
should be an overflow or similar during this operation, then it should have triggered the
relevant exceptions already in the multiplication that computed the second addend.
However, I don't want to dwell on this point -- the fact that we have unnecessary stack
moves is what bothers me.
W.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17619
* [Bug target/17619] Non-optimal code for -mfpmath=387,sse
From: bangerth at dealii dot org @ 2004-09-22 21:25 UTC (permalink / raw)
To: gcc-bugs
------- Additional Comments From bangerth at dealii dot org 2004-09-22 21:25 -------
However, Andrew is right in that the zero addition vanishes when using
-ffast-math. I'll open another bug report for this.
W.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17619
* [Bug target/17619] Non-optimal code for -mfpmath=387,sse
From: bangerth at dealii dot org @ 2004-09-22 21:35 UTC (permalink / raw)
To: gcc-bugs
------- Additional Comments From bangerth at dealii dot org 2004-09-22 21:35 -------
That new PR is now PR 17622.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17619
* [Bug target/17619] Non-optimal code for -mfpmath=387,sse
From: uros at gcc dot gnu dot org @ 2004-12-01 14:07 UTC (permalink / raw)
To: gcc-bugs
------- Additional Comments From uros at gcc dot gnu dot org 2004-12-01 14:07 -------
With "GCC: (GNU) 4.0.0 20041201 (experimental)", the following code is produced
(without -ffast-math):
_Z6foobarv:
.LFB2:
pushl %ebp
.LCFI0:
movl %esp, %ebp
.LCFI1:
subl $4, %esp
.LCFI2:
flds b+12
fmuls a+12
movss b, %xmm1
mulss a, %xmm1
addss .LC0, %xmm1
movss b+4, %xmm0
mulss a+4, %xmm0
addss %xmm0, %xmm1
movss b+8, %xmm0
mulss a+8, %xmm0
addss %xmm0, %xmm1
movss %xmm1, -4(%ebp)
flds -4(%ebp)
faddp %st, %st(1)
leave
ret
Please note that we must return the result in an fp register, so the final flds is
needed in any case. I think this code is optimal.
Should we close this bug?
Uros.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17619
* [Bug target/17619] Non-optimal code for -mfpmath=387,sse
From: pinskia at gcc dot gnu dot org @ 2004-12-01 14:27 UTC (permalink / raw)
To: gcc-bugs
------- Additional Comments From pinskia at gcc dot gnu dot org 2004-12-01 14:27 -------
Actually, the optimal code would be:
_Z6foobarv:
.LFB2:
pushl %ebp
.LCFI0:
movl %esp, %ebp
.LCFI1:
subl $24, %esp
.LCFI2:
movaps a, %xmm0
mulps b, %xmm0
movaps %xmm0, -24(%ebp)
fldz
fadds -24(%ebp)
fadds -20(%ebp)
fadds -16(%ebp)
fadds -12(%ebp)
leave
ret
But to do that, the tree vectorizer needs to become better and also needs to split the loop into two.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17619
* [Bug target/17619] Non-optimal code for -mfpmath=387,sse
From: uros at gcc dot gnu dot org @ 2004-12-01 16:03 UTC (permalink / raw)
To: gcc-bugs
------- Additional Comments From uros at gcc dot gnu dot org 2004-12-01 16:02 -------
If the loop is split manually and a, b and c are moved inside the foobar()
function [otherwise the vectorizer complains about an unaligned load]:
--cut here--
struct X
{
  float array[4];
};
float foobar()
{
  X a, b, c;
  float s = 0;
  for (unsigned int d = 0; d < 4; ++d)
    c.array[d] = a.array[d] * b.array[d];
  for (unsigned int d = 0; d < 4; ++d)
    s += c.array[d];
  return s;
}
--cut here--
Compiling this example with the right set of options (-O2 -march=pentium4
-ftree-vectorize -mfpmath=sse,387 -funroll-loops -fomit-frame-pointer
-ffast-math) produces this wonderful piece of code:
_Z6foobarv:
.LFB2:
subl $60, %esp
.LCFI0:
movaps 32(%esp), %xmm0
mulps 16(%esp), %xmm0
movaps %xmm0, (%esp)
flds 4(%esp)
fadds (%esp)
fadds 8(%esp)
fadds 12(%esp)
addl $60, %esp
ret
I don't know why the vectorizer doesn't like the original testcase.
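The split-loop test case above reads a and b uninitialized, which is fine for inspecting generated code but gives no defined result to check. A minimal sketch of a variant with defined inputs follows, assuming C++11 alignas is available to request 16-byte alignment; the names foobar_split and check_foobar_split are hypothetical, and whether the vectorizer of that GCC snapshot accepts this form has not been checked here.

```cpp
#include <cassert>

struct X
{
  // Assumption: 16-byte alignment, so aligned packed loads would be legal.
  alignas(16) float array[4];
};

// Hypothetical variant of foobar(): operands are passed in rather than
// read from uninitialized locals. The loop is still split in two.
float foobar_split(const X &a, const X &b)
{
  X c;
  // First loop: element-wise products (the part a vectorizer could handle).
  for (unsigned int d = 0; d < 4; ++d)
    c.array[d] = a.array[d] * b.array[d];
  // Second loop: scalar reduction over the products.
  float s = 0;
  for (unsigned int d = 0; d < 4; ++d)
    s += c.array[d];
  return s;
}

// Self-check with small integers, which are exactly representable in float.
void check_foobar_split()
{
  X a = {{1.0f, 2.0f, 3.0f, 4.0f}};
  X b = {{5.0f, 6.0f, 7.0f, 8.0f}};
  assert(foobar_split(a, b) == 70.0f);  // 5 + 12 + 21 + 32
}
```

Because the inputs are exact small integers, the result does not depend on whether the sum is reassociated, so the check should hold with or without -ffast-math.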
Uros.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17619
* [Bug target/17619] Non-optimal code for -mfpmath=387,sse
From: bangerth at dealii dot org @ 2004-12-01 20:49 UTC (permalink / raw)
To: gcc-bugs
------- Additional Comments From bangerth at dealii dot org 2004-12-01 20:49 -------
In reply to comment #6:
> Please note, that we should return the result in fp reg, so final flds is
> needed in any case. I think, this code is optimal.
Almost, or at least I believe so. If we assume that all the operations
with xmm registers cost the same as with the floating point stack, then
the result of -mfpmath=387,sse requires one stack push and pop more than
the result of -mfpmath=387. The compiler should recognize this and then
simply not use the sse registers at all.
I will open a new PR for this, and another one for the vectorization issue.
Thanks for now
Wolfgang
--
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17619
* [Bug target/17619] Non-optimal code for -mfpmath=387,sse
From: bangerth at dealii dot org @ 2004-12-01 20:59 UTC (permalink / raw)
To: gcc-bugs
------- Additional Comments From bangerth at dealii dot org 2004-12-01 20:59 -------
The two spinoffs are PR 18766 and PR 18767.
W.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17619