[Bug target/17619] New: Non-optimal code for -mfpmath=387,sse

public inbox for gcc-bugs@sourceware.org

* [Bug target/17619] New: Non-optimal code for -mfpmath=387,sse
@ 2004-09-22 19:19 bangerth at dealii dot org
  2004-09-22 19:33 ` [Bug target/17619] " bangerth at dealii dot org
                   ` (9 more replies)
  0 siblings, 10 replies; 11+ messages in thread
From: bangerth at dealii dot org @ 2004-09-22 19:19 UTC (permalink / raw)
  To: gcc-bugs

I know that -mfpmath=387,sse is not considered production quality. 
Nevertheless, I thought I might give it a try. So here's some 
example code that computes the scalar product between two 
vectors of length 4: 
-------------------------------- 
struct X { float array[4]; }; 
 
X a,b; 
 
float foobar () { 
  float s = 0; 
  for (unsigned int d=0; d<4; ++d) 
    s += a.array[d] * b.array[d]; 
  return s; 
} 
-------------------------- 
In the following, I will always use compile flags  
  -O3 -funroll-loops -msse3 -mtune=pentium4 -march=pentium4 
in addition to whatever setting for -mfpmath is described. 
 
With -mfpmath=387 we get this (reasonable) piece of code: 
_Z6foobarv: 
	pushl	%ebp 
	movl	%esp, %ebp 
	flds	b 
	fmuls	a 
	fadds	.LC0 
	flds	b+4 
	fmuls	a+4 
	faddp	%st, %st(1) 
	flds	b+8 
	fmuls	a+8 
	faddp	%st, %st(1) 
	flds	b+12 
	fmuls	a+12 
	faddp	%st, %st(1) 
	popl	%ebp 
	ret 
Here, we load each pair of vector elements, multiply them, and then 
add the product to the accumulator. The only thing that's nonoptimal is that 
the initial addition to zero in "fadds	.LC0" could be avoided (LC0 
is a label to a zero floating point number). 
 
If one tries to compile with -mfpmath=sse, one gets very similar 
code, with the exception that multiplications and additions are 
performed in xmm? registers. 
 
However, here comes the catch: I thought that if I specify -mfpmath=387,sse 
it should produce code at least as good as without it. But I get this: 
_Z6foobarv: 
	pushl	%ebp 
	movl	%esp, %ebp 
	subl	$4, %esp 
	flds	b 
	fmuls	a 
	fadds	.LC0 
	movss	b+4, %xmm0 
	mulss	a+4, %xmm0 
	movss	%xmm0, -4(%ebp) 
	flds	-4(%ebp) 
	faddp	%st, %st(1) 
	movss	b+8, %xmm0 
	mulss	a+8, %xmm0 
	movss	%xmm0, -4(%ebp) 
	flds	-4(%ebp) 
	faddp	%st, %st(1) 
	movss	b+12, %xmm0 
	mulss	a+12, %xmm0 
	movss	%xmm0, -4(%ebp) 
	flds	-4(%ebp) 
	faddp	%st, %st(1) 
	leave 
	ret 
That is decidedly not optimal: we compute the result of each multiplication 
in xmm registers, but then we push them onto the stack, reload them into 
st(?) registers and accumulate them there. Surely the whole thing 
can be done without these stack operations and be more efficient. In 
particular, using just -mfpmath=sse shows that this is possible. 
 
W.

-- 
           Summary: Non-optimal code for -mfpmath=387,sse
           Product: gcc
           Version: 4.0.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: target
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: bangerth at dealii dot org
                CC: gcc-bugs at gcc dot gnu dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17619



* [Bug target/17619] Non-optimal code for -mfpmath=387,sse
  2004-09-22 19:19 [Bug target/17619] New: Non-optimal code for -mfpmath=387,sse bangerth at dealii dot org
@ 2004-09-22 19:33 ` bangerth at dealii dot org
  2004-09-22 21:16 ` pinskia at gcc dot gnu dot org
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: bangerth at dealii dot org @ 2004-09-22 19:33 UTC (permalink / raw)
  To: gcc-bugs


------- Additional Comments From bangerth at dealii dot org  2004-09-22 19:33 -------
I should add that the code produced by 3.3.4 and 3.4.2 is significantly 
different, though it also shows the basic problem of moves to and from 
the stack. 
 
W. 


* [Bug target/17619] Non-optimal code for -mfpmath=387,sse
  2004-09-22 19:19 [Bug target/17619] New: Non-optimal code for -mfpmath=387,sse bangerth at dealii dot org
  2004-09-22 19:33 ` [Bug target/17619] " bangerth at dealii dot org
@ 2004-09-22 21:16 ` pinskia at gcc dot gnu dot org
  2004-09-22 21:22 ` bangerth at dealii dot org
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: pinskia at gcc dot gnu dot org @ 2004-09-22 21:16 UTC (permalink / raw)
  To: gcc-bugs


------- Additional Comments From pinskia at gcc dot gnu dot org  2004-09-22 21:15 -------
This is wrong:
> The only thing that's nonoptimal is that the initial addition to zero in
> "fadds  .LC0" could be avoided (LC0 is a label to a zero floating point number).
You cannot do this transformation except with -ffast-math.

Other than that confirmed.
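
A minimal sketch of why the zero addition cannot simply be dropped, assuming
IEEE 754 arithmetic in the default round-to-nearest mode: if the first product
is -0.0, then 0 + (-0.0) yields +0.0, so removing the addition would change the
sign of the result. The helper names add_to_zero and check_signed_zero are
made up for this illustration.

```cpp
#include <cassert>
#include <cmath>

// Models the first loop iteration, "s = 0; s += x".
// (Hypothetical helper; not part of the original testcase.)
float add_to_zero(float x) {
  volatile float zero = 0.0f;  // volatile: keep the compiler from folding the sum
  return zero + x;
}

// Self-check, assuming IEEE 754 semantics and default rounding.
void check_signed_zero() {
  float r = add_to_zero(-0.0f);
  assert(r == 0.0f);            // still compares equal to zero
  assert(!std::signbit(r));     // but the sum is +0.0 ...
  assert(std::signbit(-0.0f));  // ... while the bare operand was -0.0
}
```

Calling check_signed_zero() passes on such a target. With -ffast-math the
compiler is allowed to ignore the sign of zero, so the checks are not
meaningful there.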

-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
     Ever Confirmed|                            |1
   Last reconfirmed|0000-00-00 00:00:00         |2004-09-22 21:15:57
               date|                            |



* [Bug target/17619] Non-optimal code for -mfpmath=387,sse
  2004-09-22 19:19 [Bug target/17619] New: Non-optimal code for -mfpmath=387,sse bangerth at dealii dot org
  2004-09-22 19:33 ` [Bug target/17619] " bangerth at dealii dot org
  2004-09-22 21:16 ` pinskia at gcc dot gnu dot org
@ 2004-09-22 21:22 ` bangerth at dealii dot org
  2004-09-22 21:25 ` bangerth at dealii dot org
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: bangerth at dealii dot org @ 2004-09-22 21:22 UTC (permalink / raw)
  To: gcc-bugs

------- Additional Comments From bangerth at dealii dot org  2004-09-22 21:22 -------
> You cannot do this transformation except with -ffast-math. 

What do you mean by that? Certainly the addition of a zero floating point constant 
can be avoided even without -ffast-math (or other unsafe math operations). If there 
should be an overflow or similar during this operation, then it should have triggered the 
relevant exceptions already in the multiplication that computed the second addend. 

However, I don't want to dwell on this point -- the fact that we have unnecessary stack 
moves is what bothers me. 

W. 


* [Bug target/17619] Non-optimal code for -mfpmath=387,sse
  2004-09-22 19:19 [Bug target/17619] New: Non-optimal code for -mfpmath=387,sse bangerth at dealii dot org
                   ` (2 preceding siblings ...)
  2004-09-22 21:22 ` bangerth at dealii dot org
@ 2004-09-22 21:25 ` bangerth at dealii dot org
  2004-09-22 21:35 ` bangerth at dealii dot org
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: bangerth at dealii dot org @ 2004-09-22 21:25 UTC (permalink / raw)
  To: gcc-bugs


------- Additional Comments From bangerth at dealii dot org  2004-09-22 21:25 -------
However, Andrew is right in that the zero addition vanishes when using 
-ffast-math. I'll open another bug report for this. 
 
W. 


* [Bug target/17619] Non-optimal code for -mfpmath=387,sse
  2004-09-22 19:19 [Bug target/17619] New: Non-optimal code for -mfpmath=387,sse bangerth at dealii dot org
                   ` (3 preceding siblings ...)
  2004-09-22 21:25 ` bangerth at dealii dot org
@ 2004-09-22 21:35 ` bangerth at dealii dot org
  2004-12-01 14:07 ` uros at gcc dot gnu dot org
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: bangerth at dealii dot org @ 2004-09-22 21:35 UTC (permalink / raw)
  To: gcc-bugs


------- Additional Comments From bangerth at dealii dot org  2004-09-22 21:35 -------
That new PR is now PR 17622. 


* [Bug target/17619] Non-optimal code for -mfpmath=387,sse
  2004-09-22 19:19 [Bug target/17619] New: Non-optimal code for -mfpmath=387,sse bangerth at dealii dot org
                   ` (4 preceding siblings ...)
  2004-09-22 21:35 ` bangerth at dealii dot org
@ 2004-12-01 14:07 ` uros at gcc dot gnu dot org
  2004-12-01 14:27 ` pinskia at gcc dot gnu dot org
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: uros at gcc dot gnu dot org @ 2004-12-01 14:07 UTC (permalink / raw)
  To: gcc-bugs


------- Additional Comments From uros at gcc dot gnu dot org  2004-12-01 14:07 -------
With "GCC: (GNU) 4.0.0 20041201 (experimental)", the following code is produced
(without -ffast-math):

_Z6foobarv:
.LFB2:
        pushl   %ebp
.LCFI0:
        movl    %esp, %ebp
.LCFI1:
        subl    $4, %esp
.LCFI2:
        flds    b+12
        fmuls   a+12
        movss   b, %xmm1
        mulss   a, %xmm1
        addss   .LC0, %xmm1
        movss   b+4, %xmm0
        mulss   a+4, %xmm0
        addss   %xmm0, %xmm1
        movss   b+8, %xmm0
        mulss   a+8, %xmm0
        addss   %xmm0, %xmm1
        movss   %xmm1, -4(%ebp)
        flds    -4(%ebp)
        faddp   %st, %st(1)
        leave
        ret

Please note that we should return the result in an fp reg, so the final flds is
needed in any case. I think this code is optimal.

Should we close this bug?

Uros.


* [Bug target/17619] Non-optimal code for -mfpmath=387,sse
  2004-09-22 19:19 [Bug target/17619] New: Non-optimal code for -mfpmath=387,sse bangerth at dealii dot org
                   ` (5 preceding siblings ...)
  2004-12-01 14:07 ` uros at gcc dot gnu dot org
@ 2004-12-01 14:27 ` pinskia at gcc dot gnu dot org
  2004-12-01 16:03 ` uros at gcc dot gnu dot org
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: pinskia at gcc dot gnu dot org @ 2004-12-01 14:27 UTC (permalink / raw)
  To: gcc-bugs


------- Additional Comments From pinskia at gcc dot gnu dot org  2004-12-01 14:27 -------
Actually the optimal code would be:
_Z6foobarv:
.LFB2:  
        pushl   %ebp
.LCFI0: 
        movl    %esp, %ebp
.LCFI1: 
        subl    $24, %esp
.LCFI2: 
        movaps  a, %xmm0
        mulps   b, %xmm0
        movaps  %xmm0, -24(%ebp)
        fldz
        fadds   -24(%ebp)
        fadds   -20(%ebp)
        fadds   -16(%ebp)
        fadds   -12(%ebp)
        leave
        ret

But to do that, the tree vectorizer needs to become better and also to split the loop into two.
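
At the source level, the shape of the code above can be sketched with SSE
intrinsics: one packed multiply, a store to a stack buffer, then four scalar
adds in order, starting from zero. This is only an illustration under
assumptions: AlignedX and dot4 are hypothetical names, the 16-byte alignment
is imposed by hand, and a scalar fallback is used when the compiler does not
define __SSE__. Whether the scalar adds end up on the 387 stack or in xmm
registers depends on -mfpmath and the target.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

// Hypothetical variant of struct X with the alignment movaps needs.
struct AlignedX { alignas(16) float array[4]; };

// Multiply all four lanes at once, then accumulate the products in order.
float dot4(const AlignedX &a, const AlignedX &b) {
  // Aligned loads require 16-byte aligned operands.
  assert(reinterpret_cast<std::uintptr_t>(a.array) % 16 == 0);
  assert(reinterpret_cast<std::uintptr_t>(b.array) % 16 == 0);

  alignas(16) float prod[4];
#if defined(__SSE__)
  // Corresponds to movaps / mulps / movaps in the listing above.
  _mm_store_ps(prod, _mm_mul_ps(_mm_load_ps(a.array), _mm_load_ps(b.array)));
#else
  // Scalar fallback for targets without SSE.
  for (int d = 0; d < 4; ++d)
    prod[d] = a.array[d] * b.array[d];
#endif

  float s = 0;  // corresponds to fldz
  for (int d = 0; d < 4; ++d)
    s += prod[d];  // corresponds to the four fadds
  return s;
}
```

The split into "compute all products" and "sum the products" is the same
two-loop structure mentioned above, just written out by hand.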


* [Bug target/17619] Non-optimal code for -mfpmath=387,sse
  2004-09-22 19:19 [Bug target/17619] New: Non-optimal code for -mfpmath=387,sse bangerth at dealii dot org
                   ` (6 preceding siblings ...)
  2004-12-01 14:27 ` pinskia at gcc dot gnu dot org
@ 2004-12-01 16:03 ` uros at gcc dot gnu dot org
  2004-12-01 20:49 ` bangerth at dealii dot org
  2004-12-01 20:59 ` bangerth at dealii dot org
  9 siblings, 0 replies; 11+ messages in thread
From: uros at gcc dot gnu dot org @ 2004-12-01 16:03 UTC (permalink / raw)
  To: gcc-bugs


------- Additional Comments From uros at gcc dot gnu dot org  2004-12-01 16:02 -------
If the loop is split manually and a, b and c are put inside the foobar()
function [otherwise the vectorizer complains about an unaligned load]:

--cut here--
struct X
{
  float array[4];
};

float foobar()
{
  X a, b, c;

  float s = 0;
  for (unsigned int d = 0; d < 4; ++d)
    c.array[d] = a.array[d] * b.array[d];

  for (unsigned int d = 0; d < 4; ++d)
    s += c.array[d];

  return s;
}
--cut here--

Compiling this example with the right set of options: -O2 -march=pentium4
-ftree-vectorize -mfpmath=sse,387 -funroll-loops -fomit-frame-pointer
-ffast-math, this wonderful piece of code is produced:

_Z6foobarv:
.LFB2:
        subl    $60, %esp
.LCFI0:
        movaps  32(%esp), %xmm0
        mulps   16(%esp), %xmm0
        movaps  %xmm0, (%esp)
        flds    4(%esp)
        fadds   (%esp)
        fadds   8(%esp)
        fadds   12(%esp)
        addl    $60, %esp
        ret

I don't know why the vectorizer doesn't like the original testcase.

Uros.


* [Bug target/17619] Non-optimal code for -mfpmath=387,sse
  2004-09-22 19:19 [Bug target/17619] New: Non-optimal code for -mfpmath=387,sse bangerth at dealii dot org
                   ` (7 preceding siblings ...)
  2004-12-01 16:03 ` uros at gcc dot gnu dot org
@ 2004-12-01 20:49 ` bangerth at dealii dot org
  2004-12-01 20:59 ` bangerth at dealii dot org
  9 siblings, 0 replies; 11+ messages in thread
From: bangerth at dealii dot org @ 2004-12-01 20:49 UTC (permalink / raw)
  To: gcc-bugs


------- Additional Comments From bangerth at dealii dot org  2004-12-01 20:49 -------
In reply to comment #6: 
 
> Please note that we should return the result in an fp reg, so the final flds is 
> needed in any case. I think this code is optimal. 
 
Almost, or at least I believe so. If we assume that all the operations  
with xmm registers cost the same as with the floating point stack, then 
the result of -mfpmath=387,sse requires one stack push and pop more than 
the result of -mfpmath=387. The compiler should recognize this and then 
simply not use the sse registers at all. 
 
I will open a new PR for this, and another one for the vectorization issue. 
 
Thanks for now 
 Wolfgang 

-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED



* [Bug target/17619] Non-optimal code for -mfpmath=387,sse
  2004-09-22 19:19 [Bug target/17619] New: Non-optimal code for -mfpmath=387,sse bangerth at dealii dot org
                   ` (8 preceding siblings ...)
  2004-12-01 20:49 ` bangerth at dealii dot org
@ 2004-12-01 20:59 ` bangerth at dealii dot org
  9 siblings, 0 replies; 11+ messages in thread
From: bangerth at dealii dot org @ 2004-12-01 20:59 UTC (permalink / raw)
  To: gcc-bugs


------- Additional Comments From bangerth at dealii dot org  2004-12-01 20:59 -------
The two spinoffs are PR 18766 and PR 18767. 
W. 

