public inbox for gcc-bugs@sourceware.org
* [Bug inline-asm/29756]  New: SSE intrinsics hard to use without redundant temporaries appearing
@ 2006-11-07 22:23 timday at bottlenose dot demon dot co dot uk
  2006-11-07 22:27 ` [Bug inline-asm/29756] " timday at bottlenose dot demon dot co dot uk
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: timday at bottlenose dot demon dot co dot uk @ 2006-11-07 22:23 UTC (permalink / raw)
  To: gcc-bugs

I've been adapting a simple 4-float vector class from some old code to use SSE
via the intrinsic functions.  It seems to be quite hard to stop the generated
assembly code from being diluted by apparently redundant spills of
intermediate results to the stack.

In the assembly produced from the file to be attached, compare the code
generated for matrix44f::transform_good and matrix44f::transform_bad.
The former is 20 instructions and apparently optimal.  However, I only
arrived at it by prodding the latter version of the function (which does
exactly the same thing, expressed more naturally, but compiles to 32
instructions) until the stack temporaries went away.  It would be nice if both
versions of the function generated optimal code, and there doesn't seem to be
any particular reason they shouldn't.

Both versions' assembly contains the same expected numbers of shuffle, multiply
and add instructions; the excess all seems to involve extra stack temporaries.

[I'm not sure what the "triplet" codes on this form are.
I'm using the gcc in Debian Etch; gcc --version shows "gcc (GCC) 4.1.2 20060901
(prerelease) (Debian 4.1.1-13)", and the platform is a Pentium3.  Sorry if
"inline-asm" is a completely inappropriate component to assign this to.]


-- 
           Summary: SSE intrinsics hard to use without redundant temporaries
                    appearing
           Product: gcc
           Version: 4.1.2
            Status: UNCONFIRMED
          Severity: minor
          Priority: P3
         Component: inline-asm
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: timday at bottlenose dot demon dot co dot uk


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756



* [Bug inline-asm/29756] SSE intrinsics hard to use without redundant temporaries appearing
From: timday at bottlenose dot demon dot co dot uk @ 2006-11-07 22:27 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #1 from timday at bottlenose dot demon dot co dot uk  2006-11-07 22:26 -------
Created an attachment (id=12566)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=12566&action=view)
Result of gcc -v -save-temps -S -O3 -march=pentium3 -mfpmath=sse -msse
-fomit-frame-pointer intrin.cpp

This is the .ii file output from
gcc -v -save-temps -S -O3 -march=pentium3 -mfpmath=sse -msse
-fomit-frame-pointer intrin.cpp
Most of it is the result of the .cpp's sole direct include, #include
<xmmintrin.h>, which came immediately before the "class vector4f" declaration.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756



* [Bug middle-end/29756] SSE intrinsics hard to use without redundant temporaries appearing
From: pinskia at gcc dot gnu dot org @ 2006-11-07 22:31 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #2 from pinskia at gcc dot gnu dot org  2006-11-07 22:31 -------
Looks like this is mostly caused by:
  union
  {
    __v4sf vecf;
    __m128 rawf;
    float val[4];
  } _rep;

I will have a look more at this issue later tonight when I get home from work.


-- 

pinskia at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|minor                       |enhancement
          Component|target                      |middle-end
           Keywords|                            |missed-optimization


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756



* [Bug middle-end/29756] SSE intrinsics hard to use without redundant temporaries appearing
From: timday at bottlenose dot demon dot co dot uk @ 2006-11-08 10:01 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #3 from timday at bottlenose dot demon dot co dot uk  2006-11-08 10:01 -------
I've just tried an alternative version (will upload later), replacing the union
with a single
  __v4sf _rep;
and implementing the [] operators using e.g.
  (reinterpret_cast<const float*>(&_rep))[i];
However, the code generated for the two transform implementations remains the
same (still 20 and 32 instructions; I haven't checked the details yet).
Maybe that's not surprising, as it's just moving the problem around.

The big difference between the two methods is perhaps primarily that the bad
one involves a __v4sf->float->__v4sf conversion, while the good one stays in
__v4sf throughout by using the mul_compN methods.  I'll try to prepare a more
concise test case based on the premise that bad handling of __v4sf <-> float is
the real issue.
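
The contrast described above can be sketched roughly as follows. This is not the attached code: the helper names `component`, `mul_comp_via_float` and `mul_comp` are hypothetical, and the templated `mul_comp<N>` is only a guess at what the mul_compN methods do. The first path pulls a float out through a pointer and re-broadcasts it; the second never leaves the vector domain.

```cpp
#include <xmmintrin.h>

// Hypothetical helper mirroring the [] implementation described above:
// read component i of the vector through a float pointer.
inline float component(const __m128& v, int i)
{
    return reinterpret_cast<const float*>(&v)[i];
}

// "Bad" style (assumed shape): __v4sf -> float -> __v4sf.
// The scalar is extracted and then broadcast back into a vector.
inline __m128 mul_comp_via_float(__m128 row, __m128 v, int i)
{
    const float s = component(v, i);
    return _mm_mul_ps(row, _mm_set1_ps(s));
}

// "Good" style (assumed shape): the chosen component is broadcast with a
// shuffle, so the value stays in an SSE register throughout.
template <int N>
inline __m128 mul_comp(__m128 row, __m128 v)
{
    return _mm_mul_ps(row, _mm_shuffle_ps(v, v, _MM_SHUFFLE(N, N, N, N)));
}
```

Both functions compute the same product; whether the compiler removes the stack round trip in the first is exactly the question raised in this report.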


-- 

timday at bottlenose dot demon dot co dot uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |timday at bottlenose dot
                   |                            |demon dot co dot uk


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756



* [Bug middle-end/29756] SSE intrinsics hard to use without redundant temporaries appearing
From: timday at bottlenose dot demon dot co dot uk @ 2006-11-08 22:18 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #4 from timday at bottlenose dot demon dot co dot uk  2006-11-08 22:18 -------
Created an attachment (id=12573)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=12573&action=view)
More concise demonstration of the v4sf->float->v4sf issue.

The attached code (no classes or unions, just a few inline functions), obtained
from
  gcc -v -save-temps -S -O3 -march=pentium3 -mfpmath=sse -msse
-fomit-frame-pointer v4sf.cpp
compiles transform_good to 18 instructions and transform_bad to 33.  However,
it's not really surprising that a round-trip through stack temporaries is
required when pointer arithmetic is being used to extract a float from a
__v4sf.  I've no idea whether it's realistic to hope this could ever be
optimised away.
Alternatively, it would be very nice if the builtin vector types simply
provided a [] operator, or if there were some intrinsics for extracting floats
from a __v4sf.
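
As a possible stopgap for the extraction wish above, a component can be read without pointer arithmetic by shuffling it into the lowest lane and converting that lane with `_mm_cvtss_f32`. This is a minimal sketch assuming a `<xmmintrin.h>` that provides `_mm_cvtss_f32`; the helper name `extract` is hypothetical, and the index has to be a compile-time constant because the shuffle takes an immediate.

```cpp
#include <xmmintrin.h>

// Hypothetical helper: return component N of v as a float.
// The shuffle moves lane N into lane 0 (and every other lane);
// _mm_cvtss_f32 then returns lane 0 as a scalar.
template <int N>
inline float extract(__m128 v)
{
    return _mm_cvtss_f32(_mm_shuffle_ps(v, v, _MM_SHUFFLE(N, N, N, N)));
}
```

For example, with `v = _mm_set_ps(4, 3, 2, 1)` (lanes 1, 2, 3, 4 from lowest to highest), `extract<0>(v)` yields 1 and `extract<3>(v)` yields 4.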

(In the meantime, in the original vector4f class, remaining in the __v4sf
domain by having the const operator[] return a suitably type-wrapped __v4sf
"filled" with the specified component seems to be a promising direction).
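
One way the "filled" operator[] idea might look is sketched below. All names here (`filled4f`, `vector4f_sketch`, `transform_sketch`, `store`) are hypothetical and not taken from the attachments; the switch is there only because the shuffle needs a constant selector, on the assumption that it folds away once operator[] is inlined with a constant index.

```cpp
#include <xmmintrin.h>

// Hypothetical wrapper type: one component replicated into all four lanes.
struct filled4f
{
    explicit filled4f(__m128 r) : rep(r) {}
    __m128 rep;
};

class vector4f_sketch
{
public:
    explicit vector4f_sketch(__m128 r) : _rep(r) {}
    // _mm_set_ps takes the highest lane first, so x ends up in lane 0.
    vector4f_sketch(float x, float y, float z, float w)
        : _rep(_mm_set_ps(w, z, y, x)) {}

    // const operator[] stays in the vector domain: it returns the selected
    // component broadcast across a __m128 rather than a plain float.
    filled4f operator[](int i) const
    {
        switch (i) {
        case 0:  return filled4f(_mm_shuffle_ps(_rep, _rep, _MM_SHUFFLE(0, 0, 0, 0)));
        case 1:  return filled4f(_mm_shuffle_ps(_rep, _rep, _MM_SHUFFLE(1, 1, 1, 1)));
        case 2:  return filled4f(_mm_shuffle_ps(_rep, _rep, _MM_SHUFFLE(2, 2, 2, 2)));
        default: return filled4f(_mm_shuffle_ps(_rep, _rep, _MM_SHUFFLE(3, 3, 3, 3)));
        }
    }

    vector4f_sketch operator*(const filled4f& s) const
    {
        return vector4f_sketch(_mm_mul_ps(_rep, s.rep));
    }

    vector4f_sketch operator+(const vector4f_sketch& o) const
    {
        return vector4f_sketch(_mm_add_ps(_rep, o._rep));
    }

    // Copy the four lanes out, lowest lane first.
    void store(float out[4]) const { _mm_storeu_ps(out, _rep); }

private:
    __m128 _rep;
};

// Matrix-vector product written "naturally": each column is scaled by the
// matching component of v, and the four scaled columns are summed.
inline vector4f_sketch transform_sketch(const vector4f_sketch col[4],
                                        const vector4f_sketch& v)
{
    return col[0] * v[0] + col[1] * v[1] + col[2] * v[2] + col[3] * v[3];
}
```

The natural-looking expression in `transform_sketch` then contains only shuffles, multiplies and adds at the source level, with no float ever appearing as an intermediate.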


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756



* [Bug middle-end/29756] SSE intrinsics hard to use without redundant temporaries appearing
From: pinskia at gcc dot gnu dot org @ 2006-11-14  1:15 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #5 from pinskia at gcc dot gnu dot org  2006-11-14 01:15 -------
This is mostly PR 28367.  There are most likely other issues like some of the
SSE intrinsics not being declared as pure/const.
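
For readers unfamiliar with the pure/const remark: these are GCC function attributes that tell the optimizer a call has no side effects, which allows repeated or unused calls to be merged or dropped. The sketch below uses two hypothetical functions (`lerp_sketch` and `sum_sketch`, not real intrinsics) purely to show the declaration syntax; it says nothing about how any particular intrinsic is declared.

```cpp
// const: the result depends only on the argument values, and the function
// reads no memory other than its arguments.
__attribute__((const)) float lerp_sketch(float a, float b, float t);

// pure: the function may read memory (here through p) but has no side
// effects, so calls with unchanged inputs can be combined.
__attribute__((pure)) float sum_sketch(const float* p, int n);

float lerp_sketch(float a, float b, float t)
{
    return a + (b - a) * t;
}

float sum_sketch(const float* p, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; ++i) {
        s += p[i];
    }
    return s;
}
```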


-- 

pinskia at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  BugsThisDependsOn|                            |28367


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756

