public inbox for gcc-bugs@sourceware.org
* [Bug target/14552] compiled trivial vector intrinsic code is ineffiencent
       [not found] <bug-14552-4523@http.gcc.gnu.org/bugzilla/>
@ 2005-11-21 11:29 ` pluto at agmk dot net
  2005-11-21 11:32 ` [Bug target/14552] compiled trivial vector intrinsic code is inefficient pcarlini at suse dot de
                   ` (22 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: pluto at agmk dot net @ 2005-11-21 11:29 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #16 from pluto at agmk dot net  2005-11-21 11:29 -------
Without Uros' mmx-patch, gcc-4.1.0-20051113 generates amazing code:
(gcc -O3 -march=pentium3 -S -fomit-frame-pointer pr14552.c)

test:   subl    $20, %esp
        movl    w, %eax
        movl    w+4, %edx
        movl    %ebx, 8(%esp)
        movl    %esi, 12(%esp)
        movl    %eax, (%esp)
        movl    %edx, 4(%esp)
        movswl  (%esp),%esi
        movl    %edi, 16(%esp)
        movswl  4(%esp),%ecx
        movswl  2(%esp),%edi
        movswl  6(%esp),%ebx
        addl    %esi, %esi
        addl    %ecx, %ecx
        movzwl  %si, %esi
        sall    $17, %edi
        movzwl  %cx, %ecx
        sall    $17, %ebx
        movl    %edi, %eax
        movl    16(%esp), %edi
        movl    %ebx, %edx
        orl     %esi, %eax
        movl    8(%esp), %ebx
        orl     %ecx, %edx
        movl    12(%esp), %esi
        movl    %eax, w
        movl    %edx, w+4
        movl    w, %eax
        movl    w+4, %edx
        movl    %eax, dw
        movl    %edx, dw+4
        addl    $20, %esp
        ret
        .size   test, .-test
        .comm   dw,8,8
        .comm   w,8,8
        .ident  "GCC: (GNU) 4.1.0 20051113 (experimental)"
        .section        .note.GNU-stack,"",@progbits


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14552



* [Bug target/14552] compiled trivial vector intrinsic code is inefficient
       [not found] <bug-14552-4523@http.gcc.gnu.org/bugzilla/>
  2005-11-21 11:29 ` [Bug target/14552] compiled trivial vector intrinsic code is ineffiencent pluto at agmk dot net
@ 2005-11-21 11:32 ` pcarlini at suse dot de
  2005-11-21 11:34 ` pcarlini at suse dot de
                   ` (21 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: pcarlini at suse dot de @ 2005-11-21 11:32 UTC (permalink / raw)
  To: gcc-bugs



-- 

pcarlini at suse dot de changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         AssignedTo|uros at kss-loka dot si     |unassigned at gcc dot gnu
                   |                            |dot org
             Status|ASSIGNED                    |NEW
            Summary|compiled trivial vector     |compiled trivial vector
                   |intrinsic code is           |intrinsic code is
                   |ineffiencent                |inefficient



* [Bug target/14552] compiled trivial vector intrinsic code is inefficient
       [not found] <bug-14552-4523@http.gcc.gnu.org/bugzilla/>
  2005-11-21 11:29 ` [Bug target/14552] compiled trivial vector intrinsic code is ineffiencent pluto at agmk dot net
  2005-11-21 11:32 ` [Bug target/14552] compiled trivial vector intrinsic code is inefficient pcarlini at suse dot de
@ 2005-11-21 11:34 ` pcarlini at suse dot de
  2005-11-21 15:05 ` pluto at agmk dot net
                   ` (20 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: pcarlini at suse dot de @ 2005-11-21 11:34 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #17 from pcarlini at suse dot de  2005-11-21 11:34 -------
Sorry.


-- 

pcarlini at suse dot de changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         AssignedTo|unassigned at gcc dot gnu   |uros at kss-loka dot si
                   |dot org                     |
             Status|NEW                         |ASSIGNED



* [Bug target/14552] compiled trivial vector intrinsic code is inefficient
       [not found] <bug-14552-4523@http.gcc.gnu.org/bugzilla/>
                   ` (2 preceding siblings ...)
  2005-11-21 11:34 ` pcarlini at suse dot de
@ 2005-11-21 15:05 ` pluto at agmk dot net
  2005-11-21 15:09 ` pinskia at gcc dot gnu dot org
                   ` (19 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: pluto at agmk dot net @ 2005-11-21 15:05 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #18 from pluto at agmk dot net  2005-11-21 15:05 -------
gcc-3.3.6 produces better code:

test:   movq    w, %mm1
        psllw   $1, %mm1
        movq    %mm1, w
        movq    w, %mm1
        movq    %mm1, dw
        ret

        .comm   dw,8,8
        .comm   w,8,8


can we classify this as a code size regression?



* [Bug target/14552] compiled trivial vector intrinsic code is inefficient
       [not found] <bug-14552-4523@http.gcc.gnu.org/bugzilla/>
                   ` (3 preceding siblings ...)
  2005-11-21 15:05 ` pluto at agmk dot net
@ 2005-11-21 15:09 ` pinskia at gcc dot gnu dot org
  2005-11-21 18:38 ` pluto at agmk dot net
                   ` (18 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: pinskia at gcc dot gnu dot org @ 2005-11-21 15:09 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #19 from pinskia at gcc dot gnu dot org  2005-11-21 15:09 -------
(In reply to comment #18)
> can we classify this as a code size regression?

No because 3.3.x was also wrong in the sense it did not emit an emms.



* [Bug target/14552] compiled trivial vector intrinsic code is inefficient
       [not found] <bug-14552-4523@http.gcc.gnu.org/bugzilla/>
                   ` (4 preceding siblings ...)
  2005-11-21 15:09 ` pinskia at gcc dot gnu dot org
@ 2005-11-21 18:38 ` pluto at agmk dot net
  2005-12-01  0:52 ` pluto at agmk dot net
                   ` (17 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: pluto at agmk dot net @ 2005-11-21 18:38 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #20 from pluto at agmk dot net  2005-11-21 18:38 -------
(In reply to comment #19)
> (In reply to comment #18)
> > can we classify this as a code size regression?
> 
> No because 3.3.x was also wrong in the sense it did not emit an emms.

ok.

gcc 4.1.0/20051113 with x87/mmx mode switch patch produces:

test:   movq    w, %mm0
        paddw   %mm0, %mm0
        movq    %mm0, w
        movl    w, %eax
        movl    w+4, %edx
        movl    %eax, dw
        movl    %edx, dw+4
        emms
        ret

        .comm   dw,8,8
        .comm   w,8,8

It isn't optimal, but it is correct (it emits the emms opcode) and smaller than
the unpatched 4.1 output.



* [Bug target/14552] compiled trivial vector intrinsic code is inefficient
       [not found] <bug-14552-4523@http.gcc.gnu.org/bugzilla/>
                   ` (5 preceding siblings ...)
  2005-11-21 18:38 ` pluto at agmk dot net
@ 2005-12-01  0:52 ` pluto at agmk dot net
  2008-03-08  7:30 ` ubizjak at gmail dot com
                   ` (16 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: pluto at agmk dot net @ 2005-12-01  0:52 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #21 from pluto at agmk dot net  2005-12-01 00:52 -------
I'm wondering whether it is possible to implement transformations
of vector arithmetic into vector builtins.

e.g.

#include <mmintrin.h>
__v8qi foo(const __v8qi x, const __v8qi y) { return x + y; }
__v8qi bar(const __v8qi x, const __v8qi y) { return _mm_add_pi8(x, y); }

I expect the compiler to produce the same code for both functions,
but it produces insane code for foo() :/

foo (x, y)
{
  unsigned int D.2377;
  unsigned int D.2376;
  unsigned int D.2369;
  unsigned int D.2368;
<bb 0>:
  D.2368 = BIT_FIELD_REF <x, 32, 0>;
  D.2369 = BIT_FIELD_REF <y, 32, 0>;
  D.2376 = BIT_FIELD_REF <x, 32, 32>;
  D.2377 = BIT_FIELD_REF <y, 32, 32>;
  return VIEW_CONVERT_EXPR<__v8qi>(
    {(D.2368 ^ D.2369) & 080808080 ^ (D.2369 & 2139062143) +
     (D.2368 & 2139062143),
     (D.2376 ^ D.2377) & 080808080 ^ (D.2377 & 2139062143) +
     (D.2376 & 2139062143)});
}

bar (x, y)
{
  vector signed char D.2448;
<bb 0>:
  D.2448 = __builtin_ia32_paddb (
    VIEW_CONVERT_EXPR<vector signed char>(VIEW_CONVERT_EXPR<__m64>(x)),
    VIEW_CONVERT_EXPR<vector signed char>(VIEW_CONVERT_EXPR<__m64>(y)));
  return VIEW_CONVERT_EXPR<__v8qi>(VIEW_CONVERT_EXPR<vector int>(D.2448));
}

# gcc -O2 -march=pentium3 -fomit-frame-pointer -mregparm=3

foo:
        subl    $44, %esp
        movq    %mm0, 24(%esp)
        movl    %ebx, 32(%esp)
        movl    24(%esp), %ebx
        movl    %esi, 36(%esp)
        movl    28(%esp), %esi
        movq    %mm1, 24(%esp)
        movl    24(%esp), %eax
        movl    28(%esp), %edx
        movl    %edi, 40(%esp)
        movl    %ebx, %edi
        andl    $2139062143, %edi
        movl    %eax, %ecx
        xorl    %eax, %ebx
        andl    $2139062143, %ecx
        movl    %esi, %eax
        addl    %edi, %ecx
        xorl    %edx, %eax
        movl    40(%esp), %edi
        andl    $2139062143, %esi
        andl    $-2139062144, %ebx
        andl    $2139062143, %edx
        xorl    %ecx, %ebx
        addl    %esi, %edx
        andl    $-2139062144, %eax
        movl    36(%esp), %esi
        movl    %ebx, 20(%esp)
        xorl    %edx, %eax
        movl    32(%esp), %ebx
        movss   20(%esp), %xmm0
        movl    %eax, 20(%esp)
        movss   20(%esp), %xmm1
        unpcklps        %xmm1, %xmm0
        movlps  %xmm0, 8(%esp)
        movl    8(%esp), %eax
        movl    12(%esp), %edx
        movl    %eax, (%esp)
        movl    %edx, 4(%esp)
        movq    (%esp), %mm1
        addl    $44, %esp
        movq    %mm1, %mm0
        ret

bar:
        paddb   %mm1, %mm0
        ret
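
Editorial note: the expression in the foo() dump above is a word-at-a-time
(SWAR) byte addition, applied to each 32-bit half of the vector. The constant
2139062143 is 0x7f7f7f7f, and -2139062144 in the assembly is 0x80808080. The
following is a minimal sketch of what that lowered code computes for one
32-bit half, assuming wrapping per-byte addition as paddb performs. The
function names are hypothetical and do not come from GCC.

```c
#include <assert.h>
#include <stdint.h>

/* Add four packed 8-bit lanes held in one 32-bit word.
   Mirrors the lowered expression in the foo() dump:
   ((x ^ y) & 0x80808080) ^ ((x & 0x7f7f7f7f) + (y & 0x7f7f7f7f)). */
static uint32_t swar_add_u8x4(uint32_t x, uint32_t y)
{
    /* Add the low 7 bits of every lane. A carry can reach bit 7 of its
       own lane but can never spill into the neighbouring lane. */
    uint32_t low = (x & 0x7f7f7f7fu) + (y & 0x7f7f7f7fu);
    /* Fold in the top bit of each lane with XOR, i.e. add without
       producing a carry out of the lane. */
    return ((x ^ y) & 0x80808080u) ^ low;
}

/* Reference version: add lane by lane with explicit wraparound. */
static uint32_t ref_add_u8x4(uint32_t x, uint32_t y)
{
    uint32_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t a = (x >> (8 * i)) & 0xffu;
        uint32_t b = (y >> (8 * i)) & 0xffu;
        r |= ((a + b) & 0xffu) << (8 * i);
    }
    return r;
}

/* Hypothetical self-check comparing both versions on a few inputs. */
static void swar_add_self_check(void)
{
    assert(swar_add_u8x4(0x01ff807fu, 0x01018001u) == 0x02000080u);
    assert(swar_add_u8x4(0xffffffffu, 0x01010101u) == 0u);
    assert(swar_add_u8x4(0x12345678u, 0x9abcdef0u) ==
           ref_add_u8x4(0x12345678u, 0x9abcdef0u));
}
```

This is why foo() needs a dozen integer instructions plus spills, while bar()
is a single paddb: the generic lowering emulates the byte lanes in
general-purpose registers.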



* [Bug target/14552] compiled trivial vector intrinsic code is inefficient
       [not found] <bug-14552-4523@http.gcc.gnu.org/bugzilla/>
                   ` (6 preceding siblings ...)
  2005-12-01  0:52 ` pluto at agmk dot net
@ 2008-03-08  7:30 ` ubizjak at gmail dot com
  2008-03-19 10:46 ` ubizjak at gmail dot com
                   ` (15 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: ubizjak at gmail dot com @ 2008-03-08  7:30 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #22 from ubizjak at gmail dot com  2008-03-08 07:29 -------
*** Bug 25277 has been marked as a duplicate of this bug. ***



* [Bug target/14552] compiled trivial vector intrinsic code is inefficient
       [not found] <bug-14552-4523@http.gcc.gnu.org/bugzilla/>
                   ` (7 preceding siblings ...)
  2008-03-08  7:30 ` ubizjak at gmail dot com
@ 2008-03-19 10:46 ` ubizjak at gmail dot com
  2008-03-19 19:22 ` astrange at ithinksw dot com
                   ` (14 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: ubizjak at gmail dot com @ 2008-03-19 10:46 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #23 from ubizjak at gmail dot com  2008-03-19 10:45 -------
As said in PR 19161:

The LCM infrastructure doesn't support mode switching in a way that would be
usable for emms. Additionally, MANY problems are expected when sharing
x87 and MMX registers (e.g. handling of uninitialized x87 registers at the
beginning of the function; this is the reason we don't implement the x87
register passing ABI).

Automatic MMX vectorization is not a particularly useful feature nowadays (we
have SSE, which works quite well here). Thanks to recent changes in the MMX
register allocation area, excellent code is produced using MMX intrinsics, so
I'm closing this bug as WONTFIX.

Also, auto-vectorization would produce either MMX or SSE code, but not both of
them:

#define UNITS_PER_SIMD_WORD (TARGET_SSE ? 16 : UNITS_PER_WORD)


-- 

ubizjak at gmail dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |WONTFIX



* [Bug target/14552] compiled trivial vector intrinsic code is inefficient
       [not found] <bug-14552-4523@http.gcc.gnu.org/bugzilla/>
                   ` (8 preceding siblings ...)
  2008-03-19 10:46 ` ubizjak at gmail dot com
@ 2008-03-19 19:22 ` astrange at ithinksw dot com
  2008-03-19 19:39 ` astrange at ithinksw dot com
                   ` (13 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: astrange at ithinksw dot com @ 2008-03-19 19:22 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #24 from astrange at ithinksw dot com  2008-03-19 19:21 -------
For
typedef short mmxw  __attribute__ ((mode(V4HI)));
typedef int   mmxdw __attribute__ ((mode(V2SI)));

mmxdw dw;
mmxw w;

void test(){
    w+=w;
    dw= (mmxdw)w;
}

void test2(){
        w= __builtin_ia32_paddw(w,w);
        dw= (mmxdw)w;
}

gcc SVN generates the expected code for test2(), but not test(). I don't think
using += on an MMX variable should count as autovectorization - if you're doing
either you should know where to put emms yourself.

For test() we get:
        subl    $28, %esp
        movq    _w, %mm0
        movq    %mm0, 8(%esp)
        movzwl  8(%esp), %eax
        movzwl  10(%esp), %edx
        movzwl  12(%esp), %ecx
        addl    %eax, %eax
        addl    %edx, %edx
        movw    %ax, _w
        movw    %dx, _w+2
        movzwl  14(%esp), %eax
        addl    %ecx, %ecx
        addl    %eax, %eax
        movw    %cx, _w+4
        movw    %ax, _w+6
        movq    _w, %mm0
        movq    %mm0, _dw
        addl    $28, %esp
        ret

which touches mm0 (requiring emms, I think) but does not use paddw (so it is
slow and silly-looking).
LLVM generates expected code for both of them.



* [Bug target/14552] compiled trivial vector intrinsic code is inefficient
       [not found] <bug-14552-4523@http.gcc.gnu.org/bugzilla/>
                   ` (9 preceding siblings ...)
  2008-03-19 19:22 ` astrange at ithinksw dot com
@ 2008-03-19 19:39 ` astrange at ithinksw dot com
  2008-03-19 23:39 ` uros at gcc dot gnu dot org
                   ` (12 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: astrange at ithinksw dot com @ 2008-03-19 19:39 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #25 from astrange at ithinksw dot com  2008-03-19 19:39 -------
Actually the first generates-
        subl    $12, %esp
        movq    _w, %mm0
        paddw   %mm0, %mm0
        movq    %mm0, _w
        movq    _w, %mm0
        movq    %mm0, _dw
        addl    $12, %esp
        ret

which is better than the code in the original report but still has a useless
store/reload.



* [Bug target/14552] compiled trivial vector intrinsic code is inefficient
       [not found] <bug-14552-4523@http.gcc.gnu.org/bugzilla/>
                   ` (10 preceding siblings ...)
  2008-03-19 19:39 ` astrange at ithinksw dot com
@ 2008-03-19 23:39 ` uros at gcc dot gnu dot org
  2008-03-19 23:47 ` ubizjak at gmail dot com
                   ` (11 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: uros at gcc dot gnu dot org @ 2008-03-19 23:39 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #26 from uros at gcc dot gnu dot org  2008-03-19 23:39 -------
Subject: Bug 14552

Author: uros
Date: Wed Mar 19 23:38:35 2008
New Revision: 133354

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=133354
Log:
        PR target/14552
        * config/i386/mmx.md ("*mov<mode>_internal_rex64"): Adjust register
        allocator preferences for "y" and "r" class registers.
        ("*mov<mode>_internal"): Ditto.
        ("*movv2sf_internal_rex64"): Ditto.
        ("*movv2sf_internal"): Ditto.

testsuite/ChangeLog:

        PR target/14552
        * gcc.target/i386/pr14552.c: New test.


Added:
    trunk/gcc/testsuite/gcc.target/i386/pr14552.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/mmx.md
    trunk/gcc/testsuite/ChangeLog



* [Bug target/14552] compiled trivial vector intrinsic code is inefficient
       [not found] <bug-14552-4523@http.gcc.gnu.org/bugzilla/>
                   ` (11 preceding siblings ...)
  2008-03-19 23:39 ` uros at gcc dot gnu dot org
@ 2008-03-19 23:47 ` ubizjak at gmail dot com
  2008-03-19 23:50 ` pinskia at gcc dot gnu dot org
                   ` (10 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: ubizjak at gmail dot com @ 2008-03-19 23:47 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #27 from ubizjak at gmail dot com  2008-03-19 23:46 -------
(In reply to comment #25)
> Actually the first generates-
>         subl    $12, %esp
>         movq    _w, %mm0
>         paddw   %mm0, %mm0
>         movq    %mm0, _w
>         movq    _w, %mm0
>         movq    %mm0, _dw
>         addl    $12, %esp
>         ret
> 
> which is better than the code in the original report but still has a useless
> store/reload.

The store is not useless. The reload from "_w" is how gcc handles double stores
nowadays and is not MMX-specific. It looks like some pass forgot to check where
the value came from.



* [Bug target/14552] compiled trivial vector intrinsic code is inefficient
       [not found] <bug-14552-4523@http.gcc.gnu.org/bugzilla/>
                   ` (12 preceding siblings ...)
  2008-03-19 23:47 ` ubizjak at gmail dot com
@ 2008-03-19 23:50 ` pinskia at gcc dot gnu dot org
  2008-03-20  0:02 ` ubizjak at gmail dot com
                   ` (9 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: pinskia at gcc dot gnu dot org @ 2008-03-19 23:50 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #28 from pinskia at gcc dot gnu dot org  2008-03-19 23:49 -------
(In reply to comment #27)
> The store is not useless. Reload from "_w" is how gcc handles double stores
> nowadays and is not mmx specific. It looks that some pass forgot to check where
> the value came from.

Do you happen to know if there are two different modes at work here?  If so
there are patches which fix this up in DSE and post-reload CSE.



* [Bug target/14552] compiled trivial vector intrinsic code is inefficient
       [not found] <bug-14552-4523@http.gcc.gnu.org/bugzilla/>
                   ` (13 preceding siblings ...)
  2008-03-19 23:50 ` pinskia at gcc dot gnu dot org
@ 2008-03-20  0:02 ` ubizjak at gmail dot com
  2008-03-20  0:04 ` ubizjak at gmail dot com
                   ` (8 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: ubizjak at gmail dot com @ 2008-03-20  0:02 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #29 from ubizjak at gmail dot com  2008-03-20 00:01 -------
Now we generate:

-m32 -mmmx -msse2:

test:
        subl    $20, %esp
        movl    w, %eax
        movl    w+4, %edx
        movl    %ebx, 12(%esp)
        movl    %esi, 16(%esp)
        movl    %eax, (%esp)
        movzwl  (%esp), %ecx
        movl    %edx, 4(%esp)
        movzwl  2(%esp), %ebx
        movzwl  4(%esp), %esi
        movzwl  6(%esp), %eax
        addl    %ecx, %ecx
        addl    %ebx, %ebx
        addl    %esi, %esi
        addl    %eax, %eax
        movw    %bx, w+2
        movl    12(%esp), %ebx
        movw    %si, w+4
        movl    16(%esp), %esi
        movw    %ax, w+6
        movl    w+4, %edx
        movw    %cx, w
        movl    w, %eax
        movl    %edx, dw+4
        movl    %eax, dw
        addl    $20, %esp
        ret

-m64 -mmmx -msse2:

test:
        movabsq $9223231297218904063, %rax
        andq    w(%rip), %rax
        addq    %rax, %rax
        movq    %rax, w(%rip)
        movq    w(%rip), %rax
        movq    %rax, dw(%rip)
        ret

The issue with useless reload is PR 12395, as mentioned in Comment #5.
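
Editorial note: the -m64 sequence above is itself a SWAR trick. The constant
9223231297218904063 is 0x7fff7fff7fff7fff, so the code clears the top bit of
each 16-bit lane and then adds the register to itself, which doubles all four
lanes without carries crossing lane boundaries. A minimal sketch in C, with
hypothetical function names that are not taken from GCC:

```c
#include <assert.h>
#include <stdint.h>

/* Double four packed 16-bit lanes, as the movabsq/andq/addq sequence does. */
static uint64_t swar_double_u16x4(uint64_t x)
{
    /* Clearing bit 15 of each lane guarantees that the addition below
       cannot carry into the next lane. */
    uint64_t masked = x & 0x7fff7fff7fff7fffULL; /* 9223231297218904063 */
    return masked + masked;
}

/* Reference version: double lane by lane with explicit wraparound. */
static uint64_t ref_double_u16x4(uint64_t x)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t lane = (x >> (16 * i)) & 0xffffu;
        r |= ((lane + lane) & 0xffffu) << (16 * i);
    }
    return r;
}

/* Hypothetical self-check comparing both versions. */
static void swar_double_self_check(void)
{
    assert(0x7fff7fff7fff7fffULL == 9223231297218904063ULL);
    assert(swar_double_u16x4(0x80014000ffff0001ULL) == 0x00028000fffe0002ULL);
    assert(swar_double_u16x4(0x123456789abcdef0ULL) ==
           ref_double_u16x4(0x123456789abcdef0ULL));
}
```

The result matches paddw %mm0, %mm0 lane for lane; the difference is only in
which register file does the work.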



* [Bug target/14552] compiled trivial vector intrinsic code is inefficient
       [not found] <bug-14552-4523@http.gcc.gnu.org/bugzilla/>
                   ` (14 preceding siblings ...)
  2008-03-20  0:02 ` ubizjak at gmail dot com
@ 2008-03-20  0:04 ` ubizjak at gmail dot com
  2008-03-20  0:23   ` Andrew Pinski
  2008-03-20  0:24 ` pinskia at gmail dot com
                   ` (7 subsequent siblings)
  23 siblings, 1 reply; 25+ messages in thread
From: ubizjak at gmail dot com @ 2008-03-20  0:04 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #30 from ubizjak at gmail dot com  2008-03-20 00:04 -------
(In reply to comment #28)
> (In reply to comment #27)
> > The store is not useless. Reload from "_w" is how gcc handles double stores
> > nowadays and is not mmx specific. It looks that some pass forgot to check where
> > the value came from.
> 
> Do you happen to know if there are two different modes at work here?  If so
> there are patches which fix this up in DSE and post-reload CSE.

Yes, from comment #24 (slightly changed):

typedef short mmxw  __attribute__ ((vector_size (8)));
typedef int   mmxdw __attribute__ ((vector_size (8)));

mmxdw dw;
mmxw w;

so, we have V4HI and V2SI.
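
Editorial note: a minimal sketch of the two modes in play, using the same
typedefs with the GCC/Clang generic vector extension. It assumes a 16-bit
short and a 32-bit int; the helper names are hypothetical. The cast between
the two vector types reinterprets the same 64 bits and converts no values,
which is why the tree dumps show a VIEW_CONVERT_EXPR.

```c
#include <assert.h>
#include <string.h>

typedef short mmxw  __attribute__ ((vector_size (8)));  /* V4HI: 4 x 16-bit */
typedef int   mmxdw __attribute__ ((vector_size (8)));  /* V2SI: 2 x 32-bit */

/* Same operations as test(): element-wise add, then a V4HI -> V2SI cast.
   Arrays are used at the interface so no vector is passed by value. */
static void double_and_reinterpret(const short in[4], int out[2])
{
    mmxw w;
    mmxdw dw;
    memcpy(&w, in, sizeof w);
    w += w;            /* four independent 16-bit additions */
    dw = (mmxdw)w;     /* same bits, now viewed as two 32-bit lanes */
    memcpy(out, &dw, sizeof dw);
}

/* Hypothetical self-check: the bytes of dw equal the bytes of the doubled w. */
static void vector_modes_self_check(void)
{
    const short in[4] = { 1, 2, 3, 4 };
    const short doubled[4] = { 2, 4, 6, 8 };
    int out[2];
    double_and_reinterpret(in, out);
    assert(sizeof(mmxw) == 8 && sizeof(mmxdw) == 8);
    assert(memcmp(out, doubled, sizeof doubled) == 0);
}
```

Comparing bytes instead of integer values keeps the check independent of
endianness.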



* Re: [Bug target/14552] compiled trivial vector intrinsic code is inefficient
  2008-03-20  0:04 ` ubizjak at gmail dot com
@ 2008-03-20  0:23   ` Andrew Pinski
  0 siblings, 0 replies; 25+ messages in thread
From: Andrew Pinski @ 2008-03-20  0:23 UTC (permalink / raw)
  To: gcc-bugzilla; +Cc: gcc-bugs

See pr 33790.

Sent from my iPhone

On Mar 19, 2008, at 17:04, "ubizjak at gmail dot com" <gcc-bugzilla@gcc.gnu.org> wrote:

>
>
> ------- Comment #30 from ubizjak at gmail dot com  2008-03-20 00:04 -------
> (In reply to comment #28)
>> (In reply to comment #27)
>>> The store is not useless. Reload from "_w" is how gcc handles double stores
>>> nowadays and is not mmx specific. It looks that some pass forgot to check where
>>> the value came from.
>>
>> Do you happen to know if there are two different modes at work here?  If so
>> there are patches which fix this up in DSE and post-reload CSE.
>
> Yes, from comment #24 (slightly changed):
>
> typedef short mmxw  __attribute__ ((vector_size (8)));
> typedef int   mmxdw __attribute__ ((vector_size (8)));
>
> mmxdw dw;
> mmxw w;
>
> so, we have V4HI and V2SI.
>
>
> -- 
>
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14552
>



* [Bug target/14552] compiled trivial vector intrinsic code is inefficient
       [not found] <bug-14552-4523@http.gcc.gnu.org/bugzilla/>
                   ` (15 preceding siblings ...)
  2008-03-20  0:04 ` ubizjak at gmail dot com
@ 2008-03-20  0:24 ` pinskia at gmail dot com
  2008-03-20  0:40 ` astrange at ithinksw dot com
                   ` (6 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: pinskia at gmail dot com @ 2008-03-20  0:24 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #31 from pinskia at gmail dot com  2008-03-20 00:23 -------
Subject: Re:  compiled trivial vector intrinsic code is inefficient

See pr 33790.

Sent from my iPhone

On Mar 19, 2008, at 17:04, "ubizjak at gmail dot com" <gcc-bugzilla@gcc.gnu.org> wrote:

>
>
> ------- Comment #30 from ubizjak at gmail dot com  2008-03-20 00:04 -------
> (In reply to comment #28)
>> (In reply to comment #27)
>>> The store is not useless. Reload from "_w" is how gcc handles double stores
>>> nowadays and is not mmx specific. It looks that some pass forgot to check where
>>> the value came from.
>>
>> Do you happen to know if there are two different modes at work here?  If so
>> there are patches which fix this up in DSE and post-reload CSE.
>
> Yes, from comment #24 (slightly changed):
>
> typedef short mmxw  __attribute__ ((vector_size (8)));
> typedef int   mmxdw __attribute__ ((vector_size (8)));
>
> mmxdw dw;
> mmxw w;
>
> so, we have V4HI and V2SI.
>
>
> -- 
>
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14552
>



* [Bug target/14552] compiled trivial vector intrinsic code is inefficient
       [not found] <bug-14552-4523@http.gcc.gnu.org/bugzilla/>
                   ` (16 preceding siblings ...)
  2008-03-20  0:24 ` pinskia at gmail dot com
@ 2008-03-20  0:40 ` astrange at ithinksw dot com
  2008-03-20  1:37 ` michaelni at gmx dot at
                   ` (5 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: astrange at ithinksw dot com @ 2008-03-20  0:40 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #32 from astrange at ithinksw dot com  2008-03-20 00:39 -------
This is missed on trees:
mmxdw dw;
mmxw w;

void test2(){
        w= __builtin_ia32_paddw(w,w); dw= (mmxdw)w;
}

void test3(){
        mmxw w2= __builtin_ia32_paddw(w,w); dw= (mmxdw)w2;
}

test2 ()
{
  vector short int w.4;
  vector short int w.3;

<bb 2>:
  w.3 = w;
  w.4 = __builtin_ia32_paddw (w.3, w.3);
  w = w.4;
  dw = VIEW_CONVERT_EXPR<vector int>(w);
  return;
}

test3 ()
{
  mmxw w2;
  vector short int w.6;

<bb 2>:
  w.6 = w;
  w2 = __builtin_ia32_paddw (w.6, w.6);
  dw = VIEW_CONVERT_EXPR<vector int>(w2);
  return;
}



* [Bug target/14552] compiled trivial vector intrinsic code is inefficient
       [not found] <bug-14552-4523@http.gcc.gnu.org/bugzilla/>
                   ` (17 preceding siblings ...)
  2008-03-20  0:40 ` astrange at ithinksw dot com
@ 2008-03-20  1:37 ` michaelni at gmx dot at
  2008-03-20  9:50 ` ubizjak at gmail dot com
                   ` (4 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: michaelni at gmx dot at @ 2008-03-20  1:37 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #33 from michaelni at gmx dot at  2008-03-20 01:37 -------
Subject: Re:  compiled trivial vector intrinsic code is
        inefficient

On Wed, Mar 19, 2008 at 11:39:18PM -0000, uros at gcc dot gnu dot org wrote:
> 
> 
> ------- Comment #26 from uros at gcc dot gnu dot org  2008-03-19 23:39 -------
> Subject: Bug 14552
[...]
>         * gcc.target/i386/pr14552.c: New test.
> 
> 
> Added:
>     trunk/gcc/testsuite/gcc.target/i386/pr14552.c

Thanks, I was already scared that the inversely proportional relation between
version number and performance, which has been so nicely followed since 2.95,
would stop.
Adding a test to the testsuite to ensure that MMX intrinsics don't use
MMX registers is, well, just brilliant.
I am already eagerly awaiting the testcase which will check that floating
point code doesn't use the FPU; I assume that will happen in gcc 5.0?

Anyway, I am glad ffmpeg compiles fine under icc.

[...]



* [Bug target/14552] compiled trivial vector intrinsic code is inefficient
       [not found] <bug-14552-4523@http.gcc.gnu.org/bugzilla/>
                   ` (18 preceding siblings ...)
  2008-03-20  1:37 ` michaelni at gmx dot at
@ 2008-03-20  9:50 ` ubizjak at gmail dot com
  2008-03-20 17:18 ` michaelni at gmx dot at
                   ` (3 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: ubizjak at gmail dot com @ 2008-03-20  9:50 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #34 from ubizjak at gmail dot com  2008-03-20 09:49 -------
(In reply to comment #33)

> Anyway iam glad ffmpeg compiles fine under icc.

Me too. Now you will troll in their support lists.



* [Bug target/14552] compiled trivial vector intrinsic code is inefficient
       [not found] <bug-14552-4523@http.gcc.gnu.org/bugzilla/>
                   ` (19 preceding siblings ...)
  2008-03-20  9:50 ` ubizjak at gmail dot com
@ 2008-03-20 17:18 ` michaelni at gmx dot at
  2008-03-21 10:34 ` ubizjak at gmail dot com
                   ` (2 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: michaelni at gmx dot at @ 2008-03-20 17:18 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #35 from michaelni at gmx dot at  2008-03-20 17:18 -------
Subject: Re:  compiled trivial vector intrinsic code is
        inefficient

On Thu, Mar 20, 2008 at 09:49:22AM -0000, ubizjak at gmail dot com wrote:
> 
> 
> ------- Comment #34 from ubizjak at gmail dot com  2008-03-20 09:49 -------
> (In reply to comment #33)
> 
> > Anyway iam glad ffmpeg compiles fine under icc.
> 
> Me to. Now you will troll in their support lists.

No, truth be told, I don't plan to switch to icc yet. Somehow I do prefer to use
free tools. Of course, if the gap becomes too big, I as well as most others
will switch to icc ...
Also, ffmpeg uses almost entirely asm() instead of intrinsics, so this alone is
not so much a problem for ffmpeg as it is for others who followed the
recommendation that "intrinsics are better than asm".

About trolling: well, I made no attempt to reply politely and diplomatically, no.
But "solving" a "problem" in some use case by dropping support for that use
case is kind of extreme.

The way I see it is that
* It's nontrivial to place emms optimally and automatically
* there needs to be an emms between MMX code and FPU code

The solutions to this would be any one of
A. let the programmer place emms, as has been done in the past
B. don't support MMX at all
C. don't support the x87 FPU at all
D. place emms after every bunch of MMX instructions
E. solve a quite nontrivial problem and place emms optimally

The solution which has been selected apparently is B. Why was that chosen
instead of, let's say, A?

If I write SIMD code, then I know that I need an emms on x86. It's
trivial for the programmer to place it optimally.
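
Editorial note: option A above can be illustrated with the standard
<mmintrin.h> intrinsics, where _mm_empty() emits emms. This is a minimal
sketch; the function names are hypothetical, and the scalar fallback exists
only so that the example also builds on targets without MMX. On x86-64 the
floating-point part normally uses SSE and not x87, so the emms matters most
for 32-bit x87 code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__MMX__)
#include <mmintrin.h>
#endif

/* Double four 16-bit lanes in place, then return their mean as a double.
   The programmer places emms by hand, once, after the last MMX use. */
static double double_lanes_then_mean(int16_t v[4])
{
#if defined(__MMX__)
    __m64 x;
    memcpy(&x, v, sizeof x);
    x = _mm_add_pi16(x, x);   /* paddw: v[i] + v[i] with wraparound */
    memcpy(v, &x, sizeof x);
    _mm_empty();              /* emms: the MMX/x87 register file is FP-clean again */
#else
    /* Scalar fallback for non-MMX targets (illustration only). */
    for (int i = 0; i < 4; i++)
        v[i] = (int16_t)(uint16_t)((uint16_t)v[i] * 2u);
#endif
    /* Floating-point code runs only after the emms above. */
    return (v[0] + v[1] + v[2] + v[3]) / 4.0;
}

/* Hypothetical self-check. */
static void emms_placement_self_check(void)
{
    int16_t v[4] = { 1, 2, 3, -4 };
    double mean = double_lanes_then_mean(v);
    assert(v[0] == 2 && v[1] == 4 && v[2] == 6 && v[3] == -8);
    assert(mean == 1.0);
}
```

In a real kernel, the emms would sit after the whole MMX loop and not after
each operation, which is the placement the comment calls trivial for the
programmer.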

[...]



* [Bug target/14552] compiled trivial vector intrinsic code is inefficient
       [not found] <bug-14552-4523@http.gcc.gnu.org/bugzilla/>
                   ` (20 preceding siblings ...)
  2008-03-20 17:18 ` michaelni at gmx dot at
@ 2008-03-21 10:34 ` ubizjak at gmail dot com
  2008-03-22  2:39 ` michaelni at gmx dot at
  2008-04-21  8:21 ` ubizjak at gmail dot com
  23 siblings, 0 replies; 25+ messages in thread
From: ubizjak at gmail dot com @ 2008-03-21 10:34 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #36 from ubizjak at gmail dot com  2008-03-21 10:33 -------
(In reply to comment #35)

> Also, ffmpeg uses asm() almost entirely instead of intrinsics, so this alone is
> not so much a problem for ffmpeg as it is for others who followed the
> recommendation that "intrinsics are better than asm".
> 
> About trolling: well, I made no attempt to reply politely and diplomatically, no.
> But "solving" a "problem" in some use case by dropping support for that use
> case is kind of extreme.
> 
> The way I see it:
> * It's non-trivial to place emms optimally and automatically
> * there needs to be an emms between MMX code and FPU code
> 
> The solution to this would be any one of
> A. let the programmer place emms, as has been done in the past
> B. don't support MMX at all
> C. don't support the x87 FPU at all
> D. place emms after every bunch of MMX instructions
> E. solve a quite non-trivial problem and place emms optimally
> 
> The solution which has apparently been selected is B. Why was that chosen
> instead of, let's say, A?
> 
> If I write SIMD code, then I know that I need an emms on x86. It's
> trivial for the programmer to place it optimally.

I don't know where you get the idea that MMX support was dropped in any way. I
won't engage in a discussion about autovectorisation, intrinsics, builtins,
generic vectorisation, etc, etc with you, but please look at PR 21395 to see how
a performance PR should be filed. The MMX code in that PR is _far_ from trivial,
but since it is well written using intrinsic instructions, it enables
a jaw-dropping performance increase that is simply not possible when ASM blocks
are used.

Now, I'm sure that you have your numbers ready to back up your claims from
Comment #33 about performance of generated code, and I challenge you to beat
the performance of gcc-4.4 generated code with hand-crafted assembly using the
example of PR 21395.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14552



* [Bug target/14552] compiled trivial vector intrinsic code is inefficient
       [not found] <bug-14552-4523@http.gcc.gnu.org/bugzilla/>
                   ` (21 preceding siblings ...)
  2008-03-21 10:34 ` ubizjak at gmail dot com
@ 2008-03-22  2:39 ` michaelni at gmx dot at
  2008-04-21  8:21 ` ubizjak at gmail dot com
  23 siblings, 0 replies; 25+ messages in thread
From: michaelni at gmx dot at @ 2008-03-22  2:39 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #37 from michaelni at gmx dot at  2008-03-22 02:39 -------
Subject: Re:  compiled trivial vector intrinsic code is
        inefficient

On Fri, Mar 21, 2008 at 10:34:00AM -0000, ubizjak at gmail dot com wrote:
> 
> 
> ------- Comment #36 from ubizjak at gmail dot com  2008-03-21 10:33 -------
> (In reply to comment #35)
> 
> > Also, ffmpeg uses asm() almost entirely instead of intrinsics, so this alone is
> > not so much a problem for ffmpeg as it is for others who followed the
> > recommendation that "intrinsics are better than asm".
> > 
> > About trolling: well, I made no attempt to reply politely and diplomatically, no.
> > But "solving" a "problem" in some use case by dropping support for that use
> > case is kind of extreme.
> > 
> > The way I see it:
> > * It's non-trivial to place emms optimally and automatically
> > * there needs to be an emms between MMX code and FPU code
> > 
> > The solution to this would be any one of
> > A. let the programmer place emms, as has been done in the past
> > B. don't support MMX at all
> > C. don't support the x87 FPU at all
> > D. place emms after every bunch of MMX instructions
> > E. solve a quite non-trivial problem and place emms optimally
> > 
> > The solution which has apparently been selected is B. Why was that chosen
> > instead of, let's say, A?
> > 
> > If I write SIMD code, then I know that I need an emms on x86. It's
> > trivial for the programmer to place it optimally.
> 
> I don't know where you get the idea that MMX support was dropped in any way. I

Maybe because the SIMD code in this PR, compiled with -mmmx, does not use MMX
instructions but significantly less efficient integer instructions. And you
added a test to gcc which ensures that this case does not use MMX instructions.

This is pretty much the definition of dropping MMX support (for this specific
case).


> won't engage in a discussion about autovectorisation, intrinsics, builtins,
> generic vectorisation, etc, etc with you,

And somehow I am glad about that.


> but please look at PR 21395 to see how
> a performance PR should be filed.

> The MMX code in that PR is _far_ from trivial,

Well, that is something I would disagree with.


> but since it is well written using intrinsic instructions, it enables
> a jaw-dropping performance increase that is simply not possible when ASM blocks
> are used.
> 
> Now, I'm sure that you have your numbers ready to back up your claims from
> Comment #33 about performance of generated code, and I challenge you to beat
> the performance of gcc-4.4 generated code with hand-crafted assembly using the
> example of PR 21395.

Done.
The jaw-dropping intrinsics need
2.034s

The stinky hand-written asm needs
1.312s

But you can read the details in PR 21395.

[...]


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14552



* [Bug target/14552] compiled trivial vector intrinsic code is inefficient
       [not found] <bug-14552-4523@http.gcc.gnu.org/bugzilla/>
                   ` (22 preceding siblings ...)
  2008-03-22  2:39 ` michaelni at gmx dot at
@ 2008-04-21  8:21 ` ubizjak at gmail dot com
  23 siblings, 0 replies; 25+ messages in thread
From: ubizjak at gmail dot com @ 2008-04-21  8:21 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #38 from ubizjak at gmail dot com  2008-04-21 08:21 -------
*** Bug 32301 has been marked as a duplicate of this bug. ***


-- 

ubizjak at gmail dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tomash dot brechko at gmail
                   |                            |dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14552



end of thread, other threads:[~2008-04-21  8:21 UTC | newest]

Thread overview: 25+ messages
     [not found] <bug-14552-4523@http.gcc.gnu.org/bugzilla/>
2005-11-21 11:29 ` [Bug target/14552] compiled trivial vector intrinsic code is ineffiencent pluto at agmk dot net
2005-11-21 11:32 ` [Bug target/14552] compiled trivial vector intrinsic code is inefficient pcarlini at suse dot de
2005-11-21 11:34 ` pcarlini at suse dot de
2005-11-21 15:05 ` pluto at agmk dot net
2005-11-21 15:09 ` pinskia at gcc dot gnu dot org
2005-11-21 18:38 ` pluto at agmk dot net
2005-12-01  0:52 ` pluto at agmk dot net
2008-03-08  7:30 ` ubizjak at gmail dot com
2008-03-19 10:46 ` ubizjak at gmail dot com
2008-03-19 19:22 ` astrange at ithinksw dot com
2008-03-19 19:39 ` astrange at ithinksw dot com
2008-03-19 23:39 ` uros at gcc dot gnu dot org
2008-03-19 23:47 ` ubizjak at gmail dot com
2008-03-19 23:50 ` pinskia at gcc dot gnu dot org
2008-03-20  0:02 ` ubizjak at gmail dot com
2008-03-20  0:04 ` ubizjak at gmail dot com
2008-03-20  0:23   ` Andrew Pinski
2008-03-20  0:24 ` pinskia at gmail dot com
2008-03-20  0:40 ` astrange at ithinksw dot com
2008-03-20  1:37 ` michaelni at gmx dot at
2008-03-20  9:50 ` ubizjak at gmail dot com
2008-03-20 17:18 ` michaelni at gmx dot at
2008-03-21 10:34 ` ubizjak at gmail dot com
2008-03-22  2:39 ` michaelni at gmx dot at
2008-04-21  8:21 ` ubizjak at gmail dot com
