* [Bug tree-optimization/58095] SIMD code requiring auxiliary array for best optimization
2013-08-06 16:03 [Bug c++/58095] New: SIMD code requiring auxiliary array for best optimization siavashserver at gmail dot com
` (2 preceding siblings ...)
2013-08-06 17:46 ` siavashserver at gmail dot com
@ 2013-08-07 5:13 ` siavashserver at gmail dot com
2013-08-07 6:31 ` siavashserver at gmail dot com
2021-08-28 18:48 ` pinskia at gcc dot gnu.org
5 siblings, 0 replies; 7+ messages in thread
From: siavashserver at gmail dot com @ 2013-08-07 5:13 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58095
--- Comment #3 from Siavash Eliasi <siavashserver at gmail dot com> ---
I did an experiment using raw float data types instead of the __m128 data
type. This time GCC, Clang and ICC were able to generate the desired code, even
without using the __restrict__ keyword, though the source is a little dirtier
(pointer arithmetic).
Not most, but I'm sure that new video decoders/encoders, game engines and similar
applications are using __m128 data types directly instead of float data types,
because (1) it guarantees the data to be 16-byte aligned, (2) it removes the need
to manually load/store data from memory to XMM/YMM registers, (3) it makes the
source code smaller and easier to maintain, and (4) it produces much cleaner and
smaller generated code.
In conclusion, I don't think the issue that I and other people are facing is
related to not using the __restrict__ keyword. All compilers fail to generate
optimal code when facing __m128 data types; the one exception is ICC, which is
able to generate optimal code when __m128 data types and the __restrict__
keyword are used together.
Here is what I have tried:
#include <xmmintrin.h>
void fooFloat(float* a, float* b, float* d, float* c, unsigned int size)
{
for (unsigned int i = 0; i < size; i+=32)
{
__m128 ax[8], bx[8], cx[8], dx[8];
ax[0] = _mm_load_ps(&a[i*32+0]);
ax[1] = _mm_load_ps(&a[i*32+4]);
ax[2] = _mm_load_ps(&a[i*32+8]);
ax[3] = _mm_load_ps(&a[i*32+12]);
ax[4] = _mm_load_ps(&a[i*32+16]);
ax[5] = _mm_load_ps(&a[i*32+20]);
ax[6] = _mm_load_ps(&a[i*32+24]);
ax[7] = _mm_load_ps(&a[i*32+28]);
bx[0] = _mm_load_ps(&b[i*32+0]);
bx[1] = _mm_load_ps(&b[i*32+4]);
bx[2] = _mm_load_ps(&b[i*32+8]);
bx[3] = _mm_load_ps(&b[i*32+12]);
bx[4] = _mm_load_ps(&b[i*32+16]);
bx[5] = _mm_load_ps(&b[i*32+20]);
bx[6] = _mm_load_ps(&b[i*32+24]);
bx[7] = _mm_load_ps(&b[i*32+28]);
dx[0] = _mm_load_ps(&d[i*32+0]);
dx[1] = _mm_load_ps(&d[i*32+4]);
dx[2] = _mm_load_ps(&d[i*32+8]);
dx[3] = _mm_load_ps(&d[i*32+12]);
dx[4] = _mm_load_ps(&d[i*32+16]);
dx[5] = _mm_load_ps(&d[i*32+20]);
dx[6] = _mm_load_ps(&d[i*32+24]);
dx[7] = _mm_load_ps(&d[i*32+28]);
cx[0] = _mm_add_ps(ax[0], _mm_mul_ps(dx[0], bx[0]));
cx[1] = _mm_add_ps(ax[1], _mm_mul_ps(dx[1], bx[1]));
cx[2] = _mm_add_ps(ax[2], _mm_mul_ps(dx[2], bx[2]));
cx[3] = _mm_add_ps(ax[3], _mm_mul_ps(dx[3], bx[3]));
cx[4] = _mm_add_ps(ax[4], _mm_mul_ps(dx[4], bx[4]));
cx[5] = _mm_add_ps(ax[5], _mm_mul_ps(dx[5], bx[5]));
cx[6] = _mm_add_ps(ax[6], _mm_mul_ps(dx[6], bx[6]));
cx[7] = _mm_add_ps(ax[7], _mm_mul_ps(dx[7], bx[7]));
_mm_store_ps(&c[i*32+0], cx[0]);
_mm_store_ps(&c[i*32+4], cx[1]);
_mm_store_ps(&c[i*32+8], cx[2]);
_mm_store_ps(&c[i*32+12], cx[3]);
_mm_store_ps(&c[i*32+16], cx[4]);
_mm_store_ps(&c[i*32+20], cx[5]);
_mm_store_ps(&c[i*32+24], cx[6]);
_mm_store_ps(&c[i*32+28], cx[7]);
}
}
And its output using GCC 4.8.1 -O2 :
fooFloat(float*, float*, float*, float*, unsigned int):
push r15
xor r15d, r15d
test r8d, r8d
mov eax, 4
push r14
push r13
push r12
push rbp
push rbx
je .L15
.L19:
lea r12d, [rax+4]
lea ebp, [rax+8]
lea ebx, [rax+12]
lea r11d, [rax+16]
lea r10d, [rax+20]
lea r9d, [rax+24]
mov r14d, r15d
mov r13d, eax
add r15d, 32
sal r14d, 5
movaps xmm6, XMMWORD PTR [rdx+r13*4]
add eax, 1024
cmp r8d, r15d
movaps xmm7, XMMWORD PTR [rdx+r14*4]
mulps xmm6, XMMWORD PTR [rsi+r13*4]
movaps xmm5, XMMWORD PTR [rdx+r12*4]
mulps xmm7, XMMWORD PTR [rsi+r14*4]
movaps xmm4, XMMWORD PTR [rdx+rbp*4]
mulps xmm5, XMMWORD PTR [rsi+r12*4]
movaps xmm3, XMMWORD PTR [rdx+rbx*4]
mulps xmm4, XMMWORD PTR [rsi+rbp*4]
movaps xmm2, XMMWORD PTR [rdx+r11*4]
mulps xmm3, XMMWORD PTR [rsi+rbx*4]
movaps xmm1, XMMWORD PTR [rdx+r10*4]
mulps xmm2, XMMWORD PTR [rsi+r11*4]
movaps xmm0, XMMWORD PTR [rdx+r9*4]
mulps xmm1, XMMWORD PTR [rsi+r10*4]
addps xmm7, XMMWORD PTR [rdi+r14*4]
mulps xmm0, XMMWORD PTR [rsi+r9*4]
addps xmm6, XMMWORD PTR [rdi+r13*4]
addps xmm5, XMMWORD PTR [rdi+r12*4]
addps xmm4, XMMWORD PTR [rdi+rbp*4]
addps xmm3, XMMWORD PTR [rdi+rbx*4]
addps xmm2, XMMWORD PTR [rdi+r11*4]
addps xmm1, XMMWORD PTR [rdi+r10*4]
addps xmm0, XMMWORD PTR [rdi+r9*4]
movaps XMMWORD PTR [rcx+r14*4], xmm7
movaps XMMWORD PTR [rcx+r13*4], xmm6
movaps XMMWORD PTR [rcx+r12*4], xmm5
movaps XMMWORD PTR [rcx+rbp*4], xmm4
movaps XMMWORD PTR [rcx+rbx*4], xmm3
movaps XMMWORD PTR [rcx+r11*4], xmm2
movaps XMMWORD PTR [rcx+r10*4], xmm1
movaps XMMWORD PTR [rcx+r9*4], xmm0
ja .L19
.L15:
pop rbx
pop rbp
pop r12
pop r13
pop r14
pop r15
ret
^ permalink raw reply [flat|nested] 7+ messages in thread
* [Bug tree-optimization/58095] SIMD code requiring auxiliary array for best optimization
2013-08-06 16:03 [Bug c++/58095] New: SIMD code requiring auxiliary array for best optimization siavashserver at gmail dot com
` (3 preceding siblings ...)
2013-08-07 5:13 ` siavashserver at gmail dot com
@ 2013-08-07 6:31 ` siavashserver at gmail dot com
2021-08-28 18:48 ` pinskia at gcc dot gnu.org
5 siblings, 0 replies; 7+ messages in thread
From: siavashserver at gmail dot com @ 2013-08-07 6:31 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58095
--- Comment #4 from Siavash Eliasi <siavashserver at gmail dot com> ---
In the end, here is what I would really like GCC to generate for me: the same
output for function (foo) as for function (bar) when using GCC with the
-O3 -march=core2 switches:
#include <xmmintrin.h>
#define BATCHSIZE 8
void foo(__m128 a[][BATCHSIZE], __m128 b[][BATCHSIZE], __m128 d[][BATCHSIZE],
__m128 c[][BATCHSIZE], unsigned int size)
{
for (unsigned int i = 0; i < size; i++)
{
for (unsigned int j=0; j<BATCHSIZE; j++)
{
c[i][j] = _mm_add_ps(a[i][j], _mm_mul_ps(d[i][j], b[i][j]));
}
}
}
void bar(__m128 a[][BATCHSIZE], __m128 b[][BATCHSIZE], __m128 d[][BATCHSIZE],
__m128 c[][BATCHSIZE], unsigned int size)
{
for (unsigned int i = 0; i < size; i++)
{
__m128 cx[BATCHSIZE];
for (unsigned int j=0; j<BATCHSIZE; j++)
{
cx[j] = _mm_add_ps(a[i][j], _mm_mul_ps(d[i][j], b[i][j]));
}
for (unsigned int j=0; j<BATCHSIZE; j++)
{
c[i][j] = cx[j];
}
}
}
Generated asm code:
foo(float __vector (*) [8], float __vector (*) [8], float __vector (*) [8],
float __vector (*) [8], unsigned int):
test r8d, r8d
je .L1
xor eax, eax
.L4:
movaps xmm0, XMMWORD PTR [rdx]
add eax, 1
sub rsi, -128
sub rdx, -128
sub rdi, -128
sub rcx, -128
mulps xmm0, XMMWORD PTR [rsi-128]
addps xmm0, XMMWORD PTR [rdi-128]
movaps XMMWORD PTR [rcx-128], xmm0
movaps xmm0, XMMWORD PTR [rdx-112]
mulps xmm0, XMMWORD PTR [rsi-112]
addps xmm0, XMMWORD PTR [rdi-112]
movaps XMMWORD PTR [rcx-112], xmm0
movaps xmm0, XMMWORD PTR [rdx-96]
mulps xmm0, XMMWORD PTR [rsi-96]
addps xmm0, XMMWORD PTR [rdi-96]
movaps XMMWORD PTR [rcx-96], xmm0
movaps xmm0, XMMWORD PTR [rdx-80]
mulps xmm0, XMMWORD PTR [rsi-80]
addps xmm0, XMMWORD PTR [rdi-80]
movaps XMMWORD PTR [rcx-80], xmm0
movaps xmm0, XMMWORD PTR [rdx-64]
mulps xmm0, XMMWORD PTR [rsi-64]
addps xmm0, XMMWORD PTR [rdi-64]
movaps XMMWORD PTR [rcx-64], xmm0
movaps xmm0, XMMWORD PTR [rdx-48]
mulps xmm0, XMMWORD PTR [rsi-48]
addps xmm0, XMMWORD PTR [rdi-48]
movaps XMMWORD PTR [rcx-48], xmm0
movaps xmm0, XMMWORD PTR [rdx-32]
mulps xmm0, XMMWORD PTR [rsi-32]
addps xmm0, XMMWORD PTR [rdi-32]
movaps XMMWORD PTR [rcx-32], xmm0
movaps xmm0, XMMWORD PTR [rdx-16]
mulps xmm0, XMMWORD PTR [rsi-16]
addps xmm0, XMMWORD PTR [rdi-16]
movaps XMMWORD PTR [rcx-16], xmm0
cmp eax, r8d
jne .L4
.L1:
rep; ret
bar(float __vector (*) [8], float __vector (*) [8], float __vector (*) [8],
float __vector (*) [8], unsigned int):
test r8d, r8d
je .L6
xor eax, eax
.L9:
movaps xmm7, XMMWORD PTR [rdx]
add eax, 1
sub rsi, -128
movaps xmm6, XMMWORD PTR [rdx+16]
sub rdi, -128
sub rdx, -128
movaps xmm5, XMMWORD PTR [rdx-96]
sub rcx, -128
movaps xmm4, XMMWORD PTR [rdx-80]
movaps xmm3, XMMWORD PTR [rdx-64]
movaps xmm2, XMMWORD PTR [rdx-48]
movaps xmm1, XMMWORD PTR [rdx-32]
movaps xmm0, XMMWORD PTR [rdx-16]
mulps xmm7, XMMWORD PTR [rsi-128]
mulps xmm6, XMMWORD PTR [rsi-112]
mulps xmm5, XMMWORD PTR [rsi-96]
mulps xmm4, XMMWORD PTR [rsi-80]
mulps xmm3, XMMWORD PTR [rsi-64]
mulps xmm2, XMMWORD PTR [rsi-48]
mulps xmm1, XMMWORD PTR [rsi-32]
mulps xmm0, XMMWORD PTR [rsi-16]
addps xmm7, XMMWORD PTR [rdi-128]
addps xmm6, XMMWORD PTR [rdi-112]
addps xmm5, XMMWORD PTR [rdi-96]
addps xmm4, XMMWORD PTR [rdi-80]
addps xmm3, XMMWORD PTR [rdi-64]
addps xmm2, XMMWORD PTR [rdi-48]
addps xmm1, XMMWORD PTR [rdi-32]
addps xmm0, XMMWORD PTR [rdi-16]
movaps XMMWORD PTR [rcx-128], xmm7
movaps XMMWORD PTR [rcx-112], xmm6
movaps XMMWORD PTR [rcx-96], xmm5
movaps XMMWORD PTR [rcx-80], xmm4
movaps XMMWORD PTR [rcx-64], xmm3
movaps XMMWORD PTR [rcx-48], xmm2
movaps XMMWORD PTR [rcx-32], xmm1
movaps XMMWORD PTR [rcx-16], xmm0
cmp eax, r8d
jne .L9
.L6:
rep; ret
^ permalink raw reply [flat|nested] 7+ messages in thread