[Bug target/65832] New: Inefficient vector construction

public inbox for gcc-bugs@sourceware.org

* [Bug target/65832] New: Inefficient vector construction
@ 2015-04-21 13:56 rguenth at gcc dot gnu.org
  2015-04-21 14:03 ` [Bug target/65832] " rguenth at gcc dot gnu.org
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2015-04-21 13:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65832

            Bug ID: 65832
           Summary: Inefficient vector construction
           Product: gcc
           Version: 6.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rguenth at gcc dot gnu.org
            Target: x86_64-*-*, i?86-*-*

typedef int v4si __attribute__((vector_size(16)));

v4si foo (int i, int j, int k, int l)
{
  return (v4si) { i, j, k, l };
}

produces

        movl    %edx, -12(%rsp)
        movd    -12(%rsp), %xmm1
        movl    %ecx, -12(%rsp)
        movd    -12(%rsp), %xmm2
        movl    %edi, -12(%rsp)
        movd    -12(%rsp), %xmm0
        movl    %esi, -12(%rsp)
        movd    -12(%rsp), %xmm3
        punpckldq       %xmm2, %xmm1
        punpckldq       %xmm3, %xmm0
        punpcklqdq      %xmm1, %xmm0
        ret

Since we spill everything to the stack anyway, we could just as well use a
single vector load, something like

        movl    %edi, -24(%rsp)
        movl    %esi, -20(%rsp)
        movl    %edx, -16(%rsp)
        movl    %ecx, -12(%rsp)
        movdqu  -24(%rsp), %xmm0
        ret
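
The same idea can be written at the source level. This is only a sketch:
foo_via_stack and tmp are illustrative names that do not appear in the report,
and whether the compiler really emits four scalar stores followed by one
movdqu depends on the GCC version and flags.

```c
#include <string.h>

typedef int v4si __attribute__((vector_size(16)));

/* Hypothetical variant of foo: place the four scalars in adjacent stack
   slots and reload them with a single 16-byte copy.  */
v4si foo_via_stack (int i, int j, int k, int l)
{
  int tmp[4] = { i, j, k, l };   /* element 0 sits at the lowest address */
  v4si v;
  memcpy (&v, tmp, sizeof v);    /* one 16-byte load from the stack slots */
  return v;
}
```

Calling foo_via_stack (1, 2, 3, 4) yields the same vector as the original
foo, with element 0 equal to 1 and element 3 equal to 4.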



* [Bug target/65832] Inefficient vector construction
  2015-04-21 13:56 [Bug target/65832] New: Inefficient vector construction rguenth at gcc dot gnu.org
@ 2015-04-21 14:03 ` rguenth at gcc dot gnu.org
  2015-04-21 14:11 ` rguenth at gcc dot gnu.org
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2015-04-21 14:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65832

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
typedef int v4si __attribute__((vector_size(16)));

v4si bar (int *i, int *j, int *k, int *l)
{
  return (v4si) { *i, *j, *k, *l };
}

looks reasonable (no spills at least, stray move for the return value).

        movd    (%rsi), %xmm0
        movd    (%rdi), %xmm3
        movd    (%rcx), %xmm1
        movd    (%rdx), %xmm2
        punpckldq       %xmm0, %xmm3
        punpckldq       %xmm1, %xmm2
        movdqa  %xmm3, %xmm0
        punpcklqdq      %xmm2, %xmm0

With -mavx2 we get

        vmovd   (%rdx), %xmm2
        vmovd   (%rdi), %xmm3
        vpinsrd $1, (%rcx), %xmm2, %xmm1
        vpinsrd $1, (%rsi), %xmm3, %xmm0
        vpunpcklqdq     %xmm1, %xmm0, %xmm0



* [Bug target/65832] Inefficient vector construction
  2015-04-21 13:56 [Bug target/65832] New: Inefficient vector construction rguenth at gcc dot gnu.org
  2015-04-21 14:03 ` [Bug target/65832] " rguenth at gcc dot gnu.org
@ 2015-04-21 14:11 ` rguenth at gcc dot gnu.org
  2015-04-28  8:51 ` rguenth at gcc dot gnu.org
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2015-04-21 14:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65832

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
typedef unsigned char v16qi __attribute__((vector_size(16)));

v16qi baz (int i0, int i1, int i2, int i3,
           int i10, int i11, int i12, int i13,
           int i20, int i21, int i22, int i23,
           int i30, int i31, int i32, int i33)
{
  return (v16qi) { i0, i1, i2, i3,
      i10, i11, i12, i13,
      i20, i21, i22, i23,
      i30, i31, i32, i33 };
}

is even more "funny".

I'm looking into whether the vectorizer cost model for these vector
constructors makes sense.  Currently the cost is

      case vec_construct:
        elements = TYPE_VECTOR_SUBPARTS (vectype);
        return ix86_cost->vec_stmt_cost * (elements / 2 + 1);

which for v16qi and generic would be 9 vector stmts.  The assembly for the
above contains 47 instructions.  Not sure where elements / 2 + 1 comes from.
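
For reference, the arithmetic behind the "9 vector stmts" figure can be
reproduced with a small stand-alone function. vec_construct_cost is an
illustrative name and not a GCC function, and a vec_stmt_cost of 1 is an
assumption made here only to count statements.

```c
/* Stand-in for the vec_construct case quoted above.  The real code reads
   the element count from TYPE_VECTOR_SUBPARTS and the per-statement cost
   from the active ix86 cost table.  */
int vec_construct_cost (int vec_stmt_cost, int elements)
{
  return vec_stmt_cost * (elements / 2 + 1);
}
```

With a per-statement cost of 1, a 16-element vector such as v16qi gives
16 / 2 + 1 = 9 and a 4-element vector such as v4si gives 4 / 2 + 1 = 3.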



* [Bug target/65832] Inefficient vector construction
  2015-04-21 13:56 [Bug target/65832] New: Inefficient vector construction rguenth at gcc dot gnu.org
  2015-04-21 14:03 ` [Bug target/65832] " rguenth at gcc dot gnu.org
  2015-04-21 14:11 ` rguenth at gcc dot gnu.org
@ 2015-04-28  8:51 ` rguenth at gcc dot gnu.org
  2015-04-28  8:55 ` rguenth at gcc dot gnu.org
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2015-04-28  8:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65832

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
Another example where the vectorizer thinks vectorization is profitable:

#define N 16

unsigned int out[N];
unsigned int in[N] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};

__attribute__ ((noinline)) int
main1 (unsigned int x, unsigned int y)
{
  int i;
  unsigned int a0, a1, a2, a3;

  a0 = in[0];
  a1 = in[1];
  a2 = in[2];
  a3 = in[3];

  out[0] = a0 * x;
  out[1] = a1 * y;
  out[2] = a2 * x;
  out[3] = a3 * y;
}

generates

main1:
.LFB0:
        .cfi_startproc
        movl    %edi, -12(%rsp)
        movd    -12(%rsp), %xmm0
        movl    %esi, -12(%rsp)
        movd    -12(%rsp), %xmm3
        movdqa  in(%rip), %xmm2
        punpckldq       %xmm3, %xmm0
        psrlq   $32, %xmm2
        punpcklqdq      %xmm0, %xmm0
        movdqa  %xmm0, %xmm1
        psrlq   $32, %xmm0
        pmuludq %xmm2, %xmm0
        pshufd  $8, %xmm0, %xmm0
        pmuludq in(%rip), %xmm1
        pshufd  $8, %xmm1, %xmm1
        punpckldq       %xmm0, %xmm1
        movaps  %xmm1, out(%rip)
        ret

It is slightly less obfuscated when we allow GPR-to-XMM moves with -mtune=intel:

main1:
.LFB0:
        .cfi_startproc
        movd    %edi, %xmm0
        movd    %esi, %xmm3
        movdqa  in(%rip), %xmm2
        punpckldq       %xmm3, %xmm0
        punpcklqdq      %xmm0, %xmm0
        psrlq   $32, %xmm2
        movdqa  %xmm0, %xmm1
        psrlq   $32, %xmm0
        pmuludq in(%rip), %xmm1
        pmuludq %xmm2, %xmm0
        pshufd  $8, %xmm1, %xmm1
        pshufd  $8, %xmm0, %xmm0
        punpckldq       %xmm0, %xmm1
        movdqa  %xmm1, out(%rip)
        ret

so for { x, y, x, y } construction we generate

        movd    %edi, %xmm0
        movd    %esi, %xmm3
        punpckldq       %xmm3, %xmm0
        punpcklqdq      %xmm0, %xmm0

no f*** idea where all the shifting and shuffling comes from...

This is just

  vect_cst_.7_18 = {x_6(D), y_9(D), x_6(D), y_9(D)};
  vect_a0_2.5_17 = MEM[(unsigned int *)&in];
  vect__7.6_19 = vect_a0_2.5_17 * vect_cst_.7_18;
  MEM[(unsigned int *)&out] = vect__7.6_19;
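
Written with GCC vector extensions, those four GIMPLE statements correspond
roughly to the sketch below. mul_xyxy, v4su, dst and src are illustrative
names that are not part of the testcase, and the memcpy calls stand in for
the vector load and store.

```c
#include <string.h>

typedef unsigned int v4su __attribute__((vector_size(16)));

/* Illustrative only: computes dst[n] = src[n] * (n even ? x : y) for
   n = 0..3 with one vector constructor and one vector multiply.  */
void mul_xyxy (unsigned int *dst, const unsigned int *src,
               unsigned int x, unsigned int y)
{
  v4su cst = { x, y, x, y };     /* the vect_cst_ constructor */
  v4su a;
  memcpy (&a, src, sizeof a);    /* vector load of four elements */
  v4su r = a * cst;              /* element-wise 32-bit multiply */
  memcpy (dst, &r, sizeof r);    /* vector store of four elements */
}
```

For src = { 0, 1, 2, 3 }, x = 5 and y = 7 this stores { 0, 7, 10, 21 }, the
same values the scalar statements in main1 would produce.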



* [Bug target/65832] Inefficient vector construction
  2015-04-21 13:56 [Bug target/65832] New: Inefficient vector construction rguenth at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2015-04-28  8:51 ` rguenth at gcc dot gnu.org
@ 2015-04-28  8:55 ` rguenth at gcc dot gnu.org
  2015-05-22 12:15 ` rguenth at gcc dot gnu.org
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2015-04-28  8:55 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65832

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |uros at gcc dot gnu.org

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
Somehow we get into very weird initial RTL generated by expand...

(insn 12 11 13 (set (reg:V2DI 101)
        (mult:V2DI (zero_extend:V2DI (vec_select:V2SI (reg:V4SI 95 [ vect_cst_.7 ])
                    (parallel [
                            (const_int 0 [0])
                            (const_int 2 [0x2])
                        ])))
            (zero_extend:V2DI (vec_select:V2SI (mem/c:V4SI (reg/f:DI 94) [1 MEM[(unsigned int *)&in]+0 S16 A256])
                    (parallel [
                            (const_int 0 [0])
                            (const_int 2 [0x2])
                        ]))))) t.c:17 -1
     (nil))

(WTF!?  As if there were no integer vector multiply with SSE2 but only DImode
ones?!)



* [Bug target/65832] Inefficient vector construction
  2015-04-21 13:56 [Bug target/65832] New: Inefficient vector construction rguenth at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2015-04-28  8:55 ` rguenth at gcc dot gnu.org
@ 2015-05-22 12:15 ` rguenth at gcc dot gnu.org
  2015-06-01 10:30 ` rguenth at gcc dot gnu.org
  2020-06-24 20:45 ` [Bug middle-end/65832] " gabravier at gmail dot com
  6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2015-05-22 12:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65832

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
On a related note, store_constructor handles VECTOR_TYPE constructors through
the vec_init optab but that doesn't work for vector elements:

            /* Don't use vec_init<mode> if some elements have VECTOR_TYPE.  */
            if (icode != CODE_FOR_nothing)
              {
                tree value;

                FOR_EACH_CONSTRUCTOR_VALUE (CONSTRUCTOR_ELTS (exp), idx, value)
                  if (TREE_CODE (TREE_TYPE (value)) == VECTOR_TYPE)
                    {
                      icode = CODE_FOR_nothing;

Instead it could, for say a V16QI constructor of four V4QI elements, pun
the elements to SI mode, create a V4SI mode vector via vec_init, and then
pun it to V16QI, with all punning done via subregs.

The vectorizer currently creates such vector constructors from grouped strided
load/store support.
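
At the source level the punning idea looks roughly like the sketch below. It
assumes a 32-bit int, and build_v16qi_from_groups and its parameters are
illustrative names; the actual change would live in store_constructor and
operate on RTL subregs, not on C source.

```c
#include <string.h>

typedef unsigned char v16qi __attribute__((vector_size(16)));
typedef int v4si __attribute__((vector_size(16)));

/* Illustrative helper: p0..p3 each point at four contiguous bytes, which
   play the role of the V4QI elements.  */
v16qi build_v16qi_from_groups (const unsigned char *p0,
                               const unsigned char *p1,
                               const unsigned char *p2,
                               const unsigned char *p3)
{
  int e0, e1, e2, e3;
  memcpy (&e0, p0, 4);             /* pun each 4-byte group to SI mode */
  memcpy (&e1, p1, 4);
  memcpy (&e2, p2, 4);
  memcpy (&e3, p3, 4);
  v4si tmp = { e0, e1, e2, e3 };   /* scalar-element constructor */
  return (v16qi) tmp;              /* reinterpret the 16 bytes as V16QI */
}
```

Because every step is a bit-for-bit copy, byte n of the result equals byte
n % 4 of group n / 4.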



* [Bug target/65832] Inefficient vector construction
  2015-04-21 13:56 [Bug target/65832] New: Inefficient vector construction rguenth at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2015-05-22 12:15 ` rguenth at gcc dot gnu.org
@ 2015-06-01 10:30 ` rguenth at gcc dot gnu.org
  2020-06-24 20:45 ` [Bug middle-end/65832] " gabravier at gmail dot com
  6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2015-06-01 10:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65832

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
Testcase simulating all qi vector cases the vectorizer may create:

char a[1024];
char b[1024];

void foobar (int s)
{
  for (int i = 0; i < 16; ++i)
    {
      b[i] = a[s*i];
    }
}

void foo (int s)
{
  for (int i = 0; i < 8; ++i)
    {
      b[2*i] = a[s*i];
      b[2*i + 1] = a[s*i + 1];
    }
}

void bar (int s)
{
  for (int i = 0; i < 4; ++i)
    {
      b[4*i] = a[s*i];
      b[4*i + 1] = a[s*i + 1];
      b[4*i + 2] = a[s*i + 2];
      b[4*i + 3] = a[s*i + 3];
    }
}

void baz(int s)
{
  for (int i = 0; i < 2; ++i)
    {
      b[8*i] = a[s*i];
      b[8*i + 1] = a[s*i + 1];
      b[8*i + 2] = a[s*i + 2];
      b[8*i + 3] = a[s*i + 3];
      b[8*i + 4] = a[s*i + 4];
      b[8*i + 5] = a[s*i + 5];
      b[8*i + 6] = a[s*i + 6];
      b[8*i + 7] = a[s*i + 7];
    }
}

Compile with -fdisable-tree-cunrolli.

foobar creates abysmal code and baz needlessly goes through the stack.
With plain -msse2 none of the code generation is great, except for foo,
which ends up using pinsrw.

With -msse4, baz fails to use pinsrq and foobar fails to use pinsrb.



* [Bug middle-end/65832] Inefficient vector construction
  2015-04-21 13:56 [Bug target/65832] New: Inefficient vector construction rguenth at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2015-06-01 10:30 ` rguenth at gcc dot gnu.org
@ 2020-06-24 20:45 ` gabravier at gmail dot com
  6 siblings, 0 replies; 8+ messages in thread
From: gabravier at gmail dot com @ 2020-06-24 20:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65832

Gabriel Ravier <gabravier at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |gabravier at gmail dot com

--- Comment #10 from Gabriel Ravier <gabravier at gmail dot com> ---
Someone might want to look at this.  From what I can see, the first few
examples are now optimized well, but I can't say the same for the later
ones: some of them look close to ideal, while others are still badly
optimized, to say the least.

