[Bug tree-optimization/99971] GCC generates partially vectorized and scalar code at once

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

From: "rguenth at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/99971] GCC generates partially vectorized and scalar code at once
Date: Fri, 09 Apr 2021 07:05:46 +0000	[thread overview]
Message-ID: <bug-99971-4-Mmw3DZgY1p@http.gcc.gnu.org/bugzilla/> (raw)
In-Reply-To: <bug-99971-4@http.gcc.gnu.org/bugzilla/>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
           Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot gnu.org
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2021-04-09
             Status|UNCONFIRMED                 |ASSIGNED

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed.  While we manage to analyze for the "perfect" solution" we fail
because dependence testing doesn't handle a piece, this throws away half
of the vectorization.  We do actually see that we'll retain the scalar
loads and computations but still doing three vector loads and a vector add
seems cheaper than doing four scalar stores:

0x1fdb5a0 x_2(D)->a 1 times unaligned_load (misalign -1) costs 12 in body
0x1fdb5a0 y1_3(D)->a 1 times unaligned_load (misalign -1) costs 12 in body
0x1fdb5a0 _13 + _14 1 times vector_stmt costs 4 in body
0x1fdb5a0 _15 1 times unaligned_store (misalign -1) costs 12 in body
0x1fddcb0 _15 1 times scalar_store costs 12 in body
0x1fddcb0 _18 1 times scalar_store costs 12 in body
0x1fddcb0 _21 1 times scalar_store costs 12 in body
0x1fddcb0 _24 1 times scalar_store costs 12 in body
t.C:28:1: note:  Cost model analysis:
  Vector inside of basic block cost: 40
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar cost of basic block: 48
t.C:28:1: note:  Basic block will be vectorized using SLP

now, fortunately GCC 11 will improve on this [a bit] and we'll produce

_Z4testR1ARKS_S2_:
.LFB2:
        .cfi_startproc
        movdqu  (%rsi), %xmm0
        movdqu  (%rdi), %xmm1
        paddd   %xmm1, %xmm0
        movups  %xmm0, (%rdi)
        movd    %xmm0, %eax
        subl    (%rdx), %eax
        movl    %eax, (%rdi)
        pextrd  $1, %xmm0, %eax
        subl    4(%rdx), %eax
        movl    %eax, 4(%rdi)
        pextrd  $2, %xmm0, %eax
        subl    8(%rdx), %eax
        movl    %eax, 8(%rdi)
        pextrd  $3, %xmm0, %eax
        subl    12(%rdx), %eax
        movl    %eax, 12(%rdi)
        ret

which is not re-doing the scalar loads/adds but instead uses the vector
result.  Still the same dependence issue is present:

t.C:16:11: missed:   can't determine dependence between y1_3(D)->b and
x_2(D)->a
t.C:16:11: note:  removing SLP instance operations starting from: x_2(D)->a =
_6;

the scalar code before vectorization looks like

  <bb 2> [local count: 1073741824]:
  _13 = x_2(D)->a;
  _14 = y1_3(D)->a;
  _15 = _13 + _14;
  x_2(D)->a = _15;
  _16 = x_2(D)->b;
  _17 = y1_3(D)->b;  <---
  _18 = _16 + _17;
  x_2(D)->b = _18;
  _19 = x_2(D)->c;
  _20 = y1_3(D)->c;
  _21 = _19 + _20;
  x_2(D)->c = _21;
  _22 = x_2(D)->d;
  _23 = y1_3(D)->d;
  _24 = _22 + _23;
  x_2(D)->d = _24;
  _5 = y2_4(D)->a;
  _6 = _15 - _5;
  x_2(D)->a = _6;  <---
  _7 = y2_4(D)->b;
  _8 = _18 - _7;
  x_2(D)->b = _8;
  _9 = y2_4(D)->c;
  _10 = _21 - _9;
  x_2(D)->c = _10;
  _11 = y2_4(D)->d;
  _12 = _24 - _11;
  x_2(D)->d = _12;
  return;


Using

void test(A& __restrict x, A const& y1, A const& y2)
{
    x += y1;
    x -= y2;
}

produces optimal assembly even with GCC 10:

_Z4testR1ARKS_S2_:
.LFB2:
        .cfi_startproc
        movdqu  (%rsi), %xmm0
        movdqu  (%rdx), %xmm1
        movdqu  (%rdi), %xmm2
        psubd   %xmm1, %xmm0
        paddd   %xmm2, %xmm0
        movups  %xmm0, (%rdi)
        ret

note that I think we should be able to handle the dependences even without
the __restrict annotation.

next prev parent reply	other threads:[~2021-04-09  7:05 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-08 14:41 [Bug tree-optimization/99971] New: " andysem at mail dot ru
2021-04-08 14:45 ` [Bug tree-optimization/99971] " andysem at mail dot ru
2021-04-09  7:05 ` rguenth at gcc dot gnu.org [this message]
2021-04-15  9:15 ` andysem at mail dot ru
2021-04-15 11:26 ` rguenth at gcc dot gnu.org
2021-04-15 11:30 ` rguenth at gcc dot gnu.org
2021-04-15 16:01 ` andysem at mail dot ru
2021-04-15 23:17 ` david.bolvansky at gmail dot com
2021-04-23  7:35 ` cvs-commit at gcc dot gnu.org
2021-04-23  7:37 ` rguenth at gcc dot gnu.org
2021-04-23  8:43 ` andysem at mail dot ru
2021-04-23  9:03 ` rguenther at suse dot de

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=bug-99971-4-Mmw3DZgY1p@http.gcc.gnu.org/bugzilla/ \
    --to=gcc-bugzilla@gcc.gnu.org \
    --cc=gcc-bugs@gcc.gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).