[Bug tree-optimization/99971] New: GCC generates partially vectorized and scalar code at once

public inbox for gcc-bugs@sourceware.org

* [Bug tree-optimization/99971] New: GCC generates partially vectorized and scalar code at once
@ 2021-04-08 14:41 andysem at mail dot ru
  2021-04-08 14:45 ` [Bug tree-optimization/99971] " andysem at mail dot ru
                   ` (10 more replies)
  0 siblings, 11 replies; 12+ messages in thread
From: andysem at mail dot ru @ 2021-04-08 14:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971

            Bug ID: 99971
           Summary: GCC generates partially vectorized and scalar code at
                    once
           Product: gcc
           Version: 10.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: andysem at mail dot ru
  Target Milestone: ---

Consider the following code sample:

struct A
{
    unsigned int a, b, c, d;

    A& operator+= (A const& that)
    {
        a += that.a;
        b += that.b;
        c += that.c;
        d += that.d;
        return *this;
    }

    A& operator-= (A const& that)
    {
        a -= that.a;
        b -= that.b;
        c -= that.c;
        d -= that.d;
        return *this;
    }
};

void test(A& x, A const& y1, A const& y2)
{
    x += y1;
    x -= y2;
}

The code, when compiled with options "-O3 -march=nehalem", generates:

test(A&, A const&, A const&):
        pushq   %rbp
        movdqu  (%rdi), %xmm1
        pushq   %rbx
        movl    4(%rsi), %r8d
        movdqu  (%rsi), %xmm0
        movl    (%rsi), %r9d
        paddd   %xmm1, %xmm0
        movl    8(%rsi), %ecx
        movl    12(%rsi), %eax
        movl    %r8d, %esi
        movl    (%rdi), %ebp
        movl    4(%rdi), %ebx
        movl    8(%rdi), %r11d
        movl    12(%rdi), %r10d
        movups  %xmm0, (%rdi)
        subl    (%rdx), %r9d
        subl    4(%rdx), %esi
        subl    8(%rdx), %ecx
        subl    12(%rdx), %eax
        addl    %ebp, %r9d
        addl    %ebx, %esi
        movl    %r9d, (%rdi)
        popq    %rbx
        addl    %r11d, %ecx
        popq    %rbp
        movl    %esi, 4(%rdi)
        addl    %r10d, %eax
        movl    %ecx, 8(%rdi)
        movl    %eax, 12(%rdi)
        ret

https://gcc.godbolt.org/z/Mzchj8bxG

Here you can see that the compiler has partially vectorized the test function:
it converted "x += y1" to paddd, as expected, but failed to vectorize "x -=
y2". At the same time the compiler also generated scalar code, including
for the already vectorized "x += y1" line, basically duplicating it.

Note that when either "x += y1" or "x -= y2" is commented out, the compiler is
able to vectorize the line that is left. It is also able to vectorize both
lines when the += and -= operators are applied to different objects instead of
x.
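
For illustration, the "different objects" variant mentioned above might look
like the sketch below. The exact form used in that experiment is not given in
this report, so the name test2 and its signature are assumptions; struct A is
repeated in compact form only to keep the example self-contained.

```cpp
struct A
{
    unsigned int a, b, c, d;

    A& operator+= (A const& that)
    {
        a += that.a; b += that.b; c += that.c; d += that.d;
        return *this;
    }

    A& operator-= (A const& that)
    {
        a -= that.a; b -= that.b; c -= that.c; d -= that.d;
        return *this;
    }
};

// Hypothetical variant: += and -= target two different objects, so the
// second statement does not consume the result of the first one.
void test2(A& x1, A& x2, A const& y1, A const& y2)
{
    x1 += y1;
    x2 -= y2;
}
```

Compiling this with the same options and comparing the output against the
listing above should show whether both statements get vectorized.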

This is reproducible since gcc 8 up to and including 10.2. gcc 7 doesn't
vectorize this code. With the current trunk on godbolt the generated code is
different:

test(A&, A const&, A const&):
        movdqu  (%rsi), %xmm0
        movdqu  (%rdi), %xmm1
        paddd   %xmm1, %xmm0
        movups  %xmm0, (%rdi)
        movd    %xmm0, %eax
        subl    (%rdx), %eax
        movl    %eax, (%rdi)
        pextrd  $1, %xmm0, %eax
        subl    4(%rdx), %eax
        movl    %eax, 4(%rdi)
        pextrd  $2, %xmm0, %eax
        subl    8(%rdx), %eax
        movl    %eax, 8(%rdi)
        pextrd  $3, %xmm0, %eax
        subl    12(%rdx), %eax
        movl    %eax, 12(%rdi)
        ret

Here the compiler is able to vectorize "x += y1" but not "x -= y2". At least,
it removed the duplicate scalar version of "x += y1".

Given that the compiler is able to vectorize each line in isolation, I would
expect it to be able to vectorize them combined. Generating duplicate versions
of code is certainly not expected.

* [Bug tree-optimization/99971] GCC generates partially vectorized and scalar code at once
From: andysem at mail dot ru @ 2021-04-08 14:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971

--- Comment #1 from andysem at mail dot ru ---
For reference, an ideal version of this code should look something like this:

test(A&, A const&, A const&):
        movdqu  (%rsi), %xmm0
        movdqu  (%rdi), %xmm1
        movdqu  (%rdx), %xmm2
        paddd   %xmm1, %xmm0
        psubd   %xmm2, %xmm0
        movups  %xmm0, (%rdi)
        ret
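
The same shape can be written by hand at the source level. The sketch below
uses the GCC/Clang vector_size extension rather than anything from this
report; the names v4u and test_vec are assumptions. Like the listing above,
it loads all three operands before storing, so it only matches the scalar
code when y2 does not refer to the same object as x.

```cpp
#include <cstring>

struct A { unsigned int a, b, c, d; };

// GCC/Clang vector extension: four 32-bit unsigned lanes in 16 bytes.
typedef unsigned int v4u __attribute__((vector_size(16)));

static_assert(sizeof(A) == sizeof(v4u), "A must fill exactly one vector");

// Hypothetical hand-vectorized counterpart of test().
// Assumes y2 does not alias x, because y2 is read before x is written.
void test_vec(A& x, A const& y1, A const& y2)
{
    v4u vx, v1, v2;
    std::memcpy(&vx, &x, sizeof vx);   // unaligned load of x
    std::memcpy(&v1, &y1, sizeof v1);  // unaligned load of y1
    std::memcpy(&v2, &y2, sizeof v2);  // unaligned load of y2
    vx = vx + v1 - v2;                 // lane-wise add, then subtract
    std::memcpy(&x, &vx, sizeof vx);   // unaligned store back to x
}
```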

* [Bug tree-optimization/99971] GCC generates partially vectorized and scalar code at once
From: rguenth at gcc dot gnu.org @ 2021-04-09  7:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
           Keywords|                              |missed-optimization
           Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
     Ever confirmed|0                             |1
   Last reconfirmed|                              |2021-04-09
             Status|UNCONFIRMED                   |ASSIGNED

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed.  While we manage to analyze for the "perfect" solution, we fail
because dependence testing doesn't handle a piece, and this throws away half
of the vectorization.  We do actually see that we'll retain the scalar
loads and computations, but doing two vector loads, a vector add and a vector
store still seems cheaper than doing four scalar stores:

0x1fdb5a0 x_2(D)->a 1 times unaligned_load (misalign -1) costs 12 in body
0x1fdb5a0 y1_3(D)->a 1 times unaligned_load (misalign -1) costs 12 in body
0x1fdb5a0 _13 + _14 1 times vector_stmt costs 4 in body
0x1fdb5a0 _15 1 times unaligned_store (misalign -1) costs 12 in body
0x1fddcb0 _15 1 times scalar_store costs 12 in body
0x1fddcb0 _18 1 times scalar_store costs 12 in body
0x1fddcb0 _21 1 times scalar_store costs 12 in body
0x1fddcb0 _24 1 times scalar_store costs 12 in body
t.C:28:1: note:  Cost model analysis:
  Vector inside of basic block cost: 40
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar cost of basic block: 48
t.C:28:1: note:  Basic block will be vectorized using SLP
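
The totals in the dump above can be re-derived from its per-statement lines.
The sketch below is only that arithmetic, with the costs copied from the
dump; the constant names are made up for the example and are not GCC
identifiers.

```cpp
// Per-statement costs copied from the dump above.
constexpr int unaligned_load  = 12;
constexpr int vector_stmt     = 4;
constexpr int unaligned_store = 12;
constexpr int scalar_store    = 12;

// Vector side: loads of x and y1, one vector add, one vector store.
constexpr int vector_cost = 2 * unaligned_load + vector_stmt + unaligned_store;

// Scalar side: only the four scalar stores are counted.
constexpr int scalar_cost = 4 * scalar_store;

static_assert(vector_cost == 40, "Vector inside of basic block cost");
static_assert(scalar_cost == 48, "Scalar cost of basic block");
static_assert(vector_cost < scalar_cost, "vector form is considered cheaper");
```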

now, fortunately GCC 11 will improve on this [a bit] and we'll produce

_Z4testR1ARKS_S2_:
.LFB2:
        .cfi_startproc
        movdqu  (%rsi), %xmm0
        movdqu  (%rdi), %xmm1
        paddd   %xmm1, %xmm0
        movups  %xmm0, (%rdi)
        movd    %xmm0, %eax
        subl    (%rdx), %eax
        movl    %eax, (%rdi)
        pextrd  $1, %xmm0, %eax
        subl    4(%rdx), %eax
        movl    %eax, 4(%rdi)
        pextrd  $2, %xmm0, %eax
        subl    8(%rdx), %eax
        movl    %eax, 8(%rdi)
        pextrd  $3, %xmm0, %eax
        subl    12(%rdx), %eax
        movl    %eax, 12(%rdi)
        ret

which is not re-doing the scalar loads/adds but instead uses the vector
result.  Still the same dependence issue is present:

t.C:16:11: missed:   can't determine dependence between y1_3(D)->b and x_2(D)->a
t.C:16:11: note:  removing SLP instance operations starting from: x_2(D)->a = _6;

the scalar code before vectorization looks like

  <bb 2> [local count: 1073741824]:
  _13 = x_2(D)->a;
  _14 = y1_3(D)->a;
  _15 = _13 + _14;
  x_2(D)->a = _15;
  _16 = x_2(D)->b;
  _17 = y1_3(D)->b;  <---
  _18 = _16 + _17;
  x_2(D)->b = _18;
  _19 = x_2(D)->c;
  _20 = y1_3(D)->c;
  _21 = _19 + _20;
  x_2(D)->c = _21;
  _22 = x_2(D)->d;
  _23 = y1_3(D)->d;
  _24 = _22 + _23;
  x_2(D)->d = _24;
  _5 = y2_4(D)->a;
  _6 = _15 - _5;
  x_2(D)->a = _6;  <---
  _7 = y2_4(D)->b;
  _8 = _18 - _7;
  x_2(D)->b = _8;
  _9 = y2_4(D)->c;
  _10 = _21 - _9;
  x_2(D)->c = _10;
  _11 = y2_4(D)->d;
  _12 = _24 - _11;
  x_2(D)->d = _12;
  return;


Using

void test(A& __restrict x, A const& y1, A const& y2)
{
    x += y1;
    x -= y2;
}

produces optimal assembly even with GCC 10:

_Z4testR1ARKS_S2_:
.LFB2:
        .cfi_startproc
        movdqu  (%rsi), %xmm0
        movdqu  (%rdx), %xmm1
        movdqu  (%rdi), %xmm2
        psubd   %xmm1, %xmm0
        paddd   %xmm2, %xmm0
        movups  %xmm0, (%rdi)
        ret

note that I think we should be able to handle the dependences even without
the __restrict annotation.
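
As background on why dependences matter here: without __restrict, y1 and y2
are allowed to refer to the same object as x. The sketch below spells out the
scalar semantics that any vectorized form has to preserve; apply is a
stand-in name for the test function, and struct A is repeated in compact form
to keep the example self-contained.

```cpp
struct A
{
    unsigned int a, b, c, d;

    A& operator+= (A const& that)
    {
        a += that.a; b += that.b; c += that.c; d += that.d;
        return *this;
    }

    A& operator-= (A const& that)
    {
        a -= that.a; b -= that.b; c -= that.c; d -= that.d;
        return *this;
    }
};

// Same body as test(), without __restrict: x, y1 and y2 may alias.
void apply(A& x, A const& y1, A const& y2)
{
    x += y1;
    x -= y2;  // if y2 is x, this reads the values stored by the line above
}
```

With three distinct objects each member ends up as x + y1 - y2. With
apply(x, y1, x), each member is subtracted from itself after the addition, so
x ends up all zeros.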

* [Bug tree-optimization/99971] GCC generates partially vectorized and scalar code at once
From: andysem at mail dot ru @ 2021-04-15  9:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971

--- Comment #3 from andysem at mail dot ru ---
I tried adding __restrict__ to the equivalents of x, y1 and y2 in the original
larger code base and it didn't help. The compiler (gcc 10.2) would still
generate the same half-vectorized code.

* [Bug tree-optimization/99971] GCC generates partially vectorized and scalar code at once
From: rguenth at gcc dot gnu.org @ 2021-04-15 11:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to andysem from comment #3)
> I tried adding __restrict__ to the equivalents of x, y1 and y2 in the
> original larger code base and it didn't help. The compiler (gcc 10.2) would
> still generate the same half-vectorized code.

Hmm, that's odd.  I suppose the equivalent of test() was inlined in the
larger code base?

I'd be interested in preprocessed source of a translation unit that exhibits
this issue (and a pointer to the point in the source that is relevant).

Note for GCC 12 I have a patch to improve things w/o requiring the use
of __restrict (and I'm curious whether that helps for the larger code base).

* [Bug tree-optimization/99971] GCC generates partially vectorized and scalar code at once
From: rguenth at gcc dot gnu.org @ 2021-04-15 11:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #4)
> (In reply to andysem from comment #3)
> > I tried adding __restrict__ to the equivalents of x, y1 and y2 in the
> > original larger code base and it didn't help. The compiler (gcc 10.2) would
> > still generate the same half-vectorized code.
> 
> Hmm, that's odd.  I suppose the equivalent of test() was inlined in the
> larger code base?
> 
> I'd be interested in preprocessed source of a translation unit that exhibits
> this issue (and a pointer to the point in the source that is relevant).
> 
> Note for GCC 12 I have a patch to improve things w/o requiring the use
> of __restrict (and I'm curious on whether that helps for the larger code
> base).

https://gcc.gnu.org/pipermail/gcc-patches/2021-April/567805.html

is the patch which applies to current master.

* [Bug tree-optimization/99971] GCC generates partially vectorized and scalar code at once
From: andysem at mail dot ru @ 2021-04-15 16:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971

--- Comment #6 from andysem at mail dot ru ---
Hmm, it looks like the original code has changed enough so that the problem no
longer reproduces, with or without __restrict__. I don't have the older version
of the code, so I can't tell what changed exactly. Data alignment most probably
did change, but data layout of A (its equivalent in the original code) as well
as the operation on it certainly didn't. Sorry for the confusion.

* [Bug tree-optimization/99971] GCC generates partially vectorized and scalar code at once
From: david.bolvansky at gmail dot com @ 2021-04-15 23:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971

Dávid Bolvanský <david.bolvansky at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |david.bolvansky at gmail dot com

--- Comment #7 from Dávid Bolvanský <david.bolvansky at gmail dot com> ---
Still bad for -O3 -march=skylake-avx512

https://godbolt.org/z/azb8aTG43

* [Bug tree-optimization/99971] GCC generates partially vectorized and scalar code at once
From: cvs-commit at gcc dot gnu.org @ 2021-04-23  7:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971

--- Comment #8 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:700e542971251b11623cce877075567815f72965

commit r12-79-g700e542971251b11623cce877075567815f72965
Author: Richard Biener <rguenther@suse.de>
Date:   Fri Apr 9 09:35:51 2021 +0200

    tree-optimization/99971 - improve BB vect dependence analysis

    We can use TBAA even when we have a DR, do so.  For the testcase
    that means fully vectorizing it instead of only vectorizing
    the first store group resulting in suboptimal code.

    2021-04-09  Richard Biener  <rguenther@suse.de>

            PR tree-optimization/99971
            * tree-vect-data-refs.c (vect_slp_analyze_node_dependences):
            Always use TBAA for loads.

            * g++.dg/vect/slp-pr99971.cc: New testcase.
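
For readers unfamiliar with TBAA (type-based alias analysis), the sketch
below shows the general idea only. It is not taken from the patch or from the
new testcase, and the function name is hypothetical. Under the language's
aliasing rules a store through a float lvalue cannot modify an int object, so
the compiler may treat the two accesses as independent.

```cpp
// Illustration of type-based alias analysis in general, not of the patch.
// The store through f is assumed not to modify *i, so the compiler may
// return the value it just stored without reloading *i.
int store_then_reload(int* i, float* f)
{
    *i = 1;
    *f = 2.0f;
    return *i;
}
```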

* [Bug tree-optimization/99971] GCC generates partially vectorized and scalar code at once
From: rguenth at gcc dot gnu.org @ 2021-04-23  7:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED
      Known to work|                            |12.0

--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
Fixed for GCC 12.

* [Bug tree-optimization/99971] GCC generates partially vectorized and scalar code at once
From: andysem at mail dot ru @ 2021-04-23  8:43 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971

--- Comment #10 from andysem at mail dot ru ---
Thanks. Will this be backported to 10 and 11 branches?

* [Bug tree-optimization/99971] GCC generates partially vectorized and scalar code at once
From: rguenther at suse dot de @ 2021-04-23  9:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971

--- Comment #11 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 23 Apr 2021, andysem at mail dot ru wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971
> 
> --- Comment #10 from andysem at mail dot ru ---
> Thanks. Will this be backported to 10 and 11 branches?

I don't plan to, since it isn't a regression as far as I know.  It
doesn't apply to GCC 10, so definitely not there.  I'll consider it
for GCC 11.
