public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug tree-optimization/99149] New: [11 Regression] ICE during vectorization when shared trees contain different complex patterns
@ 2021-02-18 13:15 tnfchris at gcc dot gnu.org
  2021-02-18 13:24 ` [Bug tree-optimization/99149] " rguenth at gcc dot gnu.org
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2021-02-18 13:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99149

            Bug ID: 99149
           Summary: [11 Regression] ICE during vectorization when shared
                    trees contain different complex patterns
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: tnfchris at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64-*

The following testcase

class a {
  float b;
  float c;

public:
  a(float d, float e) : b(d), c(e) {}
  a operator+(a d) { return a(b + d.b, c + d.c); }
  a operator*(a d) { return a(b * b - c * c, b * c + c * d.b); }
};
int f, g;
class {
  a *h;
  a *i;

public:
  void j() {
    a k = h[0], l = i[g], m = k * i[f];
    i[g] = l + m;
    i[f] = m;
  }
} n;
main() { n.j(); }

crashes with aarch64-none-elf-g++ -w -march=armv8.3-a -O3 -S main.cpp

At -O3 there are two SLP trees created, but they share partially part of a
complex pattern:

   Final SLP tree for instance 0x559a7e0:
   node 0x56f5520 (max_nunits=2, refcnt=2)
   op template: MEM <float> [(struct a *)_16] = _20;
           stmt 0 MEM <float> [(struct a *)_16] = _20;
           stmt 1 MEM <float> [(struct a *)_16 + 4B] = _23;
           children 0x56f5058
   node 0x56f5058 (max_nunits=2, refcnt=3)
   op: VEC_PERM_EXPR
           stmt 0 _20 = _18 - _19;
           stmt 1 _23 = _21 + _22;
           lane permutation { 0[0] 1[1] }
           children 0x56f5410 0x56f5498
   node 0x56f5410 (max_nunits=1, refcnt=1)
   op template: _20 = _18 - _19;
           { }
           children 0x56f50e0 0x56f5278
   node 0x56f50e0 (max_nunits=2, refcnt=3)
   op template: _18 = k$b_4 * k$b_4;
           stmt 0 _18 = k$b_4 * k$b_4;
           stmt 1 _21 = k$b_4 * k$c_5;
           children 0x56f5168 0x56f51f0
   node 0x56f5168 (max_nunits=2, refcnt=2)
   op template: k$b_4 = MEM[(const struct a &)_3].b;
           stmt 0 k$b_4 = MEM[(const struct a &)_3].b;
           stmt 1 k$b_4 = MEM[(const struct a &)_3].b;
           load permutation { 0 0 }
   node 0x56f51f0 (max_nunits=2, refcnt=2)
   op template: k$b_4 = MEM[(const struct a &)_3].b;
           stmt 0 k$b_4 = MEM[(const struct a &)_3].b;
           stmt 1 k$c_5 = MEM[(const struct a &)_3].c;
           load permutation { 0 1 }
   node 0x56f5278 (max_nunits=2, refcnt=3)
   op template: _19 = k$c_5 * k$c_5;
           stmt 0 _19 = k$c_5 * k$c_5;
           stmt 1 _22 = k$c_5 * d$b_17;
           children 0x56f5300 0x56f5388
   node 0x56f5300 (max_nunits=2, refcnt=2)
   op template: k$c_5 = MEM[(const struct a &)_3].c;
           stmt 0 k$c_5 = MEM[(const struct a &)_3].c;
           stmt 1 k$c_5 = MEM[(const struct a &)_3].c;
           load permutation { 1 1 }
   node (external) 0x56f5388 (max_nunits=1, refcnt=1)
           { k$c_5, d$b_17 }
   node 0x56f5498 (max_nunits=1, refcnt=1)
   op template: _23 = _21 + _22;
           { }
           children 0x56f50e0 0x56f5278

These dump contains two roots, one at 0x56f5520 and one at 0x5763380:

   Final SLP tree for instance 0x5763380:
   node 0x56f4ec0 (max_nunits=2, refcnt=2)
   op template: MEM <float> [(struct a *)_10] = _24;
           stmt 0 MEM <float> [(struct a *)_10] = _24;
           stmt 1 MEM <float> [(struct a *)_10 + 4B] = _25;
           children 0x56f4f48
   node 0x56f4f48 (max_nunits=2, refcnt=2)
   op template: _24 = l$b_11 + _20;
           stmt 0 _24 = l$b_11 + _20;
           stmt 1 _25 = l$c_12 + _23;
           children 0x56f4fd0 0x56f5058
   node 0x56f4fd0 (max_nunits=2, refcnt=2)
   op template: l$b_11 = MEM[(const struct a &)_10].b;
           stmt 0 l$b_11 = MEM[(const struct a &)_10].b;
           stmt 1 l$c_12 = MEM[(const struct a &)_10].c;
           load permutation { 0 1 }
   node 0x56f5058 (max_nunits=2, refcnt=2)
   op: VEC_PERM_EXPR
           stmt 0 _20 = _18 - _19;
           stmt 1 _23 = _21 + _22;
           lane permutation { 0[0] 1[1] }
           children 0x56f5410 0x56f5498
   node 0x56f5410 (max_nunits=1, refcnt=1)
   op template: _20 = _18 - _19;
           { }
           children 0x56f50e0 0x56f5278
   node 0x56f50e0 (max_nunits=2, refcnt=3)
   op template: _18 = k$b_4 * k$b_4;
           stmt 0 _18 = k$b_4 * k$b_4;
           stmt 1 _21 = k$b_4 * k$c_5;
           children 0x56f5168 0x56f51f0
   node 0x56f5168 (max_nunits=2, refcnt=2)
   op template: k$b_4 = MEM[(const struct a &)_3].b;
           stmt 0 k$b_4 = MEM[(const struct a &)_3].b;
           stmt 1 k$b_4 = MEM[(const struct a &)_3].b;
           load permutation { 0 0 }
   node 0x56f51f0 (max_nunits=2, refcnt=2)
   op template: k$b_4 = MEM[(const struct a &)_3].b;
           stmt 0 k$b_4 = MEM[(const struct a &)_3].b;
           stmt 1 k$c_5 = MEM[(const struct a &)_3].c;
           load permutation { 0 1 }
   node 0x56f5278 (max_nunits=2, refcnt=3)
   op template: _19 = k$c_5 * k$c_5;
           stmt 0 _19 = k$c_5 * k$c_5;
           stmt 1 _22 = k$c_5 * d$b_17;
           children 0x56f5300 0x56f5388
   node 0x56f5300 (max_nunits=2, refcnt=2)
   op template: k$c_5 = MEM[(const struct a &)_3].c;
           stmt 0 k$c_5 = MEM[(const struct a &)_3].c;
           stmt 1 k$c_5 = MEM[(const struct a &)_3].c;
           load permutation { 1 1 }
   node (external) 0x56f5388 (max_nunits=1, refcnt=1)
           { k$c_5, d$b_17 }
   node 0x56f5498 (max_nunits=1, refcnt=1)
   op template: _23 = _21 + _22;
           { }
           children 0x56f50e0 0x56f5278


It's a bit had to see so let's render the graphs
https://sketchviz.com/@Mistuke/3f10160785531d484f9eba0de186bf65/5cedb98ad6ee1e083c3bf9574dba1c679851412e

The graph rooted to the store in *_10 matches COMPLEX_FMA but that rooted in
the store to *_16 only matches COMPLEX_MUL.

In this case the COMPLEX_FMA shouldn't be recognized as they both share the
VEC_PERM_EXPR.

Disabling the COMPLEX_FMA pattern shows that this does then generate the right
code:

main:
        adrp    x1, .LANCHOR0
        add     x2, x1, :lo12:.LANCHOR0
        movi    v0.2s, 0
        mov     w0, 0
        ldr     x4, [x1, #:lo12:.LANCHOR0]
        ldrsw   x3, [x2, 16]
        ldr     x1, [x2, 8]
        ldr     d1, [x4]
        fcmla   v0.2s, v1.2s, v1.2s, #0
        fcmla   v0.2s, v1.2s, v1.2s, #90
        ldrsw   x2, [x2, 20]
        ldr     d1, [x1, x3, lsl 3]
        fadd    v1.2s, v1.2s, v0.2s
        str     d1, [x1, x3, lsl 3]
        str     d0, [x1, x2, lsl 3]
        ret

So the pattern matcher should probably check that all children it's going to
replace belong to the same instance. But I am not sure how.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [Bug tree-optimization/99149] [11 Regression] ICE during vectorization when shared trees contain different complex patterns
  2021-02-18 13:15 [Bug tree-optimization/99149] New: [11 Regression] ICE during vectorization when shared trees contain different complex patterns tnfchris at gcc dot gnu.org
@ 2021-02-18 13:24 ` rguenth at gcc dot gnu.org
  2021-02-24  9:44 ` cvs-commit at gcc dot gnu.org
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-02-18 13:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99149

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |ice-on-valid-code
   Target Milestone|---                         |11.0

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [Bug tree-optimization/99149] [11 Regression] ICE during vectorization when shared trees contain different complex patterns
  2021-02-18 13:15 [Bug tree-optimization/99149] New: [11 Regression] ICE during vectorization when shared trees contain different complex patterns tnfchris at gcc dot gnu.org
  2021-02-18 13:24 ` [Bug tree-optimization/99149] " rguenth at gcc dot gnu.org
@ 2021-02-24  9:44 ` cvs-commit at gcc dot gnu.org
  2021-02-24  9:46 ` tnfchris at gcc dot gnu.org
  2021-02-24 14:57 ` cvs-commit at gcc dot gnu.org
  3 siblings, 0 replies; 5+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2021-02-24  9:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99149

--- Comment #1 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Tamar Christina <tnfchris@gcc.gnu.org>:

https://gcc.gnu.org/g:5296bd57d0605d1fec900d85e3ab3875197e609a

commit r11-7355-g5296bd57d0605d1fec900d85e3ab3875197e609a
Author: Tamar Christina <tamar.christina@arm.com>
Date:   Wed Feb 24 09:43:22 2021 +0000

    slp: fix sharing of SLP only patterns.

    The attached testcase ICEs due to a couple of issues.
    In the testcase you have two SLP instances that share the majority of their
    definition with each other.  One tree defines a COMPLEX_MUL sequence and
the
    other tree a COMPLEX_FMA.

    The ice happens because:

    1. the refcounts are wrong, in particular the FMA case doesn't correctly
count
    the references for the COMPLEX_MUL that it consumes.

    2. when the FMA is created it incorrectly assumes it can just tear apart
the MUL
    node that it's consuming.  This is wrong and should only be done when there
is
    no more uses of the node, in which case the vector only pattern is no
longer
    relevant.

    To fix the last part the SLP only pattern reset code was moved into
    vect_free_slp_tree which results in cleaner code.  I also think it does
belong
    there since that function knows when there are no more uses of the node and
so
    the pattern should be unmarked, so when the the vectorizer is inspecting
the BB
    it doesn't find the now invalid vector only patterns.

    The patch also clears the SLP_TREE_REPRESENTATIVE when stores are removed
such
    that we don't hit an error later trying to free the stmt_vec_info again.

    Lastly it also tweaks the results of whether a pattern has been detected or
not
    to return true when another SLP instance has created a pattern that is only
used
    by a different instance (due to the trees being unshared).

    Instead of ICEing this code now produces

            adrp    x1, .LANCHOR0
            add     x2, x1, :lo12:.LANCHOR0
            movi    v1.2s, 0
            mov     w0, 0
            ldr     x4, [x1, #:lo12:.LANCHOR0]
            ldrsw   x3, [x2, 16]
            ldr     x1, [x2, 8]
            ldrsw   x2, [x2, 20]
            ldr     d0, [x4]
            ldr     d2, [x1, x3, lsl 3]
            fcmla   v2.2s, v0.2s, v0.2s, #0
            fcmla   v2.2s, v0.2s, v0.2s, #90
            str     d2, [x1, x3, lsl 3]
            fcmla   v1.2s, v0.2s, v0.2s, #0
            fcmla   v1.2s, v0.2s, v0.2s, #90
            str     d1, [x1, x2, lsl 3]
            ret

    PS. This testcase actually shows that the codegen we get in these cases is
not
    optimal. It should generate a MUL + ADD instead MUL + FMA.

    But that's for GCC 12.

    gcc/ChangeLog:

            PR tree-optimization/99149
            * tree-vect-slp-patterns.c (vect_detect_pair_op): Don't recreate
the
            buffer.
            (vect_slp_reset_pattern): Remove.
            (complex_fma_pattern::matches): Remove call to
vect_slp_reset_pattern.
            (complex_mul_pattern::build, complex_fma_pattern::build,
            complex_fms_pattern::build): Fix ref counts.
            * tree-vect-slp.c (vect_free_slp_tree): Undo SLP only pattern
relevancy
            when node is being deleted.
            (vect_match_slp_patterns_2): Correct result of cache hit on
patterns.
            (vect_schedule_slp): Invalidate SLP_TREE_REPRESENTATIVE of removed
            stores.
            * tree-vectorizer.c (vec_info::new_stmt_vec_info): Initialize
value.

    gcc/testsuite/ChangeLog:

            PR tree-optimization/99149
            * g++.dg/vect/pr99149.cc: New test.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [Bug tree-optimization/99149] [11 Regression] ICE during vectorization when shared trees contain different complex patterns
  2021-02-18 13:15 [Bug tree-optimization/99149] New: [11 Regression] ICE during vectorization when shared trees contain different complex patterns tnfchris at gcc dot gnu.org
  2021-02-18 13:24 ` [Bug tree-optimization/99149] " rguenth at gcc dot gnu.org
  2021-02-24  9:44 ` cvs-commit at gcc dot gnu.org
@ 2021-02-24  9:46 ` tnfchris at gcc dot gnu.org
  2021-02-24 14:57 ` cvs-commit at gcc dot gnu.org
  3 siblings, 0 replies; 5+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2021-02-24  9:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99149

Tamar Christina <tnfchris at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|UNCONFIRMED                 |RESOLVED

--- Comment #2 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
Fixed in master

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [Bug tree-optimization/99149] [11 Regression] ICE during vectorization when shared trees contain different complex patterns
  2021-02-18 13:15 [Bug tree-optimization/99149] New: [11 Regression] ICE during vectorization when shared trees contain different complex patterns tnfchris at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2021-02-24  9:46 ` tnfchris at gcc dot gnu.org
@ 2021-02-24 14:57 ` cvs-commit at gcc dot gnu.org
  3 siblings, 0 replies; 5+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2021-02-24 14:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99149

--- Comment #3 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Tamar Christina <tnfchris@gcc.gnu.org>:

https://gcc.gnu.org/g:96c5a8589e0dd22819bae34cac3381ad381d3989

commit r11-7358-g96c5a8589e0dd22819bae34cac3381ad381d3989
Author: Tamar Christina <tamar.christina@arm.com>
Date:   Wed Feb 24 14:57:08 2021 +0000

    [comitted] Testsuite: Disable PR99149 test on big-endian

    This patch disables the test for PR99149 on Big-endian
    where for standard AArch64 the patterns are disabled.

    gcc/testsuite/ChangeLog:

            PR tree-optimization/99149
            * g++.dg/vect/pr99149.cc: Disabled on BE.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2021-02-24 14:57 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-18 13:15 [Bug tree-optimization/99149] New: [11 Regression] ICE during vectorization when shared trees contain different complex patterns tnfchris at gcc dot gnu.org
2021-02-18 13:24 ` [Bug tree-optimization/99149] " rguenth at gcc dot gnu.org
2021-02-24  9:44 ` cvs-commit at gcc dot gnu.org
2021-02-24  9:46 ` tnfchris at gcc dot gnu.org
2021-02-24 14:57 ` cvs-commit at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).