public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/98113] New: [11 Regression] popcnt is not vectorized on s390 since f5e18dd9c7da
@ 2020-12-03  1:28 iii at linux dot ibm.com
  2020-12-03  2:44 ` [Bug tree-optimization/98113] " linkw at gcc dot gnu.org
                   ` (7 more replies)
  0 siblings, 8 replies; 9+ messages in thread
From: iii at linux dot ibm.com @ 2020-12-03  1:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98113

            Bug ID: 98113
           Summary: [11 Regression] popcnt is not vectorized on s390 since
                    f5e18dd9c7da
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: iii at linux dot ibm.com
  Target Milestone: ---

s390's vxe/popcount-1.c began to fail after the PR96789 fix.

The reason is that for the following source code

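/* Typedef not shown in the report; implied by the unsigned int[4] views
   in the dumps below.  */
typedef unsigned int uv4si __attribute__((vector_size (16)));

uv4si __attribute__((noinline))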
vpopctf (uv4si a)
{
  uv4si r;
  int i;

  for (i = 0; i < 4; i++)
    r[i] = __builtin_popcount (a[i]);

  return r;
}

FRE turned

  _4 = BIT_FIELD_REF <aD.2283, 32, 0>;
  _11 = __builtin_popcountD.1211 (_4);
  _18 = (unsigned intD.9) _11;
  BIT_FIELD_REF <rD.2286, 32, 0> = _18;
  i_20 = 1;
  ivtmp_21 = 3;
  _25 = VIEW_CONVERT_EXPR<unsigned intD.9[4]>(aD.2283)[i_20];
  _26 = __builtin_popcountD.1211 (_25);
  _27 = (unsigned intD.9) _26;
  VIEW_CONVERT_EXPR<unsigned intD.9[4]>(rD.2286)[i_20] = _27;
  i_29 = i_20 + 1;
  ivtmp_30 = ivtmp_21 + 4294967295;
  _34 = VIEW_CONVERT_EXPR<unsigned intD.9[4]>(aD.2283)[i_29];
  _35 = __builtin_popcountD.1211 (_34);
  _36 = (unsigned intD.9) _35;
  VIEW_CONVERT_EXPR<unsigned intD.9[4]>(rD.2286)[i_29] = _36;
  i_38 = i_29 + 1;
  ivtmp_39 = ivtmp_30 + 4294967295;
  _1 = VIEW_CONVERT_EXPR<unsigned intD.9[4]>(aD.2283)[i_38];
  _2 = __builtin_popcountD.1211 (_1);
  _3 = (unsigned intD.9) _2;
  VIEW_CONVERT_EXPR<unsigned intD.9[4]>(rD.2286)[i_38] = _3;
  i_10 = i_38 + 1;
  ivtmp_16 = ivtmp_39 + 4294967295;
  _7 = rD.2286;
  rD.2286 ={v} {CLOBBER};
  return _7;

into

  _4 = BIT_FIELD_REF <a_17(D), 32, 0>;
  _11 = __builtin_popcountD.1211 (_4);
  _18 = (unsigned intD.9) _11;
  r_14 = BIT_INSERT_EXPR <r_15(D), _18, 0 (32 bits)>;
  _25 = BIT_FIELD_REF <a_17(D), 32, 32>;
  _26 = __builtin_popcountD.1211 (_25);
  _27 = (unsigned intD.9) _26;
  r_33 = BIT_INSERT_EXPR <r_14, _27, 32 (32 bits)>;
  _34 = BIT_FIELD_REF <a_17(D), 32, 64>;
  _35 = __builtin_popcountD.1211 (_34);
  _36 = (unsigned intD.9) _35;
  r_32 = BIT_INSERT_EXPR <r_33, _36, 64 (32 bits)>;
  _1 = BIT_FIELD_REF <a_17(D), 32, 96>;
  _2 = __builtin_popcountD.1211 (_1);
  _3 = (unsigned intD.9) _2;
  r_31 = BIT_INSERT_EXPR <r_32, _3, 96 (32 bits)>;
  _7 = r_31;
  return _7;

that is, replaced a sequence of stores with a sequence of
BIT_INSERT_EXPRs.

slp1 now says: "missed:  not vectorized: no grouped stores in basic
block", presumably because it doesn't understand BIT_INSERT_EXPRs.
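
To make the transformation above concrete, here is a minimal sketch in plain C of what a chain of 32-bit BIT_INSERT_EXPRs computes, next to the equivalent single vector constructor. The helper names (bit_insert_32, popcount_by_inserts, popcount_by_ctor, vectors_equal) are invented for illustration and are not GCC internals; the zero initializer merely stands in for the undefined r_15(D).

```c
#include <assert.h>
#include <string.h>

typedef unsigned int uv4si __attribute__ ((vector_size (16)));

/* Model of BIT_INSERT_EXPR <vec, val, bitpos (32 bits)>: return a copy of
   VEC with the 32-bit lane at bit offset BITPOS replaced by VAL.  */
static uv4si
bit_insert_32 (uv4si vec, unsigned int val, unsigned int bitpos)
{
  assert (bitpos % 32 == 0 && bitpos < 128);
  vec[bitpos / 32] = val;
  return vec;
}

/* The shape FRE produces: four inserts, each feeding the next.  */
static uv4si
popcount_by_inserts (uv4si a)
{
  uv4si r = { 0, 0, 0, 0 };	/* stand-in for the undefined r_15(D) */
  r = bit_insert_32 (r, (unsigned int) __builtin_popcount (a[0]), 0);
  r = bit_insert_32 (r, (unsigned int) __builtin_popcount (a[1]), 32);
  r = bit_insert_32 (r, (unsigned int) __builtin_popcount (a[2]), 64);
  r = bit_insert_32 (r, (unsigned int) __builtin_popcount (a[3]), 96);
  return r;
}

/* The same value written as one vector constructor.  */
static uv4si
popcount_by_ctor (uv4si a)
{
  uv4si r = { (unsigned int) __builtin_popcount (a[0]),
	      (unsigned int) __builtin_popcount (a[1]),
	      (unsigned int) __builtin_popcount (a[2]),
	      (unsigned int) __builtin_popcount (a[3]) };
  return r;
}

/* Bytewise comparison of two vectors.  */
static int
vectors_equal (uv4si x, uv4si y)
{
  return memcmp (&x, &y, sizeof x) == 0;
}
```

Because every lane is overwritten exactly once, the initial contents of r do not matter, which is why the insert chain and the constructor are interchangeable here.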

* [Bug tree-optimization/98113] [11 Regression] popcnt is not vectorized on s390 since f5e18dd9c7da
From: linkw at gcc dot gnu.org @ 2020-12-03  2:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98113

Kewen Lin <linkw at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenther at suse dot de
   Last reconfirmed|                            |2020-12-03
             Status|UNCONFIRMED                 |ASSIGNED
     Ever confirmed|0                           |1

--- Comment #1 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to Ilya Leoshkevich from comment #0)
> s390's vxe/popcount-1.c began to fail after the PR96789 fix.

Sorry to see this regression.

...

> 
> that is, replaced a sequence of stores with a sequence of
> BIT_INSERT_EXPRs.
> 
> slp1 now says: "missed:  not vectorized: no grouped stores in basic
> block", presumably because it doesn't understand BIT_INSERT_EXPRs.

Yes, currently an SLP instance is built from grouped stores (or a vector CTOR);
I expect Richi's ongoing SLP rework will extend it to support grouped loads.
For this BIT_INSERT_EXPR case, I guess we can extend the vector CTOR handling
to cover it, provided the BIT_INSERT_EXPR chain covers exactly all lanes, which
makes it equivalent to a vector CTOR.

CC @Richi. Hi Richi, what do you think of this?
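
The condition suggested above (the chain covers all lanes and nothing else) could be checked roughly as follows. This is only a sketch: struct lane_insert and chain_covers_all_lanes are hypothetical names, and requiring each lane to be written exactly once is one reading of "only and fully", not necessarily what the vectorizer ends up implementing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of one BIT_INSERT_EXPR in a chain: where the
   scalar is inserted and how wide it is, both in bits.  */
struct lane_insert
{
  unsigned bitpos;
  unsigned bitsize;
};

/* Return true if the N inserts in CHAIN write each of the NLANES lanes
   of LANE_BITS bits exactly once, i.e. the chain could be replaced by a
   vector constructor.  Supports at most 64 lanes.  */
static bool
chain_covers_all_lanes (const struct lane_insert *chain, size_t n,
			unsigned nlanes, unsigned lane_bits)
{
  unsigned long long seen = 0;

  if (lane_bits == 0 || nlanes == 0 || nlanes > 64 || n != nlanes)
    return false;
  assert (chain != NULL);

  for (size_t i = 0; i < n; i++)
    {
      /* Only whole, aligned lanes qualify.  */
      if (chain[i].bitsize != lane_bits || chain[i].bitpos % lane_bits != 0)
	return false;

      unsigned lane = chain[i].bitpos / lane_bits;
      if (lane >= nlanes || (seen & (1ULL << lane)) != 0)
	return false;
      seen |= 1ULL << lane;
    }

  /* N distinct in-range lanes with N == NLANES means all are covered.  */
  return true;
}
```

For the chain in the report (offsets 0, 32, 64, 96, each 32 bits wide, on a 4-lane vector) the check succeeds; a chain that writes lane 1 twice or skips lane 3 is rejected.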

* [Bug tree-optimization/98113] [11 Regression] popcnt is not vectorized on s390 since f5e18dd9c7da
From: linkw at gcc dot gnu.org @ 2020-12-03  3:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98113

--- Comment #2 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to Kewen Lin from comment #1)
> (In reply to Ilya Leoshkevich from comment #0)
> > s390's vxe/popcount-1.c began to fail after the PR96789 fix.
> 
> Sorry to see this regression.
> 
> ...
> 
> > 
> > that is, replaced a sequence of stores with a sequence of
> > BIT_INSERT_EXPRs.
> > 
> > slp1 now says: "missed:  not vectorized: no grouped stores in basic
> > block", presumably because it doesn't understand BIT_INSERT_EXPRs.
> 
> Yes, currently an SLP instance is built from grouped stores (or a vector CTOR);
> I expect Richi's ongoing SLP rework will extend it to support grouped loads.
> For this BIT_INSERT_EXPR case, I guess we can extend the vector CTOR handling
> to cover it, provided the BIT_INSERT_EXPR chain covers exactly all lanes, which
> makes it equivalent to a vector CTOR.
> 
> CC @Richi. Hi Richi, what do you think of this?

Another idea would be to teach FRE to turn such a BIT_INSERT_EXPR chain into a
vector CTOR?

* [Bug tree-optimization/98113] [11 Regression] popcnt is not vectorized on s390 since f5e18dd9c7da
From: rguenth at gcc dot gnu.org @ 2020-12-03  7:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98113

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot gnu.org
   Target Milestone|---                         |11.0
             Target|                            |x86_64-*-* s390x-*-*
           Keywords|                            |missed-optimization

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
The most straightforward approach would be to treat

  r_14 = BIT_INSERT_EXPR <r_15(D), _18, 0 (32 bits)>;
  r_33 = BIT_INSERT_EXPR <r_14, _27, 32 (32 bits)>;
  r_32 = BIT_INSERT_EXPR <r_33, _36, 64 (32 bits)>;
  r_31 = BIT_INSERT_EXPR <r_32, _3, 96 (32 bits)>;

itself as an SLP source, much like we look for CTORs as SLP sources.  Note the
transformed load is an extra complication, but at least I added support for
SLP of existing vectors.

Also regresses on x86_64.

I'll see whether I can cook up something.
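
As a rough illustration of treating the insert chain as an SLP source: starting from the last insert, one can walk the chain of vector operands backwards and record which scalar lands in each lane, giving the same lane list a CTOR would provide. The data structure and function below are invented stand-ins, not the vectorizer's own types, and scalar ids are assumed to be non-negative.

```c
#include <assert.h>

#define NLANES 4

/* Hypothetical stand-in for one statement
     def = BIT_INSERT_EXPR <base, value, lane * 32 (32 bits)>
   BASE is the index of the statement defining the input vector, or -1
   when the input is not itself a lane insert (e.g. r_15(D)).  */
struct insert_stmt
{
  int base;
  int value;		/* id of the inserted scalar, must be >= 0 */
  unsigned lane;
};

/* Walk from the last insert ROOT back through the BASE links and record
   in LANES the scalar that ends up in each lane; a later insert to the
   same lane wins.  Unwritten lanes are left as -1.  Return the number
   of lanes filled.  */
static unsigned
collect_lanes (const struct insert_stmt *stmts, int root, int lanes[NLANES])
{
  unsigned filled = 0;

  for (unsigned i = 0; i < NLANES; i++)
    lanes[i] = -1;

  for (int s = root; s >= 0; s = stmts[s].base)
    {
      unsigned lane = stmts[s].lane;
      assert (lane < NLANES && stmts[s].value >= 0);
      if (lanes[lane] == -1)
	{
	  lanes[lane] = stmts[s].value;
	  filled++;
	}
    }
  return filled;
}
```

For the four statements quoted above, encoded as { -1, 18, 0 }, { 0, 27, 1 }, { 1, 36, 2 }, { 2, 3, 3 } with root index 3, the walk fills all four lanes with the scalars _18, _27, _36 and _3 in lane order.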

* [Bug tree-optimization/98113] [11 Regression] popcnt is not vectorized on s390 since f5e18dd9c7da
From: rguenth at gcc dot gnu.org @ 2020-12-03  9:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98113

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 49668
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49668&action=edit
prototype

So like this (some correctness verification is missing as well as more general
matching on the def side).  SLP discovery works but it somehow fails to
vectorize the popcount call:

t.i:12:10: note:   Final SLP tree for instance 0x3af32e0:
t.i:12:10: note:   node 0x3a0f6f0 (max_nunits=4, refcnt=2)
t.i:12:10: note:   op template: _18 = (unsigned int) _4;
t.i:12:10: note:        stmt 0 _18 = (unsigned int) _4;
t.i:12:10: note:        stmt 1 _27 = (unsigned int) _26;
t.i:12:10: note:        stmt 2 _36 = (unsigned int) _35;
t.i:12:10: note:        stmt 3 _3 = (unsigned int) _2;
t.i:12:10: note:        children 0x3a0f778
t.i:12:10: note:   node 0x3a0f778 (max_nunits=4, refcnt=2)
t.i:12:10: note:   op template: _4 = __builtin_popcount (_5);
t.i:12:10: note:        stmt 0 _4 = __builtin_popcount (_5);
t.i:12:10: note:        stmt 1 _26 = __builtin_popcount (_25);
t.i:12:10: note:        stmt 2 _35 = __builtin_popcount (_34);
t.i:12:10: note:        stmt 3 _2 = __builtin_popcount (_1);
t.i:12:10: note:        children 0x3a0f800
t.i:12:10: note:   node 0x3a0f800 (max_nunits=1, refcnt=2)
t.i:12:10: note:   op: VEC_PERM_EXPR
t.i:12:10: note:        stmt 0 _5 = BIT_FIELD_REF <a_17(D), 32, 0>;
t.i:12:10: note:        stmt 1 _25 = BIT_FIELD_REF <a_17(D), 32, 32>;
t.i:12:10: note:        stmt 2 _34 = BIT_FIELD_REF <a_17(D), 32, 64>;
t.i:12:10: note:        stmt 3 _1 = BIT_FIELD_REF <a_17(D), 32, 96>;
t.i:12:10: note:        lane permutation { 0[0] 0[1] 0[2] 0[3] }
t.i:12:10: note:        children 0x3a0f888
t.i:12:10: note:   node (external) 0x3a0f888 (max_nunits=1, refcnt=1)
t.i:12:10: note:        { }
...
t.i:10:10: note:   ==> examining statement: _4 = __builtin_popcount (_5);
t.i:10:10: note:   vect_is_simple_use: operand BIT_FIELD_REF <a_17(D), 32, 0>, type of def: internal
t.i:10:10: missed:   function is not vectorizable.
t.i:10:12: missed:   not vectorized: relevant stmt not supported: _4 = __builtin_popcount (_5);
t.i:10:10: note:   Building vector operands of 0x3a0f778 from scalars instead

* [Bug tree-optimization/98113] [11 Regression] popcnt is not vectorized on s390 since f5e18dd9c7da
From: rguenth at gcc dot gnu.org @ 2020-12-03 10:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98113

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #49668|0                           |1
        is obsolete|                            |

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 49669
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49669&action=edit
patch

Needs -mavx512vpopcntdq for x86.  The attached is a complete patch; costing
still (as for CTORs) fails to account for the root stmts in the scalar cost
part.

The testcase is probably going to FAIL too much; there's no effective target
keyword for popcntsi vectorization.

I've made it a non-loop as well, since on x86 loop vectorization handles it
otherwise.
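
For readers who want to try this locally, a straight-line variant of the function from comment #0 might look like the sketch below. It is not the committed gcc.dg/vect/bb-slp-70.c, whose contents are not shown in this thread; the names vpopctf_noloop and check_vpopctf_noloop are made up, and the runtime check only confirms the result, not that vectorization happened.

```c
#include <assert.h>

typedef unsigned int uv4si __attribute__ ((vector_size (16)));

/* Same computation as vpopctf from the report, but without a loop, so
   only basic-block (SLP) vectorization can turn it into a vector
   popcount.  */
uv4si __attribute__ ((noinline))
vpopctf_noloop (uv4si a)
{
  uv4si r;

  r[0] = __builtin_popcount (a[0]);
  r[1] = __builtin_popcount (a[1]);
  r[2] = __builtin_popcount (a[2]);
  r[3] = __builtin_popcount (a[3]);

  return r;
}

/* Runtime sanity check of the per-lane results.  */
void
check_vpopctf_noloop (void)
{
  uv4si a = { 0u, 1u, 0xffu, 0xffffffffu };
  uv4si r = vpopctf_noloop (a);

  assert (r[0] == 0 && r[1] == 1 && r[2] == 8 && r[3] == 32);
}
```

Whether the function was actually vectorized has to be read from the generated assembly or the slp dump; on x86 that additionally needs -mavx512vpopcntdq, as noted above.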

* [Bug tree-optimization/98113] [11 Regression] popcnt is not vectorized on s390 since f5e18dd9c7da
From: iii at linux dot ibm.com @ 2020-12-03 13:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98113

--- Comment #6 from Ilya Leoshkevich <iii at linux dot ibm.com> ---
With the patch, vxe/popcount-1.c works on s390 again:

vpopctf:
.LFB2:
        .cfi_startproc
        vpopctf %v24,%v24
        br      %r14

Thanks!

* [Bug tree-optimization/98113] [11 Regression] popcnt is not vectorized on s390 since f5e18dd9c7da
From: cvs-commit at gcc dot gnu.org @ 2020-12-07 11:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98113

--- Comment #7 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:ebdfd1606da6b5aa586b0cd156b69b659235c9c2

commit r11-5821-gebdfd1606da6b5aa586b0cd156b69b659235c9c2
Author: Richard Biener <rguenther@suse.de>
Date:   Thu Dec 3 10:25:14 2020 +0100

    tree-optimization/98113 - vectorize a sequence of BIT_INSERT_EXPRs

    This adds the capability to handle a sequence of vector BIT_INSERT_EXPRs
    to be vectorized similar as to how we vectorize vector constructors.

    2020-12-03  Richard Biener  <rguenther@suse.de>

            PR tree-optimization/98113
            * tree-vectorizer.h (struct slp_root): New.
            (_bb_vec_info::roots): New member.
            * tree-vect-slp.c (vect_analyze_slp): Also walk BB info
            roots.
            (_bb_vec_info::_bb_vec_info): Adjust.
            (_bb_vec_info::~_bb_vec_info): Likewise.
            (vld_cmp): New.
            (vect_slp_is_lane_insert): Likewise.
            (vect_slp_check_for_constructors): Match a series of
            BIT_INSERT_EXPRs as vector constructor.
            (vect_slp_analyze_bb_1): Continue if BB info roots is
            not empty.
            (vect_slp_analyze_bb_1): Mark the whole BIT_INSERT_EXPR root
            sequence as pure_slp.

            * gcc.dg/vect/bb-slp-70.c: New testcase.

* [Bug tree-optimization/98113] [11 Regression] popcnt is not vectorized on s390 since f5e18dd9c7da
From: rguenth at gcc dot gnu.org @ 2020-12-07 11:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98113

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
Fixed.
