[Bug middle-end/99409] New: s252 benchmark of TSVC is vectorized by clang and not by gcc

From: hubicka at gcc dot gnu.org @ 2021-03-05 14:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99409

            Bug ID: 99409
           Summary: s252 benchmark of TSVC is vectorized by clang and not
                    by gcc
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

typedef float real_t;
#define iterations 100000
#define LEN_1D 32000
#define LEN_2D 256
real_t a[LEN_1D],b[LEN_1D],c[LEN_1D],d[LEN_1D],e[LEN_1D];

void main()
{

//    scalar and array expansion
//    loop with ambiguous scalar temporary

    real_t t, s;
    for (int nl = 0; nl < iterations; nl++) {
        t = (real_t) 0.;
        for (int i = 0; i < LEN_1D; i++) {
            s = b[i] * c[i];
            a[i] = s + t;
            t = s;
        }
    }

}

clang does:
main:                                   # @main
        .cfi_startproc
# %bb.0:
        xorl    %eax, %eax
        .p2align        4, 0x90
.LBB0_1:                                # =>This Loop Header: Depth=1
                                        #     Child Loop BB0_2 Depth 2
        vxorps  %xmm0, %xmm0, %xmm0
        movq    $-128000, %rcx                  # imm = 0xFFFE0C00
        .p2align        4, 0x90
.LBB0_2:                                #   Parent Loop BB0_1 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
        vmovups c+128000(%rcx), %ymm1
        vmovups c+128032(%rcx), %ymm2
        vmovups c+128064(%rcx), %ymm3
        vmovups c+128096(%rcx), %ymm4
        vmulps  b+128000(%rcx), %ymm1, %ymm1
        vmulps  b+128032(%rcx), %ymm2, %ymm2
        vmulps  b+128064(%rcx), %ymm3, %ymm3
        vmulps  b+128096(%rcx), %ymm4, %ymm4
        vperm2f128      $33, %ymm1, %ymm0, %ymm0 # ymm0 = ymm0[2,3],ymm1[0,1]
        vperm2f128      $33, %ymm2, %ymm1, %ymm5 # ymm5 = ymm1[2,3],ymm2[0,1]
        vperm2f128      $33, %ymm3, %ymm2, %ymm6 # ymm6 = ymm2[2,3],ymm3[0,1]
        vperm2f128      $33, %ymm4, %ymm3, %ymm7 # ymm7 = ymm3[2,3],ymm4[0,1]
        vshufps $3, %ymm1, %ymm0, %ymm0         # ymm0 = ymm0[3,0],ymm1[0,0],ymm0[7,4],ymm1[4,4]
        vshufps $3, %ymm2, %ymm5, %ymm5         # ymm5 = ymm5[3,0],ymm2[0,0],ymm5[7,4],ymm2[4,4]
        vshufps $3, %ymm3, %ymm6, %ymm6         # ymm6 = ymm6[3,0],ymm3[0,0],ymm6[7,4],ymm3[4,4]
        vshufps $3, %ymm4, %ymm7, %ymm7         # ymm7 = ymm7[3,0],ymm4[0,0],ymm7[7,4],ymm4[4,4]
        vshufps $152, %ymm1, %ymm0, %ymm0       # ymm0 = ymm0[0,2],ymm1[1,2],ymm0[4,6],ymm1[5,6]
        vshufps $152, %ymm2, %ymm5, %ymm5       # ymm5 = ymm5[0,2],ymm2[1,2],ymm5[4,6],ymm2[5,6]
        vshufps $152, %ymm3, %ymm6, %ymm6       # ymm6 = ymm6[0,2],ymm3[1,2],ymm6[4,6],ymm3[5,6]
        vshufps $152, %ymm4, %ymm7, %ymm7       # ymm7 = ymm7[0,2],ymm4[1,2],ymm7[4,6],ymm4[5,6]
        vaddps  %ymm0, %ymm1, %ymm0
        vaddps  %ymm5, %ymm2, %ymm1
        vaddps  %ymm6, %ymm3, %ymm2
        vaddps  %ymm7, %ymm4, %ymm3
        vmovups %ymm0, a+128000(%rcx)
        vmovups %ymm1, a+128032(%rcx)
        vmovups %ymm2, a+128064(%rcx)
        vmovups %ymm3, a+128096(%rcx)
        subq    $-128, %rcx
        vmovaps %ymm4, %ymm0
        jne     .LBB0_2
# %bb.3:                                #   in Loop: Header=BB0_1 Depth=1
        incl    %eax
        cmpl    $100000, %eax                   # imm = 0x186A0
        jne     .LBB0_1
# %bb.4:
        vzeroupper
        retq

gcc's vectorizer dump says:
s252.c:18:27: note:   worklist: examine stmt: _3 = s_11 + t_21;
s252.c:18:27: note:   vect_is_simple_use: operand _1 * _2, type of def: internal
s252.c:18:27: note:   mark relevant 5, live 0: s_11 = _1 * _2;
s252.c:18:27: note:   vect_is_simple_use: operand t_21 = PHI <s_11(8), 0.0(5)>, type of def: unknown
s252.c:18:27: missed:   Unsupported pattern.
s252.c:20:22: missed:   not vectorized: unsupported use in stmt.
s252.c:18:27: missed:  unexpected pattern.

  <bb 8> [local count: 1052266996]:

  <bb 3> [local count: 1063004409]:
  # t_21 = PHI <s_11(8), 0.0(5)>
  # i_23 = PHI <i_13(8), 0(5)>
  # ivtmp_20 = PHI <ivtmp_19(8), 32000(5)>
  _1 = b[i_23];
  _2 = c[i_23];
  s_11 = _1 * _2;
  _3 = s_11 + t_21;
  a[i_23] = _3;
  i_13 = i_23 + 1;
  ivtmp_19 = ivtmp_20 - 1;
  if (ivtmp_19 != 0)
    goto <bb 8>; [98.99%]
  else
    goto <bb 4>; [1.01%]

* [Bug tree-optimization/99409] s252 benchmark of TSVC is vectorized by clang and not by gcc
From: rguenth at gcc dot gnu.org @ 2021-03-08  8:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99409

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Blocks|                            |53947
          Component|middle-end                  |tree-optimization

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Yes, we can't do 'scalar expansion'.  We'd need some pre-pass to turn PHIs
into data accesses.  Here we want

        t[0] = (real_t) 0.;
        for (int i = 0; i < LEN_1D; i++) {
            s = b[i] * c[i];
            a[i] = s + t[i];
            t[i+1] = s;
        }

and then of course the trick is to elide the actual array and do
clever shuffling of vector registers instead.

IIRC one of the other TSVC examples was similar.
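
For readers following along, the scalar expansion described above can be
sketched in plain C. This is only an illustration under stated assumptions:
the function names, the explicit t array of n + 1 elements, and the
self-check helper are made up for this example and are not code from GCC
or from TSVC.

```c
#include <stddef.h>

typedef float real_t;

/* Original s252 form: the scalar t carries the previous product
   from one iteration to the next. */
void s252_scalar(const real_t *b, const real_t *c, real_t *a, size_t n)
{
    real_t t = 0.0f;
    for (size_t i = 0; i < n; i++) {
        real_t s = b[i] * c[i];
        a[i] = s + t;
        t = s;
    }
}

/* Scalar-expanded form: t is an array of n + 1 elements (an assumption
   of this sketch), so iteration i reads t[i] and writes t[i + 1]. */
void s252_expanded(const real_t *b, const real_t *c, real_t *a,
                   real_t *t, size_t n)
{
    t[0] = 0.0f;
    for (size_t i = 0; i < n; i++) {
        real_t s = b[i] * c[i];
        a[i] = s + t[i];
        t[i + 1] = s;
    }
}

/* Hypothetical self-check: returns 1 when both forms produce identical
   output. Inputs are small integers so every product and sum is exact. */
int s252_forms_agree(void)
{
    enum { N = 64 };
    real_t b[N], c[N], a1[N], a2[N], t[N + 1];
    for (size_t i = 0; i < N; i++) {
        b[i] = (real_t)(i % 7 + 1);
        c[i] = (real_t)(i % 3 + 2);
    }
    s252_scalar(b, c, a1, N);
    s252_expanded(b, c, a2, t, N);
    for (size_t i = 0; i < N; i++)
        if (a1[i] != a2[i])
            return 0;
    return 1;
}
```

In the expanded form the only cross-iteration dependence goes through
memory (t[i + 1] written, then read as t[i] one iteration later), which is
the kind of data access the comment above says a pre-pass would need to
create before the array is elided again.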


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations

* [Bug tree-optimization/99409] s252 benchmark of TSVC is vectorized by clang and not by gcc
From: cvs-commit at gcc dot gnu.org @ 2022-10-17 10:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99409

--- Comment #2 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:46a8e017d048ec3271bbb898942e3b166c4e8ff3

commit r13-3327-g46a8e017d048ec3271bbb898942e3b166c4e8ff3
Author: Richard Biener <rguenther@suse.de>
Date:   Thu Oct 6 13:56:09 2022 +0200

    Vectorization of first-order recurrences

    The following picks up the prototype by Ju-Zhe Zhong for vectorizing
    first order recurrences.  That solves two TSVC missed optimization PRs.

    There's a new scalar cycle def kind, vect_first_order_recurrence
    and its handling of the backedge value vectorization is complicated
    by the fact that the vectorized value isn't the PHI but instead
    a (series of) permute(s) shifting in the recurring value from the
    previous iteration.  I've implemented this by creating both the
    single vectorized PHI and the series of permutes when vectorizing
    the scalar PHI but leave the backedge values in both unassigned.
    The backedge values are (for the testcases) computed by a load
    which is also the place after which the permutes are inserted.
    That placement also restricts the cases we can handle (without
    resorting to code motion).

    I added both costing and SLP handling though SLP handling is
    restricted to the case where a single vectorized PHI is enough.

    Missing is epilogue handling - while prologue peeling would
    be handled transparently by adjusting iv_phi_p the epilogue
    case doesn't work with just inserting a scalar LC PHI since
    that a) keeps the scalar load live and b) that load is the
    wrong one, it has to be the last, much like when we'd vectorize
    the LC PHI as live operation.  Unfortunately LIVE
    compute/analysis happens too early before we decide on
    peeling.  When using fully masked loop vectorization the
    vect-recurr-6.c works as expected though.

    I have tested this on x86_64 for now, but since epilogue
    handling is missing there are probably no practical cases.
    My prototype WHILE_ULT AVX512 patch can handle vect-recurr-6.c
    just fine but I didn't feel like running SPEC within SDE nor
    is the WHILE_ULT patch complete enough.

            PR tree-optimization/99409
            PR tree-optimization/99394
            * tree-vectorizer.h (vect_def_type::vect_first_order_recurrence):
            Add.
            (stmt_vec_info_type::recurr_info_type): Likewise.
            (vectorizable_recurr): New function.
            * tree-vect-loop.cc (vect_phi_first_order_recurrence_p): New
            function.
            (vect_analyze_scalar_cycles_1): Look for first order
            recurrences.
            (vect_analyze_loop_operations): Handle them.
            (vect_transform_loop): Likewise.
            (vectorizable_recurr): New function.
            (maybe_set_vectorized_backedge_value): Handle the backedge value
            setting in the first order recurrence PHI and the permutes.
            * tree-vect-stmts.cc (vect_analyze_stmt): Handle first order
            recurrences.
            (vect_transform_stmt): Likewise.
            (vect_is_simple_use): Likewise.
            (vect_is_simple_use): Likewise.
            * tree-vect-slp.cc (vect_get_and_check_slp_defs): Likewise.
            (vect_build_slp_tree_2): Likewise.
            (vect_schedule_scc): Handle the backedge value setting in the
            first order recurrence PHI and the permutes.

            * gcc.dg/vect/vect-recurr-1.c: New testcase.
            * gcc.dg/vect/vect-recurr-2.c: Likewise.
            * gcc.dg/vect/vect-recurr-3.c: Likewise.
            * gcc.dg/vect/vect-recurr-4.c: Likewise.
            * gcc.dg/vect/vect-recurr-5.c: Likewise.
            * gcc.dg/vect/vect-recurr-6.c: Likewise.
            * gcc.dg/vect/tsvc/vect-tsvc-s252.c: Un-XFAIL.
            * gcc.dg/vect/tsvc/vect-tsvc-s254.c: Likewise.
            * gcc.dg/vect/tsvc/vect-tsvc-s291.c: Likewise.

    Co-authored-by: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>
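
The commit message describes the vectorized recurrence as a single vector
PHI plus permutes that shift in the recurring value from the previous
iteration. A rough emulation of that idea in plain C, for the s252 loop,
might look like the sketch below. Everything here is an assumption made
for illustration: the vectorization factor is fixed at 4, vectors are
modelled as small arrays, and shift_in_previous is a hypothetical helper
standing in for the permute. It is not the code GCC generates, and like
the commit it has no epilogue, so n must be a multiple of VF.

```c
#include <assert.h>
#include <stddef.h>

enum { VF = 4 };  /* assumed vectorization factor for this sketch */

/* Hypothetical stand-in for the two-input permute:
   out = { prev[VF-1], cur[0], ..., cur[VF-2] }. */
void shift_in_previous(const float prev[VF], const float cur[VF],
                       float out[VF])
{
    out[0] = prev[VF - 1];
    for (int k = 1; k < VF; k++)
        out[k] = cur[k - 1];
}

/* Blocked s252: each outer iteration handles VF elements. */
void s252_blocked(const float *b, const float *c, float *a, size_t n)
{
    assert(n % VF == 0);      /* no epilogue handling in this sketch */
    float prev[VF] = { 0 };   /* "vector PHI"; last lane is the initial t */
    float cur[VF], shifted[VF];

    for (size_t i = 0; i < n; i += VF) {
        for (int k = 0; k < VF; k++)      /* s = b[i] * c[i], VF lanes */
            cur[k] = b[i + k] * c[i + k];
        shift_in_previous(prev, cur, shifted);  /* lanes of t */
        for (int k = 0; k < VF; k++)      /* a[i] = s + t */
            a[i + k] = cur[k] + shifted[k];
        for (int k = 0; k < VF; k++)      /* backedge value: t = s */
            prev[k] = cur[k];
    }
}
```

Only the last lane of prev is ever consumed, which matches the scalar
loop: the value of t at the start of a block is the last s of the block
before it.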

* [Bug tree-optimization/99409] s252 benchmark of TSVC is vectorized by clang and not by gcc
From: rguenth at gcc dot gnu.org @ 2022-10-17 10:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99409

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|UNCONFIRMED                 |RESOLVED
      Known to work|                            |13.0

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
Fixed for GCC 13.
