public inbox for gcc-bugs@sourceware.org
* [Bug middle-end/99394] New: s254 benchmark of TSVC is vectorized by clang and not by gcc
@ 2021-03-04 22:56 hubicka at gcc dot gnu.org
  2021-03-04 23:29 ` [Bug middle-end/99394] " hubicka at gcc dot gnu.org
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-03-04 22:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99394

            Bug ID: 99394
           Summary: s254 benchmark of TSVC is vectorized by clang and not
                    by gcc
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

Clang vectorizes the s254 loop with -mtune=archive on znver2, giving a
speedup of about 758%. The loop is:

real_t s254(struct args_t * func_args)
{

//    scalar and array expansion
//    carry around variable

    initialise_arrays(__func__);
    gettimeofday(&func_args->t1, NULL);

    real_t x;
    for (int nl = 0; nl < 4*iterations; nl++) {
        x = b[LEN_1D-1];
        for (int i = 0; i < LEN_1D; i++) {
            a[i] = (b[i] + x) * (real_t).5;
            x = b[i];
        }
        dummy(a, b, c, d, e, aa, bb, cc, 0.);
    }

    gettimeofday(&func_args->t2, NULL);
    return calc_checksum(__func__);
}

and clang produces:
0000000000407d30 <s254>:
  407d30:       41 56                   push   %r14
  407d32:       53                      push   %rbx
  407d33:       48 83 ec 28             sub    $0x28,%rsp
  407d37:       49 89 fe                mov    %rdi,%r14
  407d3a:       bf 6b e2 42 00          mov    $0x42e26b,%edi
  407d3f:       e8 cc f8 00 00          call   417610 <initialise_arrays>
  407d44:       31 db                   xor    %ebx,%ebx
  407d46:       4c 89 f7                mov    %r14,%rdi
  407d49:       31 f6                   xor    %esi,%esi
  407d4b:       e8 10 93 ff ff          call   401060 <gettimeofday@plt>
  407d50:       c4 62 7d 18 05 af 62    vbroadcastss 0x262af(%rip),%ymm8        # 42e008 <_IO_stdin_used+0x8>
  407d57:       02 00 
  407d59:       c5 7c 11 04 24          vmovups %ymm8,(%rsp)
  407d5e:       66 90                   xchg   %ax,%ax
  407d60:       48 c7 c0 00 0c fe ff    mov    $0xfffffffffffe0c00,%rax
  407d67:       c4 e2 7d 18 05 8c a7    vbroadcastss 0x4a78c(%rip),%ymm0        # 4524fc <b+0x1f3fc>
  407d6e:       04 00 
  407d70:       c5 fc 28 88 00 25 45    vmovaps 0x452500(%rax),%ymm1
  407d77:       00 
  407d78:       c5 fc 28 90 20 25 45    vmovaps 0x452520(%rax),%ymm2
  407d7f:       00 
  407d80:       c5 fc 28 98 40 25 45    vmovaps 0x452540(%rax),%ymm3
  407d87:       00 
  407d88:       c4 e3 7d 06 c1 21       vperm2f128 $0x21,%ymm1,%ymm0,%ymm0
  407d8e:       c5 fc 28 a0 60 25 45    vmovaps 0x452560(%rax),%ymm4
  407d95:       00 
  407d96:       c5 fc c6 c1 03          vshufps $0x3,%ymm1,%ymm0,%ymm0
  407d9b:       c5 fc c6 c1 98          vshufps $0x98,%ymm1,%ymm0,%ymm0
  407da0:       c4 e3 75 06 ea 21       vperm2f128 $0x21,%ymm2,%ymm1,%ymm5
  407da6:       c5 d4 c6 ea 03          vshufps $0x3,%ymm2,%ymm5,%ymm5
  407dab:       c5 d4 c6 ea 98          vshufps $0x98,%ymm2,%ymm5,%ymm5
  407db0:       c4 e3 6d 06 f3 21       vperm2f128 $0x21,%ymm3,%ymm2,%ymm6
  407db6:       c5 cc c6 f3 03          vshufps $0x3,%ymm3,%ymm6,%ymm6
  407dbb:       c5 cc c6 f3 98          vshufps $0x98,%ymm3,%ymm6,%ymm6
  407dc0:       c4 e3 65 06 fc 21       vperm2f128 $0x21,%ymm4,%ymm3,%ymm7
  407dc6:       c5 c4 c6 fc 03          vshufps $0x3,%ymm4,%ymm7,%ymm7
  407dcb:       c5 c4 c6 fc 98          vshufps $0x98,%ymm4,%ymm7,%ymm7
  407dd0:       c5 f4 58 c0             vaddps %ymm0,%ymm1,%ymm0
  407dd4:       c5 ec 58 cd             vaddps %ymm5,%ymm2,%ymm1
  407dd8:       c5 e4 58 d6             vaddps %ymm6,%ymm3,%ymm2
  407ddc:       c5 dc 58 df             vaddps %ymm7,%ymm4,%ymm3
  407de0:       c5 bc 59 c0             vmulps %ymm0,%ymm8,%ymm0
  407de4:       c5 bc 59 c9             vmulps %ymm1,%ymm8,%ymm1
  407de8:       c5 bc 59 d2             vmulps %ymm2,%ymm8,%ymm2
  407dec:       c5 bc 59 db             vmulps %ymm3,%ymm8,%ymm3
  407df0:       c5 fc 29 80 00 19 47    vmovaps %ymm0,0x471900(%rax)
  407df7:       00 
  407df8:       c5 fc 29 88 20 19 47    vmovaps %ymm1,0x471920(%rax)
  407dff:       00 
  407e00:       c5 fc 29 90 40 19 47    vmovaps %ymm2,0x471940(%rax)
  407e07:       00 
  407e08:       c5 fc 29 98 60 19 47    vmovaps %ymm3,0x471960(%rax)
  407e0f:       00 
  407e10:       c5 fc 28 c4             vmovaps %ymm4,%ymm0
  407e14:       48 83 e8 80             sub    $0xffffffffffffff80,%rax
  407e18:       0f 85 52 ff ff ff       jne    407d70 <s254+0x40>
  407e1e:       bf 00 25 45 00          mov    $0x452500,%edi
  407e23:       be 00 31 43 00          mov    $0x433100,%esi
  407e28:       ba 00 19 47 00          mov    $0x471900,%edx
  407e2d:       b9 00 0d 49 00          mov    $0x490d00,%ecx
  407e32:       41 b8 00 01 4b 00       mov    $0x4b0100,%r8d
  407e38:       41 b9 00 f5 4c 00       mov    $0x4cf500,%r9d
  407e3e:       c5 f8 57 c0             vxorps %xmm0,%xmm0,%xmm0
  407e42:       68 00 f5 54 00          push   $0x54f500
  407e47:       68 00 f5 50 00          push   $0x50f500
  407e4c:       c5 f8 77                vzeroupper 
  407e4f:       e8 6c db 00 00          call   4159c0 <dummy>
  407e54:       c5 7c 10 44 24 10       vmovups 0x10(%rsp),%ymm8
  407e5a:       48 83 c4 10             add    $0x10,%rsp
  407e5e:       83 c3 01                add    $0x1,%ebx
  407e61:       81 fb 80 1a 06 00       cmp    $0x61a80,%ebx
  407e67:       0f 85 f3 fe ff ff       jne    407d60 <s254+0x30>
  407e6d:       49 83 c6 10             add    $0x10,%r14
  407e71:       4c 89 f7                mov    %r14,%rdi
  407e74:       31 f6                   xor    %esi,%esi
  407e76:       c5 f8 77                vzeroupper 
  407e79:       e8 e2 91 ff ff          call   401060 <gettimeofday@plt>
  407e7e:       bf 6b e2 42 00          mov    $0x42e26b,%edi
  407e83:       48 83 c4 28             add    $0x28,%rsp
  407e87:       5b                      pop    %rbx
  407e88:       41 5e                   pop    %r14
  407e8a:       e9 71 f1 01 00          jmp    427000 <calc_checksum>
  407e8f:       90                      nop
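
For readers following the listing: each vperm2f128 $0x21 / vshufps $0x3 /
vshufps $0x98 triple (for example at 407d88, 407d96 and 407d9b) appears to
build the vector of "previous" elements, i.e. the last element of the
previously loaded vector followed by the first seven elements of the current
one. The sketch below emulates those three instructions on plain float arrays
to check that reading. The helper names (perm2f128, shufps,
shift_in_previous) are made up for illustration, and the zeroing bits of
vperm2f128 are deliberately not modelled.

```c
#include <assert.h>
#include <string.h>

/* Emulates vperm2f128 on 8 floats. Each 128-bit half of dst is chosen by a
   2-bit selector: 0 = a.low, 1 = a.high, 2 = b.low, 3 = b.high.
   Here a is the first source and b the second source in Intel operand order. */
static void perm2f128(float dst[8], const float a[8], const float b[8],
                      unsigned imm)
{
    float tmp[8];
    assert((imm & 0x88) == 0);  /* zeroing bits are not modelled */
    for (int half = 0; half < 2; half++) {
        unsigned sel = (imm >> (4 * half)) & 3;
        const float *src = (sel & 2) ? b : a;
        memcpy(tmp + 4 * half, src + 4 * (sel & 1), 4 * sizeof(float));
    }
    memcpy(dst, tmp, sizeof tmp);
}

/* Emulates 256-bit vshufps: within each 128-bit lane, the two low result
   elements come from a and the two high result elements come from b. */
static void shufps(float dst[8], const float a[8], const float b[8],
                   unsigned imm)
{
    float tmp[8];
    for (int lane = 0; lane < 8; lane += 4) {
        tmp[lane + 0] = a[lane + ((imm >> 0) & 3)];
        tmp[lane + 1] = a[lane + ((imm >> 2) & 3)];
        tmp[lane + 2] = b[lane + ((imm >> 4) & 3)];
        tmp[lane + 3] = b[lane + ((imm >> 6) & 3)];
    }
    memcpy(dst, tmp, sizeof tmp);
}

/* The three-instruction sequence from the listing, applied to the previous
   vector (prev) and the freshly loaded one (cur). */
void shift_in_previous(float dst[8], const float prev[8], const float cur[8])
{
    perm2f128(dst, prev, cur, 0x21);  /* { prev[4..7], cur[0..3] } */
    shufps(dst, dst, cur, 0x03);
    shufps(dst, dst, cur, 0x98);      /* { prev[7], cur[0..6] } */
}
```

In the listing, the first "previous" vector is a broadcast of b+0x1f3fc,
which is b[31999], and the vmovaps %ymm4,%ymm0 at 407e10 carries the last
loaded vector into the next trip of the loop.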

* [Bug middle-end/99394] s254 benchmark of TSVC is vectorized by clang and not by gcc
  2021-03-04 22:56 [Bug middle-end/99394] New: s254 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
@ 2021-03-04 23:29 ` hubicka at gcc dot gnu.org
  2021-03-05  8:20 ` rguenth at gcc dot gnu.org
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-03-04 23:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99394

--- Comment #1 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Here we fail with:
tsvc.c:1526:27: note:   vect_is_simple_use: operand x_30 = PHI <_2(8), x_18(3)>, type of def: unknown
tsvc.c:1526:27: missed:   Unsupported pattern.
tsvc.c:1527:26: missed:   not vectorized: unsupported use in stmt.
tsvc.c:1526:27: missed:  unexpected pattern.


{
  int i;
  int nl;
  real_t x;
  static const char __func__[5] = "s254";
  struct timeval * _1;
  float _2;
  float _3;
  float _4;
  struct timeval * _5;
  real_t _17;
  unsigned int ivtmp_27;
  unsigned int ivtmp_28;
  unsigned int ivtmp_29;
  unsigned int ivtmp_35;

  <bb 2> [local count: 108459]:
  initialise_arrays (&__func__);
  _1 = &func_args_13(D)->t1;
  gettimeofday (_1, 0B);

  <bb 3> [local count: 10737416]:
  # nl_31 = PHI <nl_20(7), 0(2)>
  # ivtmp_28 = PHI <ivtmp_27(7), 400000(2)>
  x_18 = b[31999];

  <bb 4> [local count: 1063004409]:
  # x_30 = PHI <_2(8), x_18(3)>
  # i_32 = PHI <i_22(8), 0(3)>
  # ivtmp_35 = PHI <ivtmp_29(8), 32000(3)>
  _2 = b[i_32];
  _3 = _2 + x_30;
  _4 = _3 * 5.0e-1;
  a[i_32] = _4;
  i_22 = i_32 + 1;
  ivtmp_29 = ivtmp_35 - 1;
  if (ivtmp_29 != 0)
    goto <bb 8>; [98.99%]
  else
    goto <bb 5>; [1.01%]

  <bb 8> [local count: 1052266996]:
  goto <bb 4>; [100.00%]

....

* [Bug middle-end/99394] s254 benchmark of TSVC is vectorized by clang and not by gcc
  2021-03-04 22:56 [Bug middle-end/99394] New: s254 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
  2021-03-04 23:29 ` [Bug middle-end/99394] " hubicka at gcc dot gnu.org
@ 2021-03-05  8:20 ` rguenth at gcc dot gnu.org
  2021-03-05  8:20 ` [Bug tree-optimization/99394] " rguenth at gcc dot gnu.org
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-05  8:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99394

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
This is a loop-carried data dependence which we can't handle (we avoid creating
those from PRE but here it appears in the source itself).  I wonder how
LLVM handles this (pre/post vectorization IL).

Specifically 'carry around variable' is something we don't handle.

Can you somehow extract a compilable testcase (with just this kernel)?

Looking at the source, peeling a single iteration (to get rid of the initial
value) and then undoing the PRE, i.e. vectorizing

        for (int i = 1; i < LEN_1D; i++) {
            a[i] = (b[i] + b[i-1]) * (real_t).5;
        }

would likely result in optimal code.  The assembly from clang doesn't look
optimal to me - llvm likely materializes 'x' as a temporary array, vectorizing

        x[0] = b[LEN_1D-1];
        for (int i = 0; i < LEN_1D; i++) {
            a[i] = (b[i] + x[i]) * (real_t).5;
            x[i+1] = b[i];
        }

and then somehow (like we handle OMP simd lane arrays?) uses two vectors
as a sliding window over x[].  At least the standard strategy for
these kinds of dependences is to get "rid" of them by making them data
dependences and then hope for the best.
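
As a minimal sketch of the peeling idea above (not GCC's transformation
itself), the two functions below compute the same result when a and b do not
overlap: one keeps the carried scalar x, the other peels the first iteration
and reads b[i-1] directly. The function names and the size_t parameter n are
made up for illustration; in the benchmark, n is LEN_1D and a, b are distinct
global arrays.

```c
#include <assert.h>
#include <stddef.h>

/* Original shape: x carries b[i-1] (b[n-1] on entry) across iterations. */
void s254_carried(float *a, const float *b, size_t n)
{
    assert(n > 0);
    float x = b[n - 1];
    for (size_t i = 0; i < n; i++) {
        a[i] = (b[i] + x) * 0.5f;
        x = b[i];
    }
}

/* Peeled shape: the first iteration consumes the initial value b[n-1];
   the remaining loop has no carried scalar, only loads of b[i] and b[i-1]. */
void s254_peeled(float *a, const float *b, size_t n)
{
    assert(n > 0);
    a[0] = (b[0] + b[n - 1]) * 0.5f;
    for (size_t i = 1; i < n; i++)
        a[i] = (b[i] + b[i - 1]) * 0.5f;
}
```

The peeled loop is an ordinary loop over two overlapping loads, which is a
form the vectorizer already handles.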

* [Bug tree-optimization/99394] s254 benchmark of TSVC is vectorized by clang and not by gcc
  2021-03-04 22:56 [Bug middle-end/99394] New: s254 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
  2021-03-04 23:29 ` [Bug middle-end/99394] " hubicka at gcc dot gnu.org
  2021-03-05  8:20 ` rguenth at gcc dot gnu.org
@ 2021-03-05  8:20 ` rguenth at gcc dot gnu.org
  2021-03-05 13:52 ` hubicka at gcc dot gnu.org
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-05  8:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99394

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
           Keywords|                            |missed-optimization
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2021-03-05
                 CC|                            |rguenth at gcc dot gnu.org
          Component|middle-end                  |tree-optimization

* [Bug tree-optimization/99394] s254 benchmark of TSVC is vectorized by clang and not by gcc
  2021-03-04 22:56 [Bug middle-end/99394] New: s254 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2021-03-05  8:20 ` [Bug tree-optimization/99394] " rguenth at gcc dot gnu.org
@ 2021-03-05 13:52 ` hubicka at gcc dot gnu.org
  2022-10-17 10:36 ` cvs-commit at gcc dot gnu.org
  2022-10-17 10:39 ` rguenth at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-03-05 13:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99394

--- Comment #3 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
testcase is:

typedef float real_t;

#define iterations 100000
#define LEN_1D 32000
#define LEN_2D 256
// array definitions
real_t flat_2d_array[LEN_2D*LEN_2D];

real_t x[LEN_1D];

real_t a[LEN_1D],b[LEN_1D],c[LEN_1D],d[LEN_1D],e[LEN_1D],
bb[LEN_2D][LEN_2D],cc[LEN_2D][LEN_2D],tt[LEN_2D][LEN_2D];

int indx[LEN_1D];

real_t* __restrict__ xx;
real_t* yy;

// %2.5

real_t s254(void)
{

//    scalar and array expansion
//    carry around variable

    real_t x;
    for (int nl = 0; nl < 4*iterations; nl++) {
        x = b[LEN_1D-1];
        for (int i = 0; i < LEN_1D; i++) {
            a[i] = (b[i] + x) * (real_t).5;
            x = b[i];
        }
    }

}

* [Bug tree-optimization/99394] s254 benchmark of TSVC is vectorized by clang and not by gcc
  2021-03-04 22:56 [Bug middle-end/99394] New: s254 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2021-03-05 13:52 ` hubicka at gcc dot gnu.org
@ 2022-10-17 10:36 ` cvs-commit at gcc dot gnu.org
  2022-10-17 10:39 ` rguenth at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-10-17 10:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99394

--- Comment #4 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:46a8e017d048ec3271bbb898942e3b166c4e8ff3

commit r13-3327-g46a8e017d048ec3271bbb898942e3b166c4e8ff3
Author: Richard Biener <rguenther@suse.de>
Date:   Thu Oct 6 13:56:09 2022 +0200

    Vectorization of first-order recurrences

    The following picks up the prototype by Ju-Zhe Zhong for vectorizing
    first order recurrences.  That solves two TSVC missed optimization PRs.

    There's a new scalar cycle def kind, vect_first_order_recurrence
    and its handling of the backedge value vectorization is complicated
    by the fact that the vectorized value isn't the PHI but instead
    a (series of) permute(s) shifting in the recurring value from the
    previous iteration.  I've implemented this by creating both the
    single vectorized PHI and the series of permutes when vectorizing
    the scalar PHI but leave the backedge values in both unassigned.
    The backedge values are (for the testcases) computed by a load
    which is also the place after which the permutes are inserted.
    That placement also restricts the cases we can handle (without
    resorting to code motion).

    I added both costing and SLP handling though SLP handling is
    restricted to the case where a single vectorized PHI is enough.

    Missing is epilogue handling - while prologue peeling would
    be handled transparently by adjusting iv_phi_p the epilogue
    case doesn't work with just inserting a scalar LC PHI since
    that a) keeps the scalar load live and b) that load is the
    wrong one, it has to be the last, much like when we'd vectorize
    the LC PHI as live operation.  Unfortunately LIVE
    compute/analysis happens too early before we decide on
    peeling.  When using fully masked loop vectorization the
    vect-recurr-6.c works as expected though.

    I have tested this on x86_64 for now, but since epilogue
    handling is missing there are probably no practical cases.
    My prototype WHILE_ULT AVX512 patch can handle vect-recurr-6.c
    just fine but I didn't feel like running SPEC within SDE nor
    is the WHILE_ULT patch complete enough.

            PR tree-optimization/99409
            PR tree-optimization/99394
            * tree-vectorizer.h (vect_def_type::vect_first_order_recurrence):
            Add.
            (stmt_vec_info_type::recurr_info_type): Likewise.
            (vectorizable_recurr): New function.
            * tree-vect-loop.cc (vect_phi_first_order_recurrence_p): New
            function.
            (vect_analyze_scalar_cycles_1): Look for first order
            recurrences.
            (vect_analyze_loop_operations): Handle them.
            (vect_transform_loop): Likewise.
            (vectorizable_recurr): New function.
            (maybe_set_vectorized_backedge_value): Handle the backedge value
            setting in the first order recurrence PHI and the permutes.
            * tree-vect-stmts.cc (vect_analyze_stmt): Handle first order
            recurrences.
            (vect_transform_stmt): Likewise.
            (vect_is_simple_use): Likewise.
            (vect_is_simple_use): Likewise.
            * tree-vect-slp.cc (vect_get_and_check_slp_defs): Likewise.
            (vect_build_slp_tree_2): Likewise.
            (vect_schedule_scc): Handle the backedge value setting in the
            first order recurrence PHI and the permutes.

            * gcc.dg/vect/vect-recurr-1.c: New testcase.
            * gcc.dg/vect/vect-recurr-2.c: Likewise.
            * gcc.dg/vect/vect-recurr-3.c: Likewise.
            * gcc.dg/vect/vect-recurr-4.c: Likewise.
            * gcc.dg/vect/vect-recurr-5.c: Likewise.
            * gcc.dg/vect/vect-recurr-6.c: Likewise.
            * gcc.dg/vect/tsvc/vect-tsvc-s252.c: Un-XFAIL.
            * gcc.dg/vect/tsvc/vect-tsvc-s254.c: Likewise.
            * gcc.dg/vect/tsvc/vect-tsvc-s291.c: Likewise.

    Co-authored-by: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>
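
To illustrate the scheme the commit message describes (a single vector PHI
plus a permute that shifts in the recurring value from the previous
iteration), here is a scalar emulation for s254 with a vectorization factor
of 4. It is only a sketch under stated assumptions: the name
s254_recurrence_vf4, the constant VF, the splat used for the initial PHI
value and the exact permute selector are illustrative choices, not taken
from the patch. Matching the missing epilogue handling noted above, it
requires n to be a multiple of VF.

```c
#include <assert.h>
#include <stddef.h>

enum { VF = 4 };  /* hypothetical vectorization factor */

void s254_recurrence_vf4(float *a, const float *b, size_t n)
{
    assert(n > 0 && n % VF == 0);  /* no epilogue in this sketch */

    /* Vector PHI. On entry only its last lane is ever read, so the
       initial scalar value is simply splat across all lanes here. */
    float phi[VF];
    for (int l = 0; l < VF; l++)
        phi[l] = b[n - 1];

    for (size_t i = 0; i < n; i += VF) {
        float cur[VF], shifted[VF];

        /* Vector load; this is also the backedge value of the PHI. */
        for (int l = 0; l < VF; l++)
            cur[l] = b[i + l];

        /* Permute of the concatenation (phi, cur) with selector
           { VF-1, VF, ..., 2*VF-2 }: the last lane of the previous
           vector followed by the first VF-1 lanes of the current one. */
        shifted[0] = phi[VF - 1];
        for (int l = 1; l < VF; l++)
            shifted[l] = cur[l - 1];

        for (int l = 0; l < VF; l++)
            a[i + l] = (cur[l] + shifted[l]) * 0.5f;

        /* Backedge: the loaded vector becomes the next PHI value. */
        for (int l = 0; l < VF; l++)
            phi[l] = cur[l];
    }
}
```

The permute is placed right after the load that produces the backedge value,
which mirrors the placement restriction mentioned in the commit message.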

* [Bug tree-optimization/99394] s254 benchmark of TSVC is vectorized by clang and not by gcc
  2021-03-04 22:56 [Bug middle-end/99394] New: s254 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2022-10-17 10:36 ` cvs-commit at gcc dot gnu.org
@ 2022-10-17 10:39 ` rguenth at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-10-17 10:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99394

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
      Known to work|                            |13.0
         Resolution|---                         |FIXED
             Status|NEW                         |RESOLVED

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Fixed for GCC 13.
