[Bug middle-end/99395] New: s116 benchmark of TSVC is vectorized by clang and not by gcc

public inbox for gcc-bugs@sourceware.org

* [Bug middle-end/99395] New: s116 benchmark of TSVC is vectorized by clang and not by gcc
@ 2021-03-04 23:01 hubicka at gcc dot gnu.org
  2021-03-04 23:24 ` [Bug middle-end/99395] " hubicka at gcc dot gnu.org
                   ` (19 more replies)
  0 siblings, 20 replies; 21+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-03-04 23:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395

            Bug ID: 99395
           Summary: s116 benchmark of TSVC is vectorized by clang and not
                    by gcc
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

s116 loop is:

real_t s116(struct args_t * func_args)
{

//    linear dependence testing

    initialise_arrays(__func__);
    gettimeofday(&func_args->t1, NULL);

    for (int nl = 0; nl < iterations*10; nl++) {
        for (int i = 0; i < LEN_1D - 5; i += 5) {
            a[i] = a[i + 1] * a[i];
            a[i + 1] = a[i + 2] * a[i + 1];
            a[i + 2] = a[i + 3] * a[i + 2];
            a[i + 3] = a[i + 4] * a[i + 3];
            a[i + 4] = a[i + 5] * a[i + 4];
        }
        dummy(a, b, c, d, e, aa, bb, cc, 0.);
    }

    gettimeofday(&func_args->t2, NULL);
    return calc_checksum(__func__);
}

The vectorized code produced by clang 11 is about 2 times faster on a Zen 3 machine:

0000000000401d00 <s116>:
  401d00:       41 56                   push   %r14
  401d02:       53                      push   %rbx
  401d03:       50                      push   %rax
  401d04:       49 89 fe                mov    %rdi,%r14
  401d07:       bf 66 e1 42 00          mov    $0x42e166,%edi
  401d0c:       e8 ff 58 01 00          call   417610 <initialise_arrays>
  401d11:       31 db                   xor    %ebx,%ebx
  401d13:       4c 89 f7                mov    %r14,%rdi
  401d16:       31 f6                   xor    %esi,%esi
  401d18:       e8 43 f3 ff ff          call   401060 <gettimeofday@plt>
  401d1d:       eb 47                   jmp    401d66 <s116+0x66>
  401d1f:       90                      nop
  401d20:       bf 00 25 45 00          mov    $0x452500,%edi
  401d25:       be 00 31 43 00          mov    $0x433100,%esi
  401d2a:       ba 00 19 47 00          mov    $0x471900,%edx
  401d2f:       b9 00 0d 49 00          mov    $0x490d00,%ecx
  401d34:       41 b8 00 01 4b 00       mov    $0x4b0100,%r8d
  401d3a:       41 b9 00 f5 4c 00       mov    $0x4cf500,%r9d
  401d40:       c5 f8 57 c0             vxorps %xmm0,%xmm0,%xmm0
  401d44:       68 00 f5 54 00          push   $0x54f500
  401d49:       68 00 f5 50 00          push   $0x50f500
  401d4e:       e8 6d 3c 01 00          call   4159c0 <dummy>
  401d53:       48 83 c4 10             add    $0x10,%rsp
  401d57:       83 c3 01                add    $0x1,%ebx
  401d5a:       81 fb 40 42 0f 00       cmp    $0xf4240,%ebx
  401d60:       0f 84 9a 00 00 00       je     401e00 <s116+0x100>
  401d66:       c5 fa 10 05 92 07 05    vmovss 0x50792(%rip),%xmm0        # 452500 <a>
  401d6d:       00 
  401d6e:       31 c0                   xor    %eax,%eax
  401d70:       c5 fa 10 0c 85 04 25    vmovss 0x452504(,%rax,4),%xmm1
  401d77:       45 00 
  401d79:       c5 fa 59 c1             vmulss %xmm1,%xmm0,%xmm0
  401d7d:       c5 fa 11 04 85 00 25    vmovss %xmm0,0x452500(,%rax,4)
  401d84:       45 00 
  401d86:       c5 f8 10 04 85 08 25    vmovups 0x452508(,%rax,4),%xmm0
  401d8d:       45 00 
  401d8f:       c5 f0 c6 c8 00          vshufps $0x0,%xmm0,%xmm1,%xmm1
  401d94:       c5 f0 c6 c8 98          vshufps $0x98,%xmm0,%xmm1,%xmm1
  401d99:       c5 f8 59 c9             vmulps %xmm1,%xmm0,%xmm1
  401d9d:       c5 f8 11 0c 85 04 25    vmovups %xmm1,0x452504(,%rax,4)
  401da4:       45 00 
  401da6:       48 3d f5 7c 00 00       cmp    $0x7cf5,%rax
  401dac:       0f 87 6e ff ff ff       ja     401d20 <s116+0x20>
  401db2:       c4 e3 79 04 c0 e7       vpermilps $0xe7,%xmm0,%xmm0
  401db8:       c5 fa 10 0c 85 18 25    vmovss 0x452518(,%rax,4),%xmm1
  401dbf:       45 00 
  401dc1:       c5 fa 59 c1             vmulss %xmm1,%xmm0,%xmm0
  401dc5:       c5 fa 11 04 85 14 25    vmovss %xmm0,0x452514(,%rax,4)
  401dcc:       45 00 
  401dce:       c5 f8 10 04 85 1c 25    vmovups 0x45251c(,%rax,4),%xmm0
  401dd5:       45 00 
  401dd7:       c5 f0 c6 c8 00          vshufps $0x0,%xmm0,%xmm1,%xmm1
  401ddc:       c5 f0 c6 c8 98          vshufps $0x98,%xmm0,%xmm1,%xmm1
  401de1:       c5 f8 59 c9             vmulps %xmm1,%xmm0,%xmm1
  401de5:       c5 fa 10 04 85 28 25    vmovss 0x452528(,%rax,4),%xmm0
  401dec:       45 00 
  401dee:       c5 f8 11 0c 85 18 25    vmovups %xmm1,0x452518(,%rax,4)
  401df5:       45 00 
  401df7:       48 83 c0 0a             add    $0xa,%rax
  401dfb:       e9 70 ff ff ff          jmp    401d70 <s116+0x70>
  401e00:       49 83 c6 10             add    $0x10,%r14
  401e04:       4c 89 f7                mov    %r14,%rdi
  401e07:       31 f6                   xor    %esi,%esi
  401e09:       e8 52 f2 ff ff          call   401060 <gettimeofday@plt>
  401e0e:       bf 66 e1 42 00          mov    $0x42e166,%edi
  401e13:       48 83 c4 08             add    $0x8,%rsp
  401e17:       5b                      pop    %rbx
  401e18:       41 5e                   pop    %r14
  401e1a:       e9 e1 51 02 00          jmp    427000 <calc_checksum>
  401e1f:       90                      nop


* [Bug middle-end/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
  2021-03-04 23:01 [Bug middle-end/99395] New: s116 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
@ 2021-03-04 23:24 ` hubicka at gcc dot gnu.org
  2021-03-05  8:27 ` [Bug tree-optimization/99395] " rguenth at gcc dot gnu.org
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-03-04 23:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395

--- Comment #1 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Loop is:

real_t s116 (struct args_t * func_args)
{
  int i;
  int nl;
  static const char __func__[5] = "s116";
  struct timeval * _1;
  int _2;
  float _3;
  float _4;
  float _5;
  int _6;
  float _7;
  float _8;
  float _9;
  int _10;
  float _11;
  float _12;
  float _13;
  int _14;
  float _15;
  float _16;
  float _17;
  int _18;
  float _19;
  float _20;
  float _21;
  struct timeval * _22;
  real_t _33;
  unsigned int ivtmp_43;
  unsigned int ivtmp_44;
  unsigned int ivtmp_45;
  unsigned int ivtmp_46;

  <bb 2> [local count: 108459]:
  initialise_arrays (&__func__);
  _1 = &func_args_29(D)->t1;
  gettimeofday (_1, 0B);
  goto <bb 5>; [100.00%]

  <bb 8> [local count: 1052266996]:

  <bb 3> [local count: 1063004409]:
  # i_48 = PHI <_18(8), 0(5)>
  # ivtmp_46 = PHI <ivtmp_45(8), 6399(5)>
  _2 = i_48 + 1;
  _3 = a[_2];
  _4 = a[i_48];
  _5 = _3 * _4;
  a[i_48] = _5;
  _6 = i_48 + 2;
  _7 = a[_6];
  _8 = a[_2];
  _9 = _7 * _8;
  a[_2] = _9;
  _10 = i_48 + 3;
  _11 = a[_10];
  _12 = a[_6];
  _13 = _11 * _12;
  a[_6] = _13;
  _14 = i_48 + 4;
  _15 = a[_14];
  _16 = a[_10];
  _17 = _15 * _16;
  a[_10] = _17;
  _18 = i_48 + 5;
  _19 = a[_18];
  _20 = a[_14];
  _21 = _19 * _20;
  a[_14] = _21;
  ivtmp_45 = ivtmp_46 - 1;
  if (ivtmp_45 != 0)
    goto <bb 8>; [98.99%]
  else
    goto <bb 4>; [1.01%]


tsvc.c:275:18: missed:   not vectorized, possible dependence between data-refs a[i_48] and a[_18]
tsvc.c:274:27: missed:  bad data dependence.

_18 = i_48 + 5 and the stride is 5, so a[_18] is the element the next
iteration accesses as a[i_48].
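
To make the observation about the offset and the stride concrete: because the last statement reads a[i + 5] and the next iteration starts at i + 5, the five statements behave like five consecutive steps of a single statement a[j] = a[j + 1] * a[j]. The following is a minimal sketch under stated assumptions, not code from TSVC or GCC: the function names are illustrative, and the array and its length are passed as parameters instead of using the benchmark's globals.

```c
#include <stddef.h>

/* Loop shape as written in s116: five statements per iteration, i += 5.
   "i + 5 < len" is the unsigned-safe form of "i < LEN_1D - 5". */
void s116_unrolled(float *a, size_t len)
{
    for (size_t i = 0; i + 5 < len; i += 5) {
        a[i]     = a[i + 1] * a[i];
        a[i + 1] = a[i + 2] * a[i + 1];
        a[i + 2] = a[i + 3] * a[i + 2];
        a[i + 3] = a[i + 4] * a[i + 3];
        a[i + 4] = a[i + 5] * a[i + 4];
    }
}

/* Illustrative re-rolled form: one statement per iteration.
   Every step reads a[j + 1] before a later step overwrites it. */
void s116_rolled(float *a, size_t len)
{
    /* Number of iterations the unrolled loop executes. */
    size_t trips = len > 5 ? (len - 1) / 5 : 0;

    for (size_t j = 0; j < 5 * trips; j++)
        a[j] = a[j + 1] * a[j];
}
```

Both functions perform the same multiplications on the same operands in the same order, so they should leave identical contents in the array.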


* [Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
  2021-03-04 23:01 [Bug middle-end/99395] New: s116 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
  2021-03-04 23:24 ` [Bug middle-end/99395] " hubicka at gcc dot gnu.org
@ 2021-03-05  8:27 ` rguenth at gcc dot gnu.org
  2021-03-05 15:42 ` hubicka at gcc dot gnu.org
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-05  8:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2021-03-05
                 CC|                            |rguenth at gcc dot gnu.org,
                   |                            |rsandifo at gcc dot gnu.org
           Keywords|                            |missed-optimization
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW
          Component|middle-end                  |tree-optimization

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
please provide compilable testcases ...

Reduced testcase:

double a[1024];
void foo ()
{
  for (int i = 0; i < 1022; i += 2)
    {
      a[i] = a[i+1] * a[i];
      a[i+1] = a[i+2] * a[i+1];
    }
}


* [Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
  2021-03-04 23:01 [Bug middle-end/99395] New: s116 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
  2021-03-04 23:24 ` [Bug middle-end/99395] " hubicka at gcc dot gnu.org
  2021-03-05  8:27 ` [Bug tree-optimization/99395] " rguenth at gcc dot gnu.org
@ 2021-03-05 15:42 ` hubicka at gcc dot gnu.org
  2022-10-18 10:28 ` rguenth at gcc dot gnu.org
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-03-05 15:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395

--- Comment #3 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
The ICC version seems to run faster:
000000000040a050 <s116>:
  40a050:       55                      push   %rbp
  40a051:       48 89 e5                mov    %rsp,%rbp
  40a054:       48 83 e4 e0             and    $0xffffffffffffffe0,%rsp
  40a058:       41 57                   push   %r15
  40a05a:       53                      push   %rbx
  40a05b:       48 83 ec 10             sub    $0x10,%rsp
  40a05f:       48 89 fb                mov    %rdi,%rbx
  40a062:       bf 74 f5 42 00          mov    $0x42f574,%edi
  40a067:       e8 14 cc 00 00          call   416c80 <initialise_arrays>
  40a06c:       48 89 df                mov    %rbx,%rdi
  40a06f:       33 f6                   xor    %esi,%esi
  40a071:       e8 4a 70 ff ff          call   4010c0 <gettimeofday@plt>
  40a076:       33 c0                   xor    %eax,%eax
  40a078:       41 89 c7                mov    %eax,%r15d
  40a07b:       33 d2                   xor    %edx,%edx
  40a07d:       0f 1f 00                nopl   (%rax)
  40a080:       c5 fc 10 04 95 04 9d    vmovups 0x579d04(,%rdx,4),%ymm0
  40a087:       57 00 
  40a089:       c5 fc 10 14 95 24 9d    vmovups 0x579d24(,%rdx,4),%ymm2
  40a090:       57 00 
  40a092:       c5 fc 10 24 95 44 9d    vmovups 0x579d44(,%rdx,4),%ymm4
  40a099:       57 00 
  40a09b:       c5 fc 10 34 95 64 9d    vmovups 0x579d64(,%rdx,4),%ymm6
  40a0a2:       57 00 
  40a0a4:       c5 fc 59 0c 95 00 9d    vmulps 0x579d00(,%rdx,4),%ymm0,%ymm1
  40a0ab:       57 00 
  40a0ad:       c5 ec 59 1c 95 20 9d    vmulps 0x579d20(,%rdx,4),%ymm2,%ymm3
  40a0b4:       57 00 
  40a0b6:       c5 dc 59 2c 95 40 9d    vmulps 0x579d40(,%rdx,4),%ymm4,%ymm5
  40a0bd:       57 00 
  40a0bf:       c5 cc 59 3c 95 60 9d    vmulps 0x579d60(,%rdx,4),%ymm6,%ymm7
  40a0c6:       57 00 
  40a0c8:       c5 fc 11 0c 95 00 9d    vmovups %ymm1,0x579d00(,%rdx,4)
  40a0cf:       57 00 
  40a0d1:       c5 fc 11 1c 95 20 9d    vmovups %ymm3,0x579d20(,%rdx,4)
  40a0d8:       57 00 
  40a0da:       c5 fc 11 2c 95 40 9d    vmovups %ymm5,0x579d40(,%rdx,4)
  40a0e1:       57 00 
  40a0e3:       c5 fc 11 3c 95 60 9d    vmovups %ymm7,0x579d60(,%rdx,4)
  40a0ea:       57 00 
  40a0ec:       48 83 c2 20             add    $0x20,%rdx
  40a0f0:       48 81 fa e0 7c 00 00    cmp    $0x7ce0,%rdx
  40a0f7:       72 87                   jb     40a080 <s116+0x30>
  40a0f9:       33 c9                   xor    %ecx,%ecx
  40a0fb:       ba e1 7c 00 00          mov    $0x7ce1,%edx
  40a100:       c5 fc 10 04 95 00 9d    vmovups 0x579d00(,%rdx,4),%ymm0
  40a107:       57 00 
  40a109:       48 83 c2 08             add    $0x8,%rdx
  40a10d:       c5 fc 59 0c 8d 80 90    vmulps 0x599080(,%rcx,4),%ymm0,%ymm1
  40a114:       59 00 
  40a116:       c5 fc 11 0c 8d 80 90    vmovups %ymm1,0x599080(,%rcx,4)
  40a11d:       59 00 
  40a11f:       48 83 c1 08             add    $0x8,%rcx
  40a123:       48 83 f9 18             cmp    $0x18,%rcx
  40a127:       72 d7                   jb     40a100 <s116+0xb0>
  40a129:       c5 fa 10 0d b3 ef 18    vmovss 0x18efb3(%rip),%xmm1        # 5990e4 <a+0x1f3e4>
  40a130:       00 
  40a131:       bf 00 9d 57 00          mov    $0x579d00,%edi
  40a136:       c5 fa 10 1d aa ef 18    vmovss 0x18efaa(%rip),%xmm3        # 5990e8 <a+0x1f3e8>
  40a13d:       00 
  40a13e:       be 80 d8 45 00          mov    $0x45d880,%esi
  40a143:       c5 f2 59 05 95 ef 18    vmulss 0x18ef95(%rip),%xmm1,%xmm0        # 5990e0 <a+0x1f3e0>
  40a14a:       00 
  40a14b:       ba 00 a9 55 00          mov    $0x55a900,%edx
  40a150:       c5 e2 59 25 94 ef 18    vmulss 0x18ef94(%rip),%xmm3,%xmm4        # 5990ec <a+0x1f3ec>
  40a157:       00 
  40a158:       c5 f2 59 d3             vmulss %xmm3,%xmm1,%xmm2
  40a15c:       c5 fa 11 05 7c ef 18    vmovss %xmm0,0x18ef7c(%rip)        # 5990e0 <a+0x1f3e0>
  40a163:       00 
  40a164:       b9 80 e4 43 00          mov    $0x43e480,%ecx
  40a169:       c5 fa 11 15 73 ef 18    vmovss %xmm2,0x18ef73(%rip)        # 5990e4 <a+0x1f3e4>
  40a170:       00 
  40a171:       41 b8 00 b5 53 00       mov    $0x53b500,%r8d
  40a177:       c5 fa 11 25 69 ef 18    vmovss %xmm4,0x18ef69(%rip)        # 5990e8 <a+0x1f3e8>
  40a17e:       00 
  40a17f:       41 b9 c0 b4 4b 00       mov    $0x4bb4c0,%r9d
  40a185:       68 00 91 59 00          push   $0x599100
  40a18a:       68 00 b5 4f 00          push   $0x4fb500
  40a18f:       c5 f8 77                vzeroupper 
  40a192:       c5 f8 57 c0             vxorps %xmm0,%xmm0,%xmm0
  40a196:       e8 d5 92 00 00          call   413470 <dummy>
  40a19b:       48 83 c4 10             add    $0x10,%rsp
  40a19f:       41 ff c7                inc    %r15d
  40a1a2:       41 81 ff 40 42 0f 00    cmp    $0xf4240,%r15d
  40a1a9:       0f 82 cc fe ff ff       jb     40a07b <s116+0x2b>
  40a1af:       48 83 c3 10             add    $0x10,%rbx
  40a1b3:       33 f6                   xor    %esi,%esi
  40a1b5:       48 89 df                mov    %rbx,%rdi
  40a1b8:       e8 03 6f ff ff          call   4010c0 <gettimeofday@plt>
  40a1bd:       bf 74 f5 42 00          mov    $0x42f574,%edi
  40a1c2:       48 83 c4 10             add    $0x10,%rsp
  40a1c6:       5b                      pop    %rbx
  40a1c7:       41 5f                   pop    %r15
  40a1c9:       48 89 ec                mov    %rbp,%rsp
  40a1cc:       5d                      pop    %rbp
  40a1cd:       e9 de 97 00 00          jmp    4139b0 <calc_checksum>
  40a1d2:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)
  40a1d9:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)
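
As far as the listing above shows, the main ICC loop loads eight floats starting at a + 4 bytes, multiplies them by the eight floats starting at a, and stores the product back at a, four such vectors per iteration. That corresponds to treating the kernel as a[j] = a[j + 1] * a[j] and processing it in chunks. A scalar model of that scheme might look like the sketch below; the function names and the LANES constant are illustrative assumptions, and the 4x unrolling is omitted.

```c
#include <stddef.h>

enum { LANES = 8 };   /* floats per 256-bit vector, assumed for illustration */

/* Reference: n steps of a[j] = a[j + 1] * a[j]; a must hold n + 1 floats. */
void mul_shifted_scalar(float *a, size_t n)
{
    for (size_t j = 0; j < n; j++)
        a[j] = a[j + 1] * a[j];
}

/* Chunked model: within each chunk, all loads happen before all stores,
   mirroring the vmovups / vmulps / vmovups sequence. */
void mul_shifted_chunked(float *a, size_t n)
{
    size_t j = 0;

    for (; j + LANES <= n; j += LANES) {
        float hi[LANES], lo[LANES];

        for (size_t k = 0; k < LANES; k++) {
            hi[k] = a[j + 1 + k];   /* unaligned "shifted" load */
            lo[k] = a[j + k];
        }
        for (size_t k = 0; k < LANES; k++)
            a[j + k] = hi[k] * lo[k];
    }

    /* Scalar tail for the remaining elements. */
    for (; j < n; j++)
        a[j] = a[j + 1] * a[j];
}
```

The chunked form is valid here because every element is read before the store that overwrites it: a[j + 1] is only written one step after it has been read, so hoisting the loads of a chunk above its stores does not change any value.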


* [Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
  2021-03-04 23:01 [Bug middle-end/99395] New: s116 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2021-03-05 15:42 ` hubicka at gcc dot gnu.org
@ 2022-10-18 10:28 ` rguenth at gcc dot gnu.org
  2022-10-18 10:37 ` rguenth at gcc dot gnu.org
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-10-18 10:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
So in the BB SLP attempt from loop vectorization (or in the BB SLP pass with
-fno-predictive-commoning) we get confused during DR group building because
of a duplicate access and fixup splitting the candidates at odd points.

For the reduced testcase we see

  <bb 3> [local count: 1063004409]:
  # i_16 = PHI <_5(5), 0(2)>
  # ivtmp_18 = PHI <ivtmp_15(5), 511(2)>
  _1 = i_16 + 1;
  _2 = a[_1];
  _3 = a[i_16];
  _4 = _2 * _3;
  a[i_16] = _4;
  _5 = i_16 + 2;
  _6 = a[_5];
  _7 = a[_1];
  _8 = _6 * _7;
  a[_1] = _8;
  ivtmp_15 = ivtmp_18 - 1;
  if (ivtmp_15 != 0)
    goto <bb 5>; [99.00%]
  else
    goto <bb 4>; [1.00%]

so a[_1] is loaded twice because CSE doesn't figure that a[i_16] cannot alias
it.  That causes us to split the load group.


* [Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
  2021-03-04 23:01 [Bug middle-end/99395] New: s116 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2022-10-18 10:28 ` rguenth at gcc dot gnu.org
@ 2022-10-18 10:37 ` rguenth at gcc dot gnu.org
  2023-10-07  9:10 ` juzhe.zhong at rivai dot ai
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-10-18 10:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Fixing the CSE in the testcase by doing

double a[1024];
void foo ()
{
  for (int i = 0; i < 1022; i += 2)
    {
      double tem = a[i+1];
      a[i] = tem * a[i];
      a[i+1] = a[i+2] * tem;
    }
}

gets us

t.c:4:21: note:   Detected interleaving load a[i_15] and a[_1]
t.c:4:21: note:   Detected interleaving store a[i_15] and a[_1]
t.c:4:21: note:   Detected interleaving load of size 2
t.c:4:21: note:         _2 = a[i_15];
t.c:4:21: note:         tem_10 = a[_1];
t.c:4:21: note:   Detected single element interleaving a[_4] step 16
t.c:4:21: note:   Detected interleaving store of size 2
t.c:4:21: note:         a[i_15] = _3;
t.c:4:21: note:         a[_1] = _6;

in the loop pass, where dependence analysis then fails, and
with the SLP pass (no predcom):

t.c:10:1: note:   Detected interleaving load a[i_15] and a[_1]
t.c:10:1: note:   Detected interleaving load a[i_15] and a[_4]
t.c:10:1: note:   Detected interleaving store a[i_15] and a[_1]
t.c:10:1: note:   Detected interleaving load of size 3
t.c:10:1: note:         _2 = a[i_15];
t.c:10:1: note:         tem_10 = a[_1];
t.c:10:1: note:         _5 = a[_4];
t.c:10:1: note:   Detected interleaving store of size 2
t.c:10:1: note:         a[i_15] = _3;
t.c:10:1: note:         a[_1] = _6;

which then runs into gap vect issues for how we'd vectorize the three
element load.

The dependence analysis is done by analyzing the validity of the
vectorized load/store placement and the implied motion of the
scalar load/store statements.  The missed optimization here would
be the missed alternate placement that would be correct.  But I
think the way we form groups would need to be revisited first here.
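
To illustrate what a correct alternate placement could look like for the reduced testcase, the sketch below hoists all three loads of an iteration above both stores. This is only a source-level model under stated assumptions (the function names are illustrative, and the array is passed as a pointer to at least 1024 doubles); it does not describe how the vectorizer would actually form its groups.

```c
/* Reduced testcase as posted, with the array passed in. */
void foo_original(double *a)
{
    for (int i = 0; i < 1022; i += 2) {
        a[i] = a[i + 1] * a[i];
        a[i + 1] = a[i + 2] * a[i + 1];
    }
}

/* Same computation with every load of the iteration placed first.
   The loads form one contiguous group a[i], a[i + 1], a[i + 2];
   the stores form one contiguous group a[i], a[i + 1]. */
void foo_loads_first(double *a)
{
    for (int i = 0; i < 1022; i += 2) {
        double x0 = a[i];
        double x1 = a[i + 1];
        double x2 = a[i + 2];

        a[i] = x1 * x0;
        a[i + 1] = x2 * x1;
    }
}
```

Within one iteration the stores to a[i] and a[i + 1] never touch a[i + 2], and a[i + 1] is read before it is written, so moving the loads up preserves the result.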


* [Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
  2021-03-04 23:01 [Bug middle-end/99395] New: s116 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2022-10-18 10:37 ` rguenth at gcc dot gnu.org
@ 2023-10-07  9:10 ` juzhe.zhong at rivai dot ai
  2023-10-09  6:30 ` rguenth at gcc dot gnu.org
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2023-10-07  9:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395

JuzheZhong <juzhe.zhong at rivai dot ai> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |juzhe.zhong at rivai dot ai

--- Comment #6 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
Hi, Richi.

Recently, I have been evaluating the TSVC performance of GCC.

I found that both RISC-V and aarch64 can SLP vectorize it:

https://godbolt.org/z/ssvTxxjeT

Both GCC-13 and trunk GCC can SLP it like LLVM (GCC-12 failed), but only with
-fno-vect-cost-model.

I suspect we should adjust the vector cost model (I don't think we should
adjust the cost model in the target backend, since LLVM vectorizes such cases
by default).


* [Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
  2021-03-04 23:01 [Bug middle-end/99395] New: s116 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2023-10-07  9:10 ` juzhe.zhong at rivai dot ai
@ 2023-10-09  6:30 ` rguenth at gcc dot gnu.org
  2024-01-30 11:37 ` juzhe.zhong at rivai dot ai
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-10-09  6:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|2021-03-05 00:00:00         |2023-10-9

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to JuzheZhong from comment #6)
> Hi, Richi.
> 
> Recently, I am evaluating TSVC performance of GCC:
> 
> I found both RISC-V and aarch64 can SLP vectorize it:
> 
> https://godbolt.org/z/ssvTxxjeT
> 
> Both GCC-13 and trunk GCC can SLP it like LLVM (GCC-12 failed) but with
> -fno-vect-cost-model.
> 
> I suspect we should adjust Vector COST model (I don't think we should adjust
> cost model in target backend since LLVM by default vectorize such case).

We are only vectorizing part of the scalar code.  The CSE issue still exists,
as do the resulting loop analysis issues.


* [Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
  2021-03-04 23:01 [Bug middle-end/99395] New: s116 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2023-10-09  6:30 ` rguenth at gcc dot gnu.org
@ 2024-01-30 11:37 ` juzhe.zhong at rivai dot ai
  2024-01-30 12:26 ` rguenth at gcc dot gnu.org
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2024-01-30 11:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395

--- Comment #8 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
Hi, Richard.

Now I have found time to look into GCC vectorization optimization.

I found this case:

  _2 = a[_1];
  ...
  a[i_16] = _4;
  ...
  _7 = a[_1];    ---> This load should be eliminated and re-use _2.

Am I right?

Could you tell me which pass should do this CSE optimization?

Thanks.


* [Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
  2021-03-04 23:01 [Bug middle-end/99395] New: s116 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2024-01-30 11:37 ` juzhe.zhong at rivai dot ai
@ 2024-01-30 12:26 ` rguenth at gcc dot gnu.org
  2024-01-31  1:32 ` juzhe.zhong at rivai dot ai
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-01-30 12:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395

--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to JuzheZhong from comment #8)
> Hi, Richard.
> 
> Now, I find the time to GCC vectorization optimization.
> 
> I find this case:
> 
>   _2 = a[_1];
>   ...
>   a[i_16] = _4;
>   ...
>   _7 = a[_1];    ---> This load should be eliminated and re-use _2.
> 
> Am I right ?
> 
> Could you guide me which pass should do this CSE optimization ?
> 
> Thanks.

In principle it's value-numbering.  The reason it doesn't do this is the
compile-time cost of doing full data-ref analysis.  In principle it's
as "easy" as hooking that up into vn_reference_lookup_3 as part of the
early work therein to disambiguate more defs.

Iff we chose to refrain from valueizing any of the SSA uses we could
cache both the data references and the dependence resolution.

One could also think of doing very simple recognition of these
single index expressions and / or integrating this with other cases.
IIRC there are some cases warranting SCEV processing / niter analysis as
well, for example to figure out that

 for (int i = 0; i < 128; ++i)
   a[i] = 1;
 return a[5];

returns 1.
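
The "very simple recognition of these single index expressions" mentioned above could be modelled roughly as follows. This is a hypothetical sketch, not GCC code: struct index_ref and both functions are invented for illustration, and it assumes both references are to the same array, have the same element type, and that the index arithmetic does not wrap.

```c
#include <stdbool.h>

/* Hypothetical model of a reference a[base + offset], where "base" is
   identified by an integer id standing in for an SSA name such as i_16. */
struct index_ref {
    int base_id;    /* which index variable the reference uses */
    long offset;    /* constant added to that variable */
};

/* Two references with the same base and different constant offsets
   cannot name the same element in the same iteration.  Anything else
   is answered conservatively with "not provably distinct". */
bool refs_provably_distinct(struct index_ref x, struct index_ref y)
{
    return x.base_id == y.base_id && x.offset != y.offset;
}

/* A previously loaded value may be reused after an intervening store
   only if the store provably writes a different element. */
bool load_survives_store(struct index_ref load, struct index_ref store)
{
    return refs_provably_distinct(load, store);
}
```

For the case discussed here, the load a[_1] is a[i_16 + 1] and the store is a[i_16 + 0], so under this model the second load of a[_1] could reuse _2.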


* [Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
  2021-03-04 23:01 [Bug middle-end/99395] New: s116 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2024-01-30 12:26 ` rguenth at gcc dot gnu.org
@ 2024-01-31  1:32 ` juzhe.zhong at rivai dot ai
  2024-01-31  3:50 ` juzhe.zhong at rivai dot ai
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2024-01-31  1:32 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395

--- Comment #10 from JuzheZhong <juzhe.zhong at rivai dot ai> ---

I think the root cause is that we consider i_16 and _1 to alias because of
scalar evolution:

(get_scalar_evolution 
  (scalar = i_16)
  (scalar_evolution = {0, +, 2}<nw>_1))

(get_scalar_evolution 
  (scalar = _1)
  (scalar_evolution = {1, +, 2}<nw>_1))

I don't fully understand what this means, though.

diff --git a/gcc/tree-scalar-evolution.cc b/gcc/tree-scalar-evolution.cc
index 25e3130e2f1..2df6de67043 100644
--- a/gcc/tree-scalar-evolution.cc
+++ b/gcc/tree-scalar-evolution.cc
@@ -553,7 +553,7 @@ get_scalar_evolution (basic_block instantiated_below, tree scalar)
         if (SSA_NAME_IS_DEFAULT_DEF (scalar))
          res = scalar;
        else
-         res = *find_var_scev_info (instantiated_below, scalar);
+         res = scalar;
        break;

       case REAL_CST:

The above is an ugly hack in scalar evolution which is definitely wrong (it is
just an experiment).

With it, we can vectorize the loop:

foo:
        lui     a1,%hi(a)
        addi    a1,a1,%lo(a)
        li      a2,511
        li      a3,0
        vsetivli        zero,2,e64,m1,ta,ma
.L2:
        addiw   a5,a3,1
        slli    a5,a5,3
        add     a5,a1,a5
        fld     fa5,0(a5)
        slli    a4,a3,3
        add     a4,a1,a4
        vlse64.v        v2,0(a4),zero
        vle64.v v1,0(a5)
        vfslide1down.vf v2,v2,fa5
        addiw   a2,a2,-1
        vfmul.vv        v1,v1,v2
        vse64.v v1,0(a4)
        addiw   a3,a3,2
        bne     a2,zero,.L2
        ret

I think we can add some simple memory access index recognition, but I don't
know where to add this recognition.

Would you mind giving me some more hints?

Thanks.


* [Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
  2021-03-04 23:01 [Bug middle-end/99395] New: s116 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
                   ` (9 preceding siblings ...)
  2024-01-31  1:32 ` juzhe.zhong at rivai dot ai
@ 2024-01-31  3:50 ` juzhe.zhong at rivai dot ai
  2024-01-31  7:00 ` juzhe.zhong at rivai dot ai
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2024-01-31  3:50 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395

--- Comment #11 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
It seems that we should first fix this case (the one Richard gave), which I
don't think is a SCEV or value-numbering issue:

double a[1024];
void foo ()
{
  for (int i = 0; i < 1022; i += 2)
    {
      double tem = a[i+1];
      a[i] = tem * a[i];
      a[i+1] = a[i+2] * tem;
    }
}

auto.c:13:21: missed: couldn't vectorize loop
auto.c:15:14: missed: not vectorized: no vectype for stmt: tem_10 = a[_1];
 scalar_type: double


* [Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
  2021-03-04 23:01 [Bug middle-end/99395] New: s116 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
                   ` (10 preceding siblings ...)
  2024-01-31  3:50 ` juzhe.zhong at rivai dot ai
@ 2024-01-31  7:00 ` juzhe.zhong at rivai dot ai
  2024-01-31  7:59 ` rguenth at gcc dot gnu.org
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2024-01-31  7:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395

--- Comment #12 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
OK. It seems there is a data dependence issue:

missed:   not vectorized, possible dependence between data-refs a[i_15] and a[_4]

a[i_15] = _3;  STMT 1
_4 = i_15 + 2;
_5 = a[_4];    STMT 2

STMT 2 should not depend on STMT 1.

It is recognized as a dependence in vect_analyze_data_ref_dependence.

Is it reasonable to fix it in vect_analyze_data_ref_dependence?


* [Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
  2021-03-04 23:01 [Bug middle-end/99395] New: s116 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
                   ` (11 preceding siblings ...)
  2024-01-31  7:00 ` juzhe.zhong at rivai dot ai
@ 2024-01-31  7:59 ` rguenth at gcc dot gnu.org
  2024-01-31  8:09 ` juzhe.zhong at rivai dot ai
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-01-31  7:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395

--- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to JuzheZhong from comment #12)
> OK. It seems it has data dependency issue:
> 
> missed:   not vectorized, possible dependence between data-refs a[i_15] and
> a[_4]
> 
> a[i_15] = _3;  STMT 1
> _4 = i_15 + 2;
> _5 = a[_4];    STMT 2
> 
> STMT2 should not depend on STMT1.
> 
> It's recognized as dependency in vect_analyze_data_ref_dependence.
> 
> Is it reasonable to fix it in vect_analyze_data_ref_dependence ?

t2.c:4:21: note:   dependence distance  = 1.
t2.c:7:12: missed:   not vectorized, possible dependence between data-refs a[i_15] and a[_4]
t2.c:4:21: missed:  bad data dependence.

so there's a cross iteration dependence with distance 1 - that's

(compute_affine_dependence
  ref_a: a[i_15], stmt_a: a[i_15] = _3;
  ref_b: a[_4], stmt_b: _5 = a[_4];
(analyze_overlapping_iterations
  (chrec_a = {0, +, 2}<nw>_1)
  (chrec_b = {2, +, 2}<nw>_1)
(analyze_siv_subscript 
(analyze_subscript_affine_affine
  (overlaps_a = [1 + 1 * x_1])
  (overlaps_b = [0 + 1 * x_1]))
) 
  (overlap_iterations_a = [1 + 1 * x_1])
  (overlap_iterations_b = [0 + 1 * x_1])) 
(build_classic_dist_vector
  dist_vector = (1 
  )
)
)

a read-after-write of a[i+2] after storing to a[i+1] in program order.
This would be fine with a VF of 1 only, but we are not really considering
that (a pure SLP vectorization w/o unrolling).  Instead we start with the
assumption of classical vectorization using interleaving which has a
minimal VF of the number of lanes of the vector type with the largest
number of lanes as determined by vect_analyze_data_refs.
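
The distance of 1 in the dump above can be reproduced by hand. The following is a minimal sketch, not GCC code: `struct affine_access` and `same_step_distance` are hypothetical names, and the helper only covers the simple case where both access functions have the same step, as `{0, +, 2}` and `{2, +, 2}` do here.

```c
#include <assert.h>
#include <stdbool.h>

/* An affine access function {base, +, step}: the subscript used in
   iteration x is base + step * x.  Hypothetical helper type, not a
   GCC data structure.  */
struct affine_access
{
  long base;
  long step;
};

/* For two accesses with the same non-zero step, compute the iteration
   distance d such that access A in iteration x + d touches the same
   element as access B in iteration x.  Returns false when the two
   accesses never touch the same element.  */
static bool
same_step_distance (struct affine_access a, struct affine_access b,
                    long *dist)
{
  assert (a.step == b.step && a.step != 0);
  long diff = b.base - a.base;
  if (diff % a.step != 0)
    return false;
  *dist = diff / a.step;
  return true;
}
```

Calling it with `{0, 2}` for the store and `{2, 2}` for the load gives a distance of 1, which matches `overlaps_a = [1 + 1 * x_1]` against `overlaps_b = [0 + 1 * x_1]`.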

We can delay this all a bit but then the SLP build will fail anyway:

t2.c:4:21: missed:   Build SLP failed: different interleaving chains in one
node _5 = a[_4];

which is because we do

t2.c:4:21: note:   === vect_analyze_data_ref_accesses ===
t2.c:4:21: note:   Detected interleaving load a[i_15] and a[_1]
t2.c:4:21: note:   Detected interleaving store a[i_15] and a[_1]
t2.c:4:21: note:   Detected interleaving load of size 2
t2.c:4:21: note:        _2 = a[i_15];
t2.c:4:21: note:        tem_10 = a[_1];
t2.c:4:21: note:   Detected single element interleaving a[_4] step 16

that is, we are splitting the chain because of the intermediate store
(that's kind-of OK-ish, heuristically it works for more cases).

We'd usually handle the VF == 1 cases also during BB vectorization on
the loop body, but we're only doing that when there was if-conversion
and the later stand-alone BB vectorization is after predictive commoning
which wrecks the loop.  We should move predcom after BB vect for that.
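
To illustrate why VF == 1 would be fine here, the sketch below compares a scalar loop with a two-lane form that keeps one scalar iteration per vector iteration. It is only an illustration under assumptions: the loop shape is guessed from the dumps above (two stores per iteration, a load of a[i + 2], step 16 with double elements), t2.c itself is not shown in this message, and the function names are made up.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference.  Assumed shape of the reduced test case.  */
static void
kernel_scalar (double *a, size_t n)
{
  assert (a != NULL);
  for (size_t i = 0; i + 2 < n; i += 2)
    {
      a[i] = a[i + 1] * a[i];
      a[i + 1] = a[i + 2] * a[i + 1];
    }
}

/* Two-lane form with VF == 1: both operand groups of one scalar
   iteration are loaded first, then multiplied lane by lane, then
   stored.  No elements from a later iteration are involved, so the
   distance-1 dependence is never violated.  */
static void
kernel_two_lane (double *a, size_t n)
{
  assert (a != NULL);
  for (size_t i = 0; i + 2 < n; i += 2)
    {
      double lo[2] = { a[i], a[i + 1] };      /* lanes a[i], a[i+1] */
      double hi[2] = { a[i + 1], a[i + 2] };  /* lanes a[i+1], a[i+2] */
      double r[2] = { hi[0] * lo[0], hi[1] * lo[1] };
      a[i] = r[0];
      a[i + 1] = r[1];
    }
}
```

Both functions leave the array in the same state. With a VF of 2 or more the loads of a later iteration would be hoisted above the stores of an earlier one, which is what the distance of 1 forbids.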

That said, this PR is quite elaborate and it will touch some key design
issues in the vectorizer.  I'd rather finally finish getting us to
work on the SLP representation only before touching all these delicate
things.  The following allows the analysis to proceed a bit longer
with VF == 1.  Not adjusting min_vf early might have issues, but the
change might work as-is and possibly allow some cases to be loop vectorized
with SLP and a low VF that we now fail to.

diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
index f592aeb8028..b16b4664e7b 100644
--- a/gcc/tree-vect-data-refs.cc
+++ b/gcc/tree-vect-data-refs.cc
@@ -589,7 +589,7 @@ vect_analyze_data_ref_dependence (struct data_dependence_relation *ddr,
        }

       unsigned int abs_dist = abs (dist);
-      if (abs_dist >= 2 && abs_dist < *max_vf)
+      if (abs_dist >= 1 && abs_dist < *max_vf)
        {
          /* The dependence distance requires reduction of the maximal
             vectorization factor.  */
@@ -4946,7 +4955,7 @@ vect_analyze_data_refs (vec_info *vinfo, poly_uint64 *min_vf, bool *fatal)
       /* Adjust the minimal vectorization factor according to the
         vector type.  */
       vf = TYPE_VECTOR_SUBPARTS (vectype);
-      *min_vf = upper_bound (*min_vf, vf);
+      //*min_vf = upper_bound (*min_vf, vf);

       /* Leave the BB vectorizer to pick the vector type later, based on
         the final dataref group size and SLP node size.  */
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 30b90d99925..7eab3d4bebc 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -2719,7 +2719,7 @@ vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal,
   opt_result ok = opt_result::success ();
   int res;
   unsigned int max_vf = MAX_VECTORIZATION_FACTOR;
-  poly_uint64 min_vf = 2;
+  poly_uint64 min_vf = 1;
   loop_vec_info orig_loop_vinfo = NULL;

   /* If we are dealing with an epilogue then orig_loop_vinfo points to the
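
The effect of the first hunk can be modelled in isolation. This is a simplified sketch and not the GCC function: `distance_blocks_vectorization` and its `min_dist` parameter are hypothetical, and the handling of distance 0 and of reversed dependences that comes earlier in vect_analyze_data_ref_dependence is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified model of the distance check.  MIN_DIST is 2 before the
   patch and 1 after it.  A distance in [MIN_DIST, *MAX_VF) lowers
   *MAX_VF to that distance.  Returns true when the dependence is
   still reported as blocking vectorization.  */
static bool
distance_blocks_vectorization (int dist, unsigned min_dist,
                               unsigned *max_vf)
{
  assert (max_vf != NULL);
  unsigned abs_dist = (unsigned) abs (dist);
  if (abs_dist >= min_dist && abs_dist < *max_vf)
    *max_vf = abs_dist;   /* clamp the maximal vectorization factor */
  /* A distance at or above the maximal VF is harmless.  */
  return abs_dist < *max_vf;
}
```

With `min_dist` of 2, a distance of 1 is reported as blocking and `max_vf` is untouched. With `min_dist` of 1, `max_vf` drops to 1 and the dependence is accepted, which only helps if the minimal VF is also allowed to be 1, as the other two hunks arrange.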


* [Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
  2021-03-04 23:01 [Bug middle-end/99395] New: s116 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
                   ` (12 preceding siblings ...)
  2024-01-31  7:59 ` rguenth at gcc dot gnu.org
@ 2024-01-31  8:09 ` juzhe.zhong at rivai dot ai
  2024-01-31  8:18 ` rguenther at suse dot de
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2024-01-31  8:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395

--- Comment #14 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
Thanks Richard.

It seems that we can't fix this issue for now. Is that right?

If I understand correctly, you mean we should wait until the SLP representation
work is finished and then revisit this PR?


* [Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
  2021-03-04 23:01 [Bug middle-end/99395] New: s116 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
                   ` (13 preceding siblings ...)
  2024-01-31  8:09 ` juzhe.zhong at rivai dot ai
@ 2024-01-31  8:18 ` rguenther at suse dot de
  2024-01-31  8:27 ` juzhe.zhong at rivai dot ai
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: rguenther at suse dot de @ 2024-01-31  8:18 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395

--- Comment #15 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 31 Jan 2024, juzhe.zhong at rivai dot ai wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
> 
> --- Comment #14 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
> Thanks Richard.
> 
> It seems that we can't fix this issue for now. Is that right ?
> 
> If I understand correctly, do you mean we should wait after SLP representations
> are finished and then revisit this PR?

Yes.


* [Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
  2021-03-04 23:01 [Bug middle-end/99395] New: s116 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
                   ` (14 preceding siblings ...)
  2024-01-31  8:18 ` rguenther at suse dot de
@ 2024-01-31  8:27 ` juzhe.zhong at rivai dot ai
  2024-01-31  9:08 ` rguenther at suse dot de
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2024-01-31  8:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395

--- Comment #16 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
(In reply to rguenther@suse.de from comment #15)
> On Wed, 31 Jan 2024, juzhe.zhong at rivai dot ai wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
> > 
> > --- Comment #14 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
> > Thanks Richard.
> > 
> > It seems that we can't fix this issue for now. Is that right ?
> > 
> > If I understand correctly, do you mean we should wait after SLP representations
> > are finished and then revisit this PR?
> 
> Yes.

That seems to be a big refactoring effort.

Is there anything I can do to help with the SLP representation work?


* [Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
  2021-03-04 23:01 [Bug middle-end/99395] New: s116 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
                   ` (15 preceding siblings ...)
  2024-01-31  8:27 ` juzhe.zhong at rivai dot ai
@ 2024-01-31  9:08 ` rguenther at suse dot de
  2024-01-31  9:19 ` juzhe.zhong at rivai dot ai
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: rguenther at suse dot de @ 2024-01-31  9:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395

--- Comment #17 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 31 Jan 2024, juzhe.zhong at rivai dot ai wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
> 
> --- Comment #16 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
> (In reply to rguenther@suse.de from comment #15)
> > On Wed, 31 Jan 2024, juzhe.zhong at rivai dot ai wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
> > > 
> > > --- Comment #14 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
> > > Thanks Richard.
> > > 
> > > It seems that we can't fix this issue for now. Is that right ?
> > > 
> > > If I understand correctly, do you mean we should wait after SLP representations
> > > are finished and then revisit this PR?
> > 
> > Yes.
> 
> It seems to be a big refactor work.

It wouldn't be too bad if people didn't keep adding features that
don't implement SLP ...

> I wonder I can do anything to help with SLP representations ?

I hope to get back to this before stage1 re-opens and will post
another request for testing.  It's really mostly going to be making
sure all paths have coverage which means testing all the various
architectures - I can only easily test x86.  There's a branch
I worked on last year, refs/users/rguenth/heads/vect-force-slp,
which I use to hunt down cases not supporting SLP (it's a bit
overeager to trigger, and it has known holes so it's not really
a good starting point yet for folks to try other archs).


* [Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
  2021-03-04 23:01 [Bug middle-end/99395] New: s116 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
                   ` (16 preceding siblings ...)
  2024-01-31  9:08 ` rguenther at suse dot de
@ 2024-01-31  9:19 ` juzhe.zhong at rivai dot ai
  2024-01-31  9:56 ` rguenther at suse dot de
  2024-04-07 21:33 ` pinskia at gcc dot gnu.org
  19 siblings, 0 replies; 21+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2024-01-31  9:19 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395

--- Comment #18 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
(In reply to rguenther@suse.de from comment #17)
> On Wed, 31 Jan 2024, juzhe.zhong at rivai dot ai wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
> > 
> > --- Comment #16 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
> > (In reply to rguenther@suse.de from comment #15)
> > > On Wed, 31 Jan 2024, juzhe.zhong at rivai dot ai wrote:
> > > 
> > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
> > > > 
> > > > --- Comment #14 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
> > > > Thanks Richard.
> > > > 
> > > > It seems that we can't fix this issue for now. Is that right ?
> > > > 
> > > > If I understand correctly, do you mean we should wait after SLP representations
> > > > are finished and then revisit this PR?
> > > 
> > > Yes.
> > 
> > It seems to be a big refactor work.
> 
> It's not too bad if people wouldn't continue to add features not 
> implementing SLP ...
> 
> > I wonder I can do anything to help with SLP representations ?
> 
> I hope to get back to this before stage1 re-opens and will post
> another request for testing.  It's really mostly going to be making
> sure all paths have coverage which means testing all the various
> architectures - I can only easily test x86.  There's a branch
> I worked on last year, refs/users/rguenth/heads/vect-force-slp,
> which I use to hunt down cases not supporting SLP (it's a bit
> overeager to trigger, and it has known holes so it's not really
> a good starting point yet for folks to try other archs).

OK. It seems that you are almost done with that, but it needs more testing
on various targets.

So, if I want to work on optimizing vectorization (starting with TSVC),
I should avoid touching the cases that fail to vectorize due to data
reference/dependence analysis (e.g. the case in this PR, s116),

and avoid adding new features to the loop vectorizer, e.g. min/max reduction with
index (s315),

so as not to make your SLP refactoring work heavier.

Am I right?


* [Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
  2021-03-04 23:01 [Bug middle-end/99395] New: s116 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
                   ` (17 preceding siblings ...)
  2024-01-31  9:19 ` juzhe.zhong at rivai dot ai
@ 2024-01-31  9:56 ` rguenther at suse dot de
  2024-04-07 21:33 ` pinskia at gcc dot gnu.org
  19 siblings, 0 replies; 21+ messages in thread
From: rguenther at suse dot de @ 2024-01-31  9:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395

--- Comment #19 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 31 Jan 2024, juzhe.zhong at rivai dot ai wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
> 
> --- Comment #18 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
> (In reply to rguenther@suse.de from comment #17)
> > On Wed, 31 Jan 2024, juzhe.zhong at rivai dot ai wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
> > > 
> > > --- Comment #16 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
> > > (In reply to rguenther@suse.de from comment #15)
> > > > On Wed, 31 Jan 2024, juzhe.zhong at rivai dot ai wrote:
> > > > 
> > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
> > > > > 
> > > > > --- Comment #14 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
> > > > > Thanks Richard.
> > > > > 
> > > > > It seems that we can't fix this issue for now. Is that right ?
> > > > > 
> > > > > If I understand correctly, do you mean we should wait after SLP representations
> > > > > are finished and then revisit this PR?
> > > > 
> > > > Yes.
> > > 
> > > It seems to be a big refactor work.
> > 
> > It's not too bad if people wouldn't continue to add features not 
> > implementing SLP ...
> > 
> > > I wonder I can do anything to help with SLP representations ?
> > 
> > I hope to get back to this before stage1 re-opens and will post
> > another request for testing.  It's really mostly going to be making
> > sure all paths have coverage which means testing all the various
> > architectures - I can only easily test x86.  There's a branch
> > I worked on last year, refs/users/rguenth/heads/vect-force-slp,
> > which I use to hunt down cases not supporting SLP (it's a bit
> > overeager to trigger, and it has known holes so it's not really
> > a good starting point yet for folks to try other archs).
> 
> Ok. It seems that you almost done with that but needs more testing in
> various targets.
> 
> So, if I want to work on optimizing vectorization (start with TSVC),
> I should avoid touching the failed vectorized due to data reference/dependence
> analysis (e.g. this PR case, s116).

It depends on the actual case - the one in this bug at least looks like
half of it might be dealt with by the refactoring.

> and avoid adding new features into loop vectorizer, e.g. min/max reduction with
> index (s315).

It's fine to add features if they work with SLP as well ;)  Note that
in the future SLP will also do the "single lane" case but it doesn't
do that on trunk.  Some features are difficult with multi-lane SLP
and probably not important in practice for that case, still handling
single-lane SLP will be important as otherwise the feature is lost.

> To not to make your SLP refactoring work heavier.
> 
> Am I right ?

Yes.  I've got early break vectorization to chase now, I was "finished"
with the parts I could exercise on x86_64 in autumn ...


* [Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
  2021-03-04 23:01 [Bug middle-end/99395] New: s116 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
                   ` (18 preceding siblings ...)
  2024-01-31  9:56 ` rguenther at suse dot de
@ 2024-04-07 21:33 ` pinskia at gcc dot gnu.org
  19 siblings, 0 replies; 21+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-04-07 21:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement


Thread overview: 21+ messages
2021-03-04 23:01 [Bug middle-end/99395] New: s116 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
2021-03-04 23:24 ` [Bug middle-end/99395] " hubicka at gcc dot gnu.org
2021-03-05  8:27 ` [Bug tree-optimization/99395] " rguenth at gcc dot gnu.org
2021-03-05 15:42 ` hubicka at gcc dot gnu.org
2022-10-18 10:28 ` rguenth at gcc dot gnu.org
2022-10-18 10:37 ` rguenth at gcc dot gnu.org
2023-10-07  9:10 ` juzhe.zhong at rivai dot ai
2023-10-09  6:30 ` rguenth at gcc dot gnu.org
2024-01-30 11:37 ` juzhe.zhong at rivai dot ai
2024-01-30 12:26 ` rguenth at gcc dot gnu.org
2024-01-31  1:32 ` juzhe.zhong at rivai dot ai
2024-01-31  3:50 ` juzhe.zhong at rivai dot ai
2024-01-31  7:00 ` juzhe.zhong at rivai dot ai
2024-01-31  7:59 ` rguenth at gcc dot gnu.org
2024-01-31  8:09 ` juzhe.zhong at rivai dot ai
2024-01-31  8:18 ` rguenther at suse dot de
2024-01-31  8:27 ` juzhe.zhong at rivai dot ai
2024-01-31  9:08 ` rguenther at suse dot de
2024-01-31  9:19 ` juzhe.zhong at rivai dot ai
2024-01-31  9:56 ` rguenther at suse dot de
2024-04-07 21:33 ` pinskia at gcc dot gnu.org
