public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/113104] New: Suboptimal loop-based slp node splicing across iterations
@ 2023-12-21  8:22 fxue at os dot amperecomputing.com
From: fxue at os dot amperecomputing.com @ 2023-12-21  8:22 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104

            Bug ID: 113104
           Summary: Suboptimal loop-based slp node splicing across
                    iterations
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

Given a partial-vector-sized SLP node in a loop, code generation uses
inter-iteration parallelism to achieve full vectorization by splicing the defs
of the node from multiple iterations into one vector. This strategy is not
always beneficial and could be refined in some situations. Specifically, we
should not splice a node if it participates in a full-vector-sized operation;
otherwise an unneeded permute and vextract are introduced.

Suppose the target vector size is 128 bits, and an SLP node is mapped to VEC_OP
in an iteration. Depending on whether the backend supports LO/HI versions of
the operation, there are two kinds of code sequences for splicing.

  // Isolated 2 iterations
  res_v128_I0 = VEC_OP(opnd_v64_I0, ...)   // iteration #0
  res_v128_I1 = VEC_OP(opnd_v64_I1, ...)   // iteration #1 


  // Spliced (1)
  opnd_v128_I0_I1 = { opnd_v64_I0, opnd_v64_I1 }      // extra permute
  opnd_v64_lo = [vec_unpack_lo_expr] opnd_v128_I0_I1; // extra vextract
  opnd_v64_hi = [vec_unpack_hi_expr] opnd_v128_I0_I1; // extra vextract
  res_v128_I0 = VEC_OP(opnd_v64_lo, ...)
  res_v128_I1 = VEC_OP(opnd_v64_hi, ...)

  // Spliced (2)
  opnd_v128_I0_I1 = { opnd_v64_I0, opnd_v64_I1 }  // extra permute
  res_v128_I0 = VEC_OP_LO(opnd_v128_I0_I1, ...)   // similar or same as VEC_OP
  res_v128_I1 = VEC_OP_HI(opnd_v128_I0_I1, ...)   // similar or same as VEC_OP
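
The isolated and "Spliced (1)" sequences can be modelled in plain C to show that the splice only adds data movement. This is a scalar sketch, not GCC internals: the types `v64_u16`, `v128_u16`, `v128_u32` and the helper names are hypothetical, VEC_OP is assumed to be a widening left shift (as in the aarch64 case in this report), and the low half is assumed to hold iteration #0.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model: a 64-bit vector is 4 x u16; a 128-bit vector
   is 8 x u16 on the input side or 4 x u32 on the result side.  */
typedef struct { uint16_t e[4]; } v64_u16;
typedef struct { uint16_t e[8]; } v128_u16;
typedef struct { uint32_t e[4]; } v128_u32;

/* Stand-in for VEC_OP: widen each lane to 32 bits, then shift left.  */
static v128_u32 vec_op(v64_u16 x, unsigned shift)
{
  v128_u32 r;

  assert(shift < 32);
  for (int i = 0; i < 4; i++)
    r.e[i] = (uint32_t) x.e[i] << shift;
  return r;
}

/* Isolated form: one VEC_OP per iteration, no extra data movement.  */
static void isolated(const v64_u16 in[2], v128_u32 out[2], unsigned shift)
{
  out[0] = vec_op(in[0], shift);
  out[1] = vec_op(in[1], shift);
}

/* Spliced form (1): join two iterations, then split them apart again.  */
static void spliced(const v64_u16 in[2], v128_u32 out[2], unsigned shift)
{
  v128_u16 joined;
  v64_u16 lo, hi;

  memcpy(&joined.e[0], in[0].e, sizeof in[0].e);  /* extra permute */
  memcpy(&joined.e[4], in[1].e, sizeof in[1].e);
  memcpy(lo.e, &joined.e[0], sizeof lo.e);        /* extra vextract */
  memcpy(hi.e, &joined.e[4], sizeof hi.e);        /* extra vextract */

  out[0] = vec_op(lo, shift);
  out[1] = vec_op(hi, shift);
}

/* Returns 1 if both forms give identical results for a sample input.  */
int splice_matches_isolated(void)
{
  const v64_u16 in[2] = { { { 1, 2, 3, 4 } }, { { 5, 6, 7, 65535 } } };
  v128_u32 a[2], b[2];

  isolated(in, a, 6);
  spliced(in, b, 6);
  return memcmp(a, b, sizeof a) == 0;
}
```

In this model the join and the two extracts cancel out, which is why they are pure overhead whenever VEC_OP already consumes a 64-bit operand and produces a full 128-bit result.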

Sometimes such a permute and vextract can be optimized away by backend passes,
but sometimes they cannot. Here is a case on aarch64.

  int test(unsigned array[4][4]);

  int foo(unsigned short *a, unsigned long n)
  {
    unsigned array[4][4];

    for (unsigned i = 0; i < 4; i++, a += n)
      {
        array[i][0] = a[0] << 6;
        array[i][1] = a[1] << 6;
        array[i][2] = a[2] << 6;
        array[i][3] = a[3] << 6;
      }

    return test(array);
  }
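
To compare results before and after any change in code generation, the test case can be given a concrete callee and a scalar reference. The following is only a sketch: `test_sum`, `foo_sum`, and `foo_reference` are hypothetical names, and `test_sum` is an assumed stand-in for the external `test`, which the report leaves undefined.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the external test(): sums all 16 elements.  */
static int test_sum(unsigned array[4][4])
{
  unsigned sum = 0;

  for (int i = 0; i < 4; i++)
    for (int j = 0; j < 4; j++)
      sum += array[i][j];
  return (int) sum;
}

/* Same loop as foo() in the report, calling the stand-in above.  */
int foo_sum(unsigned short *a, unsigned long n)
{
  unsigned array[4][4];

  for (unsigned i = 0; i < 4; i++, a += n)
    {
      array[i][0] = a[0] << 6;
      array[i][1] = a[1] << 6;
      array[i][2] = a[2] << 6;
      array[i][3] = a[3] << 6;
    }

  return test_sum(array);
}

/* Scalar reference: row i starts at a + i * n, each element is shifted by 6.  */
int foo_reference(const unsigned short *a, unsigned long n)
{
  unsigned sum = 0;

  assert(a != NULL);
  for (unsigned long i = 0; i < 4; i++)
    for (unsigned long j = 0; j < 4; j++)
      sum += (unsigned) a[i * n + j] << 6;
  return (int) sum;
}
```

Calling both functions on the same buffer with a row stride `n` of at least 4 elements should give equal results regardless of how the loop is vectorized.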


        // Current code generation
        mov     x2, x0
        stp     x29, x30, [sp, -80]!
        add     x3, x2, x1, lsl 1
        lsl     x1, x1, 1
        mov     x29, sp
        add     x4, x3, x1
        ldr     d0, [x2]
        movi    v30.4s, 0
        add     x0, sp, 16
        ldr     d31, [x2, x1]
        ldr     d29, [x3, x1]
        ldr     d28, [x4, x1]
        ins     v0.d[1], v31.d[0]            //
        ins     v29.d[1], v28.d[0]           // 
        zip1    v1.8h, v0.8h, v30.8h         // superfluous  
        zip2    v0.8h, v0.8h, v30.8h         //
        zip1    v31.8h, v29.8h, v30.8h       //
        zip2    v29.8h, v29.8h, v30.8h       //
        shl     v1.4s, v1.4s, 6
        shl     v0.4s, v0.4s, 6
        shl     v31.4s, v31.4s, 6
        shl     v29.4s, v29.4s, 6
        stp     q1, q0, [sp, 16]
        stp     q31, q29, [sp, 48]
        bl      test
        ldp     x29, x30, [sp], 80
        ret


        // May be optimized to:
        stp     x29, x30, [sp, -80]!
        mov     x29, sp
        mov     x2, x0
        add     x0, sp, 16
        lsl     x3, x1, 1
        add     x1, x2, x1, lsl 1
        add     x4, x1, x3
        ldr     d31, [x2, x3]
        ushll   v31.4s, v31.4h, 6
        ldr     d30, [x2]
        ushll   v30.4s, v30.4h, 6
        str     q30, [sp, 16]
        ldr     d30, [x1, x3]
        ushll   v30.4s, v30.4h, 6
        str     q31, [sp, 32]
        ldr     d31, [x4, x3]
        ushll   v31.4s, v31.4h, 6
        stp     q30, q31, [sp, 48]
        bl      test
        ldp     x29, x30, [sp], 80
        ret

Thread overview: 8+ messages
2023-12-21  8:22 [Bug tree-optimization/113104] New: Suboptimal loop-based slp node splicing across iterations fxue at os dot amperecomputing.com
2023-12-21  9:07 ` [Bug tree-optimization/113104] " rguenth at gcc dot gnu.org
2023-12-21  9:33 ` fxue at os dot amperecomputing.com
2023-12-21  9:41 ` rguenther at suse dot de
2023-12-30 12:35 ` rsandifo at gcc dot gnu.org
2024-01-05 16:25 ` cvs-commit at gcc dot gnu.org
2024-01-05 16:32 ` rsandifo at gcc dot gnu.org
2024-01-10  5:01 ` fxue at os dot amperecomputing.com
