public inbox for gcc-bugs@sourceware.org
From: "fxue at os dot amperecomputing.com" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/113104] New: Suboptimal loop-based slp node splicing across iterations
Date: Thu, 21 Dec 2023 08:22:15 +0000	[thread overview]
Message-ID: <bug-113104-4@http.gcc.gnu.org/bugzilla/> (raw)

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104

            Bug ID: 113104
           Summary: Suboptimal loop-based slp node splicing across
                    iterations
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

Given a partial-vector-sized SLP node in a loop, code generation utilizes
inter-iteration parallelism to achieve full vectorization by splicing the defs
of the node from multiple iterations into one vector. This strategy is not
always profitable and could be refined in some situations. Specifically, we had
better not splice a node if it participates in a full-vector-sized operation;
otherwise an unneeded permute and vextract are introduced.

Suppose the target vector size is 128 bits, and an SLP node is mapped to VEC_OP
in each iteration. Depending on whether the backend supports LO/HI versions of
the operation, there are two kinds of code sequences for splicing.

  // Isolated 2 iterations
  res_v128_I0 = VEC_OP(opnd_v64_I0, ...)   // iteration #0
  res_v128_I1 = VEC_OP(opnd_v64_I1, ...)   // iteration #1 


  // Spliced (1)
  opnd_v128_I0_I1 = { opnd_v64_I0, opnd_v64_I1 }      // extra permute
  opnd_v64_lo = [vec_unpack_lo_expr] opnd_v128_I0_I1; // extra vextract
  opnd_v64_hi = [vec_unpack_hi_expr] opnd_v128_I0_I1; // extra vextract
  res_v128_I0 = VEC_OP(opnd_v64_lo, ...)
  res_v128_I1 = VEC_OP(opnd_v64_hi, ...)

  // Spliced (2)
  opnd_v128_I0_I1 = { opnd_v64_I0, opnd_v64_I1 }  // extra permute
  res_v128_I0 = VEC_OP_LO(opnd_v128_i0_i1, ...)   // similar or same as VEC_OP
  res_v128_I1 = VEC_OP_HI(opnd_v128_i0_i1, ...)   // similar or same as VEC_OP

Sometimes such a permute and vextract can be optimized away by backend passes,
but sometimes they cannot. Here is a case on aarch64.

  int test(unsigned array[4][4]);

  int foo(unsigned short *a, unsigned long n)
  {
    unsigned array[4][4];

    for (unsigned i = 0; i < 4; i++, a += n)
      {
        array[i][0] = a[0] << 6;
        array[i][1] = a[1] << 6;
        array[i][2] = a[2] << 6;
        array[i][3] = a[3] << 6;
      }

    return test(array);
  }
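
The report does not state the compile options; the assembly below is consistent
with building for an aarch64 target at a high optimization level, for example
(the exact flags are an assumption, not taken from the report):

  gcc -O3 -S foo.c    // assumed command line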


        // Current code generation
        mov     x2, x0
        stp     x29, x30, [sp, -80]!
        add     x3, x2, x1, lsl 1
        lsl     x1, x1, 1
        mov     x29, sp
        add     x4, x3, x1
        ldr     d0, [x2]
        movi    v30.4s, 0
        add     x0, sp, 16
        ldr     d31, [x2, x1]
        ldr     d29, [x3, x1]
        ldr     d28, [x4, x1]
        ins     v0.d[1], v31.d[0]            // superfluous
        ins     v29.d[1], v28.d[0]           // superfluous
        zip1    v1.8h, v0.8h, v30.8h         // superfluous
        zip2    v0.8h, v0.8h, v30.8h         // superfluous
        zip1    v31.8h, v29.8h, v30.8h       // superfluous
        zip2    v29.8h, v29.8h, v30.8h       // superfluous
        shl     v1.4s, v1.4s, 6
        shl     v0.4s, v0.4s, 6
        shl     v31.4s, v31.4s, 6
        shl     v29.4s, v29.4s, 6
        stp     q1, q0, [sp, 16]
        stp     q31, q29, [sp, 48]
        bl      test
        ldp     x29, x30, [sp], 80
        ret


        // May be optimized to:
        stp     x29, x30, [sp, -80]!
        mov     x29, sp
        mov     x2, x0
        add     x0, sp, 16
        lsl     x3, x1, 1
        add     x1, x2, x1, lsl 1
        add     x4, x1, x3
        ldr     d31, [x2, x3]
        ushll   v31.4s, v31.4h, 6
        ldr     d30, [x2]
        ushll   v30.4s, v30.4h, 6
        str     q30, [sp, 16]
        ldr     d30, [x1, x3]
        ushll   v30.4s, v30.4h, 6
        str     q31, [sp, 32]
        ldr     d31, [x4, x3]
        ushll   v31.4s, v31.4h, 6
        stp     q30, q31, [sp, 48]
        bl      test
        ldp     x29, x30, [sp], 80
        ret
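
For reference, here is a rough C sketch, using arm_neon.h intrinsics, of the
per-row pattern that the optimized sequence corresponds to: one widening shift
(ushll) per 64-bit load, with no permute or unpack. This illustration is an
assumption added for clarity and is not part of the report; the function name
is made up.

  #include <arm_neon.h>
  #include <stdint.h>

  /* One row: load 4 halfwords (64 bits), widen to 32 bits while shifting
     left by 6, and store the 128-bit result.  */
  static inline void widen_shift_row(uint32_t *dst, const uint16_t *src)
  {
    uint16x4_t v = vld1_u16(src);        /* 64-bit load                  */
    uint32x4_t w = vshll_n_u16(v, 6);    /* single ushll: widen + shift  */
    vst1q_u32(dst, w);                   /* 128-bit store                */
  }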


Thread overview: 8+ messages
2023-12-21  8:22 fxue at os dot amperecomputing.com [this message]
2023-12-21  9:07 ` [Bug tree-optimization/113104] " rguenth at gcc dot gnu.org
2023-12-21  9:33 ` fxue at os dot amperecomputing.com
2023-12-21  9:41 ` rguenther at suse dot de
2023-12-30 12:35 ` rsandifo at gcc dot gnu.org
2024-01-05 16:25 ` cvs-commit at gcc dot gnu.org
2024-01-05 16:32 ` rsandifo at gcc dot gnu.org
2024-01-10  5:01 ` fxue at os dot amperecomputing.com
