[PATCH v5 0/5] arm: Add support for MVE Tail-Predicated Low Overhead Loops


From: Andre Vieira <andre.simoesdiasvieira@arm.com>
To: gcc-patches@gcc.gnu.org
Cc: stam.markianos-wright@arm.com, richard.earnshaw@arm.com,
	Andre Vieira <andre.simoesdiasvieira@arm.com>
Subject: [PATCH v5 0/5] arm: Add support for MVE Tail-Predicated Low Overhead Loops
Date: Thu, 22 Feb 2024 17:37:58 +0000
Message-ID: <20240222173803.20989-1-andre.simoesdiasvieira@arm.com>

Hi,

This is a reworked patch series from.  The main differences are a further split
of patches, where:
[1/5] is arm specific and has been approved before,
[2/5] is target agnostic and has had no substantial changes from v3,
[3/5] is a new arm-specific patch, split from the original last patch, that
annotates across-lane instructions that are safe for tail predication if their
tail-predicated operands are zeroed,
[4/5] is a new arm-specific patch that could be committed independently of the
series to fix an obvious issue and remove unused unspecs & iterators,
[5/5] (v3-v4) reworks the last patch, refactoring the implicit predication and
some other validity checks; (v4-v5) removes the expectation that vctp
instructions are always zero extended, after this was fixed on trunk.

Original cover letter:
This patch adds support for Arm's MVE Tail Predicated Low Overhead Loop
feature.

The M-class Arm-ARM:
https://developer.arm.com/documentation/ddi0553/bu/?lang=en
Section B5.5.1 "Loop tail predication" describes the feature
we are adding support for with this patch (although
we only add codegen for DLSTP/LETP instruction loops).

Previously with commit d2ed233cb94 we'd added support for
non-MVE DLS/LE loops through the loop-doloop pass, which, given
a standard MVE loop like:

```
void  __attribute__ ((noinline)) test (int16_t *a, int16_t *b, int16_t *c, int n)
{
  while (n > 0)
    {
      mve_pred16_t p = vctp16q (n);
      int16x8_t va = vldrhq_z_s16 (a, p);
      int16x8_t vb = vldrhq_z_s16 (b, p);
      int16x8_t vc = vaddq_x_s16 (va, vb, p);
      vstrhq_p_s16 (c, vc, p);
      c+=8;
      a+=8;
      b+=8;
      n-=8;
    }
}
```
.. would output:

```
        <pre-calculate the number of iterations and place it into lr>
        dls     lr, lr
.L3:
        vctp.16 r3
        vmrs    ip, P0  @ movhi
        sxth    ip, ip
        vmsr     P0, ip @ movhi
        mov     r4, r0
        vpst
        vldrht.16       q2, [r4]
        mov     r4, r1
        vmov    q3, q0
        vpst
        vldrht.16       q1, [r4]
        mov     r4, r2
        vpst
        vaddt.i16       q3, q2, q1
        subs    r3, r3, #8
        vpst
        vstrht.16       q3, [r4]
        adds    r0, r0, #16
        adds    r1, r1, #16
        adds    r2, r2, #16
        le      lr, .L3
```

where the LE instruction will decrement LR by 1, compare and
branch if needed.

(There are also other inefficiencies in the above code, like the
pointless vmrs/sxth/vmsr on the VPR, the adds not being merged
into the vldrht/vstrht as #16 offsets, and some random movs!
But those are different problems...)

The MVE version is similar, except that:
* Instead of DLS/LE the instructions are DLSTP/LETP.
* Instead of pre-calculating the number of iterations of the
  loop, we place the number of elements to be processed by the
  loop into LR.
* Instead of decrementing the LR by one, LETP will decrement it
  by the number of elements being processed in each iteration, as
  selected by FPSCR.LTPSIZE: 16 for 8-bit elements, 8 for 16-bit
  elements, etc.
* On the final iteration, automatic Loop Tail Predication is
  performed, as if the instructions within the loop had been VPT
  predicated with a VCTP generating the VPR predicate in every
  loop iteration.
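
The difference between the two counting schemes can be sketched in plain C.
This is only a scalar model under stated assumptions, not what the hardware or
the patch implements: the function names are hypothetical, the pre-calculated
DLS count is assumed to be ceil (n / lanes), and n == 0 is simply modelled as
zero iterations.

```c
#include <assert.h>
#include <stdint.h>

/* Model of DLS/LE: LR holds a pre-calculated iteration count and LE
   decrements it by one per iteration.  */
static uint32_t
dls_le_iterations (uint32_t n, uint32_t lanes)
{
  assert (lanes > 0);
  /* Assumed pre-calculation: ceil (n / lanes).  */
  uint32_t lr = n / lanes + (n % lanes != 0);
  uint32_t iterations = 0;
  while (lr > 0)
    {
      iterations++;
      lr -= 1;			/* LE: decrement by one.  */
    }
  return iterations;
}

/* Model of DLSTP/LETP: LR holds the element count and shrinks by one
   vector's worth of elements per iteration.  *last_active receives the
   number of lanes left enabled by tail predication on the final
   iteration.  */
static uint32_t
dlstp_letp_iterations (uint32_t n, uint32_t lanes, uint32_t *last_active)
{
  assert (lanes > 0);
  uint32_t lr = n;		/* DLSTP: LR = number of elements.  */
  uint32_t iterations = 0;
  *last_active = 0;
  while (lr > 0)
    {
      uint32_t active = lr < lanes ? lr : lanes;
      *last_active = active;
      iterations++;
      lr -= active;		/* LETP: modelled as stopping at zero.  */
    }
  return iterations;
}
```

For n = 20 and 16-bit elements (8 lanes), both models run three iterations;
in the DLSTP/LETP model the last one has only 4 active lanes.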

The dlstp/letp loop now looks like:

```
        <place n into r3>
        dlstp.16        lr, r3
.L14:
        mov     r3, r0
        vldrh.16        q3, [r3]
        mov     r3, r1
        vldrh.16        q2, [r3]
        mov     r3, r2
        vadd.i16  q3, q3, q2
        adds    r0, r0, #16
        vstrh.16        q3, [r3]
        adds    r1, r1, #16
        adds    r2, r2, #16
        letp    lr, .L14

```

Since the loop tail predication is automatic, we have eliminated
the VCTP that had been specified by the user in the intrinsic
and converted the VPT-predicated instructions into their
unpredicated equivalents (which also saves us from VPST insns).
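
To illustrate why dropping the explicit VCTP is safe, here is a scalar C
model of the example loop in both forms.  It is a sketch only: model_vctp16,
add_explicit_pred and add_tail_pred are hypothetical names, the predicate
layout (two predicate bits per 16-bit lane) is modelled by hand, and
arithmetic wrap-around on overflow is assumed to match GCC's behaviour for
narrowing conversions.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of vctp16q: the first n 16-bit lanes are active, and
   each 16-bit lane owns two bits of the 16-bit predicate.  */
static uint16_t
model_vctp16 (uint32_t n)
{
  uint32_t lanes = n < 8 ? n : 8;
  return (uint16_t) ((1u << (2 * lanes)) - 1u);
}

/* The loop as written with intrinsics: a predicate is built in every
   iteration and only active lanes are loaded, added and stored.  */
static void
add_explicit_pred (const int16_t *a, const int16_t *b, int16_t *c, int n)
{
  while (n > 0)
    {
      uint16_t p = model_vctp16 ((uint32_t) n);
      for (int lane = 0; lane < 8; lane++)
	if (p & (1u << (2 * lane)))
	  c[lane] = (int16_t) (a[lane] + b[lane]);
      a += 8;
      b += 8;
      c += 8;
      n -= 8;
    }
}

/* The loop after the transformation: the body carries no predicate;
   the element counter alone limits the active lanes.  */
static void
add_tail_pred (const int16_t *a, const int16_t *b, int16_t *c, int n)
{
  int lr = n;			/* dlstp.16 lr, n  */
  while (lr > 0)
    {
      int active = lr < 8 ? lr : 8;
      for (int lane = 0; lane < active; lane++)
	c[lane] = (int16_t) (a[lane] + b[lane]);
      a += 8;
      b += 8;
      c += 8;
      lr -= 8;			/* letp lr, <loop start>  */
    }
}
```

Both functions write exactly the first n elements of c and leave the rest
untouched, which is the property the transformation relies on.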

The LETP instruction here decrements LR by 8 in each iteration.

Stam Markianos-Wright (1):
  arm: Add define_attr to to create a mapping between MVE predicated and
    unpredicated insns

Andre Vieira (4):
  doloop: Add support for predicated vectorized loops
  arm: Annotate instructions with mve_safe_imp_xlane_pred
  arm: Fix a wrong attribute use and remove unused unspecs and iterators
  arm: Add support for MVE Tail-Predicated Low Overhead Loops

-- 
2.17.1
