Hi, This is a reworked patch series from. The main differences are a further split of patches, where: [1/5] is arm specific and has been approved before, [2/5] is target agnostic, has had no substantial changes from v3. [3/5] new arm specific patch that is split from the original last patch and annotates across lane instructions that are safe for tail predication if their tail predicated operands are zeroed. [4/5] new arm specific patch that could be committed indepdent of series to fix an obvious issue and remove unused unspecs & iterators. [5/5] (v3-v4) reworked last patch refactoring the implicit predication and some other validity checks, (v4-v5) removed the expectation that vctp instructions are always zero extended after this was fixed on trunk. Original cover letter: This patch adds support for Arm's MVE Tail Predicated Low Overhead Loop feature. The M-class Arm-ARM: https://developer.arm.com/documentation/ddi0553/bu/?lang=en Section B5.5.1 "Loop tail predication" describes the feature we are adding support for with this patch (although we only add codegen for DLSTP/LETP instruction loops). Previously with commit d2ed233cb94 we'd added support for non-MVE DLS/LE loops through the loop-doloop pass, which, given a standard MVE loop like: ``` void __attribute__ ((noinline)) test (int16_t *a, int16_t *b, int16_t *c, int n) { while (n > 0) { mve_pred16_t p = vctp16q (n); int16x8_t va = vldrhq_z_s16 (a, p); int16x8_t vb = vldrhq_z_s16 (b, p); int16x8_t vc = vaddq_x_s16 (va, vb, p); vstrhq_p_s16 (c, vc, p); c+=8; a+=8; b+=8; n-=8; } } ``` .. would output: ``` dls lr, lr .L3: vctp.16 r3 vmrs ip, P0 @ movhi sxth ip, ip vmsr P0, ip @ movhi mov r4, r0 vpst vldrht.16 q2, [r4] mov r4, r1 vmov q3, q0 vpst vldrht.16 q1, [r4] mov r4, r2 vpst vaddt.i16 q3, q2, q1 subs r3, r3, #8 vpst vstrht.16 q3, [r4] adds r0, r0, #16 adds r1, r1, #16 adds r2, r2, #16 le lr, .L3 ``` where the LE instruction will decrement LR by 1, compare and branch if needed. (there are also other inefficiencies with the above code, like the pointless vmrs/sxth/vmsr on the VPR and the adds not being merged into the vldrht/vstrht as a #16 offsets and some random movs! But that's different problems...) The MVE version is similar, except that: * Instead of DLS/LE the instructions are DLSTP/LETP. * Instead of pre-calculating the number of iterations of the loop, we place the number of elements to be processed by the loop into LR. * Instead of decrementing the LR by one, LETP will decrement it by FPSCR.LTPSIZE, which is the number of elements being processed in each iteration: 16 for 8-bit elements, 5 for 16-bit elements, etc. * On the final iteration, automatic Loop Tail Predication is performed, as if the instructions within the loop had been VPT predicated with a VCTP generating the VPR predicate in every loop iteration. The dlstp/letp loop now looks like: ``` dlstp.16 lr, r3 .L14: mov r3, r0 vldrh.16 q3, [r3] mov r3, r1 vldrh.16 q2, [r3] mov r3, r2 vadd.i16 q3, q3, q2 adds r0, r0, #16 vstrh.16 q3, [r3] adds r1, r1, #16 adds r2, r2, #16 letp lr, .L14 ``` Since the loop tail predication is automatic, we have eliminated the VCTP that had been specified by the user in the intrinsic and converted the VPT-predicated instructions into their unpredicated equivalents (which also saves us from VPST insns). The LE instruction here decrements LR by 8 in each iteration. Stam Markianos-Wright (1): arm: Add define_attr to to create a mapping between MVE predicated and unpredicated insns Andre Vieira (4): doloop: Add support for predicated vectorized loops arm: Annotate instructions with mve_safe_imp_xlane_pred arm: Fix a wrong attribute use and remove unused unspecs and iterators arm: Add support for MVE Tail-Predicated Low Overhead Loops -- 2.17.1