public inbox for gcc-patches@gcc.gnu.org
* [PATCH 0/7]AArch64 Optimize truncation, shifts and bitmask comparisons
@ 2021-09-29 16:19 Tamar Christina
  2021-09-29 16:19 ` [PATCH 1/7]AArch64 Add combine patterns for right shift and narrow Tamar Christina
                   ` (6 more replies)
  0 siblings, 7 replies; 31+ messages in thread
From: Tamar Christina @ 2021-09-29 16:19 UTC (permalink / raw)
  To: gcc-patches
  Cc: nd, Richard.Earnshaw, Marcus.Shawcroft, Kyrylo.Tkachov,
	richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 3438 bytes --]

Hi All,

This patch series optimizes AArch64 codegen for narrowing operations,
shift-and-narrow combinations, and some comparisons with bitmasks.
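As a small illustration of the shift-and-narrow case (a made-up example,
not taken from the series itself), a loop such as the following should,
once vectorized, use a single shrn per vector rather than a ushr followed
by a separate xtn:

#include <stdint.h>

/* Illustrative only: shifting each 16-bit element right and truncating
   it to 8 bits is a shift-and-narrow, i.e. a candidate for shrn.  */
void
narrow_shift (uint8_t *out, const uint16_t *in, int n)
{
  for (int i = 0; i < n; i++)
    out[i] = in[i] >> 8;
}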

There is more to come, but this is the first batch.

This series shows a 2% gain on x264 in SPECCPU2017 together with a 0.05% code
size reduction, and a 5-10% performance gain on various intrinsics-optimized
real-world libraries.

One part that is still missing and needs additional work is the ability to
combine stores to sequential locations.  Consider:

#include <arm_neon.h>

#define SIZE 8
#define SIZE2 8 * 8 * 8

extern void pop (uint8_t*);

void foo (int16x8_t row0, int16x8_t row1, int16x8_t row2, int16x8_t row3,
          int16x8_t row4, int16x8_t row5, int16x8_t row6, int16x8_t row7) {
    uint8_t block_nbits[SIZE2];

    uint8x8_t row0_nbits = vsub_u8(vdup_n_u8(16),
                                   vmovn_u16(vreinterpretq_u16_s16(row0)));
    uint8x8_t row1_nbits = vsub_u8(vdup_n_u8(16),
                                   vmovn_u16(vreinterpretq_u16_s16(row1)));
    uint8x8_t row2_nbits = vsub_u8(vdup_n_u8(16),
                                   vmovn_u16(vreinterpretq_u16_s16(row2)));
    uint8x8_t row3_nbits = vsub_u8(vdup_n_u8(16),
                                   vmovn_u16(vreinterpretq_u16_s16(row3)));
    uint8x8_t row4_nbits = vsub_u8(vdup_n_u8(16),
                                   vmovn_u16(vreinterpretq_u16_s16(row4)));
    uint8x8_t row5_nbits = vsub_u8(vdup_n_u8(16),
                                   vmovn_u16(vreinterpretq_u16_s16(row5)));
    uint8x8_t row6_nbits = vsub_u8(vdup_n_u8(16),
                                   vmovn_u16(vreinterpretq_u16_s16(row6)));
    uint8x8_t row7_nbits = vsub_u8(vdup_n_u8(16),
                                   vmovn_u16(vreinterpretq_u16_s16(row7)));

    vst1_u8(block_nbits + 0 * SIZE, row0_nbits);
    vst1_u8(block_nbits + 1 * SIZE, row1_nbits);
    vst1_u8(block_nbits + 2 * SIZE, row2_nbits);
    vst1_u8(block_nbits + 3 * SIZE, row3_nbits);
    vst1_u8(block_nbits + 4 * SIZE, row4_nbits);
    vst1_u8(block_nbits + 5 * SIZE, row5_nbits);
    vst1_u8(block_nbits + 6 * SIZE, row6_nbits);
    vst1_u8(block_nbits + 7 * SIZE, row7_nbits);

    pop (block_nbits);
}

currently generates:

movi v1.8b, #0x10

xtn v17.8b, v17.8h
xtn v23.8b, v23.8h
xtn v22.8b, v22.8h
xtn v4.8b, v21.8h
xtn v20.8b, v20.8h
xtn v19.8b, v19.8h
xtn v18.8b, v18.8h
xtn v24.8b, v24.8h

sub v17.8b, v1.8b, v17.8b
sub v23.8b, v1.8b, v23.8b
sub v22.8b, v1.8b, v22.8b
sub v16.8b, v1.8b, v4.8b
sub v8.8b, v1.8b, v20.8b
sub v4.8b, v1.8b, v19.8b
sub v2.8b, v1.8b, v18.8b
sub v1.8b, v1.8b, v24.8b

stp d17, d23, [sp, #224]
stp d22, d16, [sp, #240]
stp d8, d4, [sp, #256]
stp d2, d1, [sp, #272]

whereas the optimized codegen for this would be:

movi v1.16b, #0x10

uzp1 v17.16b, v17.16b, v23.16b
uzp1 v22.16b, v22.16b, v4.16b
uzp1 v20.16b, v20.16b, v19.16b
uzp1 v24.16b, v18.16b, v24.16b

sub v17.16b, v1.16b, v17.16b
sub v18.16b, v1.16b, v22.16b
sub v19.16b, v1.16b, v20.16b
sub v20.16b, v1.16b, v24.16b

stp q17, q18, [sp, #224]
stp q19, q20, [sp, #256]

which requires us to recognize stores to sequential locations (the multiple
stp of d registers in the current example) and merge them into wider ones; an
intrinsics-level sketch of that rewrite is shown below.
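At the intrinsics level, the rewrite needed for one pair of rows looks
roughly like the sketch below (hand-written for illustration, not part of
the patches; it assumes little-endian AArch64 and that every 16-bit lane
already fits in its low byte, so taking the even bytes with uzp1 is
equivalent to xtn/xtn2):

/* Illustrative sketch only: narrow two rows with a single uzp1, do the
   subtraction on 16 bytes at once and use one 16-byte store in place of
   two 8-byte stores.  */
static inline void
narrow_pair (uint8_t *dst, int16x8_t a, int16x8_t b)
{
  uint8x16_t lo = vuzp1q_u8 (vreinterpretq_u8_s16 (a),
                             vreinterpretq_u8_s16 (b));
  vst1q_u8 (dst, vsubq_u8 (vdupq_n_u8 (16), lo));
}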

This pattern happens reasonably often, but I am unsure how to handle it.  For
one, it requires st1 and friends not to be unspecs, which is currently the
focus of

https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579582.html

Thanks,
Tamar

--- inline copy of patch -- 

-- 

[-- Attachment #2: rb14899.patch --]
[-- Type: text/x-diff, Size: 0 bytes --]




Thread overview: 31+ messages
2021-09-29 16:19 [PATCH 0/7]AArch64 Optimize truncation, shifts and bitmask comparisons Tamar Christina
2021-09-29 16:19 ` [PATCH 1/7]AArch64 Add combine patterns for right shift and narrow Tamar Christina
2021-09-30  8:50   ` Kyrylo Tkachov
2021-10-06 14:32     ` Richard Sandiford
2021-10-12 16:18       ` Tamar Christina
2021-10-12 16:35         ` Kyrylo Tkachov
2021-09-29 16:19 ` [PATCH 2/7]AArch64 Add combine patterns for narrowing shift of half top bits (shuffle) Tamar Christina
2021-09-30  8:54   ` Kyrylo Tkachov
2021-10-12 16:23     ` Tamar Christina
2021-10-12 16:36       ` Kyrylo Tkachov
2021-09-29 16:20 ` [PATCH 3/7]AArch64 Add pattern for sshr to cmlt Tamar Christina
2021-09-30  9:27   ` Kyrylo Tkachov
2021-10-11 19:56     ` Andrew Pinski
2021-10-12 12:19       ` Kyrylo Tkachov
2021-10-12 16:20         ` Tamar Christina
2021-09-29 16:20 ` [PATCH 4/7]AArch64 Add pattern xtn+xtn2 to uzp2 Tamar Christina
2021-09-30  9:28   ` Kyrylo Tkachov
2021-10-12 16:25     ` Tamar Christina
2021-10-12 16:39       ` Kyrylo Tkachov
2021-10-13 11:05         ` Tamar Christina
2021-10-13 12:52           ` Kyrylo Tkachov
2021-09-29 16:21 ` [PATCH 5/7]middle-end Convert bitclear <imm> + cmp<cc> #0 into cm<cc2> <imm2> Tamar Christina
2021-09-30  6:17   ` Richard Biener
2021-09-30  9:56     ` Tamar Christina
2021-09-30 10:26       ` Richard Biener
2021-10-05 12:55         ` Tamar Christina
2021-10-13 12:17           ` Richard Biener
2021-09-29 16:21 ` [PATCH 6/7]AArch64 Add neg + cmle into cmgt Tamar Christina
2021-09-30  9:34   ` Kyrylo Tkachov
2021-09-29 16:21 ` [PATCH 7/7]AArch64 Combine cmeq 0 + not into cmtst Tamar Christina
2021-09-30  9:35   ` Kyrylo Tkachov
