* [PATCH] PR tree-opt/40210: Fold (bswap(X)>>C1)&C2 to (X>>C3)&C2 in match.pd
@ 2021-07-06 19:01 Roger Sayle
2021-07-07 7:55 ` Richard Biener
0 siblings, 1 reply; 5+ messages in thread
From: Roger Sayle @ 2021-07-06 19:01 UTC (permalink / raw)
To: 'GCC Patches'
[-- Attachment #1: Type: text/plain, Size: 1703 bytes --]
All of the optimizations/transformations mentioned in bugzilla for
PR tree-optimization/40210 are already implemented in mainline GCC,
with one exception. In comment #5, there's a suggestion that
(bswap64(x)>>56)&0xff can be implemented without the bswap as
(unsigned char)x, or equivalently x&0xff.
This patch implements the above optimization, and closely related
variants. For any single bit, (bswap(X)>>C1)&1 can be simplified
to (X>>C2)&1, where bit position C2 is the appropriate permutation
of C1. Similarly, the bswap can eliminated if the desired set of
bits all lie within the same byte, hence (bswap(x)>>8)&255 can
always be optimized, as can (bswap(x)>>8)&123.
Previously,
int foo(long long x) {
return (__builtin_bswap64(x) >> 56) & 0xff;
}
compiled with -O2 to
foo: movq %rdi, %rax
bswap %rax
shrq $56, %rax
ret
with this patch, it now compiles to
foo: movzbl %dil, %eax
ret
This patch has been tested on x86_64-pc-linux-gnu with a "make
bootstrap" and "make -k check" with no new failures.
Ok for mainline?
2021-07-06 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
PR tree-optimization/40210
* builtins.c (get_builtin_precision): Helper function to determine
the precision in bits of a built-in function.
* builtins.h (get_builtin_precision): Prototype here.
* match.pd (bswap optimizations): Simplify (bswap(x)>>C1)&C2 as
(x>>C3)&C2 when possible. Simplify bswap(x)>>C1 as ((T)x)>>C2
when possible. Simplify bswap(x)&C1 as (x>>C2)&C1 when 0<=C1<=255.
gcc/testsuite/ChangeLog
PR tree-optimization/40210
* gcc.dg/builtin-bswap-13.c: New test.
* gcc.dg/builtin-bswap-14.c: New test.
Roger
--
Roger Sayle
NextMove Software
Cambridge, UK
[-- Attachment #2: patchb.txt --]
[-- Type: text/plain, Size: 5100 bytes --]
diff --git a/gcc/builtins.c b/gcc/builtins.c
index 855ad1e..86dfeae 100644
--- a/gcc/builtins.c
+++ b/gcc/builtins.c
@@ -14831,3 +14831,42 @@ builtin_fnspec (tree callee)
return "";
}
}
+
+/* Return the precision (in bits) of a built-in function.
+ Returns zero for unrecognized built-in functions. */
+
+unsigned int
+get_builtin_precision (enum built_in_function fn)
+{
+ switch (fn)
+ {
+ case BUILT_IN_BSWAP16:
+ return 16;
+ case BUILT_IN_BSWAP32:
+ return 32;
+ case BUILT_IN_BSWAP64:
+ return 64;
+ case BUILT_IN_BSWAP128:
+ return 128;
+ case BUILT_IN_CLZ:
+ case BUILT_IN_CTZ:
+ case BUILT_IN_FFS:
+ case BUILT_IN_PARITY:
+ case BUILT_IN_POPCOUNT:
+ return TYPE_PRECISION (integer_type_node);
+ case BUILT_IN_CLZL:
+ case BUILT_IN_CTZL:
+ case BUILT_IN_FFSL:
+ case BUILT_IN_PARITYL:
+ case BUILT_IN_POPCOUNTL:
+ return TYPE_PRECISION (long_integer_type_node);
+ case BUILT_IN_CLZLL:
+ case BUILT_IN_CTZLL:
+ case BUILT_IN_FFSLL:
+ case BUILT_IN_PARITYLL:
+ case BUILT_IN_POPCOUNTLL:
+ return TYPE_PRECISION (long_long_integer_type_node);
+ default:
+ return 0;
+ }
+}
diff --git a/gcc/builtins.h b/gcc/builtins.h
index e71f40c..cfb48f0 100644
--- a/gcc/builtins.h
+++ b/gcc/builtins.h
@@ -124,6 +124,7 @@ extern void expand_ifn_atomic_bit_test_and (gcall *);
extern void expand_ifn_atomic_compare_exchange (gcall *);
extern rtx expand_builtin (tree, rtx, rtx, machine_mode, int);
extern enum built_in_function builtin_mathfn_code (const_tree);
+extern unsigned int get_builtin_precision (enum built_in_function);
extern tree fold_builtin_expect (location_t, tree, tree, tree, tree);
extern bool avoid_folding_inline_builtin (tree);
extern tree fold_call_expr (location_t, tree, bool);
diff --git a/gcc/match.pd b/gcc/match.pd
index 39fb57e..d5a7fbf 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -3610,7 +3610,8 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
(complex (convert:itype @0) (negate (convert:itype @1)))))
/* BSWAP simplifications, transforms checked by gcc.dg/builtin-bswap-8.c. */
-(for bswap (BUILT_IN_BSWAP16 BUILT_IN_BSWAP32 BUILT_IN_BSWAP64)
+(for bswap (BUILT_IN_BSWAP16 BUILT_IN_BSWAP32
+ BUILT_IN_BSWAP64 BUILT_IN_BSWAP128)
(simplify
(bswap (bswap @0))
@0)
@@ -3620,8 +3621,69 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
(for bitop (bit_xor bit_ior bit_and)
(simplify
(bswap (bitop:c (bswap @0) @1))
- (bitop @0 (bswap @1)))))
-
+ (bitop @0 (bswap @1))))
+ /* (bswap(x) >> C1) & C2 can sometimes be simplified to (x >> C3) & C2. */
+ (simplify
+ (bit_and (convert1? (rshift@0 (convert2? (bswap @1)) INTEGER_CST@2))
+ INTEGER_CST@3)
+ (if (tree_fits_uhwi_p (@2)
+ && tree_fits_uhwi_p (@3))
+ (with
+ {
+ enum built_in_function fncode = as_builtin_fn (bswap);
+ unsigned HOST_WIDE_INT prec = get_builtin_precision (fncode);
+ unsigned HOST_WIDE_INT bits = tree_to_uhwi (@2);
+ unsigned HOST_WIDE_INT mask = tree_to_uhwi (@3);
+ unsigned HOST_WIDE_INT lo = bits & 7;
+ unsigned HOST_WIDE_INT hi = bits - lo;
+ }
+ (if (bits < prec
+ && mask < (256u>>lo)
+ && bits < TYPE_PRECISION (TREE_TYPE(@0)))
+ (with { unsigned HOST_WIDE_INT ns = (prec - (hi + 8)) + lo; }
+ (if (ns == 0)
+ (bit_and (convert @1) @3)
+ (with
+ {
+ tree utype = unsigned_type_for (TREE_TYPE (@1));
+ tree nst = build_int_cst (integer_type_node, ns);
+ }
+ (bit_and (convert (rshift:utype (convert:utype @1) {nst;})) @3))))))))
+ /* bswap(x) >> C1 can sometimes be simplified to (T)x >> C2. */
+ (simplify
+ (rshift (convert? (bswap @0)) INTEGER_CST@1)
+ (if (tree_fits_uhwi_p (@1))
+ (with
+ {
+ enum built_in_function fncode = as_builtin_fn (bswap);
+ unsigned HOST_WIDE_INT prec = get_builtin_precision (fncode);
+ unsigned HOST_WIDE_INT bits = tree_to_uhwi (@1);
+ }
+ (if (bits + 8 == prec)
+ (if (TYPE_UNSIGNED (type))
+ (convert (convert:unsigned_char_type_node @0))
+ (convert (convert:signed_char_type_node @0)))
+ (if (bits < prec && bits + 8 > prec)
+ (with
+ {
+ tree nst = build_int_cst (integer_type_node, bits & 7);
+ tree bt = TYPE_UNSIGNED (type) ? unsigned_char_type_node
+ : signed_char_type_node;
+ }
+ (convert (rshift:bt (convert:bt @0) {nst;}))))))))
+ /* bswap(x) & C1 can sometimes be simplified to (x >> C2) & C1. */
+ (simplify
+ (bit_and (convert? (bswap @0)) INTEGER_CST@1)
+ (if (tree_fits_uhwi_p (@1)
+ && tree_to_uhwi (@1) < 256)
+ (with
+ {
+ enum built_in_function fncode = as_builtin_fn (bswap);
+ unsigned HOST_WIDE_INT prec = get_builtin_precision (fncode);
+ tree utype = unsigned_type_for (TREE_TYPE (@0));
+ tree nst = build_int_cst (integer_type_node, prec - 8);
+ }
+ (bit_and (convert (rshift:utype (convert:utype @0) {nst;})) @1)))))
/* Combine COND_EXPRs and VEC_COND_EXPRs. */
[-- Attachment #3: builtin-bswap-13.c --]
[-- Type: text/plain, Size: 10583 bytes --]
/* { dg-do compile } */
/* { dg-options "-O2 -fdump-tree-optimized" } */
int test_s32_0_1(int x) { return __builtin_bswap32(x) & 1; }
int test_s32_0_2(int x) { return __builtin_bswap32(x) & 2; }
int test_s32_0_240(int x) { return __builtin_bswap32(x) & 240; }
int test_s32_0_255(int x) { return __builtin_bswap32(x) & 255; }
int test_s32_1_1(int x) { return (__builtin_bswap32(x) >> 1) & 1; }
int test_s32_7_1(int x) { return (__builtin_bswap32(x) >> 7) & 1; }
int test_s32_8_1(int x) { return (__builtin_bswap32(x) >> 8) & 1; }
int test_s32_8_240(int x) { return (__builtin_bswap32(x) >> 8) & 240; }
int test_s32_8_255(int x) { return (__builtin_bswap32(x) >> 8) & 255; }
int test_s32_15_1(int x) { return (__builtin_bswap32(x) >> 15) & 1; }
int test_s32_16_1(int x) { return (__builtin_bswap32(x) >> 16) & 1; }
int test_s32_16_240(int x) { return (__builtin_bswap32(x) >> 16) & 240; }
int test_s32_16_255(int x) { return (__builtin_bswap32(x) >> 16) & 255; }
int test_s32_24_1(int x) { return (__builtin_bswap32(x) >> 24) & 1; }
int test_s32_24_240(int x) { return (__builtin_bswap32(x) >> 24) & 240; }
int test_s32_24_255(int x) { return (__builtin_bswap32(x) >> 24) & 255; }
int test_s32_31_1(int x) { return (__builtin_bswap32(x) >> 31) & 1; }
int test_S32_0_1(int x) { return (int)__builtin_bswap32(x) & 1; }
int test_S32_0_2(int x) { return (int)__builtin_bswap32(x) & 2; }
int test_S32_0_240(int x) { return (int)__builtin_bswap32(x) & 240; }
int test_S32_0_255(int x) { return (int)__builtin_bswap32(x) & 255; }
int test_S32_1_1(int x) { return ((int)__builtin_bswap32(x) >> 1) & 1; }
int test_S32_7_1(int x) { return ((int)__builtin_bswap32(x) >> 7) & 1; }
int test_S32_8_1(int x) { return ((int)__builtin_bswap32(x) >> 8) & 1; }
int test_S32_8_240(int x) { return ((int)__builtin_bswap32(x) >> 8) & 240; }
int test_S32_8_255(int x) { return ((int)__builtin_bswap32(x) >> 8) & 255; }
int test_S32_15_1(int x) { return ((int)__builtin_bswap32(x) >> 15) & 1; }
int test_S32_16_1(int x) { return ((int)__builtin_bswap32(x) >> 16) & 1; }
int test_S32_16_240(int x) { return ((int)__builtin_bswap32(x) >> 16) & 240; }
int test_S32_16_255(int x) { return ((int)__builtin_bswap32(x) >> 16) & 255; }
int test_S32_24_1(int x) { return ((int)__builtin_bswap32(x) >> 24) & 1; }
int test_S32_24_240(int x) { return ((int)__builtin_bswap32(x) >> 24) & 240; }
int test_S32_24_255(int x) { return ((int)__builtin_bswap32(x) >> 24) & 255; }
int test_S32_31_1(int x) { return ((int)__builtin_bswap32(x) >> 31) & 1; }
unsigned int test_u32_24_255(unsigned int x) {
return (__builtin_bswap32(x) >> 24) & 255;
}
long long test_s64_0_1(long long x) {
return __builtin_bswap64(x) & 1;
}
long long test_s64_0_2(long long x) {
return __builtin_bswap64(x) & 2;
}
long long test_s64_0_240(long long x) {
return __builtin_bswap64(x) & 240;
}
long long test_s64_0_255(long long x) {
return __builtin_bswap64(x) & 255;
}
long long test_s64_7_1(long long x) {
return (__builtin_bswap64(x) >> 7) & 1;
}
long long test_s64_8_1(long long x) {
return (__builtin_bswap64(x) >> 8) & 1;
}
long long test_s64_8_240(long long x) {
return (__builtin_bswap64(x) >> 56) & 240;
}
long long test_s64_8_255(long long x) {
return (__builtin_bswap64(x) >> 8) & 255;
}
long long test_s64_9_1(long long x) {
return (__builtin_bswap64(x) >> 9) & 1;
}
long long test_s64_31_1(long long x) {
return (__builtin_bswap64(x) >> 31) & 1;
}
long long test_s64_32_1(long long x) {
return (__builtin_bswap64(x) >> 32) & 1;
}
long long test_s64_32_240(long long x) {
return (__builtin_bswap64(x) >> 32) & 240;
}
long long test_s64_32_255(long long x) {
return (__builtin_bswap64(x) >> 32) & 255;
}
long long test_s64_33_1(long long x) {
return (__builtin_bswap64(x) >> 33) & 1;
}
long long test_s64_48_1(long long x) {
return (__builtin_bswap64(x) >> 48) & 1;
}
long long test_s64_48_240(long long x) {
return (__builtin_bswap64(x) >> 48) & 240;
}
long long test_s64_48_255(long long x) {
return (__builtin_bswap64(x) >> 48) & 255;
}
long long test_s64_56_1(long long x) {
return (__builtin_bswap64(x) >> 56) & 1;
}
long long test_s64_56_240(long long x) {
return (__builtin_bswap64(x) >> 56) & 240;
}
long long test_s64_56_255(long long x) {
return (__builtin_bswap64(x) >> 56) & 255;
}
long long test_s64_57_1(long long x) {
return (__builtin_bswap64(x) >> 57) & 1;
}
long long test_s64_63_1(long long x) {
return (__builtin_bswap64(x) >> 63) & 1;
}
long long test_S64_0_1(long long x) {
return (long long)__builtin_bswap64(x) & 1;
}
long long test_S64_0_2(long long x) {
return (long long)__builtin_bswap64(x) & 2;
}
long long test_S64_0_240(long long x) {
return (long long)__builtin_bswap64(x) & 240;
}
long long test_S64_0_255(long long x) {
return (long long)__builtin_bswap64(x) & 255;
}
long long test_S64_7_1(long long x) {
return ((long long)__builtin_bswap64(x) >> 7) & 1;
}
long long test_S64_8_1(long long x) {
return ((long long)__builtin_bswap64(x) >> 8) & 1;
}
long long test_S64_8_240(long long x) {
return ((long long)__builtin_bswap64(x) >> 56) & 240;
}
long long test_S64_8_255(long long x) {
return ((long long)__builtin_bswap64(x) >> 8) & 255;
}
long long test_S64_9_1(long long x) {
return ((long long)__builtin_bswap64(x) >> 9) & 1;
}
long long test_S64_31_1(long long x) {
return ((long long)__builtin_bswap64(x) >> 31) & 1;
}
long long test_S64_32_1(long long x) {
return ((long long)__builtin_bswap64(x) >> 32) & 1;
}
long long test_S64_32_240(long long x) {
return ((long long)__builtin_bswap64(x) >> 32) & 240;
}
long long test_S64_32_255(long long x) {
return ((long long)__builtin_bswap64(x) >> 32) & 255;
}
long long test_S64_33_1(long long x) {
return ((long long)__builtin_bswap64(x) >> 33) & 1;
}
long long test_S64_48_1(long long x) {
return ((long long)__builtin_bswap64(x) >> 48) & 1;
}
long long test_S64_48_240(long long x) {
return ((long long)__builtin_bswap64(x) >> 48) & 240;
}
long long test_S64_48_255(long long x) {
return ((long long)__builtin_bswap64(x) >> 48) & 255;
}
long long test_S64_56_1(long long x) {
return ((long long)__builtin_bswap64(x) >> 56) & 1;
}
long long test_S64_56_240(long long x) {
return ((long long)__builtin_bswap64(x) >> 56) & 240;
}
long long test_S64_56_255(long long x) {
return ((long long)__builtin_bswap64(x) >> 56) & 255;
}
long long test_S64_57_1(long long x) {
return ((long long)__builtin_bswap64(x) >> 57) & 1;
}
long long test_S64_63_1(long long x) {
return ((long long)__builtin_bswap64(x) >> 63) & 1;
}
unsigned long long test_u64_56_255(unsigned long long x) {
return (__builtin_bswap64(x) >> 56) & 255;
}
short test_s16_0_1(short x) {
return __builtin_bswap16(x) & 1;
}
short test_s16_0_240(short x) {
return __builtin_bswap16(x) & 240;
}
short test_s16_0_255(short x) {
return __builtin_bswap16(x) & 255;
}
short test_s16_1_1(short x) {
return (__builtin_bswap16(x) >> 1) & 1;
}
short test_s16_7_1(short x) {
return (__builtin_bswap16(x) >> 7) & 1;
}
short test_s16_8_1(short x) {
return (__builtin_bswap16(x) >> 8) & 1;
}
short test_s16_8_240(short x) {
return (__builtin_bswap16(x) >> 8) & 240;
}
short test_s16_8_255(short x) {
return (__builtin_bswap16(x) >> 8) & 255;
}
short test_s16_9_1(short x) {
return (__builtin_bswap16(x) >> 9) & 1;
}
short test_s16_15_1(short x) {
return (__builtin_bswap16(x) >> 15) & 1;
}
short test_S16_0_1(short x) {
return (short)__builtin_bswap16(x) & 1;
}
short test_S16_0_240(short x) {
return (short)__builtin_bswap16(x) & 240;
}
short test_S16_0_255(short x) {
return (short)__builtin_bswap16(x) & 255;
}
short test_S16_1_1(short x) {
return ((short)__builtin_bswap16(x) >> 1) & 1;
}
short test_S16_7_1(short x) {
return ((short)__builtin_bswap16(x) >> 7) & 1;
}
short test_S16_8_1(short x) {
return ((short)__builtin_bswap16(x) >> 8) & 1;
}
short test_S16_8_240(short x) {
return ((short)__builtin_bswap16(x) >> 8) & 240;
}
short test_S16_8_255(short x) {
return ((short)__builtin_bswap16(x) >> 8) & 255;
}
short test_S16_9_1(short x) {
return ((short)__builtin_bswap16(x) >> 9) & 1;
}
short test_S16_15_1(short x) {
return ((short)__builtin_bswap16(x) >> 15) & 1;
}
unsigned short test_u16_8_255(unsigned short x) {
return (__builtin_bswap16(x) >> 8) & 255;
}
/* Shifts only */
int test_s32_24(int x) {
return __builtin_bswap32(x) >> 24;
}
int test_s32_25(int x) {
return __builtin_bswap32(x) >> 25;
}
int test_s32_30(int x) {
return __builtin_bswap32(x) >> 30;
}
int test_s32_31(int x) {
return __builtin_bswap32(x) >> 31;
}
unsigned int test_u32_24(unsigned int x) {
return __builtin_bswap32(x) >> 24;
}
unsigned int test_u32_25(unsigned int x) {
return __builtin_bswap32(x) >> 25;
}
unsigned int test_u32_30(unsigned int x) {
return __builtin_bswap32(x) >> 30;
}
unsigned int test_u32_31(unsigned int x) {
return __builtin_bswap32(x) >> 31;
}
long long test_s64_56(long long x) {
return __builtin_bswap64(x) >> 56;
}
long long test_s64_57(long long x) {
return __builtin_bswap64(x) >> 57;
}
long long test_s64_62(long long x) {
return __builtin_bswap64(x) >> 62;
}
long long test_s64_63(long long x) {
return __builtin_bswap64(x) >> 63;
}
unsigned long long test_u64_56(unsigned long long x) {
return __builtin_bswap64(x) >> 56;
}
unsigned long long test_u64_57(unsigned long long x) {
return __builtin_bswap64(x) >> 57;
}
unsigned long long test_u64_62(unsigned long long x) {
return __builtin_bswap64(x) >> 62;
}
unsigned long long test_u64_63(unsigned long long x) {
return __builtin_bswap64(x) >> 63;
}
short test_s16_8(short x) {
return __builtin_bswap16(x) >> 8;
}
short test_s16_9(short x) {
return __builtin_bswap16(x) >> 9;
}
short test_s16_14(short x) {
return __builtin_bswap16(x) >> 14;
}
short test_s16_15(short x) {
return __builtin_bswap16(x) >> 15;
}
unsigned short test_u16_8(unsigned short x) {
return __builtin_bswap16(x) >> 8;
}
unsigned short test_u16_9(unsigned short x) {
return __builtin_bswap16(x) >> 9;
}
unsigned short test_u16_14(unsigned short x) {
return __builtin_bswap16(x) >> 14;
}
unsigned short test_u16_15(unsigned short x) {
return __builtin_bswap16(x) >> 15;
}
/* { dg-final { scan-tree-dump-not "__builtin_bswap" "optimized" } } */
[-- Attachment #4: builtin-bswap-14.c --]
[-- Type: text/plain, Size: 6937 bytes --]
/* { dg-do run } */
/* { dg-options "-O2" } */
extern void abort (void);
__attribute__ ((noinline, noclone))
static int rt32 (int x, int y, int z) {
return (__builtin_bswap32(x) >> y) & z;
}
#define TEST32(X,Y,Z) if(((__builtin_bswap32(X)>>Y)&Z)!=rt32(X,Y,Z)) abort()
void test32(int x)
{
TEST32(x,0,1);
TEST32(x,0,255);
TEST32(x,1,1);
TEST32(x,2,1);
TEST32(x,3,1);
TEST32(x,4,1);
TEST32(x,5,1);
TEST32(x,6,1);
TEST32(x,7,1);
TEST32(x,8,1);
TEST32(x,8,255);
TEST32(x,9,1);
TEST32(x,10,1);
TEST32(x,11,1);
TEST32(x,12,1);
TEST32(x,13,1);
TEST32(x,14,1);
TEST32(x,15,1);
TEST32(x,16,1);
TEST32(x,16,255);
TEST32(x,17,1);
TEST32(x,18,1);
TEST32(x,19,1);
TEST32(x,20,1);
TEST32(x,21,1);
TEST32(x,22,1);
TEST32(x,23,1);
TEST32(x,24,1);
TEST32(x,24,255);
TEST32(x,25,1);
TEST32(x,26,1);
TEST32(x,27,1);
TEST32(x,28,1);
TEST32(x,29,1);
TEST32(x,30,1);
TEST32(x,31,1);
}
#if __SIZEOF_LONG_LONG__ == 8
__attribute__ ((noinline, noclone))
static long long rt64 (long long x, int y, long long z) {
return (__builtin_bswap64(x) >> y) & z;
}
#define TEST64(X,Y,Z) if(((__builtin_bswap64(X)>>Y)&Z)!=rt64(X,Y,Z)) abort()
void test64(long long x)
{
TEST64(x,0,1);
TEST64(x,0,255);
TEST64(x,1,1);
TEST64(x,2,1);
TEST64(x,3,1);
TEST64(x,4,1);
TEST64(x,5,1);
TEST64(x,6,1);
TEST64(x,7,1);
TEST64(x,8,1);
TEST64(x,8,255);
TEST64(x,9,1);
TEST64(x,10,1);
TEST64(x,11,1);
TEST64(x,12,1);
TEST64(x,13,1);
TEST64(x,14,1);
TEST64(x,15,1);
TEST64(x,16,1);
TEST64(x,16,255);
TEST64(x,17,1);
TEST64(x,18,1);
TEST64(x,19,1);
TEST64(x,20,1);
TEST64(x,21,1);
TEST64(x,22,1);
TEST64(x,23,1);
TEST64(x,24,1);
TEST64(x,24,255);
TEST64(x,25,1);
TEST64(x,26,1);
TEST64(x,27,1);
TEST64(x,28,1);
TEST64(x,29,1);
TEST64(x,30,1);
TEST64(x,31,1);
TEST64(x,32,1);
TEST64(x,32,255);
TEST64(x,33,1);
TEST64(x,34,1);
TEST64(x,35,1);
TEST64(x,36,1);
TEST64(x,37,1);
TEST64(x,38,1);
TEST64(x,39,1);
TEST64(x,40,1);
TEST64(x,40,255);
TEST64(x,41,1);
TEST64(x,42,1);
TEST64(x,43,1);
TEST64(x,44,1);
TEST64(x,45,1);
TEST64(x,46,1);
TEST64(x,47,1);
TEST64(x,48,1);
TEST64(x,48,255);
TEST64(x,49,1);
TEST64(x,50,1);
TEST64(x,51,1);
TEST64(x,52,1);
TEST64(x,53,1);
TEST64(x,54,1);
TEST64(x,55,1);
TEST64(x,56,1);
TEST64(x,56,255);
TEST64(x,57,1);
TEST64(x,58,1);
TEST64(x,59,1);
TEST64(x,60,1);
TEST64(x,61,1);
TEST64(x,62,1);
TEST64(x,63,1);
}
#endif
__attribute__ ((noinline, noclone))
static int rt16 (int x, int y, int z) {
return (__builtin_bswap16(x) >> y) & z;
}
#define TEST16(X,Y,Z) if(((__builtin_bswap16(X)>>Y)&Z)!=rt16(X,Y,Z)) abort()
void test16(int x)
{
TEST16(x,0,1);
TEST16(x,0,255);
TEST16(x,1,1);
TEST16(x,2,1);
TEST16(x,3,1);
TEST16(x,4,1);
TEST16(x,5,1);
TEST16(x,6,1);
TEST16(x,7,1);
TEST16(x,8,1);
TEST16(x,8,255);
TEST16(x,9,1);
TEST16(x,10,1);
TEST16(x,11,1);
TEST16(x,12,1);
TEST16(x,13,1);
TEST16(x,14,1);
TEST16(x,15,1);
}
int main()
{
test32(0x00000000);
test32(0xffffffff);
test32(0x00000001);
test32(0x00000002);
test32(0x00000004);
test32(0x00000008);
test32(0x00000010);
test32(0x00000020);
test32(0x00000040);
test32(0x00000080);
test32(0x00000100);
test32(0x00000200);
test32(0x00000400);
test32(0x00000800);
test32(0x00001000);
test32(0x00002000);
test32(0x00004000);
test32(0x00008000);
test32(0x00010000);
test32(0x00020000);
test32(0x00040000);
test32(0x00080000);
test32(0x00100000);
test32(0x00200000);
test32(0x00400000);
test32(0x00800000);
test32(0x01000000);
test32(0x02000000);
test32(0x04000000);
test32(0x08000000);
test32(0x10000000);
test32(0x20000000);
test32(0x40000000);
test32(0x80000000);
test32(0x12345678);
test32(0x87654321);
test32(0xdeadbeef);
test32(0xcafebabe);
#if __SIZEOF_LONG_LONG__ == 8
test64(0x0000000000000000ll);
test64(0xffffffffffffffffll);
test64(0x0000000000000001ll);
test64(0x0000000000000002ll);
test64(0x0000000000000004ll);
test64(0x0000000000000008ll);
test64(0x0000000000000010ll);
test64(0x0000000000000020ll);
test64(0x0000000000000040ll);
test64(0x0000000000000080ll);
test64(0x0000000000000100ll);
test64(0x0000000000000200ll);
test64(0x0000000000000400ll);
test64(0x0000000000000800ll);
test64(0x0000000000001000ll);
test64(0x0000000000002000ll);
test64(0x0000000000004000ll);
test64(0x0000000000008000ll);
test64(0x0000000000010000ll);
test64(0x0000000000020000ll);
test64(0x0000000000040000ll);
test64(0x0000000000080000ll);
test64(0x0000000000100000ll);
test64(0x0000000000200000ll);
test64(0x0000000000400000ll);
test64(0x0000000000800000ll);
test64(0x0000000001000000ll);
test64(0x0000000002000000ll);
test64(0x0000000004000000ll);
test64(0x0000000008000000ll);
test64(0x0000000010000000ll);
test64(0x0000000020000000ll);
test64(0x0000000040000000ll);
test64(0x0000000080000000ll);
test64(0x0000000100000000ll);
test64(0x0000000200000000ll);
test64(0x0000000400000000ll);
test64(0x0000000800000000ll);
test64(0x0000001000000000ll);
test64(0x0000002000000000ll);
test64(0x0000004000000000ll);
test64(0x0000008000000000ll);
test64(0x0000010000000000ll);
test64(0x0000020000000000ll);
test64(0x0000040000000000ll);
test64(0x0000080000000000ll);
test64(0x0000100000000000ll);
test64(0x0000200000000000ll);
test64(0x0000400000000000ll);
test64(0x0000800000000000ll);
test64(0x0001000000000000ll);
test64(0x0002000000000000ll);
test64(0x0004000000000000ll);
test64(0x0008000000000000ll);
test64(0x0010000000000000ll);
test64(0x0020000000000000ll);
test64(0x0040000000000000ll);
test64(0x0080000000000000ll);
test64(0x0100000000000000ll);
test64(0x0200000000000000ll);
test64(0x0400000000000000ll);
test64(0x0800000000000000ll);
test64(0x1000000000000000ll);
test64(0x2000000000000000ll);
test64(0x4000000000000000ll);
test64(0x8000000000000000ll);
test64(0x0123456789abcdefll);
test64(0xfedcba9876543210ll);
test64(0xdeadbeefdeadbeefll);
test64(0xcafebabecafebabell);
#endif
test16(0x0000);
test16(0xffff);
test16(0x0001);
test16(0x0002);
test16(0x0004);
test16(0x0008);
test16(0x0010);
test16(0x0020);
test16(0x0040);
test16(0x0080);
test16(0x0100);
test16(0x0200);
test16(0x0400);
test16(0x0800);
test16(0x1000);
test16(0x2000);
test16(0x4000);
test16(0x8000);
test16(0x1234);
test16(0x4321);
test16(0xdead);
test16(0xbeef);
test16(0xcafe);
test16(0xbabe);
return 0;
}
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] PR tree-opt/40210: Fold (bswap(X)>>C1)&C2 to (X>>C3)&C2 in match.pd
2021-07-06 19:01 [PATCH] PR tree-opt/40210: Fold (bswap(X)>>C1)&C2 to (X>>C3)&C2 in match.pd Roger Sayle
@ 2021-07-07 7:55 ` Richard Biener
2021-07-08 7:37 ` Roger Sayle
0 siblings, 1 reply; 5+ messages in thread
From: Richard Biener @ 2021-07-07 7:55 UTC (permalink / raw)
To: Roger Sayle; +Cc: GCC Patches
On Tue, Jul 6, 2021 at 9:01 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> All of the optimizations/transformations mentioned in bugzilla for
> PR tree-optimization/40210 are already implemented in mainline GCC,
> with one exception. In comment #5, there's a suggestion that
> (bswap64(x)>>56)&0xff can be implemented without the bswap as
> (unsigned char)x, or equivalently x&0xff.
>
> This patch implements the above optimization, and closely related
> variants. For any single bit, (bswap(X)>>C1)&1 can be simplified
> to (X>>C2)&1, where bit position C2 is the appropriate permutation
> of C1. Similarly, the bswap can eliminated if the desired set of
> bits all lie within the same byte, hence (bswap(x)>>8)&255 can
> always be optimized, as can (bswap(x)>>8)&123.
>
> Previously,
>
> int foo(long long x) {
> return (__builtin_bswap64(x) >> 56) & 0xff;
> }
>
> compiled with -O2 to
> foo: movq %rdi, %rax
> bswap %rax
> shrq $56, %rax
> ret
>
> with this patch, it now compiles to
> foo: movzbl %dil, %eax
> ret
>
> This patch has been tested on x86_64-pc-linux-gnu with a "make
> bootstrap" and "make -k check" with no new failures.
>
> Ok for mainline?
I don't like get_builtin_precision too much, did you consider
simply using
+ (bit_and (convert1? (rshift@0 (convert2? (bswap@3 @1)) INTEGER_CST@2))
and TYPE_PRECISION (TREE_TYPE (@3))? I think while we'll
see argument promotion and thus cannot use @1 to derive the type
the return value will be the original type.
Now, I see '8' being used which likely should be CHAR_TYPE_SIZE
since you also use char_type_node.
I wonder whether
+ /* (bswap(x) >> C1) & C2 can sometimes be simplified to (x >> C3) & C2. */
+ (simplify
+ (bit_and (convert1? (rshift@0 (convert2? (bswap @1)) INTEGER_CST@2))
+ INTEGER_CST@3)
and
+ /* bswap(x) >> C1 can sometimes be simplified to (T)x >> C2. */
+ (simplify
+ (rshift (convert? (bswap @0)) INTEGER_CST@1)
can build upon each other, for example by extending the latter
to handle more cases, transforming to ((T)x >> C2) & C3?
That might of course be only profitable when the bswap goes away.
Thanks,
Richard.
>
>
> 2021-07-06 Roger Sayle <roger@nextmovesoftware.com>
>
> gcc/ChangeLog
> PR tree-optimization/40210
> * builtins.c (get_builtin_precision): Helper function to determine
> the precision in bits of a built-in function.
> * builtins.h (get_builtin_precision): Prototype here.
> * match.pd (bswap optimizations): Simplify (bswap(x)>>C1)&C2 as
> (x>>C3)&C2 when possible. Simplify bswap(x)>>C1 as ((T)x)>>C2
> when possible. Simplify bswap(x)&C1 as (x>>C2)&C1 when 0<=C1<=255.
>
> gcc/testsuite/ChangeLog
> PR tree-optimization/40210
> * gcc.dg/builtin-bswap-13.c: New test.
> * gcc.dg/builtin-bswap-14.c: New test.
>
> Roger
> --
> Roger Sayle
> NextMove Software
> Cambridge, UK
>
^ permalink raw reply [flat|nested] 5+ messages in thread
* RE: [PATCH] PR tree-opt/40210: Fold (bswap(X)>>C1)&C2 to (X>>C3)&C2 in match.pd
2021-07-07 7:55 ` Richard Biener
@ 2021-07-08 7:37 ` Roger Sayle
2021-07-08 9:50 ` Richard Biener
0 siblings, 1 reply; 5+ messages in thread
From: Roger Sayle @ 2021-07-08 7:37 UTC (permalink / raw)
To: 'Richard Biener'; +Cc: 'GCC Patches'
[-- Attachment #1: Type: text/plain, Size: 4601 bytes --]
Hi Richard,
Thanks. Yep, you've correctly the diagnosed that the motivation for the
get_builtin_precision helper function was that the TREE_TYPE of the
argument is affected by argument promotion. Your suggestion to instead
use the TREE_TYPE of the function result is a much nicer solution.
I also agree that that all of these bswap optimizations make the assumption
that BITS_PER_UNIT is 8 (i.e. that bytes are 8-bits), and some that the
front-end supports an 8-bit type (i.e. that CHAR_TYPE_SIZE is 8), which
can be checked explicitly.
Both of these improvements are implemented in the attached revised patch,
which has been tested on x86_64-pc-linux-gnu with a "make bootstrap"
and "make -k check" with no new failures.
Ok for mainline?
2021-07-08 Roger Sayle <roger@nextmovesoftware.com>
Richard Biener <rguenther@suse.de>
gcc/ChangeLog
PR tree-optimization/40210
* match.pd (bswap optimizations): Simplify (bswap(x)>>C1)&C2 as
(x>>C3)&C2 when possible. Simplify bswap(x)>>C1 as ((T)x)>>C2
when possible. Simplify bswap(x)&C1 as (x>>C2)&C1 when 0<=C1<=255.
gcc/testsuite/ChangeLog
PR tree-optimization/40210
* gcc.dg/builtin-bswap-13.c: New test.
* gcc.dg/builtin-bswap-14.c: New test.
Roger
--
-----Original Message-----
From: Richard Biener <richard.guenther@gmail.com>
Sent: 07 July 2021 08:56
To: Roger Sayle <roger@nextmovesoftware.com>
Cc: GCC Patches <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH] PR tree-opt/40210: Fold (bswap(X)>>C1)&C2 to (X>>C3)&C2 in match.pd
On Tue, Jul 6, 2021 at 9:01 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> All of the optimizations/transformations mentioned in bugzilla for PR
> tree-optimization/40210 are already implemented in mainline GCC, with
> one exception. In comment #5, there's a suggestion that
> (bswap64(x)>>56)&0xff can be implemented without the bswap as
> (unsigned char)x, or equivalently x&0xff.
>
> This patch implements the above optimization, and closely related
> variants. For any single bit, (bswap(X)>>C1)&1 can be simplified to
> (X>>C2)&1, where bit position C2 is the appropriate permutation of C1.
> Similarly, the bswap can eliminated if the desired set of bits all lie
> within the same byte, hence (bswap(x)>>8)&255 can always be optimized,
> as can (bswap(x)>>8)&123.
>
> Previously,
>
> int foo(long long x) {
> return (__builtin_bswap64(x) >> 56) & 0xff; }
>
> compiled with -O2 to
> foo: movq %rdi, %rax
> bswap %rax
> shrq $56, %rax
> ret
>
> with this patch, it now compiles to
> foo: movzbl %dil, %eax
> ret
>
> This patch has been tested on x86_64-pc-linux-gnu with a "make
> bootstrap" and "make -k check" with no new failures.
>
> Ok for mainline?
I don't like get_builtin_precision too much, did you consider simply using
+ (bit_and (convert1? (rshift@0 (convert2? (bswap@3 @1))
+ INTEGER_CST@2))
and TYPE_PRECISION (TREE_TYPE (@3))? I think while we'll see argument promotion and thus cannot use @1 to derive the type the return value will be the original type.
Now, I see '8' being used which likely should be CHAR_TYPE_SIZE since you also use char_type_node.
I wonder whether
+ /* (bswap(x) >> C1) & C2 can sometimes be simplified to (x >> C3) &
+ C2. */ (simplify (bit_and (convert1? (rshift@0 (convert2? (bswap
+ @1)) INTEGER_CST@2))
+ INTEGER_CST@3)
and
+ /* bswap(x) >> C1 can sometimes be simplified to (T)x >> C2. */
+ (simplify (rshift (convert? (bswap @0)) INTEGER_CST@1)
can build upon each other, for example by extending the latter to handle more cases, transforming to ((T)x >> C2) & C3?
That might of course be only profitable when the bswap goes away.
Thanks,
Richard.
>
>
> 2021-07-06 Roger Sayle <roger@nextmovesoftware.com>
>
> gcc/ChangeLog
> PR tree-optimization/40210
> * builtins.c (get_builtin_precision): Helper function to determine
> the precision in bits of a built-in function.
> * builtins.h (get_builtin_precision): Prototype here.
> * match.pd (bswap optimizations): Simplify (bswap(x)>>C1)&C2 as
> (x>>C3)&C2 when possible. Simplify bswap(x)>>C1 as ((T)x)>>C2
> when possible. Simplify bswap(x)&C1 as (x>>C2)&C1 when 0<=C1<=255.
>
> gcc/testsuite/ChangeLog
> PR tree-optimization/40210
> * gcc.dg/builtin-bswap-13.c: New test.
> * gcc.dg/builtin-bswap-14.c: New test.
>
> Roger
> --
> Roger Sayle
> NextMove Software
> Cambridge, UK
>
[-- Attachment #2: patchb2.txt --]
[-- Type: text/plain, Size: 3140 bytes --]
diff --git a/gcc/match.pd b/gcc/match.pd
index 39fb57e..a134485 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -3610,7 +3610,8 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
(complex (convert:itype @0) (negate (convert:itype @1)))))
/* BSWAP simplifications, transforms checked by gcc.dg/builtin-bswap-8.c. */
-(for bswap (BUILT_IN_BSWAP16 BUILT_IN_BSWAP32 BUILT_IN_BSWAP64)
+(for bswap (BUILT_IN_BSWAP16 BUILT_IN_BSWAP32
+ BUILT_IN_BSWAP64 BUILT_IN_BSWAP128)
(simplify
(bswap (bswap @0))
@0)
@@ -3620,7 +3621,70 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
(for bitop (bit_xor bit_ior bit_and)
(simplify
(bswap (bitop:c (bswap @0) @1))
- (bitop @0 (bswap @1)))))
+ (bitop @0 (bswap @1))))
+ /* (bswap(x) >> C1) & C2 can sometimes be simplified to (x >> C3) & C2. */
+ (simplify
+ (bit_and (convert1? (rshift@0 (convert2? (bswap@4 @1)) INTEGER_CST@2))
+ INTEGER_CST@3)
+ (if (BITS_PER_UNIT == 8
+ && tree_fits_uhwi_p (@2)
+ && tree_fits_uhwi_p (@3))
+ (with
+ {
+ unsigned HOST_WIDE_INT prec = TYPE_PRECISION (TREE_TYPE (@4));
+ unsigned HOST_WIDE_INT bits = tree_to_uhwi (@2);
+ unsigned HOST_WIDE_INT mask = tree_to_uhwi (@3);
+ unsigned HOST_WIDE_INT lo = bits & 7;
+ unsigned HOST_WIDE_INT hi = bits - lo;
+ }
+ (if (bits < prec
+ && mask < (256u>>lo)
+ && bits < TYPE_PRECISION (TREE_TYPE(@0)))
+ (with { unsigned HOST_WIDE_INT ns = (prec - (hi + 8)) + lo; }
+ (if (ns == 0)
+ (bit_and (convert @1) @3)
+ (with
+ {
+ tree utype = unsigned_type_for (TREE_TYPE (@1));
+ tree nst = build_int_cst (integer_type_node, ns);
+ }
+ (bit_and (convert (rshift:utype (convert:utype @1) {nst;})) @3))))))))
+ /* bswap(x) >> C1 can sometimes be simplified to (T)x >> C2. */
+ (simplify
+ (rshift (convert? (bswap@2 @0)) INTEGER_CST@1)
+ (if (BITS_PER_UNIT == 8
+ && CHAR_TYPE_SIZE == 8
+ && tree_fits_uhwi_p (@1))
+ (with
+ {
+ unsigned HOST_WIDE_INT prec = TYPE_PRECISION (TREE_TYPE (@2));
+ unsigned HOST_WIDE_INT bits = tree_to_uhwi (@1);
+ }
+ (if (bits + 8 == prec)
+ (if (TYPE_UNSIGNED (type))
+ (convert (convert:unsigned_char_type_node @0))
+ (convert (convert:signed_char_type_node @0)))
+ (if (bits < prec && bits + 8 > prec)
+ (with
+ {
+ tree nst = build_int_cst (integer_type_node, bits & 7);
+ tree bt = TYPE_UNSIGNED (type) ? unsigned_char_type_node
+ : signed_char_type_node;
+ }
+ (convert (rshift:bt (convert:bt @0) {nst;}))))))))
+ /* bswap(x) & C1 can sometimes be simplified to (x >> C2) & C1. */
+ (simplify
+ (bit_and (convert? (bswap@2 @0)) INTEGER_CST@1)
+ (if (BITS_PER_UNIT == 8
+ && tree_fits_uhwi_p (@1)
+ && tree_to_uhwi (@1) < 256)
+ (with
+ {
+ unsigned HOST_WIDE_INT prec = TYPE_PRECISION (TREE_TYPE (@2));
+ tree utype = unsigned_type_for (TREE_TYPE (@0));
+ tree nst = build_int_cst (integer_type_node, prec - 8);
+ }
+ (bit_and (convert (rshift:utype (convert:utype @0) {nst;})) @1)))))
/* Combine COND_EXPRs and VEC_COND_EXPRs. */
[-- Attachment #3: builtin-bswap-13.c --]
[-- Type: text/plain, Size: 10583 bytes --]
/* { dg-do compile } */
/* { dg-options "-O2 -fdump-tree-optimized" } */
int test_s32_0_1(int x) { return __builtin_bswap32(x) & 1; }
int test_s32_0_2(int x) { return __builtin_bswap32(x) & 2; }
int test_s32_0_240(int x) { return __builtin_bswap32(x) & 240; }
int test_s32_0_255(int x) { return __builtin_bswap32(x) & 255; }
int test_s32_1_1(int x) { return (__builtin_bswap32(x) >> 1) & 1; }
int test_s32_7_1(int x) { return (__builtin_bswap32(x) >> 7) & 1; }
int test_s32_8_1(int x) { return (__builtin_bswap32(x) >> 8) & 1; }
int test_s32_8_240(int x) { return (__builtin_bswap32(x) >> 8) & 240; }
int test_s32_8_255(int x) { return (__builtin_bswap32(x) >> 8) & 255; }
int test_s32_15_1(int x) { return (__builtin_bswap32(x) >> 15) & 1; }
int test_s32_16_1(int x) { return (__builtin_bswap32(x) >> 16) & 1; }
int test_s32_16_240(int x) { return (__builtin_bswap32(x) >> 16) & 240; }
int test_s32_16_255(int x) { return (__builtin_bswap32(x) >> 16) & 255; }
int test_s32_24_1(int x) { return (__builtin_bswap32(x) >> 24) & 1; }
int test_s32_24_240(int x) { return (__builtin_bswap32(x) >> 24) & 240; }
int test_s32_24_255(int x) { return (__builtin_bswap32(x) >> 24) & 255; }
int test_s32_31_1(int x) { return (__builtin_bswap32(x) >> 31) & 1; }
int test_S32_0_1(int x) { return (int)__builtin_bswap32(x) & 1; }
int test_S32_0_2(int x) { return (int)__builtin_bswap32(x) & 2; }
int test_S32_0_240(int x) { return (int)__builtin_bswap32(x) & 240; }
int test_S32_0_255(int x) { return (int)__builtin_bswap32(x) & 255; }
int test_S32_1_1(int x) { return ((int)__builtin_bswap32(x) >> 1) & 1; }
int test_S32_7_1(int x) { return ((int)__builtin_bswap32(x) >> 7) & 1; }
int test_S32_8_1(int x) { return ((int)__builtin_bswap32(x) >> 8) & 1; }
int test_S32_8_240(int x) { return ((int)__builtin_bswap32(x) >> 8) & 240; }
int test_S32_8_255(int x) { return ((int)__builtin_bswap32(x) >> 8) & 255; }
int test_S32_15_1(int x) { return ((int)__builtin_bswap32(x) >> 15) & 1; }
int test_S32_16_1(int x) { return ((int)__builtin_bswap32(x) >> 16) & 1; }
int test_S32_16_240(int x) { return ((int)__builtin_bswap32(x) >> 16) & 240; }
int test_S32_16_255(int x) { return ((int)__builtin_bswap32(x) >> 16) & 255; }
int test_S32_24_1(int x) { return ((int)__builtin_bswap32(x) >> 24) & 1; }
int test_S32_24_240(int x) { return ((int)__builtin_bswap32(x) >> 24) & 240; }
int test_S32_24_255(int x) { return ((int)__builtin_bswap32(x) >> 24) & 255; }
int test_S32_31_1(int x) { return ((int)__builtin_bswap32(x) >> 31) & 1; }
unsigned int test_u32_24_255(unsigned int x) {
return (__builtin_bswap32(x) >> 24) & 255;
}
long long test_s64_0_1(long long x) {
return __builtin_bswap64(x) & 1;
}
long long test_s64_0_2(long long x) {
return __builtin_bswap64(x) & 2;
}
long long test_s64_0_240(long long x) {
return __builtin_bswap64(x) & 240;
}
long long test_s64_0_255(long long x) {
return __builtin_bswap64(x) & 255;
}
long long test_s64_7_1(long long x) {
return (__builtin_bswap64(x) >> 7) & 1;
}
long long test_s64_8_1(long long x) {
return (__builtin_bswap64(x) >> 8) & 1;
}
long long test_s64_8_240(long long x) {
return (__builtin_bswap64(x) >> 56) & 240;
}
long long test_s64_8_255(long long x) {
return (__builtin_bswap64(x) >> 8) & 255;
}
long long test_s64_9_1(long long x) {
return (__builtin_bswap64(x) >> 9) & 1;
}
long long test_s64_31_1(long long x) {
return (__builtin_bswap64(x) >> 31) & 1;
}
long long test_s64_32_1(long long x) {
return (__builtin_bswap64(x) >> 32) & 1;
}
long long test_s64_32_240(long long x) {
return (__builtin_bswap64(x) >> 32) & 240;
}
long long test_s64_32_255(long long x) {
return (__builtin_bswap64(x) >> 32) & 255;
}
long long test_s64_33_1(long long x) {
return (__builtin_bswap64(x) >> 33) & 1;
}
long long test_s64_48_1(long long x) {
return (__builtin_bswap64(x) >> 48) & 1;
}
long long test_s64_48_240(long long x) {
return (__builtin_bswap64(x) >> 48) & 240;
}
long long test_s64_48_255(long long x) {
return (__builtin_bswap64(x) >> 48) & 255;
}
long long test_s64_56_1(long long x) {
return (__builtin_bswap64(x) >> 56) & 1;
}
long long test_s64_56_240(long long x) {
return (__builtin_bswap64(x) >> 56) & 240;
}
long long test_s64_56_255(long long x) {
return (__builtin_bswap64(x) >> 56) & 255;
}
long long test_s64_57_1(long long x) {
return (__builtin_bswap64(x) >> 57) & 1;
}
long long test_s64_63_1(long long x) {
return (__builtin_bswap64(x) >> 63) & 1;
}
long long test_S64_0_1(long long x) {
return (long long)__builtin_bswap64(x) & 1;
}
long long test_S64_0_2(long long x) {
return (long long)__builtin_bswap64(x) & 2;
}
long long test_S64_0_240(long long x) {
return (long long)__builtin_bswap64(x) & 240;
}
long long test_S64_0_255(long long x) {
return (long long)__builtin_bswap64(x) & 255;
}
long long test_S64_7_1(long long x) {
return ((long long)__builtin_bswap64(x) >> 7) & 1;
}
long long test_S64_8_1(long long x) {
return ((long long)__builtin_bswap64(x) >> 8) & 1;
}
long long test_S64_8_240(long long x) {
return ((long long)__builtin_bswap64(x) >> 56) & 240;
}
long long test_S64_8_255(long long x) {
return ((long long)__builtin_bswap64(x) >> 8) & 255;
}
long long test_S64_9_1(long long x) {
return ((long long)__builtin_bswap64(x) >> 9) & 1;
}
long long test_S64_31_1(long long x) {
return ((long long)__builtin_bswap64(x) >> 31) & 1;
}
long long test_S64_32_1(long long x) {
return ((long long)__builtin_bswap64(x) >> 32) & 1;
}
long long test_S64_32_240(long long x) {
return ((long long)__builtin_bswap64(x) >> 32) & 240;
}
long long test_S64_32_255(long long x) {
return ((long long)__builtin_bswap64(x) >> 32) & 255;
}
long long test_S64_33_1(long long x) {
return ((long long)__builtin_bswap64(x) >> 33) & 1;
}
long long test_S64_48_1(long long x) {
return ((long long)__builtin_bswap64(x) >> 48) & 1;
}
long long test_S64_48_240(long long x) {
return ((long long)__builtin_bswap64(x) >> 48) & 240;
}
long long test_S64_48_255(long long x) {
return ((long long)__builtin_bswap64(x) >> 48) & 255;
}
long long test_S64_56_1(long long x) {
return ((long long)__builtin_bswap64(x) >> 56) & 1;
}
long long test_S64_56_240(long long x) {
return ((long long)__builtin_bswap64(x) >> 56) & 240;
}
long long test_S64_56_255(long long x) {
return ((long long)__builtin_bswap64(x) >> 56) & 255;
}
long long test_S64_57_1(long long x) {
return ((long long)__builtin_bswap64(x) >> 57) & 1;
}
long long test_S64_63_1(long long x) {
return ((long long)__builtin_bswap64(x) >> 63) & 1;
}
unsigned long long test_u64_56_255(unsigned long long x) {
return (__builtin_bswap64(x) >> 56) & 255;
}
short test_s16_0_1(short x) {
return __builtin_bswap16(x) & 1;
}
short test_s16_0_240(short x) {
return __builtin_bswap16(x) & 240;
}
short test_s16_0_255(short x) {
return __builtin_bswap16(x) & 255;
}
short test_s16_1_1(short x) {
return (__builtin_bswap16(x) >> 1) & 1;
}
short test_s16_7_1(short x) {
return (__builtin_bswap16(x) >> 7) & 1;
}
short test_s16_8_1(short x) {
return (__builtin_bswap16(x) >> 8) & 1;
}
short test_s16_8_240(short x) {
return (__builtin_bswap16(x) >> 8) & 240;
}
short test_s16_8_255(short x) {
return (__builtin_bswap16(x) >> 8) & 255;
}
short test_s16_9_1(short x) {
return (__builtin_bswap16(x) >> 9) & 1;
}
short test_s16_15_1(short x) {
return (__builtin_bswap16(x) >> 15) & 1;
}
short test_S16_0_1(short x) {
return (short)__builtin_bswap16(x) & 1;
}
short test_S16_0_240(short x) {
return (short)__builtin_bswap16(x) & 240;
}
short test_S16_0_255(short x) {
return (short)__builtin_bswap16(x) & 255;
}
short test_S16_1_1(short x) {
return ((short)__builtin_bswap16(x) >> 1) & 1;
}
short test_S16_7_1(short x) {
return ((short)__builtin_bswap16(x) >> 7) & 1;
}
short test_S16_8_1(short x) {
return ((short)__builtin_bswap16(x) >> 8) & 1;
}
short test_S16_8_240(short x) {
return ((short)__builtin_bswap16(x) >> 8) & 240;
}
short test_S16_8_255(short x) {
return ((short)__builtin_bswap16(x) >> 8) & 255;
}
short test_S16_9_1(short x) {
return ((short)__builtin_bswap16(x) >> 9) & 1;
}
short test_S16_15_1(short x) {
return ((short)__builtin_bswap16(x) >> 15) & 1;
}
unsigned short test_u16_8_255(unsigned short x) {
return (__builtin_bswap16(x) >> 8) & 255;
}
/* Shifts only */
int test_s32_24(int x) {
return __builtin_bswap32(x) >> 24;
}
int test_s32_25(int x) {
return __builtin_bswap32(x) >> 25;
}
int test_s32_30(int x) {
return __builtin_bswap32(x) >> 30;
}
int test_s32_31(int x) {
return __builtin_bswap32(x) >> 31;
}
unsigned int test_u32_24(unsigned int x) {
return __builtin_bswap32(x) >> 24;
}
unsigned int test_u32_25(unsigned int x) {
return __builtin_bswap32(x) >> 25;
}
unsigned int test_u32_30(unsigned int x) {
return __builtin_bswap32(x) >> 30;
}
unsigned int test_u32_31(unsigned int x) {
return __builtin_bswap32(x) >> 31;
}
long long test_s64_56(long long x) {
return __builtin_bswap64(x) >> 56;
}
long long test_s64_57(long long x) {
return __builtin_bswap64(x) >> 57;
}
long long test_s64_62(long long x) {
return __builtin_bswap64(x) >> 62;
}
long long test_s64_63(long long x) {
return __builtin_bswap64(x) >> 63;
}
unsigned long long test_u64_56(unsigned long long x) {
return __builtin_bswap64(x) >> 56;
}
unsigned long long test_u64_57(unsigned long long x) {
return __builtin_bswap64(x) >> 57;
}
unsigned long long test_u64_62(unsigned long long x) {
return __builtin_bswap64(x) >> 62;
}
unsigned long long test_u64_63(unsigned long long x) {
return __builtin_bswap64(x) >> 63;
}
short test_s16_8(short x) {
return __builtin_bswap16(x) >> 8;
}
short test_s16_9(short x) {
return __builtin_bswap16(x) >> 9;
}
short test_s16_14(short x) {
return __builtin_bswap16(x) >> 14;
}
short test_s16_15(short x) {
return __builtin_bswap16(x) >> 15;
}
unsigned short test_u16_8(unsigned short x) {
return __builtin_bswap16(x) >> 8;
}
unsigned short test_u16_9(unsigned short x) {
return __builtin_bswap16(x) >> 9;
}
unsigned short test_u16_14(unsigned short x) {
return __builtin_bswap16(x) >> 14;
}
unsigned short test_u16_15(unsigned short x) {
return __builtin_bswap16(x) >> 15;
}
/* { dg-final { scan-tree-dump-not "__builtin_bswap" "optimized" } } */
[-- Attachment #4: builtin-bswap-14.c --]
[-- Type: text/plain, Size: 6937 bytes --]
/* { dg-do run } */
/* { dg-options "-O2" } */
extern void abort (void);
__attribute__ ((noinline, noclone))
static int rt32 (int x, int y, int z) {
return (__builtin_bswap32(x) >> y) & z;
}
#define TEST32(X,Y,Z) if(((__builtin_bswap32(X)>>Y)&Z)!=rt32(X,Y,Z)) abort()
void test32(int x)
{
TEST32(x,0,1);
TEST32(x,0,255);
TEST32(x,1,1);
TEST32(x,2,1);
TEST32(x,3,1);
TEST32(x,4,1);
TEST32(x,5,1);
TEST32(x,6,1);
TEST32(x,7,1);
TEST32(x,8,1);
TEST32(x,8,255);
TEST32(x,9,1);
TEST32(x,10,1);
TEST32(x,11,1);
TEST32(x,12,1);
TEST32(x,13,1);
TEST32(x,14,1);
TEST32(x,15,1);
TEST32(x,16,1);
TEST32(x,16,255);
TEST32(x,17,1);
TEST32(x,18,1);
TEST32(x,19,1);
TEST32(x,20,1);
TEST32(x,21,1);
TEST32(x,22,1);
TEST32(x,23,1);
TEST32(x,24,1);
TEST32(x,24,255);
TEST32(x,25,1);
TEST32(x,26,1);
TEST32(x,27,1);
TEST32(x,28,1);
TEST32(x,29,1);
TEST32(x,30,1);
TEST32(x,31,1);
}
#if __SIZEOF_LONG_LONG__ == 8
__attribute__ ((noinline, noclone))
static long long rt64 (long long x, int y, long long z) {
return (__builtin_bswap64(x) >> y) & z;
}
#define TEST64(X,Y,Z) if(((__builtin_bswap64(X)>>Y)&Z)!=rt64(X,Y,Z)) abort()
void test64(long long x)
{
TEST64(x,0,1);
TEST64(x,0,255);
TEST64(x,1,1);
TEST64(x,2,1);
TEST64(x,3,1);
TEST64(x,4,1);
TEST64(x,5,1);
TEST64(x,6,1);
TEST64(x,7,1);
TEST64(x,8,1);
TEST64(x,8,255);
TEST64(x,9,1);
TEST64(x,10,1);
TEST64(x,11,1);
TEST64(x,12,1);
TEST64(x,13,1);
TEST64(x,14,1);
TEST64(x,15,1);
TEST64(x,16,1);
TEST64(x,16,255);
TEST64(x,17,1);
TEST64(x,18,1);
TEST64(x,19,1);
TEST64(x,20,1);
TEST64(x,21,1);
TEST64(x,22,1);
TEST64(x,23,1);
TEST64(x,24,1);
TEST64(x,24,255);
TEST64(x,25,1);
TEST64(x,26,1);
TEST64(x,27,1);
TEST64(x,28,1);
TEST64(x,29,1);
TEST64(x,30,1);
TEST64(x,31,1);
TEST64(x,32,1);
TEST64(x,32,255);
TEST64(x,33,1);
TEST64(x,34,1);
TEST64(x,35,1);
TEST64(x,36,1);
TEST64(x,37,1);
TEST64(x,38,1);
TEST64(x,39,1);
TEST64(x,40,1);
TEST64(x,40,255);
TEST64(x,41,1);
TEST64(x,42,1);
TEST64(x,43,1);
TEST64(x,44,1);
TEST64(x,45,1);
TEST64(x,46,1);
TEST64(x,47,1);
TEST64(x,48,1);
TEST64(x,48,255);
TEST64(x,49,1);
TEST64(x,50,1);
TEST64(x,51,1);
TEST64(x,52,1);
TEST64(x,53,1);
TEST64(x,54,1);
TEST64(x,55,1);
TEST64(x,56,1);
TEST64(x,56,255);
TEST64(x,57,1);
TEST64(x,58,1);
TEST64(x,59,1);
TEST64(x,60,1);
TEST64(x,61,1);
TEST64(x,62,1);
TEST64(x,63,1);
}
#endif
__attribute__ ((noinline, noclone))
static int rt16 (int x, int y, int z) {
return (__builtin_bswap16(x) >> y) & z;
}
#define TEST16(X,Y,Z) if(((__builtin_bswap16(X)>>Y)&Z)!=rt16(X,Y,Z)) abort()
void test16(int x)
{
TEST16(x,0,1);
TEST16(x,0,255);
TEST16(x,1,1);
TEST16(x,2,1);
TEST16(x,3,1);
TEST16(x,4,1);
TEST16(x,5,1);
TEST16(x,6,1);
TEST16(x,7,1);
TEST16(x,8,1);
TEST16(x,8,255);
TEST16(x,9,1);
TEST16(x,10,1);
TEST16(x,11,1);
TEST16(x,12,1);
TEST16(x,13,1);
TEST16(x,14,1);
TEST16(x,15,1);
}
int main()
{
test32(0x00000000);
test32(0xffffffff);
test32(0x00000001);
test32(0x00000002);
test32(0x00000004);
test32(0x00000008);
test32(0x00000010);
test32(0x00000020);
test32(0x00000040);
test32(0x00000080);
test32(0x00000100);
test32(0x00000200);
test32(0x00000400);
test32(0x00000800);
test32(0x00001000);
test32(0x00002000);
test32(0x00004000);
test32(0x00008000);
test32(0x00010000);
test32(0x00020000);
test32(0x00040000);
test32(0x00080000);
test32(0x00100000);
test32(0x00200000);
test32(0x00400000);
test32(0x00800000);
test32(0x01000000);
test32(0x02000000);
test32(0x04000000);
test32(0x08000000);
test32(0x10000000);
test32(0x20000000);
test32(0x40000000);
test32(0x80000000);
test32(0x12345678);
test32(0x87654321);
test32(0xdeadbeef);
test32(0xcafebabe);
#if __SIZEOF_LONG_LONG__ == 8
test64(0x0000000000000000ll);
test64(0xffffffffffffffffll);
test64(0x0000000000000001ll);
test64(0x0000000000000002ll);
test64(0x0000000000000004ll);
test64(0x0000000000000008ll);
test64(0x0000000000000010ll);
test64(0x0000000000000020ll);
test64(0x0000000000000040ll);
test64(0x0000000000000080ll);
test64(0x0000000000000100ll);
test64(0x0000000000000200ll);
test64(0x0000000000000400ll);
test64(0x0000000000000800ll);
test64(0x0000000000001000ll);
test64(0x0000000000002000ll);
test64(0x0000000000004000ll);
test64(0x0000000000008000ll);
test64(0x0000000000010000ll);
test64(0x0000000000020000ll);
test64(0x0000000000040000ll);
test64(0x0000000000080000ll);
test64(0x0000000000100000ll);
test64(0x0000000000200000ll);
test64(0x0000000000400000ll);
test64(0x0000000000800000ll);
test64(0x0000000001000000ll);
test64(0x0000000002000000ll);
test64(0x0000000004000000ll);
test64(0x0000000008000000ll);
test64(0x0000000010000000ll);
test64(0x0000000020000000ll);
test64(0x0000000040000000ll);
test64(0x0000000080000000ll);
test64(0x0000000100000000ll);
test64(0x0000000200000000ll);
test64(0x0000000400000000ll);
test64(0x0000000800000000ll);
test64(0x0000001000000000ll);
test64(0x0000002000000000ll);
test64(0x0000004000000000ll);
test64(0x0000008000000000ll);
test64(0x0000010000000000ll);
test64(0x0000020000000000ll);
test64(0x0000040000000000ll);
test64(0x0000080000000000ll);
test64(0x0000100000000000ll);
test64(0x0000200000000000ll);
test64(0x0000400000000000ll);
test64(0x0000800000000000ll);
test64(0x0001000000000000ll);
test64(0x0002000000000000ll);
test64(0x0004000000000000ll);
test64(0x0008000000000000ll);
test64(0x0010000000000000ll);
test64(0x0020000000000000ll);
test64(0x0040000000000000ll);
test64(0x0080000000000000ll);
test64(0x0100000000000000ll);
test64(0x0200000000000000ll);
test64(0x0400000000000000ll);
test64(0x0800000000000000ll);
test64(0x1000000000000000ll);
test64(0x2000000000000000ll);
test64(0x4000000000000000ll);
test64(0x8000000000000000ll);
test64(0x0123456789abcdefll);
test64(0xfedcba9876543210ll);
test64(0xdeadbeefdeadbeefll);
test64(0xcafebabecafebabell);
#endif
test16(0x0000);
test16(0xffff);
test16(0x0001);
test16(0x0002);
test16(0x0004);
test16(0x0008);
test16(0x0010);
test16(0x0020);
test16(0x0040);
test16(0x0080);
test16(0x0100);
test16(0x0200);
test16(0x0400);
test16(0x0800);
test16(0x1000);
test16(0x2000);
test16(0x4000);
test16(0x8000);
test16(0x1234);
test16(0x4321);
test16(0xdead);
test16(0xbeef);
test16(0xcafe);
test16(0xbabe);
return 0;
}
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] PR tree-opt/40210: Fold (bswap(X)>>C1)&C2 to (X>>C3)&C2 in match.pd
2021-07-08 7:37 ` Roger Sayle
@ 2021-07-08 9:50 ` Richard Biener
2021-07-10 14:06 ` H.J. Lu
0 siblings, 1 reply; 5+ messages in thread
From: Richard Biener @ 2021-07-08 9:50 UTC (permalink / raw)
To: Roger Sayle; +Cc: GCC Patches
On Thu, Jul 8, 2021 at 9:37 AM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> Hi Richard,
> Thanks. Yep, you've correctly the diagnosed that the motivation for the
> get_builtin_precision helper function was that the TREE_TYPE of the
> argument is affected by argument promotion. Your suggestion to instead
> use the TREE_TYPE of the function result is a much nicer solution.
>
> I also agree that that all of these bswap optimizations make the assumption
> that BITS_PER_UNIT is 8 (i.e. that bytes are 8-bits), and some that the
> front-end supports an 8-bit type (i.e. that CHAR_TYPE_SIZE is 8), which
> can be checked explicitly.
>
> Both of these improvements are implemented in the attached revised patch,
> which has been tested on x86_64-pc-linux-gnu with a "make bootstrap"
> and "make -k check" with no new failures.
>
> Ok for mainline?
OK.
Thanks,
Richard.
> 2021-07-08 Roger Sayle <roger@nextmovesoftware.com>
> Richard Biener <rguenther@suse.de>
>
> gcc/ChangeLog
> PR tree-optimization/40210
> * match.pd (bswap optimizations): Simplify (bswap(x)>>C1)&C2 as
> (x>>C3)&C2 when possible. Simplify bswap(x)>>C1 as ((T)x)>>C2
> when possible. Simplify bswap(x)&C1 as (x>>C2)&C1 when 0<=C1<=255.
>
> gcc/testsuite/ChangeLog
> PR tree-optimization/40210
> * gcc.dg/builtin-bswap-13.c: New test.
> * gcc.dg/builtin-bswap-14.c: New test.
>
> Roger
> --
>
> -----Original Message-----
> From: Richard Biener <richard.guenther@gmail.com>
> Sent: 07 July 2021 08:56
> To: Roger Sayle <roger@nextmovesoftware.com>
> Cc: GCC Patches <gcc-patches@gcc.gnu.org>
> Subject: Re: [PATCH] PR tree-opt/40210: Fold (bswap(X)>>C1)&C2 to (X>>C3)&C2 in match.pd
>
> On Tue, Jul 6, 2021 at 9:01 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
> >
> >
> > All of the optimizations/transformations mentioned in bugzilla for PR
> > tree-optimization/40210 are already implemented in mainline GCC, with
> > one exception. In comment #5, there's a suggestion that
> > (bswap64(x)>>56)&0xff can be implemented without the bswap as
> > (unsigned char)x, or equivalently x&0xff.
> >
> > This patch implements the above optimization, and closely related
> > variants. For any single bit, (bswap(X)>>C1)&1 can be simplified to
> > (X>>C2)&1, where bit position C2 is the appropriate permutation of C1.
> > Similarly, the bswap can eliminated if the desired set of bits all lie
> > within the same byte, hence (bswap(x)>>8)&255 can always be optimized,
> > as can (bswap(x)>>8)&123.
> >
> > Previously,
> >
> > int foo(long long x) {
> > return (__builtin_bswap64(x) >> 56) & 0xff; }
> >
> > compiled with -O2 to
> > foo: movq %rdi, %rax
> > bswap %rax
> > shrq $56, %rax
> > ret
> >
> > with this patch, it now compiles to
> > foo: movzbl %dil, %eax
> > ret
> >
> > This patch has been tested on x86_64-pc-linux-gnu with a "make
> > bootstrap" and "make -k check" with no new failures.
> >
> > Ok for mainline?
>
> I don't like get_builtin_precision too much, did you consider simply using
>
> + (bit_and (convert1? (rshift@0 (convert2? (bswap@3 @1))
> + INTEGER_CST@2))
>
> and TYPE_PRECISION (TREE_TYPE (@3))? I think while we'll see argument promotion and thus cannot use @1 to derive the type the return value will be the original type.
>
> Now, I see '8' being used which likely should be CHAR_TYPE_SIZE since you also use char_type_node.
>
> I wonder whether
>
> + /* (bswap(x) >> C1) & C2 can sometimes be simplified to (x >> C3) &
> + C2. */ (simplify (bit_and (convert1? (rshift@0 (convert2? (bswap
> + @1)) INTEGER_CST@2))
> + INTEGER_CST@3)
>
> and
>
> + /* bswap(x) >> C1 can sometimes be simplified to (T)x >> C2. */
> + (simplify (rshift (convert? (bswap @0)) INTEGER_CST@1)
>
> can build upon each other, for example by extending the latter to handle more cases, transforming to ((T)x >> C2) & C3?
> That might of course be only profitable when the bswap goes away.
>
> Thanks,
> Richard.
>
> >
> >
> > 2021-07-06 Roger Sayle <roger@nextmovesoftware.com>
> >
> > gcc/ChangeLog
> > PR tree-optimization/40210
> > * builtins.c (get_builtin_precision): Helper function to determine
> > the precision in bits of a built-in function.
> > * builtins.h (get_builtin_precision): Prototype here.
> > * match.pd (bswap optimizations): Simplify (bswap(x)>>C1)&C2 as
> > (x>>C3)&C2 when possible. Simplify bswap(x)>>C1 as ((T)x)>>C2
> > when possible. Simplify bswap(x)&C1 as (x>>C2)&C1 when 0<=C1<=255.
> >
> > gcc/testsuite/ChangeLog
> > PR tree-optimization/40210
> > * gcc.dg/builtin-bswap-13.c: New test.
> > * gcc.dg/builtin-bswap-14.c: New test.
> >
> > Roger
> > --
> > Roger Sayle
> > NextMove Software
> > Cambridge, UK
> >
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] PR tree-opt/40210: Fold (bswap(X)>>C1)&C2 to (X>>C3)&C2 in match.pd
2021-07-08 9:50 ` Richard Biener
@ 2021-07-10 14:06 ` H.J. Lu
0 siblings, 0 replies; 5+ messages in thread
From: H.J. Lu @ 2021-07-10 14:06 UTC (permalink / raw)
To: Richard Biener; +Cc: Roger Sayle, GCC Patches
On Thu, Jul 8, 2021 at 2:51 AM Richard Biener via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On Thu, Jul 8, 2021 at 9:37 AM Roger Sayle <roger@nextmovesoftware.com> wrote:
> >
> >
> > Hi Richard,
> > Thanks. Yep, you've correctly the diagnosed that the motivation for the
> > get_builtin_precision helper function was that the TREE_TYPE of the
> > argument is affected by argument promotion. Your suggestion to instead
> > use the TREE_TYPE of the function result is a much nicer solution.
> >
> > I also agree that that all of these bswap optimizations make the assumption
> > that BITS_PER_UNIT is 8 (i.e. that bytes are 8-bits), and some that the
> > front-end supports an 8-bit type (i.e. that CHAR_TYPE_SIZE is 8), which
> > can be checked explicitly.
> >
> > Both of these improvements are implemented in the attached revised patch,
> > which has been tested on x86_64-pc-linux-gnu with a "make bootstrap"
> > and "make -k check" with no new failures.
> >
> > Ok for mainline?
>
> OK.
>
> Thanks,
> Richard.
>
> > 2021-07-08 Roger Sayle <roger@nextmovesoftware.com>
> > Richard Biener <rguenther@suse.de>
> >
> > gcc/ChangeLog
> > PR tree-optimization/40210
> > * match.pd (bswap optimizations): Simplify (bswap(x)>>C1)&C2 as
> > (x>>C3)&C2 when possible. Simplify bswap(x)>>C1 as ((T)x)>>C2
> > when possible. Simplify bswap(x)&C1 as (x>>C2)&C1 when 0<=C1<=255.
> >
> > gcc/testsuite/ChangeLog
> > PR tree-optimization/40210
> > * gcc.dg/builtin-bswap-13.c: New test.
> > * gcc.dg/builtin-bswap-14.c: New test.
> >
> > Roger
> > --
> >
> > -----Original Message-----
> > From: Richard Biener <richard.guenther@gmail.com>
> > Sent: 07 July 2021 08:56
> > To: Roger Sayle <roger@nextmovesoftware.com>
> > Cc: GCC Patches <gcc-patches@gcc.gnu.org>
> > Subject: Re: [PATCH] PR tree-opt/40210: Fold (bswap(X)>>C1)&C2 to (X>>C3)&C2 in match.pd
> >
> > On Tue, Jul 6, 2021 at 9:01 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
> > >
> > >
> > > All of the optimizations/transformations mentioned in bugzilla for PR
> > > tree-optimization/40210 are already implemented in mainline GCC, with
> > > one exception. In comment #5, there's a suggestion that
> > > (bswap64(x)>>56)&0xff can be implemented without the bswap as
> > > (unsigned char)x, or equivalently x&0xff.
> > >
> > > This patch implements the above optimization, and closely related
> > > variants. For any single bit, (bswap(X)>>C1)&1 can be simplified to
> > > (X>>C2)&1, where bit position C2 is the appropriate permutation of C1.
> > > Similarly, the bswap can eliminated if the desired set of bits all lie
> > > within the same byte, hence (bswap(x)>>8)&255 can always be optimized,
> > > as can (bswap(x)>>8)&123.
> > >
> > > Previously,
> > >
> > > int foo(long long x) {
> > > return (__builtin_bswap64(x) >> 56) & 0xff; }
> > >
> > > compiled with -O2 to
> > > foo: movq %rdi, %rax
> > > bswap %rax
> > > shrq $56, %rax
> > > ret
> > >
> > > with this patch, it now compiles to
> > > foo: movzbl %dil, %eax
> > > ret
> > >
> > > This patch has been tested on x86_64-pc-linux-gnu with a "make
> > > bootstrap" and "make -k check" with no new failures.
> > >
> > > Ok for mainline?
> >
> > I don't like get_builtin_precision too much, did you consider simply using
> >
> > + (bit_and (convert1? (rshift@0 (convert2? (bswap@3 @1))
> > + INTEGER_CST@2))
> >
> > and TYPE_PRECISION (TREE_TYPE (@3))? I think while we'll see argument promotion and thus cannot use @1 to derive the type the return value will be the original type.
> >
> > Now, I see '8' being used which likely should be CHAR_TYPE_SIZE since you also use char_type_node.
> >
> > I wonder whether
> >
> > + /* (bswap(x) >> C1) & C2 can sometimes be simplified to (x >> C3) &
> > + C2. */ (simplify (bit_and (convert1? (rshift@0 (convert2? (bswap
> > + @1)) INTEGER_CST@2))
> > + INTEGER_CST@3)
> >
> > and
> >
> > + /* bswap(x) >> C1 can sometimes be simplified to (T)x >> C2. */
> > + (simplify (rshift (convert? (bswap @0)) INTEGER_CST@1)
> >
> > can build upon each other, for example by extending the latter to handle more cases, transforming to ((T)x >> C2) & C3?
> > That might of course be only profitable when the bswap goes away.
> >
> > Thanks,
> > Richard.
> >
> > >
> > >
> > > 2021-07-06 Roger Sayle <roger@nextmovesoftware.com>
> > >
> > > gcc/ChangeLog
> > > PR tree-optimization/40210
> > > * builtins.c (get_builtin_precision): Helper function to determine
> > > the precision in bits of a built-in function.
> > > * builtins.h (get_builtin_precision): Prototype here.
> > > * match.pd (bswap optimizations): Simplify (bswap(x)>>C1)&C2 as
> > > (x>>C3)&C2 when possible. Simplify bswap(x)>>C1 as ((T)x)>>C2
> > > when possible. Simplify bswap(x)&C1 as (x>>C2)&C1 when 0<=C1<=255.
> > >
> > > gcc/testsuite/ChangeLog
> > > PR tree-optimization/40210
> > > * gcc.dg/builtin-bswap-13.c: New test.
> > > * gcc.dg/builtin-bswap-14.c: New test.
> > >
> > > Roger
> > > --
> > > Roger Sayle
> > > NextMove Software
> > > Cambridge, UK
> > >
This caused:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101403
--
H.J.
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2021-07-10 14:07 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-06 19:01 [PATCH] PR tree-opt/40210: Fold (bswap(X)>>C1)&C2 to (X>>C3)&C2 in match.pd Roger Sayle
2021-07-07 7:55 ` Richard Biener
2021-07-08 7:37 ` Roger Sayle
2021-07-08 9:50 ` Richard Biener
2021-07-10 14:06 ` H.J. Lu
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).