From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp-out2.suse.de (smtp-out2.suse.de [195.135.220.29]) by sourceware.org (Postfix) with ESMTPS id 78CA938387E1 for ; Mon, 13 Jun 2022 11:47:57 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 78CA938387E1 Received: from relay2.suse.de (relay2.suse.de [149.44.160.134]) by smtp-out2.suse.de (Postfix) with ESMTP id 547BA1F993; Mon, 13 Jun 2022 11:47:56 +0000 (UTC) Received: from [10.168.4.8] (unknown [10.168.4.8]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by relay2.suse.de (Postfix) with ESMTPS id 33C022C141; Mon, 13 Jun 2022 11:47:56 +0000 (UTC) Date: Mon, 13 Jun 2022 13:47:56 +0200 (CEST) From: Richard Biener To: Tamar Christina cc: "gcc-patches@gcc.gnu.org" , nd , Richard Sandiford Subject: RE: [PATCH 1/2]middle-end Support optimized division by pow2 bitmask In-Reply-To: Message-ID: References: MIME-Version: 1.0 X-Spam-Status: No, score=-10.8 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, KAM_LOTSOFHASH, SPF_HELO_NONE, SPF_PASS, TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8BIT X-Content-Filtered-By: Mailman/MimeDel 2.1.29 X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 13 Jun 2022 11:48:00 -0000 On Mon, 13 Jun 2022, Tamar Christina wrote: > > -----Original Message----- > > From: Richard Biener > > Sent: Monday, June 13, 2022 10:39 AM > > To: Tamar Christina > > Cc: gcc-patches@gcc.gnu.org; nd ; Richard Sandiford > > > > Subject: Re: [PATCH 1/2]middle-end Support optimized division by pow2 > > bitmask > > > > On Mon, 13 Jun 2022, Richard Biener wrote: > > > > > On Thu, 9 Jun 2022, Tamar Christina wrote: > > > > > > > Hi All, > > > > > > > > In plenty of image and video processing code it's common to modify > > > > pixel values by a widening operation and then scale them back into range > > by dividing by 255. > > > > > > > > This patch adds an optab to allow us to emit an optimized sequence > > > > when doing an unsigned division that is equivalent to: > > > > > > > > x = y / (2 ^ (bitsize (y)/2)-1 > > > > > > > > Bootstrapped Regtested on aarch64-none-linux-gnu, > > > > x86_64-pc-linux-gnu and no issues. > > > > > > > > Ok for master? > > > > > > Looking at 2/2 it seems that this is the wrong way to attack the > > > problem. The ISA doesn't have such instruction so adding an optab > > > looks premature. I suppose that there's no unsigned vector integer > > > division and thus we open-code that in a different way? Isn't the > > > correct thing then to fixup that open-coding if it is more efficient? > > > > The problem is that even if you fixup the open-coding it would need to > be something target specific? The sequence of instructions we generate > don't have a GIMPLE representation. So whatever is generated I'd have to fixup > in RTL then. What's the operation that doesn't have a GIMPLE representation? I think for costing you could resort to the *_cost functions as used by synth_mult and friends. > The problem with this is that it seemed fragile. We generate from the > Vectorizer: > > vect__3.8_35 = MEM [(uint8_t *)_21]; > vect_patt_28.9_37 = WIDEN_MULT_LO_EXPR ; > vect_patt_28.9_38 = WIDEN_MULT_HI_EXPR ; > vect_patt_19.10_40 = vect_patt_28.9_37 h* { 32897, 32897, 32897, 32897, 32897, 32897, 32897, 32897 }; > vect_patt_19.10_41 = vect_patt_28.9_38 h* { 32897, 32897, 32897, 32897, 32897, 32897, 32897, 32897 }; > vect_patt_25.11_42 = vect_patt_19.10_40 >> 7; > vect_patt_25.11_43 = vect_patt_19.10_41 >> 7; > vect_patt_11.12_44 = VEC_PACK_TRUNC_EXPR ; > > and if the magic constants change then we miss the optimization. I could rewrite the open coding to use > shifts alone, but that might be a regression for some uarches I would imagine. OK, so you do have a highpart multiply. I suppose the pattern is too deep to be recognized by combine? What's the RTL good vs. bad before combine of one of the expressions? > > Btw, on x86 we use > > > > t.c:3:21: note: replacing earlier pattern patt_25 = patt_28 / 255; > > t.c:3:21: note: with patt_25 = patt_19 >> 7; > > t.c:3:21: note: extra pattern stmt: patt_19 = patt_28 h* 32897; > > > > which translates to > > > > vpmulhuw %ymm4, %ymm0, %ymm0 > > vpmulhuw %ymm4, %ymm1, %ymm1 > > vpsrlw $7, %ymm0, %ymm0 > > vpsrlw $7, %ymm1, %ymm1 > > > > there's odd > > > > vpand %ymm0, %ymm3, %ymm0 > > vpand %ymm1, %ymm3, %ymm1 > > > > before (%ymm3 is all 0x00ff) > > > > vpackuswb %ymm1, %ymm0, %ymm0 > > > > that's not visible in GIMPLE. I guess aarch64 lacks a highpart multiply here? > > In any case, it seems that generic division expansion could be improved > > here? (choose_multiplier?) > > We do generate multiply highpart here, but the patch completely avoids multiplies > and shifts entirely by creative use of the ISA. Another reason I went for an optab is costing. > The chosen operations are significantly cheaper on all Arm uarches than Shifts and multiply. > > This means we get vectorization in some cases where the cost model would correctly say > It's too expensive to vectorize. Particularly around double precision. > > Thanks, > Tamar > > > > > Richard. > > > > > Richard. > > > > > > > Thanks, > > > > Tamar > > > > > > > > gcc/ChangeLog: > > > > > > > > * internal-fn.def (DIV_POW2_BITMASK): New. > > > > * optabs.def (udiv_pow2_bitmask_optab): New. > > > > * doc/md.texi: Document it. > > > > * tree-vect-patterns.cc (vect_recog_divmod_pattern): Recognize > > pattern. > > > > > > > > gcc/testsuite/ChangeLog: > > > > > > > > * gcc.dg/vect/vect-div-bitmask-1.c: New test. > > > > * gcc.dg/vect/vect-div-bitmask-2.c: New test. > > > > * gcc.dg/vect/vect-div-bitmask-3.c: New test. > > > > * gcc.dg/vect/vect-div-bitmask.h: New file. > > > > > > > > --- inline copy of patch -- > > > > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi index > > > > > > f3619c505c025f158c2bc64756531877378b22e1..784c49d7d24cef7619e4d613f7 > > > > b4f6e945866c38 100644 > > > > --- a/gcc/doc/md.texi > > > > +++ b/gcc/doc/md.texi > > > > @@ -5588,6 +5588,18 @@ signed op0, op1; > > > > op0 = op1 / (1 << imm); > > > > @end smallexample > > > > > > > > +@cindex @code{udiv_pow2_bitmask@var{m2}} instruction pattern > > @item > > > > +@samp{udiv_pow2_bitmask@var{m2}} @cindex > > > > +@code{udiv_pow2_bitmask@var{m2}} instruction pattern @itemx > > > > +@samp{udiv_pow2_bitmask@var{m2}} Unsigned vector division by an > > > > +immediate that is equivalent to > > > > +@samp{2^(bitsize(m) / 2) - 1}. > > > > +@smallexample > > > > +unsigned short op0; op1; > > > > +@dots{} > > > > +op0 = op1 / 0xffU; > > > > +@end smallexample > > > > + > > > > @cindex @code{vec_shl_insert_@var{m}} instruction pattern @item > > > > @samp{vec_shl_insert_@var{m}} Shift the elements in vector input > > > > operand 1 left one element (i.e.@: > > > > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def index > > > > > > d2d550d358606022b1cb44fa842f06e0be507bc3..a3e3cc1520f77683ebf6256898 > > > > f916ed45de475f 100644 > > > > --- a/gcc/internal-fn.def > > > > +++ b/gcc/internal-fn.def > > > > @@ -159,6 +159,8 @@ DEF_INTERNAL_OPTAB_FN (VEC_SHL_INSERT, > > ECF_CONST | ECF_NOTHROW, > > > > vec_shl_insert, binary) > > > > > > > > DEF_INTERNAL_OPTAB_FN (DIV_POW2, ECF_CONST | ECF_NOTHROW, > > > > sdiv_pow2, binary) > > > > +DEF_INTERNAL_OPTAB_FN (DIV_POW2_BITMASK, ECF_CONST | > > ECF_NOTHROW, > > > > + udiv_pow2_bitmask, unary) > > > > > > > > DEF_INTERNAL_OPTAB_FN (FMS, ECF_CONST, fms, ternary) > > > > DEF_INTERNAL_OPTAB_FN (FNMA, ECF_CONST, fnma, ternary) diff --git > > > > a/gcc/optabs.def b/gcc/optabs.def index > > > > > > 801310ebaa7d469520809bb7efed6820f8eb866b..3f0ac05ef5ad5aed8d6ca391f > > 4 > > > > eed71b0494e17f 100644 > > > > --- a/gcc/optabs.def > > > > +++ b/gcc/optabs.def > > > > @@ -372,6 +372,7 @@ OPTAB_D (smulhrs_optab, "smulhrs$a3") > > OPTAB_D > > > > (umulhs_optab, "umulhs$a3") OPTAB_D (umulhrs_optab, "umulhrs$a3") > > > > OPTAB_D (sdiv_pow2_optab, "sdiv_pow2$a3") > > > > +OPTAB_D (udiv_pow2_bitmask_optab, "udiv_pow2_bitmask$a2") > > > > OPTAB_D (vec_pack_sfix_trunc_optab, "vec_pack_sfix_trunc_$a") > > > > OPTAB_D (vec_pack_ssat_optab, "vec_pack_ssat_$a") OPTAB_D > > > > (vec_pack_trunc_optab, "vec_pack_trunc_$a") diff --git > > > > a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-1.c > > > > b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-1.c > > > > new file mode 100644 > > > > index > > > > > > 0000000000000000000000000000000000000000..a7ea3cce4764239c5d281a8f0b > > > > ead1f6a452de3f > > > > --- /dev/null > > > > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-1.c > > > > @@ -0,0 +1,25 @@ > > > > +/* { dg-require-effective-target vect_int } */ > > > > + > > > > +#include > > > > +#include "tree-vect.h" > > > > + > > > > +#define N 50 > > > > +#define TYPE uint8_t > > > > + > > > > +__attribute__((noipa, noinline, optimize("O1"))) void fun1(TYPE* > > > > +restrict pixel, TYPE level, int n) { > > > > + for (int i = 0; i < n; i+=1) > > > > + pixel[i] = (pixel[i] * level) / 0xff; } > > > > + > > > > +__attribute__((noipa, noinline, optimize("O3"))) void fun2(TYPE* > > > > +restrict pixel, TYPE level, int n) { > > > > + for (int i = 0; i < n; i+=1) > > > > + pixel[i] = (pixel[i] * level) / 0xff; } > > > > + > > > > +#include "vect-div-bitmask.h" > > > > + > > > > +/* { dg-final { scan-tree-dump "vect_recog_divmod_pattern: > > > > +detected" "vect" } } */ > > > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-2.c > > > > b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-2.c > > > > new file mode 100644 > > > > index > > > > > > 0000000000000000000000000000000000000000..009e16e1b36497e5724410d98 > > 4 > > > > 3f1ce122b26dda > > > > --- /dev/null > > > > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-2.c > > > > @@ -0,0 +1,25 @@ > > > > +/* { dg-require-effective-target vect_int } */ > > > > + > > > > +#include > > > > +#include "tree-vect.h" > > > > + > > > > +#define N 50 > > > > +#define TYPE uint16_t > > > > + > > > > +__attribute__((noipa, noinline, optimize("O1"))) void fun1(TYPE* > > > > +restrict pixel, TYPE level, int n) { > > > > + for (int i = 0; i < n; i+=1) > > > > + pixel[i] = (pixel[i] * level) / 0xffffU; } > > > > + > > > > +__attribute__((noipa, noinline, optimize("O3"))) void fun2(TYPE* > > > > +restrict pixel, TYPE level, int n) { > > > > + for (int i = 0; i < n; i+=1) > > > > + pixel[i] = (pixel[i] * level) / 0xffffU; } > > > > + > > > > +#include "vect-div-bitmask.h" > > > > + > > > > +/* { dg-final { scan-tree-dump "vect_recog_divmod_pattern: > > > > +detected" "vect" } } */ > > > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-3.c > > > > b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-3.c > > > > new file mode 100644 > > > > index > > > > > > 0000000000000000000000000000000000000000..bf35a0bda8333c418e692d942 > > 2 > > > > 0df849cc47930b > > > > --- /dev/null > > > > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-3.c > > > > @@ -0,0 +1,26 @@ > > > > +/* { dg-require-effective-target vect_int } */ > > > > +/* { dg-additional-options "-fno-vect-cost-model" { target > > > > +aarch64*-*-* } } */ > > > > + > > > > +#include > > > > +#include "tree-vect.h" > > > > + > > > > +#define N 50 > > > > +#define TYPE uint32_t > > > > + > > > > +__attribute__((noipa, noinline, optimize("O1"))) void fun1(TYPE* > > > > +restrict pixel, TYPE level, int n) { > > > > + for (int i = 0; i < n; i+=1) > > > > + pixel[i] = (pixel[i] * (uint64_t)level) / 0xffffffffUL; } > > > > + > > > > +__attribute__((noipa, noinline, optimize("O3"))) void fun2(TYPE* > > > > +restrict pixel, TYPE level, int n) { > > > > + for (int i = 0; i < n; i+=1) > > > > + pixel[i] = (pixel[i] * (uint64_t)level) / 0xffffffffUL; } > > > > + > > > > +#include "vect-div-bitmask.h" > > > > + > > > > +/* { dg-final { scan-tree-dump "vect_recog_divmod_pattern: > > > > +detected" "vect" } } */ > > > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask.h > > > > b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask.h > > > > new file mode 100644 > > > > index > > > > > > 0000000000000000000000000000000000000000..29a16739aa4b706616367bfd1 > > 8 > > > > 32f28ebd07993e > > > > --- /dev/null > > > > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask.h > > > > @@ -0,0 +1,43 @@ > > > > +#include > > > > + > > > > +#ifndef N > > > > +#define N 65 > > > > +#endif > > > > + > > > > +#ifndef TYPE > > > > +#define TYPE uint32_t > > > > +#endif > > > > + > > > > +#ifndef DEBUG > > > > +#define DEBUG 0 > > > > +#endif > > > > + > > > > +#define BASE ((TYPE) -1 < 0 ? -126 : 4) > > > > + > > > > +int main () > > > > +{ > > > > + TYPE a[N]; > > > > + TYPE b[N]; > > > > + > > > > + for (int i = 0; i < N; ++i) > > > > + { > > > > + a[i] = BASE + i * 13; > > > > + b[i] = BASE + i * 13; > > > > + if (DEBUG) > > > > + printf ("%d: 0x%x\n", i, a[i]); > > > > + } > > > > + > > > > + fun1 (a, N / 2, N); > > > > + fun2 (b, N / 2, N); > > > > + > > > > + for (int i = 0; i < N; ++i) > > > > + { > > > > + if (DEBUG) > > > > + printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]); > > > > + > > > > + if (a[i] != b[i]) > > > > + __builtin_abort (); > > > > + } > > > > + return 0; > > > > +} > > > > + > > > > diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc > > > > index > > > > > > 217bdfd7045a22578a35bb891a4318d741071872..a738558cb8d12296bff462d71 > > 6 > > > > 310ca8d82957b5 100644 > > > > --- a/gcc/tree-vect-patterns.cc > > > > +++ b/gcc/tree-vect-patterns.cc > > > > @@ -3558,6 +3558,33 @@ vect_recog_divmod_pattern (vec_info *vinfo, > > > > > > > > return pattern_stmt; > > > > } > > > > + else if ((TYPE_UNSIGNED (itype) || tree_int_cst_sgn (oprnd1) != 1) > > > > + && rhs_code != TRUNC_MOD_EXPR) > > > > + { > > > > + wide_int icst = wi::to_wide (oprnd1); > > > > + wide_int val = wi::add (icst, 1); > > > > + int pow = wi::exact_log2 (val); > > > > + if (pow == (prec / 2)) > > > > + { > > > > + /* Pattern detected. */ > > > > + vect_pattern_detected ("vect_recog_divmod_pattern", last_stmt); > > > > + > > > > + *type_out = vectype; > > > > + > > > > + /* Check if the target supports this internal function. */ > > > > + internal_fn ifn = IFN_DIV_POW2_BITMASK; > > > > + if (direct_internal_fn_supported_p (ifn, vectype, > > OPTIMIZE_FOR_SPEED)) > > > > + { > > > > + tree var_div = vect_recog_temp_ssa_var (itype, NULL); > > > > + gimple *div_stmt = gimple_build_call_internal (ifn, 1, oprnd0); > > > > + gimple_call_set_lhs (div_stmt, var_div); > > > > + > > > > + gimple_set_location (div_stmt, gimple_location (last_stmt)); > > > > + > > > > + return div_stmt; > > > > + } > > > > + } > > > > + } > > > > > > > > if (prec > HOST_BITS_PER_WIDE_INT > > > > || integer_zerop (oprnd1)) > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > Richard Biener > > SUSE Software Solutions Germany GmbH, Frankenstraße 146, 90461 > > Nuernberg, Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, > > Boudien Moerman; HRB 36809 (AG Nuernberg) > -- Richard Biener SUSE Software Solutions Germany GmbH, Frankenstraße 146, 90461 Nuernberg, Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman; HRB 36809 (AG Nuernberg)