From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 14 Jun 2022 15:18:38 +0200 (CEST)
From: Richard Biener
To: Tamar Christina
Cc: "gcc-patches@gcc.gnu.org", nd, Richard Sandiford
Subject: RE: [PATCH 1/2]middle-end Support optimized division by pow2 bitmask
Message-ID: <2p382n54-427o-8q82-6o45-p2nn6869opr5@fhfr.qr>

On Mon, 13 Jun 2022, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener
> > Sent: Monday, June 13, 2022 12:48 PM
> > To: Tamar Christina
> > Cc: gcc-patches@gcc.gnu.org; nd; Richard Sandiford
> > Subject: RE: [PATCH 1/2]middle-end Support optimized division by
> > pow2 bitmask
> >
> > On Mon, 13 Jun 2022, Tamar Christina wrote:
> >
> > > > -----Original Message-----
> > > > From: Richard Biener
> > > > Sent: Monday, June 13, 2022 10:39 AM
> > > > To: Tamar Christina
> > > > Cc: gcc-patches@gcc.gnu.org; nd; Richard Sandiford
> > > > Subject: Re: [PATCH 1/2]middle-end Support optimized division by
> > > > pow2 bitmask
> > > >
> > > > On Mon, 13 Jun 2022, Richard Biener wrote:
> > > >
> > > > > On Thu, 9 Jun 2022, Tamar Christina wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > In plenty of image and video processing code it's common to
> > > > > > modify pixel values by a widening operation and then scale them
> > > > > > back into range by dividing by 255.
> > > > > >
> > > > > > This patch adds an optab to allow us to emit an optimized
> > > > > > sequence when doing an unsigned division that is equivalent to:
> > > > > >
> > > > > >   x = y / (2 ^ (bitsize (y) / 2) - 1)
> > > > > >
> > > > > > Bootstrapped and regtested on aarch64-none-linux-gnu and
> > > > > > x86_64-pc-linux-gnu with no issues.
> > > > > >
> > > > > > Ok for master?
> > > > >
> > > > > Looking at 2/2 it seems that this is the wrong way to attack the
> > > > > problem.  The ISA doesn't have such an instruction, so adding an
> > > > > optab looks premature.  I suppose that there's no unsigned vector
> > > > > integer division and thus we open-code that in a different way?
> > > > > Isn't the correct thing then to fix up that open-coding if it is
> > > > > more efficient?
> > >
> > > The problem is that even if you fix up the open-coding it would need
> > > to be something target-specific?  The sequence of instructions we
> > > generate doesn't have a GIMPLE representation, so whatever is
> > > generated I'd have to fix up in RTL then.
> >
> > What's the operation that doesn't have a GIMPLE representation?
>
> For NEON we use two operations:
> 1. Add high narrowing lowpart, essentially doing (a +w b) >>.n bitsize(a)/2,
>    where the + widens and the >> narrows.  So you give it two shorts and
>    get a byte.
> 2. Add widening add of lowpart, so basically lowpart (a +w b).
>
> For SVE2 we use a different sequence: two back-to-back sequences of:
> 1. Add narrow high part (bottom).  In SVE the Top and Bottom instructions
>    select even and odd elements of the vector rather than "top half" and
>    "bottom half".
>
>    So this instruction does: add each vector element of the first source
>    vector to the corresponding vector element of the second source vector,
>    and place the most significant half of the result in the even-numbered
>    half-width destination elements, while setting the odd-numbered
>    elements to zero.
>
> So there's an explicit permute in there.  The instructions are
> sufficiently different that there wouldn't be a single GIMPLE
> representation.

I see.  Are these also useful to express scalar integer division?

I'll defer to others to ack the special udiv_pow2_bitmask optab or suggest
some piecemeal things other targets might be able to do as well.  It does
look very special.  I'd also bikeshed it to udiv_pow2m1 since 'bitmask' is
less obvious than 2^n-1 (assuming I interpreted 'bitmask' correctly ;)).
It seems to be even less general since it is a unary op and the actual
divisor is constrained by the mode itself?

Richard.

> > I think for costing you could resort to the *_cost functions as used by
> > synth_mult and friends.
>
> > > The problem with this is that it seemed fragile.
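[A scalar sketch of the identity behind the addhn/uaddw sequence described above, for the 16-bit /255 case — illustrative C only, not code from the patch; the helper name is invented:]

```c
#include <stdint.h>

/* x / 255 for a 16-bit x, with no multiply or divide:
     t = (x + 257) >> 8    -- the addhn step (add the 0x101 constant,
                              keep the high half, narrowed)
     q = (x + t) >> 8      -- the uaddw step plus the final shift.
   Note the hardware addhn adds modulo 2^16, so the vector sequence
   needs x + 257 not to wrap, i.e. x <= 0xFEFE; widened products of
   two bytes (at most 255 * 255 = 65025) always satisfy that.  */
static uint16_t div255_addhn (uint16_t x)
{
  uint16_t t = (uint16_t) ((x + 257u) >> 8);  /* matches the 0x101 const_vector */
  return (uint16_t) ((x + t) >> 8);
}
```

In full-width arithmetic the identity holds for every 16-bit input, which an exhaustive check against `x / 255` confirms.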
> > > We generate from the Vectorizer:
> > >
> > >   vect__3.8_35 = MEM [(uint8_t *)_21];
> > >   vect_patt_28.9_37 = WIDEN_MULT_LO_EXPR <vect__3.8_35, vect_cst__36>;
> > >   vect_patt_28.9_38 = WIDEN_MULT_HI_EXPR <vect__3.8_35, vect_cst__36>;
> > >   vect_patt_19.10_40 = vect_patt_28.9_37 h* { 32897, 32897, 32897, 32897, 32897, 32897, 32897, 32897 };
> > >   vect_patt_19.10_41 = vect_patt_28.9_38 h* { 32897, 32897, 32897, 32897, 32897, 32897, 32897, 32897 };
> > >   vect_patt_25.11_42 = vect_patt_19.10_40 >> 7;
> > >   vect_patt_25.11_43 = vect_patt_19.10_41 >> 7;
> > >   vect_patt_11.12_44 = VEC_PACK_TRUNC_EXPR <vect_patt_25.11_42, vect_patt_25.11_43>;
> > >
> > > and if the magic constants change then we miss the optimization.  I
> > > could rewrite the open coding to use shifts alone, but that might be a
> > > regression for some uarches, I would imagine.
> >
> > OK, so you do have a highpart multiply.  I suppose the pattern is too
> > deep to be recognized by combine?  What's the RTL, good vs. bad, before
> > combine for one of the expressions?
>
> Yeah, combine only tries 2-3 instructions, but to use these sequences we
> have to match the entire chain, as the instructions do the narrowing
> themselves.
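[For comparison, the generic open-coded route in the dump above — a highpart multiply by 32897 (0x8081) followed by a right shift of 7 — can be sketched in scalar C; illustrative only, and the helper name is invented:]

```c
#include <stdint.h>

/* The vectorizer's open-coding of x / 255 on 16-bit lanes:
   `h* 32897` keeps bits [31:16] of x * 32897 (the highpart multiply),
   and the extra >> 7 completes the division, since
   32897 / 2^23 is just over 1/255.  */
static uint16_t div255_mulhi (uint16_t x)
{
  uint16_t hi = (uint16_t) (((uint32_t) x * 32897u) >> 16);  /* h* 32897 */
  return (uint16_t) (hi >> 7);
}
```

The magic constant 32897 is ceil(2^23 / 255); the result matches `x / 255` for every 16-bit input.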
> So the RTL for the bad case before combine is
>
> (insn 39 37 42 4 (set (reg:V4SI 119)
>         (mult:V4SI (zero_extend:V4SI (vec_select:V4HI (reg:V8HI 116 [ vect_patt_28.9D.3754 ])
>                     (parallel:V8HI [
>                             (const_int 4 [0x4])
>                             (const_int 5 [0x5])
>                             (const_int 6 [0x6])
>                             (const_int 7 [0x7])
>                         ])))
>             (zero_extend:V4SI (vec_select:V4HI (reg:V8HI 118)
>                     (parallel:V8HI [
>                             (const_int 4 [0x4])
>                             (const_int 5 [0x5])
>                             (const_int 6 [0x6])
>                             (const_int 7 [0x7])
>                         ]))))) "/app/example.c":6:14 2114 {aarch64_simd_vec_umult_hi_v8hi}
>      (expr_list:REG_DEAD (reg:V8HI 116 [ vect_patt_28.9D.3754 ])
>         (expr_list:REG_EQUAL (mult:V4SI (zero_extend:V4SI (vec_select:V4HI (reg:V8HI 116 [ vect_patt_28.9D.3754 ])
>                         (parallel:V8HI [
>                                 (const_int 4 [0x4])
>                                 (const_int 5 [0x5])
>                                 (const_int 6 [0x6])
>                                 (const_int 7 [0x7])
>                             ])))
>                 (const_vector:V4SI [
>                         (const_int 32897 [0x8081]) repeated x4
>                     ]))
>             (nil))))
> (insn 42 39 43 4 (set (reg:V8HI 121 [ vect_patt_19.10D.3755 ])
>         (unspec:V8HI [
>                 (subreg:V8HI (reg:V4SI 117) 0)
>                 (subreg:V8HI (reg:V4SI 119) 0)
>             ] UNSPEC_UZP2)) "/app/example.c":6:14 4096 {aarch64_uzp2v8hi}
>      (expr_list:REG_DEAD (reg:V4SI 119)
>         (expr_list:REG_DEAD (reg:V4SI 117)
>             (nil))))
> (insn 43 42 44 4 (set (reg:V8HI 124 [ vect_patt_25.11D.3756 ])
>         (lshiftrt:V8HI (reg:V8HI 121 [ vect_patt_19.10D.3755 ])
>             (const_vector:V8HI [
>                     (const_int 7 [0x7]) repeated x8
>                 ]))) "/app/example.c":6:14 1803 {aarch64_simd_lshrv8hi}
>      (expr_list:REG_DEAD (reg:V8HI 121 [ vect_patt_19.10D.3755 ])
>         (nil)))
> (insn 44 43 46 4 (set (reg:V8HI 125 [ vect_patt_28.9D.3754 ])
>         (mult:V8HI (zero_extend:V8HI (vec_select:V8QI (reg:V16QI 115 [ MEM [(uint8_tD.3704 *)_21 clique 1 base 1] ])
>                     (parallel:V16QI [
>                             (const_int 8 [0x8])
>                             (const_int 9 [0x9])
>                             (const_int 10 [0xa])
>                             (const_int 11 [0xb])
>                             (const_int 12 [0xc])
>                             (const_int 13 [0xd])
>                             (const_int 14 [0xe])
>                             (const_int 15 [0xf])
>                         ])))
>             (zero_extend:V8HI (vec_select:V8QI (reg:V16QI 100 [ vect_cst__36 ])
>                     (parallel:V16QI [
>                             (const_int 8 [0x8])
>                             (const_int 9 [0x9])
>                             (const_int 10 [0xa])
>                             (const_int 11 [0xb])
>                             (const_int 12 [0xc])
>                             (const_int 13 [0xd])
>                             (const_int 14 [0xe])
>                             (const_int 15 [0xf])
>                         ]))))) "/app/example.c":6:14 2112 {aarch64_simd_vec_umult_hi_v16qi}
>      (expr_list:REG_DEAD (reg:V16QI 115 [ MEM [(uint8_tD.3704 *)_21 clique 1 base 1] ])
>         (nil)))
> (insn 46 44 48 4 (set (reg:V4SI 126)
>         (mult:V4SI (zero_extend:V4SI (subreg:V4HI (reg:V8HI 125 [ vect_patt_28.9D.3754 ]) 0))
>             (zero_extend:V4SI (subreg:V4HI (reg:V8HI 118) 0)))) "/app/example.c":6:14 2108 {aarch64_intrinsic_vec_umult_lo_v4hi}
>      (expr_list:REG_EQUAL (mult:V4SI (zero_extend:V4SI (subreg:V4HI (reg:V8HI 125 [ vect_patt_28.9D.3754 ]) 0))
>                 (const_vector:V4SI [
>                         (const_int 32897 [0x8081]) repeated x4
>                     ]))
>             (nil)))
> (insn 48 46 51 4 (set (reg:V4SI 128)
>         (mult:V4SI (zero_extend:V4SI (vec_select:V4HI (reg:V8HI 125 [ vect_patt_28.9D.3754 ])
>                     (parallel:V8HI [
>                             (const_int 4 [0x4])
>                             (const_int 5 [0x5])
>                             (const_int 6 [0x6])
>                             (const_int 7 [0x7])
>                         ])))
>             (zero_extend:V4SI (vec_select:V4HI (reg:V8HI 118)
>                     (parallel:V8HI [
>                             (const_int 4 [0x4])
>                             (const_int 5 [0x5])
>                             (const_int 6 [0x6])
>                             (const_int 7 [0x7])
>                         ]))))) "/app/example.c":6:14 2114 {aarch64_simd_vec_umult_hi_v8hi}
>      (expr_list:REG_DEAD (reg:V8HI 125 [ vect_patt_28.9D.3754 ])
>         (expr_list:REG_EQUAL (mult:V4SI (zero_extend:V4SI (vec_select:V4HI (reg:V8HI 125 [ vect_patt_28.9D.3754 ])
>                         (parallel:V8HI [
>                                 (const_int 4 [0x4])
>                                 (const_int 5 [0x5])
>                                 (const_int 6 [0x6])
>                                 (const_int 7 [0x7])
>                             ])))
>                 (const_vector:V4SI [
>                         (const_int 32897 [0x8081]) repeated x4
>                     ]))
>             (nil))))
> (insn 51 48 52 4 (set (reg:V8HI 130 [ vect_patt_19.10D.3755 ])
>         (unspec:V8HI [
>                 (subreg:V8HI (reg:V4SI 126) 0)
>                 (subreg:V8HI (reg:V4SI 128) 0)
>             ] UNSPEC_UZP2)) "/app/example.c":6:14 4096 {aarch64_uzp2v8hi}
>      (expr_list:REG_DEAD (reg:V4SI 128)
>         (expr_list:REG_DEAD (reg:V4SI 126)
>             (nil))))
> (insn 52 51 53 4 (set (reg:V8HI 133 [ vect_patt_25.11D.3756 ])
>         (lshiftrt:V8HI (reg:V8HI 130 [ vect_patt_19.10D.3755 ])
>             (const_vector:V8HI [
>                     (const_int 7 [0x7]) repeated x8
>                 ]))) "/app/example.c":6:14 1803 {aarch64_simd_lshrv8hi}
>      (expr_list:REG_DEAD (reg:V8HI 130 [ vect_patt_19.10D.3755 ])
>         (nil)))
>
> And for good:
>
> (insn 32 30 34 4 (set (reg:V16QI 118)
>         (vec_concat:V16QI (unspec:V8QI [
>                     (reg:V8HI 114 [ vect_patt_28.9 ])
>                     (reg:V8HI 115)
>                 ] UNSPEC_ADDHN)
>             (const_vector:V8QI [
>                     (const_int 0 [0]) repeated x8
>                 ]))) "draw.c":6:35 2688 {aarch64_addhnv8hi_insn_le}
>      (expr_list:REG_EQUAL (vec_concat:V16QI (unspec:V8QI [
>                         (reg:V8HI 114 [ vect_patt_28.9 ])
>                         (const_vector:V8HI [
>                                 (const_int 257 [0x101]) repeated x8
>                             ])
>                     ] UNSPEC_ADDHN)
>                 (const_vector:V8QI [
>                         (const_int 0 [0]) repeated x8
>                     ]))
>         (nil)))
> (insn 34 32 35 4 (set (reg:V8HI 117)
>         (plus:V8HI (zero_extend:V8HI (subreg:V8QI (reg:V16QI 118) 0))
>             (reg:V8HI 114 [ vect_patt_28.9 ]))) "draw.c":6:35 2635 {aarch64_uaddwv8qi}
>      (expr_list:REG_DEAD (reg:V16QI 118)
>         (expr_list:REG_DEAD (reg:V8HI 114 [ vect_patt_28.9 ])
>             (nil))))
> (insn 35 34 37 4 (set (reg:V8HI 103 [ vect_patt_25.10 ])
>         (lshiftrt:V8HI (reg:V8HI 117)
>             (const_vector:V8HI [
>                     (const_int 8 [0x8]) repeated x8
>                 ]))) "draw.c":6:35 1741 {aarch64_simd_lshrv8hi}
>      (expr_list:REG_DEAD (reg:V8HI 117)
>         (nil)))
> (insn 37 35 39 4 (set (reg:V16QI 122)
>         (vec_concat:V16QI (unspec:V8QI [
>                     (reg:V8HI 102 [ vect_patt_28.9 ])
>                     (reg:V8HI 115)
>                 ] UNSPEC_ADDHN)
>             (const_vector:V8QI [
>                     (const_int 0 [0]) repeated x8
>                 ]))) "draw.c":6:35 2688 {aarch64_addhnv8hi_insn_le}
>      (expr_list:REG_EQUAL (vec_concat:V16QI (unspec:V8QI [
>                         (reg:V8HI 102 [ vect_patt_28.9 ])
>                         (const_vector:V8HI [
>                                 (const_int 257 [0x101]) repeated x8
>                             ])
>                     ] UNSPEC_ADDHN)
>                 (const_vector:V8QI [
>                         (const_int 0 [0]) repeated x8
>                     ]))
>         (nil)))
> (insn 39 37 40 4 (set (reg:V8HI 121)
>         (plus:V8HI (zero_extend:V8HI (subreg:V8QI (reg:V16QI 122) 0))
>             (reg:V8HI 102 [ vect_patt_28.9 ]))) "draw.c":6:35 2635 {aarch64_uaddwv8qi}
>      (expr_list:REG_DEAD (reg:V16QI 122)
>         (expr_list:REG_DEAD (reg:V8HI 102 [ vect_patt_28.9 ])
>             (nil))))
> (insn 40 39 41 4 (set (reg:V8HI 104 [ vect_patt_25.10 ])
>         (lshiftrt:V8HI (reg:V8HI 121)
>             (const_vector:V8HI [
>                     (const_int 8 [0x8]) repeated x8
>                 ]))) "draw.c":6:35 1741 {aarch64_simd_lshrv8hi}
>
> Cheers,
> Tamar
>
> > > > Btw, on x86 we use
> > > >
> > > > t.c:3:21: note: replacing earlier pattern patt_25 = patt_28 / 255;
> > > > t.c:3:21: note: with patt_25 = patt_19 >> 7;
> > > > t.c:3:21: note: extra pattern stmt: patt_19 = patt_28 h* 32897;
> > > >
> > > > which translates to
> > > >
> > > >   vpmulhuw %ymm4, %ymm0, %ymm0
> > > >   vpmulhuw %ymm4, %ymm1, %ymm1
> > > >   vpsrlw $7, %ymm0, %ymm0
> > > >   vpsrlw $7, %ymm1, %ymm1
> > > >
> > > > there's odd
> > > >
> > > >   vpand %ymm0, %ymm3, %ymm0
> > > >   vpand %ymm1, %ymm3, %ymm1
> > > >
> > > > before (%ymm3 is all 0x00ff)
> > > >
> > > >   vpackuswb %ymm1, %ymm0, %ymm0
> > > >
> > > > that's not visible in GIMPLE.  I guess aarch64 lacks a highpart
> > > > multiply here?  In any case, it seems that generic division
> > > > expansion could be improved here?  (choose_multiplier?)
> > >
> > > We do generate multiply highpart here, but the patch avoids multiplies
> > > and shifts entirely by creative use of the ISA.  Another reason I went
> > > for an optab is costing: the chosen operations are significantly
> > > cheaper on all Arm uarches than shifts and multiplies.
> > >
> > > This means we get vectorization in some cases where the cost model
> > > would correctly say it's too expensive to vectorize, particularly
> > > around double precision.
> > >
> > > Thanks,
> > > Tamar
> > >
> > > > Richard.
> > > >
> > > > > Richard.
> > > > >
> > > > > > Thanks,
> > > > > > Tamar
> > > > > >
> > > > > > gcc/ChangeLog:
> > > > > >
> > > > > > 	* internal-fn.def (DIV_POW2_BITMASK): New.
> > > > > > 	* optabs.def (udiv_pow2_bitmask_optab): New.
> > > > > > 	* doc/md.texi: Document it.
> > > > > > 	* tree-vect-patterns.cc (vect_recog_divmod_pattern): Recognize pattern.
> > > > > >
> > > > > > gcc/testsuite/ChangeLog:
> > > > > >
> > > > > > 	* gcc.dg/vect/vect-div-bitmask-1.c: New test.
> > > > > > 	* gcc.dg/vect/vect-div-bitmask-2.c: New test.
> > > > > > 	* gcc.dg/vect/vect-div-bitmask-3.c: New test.
> > > > > > 	* gcc.dg/vect/vect-div-bitmask.h: New file.
> > > > > >
> > > > > > --- inline copy of patch --
> > > > > > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> > > > > > index f3619c505c025f158c2bc64756531877378b22e1..784c49d7d24cef7619e4d613f7b4f6e945866c38 100644
> > > > > > --- a/gcc/doc/md.texi
> > > > > > +++ b/gcc/doc/md.texi
> > > > > > @@ -5588,6 +5588,18 @@ signed op0, op1;
> > > > > >  op0 = op1 / (1 << imm);
> > > > > >  @end smallexample
> > > > > >
> > > > > > +@cindex @code{udiv_pow2_bitmask@var{m2}} instruction pattern
> > > > > > +@item @samp{udiv_pow2_bitmask@var{m2}}
> > > > > > +Unsigned vector division by an immediate that is equivalent to
> > > > > > +@samp{2^(bitsize(m) / 2) - 1}.
> > > > > > +@smallexample
> > > > > > +unsigned short op0, op1;
> > > > > > +@dots{}
> > > > > > +op0 = op1 / 0xffU;
> > > > > > +@end smallexample
> > > > > > +
> > > > > >  @cindex @code{vec_shl_insert_@var{m}} instruction pattern
> > > > > >  @item @samp{vec_shl_insert_@var{m}}
> > > > > >  Shift the elements in vector input operand 1 left one element (i.e.@:
> > > > > > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> > > > > > index d2d550d358606022b1cb44fa842f06e0be507bc3..a3e3cc1520f77683ebf6256898f916ed45de475f 100644
> > > > > > --- a/gcc/internal-fn.def
> > > > > > +++ b/gcc/internal-fn.def
> > > > > > @@ -159,6 +159,8 @@ DEF_INTERNAL_OPTAB_FN (VEC_SHL_INSERT, ECF_CONST | ECF_NOTHROW,
> > > > > >  		       vec_shl_insert, binary)
> > > > > >
> > > > > >  DEF_INTERNAL_OPTAB_FN (DIV_POW2, ECF_CONST | ECF_NOTHROW, sdiv_pow2, binary)
> > > > > > +DEF_INTERNAL_OPTAB_FN (DIV_POW2_BITMASK, ECF_CONST | ECF_NOTHROW,
> > > > > > +		       udiv_pow2_bitmask, unary)
> > > > > >
> > > > > >  DEF_INTERNAL_OPTAB_FN (FMS, ECF_CONST, fms, ternary)
> > > > > >  DEF_INTERNAL_OPTAB_FN (FNMA, ECF_CONST, fnma, ternary)
> > > > > > diff --git a/gcc/optabs.def b/gcc/optabs.def
> > > > > > index 801310ebaa7d469520809bb7efed6820f8eb866b..3f0ac05ef5ad5aed8d6ca391f4eed71b0494e17f 100644
> > > > > > --- a/gcc/optabs.def
> > > > > > +++ b/gcc/optabs.def
> > > > > > @@ -372,6 +372,7 @@ OPTAB_D (smulhrs_optab, "smulhrs$a3")
> > > > > >  OPTAB_D (umulhs_optab, "umulhs$a3")
> > > > > >  OPTAB_D (umulhrs_optab, "umulhrs$a3")
> > > > > >  OPTAB_D (sdiv_pow2_optab, "sdiv_pow2$a3")
> > > > > > +OPTAB_D (udiv_pow2_bitmask_optab, "udiv_pow2_bitmask$a2")
> > > > > >  OPTAB_D (vec_pack_sfix_trunc_optab, "vec_pack_sfix_trunc_$a")
> > > > > >  OPTAB_D (vec_pack_ssat_optab, "vec_pack_ssat_$a")
> > > > > >  OPTAB_D (vec_pack_trunc_optab, "vec_pack_trunc_$a")
> > > > > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-1.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-1.c
> > > > > > new file mode 100644
> > > > > > index 0000000000000000000000000000000000000000..a7ea3cce4764239c5d281a8f0bead1f6a452de3f
> > > > > > --- /dev/null
> > > > > > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-1.c
> > > > > > @@ -0,0 +1,25 @@
> > > > > > +/* { dg-require-effective-target vect_int } */
> > > > > > +
> > > > > > +#include <stdint.h>
> > > > > > +#include "tree-vect.h"
> > > > > > +
> > > > > > +#define N 50
> > > > > > +#define TYPE uint8_t
> > > > > > +
> > > > > > +__attribute__((noipa, noinline, optimize("O1"))) void
> > > > > > +fun1(TYPE* restrict pixel, TYPE level, int n)
> > > > > > +{
> > > > > > +  for (int i = 0; i < n; i+=1)
> > > > > > +    pixel[i] = (pixel[i] * level) / 0xff;
> > > > > > +}
> > > > > > +
> > > > > > +__attribute__((noipa, noinline, optimize("O3"))) void
> > > > > > +fun2(TYPE* restrict pixel, TYPE level, int n)
> > > > > > +{
> > > > > > +  for (int i = 0; i < n; i+=1)
> > > > > > +    pixel[i] = (pixel[i] * level) / 0xff;
> > > > > > +}
> > > > > > +
> > > > > > +#include "vect-div-bitmask.h"
> > > > > > +
> > > > > > +/* { dg-final { scan-tree-dump "vect_recog_divmod_pattern: detected" "vect" } } */
> > > > > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-2.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-2.c
> > > > > > new file mode 100644
> > > > > > index 0000000000000000000000000000000000000000..009e16e1b36497e5724410d9843f1ce122b26dda
> > > > > > --- /dev/null
> > > > > > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-2.c
> > > > > > @@ -0,0 +1,25 @@
> > > > > > +/* { dg-require-effective-target vect_int } */
> > > > > > +
> > > > > > +#include <stdint.h>
> > > > > > +#include "tree-vect.h"
> > > > > > +
> > > > > > +#define N 50
> > > > > > +#define TYPE uint16_t
> > > > > > +
> > > > > > +__attribute__((noipa, noinline, optimize("O1"))) void
> > > > > > +fun1(TYPE* restrict pixel, TYPE level, int n)
> > > > > > +{
> > > > > > +  for (int i = 0; i < n; i+=1)
> > > > > > +    pixel[i] = (pixel[i] * level) / 0xffffU;
> > > > > > +}
> > > > > > +
> > > > > > +__attribute__((noipa, noinline, optimize("O3"))) void
> > > > > > +fun2(TYPE* restrict pixel, TYPE level, int n)
> > > > > > +{
> > > > > > +  for (int i = 0; i < n; i+=1)
> > > > > > +    pixel[i] = (pixel[i] * level) / 0xffffU;
> > > > > > +}
> > > > > > +
> > > > > > +#include "vect-div-bitmask.h"
> > > > > > +
> > > > > > +/* { dg-final { scan-tree-dump "vect_recog_divmod_pattern: detected" "vect" } } */
> > > > > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-3.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-3.c
> > > > > > new file mode 100644
> > > > > > index 0000000000000000000000000000000000000000..bf35a0bda8333c418e692d9420df849cc47930b
> > > > > > --- /dev/null
> > > > > > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-3.c
> > > > > > @@ -0,0 +1,26 @@
> > > > > > +/* { dg-require-effective-target vect_int } */
> > > > > > +/* { dg-additional-options "-fno-vect-cost-model" { target aarch64*-*-* } } */
> > > > > > +
> > > > > > +#include <stdint.h>
> > > > > > +#include "tree-vect.h"
> > > > > > +
> > > > > > +#define N 50
> > > > > > +#define TYPE uint32_t
> > > > > > +
> > > > > > +__attribute__((noipa, noinline, optimize("O1"))) void
> > > > > > +fun1(TYPE* restrict pixel, TYPE level, int n)
> > > > > > +{
> > > > > > +  for (int i = 0; i < n; i+=1)
> > > > > > +    pixel[i] = (pixel[i] * (uint64_t)level) / 0xffffffffUL;
> > > > > > +}
> > > > > > +
> > > > > > +__attribute__((noipa, noinline, optimize("O3"))) void
> > > > > > +fun2(TYPE* restrict pixel, TYPE level, int n)
> > > > > > +{
> > > > > > +  for (int i = 0; i < n; i+=1)
> > > > > > +    pixel[i] = (pixel[i] * (uint64_t)level) / 0xffffffffUL;
> > > > > > +}
> > > > > > +
> > > > > > +#include "vect-div-bitmask.h"
> > > > > > +
> > > > > > +/* { dg-final { scan-tree-dump "vect_recog_divmod_pattern: detected" "vect" } } */
> > > > > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask.h b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask.h
> > > > > > new file mode 100644
> > > > > > index 0000000000000000000000000000000000000000..29a16739aa4b706616367bfd1832f28ebd07993e
> > > > > > --- /dev/null
> > > > > > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask.h
> > > > > > @@ -0,0 +1,43 @@
> > > > > > +#include <stdio.h>
> > > > > > +
> > > > > > +#ifndef N
> > > > > > +#define N 65
> > > > > > +#endif
> > > > > > +
> > > > > > +#ifndef TYPE
> > > > > > +#define TYPE uint32_t
> > > > > > +#endif
> > > > > > +
> > > > > > +#ifndef DEBUG
> > > > > > +#define DEBUG 0
> > > > > > +#endif
> > > > > > +
> > > > > > +#define BASE ((TYPE) -1 < 0 ? -126 : 4)
> > > > > > +
> > > > > > +int main ()
> > > > > > +{
> > > > > > +  TYPE a[N];
> > > > > > +  TYPE b[N];
> > > > > > +
> > > > > > +  for (int i = 0; i < N; ++i)
> > > > > > +    {
> > > > > > +      a[i] = BASE + i * 13;
> > > > > > +      b[i] = BASE + i * 13;
> > > > > > +      if (DEBUG)
> > > > > > +        printf ("%d: 0x%x\n", i, a[i]);
> > > > > > +    }
> > > > > > +
> > > > > > +  fun1 (a, N / 2, N);
> > > > > > +  fun2 (b, N / 2, N);
> > > > > > +
> > > > > > +  for (int i = 0; i < N; ++i)
> > > > > > +    {
> > > > > > +      if (DEBUG)
> > > > > > +        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
> > > > > > +
> > > > > > +      if (a[i] != b[i])
> > > > > > +        __builtin_abort ();
> > > > > > +    }
> > > > > > +  return 0;
> > > > > > +}
> > > > > > +
> > > > > > diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> > > > > > index 217bdfd7045a22578a35bb891a4318d741071872..a738558cb8d12296bff462d716310ca8d82957b5 100644
> > > > > > --- a/gcc/tree-vect-patterns.cc
> > > > > > +++ b/gcc/tree-vect-patterns.cc
> > > > > > @@ -3558,6 +3558,33 @@ vect_recog_divmod_pattern (vec_info *vinfo,
> > > > > >
> > > > > >        return pattern_stmt;
> > > > > >      }
> > > > > > +  else if ((TYPE_UNSIGNED (itype) || tree_int_cst_sgn (oprnd1) != 1)
> > > > > > +	   && rhs_code != TRUNC_MOD_EXPR)
> > > > > > +    {
> > > > > > +      wide_int icst = wi::to_wide (oprnd1);
> > > > > > +      wide_int val = wi::add (icst, 1);
> > > > > > +      int pow = wi::exact_log2 (val);
> > > > > > +      if (pow == (prec / 2))
> > > > > > +	{
> > > > > > +	  /* Pattern detected.  */
> > > > > > +	  vect_pattern_detected ("vect_recog_divmod_pattern", last_stmt);
> > > > > > +
> > > > > > +	  *type_out = vectype;
> > > > > > +
> > > > > > +	  /* Check if the target supports this internal function.  */
> > > > > > +	  internal_fn ifn = IFN_DIV_POW2_BITMASK;
> > > > > > +	  if (direct_internal_fn_supported_p (ifn, vectype, OPTIMIZE_FOR_SPEED))
> > > > > > +	    {
> > > > > > +	      tree var_div = vect_recog_temp_ssa_var (itype, NULL);
> > > > > > +	      gimple *div_stmt = gimple_build_call_internal (ifn, 1, oprnd0);
> > > > > > +	      gimple_call_set_lhs (div_stmt, var_div);
> > > > > > +
> > > > > > +	      gimple_set_location (div_stmt, gimple_location (last_stmt));
> > > > > > +
> > > > > > +	      return div_stmt;
> > > > > > +	    }
> > > > > > +	}
> > > > > > +    }
> > > > > >
> > > > > >    if (prec > HOST_BITS_PER_WIDE_INT
> > > > > >        || integer_zerop (oprnd1))
> > > >
> > > > --
> > > > Richard Biener
> > > > SUSE Software Solutions Germany GmbH, Frankenstraße 146, 90461
> > > > Nuernberg, Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald,
> > > > Boudien Moerman; HRB 36809 (AG Nuernberg)
>

-- 
Richard Biener
SUSE Software Solutions Germany GmbH, Frankenstraße 146, 90461
Nuernberg, Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald,
Boudien Moerman; HRB 36809 (AG Nuernberg)