From: Tamar Christina <Tamar.Christina@arm.com>
To: Richard Biener <rguenther@suse.de>
Cc: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,
	nd <nd@arm.com>, Richard Sandiford <Richard.Sandiford@arm.com>
Subject: RE: [PATCH 1/2]middle-end Support optimized division by pow2 bitmask
Date: Tue, 14 Jun 2022 13:38:41 +0000
Message-ID: <VI1PR08MB53258D935EF20C3BE1E47B4AFFAA9@VI1PR08MB5325.eurprd08.prod.outlook.com>
In-Reply-To: <2p382n54-427o-8q82-6o45-p2nn6869opr5@fhfr.qr>

> -----Original Message-----
> From: Richard Biener <rguenther@suse.de>
> Sent: Tuesday, June 14, 2022 2:19 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Sandiford
> <Richard.Sandiford@arm.com>
> Subject: RE: [PATCH 1/2]middle-end Support optimized division by pow2
> bitmask
> 
> On Mon, 13 Jun 2022, Tamar Christina wrote:
> 
> > > -----Original Message-----
> > > From: Richard Biener <rguenther@suse.de>
> > > Sent: Monday, June 13, 2022 12:48 PM
> > > To: Tamar Christina <Tamar.Christina@arm.com>
> > > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Sandiford
> > > <Richard.Sandiford@arm.com>
> > > Subject: RE: [PATCH 1/2]middle-end Support optimized division by
> > > pow2 bitmask
> > >
> > > On Mon, 13 Jun 2022, Tamar Christina wrote:
> > >
> > > > > -----Original Message-----
> > > > > From: Richard Biener <rguenther@suse.de>
> > > > > Sent: Monday, June 13, 2022 10:39 AM
> > > > > To: Tamar Christina <Tamar.Christina@arm.com>
> > > > > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Sandiford
> > > > > <Richard.Sandiford@arm.com>
> > > > > Subject: Re: [PATCH 1/2]middle-end Support optimized division by
> > > > > pow2 bitmask
> > > > >
> > > > > On Mon, 13 Jun 2022, Richard Biener wrote:
> > > > >
> > > > > > On Thu, 9 Jun 2022, Tamar Christina wrote:
> > > > > >
> > > > > > > Hi All,
> > > > > > >
> > > > > > > In plenty of image and video processing code it's common to
> > > > > > > modify pixel values by a widening operation and then scale them
> > > > > > > back into range by dividing by 255.
> > > > > > >
> > > > > > > This patch adds an optab to allow us to emit an optimized
> > > > > > > sequence when doing an unsigned division that is equivalent to:
> > > > > > >
> > > > > > >    x = y / (2 ^ (bitsize (y)/2) - 1)
> > > > > > >
> > > > > > > Bootstrapped and regtested on aarch64-none-linux-gnu and
> > > > > > > x86_64-pc-linux-gnu with no issues.
> > > > > > >
> > > > > > > Ok for master?
> > > > > >
> > > > > > Looking at 2/2 it seems that this is the wrong way to attack
> > > > > > the problem.  The ISA doesn't have such an instruction, so adding
> > > > > > an optab looks premature.  I suppose there's no unsigned vector
> > > > > > integer division and thus we open-code that in a different way?
> > > > > > Isn't the correct thing then to fix up that open-coding if it is
> > > > > > more efficient?
> > > > >
> > > >
> > > > The problem is that even if you fix up the open-coding it would
> > > > need to be something target specific?  The sequence of instructions
> > > > we generate doesn't have a GIMPLE representation.  So whatever is
> > > > generated I'd have to fix up in RTL then.
> > >
> > > What's the operation that doesn't have a GIMPLE representation?
> >
> > For NEON we use two operations:
> > 1. Add high narrowing lowpart, essentially doing (a +w b) >>.n bitsize(a)/2,
> >    where the + widens and the >> narrows.  So you give it two shorts and
> >    get back a byte.
> > 2. Add widening of the lowpart, so basically lowpart (a +w b).
> >
> > For SVE2 we use a different sequence: two back-to-back sequences of:
> > 1. Add narrow high part (bottom).  In SVE the Top and Bottom instructions
> >    select even and odd elements of the vector rather than "top half" and
> >    "bottom half".
> >
> >    So this instruction does: add each vector element of the first source
> >    vector to the corresponding vector element of the second source vector,
> >    and place the most significant half of the result in the even-numbered
> >    half-width destination elements, while setting the odd-numbered elements
> >    to zero.
> >
> > So there's an explicit permute in there.  The instructions are
> > sufficiently different that there wouldn't be a single GIMPLE
> > representation.
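
To illustrate, the NEON sequence above in ACLE intrinsics would be roughly the
following (an untested sketch for a single uint16x8_t vector x of widened
products; not necessarily the exact expansion the patch emits):

  #include <arm_neon.h>

  /* Divide each 16-bit lane of x by 255, assuming x came from an
     8-bit * 8-bit widening multiply (so x <= 65025 and the ADDHN
     addition below cannot wrap).  */
  static inline uint16x8_t div255_u16 (uint16x8_t x)
  {
    uint8x8_t  hi = vaddhn_u16 (x, vdupq_n_u16 (257)); /* ADDHN: (x + 257) >> 8, narrowed  */
    uint16x8_t y  = vaddw_u8 (x, hi);                  /* UADDW: x + widen (hi)            */
    return vshrq_n_u16 (y, 8);                         /* USHR:  final shift by 8          */
  }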
> 
> I see.  Are these also useful to express scalar integer division?

Hmm, not these exact instructions, as they only exist for vectors.  Scalar may
potentially benefit from rewriting this to (x + ((x + 257) >> 8)) >> 8, which
avoids the multiply with the magic constant.  But the problem here is that,
unless undone for vector, it would likely generate worse code than what we have
now if vectorized exactly like this on most ISAs.
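
For illustration, a scalar sketch of that rewrite (not part of the patch, and
assuming x fits in 16 bits so the two shifts implement x / 255 exactly):

  /* x / 255 without a multiply, for 0 <= x <= 0xffff.  */
  static inline unsigned div255 (unsigned x)
  {
    return (x + ((x + 257) >> 8)) >> 8;
  }

  /* Quick exhaustive check against the plain division:
     for (unsigned x = 0; x <= 0xffff; x++)
       if (div255 (x) != x / 255)
         __builtin_abort ();  */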

> 
> I'll defer to others to ack the special udiv_pow2_bitmask optab or suggest
> some piecemeal things other targets might be able to do as well.  It does
> look very special.  I'd also bikeshed it to udiv_pow2m1 since 'bitmask' is
> less obvious than 2^n-1 (assuming I interpreted 'bitmask' correctly ;)).
> It seems to be even less general since it is a unary op and the actual
> divisor is constrained by the mode itself?

I am happy to change the name, and quite happy to add the constant as an
argument.  I had only made it this specific because this was the only fairly
common operation I had found.  Though perhaps it's indeed better to make the
optab a bit more general?
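
Concretely, generalizing it would mean something along these lines instead of
the current unary definitions (just a sketch, reusing your suggested
udiv_pow2m1 name, with the 2^(bitsize/2)-1 constant passed as the second
operand; the final form obviously depends on review):

  DEF_INTERNAL_OPTAB_FN (UDIV_POW2M1, ECF_CONST | ECF_NOTHROW,
                         udiv_pow2m1, binary)

  OPTAB_D (udiv_pow2m1_optab, "udiv_pow2m1$a3")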

Thanks,
Tamar

> 
> Richard.
> 
> > >
> > > I think for costing you could resort to the *_cost functions as used
> > > by synth_mult and friends.
> > >
> > > > The problem with this is that it seemed fragile.  We generate from
> > > > the vectorizer:
> > > >
> > > >   vect__3.8_35 = MEM <vector(16) unsigned char> [(uint8_t *)_21];
> > > >   vect_patt_28.9_37 = WIDEN_MULT_LO_EXPR <vect__3.8_35, vect_cst__36>;
> > > >   vect_patt_28.9_38 = WIDEN_MULT_HI_EXPR <vect__3.8_35, vect_cst__36>;
> > > >   vect_patt_19.10_40 = vect_patt_28.9_37 h* { 32897, 32897, 32897, 32897, 32897, 32897, 32897, 32897 };
> > > >   vect_patt_19.10_41 = vect_patt_28.9_38 h* { 32897, 32897, 32897, 32897, 32897, 32897, 32897, 32897 };
> > > >   vect_patt_25.11_42 = vect_patt_19.10_40 >> 7;
> > > >   vect_patt_25.11_43 = vect_patt_19.10_41 >> 7;
> > > >   vect_patt_11.12_44 = VEC_PACK_TRUNC_EXPR <vect_patt_25.11_42, vect_patt_25.11_43>;
> > > >
> > > > and if the magic constants change then we miss the optimization.  I
> > > > could rewrite the open coding to use shifts alone, but that might be
> > > > a regression for some uarches I would imagine.
> > >
> > > OK, so you do have a highpart multiply.  I suppose the pattern is
> > > too deep to be recognized by combine?  What's the RTL good vs. bad
> > > before combine of one of the expressions?
> >
> > Yeah, combine only tries 2-3 instructions, but to use these sequences
> > we have to match the entire chain, as the instructions do the narrowing
> > themselves.  So the RTL for the bad case before combine is:
> >
> > (insn 39 37 42 4 (set (reg:V4SI 119)
> >         (mult:V4SI (zero_extend:V4SI (vec_select:V4HI (reg:V8HI 116 [
> vect_patt_28.9D.3754 ])
> >                     (parallel:V8HI [
> >                             (const_int 4 [0x4])
> >                             (const_int 5 [0x5])
> >                             (const_int 6 [0x6])
> >                             (const_int 7 [0x7])
> >                         ])))
> >             (zero_extend:V4SI (vec_select:V4HI (reg:V8HI 118)
> >                     (parallel:V8HI [
> >                             (const_int 4 [0x4])
> >                             (const_int 5 [0x5])
> >                             (const_int 6 [0x6])
> >                             (const_int 7 [0x7])
> >                         ]))))) "/app/example.c":6:14 2114
> {aarch64_simd_vec_umult_hi_v8hi}
> >      (expr_list:REG_DEAD (reg:V8HI 116 [ vect_patt_28.9D.3754 ])
> >         (expr_list:REG_EQUAL (mult:V4SI (zero_extend:V4SI (vec_select:V4HI
> (reg:V8HI 116 [ vect_patt_28.9D.3754 ])
> >                         (parallel:V8HI [
> >                                 (const_int 4 [0x4])
> >                                 (const_int 5 [0x5])
> >                                 (const_int 6 [0x6])
> >                                 (const_int 7 [0x7])
> >                             ])))
> >                 (const_vector:V4SI [
> >                         (const_int 32897 [0x8081]) repeated x4
> >                     ]))
> >             (nil))))
> > (insn 42 39 43 4 (set (reg:V8HI 121 [ vect_patt_19.10D.3755 ])
> >         (unspec:V8HI [
> >                 (subreg:V8HI (reg:V4SI 117) 0)
> >                 (subreg:V8HI (reg:V4SI 119) 0)
> >             ] UNSPEC_UZP2)) "/app/example.c":6:14 4096 {aarch64_uzp2v8hi}
> >      (expr_list:REG_DEAD (reg:V4SI 119)
> >         (expr_list:REG_DEAD (reg:V4SI 117)
> >             (nil))))
> > (insn 43 42 44 4 (set (reg:V8HI 124 [ vect_patt_25.11D.3756 ])
> >         (lshiftrt:V8HI (reg:V8HI 121 [ vect_patt_19.10D.3755 ])
> >             (const_vector:V8HI [
> >                     (const_int 7 [0x7]) repeated x8
> >                 ]))) "/app/example.c":6:14 1803 {aarch64_simd_lshrv8hi}
> >      (expr_list:REG_DEAD (reg:V8HI 121 [ vect_patt_19.10D.3755 ])
> >         (nil)))
> > (insn 44 43 46 4 (set (reg:V8HI 125 [ vect_patt_28.9D.3754 ])
> >         (mult:V8HI (zero_extend:V8HI (vec_select:V8QI (reg:V16QI 115 [ MEM
> <vector(16) unsigned charD.21> [(uint8_tD.3704 *)_21 clique 1 base 1] ])
> >                     (parallel:V16QI [
> >                             (const_int 8 [0x8])
> >                             (const_int 9 [0x9])
> >                             (const_int 10 [0xa])
> >                             (const_int 11 [0xb])
> >                             (const_int 12 [0xc])
> >                             (const_int 13 [0xd])
> >                             (const_int 14 [0xe])
> >                             (const_int 15 [0xf])
> >                         ])))
> >             (zero_extend:V8HI (vec_select:V8QI (reg:V16QI 100 [ vect_cst__36 ])
> >                     (parallel:V16QI [
> >                             (const_int 8 [0x8])
> >                             (const_int 9 [0x9])
> >                             (const_int 10 [0xa])
> >                             (const_int 11 [0xb])
> >                             (const_int 12 [0xc])
> >                             (const_int 13 [0xd])
> >                             (const_int 14 [0xe])
> >                             (const_int 15 [0xf])
> >                         ]))))) "/app/example.c":6:14 2112
> {aarch64_simd_vec_umult_hi_v16qi}
> >      (expr_list:REG_DEAD (reg:V16QI 115 [ MEM <vector(16) unsigned
> charD.21> [(uint8_tD.3704 *)_21 clique 1 base 1] ])
> >         (nil)))
> > (insn 46 44 48 4 (set (reg:V4SI 126)
> >         (mult:V4SI (zero_extend:V4SI (subreg:V4HI (reg:V8HI 125 [
> vect_patt_28.9D.3754 ]) 0))
> >             (zero_extend:V4SI (subreg:V4HI (reg:V8HI 118) 0))))
> "/app/example.c":6:14 2108 {aarch64_intrinsic_vec_umult_lo_v4hi}
> >      (expr_list:REG_EQUAL (mult:V4SI (zero_extend:V4SI (subreg:V4HI
> (reg:V8HI 125 [ vect_patt_28.9D.3754 ]) 0))
> >             (const_vector:V4SI [
> >                     (const_int 32897 [0x8081]) repeated x4
> >                 ]))
> >         (nil)))
> > (insn 48 46 51 4 (set (reg:V4SI 128)
> >         (mult:V4SI (zero_extend:V4SI (vec_select:V4HI (reg:V8HI 125 [
> vect_patt_28.9D.3754 ])
> >                     (parallel:V8HI [
> >                             (const_int 4 [0x4])
> >                             (const_int 5 [0x5])
> >                             (const_int 6 [0x6])
> >                             (const_int 7 [0x7])
> >                         ])))
> >             (zero_extend:V4SI (vec_select:V4HI (reg:V8HI 118)
> >                     (parallel:V8HI [
> >                             (const_int 4 [0x4])
> >                             (const_int 5 [0x5])
> >                             (const_int 6 [0x6])
> >                             (const_int 7 [0x7])
> >                         ]))))) "/app/example.c":6:14 2114
> {aarch64_simd_vec_umult_hi_v8hi}
> >      (expr_list:REG_DEAD (reg:V8HI 125 [ vect_patt_28.9D.3754 ])
> >         (expr_list:REG_EQUAL (mult:V4SI (zero_extend:V4SI (vec_select:V4HI
> (reg:V8HI 125 [ vect_patt_28.9D.3754 ])
> >                         (parallel:V8HI [
> >                                 (const_int 4 [0x4])
> >                                 (const_int 5 [0x5])
> >                                 (const_int 6 [0x6])
> >                                 (const_int 7 [0x7])
> >                             ])))
> >                 (const_vector:V4SI [
> >                         (const_int 32897 [0x8081]) repeated x4
> >                     ]))
> >             (nil))))
> > (insn 51 48 52 4 (set (reg:V8HI 130 [ vect_patt_19.10D.3755 ])
> >         (unspec:V8HI [
> >                 (subreg:V8HI (reg:V4SI 126) 0)
> >                 (subreg:V8HI (reg:V4SI 128) 0)
> >             ] UNSPEC_UZP2)) "/app/example.c":6:14 4096 {aarch64_uzp2v8hi}
> >      (expr_list:REG_DEAD (reg:V4SI 128)
> >         (expr_list:REG_DEAD (reg:V4SI 126)
> >             (nil))))
> > (insn 52 51 53 4 (set (reg:V8HI 133 [ vect_patt_25.11D.3756 ])
> >         (lshiftrt:V8HI (reg:V8HI 130 [ vect_patt_19.10D.3755 ])
> >             (const_vector:V8HI [
> >                     (const_int 7 [0x7]) repeated x8
> >                 ]))) "/app/example.c":6:14 1803 {aarch64_simd_lshrv8hi}
> >      (expr_list:REG_DEAD (reg:V8HI 130 [ vect_patt_19.10D.3755 ])
> >         (nil)))
> >
> > And for good:
> >
> > (insn 32 30 34 4 (set (reg:V16QI 118)
> >         (vec_concat:V16QI (unspec:V8QI [
> >                     (reg:V8HI 114 [ vect_patt_28.9 ])
> >                     (reg:V8HI 115)
> >                 ] UNSPEC_ADDHN)
> >             (const_vector:V8QI [
> >                     (const_int 0 [0]) repeated x8
> >                 ]))) "draw.c":6:35 2688 {aarch64_addhnv8hi_insn_le}
> >      (expr_list:REG_EQUAL (vec_concat:V16QI (unspec:V8QI [
> >                     (reg:V8HI 114 [ vect_patt_28.9 ])
> >                     (const_vector:V8HI [
> >                             (const_int 257 [0x101]) repeated x8
> >                         ])
> >                 ] UNSPEC_ADDHN)
> >             (const_vector:V8QI [
> >                     (const_int 0 [0]) repeated x8
> >                 ]))
> >         (nil)))
> > (insn 34 32 35 4 (set (reg:V8HI 117)
> >         (plus:V8HI (zero_extend:V8HI (subreg:V8QI (reg:V16QI 118) 0))
> >             (reg:V8HI 114 [ vect_patt_28.9 ]))) "draw.c":6:35 2635
> {aarch64_uaddwv8qi}
> >      (expr_list:REG_DEAD (reg:V16QI 118)
> >         (expr_list:REG_DEAD (reg:V8HI 114 [ vect_patt_28.9 ])
> >             (nil))))
> > (insn 35 34 37 4 (set (reg:V8HI 103 [ vect_patt_25.10 ])
> >         (lshiftrt:V8HI (reg:V8HI 117)
> >             (const_vector:V8HI [
> >                     (const_int 8 [0x8]) repeated x8
> >                 ]))) "draw.c":6:35 1741 {aarch64_simd_lshrv8hi}
> >      (expr_list:REG_DEAD (reg:V8HI 117)
> >         (nil)))
> > (insn 37 35 39 4 (set (reg:V16QI 122)
> >         (vec_concat:V16QI (unspec:V8QI [
> >                     (reg:V8HI 102 [ vect_patt_28.9 ])
> >                     (reg:V8HI 115)
> >                 ] UNSPEC_ADDHN)
> >             (const_vector:V8QI [
> >                     (const_int 0 [0]) repeated x8
> >                 ]))) "draw.c":6:35 2688 {aarch64_addhnv8hi_insn_le}
> >      (expr_list:REG_EQUAL (vec_concat:V16QI (unspec:V8QI [
> >                     (reg:V8HI 102 [ vect_patt_28.9 ])
> >                     (const_vector:V8HI [
> >                             (const_int 257 [0x101]) repeated x8
> >                         ])
> >                 ] UNSPEC_ADDHN)
> >             (const_vector:V8QI [
> >                     (const_int 0 [0]) repeated x8
> >                 ]))
> >         (nil)))
> > (insn 39 37 40 4 (set (reg:V8HI 121)
> >         (plus:V8HI (zero_extend:V8HI (subreg:V8QI (reg:V16QI 122) 0))
> >             (reg:V8HI 102 [ vect_patt_28.9 ]))) "draw.c":6:35 2635
> {aarch64_uaddwv8qi}
> >      (expr_list:REG_DEAD (reg:V16QI 122)
> >         (expr_list:REG_DEAD (reg:V8HI 102 [ vect_patt_28.9 ])
> >             (nil))))
> > (insn 40 39 41 4 (set (reg:V8HI 104 [ vect_patt_25.10 ])
> >         (lshiftrt:V8HI (reg:V8HI 121)
> >             (const_vector:V8HI [
> >                     (const_int 8 [0x8]) repeated x8
> >                 ]))) "draw.c":6:35 1741 {aarch64_simd_lshrv8hi}
> >
> > Cheers,
> > Tamar
> >
> > >
> > > > > Btw, on x86 we use
> > > > >
> > > > > t.c:3:21: note:   replacing earlier pattern patt_25 = patt_28 / 255;
> > > > > t.c:3:21: note:   with patt_25 = patt_19 >> 7;
> > > > > t.c:3:21: note:   extra pattern stmt: patt_19 = patt_28 h* 32897;
> > > > >
> > > > > which translates to
> > > > >
> > > > >         vpmulhuw        %ymm4, %ymm0, %ymm0
> > > > >         vpmulhuw        %ymm4, %ymm1, %ymm1
> > > > >         vpsrlw  $7, %ymm0, %ymm0
> > > > >         vpsrlw  $7, %ymm1, %ymm1
> > > > >
> > > > > there's odd
> > > > >
> > > > >         vpand   %ymm0, %ymm3, %ymm0
> > > > >         vpand   %ymm1, %ymm3, %ymm1
> > > > >
> > > > > before (%ymm3 is all 0x00ff)
> > > > >
> > > > >         vpackuswb       %ymm1, %ymm0, %ymm0
> > > > >
> > > > > that's not visible in GIMPLE.  I guess aarch64 lacks a highpart
> > > > > multiply here?
> > > > > In any case, it seems that generic division expansion could be
> > > > > improved here? (choose_multiplier?)
> > > >
> > > > We do generate a highpart multiply here, but the patch avoids
> > > > multiplies and shifts entirely by creative use of the ISA.  Another
> > > > reason I went for an optab is costing.  The chosen operations are
> > > > significantly cheaper on all Arm uarches than shifts and multiplies.
> > > >
> > > > This means we get vectorization in some cases where the cost model
> > > > would otherwise correctly say it's too expensive to vectorize,
> > > > particularly around double precision.
> > > >
> > > > Thanks,
> > > > Tamar
> > > >
> > > > >
> > > > > Richard.
> > > > >
> > > > > > Richard.
> > > > > >
> > > > > > > Thanks,
> > > > > > > Tamar
> > > > > > >
> > > > > > > gcc/ChangeLog:
> > > > > > >
> > > > > > > 	* internal-fn.def (DIV_POW2_BITMASK): New.
> > > > > > > 	* optabs.def (udiv_pow2_bitmask_optab): New.
> > > > > > > 	* doc/md.texi: Document it.
> > > > > > > 	* tree-vect-patterns.cc (vect_recog_divmod_pattern): Recognize
> > > > > > > 	pattern.
> > > > > > >
> > > > > > > gcc/testsuite/ChangeLog:
> > > > > > >
> > > > > > > 	* gcc.dg/vect/vect-div-bitmask-1.c: New test.
> > > > > > > 	* gcc.dg/vect/vect-div-bitmask-2.c: New test.
> > > > > > > 	* gcc.dg/vect/vect-div-bitmask-3.c: New test.
> > > > > > > 	* gcc.dg/vect/vect-div-bitmask.h: New file.
> > > > > > >
> > > > > > > --- inline copy of patch --
> > > > > > > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> > > > > > > index f3619c505c025f158c2bc64756531877378b22e1..784c49d7d24cef7619e4d613f7b4f6e945866c38 100644
> > > > > > > --- a/gcc/doc/md.texi
> > > > > > > +++ b/gcc/doc/md.texi
> > > > > > > @@ -5588,6 +5588,18 @@ signed op0, op1;
> > > > > > >  op0 = op1 / (1 << imm);
> > > > > > >  @end smallexample
> > > > > > >
> > > > > > > +@cindex @code{udiv_pow2_bitmask@var{m2}} instruction pattern
> > > > > > > +@item @samp{udiv_pow2_bitmask@var{m2}}
> > > > > > > +@cindex @code{udiv_pow2_bitmask@var{m2}} instruction pattern
> > > > > > > +@itemx @samp{udiv_pow2_bitmask@var{m2}}
> > > > > > > +Unsigned vector division by an immediate that is equivalent to
> > > > > > > +@samp{2^(bitsize(m) / 2) - 1}.
> > > > > > > +@smallexample
> > > > > > > +unsigned short op0, op1;
> > > > > > > +@dots{}
> > > > > > > +op0 = op1 / 0xffU;
> > > > > > > +@end smallexample
> > > > > > > +
> > > > > > >  @cindex @code{vec_shl_insert_@var{m}} instruction pattern
> > > > > > >  @item @samp{vec_shl_insert_@var{m}}
> > > > > > >  Shift the elements in vector input operand 1 left one element (i.e.@:
> > > > > > > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> > > > > > > index d2d550d358606022b1cb44fa842f06e0be507bc3..a3e3cc1520f77683ebf6256898f916ed45de475f 100644
> > > > > > > --- a/gcc/internal-fn.def
> > > > > > > +++ b/gcc/internal-fn.def
> > > > > > > @@ -159,6 +159,8 @@ DEF_INTERNAL_OPTAB_FN (VEC_SHL_INSERT, ECF_CONST | ECF_NOTHROW,
> > > > > > >  		       vec_shl_insert, binary)
> > > > > > >
> > > > > > >  DEF_INTERNAL_OPTAB_FN (DIV_POW2, ECF_CONST | ECF_NOTHROW, sdiv_pow2, binary)
> > > > > > > +DEF_INTERNAL_OPTAB_FN (DIV_POW2_BITMASK, ECF_CONST | ECF_NOTHROW,
> > > > > > > +		       udiv_pow2_bitmask, unary)
> > > > > > >
> > > > > > >  DEF_INTERNAL_OPTAB_FN (FMS, ECF_CONST, fms, ternary)
> > > > > > >  DEF_INTERNAL_OPTAB_FN (FNMA, ECF_CONST, fnma, ternary)
> > > > > > > diff --git a/gcc/optabs.def b/gcc/optabs.def
> > > > > > > index 801310ebaa7d469520809bb7efed6820f8eb866b..3f0ac05ef5ad5aed8d6ca391f4eed71b0494e17f 100644
> > > > > > > --- a/gcc/optabs.def
> > > > > > > +++ b/gcc/optabs.def
> > > > > > > @@ -372,6 +372,7 @@ OPTAB_D (smulhrs_optab, "smulhrs$a3")
> > > > > > >  OPTAB_D (umulhs_optab, "umulhs$a3")
> > > > > > >  OPTAB_D (umulhrs_optab, "umulhrs$a3")
> > > > > > >  OPTAB_D (sdiv_pow2_optab, "sdiv_pow2$a3")
> > > > > > > +OPTAB_D (udiv_pow2_bitmask_optab, "udiv_pow2_bitmask$a2")
> > > > > > >  OPTAB_D (vec_pack_sfix_trunc_optab, "vec_pack_sfix_trunc_$a")
> > > > > > >  OPTAB_D (vec_pack_ssat_optab, "vec_pack_ssat_$a")
> > > > > > >  OPTAB_D (vec_pack_trunc_optab, "vec_pack_trunc_$a")
> > > > > > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-1.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-1.c
> > > > > > > new file mode 100644
> > > > > > > index
> > > > > > >
> > > > >
> > >
> 0000000000000000000000000000000000000000..a7ea3cce4764239c5d281a8f0b
> > > > > > > ead1f6a452de3f
> > > > > > > --- /dev/null
> > > > > > > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-1.c
> > > > > > > @@ -0,0 +1,25 @@
> > > > > > > +/* { dg-require-effective-target vect_int } */
> > > > > > > +
> > > > > > > +#include <stdint.h>
> > > > > > > +#include "tree-vect.h"
> > > > > > > +
> > > > > > > +#define N 50
> > > > > > > +#define TYPE uint8_t
> > > > > > > +
> > > > > > > +__attribute__((noipa, noinline, optimize("O1"))) void
> > > > > > > +fun1(TYPE* restrict pixel, TYPE level, int n) {
> > > > > > > +  for (int i = 0; i < n; i+=1)
> > > > > > > +    pixel[i] = (pixel[i] * level) / 0xff; }
> > > > > > > +
> > > > > > > +__attribute__((noipa, noinline, optimize("O3"))) void
> > > > > > > +fun2(TYPE* restrict pixel, TYPE level, int n) {
> > > > > > > +  for (int i = 0; i < n; i+=1)
> > > > > > > +    pixel[i] = (pixel[i] * level) / 0xff; }
> > > > > > > +
> > > > > > > +#include "vect-div-bitmask.h"
> > > > > > > +
> > > > > > > +/* { dg-final { scan-tree-dump "vect_recog_divmod_pattern:
> > > > > > > +detected" "vect" } } */
> > > > > > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-2.c
> > > > > > > b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-2.c
> > > > > > > new file mode 100644
> > > > > > > index
> > > > > > >
> > > > >
> > >
> 0000000000000000000000000000000000000000..009e16e1b36497e5724410d98
> > > > > 4
> > > > > > > 3f1ce122b26dda
> > > > > > > --- /dev/null
> > > > > > > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-2.c
> > > > > > > @@ -0,0 +1,25 @@
> > > > > > > +/* { dg-require-effective-target vect_int } */
> > > > > > > +
> > > > > > > +#include <stdint.h>
> > > > > > > +#include "tree-vect.h"
> > > > > > > +
> > > > > > > +#define N 50
> > > > > > > +#define TYPE uint16_t
> > > > > > > +
> > > > > > > +__attribute__((noipa, noinline, optimize("O1"))) void
> > > > > > > +fun1(TYPE* restrict pixel, TYPE level, int n) {
> > > > > > > +  for (int i = 0; i < n; i+=1)
> > > > > > > +    pixel[i] = (pixel[i] * level) / 0xffffU; }
> > > > > > > +
> > > > > > > +__attribute__((noipa, noinline, optimize("O3"))) void
> > > > > > > +fun2(TYPE* restrict pixel, TYPE level, int n) {
> > > > > > > +  for (int i = 0; i < n; i+=1)
> > > > > > > +    pixel[i] = (pixel[i] * level) / 0xffffU; }
> > > > > > > +
> > > > > > > +#include "vect-div-bitmask.h"
> > > > > > > +
> > > > > > > +/* { dg-final { scan-tree-dump "vect_recog_divmod_pattern:
> > > > > > > +detected" "vect" } } */
> > > > > > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-3.c
> > > > > > > b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-3.c
> > > > > > > new file mode 100644
> > > > > > > index
> > > > > > >
> > > > >
> > >
> 0000000000000000000000000000000000000000..bf35a0bda8333c418e692d942
> > > > > 2
> > > > > > > 0df849cc47930b
> > > > > > > --- /dev/null
> > > > > > > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-3.c
> > > > > > > @@ -0,0 +1,26 @@
> > > > > > > +/* { dg-require-effective-target vect_int } */
> > > > > > > +/* { dg-additional-options "-fno-vect-cost-model" { target
> > > > > > > +aarch64*-*-* } } */
> > > > > > > +
> > > > > > > +#include <stdint.h>
> > > > > > > +#include "tree-vect.h"
> > > > > > > +
> > > > > > > +#define N 50
> > > > > > > +#define TYPE uint32_t
> > > > > > > +
> > > > > > > +__attribute__((noipa, noinline, optimize("O1"))) void
> > > > > > > +fun1(TYPE* restrict pixel, TYPE level, int n) {
> > > > > > > +  for (int i = 0; i < n; i+=1)
> > > > > > > +    pixel[i] = (pixel[i] * (uint64_t)level) / 0xffffffffUL;
> > > > > > > +}
> > > > > > > +
> > > > > > > +__attribute__((noipa, noinline, optimize("O3"))) void
> > > > > > > +fun2(TYPE* restrict pixel, TYPE level, int n) {
> > > > > > > +  for (int i = 0; i < n; i+=1)
> > > > > > > +    pixel[i] = (pixel[i] * (uint64_t)level) / 0xffffffffUL;
> > > > > > > +}
> > > > > > > +
> > > > > > > +#include "vect-div-bitmask.h"
> > > > > > > +
> > > > > > > +/* { dg-final { scan-tree-dump "vect_recog_divmod_pattern:
> > > > > > > +detected" "vect" } } */
> > > > > > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask.h
> > > > > > > b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask.h
> > > > > > > new file mode 100644
> > > > > > > index
> > > > > > >
> > > > >
> > >
> 0000000000000000000000000000000000000000..29a16739aa4b706616367bfd1
> > > > > 8
> > > > > > > 32f28ebd07993e
> > > > > > > --- /dev/null
> > > > > > > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask.h
> > > > > > > @@ -0,0 +1,43 @@
> > > > > > > +#include <stdio.h>
> > > > > > > +
> > > > > > > +#ifndef N
> > > > > > > +#define N 65
> > > > > > > +#endif
> > > > > > > +
> > > > > > > +#ifndef TYPE
> > > > > > > +#define TYPE uint32_t
> > > > > > > +#endif
> > > > > > > +
> > > > > > > +#ifndef DEBUG
> > > > > > > +#define DEBUG 0
> > > > > > > +#endif
> > > > > > > +
> > > > > > > +#define BASE ((TYPE) -1 < 0 ? -126 : 4)
> > > > > > > +
> > > > > > > +int main ()
> > > > > > > +{
> > > > > > > +  TYPE a[N];
> > > > > > > +  TYPE b[N];
> > > > > > > +
> > > > > > > +  for (int i = 0; i < N; ++i)
> > > > > > > +    {
> > > > > > > +      a[i] = BASE + i * 13;
> > > > > > > +      b[i] = BASE + i * 13;
> > > > > > > +      if (DEBUG)
> > > > > > > +        printf ("%d: 0x%x\n", i, a[i]);
> > > > > > > +    }
> > > > > > > +
> > > > > > > +  fun1 (a, N / 2, N);
> > > > > > > +  fun2 (b, N / 2, N);
> > > > > > > +
> > > > > > > +  for (int i = 0; i < N; ++i)
> > > > > > > +    {
> > > > > > > +      if (DEBUG)
> > > > > > > +        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
> > > > > > > +
> > > > > > > +      if (a[i] != b[i])
> > > > > > > +        __builtin_abort ();
> > > > > > > +    }
> > > > > > > +  return 0;
> > > > > > > +}
> > > > > > > +
> > > > > > > diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> > > > > > > index 217bdfd7045a22578a35bb891a4318d741071872..a738558cb8d12296bff462d716310ca8d82957b5 100644
> > > > > > > --- a/gcc/tree-vect-patterns.cc
> > > > > > > +++ b/gcc/tree-vect-patterns.cc
> > > > > > > @@ -3558,6 +3558,33 @@ vect_recog_divmod_pattern (vec_info *vinfo,
> > > > > > >
> > > > > > >        return pattern_stmt;
> > > > > > >      }
> > > > > > > +  else if ((TYPE_UNSIGNED (itype) || tree_int_cst_sgn (oprnd1) != 1)
> > > > > > > +	   && rhs_code != TRUNC_MOD_EXPR)
> > > > > > > +    {
> > > > > > > +      wide_int icst = wi::to_wide (oprnd1);
> > > > > > > +      wide_int val = wi::add (icst, 1);
> > > > > > > +      int pow = wi::exact_log2 (val);
> > > > > > > +      if (pow == (prec / 2))
> > > > > > > +	{
> > > > > > > +	  /* Pattern detected.  */
> > > > > > > +	  vect_pattern_detected ("vect_recog_divmod_pattern", last_stmt);
> > > > > > > +
> > > > > > > +	  *type_out = vectype;
> > > > > > > +
> > > > > > > +	  /* Check if the target supports this internal function.  */
> > > > > > > +	  internal_fn ifn = IFN_DIV_POW2_BITMASK;
> > > > > > > +	  if (direct_internal_fn_supported_p (ifn, vectype, OPTIMIZE_FOR_SPEED))
> > > > > > > +	    {
> > > > > > > +	      tree var_div = vect_recog_temp_ssa_var (itype, NULL);
> > > > > > > +	      gimple *div_stmt = gimple_build_call_internal (ifn, 1, oprnd0);
> > > > > > > +	      gimple_call_set_lhs (div_stmt, var_div);
> > > > > > > +
> > > > > > > +	      gimple_set_location (div_stmt, gimple_location (last_stmt));
> > > > > > > +
> > > > > > > +	      return div_stmt;
> > > > > > > +	    }
> > > > > > > +	}
> > > > > > > +    }
> > > > > > >
> > > > > > >    if (prec > HOST_BITS_PER_WIDE_INT
> > > > > > >        || integer_zerop (oprnd1))
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > > --
> > > > > Richard Biener <rguenther@suse.de> SUSE Software Solutions
> > > > > Germany GmbH, Frankenstraße 146, 90461 Nuernberg, Germany; GF:
> > > > > Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
> HRB
> > > > > 36809 (AG Nuernberg)
> > > >
> > >
> > > --
> > > Richard Biener <rguenther@suse.de>
> > > SUSE Software Solutions Germany GmbH, Frankenstraße 146, 90461
> > > Nuernberg, Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald,
> > > Boudien Moerman; HRB 36809 (AG Nuernberg)
> >
> 
> --
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH, Frankenstraße 146, 90461
> Nuernberg, Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald,
> Boudien Moerman; HRB 36809 (AG Nuernberg)

Thread overview: 35+ messages
2022-06-09  4:39 Tamar Christina
2022-06-09  4:40 ` [PATCH 2/2]AArch64 aarch64: Add implementation for pow2 bitmask division Tamar Christina
2022-06-13  9:24 ` [PATCH 1/2]middle-end Support optimized division by pow2 bitmask Richard Biener
2022-06-13  9:39   ` Richard Biener
2022-06-13 10:09     ` Tamar Christina
2022-06-13 11:47       ` Richard Biener
2022-06-13 14:37         ` Tamar Christina
2022-06-14 13:18           ` Richard Biener
2022-06-14 13:38             ` Tamar Christina [this message]
2022-06-14 13:42             ` Richard Sandiford
2022-06-14 15:57               ` Tamar Christina
2022-06-14 16:09                 ` Richard Biener
2022-06-22  0:34                 ` Tamar Christina
2022-06-26 19:55                   ` Jeff Law
2022-09-23  9:33 ` [PATCH 1/4]middle-end Support not decomposing specific divisions during vectorization Tamar Christina
2022-09-23  9:33 ` [PATCH 2/4]AArch64 Add implementation for pow2 bitmask division Tamar Christina
2022-10-31 11:34   ` Tamar Christina
2022-11-09  8:33     ` Tamar Christina
2022-11-09 16:02     ` Kyrylo Tkachov
2022-09-23  9:33 ` [PATCH 3/4]AArch64 Add SVE2 " Tamar Christina
2022-10-31 11:34   ` Tamar Christina
2022-11-09  8:33     ` Tamar Christina
2022-11-12 12:17   ` Richard Sandiford
2022-09-23  9:34 ` [PATCH 4/4]AArch64 sve2: rewrite pack + NARROWB + NARROWB to NARROWB + NARROWT Tamar Christina
2022-10-31 11:34   ` Tamar Christina
2022-11-09  8:33     ` Tamar Christina
2022-11-12 12:25   ` Richard Sandiford
2022-11-12 12:33     ` Richard Sandiford
2022-09-26 10:39 ` [PATCH 1/4]middle-end Support not decomposing specific divisions during vectorization Richard Biener
2022-10-31 11:34   ` Tamar Christina
2022-10-31 17:12     ` Jeff Law
2022-11-08 17:36     ` Tamar Christina
2022-11-09  8:01       ` Richard Biener
2022-11-09  8:26         ` Tamar Christina
2022-11-09 10:37 ` Kyrylo Tkachov
