public inbox for gcc-patches@gcc.gnu.org
From: Tamar Christina <Tamar.Christina@arm.com>
To: Richard Biener <rguenther@suse.de>
Cc: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,
	nd <nd@arm.com>, Richard Sandiford <Richard.Sandiford@arm.com>
Subject: RE: [PATCH 1/2]middle-end Support optimized division by pow2 bitmask
Date: Mon, 13 Jun 2022 14:37:14 +0000	[thread overview]
Message-ID: <VI1PR08MB5325B5B8B2AB0A77E9424A59FFAB9@VI1PR08MB5325.eurprd08.prod.outlook.com> (raw)
In-Reply-To: <sn5qn872-834o-8149-ns0-46996nsp41q1@fhfr.qr>

> -----Original Message-----
> From: Richard Biener <rguenther@suse.de>
> Sent: Monday, June 13, 2022 12:48 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Sandiford
> <Richard.Sandiford@arm.com>
> Subject: RE: [PATCH 1/2]middle-end Support optimized division by pow2
> bitmask
> 
> On Mon, 13 Jun 2022, Tamar Christina wrote:
> 
> > > -----Original Message-----
> > > From: Richard Biener <rguenther@suse.de>
> > > Sent: Monday, June 13, 2022 10:39 AM
> > > To: Tamar Christina <Tamar.Christina@arm.com>
> > > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Sandiford
> > > <Richard.Sandiford@arm.com>
> > > Subject: Re: [PATCH 1/2]middle-end Support optimized division by
> > > pow2 bitmask
> > >
> > > On Mon, 13 Jun 2022, Richard Biener wrote:
> > >
> > > > On Thu, 9 Jun 2022, Tamar Christina wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > In plenty of image and video processing code it's common to
> > > > > modify pixel values by a widening operation and then scale them
> > > > > back into range by dividing by 255.
> > > > >
> > > > > This patch adds an optab to allow us to emit an optimized
> > > > > sequence when doing an unsigned division that is equivalent to:
> > > > >
> > > > >    x = y / (2^(bitsize(y)/2) - 1)
> > > > >
> > > > > Bootstrapped Regtested on aarch64-none-linux-gnu,
> > > > > x86_64-pc-linux-gnu and no issues.
> > > > >
> > > > > Ok for master?
> > > >
> > > > Looking at 2/2 it seems that this is the wrong way to attack the
> > > > problem.  The ISA doesn't have such instruction so adding an optab
> > > > looks premature.  I suppose that there's no unsigned vector
> > > > integer division and thus we open-code that in a different way?
> > > > Isn't the correct thing then to fixup that open-coding if it is more efficient?
> > >
> >
> > The problem is that even if you fix up the open-coding it would need to
> > be something target specific?  The sequence of instructions we generate
> > doesn't have a GIMPLE representation, so whatever is generated I'd have
> > to fix up in RTL then.
> 
> What's the operation that doesn't have a GIMPLE representation?

For NEON we use two operations:
1. An add-high-narrow of the lowpart, essentially doing (a +w b) >>.n bitsize(a)/2,
    where the + widens and the >> narrows.  So you give it two shorts and get a byte back.
2. A widening add of the lowpart, so basically lowpart (a +w b).

For SVE2 we use a different sequence: two back-to-back uses of
1. Add-narrow-high-part (bottom).  In SVE the Bottom and Top instructions work on the
   even and odd elements of the vector rather than on the "bottom half" and "top half".

   So this instruction adds each vector element of the first source vector to the
   corresponding vector element of the second source vector, places the most
   significant half of the result in the even-numbered half-width destination elements,
   and sets the odd-numbered elements to zero.

So there's an explicit permute in there.  The instructions are sufficiently different that there
wouldn't be a single GIMPLE representation.
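
To make the NEON sequence concrete, here is a rough intrinsics sketch of what the
expansion computes for the uint8_t case (my own illustration, not part of the patch;
div255_u16 is a made-up name and the 0x101/8 constants are taken from the RTL dumps
further down):

#include <arm_neon.h>

/* Sketch only: divide each 16-bit lane by 255, exact at least for the range
   of a u8 * u8 product (x <= 0xfe01).  ADDHN computes (x + 0x101) >> 8 narrowed
   to 8 bits, UADDW adds that back widened, and the final shift finishes
   x / 255 == (x + ((x + 0x101) >> 8)) >> 8.  */
static inline uint16x8_t div255_u16 (uint16x8_t x)
{
  uint8x8_t  hi  = vaddhn_u16 (x, vdupq_n_u16 (0x101)); /* addhn       */
  uint16x8_t sum = vaddw_u8 (x, hi);                     /* uaddw       */
  return vshrq_n_u16 (sum, 8);                           /* ushr #8     */
}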

> 
> I think for costing you could resort to the *_cost functions as used by
> synth_mult and friends.
> 
> > The problem with this is that it seemed fragile. We generate from the
> > Vectorizer:
> >
> >   vect__3.8_35 = MEM <vector(16) unsigned char> [(uint8_t *)_21];
> >   vect_patt_28.9_37 = WIDEN_MULT_LO_EXPR <vect__3.8_35, vect_cst__36>;
> >   vect_patt_28.9_38 = WIDEN_MULT_HI_EXPR <vect__3.8_35, vect_cst__36>;
> >   vect_patt_19.10_40 = vect_patt_28.9_37 h* { 32897, 32897, 32897, 32897, 32897, 32897, 32897, 32897 };
> >   vect_patt_19.10_41 = vect_patt_28.9_38 h* { 32897, 32897, 32897, 32897, 32897, 32897, 32897, 32897 };
> >   vect_patt_25.11_42 = vect_patt_19.10_40 >> 7;
> >   vect_patt_25.11_43 = vect_patt_19.10_41 >> 7;
> >   vect_patt_11.12_44 = VEC_PACK_TRUNC_EXPR <vect_patt_25.11_42, vect_patt_25.11_43>;
> >
> > and if the magic constants change then we miss the optimization. I
> > could rewrite the open coding to use shifts alone, but that might be a
> > regression for some uarches I would imagine.
> 
> OK, so you do have a highpart multiply.  I suppose the pattern is too deep to
> be recognized by combine?  What's the RTL good vs. bad before combine of
> one of the expressions?

Yeah, combine only tries 2-3 instructions, but to use these sequences we have to
match the entire chain, as the instructions do the narrowing themselves.  So the RTL
for the bad case before combine is:

(insn 39 37 42 4 (set (reg:V4SI 119)
        (mult:V4SI (zero_extend:V4SI (vec_select:V4HI (reg:V8HI 116 [ vect_patt_28.9D.3754 ])
                    (parallel:V8HI [
                            (const_int 4 [0x4])
                            (const_int 5 [0x5])
                            (const_int 6 [0x6])
                            (const_int 7 [0x7])
                        ])))
            (zero_extend:V4SI (vec_select:V4HI (reg:V8HI 118)
                    (parallel:V8HI [
                            (const_int 4 [0x4])
                            (const_int 5 [0x5])
                            (const_int 6 [0x6])
                            (const_int 7 [0x7])
                        ]))))) "/app/example.c":6:14 2114 {aarch64_simd_vec_umult_hi_v8hi}
     (expr_list:REG_DEAD (reg:V8HI 116 [ vect_patt_28.9D.3754 ])
        (expr_list:REG_EQUAL (mult:V4SI (zero_extend:V4SI (vec_select:V4HI (reg:V8HI 116 [ vect_patt_28.9D.3754 ])
                        (parallel:V8HI [
                                (const_int 4 [0x4])
                                (const_int 5 [0x5])
                                (const_int 6 [0x6])
                                (const_int 7 [0x7])
                            ])))
                (const_vector:V4SI [
                        (const_int 32897 [0x8081]) repeated x4
                    ]))
            (nil))))
(insn 42 39 43 4 (set (reg:V8HI 121 [ vect_patt_19.10D.3755 ])
        (unspec:V8HI [
                (subreg:V8HI (reg:V4SI 117) 0)
                (subreg:V8HI (reg:V4SI 119) 0)
            ] UNSPEC_UZP2)) "/app/example.c":6:14 4096 {aarch64_uzp2v8hi}
     (expr_list:REG_DEAD (reg:V4SI 119)
        (expr_list:REG_DEAD (reg:V4SI 117)
            (nil))))
(insn 43 42 44 4 (set (reg:V8HI 124 [ vect_patt_25.11D.3756 ])
        (lshiftrt:V8HI (reg:V8HI 121 [ vect_patt_19.10D.3755 ])
            (const_vector:V8HI [
                    (const_int 7 [0x7]) repeated x8
                ]))) "/app/example.c":6:14 1803 {aarch64_simd_lshrv8hi}
     (expr_list:REG_DEAD (reg:V8HI 121 [ vect_patt_19.10D.3755 ])
        (nil)))
(insn 44 43 46 4 (set (reg:V8HI 125 [ vect_patt_28.9D.3754 ])
        (mult:V8HI (zero_extend:V8HI (vec_select:V8QI (reg:V16QI 115 [ MEM <vector(16) unsigned charD.21> [(uint8_tD.3704 *)_21 clique 1 base 1] ])
                    (parallel:V16QI [
                            (const_int 8 [0x8])
                            (const_int 9 [0x9])
                            (const_int 10 [0xa])
                            (const_int 11 [0xb])
                            (const_int 12 [0xc])
                            (const_int 13 [0xd])
                            (const_int 14 [0xe])
                            (const_int 15 [0xf])
                        ])))
            (zero_extend:V8HI (vec_select:V8QI (reg:V16QI 100 [ vect_cst__36 ])
                    (parallel:V16QI [
                            (const_int 8 [0x8])
                            (const_int 9 [0x9])
                            (const_int 10 [0xa])
                            (const_int 11 [0xb])
                            (const_int 12 [0xc])
                            (const_int 13 [0xd])
                            (const_int 14 [0xe])
                            (const_int 15 [0xf])
                        ]))))) "/app/example.c":6:14 2112 {aarch64_simd_vec_umult_hi_v16qi}
     (expr_list:REG_DEAD (reg:V16QI 115 [ MEM <vector(16) unsigned charD.21> [(uint8_tD.3704 *)_21 clique 1 base 1] ])
        (nil)))
(insn 46 44 48 4 (set (reg:V4SI 126)
        (mult:V4SI (zero_extend:V4SI (subreg:V4HI (reg:V8HI 125 [ vect_patt_28.9D.3754 ]) 0))
            (zero_extend:V4SI (subreg:V4HI (reg:V8HI 118) 0)))) "/app/example.c":6:14 2108 {aarch64_intrinsic_vec_umult_lo_v4hi}
     (expr_list:REG_EQUAL (mult:V4SI (zero_extend:V4SI (subreg:V4HI (reg:V8HI 125 [ vect_patt_28.9D.3754 ]) 0))
            (const_vector:V4SI [
                    (const_int 32897 [0x8081]) repeated x4
                ]))
        (nil)))
(insn 48 46 51 4 (set (reg:V4SI 128)
        (mult:V4SI (zero_extend:V4SI (vec_select:V4HI (reg:V8HI 125 [ vect_patt_28.9D.3754 ])
                    (parallel:V8HI [
                            (const_int 4 [0x4])
                            (const_int 5 [0x5])
                            (const_int 6 [0x6])
                            (const_int 7 [0x7])
                        ])))
            (zero_extend:V4SI (vec_select:V4HI (reg:V8HI 118)
                    (parallel:V8HI [
                            (const_int 4 [0x4])
                            (const_int 5 [0x5])
                            (const_int 6 [0x6])
                            (const_int 7 [0x7])
                        ]))))) "/app/example.c":6:14 2114 {aarch64_simd_vec_umult_hi_v8hi}
     (expr_list:REG_DEAD (reg:V8HI 125 [ vect_patt_28.9D.3754 ])
        (expr_list:REG_EQUAL (mult:V4SI (zero_extend:V4SI (vec_select:V4HI (reg:V8HI 125 [ vect_patt_28.9D.3754 ])
                        (parallel:V8HI [
                                (const_int 4 [0x4])
                                (const_int 5 [0x5])
                                (const_int 6 [0x6])
                                (const_int 7 [0x7])
                            ])))
                (const_vector:V4SI [
                        (const_int 32897 [0x8081]) repeated x4
                    ]))
            (nil))))
(insn 51 48 52 4 (set (reg:V8HI 130 [ vect_patt_19.10D.3755 ])
        (unspec:V8HI [
                (subreg:V8HI (reg:V4SI 126) 0)
                (subreg:V8HI (reg:V4SI 128) 0)
            ] UNSPEC_UZP2)) "/app/example.c":6:14 4096 {aarch64_uzp2v8hi}
     (expr_list:REG_DEAD (reg:V4SI 128)
        (expr_list:REG_DEAD (reg:V4SI 126)
            (nil))))
(insn 52 51 53 4 (set (reg:V8HI 133 [ vect_patt_25.11D.3756 ])
        (lshiftrt:V8HI (reg:V8HI 130 [ vect_patt_19.10D.3755 ])
            (const_vector:V8HI [
                    (const_int 7 [0x7]) repeated x8
                ]))) "/app/example.c":6:14 1803 {aarch64_simd_lshrv8hi}
     (expr_list:REG_DEAD (reg:V8HI 130 [ vect_patt_19.10D.3755 ])
        (nil)))

And for the good case:

(insn 32 30 34 4 (set (reg:V16QI 118)
        (vec_concat:V16QI (unspec:V8QI [
                    (reg:V8HI 114 [ vect_patt_28.9 ])
                    (reg:V8HI 115)
                ] UNSPEC_ADDHN)
            (const_vector:V8QI [
                    (const_int 0 [0]) repeated x8
                ]))) "draw.c":6:35 2688 {aarch64_addhnv8hi_insn_le}
     (expr_list:REG_EQUAL (vec_concat:V16QI (unspec:V8QI [
                    (reg:V8HI 114 [ vect_patt_28.9 ])
                    (const_vector:V8HI [
                            (const_int 257 [0x101]) repeated x8
                        ])
                ] UNSPEC_ADDHN)
            (const_vector:V8QI [
                    (const_int 0 [0]) repeated x8
                ]))
        (nil)))
(insn 34 32 35 4 (set (reg:V8HI 117)
        (plus:V8HI (zero_extend:V8HI (subreg:V8QI (reg:V16QI 118) 0))
            (reg:V8HI 114 [ vect_patt_28.9 ]))) "draw.c":6:35 2635 {aarch64_uaddwv8qi}
     (expr_list:REG_DEAD (reg:V16QI 118)
        (expr_list:REG_DEAD (reg:V8HI 114 [ vect_patt_28.9 ])
            (nil))))
(insn 35 34 37 4 (set (reg:V8HI 103 [ vect_patt_25.10 ])
        (lshiftrt:V8HI (reg:V8HI 117)
            (const_vector:V8HI [
                    (const_int 8 [0x8]) repeated x8
                ]))) "draw.c":6:35 1741 {aarch64_simd_lshrv8hi}
     (expr_list:REG_DEAD (reg:V8HI 117)
        (nil)))
(insn 37 35 39 4 (set (reg:V16QI 122)
        (vec_concat:V16QI (unspec:V8QI [
                    (reg:V8HI 102 [ vect_patt_28.9 ])
                    (reg:V8HI 115)
                ] UNSPEC_ADDHN)
            (const_vector:V8QI [
                    (const_int 0 [0]) repeated x8
                ]))) "draw.c":6:35 2688 {aarch64_addhnv8hi_insn_le}
     (expr_list:REG_EQUAL (vec_concat:V16QI (unspec:V8QI [
                    (reg:V8HI 102 [ vect_patt_28.9 ])
                    (const_vector:V8HI [
                            (const_int 257 [0x101]) repeated x8
                        ])
                ] UNSPEC_ADDHN)
            (const_vector:V8QI [
                    (const_int 0 [0]) repeated x8
                ]))
        (nil)))
(insn 39 37 40 4 (set (reg:V8HI 121)
        (plus:V8HI (zero_extend:V8HI (subreg:V8QI (reg:V16QI 122) 0))
            (reg:V8HI 102 [ vect_patt_28.9 ]))) "draw.c":6:35 2635 {aarch64_uaddwv8qi}
     (expr_list:REG_DEAD (reg:V16QI 122)
        (expr_list:REG_DEAD (reg:V8HI 102 [ vect_patt_28.9 ])
            (nil))))
(insn 40 39 41 4 (set (reg:V8HI 104 [ vect_patt_25.10 ])
        (lshiftrt:V8HI (reg:V8HI 121)
            (const_vector:V8HI [
                    (const_int 8 [0x8]) repeated x8
                ]))) "draw.c":6:35 1741 {aarch64_simd_lshrv8hi}

Cheers,
Tamar

> 
> > > Btw, on x86 we use
> > >
> > > t.c:3:21: note:   replacing earlier pattern patt_25 = patt_28 / 255;
> > > t.c:3:21: note:   with patt_25 = patt_19 >> 7;
> > > t.c:3:21: note:   extra pattern stmt: patt_19 = patt_28 h* 32897;
> > >
> > > which translates to
> > >
> > >         vpmulhuw        %ymm4, %ymm0, %ymm0
> > >         vpmulhuw        %ymm4, %ymm1, %ymm1
> > >         vpsrlw  $7, %ymm0, %ymm0
> > >         vpsrlw  $7, %ymm1, %ymm1
> > >
> > > there's odd
> > >
> > >         vpand   %ymm0, %ymm3, %ymm0
> > >         vpand   %ymm1, %ymm3, %ymm1
> > >
> > > before (%ymm3 is all 0x00ff)
> > >
> > >         vpackuswb       %ymm1, %ymm0, %ymm0
> > >
> > > that's not visible in GIMPLE.  I guess aarch64 lacks a highpart multiply here?
> > > In any case, it seems that generic division expansion could be
> > > improved here? (choose_multiplier?)
> >
> > We do generate a multiply highpart here, but the patch avoids multiplies and
> > shifts entirely by creative use of the ISA.  Another reason I went for an
> > optab is costing.  The chosen operations are significantly cheaper on all
> > Arm uarches than shifts and multiplies.
> >
> > This means we get vectorization in some cases where the cost model would
> > correctly say it's too expensive to vectorize, particularly around double
> > precision.
> >
> > Thanks,
> > Tamar
> >
> > >
> > > Richard.
> > >
> > > > Richard.
> > > >
> > > > > Thanks,
> > > > > Tamar
> > > > >
> > > > > gcc/ChangeLog:
> > > > >
> > > > > 	* internal-fn.def (DIV_POW2_BITMASK): New.
> > > > > 	* optabs.def (udiv_pow2_bitmask_optab): New.
> > > > > 	* doc/md.texi: Document it.
> > > > > 	* tree-vect-patterns.cc (vect_recog_divmod_pattern): Recognize pattern.
> > > > >
> > > > > gcc/testsuite/ChangeLog:
> > > > >
> > > > > 	* gcc.dg/vect/vect-div-bitmask-1.c: New test.
> > > > > 	* gcc.dg/vect/vect-div-bitmask-2.c: New test.
> > > > > 	* gcc.dg/vect/vect-div-bitmask-3.c: New test.
> > > > > 	* gcc.dg/vect/vect-div-bitmask.h: New file.
> > > > >
> > > > > --- inline copy of patch --
> > > > > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> > > > > index f3619c505c025f158c2bc64756531877378b22e1..784c49d7d24cef7619e4d613f7b4f6e945866c38 100644
> > > > > --- a/gcc/doc/md.texi
> > > > > +++ b/gcc/doc/md.texi
> > > > > @@ -5588,6 +5588,18 @@ signed op0, op1;
> > > > >  op0 = op1 / (1 << imm);
> > > > >  @end smallexample
> > > > >
> > > > > +@cindex @code{udiv_pow2_bitmask@var{m2}} instruction pattern
> > > > > +@item @samp{udiv_pow2_bitmask@var{m2}}
> > > > > +@cindex @code{udiv_pow2_bitmask@var{m2}} instruction pattern
> > > > > +@itemx @samp{udiv_pow2_bitmask@var{m2}}
> > > > > +Unsigned vector division by an immediate that is equivalent to
> > > > > +@samp{2^(bitsize(m) / 2) - 1}.
> > > > > +@smallexample
> > > > > +unsigned short op0; op1;
> > > > > +@dots{}
> > > > > +op0 = op1 / 0xffU;
> > > > > +@end smallexample
> > > > > +
> > > > >  @cindex @code{vec_shl_insert_@var{m}} instruction pattern
> > > > >  @item @samp{vec_shl_insert_@var{m}}
> > > > >  Shift the elements in vector input operand 1 left one element (i.e.@:
> > > > > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> > > > > index d2d550d358606022b1cb44fa842f06e0be507bc3..a3e3cc1520f77683ebf6256898f916ed45de475f 100644
> > > > > --- a/gcc/internal-fn.def
> > > > > +++ b/gcc/internal-fn.def
> > > > > @@ -159,6 +159,8 @@ DEF_INTERNAL_OPTAB_FN (VEC_SHL_INSERT, ECF_CONST | ECF_NOTHROW,
> > > > >  		       vec_shl_insert, binary)
> > > > >
> > > > >  DEF_INTERNAL_OPTAB_FN (DIV_POW2, ECF_CONST | ECF_NOTHROW, sdiv_pow2, binary)
> > > > > +DEF_INTERNAL_OPTAB_FN (DIV_POW2_BITMASK, ECF_CONST | ECF_NOTHROW,
> > > > > +		       udiv_pow2_bitmask, unary)
> > > > >
> > > > >  DEF_INTERNAL_OPTAB_FN (FMS, ECF_CONST, fms, ternary)
> > > > >  DEF_INTERNAL_OPTAB_FN (FNMA, ECF_CONST, fnma, ternary)
> > > > > diff --git a/gcc/optabs.def b/gcc/optabs.def
> > > > > index 801310ebaa7d469520809bb7efed6820f8eb866b..3f0ac05ef5ad5aed8d6ca391f4eed71b0494e17f 100644
> > > > > --- a/gcc/optabs.def
> > > > > +++ b/gcc/optabs.def
> > > > > @@ -372,6 +372,7 @@ OPTAB_D (smulhrs_optab, "smulhrs$a3")
> > > > >  OPTAB_D (umulhs_optab, "umulhs$a3")
> > > > >  OPTAB_D (umulhrs_optab, "umulhrs$a3")
> > > > >  OPTAB_D (sdiv_pow2_optab, "sdiv_pow2$a3")
> > > > > +OPTAB_D (udiv_pow2_bitmask_optab, "udiv_pow2_bitmask$a2")
> > > > >  OPTAB_D (vec_pack_sfix_trunc_optab, "vec_pack_sfix_trunc_$a")
> > > > >  OPTAB_D (vec_pack_ssat_optab, "vec_pack_ssat_$a")
> > > > >  OPTAB_D (vec_pack_trunc_optab, "vec_pack_trunc_$a")
> > > > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-1.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-1.c
> > > > > new file mode 100644
> > > > > index 0000000000000000000000000000000000000000..a7ea3cce4764239c5d281a8f0bead1f6a452de3f
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-1.c
> > > > > @@ -0,0 +1,25 @@
> > > > > +/* { dg-require-effective-target vect_int } */
> > > > > +
> > > > > +#include <stdint.h>
> > > > > +#include "tree-vect.h"
> > > > > +
> > > > > +#define N 50
> > > > > +#define TYPE uint8_t
> > > > > +
> > > > > +__attribute__((noipa, noinline, optimize("O1")))
> > > > > +void fun1(TYPE* restrict pixel, TYPE level, int n)
> > > > > +{
> > > > > +  for (int i = 0; i < n; i+=1)
> > > > > +    pixel[i] = (pixel[i] * level) / 0xff;
> > > > > +}
> > > > > +
> > > > > +__attribute__((noipa, noinline, optimize("O3")))
> > > > > +void fun2(TYPE* restrict pixel, TYPE level, int n)
> > > > > +{
> > > > > +  for (int i = 0; i < n; i+=1)
> > > > > +    pixel[i] = (pixel[i] * level) / 0xff;
> > > > > +}
> > > > > +
> > > > > +#include "vect-div-bitmask.h"
> > > > > +
> > > > > +/* { dg-final { scan-tree-dump "vect_recog_divmod_pattern: detected" "vect" } } */
> > > > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-2.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-2.c
> > > > > new file mode 100644
> > > > > index 0000000000000000000000000000000000000000..009e16e1b36497e5724410d9843f1ce122b26dda
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-2.c
> > > > > @@ -0,0 +1,25 @@
> > > > > +/* { dg-require-effective-target vect_int } */
> > > > > +
> > > > > +#include <stdint.h>
> > > > > +#include "tree-vect.h"
> > > > > +
> > > > > +#define N 50
> > > > > +#define TYPE uint16_t
> > > > > +
> > > > > +__attribute__((noipa, noinline, optimize("O1")))
> > > > > +void fun1(TYPE* restrict pixel, TYPE level, int n)
> > > > > +{
> > > > > +  for (int i = 0; i < n; i+=1)
> > > > > +    pixel[i] = (pixel[i] * level) / 0xffffU;
> > > > > +}
> > > > > +
> > > > > +__attribute__((noipa, noinline, optimize("O3")))
> > > > > +void fun2(TYPE* restrict pixel, TYPE level, int n)
> > > > > +{
> > > > > +  for (int i = 0; i < n; i+=1)
> > > > > +    pixel[i] = (pixel[i] * level) / 0xffffU;
> > > > > +}
> > > > > +
> > > > > +#include "vect-div-bitmask.h"
> > > > > +
> > > > > +/* { dg-final { scan-tree-dump "vect_recog_divmod_pattern: detected" "vect" } } */
> > > > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-3.c b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-3.c
> > > > > new file mode 100644
> > > > > index 0000000000000000000000000000000000000000..bf35a0bda8333c418e692d94220df849cc47930b
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-3.c
> > > > > @@ -0,0 +1,26 @@
> > > > > +/* { dg-require-effective-target vect_int } */
> > > > > +/* { dg-additional-options "-fno-vect-cost-model" { target aarch64*-*-* } } */
> > > > > +
> > > > > +#include <stdint.h>
> > > > > +#include "tree-vect.h"
> > > > > +
> > > > > +#define N 50
> > > > > +#define TYPE uint32_t
> > > > > +
> > > > > +__attribute__((noipa, noinline, optimize("O1")))
> > > > > +void fun1(TYPE* restrict pixel, TYPE level, int n)
> > > > > +{
> > > > > +  for (int i = 0; i < n; i+=1)
> > > > > +    pixel[i] = (pixel[i] * (uint64_t)level) / 0xffffffffUL;
> > > > > +}
> > > > > +
> > > > > +__attribute__((noipa, noinline, optimize("O3")))
> > > > > +void fun2(TYPE* restrict pixel, TYPE level, int n)
> > > > > +{
> > > > > +  for (int i = 0; i < n; i+=1)
> > > > > +    pixel[i] = (pixel[i] * (uint64_t)level) / 0xffffffffUL;
> > > > > +}
> > > > > +
> > > > > +#include "vect-div-bitmask.h"
> > > > > +
> > > > > +/* { dg-final { scan-tree-dump "vect_recog_divmod_pattern: detected" "vect" } } */
> > > > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask.h b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask.h
> > > > > new file mode 100644
> > > > > index 0000000000000000000000000000000000000000..29a16739aa4b706616367bfd1832f28ebd07993e
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask.h
> > > > > @@ -0,0 +1,43 @@
> > > > > +#include <stdio.h>
> > > > > +
> > > > > +#ifndef N
> > > > > +#define N 65
> > > > > +#endif
> > > > > +
> > > > > +#ifndef TYPE
> > > > > +#define TYPE uint32_t
> > > > > +#endif
> > > > > +
> > > > > +#ifndef DEBUG
> > > > > +#define DEBUG 0
> > > > > +#endif
> > > > > +
> > > > > +#define BASE ((TYPE) -1 < 0 ? -126 : 4)
> > > > > +
> > > > > +int main ()
> > > > > +{
> > > > > +  TYPE a[N];
> > > > > +  TYPE b[N];
> > > > > +
> > > > > +  for (int i = 0; i < N; ++i)
> > > > > +    {
> > > > > +      a[i] = BASE + i * 13;
> > > > > +      b[i] = BASE + i * 13;
> > > > > +      if (DEBUG)
> > > > > +        printf ("%d: 0x%x\n", i, a[i]);
> > > > > +    }
> > > > > +
> > > > > +  fun1 (a, N / 2, N);
> > > > > +  fun2 (b, N / 2, N);
> > > > > +
> > > > > +  for (int i = 0; i < N; ++i)
> > > > > +    {
> > > > > +      if (DEBUG)
> > > > > +        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
> > > > > +
> > > > > +      if (a[i] != b[i])
> > > > > +        __builtin_abort ();
> > > > > +    }
> > > > > +  return 0;
> > > > > +}
> > > > > +
> > > > > diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> > > > > index 217bdfd7045a22578a35bb891a4318d741071872..a738558cb8d12296bff462d716310ca8d82957b5 100644
> > > > > --- a/gcc/tree-vect-patterns.cc
> > > > > +++ b/gcc/tree-vect-patterns.cc
> > > > > @@ -3558,6 +3558,33 @@ vect_recog_divmod_pattern (vec_info *vinfo,
> > > > >
> > > > >        return pattern_stmt;
> > > > >      }
> > > > > +  else if ((TYPE_UNSIGNED (itype) || tree_int_cst_sgn (oprnd1) != 1)
> > > > > +	   && rhs_code != TRUNC_MOD_EXPR)
> > > > > +    {
> > > > > +      wide_int icst = wi::to_wide (oprnd1);
> > > > > +      wide_int val = wi::add (icst, 1);
> > > > > +      int pow = wi::exact_log2 (val);
> > > > > +      if (pow == (prec / 2))
> > > > > +	{
> > > > > +	  /* Pattern detected.  */
> > > > > +	  vect_pattern_detected ("vect_recog_divmod_pattern", last_stmt);
> > > > > +
> > > > > +	  *type_out = vectype;
> > > > > +
> > > > > +	  /* Check if the target supports this internal function.  */
> > > > > +	  internal_fn ifn = IFN_DIV_POW2_BITMASK;
> > > > > +	  if (direct_internal_fn_supported_p (ifn, vectype, OPTIMIZE_FOR_SPEED))
> > > > > +	    {
> > > > > +	      tree var_div = vect_recog_temp_ssa_var (itype, NULL);
> > > > > +	      gimple *div_stmt = gimple_build_call_internal (ifn, 1, oprnd0);
> > > > > +	      gimple_call_set_lhs (div_stmt, var_div);
> > > > > +
> > > > > +	      gimple_set_location (div_stmt, gimple_location (last_stmt));
> > > > > +
> > > > > +	      return div_stmt;
> > > > > +	    }
> > > > > +	}
> > > > > +    }
> > > > >
> > > > >    if (prec > HOST_BITS_PER_WIDE_INT
> > > > >        || integer_zerop (oprnd1))
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > >
> > > --
> > > Richard Biener <rguenther@suse.de>
> > > SUSE Software Solutions Germany GmbH, Frankenstraße 146, 90461
> > > Nuernberg, Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald,
> > > Boudien Moerman; HRB 36809 (AG Nuernberg)
> >
> 
> --
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH, Frankenstraße 146, 90461
> Nuernberg, Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald,
> Boudien Moerman; HRB 36809 (AG Nuernberg)

Thread overview: 35+ messages
2022-06-09  4:39 Tamar Christina
2022-06-09  4:40 ` [PATCH 2/2]AArch64 aarch64: Add implementation for pow2 bitmask division Tamar Christina
2022-06-13  9:24 ` [PATCH 1/2]middle-end Support optimized division by pow2 bitmask Richard Biener
2022-06-13  9:39   ` Richard Biener
2022-06-13 10:09     ` Tamar Christina
2022-06-13 11:47       ` Richard Biener
2022-06-13 14:37         ` Tamar Christina [this message]
2022-06-14 13:18           ` Richard Biener
2022-06-14 13:38             ` Tamar Christina
2022-06-14 13:42             ` Richard Sandiford
2022-06-14 15:57               ` Tamar Christina
2022-06-14 16:09                 ` Richard Biener
2022-06-22  0:34                 ` Tamar Christina
2022-06-26 19:55                   ` Jeff Law
2022-09-23  9:33 ` [PATCH 1/4]middle-end Support not decomposing specific divisions during vectorization Tamar Christina
2022-09-23  9:33 ` [PATCH 2/4]AArch64 Add implementation for pow2 bitmask division Tamar Christina
2022-10-31 11:34   ` Tamar Christina
2022-11-09  8:33     ` Tamar Christina
2022-11-09 16:02     ` Kyrylo Tkachov
2022-09-23  9:33 ` [PATCH 3/4]AArch64 Add SVE2 " Tamar Christina
2022-10-31 11:34   ` Tamar Christina
2022-11-09  8:33     ` Tamar Christina
2022-11-12 12:17   ` Richard Sandiford
2022-09-23  9:34 ` [PATCH 4/4]AArch64 sve2: rewrite pack + NARROWB + NARROWB to NARROWB + NARROWT Tamar Christina
2022-10-31 11:34   ` Tamar Christina
2022-11-09  8:33     ` Tamar Christina
2022-11-12 12:25   ` Richard Sandiford
2022-11-12 12:33     ` Richard Sandiford
2022-09-26 10:39 ` [PATCH 1/4]middle-end Support not decomposing specific divisions during vectorization Richard Biener
2022-10-31 11:34   ` Tamar Christina
2022-10-31 17:12     ` Jeff Law
2022-11-08 17:36     ` Tamar Christina
2022-11-09  8:01       ` Richard Biener
2022-11-09  8:26         ` Tamar Christina
2022-11-09 10:37 ` Kyrylo Tkachov
