RE: [PATCH 1/2]middle-end Support optimized division by pow2 bitmask

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: Richard Biener <rguenther@suse.de>
To: Tamar Christina <Tamar.Christina@arm.com>
Cc: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,
	nd <nd@arm.com>,  Richard Sandiford <Richard.Sandiford@arm.com>
Subject: RE: [PATCH 1/2]middle-end Support optimized division by pow2 bitmask
Date: Mon, 13 Jun 2022 13:47:56 +0200 (CEST)	[thread overview]
Message-ID: <sn5qn872-834o-8149-ns0-46996nsp41q1@fhfr.qr> (raw)
In-Reply-To: <VI1PR08MB532574ACE2174CE64460E480FFAB9@VI1PR08MB5325.eurprd08.prod.outlook.com>

On Mon, 13 Jun 2022, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <rguenther@suse.de>
> > Sent: Monday, June 13, 2022 10:39 AM
> > To: Tamar Christina <Tamar.Christina@arm.com>
> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Sandiford
> > <Richard.Sandiford@arm.com>
> > Subject: Re: [PATCH 1/2]middle-end Support optimized division by pow2
> > bitmask
> > 
> > On Mon, 13 Jun 2022, Richard Biener wrote:
> > 
> > > On Thu, 9 Jun 2022, Tamar Christina wrote:
> > >
> > > > Hi All,
> > > >
> > > > In plenty of image and video processing code it's common to modify
> > > > pixel values by a widening operation and then scale them back into range
> > by dividing by 255.
> > > >
> > > > This patch adds an optab to allow us to emit an optimized sequence
> > > > when doing an unsigned division that is equivalent to:
> > > >
> > > >    x = y / (2 ^ (bitsize (y)/2)-1
> > > >
> > > > Bootstrapped Regtested on aarch64-none-linux-gnu,
> > > > x86_64-pc-linux-gnu and no issues.
> > > >
> > > > Ok for master?
> > >
> > > Looking at 2/2 it seems that this is the wrong way to attack the
> > > problem.  The ISA doesn't have such instruction so adding an optab
> > > looks premature.  I suppose that there's no unsigned vector integer
> > > division and thus we open-code that in a different way?  Isn't the
> > > correct thing then to fixup that open-coding if it is more efficient?
> > 
> 
> The problem is that even if you fixup the open-coding it would need to
> be something target specific? The sequence of instructions we generate
> don't have a GIMPLE representation.  So whatever is generated I'd have to fixup
> in RTL then.

What's the operation that doesn't have a GIMPLE representation?

I think for costing you could resort to the *_cost functions as used
by synth_mult and friends.

> The problem with this is that it seemed fragile. We generate from the
> Vectorizer:
> 
>   vect__3.8_35 = MEM <vector(16) unsigned char> [(uint8_t *)_21];
>   vect_patt_28.9_37 = WIDEN_MULT_LO_EXPR <vect__3.8_35, vect_cst__36>;
>   vect_patt_28.9_38 = WIDEN_MULT_HI_EXPR <vect__3.8_35, vect_cst__36>;
>   vect_patt_19.10_40 = vect_patt_28.9_37 h* { 32897, 32897, 32897, 32897, 32897, 32897, 32897, 32897 };
>   vect_patt_19.10_41 = vect_patt_28.9_38 h* { 32897, 32897, 32897, 32897, 32897, 32897, 32897, 32897 };
>   vect_patt_25.11_42 = vect_patt_19.10_40 >> 7;
>   vect_patt_25.11_43 = vect_patt_19.10_41 >> 7;
>   vect_patt_11.12_44 = VEC_PACK_TRUNC_EXPR <vect_patt_25.11_42, vect_patt_25.11_43>;
> 
> and if the magic constants change then we miss the optimization. I could rewrite the open coding to use
> shifts alone, but that might be a regression for some uarches I would imagine.

OK, so you do have a highpart multiply.  I suppose the pattern is too deep
to be recognized by combine?  What's the RTL good vs. bad before combine
of one of the expressions?

> > Btw, on x86 we use
> > 
> > t.c:3:21: note:   replacing earlier pattern patt_25 = patt_28 / 255;
> > t.c:3:21: note:   with patt_25 = patt_19 >> 7;
> > t.c:3:21: note:   extra pattern stmt: patt_19 = patt_28 h* 32897;
> > 
> > which translates to
> > 
> >         vpmulhuw        %ymm4, %ymm0, %ymm0
> >         vpmulhuw        %ymm4, %ymm1, %ymm1
> >         vpsrlw  $7, %ymm0, %ymm0
> >         vpsrlw  $7, %ymm1, %ymm1
> > 
> > there's odd
> > 
> >         vpand   %ymm0, %ymm3, %ymm0
> >         vpand   %ymm1, %ymm3, %ymm1
> > 
> > before (%ymm3 is all 0x00ff)
> > 
> >         vpackuswb       %ymm1, %ymm0, %ymm0
> > 
> > that's not visible in GIMPLE.  I guess aarch64 lacks a highpart multiply here?
> > In any case, it seems that generic division expansion could be improved
> > here? (choose_multiplier?)
> 
> We do generate multiply highpart here, but the patch completely avoids multiplies
> and shifts entirely by creative use of the ISA. Another reason I went for an optab is costing.
> The chosen operations are significantly cheaper on all Arm uarches than Shifts and multiply.
> 
> This means we get vectorization in some cases where the cost model would correctly say
> It's too expensive to vectorize. Particularly around double precision.
> 
> Thanks,
> Tamar
> 
> > 
> > Richard.
> > 
> > > Richard.
> > >
> > > > Thanks,
> > > > Tamar
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > > 	* internal-fn.def (DIV_POW2_BITMASK): New.
> > > > 	* optabs.def (udiv_pow2_bitmask_optab): New.
> > > > 	* doc/md.texi: Document it.
> > > > 	* tree-vect-patterns.cc (vect_recog_divmod_pattern): Recognize
> > pattern.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >
> > > > 	* gcc.dg/vect/vect-div-bitmask-1.c: New test.
> > > > 	* gcc.dg/vect/vect-div-bitmask-2.c: New test.
> > > > 	* gcc.dg/vect/vect-div-bitmask-3.c: New test.
> > > > 	* gcc.dg/vect/vect-div-bitmask.h: New file.
> > > >
> > > > --- inline copy of patch --
> > > > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi index
> > > >
> > f3619c505c025f158c2bc64756531877378b22e1..784c49d7d24cef7619e4d613f7
> > > > b4f6e945866c38 100644
> > > > --- a/gcc/doc/md.texi
> > > > +++ b/gcc/doc/md.texi
> > > > @@ -5588,6 +5588,18 @@ signed op0, op1;
> > > >  op0 = op1 / (1 << imm);
> > > >  @end smallexample
> > > >
> > > > +@cindex @code{udiv_pow2_bitmask@var{m2}} instruction pattern
> > @item
> > > > +@samp{udiv_pow2_bitmask@var{m2}} @cindex
> > > > +@code{udiv_pow2_bitmask@var{m2}} instruction pattern @itemx
> > > > +@samp{udiv_pow2_bitmask@var{m2}} Unsigned vector division by an
> > > > +immediate that is equivalent to
> > > > +@samp{2^(bitsize(m) / 2) - 1}.
> > > > +@smallexample
> > > > +unsigned short op0; op1;
> > > > +@dots{}
> > > > +op0 = op1 / 0xffU;
> > > > +@end smallexample
> > > > +
> > > >  @cindex @code{vec_shl_insert_@var{m}} instruction pattern  @item
> > > > @samp{vec_shl_insert_@var{m}}  Shift the elements in vector input
> > > > operand 1 left one element (i.e.@:
> > > > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def index
> > > >
> > d2d550d358606022b1cb44fa842f06e0be507bc3..a3e3cc1520f77683ebf6256898
> > > > f916ed45de475f 100644
> > > > --- a/gcc/internal-fn.def
> > > > +++ b/gcc/internal-fn.def
> > > > @@ -159,6 +159,8 @@ DEF_INTERNAL_OPTAB_FN (VEC_SHL_INSERT,
> > ECF_CONST | ECF_NOTHROW,
> > > >  		       vec_shl_insert, binary)
> > > >
> > > >  DEF_INTERNAL_OPTAB_FN (DIV_POW2, ECF_CONST | ECF_NOTHROW,
> > > > sdiv_pow2, binary)
> > > > +DEF_INTERNAL_OPTAB_FN (DIV_POW2_BITMASK, ECF_CONST |
> > ECF_NOTHROW,
> > > > +		       udiv_pow2_bitmask, unary)
> > > >
> > > >  DEF_INTERNAL_OPTAB_FN (FMS, ECF_CONST, fms, ternary)
> > > > DEF_INTERNAL_OPTAB_FN (FNMA, ECF_CONST, fnma, ternary) diff --git
> > > > a/gcc/optabs.def b/gcc/optabs.def index
> > > >
> > 801310ebaa7d469520809bb7efed6820f8eb866b..3f0ac05ef5ad5aed8d6ca391f
> > 4
> > > > eed71b0494e17f 100644
> > > > --- a/gcc/optabs.def
> > > > +++ b/gcc/optabs.def
> > > > @@ -372,6 +372,7 @@ OPTAB_D (smulhrs_optab, "smulhrs$a3")
> > OPTAB_D
> > > > (umulhs_optab, "umulhs$a3")  OPTAB_D (umulhrs_optab, "umulhrs$a3")
> > > > OPTAB_D (sdiv_pow2_optab, "sdiv_pow2$a3")
> > > > +OPTAB_D (udiv_pow2_bitmask_optab, "udiv_pow2_bitmask$a2")
> > > >  OPTAB_D (vec_pack_sfix_trunc_optab, "vec_pack_sfix_trunc_$a")
> > > > OPTAB_D (vec_pack_ssat_optab, "vec_pack_ssat_$a")  OPTAB_D
> > > > (vec_pack_trunc_optab, "vec_pack_trunc_$a") diff --git
> > > > a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-1.c
> > > > b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-1.c
> > > > new file mode 100644
> > > > index
> > > >
> > 0000000000000000000000000000000000000000..a7ea3cce4764239c5d281a8f0b
> > > > ead1f6a452de3f
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-1.c
> > > > @@ -0,0 +1,25 @@
> > > > +/* { dg-require-effective-target vect_int } */
> > > > +
> > > > +#include <stdint.h>
> > > > +#include "tree-vect.h"
> > > > +
> > > > +#define N 50
> > > > +#define TYPE uint8_t
> > > > +
> > > > +__attribute__((noipa, noinline, optimize("O1"))) void fun1(TYPE*
> > > > +restrict pixel, TYPE level, int n) {
> > > > +  for (int i = 0; i < n; i+=1)
> > > > +    pixel[i] = (pixel[i] * level) / 0xff; }
> > > > +
> > > > +__attribute__((noipa, noinline, optimize("O3"))) void fun2(TYPE*
> > > > +restrict pixel, TYPE level, int n) {
> > > > +  for (int i = 0; i < n; i+=1)
> > > > +    pixel[i] = (pixel[i] * level) / 0xff; }
> > > > +
> > > > +#include "vect-div-bitmask.h"
> > > > +
> > > > +/* { dg-final { scan-tree-dump "vect_recog_divmod_pattern:
> > > > +detected" "vect" } } */
> > > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-2.c
> > > > b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-2.c
> > > > new file mode 100644
> > > > index
> > > >
> > 0000000000000000000000000000000000000000..009e16e1b36497e5724410d98
> > 4
> > > > 3f1ce122b26dda
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-2.c
> > > > @@ -0,0 +1,25 @@
> > > > +/* { dg-require-effective-target vect_int } */
> > > > +
> > > > +#include <stdint.h>
> > > > +#include "tree-vect.h"
> > > > +
> > > > +#define N 50
> > > > +#define TYPE uint16_t
> > > > +
> > > > +__attribute__((noipa, noinline, optimize("O1"))) void fun1(TYPE*
> > > > +restrict pixel, TYPE level, int n) {
> > > > +  for (int i = 0; i < n; i+=1)
> > > > +    pixel[i] = (pixel[i] * level) / 0xffffU; }
> > > > +
> > > > +__attribute__((noipa, noinline, optimize("O3"))) void fun2(TYPE*
> > > > +restrict pixel, TYPE level, int n) {
> > > > +  for (int i = 0; i < n; i+=1)
> > > > +    pixel[i] = (pixel[i] * level) / 0xffffU; }
> > > > +
> > > > +#include "vect-div-bitmask.h"
> > > > +
> > > > +/* { dg-final { scan-tree-dump "vect_recog_divmod_pattern:
> > > > +detected" "vect" } } */
> > > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-3.c
> > > > b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-3.c
> > > > new file mode 100644
> > > > index
> > > >
> > 0000000000000000000000000000000000000000..bf35a0bda8333c418e692d942
> > 2
> > > > 0df849cc47930b
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-3.c
> > > > @@ -0,0 +1,26 @@
> > > > +/* { dg-require-effective-target vect_int } */
> > > > +/* { dg-additional-options "-fno-vect-cost-model" { target
> > > > +aarch64*-*-* } } */
> > > > +
> > > > +#include <stdint.h>
> > > > +#include "tree-vect.h"
> > > > +
> > > > +#define N 50
> > > > +#define TYPE uint32_t
> > > > +
> > > > +__attribute__((noipa, noinline, optimize("O1"))) void fun1(TYPE*
> > > > +restrict pixel, TYPE level, int n) {
> > > > +  for (int i = 0; i < n; i+=1)
> > > > +    pixel[i] = (pixel[i] * (uint64_t)level) / 0xffffffffUL; }
> > > > +
> > > > +__attribute__((noipa, noinline, optimize("O3"))) void fun2(TYPE*
> > > > +restrict pixel, TYPE level, int n) {
> > > > +  for (int i = 0; i < n; i+=1)
> > > > +    pixel[i] = (pixel[i] * (uint64_t)level) / 0xffffffffUL; }
> > > > +
> > > > +#include "vect-div-bitmask.h"
> > > > +
> > > > +/* { dg-final { scan-tree-dump "vect_recog_divmod_pattern:
> > > > +detected" "vect" } } */
> > > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask.h
> > > > b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask.h
> > > > new file mode 100644
> > > > index
> > > >
> > 0000000000000000000000000000000000000000..29a16739aa4b706616367bfd1
> > 8
> > > > 32f28ebd07993e
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask.h
> > > > @@ -0,0 +1,43 @@
> > > > +#include <stdio.h>
> > > > +
> > > > +#ifndef N
> > > > +#define N 65
> > > > +#endif
> > > > +
> > > > +#ifndef TYPE
> > > > +#define TYPE uint32_t
> > > > +#endif
> > > > +
> > > > +#ifndef DEBUG
> > > > +#define DEBUG 0
> > > > +#endif
> > > > +
> > > > +#define BASE ((TYPE) -1 < 0 ? -126 : 4)
> > > > +
> > > > +int main ()
> > > > +{
> > > > +  TYPE a[N];
> > > > +  TYPE b[N];
> > > > +
> > > > +  for (int i = 0; i < N; ++i)
> > > > +    {
> > > > +      a[i] = BASE + i * 13;
> > > > +      b[i] = BASE + i * 13;
> > > > +      if (DEBUG)
> > > > +        printf ("%d: 0x%x\n", i, a[i]);
> > > > +    }
> > > > +
> > > > +  fun1 (a, N / 2, N);
> > > > +  fun2 (b, N / 2, N);
> > > > +
> > > > +  for (int i = 0; i < N; ++i)
> > > > +    {
> > > > +      if (DEBUG)
> > > > +        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
> > > > +
> > > > +      if (a[i] != b[i])
> > > > +        __builtin_abort ();
> > > > +    }
> > > > +  return 0;
> > > > +}
> > > > +
> > > > diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> > > > index
> > > >
> > 217bdfd7045a22578a35bb891a4318d741071872..a738558cb8d12296bff462d71
> > 6
> > > > 310ca8d82957b5 100644
> > > > --- a/gcc/tree-vect-patterns.cc
> > > > +++ b/gcc/tree-vect-patterns.cc
> > > > @@ -3558,6 +3558,33 @@ vect_recog_divmod_pattern (vec_info *vinfo,
> > > >
> > > >        return pattern_stmt;
> > > >      }
> > > > +  else if ((TYPE_UNSIGNED (itype) || tree_int_cst_sgn (oprnd1) != 1)
> > > > +	   && rhs_code != TRUNC_MOD_EXPR)
> > > > +    {
> > > > +      wide_int icst = wi::to_wide (oprnd1);
> > > > +      wide_int val = wi::add (icst, 1);
> > > > +      int pow = wi::exact_log2 (val);
> > > > +      if (pow == (prec / 2))
> > > > +	{
> > > > +	  /* Pattern detected.  */
> > > > +	  vect_pattern_detected ("vect_recog_divmod_pattern", last_stmt);
> > > > +
> > > > +	  *type_out = vectype;
> > > > +
> > > > +	  /* Check if the target supports this internal function.  */
> > > > +	  internal_fn ifn = IFN_DIV_POW2_BITMASK;
> > > > +	  if (direct_internal_fn_supported_p (ifn, vectype,
> > OPTIMIZE_FOR_SPEED))
> > > > +	    {
> > > > +	      tree var_div = vect_recog_temp_ssa_var (itype, NULL);
> > > > +	      gimple *div_stmt = gimple_build_call_internal (ifn, 1, oprnd0);
> > > > +	      gimple_call_set_lhs (div_stmt, var_div);
> > > > +
> > > > +	      gimple_set_location (div_stmt, gimple_location (last_stmt));
> > > > +
> > > > +	      return div_stmt;
> > > > +	    }
> > > > +	}
> > > > +    }
> > > >
> > > >    if (prec > HOST_BITS_PER_WIDE_INT
> > > >        || integer_zerop (oprnd1))
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> > 
> > --
> > Richard Biener <rguenther@suse.de>
> > SUSE Software Solutions Germany GmbH, Frankenstraße 146, 90461
> > Nuernberg, Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald,
> > Boudien Moerman; HRB 36809 (AG Nuernberg)
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstraße 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)

next prev parent reply	other threads:[~2022-06-13 11:47 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-09  4:39 Tamar Christina
2022-06-09  4:40 ` [PATCH 2/2]AArch64 aarch64: Add implementation for pow2 bitmask division Tamar Christina
2022-06-13  9:24 ` [PATCH 1/2]middle-end Support optimized division by pow2 bitmask Richard Biener
2022-06-13  9:39   ` Richard Biener
2022-06-13 10:09     ` Tamar Christina
2022-06-13 11:47       ` Richard Biener [this message]
2022-06-13 14:37         ` Tamar Christina
2022-06-14 13:18           ` Richard Biener
2022-06-14 13:38             ` Tamar Christina
2022-06-14 13:42             ` Richard Sandiford
2022-06-14 15:57               ` Tamar Christina
2022-06-14 16:09                 ` Richard Biener
2022-06-22  0:34                 ` Tamar Christina
2022-06-26 19:55                   ` Jeff Law
2022-09-23  9:33 ` [PATCH 1/4]middle-end Support not decomposing specific divisions during vectorization Tamar Christina
2022-09-23  9:33 ` [PATCH 2/4]AArch64 Add implementation for pow2 bitmask division Tamar Christina
2022-10-31 11:34   ` Tamar Christina
2022-11-09  8:33     ` Tamar Christina
2022-11-09 16:02     ` Kyrylo Tkachov
2022-09-23  9:33 ` [PATCH 3/4]AArch64 Add SVE2 " Tamar Christina
2022-10-31 11:34   ` Tamar Christina
2022-11-09  8:33     ` Tamar Christina
2022-11-12 12:17   ` Richard Sandiford
2022-09-23  9:34 ` [PATCH 4/4]AArch64 sve2: rewrite pack + NARROWB + NARROWB to NARROWB + NARROWT Tamar Christina
2022-10-31 11:34   ` Tamar Christina
2022-11-09  8:33     ` Tamar Christina
2022-11-12 12:25   ` Richard Sandiford
2022-11-12 12:33     ` Richard Sandiford
2022-09-26 10:39 ` [PATCH 1/4]middle-end Support not decomposing specific divisions during vectorization Richard Biener
2022-10-31 11:34   ` Tamar Christina
2022-10-31 17:12     ` Jeff Law
2022-11-08 17:36     ` Tamar Christina
2022-11-09  8:01       ` Richard Biener
2022-11-09  8:26         ` Tamar Christina
2022-11-09 10:37 ` Kyrylo Tkachov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=sn5qn872-834o-8149-ns0-46996nsp41q1@fhfr.qr \
    --to=rguenther@suse.de \
    --cc=Richard.Sandiford@arm.com \
    --cc=Tamar.Christina@arm.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=nd@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).