RE: [PATCH 1/2]middle-end Support optimized division by pow2 bitmask

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: Tamar Christina <Tamar.Christina@arm.com>
To: Richard Biener <rguenther@suse.de>
Cc: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,
	nd <nd@arm.com>, Richard Sandiford <Richard.Sandiford@arm.com>
Subject: RE: [PATCH 1/2]middle-end Support optimized division by pow2 bitmask
Date: Mon, 13 Jun 2022 10:09:12 +0000	[thread overview]
Message-ID: <VI1PR08MB532574ACE2174CE64460E480FFAB9@VI1PR08MB5325.eurprd08.prod.outlook.com> (raw)
In-Reply-To: <q77434s9-7825-n2r-p664-7ssqrn6no5n8@fhfr.qr>

> -----Original Message-----
> From: Richard Biener <rguenther@suse.de>
> Sent: Monday, June 13, 2022 10:39 AM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Sandiford
> <Richard.Sandiford@arm.com>
> Subject: Re: [PATCH 1/2]middle-end Support optimized division by pow2
> bitmask
> 
> On Mon, 13 Jun 2022, Richard Biener wrote:
> 
> > On Thu, 9 Jun 2022, Tamar Christina wrote:
> >
> > > Hi All,
> > >
> > > In plenty of image and video processing code it's common to modify
> > > pixel values by a widening operation and then scale them back into range
> by dividing by 255.
> > >
> > > This patch adds an optab to allow us to emit an optimized sequence
> > > when doing an unsigned division that is equivalent to:
> > >
> > >    x = y / (2 ^ (bitsize (y)/2)-1
> > >
> > > Bootstrapped Regtested on aarch64-none-linux-gnu,
> > > x86_64-pc-linux-gnu and no issues.
> > >
> > > Ok for master?
> >
> > Looking at 2/2 it seems that this is the wrong way to attack the
> > problem.  The ISA doesn't have such instruction so adding an optab
> > looks premature.  I suppose that there's no unsigned vector integer
> > division and thus we open-code that in a different way?  Isn't the
> > correct thing then to fixup that open-coding if it is more efficient?
> 

The problem is that even if you fixup the open-coding it would need to
be something target specific? The sequence of instructions we generate
don't have a GIMPLE representation.  So whatever is generated I'd have to fixup
in RTL then.

The problem with this is that it seemed fragile. We generate from the
Vectorizer:

  vect__3.8_35 = MEM <vector(16) unsigned char> [(uint8_t *)_21];
  vect_patt_28.9_37 = WIDEN_MULT_LO_EXPR <vect__3.8_35, vect_cst__36>;
  vect_patt_28.9_38 = WIDEN_MULT_HI_EXPR <vect__3.8_35, vect_cst__36>;
  vect_patt_19.10_40 = vect_patt_28.9_37 h* { 32897, 32897, 32897, 32897, 32897, 32897, 32897, 32897 };
  vect_patt_19.10_41 = vect_patt_28.9_38 h* { 32897, 32897, 32897, 32897, 32897, 32897, 32897, 32897 };
  vect_patt_25.11_42 = vect_patt_19.10_40 >> 7;
  vect_patt_25.11_43 = vect_patt_19.10_41 >> 7;
  vect_patt_11.12_44 = VEC_PACK_TRUNC_EXPR <vect_patt_25.11_42, vect_patt_25.11_43>;

and if the magic constants change then we miss the optimization. I could rewrite the open coding to use
shifts alone, but that might be a regression for some uarches I would imagine.

> Btw, on x86 we use
> 
> t.c:3:21: note:   replacing earlier pattern patt_25 = patt_28 / 255;
> t.c:3:21: note:   with patt_25 = patt_19 >> 7;
> t.c:3:21: note:   extra pattern stmt: patt_19 = patt_28 h* 32897;
> 
> which translates to
> 
>         vpmulhuw        %ymm4, %ymm0, %ymm0
>         vpmulhuw        %ymm4, %ymm1, %ymm1
>         vpsrlw  $7, %ymm0, %ymm0
>         vpsrlw  $7, %ymm1, %ymm1
> 
> there's odd
> 
>         vpand   %ymm0, %ymm3, %ymm0
>         vpand   %ymm1, %ymm3, %ymm1
> 
> before (%ymm3 is all 0x00ff)
> 
>         vpackuswb       %ymm1, %ymm0, %ymm0
> 
> that's not visible in GIMPLE.  I guess aarch64 lacks a highpart multiply here?
> In any case, it seems that generic division expansion could be improved
> here? (choose_multiplier?)

We do generate multiply highpart here, but the patch completely avoids multiplies
and shifts entirely by creative use of the ISA. Another reason I went for an optab is costing.
The chosen operations are significantly cheaper on all Arm uarches than Shifts and multiply.

This means we get vectorization in some cases where the cost model would correctly say
It's too expensive to vectorize. Particularly around double precision.

Thanks,
Tamar

> 
> Richard.
> 
> > Richard.
> >
> > > Thanks,
> > > Tamar
> > >
> > > gcc/ChangeLog:
> > >
> > > 	* internal-fn.def (DIV_POW2_BITMASK): New.
> > > 	* optabs.def (udiv_pow2_bitmask_optab): New.
> > > 	* doc/md.texi: Document it.
> > > 	* tree-vect-patterns.cc (vect_recog_divmod_pattern): Recognize
> pattern.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > > 	* gcc.dg/vect/vect-div-bitmask-1.c: New test.
> > > 	* gcc.dg/vect/vect-div-bitmask-2.c: New test.
> > > 	* gcc.dg/vect/vect-div-bitmask-3.c: New test.
> > > 	* gcc.dg/vect/vect-div-bitmask.h: New file.
> > >
> > > --- inline copy of patch --
> > > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi index
> > >
> f3619c505c025f158c2bc64756531877378b22e1..784c49d7d24cef7619e4d613f7
> > > b4f6e945866c38 100644
> > > --- a/gcc/doc/md.texi
> > > +++ b/gcc/doc/md.texi
> > > @@ -5588,6 +5588,18 @@ signed op0, op1;
> > >  op0 = op1 / (1 << imm);
> > >  @end smallexample
> > >
> > > +@cindex @code{udiv_pow2_bitmask@var{m2}} instruction pattern
> @item
> > > +@samp{udiv_pow2_bitmask@var{m2}} @cindex
> > > +@code{udiv_pow2_bitmask@var{m2}} instruction pattern @itemx
> > > +@samp{udiv_pow2_bitmask@var{m2}} Unsigned vector division by an
> > > +immediate that is equivalent to
> > > +@samp{2^(bitsize(m) / 2) - 1}.
> > > +@smallexample
> > > +unsigned short op0; op1;
> > > +@dots{}
> > > +op0 = op1 / 0xffU;
> > > +@end smallexample
> > > +
> > >  @cindex @code{vec_shl_insert_@var{m}} instruction pattern  @item
> > > @samp{vec_shl_insert_@var{m}}  Shift the elements in vector input
> > > operand 1 left one element (i.e.@:
> > > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def index
> > >
> d2d550d358606022b1cb44fa842f06e0be507bc3..a3e3cc1520f77683ebf6256898
> > > f916ed45de475f 100644
> > > --- a/gcc/internal-fn.def
> > > +++ b/gcc/internal-fn.def
> > > @@ -159,6 +159,8 @@ DEF_INTERNAL_OPTAB_FN (VEC_SHL_INSERT,
> ECF_CONST | ECF_NOTHROW,
> > >  		       vec_shl_insert, binary)
> > >
> > >  DEF_INTERNAL_OPTAB_FN (DIV_POW2, ECF_CONST | ECF_NOTHROW,
> > > sdiv_pow2, binary)
> > > +DEF_INTERNAL_OPTAB_FN (DIV_POW2_BITMASK, ECF_CONST |
> ECF_NOTHROW,
> > > +		       udiv_pow2_bitmask, unary)
> > >
> > >  DEF_INTERNAL_OPTAB_FN (FMS, ECF_CONST, fms, ternary)
> > > DEF_INTERNAL_OPTAB_FN (FNMA, ECF_CONST, fnma, ternary) diff --git
> > > a/gcc/optabs.def b/gcc/optabs.def index
> > >
> 801310ebaa7d469520809bb7efed6820f8eb866b..3f0ac05ef5ad5aed8d6ca391f
> 4
> > > eed71b0494e17f 100644
> > > --- a/gcc/optabs.def
> > > +++ b/gcc/optabs.def
> > > @@ -372,6 +372,7 @@ OPTAB_D (smulhrs_optab, "smulhrs$a3")
> OPTAB_D
> > > (umulhs_optab, "umulhs$a3")  OPTAB_D (umulhrs_optab, "umulhrs$a3")
> > > OPTAB_D (sdiv_pow2_optab, "sdiv_pow2$a3")
> > > +OPTAB_D (udiv_pow2_bitmask_optab, "udiv_pow2_bitmask$a2")
> > >  OPTAB_D (vec_pack_sfix_trunc_optab, "vec_pack_sfix_trunc_$a")
> > > OPTAB_D (vec_pack_ssat_optab, "vec_pack_ssat_$a")  OPTAB_D
> > > (vec_pack_trunc_optab, "vec_pack_trunc_$a") diff --git
> > > a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-1.c
> > > b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-1.c
> > > new file mode 100644
> > > index
> > >
> 0000000000000000000000000000000000000000..a7ea3cce4764239c5d281a8f0b
> > > ead1f6a452de3f
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-1.c
> > > @@ -0,0 +1,25 @@
> > > +/* { dg-require-effective-target vect_int } */
> > > +
> > > +#include <stdint.h>
> > > +#include "tree-vect.h"
> > > +
> > > +#define N 50
> > > +#define TYPE uint8_t
> > > +
> > > +__attribute__((noipa, noinline, optimize("O1"))) void fun1(TYPE*
> > > +restrict pixel, TYPE level, int n) {
> > > +  for (int i = 0; i < n; i+=1)
> > > +    pixel[i] = (pixel[i] * level) / 0xff; }
> > > +
> > > +__attribute__((noipa, noinline, optimize("O3"))) void fun2(TYPE*
> > > +restrict pixel, TYPE level, int n) {
> > > +  for (int i = 0; i < n; i+=1)
> > > +    pixel[i] = (pixel[i] * level) / 0xff; }
> > > +
> > > +#include "vect-div-bitmask.h"
> > > +
> > > +/* { dg-final { scan-tree-dump "vect_recog_divmod_pattern:
> > > +detected" "vect" } } */
> > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-2.c
> > > b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-2.c
> > > new file mode 100644
> > > index
> > >
> 0000000000000000000000000000000000000000..009e16e1b36497e5724410d98
> 4
> > > 3f1ce122b26dda
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-2.c
> > > @@ -0,0 +1,25 @@
> > > +/* { dg-require-effective-target vect_int } */
> > > +
> > > +#include <stdint.h>
> > > +#include "tree-vect.h"
> > > +
> > > +#define N 50
> > > +#define TYPE uint16_t
> > > +
> > > +__attribute__((noipa, noinline, optimize("O1"))) void fun1(TYPE*
> > > +restrict pixel, TYPE level, int n) {
> > > +  for (int i = 0; i < n; i+=1)
> > > +    pixel[i] = (pixel[i] * level) / 0xffffU; }
> > > +
> > > +__attribute__((noipa, noinline, optimize("O3"))) void fun2(TYPE*
> > > +restrict pixel, TYPE level, int n) {
> > > +  for (int i = 0; i < n; i+=1)
> > > +    pixel[i] = (pixel[i] * level) / 0xffffU; }
> > > +
> > > +#include "vect-div-bitmask.h"
> > > +
> > > +/* { dg-final { scan-tree-dump "vect_recog_divmod_pattern:
> > > +detected" "vect" } } */
> > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-3.c
> > > b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-3.c
> > > new file mode 100644
> > > index
> > >
> 0000000000000000000000000000000000000000..bf35a0bda8333c418e692d942
> 2
> > > 0df849cc47930b
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask-3.c
> > > @@ -0,0 +1,26 @@
> > > +/* { dg-require-effective-target vect_int } */
> > > +/* { dg-additional-options "-fno-vect-cost-model" { target
> > > +aarch64*-*-* } } */
> > > +
> > > +#include <stdint.h>
> > > +#include "tree-vect.h"
> > > +
> > > +#define N 50
> > > +#define TYPE uint32_t
> > > +
> > > +__attribute__((noipa, noinline, optimize("O1"))) void fun1(TYPE*
> > > +restrict pixel, TYPE level, int n) {
> > > +  for (int i = 0; i < n; i+=1)
> > > +    pixel[i] = (pixel[i] * (uint64_t)level) / 0xffffffffUL; }
> > > +
> > > +__attribute__((noipa, noinline, optimize("O3"))) void fun2(TYPE*
> > > +restrict pixel, TYPE level, int n) {
> > > +  for (int i = 0; i < n; i+=1)
> > > +    pixel[i] = (pixel[i] * (uint64_t)level) / 0xffffffffUL; }
> > > +
> > > +#include "vect-div-bitmask.h"
> > > +
> > > +/* { dg-final { scan-tree-dump "vect_recog_divmod_pattern:
> > > +detected" "vect" } } */
> > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-div-bitmask.h
> > > b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask.h
> > > new file mode 100644
> > > index
> > >
> 0000000000000000000000000000000000000000..29a16739aa4b706616367bfd1
> 8
> > > 32f28ebd07993e
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-div-bitmask.h
> > > @@ -0,0 +1,43 @@
> > > +#include <stdio.h>
> > > +
> > > +#ifndef N
> > > +#define N 65
> > > +#endif
> > > +
> > > +#ifndef TYPE
> > > +#define TYPE uint32_t
> > > +#endif
> > > +
> > > +#ifndef DEBUG
> > > +#define DEBUG 0
> > > +#endif
> > > +
> > > +#define BASE ((TYPE) -1 < 0 ? -126 : 4)
> > > +
> > > +int main ()
> > > +{
> > > +  TYPE a[N];
> > > +  TYPE b[N];
> > > +
> > > +  for (int i = 0; i < N; ++i)
> > > +    {
> > > +      a[i] = BASE + i * 13;
> > > +      b[i] = BASE + i * 13;
> > > +      if (DEBUG)
> > > +        printf ("%d: 0x%x\n", i, a[i]);
> > > +    }
> > > +
> > > +  fun1 (a, N / 2, N);
> > > +  fun2 (b, N / 2, N);
> > > +
> > > +  for (int i = 0; i < N; ++i)
> > > +    {
> > > +      if (DEBUG)
> > > +        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
> > > +
> > > +      if (a[i] != b[i])
> > > +        __builtin_abort ();
> > > +    }
> > > +  return 0;
> > > +}
> > > +
> > > diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> > > index
> > >
> 217bdfd7045a22578a35bb891a4318d741071872..a738558cb8d12296bff462d71
> 6
> > > 310ca8d82957b5 100644
> > > --- a/gcc/tree-vect-patterns.cc
> > > +++ b/gcc/tree-vect-patterns.cc
> > > @@ -3558,6 +3558,33 @@ vect_recog_divmod_pattern (vec_info *vinfo,
> > >
> > >        return pattern_stmt;
> > >      }
> > > +  else if ((TYPE_UNSIGNED (itype) || tree_int_cst_sgn (oprnd1) != 1)
> > > +	   && rhs_code != TRUNC_MOD_EXPR)
> > > +    {
> > > +      wide_int icst = wi::to_wide (oprnd1);
> > > +      wide_int val = wi::add (icst, 1);
> > > +      int pow = wi::exact_log2 (val);
> > > +      if (pow == (prec / 2))
> > > +	{
> > > +	  /* Pattern detected.  */
> > > +	  vect_pattern_detected ("vect_recog_divmod_pattern", last_stmt);
> > > +
> > > +	  *type_out = vectype;
> > > +
> > > +	  /* Check if the target supports this internal function.  */
> > > +	  internal_fn ifn = IFN_DIV_POW2_BITMASK;
> > > +	  if (direct_internal_fn_supported_p (ifn, vectype,
> OPTIMIZE_FOR_SPEED))
> > > +	    {
> > > +	      tree var_div = vect_recog_temp_ssa_var (itype, NULL);
> > > +	      gimple *div_stmt = gimple_build_call_internal (ifn, 1, oprnd0);
> > > +	      gimple_call_set_lhs (div_stmt, var_div);
> > > +
> > > +	      gimple_set_location (div_stmt, gimple_location (last_stmt));
> > > +
> > > +	      return div_stmt;
> > > +	    }
> > > +	}
> > > +    }
> > >
> > >    if (prec > HOST_BITS_PER_WIDE_INT
> > >        || integer_zerop (oprnd1))
> > >
> > >
> > >
> > >
> > >
> >
> >
> 
> --
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH, Frankenstraße 146, 90461
> Nuernberg, Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald,
> Boudien Moerman; HRB 36809 (AG Nuernberg)

next prev parent reply	other threads:[~2022-06-13 10:09 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-09  4:39 Tamar Christina
2022-06-09  4:40 ` [PATCH 2/2]AArch64 aarch64: Add implementation for pow2 bitmask division Tamar Christina
2022-06-13  9:24 ` [PATCH 1/2]middle-end Support optimized division by pow2 bitmask Richard Biener
2022-06-13  9:39   ` Richard Biener
2022-06-13 10:09     ` Tamar Christina [this message]
2022-06-13 11:47       ` Richard Biener
2022-06-13 14:37         ` Tamar Christina
2022-06-14 13:18           ` Richard Biener
2022-06-14 13:38             ` Tamar Christina
2022-06-14 13:42             ` Richard Sandiford
2022-06-14 15:57               ` Tamar Christina
2022-06-14 16:09                 ` Richard Biener
2022-06-22  0:34                 ` Tamar Christina
2022-06-26 19:55                   ` Jeff Law
2022-09-23  9:33 ` [PATCH 1/4]middle-end Support not decomposing specific divisions during vectorization Tamar Christina
2022-09-23  9:33 ` [PATCH 2/4]AArch64 Add implementation for pow2 bitmask division Tamar Christina
2022-10-31 11:34   ` Tamar Christina
2022-11-09  8:33     ` Tamar Christina
2022-11-09 16:02     ` Kyrylo Tkachov
2022-09-23  9:33 ` [PATCH 3/4]AArch64 Add SVE2 " Tamar Christina
2022-10-31 11:34   ` Tamar Christina
2022-11-09  8:33     ` Tamar Christina
2022-11-12 12:17   ` Richard Sandiford
2022-09-23  9:34 ` [PATCH 4/4]AArch64 sve2: rewrite pack + NARROWB + NARROWB to NARROWB + NARROWT Tamar Christina
2022-10-31 11:34   ` Tamar Christina
2022-11-09  8:33     ` Tamar Christina
2022-11-12 12:25   ` Richard Sandiford
2022-11-12 12:33     ` Richard Sandiford
2022-09-26 10:39 ` [PATCH 1/4]middle-end Support not decomposing specific divisions during vectorization Richard Biener
2022-10-31 11:34   ` Tamar Christina
2022-10-31 17:12     ` Jeff Law
2022-11-08 17:36     ` Tamar Christina
2022-11-09  8:01       ` Richard Biener
2022-11-09  8:26         ` Tamar Christina
2022-11-09 10:37 ` Kyrylo Tkachov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=VI1PR08MB532574ACE2174CE64460E480FFAB9@VI1PR08MB5325.eurprd08.prod.outlook.com \
    --to=tamar.christina@arm.com \
    --cc=Richard.Sandiford@arm.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=nd@arm.com \
    --cc=rguenther@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).