Re: [PATCH] rs6000: Add optimizations for _mm_sad

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* Re: [PATCH] rs6000: Add optimizations for _mm_sad_epu8
@ 2022-01-07 19:40 David Edelsohn
  2022-01-07 20:57 ` Paul A. Clarke
  0 siblings, 1 reply; 5+ messages in thread
From: David Edelsohn @ 2022-01-07 19:40 UTC (permalink / raw)
  To: Paul A. Clarke, Segher Boessenkool; +Cc: GCC Patches

+#ifdef __LITTLE_ENDIAN__
+  /* Sum across four integers with two integer results.  */
+  asm ("vsum2sws %0,%1,%2" : "=v" (result) : "v" (vsum), "v" (zero));
+  /* Note: vec_sum2s could be used here, but on little-endian, vector
+     shifts are added that are not needed for this use-case.
+     A vector shift to correctly position the 32-bit integer results
+     (currently at [0] and [2]) to [1] and [3] would then need to be
+     swapped back again since the desired results are two 64-bit
+     integers ([1]|[0] and [3]|[2]).  Thus, no shift is performed.  */
+#else
   /* Sum across four integers with two integer results.  */
   result = vec_sum2s (vsum, (__vector signed int) zero);

If little-endian adds shifts to correct for the position and
big-endian does not, why not use the inline asm without the shifts for
both?  It seems confusing to add the inline asm only for LE instead of
always using it with the appropriate comment.

It's a good and valuable optimization for LE.  Fewer variants are less
fragile, easier to test and easier to maintain.  If you're going to go
to the trouble of using inline asm for LE, use it for both.

Thanks, David

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] rs6000: Add optimizations for _mm_sad_epu8
  2022-01-07 19:40 [PATCH] rs6000: Add optimizations for _mm_sad_epu8 David Edelsohn
@ 2022-01-07 20:57 ` Paul A. Clarke
  2022-01-07 21:03   ` David Edelsohn
  0 siblings, 1 reply; 5+ messages in thread
From: Paul A. Clarke @ 2022-01-07 20:57 UTC (permalink / raw)
  To: David Edelsohn; +Cc: Segher Boessenkool, GCC Patches

On Fri, Jan 07, 2022 at 02:40:51PM -0500, David Edelsohn via Gcc-patches wrote:
> +#ifdef __LITTLE_ENDIAN__
> +  /* Sum across four integers with two integer results.  */
> +  asm ("vsum2sws %0,%1,%2" : "=v" (result) : "v" (vsum), "v" (zero));
> +  /* Note: vec_sum2s could be used here, but on little-endian, vector
> +     shifts are added that are not needed for this use-case.
> +     A vector shift to correctly position the 32-bit integer results
> +     (currently at [0] and [2]) to [1] and [3] would then need to be
> +     swapped back again since the desired results are two 64-bit
> +     integers ([1]|[0] and [3]|[2]).  Thus, no shift is performed.  */
> +#else
>    /* Sum across four integers with two integer results.  */
>    result = vec_sum2s (vsum, (__vector signed int) zero);
> 
> If little-endian adds shifts to correct for the position and
> big-endian does not, why not use the inline asm without the shifts for
> both?  It seems confusing to add the inline asm only for LE instead of
> always using it with the appropriate comment.
> 
> It's a good and valuable optimization for LE.  Fewer variants are less
> fragile, easier to test and easier to maintain.  If you're going to go
> to the trouble of using inline asm for LE, use it for both.

BE (only) _does_ need a shift as seen on the next two lines after the
code snippet above:
  /* Sum across four integers with two integer results.  */
  result = vec_sum2s (vsum, (__vector signed int) zero);
  /* Rotate the sums into the correct position.  */
  result = vec_sld (result, result, 6);

So, when using {vec_sum2s;vec_sld}:
- LE gets an implicit shift in vec_sum2s which just needs to be undone
  by the vec_sld, and those shifts don't "cancel out" and get removed
  by GCC.
- BE does not get any implicit shifts, but needs one that comes from
  vec_sld.

Are you saying use the asm(vsum2sws) and then conditionally call
vec_sld on BE only?

I viewed this change as a temporary bandage unless and until GCC can
remove the unnecessary swaps.  It seems like the preferred code is
vec_sum2s/vec_sld, not the asm, but that currently is suboptimal for LE.

PC

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] rs6000: Add optimizations for _mm_sad_epu8
  2022-01-07 20:57 ` Paul A. Clarke
@ 2022-01-07 21:03   ` David Edelsohn
  0 siblings, 0 replies; 5+ messages in thread
From: David Edelsohn @ 2022-01-07 21:03 UTC (permalink / raw)
  To: Paul A. Clarke; +Cc: Segher Boessenkool, GCC Patches

On Fri, Jan 7, 2022 at 3:57 PM Paul A. Clarke <pc@us.ibm.com> wrote:
>
> On Fri, Jan 07, 2022 at 02:40:51PM -0500, David Edelsohn via Gcc-patches wrote:
> > +#ifdef __LITTLE_ENDIAN__
> > +  /* Sum across four integers with two integer results.  */
> > +  asm ("vsum2sws %0,%1,%2" : "=v" (result) : "v" (vsum), "v" (zero));
> > +  /* Note: vec_sum2s could be used here, but on little-endian, vector
> > +     shifts are added that are not needed for this use-case.
> > +     A vector shift to correctly position the 32-bit integer results
> > +     (currently at [0] and [2]) to [1] and [3] would then need to be
> > +     swapped back again since the desired results are two 64-bit
> > +     integers ([1]|[0] and [3]|[2]).  Thus, no shift is performed.  */
> > +#else
> >    /* Sum across four integers with two integer results.  */
> >    result = vec_sum2s (vsum, (__vector signed int) zero);
> >
> > If little-endian adds shifts to correct for the position and
> > big-endian does not, why not use the inline asm without the shifts for
> > both?  It seems confusing to add the inline asm only for LE instead of
> > always using it with the appropriate comment.
> >
> > It's a good and valuable optimization for LE.  Fewer variants are less
> > fragile, easier to test and easier to maintain.  If you're going to go
> > to the trouble of using inline asm for LE, use it for both.
>
> BE (only) _does_ need a shift as seen on the next two lines after the
> code snippet above:
>   /* Sum across four integers with two integer results.  */
>   result = vec_sum2s (vsum, (__vector signed int) zero);
>   /* Rotate the sums into the correct position.  */
>   result = vec_sld (result, result, 6);
>
> So, when using {vec_sum2s;vec_sld}:
> - LE gets an implicit shift in vec_sum2s which just needs to be undone
>   by the vec_sld, and those shifts don't "cancel out" and get removed
>   by GCC.
> - BE does not get any implicit shifts, but needs one that comes from
>   vec_sld.
>
> Are you saying use the asm(vsum2sws) and then conditionally call
> vec_sld on BE only?
>
> I viewed this change as a temporary bandage unless and until GCC can
> remove the unnecessary swaps.  It seems like the preferred code is
> vec_sum2s/vec_sld, not the asm, but that currently is suboptimal for LE.

Nevermind.  I thought that these patches had not been reviewed.

Thanks, David

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH] rs6000: Add optimizations for _mm_sad_epu8
@ 2021-10-22 17:28 Paul A. Clarke
  2021-11-19 18:09 ` Segher Boessenkool
  0 siblings, 1 reply; 5+ messages in thread
From: Paul A. Clarke @ 2021-10-22 17:28 UTC (permalink / raw)
  To: segher, gcc-patches; +Cc: wschmidt

Power9 ISA added `vabsdub` instruction which is realized in the
`vec_absd` instrinsic.

Use `vec_absd` for `_mm_sad_epu8` compatibility intrinsic, when
`_ARCH_PWR9`.

Also, the realization of `vec_sum2s` on little-endian includes
two shifts in order to position the input and output to match
the semantics of `vec_sum2s`:
- Shift the second input vector left 12 bytes. In the current usage,
  that vector is `{0}`, so this shift is unnecessary, but is currently
  not eliminated under optimization.
- Shift the vector produced by the `vsum2sws` instruction left 4 bytes.
  The two words within each doubleword of this (shifted) result must then
  be explicitly swapped to match the semantics of `_mm_sad_epu8`,
  effectively reversing this shift.  So, this shift (and a susequent swap)
  are unnecessary, but not currently removed under optimization.

Using `__builtin_altivec_vsum2sws` retains both shifts, so is not an
option for removing the shifts.

For little-endian, use the `vsum2sws` instruction directly, and
eliminate the explicit shift (swap).

2021-10-22  Paul A. Clarke  <pc@us.ibm.com>

gcc
	* config/rs6000/emmintrin.h (_mm_sad_epu8): Use vec_absd
	when _ARCH_PWR9, optimize vec_sum2s when LE.
---
Tested on powerpc64le-linux on Power9, with and without `-mcpu=power9`,
and on powerpc/powerpc64-linux on Power8.

OK for trunk?

 gcc/config/rs6000/emmintrin.h | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/gcc/config/rs6000/emmintrin.h b/gcc/config/rs6000/emmintrin.h
index ab16c13c379e..c4758be0e777 100644
--- a/gcc/config/rs6000/emmintrin.h
+++ b/gcc/config/rs6000/emmintrin.h
@@ -2197,27 +2197,37 @@ extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __arti
 _mm_sad_epu8 (__m128i __A, __m128i __B)
 {
   __v16qu a, b;
-  __v16qu vmin, vmax, vabsdiff;
+  __v16qu vabsdiff;
   __v4si vsum;
   const __v4su zero = { 0, 0, 0, 0 };
   __v4si result;
 
   a = (__v16qu) __A;
   b = (__v16qu) __B;
-  vmin = vec_min (a, b);
-  vmax = vec_max (a, b);
+#ifndef _ARCH_PWR9
+  __v16qu vmin = vec_min (a, b);
+  __v16qu vmax = vec_max (a, b);
   vabsdiff = vec_sub (vmax, vmin);
+#else
+  vabsdiff = vec_absd (a, b);
+#endif
   /* Sum four groups of bytes into integers.  */
   vsum = (__vector signed int) vec_sum4s (vabsdiff, zero);
+#ifdef __LITTLE_ENDIAN__
+  /* Sum across four integers with two integer results.  */
+  asm ("vsum2sws %0,%1,%2" : "=v" (result) : "v" (vsum), "v" (zero));
+  /* Note: vec_sum2s could be used here, but on little-endian, vector
+     shifts are added that are not needed for this use-case.
+     A vector shift to correctly position the 32-bit integer results
+     (currently at [0] and [2]) to [1] and [3] would then need to be
+     swapped back again since the desired results are two 64-bit
+     integers ([1]|[0] and [3]|[2]).  Thus, no shift is performed.  */
+#else
   /* Sum across four integers with two integer results.  */
   result = vec_sum2s (vsum, (__vector signed int) zero);
   /* Rotate the sums into the correct position.  */
-#ifdef __LITTLE_ENDIAN__
-  result = vec_sld (result, result, 4);
-#else
   result = vec_sld (result, result, 6);
 #endif
-  /* Rotate the sums into the correct position.  */
   return (__m128i) result;
 }
 
-- 
2.27.0


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] rs6000: Add optimizations for _mm_sad_epu8
  2021-10-22 17:28 Paul A. Clarke
@ 2021-11-19 18:09 ` Segher Boessenkool
  0 siblings, 0 replies; 5+ messages in thread
From: Segher Boessenkool @ 2021-11-19 18:09 UTC (permalink / raw)
  To: Paul A. Clarke; +Cc: gcc-patches, wschmidt

Hi!

On Fri, Oct 22, 2021 at 12:28:49PM -0500, Paul A. Clarke wrote:
> Power9 ISA added `vabsdub` instruction which is realized in the
> `vec_absd` instrinsic.
> 
> Use `vec_absd` for `_mm_sad_epu8` compatibility intrinsic, when
> `_ARCH_PWR9`.
> 
> Also, the realization of `vec_sum2s` on little-endian includes
> two shifts in order to position the input and output to match
> the semantics of `vec_sum2s`:
> - Shift the second input vector left 12 bytes. In the current usage,
>   that vector is `{0}`, so this shift is unnecessary, but is currently
>   not eliminated under optimization.

The vsum2sws implementation uses an unspec, so there is almost no chance
of anything with it being optimised :-(

It rotates it right by 4 bytes btw, it's not a shift.

> - Shift the vector produced by the `vsum2sws` instruction left 4 bytes.
>   The two words within each doubleword of this (shifted) result must then
>   be explicitly swapped to match the semantics of `_mm_sad_epu8`,
>   effectively reversing this shift.  So, this shift (and a susequent swap)
>   are unnecessary, but not currently removed under optimization.

Rotate left by 4 -- same thing once you consider word 0 and 2 are set
to zeroes by the sum2sws.

Not sure why it is not optimised, what do the dump files say?  -dap and
I'd start looking at the combine dump.

> Using `__builtin_altivec_vsum2sws` retains both shifts, so is not an
> option for removing the shifts.
> 
> For little-endian, use the `vsum2sws` instruction directly, and
> eliminate the explicit shift (swap).
> 
> 2021-10-22  Paul A. Clarke  <pc@us.ibm.com>
> 
> gcc
> 	* config/rs6000/emmintrin.h (_mm_sad_epu8): Use vec_absd
> 	when _ARCH_PWR9, optimize vec_sum2s when LE.

Please don't break changelog lines early.

> -  vmin = vec_min (a, b);
> -  vmax = vec_max (a, b);
> +#ifndef _ARCH_PWR9
> +  __v16qu vmin = vec_min (a, b);
> +  __v16qu vmax = vec_max (a, b);
>    vabsdiff = vec_sub (vmax, vmin);
> +#else
> +  vabsdiff = vec_absd (a, b);
> +#endif

So hrm, maybe we should have the vec_absd macro (or the builtin) always,
just expanding to three insns if necessary.

Okay for trunk with approproate changelog and commit message changes.
Thanks!


Segher

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2022-01-07 21:03 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-07 19:40 [PATCH] rs6000: Add optimizations for _mm_sad_epu8 David Edelsohn
2022-01-07 20:57 ` Paul A. Clarke
2022-01-07 21:03   ` David Edelsohn
  -- strict thread matches above, loose matches on Subject: below --
2021-10-22 17:28 Paul A. Clarke
2021-11-19 18:09 ` Segher Boessenkool

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).