public inbox for gcc-patches@gcc.gnu.org
* Re: [PATCH] rs6000: Add optimizations for _mm_sad_epu8
@ 2022-01-07 19:40 David Edelsohn
  2022-01-07 20:57 ` Paul A. Clarke
  0 siblings, 1 reply; 5+ messages in thread
From: David Edelsohn @ 2022-01-07 19:40 UTC (permalink / raw)
  To: Paul A. Clarke, Segher Boessenkool; +Cc: GCC Patches

+#ifdef __LITTLE_ENDIAN__
+  /* Sum across four integers with two integer results.  */
+  asm ("vsum2sws %0,%1,%2" : "=v" (result) : "v" (vsum), "v" (zero));
+  /* Note: vec_sum2s could be used here, but on little-endian, vector
+     shifts are added that are not needed for this use-case.
+     A vector shift to correctly position the 32-bit integer results
+     (currently at [0] and [2]) to [1] and [3] would then need to be
+     swapped back again since the desired results are two 64-bit
+     integers ([1]|[0] and [3]|[2]).  Thus, no shift is performed.  */
+#else
   /* Sum across four integers with two integer results.  */
   result = vec_sum2s (vsum, (__vector signed int) zero);

If little-endian adds shifts to correct for the position and
big-endian does not, why not use the inline asm without the shifts for
both?  It seems confusing to add the inline asm only for LE instead of
always using it with the appropriate comment.

It's a good and valuable optimization for LE.  Fewer variants are less
fragile, easier to test and easier to maintain.  If you're going to go
to the trouble of using inline asm for LE, use it for both.
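Just to sketch what I mean (untested; the big-endian rotate amount is
simply carried over unchanged from the existing code path):

#include <altivec.h>

static inline __vector signed int
sad_sum2 (__vector signed int vsum, __vector signed int zero)
{
  __vector signed int result;
  /* Sum across four integers with two integer results.  */
  asm ("vsum2sws %0,%1,%2" : "=v" (result) : "v" (vsum), "v" (zero));
#ifndef __LITTLE_ENDIAN__
  /* Big-endian keeps its existing rotate to place the sums where
     _mm_sad_epu8 expects them.  */
  result = vec_sld (result, result, 6);
#endif
  return result;
}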

Thanks, David

* [PATCH] rs6000: Add optimizations for _mm_sad_epu8
@ 2021-10-22 17:28 Paul A. Clarke
  2021-11-19 18:09 ` Segher Boessenkool
  0 siblings, 1 reply; 5+ messages in thread
From: Paul A. Clarke @ 2021-10-22 17:28 UTC (permalink / raw)
  To: segher, gcc-patches; +Cc: wschmidt

The Power9 ISA added the `vabsdub` instruction, which is realized by the
`vec_absd` intrinsic.

Use `vec_absd` in the `_mm_sad_epu8` compatibility intrinsic when
`_ARCH_PWR9` is defined.
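
Purely for illustration (not part of the patch; the function name here is
made up), the same byte-wise absolute difference with and without the
Power9 instruction:

#include <altivec.h>

__vector unsigned char
bytewise_absdiff (__vector unsigned char a, __vector unsigned char b)
{
#ifdef _ARCH_PWR9
  return vec_absd (a, b);                           /* single vabsdub */
#else
  /* Pre-Power9 fallback: vmaxub / vminub / vsububm.  */
  return vec_sub (vec_max (a, b), vec_min (a, b));
#endif
}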

Also, the little-endian realization of `vec_sum2s` includes
two shifts in order to position the input and output to match
the big-endian element numbering that `vec_sum2s` is defined in terms of:
- Shift the second input vector left 12 bytes. In the current usage,
  that vector is `{0}`, so this shift is unnecessary, but is currently
  not eliminated under optimization.
- Shift the vector produced by the `vsum2sws` instruction left 4 bytes.
  The two words within each doubleword of this (shifted) result must then
  be explicitly swapped to match the semantics of `_mm_sad_epu8`,
  effectively reversing this shift.  So, this shift (and a subsequent swap)
  are unnecessary, but not currently removed under optimization.

Using `__builtin_altivec_vsum2sws` retains both shifts, so is not an
option for removing the shifts.

For little-endian, use the `vsum2sws` instruction directly, and
eliminate the explicit shift (swap).
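
To make the difference concrete, here is a rough sketch of what the two
variants amount to (illustrative only, following the description above,
not literal compiler output; the function names are made up):

#include <altivec.h>

/* What the vec_sum2s call expands to on little-endian, per the description
   above; the asm stands in for the vsum2sws the intrinsic generates.  */
__vector signed int
sum2s_le_expansion (__vector signed int vsum, __vector signed int zero)
{
  __vector signed int b, t;
  /* Pre-shift of the second input (a no-op here, since it is all zeros).  */
  b = vec_sld (zero, zero, 12);
  asm ("vsum2sws %0,%1,%2" : "=v" (t) : "v" (vsum), "v" (b));
  /* Post-shift of the result, which _mm_sad_epu8 then has to undo.  */
  return vec_sld (t, t, 4);
}

/* The direct instruction: no shifts, sums already at words [0] and [2].  */
__vector signed int
sum2s_direct (__vector signed int vsum, __vector signed int zero)
{
  __vector signed int result;
  asm ("vsum2sws %0,%1,%2" : "=v" (result) : "v" (vsum), "v" (zero));
  return result;
}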

2021-10-22  Paul A. Clarke  <pc@us.ibm.com>

gcc
	* config/rs6000/emmintrin.h (_mm_sad_epu8): Use vec_absd
	when _ARCH_PWR9, optimize vec_sum2s when LE.
---
Tested on powerpc64le-linux on Power9, with and without `-mcpu=power9`,
and on powerpc/powerpc64-linux on Power8.
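
For anyone wanting a quick standalone check, something along these lines
works (illustrative only, not from the testsuite; it assumes the
little-endian layout targeted by the patch):

/* Build with something like:
     gcc -O2 -mvsx -DNO_WARN_X86_INTRINSICS sad.c  */
#include <emmintrin.h>
#include <stdio.h>
#include <string.h>

int
main (void)
{
  unsigned char a[16], b[16];
  unsigned long long want[2] = { 0, 0 }, got[2];
  for (int i = 0; i < 16; i++)
    {
      a[i] = (unsigned char) (i * 7 + 3);
      b[i] = (unsigned char) (250 - i * 11);
      unsigned int d = a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
      /* Scalar sum of absolute differences, one sum per 8-byte half.  */
      want[i / 8] += d;
    }
  __m128i va, vb, vr;
  memcpy (&va, a, 16);
  memcpy (&vb, b, 16);
  vr = _mm_sad_epu8 (va, vb);
  memcpy (got, &vr, 16);
  printf ("got %llu %llu, want %llu %llu\n", got[0], got[1], want[0], want[1]);
  return !(got[0] == want[0] && got[1] == want[1]);
}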

OK for trunk?

 gcc/config/rs6000/emmintrin.h | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/gcc/config/rs6000/emmintrin.h b/gcc/config/rs6000/emmintrin.h
index ab16c13c379e..c4758be0e777 100644
--- a/gcc/config/rs6000/emmintrin.h
+++ b/gcc/config/rs6000/emmintrin.h
@@ -2197,27 +2197,37 @@ extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __arti
 _mm_sad_epu8 (__m128i __A, __m128i __B)
 {
   __v16qu a, b;
-  __v16qu vmin, vmax, vabsdiff;
+  __v16qu vabsdiff;
   __v4si vsum;
   const __v4su zero = { 0, 0, 0, 0 };
   __v4si result;
 
   a = (__v16qu) __A;
   b = (__v16qu) __B;
-  vmin = vec_min (a, b);
-  vmax = vec_max (a, b);
+#ifndef _ARCH_PWR9
+  __v16qu vmin = vec_min (a, b);
+  __v16qu vmax = vec_max (a, b);
   vabsdiff = vec_sub (vmax, vmin);
+#else
+  vabsdiff = vec_absd (a, b);
+#endif
   /* Sum four groups of bytes into integers.  */
   vsum = (__vector signed int) vec_sum4s (vabsdiff, zero);
+#ifdef __LITTLE_ENDIAN__
+  /* Sum across four integers with two integer results.  */
+  asm ("vsum2sws %0,%1,%2" : "=v" (result) : "v" (vsum), "v" (zero));
+  /* Note: vec_sum2s could be used here, but on little-endian, vector
+     shifts are added that are not needed for this use-case.
+     A vector shift to correctly position the 32-bit integer results
+     (currently at [0] and [2]) to [1] and [3] would then need to be
+     swapped back again since the desired results are two 64-bit
+     integers ([1]|[0] and [3]|[2]).  Thus, no shift is performed.  */
+#else
   /* Sum across four integers with two integer results.  */
   result = vec_sum2s (vsum, (__vector signed int) zero);
   /* Rotate the sums into the correct position.  */
-#ifdef __LITTLE_ENDIAN__
-  result = vec_sld (result, result, 4);
-#else
   result = vec_sld (result, result, 6);
 #endif
-  /* Rotate the sums into the correct position.  */
   return (__m128i) result;
 }
 
-- 
2.27.0


