public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/51492] New: vectorizer generates unnecessary code
@ 2011-12-10  1:38 drepper.fsp at gmail dot com
  2011-12-12 10:25 ` [Bug tree-optimization/51492] vectorizer does not support saturated arithmetic patterns rguenth at gcc dot gnu.org
                   ` (18 more replies)
  0 siblings, 19 replies; 20+ messages in thread
From: drepper.fsp at gmail dot com @ 2011-12-10  1:38 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51492

             Bug #: 51492
           Summary: vectorizer generates unnecessary code
    Classification: Unclassified
           Product: gcc
           Version: 4.6.2
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: drepper.fsp@gmail.com
             Build: x86_64-linux


Compile this code with 4.6.2 on an x86-64 machine with -O3:

#define SIZE 65536
#define WSIZE 64
unsigned short head[SIZE] __attribute__((aligned(64)));

void
f(void)
{
  for (unsigned n = 0; n < SIZE; ++n) {
    unsigned short m = head[n];
    head[n] = (unsigned short)(m >= WSIZE ? m-WSIZE : 0);
  }
}

The result I see is this:

0000000000000000 <f>:
   0:    66 0f ef d2              pxor   %xmm2,%xmm2
   4:    b8 00 00 00 00           mov    $0x0,%eax
            5: R_X86_64_32    head
   9:    66 0f 6f 25 00 00 00     movdqa 0x0(%rip),%xmm4        # 11 <f+0x11>
  10:    00 
            d: R_X86_64_PC32    .LC0-0x4
  11:    66 0f 6f 1d 00 00 00     movdqa 0x0(%rip),%xmm3        # 19 <f+0x19>
  18:    00 
            15: R_X86_64_PC32    .LC1-0x4
  19:    0f 1f 80 00 00 00 00     nopl   0x0(%rax)
  20:    66 0f 6f 00              movdqa (%rax),%xmm0
  24:    66 0f 6f c8              movdqa %xmm0,%xmm1
  28:    66 0f d9 c4              psubusw %xmm4,%xmm0
  2c:    66 0f 75 c2              pcmpeqw %xmm2,%xmm0
  30:    66 0f fd cb              paddw  %xmm3,%xmm1
  34:    66 0f df c1              pandn  %xmm1,%xmm0
  38:    66 0f 7f 00              movdqa %xmm0,(%rax)
  3c:    48 83 c0 10              add    $0x10,%rax
  40:    48 3d 00 00 00 00        cmp    $0x0,%rax
            42: R_X86_64_32S    head+0x20000
  46:    75 d8                    jne    20 <f+0x20>
  48:    f3 c3                    repz retq 


There is a lot of unnecessary code.  The psubusw instruction alone is
sufficient.  The purpose of this instruction is to implement saturated
subtraction.  Why does gcc create all this extra code?  The code should just be

   movdqa (%rax), %xmm0
   psubusw %xmm1, %xmm0
   movdqa %xmm0, (%rax)

where %xmm1 has WSIZE in the 16-bit values.
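
As a rough scalar model of what is being asked for (a sketch, not code from the report): psubusw performs an unsigned saturating subtraction on each 16-bit lane, which is the same computation as the expression in the loop body. The helper names `sat_sub_u16` and `sub_wsize` are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: unsigned saturating subtraction on one 16-bit lane,
   i.e. what psubusw computes for each of its eight lanes. */
static uint16_t sat_sub_u16(uint16_t a, uint16_t b)
{
    return (uint16_t)(a >= b ? a - b : 0);
}

/* The loop from the report, written in terms of the helper. */
static void sub_wsize(uint16_t *v, size_t n, uint16_t wsize)
{
    for (size_t i = 0; i < n; ++i)
        v[i] = sat_sub_u16(v[i], wsize);
}
```

Values below WSIZE clamp to 0 instead of wrapping around, which is why no separate compare and mask should be needed once the saturating instruction is used.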


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug tree-optimization/51492] vectorizer does not support saturated arithmetic patterns
From: rguenth at gcc dot gnu.org @ 2011-12-12 10:25 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51492

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2011-12-12
            Summary|vectorizer generates        |vectorizer does not support
                   |unnecessary code            |saturated arithmetic
                   |                            |patterns
     Ever Confirmed|0                           |1
           Severity|normal                      |enhancement

--- Comment #1 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-12-12 10:23:20 UTC ---
It's vectorized as

  vect_var_.11_17 = MEM[base: D.1616_5, offset: 0B];
  vect_var_.12_19 = vect_var_.11_17 + { 65472, 65472, 65472, 65472, 65472, 65472, 65472, 65472 };
  vect_var_.14_22 = VEC_COND_EXPR <vect_var_.11_17 > { 63, 63, 63, 63, 63, 63, 63, 63 }, vect_var_.12_19, { 0, 0, 0, 0, 0, 0, 0, 0 }>;
  MEM[base: D.1616_5, offset: 0B] = vect_var_.14_22;

GCC has no notion that this is a "saturated subtraction".  If targets
have saturated arithmetic support, but only with vectors, then the vectorizer
pattern recognition would need to be enhanced and the targets eventually
should support expanding saturated arithmetic.

OTOH, middle-end support for saturated arithmetic needs to be improved;
scalar code could also benefit from optimization.  On the RTL level
we have [us]s_{plus,minus}, which the vectorizer could use (if implemented
on the target for vector types).
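
To see why the dump above is equivalent to a saturated subtraction, here is a small per-lane model in C. It is only a sketch: the function names are hypothetical, and each function models a single 16-bit lane rather than the whole vector.

```c
#include <stdint.h>

/* One lane of the vectorized GIMPLE: wrapping add of 65472, then a select. */
static uint16_t lane_add_then_select(uint16_t m)
{
    uint16_t t = (uint16_t)(m + 65472u);   /* m - 64 modulo 2^16 */
    return m > 63 ? t : 0;                 /* VEC_COND_EXPR <m > 63, t, 0> */
}

/* One lane of an unsigned saturating subtraction (us_minus in RTL terms). */
static uint16_t lane_us_minus(uint16_t m, uint16_t w)
{
    return (uint16_t)(m >= w ? m - w : 0);
}

/* Returns 1 if both forms agree for every 16-bit input when w == 64. */
static int forms_agree(void)
{
    for (uint32_t m = 0; m <= 65535u; ++m)
        if (lane_add_then_select((uint16_t)m) != lane_us_minus((uint16_t)m, 64))
            return 0;
    return 1;
}
```

The exhaustive check over all 65536 inputs is cheap and shows that the add-plus-select sequence carries no information beyond the single saturating operation.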


* [Bug tree-optimization/51492] vectorizer does not support saturated arithmetic patterns
From: drepper.fsp at gmail dot com @ 2012-01-08 18:57 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51492

--- Comment #2 from Ulrich Drepper <drepper.fsp at gmail dot com> 2012-01-08 18:56:48 UTC ---
Note, this code appears in gzip and therefore IIRC in specCPU (in
deflate.c:fill_window).  However, when compiling gzip myself, with that code
embedded in a larger function, I cannot get the optimization to apply at all.

If this bug is fixed and the optimization is applied, the spec numbers could go
up if specCPU is testing unzipping...


* [Bug tree-optimization/51492] vectorizer does not support saturated arithmetic patterns
From: rguenth at gcc dot gnu.org @ 2012-07-13  8:39 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51492

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |53947

--- Comment #3 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-07-13 08:39:43 UTC ---
Link to vectorizer missed-optimization meta-bug.


* [Bug tree-optimization/51492] vectorizer does not support saturated arithmetic patterns
From: pinskia at gcc dot gnu.org @ 2021-08-24 23:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51492

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|2016-01-04 00:00:00         |2021-8-24

--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
We do slightly better now, but are still not close:
        movdqa  (%rax), %xmm0
        addq    $16, %rax
        psubusw %xmm1, %xmm0
        paddw   %xmm1, %xmm0
        paddw   %xmm2, %xmm0
        movaps  %xmm0, -16(%rax)

Which is expanded from:
  vect__1.6_15 = MAX_EXPR <vect_m_6.5_3, { 64, 64, 64, 64, 64, 64, 64, 64 }>;
  vect__2.7_17 = vect__1.6_15 + { 65472, 65472, 65472, 65472, 65472, 65472, 65472, 65472 };

With -mavx2 we get:
        vpmaxuw (%rax), %ymm2, %ymm0
        addq    $32, %rax
        vpaddw  %ymm1, %ymm0, %ymm0
        vmovdqa %ymm0, -32(%rax)

Note that 65472 is -64 as an unsigned 16-bit value.

This shouldn't be too hard to detect and add, and even to lower back to
MAX_EXPR/PLUS_EXPR if us_minus does not exist.
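
A small sketch of the MAX_EXPR/PLUS_EXPR form, modelling one 16-bit lane in C. The function names are hypothetical; the point is only that max(m, w) followed by a wrapping add of -w gives the same result as a saturating subtraction.

```c
#include <stdint.h>

/* One lane of MAX_EXPR followed by PLUS_EXPR with the negated constant. */
static uint16_t lane_max_plus(uint16_t m, uint16_t w)
{
    uint16_t mx = m > w ? m : w;            /* MAX_EXPR <m, w> */
    uint16_t neg_w = (uint16_t)(0u - w);    /* 65472 when w == 64 */
    return (uint16_t)(mx + neg_w);          /* wrapping add modulo 2^16 */
}

/* Reference: unsigned saturating subtraction on one lane. */
static uint16_t lane_sat_sub(uint16_t m, uint16_t w)
{
    return (uint16_t)(m >= w ? m - w : 0);
}

/* Returns 1 if both forms agree for every 16-bit m with the given w. */
static int max_form_matches_sat_sub(uint16_t w)
{
    for (uint32_t m = 0; m <= 65535u; ++m)
        if (lane_max_plus((uint16_t)m, w) != lane_sat_sub((uint16_t)m, w))
            return 0;
    return 1;
}
```

This is also the shape a fallback could take on targets without a saturating subtract, since it needs only an unsigned max and an ordinary add.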

* [Bug tree-optimization/51492] vectorizer does not support saturated arithmetic patterns
From: pinskia at gcc dot gnu.org @ 2021-08-25  3:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51492

--- Comment #6 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
https://gcc.gnu.org/pipermail/gcc/2021-May/236015.html

* [Bug tree-optimization/51492] vectorizer does not support saturated arithmetic patterns
From: pan2.li at intel dot com @ 2024-02-01 13:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51492

--- Comment #7 from Li Pan <pan2.li at intel dot com> ---
Reproducer for the RISC-V backend; build with "-march=rv64gcv_zba_zbb_zbc_zbs
--param=riscv-autovec-preference=fixed-vlmax -Ofast -ffast-math":

typedef unsigned short uint16_t;

void AAA (uint16_t *x, uint16_t *y, unsigned wsize, unsigned count)
{
  unsigned m = 0, n = count;
  register uint16_t *p;

  p = x;

  do {
    m = *--p;
    *p = (uint16_t)(m >= wsize ? m-wsize : 0);
  } while (--n);

  n = wsize;
  p = y;

  do {
      m = *--p;
      *p = (uint16_t)(m >= wsize ? m-wsize : 0);
  } while (--n);
}

* [Bug tree-optimization/51492] vectorizer does not support saturated arithmetic patterns
From: juzhe.zhong at rivai dot ai @ 2024-02-01 13:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51492

--- Comment #8 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
The missing saturating vectorization makes RVV Clang 20% faster than RVV GCC
in a recent benchmark evaluation (the coremark pro zip-test).

I believe other targets should be affected in the same way.

I wonder how we should start to support it.  Or has somebody already
started on it?

* [Bug tree-optimization/51492] vectorizer does not support saturated arithmetic patterns
From: juzhe.zhong at rivai dot ai @ 2024-02-01 13:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51492

--- Comment #9 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
Ok. After investigation of LLVM:

Before loop vectorizer:

  %cond12 = tail call i32 @llvm.usub.sat.i32(i32 %conv5, i32 %wsize)
  %conv13 = trunc i32 %cond12 to i16

After loop vectorizer:

  %10 = call <16 x i32> @llvm.usub.sat.v16i32(<16 x i32> %9, <16 x i32> %broadcast.splat)
  %11 = trunc <16 x i32> %10 to <16 x i16>

I think GCC can follow this approach: first recognize the scalar saturation,
then let the loop vectorizer turn it into the vector saturating operation.
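
For readers less familiar with the LLVM IR, the scalar form before vectorization corresponds roughly to the C below. This is a sketch under assumptions: `usub_sat_u32` is a hypothetical stand-in for `llvm.usub.sat.i32`, and `%conv5` is assumed to be the zero-extended 16-bit element.

```c
#include <stdint.h>

/* Hypothetical scalar model of llvm.usub.sat.i32. */
static uint32_t usub_sat_u32(uint32_t a, uint32_t b)
{
    return a >= b ? a - b : 0;
}

/* Model of one loop iteration: widen, saturate in 32 bits, truncate. */
static uint16_t window_step(uint16_t elem, uint32_t wsize)
{
    uint32_t m = elem;                         /* assumed zext i16 -> i32 */
    uint32_t r = usub_sat_u32(m, wsize);       /* %cond12 */
    return (uint16_t)r;                        /* %conv13: trunc to i16 */
}
```

Because the result never exceeds the original 16-bit element, the final truncation cannot lose bits, which is what makes the widened form safe.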

* [Bug tree-optimization/51492] vectorizer does not support saturated arithmetic patterns
From: juzhe.zhong at rivai dot ai @ 2024-02-01 14:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51492

--- Comment #11 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
Hi, Tamar.

We are interested in supporting saturating and rounding.

We may need to support scalar first.

Do you have any suggestions?

Or are you already working on it?

Thanks.

* [Bug tree-optimization/51492] vectorizer does not support saturated arithmetic patterns
From: tnfchris at gcc dot gnu.org @ 2024-02-01 15:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51492

--- Comment #12 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to JuzheZhong from comment #11)
> Hi, Tamar.
> 
> We are interested in supporting saturating and rounding.

Awesome!

> 
> We may need to support scalar first.
> 
> Do you have any suggestions ?
> 
> Or you are already working on it?

No, atm we're not, it's on the backlog but haven't gotten to it so feel free to
do so.

The general conclusion of the thread is that we should introduce new internal
functions in the mid-end for this (also see
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112600 for some other scalar
examples).

So e.g. we'd have IFN_SAT_ADD etc. and new optabs.  By recognizing this on
scalar code you'll then automatically get autovect.

What I would do is create non-direct-optab IFNs, as in, have a default
fallback for architectures that don't have the optab implemented, and have
those that do use the optab.

I think we should be able to do better here in general even for scalar if we
know the operation is supposed to saturate like
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112600 shows.

This also simplifies optimizations because every target then has the same
GIMPLE representation for these operations.

The only outstanding thing is where to do this.  We obviously have to do so
before vectorization, but some of the saturation idioms require phi-opts
(https://godbolt.org/z/9oWP5vqee) while others can't be done in phi-opts; those
probably fit in match.pd or forwardprop.
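
To illustrate the distinction, here are two common ways of writing an unsigned saturating add in C. This is only a sketch and does not reproduce the godbolt example linked above. The function names are hypothetical, and which pass would actually see each form is an assumption, since earlier passes may rewrite either one.

```c
#include <stdint.h>

/* Branchy form: the two return paths merge through control flow, so the
   result reaches the optimizer as a PHI-like merge of UINT32_MAX and sum. */
static uint32_t sat_add_branchy(uint32_t x, uint32_t y)
{
    uint32_t sum = x + y;
    if (sum < x)            /* unsigned wrap-around means overflow */
        return UINT32_MAX;
    return sum;
}

/* Branchless form: a single straight-line expression with no control flow. */
static uint32_t sat_add_branchless(uint32_t x, uint32_t y)
{
    uint32_t sum = x + y;
    return sum | -(uint32_t)(sum < x);   /* OR with all ones on overflow */
}
```

Both compute the same value for all inputs, so a single internal function can represent either once it has been recognized.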

Any suggestions on where best to add the detection, richi?


> 
> Thanks.

* [Bug tree-optimization/51492] vectorizer does not support saturated arithmetic patterns
From: pan2.li at intel dot com @ 2024-02-02  1:04 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51492

--- Comment #13 from Li Pan <pan2.li at intel dot com> ---
I'll try to understand it and make it happen soon.

* [Bug tree-optimization/51492] vectorizer does not support saturated arithmetic patterns
From: tnfchris at gcc dot gnu.org @ 2024-02-02 11:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51492

--- Comment #14 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
Awesome! Feel free to reach out if you need any help.

It’s likely easier to start with add and sub and get things pipe cleaned and
expand incrementally than to try and do it all at once.

* [Bug tree-optimization/51492] vectorizer does not support saturated arithmetic patterns
From: pan2.li at intel dot com @ 2024-02-03  6:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51492

--- Comment #15 from Li Pan <pan2.li at intel dot com> ---
(In reply to Tamar Christina from comment #14)
> Awesome! Feel free to reach out if you need any help.
> 
> It’s likely easier to start with add and sub and get things pipe cleaned and
> expand incrementally than to try and do it all at once.

Cool, thanks in advance.

I will first try to wire SAT_ADD to a direct optab as a POC, following your
RFC and suggestion. It looks like at least match.pd and internal-fn.def will be
touched. I am learning how match.pd works right now.

* [Bug tree-optimization/51492] vectorizer does not support saturated arithmetic patterns
From: pan2.li at intel dot com @ 2024-02-06  1:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51492

--- Comment #16 from Li Pan <pan2.li at intel dot com> ---
I gave it a try as shown below and finally got the standard name "SAT_ADD".
Could you please help double-check that my understanding is correct?

Given the example code below:

typedef unsigned int uint32_t;

uint32_t
sat_add (uint32_t x, uint32_t y)
{
  return (x + y) | - ((x + y) < x);
}

Then add one simplify to match.pd and define a new DEF_INTERNAL_OPTAB_FN for
it. We then have the SAT_ADD representation after expand.

uint32_t sat_add (uint32_t x, uint32_t y)
{
  uint32_t _6;

;;   basic block 2, loop depth 0
;;    pred:       ENTRY
  _6 = .SAT_ADD (x_4(D), y_5(D)); [tail call]
  return _6;
;;    succ:       EXIT

}

If everything goes well, I will prepare the patch for it later. Thanks.

* [Bug tree-optimization/51492] vectorizer does not support saturated arithmetic patterns
From: tnfchris at gcc dot gnu.org @ 2024-02-06 22:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51492

--- Comment #17 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Li Pan from comment #16)
> I have a try like below and finally have the Standard Name "SAT_ADD". Could
> you please help to double-check if my understanding is correct?
> 
> Given below example code below:
> 
> typedef unsigned int uint32_t;
> 
> uint32_t
> sat_add (uint32_t x, uint32_t y)
> {
>   return (x + y) | - ((x + y) < x);
> }
> 
> And then add one simpify to match.pd and define new DEF_INTERNAL_OPTAB_FN
> for it. Then we have the SAT_ADD representation after expand.
> 
> uint32_t sat_add (uint32_t x, uint32_t y)
> {
>   uint32_t _6;
> 
> ;;   basic block 2, loop depth 0
> ;;    pred:       ENTRY
>   _6 = .SAT_ADD (x_4(D), y_5(D)); [tail call]
>   return _6;
> ;;    succ:       EXIT
> 
> }
> 
> If everything goes well, I will prepare the patch for it later. Thanks.

Yeah, that looks right. I assume above you mean before expand?

I believe saturating add is commutative but not associative, so you'd want to
add it to commutative_binary_fn_p in internal-fn.cc.

You may also want to provide some basic optimizations for it in match.pd, such
as .SAT_ADD (a, 0) = a, etc.
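
The properties mentioned here can be checked on small types. The sketch below uses hypothetical 8-bit helpers: it exercises commutativity and the `a + 0 = a` identity for the unsigned form, and the helpers allow a non-associativity counterexample to be shown with a signed saturating add.

```c
#include <stdint.h>

/* Hypothetical unsigned 8-bit saturating add. */
static uint8_t usat_add8(uint8_t a, uint8_t b)
{
    unsigned s = (unsigned)a + b;
    return (uint8_t)(s > UINT8_MAX ? UINT8_MAX : s);
}

/* Hypothetical signed 8-bit saturating add, clamping at both ends. */
static int8_t ssat_add8(int8_t a, int8_t b)
{
    int s = a + b;
    if (s > INT8_MAX)
        return INT8_MAX;
    if (s < INT8_MIN)
        return INT8_MIN;
    return (int8_t)s;
}

/* Returns 1 if usat_add8 is commutative and has 0 as identity for all inputs. */
static int usat_add8_commutes(void)
{
    for (unsigned a = 0; a <= UINT8_MAX; ++a) {
        if (usat_add8((uint8_t)a, 0) != a)
            return 0;
        for (unsigned b = 0; b <= UINT8_MAX; ++b)
            if (usat_add8((uint8_t)a, (uint8_t)b) != usat_add8((uint8_t)b, (uint8_t)a))
                return 0;
    }
    return 1;
}
```

With the signed helper, `(127 + 1) + (-1)` saturates to 127 first and ends at 126, while `127 + (1 + (-1))` stays at 127, so the grouping changes the result.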

* [Bug tree-optimization/51492] vectorizer does not support saturated arithmetic patterns
From: pan2.li at intel dot com @ 2024-02-07  0:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51492

--- Comment #18 from Li Pan <pan2.li at intel dot com> ---
Thanks for the confirmation.

Yes, it was before expand. I will prepare a patch for this, and it should
target gcc-15, I bet.

* [Bug tree-optimization/51492] vectorizer does not support saturated arithmetic patterns
From: cvs-commit at gcc dot gnu.org @ 2024-05-16 12:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51492

--- Comment #19 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Pan Li <panli@gcc.gnu.org>:

https://gcc.gnu.org/g:52b0536710ff3f3ace72ab00ce9ef6c630cd1183

commit r15-576-g52b0536710ff3f3ace72ab00ce9ef6c630cd1183
Author: Pan Li <pan2.li@intel.com>
Date:   Wed May 15 10:14:05 2024 +0800

    Internal-fn: Support new IFN SAT_ADD for unsigned scalar int

    This patch adds the middle-end representation for saturating add,
    i.e. the result of the add is set to the max on overflow.
    It matches patterns similar to the one below.

    SAT_ADD (x, y) => (x + y) | (-(TYPE)((TYPE)(x + y) < x))

    Take uint8_t as example, we will have:

    * SAT_ADD (1, 254)   => 255.
    * SAT_ADD (1, 255)   => 255.
    * SAT_ADD (2, 255)   => 255.
    * SAT_ADD (255, 255) => 255.
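
The uint8_t cases above can be reproduced with the pattern directly. This is a sketch with a hypothetical function name; note that in C the (TYPE) casts matter for types narrower than int, because `x + y` is otherwise computed in int and never wraps.

```c
#include <stdint.h>

/* The SAT_ADD pattern instantiated for TYPE = uint8_t. */
static uint8_t sat_add_u8(uint8_t x, uint8_t y)
{
    uint8_t sum = (uint8_t)(x + y);               /* (TYPE)(x + y), wraps mod 256 */
    return (uint8_t)(sum | -(uint8_t)(sum < x));  /* OR with all ones on overflow */
}
```

When `sum < x` holds, the negated comparison is all ones and the OR forces the result to 255. Otherwise it is 0 and the sum passes through unchanged.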

    Given below example for the unsigned scalar integer uint64_t:

    uint64_t sat_add_u64 (uint64_t x, uint64_t y)
    {
      return (x + y) | (- (uint64_t)((uint64_t)(x + y) < x));
    }

    Before this patch:
    uint64_t sat_add_uint64_t (uint64_t x, uint64_t y)
    {
      long unsigned int _1;
      _Bool _2;
      long unsigned int _3;
      long unsigned int _4;
      uint64_t _7;
      long unsigned int _10;
      __complex__ long unsigned int _11;

    ;;   basic block 2, loop depth 0
    ;;    pred:       ENTRY
      _11 = .ADD_OVERFLOW (x_5(D), y_6(D));
      _1 = REALPART_EXPR <_11>;
      _10 = IMAGPART_EXPR <_11>;
      _2 = _10 != 0;
      _3 = (long unsigned int) _2;
      _4 = -_3;
      _7 = _1 | _4;
      return _7;
    ;;    succ:       EXIT

    }

    After this patch:
    uint64_t sat_add_uint64_t (uint64_t x, uint64_t y)
    {
      uint64_t _7;

    ;;   basic block 2, loop depth 0
    ;;    pred:       ENTRY
      _7 = .SAT_ADD (x_5(D), y_6(D)); [tail call]
      return _7;
    ;;    succ:       EXIT
    }
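
The "before" dump can be read as the C sketch below, assuming a compiler that provides the `__builtin_add_overflow` extension (GCC and Clang do). The function name is hypothetical, and the mapping of variables to the REALPART/IMAGPART results is an interpretation of the dump rather than something stated in the commit.

```c
#include <stdint.h>

/* Rough C equivalent of the .ADD_OVERFLOW sequence in the dump. */
static uint64_t sat_add_via_overflow(uint64_t x, uint64_t y)
{
    uint64_t sum;                                  /* _1: wrapped sum */
    int ovf = __builtin_add_overflow(x, y, &sum);  /* _10: overflow flag */
    uint64_t mask = -(uint64_t)(ovf != 0);         /* _4: all ones on overflow */
    return sum | mask;                             /* _7 = _1 | _4 */
}
```

After the patch, this whole sequence is represented by the single `.SAT_ADD` call shown above.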

    The tests below pass with this patch:
    1. The riscv full regression tests.
    2. The x86 bootstrap tests.
    3. The x86 full regression tests.

            PR target/51492
            PR target/112600

    gcc/ChangeLog:

            * internal-fn.cc (commutative_binary_fn_p): Add type IFN_SAT_ADD
            to the return true switch case(s).
            * internal-fn.def (SAT_ADD):  Add new signed optab SAT_ADD.
            * match.pd: Add unsigned SAT_ADD match(es).
            * optabs.def (OPTAB_NL): Remove fixed-point limitation for
            us/ssadd.
            * tree-ssa-math-opts.cc (gimple_unsigned_integer_sat_add): New
            extern func decl generated in match.pd match.
            (match_saturation_arith): New func impl to match the saturation
            arith.
            (math_opts_dom_walker::after_dom_children): Try match saturation
            arith when IOR expr.

    Signed-off-by: Pan Li <pan2.li@intel.com>

* [Bug tree-optimization/51492] vectorizer does not support saturated arithmetic patterns
From: cvs-commit at gcc dot gnu.org @ 2024-05-16 12:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51492

--- Comment #20 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Pan Li <panli@gcc.gnu.org>:

https://gcc.gnu.org/g:d4dee347b3fe1982bab26485ff31cd039c9df010

commit r15-577-gd4dee347b3fe1982bab26485ff31cd039c9df010
Author: Pan Li <pan2.li@intel.com>
Date:   Wed May 15 10:14:06 2024 +0800

    Vect: Support new IFN SAT_ADD for unsigned vector int

    For vectorization, we leverage the existing vect pattern recog to find
    the pattern similar to scalar and let the vectorizer perform
    the rest for the standard name usadd<mode>3 in vector mode.
    The riscv vector backend has the insn "Vector Single-Width Saturating
    Add and Subtract", which can be leveraged when expanding usadd<mode>3
    in vector mode.  For example:

    void vec_sat_add_u64 (uint64_t *out, uint64_t *x, uint64_t *y, unsigned n)
    {
      unsigned i;

      for (i = 0; i < n; i++)
        out[i] = (x[i] + y[i]) | (- (uint64_t)((uint64_t)(x[i] + y[i]) < x[i]));
    }

    Before this patch:
    void vec_sat_add_u64 (uint64_t *out, uint64_t *x, uint64_t *y, unsigned n)
    {
      ...
      _80 = .SELECT_VL (ivtmp_78, POLY_INT_CST [2, 2]);
      ivtmp_58 = _80 * 8;
      vect__4.7_61 = .MASK_LEN_LOAD (vectp_x.5_59, 64B, { -1, ... }, _80, 0);
      vect__6.10_65 = .MASK_LEN_LOAD (vectp_y.8_63, 64B, { -1, ... }, _80, 0);
      vect__7.11_66 = vect__4.7_61 + vect__6.10_65;
      mask__8.12_67 = vect__4.7_61 > vect__7.11_66;
      vect__12.15_72 = .VCOND_MASK (mask__8.12_67, { 18446744073709551615,
        ... }, vect__7.11_66);
      .MASK_LEN_STORE (vectp_out.16_74, 64B, { -1, ... }, _80, 0,
        vect__12.15_72);
      vectp_x.5_60 = vectp_x.5_59 + ivtmp_58;
      vectp_y.8_64 = vectp_y.8_63 + ivtmp_58;
      vectp_out.16_75 = vectp_out.16_74 + ivtmp_58;
      ivtmp_79 = ivtmp_78 - _80;
      ...
    }

    After this patch:
    void vec_sat_add_u64 (uint64_t *out, uint64_t *x, uint64_t *y, unsigned n)
    {
      ...
      _62 = .SELECT_VL (ivtmp_60, POLY_INT_CST [2, 2]);
      ivtmp_46 = _62 * 8;
      vect__4.7_49 = .MASK_LEN_LOAD (vectp_x.5_47, 64B, { -1, ... }, _62, 0);
      vect__6.10_53 = .MASK_LEN_LOAD (vectp_y.8_51, 64B, { -1, ... }, _62, 0);
      vect__12.11_54 = .SAT_ADD (vect__4.7_49, vect__6.10_53);
      .MASK_LEN_STORE (vectp_out.12_56, 64B, { -1, ... }, _62, 0,
        vect__12.11_54);
      ...
    }
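
A plain C reference for what each lane of the vector `.SAT_ADD` is expected to compute can help when comparing the two dumps. This is a sketch with hypothetical function names: `vec_sat_add_idiom` repeats the idiom from the example, and `vec_sat_add_ref` spells out the clamp explicitly.

```c
#include <stddef.h>
#include <stdint.h>

/* The idiom from the commit message, applied element by element. */
static void vec_sat_add_idiom(uint64_t *out, const uint64_t *x,
                              const uint64_t *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (x[i] + y[i]) | (-(uint64_t)((uint64_t)(x[i] + y[i]) < x[i]));
}

/* Reference semantics: clamp to UINT64_MAX when the sum wraps around. */
static void vec_sat_add_ref(uint64_t *out, const uint64_t *x,
                            const uint64_t *y, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t sum = x[i] + y[i];
        out[i] = sum < x[i] ? UINT64_MAX : sum;
    }
}
```

The "before" dump corresponds to the compare-and-select shape of the reference loop, and the "after" dump folds the add, compare and select into one operation per vector.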

    The test suites below pass with this patch.
    * The riscv full regression tests.
    * The x86 bootstrap tests.
    * The x86 full regression tests.

            PR target/51492
            PR target/112600

    gcc/ChangeLog:

            * tree-vect-patterns.cc (gimple_unsigned_integer_sat_add): New
            func decl generated by match.pd match.
            (vect_recog_sat_add_pattern): New func impl to recog the pattern
            for unsigned SAT_ADD.

    Signed-off-by: Pan Li <pan2.li@intel.com>
