[RFC] middle-end: Extend CSE to understand vector extracts.

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [RFC] middle-end: Extend CSE to understand vector extracts.
@ 2021-01-04 12:18 Tamar Christina
  2021-01-04 13:33 ` Richard Biener
  2021-08-23  3:39 ` Jeff Law
  0 siblings, 2 replies; 6+ messages in thread
From: Tamar Christina @ 2021-01-04 12:18 UTC (permalink / raw)
  To: gcc-patches; +Cc: nd, rguenther, ian, law

[-- Attachment #1: Type: text/plain, Size: 7010 bytes --]

Hi All,

I am trying to get CSE to re-use constants already inside a vector rather than
re-materializing the constant again.

Basically consider the following case:

#include <stdint.h>
#include <arm_neon.h>

uint64_t
test (uint64_t a, uint64x2_t b, uint64x2_t* rt)
{
  uint64_t arr[2] = { 0x0942430810234076UL, 0x0942430810234076UL};
  uint64_t res = a | arr[0];
  uint64x2_t val = vld1q_u64 (arr);
  *rt = vaddq_u64 (val, b);
  return res;
}

The actual behavior is inconsequential however notice that the same constants
are used in the vector (arr and later val) and in the calculation of res.

The code we generate for this however is quite sub-optimal:

test:
        adrp    x2, .LC0
        sub     sp, sp, #16
        ldr     q1, [x2, #:lo12:.LC0]
        mov     x2, 16502
        movk    x2, 0x1023, lsl 16
        movk    x2, 0x4308, lsl 32
        add     v1.2d, v1.2d, v0.2d
        movk    x2, 0x942, lsl 48
        orr     x0, x0, x2
        str     q1, [x1]
        add     sp, sp, 16
        ret
.LC0:
        .xword  667169396713799798
        .xword  667169396713799798

Essentially we materialize the same constant twice.  The reason for this is
because the front-end lowers the constant extracted from arr[0] quite early on.
If you look into the result of fre you'll find

  <bb 2> :
  arr[0] = 667169396713799798;
  arr[1] = 667169396713799798;
  res_7 = a_6(D) | 667169396713799798;
  _16 = __builtin_aarch64_ld1v2di (&arr);
  _17 = VIEW_CONVERT_EXPR<uint64x2_t>(_16);
  _11 = b_10(D) + _17;
  *rt_12(D) = _11;
  arr ={v} {CLOBBER};
  return res_7;

Which makes sense for further optimization.  However come expand time if the
constant isn't representable in the target arch it will be assigned to a
register again.

(insn 8 5 9 2 (set (reg:V2DI 99)
        (const_vector:V2DI [
                (const_int 667169396713799798 [0x942430810234076]) repeated x2
            ])) "cse.c":7:12 -1
     (nil))
...
(insn 14 13 15 2 (set (reg:DI 103)
        (const_int 667169396713799798 [0x942430810234076])) "cse.c":8:12 -1
     (nil))
(insn 15 14 16 2 (set (reg:DI 102 [ res ])
        (ior:DI (reg/v:DI 96 [ a ])
            (reg:DI 103))) "cse.c":8:12 -1
     (nil))

And since it's out of the immediate range of the scalar instruction used
combine won't be able to do anything here.

This will then trigger the re-materialization of the constant twice.

So I figured the best place to handle this is in CSE since in some uArch it's
far cheaper to extract a constant from a vector than to materialize it.

Particularly doing it pre-RA has the benefit of allowing RA to decide whether it
needs to move the constant between register files or not as some uArch can
perform scalar operation both on the SIMD and GENREG side.

The issue is I don't know that much about CSE.  I have been reading through the
source and think I have a basic understanding of how it works but this email is
to see if I'm on the right track or not (to something that is acceptable
upstream).

My current patch for CSE is:

diff --git a/gcc/cse.c b/gcc/cse.c
index 36bcfc354d8..3cee53bed85 100644
--- a/gcc/cse.c
+++ b/gcc/cse.c
@@ -43,6 +43,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "rtl-iter.h"
 #include "regs.h"
 #include "function-abi.h"
+#include "expr.h"

 /* The basic idea of common subexpression elimination is to go
    through the code, keeping a record of expressions that would
@@ -4306,6 +4307,20 @@ find_sets_in_insn (rtx_insn *insn, struct set **psets)
         someplace else, so it isn't worth cse'ing.  */
       else if (GET_CODE (SET_SRC (x)) == CALL)
        ;
+      else if (GET_CODE (SET_SRC (x)) == CONST_VECTOR)
+       {
+         /* First register the vector itself.  */
+         sets[n_sets++].rtl = x;
+         rtx src = SET_SRC (x);
+         machine_mode elem_mode = GET_MODE_INNER (GET_MODE (src));
+          /* Go over the constants of the CONST_VECTOR in forward order, to
+            put them in the same order in the SETS array.  */
+         for (unsigned i = 0; i < const_vector_encoded_nelts (src) ; i++)
+           {
+             rtx y = gen_rtx_SUBREG (elem_mode, SET_DEST (x), i);
+             sets[n_sets++].rtl = PATTERN (gen_move_insn (y, CONST_VECTOR_ELT (src, i)));
+           }
+       }
       else
        sets[n_sets++].rtl = x;
     }
@@ -4545,7 +4560,14 @@ cse_insn (rtx_insn *insn)
   struct set *sets = (struct set *) 0;

   if (GET_CODE (x) == SET)
-    sets = XALLOCA (struct set);
+    {
+      /* For CONST_VECTOR we wants to be able to CSE the vector itself along with
+        elements inside the vector if the target says it's cheap.  */
+      if (GET_CODE (SET_SRC (x)) == CONST_VECTOR)
+       sets = XALLOCAVEC (struct set, const_vector_encoded_nelts (SET_SRC (x)) + 1);
+      else
+       sets = XALLOCA (struct set);
+    }
   else if (GET_CODE (x) == PARALLEL)
     sets = XALLOCAVEC (struct set, XVECLEN (x, 0));

--

This extends the sets that CSE uses to perform CSE to not only contain the
CONST_VECTOR but also the individual elements of the vector.

For each element I generate new RTL which models them as a constant being set
into a subreg of the original vector at the index of the element in the vector.

This so that the SRC is the constant we want to CSE and DEST contains the
SUBREG to extract from the vector.

It works as expected, the testcase above generates:

test:
        adrp    x2, .LC0
        sub     sp, sp, #16
        ldr     q1, [x2, #:lo12:.LC0]
        add     v0.2d, v1.2d, v0.2d
        fmov    x2, d1
        str     q0, [x1]
        orr     x0, x0, x2
        add     sp, sp, 16
        ret
.LC0:
        .xword  667169396713799798
        .xword  667169396713799798

The problem is that this is somewhat accidental.  CSE is single pass, presumably
because it currently only tracks SETs of constants where any of the duplicates
can be replaced by any alternative (it does pick the cheapest, but all the
alternatives are valid.).

This breaks with vectors because vectors can only be used as a SRC.  The code
does validate that the resulting CSE is valid, so this does not break.

but if the INSN are flipped in RTL:

(insn 14 13 15 2 (set (reg:DI 103)
        (const_int 667169396713799798 [0x942430810234076])) "cse.c":8:12 -1
     (nil))
...
(insn 8 5 9 2 (set (reg:V2DI 99)
        (const_vector:V2DI [
                (const_int 667169396713799798 [0x942430810234076]) repeated x2
            ])) "cse.c":7:12 -1
     (nil))

This no longer works, because it sees the constant version in insn 14 before it
sees insn 8.  When we find insn 8 we can tell that there is an instruction that
can be replaced by insn 8, but we don't know the original insn and so as a
consequence we can't update it.

so questions:

1) Does what I'm doing make sense?
2) Is there anyway to go from a SET to an insn?
3) If not, can I store the insn in table_elt and have cse_insn produce a worklist
   of additional insn that need to be re-examined?

Thanks,
Tamar

--- inline copy of patch -- 

-- 

[-- Attachment #2: rb13965.patch --]
[-- Type: text/x-diff, Size: 0 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC] middle-end: Extend CSE to understand vector extracts.
  2021-01-04 12:18 [RFC] middle-end: Extend CSE to understand vector extracts Tamar Christina
@ 2021-01-04 13:33 ` Richard Biener
  2021-01-04 13:57   ` Tamar Christina
  2021-08-23  3:39 ` Jeff Law
  1 sibling, 1 reply; 6+ messages in thread
From: Richard Biener @ 2021-01-04 13:33 UTC (permalink / raw)
  To: Tamar Christina; +Cc: gcc-patches, nd, ian, law

On Mon, 4 Jan 2021, Tamar Christina wrote:

> Hi All,
> 
> I am trying to get CSE to re-use constants already inside a vector rather than
> re-materializing the constant again.
> 
> Basically consider the following case:
> 
> #include <stdint.h>
> #include <arm_neon.h>
> 
> uint64_t
> test (uint64_t a, uint64x2_t b, uint64x2_t* rt)
> {
>   uint64_t arr[2] = { 0x0942430810234076UL, 0x0942430810234076UL};
>   uint64_t res = a | arr[0];
>   uint64x2_t val = vld1q_u64 (arr);
>   *rt = vaddq_u64 (val, b);
>   return res;
> }
> 
> The actual behavior is inconsequential however notice that the same constants
> are used in the vector (arr and later val) and in the calculation of res.
> 
> The code we generate for this however is quite sub-optimal:
> 
> test:
>         adrp    x2, .LC0
>         sub     sp, sp, #16
>         ldr     q1, [x2, #:lo12:.LC0]
>         mov     x2, 16502
>         movk    x2, 0x1023, lsl 16
>         movk    x2, 0x4308, lsl 32
>         add     v1.2d, v1.2d, v0.2d
>         movk    x2, 0x942, lsl 48
>         orr     x0, x0, x2
>         str     q1, [x1]
>         add     sp, sp, 16
>         ret
> .LC0:
>         .xword  667169396713799798
>         .xword  667169396713799798
> 
> Essentially we materialize the same constant twice.  The reason for this is
> because the front-end lowers the constant extracted from arr[0] quite early on.
> If you look into the result of fre you'll find
> 
>   <bb 2> :
>   arr[0] = 667169396713799798;
>   arr[1] = 667169396713799798;
>   res_7 = a_6(D) | 667169396713799798;
>   _16 = __builtin_aarch64_ld1v2di (&arr);
>   _17 = VIEW_CONVERT_EXPR<uint64x2_t>(_16);
>   _11 = b_10(D) + _17;
>   *rt_12(D) = _11;
>   arr ={v} {CLOBBER};
>   return res_7;
> 
> Which makes sense for further optimization.  However come expand time if the
> constant isn't representable in the target arch it will be assigned to a
> register again.
> 
> (insn 8 5 9 2 (set (reg:V2DI 99)
>         (const_vector:V2DI [
>                 (const_int 667169396713799798 [0x942430810234076]) repeated x2
>             ])) "cse.c":7:12 -1
>      (nil))
> ...
> (insn 14 13 15 2 (set (reg:DI 103)
>         (const_int 667169396713799798 [0x942430810234076])) "cse.c":8:12 -1
>      (nil))
> (insn 15 14 16 2 (set (reg:DI 102 [ res ])
>         (ior:DI (reg/v:DI 96 [ a ])
>             (reg:DI 103))) "cse.c":8:12 -1
>      (nil))
> 
> And since it's out of the immediate range of the scalar instruction used
> combine won't be able to do anything here.
> 
> This will then trigger the re-materialization of the constant twice.
> 
> So I figured the best place to handle this is in CSE since in some uArch it's
> far cheaper to extract a constant from a vector than to materialize it.
> 
> Particularly doing it pre-RA has the benefit of allowing RA to decide whether it
> needs to move the constant between register files or not as some uArch can
> perform scalar operation both on the SIMD and GENREG side.
> 
> The issue is I don't know that much about CSE.  I have been reading through the
> source and think I have a basic understanding of how it works but this email is
> to see if I'm on the right track or not (to something that is acceptable
> upstream).
> 
> My current patch for CSE is:
> 
> diff --git a/gcc/cse.c b/gcc/cse.c
> index 36bcfc354d8..3cee53bed85 100644
> --- a/gcc/cse.c
> +++ b/gcc/cse.c
> @@ -43,6 +43,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "rtl-iter.h"
>  #include "regs.h"
>  #include "function-abi.h"
> +#include "expr.h"
> 
>  /* The basic idea of common subexpression elimination is to go
>     through the code, keeping a record of expressions that would
> @@ -4306,6 +4307,20 @@ find_sets_in_insn (rtx_insn *insn, struct set **psets)
>          someplace else, so it isn't worth cse'ing.  */
>        else if (GET_CODE (SET_SRC (x)) == CALL)
>         ;
> +      else if (GET_CODE (SET_SRC (x)) == CONST_VECTOR)
> +       {
> +         /* First register the vector itself.  */
> +         sets[n_sets++].rtl = x;
> +         rtx src = SET_SRC (x);
> +         machine_mode elem_mode = GET_MODE_INNER (GET_MODE (src));
> +          /* Go over the constants of the CONST_VECTOR in forward order, to
> +            put them in the same order in the SETS array.  */
> +         for (unsigned i = 0; i < const_vector_encoded_nelts (src) ; i++)
> +           {
> +             rtx y = gen_rtx_SUBREG (elem_mode, SET_DEST (x), i);
> +             sets[n_sets++].rtl = PATTERN (gen_move_insn (y, CONST_VECTOR_ELT (src, i)));
> +           }
> +       }
>        else
>         sets[n_sets++].rtl = x;
>      }
> @@ -4545,7 +4560,14 @@ cse_insn (rtx_insn *insn)
>    struct set *sets = (struct set *) 0;
> 
>    if (GET_CODE (x) == SET)
> -    sets = XALLOCA (struct set);
> +    {
> +      /* For CONST_VECTOR we wants to be able to CSE the vector itself along with
> +        elements inside the vector if the target says it's cheap.  */
> +      if (GET_CODE (SET_SRC (x)) == CONST_VECTOR)
> +       sets = XALLOCAVEC (struct set, const_vector_encoded_nelts (SET_SRC (x)) + 1);
> +      else
> +       sets = XALLOCA (struct set);
> +    }
>    else if (GET_CODE (x) == PARALLEL)
>      sets = XALLOCAVEC (struct set, XVECLEN (x, 0));
> 
> --
> 
> This extends the sets that CSE uses to perform CSE to not only contain the
> CONST_VECTOR but also the individual elements of the vector.
> 
> For each element I generate new RTL which models them as a constant being set
> into a subreg of the original vector at the index of the element in the vector.
> 
> This so that the SRC is the constant we want to CSE and DEST contains the
> SUBREG to extract from the vector.
> 
> It works as expected, the testcase above generates:
> 
> test:
>         adrp    x2, .LC0
>         sub     sp, sp, #16
>         ldr     q1, [x2, #:lo12:.LC0]
>         add     v0.2d, v1.2d, v0.2d
>         fmov    x2, d1
>         str     q0, [x1]
>         orr     x0, x0, x2
>         add     sp, sp, 16
>         ret
> .LC0:
>         .xword  667169396713799798
>         .xword  667169396713799798
> 
> The problem is that this is somewhat accidental.  CSE is single pass, presumably
> because it currently only tracks SETs of constants where any of the duplicates
> can be replaced by any alternative (it does pick the cheapest, but all the
> alternatives are valid.).
> 
> This breaks with vectors because vectors can only be used as a SRC.  The code
> does validate that the resulting CSE is valid, so this does not break.
> 
> but if the INSN are flipped in RTL:
> 
> (insn 14 13 15 2 (set (reg:DI 103)
>         (const_int 667169396713799798 [0x942430810234076])) "cse.c":8:12 -1
>      (nil))
> ...
> (insn 8 5 9 2 (set (reg:V2DI 99)
>         (const_vector:V2DI [
>                 (const_int 667169396713799798 [0x942430810234076]) repeated x2
>             ])) "cse.c":7:12 -1
>      (nil))
> 
> This no longer works, because it sees the constant version in insn 14 before it
> sees insn 8.  When we find insn 8 we can tell that there is an instruction that
> can be replaced by insn 8, but we don't know the original insn and so as a
> consequence we can't update it.
> 
> so questions:
> 
> 1) Does what I'm doing make sense?
> 2) Is there anyway to go from a SET to an insn?
> 3) If not, can I store the insn in table_elt and have cse_insn produce a worklist
>    of additional insn that need to be re-examined?

Without being able to comment on RTL or the CSE implementation the
issue at hand (optimizing constant generation / placement) doesn't
fit CSE well but it's more a global LCM/PRE problem.  There's also
the issue that while on x86 many constants _are_ valid as immediates
CSEing them into a register (if one is available!) is still
profitable but RTL passes generally propagate / duplicate them
back into the instructions where they are valid (so "fixing" things
on GIMPLE generally doesn't work).

Also IIRC targets can delegitmize constants late (during reload/LRA)
which might cause extra complication.

Richard.

> Thanks,
> Tamar
> 
> --- inline copy of patch -- 
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: [RFC] middle-end: Extend CSE to understand vector extracts.
  2021-01-04 13:33 ` Richard Biener
@ 2021-01-04 13:57   ` Tamar Christina
  2021-01-04 14:13     ` Richard Biener
  0 siblings, 1 reply; 6+ messages in thread
From: Tamar Christina @ 2021-01-04 13:57 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, nd, ian, law

Hi Richi, 

> -----Original Message-----
> From: Richard Biener <rguenther@suse.de>
> Sent: Monday, January 4, 2021 1:33 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; ian@airs.com;
> law@redhat.com
> Subject: Re: [RFC] middle-end: Extend CSE to understand vector extracts.
> 
> On Mon, 4 Jan 2021, Tamar Christina wrote:
> 
> > Hi All,
> >
> > I am trying to get CSE to re-use constants already inside a vector
> > rather than re-materializing the constant again.
> >
> > Basically consider the following case:
> >
> > #include <stdint.h>
> > #include <arm_neon.h>
> >
> > uint64_t
> > test (uint64_t a, uint64x2_t b, uint64x2_t* rt) {
> >   uint64_t arr[2] = { 0x0942430810234076UL, 0x0942430810234076UL};
> >   uint64_t res = a | arr[0];
> >   uint64x2_t val = vld1q_u64 (arr);
> >   *rt = vaddq_u64 (val, b);
> >   return res;
> > }
> >
> > The actual behavior is inconsequential however notice that the same
> > constants are used in the vector (arr and later val) and in the calculation of
> res.
> >
> > The code we generate for this however is quite sub-optimal:
> >
> > test:
> >         adrp    x2, .LC0
> >         sub     sp, sp, #16
> >         ldr     q1, [x2, #:lo12:.LC0]
> >         mov     x2, 16502
> >         movk    x2, 0x1023, lsl 16
> >         movk    x2, 0x4308, lsl 32
> >         add     v1.2d, v1.2d, v0.2d
> >         movk    x2, 0x942, lsl 48
> >         orr     x0, x0, x2
> >         str     q1, [x1]
> >         add     sp, sp, 16
> >         ret
> > .LC0:
> >         .xword  667169396713799798
> >         .xword  667169396713799798
> >
> > Essentially we materialize the same constant twice.  The reason for
> > this is because the front-end lowers the constant extracted from arr[0]
> quite early on.
> > If you look into the result of fre you'll find
> >
> >   <bb 2> :
> >   arr[0] = 667169396713799798;
> >   arr[1] = 667169396713799798;
> >   res_7 = a_6(D) | 667169396713799798;
> >   _16 = __builtin_aarch64_ld1v2di (&arr);
> >   _17 = VIEW_CONVERT_EXPR<uint64x2_t>(_16);
> >   _11 = b_10(D) + _17;
> >   *rt_12(D) = _11;
> >   arr ={v} {CLOBBER};
> >   return res_7;
> >
> > Which makes sense for further optimization.  However come expand time
> > if the constant isn't representable in the target arch it will be
> > assigned to a register again.
> >
> > (insn 8 5 9 2 (set (reg:V2DI 99)
> >         (const_vector:V2DI [
> >                 (const_int 667169396713799798 [0x942430810234076]) repeated x2
> >             ])) "cse.c":7:12 -1
> >      (nil))
> > ...
> > (insn 14 13 15 2 (set (reg:DI 103)
> >         (const_int 667169396713799798 [0x942430810234076])) "cse.c":8:12 -1
> >      (nil))
> > (insn 15 14 16 2 (set (reg:DI 102 [ res ])
> >         (ior:DI (reg/v:DI 96 [ a ])
> >             (reg:DI 103))) "cse.c":8:12 -1
> >      (nil))
> >
> > And since it's out of the immediate range of the scalar instruction
> > used combine won't be able to do anything here.
> >
> > This will then trigger the re-materialization of the constant twice.
> >
> > So I figured the best place to handle this is in CSE since in some
> > uArch it's far cheaper to extract a constant from a vector than to materialize
> it.
> >
> > Particularly doing it pre-RA has the benefit of allowing RA to decide
> > whether it needs to move the constant between register files or not as
> > some uArch can perform scalar operation both on the SIMD and GENREG
> side.
> >
> > The issue is I don't know that much about CSE.  I have been reading
> > through the source and think I have a basic understanding of how it
> > works but this email is to see if I'm on the right track or not (to
> > something that is acceptable upstream).
> >
> > My current patch for CSE is:
> >
> > diff --git a/gcc/cse.c b/gcc/cse.c
> > index 36bcfc354d8..3cee53bed85 100644
> > --- a/gcc/cse.c
> > +++ b/gcc/cse.c
> > @@ -43,6 +43,7 @@ along with GCC; see the file COPYING3.  If not see
> > #include "rtl-iter.h"
> >  #include "regs.h"
> >  #include "function-abi.h"
> > +#include "expr.h"
> >
> >  /* The basic idea of common subexpression elimination is to go
> >     through the code, keeping a record of expressions that would @@
> > -4306,6 +4307,20 @@ find_sets_in_insn (rtx_insn *insn, struct set **psets)
> >          someplace else, so it isn't worth cse'ing.  */
> >        else if (GET_CODE (SET_SRC (x)) == CALL)
> >         ;
> > +      else if (GET_CODE (SET_SRC (x)) == CONST_VECTOR)
> > +       {
> > +         /* First register the vector itself.  */
> > +         sets[n_sets++].rtl = x;
> > +         rtx src = SET_SRC (x);
> > +         machine_mode elem_mode = GET_MODE_INNER (GET_MODE (src));
> > +          /* Go over the constants of the CONST_VECTOR in forward order, to
> > +            put them in the same order in the SETS array.  */
> > +         for (unsigned i = 0; i < const_vector_encoded_nelts (src) ; i++)
> > +           {
> > +             rtx y = gen_rtx_SUBREG (elem_mode, SET_DEST (x), i);
> > +             sets[n_sets++].rtl = PATTERN (gen_move_insn (y,
> CONST_VECTOR_ELT (src, i)));
> > +           }
> > +       }
> >        else
> >         sets[n_sets++].rtl = x;
> >      }
> > @@ -4545,7 +4560,14 @@ cse_insn (rtx_insn *insn)
> >    struct set *sets = (struct set *) 0;
> >
> >    if (GET_CODE (x) == SET)
> > -    sets = XALLOCA (struct set);
> > +    {
> > +      /* For CONST_VECTOR we wants to be able to CSE the vector itself
> along with
> > +        elements inside the vector if the target says it's cheap.  */
> > +      if (GET_CODE (SET_SRC (x)) == CONST_VECTOR)
> > +       sets = XALLOCAVEC (struct set, const_vector_encoded_nelts
> (SET_SRC (x)) + 1);
> > +      else
> > +       sets = XALLOCA (struct set);
> > +    }
> >    else if (GET_CODE (x) == PARALLEL)
> >      sets = XALLOCAVEC (struct set, XVECLEN (x, 0));
> >
> > --
> >
> > This extends the sets that CSE uses to perform CSE to not only contain
> > the CONST_VECTOR but also the individual elements of the vector.
> >
> > For each element I generate new RTL which models them as a constant
> > being set into a subreg of the original vector at the index of the element in
> the vector.
> >
> > This so that the SRC is the constant we want to CSE and DEST contains
> > the SUBREG to extract from the vector.
> >
> > It works as expected, the testcase above generates:
> >
> > test:
> >         adrp    x2, .LC0
> >         sub     sp, sp, #16
> >         ldr     q1, [x2, #:lo12:.LC0]
> >         add     v0.2d, v1.2d, v0.2d
> >         fmov    x2, d1
> >         str     q0, [x1]
> >         orr     x0, x0, x2
> >         add     sp, sp, 16
> >         ret
> > .LC0:
> >         .xword  667169396713799798
> >         .xword  667169396713799798
> >
> > The problem is that this is somewhat accidental.  CSE is single pass,
> > presumably because it currently only tracks SETs of constants where
> > any of the duplicates can be replaced by any alternative (it does pick
> > the cheapest, but all the alternatives are valid.).
> >
> > This breaks with vectors because vectors can only be used as a SRC.
> > The code does validate that the resulting CSE is valid, so this does not break.
> >
> > but if the INSN are flipped in RTL:
> >
> > (insn 14 13 15 2 (set (reg:DI 103)
> >         (const_int 667169396713799798 [0x942430810234076])) "cse.c":8:12 -1
> >      (nil))
> > ...
> > (insn 8 5 9 2 (set (reg:V2DI 99)
> >         (const_vector:V2DI [
> >                 (const_int 667169396713799798 [0x942430810234076]) repeated x2
> >             ])) "cse.c":7:12 -1
> >      (nil))
> >
> > This no longer works, because it sees the constant version in insn 14
> > before it sees insn 8.  When we find insn 8 we can tell that there is
> > an instruction that can be replaced by insn 8, but we don't know the
> > original insn and so as a consequence we can't update it.
> >
> > so questions:
> >
> > 1) Does what I'm doing make sense?
> > 2) Is there anyway to go from a SET to an insn?
> > 3) If not, can I store the insn in table_elt and have cse_insn produce a
> worklist
> >    of additional insn that need to be re-examined?
> 
> Without being able to comment on RTL or the CSE implementation the issue
> at hand (optimizing constant generation / placement) doesn't fit CSE well but
> it's more a global LCM/PRE problem.

Hmm that's fair, I can try using PRE.  I initially chose CSE since it already did the majority
of the work to support PARALLELs already.

> There's also the issue that while on x86
> many constants _are_ valid as immediates CSEing them into a register (if one
> is available!) is still profitable but RTL passes generally propagate / duplicate
> them back into the instructions where they are valid (so "fixing" things on
> GIMPLE generally doesn't work).

I was going to make this a target hook so the back-end can decide what it wants to do,
I just didn't do that yet. It would have to be, even for PRE wouldn't it?

I agree that at GIMPLE it wouldn't work but CSE always runs at RTL no?

> 
> Also IIRC targets can delegitmize constants late (during reload/LRA) which
> might cause extra complication.

True, but doing it post-reloads has the issue that reload has then already chosen a register class,
which makes it not able to generate the most efficient code anymore.

For this simple case with a vec_dup I can of course fix this by changing the representation at expand time
from being a vec_dup of a constant to that of a register and shove the constant in the register.

Which would work for the dups case but not the general case of extracting any element.  I could again change
The representation to be a concat of a bunch of registers, but at some point they have to go back in.

Cheers,
Tamar

> 
> Richard.
> 
> > Thanks,
> > Tamar
> >
> > --- inline copy of patch --
> >
> >
> 
> --
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
> Nuernberg, Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: [RFC] middle-end: Extend CSE to understand vector extracts.
  2021-01-04 13:57   ` Tamar Christina
@ 2021-01-04 14:13     ` Richard Biener
  2021-01-04 15:25       ` Jeff Law
  0 siblings, 1 reply; 6+ messages in thread
From: Richard Biener @ 2021-01-04 14:13 UTC (permalink / raw)
  To: Tamar Christina; +Cc: gcc-patches, nd, ian, law

On Mon, 4 Jan 2021, Tamar Christina wrote:

> Hi Richi, 
> 
> > -----Original Message-----
> > From: Richard Biener <rguenther@suse.de>
> > Sent: Monday, January 4, 2021 1:33 PM
> > To: Tamar Christina <Tamar.Christina@arm.com>
> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; ian@airs.com;
> > law@redhat.com
> > Subject: Re: [RFC] middle-end: Extend CSE to understand vector extracts.
> > 
> > On Mon, 4 Jan 2021, Tamar Christina wrote:
> > 
> > > Hi All,
> > >
> > > I am trying to get CSE to re-use constants already inside a vector
> > > rather than re-materializing the constant again.
> > >
> > > Basically consider the following case:
> > >
> > > #include <stdint.h>
> > > #include <arm_neon.h>
> > >
> > > uint64_t
> > > test (uint64_t a, uint64x2_t b, uint64x2_t* rt) {
> > >   uint64_t arr[2] = { 0x0942430810234076UL, 0x0942430810234076UL};
> > >   uint64_t res = a | arr[0];
> > >   uint64x2_t val = vld1q_u64 (arr);
> > >   *rt = vaddq_u64 (val, b);
> > >   return res;
> > > }
> > >
> > > The actual behavior is inconsequential however notice that the same
> > > constants are used in the vector (arr and later val) and in the calculation of
> > res.
> > >
> > > The code we generate for this however is quite sub-optimal:
> > >
> > > test:
> > >         adrp    x2, .LC0
> > >         sub     sp, sp, #16
> > >         ldr     q1, [x2, #:lo12:.LC0]
> > >         mov     x2, 16502
> > >         movk    x2, 0x1023, lsl 16
> > >         movk    x2, 0x4308, lsl 32
> > >         add     v1.2d, v1.2d, v0.2d
> > >         movk    x2, 0x942, lsl 48
> > >         orr     x0, x0, x2
> > >         str     q1, [x1]
> > >         add     sp, sp, 16
> > >         ret
> > > .LC0:
> > >         .xword  667169396713799798
> > >         .xword  667169396713799798
> > >
> > > Essentially we materialize the same constant twice.  The reason for
> > > this is because the front-end lowers the constant extracted from arr[0]
> > quite early on.
> > > If you look into the result of fre you'll find
> > >
> > >   <bb 2> :
> > >   arr[0] = 667169396713799798;
> > >   arr[1] = 667169396713799798;
> > >   res_7 = a_6(D) | 667169396713799798;
> > >   _16 = __builtin_aarch64_ld1v2di (&arr);
> > >   _17 = VIEW_CONVERT_EXPR<uint64x2_t>(_16);
> > >   _11 = b_10(D) + _17;
> > >   *rt_12(D) = _11;
> > >   arr ={v} {CLOBBER};
> > >   return res_7;
> > >
> > > Which makes sense for further optimization.  However come expand time
> > > if the constant isn't representable in the target arch it will be
> > > assigned to a register again.
> > >
> > > (insn 8 5 9 2 (set (reg:V2DI 99)
> > >         (const_vector:V2DI [
> > >                 (const_int 667169396713799798 [0x942430810234076]) repeated x2
> > >             ])) "cse.c":7:12 -1
> > >      (nil))
> > > ...
> > > (insn 14 13 15 2 (set (reg:DI 103)
> > >         (const_int 667169396713799798 [0x942430810234076])) "cse.c":8:12 -1
> > >      (nil))
> > > (insn 15 14 16 2 (set (reg:DI 102 [ res ])
> > >         (ior:DI (reg/v:DI 96 [ a ])
> > >             (reg:DI 103))) "cse.c":8:12 -1
> > >      (nil))
> > >
> > > And since it's out of the immediate range of the scalar instruction
> > > used combine won't be able to do anything here.
> > >
> > > This will then trigger the re-materialization of the constant twice.
> > >
> > > So I figured the best place to handle this is in CSE since in some
> > > uArch it's far cheaper to extract a constant from a vector than to materialize
> > it.
> > >
> > > Particularly doing it pre-RA has the benefit of allowing RA to decide
> > > whether it needs to move the constant between register files or not as
> > > some uArch can perform scalar operation both on the SIMD and GENREG
> > side.
> > >
> > > The issue is I don't know that much about CSE.  I have been reading
> > > through the source and think I have a basic understanding of how it
> > > works but this email is to see if I'm on the right track or not (to
> > > something that is acceptable upstream).
> > >
> > > My current patch for CSE is:
> > >
> > > diff --git a/gcc/cse.c b/gcc/cse.c
> > > index 36bcfc354d8..3cee53bed85 100644
> > > --- a/gcc/cse.c
> > > +++ b/gcc/cse.c
> > > @@ -43,6 +43,7 @@ along with GCC; see the file COPYING3.  If not see
> > > #include "rtl-iter.h"
> > >  #include "regs.h"
> > >  #include "function-abi.h"
> > > +#include "expr.h"
> > >
> > >  /* The basic idea of common subexpression elimination is to go
> > >     through the code, keeping a record of expressions that would @@
> > > -4306,6 +4307,20 @@ find_sets_in_insn (rtx_insn *insn, struct set **psets)
> > >          someplace else, so it isn't worth cse'ing.  */
> > >        else if (GET_CODE (SET_SRC (x)) == CALL)
> > >         ;
> > > +      else if (GET_CODE (SET_SRC (x)) == CONST_VECTOR)
> > > +       {
> > > +         /* First register the vector itself.  */
> > > +         sets[n_sets++].rtl = x;
> > > +         rtx src = SET_SRC (x);
> > > +         machine_mode elem_mode = GET_MODE_INNER (GET_MODE (src));
> > > +          /* Go over the constants of the CONST_VECTOR in forward order, to
> > > +            put them in the same order in the SETS array.  */
> > > +         for (unsigned i = 0; i < const_vector_encoded_nelts (src) ; i++)
> > > +           {
> > > +             rtx y = gen_rtx_SUBREG (elem_mode, SET_DEST (x), i);
> > > +             sets[n_sets++].rtl = PATTERN (gen_move_insn (y,
> > CONST_VECTOR_ELT (src, i)));
> > > +           }
> > > +       }
> > >        else
> > >         sets[n_sets++].rtl = x;
> > >      }
> > > @@ -4545,7 +4560,14 @@ cse_insn (rtx_insn *insn)
> > >    struct set *sets = (struct set *) 0;
> > >
> > >    if (GET_CODE (x) == SET)
> > > -    sets = XALLOCA (struct set);
> > > +    {
> > > +      /* For CONST_VECTOR we wants to be able to CSE the vector itself
> > along with
> > > +        elements inside the vector if the target says it's cheap.  */
> > > +      if (GET_CODE (SET_SRC (x)) == CONST_VECTOR)
> > > +       sets = XALLOCAVEC (struct set, const_vector_encoded_nelts
> > (SET_SRC (x)) + 1);
> > > +      else
> > > +       sets = XALLOCA (struct set);
> > > +    }
> > >    else if (GET_CODE (x) == PARALLEL)
> > >      sets = XALLOCAVEC (struct set, XVECLEN (x, 0));
> > >
> > > --
> > >
> > > This extends the sets that CSE uses to perform CSE to not only contain
> > > the CONST_VECTOR but also the individual elements of the vector.
> > >
> > > For each element I generate new RTL which models them as a constant
> > > being set into a subreg of the original vector at the index of the element in
> > the vector.
> > >
> > > This so that the SRC is the constant we want to CSE and DEST contains
> > > the SUBREG to extract from the vector.
> > >
> > > It works as expected, the testcase above generates:
> > >
> > > test:
> > >         adrp    x2, .LC0
> > >         sub     sp, sp, #16
> > >         ldr     q1, [x2, #:lo12:.LC0]
> > >         add     v0.2d, v1.2d, v0.2d
> > >         fmov    x2, d1
> > >         str     q0, [x1]
> > >         orr     x0, x0, x2
> > >         add     sp, sp, 16
> > >         ret
> > > .LC0:
> > >         .xword  667169396713799798
> > >         .xword  667169396713799798
> > >
> > > The problem is that this is somewhat accidental.  CSE is single pass,
> > > presumably because it currently only tracks SETs of constants where
> > > any of the duplicates can be replaced by any alternative (it does pick
> > > the cheapest, but all the alternatives are valid.).
> > >
> > > This breaks with vectors because vectors can only be used as a SRC.
> > > The code does validate that the resulting CSE is valid, so this does not break.
> > >
> > > but if the INSN are flipped in RTL:
> > >
> > > (insn 14 13 15 2 (set (reg:DI 103)
> > >         (const_int 667169396713799798 [0x942430810234076])) "cse.c":8:12 -1
> > >      (nil))
> > > ...
> > > (insn 8 5 9 2 (set (reg:V2DI 99)
> > >         (const_vector:V2DI [
> > >                 (const_int 667169396713799798 [0x942430810234076]) repeated x2
> > >             ])) "cse.c":7:12 -1
> > >      (nil))
> > >
> > > This no longer works, because it sees the constant version in insn 14
> > > before it sees insn 8.  When we find insn 8 we can tell that there is
> > > an instruction that can be replaced by insn 8, but we don't know the
> > > original insn and so as a consequence we can't update it.
> > >
> > > so questions:
> > >
> > > 1) Does what I'm doing make sense?
> > > 2) Is there anyway to go from a SET to an insn?
> > > 3) If not, can I store the insn in table_elt and have cse_insn produce a
> > worklist
> > >    of additional insn that need to be re-examined?
> > 
> > Without being able to comment on RTL or the CSE implementation the issue
> > at hand (optimizing constant generation / placement) doesn't fit CSE well but
> > it's more a global LCM/PRE problem.
> 
> Hmm that's fair, I can try using PRE.  I initially chose CSE since it already did the majority
> of the work to support PARALLELs already.
> 
> > There's also the issue that while on x86
> > many constants _are_ valid as immediates CSEing them into a register (if one
> > is available!) is still profitable but RTL passes generally propagate / duplicate
> > them back into the instructions where they are valid (so "fixing" things on
> > GIMPLE generally doesn't work).
> 
> I was going to make this a target hook so the back-end can decide what it wants to do,
> I just didn't do that yet. It would have to be, even for PRE wouldn't it?
> 
> I agree that at GIMPLE it wouldn't work but CSE always runs at RTL no?

Yes.

> > 
> > Also IIRC targets can delegitmize constants late (during reload/LRA) which
> > might cause extra complication.
> 
> True, but doing it post-reloads has the issue that reload has then already chosen a register class,
> which makes it not able to generate the most efficient code anymore.

True - I thought of a pass just before IRA/LRA that splits out constants
from all insns that can bear a register operand in its place placing the
init of the pseudo using LCM dataflow (and doing CSE plus magic for 
dealing with the vector component case).  I think LRA can already
rematerialize a constant in the insn (aka propagate it back) in case
the def of the constant didn't get a hardreg.  One could, after LCM,
do trivial propagation of single-use defs back to the insns as well.

> For this simple case with a vec_dup I can of course fix this by changing the representation at expand time
> from being a vec_dup of a constant to that of a register and shove the constant in the register.
> 
> Which would work for the dups case but not the general case of extracting any element.  I could again change
> The representation to be a concat of a bunch of registers, but at some point they have to go back in.

I think for your case at hand the CSE approach is fine (no comments on the
details) - just that CSE isn't going to be the place to fix all cases
(as you noticed with the second one).

Richard.

> Cheers,
> Tamar
> 
> > 
> > Richard.
> > 
> > > Thanks,
> > > Tamar
> > >
> > > --- inline copy of patch --
> > >
> > >
> > 
> > --
> > Richard Biener <rguenther@suse.de>
> > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
> > Nuernberg, Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC] middle-end: Extend CSE to understand vector extracts.
  2021-01-04 14:13     ` Richard Biener
@ 2021-01-04 15:25       ` Jeff Law
  0 siblings, 0 replies; 6+ messages in thread
From: Jeff Law @ 2021-01-04 15:25 UTC (permalink / raw)
  To: Richard Biener, Tamar Christina; +Cc: gcc-patches, nd, ian



On 1/4/21 7:13 AM, Richard Biener wrote:
> On Mon, 4 Jan 2021, Tamar Christina wrote:
>
>> Hi Richi, 
>>
>>> -----Original Message-----
>>> From: Richard Biener <rguenther@suse.de>
>>> Sent: Monday, January 4, 2021 1:33 PM
>>> To: Tamar Christina <Tamar.Christina@arm.com>
>>> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; ian@airs.com;
>>> law@redhat.com
>>> Subject: Re: [RFC] middle-end: Extend CSE to understand vector extracts.
>>>
>>> On Mon, 4 Jan 2021, Tamar Christina wrote:
>>>
>>>> Hi All,
>>>>
>>>> I am trying to get CSE to re-use constants already inside a vector
>>>> rather than re-materializing the constant again.
>>>>
>>>> Basically consider the following case:
>>>>
>>>> #include <stdint.h>
>>>> #include <arm_neon.h>
>>>>
>>>> uint64_t
>>>> test (uint64_t a, uint64x2_t b, uint64x2_t* rt) {
>>>>   uint64_t arr[2] = { 0x0942430810234076UL, 0x0942430810234076UL};
>>>>   uint64_t res = a | arr[0];
>>>>   uint64x2_t val = vld1q_u64 (arr);
>>>>   *rt = vaddq_u64 (val, b);
>>>>   return res;
>>>> }
>>>>
>>>> The actual behavior is inconsequential however notice that the same
>>>> constants are used in the vector (arr and later val) and in the calculation of
>>> res.
>>>> The code we generate for this however is quite sub-optimal:
>>>>
>>>> test:
>>>>         adrp    x2, .LC0
>>>>         sub     sp, sp, #16
>>>>         ldr     q1, [x2, #:lo12:.LC0]
>>>>         mov     x2, 16502
>>>>         movk    x2, 0x1023, lsl 16
>>>>         movk    x2, 0x4308, lsl 32
>>>>         add     v1.2d, v1.2d, v0.2d
>>>>         movk    x2, 0x942, lsl 48
>>>>         orr     x0, x0, x2
>>>>         str     q1, [x1]
>>>>         add     sp, sp, 16
>>>>         ret
>>>> .LC0:
>>>>         .xword  667169396713799798
>>>>         .xword  667169396713799798
>>>>
>>>> Essentially we materialize the same constant twice.  The reason for
>>>> this is because the front-end lowers the constant extracted from arr[0]
>>> quite early on.
>>>> If you look into the result of fre you'll find
>>>>
>>>>   <bb 2> :
>>>>   arr[0] = 667169396713799798;
>>>>   arr[1] = 667169396713799798;
>>>>   res_7 = a_6(D) | 667169396713799798;
>>>>   _16 = __builtin_aarch64_ld1v2di (&arr);
>>>>   _17 = VIEW_CONVERT_EXPR<uint64x2_t>(_16);
>>>>   _11 = b_10(D) + _17;
>>>>   *rt_12(D) = _11;
>>>>   arr ={v} {CLOBBER};
>>>>   return res_7;
>>>>
>>>> Which makes sense for further optimization.  However come expand time
>>>> if the constant isn't representable in the target arch it will be
>>>> assigned to a register again.
>>>>
>>>> (insn 8 5 9 2 (set (reg:V2DI 99)
>>>>         (const_vector:V2DI [
>>>>                 (const_int 667169396713799798 [0x942430810234076]) repeated x2
>>>>             ])) "cse.c":7:12 -1
>>>>      (nil))
>>>> ...
>>>> (insn 14 13 15 2 (set (reg:DI 103)
>>>>         (const_int 667169396713799798 [0x942430810234076])) "cse.c":8:12 -1
>>>>      (nil))
>>>> (insn 15 14 16 2 (set (reg:DI 102 [ res ])
>>>>         (ior:DI (reg/v:DI 96 [ a ])
>>>>             (reg:DI 103))) "cse.c":8:12 -1
>>>>      (nil))
>>>>
>>>> And since it's out of the immediate range of the scalar instruction
>>>> used combine won't be able to do anything here.
>>>>
>>>> This will then trigger the re-materialization of the constant twice.
>>>>
>>>> So I figured the best place to handle this is in CSE since in some
>>>> uArch it's far cheaper to extract a constant from a vector than to materialize
>>> it.
>>>> Particularly doing it pre-RA has the benefit of allowing RA to decide
>>>> whether it needs to move the constant between register files or not as
>>>> some uArch can perform scalar operation both on the SIMD and GENREG
>>> side.
>>>> The issue is I don't know that much about CSE.  I have been reading
>>>> through the source and think I have a basic understanding of how it
>>>> works but this email is to see if I'm on the right track or not (to
>>>> something that is acceptable upstream).
>>>>
>>>> My current patch for CSE is:
>>>>
>>>> diff --git a/gcc/cse.c b/gcc/cse.c
>>>> index 36bcfc354d8..3cee53bed85 100644
>>>> --- a/gcc/cse.c
>>>> +++ b/gcc/cse.c
>>>> @@ -43,6 +43,7 @@ along with GCC; see the file COPYING3.  If not see
>>>> #include "rtl-iter.h"
>>>>  #include "regs.h"
>>>>  #include "function-abi.h"
>>>> +#include "expr.h"
>>>>
>>>>  /* The basic idea of common subexpression elimination is to go
>>>>     through the code, keeping a record of expressions that would @@
>>>> -4306,6 +4307,20 @@ find_sets_in_insn (rtx_insn *insn, struct set **psets)
>>>>          someplace else, so it isn't worth cse'ing.  */
>>>>        else if (GET_CODE (SET_SRC (x)) == CALL)
>>>>         ;
>>>> +      else if (GET_CODE (SET_SRC (x)) == CONST_VECTOR)
>>>> +       {
>>>> +         /* First register the vector itself.  */
>>>> +         sets[n_sets++].rtl = x;
>>>> +         rtx src = SET_SRC (x);
>>>> +         machine_mode elem_mode = GET_MODE_INNER (GET_MODE (src));
>>>> +          /* Go over the constants of the CONST_VECTOR in forward order, to
>>>> +            put them in the same order in the SETS array.  */
>>>> +         for (unsigned i = 0; i < const_vector_encoded_nelts (src) ; i++)
>>>> +           {
>>>> +             rtx y = gen_rtx_SUBREG (elem_mode, SET_DEST (x), i);
>>>> +             sets[n_sets++].rtl = PATTERN (gen_move_insn (y,
>>> CONST_VECTOR_ELT (src, i)));
>>>> +           }
>>>> +       }
>>>>        else
>>>>         sets[n_sets++].rtl = x;
>>>>      }
>>>> @@ -4545,7 +4560,14 @@ cse_insn (rtx_insn *insn)
>>>>    struct set *sets = (struct set *) 0;
>>>>
>>>>    if (GET_CODE (x) == SET)
>>>> -    sets = XALLOCA (struct set);
>>>> +    {
>>>> +      /* For CONST_VECTOR we wants to be able to CSE the vector itself
>>> along with
>>>> +        elements inside the vector if the target says it's cheap.  */
>>>> +      if (GET_CODE (SET_SRC (x)) == CONST_VECTOR)
>>>> +       sets = XALLOCAVEC (struct set, const_vector_encoded_nelts
>>> (SET_SRC (x)) + 1);
>>>> +      else
>>>> +       sets = XALLOCA (struct set);
>>>> +    }
>>>>    else if (GET_CODE (x) == PARALLEL)
>>>>      sets = XALLOCAVEC (struct set, XVECLEN (x, 0));
>>>>
>>>> --
>>>>
>>>> This extends the sets that CSE uses to perform CSE to not only contain
>>>> the CONST_VECTOR but also the individual elements of the vector.
>>>>
>>>> For each element I generate new RTL which models them as a constant
>>>> being set into a subreg of the original vector at the index of the element in
>>> the vector.
>>>> This so that the SRC is the constant we want to CSE and DEST contains
>>>> the SUBREG to extract from the vector.
>>>>
>>>> It works as expected, the testcase above generates:
>>>>
>>>> test:
>>>>         adrp    x2, .LC0
>>>>         sub     sp, sp, #16
>>>>         ldr     q1, [x2, #:lo12:.LC0]
>>>>         add     v0.2d, v1.2d, v0.2d
>>>>         fmov    x2, d1
>>>>         str     q0, [x1]
>>>>         orr     x0, x0, x2
>>>>         add     sp, sp, 16
>>>>         ret
>>>> .LC0:
>>>>         .xword  667169396713799798
>>>>         .xword  667169396713799798
>>>>
>>>> The problem is that this is somewhat accidental.  CSE is single pass,
>>>> presumably because it currently only tracks SETs of constants where
>>>> any of the duplicates can be replaced by any alternative (it does pick
>>>> the cheapest, but all the alternatives are valid.).
>>>>
>>>> This breaks with vectors because vectors can only be used as a SRC.
>>>> The code does validate that the resulting CSE is valid, so this does not break.
>>>>
>>>> but if the INSN are flipped in RTL:
>>>>
>>>> (insn 14 13 15 2 (set (reg:DI 103)
>>>>         (const_int 667169396713799798 [0x942430810234076])) "cse.c":8:12 -1
>>>>      (nil))
>>>> ...
>>>> (insn 8 5 9 2 (set (reg:V2DI 99)
>>>>         (const_vector:V2DI [
>>>>                 (const_int 667169396713799798 [0x942430810234076]) repeated x2
>>>>             ])) "cse.c":7:12 -1
>>>>      (nil))
>>>>
>>>> This no longer works, because it sees the constant version in insn 14
>>>> before it sees insn 8.  When we find insn 8 we can tell that there is
>>>> an instruction that can be replaced by insn 8, but we don't know the
>>>> original insn and so as a consequence we can't update it.
>>>>
>>>> so questions:
>>>>
>>>> 1) Does what I'm doing make sense?
>>>> 2) Is there anyway to go from a SET to an insn?
>>>> 3) If not, can I store the insn in table_elt and have cse_insn produce a
>>> worklist
>>>>    of additional insn that need to be re-examined?
>>> Without being able to comment on RTL or the CSE implementation the issue
>>> at hand (optimizing constant generation / placement) doesn't fit CSE well but
>>> it's more a global LCM/PRE problem.
>> Hmm that's fair, I can try using PRE.  I initially chose CSE since it already did the majority
>> of the work to support PARALLELs already.
>>
>>> There's also the issue that while on x86
>>> many constants _are_ valid as immediates CSEing them into a register (if one
>>> is available!) is still profitable but RTL passes generally propagate / duplicate
>>> them back into the instructions where they are valid (so "fixing" things on
>>> GIMPLE generally doesn't work).
>> I was going to make this a target hook so the back-end can decide what it wants to do,
>> I just didn't do that yet. It would have to be, even for PRE wouldn't it?
>>
>> I agree that at GIMPLE it wouldn't work but CSE always runs at RTL no?
> Yes.
>
>>> Also IIRC targets can delegitmize constants late (during reload/LRA) which
>>> might cause extra complication.
>> True, but doing it post-reloads has the issue that reload has then already chosen a register class,
>> which makes it not able to generate the most efficient code anymore.
> True - I thought of a pass just before IRA/LRA that splits out constants
> from all insns that can bear a register operand in its place placing the
> init of the pseudo using LCM dataflow (and doing CSE plus magic for 
> dealing with the vector component case).  I think LRA can already
> rematerialize a constant in the insn (aka propagate it back) in case
> the def of the constant didn't get a hardreg.  One could, after LCM,
> do trivial propagation of single-use defs back to the insns as well.
>
>> For this simple case with a vec_dup I can of course fix this by changing the representation at expand time
>> from being a vec_dup of a constant to that of a register and shove the constant in the register.
>>
>> Which would work for the dups case but not the general case of extracting any element.  I could again change
>> The representation to be a concat of a bunch of registers, but at some point they have to go back in.
> I think for your case at hand the CSE approach is fine (no comments on the
> details) - just that CSE isn't going to be the place to fix all cases
> (as you noticed with the second one).
But note that our RTL PRE implementation generally ignores constants. 
It's also the case that our RTL PRE implementation assumes that the
source operand can be trivially copied into a register.

Jeff


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC] middle-end: Extend CSE to understand vector extracts.
  2021-01-04 12:18 [RFC] middle-end: Extend CSE to understand vector extracts Tamar Christina
  2021-01-04 13:33 ` Richard Biener
@ 2021-08-23  3:39 ` Jeff Law
  1 sibling, 0 replies; 6+ messages in thread
From: Jeff Law @ 2021-08-23  3:39 UTC (permalink / raw)
  To: Tamar Christina, gcc-patches; +Cc: nd, rguenther, ian, law



On 1/4/2021 6:18 AM, Tamar Christina wrote:
> Hi All,
>
> I am trying to get CSE to re-use constants already inside a vector rather than
> re-materializing the constant again.
>
> Basically consider the following case:
>
> #include <stdint.h>
> #include <arm_neon.h>
>
> uint64_t
> test (uint64_t a, uint64x2_t b, uint64x2_t* rt)
> {
>    uint64_t arr[2] = { 0x0942430810234076UL, 0x0942430810234076UL};
>    uint64_t res = a | arr[0];
>    uint64x2_t val = vld1q_u64 (arr);
>    *rt = vaddq_u64 (val, b);
>    return res;
> }
>
> The actual behavior is inconsequential however notice that the same constants
> are used in the vector (arr and later val) and in the calculation of res.
>
> The code we generate for this however is quite sub-optimal:
>
> test:
>          adrp    x2, .LC0
>          sub     sp, sp, #16
>          ldr     q1, [x2, #:lo12:.LC0]
>          mov     x2, 16502
>          movk    x2, 0x1023, lsl 16
>          movk    x2, 0x4308, lsl 32
>          add     v1.2d, v1.2d, v0.2d
>          movk    x2, 0x942, lsl 48
>          orr     x0, x0, x2
>          str     q1, [x1]
>          add     sp, sp, 16
>          ret
> .LC0:
>          .xword  667169396713799798
>          .xword  667169396713799798
>
> Essentially we materialize the same constant twice.  The reason for this is
> because the front-end lowers the constant extracted from arr[0] quite early on.
> If you look into the result of fre you'll find
>
>    <bb 2> :
>    arr[0] = 667169396713799798;
>    arr[1] = 667169396713799798;
>    res_7 = a_6(D) | 667169396713799798;
>    _16 = __builtin_aarch64_ld1v2di (&arr);
>    _17 = VIEW_CONVERT_EXPR<uint64x2_t>(_16);
>    _11 = b_10(D) + _17;
>    *rt_12(D) = _11;
>    arr ={v} {CLOBBER};
>    return res_7;
>
> Which makes sense for further optimization.  However come expand time if the
> constant isn't representable in the target arch it will be assigned to a
> register again.
>
> (insn 8 5 9 2 (set (reg:V2DI 99)
>          (const_vector:V2DI [
>                  (const_int 667169396713799798 [0x942430810234076]) repeated x2
>              ])) "cse.c":7:12 -1
>       (nil))
> ...
> (insn 14 13 15 2 (set (reg:DI 103)
>          (const_int 667169396713799798 [0x942430810234076])) "cse.c":8:12 -1
>       (nil))
> (insn 15 14 16 2 (set (reg:DI 102 [ res ])
>          (ior:DI (reg/v:DI 96 [ a ])
>              (reg:DI 103))) "cse.c":8:12 -1
>       (nil))
So I think the key here is to be able to hash the elements of the 
const_vector to the same value as the const_int.  If you can hash them 
the same, they'll be seen as common subexpressions regardless of the 
order in which the insns appear.


>
> My current patch for CSE is:
>
> diff --git a/gcc/cse.c b/gcc/cse.c
> index 36bcfc354d8..3cee53bed85 100644
> --- a/gcc/cse.c
> +++ b/gcc/cse.c
> @@ -43,6 +43,7 @@ along with GCC; see the file COPYING3.  If not see
>   #include "rtl-iter.h"
>   #include "regs.h"
>   #include "function-abi.h"
> +#include "expr.h"
>
>   /* The basic idea of common subexpression elimination is to go
>      through the code, keeping a record of expressions that would
> @@ -4306,6 +4307,20 @@ find_sets_in_insn (rtx_insn *insn, struct set **psets)
>           someplace else, so it isn't worth cse'ing.  */
>         else if (GET_CODE (SET_SRC (x)) == CALL)
>          ;
> +      else if (GET_CODE (SET_SRC (x)) == CONST_VECTOR)
> +       {
> +         /* First register the vector itself.  */
> +         sets[n_sets++].rtl = x;
> +         rtx src = SET_SRC (x);
> +         machine_mode elem_mode = GET_MODE_INNER (GET_MODE (src));
> +          /* Go over the constants of the CONST_VECTOR in forward order, to
> +            put them in the same order in the SETS array.  */
> +         for (unsigned i = 0; i < const_vector_encoded_nelts (src) ; i++)
> +           {
> +             rtx y = gen_rtx_SUBREG (elem_mode, SET_DEST (x), i);
> +             sets[n_sets++].rtl = PATTERN (gen_move_insn (y, CONST_VECTOR_ELT (src, i)));
> +           }
> +       }
>         else
>          sets[n_sets++].rtl = x;
>       }
> @@ -4545,7 +4560,14 @@ cse_insn (rtx_insn *insn)
>     struct set *sets = (struct set *) 0;
>
>     if (GET_CODE (x) == SET)
> -    sets = XALLOCA (struct set);
> +    {
> +      /* For CONST_VECTOR we wants to be able to CSE the vector itself along with
> +        elements inside the vector if the target says it's cheap.  */
> +      if (GET_CODE (SET_SRC (x)) == CONST_VECTOR)
> +       sets = XALLOCAVEC (struct set, const_vector_encoded_nelts (SET_SRC (x)) + 1);
> +      else
> +       sets = XALLOCA (struct set);
> +    }
>     else if (GET_CODE (x) == PARALLEL)
>       sets = XALLOCAVEC (struct set, XVECLEN (x, 0));
>
> --
>
> This extends the sets that CSE uses to perform CSE to not only contain the
> CONST_VECTOR but also the individual elements of the vector.
Seems conceptually reasonable.  You probably want something similar to 
allow you to replace those elements in the vector as well.

>
> For each element I generate new RTL which models them as a constant being set
> into a subreg of the original vector at the index of the element in the vector.
>
> This so that the SRC is the constant we want to CSE and DEST contains the
> SUBREG to extract from the vector.
>
> It works as expected, the testcase above generates:
>
> test:
>          adrp    x2, .LC0
>          sub     sp, sp, #16
>          ldr     q1, [x2, #:lo12:.LC0]
>          add     v0.2d, v1.2d, v0.2d
>          fmov    x2, d1
>          str     q0, [x1]
>          orr     x0, x0, x2
>          add     sp, sp, 16
>          ret
> .LC0:
>          .xword  667169396713799798
>          .xword  667169396713799798
>
> The problem is that this is somewhat accidental.  CSE is single pass, presumably
> because it currently only tracks SETs of constants where any of the duplicates
> can be replaced by any alternative (it does pick the cheapest, but all the
> alternatives are valid.).
>
> This breaks with vectors because vectors can only be used as a SRC.  The code
> does validate that the resulting CSE is valid, so this does not break.
>
> but if the INSN are flipped in RTL:
>
> (insn 14 13 15 2 (set (reg:DI 103)
>          (const_int 667169396713799798 [0x942430810234076])) "cse.c":8:12 -1
>       (nil))
> ...
> (insn 8 5 9 2 (set (reg:V2DI 99)
>          (const_vector:V2DI [
>                  (const_int 667169396713799798 [0x942430810234076]) repeated x2
>              ])) "cse.c":7:12 -1
>       (nil))
>
> This no longer works, because it sees the constant version in insn 14 before it
> sees insn 8.  When we find insn 8 we can tell that there is an instruction that
> can be replaced by insn 8, but we don't know the original insn and so as a
> consequence we can't update it.
Well, what I think you need to do in this case  is look inside the 
const_vector at each element and see if you get a hit in the hash table.

>
> so questions:
>
> 1) Does what I'm doing make sense?
Yes so far.

> 2) Is there anyway to go from a SET to an insn?
Not really sure what you're asking.  Are you trying to map from a set 
back to the insn with the set?

> 3) If not, can I store the insn in table_elt and have cse_insn produce a worklist
>     of additional insn that need to be re-examined?
I think that's an indication you're going the wrong direction for the 
reversed case.

jeff


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2021-08-23  3:39 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-04 12:18 [RFC] middle-end: Extend CSE to understand vector extracts Tamar Christina
2021-01-04 13:33 ` Richard Biener
2021-01-04 13:57   ` Tamar Christina
2021-01-04 14:13     ` Richard Biener
2021-01-04 15:25       ` Jeff Law
2021-08-23  3:39 ` Jeff Law

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).