public inbox for gcc-patches@gcc.gnu.org
* [PATCH][ivopts]: perform affine fold on unsigned addressing modes known not to overflow. [PR114932]
@ 2024-06-14 11:49 Tamar Christina
  2024-06-19 12:13 ` Richard Biener
  0 siblings, 1 reply; 5+ messages in thread
From: Tamar Christina @ 2024-06-14 11:49 UTC (permalink / raw)
  To: gcc-patches; +Cc: nd, rguenther, bin.cheng

[-- Attachment #1: Type: text/plain, Size: 5622 bytes --]

Hi All,

When the patch for PR114074 was applied we saw a good boost in exchange2.

This boost was partially caused by a simplification of the addressing modes.
With the patch applied, IVopts saw the following form for the base addressing:

  Base: (integer(kind=4) *) &block + ((sizetype) ((unsigned long) l0_19(D) *
324) + 36)

vs what we normally get:

  Base: (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D)
* 81) + 9) * 4

This is because the patch promoted multiplies where one operand is a constant
from a signed multiply to an unsigned one, to attempt to fold away the constant.

This patch attempts the same, but due to the various problems with SCEV and
niters not being able to analyze the resulting forms (i.e. PR114322) we can't
do it during SCEV, nor in general form in fold-const the way extract_muldiv
attempts.

Instead, this applies the simplification during IVopts initialization when we
create the IV.  Essentially, when we know the IV won't overflow with regard to
niters, we perform an affine fold which gets it to simplify the internal
computation, even if it is signed, because we know that for IVopts' uses the
IV won't ever overflow.  This allows IVopts to see the simplified form
without influencing the rest of the compiler.

As mentioned in PR114074, it would be good to fix the missed optimization in
the other passes so we can perform this in general.

The reason this has a big impact on Fortran code is that Fortran doesn't seem
to have unsigned integer types.  As such all its address computations are
created with signed types, and folding does not happen on them due to the
possible overflow.
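To illustrate the fold (a hypothetical C sketch, not GCC code; the constants
81, 9, and element size 4 are taken from the example above): for any value
where the signed computation does not overflow, the signed and unsigned forms
compute the same offset, because unsigned arithmetic wraps modulo 2^64.

```c
#include <stdint.h>

/* Signed form: (sizetype)((integer(kind=8)) l0 * 81 + 9) * 4.  */
static uint64_t
offset_signed (int64_t l0)
{
  return (uint64_t) (l0 * 81 + 9) * 4;
}

/* Unsigned affine fold: (unsigned long) l0 * 324 + 36.  The constants
   have been folded: 81 * 4 == 324, 9 * 4 == 36.  */
static uint64_t
offset_unsigned (int64_t l0)
{
  return (uint64_t) l0 * 324 + 36;
}
```

The unsigned form needs only one multiply-add chain with fully folded
constants, which is what lets the backend pick cheaper addressing modes.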

Concretely, on AArch64 this changes the generated code from:

        mov     x27, -108
        mov     x24, -72
        mov     x23, -36
        add     x21, x1, x0, lsl 2
        add     x19, x20, x22
.L5:
        add     x0, x22, x19
        add     x19, x19, 324
        ldr     d1, [x0, x27]
        add     v1.2s, v1.2s, v15.2s
        str     d1, [x20, 216]
        ldr     d0, [x0, x24]
        add     v0.2s, v0.2s, v15.2s
        str     d0, [x20, 252]
        ldr     d31, [x0, x23]
        add     v31.2s, v31.2s, v15.2s
        str     d31, [x20, 288]
        bl      digits_20_
        cmp     x21, x19
        bne     .L5

into:

.L5:
        ldr     d1, [x19, -108]
        add     v1.2s, v1.2s, v15.2s
        str     d1, [x20, 216]
        ldr     d0, [x19, -72]
        add     v0.2s, v0.2s, v15.2s
        str     d0, [x20, 252]
        ldr     d31, [x19, -36]
        add     x19, x19, 324
        add     v31.2s, v31.2s, v15.2s
        str     d31, [x20, 288]
        bl      digits_20_
        cmp     x21, x19
        bne     .L5

The two patches together result in a 10% performance increase in exchange2 in
SPECCPU 2017, a 4% reduction in binary size, and a 5% improvement in compile
time.  There's also a 5% performance improvement in fotonik3d and a similar
reduction in binary size.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	PR tree-optimization/114932
	* tree-ssa-loop-ivopts.cc (alloc_iv): Perform affine unsigned fold.

gcc/testsuite/ChangeLog:

	PR tree-optimization/114932
	* gfortran.dg/addressing-modes_1.f90: New test.

---
diff --git a/gcc/testsuite/gfortran.dg/addressing-modes_1.f90 b/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
new file mode 100644
index 0000000000000000000000000000000000000000..334d5bc47a16e53e9168bb1f90dfeff584b4e494
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
@@ -0,0 +1,37 @@
+! { dg-do compile { target aarch64-*-* } }
+! { dg-additional-options "-w -Ofast" }
+
+  module brute_force
+    integer, parameter :: r=9
+     integer  block(r, r, 0)
+    contains
+  subroutine brute
+     do
+      do
+          do
+           do
+                do
+                     do
+                         do i7 = l0, 1
+                       select case(1 )
+                       case(1)
+                           block(:2, 7:, 1) = block(:2, 7:, i7) - 1
+                       end select
+                            do i8 = 1, 1
+                               do i9 = 1, 1
+                            if(1 == 1) then
+                                    call digits_20
+                                end if
+                                end do
+                          end do
+                    end do
+                    end do
+              end do
+              end do
+           end do
+     end do
+  end do
+ end
+  end
+
+! { dg-final { scan-assembler-not {ldr\s+d([0-9]+),\s+\[x[0-9]+, x[0-9]+\]} } }
diff --git a/gcc/tree-ssa-loop-ivopts.cc b/gcc/tree-ssa-loop-ivopts.cc
index 4338d7b64a6c2df6404b8d5e51c7f62c23006e72..f621e4ee924b930e1e1d68e35f3d3a0d52470811 100644
--- a/gcc/tree-ssa-loop-ivopts.cc
+++ b/gcc/tree-ssa-loop-ivopts.cc
@@ -1216,6 +1216,18 @@ alloc_iv (struct ivopts_data *data, tree base, tree step,
       base = fold_convert (TREE_TYPE (base), aff_combination_to_tree (&comb));
     }
 
+  /* If we know the IV won't overflow wrt niters and the type is an unsigned
+     type then fold using affine unsigned arithmetic to allow more folding of
+     constants.  */
+  if (no_overflow
+      && TYPE_UNSIGNED (TREE_TYPE (expr)))
+    {
+      aff_tree comb;
+      tree utype = unsigned_type_for (TREE_TYPE (expr));
+      tree_to_aff_combination (expr, utype, &comb);
+      base = fold_convert (TREE_TYPE (base), aff_combination_to_tree (&comb));
+    }
+
   iv->base = base;
   iv->base_object = determine_base_object (data, base);
   iv->step = step;




-- 

[-- Attachment #2: rb18487.patch --]
[-- Type: text/x-diff, Size: 2333 bytes --]





^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH][ivopts]: perform affine fold on unsigned addressing modes known not to overflow. [PR114932]
  2024-06-14 11:49 [PATCH][ivopts]: perform affine fold on unsigned addressing modes known not to overflow. [PR114932] Tamar Christina
@ 2024-06-19 12:13 ` Richard Biener
  2024-06-19 13:22   ` Tamar Christina
  0 siblings, 1 reply; 5+ messages in thread
From: Richard Biener @ 2024-06-19 12:13 UTC (permalink / raw)
  To: Tamar Christina; +Cc: gcc-patches, nd, bin.cheng

On Fri, 14 Jun 2024, Tamar Christina wrote:

> Hi All,
> 
> When the patch for PR114074 was applied we saw a good boost in exchange2.
> 
> This boost was partially caused by a simplification of the addressing modes.
> With the patch applied IV opts saw the following form for the base addressing;
> 
>   Base: (integer(kind=4) *) &block + ((sizetype) ((unsigned long) l0_19(D) *
> 324) + 36)
> 
> vs what we normally get:
> 
>   Base: (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D)
> * 81) + 9) * 4
> 
> This is because the patch promoted multiplies where one operand is a constant
> from a signed multiply to an unsigned one, to attempt to fold away the constant.
> 
> This patch attempts the same but due to the various problems with SCEV and
> niters not being able to analyze the resulting forms (i.e. PR114322) we can't
> do it during SCEV or in the general form like in fold-const like extract_muldiv
> attempts.
> 
> Instead this applies the simplification during IVopts initialization when we
> create the IV.  Essentially when we know the IV won't overflow with regards to
> niters then we perform an affine fold which gets it to simplify the internal
> computation, even if this is signed because we know that for IVOPTs uses the
> IV won't ever overflow.  This allows IV opts to see the simplified form
> without influencing the rest of the compiler.
> 
> as mentioned in PR114074 it would be good to fix the missed optimization in the
> other passes so we can perform this in general.
> 
> The reason this has a big impact on fortran code is that fortran doesn't seem to
> have unsigned integer types.  As such all it's addressing are created with
> signed types and folding does not happen on them due to the possible overflow.
> 
> concretely on AArch64 this changes the results from generation:
> 
>         mov     x27, -108
>         mov     x24, -72
>         mov     x23, -36
>         add     x21, x1, x0, lsl 2
>         add     x19, x20, x22
> .L5:
>         add     x0, x22, x19
>         add     x19, x19, 324
>         ldr     d1, [x0, x27]
>         add     v1.2s, v1.2s, v15.2s
>         str     d1, [x20, 216]
>         ldr     d0, [x0, x24]
>         add     v0.2s, v0.2s, v15.2s
>         str     d0, [x20, 252]
>         ldr     d31, [x0, x23]
>         add     v31.2s, v31.2s, v15.2s
>         str     d31, [x20, 288]
>         bl      digits_20_
>         cmp     x21, x19
>         bne     .L5
> 
> into:
> 
> .L5:
>         ldr     d1, [x19, -108]
>         add     v1.2s, v1.2s, v15.2s
>         str     d1, [x20, 216]
>         ldr     d0, [x19, -72]
>         add     v0.2s, v0.2s, v15.2s
>         str     d0, [x20, 252]
>         ldr     d31, [x19, -36]
>         add     x19, x19, 324
>         add     v31.2s, v31.2s, v15.2s
>         str     d31, [x20, 288]
>         bl      digits_20_
>         cmp     x21, x19
>         bne     .L5
> 
> The two patches together results in a 10% performance increase in exchange2 in
> SPECCPU 2017 and a 4% reduction in binary size and a 5% improvement in compile
> time. There's also a 5% performance improvement in fotonik3d and similar
> reduction in binary size.
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> 
> Ok for master?
> 
> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> 	PR tree-optimization/114932
> 	* tree-scalar-evolution.cc (alloc_iv): Perform affine unsigned fold.
> 
> gcc/testsuite/ChangeLog:
> 
> 	PR tree-optimization/114932
> 	* gfortran.dg/addressing-modes_1.f90: New test.
> 
> ---
> diff --git a/gcc/testsuite/gfortran.dg/addressing-modes_1.f90 b/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
> new file mode 100644
> index 0000000000000000000000000000000000000000..334d5bc47a16e53e9168bb1f90dfeff584b4e494
> --- /dev/null
> +++ b/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
> @@ -0,0 +1,37 @@
> +! { dg-do compile { target aarch64-*-* } }
> +! { dg-additional-options "-w -Ofast" }
> +
> +  module brute_force
> +    integer, parameter :: r=9
> +     integer  block(r, r, 0)
> +    contains
> +  subroutine brute
> +     do
> +      do
> +          do
> +           do
> +                do
> +                     do
> +                         do i7 = l0, 1
> +                       select case(1 )
> +                       case(1)
> +                           block(:2, 7:, 1) = block(:2, 7:, i7) - 1
> +                       end select
> +                            do i8 = 1, 1
> +                               do i9 = 1, 1
> +                            if(1 == 1) then
> +                                    call digits_20
> +                                end if
> +                                end do
> +                          end do
> +                    end do
> +                    end do
> +              end do
> +              end do
> +           end do
> +     end do
> +  end do
> + end
> +  end
> +
> +! { dg-final { scan-assembler-not {ldr\s+d([0-9]+),\s+\[x[0-9]+, x[0-9]+\]} } }
> diff --git a/gcc/tree-ssa-loop-ivopts.cc b/gcc/tree-ssa-loop-ivopts.cc
> index 4338d7b64a6c2df6404b8d5e51c7f62c23006e72..f621e4ee924b930e1e1d68e35f3d3a0d52470811 100644
> --- a/gcc/tree-ssa-loop-ivopts.cc
> +++ b/gcc/tree-ssa-loop-ivopts.cc
> @@ -1216,6 +1216,18 @@ alloc_iv (struct ivopts_data *data, tree base, tree step,
>        base = fold_convert (TREE_TYPE (base), aff_combination_to_tree (&comb));
>      }
>  
> +  /* If we know the IV won't overflow wrt niters and the type is an unsigned
> +     type then fold using affine unsigned arithmetic to allow more folding of
> +     constants.  */
> +  if (no_overflow
> +      && TYPE_UNSIGNED (TREE_TYPE (expr)))
> +    {
> +      aff_tree comb;
> +      tree utype = unsigned_type_for (TREE_TYPE (expr));
> +      tree_to_aff_combination (expr, utype, &comb);
> +      base = fold_convert (TREE_TYPE (base), aff_combination_to_tree (&comb));
> +    }
> +

So right above we already do

  /* Lower address expression in base except ones with DECL_P as operand.
     By doing this:
       1) More accurate cost can be computed for address expressions;
       2) Duplicate candidates won't be created for bases in different
          forms, like &a[0] and &a.  */
  STRIP_NOPS (expr);
  if ((TREE_CODE (expr) == ADDR_EXPR && !DECL_P (TREE_OPERAND (expr, 0)))
      || contain_complex_addr_expr (expr))
    {
      aff_tree comb;
      tree_to_aff_combination (expr, TREE_TYPE (expr), &comb);
      base = fold_convert (TREE_TYPE (base), aff_combination_to_tree 
(&comb));
    }

and if I read correctly 'expr' is

  (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D)
* 81) + 9) * 4

in your interesting case which means it doesn't satisfy
contain_complex_addr_expr.

I don't quite get why rewriting the base into (T)(unsigned)... is
only valid when no_overflow - no_overflow is about {base, +, step},
not about any overflow contained in 'base'.

I wonder if we maybe want to record an "original" iv->base
to be used for code generation (and there only the unexpanded form)
and a variant used for the various sorts of canonicalization/compare
(I see we eventually add/subtract step and then compare against
sth else).  And then apply this normalization always to the not
"original" form.

The above STRIP_NOPS (expr) + expand might turn an unsigned
affine combination into a signed one which might be problematic.
So what happens if you change the above to simply always
unsigned expand?

Richard.

>    iv->base = base;
>    iv->base_object = determine_base_object (data, base);
>    iv->step = step;
> 
> 
> 
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


* RE: [PATCH][ivopts]: perform affine fold on unsigned addressing modes known not to overflow. [PR114932]
  2024-06-19 12:13 ` Richard Biener
@ 2024-06-19 13:22   ` Tamar Christina
  2024-06-20  7:54     ` Richard Biener
  0 siblings, 1 reply; 5+ messages in thread
From: Tamar Christina @ 2024-06-19 13:22 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, nd, bin.cheng

> -----Original Message-----
> From: Richard Biener <rguenther@suse.de>
> Sent: Wednesday, June 19, 2024 1:14 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; bin.cheng@linux.alibaba.com
> Subject: Re: [PATCH][ivopts]: perform affine fold on unsigned addressing modes
> known not to overflow. [PR114932]
> 
> On Fri, 14 Jun 2024, Tamar Christina wrote:
> 
> > Hi All,
> >
> > When the patch for PR114074 was applied we saw a good boost in exchange2.
> >
> > This boost was partially caused by a simplification of the addressing modes.
> > With the patch applied IV opts saw the following form for the base addressing;
> >
> >   Base: (integer(kind=4) *) &block + ((sizetype) ((unsigned long) l0_19(D) *
> > 324) + 36)
> >
> > vs what we normally get:
> >
> >   Base: (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D)
> > * 81) + 9) * 4
> >
> > This is because the patch promoted multiplies where one operand is a constant
> > from a signed multiply to an unsigned one, to attempt to fold away the constant.
> >
> > This patch attempts the same but due to the various problems with SCEV and
> > niters not being able to analyze the resulting forms (i.e. PR114322) we can't
> > do it during SCEV or in the general form like in fold-const like extract_muldiv
> > attempts.
> >
> > Instead this applies the simplification during IVopts initialization when we
> > create the IV.  Essentially when we know the IV won't overflow with regards to
> > niters then we perform an affine fold which gets it to simplify the internal
> > computation, even if this is signed because we know that for IVOPTs uses the
> > IV won't ever overflow.  This allows IV opts to see the simplified form
> > without influencing the rest of the compiler.
> >
> > as mentioned in PR114074 it would be good to fix the missed optimization in the
> > other passes so we can perform this in general.
> >
> > The reason this has a big impact on fortran code is that fortran doesn't seem to
> > have unsigned integer types.  As such all it's addressing are created with
> > signed types and folding does not happen on them due to the possible overflow.
> >
> > concretely on AArch64 this changes the results from generation:
> >
> >         mov     x27, -108
> >         mov     x24, -72
> >         mov     x23, -36
> >         add     x21, x1, x0, lsl 2
> >         add     x19, x20, x22
> > .L5:
> >         add     x0, x22, x19
> >         add     x19, x19, 324
> >         ldr     d1, [x0, x27]
> >         add     v1.2s, v1.2s, v15.2s
> >         str     d1, [x20, 216]
> >         ldr     d0, [x0, x24]
> >         add     v0.2s, v0.2s, v15.2s
> >         str     d0, [x20, 252]
> >         ldr     d31, [x0, x23]
> >         add     v31.2s, v31.2s, v15.2s
> >         str     d31, [x20, 288]
> >         bl      digits_20_
> >         cmp     x21, x19
> >         bne     .L5
> >
> > into:
> >
> > .L5:
> >         ldr     d1, [x19, -108]
> >         add     v1.2s, v1.2s, v15.2s
> >         str     d1, [x20, 216]
> >         ldr     d0, [x19, -72]
> >         add     v0.2s, v0.2s, v15.2s
> >         str     d0, [x20, 252]
> >         ldr     d31, [x19, -36]
> >         add     x19, x19, 324
> >         add     v31.2s, v31.2s, v15.2s
> >         str     d31, [x20, 288]
> >         bl      digits_20_
> >         cmp     x21, x19
> >         bne     .L5
> >
> > The two patches together results in a 10% performance increase in exchange2 in
> > SPECCPU 2017 and a 4% reduction in binary size and a 5% improvement in
> compile
> > time. There's also a 5% performance improvement in fotonik3d and similar
> > reduction in binary size.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > 	PR tree-optimization/114932
> > 	* tree-scalar-evolution.cc (alloc_iv): Perform affine unsigned fold.
> >
> > gcc/testsuite/ChangeLog:
> >
> > 	PR tree-optimization/114932
> > 	* gfortran.dg/addressing-modes_1.f90: New test.
> >
> > ---
> > diff --git a/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
> b/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
> > new file mode 100644
> > index
> 0000000000000000000000000000000000000000..334d5bc47a16e53e9168b
> b1f90dfeff584b4e494
> > --- /dev/null
> > +++ b/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
> > @@ -0,0 +1,37 @@
> > +! { dg-do compile { target aarch64-*-* } }
> > +! { dg-additional-options "-w -Ofast" }
> > +
> > +  module brute_force
> > +    integer, parameter :: r=9
> > +     integer  block(r, r, 0)
> > +    contains
> > +  subroutine brute
> > +     do
> > +      do
> > +          do
> > +           do
> > +                do
> > +                     do
> > +                         do i7 = l0, 1
> > +                       select case(1 )
> > +                       case(1)
> > +                           block(:2, 7:, 1) = block(:2, 7:, i7) - 1
> > +                       end select
> > +                            do i8 = 1, 1
> > +                               do i9 = 1, 1
> > +                            if(1 == 1) then
> > +                                    call digits_20
> > +                                end if
> > +                                end do
> > +                          end do
> > +                    end do
> > +                    end do
> > +              end do
> > +              end do
> > +           end do
> > +     end do
> > +  end do
> > + end
> > +  end
> > +
> > +! { dg-final { scan-assembler-not {ldr\s+d([0-9]+),\s+\[x[0-9]+, x[0-9]+\]} } }
> > diff --git a/gcc/tree-ssa-loop-ivopts.cc b/gcc/tree-ssa-loop-ivopts.cc
> > index
> 4338d7b64a6c2df6404b8d5e51c7f62c23006e72..f621e4ee924b930e1e1d68e3
> 5f3d3a0d52470811 100644
> > --- a/gcc/tree-ssa-loop-ivopts.cc
> > +++ b/gcc/tree-ssa-loop-ivopts.cc
> > @@ -1216,6 +1216,18 @@ alloc_iv (struct ivopts_data *data, tree base, tree
> step,
> >        base = fold_convert (TREE_TYPE (base), aff_combination_to_tree (&comb));
> >      }
> >
> > +  /* If we know the IV won't overflow wrt niters and the type is an unsigned
> > +     type then fold using affine unsigned arithmetic to allow more folding of
> > +     constants.  */
> > +  if (no_overflow
> > +      && TYPE_UNSIGNED (TREE_TYPE (expr)))
> > +    {
> > +      aff_tree comb;
> > +      tree utype = unsigned_type_for (TREE_TYPE (expr));
> > +      tree_to_aff_combination (expr, utype, &comb);
> > +      base = fold_convert (TREE_TYPE (base), aff_combination_to_tree (&comb));
> > +    }
> > +
> 
> So right above we already do
> 
>   /* Lower address expression in base except ones with DECL_P as operand.
>      By doing this:
>        1) More accurate cost can be computed for address expressions;
>        2) Duplicate candidates won't be created for bases in different
>           forms, like &a[0] and &a.  */
>   STRIP_NOPS (expr);
>   if ((TREE_CODE (expr) == ADDR_EXPR && !DECL_P (TREE_OPERAND (expr, 0)))
>       || contain_complex_addr_expr (expr))
>     {
>       aff_tree comb;
>       tree_to_aff_combination (expr, TREE_TYPE (expr), &comb);
>       base = fold_convert (TREE_TYPE (base), aff_combination_to_tree
> (&comb));
>     }
> 
> and if I read correctly 'expr' is
> 
>   (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D)
> * 81) + 9) * 4
> 
> in your interesting case which means it doesn't satisfy
> contain_complex_addr_expr.
> 
> I don't quite get why rewriting the base into (T)(unsigned)... is
> only valid when no_overflow - no_overflow is about {base, +, step},
> not about any overflow contained in 'base'.
> 

Right, so you're saying overflow only applies to the step and its final
evolution wrt niters? Ok, makes sense.  I had thought it referred to the
final possible address not overflowing, i.e. offset + step * niters.
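In other words (a hypothetical sketch of the predicate, not the actual IVopts
code): no_overflow for {base, +, step} only promises the evolution itself
stays in range for the first niters steps; it makes no claim about how the
sub-expressions of 'base' were computed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative check: does base + i * step avoid wrapping for every
   0 <= i < niters?  This is the shape of the guarantee; it says nothing
   about overflow inside the computation of 'base' itself.  */
static bool
iv_no_overflow_p (uint64_t base, uint64_t step, uint64_t niters)
{
  if (niters == 0)
    return true;
  uint64_t span, last;
  /* Distance covered by the final iteration: (niters - 1) * step.  */
  if (__builtin_mul_overflow (niters - 1, step, &span))
    return false;
  return !__builtin_add_overflow (base, span, &last);
}
```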

> I wonder if we maybe want to record an "original" iv->base
> to be used for code generation (and there only the unexpanded form)
> and a variant used for the various sorts of canonicalization/compare
> (I see we eventually add/subtract step and then compare against
> sth else).  And then apply this normalization always to the not
> "original" form.

That's certainly possible, but it'll make the code a bit harder to follow in
the cases where the comparisons are done between the split address base + step
and a new IV built up from it.

In those cases we'd have to split both expressions but use different ones in
the right places.  Where it gets messy is that sometimes it updates the value
and then compares.

So if we can avoid it it's likely cleaner.

> 
> The above STRIP_NOPS (expr) + expand might turn an unsigned
> affine combination into a signed one which might be problematic.
> So what happens if you change the above to simply always
> unsigned expand?
> 

Yes, that's what's currently happening. It's stripping (signed) (a + b) into
a + b.

To be clear, you mean change the affine fold I added to always be unsigned,
and thus remove the contain_complex_addr_expr chunk?

That works and was my first patch locally (though need to fully reg-test it).

Thanks,
Tamar

> Richard.
> 
> >    iv->base = base;
> >    iv->base_object = determine_base_object (data, base);
> >    iv->step = step;
> >
> >
> >
> >
> >
> 
> --
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


* RE: [PATCH][ivopts]: perform affine fold on unsigned addressing modes known not to overflow. [PR114932]
  2024-06-19 13:22   ` Tamar Christina
@ 2024-06-20  7:54     ` Richard Biener
  2024-06-24 14:44       ` Tamar Christina
  0 siblings, 1 reply; 5+ messages in thread
From: Richard Biener @ 2024-06-20  7:54 UTC (permalink / raw)
  To: Tamar Christina; +Cc: gcc-patches, nd, bin.cheng

On Wed, 19 Jun 2024, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <rguenther@suse.de>
> > Sent: Wednesday, June 19, 2024 1:14 PM
> > To: Tamar Christina <Tamar.Christina@arm.com>
> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; bin.cheng@linux.alibaba.com
> > Subject: Re: [PATCH][ivopts]: perform affine fold on unsigned addressing modes
> > known not to overflow. [PR114932]
> > 
> > On Fri, 14 Jun 2024, Tamar Christina wrote:
> > 
> > > Hi All,
> > >
> > > When the patch for PR114074 was applied we saw a good boost in exchange2.
> > >
> > > This boost was partially caused by a simplification of the addressing modes.
> > > With the patch applied IV opts saw the following form for the base addressing;
> > >
> > >   Base: (integer(kind=4) *) &block + ((sizetype) ((unsigned long) l0_19(D) *
> > > 324) + 36)
> > >
> > > vs what we normally get:
> > >
> > >   Base: (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D)
> > > * 81) + 9) * 4
> > >
> > > This is because the patch promoted multiplies where one operand is a constant
> > > from a signed multiply to an unsigned one, to attempt to fold away the constant.
> > >
> > > This patch attempts the same but due to the various problems with SCEV and
> > > niters not being able to analyze the resulting forms (i.e. PR114322) we can't
> > > do it during SCEV or in the general form like in fold-const like extract_muldiv
> > > attempts.
> > >
> > > Instead this applies the simplification during IVopts initialization when we
> > > create the IV.  Essentially when we know the IV won't overflow with regards to
> > > niters then we perform an affine fold which gets it to simplify the internal
> > > computation, even if this is signed because we know that for IVOPTs uses the
> > > IV won't ever overflow.  This allows IV opts to see the simplified form
> > > without influencing the rest of the compiler.
> > >
> > > as mentioned in PR114074 it would be good to fix the missed optimization in the
> > > other passes so we can perform this in general.
> > >
> > > The reason this has a big impact on fortran code is that fortran doesn't seem to
> > > have unsigned integer types.  As such all it's addressing are created with
> > > signed types and folding does not happen on them due to the possible overflow.
> > >
> > > concretely on AArch64 this changes the results from generation:
> > >
> > >         mov     x27, -108
> > >         mov     x24, -72
> > >         mov     x23, -36
> > >         add     x21, x1, x0, lsl 2
> > >         add     x19, x20, x22
> > > .L5:
> > >         add     x0, x22, x19
> > >         add     x19, x19, 324
> > >         ldr     d1, [x0, x27]
> > >         add     v1.2s, v1.2s, v15.2s
> > >         str     d1, [x20, 216]
> > >         ldr     d0, [x0, x24]
> > >         add     v0.2s, v0.2s, v15.2s
> > >         str     d0, [x20, 252]
> > >         ldr     d31, [x0, x23]
> > >         add     v31.2s, v31.2s, v15.2s
> > >         str     d31, [x20, 288]
> > >         bl      digits_20_
> > >         cmp     x21, x19
> > >         bne     .L5
> > >
> > > into:
> > >
> > > .L5:
> > >         ldr     d1, [x19, -108]
> > >         add     v1.2s, v1.2s, v15.2s
> > >         str     d1, [x20, 216]
> > >         ldr     d0, [x19, -72]
> > >         add     v0.2s, v0.2s, v15.2s
> > >         str     d0, [x20, 252]
> > >         ldr     d31, [x19, -36]
> > >         add     x19, x19, 324
> > >         add     v31.2s, v31.2s, v15.2s
> > >         str     d31, [x20, 288]
> > >         bl      digits_20_
> > >         cmp     x21, x19
> > >         bne     .L5
> > >
> > > The two patches together results in a 10% performance increase in exchange2 in
> > > SPECCPU 2017 and a 4% reduction in binary size and a 5% improvement in
> > compile
> > > time. There's also a 5% performance improvement in fotonik3d and similar
> > > reduction in binary size.
> > >
> > > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> > >
> > > Ok for master?
> > >
> > > Thanks,
> > > Tamar
> > >
> > > gcc/ChangeLog:
> > >
> > > 	PR tree-optimization/114932
> > > 	* tree-scalar-evolution.cc (alloc_iv): Perform affine unsigned fold.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > > 	PR tree-optimization/114932
> > > 	* gfortran.dg/addressing-modes_1.f90: New test.
> > >
> > > ---
> > > diff --git a/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
> > b/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
> > > new file mode 100644
> > > index
> > 0000000000000000000000000000000000000000..334d5bc47a16e53e9168b
> > b1f90dfeff584b4e494
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
> > > @@ -0,0 +1,37 @@
> > > +! { dg-do compile { target aarch64-*-* } }
> > > +! { dg-additional-options "-w -Ofast" }
> > > +
> > > +  module brute_force
> > > +    integer, parameter :: r=9
> > > +     integer  block(r, r, 0)
> > > +    contains
> > > +  subroutine brute
> > > +     do
> > > +      do
> > > +          do
> > > +           do
> > > +                do
> > > +                     do
> > > +                         do i7 = l0, 1
> > > +                       select case(1 )
> > > +                       case(1)
> > > +                           block(:2, 7:, 1) = block(:2, 7:, i7) - 1
> > > +                       end select
> > > +                            do i8 = 1, 1
> > > +                               do i9 = 1, 1
> > > +                            if(1 == 1) then
> > > +                                    call digits_20
> > > +                                end if
> > > +                                end do
> > > +                          end do
> > > +                    end do
> > > +                    end do
> > > +              end do
> > > +              end do
> > > +           end do
> > > +     end do
> > > +  end do
> > > + end
> > > +  end
> > > +
> > > +! { dg-final { scan-assembler-not {ldr\s+d([0-9]+),\s+\[x[0-9]+, x[0-9]+\]} } }
> > > diff --git a/gcc/tree-ssa-loop-ivopts.cc b/gcc/tree-ssa-loop-ivopts.cc
> > > index
> > 4338d7b64a6c2df6404b8d5e51c7f62c23006e72..f621e4ee924b930e1e1d68e3
> > 5f3d3a0d52470811 100644
> > > --- a/gcc/tree-ssa-loop-ivopts.cc
> > > +++ b/gcc/tree-ssa-loop-ivopts.cc
> > > @@ -1216,6 +1216,18 @@ alloc_iv (struct ivopts_data *data, tree base, tree
> > step,
> > >        base = fold_convert (TREE_TYPE (base), aff_combination_to_tree (&comb));
> > >      }
> > >
> > > +  /* If we know the IV won't overflow wrt niters and the type is an unsigned
> > > +     type then fold using affine unsigned arithmetic to allow more folding of
> > > +     constants.  */
> > > +  if (no_overflow
> > > +      && TYPE_UNSIGNED (TREE_TYPE (expr)))
> > > +    {
> > > +      aff_tree comb;
> > > +      tree utype = unsigned_type_for (TREE_TYPE (expr));
> > > +      tree_to_aff_combination (expr, utype, &comb);
> > > +      base = fold_convert (TREE_TYPE (base), aff_combination_to_tree (&comb));
> > > +    }
> > > +
> > 
> > So right above we already do
> > 
> >   /* Lower address expression in base except ones with DECL_P as operand.
> >      By doing this:
> >        1) More accurate cost can be computed for address expressions;
> >        2) Duplicate candidates won't be created for bases in different
> >           forms, like &a[0] and &a.  */
> >   STRIP_NOPS (expr);
> >   if ((TREE_CODE (expr) == ADDR_EXPR && !DECL_P (TREE_OPERAND (expr, 0)))
> >       || contain_complex_addr_expr (expr))
> >     {
> >       aff_tree comb;
> >       tree_to_aff_combination (expr, TREE_TYPE (expr), &comb);
> >       base = fold_convert (TREE_TYPE (base), aff_combination_to_tree
> > (&comb));
> >     }
> > 
> > and if I read correctly 'expr' is
> > 
> >   (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D)
> > * 81) + 9) * 4
> > 
> > in your interesting case which means it doesn't satisfy
> > contain_complex_addr_expr.
> > 
> > I don't quite get why rewriting the base into (T)(unsigned)... is
> > only valid when no_overflow - no_overflow is about {base, +, step},
> > not about any overflow contained in 'base'.
> > 
> 
> Right, so you're saying overflow only applies to the step and its final
> evolution wrt niters? Ok, makes sense.  I had thought it referred to
> the final possible address not overflowing, i.e. offset + step * niters.

It refers to the IV itself not wrapping, but 'base' is the initial
value computation; whether that computation itself wraps or not
(it would wrap the same for each iteration) isn't covered.

> > I wonder if we maybe want to record an "original" iv->base
> > to be used for code generation (and there only the unexpanded form)
> > and a variant used for the various sorts of canonicalization/compare
> > (I see we eventually add/subtract step and then compare against
> > sth else).  And then apply this normalization always to the not
> > "original" form.
> 
> That's certainly possible, but it'll make the code a bit harder to follow in the
> cases where the comparisons are done between the split address base + step
> and then a new IV built up from it.
> 
> In those cases we'd have to split both expressions but use different ones in
> the right places.  Where it gets messy is that sometimes it updates the value
> and then compares.
> 
> So if we can avoid it, it's likely cleaner.

Sure.

> > 
> > The above STRIP_NOPS (expr) + expand might turn an unsigned
> > affine combination into a signed one which might be problematic.
> > So what happens if you change the above to simply always
> > unsigned expand?
> > 
> 
> Yes, that's what's currently happening. It's stripping (signed) (a + b) into
> a + b.
> 
> To be clear, you mean change the affine fold I added to always be unsigned,
> and thus remove the contain_complex_addr_expr chunk?

Yes.

> That works and was my first patch locally (though I need to fully reg-test it).

I'm curious what happens.

Btw, I think we want to go very piecemeal with these changes given
there's nobody who's really familiar with the code.

Richard.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: [PATCH][ivopts]: perform affine fold on unsigned addressing modes known not to overflow. [PR114932]
  2024-06-20  7:54     ` Richard Biener
@ 2024-06-24 14:44       ` Tamar Christina
  0 siblings, 0 replies; 5+ messages in thread
From: Tamar Christina @ 2024-06-24 14:44 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, nd, bin.cheng

> -----Original Message-----
> From: Richard Biener <rguenther@suse.de>
> Sent: Thursday, June 20, 2024 8:55 AM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; bin.cheng@linux.alibaba.com
> Subject: RE: [PATCH][ivopts]: perform affine fold on unsigned addressing modes
> known not to overflow. [PR114932]
> 
> On Wed, 19 Jun 2024, Tamar Christina wrote:
> 
> > > -----Original Message-----
> > > From: Richard Biener <rguenther@suse.de>
> > > Sent: Wednesday, June 19, 2024 1:14 PM
> > > To: Tamar Christina <Tamar.Christina@arm.com>
> > > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>;
> bin.cheng@linux.alibaba.com
> > > Subject: Re: [PATCH][ivopts]: perform affine fold on unsigned addressing
> modes
> > > known not to overflow. [PR114932]
> > >
> > > On Fri, 14 Jun 2024, Tamar Christina wrote:
> > >
> > > > Hi All,
> > > >
> > > > When the patch for PR114074 was applied we saw a good boost in
> > > > exchange2.
> > > >
> > > > This boost was partially caused by a simplification of the addressing modes.
> > > > With the patch applied IV opts saw the following form for the base
> > > > addressing:
> > > >
> > > >   Base: (integer(kind=4) *) &block + ((sizetype) ((unsigned long) l0_19(D) *
> > > > 324) + 36)
> > > >
> > > > vs what we normally get:
> > > >
> > > >   Base: (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D)
> > > > * 81) + 9) * 4
> > > >
> > > > This is because the patch promoted multiplies where one operand is a
> > > > constant from a signed multiply to an unsigned one, to attempt to fold
> > > > away the constant.
> > > >
> > > > This patch attempts the same, but due to the various problems with SCEV
> > > > and niters not being able to analyze the resulting forms (i.e. PR114322)
> > > > we can't do it during SCEV or in general form in fold-const the way
> > > > extract_muldiv attempts.
> > > >
> > > > Instead this applies the simplification during IVopts initialization
> > > > when we create the IV.  Essentially, when we know the IV won't overflow
> > > > with regards to niters, we perform an affine fold which simplifies the
> > > > internal computation, even if it is signed, because we know that for
> > > > IVopts' uses the IV won't ever overflow.  This allows IVopts to see the
> > > > simplified form without influencing the rest of the compiler.
> > > >
> > > > As mentioned in PR114074, it would be good to fix the missed
> > > > optimization in the other passes so we can perform this in general.
> > > >
> > > > The reason this has a big impact on Fortran code is that Fortran doesn't
> > > > seem to have unsigned integer types.  As such all its addressing is
> > > > created with signed types and folding does not happen on them due to the
> > > > possible overflow.
> > > >
> > > > concretely on AArch64 this changes the results from generation:
> > > >
> > > >         mov     x27, -108
> > > >         mov     x24, -72
> > > >         mov     x23, -36
> > > >         add     x21, x1, x0, lsl 2
> > > >         add     x19, x20, x22
> > > > .L5:
> > > >         add     x0, x22, x19
> > > >         add     x19, x19, 324
> > > >         ldr     d1, [x0, x27]
> > > >         add     v1.2s, v1.2s, v15.2s
> > > >         str     d1, [x20, 216]
> > > >         ldr     d0, [x0, x24]
> > > >         add     v0.2s, v0.2s, v15.2s
> > > >         str     d0, [x20, 252]
> > > >         ldr     d31, [x0, x23]
> > > >         add     v31.2s, v31.2s, v15.2s
> > > >         str     d31, [x20, 288]
> > > >         bl      digits_20_
> > > >         cmp     x21, x19
> > > >         bne     .L5
> > > >
> > > > into:
> > > >
> > > > .L5:
> > > >         ldr     d1, [x19, -108]
> > > >         add     v1.2s, v1.2s, v15.2s
> > > >         str     d1, [x20, 216]
> > > >         ldr     d0, [x19, -72]
> > > >         add     v0.2s, v0.2s, v15.2s
> > > >         str     d0, [x20, 252]
> > > >         ldr     d31, [x19, -36]
> > > >         add     x19, x19, 324
> > > >         add     v31.2s, v31.2s, v15.2s
> > > >         str     d31, [x20, 288]
> > > >         bl      digits_20_
> > > >         cmp     x21, x19
> > > >         bne     .L5
> > > >
> > > > The two patches together result in a 10% performance increase in
> > > > exchange2 in SPECCPU 2017, a 4% reduction in binary size, and a 5%
> > > > improvement in compile time.  There's also a 5% performance improvement
> > > > in fotonik3d and a similar reduction in binary size.
> > > >
> > > > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> > > >
> > > > Ok for master?
> > > >
> > > > Thanks,
> > > > Tamar
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > > 	PR tree-optimization/114932
> > > > 	* tree-ssa-loop-ivopts.cc (alloc_iv): Perform affine unsigned fold.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >
> > > > 	PR tree-optimization/114932
> > > > 	* gfortran.dg/addressing-modes_1.f90: New test.
> > > >
> > > > ---
> > > > diff --git a/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
> > > b/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
> > > > new file mode 100644
> > > > index
> > >
> 0000000000000000000000000000000000000000..334d5bc47a16e53e9168b
> > > b1f90dfeff584b4e494
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
> > > > @@ -0,0 +1,37 @@
> > > > +! { dg-do compile { target aarch64-*-* } }
> > > > +! { dg-additional-options "-w -Ofast" }
> > > > +
> > > > +  module brute_force
> > > > +    integer, parameter :: r=9
> > > > +     integer  block(r, r, 0)
> > > > +    contains
> > > > +  subroutine brute
> > > > +     do
> > > > +      do
> > > > +          do
> > > > +           do
> > > > +                do
> > > > +                     do
> > > > +                         do i7 = l0, 1
> > > > +                       select case(1 )
> > > > +                       case(1)
> > > > +                           block(:2, 7:, 1) = block(:2, 7:, i7) - 1
> > > > +                       end select
> > > > +                            do i8 = 1, 1
> > > > +                               do i9 = 1, 1
> > > > +                            if(1 == 1) then
> > > > +                                    call digits_20
> > > > +                                end if
> > > > +                                end do
> > > > +                          end do
> > > > +                    end do
> > > > +                    end do
> > > > +              end do
> > > > +              end do
> > > > +           end do
> > > > +     end do
> > > > +  end do
> > > > + end
> > > > +  end
> > > > +
> > > > +! { dg-final { scan-assembler-not {ldr\s+d([0-9]+),\s+\[x[0-9]+, x[0-9]+\]} } }
> > > > diff --git a/gcc/tree-ssa-loop-ivopts.cc b/gcc/tree-ssa-loop-ivopts.cc
> > > > index
> > >
> 4338d7b64a6c2df6404b8d5e51c7f62c23006e72..f621e4ee924b930e1e1d68e3
> > > 5f3d3a0d52470811 100644
> > > > --- a/gcc/tree-ssa-loop-ivopts.cc
> > > > +++ b/gcc/tree-ssa-loop-ivopts.cc
> > > > @@ -1216,6 +1216,18 @@ alloc_iv (struct ivopts_data *data, tree base, tree
> > > step,
> > > >        base = fold_convert (TREE_TYPE (base), aff_combination_to_tree
> (&comb));
> > > >      }
> > > >
> > > > +  /* If we know the IV won't overflow wrt niters and the type is an unsigned
> > > > +     type then fold using affine unsigned arithmetic to allow more folding of
> > > > +     constants.  */
> > > > +  if (no_overflow
> > > > +      && TYPE_UNSIGNED (TREE_TYPE (expr)))
> > > > +    {
> > > > +      aff_tree comb;
> > > > +      tree utype = unsigned_type_for (TREE_TYPE (expr));
> > > > +      tree_to_aff_combination (expr, utype, &comb);
> > > > +      base = fold_convert (TREE_TYPE (base), aff_combination_to_tree
> (&comb));
> > > > +    }
> > > > +
> > >
> > > So right above we already do
> > >
> > >   /* Lower address expression in base except ones with DECL_P as operand.
> > >      By doing this:
> > >        1) More accurate cost can be computed for address expressions;
> > >        2) Duplicate candidates won't be created for bases in different
> > >           forms, like &a[0] and &a.  */
> > >   STRIP_NOPS (expr);
> > >   if ((TREE_CODE (expr) == ADDR_EXPR && !DECL_P (TREE_OPERAND (expr, 0)))
> > >       || contain_complex_addr_expr (expr))
> > >     {
> > >       aff_tree comb;
> > >       tree_to_aff_combination (expr, TREE_TYPE (expr), &comb);
> > >       base = fold_convert (TREE_TYPE (base), aff_combination_to_tree
> > > (&comb));
> > >     }
> > >
> > > and if I read correctly 'expr' is
> > >
> > >   (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D)
> > > * 81) + 9) * 4
> > >
> > > in your interesting case which means it doesn't satisfy
> > > contain_complex_addr_expr.
> > >
> > > I don't quite get why rewriting the base into (T)(unsigned)... is
> > > only valid when no_overflow - no_overflow is about {base, +, step},
> > > not about any overflow contained in 'base'.
> > >
> >
> > Right, so you're saying overflow only applies to the step and its final
> > evolution wrt niters? Ok, makes sense.  I had thought it referred to
> > the final possible address not overflowing, i.e. offset + step * niters.
> 
> It refers to the IV itself not wrapping, but 'base' is the initial
> value computation; whether that computation itself wraps or not
> (it would wrap the same for each iteration) isn't covered.
> 
> > > I wonder if we maybe want to record an "original" iv->base
> > > to be used for code generation (and there only the unexpanded form)
> > > and a variant used for the various sorts of canonicalization/compare
> > > (I see we eventually add/subtract step and then compare against
> > > sth else).  And then apply this normalization always to the not
> > > "original" form.
> >
> > That's certainly possible, but it'll make the code a bit harder to follow in the
> > cases where the comparisons are done between the split address base + step
> > and then a new IV built up from it.
> >
> > In those cases we'd have to split both expressions but use different ones in
> > the right places.  Where it gets messy is that sometimes it updates the value
> > and then compares.
> >
> > So if we can avoid it, it's likely cleaner.
> 
> Sure.
> 
> > >
> > > The above STRIP_NOPS (expr) + expand might turn an unsigned
> > > affine combination into a signed one which might be problematic.
> > > So what happens if you change the above to simply always
> > > unsigned expand?
> > >
> >
> > Yes, that's what's currently happening. It's stripping (signed) (a + b) into
> > a + b.
> >
> > To be clear, you mean change the affine fold I added to always be unsigned,
> > and thus remove the contain_complex_addr_expr chunk?
> 
> Yes.
> 
> > That works and was my first patch locally (though I need to fully reg-test it).
> 
> I'm curious what happens.

It looks like two failures in the testsuite:

The testcase for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64705 fails
(though I need to check the codegen to see whether it's a testism or not).
There is, however, a torture execution failure in bitint-49.c; taking a look
at that one now.  Looks like it timed out.

> 
> Btw, I think we want to go very piecemeal with these changes given
> there's nobody who's really familiar with the code.

I completely understand and agree, thanks for the review so far :)

Thanks,
Tamar

> 
> Richard.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2024-06-24 14:44 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-06-14 11:49 [PATCH][ivopts]: perform affine fold on unsigned addressing modes known not to overflow. [PR114932] Tamar Christina
2024-06-19 12:13 ` Richard Biener
2024-06-19 13:22   ` Tamar Christina
2024-06-20  7:54     ` Richard Biener
2024-06-24 14:44       ` Tamar Christina

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).