public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
From: Tamar Christina <Tamar.Christina@arm.com>
To: Richard Biener <rguenther@suse.de>
Cc: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,
	nd <nd@arm.com>,
	"bin.cheng@linux.alibaba.com" <bin.cheng@linux.alibaba.com>
Subject: RE: [PATCH][ivopts]: perform affine fold on unsigned addressing modes known not to overflow. [PR114932]
Date: Wed, 19 Jun 2024 13:22:53 +0000	[thread overview]
Message-ID: <VI1PR08MB5325C516221DE375B6DBE997FFCF2@VI1PR08MB5325.eurprd08.prod.outlook.com> (raw)
In-Reply-To: <po87oq32-452o-0o70-s968-rqnnp8263390@fhfr.qr>

> -----Original Message-----
> From: Richard Biener <rguenther@suse.de>
> Sent: Wednesday, June 19, 2024 1:14 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; bin.cheng@linux.alibaba.com
> Subject: Re: [PATCH][ivopts]: perform affine fold on unsigned addressing modes
> known not to overflow. [PR114932]
> 
> On Fri, 14 Jun 2024, Tamar Christina wrote:
> 
> > Hi All,
> >
> > When the patch for PR114074 was applied we saw a good boost in exchange2.
> >
> > This boost was partially caused by a simplification of the addressing modes.
> > With the patch applied IV opts saw the following form for the base addressing;
> >
> >   Base: (integer(kind=4) *) &block + ((sizetype) ((unsigned long) l0_19(D) *
> > 324) + 36)
> >
> > vs what we normally get:
> >
> >   Base: (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D)
> > * 81) + 9) * 4
> >
> > This is because the patch promoted multiplies where one operand is a constant
> > from a signed multiply to an unsigned one, to attempt to fold away the constant.
> >
> > This patch attempts the same but due to the various problems with SCEV and
> > niters not being able to analyze the resulting forms (i.e. PR114322) we can't
> > do it during SCEV or in the general form like in fold-const like extract_muldiv
> > attempts.
> >
> > Instead this applies the simplification during IVopts initialization when we
> > create the IV.  Essentially when we know the IV won't overflow with regards to
> > niters then we perform an affine fold which gets it to simplify the internal
> > computation, even if this is signed because we know that for IVOPTs uses the
> > IV won't ever overflow.  This allows IV opts to see the simplified form
> > without influencing the rest of the compiler.
> >
> > as mentioned in PR114074 it would be good to fix the missed optimization in the
> > other passes so we can perform this in general.
> >
> > The reason this has a big impact on fortran code is that fortran doesn't seem to
> > have unsigned integer types.  As such all it's addressing are created with
> > signed types and folding does not happen on them due to the possible overflow.
> >
> > concretely on AArch64 this changes the results from generation:
> >
> >         mov     x27, -108
> >         mov     x24, -72
> >         mov     x23, -36
> >         add     x21, x1, x0, lsl 2
> >         add     x19, x20, x22
> > .L5:
> >         add     x0, x22, x19
> >         add     x19, x19, 324
> >         ldr     d1, [x0, x27]
> >         add     v1.2s, v1.2s, v15.2s
> >         str     d1, [x20, 216]
> >         ldr     d0, [x0, x24]
> >         add     v0.2s, v0.2s, v15.2s
> >         str     d0, [x20, 252]
> >         ldr     d31, [x0, x23]
> >         add     v31.2s, v31.2s, v15.2s
> >         str     d31, [x20, 288]
> >         bl      digits_20_
> >         cmp     x21, x19
> >         bne     .L5
> >
> > into:
> >
> > .L5:
> >         ldr     d1, [x19, -108]
> >         add     v1.2s, v1.2s, v15.2s
> >         str     d1, [x20, 216]
> >         ldr     d0, [x19, -72]
> >         add     v0.2s, v0.2s, v15.2s
> >         str     d0, [x20, 252]
> >         ldr     d31, [x19, -36]
> >         add     x19, x19, 324
> >         add     v31.2s, v31.2s, v15.2s
> >         str     d31, [x20, 288]
> >         bl      digits_20_
> >         cmp     x21, x19
> >         bne     .L5
> >
> > The two patches together results in a 10% performance increase in exchange2 in
> > SPECCPU 2017 and a 4% reduction in binary size and a 5% improvement in
> compile
> > time. There's also a 5% performance improvement in fotonik3d and similar
> > reduction in binary size.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > 	PR tree-optimization/114932
> > 	* tree-scalar-evolution.cc (alloc_iv): Perform affine unsigned fold.
> >
> > gcc/testsuite/ChangeLog:
> >
> > 	PR tree-optimization/114932
> > 	* gfortran.dg/addressing-modes_1.f90: New test.
> >
> > ---
> > diff --git a/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
> b/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
> > new file mode 100644
> > index
> 0000000000000000000000000000000000000000..334d5bc47a16e53e9168b
> b1f90dfeff584b4e494
> > --- /dev/null
> > +++ b/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
> > @@ -0,0 +1,37 @@
> > +! { dg-do compile { target aarch64-*-* } }
> > +! { dg-additional-options "-w -Ofast" }
> > +
> > +  module brute_force
> > +    integer, parameter :: r=9
> > +     integer  block(r, r, 0)
> > +    contains
> > +  subroutine brute
> > +     do
> > +      do
> > +          do
> > +           do
> > +                do
> > +                     do
> > +                         do i7 = l0, 1
> > +                       select case(1 )
> > +                       case(1)
> > +                           block(:2, 7:, 1) = block(:2, 7:, i7) - 1
> > +                       end select
> > +                            do i8 = 1, 1
> > +                               do i9 = 1, 1
> > +                            if(1 == 1) then
> > +                                    call digits_20
> > +                                end if
> > +                                end do
> > +                          end do
> > +                    end do
> > +                    end do
> > +              end do
> > +              end do
> > +           end do
> > +     end do
> > +  end do
> > + end
> > +  end
> > +
> > +! { dg-final { scan-assembler-not {ldr\s+d([0-9]+),\s+\[x[0-9]+, x[0-9]+\]} } }
> > diff --git a/gcc/tree-ssa-loop-ivopts.cc b/gcc/tree-ssa-loop-ivopts.cc
> > index
> 4338d7b64a6c2df6404b8d5e51c7f62c23006e72..f621e4ee924b930e1e1d68e3
> 5f3d3a0d52470811 100644
> > --- a/gcc/tree-ssa-loop-ivopts.cc
> > +++ b/gcc/tree-ssa-loop-ivopts.cc
> > @@ -1216,6 +1216,18 @@ alloc_iv (struct ivopts_data *data, tree base, tree
> step,
> >        base = fold_convert (TREE_TYPE (base), aff_combination_to_tree (&comb));
> >      }
> >
> > +  /* If we know the IV won't overflow wrt niters and the type is an unsigned
> > +     type then fold using affine unsigned arithmetic to allow more folding of
> > +     constants.  */
> > +  if (no_overflow
> > +      && TYPE_UNSIGNED (TREE_TYPE (expr)))
> > +    {
> > +      aff_tree comb;
> > +      tree utype = unsigned_type_for (TREE_TYPE (expr));
> > +      tree_to_aff_combination (expr, utype, &comb);
> > +      base = fold_convert (TREE_TYPE (base), aff_combination_to_tree (&comb));
> > +    }
> > +
> 
> So right above we already do
> 
>   /* Lower address expression in base except ones with DECL_P as operand.
>      By doing this:
>        1) More accurate cost can be computed for address expressions;
>        2) Duplicate candidates won't be created for bases in different
>           forms, like &a[0] and &a.  */
>   STRIP_NOPS (expr);
>   if ((TREE_CODE (expr) == ADDR_EXPR && !DECL_P (TREE_OPERAND (expr, 0)))
>       || contain_complex_addr_expr (expr))
>     {
>       aff_tree comb;
>       tree_to_aff_combination (expr, TREE_TYPE (expr), &comb);
>       base = fold_convert (TREE_TYPE (base), aff_combination_to_tree
> (&comb));
>     }
> 
> and if I read correctly 'expr' is
> 
>   (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D)
> * 81) + 9) * 4
> 
> in your interesting case which means it doesn't satisfy
> contain_complex_addr_expr.
> 
> I don't quite get why rewriting the base into (T)(unsigned)... is
> only valid when no_overflow - no_overflow is about {base, +, step},
> not about any overflow contained in 'base'.
> 

Right, so you're saying overflow only applies to step and it's final
evolution wrt to niters? Ok, makes sense.  I had thought it referred
to that the final possible address doesn't overflow. So offset + step * niters.

> I wonder if we maybe want to record an "original" iv->base
> to be used for code generation (and there only the unexpanded form)
> and a variant used for the various sorts of canonicalization/compare
> (I see we eventually add/subtract step and then compare against
> sth else).  And then apply this normalization always to the not
> "original" form.

That's certainly possible, but it'll make the code a bit harder to follow in the
cases where the comparisons are done between the split address base + step
and then a new IV built up from it.

In those cases we'd have to split both expressions but used different ones in
the right places.  Where it gets messy is that something it updates the value
and then compares.

So if we can avoid it it's likely cleaner.

> 
> The above STRIP_NOPS (expr) + expand might turn an unsigned
> affine combination into a signed one which might be problematic.
> So what happens if you change the above to simply always
> unsigned expand?
> 

Yes, that's what's currently happening. It's stripping (signed) (a + b) into
a + b.

To be clear you mean change the affine fold I added to always be unsigned?
and thus removing the contain_complex_addr_expr chunk?

That works and was my first patch locally (though need to fully reg-test it).

Thanks,
Tamar

> Richard.
> 
> >    iv->base = base;
> >    iv->base_object = determine_base_object (data, base);
> >    iv->step = step;
> >
> >
> >
> >
> >
> 
> --
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

  reply	other threads:[~2024-06-19 13:23 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-06-14 11:49 Tamar Christina
2024-06-19 12:13 ` Richard Biener
2024-06-19 13:22   ` Tamar Christina [this message]
2024-06-20  7:54     ` Richard Biener
2024-06-24 14:44       ` Tamar Christina

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=VI1PR08MB5325C516221DE375B6DBE997FFCF2@VI1PR08MB5325.eurprd08.prod.outlook.com \
    --to=tamar.christina@arm.com \
    --cc=bin.cheng@linux.alibaba.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=nd@arm.com \
    --cc=rguenther@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).