From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=MvOm=NV=suse.de=rguenther@sourceware.org>
Received: from smtp-out2.suse.de (smtp-out2.suse.de [IPv6:2a07:de40:b251:101:10:150:64:2])
	by sourceware.org (Postfix) with ESMTPS id 3F5FF388450F
	for <gcc-patches@gcc.gnu.org>; Wed, 19 Jun 2024 12:13:54 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 3F5FF388450F
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=suse.de
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=suse.de
ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 3F5FF388450F
Authentication-Results: server2.sourceware.org; arc=none smtp.remote-ip=2a07:de40:b251:101:10:150:64:2
ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1718799236; cv=none;
	b=C7NhS8xoqBbUyc7450eylsEHuSYcrLBauUP1Nc2N+894fBblFj1ff1Eboh7NeajsMCYXBBSfXl8OtXa8GkZoWcCFP1wAzmdh5J1GiRA5W3tmX71mY2g4n8OKOVspO3Xy7ptnWrDDRUpsRGReOkT4uO7vzxOJsm1X00AdgaTqXVc=
ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key;
	t=1718799236; c=relaxed/simple;
	bh=P6kPQucXBDWbbjA6pvf0JNsURKuJlqEMyvLe0SdKEZ0=;
	h=DKIM-Signature:DKIM-Signature:DKIM-Signature:DKIM-Signature:Date:
	 From:To:Subject:Message-ID:MIME-Version; b=gp/X6rqYxltyhykvOK2Ow6+1MlUSacbXEV+dGYGiYHDsJg/b8kAcZNz3ARc3ZpjeeDLsTUq8a7P6HvLLjTh2V+CfhEBN1Zk5MB2fRUWiM5k4hl0is8jnPUyDd6OwLu8fHnTcEOy/8O63wxyxqP85Ayp1q4Ulr86/CmaIKrEEsaw=
ARC-Authentication-Results: i=1; server2.sourceware.org
Received: from murzim.nue2.suse.org (unknown [10.168.4.243])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256)
	(No client certificate requested)
	by smtp-out2.suse.de (Postfix) with ESMTPS id F3B661F832;
	Wed, 19 Jun 2024 12:13:52 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_rsa;
	t=1718799233; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc:
	 mime-version:mime-version:content-type:content-type:
	 in-reply-to:in-reply-to:references:references;
	bh=uZ+HZ9NmGZsNKdk9sHs3vvD3l8pb6vJVINyWQNidFY0=;
	b=FUycedGm3mXOEByAfJBK1FE53q6FcmHx6Y5NCSoHilGTP2T2iGVWniUv1BE3Ks8tyEO/Jg
	vkBpF00O7wdtfH0Wf6o4c5c5i8nYcHOi2y5ag3G1lolRPFvM+h9W6DLkl3j3JD2zaaXxj1
	XssBx+IlaFO1lDhiUgaj0PNEBfjg+W0=
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de;
	s=susede2_ed25519; t=1718799233;
	h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc:
	 mime-version:mime-version:content-type:content-type:
	 in-reply-to:in-reply-to:references:references;
	bh=uZ+HZ9NmGZsNKdk9sHs3vvD3l8pb6vJVINyWQNidFY0=;
	b=yfRk0NwbFs//o4aXrGQciIg0JGkAJt7OlDJWP9b6eL1lk+LIHBYUNrRWN0Cx7GSt2Z0h7A
	Y7qziAnJ3v1EnVDg==
Authentication-Results: smtp-out2.suse.de;
	none
Date: Wed, 19 Jun 2024 14:13:52 +0200 (CEST)
From: Richard Biener <rguenther@suse.de>
To: Tamar Christina <tamar.christina@arm.com>
cc: gcc-patches@gcc.gnu.org, nd@arm.com, bin.cheng@linux.alibaba.com
Subject: Re: [PATCH][ivopts]: perform affine fold on unsigned addressing
 modes known not to overflow. [PR114932]
In-Reply-To: <patch-18487-tamar@arm.com>
Message-ID: <po87oq32-452o-0o70-s968-rqnnp8263390@fhfr.qr>
References: <patch-18487-tamar@arm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
X-Spam-Score: -4.30
X-Spam-Level: 
X-Spamd-Result: default: False [-4.30 / 50.00];
	BAYES_HAM(-3.00)[100.00%];
	NEURAL_HAM_LONG(-1.00)[-1.000];
	NEURAL_HAM_SHORT(-0.20)[-1.000];
	MIME_GOOD(-0.10)[text/plain];
	MISSING_XM_UA(0.00)[];
	MIME_TRACE(0.00)[0:+];
	TO_DN_SOME(0.00)[];
	ARC_NA(0.00)[];
	RCVD_COUNT_ZERO(0.00)[0];
	FROM_HAS_DN(0.00)[];
	DKIM_SIGNED(0.00)[suse.de:s=susede2_rsa,suse.de:s=susede2_ed25519];
	FROM_EQ_ENVFROM(0.00)[];
	RCPT_COUNT_THREE(0.00)[4];
	TO_MATCH_ENVRCPT_ALL(0.00)[];
	FUZZY_BLOCKED(0.00)[rspamd.com];
	DBL_BLOCKED_OPENRESOLVER(0.00)[suse.de:email]
X-Spam-Status: No, score=-11.0 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,GIT_PATCH_0,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

On Fri, 14 Jun 2024, Tamar Christina wrote:

> Hi All,
> 
> When the patch for PR114074 was applied we saw a good boost in exchange2.
> 
> This boost was partially caused by a simplification of the addressing modes.
> With the patch applied IV opts saw the following form for the base addressing;
> 
>   Base: (integer(kind=4) *) &block + ((sizetype) ((unsigned long) l0_19(D) *
> 324) + 36)
> 
> vs what we normally get:
> 
>   Base: (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D)
> * 81) + 9) * 4
> 
> This is because the patch promoted multiplies where one operand is a constant
> from a signed multiply to an unsigned one, to attempt to fold away the constant.
> 
> This patch attempts the same but due to the various problems with SCEV and
> niters not being able to analyze the resulting forms (i.e. PR114322) we can't
> do it during SCEV or in the general form like in fold-const like extract_muldiv
> attempts.
> 
> Instead this applies the simplification during IVopts initialization when we
> create the IV.  Essentially when we know the IV won't overflow with regards to
> niters then we perform an affine fold which gets it to simplify the internal
> computation, even if this is signed because we know that for IVOPTs uses the
> IV won't ever overflow.  This allows IV opts to see the simplified form
> without influencing the rest of the compiler.
> 
> as mentioned in PR114074 it would be good to fix the missed optimization in the
> other passes so we can perform this in general.
> 
> The reason this has a big impact on Fortran code is that Fortran doesn't seem to
> have unsigned integer types.  As such, all its address computations are created
> with signed types, and folding does not happen on them due to the possible overflow.
> 
> concretely on AArch64 this changes the results from generation:
> 
>         mov     x27, -108
>         mov     x24, -72
>         mov     x23, -36
>         add     x21, x1, x0, lsl 2
>         add     x19, x20, x22
> .L5:
>         add     x0, x22, x19
>         add     x19, x19, 324
>         ldr     d1, [x0, x27]
>         add     v1.2s, v1.2s, v15.2s
>         str     d1, [x20, 216]
>         ldr     d0, [x0, x24]
>         add     v0.2s, v0.2s, v15.2s
>         str     d0, [x20, 252]
>         ldr     d31, [x0, x23]
>         add     v31.2s, v31.2s, v15.2s
>         str     d31, [x20, 288]
>         bl      digits_20_
>         cmp     x21, x19
>         bne     .L5
> 
> into:
> 
> .L5:
>         ldr     d1, [x19, -108]
>         add     v1.2s, v1.2s, v15.2s
>         str     d1, [x20, 216]
>         ldr     d0, [x19, -72]
>         add     v0.2s, v0.2s, v15.2s
>         str     d0, [x20, 252]
>         ldr     d31, [x19, -36]
>         add     x19, x19, 324
>         add     v31.2s, v31.2s, v15.2s
>         str     d31, [x20, 288]
>         bl      digits_20_
>         cmp     x21, x19
>         bne     .L5
> 
> The two patches together results in a 10% performance increase in exchange2 in
> SPECCPU 2017 and a 4% reduction in binary size and a 5% improvement in compile
> time. There's also a 5% performance improvement in fotonik3d and similar
> reduction in binary size.
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> 
> Ok for master?
> 
> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> 	PR tree-optimization/114932
> 	* tree-ssa-loop-ivopts.cc (alloc_iv): Perform affine unsigned fold.
> 
> gcc/testsuite/ChangeLog:
> 
> 	PR tree-optimization/114932
> 	* gfortran.dg/addressing-modes_1.f90: New test.
> 
> ---
> diff --git a/gcc/testsuite/gfortran.dg/addressing-modes_1.f90 b/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
> new file mode 100644
> index 0000000000000000000000000000000000000000..334d5bc47a16e53e9168bb1f90dfeff584b4e494
> --- /dev/null
> +++ b/gcc/testsuite/gfortran.dg/addressing-modes_1.f90
> @@ -0,0 +1,37 @@
> +! { dg-do compile { target aarch64-*-* } }
> +! { dg-additional-options "-w -Ofast" }
> +
> +  module brute_force
> +    integer, parameter :: r=9
> +     integer  block(r, r, 0)
> +    contains
> +  subroutine brute
> +     do
> +      do
> +          do
> +           do
> +                do
> +                     do
> +                         do i7 = l0, 1
> +                       select case(1 )
> +                       case(1)
> +                           block(:2, 7:, 1) = block(:2, 7:, i7) - 1
> +                       end select
> +                            do i8 = 1, 1
> +                               do i9 = 1, 1
> +                            if(1 == 1) then
> +                                    call digits_20
> +                                end if
> +                                end do
> +                          end do
> +                    end do
> +                    end do
> +              end do
> +              end do
> +           end do
> +     end do
> +  end do
> + end
> +  end
> +
> +! { dg-final { scan-assembler-not {ldr\s+d([0-9]+),\s+\[x[0-9]+, x[0-9]+\]} } }
> diff --git a/gcc/tree-ssa-loop-ivopts.cc b/gcc/tree-ssa-loop-ivopts.cc
> index 4338d7b64a6c2df6404b8d5e51c7f62c23006e72..f621e4ee924b930e1e1d68e35f3d3a0d52470811 100644
> --- a/gcc/tree-ssa-loop-ivopts.cc
> +++ b/gcc/tree-ssa-loop-ivopts.cc
> @@ -1216,6 +1216,18 @@ alloc_iv (struct ivopts_data *data, tree base, tree step,
>        base = fold_convert (TREE_TYPE (base), aff_combination_to_tree (&comb));
>      }
>  
> +  /* If we know the IV won't overflow wrt niters and the type is an unsigned
> +     type then fold using affine unsigned arithmetic to allow more folding of
> +     constants.  */
> +  if (no_overflow
> +      && TYPE_UNSIGNED (TREE_TYPE (expr)))
> +    {
> +      aff_tree comb;
> +      tree utype = unsigned_type_for (TREE_TYPE (expr));
> +      tree_to_aff_combination (expr, utype, &comb);
> +      base = fold_convert (TREE_TYPE (base), aff_combination_to_tree (&comb));
> +    }
> +

So right above we already do

  /* Lower address expression in base except ones with DECL_P as operand.
     By doing this:
       1) More accurate cost can be computed for address expressions;
       2) Duplicate candidates won't be created for bases in different
          forms, like &a[0] and &a.  */
  STRIP_NOPS (expr);
  if ((TREE_CODE (expr) == ADDR_EXPR && !DECL_P (TREE_OPERAND (expr, 0)))
      || contain_complex_addr_expr (expr))
    {
      aff_tree comb;
      tree_to_aff_combination (expr, TREE_TYPE (expr), &comb);
      base = fold_convert (TREE_TYPE (base), aff_combination_to_tree (&comb));
    }

and if I read correctly 'expr' is

  (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D) * 81) + 9) * 4

in your interesting case which means it doesn't satisfy
contain_complex_addr_expr.

I don't quite get why rewriting the base into (T)(unsigned)... is
only valid when no_overflow - no_overflow is about {base, +, step},
not about any overflow contained in 'base'.
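
To make the distinction concrete with the numbers from this thread: once
the offset is computed in a wrapping unsigned type, distributing the
constant multiply preserves the value for every input, independent of
whether the IV {base, +, step} overflows.  A minimal standalone sketch in
plain C (not GCC internals; the function names are made up for
illustration, and it assumes a 64-bit sizetype):

```c
#include <stdint.h>

/* Unexpanded form: ((sizetype) l0 * 81 + 9) * 4, evaluated entirely in
   wrapping (modulo 2^64) unsigned arithmetic.  */
uint64_t
offset_unexpanded (uint64_t l0)
{
  return (l0 * 81u + 9u) * 4u;
}

/* Folded form: (sizetype) l0 * 324 + 36, as in the simplified base.  */
uint64_t
offset_folded (uint64_t l0)
{
  return l0 * 324u + 36u;
}

/* Returns nonzero when both forms yield the same offset.  In modular
   arithmetic this holds for every l0, including values where the
   multiplication wraps.  */
int
forms_agree (uint64_t l0)
{
  return offset_unexpanded (l0) == offset_folded (l0);
}
```

For l0 = 1 both forms give 360, and forms_agree also holds for
l0 = UINT64_MAX, where the multiplication wraps.  The same rewrite on a
signed type would rely on the intermediate results not overflowing, which
is a property of 'base' itself and not of the IV.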

I wonder if we maybe want to record an "original" iv->base
to be used for code generation (and there only the unexpanded form)
and a variant used for the various sorts of canonicalization/compare
(I see we eventually add/subtract step and then compare against
something else).  And then always apply this normalization to the
variant, never to the "original" form.

The above STRIP_NOPS (expr) + expand might turn an unsigned
affine combination into a signed one, which might be problematic.
So what happens if you change the above to simply always
expand in the unsigned type?

Richard.

>    iv->base = base;
>    iv->base_object = determine_base_object (data, base);
>    iv->step = step;

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)