From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-492369-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 17106 invoked by alias); 13 Dec 2018 16:08:13 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Received: (qmail 17087 invoked by uid 89); 13 Dec 2018 16:08:12 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=-1.9 required=5.0 tests=BAYES_00,SPF_PASS autolearn=ham version=3.3.2 spammy=beneficiary, hoping
X-HELO: mx1.suse.de
Received: from mx2.suse.de (HELO mx1.suse.de) (195.135.220.15) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Thu, 13 Dec 2018 16:08:05 +0000
Received: from relay1.suse.de (unknown [195.135.220.254])	by mx1.suse.de (Postfix) with ESMTP id 18C77AEE9;	Thu, 13 Dec 2018 16:08:03 +0000 (UTC)
Date: Thu, 13 Dec 2018 16:08:00 -0000
User-Agent: K-9 Mail for Android
In-Reply-To: <87o99qegch.fsf@arm.com>
References: <alpine.LSU.2.20.1810301109360.1827@zhemvz.fhfr.qr> <87d0qpi1wr.fsf@arm.com> <alpine.LSU.2.20.1812031342150.1827@zhemvz.fhfr.qr> <875zw63i9n.fsf@arm.com> <CAFiYyc3R01O+RUQm2MX35G7_jUWMf=ERfqmuOdt0NKBWTJii3w@mail.gmail.com> <87o99qegch.fsf@arm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Subject: Re: Add a loop versioning pass
To: Richard Sandiford <richard.sandiford@arm.com>,Richard Biener <richard.guenther@gmail.com>
CC: GCC Patches <gcc-patches@gcc.gnu.org>
From: Richard Biener <rguenther@suse.de>
Message-ID: <2E49010E-6D50-4BB7-893A-82BD961755FB@suse.de>
X-SW-Source: 2018-12/txt/msg00943.txt.bz2

On December 12, 2018 7:43:10 PM GMT+01:00, Richard Sandiford <richard.sandi=
ford@arm.com> wrote:
>Richard Biener <richard.guenther@gmail.com> writes:
>> On Thu, Dec 6, 2018 at 2:19 PM Richard Sandiford
>>> Tested on x86_64-linux-gnu, aarch64-linux-gnu and aarch64_be-elf.
>>> Also repeated the performance testing (but haven't yet tried an
>>> LTO variant; will do that over the weekend).
>>
>> Any results?
>
>Sorry, I should've remembered that finding time to run tests is easy,
>finding time to analyse them is hard.
>
>Speed-wise, the impact of the patch for LTO is similar to without,
>with 554.roms_r being the main beneficiary for both AArch64 and x86_64.
>I get a 6.8% improvement on Cortex-A57 with -Ofast -mcpu=3Dnative
>-flto=3Djobserver.
>
>Size-wise, there are three tests that grow by >=3D2% on x86_64:
>
>549.fotonik3d_r: 5.5%
>548.exchange2_r: 29.5%
>554.roms_r: 39.6%

Uh. With LTO we might have a reasonable guessed profile and you do have a o=
ptimize_loop_nest_for_speed guard on the transform?=20

How does compile time fare with the above benchmarks?

>The roms increase seems fair enough given the benefit, since the
>whole program uses a similar style for handling arrays.
>
>fotonik3d is a mixed bag.  Some versioning conditions are from
>things like:
>
>  subroutine foo(a)
>    real :: a(:,:,:)
>    a(1,:,:) =3D ...
>  end subroutine
>
>where we version for the middle dimension having a stride of 1.
>This could be eliminated if we had more semantic information.
>
>Other conditions are for things like:
>
>  subroutine foo(a)
>    real :: a(:,:,:)
>    a(:,1,:) =3D ...
>  end subroutine
>
>where the pass really does help, but unfortunately those loops
>aren't hot and might not even be used at all.
>
>Some opportunities are cases in which we avoid gather loads, such as
>in the "mp" loads in the hot __material_mod_MOD_mat_updatee function.
>But mp feeds a gather load itself and these assignments aren't a
>critical part of the loop.
>
>(I'm testing on an AVX-only system, so these are synthesised gathers
>rather than real gathers.)
>
>The growth in 548.exchange2_r comes from reasonable changes to cold
>code.
>The test spends almost all its time in __brute_force_MOD_digits_2,
>which
>contains the same code before and after the patch, but which accounts
>for less than 1% of .text before the patch.
>
>> I've skimmed over the updated patch and it looks
>> a lot better now.
>>
>> +bool
>> +loop_versioning
>> +::find_per_loop_multiplication (address_info &address,
>address_term_info &term)
>> +{
>>
>> is that what coding convention allows?
>
>Not sure we've settled on something, so I improvised.
>
>> For grepping I'd then say we should do
>>
>> bool loop_versioning::
>> find_per_loop_multiplication (...)
>>
>> ;)
>
>Sounds good to me.
>
>> Anywhere else we you use
>>
>> loop_versioning::foo
>>
>> so please stick to that.
>
>Yeah, I used that when I could avoid an argument spilling over 80
>chars,
>but I think I missed that the above function now fits into that
>category,
>Will double-check the others.
>
>> Otherwise OK.
>
>Thanks :-)
>
>> I think I don't see a testcase where we could version both loops in a
>nest
>> so I'm not sure whether the transform works fine when you are only
>> updating SSA form at the very end of the pass.
>
>You mean something like:
>
>  real :: foo(:,:), bar(:)
>
>  do i=3D1,n
>    do j=3D1,n
>      foo(i,j) =3D ...
>    end do
>    bar(i) =3D ..
>  end do
>
>?  I can add a test if so.

Please.=20

>At the moment the pass only looks for direct versioning opportunities
>in inner loops, so the assignment to bar wouldn't be treated as a
>versioning opportunity.=20=20

Ah, OK.=20

We should still hoist the version check
>for foo outside both loops, which is tested by loop_versioning_4.f90
>for cases in which the outer loop doesn't have its own array
>arithmetic, but isn't tested for cases like the above.
>
>> There may also be some subtle issues with substitute_and_fold being
>> applied to non-up-to-date SSA form given it folds stmts looking at
>> (single-use!) SSA edges.  The single-use guard might be what saves
>you
>> here (SSA uses in the copies are not yet updated to point to the
>> copied DEFs).
>
>OK.  I was hoping that because we only apply substitute_and_fold
>to new code, there would be no problem with uses elsewhere.

Might be, yes.

>Would it be safer to:
>
>  - version all loops we want to version
>  - update SSA explicitly
>  - apply substitute and fold to all "new" loops

That would be definitely less fishy. But you need to get at the actual 'new=
' SSA names for the replacements as I guess they'll be rewritten? Or maybe =
those are not.=20

>?  Could we then get away with returning a 0 TODO at the end?

Yes.=20

Richard.=20

>Thanks,
>Richard