From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-492283-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 48361 invoked by alias); 12 Dec 2018 18:43:17 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Received: (qmail 48350 invoked by uid 89); 12 Dec 2018 18:43:17 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=-1.9 required=5.0 tests=BAYES_00,SPF_PASS autolearn=ham version=3.3.2 spammy=settled, opportunities, beneficiary, address_info
X-HELO: foss.arm.com
Received: from usa-sjc-mx-foss1.foss.arm.com (HELO foss.arm.com) (217.140.101.70) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Wed, 12 Dec 2018 18:43:15 +0000
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.72.51.249])	by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 5171480D;	Wed, 12 Dec 2018 10:43:13 -0800 (PST)
Received: from localhost (unknown [10.32.99.101])	by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 8299F3F575;	Wed, 12 Dec 2018 10:43:12 -0800 (PST)
From: Richard Sandiford <richard.sandiford@arm.com>
To: Richard Biener <richard.guenther@gmail.com>
Mail-Followup-To: Richard Biener <richard.guenther@gmail.com>,Richard Guenther <rguenther@suse.de>,  GCC Patches <gcc-patches@gcc.gnu.org>, richard.sandiford@arm.com
Cc: Richard Guenther <rguenther@suse.de>,  GCC Patches <gcc-patches@gcc.gnu.org>
Subject: Re: Add a loop versioning pass
References: <alpine.LSU.2.20.1810301109360.1827@zhemvz.fhfr.qr>	<87d0qpi1wr.fsf@arm.com>	<alpine.LSU.2.20.1812031342150.1827@zhemvz.fhfr.qr>	<875zw63i9n.fsf@arm.com>	<CAFiYyc3R01O+RUQm2MX35G7_jUWMf=ERfqmuOdt0NKBWTJii3w@mail.gmail.com>
Date: Wed, 12 Dec 2018 18:43:00 -0000
In-Reply-To: <CAFiYyc3R01O+RUQm2MX35G7_jUWMf=ERfqmuOdt0NKBWTJii3w@mail.gmail.com>	(Richard Biener's message of "Wed, 12 Dec 2018 13:05:40 +0100")
Message-ID: <87o99qegch.fsf@arm.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.1 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
X-SW-Source: 2018-12/txt/msg00857.txt.bz2

Richard Biener <richard.guenther@gmail.com> writes:
> On Thu, Dec 6, 2018 at 2:19 PM Richard Sandiford
>> Tested on x86_64-linux-gnu, aarch64-linux-gnu and aarch64_be-elf.
>> Also repeated the performance testing (but haven't yet tried an
>> LTO variant; will do that over the weekend).
>
> Any results?

Sorry, I should've remembered that finding time to run tests is easy,
finding time to analyse them is hard.

Speed-wise, the impact of the patch for LTO is similar to without,
with 554.roms_r being the main beneficiary for both AArch64 and x86_64.
I get a 6.8% improvement on Cortex-A57 with -Ofast -mcpu=native
-flto=jobserver.

Size-wise, there are three tests that grow by >=2% on x86_64:

549.fotonik3d_r: 5.5%
548.exchange2_r: 29.5%
554.roms_r: 39.6%

The roms increase seems fair enough given the benefit, since the
whole program uses a similar style for handling arrays.

fotonik3d is a mixed bag.  Some versioning conditions are from
things like:

  subroutine foo(a)
    real :: a(:,:,:)
    a(1,:,:) = ...
  end subroutine

where we version for the middle dimension having a stride of 1.
This could be eliminated if we had more semantic information.

Other conditions are for things like:

  subroutine foo(a)
    real :: a(:,:,:)
    a(:,1,:) = ...
  end subroutine

where the pass really does help, but unfortunately those loops
aren't hot and might not even be used at all.

Some opportunities are cases in which we avoid gather loads, such as
in the "mp" loads in the hot __material_mod_MOD_mat_updatee function.
But mp feeds a gather load itself and these assignments aren't a
critical part of the loop.

(I'm testing on an AVX-only system, so these are synthesised gathers
rather than real gathers.)

The growth in 548.exchange2_r comes from reasonable changes to cold code.
The test spends almost all its time in __brute_force_MOD_digits_2, which
contains the same code before and after the patch, but which accounts
for less than 1% of .text before the patch.

> I've skimmed over the updated patch and it looks
> a lot better now.
>
> +bool
> +loop_versioning
> +::find_per_loop_multiplication (address_info &address, address_term_info &term)
> +{
>
> is that what coding convention allows?

Not sure we've settled on something, so I improvised.

> For grepping I'd then say we should do
>
> bool loop_versioning::
> find_per_loop_multiplication (...)
>
> ;)

Sounds good to me.

> Anywhere else we you use
>
> loop_versioning::foo
>
> so please stick to that.

Yeah, I used that when I could avoid an argument spilling over 80 chars,
but I think I missed that the above function now fits into that category,
Will double-check the others.

> Otherwise OK.

Thanks :-)

> I think I don't see a testcase where we could version both loops in a nest
> so I'm not sure whether the transform works fine when you are only
> updating SSA form at the very end of the pass.

You mean something like:

  real :: foo(:,:), bar(:)

  do i=1,n
    do j=1,n
      foo(i,j) = ...
    end do
    bar(i) = ..
  end do

?  I can add a test if so.

At the moment the pass only looks for direct versioning opportunities
in inner loops, so the assignment to bar wouldn't be treated as a
versioning opportunity.  We should still hoist the version check
for foo outside both loops, which is tested by loop_versioning_4.f90
for cases in which the outer loop doesn't have its own array
arithmetic, but isn't tested for cases like the above.

> There may also be some subtle issues with substitute_and_fold being
> applied to non-up-to-date SSA form given it folds stmts looking at
> (single-use!) SSA edges.  The single-use guard might be what saves you
> here (SSA uses in the copies are not yet updated to point to the
> copied DEFs).

OK.  I was hoping that because we only apply substitute_and_fold
to new code, there would be no problem with uses elsewhere.

Would it be safer to:

  - version all loops we want to version
  - update SSA explicitly
  - apply substitute and fold to all "new" loops

?  Could we then get away with returning a 0 TODO at the end?

Thanks,
Richard