From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 17106 invoked by alias); 13 Dec 2018 16:08:13 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Received: (qmail 17087 invoked by uid 89); 13 Dec 2018 16:08:12 -0000 Authentication-Results: sourceware.org; auth=none X-Spam-SWARE-Status: No, score=-1.9 required=5.0 tests=BAYES_00,SPF_PASS autolearn=ham version=3.3.2 spammy=beneficiary, hoping X-HELO: mx1.suse.de Received: from mx2.suse.de (HELO mx1.suse.de) (195.135.220.15) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Thu, 13 Dec 2018 16:08:05 +0000 Received: from relay1.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id 18C77AEE9; Thu, 13 Dec 2018 16:08:03 +0000 (UTC) Date: Thu, 13 Dec 2018 16:08:00 -0000 User-Agent: K-9 Mail for Android In-Reply-To: <87o99qegch.fsf@arm.com> References: <87d0qpi1wr.fsf@arm.com> <875zw63i9n.fsf@arm.com> <87o99qegch.fsf@arm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Subject: Re: Add a loop versioning pass To: Richard Sandiford ,Richard Biener CC: GCC Patches From: Richard Biener Message-ID: <2E49010E-6D50-4BB7-893A-82BD961755FB@suse.de> X-SW-Source: 2018-12/txt/msg00943.txt.bz2 On December 12, 2018 7:43:10 PM GMT+01:00, Richard Sandiford wrote: >Richard Biener writes: >> On Thu, Dec 6, 2018 at 2:19 PM Richard Sandiford >>> Tested on x86_64-linux-gnu, aarch64-linux-gnu and aarch64_be-elf. >>> Also repeated the performance testing (but haven't yet tried an >>> LTO variant; will do that over the weekend). >> >> Any results? > >Sorry, I should've remembered that finding time to run tests is easy, >finding time to analyse them is hard. > >Speed-wise, the impact of the patch for LTO is similar to without, >with 554.roms_r being the main beneficiary for both AArch64 and x86_64. >I get a 6.8% improvement on Cortex-A57 with -Ofast -mcpu=3Dnative >-flto=3Djobserver. > >Size-wise, there are three tests that grow by >=3D2% on x86_64: > >549.fotonik3d_r: 5.5% >548.exchange2_r: 29.5% >554.roms_r: 39.6% Uh. With LTO we might have a reasonable guessed profile and you do have a o= ptimize_loop_nest_for_speed guard on the transform?=20 How does compile time fare with the above benchmarks? >The roms increase seems fair enough given the benefit, since the >whole program uses a similar style for handling arrays. > >fotonik3d is a mixed bag. Some versioning conditions are from >things like: > > subroutine foo(a) > real :: a(:,:,:) > a(1,:,:) =3D ... > end subroutine > >where we version for the middle dimension having a stride of 1. >This could be eliminated if we had more semantic information. > >Other conditions are for things like: > > subroutine foo(a) > real :: a(:,:,:) > a(:,1,:) =3D ... > end subroutine > >where the pass really does help, but unfortunately those loops >aren't hot and might not even be used at all. > >Some opportunities are cases in which we avoid gather loads, such as >in the "mp" loads in the hot __material_mod_MOD_mat_updatee function. >But mp feeds a gather load itself and these assignments aren't a >critical part of the loop. > >(I'm testing on an AVX-only system, so these are synthesised gathers >rather than real gathers.) > >The growth in 548.exchange2_r comes from reasonable changes to cold >code. >The test spends almost all its time in __brute_force_MOD_digits_2, >which >contains the same code before and after the patch, but which accounts >for less than 1% of .text before the patch. > >> I've skimmed over the updated patch and it looks >> a lot better now. >> >> +bool >> +loop_versioning >> +::find_per_loop_multiplication (address_info &address, >address_term_info &term) >> +{ >> >> is that what coding convention allows? > >Not sure we've settled on something, so I improvised. > >> For grepping I'd then say we should do >> >> bool loop_versioning:: >> find_per_loop_multiplication (...) >> >> ;) > >Sounds good to me. > >> Anywhere else we you use >> >> loop_versioning::foo >> >> so please stick to that. > >Yeah, I used that when I could avoid an argument spilling over 80 >chars, >but I think I missed that the above function now fits into that >category, >Will double-check the others. > >> Otherwise OK. > >Thanks :-) > >> I think I don't see a testcase where we could version both loops in a >nest >> so I'm not sure whether the transform works fine when you are only >> updating SSA form at the very end of the pass. > >You mean something like: > > real :: foo(:,:), bar(:) > > do i=3D1,n > do j=3D1,n > foo(i,j) =3D ... > end do > bar(i) =3D .. > end do > >? I can add a test if so. Please.=20 >At the moment the pass only looks for direct versioning opportunities >in inner loops, so the assignment to bar wouldn't be treated as a >versioning opportunity.=20=20 Ah, OK.=20 We should still hoist the version check >for foo outside both loops, which is tested by loop_versioning_4.f90 >for cases in which the outer loop doesn't have its own array >arithmetic, but isn't tested for cases like the above. > >> There may also be some subtle issues with substitute_and_fold being >> applied to non-up-to-date SSA form given it folds stmts looking at >> (single-use!) SSA edges. The single-use guard might be what saves >you >> here (SSA uses in the copies are not yet updated to point to the >> copied DEFs). > >OK. I was hoping that because we only apply substitute_and_fold >to new code, there would be no problem with uses elsewhere. Might be, yes. >Would it be safer to: > > - version all loops we want to version > - update SSA explicitly > - apply substitute and fold to all "new" loops That would be definitely less fishy. But you need to get at the actual 'new= ' SSA names for the replacements as I guess they'll be rewritten? Or maybe = those are not.=20 >? Could we then get away with returning a 0 TODO at the end? Yes.=20 Richard.=20 >Thanks, >Richard