From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 48361 invoked by alias); 12 Dec 2018 18:43:17 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Received: (qmail 48350 invoked by uid 89); 12 Dec 2018 18:43:17 -0000 Authentication-Results: sourceware.org; auth=none X-Spam-SWARE-Status: No, score=-1.9 required=5.0 tests=BAYES_00,SPF_PASS autolearn=ham version=3.3.2 spammy=settled, opportunities, beneficiary, address_info X-HELO: foss.arm.com Received: from usa-sjc-mx-foss1.foss.arm.com (HELO foss.arm.com) (217.140.101.70) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Wed, 12 Dec 2018 18:43:15 +0000 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.72.51.249]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 5171480D; Wed, 12 Dec 2018 10:43:13 -0800 (PST) Received: from localhost (unknown [10.32.99.101]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 8299F3F575; Wed, 12 Dec 2018 10:43:12 -0800 (PST) From: Richard Sandiford To: Richard Biener Mail-Followup-To: Richard Biener ,Richard Guenther , GCC Patches , richard.sandiford@arm.com Cc: Richard Guenther , GCC Patches Subject: Re: Add a loop versioning pass References: <87d0qpi1wr.fsf@arm.com> <875zw63i9n.fsf@arm.com> Date: Wed, 12 Dec 2018 18:43:00 -0000 In-Reply-To: (Richard Biener's message of "Wed, 12 Dec 2018 13:05:40 +0100") Message-ID: <87o99qegch.fsf@arm.com> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.1 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain X-SW-Source: 2018-12/txt/msg00857.txt.bz2 Richard Biener writes: > On Thu, Dec 6, 2018 at 2:19 PM Richard Sandiford >> Tested on x86_64-linux-gnu, aarch64-linux-gnu and aarch64_be-elf. >> Also repeated the performance testing (but haven't yet tried an >> LTO variant; will do that over the weekend). > > Any results? Sorry, I should've remembered that finding time to run tests is easy, finding time to analyse them is hard. Speed-wise, the impact of the patch for LTO is similar to without, with 554.roms_r being the main beneficiary for both AArch64 and x86_64. I get a 6.8% improvement on Cortex-A57 with -Ofast -mcpu=native -flto=jobserver. Size-wise, there are three tests that grow by >=2% on x86_64: 549.fotonik3d_r: 5.5% 548.exchange2_r: 29.5% 554.roms_r: 39.6% The roms increase seems fair enough given the benefit, since the whole program uses a similar style for handling arrays. fotonik3d is a mixed bag. Some versioning conditions are from things like: subroutine foo(a) real :: a(:,:,:) a(1,:,:) = ... end subroutine where we version for the middle dimension having a stride of 1. This could be eliminated if we had more semantic information. Other conditions are for things like: subroutine foo(a) real :: a(:,:,:) a(:,1,:) = ... end subroutine where the pass really does help, but unfortunately those loops aren't hot and might not even be used at all. Some opportunities are cases in which we avoid gather loads, such as in the "mp" loads in the hot __material_mod_MOD_mat_updatee function. But mp feeds a gather load itself and these assignments aren't a critical part of the loop. (I'm testing on an AVX-only system, so these are synthesised gathers rather than real gathers.) The growth in 548.exchange2_r comes from reasonable changes to cold code. The test spends almost all its time in __brute_force_MOD_digits_2, which contains the same code before and after the patch, but which accounts for less than 1% of .text before the patch. > I've skimmed over the updated patch and it looks > a lot better now. > > +bool > +loop_versioning > +::find_per_loop_multiplication (address_info &address, address_term_info &term) > +{ > > is that what coding convention allows? Not sure we've settled on something, so I improvised. > For grepping I'd then say we should do > > bool loop_versioning:: > find_per_loop_multiplication (...) > > ;) Sounds good to me. > Anywhere else we you use > > loop_versioning::foo > > so please stick to that. Yeah, I used that when I could avoid an argument spilling over 80 chars, but I think I missed that the above function now fits into that category, Will double-check the others. > Otherwise OK. Thanks :-) > I think I don't see a testcase where we could version both loops in a nest > so I'm not sure whether the transform works fine when you are only > updating SSA form at the very end of the pass. You mean something like: real :: foo(:,:), bar(:) do i=1,n do j=1,n foo(i,j) = ... end do bar(i) = .. end do ? I can add a test if so. At the moment the pass only looks for direct versioning opportunities in inner loops, so the assignment to bar wouldn't be treated as a versioning opportunity. We should still hoist the version check for foo outside both loops, which is tested by loop_versioning_4.f90 for cases in which the outer loop doesn't have its own array arithmetic, but isn't tested for cases like the above. > There may also be some subtle issues with substitute_and_fold being > applied to non-up-to-date SSA form given it folds stmts looking at > (single-use!) SSA edges. The single-use guard might be what saves you > here (SSA uses in the copies are not yet updated to point to the > copied DEFs). OK. I was hoping that because we only apply substitute_and_fold to new code, there would be no problem with uses elsewhere. Would it be safer to: - version all loops we want to version - update SSA explicitly - apply substitute and fold to all "new" loops ? Could we then get away with returning a 0 TODO at the end? Thanks, Richard