From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH v3 2/3] Add predict_doloop_p target hook
To: Richard Biener
Cc: Segher Boessenkool, Jeff Law, gcc-patches@gcc.gnu.org,
    wschmidt@linux.ibm.com, bin.cheng@linux.alibaba.com, jakub@redhat.com
References: <1558064130-111037-1-git-send-email-linkw@linux.ibm.com>
    <20190520102439.GT31586@gate.crashing.org>
    <20190520163759.GY31586@gate.crashing.org>
From: "Kewen.Lin"
Date: Tue, 11 Jun 2019 02:39:00 -0000
Message-Id: <3d3347ad-6910-a5ea-11f9-1a4fc3cbc6d0@linux.ibm.com>

>> If my understanding on this question is correct, IMHO we should try to make
>> IVOPTs conservative than optimistic, since once the predict is wrong from
>> too optimistic decision, the costing on the doloop use is wrong, it's very
>> possible to affect the global optimal set. It looks we don't have any ways
>> to recover it in RTL then? (otherwise, there should be better place to fix
>> the PR). Although it's also possible to miss some good cases, it's at least
>> as good as before, I'm inclined to make it conservative.
>
> I wonder if you could simply benchmark what happens if you make
> IVOPTs _always_ create a doloop IV (if possible)? I doubt the
> cases where a doloop IV is bad (calls, etc.) are too common and
> that in those cases the extra simple IV hurts.
>

Hi Richard and all,

With the following settings:

  Base) without any changes
  A) predict_doloop with all checks enabled
  B) A + disable check on "too few iterations" (0 <= est_niter < 3)
  C) A + disable costly niter check
  D) A + disable invalid stmt check (call/computed_goto/switch)

I collected some runtime performance data with SPEC2017 as follows:

                    A vs. Base   B vs. A   C vs. A   D vs. A
  500.perlbench_r        0.00%     0.38%     0.00%    -0.19%
  502.gcc_r              0.00%     0.38%     0.00%     0.00%
  505.mcf_r              0.89%     0.00%     0.00%     0.00%
  520.omnetpp_r         -0.41%    -1.25%     0.00%     0.00%
  523.xalancbmk_r       -0.36%    -0.36%    -0.36%    -0.73%
  525.x264_r             1.14%     0.00%     0.00%     0.00%
  531.deepsjeng_r       -0.26%     0.26%     0.00%     0.00%
  541.leela_r            0.00%     0.37%     0.00%     0.00%
  548.exchange2_r        0.85%    -0.21%     0.00%     0.00%
  557.xz_r              -0.77%     0.52%    -0.26%     0.00%
  503.bwaves_r           0.00%     0.00%     0.36%     0.00%
  507.cactuBSSN_r       -0.57%     0.00%     0.00%     0.00%
  508.namd_r            -0.69%     0.35%     0.00%     0.35%
  510.parest_r           0.17%    -0.17%     0.00%    -0.17%
  511.povray_r          -1.31%    -0.44%     0.15%     0.15%
  519.lbm_r              0.00%     0.00%     0.00%     0.00%
  521.wrf_r              0.33%    -0.44%    -0.33%    -0.33%
  526.blender_r          0.26%     0.26%     0.00%     0.00%
  527.cam4_r             0.59%    -0.59%     0.00%    -0.39%
  538.imagick_r          0.45%     0.00%     0.00%     0.00%
  544.nab_r              0.23%     0.00%     0.00%     0.00%
  549.fotonik3d_r        1.80%    -0.29%     0.00%     0.00%
  554.roms_r             0.00%     0.00%     0.00%     0.00%
  geomean                0.10%    -0.05%    -0.02%    -0.06%

As shown above, the differences are very small; they look like noise and
can be ignored.
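
To make the four configurations concrete, here is a minimal sketch of how
the three checks could combine. It is not the code from the patch: the
struct names, field names and the function predict_doloop_sketch are all
hypothetical, and only the conditions themselves (0 <= est_niter < 3,
costly niter, call/computed_goto/switch) come from the settings listed
above.

```c
#include <stdbool.h>

/* Hypothetical summary of what the analysis knows about one loop.
   None of these names come from GCC or from the patch.  */
struct loop_summary {
    long est_niter;         /* estimated iteration count, -1 if unknown */
    bool niter_is_costly;   /* niter expression judged expensive */
    bool has_invalid_stmt;  /* call, computed goto or switch in the body */
};

/* Which checks are enabled; configuration A) enables all three.  */
struct doloop_checks {
    bool few_iterations;    /* disabled in B) */
    bool costly_niter;      /* disabled in C) */
    bool invalid_stmt;      /* disabled in D) */
};

/* Return true if the loop is predicted to become a doloop.  */
bool predict_doloop_sketch(const struct loop_summary *loop,
                           const struct doloop_checks *checks)
{
    if (checks->few_iterations
        && loop->est_niter >= 0 && loop->est_niter < 3)
        return false;
    if (checks->costly_niter && loop->niter_is_costly)
        return false;
    if (checks->invalid_stmt && loop->has_invalid_stmt)
        return false;
    return true;
}
```

With all checks enabled, a loop whose body contains a call is rejected;
clearing only invalid_stmt (configuration D) accepts it, which is the
situation the A) vs. D) comparison measures.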

I also ran part of the test suite with some explicit statistics dumping
(on gcc/g++/gfortran etc.). No regressions were found. The numbers of
unique files in which a predicted doloop was found are:

  A) 3297
  B) 3416
  C) 3297
  D) 3858

Some observations:

  * A) and C) give the same count, so the check on costly niter has no
    effect here. I plan to drop the check or replace it with the existing
    interface expression_expensive_p, as Richard mentioned. (Correct me
    if you have any concerns.)

  * B) does filter some cases. I checked a few of the differing cases,
    and they are indeed written with a small iteration count.

  * The delta between A) and D) isn't small. I did some filtering by
    compiling the C/C++ files at -O2 with A) and D) (-O2 is probably not
    the actual option used in each test, so it may not reproduce every
    difference, but I used it for simplicity), dumped the assembly files
    and compared them further, which finally left 60 files. I looked
    into all 60 cases (I assumed these are typical enough, so there is no
    need to go through the Fortran etc. ones). Most of the with/without
    differences are expected: 55 of them use more insns to update the
    biv, while 5 of them are trivial since they have to use the original
    biv later, so it is kept with or without the scanning. Some of the 55
    take one more register in the prologue/epilogue, and some of them
    have fewer setup insns (the ones that calculate the original value
    and bound for the selected iv).
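
For orientation, a loop of roughly the following shape contains a call, so
the invalid stmt check in A) rejects it while D) predicts it as a doloop.
This is only a hypothetical sketch: the actual test source is not included
in this mail, and the struct layout, the stub body of foo and the counter
foo_calls are assumptions, picked so that the store offset (188), the
stride (448) and the iteration count (100) line up with the constants in
the typical case shown next.

```c
#include <stddef.h>

/* Hypothetical layout, chosen only so that the stored field sits at
   byte offset 188 and consecutive elements are 448 bytes apart.  */
struct bla {
    char pad_before[188];
    int  val;
    char pad_after[256];
};

struct bla arr_base[100];

/* Stand-in callee, defined here only so the block links on its own.
   In a real test it would be external so the call stays in the loop.  */
static int foo_calls;
int foo(void)
{
    return ++foo_calls;
}

void xxx(void)
{
    int iter;

    /* 100 iterations, one call and one store per iteration.  */
    for (iter = 0; iter < 100; iter++)
        arr_base[iter].val = foo();
}
```

The call in the loop body is what the invalid stmt scanning looks for; the
address IV alone is enough for the exit test, so a separate counting IV is
extra work unless the loop really becomes a doloop.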

A typical case looks like this:

Replacing exit test: if (ivtmp_2 != 0)  | Replacing exit test: if (ivtmp_2 != 0)
xxx ()                                  | xxx ()
{                                       | {
  unsigned long ivtmp.9;                |   unsigned long ivtmp.9;
  int iter;                             |   int iter;
  int _1;                               |   int _1;
                                        |   unsigned int ivtmp_2;
                                        |   unsigned int ivtmp_3;
  struct bla[100] * _13;                |   struct bla[100] * _13;
  void * _14;                           |   void * _14;
  unsigned long _15;                    |
  unsigned long _16;                    |
                                        |
  [local count: 10737418]:              |   [local count: 10737418]:
  _13 = &arr_base + 188;                |   _13 = &arr_base + 188;
  ivtmp.9_8 = (unsigned long) _13;      |   ivtmp.9_8 = (unsigned long) _13;
  _15 = (unsigned long) &arr_base;      |
  _16 = _15 + 44988;                    |
                                        |
  [local count: 1063004407]:            |   [local count: 1063004407]:
                                        |   # ivtmp_3 = PHI <100(2), ivtmp_2(5)>
  # ivtmp.9_10 = PHI                    |   # ivtmp.9_10 = PHI
  _1 = foo ();                          |   _1 = foo ();
  _14 = (void *) ivtmp.9_10;            |   _14 = (void *) ivtmp.9_10;
  MEM[base: _14, offset: 0B] = _1;      |   MEM[base: _14, offset: 0B] = _1;
                                        |   ivtmp_2 = ivtmp_3 - 1;
  ivtmp.9_9 = ivtmp.9_10 + 448;         |   ivtmp.9_9 = ivtmp.9_10 + 448;
  if (ivtmp.9_9 != _16)                 |   if (ivtmp_2 != 0)
    goto ; [98.99%]                     |     goto ; [98.99%]
  else                                  |   else
    goto ; [1.01%]                      |     goto ; [1.01%]

        addis 31,2,.LC0@toc@ha          |         li 31,100
        ld 31,.LC0@toc@l(31)            |         addis 30,2,.LC0@toc@ha
        addis 30,31,0x1                 |         ld 30,.LC0@toc@l(30)
        std 0,16(1)                     |         std 0,16(1)
        stdu 1,-48(1)                   |         stdu 1,-48(1)
        .cfi_def_cfa_offset 48          |         .cfi_def_cfa_offset 48
        .cfi_offset 65, 16              |
        addi 30,30,-20736               |         .cfi_offset 65, 16
        .p2align 5                      |         .p2align 5
.L2:                                    | .L2:
        bl foo                          |         bl foo
        nop                             |         nop
        addi 31,31,448                  |         addi 9,31,-1
        stw 3,-448(31)                  |         addi 30,30,448
        cmpld 0,31,30                   |         rldicl. 31,9,0,32
                                        |         stw 3,-448(30)
        bne 0,.L2                       |         bne 0,.L2
        addi 1,1,48                     |         addi 1,1,48
        .cfi_def_cfa_offset 0           |         .cfi_def_cfa_offset 0
        ld 0,16(1)                      |         ld 0,16(1)
        ld 30,-16(1)                    |         ld 30,-16(1)
        ld 31,-8(1)                     |         ld 31,-8(1)

As shown above, the P8 SPEC2017 performance evaluation shows that
disabling the invalid_stmt scanning doesn't have any impact. It looks
like it's fine not to force it. The small test suite cases show that it's
possible to get a suboptimal instruction sequence in which the BIV and
its update are kept rather than eliminated. I guess the concern about the
scanning is its compilation time cost. Is it worth doing according to the
current data?

What do you think of this? Thanks a lot in advance!

Kewen