From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-502725-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 103130 invoked by alias); 11 Jun 2019 02:39:12 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Received: (qmail 103116 invoked by uid 89); 11 Jun 2019 02:39:11 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=-7.6 required=5.0 tests=AWL,BAYES_00,GIT_PATCH_2,KAM_ASCII_DIVIDERS,RCVD_IN_DNSWL_LOW,SPF_PASS autolearn=ham version=3.3.1 spammy=sk:cfi_def, 073, 038, advance!
X-HELO: mx0a-001b2d01.pphosted.com
Received: from mx0b-001b2d01.pphosted.com (HELO mx0a-001b2d01.pphosted.com) (148.163.158.5) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Tue, 11 Jun 2019 02:39:10 +0000
Received: from pps.filterd (m0098413.ppops.net [127.0.0.1])	by mx0b-001b2d01.pphosted.com (8.16.0.27/8.16.0.27) with SMTP id x5B2b6A0016629	for <gcc-patches@gcc.gnu.org>; Mon, 10 Jun 2019 22:39:08 -0400
Received: from e06smtp07.uk.ibm.com (e06smtp07.uk.ibm.com [195.75.94.103])	by mx0b-001b2d01.pphosted.com with ESMTP id 2t22e9j9h3-1	(version=TLSv1.2 cipher=AES256-GCM-SHA384 bits=256 verify=NOT)	for <gcc-patches@gcc.gnu.org>; Mon, 10 Jun 2019 22:39:07 -0400
Received: from localhost	by e06smtp07.uk.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted	for <gcc-patches@gcc.gnu.org> from <linkw@linux.ibm.com>;	Tue, 11 Jun 2019 03:39:06 +0100
Received: from b06avi18878370.portsmouth.uk.ibm.com (9.149.26.194)	by e06smtp07.uk.ibm.com (192.168.101.137) with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted;	(version=TLSv1/SSLv3 cipher=AES256-GCM-SHA384 bits=256/256)	Tue, 11 Jun 2019 03:39:03 +0100
Received: from d06av26.portsmouth.uk.ibm.com (d06av26.portsmouth.uk.ibm.com [9.149.105.62])	by b06avi18878370.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id x5B2d2Tq35455254	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK);	Tue, 11 Jun 2019 02:39:02 GMT
Received: from d06av26.portsmouth.uk.ibm.com (unknown [127.0.0.1])	by IMSVA (Postfix) with ESMTP id 7429FAE055;	Tue, 11 Jun 2019 02:39:02 +0000 (GMT)
Received: from d06av26.portsmouth.uk.ibm.com (unknown [127.0.0.1])	by IMSVA (Postfix) with ESMTP id F3420AE045;	Tue, 11 Jun 2019 02:38:59 +0000 (GMT)
Received: from kewenlins-mbp.cn.ibm.com (unknown [9.200.146.168])	by d06av26.portsmouth.uk.ibm.com (Postfix) with ESMTP;	Tue, 11 Jun 2019 02:38:59 +0000 (GMT)
Subject: Re: [PATCH v3 2/3] Add predict_doloop_p target hook
To: Richard Biener <rguenther@suse.de>
Cc: Segher Boessenkool <segher@kernel.crashing.org>,        Jeff Law <law@redhat.com>, gcc-patches@gcc.gnu.org,        wschmidt@linux.ibm.com, bin.cheng@linux.alibaba.com, jakub@redhat.com
References: <1558064130-111037-1-git-send-email-linkw@linux.ibm.com> <alpine.LSU.2.20.1905201122310.10704@zhemvz.fhfr.qr> <20190520102439.GT31586@gate.crashing.org> <b88b2031-28a2-4c87-b6fc-9685ee08c25f@redhat.com> <20190520163759.GY31586@gate.crashing.org> <e8a1d9e1-e2f7-9d85-393f-1784055664ad@linux.ibm.com> <alpine.LSU.2.20.1905211219190.10704@zhemvz.fhfr.qr>
From: "Kewen.Lin" <linkw@linux.ibm.com>
Date: Tue, 11 Jun 2019 02:39:00 -0000
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:60.0) Gecko/20100101 Thunderbird/60.7.0
MIME-Version: 1.0
In-Reply-To: <alpine.LSU.2.20.1905211219190.10704@zhemvz.fhfr.qr>
Content-Type: text/plain; charset=gbk
Content-Transfer-Encoding: 7bit
x-cbid: 19061102-0028-0000-0000-000003791DEA
X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused
x-cbparentid: 19061102-0029-0000-0000-0000243909CD
Message-Id: <3d3347ad-6910-a5ea-11f9-1a4fc3cbc6d0@linux.ibm.com>
X-IsSubscribed: yes
X-SW-Source: 2019-06/txt/msg00592.txt.bz2

>> If my understanding on this question is correct, IMHO we should try to make
>> IVOPTs conservative than optimistic, since once the predict is wrong from
>> too optimistic decision, the costing on the doloop use is wrong, it's very
>> possible to affect the global optimal set.  It looks we don't have any ways
>> to recover it in RTL then?  (otherwise, there should be better place to fix
>> the PR).  Although it's also possible to miss some good cases, it's at least
>> as good as before, I'm inclined to make it conservative.
> 
> I wonder if you could simply benchmark what happens if you make
> IVOPTs _always_ create a doloop IV (if possible)?  I doubt the
> cases where a doloop IV is bad (calls, etc.) are too common and
> that in those cases the extra simple IV hurts.
> 

Hi Richard and all,

With these different settings:
	Base) without any changes
	A) having predict_doloop and enable all checks
	B) A + disable check on "too few iterations" (0 <= est_niter < 3)
	C) A + disable costly niter check
	D) A + disable invalid stmt check (call/computed_goto/switch)

I collected some runtime performance data with SPEC2017 as following:

			Avs.Base   Bvs.A  Cvs.A  Dvs.A
500.perlbench_r		0.00%	0.38%	0.00%	-0.19%
502.gcc_r		0.00%	0.38%	0.00%	0.00%
505.mcf_r		0.89%	0.00%	0.00%	0.00%
520.omnetpp_r		-0.41%	-1.25%	0.00%	0.00%
523.xalancbmk_r		-0.36%	-0.36%	-0.36%	-0.73%
525.x264_r		1.14%	0.00%	0.00%	0.00%
531.deepsjeng_r		-0.26%	0.26%	0.00%	0.00%
541.leela_r		0.00%	0.37%	0.00%	0.00%
548.exchange2_r		0.85%	-0.21%	0.00%	0.00%
557.xz_r		-0.77%	0.52%	-0.26%	0.00%
503.bwaves_r		0.00%	0.00%	0.36%	0.00%
507.cactuBSSN_r		-0.57%	0.00%	0.00%	0.00%
508.namd_r		-0.69%	0.35%	0.00%	0.35%
510.parest_r		0.17%	-0.17%	0.00%	-0.17%
511.povray_r		-1.31%	-0.44%	0.15%	0.15%
519.lbm_r		0.00%	0.00%	0.00%	0.00%
521.wrf_r		0.33%	-0.44%	-0.33%	-0.33%
526.blender_r		0.26%	0.26%	0.00%	0.00%
527.cam4_r		0.59%	-0.59%	0.00%	-0.39%
538.imagick_r		0.45%	0.00%	0.00%	0.00%
544.nab_r		0.23%	0.00%	0.00%	0.00%
549.fotonik3d_r		1.80%	-0.29%	0.00%	0.00%
554.roms_r		0.00%	0.00%	0.00%	0.00%
geomean			0.10%	-0.05%	-0.02%	-0.06%

As above, the difference is very small, looks like caused by noise and can 
be ignored. 

I also ran partial of test suite with some explicit statistics dumping 
(on gcc/g++/gfortran etc.). No regressions found. The unique files number 
with predicted doloop found are:

  A) 3297
  B) 3416
  C) 3297
  D) 3858

Some observations:
  * Based on A) and C), we can see the checking on costly niter is useless, 
    I plan to give the check up or replace it with one existing interface 
    expression_expensive_p as Richard mentioned.  (Correct me if you have 
    any concerns.)
  * B) does filter some cases, I checked a few different cases, they are 
    written with small iteration count indeed.
  * The delta number isn't small between A) and D).

I ran some filtering by compiling C/C++ files at -O2 with A) and D) (-O2 is 
probably not the actual option used in each testing, may not cause the 
difference, but just for simplicity), obtained dumping assembly file and did
further comparison, then I got 60 files finally.

I looked into all 60 cases (I assumed these are typical enough, don't need
to go through the Fortran etc.).  Most of wi/wo differences are expected, 
55 of them use more insns to update biv, but 5 of them are trivial since they
have to use original biv later so it's kept wi/wo the scanning.  Some of 55 
takes one more register in prologue/epilogue, some of them have fewer setup 
insns (which are to calculate the original value and bound for selected iv).

Some typical case looks like:

 Replacing exit test: if (ivtmp_2 != 0)                                        \  Replacing exit test: if (ivtmp_2 != 0)
  xxx ()                                                                       \  xxx ()
  {                                                                            \  {
    unsigned long ivtmp.9;                                                     \    unsigned long ivtmp.9;
    int iter;                                                                  \    int iter;
    int _1;                                                                    \    int _1;
  -----------------------------------------------------------------------------\    unsigned int ivtmp_2;
  -----------------------------------------------------------------------------\    unsigned int ivtmp_3;
    struct bla[100] * _13;                                                     \    struct bla[100] * _13;
    void * _14;                                                                \    void * _14;
    unsigned long _15;                                                         \  -----------------------------------------------------------------------------
    unsigned long _16;                                                         \  -----------------------------------------------------------------------------
                                                                               \
    <bb 2> [local count: 10737418]:                                            \    <bb 2> [local count: 10737418]:
    _13 = &arr_base + 188;                                                     \    _13 = &arr_base + 188;
    ivtmp.9_8 = (unsigned long) _13;                                           \    ivtmp.9_8 = (unsigned long) _13;
    _15 = (unsigned long) &arr_base;                                           \  -----------------------------------------------------------------------------
    _16 = _15 + 44988;                                                         \  -----------------------------------------------------------------------------
                                                                               \
    <bb 3> [local count: 1063004407]:                                          \    <bb 3> [local count: 1063004407]:
  -----------------------------------------------------------------------------\    # ivtmp_3 = PHI <100(2), ivtmp_2(5)>
    # ivtmp.9_10 = PHI <ivtmp.9_8(2), ivtmp.9_9(5)>                            \    # ivtmp.9_10 = PHI <ivtmp.9_8(2), ivtmp.9_9(5)>
    _1 = foo ();                                                               \    _1 = foo ();
    _14 = (void *) ivtmp.9_10;                                                 \    _14 = (void *) ivtmp.9_10;
    MEM[base: _14, offset: 0B] = _1;                                           \    MEM[base: _14, offset: 0B] = _1;
  -----------------------------------------------------------------------------\    ivtmp_2 = ivtmp_3 - 1;
    ivtmp.9_9 = ivtmp.9_10 + 448;                                              \    ivtmp.9_9 = ivtmp.9_10 + 448;
    if (ivtmp.9_9 != _16)                                                      \    if (ivtmp_2 != 0)
      goto <bb 5>; [98.99%]                                                    \      goto <bb 5>; [98.99%]
    else                                                                       \    else
      goto <bb 4>; [1.01%]                                                     \      goto <bb 4>; [1.01%]


  >-------addis 31,2,.LC0@toc@ha                                               \  >-------li 31,100
  >-------ld 31,.LC0@toc@l(31)                                                 \  >-------addis 30,2,.LC0@toc@ha
  >-------addis 30,31,0x1                                                      \  >-------ld 30,.LC0@toc@l(30)
  >-------std 0,16(1)                                                          \  >-------std 0,16(1)
  >-------stdu 1,-48(1)                                                        \  >-------stdu 1,-48(1)
  >-------.cfi_def_cfa_offset 48                                               \  >-------.cfi_def_cfa_offset 48
  >-------.cfi_offset 65, 16                                                   \  -----------------------------------------------------------------------------
  >-------addi 30,30,-20736                                                    \  >-------.cfi_offset 65, 16
  >-------.p2align 5                                                           \  >-------.p2align 5
  .L2:                                                                         \  .L2:
  >-------bl foo                                                               \  >-------bl foo
  >-------nop                                                                  \  >-------nop
  >-------addi 31,31,448                                                       \  >-------addi 9,31,-1
  >-------stw 3,-448(31)                                                       \  >-------addi 30,30,448
  >-------cmpld 0,31,30                                                        \  >-------rldicl. 31,9,0,32
  -----------------------------------------------------------------------------\  >-------stw 3,-448(30)
  >-------bne 0,.L2                                                            \  >-------bne 0,.L2
  >-------addi 1,1,48                                                          \  >-------addi 1,1,48
  >-------.cfi_def_cfa_offset 0                                                \  >-------.cfi_def_cfa_offset 0
  >-------ld 0,16(1)                                                           \  >-------ld 0,16(1)
  >-------ld 30,-16(1)                                                         \  >-------ld 30,-16(1)
  >-------ld 31,-8(1)                                                          \  >-------ld 31,-8(1)


As shown above, P8 SPEC2017 performance evaluation shows to disable 
invalid_stmt scanning doesn't have any impacts.  It looks it's fine
not to force it.  The test suite small cases show it's possible to
have sub optimal instruction sequence with eliminated BIV and its update. 

I guess the concern on the scanning is compilation time cost, is it
worth to doing according to current data?

What do you think of this? 

Thanks a lot in advance!


Kewen