Re: [PATCH v3 2/3] Add predict_doloop_p target hook

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: "Kewen.Lin" <linkw@linux.ibm.com>
To: Richard Biener <rguenther@suse.de>
Cc: Segher Boessenkool <segher@kernel.crashing.org>,
	       Jeff Law <law@redhat.com>,
	gcc-patches@gcc.gnu.org,        wschmidt@linux.ibm.com,
	bin.cheng@linux.alibaba.com, jakub@redhat.com
Subject: Re: [PATCH v3 2/3] Add predict_doloop_p target hook
Date: Tue, 11 Jun 2019 02:39:00 -0000	[thread overview]
Message-ID: <3d3347ad-6910-a5ea-11f9-1a4fc3cbc6d0@linux.ibm.com> (raw)
In-Reply-To: <alpine.LSU.2.20.1905211219190.10704@zhemvz.fhfr.qr>

>> If my understanding on this question is correct, IMHO we should try to make
>> IVOPTs conservative than optimistic, since once the predict is wrong from
>> too optimistic decision, the costing on the doloop use is wrong, it's very
>> possible to affect the global optimal set.  It looks we don't have any ways
>> to recover it in RTL then?  (otherwise, there should be better place to fix
>> the PR).  Although it's also possible to miss some good cases, it's at least
>> as good as before, I'm inclined to make it conservative.
> 
> I wonder if you could simply benchmark what happens if you make
> IVOPTs _always_ create a doloop IV (if possible)?  I doubt the
> cases where a doloop IV is bad (calls, etc.) are too common and
> that in those cases the extra simple IV hurts.
> 

Hi Richard and all,

With these different settings:
	Base) without any changes
	A) having predict_doloop and enable all checks
	B) A + disable check on "too few iterations" (0 <= est_niter < 3)
	C) A + disable costly niter check
	D) A + disable invalid stmt check (call/computed_goto/switch)

I collected some runtime performance data with SPEC2017 as following:

			Avs.Base   Bvs.A  Cvs.A  Dvs.A
500.perlbench_r		0.00%	0.38%	0.00%	-0.19%
502.gcc_r		0.00%	0.38%	0.00%	0.00%
505.mcf_r		0.89%	0.00%	0.00%	0.00%
520.omnetpp_r		-0.41%	-1.25%	0.00%	0.00%
523.xalancbmk_r		-0.36%	-0.36%	-0.36%	-0.73%
525.x264_r		1.14%	0.00%	0.00%	0.00%
531.deepsjeng_r		-0.26%	0.26%	0.00%	0.00%
541.leela_r		0.00%	0.37%	0.00%	0.00%
548.exchange2_r		0.85%	-0.21%	0.00%	0.00%
557.xz_r		-0.77%	0.52%	-0.26%	0.00%
503.bwaves_r		0.00%	0.00%	0.36%	0.00%
507.cactuBSSN_r		-0.57%	0.00%	0.00%	0.00%
508.namd_r		-0.69%	0.35%	0.00%	0.35%
510.parest_r		0.17%	-0.17%	0.00%	-0.17%
511.povray_r		-1.31%	-0.44%	0.15%	0.15%
519.lbm_r		0.00%	0.00%	0.00%	0.00%
521.wrf_r		0.33%	-0.44%	-0.33%	-0.33%
526.blender_r		0.26%	0.26%	0.00%	0.00%
527.cam4_r		0.59%	-0.59%	0.00%	-0.39%
538.imagick_r		0.45%	0.00%	0.00%	0.00%
544.nab_r		0.23%	0.00%	0.00%	0.00%
549.fotonik3d_r		1.80%	-0.29%	0.00%	0.00%
554.roms_r		0.00%	0.00%	0.00%	0.00%
geomean			0.10%	-0.05%	-0.02%	-0.06%

As above, the difference is very small, looks like caused by noise and can 
be ignored. 

I also ran partial of test suite with some explicit statistics dumping 
(on gcc/g++/gfortran etc.). No regressions found. The unique files number 
with predicted doloop found are:

  A) 3297
  B) 3416
  C) 3297
  D) 3858

Some observations:
  * Based on A) and C), we can see the checking on costly niter is useless, 
    I plan to give the check up or replace it with one existing interface 
    expression_expensive_p as Richard mentioned.  (Correct me if you have 
    any concerns.)
  * B) does filter some cases, I checked a few different cases, they are 
    written with small iteration count indeed.
  * The delta number isn't small between A) and D).

I ran some filtering by compiling C/C++ files at -O2 with A) and D) (-O2 is 
probably not the actual option used in each testing, may not cause the 
difference, but just for simplicity), obtained dumping assembly file and did
further comparison, then I got 60 files finally.

I looked into all 60 cases (I assumed these are typical enough, don't need
to go through the Fortran etc.).  Most of wi/wo differences are expected, 
55 of them use more insns to update biv, but 5 of them are trivial since they
have to use original biv later so it's kept wi/wo the scanning.  Some of 55 
takes one more register in prologue/epilogue, some of them have fewer setup 
insns (which are to calculate the original value and bound for selected iv).

Some typical case looks like:

 Replacing exit test: if (ivtmp_2 != 0)                                        \  Replacing exit test: if (ivtmp_2 != 0)
  xxx ()                                                                       \  xxx ()
  {                                                                            \  {
    unsigned long ivtmp.9;                                                     \    unsigned long ivtmp.9;
    int iter;                                                                  \    int iter;
    int _1;                                                                    \    int _1;
  -----------------------------------------------------------------------------\    unsigned int ivtmp_2;
  -----------------------------------------------------------------------------\    unsigned int ivtmp_3;
    struct bla[100] * _13;                                                     \    struct bla[100] * _13;
    void * _14;                                                                \    void * _14;
    unsigned long _15;                                                         \  -----------------------------------------------------------------------------
    unsigned long _16;                                                         \  -----------------------------------------------------------------------------
                                                                               \
    <bb 2> [local count: 10737418]:                                            \    <bb 2> [local count: 10737418]:
    _13 = &arr_base + 188;                                                     \    _13 = &arr_base + 188;
    ivtmp.9_8 = (unsigned long) _13;                                           \    ivtmp.9_8 = (unsigned long) _13;
    _15 = (unsigned long) &arr_base;                                           \  -----------------------------------------------------------------------------
    _16 = _15 + 44988;                                                         \  -----------------------------------------------------------------------------
                                                                               \
    <bb 3> [local count: 1063004407]:                                          \    <bb 3> [local count: 1063004407]:
  -----------------------------------------------------------------------------\    # ivtmp_3 = PHI <100(2), ivtmp_2(5)>
    # ivtmp.9_10 = PHI <ivtmp.9_8(2), ivtmp.9_9(5)>                            \    # ivtmp.9_10 = PHI <ivtmp.9_8(2), ivtmp.9_9(5)>
    _1 = foo ();                                                               \    _1 = foo ();
    _14 = (void *) ivtmp.9_10;                                                 \    _14 = (void *) ivtmp.9_10;
    MEM[base: _14, offset: 0B] = _1;                                           \    MEM[base: _14, offset: 0B] = _1;
  -----------------------------------------------------------------------------\    ivtmp_2 = ivtmp_3 - 1;
    ivtmp.9_9 = ivtmp.9_10 + 448;                                              \    ivtmp.9_9 = ivtmp.9_10 + 448;
    if (ivtmp.9_9 != _16)                                                      \    if (ivtmp_2 != 0)
      goto <bb 5>; [98.99%]                                                    \      goto <bb 5>; [98.99%]
    else                                                                       \    else
      goto <bb 4>; [1.01%]                                                     \      goto <bb 4>; [1.01%]


  >-------addis 31,2,.LC0@toc@ha                                               \  >-------li 31,100
  >-------ld 31,.LC0@toc@l(31)                                                 \  >-------addis 30,2,.LC0@toc@ha
  >-------addis 30,31,0x1                                                      \  >-------ld 30,.LC0@toc@l(30)
  >-------std 0,16(1)                                                          \  >-------std 0,16(1)
  >-------stdu 1,-48(1)                                                        \  >-------stdu 1,-48(1)
  >-------.cfi_def_cfa_offset 48                                               \  >-------.cfi_def_cfa_offset 48
  >-------.cfi_offset 65, 16                                                   \  -----------------------------------------------------------------------------
  >-------addi 30,30,-20736                                                    \  >-------.cfi_offset 65, 16
  >-------.p2align 5                                                           \  >-------.p2align 5
  .L2:                                                                         \  .L2:
  >-------bl foo                                                               \  >-------bl foo
  >-------nop                                                                  \  >-------nop
  >-------addi 31,31,448                                                       \  >-------addi 9,31,-1
  >-------stw 3,-448(31)                                                       \  >-------addi 30,30,448
  >-------cmpld 0,31,30                                                        \  >-------rldicl. 31,9,0,32
  -----------------------------------------------------------------------------\  >-------stw 3,-448(30)
  >-------bne 0,.L2                                                            \  >-------bne 0,.L2
  >-------addi 1,1,48                                                          \  >-------addi 1,1,48
  >-------.cfi_def_cfa_offset 0                                                \  >-------.cfi_def_cfa_offset 0
  >-------ld 0,16(1)                                                           \  >-------ld 0,16(1)
  >-------ld 30,-16(1)                                                         \  >-------ld 30,-16(1)
  >-------ld 31,-8(1)                                                          \  >-------ld 31,-8(1)


As shown above, P8 SPEC2017 performance evaluation shows to disable 
invalid_stmt scanning doesn't have any impacts.  It looks it's fine
not to force it.  The test suite small cases show it's possible to
have sub optimal instruction sequence with eliminated BIV and its update. 

I guess the concern on the scanning is compilation time cost, is it
worth to doing according to current data?

What do you think of this? 

Thanks a lot in advance!


Kewen

next prev parent reply	other threads:[~2019-06-11  2:39 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-05-17  3:37 linkw
2019-05-17  5:31 ` Kugan Vivekanandarajah
2019-05-17  6:15   ` Kewen.Lin
2019-05-17  6:57   ` Segher Boessenkool
2019-05-17  6:49 ` Segher Boessenkool
2019-05-17  7:20   ` Kewen.Lin
2019-05-20  9:28 ` Richard Biener
2019-05-20 10:24   ` Segher Boessenkool
2019-05-20 14:44     ` Jeff Law
2019-05-20 16:38       ` Segher Boessenkool
2019-05-21  6:03         ` Kewen.Lin
2019-05-21 10:20           ` Richard Biener
2019-05-21 12:42             ` Segher Boessenkool
2019-05-21 15:22             ` Jeff Law
2019-05-22  2:07             ` Kewen.Lin
2019-06-11  2:39             ` Kewen.Lin [this message]
2019-06-11 12:17               ` Richard Biener
2019-06-13  5:50                 ` [PATCH v4 " Kewen.Lin
2019-06-14 21:53                   ` Segher Boessenkool
2019-06-14 23:46                     ` Bill Schmidt
2019-06-17  2:08                       ` Kewen.Lin
2019-06-17  8:51                         ` Richard Biener
2019-06-17  9:39                           ` Kewen.Lin
2019-06-17  9:59                           ` Segher Boessenkool
2019-06-17 12:08                             ` Richard Biener
2019-06-17 13:39                               ` Kewen.Lin
2019-06-17 13:44                                 ` Richard Biener
2019-06-17 14:24                                   ` Kewen.Lin
2019-05-21 11:09           ` [PATCH v3 " Segher Boessenkool
2019-05-21  9:58         ` Richard Biener
2019-05-21  5:50       ` Kewen.Lin
2019-05-21  6:32         ` Bin.Cheng
2019-05-21 10:36         ` Segher Boessenkool
2019-05-21  3:48   ` Kewen.Lin
2019-05-21  5:28   ` Kewen.Lin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3d3347ad-6910-a5ea-11f9-1a4fc3cbc6d0@linux.ibm.com \
    --to=linkw@linux.ibm.com \
    --cc=bin.cheng@linux.alibaba.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=jakub@redhat.com \
    --cc=law@redhat.com \
    --cc=rguenther@suse.de \
    --cc=segher@kernel.crashing.org \
    --cc=wschmidt@linux.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).