Re: The performance data for two different implementation of new security feature -ftrivial-auto-var-init

public inbox for gcc-patches@gcc.gnu.org

From: Qing Zhao <QING.ZHAO@ORACLE.COM>
To: Richard Biener <rguenther@suse.de>
Cc: Richard Sandiford <richard.sandiford@arm.com>,
	Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org>
Subject: Re: The performance data for two different implementation of new security feature -ftrivial-auto-var-init
Date: Fri, 15 Jan 2021 10:16:40 -0600
Message-ID: <C43EAB76-F9DC-476D-BA32-85F4C8CE3C22@ORACLE.COM>
In-Reply-To: <alpine.DEB.2.21.2101150907170.2612@rguenther-XPS-13-9380>



> On Jan 15, 2021, at 2:11 AM, Richard Biener <rguenther@suse.de> wrote:
> 
> 
> 
> On Thu, 14 Jan 2021, Qing Zhao wrote:
> 
>> Hi, 
>> More data on code size and compilation time with CPU2017:
>> ********Compilation time data:   the numbers are the slowdown against the
>> default “no”:
>> benchmarks  A/no D/no
>>                         
>> 500.perlbench_r 5.19% 1.95%
>> 502.gcc_r 0.46% -0.23%
>> 505.mcf_r 0.00% 0.00%
>> 520.omnetpp_r 0.85% 0.00%
>> 523.xalancbmk_r 0.79% -0.40%
>> 525.x264_r -4.48% 0.00%
>> 531.deepsjeng_r 16.67% 16.67%
>> 541.leela_r  0.00%  0.00%
>> 557.xz_r 0.00%  0.00%
>>                         
>> 507.cactuBSSN_r 1.16% 0.58%
>> 508.namd_r 9.62% 8.65%
>> 510.parest_r 0.48% 1.19%
>> 511.povray_r 3.70% 3.70%
>> 519.lbm_r 0.00% 0.00%
>> 521.wrf_r 0.05% 0.02%
>> 526.blender_r 0.33% 1.32%
>> 527.cam4_r -0.93% -0.93%
>> 538.imagick_r 1.32% 3.95%
>> 544.nab_r  0.00% 0.00%
> >> From the above data, it looks like the compilation time impact
> >> of implementations A and D is almost the same.
>> *******code size data: the numbers are the code size increase against the
>> default “no”:
>> benchmarks A/no D/no
>>                         
>> 500.perlbench_r 2.84% 0.34%
>> 502.gcc_r 2.59% 0.35%
>> 505.mcf_r 3.55% 0.39%
>> 520.omnetpp_r 0.54% 0.03%
>> 523.xalancbmk_r 0.36%  0.39%
>> 525.x264_r 1.39% 0.13%
>> 531.deepsjeng_r 2.15% -1.12%
>> 541.leela_r 0.50% -0.20%
>> 557.xz_r 0.31% 0.13%
>>                         
>> 507.cactuBSSN_r 5.00% -0.01%
>> 508.namd_r 3.64% -0.07%
>> 510.parest_r 1.12% 0.33%
>> 511.povray_r 4.18% 1.16%
>> 519.lbm_r 8.83% 6.44%
>> 521.wrf_r 0.08% 0.02%
>> 526.blender_r 1.63% 0.45%
>> 527.cam4_r  0.16% 0.06%
>> 538.imagick_r 3.18% -0.80%
>> 544.nab_r 5.76% -1.11%
>> Avg 2.52% 0.36%
> >> From the above data, implementation D is better than A on every benchmark
> >> except 523.xalancbmk_r, which is surprising to me; I am not sure what the reason for this is.
> 
> D probably inhibits most interesting loop transforms (check SPEC FP
> performance).

The call to .DEFERRED_INIT is marked as ECF_CONST:

/* A function to represent an artifical initialization to an uninitialized
   automatic variable. The first argument is the variable itself, the
   second argument is the initialization type.  */
DEF_INTERNAL_FN (DEFERRED_INIT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)

So I assume that such a const call should minimize the impact on loop optimizations. But yes, it will still inhibit some loop transformations.
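
As a rough user-level analogy (not the internal function itself): as far as I understand, ECF_CONST is the same flag GCC derives for functions declared with the `const` attribute. In the sketch below, `scale` and `sum_scaled` are hypothetical names made up for illustration. Because the call has no side effects and depends only on its argument, the optimizer is allowed to evaluate it once outside the loop, though nothing guarantees that it will.

```c
/* Hypothetical helper: the result depends only on the argument and the
   function has no side effects, so it may be declared const.  */
static int scale(int x) __attribute__((const));
static int scale(int x) { return 3 * x + 1; }

/* Sum v[0..n-1], adding the loop-invariant value scale(k) to each element.  */
int sum_scaled(const int *v, int n, int k)
{
    int total = 0;
    for (int i = 0; i < n; i++) {
        /* scale(k) does not change between iterations; since the call is
           const, the compiler is free to compute it once before the loop.  */
        total += v[i] + scale(k);
    }
    return total;
}
```

A .DEFERRED_INIT call differs in that its result is assigned back to the variable it takes as an argument, which is one reason some transformations can still be blocked.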

>  It will also most definitely disallow SRA which, when
> an aggregate is not completely elided, tends to grow code.

Makes sense to me.

The run-time performance data for D and A is actually very similar, as I posted in the previous email (repeated here for convenience):

Run-time performance overhead with A and D:

benchmarks        A / no    D / no

500.perlbench_r    1.25%     1.25%
502.gcc_r          0.68%     1.80%
505.mcf_r          0.68%     0.14%
520.omnetpp_r      4.83%     4.68%
523.xalancbmk_r    0.18%     1.96%
525.x264_r         1.55%     2.07%
531.deepsjeng_    11.57%    11.85%
541.leela_r        0.64%     0.80%
557.xz_           -0.41%    -0.41%

507.cactuBSSN_r    0.44%     0.44%
508.namd_r         0.34%     0.34%
510.parest_r       0.17%     0.25%
511.povray_r      56.57%    57.27%
519.lbm_r          0.00%     0.00%
521.wrf_r         -0.28%    -0.37%
526.blender_r     16.96%    17.71%
527.cam4_r         0.70%     0.53%
538.imagick_r      2.40%     2.40%
544.nab_r          0.00%    -0.65%

avg                5.17%     5.37%

Especially for the SPEC FP benchmarks, I didn’t see much performance difference between A and D.
I guess that the RTL optimizations might be enough to get rid of most of the overhead introduced by the additional initialization.
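
For anyone who wants to re-derive the "avg" row, here is a minimal sketch that takes the plain arithmetic mean of the per-benchmark overheads listed above. The names `mean_percent`, `overhead_a` and `overhead_d` are made up for this illustration, and I am assuming the posted average is a simple arithmetic mean, which matches the numbers.

```c
#include <stddef.h>

/* Arithmetic mean of n percentage values (0.0 for an empty list).  */
double mean_percent(const double *v, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return n ? sum / (double)n : 0.0;
}

/* Run-time overheads in percent, in table order: the first 9 entries are
   the first (integer) group, the last 10 are the SPEC FP group.  */
const double overhead_a[19] = {
    1.25, 0.68, 0.68, 4.83, 0.18, 1.55, 11.57, 0.64, -0.41,
    0.44, 0.34, 0.17, 56.57, 0.00, -0.28, 16.96, 0.70, 2.40, 0.00
};
const double overhead_d[19] = {
    1.25, 1.80, 0.14, 4.68, 1.96, 2.07, 11.85, 0.80, -0.41,
    0.44, 0.34, 0.25, 57.27, 0.00, -0.37, 17.71, 0.53, 2.40, -0.65
};
```

Calling `mean_percent(overhead_a, 19)` and `mean_percent(overhead_d, 19)` gives about 5.17 and 5.37, matching the table. Restricting to the FP group with `mean_percent(overhead_a + 9, 10)` and `mean_percent(overhead_d + 9, 10)` gives about 7.73 and 7.79, so the gap between A and D stays small there as well.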

> 
> >> ********stack usage data: I added -fstack-usage to the compilation line when
> >> compiling the CPU2017 benchmarks, so a *.su file was generated for each
> >> of the modules.
> >> Since there are a lot of such files, and the stack size information is embedded
> >> in each of them, I just picked one benchmark, 511.povray, to
> >> check; it is the one that
> >> has the most runtime overhead when adding initialization (both A and D). 
> >> I identified all the *.su files that differ between A and D and ran a
> >> diff on them, and it looks like the stack size is much higher
> >> with D than with A, for example:
> >> $ diff build_base_auto_init.D.0000/bbox.su
> >> build_base_auto_init.A.0000/bbox.su
> >> 5c5
>> < bbox.cpp:1782:12:int pov::sort_and_split(pov::BBOX_TREE**,
>> pov::BBOX_TREE**&, long int*, long int, long int) 160 static
>> ---
>> > bbox.cpp:1782:12:int pov::sort_and_split(pov::BBOX_TREE**,
>> pov::BBOX_TREE**&, long int*, long int, long int) 96 static
>> $ diff build_base_auto_init.D.0000/image.su
>> build_base_auto_init.A.0000/image.su
>> 9c9
>> < image.cpp:240:6:void pov::bump_map(double*, pov::TNORMAL*, double*) 624
>> static
>> ---
>> > image.cpp:240:6:void pov::bump_map(double*, pov::TNORMAL*, double*) 272
>> static
>> ….
> >> It looks like implementation D has more stack size impact than A. 
> >> Do you have any insight into the reason for this?
> 
> D will keep all initialized aggregates as aggregates and live which
> means stack will be allocated for it.  With A the usual optimizations
> to reduce stack usage can be applied.
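
To compare many *.su files without eyeballing diffs, a small helper along these lines could pull the size field out of each line. This is only a sketch: `su_stack_size` is a hypothetical name, and it assumes each line ends in "<size> <qualifier>" as in the diff output quoted above (with the function signature joined back onto one line).

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Return the stack size in bytes from one -fstack-usage line, or -1 if
   the line does not end in "<size> <qualifier>".  */
long su_stack_size(const char *line)
{
    size_t end = strlen(line);

    /* Skip trailing whitespace, then the qualifier word (e.g. "static").  */
    while (end > 0 && isspace((unsigned char)line[end - 1]))
        end--;
    while (end > 0 && !isspace((unsigned char)line[end - 1]))
        end--;
    /* Skip the whitespace between the size and the qualifier.  */
    while (end > 0 && isspace((unsigned char)line[end - 1]))
        end--;

    /* Walk back over the digits of the size field.  */
    size_t start = end;
    while (start > 0 && isdigit((unsigned char)line[start - 1]))
        start--;
    if (start == end)
        return -1;              /* no digits where the size should be */
    if (start > 0 && !isspace((unsigned char)line[start - 1]))
        return -1;              /* digits belong to a longer token */

    return strtol(line + start, NULL, 10);
}
```

Applied to the two bump_map lines, this yields 624 for D and 272 for A, a difference of 352 bytes; for sort_and_split it yields 160 and 96, a difference of 64 bytes.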

I checked the routine “pov::bump_map” in 511.povray_r, since its stack usage increases a lot
with implementation D. Examining the IR immediately before the RTL expansion phase
(image.cpp.244t.optimized), I found the following additional statements for the array elements:

void  pov::bump_map (double * EPoint, struct TNORMAL * Tnormal, double * normal)
{
…
  double p3[3];
  double p2[3];
  double p1[3];
  float colour3[5];
  float colour2[5];
  float colour1[5];
…
  # DEBUG BEGIN_STMT
  colour1 = .DEFERRED_INIT (colour1, 2);
  colour2 = .DEFERRED_INIT (colour2, 2);
  colour3 = .DEFERRED_INIT (colour3, 2);
  # DEBUG BEGIN_STMT
  MEM <double> [(double[3] *)&p1] = p1$0_144(D);
  MEM <double> [(double[3] *)&p1 + 8B] = p1$1_135(D);
  MEM <double> [(double[3] *)&p1 + 16B] = p1$2_138(D);
  p1 = .DEFERRED_INIT (p1, 2);
  # DEBUG D#12 => MEM <double> [(double[3] *)&p1]
  # DEBUG p1$0 => D#12
  # DEBUG D#11 => MEM <double> [(double[3] *)&p1 + 8B]
  # DEBUG p1$1 => D#11
  # DEBUG D#10 => MEM <double> [(double[3] *)&p1 + 16B]
  # DEBUG p1$2 => D#10
  MEM <double> [(double[3] *)&p2] = p2$0_109(D);
  MEM <double> [(double[3] *)&p2 + 8B] = p2$1_111(D);
  MEM <double> [(double[3] *)&p2 + 16B] = p2$2_254(D);
  p2 = .DEFERRED_INIT (p2, 2);
  # DEBUG D#9 => MEM <double> [(double[3] *)&p2]
  # DEBUG p2$0 => D#9
  # DEBUG D#8 => MEM <double> [(double[3] *)&p2 + 8B]
  # DEBUG p2$1 => D#8
  # DEBUG D#7 => MEM <double> [(double[3] *)&p2 + 16B]
  # DEBUG p2$2 => D#7
  MEM <double> [(double[3] *)&p3] = p3$0_256(D);
  MEM <double> [(double[3] *)&p3 + 8B] = p3$1_258(D);
  MEM <double> [(double[3] *)&p3 + 16B] = p3$2_260(D);
  p3 = .DEFERRED_INIT (p3, 2);
  ….
}

I guess that the “MEM <double> … = …” stores above are what make the difference. Which phase introduced them?
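
For readers less familiar with the option, the source-level effect of -ftrivial-auto-var-init=zero on locals like p1 and colour1 can be approximated in plain C as below. This is only an illustration of the semantics, not what either implementation emits; the function name `sum_after_zero_init` is made up, and the explicit memset calls stand in for the initialization the compiler would add on its own.

```c
#include <string.h>

/* Hypothetical stand-in for a routine with small local arrays, similar in
   shape to the p1[3] and colour1[5] locals in the dump above.  */
double sum_after_zero_init(void)
{
    double p1[3];
    float colour1[5];

    /* Approximation of the option's effect: every byte of each automatic
       variable is zero before its first use.  */
    memset(p1, 0, sizeof p1);
    memset(colour1, 0, sizeof colour1);

    double sum = 0.0;
    for (int i = 0; i < 3; i++)
        sum += p1[i];
    for (int i = 0; i < 5; i++)
        sum += colour1[i];
    return sum;
}
```

With approach A the initialization is ordinary code from gimplification onwards, so the usual optimizations see it; with approach D the variable is instead assigned the result of .DEFERRED_INIT until expand, as in the dump.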

> 
>> Let me know if you have any comments and suggestions.
> 
> First of all I would check whether the prototype implementations
> work as expected.
I have already done such checks with small test cases, inspecting the IR generated with implementation A or D, mainly
focusing on *.c.006t.gimple and *.c.*t.expand; all worked as expected.

For CPU2017, as in the example above, I also checked the IR for both A and D, and everything looks as expected.

Thanks. 

Qing
> 
> Richard.
> 
> 
>> thanks.
>> Qing
>>      On Jan 13, 2021, at 1:39 AM, Richard Biener <rguenther@suse.de>
>>      wrote:
>> 
>>      On Tue, 12 Jan 2021, Qing Zhao wrote:
>> 
>>            Hi, 
>> 
>>            Just check in to see whether you have any comments
>>            and suggestions on this:
>> 
>>            FYI, I have been continuing with the Approach D
>>            implementation since last week:
>> 
>>            D. Adding  calls to .DEFERRED_INIT during
>>            gimplification, expand the .DEFERRED_INIT during
>>            expand to
>>            real initialization. Adjusting uninitialized pass
>>            with the new refs with “.DEFERRED_INIT”.
>> 
>>            For the remaining work of Approach D:
>> 
>>            ** complete the implementation of
>>            -ftrivial-auto-var-init=pattern;
>>            ** complete the implementation of uninitialized
>>            warnings maintenance work for D. 
>> 
>>            I have completed the uninitialized warnings
>>            maintenance work for D.
>>            And finished partial of the
>>            -ftrivial-auto-var-init=pattern implementation. 
>> 
>>            The following are remaining work of Approach D:
>> 
>>              ** -ftrivial-auto-var-init=pattern for VLA;
>>              ** add a new attribute for variables:
>>            __attribute__((uninitialized))
>>            the marked variable is left uninitialized intentionally
>>            for performance purposes.
>>              ** adding complete testing cases;
>> 
>>            Please let me know if you have any objection on my
>>            current decision on implementing approach D. 
>> 
>>      Did you do any analysis on how stack usage and code size are
>>      changed 
>>      with approach D?  How does compile-time behave (we could gobble
>>      up
>>      lots of .DEFERRED_INIT calls I guess)?
>> 
>>      Richard.
>> 
>>            Thanks a lot for your help.
>> 
>>            Qing
>> 
>>                  On Jan 5, 2021, at 1:05 PM, Qing Zhao
>>                  via Gcc-patches
>>                  <gcc-patches@gcc.gnu.org> wrote:
>> 
>>                  Hi,
>> 
>>                  This is an update for our previous
>>                  discussion. 
>> 
>>                  1. I implemented the following two
>>                  different implementations in the latest
>>                  upstream gcc:
>> 
>>                  A. Adding real initialization during
>>                  gimplification, not maintain the
>>                  uninitialized warnings.
>> 
>>                  D. Adding  calls to .DEFERRED_INIT
>>                  during gimplification, expand the
>>                  .DEFERRED_INIT during expand to
>>                  real initialization. Adjusting
>>                  uninitialized pass with the new refs
>>                  with “.DEFERRED_INIT”.
>> 
>>                  Note, in this initial implementation,
>>                  ** I ONLY implement
>>                  -ftrivial-auto-var-init=zero, the
>>                  implementation of
>>                  -ftrivial-auto-var-init=pattern 
>>                     is not done yet.  Therefore, the
>>                  performance data is only about
>>                  -ftrivial-auto-var-init=zero. 
>> 
>>                  ** I added an temporary  option
>>                  -fauto-var-init-approach=A|B|C|D  to
>>                  choose implementation A or D for 
>>                     runtime performance study.
>>                  ** I didn’t finish the uninitialized
>>                  warnings maintenance work for D. (That
>>                  might take more time than I expected). 
>> 
>>                  2. I collected runtime data for CPU2017
>>                  on a x86 machine with this new gcc for
>>                  the following 3 cases:
>> 
>>                  no: default. (-g -O2 -march=native )
>>                  A:  default +
>>                   -ftrivial-auto-var-init=zero
>>                  -fauto-var-init-approach=A 
>>                  D:  default +
>>                   -ftrivial-auto-var-init=zero
>>                  -fauto-var-init-approach=D 
>> 
>>                  And then compute the slowdown data for
>>                  both A and D as following:
>> 
>>                  benchmarks A / no D /no
>> 
>>                  500.perlbench_r 1.25% 1.25%
>>                  502.gcc_r 0.68% 1.80%
>>                  505.mcf_r 0.68% 0.14%
>>                  520.omnetpp_r 4.83% 4.68%
>>                  523.xalancbmk_r 0.18% 1.96%
>>                  525.x264_r 1.55% 2.07%
>>                  531.deepsjeng_ 11.57% 11.85%
>>                  541.leela_r 0.64% 0.80%
>>                  557.xz_  -0.41% -0.41%
>> 
>>                  507.cactuBSSN_r 0.44% 0.44%
>>                  508.namd_r 0.34% 0.34%
>>                  510.parest_r 0.17% 0.25%
>>                  511.povray_r 56.57% 57.27%
>>                  519.lbm_r 0.00% 0.00%
>>                  521.wrf_r  -0.28% -0.37%
>>                  526.blender_r 16.96% 17.71%
>>                  527.cam4_r 0.70% 0.53%
>>                  538.imagick_r 2.40% 2.40%
>>                  544.nab_r 0.00% -0.65%
>> 
>>                  avg 5.17% 5.37%
>> 
>>                  From the above data, we can see that in
>>                  general, the runtime performance
>>                  slowdown for 
>>                  implementation A and D are similar for
>>                  individual benchmarks.
>> 
>>                  There are several benchmarks that have
>>                  significant slowdown with the new added
>>                  initialization for both
>>                  A and D, for example, 511.povray_r,
>>                  526.blender_, and 531.deepsjeng_r, I
>>                  will try to study a little bit
>>                  more on what kind of new initializations
>>                  introduced such slowdown. 
>> 
>>                  From the current study so far, I think
>>                  that approach D should be good enough
>>                  for our final implementation. 
>>                  So, I will try to finish approach D with
>>                  the following remaining work
>> 
>>                      ** complete the implementation of
>>                  -ftrivial-auto-var-init=pattern;
>>                      ** complete the implementation of
>>                  uninitialized warnings maintenance work
>>                  for D. 
>> 
>>                  Let me know if you have any comments and
>>                  suggestions on my current and future
>>                  work.
>> 
>>                  Thanks a lot for your help.
>> 
>>                  Qing
>> 
>>                        On Dec 9, 2020, at 10:18 AM,
>>                        Qing Zhao via Gcc-patches
>>                        <gcc-patches@gcc.gnu.org>
>>                        wrote:
>> 
>>                        The following are the
>>                        approaches I will implement
>>                        and compare:
>> 
>>                        Our final goal is to keep
>>                        the uninitialized warning
>>                        and minimize the run-time
>>                        performance cost.
>> 
>>                        A. Adding real
>>                        initialization during
>>                        gimplification, not maintain
>>                        the uninitialized warnings.
>>                        B. Adding real
>>                        initialization during
>>                        gimplification, marking them
>>                        with “artificial_init”. 
>>                          Adjusting uninitialized
>>                        pass, maintaining the
>>                        annotation, making sure the
>>                        real init not
>>                          Deleted from the fake
>>                        init. 
>>                        C.  Marking the DECL for an
>>                        uninitialized auto variable
>>                        as “no_explicit_init” during
>>                        gimplification,
>>                           maintain this
>>                        “no_explicit_init” bit till
>>                        after
>>                        pass_late_warn_uninitialized,
>>                        or till pass_expand, 
>>                           add real initialization
>>                        for all DECLs that are
>>                        marked with
>>                        “no_explicit_init”.
>>                        D. Adding .DEFERRED_INIT
>>                        during gimplification,
>>                        expand the .DEFERRED_INIT
>>                        during expand to
>>                          real initialization.
>>                        Adjusting uninitialized pass
>>                        with the new refs with
>>                        “.DEFERRED_INIT”.
>> 
>>                        In the above, approach A
>>                        will be the one that have
>>                        the minimum run-time cost,
>>                        will be the base for the
>>                        performance
>>                        comparison. 
>> 
>>                        I will implement approach D
>>                        then, this one is expected
>>                        to have the most run-time
>>                        overhead among the above
>>                        list, but
>>                        Implementation should be the
>>                        cleanest among B, C, D.
>>                        Let’s see how much more
>>                        performance overhead this
>>                        approach
>>                        will be. If the data is
>>                        good, maybe we can avoid the
>>                        effort to implement B, and
>>                        C. 
>> 
>>                        If the performance of D is
>>                        not good, I will implement B
>>                        or C at that time.
>> 
>>                        Let me know if you have any
>>                        comment or suggestions.
>> 
>>                        Thanks.
>> 
>>                        Qing
>> 
>>      -- 
>>      Richard Biener <rguenther@suse.de>
>>      SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
>>      Nuernberg,
>>      Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)
