From: Richard Biener <rguenther@suse.de>
To: Qing Zhao
Cc: Richard Sandiford, Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org>
Date: Fri, 15 Jan 2021 18:22:36 +0100
Subject: Re: The performance data for two different implementation of new security feature -ftrivial-auto-var-init

On January 15, 2021 5:16:40 PM GMT+01:00, Qing Zhao wrote:
>
>> On Jan 15, 2021, at 2:11 AM, Richard Biener wrote:
>>
>> On Thu, 14 Jan 2021, Qing Zhao wrote:
>>
>>> Hi,
>>>
>>> More data on code size and compilation time with CPU2017:
>>>
>>> ******** Compilation time data: the numbers are the slowdown against
>>> the default "no":
>>>
>>> benchmarks         A/no     D/no
>>>
>>> 500.perlbench_r    5.19%    1.95%
>>> 502.gcc_r          0.46%   -0.23%
>>> 505.mcf_r          0.00%    0.00%
>>> 520.omnetpp_r      0.85%    0.00%
>>> 523.xalancbmk_r    0.79%   -0.40%
>>> 525.x264_r        -4.48%    0.00%
>>> 531.deepsjeng_r   16.67%   16.67%
>>> 541.leela_r        0.00%    0.00%
>>> 557.xz_r           0.00%    0.00%
>>>
>>> 507.cactuBSSN_r    1.16%    0.58%
>>> 508.namd_r         9.62%    8.65%
>>> 510.parest_r       0.48%    1.19%
>>> 511.povray_r       3.70%    3.70%
>>> 519.lbm_r          0.00%    0.00%
>>> 521.wrf_r          0.05%    0.02%
>>> 526.blender_r      0.33%    1.32%
>>> 527.cam4_r        -0.93%   -0.93%
>>> 538.imagick_r      1.32%    3.95%
>>> 544.nab_r          0.00%    0.00%
>>>
>>> From the above data, it looks like the compilation-time impact of
>>> implementations A and D is almost the same.
>>>
>>> ******** Code size data: the numbers are the code size increase
>>> against the default "no":
>>>
>>> benchmarks         A/no     D/no
>>>
>>> 500.perlbench_r    2.84%    0.34%
>>> 502.gcc_r          2.59%    0.35%
>>> 505.mcf_r          3.55%    0.39%
>>> 520.omnetpp_r      0.54%    0.03%
>>> 523.xalancbmk_r    0.36%    0.39%
>>> 525.x264_r         1.39%    0.13%
>>> 531.deepsjeng_r    2.15%   -1.12%
>>> 541.leela_r        0.50%   -0.20%
>>> 557.xz_r           0.31%    0.13%
>>>
>>> 507.cactuBSSN_r    5.00%   -0.01%
>>> 508.namd_r         3.64%   -0.07%
>>> 510.parest_r       1.12%    0.33%
>>> 511.povray_r       4.18%    1.16%
>>> 519.lbm_r          8.83%    6.44%
>>> 521.wrf_r          0.08%    0.02%
>>> 526.blender_r      1.63%    0.45%
>>> 527.cam4_r         0.16%    0.06%
>>> 538.imagick_r      3.18%   -0.80%
>>> 544.nab_r          5.76%   -1.11%
>>>
>>> Avg                2.52%    0.36%
>>>
>>> From the above data, the implementation D
>>> is always better than A for code size; that is surprising to me, and
>>> I am not sure what the reason for this is.
>>
>> D probably inhibits most interesting loop transforms (check SPEC FP
>> performance).
>
> The call to .DEFERRED_INIT is marked as ECF_CONST:
>
> /* A function to represent an artificial initialization to an uninitialized
>    automatic variable.  The first argument is the variable itself, the
>    second argument is the initialization type.  */
> DEF_INTERNAL_FN (DEFERRED_INIT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
>
> So, I assume that such a const call should minimize the impact on loop
> optimizations.  But yes, it will still inhibit some of the loop
> transformations.
>
>> It will also most definitely disallow SRA which, when
>> an aggregate is not completely elided, tends to grow code.
>
> Makes sense to me.
>
> The run-time performance data for D and A are actually very similar, as
> I posted in the previous email (listed here again for convenience).
>
> Run-time performance overhead with A and D:
>
> benchmarks         A/no     D/no
>
> 500.perlbench_r    1.25%    1.25%
> 502.gcc_r          0.68%    1.80%
> 505.mcf_r          0.68%    0.14%
> 520.omnetpp_r      4.83%    4.68%
> 523.xalancbmk_r    0.18%    1.96%
> 525.x264_r         1.55%    2.07%
> 531.deepsjeng_r   11.57%   11.85%
> 541.leela_r        0.64%    0.80%
> 557.xz_r          -0.41%   -0.41%
>
> 507.cactuBSSN_r    0.44%    0.44%
> 508.namd_r         0.34%    0.34%
> 510.parest_r       0.17%    0.25%
> 511.povray_r      56.57%   57.27%
> 519.lbm_r          0.00%    0.00%
> 521.wrf_r         -0.28%   -0.37%
> 526.blender_r     16.96%   17.71%
> 527.cam4_r         0.70%    0.53%
> 538.imagick_r      2.40%    2.40%
> 544.nab_r          0.00%   -0.65%
>
> avg                5.17%    5.37%
>
> Especially for the SPEC FP benchmarks, I didn't see much performance
> difference between A and D.
> I guess the RTL optimizations might be enough to get rid of most of the
> overhead introduced by the additional initialization.
>
>>
>>> ******** Stack usage data: I added -fstack-usage to the compilation
>>> line when compiling the CPU2017 benchmarks, so a *.su file was
>>> generated for each of the modules.
>>> Since there are a lot of such files, and the stack size information
>>> is embedded in each of them, I just picked one benchmark,
>>> 511.povray, to check; it is the one with the most runtime overhead
>>> when adding initialization (both A and D).
>>> I identified all the *.su files that differ between A and D and
>>> diffed them; it looks like the stack size is much higher with D than
>>> with A, for example:
>>>
>>> $ diff build_base_auto_init.D.0000/bbox.su build_base_auto_init.A.0000/bbox.su
>>> 5c5
>>> < bbox.cpp:1782:12:int pov::sort_and_split(pov::BBOX_TREE**, pov::BBOX_TREE**&, long int*, long int, long int)	160	static
>>> ---
>>> > bbox.cpp:1782:12:int pov::sort_and_split(pov::BBOX_TREE**, pov::BBOX_TREE**&, long int*, long int, long int)	96	static
>>>
>>> $ diff build_base_auto_init.D.0000/image.su build_base_auto_init.A.0000/image.su
>>> 9c9
>>> < image.cpp:240:6:void pov::bump_map(double*, pov::TNORMAL*, double*)	624	static
>>> ---
>>> > image.cpp:240:6:void pov::bump_map(double*, pov::TNORMAL*, double*)	272	static
>>> …
>>>
>>> It looks like implementation D has more stack size impact than A.
>>> Do you have any insight into the reason for this?
>>
>> D will keep all initialized aggregates as aggregates and live, which
>> means stack will be allocated for them.  With A the usual
>> optimizations to reduce stack usage can be applied.
>
> I checked the routine pov::bump_map in 511.povray_r, since it has a
> large stack increase due to implementation D, by examining the IR
> immediately before the RTL expansion phase
> (image.cpp.244t.optimized).  I found the following additional
> statements for the array elements:
>
> void pov::bump_map (double * EPoint, struct TNORMAL * Tnormal, double * normal)
> {
> …
>   double p3[3];
>   double p2[3];
>   double p1[3];
>   float colour3[5];
>   float colour2[5];
>   float colour1[5];
> …
>   # DEBUG BEGIN_STMT
>   colour1 = .DEFERRED_INIT (colour1, 2);
>   colour2 = .DEFERRED_INIT (colour2, 2);
>   colour3 = .DEFERRED_INIT (colour3, 2);
>   # DEBUG BEGIN_STMT
>   MEM [(double[3] *)&p1] = p1$0_144(D);
>   MEM [(double[3] *)&p1 + 8B] = p1$1_135(D);
>   MEM [(double[3] *)&p1 + 16B] = p1$2_138(D);
>   p1 = .DEFERRED_INIT (p1, 2);
>   # DEBUG D#12 => MEM [(double[3] *)&p1]
>   # DEBUG p1$0 => D#12
>   # DEBUG D#11 => MEM [(double[3] *)&p1 + 8B]
>   # DEBUG p1$1 => D#11
>   # DEBUG D#10 => MEM [(double[3] *)&p1 + 16B]
>   # DEBUG p1$2 => D#10
>   MEM [(double[3] *)&p2] = p2$0_109(D);
>   MEM [(double[3] *)&p2 + 8B] = p2$1_111(D);
>   MEM [(double[3] *)&p2 + 16B] = p2$2_254(D);
>   p2 = .DEFERRED_INIT (p2, 2);
>   # DEBUG D#9 => MEM [(double[3] *)&p2]
>   # DEBUG p2$0 => D#9
>   # DEBUG D#8 => MEM [(double[3] *)&p2 + 8B]
>   # DEBUG p2$1 => D#8
>   # DEBUG D#7 => MEM [(double[3] *)&p2 + 16B]
>   # DEBUG p2$2 => D#7
>   MEM [(double[3] *)&p3] = p3$0_256(D);
>   MEM [(double[3] *)&p3 + 8B] = p3$1_258(D);
>   MEM [(double[3] *)&p3 + 16B] = p3$2_260(D);
>   p3 = .DEFERRED_INIT (p3, 2);
>   …
> }
>
> I guess the above "MEM … = …" stores are the ones that make the
> difference.
Which phase introduced them?  It looks like SRA.  But you can just dump
all passes and grep for the first occurrence.

>>
>>> Let me know if you have any comments and suggestions.
>>
>> First of all I would check whether the prototype implementations
>> work as expected.
>
> I have done such checks with small test cases already, checking the IR
> generated with implementation A or D, focusing mainly on
> *.c.006t.gimple and *.c.*t.expand; all worked as expected.
>
> For CPU2017, for example as above, I also checked the IR for both A
> and D, and it looks like all worked as expected.
>
> Thanks.
>
> Qing
>
>> Richard.
>>
>>> thanks.
>>> Qing
>>>
>>> On Jan 13, 2021, at 1:39 AM, Richard Biener wrote:
>>>
>>> On Tue, 12 Jan 2021, Qing Zhao wrote:
>>>
>>> Hi,
>>>
>>> Just checking in to see whether you have any comments and
>>> suggestions on this:
>>>
>>> FYI, I have been continuing with the Approach D implementation since
>>> last week:
>>>
>>> D. Adding calls to .DEFERRED_INIT during gimplification, expanding
>>>    the .DEFERRED_INIT during expand to real initialization, and
>>>    adjusting the uninitialized pass to handle the new refs with
>>>    ".DEFERRED_INIT".
>>>
>>> For the remaining work of Approach D:
>>>
>>> ** complete the implementation of -ftrivial-auto-var-init=pattern;
>>> ** complete the uninitialized-warnings maintenance work for D.
>>>
>>> I have completed the uninitialized-warnings maintenance work for D,
>>> and finished part of the -ftrivial-auto-var-init=pattern
>>> implementation.
>>>
>>> The following is the remaining work for Approach D:
>>>
>>> ** -ftrivial-auto-var-init=pattern for VLAs;
>>> ** add a new attribute for variables, __attribute__((uninitialized)):
>>>    a marked variable is intentionally left uninitialized for
>>>    performance reasons;
>>> ** add complete test cases.
>>>
>>> Please let me know if you have any objection to my current decision
>>> to implement approach D.
>>>
>>> Did you do any analysis on how stack usage and code size change
>>> with approach D?  How does compile time behave (we could gobble up
>>> lots of .DEFERRED_INIT calls I guess)?
>>>
>>> Richard.
>>>
>>> Thanks a lot for your help.
>>>
>>> Qing
>>>
>>> On Jan 5, 2021, at 1:05 PM, Qing Zhao via Gcc-patches wrote:
>>>
>>> Hi,
>>>
>>> This is an update on our previous discussion.
>>>
>>> 1. I implemented the following two different implementations in the
>>>    latest upstream gcc:
>>>
>>> A. Adding real initialization during gimplification, not maintaining
>>>    the uninitialized warnings.
>>>
>>> D. Adding calls to .DEFERRED_INIT during gimplification, expanding
>>>    the .DEFERRED_INIT during expand to real initialization, and
>>>    adjusting the uninitialized pass to handle the new refs with
>>>    ".DEFERRED_INIT".
>>>
>>> Note, in this initial implementation:
>>>
>>> ** I ONLY implemented -ftrivial-auto-var-init=zero; the
>>>    implementation of -ftrivial-auto-var-init=pattern is not done
>>>    yet.  Therefore, the performance data below is only about
>>>    -ftrivial-auto-var-init=zero.
>>> ** I added a temporary option -fauto-var-init-approach=A|B|C|D to
>>>    choose implementation A or D for the runtime performance study.
>>> ** I didn't finish the uninitialized-warnings maintenance work for D
>>>    (that might take more time than I expected).
>>>
>>> 2. I collected runtime data for CPU2017 on an x86 machine with this
>>>    new gcc for the following three cases:
>>>
>>> no: default (-g -O2 -march=native)
>>> A:  default + -ftrivial-auto-var-init=zero -fauto-var-init-approach=A
>>> D:  default + -ftrivial-auto-var-init=zero -fauto-var-init-approach=D
>>>
>>> And then computed the slowdown data for both A and D as follows:
>>>
>>> benchmarks         A/no     D/no
>>>
>>> 500.perlbench_r    1.25%    1.25%
>>> 502.gcc_r          0.68%    1.80%
>>> 505.mcf_r          0.68%    0.14%
>>> 520.omnetpp_r      4.83%    4.68%
>>> 523.xalancbmk_r    0.18%    1.96%
>>> 525.x264_r         1.55%    2.07%
>>> 531.deepsjeng_r   11.57%   11.85%
>>> 541.leela_r        0.64%    0.80%
>>> 557.xz_r          -0.41%   -0.41%
>>>
>>> 507.cactuBSSN_r    0.44%    0.44%
>>> 508.namd_r         0.34%    0.34%
>>> 510.parest_r       0.17%    0.25%
>>> 511.povray_r      56.57%   57.27%
>>> 519.lbm_r          0.00%    0.00%
>>> 521.wrf_r         -0.28%   -0.37%
>>> 526.blender_r     16.96%   17.71%
>>> 527.cam4_r         0.70%    0.53%
>>> 538.imagick_r      2.40%    2.40%
>>> 544.nab_r          0.00%   -0.65%
>>>
>>> avg                5.17%    5.37%
>>>
>>> From the above data, we can see that in general the runtime slowdown
>>> for implementations A and D is similar for individual benchmarks.
>>>
>>> There are several benchmarks that have a significant slowdown with
>>> the newly added initialization for both A and D, for example
>>> 511.povray_r, 526.blender_r, and 531.deepsjeng_r; I will try to
>>> study a bit more what kind of new initializations introduced such
>>> slowdown.
>>>
>>> From the current study so far, I think approach D should be good
>>> enough for our final implementation.  So I will try to finish
>>> approach D with the following remaining work:
>>>
>>> ** complete the implementation of -ftrivial-auto-var-init=pattern;
>>> ** complete the uninitialized-warnings maintenance work for D.
>>>
>>> Let me know if you have any comments and suggestions on my current
>>> and future work.
>>>
>>> Thanks a lot for your help.
>>>
>>> Qing
>>>
>>> On Dec 9, 2020, at 10:18 AM, Qing Zhao via Gcc-patches wrote:
>>>
>>> The following are the approaches I will implement and compare:
>>>
>>> Our final goal is to keep the uninitialized warnings and minimize
>>> the run-time performance cost.
>>>
>>> A. Adding real initialization during gimplification, not maintaining
>>>    the uninitialized warnings.
>>> B. Adding real initialization during gimplification, marking them
>>>    with "artificial_init".  Adjusting the uninitialized pass,
>>>    maintaining the annotation, making sure the real init is not
>>>    deleted from the fake init.
>>> C. Marking the DECL for an uninitialized auto variable as
>>>    "no_explicit_init" during gimplification, maintaining this
>>>    "no_explicit_init" bit until after pass_late_warn_uninitialized,
>>>    or until pass_expand, then adding real initialization for all
>>>    DECLs that are marked with "no_explicit_init".
>>> D. Adding .DEFERRED_INIT during gimplification, expanding the
>>>    .DEFERRED_INIT during expand to real initialization, and
>>>    adjusting the uninitialized pass to handle the new refs with
>>>    ".DEFERRED_INIT".
>>>
>>> Of the above, approach A will be the one with the minimum run-time
>>> cost and will be the base for the performance comparison.
>>>
>>> I will implement approach D first; it is expected to have the most
>>> run-time overhead among the above list, but its implementation
>>> should be the cleanest among B, C, and D.  Let's see how much
>>> performance overhead this approach has.  If the data is good, maybe
>>> we can avoid the effort to implement B and C.
>>>
>>> If the performance of D is not good, I will implement B or C at that
>>> time.
>>>
>>> Let me know if you have any comments or suggestions.
>>>
>>> Thanks.
>>>
>>> Qing
>>>
>>> --
>>> Richard Biener
>>> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
>>> Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)