[PATCH i386] Enable -freorder-blocks-and-partition

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [PATCH i386] Enable -freorder-blocks-and-partition
@ 2013-11-19 15:17 Teresa Johnson
  2013-11-19 16:31 ` Jan Hubicka
  2013-11-19 22:05 ` Andi Kleen
  0 siblings, 2 replies; 20+ messages in thread
From: Teresa Johnson @ 2013-11-19 15:17 UTC (permalink / raw)
  To: gcc-patches; +Cc: David Li, Jan Hubicka, Steven Bosscher

This patch enables -freorder-blocks-and-partition by default for x86
at -O2 and up. It is showing some modest gains in cpu2006 performance
with profile feedback and -O2 on an Intel Westmere system. Specifically,
I am seeing consistent improvements in 401.bzip2 (1.5-3%), 483.xalancbmk
(1.5-3%), and 453.povray (2.5-3%), and no apparent regressions.

Bootstrapped and tested on x86-64-unknown-linux-gnu with a normal
bootstrap, a profiledbootstrap and an LTO profiledbootstrap. All were
configured with --enable-languages=all,obj-c++ and tested for both
32 and 64-bit with RUNTESTFLAGS="--target_board=unix\{-m32,-m64\}".

It would be good to enable this for additional targets as a follow on,
but it needs more testing for both correctness and performance on those
other targets (i.e for correctness because I see a number of places
in other config/*/*.c files that do some special handling under this
option for different targets or simply disable it, so I am not sure
how well-tested it is under different architectural constraints).

Ok for trunk?

Thanks,
Teresa

2013-11-19  Teresa Johnson  <tejohnson@google.com>

        * common/config/i386/i386-common.c: Enable
        -freorder-blocks-and-partition at -O2 and up for x86.
        * opts.c (finish_options): Only warn if -freorder-blocks-and-
        partition was set on command line.

Index: common/config/i386/i386-common.c
===================================================================
--- common/config/i386/i386-common.c    (revision 205001)
+++ common/config/i386/i386-common.c    (working copy)
@@ -789,6 +789,8 @@ static const struct default_options ix86_option_op
   {
     /* Enable redundant extension instructions removal at -O2 and higher.  */
     { OPT_LEVELS_2_PLUS, OPT_free, NULL, 1 },
+    /* Enable function splitting at -O2 and higher.  */
+    { OPT_LEVELS_2_PLUS, OPT_freorder_blocks_and_partition, NULL, 1 },
     /* Turn off -fschedule-insns by default.  It tends to make the
        problem with not enough registers even worse.  */
     { OPT_LEVELS_ALL, OPT_fschedule_insns, NULL, 0 },
Index: opts.c
===================================================================
--- opts.c      (revision 205001)
+++ opts.c      (working copy)
@@ -737,9 +737,10 @@ finish_options (struct gcc_options *opts, struct g
       && opts->x_flag_reorder_blocks_and_partition
       && (ui_except == UI_SJLJ || ui_except >= UI_TARGET))
     {
-      inform (loc,
-             "-freorder-blocks-and-partition does not work "
-             "with exceptions on this architecture");
+      if (opts_set->x_flag_reorder_blocks_and_partition)
+        inform (loc,
+                "-freorder-blocks-and-partition does not work "
+                "with exceptions on this architecture");
       opts->x_flag_reorder_blocks_and_partition = 0;
       opts->x_flag_reorder_blocks = 1;
     }
@@ -752,9 +753,10 @@ finish_options (struct gcc_options *opts, struct g
       && opts->x_flag_reorder_blocks_and_partition
       && (ui_except == UI_SJLJ || ui_except >= UI_TARGET))
     {
-      inform (loc,
-             "-freorder-blocks-and-partition does not support "
-             "unwind info on this architecture");
+      if (opts_set->x_flag_reorder_blocks_and_partition)
+        inform (loc,
+                "-freorder-blocks-and-partition does not support "
+                "unwind info on this architecture");
       opts->x_flag_reorder_blocks_and_partition = 0;
       opts->x_flag_reorder_blocks = 1;
     }
@@ -769,9 +771,10 @@ finish_options (struct gcc_options *opts, struct g
              && targetm_common.unwind_tables_default
              && (ui_except == UI_SJLJ || ui_except >= UI_TARGET))))
     {
-      inform (loc,
-             "-freorder-blocks-and-partition does not work "
-             "on this architecture");
+      if (opts_set->x_flag_reorder_blocks_and_partition)
+        inform (loc,
+                "-freorder-blocks-and-partition does not work "
+                "on this architecture");
       opts->x_flag_reorder_blocks_and_partition = 0;
       opts->x_flag_reorder_blocks = 1;
     }


-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH i386] Enable -freorder-blocks-and-partition
  2013-11-19 15:17 [PATCH i386] Enable -freorder-blocks-and-partition Teresa Johnson
@ 2013-11-19 16:31 ` Jan Hubicka
  2013-11-19 18:23   ` Teresa Johnson
  2013-11-19 22:05 ` Andi Kleen
  1 sibling, 1 reply; 20+ messages in thread
From: Jan Hubicka @ 2013-11-19 16:31 UTC (permalink / raw)
  To: Teresa Johnson, marxin.liska, gcc-patches

Martin,
can you, please, generate the updated systemtap with
-freorder-blocks-and-partition enabled?

I am in favour of enabling this - it is usefull pass and it is pointless ot
have passes that are not enabled by default.
Is there reason why this would not work on other ELF target? Is it working
with Darwin and Windows?

> This patch enables -freorder-blocks-and-partition by default for x86
> at -O2 and up. It is showing some modest gains in cpu2006 performance
> with profile feedback and -O2 on an Intel Westmere system. Specifically,
> I am seeing consistent improvements in 401.bzip2 (1.5-3%), 483.xalancbmk
> (1.5-3%), and 453.povray (2.5-3%), and no apparent regressions.

This actually sounds very good ;)

Lets see how the systemtap graphs goes.  If we will end up with problem
of too many accesses to cold section, I would suggest making cold section
subdivided into .unlikely and .unlikely.part (we could have better name)
with the second consisting only of unlikely parts of hot&normal functions.

This should reduce the problems we are seeing with mistakely identifying
code to be cold because of roundoff errors (and it probably makes sense
in general, too).
We will however need to update gold and ld for that.
> 
> Bootstrapped and tested on x86-64-unknown-linux-gnu with a normal
> bootstrap, a profiledbootstrap and an LTO profiledbootstrap. All were
> configured with --enable-languages=all,obj-c++ and tested for both
> 32 and 64-bit with RUNTESTFLAGS="--target_board=unix\{-m32,-m64\}".
> 
> It would be good to enable this for additional targets as a follow on,
> but it needs more testing for both correctness and performance on those
> other targets (i.e for correctness because I see a number of places
> in other config/*/*.c files that do some special handling under this
> option for different targets or simply disable it, so I am not sure
> how well-tested it is under different architectural constraints).
> 
> Ok for trunk?
> 
> Thanks,
> Teresa
> 
> 2013-11-19  Teresa Johnson  <tejohnson@google.com>
> 
>         * common/config/i386/i386-common.c: Enable
>         -freorder-blocks-and-partition at -O2 and up for x86.
>         * opts.c (finish_options): Only warn if -freorder-blocks-and-
>         partition was set on command line.

You probably mis doc/invoke.texi update.
Thank you for working on this!

Honza
> 
> Index: common/config/i386/i386-common.c
> ===================================================================
> --- common/config/i386/i386-common.c    (revision 205001)
> +++ common/config/i386/i386-common.c    (working copy)
> @@ -789,6 +789,8 @@ static const struct default_options ix86_option_op
>    {
>      /* Enable redundant extension instructions removal at -O2 and higher.  */
>      { OPT_LEVELS_2_PLUS, OPT_free, NULL, 1 },
> +    /* Enable function splitting at -O2 and higher.  */
> +    { OPT_LEVELS_2_PLUS, OPT_freorder_blocks_and_partition, NULL, 1 },
>      /* Turn off -fschedule-insns by default.  It tends to make the
>         problem with not enough registers even worse.  */
>      { OPT_LEVELS_ALL, OPT_fschedule_insns, NULL, 0 },
> Index: opts.c
> ===================================================================
> --- opts.c      (revision 205001)
> +++ opts.c      (working copy)
> @@ -737,9 +737,10 @@ finish_options (struct gcc_options *opts, struct g
>        && opts->x_flag_reorder_blocks_and_partition
>        && (ui_except == UI_SJLJ || ui_except >= UI_TARGET))
>      {
> -      inform (loc,
> -             "-freorder-blocks-and-partition does not work "
> -             "with exceptions on this architecture");
> +      if (opts_set->x_flag_reorder_blocks_and_partition)
> +        inform (loc,
> +                "-freorder-blocks-and-partition does not work "
> +                "with exceptions on this architecture");
>        opts->x_flag_reorder_blocks_and_partition = 0;
>        opts->x_flag_reorder_blocks = 1;
>      }
> @@ -752,9 +753,10 @@ finish_options (struct gcc_options *opts, struct g
>        && opts->x_flag_reorder_blocks_and_partition
>        && (ui_except == UI_SJLJ || ui_except >= UI_TARGET))
>      {
> -      inform (loc,
> -             "-freorder-blocks-and-partition does not support "
> -             "unwind info on this architecture");
> +      if (opts_set->x_flag_reorder_blocks_and_partition)
> +        inform (loc,
> +                "-freorder-blocks-and-partition does not support "
> +                "unwind info on this architecture");
>        opts->x_flag_reorder_blocks_and_partition = 0;
>        opts->x_flag_reorder_blocks = 1;
>      }
> @@ -769,9 +771,10 @@ finish_options (struct gcc_options *opts, struct g
>               && targetm_common.unwind_tables_default
>               && (ui_except == UI_SJLJ || ui_except >= UI_TARGET))))
>      {
> -      inform (loc,
> -             "-freorder-blocks-and-partition does not work "
> -             "on this architecture");
> +      if (opts_set->x_flag_reorder_blocks_and_partition)
> +        inform (loc,
> +                "-freorder-blocks-and-partition does not work "
> +                "on this architecture");
>        opts->x_flag_reorder_blocks_and_partition = 0;
>        opts->x_flag_reorder_blocks = 1;
>      }
> 
> 
> -- 
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH i386] Enable -freorder-blocks-and-partition
  2013-11-19 16:31 ` Jan Hubicka
@ 2013-11-19 18:23   ` Teresa Johnson
  2013-11-19 19:32     ` Jeff Law
  0 siblings, 1 reply; 20+ messages in thread
From: Teresa Johnson @ 2013-11-19 18:23 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Martin Liška, gcc-patches

On Tue, Nov 19, 2013 at 7:44 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
> Martin,
> can you, please, generate the updated systemtap with
> -freorder-blocks-and-partition enabled?
>
> I am in favour of enabling this - it is usefull pass and it is pointless ot
> have passes that are not enabled by default.
> Is there reason why this would not work on other ELF target? Is it working
> with Darwin and Windows?

I don't know how to test these (I don't see any machines listed in the
gcc compile farm of those types). For Windows, I assume you mean
MinGW, which should be enabled as it is under i386. Should I disable
it there and for Darwin?

>
>> This patch enables -freorder-blocks-and-partition by default for x86
>> at -O2 and up. It is showing some modest gains in cpu2006 performance
>> with profile feedback and -O2 on an Intel Westmere system. Specifically,
>> I am seeing consistent improvements in 401.bzip2 (1.5-3%), 483.xalancbmk
>> (1.5-3%), and 453.povray (2.5-3%), and no apparent regressions.
>
> This actually sounds very good ;)
>
> Lets see how the systemtap graphs goes.  If we will end up with problem
> of too many accesses to cold section, I would suggest making cold section
> subdivided into .unlikely and .unlikely.part (we could have better name)
> with the second consisting only of unlikely parts of hot&normal functions.
>
> This should reduce the problems we are seeing with mistakely identifying
> code to be cold because of roundoff errors (and it probably makes sense
> in general, too).
> We will however need to update gold and ld for that.

Note that I don't think this would help much unless the linker is
changed to move the cold split section close to the hot section. There
is probably some fine-tuning we could do eventually in the linker
under -ffunction-sections without putting the split portions in a
separate section. I.e. clump the split parts together within unlikely.
But hopefully this can all be done later on as follow-on work to boost
the performance further.

>>
>> Bootstrapped and tested on x86-64-unknown-linux-gnu with a normal
>> bootstrap, a profiledbootstrap and an LTO profiledbootstrap. All were
>> configured with --enable-languages=all,obj-c++ and tested for both
>> 32 and 64-bit with RUNTESTFLAGS="--target_board=unix\{-m32,-m64\}".
>>
>> It would be good to enable this for additional targets as a follow on,
>> but it needs more testing for both correctness and performance on those
>> other targets (i.e for correctness because I see a number of places
>> in other config/*/*.c files that do some special handling under this
>> option for different targets or simply disable it, so I am not sure
>> how well-tested it is under different architectural constraints).
>>
>> Ok for trunk?
>>
>> Thanks,
>> Teresa
>>
>> 2013-11-19  Teresa Johnson  <tejohnson@google.com>
>>
>>         * common/config/i386/i386-common.c: Enable
>>         -freorder-blocks-and-partition at -O2 and up for x86.
>>         * opts.c (finish_options): Only warn if -freorder-blocks-and-
>>         partition was set on command line.
>
> You probably mis doc/invoke.texi update.
> Thank you for working on this!

Yes, thanks. Here is the patch with the invoke.texi update.

Teresa


2013-11-19  Teresa Johnson  <tejohnson@google.com>

        * common/config/i386/i386-common.c: Enable
        -freorder-blocks-and-partition at -O2 and up for x86.
        * doc/invoke.texi: Update -freorder-blocks-and-partition default.
        * opts.c (finish_options): Only warn if -freorder-blocks-and-
        partition was set on command line.

Index: common/config/i386/i386-common.c
===================================================================
--- common/config/i386/i386-common.c    (revision 205001)
+++ common/config/i386/i386-common.c    (working copy)
@@ -789,6 +789,8 @@ static const struct default_options ix86_option_op
   {
     /* Enable redundant extension instructions removal at -O2 and higher.  */
     { OPT_LEVELS_2_PLUS, OPT_free, NULL, 1 },
+    /* Enable function splitting at -O2 and higher.  */
+    { OPT_LEVELS_2_PLUS, OPT_freorder_blocks_and_partition, NULL, 1 },
     /* Turn off -fschedule-insns by default.  It tends to make the
        problem with not enough registers even worse.  */
     { OPT_LEVELS_ALL, OPT_fschedule_insns, NULL, 0 },
Index: doc/invoke.texi
===================================================================
--- doc/invoke.texi     (revision 205001)
+++ doc/invoke.texi     (working copy)
@@ -8172,6 +8172,8 @@ exception handling, for linkonce sections, for fun
 section attribute and on any architecture that does not support named
 sections.

+Enabled for x86 at levels @option{-O2}, @option{-O3}.
+
 @item -freorder-functions
 @opindex freorder-functions
 Reorder functions in the object file in order to
Index: opts.c
===================================================================
--- opts.c      (revision 205001)
+++ opts.c      (working copy)
@@ -737,9 +737,10 @@ finish_options (struct gcc_options *opts, struct g
       && opts->x_flag_reorder_blocks_and_partition
       && (ui_except == UI_SJLJ || ui_except >= UI_TARGET))
     {
-      inform (loc,
-             "-freorder-blocks-and-partition does not work "
-             "with exceptions on this architecture");
+      if (opts_set->x_flag_reorder_blocks_and_partition)
+        inform (loc,
+                "-freorder-blocks-and-partition does not work "
+                "with exceptions on this architecture");
       opts->x_flag_reorder_blocks_and_partition = 0;
       opts->x_flag_reorder_blocks = 1;
     }
@@ -752,9 +753,10 @@ finish_options (struct gcc_options *opts, struct g
       && opts->x_flag_reorder_blocks_and_partition
       && (ui_except == UI_SJLJ || ui_except >= UI_TARGET))
     {
-      inform (loc,
-             "-freorder-blocks-and-partition does not support "
-             "unwind info on this architecture");
+      if (opts_set->x_flag_reorder_blocks_and_partition)
+        inform (loc,
+                "-freorder-blocks-and-partition does not support "
+                "unwind info on this architecture");
       opts->x_flag_reorder_blocks_and_partition = 0;
       opts->x_flag_reorder_blocks = 1;
     }
@@ -769,9 +771,10 @@ finish_options (struct gcc_options *opts, struct g
              && targetm_common.unwind_tables_default
              && (ui_except == UI_SJLJ || ui_except >= UI_TARGET))))
     {
-      inform (loc,
-             "-freorder-blocks-and-partition does not work "
-             "on this architecture");
+      if (opts_set->x_flag_reorder_blocks_and_partition)
+        inform (loc,
+                "-freorder-blocks-and-partition does not work "
+                "on this architecture");
       opts->x_flag_reorder_blocks_and_partition = 0;
       opts->x_flag_reorder_blocks = 1;
     }


-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH i386] Enable -freorder-blocks-and-partition
  2013-11-19 18:23   ` Teresa Johnson
@ 2013-11-19 19:32     ` Jeff Law
  2013-11-20  1:55       ` Teresa Johnson
  0 siblings, 1 reply; 20+ messages in thread
From: Jeff Law @ 2013-11-19 19:32 UTC (permalink / raw)
  To: Teresa Johnson, Jan Hubicka; +Cc: Martin Liška, gcc-patches

On 11/19/13 10:24, Teresa Johnson wrote:
> On Tue, Nov 19, 2013 at 7:44 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> Martin,
>> can you, please, generate the updated systemtap with
>> -freorder-blocks-and-partition enabled?
>>
>> I am in favour of enabling this - it is usefull pass and it is pointless ot
>> have passes that are not enabled by default.
>> Is there reason why this would not work on other ELF target? Is it working
>> with Darwin and Windows?
>
> I don't know how to test these (I don't see any machines listed in the
> gcc compile farm of those types). For Windows, I assume you mean
> MinGW, which should be enabled as it is under i386. Should I disable
> it there and for Darwin?
>
>>
>>> This patch enables -freorder-blocks-and-partition by default for x86
>>> at -O2 and up. It is showing some modest gains in cpu2006 performance
>>> with profile feedback and -O2 on an Intel Westmere system. Specifically,
>>> I am seeing consistent improvements in 401.bzip2 (1.5-3%), 483.xalancbmk
>>> (1.5-3%), and 453.povray (2.5-3%), and no apparent regressions.
>>
>> This actually sounds very good ;)
>>
>> Lets see how the systemtap graphs goes.  If we will end up with problem
>> of too many accesses to cold section, I would suggest making cold section
>> subdivided into .unlikely and .unlikely.part (we could have better name)
>> with the second consisting only of unlikely parts of hot&normal functions.
>>
>> This should reduce the problems we are seeing with mistakely identifying
>> code to be cold because of roundoff errors (and it probably makes sense
>> in general, too).
>> We will however need to update gold and ld for that.
>
> Note that I don't think this would help much unless the linker is
> changed to move the cold split section close to the hot section. There
> is probably some fine-tuning we could do eventually in the linker
> under -ffunction-sections without putting the split portions in a
> separate section. I.e. clump the split parts together within unlikely.
> But hopefully this can all be done later on as follow-on work to boost
> the performance further.
>
>>>
>>> Bootstrapped and tested on x86-64-unknown-linux-gnu with a normal
>>> bootstrap, a profiledbootstrap and an LTO profiledbootstrap. All were
>>> configured with --enable-languages=all,obj-c++ and tested for both
>>> 32 and 64-bit with RUNTESTFLAGS="--target_board=unix\{-m32,-m64\}".
>>>
>>> It would be good to enable this for additional targets as a follow on,
>>> but it needs more testing for both correctness and performance on those
>>> other targets (i.e for correctness because I see a number of places
>>> in other config/*/*.c files that do some special handling under this
>>> option for different targets or simply disable it, so I am not sure
>>> how well-tested it is under different architectural constraints).
>>>
>>> Ok for trunk?
>>>
>>> Thanks,
>>> Teresa
>>>
>>> 2013-11-19  Teresa Johnson  <tejohnson@google.com>
>>>
>>>          * common/config/i386/i386-common.c: Enable
>>>          -freorder-blocks-and-partition at -O2 and up for x86.
>>>          * opts.c (finish_options): Only warn if -freorder-blocks-and-
>>>          partition was set on command line.
>>
>> You probably mis doc/invoke.texi update.
>> Thank you for working on this!
>
> Yes, thanks. Here is the patch with the invoke.texi update.
>
> Teresa
>
>
> 2013-11-19  Teresa Johnson  <tejohnson@google.com>
>
>          * common/config/i386/i386-common.c: Enable
>          -freorder-blocks-and-partition at -O2 and up for x86.
>          * doc/invoke.texi: Update -freorder-blocks-and-partition default.
>          * opts.c (finish_options): Only warn if -freorder-blocks-and-
>          partition was set on command line.
This is fine.  Glad to see we're getting some good benefits from this code.

FWIW, I would expect the PA to show benefits from this work -- HP showed 
good results for block positioning and procedure splitting eons ago.  A 
secondary effect you would expect to see would be an increase in the 
number of long branch stubs (static count), but a decrease in the number 
of those stubs that actually execute.  Measuring those effects would 
obviously be out-of-scope.

I totally understand that the PA is irrelevant these days, but seeing 
good results on another architecture would go a long way to answering 
the question "should this be enabled by default across the board?"

jeff


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH i386] Enable -freorder-blocks-and-partition
  2013-11-19 15:17 [PATCH i386] Enable -freorder-blocks-and-partition Teresa Johnson
  2013-11-19 16:31 ` Jan Hubicka
@ 2013-11-19 22:05 ` Andi Kleen
  2013-11-19 22:10   ` Teresa Johnson
  1 sibling, 1 reply; 20+ messages in thread
From: Andi Kleen @ 2013-11-19 22:05 UTC (permalink / raw)
  To: Teresa Johnson; +Cc: gcc-patches, David Li, Jan Hubicka, Steven Bosscher

Teresa Johnson <tejohnson@google.com> writes:

> This patch enables -freorder-blocks-and-partition by default for x86
> at -O2 and up. It is showing some modest gains in cpu2006 performance
> with profile feedback and -O2 on an Intel Westmere system. Specifically,
> I am seeing consistent improvements in 401.bzip2 (1.5-3%), 483.xalancbmk
> (1.5-3%), and 453.povray (2.5-3%), and no apparent regressions.

One thing that worries me is what this will do to profilers.

I had to hack some assembler code using out of line sections
to able to handle the libunwind based perf dwarf unwinder.

My understanding is that these out of line sections can be 
only described properly in dwarf3, and there's still some
dwarf2 based unwinder code around.

So this may cause problems with profiling and debugging.

It's probably still a good idea, just may need some extra
care.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH i386] Enable -freorder-blocks-and-partition
  2013-11-19 22:05 ` Andi Kleen
@ 2013-11-19 22:10   ` Teresa Johnson
  0 siblings, 0 replies; 20+ messages in thread
From: Teresa Johnson @ 2013-11-19 22:10 UTC (permalink / raw)
  To: Andi Kleen; +Cc: gcc-patches, David Li, Jan Hubicka, Steven Bosscher

On Tue, Nov 19, 2013 at 1:00 PM, Andi Kleen <andi@firstfloor.org> wrote:
> Teresa Johnson <tejohnson@google.com> writes:
>
>> This patch enables -freorder-blocks-and-partition by default for x86
>> at -O2 and up. It is showing some modest gains in cpu2006 performance
>> with profile feedback and -O2 on an Intel Westmere system. Specifically,
>> I am seeing consistent improvements in 401.bzip2 (1.5-3%), 483.xalancbmk
>> (1.5-3%), and 453.povray (2.5-3%), and no apparent regressions.
>
> One thing that worries me is what this will do to profilers.
>
> I had to hack some assembler code using out of line sections
> to able to handle the libunwind based perf dwarf unwinder.
>
> My understanding is that these out of line sections can be
> only described properly in dwarf3, and there's still some
> dwarf2 based unwinder code around.
>
> So this may cause problems with profiling and debugging.
>
> It's probably still a good idea, just may need some extra
> care.
>
> -Andi

Sri has approval for a patch that should address this by giving the
split cold sections a label. It should go in today as well:

http://gcc.gnu.org/ml/gcc-patches/2013-11/msg02143.html

Thanks,
Teresa
>
> --
> ak@linux.intel.com -- Speaking for myself only



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH i386] Enable -freorder-blocks-and-partition
  2013-11-19 19:32     ` Jeff Law
@ 2013-11-20  1:55       ` Teresa Johnson
       [not found]         ` <CAObPJ3OZHvET=QNmNtx9ZjaHZk=GhokWjoF2njr5mgwcv2ogDA@mail.gmail.com>
  0 siblings, 1 reply; 20+ messages in thread
From: Teresa Johnson @ 2013-11-20  1:55 UTC (permalink / raw)
  To: Jeff Law; +Cc: Jan Hubicka, Martin Liška, gcc-patches

On Tue, Nov 19, 2013 at 9:40 AM, Jeff Law <law@redhat.com> wrote:
> On 11/19/13 10:24, Teresa Johnson wrote:
>>
>> On Tue, Nov 19, 2013 at 7:44 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>>
>>> Martin,
>>> can you, please, generate the updated systemtap with
>>> -freorder-blocks-and-partition enabled?
>>>
>>> I am in favour of enabling this - it is usefull pass and it is pointless
>>> ot
>>> have passes that are not enabled by default.
>>> Is there reason why this would not work on other ELF target? Is it
>>> working
>>> with Darwin and Windows?
>>
>>
>> I don't know how to test these (I don't see any machines listed in the
>> gcc compile farm of those types). For Windows, I assume you mean
>> MinGW, which should be enabled as it is under i386. Should I disable
>> it there and for Darwin?
>>
>>>
>>>> This patch enables -freorder-blocks-and-partition by default for x86
>>>> at -O2 and up. It is showing some modest gains in cpu2006 performance
>>>> with profile feedback and -O2 on an Intel Westmere system. Specifically,
>>>> I am seeing consistent improvements in 401.bzip2 (1.5-3%), 483.xalancbmk
>>>> (1.5-3%), and 453.povray (2.5-3%), and no apparent regressions.
>>>
>>>
>>> This actually sounds very good ;)
>>>
>>> Lets see how the systemtap graphs goes.  If we will end up with problem
>>> of too many accesses to cold section, I would suggest making cold section
>>> subdivided into .unlikely and .unlikely.part (we could have better name)
>>> with the second consisting only of unlikely parts of hot&normal
>>> functions.
>>>
>>> This should reduce the problems we are seeing with mistakely identifying
>>> code to be cold because of roundoff errors (and it probably makes sense
>>> in general, too).
>>> We will however need to update gold and ld for that.
>>
>>
>> Note that I don't think this would help much unless the linker is
>> changed to move the cold split section close to the hot section. There
>> is probably some fine-tuning we could do eventually in the linker
>> under -ffunction-sections without putting the split portions in a
>> separate section. I.e. clump the split parts together within unlikely.
>> But hopefully this can all be done later on as follow-on work to boost
>> the performance further.
>>
>>>>
>>>> Bootstrapped and tested on x86-64-unknown-linux-gnu with a normal
>>>> bootstrap, a profiledbootstrap and an LTO profiledbootstrap. All were
>>>> configured with --enable-languages=all,obj-c++ and tested for both
>>>> 32 and 64-bit with RUNTESTFLAGS="--target_board=unix\{-m32,-m64\}".
>>>>
>>>> It would be good to enable this for additional targets as a follow on,
>>>> but it needs more testing for both correctness and performance on those
>>>> other targets (i.e for correctness because I see a number of places
>>>> in other config/*/*.c files that do some special handling under this
>>>> option for different targets or simply disable it, so I am not sure
>>>> how well-tested it is under different architectural constraints).
>>>>
>>>> Ok for trunk?
>>>>
>>>> Thanks,
>>>> Teresa
>>>>
>>>> 2013-11-19  Teresa Johnson  <tejohnson@google.com>
>>>>
>>>>          * common/config/i386/i386-common.c: Enable
>>>>          -freorder-blocks-and-partition at -O2 and up for x86.
>>>>          * opts.c (finish_options): Only warn if -freorder-blocks-and-
>>>>          partition was set on command line.
>>>
>>>
>>> You probably mis doc/invoke.texi update.
>>> Thank you for working on this!
>>
>>
>> Yes, thanks. Here is the patch with the invoke.texi update.
>>
>> Teresa
>>
>>
>> 2013-11-19  Teresa Johnson  <tejohnson@google.com>
>>
>>          * common/config/i386/i386-common.c: Enable
>>          -freorder-blocks-and-partition at -O2 and up for x86.
>>          * doc/invoke.texi: Update -freorder-blocks-and-partition default.
>>          * opts.c (finish_options): Only warn if -freorder-blocks-and-
>>          partition was set on command line.
>
> This is fine.  Glad to see we're getting some good benefits from this code.

Ok, thanks. It is in as r205058.

>
> FWIW, I would expect the PA to show benefits from this work -- HP showed
> good results for block positioning and procedure splitting eons ago.  A
> secondary effect you would expect to see would be an increase in the number
> of long branch stubs (static count), but a decrease in the number of those
> stubs that actually execute.  Measuring those effects would obviously be
> out-of-scope.
>
> I totally understand that the PA is irrelevant these days, but seeing good
> results on another architecture would go a long way to answering the
> question "should this be enabled by default across the board?"

Yep, we saw benefits for IPF on hp-ux as well. It would be great to
eventually enable this across the board and only selectively disable.
I know there were people interested in trying this with ARM, that
would be a good relevant arch to try this with now to see if it can be
enabled more widely.

Teresa

>
> jeff
>
>



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH i386] Enable -freorder-blocks-and-partition
       [not found]         ` <CAObPJ3OZHvET=QNmNtx9ZjaHZk=GhokWjoF2njr5mgwcv2ogDA@mail.gmail.com>
@ 2013-11-28 16:34           ` Jan Hubicka
  2013-12-02 15:16             ` Teresa Johnson
  0 siblings, 1 reply; 20+ messages in thread
From: Jan Hubicka @ 2013-11-28 16:34 UTC (permalink / raw)
  To: Martin Liška; +Cc: Teresa Johnson, Jeff Law, Jan Hubicka, gcc-patches

> Dear Teresa and Jan,
>    I tried to test Teresa's patch, but I've encountered two bugs
> during usage of -fprofile-generate/use (one in SPEC CPU 2006 and
> Inkscape).

Thanks, this is non-LTO run. Is there a chance to get -flto version, too?
So we see how things combine with -freorder-function
> 
> This will be probably for Jan:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59266
> 
> second one:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59265
> 
> There are numbers I recorded for GIMP with and without block reordering.
> 
> GIMP (-freorder-blocks-and-partition)
> pages read (no readahead): 597 pages (4K)
> 
> GIMP (-no-freorder-blocks-and-partition)
> pages read (no readahead): 596 pages (4K)

The graphs themselves seems bit odd however, why do we have so many accesses
to cold section with -fno-reorder-blocks-and-partition again?

Honza
> 
> Martin
> 
> On 19 November 2013 23:18, Teresa Johnson <tejohnson@google.com> wrote:
> > On Tue, Nov 19, 2013 at 9:40 AM, Jeff Law <law@redhat.com> wrote:
> >> On 11/19/13 10:24, Teresa Johnson wrote:
> >>>
> >>> On Tue, Nov 19, 2013 at 7:44 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
> >>>>
> >>>> Martin,
> >>>> can you, please, generate the updated systemtap with
> >>>> -freorder-blocks-and-partition enabled?
> >>>>
> >>>> I am in favour of enabling this - it is usefull pass and it is pointless
> >>>> ot
> >>>> have passes that are not enabled by default.
> >>>> Is there reason why this would not work on other ELF target? Is it
> >>>> working
> >>>> with Darwin and Windows?
> >>>
> >>>
> >>> I don't know how to test these (I don't see any machines listed in the
> >>> gcc compile farm of those types). For Windows, I assume you mean
> >>> MinGW, which should be enabled as it is under i386. Should I disable
> >>> it there and for Darwin?
> >>>
> >>>>
> >>>>> This patch enables -freorder-blocks-and-partition by default for x86
> >>>>> at -O2 and up. It is showing some modest gains in cpu2006 performance
> >>>>> with profile feedback and -O2 on an Intel Westmere system. Specifically,
> >>>>> I am seeing consistent improvements in 401.bzip2 (1.5-3%), 483.xalancbmk
> >>>>> (1.5-3%), and 453.povray (2.5-3%), and no apparent regressions.
> >>>>
> >>>>
> >>>> This actually sounds very good ;)
> >>>>
> >>>> Lets see how the systemtap graphs goes.  If we will end up with problem
> >>>> of too many accesses to cold section, I would suggest making cold section
> >>>> subdivided into .unlikely and .unlikely.part (we could have better name)
> >>>> with the second consisting only of unlikely parts of hot&normal
> >>>> functions.
> >>>>
> >>>> This should reduce the problems we are seeing with mistakely identifying
> >>>> code to be cold because of roundoff errors (and it probably makes sense
> >>>> in general, too).
> >>>> We will however need to update gold and ld for that.
> >>>
> >>>
> >>> Note that I don't think this would help much unless the linker is
> >>> changed to move the cold split section close to the hot section. There
> >>> is probably some fine-tuning we could do eventually in the linker
> >>> under -ffunction-sections without putting the split portions in a
> >>> separate section. I.e. clump the split parts together within unlikely.
> >>> But hopefully this can all be done later on as follow-on work to boost
> >>> the performance further.
> >>>
> >>>>>
> >>>>> Bootstrapped and tested on x86-64-unknown-linux-gnu with a normal
> >>>>> bootstrap, a profiledbootstrap and an LTO profiledbootstrap. All were
> >>>>> configured with --enable-languages=all,obj-c++ and tested for both
> >>>>> 32 and 64-bit with RUNTESTFLAGS="--target_board=unix\{-m32,-m64\}".
> >>>>>
> >>>>> It would be good to enable this for additional targets as a follow on,
> >>>>> but it needs more testing for both correctness and performance on those
> >>>>> other targets (i.e for correctness because I see a number of places
> >>>>> in other config/*/*.c files that do some special handling under this
> >>>>> option for different targets or simply disable it, so I am not sure
> >>>>> how well-tested it is under different architectural constraints).
> >>>>>
> >>>>> Ok for trunk?
> >>>>>
> >>>>> Thanks,
> >>>>> Teresa
> >>>>>
> >>>>> 2013-11-19  Teresa Johnson  <tejohnson@google.com>
> >>>>>
> >>>>>          * common/config/i386/i386-common.c: Enable
> >>>>>          -freorder-blocks-and-partition at -O2 and up for x86.
> >>>>>          * opts.c (finish_options): Only warn if -freorder-blocks-and-
> >>>>>          partition was set on command line.
> >>>>
> >>>>
> >>>> You probably mis doc/invoke.texi update.
> >>>> Thank you for working on this!
> >>>
> >>>
> >>> Yes, thanks. Here is the patch with the invoke.texi update.
> >>>
> >>> Teresa
> >>>
> >>>
> >>> 2013-11-19  Teresa Johnson  <tejohnson@google.com>
> >>>
> >>>          * common/config/i386/i386-common.c: Enable
> >>>          -freorder-blocks-and-partition at -O2 and up for x86.
> >>>          * doc/invoke.texi: Update -freorder-blocks-and-partition default.
> >>>          * opts.c (finish_options): Only warn if -freorder-blocks-and-
> >>>          partition was set on command line.
> >>
> >> This is fine.  Glad to see we're getting some good benefits from this code.
> >
> > Ok, thanks. It is in as r205058.
> >
> >>
> >> FWIW, I would expect the PA to show benefits from this work -- HP showed
> >> good results for block positioning and procedure splitting eons ago.  A
> >> secondary effect you would expect to see would be an increase in the number
> >> of long branch stubs (static count), but a decrease in the number of those
> >> stubs that actually execute.  Measuring those effects would obviously be
> >> out-of-scope.
> >>
> >> I totally understand that the PA is irrelevant these days, but seeing good
> >> results on another architecture would go a long way to answering the
> >> question "should this be enabled by default across the board?"
> >
> > Yep, we saw benefits for IPF on hp-ux as well. It would be great to
> > eventually enable this across the board and only selectively disable.
> > I know there were people interested in trying this with ARM, that
> > would be a good relevant arch to try this with now to see if it can be
> > enabled more widely.
> >
> > Teresa
> >
> >>
> >> jeff
> >>
> >>
> >
> >
> >
> > --
> > Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH i386] Enable -freorder-blocks-and-partition
  2013-11-28 16:34           ` Jan Hubicka
@ 2013-12-02 15:16             ` Teresa Johnson
  2013-12-02 16:17               ` Jeff Law
  0 siblings, 1 reply; 20+ messages in thread
From: Teresa Johnson @ 2013-12-02 15:16 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Martin Liška, Jeff Law, gcc-patches

On Thu, Nov 28, 2013 at 6:06 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> Dear Teresa and Jan,
>>    I tried to test Teresa's patch, but I've encountered two bugs
>> during usage of -fprofile-generate/use (one in SPEC CPU 2006 and
>> Inkscape).
>
> Thanks, this is non-LTO run. Is there a chance to get -flto version, too?
> So we see how things combine with -freorder-function
>>
>> This will be probably for Jan:
>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59266
>>
>> second one:
>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59265
>>
>> There are numbers I recorded for GIMP with and without block reordering.
>>
>> GIMP (-freorder-blocks-and-partition)
>> pages read (no readahead): 597 pages (4K)
>>
>> GIMP (-no-freorder-blocks-and-partition)
>> pages read (no readahead): 596 pages (4K)
>
> The graphs themselves seems bit odd however, why do we have so many accesses
> to cold section with -fno-reorder-blocks-and-partition again?

Comparing the two graphs I don't see additional accesses in the cold
section from -freorder-blocks-and-partition. For the most part the
graphs look identical. In contrast, the graphs Martin had generated
with and without -freorder-blocks-and-partition back in August had a
significant increase in execution out of text.unlikely.

I'm wondering if the -fno-reorder-blocks-and-partition graph really
had that disabled. I am surprised that the size of the .text and
.text.hot did not shrink from splitting. And the accesses in the cold
section in both graphs look suspiciously like the accesses we ended up
with in the cold section when enabling -freorder-blocks-and-partition
back in Aug (although there are certainly a lot fewer than before,
which is good news).

Martin, can you check that the binary used for
-fno-reorder-blocks-and-partition really doesn't have any splitting?

Thanks,
Teresa

>
> Honza
>>
>> Martin
>>
>> On 19 November 2013 23:18, Teresa Johnson <tejohnson@google.com> wrote:
>> > On Tue, Nov 19, 2013 at 9:40 AM, Jeff Law <law@redhat.com> wrote:
>> >> On 11/19/13 10:24, Teresa Johnson wrote:
>> >>>
>> >>> On Tue, Nov 19, 2013 at 7:44 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> >>>>
>> >>>> Martin,
>> >>>> can you, please, generate the updated systemtap with
>> >>>> -freorder-blocks-and-partition enabled?
>> >>>>
>> >>>> I am in favour of enabling this - it is usefull pass and it is pointless
>> >>>> ot
>> >>>> have passes that are not enabled by default.
>> >>>> Is there reason why this would not work on other ELF target? Is it
>> >>>> working
>> >>>> with Darwin and Windows?
>> >>>
>> >>>
>> >>> I don't know how to test these (I don't see any machines listed in the
>> >>> gcc compile farm of those types). For Windows, I assume you mean
>> >>> MinGW, which should be enabled as it is under i386. Should I disable
>> >>> it there and for Darwin?
>> >>>
>> >>>>
>> >>>>> This patch enables -freorder-blocks-and-partition by default for x86
>> >>>>> at -O2 and up. It is showing some modest gains in cpu2006 performance
>> >>>>> with profile feedback and -O2 on an Intel Westmere system. Specifically,
>> >>>>> I am seeing consistent improvements in 401.bzip2 (1.5-3%), 483.xalancbmk
>> >>>>> (1.5-3%), and 453.povray (2.5-3%), and no apparent regressions.
>> >>>>
>> >>>>
>> >>>> This actually sounds very good ;)
>> >>>>
>> >>>> Lets see how the systemtap graphs goes.  If we will end up with problem
>> >>>> of too many accesses to cold section, I would suggest making cold section
>> >>>> subdivided into .unlikely and .unlikely.part (we could have better name)
>> >>>> with the second consisting only of unlikely parts of hot&normal
>> >>>> functions.
>> >>>>
>> >>>> This should reduce the problems we are seeing with mistakely identifying
>> >>>> code to be cold because of roundoff errors (and it probably makes sense
>> >>>> in general, too).
>> >>>> We will however need to update gold and ld for that.
>> >>>
>> >>>
>> >>> Note that I don't think this would help much unless the linker is
>> >>> changed to move the cold split section close to the hot section. There
>> >>> is probably some fine-tuning we could do eventually in the linker
>> >>> under -ffunction-sections without putting the split portions in a
>> >>> separate section. I.e. clump the split parts together within unlikely.
>> >>> But hopefully this can all be done later on as follow-on work to boost
>> >>> the performance further.
>> >>>
>> >>>>>
>> >>>>> Bootstrapped and tested on x86-64-unknown-linux-gnu with a normal
>> >>>>> bootstrap, a profiledbootstrap and an LTO profiledbootstrap. All were
>> >>>>> configured with --enable-languages=all,obj-c++ and tested for both
>> >>>>> 32 and 64-bit with RUNTESTFLAGS="--target_board=unix\{-m32,-m64\}".
>> >>>>>
>> >>>>> It would be good to enable this for additional targets as a follow on,
>> >>>>> but it needs more testing for both correctness and performance on those
>> >>>>> other targets (i.e for correctness because I see a number of places
>> >>>>> in other config/*/*.c files that do some special handling under this
>> >>>>> option for different targets or simply disable it, so I am not sure
>> >>>>> how well-tested it is under different architectural constraints).
>> >>>>>
>> >>>>> Ok for trunk?
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Teresa
>> >>>>>
>> >>>>> 2013-11-19  Teresa Johnson  <tejohnson@google.com>
>> >>>>>
>> >>>>>          * common/config/i386/i386-common.c: Enable
>> >>>>>          -freorder-blocks-and-partition at -O2 and up for x86.
>> >>>>>          * opts.c (finish_options): Only warn if -freorder-blocks-and-
>> >>>>>          partition was set on command line.
>> >>>>
>> >>>>
>> >>>> You probably mis doc/invoke.texi update.
>> >>>> Thank you for working on this!
>> >>>
>> >>>
>> >>> Yes, thanks. Here is the patch with the invoke.texi update.
>> >>>
>> >>> Teresa
>> >>>
>> >>>
>> >>> 2013-11-19  Teresa Johnson  <tejohnson@google.com>
>> >>>
>> >>>          * common/config/i386/i386-common.c: Enable
>> >>>          -freorder-blocks-and-partition at -O2 and up for x86.
>> >>>          * doc/invoke.texi: Update -freorder-blocks-and-partition default.
>> >>>          * opts.c (finish_options): Only warn if -freorder-blocks-and-
>> >>>          partition was set on command line.
>> >>
>> >> This is fine.  Glad to see we're getting some good benefits from this code.
>> >
>> > Ok, thanks. It is in as r205058.
>> >
>> >>
>> >> FWIW, I would expect the PA to show benefits from this work -- HP showed
>> >> good results for block positioning and procedure splitting eons ago.  A
>> >> secondary effect you would expect to see would be an increase in the number
>> >> of long branch stubs (static count), but a decrease in the number of those
>> >> stubs that actually execute.  Measuring those effects would obviously be
>> >> out-of-scope.
>> >>
>> >> I totally understand that the PA is irrelevant these days, but seeing good
>> >> results on another architecture would go a long way to answering the
>> >> question "should this be enabled by default across the board?"
>> >
>> > Yep, we saw benefits for IPF on hp-ux as well. It would be great to
>> > eventually enable this across the board and only selectively disable.
>> > I know there were people interested in trying this with ARM, that
>> > would be a good relevant arch to try this with now to see if it can be
>> > enabled more widely.
>> >
>> > Teresa
>> >
>> >>
>> >> jeff
>> >>
>> >>
>> >
>> >
>> >
>> > --
>> > Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
>
>
>



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH i386] Enable -freorder-blocks-and-partition
  2013-12-02 15:16             ` Teresa Johnson
@ 2013-12-02 16:17               ` Jeff Law
  2013-12-02 16:53                 ` Martin Liška
  0 siblings, 1 reply; 20+ messages in thread
From: Jeff Law @ 2013-12-02 16:17 UTC (permalink / raw)
  To: Teresa Johnson, Jan Hubicka; +Cc: Martin Liška, gcc-patches

On 12/02/13 08:16, Teresa Johnson wrote:
>
> I'm wondering if the -fno-reorder-blocks-and-partition graph really
> had that disabled. I am surprised that the size of the .text and
> .text.hot did not shrink from splitting.
Could be due to needing longer jump opcodes to reach the unlikely sections.
jeff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH i386] Enable -freorder-blocks-and-partition
  2013-12-02 16:17               ` Jeff Law
@ 2013-12-02 16:53                 ` Martin Liška
  2013-12-11  9:21                   ` Martin Liška
  0 siblings, 1 reply; 20+ messages in thread
From: Martin Liška @ 2013-12-02 16:53 UTC (permalink / raw)
  To: Jeff Law; +Cc: Teresa Johnson, Jan Hubicka, gcc-patches

Dear Teresa,
   I will today double check if the graphs are correct :)

Martin

On 2 December 2013 17:16, Jeff Law <law@redhat.com> wrote:
> On 12/02/13 08:16, Teresa Johnson wrote:
>>
>>
>> I'm wondering if the -fno-reorder-blocks-and-partition graph really
>> had that disabled. I am surprised that the size of the .text and
>> .text.hot did not shrink from splitting.
>
> Could be due to needing longer jump opcodes to reach the unlikely sections.
> jeff
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH i386] Enable -freorder-blocks-and-partition
  2013-12-02 16:53                 ` Martin Liška
@ 2013-12-11  9:21                   ` Martin Liška
  2013-12-12  5:51                     ` Teresa Johnson
  0 siblings, 1 reply; 20+ messages in thread
From: Martin Liška @ 2013-12-11  9:21 UTC (permalink / raw)
  To: Jeff Law; +Cc: Teresa Johnson, Jan Hubicka, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 842 bytes --]

Hello,
   I prepared a collection of systemtap graphs for GIMP.

1) just my profile-based function reordering: 550 pages
2) just -freorder-blocks-and-partitions: 646 pages
3) just -fno-reorder-blocks-and-partitions: 638 pages

Please see attached data.

Martin

On 2 December 2013 17:52, Martin Liška <marxin.liska@gmail.com> wrote:
> Dear Teresa,
>    I will today double check if the graphs are correct :)
>
> Martin
>
> On 2 December 2013 17:16, Jeff Law <law@redhat.com> wrote:
>> On 12/02/13 08:16, Teresa Johnson wrote:
>>>
>>>
>>> I'm wondering if the -fno-reorder-blocks-and-partition graph really
>>> had that disabled. I am surprised that the size of the .text and
>>> .text.hot did not shrink from splitting.
>>
>> Could be due to needing longer jump opcodes to reach the unlikely sections.
>> jeff
>>

[-- Attachment #2: gimp-graphs.tar.bz2 --]
[-- Type: application/x-bzip2, Size: 152914 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH i386] Enable -freorder-blocks-and-partition
  2013-12-11  9:21                   ` Martin Liška
@ 2013-12-12  5:51                     ` Teresa Johnson
  2013-12-12 20:54                       ` Jan Hubicka
  2013-12-13  1:13                       ` Jan Hubicka
  0 siblings, 2 replies; 20+ messages in thread
From: Teresa Johnson @ 2013-12-12  5:51 UTC (permalink / raw)
  To: Martin Liška; +Cc: Jeff Law, Jan Hubicka, gcc-patches

On Wed, Dec 11, 2013 at 1:21 AM, Martin Liška <marxin.liska@gmail.com> wrote:
> Hello,
>    I prepared a collection of systemtap graphs for GIMP.
>
> 1) just my profile-based function reordering: 550 pages
> 2) just -freorder-blocks-and-partitions: 646 pages
> 3) just -fno-reorder-blocks-and-partitions: 638 pages
>
> Please see attached data.

Thanks for the data. A few observations/questions:

With both 1) (your (time-based?) reordering) and 2)
(-freorder-blocks-and-partitions) there are a fair amount of accesses
out of the cold section. I'm not seeing so many accesses out of the
cold section in the apps I am looking at with splitting enabled. In
the case of splitting, it could either be non-representative profile
data or profile data that isn't being maintained properly and lost,
although I think I fixed most of those. If you have identified any of
the cold split routines that are being executed in the case of 2) it
would be interesting to look at the dumps.

With 2) there is also a big clump towards the end which is being
executed out of the cold section, which again would be interesting to
investigate.

Why is your function reordering in 1) accessing more out of the cold
section than 3) (-fno-reorder-blocks-and-partitions)?

Both 2) and 3) have both normal .text and .text.hot, whereas 1) only
has .text. I wonder if that is contributing to the higher number of
pages either of these has compared to 1), since the non-cold addresses
are distributed across both sections?

Teresa

>
> Martin
>
> On 2 December 2013 17:52, Martin Liška <marxin.liska@gmail.com> wrote:
>> Dear Teresa,
>>    I will today double check if the graphs are correct :)
>>
>> Martin
>>
>> On 2 December 2013 17:16, Jeff Law <law@redhat.com> wrote:
>>> On 12/02/13 08:16, Teresa Johnson wrote:
>>>>
>>>>
>>>> I'm wondering if the -fno-reorder-blocks-and-partition graph really
>>>> had that disabled. I am surprised that the size of the .text and
>>>> .text.hot did not shrink from splitting.
>>>
>>> Could be due to needing longer jump opcodes to reach the unlikely sections.
>>> jeff
>>>

-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH i386] Enable -freorder-blocks-and-partition
  2013-12-12  5:51                     ` Teresa Johnson
@ 2013-12-12 20:54                       ` Jan Hubicka
  2013-12-13  1:13                       ` Jan Hubicka
  1 sibling, 0 replies; 20+ messages in thread
From: Jan Hubicka @ 2013-12-12 20:54 UTC (permalink / raw)
  To: Teresa Johnson; +Cc: Martin Liška, Jeff Law, Jan Hubicka, gcc-patches

> On Wed, Dec 11, 2013 at 1:21 AM, Martin LiÂška <marxin.liska@gmail.com> wrote:
> > Hello,
> >    I prepared a collection of systemtap graphs for GIMP.
> >
> > 1) just my profile-based function reordering: 550 pages
> > 2) just -freorder-blocks-and-partitions: 646 pages
> > 3) just -fno-reorder-blocks-and-partitions: 638 pages
> >
> > Please see attached data.
> 
> Thanks for the data. A few observations/questions:
> 
> With both 1) (your (time-based?) reordering) and 2)
> (-freorder-blocks-and-partitions) there are a fair amount of accesses
> out of the cold section. I'm not seeing so many accesses out of the

Good point, I misread the description and assumed that 1) is time profiling +
reorder-blcoks-and-partition.

Martin, what version of GCC you used?  Rong introduced bug into libgcov
that made gcov streaming around fork to split summaries (so the number
of runs did not match).  I fixed it by
2013-11-18  Jan Hubicka  <jh@suse.cz>

        * libgcov-driver.c (run_accounted): Make global level static.
        (gcov_exit_merge_summary): Silence warning; do not clear
        run_accounted here.
        (gcov_exit): Clear it here.

        * libgcov-driver.c (gcov_exit_merge_summary): Fix setting
        run_accounted.

        * libgcov-driver.c (get_gcov_dump_complete): Update comments.
        (all_prg, crc32): Remove static vars.
        (gcov_exit_compute_summary): Rewrite to return crc32; do not clear
        all_prg.
        (gcov_exit_merge_gcda): Add crc32 parameter.
        (gcov_exit_merge_summary): Add crc32 and all_prg parameter;
        do not account run if it was already accounted.
        (gcov_exit_dump_gcov): Add crc32 and all_prg parameters.
        (gcov_exit): Initialize all_prg; update.

so please be sure you have this one in tree.  If you do, can you please repeat
the trick with locked unlikely section so we see why we get there even 
with -fno-reorder-blcoks-and-partition?

> cold section in the apps I am looking at with splitting enabled. In
> the case of splitting, it could either be non-representative profile
> data or profile data that isn't being maintained properly and lost,
> although I think I fixed most of those. If you have identified any of
> the cold split routines that are being executed in the case of 2) it
> would be interesting to look at the dumps.
> 
> With 2) there is also a big clump towards the end which is being
> executed out of the cold section, which again would be interesting to
> investigate.

I think the data towards the end comes from fact that Martin is manually
quiting gimp and at that time he may do it differently (and after different
delay) each time.  This makes the graphs harder to read, but one should
basically ignore everything after the huge gap that indicate that the app
has started.
> 
> Why is your function reordering in 1) accessing more out of the cold
> section than 3) (-fno-reorder-blocks-and-partitions)?

This seems like a scale differnce here (i.e. Martin took longer to quit
gimp in gimp_reorder_blocks_and_partition).  In the first gap you can see
several red dots aligned vertically.  I did not go into manually counting the
dots, but it seems that they should be about the same.
> 
> Both 2) and 3) have both normal .text and .text.hot, whereas 1) only
> has .text. I wonder if that is contributing to the higher number of
> pages either of these has compared to 1), since the non-cold addresses
> are distributed across both sections?

Looking at Martin's tree
https://github.com/marxin/gcc/blob/time-profiler-patch2/gcc/varasm.c#L536
he has a hack in disables startup/hot sections, but keeps unlikely.
I agree that for more comparable results we should keep all.

But first lets work out why we have unlikely section accesses in again...
Martin, is there any chance you can test these things on mainline rather
than patched trees?

Honza

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH i386] Enable -freorder-blocks-and-partition
  2013-12-12  5:51                     ` Teresa Johnson
  2013-12-12 20:54                       ` Jan Hubicka
@ 2013-12-13  1:13                       ` Jan Hubicka
       [not found]                         ` <CAObPJ3MLrWsTwK-23ienktyASO9YLvAXMNKUxVm3v+KC=5JzOA@mail.gmail.com>
  2013-12-20  6:19                         ` Teresa Johnson
  1 sibling, 2 replies; 20+ messages in thread
From: Jan Hubicka @ 2013-12-13  1:13 UTC (permalink / raw)
  To: Teresa Johnson; +Cc: Martin Liška, Jeff Law, Jan Hubicka, gcc-patches

> On Wed, Dec 11, 2013 at 1:21 AM, Martin LiÂška <marxin.liska@gmail.com> wrote:
> > Hello,
> >    I prepared a collection of systemtap graphs for GIMP.
> >
> > 1) just my profile-based function reordering: 550 pages
> > 2) just -freorder-blocks-and-partitions: 646 pages
> > 3) just -fno-reorder-blocks-and-partitions: 638 pages
> >
> > Please see attached data.
> 
> Thanks for the data. A few observations/questions:
> 
> With both 1) (your (time-based?) reordering) and 2)
> (-freorder-blocks-and-partitions) there are a fair amount of accesses
> out of the cold section. I'm not seeing so many accesses out of the
> cold section in the apps I am looking at with splitting enabled. In

I see you already comitted the patch, so perhaps Martin's measurement assume
the pass is off by default?

I rebuilded GCC with profiledboostrap and with the linkerscript unmapping
text.unlikely.  I get ICE in:
(gdb) bt
#0  diagnostic_set_caret_max_width(diagnostic_context*, int) () at ../../gcc/diagnostic.c:108
#1  0x0000000000f68457 in diagnostic_initialize (context=0x18ae000 <global_diagnostic_context>, n_opts=n_opts@entry=1290) at ../../gcc/diagnostic.c:135
#2  0x000000000100050e in general_init (argv0=<optimized out>) at ../../gcc/toplev.c:1110
#3  toplev_main(int, char**) () at ../../gcc/toplev.c:1922
#4  0x00007ffff774cbe5 in __libc_start_main () from /lib64/libc.so.6
#5  0x0000000000f7898d in _start () at ../sysdeps/x86_64/start.S:122

That is relatively early in startup process. The function seems inlined and
it fails only on second invocation, did not have time to investigate further,
yet while without -fprofile-use it starts...

On our periodic testers I see off-noise improvement in crafty 2200->2300
and regression on Vortex, 2900->2800, plus code size increase.

Honza

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH i386] Enable -freorder-blocks-and-partition
       [not found]                         ` <CAObPJ3MLrWsTwK-23ienktyASO9YLvAXMNKUxVm3v+KC=5JzOA@mail.gmail.com>
@ 2013-12-15 22:19                           ` Martin Liška
  2013-12-17 15:09                             ` Teresa Johnson
  0 siblings, 1 reply; 20+ messages in thread
From: Martin Liška @ 2013-12-15 22:19 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Teresa Johnson, Jeff Law, gcc-patches

On 15 December 2013 23:17, Martin Liška <marxin.liska@gmail.com> wrote:
> Dear Jan and Teresa,
>     Jan was right that I've been using changes which were commited by
> Teresa and do live in trunk. So the graph with time profile presented
> in my previous post was really with enabled
> -freorder-blocks-and-partition. I removed the hack in varasm.c and I
> do use classic section layout. Please open the following dump
> (includes PDF graph+html report that shows functions with time profile
> located in cold section and all -fdump-ipa-all dumps):
>
> https://drive.google.com/file/d/0B0pisUJ80pO1YW1QWUFkZjdqME0/edit?usp=sharing
>
> Apart from that, I created also PDF graph (https://drive.google.com/file/d/0B0pisUJ80pO1aHhPWW56dXpLVTQ/edit?usp=sharing) that
> shows that time profile is almost perfect for GIMP. I miss just some
> examples that do not have profile in generate phase.
>
> I will merge current trunk and prepare final patch.
>
> Are there any other data that you want to be prepared?
>
> Martin
>
>
> On 13 December 2013 02:13, Jan Hubicka <hubicka@ucw.cz> wrote:
>>> On Wed, Dec 11, 2013 at 1:21 AM, Martin Liška <marxin.liska@gmail.com> wrote:
>>> > Hello,
>>> >    I prepared a collection of systemtap graphs for GIMP.
>>> >
>>> > 1) just my profile-based function reordering: 550 pages
>>> > 2) just -freorder-blocks-and-partitions: 646 pages
>>> > 3) just -fno-reorder-blocks-and-partitions: 638 pages
>>> >
>>> > Please see attached data.
>>>
>>> Thanks for the data. A few observations/questions:
>>>
>>> With both 1) (your (time-based?) reordering) and 2)
>>> (-freorder-blocks-and-partitions) there are a fair amount of accesses
>>> out of the cold section. I'm not seeing so many accesses out of the
>>> cold section in the apps I am looking at with splitting enabled. In
>>
>> I see you already comitted the patch, so perhaps Martin's measurement assume
>> the pass is off by default?
>>
>> I rebuilded GCC with profiledboostrap and with the linkerscript unmapping
>> text.unlikely.  I get ICE in:
>> (gdb) bt
>> #0  diagnostic_set_caret_max_width(diagnostic_context*, int) () at ../../gcc/diagnostic.c:108
>> #1  0x0000000000f68457 in diagnostic_initialize (context=0x18ae000 <global_diagnostic_context>, n_opts=n_opts@entry=1290) at ../../gcc/diagnostic.c:135
>> #2  0x000000000100050e in general_init (argv0=<optimized out>) at ../../gcc/toplev.c:1110
>> #3  toplev_main(int, char**) () at ../../gcc/toplev.c:1922
>> #4  0x00007ffff774cbe5 in __libc_start_main () from /lib64/libc.so.6
>> #5  0x0000000000f7898d in _start () at ../sysdeps/x86_64/start.S:122
>>
>> That is relatively early in startup process. The function seems inlined and
>> it fails only on second invocation, did not have time to investigate further,
>> yet while without -fprofile-use it starts...
>>
>> On our periodic testers I see off-noise improvement in crafty 2200->2300
>> and regression on Vortex, 2900->2800, plus code size increase.
>>
>> Honza

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH i386] Enable -freorder-blocks-and-partition
  2013-12-15 22:19                           ` Martin Liška
@ 2013-12-17 15:09                             ` Teresa Johnson
  0 siblings, 0 replies; 20+ messages in thread
From: Teresa Johnson @ 2013-12-17 15:09 UTC (permalink / raw)
  To: Martin Liška; +Cc: Jan Hubicka, Jeff Law, gcc-patches

Thanks for the data. A few questions:

- Do you have the raw data used to generate your pdfs available? Since
you gave me the binaries, if I have the data in terms of exactly what
addresses are being plotted I can correlate with the specific cold
functions via nm. Once I know what cold functions are being hit, I
would then need the .i files and the .gcda files to reproduce the
build.

- I tried running the binaries, but don't have the necessary shared
library dependencies installed on my system:
$ ldd gimp-2.8 | grep found
libgimpwidgets-2.0.so.0 => not found
libgimpconfig-2.0.so.0 => not found
libgimpcolor-2.0.so.0 => not found
libgimpmath-2.0.so.0 => not found
libgimpthumb-2.0.so.0 => not found
libgimpmodule-2.0.so.0 => not found
libgimpbase-2.0.so.0 => not found
libgegl-0.2.so.0 => not found
libbabl-0.1.so.0 => not found

I'll try to get these installed, but the last time I did that in an
attempt to build gimp I had a lot of trouble trying to get the right
versions and get them to build for me - any chance you could build an
archive version of the gimp binary?

Thanks,
Teresa

On Sun, Dec 15, 2013 at 2:19 PM, Martin Liška <marxin.liska@gmail.com> wrote:
> On 15 December 2013 23:17, Martin Liška <marxin.liska@gmail.com> wrote:
>> Dear Jan and Teresa,
>>     Jan was right that I've been using changes which were commited by
>> Teresa and do live in trunk. So the graph with time profile presented
>> in my previous post was really with enabled
>> -freorder-blocks-and-partition. I removed the hack in varasm.c and I
>> do use classic section layout. Please open the following dump
>> (includes PDF graph+html report that shows functions with time profile
>> located in cold section and all -fdump-ipa-all dumps):
>>
>> https://drive.google.com/file/d/0B0pisUJ80pO1YW1QWUFkZjdqME0/edit?usp=sharing
>>
>> Apart from that, I created also PDF graph (https://drive.google.com/file/d/0B0pisUJ80pO1aHhPWW56dXpLVTQ/edit?usp=sharing) that
>> shows that time profile is almost perfect for GIMP. I miss just some
>> examples that do not have profile in generate phase.
>>
>> I will merge current trunk and prepare final patch.
>>
>> Are there any other data that you want to be prepared?
>>
>> Martin
>>
>>
>> On 13 December 2013 02:13, Jan Hubicka <hubicka@ucw.cz> wrote:
>>>> On Wed, Dec 11, 2013 at 1:21 AM, Martin Liška <marxin.liska@gmail.com> wrote:
>>>> > Hello,
>>>> >    I prepared a collection of systemtap graphs for GIMP.
>>>> >
>>>> > 1) just my profile-based function reordering: 550 pages
>>>> > 2) just -freorder-blocks-and-partitions: 646 pages
>>>> > 3) just -fno-reorder-blocks-and-partitions: 638 pages
>>>> >
>>>> > Please see attached data.
>>>>
>>>> Thanks for the data. A few observations/questions:
>>>>
>>>> With both 1) (your (time-based?) reordering) and 2)
>>>> (-freorder-blocks-and-partitions) there are a fair amount of accesses
>>>> out of the cold section. I'm not seeing so many accesses out of the
>>>> cold section in the apps I am looking at with splitting enabled. In
>>>
>>> I see you already comitted the patch, so perhaps Martin's measurement assume
>>> the pass is off by default?
>>>
>>> I rebuilded GCC with profiledboostrap and with the linkerscript unmapping
>>> text.unlikely.  I get ICE in:
>>> (gdb) bt
>>> #0  diagnostic_set_caret_max_width(diagnostic_context*, int) () at ../../gcc/diagnostic.c:108
>>> #1  0x0000000000f68457 in diagnostic_initialize (context=0x18ae000 <global_diagnostic_context>, n_opts=n_opts@entry=1290) at ../../gcc/diagnostic.c:135
>>> #2  0x000000000100050e in general_init (argv0=<optimized out>) at ../../gcc/toplev.c:1110
>>> #3  toplev_main(int, char**) () at ../../gcc/toplev.c:1922
>>> #4  0x00007ffff774cbe5 in __libc_start_main () from /lib64/libc.so.6
>>> #5  0x0000000000f7898d in _start () at ../sysdeps/x86_64/start.S:122
>>>
>>> That is relatively early in startup process. The function seems inlined and
>>> it fails only on second invocation, did not have time to investigate further,
>>> yet while without -fprofile-use it starts...
>>>
>>> On our periodic testers I see off-noise improvement in crafty 2200->2300
>>> and regression on Vortex, 2900->2800, plus code size increase.
>>>
>>> Honza



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH i386] Enable -freorder-blocks-and-partition
  2013-12-13  1:13                       ` Jan Hubicka
       [not found]                         ` <CAObPJ3MLrWsTwK-23ienktyASO9YLvAXMNKUxVm3v+KC=5JzOA@mail.gmail.com>
@ 2013-12-20  6:19                         ` Teresa Johnson
  2014-02-11 22:21                           ` Teresa Johnson
  1 sibling, 1 reply; 20+ messages in thread
From: Teresa Johnson @ 2013-12-20  6:19 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Martin Liška, gcc-patches

On Thu, Dec 12, 2013 at 5:13 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> On Wed, Dec 11, 2013 at 1:21 AM, Martin Liška <marxin.liska@gmail.com> wrote:
>> > Hello,
>> >    I prepared a collection of systemtap graphs for GIMP.
>> >
>> > 1) just my profile-based function reordering: 550 pages
>> > 2) just -freorder-blocks-and-partitions: 646 pages
>> > 3) just -fno-reorder-blocks-and-partitions: 638 pages
>> >
>> > Please see attached data.
>>
>> Thanks for the data. A few observations/questions:
>>
>> With both 1) (your (time-based?) reordering) and 2)
>> (-freorder-blocks-and-partitions) there are a fair amount of accesses
>> out of the cold section. I'm not seeing so many accesses out of the
>> cold section in the apps I am looking at with splitting enabled. In
>
> I see you already comitted the patch, so perhaps Martin's measurement assume
> the pass is off by default?
>
> I rebuilded GCC with profiledboostrap and with the linkerscript unmapping
> text.unlikely.  I get ICE in:
> (gdb) bt
> #0  diagnostic_set_caret_max_width(diagnostic_context*, int) () at ../../gcc/diagnostic.c:108
> #1  0x0000000000f68457 in diagnostic_initialize (context=0x18ae000 <global_diagnostic_context>, n_opts=n_opts@entry=1290) at ../../gcc/diagnostic.c:135
> #2  0x000000000100050e in general_init (argv0=<optimized out>) at ../../gcc/toplev.c:1110
> #3  toplev_main(int, char**) () at ../../gcc/toplev.c:1922
> #4  0x00007ffff774cbe5 in __libc_start_main () from /lib64/libc.so.6
> #5  0x0000000000f7898d in _start () at ../sysdeps/x86_64/start.S:122
>
> That is relatively early in startup process. The function seems inlined and
> it fails only on second invocation, did not have time to investigate further,
> yet while without -fprofile-use it starts...

I'll see if I can reproduce this and investigate, although at this
point that might have to wait until after my holiday vacation.

>
> On our periodic testers I see off-noise improvement in crafty 2200->2300
> and regression on Vortex, 2900->2800, plus code size increase.

I had only run cpu2006, but not cpu2000. I'll see if I can reproduce
this as well.

I have been investigating a few places where I saw accesses in the
cold split regions in internal benchmarks. Here are a couple, and how
I have addressed them so far:

1) loop unswitching

In this case, loop unswitching hoisted a branch from within the loop
to outside the loop, and in doing so it was effectively speculated
above several other branches. In it's original location it always went
to only one of the successors (biased 0/100%). But when it was hoisted
it sometimes took the previously 0% path. This led to executing out of
the cold region, since we didn't update the branch probability when
hoisting. I worked around this by assigning a small non-zero
probability after hoisting with the following change:

Index: tree-ssa-loop-unswitch.c
===================================================================
--- tree-ssa-loop-unswitch.c    (revision 205590)
+++ tree-ssa-loop-unswitch.c    (working copy)
@@ -384,6 +384,8 @@ tree_unswitch_loop (struct loop *loop,

   extract_true_false_edges_from_block (unswitch_on, &edge_true, &edge_false);
   prob_true = edge_true->probability;
+  if (!prob_true)
+    prob_true = REG_BR_PROB_BASE/10;
   return loop_version (loop, unshare_expr (cond),
                       NULL, prob_true, prob_true,
                       REG_BR_PROB_BASE - prob_true, false);

This should probably be refined (if prob_true is 100% we want to
assign a small non-zero probability to the false path), and 10% may be
too high (maybe give it 1%?).

2) More COMDAT issues

My earlier patch handled the case where the comdat had 0 counts since
the linker kept the copy in a different module. In that case we
prevent the guessed frequencies from being dropped by counts_to_freq,
and then later mark any reached via non-zero callgraph edges as
guessed. Finally, when one such 0-count comdat is inlined the call
count is propagated to the callee blocks using the guessed
probabilities.

However, in this case, there was a comdat that had a very small
non-zero count, that was being inlined to a much hotter callsite. I
believe this could happen when there was a copy that was ipa-inlined
in the profile gen compile, so the copy in that module gets some
non-zero counts from the ipa inlined instance, but when the out of
line copy was eliminated by the linker (selected from a different
module). In this case the inliner was scaling the bb counts up quite a
lot when inlining. The problem is that you most likely can't trust
that the 0 count bbs in such a case are really not executed by the
callsite it is being into, since the counts are very small and
correspond to a different callsite.

The problem is how to address this. We can't simply suppress
counts_to_freq from overwriting the guessed frequencies in this case,
since the profile counts are non-zero and would not match the guessed
probabilities. But we can't figure out which are called by much hotter
callsites (compared to their entry count) until later when the
callgraph is built, which is when we would know that we want to ignore
the profile counts and use the guessed probabilities instead. The
solution I came up with is to allow the profile counts to overwrite
the guessed probabilites in counts_to_freq. But then when we inline we
re-estimate the probabilities in the callee when the callsite count is
much hotter than the entry count, and then follow the same procedure
we were doing in the 0-count case (propagate the call count into the
callee bb counts via the guessed probabilities). Is there a better
solution?

Thanks,
Teresa

>
> Honza

-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH i386] Enable -freorder-blocks-and-partition
  2013-12-20  6:19                         ` Teresa Johnson
@ 2014-02-11 22:21                           ` Teresa Johnson
  2014-02-14 18:50                             ` Teresa Johnson
  0 siblings, 1 reply; 20+ messages in thread
From: Teresa Johnson @ 2014-02-11 22:21 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Martin Liška, gcc-patches

On Thu, Dec 19, 2013 at 10:19 PM, Teresa Johnson <tejohnson@google.com> wrote:
> On Thu, Dec 12, 2013 at 5:13 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>> On Wed, Dec 11, 2013 at 1:21 AM, Martin Liška <marxin.liska@gmail.com> wrote:
>>> > Hello,
>>> >    I prepared a collection of systemtap graphs for GIMP.
>>> >
>>> > 1) just my profile-based function reordering: 550 pages
>>> > 2) just -freorder-blocks-and-partitions: 646 pages
>>> > 3) just -fno-reorder-blocks-and-partitions: 638 pages
>>> >
>>> > Please see attached data.
>>>
>>> Thanks for the data. A few observations/questions:
>>>
>>> With both 1) (your (time-based?) reordering) and 2)
>>> (-freorder-blocks-and-partitions) there are a fair amount of accesses
>>> out of the cold section. I'm not seeing so many accesses out of the
>>> cold section in the apps I am looking at with splitting enabled. In
>>
>> I see you already comitted the patch, so perhaps Martin's measurement assume
>> the pass is off by default?
>>
>> I rebuilded GCC with profiledboostrap and with the linkerscript unmapping
>> text.unlikely.  I get ICE in:
>> (gdb) bt
>> #0  diagnostic_set_caret_max_width(diagnostic_context*, int) () at ../../gcc/diagnostic.c:108
>> #1  0x0000000000f68457 in diagnostic_initialize (context=0x18ae000 <global_diagnostic_context>, n_opts=n_opts@entry=1290) at ../../gcc/diagnostic.c:135
>> #2  0x000000000100050e in general_init (argv0=<optimized out>) at ../../gcc/toplev.c:1110
>> #3  toplev_main(int, char**) () at ../../gcc/toplev.c:1922
>> #4  0x00007ffff774cbe5 in __libc_start_main () from /lib64/libc.so.6
>> #5  0x0000000000f7898d in _start () at ../sysdeps/x86_64/start.S:122
>>
>> That is relatively early in startup process. The function seems inlined and
>> it fails only on second invocation, did not have time to investigate further,
>> yet while without -fprofile-use it starts...
>
> I'll see if I can reproduce this and investigate, although at this
> point that might have to wait until after my holiday vacation.

I tried the linkerscript with cpu2006 and got quite a lot of failures
(using the ref inputs to train, so the behavior should be the same in
both profile-gen and profile-use). I investigated the one in bzip2 and
found an issue that may not be easy to fix and is perhaps something it
is not worth fixing. The case was essentially the following: Function
foo was called by callsites A and B, with call counts 148122615 and
18, respectively.

Within function foo, there was a basic block that had a very low count
(compared to the entry bb count of 148122633), and therefore a 0
frequency:

;;   basic block 6, loop depth 0, count 18, freq 0

The ipa inliner decided to inline into callsite A but not B. Because
the vast majority of the call count was from callsite A, when we
performed execute_fixup_cfg after doing the inline transformation, the
count_scale is 0 and the out-of-line copy of foo's blocks all got
counts 0. However, most of the bbs still had non-zero frequencies. But
bb 6 ended up with a count and frequency of 0, leading us to split it
out. It turns out that at least one of the 18 counts for this block
were from callsite B, and we ended up trying to execute the split bb
in the out-of-line copy from that callsite.

I can think of a couple of ways to prevent this to happen (e.g. have
execute_fixup_cfg give the block a count or frequency of 1 instead of
0, or mark the bb somehow as not eligible for splitting due to a low
confidence in the 0 count/frequency), but they seem a little hacky. I
am thinking that the splitting here is something we shouldn't worry
about - it is so close to 0 count that the occasional jump to the
split section caused by the lack of precision in the frequency is not
a big deal. Unfortunately, without fixing this I can't use the linker
script without disabling inlining to avoid this problem.

I reran cpu2006 with the linker script but with -fno-inline and got 6
more benchmarks to pass. So there are other issues out there. I will
take a look at another one and see if it is a similar
scaling/precision issue. I'm thinking that I may just use my heatmap
scripts (which leverages perf-events profiles) to see if there is any
significant execution in the split cold sections, since it seems we
can't realistically prevent any and all execution of the cold split
sections, and that is more meaningful anyway.

Teresa

>
>>
>> On our periodic testers I see off-noise improvement in crafty 2200->2300
>> and regression on Vortex, 2900->2800, plus code size increase.
>
> I had only run cpu2006, but not cpu2000. I'll see if I can reproduce
> this as well.
>
> I have been investigating a few places where I saw accesses in the
> cold split regions in internal benchmarks. Here are a couple, and how
> I have addressed them so far:
>
> 1) loop unswitching
>
> In this case, loop unswitching hoisted a branch from within the loop
> to outside the loop, and in doing so it was effectively speculated
> above several other branches. In it's original location it always went
> to only one of the successors (biased 0/100%). But when it was hoisted
> it sometimes took the previously 0% path. This led to executing out of
> the cold region, since we didn't update the branch probability when
> hoisting. I worked around this by assigning a small non-zero
> probability after hoisting with the following change:
>
> Index: tree-ssa-loop-unswitch.c
> ===================================================================
> --- tree-ssa-loop-unswitch.c    (revision 205590)
> +++ tree-ssa-loop-unswitch.c    (working copy)
> @@ -384,6 +384,8 @@ tree_unswitch_loop (struct loop *loop,
>
>    extract_true_false_edges_from_block (unswitch_on, &edge_true, &edge_false);
>    prob_true = edge_true->probability;
> +  if (!prob_true)
> +    prob_true = REG_BR_PROB_BASE/10;
>    return loop_version (loop, unshare_expr (cond),
>                        NULL, prob_true, prob_true,
>                        REG_BR_PROB_BASE - prob_true, false);
>
> This should probably be refined (if prob_true is 100% we want to
> assign a small non-zero probability to the false path), and 10% may be
> too high (maybe give it 1%?).
>
> 2) More COMDAT issues
>
> My earlier patch handled the case where the comdat had 0 counts since
> the linker kept the copy in a different module. In that case we
> prevent the guessed frequencies from being dropped by counts_to_freq,
> and then later mark any reached via non-zero callgraph edges as
> guessed. Finally, when one such 0-count comdat is inlined the call
> count is propagated to the callee blocks using the guessed
> probabilities.
>
> However, in this case, there was a comdat that had a very small
> non-zero count, that was being inlined to a much hotter callsite. I
> believe this could happen when there was a copy that was ipa-inlined
> in the profile gen compile, so the copy in that module gets some
> non-zero counts from the ipa inlined instance, but when the out of
> line copy was eliminated by the linker (selected from a different
> module). In this case the inliner was scaling the bb counts up quite a
> lot when inlining. The problem is that you most likely can't trust
> that the 0 count bbs in such a case are really not executed by the
> callsite it is being into, since the counts are very small and
> correspond to a different callsite.
>
> The problem is how to address this. We can't simply suppress
> counts_to_freq from overwriting the guessed frequencies in this case,
> since the profile counts are non-zero and would not match the guessed
> probabilities. But we can't figure out which are called by much hotter
> callsites (compared to their entry count) until later when the
> callgraph is built, which is when we would know that we want to ignore
> the profile counts and use the guessed probabilities instead. The
> solution I came up with is to allow the profile counts to overwrite
> the guessed probabilites in counts_to_freq. But then when we inline we
> re-estimate the probabilities in the callee when the callsite count is
> much hotter than the entry count, and then follow the same procedure
> we were doing in the 0-count case (propagate the call count into the
> callee bb counts via the guessed probabilities). Is there a better
> solution?
>
> Thanks,
> Teresa
>
>
>>
>> Honza
>
>
>
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH i386] Enable -freorder-blocks-and-partition
  2014-02-11 22:21                           ` Teresa Johnson
@ 2014-02-14 18:50                             ` Teresa Johnson
  0 siblings, 0 replies; 20+ messages in thread
From: Teresa Johnson @ 2014-02-14 18:50 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Martin Liška, gcc-patches

On Tue, Feb 11, 2014 at 2:21 PM, Teresa Johnson <tejohnson@google.com> wrote:
> On Thu, Dec 19, 2013 at 10:19 PM, Teresa Johnson <tejohnson@google.com> wrote:
>> On Thu, Dec 12, 2013 at 5:13 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>>> On Wed, Dec 11, 2013 at 1:21 AM, Martin Liška <marxin.liska@gmail.com> wrote:
>>>> > Hello,
>>>> >    I prepared a collection of systemtap graphs for GIMP.
>>>> >
>>>> > 1) just my profile-based function reordering: 550 pages
>>>> > 2) just -freorder-blocks-and-partitions: 646 pages
>>>> > 3) just -fno-reorder-blocks-and-partitions: 638 pages
>>>> >
>>>> > Please see attached data.
>>>>
>>>> Thanks for the data. A few observations/questions:
>>>>
>>>> With both 1) (your (time-based?) reordering) and 2)
>>>> (-freorder-blocks-and-partitions) there are a fair amount of accesses
>>>> out of the cold section. I'm not seeing so many accesses out of the
>>>> cold section in the apps I am looking at with splitting enabled. In
>>>
>>> I see you already comitted the patch, so perhaps Martin's measurement assume
>>> the pass is off by default?
>>>
>>> I rebuilded GCC with profiledboostrap and with the linkerscript unmapping
>>> text.unlikely.  I get ICE in:
>>> (gdb) bt
>>> #0  diagnostic_set_caret_max_width(diagnostic_context*, int) () at ../../gcc/diagnostic.c:108
>>> #1  0x0000000000f68457 in diagnostic_initialize (context=0x18ae000 <global_diagnostic_context>, n_opts=n_opts@entry=1290) at ../../gcc/diagnostic.c:135
>>> #2  0x000000000100050e in general_init (argv0=<optimized out>) at ../../gcc/toplev.c:1110
>>> #3  toplev_main(int, char**) () at ../../gcc/toplev.c:1922
>>> #4  0x00007ffff774cbe5 in __libc_start_main () from /lib64/libc.so.6
>>> #5  0x0000000000f7898d in _start () at ../sysdeps/x86_64/start.S:122
>>>
>>> That is relatively early in startup process. The function seems inlined and
>>> it fails only on second invocation, did not have time to investigate further,
>>> yet while without -fprofile-use it starts...
>>
>> I'll see if I can reproduce this and investigate, although at this
>> point that might have to wait until after my holiday vacation.
>
> I tried the linkerscript with cpu2006 and got quite a lot of failures
> (using the ref inputs to train, so the behavior should be the same in
> both profile-gen and profile-use). I investigated the one in bzip2 and
> found an issue that may not be easy to fix and is perhaps something it
> is not worth fixing. The case was essentially the following: Function
> foo was called by callsites A and B, with call counts 148122615 and
> 18, respectively.
>
> Within function foo, there was a basic block that had a very low count
> (compared to the entry bb count of 148122633), and therefore a 0
> frequency:
>
> ;;   basic block 6, loop depth 0, count 18, freq 0
>
> The ipa inliner decided to inline into callsite A but not B. Because
> the vast majority of the call count was from callsite A, when we
> performed execute_fixup_cfg after doing the inline transformation, the
> count_scale is 0 and the out-of-line copy of foo's blocks all got
> counts 0. However, most of the bbs still had non-zero frequencies. But
> bb 6 ended up with a count and frequency of 0, leading us to split it
> out. It turns out that at least one of the 18 counts for this block
> were from callsite B, and we ended up trying to execute the split bb
> in the out-of-line copy from that callsite.
>
> I can think of a couple of ways to prevent this to happen (e.g. have
> execute_fixup_cfg give the block a count or frequency of 1 instead of
> 0, or mark the bb somehow as not eligible for splitting due to a low
> confidence in the 0 count/frequency), but they seem a little hacky. I
> am thinking that the splitting here is something we shouldn't worry
> about - it is so close to 0 count that the occasional jump to the
> split section caused by the lack of precision in the frequency is not
> a big deal. Unfortunately, without fixing this I can't use the linker
> script without disabling inlining to avoid this problem.
>
> I reran cpu2006 with the linker script but with -fno-inline and got 6
> more benchmarks to pass. So there are other issues out there. I will
> take a look at another one and see if it is a similar
> scaling/precision issue. I'm thinking that I may just use my heatmap
> scripts (which leverages perf-events profiles) to see if there is any
> significant execution in the split cold sections, since it seems we
> can't realistically prevent any and all execution of the cold split
> sections, and that is more meaningful anyway.

I collected perf cycle profiles for all of the split cpu2006 binaries
(with the patch I sent separately to do more COMDAT profile fixes),
and processed the results to see how many samples were in cold
functions. There were 13 benchmarks that had non-zero samples in split
cold sections, although they were very small as a total percentage of
sampled cycles in the benchmark. Here are the results, sorted in
reverse order of the percentage of total samples in the cold section:

Benchmark     Cold samples   Total samples    Percent cold samples
471.omnetpp       471           380596        .12390660966787241039
400.perlbench     313           348717        .08983823377458352946
403.gcc           245           369530        .06634442232963700123
453.povray        105           210146        .04999024000076175603
454.calculix      326           814160        .04005730898438747951
450.soplex        104           277853        .03744387918588365754
445.gobmk          81           484055        .01673643625484013604
458.sjeng          89           537853        .01655001078540028711
465.tonto           8           554507        .00144274381017819689
437.leslie3d        4           351889        .00113673501285931483
473.astar           4           436143        .00091713880207915366
464.h264ref         3           649895        .00046161516067285025
434.zeusmp          1           563594        .00017743300573286041

I looked at omnetpp and perlbench.

Omnetpp's cold samples are exclusively from a single routine,
_ZN9TOmnetApp15checkTimeLimitsEv (TOmnetApp::checkTimeLimits). I
tracked this down to the RTL expansion for the floating point
comparison "opt_simtimelimit!=0" at libs/envir/omnetapp.cc:528.
do_compare_rtx_and_jump in dojump.c calls split_comparison and
produces two jumps, and we assign the full probability to one path
through these two branches. In this case the first split jump actually
goes the other way (so we evaluate the second split jump), into the
cold section. This should be easy to fix along the same lines as the
way I fixed the TRUTH_ANDIF_EXPR/TRUTH_ORIF_EXPR expansions in
r203126. I'll make this fix and send a patch soon.

Most of perlbench's cold samples are from Perl_pp_helem. I tracked
these down to some big profile insanities introduced by jump threading
in vrp1, which in turn caused tracer to end up creating a hot path
that flowed into a 0-weight path, which was then split and executed.
If I disable -ftree-vrp, tracer no longer causes this issue. However,
with vrp off I found similar profile insanities created by jump
threading called from dom1, which also resulted in hot blocks flowing
into a 0-weight block due to a profile insanity. I already fixed some
profile insanities created by jump threading back in r203041, but
apparently there are more (or new ones). I will take a look.

Teresa

>
> Teresa
>
>>
>>>
>>> On our periodic testers I see off-noise improvement in crafty 2200->2300
>>> and regression on Vortex, 2900->2800, plus code size increase.
>>
>> I had only run cpu2006, but not cpu2000. I'll see if I can reproduce
>> this as well.
>>
>> I have been investigating a few places where I saw accesses in the
>> cold split regions in internal benchmarks. Here are a couple, and how
>> I have addressed them so far:
>>
>> 1) loop unswitching
>>
>> In this case, loop unswitching hoisted a branch from within the loop
>> to outside the loop, and in doing so it was effectively speculated
>> above several other branches. In it's original location it always went
>> to only one of the successors (biased 0/100%). But when it was hoisted
>> it sometimes took the previously 0% path. This led to executing out of
>> the cold region, since we didn't update the branch probability when
>> hoisting. I worked around this by assigning a small non-zero
>> probability after hoisting with the following change:
>>
>> Index: tree-ssa-loop-unswitch.c
>> ===================================================================
>> --- tree-ssa-loop-unswitch.c    (revision 205590)
>> +++ tree-ssa-loop-unswitch.c    (working copy)
>> @@ -384,6 +384,8 @@ tree_unswitch_loop (struct loop *loop,
>>
>>    extract_true_false_edges_from_block (unswitch_on, &edge_true, &edge_false);
>>    prob_true = edge_true->probability;
>> +  if (!prob_true)
>> +    prob_true = REG_BR_PROB_BASE/10;
>>    return loop_version (loop, unshare_expr (cond),
>>                        NULL, prob_true, prob_true,
>>                        REG_BR_PROB_BASE - prob_true, false);
>>
>> This should probably be refined (if prob_true is 100% we want to
>> assign a small non-zero probability to the false path), and 10% may be
>> too high (maybe give it 1%?).
>>
>> 2) More COMDAT issues
>>
>> My earlier patch handled the case where the comdat had 0 counts since
>> the linker kept the copy in a different module. In that case we
>> prevent the guessed frequencies from being dropped by counts_to_freq,
>> and then later mark any reached via non-zero callgraph edges as
>> guessed. Finally, when one such 0-count comdat is inlined the call
>> count is propagated to the callee blocks using the guessed
>> probabilities.
>>
>> However, in this case, there was a comdat that had a very small
>> non-zero count, that was being inlined to a much hotter callsite. I
>> believe this could happen when there was a copy that was ipa-inlined
>> in the profile gen compile, so the copy in that module gets some
>> non-zero counts from the ipa inlined instance, but when the out of
>> line copy was eliminated by the linker (selected from a different
>> module). In this case the inliner was scaling the bb counts up quite a
>> lot when inlining. The problem is that you most likely can't trust
>> that the 0 count bbs in such a case are really not executed by the
>> callsite it is being into, since the counts are very small and
>> correspond to a different callsite.
>>
>> The problem is how to address this. We can't simply suppress
>> counts_to_freq from overwriting the guessed frequencies in this case,
>> since the profile counts are non-zero and would not match the guessed
>> probabilities. But we can't figure out which are called by much hotter
>> callsites (compared to their entry count) until later when the
>> callgraph is built, which is when we would know that we want to ignore
>> the profile counts and use the guessed probabilities instead. The
>> solution I came up with is to allow the profile counts to overwrite
>> the guessed probabilites in counts_to_freq. But then when we inline we
>> re-estimate the probabilities in the callee when the callsite count is
>> much hotter than the entry count, and then follow the same procedure
>> we were doing in the 0-count case (propagate the call count into the
>> callee bb counts via the guessed probabilities). Is there a better
>> solution?
>>
>> Thanks,
>> Teresa
>>
>>
>>>
>>> Honza
>>
>>
>>
>> --
>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
>
>
>
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2014-02-14 18:50 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-11-19 15:17 [PATCH i386] Enable -freorder-blocks-and-partition Teresa Johnson
2013-11-19 16:31 ` Jan Hubicka
2013-11-19 18:23   ` Teresa Johnson
2013-11-19 19:32     ` Jeff Law
2013-11-20  1:55       ` Teresa Johnson
     [not found]         ` <CAObPJ3OZHvET=QNmNtx9ZjaHZk=GhokWjoF2njr5mgwcv2ogDA@mail.gmail.com>
2013-11-28 16:34           ` Jan Hubicka
2013-12-02 15:16             ` Teresa Johnson
2013-12-02 16:17               ` Jeff Law
2013-12-02 16:53                 ` Martin Liška
2013-12-11  9:21                   ` Martin Liška
2013-12-12  5:51                     ` Teresa Johnson
2013-12-12 20:54                       ` Jan Hubicka
2013-12-13  1:13                       ` Jan Hubicka
     [not found]                         ` <CAObPJ3MLrWsTwK-23ienktyASO9YLvAXMNKUxVm3v+KC=5JzOA@mail.gmail.com>
2013-12-15 22:19                           ` Martin Liška
2013-12-17 15:09                             ` Teresa Johnson
2013-12-20  6:19                         ` Teresa Johnson
2014-02-11 22:21                           ` Teresa Johnson
2014-02-14 18:50                             ` Teresa Johnson
2013-11-19 22:05 ` Andi Kleen
2013-11-19 22:10   ` Teresa Johnson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).