* [RFC, PATCH] LTO: IPA inline speed up for large apps (Chrome)
@ 2015-02-17 17:14 Martin Liška
2015-02-17 18:38 ` Jan Hubicka
2015-02-17 21:03 ` Jan Hubicka
0 siblings, 2 replies; 6+ messages in thread
From: Martin Liška @ 2015-02-17 17:14 UTC
To: GCC Patches; +Cc: hubicka >> Jan Hubicka
[-- Attachment #1: Type: text/plain, Size: 742 bytes --]
Hello.
After debugging the LTO build of Chrome, Honza and I noticed that the WPA phase takes quite a long time.
The following patch is an attempt to cache IPA inliner predicates that are constant during
inline_small_functions.
As you can see in the attached report, this patch can reduce the time spent in WPA by ~40%, which
is a really big improvement. The disadvantage of the solution is that the patch adds 4 new bitfields
to the cgraph_node class. We could move these flags to inline_summary, but as that struct is not
accessible from cgraph.h, the predicates could then no longer be inlined, which is crucial for them.
I welcome any ideas about the solution. I'm not sure whether it's acceptable for stage 4; that's the reason
why no ChangeLog entry is prepared.
Thanks,
Martin
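The caching scheme described in this message can be sketched roughly as follows. This is a minimal standalone illustration under stated assumptions, not the patch itself: node_sketch, its fields and the cache_enabled parameter are hypothetical stand-ins for cgraph_node and symtab->enable_inline_cache, and the predicate body is heavily simplified.

```cpp
#include <cassert>

// Hypothetical stand-in for cgraph_node; only what the example needs.
struct node_sketch
{
  bool external = false;        // plays the role of DECL_EXTERNAL (decl)
  bool address_taken = false;
  int compute_calls = 0;        // instrumentation for this example only

  // One "valid" bit and one "value" bit per cached predicate.
  unsigned can_remove_init : 1;
  unsigned can_remove : 1;

  node_sketch () : can_remove_init (0), can_remove (0) {}

  // The uncached predicate (simplified).
  bool can_remove_compute_p ()
  {
    ++compute_calls;
    if (external)
      return true;
    return !address_taken;
  }

  // Wrapper: compute once while the cache is enabled, then reuse the bit.
  bool can_remove_p (bool cache_enabled)
  {
    if (!cache_enabled)
      return can_remove_compute_p ();
    if (!can_remove_init)
      {
        can_remove = can_remove_compute_p ();
        can_remove_init = 1;
      }
    return can_remove;
  }
};
```

A cache like this is only safe while the inputs of the predicate do not change, which is why the patch switches the flag on and off around the main loop of inline_small_functions.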
[-- Attachment #2: 0001-ipa-inline-introduce-computed-value-that-speeds-up-I.patch --]
[-- Type: text/x-patch, Size: 22738 bytes --]
From 4e878a928ff7e9fe4eee0ea4b241c01c4440bd60 Mon Sep 17 00:00:00 2001
From: mliska <mliska@suse.cz>
Date: Mon, 16 Feb 2015 16:48:01 +0100
Subject: [PATCH] ipa-inline: introduce computed value that speeds up IPA
inliner.
---
gcc/cgraph.c | 77 -------------
gcc/cgraph.h | 309 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
gcc/ipa-inline.c | 2 +
gcc/lto-streamer.c | 2 +
gcc/symtab.c | 48 ++++++---
5 files changed, 345 insertions(+), 93 deletions(-)
diff --git a/gcc/cgraph.c b/gcc/cgraph.c
index 3548bd0..b72a6c0 100644
--- a/gcc/cgraph.c
+++ b/gcc/cgraph.c
@@ -2403,83 +2403,6 @@ cgraph_edge::maybe_hot_p (void)
return true;
}
-/* Worker for cgraph_can_remove_if_no_direct_calls_p. */
-
-static bool
-nonremovable_p (cgraph_node *node, void *)
-{
- return !node->can_remove_if_no_direct_calls_and_refs_p ();
-}
-
-/* Return true when function cgraph_node and its aliases can be removed from
- callgraph if all direct calls are eliminated. */
-
-bool
-cgraph_node::can_remove_if_no_direct_calls_p (void)
-{
- /* Extern inlines can always go, we will use the external definition. */
- if (DECL_EXTERNAL (decl))
- return true;
- if (address_taken)
- return false;
- return !call_for_symbol_and_aliases (nonremovable_p, NULL, true);
-}
-
-/* Return true when function cgraph_node can be expected to be removed
- from program when direct calls in this compilation unit are removed.
-
- As a special case COMDAT functions are
- cgraph_can_remove_if_no_direct_calls_p while the are not
- cgraph_only_called_directly_p (it is possible they are called from other
- unit)
-
- This function behaves as cgraph_only_called_directly_p because eliminating
- all uses of COMDAT function does not make it necessarily disappear from
- the program unless we are compiling whole program or we do LTO. In this
- case we know we win since dynamic linking will not really discard the
- linkonce section. */
-
-bool
-cgraph_node::will_be_removed_from_program_if_no_direct_calls_p (void)
-{
- gcc_assert (!global.inlined_to);
-
- if (call_for_symbol_and_aliases (used_from_object_file_p_worker,
- NULL, true))
- return false;
- if (!in_lto_p && !flag_whole_program)
- return only_called_directly_p ();
- else
- {
- if (DECL_EXTERNAL (decl))
- return true;
- return can_remove_if_no_direct_calls_p ();
- }
-}
-
-
-/* Worker for cgraph_only_called_directly_p. */
-
-static bool
-cgraph_not_only_called_directly_p_1 (cgraph_node *node, void *)
-{
- return !node->only_called_directly_or_aliased_p ();
-}
-
-/* Return true when function cgraph_node and all its aliases are only called
- directly.
- i.e. it is not externally visible, address was not taken and
- it is not used in any other non-standard way. */
-
-bool
-cgraph_node::only_called_directly_p (void)
-{
- gcc_assert (ultimate_alias_target () == this);
- return !call_for_symbol_and_aliases (cgraph_not_only_called_directly_p_1,
- NULL, true);
-}
-
-
/* Collect all callers of NODE. Worker for collect_callers_of_node. */
static bool
diff --git a/gcc/cgraph.h b/gcc/cgraph.h
index 06d2704..39cb340 100644
--- a/gcc/cgraph.h
+++ b/gcc/cgraph.h
@@ -261,17 +261,29 @@ public:
void *data,
bool include_overwrite);
+ /* Call callback on symtab node and aliases associated to this node.
+ When INCLUDE_OVERWRITABLE is false, overwritable aliases and thunks are
+ skipped. */
+ template <typename Arg, bool (*callback) (symtab_node*, Arg arg)>
+ bool call_for_symbol_and_aliases (Arg data, bool include_overwrite);
+
/* If node can not be interposable by static or dynamic linker to point to
different definition, return this symbol. Otherwise look for alias with
such property and if none exists, introduce new one. */
symtab_node *noninterposable_alias (void);
+ /* Worker searching noninterposable alias. */
+ static bool noninterposable_alias (symtab_node *node, symtab_node **data);
+
/* Return node that alias is aliasing. */
inline symtab_node *get_alias_target (void);
/* Set section for symbol and its aliases. */
void set_section (const char *section);
+ /* Worker for set_section. */
+ static bool set_section (symtab_node *n, const char *s);
+
/* Set section, do not recurse into aliases.
When one wants to change section of symbol and its aliases,
use set_section. */
@@ -523,6 +535,11 @@ protected:
bool call_for_symbol_and_aliases_1 (bool (*callback) (symtab_node *, void *),
void *data,
bool include_overwrite);
+
+ /* Worker for call_for_symbol_and_aliases. */
+ template <typename Arg, bool (*callback) (symtab_node *, Arg)>
+ bool call_for_symbol_and_aliases_1 (Arg data, bool include_overwritable);
+
private:
/* Worker for set_section. */
static bool set_section (symtab_node *n, void *s);
@@ -1042,6 +1059,13 @@ public:
void *),
void *data, bool include_overwritable);
+ /* Call callback on function and aliases associated to the function.
+ When INCLUDE_OVERWRITABLE is false, overwritable aliases and thunks are
+ skipped. */
+ template <typename Arg, bool (*callback) (cgraph_node *, Arg)>
+ bool call_for_symbol_and_aliases (Arg data, bool include_overwritable);
+
+
/* Call callback on cgraph_node, thunks and aliases associated to NODE.
When INCLUDE_OVERWRITABLE is false, overwritable aliases and thunks are
skipped. When EXCLUDE_VIRTUAL_THUNKS is true, virtual thunks are
@@ -1052,6 +1076,15 @@ public:
bool include_overwritable,
bool exclude_virtual_thunks = false);
+ /* Call callback on cgraph_node, thunks and aliases associated to NODE.
+ When INCLUDE_OVERWRITABLE is false, overwritable aliases and thunks are
+ skipped. When EXCLUDE_VIRTUAL_THUNKS is true, virtual thunks are
+ skipped. */
+ template <typename Arg, bool (*callback) (cgraph_node *, Arg)>
+ bool call_for_symbol_thunks_and_aliases (Arg data,
+ bool include_overwritable,
+ bool exclude_virtual_thunks = false);
+
/* Likewise indicate that a node is needed, i.e. reachable via some
external means. */
inline void mark_force_output (void);
@@ -1093,6 +1126,9 @@ public:
the program unless we are compiling whole program or we do LTO. In this
case we know we win since dynamic linking will not really discard the
linkonce section. */
+ bool will_be_removed_from_program_if_no_direct_calls_compute_p (void);
+
+ /* Wrapper for will_be_removed_from_program_if_no_direct_calls_compute_p. */
bool will_be_removed_from_program_if_no_direct_calls_p (void);
/* Return true when function can be removed from callgraph
@@ -1101,8 +1137,15 @@ public:
/* Return true when function cgraph_node and its aliases can be removed from
callgraph if all direct calls are eliminated. */
+ bool can_remove_if_no_direct_calls_compute_p (void);
+
+ /* Wrapper for can_remove_if_no_direct_calls_compute_p. */
bool can_remove_if_no_direct_calls_p (void);
+ /* Worker for cgraph_can_remove_if_no_direct_calls_p. */
+ static bool nonremovable_p (cgraph_node *node, void *);
+ static bool nonremovable_compute_p (cgraph_node *node, void *);
+
/* Return true when callgraph node is a function with Gimple body defined
in current unit. Functions can also be define externally or they
can be thunks with no Gimple representation.
@@ -1295,11 +1338,24 @@ public:
/* True if there was multiple COMDAT bodies merged by lto-symtab. */
unsigned merged : 1;
+ /* IPA inline cached values. */
+ unsigned inline_nonremovable_init: 1;
+ unsigned inline_can_remove_if_no_direct_calls_init: 1;
+ unsigned inline_will_be_removed_if_no_direct_calls_init: 1;
+
+ unsigned inline_nonremovable: 1;
+ unsigned inline_can_remove_if_no_direct_calls: 1;
+ unsigned inline_will_be_removed_if_no_direct_calls: 1;
+
private:
/* Worker for call_for_symbol_and_aliases. */
bool call_for_symbol_and_aliases_1 (bool (*callback) (cgraph_node *,
void *),
void *data, bool include_overwritable);
+
+ /* Worker for call_for_symbol_and_aliases. */
+ template <typename Arg, bool (*callback) (cgraph_node *, Arg)>
+ bool call_for_symbol_and_aliases_1 (Arg data, bool include_overwritable);
};
/* A cgraph node set is a collection of cgraph nodes. A cgraph node
@@ -1683,6 +1739,12 @@ public:
void *data,
bool include_overwritable);
+ /* Call calback on varpool symbol and aliases associated to varpool symbol.
+ When INCLUDE_OVERWRITABLE is false, overwritable aliases and thunks are
+ skipped. */
+ template <typename Arg, bool (*callback) (varpool_node *, Arg)>
+ bool call_for_symbol_and_aliases (Arg data, bool include_overwritable);
+
/* Return true when variable should be considered externally visible. */
bool externally_visible_p (void);
@@ -1761,6 +1823,10 @@ private:
bool call_for_symbol_and_aliases_1 (bool (*callback) (varpool_node *, void *),
void *data,
bool include_overwritable);
+
+ /* Worker for call_for_symbol_and_aliases. */
+ template <typename Arg, bool (*callback) (varpool_node*, Arg arg)>
+ bool call_for_symbol_and_aliases_1 (Arg data, bool include_overwritable);
};
/* Every top level asm statement is put into a asm_node. */
@@ -1862,7 +1928,7 @@ public:
friend class cgraph_node;
friend class cgraph_edge;
- symbol_table (): cgraph_max_summary_uid (1)
+ symbol_table (): cgraph_max_summary_uid (1), enable_inline_cache (false)
{
}
@@ -2101,6 +2167,9 @@ public:
FILE* GTY ((skip)) dump_file;
+ /* Inline cache flag. */
+ bool enable_inline_cache;
+
private:
/* Allocate new callgraph node. */
inline cgraph_node * allocate_cgraph_symbol (void);
@@ -2987,6 +3056,21 @@ symtab_node::call_for_symbol_and_aliases (bool (*callback) (symtab_node *,
return false;
}
+template <typename Arg, bool (*callback) (symtab_node *, Arg arg)>
+inline bool
+symtab_node::call_for_symbol_and_aliases (Arg data, bool include_overwritable)
+{
+ ipa_ref *ref;
+
+ if (callback (this, data))
+ return true;
+ if (iterate_direct_aliases (0, ref))
+ return call_for_symbol_and_aliases_1 <Arg, callback>
+ (data, include_overwritable);
+ return false;
+}
+
+
/* Call callback on function and aliases associated to the function.
When INCLUDE_OVERWRITABLE is false, overwritable aliases and thunks are
skipped. */
@@ -3004,6 +3088,43 @@ cgraph_node::call_for_symbol_and_aliases (bool (*callback) (cgraph_node *,
return false;
}
+/* Call callback on function and aliases associated to the function.
+ When INCLUDE_OVERWRITABLE is false, overwritable aliases and thunks are
+ skipped. */
+
+template <typename Arg, bool (*callback) (cgraph_node *, Arg arg)>
+inline bool
+cgraph_node::call_for_symbol_and_aliases (Arg data, bool include_overwritable)
+{
+ ipa_ref *ref;
+
+ if (callback (this, data))
+ return true;
+
+ if (iterate_direct_aliases (0, ref))
+ return call_for_symbol_and_aliases_1 <Arg, callback> (data, include_overwritable);
+
+ return false;
+}
+
+template <typename Arg, bool (*callback) (cgraph_node *, Arg arg)>
+inline bool
+cgraph_node::call_for_symbol_and_aliases_1 (Arg data, bool include_overwritable)
+{
+ ipa_ref *ref;
+ FOR_EACH_ALIAS (this, ref)
+ {
+ cgraph_node *alias = dyn_cast <cgraph_node *> (ref->referring);
+ if (include_overwritable
+ || alias->get_availability () > AVAIL_INTERPOSABLE)
+ if (alias->call_for_symbol_and_aliases <Arg, callback> (data, include_overwritable))
+ return true;
+ }
+
+ return false;
+}
+
+
/* Call calback on varpool symbol and aliases associated to varpool symbol.
When INCLUDE_OVERWRITABLE is false, overwritable aliases and thunks are
skipped. */
@@ -3021,6 +3142,47 @@ varpool_node::call_for_symbol_and_aliases (bool (*callback) (varpool_node *,
return false;
}
+
+/* Call calback on varpool symbol and aliases associated to varpool symbol.
+ When INCLUDE_OVERWRITABLE is false, overwritable aliases and thunks are
+ skipped. */
+
+template <typename Arg, bool (*callback) (varpool_node*, Arg arg)>
+inline bool
+varpool_node::call_for_symbol_and_aliases (Arg data, bool include_overwritable)
+{
+ ipa_ref *ref;
+
+ if (callback (this, data))
+ return true;
+ if (iterate_direct_aliases (0, ref))
+ return call_for_symbol_and_aliases_1 <Arg, callback>
+ (data, include_overwritable);
+
+ return false;
+}
+
+/* Worker for call_for_symbol_and_aliases. */
+
+template <typename Arg, bool (*callback) (varpool_node*, Arg arg)>
+bool
+varpool_node::call_for_symbol_and_aliases_1 (Arg data,
+ bool include_overwritable)
+{
+ ipa_ref *ref;
+
+ FOR_EACH_ALIAS (this, ref)
+ {
+ varpool_node *alias = dyn_cast <varpool_node *> (ref->referring);
+ if (include_overwritable
+ || alias->get_availability () > AVAIL_INTERPOSABLE)
+ if (alias->call_for_symbol_and_aliases <Arg, callback>
+ (data, include_overwritable))
+ return true;
+ }
+ return false;
+}
+
/* Build polymorphic call context for indirect call E. */
inline
@@ -3094,6 +3256,151 @@ cgraph_local_p (cgraph_node *node)
return node->local.local && node->instrumented_version->local.local;
}
+inline bool
+cgraph_node::nonremovable_compute_p (cgraph_node *node, void *)
+{
+ return !node->can_remove_if_no_direct_calls_and_refs_p ();
+}
+
+inline bool
+cgraph_node::nonremovable_p (cgraph_node *node, void *)
+{
+ bool retval;
+
+ if (symtab->enable_inline_cache)
+ {
+ if (!node->inline_nonremovable_init)
+ {
+ node->inline_nonremovable = nonremovable_compute_p (node, NULL);
+ node->inline_nonremovable_init = true;
+ }
+
+ retval = node->inline_nonremovable;
+
+ gcc_checking_assert (retval == nonremovable_compute_p (node, NULL));
+ }
+ else
+ retval = nonremovable_compute_p (node, NULL);
+
+ return retval;
+}
+
+inline bool
+cgraph_node::can_remove_if_no_direct_calls_compute_p (void)
+{
+ if (DECL_EXTERNAL (decl))
+ return true;
+ if (address_taken)
+ return false;
+
+ return !call_for_symbol_and_aliases <void *, cgraph_node::nonremovable_compute_p>
+ (NULL, true);
+}
+
+/* Return true when function cgraph_node and its aliases can be removed from
+ callgraph if all direct calls are eliminated. */
+
+inline bool
+cgraph_node::can_remove_if_no_direct_calls_p (void)
+{
+ bool retval;
+
+ if (symtab->enable_inline_cache)
+ {
+ if (!inline_can_remove_if_no_direct_calls_init)
+ {
+ inline_can_remove_if_no_direct_calls = can_remove_if_no_direct_calls_compute_p ();
+ inline_can_remove_if_no_direct_calls_init = true;
+ }
+
+ retval = inline_can_remove_if_no_direct_calls;
+
+ gcc_checking_assert
+ (retval == can_remove_if_no_direct_calls_compute_p ());
+ }
+ else
+ retval = can_remove_if_no_direct_calls_compute_p ();
+
+ return retval;
+}
+
+/* Return true when function cgraph_node can be expected to be removed
+ from program when direct calls in this compilation unit are removed.
+
+ As a special case COMDAT functions are
+ cgraph_can_remove_if_no_direct_calls_p while the are not
+ cgraph_only_called_directly_p (it is possible they are called from other
+ unit)
+
+ This function behaves as cgraph_only_called_directly_p because eliminating
+ all uses of COMDAT function does not make it necessarily disappear from
+ the program unless we are compiling whole program or we do LTO. In this
+ case we know we win since dynamic linking will not really discard the
+ linkonce section. */
+
+inline bool
+cgraph_node::will_be_removed_from_program_if_no_direct_calls_compute_p (void)
+{
+ gcc_assert (!global.inlined_to);
+
+ if (call_for_symbol_and_aliases <void *, used_from_object_file_p_worker>
+ (NULL, true))
+ return false;
+ if (!in_lto_p && !flag_whole_program)
+ return only_called_directly_p ();
+ else
+ {
+ if (DECL_EXTERNAL (decl))
+ return true;
+ return can_remove_if_no_direct_calls_p ();
+ }
+}
+
+/* Wrapper for will_be_removed_from_program_if_no_direct_calls_computed_p. */
+
+inline bool
+cgraph_node::will_be_removed_from_program_if_no_direct_calls_p (void)
+{
+ if (symtab->enable_inline_cache)
+ {
+ if (!inline_will_be_removed_if_no_direct_calls_init)
+ {
+ inline_will_be_removed_if_no_direct_calls
+ = will_be_removed_from_program_if_no_direct_calls_compute_p ();
+
+ inline_will_be_removed_if_no_direct_calls_init = true;
+ }
+
+ gcc_checking_assert (inline_will_be_removed_if_no_direct_calls ==
+ will_be_removed_from_program_if_no_direct_calls_compute_p ());
+ return inline_will_be_removed_if_no_direct_calls;
+ }
+
+ return will_be_removed_from_program_if_no_direct_calls_compute_p ();
+}
+
+/* Worker for cgraph_only_called_directly_p. */
+
+static bool
+cgraph_not_only_called_directly_p_1 (cgraph_node *node, void *)
+{
+ return !node->only_called_directly_or_aliased_p ();
+}
+
+/* Return true when function cgraph_node and all its aliases are only called
+ directly.
+ i.e. it is not externally visible, address was not taken and
+ it is not used in any other non-standard way. */
+
+inline bool
+cgraph_node::only_called_directly_p (void)
+{
+ gcc_assert (ultimate_alias_target () == this);
+ return !call_for_symbol_and_aliases (cgraph_not_only_called_directly_p_1,
+ NULL, true);
+}
+
+
/* When using fprintf (or similar), problems can arise with
transient generated strings. Many string-generation APIs
only support one result being alive at once (e.g. by
diff --git a/gcc/ipa-inline.c b/gcc/ipa-inline.c
index 287a6dd..8a07e04 100644
--- a/gcc/ipa-inline.c
+++ b/gcc/ipa-inline.c
@@ -1651,6 +1651,7 @@ inline_small_functions (void)
ipa_reduced_postorder (order, true, true, NULL);
free (order);
+ symtab->enable_inline_cache = true;
FOR_EACH_DEFINED_FUNCTION (node)
if (!node->global.inlined_to)
{
@@ -1966,6 +1967,7 @@ inline_small_functions (void)
}
}
+ symtab->enable_inline_cache = false;
free_growth_caches ();
if (dump_file)
fprintf (dump_file,
diff --git a/gcc/lto-streamer.c b/gcc/lto-streamer.c
index 836dce9..542a813 100644
--- a/gcc/lto-streamer.c
+++ b/gcc/lto-streamer.c
@@ -319,11 +319,13 @@ static hash_table<tree_hash_entry> *tree_htab;
void
lto_streamer_init (void)
{
+#ifdef ENABLE_CHECKING
/* Check that all the TS_* handled by the reader and writer routines
match exactly the structures defined in treestruct.def. When a
new TS_* astructure is added, the streamer should be updated to
handle it. */
streamer_check_handled_ts_structures ();
+#endif
#ifdef LTO_STREAMER_DEBUG
tree_htab = new hash_table<tree_hash_entry> (31);
diff --git a/gcc/symtab.c b/gcc/symtab.c
index ee47a73..df0950b 100644
--- a/gcc/symtab.c
+++ b/gcc/symtab.c
@@ -1337,9 +1337,9 @@ symtab_node::set_section_for_node (const char *section)
/* Worker for set_section. */
bool
-symtab_node::set_section (symtab_node *n, void *s)
+symtab_node::set_section (symtab_node *n, const char *s)
{
- n->set_section_for_node ((char *)s);
+ n->set_section_for_node (s);
return false;
}
@@ -1349,8 +1349,7 @@ void
symtab_node::set_section (const char *section)
{
gcc_assert (!this->alias);
- call_for_symbol_and_aliases
- (symtab_node::set_section, const_cast<char *>(section), true);
+ call_for_symbol_and_aliases <const char *, symtab_node::set_section> (section, true);
}
/* Return the initialization priority. */
@@ -1491,10 +1490,11 @@ symtab_node::resolve_alias (symtab_node *target)
{
error ("section of alias %q+D must match section of its target", decl);
}
- call_for_symbol_and_aliases (symtab_node::set_section,
- const_cast<char *>(target->get_section ()), true);
+ call_for_symbol_and_aliases <const char *, symtab_node::set_section>
+ (const_cast<char *>(target->get_section ()), true);
if (target->implicit_section)
- call_for_symbol_and_aliases (set_implicit_section, NULL, true);
+ call_for_symbol_and_aliases <void *, symtab_node::set_implicit_section>
+ (NULL, true);
/* Alias targets become redundant after alias is resolved into an reference.
We do not want to keep it around or we would have to mind updating them
@@ -1513,7 +1513,7 @@ symtab_node::resolve_alias (symtab_node *target)
/* Worker searching noninterposable alias. */
bool
-symtab_node::noninterposable_alias (symtab_node *node, void *data)
+symtab_node::noninterposable_alias (symtab_node *node, symtab_node **data)
{
if (decl_binds_to_current_def_p (node->decl))
{
@@ -1530,7 +1530,7 @@ symtab_node::noninterposable_alias (symtab_node *node, void *data)
|| DECL_ATTRIBUTES (node->decl) != DECL_ATTRIBUTES (fn->decl))
return false;
- *(symtab_node **)data = node;
+ *data = node;
return true;
}
return false;
@@ -1550,8 +1550,8 @@ symtab_node::noninterposable_alias (void)
(if that is already non-overwritable). */
symtab_node *node = ultimate_alias_target ();
gcc_assert (!node->alias && !node->weakref);
- node->call_for_symbol_and_aliases (symtab_node::noninterposable_alias,
- (void *)&new_node, true);
+ node->call_for_symbol_and_aliases
+ <symtab_node **, symtab_node::noninterposable_alias> (&new_node, true);
if (new_node)
return new_node;
#ifndef ASM_OUTPUT_DEF
@@ -1840,10 +1840,8 @@ symtab_node::equal_address_to (symtab_node *s2)
/* Worker for call_for_symbol_and_aliases. */
bool
-symtab_node::call_for_symbol_and_aliases_1 (bool (*callback) (symtab_node *,
- void *),
- void *data,
- bool include_overwritable)
+symtab_node::call_for_symbol_and_aliases_1 (bool (*callback) (symtab_node *,void *),
+ void *data, bool include_overwritable)
{
ipa_ref *ref;
FOR_EACH_ALIAS (this, ref)
@@ -1857,3 +1855,23 @@ symtab_node::call_for_symbol_and_aliases_1 (bool (*callback) (symtab_node *,
}
return false;
}
+
+/* Worker for call_for_symbol_and_aliases. */
+
+template <typename Arg, bool (*callback) (symtab_node*, Arg arg)>
+bool
+symtab_node::call_for_symbol_and_aliases_1 (Arg data,
+ bool include_overwritable)
+{
+ ipa_ref *ref;
+ FOR_EACH_ALIAS (this, ref)
+ {
+ symtab_node *alias = ref->referring;
+ if (include_overwritable
+ || alias->get_availability () > AVAIL_INTERPOSABLE)
+ if (alias->call_for_symbol_and_aliases <Arg, callback> (data,
+ include_overwritable))
+ return true;
+ }
+ return false;
+}
--
2.1.2
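The templated call_for_symbol_and_aliases overloads in the patch above take the callback as a template argument, so the data parameter keeps its real type instead of travelling through void *. A minimal standalone sketch of that idea follows; sym and set_section_worker are hypothetical names used only for illustration, and the availability filtering (include_overwritable) of the real functions is left out.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for symtab_node: a symbol plus its direct aliases.
struct sym
{
  const char *section = nullptr;
  std::vector<sym *> aliases;

  // The callback is a template argument, so its second parameter has a real
  // type and the compiler can inline the call. Returns true as soon as a
  // callback returns true, mirroring the early-exit convention in the patch.
  template <typename Arg, bool (*callback) (sym *, Arg)>
  bool call_for_symbol_and_aliases (Arg data)
  {
    if (callback (this, data))
      return true;
    for (sym *alias : aliases)
      if (alias->call_for_symbol_and_aliases<Arg, callback> (data))
        return true;
    return false;
  }
};

// Typed worker: no cast from void * is needed to reach the section name.
bool set_section_worker (sym *s, const char *name)
{
  s->section = name;
  return false;   // false means "keep walking"
}
```

A caller would write node.call_for_symbol_and_aliases<const char *, set_section_worker> (name), which has the same shape as the updated call sites in symtab.c.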
[-- Attachment #3: cover-letter-chromium.txt --]
[-- Type: text/plain, Size: 9314 bytes --]
Hello.
The following mini patch set speeds up LTO WPA; the numbers below were measured on the chromium binary:
Before:
Execution times (seconds)
phase setup : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 1977 kB ( 0%) ggc
phase opt and generate : 179.87 (66%) usr 1.67 (45%) sys 181.47 (66%) wall 2682287 kB (13%) ggc
phase stream in : 92.75 (34%) usr 2.05 (55%) sys 94.77 (34%) wall 18738391 kB (87%) ggc
callgraph optimization : 0.71 ( 0%) usr 0.00 ( 0%) sys 0.71 ( 0%) wall 16 kB ( 0%) ggc
ipa dead code removal : 5.20 ( 2%) usr 0.05 ( 1%) sys 5.26 ( 2%) wall 0 kB ( 0%) ggc
ipa virtual call target : 3.22 ( 1%) usr 0.03 ( 1%) sys 3.20 ( 1%) wall 0 kB ( 0%) ggc
ipa devirtualization : 0.28 ( 0%) usr 0.01 ( 0%) sys 0.26 ( 0%) wall 32638 kB ( 0%) ggc
ipa cp : 4.27 ( 2%) usr 0.24 ( 6%) sys 4.55 ( 2%) wall 851324 kB ( 4%) ggc
ipa inlining heuristics : 127.09 (47%) usr 0.27 ( 7%) sys 127.25 (46%) wall 807884 kB ( 4%) ggc
ipa comdats : 0.57 ( 0%) usr 0.00 ( 0%) sys 0.57 ( 0%) wall 0 kB ( 0%) ggc
ipa lto gimple in : 5.47 ( 2%) usr 0.92 (25%) sys 6.37 ( 2%) wall 1370242 kB ( 6%) ggc
ipa lto decl in : 79.23 (29%) usr 1.32 (35%) sys 80.53 (29%) wall 16957392 kB (79%) ggc
ipa lto constructors in : 0.33 ( 0%) usr 0.03 ( 1%) sys 0.44 ( 0%) wall 22897 kB ( 0%) ggc
ipa lto cgraph I/O : 1.41 ( 1%) usr 0.21 ( 6%) sys 1.62 ( 1%) wall 901987 kB ( 4%) ggc
ipa lto decl merge : 3.22 ( 1%) usr 0.00 ( 0%) sys 3.22 ( 1%) wall 16383 kB ( 0%) ggc
ipa lto cgraph merge : 5.10 ( 2%) usr 0.01 ( 0%) sys 5.11 ( 2%) wall 20432 kB ( 0%) ggc
whopr wpa : 1.95 ( 1%) usr 0.00 ( 0%) sys 1.94 ( 1%) wall 2 kB ( 0%) ggc
whopr partitioning : 5.22 ( 2%) usr 0.01 ( 0%) sys 5.23 ( 2%) wall 7800 kB ( 0%) ggc
ipa reference : 2.97 ( 1%) usr 0.06 ( 2%) sys 3.02 ( 1%) wall 0 kB ( 0%) ggc
ipa profile : 0.52 ( 0%) usr 0.04 ( 1%) sys 0.56 ( 0%) wall 0 kB ( 0%) ggc
ipa pure const : 3.51 ( 1%) usr 0.04 ( 1%) sys 3.56 ( 1%) wall 0 kB ( 0%) ggc
ipa icf : 19.33 ( 7%) usr 0.12 ( 3%) sys 19.52 ( 7%) wall 3089 kB ( 0%) ggc
tree SSA rewrite : 0.35 ( 0%) usr 0.02 ( 1%) sys 0.37 ( 0%) wall 51191 kB ( 0%) ggc
tree SSA other : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall 0 kB ( 0%) ggc
tree SSA incremental : 0.48 ( 0%) usr 0.06 ( 2%) sys 0.37 ( 0%) wall 33552 kB ( 0%) ggc
tree operand scan : 0.41 ( 0%) usr 0.08 ( 2%) sys 0.53 ( 0%) wall 343835 kB ( 2%) ggc
dominance frontiers : 0.04 ( 0%) usr 0.01 ( 0%) sys 0.03 ( 0%) wall 0 kB ( 0%) ggc
dominance computation : 0.36 ( 0%) usr 0.09 ( 2%) sys 0.55 ( 0%) wall 0 kB ( 0%) ggc
varconst : 0.03 ( 0%) usr 0.03 ( 1%) sys 0.06 ( 0%) wall 0 kB ( 0%) ggc
loop fini : 0.08 ( 0%) usr 0.00 ( 0%) sys 0.09 ( 0%) wall 0 kB ( 0%) ggc
unaccounted todo : 1.18 ( 0%) usr 0.00 ( 0%) sys 1.19 ( 0%) wall 0 kB ( 0%) ggc
TOTAL : 272.63 3.72 276.25 21422657 kB
After:
Execution times (seconds)
phase setup : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall 1977 kB ( 0%) ggc
phase opt and generate : 73.30 (43%) usr 1.79 (44%) sys 75.06 (43%) wall 2682287 kB (13%) ggc
phase stream in : 95.72 (57%) usr 2.25 (56%) sys 97.94 (57%) wall 18738391 kB (87%) ggc
callgraph optimization : 0.75 ( 0%) usr 0.00 ( 0%) sys 0.76 ( 0%) wall 16 kB ( 0%) ggc
ipa dead code removal : 5.19 ( 3%) usr 0.03 ( 1%) sys 5.25 ( 3%) wall 0 kB ( 0%) ggc
ipa virtual call target : 2.81 ( 2%) usr 0.03 ( 1%) sys 3.15 ( 2%) wall 0 kB ( 0%) ggc
ipa devirtualization : 0.29 ( 0%) usr 0.00 ( 0%) sys 0.26 ( 0%) wall 32638 kB ( 0%) ggc
ipa cp : 4.59 ( 3%) usr 0.24 ( 6%) sys 4.76 ( 3%) wall 851324 kB ( 4%) ggc
ipa inlining heuristics : 22.09 (13%) usr 0.26 ( 6%) sys 22.20 (13%) wall 807884 kB ( 4%) ggc
ipa comdats : 0.57 ( 0%) usr 0.00 ( 0%) sys 0.57 ( 0%) wall 0 kB ( 0%) ggc
ipa lto gimple in : 5.67 ( 3%) usr 0.93 (23%) sys 6.51 ( 4%) wall 1370242 kB ( 6%) ggc
ipa lto decl in : 81.86 (48%) usr 1.45 (36%) sys 83.29 (48%) wall 16957392 kB (79%) ggc
ipa lto constructors in : 0.41 ( 0%) usr 0.09 ( 2%) sys 0.36 ( 0%) wall 22897 kB ( 0%) ggc
ipa lto cgraph I/O : 1.49 ( 1%) usr 0.25 ( 6%) sys 1.73 ( 1%) wall 901987 kB ( 4%) ggc
ipa lto decl merge : 3.55 ( 2%) usr 0.00 ( 0%) sys 3.55 ( 2%) wall 16383 kB ( 0%) ggc
ipa lto cgraph merge : 5.05 ( 3%) usr 0.00 ( 0%) sys 5.07 ( 3%) wall 20432 kB ( 0%) ggc
whopr wpa : 1.88 ( 1%) usr 0.00 ( 0%) sys 1.86 ( 1%) wall 2 kB ( 0%) ggc
whopr partitioning : 4.89 ( 3%) usr 0.02 ( 0%) sys 4.90 ( 3%) wall 7800 kB ( 0%) ggc
ipa reference : 2.85 ( 2%) usr 0.05 ( 1%) sys 2.91 ( 2%) wall 0 kB ( 0%) ggc
ipa profile : 0.55 ( 0%) usr 0.04 ( 1%) sys 0.59 ( 0%) wall 0 kB ( 0%) ggc
ipa pure const : 3.28 ( 2%) usr 0.04 ( 1%) sys 3.33 ( 2%) wall 0 kB ( 0%) ggc
ipa icf : 18.23 (11%) usr 0.12 ( 3%) sys 18.29 (11%) wall 3089 kB ( 0%) ggc
tree SSA rewrite : 0.26 ( 0%) usr 0.04 ( 1%) sys 0.32 ( 0%) wall 51191 kB ( 0%) ggc
tree SSA other : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 0 kB ( 0%) ggc
tree SSA incremental : 0.51 ( 0%) usr 0.16 ( 4%) sys 0.60 ( 0%) wall 33552 kB ( 0%) ggc
tree operand scan : 0.36 ( 0%) usr 0.13 ( 3%) sys 0.49 ( 0%) wall 343835 kB ( 2%) ggc
dominance frontiers : 0.04 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall 0 kB ( 0%) ggc
dominance computation : 0.39 ( 0%) usr 0.06 ( 1%) sys 0.63 ( 0%) wall 0 kB ( 0%) ggc
varconst : 0.05 ( 0%) usr 0.04 ( 1%) sys 0.06 ( 0%) wall 0 kB ( 0%) ggc
loop fini : 0.10 ( 0%) usr 0.00 ( 0%) sys 0.12 ( 0%) wall 0 kB ( 0%) ggc
unaccounted todo : 1.26 ( 1%) usr 0.00 ( 0%) sys 1.26 ( 1%) wall 0 kB ( 0%) ggc
TOTAL : 169.02 4.04 173.00 21422657 kB
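To relate the two reports to the "~40%" figure, the relative saving can be computed from the wall-clock columns. The helper below is only an illustrative sketch; percent_reduction is a hypothetical name, and the inputs mentioned afterwards are copied from the TOTAL and "ipa inlining heuristics" rows above.

```cpp
#include <cassert>

// Percentage by which AFTER is smaller than BEFORE.
double percent_reduction (double before, double after)
{
  assert (before > 0.0);
  return (before - after) / before * 100.0;
}
```

With the total wall times (276.25 s before, 173.00 s after) this gives a reduction of roughly 37%, and for the "ipa inlining heuristics" row alone (127.25 s before, 22.20 s after) roughly 83%.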
perf report after:
10.17% lto1-wpa lto1 [.] inflate_fast
3.74% lto1-wpa lto1 [.] compare_tree_sccs_1(tree_node*, tree_node*, tree_node***)
3.56% lto1-wpa lto1 [.] streamer_read_uhwi(lto_input_block*)
3.16% lto1-wpa lto1 [.] ht_lookup_with_hash(ht*, unsigned char const*, unsigned long, unsigned int, ht_lookup_option)
3.01% lto1-wpa lto1 [.] unify_scc(streamer_tree_cache_d*, unsigned int, unsigned int, unsigned int, unsigned int)
2.69% lto1-wpa lto1 [.] streamer_read_tree_bitfields(lto_input_block*, data_in*, tree_node*)
2.16% lto1-wpa lto1 [.] lto_cgraph_replace_node(cgraph_node*, cgraph_node*)
2.00% lto1-wpa lto1 [.] streamer_get_pickled_tree(lto_input_block*, data_in*)
2.00% lto1-wpa libc-2.19.so [.] msort_with_tmp.part.0
1.91% lto1-wpa lto1 [.] ipa_icf::sem_variable::equals(tree_node*, tree_node*)
1.72% lto1-wpa libc-2.19.so [.] _int_malloc
1.70% lto1-wpa lto1 [.] symbol_table::remove_unreachable_nodes(_IO_FILE*)
1.54% lto1-wpa lto1 [.] lto_input_tree_1(lto_input_block*, data_in*, LTO_tags, unsigned int)
1.33% lto1-wpa lto1 [.] inflate
1.21% lto1-wpa lto1 [.] adler32
1.16% lto1-wpa lto1 [.] cgraph_node::call_for_symbol_thunks_and_aliases(bool (*)(cgraph_node*, void*), void*, bool, bool)
1.11% lto1-wpa lto1 [.] lto_input_tree(lto_input_block*, data_in*)
1.07% lto1-wpa lto1 [.] streamer_read_tree_body(lto_input_block*, data_in*, tree_node*)
1.03% lto1-wpa lto1 [.] lto_input_location(bitpack_d*, data_in*)
1.01% lto1-wpa lto1 [.] htab_hash_string
0.99% lto1-wpa lto1 [.] estimate_calls_size_and_time(cgraph_node*, int*, int*, int*, int*, unsigned int, vec<tree_node*, va_heap, vl_ptr>, vec<ipa_polymorphic_call_context, va_heap, vl_ptr>, vec<ipa_agg_jump_function*, va_heap, vl_ptr>) [clone .isra.137]
0.92% lto1-wpa lto1 [.] ht_lookup(ht*, unsigned char const*, unsigned long, ht_lookup_option)
0.92% lto1-wpa lto1 [.] ggc_internal_alloc(unsigned long, void (*)(void*), unsigned long, unsigned long)
0.86% lto1-wpa lto1 [.] splay_tree_splay
0.83% lto1-wpa lto1 [.] bp_unpack_var_len_unsigned(bitpack_d*)
0.80% lto1-wpa libc-2.19.so [.] malloc_consolidate
0.77% lto1-wpa lto1 [.] can_inline_edge_p(cgraph_edge*, bool, bool)
0.72% lto1-wpa lto1 [.] gimple_has_body_p(tree_node*)
Thanks,
Martin
* Re: [RFC, PATCH] LTO: IPA inline speed up for large apps (Chrome)
2015-02-17 17:14 [RFC, PATCH] LTO: IPA inline speed up for large apps (Chrome) Martin Liška
@ 2015-02-17 18:38 ` Jan Hubicka
2015-02-18 10:28 ` Martin Liška
2015-02-17 21:03 ` Jan Hubicka
1 sibling, 1 reply; 6+ messages in thread
From: Jan Hubicka @ 2015-02-17 18:38 UTC
To: Martin Liška; +Cc: GCC Patches, hubicka >> Jan Hubicka
Hi,
thanks for working on it. There are basically 3 independent changes in the patch:
- The patch to make checking in lto_streamer_init ENABLE_CHECKING only, which I
think can be committed as obvious.
- Templates for call_for_symbol_and_aliases
I do not think these should be strictly necessary for performance, because once we
spend too much time in these we are a bit screwed.
I however see it also makes things a bit nicer by not needing typecasts on the data pointer.
Perhaps that could be cleaned up further?
An alternative would be to implement the FOR_EACH_ALIAS macro with a tree-walking iterator.
You have all the structure to not require a stack. The iterator will contain a
root node, current node and index to ref.
This may be even easier to use and will probably wind up generating about the same code,
given that the for-each template needs to produce a self-recursive function anyway.
I would not care about for_symbol_thunk_and_aliases. That function is heavy because it walks
all callers anyway, and it should not be used in hot code.
I have a patch that removes its use from the inliner - it is more or less a leftover from the time
we represented thunks as special aliases instead of functions w/o gimple body.
- The caching itself.
I will look into the caching in detail. I am not quite sure I like the idea of exposing an inline-only
cache in cgraph.h. You could just keep the predicates as they are, but have inline_ variants
in ipa-inline.h that do the caching for you.
Allocating the bits directly in cgraph_node is probably OK, we don't really have a shortage there
and it can be revisited easily later...
Honza
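For illustration only, here is a minimal sketch of the stack-free walk suggested above. It does not use GCC's cgraph_node or ipa_ref; Node, add_alias, alias_iterator and collect_names are hypothetical names, and the sketch assumes each alias records its target and its position in the target's alias list, so the walk can climb back up without a stack.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a symbol node (not GCC's cgraph_node).
struct Node
{
  std::string name;
  Node *alias_of = nullptr;         // node this one aliases, or null
  std::size_t index_in_parent = 0;  // position in alias_of->aliases
  std::vector<Node *> aliases;      // aliases referring to this node
};

// Register ALIAS as an alias of TARGET.
void add_alias (Node &target, Node &alias)
{
  alias.alias_of = &target;
  alias.index_in_parent = target.aliases.size ();
  target.aliases.push_back (&alias);
}

// Pre-order walk over a root node and its transitive aliases that
// needs neither recursion nor an explicit stack.
class alias_iterator
{
public:
  explicit alias_iterator (Node *root) : m_root (root), m_current (root) {}

  bool done () const { return m_current == nullptr; }
  Node *current () const { return m_current; }

  void next ()
  {
    assert (!done ());
    // Descend to the first alias of the current node, if there is one.
    if (!m_current->aliases.empty ())
      {
        m_current = m_current->aliases[0];
        return;
      }
    // Otherwise climb until some ancestor has an unvisited sibling.
    while (m_current != m_root)
      {
        Node *parent = m_current->alias_of;
        std::size_t next_index = m_current->index_in_parent + 1;
        if (next_index < parent->aliases.size ())
          {
            m_current = parent->aliases[next_index];
            return;
          }
        m_current = parent;
      }
    m_current = nullptr;  // climbed back to the root: walk finished
  }

private:
  Node *m_root;
  Node *m_current;
};

// Example use: collect the names visited, in walk order.
std::vector<std::string> collect_names (Node *root)
{
  std::vector<std::string> names;
  for (alias_iterator it (root); !it.done (); it.next ())
    names.push_back (it.current ()->name);
  return names;
}
```

A FOR_EACH_ALIAS-style macro could wrap the for loop shown in collect_names; whether the real reference lists make the "next sibling" step equally cheap is an assumption of this sketch.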
* Re: [RFC, PATCH] LTO: IPA inline speed up for large apps (Chrome)
2015-02-17 18:38 ` Jan Hubicka
@ 2015-02-18 10:28 ` Martin Liška
0 siblings, 0 replies; 6+ messages in thread
From: Martin Liška @ 2015-02-18 10:28 UTC (permalink / raw)
To: Jan Hubicka, GCC Patches
[-- Attachment #1: Type: text/plain, Size: 2161 bytes --]
On 02/17/2015 07:38 PM, Jan Hubicka wrote:
> Hi,
> thanks for working on it. There are basically 3 independent changes in the patch:
> - The patch to make checking in lto_streamer_init ENABLE_CHECKING only, which I
> think can be committed as obvious.
Hello.
The attached patch contains the fix for that; I am going to install it.
> - Templates for call_for_symbol_and_aliases
> I do not think these should be strictly necessary for performance, because once we
> spend too much time in these we are a bit screwed.
> I however see that it also makes things a bit nicer by not needing typecasts on the data pointer.
> Perhaps that could be cleaned up further?
>
> An alternative would be to implement the FOR_EACH_ALIAS macro with a tree-walking iterator.
> You have all the structure needed to avoid a stack. The iterator will contain a
> root node, the current node and an index to the ref.
> This may be even easier to use and will probably wind up generating about the same code,
> given that the for-each template needs to produce a self-recursive function anyway.
>
> I would not care about for_symbol_thunk_and_aliases. That function is heavy because it walks
> all callers anyway and should not be used in hot code.
> I have a patch that removes its use from the inliner - it is more or less a leftover from the time
> when we represented thunks as special aliases instead of functions w/o gimple body.
Yes, I was also thinking about a flat iterator capable of iterating over thunks/aliases, and
I prefer that approach to recursive functions. I think we can prepare it for the next release
since, as you said, it does not bring much of a performance gain.
> - The caching itself.
>
> I will look into the caching in detail. I am not quite sure I like the idea of exposing an inline-only
> cache in cgraph.h. You could just keep the predicates as they are, but have inline_ variants
> in ipa-inline.h that do the caching for you.
>
> Allocating the bits directly in cgraph_node is probably OK; we don't really have a shortage there
> and it can be revisited easily later...
>
> Honza
>
Please take a look at the caching; it is the crucial part of the speed improvement.
Martin
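To make the inline_ wrapper idea quoted above concrete, here is a rough sketch of a cached predicate that spends two bits per node. It is not the code from the patch: Node, can_remove_uncached, inline_can_remove_p and invalidate_inline_cache are hypothetical names, and the sketch assumes the predicate's inputs stay constant while the cache is live (as the original posting says they do during inline_small_functions).

```cpp
#include <cassert>

// Hypothetical, heavily simplified node; the two cache bits play the
// role of the new bitfields discussed in the thread.
struct Node
{
  int callers = 0;
  bool address_taken = false;
  unsigned removable_cached : 1;  // has the predicate been computed?
  unsigned removable_value : 1;   // cached result, valid if computed

  Node () : removable_cached (0), removable_value (0) {}
};

// Counts how often the expensive path runs (for demonstration only).
static int expensive_calls = 0;

// Stand-in for the original, uncached predicate.
bool can_remove_uncached (const Node &n)
{
  ++expensive_calls;
  return n.callers == 0 && !n.address_taken;
}

// Caching wrapper: compute once, then answer from the cached bit.
bool inline_can_remove_p (Node &n)
{
  if (!n.removable_cached)
    {
      n.removable_value = can_remove_uncached (n);
      n.removable_cached = 1;
    }
  return n.removable_value;
}

// Must be called whenever the inputs of the predicate change.
void invalidate_inline_cache (Node &n)
{
  n.removable_cached = 0;
}
```

The trade-off is the usual one for caches: the wrapper is only correct as long as every change to the node's inputs is followed by an invalidation, which is why restricting it to the inliner is attractive.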
[-- Attachment #2: 0001-Add-checking-macro-within-lto_streamer_init.patch --]
[-- Type: text/x-patch, Size: 1076 bytes --]
From eb9d34244c43ae1d0576b2ae1002f5267c6cd547 Mon Sep 17 00:00:00 2001
From: mliska <mliska@suse.cz>
Date: Wed, 18 Feb 2015 11:18:47 +0100
Subject: [PATCH] Add checking macro within lto_streamer_init.
gcc/ChangeLog:
2015-02-18 Martin Liska <mliska@suse.cz>
* lto-streamer.c (lto_streamer_init): Encapsulate
streamer_check_handled_ts_structures with checking macro.
---
gcc/lto-streamer.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/gcc/lto-streamer.c b/gcc/lto-streamer.c
index 836dce9..542a813 100644
--- a/gcc/lto-streamer.c
+++ b/gcc/lto-streamer.c
@@ -319,11 +319,13 @@ static hash_table<tree_hash_entry> *tree_htab;
void
lto_streamer_init (void)
{
+#ifdef ENABLE_CHECKING
/* Check that all the TS_* handled by the reader and writer routines
match exactly the structures defined in treestruct.def. When a
new TS_* astructure is added, the streamer should be updated to
handle it. */
streamer_check_handled_ts_structures ();
+#endif
#ifdef LTO_STREAMER_DEBUG
tree_htab = new hash_table<tree_hash_entry> (31);
--
2.1.2
* Re: [RFC, PATCH] LTO: IPA inline speed up for large apps (Chrome)
2015-02-17 17:14 [RFC, PATCH] LTO: IPA inline speed up for large apps (Chrome) Martin Liška
2015-02-17 18:38 ` Jan Hubicka
@ 2015-02-17 21:03 ` Jan Hubicka
2015-02-18 13:58 ` Martin Liška
1 sibling, 1 reply; 6+ messages in thread
From: Jan Hubicka @ 2015-02-17 21:03 UTC (permalink / raw)
To: Martin Liška; +Cc: GCC Patches, hubicka >> Jan Hubicka
Hi,
this patch should chase away the expensive thunk and alias walks from most
of the analysis code. I think the only real use left is the local_p predicate, which needs to
stay because i386 expects the local flag to match between caller and callee when
expanding an assembler thunk. I at least optimized it by first making the walk
conditional for nonlocal functions only and then reorganizing
call_for_symbol_thunks_and_aliases to first inspect aliases (that is cheap) and
only then work on thunks. Most likely this will find the non-local thunk/alias
faster. The other cases were leftovers from the conversion of thunks from aliases
to functions.
I also noticed a bug in ipa-profile, which does not disable all the
transforms with !ipa_profile_flag used on OPTIMIZATION_NODE, and fixed it.
Bootstrapped/regtested x86_64-linux, committed. I would be interested to
know if call_for_symbol_thunks_and_aliases is now off your oprofiles
(sorry, easier to type than perf-profiles)
Honza
* ipa-visibility.c (function_and_variable_visibility): Only
check locality if node is not already local.
* ipa-inline.c (want_inline_function_to_all_callers_p): Use
call_for_symbol_and_aliases instead of
call_for_symbol_thunks_and_aliases.
(ipa_inline): Likewise.
* cgraph.c (cgraph_node::call_for_symbol_thunks_and_aliases):
First walk aliases.
* ipa.c (symbol_table::remove_unreachable_nodes): Use
call_for_symbol_and_aliases.
* ipa-profile.c (ipa_propagate_frequency_data): Add function_symbol.
(ipa_propagate_frequency_1): Use it; use opt_for_fn.
(ipa_propagate_frequency): Update.
(ipa_profile): Add opt_for_fn guards.
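As an illustration of why walking aliases before thunks helps, the sketch below mimics the reordered walk on a toy graph. It is not the GCC function: Node, callers_scanned and walk_thunks_and_aliases are hypothetical, and the sketch assumes (as the message says) that aliases are available in a short list while thunks can only be found by scanning every caller.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical simplified node; thunks are ordinary nodes reached
// through the caller list, aliases have their own short list.
struct Node
{
  bool local = true;
  bool is_thunk = false;
  std::vector<Node *> aliases;
  std::vector<Node *> callers;
};

// Counts caller edges inspected (for demonstration only).
static int callers_scanned = 0;

// Return true as soon as CALLBACK returns true for NODE, one of its
// aliases, or one of its thunks.  Aliases are checked first because
// that walk is cheap; the caller scan runs only if they all fail.
bool walk_thunks_and_aliases (Node *node,
                              const std::function<bool (Node *)> &callback)
{
  assert (node != nullptr);
  if (callback (node))
    return true;
  for (Node *alias : node->aliases)
    if (walk_thunks_and_aliases (alias, callback))
      return true;
  for (Node *caller : node->callers)
    {
      ++callers_scanned;
      if (caller->is_thunk && walk_thunks_and_aliases (caller, callback))
        return true;
    }
  return false;
}
```

With a non-local alias present, the walk returns before touching a single caller edge; with the old order it would have scanned all callers first.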
Index: ipa-visibility.c
===================================================================
--- ipa-visibility.c (revision 220741)
+++ ipa-visibility.c (working copy)
@@ -595,7 +595,8 @@ function_and_variable_visibility (bool w
}
FOR_EACH_DEFINED_FUNCTION (node)
{
- node->local.local |= node->local_p ();
+ if (!node->local.local)
+ node->local.local |= node->local_p ();
/* If we know that function can not be overwritten by a different semantics
and moreover its section can not be discarded, replace all direct calls
Index: ipa-inline.c
===================================================================
--- ipa-inline.c (revision 220741)
+++ ipa-inline.c (working copy)
@@ -975,14 +975,14 @@ want_inline_function_to_all_callers_p (s
if (node->global.inlined_to)
return false;
/* Does it have callers? */
- if (!node->call_for_symbol_thunks_and_aliases (has_caller_p, NULL, true))
+ if (!node->call_for_symbol_and_aliases (has_caller_p, NULL, true))
return false;
/* Inlining into all callers would increase size? */
if (estimate_growth (node) > 0)
return false;
/* All inlines must be possible. */
- if (node->call_for_symbol_thunks_and_aliases (check_callers, &has_hot_call,
- true))
+ if (node->call_for_symbol_and_aliases (check_callers, &has_hot_call,
+ true))
return false;
if (!cold && !has_hot_call)
return false;
@@ -2359,9 +2359,9 @@ ipa_inline (void)
if (want_inline_function_to_all_callers_p (node, cold))
{
int num_calls = 0;
- node->call_for_symbol_thunks_and_aliases (sum_callers, &num_calls,
- true);
- while (node->call_for_symbol_thunks_and_aliases
+ node->call_for_symbol_and_aliases (sum_callers, &num_calls,
+ true);
+ while (node->call_for_symbol_and_aliases
(inline_to_all_callers, &num_calls, true))
;
remove_functions = true;
Index: cgraph.c
===================================================================
--- cgraph.c (revision 220741)
+++ cgraph.c (working copy)
@@ -2191,6 +2191,16 @@ cgraph_node::call_for_symbol_thunks_and_
if (callback (this, data))
return true;
+ FOR_EACH_ALIAS (this, ref)
+ {
+ cgraph_node *alias = dyn_cast <cgraph_node *> (ref->referring);
+ if (include_overwritable
+ || alias->get_availability () > AVAIL_INTERPOSABLE)
+ if (alias->call_for_symbol_thunks_and_aliases (callback, data,
+ include_overwritable,
+ exclude_virtual_thunks))
+ return true;
+ }
for (e = callers; e; e = e->next_caller)
if (e->caller->thunk.thunk_p
&& (include_overwritable
@@ -2202,16 +2212,6 @@ cgraph_node::call_for_symbol_thunks_and_
exclude_virtual_thunks))
return true;
- FOR_EACH_ALIAS (this, ref)
- {
- cgraph_node *alias = dyn_cast <cgraph_node *> (ref->referring);
- if (include_overwritable
- || alias->get_availability () > AVAIL_INTERPOSABLE)
- if (alias->call_for_symbol_thunks_and_aliases (callback, data,
- include_overwritable,
- exclude_virtual_thunks))
- return true;
- }
return false;
}
Index: ipa.c
===================================================================
--- ipa.c (revision 220741)
+++ ipa.c (working copy)
@@ -661,7 +661,7 @@ symbol_table::remove_unreachable_nodes (
if (node->address_taken
&& !node->used_from_other_partition)
{
- if (!node->call_for_symbol_thunks_and_aliases
+ if (!node->call_for_symbol_and_aliases
(has_addr_references_p, NULL, true)
&& (!node->instrumentation_clone
|| !node->instrumented_version
Index: ipa-profile.c
===================================================================
--- ipa-profile.c (revision 220741)
+++ ipa-profile.c (working copy)
@@ -322,6 +322,7 @@ ipa_profile_read_summary (void)
struct ipa_propagate_frequency_data
{
+ cgraph_node *function_symbol;
bool maybe_unlikely_executed;
bool maybe_executed_once;
bool only_called_at_startup;
@@ -342,7 +343,7 @@ ipa_propagate_frequency_1 (struct cgraph
|| d->only_called_at_startup || d->only_called_at_exit);
edge = edge->next_caller)
{
- if (edge->caller != node)
+ if (edge->caller != d->function_symbol)
{
d->only_called_at_startup &= edge->caller->only_called_at_startup;
/* It makes sense to put main() together with the static constructors.
@@ -358,7 +359,11 @@ ipa_propagate_frequency_1 (struct cgraph
errors can make us to push function into unlikely section even when
it is executed by the train run. Transfer the function only if all
callers are unlikely executed. */
- if (profile_info && flag_branch_probabilities
+ if (profile_info
+ && opt_for_fn (d->function_symbol->decl, flag_branch_probabilities)
+ /* Thunks are not profiled. This is more or less implementation
+ bug. */
+ && !d->function_symbol->thunk.thunk_p
&& (edge->caller->frequency != NODE_FREQUENCY_UNLIKELY_EXECUTED
|| (edge->caller->global.inlined_to
&& edge->caller->global.inlined_to->frequency
@@ -418,7 +423,7 @@ contains_hot_call_p (struct cgraph_node
bool
ipa_propagate_frequency (struct cgraph_node *node)
{
- struct ipa_propagate_frequency_data d = {true, true, true, true};
+ struct ipa_propagate_frequency_data d = {node, true, true, true, true};
bool changed = false;
/* We can not propagate anything useful about externally visible functions
@@ -432,8 +437,8 @@ ipa_propagate_frequency (struct cgraph_n
if (dump_file && (dump_flags & TDF_DETAILS))
fprintf (dump_file, "Processing frequency %s\n", node->name ());
- node->call_for_symbol_thunks_and_aliases (ipa_propagate_frequency_1, &d,
- true);
+ node->call_for_symbol_and_aliases (ipa_propagate_frequency_1, &d,
+ true);
if ((d.only_called_at_startup && !d.only_called_at_exit)
&& !node->only_called_at_startup)
@@ -597,6 +602,9 @@ ipa_profile (void)
{
bool update = false;
+ if (!opt_for_fn (n->decl, flag_ipa_profile))
+ continue;
+
for (e = n->indirect_calls; e; e = e->next_callee)
{
if (n->count)
@@ -697,7 +705,9 @@ ipa_profile (void)
order_pos = ipa_reverse_postorder (order);
for (i = order_pos - 1; i >= 0; i--)
{
- if (order[i]->local.local && ipa_propagate_frequency (order[i]))
+ if (order[i]->local.local
+ && opt_for_fn (order[i]->decl, flag_ipa_profile)
+ && ipa_propagate_frequency (order[i]))
{
for (e = order[i]->callees; e; e = e->next_callee)
if (e->callee->local.local && !e->callee->aux)
@@ -714,7 +724,9 @@ ipa_profile (void)
something_changed = false;
for (i = order_pos - 1; i >= 0; i--)
{
- if (order[i]->aux && ipa_propagate_frequency (order[i]))
+ if (order[i]->aux
+ && opt_for_fn (order[i]->decl, flag_ipa_profile)
+ && ipa_propagate_frequency (order[i]))
{
for (e = order[i]->callees; e; e = e->next_callee)
if (e->callee->local.local && !e->callee->aux)
* Re: [RFC, PATCH] LTO: IPA inline speed up for large apps (Chrome)
2015-02-17 21:03 ` Jan Hubicka
@ 2015-02-18 13:58 ` Martin Liška
2015-02-18 14:13 ` Martin Liška
0 siblings, 1 reply; 6+ messages in thread
From: Martin Liška @ 2015-02-18 13:58 UTC (permalink / raw)
To: Jan Hubicka; +Cc: GCC Patches
[-- Attachment #1: Type: text/plain, Size: 9240 bytes --]
Hi.
Attached are the perf report and the -ftime-report output for the WPA phase.
Martin
[-- Attachment #2: chrome-latest.profile.txt --]
[-- Type: text/plain, Size: 6466 bytes --]
Execution times (seconds)
phase setup : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall 1977 kB ( 0%) ggc
phase opt and generate : 171.18 (65%) usr 2.29 (47%) sys 173.40 (65%) wall 2682609 kB (13%) ggc
phase stream in : 92.09 (35%) usr 2.55 (53%) sys 94.61 (35%) wall 18738048 kB (87%) ggc
callgraph optimization : 0.72 ( 0%) usr 0.00 ( 0%) sys 0.73 ( 0%) wall 16 kB ( 0%) ggc
ipa dead code removal : 5.12 ( 2%) usr 0.05 ( 1%) sys 5.07 ( 2%) wall 0 kB ( 0%) ggc
ipa virtual call target : 2.93 ( 1%) usr 0.03 ( 1%) sys 3.02 ( 1%) wall 0 kB ( 0%) ggc
ipa devirtualization : 0.26 ( 0%) usr 0.01 ( 0%) sys 0.34 ( 0%) wall 32646 kB ( 0%) ggc
ipa cp : 4.29 ( 2%) usr 0.48 (10%) sys 4.86 ( 2%) wall 851380 kB ( 4%) ggc
ipa inlining heuristics : 122.37 (46%) usr 0.42 ( 9%) sys 122.72 (46%) wall 807997 kB ( 4%) ggc
ipa comdats : 0.53 ( 0%) usr 0.00 ( 0%) sys 0.53 ( 0%) wall 0 kB ( 0%) ggc
ipa lto gimple in : 5.16 ( 2%) usr 1.09 (23%) sys 6.64 ( 2%) wall 1370302 kB ( 6%) ggc
ipa lto decl in : 79.11 (30%) usr 1.58 (33%) sys 80.64 (30%) wall 16957092 kB (79%) ggc
ipa lto constructors in : 0.37 ( 0%) usr 0.06 ( 1%) sys 0.37 ( 0%) wall 22897 kB ( 0%) ggc
ipa lto cgraph I/O : 1.44 ( 1%) usr 0.24 ( 5%) sys 1.69 ( 1%) wall 901960 kB ( 4%) ggc
ipa lto decl merge : 3.27 ( 1%) usr 0.01 ( 0%) sys 3.26 ( 1%) wall 16383 kB ( 0%) ggc
ipa lto cgraph merge : 4.63 ( 2%) usr 0.04 ( 1%) sys 4.68 ( 2%) wall 20432 kB ( 0%) ggc
whopr wpa : 1.70 ( 1%) usr 0.00 ( 0%) sys 1.71 ( 1%) wall 2 kB ( 0%) ggc
whopr partitioning : 4.72 ( 2%) usr 0.02 ( 0%) sys 4.73 ( 2%) wall 7796 kB ( 0%) ggc
ipa reference : 2.70 ( 1%) usr 0.10 ( 2%) sys 2.80 ( 1%) wall 0 kB ( 0%) ggc
ipa profile : 0.53 ( 0%) usr 0.03 ( 1%) sys 0.58 ( 0%) wall 0 kB ( 0%) ggc
ipa pure const : 3.13 ( 1%) usr 0.09 ( 2%) sys 3.21 ( 1%) wall 0 kB ( 0%) ggc
ipa icf : 16.96 ( 6%) usr 0.17 ( 4%) sys 17.06 ( 6%) wall 3087 kB ( 0%) ggc
inline parameters : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall 0 kB ( 0%) ggc
tree SSA rewrite : 0.39 ( 0%) usr 0.05 ( 1%) sys 0.27 ( 0%) wall 51205 kB ( 0%) ggc
tree SSA other : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 0 kB ( 0%) ggc
tree SSA incremental : 0.50 ( 0%) usr 0.08 ( 2%) sys 0.50 ( 0%) wall 33556 kB ( 0%) ggc
tree operand scan : 0.45 ( 0%) usr 0.11 ( 2%) sys 0.47 ( 0%) wall 343892 kB ( 2%) ggc
dominance frontiers : 0.05 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall 0 kB ( 0%) ggc
dominance computation : 0.51 ( 0%) usr 0.08 ( 2%) sys 0.58 ( 0%) wall 0 kB ( 0%) ggc
varconst : 0.02 ( 0%) usr 0.06 ( 1%) sys 0.05 ( 0%) wall 0 kB ( 0%) ggc
loop fini : 0.12 ( 0%) usr 0.00 ( 0%) sys 0.13 ( 0%) wall 0 kB ( 0%) ggc
unaccounted todo : 1.19 ( 0%) usr 0.00 ( 0%) sys 1.15 ( 0%) wall 0 kB ( 0%) ggc
TOTAL : 263.27 4.84 268.01 21422636 kB
[ perf record: Woken up 254 times to write data ]
[ perf record: Captured and wrote 63.481 MB perf.data (~2773530 samples) ]
marxin@marxinbox:~/Programming/chromium/src/out/Release> perf report --stdio | sed 's/\ *$//' | head -n50
# To display the perf.data header info, please use --header/--header-only options.
#
# Samples: 1M of event 'cycles'
# Event count (approx.): 945739511218
#
# Overhead Command Shared Object Symbol
# ........ ........ ................. ......
#
19.88% lto1-wpa lto1 [.] nonremovable_p(cgraph_node*, void*)
9.17% lto1-wpa lto1 [.] cgraph_node::used_from_object_file_p_worker(cgraph_node*, void*)
7.93% lto1-wpa lto1 [.] cgraph_node::call_for_symbol_and_aliases_1(bool (*)(cgraph_node*, void*), void*, bool)
6.37% lto1-wpa lto1 [.] inflate_fast
2.23% lto1-wpa lto1 [.] compare_tree_sccs_1(tree_node*, tree_node*, tree_node***)
2.14% lto1-wpa lto1 [.] streamer_read_uhwi(lto_input_block*)
1.96% lto1-wpa lto1 [.] ht_lookup_with_hash(ht*, unsigned char const*, unsigned long, unsigned int, ht_lookup_option)
1.83% lto1-wpa lto1 [.] unify_scc(streamer_tree_cache_d*, unsigned int, unsigned int, unsigned int, unsigned int)
1.61% lto1-wpa lto1 [.] streamer_read_tree_bitfields(lto_input_block*, data_in*, tree_node*)
1.23% lto1-wpa lto1 [.] lto_cgraph_replace_node(cgraph_node*, cgraph_node*)
1.21% lto1-wpa libc-2.19.so [.] msort_with_tmp.part.0
1.19% lto1-wpa lto1 [.] streamer_get_pickled_tree(lto_input_block*, data_in*)
1.14% lto1-wpa lto1 [.] symbol_table::remove_unreachable_nodes(_IO_FILE*)
1.08% lto1-wpa libc-2.19.so [.] _int_malloc
1.02% lto1-wpa lto1 [.] ipa_icf::sem_variable::equals(tree_node*, tree_node*)
0.96% lto1-wpa lto1 [.] lto_input_tree_1(lto_input_block*, data_in*, LTO_tags, unsigned int)
0.84% lto1-wpa lto1 [.] inflate
0.74% lto1-wpa lto1 [.] adler32
0.71% lto1-wpa lto1 [.] lto_input_tree(lto_input_block*, data_in*)
0.68% lto1-wpa lto1 [.] cgraph_node::call_for_symbol_thunks_and_aliases(bool (*)(cgraph_node*, void*), void*, bool, bool)
0.66% lto1-wpa lto1 [.] streamer_read_tree_body(lto_input_block*, data_in*, tree_node*)
0.64% lto1-wpa lto1 [.] estimate_calls_size_and_time(cgraph_node*, int*, int*, int*, int*, unsigned int, vec<tree_node*, va_heap, vl_ptr>, vec<ipa_polymorphic_call_context, va_heap, vl_ptr>, vec<ipa_agg_jump_function*, va_heap, vl_ptr>) [clone .isra.129]
0.63% lto1-wpa lto1 [.] lto_input_location(bitpack_d*, data_in*)
* Re: [RFC, PATCH] LTO: IPA inline speed up for large apps (Chrome)
2015-02-18 13:58 ` Martin Liška
@ 2015-02-18 14:13 ` Martin Liška
0 siblings, 0 replies; 6+ messages in thread
From: Martin Liška @ 2015-02-18 14:13 UTC (permalink / raw)
To: gcc-patches
[-- Attachment #1: Type: text/plain, Size: 10096 bytes --]
On 02/18/2015 02:58 PM, Martin Liška wrote:
> On 02/17/2015 10:03 PM, Jan Hubicka wrote:
>> Hi,
>> this patch should chase away the expensive thunk and alias walks from most
>> of the analysis code. I think the only real use left is the local_p predicate, which needs to
>> stay because i386 expects the local flag to match between caller and callee when
>> expanding an assembler thunk. I at least optimized it by first making the walk
>> conditional for nonlocal functions only and then reorganizing
>> call_for_symbol_thunks_and_aliases to first inspect aliases (that is cheap) and
>> only then work on thunks. Most likely this will find the non-local thunk/alias
>> faster. The other cases were leftovers from the conversion of thunks from aliases
>> to functions.
>>
>> I also noticed a bug in ipa-profile, which does not disable all the
>> transforms with !ipa_profile_flag used on OPTIMIZATION_NODE, and fixed it.
>>
>> Bootstrapped/regtested x86_64-linux, committed. I would be interested to
>> know if call_for_symbol_thunks_and_aliases is now off your oprofiles
>> (sorry, easier to type than perf-profiles)
>>
>> Honza
>>
>> * ipa-visibility.c (function_and_variable_visibility): Only
>> check locality if node is not already local.
>> * ipa-inline.c (want_inline_function_to_all_callers_p): Use
>> call_for_symbol_and_aliases instead of
>> call_for_symbol_thunks_and_aliases.
>> (ipa_inline): Likewise.
>> * cgraph.c (cgraph_node::call_for_symbol_thunks_and_aliases):
>> First walk aliases.
>> * ipa.c (symbol_table::remove_unreachable_nodes): Use
>> call_for_symbol_and_aliases.
>> * ipa-profile.c (ipa_propagate_frequency_data): Add function_symbol.
>> (ipa_propagate_frequency_1): Use it; use opt_for_fn.
>> (ipa_propagate_frequency): Update.
>> (ipa_profile): Add opt_for_fn guards.
>> Index: ipa-visibility.c
>> ===================================================================
>> --- ipa-visibility.c (revision 220741)
>> +++ ipa-visibility.c (working copy)
>> @@ -595,7 +595,8 @@ function_and_variable_visibility (bool w
>> }
>> FOR_EACH_DEFINED_FUNCTION (node)
>> {
>> - node->local.local |= node->local_p ();
>> + if (!node->local.local)
>> + node->local.local |= node->local_p ();
>>
>> /* If we know that function can not be overwritten by a different semantics
>> and moreover its section can not be discarded, replace all direct calls
>> Index: ipa-inline.c
>> ===================================================================
>> --- ipa-inline.c (revision 220741)
>> +++ ipa-inline.c (working copy)
>> @@ -975,14 +975,14 @@ want_inline_function_to_all_callers_p (s
>> if (node->global.inlined_to)
>> return false;
>> /* Does it have callers? */
>> - if (!node->call_for_symbol_thunks_and_aliases (has_caller_p, NULL, true))
>> + if (!node->call_for_symbol_and_aliases (has_caller_p, NULL, true))
>> return false;
>> /* Inlining into all callers would increase size? */
>> if (estimate_growth (node) > 0)
>> return false;
>> /* All inlines must be possible. */
>> - if (node->call_for_symbol_thunks_and_aliases (check_callers, &has_hot_call,
>> - true))
>> + if (node->call_for_symbol_and_aliases (check_callers, &has_hot_call,
>> + true))
>> return false;
>> if (!cold && !has_hot_call)
>> return false;
>> @@ -2359,9 +2359,9 @@ ipa_inline (void)
>> if (want_inline_function_to_all_callers_p (node, cold))
>> {
>> int num_calls = 0;
>> - node->call_for_symbol_thunks_and_aliases (sum_callers, &num_calls,
>> - true);
>> - while (node->call_for_symbol_thunks_and_aliases
>> + node->call_for_symbol_and_aliases (sum_callers, &num_calls,
>> + true);
>> + while (node->call_for_symbol_and_aliases
>> (inline_to_all_callers, &num_calls, true))
>> ;
>> remove_functions = true;
>> Index: cgraph.c
>> ===================================================================
>> --- cgraph.c (revision 220741)
>> +++ cgraph.c (working copy)
>> @@ -2191,6 +2191,16 @@ cgraph_node::call_for_symbol_thunks_and_
>>
>> if (callback (this, data))
>> return true;
>> + FOR_EACH_ALIAS (this, ref)
>> + {
>> + cgraph_node *alias = dyn_cast <cgraph_node *> (ref->referring);
>> + if (include_overwritable
>> + || alias->get_availability () > AVAIL_INTERPOSABLE)
>> + if (alias->call_for_symbol_thunks_and_aliases (callback, data,
>> + include_overwritable,
>> + exclude_virtual_thunks))
>> + return true;
>> + }
>> for (e = callers; e; e = e->next_caller)
>> if (e->caller->thunk.thunk_p
>> && (include_overwritable
>> @@ -2202,16 +2212,6 @@ cgraph_node::call_for_symbol_thunks_and_
>> exclude_virtual_thunks))
>> return true;
>>
>> - FOR_EACH_ALIAS (this, ref)
>> - {
>> - cgraph_node *alias = dyn_cast <cgraph_node *> (ref->referring);
>> - if (include_overwritable
>> - || alias->get_availability () > AVAIL_INTERPOSABLE)
>> - if (alias->call_for_symbol_thunks_and_aliases (callback, data,
>> - include_overwritable,
>> - exclude_virtual_thunks))
>> - return true;
>> - }
>> return false;
>> }
>>
>> Index: ipa.c
>> ===================================================================
>> --- ipa.c (revision 220741)
>> +++ ipa.c (working copy)
>> @@ -661,7 +661,7 @@ symbol_table::remove_unreachable_nodes (
>> if (node->address_taken
>> && !node->used_from_other_partition)
>> {
>> - if (!node->call_for_symbol_thunks_and_aliases
>> + if (!node->call_for_symbol_and_aliases
>> (has_addr_references_p, NULL, true)
>> && (!node->instrumentation_clone
>> || !node->instrumented_version
>> Index: ipa-profile.c
>> ===================================================================
>> --- ipa-profile.c (revision 220741)
>> +++ ipa-profile.c (working copy)
>> @@ -322,6 +322,7 @@ ipa_profile_read_summary (void)
>>
>> struct ipa_propagate_frequency_data
>> {
>> + cgraph_node *function_symbol;
>> bool maybe_unlikely_executed;
>> bool maybe_executed_once;
>> bool only_called_at_startup;
>> @@ -342,7 +343,7 @@ ipa_propagate_frequency_1 (struct cgraph
>> || d->only_called_at_startup || d->only_called_at_exit);
>> edge = edge->next_caller)
>> {
>> - if (edge->caller != node)
>> + if (edge->caller != d->function_symbol)
>> {
>> d->only_called_at_startup &= edge->caller->only_called_at_startup;
>> /* It makes sense to put main() together with the static constructors.
>> @@ -358,7 +359,11 @@ ipa_propagate_frequency_1 (struct cgraph
>> errors can make us to push function into unlikely section even when
>> it is executed by the train run. Transfer the function only if all
>> callers are unlikely executed. */
>> - if (profile_info && flag_branch_probabilities
>> + if (profile_info
>> + && opt_for_fn (d->function_symbol->decl, flag_branch_probabilities)
>> + /* Thunks are not profiled. This is more or less implementation
>> + bug. */
>> + && !d->function_symbol->thunk.thunk_p
>> && (edge->caller->frequency != NODE_FREQUENCY_UNLIKELY_EXECUTED
>> || (edge->caller->global.inlined_to
>> && edge->caller->global.inlined_to->frequency
>> @@ -418,7 +423,7 @@ contains_hot_call_p (struct cgraph_node
>> bool
>> ipa_propagate_frequency (struct cgraph_node *node)
>> {
>> - struct ipa_propagate_frequency_data d = {true, true, true, true};
>> + struct ipa_propagate_frequency_data d = {node, true, true, true, true};
>> bool changed = false;
>>
>> /* We can not propagate anything useful about externally visible functions
>> @@ -432,8 +437,8 @@ ipa_propagate_frequency (struct cgraph_n
>> if (dump_file && (dump_flags & TDF_DETAILS))
>> fprintf (dump_file, "Processing frequency %s\n", node->name ());
>>
>> - node->call_for_symbol_thunks_and_aliases (ipa_propagate_frequency_1, &d,
>> - true);
>> + node->call_for_symbol_and_aliases (ipa_propagate_frequency_1, &d,
>> + true);
>>
>> if ((d.only_called_at_startup && !d.only_called_at_exit)
>> && !node->only_called_at_startup)
>> @@ -597,6 +602,9 @@ ipa_profile (void)
>> {
>> bool update = false;
>>
>> + if (!opt_for_fn (n->decl, flag_ipa_profile))
>> + continue;
>> +
>> for (e = n->indirect_calls; e; e = e->next_callee)
>> {
>> if (n->count)
>> @@ -697,7 +705,9 @@ ipa_profile (void)
>> order_pos = ipa_reverse_postorder (order);
>> for (i = order_pos - 1; i >= 0; i--)
>> {
>> - if (order[i]->local.local && ipa_propagate_frequency (order[i]))
>> + if (order[i]->local.local
>> + && opt_for_fn (order[i]->decl, flag_ipa_profile)
>> + && ipa_propagate_frequency (order[i]))
>> {
>> for (e = order[i]->callees; e; e = e->next_callee)
>> if (e->callee->local.local && !e->callee->aux)
>> @@ -714,7 +724,9 @@ ipa_profile (void)
>> something_changed = false;
>> for (i = order_pos - 1; i >= 0; i--)
>> {
>> - if (order[i]->aux && ipa_propagate_frequency (order[i]))
>> + if (order[i]->aux
>> + && opt_for_fn (order[i]->decl, flag_ipa_profile)
>> + && ipa_propagate_frequency (order[i]))
>> {
>> for (e = order[i]->callees; e; e = e->next_callee)
>> if (e->callee->local.local && !e->callee->aux)
>>
>
> Hi.
>
> Attached are the perf report and the -ftime-report output for the WPA phase.
>
> Martin
Hm, using the same compiler, the Firefox LTO time statistics and perf report are very different.
I wonder how that is possible?
Martin
[-- Attachment #2: firefox-latest.profile.txt --]
[-- Type: text/plain, Size: 9434 bytes --]
Execution times (seconds)
phase setup : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 1988 kB ( 0%) ggc
phase opt and generate : 42.32 (70%) usr 0.85 (56%) sys 43.16 (69%) wall 1387464 kB (28%) ggc
phase stream in : 18.50 (30%) usr 0.68 (44%) sys 19.17 (31%) wall 3528077 kB (72%) ggc
garbage collection : 2.24 ( 4%) usr 0.00 ( 0%) sys 2.24 ( 4%) wall 0 kB ( 0%) ggc
callgraph optimization : 0.37 ( 1%) usr 0.00 ( 0%) sys 0.37 ( 1%) wall 38 kB ( 0%) ggc
ipa dead code removal : 3.06 ( 5%) usr 0.01 ( 1%) sys 2.88 ( 5%) wall 0 kB ( 0%) ggc
ipa virtual call target : 5.72 ( 9%) usr 0.06 ( 4%) sys 5.87 ( 9%) wall 0 kB ( 0%) ggc
ipa devirtualization : 0.18 ( 0%) usr 0.00 ( 0%) sys 0.23 ( 0%) wall 22382 kB ( 0%) ggc
ipa cp : 2.88 ( 5%) usr 0.09 ( 6%) sys 2.97 ( 5%) wall 515623 kB (10%) ggc
ipa inlining heuristics : 13.96 (23%) usr 0.13 ( 8%) sys 14.12 (23%) wall 471848 kB (10%) ggc
ipa comdats : 0.12 ( 0%) usr 0.00 ( 0%) sys 0.12 ( 0%) wall 0 kB ( 0%) ggc
ipa lto gimple in : 2.54 ( 4%) usr 0.48 (31%) sys 3.23 ( 5%) wall 645652 kB (13%) ggc
ipa lto decl in : 12.64 (21%) usr 0.37 (24%) sys 13.01 (21%) wall 2592737 kB (53%) ggc
ipa lto constructors in : 0.17 ( 0%) usr 0.01 ( 1%) sys 0.20 ( 0%) wall 16493 kB ( 0%) ggc
ipa lto cgraph I/O : 0.58 ( 1%) usr 0.09 ( 6%) sys 0.67 ( 1%) wall 437504 kB ( 9%) ggc
ipa lto decl merge : 1.90 ( 3%) usr 0.00 ( 0%) sys 1.90 ( 3%) wall 8191 kB ( 0%) ggc
ipa lto cgraph merge : 1.30 ( 2%) usr 0.00 ( 0%) sys 1.29 ( 2%) wall 14989 kB ( 0%) ggc
whopr wpa : 0.91 ( 1%) usr 0.00 ( 0%) sys 0.88 ( 1%) wall 2 kB ( 0%) ggc
whopr partitioning : 2.66 ( 4%) usr 0.00 ( 0%) sys 2.67 ( 4%) wall 6081 kB ( 0%) ggc
ipa reference : 1.38 ( 2%) usr 0.01 ( 1%) sys 1.40 ( 2%) wall 0 kB ( 0%) ggc
ipa profile : 0.21 ( 0%) usr 0.01 ( 1%) sys 0.21 ( 0%) wall 0 kB ( 0%) ggc
ipa pure const : 1.61 ( 3%) usr 0.01 ( 1%) sys 1.61 ( 3%) wall 0 kB ( 0%) ggc
ipa icf : 4.99 ( 8%) usr 0.06 ( 4%) sys 5.00 ( 8%) wall 1120 kB ( 0%) ggc
tree SSA rewrite : 0.12 ( 0%) usr 0.02 ( 1%) sys 0.12 ( 0%) wall 23170 kB ( 0%) ggc
tree SSA incremental : 0.23 ( 0%) usr 0.05 ( 3%) sys 0.21 ( 0%) wall 14434 kB ( 0%) ggc
tree operand scan : 0.14 ( 0%) usr 0.03 ( 2%) sys 0.22 ( 0%) wall 145252 kB ( 3%) ggc
dominance frontiers : 0.04 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 0 kB ( 0%) ggc
dominance computation : 0.14 ( 0%) usr 0.05 ( 3%) sys 0.11 ( 0%) wall 0 kB ( 0%) ggc
varconst : 0.01 ( 0%) usr 0.02 ( 1%) sys 0.03 ( 0%) wall 0 kB ( 0%) ggc
loop fini : 0.07 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall 0 kB ( 0%) ggc
unaccounted todo : 0.62 ( 1%) usr 0.00 ( 0%) sys 0.65 ( 1%) wall 0 kB ( 0%) ggc
TOTAL : 60.82 1.53 62.34 4917531 kB
[ perf record: Woken up 59 times to write data ]
[ perf record: Captured and wrote 14.722 MB perf.data (~643202 samples) ]
marxin@marxinbox:~/Programming/gecko-dev/obj-x86_64-unknown-linux-gnu/toolkit/library> perf report
marxin@marxinbox:~/Programming/gecko-dev/obj-x86_64-unknown-linux-gnu/toolkit/library> gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/home/marxin/Programming/bin/gcc2/lib/gcc/x86_64-unknown-linux-gnu/5.0.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ../configure --enable-languages=c,c++ --disable-libsanitizer --prefix=/home/marxin/Programming/bin/gcc2 --disable-bootstrap --enable-checking=release
Thread model: posix
gcc version 5.0.0 20150218 (experimental) (GCC)
marxin@marxinbox:~/Programming/gecko-dev/obj-x86_64-unknown-linux-gnu/toolkit/library> perf report
marxin@marxinbox:~/Programming/gecko-dev/obj-x86_64-unknown-linux-gnu/toolkit/library> perf report --stdio | sed 's/\ *$//' | head -n50
# To display the perf.data header info, please use --header/--header-only options.
#
# Samples: 245K of event 'cycles'
# Event count (approx.): 216467422123
#
# Overhead Command Shared Object Symbol
# ........ ........ ................. ..................................................................................................................................................................................................................................................................................................
#
4.97% lto1-wpa lto1 [.] inflate_fast
2.78% lto1-wpa lto1 [.] symbol_table::remove_unreachable_nodes(_IO_FILE*)
2.37% lto1-wpa libc-2.19.so [.] _int_malloc
1.77% lto1-wpa lto1 [.] record_target_from_binfo(vec<cgraph_node*, va_heap, vl_ptr>&, vec<tree_node*, va_heap, vl_ptr>*, tree_node*, tree_node*, vec<tree_node*, va_heap, vl_ptr>&, long, tree_node*, long, hash_set<tree_node*, default_hashset_traits>*, hash_set<tree_node*, default_hashset_traits>*, bool, bool*)
1.57% lto1-wpa lto1 [.] ht_lookup_with_hash(ht*, unsigned char const*, unsigned long, unsigned int, ht_lookup_option)
1.56% lto1-wpa lto1 [.] streamer_read_uhwi(lto_input_block*)
1.48% lto1-wpa lto1 [.] estimate_calls_size_and_time(cgraph_node*, int*, int*, int*, int*, unsigned int, vec<tree_node*, va_heap, vl_ptr>, vec<ipa_polymorphic_call_context, va_heap, vl_ptr>, vec<ipa_agg_jump_function*, va_heap, vl_ptr>) [clone .isra.129]
1.48% lto1-wpa lto1 [.] unify_scc(streamer_tree_cache_d*, unsigned int, unsigned int, unsigned int, unsigned int)
1.40% lto1-wpa lto1 [.] lto_cgraph_replace_node(cgraph_node*, cgraph_node*)
1.38% lto1-wpa lto1 [.] ggc_set_mark(void const*)
1.30% lto1-wpa libc-2.19.so [.] malloc_consolidate
1.28% lto1-wpa lto1 [.] htab_hash_string
1.25% lto1-wpa lto1 [.] compare_tree_sccs_1(tree_node*, tree_node*, tree_node***)
1.23% lto1-wpa lto1 [.] fibonacci_heap<sreal, cgraph_edge>::consolidate()
1.19% lto1-wpa lto1 [.] splay_tree_splay
1.15% lto1-wpa lto1 [.] can_inline_edge_p(cgraph_edge*, bool, bool)
1.14% lto1-wpa lto1 [.] cgraph_node::get_availability()
1.14% lto1-wpa lto1 [.] evaluate_properties_for_edge(cgraph_edge*, bool, unsigned int*, vec<tree_node*, va_heap, vl_ptr>*, vec<ipa_polymorphic_call_context, va_heap, vl_ptr>*, vec<ipa_agg_jump_function*, va_heap, vl_ptr>*) [clone .constprop.131]
1.13% lto1-wpa lto1 [.] gimple_get_virt_method_for_vtable(long, tree_node*, unsigned long, bool*)
1.10% lto1-wpa lto1 [.] types_same_for_odr(tree_node const*, tree_node const*)
1.08% lto1-wpa lto1 [.] gt_ggc_mx_lang_tree_node(void*)
1.05% lto1-wpa lto1 [.] streamer_read_tree_bitfields(lto_input_block*, data_in*, tree_node*)
0.99% lto1-wpa lto1 [.] type_in_anonymous_namespace_p(tree_node const*)
0.99% lto1-wpa lto1 [.] gimple_has_body_p(tree_node*)
0.95% lto1-wpa lto1 [.] decl_assembler_name(tree_node*)
0.93% lto1-wpa lto1 [.] do_per_function(void (*)(function*, void*), void*)
0.82% lto1-wpa libc-2.19.so [.] _int_free
0.81% lto1-wpa lto1 [.] possible_polymorphic_call_targets_1(vec<cgraph_node*, va_heap, vl_ptr>&, hash_set<tree_node*, default_hashset_traits>*, hash_set<tree_node*, default_hashset_traits>*, tree_node*, odr_type_d*, long, tree_node*, long, bool*, vec<tree_node*, va_heap, vl_ptr>&, bool)
0.81% lto1-wpa lto1 [.] searchc(searchc_env*, cgraph_node*, bool (*)(cgraph_edge*))
0.80% lto1-wpa lto1 [.] streamer_get_pickled_tree(lto_input_block*, data_in*)
0.78% lto1-wpa lto1 [.] edge_badness(cgraph_edge*, bool)
0.77% lto1-wpa lto1 [.] hash_table<asmname_hasher, xcallocator, true>::find_slot_with_hash(tree_node const* const&, unsigned int, insert_option)
0.77% lto1-wpa lto1 [.] update_callee_keys(fibonacci_heap<sreal, cgraph_edge>*, cgraph_node*, bitmap_head*)
0.76% lto1-wpa lto1 [.] ggc_internal_alloc(unsigned long, void (*)(void*), unsigned long, unsigned long)
0.75% lto1-wpa lto1 [.] fibonacci_heap<sreal, cgraph_edge>::extract_minimum_node()
0.75% lto1-wpa lto1 [.] execute_one_pass(opt_pass*)
0.74% lto1-wpa lto1 [.] inflate
0.71% lto1-wpa lto1 [.] contains_polymorphic_type_p(tree_node const*)
0.67% lto1-wpa lto1 [.] get_binfo_at_offset(tree_node*, long, tree_node*)
0.64% lto1-wpa lto1 [.] symbol_table::decl_assembler_name_equal(tree_node*, tree_node const*)
0.61% lto1-wpa lto1 [.] lto_balanced_map(int)
0.61% lto1-wpa lto1 [.] ipa_icf::sem_item_optimizer::do_congruence_step_for_index(ipa_icf::congruence_class*, unsigned int)
end of thread, other threads:[~2015-02-18 14:13 UTC | newest]
Thread overview: 6+ messages
2015-02-17 17:14 [RFC, PATCH] LTO: IPA inline speed up for large apps (Chrome) Martin Liška
2015-02-17 18:38 ` Jan Hubicka
2015-02-18 10:28 ` Martin Liška
2015-02-17 21:03 ` Jan Hubicka
2015-02-18 13:58 ` Martin Liška
2015-02-18 14:13 ` Martin Liška