public inbox for gcc-patches@gcc.gnu.org
* [RFC, PATCH] LTO: IPA inline speed up for large apps (Chrome)
@ 2015-02-17 17:14 Martin Liška
  2015-02-17 18:38 ` Jan Hubicka
  2015-02-17 21:03 ` Jan Hubicka
  0 siblings, 2 replies; 6+ messages in thread
From: Martin Liška @ 2015-02-17 17:14 UTC
  To: GCC Patches; +Cc: hubicka >> Jan Hubicka

[-- Attachment #1: Type: text/plain, Size: 742 bytes --]

Hello.

While debugging LTO builds of Chrome, Honza and I noticed that the WPA phase takes quite a long time.
The following patch is an attempt to cache IPA inliner predicates that stay constant during
inline_small_functions.
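
A minimal sketch of the caching idea, assuming a simplified node type: each predicate gets one "initialized" bit and one "value" bit, and a global flag decides whether the cached value may be used. The names `node`, `enable_cache`, `can_remove_p` and `compute_calls` are hypothetical stand-ins for illustration, not GCC's cgraph_node API.

```cpp
#include <cassert>

// Hypothetical stand-in for symbol_table::enable_inline_cache.
static bool enable_cache = false;

// Hypothetical, heavily simplified node; not GCC's cgraph_node.
struct node
{
  bool address_taken = false;
  unsigned compute_calls = 0;   // counts slow-path evaluations (illustration only)

  unsigned removable_init : 1;  // set once the cached value is valid
  unsigned removable : 1;       // cached predicate result

  node () : removable_init (0), removable (0) {}

  // Slow path: always recomputes the predicate.
  bool can_remove_compute_p ()
  {
    ++compute_calls;
    return !address_taken;
  }

  // Wrapper: uses the cached bit while caching is enabled.
  bool can_remove_p ()
  {
    if (!enable_cache)
      return can_remove_compute_p ();

    if (!removable_init)
      {
        removable = can_remove_compute_p ();
        removable_init = 1;
      }
    return removable;
  }
};
```

The cache is only sound while the inputs of the predicate do not change, which is why the patch switches the flag on only for the duration of inline_small_functions.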

As you can see in the attached report, this patch can reduce the time spent in WPA by ~40%, which
is a really big improvement. The disadvantage of the solution is that the patch adds 4 new bitfields
to the cgraph_node class. We could move these flags to inline_summary, but as that struct is not
accessible from cgraph.h, we would lose the inlining that is crucial for these predicates.
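
For readers skimming the diff: besides the cached bits, the patch adds template overloads of call_for_symbol_and_aliases that take the callback as a template argument instead of a run-time function pointer. The sketch below illustrates that pattern under simplified assumptions; `sym`, `for_each`, `value_exceeds` and `value_exceeds_untyped` are hypothetical names, not GCC code.

```cpp
#include <cassert>
#include <vector>

// Hypothetical symbol with a list of aliases; not GCC's symtab_node.
struct sym
{
  int value;
  std::vector<sym *> aliases;

  // Classic form: the callee is known only at run time and the payload
  // is passed as an untyped void pointer.
  bool for_each (bool (*callback) (sym *, void *), void *data)
  {
    if (callback (this, data))
      return true;
    for (sym *a : aliases)
      if (a->for_each (callback, data))
        return true;
    return false;
  }

  // Template form: the callee is a compile-time constant, so the compiler
  // may inline it, and the payload keeps its static type.
  template <typename Arg, bool (*callback) (sym *, Arg)>
  bool for_each (Arg data)
  {
    if (callback (this, data))
      return true;
    for (sym *a : aliases)
      if (a->for_each<Arg, callback> (data))
        return true;
    return false;
  }
};

// Typed callback for the template form.
static bool value_exceeds (sym *s, int limit)
{
  return s->value > limit;
}

// Untyped callback for the function-pointer form.
static bool value_exceeds_untyped (sym *s, void *data)
{
  return s->value > *static_cast<int *> (data);
}
```

Whether the compiler actually inlines the callback depends on its heuristics; the template form only makes the callee visible at compile time and removes the casts through void *.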

I welcome any ideas about the solution. I'm not sure whether it is acceptable for stage 4, which is
why no ChangeLog entry is prepared yet.

Thanks,
Martin

[-- Attachment #2: 0001-ipa-inline-introduce-computed-value-that-speeds-up-I.patch --]
[-- Type: text/x-patch, Size: 22738 bytes --]

From 4e878a928ff7e9fe4eee0ea4b241c01c4440bd60 Mon Sep 17 00:00:00 2001
From: mliska <mliska@suse.cz>
Date: Mon, 16 Feb 2015 16:48:01 +0100
Subject: [PATCH] ipa-inline: introduce computed value that speeds up IPA
 inliner.

---
 gcc/cgraph.c       |  77 -------------
 gcc/cgraph.h       | 309 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 gcc/ipa-inline.c   |   2 +
 gcc/lto-streamer.c |   2 +
 gcc/symtab.c       |  48 ++++++---
 5 files changed, 345 insertions(+), 93 deletions(-)

diff --git a/gcc/cgraph.c b/gcc/cgraph.c
index 3548bd0..b72a6c0 100644
--- a/gcc/cgraph.c
+++ b/gcc/cgraph.c
@@ -2403,83 +2403,6 @@ cgraph_edge::maybe_hot_p (void)
   return true;
 }
 
-/* Worker for cgraph_can_remove_if_no_direct_calls_p.  */
-
-static bool
-nonremovable_p (cgraph_node *node, void *)
-{
-  return !node->can_remove_if_no_direct_calls_and_refs_p ();
-}
-
-/* Return true when function cgraph_node and its aliases can be removed from
-   callgraph if all direct calls are eliminated.  */
-
-bool
-cgraph_node::can_remove_if_no_direct_calls_p (void)
-{
-  /* Extern inlines can always go, we will use the external definition.  */
-  if (DECL_EXTERNAL (decl))
-    return true;
-  if (address_taken)
-    return false;
-  return !call_for_symbol_and_aliases (nonremovable_p, NULL, true);
-}
-
-/* Return true when function cgraph_node can be expected to be removed
-   from program when direct calls in this compilation unit are removed.
-
-   As a special case COMDAT functions are
-   cgraph_can_remove_if_no_direct_calls_p while the are not
-   cgraph_only_called_directly_p (it is possible they are called from other
-   unit)
-
-   This function behaves as cgraph_only_called_directly_p because eliminating
-   all uses of COMDAT function does not make it necessarily disappear from
-   the program unless we are compiling whole program or we do LTO.  In this
-   case we know we win since dynamic linking will not really discard the
-   linkonce section.  */
-
-bool
-cgraph_node::will_be_removed_from_program_if_no_direct_calls_p (void)
-{
-  gcc_assert (!global.inlined_to);
-
-  if (call_for_symbol_and_aliases (used_from_object_file_p_worker,
-				   NULL, true))
-    return false;
-  if (!in_lto_p && !flag_whole_program)
-    return only_called_directly_p ();
-  else
-    {
-       if (DECL_EXTERNAL (decl))
-         return true;
-      return can_remove_if_no_direct_calls_p ();
-    }
-}
-
-
-/* Worker for cgraph_only_called_directly_p.  */
-
-static bool
-cgraph_not_only_called_directly_p_1 (cgraph_node *node, void *)
-{
-  return !node->only_called_directly_or_aliased_p ();
-}
-
-/* Return true when function cgraph_node and all its aliases are only called
-   directly.
-   i.e. it is not externally visible, address was not taken and
-   it is not used in any other non-standard way.  */
-
-bool
-cgraph_node::only_called_directly_p (void)
-{
-  gcc_assert (ultimate_alias_target () == this);
-  return !call_for_symbol_and_aliases (cgraph_not_only_called_directly_p_1,
-				       NULL, true);
-}
-
-
 /* Collect all callers of NODE.  Worker for collect_callers_of_node.  */
 
 static bool
diff --git a/gcc/cgraph.h b/gcc/cgraph.h
index 06d2704..39cb340 100644
--- a/gcc/cgraph.h
+++ b/gcc/cgraph.h
@@ -261,17 +261,29 @@ public:
 				  void *data,
 				  bool include_overwrite);
 
+  /* Call callback on symtab node and aliases associated to this node.
+     When INCLUDE_OVERWRITABLE is false, overwritable aliases and thunks are
+     skipped.  */
+  template <typename Arg, bool (*callback) (symtab_node*, Arg arg)>
+  bool call_for_symbol_and_aliases (Arg data, bool include_overwrite);
+
   /* If node can not be interposable by static or dynamic linker to point to
      different definition, return this symbol. Otherwise look for alias with
      such property and if none exists, introduce new one.  */
   symtab_node *noninterposable_alias (void);
 
+  /* Worker searching noninterposable alias.  */
+  static bool noninterposable_alias (symtab_node *node, symtab_node **data);
+
   /* Return node that alias is aliasing.  */
   inline symtab_node *get_alias_target (void);
 
   /* Set section for symbol and its aliases.  */
   void set_section (const char *section);
 
+  /* Worker for set_section.  */
+  static bool set_section (symtab_node *n, const char *s);
+
   /* Set section, do not recurse into aliases.
      When one wants to change section of symbol and its aliases,
      use set_section.  */
@@ -523,6 +535,11 @@ protected:
   bool call_for_symbol_and_aliases_1 (bool (*callback) (symtab_node *, void *),
 				      void *data,
 				      bool include_overwrite);
+
+  /* Worker for call_for_symbol_and_aliases.  */
+  template <typename Arg, bool (*callback) (symtab_node *, Arg)>
+  bool call_for_symbol_and_aliases_1 (Arg data, bool include_overwritable);
+
 private:
   /* Worker for set_section.  */
   static bool set_section (symtab_node *n, void *s);
@@ -1042,6 +1059,13 @@ public:
 						      void *),
 				    void *data, bool include_overwritable);
 
+  /* Call callback on function and aliases associated to the function.
+     When INCLUDE_OVERWRITABLE is false, overwritable aliases and thunks are
+     skipped. */
+  template <typename Arg, bool (*callback) (cgraph_node *, Arg)>
+  bool call_for_symbol_and_aliases (Arg data, bool include_overwritable);
+
+
   /* Call callback on cgraph_node, thunks and aliases associated to NODE.
      When INCLUDE_OVERWRITABLE is false, overwritable aliases and thunks are
      skipped.  When EXCLUDE_VIRTUAL_THUNKS is true, virtual thunks are
@@ -1052,6 +1076,15 @@ public:
 					   bool include_overwritable,
 					   bool exclude_virtual_thunks = false);
 
+  /* Call callback on cgraph_node, thunks and aliases associated to NODE.
+     When INCLUDE_OVERWRITABLE is false, overwritable aliases and thunks are
+     skipped.  When EXCLUDE_VIRTUAL_THUNKS is true, virtual thunks are
+     skipped.  */
+  template <typename Arg, bool (*callback) (cgraph_node *, Arg)>
+  bool call_for_symbol_thunks_and_aliases (Arg data,
+					   bool include_overwritable,
+					   bool exclude_virtual_thunks = false);
+
   /* Likewise indicate that a node is needed, i.e. reachable via some
      external means.  */
   inline void mark_force_output (void);
@@ -1093,6 +1126,9 @@ public:
      the program unless we are compiling whole program or we do LTO.  In this
      case we know we win since dynamic linking will not really discard the
      linkonce section.  */
+  bool will_be_removed_from_program_if_no_direct_calls_compute_p (void);
+
+  /* Wrapper for will_be_removed_from_program_if_no_direct_calls_compute_p.  */
   bool will_be_removed_from_program_if_no_direct_calls_p (void);
 
   /* Return true when function can be removed from callgraph
@@ -1101,8 +1137,15 @@ public:
 
   /* Return true when function cgraph_node and its aliases can be removed from
      callgraph if all direct calls are eliminated.  */
+  bool can_remove_if_no_direct_calls_compute_p (void);
+
+  /* Wrapper for can_remove_if_no_direct_calls_compute_p.  */
   bool can_remove_if_no_direct_calls_p (void);
 
+  /* Worker for cgraph_can_remove_if_no_direct_calls_p.  */
+  static bool nonremovable_p (cgraph_node *node, void *);
+  static bool nonremovable_compute_p (cgraph_node *node, void *);
+
   /* Return true when callgraph node is a function with Gimple body defined
      in current unit.  Functions can also be define externally or they
      can be thunks with no Gimple representation.
@@ -1295,11 +1338,24 @@ public:
   /* True if there was multiple COMDAT bodies merged by lto-symtab.  */
   unsigned merged : 1;
 
+  /* IPA inline cached values.  */
+  unsigned inline_nonremovable_init: 1;
+  unsigned inline_can_remove_if_no_direct_calls_init: 1;
+  unsigned inline_will_be_removed_if_no_direct_calls_init: 1;
+
+  unsigned inline_nonremovable: 1;
+  unsigned inline_can_remove_if_no_direct_calls: 1;
+  unsigned inline_will_be_removed_if_no_direct_calls: 1;
+
 private:
   /* Worker for call_for_symbol_and_aliases.  */
   bool call_for_symbol_and_aliases_1 (bool (*callback) (cgraph_node *,
 						        void *),
 				      void *data, bool include_overwritable);
+
+  /* Worker for call_for_symbol_and_aliases.  */
+  template <typename Arg, bool (*callback) (cgraph_node *, Arg)>
+  bool call_for_symbol_and_aliases_1 (Arg data, bool include_overwritable);
 };
 
 /* A cgraph node set is a collection of cgraph nodes.  A cgraph node
@@ -1683,6 +1739,12 @@ public:
 				    void *data,
 				    bool include_overwritable);
 
+  /* Call calback on varpool symbol and aliases associated to varpool symbol.
+     When INCLUDE_OVERWRITABLE is false, overwritable aliases and thunks are
+     skipped. */
+  template <typename Arg, bool (*callback) (varpool_node *, Arg)>
+  bool call_for_symbol_and_aliases (Arg data, bool include_overwritable);
+
   /* Return true when variable should be considered externally visible.  */
   bool externally_visible_p (void);
 
@@ -1761,6 +1823,10 @@ private:
   bool call_for_symbol_and_aliases_1 (bool (*callback) (varpool_node *, void *),
 				      void *data,
 				      bool include_overwritable);
+
+  /* Worker for call_for_symbol_and_aliases.  */
+  template <typename Arg, bool (*callback) (varpool_node*, Arg arg)>
+  bool call_for_symbol_and_aliases_1 (Arg data, bool include_overwritable);
 };
 
 /* Every top level asm statement is put into a asm_node.  */
@@ -1862,7 +1928,7 @@ public:
   friend class cgraph_node;
   friend class cgraph_edge;
 
-  symbol_table (): cgraph_max_summary_uid (1)
+  symbol_table (): cgraph_max_summary_uid (1), enable_inline_cache (false)
   {
   }
 
@@ -2101,6 +2167,9 @@ public:
 
   FILE* GTY ((skip)) dump_file;
 
+  /* Inline cache flag.  */
+  bool enable_inline_cache;
+
 private:
   /* Allocate new callgraph node.  */
   inline cgraph_node * allocate_cgraph_symbol (void);
@@ -2987,6 +3056,21 @@ symtab_node::call_for_symbol_and_aliases (bool (*callback) (symtab_node *,
   return false;
 }
 
+template <typename Arg, bool (*callback) (symtab_node *, Arg arg)>
+inline bool
+symtab_node::call_for_symbol_and_aliases (Arg data, bool include_overwritable)
+{
+  ipa_ref *ref;
+
+  if (callback (this, data))
+    return true;
+  if (iterate_direct_aliases (0, ref))
+    return call_for_symbol_and_aliases_1 <Arg, callback>
+      (data, include_overwritable);
+  return false;
+}
+
+
 /* Call callback on function and aliases associated to the function.
    When INCLUDE_OVERWRITABLE is false, overwritable aliases and thunks are
    skipped.  */
@@ -3004,6 +3088,43 @@ cgraph_node::call_for_symbol_and_aliases (bool (*callback) (cgraph_node *,
   return false;
 }
 
+/* Call callback on function and aliases associated to the function.
+   When INCLUDE_OVERWRITABLE is false, overwritable aliases and thunks are
+   skipped.  */
+
+template <typename Arg, bool (*callback) (cgraph_node *, Arg arg)>
+inline bool
+cgraph_node::call_for_symbol_and_aliases (Arg data, bool include_overwritable)
+{
+  ipa_ref *ref;
+
+  if (callback (this, data))
+    return true;
+
+  if (iterate_direct_aliases (0, ref))
+    return call_for_symbol_and_aliases_1 <Arg, callback> (data, include_overwritable);
+
+  return false;
+}
+
+template <typename Arg, bool (*callback) (cgraph_node *, Arg arg)>
+inline bool
+cgraph_node::call_for_symbol_and_aliases_1 (Arg data, bool include_overwritable)
+{
+  ipa_ref *ref;
+  FOR_EACH_ALIAS (this, ref)
+    {
+      cgraph_node *alias = dyn_cast <cgraph_node *> (ref->referring);
+      if (include_overwritable
+	  || alias->get_availability () > AVAIL_INTERPOSABLE)
+	if (alias->call_for_symbol_and_aliases <Arg, callback> (data, include_overwritable))
+	  return true;
+    }
+
+  return false;
+}
+
+
 /* Call calback on varpool symbol and aliases associated to varpool symbol.
    When INCLUDE_OVERWRITABLE is false, overwritable aliases and thunks are
    skipped. */
@@ -3021,6 +3142,47 @@ varpool_node::call_for_symbol_and_aliases (bool (*callback) (varpool_node *,
   return false;
 }
 
+
+/* Call calback on varpool symbol and aliases associated to varpool symbol.
+   When INCLUDE_OVERWRITABLE is false, overwritable aliases and thunks are
+   skipped. */
+
+template <typename Arg, bool (*callback) (varpool_node*, Arg arg)>
+inline bool
+varpool_node::call_for_symbol_and_aliases (Arg data, bool include_overwritable)
+{
+  ipa_ref *ref;
+
+  if (callback (this, data))
+    return true;
+  if (iterate_direct_aliases (0, ref))
+    return call_for_symbol_and_aliases_1 <Arg, callback>
+      (data, include_overwritable);
+
+  return false;
+}
+
+/* Worker for call_for_symbol_and_aliases.  */
+
+template <typename Arg, bool (*callback) (varpool_node*, Arg arg)>
+bool
+varpool_node::call_for_symbol_and_aliases_1 (Arg data,
+					     bool include_overwritable)
+{
+  ipa_ref *ref;
+
+  FOR_EACH_ALIAS (this, ref)
+    {
+      varpool_node *alias = dyn_cast <varpool_node *> (ref->referring);
+      if (include_overwritable
+	  || alias->get_availability () > AVAIL_INTERPOSABLE)
+	if (alias->call_for_symbol_and_aliases <Arg, callback>
+	  (data, include_overwritable))
+	    return true;
+    }
+  return false;
+}
+
 /* Build polymorphic call context for indirect call E.  */
 
 inline
@@ -3094,6 +3256,151 @@ cgraph_local_p (cgraph_node *node)
   return node->local.local && node->instrumented_version->local.local;
 }
 
+inline bool
+cgraph_node::nonremovable_compute_p (cgraph_node *node, void *)
+{
+  return !node->can_remove_if_no_direct_calls_and_refs_p ();
+}
+
+inline bool
+cgraph_node::nonremovable_p (cgraph_node *node, void *)
+{
+  bool retval;
+
+  if (symtab->enable_inline_cache)
+    {
+      if (!node->inline_nonremovable_init)
+        {
+	  node->inline_nonremovable = nonremovable_compute_p (node, NULL);
+	  node->inline_nonremovable_init = true;
+	}
+
+      retval = node->inline_nonremovable;
+
+      gcc_checking_assert (retval == nonremovable_compute_p (node, NULL));
+    }
+  else
+    retval = nonremovable_compute_p (node, NULL);
+
+  return retval;
+}
+
+inline bool
+cgraph_node::can_remove_if_no_direct_calls_compute_p (void)
+{
+  if (DECL_EXTERNAL (decl))
+    return true;
+  if (address_taken)
+    return false;
+
+  return !call_for_symbol_and_aliases <void *, cgraph_node::nonremovable_compute_p>
+    (NULL, true);
+}
+
+/* Return true when function cgraph_node and its aliases can be removed from
+   callgraph if all direct calls are eliminated.  */
+
+inline bool
+cgraph_node::can_remove_if_no_direct_calls_p (void)
+{
+  bool retval;
+
+  if (symtab->enable_inline_cache)
+  {
+    if (!inline_can_remove_if_no_direct_calls_init)
+      {
+	inline_can_remove_if_no_direct_calls = can_remove_if_no_direct_calls_compute_p ();
+	inline_can_remove_if_no_direct_calls_init = true;
+      }
+
+    retval = inline_can_remove_if_no_direct_calls;
+
+    gcc_checking_assert
+      (retval == can_remove_if_no_direct_calls_compute_p ());
+  }
+  else
+    retval = can_remove_if_no_direct_calls_compute_p ();
+
+  return retval;
+}
+
+/* Return true when function cgraph_node can be expected to be removed
+   from program when direct calls in this compilation unit are removed.
+
+   As a special case COMDAT functions are
+   cgraph_can_remove_if_no_direct_calls_p while the are not
+   cgraph_only_called_directly_p (it is possible they are called from other
+   unit)
+
+   This function behaves as cgraph_only_called_directly_p because eliminating
+   all uses of COMDAT function does not make it necessarily disappear from
+   the program unless we are compiling whole program or we do LTO.  In this
+   case we know we win since dynamic linking will not really discard the
+   linkonce section.  */
+
+inline bool
+cgraph_node::will_be_removed_from_program_if_no_direct_calls_compute_p (void)
+{
+  gcc_assert (!global.inlined_to);
+
+  if (call_for_symbol_and_aliases <void *, used_from_object_file_p_worker>
+    (NULL, true))
+      return false;
+  if (!in_lto_p && !flag_whole_program)
+    return only_called_directly_p ();
+  else
+    {
+       if (DECL_EXTERNAL (decl))
+         return true;
+      return can_remove_if_no_direct_calls_p ();
+    }
+}
+
+/* Wrapper for will_be_removed_from_program_if_no_direct_calls_computed_p.  */
+
+inline bool
+cgraph_node::will_be_removed_from_program_if_no_direct_calls_p (void)
+{
+  if (symtab->enable_inline_cache)
+    {
+      if (!inline_will_be_removed_if_no_direct_calls_init)
+        {
+	  inline_will_be_removed_if_no_direct_calls
+	    = will_be_removed_from_program_if_no_direct_calls_compute_p ();
+
+	  inline_will_be_removed_if_no_direct_calls_init = true;
+        }
+
+      gcc_checking_assert (inline_will_be_removed_if_no_direct_calls ==
+	will_be_removed_from_program_if_no_direct_calls_compute_p ());
+      return inline_will_be_removed_if_no_direct_calls;
+    }
+
+  return will_be_removed_from_program_if_no_direct_calls_compute_p ();
+}
+
+/* Worker for cgraph_only_called_directly_p.  */
+
+static bool
+cgraph_not_only_called_directly_p_1 (cgraph_node *node, void *)
+{
+  return !node->only_called_directly_or_aliased_p ();
+}
+
+/* Return true when function cgraph_node and all its aliases are only called
+   directly.
+   i.e. it is not externally visible, address was not taken and
+   it is not used in any other non-standard way.  */
+
+inline bool
+cgraph_node::only_called_directly_p (void)
+{
+  gcc_assert (ultimate_alias_target () == this);
+  return !call_for_symbol_and_aliases (cgraph_not_only_called_directly_p_1,
+				       NULL, true);
+}
+
+
 /* When using fprintf (or similar), problems can arise with
    transient generated strings.  Many string-generation APIs
    only support one result being alive at once (e.g. by
diff --git a/gcc/ipa-inline.c b/gcc/ipa-inline.c
index 287a6dd..8a07e04 100644
--- a/gcc/ipa-inline.c
+++ b/gcc/ipa-inline.c
@@ -1651,6 +1651,7 @@ inline_small_functions (void)
   ipa_reduced_postorder (order, true, true, NULL);
   free (order);
 
+  symtab->enable_inline_cache = true;
   FOR_EACH_DEFINED_FUNCTION (node)
     if (!node->global.inlined_to)
       {
@@ -1966,6 +1967,7 @@ inline_small_functions (void)
 	}
     }
 
+  symtab->enable_inline_cache = false;
   free_growth_caches ();
   if (dump_file)
     fprintf (dump_file,
diff --git a/gcc/lto-streamer.c b/gcc/lto-streamer.c
index 836dce9..542a813 100644
--- a/gcc/lto-streamer.c
+++ b/gcc/lto-streamer.c
@@ -319,11 +319,13 @@ static hash_table<tree_hash_entry> *tree_htab;
 void
 lto_streamer_init (void)
 {
+#ifdef ENABLE_CHECKING
   /* Check that all the TS_* handled by the reader and writer routines
      match exactly the structures defined in treestruct.def.  When a
      new TS_* astructure is added, the streamer should be updated to
      handle it.  */
   streamer_check_handled_ts_structures ();
+#endif
 
 #ifdef LTO_STREAMER_DEBUG
   tree_htab = new hash_table<tree_hash_entry> (31);
diff --git a/gcc/symtab.c b/gcc/symtab.c
index ee47a73..df0950b 100644
--- a/gcc/symtab.c
+++ b/gcc/symtab.c
@@ -1337,9 +1337,9 @@ symtab_node::set_section_for_node (const char *section)
 /* Worker for set_section.  */
 
 bool
-symtab_node::set_section (symtab_node *n, void *s)
+symtab_node::set_section (symtab_node *n, const char *s)
 {
-  n->set_section_for_node ((char *)s);
+  n->set_section_for_node (s);
   return false;
 }
 
@@ -1349,8 +1349,7 @@ void
 symtab_node::set_section (const char *section)
 {
   gcc_assert (!this->alias);
-  call_for_symbol_and_aliases
-    (symtab_node::set_section, const_cast<char *>(section), true);
+  call_for_symbol_and_aliases <const char *, symtab_node::set_section> (section, true);
 }
 
 /* Return the initialization priority.  */
@@ -1491,10 +1490,11 @@ symtab_node::resolve_alias (symtab_node *target)
     {
       error ("section of alias %q+D must match section of its target", decl);
     }
-  call_for_symbol_and_aliases (symtab_node::set_section,
-			     const_cast<char *>(target->get_section ()), true);
+  call_for_symbol_and_aliases <const char *, symtab_node::set_section>
+    (const_cast<char *>(target->get_section ()), true);
   if (target->implicit_section)
-    call_for_symbol_and_aliases (set_implicit_section, NULL, true);
+    call_for_symbol_and_aliases <void *, symtab_node::set_implicit_section>
+      (NULL, true);
 
   /* Alias targets become redundant after alias is resolved into an reference.
      We do not want to keep it around or we would have to mind updating them
@@ -1513,7 +1513,7 @@ symtab_node::resolve_alias (symtab_node *target)
 /* Worker searching noninterposable alias.  */
 
 bool
-symtab_node::noninterposable_alias (symtab_node *node, void *data)
+symtab_node::noninterposable_alias (symtab_node *node, symtab_node **data)
 {
   if (decl_binds_to_current_def_p (node->decl))
     {
@@ -1530,7 +1530,7 @@ symtab_node::noninterposable_alias (symtab_node *node, void *data)
 	  || DECL_ATTRIBUTES (node->decl) != DECL_ATTRIBUTES (fn->decl))
 	return false;
 
-      *(symtab_node **)data = node;
+      *data = node;
       return true;
     }
   return false;
@@ -1550,8 +1550,8 @@ symtab_node::noninterposable_alias (void)
      (if that is already non-overwritable).  */
   symtab_node *node = ultimate_alias_target ();
   gcc_assert (!node->alias && !node->weakref);
-  node->call_for_symbol_and_aliases (symtab_node::noninterposable_alias,
-				   (void *)&new_node, true);
+  node->call_for_symbol_and_aliases
+    <symtab_node **, symtab_node::noninterposable_alias> (&new_node, true);
   if (new_node)
     return new_node;
 #ifndef ASM_OUTPUT_DEF
@@ -1840,10 +1840,8 @@ symtab_node::equal_address_to (symtab_node *s2)
 /* Worker for call_for_symbol_and_aliases.  */
 
 bool
-symtab_node::call_for_symbol_and_aliases_1 (bool (*callback) (symtab_node *,
-							      void *),
-					    void *data,
-					    bool include_overwritable)
+symtab_node::call_for_symbol_and_aliases_1 (bool (*callback) (symtab_node *,void *),
+                                           void *data, bool include_overwritable)
 {
   ipa_ref *ref;
   FOR_EACH_ALIAS (this, ref)
@@ -1857,3 +1855,23 @@ symtab_node::call_for_symbol_and_aliases_1 (bool (*callback) (symtab_node *,
     }
   return false;
 }
+
+/* Worker for call_for_symbol_and_aliases.  */
+
+template <typename Arg, bool (*callback) (symtab_node*, Arg arg)>
+bool
+symtab_node::call_for_symbol_and_aliases_1 (Arg data,
+					    bool include_overwritable)
+{
+  ipa_ref *ref;
+  FOR_EACH_ALIAS (this, ref)
+    {
+      symtab_node *alias = ref->referring;
+      if (include_overwritable
+	  || alias->get_availability () > AVAIL_INTERPOSABLE)
+	if (alias->call_for_symbol_and_aliases <Arg, callback> (data,
+					      include_overwritable))
+	  return true;
+    }
+  return false;
+}
-- 
2.1.2


[-- Attachment #3: cover-letter-chromium.txt --]
[-- Type: text/plain, Size: 9314 bytes --]

Hello.

The following mini patch set speeds up LTO WPA; the numbers below were measured on the Chromium binary:

Before:
Execution times (seconds)
 phase setup             :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall    1977 kB ( 0%) ggc
 phase opt and generate  : 179.87 (66%) usr   1.67 (45%) sys 181.47 (66%) wall 2682287 kB (13%) ggc
 phase stream in         :  92.75 (34%) usr   2.05 (55%) sys  94.77 (34%) wall18738391 kB (87%) ggc
 callgraph optimization  :   0.71 ( 0%) usr   0.00 ( 0%) sys   0.71 ( 0%) wall      16 kB ( 0%) ggc
 ipa dead code removal   :   5.20 ( 2%) usr   0.05 ( 1%) sys   5.26 ( 2%) wall       0 kB ( 0%) ggc
 ipa virtual call target :   3.22 ( 1%) usr   0.03 ( 1%) sys   3.20 ( 1%) wall       0 kB ( 0%) ggc
 ipa devirtualization    :   0.28 ( 0%) usr   0.01 ( 0%) sys   0.26 ( 0%) wall   32638 kB ( 0%) ggc
 ipa cp                  :   4.27 ( 2%) usr   0.24 ( 6%) sys   4.55 ( 2%) wall  851324 kB ( 4%) ggc
 ipa inlining heuristics : 127.09 (47%) usr   0.27 ( 7%) sys 127.25 (46%) wall  807884 kB ( 4%) ggc
 ipa comdats             :   0.57 ( 0%) usr   0.00 ( 0%) sys   0.57 ( 0%) wall       0 kB ( 0%) ggc
 ipa lto gimple in       :   5.47 ( 2%) usr   0.92 (25%) sys   6.37 ( 2%) wall 1370242 kB ( 6%) ggc
 ipa lto decl in         :  79.23 (29%) usr   1.32 (35%) sys  80.53 (29%) wall16957392 kB (79%) ggc
 ipa lto constructors in :   0.33 ( 0%) usr   0.03 ( 1%) sys   0.44 ( 0%) wall   22897 kB ( 0%) ggc
 ipa lto cgraph I/O      :   1.41 ( 1%) usr   0.21 ( 6%) sys   1.62 ( 1%) wall  901987 kB ( 4%) ggc
 ipa lto decl merge      :   3.22 ( 1%) usr   0.00 ( 0%) sys   3.22 ( 1%) wall   16383 kB ( 0%) ggc
 ipa lto cgraph merge    :   5.10 ( 2%) usr   0.01 ( 0%) sys   5.11 ( 2%) wall   20432 kB ( 0%) ggc
 whopr wpa               :   1.95 ( 1%) usr   0.00 ( 0%) sys   1.94 ( 1%) wall       2 kB ( 0%) ggc
 whopr partitioning      :   5.22 ( 2%) usr   0.01 ( 0%) sys   5.23 ( 2%) wall    7800 kB ( 0%) ggc
 ipa reference           :   2.97 ( 1%) usr   0.06 ( 2%) sys   3.02 ( 1%) wall       0 kB ( 0%) ggc
 ipa profile             :   0.52 ( 0%) usr   0.04 ( 1%) sys   0.56 ( 0%) wall       0 kB ( 0%) ggc
 ipa pure const          :   3.51 ( 1%) usr   0.04 ( 1%) sys   3.56 ( 1%) wall       0 kB ( 0%) ggc
 ipa icf                 :  19.33 ( 7%) usr   0.12 ( 3%) sys  19.52 ( 7%) wall    3089 kB ( 0%) ggc
 tree SSA rewrite        :   0.35 ( 0%) usr   0.02 ( 1%) sys   0.37 ( 0%) wall   51191 kB ( 0%) ggc
 tree SSA other          :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 tree SSA incremental    :   0.48 ( 0%) usr   0.06 ( 2%) sys   0.37 ( 0%) wall   33552 kB ( 0%) ggc
 tree operand scan       :   0.41 ( 0%) usr   0.08 ( 2%) sys   0.53 ( 0%) wall  343835 kB ( 2%) ggc
 dominance frontiers     :   0.04 ( 0%) usr   0.01 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 dominance computation   :   0.36 ( 0%) usr   0.09 ( 2%) sys   0.55 ( 0%) wall       0 kB ( 0%) ggc
 varconst                :   0.03 ( 0%) usr   0.03 ( 1%) sys   0.06 ( 0%) wall       0 kB ( 0%) ggc
 loop fini               :   0.08 ( 0%) usr   0.00 ( 0%) sys   0.09 ( 0%) wall       0 kB ( 0%) ggc
 unaccounted todo        :   1.18 ( 0%) usr   0.00 ( 0%) sys   1.19 ( 0%) wall       0 kB ( 0%) ggc
 TOTAL                 : 272.63             3.72           276.25           21422657 kB

After:

Execution times (seconds)
 phase setup             :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall    1977 kB ( 0%) ggc
 phase opt and generate  :  73.30 (43%) usr   1.79 (44%) sys  75.06 (43%) wall 2682287 kB (13%) ggc
 phase stream in         :  95.72 (57%) usr   2.25 (56%) sys  97.94 (57%) wall18738391 kB (87%) ggc
 callgraph optimization  :   0.75 ( 0%) usr   0.00 ( 0%) sys   0.76 ( 0%) wall      16 kB ( 0%) ggc
 ipa dead code removal   :   5.19 ( 3%) usr   0.03 ( 1%) sys   5.25 ( 3%) wall       0 kB ( 0%) ggc
 ipa virtual call target :   2.81 ( 2%) usr   0.03 ( 1%) sys   3.15 ( 2%) wall       0 kB ( 0%) ggc
 ipa devirtualization    :   0.29 ( 0%) usr   0.00 ( 0%) sys   0.26 ( 0%) wall   32638 kB ( 0%) ggc
 ipa cp                  :   4.59 ( 3%) usr   0.24 ( 6%) sys   4.76 ( 3%) wall  851324 kB ( 4%) ggc
 ipa inlining heuristics :  22.09 (13%) usr   0.26 ( 6%) sys  22.20 (13%) wall  807884 kB ( 4%) ggc
 ipa comdats             :   0.57 ( 0%) usr   0.00 ( 0%) sys   0.57 ( 0%) wall       0 kB ( 0%) ggc
 ipa lto gimple in       :   5.67 ( 3%) usr   0.93 (23%) sys   6.51 ( 4%) wall 1370242 kB ( 6%) ggc
 ipa lto decl in         :  81.86 (48%) usr   1.45 (36%) sys  83.29 (48%) wall16957392 kB (79%) ggc
 ipa lto constructors in :   0.41 ( 0%) usr   0.09 ( 2%) sys   0.36 ( 0%) wall   22897 kB ( 0%) ggc
 ipa lto cgraph I/O      :   1.49 ( 1%) usr   0.25 ( 6%) sys   1.73 ( 1%) wall  901987 kB ( 4%) ggc
 ipa lto decl merge      :   3.55 ( 2%) usr   0.00 ( 0%) sys   3.55 ( 2%) wall   16383 kB ( 0%) ggc
 ipa lto cgraph merge    :   5.05 ( 3%) usr   0.00 ( 0%) sys   5.07 ( 3%) wall   20432 kB ( 0%) ggc
 whopr wpa               :   1.88 ( 1%) usr   0.00 ( 0%) sys   1.86 ( 1%) wall       2 kB ( 0%) ggc
 whopr partitioning      :   4.89 ( 3%) usr   0.02 ( 0%) sys   4.90 ( 3%) wall    7800 kB ( 0%) ggc
 ipa reference           :   2.85 ( 2%) usr   0.05 ( 1%) sys   2.91 ( 2%) wall       0 kB ( 0%) ggc
 ipa profile             :   0.55 ( 0%) usr   0.04 ( 1%) sys   0.59 ( 0%) wall       0 kB ( 0%) ggc
 ipa pure const          :   3.28 ( 2%) usr   0.04 ( 1%) sys   3.33 ( 2%) wall       0 kB ( 0%) ggc
 ipa icf                 :  18.23 (11%) usr   0.12 ( 3%) sys  18.29 (11%) wall    3089 kB ( 0%) ggc
 tree SSA rewrite        :   0.26 ( 0%) usr   0.04 ( 1%) sys   0.32 ( 0%) wall   51191 kB ( 0%) ggc
 tree SSA other          :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree SSA incremental    :   0.51 ( 0%) usr   0.16 ( 4%) sys   0.60 ( 0%) wall   33552 kB ( 0%) ggc
 tree operand scan       :   0.36 ( 0%) usr   0.13 ( 3%) sys   0.49 ( 0%) wall  343835 kB ( 2%) ggc
 dominance frontiers     :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 dominance computation   :   0.39 ( 0%) usr   0.06 ( 1%) sys   0.63 ( 0%) wall       0 kB ( 0%) ggc
 varconst                :   0.05 ( 0%) usr   0.04 ( 1%) sys   0.06 ( 0%) wall       0 kB ( 0%) ggc
 loop fini               :   0.10 ( 0%) usr   0.00 ( 0%) sys   0.12 ( 0%) wall       0 kB ( 0%) ggc
 unaccounted todo        :   1.26 ( 1%) usr   0.00 ( 0%) sys   1.26 ( 1%) wall       0 kB ( 0%) ggc
 TOTAL                 : 169.02             4.04           173.00           21422657 kB
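
As a quick check of the two reports above, the relative saving can be computed from the wall-clock columns. Total wall time drops from 276.25 s to 173.00 s, roughly 37%, and "ipa inlining heuristics" drops from 127.25 s to 22.20 s, roughly 83%. The helper below is a hypothetical illustration of that arithmetic, not part of the patch.

```cpp
#include <cassert>

// Fraction of time saved when going from BEFORE seconds to AFTER seconds.
// Hypothetical helper, used only to reproduce the percentages quoted above.
static double relative_reduction (double before, double after)
{
  assert (before > 0.0);
  return (before - after) / before;
}
```

Calling relative_reduction (276.25, 173.00) gives the total saving, and relative_reduction (127.25, 22.20) gives the saving in the inlining heuristics alone.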

perf report after:

    10.17%  lto1-wpa  lto1               [.] inflate_fast
     3.74%  lto1-wpa  lto1               [.] compare_tree_sccs_1(tree_node*, tree_node*, tree_node***)
     3.56%  lto1-wpa  lto1               [.] streamer_read_uhwi(lto_input_block*)
     3.16%  lto1-wpa  lto1               [.] ht_lookup_with_hash(ht*, unsigned char const*, unsigned long, unsigned int, ht_lookup_option)
     3.01%  lto1-wpa  lto1               [.] unify_scc(streamer_tree_cache_d*, unsigned int, unsigned int, unsigned int, unsigned int)
     2.69%  lto1-wpa  lto1               [.] streamer_read_tree_bitfields(lto_input_block*, data_in*, tree_node*)
     2.16%  lto1-wpa  lto1               [.] lto_cgraph_replace_node(cgraph_node*, cgraph_node*)
     2.00%  lto1-wpa  lto1               [.] streamer_get_pickled_tree(lto_input_block*, data_in*)
     2.00%  lto1-wpa  libc-2.19.so       [.] msort_with_tmp.part.0
     1.91%  lto1-wpa  lto1               [.] ipa_icf::sem_variable::equals(tree_node*, tree_node*)
     1.72%  lto1-wpa  libc-2.19.so       [.] _int_malloc
     1.70%  lto1-wpa  lto1               [.] symbol_table::remove_unreachable_nodes(_IO_FILE*)
     1.54%  lto1-wpa  lto1               [.] lto_input_tree_1(lto_input_block*, data_in*, LTO_tags, unsigned int)
     1.33%  lto1-wpa  lto1               [.] inflate
     1.21%  lto1-wpa  lto1               [.] adler32
     1.16%  lto1-wpa  lto1               [.] cgraph_node::call_for_symbol_thunks_and_aliases(bool (*)(cgraph_node*, void*), void*, bool, bool)
     1.11%  lto1-wpa  lto1               [.] lto_input_tree(lto_input_block*, data_in*)
     1.07%  lto1-wpa  lto1               [.] streamer_read_tree_body(lto_input_block*, data_in*, tree_node*)
     1.03%  lto1-wpa  lto1               [.] lto_input_location(bitpack_d*, data_in*)
     1.01%  lto1-wpa  lto1               [.] htab_hash_string
     0.99%  lto1-wpa  lto1               [.] estimate_calls_size_and_time(cgraph_node*, int*, int*, int*, int*, unsigned int, vec<tree_node*, va_heap, vl_ptr>, vec<ipa_polymorphic_call_context, va_heap, vl_ptr>, vec<ipa_agg_jump_function*, va_heap, vl_ptr>) [clone .isra.137]
     0.92%  lto1-wpa  lto1               [.] ht_lookup(ht*, unsigned char const*, unsigned long, ht_lookup_option)
     0.92%  lto1-wpa  lto1               [.] ggc_internal_alloc(unsigned long, void (*)(void*), unsigned long, unsigned long)
     0.86%  lto1-wpa  lto1               [.] splay_tree_splay
     0.83%  lto1-wpa  lto1               [.] bp_unpack_var_len_unsigned(bitpack_d*)
     0.80%  lto1-wpa  libc-2.19.so       [.] malloc_consolidate
     0.77%  lto1-wpa  lto1               [.] can_inline_edge_p(cgraph_edge*, bool, bool)
     0.72%  lto1-wpa  lto1               [.] gimple_has_body_p(tree_node*)



Thanks,
Martin


* Re: [RFC, PATCH] LTO: IPA inline speed up for large apps (Chrome)
  2015-02-17 17:14 [RFC, PATCH] LTO: IPA inline speed up for large apps (Chrome) Martin Liška
@ 2015-02-17 18:38 ` Jan Hubicka
  2015-02-18 10:28   ` Martin Liška
  2015-02-17 21:03 ` Jan Hubicka
  1 sibling, 1 reply; 6+ messages in thread
From: Jan Hubicka @ 2015-02-17 18:38 UTC (permalink / raw)
  To: Martin Liška; +Cc: GCC Patches, hubicka >> Jan Hubicka

Hi,
thanks for working on it.  There are basically 3 independent changes in the patch:
 - The patch to make the checking in lto_streamer_init ENABLE_CHECKING only, which I
   think can be committed as obvious.
 - Templates for call_for_symbol_and_aliases
   I do not think these should be strictly necessary for performance, because once we
   spend too much time in these we are a bit screwed.
   I however see it also makes things a bit nicer by not needing typecasts on the data pointer.
   Perhaps that could be cleaned up further?

   An alternative would be to implement the FOR_EACH_ALIAS macro with a tree walking iterator.
   You have all the structure needed to not require a stack.  The iterator will contain a
   root node, the current node and an index to the ref.
   This may be even easier to use and will probably wind up generating about the same code,
   given that the for-each template needs to produce a self-recursive function anyway.

   I would not care about call_for_symbol_thunks_and_aliases.  That function is heavy, since it
   walks all callers anyway, and should not be used in hot code.
   I have a patch that removes its use from the inliner - it is more or less a leftover from the time
   we represented thunks as special aliases instead of functions w/o gimple body.
 - The caching itself.

I will look into the caching in detail.  I am not quite sure I like the idea of exposing an
inline-only cache in cgraph.h.  You could just keep the predicates as they are, but have inline_
variants in ipa-inline.h that do the caching for you.

Allocating the bits directly in cgraph_node is probably OK; we don't really have a shortage there
and it can be revisited easily later...
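
To make the inline_ variant idea concrete, here is a minimal sketch.  It assumes a
toy node type; toy_node, can_remove_p, inline_can_remove_p and inline_reset_cache are
hypothetical names for illustration only and are not taken from cgraph.h or ipa-inline.h.
The predicate itself stays untouched, and a wrapper stores its result in two bits on
the node (one "computed" bit, one "value" bit):

```cpp
#include <cassert>

// Hypothetical stand-in for cgraph_node; the real class is far larger.
struct toy_node
{
  int walk_count = 0;          // counts how often the expensive walk runs
  bool address_taken = false;  // input of the toy predicate

  // Two cache bits for one predicate: "already computed" and "cached value".
  unsigned removable_cached : 1;
  unsigned removable_value : 1;

  toy_node () : removable_cached (0), removable_value (0) {}

  // Stand-in for an expensive predicate that would walk aliases and callers.
  bool can_remove_p ()
  {
    ++walk_count;
    return !address_taken;
  }
};

// Caching variant: runs the expensive predicate at most once per node
// until the cache is reset.
inline bool
inline_can_remove_p (toy_node *node)
{
  assert (node != nullptr);
  if (!node->removable_cached)
    {
      node->removable_value = node->can_remove_p ();
      node->removable_cached = 1;
    }
  return node->removable_value;
}

// Must be called whenever something the predicate depends on changes;
// otherwise the wrapper keeps returning the stale value.
inline void
inline_reset_cache (toy_node *node)
{
  node->removable_cached = 0;
}
```

Such a cache is only valid while the inputs of the predicate do not change, which is
the assumption the original patch makes for the duration of inline_small_functions.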

Honza


* Re: [RFC, PATCH] LTO: IPA inline speed up for large apps (Chrome)
  2015-02-17 17:14 [RFC, PATCH] LTO: IPA inline speed up for large apps (Chrome) Martin Liška
  2015-02-17 18:38 ` Jan Hubicka
@ 2015-02-17 21:03 ` Jan Hubicka
  2015-02-18 13:58   ` Martin Liška
  1 sibling, 1 reply; 6+ messages in thread
From: Jan Hubicka @ 2015-02-17 21:03 UTC (permalink / raw)
  To: Martin Liška; +Cc: GCC Patches, hubicka >> Jan Hubicka

Hi,
this patch should chase away the expensive thunk and alias walks from most
of the analysis code.  I think the only real use left is the local_p predicate,
which needs to stay because i386 expects the local flag to match between caller
and callee when expanding an assembler thunk.  I at least optimized it by first
making the walk conditional for nonlocal functions only and then reorganizing
call_for_symbol_thunks_and_aliases to first inspect aliases (that is cheap) and
only then work on thunks.  Most likely this will find the non-local thunk/alias
faster.  The other cases were leftovers from the conversion of thunks from
aliases to functions.

I also noticed a bug in ipa-profile: it does not disable all the
transforms when !flag_ipa_profile is set on OPTIMIZATION_NODE.  I fixed it.

Bootstrapped/regtested x86_64-linux, committed.  I would be interested to
know if call_for_symbol_thunks_and_aliases is now off your oprofiles
(sorry, easier to type than perf-profiles)

Honza

	* ipa-visibility.c (function_and_variable_visibility): Only
	check locality if node is not already local.
	* ipa-inline.c (want_inline_function_to_all_callers_p): Use
	call_for_symbol_and_aliases instead of
	call_for_symbol_thunks_and_aliases.
	(ipa_inline): Likewise.
	* cgraph.c (cgraph_node::call_for_symbol_thunks_and_aliases):
	first walk aliases.
	* ipa.c (symbol_table::remove_unreachable_nodes): Use
	call_for_symbol_and_aliases.
	* ipa-profile.c (ipa_propagate_frequency_data): Add function_symbol.
	(ipa_propagate_frequency_1): Use it; use opt_for_fn
	(ipa_propagate_frequency): Update.
	(ipa_profile): Add opt_for_fn guards.
Index: ipa-visibility.c
===================================================================
--- ipa-visibility.c	(revision 220741)
+++ ipa-visibility.c	(working copy)
@@ -595,7 +595,8 @@ function_and_variable_visibility (bool w
     }
   FOR_EACH_DEFINED_FUNCTION (node)
     {
-      node->local.local |= node->local_p ();
+      if (!node->local.local)
+        node->local.local |= node->local_p ();
 
       /* If we know that function can not be overwritten by a different semantics
 	 and moreover its section can not be discarded, replace all direct calls
Index: ipa-inline.c
===================================================================
--- ipa-inline.c	(revision 220741)
+++ ipa-inline.c	(working copy)
@@ -975,14 +975,14 @@ want_inline_function_to_all_callers_p (s
   if (node->global.inlined_to)
     return false;
   /* Does it have callers?  */
-  if (!node->call_for_symbol_thunks_and_aliases (has_caller_p, NULL, true))
+  if (!node->call_for_symbol_and_aliases (has_caller_p, NULL, true))
     return false;
   /* Inlining into all callers would increase size?  */
   if (estimate_growth (node) > 0)
     return false;
   /* All inlines must be possible.  */
-  if (node->call_for_symbol_thunks_and_aliases (check_callers, &has_hot_call,
-						true))
+  if (node->call_for_symbol_and_aliases (check_callers, &has_hot_call,
+					 true))
     return false;
   if (!cold && !has_hot_call)
     return false;
@@ -2359,9 +2359,9 @@ ipa_inline (void)
 	  if (want_inline_function_to_all_callers_p (node, cold))
 	    {
 	      int num_calls = 0;
-	      node->call_for_symbol_thunks_and_aliases (sum_callers, &num_calls,
-						      true);
-	      while (node->call_for_symbol_thunks_and_aliases
+	      node->call_for_symbol_and_aliases (sum_callers, &num_calls,
+						 true);
+	      while (node->call_for_symbol_and_aliases
 		       (inline_to_all_callers, &num_calls, true))
 		;
 	      remove_functions = true;
Index: cgraph.c
===================================================================
--- cgraph.c	(revision 220741)
+++ cgraph.c	(working copy)
@@ -2191,6 +2191,16 @@ cgraph_node::call_for_symbol_thunks_and_
 
   if (callback (this, data))
     return true;
+  FOR_EACH_ALIAS (this, ref)
+    {
+      cgraph_node *alias = dyn_cast <cgraph_node *> (ref->referring);
+      if (include_overwritable
+	  || alias->get_availability () > AVAIL_INTERPOSABLE)
+	if (alias->call_for_symbol_thunks_and_aliases (callback, data,
+						     include_overwritable,
+						     exclude_virtual_thunks))
+	  return true;
+    }
   for (e = callers; e; e = e->next_caller)
     if (e->caller->thunk.thunk_p
 	&& (include_overwritable
@@ -2202,16 +2212,6 @@ cgraph_node::call_for_symbol_thunks_and_
 						       exclude_virtual_thunks))
 	return true;
 
-  FOR_EACH_ALIAS (this, ref)
-    {
-      cgraph_node *alias = dyn_cast <cgraph_node *> (ref->referring);
-      if (include_overwritable
-	  || alias->get_availability () > AVAIL_INTERPOSABLE)
-	if (alias->call_for_symbol_thunks_and_aliases (callback, data,
-						     include_overwritable,
-						     exclude_virtual_thunks))
-	  return true;
-    }
   return false;
 }
 
Index: ipa.c
===================================================================
--- ipa.c	(revision 220741)
+++ ipa.c	(working copy)
@@ -661,7 +661,7 @@ symbol_table::remove_unreachable_nodes (
     if (node->address_taken
 	&& !node->used_from_other_partition)
       {
-	if (!node->call_for_symbol_thunks_and_aliases
+	if (!node->call_for_symbol_and_aliases
 	    (has_addr_references_p, NULL, true)
 	    && (!node->instrumentation_clone
 		|| !node->instrumented_version
Index: ipa-profile.c
===================================================================
--- ipa-profile.c	(revision 220741)
+++ ipa-profile.c	(working copy)
@@ -322,6 +322,7 @@ ipa_profile_read_summary (void)
 
 struct ipa_propagate_frequency_data
 {
+  cgraph_node *function_symbol;
   bool maybe_unlikely_executed;
   bool maybe_executed_once;
   bool only_called_at_startup;
@@ -342,7 +343,7 @@ ipa_propagate_frequency_1 (struct cgraph
 	        || d->only_called_at_startup || d->only_called_at_exit);
        edge = edge->next_caller)
     {
-      if (edge->caller != node)
+      if (edge->caller != d->function_symbol)
 	{
           d->only_called_at_startup &= edge->caller->only_called_at_startup;
 	  /* It makes sense to put main() together with the static constructors.
@@ -358,7 +359,11 @@ ipa_propagate_frequency_1 (struct cgraph
 	 errors can make us to push function into unlikely section even when
 	 it is executed by the train run.  Transfer the function only if all
 	 callers are unlikely executed.  */
-      if (profile_info && flag_branch_probabilities
+      if (profile_info
+	  && opt_for_fn (d->function_symbol->decl, flag_branch_probabilities)
+	  /* Thunks are not profiled.  This is more or less implementation
+	     bug.  */
+	  && !d->function_symbol->thunk.thunk_p
 	  && (edge->caller->frequency != NODE_FREQUENCY_UNLIKELY_EXECUTED
 	      || (edge->caller->global.inlined_to
 		  && edge->caller->global.inlined_to->frequency
@@ -418,7 +423,7 @@ contains_hot_call_p (struct cgraph_node
 bool
 ipa_propagate_frequency (struct cgraph_node *node)
 {
-  struct ipa_propagate_frequency_data d = {true, true, true, true};
+  struct ipa_propagate_frequency_data d = {node, true, true, true, true};
   bool changed = false;
 
   /* We can not propagate anything useful about externally visible functions
@@ -432,8 +437,8 @@ ipa_propagate_frequency (struct cgraph_n
   if (dump_file && (dump_flags & TDF_DETAILS))
     fprintf (dump_file, "Processing frequency %s\n", node->name ());
 
-  node->call_for_symbol_thunks_and_aliases (ipa_propagate_frequency_1, &d,
-					    true);
+  node->call_for_symbol_and_aliases (ipa_propagate_frequency_1, &d,
+				     true);
 
   if ((d.only_called_at_startup && !d.only_called_at_exit)
       && !node->only_called_at_startup)
@@ -597,6 +602,9 @@ ipa_profile (void)
     {
       bool update = false;
 
+      if (!opt_for_fn (n->decl, flag_ipa_profile))
+	continue;
+
       for (e = n->indirect_calls; e; e = e->next_callee)
 	{
 	  if (n->count)
@@ -697,7 +705,9 @@ ipa_profile (void)
   order_pos = ipa_reverse_postorder (order);
   for (i = order_pos - 1; i >= 0; i--)
     {
-      if (order[i]->local.local && ipa_propagate_frequency (order[i]))
+      if (order[i]->local.local
+	  && opt_for_fn (order[i]->decl, flag_ipa_profile)
+	  && ipa_propagate_frequency (order[i]))
 	{
 	  for (e = order[i]->callees; e; e = e->next_callee)
 	    if (e->callee->local.local && !e->callee->aux)
@@ -714,7 +724,9 @@ ipa_profile (void)
       something_changed = false;
       for (i = order_pos - 1; i >= 0; i--)
 	{
-	  if (order[i]->aux && ipa_propagate_frequency (order[i]))
+	  if (order[i]->aux
+	      && opt_for_fn (order[i]->decl, flag_ipa_profile)
+	      && ipa_propagate_frequency (order[i]))
 	    {
 	      for (e = order[i]->callees; e; e = e->next_callee)
 		if (e->callee->local.local && !e->callee->aux)


* Re: [RFC, PATCH] LTO: IPA inline speed up for large apps (Chrome)
  2015-02-17 18:38 ` Jan Hubicka
@ 2015-02-18 10:28   ` Martin Liška
  0 siblings, 0 replies; 6+ messages in thread
From: Martin Liška @ 2015-02-18 10:28 UTC (permalink / raw)
  To: Jan Hubicka, GCC Patches

[-- Attachment #1: Type: text/plain, Size: 2161 bytes --]

On 02/17/2015 07:38 PM, Jan Hubicka wrote:
> Hi,
> thanks for working on it.  There are basically 3 independent changes in the patch:
>   - The patch to make the checking in lto_streamer_init ENABLE_CHECKING only, which I
>     think can be committed as obvious.

Hello.

This email contains a fix for that, which I'm going to install.

>   - Templates for call_for_symbol_and_aliases
>     I do not think these should be strictly necessary for performance, because once we
>     spend too much time in these we are a bit screwed.
>     I however see it also makes things a bit nicer by not needing typecasts on the data pointer.
>     Perhaps that could be cleaned up further?
>
>     An alternative would be to implement the FOR_EACH_ALIAS macro with a tree walking iterator.
>     You have all the structure needed to not require a stack.  The iterator will contain a
>     root node, the current node and an index to the ref.
>     This may be even easier to use and will probably wind up generating about the same code,
>     given that the for-each template needs to produce a self-recursive function anyway.
>
>     I would not care about call_for_symbol_thunks_and_aliases.  That function is heavy, since it
>     walks all callers anyway, and should not be used in hot code.
>     I have a patch that removes its use from the inliner - it is more or less a leftover from the time
>     we represented thunks as special aliases instead of functions w/o gimple body.

Yes, I was also thinking about a flat iterator capable of iterating over thunks/aliases, and
I prefer that approach to recursive functions. I think we can prepare it for the next release;
as you said, it does not bring much of a performance gain.
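
As a rough illustration of such a flat iterator, here is a minimal sketch over a toy
alias tree.  All names (toy_symbol, alias_iterator, collect_ids) are hypothetical and
do not exist in GCC.  The sketch assumes that every alias knows its target and its own
position in the target's alias list, so the walk needs no stack and no recursion; unlike
the description quoted above, the index lives in the node rather than in the iterator.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a symbol table node.
struct toy_symbol
{
  int id;
  toy_symbol *target = nullptr;       // symbol this one aliases, if any
  std::size_t index_in_target = 0;    // position in target->aliases
  std::vector<toy_symbol *> aliases;  // symbols that alias this one

  explicit toy_symbol (int id_) : id (id_) {}

  void add_alias (toy_symbol *alias)
  {
    alias->target = this;
    alias->index_in_target = aliases.size ();
    aliases.push_back (alias);
  }
};

// Pre-order walk of ROOT and all of its transitive aliases.
class alias_iterator
{
public:
  explicit alias_iterator (toy_symbol *root)
    : m_root (root), m_current (root) {}

  // Returns nullptr once the walk is finished.
  toy_symbol *current () const { return m_current; }

  void next ()
  {
    assert (m_current != nullptr);
    // Descend to the first alias if there is one.
    if (!m_current->aliases.empty ())
      {
        m_current = m_current->aliases[0];
        return;
      }
    // Otherwise climb until some ancestor has an unvisited sibling,
    // stopping when we are back at the root.
    while (m_current != m_root)
      {
        toy_symbol *parent = m_current->target;
        std::size_t next_index = m_current->index_in_target + 1;
        if (next_index < parent->aliases.size ())
          {
            m_current = parent->aliases[next_index];
            return;
          }
        m_current = parent;
      }
    m_current = nullptr;
  }

private:
  toy_symbol *m_root;
  toy_symbol *m_current;
};

// Usage example: collect the ids in visiting order.
inline std::vector<int>
collect_ids (toy_symbol *root)
{
  std::vector<int> ids;
  for (alias_iterator it (root); it.current (); it.next ())
    ids.push_back (it.current ()->id);
  return ids;
}
```

Extending this to thunks would additionally require walking the callers, which is the
expensive part discussed above and is deliberately left out of the sketch.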

>   - the caching itself.
>
> I will look into the caching in detail.  I am not quite sure I like the idea of exposing an
> inline-only cache in cgraph.h.  You could just keep the predicates as they are, but have inline_
> variants in ipa-inline.h that do the caching for you.
>
> Allocating the bits directly in cgraph_node is probably OK; we don't really have a shortage there
> and it can be revisited easily later...
>
> Honza
>

Please take a look at the caching; it is the crucial part of the speed improvement.

Martin

[-- Attachment #2: 0001-Add-checking-macro-within-lto_streamer_init.patch --]
[-- Type: text/x-patch, Size: 1076 bytes --]

From eb9d34244c43ae1d0576b2ae1002f5267c6cd547 Mon Sep 17 00:00:00 2001
From: mliska <mliska@suse.cz>
Date: Wed, 18 Feb 2015 11:18:47 +0100
Subject: [PATCH] Add checking macro within lto_streamer_init.

gcc/ChangeLog:

2015-02-18  Martin Liska  <mliska@suse.cz>

	* lto-streamer.c (lto_streamer_init): Encapsulate
	streamer_check_handled_ts_structures with checking macro.
---
 gcc/lto-streamer.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/gcc/lto-streamer.c b/gcc/lto-streamer.c
index 836dce9..542a813 100644
--- a/gcc/lto-streamer.c
+++ b/gcc/lto-streamer.c
@@ -319,11 +319,13 @@ static hash_table<tree_hash_entry> *tree_htab;
 void
 lto_streamer_init (void)
 {
+#ifdef ENABLE_CHECKING
   /* Check that all the TS_* handled by the reader and writer routines
      match exactly the structures defined in treestruct.def.  When a
      new TS_* astructure is added, the streamer should be updated to
      handle it.  */
   streamer_check_handled_ts_structures ();
+#endif
 
 #ifdef LTO_STREAMER_DEBUG
   tree_htab = new hash_table<tree_hash_entry> (31);
-- 
2.1.2



* Re: [RFC, PATCH] LTO: IPA inline speed up for large apps (Chrome)
  2015-02-17 21:03 ` Jan Hubicka
@ 2015-02-18 13:58   ` Martin Liška
  2015-02-18 14:13     ` Martin Liška
  0 siblings, 1 reply; 6+ messages in thread
From: Martin Liška @ 2015-02-18 13:58 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 9240 bytes --]

On 02/17/2015 10:03 PM, Jan Hubicka wrote:
> Hi,
> this patch should chase away the expensive thunk and alias walks from most
> of the analysis code.  I think the only real use left is the local_p predicate,
> which needs to stay because i386 expects the local flag to match between caller
> and callee when expanding an assembler thunk.  I at least optimized it by first
> making the walk conditional for nonlocal functions only and then reorganizing
> call_for_symbol_thunks_and_aliases to first inspect aliases (that is cheap) and
> only then work on thunks.  Most likely this will find the non-local thunk/alias
> faster.  The other cases were leftovers from the conversion of thunks from
> aliases to functions.
>
> I also noticed a bug in ipa-profile: it does not disable all the
> transforms when !flag_ipa_profile is set on OPTIMIZATION_NODE.  I fixed it.
>
> Bootstrapped/regtested x86_64-linux, committed.  I would be interested to
> know if call_for_symbol_thunks_and_aliases is now off your oprofiles
> (sorry, easier to type than perf-profiles)
>
> Honza
>
> 	* ipa-visibility.c (function_and_variable_visibility): Only
> 	check locality if node is not already local.
> 	* ipa-inline.c (want_inline_function_to_all_callers_p): Use
> 	call_for_symbol_and_aliases instead of
> 	call_for_symbol_thunks_and_aliases.
> 	(ipa_inline): Likewise.
> 	* cgraph.c (cgraph_node::call_for_symbol_thunks_and_aliases):
> 	first walk aliases.
> 	* ipa.c (symbol_table::remove_unreachable_nodes): Use
> 	call_for_symbol_and_aliases.
> 	* ipa-profile.c (ipa_propagate_frequency_data): Add function_symbol.
> 	(ipa_propagate_frequency_1): Use it; use opt_for_fn
> 	(ipa_propagate_frequency): Update.
> 	(ipa_profile): Add opt_for_fn guards.
> Index: ipa-visibility.c
> ===================================================================
> --- ipa-visibility.c	(revision 220741)
> +++ ipa-visibility.c	(working copy)
> @@ -595,7 +595,8 @@ function_and_variable_visibility (bool w
>       }
>     FOR_EACH_DEFINED_FUNCTION (node)
>       {
> -      node->local.local |= node->local_p ();
> +      if (!node->local.local)
> +        node->local.local |= node->local_p ();
>
>         /* If we know that function can not be overwritten by a different semantics
>   	 and moreover its section can not be discarded, replace all direct calls
> Index: ipa-inline.c
> ===================================================================
> --- ipa-inline.c	(revision 220741)
> +++ ipa-inline.c	(working copy)
> @@ -975,14 +975,14 @@ want_inline_function_to_all_callers_p (s
>     if (node->global.inlined_to)
>       return false;
>     /* Does it have callers?  */
> -  if (!node->call_for_symbol_thunks_and_aliases (has_caller_p, NULL, true))
> +  if (!node->call_for_symbol_and_aliases (has_caller_p, NULL, true))
>       return false;
>     /* Inlining into all callers would increase size?  */
>     if (estimate_growth (node) > 0)
>       return false;
>     /* All inlines must be possible.  */
> -  if (node->call_for_symbol_thunks_and_aliases (check_callers, &has_hot_call,
> -						true))
> +  if (node->call_for_symbol_and_aliases (check_callers, &has_hot_call,
> +					 true))
>       return false;
>     if (!cold && !has_hot_call)
>       return false;
> @@ -2359,9 +2359,9 @@ ipa_inline (void)
>   	  if (want_inline_function_to_all_callers_p (node, cold))
>   	    {
>   	      int num_calls = 0;
> -	      node->call_for_symbol_thunks_and_aliases (sum_callers, &num_calls,
> -						      true);
> -	      while (node->call_for_symbol_thunks_and_aliases
> +	      node->call_for_symbol_and_aliases (sum_callers, &num_calls,
> +						 true);
> +	      while (node->call_for_symbol_and_aliases
>   		       (inline_to_all_callers, &num_calls, true))
>   		;
>   	      remove_functions = true;
> Index: cgraph.c
> ===================================================================
> --- cgraph.c	(revision 220741)
> +++ cgraph.c	(working copy)
> @@ -2191,6 +2191,16 @@ cgraph_node::call_for_symbol_thunks_and_
>
>     if (callback (this, data))
>       return true;
> +  FOR_EACH_ALIAS (this, ref)
> +    {
> +      cgraph_node *alias = dyn_cast <cgraph_node *> (ref->referring);
> +      if (include_overwritable
> +	  || alias->get_availability () > AVAIL_INTERPOSABLE)
> +	if (alias->call_for_symbol_thunks_and_aliases (callback, data,
> +						     include_overwritable,
> +						     exclude_virtual_thunks))
> +	  return true;
> +    }
>     for (e = callers; e; e = e->next_caller)
>       if (e->caller->thunk.thunk_p
>   	&& (include_overwritable
> @@ -2202,16 +2212,6 @@ cgraph_node::call_for_symbol_thunks_and_
>   						       exclude_virtual_thunks))
>   	return true;
>
> -  FOR_EACH_ALIAS (this, ref)
> -    {
> -      cgraph_node *alias = dyn_cast <cgraph_node *> (ref->referring);
> -      if (include_overwritable
> -	  || alias->get_availability () > AVAIL_INTERPOSABLE)
> -	if (alias->call_for_symbol_thunks_and_aliases (callback, data,
> -						     include_overwritable,
> -						     exclude_virtual_thunks))
> -	  return true;
> -    }
>     return false;
>   }
>
> Index: ipa.c
> ===================================================================
> --- ipa.c	(revision 220741)
> +++ ipa.c	(working copy)
> @@ -661,7 +661,7 @@ symbol_table::remove_unreachable_nodes (
>       if (node->address_taken
>   	&& !node->used_from_other_partition)
>         {
> -	if (!node->call_for_symbol_thunks_and_aliases
> +	if (!node->call_for_symbol_and_aliases
>   	    (has_addr_references_p, NULL, true)
>   	    && (!node->instrumentation_clone
>   		|| !node->instrumented_version
> Index: ipa-profile.c
> ===================================================================
> --- ipa-profile.c	(revision 220741)
> +++ ipa-profile.c	(working copy)
> @@ -322,6 +322,7 @@ ipa_profile_read_summary (void)
>
>   struct ipa_propagate_frequency_data
>   {
> +  cgraph_node *function_symbol;
>     bool maybe_unlikely_executed;
>     bool maybe_executed_once;
>     bool only_called_at_startup;
> @@ -342,7 +343,7 @@ ipa_propagate_frequency_1 (struct cgraph
>   	        || d->only_called_at_startup || d->only_called_at_exit);
>          edge = edge->next_caller)
>       {
> -      if (edge->caller != node)
> +      if (edge->caller != d->function_symbol)
>   	{
>             d->only_called_at_startup &= edge->caller->only_called_at_startup;
>   	  /* It makes sense to put main() together with the static constructors.
> @@ -358,7 +359,11 @@ ipa_propagate_frequency_1 (struct cgraph
>   	 errors can make us to push function into unlikely section even when
>   	 it is executed by the train run.  Transfer the function only if all
>   	 callers are unlikely executed.  */
> -      if (profile_info && flag_branch_probabilities
> +      if (profile_info
> +	  && opt_for_fn (d->function_symbol->decl, flag_branch_probabilities)
> +	  /* Thunks are not profiled.  This is more or less implementation
> +	     bug.  */
> +	  && !d->function_symbol->thunk.thunk_p
>   	  && (edge->caller->frequency != NODE_FREQUENCY_UNLIKELY_EXECUTED
>   	      || (edge->caller->global.inlined_to
>   		  && edge->caller->global.inlined_to->frequency
> @@ -418,7 +423,7 @@ contains_hot_call_p (struct cgraph_node
>   bool
>   ipa_propagate_frequency (struct cgraph_node *node)
>   {
> -  struct ipa_propagate_frequency_data d = {true, true, true, true};
> +  struct ipa_propagate_frequency_data d = {node, true, true, true, true};
>     bool changed = false;
>
>     /* We can not propagate anything useful about externally visible functions
> @@ -432,8 +437,8 @@ ipa_propagate_frequency (struct cgraph_n
>     if (dump_file && (dump_flags & TDF_DETAILS))
>       fprintf (dump_file, "Processing frequency %s\n", node->name ());
>
> -  node->call_for_symbol_thunks_and_aliases (ipa_propagate_frequency_1, &d,
> -					    true);
> +  node->call_for_symbol_and_aliases (ipa_propagate_frequency_1, &d,
> +				     true);
>
>     if ((d.only_called_at_startup && !d.only_called_at_exit)
>         && !node->only_called_at_startup)
> @@ -597,6 +602,9 @@ ipa_profile (void)
>       {
>         bool update = false;
>
> +      if (!opt_for_fn (n->decl, flag_ipa_profile))
> +	continue;
> +
>         for (e = n->indirect_calls; e; e = e->next_callee)
>   	{
>   	  if (n->count)
> @@ -697,7 +705,9 @@ ipa_profile (void)
>     order_pos = ipa_reverse_postorder (order);
>     for (i = order_pos - 1; i >= 0; i--)
>       {
> -      if (order[i]->local.local && ipa_propagate_frequency (order[i]))
> +      if (order[i]->local.local
> +	  && opt_for_fn (order[i]->decl, flag_ipa_profile)
> +	  && ipa_propagate_frequency (order[i]))
>   	{
>   	  for (e = order[i]->callees; e; e = e->next_callee)
>   	    if (e->callee->local.local && !e->callee->aux)
> @@ -714,7 +724,9 @@ ipa_profile (void)
>         something_changed = false;
>         for (i = order_pos - 1; i >= 0; i--)
>   	{
> -	  if (order[i]->aux && ipa_propagate_frequency (order[i]))
> +	  if (order[i]->aux
> +	      && opt_for_fn (order[i]->decl, flag_ipa_profile)
> +	      && ipa_propagate_frequency (order[i]))
>   	    {
>   	      for (e = order[i]->callees; e; e = e->next_callee)
>   		if (e->callee->local.local && !e->callee->aux)
>

Hi.

Attached are the perf report and the -ftime-report output of the WPA phase.

Martin

[-- Attachment #2: chrome-latest.profile.txt --]
[-- Type: text/plain, Size: 6466 bytes --]

Execution times (seconds)
 phase setup             :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall    1977 kB ( 0%) ggc
 phase opt and generate  : 171.18 (65%) usr   2.29 (47%) sys 173.40 (65%) wall 2682609 kB (13%) ggc
 phase stream in         :  92.09 (35%) usr   2.55 (53%) sys  94.61 (35%) wall 18738048 kB (87%) ggc
 callgraph optimization  :   0.72 ( 0%) usr   0.00 ( 0%) sys   0.73 ( 0%) wall      16 kB ( 0%) ggc
 ipa dead code removal   :   5.12 ( 2%) usr   0.05 ( 1%) sys   5.07 ( 2%) wall       0 kB ( 0%) ggc
 ipa virtual call target :   2.93 ( 1%) usr   0.03 ( 1%) sys   3.02 ( 1%) wall       0 kB ( 0%) ggc
 ipa devirtualization    :   0.26 ( 0%) usr   0.01 ( 0%) sys   0.34 ( 0%) wall   32646 kB ( 0%) ggc
 ipa cp                  :   4.29 ( 2%) usr   0.48 (10%) sys   4.86 ( 2%) wall  851380 kB ( 4%) ggc
 ipa inlining heuristics : 122.37 (46%) usr   0.42 ( 9%) sys 122.72 (46%) wall  807997 kB ( 4%) ggc
 ipa comdats             :   0.53 ( 0%) usr   0.00 ( 0%) sys   0.53 ( 0%) wall       0 kB ( 0%) ggc
 ipa lto gimple in       :   5.16 ( 2%) usr   1.09 (23%) sys   6.64 ( 2%) wall 1370302 kB ( 6%) ggc
 ipa lto decl in         :  79.11 (30%) usr   1.58 (33%) sys  80.64 (30%) wall 16957092 kB (79%) ggc
 ipa lto constructors in :   0.37 ( 0%) usr   0.06 ( 1%) sys   0.37 ( 0%) wall   22897 kB ( 0%) ggc
 ipa lto cgraph I/O      :   1.44 ( 1%) usr   0.24 ( 5%) sys   1.69 ( 1%) wall  901960 kB ( 4%) ggc
 ipa lto decl merge      :   3.27 ( 1%) usr   0.01 ( 0%) sys   3.26 ( 1%) wall   16383 kB ( 0%) ggc
 ipa lto cgraph merge    :   4.63 ( 2%) usr   0.04 ( 1%) sys   4.68 ( 2%) wall   20432 kB ( 0%) ggc
 whopr wpa               :   1.70 ( 1%) usr   0.00 ( 0%) sys   1.71 ( 1%) wall       2 kB ( 0%) ggc
 whopr partitioning      :   4.72 ( 2%) usr   0.02 ( 0%) sys   4.73 ( 2%) wall    7796 kB ( 0%) ggc
 ipa reference           :   2.70 ( 1%) usr   0.10 ( 2%) sys   2.80 ( 1%) wall       0 kB ( 0%) ggc
 ipa profile             :   0.53 ( 0%) usr   0.03 ( 1%) sys   0.58 ( 0%) wall       0 kB ( 0%) ggc
 ipa pure const          :   3.13 ( 1%) usr   0.09 ( 2%) sys   3.21 ( 1%) wall       0 kB ( 0%) ggc
 ipa icf                 :  16.96 ( 6%) usr   0.17 ( 4%) sys  17.06 ( 6%) wall    3087 kB ( 0%) ggc
 inline parameters       :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 tree SSA rewrite        :   0.39 ( 0%) usr   0.05 ( 1%) sys   0.27 ( 0%) wall   51205 kB ( 0%) ggc
 tree SSA other          :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree SSA incremental    :   0.50 ( 0%) usr   0.08 ( 2%) sys   0.50 ( 0%) wall   33556 kB ( 0%) ggc
 tree operand scan       :   0.45 ( 0%) usr   0.11 ( 2%) sys   0.47 ( 0%) wall  343892 kB ( 2%) ggc
 dominance frontiers     :   0.05 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall       0 kB ( 0%) ggc
 dominance computation   :   0.51 ( 0%) usr   0.08 ( 2%) sys   0.58 ( 0%) wall       0 kB ( 0%) ggc
 varconst                :   0.02 ( 0%) usr   0.06 ( 1%) sys   0.05 ( 0%) wall       0 kB ( 0%) ggc
 loop fini               :   0.12 ( 0%) usr   0.00 ( 0%) sys   0.13 ( 0%) wall       0 kB ( 0%) ggc
 unaccounted todo        :   1.19 ( 0%) usr   0.00 ( 0%) sys   1.15 ( 0%) wall       0 kB ( 0%) ggc
 TOTAL                 : 263.27             4.84           268.01           21422636 kB
[ perf record: Woken up 254 times to write data ]
[ perf record: Captured and wrote 63.481 MB perf.data (~2773530 samples) ]
marxin@marxinbox:~/Programming/chromium/src/out/Release> perf report --stdio | sed 's/\ *$//' | head -n50
# To display the perf.data header info, please use --header/--header-only options.
#
# Samples: 1M of event 'cycles'
# Event count (approx.): 945739511218
#
# Overhead   Command      Shared Object
# ........  ........  .................  ..................................................................................................................................................................................................................................................................................................
#
    19.88%  lto1-wpa  lto1               [.] nonremovable_p(cgraph_node*, void*)
     9.17%  lto1-wpa  lto1               [.] cgraph_node::used_from_object_file_p_worker(cgraph_node*, void*)
     7.93%  lto1-wpa  lto1               [.] cgraph_node::call_for_symbol_and_aliases_1(bool (*)(cgraph_node*, void*), void*, bool)
     6.37%  lto1-wpa  lto1               [.] inflate_fast
     2.23%  lto1-wpa  lto1               [.] compare_tree_sccs_1(tree_node*, tree_node*, tree_node***)
     2.14%  lto1-wpa  lto1               [.] streamer_read_uhwi(lto_input_block*)
     1.96%  lto1-wpa  lto1               [.] ht_lookup_with_hash(ht*, unsigned char const*, unsigned long, unsigned int, ht_lookup_option)
     1.83%  lto1-wpa  lto1               [.] unify_scc(streamer_tree_cache_d*, unsigned int, unsigned int, unsigned int, unsigned int)
     1.61%  lto1-wpa  lto1               [.] streamer_read_tree_bitfields(lto_input_block*, data_in*, tree_node*)
     1.23%  lto1-wpa  lto1               [.] lto_cgraph_replace_node(cgraph_node*, cgraph_node*)
     1.21%  lto1-wpa  libc-2.19.so       [.] msort_with_tmp.part.0
     1.19%  lto1-wpa  lto1               [.] streamer_get_pickled_tree(lto_input_block*, data_in*)
     1.14%  lto1-wpa  lto1               [.] symbol_table::remove_unreachable_nodes(_IO_FILE*)
     1.08%  lto1-wpa  libc-2.19.so       [.] _int_malloc
     1.02%  lto1-wpa  lto1               [.] ipa_icf::sem_variable::equals(tree_node*, tree_node*)
     0.96%  lto1-wpa  lto1               [.] lto_input_tree_1(lto_input_block*, data_in*, LTO_tags, unsigned int)
     0.84%  lto1-wpa  lto1               [.] inflate
     0.74%  lto1-wpa  lto1               [.] adler32
     0.71%  lto1-wpa  lto1               [.] lto_input_tree(lto_input_block*, data_in*)
     0.68%  lto1-wpa  lto1               [.] cgraph_node::call_for_symbol_thunks_and_aliases(bool (*)(cgraph_node*, void*), void*, bool, bool)
     0.66%  lto1-wpa  lto1               [.] streamer_read_tree_body(lto_input_block*, data_in*, tree_node*)
     0.64%  lto1-wpa  lto1               [.] estimate_calls_size_and_time(cgraph_node*, int*, int*, int*, int*, unsigned int, vec<tree_node*, va_heap, vl_ptr>, vec<ipa_polymorphic_call_context, va_heap, vl_ptr>, vec<ipa_agg_jump_function*, va_heap, vl_ptr>) [clone .isra.129]
     0.63%  lto1-wpa  lto1               [.] lto_input_location(bitpack_d*, data_in*)



* Re: [RFC, PATCH] LTO: IPA inline speed up for large apps (Chrome)
  2015-02-18 13:58   ` Martin Liška
@ 2015-02-18 14:13     ` Martin Liška
  0 siblings, 0 replies; 6+ messages in thread
From: Martin Liška @ 2015-02-18 14:13 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 10096 bytes --]

On 02/18/2015 02:58 PM, Martin Liška wrote:
> On 02/17/2015 10:03 PM, Jan Hubicka wrote:
>> Hi,
>> this patch should chase away the expensive thunk and alias walks from most
>> of the analysis code. I think the only real use left is the local_p predicate, which
>> needs to stay because i386 expects the local flag to match between caller and callee
>> when expanding an assembler thunk. I at least optimized it by first making the walk
>> conditional for nonlocal functions only and then reorganizing
>> call_for_symbol_thunks_and_aliases to first inspect aliases (that is cheap) and
>> only then work on thunks.  Most likely this will find the non-local thunk/alias
>> faster.  The other cases were leftovers from the conversion of thunks from aliases
>> to functions.
>>
>> I also noticed a bug in ipa-profile that does not disable all the
>> transforms with !flag_ipa_profile used on OPTIMIZATION_NODE and fixed it.
>>
>> Bootstrapped/regtested x86_64-linux, committed.  I would be interested to
>> know if call_for_symbol_thunks_and_aliases is now off your oprofiles
>> (sorry, easier to type than perf-profiles).
>>
>> Honza
>>
>>     * ipa-visibility.c (function_and_variable_visibility): Only
>>     check locality if node is not already local.
>>     * ipa-inline.c (want_inline_function_to_all_callers_p): Use
>>     call_for_symbol_and_aliases instead of
>>     call_for_symbol_thunks_and_aliases.
>>     (ipa_inline): Likewise.
>>     * cgraph.c (cgraph_node::call_for_symbol_thunks_and_aliases):
>>     First walk aliases.
>>     * ipa.c (symbol_table::remove_unreachable_nodes): Use
>>     call_for_symbol_and_aliases.
>>     * ipa-profile.c (ipa_propagate_frequency_data): Add function_symbol.
>>     (ipa_propagate_frequency_1): Use it; use opt_for_fn.
>>     (ipa_propagate_frequency): Update.
>>     (ipa_profile): Add opt_for_fn guards.
>> Index: ipa-visibility.c
>> ===================================================================
>> --- ipa-visibility.c    (revision 220741)
>> +++ ipa-visibility.c    (working copy)
>> @@ -595,7 +595,8 @@ function_and_variable_visibility (bool w
>>       }
>>     FOR_EACH_DEFINED_FUNCTION (node)
>>       {
>> -      node->local.local |= node->local_p ();
>> +      if (!node->local.local)
>> +        node->local.local |= node->local_p ();
>>
>>         /* If we know that function can not be overwritten by a different semantics
>>        and moreover its section can not be discarded, replace all direct calls
>> Index: ipa-inline.c
>> ===================================================================
>> --- ipa-inline.c    (revision 220741)
>> +++ ipa-inline.c    (working copy)
>> @@ -975,14 +975,14 @@ want_inline_function_to_all_callers_p (s
>>     if (node->global.inlined_to)
>>       return false;
>>     /* Does it have callers?  */
>> -  if (!node->call_for_symbol_thunks_and_aliases (has_caller_p, NULL, true))
>> +  if (!node->call_for_symbol_and_aliases (has_caller_p, NULL, true))
>>       return false;
>>     /* Inlining into all callers would increase size?  */
>>     if (estimate_growth (node) > 0)
>>       return false;
>>     /* All inlines must be possible.  */
>> -  if (node->call_for_symbol_thunks_and_aliases (check_callers, &has_hot_call,
>> -                        true))
>> +  if (node->call_for_symbol_and_aliases (check_callers, &has_hot_call,
>> +                     true))
>>       return false;
>>     if (!cold && !has_hot_call)
>>       return false;
>> @@ -2359,9 +2359,9 @@ ipa_inline (void)
>>         if (want_inline_function_to_all_callers_p (node, cold))
>>           {
>>             int num_calls = 0;
>> -          node->call_for_symbol_thunks_and_aliases (sum_callers, &num_calls,
>> -                              true);
>> -          while (node->call_for_symbol_thunks_and_aliases
>> +          node->call_for_symbol_and_aliases (sum_callers, &num_calls,
>> +                         true);
>> +          while (node->call_for_symbol_and_aliases
>>                  (inline_to_all_callers, &num_calls, true))
>>           ;
>>             remove_functions = true;
>> Index: cgraph.c
>> ===================================================================
>> --- cgraph.c    (revision 220741)
>> +++ cgraph.c    (working copy)
>> @@ -2191,6 +2191,16 @@ cgraph_node::call_for_symbol_thunks_and_
>>
>>     if (callback (this, data))
>>       return true;
>> +  FOR_EACH_ALIAS (this, ref)
>> +    {
>> +      cgraph_node *alias = dyn_cast <cgraph_node *> (ref->referring);
>> +      if (include_overwritable
>> +      || alias->get_availability () > AVAIL_INTERPOSABLE)
>> +    if (alias->call_for_symbol_thunks_and_aliases (callback, data,
>> +                             include_overwritable,
>> +                             exclude_virtual_thunks))
>> +      return true;
>> +    }
>>     for (e = callers; e; e = e->next_caller)
>>       if (e->caller->thunk.thunk_p
>>       && (include_overwritable
>> @@ -2202,16 +2212,6 @@ cgraph_node::call_for_symbol_thunks_and_
>>                                  exclude_virtual_thunks))
>>       return true;
>>
>> -  FOR_EACH_ALIAS (this, ref)
>> -    {
>> -      cgraph_node *alias = dyn_cast <cgraph_node *> (ref->referring);
>> -      if (include_overwritable
>> -      || alias->get_availability () > AVAIL_INTERPOSABLE)
>> -    if (alias->call_for_symbol_thunks_and_aliases (callback, data,
>> -                             include_overwritable,
>> -                             exclude_virtual_thunks))
>> -      return true;
>> -    }
>>     return false;
>>   }
>>
>> Index: ipa.c
>> ===================================================================
>> --- ipa.c    (revision 220741)
>> +++ ipa.c    (working copy)
>> @@ -661,7 +661,7 @@ symbol_table::remove_unreachable_nodes (
>>       if (node->address_taken
>>       && !node->used_from_other_partition)
>>         {
>> -    if (!node->call_for_symbol_thunks_and_aliases
>> +    if (!node->call_for_symbol_and_aliases
>>           (has_addr_references_p, NULL, true)
>>           && (!node->instrumentation_clone
>>           || !node->instrumented_version
>> Index: ipa-profile.c
>> ===================================================================
>> --- ipa-profile.c    (revision 220741)
>> +++ ipa-profile.c    (working copy)
>> @@ -322,6 +322,7 @@ ipa_profile_read_summary (void)
>>
>>   struct ipa_propagate_frequency_data
>>   {
>> +  cgraph_node *function_symbol;
>>     bool maybe_unlikely_executed;
>>     bool maybe_executed_once;
>>     bool only_called_at_startup;
>> @@ -342,7 +343,7 @@ ipa_propagate_frequency_1 (struct cgraph
>>               || d->only_called_at_startup || d->only_called_at_exit);
>>          edge = edge->next_caller)
>>       {
>> -      if (edge->caller != node)
>> +      if (edge->caller != d->function_symbol)
>>       {
>>             d->only_called_at_startup &= edge->caller->only_called_at_startup;
>>         /* It makes sense to put main() together with the static constructors.
>> @@ -358,7 +359,11 @@ ipa_propagate_frequency_1 (struct cgraph
>>        errors can make us to push function into unlikely section even when
>>        it is executed by the train run.  Transfer the function only if all
>>        callers are unlikely executed.  */
>> -      if (profile_info && flag_branch_probabilities
>> +      if (profile_info
>> +      && opt_for_fn (d->function_symbol->decl, flag_branch_probabilities)
>> +      /* Thunks are not profiled.  This is more or less implementation
>> +         bug.  */
>> +      && !d->function_symbol->thunk.thunk_p
>>         && (edge->caller->frequency != NODE_FREQUENCY_UNLIKELY_EXECUTED
>>             || (edge->caller->global.inlined_to
>>             && edge->caller->global.inlined_to->frequency
>> @@ -418,7 +423,7 @@ contains_hot_call_p (struct cgraph_node
>>   bool
>>   ipa_propagate_frequency (struct cgraph_node *node)
>>   {
>> -  struct ipa_propagate_frequency_data d = {true, true, true, true};
>> +  struct ipa_propagate_frequency_data d = {node, true, true, true, true};
>>     bool changed = false;
>>
>>     /* We can not propagate anything useful about externally visible functions
>> @@ -432,8 +437,8 @@ ipa_propagate_frequency (struct cgraph_n
>>     if (dump_file && (dump_flags & TDF_DETAILS))
>>       fprintf (dump_file, "Processing frequency %s\n", node->name ());
>>
>> -  node->call_for_symbol_thunks_and_aliases (ipa_propagate_frequency_1, &d,
>> -                        true);
>> +  node->call_for_symbol_and_aliases (ipa_propagate_frequency_1, &d,
>> +                     true);
>>
>>     if ((d.only_called_at_startup && !d.only_called_at_exit)
>>         && !node->only_called_at_startup)
>> @@ -597,6 +602,9 @@ ipa_profile (void)
>>       {
>>         bool update = false;
>>
>> +      if (!opt_for_fn (n->decl, flag_ipa_profile))
>> +    continue;
>> +
>>         for (e = n->indirect_calls; e; e = e->next_callee)
>>       {
>>         if (n->count)
>> @@ -697,7 +705,9 @@ ipa_profile (void)
>>     order_pos = ipa_reverse_postorder (order);
>>     for (i = order_pos - 1; i >= 0; i--)
>>       {
>> -      if (order[i]->local.local && ipa_propagate_frequency (order[i]))
>> +      if (order[i]->local.local
>> +      && opt_for_fn (order[i]->decl, flag_ipa_profile)
>> +      && ipa_propagate_frequency (order[i]))
>>       {
>>         for (e = order[i]->callees; e; e = e->next_callee)
>>           if (e->callee->local.local && !e->callee->aux)
>> @@ -714,7 +724,9 @@ ipa_profile (void)
>>         something_changed = false;
>>         for (i = order_pos - 1; i >= 0; i--)
>>       {
>> -      if (order[i]->aux && ipa_propagate_frequency (order[i]))
>> +      if (order[i]->aux
>> +          && opt_for_fn (order[i]->decl, flag_ipa_profile)
>> +          && ipa_propagate_frequency (order[i]))
>>           {
>>             for (e = order[i]->callees; e; e = e->next_callee)
>>           if (e->callee->local.local && !e->callee->aux)
>>
>
> Hi.
>
> Here are the perf report and the -ftime-report output for the WPA phase.
>
> Martin

Hm, using the same compiler, the Firefox LTO time statistics and perf report are very different.
I wonder how that is possible?

Martin
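
For readers following the quoted cgraph.c hunk, here is a minimal standalone sketch of the
reordering it makes: run the callback on the node itself, then on its aliases (the cheap walk),
and only then on thunks found among the callers, stopping at the first hit. Everything in it is
a hypothetical stand-in (Node, Callback, walk_aliases_then_thunks); it is not GCC's cgraph_node
API and it leaves out the availability and virtual-thunk checks that the real function performs.

```cpp
#include <functional>
#include <vector>

// Hypothetical stand-in for a call-graph node; NOT GCC's cgraph_node.
struct Node {
  int id = 0;
  bool is_thunk = false;
  std::vector<Node*> aliases;  // symbols that alias this node
  std::vector<Node*> callers;  // nodes that call this node
};

using Callback = std::function<bool(Node*)>;

// Visit the node, then its aliases, then thunks that call it.
// Returns true as soon as the callback returns true (early exit).
bool walk_aliases_then_thunks(Node* node, const Callback& cb) {
  if (cb(node))
    return true;
  // Aliases first: assumed to be the cheap walk, as in the quoted mail.
  for (Node* alias : node->aliases)
    if (walk_aliases_then_thunks(alias, cb))
      return true;
  // Thunks last: this scans every caller, so it is the expensive part.
  for (Node* caller : node->callers)
    if (caller->is_thunk && walk_aliases_then_thunks(caller, cb))
      return true;
  return false;
}
```

With this ordering, a callback that matches on an alias returns before the caller list is scanned
at all, which is the effect the mail expects ("Most likely this will find the non-local
thunk/alias faster").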

[-- Attachment #2: firefox-latest.profile.txt --]
[-- Type: text/plain, Size: 9434 bytes --]

Execution times (seconds)
 phase setup             :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall    1988 kB ( 0%) ggc
 phase opt and generate  :  42.32 (70%) usr   0.85 (56%) sys  43.16 (69%) wall 1387464 kB (28%) ggc
 phase stream in         :  18.50 (30%) usr   0.68 (44%) sys  19.17 (31%) wall 3528077 kB (72%) ggc
 garbage collection      :   2.24 ( 4%) usr   0.00 ( 0%) sys   2.24 ( 4%) wall       0 kB ( 0%) ggc
 callgraph optimization  :   0.37 ( 1%) usr   0.00 ( 0%) sys   0.37 ( 1%) wall      38 kB ( 0%) ggc
 ipa dead code removal   :   3.06 ( 5%) usr   0.01 ( 1%) sys   2.88 ( 5%) wall       0 kB ( 0%) ggc
 ipa virtual call target :   5.72 ( 9%) usr   0.06 ( 4%) sys   5.87 ( 9%) wall       0 kB ( 0%) ggc
 ipa devirtualization    :   0.18 ( 0%) usr   0.00 ( 0%) sys   0.23 ( 0%) wall   22382 kB ( 0%) ggc
 ipa cp                  :   2.88 ( 5%) usr   0.09 ( 6%) sys   2.97 ( 5%) wall  515623 kB (10%) ggc
 ipa inlining heuristics :  13.96 (23%) usr   0.13 ( 8%) sys  14.12 (23%) wall  471848 kB (10%) ggc
 ipa comdats             :   0.12 ( 0%) usr   0.00 ( 0%) sys   0.12 ( 0%) wall       0 kB ( 0%) ggc
 ipa lto gimple in       :   2.54 ( 4%) usr   0.48 (31%) sys   3.23 ( 5%) wall  645652 kB (13%) ggc
 ipa lto decl in         :  12.64 (21%) usr   0.37 (24%) sys  13.01 (21%) wall 2592737 kB (53%) ggc
 ipa lto constructors in :   0.17 ( 0%) usr   0.01 ( 1%) sys   0.20 ( 0%) wall   16493 kB ( 0%) ggc
 ipa lto cgraph I/O      :   0.58 ( 1%) usr   0.09 ( 6%) sys   0.67 ( 1%) wall  437504 kB ( 9%) ggc
 ipa lto decl merge      :   1.90 ( 3%) usr   0.00 ( 0%) sys   1.90 ( 3%) wall    8191 kB ( 0%) ggc
 ipa lto cgraph merge    :   1.30 ( 2%) usr   0.00 ( 0%) sys   1.29 ( 2%) wall   14989 kB ( 0%) ggc
 whopr wpa               :   0.91 ( 1%) usr   0.00 ( 0%) sys   0.88 ( 1%) wall       2 kB ( 0%) ggc
 whopr partitioning      :   2.66 ( 4%) usr   0.00 ( 0%) sys   2.67 ( 4%) wall    6081 kB ( 0%) ggc
 ipa reference           :   1.38 ( 2%) usr   0.01 ( 1%) sys   1.40 ( 2%) wall       0 kB ( 0%) ggc
 ipa profile             :   0.21 ( 0%) usr   0.01 ( 1%) sys   0.21 ( 0%) wall       0 kB ( 0%) ggc
 ipa pure const          :   1.61 ( 3%) usr   0.01 ( 1%) sys   1.61 ( 3%) wall       0 kB ( 0%) ggc
 ipa icf                 :   4.99 ( 8%) usr   0.06 ( 4%) sys   5.00 ( 8%) wall    1120 kB ( 0%) ggc
 tree SSA rewrite        :   0.12 ( 0%) usr   0.02 ( 1%) sys   0.12 ( 0%) wall   23170 kB ( 0%) ggc
 tree SSA incremental    :   0.23 ( 0%) usr   0.05 ( 3%) sys   0.21 ( 0%) wall   14434 kB ( 0%) ggc
 tree operand scan       :   0.14 ( 0%) usr   0.03 ( 2%) sys   0.22 ( 0%) wall  145252 kB ( 3%) ggc
 dominance frontiers     :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 dominance computation   :   0.14 ( 0%) usr   0.05 ( 3%) sys   0.11 ( 0%) wall       0 kB ( 0%) ggc
 varconst                :   0.01 ( 0%) usr   0.02 ( 1%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 loop fini               :   0.07 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 unaccounted todo        :   0.62 ( 1%) usr   0.00 ( 0%) sys   0.65 ( 1%) wall       0 kB ( 0%) ggc
 TOTAL                 :  60.82             1.53            62.34            4917531 kB
[ perf record: Woken up 59 times to write data ]
[ perf record: Captured and wrote 14.722 MB perf.data (~643202 samples) ]
marxin@marxinbox:~/Programming/gecko-dev/obj-x86_64-unknown-linux-gnu/toolkit/library> perf report
marxin@marxinbox:~/Programming/gecko-dev/obj-x86_64-unknown-linux-gnu/toolkit/library> gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/home/marxin/Programming/bin/gcc2/lib/gcc/x86_64-unknown-linux-gnu/5.0.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ../configure --enable-languages=c,c++ --disable-libsanitizer --prefix=/home/marxin/Programming/bin/gcc2 --disable-bootstrap --enable-checking=release
Thread model: posix
gcc version 5.0.0 20150218 (experimental) (GCC) 
marxin@marxinbox:~/Programming/gecko-dev/obj-x86_64-unknown-linux-gnu/toolkit/library> perf report
marxin@marxinbox:~/Programming/gecko-dev/obj-x86_64-unknown-linux-gnu/toolkit/library> perf report --stdio | sed 's/\ *$//' | head -n50 
# To display the perf.data header info, please use --header/--header-only options.
#
# Samples: 245K of event 'cycles'
# Event count (approx.): 216467422123
#
# Overhead   Command      Shared Object
# ........  ........  .................  ..................................................................................................................................................................................................................................................................................................
#
     4.97%  lto1-wpa  lto1               [.] inflate_fast
     2.78%  lto1-wpa  lto1               [.] symbol_table::remove_unreachable_nodes(_IO_FILE*)
     2.37%  lto1-wpa  libc-2.19.so       [.] _int_malloc
     1.77%  lto1-wpa  lto1               [.] record_target_from_binfo(vec<cgraph_node*, va_heap, vl_ptr>&, vec<tree_node*, va_heap, vl_ptr>*, tree_node*, tree_node*, vec<tree_node*, va_heap, vl_ptr>&, long, tree_node*, long, hash_set<tree_node*, default_hashset_traits>*, hash_set<tree_node*, default_hashset_traits>*, bool, bool*)
     1.57%  lto1-wpa  lto1               [.] ht_lookup_with_hash(ht*, unsigned char const*, unsigned long, unsigned int, ht_lookup_option)
     1.56%  lto1-wpa  lto1               [.] streamer_read_uhwi(lto_input_block*)
     1.48%  lto1-wpa  lto1               [.] estimate_calls_size_and_time(cgraph_node*, int*, int*, int*, int*, unsigned int, vec<tree_node*, va_heap, vl_ptr>, vec<ipa_polymorphic_call_context, va_heap, vl_ptr>, vec<ipa_agg_jump_function*, va_heap, vl_ptr>) [clone .isra.129]
     1.48%  lto1-wpa  lto1               [.] unify_scc(streamer_tree_cache_d*, unsigned int, unsigned int, unsigned int, unsigned int)
     1.40%  lto1-wpa  lto1               [.] lto_cgraph_replace_node(cgraph_node*, cgraph_node*)
     1.38%  lto1-wpa  lto1               [.] ggc_set_mark(void const*)
     1.30%  lto1-wpa  libc-2.19.so       [.] malloc_consolidate
     1.28%  lto1-wpa  lto1               [.] htab_hash_string
     1.25%  lto1-wpa  lto1               [.] compare_tree_sccs_1(tree_node*, tree_node*, tree_node***)
     1.23%  lto1-wpa  lto1               [.] fibonacci_heap<sreal, cgraph_edge>::consolidate()
     1.19%  lto1-wpa  lto1               [.] splay_tree_splay
     1.15%  lto1-wpa  lto1               [.] can_inline_edge_p(cgraph_edge*, bool, bool)
     1.14%  lto1-wpa  lto1               [.] cgraph_node::get_availability()
     1.14%  lto1-wpa  lto1               [.] evaluate_properties_for_edge(cgraph_edge*, bool, unsigned int*, vec<tree_node*, va_heap, vl_ptr>*, vec<ipa_polymorphic_call_context, va_heap, vl_ptr>*, vec<ipa_agg_jump_function*, va_heap, vl_ptr>*) [clone .constprop.131]
     1.13%  lto1-wpa  lto1               [.] gimple_get_virt_method_for_vtable(long, tree_node*, unsigned long, bool*)
     1.10%  lto1-wpa  lto1               [.] types_same_for_odr(tree_node const*, tree_node const*)
     1.08%  lto1-wpa  lto1               [.] gt_ggc_mx_lang_tree_node(void*)
     1.05%  lto1-wpa  lto1               [.] streamer_read_tree_bitfields(lto_input_block*, data_in*, tree_node*)
     0.99%  lto1-wpa  lto1               [.] type_in_anonymous_namespace_p(tree_node const*)
     0.99%  lto1-wpa  lto1               [.] gimple_has_body_p(tree_node*)
     0.95%  lto1-wpa  lto1               [.] decl_assembler_name(tree_node*)
     0.93%  lto1-wpa  lto1               [.] do_per_function(void (*)(function*, void*), void*)
     0.82%  lto1-wpa  libc-2.19.so       [.] _int_free
     0.81%  lto1-wpa  lto1               [.] possible_polymorphic_call_targets_1(vec<cgraph_node*, va_heap, vl_ptr>&, hash_set<tree_node*, default_hashset_traits>*, hash_set<tree_node*, default_hashset_traits>*, tree_node*, odr_type_d*, long, tree_node*, long, bool*, vec<tree_node*, va_heap, vl_ptr>&, bool)
     0.81%  lto1-wpa  lto1               [.] searchc(searchc_env*, cgraph_node*, bool (*)(cgraph_edge*))
     0.80%  lto1-wpa  lto1               [.] streamer_get_pickled_tree(lto_input_block*, data_in*)
     0.78%  lto1-wpa  lto1               [.] edge_badness(cgraph_edge*, bool)
     0.77%  lto1-wpa  lto1               [.] hash_table<asmname_hasher, xcallocator, true>::find_slot_with_hash(tree_node const* const&, unsigned int, insert_option)
     0.77%  lto1-wpa  lto1               [.] update_callee_keys(fibonacci_heap<sreal, cgraph_edge>*, cgraph_node*, bitmap_head*)
     0.76%  lto1-wpa  lto1               [.] ggc_internal_alloc(unsigned long, void (*)(void*), unsigned long, unsigned long)
     0.75%  lto1-wpa  lto1               [.] fibonacci_heap<sreal, cgraph_edge>::extract_minimum_node()
     0.75%  lto1-wpa  lto1               [.] execute_one_pass(opt_pass*)
     0.74%  lto1-wpa  lto1               [.] inflate
     0.71%  lto1-wpa  lto1               [.] contains_polymorphic_type_p(tree_node const*)
     0.67%  lto1-wpa  lto1               [.] get_binfo_at_offset(tree_node*, long, tree_node*)
     0.64%  lto1-wpa  lto1               [.] symbol_table::decl_assembler_name_equal(tree_node*, tree_node const*)
     0.61%  lto1-wpa  lto1               [.] lto_balanced_map(int)
     0.61%  lto1-wpa  lto1               [.] ipa_icf::sem_item_optimizer::do_congruence_step_for_index(ipa_icf::congruence_class*, unsigned int)




Thread overview: 6+ messages
2015-02-17 17:14 [RFC, PATCH] LTO: IPA inline speed up for large apps (Chrome) Martin Liška
2015-02-17 18:38 ` Jan Hubicka
2015-02-18 10:28   ` Martin Liška
2015-02-17 21:03 ` Jan Hubicka
2015-02-18 13:58   ` Martin Liška
2015-02-18 14:13     ` Martin Liška
