public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
@ 2013-09-13 20:23 H.J. Lu
  2013-09-13 21:47 ` Wei Mi
  0 siblings, 1 reply; 59+ messages in thread
From: H.J. Lu @ 2013-09-13 20:23 UTC (permalink / raw)
  To: Wei Mi
  Cc: Alexander Monakov, Steven Bosscher, GCC Patches, David Li, Kirill Yukhin

On Fri, Sep 13, 2013 at 11:28 AM, Wei Mi <wmi@google.com> wrote:
>> Checking corei7/corei7-avx explicitly isn't a good idea.
>> It is also useful for Ivy Bridge and Haswell.  I think you
>> should use a variable to control it, similar to
>> TARGET_FUSE_CMP_AND_BRANCH.
>>
>>
>> --
>> H.J.
>
> Different x86 microarchitectures support macro-fusion for different
> compare and branch combinations. I need to differentiate various x86
> microarchitectures. If use TARGET_FUSE_CMP_AND_BRANCH like vars to
> control it, it requires a bunch of them. That is why I choose to check

Can you use TARGET_FUSE_CMP_AND_BRANCH covers cmp/test
and branch,  TARGET_FUSE_ALU_AND_BRANCH covers and/add/sub/inc/dec
and branch?

> corei7/corei7-avx in that function. I don't add core-avx-i/core-avx2
> for now because I don't have those machines for testing.
>

Kirill, can we test it on Ivy Bridge and Haswell?

Thanks.

-- 
H.J.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-09-13 20:23 Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion H.J. Lu
@ 2013-09-13 21:47 ` Wei Mi
  2013-09-14  0:44   ` Wei Mi
  0 siblings, 1 reply; 59+ messages in thread
From: Wei Mi @ 2013-09-13 21:47 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Alexander Monakov, Steven Bosscher, GCC Patches, David Li, Kirill Yukhin

On Fri, Sep 13, 2013 at 12:09 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Fri, Sep 13, 2013 at 11:28 AM, Wei Mi <wmi@google.com> wrote:
>>> Checking corei7/corei7-avx explicitly isn't a good idea.
>>> It is also useful for Ivy Bridge and Haswell.  I think you
>>> should use a variable to control it, similar to
>>> TARGET_FUSE_CMP_AND_BRANCH.
>>>
>>>
>>> --
>>> H.J.
>>
>> Different x86 microarchitectures support macro-fusion for different
>> compare and branch combinations. I need to differentiate various x86
>> microarchitectures. If use TARGET_FUSE_CMP_AND_BRANCH like vars to
>> control it, it requires a bunch of them. That is why I choose to check
>
> Can you use TARGET_FUSE_CMP_AND_BRANCH covers cmp/test
> and branch,  TARGET_FUSE_ALU_AND_BRANCH covers and/add/sub/inc/dec
> and branch?
>

Yes, I can. Thanks for the suggestion. Will fix it, and with Ivy
Bridge and Haswell included.

Thanks,
Wei.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-09-13 21:47 ` Wei Mi
@ 2013-09-14  0:44   ` Wei Mi
  2013-09-14  0:55     ` H.J. Lu
  0 siblings, 1 reply; 59+ messages in thread
From: Wei Mi @ 2013-09-14  0:44 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Alexander Monakov, Steven Bosscher, GCC Patches, David Li, Kirill Yukhin

On Fri, Sep 13, 2013 at 1:45 PM, Wei Mi <wmi@google.com> wrote:
> On Fri, Sep 13, 2013 at 12:09 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> On Fri, Sep 13, 2013 at 11:28 AM, Wei Mi <wmi@google.com> wrote:
>>>> Checking corei7/corei7-avx explicitly isn't a good idea.
>>>> It is also useful for Ivy Bridge and Haswell.  I think you
>>>> should use a variable to control it, similar to
>>>> TARGET_FUSE_CMP_AND_BRANCH.
>>>>
>>>>
>>>> --
>>>> H.J.
>>>
>>> Different x86 microarchitectures support macro-fusion for different
>>> compare and branch combinations. I need to differentiate various x86
>>> microarchitectures. If use TARGET_FUSE_CMP_AND_BRANCH like vars to
>>> control it, it requires a bunch of them. That is why I choose to check
>>
>> Can you use TARGET_FUSE_CMP_AND_BRANCH covers cmp/test
>> and branch,  TARGET_FUSE_ALU_AND_BRANCH covers and/add/sub/inc/dec
>> and branch?
>>
>
> Yes, I can. Thanks for the suggestion. Will fix it, and with Ivy
> Bridge and Haswell included.
>

Just notice another problem here:
processor_type only contains PROCESSOR_COREI7, so I cannot
differentiate Westmere and Sandybridge in x86-tune.def, which are
different for TARGET_FUSE_ALU_AND_BRANCH. So do I have to separate
m_SANDYBRIDGE out from m_COREI7?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-09-14  0:44   ` Wei Mi
@ 2013-09-14  0:55     ` H.J. Lu
  2013-09-16 18:21       ` Wei Mi
  0 siblings, 1 reply; 59+ messages in thread
From: H.J. Lu @ 2013-09-14  0:55 UTC (permalink / raw)
  To: Wei Mi
  Cc: Alexander Monakov, Steven Bosscher, GCC Patches, David Li, Kirill Yukhin

On Fri, Sep 13, 2013 at 4:07 PM, Wei Mi <wmi@google.com> wrote:
> On Fri, Sep 13, 2013 at 1:45 PM, Wei Mi <wmi@google.com> wrote:
>> On Fri, Sep 13, 2013 at 12:09 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>> On Fri, Sep 13, 2013 at 11:28 AM, Wei Mi <wmi@google.com> wrote:
>>>>> Checking corei7/corei7-avx explicitly isn't a good idea.
>>>>> It is also useful for Ivy Bridge and Haswell.  I think you
>>>>> should use a variable to control it, similar to
>>>>> TARGET_FUSE_CMP_AND_BRANCH.
>>>>>
>>>>>
>>>>> --
>>>>> H.J.
>>>>
>>>> Different x86 microarchitectures support macro-fusion for different
>>>> compare and branch combinations. I need to differentiate various x86
>>>> microarchitectures. If use TARGET_FUSE_CMP_AND_BRANCH like vars to
>>>> control it, it requires a bunch of them. That is why I choose to check
>>>
>>> Can you use TARGET_FUSE_CMP_AND_BRANCH covers cmp/test
>>> and branch,  TARGET_FUSE_ALU_AND_BRANCH covers and/add/sub/inc/dec
>>> and branch?
>>>
>>
>> Yes, I can. Thanks for the suggestion. Will fix it, and with Ivy
>> Bridge and Haswell included.
>>
>
> Just notice another problem here:
> processor_type only contains PROCESSOR_COREI7, so I cannot
> differentiate Westmere and Sandybridge in x86-tune.def, which are
> different for TARGET_FUSE_ALU_AND_BRANCH. So do I have to separate
> m_SANDYBRIDGE out from m_COREI7?

Yes, please.

Thanks.

-- 
H.J.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-09-14  0:55     ` H.J. Lu
@ 2013-09-16 18:21       ` Wei Mi
  2013-09-22 12:41         ` Jan Hubicka
  0 siblings, 1 reply; 59+ messages in thread
From: Wei Mi @ 2013-09-16 18:21 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Alexander Monakov, Steven Bosscher, GCC Patches, David Li, Kirill Yukhin

>> Just notice another problem here:
>> processor_type only contains PROCESSOR_COREI7, so I cannot
>> differentiate Westmere and Sandybridge in x86-tune.def, which are
>> different for TARGET_FUSE_ALU_AND_BRANCH. So do I have to separate
>> m_SANDYBRIDGE out from m_COREI7?
>
> Yes, please.
>
> Thanks.
>
> --
> H.J.

I separate the change into two patches here:

Patch1 is to separate PROCESSOR_COREI7_AVX from PROCESSOR_COREI7.
PROCESSOR_COREI7_AVX includes Sandybridge and Ivybridge.

Patch2 is about the change for macro-fusion. Add three tune features here:
X86_TUNE_FUSE_CMP_AND_BRANCH_64:   CORE2 only support macrofusion for
32 bits. COREI7, COREI7_AVX and Haswell support 64 bits.
X86_TUNE_FUSE_CMP_AND_BRANCH_SOFLAGS: CORE2 only support macrofusion
for branch only checking Zero and Carry flags. COREI7, COREI7_AVX and
Haswell support branch checking Sign and Overflow flags.
X86_TUNE_FUSE_ALU_AND_BRANCH: COREI7 doesn't support macrofusion for
alu + branch. COREI7_AVX and Haswell support it.

bootstrap and regression ok for the two patches.

Thanks,
Wei Mi.


Patch1:

2013-09-16  Wei Mi  <wmi@google.com>

        * gcc/config/i386/i386-c.c (ix86_target_macros_internal): Separate
        PROCESSOR_COREI7_AVX out from PROCESSOR_COREI7.
        * gcc/config/i386/i386.c (ix86_option_override_internal): Ditto.
        (ix86_issue_rate): Ditto.
        (ia32_multipass_dfa_lookahead): Ditto.
        (ix86_sched_init_global): Ditto.
        (get_builtin_code_for_version): Ditto.
        * gcc/config/i386/i386.h (enum target_cpu_default): Ditto.
        (enum processor_type): Ditto.
        * gcc/config/i386/x86-tune.def (DEF_TUNE): Ditto.

diff --git a/gcc/config/i386/i386-c.c b/gcc/config/i386/i386-c.c
index 14349be..7ea68cc 100644
--- a/gcc/config/i386/i386-c.c
+++ b/gcc/config/i386/i386-c.c
@@ -141,6 +141,10 @@ ix86_target_macros_internal (HOST_WIDE_INT isa_flag,
       def_or_undef (parse_in, "__corei7");
       def_or_undef (parse_in, "__corei7__");
       break;
+    case PROCESSOR_COREI7_AVX:
+      def_or_undef (parse_in, "__corei7_avx");
+      def_or_undef (parse_in, "__corei7_avx__");
+      break;
     case PROCESSOR_HASWELL:
       def_or_undef (parse_in, "__core_avx2");
       def_or_undef (parse_in, "__core_avx2__");
@@ -239,6 +243,9 @@ ix86_target_macros_internal (HOST_WIDE_INT isa_flag,
     case PROCESSOR_COREI7:
       def_or_undef (parse_in, "__tune_corei7__");
       break;
+    case PROCESSOR_COREI7_AVX:
+      def_or_undef (parse_in, "__tune_corei7_avx__");
+      break;
     case PROCESSOR_HASWELL:
       def_or_undef (parse_in, "__tune_core_avx2__");
       break;
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 536c357..1fd3f60 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -1908,8 +1908,9 @@ const struct processor_costs *ix86_cost = &pentium_cost;
 #define m_P4_NOCONA (m_PENT4 | m_NOCONA)
 #define m_CORE2 (1<<PROCESSOR_CORE2)
 #define m_COREI7 (1<<PROCESSOR_COREI7)
+#define m_COREI7_AVX (1<<PROCESSOR_COREI7_AVX)
 #define m_HASWELL (1<<PROCESSOR_HASWELL)
-#define m_CORE_ALL (m_CORE2 | m_COREI7  | m_HASWELL)
+#define m_CORE_ALL (m_CORE2 | m_COREI7  | m_COREI7_AVX | m_HASWELL)
 #define m_ATOM (1<<PROCESSOR_ATOM)
 #define m_SLM (1<<PROCESSOR_SLM)

@@ -1984,10 +1985,10 @@ static const unsigned int x86_arch_always_fancy_math_387
   = m_PENT | m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_ATOM | m_SLM |
m_AMD_MULTIPLE | m_GENERIC;

 static const unsigned int x86_avx256_split_unaligned_load
-  = m_COREI7 | m_GENERIC;
+  = m_COREI7 | m_COREI7_AVX | m_GENERIC;

 static const unsigned int x86_avx256_split_unaligned_store
-  = m_COREI7 | m_BDVER | m_GENERIC;
+  = m_COREI7 | m_COREI7_AVX | m_BDVER | m_GENERIC;

 /* In case the average insn count for single function invocation is
    lower than this constant, emit fast (but longer) prologue and
@@ -2377,6 +2378,8 @@ static const struct ptt
processor_target_table[PROCESSOR_max] =
   {&core_cost, 16, 10, 16, 10, 16},
   /* Core i7  */
   {&core_cost, 16, 10, 16, 10, 16},
+  /* Core i7 avx  */
+  {&core_cost, 16, 10, 16, 10, 16},
   /* Core avx2  */
   {&core_cost, 16, 10, 16, 10, 16},
   {&generic32_cost, 16, 7, 16, 7, 16},
@@ -2407,6 +2410,7 @@ static const char *const
cpu_names[TARGET_CPU_DEFAULT_max] =
   "nocona",
   "core2",
   "corei7",
+  "corei7-avx",
   "core-avx2",
   "atom",
   "slm",
@@ -3091,12 +3095,12 @@ ix86_option_override_internal (bool main_args_p)
       {"corei7", PROCESSOR_COREI7, CPU_COREI7,
        PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3 | PTA_SSSE3
        | PTA_SSE4_1 | PTA_SSE4_2 | PTA_CX16 | PTA_POPCNT | PTA_FXSR},
-      {"corei7-avx", PROCESSOR_COREI7, CPU_COREI7,
+      {"corei7-avx", PROCESSOR_COREI7_AVX, CPU_COREI7,
        PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3
        | PTA_SSSE3 | PTA_SSE4_1 | PTA_SSE4_2 | PTA_AVX
        | PTA_CX16 | PTA_POPCNT | PTA_AES | PTA_PCLMUL
        | PTA_FXSR | PTA_XSAVE | PTA_XSAVEOPT},
-      {"core-avx-i", PROCESSOR_COREI7, CPU_COREI7,
+      {"core-avx-i", PROCESSOR_COREI7_AVX, CPU_COREI7,
        PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3
        | PTA_SSSE3 | PTA_SSE4_1 | PTA_SSE4_2 | PTA_AVX
        | PTA_CX16 | PTA_POPCNT | PTA_AES | PTA_PCLMUL | PTA_FSGSBASE
@@ -24477,6 +24481,7 @@ ix86_issue_rate (void)
     case PROCESSOR_PENTIUM4:
     case PROCESSOR_CORE2:
     case PROCESSOR_COREI7:
+    case PROCESSOR_COREI7_AVX:
     case PROCESSOR_HASWELL:
     case PROCESSOR_ATHLON:
     case PROCESSOR_K8:
@@ -24834,6 +24839,7 @@ ia32_multipass_dfa_lookahead (void)

     case PROCESSOR_CORE2:
     case PROCESSOR_COREI7:
+    case PROCESSOR_COREI7_AVX:
     case PROCESSOR_HASWELL:
     case PROCESSOR_ATOM:
     case PROCESSOR_SLM:
@@ -25474,6 +25480,7 @@ ix86_sched_init_global (FILE *dump ATTRIBUTE_UNUSED,
     {
     case PROCESSOR_CORE2:
     case PROCESSOR_COREI7:
+    case PROCESSOR_COREI7_AVX:
     case PROCESSOR_HASWELL:
       /* Do not perform multipass scheduling for pre-reload schedule
          to save compile time.  */
@@ -29324,6 +29331,10 @@ get_builtin_code_for_version (tree decl, tree
*predicate_list)
              arg_str = "corei7";
              priority = P_PROC_SSE4_2;
              break;
+            case PROCESSOR_COREI7_AVX:
+              arg_str = "corei7-avx";
+              priority = P_PROC_SSE4_2;
+              break;
            case PROCESSOR_ATOM:
              arg_str = "atom";
              priority = P_PROC_SSSE3;
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index e820aa6..788cb8a 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -250,6 +250,7 @@ extern const struct processor_costs ix86_size_cost;
 #define TARGET_NOCONA (ix86_tune == PROCESSOR_NOCONA)
 #define TARGET_CORE2 (ix86_tune == PROCESSOR_CORE2)
 #define TARGET_COREI7 (ix86_tune == PROCESSOR_COREI7)
+#define TARGET_COREI7_AVX (ix86_tune == PROCESSOR_COREI7_AVX)
 #define TARGET_HASWELL (ix86_tune == PROCESSOR_HASWELL)
 #define TARGET_GENERIC32 (ix86_tune == PROCESSOR_GENERIC32)
 #define TARGET_GENERIC64 (ix86_tune == PROCESSOR_GENERIC64)
@@ -559,6 +560,7 @@ enum target_cpu_default
   TARGET_CPU_DEFAULT_nocona,
   TARGET_CPU_DEFAULT_core2,
   TARGET_CPU_DEFAULT_corei7,
+  TARGET_CPU_DEFAULT_corei7_avx,
   TARGET_CPU_DEFAULT_haswell,
   TARGET_CPU_DEFAULT_atom,
   TARGET_CPU_DEFAULT_slm,
@@ -2118,6 +2120,7 @@ enum processor_type
   PROCESSOR_NOCONA,
   PROCESSOR_CORE2,
   PROCESSOR_COREI7,
+  PROCESSOR_COREI7_AVX,
   PROCESSOR_HASWELL,
   PROCESSOR_GENERIC32,
   PROCESSOR_GENERIC64,
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index e3a34ee..4ae5f70 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -118,9 +118,9 @@ DEF_TUNE (X86_TUNE_SSE_PARTIAL_REG_DEPENDENCY,
"sse_partial_reg_dependency",
           m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_ATOM | m_SLM | m_AMDFAM10
           | m_BDVER | m_GENERIC)
 DEF_TUNE (X86_TUNE_SSE_UNALIGNED_LOAD_OPTIMAL, "sse_unaligned_load_optimal",
-          m_COREI7 | m_AMDFAM10 | m_BDVER | m_BTVER | m_SLM)
+          m_COREI7 | m_COREI7_AVX | m_AMDFAM10 | m_BDVER | m_BTVER | m_SLM)
 DEF_TUNE (X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL, "sse_unaligned_store_optimal",
-          m_COREI7 | m_BDVER | m_SLM)
+          m_COREI7 | m_COREI7_AVX | m_BDVER | m_SLM)
 DEF_TUNE (X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL,
"sse_packed_single_insn_optimal",
           m_BDVER)
 /* X86_TUNE_SSE_SPLIT_REGS: Set for machines where the type and dependencies


Patch2:

2013-09-16  Wei Mi  <wmi@google.com>

        * gcc/config/i386/i386.c (ix86_macro_fusion_p): New Function.
        (ix86_macro_fusion_pair_p): Ditto.
        * gcc/config/i386/i386.h: Add new tune features about macro-fusion.
        * gcc/config/i386/x86-tune.def (DEF_TUNE): Ditto.
        * gcc/doc/tm.texi: Generated.
        * gcc/doc/tm.texi.in: Ditto.
        * gcc/haifa-sched.c (try_group_insn): New function.
        (group_insns_for_macro_fusion): Ditto.
        (sched_init): Call group_insns_for_macro_fusion.
        * gcc/sched-rgn.c (add_branch_dependences): Keep insns in
        a SCHED_GROUP at the end of BB to remain their location.
        * gcc/target.def: Add two hooks: macro_fusion_p and
        macro_fusion_pair_p.

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 1fd3f60..85b7aa0 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -24856,6 +24856,90 @@ ia32_multipass_dfa_lookahead (void)
     }
 }

+/* Return true if target platform supports macro-fusion.  */
+
+static bool
+ix86_macro_fusion_p ()
+{
+  if (TARGET_FUSE_CMP_AND_BRANCH
+      && (!TARGET_64BIT || TARGET_FUSE_CMP_AND_BRANCH_64))
+    return true;
+  else
+    return false;
+}
+
+/* Check whether current microarchitecture support macro fusion
+   for insn pair "CONDGEN + CONDJMP". Refer to
+   "Intel Architectures Optimization Reference Manual". */
+
+static bool
+ix86_macro_fusion_pair_p (rtx condgen, rtx condjmp)
+{
+  rtx src;
+  rtx single_set = single_set (condgen);
+  enum rtx_code ccode;
+  rtx compare_set = NULL_RTX, test_if, cond;
+
+  if (single_set == NULL_RTX
+      && !TARGET_FUSE_ALU_AND_BRANCH)
+    return false;
+
+  if (single_set != NULL_RTX)
+    compare_set = single_set;
+  else
+    {
+      int i;
+      rtx pat = PATTERN (condgen);
+      for (i = 0; i < XVECLEN (pat, 0); i++)
+       if (GET_CODE (XVECEXP (pat, 0, i)) == SET
+           && GET_CODE (SET_SRC (XVECEXP (pat, 0, i))) == COMPARE)
+         compare_set = XVECEXP (pat, 0, i);
+    }
+  if (compare_set == NULL_RTX)
+    return false;
+  src = SET_SRC (compare_set);
+  if (GET_CODE (src) != COMPARE)
+    return false;
+
+  /* Macro-fusion for cmp/test MEM-IMM + conditional jmp is not
+     supported.  */
+  if ((MEM_P (XEXP (src, 0))
+       && CONST_INT_P (XEXP (src, 1)))
+      || (MEM_P (XEXP (src, 1))
+         && CONST_INT_P (XEXP (src, 0))))
+    return false;
+
+  test_if = SET_SRC (pc_set (condjmp));
+  cond = XEXP (test_if, 0);
+  ccode = GET_CODE (cond);
+  /* Check whether conditional jump use Sign or Overflow Flags.  */
+  if (!TARGET_FUSE_CMP_AND_BRANCH_SOFLAGS
+      && (ccode == GE
+          || ccode == GT
+         || ccode == LE
+         || ccode == LT))
+    return false;
+
+  if (get_attr_type (condgen) == TYPE_TEST
+      || get_attr_type (condgen) == TYPE_ICMP)
+    return true;
+
+  /* The following is the case that macro-fusion for alu + jmp.  */
+  if (!TARGET_FUSE_ALU_AND_BRANCH)
+    return false;
+
+  /* Macro-fusion for inc/dec + unsigned conditional jump is not
+     supported.  */
+  if (get_attr_type (condgen) == TYPE_INCDEC
+      && (ccode == GEU
+         || ccode == GTU
+         || ccode == LEU
+         || ccode == LTU))
+    return false;
+
+  return true;
+}
+
 /* Try to reorder ready list to take advantage of Atom pipelined IMUL
    execution. It is applied if
    (1) IMUL instruction is on the top of list;
@@ -42993,6 +43077,10 @@ ix86_memmodel_check (unsigned HOST_WIDE_INT val)
 #undef TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD
 #define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD \
   ia32_multipass_dfa_lookahead
+#undef TARGET_SCHED_MACRO_FUSION_P
+#define TARGET_SCHED_MACRO_FUSION_P ix86_macro_fusion_p
+#undef TARGET_SCHED_MACRO_FUSION_PAIR_P
+#define TARGET_SCHED_MACRO_FUSION_PAIR_P ix86_macro_fusion_pair_p

 #undef TARGET_FUNCTION_OK_FOR_SIBCALL
 #define TARGET_FUNCTION_OK_FOR_SIBCALL ix86_function_ok_for_sibcall
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 788cb8a..048cc18 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -364,6 +364,12 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
        ix86_tune_features[X86_TUNE_USE_VECTOR_CONVERTS]
 #define TARGET_FUSE_CMP_AND_BRANCH \
        ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH]
+#define TARGET_FUSE_CMP_AND_BRANCH_64 \
+       ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH_64]
+#define TARGET_FUSE_CMP_AND_BRANCH_SOFLAGS \
+       ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH_SOFLAGS]
+#define TARGET_FUSE_ALU_AND_BRANCH \
+       ix86_tune_features[X86_TUNE_FUSE_ALU_AND_BRANCH]
 #define TARGET_OPT_AGU ix86_tune_features[X86_TUNE_OPT_AGU]
 #define TARGET_VECTORIZE_DOUBLE \
        ix86_tune_features[X86_TUNE_VECTORIZE_DOUBLE]
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index 4ae5f70..a60d0f4 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -196,7 +196,22 @@ DEF_TUNE (X86_TUNE_USE_VECTOR_CONVERTS,
"use_vector_converts", m_AMDFAM10)
 /* X86_TUNE_FUSE_CMP_AND_BRANCH: Fuse a compare or test instruction
    with a subsequent conditional jump instruction into a single
    compare-and-branch uop.  */
-DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH, "fuse_cmp_and_branch", m_BDVER)
+DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH, "fuse_cmp_and_branch",
+          m_CORE_ALL | m_BDVER)
+/* X86_TUNE_FUSE_CMP_AND_BRANCH_64: Fuse compare with a subsequent
+   conditional jump instruction for TARGET_64BIT.  */
+DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH_64, "fuse_cmp_and_branch_64",
+          m_COREI7 | m_COREI7_AVX | m_HASWELL)
+/* X86_TUNE_FUSE_CMP_AND_BRANCH_SOFLAGS: Fuse compare with a
+   subsequent conditional jump instruction when the condition jump
+   check sign flag (SF) or overflow flag (OF).  */
+DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH_SOFLAGS, "fuse_cmp_and_branch_soflags",
+          m_COREI7 | m_COREI7_AVX | m_HASWELL)
+/* X86_TUNE_FUSE_ALU_AND_BRANCH: Fuse alu with a subsequent conditional
+   jump instruction when the alu instruction produces the CCFLAG consumed by
+   the conditional jump instruction. */
+DEF_TUNE (X86_TUNE_FUSE_ALU_AND_BRANCH, "fuse_alu_and_branch",
+          m_COREI7_AVX | m_HASWELL)
 /* X86_TUNE_OPT_AGU: Optimize for Address Generation Unit. This flag
    will impact LEA instruction selection. */
 DEF_TUNE (X86_TUNE_OPT_AGU, "opt_agu", m_ATOM | m_SLM)
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index d15f53c..66b45b9 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6553,6 +6553,17 @@ scheduling one insn causes other insns to
become ready in the same
 cycle.  These other insns can then be taken into account properly.
 @end deftypefn

+@deftypefn {Target Hook} bool TARGET_SCHED_MACRO_FUSION_P (void)
+This hook is used to check whether target platform supports macro fusion.
+@end deftypefn
+
+@deftypefn {Target Hook} bool TARGET_SCHED_MACRO_FUSION_PAIR_P (rtx
@var{condgen}, rtx @var{condjmp})
+This hook is used to check whether two insns could be macro fused for
+target microarchitecture. If this hook returns true for the given insn pair
+(@var{condgen} and @var{condjmp}), scheduler will put them into a sched
+group, and they will not be scheduled apart.
+@end deftypefn
+
 @deftypefn {Target Hook} void
TARGET_SCHED_DEPENDENCIES_EVALUATION_HOOK (rtx @var{head}, rtx
@var{tail})
 This hook is called after evaluation forward dependencies of insns in
 chain given by two parameter values (@var{head} and @var{tail}
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index b51d7b3..361ee87 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4940,6 +4940,10 @@ them: try the first ones in this list first.

 @hook TARGET_SCHED_REORDER2

+@hook TARGET_SCHED_MACRO_FUSION_P
+
+@hook TARGET_SCHED_MACRO_FUSION_PAIR_P
+
 @hook TARGET_SCHED_DEPENDENCIES_EVALUATION_HOOK

 @hook TARGET_SCHED_INIT
diff --git a/gcc/haifa-sched.c b/gcc/haifa-sched.c
index 61eaaef..d6726a9 100644
--- a/gcc/haifa-sched.c
+++ b/gcc/haifa-sched.c
@@ -6519,6 +6519,44 @@ setup_sched_dump (void)
                ? stderr : dump_file);
 }

+static void
+try_group_insn (rtx insn)
+{
+  unsigned int condreg1, condreg2;
+  rtx cc_reg_1;
+  rtx prev;
+
+  targetm.fixed_condition_code_regs (&condreg1, &condreg2);
+  cc_reg_1 = gen_rtx_REG (CCmode, condreg1);
+  prev = prev_nonnote_nondebug_insn (insn);
+  if (!any_condjump_p (insn)
+      || !reg_referenced_p (cc_reg_1, PATTERN (insn))
+      || !prev
+      || !modified_in_p (cc_reg_1, prev))
+    return;
+
+  /* Different microarchitectures support macro fusions for different
+     combinations of insn pairs.  */
+  if (!targetm.sched.macro_fusion_pair_p
+      || !targetm.sched.macro_fusion_pair_p (prev, insn))
+    return;
+
+  SCHED_GROUP_P (insn) = 1;
+}
+
+/* If the last cond jump and the cond register defining insn are consecutive
+   before scheduling, we want them to be in a schedule group. This is good
+   for performance on microarchitectures supporting macro-fusion.  */
+
+static void
+group_insns_for_macro_fusion ()
+{
+  basic_block bb;
+
+  FOR_EACH_BB (bb)
+    try_group_insn (BB_END (bb));
+}
+
 /* Initialize some global state for the scheduler.  This function works
    with the common data shared between all the schedulers.  It is called
    from the scheduler specific initialization routine.  */
@@ -6645,6 +6683,11 @@ sched_init (void)
     }

   curr_state = xmalloc (dfa_state_size);
+
+  /* Group compare and branch insns for macro-fusion.  */
+  if (targetm.sched.macro_fusion_p
+      && targetm.sched.macro_fusion_p ())
+    group_insns_for_macro_fusion ();
 }

 static void haifa_init_only_bb (basic_block, basic_block);
diff --git a/gcc/sched-rgn.c b/gcc/sched-rgn.c
index e1a2dce..156359e 100644
--- a/gcc/sched-rgn.c
+++ b/gcc/sched-rgn.c
@@ -2443,6 +2443,8 @@ add_branch_dependences (rtx head, rtx tail)
      cc0 setters remain at the end because they can't be moved away from
      their cc0 user.

+     Predecessors of SCHED_GROUP_P instructions at the end remain at the end.
+
      COND_EXEC insns cannot be moved past a branch (see e.g. PR17808).

      Insns setting TARGET_CLASS_LIKELY_SPILLED_P registers (usually return
@@ -2465,7 +2467,8 @@ add_branch_dependences (rtx head, rtx tail)
 #endif
                 || (!reload_completed
                     && sets_likely_spilled (PATTERN (insn)))))
-        || NOTE_P (insn))
+        || NOTE_P (insn)
+        || (last != 0 && SCHED_GROUP_P (last)))
     {
       if (!NOTE_P (insn))
        {
diff --git a/gcc/target.def b/gcc/target.def
index 6de513f..dae0378 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1041,6 +1041,19 @@ scheduling one insn causes other insns to
become ready in the same\n\
 cycle.  These other insns can then be taken into account properly.",
  int, (FILE *file, int verbose, rtx *ready, int *n_readyp, int clock), NULL)

+DEFHOOK
+(macro_fusion_p,
+ "This hook is used to check whether target platform supports macro fusion.",
+ bool, (void), NULL)
+
+DEFHOOK
+(macro_fusion_pair_p,
+ "This hook is used to check whether two insns could be macro fused for\n\
+target microarchitecture. If this hook returns true for the given insn pair\n\
+(@var{condgen} and @var{condjmp}), scheduler will put them into a sched\n\
+group, and they will not be scheduled apart.",
+ bool, (rtx condgen, rtx condjmp), NULL)
+
 /* The following member value is a pointer to a function called
    after evaluation forward dependencies of insns in chain given
    by two parameter values (head and tail correspondingly).  */

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-09-16 18:21       ` Wei Mi
@ 2013-09-22 12:41         ` Jan Hubicka
  2013-09-22 13:07           ` Jan Hubicka
  0 siblings, 1 reply; 59+ messages in thread
From: Jan Hubicka @ 2013-09-22 12:41 UTC (permalink / raw)
  To: Wei Mi
  Cc: H.J. Lu, Alexander Monakov, Steven Bosscher, GCC Patches,
	David Li, Kirill Yukhin

> 2013-09-16  Wei Mi  <wmi@google.com>
> 
>         * gcc/config/i386/i386-c.c (ix86_target_macros_internal): Separate
>         PROCESSOR_COREI7_AVX out from PROCESSOR_COREI7.
>         * gcc/config/i386/i386.c (ix86_option_override_internal): Ditto.
>         (ix86_issue_rate): Ditto.
>         (ia32_multipass_dfa_lookahead): Ditto.
>         (ix86_sched_init_global): Ditto.
>         (get_builtin_code_for_version): Ditto.
>         * gcc/config/i386/i386.h (enum target_cpu_default): Ditto.
>         (enum processor_type): Ditto.
>         * gcc/config/i386/x86-tune.def (DEF_TUNE): Ditto.

This patch is OK.
> 
> 2013-09-16  Wei Mi  <wmi@google.com>
> 
>         * gcc/config/i386/i386.c (ix86_macro_fusion_p): New Function.
>         (ix86_macro_fusion_pair_p): Ditto.
>         * gcc/config/i386/i386.h: Add new tune features about macro-fusion.
>         * gcc/config/i386/x86-tune.def (DEF_TUNE): Ditto.
>         * gcc/doc/tm.texi: Generated.
>         * gcc/doc/tm.texi.in: Ditto.
>         * gcc/haifa-sched.c (try_group_insn): New function.
>         (group_insns_for_macro_fusion): Ditto.
>         (sched_init): Call group_insns_for_macro_fusion.
>         * gcc/sched-rgn.c (add_branch_dependences): Keep insns in
>         a SCHED_GROUP at the end of BB to remain their location.
>         * gcc/target.def: Add two hooks: macro_fusion_p and
>         macro_fusion_pair_p.

I think original plan for fusing was to turn the fused instructions into one within
combiner.  I guess this would lead to quite a large explossion of insns patterns,
so the scheduler approach may be better, if scheduler maintainer agree.
> 
> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> index 1fd3f60..85b7aa0 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -24856,6 +24856,90 @@ ia32_multipass_dfa_lookahead (void)
>      }
>  }
> 
> +/* Return true if target platform supports macro-fusion.  */
> +
> +static bool
> +ix86_macro_fusion_p ()
> +{
> +  if (TARGET_FUSE_CMP_AND_BRANCH
> +      && (!TARGET_64BIT || TARGET_FUSE_CMP_AND_BRANCH_64))

You disable fusion for Budozer here sinze you did not add it into
TARGET_FUSE_CMP_AND_BRANCH_64.
> @@ -364,6 +364,12 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
>         ix86_tune_features[X86_TUNE_USE_VECTOR_CONVERTS]
>  #define TARGET_FUSE_CMP_AND_BRANCH \
>         ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH]
> +#define TARGET_FUSE_CMP_AND_BRANCH_64 \
> +       ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH_64]

Perhaps we can have TARGET_FUSE_CMP_AND_BRANCH_64 and TARGET_FUSE_CMP_AND_BRANCH_32
plus an macro TARGET_FUSE_CMP_AND_BRANCH that chose corresponding variant based
on TARGET_64BIT rather than having to wind down the test in every use.
> +#define TARGET_FUSE_CMP_AND_BRANCH_SOFLAGS \
> +       ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH_SOFLAGS]
> +#define TARGET_FUSE_ALU_AND_BRANCH \
> +       ix86_tune_features[X86_TUNE_FUSE_ALU_AND_BRANCH]
>  #define TARGET_OPT_AGU ix86_tune_features[X86_TUNE_OPT_AGU]
>  #define TARGET_VECTORIZE_DOUBLE \
>         ix86_tune_features[X86_TUNE_VECTORIZE_DOUBLE]
> diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> index 4ae5f70..a60d0f4 100644
> --- a/gcc/config/i386/x86-tune.def
> +++ b/gcc/config/i386/x86-tune.def
> @@ -196,7 +196,22 @@ DEF_TUNE (X86_TUNE_USE_VECTOR_CONVERTS,
> "use_vector_converts", m_AMDFAM10)
>  /* X86_TUNE_FUSE_CMP_AND_BRANCH: Fuse a compare or test instruction
>     with a subsequent conditional jump instruction into a single
>     compare-and-branch uop.  */
> -DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH, "fuse_cmp_and_branch", m_BDVER)
> +DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH, "fuse_cmp_and_branch",
> +          m_CORE_ALL | m_BDVER)
> +/* X86_TUNE_FUSE_CMP_AND_BRANCH_64: Fuse compare with a subsequent
> +   conditional jump instruction for TARGET_64BIT.  */
> +DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH_64, "fuse_cmp_and_branch_64",
> +          m_COREI7 | m_COREI7_AVX | m_HASWELL)
Add m_BDVER
> +/* X86_TUNE_FUSE_CMP_AND_BRANCH_SOFLAGS: Fuse compare with a
> +   subsequent conditional jump instruction when the condition jump
> +   check sign flag (SF) or overflow flag (OF).  */
> +DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH_SOFLAGS, "fuse_cmp_and_branch_soflags",
> +          m_COREI7 | m_COREI7_AVX | m_HASWELL)

This flag is affecting only fuding of ALU and BRANCh or should it also affect
X86_TUNE_FUSE_CMP_AND_BRANCH?  In current implementation it seems to be the first
and in that case it ought to be documented that way and probably
called ALT_AND_BRANCH_SOFLAGS to avoid confussion.

I am not sure if AMD hardware has any limitations here.  It fuses only cmp/test
as far as I know, but I do not think it matters what flags you use.

The i386 specific part of the change seems resonable to me.

Honza

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-09-22 12:41         ` Jan Hubicka
@ 2013-09-22 13:07           ` Jan Hubicka
  2013-09-23  6:50             ` Wei Mi
  0 siblings, 1 reply; 59+ messages in thread
From: Jan Hubicka @ 2013-09-22 13:07 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Wei Mi, H.J. Lu, Alexander Monakov, Steven Bosscher, GCC Patches,
	David Li, Kirill Yukhin

> 
> I am not sure if AMD hardware has any limitations here.  It fuses only cmp/test

This is what Agner Fog says:

A CMP or TEST instruction immediately followed by a conditional jump can be
fused into a single macro-op. This applies to all versions of the CMP and TEST
instructions and all conditional jumps except if the CMP or TEST has a
rip-relative address or both a displacement and an immediate operand. 

So it is a bit more weird.  Perhaps you can extend your predicate to look
for IP relative addresses & displacements of CMP and TEST, too.

Honza
> as far as I know, but I do not think it matters what flags you use.
> 
> The i386 specific part of the change seems resonable to me.
> 
> Honza

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-09-22 13:07           ` Jan Hubicka
@ 2013-09-23  6:50             ` Wei Mi
  2013-09-24 19:24               ` Wei Mi
  0 siblings, 1 reply; 59+ messages in thread
From: Wei Mi @ 2013-09-23  6:50 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: H.J. Lu, Alexander Monakov, Steven Bosscher, GCC Patches,
	David Li, Kirill Yukhin

> You disable fusion for Budozer here sinze you did not add it into
> TARGET_FUSE_CMP_AND_BRANCH_64.

Ok, will add it.

>
> Perhaps we can have TARGET_FUSE_CMP_AND_BRANCH_64 and TARGET_FUSE_CMP_AND_BRANCH_32
> plus an macro TARGET_FUSE_CMP_AND_BRANCH that chose corresponding variant based
> on TARGET_64BIT rather than having to wind down the test in every use.

Ok, will fix it.

>> +/* X86_TUNE_FUSE_CMP_AND_BRANCH_SOFLAGS: Fuse compare with a
>> +   subsequent conditional jump instruction when the condition jump
>> +   check sign flag (SF) or overflow flag (OF).  */
>> +DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH_SOFLAGS, "fuse_cmp_and_branch_soflags",
>> +          m_COREI7 | m_COREI7_AVX | m_HASWELL)
>
> This flag is affecting only fuding of ALU and BRANCh or should it also affect
> X86_TUNE_FUSE_CMP_AND_BRANCH?  In current implementation it seems to be the first
> and in that case it ought to be documented that way and probably
> called ALT_AND_BRANCH_SOFLAGS to avoid confussion.
>

X86_TUNE_FUSE_CMP_AND_BRANCH_SOFLAGS is not affecting fusing ALU and
BRANCH. It is added because m_CORE2 doesn't support fusing cmp and
JL/JG/JLE/JGE.

> This is what Agner Fog says:
>
> A CMP or TEST instruction immediately followed by a conditional jump can be
> fused into a single macro-op. This applies to all versions of the CMP and TEST
> instructions and all conditional jumps except if the CMP or TEST has a
> rip-relative address or both a displacement and an immediate operand.
>
> So it is a bit more weird.  Perhaps you can extend your predicate to look
> for IP relative addresses & displacements of CMP and TEST, too.
>
> Honza

Thanks for checking it. Agner's guide also mentions this constraint
for sandybridge, ivybridge.... I missed it because Intel optimization
reference manual doesn't mention it. I did some experiment just now
and verified the constraint for sandybridge existed. Will add the
predicate.

Thanks,
Wei Mi.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-09-23  6:50             ` Wei Mi
@ 2013-09-24 19:24               ` Wei Mi
  2013-09-24 21:59                 ` H.J. Lu
  0 siblings, 1 reply; 59+ messages in thread
From: Wei Mi @ 2013-09-24 19:24 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: H.J. Lu, Alexander Monakov, Steven Bosscher, GCC Patches,
	David Li, Kirill Yukhin

This is the updated patch2.
Changed:
1. For cmp/test with rip-relative addressing mem operand, don't group
insns. Bulldozer also doesn't support fusion for cmp/test with both
displacement MEM and immediate operand, while m_CORE_ALL doesn't
support fusion for cmp/test with MEM and immediate operand. I simplify
choose to use the more stringent constraint here (m_CORE_ALL's
constraint).
2. Add Budozer back and merge TARGET_FUSE_CMP_AND_BRANCH_64 and
TARGET_FUSE_CMP_AND_BRANCH_32.

bootstrap and regression pass. ok for trunk?

2013-09-24  Wei Mi  <wmi@google.com>

        * gcc/config/i386/i386.c (rip_relative_addr_p): New Function.
        (ix86_macro_fusion_p): Ditto.
        (ix86_macro_fusion_pair_p): Ditto.
        * gcc/config/i386/i386.h: Add new tune features about macro-fusion.
        * gcc/config/i386/x86-tune.def (DEF_TUNE): Ditto.
        * gcc/doc/tm.texi: Generated.
        * gcc/doc/tm.texi.in: Ditto.
        * gcc/haifa-sched.c (try_group_insn): New Function.
        (group_insns_for_macro_fusion): Ditto.
        (sched_init): Call group_insns_for_macro_fusion.
        * gcc/sched-rgn.c (add_branch_dependences): Keep insns in
        a SCHED_GROUP at the end of BB to remain their location.
        * gcc/target.def: Add two hooks: macro_fusion_p and
        macro_fusion_pair_p.

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 1fd3f60..4a04778 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -24856,6 +24856,167 @@ ia32_multipass_dfa_lookahead (void)
     }
 }

+/* Extracted from ix86_print_operand_address. Check whether ADDR is a
+   rip-relative address.  */
+
+static bool
+rip_relative_addr_p (rtx addr)
+{
+  struct ix86_address parts;
+  rtx base, index, disp;
+  int ok;
+
+  if (GET_CODE (addr) == UNSPEC && XINT (addr, 1) == UNSPEC_VSIBADDR)
+    {
+      ok = ix86_decompose_address (XVECEXP (addr, 0, 0), &parts);
+      parts.index = XVECEXP (addr, 0, 1);
+    }
+  else if (GET_CODE (addr) == UNSPEC && XINT (addr, 1) == UNSPEC_LEA_ADDR)
+    ok = ix86_decompose_address (XVECEXP (addr, 0, 0), &parts);
+  else
+    ok = ix86_decompose_address (addr, &parts);
+
+  gcc_assert (ok);
+  base = parts.base;
+  index = parts.index;
+  disp = parts.disp;
+
+  if (TARGET_64BIT && !base && !index)
+    {
+      rtx symbol = disp;
+
+      if (GET_CODE (disp) == CONST
+         && GET_CODE (XEXP (disp, 0)) == PLUS
+         && CONST_INT_P (XEXP (XEXP (disp, 0), 1)))
+       symbol = XEXP (XEXP (disp, 0), 0);
+
+      if (GET_CODE (symbol) == LABEL_REF
+         || (GET_CODE (symbol) == SYMBOL_REF
+             && SYMBOL_REF_TLS_MODEL (symbol) == 0))
+       return true;
+    }
+  if (flag_pic && !base && !index)
+    {
+      if (GET_CODE (disp) == CONST
+         && GET_CODE (XEXP (disp, 0)) == UNSPEC
+         && (XINT (XEXP (disp, 0), 1) == UNSPEC_PCREL
+             || XINT (XEXP (disp, 0), 1) == UNSPEC_GOTPCREL
+             || (TARGET_64BIT
+                 && XINT (XEXP (disp, 0), 1) == UNSPEC_GOTNTPOFF)))
+       return true;
+    }
+  return false;
+}
+
+/* Return true if target platform supports macro-fusion.  */
+
+static bool
+ix86_macro_fusion_p ()
+{
+  if (TARGET_FUSE_CMP_AND_BRANCH)
+    return true;
+  else
+    return false;
+}
+
+/* Check whether current microarchitecture support macro fusion
+   for insn pair "CONDGEN + CONDJMP". Refer to
+   "Intel Architectures Optimization Reference Manual". */
+
+static bool
+ix86_macro_fusion_pair_p (rtx condgen, rtx condjmp)
+{
+  rtx src, dest;
+  rtx single_set = single_set (condgen);
+  enum rtx_code ccode;
+  rtx compare_set = NULL_RTX, test_if, cond;
+  rtx alu_set = NULL_RTX, addr = NULL_RTX;
+
+  if (get_attr_type (condgen) != TYPE_TEST
+      && get_attr_type (condgen) != TYPE_ICMP
+      && get_attr_type (condgen) != TYPE_INCDEC
+      && get_attr_type (condgen) != TYPE_ALU)
+    return false;
+
+  if (single_set == NULL_RTX
+      && !TARGET_FUSE_ALU_AND_BRANCH)
+    return false;
+
+  if (single_set != NULL_RTX)
+    compare_set = single_set;
+  else
+    {
+      int i;
+      rtx pat = PATTERN (condgen);
+      for (i = 0; i < XVECLEN (pat, 0); i++)
+       if (GET_CODE (XVECEXP (pat, 0, i)) == SET)
+         {
+           rtx set_src = SET_SRC (XVECEXP (pat, 0, i));
+           if (GET_CODE (set_src) == COMPARE)
+             compare_set = XVECEXP (pat, 0, i);
+           else
+             alu_set = XVECEXP (pat, 0, i);
+         }
+    }
+  if (compare_set == NULL_RTX)
+    return false;
+  src = SET_SRC (compare_set);
+  if (GET_CODE (src) != COMPARE)
+    return false;
+
+  /* Macro-fusion for cmp/test MEM-IMM + conditional jmp is not
+     supported.  */
+  if ((MEM_P (XEXP (src, 0))
+       && CONST_INT_P (XEXP (src, 1)))
+      || (MEM_P (XEXP (src, 1))
+         && CONST_INT_P (XEXP (src, 0))))
+    return false;
+
+  /* No fusion for RIP-relative address.  */
+  if (MEM_P (XEXP (src, 0)))
+    addr = XEXP (XEXP (src, 0), 0);
+  else if (MEM_P (XEXP (src, 1)))
+    addr = XEXP (XEXP (src, 1), 0);
+  if (addr && rip_relative_addr_p (addr))
+    return false;
+
+  test_if = SET_SRC (pc_set (condjmp));
+  cond = XEXP (test_if, 0);
+  ccode = GET_CODE (cond);
+  /* Check whether conditional jump use Sign or Overflow Flags.  */
+  if (!TARGET_FUSE_CMP_AND_BRANCH_SOFLAGS
+      && (ccode == GE
+          || ccode == GT
+         || ccode == LE
+         || ccode == LT))
+    return false;
+
+  /* Return true for TYPE_TEST and TYPE_ICMP.  */
+  if (get_attr_type (condgen) == TYPE_TEST
+      || get_attr_type (condgen) == TYPE_ICMP)
+    return true;
+
+  /* The following is the case that macro-fusion for alu + jmp.  */
+  if (!TARGET_FUSE_ALU_AND_BRANCH || !alu_set)
+    return false;
+
+  /* No fusion for alu op with memory destination operand.  */
+  dest = SET_DEST (alu_set);
+  if (MEM_P (dest))
+    return false;
+
+  /* Macro-fusion for inc/dec + unsigned conditional jump is not
+     supported.  */
+  if (get_attr_type (condgen) == TYPE_INCDEC
+      && (ccode == GEU
+         || ccode == GTU
+         || ccode == LEU
+         || ccode == LTU))
+    return false;
+
+  return true;
+}
+
 /* Try to reorder ready list to take advantage of Atom pipelined IMUL
    execution. It is applied if
    (1) IMUL instruction is on the top of list;
@@ -42993,6 +43154,10 @@ ix86_memmodel_check (unsigned HOST_WIDE_INT val)
 #undef TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD
 #define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD \
   ia32_multipass_dfa_lookahead
+#undef TARGET_SCHED_MACRO_FUSION_P
+#define TARGET_SCHED_MACRO_FUSION_P ix86_macro_fusion_p
+#undef TARGET_SCHED_MACRO_FUSION_PAIR_P
+#define TARGET_SCHED_MACRO_FUSION_PAIR_P ix86_macro_fusion_pair_p

 #undef TARGET_FUNCTION_OK_FOR_SIBCALL
 #define TARGET_FUNCTION_OK_FOR_SIBCALL ix86_function_ok_for_sibcall
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 788cb8a..68fabd9 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -362,8 +362,17 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
        ix86_tune_features[X86_TUNE_USE_VECTOR_FP_CONVERTS]
 #define TARGET_USE_VECTOR_CONVERTS \
        ix86_tune_features[X86_TUNE_USE_VECTOR_CONVERTS]
+#define TARGET_FUSE_CMP_AND_BRANCH_32 \
+       ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH_32]
+#define TARGET_FUSE_CMP_AND_BRANCH_64 \
+       ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH_64]
 #define TARGET_FUSE_CMP_AND_BRANCH \
-       ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH]
+       (TARGET_64BIT ? TARGET_FUSE_CMP_AND_BRANCH_64 \
+        : TARGET_FUSE_CMP_AND_BRANCH_32)
+#define TARGET_FUSE_CMP_AND_BRANCH_SOFLAGS \
+       ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH_SOFLAGS]
+#define TARGET_FUSE_ALU_AND_BRANCH \
+       ix86_tune_features[X86_TUNE_FUSE_ALU_AND_BRANCH]
 #define TARGET_OPT_AGU ix86_tune_features[X86_TUNE_OPT_AGU]
 #define TARGET_VECTORIZE_DOUBLE \
        ix86_tune_features[X86_TUNE_VECTORIZE_DOUBLE]
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index 4ae5f70..3d395b0 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -193,10 +193,24 @@ DEF_TUNE (X86_TUNE_USE_VECTOR_FP_CONVERTS,
"use_vector_fp_converts",
 /* X86_TUNE_USE_VECTOR_CONVERTS: Prefer vector packed SSE conversion
    from integer to FP. */
 DEF_TUNE (X86_TUNE_USE_VECTOR_CONVERTS, "use_vector_converts", m_AMDFAM10)
-/* X86_TUNE_FUSE_CMP_AND_BRANCH: Fuse a compare or test instruction
-   with a subsequent conditional jump instruction into a single
-   compare-and-branch uop.  */
-DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH, "fuse_cmp_and_branch", m_BDVER)
+/* X86_TUNE_FUSE_CMP_AND_BRANCH_32: Fuse compare with a subsequent
+   conditional jump instruction for 32 bit TARGET.  */
+DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH_32, "fuse_cmp_and_branch_32",
+          m_CORE_ALL | m_BDVER)
+/* X86_TUNE_FUSE_CMP_AND_BRANCH_64: Fuse compare with a subsequent
+   conditional jump instruction for TARGET_64BIT.  */
+DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH_64, "fuse_cmp_and_branch_64",
+          m_COREI7 | m_COREI7_AVX | m_HASWELL | m_BDVER)
+/* X86_TUNE_FUSE_CMP_AND_BRANCH_SOFLAGS: Fuse compare with a
+   subsequent conditional jump instruction when the condition jump
+   check sign flag (SF) or overflow flag (OF).  */
+DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH_SOFLAGS, "fuse_cmp_and_branch_soflags",
+          m_COREI7 | m_COREI7_AVX | m_HASWELL | m_BDVER)
+/* X86_TUNE_FUSE_ALU_AND_BRANCH: Fuse alu with a subsequent conditional
+   jump instruction when the alu instruction produces the CCFLAG consumed by
+   the conditional jump instruction. */
+DEF_TUNE (X86_TUNE_FUSE_ALU_AND_BRANCH, "fuse_alu_and_branch",
+          m_COREI7_AVX | m_HASWELL)
 /* X86_TUNE_OPT_AGU: Optimize for Address Generation Unit. This flag
    will impact LEA instruction selection. */
 DEF_TUNE (X86_TUNE_OPT_AGU, "opt_agu", m_ATOM | m_SLM)
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index d15f53c..66b45b9 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6553,6 +6553,17 @@ scheduling one insn causes other insns to
become ready in the same
 cycle.  These other insns can then be taken into account properly.
 @end deftypefn

+@deftypefn {Target Hook} bool TARGET_SCHED_MACRO_FUSION_P (void)
+This hook is used to check whether target platform supports macro fusion.
+@end deftypefn
+
+@deftypefn {Target Hook} bool TARGET_SCHED_MACRO_FUSION_PAIR_P (rtx
@var{condgen}, rtx @var{condjmp})
+This hook is used to check whether two insns could be macro fused for
+target microarchitecture. If this hook returns true for the given insn pair
+(@var{condgen} and @var{condjmp}), scheduler will put them into a sched
+group, and they will not be scheduled apart.
+@end deftypefn
+
 @deftypefn {Target Hook} void
TARGET_SCHED_DEPENDENCIES_EVALUATION_HOOK (rtx @var{head}, rtx
@var{tail})
 This hook is called after evaluation forward dependencies of insns in
 chain given by two parameter values (@var{head} and @var{tail}
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index b51d7b3..361ee87 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4940,6 +4940,10 @@ them: try the first ones in this list first.

 @hook TARGET_SCHED_REORDER2

+@hook TARGET_SCHED_MACRO_FUSION_P
+
+@hook TARGET_SCHED_MACRO_FUSION_PAIR_P
+
 @hook TARGET_SCHED_DEPENDENCIES_EVALUATION_HOOK

 @hook TARGET_SCHED_INIT
diff --git a/gcc/haifa-sched.c b/gcc/haifa-sched.c
index 61eaaef..d6726a9 100644
--- a/gcc/haifa-sched.c
+++ b/gcc/haifa-sched.c
@@ -6519,6 +6519,44 @@ setup_sched_dump (void)
                ? stderr : dump_file);
 }

+static void
+try_group_insn (rtx insn)
+{
+  unsigned int condreg1, condreg2;
+  rtx cc_reg_1;
+  rtx prev;
+
+  targetm.fixed_condition_code_regs (&condreg1, &condreg2);
+  cc_reg_1 = gen_rtx_REG (CCmode, condreg1);
+  prev = prev_nonnote_nondebug_insn (insn);
+  if (!any_condjump_p (insn)
+      || !reg_referenced_p (cc_reg_1, PATTERN (insn))
+      || !prev
+      || !modified_in_p (cc_reg_1, prev))
+    return;
+
+  /* Different microarchitectures support macro fusions for different
+     combinations of insn pairs.  */
+  if (!targetm.sched.macro_fusion_pair_p
+      || !targetm.sched.macro_fusion_pair_p (prev, insn))
+    return;
+
+  SCHED_GROUP_P (insn) = 1;
+}
+
+/* If the last cond jump and the cond register defining insn are consecutive
+   before scheduling, we want them to be in a schedule group. This is good
+   for performance on microarchitectures supporting macro-fusion.  */
+
+static void
+group_insns_for_macro_fusion ()
+{
+  basic_block bb;
+
+  FOR_EACH_BB (bb)
+    try_group_insn (BB_END (bb));
+}
+
 /* Initialize some global state for the scheduler.  This function works
    with the common data shared between all the schedulers.  It is called
    from the scheduler specific initialization routine.  */
@@ -6645,6 +6683,11 @@ sched_init (void)
     }

   curr_state = xmalloc (dfa_state_size);
+
+  /* Group compare and branch insns for macro-fusion.  */
+  if (targetm.sched.macro_fusion_p
+      && targetm.sched.macro_fusion_p ())
+    group_insns_for_macro_fusion ();
 }

 static void haifa_init_only_bb (basic_block, basic_block);
diff --git a/gcc/sched-rgn.c b/gcc/sched-rgn.c
index e1a2dce..156359e 100644
--- a/gcc/sched-rgn.c
+++ b/gcc/sched-rgn.c
@@ -2443,6 +2443,8 @@ add_branch_dependences (rtx head, rtx tail)
      cc0 setters remain at the end because they can't be moved away from
      their cc0 user.

+     Predecessors of SCHED_GROUP_P instructions at the end remain at the end.
+
      COND_EXEC insns cannot be moved past a branch (see e.g. PR17808).

      Insns setting TARGET_CLASS_LIKELY_SPILLED_P registers (usually return
@@ -2465,7 +2467,8 @@ add_branch_dependences (rtx head, rtx tail)
 #endif
                 || (!reload_completed
                     && sets_likely_spilled (PATTERN (insn)))))
-        || NOTE_P (insn))
+        || NOTE_P (insn)
+        || (last != 0 && SCHED_GROUP_P (last)))
     {
       if (!NOTE_P (insn))
        {
diff --git a/gcc/target.def b/gcc/target.def
index 6de513f..dae0378 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1041,6 +1041,19 @@ scheduling one insn causes other insns to
become ready in the same\n\
 cycle.  These other insns can then be taken into account properly.",
  int, (FILE *file, int verbose, rtx *ready, int *n_readyp, int clock), NULL)

+DEFHOOK
+(macro_fusion_p,
+ "This hook is used to check whether target platform supports macro fusion.",
+ bool, (void), NULL)
+
+DEFHOOK
+(macro_fusion_pair_p,
+ "This hook is used to check whether two insns could be macro fused for\n\
+target microarchitecture. If this hook returns true for the given insn pair\n\
+(@var{condgen} and @var{condjmp}), scheduler will put them into a sched\n\
+group, and they will not be scheduled apart.",
+ bool, (rtx condgen, rtx condjmp), NULL)
+
 /* The following member value is a pointer to a function called
    after evaluation forward dependencies of insns in chain given
    by two parameter values (head and tail correspondingly).  */

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-09-24 19:24               ` Wei Mi
@ 2013-09-24 21:59                 ` H.J. Lu
  2013-09-24 22:37                   ` Jan Hubicka
  2013-09-30 10:31                   ` Gopalasubramanian, Ganesh
  0 siblings, 2 replies; 59+ messages in thread
From: H.J. Lu @ 2013-09-24 21:59 UTC (permalink / raw)
  To: Wei Mi
  Cc: Jan Hubicka, Alexander Monakov, Steven Bosscher, GCC Patches,
	David Li, Kirill Yukhin

On Tue, Sep 24, 2013 at 12:06 PM, Wei Mi <wmi@google.com> wrote:
> This is the updated patch2.
> Changed:
> 1. For cmp/test with rip-relative addressing mem operand, don't group
> insns. Bulldozer also doesn't support fusion for cmp/test with both
> displacement MEM and immediate operand, while m_CORE_ALL doesn't
> support fusion for cmp/test with MEM and immediate operand. I simplify
> choose to use the more stringent constraint here (m_CORE_ALL's
> constraint).
> 2. Add Budozer back and merge TARGET_FUSE_CMP_AND_BRANCH_64 and
> TARGET_FUSE_CMP_AND_BRANCH_32.
>
> bootstrap and regression pass. ok for trunk?
>
> 2013-09-24  Wei Mi  <wmi@google.com>
>
>         * gcc/config/i386/i386.c (rip_relative_addr_p): New Function.
>         (ix86_macro_fusion_p): Ditto.
>         (ix86_macro_fusion_pair_p): Ditto.
>         * gcc/config/i386/i386.h: Add new tune features about macro-fusion.
>         * gcc/config/i386/x86-tune.def (DEF_TUNE): Ditto.
>         * gcc/doc/tm.texi: Generated.
>         * gcc/doc/tm.texi.in: Ditto.
>         * gcc/haifa-sched.c (try_group_insn): New Function.
>         (group_insns_for_macro_fusion): Ditto.
>         (sched_init): Call group_insns_for_macro_fusion.
>         * gcc/sched-rgn.c (add_branch_dependences): Keep insns in
>         a SCHED_GROUP at the end of BB to remain their location.
>         * gcc/target.def: Add two hooks: macro_fusion_p and
>         macro_fusion_pair_p.
>
> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> index 1fd3f60..4a04778 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -24856,6 +24856,167 @@ ia32_multipass_dfa_lookahead (void)
>      }
>  }
>
> +/* Extracted from ix86_print_operand_address. Check whether ADDR is a
> +   rip-relative address.  */
> +
> +static bool
> +rip_relative_addr_p (rtx addr)
> +{
> +  struct ix86_address parts;
> +  rtx base, index, disp;
> +  int ok;
> +
> +  if (GET_CODE (addr) == UNSPEC && XINT (addr, 1) == UNSPEC_VSIBADDR)
> +    {
> +      ok = ix86_decompose_address (XVECEXP (addr, 0, 0), &parts);
> +      parts.index = XVECEXP (addr, 0, 1);
> +    }
> +  else if (GET_CODE (addr) == UNSPEC && XINT (addr, 1) == UNSPEC_LEA_ADDR)
> +    ok = ix86_decompose_address (XVECEXP (addr, 0, 0), &parts);
> +  else
> +    ok = ix86_decompose_address (addr, &parts);
> +
> +  gcc_assert (ok);
> +  base = parts.base;
> +  index = parts.index;
> +  disp = parts.disp;
> +
> +  if (TARGET_64BIT && !base && !index)
> +    {
> +      rtx symbol = disp;
> +
> +      if (GET_CODE (disp) == CONST
> +         && GET_CODE (XEXP (disp, 0)) == PLUS
> +         && CONST_INT_P (XEXP (XEXP (disp, 0), 1)))
> +       symbol = XEXP (XEXP (disp, 0), 0);
> +
> +      if (GET_CODE (symbol) == LABEL_REF
> +         || (GET_CODE (symbol) == SYMBOL_REF
> +             && SYMBOL_REF_TLS_MODEL (symbol) == 0))
> +       return true;
> +    }
> +  if (flag_pic && !base && !index)
> +    {
> +      if (GET_CODE (disp) == CONST
> +         && GET_CODE (XEXP (disp, 0)) == UNSPEC
> +         && (XINT (XEXP (disp, 0), 1) == UNSPEC_PCREL
> +             || XINT (XEXP (disp, 0), 1) == UNSPEC_GOTPCREL
> +             || (TARGET_64BIT
> +                 && XINT (XEXP (disp, 0), 1) == UNSPEC_GOTNTPOFF)))
> +       return true;
> +    }
> +  return false;
> +}
> +

It doesn't look right.  IP relative address is only possible
with TARGET_64BIT and

1. base == pc. Or
2. UUNSPEC_PCREL,  UNSPEC_GOTPCREL, and
NSPEC_GOTNTPOFF.


-- 
H.J.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-09-24 21:59                 ` H.J. Lu
@ 2013-09-24 22:37                   ` Jan Hubicka
  2013-09-25  3:28                     ` Wei Mi
  2013-09-30 10:31                   ` Gopalasubramanian, Ganesh
  1 sibling, 1 reply; 59+ messages in thread
From: Jan Hubicka @ 2013-09-24 22:37 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Wei Mi, Jan Hubicka, Alexander Monakov, Steven Bosscher,
	GCC Patches, David Li, Kirill Yukhin

> > +  gcc_assert (ok);
> > +  base = parts.base;
> > +  index = parts.index;
> > +  disp = parts.disp;
> > +
> > +  if (TARGET_64BIT && !base && !index)
> > +    {
> > +      rtx symbol = disp;
> > +
> > +      if (GET_CODE (disp) == CONST
> > +         && GET_CODE (XEXP (disp, 0)) == PLUS
> > +         && CONST_INT_P (XEXP (XEXP (disp, 0), 1)))
> > +       symbol = XEXP (XEXP (disp, 0), 0);
> > +
> > +      if (GET_CODE (symbol) == LABEL_REF
> > +         || (GET_CODE (symbol) == SYMBOL_REF
> > +             && SYMBOL_REF_TLS_MODEL (symbol) == 0))
> > +       return true;
> > +    }
> > +  if (flag_pic && !base && !index)
> > +    {
> > +      if (GET_CODE (disp) == CONST
> > +         && GET_CODE (XEXP (disp, 0)) == UNSPEC
> > +         && (XINT (XEXP (disp, 0), 1) == UNSPEC_PCREL
> > +             || XINT (XEXP (disp, 0), 1) == UNSPEC_GOTPCREL
> > +             || (TARGET_64BIT
> > +                 && XINT (XEXP (disp, 0), 1) == UNSPEC_GOTNTPOFF)))
> > +       return true;
> > +    }
> > +  return false;
> > +}
> > +
> 
> It doesn't look right.  IP relative address is only possible
> with TARGET_64BIT and
> 
> 1. base == pc. Or
> 2. UUNSPEC_PCREL,  UNSPEC_GOTPCREL, and
> NSPEC_GOTNTPOFF.

Target 64bit should be tested above.  We however output RIP addresses
also for basic symbol references.  I.e. when base is an symbol addresss.
such as in:
int a;
int t()
{
  return a;
}

memory_address_length already contains logic to figure out if there is IP
relative addressing going on (I am not sure it is completely accurate either).
Better to break it out to a common predicate and perhaps unify with what
ix86_print_operand_address is doing.

Honza
> 
> 
> -- 
> H.J.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-09-24 22:37                   ` Jan Hubicka
@ 2013-09-25  3:28                     ` Wei Mi
  2013-10-03 18:25                       ` Wei Mi
  0 siblings, 1 reply; 59+ messages in thread
From: Wei Mi @ 2013-09-25  3:28 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: H.J. Lu, Alexander Monakov, Steven Bosscher, GCC Patches,
	David Li, Kirill Yukhin

>> It doesn't look right.  IP relative address is only possible
>> with TARGET_64BIT and
>>
>> 1. base == pc. Or
>> 2. UUNSPEC_PCREL,  UNSPEC_GOTPCREL, and
>> NSPEC_GOTNTPOFF.
>
> Target 64bit should be tested above.  We however output RIP addresses
> also for basic symbol references.  I.e. when base is an symbol addresss.
> such as in:
> int a;
> int t()
> {
>   return a;
> }
>
> memory_address_length already contains logic to figure out if there is IP
> relative addressing going on (I am not sure it is completely accurate either).
> Better to break it out to a common predicate and perhaps unify with what
> ix86_print_operand_address is doing.
>
> Honza
>>
>>
>> --
>> H.J.

Thanks. How about this one. bootstrap and regression are going on.

2013-09-24  Wei Mi  <wmi@google.com>

        * gcc/config/i386/i386.c (memory_address_length): Extract a part
        of code to rip_relative_addr_p.
        (rip_relative_addr_p): New Function.
        (ix86_macro_fusion_p): Ditto.
        (ix86_macro_fusion_pair_p): Ditto.
        * gcc/config/i386/i386.h: Add new tune features about macro-fusion.
        * gcc/config/i386/x86-tune.def (DEF_TUNE): Ditto.
        * gcc/doc/tm.texi: Generated.
        * gcc/doc/tm.texi.in: Ditto.
        * gcc/haifa-sched.c (try_group_insn): New Function.
        (group_insns_for_macro_fusion): Ditto.
        (sched_init): Call group_insns_for_macro_fusion.
        * gcc/sched-rgn.c (add_branch_dependences): Keep insns in
        a SCHED_GROUP at the end of BB to remain their location.
        * gcc/target.def: Add two hooks: macro_fusion_p and
        macro_fusion_pair_p.

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 1fd3f60..808e0c6 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -24275,25 +24275,8 @@ memory_address_length (rtx addr, bool lea)
   else if (disp && !base && !index)
     {
       len += 4;
-      if (TARGET_64BIT)
-       {
-         rtx symbol = disp;
-
-         if (GET_CODE (disp) == CONST)
-           symbol = XEXP (disp, 0);
-         if (GET_CODE (symbol) == PLUS
-             && CONST_INT_P (XEXP (symbol, 1)))
-           symbol = XEXP (symbol, 0);
-
-         if (GET_CODE (symbol) != LABEL_REF
-             && (GET_CODE (symbol) != SYMBOL_REF
-                 || SYMBOL_REF_TLS_MODEL (symbol) != 0)
-             && (GET_CODE (symbol) != UNSPEC
-                 || (XINT (symbol, 1) != UNSPEC_GOTPCREL
-                     && XINT (symbol, 1) != UNSPEC_PCREL
-                     && XINT (symbol, 1) != UNSPEC_GOTNTPOFF)))
-           len++;
-       }
+      if (rip_relative_addr_p (&parts))
+       len++;
     }
   else
     {
@@ -24856,6 +24839,159 @@ ia32_multipass_dfa_lookahead (void)
     }
 }

+/* Check whether x86 address PARTS is a pc-relative address.  */
+
+static bool
+rip_relative_addr_p (struct ix86_address *parts)
+{
+  struct ix86_address *parts;
+  rtx base, index, disp;
+
+  base = parts->base;
+  index = parts->index;
+  disp = parts->disp;
+
+  if (disp && !base && !index)
+    {
+      if (TARGET_64BIT)
+       {
+         rtx symbol = disp;
+
+         if (GET_CODE (disp) == CONST)
+           symbol = XEXP (disp, 0);
+         if (GET_CODE (symbol) == PLUS
+             && CONST_INT_P (XEXP (symbol, 1)))
+           symbol = XEXP (symbol, 0);
+
+         if (GET_CODE (symbol) == LABEL_REF
+             || (GET_CODE (symbol) == SYMBOL_REF
+                 && SYMBOL_REF_TLS_MODEL (symbol) == 0)
+             || (GET_CODE (symbol) == UNSPEC
+                 && (XINT (symbol, 1) == UNSPEC_GOTPCREL
+                     || XINT (symbol, 1) == UNSPEC_PCREL
+                     || XINT (symbol, 1) == UNSPEC_GOTNTPOFF)))
+           return true;
+       }
+    }
+  return false;
+}
+
+/* Return true if target platform supports macro-fusion.  */
+
+static bool
+ix86_macro_fusion_p ()
+{
+  if (TARGET_FUSE_CMP_AND_BRANCH)
+    return true;
+  else
+    return false;
+}
+
+/* Check whether current microarchitecture support macro fusion
+   for insn pair "CONDGEN + CONDJMP". Refer to
+   "Intel Architectures Optimization Reference Manual". */
+
+static bool
+ix86_macro_fusion_pair_p (rtx condgen, rtx condjmp)
+{
+  rtx src, dest;
+  rtx single_set = single_set (condgen);
+  enum rtx_code ccode;
+  rtx compare_set = NULL_RTX, test_if, cond;
+  rtx alu_set = NULL_RTX, addr = NULL_RTX;
+
+  if (get_attr_type (condgen) != TYPE_TEST
+      && get_attr_type (condgen) != TYPE_ICMP
+      && get_attr_type (condgen) != TYPE_INCDEC
+      && get_attr_type (condgen) != TYPE_ALU)
+    return false;
+
+  if (single_set == NULL_RTX
+      && !TARGET_FUSE_ALU_AND_BRANCH)
+    return false;
+
+  if (single_set != NULL_RTX)
+    compare_set = single_set;
+  else
+    {
+      int i;
+      rtx pat = PATTERN (condgen);
+      for (i = 0; i < XVECLEN (pat, 0); i++)
+       if (GET_CODE (XVECEXP (pat, 0, i)) == SET)
+         {
+           rtx set_src = SET_SRC (XVECEXP (pat, 0, i));
+           if (GET_CODE (set_src) == COMPARE)
+             compare_set = XVECEXP (pat, 0, i);
+           else
+             alu_set = XVECEXP (pat, 0, i);
+         }
+    }
+  if (compare_set == NULL_RTX)
+    return false;
+  src = SET_SRC (compare_set);
+  if (GET_CODE (src) != COMPARE)
+    return false;
+
+  /* Macro-fusion for cmp/test MEM-IMM + conditional jmp is not
+     supported.  */
+  if ((MEM_P (XEXP (src, 0))
+       && CONST_INT_P (XEXP (src, 1)))
+      || (MEM_P (XEXP (src, 1))
+         && CONST_INT_P (XEXP (src, 0))))
+    return false;
+
+  /* No fusion for RIP-relative address.  */
+  if (MEM_P (XEXP (src, 0)))
+    addr = XEXP (XEXP (src, 0), 0);
+  else if (MEM_P (XEXP (src, 1)))
+    addr = XEXP (XEXP (src, 1), 0);
+
+  if (addr) {
+    ix86_address parts;
+    int ok = ix86_decompose_address (addr, &parts);
+    gcc_assert (ok);
+
+    if (rip_relative_addr_p (&parts))
+      return false;
+  }
+
+  test_if = SET_SRC (pc_set (condjmp));
+  cond = XEXP (test_if, 0);
+  ccode = GET_CODE (cond);
+  /* Check whether conditional jump use Sign or Overflow Flags.  */
+  if (!TARGET_FUSE_CMP_AND_BRANCH_SOFLAGS
+      && (ccode == GE
+          || ccode == GT
+         || ccode == LE
+         || ccode == LT))
+    return false;
+
+  /* Return true for TYPE_TEST and TYPE_ICMP.  */
+  if (get_attr_type (condgen) == TYPE_TEST
+      || get_attr_type (condgen) == TYPE_ICMP)
+    return true;
+
+  /* The following is the case that macro-fusion for alu + jmp.  */
+  if (!TARGET_FUSE_ALU_AND_BRANCH || !alu_set)
+    return false;
+
+  /* No fusion for alu op with memory destination operand.  */
+  dest = SET_DEST (alu_set);
+  if (MEM_P (dest))
+    return false;
+
+  /* Macro-fusion for inc/dec + unsigned conditional jump is not
+     supported.  */
+  if (get_attr_type (condgen) == TYPE_INCDEC
+      && (ccode == GEU
+         || ccode == GTU
+         || ccode == LEU
+         || ccode == LTU))
+    return false;
+
+  return true;
+}
+
 /* Try to reorder ready list to take advantage of Atom pipelined IMUL
    execution. It is applied if
    (1) IMUL instruction is on the top of list;
@@ -42993,6 +43129,10 @@ ix86_memmodel_check (unsigned HOST_WIDE_INT val)
 #undef TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD
 #define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD \
   ia32_multipass_dfa_lookahead
+#undef TARGET_SCHED_MACRO_FUSION_P
+#define TARGET_SCHED_MACRO_FUSION_P ix86_macro_fusion_p
+#undef TARGET_SCHED_MACRO_FUSION_PAIR_P
+#define TARGET_SCHED_MACRO_FUSION_PAIR_P ix86_macro_fusion_pair_p

 #undef TARGET_FUNCTION_OK_FOR_SIBCALL
 #define TARGET_FUNCTION_OK_FOR_SIBCALL ix86_function_ok_for_sibcall
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 788cb8a..68fabd9 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -362,8 +362,17 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
        ix86_tune_features[X86_TUNE_USE_VECTOR_FP_CONVERTS]
 #define TARGET_USE_VECTOR_CONVERTS \
        ix86_tune_features[X86_TUNE_USE_VECTOR_CONVERTS]
+#define TARGET_FUSE_CMP_AND_BRANCH_32 \
+       ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH_32]
+#define TARGET_FUSE_CMP_AND_BRANCH_64 \
+       ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH_64]
 #define TARGET_FUSE_CMP_AND_BRANCH \
-       ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH]
+       (TARGET_64BIT ? TARGET_FUSE_CMP_AND_BRANCH_64 \
+        : TARGET_FUSE_CMP_AND_BRANCH_32)
+#define TARGET_FUSE_CMP_AND_BRANCH_SOFLAGS \
+       ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH_SOFLAGS]
+#define TARGET_FUSE_ALU_AND_BRANCH \
+       ix86_tune_features[X86_TUNE_FUSE_ALU_AND_BRANCH]
 #define TARGET_OPT_AGU ix86_tune_features[X86_TUNE_OPT_AGU]
 #define TARGET_VECTORIZE_DOUBLE \
        ix86_tune_features[X86_TUNE_VECTORIZE_DOUBLE]
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index 4ae5f70..3d395b0 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -193,10 +193,24 @@ DEF_TUNE (X86_TUNE_USE_VECTOR_FP_CONVERTS,
"use_vector_fp_converts",
 /* X86_TUNE_USE_VECTOR_CONVERTS: Prefer vector packed SSE conversion
    from integer to FP. */
 DEF_TUNE (X86_TUNE_USE_VECTOR_CONVERTS, "use_vector_converts", m_AMDFAM10)
-/* X86_TUNE_FUSE_CMP_AND_BRANCH: Fuse a compare or test instruction
-   with a subsequent conditional jump instruction into a single
-   compare-and-branch uop.  */
-DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH, "fuse_cmp_and_branch", m_BDVER)
+/* X86_TUNE_FUSE_CMP_AND_BRANCH_32: Fuse compare with a subsequent
+   conditional jump instruction for 32 bit TARGET.  */
+DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH_32, "fuse_cmp_and_branch_32",
+          m_CORE_ALL | m_BDVER)
+/* X86_TUNE_FUSE_CMP_AND_BRANCH_64: Fuse compare with a subsequent
+   conditional jump instruction for TARGET_64BIT.  */
+DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH_64, "fuse_cmp_and_branch_64",
+          m_COREI7 | m_COREI7_AVX | m_HASWELL | m_BDVER)
+/* X86_TUNE_FUSE_CMP_AND_BRANCH_SOFLAGS: Fuse compare with a
+   subsequent conditional jump instruction when the condition jump
+   check sign flag (SF) or overflow flag (OF).  */
+DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH_SOFLAGS, "fuse_cmp_and_branch_soflags",
+          m_COREI7 | m_COREI7_AVX | m_HASWELL | m_BDVER)
+/* X86_TUNE_FUSE_ALU_AND_BRANCH: Fuse alu with a subsequent conditional
+   jump instruction when the alu instruction produces the CCFLAG consumed by
+   the conditional jump instruction. */
+DEF_TUNE (X86_TUNE_FUSE_ALU_AND_BRANCH, "fuse_alu_and_branch",
+          m_COREI7_AVX | m_HASWELL)
 /* X86_TUNE_OPT_AGU: Optimize for Address Generation Unit. This flag
    will impact LEA instruction selection. */
 DEF_TUNE (X86_TUNE_OPT_AGU, "opt_agu", m_ATOM | m_SLM)
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index d15f53c..66b45b9 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6553,6 +6553,17 @@ scheduling one insn causes other insns to
become ready in the same
 cycle.  These other insns can then be taken into account properly.
 @end deftypefn

+@deftypefn {Target Hook} bool TARGET_SCHED_MACRO_FUSION_P (void)
+This hook is used to check whether target platform supports macro fusion.
+@end deftypefn
+
+@deftypefn {Target Hook} bool TARGET_SCHED_MACRO_FUSION_PAIR_P (rtx
@var{condgen}, rtx @var{condjmp})
+This hook is used to check whether two insns could be macro fused for
+target microarchitecture. If this hook returns true for the given insn pair
+(@var{condgen} and @var{condjmp}), scheduler will put them into a sched
+group, and they will not be scheduled apart.
+@end deftypefn
+
 @deftypefn {Target Hook} void
TARGET_SCHED_DEPENDENCIES_EVALUATION_HOOK (rtx @var{head}, rtx
@var{tail})
 This hook is called after evaluation forward dependencies of insns in
 chain given by two parameter values (@var{head} and @var{tail}
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index b51d7b3..361ee87 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4940,6 +4940,10 @@ them: try the first ones in this list first.

 @hook TARGET_SCHED_REORDER2

+@hook TARGET_SCHED_MACRO_FUSION_P
+
+@hook TARGET_SCHED_MACRO_FUSION_PAIR_P
+
 @hook TARGET_SCHED_DEPENDENCIES_EVALUATION_HOOK

 @hook TARGET_SCHED_INIT
diff --git a/gcc/haifa-sched.c b/gcc/haifa-sched.c
index 61eaaef..d6726a9 100644
--- a/gcc/haifa-sched.c
+++ b/gcc/haifa-sched.c
@@ -6519,6 +6519,44 @@ setup_sched_dump (void)
                ? stderr : dump_file);
 }

+static void
+try_group_insn (rtx insn)
+{
+  unsigned int condreg1, condreg2;
+  rtx cc_reg_1;
+  rtx prev;
+
+  targetm.fixed_condition_code_regs (&condreg1, &condreg2);
+  cc_reg_1 = gen_rtx_REG (CCmode, condreg1);
+  prev = prev_nonnote_nondebug_insn (insn);
+  if (!any_condjump_p (insn)
+      || !reg_referenced_p (cc_reg_1, PATTERN (insn))
+      || !prev
+      || !modified_in_p (cc_reg_1, prev))
+    return;
+
+  /* Different microarchitectures support macro fusions for different
+     combinations of insn pairs.  */
+  if (!targetm.sched.macro_fusion_pair_p
+      || !targetm.sched.macro_fusion_pair_p (prev, insn))
+    return;
+
+  SCHED_GROUP_P (insn) = 1;
+}
+
+/* If the last cond jump and the cond register defining insn are consecutive
+   before scheduling, we want them to be in a schedule group. This is good
+   for performance on microarchitectures supporting macro-fusion.  */
+
+static void
+group_insns_for_macro_fusion ()
+{
+  basic_block bb;
+
+  FOR_EACH_BB (bb)
+    try_group_insn (BB_END (bb));
+}
+
 /* Initialize some global state for the scheduler.  This function works
    with the common data shared between all the schedulers.  It is called
    from the scheduler specific initialization routine.  */
@@ -6645,6 +6683,11 @@ sched_init (void)
     }

   curr_state = xmalloc (dfa_state_size);
+
+  /* Group compare and branch insns for macro-fusion.  */
+  if (targetm.sched.macro_fusion_p
+      && targetm.sched.macro_fusion_p ())
+    group_insns_for_macro_fusion ();
 }

 static void haifa_init_only_bb (basic_block, basic_block);
diff --git a/gcc/sched-rgn.c b/gcc/sched-rgn.c
index e1a2dce..156359e 100644
--- a/gcc/sched-rgn.c
+++ b/gcc/sched-rgn.c
@@ -2443,6 +2443,8 @@ add_branch_dependences (rtx head, rtx tail)
      cc0 setters remain at the end because they can't be moved away from
      their cc0 user.

+     Predecessors of SCHED_GROUP_P instructions at the end remain at the end.
+
      COND_EXEC insns cannot be moved past a branch (see e.g. PR17808).

      Insns setting TARGET_CLASS_LIKELY_SPILLED_P registers (usually return
@@ -2465,7 +2467,8 @@ add_branch_dependences (rtx head, rtx tail)
 #endif
                 || (!reload_completed
                     && sets_likely_spilled (PATTERN (insn)))))
-        || NOTE_P (insn))
+        || NOTE_P (insn)
+        || (last != 0 && SCHED_GROUP_P (last)))
     {
       if (!NOTE_P (insn))
        {
diff --git a/gcc/target.def b/gcc/target.def
index 6de513f..dae0378 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1041,6 +1041,19 @@ scheduling one insn causes other insns to
become ready in the same\n\
 cycle.  These other insns can then be taken into account properly.",
  int, (FILE *file, int verbose, rtx *ready, int *n_readyp, int clock), NULL)

+DEFHOOK
+(macro_fusion_p,
+ "This hook is used to check whether target platform supports macro fusion.",
+ bool, (void), NULL)
+
+DEFHOOK
+(macro_fusion_pair_p,
+ "This hook is used to check whether two insns could be macro fused for\n\
+target microarchitecture. If this hook returns true for the given insn pair\n\
+(@var{condgen} and @var{condjmp}), scheduler will put them into a sched\n\
+group, and they will not be scheduled apart.",
+ bool, (rtx condgen, rtx condjmp), NULL)
+
 /* The following member value is a pointer to a function called
    after evaluation forward dependencies of insns in chain given
    by two parameter values (head and tail correspondingly).  */

^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-09-24 21:59                 ` H.J. Lu
  2013-09-24 22:37                   ` Jan Hubicka
@ 2013-09-30 10:31                   ` Gopalasubramanian, Ganesh
  1 sibling, 0 replies; 59+ messages in thread
From: Gopalasubramanian, Ganesh @ 2013-09-30 10:31 UTC (permalink / raw)
  To: Wei Mi
  Cc: Jan Hubicka, Alexander Monakov, Steven Bosscher, GCC Patches,
	David Li, Kirill Yukhin, H.J. Lu

> 1. For cmp/test with rip-relative addressing mem operand, don't group 
> insns. Bulldozer also doesn't support fusion for cmp/test with both 
> displacement MEM and immediate operand, while m_CORE_ALL doesn't 
> support fusion for cmp/test with MEM and immediate operand. I simplify 
> choose to use the more stringent constraint here (m_CORE_ALL's 
> constraint).

This suits Bulldozer's specification. We don't see an issue with the proposed patch.

Regards
Ganesh

-----Original Message-----
From: H.J. Lu [mailto:hjl.tools@gmail.com] 
Sent: Wednesday, September 25, 2013 2:12 AM
To: Wei Mi
Cc: Jan Hubicka; Alexander Monakov; Steven Bosscher; GCC Patches; David Li; Kirill Yukhin
Subject: Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion

On Tue, Sep 24, 2013 at 12:06 PM, Wei Mi <wmi@google.com> wrote:
> This is the updated patch2.
> Changed:
> 1. For cmp/test with rip-relative addressing mem operand, don't group 
> insns. Bulldozer also doesn't support fusion for cmp/test with both 
> displacement MEM and immediate operand, while m_CORE_ALL doesn't 
> support fusion for cmp/test with MEM and immediate operand. I simplify 
> choose to use the more stringent constraint here (m_CORE_ALL's 
> constraint).
> 2. Add Budozer back and merge TARGET_FUSE_CMP_AND_BRANCH_64 and 
> TARGET_FUSE_CMP_AND_BRANCH_32.
>
> bootstrap and regression pass. ok for trunk?
>
> 2013-09-24  Wei Mi  <wmi@google.com>
>
>         * gcc/config/i386/i386.c (rip_relative_addr_p): New Function.
>         (ix86_macro_fusion_p): Ditto.
>         (ix86_macro_fusion_pair_p): Ditto.
>         * gcc/config/i386/i386.h: Add new tune features about macro-fusion.
>         * gcc/config/i386/x86-tune.def (DEF_TUNE): Ditto.
>         * gcc/doc/tm.texi: Generated.
>         * gcc/doc/tm.texi.in: Ditto.
>         * gcc/haifa-sched.c (try_group_insn): New Function.
>         (group_insns_for_macro_fusion): Ditto.
>         (sched_init): Call group_insns_for_macro_fusion.
>         * gcc/sched-rgn.c (add_branch_dependences): Keep insns in
>         a SCHED_GROUP at the end of BB to remain their location.
>         * gcc/target.def: Add two hooks: macro_fusion_p and
>         macro_fusion_pair_p.
>
> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 
> 1fd3f60..4a04778 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -24856,6 +24856,167 @@ ia32_multipass_dfa_lookahead (void)
>      }
>  }
>
> +/* Extracted from ix86_print_operand_address. Check whether ADDR is a
> +   rip-relative address.  */
> +
> +static bool
> +rip_relative_addr_p (rtx addr)
> +{
> +  struct ix86_address parts;
> +  rtx base, index, disp;
> +  int ok;
> +
> +  if (GET_CODE (addr) == UNSPEC && XINT (addr, 1) == UNSPEC_VSIBADDR)
> +    {
> +      ok = ix86_decompose_address (XVECEXP (addr, 0, 0), &parts);
> +      parts.index = XVECEXP (addr, 0, 1);
> +    }
> +  else if (GET_CODE (addr) == UNSPEC && XINT (addr, 1) == UNSPEC_LEA_ADDR)
> +    ok = ix86_decompose_address (XVECEXP (addr, 0, 0), &parts);  else
> +    ok = ix86_decompose_address (addr, &parts);
> +
> +  gcc_assert (ok);
> +  base = parts.base;
> +  index = parts.index;
> +  disp = parts.disp;
> +
> +  if (TARGET_64BIT && !base && !index)
> +    {
> +      rtx symbol = disp;
> +
> +      if (GET_CODE (disp) == CONST
> +         && GET_CODE (XEXP (disp, 0)) == PLUS
> +         && CONST_INT_P (XEXP (XEXP (disp, 0), 1)))
> +       symbol = XEXP (XEXP (disp, 0), 0);
> +
> +      if (GET_CODE (symbol) == LABEL_REF
> +         || (GET_CODE (symbol) == SYMBOL_REF
> +             && SYMBOL_REF_TLS_MODEL (symbol) == 0))
> +       return true;
> +    }
> +  if (flag_pic && !base && !index)
> +    {
> +      if (GET_CODE (disp) == CONST
> +         && GET_CODE (XEXP (disp, 0)) == UNSPEC
> +         && (XINT (XEXP (disp, 0), 1) == UNSPEC_PCREL
> +             || XINT (XEXP (disp, 0), 1) == UNSPEC_GOTPCREL
> +             || (TARGET_64BIT
> +                 && XINT (XEXP (disp, 0), 1) == UNSPEC_GOTNTPOFF)))
> +       return true;
> +    }
> +  return false;
> +}
> +

It doesn't look right.  IP relative address is only possible with TARGET_64BIT and

1. base == pc. Or
2. UUNSPEC_PCREL,  UNSPEC_GOTPCREL, and
NSPEC_GOTNTPOFF.


--
H.J.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-09-25  3:28                     ` Wei Mi
@ 2013-10-03 18:25                       ` Wei Mi
  2013-10-15 20:47                         ` Jeff Law
  0 siblings, 1 reply; 59+ messages in thread
From: Wei Mi @ 2013-10-03 18:25 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: H.J. Lu, Alexander Monakov, Steven Bosscher, GCC Patches,
	David Li, Kirill Yukhin, Vladimir Makarov, law

On Tue, Sep 24, 2013 at 4:32 PM, Wei Mi <wmi@google.com> wrote:
>>> It doesn't look right.  IP relative address is only possible
>>> with TARGET_64BIT and
>>>
>>> 1. base == pc. Or
>>> 2. UUNSPEC_PCREL,  UNSPEC_GOTPCREL, and
>>> NSPEC_GOTNTPOFF.
>>
>> Target 64bit should be tested above.  We however output RIP addresses
>> also for basic symbol references.  I.e. when base is an symbol addresss.
>> such as in:
>> int a;
>> int t()
>> {
>>   return a;
>> }
>>
>> memory_address_length already contains logic to figure out if there is IP
>> relative addressing going on (I am not sure it is completely accurate either).
>> Better to break it out to a common predicate and perhaps unify with what
>> ix86_print_operand_address is doing.
>>
>> Honza
>>>
>>>
>>> --
>>> H.J.
>
> Thanks. How about this one. bootstrap and regression are going on.
>

Ccing scheduler maintainers.

Ping. Repaste the patch with some minor error fixed. bootstrap and
regression ok. Ok for trunk?

Thanks,
Wei Mi.

2013-10-03  Wei Mi  <wmi@google.com>

        * gcc/config/i386/i386.c (memory_address_length): Extract a part
        of code to rip_relative_addr_p.
        (rip_relative_addr_p): New Function.
        (ix86_macro_fusion_p): Ditto.
        (ix86_macro_fusion_pair_p): Ditto.
        * gcc/config/i386/i386.h: Add new tune features about macro-fusion.
        * gcc/config/i386/x86-tune.def (DEF_TUNE): Ditto.
        * gcc/doc/tm.texi: Generated.
        * gcc/doc/tm.texi.in: Ditto.
        * gcc/haifa-sched.c (try_group_insn): New Function.
        (group_insns_for_macro_fusion): Ditto.
        (sched_init): Call group_insns_for_macro_fusion.
        * gcc/sched-rgn.c (add_branch_dependences): Keep insns in
        a SCHED_GROUP at the end of BB to remain their location.
        * gcc/target.def: Add two hooks: macro_fusion_p and
        macro_fusion_pair_p.

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 1fd3f60..59b0bcf 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -24204,6 +24204,42 @@ ix86_instantiate_decls (void)
       instantiate_decl_rtl (s->rtl);
 }

+/* Check whether x86 address PARTS is a pc-relative address.  */
+
+static bool
+rip_relative_addr_p (struct ix86_address *parts)
+{
+  rtx base, index, disp;
+
+  base = parts->base;
+  index = parts->index;
+  disp = parts->disp;
+
+  if (disp && !base && !index)
+    {
+      if (TARGET_64BIT)
+       {
+         rtx symbol = disp;
+
+         if (GET_CODE (disp) == CONST)
+           symbol = XEXP (disp, 0);
+         if (GET_CODE (symbol) == PLUS
+             && CONST_INT_P (XEXP (symbol, 1)))
+           symbol = XEXP (symbol, 0);
+
+         if (GET_CODE (symbol) == LABEL_REF
+             || (GET_CODE (symbol) == SYMBOL_REF
+                 && SYMBOL_REF_TLS_MODEL (symbol) == 0)
+             || (GET_CODE (symbol) == UNSPEC
+                 && (XINT (symbol, 1) == UNSPEC_GOTPCREL
+                     || XINT (symbol, 1) == UNSPEC_PCREL
+                     || XINT (symbol, 1) == UNSPEC_GOTNTPOFF)))
+           return true;
+       }
+    }
+  return false;
+}
+
 /* Calculate the length of the memory address in the instruction encoding.
    Includes addr32 prefix, does not include the one-byte modrm, opcode,
    or other prefixes.  We never generate addr32 prefix for LEA insn.  */
@@ -24275,25 +24311,8 @@ memory_address_length (rtx addr, bool lea)
   else if (disp && !base && !index)
     {
       len += 4;
-      if (TARGET_64BIT)
-       {
-         rtx symbol = disp;
-
-         if (GET_CODE (disp) == CONST)
-           symbol = XEXP (disp, 0);
-         if (GET_CODE (symbol) == PLUS
-             && CONST_INT_P (XEXP (symbol, 1)))
-           symbol = XEXP (symbol, 0);
-
-         if (GET_CODE (symbol) != LABEL_REF
-             && (GET_CODE (symbol) != SYMBOL_REF
-                 || SYMBOL_REF_TLS_MODEL (symbol) != 0)
-             && (GET_CODE (symbol) != UNSPEC
-                 || (XINT (symbol, 1) != UNSPEC_GOTPCREL
-                     && XINT (symbol, 1) != UNSPEC_PCREL
-                     && XINT (symbol, 1) != UNSPEC_GOTNTPOFF)))
-           len++;
-       }
+      if (rip_relative_addr_p (&parts))
+       len++;
     }
   else
     {
@@ -24856,6 +24875,122 @@ ia32_multipass_dfa_lookahead (void)
     }
 }

+/* Return true if target platform supports macro-fusion.  */
+
+static bool
+ix86_macro_fusion_p ()
+{
+  if (TARGET_FUSE_CMP_AND_BRANCH)
+    return true;
+  else
+    return false;
+}
+
+/* Check whether current microarchitecture support macro fusion
+   for insn pair "CONDGEN + CONDJMP". Refer to
+   "Intel Architectures Optimization Reference Manual". */
+
+static bool
+ix86_macro_fusion_pair_p (rtx condgen, rtx condjmp)
+{
+  rtx src, dest;
+  rtx single_set = single_set (condgen);
+  enum rtx_code ccode;
+  rtx compare_set = NULL_RTX, test_if, cond;
+  rtx alu_set = NULL_RTX, addr = NULL_RTX;
+
+  if (get_attr_type (condgen) != TYPE_TEST
+      && get_attr_type (condgen) != TYPE_ICMP
+      && get_attr_type (condgen) != TYPE_INCDEC
+      && get_attr_type (condgen) != TYPE_ALU)
+    return false;
+
+  if (single_set == NULL_RTX
+      && !TARGET_FUSE_ALU_AND_BRANCH)
+    return false;
+
+  if (single_set != NULL_RTX)
+    compare_set = single_set;
+  else
+    {
+      int i;
+      rtx pat = PATTERN (condgen);
+      for (i = 0; i < XVECLEN (pat, 0); i++)
+       if (GET_CODE (XVECEXP (pat, 0, i)) == SET)
+         {
+           rtx set_src = SET_SRC (XVECEXP (pat, 0, i));
+           if (GET_CODE (set_src) == COMPARE)
+             compare_set = XVECEXP (pat, 0, i);
+           else
+             alu_set = XVECEXP (pat, 0, i);
+         }
+    }
+  if (compare_set == NULL_RTX)
+    return false;
+  src = SET_SRC (compare_set);
+  if (GET_CODE (src) != COMPARE)
+    return false;
+
+  /* Macro-fusion for cmp/test MEM-IMM + conditional jmp is not
+     supported.  */
+  if ((MEM_P (XEXP (src, 0))
+       && CONST_INT_P (XEXP (src, 1)))
+      || (MEM_P (XEXP (src, 1))
+         && CONST_INT_P (XEXP (src, 0))))
+    return false;
+
+  /* No fusion for RIP-relative address.  */
+  if (MEM_P (XEXP (src, 0)))
+    addr = XEXP (XEXP (src, 0), 0);
+  else if (MEM_P (XEXP (src, 1)))
+    addr = XEXP (XEXP (src, 1), 0);
+
+  if (addr) {
+    ix86_address parts;
+    int ok = ix86_decompose_address (addr, &parts);
+    gcc_assert (ok);
+
+    if (rip_relative_addr_p (&parts))
+      return false;
+  }
+
+  test_if = SET_SRC (pc_set (condjmp));
+  cond = XEXP (test_if, 0);
+  ccode = GET_CODE (cond);
+  /* Check whether conditional jump use Sign or Overflow Flags.  */
+  if (!TARGET_FUSE_CMP_AND_BRANCH_SOFLAGS
+      && (ccode == GE
+          || ccode == GT
+         || ccode == LE
+         || ccode == LT))
+    return false;
+
+  /* Return true for TYPE_TEST and TYPE_ICMP.  */
+  if (get_attr_type (condgen) == TYPE_TEST
+      || get_attr_type (condgen) == TYPE_ICMP)
+    return true;
+
+  /* The following is the case that macro-fusion for alu + jmp.  */
+  if (!TARGET_FUSE_ALU_AND_BRANCH || !alu_set)
+    return false;
+
+  /* No fusion for alu op with memory destination operand.  */
+  dest = SET_DEST (alu_set);
+  if (MEM_P (dest))
+    return false;
+
+  /* Macro-fusion for inc/dec + unsigned conditional jump is not
+     supported.  */
+  if (get_attr_type (condgen) == TYPE_INCDEC
+      && (ccode == GEU
+         || ccode == GTU
+         || ccode == LEU
+         || ccode == LTU))
+    return false;
+
+  return true;
+}
+
 /* Try to reorder ready list to take advantage of Atom pipelined IMUL
    execution. It is applied if
    (1) IMUL instruction is on the top of list;
@@ -42993,6 +43128,10 @@ ix86_memmodel_check (unsigned HOST_WIDE_INT val)
 #undef TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD
 #define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD \
   ia32_multipass_dfa_lookahead
+#undef TARGET_SCHED_MACRO_FUSION_P
+#define TARGET_SCHED_MACRO_FUSION_P ix86_macro_fusion_p
+#undef TARGET_SCHED_MACRO_FUSION_PAIR_P
+#define TARGET_SCHED_MACRO_FUSION_PAIR_P ix86_macro_fusion_pair_p

 #undef TARGET_FUNCTION_OK_FOR_SIBCALL
 #define TARGET_FUNCTION_OK_FOR_SIBCALL ix86_function_ok_for_sibcall
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 788cb8a..68fabd9 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -362,8 +362,17 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
        ix86_tune_features[X86_TUNE_USE_VECTOR_FP_CONVERTS]
 #define TARGET_USE_VECTOR_CONVERTS \
        ix86_tune_features[X86_TUNE_USE_VECTOR_CONVERTS]
+#define TARGET_FUSE_CMP_AND_BRANCH_32 \
+       ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH_32]
+#define TARGET_FUSE_CMP_AND_BRANCH_64 \
+       ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH_64]
 #define TARGET_FUSE_CMP_AND_BRANCH \
-       ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH]
+       (TARGET_64BIT ? TARGET_FUSE_CMP_AND_BRANCH_64 \
+        : TARGET_FUSE_CMP_AND_BRANCH_32)
+#define TARGET_FUSE_CMP_AND_BRANCH_SOFLAGS \
+       ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH_SOFLAGS]
+#define TARGET_FUSE_ALU_AND_BRANCH \
+       ix86_tune_features[X86_TUNE_FUSE_ALU_AND_BRANCH]
 #define TARGET_OPT_AGU ix86_tune_features[X86_TUNE_OPT_AGU]
 #define TARGET_VECTORIZE_DOUBLE \
        ix86_tune_features[X86_TUNE_VECTORIZE_DOUBLE]
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index 4ae5f70..3d395b0 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -193,10 +193,24 @@ DEF_TUNE (X86_TUNE_USE_VECTOR_FP_CONVERTS,
"use_vector_fp_converts",
 /* X86_TUNE_USE_VECTOR_CONVERTS: Prefer vector packed SSE conversion
    from integer to FP. */
 DEF_TUNE (X86_TUNE_USE_VECTOR_CONVERTS, "use_vector_converts", m_AMDFAM10)
-/* X86_TUNE_FUSE_CMP_AND_BRANCH: Fuse a compare or test instruction
-   with a subsequent conditional jump instruction into a single
-   compare-and-branch uop.  */
-DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH, "fuse_cmp_and_branch", m_BDVER)
+/* X86_TUNE_FUSE_CMP_AND_BRANCH_32: Fuse compare with a subsequent
+   conditional jump instruction for 32 bit TARGET.  */
+DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH_32, "fuse_cmp_and_branch_32",
+          m_CORE_ALL | m_BDVER)
+/* X86_TUNE_FUSE_CMP_AND_BRANCH_64: Fuse compare with a subsequent
+   conditional jump instruction for TARGET_64BIT.  */
+DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH_64, "fuse_cmp_and_branch_64",
+          m_COREI7 | m_COREI7_AVX | m_HASWELL | m_BDVER)
+/* X86_TUNE_FUSE_CMP_AND_BRANCH_SOFLAGS: Fuse compare with a
+   subsequent conditional jump instruction when the condition jump
+   check sign flag (SF) or overflow flag (OF).  */
+DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH_SOFLAGS, "fuse_cmp_and_branch_soflags",
+          m_COREI7 | m_COREI7_AVX | m_HASWELL | m_BDVER)
+/* X86_TUNE_FUSE_ALU_AND_BRANCH: Fuse alu with a subsequent conditional
+   jump instruction when the alu instruction produces the CCFLAG consumed by
+   the conditional jump instruction. */
+DEF_TUNE (X86_TUNE_FUSE_ALU_AND_BRANCH, "fuse_alu_and_branch",
+          m_COREI7_AVX | m_HASWELL)
 /* X86_TUNE_OPT_AGU: Optimize for Address Generation Unit. This flag
    will impact LEA instruction selection. */
 DEF_TUNE (X86_TUNE_OPT_AGU, "opt_agu", m_ATOM | m_SLM)
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index d15f53c..66b45b9 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6553,6 +6553,17 @@ scheduling one insn causes other insns to
become ready in the same
 cycle.  These other insns can then be taken into account properly.
 @end deftypefn

+@deftypefn {Target Hook} bool TARGET_SCHED_MACRO_FUSION_P (void)
+This hook is used to check whether target platform supports macro fusion.
+@end deftypefn
+
+@deftypefn {Target Hook} bool TARGET_SCHED_MACRO_FUSION_PAIR_P (rtx
@var{condgen}, rtx @var{condjmp})
+This hook is used to check whether two insns could be macro fused for
+target microarchitecture. If this hook returns true for the given insn pair
+(@var{condgen} and @var{condjmp}), scheduler will put them into a sched
+group, and they will not be scheduled apart.
+@end deftypefn
+
 @deftypefn {Target Hook} void
TARGET_SCHED_DEPENDENCIES_EVALUATION_HOOK (rtx @var{head}, rtx
@var{tail})
 This hook is called after evaluation forward dependencies of insns in
 chain given by two parameter values (@var{head} and @var{tail}
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index b51d7b3..361ee87 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4940,6 +4940,10 @@ them: try the first ones in this list first.

 @hook TARGET_SCHED_REORDER2

+@hook TARGET_SCHED_MACRO_FUSION_P
+
+@hook TARGET_SCHED_MACRO_FUSION_PAIR_P
+
 @hook TARGET_SCHED_DEPENDENCIES_EVALUATION_HOOK

 @hook TARGET_SCHED_INIT
diff --git a/gcc/haifa-sched.c b/gcc/haifa-sched.c
index 61eaaef..d6726a9 100644
--- a/gcc/haifa-sched.c
+++ b/gcc/haifa-sched.c
@@ -6519,6 +6519,44 @@ setup_sched_dump (void)
                ? stderr : dump_file);
 }

+static void
+try_group_insn (rtx insn)
+{
+  unsigned int condreg1, condreg2;
+  rtx cc_reg_1;
+  rtx prev;
+
+  targetm.fixed_condition_code_regs (&condreg1, &condreg2);
+  cc_reg_1 = gen_rtx_REG (CCmode, condreg1);
+  prev = prev_nonnote_nondebug_insn (insn);
+  if (!any_condjump_p (insn)
+      || !reg_referenced_p (cc_reg_1, PATTERN (insn))
+      || !prev
+      || !modified_in_p (cc_reg_1, prev))
+    return;
+
+  /* Different microarchitectures support macro fusions for different
+     combinations of insn pairs.  */
+  if (!targetm.sched.macro_fusion_pair_p
+      || !targetm.sched.macro_fusion_pair_p (prev, insn))
+    return;
+
+  SCHED_GROUP_P (insn) = 1;
+}
+
+/* If the last cond jump and the cond register defining insn are consecutive
+   before scheduling, we want them to be in a schedule group. This is good
+   for performance on microarchitectures supporting macro-fusion.  */
+
+static void
+group_insns_for_macro_fusion ()
+{
+  basic_block bb;
+
+  FOR_EACH_BB (bb)
+    try_group_insn (BB_END (bb));
+}
+
 /* Initialize some global state for the scheduler.  This function works
    with the common data shared between all the schedulers.  It is called
    from the scheduler specific initialization routine.  */
@@ -6645,6 +6683,11 @@ sched_init (void)
     }

   curr_state = xmalloc (dfa_state_size);
+
+  /* Group compare and branch insns for macro-fusion.  */
+  if (targetm.sched.macro_fusion_p
+      && targetm.sched.macro_fusion_p ())
+    group_insns_for_macro_fusion ();
 }

 static void haifa_init_only_bb (basic_block, basic_block);
diff --git a/gcc/sched-rgn.c b/gcc/sched-rgn.c
index e1a2dce..156359e 100644
--- a/gcc/sched-rgn.c
+++ b/gcc/sched-rgn.c
@@ -2443,6 +2443,8 @@ add_branch_dependences (rtx head, rtx tail)
      cc0 setters remain at the end because they can't be moved away from
      their cc0 user.

+     Predecessors of SCHED_GROUP_P instructions at the end remain at the end.
+
      COND_EXEC insns cannot be moved past a branch (see e.g. PR17808).

      Insns setting TARGET_CLASS_LIKELY_SPILLED_P registers (usually return
@@ -2465,7 +2467,8 @@ add_branch_dependences (rtx head, rtx tail)
 #endif
                 || (!reload_completed
                     && sets_likely_spilled (PATTERN (insn)))))
-        || NOTE_P (insn))
+        || NOTE_P (insn)
+        || (last != 0 && SCHED_GROUP_P (last)))
     {
       if (!NOTE_P (insn))
        {
diff --git a/gcc/target.def b/gcc/target.def
index 6de513f..dae0378 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1041,6 +1041,19 @@ scheduling one insn causes other insns to
become ready in the same\n\
 cycle.  These other insns can then be taken into account properly.",
  int, (FILE *file, int verbose, rtx *ready, int *n_readyp, int clock), NULL)

+DEFHOOK
+(macro_fusion_p,
+ "This hook is used to check whether target platform supports macro fusion.",
+ bool, (void), NULL)
+
+DEFHOOK
+(macro_fusion_pair_p,
+ "This hook is used to check whether two insns could be macro fused for\n\
+target microarchitecture. If this hook returns true for the given insn pair\n\
+(@var{condgen} and @var{condjmp}), scheduler will put them into a sched\n\
+group, and they will not be scheduled apart.",
+ bool, (rtx condgen, rtx condjmp), NULL)
+
 /* The following member value is a pointer to a function called
    after evaluation forward dependencies of insns in chain given
    by two parameter values (head and tail correspondingly).  */

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-10-03 18:25                       ` Wei Mi
@ 2013-10-15 20:47                         ` Jeff Law
  2013-10-15 21:57                           ` Wei Mi
  0 siblings, 1 reply; 59+ messages in thread
From: Jeff Law @ 2013-10-15 20:47 UTC (permalink / raw)
  To: Wei Mi, Jan Hubicka
  Cc: H.J. Lu, Alexander Monakov, Steven Bosscher, GCC Patches,
	David Li, Kirill Yukhin, Vladimir Makarov

On 10/03/13 12:24, Wei Mi wrote:
> Thanks,
> Wei Mi.
>
> 2013-10-03  Wei Mi  <wmi@google.com>
>
>          * gcc/config/i386/i386.c (memory_address_length): Extract a part
>          of code to rip_relative_addr_p.
>          (rip_relative_addr_p): New Function.
>          (ix86_macro_fusion_p): Ditto.
>          (ix86_macro_fusion_pair_p): Ditto.
>          * gcc/config/i386/i386.h: Add new tune features about macro-fusion.
>          * gcc/config/i386/x86-tune.def (DEF_TUNE): Ditto.
>          * gcc/doc/tm.texi: Generated.
>          * gcc/doc/tm.texi.in: Ditto.
>          * gcc/haifa-sched.c (try_group_insn): New Function.
>          (group_insns_for_macro_fusion): Ditto.
>          (sched_init): Call group_insns_for_macro_fusion.
>          * gcc/sched-rgn.c (add_branch_dependences): Keep insns in
>          a SCHED_GROUP at the end of BB to remain their location.
>          * gcc/target.def: Add two hooks: macro_fusion_p and
>          macro_fusion_pair_p.
I'm not going to comment on the x86 specific stuff -- I'll defer to the 
port maintainers for that.


> index 61eaaef..d6726a9 100644
> --- a/gcc/haifa-sched.c
> +++ b/gcc/haifa-sched.c
> @@ -6519,6 +6519,44 @@ setup_sched_dump (void)
>                  ? stderr : dump_file);
>   }
>
> +static void
> +try_group_insn (rtx insn)
You need a comment for this function.


> +{
> +  unsigned int condreg1, condreg2;
> +  rtx cc_reg_1;
> +  rtx prev;
> +
> +  targetm.fixed_condition_code_regs (&condreg1, &condreg2);
> +  cc_reg_1 = gen_rtx_REG (CCmode, condreg1);
> +  prev = prev_nonnote_nondebug_insn (insn);
> +  if (!any_condjump_p (insn)
> +      || !reg_referenced_p (cc_reg_1, PATTERN (insn))
> +      || !prev
> +      || !modified_in_p (cc_reg_1, prev))
> +    return;
I'd test !any_condjump_p at the start of this function before calling 
the target hook.  If insn isn't a conditional jump, then all the other 
work is totally useless.

Aren't you just trying to see if we have a comparison feeding the 
conditional jump and if they're already adjacent?  Do you actually need 
to get the condition code regs to do that test?

> +
> +  /* Different microarchitectures support macro fusions for different
> +     combinations of insn pairs.  */
> +  if (!targetm.sched.macro_fusion_pair_p
> +      || !targetm.sched.macro_fusion_pair_p (prev, insn))
> +    return;
> +
> +  SCHED_GROUP_P (insn) = 1;
I'm surprised that SCHED_GROUP_P worked -- I've tried to do similar 
stuff in the past and ran into numerous problems trying to hijack 
SCHED_GROUP_P for this kind of purpose.


>
>   static void haifa_init_only_bb (basic_block, basic_block);
> diff --git a/gcc/sched-rgn.c b/gcc/sched-rgn.c
> index e1a2dce..156359e 100644
> --- a/gcc/sched-rgn.c
> +++ b/gcc/sched-rgn.c
> @@ -2443,6 +2443,8 @@ add_branch_dependences (rtx head, rtx tail)
>        cc0 setters remain at the end because they can't be moved away from
>        their cc0 user.
>
> +     Predecessors of SCHED_GROUP_P instructions at the end remain at the end.
> +
>        COND_EXEC insns cannot be moved past a branch (see e.g. PR17808).
>
>        Insns setting TARGET_CLASS_LIKELY_SPILLED_P registers (usually return
> @@ -2465,7 +2467,8 @@ add_branch_dependences (rtx head, rtx tail)
>   #endif
>                   || (!reload_completed
>                       && sets_likely_spilled (PATTERN (insn)))))
> -        || NOTE_P (insn))
> +        || NOTE_P (insn)
> +        || (last != 0 && SCHED_GROUP_P (last)))
>       {
>         if (!NOTE_P (insn))
>          {
This looks like a straighforward bugfix and probably should go forward 
independent of this enhancement.

Jeff

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-10-15 20:47                         ` Jeff Law
@ 2013-10-15 21:57                           ` Wei Mi
  2013-10-16 20:51                             ` Jeff Law
  0 siblings, 1 reply; 59+ messages in thread
From: Wei Mi @ 2013-10-15 21:57 UTC (permalink / raw)
  To: Jeff Law
  Cc: Jan Hubicka, H.J. Lu, Alexander Monakov, Steven Bosscher,
	GCC Patches, David Li, Kirill Yukhin, Vladimir Makarov

Thanks for the comments. One question inlined. Preparing another patch
addressing the comments.

Regards,
Wei Mi.

On Tue, Oct 15, 2013 at 1:35 PM, Jeff Law <law@redhat.com> wrote:
> On 10/03/13 12:24, Wei Mi wrote:
>>
>> Thanks,
>> Wei Mi.
>>
>> 2013-10-03  Wei Mi  <wmi@google.com>
>>
>>          * gcc/config/i386/i386.c (memory_address_length): Extract a part
>>          of code to rip_relative_addr_p.
>>          (rip_relative_addr_p): New Function.
>>          (ix86_macro_fusion_p): Ditto.
>>          (ix86_macro_fusion_pair_p): Ditto.
>>          * gcc/config/i386/i386.h: Add new tune features about
>> macro-fusion.
>>          * gcc/config/i386/x86-tune.def (DEF_TUNE): Ditto.
>>          * gcc/doc/tm.texi: Generated.
>>          * gcc/doc/tm.texi.in: Ditto.
>>          * gcc/haifa-sched.c (try_group_insn): New Function.
>>          (group_insns_for_macro_fusion): Ditto.
>>          (sched_init): Call group_insns_for_macro_fusion.
>>          * gcc/sched-rgn.c (add_branch_dependences): Keep insns in
>>          a SCHED_GROUP at the end of BB to remain their location.
>>          * gcc/target.def: Add two hooks: macro_fusion_p and
>>          macro_fusion_pair_p.
>
> I'm not going to comment on the x86 specific stuff -- I'll defer to the port
> maintainers for that.
>
>
>
>> index 61eaaef..d6726a9 100644
>> --- a/gcc/haifa-sched.c
>> +++ b/gcc/haifa-sched.c
>> @@ -6519,6 +6519,44 @@ setup_sched_dump (void)
>>                  ? stderr : dump_file);
>>   }
>>
>> +static void
>> +try_group_insn (rtx insn)
>
> You need a comment for this function.
>

Ok, will add comment for it.

>
>
>> +{
>> +  unsigned int condreg1, condreg2;
>> +  rtx cc_reg_1;
>> +  rtx prev;
>> +
>> +  targetm.fixed_condition_code_regs (&condreg1, &condreg2);
>> +  cc_reg_1 = gen_rtx_REG (CCmode, condreg1);
>> +  prev = prev_nonnote_nondebug_insn (insn);
>> +  if (!any_condjump_p (insn)
>> +      || !reg_referenced_p (cc_reg_1, PATTERN (insn))
>> +      || !prev
>> +      || !modified_in_p (cc_reg_1, prev))
>> +    return;
>
> I'd test !any_condjump_p at the start of this function before calling the
> target hook.  If insn isn't a conditional jump, then all the other work is
> totally useless.

Ok. will fix it.

>
> Aren't you just trying to see if we have a comparison feeding the
> conditional jump and if they're already adjacent?  Do you actually need to
> get the condition code regs to do that test?
>

Yes, I am trying to see if we have a comparison feeding the
conditional jump and if they're already adjacent. Do you have more
easier way to do that test?

>
>> +
>> +  /* Different microarchitectures support macro fusions for different
>> +     combinations of insn pairs.  */
>> +  if (!targetm.sched.macro_fusion_pair_p
>> +      || !targetm.sched.macro_fusion_pair_p (prev, insn))
>> +    return;
>> +
>> +  SCHED_GROUP_P (insn) = 1;
>
> I'm surprised that SCHED_GROUP_P worked -- I've tried to do similar stuff in
> the past and ran into numerous problems trying to hijack SCHED_GROUP_P for
> this kind of purpose.
>
>
>
>>
>>   static void haifa_init_only_bb (basic_block, basic_block);
>> diff --git a/gcc/sched-rgn.c b/gcc/sched-rgn.c
>> index e1a2dce..156359e 100644
>> --- a/gcc/sched-rgn.c
>> +++ b/gcc/sched-rgn.c
>> @@ -2443,6 +2443,8 @@ add_branch_dependences (rtx head, rtx tail)
>>        cc0 setters remain at the end because they can't be moved away from
>>        their cc0 user.
>>
>> +     Predecessors of SCHED_GROUP_P instructions at the end remain at the
>> end.
>> +
>>        COND_EXEC insns cannot be moved past a branch (see e.g. PR17808).
>>
>>        Insns setting TARGET_CLASS_LIKELY_SPILLED_P registers (usually
>> return
>> @@ -2465,7 +2467,8 @@ add_branch_dependences (rtx head, rtx tail)
>>   #endif
>>                   || (!reload_completed
>>                       && sets_likely_spilled (PATTERN (insn)))))
>> -        || NOTE_P (insn))
>> +        || NOTE_P (insn)
>> +        || (last != 0 && SCHED_GROUP_P (last)))
>>       {
>>         if (!NOTE_P (insn))
>>          {
>
> This looks like a straighforward bugfix and probably should go forward
> independent of this enhancement.

Ok, I will separate it into another patch.

>
> Jeff

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-10-15 21:57                           ` Wei Mi
@ 2013-10-16 20:51                             ` Jeff Law
  2013-10-17  1:06                               ` Wei Mi
  0 siblings, 1 reply; 59+ messages in thread
From: Jeff Law @ 2013-10-16 20:51 UTC (permalink / raw)
  To: Wei Mi
  Cc: Jan Hubicka, H.J. Lu, Alexander Monakov, Steven Bosscher,
	GCC Patches, David Li, Kirill Yukhin, Vladimir Makarov

On 10/15/13 15:30, Wei Mi wrote:
>
>>
>> Aren't you just trying to see if we have a comparison feeding the
>> conditional jump and if they're already adjacent?  Do you actually need to
>> get the condition code regs to do that test?
>>
>
> Yes, I am trying to see if we have a comparison feeding the
> conditional jump and if they're already adjacent. Do you have more
> easier way to do that test?
Can't you just look at the last insn in the block and if it's a 
conditional peek at the previous insn and see if it sets CC mode register?

Hmm, I guess that's effectively what you're doing, I guess I was just 
surprised by the need to first get the fixed_condition_code_regs as I 
expected you to just extract them from the conditional jump.   But 
thinking a bit more about it now your solution seems rather clean.



>>>    static void haifa_init_only_bb (basic_block, basic_block);
>>> diff --git a/gcc/sched-rgn.c b/gcc/sched-rgn.c
>>> index e1a2dce..156359e 100644
>>> --- a/gcc/sched-rgn.c
>>> +++ b/gcc/sched-rgn.c
>>> @@ -2443,6 +2443,8 @@ add_branch_dependences (rtx head, rtx tail)
>>>         cc0 setters remain at the end because they can't be moved away from
>>>         their cc0 user.
>>>
>>> +     Predecessors of SCHED_GROUP_P instructions at the end remain at the
>>> end.
>>> +
>>>         COND_EXEC insns cannot be moved past a branch (see e.g. PR17808).
>>>
>>>         Insns setting TARGET_CLASS_LIKELY_SPILLED_P registers (usually
>>> return
>>> @@ -2465,7 +2467,8 @@ add_branch_dependences (rtx head, rtx tail)
>>>    #endif
>>>                    || (!reload_completed
>>>                        && sets_likely_spilled (PATTERN (insn)))))
>>> -        || NOTE_P (insn))
>>> +        || NOTE_P (insn)
>>> +        || (last != 0 && SCHED_GROUP_P (last)))
>>>        {
>>>          if (!NOTE_P (insn))
>>>           {
>>
>> This looks like a straighforward bugfix and probably should go forward
>> independent of this enhancement.
>
> Ok, I will separate it into another patch.
Go ahead and consider that pre-approved.  Just send it to the list with 
a note that I approved it in this thread.
>
>>
>> Jeff

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-10-16 20:51                             ` Jeff Law
@ 2013-10-17  1:06                               ` Wei Mi
  2013-10-17  8:06                                 ` Marek Polacek
  2013-11-02  1:40                                 ` Wei Mi
  0 siblings, 2 replies; 59+ messages in thread
From: Wei Mi @ 2013-10-17  1:06 UTC (permalink / raw)
  To: Jeff Law
  Cc: Jan Hubicka, H.J. Lu, Alexander Monakov, Steven Bosscher,
	GCC Patches, David Li, Kirill Yukhin, Vladimir Makarov

> Go ahead and consider that pre-approved.  Just send it to the list with a
> note that I approved it in this thread.
>
> Jeff

Thanks! The new patch addressed Jeff's comments.

Is it ok for x86 maintainer?

Thanks,
Wei Mi.

2013-10-16  Wei Mi  <wmi@google.com>

        * gcc/config/i386/i386.c (memory_address_length): Extract a part
        of code to rip_relative_addr_p.
        (rip_relative_addr_p): New Function.
        (ix86_macro_fusion_p): Ditto.
        (ix86_macro_fusion_pair_p): Ditto.
        * gcc/config/i386/i386.h: Add new tune features about macro-fusion.
        * gcc/config/i386/x86-tune.def (DEF_TUNE): Ditto.
        * gcc/doc/tm.texi: Generated.
        * gcc/doc/tm.texi.in: Ditto.
        * gcc/haifa-sched.c (try_group_insn): New Function.
        (group_insns_for_macro_fusion): Ditto.
        (sched_init): Call group_insns_for_macro_fusion.
        * gcc/target.def: Add two hooks: macro_fusion_p and
        macro_fusion_pair_p.

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 1fd3f60..59b0bcf 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -24204,6 +24204,42 @@ ix86_instantiate_decls (void)
       instantiate_decl_rtl (s->rtl);
 }

+/* Check whether x86 address PARTS is a pc-relative address.  */
+
+static bool
+rip_relative_addr_p (struct ix86_address *parts)
+{
+  rtx base, index, disp;
+
+  base = parts->base;
+  index = parts->index;
+  disp = parts->disp;
+
+  if (disp && !base && !index)
+    {
+      if (TARGET_64BIT)
+       {
+         rtx symbol = disp;
+
+         if (GET_CODE (disp) == CONST)
+           symbol = XEXP (disp, 0);
+         if (GET_CODE (symbol) == PLUS
+             && CONST_INT_P (XEXP (symbol, 1)))
+           symbol = XEXP (symbol, 0);
+
+         if (GET_CODE (symbol) == LABEL_REF
+             || (GET_CODE (symbol) == SYMBOL_REF
+                 && SYMBOL_REF_TLS_MODEL (symbol) == 0)
+             || (GET_CODE (symbol) == UNSPEC
+                 && (XINT (symbol, 1) == UNSPEC_GOTPCREL
+                     || XINT (symbol, 1) == UNSPEC_PCREL
+                     || XINT (symbol, 1) == UNSPEC_GOTNTPOFF)))
+           return true;
+       }
+    }
+  return false;
+}
+
 /* Calculate the length of the memory address in the instruction encoding.
    Includes addr32 prefix, does not include the one-byte modrm, opcode,
    or other prefixes.  We never generate addr32 prefix for LEA insn.  */
@@ -24275,25 +24311,8 @@ memory_address_length (rtx addr, bool lea)
   else if (disp && !base && !index)
     {
       len += 4;
-      if (TARGET_64BIT)
-       {
-         rtx symbol = disp;
-
-         if (GET_CODE (disp) == CONST)
-           symbol = XEXP (disp, 0);
-         if (GET_CODE (symbol) == PLUS
-             && CONST_INT_P (XEXP (symbol, 1)))
-           symbol = XEXP (symbol, 0);
-
-         if (GET_CODE (symbol) != LABEL_REF
-             && (GET_CODE (symbol) != SYMBOL_REF
-                 || SYMBOL_REF_TLS_MODEL (symbol) != 0)
-             && (GET_CODE (symbol) != UNSPEC
-                 || (XINT (symbol, 1) != UNSPEC_GOTPCREL
-                     && XINT (symbol, 1) != UNSPEC_PCREL
-                     && XINT (symbol, 1) != UNSPEC_GOTNTPOFF)))
-           len++;
-       }
+      if (rip_relative_addr_p (&parts))
+       len++;
     }
   else
     {
@@ -24856,6 +24875,122 @@ ia32_multipass_dfa_lookahead (void)
     }
 }

+/* Return true if target platform supports macro-fusion.  */
+
+static bool
+ix86_macro_fusion_p ()
+{
+  if (TARGET_FUSE_CMP_AND_BRANCH)
+    return true;
+  else
+    return false;
+}
+
+/* Check whether current microarchitecture support macro fusion
+   for insn pair "CONDGEN + CONDJMP". Refer to
+   "Intel Architectures Optimization Reference Manual". */
+
+static bool
+ix86_macro_fusion_pair_p (rtx condgen, rtx condjmp)
+{
+  rtx src, dest;
+  rtx single_set = single_set (condgen);
+  enum rtx_code ccode;
+  rtx compare_set = NULL_RTX, test_if, cond;
+  rtx alu_set = NULL_RTX, addr = NULL_RTX;
+
+  if (get_attr_type (condgen) != TYPE_TEST
+      && get_attr_type (condgen) != TYPE_ICMP
+      && get_attr_type (condgen) != TYPE_INCDEC
+      && get_attr_type (condgen) != TYPE_ALU)
+    return false;
+
+  if (single_set == NULL_RTX
+      && !TARGET_FUSE_ALU_AND_BRANCH)
+    return false;
+
+  if (single_set != NULL_RTX)
+    compare_set = single_set;
+  else
+    {
+      int i;
+      rtx pat = PATTERN (condgen);
+      for (i = 0; i < XVECLEN (pat, 0); i++)
+       if (GET_CODE (XVECEXP (pat, 0, i)) == SET)
+         {
+           rtx set_src = SET_SRC (XVECEXP (pat, 0, i));
+           if (GET_CODE (set_src) == COMPARE)
+             compare_set = XVECEXP (pat, 0, i);
+           else
+             alu_set = XVECEXP (pat, 0, i);
+         }
+    }
+  if (compare_set == NULL_RTX)
+    return false;
+  src = SET_SRC (compare_set);
+  if (GET_CODE (src) != COMPARE)
+    return false;
+
+  /* Macro-fusion for cmp/test MEM-IMM + conditional jmp is not
+     supported.  */
+  if ((MEM_P (XEXP (src, 0))
+       && CONST_INT_P (XEXP (src, 1)))
+      || (MEM_P (XEXP (src, 1))
+         && CONST_INT_P (XEXP (src, 0))))
+    return false;
+
+  /* No fusion for RIP-relative address.  */
+  if (MEM_P (XEXP (src, 0)))
+    addr = XEXP (XEXP (src, 0), 0);
+  else if (MEM_P (XEXP (src, 1)))
+    addr = XEXP (XEXP (src, 1), 0);
+
+  if (addr) {
+    ix86_address parts;
+    int ok = ix86_decompose_address (addr, &parts);
+    gcc_assert (ok);
+
+    if (rip_relative_addr_p (&parts))
+      return false;
+  }
+
+  test_if = SET_SRC (pc_set (condjmp));
+  cond = XEXP (test_if, 0);
+  ccode = GET_CODE (cond);
+  /* Check whether conditional jump use Sign or Overflow Flags.  */
+  if (!TARGET_FUSE_CMP_AND_BRANCH_SOFLAGS
+      && (ccode == GE
+          || ccode == GT
+         || ccode == LE
+         || ccode == LT))
+    return false;
+
+  /* Return true for TYPE_TEST and TYPE_ICMP.  */
+  if (get_attr_type (condgen) == TYPE_TEST
+      || get_attr_type (condgen) == TYPE_ICMP)
+    return true;
+
+  /* The following is the case that macro-fusion for alu + jmp.  */
+  if (!TARGET_FUSE_ALU_AND_BRANCH || !alu_set)
+    return false;
+
+  /* No fusion for alu op with memory destination operand.  */
+  dest = SET_DEST (alu_set);
+  if (MEM_P (dest))
+    return false;
+
+  /* Macro-fusion for inc/dec + unsigned conditional jump is not
+     supported.  */
+  if (get_attr_type (condgen) == TYPE_INCDEC
+      && (ccode == GEU
+         || ccode == GTU
+         || ccode == LEU
+         || ccode == LTU))
+    return false;
+
+  return true;
+}
+
 /* Try to reorder ready list to take advantage of Atom pipelined IMUL
    execution. It is applied if
    (1) IMUL instruction is on the top of list;
@@ -42993,6 +43128,10 @@ ix86_memmodel_check (unsigned HOST_WIDE_INT val)
 #undef TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD
 #define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD \
   ia32_multipass_dfa_lookahead
+#undef TARGET_SCHED_MACRO_FUSION_P
+#define TARGET_SCHED_MACRO_FUSION_P ix86_macro_fusion_p
+#undef TARGET_SCHED_MACRO_FUSION_PAIR_P
+#define TARGET_SCHED_MACRO_FUSION_PAIR_P ix86_macro_fusion_pair_p

 #undef TARGET_FUNCTION_OK_FOR_SIBCALL
 #define TARGET_FUNCTION_OK_FOR_SIBCALL ix86_function_ok_for_sibcall
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 788cb8a..68fabd9 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -362,8 +362,17 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
        ix86_tune_features[X86_TUNE_USE_VECTOR_FP_CONVERTS]
 #define TARGET_USE_VECTOR_CONVERTS \
        ix86_tune_features[X86_TUNE_USE_VECTOR_CONVERTS]
+#define TARGET_FUSE_CMP_AND_BRANCH_32 \
+       ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH_32]
+#define TARGET_FUSE_CMP_AND_BRANCH_64 \
+       ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH_64]
 #define TARGET_FUSE_CMP_AND_BRANCH \
-       ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH]
+       (TARGET_64BIT ? TARGET_FUSE_CMP_AND_BRANCH_64 \
+        : TARGET_FUSE_CMP_AND_BRANCH_32)
+#define TARGET_FUSE_CMP_AND_BRANCH_SOFLAGS \
+       ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH_SOFLAGS]
+#define TARGET_FUSE_ALU_AND_BRANCH \
+       ix86_tune_features[X86_TUNE_FUSE_ALU_AND_BRANCH]
 #define TARGET_OPT_AGU ix86_tune_features[X86_TUNE_OPT_AGU]
 #define TARGET_VECTORIZE_DOUBLE \
        ix86_tune_features[X86_TUNE_VECTORIZE_DOUBLE]
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index 4ae5f70..3d395b0 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -193,10 +193,24 @@ DEF_TUNE (X86_TUNE_USE_VECTOR_FP_CONVERTS,
"use_vector_fp_converts",
 /* X86_TUNE_USE_VECTOR_CONVERTS: Prefer vector packed SSE conversion
    from integer to FP. */
 DEF_TUNE (X86_TUNE_USE_VECTOR_CONVERTS, "use_vector_converts", m_AMDFAM10)
-/* X86_TUNE_FUSE_CMP_AND_BRANCH: Fuse a compare or test instruction
-   with a subsequent conditional jump instruction into a single
-   compare-and-branch uop.  */
-DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH, "fuse_cmp_and_branch", m_BDVER)
+/* X86_TUNE_FUSE_CMP_AND_BRANCH_32: Fuse compare with a subsequent
+   conditional jump instruction for 32 bit TARGET.  */
+DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH_32, "fuse_cmp_and_branch_32",
+          m_CORE_ALL | m_BDVER)
+/* X86_TUNE_FUSE_CMP_AND_BRANCH_64: Fuse compare with a subsequent
+   conditional jump instruction for TARGET_64BIT.  */
+DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH_64, "fuse_cmp_and_branch_64",
+          m_COREI7 | m_COREI7_AVX | m_HASWELL | m_BDVER)
+/* X86_TUNE_FUSE_CMP_AND_BRANCH_SOFLAGS: Fuse compare with a
+   subsequent conditional jump instruction when the condition jump
+   check sign flag (SF) or overflow flag (OF).  */
+DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH_SOFLAGS, "fuse_cmp_and_branch_soflags",
+          m_COREI7 | m_COREI7_AVX | m_HASWELL | m_BDVER)
+/* X86_TUNE_FUSE_ALU_AND_BRANCH: Fuse alu with a subsequent conditional
+   jump instruction when the alu instruction produces the CCFLAG consumed by
+   the conditional jump instruction. */
+DEF_TUNE (X86_TUNE_FUSE_ALU_AND_BRANCH, "fuse_alu_and_branch",
+          m_COREI7_AVX | m_HASWELL)
 /* X86_TUNE_OPT_AGU: Optimize for Address Generation Unit. This flag
    will impact LEA instruction selection. */
 DEF_TUNE (X86_TUNE_OPT_AGU, "opt_agu", m_ATOM | m_SLM)
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index d15f53c..66b45b9 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6553,6 +6553,17 @@ scheduling one insn causes other insns to
become ready in the same
 cycle.  These other insns can then be taken into account properly.
 @end deftypefn

+@deftypefn {Target Hook} bool TARGET_SCHED_MACRO_FUSION_P (void)
+This hook is used to check whether target platform supports macro fusion.
+@end deftypefn
+
+@deftypefn {Target Hook} bool TARGET_SCHED_MACRO_FUSION_PAIR_P (rtx
@var{condgen}, rtx @var{condjmp})
+This hook is used to check whether two insns could be macro fused for
+target microarchitecture. If this hook returns true for the given insn pair
+(@var{condgen} and @var{condjmp}), scheduler will put them into a sched
+group, and they will not be scheduled apart.
+@end deftypefn
+
 @deftypefn {Target Hook} void
TARGET_SCHED_DEPENDENCIES_EVALUATION_HOOK (rtx @var{head}, rtx
@var{tail})
 This hook is called after evaluation forward dependencies of insns in
 chain given by two parameter values (@var{head} and @var{tail}
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index b51d7b3..361ee87 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4940,6 +4940,10 @@ them: try the first ones in this list first.

 @hook TARGET_SCHED_REORDER2

+@hook TARGET_SCHED_MACRO_FUSION_P
+
+@hook TARGET_SCHED_MACRO_FUSION_PAIR_P
+
 @hook TARGET_SCHED_DEPENDENCIES_EVALUATION_HOOK

 @hook TARGET_SCHED_INIT
diff --git a/gcc/haifa-sched.c b/gcc/haifa-sched.c
index 61eaaef..e24009d 100644
--- a/gcc/haifa-sched.c
+++ b/gcc/haifa-sched.c
@@ -6519,6 +6519,50 @@ setup_sched_dump (void)
                ? stderr : dump_file);
 }

+/* Try to group comparison and the following conditional jump INSN if
+   they're already adjacent. This is to prevent scheduler from scheduling
+   them apart.  */
+
+static void
+try_group_insn (rtx insn)
+{
+  unsigned int condreg1, condreg2;
+  rtx cc_reg_1;
+  rtx prev;
+
+  if (!any_condjump_p (insn))
+    return;
+
+  targetm.fixed_condition_code_regs (&condreg1, &condreg2);
+  cc_reg_1 = gen_rtx_REG (CCmode, condreg1);
+  prev = prev_nonnote_nondebug_insn (insn);
+  if (!reg_referenced_p (cc_reg_1, PATTERN (insn))
+      || !prev
+      || !modified_in_p (cc_reg_1, prev))
+    return;
+
+  /* Different microarchitectures support macro fusions for different
+     combinations of insn pairs.  */
+  if (!targetm.sched.macro_fusion_pair_p
+      || !targetm.sched.macro_fusion_pair_p (prev, insn))
+    return;
+
+  SCHED_GROUP_P (insn) = 1;
+}
+
+/* If the last cond jump and the cond register defining insn are consecutive
+   before scheduling, we want them to be in a schedule group. This is good
+   for performance on microarchitectures supporting macro-fusion.  */
+
+static void
+group_insns_for_macro_fusion ()
+{
+  basic_block bb;
+
+  FOR_EACH_BB (bb)
+    try_group_insn (BB_END (bb));
+}
+
 /* Initialize some global state for the scheduler.  This function works
    with the common data shared between all the schedulers.  It is called
    from the scheduler specific initialization routine.  */
@@ -6645,6 +6689,11 @@ sched_init (void)
     }

   curr_state = xmalloc (dfa_state_size);
+
+  /* Group compare and branch insns for macro-fusion.  */
+  if (targetm.sched.macro_fusion_p
+      && targetm.sched.macro_fusion_p ())
+    group_insns_for_macro_fusion ();
 }

 static void haifa_init_only_bb (basic_block, basic_block);
diff --git a/gcc/target.def b/gcc/target.def
index 6de513f..dae0378 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1041,6 +1041,19 @@ scheduling one insn causes other insns to
become ready in the same\n\
 cycle.  These other insns can then be taken into account properly.",
  int, (FILE *file, int verbose, rtx *ready, int *n_readyp, int clock), NULL)

+DEFHOOK
+(macro_fusion_p,
+ "This hook is used to check whether target platform supports macro fusion.",
+ bool, (void), NULL)
+
+DEFHOOK
+(macro_fusion_pair_p,
+ "This hook is used to check whether two insns could be macro fused for\n\
+target microarchitecture. If this hook returns true for the given insn pair\n\
+(@var{condgen} and @var{condjmp}), scheduler will put them into a sched\n\
+group, and they will not be scheduled apart.",
+ bool, (rtx condgen, rtx condjmp), NULL)
+
 /* The following member value is a pointer to a function called
    after evaluation forward dependencies of insns in chain given
    by two parameter values (head and tail correspondingly).  */




2013-10-16  Wei Mi  <wmi@google.com>
        * gcc/sched-rgn.c (add_branch_dependences): Keep insns in
        a SCHED_GROUP at the end of BB to remain their location.

diff --git a/gcc/sched-rgn.c b/gcc/sched-rgn.c
index e1a2dce..156359e 100644
--- a/gcc/sched-rgn.c
+++ b/gcc/sched-rgn.c
@@ -2443,6 +2443,8 @@ add_branch_dependences (rtx head, rtx tail)
      cc0 setters remain at the end because they can't be moved away from
      their cc0 user.

+     Predecessors of SCHED_GROUP_P instructions at the end remain at the end.
+
      COND_EXEC insns cannot be moved past a branch (see e.g. PR17808).

      Insns setting TARGET_CLASS_LIKELY_SPILLED_P registers (usually return
@@ -2465,7 +2467,8 @@ add_branch_dependences (rtx head, rtx tail)
 #endif
                 || (!reload_completed
                     && sets_likely_spilled (PATTERN (insn)))))
-        || NOTE_P (insn))
+        || NOTE_P (insn)
+        || (last != 0 && SCHED_GROUP_P (last)))
     {
       if (!NOTE_P (insn))
        {

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-10-17  1:06                               ` Wei Mi
@ 2013-10-17  8:06                                 ` Marek Polacek
  2013-10-17 16:28                                   ` Wei Mi
  2013-11-02  1:40                                 ` Wei Mi
  1 sibling, 1 reply; 59+ messages in thread
From: Marek Polacek @ 2013-10-17  8:06 UTC (permalink / raw)
  To: Wei Mi
  Cc: Jeff Law, Jan Hubicka, H.J. Lu, Alexander Monakov,
	Steven Bosscher, GCC Patches, David Li, Kirill Yukhin,
	Vladimir Makarov

On Wed, Oct 16, 2013 at 04:25:58PM -0700, Wei Mi wrote:
> +/* Return true if target platform supports macro-fusion.  */
> +
> +static bool
> +ix86_macro_fusion_p ()
> +{
> +  if (TARGET_FUSE_CMP_AND_BRANCH)
> +    return true;
> +  else
> +    return false;
> +}

That looks weird, why not just

static bool
ix86_macro_fusion_p (void)
{
  return TARGET_FUSE_CMP_AND_BRANCH;
}

?

	Marek

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-10-17  8:06                                 ` Marek Polacek
@ 2013-10-17 16:28                                   ` Wei Mi
  0 siblings, 0 replies; 59+ messages in thread
From: Wei Mi @ 2013-10-17 16:28 UTC (permalink / raw)
  To: Marek Polacek
  Cc: Jeff Law, Jan Hubicka, H.J. Lu, Alexander Monakov,
	Steven Bosscher, GCC Patches, David Li, Kirill Yukhin,
	Vladimir Makarov

On Thu, Oct 17, 2013 at 12:35 AM, Marek Polacek <polacek@redhat.com> wrote:
> On Wed, Oct 16, 2013 at 04:25:58PM -0700, Wei Mi wrote:
>> +/* Return true if target platform supports macro-fusion.  */
>> +
>> +static bool
>> +ix86_macro_fusion_p ()
>> +{
>> +  if (TARGET_FUSE_CMP_AND_BRANCH)
>> +    return true;
>> +  else
>> +    return false;
>> +}
>
> That looks weird, why not just
>
> static bool
> ix86_macro_fusion_p (void)
> {
>   return TARGET_FUSE_CMP_AND_BRANCH;
> }
>
> ?
>
>         Marek

Thanks, fixed.

Wei Mi.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-10-17  1:06                               ` Wei Mi
  2013-10-17  8:06                                 ` Marek Polacek
@ 2013-11-02  1:40                                 ` Wei Mi
  2013-11-04  1:18                                   ` Jan Hubicka
  1 sibling, 1 reply; 59+ messages in thread
From: Wei Mi @ 2013-11-02  1:40 UTC (permalink / raw)
  To: Jeff Law
  Cc: Jan Hubicka, H.J. Lu, Alexander Monakov, Steven Bosscher,
	GCC Patches, David Li, Kirill Yukhin, Vladimir Makarov

Ping.  Is it ok for x86 maintainer?

Thanks,
Wei Mi.

On Wed, Oct 16, 2013 at 4:25 PM, Wei Mi <wmi@google.com> wrote:
>> Go ahead and consider that pre-approved.  Just send it to the list with a
>> note that I approved it in this thread.
>>
>> Jeff
>
> Thanks! The new patch addressed Jeff's comments.
>
> Is it ok for x86 maintainer?
>
> Thanks,
> Wei Mi.
>
> 2013-10-16  Wei Mi  <wmi@google.com>
>
>         * gcc/config/i386/i386.c (memory_address_length): Extract a part
>         of code to rip_relative_addr_p.
>         (rip_relative_addr_p): New Function.
>         (ix86_macro_fusion_p): Ditto.
>         (ix86_macro_fusion_pair_p): Ditto.
>         * gcc/config/i386/i386.h: Add new tune features about macro-fusion.
>         * gcc/config/i386/x86-tune.def (DEF_TUNE): Ditto.
>         * gcc/doc/tm.texi: Generated.
>         * gcc/doc/tm.texi.in: Ditto.
>         * gcc/haifa-sched.c (try_group_insn): New Function.
>         (group_insns_for_macro_fusion): Ditto.
>         (sched_init): Call group_insns_for_macro_fusion.
>         * gcc/target.def: Add two hooks: macro_fusion_p and
>         macro_fusion_pair_p.
>
> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> index 1fd3f60..59b0bcf 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -24204,6 +24204,42 @@ ix86_instantiate_decls (void)
>        instantiate_decl_rtl (s->rtl);
>  }
>
> +/* Check whether x86 address PARTS is a pc-relative address.  */
> +
> +static bool
> +rip_relative_addr_p (struct ix86_address *parts)
> +{
> +  rtx base, index, disp;
> +
> +  base = parts->base;
> +  index = parts->index;
> +  disp = parts->disp;
> +
> +  if (disp && !base && !index)
> +    {
> +      if (TARGET_64BIT)
> +       {
> +         rtx symbol = disp;
> +
> +         if (GET_CODE (disp) == CONST)
> +           symbol = XEXP (disp, 0);
> +         if (GET_CODE (symbol) == PLUS
> +             && CONST_INT_P (XEXP (symbol, 1)))
> +           symbol = XEXP (symbol, 0);
> +
> +         if (GET_CODE (symbol) == LABEL_REF
> +             || (GET_CODE (symbol) == SYMBOL_REF
> +                 && SYMBOL_REF_TLS_MODEL (symbol) == 0)
> +             || (GET_CODE (symbol) == UNSPEC
> +                 && (XINT (symbol, 1) == UNSPEC_GOTPCREL
> +                     || XINT (symbol, 1) == UNSPEC_PCREL
> +                     || XINT (symbol, 1) == UNSPEC_GOTNTPOFF)))
> +           return true;
> +       }
> +    }
> +  return false;
> +}
> +
>  /* Calculate the length of the memory address in the instruction encoding.
>     Includes addr32 prefix, does not include the one-byte modrm, opcode,
>     or other prefixes.  We never generate addr32 prefix for LEA insn.  */
> @@ -24275,25 +24311,8 @@ memory_address_length (rtx addr, bool lea)
>    else if (disp && !base && !index)
>      {
>        len += 4;
> -      if (TARGET_64BIT)
> -       {
> -         rtx symbol = disp;
> -
> -         if (GET_CODE (disp) == CONST)
> -           symbol = XEXP (disp, 0);
> -         if (GET_CODE (symbol) == PLUS
> -             && CONST_INT_P (XEXP (symbol, 1)))
> -           symbol = XEXP (symbol, 0);
> -
> -         if (GET_CODE (symbol) != LABEL_REF
> -             && (GET_CODE (symbol) != SYMBOL_REF
> -                 || SYMBOL_REF_TLS_MODEL (symbol) != 0)
> -             && (GET_CODE (symbol) != UNSPEC
> -                 || (XINT (symbol, 1) != UNSPEC_GOTPCREL
> -                     && XINT (symbol, 1) != UNSPEC_PCREL
> -                     && XINT (symbol, 1) != UNSPEC_GOTNTPOFF)))
> -           len++;
> -       }
> +      if (rip_relative_addr_p (&parts))
> +       len++;
>      }
>    else
>      {
> @@ -24856,6 +24875,122 @@ ia32_multipass_dfa_lookahead (void)
>      }
>  }
>
> +/* Return true if target platform supports macro-fusion.  */
> +
> +static bool
> +ix86_macro_fusion_p ()
> +{
> +  if (TARGET_FUSE_CMP_AND_BRANCH)
> +    return true;
> +  else
> +    return false;
> +}
> +
> +/* Check whether current microarchitecture support macro fusion
> +   for insn pair "CONDGEN + CONDJMP". Refer to
> +   "Intel Architectures Optimization Reference Manual". */
> +
> +static bool
> +ix86_macro_fusion_pair_p (rtx condgen, rtx condjmp)
> +{
> +  rtx src, dest;
> +  rtx single_set = single_set (condgen);
> +  enum rtx_code ccode;
> +  rtx compare_set = NULL_RTX, test_if, cond;
> +  rtx alu_set = NULL_RTX, addr = NULL_RTX;
> +
> +  if (get_attr_type (condgen) != TYPE_TEST
> +      && get_attr_type (condgen) != TYPE_ICMP
> +      && get_attr_type (condgen) != TYPE_INCDEC
> +      && get_attr_type (condgen) != TYPE_ALU)
> +    return false;
> +
> +  if (single_set == NULL_RTX
> +      && !TARGET_FUSE_ALU_AND_BRANCH)
> +    return false;
> +
> +  if (single_set != NULL_RTX)
> +    compare_set = single_set;
> +  else
> +    {
> +      int i;
> +      rtx pat = PATTERN (condgen);
> +      for (i = 0; i < XVECLEN (pat, 0); i++)
> +       if (GET_CODE (XVECEXP (pat, 0, i)) == SET)
> +         {
> +           rtx set_src = SET_SRC (XVECEXP (pat, 0, i));
> +           if (GET_CODE (set_src) == COMPARE)
> +             compare_set = XVECEXP (pat, 0, i);
> +           else
> +             alu_set = XVECEXP (pat, 0, i);
> +         }
> +    }
> +  if (compare_set == NULL_RTX)
> +    return false;
> +  src = SET_SRC (compare_set);
> +  if (GET_CODE (src) != COMPARE)
> +    return false;
> +
> +  /* Macro-fusion for cmp/test MEM-IMM + conditional jmp is not
> +     supported.  */
> +  if ((MEM_P (XEXP (src, 0))
> +       && CONST_INT_P (XEXP (src, 1)))
> +      || (MEM_P (XEXP (src, 1))
> +         && CONST_INT_P (XEXP (src, 0))))
> +    return false;
> +
> +  /* No fusion for RIP-relative address.  */
> +  if (MEM_P (XEXP (src, 0)))
> +    addr = XEXP (XEXP (src, 0), 0);
> +  else if (MEM_P (XEXP (src, 1)))
> +    addr = XEXP (XEXP (src, 1), 0);
> +
> +  if (addr) {
> +    ix86_address parts;
> +    int ok = ix86_decompose_address (addr, &parts);
> +    gcc_assert (ok);
> +
> +    if (rip_relative_addr_p (&parts))
> +      return false;
> +  }
> +
> +  test_if = SET_SRC (pc_set (condjmp));
> +  cond = XEXP (test_if, 0);
> +  ccode = GET_CODE (cond);
> +  /* Check whether conditional jump use Sign or Overflow Flags.  */
> +  if (!TARGET_FUSE_CMP_AND_BRANCH_SOFLAGS
> +      && (ccode == GE
> +          || ccode == GT
> +         || ccode == LE
> +         || ccode == LT))
> +    return false;
> +
> +  /* Return true for TYPE_TEST and TYPE_ICMP.  */
> +  if (get_attr_type (condgen) == TYPE_TEST
> +      || get_attr_type (condgen) == TYPE_ICMP)
> +    return true;
> +
> +  /* The following is the case that macro-fusion for alu + jmp.  */
> +  if (!TARGET_FUSE_ALU_AND_BRANCH || !alu_set)
> +    return false;
> +
> +  /* No fusion for alu op with memory destination operand.  */
> +  dest = SET_DEST (alu_set);
> +  if (MEM_P (dest))
> +    return false;
> +
> +  /* Macro-fusion for inc/dec + unsigned conditional jump is not
> +     supported.  */
> +  if (get_attr_type (condgen) == TYPE_INCDEC
> +      && (ccode == GEU
> +         || ccode == GTU
> +         || ccode == LEU
> +         || ccode == LTU))
> +    return false;
> +
> +  return true;
> +}
> +
>  /* Try to reorder ready list to take advantage of Atom pipelined IMUL
>     execution. It is applied if
>     (1) IMUL instruction is on the top of list;
> @@ -42993,6 +43128,10 @@ ix86_memmodel_check (unsigned HOST_WIDE_INT val)
>  #undef TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD
>  #define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD \
>    ia32_multipass_dfa_lookahead
> +#undef TARGET_SCHED_MACRO_FUSION_P
> +#define TARGET_SCHED_MACRO_FUSION_P ix86_macro_fusion_p
> +#undef TARGET_SCHED_MACRO_FUSION_PAIR_P
> +#define TARGET_SCHED_MACRO_FUSION_PAIR_P ix86_macro_fusion_pair_p
>
>  #undef TARGET_FUNCTION_OK_FOR_SIBCALL
>  #define TARGET_FUNCTION_OK_FOR_SIBCALL ix86_function_ok_for_sibcall
> diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> index 788cb8a..68fabd9 100644
> --- a/gcc/config/i386/i386.h
> +++ b/gcc/config/i386/i386.h
> @@ -362,8 +362,17 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
>         ix86_tune_features[X86_TUNE_USE_VECTOR_FP_CONVERTS]
>  #define TARGET_USE_VECTOR_CONVERTS \
>         ix86_tune_features[X86_TUNE_USE_VECTOR_CONVERTS]
> +#define TARGET_FUSE_CMP_AND_BRANCH_32 \
> +       ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH_32]
> +#define TARGET_FUSE_CMP_AND_BRANCH_64 \
> +       ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH_64]
>  #define TARGET_FUSE_CMP_AND_BRANCH \
> -       ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH]
> +       (TARGET_64BIT ? TARGET_FUSE_CMP_AND_BRANCH_64 \
> +        : TARGET_FUSE_CMP_AND_BRANCH_32)
> +#define TARGET_FUSE_CMP_AND_BRANCH_SOFLAGS \
> +       ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH_SOFLAGS]
> +#define TARGET_FUSE_ALU_AND_BRANCH \
> +       ix86_tune_features[X86_TUNE_FUSE_ALU_AND_BRANCH]
>  #define TARGET_OPT_AGU ix86_tune_features[X86_TUNE_OPT_AGU]
>  #define TARGET_VECTORIZE_DOUBLE \
>         ix86_tune_features[X86_TUNE_VECTORIZE_DOUBLE]
> diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> index 4ae5f70..3d395b0 100644
> --- a/gcc/config/i386/x86-tune.def
> +++ b/gcc/config/i386/x86-tune.def
> @@ -193,10 +193,24 @@ DEF_TUNE (X86_TUNE_USE_VECTOR_FP_CONVERTS,
> "use_vector_fp_converts",
>  /* X86_TUNE_USE_VECTOR_CONVERTS: Prefer vector packed SSE conversion
>     from integer to FP. */
>  DEF_TUNE (X86_TUNE_USE_VECTOR_CONVERTS, "use_vector_converts", m_AMDFAM10)
> -/* X86_TUNE_FUSE_CMP_AND_BRANCH: Fuse a compare or test instruction
> -   with a subsequent conditional jump instruction into a single
> -   compare-and-branch uop.  */
> -DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH, "fuse_cmp_and_branch", m_BDVER)
> +/* X86_TUNE_FUSE_CMP_AND_BRANCH_32: Fuse compare with a subsequent
> +   conditional jump instruction for 32 bit TARGET.  */
> +DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH_32, "fuse_cmp_and_branch_32",
> +          m_CORE_ALL | m_BDVER)
> +/* X86_TUNE_FUSE_CMP_AND_BRANCH_64: Fuse compare with a subsequent
> +   conditional jump instruction for TARGET_64BIT.  */
> +DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH_64, "fuse_cmp_and_branch_64",
> +          m_COREI7 | m_COREI7_AVX | m_HASWELL | m_BDVER)
> +/* X86_TUNE_FUSE_CMP_AND_BRANCH_SOFLAGS: Fuse compare with a
> +   subsequent conditional jump instruction when the condition jump
> +   check sign flag (SF) or overflow flag (OF).  */
> +DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH_SOFLAGS, "fuse_cmp_and_branch_soflags",
> +          m_COREI7 | m_COREI7_AVX | m_HASWELL | m_BDVER)
> +/* X86_TUNE_FUSE_ALU_AND_BRANCH: Fuse alu with a subsequent conditional
> +   jump instruction when the alu instruction produces the CCFLAG consumed by
> +   the conditional jump instruction. */
> +DEF_TUNE (X86_TUNE_FUSE_ALU_AND_BRANCH, "fuse_alu_and_branch",
> +          m_COREI7_AVX | m_HASWELL)
>  /* X86_TUNE_OPT_AGU: Optimize for Address Generation Unit. This flag
>     will impact LEA instruction selection. */
>  DEF_TUNE (X86_TUNE_OPT_AGU, "opt_agu", m_ATOM | m_SLM)
> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> index d15f53c..66b45b9 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -6553,6 +6553,17 @@ scheduling one insn causes other insns to
> become ready in the same
>  cycle.  These other insns can then be taken into account properly.
>  @end deftypefn
>
> +@deftypefn {Target Hook} bool TARGET_SCHED_MACRO_FUSION_P (void)
> +This hook is used to check whether target platform supports macro fusion.
> +@end deftypefn
> +
> +@deftypefn {Target Hook} bool TARGET_SCHED_MACRO_FUSION_PAIR_P (rtx
> @var{condgen}, rtx @var{condjmp})
> +This hook is used to check whether two insns could be macro fused for
> +target microarchitecture. If this hook returns true for the given insn pair
> +(@var{condgen} and @var{condjmp}), scheduler will put them into a sched
> +group, and they will not be scheduled apart.
> +@end deftypefn
> +
>  @deftypefn {Target Hook} void
> TARGET_SCHED_DEPENDENCIES_EVALUATION_HOOK (rtx @var{head}, rtx
> @var{tail})
>  This hook is called after evaluation forward dependencies of insns in
>  chain given by two parameter values (@var{head} and @var{tail}
> diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> index b51d7b3..361ee87 100644
> --- a/gcc/doc/tm.texi.in
> +++ b/gcc/doc/tm.texi.in
> @@ -4940,6 +4940,10 @@ them: try the first ones in this list first.
>
>  @hook TARGET_SCHED_REORDER2
>
> +@hook TARGET_SCHED_MACRO_FUSION_P
> +
> +@hook TARGET_SCHED_MACRO_FUSION_PAIR_P
> +
>  @hook TARGET_SCHED_DEPENDENCIES_EVALUATION_HOOK
>
>  @hook TARGET_SCHED_INIT
> diff --git a/gcc/haifa-sched.c b/gcc/haifa-sched.c
> index 61eaaef..e24009d 100644
> --- a/gcc/haifa-sched.c
> +++ b/gcc/haifa-sched.c
> @@ -6519,6 +6519,50 @@ setup_sched_dump (void)
>                 ? stderr : dump_file);
>  }
>
> +/* Try to group comparison and the following conditional jump INSN if
> +   they're already adjacent. This is to prevent scheduler from scheduling
> +   them apart.  */
> +
> +static void
> +try_group_insn (rtx insn)
> +{
> +  unsigned int condreg1, condreg2;
> +  rtx cc_reg_1;
> +  rtx prev;
> +
> +  if (!any_condjump_p (insn))
> +    return;
> +
> +  targetm.fixed_condition_code_regs (&condreg1, &condreg2);
> +  cc_reg_1 = gen_rtx_REG (CCmode, condreg1);
> +  prev = prev_nonnote_nondebug_insn (insn);
> +  if (!reg_referenced_p (cc_reg_1, PATTERN (insn))
> +      || !prev
> +      || !modified_in_p (cc_reg_1, prev))
> +    return;
> +
> +  /* Different microarchitectures support macro fusions for different
> +     combinations of insn pairs.  */
> +  if (!targetm.sched.macro_fusion_pair_p
> +      || !targetm.sched.macro_fusion_pair_p (prev, insn))
> +    return;
> +
> +  SCHED_GROUP_P (insn) = 1;
> +}
> +
> +/* If the last cond jump and the cond register defining insn are consecutive
> +   before scheduling, we want them to be in a schedule group. This is good
> +   for performance on microarchitectures supporting macro-fusion.  */
> +
> +static void
> +group_insns_for_macro_fusion ()
> +{
> +  basic_block bb;
> +
> +  FOR_EACH_BB (bb)
> +    try_group_insn (BB_END (bb));
> +}
> +
>  /* Initialize some global state for the scheduler.  This function works
>     with the common data shared between all the schedulers.  It is called
>     from the scheduler specific initialization routine.  */
> @@ -6645,6 +6689,11 @@ sched_init (void)
>      }
>
>    curr_state = xmalloc (dfa_state_size);
> +
> +  /* Group compare and branch insns for macro-fusion.  */
> +  if (targetm.sched.macro_fusion_p
> +      && targetm.sched.macro_fusion_p ())
> +    group_insns_for_macro_fusion ();
>  }
>
>  static void haifa_init_only_bb (basic_block, basic_block);
> diff --git a/gcc/target.def b/gcc/target.def
> index 6de513f..dae0378 100644
> --- a/gcc/target.def
> +++ b/gcc/target.def
> @@ -1041,6 +1041,19 @@ scheduling one insn causes other insns to
> become ready in the same\n\
>  cycle.  These other insns can then be taken into account properly.",
>   int, (FILE *file, int verbose, rtx *ready, int *n_readyp, int clock), NULL)
>
> +DEFHOOK
> +(macro_fusion_p,
> + "This hook is used to check whether target platform supports macro fusion.",
> + bool, (void), NULL)
> +
> +DEFHOOK
> +(macro_fusion_pair_p,
> + "This hook is used to check whether two insns could be macro fused for\n\
> +target microarchitecture. If this hook returns true for the given insn pair\n\
> +(@var{condgen} and @var{condjmp}), scheduler will put them into a sched\n\
> +group, and they will not be scheduled apart.",
> + bool, (rtx condgen, rtx condjmp), NULL)
> +
>  /* The following member value is a pointer to a function called
>     after evaluation forward dependencies of insns in chain given
>     by two parameter values (head and tail correspondingly).  */
>
>
>
>
> 2013-10-16  Wei Mi  <wmi@google.com>
>         * gcc/sched-rgn.c (add_branch_dependences): Keep insns in
>         a SCHED_GROUP at the end of BB to remain their location.
>
> diff --git a/gcc/sched-rgn.c b/gcc/sched-rgn.c
> index e1a2dce..156359e 100644
> --- a/gcc/sched-rgn.c
> +++ b/gcc/sched-rgn.c
> @@ -2443,6 +2443,8 @@ add_branch_dependences (rtx head, rtx tail)
>       cc0 setters remain at the end because they can't be moved away from
>       their cc0 user.
>
> +     Predecessors of SCHED_GROUP_P instructions at the end remain at the end.
> +
>       COND_EXEC insns cannot be moved past a branch (see e.g. PR17808).
>
>       Insns setting TARGET_CLASS_LIKELY_SPILLED_P registers (usually return
> @@ -2465,7 +2467,8 @@ add_branch_dependences (rtx head, rtx tail)
>  #endif
>                  || (!reload_completed
>                      && sets_likely_spilled (PATTERN (insn)))))
> -        || NOTE_P (insn))
> +        || NOTE_P (insn)
> +        || (last != 0 && SCHED_GROUP_P (last)))
>      {
>        if (!NOTE_P (insn))
>         {

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-11-02  1:40                                 ` Wei Mi
@ 2013-11-04  1:18                                   ` Jan Hubicka
  2013-11-04 22:04                                     ` Wei Mi
  0 siblings, 1 reply; 59+ messages in thread
From: Jan Hubicka @ 2013-11-04  1:18 UTC (permalink / raw)
  To: Wei Mi
  Cc: Jeff Law, Jan Hubicka, H.J. Lu, Alexander Monakov,
	Steven Bosscher, GCC Patches, David Li, Kirill Yukhin,
	Vladimir Makarov

> Ping.  Is it ok for x86 maintainer?

I tought I already approved the x86 bits.
> 
> Thanks,
> Wei Mi.
> 
> On Wed, Oct 16, 2013 at 4:25 PM, Wei Mi <wmi@google.com> wrote:
> >> Go ahead and consider that pre-approved.  Just send it to the list with a
> >> note that I approved it in this thread.
> >>
> >> Jeff
> >
> > Thanks! The new patch addressed Jeff's comments.
> >
> > Is it ok for x86 maintainer?
> >
> > Thanks,
> > Wei Mi.
> >
> > 2013-10-16  Wei Mi  <wmi@google.com>
> >
> >         * gcc/config/i386/i386.c (memory_address_length): Extract a part
> >         of code to rip_relative_addr_p.
> >         (rip_relative_addr_p): New Function.
> >         (ix86_macro_fusion_p): Ditto.
> >         (ix86_macro_fusion_pair_p): Ditto.
> >         * gcc/config/i386/i386.h: Add new tune features about macro-fusion.
> >         * gcc/config/i386/x86-tune.def (DEF_TUNE): Ditto.
> >         * gcc/doc/tm.texi: Generated.
> >         * gcc/doc/tm.texi.in: Ditto.
> >         * gcc/haifa-sched.c (try_group_insn): New Function.
> >         (group_insns_for_macro_fusion): Ditto.
> >         (sched_init): Call group_insns_for_macro_fusion.
> >         * gcc/target.def: Add two hooks: macro_fusion_p and
> >         macro_fusion_pair_p.
> >
The i386 bits are OK.

Honza

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-11-04  1:18                                   ` Jan Hubicka
@ 2013-11-04 22:04                                     ` Wei Mi
  2013-11-24 11:14                                       ` H.J. Lu
  0 siblings, 1 reply; 59+ messages in thread
From: Wei Mi @ 2013-11-04 22:04 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Jeff Law, H.J. Lu, Alexander Monakov, Steven Bosscher,
	GCC Patches, David Li, Kirill Yukhin, Vladimir Makarov

Thanks! The three patches are commited as r204367, r204369 and r204371.

Regards,
Wei Mi.

On Sun, Nov 3, 2013 at 5:18 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> Ping.  Is it ok for x86 maintainer?
>
> I tought I already approved the x86 bits.
>>
>> Thanks,
>> Wei Mi.
>>
>> On Wed, Oct 16, 2013 at 4:25 PM, Wei Mi <wmi@google.com> wrote:
>> >> Go ahead and consider that pre-approved.  Just send it to the list with a
>> >> note that I approved it in this thread.
>> >>
>> >> Jeff
>> >
>> > Thanks! The new patch addressed Jeff's comments.
>> >
>> > Is it ok for x86 maintainer?
>> >
>> > Thanks,
>> > Wei Mi.
>> >
>> > 2013-10-16  Wei Mi  <wmi@google.com>
>> >
>> >         * gcc/config/i386/i386.c (memory_address_length): Extract a part
>> >         of code to rip_relative_addr_p.
>> >         (rip_relative_addr_p): New Function.
>> >         (ix86_macro_fusion_p): Ditto.
>> >         (ix86_macro_fusion_pair_p): Ditto.
>> >         * gcc/config/i386/i386.h: Add new tune features about macro-fusion.
>> >         * gcc/config/i386/x86-tune.def (DEF_TUNE): Ditto.
>> >         * gcc/doc/tm.texi: Generated.
>> >         * gcc/doc/tm.texi.in: Ditto.
>> >         * gcc/haifa-sched.c (try_group_insn): New Function.
>> >         (group_insns_for_macro_fusion): Ditto.
>> >         (sched_init): Call group_insns_for_macro_fusion.
>> >         * gcc/target.def: Add two hooks: macro_fusion_p and
>> >         macro_fusion_pair_p.
>> >
> The i386 bits are OK.
>
> Honza

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-11-04 22:04                                     ` Wei Mi
@ 2013-11-24 11:14                                       ` H.J. Lu
  2013-11-24 13:38                                         ` Wei Mi
  0 siblings, 1 reply; 59+ messages in thread
From: H.J. Lu @ 2013-11-24 11:14 UTC (permalink / raw)
  To: Wei Mi
  Cc: Jan Hubicka, Jeff Law, Alexander Monakov, Steven Bosscher,
	GCC Patches, David Li, Kirill Yukhin, Vladimir Makarov

On Mon, Nov 4, 2013 at 1:51 PM, Wei Mi <wmi@google.com> wrote:
> Thanks! The three patches are commited as r204367, r204369 and r204371.
>

r204369 caused:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59020

-- 
H.J.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-11-24 11:14                                       ` H.J. Lu
@ 2013-11-24 13:38                                         ` Wei Mi
  2013-11-25 12:08                                           ` Alexander Monakov
  2013-11-25 21:08                                           ` Jeff Law
  0 siblings, 2 replies; 59+ messages in thread
From: Wei Mi @ 2013-11-24 13:38 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Jan Hubicka, Jeff Law, Alexander Monakov, Steven Bosscher,
	GCC Patches, David Li, Kirill Yukhin, Vladimir Makarov

Sorry about the problem.

For the failed testcase, it was compiled using -fmodulo-sched.
modulo-sched phase set SCHED_GROUP_P of a jump insn to be true, which
means the jump insn should be scheduled with prev insn as a group.
When modulo scheduling is finished, the flag of SCHED_GROUP_P is not
cleaned up. After that, pass_jump2 phase split the bb and move the
prev insn to another bb. Then pass_sched2 see the flag and mistakenly
try to bind the jump insn with a code label.

I am thinking other cases setting SCHED_GROUP_P should have the same
problem because SCHED_GROUP_P is not cleaned up after scheduling is
done. The flag may become inconsistent after some optimizations and
may cause problem if it is used by later scheduling passes. I don't
know why similar problem was never exposed before.

The fix is to simply cleanup SCHED_GROUP_P flag in sched_finish.

bootstrap is ok. regression test is going on. Is it ok if regression passes?

Thanks,
Wei.

2013-11-23  Wei Mi  <wmi@google.com>

        PR rtl-optimization/59020
        * haifa-sched.c (cleanup_sched_group): New function.
        (sched_finish): Call cleanup_sched_group to cleanup SCHED_GROUP_P.

2013-11-23  Wei Mi  <wmi@google.com>
        PR rtl-optimization/59020
        * testsuite/gcc.dg/pr59020.c (void f):

Index: haifa-sched.c
===================================================================
--- haifa-sched.c       (revision 204923)
+++ haifa-sched.c       (working copy)
@@ -6598,6 +6598,23 @@ group_insns_for_macro_fusion ()
     try_group_insn (BB_END (bb));
 }

+/* Cleanup SCHED_GROUP_P after scheduling is done. This is necessary because
+   bb may be changed by other optimizations and the flag from last scheduling
+   may become invalid. If later scheduler see the flag generated from last
+   scheduling, it may produces incorrect result.  */
+
+static void
+cleanup_sched_group ()
+{
+  basic_block bb;
+  rtx insn;
+
+  FOR_EACH_BB (bb)
+    FOR_BB_INSNS(bb, insn)
+      if (INSN_P (insn) && SCHED_GROUP_P (insn))
+       SCHED_GROUP_P (insn) = 0;
+}
+
 /* Initialize some global state for the scheduler.  This function works
    with the common data shared between all the schedulers.  It is called
    from the scheduler specific initialization routine.  */
@@ -6841,6 +6858,8 @@ sched_finish (void)
     }
   free (curr_state);

+  cleanup_sched_group ();
+
   if (targetm.sched.finish_global)
     targetm.sched.finish_global (sched_dump, sched_verbose);

Index: testsuite/gcc.dg/pr59020.c
===================================================================
--- testsuite/gcc.dg/pr59020.c  (revision 0)
+++ testsuite/gcc.dg/pr59020.c  (revision 0)
@@ -0,0 +1,15 @@
+/* PR rtl-optimization/59020 */
+
+/* { dg-do compile { target i?86-*-* x86_64-*-* } } */
+/* { dg-options "-O2 -fmodulo-sched -fno-inline -march=corei7" } */
+
+int a, b, d;
+unsigned c;
+
+void f()
+{
+  unsigned q;
+  for(; a; a++)
+    if(((c %= d && 1) ? : 1) & 1)
+      for(; b; q++);
+}

On Sat, Nov 23, 2013 at 4:34 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Mon, Nov 4, 2013 at 1:51 PM, Wei Mi <wmi@google.com> wrote:
>> Thanks! The three patches are commited as r204367, r204369 and r204371.
>>
>
> r204369 caused:
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59020
>
> --
> H.J.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-11-24 13:38                                         ` Wei Mi
@ 2013-11-25 12:08                                           ` Alexander Monakov
  2013-11-25 19:19                                             ` Wei Mi
  2013-11-25 21:08                                           ` Jeff Law
  1 sibling, 1 reply; 59+ messages in thread
From: Alexander Monakov @ 2013-11-25 12:08 UTC (permalink / raw)
  To: Wei Mi
  Cc: H.J. Lu, Jan Hubicka, Jeff Law, Steven Bosscher, GCC Patches,
	David Li, Kirill Yukhin, Vladimir Makarov

On Sat, 23 Nov 2013, Wei Mi wrote:
> For the failed testcase, it was compiled using -fmodulo-sched.
> modulo-sched phase set SCHED_GROUP_P of a jump insn to be true, which
> means the jump insn should be scheduled with prev insn as a group.

SMS doesn't set SCHED_GROUP_P by itself; did you mean that SCHED_GROUP_P is
set by dependency analysis code similar to sched2?

> When modulo scheduling is finished, the flag of SCHED_GROUP_P is not
> cleaned up. After that, pass_jump2 phase split the bb and move the
> prev insn to another bb. Then pass_sched2 see the flag and mistakenly
> try to bind the jump insn with a code label.

I think the analysis is incomplete.  Looking at the backtrace posted in the
bug report, the failure path goes through chain_to_prev_insn, which protects
against such failure:

  prev_nonnote = prev_nonnote_nondebug_insn (insn);
  if (BLOCK_FOR_INSN (insn) == BLOCK_FOR_INSN (prev_nonnote)
      && ! sched_insns_conditions_mutex_p (insn, prev_nonnote))
    add_dependence (insn, prev_nonnote, REG_DEP_ANTI);

Why does it end up with a label at the assertion failure point?

Alexander

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-11-25 12:08                                           ` Alexander Monakov
@ 2013-11-25 19:19                                             ` Wei Mi
  0 siblings, 0 replies; 59+ messages in thread
From: Wei Mi @ 2013-11-25 19:19 UTC (permalink / raw)
  To: Alexander Monakov
  Cc: H.J. Lu, Jan Hubicka, Jeff Law, Steven Bosscher, GCC Patches,
	David Li, Kirill Yukhin, Vladimir Makarov

On Mon, Nov 25, 2013 at 2:08 AM, Alexander Monakov <amonakov@ispras.ru> wrote:
> On Sat, 23 Nov 2013, Wei Mi wrote:
>> For the failed testcase, it was compiled using -fmodulo-sched.
>> modulo-sched phase set SCHED_GROUP_P of a jump insn to be true, which
>> means the jump insn should be scheduled with prev insn as a group.
>
> SMS doesn't set SCHED_GROUP_P by itself; did you mean that SCHED_GROUP_P is
> set by dependency analysis code similar to sched2?
>

SCHED_GROUP_P is set in sched_analyze for "call + return value" group
and other groups, and in sched_init for macrofusion. Both
sched_analyze and sched_init are used by SMS (sched_analyze is used by
creating ddg). I think sched1 may have the same problem when it set
SCHED_GROUP_P and sched2 uses it.

>> When modulo scheduling is finished, the flag of SCHED_GROUP_P is not
>> cleaned up. After that, pass_jump2 phase split the bb and move the
>> prev insn to another bb. Then pass_sched2 see the flag and mistakenly
>> try to bind the jump insn with a code label.
>
> I think the analysis is incomplete.  Looking at the backtrace posted in the
> bug report, the failure path goes through chain_to_prev_insn, which protects
> against such failure:
>
>   prev_nonnote = prev_nonnote_nondebug_insn (insn);
>   if (BLOCK_FOR_INSN (insn) == BLOCK_FOR_INSN (prev_nonnote)
>       && ! sched_insns_conditions_mutex_p (insn, prev_nonnote))
>     add_dependence (insn, prev_nonnote, REG_DEP_ANTI);
>
> Why does it end up with a label at the assertion failure point?
>
> Alexander

Because code label is not a note or debug insn.

I think it is impossible to detect such inconsistency in
chain_to_prev_insn. If the prev_nonnote is not a code label, the bug
will not be exposed this time. Suppose some other optimizations insert
a real insn before the jump marked as SCHED_GROUP_P, following
scheduler pass will schedule them together silently. That is why I
think it is necessary to cleanup SCHED_GROUP_P when a scheduling pass
is finished.

Thanks,
Wei.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-11-24 13:38                                         ` Wei Mi
  2013-11-25 12:08                                           ` Alexander Monakov
@ 2013-11-25 21:08                                           ` Jeff Law
  2013-11-25 21:47                                             ` Wei Mi
  1 sibling, 1 reply; 59+ messages in thread
From: Jeff Law @ 2013-11-25 21:08 UTC (permalink / raw)
  To: Wei Mi, H.J. Lu
  Cc: Jan Hubicka, Alexander Monakov, Steven Bosscher, GCC Patches,
	David Li, Kirill Yukhin, Vladimir Makarov

On 11/24/13 00:30, Wei Mi wrote:
> Sorry about the problem.
>
> For the failed testcase, it was compiled using -fmodulo-sched.
> modulo-sched phase set SCHED_GROUP_P of a jump insn to be true, which
> means the jump insn should be scheduled with prev insn as a group.
> When modulo scheduling is finished, the flag of SCHED_GROUP_P is not
> cleaned up. After that, pass_jump2 phase split the bb and move the
> prev insn to another bb. Then pass_sched2 see the flag and mistakenly
> try to bind the jump insn with a code label.
>
> I am thinking other cases setting SCHED_GROUP_P should have the same
> problem because SCHED_GROUP_P is not cleaned up after scheduling is
> done. The flag may become inconsistent after some optimizations and
> may cause problem if it is used by later scheduling passes. I don't
> know why similar problem was never exposed before.
>
> The fix is to simply cleanup SCHED_GROUP_P flag in sched_finish.
I think this is showing up because this is the first time we have used 
SCHED_GROUP_P in cases where we merely want to keep two instructions 
consecutive vs cases where we are required to keep certain instructions 
consecutive.  For example, all the RTL passes already know they need to 
keep a cc0 setter and cc0 user consecutive on a HAVE_cc0 target.

In the latter case passes should already be doing what is necessary to 
keep those instructions consecutive.  In the former case, we'd have to 
audit & fix passes to honor the desire to keep certain instructions 
consecutive.




>
> bootstrap is ok. regression test is going on. Is it ok if regression passes?
>
> Thanks,
> Wei.
>
> 2013-11-23  Wei Mi  <wmi@google.com>
>
>          PR rtl-optimization/59020
>          * haifa-sched.c (cleanup_sched_group): New function.
>          (sched_finish): Call cleanup_sched_group to cleanup SCHED_GROUP_P.
>
> 2013-11-23  Wei Mi  <wmi@google.com>
>          PR rtl-optimization/59020
>          * testsuite/gcc.dg/pr59020.c (void f):
I'll note you're doing an extra pass over all the RTL here.   Is there 
any clean way you can clean SCHED_GROUP_P without that extra pass over 
the RTL?  Perhaps when the group actually gets scheduled?

jeff

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-11-25 21:08                                           ` Jeff Law
@ 2013-11-25 21:47                                             ` Wei Mi
  2013-11-25 22:13                                               ` Jeff Law
  0 siblings, 1 reply; 59+ messages in thread
From: Wei Mi @ 2013-11-25 21:47 UTC (permalink / raw)
  To: Jeff Law
  Cc: H.J. Lu, Jan Hubicka, Alexander Monakov, Steven Bosscher,
	GCC Patches, David Li, Kirill Yukhin, Vladimir Makarov

On Mon, Nov 25, 2013 at 10:36 AM, Jeff Law <law@redhat.com> wrote:
> On 11/24/13 00:30, Wei Mi wrote:
>>
>> Sorry about the problem.
>>
>> For the failed testcase, it was compiled using -fmodulo-sched.
>> modulo-sched phase set SCHED_GROUP_P of a jump insn to be true, which
>> means the jump insn should be scheduled with prev insn as a group.
>> When modulo scheduling is finished, the flag of SCHED_GROUP_P is not
>> cleaned up. After that, pass_jump2 phase split the bb and move the
>> prev insn to another bb. Then pass_sched2 see the flag and mistakenly
>> try to bind the jump insn with a code label.
>>
>> I am thinking other cases setting SCHED_GROUP_P should have the same
>> problem because SCHED_GROUP_P is not cleaned up after scheduling is
>> done. The flag may become inconsistent after some optimizations and
>> may cause problem if it is used by later scheduling passes. I don't
>> know why similar problem was never exposed before.
>>
>> The fix is to simply cleanup SCHED_GROUP_P flag in sched_finish.
>
> I think this is showing up because this is the first time we have used
> SCHED_GROUP_P in cases where we merely want to keep two instructions
> consecutive vs cases where we are required to keep certain instructions
> consecutive.  For example, all the RTL passes already know they need to keep
> a cc0 setter and cc0 user consecutive on a HAVE_cc0 target.
>
> In the latter case passes should already be doing what is necessary to keep
> those instructions consecutive.  In the former case, we'd have to audit &
> fix passes to honor the desire to keep certain instructions consecutive.
>

I see. Thanks for showing me the reason.

>>
>> bootstrap is ok. regression test is going on. Is it ok if regression
>> passes?
>>
>> Thanks,
>> Wei.
>>
>> 2013-11-23  Wei Mi  <wmi@google.com>
>>
>>          PR rtl-optimization/59020
>>          * haifa-sched.c (cleanup_sched_group): New function.
>>          (sched_finish): Call cleanup_sched_group to cleanup
>> SCHED_GROUP_P.
>>
>> 2013-11-23  Wei Mi  <wmi@google.com>
>>          PR rtl-optimization/59020
>>          * testsuite/gcc.dg/pr59020.c (void f):
>
> I'll note you're doing an extra pass over all the RTL here.   Is there any
> clean way you can clean SCHED_GROUP_P without that extra pass over the RTL?
> Perhaps when the group actually gets scheduled?
>
> jeff
>

With your help to understand that sched group will not be broken by
other passes in other cases, I can cleanup SCHED_GROUP_P for
macrofusion only by checking every condjump insn which is at the end
of BB. Then the cost will be in the same scale with bb nums. Do you
think it is ok?

Thanks,
Wei.

2013-11-25  Wei Mi  <wmi@google.com>

        PR rtl-optimization/59020
        * haifa-sched.c (cleanup_sched_group): New function.
        (sched_finish): Call cleanup_sched_group to cleanup SCHED_GROUP_P.

2013-11-25  Wei Mi  <wmi@google.com>
        PR rtl-optimization/59020
        * testsuite/gcc.dg/pr59020.c (void f):

Index: haifa-sched.c
===================================================================
--- haifa-sched.c       (revision 204923)
+++ haifa-sched.c       (working copy)
@@ -6598,6 +6598,25 @@ group_insns_for_macro_fusion ()
     try_group_insn (BB_END (bb));
 }

+/* Cleanup SCHED_GROUP_P after scheduling is done. This is necessary because
+   bb may be changed by other optimizations and the flag from last scheduling
+   may become invalid. If later scheduler see the flag generated from last
+   scheduling, it may produces incorrect result.  */
+
+static void
+cleanup_sched_group ()
+{
+  basic_block bb;
+  rtx insn;
+
+  FOR_EACH_BB (bb)
+    {
+      insn = BB_END (bb);
+      if (INSN_P (insn) && SCHED_GROUP_P (insn))
+       SCHED_GROUP_P (insn) = 0;
+    }
+}
+
 /* Initialize some global state for the scheduler.  This function works
    with the common data shared between all the schedulers.  It is called
    from the scheduler specific initialization routine.  */
@@ -6841,6 +6860,8 @@ sched_finish (void)
     }
   free (curr_state);

+  cleanup_sched_group ();
+
   if (targetm.sched.finish_global)
     targetm.sched.finish_global (sched_dump, sched_verbose);

Index: testsuite/gcc.dg/pr59020.c
===================================================================
--- testsuite/gcc.dg/pr59020.c  (revision 0)
+++ testsuite/gcc.dg/pr59020.c  (revision 0)
@@ -0,0 +1,15 @@
+/* PR rtl-optimization/59020 */
+
+/* { dg-do compile { target i?86-*-* x86_64-*-* } } */
+/* { dg-options "-O2 -fmodulo-sched -fno-inline -march=corei7" } */
+
+int a, b, d;
+unsigned c;
+
+void f()
+{
+  unsigned q;
+  for(; a; a++)
+    if(((c %= d && 1) ? : 1) & 1)
+      for(; b; q++);
+}

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-11-25 21:47                                             ` Wei Mi
@ 2013-11-25 22:13                                               ` Jeff Law
  2013-11-25 22:19                                                 ` Wei Mi
  0 siblings, 1 reply; 59+ messages in thread
From: Jeff Law @ 2013-11-25 22:13 UTC (permalink / raw)
  To: Wei Mi
  Cc: H.J. Lu, Jan Hubicka, Alexander Monakov, Steven Bosscher,
	GCC Patches, David Li, Kirill Yukhin, Vladimir Makarov

On 11/25/13 12:16, Wei Mi wrote:
>>
>> I'll note you're doing an extra pass over all the RTL here.   Is there any
>> clean way you can clean SCHED_GROUP_P without that extra pass over the RTL?
>> Perhaps when the group actually gets scheduled?
>>
>> jeff
>>
>
> With your help to understand that sched group will not be broken by
> other passes in other cases, I can cleanup SCHED_GROUP_P for
> macrofusion only by checking every condjump insn which is at the end
> of BB. Then the cost will be in the same scale with bb nums. Do you
> think it is ok?
>
> Thanks,
> Wei.
>
> 2013-11-25  Wei Mi  <wmi@google.com>
>
>          PR rtl-optimization/59020
>          * haifa-sched.c (cleanup_sched_group): New function.
>          (sched_finish): Call cleanup_sched_group to cleanup SCHED_GROUP_P.
>
> 2013-11-25  Wei Mi  <wmi@google.com>
>          PR rtl-optimization/59020
>          * testsuite/gcc.dg/pr59020.c (void f):
But there's nothing that requires the SCHED_GROUP_P to be at the end of 
a block.  The cc0-setter/cc0-user case was just an example.  Another 
example would be groups created around call insns on small register 
class machines.

ISTM that when an insn moves from the ready list to back to the main 
insn chain, that you can just clear SCHED_GROUP_P at that time.  Is that 
not the case?

jeff

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-11-25 22:13                                               ` Jeff Law
@ 2013-11-25 22:19                                                 ` Wei Mi
  2013-11-26  3:35                                                   ` Jeff Law
  0 siblings, 1 reply; 59+ messages in thread
From: Wei Mi @ 2013-11-25 22:19 UTC (permalink / raw)
  To: Jeff Law
  Cc: H.J. Lu, Jan Hubicka, Alexander Monakov, Steven Bosscher,
	GCC Patches, David Li, Kirill Yukhin, Vladimir Makarov

On Mon, Nov 25, 2013 at 11:25 AM, Jeff Law <law@redhat.com> wrote:
> On 11/25/13 12:16, Wei Mi wrote:
>>>
>>>
>>> I'll note you're doing an extra pass over all the RTL here.   Is there
>>> any
>>> clean way you can clean SCHED_GROUP_P without that extra pass over the
>>> RTL?
>>> Perhaps when the group actually gets scheduled?
>>>
>>> jeff
>>>
>>
>> With your help to understand that sched group will not be broken by
>> other passes in other cases, I can cleanup SCHED_GROUP_P for
>> macrofusion only by checking every condjump insn which is at the end
>> of BB. Then the cost will be in the same scale with bb nums. Do you
>> think it is ok?
>>
>> Thanks,
>> Wei.
>>
>> 2013-11-25  Wei Mi  <wmi@google.com>
>>
>>          PR rtl-optimization/59020
>>          * haifa-sched.c (cleanup_sched_group): New function.
>>          (sched_finish): Call cleanup_sched_group to cleanup
>> SCHED_GROUP_P.
>>
>> 2013-11-25  Wei Mi  <wmi@google.com>
>>          PR rtl-optimization/59020
>>          * testsuite/gcc.dg/pr59020.c (void f):
>
> But there's nothing that requires the SCHED_GROUP_P to be at the end of a
> block.  The cc0-setter/cc0-user case was just an example.  Another example
> would be groups created around call insns on small register class machines.
>

Doing the cleanup at the end of BB could ensure all the groups
inserted for macrofusion will be cleaned. For groups not at the end of
a block, no matter whether they are cleaned up or not, nothing will
happen because other passes will not mess up those groups -- you said
cc0-setter/cc0-user was such a case. Is it call group a different
case?

> ISTM that when an insn moves from the ready list to back to the main insn
> chain, that you can just clear SCHED_GROUP_P at that time.  Is that not the
> case?
>
> jeff
>

For sched1 and sched2, we can do that. Actually, I find it has been
done in move_insn when commit_schedule. But for modulo scheduling, I
havn't found a good place to do it.

Thanks,
Wei.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-11-25 22:19                                                 ` Wei Mi
@ 2013-11-26  3:35                                                   ` Jeff Law
  2013-11-26 21:20                                                     ` Wei Mi
  0 siblings, 1 reply; 59+ messages in thread
From: Jeff Law @ 2013-11-26  3:35 UTC (permalink / raw)
  To: Wei Mi
  Cc: H.J. Lu, Jan Hubicka, Alexander Monakov, Steven Bosscher,
	GCC Patches, David Li, Kirill Yukhin, Vladimir Makarov


>
> Doing the cleanup at the end of BB could ensure all the groups
> inserted for macrofusion will be cleaned. For groups not at the end of
> a block, no matter whether they are cleaned up or not, nothing will
> happen because other passes will not mess up those groups -- you said
> cc0-setter/cc0-user was such a case. Is it call group a different
> case?
True, it would be safe, but it seems inconsistent and confusing that 
some SCHED_GROUP_P references would be purged and others remain.

Given SCHED_GROUP_P is to used strictly in the scheduler ISTM that we 
should be wiping it as we leave and that our RTL checkers ought to be 
verifying there are no insns with SCHED_GROUP_P left on.

>
> For sched1 and sched2, we can do that. Actually, I find it has been
> done in move_insn when commit_schedule. But for modulo scheduling, I
> havn't found a good place to do it.
Well, that's where I'd suggest focusing attention.

jeff

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-11-26  3:35                                                   ` Jeff Law
@ 2013-11-26 21:20                                                     ` Wei Mi
  2013-11-27  9:56                                                       ` Jeff Law
  0 siblings, 1 reply; 59+ messages in thread
From: Wei Mi @ 2013-11-26 21:20 UTC (permalink / raw)
  To: Jeff Law
  Cc: H.J. Lu, Jan Hubicka, Alexander Monakov, Steven Bosscher,
	GCC Patches, David Li, Kirill Yukhin, Vladimir Makarov

On Mon, Nov 25, 2013 at 2:12 PM, Jeff Law <law@redhat.com> wrote:
>
>>
>> Doing the cleanup at the end of BB could ensure all the groups
>> inserted for macrofusion will be cleaned. For groups not at the end of
>> a block, no matter whether they are cleaned up or not, nothing will
>> happen because other passes will not mess up those groups -- you said
>> cc0-setter/cc0-user was such a case. Is it call group a different
>> case?
>
> True, it would be safe, but it seems inconsistent and confusing that some
> SCHED_GROUP_P references would be purged and others remain.
>
> Given SCHED_GROUP_P is to used strictly in the scheduler ISTM that we should
> be wiping it as we leave and that our RTL checkers ought to be verifying
> there are no insns with SCHED_GROUP_P left on.
>

How about add a verifier TODO_verify_sched_group_flag similar as
TODO_verify_rtl_sharing, and add the verifier in the todo lists of all
the scheduling passes.

>
>>
>> For sched1 and sched2, we can do that. Actually, I find it has been
>> done in move_insn when commit_schedule. But for modulo scheduling, I
>> havn't found a good place to do it.
>
> Well, that's where I'd suggest focusing attention.
>
> jeff
>

After looking it carefully, even for sched1 and sched2, it is not ok
to depend on move_insn in commit_schedule to clean up all the
SCHED_GROUP_P, suppose a block is decided not to be scheduled by
dbg_cnt, then SCHED_GROUP_P inside the block will not be cleaned.
It is even more difficult to find a place inside SMS scheduling to do
the cleanup.

Any suggestions?

Thanks,
Wei.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-11-26 21:20                                                     ` Wei Mi
@ 2013-11-27  9:56                                                       ` Jeff Law
  2013-11-27 10:23                                                         ` Wei Mi
  0 siblings, 1 reply; 59+ messages in thread
From: Jeff Law @ 2013-11-27  9:56 UTC (permalink / raw)
  To: Wei Mi
  Cc: H.J. Lu, Jan Hubicka, Alexander Monakov, Steven Bosscher,
	GCC Patches, David Li, Kirill Yukhin, Vladimir Makarov

On 11/26/13 12:33, Wei Mi wrote:
> On Mon, Nov 25, 2013 at 2:12 PM, Jeff Law <law@redhat.com> wrote:
>>
>>>
>>> Doing the cleanup at the end of BB could ensure all the groups
>>> inserted for macrofusion will be cleaned. For groups not at the end of
>>> a block, no matter whether they are cleaned up or not, nothing will
>>> happen because other passes will not mess up those groups -- you said
>>> cc0-setter/cc0-user was such a case. Is it call group a different
>>> case?
>>
>> True, it would be safe, but it seems inconsistent and confusing that some
>> SCHED_GROUP_P references would be purged and others remain.
>>
>> Given SCHED_GROUP_P is to used strictly in the scheduler ISTM that we should
>> be wiping it as we leave and that our RTL checkers ought to be verifying
>> there are no insns with SCHED_GROUP_P left on.
>>
>
> How about add a verifier TODO_verify_sched_group_flag similar as
> TODO_verify_rtl_sharing, and add the verifier in the todo lists of all
> the scheduling passes.
>
>>
>>>
>>> For sched1 and sched2, we can do that. Actually, I find it has been
>>> done in move_insn when commit_schedule. But for modulo scheduling, I
>>> havn't found a good place to do it.
>>
>> Well, that's where I'd suggest focusing attention.
>>
>> jeff
>>
>
> After looking it carefully, even for sched1 and sched2, it is not ok
> to depend on move_insn in commit_schedule to clean up all the
> SCHED_GROUP_P, suppose a block is decided not to be scheduled by
> dbg_cnt, then SCHED_GROUP_P inside the block will not be cleaned.
> It is even more difficult to find a place inside SMS scheduling to do
> the cleanup.
>
> Any suggestions?
Hmm, maybe attack from the other direction? -- could we clear 
SCHED_GROUP_P for each insn at the start of this loop in sched_analyze?

It's not as clean in the sense that SCHED_GROUP_P "escapes" the 
scheduler, but it might be an option.

    for (insn = head;; insn = NEXT_INSN (insn))
     {

       if (INSN_P (insn))
         {
           /* And initialize deps_lists.  */
           sd_init_insn (insn);
         }

       deps_analyze_insn (deps, insn);

       if (insn == tail)
         {
           if (sched_deps_info->use_cselib)
             cselib_finish ();
           return;
         }
     }
Jeff
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-11-27  9:56                                                       ` Jeff Law
@ 2013-11-27 10:23                                                         ` Wei Mi
  2013-11-28  2:24                                                           ` Wei Mi
  0 siblings, 1 reply; 59+ messages in thread
From: Wei Mi @ 2013-11-27 10:23 UTC (permalink / raw)
  To: Jeff Law
  Cc: H.J. Lu, Jan Hubicka, Alexander Monakov, Steven Bosscher,
	GCC Patches, David Li, Kirill Yukhin, Vladimir Makarov

On Tue, Nov 26, 2013 at 9:34 PM, Jeff Law <law@redhat.com> wrote:
> On 11/26/13 12:33, Wei Mi wrote:
>>
>> On Mon, Nov 25, 2013 at 2:12 PM, Jeff Law <law@redhat.com> wrote:
>>>
>>>
>>>>
>>>> Doing the cleanup at the end of BB could ensure all the groups
>>>> inserted for macrofusion will be cleaned. For groups not at the end of
>>>> a block, no matter whether they are cleaned up or not, nothing will
>>>> happen because other passes will not mess up those groups -- you said
>>>> cc0-setter/cc0-user was such a case. Is it call group a different
>>>> case?
>>>
>>>
>>> True, it would be safe, but it seems inconsistent and confusing that some
>>> SCHED_GROUP_P references would be purged and others remain.
>>>
>>> Given SCHED_GROUP_P is to used strictly in the scheduler ISTM that we
>>> should
>>> be wiping it as we leave and that our RTL checkers ought to be verifying
>>> there are no insns with SCHED_GROUP_P left on.
>>>
>>
>> How about add a verifier TODO_verify_sched_group_flag similar as
>> TODO_verify_rtl_sharing, and add the verifier in the todo lists of all
>> the scheduling passes.
>>
>>>
>>>>
>>>> For sched1 and sched2, we can do that. Actually, I find it has been
>>>> done in move_insn when commit_schedule. But for modulo scheduling, I
>>>> havn't found a good place to do it.
>>>
>>>
>>> Well, that's where I'd suggest focusing attention.
>>>
>>> jeff
>>>
>>
>> After looking it carefully, even for sched1 and sched2, it is not ok
>> to depend on move_insn in commit_schedule to clean up all the
>> SCHED_GROUP_P, suppose a block is decided not to be scheduled by
>> dbg_cnt, then SCHED_GROUP_P inside the block will not be cleaned.
>> It is even more difficult to find a place inside SMS scheduling to do
>> the cleanup.
>>
>> Any suggestions?
>
> Hmm, maybe attack from the other direction? -- could we clear SCHED_GROUP_P
> for each insn at the start of this loop in sched_analyze?
>
> It's not as clean in the sense that SCHED_GROUP_P "escapes" the scheduler,
> but it might be an option.
>
>    for (insn = head;; insn = NEXT_INSN (insn))
>     {
>
>       if (INSN_P (insn))
>         {
>           /* And initialize deps_lists.  */
>           sd_init_insn (insn);
>         }
>
>       deps_analyze_insn (deps, insn);
>
>       if (insn == tail)
>         {
>           if (sched_deps_info->use_cselib)
>             cselib_finish ();
>           return;
>         }
>     }
> Jeff
>>
>>
>

Thanks for the suggestion. It looks workable. Then I need to move the
SCHED_GROUP_P setting for macrofusion from sched_init to a place
inside sched_analyze after the SCHED_GROUP_P cleanup. It will be more
consistent with the settings for cc0 setter-user group and call group,
which are both inside sched_analyze.
I am trying this method...

Thanks,
Wei.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-11-27 10:23                                                         ` Wei Mi
@ 2013-11-28  2:24                                                           ` Wei Mi
  2013-12-03  4:53                                                             ` Jeff Law
  0 siblings, 1 reply; 59+ messages in thread
From: Wei Mi @ 2013-11-28  2:24 UTC (permalink / raw)
  To: Jeff Law
  Cc: H.J. Lu, Jan Hubicka, Alexander Monakov, Steven Bosscher,
	GCC Patches, David Li, Kirill Yukhin, Vladimir Makarov

>> Hmm, maybe attack from the other direction? -- could we clear SCHED_GROUP_P
>> for each insn at the start of this loop in sched_analyze?
>>
>> It's not as clean in the sense that SCHED_GROUP_P "escapes" the scheduler,
>> but it might be an option.
>>
>>    for (insn = head;; insn = NEXT_INSN (insn))
>>     {
>>
>>       if (INSN_P (insn))
>>         {
>>           /* And initialize deps_lists.  */
>>           sd_init_insn (insn);
>>         }
>>
>>       deps_analyze_insn (deps, insn);
>>
>>       if (insn == tail)
>>         {
>>           if (sched_deps_info->use_cselib)
>>             cselib_finish ();
>>           return;
>>         }
>>     }
>> Jeff
>>>
>>>
>>
>
> Thanks for the suggestion. It looks workable. Then I need to move the
> SCHED_GROUP_P setting for macrofusion from sched_init to a place
> inside sched_analyze after the SCHED_GROUP_P cleanup. It will be more
> consistent with the settings for cc0 setter-user group and call group,
> which are both inside sched_analyze.
> I am trying this method...
>
> Thanks,
> Wei.

Here is the patch. The patch does the SCHED_GROUP_P cleanup in
sched_analyze before deps_analyze_insn set SCHED_GROUP_P and chain the
insn with prev insns. And it move try_group_insn for macrofusion from
sched_init to sched_analyze_insn.

bootstrap and regression pass on x86_64-linux-gnu. Is it ok?

Thanks,
Wei.

2013-11-27  Wei Mi  <wmi@google.com>

        PR rtl-optimization/59020
        * sched-deps.c (try_group_insn): Move it from haifa-sched.c to here.
        (sched_analyze_insn): Call try_group_insn.
        (sched_analyze): Cleanup SCHED_GROUP_P before start the analysis.
        * haifa-sched.c (try_group_insn): Moved to sched-deps.c.
        (group_insns_for_macro_fusion): Removed.
        (sched_init): Remove calling group_insns_for_macro_fusion.

2013-11-27  Wei Mi  <wmi@google.com>

        PR rtl-optimization/59020
        * testsuite/gcc.dg/pr59020.c: New.
        * testsuite/gcc.dg/macro-fusion-1.c: New.
        * testsuite/gcc.dg/macro-fusion-2.c: New.

Index: sched-deps.c
===================================================================
--- sched-deps.c        (revision 204923)
+++ sched-deps.c        (working copy)
@@ -2820,6 +2820,37 @@ sched_analyze_2 (struct deps_desc *deps,
     sched_deps_info->finish_rhs ();
 }

+/* Try to group comparison and the following conditional jump INSN if
+   they're already adjacent. This is to prevent scheduler from scheduling
+   them apart.  */
+
+static void
+try_group_insn (rtx insn)
+{
+  unsigned int condreg1, condreg2;
+  rtx cc_reg_1;
+  rtx prev;
+
+  if (!any_condjump_p (insn))
+    return;
+
+  targetm.fixed_condition_code_regs (&condreg1, &condreg2);
+  cc_reg_1 = gen_rtx_REG (CCmode, condreg1);
+  prev = prev_nonnote_nondebug_insn (insn);
+  if (!reg_referenced_p (cc_reg_1, PATTERN (insn))
+      || !prev
+      || !modified_in_p (cc_reg_1, prev))
+    return;
+
+  /* Different microarchitectures support macro fusions for different
+     combinations of insn pairs.  */
+  if (!targetm.sched.macro_fusion_pair_p
+      || !targetm.sched.macro_fusion_pair_p (prev, insn))
+    return;
+
+  SCHED_GROUP_P (insn) = 1;
+}
+
 /* Analyze an INSN with pattern X to find all dependencies.  */
 static void
 sched_analyze_insn (struct deps_desc *deps, rtx x, rtx insn)
@@ -2843,6 +2874,11 @@ sched_analyze_insn (struct deps_desc *de
   can_start_lhs_rhs_p = (NONJUMP_INSN_P (insn)
                         && code == SET);

+  /* Group compare and branch insns for macro-fusion.  */
+  if (targetm.sched.macro_fusion_p
+      && targetm.sched.macro_fusion_p ())
+    try_group_insn (insn);
+
   if (may_trap_p (x))
     /* Avoid moving trapping instructions across function calls that might
        not always return.  */
@@ -3733,6 +3769,10 @@ sched_analyze (struct deps_desc *deps, r
        {
          /* And initialize deps_lists.  */
          sd_init_insn (insn);
+         /* Clean up SCHED_GROUP_P which may have been set by last
+            scheduler pass.  */
+         if (SCHED_GROUP_P (insn))
+           SCHED_GROUP_P (insn) = 0;
        }

       deps_analyze_insn (deps, insn);
Index: haifa-sched.c
===================================================================
--- haifa-sched.c       (revision 204923)
+++ haifa-sched.c       (working copy)
@@ -6554,50 +6554,6 @@ setup_sched_dump (void)
                ? stderr : dump_file);
 }

-/* Try to group comparison and the following conditional jump INSN if
-   they're already adjacent. This is to prevent scheduler from scheduling
-   them apart.  */
-
-static void
-try_group_insn (rtx insn)
-{
-  unsigned int condreg1, condreg2;
-  rtx cc_reg_1;
-  rtx prev;
-
-  if (!any_condjump_p (insn))
-    return;
-
-  targetm.fixed_condition_code_regs (&condreg1, &condreg2);
-  cc_reg_1 = gen_rtx_REG (CCmode, condreg1);
-  prev = prev_nonnote_nondebug_insn (insn);
-  if (!reg_referenced_p (cc_reg_1, PATTERN (insn))
-      || !prev
-      || !modified_in_p (cc_reg_1, prev))
-    return;
-
-  /* Different microarchitectures support macro fusions for different
-     combinations of insn pairs.  */
-  if (!targetm.sched.macro_fusion_pair_p
-      || !targetm.sched.macro_fusion_pair_p (prev, insn))
-    return;
-
-  SCHED_GROUP_P (insn) = 1;
-}
-
-/* If the last cond jump and the cond register defining insn are consecutive
-   before scheduling, we want them to be in a schedule group. This is good
-   for performance on microarchitectures supporting macro-fusion.  */
-
-static void
-group_insns_for_macro_fusion ()
-{
-  basic_block bb;
-
-  FOR_EACH_BB (bb)
-    try_group_insn (BB_END (bb));
-}
-
 /* Initialize some global state for the scheduler.  This function works
    with the common data shared between all the schedulers.  It is called
    from the scheduler specific initialization routine.  */
@@ -6726,11 +6682,6 @@ sched_init (void)
     }

   curr_state = xmalloc (dfa_state_size);
-
-  /* Group compare and branch insns for macro-fusion.  */
-  if (targetm.sched.macro_fusion_p
-      && targetm.sched.macro_fusion_p ())
-    group_insns_for_macro_fusion ();
 }

 static void haifa_init_only_bb (basic_block, basic_block);
Index: testsuite/gcc.dg/macro-fusion-1.c
===================================================================
--- testsuite/gcc.dg/macro-fusion-1.c   (revision 0)
+++ testsuite/gcc.dg/macro-fusion-1.c   (revision 0)
@@ -0,0 +1,13 @@
+/* { dg-do compile { target i?86-*-* x86_64-*-* } } */
+/* { dg-options "-O2 -mtune=corei7 -fdump-rtl-sched2" } */
+/* { dg-final { scan-rtl-dump-not
"compare.*insn.*jump_insn.*jump_insn" "sched2" } } */
+
+int a[100];
+
+double bar (double sum)
+{
+  int i;
+  for (i = 0; i < 1000000; i++)
+   sum += (0.5 + (a[i%100] - 128));
+  return sum;
+}
Index: testsuite/gcc.dg/macro-fusion-2.c
===================================================================
--- testsuite/gcc.dg/macro-fusion-2.c   (revision 0)
+++ testsuite/gcc.dg/macro-fusion-2.c   (revision 0)
@@ -0,0 +1,16 @@
+/* { dg-do compile { target i?86-*-* x86_64-*-* } } */
+/* { dg-options "-O2 -mtune=corei7-avx -fdump-rtl-sched2" } */
+/* { dg-final { scan-rtl-dump-not
"compare.*insn.*jump_insn.*jump_insn" "sched2" } } */
+
+int a[100];
+
+double bar (double sum)
+{
+  int i = 100000;
+  while (i != 0)
+    {
+      sum += (0.5 + (a[i%100] - 128));
+      i--;
+    }
+  return sum;
+}
Index: testsuite/gcc.dg/pr59020.c
===================================================================
--- testsuite/gcc.dg/pr59020.c  (revision 0)
+++ testsuite/gcc.dg/pr59020.c  (revision 0)
@@ -0,0 +1,15 @@
+/* PR rtl-optimization/59020 */
+
+/* { dg-do compile { target i?86-*-* x86_64-*-* } } */
+/* { dg-options "-O2 -fmodulo-sched -fno-inline -march=corei7" } */
+
+int a, b, d;
+unsigned c;
+
+void f()
+{
+  unsigned q;
+  for(; a; a++)
+    if(((c %= d && 1) ? : 1) & 1)
+      for(; b; q++);
+}

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-11-28  2:24                                                           ` Wei Mi
@ 2013-12-03  4:53                                                             ` Jeff Law
  0 siblings, 0 replies; 59+ messages in thread
From: Jeff Law @ 2013-12-03  4:53 UTC (permalink / raw)
  To: Wei Mi
  Cc: H.J. Lu, Jan Hubicka, Alexander Monakov, Steven Bosscher,
	GCC Patches, David Li, Kirill Yukhin, Vladimir Makarov

On 11/27/13 15:31, Wei Mi wrote:
>>> Hmm, maybe attack from the other direction? -- could we clear SCHED_GROUP_P
>>> for each insn at the start of this loop in sched_analyze?
>>>
>>> It's not as clean in the sense that SCHED_GROUP_P "escapes" the scheduler,
>>> but it might be an option.
>>>
>>>     for (insn = head;; insn = NEXT_INSN (insn))
>>>      {
>>>
>>>        if (INSN_P (insn))
>>>          {
>>>            /* And initialize deps_lists.  */
>>>            sd_init_insn (insn);
>>>          }
>>>
>>>        deps_analyze_insn (deps, insn);
>>>
>>>        if (insn == tail)
>>>          {
>>>            if (sched_deps_info->use_cselib)
>>>              cselib_finish ();
>>>            return;
>>>          }
>>>      }
>>> Jeff
>>>>
>>>>
>>>
>>
>> Thanks for the suggestion. It looks workable. Then I need to move the
>> SCHED_GROUP_P setting for macrofusion from sched_init to a place
>> inside sched_analyze after the SCHED_GROUP_P cleanup. It will be more
>> consistent with the settings for cc0 setter-user group and call group,
>> which are both inside sched_analyze.
>> I am trying this method...
>>
>> Thanks,
>> Wei.
>
> Here is the patch. The patch does the SCHED_GROUP_P cleanup in
> sched_analyze before deps_analyze_insn set SCHED_GROUP_P and chain the
> insn with prev insns. And it move try_group_insn for macrofusion from
> sched_init to sched_analyze_insn.
>
> bootstrap and regression pass on x86_64-linux-gnu. Is it ok?
>
> Thanks,
> Wei.
>
> 2013-11-27  Wei Mi  <wmi@google.com>
>
>          PR rtl-optimization/59020
>          * sched-deps.c (try_group_insn): Move it from haifa-sched.c to here.
>          (sched_analyze_insn): Call try_group_insn.
>          (sched_analyze): Cleanup SCHED_GROUP_P before start the analysis.
>          * haifa-sched.c (try_group_insn): Moved to sched-deps.c.
>          (group_insns_for_macro_fusion): Removed.
>          (sched_init): Remove calling group_insns_for_macro_fusion.
>
> 2013-11-27  Wei Mi  <wmi@google.com>
>
>          PR rtl-optimization/59020
>          * testsuite/gcc.dg/pr59020.c: New.
>          * testsuite/gcc.dg/macro-fusion-1.c: New.
>          * testsuite/gcc.dg/macro-fusion-2.c: New.
This is fine.  Thanks for your patience,

Jeff

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-09-13 18:35                                   ` Wei Mi
@ 2013-09-14  6:10                                     ` Andi Kleen
  0 siblings, 0 replies; 59+ messages in thread
From: Andi Kleen @ 2013-09-14  6:10 UTC (permalink / raw)
  To: Wei Mi; +Cc: H.J. Lu, Alexander Monakov, Steven Bosscher, GCC Patches, David Li

Wei Mi <wmi@google.com> writes:

>> Checking corei7/corei7-avx explicitly isn't a good idea.
>> It is also useful for Ivy Bridge and Haswell.  I think you
>> should use a variable to control it, similar to
>> TARGET_FUSE_CMP_AND_BRANCH.
>>
>>
>> --
>> H.J.
>
> Different x86 microarchitectures support macro-fusion for different
> compare and branch combinations. I need to differentiate various x86
> microarchitectures. If use TARGET_FUSE_CMP_AND_BRANCH like vars to
> control it, it requires a bunch of them. That is why I choose to check
> corei7/corei7-avx in that function. I don't add core-avx-i/core-avx2
> for now because I don't have those machines for testing.

They are normally a super set of each other.
So flags as suggested by HJ will work better than explicit matches.

-Andi

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-09-13 18:29                                 ` H.J. Lu
@ 2013-09-13 18:35                                   ` Wei Mi
  2013-09-14  6:10                                     ` Andi Kleen
  0 siblings, 1 reply; 59+ messages in thread
From: Wei Mi @ 2013-09-13 18:35 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Alexander Monakov, Steven Bosscher, GCC Patches, David Li

> Checking corei7/corei7-avx explicitly isn't a good idea.
> It is also useful for Ivy Bridge and Haswell.  I think you
> should use a variable to control it, similar to
> TARGET_FUSE_CMP_AND_BRANCH.
>
>
> --
> H.J.

Different x86 microarchitectures support macro-fusion for different
compare and branch combinations. I need to differentiate various x86
microarchitectures. If use TARGET_FUSE_CMP_AND_BRANCH like vars to
control it, it requires a bunch of them. That is why I choose to check
corei7/corei7-avx in that function. I don't add core-avx-i/core-avx2
for now because I don't have those machines for testing.

Thanks,
Wei Mi.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-09-13 18:06                               ` Wei Mi
@ 2013-09-13 18:29                                 ` H.J. Lu
  2013-09-13 18:35                                   ` Wei Mi
  0 siblings, 1 reply; 59+ messages in thread
From: H.J. Lu @ 2013-09-13 18:29 UTC (permalink / raw)
  To: Wei Mi; +Cc: Alexander Monakov, Steven Bosscher, GCC Patches, David Li

On Fri, Sep 13, 2013 at 10:28 AM, Wei Mi <wmi@google.com> wrote:
>> Thanks.  At this point you need feedback from x86 and scheduler maintainers.
>> I would recommend you to resubmit the patch with a Changelog text, and with
>> the text of the patch inline in the email (your last mail has the patch as a
>> binary attachment, which makes it harder to review and respond to).  Please
>> mention if the updated patch passes bootstrap and regtest.
>
> Thanks! Here is the new patch. bootstrap and regression pass. ok for trunk?
>
> 2013-09-13  Wei Mi  <wmi@google.com>
>
>         * sched-rgn.c (add_branch_dependences): Keep insns in
>         a SCHED_GROUP at the end of bb to remain their locations.
>         * config/i386/x86-tune.def (DEF_TUNE): Add m_COREI7 for
>         X86_TUNE_FUSE_CMP_AND_BRANCH.
>         * config/i386/i386.c (ix86_macro_fusion_p): New Function.
>         (ix86_macro_fusion_pair_p): Ditto.
>         * doc/tm.texi.in: Generated.
>         * doc/tm.texi: Ditto.
>         * target.def: Add two hooks: macro_fusion_p and
>         macro_fusion_pair_p.
>         * haifa-sched.c (try_group_insn): New function.
>         (group_insns_for_macro_fusion): New function.
>         (sched_init): Call group_insns_for_macro_fusion.
>

> Index: config/i386/i386.c
> ===================================================================
> --- config/i386/i386.c  (revision 201963)
> +++ config/i386/i386.c  (working copy)
> @@ -24850,6 +24850,99 @@ ia32_multipass_dfa_lookahead (void)
>      }
>  }
>
> +/* Return true if target platform supports macro-fusion.  */
> +
> +static bool
> +ix86_macro_fusion_p ()
> +{
> +  if (TARGET_FUSE_CMP_AND_BRANCH)
> +    return true;
> +  else
> +    return false;
> +}
> +
> +/* Check whether current microarchitecture support macro fusion
> +   for insn pair "CONDGEN + CONDJMP". Refer to
> +   "Intel Architectures Optimization Reference Manual". */
> +
> +static bool
> +ix86_macro_fusion_pair_p (rtx condgen, rtx condjmp)
> +{
> +  rtx src;
> +  if (!strcmp (ix86_tune_string, "corei7"))
> +    {
> +      /* For Nehalem.  */
> +      rtx single_set = single_set (condgen);
> +      /* Nehalem doesn't support macro-fusion for add/sub+jmp.  */
> +      if (single_set == NULL_RTX)
> +        return false;
> +
> +      src = SET_SRC (single_set);
> +      if (GET_CODE (src) != COMPARE)
> +       return false;
> +
> +      /* Nehalem doesn't support macro-fusion for cmp/test MEM-IMM
> +        insn pattern.  */
> +      if ((MEM_P (XEXP (src, 0))
> +          && CONST_INT_P (XEXP (src, 1)))
> +         || (MEM_P (XEXP (src, 1))
> +             && CONST_INT_P (XEXP (src, 0))))
> +       return false;
> +
> +      /* Nehalem doesn't support macro-fusion for add/sub/dec/inc + jmp.  */
> +      if (get_attr_type (condgen) != TYPE_TEST
> +         && get_attr_type (condgen) != TYPE_ICMP)
> +       return false;
> +      return true;
> +    }
> +  else if (!strcmp (ix86_tune_string, "corei7-avx"))
> +    {
> +      /* For Sandybridge.  */
> +      enum rtx_code ccode;
> +      rtx compare_set = NULL_RTX, test_if, cond;
> +      rtx single_set = single_set (condgen);
> +      if (single_set != NULL_RTX)
> +        compare_set = single_set;
> +      else
> +       {
> +         int i;
> +         rtx pat = PATTERN (condgen);
> +         for (i = 0; i < XVECLEN (pat, 0); i++)
> +           if (GET_CODE (XVECEXP (pat, 0, i)) == SET
> +               && GET_CODE (SET_SRC (XVECEXP (pat, 0, i))) == COMPARE)
> +             compare_set = XVECEXP (pat, 0, i);
> +       }
> +
> +      if (compare_set == NULL_RTX)
> +       return false;
> +      src = SET_SRC (compare_set);
> +      if (GET_CODE (src) != COMPARE)
> +       return false;
> +
> +      /* Sandybridge doesn't support macro-fusion for cmp/test MEM-IMM
> +        insn pattern.  */
> +      if ((MEM_P (XEXP (src, 0))
> +           && CONST_INT_P (XEXP (src, 1)))
> +          || (MEM_P (XEXP (src, 1))
> +              && CONST_INT_P (XEXP (src, 0))))
> +        return false;
> +
> +      /* Sandybridge doesn't support macro-fusion for inc/dec +
> +        unsigned comparison jmp.  */
> +      test_if = SET_SRC (pc_set (condjmp));
> +      cond = XEXP (test_if, 0);
> +      ccode = GET_CODE (cond);
> +      if (get_attr_type (condgen) == TYPE_INCDEC
> +         && (ccode == GEU
> +             || ccode == GTU
> +             || ccode == LEU
> +             || ccode == LTU))
> +       return false;
> +      return true;
> +    }
> +  return false;
> +}
> +

Checking corei7/corei7-avx explicitly isn't a good idea.
It is also useful for Ivy Bridge and Haswell.  I think you
should use a variable to control it, similar to
TARGET_FUSE_CMP_AND_BRANCH.


-- 
H.J.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-09-13 10:18                             ` Alexander Monakov
@ 2013-09-13 18:06                               ` Wei Mi
  2013-09-13 18:29                                 ` H.J. Lu
  0 siblings, 1 reply; 59+ messages in thread
From: Wei Mi @ 2013-09-13 18:06 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: Steven Bosscher, GCC Patches, David Li

> Thanks.  At this point you need feedback from x86 and scheduler maintainers.
> I would recommend you to resubmit the patch with a Changelog text, and with
> the text of the patch inline in the email (your last mail has the patch as a
> binary attachment, which makes it harder to review and respond to).  Please
> mention if the updated patch passes bootstrap and regtest.

Thanks! Here is the new patch. bootstrap and regression pass. ok for trunk?

2013-09-13  Wei Mi  <wmi@google.com>

        * sched-rgn.c (add_branch_dependences): Keep insns in
        a SCHED_GROUP at the end of bb to remain their locations.
        * config/i386/x86-tune.def (DEF_TUNE): Add m_COREI7 for
        X86_TUNE_FUSE_CMP_AND_BRANCH.
        * config/i386/i386.c (ix86_macro_fusion_p): New Function.
        (ix86_macro_fusion_pair_p): Ditto.
        * doc/tm.texi.in: Generated.
        * doc/tm.texi: Ditto.
        * target.def: Add two hooks: macro_fusion_p and
        macro_fusion_pair_p.
        * haifa-sched.c (try_group_insn): New function.
        (group_insns_for_macro_fusion): New function.
        (sched_init): Call group_insns_for_macro_fusion.

Index: sched-rgn.c
===================================================================
--- sched-rgn.c (revision 201963)
+++ sched-rgn.c (working copy)
@@ -2443,6 +2443,8 @@ add_branch_dependences (rtx head, rtx ta
      cc0 setters remain at the end because they can't be moved away from
      their cc0 user.

+     Predecessors of SCHED_GROUP_P instructions at the end remain at the end.
+
      COND_EXEC insns cannot be moved past a branch (see e.g. PR17808).

      Insns setting TARGET_CLASS_LIKELY_SPILLED_P registers (usually return
@@ -2465,7 +2467,8 @@ add_branch_dependences (rtx head, rtx ta
 #endif
  || (!reload_completed
      && sets_likely_spilled (PATTERN (insn)))))
- || NOTE_P (insn))
+ || NOTE_P (insn)
+ || (last != 0 && SCHED_GROUP_P (last)))
     {
       if (!NOTE_P (insn))
  {
Index: config/i386/x86-tune.def
===================================================================
--- config/i386/x86-tune.def    (revision 201963)
+++ config/i386/x86-tune.def    (working copy)
@@ -196,7 +196,8 @@ DEF_TUNE (X86_TUNE_USE_VECTOR_CONVERTS,
 /* X86_TUNE_FUSE_CMP_AND_BRANCH: Fuse a compare or test instruction
    with a subsequent conditional jump instruction into a single
    compare-and-branch uop.  */
-DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH, "fuse_cmp_and_branch", m_BDVER)
+DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH, "fuse_cmp_and_branch",
+          m_COREI7 | m_BDVER)
 /* X86_TUNE_OPT_AGU: Optimize for Address Generation Unit. This flag
    will impact LEA instruction selection. */
 DEF_TUNE (X86_TUNE_OPT_AGU, "opt_agu", m_ATOM | m_SLM)
Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c  (revision 201963)
+++ config/i386/i386.c  (working copy)
@@ -24850,6 +24850,99 @@ ia32_multipass_dfa_lookahead (void)
     }
 }

+/* Return true if target platform supports macro-fusion.  */
+
+static bool
+ix86_macro_fusion_p ()
+{
+  if (TARGET_FUSE_CMP_AND_BRANCH)
+    return true;
+  else
+    return false;
+}
+
+/* Check whether current microarchitecture support macro fusion
+   for insn pair "CONDGEN + CONDJMP". Refer to
+   "Intel Architectures Optimization Reference Manual". */
+
+static bool
+ix86_macro_fusion_pair_p (rtx condgen, rtx condjmp)
+{
+  rtx src;
+  if (!strcmp (ix86_tune_string, "corei7"))
+    {
+      /* For Nehalem.  */
+      rtx single_set = single_set (condgen);
+      /* Nehalem doesn't support macro-fusion for add/sub+jmp.  */
+      if (single_set == NULL_RTX)
+        return false;
+
+      src = SET_SRC (single_set);
+      if (GET_CODE (src) != COMPARE)
+       return false;
+
+      /* Nehalem doesn't support macro-fusion for cmp/test MEM-IMM
+        insn pattern.  */
+      if ((MEM_P (XEXP (src, 0))
+          && CONST_INT_P (XEXP (src, 1)))
+         || (MEM_P (XEXP (src, 1))
+             && CONST_INT_P (XEXP (src, 0))))
+       return false;
+
+      /* Nehalem doesn't support macro-fusion for add/sub/dec/inc + jmp.  */
+      if (get_attr_type (condgen) != TYPE_TEST
+         && get_attr_type (condgen) != TYPE_ICMP)
+       return false;
+      return true;
+    }
+  else if (!strcmp (ix86_tune_string, "corei7-avx"))
+    {
+      /* For Sandybridge.  */
+      enum rtx_code ccode;
+      rtx compare_set = NULL_RTX, test_if, cond;
+      rtx single_set = single_set (condgen);
+      if (single_set != NULL_RTX)
+        compare_set = single_set;
+      else
+       {
+         int i;
+         rtx pat = PATTERN (condgen);
+         for (i = 0; i < XVECLEN (pat, 0); i++)
+           if (GET_CODE (XVECEXP (pat, 0, i)) == SET
+               && GET_CODE (SET_SRC (XVECEXP (pat, 0, i))) == COMPARE)
+             compare_set = XVECEXP (pat, 0, i);
+       }
+
+      if (compare_set == NULL_RTX)
+       return false;
+      src = SET_SRC (compare_set);
+      if (GET_CODE (src) != COMPARE)
+       return false;
+
+      /* Sandybridge doesn't support macro-fusion for cmp/test MEM-IMM
+        insn pattern.  */
+      if ((MEM_P (XEXP (src, 0))
+           && CONST_INT_P (XEXP (src, 1)))
+          || (MEM_P (XEXP (src, 1))
+              && CONST_INT_P (XEXP (src, 0))))
+        return false;
+
+      /* Sandybridge doesn't support macro-fusion for inc/dec +
+        unsigned comparison jmp.  */
+      test_if = SET_SRC (pc_set (condjmp));
+      cond = XEXP (test_if, 0);
+      ccode = GET_CODE (cond);
+      if (get_attr_type (condgen) == TYPE_INCDEC
+         && (ccode == GEU
+             || ccode == GTU
+             || ccode == LEU
+             || ccode == LTU))
+       return false;
+      return true;
+    }
+  return false;
+}
+
 /* Try to reorder ready list to take advantage of Atom pipelined IMUL
    execution. It is applied if
    (1) IMUL instruction is on the top of list;
@@ -42982,6 +43075,10 @@ ix86_memmodel_check (unsigned HOST_WIDE_
 #undef TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD
 #define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD \
   ia32_multipass_dfa_lookahead
+#undef TARGET_SCHED_MACRO_FUSION_P
+#define TARGET_SCHED_MACRO_FUSION_P ix86_macro_fusion_p
+#undef TARGET_SCHED_MACRO_FUSION_PAIR_P
+#define TARGET_SCHED_MACRO_FUSION_PAIR_P ix86_macro_fusion_pair_p

 #undef TARGET_FUNCTION_OK_FOR_SIBCALL
 #define TARGET_FUNCTION_OK_FOR_SIBCALL ix86_function_ok_for_sibcall
Index: doc/tm.texi.in
===================================================================
--- doc/tm.texi.in      (revision 201963)
+++ doc/tm.texi.in      (working copy)
@@ -4940,6 +4940,10 @@ them: try the first ones in this list fi

 @hook TARGET_SCHED_REORDER2

+@hook TARGET_SCHED_MACRO_FUSION_P
+
+@hook TARGET_SCHED_MACRO_FUSION_PAIR_P
+
 @hook TARGET_SCHED_DEPENDENCIES_EVALUATION_HOOK

 @hook TARGET_SCHED_INIT
Index: doc/tm.texi
===================================================================
--- doc/tm.texi (revision 201963)
+++ doc/tm.texi (working copy)
@@ -6553,6 +6553,17 @@ scheduling one insn causes other insns t
 cycle.  These other insns can then be taken into account properly.
 @end deftypefn

+@deftypefn {Target Hook} bool TARGET_SCHED_MACRO_FUSION_P (void)
+This hook is used to check whether target platform supports macro fusion.
+@end deftypefn
+
+@deftypefn {Target Hook} bool TARGET_SCHED_MACRO_FUSION_PAIR_P (rtx
@var{condgen}, rtx @var{condjmp})
+This hook is used to check whether two insns could be macro fused for
+target microarchitecture. If this hook returns true for the given insn pair
+(@var{condgen} and @var{condjmp}), scheduler will put them into a sched
+group, and they will not be scheduled apart.
+@end deftypefn
+
 @deftypefn {Target Hook} void
TARGET_SCHED_DEPENDENCIES_EVALUATION_HOOK (rtx @var{head}, rtx
@var{tail})
 This hook is called after evaluation forward dependencies of insns in
 chain given by two parameter values (@var{head} and @var{tail}
Index: target.def
===================================================================
--- target.def  (revision 201963)
+++ target.def  (working copy)
@@ -1041,6 +1041,19 @@ scheduling one insn causes other insns t
 cycle.  These other insns can then be taken into account properly.",
  int, (FILE *file, int verbose, rtx *ready, int *n_readyp, int clock), NULL)

+DEFHOOK
+(macro_fusion_p,
+ "This hook is used to check whether target platform supports macro fusion.",
+ bool, (void), NULL)
+
+DEFHOOK
+(macro_fusion_pair_p,
+ "This hook is used to check whether two insns could be macro fused for\n\
+target microarchitecture. If this hook returns true for the given insn pair\n\
+(@var{condgen} and @var{condjmp}), scheduler will put them into a sched\n\
+group, and they will not be scheduled apart.",
+ bool, (rtx condgen, rtx condjmp), NULL)
+
 /* The following member value is a pointer to a function called
    after evaluation forward dependencies of insns in chain given
    by two parameter values (head and tail correspondingly).  */
Index: haifa-sched.c
===================================================================
--- haifa-sched.c       (revision 201963)
+++ haifa-sched.c       (working copy)
@@ -6519,6 +6519,44 @@ setup_sched_dump (void)
                ? stderr : dump_file);
 }

+static void
+try_group_insn (rtx insn)
+{
+  unsigned int condreg1, condreg2;
+  rtx cc_reg_1;
+  rtx prev;
+
+  targetm.fixed_condition_code_regs (&condreg1, &condreg2);
+  cc_reg_1 = gen_rtx_REG (CCmode, condreg1);
+  prev = prev_nonnote_nondebug_insn (insn);
+  if (!any_condjump_p (insn)
+      || !reg_referenced_p (cc_reg_1, PATTERN (insn))
+      || !prev
+      || !modified_in_p (cc_reg_1, prev))
+    return;
+
+  /* Different microarchitectures support macro fusions for different
+     combinations of insn pairs.  */
+  if (!targetm.sched.macro_fusion_pair_p
+      || !targetm.sched.macro_fusion_pair_p (prev, insn))
+    return;
+
+  SCHED_GROUP_P (insn) = 1;
+}
+
+/* If the last cond jump and the cond register defining insn are consecutive
+   before scheduling, we want them to be in a schedule group. This is good
+   for performance on microarchitectures supporting macro-fusion.  */
+
+static void
+group_insns_for_macro_fusion ()
+{
+  basic_block bb;
+
+  FOR_EACH_BB (bb)
+    try_group_insn (BB_END (bb));
+}
+
 /* Initialize some global state for the scheduler.  This function works
    with the common data shared between all the schedulers.  It is called
    from the scheduler specific initialization routine.  */
@@ -6645,6 +6683,11 @@ sched_init (void)
     }

   curr_state = xmalloc (dfa_state_size);
+
+  /* Group compare and branch insns for macro-fusion.  */
+  if (targetm.sched.macro_fusion_p
+      && targetm.sched.macro_fusion_p ())
+    group_insns_for_macro_fusion ();
 }

 static void haifa_init_only_bb (basic_block, basic_block);

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-09-12 18:19                           ` Wei Mi
@ 2013-09-13 10:18                             ` Alexander Monakov
  2013-09-13 18:06                               ` Wei Mi
  0 siblings, 1 reply; 59+ messages in thread
From: Alexander Monakov @ 2013-09-13 10:18 UTC (permalink / raw)
  To: Wei Mi; +Cc: Steven Bosscher, GCC Patches, David Li

On Thu, 12 Sep 2013, Wei Mi wrote:

> Thanks, fixed. New patch attached.

Thanks.  At this point you need feedback from x86 and scheduler maintainers.
I would recommend you to resubmit the patch with a Changelog text, and with
the text of the patch inline in the email (your last mail has the patch as a
binary attachment, which makes it harder to review and respond to).  Please
mention if the updated patch passes bootstrap and regtest.

Alexander

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-09-12 10:25                         ` Alexander Monakov
@ 2013-09-12 18:19                           ` Wei Mi
  2013-09-13 10:18                             ` Alexander Monakov
  0 siblings, 1 reply; 59+ messages in thread
From: Wei Mi @ 2013-09-12 18:19 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: Steven Bosscher, GCC Patches, David Li

[-- Attachment #1: Type: text/plain, Size: 167 bytes --]

> Your new implementation is not efficient: when looping over BBs, you need to
> look only at the last insn of each basic block.
>

Thanks, fixed. New patch attached.

[-- Attachment #2: patch --]
[-- Type: application/octet-stream, Size: 9201 bytes --]

Index: doc/tm.texi.in
===================================================================
--- doc/tm.texi.in	(revision 201771)
+++ doc/tm.texi.in	(working copy)
@@ -6455,6 +6455,10 @@ scheduling one insn causes other insns t
 cycle.  These other insns can then be taken into account properly.
 @end deftypefn
 
+@hook TARGET_SCHED_MACRO_FUSION_P
+
+@hook TARGET_SCHED_MACRO_FUSION_PAIR_P
+
 @hook TARGET_SCHED_DEPENDENCIES_EVALUATION_HOOK
 This hook is called after evaluation forward dependencies of insns in
 chain given by two parameter values (@var{head} and @var{tail}
Index: doc/tm.texi
===================================================================
--- doc/tm.texi	(revision 201771)
+++ doc/tm.texi	(working copy)
@@ -6551,6 +6551,17 @@ scheduling one insn causes other insns t
 cycle.  These other insns can then be taken into account properly.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_SCHED_MACRO_FUSION_P (void)
+This hook is used to check whether target platform supports macro fusion.
+@end deftypefn
+
+@deftypefn {Target Hook} bool TARGET_SCHED_MACRO_FUSION_PAIR_P (rtx @var{condgen}, rtx @var{condjmp})
+This hook is used to check whether two insns could be macro fused for
+target microarchitecture. If this hook returns true for the given insn pair
+(@var{condgen} and @var{condjmp}), scheduler will put them into a sched
+group, and they will not be scheduled apart.
+@end deftypefn
+
 @deftypefn {Target Hook} void TARGET_SCHED_DEPENDENCIES_EVALUATION_HOOK (rtx @var{head}, rtx @var{tail})
 This hook is called after evaluation forward dependencies of insns in
 chain given by two parameter values (@var{head} and @var{tail}
Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c	(revision 201771)
+++ config/i386/i386.c	(working copy)
@@ -2004,7 +2004,7 @@ static unsigned int initial_ix86_tune_fe
   /* X86_TUNE_FUSE_CMP_AND_BRANCH: Fuse a compare or test instruction
      with a subsequent conditional jump instruction into a single
      compare-and-branch uop.  */
-  m_BDVER,
+  m_COREI7 | m_BDVER,
 
   /* X86_TUNE_OPT_AGU: Optimize for Address Generation Unit. This flag
      will impact LEA instruction selection. */
@@ -24845,6 +24845,99 @@ ia32_multipass_dfa_lookahead (void)
     }
 }
 
+/* Return true if target platform supports macro-fusion.  */
+
+static bool
+ix86_macro_fusion_p ()
+{
+  if (TARGET_FUSE_CMP_AND_BRANCH)
+    return true;
+  else
+    return false;
+}
+
+/* Check whether current microarchitecture support macro fusion
+   for insn pair "CONDGEN + CONDJMP". Refer to 
+   "Intel Architectures Optimization Reference Manual". */
+
+static bool
+ix86_macro_fusion_pair_p (rtx condgen, rtx condjmp)
+{
+  rtx src;
+  if (!strcmp (ix86_tune_string, "corei7"))
+    {
+      /* For Nehalem.  */
+      rtx single_set = single_set (condgen);
+      /* Nehalem doesn't support macro-fusion for add/sub+jmp.  */
+      if (single_set == NULL_RTX)
+        return false;
+
+      src = SET_SRC (single_set);
+      if (GET_CODE (src) != COMPARE)
+	return false;
+
+      /* Nehalem doesn't support macro-fusion for cmp/test MEM-IMM
+	 insn pattern.  */
+      if ((MEM_P (XEXP (src, 0))
+	   && CONST_INT_P (XEXP (src, 1)))
+	  || (MEM_P (XEXP (src, 1))
+	      && CONST_INT_P (XEXP (src, 0))))
+	return false;
+
+      /* Nehalem doesn't support macro-fusion for add/sub/dec/inc + jmp.  */
+      if (get_attr_type (condgen) != TYPE_TEST
+	  && get_attr_type (condgen) != TYPE_ICMP)
+	return false;
+      return true;
+    }
+  else if (!strcmp (ix86_tune_string, "corei7-avx"))
+    {
+      /* For Sandybridge.  */
+      enum rtx_code ccode;
+      rtx compare_set = NULL_RTX, test_if, cond;
+      rtx single_set = single_set (condgen);
+      if (single_set != NULL_RTX)
+        compare_set = single_set;
+      else
+	{
+	  int i;
+	  rtx pat = PATTERN (condgen);
+	  for (i = 0; i < XVECLEN (pat, 0); i++)
+	    if (GET_CODE (XVECEXP (pat, 0, i)) == SET
+		&& GET_CODE (SET_SRC (XVECEXP (pat, 0, i))) == COMPARE)
+	      compare_set = XVECEXP (pat, 0, i);
+	}
+
+      if (compare_set == NULL_RTX)
+	return false;      
+      src = SET_SRC (compare_set);
+      if (GET_CODE (src) != COMPARE)
+	return false;
+
+      /* Sandybridge doesn't support macro-fusion for cmp/test MEM-IMM
+	 insn pattern.  */
+      if ((MEM_P (XEXP (src, 0))
+           && CONST_INT_P (XEXP (src, 1)))
+          || (MEM_P (XEXP (src, 1))
+              && CONST_INT_P (XEXP (src, 0))))
+        return false;
+
+      /* Sandybridge doesn't support macro-fusion for inc/dec +
+	 unsigned comparison jmp.  */
+      test_if = SET_SRC (pc_set (condjmp));
+      cond = XEXP (test_if, 0);
+      ccode = GET_CODE (cond);
+      if (get_attr_type (condgen) == TYPE_INCDEC
+	  && (ccode == GEU
+	      || ccode == GTU
+	      || ccode == LEU
+	      || ccode == LTU))
+	return false;
+      return true;
+    }
+  return false;
+}
+
 /* Try to reorder ready list to take advantage of Atom pipelined IMUL
    execution. It is applied if
    (1) IMUL instruction is on the top of list;
@@ -42834,6 +42927,10 @@ ix86_memmodel_check (unsigned HOST_WIDE_
 #undef TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD
 #define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD \
   ia32_multipass_dfa_lookahead
+#undef TARGET_SCHED_MACRO_FUSION_P
+#define TARGET_SCHED_MACRO_FUSION_P ix86_macro_fusion_p
+#undef TARGET_SCHED_MACRO_FUSION_PAIR_P
+#define TARGET_SCHED_MACRO_FUSION_PAIR_P ix86_macro_fusion_pair_p
 
 #undef TARGET_FUNCTION_OK_FOR_SIBCALL
 #define TARGET_FUNCTION_OK_FOR_SIBCALL ix86_function_ok_for_sibcall
Index: haifa-sched.c
===================================================================
--- haifa-sched.c	(revision 201963)
+++ haifa-sched.c	(working copy)
@@ -6519,6 +6519,44 @@ setup_sched_dump (void)
 		? stderr : dump_file);
 }
 
+static void
+try_group_insn (rtx insn)
+{
+  unsigned int condreg1, condreg2;
+  rtx cc_reg_1;
+  rtx prev;
+
+  targetm.fixed_condition_code_regs (&condreg1, &condreg2);
+  cc_reg_1 = gen_rtx_REG (CCmode, condreg1);
+  prev = prev_nonnote_nondebug_insn (insn);
+  if (!any_condjump_p (insn)
+      || !reg_referenced_p (cc_reg_1, PATTERN (insn))
+      || !prev
+      || !modified_in_p (cc_reg_1, prev))
+    return;
+
+  /* Different microarchitectures support macro fusions for different
+     combinations of insn pairs.  */
+  if (!targetm.sched.macro_fusion_pair_p
+      || !targetm.sched.macro_fusion_pair_p (prev, insn))
+    return;
+
+  SCHED_GROUP_P (insn) = 1;
+}
+
+/* If the last cond jump and the cond register defining insn are consecutive
+   before scheduling, we want them to be in a schedule group. This is good
+   for performance on microarchitectures supporting macro-fusion.  */
+
+static void
+group_insns_for_macro_fusion ()
+{
+  basic_block bb;
+
+  FOR_EACH_BB (bb)
+    try_group_insn (BB_END (bb));
+}
+
 /* Initialize some global state for the scheduler.  This function works
    with the common data shared between all the schedulers.  It is called
    from the scheduler specific initialization routine.  */
@@ -6645,6 +6683,11 @@ sched_init (void)
     }
 
   curr_state = xmalloc (dfa_state_size);
+
+  /* Group compare and branch insns for macro-fusion.  */
+  if (targetm.sched.macro_fusion_p
+      && targetm.sched.macro_fusion_p ())
+    group_insns_for_macro_fusion ();
 }
 
 static void haifa_init_only_bb (basic_block, basic_block);
Index: target.def
===================================================================
--- target.def	(revision 201771)
+++ target.def	(working copy)
@@ -591,6 +591,19 @@ DEFHOOK
  "",
  int, (FILE *file, int verbose, rtx *ready, int *n_readyp, int clock), NULL)
 
+DEFHOOK
+(macro_fusion_p,
+ "This hook is used to check whether target platform supports macro fusion.",
+ bool, (void), NULL)
+
+DEFHOOK
+(macro_fusion_pair_p,
+ "This hook is used to check whether two insns could be macro fused for\n\
+target microarchitecture. If this hook returns true for the given insn pair\n\
+(@var{condgen} and @var{condjmp}), scheduler will put them into a sched\n\
+group, and they will not be scheduled apart.",
+ bool, (rtx condgen, rtx condjmp), NULL)
+
 /* The following member value is a pointer to a function called
    after evaluation forward dependencies of insns in chain given
    by two parameter values (head and tail correspondingly).  */
Index: sched-rgn.c
===================================================================
--- sched-rgn.c	(revision 201771)
+++ sched-rgn.c	(working copy)
@@ -2437,6 +2437,8 @@ add_branch_dependences (rtx head, rtx ta
      cc0 setters remain at the end because they can't be moved away from
      their cc0 user.
 
+     Predecessors of SCHED_GROUP_P instructions at the end remain at the end.
+
      COND_EXEC insns cannot be moved past a branch (see e.g. PR17808).
 
      Insns setting TARGET_CLASS_LIKELY_SPILLED_P registers (usually return
@@ -2459,7 +2461,8 @@ add_branch_dependences (rtx head, rtx ta
 #endif
 		 || (!reload_completed
 		     && sets_likely_spilled (PATTERN (insn)))))
-	 || NOTE_P (insn))
+	 || NOTE_P (insn)
+	 || (last != 0 && SCHED_GROUP_P (last)))
     {
       if (!NOTE_P (insn))
 	{

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-09-12  7:56                       ` Wei Mi
@ 2013-09-12 10:25                         ` Alexander Monakov
  2013-09-12 18:19                           ` Wei Mi
  0 siblings, 1 reply; 59+ messages in thread
From: Alexander Monakov @ 2013-09-12 10:25 UTC (permalink / raw)
  To: Wei Mi; +Cc: Steven Bosscher, GCC Patches, David Li



On Wed, 11 Sep 2013, Wei Mi wrote:
> I agree with you that explicit handling in sched-deps.c for this
> feature looks not good. So I move it to sched_init (Instead of
> ix86_sched_init_global because ix86_sched_init_global is used to
> install scheduling hooks), and then it is possible for other
> architectures to use it.

To clarify, I meant TARGET_SCHED_INIT hook, which is currently not used in the
x86 backend (the corresponding function would be ix86_sched_init, not
ix86_sched_init_global).

Your new implementation is not efficient: when looping over BBs, you need to
look only at the last insn of each basic block.

Alexander

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-09-11 17:07                     ` Alexander Monakov
@ 2013-09-12  7:56                       ` Wei Mi
  2013-09-12 10:25                         ` Alexander Monakov
  0 siblings, 1 reply; 59+ messages in thread
From: Wei Mi @ 2013-09-12  7:56 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: Steven Bosscher, GCC Patches, David Li

Thanks! Your method to adjust 'last' is more concise. I try it and it
works for small testcases. bootstrap and regression are ok. More
performance test is going on.

I agree with you that explicit handling in sched-deps.c for this
feature looks not good. So I move it to sched_init (Instead of
ix86_sched_init_global because ix86_sched_init_global is used to
install scheduling hooks), and then it is possible for other
architectures to use it.
I also need the two hooks because one is used as the gate for
macro-fusion controlled by -mtune-ctrl=fuse_cmp_and_branch on x86, and
the other is used to check for which kind of cmp and branch pair
macro-fusion is supported on target platform. But I am not sure if it
is proper to put those two hooks under TARGET_SCHED hook vector.

Thanks,
Wei Mi.

updated patch:

Index: doc/tm.texi.in
===================================================================
--- doc/tm.texi.in      (revision 201771)
+++ doc/tm.texi.in      (working copy)
@@ -6455,6 +6455,10 @@ scheduling one insn causes other insns t
 cycle.  These other insns can then be taken into account properly.
 @end deftypefn

+@hook TARGET_SCHED_MACRO_FUSION_P
+
+@hook TARGET_SCHED_MACRO_FUSION_PAIR_P
+
 @hook TARGET_SCHED_DEPENDENCIES_EVALUATION_HOOK
 This hook is called after evaluation forward dependencies of insns in
 chain given by two parameter values (@var{head} and @var{tail}
Index: doc/tm.texi
===================================================================
--- doc/tm.texi (revision 201771)
+++ doc/tm.texi (working copy)
@@ -6551,6 +6551,17 @@ scheduling one insn causes other insns t
 cycle.  These other insns can then be taken into account properly.
 @end deftypefn

+@deftypefn {Target Hook} bool TARGET_SCHED_MACRO_FUSION_P (void)
+This hook is used to check whether target platform supports macro fusion.
+@end deftypefn
+
+@deftypefn {Target Hook} bool TARGET_SCHED_MACRO_FUSION_PAIR_P (rtx
@var{condgen}, rtx @var{condjmp})
+This hook is used to check whether two insns could be macro fused for
+target microarchitecture. If this hook returns true for the given insn pair
+(@var{condgen} and @var{condjmp}), scheduler will put them into a sched
+group, and they will not be scheduled apart.
+@end deftypefn
+
 @deftypefn {Target Hook} void
TARGET_SCHED_DEPENDENCIES_EVALUATION_HOOK (rtx @var{head}, rtx
@var{tail})
 This hook is called after evaluation forward dependencies of insns in
 chain given by two parameter values (@var{head} and @var{tail}
Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c  (revision 201771)
+++ config/i386/i386.c  (working copy)
@@ -2004,7 +2004,7 @@ static unsigned int initial_ix86_tune_fe
   /* X86_TUNE_FUSE_CMP_AND_BRANCH: Fuse a compare or test instruction
      with a subsequent conditional jump instruction into a single
      compare-and-branch uop.  */
-  m_BDVER,
+  m_COREI7 | m_BDVER,

   /* X86_TUNE_OPT_AGU: Optimize for Address Generation Unit. This flag
      will impact LEA instruction selection. */
@@ -24845,6 +24845,99 @@ ia32_multipass_dfa_lookahead (void)
     }
 }

+/* Return true if target platform supports macro-fusion.  */
+
+static bool
+ix86_macro_fusion_p ()
+{
+  if (TARGET_FUSE_CMP_AND_BRANCH)
+    return true;
+  else
+    return false;
+}
+
+/* Check whether current microarchitecture support macro fusion
+   for insn pair "CONDGEN + CONDJMP". Refer to
+   "Intel Architectures Optimization Reference Manual". */
+
+static bool
+ix86_macro_fusion_pair_p (rtx condgen, rtx condjmp)
+{
+  rtx src;
+  if (!strcmp (ix86_tune_string, "corei7"))
+    {
+      /* For Nehalem.  */
+      rtx single_set = single_set (condgen);
+      /* Nehalem doesn't support macro-fusion for add/sub+jmp.  */
+      if (single_set == NULL_RTX)
+        return false;
+
+      src = SET_SRC (single_set);
+      if (GET_CODE (src) != COMPARE)
+       return false;
+
+      /* Nehalem doesn't support macro-fusion for cmp/test MEM-IMM
+        insn pattern.  */
+      if ((MEM_P (XEXP (src, 0))
+          && CONST_INT_P (XEXP (src, 1)))
+         || (MEM_P (XEXP (src, 1))
+             && CONST_INT_P (XEXP (src, 0))))
+       return false;
+
+      /* Nehalem doesn't support macro-fusion for add/sub/dec/inc + jmp.  */
+      if (get_attr_type (condgen) != TYPE_TEST
+         && get_attr_type (condgen) != TYPE_ICMP)
+       return false;
+      return true;
+    }
+  else if (!strcmp (ix86_tune_string, "corei7-avx"))
+    {
+      /* For Sandybridge.  */
+      enum rtx_code ccode;
+      rtx compare_set = NULL_RTX, test_if, cond;
+      rtx single_set = single_set (condgen);
+      if (single_set != NULL_RTX)
+        compare_set = single_set;
+      else
+       {
+         int i;
+         rtx pat = PATTERN (condgen);
+         for (i = 0; i < XVECLEN (pat, 0); i++)
+           if (GET_CODE (XVECEXP (pat, 0, i)) == SET
+               && GET_CODE (SET_SRC (XVECEXP (pat, 0, i))) == COMPARE)
+             compare_set = XVECEXP (pat, 0, i);
+       }
+
+      if (compare_set == NULL_RTX)
+       return false;
+      src = SET_SRC (compare_set);
+      if (GET_CODE (src) != COMPARE)
+       return false;
+
+      /* Sandybridge doesn't support macro-fusion for cmp/test MEM-IMM
+        insn pattern.  */
+      if ((MEM_P (XEXP (src, 0))
+           && CONST_INT_P (XEXP (src, 1)))
+          || (MEM_P (XEXP (src, 1))
+              && CONST_INT_P (XEXP (src, 0))))
+        return false;
+
+      /* Sandybridge doesn't support macro-fusion for inc/dec +
+        unsigned comparison jmp.  */
+      test_if = SET_SRC (pc_set (condjmp));
+      cond = XEXP (test_if, 0);
+      ccode = GET_CODE (cond);
+      if (get_attr_type (condgen) == TYPE_INCDEC
+         && (ccode == GEU
+             || ccode == GTU
+             || ccode == LEU
+             || ccode == LTU))
+       return false;
+      return true;
+    }
+  return false;
+}
+
 /* Try to reorder ready list to take advantage of Atom pipelined IMUL
    execution. It is applied if
    (1) IMUL instruction is on the top of list;
@@ -42834,6 +42927,10 @@ ix86_memmodel_check (unsigned HOST_WIDE_
 #undef TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD
 #define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD \
   ia32_multipass_dfa_lookahead
+#undef TARGET_SCHED_MACRO_FUSION_P
+#define TARGET_SCHED_MACRO_FUSION_P ix86_macro_fusion_p
+#undef TARGET_SCHED_MACRO_FUSION_PAIR_P
+#define TARGET_SCHED_MACRO_FUSION_PAIR_P ix86_macro_fusion_pair_p

 #undef TARGET_FUNCTION_OK_FOR_SIBCALL
 #define TARGET_FUNCTION_OK_FOR_SIBCALL ix86_function_ok_for_sibcall
Index: haifa-sched.c
===================================================================
--- haifa-sched.c       (revision 201771)
+++ haifa-sched.c       (working copy)
@@ -6511,6 +6511,49 @@ setup_sched_dump (void)
                ? stderr : dump_file);
 }

+static void
+try_group_insn (rtx insn)
+{
+  unsigned int condreg1, condreg2;
+  rtx cc_reg_1;
+  rtx prev;
+
+  targetm.fixed_condition_code_regs (&condreg1, &condreg2);
+  cc_reg_1 = gen_rtx_REG (CCmode, condreg1);
+  prev = prev_nonnote_nondebug_insn (insn);
+  if (!any_condjump_p (insn)
+      || !reg_referenced_p (cc_reg_1, PATTERN (insn))
+      || !prev
+      || !modified_in_p (cc_reg_1, prev))
+    return;
+
+  /* Different microarchitectures support macro fusions for different
+     combinations of insn pairs.  */
+  if (!targetm.sched.macro_fusion_pair_p
+      || !targetm.sched.macro_fusion_pair_p (prev, insn))
+    return;
+
+  SCHED_GROUP_P (insn) = 1;
+}
+
+/* If the last cond jump and the cond register defining insn are consecutive
+   before scheduling, we want them to be in a schedule group. This is good
+   for performance on microarchitectures supporting macro-fusion.  */
+
+static void
+group_insns_for_macro_fusion ()
+{
+  rtx insn;
+  basic_block bb;
+
+  FOR_EACH_BB (bb)
+    FOR_BB_INSNS (bb, insn)
+      {
+       if (INSN_P (insn))
+         try_group_insn (insn);
+      }
+}
+
 /* Initialize some global state for the scheduler.  This function works
    with the common data shared between all the schedulers.  It is called
    from the scheduler specific initialization routine.  */
@@ -6637,6 +6680,11 @@ sched_init (void)
     }

   curr_state = xmalloc (dfa_state_size);
+
+  /* Group compare and branch insns for macro-fusion.  */
+  if (targetm.sched.macro_fusion_p
+      && targetm.sched.macro_fusion_p ())
+    group_insns_for_macro_fusion ();
 }

 static void haifa_init_only_bb (basic_block, basic_block);
Index: target.def
===================================================================
--- target.def  (revision 201771)
+++ target.def  (working copy)
@@ -591,6 +591,19 @@ DEFHOOK
  "",
  int, (FILE *file, int verbose, rtx *ready, int *n_readyp, int clock), NULL)

+DEFHOOK
+(macro_fusion_p,
+ "This hook is used to check whether target platform supports macro fusion.",
+ bool, (void), NULL)
+
+DEFHOOK
+(macro_fusion_pair_p,
+ "This hook is used to check whether two insns could be macro fused for\n\
+target microarchitecture. If this hook returns true for the given insn pair\n\
+(@var{condgen} and @var{condjmp}), scheduler will put them into a sched\n\
+group, and they will not be scheduled apart.",
+ bool, (rtx condgen, rtx condjmp), NULL)
+
 /* The following member value is a pointer to a function called
    after evaluation forward dependencies of insns in chain given
    by two parameter values (head and tail correspondingly).  */
Index: sched-rgn.c
===================================================================
--- sched-rgn.c (revision 201771)
+++ sched-rgn.c (working copy)
@@ -2437,6 +2437,8 @@ add_branch_dependences (rtx head, rtx ta
      cc0 setters remain at the end because they can't be moved away from
      their cc0 user.

+     Predecessors of SCHED_GROUP_P instructions at the end remain at the end.
+
      COND_EXEC insns cannot be moved past a branch (see e.g. PR17808).

      Insns setting TARGET_CLASS_LIKELY_SPILLED_P registers (usually return
@@ -2459,7 +2461,8 @@ add_branch_dependences (rtx head, rtx ta
 #endif
                 || (!reload_completed
                     && sets_likely_spilled (PATTERN (insn)))))
-        || NOTE_P (insn))
+        || NOTE_P (insn)
+        || (last != 0 && SCHED_GROUP_P (last)))
     {
       if (!NOTE_P (insn))
        {

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-09-11 18:18         ` Andrew Pinski
@ 2013-09-11 19:03           ` Wei Mi
  0 siblings, 0 replies; 59+ messages in thread
From: Wei Mi @ 2013-09-11 19:03 UTC (permalink / raw)
  To: Andrew Pinski; +Cc: Alexander Monakov, Steven Bosscher, GCC Patches, David Li

Taking the same issue slot is not enough for x86. The compare and
branch need to be consecutive in binary to be macro-fused on x86.

Thanks,
Wei Mi.

On Wed, Sep 11, 2013 at 10:45 AM, Andrew Pinski <pinskia@gmail.com> wrote:
> On Wed, Sep 4, 2013 at 12:33 PM, Alexander Monakov <amonakov@ispras.ru> wrote:
>> On Wed, Sep 4, 2013 at 9:53 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
>>>
>>> On Wed, Sep 4, 2013 at 10:58 AM, Alexander Monakov wrote:
>>> > Hello,
>>> >
>>> > Could you use the existing facilities instead, such as adjust_priority hook,
>>> > or making the compare-branch insn sequence a SCHED_GROUP?
>>>
>>>
>>> Or a define_bypass?
>>
>> Hm, I don't think define_bypass would work: it still leaves the
>> scheduler freedom to move the compare up.
>
> Even though it allows the scheduler freedom to move the compare up,
> the schedule does due to the schedule model not being correct for the
> processor.  I have done the same for Octeon2 where it is able to
> combine the compare and the branch and found the resulting schedule is
> much better than even what this hack could do due to the instructions
> still take a issue slot.  Is it true that for these two processors it
> takes an issue slot or is it being done before issue?
>
> Thanks,
> Andrew Pinski
>
>>
>> IMO adjust_priority would be preferable if it allows to achieve the goal.
>>
>> Alexander

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-09-04 19:33       ` Alexander Monakov
  2013-09-04 23:32         ` Wei Mi
@ 2013-09-11 18:18         ` Andrew Pinski
  2013-09-11 19:03           ` Wei Mi
  1 sibling, 1 reply; 59+ messages in thread
From: Andrew Pinski @ 2013-09-11 18:18 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: Steven Bosscher, Wei Mi, GCC Patches, David Li

On Wed, Sep 4, 2013 at 12:33 PM, Alexander Monakov <amonakov@ispras.ru> wrote:
> On Wed, Sep 4, 2013 at 9:53 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
>>
>> On Wed, Sep 4, 2013 at 10:58 AM, Alexander Monakov wrote:
>> > Hello,
>> >
>> > Could you use the existing facilities instead, such as adjust_priority hook,
>> > or making the compare-branch insn sequence a SCHED_GROUP?
>>
>>
>> Or a define_bypass?
>
> Hm, I don't think define_bypass would work: it still leaves the
> scheduler freedom to move the compare up.

Even though it allows the scheduler freedom to move the compare up,
the schedule does due to the schedule model not being correct for the
processor.  I have done the same for Octeon2 where it is able to
combine the compare and the branch and found the resulting schedule is
much better than even what this hack could do due to the instructions
still take a issue slot.  Is it true that for these two processors it
takes an issue slot or is it being done before issue?

Thanks,
Andrew Pinski

>
> IMO adjust_priority would be preferable if it allows to achieve the goal.
>
> Alexander

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-09-11 16:43                   ` Wei Mi
@ 2013-09-11 17:07                     ` Alexander Monakov
  2013-09-12  7:56                       ` Wei Mi
  0 siblings, 1 reply; 59+ messages in thread
From: Alexander Monakov @ 2013-09-11 17:07 UTC (permalink / raw)
  To: Wei Mi; +Cc: Steven Bosscher, GCC Patches, David Li



On Wed, 11 Sep 2013, Wei Mi wrote:

> I tried that and it caused some regressions, so I choosed to do
> chain_to_prev_insn another time in add_branch_dependences. There could
> be some dependence between those two functions.

(please don't top-post on this list)

In that case you can adjust 'last' in add_branch_dependences so that the
dependences pin the compare rather than the jump to the end, like this
(untested):

diff --git a/gcc/sched-rgn.c b/gcc/sched-rgn.c
index 2c971e2..a774d5d 100644
--- a/gcc/sched-rgn.c
+++ b/gcc/sched-rgn.c
@@ -2443,6 +2443,9 @@ add_branch_dependences (rtx head, rtx tail)
      cc0 setters remain at the end because they can't be moved away from
      their cc0 user.
 
+     Predecessors of SCHED_GROUP_P instructions that remain at the end also
+     remain at the end.
+
      COND_EXEC insns cannot be moved past a branch (see e.g. PR17808).
 
      Insns setting TARGET_CLASS_LIKELY_SPILLED_P registers (usually return
@@ -2465,6 +2468,7 @@ add_branch_dependences (rtx head, rtx tail)
 #endif
 		 || (!reload_completed
 		     && sets_likely_spilled (PATTERN (insn)))))
+	 || (last != 0 && SCHED_GROUP_P (last))
 	 || NOTE_P (insn))
     {
       if (!NOTE_P (insn))

I'm also not a fan of adding two scheduler hooks and explicit handling in
sched-deps.c for this feature.  You probably could handle that with sched_init
hook entirely in the x86 backend (just loop over basic blocks and mark
suitable jumps with SCHED_GROUP_P), but on the other hand I can see an
argument that this might be useful in the future for other architectures.
Have you considered that?  What do other maintainers say?

Thanks.

Alexander

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-09-11 10:36                 ` Alexander Monakov
@ 2013-09-11 16:43                   ` Wei Mi
  2013-09-11 17:07                     ` Alexander Monakov
  0 siblings, 1 reply; 59+ messages in thread
From: Wei Mi @ 2013-09-11 16:43 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: Steven Bosscher, GCC Patches, David Li

I tried that and it caused some regressions, so I choosed to do
chain_to_prev_insn another time in add_branch_dependences. There could
be some dependence between those two functions.

On Wed, Sep 11, 2013 at 2:58 AM, Alexander Monakov <amonakov@ispras.ru> wrote:
>
>
> On Tue, 10 Sep 2013, Wei Mi wrote:
>
>> Because deps_analyze_insn only analyzes data deps but no control deps.
>> Control deps are included by add_branch_dependences. Without the
>> chain_to_prev_insn in the end of add_branch_dependences, jmp will be
>> control dependent on every previous insn in the same bb, and the cmp
>> and jmp group could still be scheduled apart since they will not be
>> put in ready list at the same time.
>
> Would calling add_branch_dependences before sched_analyze solve that, then?
>
> Alexander

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-09-10 16:14               ` Wei Mi
@ 2013-09-11 10:36                 ` Alexander Monakov
  2013-09-11 16:43                   ` Wei Mi
  0 siblings, 1 reply; 59+ messages in thread
From: Alexander Monakov @ 2013-09-11 10:36 UTC (permalink / raw)
  To: Wei Mi; +Cc: Steven Bosscher, GCC Patches, David Li



On Tue, 10 Sep 2013, Wei Mi wrote:

> Because deps_analyze_insn only analyzes data deps but no control deps.
> Control deps are included by add_branch_dependences. Without the
> chain_to_prev_insn in the end of add_branch_dependences, jmp will be
> control dependent on every previous insn in the same bb, and the cmp
> and jmp group could still be scheduled apart since they will not be
> put in ready list at the same time.

Would calling add_branch_dependences before sched_analyze solve that, then?

Alexander

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-09-10 12:30             ` Alexander Monakov
@ 2013-09-10 16:14               ` Wei Mi
  2013-09-11 10:36                 ` Alexander Monakov
  0 siblings, 1 reply; 59+ messages in thread
From: Wei Mi @ 2013-09-10 16:14 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: Steven Bosscher, GCC Patches, David Li

Because deps_analyze_insn only analyzes data deps but no control deps.
Control deps are included by add_branch_dependences. Without the
chain_to_prev_insn in the end of add_branch_dependences, jmp will be
control dependent on every previous insn in the same bb, and the cmp
and jmp group could still be scheduled apart since they will not be
put in ready list at the same time.


On Tue, Sep 10, 2013 at 4:44 AM, Alexander Monakov <amonakov@ispras.ru> wrote:
>
>
> On Fri, 6 Sep 2013, Wei Mi wrote:
>
>> SCHED_GROUP works after I add chain_to_prev_insn after
>> add_branch_dependences, in order to chain control dependences to prev
>> insn for sched group.
>
> chain_to_prev_insn is done in the end of deps_analyze_insn, why is that not
> sufficient?
>
> Alexander

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-09-06 17:39           ` Wei Mi
  2013-09-09 18:02             ` Wei Mi
@ 2013-09-10 12:30             ` Alexander Monakov
  2013-09-10 16:14               ` Wei Mi
  1 sibling, 1 reply; 59+ messages in thread
From: Alexander Monakov @ 2013-09-10 12:30 UTC (permalink / raw)
  To: Wei Mi; +Cc: Steven Bosscher, GCC Patches, David Li



On Fri, 6 Sep 2013, Wei Mi wrote:

> SCHED_GROUP works after I add chain_to_prev_insn after
> add_branch_dependences, in order to chain control dependences to prev
> insn for sched group.

chain_to_prev_insn is done in the end of deps_analyze_insn, why is that not
sufficient?

Alexander

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-09-06 17:39           ` Wei Mi
@ 2013-09-09 18:02             ` Wei Mi
  2013-09-10 12:30             ` Alexander Monakov
  1 sibling, 0 replies; 59+ messages in thread
From: Wei Mi @ 2013-09-09 18:02 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: Steven Bosscher, GCC Patches, David Li

Add a testcase. bootstrap and regression ok for the patch in last mail.

2013-09-09  Wei Mi  <wmi@google.com>

        * gcc/testsuite/gcc.dg/macro-fusion-1.c: New.

Index: gcc/testsuite/gcc.dg/macro-fusion-1.c
===================================================================
--- gcc/testsuite/gcc.dg/macro-fusion-1.c       (revision 0)
+++ gcc/testsuite/gcc.dg/macro-fusion-1.c       (revision 0)
@@ -0,0 +1,14 @@
+/* { dg-do compile { target i?86-*-* x86_64-*-* } } */
+/* { dg-options "-O2 -mtune=corei7 -fdump-rtl-sched2" } */
+/* { dg-final { scan-rtl-dump-not
"compare.*insn.*jump_insn.*jump_insn" "sched2" } } */
+
+int a[100];
+
+double bar (double sum)
+{
+  int i;
+  for (i = 0; i < 1000000; i++)
+   sum += (0.5 + (a[i%100] - 128));
+  return sum;
+}
+

On Fri, Sep 6, 2013 at 10:39 AM, Wei Mi <wmi@google.com> wrote:
> SCHED_GROUP works after I add chain_to_prev_insn after
> add_branch_dependences, in order to chain control dependences to prev
> insn for sched group. Here is the new patch. Testing is going on.
>
> Thanks,
> Wei Mi.
>
> 2013-09-06  Wei Mi  <wmi@google.com>
>
>         * config/i386/i386.c (ix86_macro_fusion_p): New function.
>         (ix86_macro_fusion_pair_p): Ditto.
>         * config/i386/x86-tune.def (DEF_TUNE): Add m_COREI7 for
>         X86_TUNE_FUSE_CMP_AND_BRANCH.
>         * sched-deps.c (group_insns_for_macro_fusion): New function.
>         (sched_analyze_insn): Call group_insns_for_macro_fusion.
>         (chain_to_prev_insn): Change it from static to extern.
>         (chain_to_prev_insn_p): Ditto.
>         * doc/tm.texi: Generated.
>         * doc/tm.texi.in: Ditto.
>         * sched-int.h: New declarations.
>         * sched-rgn.c (add_branch_dependences): Chain control
>         dependences to prev insn for sched group.
>         * target.def: Add macro_fusion_p and macro_fusion_pair_p.
>
> Index: config/i386/i386.c
> ===================================================================
> --- config/i386/i386.c  (revision 201963)
> +++ config/i386/i386.c  (working copy)
> @@ -24850,6 +24850,99 @@ ia32_multipass_dfa_lookahead (void)
>      }
>  }
>
> +/* Return true if target platform supports macro-fusion.  */
> +
> +static bool
> +ix86_macro_fusion_p ()
> +{
> +  if (TARGET_FUSE_CMP_AND_BRANCH)
> +    return true;
> +  else
> +    return false;
> +}
> +
> +/* Check whether current microarchitecture support macro fusion
> +   for insn pair "CONDGEN + CONDJMP". Refer to
> +   "Intel Architectures Optimization Reference Manual". */
> +
> +static bool
> +ix86_macro_fusion_pair_p (rtx condgen, rtx condjmp)
> +{
> +  rtx src;
> +  if (!strcmp (ix86_tune_string, "corei7"))
> +    {
> +      /* For Nehalem.  */
> +      rtx single_set = single_set (condgen);
> +      /* Nehalem doesn't support macro-fusion for add/sub+jmp.  */
> +      if (single_set == NULL_RTX)
> +        return false;
> +
> +      src = SET_SRC (single_set);
> +      if (GET_CODE (src) != COMPARE)
> +       return false;
> +
> +      /* Nehalem doesn't support macro-fusion for cmp/test MEM-IMM
> +        insn pattern.  */
> +      if ((MEM_P (XEXP (src, 0))
> +          && CONST_INT_P (XEXP (src, 1)))
> +         || (MEM_P (XEXP (src, 1))
> +             && CONST_INT_P (XEXP (src, 0))))
> +       return false;
> +
> +      /* Nehalem doesn't support macro-fusion for add/sub/dec/inc + jmp.  */
> +      if (get_attr_type (condgen) != TYPE_TEST
> +         && get_attr_type (condgen) != TYPE_ICMP)
> +       return false;
> +      return true;
> +    }
> +  else if (!strcmp (ix86_tune_string, "corei7-avx"))
> +    {
> +      /* For Sandybridge.  */
> +      enum rtx_code ccode;
> +      rtx compare_set = NULL_RTX, test_if, cond;
> +      rtx single_set = single_set (condgen);
> +      if (single_set != NULL_RTX)
> +        compare_set = single_set;
> +      else
> +       {
> +         int i;
> +         rtx pat = PATTERN (condgen);
> +         for (i = 0; i < XVECLEN (pat, 0); i++)
> +           if (GET_CODE (XVECEXP (pat, 0, i)) == SET
> +               && GET_CODE (SET_SRC (XVECEXP (pat, 0, i))) == COMPARE)
> +             compare_set = XVECEXP (pat, 0, i);
> +       }
> +
> +      if (compare_set == NULL_RTX)
> +       return false;
> +      src = SET_SRC (compare_set);
> +      if (GET_CODE (src) != COMPARE)
> +       return false;
> +
> +      /* Sandybridge doesn't support macro-fusion for cmp/test MEM-IMM
> +        insn pattern.  */
> +      if ((MEM_P (XEXP (src, 0))
> +           && CONST_INT_P (XEXP (src, 1)))
> +          || (MEM_P (XEXP (src, 1))
> +              && CONST_INT_P (XEXP (src, 0))))
> +        return false;
> +
> +      /* Sandybridge doesn't support macro-fusion for inc/dec +
> +        unsigned comparison jmp.  */
> +      test_if = SET_SRC (pc_set (condjmp));
> +      cond = XEXP (test_if, 0);
> +      ccode = GET_CODE (cond);
> +      if (get_attr_type (condgen) == TYPE_INCDEC
> +         && (ccode == GEU
> +             || ccode == GTU
> +             || ccode == LEU
> +             || ccode == LTU))
> +       return false;
> +      return true;
> +    }
> +  return false;
> +}
> +
>  /* Try to reorder ready list to take advantage of Atom pipelined IMUL
>     execution. It is applied if
>     (1) IMUL instruction is on the top of list;
> @@ -42982,6 +43075,10 @@ ix86_memmodel_check (unsigned HOST_WIDE_
>  #undef TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD
>  #define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD \
>    ia32_multipass_dfa_lookahead
> +#undef TARGET_SCHED_MACRO_FUSION_P
> +#define TARGET_SCHED_MACRO_FUSION_P ix86_macro_fusion_p
> +#undef TARGET_SCHED_MACRO_FUSION_PAIR_P
> +#define TARGET_SCHED_MACRO_FUSION_PAIR_P ix86_macro_fusion_pair_p
>
>  #undef TARGET_FUNCTION_OK_FOR_SIBCALL
>  #define TARGET_FUNCTION_OK_FOR_SIBCALL ix86_function_ok_for_sibcall
> Index: config/i386/x86-tune.def
> ===================================================================
> --- config/i386/x86-tune.def    (revision 201963)
> +++ config/i386/x86-tune.def    (working copy)
> @@ -196,7 +196,8 @@ DEF_TUNE (X86_TUNE_USE_VECTOR_CONVERTS,
>  /* X86_TUNE_FUSE_CMP_AND_BRANCH: Fuse a compare or test instruction
>     with a subsequent conditional jump instruction into a single
>     compare-and-branch uop.  */
> -DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH, "fuse_cmp_and_branch", m_BDVER)
> +DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH, "fuse_cmp_and_branch",
> +          m_COREI7 | m_BDVER)
>  /* X86_TUNE_OPT_AGU: Optimize for Address Generation Unit. This flag
>     will impact LEA instruction selection. */
>  DEF_TUNE (X86_TUNE_OPT_AGU, "opt_agu", m_ATOM | m_SLM)
> Index: sched-deps.c
> ===================================================================
> --- sched-deps.c        (revision 201963)
> +++ sched-deps.c        (working copy)
> @@ -487,7 +487,6 @@ static void add_dependence_list (rtx, rt
>  static void add_dependence_list_and_free (struct deps_desc *, rtx,
>                                           rtx *, int, enum reg_note, bool);
>  static void delete_all_dependences (rtx);
> -static void chain_to_prev_insn (rtx);
>
>  static void flush_pending_lists (struct deps_desc *, rtx, int, int);
>  static void sched_analyze_1 (struct deps_desc *, rtx, rtx);
> @@ -1660,7 +1659,7 @@ delete_all_dependences (rtx insn)
>     chains backwards. Then we add the dependencies for the group to
>     the previous nonnote insn.  */
>
> -static void
> +void
>  chain_to_prev_insn (rtx insn)
>  {
>    sd_iterator_def sd_it;
> @@ -2821,6 +2820,35 @@ sched_analyze_2 (struct deps_desc *deps,
>      sched_deps_info->finish_rhs ();
>  }
>
> +/* If the last cond jump and the cond register defining insn are consecutive
> +   before scheduling, we want them to be in a schedule group. This is good
> +   for performance on microarchitectures supporting macro-fusion.  */
> +
> +static void
> +group_insns_for_macro_fusion (rtx insn)
> +{
> +  unsigned int condreg1, condreg2;
> +  rtx cc_reg_1;
> +  rtx prev;
> +
> +  targetm.fixed_condition_code_regs (&condreg1, &condreg2);
> +  cc_reg_1 = gen_rtx_REG (CCmode, condreg1);
> +  prev = prev_nonnote_nondebug_insn (insn);
> +  if (!any_condjump_p (insn)
> +      || !reg_referenced_p (cc_reg_1, PATTERN (insn))
> +      || !prev
> +      || !modified_in_p (cc_reg_1, prev))
> +    return;
> +
> +  /* Different microarchitectures support macro fusions for different
> +     combinations of insn pairs.  */
> +  if (!targetm.sched.macro_fusion_pair_p
> +      || !targetm.sched.macro_fusion_pair_p (prev, insn))
> +    return;
> +
> +  SCHED_GROUP_P (insn) = 1;
> +}
> +
>  /* Analyze an INSN with pattern X to find all dependencies.  */
>  static void
>  sched_analyze_insn (struct deps_desc *deps, rtx x, rtx insn)
> @@ -2844,6 +2872,10 @@ sched_analyze_insn (struct deps_desc *de
>    can_start_lhs_rhs_p = (NONJUMP_INSN_P (insn)
>                          && code == SET);
>
> +  if (targetm.sched.macro_fusion_p
> +      && targetm.sched.macro_fusion_p ())
> +    group_insns_for_macro_fusion (insn);
> +
>    if (may_trap_p (x))
>      /* Avoid moving trapping instructions across function calls that might
>         not always return.  */
> @@ -3504,7 +3536,7 @@ call_may_noreturn_p (rtx insn)
>     group, and if all INSN's dependencies should be moved to the first
>     instruction of that group.  */
>
> -static bool
> +bool
>  chain_to_prev_insn_p (rtx insn)
>  {
>    rtx prev, x;
> Index: doc/tm.texi
> ===================================================================
> --- doc/tm.texi (revision 201963)
> +++ doc/tm.texi (working copy)
> @@ -6553,6 +6553,17 @@ scheduling one insn causes other insns t
>  cycle.  These other insns can then be taken into account properly.
>  @end deftypefn
>
> +@deftypefn {Target Hook} bool TARGET_SCHED_MACRO_FUSION_P (void)
> +This hook is used to check whether target platform supports macro fusion.
> +@end deftypefn
> +
> +@deftypefn {Target Hook} bool TARGET_SCHED_MACRO_FUSION_PAIR_P (rtx
> @var{condgen}, rtx @var{condjmp})
> +This hook is used to check whether two insns could be macro fused for
> +target microarchitecture. If this hook returns true for the given insn pair
> +(@var{condgen} and @var{condjmp}), scheduler will put them into a sched
> +group, and they will not be scheduled apart.
> +@end deftypefn
> +
>  @deftypefn {Target Hook} void
> TARGET_SCHED_DEPENDENCIES_EVALUATION_HOOK (rtx @var{head}, rtx
> @var{tail})
>  This hook is called after evaluation forward dependencies of insns in
>  chain given by two parameter values (@var{head} and @var{tail}
> Index: doc/tm.texi.in
> ===================================================================
> --- doc/tm.texi.in      (revision 201963)
> +++ doc/tm.texi.in      (working copy)
> @@ -4940,6 +4940,10 @@ them: try the first ones in this list fi
>
>  @hook TARGET_SCHED_REORDER2
>
> +@hook TARGET_SCHED_MACRO_FUSION_P
> +
> +@hook TARGET_SCHED_MACRO_FUSION_PAIR_P
> +
>  @hook TARGET_SCHED_DEPENDENCIES_EVALUATION_HOOK
>
>  @hook TARGET_SCHED_INIT
> Index: sched-int.h
> ===================================================================
> --- sched-int.h (revision 201963)
> +++ sched-int.h (working copy)
> @@ -1302,6 +1302,8 @@ extern void finish_deps_global (void);
>  extern void deps_analyze_insn (struct deps_desc *, rtx);
>  extern void remove_from_deps (struct deps_desc *, rtx);
>  extern void init_insn_reg_pressure_info (rtx);
> +extern bool chain_to_prev_insn_p (rtx insn);
> +extern void chain_to_prev_insn (rtx);
>
>  extern dw_t get_dep_weak (ds_t, ds_t);
>  extern ds_t set_dep_weak (ds_t, ds_t, dw_t);
> Index: sched-rgn.c
> ===================================================================
> --- sched-rgn.c (revision 201963)
> +++ sched-rgn.c (working copy)
> @@ -2507,7 +2507,7 @@ add_branch_dependences (rtx head, rtx ta
>        }
>
>    if (!targetm.have_conditional_execution ())
> -    return;
> +    goto chain_to_prev_insn;
>
>    /* Finally, if the block ends in a jump, and we are doing intra-block
>       scheduling, make sure that the branch depends on any COND_EXEC insns
> @@ -2543,7 +2543,7 @@ add_branch_dependences (rtx head, rtx ta
>       could remove always-true predicates.  */
>
>    if (!reload_completed || ! (JUMP_P (tail) || JUMP_TABLE_DATA_P (tail)))
> -    return;
> +    goto chain_to_prev_insn;
>
>    insn = tail;
>    while (insn != head)
> @@ -2557,6 +2557,23 @@ add_branch_dependences (rtx head, rtx ta
>        if (INSN_P (insn) && GET_CODE (PATTERN (insn)) == COND_EXEC)
>         add_dependence (tail, insn, REG_DEP_ANTI);
>      }
> +
> + chain_to_prev_insn:
> +  /* Control dependences also need to be chained to the prev insn
> +     for sched group.  */
> +  insn = tail;
> +  while (insn != head)
> +    {
> +      /* Fixup the dependencies in the sched group.  */
> +      if (JUMP_P (insn)
> +         && chain_to_prev_insn_p (insn)
> +         && !sel_sched_p ())
> +       chain_to_prev_insn (insn);
> +
> +      insn = PREV_INSN (insn);
> +    }
> +
> +  return;
>  }
>
>  /* Data structures for the computation of data dependences in a regions.  We
> Index: target.def
> ===================================================================
> --- target.def  (revision 201963)
> +++ target.def  (working copy)
> @@ -1041,6 +1041,19 @@ scheduling one insn causes other insns t
>  cycle.  These other insns can then be taken into account properly.",
>   int, (FILE *file, int verbose, rtx *ready, int *n_readyp, int clock), NULL)
>
> +DEFHOOK
> +(macro_fusion_p,
> + "This hook is used to check whether target platform supports macro fusion.",
> + bool, (void), NULL)
> +
> +DEFHOOK
> +(macro_fusion_pair_p,
> + "This hook is used to check whether two insns could be macro fused for\n\
> +target microarchitecture. If this hook returns true for the given insn pair\n\
> +(@var{condgen} and @var{condjmp}), scheduler will put them into a sched\n\
> +group, and they will not be scheduled apart.",
> + bool, (rtx condgen, rtx condjmp), NULL)
> +
>  /* The following member value is a pointer to a function called
>     after evaluation forward dependencies of insns in chain given
>     by two parameter values (head and tail correspondingly).  */

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-09-04 23:32         ` Wei Mi
@ 2013-09-06 17:39           ` Wei Mi
  2013-09-09 18:02             ` Wei Mi
  2013-09-10 12:30             ` Alexander Monakov
  0 siblings, 2 replies; 59+ messages in thread
From: Wei Mi @ 2013-09-06 17:39 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: Steven Bosscher, GCC Patches, David Li

SCHED_GROUP works after I add chain_to_prev_insn after
add_branch_dependences, in order to chain control dependences to prev
insn for sched group. Here is the new patch. Testing is going on.

Thanks,
Wei Mi.

2013-09-06  Wei Mi  <wmi@google.com>

        * config/i386/i386.c (ix86_macro_fusion_p): New function.
        (ix86_macro_fusion_pair_p): Ditto.
        * config/i386/x86-tune.def (DEF_TUNE): Add m_COREI7 for
        X86_TUNE_FUSE_CMP_AND_BRANCH.
        * sched-deps.c (group_insns_for_macro_fusion): New function.
        (sched_analyze_insn): Call group_insns_for_macro_fusion.
        (chain_to_prev_insn): Change it from static to extern.
        (chain_to_prev_insn_p): Ditto.
        * doc/tm.texi: Generated.
        * doc/tm.texi.in: Ditto.
        * sched-int.h: New declarations.
        * sched-rgn.c (add_branch_dependences): Chain control
        dependences to prev insn for sched group.
        * target.def: Add macro_fusion_p and macro_fusion_pair_p.

Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c  (revision 201963)
+++ config/i386/i386.c  (working copy)
@@ -24850,6 +24850,99 @@ ia32_multipass_dfa_lookahead (void)
     }
 }

+/* Return true if target platform supports macro-fusion.  */
+
+static bool
+ix86_macro_fusion_p ()
+{
+  if (TARGET_FUSE_CMP_AND_BRANCH)
+    return true;
+  else
+    return false;
+}
+
+/* Check whether current microarchitecture support macro fusion
+   for insn pair "CONDGEN + CONDJMP". Refer to
+   "Intel Architectures Optimization Reference Manual". */
+
+static bool
+ix86_macro_fusion_pair_p (rtx condgen, rtx condjmp)
+{
+  rtx src;
+  if (!strcmp (ix86_tune_string, "corei7"))
+    {
+      /* For Nehalem.  */
+      rtx single_set = single_set (condgen);
+      /* Nehalem doesn't support macro-fusion for add/sub+jmp.  */
+      if (single_set == NULL_RTX)
+        return false;
+
+      src = SET_SRC (single_set);
+      if (GET_CODE (src) != COMPARE)
+       return false;
+
+      /* Nehalem doesn't support macro-fusion for cmp/test MEM-IMM
+        insn pattern.  */
+      if ((MEM_P (XEXP (src, 0))
+          && CONST_INT_P (XEXP (src, 1)))
+         || (MEM_P (XEXP (src, 1))
+             && CONST_INT_P (XEXP (src, 0))))
+       return false;
+
+      /* Nehalem doesn't support macro-fusion for add/sub/dec/inc + jmp.  */
+      if (get_attr_type (condgen) != TYPE_TEST
+         && get_attr_type (condgen) != TYPE_ICMP)
+       return false;
+      return true;
+    }
+  else if (!strcmp (ix86_tune_string, "corei7-avx"))
+    {
+      /* For Sandybridge.  */
+      enum rtx_code ccode;
+      rtx compare_set = NULL_RTX, test_if, cond;
+      rtx single_set = single_set (condgen);
+      if (single_set != NULL_RTX)
+        compare_set = single_set;
+      else
+       {
+         int i;
+         rtx pat = PATTERN (condgen);
+         for (i = 0; i < XVECLEN (pat, 0); i++)
+           if (GET_CODE (XVECEXP (pat, 0, i)) == SET
+               && GET_CODE (SET_SRC (XVECEXP (pat, 0, i))) == COMPARE)
+             compare_set = XVECEXP (pat, 0, i);
+       }
+
+      if (compare_set == NULL_RTX)
+       return false;
+      src = SET_SRC (compare_set);
+      if (GET_CODE (src) != COMPARE)
+       return false;
+
+      /* Sandybridge doesn't support macro-fusion for cmp/test MEM-IMM
+        insn pattern.  */
+      if ((MEM_P (XEXP (src, 0))
+           && CONST_INT_P (XEXP (src, 1)))
+          || (MEM_P (XEXP (src, 1))
+              && CONST_INT_P (XEXP (src, 0))))
+        return false;
+
+      /* Sandybridge doesn't support macro-fusion for inc/dec +
+        unsigned comparison jmp.  */
+      test_if = SET_SRC (pc_set (condjmp));
+      cond = XEXP (test_if, 0);
+      ccode = GET_CODE (cond);
+      if (get_attr_type (condgen) == TYPE_INCDEC
+         && (ccode == GEU
+             || ccode == GTU
+             || ccode == LEU
+             || ccode == LTU))
+       return false;
+      return true;
+    }
+  return false;
+}
+
 /* Try to reorder ready list to take advantage of Atom pipelined IMUL
    execution. It is applied if
    (1) IMUL instruction is on the top of list;
@@ -42982,6 +43075,10 @@ ix86_memmodel_check (unsigned HOST_WIDE_
 #undef TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD
 #define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD \
   ia32_multipass_dfa_lookahead
+#undef TARGET_SCHED_MACRO_FUSION_P
+#define TARGET_SCHED_MACRO_FUSION_P ix86_macro_fusion_p
+#undef TARGET_SCHED_MACRO_FUSION_PAIR_P
+#define TARGET_SCHED_MACRO_FUSION_PAIR_P ix86_macro_fusion_pair_p

 #undef TARGET_FUNCTION_OK_FOR_SIBCALL
 #define TARGET_FUNCTION_OK_FOR_SIBCALL ix86_function_ok_for_sibcall
Index: config/i386/x86-tune.def
===================================================================
--- config/i386/x86-tune.def    (revision 201963)
+++ config/i386/x86-tune.def    (working copy)
@@ -196,7 +196,8 @@ DEF_TUNE (X86_TUNE_USE_VECTOR_CONVERTS,
 /* X86_TUNE_FUSE_CMP_AND_BRANCH: Fuse a compare or test instruction
    with a subsequent conditional jump instruction into a single
    compare-and-branch uop.  */
-DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH, "fuse_cmp_and_branch", m_BDVER)
+DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH, "fuse_cmp_and_branch",
+          m_COREI7 | m_BDVER)
 /* X86_TUNE_OPT_AGU: Optimize for Address Generation Unit. This flag
    will impact LEA instruction selection. */
 DEF_TUNE (X86_TUNE_OPT_AGU, "opt_agu", m_ATOM | m_SLM)
Index: sched-deps.c
===================================================================
--- sched-deps.c        (revision 201963)
+++ sched-deps.c        (working copy)
@@ -487,7 +487,6 @@ static void add_dependence_list (rtx, rt
 static void add_dependence_list_and_free (struct deps_desc *, rtx,
                                          rtx *, int, enum reg_note, bool);
 static void delete_all_dependences (rtx);
-static void chain_to_prev_insn (rtx);

 static void flush_pending_lists (struct deps_desc *, rtx, int, int);
 static void sched_analyze_1 (struct deps_desc *, rtx, rtx);
@@ -1660,7 +1659,7 @@ delete_all_dependences (rtx insn)
    chains backwards. Then we add the dependencies for the group to
    the previous nonnote insn.  */

-static void
+void
 chain_to_prev_insn (rtx insn)
 {
   sd_iterator_def sd_it;
@@ -2821,6 +2820,35 @@ sched_analyze_2 (struct deps_desc *deps,
     sched_deps_info->finish_rhs ();
 }

+/* If the last cond jump and the cond register defining insn are consecutive
+   before scheduling, we want them to be in a schedule group. This is good
+   for performance on microarchitectures supporting macro-fusion.  */
+
+static void
+group_insns_for_macro_fusion (rtx insn)
+{
+  unsigned int condreg1, condreg2;
+  rtx cc_reg_1;
+  rtx prev;
+
+  targetm.fixed_condition_code_regs (&condreg1, &condreg2);
+  cc_reg_1 = gen_rtx_REG (CCmode, condreg1);
+  prev = prev_nonnote_nondebug_insn (insn);
+  if (!any_condjump_p (insn)
+      || !reg_referenced_p (cc_reg_1, PATTERN (insn))
+      || !prev
+      || !modified_in_p (cc_reg_1, prev))
+    return;
+
+  /* Different microarchitectures support macro fusions for different
+     combinations of insn pairs.  */
+  if (!targetm.sched.macro_fusion_pair_p
+      || !targetm.sched.macro_fusion_pair_p (prev, insn))
+    return;
+
+  SCHED_GROUP_P (insn) = 1;
+}
+
 /* Analyze an INSN with pattern X to find all dependencies.  */
 static void
 sched_analyze_insn (struct deps_desc *deps, rtx x, rtx insn)
@@ -2844,6 +2872,10 @@ sched_analyze_insn (struct deps_desc *de
   can_start_lhs_rhs_p = (NONJUMP_INSN_P (insn)
                         && code == SET);

+  if (targetm.sched.macro_fusion_p
+      && targetm.sched.macro_fusion_p ())
+    group_insns_for_macro_fusion (insn);
+
   if (may_trap_p (x))
     /* Avoid moving trapping instructions across function calls that might
        not always return.  */
@@ -3504,7 +3536,7 @@ call_may_noreturn_p (rtx insn)
    group, and if all INSN's dependencies should be moved to the first
    instruction of that group.  */

-static bool
+bool
 chain_to_prev_insn_p (rtx insn)
 {
   rtx prev, x;
Index: doc/tm.texi
===================================================================
--- doc/tm.texi (revision 201963)
+++ doc/tm.texi (working copy)
@@ -6553,6 +6553,17 @@ scheduling one insn causes other insns t
 cycle.  These other insns can then be taken into account properly.
 @end deftypefn

+@deftypefn {Target Hook} bool TARGET_SCHED_MACRO_FUSION_P (void)
+This hook is used to check whether target platform supports macro fusion.
+@end deftypefn
+
+@deftypefn {Target Hook} bool TARGET_SCHED_MACRO_FUSION_PAIR_P (rtx
@var{condgen}, rtx @var{condjmp})
+This hook is used to check whether two insns could be macro fused for
+target microarchitecture. If this hook returns true for the given insn pair
+(@var{condgen} and @var{condjmp}), scheduler will put them into a sched
+group, and they will not be scheduled apart.
+@end deftypefn
+
 @deftypefn {Target Hook} void
TARGET_SCHED_DEPENDENCIES_EVALUATION_HOOK (rtx @var{head}, rtx
@var{tail})
 This hook is called after evaluation forward dependencies of insns in
 chain given by two parameter values (@var{head} and @var{tail}
Index: doc/tm.texi.in
===================================================================
--- doc/tm.texi.in      (revision 201963)
+++ doc/tm.texi.in      (working copy)
@@ -4940,6 +4940,10 @@ them: try the first ones in this list fi

 @hook TARGET_SCHED_REORDER2

+@hook TARGET_SCHED_MACRO_FUSION_P
+
+@hook TARGET_SCHED_MACRO_FUSION_PAIR_P
+
 @hook TARGET_SCHED_DEPENDENCIES_EVALUATION_HOOK

 @hook TARGET_SCHED_INIT
Index: sched-int.h
===================================================================
--- sched-int.h (revision 201963)
+++ sched-int.h (working copy)
@@ -1302,6 +1302,8 @@ extern void finish_deps_global (void);
 extern void deps_analyze_insn (struct deps_desc *, rtx);
 extern void remove_from_deps (struct deps_desc *, rtx);
 extern void init_insn_reg_pressure_info (rtx);
+extern bool chain_to_prev_insn_p (rtx insn);
+extern void chain_to_prev_insn (rtx);

 extern dw_t get_dep_weak (ds_t, ds_t);
 extern ds_t set_dep_weak (ds_t, ds_t, dw_t);
Index: sched-rgn.c
===================================================================
--- sched-rgn.c (revision 201963)
+++ sched-rgn.c (working copy)
@@ -2507,7 +2507,7 @@ add_branch_dependences (rtx head, rtx ta
       }

   if (!targetm.have_conditional_execution ())
-    return;
+    goto chain_to_prev_insn;

   /* Finally, if the block ends in a jump, and we are doing intra-block
      scheduling, make sure that the branch depends on any COND_EXEC insns
@@ -2543,7 +2543,7 @@ add_branch_dependences (rtx head, rtx ta
      could remove always-true predicates.  */

   if (!reload_completed || ! (JUMP_P (tail) || JUMP_TABLE_DATA_P (tail)))
-    return;
+    goto chain_to_prev_insn;

   insn = tail;
   while (insn != head)
@@ -2557,6 +2557,23 @@ add_branch_dependences (rtx head, rtx ta
       if (INSN_P (insn) && GET_CODE (PATTERN (insn)) == COND_EXEC)
        add_dependence (tail, insn, REG_DEP_ANTI);
     }
+
+ chain_to_prev_insn:
+  /* Control dependences also need to be chained to the prev insn
+     for sched group.  */
+  insn = tail;
+  while (insn != head)
+    {
+      /* Fixup the dependencies in the sched group.  */
+      if (JUMP_P (insn)
+         && chain_to_prev_insn_p (insn)
+         && !sel_sched_p ())
+       chain_to_prev_insn (insn);
+
+      insn = PREV_INSN (insn);
+    }
+
+  return;
 }

 /* Data structures for the computation of data dependences in a regions.  We
Index: target.def
===================================================================
--- target.def  (revision 201963)
+++ target.def  (working copy)
@@ -1041,6 +1041,19 @@ scheduling one insn causes other insns t
 cycle.  These other insns can then be taken into account properly.",
  int, (FILE *file, int verbose, rtx *ready, int *n_readyp, int clock), NULL)

+DEFHOOK
+(macro_fusion_p,
+ "This hook is used to check whether target platform supports macro fusion.",
+ bool, (void), NULL)
+
+DEFHOOK
+(macro_fusion_pair_p,
+ "This hook is used to check whether two insns could be macro fused for\n\
+target microarchitecture. If this hook returns true for the given insn pair\n\
+(@var{condgen} and @var{condjmp}), scheduler will put them into a sched\n\
+group, and they will not be scheduled apart.",
+ bool, (rtx condgen, rtx condjmp), NULL)
+
 /* The following member value is a pointer to a function called
    after evaluation forward dependencies of insns in chain given
    by two parameter values (head and tail correspondingly).  */

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-09-04 19:33       ` Alexander Monakov
@ 2013-09-04 23:32         ` Wei Mi
  2013-09-06 17:39           ` Wei Mi
  2013-09-11 18:18         ` Andrew Pinski
  1 sibling, 1 reply; 59+ messages in thread
From: Wei Mi @ 2013-09-04 23:32 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: Steven Bosscher, GCC Patches, David Li

Thanks for the suggestions! I take a look at adjust_priority, and find
it may not guarantee to schedule cmp and jmp together. The priority is
used to choose a candidate from ready list. If cmp is the only insn in
ready list and there is another insn-A in queued set (insn-A's
dependence has been resolved, but it is not ready because of data
delay or resource delay), then cmp will be scheduled before insn-A no
matter what their priorities are.

I will take a look at whether SCHED_GROUP is going to work.

On Wed, Sep 4, 2013 at 12:33 PM, Alexander Monakov <amonakov@ispras.ru> wrote:
> On Wed, Sep 4, 2013 at 9:53 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
>>
>> On Wed, Sep 4, 2013 at 10:58 AM, Alexander Monakov wrote:
>> > Hello,
>> >
>> > Could you use the existing facilities instead, such as adjust_priority hook,
>> > or making the compare-branch insn sequence a SCHED_GROUP?
>>
>>
>> Or a define_bypass?
>
> Hm, I don't think define_bypass would work: it still leaves the
> scheduler freedom to move the compare up.
>
> IMO adjust_priority would be preferable if it allows to achieve the goal.
>
> Alexander

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-09-04 17:54     ` Steven Bosscher
@ 2013-09-04 19:33       ` Alexander Monakov
  2013-09-04 23:32         ` Wei Mi
  2013-09-11 18:18         ` Andrew Pinski
  0 siblings, 2 replies; 59+ messages in thread
From: Alexander Monakov @ 2013-09-04 19:33 UTC (permalink / raw)
  To: Steven Bosscher; +Cc: Wei Mi, GCC Patches, David Li

On Wed, Sep 4, 2013 at 9:53 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
>
> On Wed, Sep 4, 2013 at 10:58 AM, Alexander Monakov wrote:
> > Hello,
> >
> > Could you use the existing facilities instead, such as adjust_priority hook,
> > or making the compare-branch insn sequence a SCHED_GROUP?
>
>
> Or a define_bypass?

Hm, I don't think define_bypass would work: it still leaves the
scheduler freedom to move the compare up.

IMO adjust_priority would be preferable if it allows to achieve the goal.

Alexander

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-09-04  9:00   ` Alexander Monakov
@ 2013-09-04 17:54     ` Steven Bosscher
  2013-09-04 19:33       ` Alexander Monakov
  0 siblings, 1 reply; 59+ messages in thread
From: Steven Bosscher @ 2013-09-04 17:54 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: Wei Mi, GCC Patches, David Li

On Wed, Sep 4, 2013 at 10:58 AM, Alexander Monakov wrote:
> Hello,
>
> Could you use the existing facilities instead, such as adjust_priority hook,
> or making the compare-branch insn sequence a SCHED_GROUP?


Or a define_bypass?

Ciao!
Steven

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
  2013-09-03 22:55 ` Wei Mi
@ 2013-09-04  9:00   ` Alexander Monakov
  2013-09-04 17:54     ` Steven Bosscher
  0 siblings, 1 reply; 59+ messages in thread
From: Alexander Monakov @ 2013-09-04  9:00 UTC (permalink / raw)
  To: Wei Mi; +Cc: GCC Patches, David Li

Hello,

Could you use the existing facilities instead, such as adjust_priority hook,
or making the compare-branch insn sequence a SCHED_GROUP?

Alexander

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
       [not found] <CA+4CFy60EZ99xBGecerYLM9wD6w2v60SpJsfOGguA_PWLLupsA@mail.gmail.com>
@ 2013-09-03 22:55 ` Wei Mi
  2013-09-04  9:00   ` Alexander Monakov
  0 siblings, 1 reply; 59+ messages in thread
From: Wei Mi @ 2013-09-03 22:55 UTC (permalink / raw)
  To: GCC Patches; +Cc: David Li

This is a patch to prevent scheduler from scheduling compare and
branch away, in order to increase macro-fusion opportunity on recent
x86 platforms. It is motivated by the following small testcase.

double __attribute__ ((noinline)) bar (double sum);

int a[100];

double bar (double sum)
{
  int i;
  for (i = 0; i < 1000000; i++)
   sum += (0.5 + (a[i%100] - 128));
  return sum;
}

int main() {
  double total;
  int i;

  for (i = 0; i < 1000; i++)
    total += bar (i);

  return total != 0.333;
}

~/workarea/gcc-r201963/build/install/bin/gcc -O2 -mtune=corei7-avx 1.c -o 1.out
The binary of the kernel loop in func bar () is:

  401180:       89 c8                   mov    %ecx,%eax
  401182:       66 0f 57 c9             xorpd  %xmm1,%xmm1
  401186:       f7 ee                   imul   %esi
  401188:       89 c8                   mov    %ecx,%eax
  40118a:       c1 f8 1f                sar    $0x1f,%eax
  40118d:       c1 fa 05                sar    $0x5,%edx
  401190:       29 c2                   sub    %eax,%edx
  401192:       b8 64 00 00 00          mov    $0x64,%eax
  401197:       0f af d0                imul   %eax,%edx
  40119a:       89 c8                   mov    %ecx,%eax
  40119c:       83 c1 01                add    $0x1,%ecx
  40119f:       29 d0                   sub    %edx,%eax
  4011a1:       48 98                   cltq
  4011a3:       8b 04 85 60 51 6c 00    mov    0x6c5160(,%rax,4),%eax
  4011aa:       83 c0 80                add    $0xffffff80,%eax
  4011ad:       81 f9 40 42 0f 00       cmp    $0xf4240,%ecx
  4011b3:       f2 0f 2a c8             cvtsi2sd %eax,%xmm1
  4011b7:       f2 0f 58 ca             addsd  %xmm2,%xmm1
  4011bb:       f2 0f 58 c1             addsd  %xmm1,%xmm0
  4011bf:       75 bf                   jne    401180 <bar+0x10>

Here cmp (addr: 4011ad) and jne (addr: 4011bf) are not consecutive in
object code, but they are consecutive before sched2 pass. If we
manually keep the cmp and jne together, the performance of 1.out
changes from 2.40s to 2.31s on a sandybridge machine. Perf stat result
shows that UOPS_RETIRED.MACRO_FUSED event increases from 131,075 to
1,000,130,308, and UOPS_RETIRED.ANY event decreases from
23,002,543,637 to 22,002,511,525.

The patch is to reschedule cmp and jmp to make them consecutive. It is
done at the end of scheduling each block before schedule result is
commited. bootstrapped and regression ok on x86_64-linux-gnu. ok for
trunk?

2013-09-03  Wei Mi  <wmi@google.com>

        * haifa-sched.c (move_insns): New function.
        (adjust_for_macro_fusion): Ditto.
        (schedule_block): Call adjust_for_macro_fusion before commit schedule.
        * doc/tm.texi.in: Generated.
        * doc/tm.texi: Ditto.
        * config/i386/x86-tune.def (DEF_TUNE): Add m_COREI7 for
        X86_TUNE_FUSE_CMP_AND_BRANCH.
        * config/i386/i386.c (ix86_macro_fusion_p): New function.
        (ix86_macro_fusion_pair_p): Ditto.
        * target.def: Add macro_fusion_p and macro_fusion_pair_p in sched
        group.

Index: haifa-sched.c
===================================================================
--- haifa-sched.c       (revision 201963)
+++ haifa-sched.c       (working copy)
@@ -5605,6 +5605,56 @@ choose_ready (struct ready_list *ready,
     }
 }

+/* Move insn scheduled_insns[I] to the position J in scheduled_insns.  */
+
+static void
+move_insns (int i, int j)
+{
+  rtx insn = scheduled_insns[i];
+  scheduled_insns.ordered_remove (i);
+  scheduled_insns.safe_insert (j, insn);
+}
+
+/* If the last cond jump and the cond register setting insn are consecutive
+   before scheduling, and are scheduled away from each other, this func
+   tries to rearrange insns in scheduled_insns and keep those two insns
+   together. This is good for performance on microarchitectures supporting
+   macro-fusion.  */
+
+static void
+adjust_for_macro_fusion ()
+{
+  int i = -1, length;
+  unsigned int condreg1, condreg2;
+  rtx cc_reg_1;
+  rtx insn;
+  rtx last = scheduled_insns.last();
+
+  targetm.fixed_condition_code_regs (&condreg1, &condreg2);
+  cc_reg_1 = gen_rtx_REG (CCmode, condreg1);
+  length = scheduled_insns.length ();
+  if (any_condjump_p (last) && reg_referenced_p (cc_reg_1, PATTERN (last)))
+    {
+      for (i = length - 2; i >= 0; i--)
+       {
+         insn = scheduled_insns[i];
+         if (modified_in_p (cc_reg_1, insn))
+            break;
+       }
+    }
+  if (i < 0 || i == length - 2)
+    return;
+
+  if (NEXT_INSN (insn) != last)
+    return;
+
+  if (!targetm.sched.macro_fusion_pair_p
+      || !targetm.sched.macro_fusion_pair_p (insn, last))
+    return;
+
+  move_insns (i, length - 2);
+}
+
 /* This function is called when we have successfully scheduled a
    block.  It uses the schedule stored in the scheduled_insns vector
    to rearrange the RTL.  PREV_HEAD is used as the anchor to which we
@@ -6421,6 +6471,9 @@ schedule_block (basic_block *target_bb,

   if (success)
     {
+      if (targetm.sched.macro_fusion_p
+         && targetm.sched.macro_fusion_p ())
+       adjust_for_macro_fusion ();
       commit_schedule (prev_head, tail, target_bb);
       if (sched_verbose)
        fprintf (sched_dump, ";;   total time = %d\n", clock_var);
Index: doc/tm.texi.in
===================================================================
--- doc/tm.texi.in      (revision 201963)
+++ doc/tm.texi.in      (working copy)
@@ -4940,6 +4940,10 @@ them: try the first ones in this list fi

 @hook TARGET_SCHED_REORDER2

+@hook TARGET_SCHED_MACRO_FUSION_P
+
+@hook TARGET_SCHED_MACRO_FUSION_PAIR_P
+
 @hook TARGET_SCHED_DEPENDENCIES_EVALUATION_HOOK

 @hook TARGET_SCHED_INIT
Index: doc/tm.texi
===================================================================
--- doc/tm.texi (revision 201963)
+++ doc/tm.texi (working copy)
@@ -6553,6 +6553,18 @@ scheduling one insn causes other insns t
 cycle.  These other insns can then be taken into account properly.
 @end deftypefn

+@deftypefn {Target Hook} bool TARGET_SCHED_MACRO_FUSION_P (void)
+This hook is used to check whether target platform supports macro fusion.
+@end deftypefn
+
+@deftypefn {Target Hook} bool TARGET_SCHED_MACRO_FUSION_PAIR_P (rtx
@var{condgen}, rtx @var{condjmp})
+This hook is used to check whether two insns could be macro fused for
+target microarchitecture. Now it is used in scheduler to adjust scheduling
+result for macro-fusion. If this hook returns true for the given insn pair
+(@var{condgen} and @var{condjmp}), scheduler will reschedule @var{condgen}
+to the position just before condjmp before commit the scheduling result.
+@end deftypefn
+
 @deftypefn {Target Hook} void
TARGET_SCHED_DEPENDENCIES_EVALUATION_HOOK (rtx @var{head}, rtx
@var{tail})
 This hook is called after evaluation forward dependencies of insns in
 chain given by two parameter values (@var{head} and @var{tail}
Index: config/i386/x86-tune.def
===================================================================
--- config/i386/x86-tune.def    (revision 201963)
+++ config/i386/x86-tune.def    (working copy)
@@ -196,7 +196,8 @@ DEF_TUNE (X86_TUNE_USE_VECTOR_CONVERTS,
 /* X86_TUNE_FUSE_CMP_AND_BRANCH: Fuse a compare or test instruction
    with a subsequent conditional jump instruction into a single
    compare-and-branch uop.  */
-DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH, "fuse_cmp_and_branch", m_BDVER)
+DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH, "fuse_cmp_and_branch",
+          m_COREI7 | m_BDVER)
 /* X86_TUNE_OPT_AGU: Optimize for Address Generation Unit. This flag
    will impact LEA instruction selection. */
 DEF_TUNE (X86_TUNE_OPT_AGU, "opt_agu", m_ATOM | m_SLM)
Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c  (revision 201963)
+++ config/i386/i386.c  (working copy)
@@ -24850,6 +24850,99 @@ ia32_multipass_dfa_lookahead (void)
     }
 }

+/* Return true if target platform supports macro-fusion.  */
+
+static bool
+ix86_macro_fusion_p ()
+{
+  if (TARGET_FUSE_CMP_AND_BRANCH)
+    return true;
+  else
+    return false;
+}
+
+/* Check whether current microarchitecture support macro fusion
+   for insn pair "CONDGEN + CONDJMP". Refer to
+   "Intel Architectures Optimization Reference Manual". */
+
+static bool
+ix86_macro_fusion_pair_p (rtx condgen, rtx condjmp)
+{
+  rtx src;
+  if (strcmp (ix86_tune_string, "corei7"))
+    {
+      /* For Nehalem.  */
+      rtx single_set = single_set (condgen);
+      /* Nehalem doesn't support macro-fusion for add/sub+jmp.  */
+      if (single_set == NULL_RTX)
+        return false;
+
+      src = SET_SRC (single_set);
+      if (GET_CODE (src) != COMPARE)
+       return false;
+
+      /* Nehalem doesn't support macro-fusion for cmp/test MEM-IMM
+        insn pattern.  */
+      if ((MEM_P (XEXP (src, 0))
+          && CONST_INT_P (XEXP (src, 1)))
+         || (MEM_P (XEXP (src, 1))
+             && CONST_INT_P (XEXP (src, 0))))
+       return false;
+
+      /* Nehalem doesn't support macro-fusion for add/sub/dec/inc + jmp.  */
+      if (get_attr_type (condgen) != TYPE_TEST
+         && get_attr_type (condgen) != TYPE_ICMP)
+       return false;
+      return true;
+    }
+  else if (strcmp (ix86_tune_string, "corei7-avx"))
+    {
+      /* For Sandybridge.  */
+      enum rtx_code ccode;
+      rtx compare_set = NULL_RTX, test_if, cond;
+      rtx single_set = single_set (condgen);
+      if (single_set != NULL_RTX)
+        compare_set = single_set;
+      else
+       {
+         int i;
+         rtx pat = PATTERN (condgen);
+         for (i = 0; i < XVECLEN (pat, 0); i++)
+           if (GET_CODE (XVECEXP (pat, 0, i)) == SET
+               && GET_CODE (SET_SRC (XVECEXP (pat, 0, i))) == COMPARE)
+             compare_set = XVECEXP (pat, 0, i);
+       }
+
+      if (compare_set == NULL_RTX)
+       return false;
+      src = SET_SRC (compare_set);
+      if (GET_CODE (src) != COMPARE)
+       return false;
+
+      /* Sandybridge doesn't support macro-fusion for cmp/test MEM-IMM
+        insn pattern.  */
+      if ((MEM_P (XEXP (src, 0))
+           && CONST_INT_P (XEXP (src, 1)))
+          || (MEM_P (XEXP (src, 1))
+              && CONST_INT_P (XEXP (src, 0))))
+        return false;
+
+      /* Sandybridge doesn't support macro-fusion for inc/dec +
+        unsigned comparison jmp.  */
+      test_if = SET_SRC (pc_set (condjmp));
+      cond = XEXP (test_if, 0);
+      ccode = GET_CODE (cond);
+      if (get_attr_type (condgen) == TYPE_INCDEC
+         && (ccode == GEU
+             || ccode == GTU
+             || ccode == LEU
+             || ccode == LTU))
+       return false;
+      return true;
+    }
+  return false;
+}
+
 /* Try to reorder ready list to take advantage of Atom pipelined IMUL
    execution. It is applied if
    (1) IMUL instruction is on the top of list;
@@ -42982,6 +43075,10 @@ ix86_memmodel_check (unsigned HOST_WIDE_
 #undef TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD
 #define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD \
   ia32_multipass_dfa_lookahead
+#undef TARGET_SCHED_MACRO_FUSION_P
+#define TARGET_SCHED_MACRO_FUSION_P ix86_macro_fusion_p
+#undef TARGET_SCHED_MACRO_FUSION_PAIR_P
+#define TARGET_SCHED_MACRO_FUSION_PAIR_P ix86_macro_fusion_pair_p

 #undef TARGET_FUNCTION_OK_FOR_SIBCALL
 #define TARGET_FUNCTION_OK_FOR_SIBCALL ix86_function_ok_for_sibcall
Index: target.def
===================================================================
--- target.def  (revision 201963)
+++ target.def  (working copy)
@@ -1041,6 +1041,20 @@ scheduling one insn causes other insns t
 cycle.  These other insns can then be taken into account properly.",
  int, (FILE *file, int verbose, rtx *ready, int *n_readyp, int clock), NULL)

+DEFHOOK
+(macro_fusion_p,
+ "This hook is used to check whether target platform supports macro fusion.",
+ bool, (void), NULL)
+
+DEFHOOK
+(macro_fusion_pair_p,
+ "This hook is used to check whether two insns could be macro fused for\n\
+target microarchitecture. Now it is used in scheduler to adjust scheduling\n\
+result for macro-fusion. If this hook returns true for the given insn pair\n\
+(@var{condgen} and @var{condjmp}), scheduler will reschedule @var{condgen}\n\
+to the position just before condjmp before commit the scheduling result.",
+ bool, (rtx condgen, rtx condjmp), NULL)
+
 /* The following member value is a pointer to a function called
    after evaluation forward dependencies of insns in chain given
    by two parameter values (head and tail correspondingly).  */

^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2013-12-03  4:53 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-09-13 20:23 Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion H.J. Lu
2013-09-13 21:47 ` Wei Mi
2013-09-14  0:44   ` Wei Mi
2013-09-14  0:55     ` H.J. Lu
2013-09-16 18:21       ` Wei Mi
2013-09-22 12:41         ` Jan Hubicka
2013-09-22 13:07           ` Jan Hubicka
2013-09-23  6:50             ` Wei Mi
2013-09-24 19:24               ` Wei Mi
2013-09-24 21:59                 ` H.J. Lu
2013-09-24 22:37                   ` Jan Hubicka
2013-09-25  3:28                     ` Wei Mi
2013-10-03 18:25                       ` Wei Mi
2013-10-15 20:47                         ` Jeff Law
2013-10-15 21:57                           ` Wei Mi
2013-10-16 20:51                             ` Jeff Law
2013-10-17  1:06                               ` Wei Mi
2013-10-17  8:06                                 ` Marek Polacek
2013-10-17 16:28                                   ` Wei Mi
2013-11-02  1:40                                 ` Wei Mi
2013-11-04  1:18                                   ` Jan Hubicka
2013-11-04 22:04                                     ` Wei Mi
2013-11-24 11:14                                       ` H.J. Lu
2013-11-24 13:38                                         ` Wei Mi
2013-11-25 12:08                                           ` Alexander Monakov
2013-11-25 19:19                                             ` Wei Mi
2013-11-25 21:08                                           ` Jeff Law
2013-11-25 21:47                                             ` Wei Mi
2013-11-25 22:13                                               ` Jeff Law
2013-11-25 22:19                                                 ` Wei Mi
2013-11-26  3:35                                                   ` Jeff Law
2013-11-26 21:20                                                     ` Wei Mi
2013-11-27  9:56                                                       ` Jeff Law
2013-11-27 10:23                                                         ` Wei Mi
2013-11-28  2:24                                                           ` Wei Mi
2013-12-03  4:53                                                             ` Jeff Law
2013-09-30 10:31                   ` Gopalasubramanian, Ganesh
     [not found] <CA+4CFy60EZ99xBGecerYLM9wD6w2v60SpJsfOGguA_PWLLupsA@mail.gmail.com>
2013-09-03 22:55 ` Wei Mi
2013-09-04  9:00   ` Alexander Monakov
2013-09-04 17:54     ` Steven Bosscher
2013-09-04 19:33       ` Alexander Monakov
2013-09-04 23:32         ` Wei Mi
2013-09-06 17:39           ` Wei Mi
2013-09-09 18:02             ` Wei Mi
2013-09-10 12:30             ` Alexander Monakov
2013-09-10 16:14               ` Wei Mi
2013-09-11 10:36                 ` Alexander Monakov
2013-09-11 16:43                   ` Wei Mi
2013-09-11 17:07                     ` Alexander Monakov
2013-09-12  7:56                       ` Wei Mi
2013-09-12 10:25                         ` Alexander Monakov
2013-09-12 18:19                           ` Wei Mi
2013-09-13 10:18                             ` Alexander Monakov
2013-09-13 18:06                               ` Wei Mi
2013-09-13 18:29                                 ` H.J. Lu
2013-09-13 18:35                                   ` Wei Mi
2013-09-14  6:10                                     ` Andi Kleen
2013-09-11 18:18         ` Andrew Pinski
2013-09-11 19:03           ` Wei Mi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).