public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH,x86] Fix combine for conditional instructions.
@ 2012-12-12 11:27 Yuri Rumyantsev
  2012-12-12 11:38 ` Uros Bizjak
  2012-12-12 11:44 ` Richard Biener
  0 siblings, 2 replies; 25+ messages in thread
From: Yuri Rumyantsev @ 2012-12-12 11:27 UTC (permalink / raw)
  To: gcc-patches; +Cc: Uros Bizjak, Igor Zamyatin

[-- Attachment #1: Type: text/plain, Size: 816 bytes --]

Hi All,

This fix is aimed at removing a performance degradation introduced by
the new LRA phase, which is in fact a combine problem. The GCC combiner
propagates a memory load into an if-then-else instruction that the old
reload phase used to split back; LRA does not perform such splitting.
To avoid the performance slowdown on an important benchmark (this is
true for all x86 targets) we decided to enhance
'ix86_legitimate_combined_insn' with a check for such propagation and
to treat such a conditional instruction with a memory operand as
illegal from a performance point of view.
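
For illustration, here is a minimal C sketch of the kind of code involved
(the structure and names are invented, loosely modelled on the trie walk
that shows up later in this thread; it is only a sketch, not code from
the benchmark):

/* The load of n->rlink is unconditional in the source.  After the branch
   is if-converted into a register-register cmov, combine can merge the
   load into the cmov, yielding a cmov with a memory operand.  */
struct node { struct node *llink, *rlink; };

struct node *
step (struct node *n, int go_right)
{
  struct node *next = n->rlink;   /* memory load */
  if (go_right)
    n = next;                     /* becomes a conditional move */
  return n;
}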

The fix was bootstrapped and regtested for x86-64.
Is it OK for 4.8 and mainline?

ChangeLog:

2012-12-12  Yuri Rumyantsev  <ysrumyan@gmail.com>

	* config/i386/i386.c (ix86_legitimate_combined_insn): Avoid combining
	of load and if_then_else instructions.

[-- Attachment #2: x86-cmove-combine-fix.diff --]
[-- Type: application/octet-stream, Size: 1042 bytes --]

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
old mode 100644
new mode 100755
index 69f44aa..bfe6271
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -5350,6 +5350,7 @@ ix86_return_pops_args (tree fundecl, tree funtype, int size)
 static bool
 ix86_legitimate_combined_insn (rtx insn)
 {
+  rtx set;
   /* Check operand constraints in case hard registers were propagated
      into insn pattern.  This check prevents combine pass from
      generating insn patterns with invalid hard register operands.
@@ -5413,6 +5414,16 @@ ix86_legitimate_combined_insn (rtx insn)
 	    return false;
 	}
     }
+  /* Avoid combining of load and if_then_else insns since it can cause 
+     performance degradation.  */
+  set = single_set (insn);
+  if (set != NULL_RTX
+      && GET_CODE (set) == SET
+      && GET_CODE (SET_DEST (set)) == REG
+      && GET_CODE (SET_SRC (set))  == IF_THEN_ELSE
+      && (MEM_P (XEXP (SET_SRC (set), 1))
+          || MEM_P (XEXP (SET_SRC (set), 2))))
+    return false;
 
   return true;
 }

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH,x86] Fix combine for conditional instructions.
  2012-12-12 11:27 [PATCH,x86] Fix combine for conditional instructions Yuri Rumyantsev
@ 2012-12-12 11:38 ` Uros Bizjak
  2012-12-12 11:47   ` Yuri Rumyantsev
  2012-12-12 11:44 ` Richard Biener
  1 sibling, 1 reply; 25+ messages in thread
From: Uros Bizjak @ 2012-12-12 11:38 UTC (permalink / raw)
  To: Yuri Rumyantsev; +Cc: gcc-patches, Igor Zamyatin

On Wed, Dec 12, 2012 at 12:27 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:

> This fix is aimed at removing a performance degradation introduced by
> the new LRA phase, which is in fact a combine problem. The GCC combiner
> propagates a memory load into an if-then-else instruction that the old
> reload phase used to split back; LRA does not perform such splitting.
> To avoid the performance slowdown on an important benchmark (this is
> true for all x86 targets) we decided to enhance
> 'ix86_legitimate_combined_insn' with a check for such propagation and
> to treat such a conditional instruction with a memory operand as
> illegal from a performance point of view.

Is this true for all x86 targets? I have no objections to the
implementation, but these fine-tunings should be declared in
ix86_tune_features[] array, and used as conditions involving
TARGET_xxx in the code. Please see many examples in the i386 source
dir.
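
For reference, the shape of such a tuning knob is roughly this (only a
sketch of file fragments; the names below mirror the entry the final
patch at the end of this thread adds, rather than being part of this
message):

/* config/i386/i386.h: add an index and a convenience macro.  */
  X86_TUNE_AVOID_MEM_OPND_FOR_CMOVE,

#define TARGET_AVOID_MEM_OPND_FOR_CMOVE \
	ix86_tune_features[X86_TUNE_AVOID_MEM_OPND_FOR_CMOVE]

/* config/i386/i386.c: add a matching entry to initial_ix86_tune_features[]
   selecting the processors the tuning applies to, e.g.  */
  /* X86_TUNE_AVOID_MEM_OPND_FOR_CMOVE: Try to avoid memory operands for
     a conditional move.  */
  m_ATOM,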

> Is it OK for 4.8 and mainline?

Hm, currently 4.8 _is_ mainline. Did you mean 4.7?

Thanks,
Uros.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH,x86] Fix combine for conditional instructions.
  2012-12-12 11:27 [PATCH,x86] Fix combine for conditional instructions Yuri Rumyantsev
  2012-12-12 11:38 ` Uros Bizjak
@ 2012-12-12 11:44 ` Richard Biener
  2012-12-12 12:55   ` Uros Bizjak
  1 sibling, 1 reply; 25+ messages in thread
From: Richard Biener @ 2012-12-12 11:44 UTC (permalink / raw)
  To: Yuri Rumyantsev; +Cc: gcc-patches, Uros Bizjak, Igor Zamyatin

On Wed, Dec 12, 2012 at 12:27 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
> Hi All,
>
> This fix is aimed at removing a performance degradation introduced by
> the new LRA phase, which is in fact a combine problem. The GCC combiner
> propagates a memory load into an if-then-else instruction that the old
> reload phase used to split back; LRA does not perform such splitting.
> To avoid the performance slowdown on an important benchmark (this is
> true for all x86 targets) we decided to enhance
> 'ix86_legitimate_combined_insn' with a check for such propagation and
> to treat such a conditional instruction with a memory operand as
> illegal from a performance point of view.
>
> The fix was bootstrapped and regtested for x86-64.
> Is it OK for 4.8 and mainline?

Isn't it a win for -Os though?  Thus, optimize_insn_for_size ()?  It can
also increase register pressure, no?  So eventually this splitting should
be done post-reload only.  Not sure what appropriate machinery there is,
apart from mdreorg (or split itself).

Richard.

> ChangeLog:
>
> 2012-12-12  Yuri Rumyantsev  <ysrumyan@gmail.com>
>
>         * config/i386/i386.c (ix86_legitimate_combined_insn): Avoid combining
>         of load and if_then_else instructions.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH,x86] Fix combine for conditional instructions.
  2012-12-12 11:38 ` Uros Bizjak
@ 2012-12-12 11:47   ` Yuri Rumyantsev
  2012-12-12 12:17     ` Richard Biener
  0 siblings, 1 reply; 25+ messages in thread
From: Yuri Rumyantsev @ 2012-12-12 11:47 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: gcc-patches, Igor Zamyatin

Hi Uros,

This fix is for all x86 platforms; we tested it on core2/corei7,
atom/atom2 and AMD and got a performance improvement of +6% -- +11%. So I
don't think we need to introduce an additional tune feature.

Sorry for my typo with the gcc version - I meant mainline only, since 4.7
does not use LRA.

Thanks.
Yuri.



2012/12/12 Uros Bizjak <ubizjak@gmail.com>:
> On Wed, Dec 12, 2012 at 12:27 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>
>> This fix is aimed at removing a performance degradation introduced by
>> the new LRA phase, which is in fact a combine problem. The GCC combiner
>> propagates a memory load into an if-then-else instruction that the old
>> reload phase used to split back; LRA does not perform such splitting.
>> To avoid the performance slowdown on an important benchmark (this is
>> true for all x86 targets) we decided to enhance
>> 'ix86_legitimate_combined_insn' with a check for such propagation and
>> to treat such a conditional instruction with a memory operand as
>> illegal from a performance point of view.
>
> Is this true for all x86 targets? I have no objections to the
> implementation, but these fine-tunings should be declared in
> ix86_tune_features[] array, and used as conditions involving
> TARGET_xxx in the code. Please see many examples in the i386 source
> dir.
>
>> Is it OK for 4.8 and mainline?
>
> Hm, currently 4.8 _is_ mainline. Did you mean 4.7?
>
> Thanks,
> Uros.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH,x86] Fix combine for conditional instructions.
  2012-12-12 11:47   ` Yuri Rumyantsev
@ 2012-12-12 12:17     ` Richard Biener
  2012-12-12 12:35       ` Yuri Rumyantsev
  0 siblings, 1 reply; 25+ messages in thread
From: Richard Biener @ 2012-12-12 12:17 UTC (permalink / raw)
  To: Yuri Rumyantsev; +Cc: Uros Bizjak, gcc-patches, Igor Zamyatin

On Wed, Dec 12, 2012 at 12:47 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
> Hi Uros,
>
> This fix is for all x86 platforms; we tested it on core2/corei7,
> atom/atom2 and AMD and got a performance improvement of +6% -- +11%. So I
> don't think we need to introduce an additional tune feature.
>
> Sorry for my typo with the gcc version - I meant mainline only, since 4.7
> does not use LRA.

Btw, if it's a performance improvement all over the board, why not adjust
the patterns themselves to not allow memory operands?  Thus, why restrict it
to combine not creating this instruction?  (My -Os comment still stands)

Richard.

> Thanks.
> Yuri.
>
>
>
> 2012/12/12 Uros Bizjak <ubizjak@gmail.com>:
>> On Wed, Dec 12, 2012 at 12:27 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>
>>> This fix is aimed at removing a performance degradation introduced by
>>> the new LRA phase, which is in fact a combine problem. The GCC combiner
>>> propagates a memory load into an if-then-else instruction that the old
>>> reload phase used to split back; LRA does not perform such splitting.
>>> To avoid the performance slowdown on an important benchmark (this is
>>> true for all x86 targets) we decided to enhance
>>> 'ix86_legitimate_combined_insn' with a check for such propagation and
>>> to treat such a conditional instruction with a memory operand as
>>> illegal from a performance point of view.
>>
>> Is this true for all x86 targets? I have no objections to the
>> implementation, but these fine-tunings should be declared in
>> ix86_tune_features[] array, and used as conditions involving
>> TARGET_xxx in the code. Please see many examples in the i386 source
>> dir.
>>
>>> Is it OK for 4.8 and mainline?
>>
>> Hm, currently 4.8 _is_ mainline. Did you mean 4.7?
>>
>> Thanks,
>> Uros.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH,x86] Fix combine for conditional instructions.
  2012-12-12 12:17     ` Richard Biener
@ 2012-12-12 12:35       ` Yuri Rumyantsev
  0 siblings, 0 replies; 25+ messages in thread
From: Yuri Rumyantsev @ 2012-12-12 12:35 UTC (permalink / raw)
  To: Richard Biener; +Cc: Uros Bizjak, gcc-patches, Igor Zamyatin

Hi Richard,

I assume that this fix does not affect code size, since such a pattern
occurs very rarely, although I can add a check for it if you insist.
Register pressure is not an issue here, since I assume an additional
fill won't hurt performance as much as a cmove with a memory operand. I
decided not to change the machine description of cmov (prohibiting a
memory operand), since that is very risky, and instead only made a simple
change so as not to produce such an instruction during the forward
substitution (aka combine) phase.

Best regards.
Yuri.


2012/12/12 Richard Biener <richard.guenther@gmail.com>:
> On Wed, Dec 12, 2012 at 12:47 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>> Hi Uros,
>>
>> This fix is for all x86 platforms; we tested it on core2/corei7,
>> atom/atom2 and AMD and got a performance improvement of +6% -- +11%. So I
>> don't think we need to introduce an additional tune feature.
>>
>> Sorry for my typo with the gcc version - I meant mainline only, since 4.7
>> does not use LRA.
>
> Btw, if it's a performance improvement all over the board, why not adjust
> the patterns themselves to not allow memory operands?  Thus, why restrict it
> to combine not creating this instruction?  (My -Os comment still stands)
>
> Richard.
>
>> Thanks.
>> Yuri.
>>
>>
>>
>> 2012/12/12 Uros Bizjak <ubizjak@gmail.com>:
>>> On Wed, Dec 12, 2012 at 12:27 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>>
>>>> This fix is aimed at removing a performance degradation introduced by
>>>> the new LRA phase, which is in fact a combine problem. The GCC combiner
>>>> propagates a memory load into an if-then-else instruction that the old
>>>> reload phase used to split back; LRA does not perform such splitting.
>>>> To avoid the performance slowdown on an important benchmark (this is
>>>> true for all x86 targets) we decided to enhance
>>>> 'ix86_legitimate_combined_insn' with a check for such propagation and
>>>> to treat such a conditional instruction with a memory operand as
>>>> illegal from a performance point of view.
>>>
>>> Is this true for all x86 targets? I have no objections to the
>>> implementation, but these fine-tunings should be declared in
>>> ix86_tune_features[] array, and used as conditions involving
>>> TARGET_xxx in the code. Please see many examples in the i386 source
>>> dir.
>>>
>>>> Is it OK for 4.8 and mainline?
>>>
>>> Hm, currently 4.8 _is_ mainline. Did you mean 4.7?
>>>
>>> Thanks,
>>> Uros.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH,x86] Fix combine for conditional instructions.
  2012-12-12 11:44 ` Richard Biener
@ 2012-12-12 12:55   ` Uros Bizjak
  2012-12-12 13:10     ` Richard Biener
  0 siblings, 1 reply; 25+ messages in thread
From: Uros Bizjak @ 2012-12-12 12:55 UTC (permalink / raw)
  To: Richard Biener; +Cc: Yuri Rumyantsev, gcc-patches, Igor Zamyatin

On Wed, Dec 12, 2012 at 12:44 PM, Richard Biener
<richard.guenther@gmail.com> wrote:

>> This fix is aimed at removing a performance degradation introduced by
>> the new LRA phase, which is in fact a combine problem. The GCC combiner
>> propagates a memory load into an if-then-else instruction that the old
>> reload phase used to split back; LRA does not perform such splitting.
>> To avoid the performance slowdown on an important benchmark (this is
>> true for all x86 targets) we decided to enhance
>> 'ix86_legitimate_combined_insn' with a check for such propagation and
>> to treat such a conditional instruction with a memory operand as
>> illegal from a performance point of view.
>>
>> The fix was bootstrapped and regtested for x86-64.
>> Is it OK for 4.8 and mainline?
>
> Isn't it a win for -Os though?  Thus, optimize_insn_for_size ()?  It can
> also increase register pressure, no?  So eventually this splitting should
> be done post-reload only.  Not sure what appropriate machinery there is,
> apart from mdreorg (or split itself).

So, you are proposing to use peephole2 with (match_scratch)
conditional temporary?

Uros.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH,x86] Fix combine for conditional instructions.
  2012-12-12 12:55   ` Uros Bizjak
@ 2012-12-12 13:10     ` Richard Biener
  2012-12-12 14:39       ` Yuri Rumyantsev
  0 siblings, 1 reply; 25+ messages in thread
From: Richard Biener @ 2012-12-12 13:10 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: Yuri Rumyantsev, gcc-patches, Igor Zamyatin

On Wed, Dec 12, 2012 at 1:55 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Wed, Dec 12, 2012 at 12:44 PM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>
>>> This fix is aimed at removing a performance degradation introduced by
>>> the new LRA phase, which is in fact a combine problem. The GCC combiner
>>> propagates a memory load into an if-then-else instruction that the old
>>> reload phase used to split back; LRA does not perform such splitting.
>>> To avoid the performance slowdown on an important benchmark (this is
>>> true for all x86 targets) we decided to enhance
>>> 'ix86_legitimate_combined_insn' with a check for such propagation and
>>> to treat such a conditional instruction with a memory operand as
>>> illegal from a performance point of view.
>>>
>>> The fix was bootstrapped and regtested for x86-64.
>>> Is it OK for 4.8 and mainline?
>>
>> Isn't it a win for -Os though?  Thus, optimize_insn_for_size ()?  It can
>> also increase register pressure, no?  So eventually this splitting should
>> be done post-reload only.  Not sure what appropriate machinery there is,
>> apart from mdreorg (or split itself).
>
> So, you are proposing to use peephole2 with (match_scratch)
> conditional temporary?

Yes, if that works.  (sounds backward to me having a peephole split one
insn into two ... ;))

Richard.

> Uros.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH,x86] Fix combine for conditional instructions.
  2012-12-12 13:10     ` Richard Biener
@ 2012-12-12 14:39       ` Yuri Rumyantsev
  2012-12-12 14:46         ` Richard Biener
  0 siblings, 1 reply; 25+ messages in thread
From: Yuri Rumyantsev @ 2012-12-12 14:39 UTC (permalink / raw)
  To: Richard Biener; +Cc: Uros Bizjak, gcc-patches, Igor Zamyatin

Guys,

I assume that this is not the right way to fix such a simple performance
anomaly, since we need to do redundant work - combine the load into the
conditional and then split it back in peephole2. Does that look
reasonable? Why should we produce an inefficient instruction that must
be split later?

Best regards.
Yuri.

2012/12/12 Richard Biener <richard.guenther@gmail.com>:
> On Wed, Dec 12, 2012 at 1:55 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
>> On Wed, Dec 12, 2012 at 12:44 PM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>
>>>> This fix is aimed at removing a performance degradation introduced by
>>>> the new LRA phase, which is in fact a combine problem. The GCC combiner
>>>> propagates a memory load into an if-then-else instruction that the old
>>>> reload phase used to split back; LRA does not perform such splitting.
>>>> To avoid the performance slowdown on an important benchmark (this is
>>>> true for all x86 targets) we decided to enhance
>>>> 'ix86_legitimate_combined_insn' with a check for such propagation and
>>>> to treat such a conditional instruction with a memory operand as
>>>> illegal from a performance point of view.
>>>>
>>>> The fix was bootstrapped and regtested for x86-64.
>>>> Is it OK for 4.8 and mainline?
>>>
>>> Isn't it a win for -Os though?  Thus, optimize_insn_for_size ()?  It can
>>> also increase register pressure, no?  So eventually this splitting should
>>> be done post-reload only.  Not sure what appropriate machinery there is,
>>> apart from mdreorg (or split itself).
>>
>> So, you are proposing to use peephole2 with (match_scratch)
>> conditional temporary?
>
> Yes, if that works.  (sounds backward to me having a peephole split one
> insn into two ... ;))
>
> Richard.
>
>> Uros.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH,x86] Fix combine for conditional instructions.
  2012-12-12 14:39       ` Yuri Rumyantsev
@ 2012-12-12 14:46         ` Richard Biener
  2012-12-12 18:33           ` Uros Bizjak
  0 siblings, 1 reply; 25+ messages in thread
From: Richard Biener @ 2012-12-12 14:46 UTC (permalink / raw)
  To: Yuri Rumyantsev; +Cc: Uros Bizjak, gcc-patches, Igor Zamyatin

On Wed, Dec 12, 2012 at 3:39 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
> Guys,
>
> I assume that this is not right way for fixing such simple performance
> anomaly since we need to do redundant work - combine load to
> conditional and then split it back in peephole2? Does it look
> reasonable? Why we should produce non-efficient instrucction that must
> be splitted later?

Well, either don't allow this instruction variant from the start, or allow
the extra freedom for register allocation this creates.  It doesn't make
sense to just reject it being generated by combine - that doesn't address
when it materializes in another way.

Just my 2 cents - I'm not an x86 target maintainer and it's definitely up to them
to decide.

Richard.

> Best regards.
> Yuri.
>
> 2012/12/12 Richard Biener <richard.guenther@gmail.com>:
>> On Wed, Dec 12, 2012 at 1:55 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
>>> On Wed, Dec 12, 2012 at 12:44 PM, Richard Biener
>>> <richard.guenther@gmail.com> wrote:
>>>
>>>>> This fix is aimed at removing a performance degradation introduced by
>>>>> the new LRA phase, which is in fact a combine problem. The GCC combiner
>>>>> propagates a memory load into an if-then-else instruction that the old
>>>>> reload phase used to split back; LRA does not perform such splitting.
>>>>> To avoid the performance slowdown on an important benchmark (this is
>>>>> true for all x86 targets) we decided to enhance
>>>>> 'ix86_legitimate_combined_insn' with a check for such propagation and
>>>>> to treat such a conditional instruction with a memory operand as
>>>>> illegal from a performance point of view.
>>>>>
>>>>> The fix was bootstrapped and regtested for x86-64.
>>>>> Is it OK for 4.8 and mainline?
>>>>
>>>> Isn't it a win for -Os though?  Thus, optimize_insn_for_size ()?  It can
>>>> also increase register pressure, no?  So eventually this splitting should
>>>> be done post-reload only.  Not sure what appropriate machinery there is,
>>>> apart from mdreorg (or split itself).
>>>
>>> So, you are proposing to use peephole2 with (match_scratch)
>>> conditional temporary?
>>
>> Yes, if that works.  (sounds backward to me having a peephole split one
>> insn into two ... ;))
>>
>> Richard.
>>
>>> Uros.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH,x86] Fix combine for conditional instructions.
  2012-12-12 14:46         ` Richard Biener
@ 2012-12-12 18:33           ` Uros Bizjak
  2012-12-12 19:32             ` Richard Henderson
  2012-12-13  9:51             ` Richard Biener
  0 siblings, 2 replies; 25+ messages in thread
From: Uros Bizjak @ 2012-12-12 18:33 UTC (permalink / raw)
  To: Richard Biener; +Cc: Yuri Rumyantsev, gcc-patches, Igor Zamyatin

[-- Attachment #1: Type: text/plain, Size: 1121 bytes --]

On Wed, Dec 12, 2012 at 3:45 PM, Richard Biener
<richard.guenther@gmail.com> wrote:

>> I assume that this is not the right way to fix such a simple performance
>> anomaly, since we need to do redundant work - combine the load into the
>> conditional and then split it back in peephole2. Does that look
>> reasonable? Why should we produce an inefficient instruction that must
>> be split later?
>
> Well, either don't allow this instruction variant from the start, or allow
> the extra freedom for register allocation this creates.  It doesn't make
> sense to just reject it being generated by combine - that doesn't address
> when it materializes in another way.

Please check the attached patch, it implements this limitation in a correct way:
- keeps memory operands for -Os or cold parts of the executable
- doesn't increase register pressure
- handles all situations where memory operand can propagate into RTX

Yuri, can you please check if this patch fixes the performance problem for you?

BTW: I would really like to add some TARGET_USE_CMOVE_WITH_MEMOP
target macro and conditionalize new peephole2 patterns on it.

Uros.

[-- Attachment #2: p.diff.txt --]
[-- Type: text/plain, Size: 2711 bytes --]

Index: i386.md
===================================================================
--- i386.md	(revision 194451)
+++ i386.md	(working copy)
@@ -16122,6 +16122,31 @@
   operands[3] = gen_lowpart (SImode, operands[3]);
 })
 
+;; Don't do conditional moves with memory inputs
+(define_peephole2
+  [(match_scratch:SWI248 2 "r")
+   (set (match_operand:SWI248 0 "register_operand")
+	(if_then_else:SWI248 (match_operator 1 "ix86_comparison_operator"
+			       [(reg FLAGS_REG) (const_int 0)])
+	  (match_dup 0)
+	  (match_operand:SWI248 3 "memory_operand")))]
+  "TARGET_CMOVE && optimize_insn_for_speed_p ()"
+  [(set (match_dup 2) (match_dup 3))
+   (set (match_dup 0)
+	(if_then_else:SWI248 (match_dup 1) (match_dup 0) (match_dup 2)))])
+
+(define_peephole2
+  [(match_scratch:SWI248 2 "r")
+   (set (match_operand:SWI248 0 "register_operand")
+	(if_then_else:SWI248 (match_operator 1 "ix86_comparison_operator"
+			       [(reg FLAGS_REG) (const_int 0)])
+	  (match_operand:SWI248 3 "memory_operand")
+	  (match_dup 0)))]
+  "TARGET_CMOVE && optimize_insn_for_speed_p ()"
+  [(set (match_dup 2) (match_dup 3))
+   (set (match_dup 0)
+	(if_then_else:SWI248 (match_dup 1) (match_dup 2) (match_dup 0)))])
+
 (define_expand "mov<mode>cc"
   [(set (match_operand:X87MODEF 0 "register_operand")
 	(if_then_else:X87MODEF
@@ -16209,6 +16234,35 @@
   [(set_attr "type" "fcmov,fcmov,icmov,icmov")
    (set_attr "mode" "SF,SF,SI,SI")])
 
+;; Don't do conditional moves with memory inputs
+(define_peephole2
+  [(match_scratch:MODEF 2 "r")
+   (set (match_operand:MODEF 0 "register_and_not_any_fp_reg_operand")
+	(if_then_else:MODEF (match_operator 1 "fcmov_comparison_operator"
+			      [(reg FLAGS_REG) (const_int 0)])
+	  (match_dup 0)
+	  (match_operand:MODEF 3 "memory_operand")))]
+  "(<MODE>mode != DFmode || TARGET_64BIT)
+   && TARGET_80387 && TARGET_CMOVE
+   && optimize_insn_for_speed_p ()"
+  [(set (match_dup 2) (match_dup 3))
+   (set (match_dup 0)
+	(if_then_else:MODEF (match_dup 1) (match_dup 0) (match_dup 2)))])
+
+(define_peephole2
+  [(match_scratch:MODEF 2 "r")
+   (set (match_operand:MODEF 0 "register_and_not_any_fp_reg_operand")
+	(if_then_else:MODEF (match_operator 1 "fcmov_comparison_operator"
+			      [(reg FLAGS_REG) (const_int 0)])
+	  (match_operand:MODEF 3 "memory_operand")
+	  (match_dup 0)))]
+  "(<MODE>mode != DFmode || TARGET_64BIT)
+   && TARGET_80387 && TARGET_CMOVE
+   && optimize_insn_for_speed_p ()"
+  [(set (match_dup 2) (match_dup 3))
+   (set (match_dup 0)
+	(if_then_else:MODEF (match_dup 1) (match_dup 2) (match_dup 0)))])
+
 ;; All moves in XOP pcmov instructions are 128 bits and hence we restrict
 ;; the scalar versions to have only XMM registers as operands.
 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH,x86] Fix combine for conditional instructions.
  2012-12-12 18:33           ` Uros Bizjak
@ 2012-12-12 19:32             ` Richard Henderson
  2012-12-13 14:23               ` Yuri Rumyantsev
  2012-12-13  9:51             ` Richard Biener
  1 sibling, 1 reply; 25+ messages in thread
From: Richard Henderson @ 2012-12-12 19:32 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: Richard Biener, Yuri Rumyantsev, gcc-patches, Igor Zamyatin

On 12/12/2012 10:32 AM, Uros Bizjak wrote:
> Please check the attached patch, it implements this limitation in a correct way:
> - keeps memory operands for -Os or cold parts of the executable
> - doesn't increase register pressure
> - handles all situations where memory operand can propagate into RTX

I agree this is the right way to attack this problem.


r~

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH,x86] Fix combine for conditional instructions.
  2012-12-12 18:33           ` Uros Bizjak
  2012-12-12 19:32             ` Richard Henderson
@ 2012-12-13  9:51             ` Richard Biener
  2012-12-13 10:20               ` Uros Bizjak
  2012-12-13 19:30               ` Jan Hubicka
  1 sibling, 2 replies; 25+ messages in thread
From: Richard Biener @ 2012-12-13  9:51 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: Yuri Rumyantsev, gcc-patches, Igor Zamyatin, Jan Hubicka

On Wed, Dec 12, 2012 at 7:32 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Wed, Dec 12, 2012 at 3:45 PM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>
>>> I assume that this is not the right way to fix such a simple performance
>>> anomaly, since we need to do redundant work - combine the load into the
>>> conditional and then split it back in peephole2. Does that look
>>> reasonable? Why should we produce an inefficient instruction that must
>>> be split later?
>>
>> Well, either don't allow this instruction variant from the start, or allow
>> the extra freedom for register allocation this creates.  It doesn't make
>> sense to just reject it being generated by combine - that doesn't address
>> when it materializes in another way.
>
> Please check the attached patch, it implements this limitation in a correct way:
> - keeps memory operands for -Os or cold parts of the executable
> - doesn't increase register pressure
> - handles all situations where memory operand can propagate into RTX
>
> Yuri, can you please check if this patch fixes the performance problem for you?
>
> BTW: I would really like to add some TARGET_USE_CMOVE_WITH_MEMOP
> target macro and conditionalize new peephole2 patterns on it.

Looks good to me.  I believe optimize_insn_for_speed_p ()
only works reliably during RTL expansion, as it relies on
crtl->maybe_hot_insn_p to be better than function granular.  I'm quite sure
split does not adjust this (it probably should, as those predicates are
definitely the correct ones to use), via rtl_profile_for_bb ().  I
think passes that
do not adjust this get what is left over by previous passes (instead of the
default).

Honza, I think the pass manager should call default_rtl_profile () before each
RTL pass to avoid this, no?
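
For reference, the per-basic-block idiom being discussed looks roughly like
this (a sketch only; it is essentially what peephole2_optimize does, as
quoted later in this thread):

  /* In an RTL pass that wants optimize_insn_for_speed_p () to be
     accurate per basic block rather than per function:  */
  FOR_EACH_BB_REVERSE (bb)
    {
      rtl_profile_for_bb (bb);
      /* ... process the insns of BB; optimize_insn_for_speed_p ()
	 now reflects whether BB is hot ...  */
    }
  default_rtl_profile ();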

Thanks,
Richard.

> Uros.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH,x86] Fix combine for conditional instructions.
  2012-12-13  9:51             ` Richard Biener
@ 2012-12-13 10:20               ` Uros Bizjak
  2012-12-13 10:33                 ` Richard Biener
  2012-12-13 19:33                 ` Jan Hubicka
  2012-12-13 19:30               ` Jan Hubicka
  1 sibling, 2 replies; 25+ messages in thread
From: Uros Bizjak @ 2012-12-13 10:20 UTC (permalink / raw)
  To: Richard Biener; +Cc: Yuri Rumyantsev, gcc-patches, Igor Zamyatin, Jan Hubicka

On Thu, Dec 13, 2012 at 10:51 AM, Richard Biener
<richard.guenther@gmail.com> wrote:

>>>> I assume that this is not the right way to fix such a simple performance
>>>> anomaly, since we need to do redundant work - combine the load into the
>>>> conditional and then split it back in peephole2. Does that look
>>>> reasonable? Why should we produce an inefficient instruction that must
>>>> be split later?
>>>
>>> Well, either don't allow this instruction variant from the start, or allow
>>> the extra freedom for register allocation this creates.  It doesn't make
>>> sense to just reject it being generated by combine - that doesn't address
>>> when it materializes in another way.
>>
>> Please check the attached patch, it implements this limitation in a correct way:
>> - keeps memory operands for -Os or cold parts of the executable
>> - doesn't increase register pressure
>> - handles all situations where memory operand can propagate into RTX
>>
>> Yuri, can you please check if this patch fixes the performance problem for you?
>>
>> BTW: I would really like to add some TARGET_USE_CMOVE_WITH_MEMOP
>> target macro and conditionalize new peephole2 patterns on it.
>
> Looks good to me.  I believe optimize_insn_for_speed_p ()
> only works reliably during RTL expansion, as it relies on
> crtl->maybe_hot_insn_p to be better than function granular.  I'm quite sure
> split does not adjust this (it probably should, as those predicates are
> definitely the correct ones to use), via rtl_profile_for_bb ().  I
> think passes that
> do not adjust this get what is left over by previous passes (instead of the
> default).
>
> Honza, I think the pass manager should call default_rtl_profile () before each
> RTL pass to avoid this, no?

Please note that we have plenty of existing peephole2s that use
optimize_insn_for_speed_p predicate. It is assumed to work ...

Uros.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH,x86] Fix combine for conditional instructions.
  2012-12-13 10:20               ` Uros Bizjak
@ 2012-12-13 10:33                 ` Richard Biener
  2012-12-13 19:33                 ` Jan Hubicka
  1 sibling, 0 replies; 25+ messages in thread
From: Richard Biener @ 2012-12-13 10:33 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: Yuri Rumyantsev, gcc-patches, Igor Zamyatin, Jan Hubicka

On Thu, Dec 13, 2012 at 11:20 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Thu, Dec 13, 2012 at 10:51 AM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>
>>>>> I assume that this is not the right way to fix such a simple performance
>>>>> anomaly, since we need to do redundant work - combine the load into the
>>>>> conditional and then split it back in peephole2. Does that look
>>>>> reasonable? Why should we produce an inefficient instruction that must
>>>>> be split later?
>>>>
>>>> Well, either don't allow this instruction variant from the start, or allow
>>>> the extra freedom for register allocation this creates.  It doesn't make
>>>> sense to just reject it being generated by combine - that doesn't address
>>>> when it materializes in another way.
>>>
>>> Please check the attached patch, it implements this limitation in a correct way:
>>> - keeps memory operands for -Os or cold parts of the executable
>>> - doesn't increase register pressure
>>> - handles all situations where memory operand can propagate into RTX
>>>
>>> Yuri, can you please check if this patch fixes the performance problem for you?
>>>
>>> BTW: I would really like to add some TARGET_USE_CMOVE_WITH_MEMOP
>>> target macro and conditionalize new peephole2 patterns on it.
>>
>> Looks good to me.  I believe optimize_insn_for_speed_p ()
>> only works reliably during RTL expansion, as it relies on
>> crtl->maybe_hot_insn_p to be better than function granular.  I'm quite sure
>> split does not adjust this (it probably should, as those predicates are
>> definitely the correct ones to use), via rtl_profile_for_bb ().  I
>> think passes that
>> do not adjust this get what is left over by previous passes (instead of the
>> default).
>>
>> Honza, I think the pass manager should call default_rtl_profile () before each
>> RTL pass to avoid this, no?
>
> Please note that we have plenty of existing peephole2s that use
> optimize_insn_for_speed_p predicate. It is assumed to work ...

Indeed - it calls rtl_profile_for_bb already.  (it's in recog.c ...
just looked for sth like "split.c" in my grep).

Richard.

> Uros.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH,x86] Fix combine for conditional instructions.
  2012-12-12 19:32             ` Richard Henderson
@ 2012-12-13 14:23               ` Yuri Rumyantsev
  2012-12-13 14:27                 ` Uros Bizjak
  2012-12-13 14:36                 ` Richard Biener
  0 siblings, 2 replies; 25+ messages in thread
From: Yuri Rumyantsev @ 2012-12-13 14:23 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Uros Bizjak, Richard Biener, gcc-patches, Igor Zamyatin

Hi Guys,

The patch proposed by Uros is useless, since we don't have a free scratch
register to split the memory operand:

;;  regs ever live 	 0[ax] 1[dx] 2[cx] 3[bx] 4[si] 5[di] 6[bp] 7[sp] 17[flags]

...

(insn 96 131 132 7 (set (reg/v/f:SI 6 bp [orig:70 trie_root ] [70])
        (if_then_else:SI (ne (reg:CCZ 17 flags)
                (const_int 0 [0]))
            (mem/f:SI (plus:SI (reg/v/f:SI 0 ax [orig:70 trie_root ] [70])
                    (const_int 12 [0xc])) [2 trie_root_23->rlink+0 S4 A32])
            (reg/v/f:SI 6 bp [orig:70 trie_root ] [70])))
routelookup/route_lookup.c:639 940 {*movsicc_noc}
     (expr_list:REG_DEAD (reg:CCZ 17 flags)
        (expr_list:REG_DEAD (reg/v/f:SI 0 ax [orig:70 trie_root ] [70])
            (nil))))

How can we cope with this? I still assume that we can apply my patch
with an additional check on optimizing for speed.

Best regards.
Yuri.

2012/12/12 Richard Henderson <rth@redhat.com>:
> On 12/12/2012 10:32 AM, Uros Bizjak wrote:
>> Please check the attached patch, it implements this limitation in a correct way:
>> - keeps memory operands for -Os or cold parts of the executable
>> - doesn't increase register pressure
>> - handles all situations where memory operand can propagate into RTX
>
> I agree this is the right way to attack this problem.
>
>
> r~

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH,x86] Fix combine for conditional instructions.
  2012-12-13 14:23               ` Yuri Rumyantsev
@ 2012-12-13 14:27                 ` Uros Bizjak
  2012-12-13 14:40                   ` Uros Bizjak
  2012-12-13 14:36                 ` Richard Biener
  1 sibling, 1 reply; 25+ messages in thread
From: Uros Bizjak @ 2012-12-13 14:27 UTC (permalink / raw)
  To: Yuri Rumyantsev
  Cc: Richard Henderson, Richard Biener, gcc-patches, Igor Zamyatin

On Thu, Dec 13, 2012 at 3:23 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:

> The patch proposed by Uros is useless, since we don't have a free scratch
> register to split the memory operand:
>
> ;;  regs ever live       0[ax] 1[dx] 2[cx] 3[bx] 4[si] 5[di] 6[bp] 7[sp] 17[flags]
>
> ...
>
> (insn 96 131 132 7 (set (reg/v/f:SI 6 bp [orig:70 trie_root ] [70])
>         (if_then_else:SI (ne (reg:CCZ 17 flags)
>                 (const_int 0 [0]))
>             (mem/f:SI (plus:SI (reg/v/f:SI 0 ax [orig:70 trie_root ] [70])
>                     (const_int 12 [0xc])) [2 trie_root_23->rlink+0 S4 A32])
>             (reg/v/f:SI 6 bp [orig:70 trie_root ] [70])))
> routelookup/route_lookup.c:639 940 {*movsicc_noc}
>      (expr_list:REG_DEAD (reg:CCZ 17 flags)
>         (expr_list:REG_DEAD (reg/v/f:SI 0 ax [orig:70 trie_root ] [70])
>             (nil))))
>
> How can we cope with this? I still assume that we can apply my patch
> with an additional check on optimizing for speed.

Can you please post the testcase?

Uros.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH,x86] Fix combine for conditional instructions.
  2012-12-13 14:23               ` Yuri Rumyantsev
  2012-12-13 14:27                 ` Uros Bizjak
@ 2012-12-13 14:36                 ` Richard Biener
  1 sibling, 0 replies; 25+ messages in thread
From: Richard Biener @ 2012-12-13 14:36 UTC (permalink / raw)
  To: Yuri Rumyantsev
  Cc: Richard Henderson, Uros Bizjak, gcc-patches, Igor Zamyatin

On Thu, Dec 13, 2012 at 3:23 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
> Hi Guys,
>
> The patch proposed by Uros is useless, since we don't have a free scratch
> register to split the memory operand:
>
> ;;  regs ever live       0[ax] 1[dx] 2[cx] 3[bx] 4[si] 5[di] 6[bp] 7[sp] 17[flags]
>
> ...
>
> (insn 96 131 132 7 (set (reg/v/f:SI 6 bp [orig:70 trie_root ] [70])
>         (if_then_else:SI (ne (reg:CCZ 17 flags)
>                 (const_int 0 [0]))
>             (mem/f:SI (plus:SI (reg/v/f:SI 0 ax [orig:70 trie_root ] [70])
>                     (const_int 12 [0xc])) [2 trie_root_23->rlink+0 S4 A32])
>             (reg/v/f:SI 6 bp [orig:70 trie_root ] [70])))
> routelookup/route_lookup.c:639 940 {*movsicc_noc}
>      (expr_list:REG_DEAD (reg:CCZ 17 flags)
>         (expr_list:REG_DEAD (reg/v/f:SI 0 ax [orig:70 trie_root ] [70])
>             (nil))))
>
> How can we cope with this? I still assume that we can apply my patch
> with an additional check on optimizing for speed.

What's the important benchmark you are register-starved for?

Richard.

> Best regards.
> Yuri.
>
> 2012/12/12 Richard Henderson <rth@redhat.com>:
>> On 12/12/2012 10:32 AM, Uros Bizjak wrote:
>>> Please check the attached patch, it implements this limitation in a correct way:
>>> - keeps memory operands for -Os or cold parts of the executable
>>> - doesn't increase register pressure
>>> - handles all situations where memory operand can propagate into RTX
>>
>> I agree this is the right way to attack this problem.
>>
>>
>> r~

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH,x86] Fix combine for conditional instructions.
  2012-12-13 14:27                 ` Uros Bizjak
@ 2012-12-13 14:40                   ` Uros Bizjak
  2012-12-13 15:02                     ` Yuri Rumyantsev
  0 siblings, 1 reply; 25+ messages in thread
From: Uros Bizjak @ 2012-12-13 14:40 UTC (permalink / raw)
  To: Yuri Rumyantsev
  Cc: Richard Henderson, Richard Biener, gcc-patches, Igor Zamyatin

On Thu, Dec 13, 2012 at 3:27 PM, Uros Bizjak <ubizjak@gmail.com> wrote:

>> The patch proposed by Uros is useless, since we don't have a free scratch
>> register to split the memory operand:
>>
>> ;;  regs ever live       0[ax] 1[dx] 2[cx] 3[bx] 4[si] 5[di] 6[bp] 7[sp] 17[flags]
>>
>> ...
>>
>> (insn 96 131 132 7 (set (reg/v/f:SI 6 bp [orig:70 trie_root ] [70])
>>         (if_then_else:SI (ne (reg:CCZ 17 flags)
>>                 (const_int 0 [0]))
>>             (mem/f:SI (plus:SI (reg/v/f:SI 0 ax [orig:70 trie_root ] [70])
>>                     (const_int 12 [0xc])) [2 trie_root_23->rlink+0 S4 A32])
>>             (reg/v/f:SI 6 bp [orig:70 trie_root ] [70])))
>> routelookup/route_lookup.c:639 940 {*movsicc_noc}
>>      (expr_list:REG_DEAD (reg:CCZ 17 flags)
>>         (expr_list:REG_DEAD (reg/v/f:SI 0 ax [orig:70 trie_root ] [70])
>>             (nil))))
>>
>> How can we cope with this? I still assume that we can apply my patch
>> with an additional check on optimizing for speed.
>
> Can you please post the testcase?

Also, the peephole2 pass recalculates life information by itself before
every insn, so the "regs ever live" array is not something to look at. The
question was whether the patch solves the runtime regression for you, not
whether it removes all memory operands from the conditional moves. The
split depends on the optimize_insn_for_speed_p predicate.

Uros.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH,x86] Fix combine for conditional instructions.
  2012-12-13 14:40                   ` Uros Bizjak
@ 2012-12-13 15:02                     ` Yuri Rumyantsev
  2012-12-13 17:03                       ` Uros Bizjak
  0 siblings, 1 reply; 25+ messages in thread
From: Yuri Rumyantsev @ 2012-12-13 15:02 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: Richard Henderson, Richard Biener, gcc-patches, Igor Zamyatin

Uros,

We did not see any performance improvement on Atom in 32-bit mode at
routelookup from eembc_2_0 (eembc_1_1).

Best regards.
Yuri.

2012/12/13 Uros Bizjak <ubizjak@gmail.com>:
> On Thu, Dec 13, 2012 at 3:27 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
>
>>> The patch proposed by Uros is useless, since we don't have a free scratch
>>> register to split the memory operand:
>>>
>>> ;;  regs ever live       0[ax] 1[dx] 2[cx] 3[bx] 4[si] 5[di] 6[bp] 7[sp] 17[flags]
>>>
>>> ...
>>>
>>> (insn 96 131 132 7 (set (reg/v/f:SI 6 bp [orig:70 trie_root ] [70])
>>>         (if_then_else:SI (ne (reg:CCZ 17 flags)
>>>                 (const_int 0 [0]))
>>>             (mem/f:SI (plus:SI (reg/v/f:SI 0 ax [orig:70 trie_root ] [70])
>>>                     (const_int 12 [0xc])) [2 trie_root_23->rlink+0 S4 A32])
>>>             (reg/v/f:SI 6 bp [orig:70 trie_root ] [70])))
>>> routelookup/route_lookup.c:639 940 {*movsicc_noc}
>>>      (expr_list:REG_DEAD (reg:CCZ 17 flags)
>>>         (expr_list:REG_DEAD (reg/v/f:SI 0 ax [orig:70 trie_root ] [70])
>>>             (nil))))
>>>
>>> How can we cope with this? I still assume that we can apply my patch
>>> with an additional check on optimizing for speed.
>>
>> Can you please post the testcase?
>
> Also, the peephole2 pass recalculates life information by itself before
> every insn, so the "regs ever live" array is not something to look at. The
> question was whether the patch solves the runtime regression for you, not
> whether it removes all memory operands from the conditional moves. The
> split depends on the optimize_insn_for_speed_p predicate.
>
> Uros.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH,x86] Fix combine for conditional instructions.
  2012-12-13 15:02                     ` Yuri Rumyantsev
@ 2012-12-13 17:03                       ` Uros Bizjak
  2012-12-14 10:47                         ` Yuri Rumyantsev
  0 siblings, 1 reply; 25+ messages in thread
From: Uros Bizjak @ 2012-12-13 17:03 UTC (permalink / raw)
  To: Yuri Rumyantsev
  Cc: Richard Henderson, Richard Biener, gcc-patches, Igor Zamyatin

[-- Attachment #1: Type: text/plain, Size: 1565 bytes --]

On Thu, Dec 13, 2012 at 4:02 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:

> We did not see any performance improvement on Atom in 32-bit mode at
> routelookup from eembc_2_0 (eembc_1_1).

I assume that for x86_64 the patch works as expected. Let's take a
bigger hammer for 32-bit targets - a splitter that effectively does
the same as your proposed patch. Please note that this splitter
uses the optimize_function_for_speed_p predicate, so this
transformation will take place in the whole function. Can you perhaps
investigate what happens if this predicate is changed back to
optimize_insn_for_speed_p () - this is what we would like to have
here?

;; Don't do conditional moves with memory inputs.  This splitter helps
;; register starved x86_32 by forcing inputs into registers before reload.
(define_split
  [(set (match_operand:SWI248 0 "register_operand")
	(if_then_else:SWI248 (match_operator 1 "ix86_comparison_operator"
			       [(reg FLAGS_REG) (const_int 0)])
	  (match_operand:SWI248 2 "nonimmediate_operand")
	  (match_operand:SWI248 3 "nonimmediate_operand")))]
  "!TARGET_64BIT && TARGET_CMOVE
   && (MEM_P (operands[2]) || MEM_P (operands[3]))
   && can_create_pseudo_p ()
   && optimize_function_for_speed_p (cfun)"
  [(set (match_dup 0)
	(if_then_else:SWI248 (match_dup 1) (match_dup 2) (match_dup 3)))]
{
  if (MEM_P (operands[2]))
    operands[2] = force_reg (<MODE>mode, operands[2]);
  if (MEM_P (operands[3]))
    operands[3] = force_reg (<MODE>mode, operands[3]);
})

Attached is the complete patch, including peephole2s.

Uros.

[-- Attachment #2: p.diff.txt --]
[-- Type: text/plain, Size: 5512 bytes --]

Index: i386.md
===================================================================
--- i386.md	(revision 194451)
+++ i386.md	(working copy)
@@ -16093,6 +16093,27 @@
   [(set_attr "type" "icmov")
    (set_attr "mode" "<MODE>")])
 
+;; Don't do conditional moves with memory inputs.  This splitter helps
+;; register starved x86_32 by forcing inputs into registers before reload.
+(define_split
+  [(set (match_operand:SWI248 0 "register_operand")
+	(if_then_else:SWI248 (match_operator 1 "ix86_comparison_operator"
+			       [(reg FLAGS_REG) (const_int 0)])
+	  (match_operand:SWI248 2 "nonimmediate_operand")
+	  (match_operand:SWI248 3 "nonimmediate_operand")))]
+  "!TARGET_64BIT && TARGET_CMOVE
+   && (MEM_P (operands[2]) || MEM_P (operands[3]))
+   && can_create_pseudo_p ()
+   && optimize_function_for_speed_p (cfun)"
+  [(set (match_dup 0)
+	(if_then_else:SWI248 (match_dup 1) (match_dup 2) (match_dup 3)))]
+{
+  if (MEM_P (operands[2]))
+    operands[2] = force_reg (<MODE>mode, operands[2]);
+  if (MEM_P (operands[3]))
+    operands[3] = force_reg (<MODE>mode, operands[3]);
+})
+
 (define_insn "*movqicc_noc"
   [(set (match_operand:QI 0 "register_operand" "=r,r")
 	(if_then_else:QI (match_operator 1 "ix86_comparison_operator"
@@ -16105,14 +16126,12 @@
    (set_attr "mode" "QI")])
 
 (define_split
-  [(set (match_operand 0 "register_operand")
-	(if_then_else (match_operator 1 "ix86_comparison_operator"
-			[(reg FLAGS_REG) (const_int 0)])
-		      (match_operand 2 "register_operand")
-		      (match_operand 3 "register_operand")))]
+  [(set (match_operand:SWI12 0 "register_operand")
+	(if_then_else:SWI12 (match_operator 1 "ix86_comparison_operator"
+			      [(reg FLAGS_REG) (const_int 0)])
+		      (match_operand:SWI12 2 "register_operand")
+		      (match_operand:SWI12 3 "register_operand")))]
   "TARGET_CMOVE && !TARGET_PARTIAL_REG_STALL
-   && (GET_MODE (operands[0]) == QImode
-       || GET_MODE (operands[0]) == HImode)
    && reload_completed"
   [(set (match_dup 0)
 	(if_then_else:SI (match_dup 1) (match_dup 2) (match_dup 3)))]
@@ -16122,6 +16141,31 @@
   operands[3] = gen_lowpart (SImode, operands[3]);
 })
 
+;; Don't do conditional moves with memory inputs
+(define_peephole2
+  [(match_scratch:SWI248 2 "r")
+   (set (match_operand:SWI248 0 "register_operand")
+	(if_then_else:SWI248 (match_operator 1 "ix86_comparison_operator"
+			       [(reg FLAGS_REG) (const_int 0)])
+	  (match_dup 0)
+	  (match_operand:SWI248 3 "memory_operand")))]
+  "TARGET_CMOVE && optimize_insn_for_speed_p ()"
+  [(set (match_dup 2) (match_dup 3))
+   (set (match_dup 0)
+	(if_then_else:SWI248 (match_dup 1) (match_dup 0) (match_dup 2)))])
+
+(define_peephole2
+  [(match_scratch:SWI248 2 "r")
+   (set (match_operand:SWI248 0 "register_operand")
+	(if_then_else:SWI248 (match_operator 1 "ix86_comparison_operator"
+			       [(reg FLAGS_REG) (const_int 0)])
+	  (match_operand:SWI248 3 "memory_operand")
+	  (match_dup 0)))]
+  "TARGET_CMOVE && optimize_insn_for_speed_p ()"
+  [(set (match_dup 2) (match_dup 3))
+   (set (match_dup 0)
+	(if_then_else:SWI248 (match_dup 1) (match_dup 2) (match_dup 0)))])
+
 (define_expand "mov<mode>cc"
   [(set (match_operand:X87MODEF 0 "register_operand")
 	(if_then_else:X87MODEF
@@ -16209,6 +16253,56 @@
   [(set_attr "type" "fcmov,fcmov,icmov,icmov")
    (set_attr "mode" "SF,SF,SI,SI")])
 
+;; Don't do conditional moves with memory inputs.  This splitter helps
+;; register starved x86_32 by forcing inputs into registers before reload.
+(define_split
+  [(set (match_operand:MODEF 0 "register_operand")
+	(if_then_else:MODEF (match_operator 1 "ix86_comparison_operator"
+			      [(reg FLAGS_REG) (const_int 0)])
+	  (match_operand:MODEF 2 "nonimmediate_operand")
+	  (match_operand:MODEF 3 "nonimmediate_operand")))]
+  "!TARGET_64BIT && TARGET_80387 && TARGET_CMOVE
+   && (MEM_P (operands[2]) || MEM_P (operands[3]))
+   && can_create_pseudo_p ()
+   && optimize_function_for_speed_p (cfun)"
+  [(set (match_dup 0)
+	(if_then_else:MODEF (match_dup 1) (match_dup 2) (match_dup 3)))]
+{
+  if (MEM_P (operands[2]))
+    operands[2] = force_reg (<MODE>mode, operands[2]);
+  if (MEM_P (operands[3]))
+    operands[3] = force_reg (<MODE>mode, operands[3]);
+})
+
+;; Don't do conditional moves with memory inputs
+(define_peephole2
+  [(match_scratch:MODEF 2 "r")
+   (set (match_operand:MODEF 0 "register_and_not_any_fp_reg_operand")
+	(if_then_else:MODEF (match_operator 1 "fcmov_comparison_operator"
+			      [(reg FLAGS_REG) (const_int 0)])
+	  (match_dup 0)
+	  (match_operand:MODEF 3 "memory_operand")))]
+  "(<MODE>mode != DFmode || TARGET_64BIT)
+   && TARGET_80387 && TARGET_CMOVE
+   && optimize_insn_for_speed_p ()"
+  [(set (match_dup 2) (match_dup 3))
+   (set (match_dup 0)
+	(if_then_else:MODEF (match_dup 1) (match_dup 0) (match_dup 2)))])
+
+(define_peephole2
+  [(match_scratch:MODEF 2 "r")
+   (set (match_operand:MODEF 0 "register_and_not_any_fp_reg_operand")
+	(if_then_else:MODEF (match_operator 1 "fcmov_comparison_operator"
+			      [(reg FLAGS_REG) (const_int 0)])
+	  (match_operand:MODEF 3 "memory_operand")
+	  (match_dup 0)))]
+  "(<MODE>mode != DFmode || TARGET_64BIT)
+   && TARGET_80387 && TARGET_CMOVE
+   && optimize_insn_for_speed_p ()"
+  [(set (match_dup 2) (match_dup 3))
+   (set (match_dup 0)
+	(if_then_else:MODEF (match_dup 1) (match_dup 2) (match_dup 0)))])
+
 ;; All moves in XOP pcmov instructions are 128 bits and hence we restrict
 ;; the scalar versions to have only XMM registers as operands.
 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH,x86] Fix combine for conditional instructions.
  2012-12-13  9:51             ` Richard Biener
  2012-12-13 10:20               ` Uros Bizjak
@ 2012-12-13 19:30               ` Jan Hubicka
  1 sibling, 0 replies; 25+ messages in thread
From: Jan Hubicka @ 2012-12-13 19:30 UTC (permalink / raw)
  To: Richard Biener
  Cc: Uros Bizjak, Yuri Rumyantsev, gcc-patches, Igor Zamyatin, Jan Hubicka

> On Wed, Dec 12, 2012 at 7:32 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
> > On Wed, Dec 12, 2012 at 3:45 PM, Richard Biener
> > <richard.guenther@gmail.com> wrote:
> >
> >>> I assume that this is not right way for fixing such simple performance
> >>> anomaly since we need to do redundant work - combine load to
> >>> conditional and then split it back in peephole2? Does it look
> >>> reasonable? Why we should produce non-efficient instrucction that must
> >>> be splitted later?
> >>
> >> Well, either don't allow this instruction variant from the start, or allow
> >> the extra freedom for register allocation this creates.  It doesn't make
> >> sense to just reject it being generated by combine - that doesn't address
> >> when it materializes in another way.
> >
> > Please check the attached patch, it implements this limitation in a correct way:
> > - keeps memory operands for -Os or cold parts of the executable
> > - doesn't increase register pressure
> > - handles all situations where memory operand can propagate into RTX
> >
> > Yuri, can you please check if this patch fixes the performance problem for you?
> >
> > BTW: I would really like to add some TARGET_USE_CMOVE_WITH_MEMOP
> > target macro and conditionalize new peephole2 patterns on it.
> 
> Looks good to me.  I believe optimize_insn_for_speed_p ()
> only works reliably during RTL expansion, as it relies on
> crtl->maybe_hot_insn_p to be better than function granular.  I'm quite sure
> split does not adjust this (it probably should, as those predicates are
> definitely the correct ones to use), via rtl_profile_for_bb ().  I
> think passes that
> do not adjust this get what is left over by previous passes (instead of the
> default).
> 
> Honza, I think the pass manager should call default_rtl_profile () before each
> RTL pass to avoid this, no?

Yep, we should get to default_rtl_profile after each pass.  We should also
fix passes that do splitting/expansion without properly setting the current
basic block.

Honza
> 
> Thanks,
> Richard.
> 
> > Uros.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH,x86] Fix combine for conditional instructions.
  2012-12-13 10:20               ` Uros Bizjak
  2012-12-13 10:33                 ` Richard Biener
@ 2012-12-13 19:33                 ` Jan Hubicka
  1 sibling, 0 replies; 25+ messages in thread
From: Jan Hubicka @ 2012-12-13 19:33 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: Richard Biener, Yuri Rumyantsev, gcc-patches, Igor Zamyatin, Jan Hubicka

> > Honza, I think the pass manager should call default_rtl_profile () before each
> > RTL pass to avoid this, no?
> 
> Please note that we have plenty of existing peephole2s that use
> optimize_insn_for_speed_p predicate. It is assumed to work ...

It is set by peep2 pass
static void
peephole2_optimize (void)
{
  rtx insn;
  bitmap live;
  int i;
  basic_block bb;

  peep2_do_cleanup_cfg = false;
  peep2_do_rebuild_jump_labels = false;

  df_set_flags (DF_LR_RUN_DCE);
  df_note_add_problem ();
  df_analyze ();

  /* Initialize the regsets we're going to use.  */
  for (i = 0; i < MAX_INSNS_PER_PEEP2 + 1; ++i)
    peep2_insn_data[i].live_before = BITMAP_ALLOC (&reg_obstack);
  live = BITMAP_ALLOC (&reg_obstack);

  FOR_EACH_BB_REVERSE (bb)
    {
      bool past_end = false;
      int pos;

      rtl_profile_for_bb (bb);
      ^^^ here.

Splitters/peepholes should be safe, as the splitting/peephole and combine
passes care about the profile.  What is somewhat unreliable is expansion
done late, like from the loop optimizer.  I am not sure how much of that
we have left.

Honza
> 
> Uros.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH,x86] Fix combine for conditional instructions.
  2012-12-13 17:03                       ` Uros Bizjak
@ 2012-12-14 10:47                         ` Yuri Rumyantsev
  2012-12-19 16:08                           ` Uros Bizjak
  0 siblings, 1 reply; 25+ messages in thread
From: Yuri Rumyantsev @ 2012-12-14 10:47 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: Richard Henderson, Richard Biener, gcc-patches, Igor Zamyatin

Hi Uros,

With your new fix that adds if-then-else splitting for memory operands I
got the expected performance speed-up - +6.7% for Atom and +8.4% for SNB.
We need to do full testing this weekend and I will get you our final
feedback on Monday.

Thanks in advance for all your help.
Yuri.

2012/12/13 Uros Bizjak <ubizjak@gmail.com>:
> On Thu, Dec 13, 2012 at 4:02 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>
>> We did not see any performance improvement on Atom in 32-bit mode at
>> routelookup from eembc_2_0 (eembc_1_1).
>
> I assume that for x86_64 the patch works as expected. Let's take a
> bigger hammer for 32-bit targets - a splitter that effectively does
> the same as your proposed patch. Please note that this splitter
> uses the optimize_function_for_speed_p predicate, so this
> transformation will take place in the whole function. Can you perhaps
> investigate what happens if this predicate is changed back to
> optimize_insn_for_speed_p () - this is what we would like to have
> here?
>
> ;; Don't do conditional moves with memory inputs.  This splitter helps
> ;; register starved x86_32 by forcing inputs into registers before reload.
> (define_split
>   [(set (match_operand:SWI248 0 "register_operand")
>         (if_then_else:SWI248 (match_operator 1 "ix86_comparison_operator"
>                                [(reg FLAGS_REG) (const_int 0)])
>           (match_operand:SWI248 2 "nonimmediate_operand")
>           (match_operand:SWI248 3 "nonimmediate_operand")))]
>   "!TARGET_64BIT && TARGET_CMOVE
>    && (MEM_P (operands[2]) || MEM_P (operands[3]))
>    && can_create_pseudo_p ()
>    && optimize_function_for_speed_p (cfun)"
>   [(set (match_dup 0)
>         (if_then_else:SWI248 (match_dup 1) (match_dup 2) (match_dup 3)))]
> {
>   if (MEM_P (operands[2]))
>     operands[2] = force_reg (<MODE>mode, operands[2]);
>   if (MEM_P (operands[3]))
>     operands[3] = force_reg (<MODE>mode, operands[3]);
> })
>
> Attached is the complete patch, including peephole2s.
>
> Uros.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH,x86] Fix combine for conditional instructions.
  2012-12-14 10:47                         ` Yuri Rumyantsev
@ 2012-12-19 16:08                           ` Uros Bizjak
  0 siblings, 0 replies; 25+ messages in thread
From: Uros Bizjak @ 2012-12-19 16:08 UTC (permalink / raw)
  To: Yuri Rumyantsev
  Cc: Richard Henderson, Richard Biener, gcc-patches, Igor Zamyatin

[-- Attachment #1: Type: text/plain, Size: 990 bytes --]

On Fri, Dec 14, 2012 at 11:47 AM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:

> With your new fix that adds if-then-else splitting for memory operands I
> got the expected performance speed-up: +6.7% for Atom and +8.4% for SNB.
> We need to do all the testing this weekend, and I will get you our final
> feedback on Monday.

After some offline discussion, we decided to enable the splitting for
Atom only (where it was always a win), since the splitting regressed SNB
in some other tests.
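
For illustration, the kind of source this is about - a conditional select
where one arm comes from memory, which combine may turn into a cmov with a
memory operand (a hand-written sketch, not a testcase from the patch):

int
select (int x, const int *p)
{
  /* At -O2 the load of *p is typically emitted unconditionally and the
     select becomes a cmov; the combiner may then fold the load straight
     into the cmov as a memory operand, which the new tuning avoids on
     Atom by forcing *p into a register first.  */
  return x > 0 ? x : *p;
}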

2012-12-19  Uros Bizjak  <ubizjak@gmail.com>
	    Yuri Rumyantsev  <ysrumyan@gmail.com>

	* config/i386/i386.h (enum ix86_tune_indices): Add
	X86_TUNE_AVOID_MEM_OPND_FOR_CMOVE.
	(TARGET_AVOID_MEM_OPND_FOR_CMOVE): New define.
	* config/i386/i386.c (initial_ix86_tune_features)
	<X86_TUNE_AVOID_MEM_OPND_FOR_CMOVE>: Initialize.
	* config/i386/i386.md (splitters to avoid cmove memory operands): New.
	(peephole2s to avoid cmove memory operands): New.

Tested on x86_64-pc-linux-gnu, committed to mainline SVN.

Uros.

[-- Attachment #2: p.diff.txt --]
[-- Type: text/plain, Size: 7053 bytes --]

Index: i386.c
===================================================================
--- i386.c	(revision 194610)
+++ i386.c	(working copy)
@@ -2026,7 +2026,11 @@ static unsigned int initial_ix86_tune_features[X86
 
   /* X86_TUNE_GENERAL_REGS_SSE_SPILL: Try to spill general regs to SSE
      regs instead of memory.  */
-  m_COREI7 | m_CORE2I7
+  m_COREI7 | m_CORE2I7,
+
+  /* X86_TUNE_AVOID_MEM_OPND_FOR_CMOVE: Try to avoid memory operands for
+     a conditional move.  */
+  m_ATOM
 };
 
 /* Feature tests against the various architecture variations.  */
Index: i386.h
===================================================================
--- i386.h	(revision 194610)
+++ i386.h	(working copy)
@@ -331,6 +331,7 @@ enum ix86_tune_indices {
   X86_TUNE_REASSOC_INT_TO_PARALLEL,
   X86_TUNE_REASSOC_FP_TO_PARALLEL,
   X86_TUNE_GENERAL_REGS_SSE_SPILL,
+  X86_TUNE_AVOID_MEM_OPND_FOR_CMOVE,
 
   X86_TUNE_LAST
 };
@@ -436,6 +437,8 @@ extern unsigned char ix86_tune_features[X86_TUNE_L
 	ix86_tune_features[X86_TUNE_REASSOC_FP_TO_PARALLEL]
 #define TARGET_GENERAL_REGS_SSE_SPILL \
 	ix86_tune_features[X86_TUNE_GENERAL_REGS_SSE_SPILL]
+#define TARGET_AVOID_MEM_OPND_FOR_CMOVE \
+	ix86_tune_features[X86_TUNE_AVOID_MEM_OPND_FOR_CMOVE]
 
 /* Feature tests against the various architecture variations.  */
 enum ix86_arch_indices {
Index: i386.md
===================================================================
--- i386.md	(revision 194610)
+++ i386.md	(working copy)
@@ -16093,6 +16093,28 @@
   [(set_attr "type" "icmov")
    (set_attr "mode" "<MODE>")])
 
+;; Don't do conditional moves with memory inputs.  This splitter helps
+;; register starved x86_32 by forcing inputs into registers before reload.
+(define_split
+  [(set (match_operand:SWI248 0 "register_operand")
+	(if_then_else:SWI248 (match_operator 1 "ix86_comparison_operator"
+			       [(reg FLAGS_REG) (const_int 0)])
+	  (match_operand:SWI248 2 "nonimmediate_operand")
+	  (match_operand:SWI248 3 "nonimmediate_operand")))]
+  "!TARGET_64BIT && TARGET_CMOVE
+   && TARGET_AVOID_MEM_OPND_FOR_CMOVE
+   && (MEM_P (operands[2]) || MEM_P (operands[3]))
+   && can_create_pseudo_p ()
+   && optimize_insn_for_speed_p ()"
+  [(set (match_dup 0)
+	(if_then_else:SWI248 (match_dup 1) (match_dup 2) (match_dup 3)))]
+{
+  if (MEM_P (operands[2]))
+    operands[2] = force_reg (<MODE>mode, operands[2]);
+  if (MEM_P (operands[3]))
+    operands[3] = force_reg (<MODE>mode, operands[3]);
+})
+
 (define_insn "*movqicc_noc"
   [(set (match_operand:QI 0 "register_operand" "=r,r")
 	(if_then_else:QI (match_operator 1 "ix86_comparison_operator"
@@ -16105,14 +16127,12 @@
    (set_attr "mode" "QI")])
 
 (define_split
-  [(set (match_operand 0 "register_operand")
-	(if_then_else (match_operator 1 "ix86_comparison_operator"
-			[(reg FLAGS_REG) (const_int 0)])
-		      (match_operand 2 "register_operand")
-		      (match_operand 3 "register_operand")))]
+  [(set (match_operand:SWI12 0 "register_operand")
+	(if_then_else:SWI12 (match_operator 1 "ix86_comparison_operator"
+			      [(reg FLAGS_REG) (const_int 0)])
+		      (match_operand:SWI12 2 "register_operand")
+		      (match_operand:SWI12 3 "register_operand")))]
   "TARGET_CMOVE && !TARGET_PARTIAL_REG_STALL
-   && (GET_MODE (operands[0]) == QImode
-       || GET_MODE (operands[0]) == HImode)
    && reload_completed"
   [(set (match_dup 0)
 	(if_then_else:SI (match_dup 1) (match_dup 2) (match_dup 3)))]
@@ -16122,6 +16142,33 @@
   operands[3] = gen_lowpart (SImode, operands[3]);
 })
 
+;; Don't do conditional moves with memory inputs
+(define_peephole2
+  [(match_scratch:SWI248 2 "r")
+   (set (match_operand:SWI248 0 "register_operand")
+	(if_then_else:SWI248 (match_operator 1 "ix86_comparison_operator"
+			       [(reg FLAGS_REG) (const_int 0)])
+	  (match_dup 0)
+	  (match_operand:SWI248 3 "memory_operand")))]
+  "TARGET_CMOVE && TARGET_AVOID_MEM_OPND_FOR_CMOVE
+   && optimize_insn_for_speed_p ()"
+  [(set (match_dup 2) (match_dup 3))
+   (set (match_dup 0)
+	(if_then_else:SWI248 (match_dup 1) (match_dup 0) (match_dup 2)))])
+
+(define_peephole2
+  [(match_scratch:SWI248 2 "r")
+   (set (match_operand:SWI248 0 "register_operand")
+	(if_then_else:SWI248 (match_operator 1 "ix86_comparison_operator"
+			       [(reg FLAGS_REG) (const_int 0)])
+	  (match_operand:SWI248 3 "memory_operand")
+	  (match_dup 0)))]
+  "TARGET_CMOVE && TARGET_AVOID_MEM_OPND_FOR_CMOVE
+   && optimize_insn_for_speed_p ()"
+  [(set (match_dup 2) (match_dup 3))
+   (set (match_dup 0)
+	(if_then_else:SWI248 (match_dup 1) (match_dup 2) (match_dup 0)))])
+
 (define_expand "mov<mode>cc"
   [(set (match_operand:X87MODEF 0 "register_operand")
 	(if_then_else:X87MODEF
@@ -16209,6 +16256,59 @@
   [(set_attr "type" "fcmov,fcmov,icmov,icmov")
    (set_attr "mode" "SF,SF,SI,SI")])
 
+;; Don't do conditional moves with memory inputs.  This splitter helps
+;; register starved x86_32 by forcing inputs into registers before reload.
+(define_split
+  [(set (match_operand:MODEF 0 "register_operand")
+	(if_then_else:MODEF (match_operator 1 "ix86_comparison_operator"
+			      [(reg FLAGS_REG) (const_int 0)])
+	  (match_operand:MODEF 2 "nonimmediate_operand")
+	  (match_operand:MODEF 3 "nonimmediate_operand")))]
+  "!TARGET_64BIT && TARGET_80387 && TARGET_CMOVE
+   && TARGET_AVOID_MEM_OPND_FOR_CMOVE
+   && (MEM_P (operands[2]) || MEM_P (operands[3]))
+   && can_create_pseudo_p ()
+   && optimize_insn_for_speed_p ()"
+  [(set (match_dup 0)
+	(if_then_else:MODEF (match_dup 1) (match_dup 2) (match_dup 3)))]
+{
+  if (MEM_P (operands[2]))
+    operands[2] = force_reg (<MODE>mode, operands[2]);
+  if (MEM_P (operands[3]))
+    operands[3] = force_reg (<MODE>mode, operands[3]);
+})
+
+;; Don't do conditional moves with memory inputs
+(define_peephole2
+  [(match_scratch:MODEF 2 "r")
+   (set (match_operand:MODEF 0 "register_and_not_any_fp_reg_operand")
+	(if_then_else:MODEF (match_operator 1 "fcmov_comparison_operator"
+			      [(reg FLAGS_REG) (const_int 0)])
+	  (match_dup 0)
+	  (match_operand:MODEF 3 "memory_operand")))]
+  "(<MODE>mode != DFmode || TARGET_64BIT)
+   && TARGET_80387 && TARGET_CMOVE
+   && TARGET_AVOID_MEM_OPND_FOR_CMOVE
+   && optimize_insn_for_speed_p ()"
+  [(set (match_dup 2) (match_dup 3))
+   (set (match_dup 0)
+	(if_then_else:MODEF (match_dup 1) (match_dup 0) (match_dup 2)))])
+
+(define_peephole2
+  [(match_scratch:MODEF 2 "r")
+   (set (match_operand:MODEF 0 "register_and_not_any_fp_reg_operand")
+	(if_then_else:MODEF (match_operator 1 "fcmov_comparison_operator"
+			      [(reg FLAGS_REG) (const_int 0)])
+	  (match_operand:MODEF 3 "memory_operand")
+	  (match_dup 0)))]
+  "(<MODE>mode != DFmode || TARGET_64BIT)
+   && TARGET_80387 && TARGET_CMOVE
+   && TARGET_AVOID_MEM_OPND_FOR_CMOVE
+   && optimize_insn_for_speed_p ()"
+  [(set (match_dup 2) (match_dup 3))
+   (set (match_dup 0)
+	(if_then_else:MODEF (match_dup 1) (match_dup 2) (match_dup 0)))])
+
 ;; All moves in XOP pcmov instructions are 128 bits and hence we restrict
 ;; the scalar versions to have only XMM registers as operands.
 

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2012-12-19 16:08 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-12-12 11:27 [PATCH,x86] Fix combine for condditional instructions Yuri Rumyantsev
2012-12-12 11:38 ` Uros Bizjak
2012-12-12 11:47   ` Yuri Rumyantsev
2012-12-12 12:17     ` Richard Biener
2012-12-12 12:35       ` Yuri Rumyantsev
2012-12-12 11:44 ` Richard Biener
2012-12-12 12:55   ` Uros Bizjak
2012-12-12 13:10     ` Richard Biener
2012-12-12 14:39       ` Yuri Rumyantsev
2012-12-12 14:46         ` Richard Biener
2012-12-12 18:33           ` Uros Bizjak
2012-12-12 19:32             ` Richard Henderson
2012-12-13 14:23               ` Yuri Rumyantsev
2012-12-13 14:27                 ` Uros Bizjak
2012-12-13 14:40                   ` Uros Bizjak
2012-12-13 15:02                     ` Yuri Rumyantsev
2012-12-13 17:03                       ` Uros Bizjak
2012-12-14 10:47                         ` Yuri Rumyantsev
2012-12-19 16:08                           ` Uros Bizjak
2012-12-13 14:36                 ` Richard Biener
2012-12-13  9:51             ` Richard Biener
2012-12-13 10:20               ` Uros Bizjak
2012-12-13 10:33                 ` Richard Biener
2012-12-13 19:33                 ` Jan Hubicka
2012-12-13 19:30               ` Jan Hubicka
