public inbox for gcc-bugs@sourceware.org
* [Bug middle-end/67438] New: [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation
From: afomin.mailbox at gmail dot com @ 2015-09-02 17:29 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438

Bug ID: 67438
Summary: [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation
Product: gcc
Version: 6.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: afomin.mailbox at gmail dot com
CC: izamyatin at gmail dot com, rguenth at gcc dot gnu.org, ysrumyan at gmail dot com
Target Milestone: ---
Target: i686

For the loop in the attached test compiled with -O3 -m32 -march=slm
-ftree-loop-if-convert (in fact, -march=slm can be omitted resulting in a
greater number of insns) after r225249 we generate 28 insns instead of 23 insns
for r225248. That revision moves some simplification patterns from fold-const.c
to match.pd, and I've noticed that relocating back ~X op ~Y -> Y op X from
match.pd to fold-const.c fixes the problem.

r225248:
  movzbl (%ebx),%ecx
  add $0x3,%ebx
  movzbl -0x2(%ebx),%edx
  not %ecx
  movzbl -0x1(%ebx),%eax
  not %edx
  mov %cl,(%esi)
  mov %dl,0x1(%esi)
  not %eax
  cmp %al,%cl
  mov %eax,%edi
  mov %al,0x2(%esi)
  mov %eax,%ebp
  cmovle %ecx,%edi
  cmp %al,%dl
  cmovle %edx,%ebp
  add $0x4,%esi
  cmp %dl,%cl
  mov %ebp,%eax
  cmovl %edi,%eax
  cmp (%esp),%ebx
  mov %al,-0x1(%esi)
  jne 30 <foo+0x30>

r225249:
  movzbl (%edi),%eax
  add $0x3,%edi
  movzbl -0x2(%edi),%edx
  mov %al,0x2(%esp)
  mov %eax,%ebx
  movzbl -0x1(%edi),%eax
  not %ebx
  mov %dl,0x3(%esp)
  mov %edx,%ecx
  mov %bl,0x0(%ebp)
  not %ecx
  mov %cl,0x1(%ebp)
  not %eax
  cmp %al,%bl
  mov %eax,%esi
  mov %al,0x2(%ebp)
  cmovle %ebx,%esi
  cmp %al,%cl
  mov %esi,%edx
  mov %eax,%esi
  cmovle %ecx,%esi
  add $0x4,%ebp
  movzbl 0x3(%esp),%ecx
  cmp %cl,0x2(%esp)
  cmovle %esi,%edx
  cmp 0x4(%esp),%edi
  mov %dl,-0x1(%ebp)
  jne 30 <foo+0x30>
* [Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation
From: afomin.mailbox at gmail dot com @ 2015-09-02 17:31 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438

Alexander Fomin <afomin.mailbox at gmail dot com> changed:

What    |Removed                     |Added
----------------------------------------------------------------------------
CC      |                            |afomin.mailbox at gmail dot com

--- Comment #1 from Alexander Fomin <afomin.mailbox at gmail dot com> ---
Created attachment 36287
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36287&action=edit
Testcase

A reproducer
* [Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation
From: pinskia at gcc dot gnu.org @ 2015-09-02 17:47 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
This just looks like increased register pressure. Maybe some :s should be used
in match.pd or maybe not.
* [Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86
From: pinskia at gcc dot gnu.org @ 2015-09-02 17:48 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

What    |Removed                     |Added
----------------------------------------------------------------------------
Target  |i686                        |i?86-*-*
Summary |[6 Regression] ~X op ~Y     |[6 Regression] ~X op ~Y
        |pattern relocation causes   |pattern relocation causes
        |loop performance            |loop performance
        |degradation                 |degradation on 32bit x86

--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I bet this code is slightly faster on a machine with some extra registers like
even x86_64 or aarch64 or PowerPC or MIPS.
* [Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86
From: miyuki at gcc dot gnu.org @ 2015-09-03 3:36 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438

Mikhail Maltsev <miyuki at gcc dot gnu.org> changed:

What    |Removed                     |Added
----------------------------------------------------------------------------
CC      |                            |miyuki at gcc dot gnu.org

--- Comment #4 from Mikhail Maltsev <miyuki at gcc dot gnu.org> ---
I looked at gimple dumps. The only difference looks like this. In the "good"
revision after forwprop1:

  <bb 3>:
  _13 = *in_2;
  a_14 = ~_13;
  _17 = MEM[(char *)in_2 + 1B];
  b_18 = ~_17;
  in_20 = &MEM[(void *)in_2 + 3B];
  _21 = MEM[(char *)in_2 + 2B];
  c_22 = ~_21;
  if (a_14 < b_18)
    goto <bb 4>;
  else
    goto <bb 5>;

In the "bad" revision this basic block is simplified:

  <bb 3>:
  _13 = *in_2;
  a_14 = ~_13;
  _17 = MEM[(char *)in_2 + 1B];
  b_18 = ~_17;
  in_20 = &MEM[(void *)in_2 + 3B];
  _21 = MEM[(char *)in_2 + 2B];
  c_22 = ~_21;
  if (_13 > _17)
    goto <bb 4>;
  else
    goto <bb 5>;

Next BB's are:

  <bb 4>:
  d_23 = MIN_EXPR <a_14, c_22>;

  <bb 5>:
  d_24 = MIN_EXPR <b_18, c_22>;

  <bb 6>:
  # d_4 = PHI <d_23(4), d_24(5)>

The condition of "if" is not altered throughout all other passes (it gets
if-converted and vectorized).

Another small difference: VRP adds assertions in bb 4 (a_12 lt_expr b_14, b_14
gt_expr a_12) and bb5 (a_12 ge_expr b_14, b_14 le_expr a_12).
For some reason this does not happen in the "bad" revision.

As I understand, the problem is that if we do not fold the condition, values
_13 and _17 are killed after we calculate a_14 = ~_13 and b_18 = ~_17. But if
we do fold, they are still live (because they are used in the condition), thus,
register pressure increases.
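
[Editorial sketch] The liveness argument above is easier to follow against source code. The function below is a hypothetical reconstruction inferred from the GIMPLE dumps quoted in this thread; it is not attachment 36287, which is not reproduced here. The name `foo_sketch`, the use of `signed char`, and the simplified loop control are assumptions.

```c
#include <stddef.h>

/* Hypothetical reconstruction of the loop body, based only on the dumps in
   this thread. signed char is assumed because the x86 code uses signed
   conditional moves (cmovle/cmovl). */
static void foo_sketch(const signed char *in, signed char *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        signed char a = (signed char)~in[0];
        signed char b = (signed char)~in[1];
        signed char c = (signed char)~in[2];
        signed char d;

        /* r225248 keeps this as a < b, so in[0] and in[1] are dead once a
           and b are computed. r225249 folds it to in[0] > in[1], which keeps
           the two loaded values live alongside a, b and c. */
        if (a < b)
            d = a < c ? a : c;   /* MIN_EXPR <a, c> */
        else
            d = b < c ? b : c;   /* MIN_EXPR <b, c> */

        out[0] = a;
        out[1] = b;
        out[2] = c;
        out[3] = d;
        in += 3;
        out += 4;
    }
}
```

With the folded condition, five values (the two raw loads plus a, b, c) must be held at the comparison instead of three, which is consistent with the extra stack spills visible in the r225249 listing.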
* [Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86
From: rguenther at suse dot de @ 2015-09-03 8:04 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438

--- Comment #5 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 3 Sep 2015, miyuki at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438
>
> Mikhail Maltsev <miyuki at gcc dot gnu.org> changed:
>
> What    |Removed                     |Added
> ----------------------------------------------------------------------------
> CC      |                            |miyuki at gcc dot gnu.org
>
> --- Comment #4 from Mikhail Maltsev <miyuki at gcc dot gnu.org> ---
> I looked at gimple dumps. The only difference looks like this. In the "good"
> revision after forwprop1:
>
>   <bb 3>:
>   _13 = *in_2;
>   a_14 = ~_13;
>   _17 = MEM[(char *)in_2 + 1B];
>   b_18 = ~_17;
>   in_20 = &MEM[(void *)in_2 + 3B];
>   _21 = MEM[(char *)in_2 + 2B];
>   c_22 = ~_21;
>   if (a_14 < b_18)
>     goto <bb 4>;
>   else
>     goto <bb 5>;
>
> In the "bad" revision this basic block is simplified:
>
>   <bb 3>:
>   _13 = *in_2;
>   a_14 = ~_13;
>   _17 = MEM[(char *)in_2 + 1B];
>   b_18 = ~_17;
>   in_20 = &MEM[(void *)in_2 + 3B];
>   _21 = MEM[(char *)in_2 + 2B];
>   c_22 = ~_21;
>   if (_13 > _17)
>     goto <bb 4>;
>   else
>     goto <bb 5>;
>
> Next BB's are:
>
>   <bb 4>:
>   d_23 = MIN_EXPR <a_14, c_22>;
>
>   <bb 5>:
>   d_24 = MIN_EXPR <b_18, c_22>;
>
>   <bb 6>:
>   # d_4 = PHI <d_23(4), d_24(5)>
>
> The condition of "if" is not altered throughout all other passes (it gets
> if-converted and vectorized).
>
> Another small difference: VRP adds assertions in bb 4 (a_12 lt_expr b_14, b_14
> gt_expr a_12) and bb5 (a_12 ge_expr b_14, b_14 le_expr a_12). For some reason
> this does not happen in the "bad" revision.
>
> As I understand, the problem is that if we do not fold the condition, values
> _13 and _17 are killed after we calculate a_14 = ~_13 and b_18 = ~_17. But if
> we do fold, they are still live (because they are used in the condition), thus,
> register pressure increases.

Yes.

Note that because of :s implementation details "fixing"

/* Fold ~X op ~Y as Y op X.  */
(for cmp (simple_comparison)
 (simplify
  (cmp (bit_not @0) (bit_not @1))
  (cmp @1 @0)))

with :s on the bit_not's is not going to help (because we still allow
a single-stmt result as we are just replacing one with another).

So :s cannot be used to guard against register pressure increase
but only to guard against undoing CSE.

For the case in this bug the user might have written the testcase in
the way we transform it now and thus what is desirable is a pass
that can reduce register pressure by expressing values in a different way.

For the case above, why is a_14 = ~_13 not sunk to the edge
3->4 and b_18 = ~_17 to the edge 3->5?  (yes, this creates
additional BBs)  This would reduce register pressure.

Maybe this kind of scheduling can be considered when register pressure
is high (does -fsched-pressure -fschedule-insns help?)
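
[Editorial sketch] The match.pd pattern quoted above rewrites `~X op ~Y` as `Y op X`. The short C check below illustrates why that rewrite preserves the result for signed char operands: since `~v == -v - 1`, complementing both sides reverses the ordering. The helper name `count_fold_mismatches` is hypothetical, and only three of the comparison operators are exercised; this says nothing about how GCC itself validates the pattern.

```c
#include <limits.h>

/* Exhaustively compares (~x op ~y) with (y op x) over all signed char
   values, for op in {<, <=, ==}. Returns the number of disagreements. */
static int count_fold_mismatches(void)
{
    int mismatches = 0;
    for (int x = SCHAR_MIN; x <= SCHAR_MAX; x++) {
        for (int y = SCHAR_MIN; y <= SCHAR_MAX; y++) {
            /* ~v equals -v - 1, which stays within signed char range. */
            signed char nx = (signed char)~x;
            signed char ny = (signed char)~y;
            if ((nx < ny) != (y < x))
                mismatches++;
            if ((nx <= ny) != (y <= x))
                mismatches++;
            if ((nx == ny) != (y == x))
                mismatches++;
        }
    }
    return mismatches;
}
```

The rewrite is therefore value-preserving; the concern raised in this bug is only that it changes which values must stay live.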
* [Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86
From: miyuki at gcc dot gnu.org @ 2015-09-03 18:00 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438

--- Comment #6 from Mikhail Maltsev <miyuki at gcc dot gnu.org> ---
(In reply to rguenther@suse.de from comment #5)
> For the case above, why is a_14 = ~_13 not sunk to the edge
> 3->4 and b_18 = ~_17 to the edge 3->5? (yes, this creates
> additional BBs) This would reduce register pressure.

I think, because a_14 and b_18 are used in the next bb. Actually I wrote only
part of bb6. The full dump looks like this:

  <bb 6>:
  # d_4 = PHI <d_23(4), d_24(5)>
  out_26 = out_3 + 1;
  *out_3 = a_14;
  out_29 = &MEM[(void *)out_3 + 2B];
  MEM[(char *)out_3 + 1B] = b_18;
  out_32 = &MEM[(void *)out_3 + 3B];
  MEM[(char *)out_3 + 2B] = c_22;
  out_35 = &MEM[(void *)out_3 + 4B];
  MEM[(char *)out_3 + 3B] = d_4;

  <bb 7>:
  # n_1 = PHI <n_6(D)(2), n_10(6)>
  # in_2 = PHI <in_7(D)(2), in_20(6)>
  # out_3 = PHI <out_8(D)(2), out_35(6)>
  n_10 = n_1 + -1;
  if (n_10 != 0)
    goto <bb 3>;
  else
    goto <bb 8>;

  <bb 8>:
  return;

> Maybe this kind of scheduling can be considered when register pressure
> is high (does -fsched-pressure -fschedule-insns help?)

Not much.
With -fsched-pressure -fschedule-insns we generate 2 insns less:

.L7:
  movzbl 0(%ebp), %edi # MEM[base: in_70, offset: 0B], D.1940
  addl $3, %ebp #, in
  movzbl -2(%ebp), %esi # MEM[base: in_70, offset: 1B], D.1940
  movl %edi, %eax # D.1940, a
  movzbl -1(%ebp), %edx # MEM[base: in_30, offset: 4294967295B], MEM[base: in_30, offset: 4294967295B]
  notl %eax # a
  movb %al, (%ebx) # a, MEM[base: out_71, offset: 0B]
  movl %esi, %ecx # D.1940, b
  notl %ecx # b
  movb %cl, 1(%ebx) # b, MEM[base: out_71, offset: 1B]
  notl %edx # c
  movb %dl, 2(%ebx) # c, MEM[base: out_71, offset: 2B]
  cmpb %dl, %al # c, a
  cmovg %edx, %eax # d,, c, d
  cmpb %dl, %cl # c, b
  movb %al, 4(%esp) # tmp277, %sfp
  cmovle %ecx, %edx # b,, d
  movl %esi, %eax # D.1940, D.1940
  movl %edi, %ecx # D.1940, D.1940
  addl $4, %ebx #, out
  cmpb %al, %cl # D.1940, D.1940
  movzbl 4(%esp), %eax # %sfp, d
  cmovg %eax, %edx # d,, d
  cmpl 8(%esp), %ebp # %sfp, in
  movb %dl, -1(%ebx) # d, MEM[base: out_11, offset: 4294967295B]
  jne .L7 #,

I wonder, whether a transformation like this could help:

bb1:
  x = min(a, c)
  goto bb3

bb2:
  y = min(b, c)
  goto bb3

bb3:
  z = phi(x, y)
  // x and y are single-use

--->

bb1:
  x = a
  goto bb3

bb2:
  y = b
  goto bb3

bb3:
  z' = phi(x, y)
  z = min(z', c)

Though if we don't simplify phi(x, y), we would increase register pressure even
more.
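
[Editorial sketch] The proposed transformation moves the common `min(..., c)` out of both arms and applies it once after the merge. The C functions below model the two shapes as plain expressions and compare them over a small range; the names `d_before`, `d_after` and `count_sink_mismatches` are hypothetical, and the branch condition `a < b` is taken from the dumps earlier in the thread.

```c
static int min_int(int x, int y)
{
    return x < y ? x : y;
}

/* Shape before: each arm computes its own min, the phi merges the results. */
static int d_before(int a, int b, int c)
{
    return (a < b) ? min_int(a, c) : min_int(b, c);
}

/* Shape after: the phi selects a or b, and a single min follows the merge. */
static int d_after(int a, int b, int c)
{
    int z = (a < b) ? a : b;
    return min_int(z, c);
}

/* Counts inputs in [lo, hi]^3 where the two shapes disagree. */
static int count_sink_mismatches(int lo, int hi)
{
    int mismatches = 0;
    for (int a = lo; a <= hi; a++)
        for (int b = lo; b <= hi; b++)
            for (int c = lo; c <= hi; c++)
                if (d_before(a, b, c) != d_after(a, b, c))
                    mismatches++;
    return mismatches;
}
```

The two shapes agree for any branch condition, since the min is applied to whichever operand the branch selects. With the condition `a < b`, the selected value is itself `min(a, b)`, so the merged form amounts to `min(min(a, b), c)`.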
* [Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86
From: afomin.mailbox at gmail dot com @ 2015-09-07 12:35 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438

--- Comment #7 from Alexander Fomin <afomin.mailbox at gmail dot com> ---
Looks like a cost model should be introduced to avoid such kind of
transformations for targets with HW min/max implementation.
* [Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86
From: graham.stott at btinternet dot com @ 2015-09-07 12:57 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438

--- Comment #8 from graham.stott at btinternet dot com ---
Sent from Samsung Mobile on O2

-------- Original message --------
From: "afomin.mailbox at gmail dot com" <gcc-bugzilla@gcc.gnu.org>
Date: 07/09/2015 13:35 (GMT+00:00)
To: gcc-bugs@gcc.gnu.org
Subject: [Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438

--- Comment #7 from Alexander Fomin <afomin.mailbox at gmail dot com> ---
Looks like a cost model should be introduced to avoid such kind of
transformations for targets with HW min/max implementation.
* [Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86
From: rguenth at gcc dot gnu.org @ 2015-09-14 11:52 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438

Richard Biener <rguenth at gcc dot gnu.org> changed:

What             |Removed                     |Added
----------------------------------------------------------------------------
Target Milestone |---                         |6.0
* [Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86
From: pinskia at gcc dot gnu.org @ 2023-10-15 23:08 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67438

--- Comment #15 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Yuri Rumyantsev from comment #11)
> Richard proposed to use the same simplification for min/max operations but
> in original test-case nested min/max operation (min(x,min(y,z)) or multi
> operand min/max (min(x,y,z)) are not recognized by gcc (Note that icc does
> such transformation) and so this won't help since we have the same register
> pressure issue:
>   c = ~r;
>   m = ~g;
>   y = ~b;
>   k = min(c, m, y);
>   *out++ = c - k;
>   *out++ = m - k;
>   *out++ = y - k;
>   *out++ = k;

This is now recognized since GCC 13 (by r13-1950-g9bb19e143cfe88 and improved
for GCC 14 by r14-337-gc43819a9b4cdaa).

There is still a missing MIN/MAX detection:

int f(int a, int b, int c)
{
  int at = ~a;
  int bt = ~b;
  int ct = ~c;
  int t = a < b ? at : bt;
  return t;
}

This is not detected until phiopt4. I will file a bug about that.
Once that is fixed, I think we might be able to remove the single_use again.
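
[Editorial sketch] The function `f` in the comment above selects `~a` when `a < b` and `~b` otherwise, which is the same value as `~min(a, b)` (equivalently `max(~a, ~b)`). The check below illustrates that equivalence in C. The helpers `f_as_min` and `count_f_mismatches` are hypothetical names, and the equivalence is an observation about the source code only; the thread does not state what form GCC's phiopt pass produces.

```c
/* The function from the comment above; the unused ct is kept as posted. */
static int f(int a, int b, int c)
{
    int at = ~a;
    int bt = ~b;
    int ct = ~c;
    (void)ct; /* silence the unused-variable warning */
    int t = a < b ? at : bt;
    return t;
}

/* The same value written as a complement of an explicit minimum. */
static int f_as_min(int a, int b)
{
    int m = a < b ? a : b;
    return ~m;
}

/* Counts inputs in [lo, hi]^2 where the two forms disagree. */
static int count_f_mismatches(int lo, int hi)
{
    int mismatches = 0;
    for (int a = lo; a <= hi; a++)
        for (int b = lo; b <= hi; b++)
            if (f(a, b, 0) != f_as_min(a, b))
                mismatches++;
    return mismatches;
}
```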
Thread overview: 11+ messages

2015-09-02 17:29 [Bug middle-end/67438] New: [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation afomin.mailbox at gmail dot com
2015-09-02 17:31 ` [Bug middle-end/67438] " afomin.mailbox at gmail dot com
2015-09-02 17:47 ` pinskia at gcc dot gnu.org
2015-09-02 17:48 ` [Bug middle-end/67438] [6 Regression] ~X op ~Y pattern relocation causes loop performance degradation on 32bit x86 pinskia at gcc dot gnu.org
2015-09-03  3:36 ` miyuki at gcc dot gnu.org
2015-09-03  8:04 ` rguenther at suse dot de
2015-09-03 18:00 ` miyuki at gcc dot gnu.org
2015-09-07 12:35 ` afomin.mailbox at gmail dot com
2015-09-07 12:57 ` graham.stott at btinternet dot com
2015-09-14 11:52 ` rguenth at gcc dot gnu.org
2023-10-15 23:08 ` pinskia at gcc dot gnu.org