* [PATCH] combine: Allow combining two insns to two insns
@ 2018-07-24 17:18 Segher Boessenkool
  2018-07-24 21:13 ` Jeff Law
  ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Segher Boessenkool @ 2018-07-24 17:18 UTC (permalink / raw)
  To: gcc-patches; +Cc: Segher Boessenkool

This patch allows combine to combine two insns into two.  This helps
in many cases, by reducing instruction path length, and also by allowing
further combinations to happen.  PR85160 is a typical example of code
that it can improve.

This patch does not allow such combinations if either of the original
instructions was a simple move instruction.  In those cases, combining
the two instructions increases register pressure without improving the
code.  With this move test, register pressure no longer increases
noticeably as far as I can tell.

(At first I also didn't allow either of the resulting insns to be a
move instruction.  But that is actually a very good thing to have, as
should have been obvious.)

Tested for many months; tested on about 30 targets.

I'll commit this later this week if there are no objections.


Segher


2018-07-24  Segher Boessenkool  <segher@kernel.crashing.org>

	PR rtl-optimization/85160
	* combine.c (is_just_move): New function.
	(try_combine): Allow combining two instructions into two if neither
	of the original instructions was a move.

---
 gcc/combine.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/gcc/combine.c b/gcc/combine.c
index cfe0f19..d64e84d 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -2604,6 +2604,17 @@ can_split_parallel_of_n_reg_sets (rtx_insn *insn, int n)
   return true;
 }
 
+/* Return whether X is just a single set, with the source
+   a general_operand.  */
+static bool
+is_just_move (rtx x)
+{
+  if (INSN_P (x))
+    x = PATTERN (x);
+
+  return (GET_CODE (x) == SET && general_operand (SET_SRC (x), VOIDmode));
+}
+
 /* Try to combine the insns I0, I1 and I2 into I3.
    Here I0, I1 and I2 appear earlier than I3.
    I0 and I1 can be zero; then we combine just I2 into I3, or I1 and I2 into
@@ -2668,6 +2679,7 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
   int swap_i2i3 = 0;
   int split_i2i3 = 0;
   int changed_i3_dest = 0;
+  bool i2_was_move = false, i3_was_move = false;
 
   int maxreg;
   rtx_insn *temp_insn;
@@ -3059,6 +3071,10 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
       return 0;
     }
 
+  /* Record whether i2 and i3 are trivial moves.  */
+  i2_was_move = is_just_move (i2);
+  i3_was_move = is_just_move (i3);
+
   /* Record whether I2DEST is used in I2SRC and similarly for the other
      cases.  Knowing this will help in register status updating below.  */
   i2dest_in_i2src = reg_overlap_mentioned_p (i2dest, i2src);
@@ -4014,8 +4030,10 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
       && XVECLEN (newpat, 0) == 2
       && GET_CODE (XVECEXP (newpat, 0, 0)) == SET
       && GET_CODE (XVECEXP (newpat, 0, 1)) == SET
-      && (i1 || set_noop_p (XVECEXP (newpat, 0, 0))
-	  || set_noop_p (XVECEXP (newpat, 0, 1)))
+      && (i1
+	  || set_noop_p (XVECEXP (newpat, 0, 0))
+	  || set_noop_p (XVECEXP (newpat, 0, 1))
+	  || (!i2_was_move && !i3_was_move))
       && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 0))) != ZERO_EXTRACT
       && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 0))) != STRICT_LOW_PART
       && GET_CODE (SET_DEST (XVECEXP (newpat, 0, 1))) != ZERO_EXTRACT
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH] combine: Allow combining two insns to two insns
  2018-07-24 17:18 [PATCH] combine: Allow combining two insns to two insns Segher Boessenkool
@ 2018-07-24 21:13 ` Jeff Law
  2018-07-25  8:28 ` Richard Biener
  ` (3 subsequent siblings)
  4 siblings, 0 replies; 18+ messages in thread
From: Jeff Law @ 2018-07-24 21:13 UTC (permalink / raw)
  To: Segher Boessenkool, gcc-patches

On 07/24/2018 11:18 AM, Segher Boessenkool wrote:
> This patch allows combine to combine two insns into two.  This helps
> in many cases, by reducing instruction path length, and also allowing
> further combinations to happen.  PR85160 is a typical example of code
> that it can improve.
>
> [...snip...]
>
> 2018-07-24  Segher Boessenkool  <segher@kernel.crashing.org>
>
> 	PR rtl-optimization/85160
> 	* combine.c (is_just_move): New function.
> 	(try_combine): Allow combining two instructions into two if neither of
> 	the original instructions was a move.
I've had several instances where a 2->2 combination would be useful
through the years.  I didn't save any of those examples though...

Good to see the limitation being addressed.

jeff

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH] combine: Allow combining two insns to two insns
  2018-07-24 17:18 [PATCH] combine: Allow combining two insns to two insns Segher Boessenkool
  2018-07-24 21:13 ` Jeff Law
@ 2018-07-25  8:28 ` Richard Biener
  2018-07-25  9:50   ` Segher Boessenkool
  2018-07-31 12:39   ` H.J. Lu
  2018-07-25 13:47 ` David Malcolm
  ` (2 subsequent siblings)
  4 siblings, 2 replies; 18+ messages in thread
From: Richard Biener @ 2018-07-25  8:28 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: GCC Patches

On Tue, Jul 24, 2018 at 7:18 PM Segher Boessenkool
<segher@kernel.crashing.org> wrote:
>
> This patch allows combine to combine two insns into two.  This helps
> in many cases, by reducing instruction path length, and also allowing
> further combinations to happen.  PR85160 is a typical example of code
> that it can improve.
>
> [...snip...]
>
> Tested for many months; tested on about 30 targets.
>
> I'll commit this later this week if there are no objections.

Sounds good - but, _any_ testcase?  Please! ;)

Richard.

> [...snip: quoted ChangeLog and patch...]
* Re: [PATCH] combine: Allow combining two insns to two insns
  2018-07-25  8:28 ` Richard Biener
@ 2018-07-25  9:50 ` Segher Boessenkool
  2018-07-25 10:37   ` Richard Biener
  2018-07-31 12:39 ` H.J. Lu
  1 sibling, 1 reply; 18+ messages in thread
From: Segher Boessenkool @ 2018-07-25  9:50 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches

On Wed, Jul 25, 2018 at 10:28:30AM +0200, Richard Biener wrote:
> On Tue, Jul 24, 2018 at 7:18 PM Segher Boessenkool
> <segher@kernel.crashing.org> wrote:
> >
> > [...snip...]
> >
> > I'll commit this later this week if there are no objections.
>
> Sounds good - but, _any_ testcase?  Please! ;)

I only have target-specific ones.  Most *simple* ones will already be
optimised by current code (via 3->2 combination).  But I've now got one
that trunk does not optimise, and it can be confirmed by looking at
the resulting machine code even (no need to look at the combine dump,
which is a very good thing).  And it is a proper thing to test even: it
tests that some source is compiled to properly optimised machine code.

Any other kind of testcase is worse than useless, of course.

Testing that it results in working code isn't very feasible or useful
either.


Segher

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH] combine: Allow combining two insns to two insns
  2018-07-25  9:50 ` Segher Boessenkool
@ 2018-07-25 10:37 ` Richard Biener
  0 siblings, 0 replies; 18+ messages in thread
From: Richard Biener @ 2018-07-25 10:37 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: GCC Patches

On Wed, Jul 25, 2018 at 11:50 AM Segher Boessenkool
<segher@kernel.crashing.org> wrote:
>
> On Wed, Jul 25, 2018 at 10:28:30AM +0200, Richard Biener wrote:
> > [...snip...]
> >
> > Sounds good - but, _any_ testcase?  Please! ;)
>
> I only have target-specific ones.

Works for me.

> Most *simple* ones will already be
> optimised by current code (via 3->2 combination).
> [...snip...]
>
>
> Segher
* Re: [PATCH] combine: Allow combining two insns to two insns
  2018-07-25  8:28 ` Richard Biener
  2018-07-25  9:50 ` Segher Boessenkool
@ 2018-07-31 12:39 ` H.J. Lu
  2018-07-31 14:08   ` Segher Boessenkool
  1 sibling, 1 reply; 18+ messages in thread
From: H.J. Lu @ 2018-07-31 12:39 UTC (permalink / raw)
  To: Richard Biener; +Cc: Segher Boessenkool, GCC Patches

On Wed, Jul 25, 2018 at 1:28 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Tue, Jul 24, 2018 at 7:18 PM Segher Boessenkool
> <segher@kernel.crashing.org> wrote:
>>
>> [...snip...]
>>
>> I'll commit this later this week if there are no objections.
>
> Sounds good - but, _any_ testcase?  Please! ;)
>

Here is a testcase.  For

---
#define N 16
float f[N];
double d[N];
int n[N];

__attribute__((noinline)) void
f3 (void)
{
  int i;
  for (i = 0; i < N; i++)
    d[i] = f[i];
}
---

r263067 improved -O3 -mavx2 -mtune=generic -m64 from

	.cfi_startproc
	vmovaps	f(%rip), %xmm2
	vmovaps	f+32(%rip), %xmm3
	vinsertf128	$0x1, f+16(%rip), %ymm2, %ymm0
	vcvtps2pd	%xmm0, %ymm1
	vextractf128	$0x1, %ymm0, %xmm0
	vmovaps	%xmm1, d(%rip)
	vextractf128	$0x1, %ymm1, d+16(%rip)
	vcvtps2pd	%xmm0, %ymm0
	vmovaps	%xmm0, d+32(%rip)
	vextractf128	$0x1, %ymm0, d+48(%rip)
	vinsertf128	$0x1, f+48(%rip), %ymm3, %ymm0
	vcvtps2pd	%xmm0, %ymm1
	vextractf128	$0x1, %ymm0, %xmm0
	vmovaps	%xmm1, d+64(%rip)
	vextractf128	$0x1, %ymm1, d+80(%rip)
	vcvtps2pd	%xmm0, %ymm0
	vmovaps	%xmm0, d+96(%rip)
	vextractf128	$0x1, %ymm0, d+112(%rip)
	vzeroupper
	ret
	.cfi_endproc

to

	.cfi_startproc
	vcvtps2pd	f(%rip), %ymm0
	vmovaps	%xmm0, d(%rip)
	vextractf128	$0x1, %ymm0, d+16(%rip)
	vcvtps2pd	f+16(%rip), %ymm0
	vmovaps	%xmm0, d+32(%rip)
	vextractf128	$0x1, %ymm0, d+48(%rip)
	vcvtps2pd	f+32(%rip), %ymm0
	vextractf128	$0x1, %ymm0, d+80(%rip)
	vmovaps	%xmm0, d+64(%rip)
	vcvtps2pd	f+48(%rip), %ymm0
	vextractf128	$0x1, %ymm0, d+112(%rip)
	vmovaps	%xmm0, d+96(%rip)
	vzeroupper
	ret
	.cfi_endproc

This is:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86752

H.J.

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH] combine: Allow combining two insns to two insns
  2018-07-31 12:39 ` H.J. Lu
@ 2018-07-31 14:08 ` Segher Boessenkool
  0 siblings, 0 replies; 18+ messages in thread
From: Segher Boessenkool @ 2018-07-31 14:08 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Richard Biener, GCC Patches

On Tue, Jul 31, 2018 at 05:39:37AM -0700, H.J. Lu wrote:
> For
>
> ---
> #define N 16
> float f[N];
> double d[N];
> int n[N];
>
> __attribute__((noinline)) void
> f3 (void)
> {
>   int i;
>   for (i = 0; i < N; i++)
>     d[i] = f[i];
> }
> ---
>
> r263067 improved -O3 -mavx2 -mtune=generic -m64 from
>
> [...snip: before/after assembly...]

I cannot really read AVX, but that looks like better code alright :-)


Segher

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH] combine: Allow combining two insns to two insns
  2018-07-24 17:18 [PATCH] combine: Allow combining two insns to two insns Segher Boessenkool
  2018-07-24 21:13 ` Jeff Law
  2018-07-25  8:28 ` Richard Biener
@ 2018-07-25 13:47 ` David Malcolm
  2018-07-25 14:19   ` Segher Boessenkool
  2018-07-30 16:09 ` Segher Boessenkool
  2018-08-02  5:52 ` Toon Moene
  4 siblings, 1 reply; 18+ messages in thread
From: David Malcolm @ 2018-07-25 13:47 UTC (permalink / raw)
  To: Segher Boessenkool, gcc-patches

On Tue, 2018-07-24 at 17:18 +0000, Segher Boessenkool wrote:
> This patch allows combine to combine two insns into two.  This helps
> in many cases, by reducing instruction path length, and also allowing
> further combinations to happen.  PR85160 is a typical example of code
> that it can improve.
>
> [...snip...]
>
> +/* Return whether X is just a single set, with the source
> +   a general_operand.  */
> +static bool
> +is_just_move (rtx x)
> +{
> +  if (INSN_P (x))
> +    x = PATTERN (x);
> +
> +  return (GET_CODE (x) == SET && general_operand (SET_SRC (x),
> VOIDmode));
> +}

If I'm reading it right, the patch only calls this function on i2 and
i3, which are known to be rtx_insn *, rather than just rtx.

Hence the only way in which GET_CODE (x) can be SET is if the INSN_P
pattern test sets x to PATTERN (x) immediately above: it can't be a
SET otherwise - but this isn't obvious from the code.

Can this function take an rtx_insn * instead?  Maybe something like:

/* Return whether INSN's pattern is just a single set, with the source
   a general_operand.  */

static bool
is_just_move_p (rtx_insn *insn)
{
  if (!INSN_P (insn))
    return false;

  rtx x = PATTERN (insn);
  return (GET_CODE (x) == SET && general_operand (SET_SRC (x), VOIDmode));
}

or similar?

[...snip...]

Thanks; I hope this is constructive.
Dave

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH] combine: Allow combining two insns to two insns
  2018-07-25 13:47 ` David Malcolm
@ 2018-07-25 14:19 ` Segher Boessenkool
  0 siblings, 0 replies; 18+ messages in thread
From: Segher Boessenkool @ 2018-07-25 14:19 UTC (permalink / raw)
  To: David Malcolm; +Cc: gcc-patches

On Wed, Jul 25, 2018 at 09:47:31AM -0400, David Malcolm wrote:
> > +/* Return whether X is just a single set, with the source
> > +   a general_operand.  */
> > +static bool
> > +is_just_move (rtx x)
> > +{
> > +  if (INSN_P (x))
> > +    x = PATTERN (x);
> > +
> > +  return (GET_CODE (x) == SET && general_operand (SET_SRC (x),
> > VOIDmode));
> > +}
>
> If I'm reading it right, the patch only calls this function on i2 and
> i3, which are known to be rtx_insn *, rather than just rtx.

I used to also have is_just_move (XVECEXP (newpat, 0, 0)) etc.; during
most of combine you do not have instructions, just patterns.


Segher

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH] combine: Allow combining two insns to two insns
  2018-07-24 17:18 [PATCH] combine: Allow combining two insns to two insns Segher Boessenkool
  ` (2 preceding siblings ...)
  2018-07-25 13:47 ` David Malcolm
@ 2018-07-30 16:09 ` Segher Boessenkool
  2018-07-31 12:34   ` Christophe Lyon
  2018-08-02  5:52 ` Toon Moene
  4 siblings, 1 reply; 18+ messages in thread
From: Segher Boessenkool @ 2018-07-30 16:09 UTC (permalink / raw)
  To: gcc-patches

On Tue, Jul 24, 2018 at 05:18:41PM +0000, Segher Boessenkool wrote:
> This patch allows combine to combine two insns into two.  This helps
> in many cases, by reducing instruction path length, and also allowing
> further combinations to happen.  PR85160 is a typical example of code
> that it can improve.
>
> [...snip...]
>
> I'll commit this later this week if there are no objections.

Done now, with the testcase at
https://gcc.gnu.org/ml/gcc-patches/2018-07/msg01856.html .


Segher

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH] combine: Allow combining two insns to two insns
  2018-07-30 16:09 ` Segher Boessenkool
@ 2018-07-31 12:34 ` Christophe Lyon
  2018-07-31 12:59   ` Richard Sandiford
  2018-07-31 13:57   ` Segher Boessenkool
  0 siblings, 2 replies; 18+ messages in thread
From: Christophe Lyon @ 2018-07-31 12:34 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc Patches

On Mon, 30 Jul 2018 at 18:09, Segher Boessenkool
<segher@kernel.crashing.org> wrote:
>
> On Tue, Jul 24, 2018 at 05:18:41PM +0000, Segher Boessenkool wrote:
> > [...snip...]
> >
> > I'll commit this later this week if there are no objections.
>
> Done now, with the testcase at
> https://gcc.gnu.org/ml/gcc-patches/2018-07/msg01856.html .
>

Hi,

Since this was committed, I've noticed regressions
on aarch64:
FAIL: gcc.dg/zero_bits_compound-1.c scan-assembler-not \\(and:

on arm-none-linux-gnueabi:
FAIL: gfortran.dg/actual_array_constructor_1.f90 -O1 execution test

On aarch64, I've also noticed a few other regressions, but I'm not yet
100% sure they are caused by this patch (bisect running):
gcc.target/aarch64/ashltidisi.c scan-assembler-times asr 4
gcc.target/aarch64/sve/var_stride_2.c -march=armv8.2-a+sve scan-assembler-times \\tadd\\tx[0-9]+, x[0-9]+, x[0-9]+, lsl 10\\n 2
gcc.target/aarch64/sve/var_stride_4.c -march=armv8.2-a+sve scan-assembler-times \\tlsl\\tx[0-9]+, x[0-9]+, 10\\n 2
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmeq\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 7
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmeq\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 14
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmeq\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 5
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmeq\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 10
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmgt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmgt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmgt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmgt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmlt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmlt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmlt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmlt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmne\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmne\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmne\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmne\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmuo\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 252
gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmuo\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 180
gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmeq\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0 14
gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmeq\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d 28
gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmeq\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0 10
gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmeq\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s 20
gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0 21
gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d 42
gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0 15
gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s 30
gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmgt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0 28
gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmgt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d 56
gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmgt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0 20
gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmgt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s 40
gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0 21
gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d 42
gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0 15
gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s 30
gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmlt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0 28
gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmlt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d 56
gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmlt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0 20
gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmlt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s 40
gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmne\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0 7
gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmne\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d 14
gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmne\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0 5
gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmne\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s 10
gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmuo\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d 63
gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmuo\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s 45

>
> Segher

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH] combine: Allow combining two insns to two insns
  2018-07-31 12:34 ` Christophe Lyon
@ 2018-07-31 12:59 ` Richard Sandiford
  2018-07-31 13:57 ` Segher Boessenkool
  1 sibling, 0 replies; 18+ messages in thread
From: Richard Sandiford @ 2018-07-31 12:59 UTC (permalink / raw)
To: Christophe Lyon; +Cc: Segher Boessenkool, gcc Patches

Christophe Lyon <christophe.lyon@linaro.org> writes:
> On Mon, 30 Jul 2018 at 18:09, Segher Boessenkool
> <segher@kernel.crashing.org> wrote:
>>
>> On Tue, Jul 24, 2018 at 05:18:41PM +0000, Segher Boessenkool wrote:
>> > This patch allows combine to combine two insns into two.  This helps
>> > in many cases, by reducing instruction path length, and also allowing
>> > further combinations to happen.  PR85160 is a typical example of code
>> > that it can improve.
>> >
>> > This patch does not allow such combinations if either of the original
>> > instructions was a simple move instruction.  In those cases combining
>> > the two instructions increases register pressure without improving the
>> > code.  With this move test register pressure does no longer increase
>> > noticably as far as I can tell.
>> >
>> > (At first I also didn't allow either of the resulting insns to be a
>> > move instruction.  But that is actually a very good thing to have, as
>> > should have been obvious).
>> >
>> > Tested for many months; tested on about 30 targets.
>> >
>> > I'll commit this later this week if there are no objections.
>>
>> Done now, with the testcase at
>> https://gcc.gnu.org/ml/gcc-patches/2018-07/msg01856.html .
>>
>
> Hi,
>
> Since this was committed, I've noticed regressions
> on aarch64:
> FAIL: gcc.dg/zero_bits_compound-1.c scan-assembler-not \\(and:
>
> on arm-none-linux-gnueabi
> FAIL: gfortran.dg/actual_array_constructor_1.f90 -O1 execution test
>
> On aarch64, I've also noticed a few others regressions but I'm not yet
> 100% sure it's caused by this patch (bisect running):
> gcc.target/aarch64/ashltidisi.c scan-assembler-times asr 4
> gcc.target/aarch64/sve/var_stride_2.c -march=armv8.2-a+sve scan-assembler-times \\tadd\\tx[0-9]+, x[0-9]+, x[0-9]+, lsl 10\\n 2
> gcc.target/aarch64/sve/var_stride_4.c -march=armv8.2-a+sve scan-assembler-times \\tlsl\\tx[0-9]+, x[0-9]+, 10\\n 2
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmeq\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 7
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmeq\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 14
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmeq\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 5
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmeq\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 10
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmgt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmgt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmgt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmgt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmlt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmlt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmlt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmlt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmne\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0\\n 21
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmne\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 42
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmne\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0\\n 15
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmne\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 30
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmuo\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d\\n 252
> gcc.target/aarch64/sve/vcond_4.c -march=armv8.2-a+sve scan-assembler-times \\tfcmuo\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s\\n 180
> gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmeq\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0 14
> gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmeq\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d 28
> gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmeq\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0 10
> gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmeq\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s 20
> gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0 21
> gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmge\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d 42
> gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0 15
> gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmge\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s 30
> gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmgt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0 28
> gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmgt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d 56
> gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmgt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0 20
> gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmgt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s 40
> gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0 21
> gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmle\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d 42
> gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0 15
> gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmle\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s 30
> gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmlt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0 28
> gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmlt\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d 56
> gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmlt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0 20
> gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmlt\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s 40
> gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmne\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, #0\\.0 7
> gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmne\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d 14
> gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmne\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, #0\\.0 5
> gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmne\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s 10
> gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmuo\\tp[0-9]+\\.d, p[0-7]/z, z[0-9]+\\.d, z[0-9]+\\.d 63
> gcc.target/aarch64/sve/vcond_5.c -march=armv8.2-a+sve scan-assembler-times \\tfcmuo\\tp[0-9]+\\.s, p[0-7]/z, z[0-9]+\\.s, z[0-9]+\\.s 45

The SVE failures were caused by it.  What combine is doing is definitely
valid though, since it's converting two dependent instructions into two
independent instructions of equal cost.  I think the fix would be to
have proper support for conditional comparisons, but that's not a
short-term thing.  I've filed PR86753 in the meantime and will probably
XFAIL the tests for now.

Thanks,
Richard
^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH] combine: Allow combining two insns to two insns
  2018-07-31 12:34 ` Christophe Lyon
  2018-07-31 12:59 ` Richard Sandiford
@ 2018-07-31 13:57 ` Segher Boessenkool
  2018-07-31 15:37 ` Richard Earnshaw (lists)
  2018-08-01 8:27 ` Christophe Lyon
  1 sibling, 2 replies; 18+ messages in thread
From: Segher Boessenkool @ 2018-07-31 13:57 UTC (permalink / raw)
To: Christophe Lyon; +Cc: gcc Patches

Hi Christophe,

On Tue, Jul 31, 2018 at 02:34:06PM +0200, Christophe Lyon wrote:
> Since this was committed, I've noticed regressions
> on aarch64:
> FAIL: gcc.dg/zero_bits_compound-1.c scan-assembler-not \\(and:

This went from
	and	w0, w0, 255
	lsl	w1, w0, 8
	orr	w0, w1, w0, lsl 20
	ret
to
	and	w1, w0, 255
	ubfiz	w0, w0, 8, 8
	orr	w0, w0, w1, lsl 20
	ret
so it's neither an improvement nor a regression, just different code.
The testcase wants no ANDs in the RTL.

> on arm-none-linux-gnueabi
> FAIL: gfortran.dg/actual_array_constructor_1.f90 -O1 execution test

That sounds bad.  Open a PR, maybe?

> On aarch64, I've also noticed a few others regressions but I'm not yet
> 100% sure it's caused by this patch (bisect running):
> gcc.target/aarch64/ashltidisi.c scan-assembler-times asr 4

ushift_53_i:
-	uxtw	x1, w0
-	lsl	x0, x1, 53
-	lsr	x1, x1, 11
+	lsr	w1, w0, 11
+	lsl	x0, x0, 53
	ret

shift_53_i:
-	sxtw	x1, w0
-	lsl	x0, x1, 53
-	asr	x1, x1, 11
+	sbfx	x1, x0, 11, 21
+	lsl	x0, x0, 53
	ret

Both are improvements afaics.  The number of asr insns changes, sure.

> gcc.target/aarch64/sve/var_stride_2.c -march=armv8.2-a+sve
> scan-assembler-times \\tadd\\tx[0-9]+, x[0-9]+, x[0-9]+, lsl 10\\n 2

Skipping all the SVE tests, sorry.  Richard says they look like
improvements, and exactly of the expected kind. :-)


Segher
^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH] combine: Allow combining two insns to two insns
  2018-07-31 13:57 ` Segher Boessenkool
@ 2018-07-31 15:37 ` Richard Earnshaw (lists)
  2018-08-01 8:27 ` Christophe Lyon
  1 sibling, 0 replies; 18+ messages in thread
From: Richard Earnshaw (lists) @ 2018-07-31 15:37 UTC (permalink / raw)
To: Segher Boessenkool, Christophe Lyon; +Cc: gcc Patches

On 31/07/18 14:57, Segher Boessenkool wrote:
> Hi Christophe,
>
> On Tue, Jul 31, 2018 at 02:34:06PM +0200, Christophe Lyon wrote:
>> Since this was committed, I've noticed regressions
>> on aarch64:
>> FAIL: gcc.dg/zero_bits_compound-1.c scan-assembler-not \\(and:
>
> This went from
> and w0, w0, 255
> lsl w1, w0, 8

These are sequentially dependent.

> orr w0, w1, w0, lsl 20
> ret
> to
> and w1, w0, 255
> ubfiz w0, w0, 8, 8

These can run in parallel.

So the change is a good one!  On a super-scalar machine we save a cycle.

R.

> orr w0, w0, w1, lsl 20
> ret
> so it's neither an improvement nor a regression, just different code.
> The testcase wants no ANDs in the RTL.
>
>> on arm-none-linux-gnueabi
>> FAIL: gfortran.dg/actual_array_constructor_1.f90 -O1 execution test
>
> That sounds bad.  Open a PR, maybe?
>
>> On aarch64, I've also noticed a few others regressions but I'm not yet
>> 100% sure it's caused by this patch (bisect running):
>> gcc.target/aarch64/ashltidisi.c scan-assembler-times asr 4
>
> ushift_53_i:
> - uxtw x1, w0
> - lsl x0, x1, 53
> - lsr x1, x1, 11
> + lsr w1, w0, 11
> + lsl x0, x0, 53
> ret
>
> shift_53_i:
> - sxtw x1, w0
> - lsl x0, x1, 53
> - asr x1, x1, 11
> + sbfx x1, x0, 11, 21
> + lsl x0, x0, 53
> ret
>
> Both are improvements afais.  The number of asr insns changes, sure.
>
>> gcc.target/aarch64/sve/var_stride_2.c -march=armv8.2-a+sve
>> scan-assembler-times \\tadd\\tx[0-9]+, x[0-9]+, x[0-9]+, lsl 10\\n 2
>
> Skipping all the SVE tests, sorry.  Richard says they look like
> improvements, and exactly of the expected kind. :-)
>
>
> Segher
>
^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH] combine: Allow combining two insns to two insns
  2018-07-31 13:57 ` Segher Boessenkool
  2018-07-31 15:37 ` Richard Earnshaw (lists)
@ 2018-08-01 8:27 ` Christophe Lyon
  2018-08-01 9:40 ` Segher Boessenkool
  1 sibling, 1 reply; 18+ messages in thread
From: Christophe Lyon @ 2018-08-01 8:27 UTC (permalink / raw)
To: Segher Boessenkool; +Cc: gcc Patches

On Tue, 31 Jul 2018 at 15:57, Segher Boessenkool
<segher@kernel.crashing.org> wrote:
>
> Hi Christophe,
>
> On Tue, Jul 31, 2018 at 02:34:06PM +0200, Christophe Lyon wrote:
> > Since this was committed, I've noticed regressions
> > on aarch64:
> > FAIL: gcc.dg/zero_bits_compound-1.c scan-assembler-not \\(and:
>
> This went from
> and w0, w0, 255
> lsl w1, w0, 8
> orr w0, w1, w0, lsl 20
> ret
> to
> and w1, w0, 255
> ubfiz w0, w0, 8, 8
> orr w0, w0, w1, lsl 20
> ret
> so it's neither an improvement nor a regression, just different code.
> The testcase wants no ANDs in the RTL.

I didn't try to manually regenerate the code before and after the patch,
but if there was "and w0, w0, 255" before the patch, why did the test pass?

> > on arm-none-linux-gnueabi
> > FAIL: gfortran.dg/actual_array_constructor_1.f90 -O1 execution test
>
> That sounds bad.  Open a PR, maybe?
>

I've just filed PR86771

> > On aarch64, I've also noticed a few others regressions but I'm not yet
> > 100% sure it's caused by this patch (bisect running):
> > gcc.target/aarch64/ashltidisi.c scan-assembler-times asr 4
>
> ushift_53_i:
> - uxtw x1, w0
> - lsl x0, x1, 53
> - lsr x1, x1, 11
> + lsr w1, w0, 11
> + lsl x0, x0, 53
> ret
>
> shift_53_i:
> - sxtw x1, w0
> - lsl x0, x1, 53
> - asr x1, x1, 11
> + sbfx x1, x0, 11, 21
> + lsl x0, x0, 53
> ret
>
> Both are improvements afais.  The number of asr insns changes, sure.
>

Looks like an easy "fix" would be to change to "scan-assembler-times asr 3"
but maybe the aarch64 maintainers want to add more checks here
(lsl/lsr counts)

Don't you include arm/aarch64 in the 30 targets you used for testing?

> > gcc.target/aarch64/sve/var_stride_2.c -march=armv8.2-a+sve scan-assembler-times \\tadd\\tx[0-9]+, x[0-9]+, x[0-9]+, lsl 10\\n 2
>
> Skipping all the SVE tests, sorry.  Richard says they look like
> improvements, and exactly of the expected kind. :-)
>
>
> Segher
^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH] combine: Allow combining two insns to two insns
  2018-08-01 8:27 ` Christophe Lyon
@ 2018-08-01 9:40 ` Segher Boessenkool
  2018-08-01 10:52 ` Christophe Lyon
  0 siblings, 1 reply; 18+ messages in thread
From: Segher Boessenkool @ 2018-08-01 9:40 UTC (permalink / raw)
To: Christophe Lyon; +Cc: gcc Patches

On Wed, Aug 01, 2018 at 10:27:31AM +0200, Christophe Lyon wrote:
> On Tue, 31 Jul 2018 at 15:57, Segher Boessenkool
> <segher@kernel.crashing.org> wrote:
> > On Tue, Jul 31, 2018 at 02:34:06PM +0200, Christophe Lyon wrote:
> > > Since this was committed, I've noticed regressions
> > > on aarch64:
> > > FAIL: gcc.dg/zero_bits_compound-1.c scan-assembler-not \\(and:
> >
> > This went from
> > and w0, w0, 255
> > lsl w1, w0, 8
> > orr w0, w1, w0, lsl 20
> > ret
> > to
> > and w1, w0, 255
> > ubfiz w0, w0, 8, 8
> > orr w0, w0, w1, lsl 20
> > ret
> > so it's neither an improvement nor a regression, just different code.
> > The testcase wants no ANDs in the RTL.
>
> I didn't try to manually regenerate the code before and after the patch,
> but if there was "and w0, w0, 255" before the patch, why did the test pass?

It wasn't an AND in RTL (it was a ZERO_EXTEND).

> > > on arm-none-linux-gnueabi
> > > FAIL: gfortran.dg/actual_array_constructor_1.f90 -O1 execution test
> >
> > That sounds bad.  Open a PR, maybe?
>
> I've just filed PR86771

Thanks.


Segher
^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH] combine: Allow combining two insns to two insns
  2018-08-01 9:40 ` Segher Boessenkool
@ 2018-08-01 10:52 ` Christophe Lyon
  0 siblings, 0 replies; 18+ messages in thread
From: Christophe Lyon @ 2018-08-01 10:52 UTC (permalink / raw)
To: Segher Boessenkool; +Cc: gcc Patches

On Wed, 1 Aug 2018 at 11:40, Segher Boessenkool
<segher@kernel.crashing.org> wrote:
>
> On Wed, Aug 01, 2018 at 10:27:31AM +0200, Christophe Lyon wrote:
> > On Tue, 31 Jul 2018 at 15:57, Segher Boessenkool
> > <segher@kernel.crashing.org> wrote:
> > > On Tue, Jul 31, 2018 at 02:34:06PM +0200, Christophe Lyon wrote:
> > > > Since this was committed, I've noticed regressions
> > > > on aarch64:
> > > > FAIL: gcc.dg/zero_bits_compound-1.c scan-assembler-not \\(and:
> > >
> > > This went from
> > > and w0, w0, 255
> > > lsl w1, w0, 8
> > > orr w0, w1, w0, lsl 20
> > > ret
> > > to
> > > and w1, w0, 255
> > > ubfiz w0, w0, 8, 8
> > > orr w0, w0, w1, lsl 20
> > > ret
> > > so it's neither an improvement nor a regression, just different code.
> > > The testcase wants no ANDs in the RTL.
> >
> > I didn't try to manually regenerate the code before and after the patch,
> > but if there was "and w0, w0, 255" before the patch, why did the test pass?
>
> It wasn't an AND in RTL (it was a ZERO_EXTEND).
>

Indeed, I missed the -dP in the options.

> > > > on arm-none-linux-gnueabi
> > > > FAIL: gfortran.dg/actual_array_constructor_1.f90 -O1 execution test
> > >
> > > That sounds bad.  Open a PR, maybe?
> >
> > I've just filed PR86771
>
> Thanks.
>
>
> Segher
^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH] combine: Allow combining two insns to two insns
  2018-07-24 17:18 [PATCH] combine: Allow combining two insns to two insns Segher Boessenkool
  ` (3 preceding siblings ...)
  2018-07-30 16:09 ` Segher Boessenkool
@ 2018-08-02 5:52 ` Toon Moene
  4 siblings, 0 replies; 18+ messages in thread
From: Toon Moene @ 2018-08-02 5:52 UTC (permalink / raw)
To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1099 bytes --]

On 07/24/2018 07:18 PM, Segher Boessenkool wrote:

> This patch allows combine to combine two insns into two.  This helps
> in many cases, by reducing instruction path length, and also allowing
> further combinations to happen.  PR85160 is a typical example of code
> that it can improve.

I cannot state with certainty that the improvements to our most
notorious routine between 8.2 and current trunk are solely due to this
change, but the differences are telling (see attached Fortran code -
the analysis is about the third loop).

Number of instructions for this loop (Skylake i9-7900):

gfortran82 -S -Ofast -march=native -mtune=native: 458 verint.s.82.loop3
gfortran90 -S -Ofast -march=native -mtune=native: 396 verint.s.90.loop3

But the most stunning difference is the use of the stack [ nn(rsp) ] -
see the attached files ...
-- Toon Moene - e-mail: toon@moene.org - phone: +31 346 214290 Saturnushof 14, 3738 XG Maartensdijk, The Netherlands At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/ Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news [-- Attachment #2: verint.f --] [-- Type: text/x-fortran, Size: 12959 bytes --] # 1 "/scratch/hirlam/hl_home/MPI/lib/src/grdy/verint.F" # 1 "<built-in>" # 1 "<command-line>" # 1 "/scratch/hirlam/hl_home/MPI/lib/src/grdy/verint.F" c Library:grdy $RCSfile$, $Revision: 7536 $ c checked in by $Author: ovignes $ at $Date: 2009-12-18 14:23:36 +0100 (Fri, 18 Dec 2009) $ c $State$, $Locker$ c $Log$ c Revision 1.3 1999/04/22 09:30:45 DagBjoerge c MPP code c c Revision 1.2 1999/03/09 10:23:13 GerardCats c Add SGI paralllellisation directives DOACROSS c c Revision 1.1 1996/09/06 13:12:18 GCats c Created from grdy.apl, 1 version 2.6.1, by Gerard Cats c SUBROUTINE VERINT ( I KLON , KLAT , KLEV , KINT , KHALO I , KLON1 , KLON2 , KLAT1 , KLAT2 I , KP , KQ , KR R , PARG , PRES R , PALFH , PBETH R , PALFA , PBETA , PGAMA ) C C******************************************************************* C C VERINT - THREE DIMENSIONAL INTERPOLATION C C PURPOSE: C C THREE DIMENSIONAL INTERPOLATION C C INPUT PARAMETERS: C C KLON NUMBER OF GRIDPOINTS IN X-DIRECTION C KLAT NUMBER OF GRIDPOINTS IN Y-DIRECTION C KLEV NUMBER OF VERTICAL LEVELS C KINT TYPE OF INTERPOLATION C = 1 - LINEAR C = 2 - QUADRATIC C = 3 - CUBIC C = 4 - MIXED CUBIC/LINEAR C KLON1 FIRST GRIDPOINT IN X-DIRECTION C KLON2 LAST GRIDPOINT IN X-DIRECTION C KLAT1 FIRST GRIDPOINT IN Y-DIRECTION C KLAT2 LAST GRIDPOINT IN Y-DIRECTION C KP ARRAY OF INDEXES FOR HORIZONTAL DISPLACEMENTS C KQ ARRAY OF INDEXES FOR HORIZONTAL DISPLACEMENTS C KR ARRAY OF INDEXES FOR VERTICAL DISPLACEMENTS C PARG ARRAY OF ARGUMENTS C PALFH ALFA HAT C PBETH BETA HAT C PALFA ARRAY OF WEIGHTS IN X-DIRECTION C PBETA ARRAY OF WEIGHTS IN Y-DIRECTION C PGAMA ARRAY OF WEIGHTS IN VERTICAL DIRECTION C C OUTPUT PARAMETERS: C C 
PRES INTERPOLATED FIELD C C HISTORY: C C J.E. HAUGEN 1 1992 C C******************************************************************* C IMPLICIT NONE C INTEGER KLON , KLAT , KLEV , KINT , KHALO, I KLON1 , KLON2 , KLAT1 , KLAT2 C INTEGER KP(KLON,KLAT), KQ(KLON,KLAT), KR(KLON,KLAT) REAL PARG(2-KHALO:KLON+KHALO-1,2-KHALO:KLAT+KHALO-1,KLEV) , R PRES(KLON,KLAT) , R PALFH(KLON,KLAT) , PBETH(KLON,KLAT) , R PALFA(KLON,KLAT,4) , PBETA(KLON,KLAT,4), R PGAMA(KLON,KLAT,4) C INTEGER JX, JY, IDX, IDY, ILEV REAL Z1MAH, Z1MBH C IF (KINT.EQ.1) THEN C LINEAR INTERPOLATION C DO JY = KLAT1,KLAT2 DO JX = KLON1,KLON2 IDX = KP(JX,JY) IDY = KQ(JX,JY) ILEV = KR(JX,JY) C PRES(JX,JY) = PGAMA(JX,JY,1)*( C + PBETA(JX,JY,1)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY-1,ILEV-1) + + PALFA(JX,JY,2)*PARG(IDX ,IDY-1,ILEV-1) ) + + PBETA(JX,JY,2)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY ,ILEV-1) + + PALFA(JX,JY,2)*PARG(IDX ,IDY ,ILEV-1) ) ) C + + + PGAMA(JX,JY,2)*( C + + PBETA(JX,JY,1)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY-1,ILEV ) + + PALFA(JX,JY,2)*PARG(IDX ,IDY-1,ILEV ) ) + + PBETA(JX,JY,2)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY ,ILEV ) + + PALFA(JX,JY,2)*PARG(IDX ,IDY ,ILEV ) ) ) ENDDO ENDDO C ELSE +IF (KINT.EQ.2) THEN C QUADRATIC INTERPOLATION C DO JY = KLAT1,KLAT2 DO JX = KLON1,KLON2 IDX = KP(JX,JY) IDY = KQ(JX,JY) ILEV = KR(JX,JY) C PRES(JX,JY) = PGAMA(JX,JY,1)*( C + PBETA(JX,JY,1)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY-1,ILEV-1) + + PALFA(JX,JY,2)*PARG(IDX ,IDY-1,ILEV-1) + + PALFA(JX,JY,3)*PARG(IDX+1,IDY-1,ILEV-1) ) + + PBETA(JX,JY,2)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY ,ILEV-1) + + PALFA(JX,JY,2)*PARG(IDX ,IDY ,ILEV-1) + + PALFA(JX,JY,3)*PARG(IDX+1,IDY ,ILEV-1) ) + + PBETA(JX,JY,3)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY+1,ILEV-1) + + PALFA(JX,JY,2)*PARG(IDX ,IDY+1,ILEV-1) + + PALFA(JX,JY,3)*PARG(IDX+1,IDY+1,ILEV-1) ) ) C + + + PGAMA(JX,JY,2)*( C + + PBETA(JX,JY,1)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY-1,ILEV ) + + PALFA(JX,JY,2)*PARG(IDX ,IDY-1,ILEV ) + + PALFA(JX,JY,3)*PARG(IDX+1,IDY-1,ILEV ) ) + + PBETA(JX,JY,2)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY ,ILEV 
) + + PALFA(JX,JY,2)*PARG(IDX ,IDY ,ILEV ) + + PALFA(JX,JY,3)*PARG(IDX+1,IDY ,ILEV ) ) + + PBETA(JX,JY,3)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY+1,ILEV ) + + PALFA(JX,JY,2)*PARG(IDX ,IDY+1,ILEV ) + + PALFA(JX,JY,3)*PARG(IDX+1,IDY+1,ILEV ) ) ) C + + + PGAMA(JX,JY,3)*( C + + PBETA(JX,JY,1)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY-1,ILEV+1) + + PALFA(JX,JY,2)*PARG(IDX ,IDY-1,ILEV+1) + + PALFA(JX,JY,3)*PARG(IDX+1,IDY-1,ILEV+1) ) + + PBETA(JX,JY,2)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY ,ILEV+1) + + PALFA(JX,JY,2)*PARG(IDX ,IDY ,ILEV+1) + + PALFA(JX,JY,3)*PARG(IDX+1,IDY ,ILEV+1) ) + + PBETA(JX,JY,3)*( PALFA(JX,JY,1)*PARG(IDX-1,IDY+1,ILEV+1) + + PALFA(JX,JY,2)*PARG(IDX ,IDY+1,ILEV+1) + + PALFA(JX,JY,3)*PARG(IDX+1,IDY+1,ILEV+1) ) ) ENDDO ENDDO C ELSE +IF (KINT.EQ.3) THEN C CUBIC INTERPOLATION C DO JY = KLAT1,KLAT2 DO JX = KLON1,KLON2 IDX = KP(JX,JY) IDY = KQ(JX,JY) ILEV = KR(JX,JY) C PRES(JX,JY) = PGAMA(JX,JY,1)*( C + PBETA(JX,JY,1)*( PALFA(JX,JY,1)*PARG(IDX-2,IDY-2,ILEV-2) + + PALFA(JX,JY,2)*PARG(IDX-1,IDY-2,ILEV-2) + + PALFA(JX,JY,3)*PARG(IDX ,IDY-2,ILEV-2) + + PALFA(JX,JY,4)*PARG(IDX+1,IDY-2,ILEV-2) ) + + PBETA(JX,JY,2)*( PALFA(JX,JY,1)*PARG(IDX-2,IDY-1,ILEV-2) + + PALFA(JX,JY,2)*PARG(IDX-1,IDY-1,ILEV-2) + + PALFA(JX,JY,3)*PARG(IDX ,IDY-1,ILEV-2) + + PALFA(JX,JY,4)*PARG(IDX+1,IDY-1,ILEV-2) ) + + PBETA(JX,JY,3)*( PALFA(JX,JY,1)*PARG(IDX-2,IDY ,ILEV-2) + + PALFA(JX,JY,2)*PARG(IDX-1,IDY ,ILEV-2) + + PALFA(JX,JY,3)*PARG(IDX ,IDY ,ILEV-2) + + PALFA(JX,JY,4)*PARG(IDX+1,IDY ,ILEV-2) ) + + PBETA(JX,JY,4)*( PALFA(JX,JY,1)*PARG(IDX-2,IDY+1,ILEV-2) + + PALFA(JX,JY,2)*PARG(IDX-1,IDY+1,ILEV-2) + + PALFA(JX,JY,3)*PARG(IDX ,IDY+1,ILEV-2) + + PALFA(JX,JY,4)*PARG(IDX+1,IDY+1,ILEV-2) ) ) C + + + PGAMA(JX,JY,2)*( C + + PBETA(JX,JY,1)*( PALFA(JX,JY,1)*PARG(IDX-2,IDY-2,ILEV-1) + + PALFA(JX,JY,2)*PARG(IDX-1,IDY-2,ILEV-1) + + PALFA(JX,JY,3)*PARG(IDX ,IDY-2,ILEV-1) + + PALFA(JX,JY,4)*PARG(IDX+1,IDY-2,ILEV-1) ) + + PBETA(JX,JY,2)*( PALFA(JX,JY,1)*PARG(IDX-2,IDY-1,ILEV-1) + + PALFA(JX,JY,2)*PARG(IDX-1,IDY-1,ILEV-1) + 
+ PALFA(JX,JY,3)*PARG(IDX ,IDY-1,ILEV-1) + + PALFA(JX,JY,4)*PARG(IDX+1,IDY-1,ILEV-1) ) + + PBETA(JX,JY,3)*( PALFA(JX,JY,1)*PARG(IDX-2,IDY ,ILEV-1) + + PALFA(JX,JY,2)*PARG(IDX-1,IDY ,ILEV-1) + + PALFA(JX,JY,3)*PARG(IDX ,IDY ,ILEV-1) + + PALFA(JX,JY,4)*PARG(IDX+1,IDY ,ILEV-1) ) + + PBETA(JX,JY,4)*( PALFA(JX,JY,1)*PARG(IDX-2,IDY+1,ILEV-1) + + PALFA(JX,JY,2)*PARG(IDX-1,IDY+1,ILEV-1) + + PALFA(JX,JY,3)*PARG(IDX ,IDY+1,ILEV-1) + + PALFA(JX,JY,4)*PARG(IDX+1,IDY+1,ILEV-1) ) ) C + + + PGAMA(JX,JY,3)*( C + + PBETA(JX,JY,1)*( PALFA(JX,JY,1)*PARG(IDX-2,IDY-2,ILEV ) + + PALFA(JX,JY,2)*PARG(IDX-1,IDY-2,ILEV ) + + PALFA(JX,JY,3)*PARG(IDX ,IDY-2,ILEV ) + + PALFA(JX,JY,4)*PARG(IDX+1,IDY-2,ILEV ) ) + + PBETA(JX,JY,2)*( PALFA(JX,JY,1)*PARG(IDX-2,IDY-1,ILEV ) + + PALFA(JX,JY,2)*PARG(IDX-1,IDY-1,ILEV ) + + PALFA(JX,JY,3)*PARG(IDX ,IDY-1,ILEV ) + + PALFA(JX,JY,4)*PARG(IDX+1,IDY-1,ILEV ) ) + + PBETA(JX,JY,3)*( PALFA(JX,JY,1)*PARG(IDX-2,IDY ,ILEV ) + + PALFA(JX,JY,2)*PARG(IDX-1,IDY ,ILEV ) + + PALFA(JX,JY,3)*PARG(IDX ,IDY ,ILEV ) + + PALFA(JX,JY,4)*PARG(IDX+1,IDY ,ILEV ) ) + + PBETA(JX,JY,4)*( PALFA(JX,JY,1)*PARG(IDX-2,IDY+1,ILEV ) + + PALFA(JX,JY,2)*PARG(IDX-1,IDY+1,ILEV ) + + PALFA(JX,JY,3)*PARG(IDX ,IDY+1,ILEV ) + + PALFA(JX,JY,4)*PARG(IDX+1,IDY+1,ILEV ) ) ) C + + + PGAMA(JX,JY,4)*( C + + PBETA(JX,JY,1)*( PALFA(JX,JY,1)*PARG(IDX-2,IDY-2,ILEV+1) + + PALFA(JX,JY,2)*PARG(IDX-1,IDY-2,ILEV+1) + + PALFA(JX,JY,3)*PARG(IDX ,IDY-2,ILEV+1) + + PALFA(JX,JY,4)*PARG(IDX+1,IDY-2,ILEV+1) ) + + PBETA(JX,JY,2)*( PALFA(JX,JY,1)*PARG(IDX-2,IDY-1,ILEV+1) + + PALFA(JX,JY,2)*PARG(IDX-1,IDY-1,ILEV+1) + + PALFA(JX,JY,3)*PARG(IDX ,IDY-1,ILEV+1) + + PALFA(JX,JY,4)*PARG(IDX+1,IDY-1,ILEV+1) ) + + PBETA(JX,JY,3)*( PALFA(JX,JY,1)*PARG(IDX-2,IDY ,ILEV+1) + + PALFA(JX,JY,2)*PARG(IDX-1,IDY ,ILEV+1) + + PALFA(JX,JY,3)*PARG(IDX ,IDY ,ILEV+1) + + PALFA(JX,JY,4)*PARG(IDX+1,IDY ,ILEV+1) ) + + PBETA(JX,JY,4)*( PALFA(JX,JY,1)*PARG(IDX-2,IDY+1,ILEV+1) + + PALFA(JX,JY,2)*PARG(IDX-1,IDY+1,ILEV+1) + + PALFA(JX,JY,3)*PARG(IDX 
,IDY+1,ILEV+1) + + PALFA(JX,JY,4)*PARG(IDX+1,IDY+1,ILEV+1) ) ) ENDDO ENDDO C ELSE +IF (KINT.EQ.4) THEN C MIXED CUBIC/LINEAR INTERPOLATION C DO JY = KLAT1,KLAT2 DO JX = KLON1,KLON2 IDX = KP(JX,JY) IDY = KQ(JX,JY) ILEV = KR(JX,JY) C Z1MAH = 1.0 - PALFH(JX,JY) Z1MBH = 1.0 - PBETH(JX,JY) C PRES(JX,JY) = PGAMA(JX,JY,1)*( C + PBETH(JX,JY) *( PALFH(JX,JY) *PARG(IDX-1,IDY-1,ILEV-2) + + Z1MAH *PARG(IDX ,IDY-1,ILEV-2) ) + + Z1MBH *( PALFH(JX,JY) *PARG(IDX-1,IDY ,ILEV-2) + + Z1MAH *PARG(IDX ,IDY ,ILEV-2) ) ) C + + + PGAMA(JX,JY,4)*( C + + PBETH(JX,JY) *( PALFH(JX,JY) *PARG(IDX-1,IDY-1,ILEV+1) + + Z1MAH *PARG(IDX ,IDY-1,ILEV+1) ) + + Z1MBH *( PALFH(JX,JY) *PARG(IDX-1,IDY ,ILEV+1) + + Z1MAH *PARG(IDX ,IDY ,ILEV+1) ) ) C + + + PGAMA(JX,JY,2)*( C + + PBETA(JX,JY,1)*( PALFH(JX,JY) *PARG(IDX-1,IDY-2,ILEV-1) + + Z1MAH *PARG(IDX ,IDY-2,ILEV-1) ) + + PBETA(JX,JY,2)*( PALFA(JX,JY,1)*PARG(IDX-2,IDY-1,ILEV-1) + + PALFA(JX,JY,2)*PARG(IDX-1,IDY-1,ILEV-1) + + PALFA(JX,JY,3)*PARG(IDX ,IDY-1,ILEV-1) + + PALFA(JX,JY,4)*PARG(IDX+1,IDY-1,ILEV-1) ) + + PBETA(JX,JY,3)*( PALFA(JX,JY,1)*PARG(IDX-2,IDY ,ILEV-1) + + PALFA(JX,JY,2)*PARG(IDX-1,IDY ,ILEV-1) + + PALFA(JX,JY,3)*PARG(IDX ,IDY ,ILEV-1) + + PALFA(JX,JY,4)*PARG(IDX+1,IDY ,ILEV-1) ) + + PBETA(JX,JY,4)*( PALFH(JX,JY) *PARG(IDX-1,IDY+1,ILEV-1) + + Z1MAH *PARG(IDX ,IDY+1,ILEV-1) ) ) C + + + PGAMA(JX,JY,3)*( C + + PBETA(JX,JY,1)*( PALFH(JX,JY) *PARG(IDX-1,IDY-2,ILEV ) + + Z1MAH *PARG(IDX ,IDY-2,ILEV ) ) + + PBETA(JX,JY,2)*( PALFA(JX,JY,1)*PARG(IDX-2,IDY-1,ILEV ) + + PALFA(JX,JY,2)*PARG(IDX-1,IDY-1,ILEV ) + + PALFA(JX,JY,3)*PARG(IDX ,IDY-1,ILEV ) + + PALFA(JX,JY,4)*PARG(IDX+1,IDY-1,ILEV ) ) + + PBETA(JX,JY,3)*( PALFA(JX,JY,1)*PARG(IDX-2,IDY ,ILEV ) + + PALFA(JX,JY,2)*PARG(IDX-1,IDY ,ILEV ) + + PALFA(JX,JY,3)*PARG(IDX ,IDY ,ILEV ) + + PALFA(JX,JY,4)*PARG(IDX+1,IDY ,ILEV ) ) + + PBETA(JX,JY,4)*( PALFH(JX,JY) *PARG(IDX-1,IDY+1,ILEV ) + + Z1MAH *PARG(IDX ,IDY+1,ILEV ) ) ) ENDDO ENDDO C ENDIF C RETURN END [-- Attachment #3: verint.s.90.loop3.stack --] 
[-- Type: text/plain, Size: 229 bytes --]

        movq    32(%rsp), %rdx
        movq    72(%rsp), %rdx
        movq    80(%rsp), %rdx
        vmovaps %ymm18, 296(%rsp)
        movq    56(%rsp), %rdx
        movq    48(%rsp), %rdx
        movq    64(%rsp), %rdx
        movq    40(%rsp), %rdx
        vaddps  296(%rsp), %ymm19, %ymm19
        cmpq    24(%rsp), %rax

[-- Attachment #4: verint.s.82.loop3.stack --]
[-- Type: text/plain, Size: 3128 bytes --]

        movq    1384(%rsp), %rdx
        vmovaps %ymm29, 2184(%rsp)
        vmovaps %ymm30, 2216(%rsp)
        vmovaps %ymm28, 2248(%rsp)
        vmovaps %ymm17, 2280(%rsp)
        vmovaps %ymm19, 2312(%rsp)
        vmovaps %ymm31, 2344(%rsp)
        vmovaps %ymm25, 2376(%rsp)
        vmovaps %ymm21, 2408(%rsp)
        vmovaps %ymm16, 2440(%rsp)
        vmovaps %ymm30, 2472(%rsp)
        vmovaps %ymm13, 2504(%rsp)
        vmovaps %ymm11, 2536(%rsp)
        vmovaps %ymm28, 2568(%rsp)
        vmovaps %ymm4, 2600(%rsp)
        vmovaps %ymm19, 2632(%rsp)
        vmovaps %ymm17, 2664(%rsp)
        vmovaps %ymm16, 2696(%rsp)
        vmovaps %ymm19, 2728(%rsp)
        movq    1256(%rsp), %rdx
        vmovaps %ymm31, 2760(%rsp)
        vmovaps %ymm21, 2792(%rsp)
        vmovaps %ymm31, 2152(%rsp)
        vmovaps %ymm17, 2824(%rsp)
        vmovaps %ymm21, 2856(%rsp)
        vmovaps %ymm17, 2888(%rsp)
        vmovaps %ymm21, 2920(%rsp)
        vmovaps %ymm16, 2952(%rsp)
        vmovaps %ymm11, 2984(%rsp)
        vmovaps %ymm31, 3016(%rsp)
        vmovaps %ymm21, 3048(%rsp)
        vmovaps %ymm31, 3080(%rsp)
        vmovaps %ymm21, 3112(%rsp)
        vmovaps %ymm29, 3144(%rsp)
        vmovaps %ymm31, 3176(%rsp)
        vmovaps %ymm16, 3208(%rsp)
        vmovaps %ymm29, 3240(%rsp)
        vmovaps %ymm4, 3272(%rsp)
        vmovaps %ymm31, 3304(%rsp)
        vmovaps %ymm29, 3336(%rsp)
        vmovaps %ymm26, 3368(%rsp)
        vmovaps %ymm25, 3400(%rsp)
        vmovaps %ymm26, 3432(%rsp)
        vmovaps %ymm25, 3464(%rsp)
        vmovaps %ymm25, 3496(%rsp)
        vmovaps %ymm31, 3528(%rsp)
        vmovaps %ymm2, 3560(%rsp)
        vmovaps %ymm29, 3592(%rsp)
        vmovaps 3144(%rsp), %ymm8
        vshuff32x4      $0, 3176(%rsp), %ymm8, %ymm0
        vshuff32x4      $0, 3208(%rsp), %ymm11, %ymm11
        vmovaps 3016(%rsp), %ymm10
        vmovaps 2152(%rsp), %ymm31
        vshuff32x4      $0, 3272(%rsp), %ymm21, %ymm21
        vshuff32x4      $0, 3240(%rsp), %ymm15, %ymm15
        vshuff32x4      $0, 3048(%rsp), %ymm10, %ymm0
        vmovaps 3080(%rsp), %ymm10
        vshuff32x4      $0, 3112(%rsp), %ymm10, %ymm5
        vmovaps 2632(%rsp), %ymm5
        vshuff32x4      $0, 2664(%rsp), %ymm5, %ymm8
        vmovaps 2696(%rsp), %ymm5
        vshuff32x4      $0, 2728(%rsp), %ymm5, %ymm5
        vmovaps 2760(%rsp), %ymm15
        vshuff32x4      $0, 2792(%rsp), %ymm15, %ymm9
        vmovaps 2824(%rsp), %ymm15
        vmovaps 2888(%rsp), %ymm0
        vshuff32x4      $0, 2920(%rsp), %ymm0, %ymm8
        vmovaps 2952(%rsp), %ymm0
        vshuff32x4      $0, 2984(%rsp), %ymm0, %ymm0
        vshuff32x4      $0, 2856(%rsp), %ymm15, %ymm8
        vshuff32x4      $0, 3560(%rsp), %ymm19, %ymm19
        vmovaps 3304(%rsp), %ymm7
        vshuff32x4      $0, 3336(%rsp), %ymm7, %ymm0
        vmovaps 3368(%rsp), %ymm7
        vshuff32x4      $0, 3528(%rsp), %ymm13, %ymm13
        vshuff32x4      $0, 3400(%rsp), %ymm7, %ymm1
        vmovaps 3432(%rsp), %ymm18
        vshuff32x4      $0, 3592(%rsp), %ymm20, %ymm20
        vshuff32x4      $0, 3464(%rsp), %ymm18, %ymm18
        vshuff32x4      $0, 3496(%rsp), %ymm26, %ymm1
        vmovaps 2184(%rsp), %ymm29
        movq    1224(%rsp), %rdx
        vmovaps 2440(%rsp), %ymm20
        vshuff32x4      $0, 2216(%rsp), %ymm29, %ymm0
        vmovaps 2248(%rsp), %ymm29
        vshuff32x4      $0, 2280(%rsp), %ymm29, %ymm1
        vmovaps 2312(%rsp), %ymm29
        movq    1288(%rsp), %rdx
        vshuff32x4      $0, 2344(%rsp), %ymm29, %ymm1
        vmovaps 2376(%rsp), %ymm29
        vshuff32x4      $0, 2408(%rsp), %ymm29, %ymm2
        vshuff32x4      $0, 2472(%rsp), %ymm20, %ymm1
        vmovaps 2504(%rsp), %ymm20
        vshuff32x4      $0, 2600(%rsp), %ymm28, %ymm3
        vshuff32x4      $0, 2536(%rsp), %ymm20, %ymm2
        movq    1320(%rsp), %rdx
        vshuff32x4      $0, 2568(%rsp), %ymm30, %ymm2
        movq    1352(%rsp), %rdx
        movq    1192(%rsp), %rdx
        cmpq    1160(%rsp), %rax
end of thread, other threads: [~2018-08-02 5:52 UTC | newest]

Thread overview: 18+ messages
2018-07-24 17:18 [PATCH] combine: Allow combining two insns to two insns Segher Boessenkool
2018-07-24 21:13 ` Jeff Law
2018-07-25  8:28 ` Richard Biener
2018-07-25  9:50 ` Segher Boessenkool
2018-07-25 10:37 ` Richard Biener
2018-07-31 12:39 ` H.J. Lu
2018-07-31 14:08 ` Segher Boessenkool
2018-07-25 13:47 ` David Malcolm
2018-07-25 14:19 ` Segher Boessenkool
2018-07-30 16:09 ` Segher Boessenkool
2018-07-31 12:34 ` Christophe Lyon
2018-07-31 12:59 ` Richard Sandiford
2018-07-31 13:57 ` Segher Boessenkool
2018-07-31 15:37 ` Richard Earnshaw (lists)
2018-08-01  8:27 ` Christophe Lyon
2018-08-01  9:40 ` Segher Boessenkool
2018-08-01 10:52 ` Christophe Lyon
2018-08-02  5:52 ` Toon Moene