* [PATCH][RFC] tree-optimization/92335 - Improve sinking heuristics for vectorization
@ 2023-07-28  7:05 Richard Biener
  2023-07-31 15:33 ` Jeff Law
  0 siblings, 1 reply; 18+ messages in thread

From: Richard Biener @ 2023-07-28  7:05 UTC (permalink / raw)
To: gcc-patches

The following delays sinking of loads within the same innermost loop
when the load was unconditional before.  That is a not uncommon issue
preventing vectorization when masked loads are not available.

Bootstrapped and tested on x86_64-unknown-linux-gnu.

I have a followup patch improving sinking which, without this change,
would cause more of the problematic sinking - now that we have a second
sink pass after loop opts this looks like a reasonable approach?

OK?

Thanks,
Richard.

	PR tree-optimization/92335
	* tree-ssa-sink.cc (select_best_block): Before loop
	optimizations avoid sinking unconditional loads/stores
	in innermost loops to conditionally executed places.

	* gcc.dg/tree-ssa/ssa-sink-10.c: Disable vectorizing.
	* gcc.dg/tree-ssa/predcom-9.c: Clone from ssa-sink-10.c,
	expect predictive commoning to happen instead of sinking.
	* gcc.dg/vect/pr65947-3.c: Adjust.
---
 gcc/testsuite/gcc.dg/tree-ssa/predcom-9.c   | 20 ++++++++++++++++++++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-10.c |  2 +-
 gcc/testsuite/gcc.dg/vect/pr65947-3.c       |  6 +-----
 gcc/tree-ssa-sink.cc                        | 12 ++++++++++++
 4 files changed, 34 insertions(+), 6 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/predcom-9.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/predcom-9.c b/gcc/testsuite/gcc.dg/tree-ssa/predcom-9.c
new file mode 100644
index 00000000000..b0fb0e2d4c5
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/predcom-9.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-sink-details -fdump-tree-pcom-details" } */
+
+int x[1024], y[1024], z[1024], w[1024];
+void foo (void)
+{
+  int i;
+  for (i = 1; i < 1024; ++i)
+    {
+      int a = x[i];
+      int b = y[i];
+      int c = x[i-1];
+      int d = y[i-1];
+      if (w[i])
+	z[i] = (a + b) + (c + d);
+    }
+}
+
+/* { dg-final { scan-tree-dump-not "Sinking # VUSE" "sink1" } } */
+/* { dg-final { scan-tree-dump "Executing predictive commoning without unrolling" "pcom" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-10.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-10.c
index 535cb3208f5..a35014be038 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-10.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-10.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-sink-details -fno-tree-pre" } */
+/* { dg-options "-O2 -fdump-tree-sink-details -fno-tree-vectorize -fno-tree-pre" } */
 
 int x[1024], y[1024], z[1024], w[1024];
 void foo (void)
diff --git a/gcc/testsuite/gcc.dg/vect/pr65947-3.c b/gcc/testsuite/gcc.dg/vect/pr65947-3.c
index f1bfad65c22..6b4077e1a62 100644
--- a/gcc/testsuite/gcc.dg/vect/pr65947-3.c
+++ b/gcc/testsuite/gcc.dg/vect/pr65947-3.c
@@ -51,10 +51,6 @@ main (void)
   return 0;
 }
 
-/* Since the fix for PR97307 which sinks the load of a[i], preventing
-   if-conversion to happen, targets that cannot do masked loads only
-   vectorize the inline copy.  */
-/* { dg-final { scan-tree-dump-times "LOOP VECTORIZED" 2 "vect" { target vect_masked_load } } } */
-/* { dg-final { scan-tree-dump-times "LOOP VECTORIZED" 1 "vect" { target { ! vect_masked_load } } } } */
+/* { dg-final { scan-tree-dump-times "LOOP VECTORIZED" 2 "vect" } } */
 /* { dg-final { scan-tree-dump-times "optimizing condition reduction with FOLD_EXTRACT_LAST" 2 "vect" { target vect_fold_extract_last } } } */
 /* { dg-final { scan-tree-dump-not "condition expression based on integer induction." "vect" } } */
diff --git a/gcc/tree-ssa-sink.cc b/gcc/tree-ssa-sink.cc
index cf0a32a954b..dcbe05b3b03 100644
--- a/gcc/tree-ssa-sink.cc
+++ b/gcc/tree-ssa-sink.cc
@@ -220,6 +220,18 @@ select_best_block (basic_block early_bb,
   if (bb_loop_depth (best_bb) < bb_loop_depth (early_bb))
     return best_bb;
 
+  /* Avoid turning an unconditional load/store into a conditional one when we
+     still might want to perform vectorization.  */
+  if (best_bb->loop_father == early_bb->loop_father
+      && loop_outer (best_bb->loop_father)
+      && !best_bb->loop_father->inner
+      && gimple_vuse (stmt)
+      && flag_tree_loop_vectorize
+      && !(cfun->curr_properties & PROP_loop_opts_done)
+      && dominated_by_p (CDI_DOMINATORS, best_bb->loop_father->latch, early_bb)
+      && !dominated_by_p (CDI_DOMINATORS, best_bb->loop_father->latch, best_bb))
+    return early_bb;
+
   /* Get the sinking threshold.  If the statement to be moved has memory
      operands, then increase the threshold by 7% as those are even more
      profitable to avoid, clamping at 100%.  */
-- 
2.35.3

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH][RFC] tree-optimization/92335 - Improve sinking heuristics for vectorization
  2023-07-28  7:05 [PATCH][RFC] tree-optimization/92335 - Improve sinking heuristics for vectorization Richard Biener
@ 2023-07-31 15:33 ` Jeff Law
  2023-08-02  8:46   ` Richard Biener
  0 siblings, 1 reply; 18+ messages in thread

From: Jeff Law @ 2023-07-31 15:33 UTC (permalink / raw)
To: Richard Biener, gcc-patches

On 7/28/23 01:05, Richard Biener via Gcc-patches wrote:
> The following delays sinking of loads within the same innermost
> loop when it was unconditional before.  That's a not uncommon
> issue preventing vectorization when masked loads are not available.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>
> I have a followup patch improving sinking that without this would
> cause more of the problematic sinking - now that we have a second
> sink pass after loop opts this looks like a reasonable approach?
>
> OK?
>
> Thanks,
> Richard.
>
>     PR tree-optimization/92335
>     * tree-ssa-sink.cc (select_best_block): Before loop
>     optimizations avoid sinking unconditional loads/stores
>     in innermost loops to conditional executed places.
>
>     * gcc.dg/tree-ssa/ssa-sink-10.c: Disable vectorizing.
>     * gcc.dg/tree-ssa/predcom-9.c: Clone from ssa-sink-10.c,
>     expect predictive commoning to happen instead of sinking.
>     * gcc.dg/vect/pr65947-3.c: Adjust.
I think it's reasonable -- there's probably going to be cases where it's
not great, but more often than not I think it's going to be a reasonable
heuristic.

If there is undesirable fallout, better to find it over the coming
months than next spring.  So I'd suggest we go forward now to give more
time to find any pathological cases (if they exist).

Jeff

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH][RFC] tree-optimization/92335 - Improve sinking heuristics for vectorization
  2023-07-31 15:33 ` Jeff Law
@ 2023-08-02  8:46   ` Richard Biener
  2023-08-03 11:42     ` Prathamesh Kulkarni
  0 siblings, 1 reply; 18+ messages in thread

From: Richard Biener @ 2023-08-02  8:46 UTC (permalink / raw)
To: Jeff Law; +Cc: gcc-patches

On Mon, 31 Jul 2023, Jeff Law wrote:

>
>
> On 7/28/23 01:05, Richard Biener via Gcc-patches wrote:
> > The following delays sinking of loads within the same innermost
> > loop when it was unconditional before.  That's a not uncommon
> > issue preventing vectorization when masked loads are not available.
> >
> > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> >
> > I have a followup patch improving sinking that without this would
> > cause more of the problematic sinking - now that we have a second
> > sink pass after loop opts this looks like a reasonable approach?
> >
> > OK?
> >
> > Thanks,
> > Richard.
> >
> >     PR tree-optimization/92335
> >     * tree-ssa-sink.cc (select_best_block): Before loop
> >     optimizations avoid sinking unconditional loads/stores
> >     in innermost loops to conditional executed places.
> >
> >     * gcc.dg/tree-ssa/ssa-sink-10.c: Disable vectorizing.
> >     * gcc.dg/tree-ssa/predcom-9.c: Clone from ssa-sink-10.c,
> >     expect predictive commoning to happen instead of sinking.
> >     * gcc.dg/vect/pr65947-3.c: Adjust.
> I think it's reasonable -- there's probably going to be cases where it's not
> great, but more often than not I think it's going to be a reasonable
> heuristic.
>
> If there is undesirable fallout, better to find it over the coming months than
> next spring.  So I'd suggest we go forward now to give more time to find any
> pathological cases (if they exist).

Agreed, I've pushed this now.

Richard.

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH][RFC] tree-optimization/92335 - Improve sinking heuristics for vectorization
  2023-08-02  8:46 ` Richard Biener
@ 2023-08-03 11:42   ` Prathamesh Kulkarni
  2023-08-03 12:14     ` Richard Biener
  0 siblings, 1 reply; 18+ messages in thread

From: Prathamesh Kulkarni @ 2023-08-03 11:42 UTC (permalink / raw)
To: Richard Biener; +Cc: Jeff Law, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 2397 bytes --]

On Wed, 2 Aug 2023 at 14:17, Richard Biener via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On Mon, 31 Jul 2023, Jeff Law wrote:
>
> >
> >
> > On 7/28/23 01:05, Richard Biener via Gcc-patches wrote:
> > > The following delays sinking of loads within the same innermost
> > > loop when it was unconditional before.  That's a not uncommon
> > > issue preventing vectorization when masked loads are not available.
> > >
> > > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> > >
> > > I have a followup patch improving sinking that without this would
> > > cause more of the problematic sinking - now that we have a second
> > > sink pass after loop opts this looks like a reasonable approach?
> > >
> > > OK?
> > >
> > > Thanks,
> > > Richard.
> > >
> > >     PR tree-optimization/92335
> > >     * tree-ssa-sink.cc (select_best_block): Before loop
> > >     optimizations avoid sinking unconditional loads/stores
> > >     in innermost loops to conditional executed places.
> > >
> > >     * gcc.dg/tree-ssa/ssa-sink-10.c: Disable vectorizing.
> > >     * gcc.dg/tree-ssa/predcom-9.c: Clone from ssa-sink-10.c,
> > >     expect predictive commoning to happen instead of sinking.
> > >     * gcc.dg/vect/pr65947-3.c: Adjust.
> > I think it's reasonable -- there's probably going to be cases where it's not
> > great, but more often than not I think it's going to be a reasonable
> > heuristic.
> >
> > If there is undesirable fallout, better to find it over the coming months than
> > next spring.  So I'd suggest we go forward now to give more time to find any
> > pathological cases (if they exist).
>
> Agreed, I've pushed this now.
Hi Richard,
After this patch (committed in 399c8dd44ff44f4b496223c7cc980651c4d6f6a0),
pr65947-7.c "failed" for aarch64-linux-gnu:
FAIL: gcc.dg/vect/pr65947-7.c scan-tree-dump-not vect "LOOP VECTORIZED"
FAIL: gcc.dg/vect/pr65947-7.c -flto -ffat-lto-objects
scan-tree-dump-not vect "LOOP VECTORIZED"

/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" { target {
! vect_fold_extract_last } } } } */

With your commit, condition_reduction in pr65947-7.c gets vectorized
regardless of vect_fold_extract_last, which gates the above test
(this is an improvement, because the function didn't get vectorized
before the commit).

The attached patch thus removes the gating on vect_fold_extract_last,
and the test passes again.
OK to commit?

Thanks,
Prathamesh
>
> Richard.

[-- Attachment #2: foo.txt --]
[-- Type: text/plain, Size: 535 bytes --]

diff --git a/gcc/testsuite/gcc.dg/vect/pr65947-7.c b/gcc/testsuite/gcc.dg/vect/pr65947-7.c
index 16cdcd1c6eb..7dabae81abf 100644
--- a/gcc/testsuite/gcc.dg/vect/pr65947-7.c
+++ b/gcc/testsuite/gcc.dg/vect/pr65947-7.c
@@ -52,5 +52,4 @@ main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" { target vect_fold_extract_last } } } */
-/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" { target { ! vect_fold_extract_last } } } } */
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH][RFC] tree-optimization/92335 - Improve sinking heuristics for vectorization
  2023-08-03 11:42 ` Prathamesh Kulkarni
@ 2023-08-03 12:14   ` Richard Biener
  2023-08-03 12:16     ` Richard Biener
  0 siblings, 1 reply; 18+ messages in thread

From: Richard Biener @ 2023-08-03 12:14 UTC (permalink / raw)
To: Prathamesh Kulkarni; +Cc: Jeff Law, gcc-patches

On Thu, 3 Aug 2023, Prathamesh Kulkarni wrote:

> On Wed, 2 Aug 2023 at 14:17, Richard Biener via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> > On Mon, 31 Jul 2023, Jeff Law wrote:
> >
> > >
> > >
> > > On 7/28/23 01:05, Richard Biener via Gcc-patches wrote:
> > > > The following delays sinking of loads within the same innermost
> > > > loop when it was unconditional before.  That's a not uncommon
> > > > issue preventing vectorization when masked loads are not available.
> > > >
> > > > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> > > >
> > > > I have a followup patch improving sinking that without this would
> > > > cause more of the problematic sinking - now that we have a second
> > > > sink pass after loop opts this looks like a reasonable approach?
> > > >
> > > > OK?
> > > >
> > > > Thanks,
> > > > Richard.
> > > >
> > > >     PR tree-optimization/92335
> > > >     * tree-ssa-sink.cc (select_best_block): Before loop
> > > >     optimizations avoid sinking unconditional loads/stores
> > > >     in innermost loops to conditional executed places.
> > > >
> > > >     * gcc.dg/tree-ssa/ssa-sink-10.c: Disable vectorizing.
> > > >     * gcc.dg/tree-ssa/predcom-9.c: Clone from ssa-sink-10.c,
> > > >     expect predictive commoning to happen instead of sinking.
> > > >     * gcc.dg/vect/pr65947-3.c: Adjust.
> > > I think it's reasonable -- there's probably going to be cases where it's not
> > > great, but more often than not I think it's going to be a reasonable
> > > heuristic.
> > >
> > > If there is undesirable fallout, better to find it over the coming months than
> > > next spring.  So I'd suggest we go forward now to give more time to find any
> > > pathological cases (if they exist).
> >
> > Agreed, I've pushed this now.
> Hi Richard,
> After this patch (committed in 399c8dd44ff44f4b496223c7cc980651c4d6f6a0),
> pr65947-7.c "failed" for aarch64-linux-gnu:
> FAIL: gcc.dg/vect/pr65947-7.c scan-tree-dump-not vect "LOOP VECTORIZED"
> FAIL: gcc.dg/vect/pr65947-7.c -flto -ffat-lto-objects
> scan-tree-dump-not vect "LOOP VECTORIZED"
>
> /* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" { target {
> ! vect_fold_extract_last } } } } */
>
> With your commit, condition_reduction in pr65947-7.c gets vectorized
> regardless of vect_fold_extract_last,
> which gates the above test (which is an improvement, because the
> function didn't get vectorized before the commit).
>
> The attached patch thus removes the gating on vect_fold_extract_last,
> and the test passes again.
> OK to commit ?

OK.

Thanks,
Richard.

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH][RFC] tree-optimization/92335 - Improve sinking heuristics for vectorization
  2023-08-03 12:14 ` Richard Biener
@ 2023-08-03 12:16   ` Richard Biener
  2023-08-03 12:18     ` Richard Biener
  0 siblings, 1 reply; 18+ messages in thread

From: Richard Biener @ 2023-08-03 12:16 UTC (permalink / raw)
To: Prathamesh Kulkarni; +Cc: Jeff Law, gcc-patches

On Thu, 3 Aug 2023, Richard Biener wrote:

> On Thu, 3 Aug 2023, Prathamesh Kulkarni wrote:
>
> > On Wed, 2 Aug 2023 at 14:17, Richard Biener via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> > >
> > > On Mon, 31 Jul 2023, Jeff Law wrote:
> > >
> > > >
> > > >
> > > > On 7/28/23 01:05, Richard Biener via Gcc-patches wrote:
> > > > > The following delays sinking of loads within the same innermost
> > > > > loop when it was unconditional before.  That's a not uncommon
> > > > > issue preventing vectorization when masked loads are not available.
> > > > >
> > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> > > > >
> > > > > I have a followup patch improving sinking that without this would
> > > > > cause more of the problematic sinking - now that we have a second
> > > > > sink pass after loop opts this looks like a reasonable approach?
> > > > >
> > > > > OK?
> > > > >
> > > > > Thanks,
> > > > > Richard.
> > > > >
> > > > >     PR tree-optimization/92335
> > > > >     * tree-ssa-sink.cc (select_best_block): Before loop
> > > > >     optimizations avoid sinking unconditional loads/stores
> > > > >     in innermost loops to conditional executed places.
> > > > >
> > > > >     * gcc.dg/tree-ssa/ssa-sink-10.c: Disable vectorizing.
> > > > >     * gcc.dg/tree-ssa/predcom-9.c: Clone from ssa-sink-10.c,
> > > > >     expect predictive commoning to happen instead of sinking.
> > > > >     * gcc.dg/vect/pr65947-3.c: Adjust.
> > > > I think it's reasonable -- there's probably going to be cases where it's not
> > > > great, but more often than not I think it's going to be a reasonable
> > > > heuristic.
> > > >
> > > > If there is undesirable fallout, better to find it over the coming months than
> > > > next spring.  So I'd suggest we go forward now to give more time to find any
> > > > pathological cases (if they exist).
> > >
> > > Agreed, I've pushed this now.
> > Hi Richard,
> > After this patch (committed in 399c8dd44ff44f4b496223c7cc980651c4d6f6a0),
> > pr65947-7.c "failed" for aarch64-linux-gnu:
> > FAIL: gcc.dg/vect/pr65947-7.c scan-tree-dump-not vect "LOOP VECTORIZED"
> > FAIL: gcc.dg/vect/pr65947-7.c -flto -ffat-lto-objects
> > scan-tree-dump-not vect "LOOP VECTORIZED"
> >
> > /* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" { target {
> > ! vect_fold_extract_last } } } } */
> >
> > With your commit, condition_reduction in pr65947-7.c gets vectorized
> > regardless of vect_fold_extract_last,
> > which gates the above test (which is an improvement, because the
> > function didn't get vectorized before the commit).
> >
> > The attached patch thus removes the gating on vect_fold_extract_last,
> > and the test passes again.
> > OK to commit ?
>
> OK.

Or wait - the loop doesn't vectorize on x86_64, so I guess one
critical target condition is missing.  Can you figure out which?

Thanks,
Richard.

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH][RFC] tree-optimization/92335 - Improve sinking heuristics for vectorization
  2023-08-03 12:16 ` Richard Biener
@ 2023-08-03 12:18   ` Richard Biener
  2023-08-07  0:04     ` Prathamesh Kulkarni
  0 siblings, 1 reply; 18+ messages in thread

From: Richard Biener @ 2023-08-03 12:18 UTC (permalink / raw)
To: Prathamesh Kulkarni; +Cc: Jeff Law, gcc-patches

On Thu, 3 Aug 2023, Richard Biener wrote:

> On Thu, 3 Aug 2023, Richard Biener wrote:
>
> > On Thu, 3 Aug 2023, Prathamesh Kulkarni wrote:
> >
> > > On Wed, 2 Aug 2023 at 14:17, Richard Biener via Gcc-patches
> > > <gcc-patches@gcc.gnu.org> wrote:
> > > >
> > > > On Mon, 31 Jul 2023, Jeff Law wrote:
> > > >
> > > > >
> > > > >
> > > > > On 7/28/23 01:05, Richard Biener via Gcc-patches wrote:
> > > > > > The following delays sinking of loads within the same innermost
> > > > > > loop when it was unconditional before.  That's a not uncommon
> > > > > > issue preventing vectorization when masked loads are not available.
> > > > > >
> > > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> > > > > >
> > > > > > I have a followup patch improving sinking that without this would
> > > > > > cause more of the problematic sinking - now that we have a second
> > > > > > sink pass after loop opts this looks like a reasonable approach?
> > > > > >
> > > > > > OK?
> > > > > >
> > > > > > Thanks,
> > > > > > Richard.
> > > > > >
> > > > > >     PR tree-optimization/92335
> > > > > >     * tree-ssa-sink.cc (select_best_block): Before loop
> > > > > >     optimizations avoid sinking unconditional loads/stores
> > > > > >     in innermost loops to conditional executed places.
> > > > > >
> > > > > >     * gcc.dg/tree-ssa/ssa-sink-10.c: Disable vectorizing.
> > > > > >     * gcc.dg/tree-ssa/predcom-9.c: Clone from ssa-sink-10.c,
> > > > > >     expect predictive commoning to happen instead of sinking.
> > > > > >     * gcc.dg/vect/pr65947-3.c: Adjust.
> > > > > I think it's reasonable -- there's probably going to be cases where it's not
> > > > > great, but more often than not I think it's going to be a reasonable
> > > > > heuristic.
> > > > >
> > > > > If there is undesirable fallout, better to find it over the coming months than
> > > > > next spring.  So I'd suggest we go forward now to give more time to find any
> > > > > pathological cases (if they exist).
> > > >
> > > > Agreed, I've pushed this now.
> > > Hi Richard,
> > > After this patch (committed in 399c8dd44ff44f4b496223c7cc980651c4d6f6a0),
> > > pr65947-7.c "failed" for aarch64-linux-gnu:
> > > FAIL: gcc.dg/vect/pr65947-7.c scan-tree-dump-not vect "LOOP VECTORIZED"
> > > FAIL: gcc.dg/vect/pr65947-7.c -flto -ffat-lto-objects
> > > scan-tree-dump-not vect "LOOP VECTORIZED"
> > >
> > > /* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" { target {
> > > ! vect_fold_extract_last } } } } */
> > >
> > > With your commit, condition_reduction in pr65947-7.c gets vectorized
> > > regardless of vect_fold_extract_last,
> > > which gates the above test (which is an improvement, because the
> > > function didn't get vectorized before the commit).
> > >
> > > The attached patch thus removes the gating on vect_fold_extract_last,
> > > and the test passes again.
> > > OK to commit ?
> >
> > OK.
>
> Or wait - the loop doesn't vectorize on x86_64, so I guess one
> critical target condition is missing.  Can you figure out which?

I see

/space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21:
note: vect_is_simple_use: operand last_19 = PHI <last_8(7), 108(15)>,
type of def: reduction
/space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21:
note: vect_is_simple_use: vectype vector(4) int
/space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21:
missed: multiple types in double reduction or condition reduction or
fold-left reduction.
/space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:13:1:
missed: not vectorized: relevant phi not supported: last_19 = PHI
<last_8(7), 108(15)>
/space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21:
missed: bad operation or unsupported loop bound.

Richard.

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH][RFC] tree-optimization/92335 - Improve sinking heuristics for vectorization
  2023-08-03 12:18 ` Richard Biener
@ 2023-08-07  0:04   ` Prathamesh Kulkarni
  2023-08-07  7:48     ` Richard Biener
  0 siblings, 1 reply; 18+ messages in thread

From: Prathamesh Kulkarni @ 2023-08-07  0:04 UTC (permalink / raw)
To: Richard Biener; +Cc: Jeff Law, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 6220 bytes --]

On Thu, 3 Aug 2023 at 17:48, Richard Biener <rguenther@suse.de> wrote:
>
> On Thu, 3 Aug 2023, Richard Biener wrote:
>
> > On Thu, 3 Aug 2023, Richard Biener wrote:
> >
> > > On Thu, 3 Aug 2023, Prathamesh Kulkarni wrote:
> > >
> > > > On Wed, 2 Aug 2023 at 14:17, Richard Biener via Gcc-patches
> > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > >
> > > > > On Mon, 31 Jul 2023, Jeff Law wrote:
> > > > >
> > > > > >
> > > > > >
> > > > > > On 7/28/23 01:05, Richard Biener via Gcc-patches wrote:
> > > > > > > The following delays sinking of loads within the same innermost
> > > > > > > loop when it was unconditional before.  That's a not uncommon
> > > > > > > issue preventing vectorization when masked loads are not available.
> > > > > > >
> > > > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> > > > > > >
> > > > > > > I have a followup patch improving sinking that without this would
> > > > > > > cause more of the problematic sinking - now that we have a second
> > > > > > > sink pass after loop opts this looks like a reasonable approach?
> > > > > > >
> > > > > > > OK?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Richard.
> > > > > > >
> > > > > > >     PR tree-optimization/92335
> > > > > > >     * tree-ssa-sink.cc (select_best_block): Before loop
> > > > > > >     optimizations avoid sinking unconditional loads/stores
> > > > > > >     in innermost loops to conditional executed places.
> > > > > > >
> > > > > > >     * gcc.dg/tree-ssa/ssa-sink-10.c: Disable vectorizing.
> > > > > > >     * gcc.dg/tree-ssa/predcom-9.c: Clone from ssa-sink-10.c,
> > > > > > >     expect predictive commoning to happen instead of sinking.
> > > > > > >     * gcc.dg/vect/pr65947-3.c: Adjust.
> > > > > > I think it's reasonable -- there's probably going to be cases where it's not
> > > > > > great, but more often than not I think it's going to be a reasonable
> > > > > > heuristic.
> > > > > >
> > > > > > If there is undesirable fallout, better to find it over the coming months than
> > > > > > next spring.  So I'd suggest we go forward now to give more time to find any
> > > > > > pathological cases (if they exist).
> > > > >
> > > > > Agreed, I've pushed this now.
> > > > Hi Richard,
> > > > After this patch (committed in 399c8dd44ff44f4b496223c7cc980651c4d6f6a0),
> > > > pr65947-7.c "failed" for aarch64-linux-gnu:
> > > > FAIL: gcc.dg/vect/pr65947-7.c scan-tree-dump-not vect "LOOP VECTORIZED"
> > > > FAIL: gcc.dg/vect/pr65947-7.c -flto -ffat-lto-objects
> > > > scan-tree-dump-not vect "LOOP VECTORIZED"
> > > >
> > > > /* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" { target {
> > > > ! vect_fold_extract_last } } } } */
> > > >
> > > > With your commit, condition_reduction in pr65947-7.c gets vectorized
> > > > regardless of vect_fold_extract_last,
> > > > which gates the above test (which is an improvement, because the
> > > > function didn't get vectorized before the commit).
> > > >
> > > > The attached patch thus removes the gating on vect_fold_extract_last,
> > > > and the test passes again.
> > > > OK to commit ?
> > >
> > > OK.
> >
> > Or wait - the loop doesn't vectorize on x86_64, so I guess one
> > critical target condition is missing.  Can you figure out which?
>
> I see
>
> /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21:
> note: vect_is_simple_use: operand last_19 = PHI <last_8(7), 108(15)>,
> type of def: reduction
> /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21:
> note: vect_is_simple_use: vectype vector(4) int
> /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21:
> missed: multiple types in double reduction or condition reduction or
> fold-left reduction.
> /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:13:1:
> missed: not vectorized: relevant phi not supported: last_19 = PHI
> <last_8(7), 108(15)>
> /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21:
> missed: bad operation or unsupported loop bound.
Hi Richard,
Looking at the aarch64 vect dump, it seems the loop in
condition_reduction gets vectorized with V4HI mode while it fails for
other modes in vectorizable_condition:

  if ((double_reduc || reduction_type != TREE_CODE_REDUCTION)
      && ncopies > 1)
    {
      if (dump_enabled_p ())
	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
			 "multiple types in double reduction or condition "
			 "reduction or fold-left reduction.\n");
      return false;
    }

From the dump:
foo.c:9:21: note: === vect_analyze_loop_operations ===
foo.c:9:21: note: examining phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note: vect_is_simple_use: operand (int) aval_13, type of def: internal
foo.c:9:21: note: vect_is_simple_use: vectype vector(4) int
foo.c:9:21: note: vect_is_simple_use: operand last_19 = PHI <last_8(7), 108(15)>, type of def: reduction
foo.c:9:21: note: vect_is_simple_use: vectype vector(4) int

For V8HI, VF = 8, and vectype_in = vector(4) int.
Thus ncopies = VF / length(vectype_in) = 2, which is greater than 1,
and thus fails:
foo.c:9:21: missed: multiple types in double reduction or condition reduction or fold-left reduction.
foo.c:4:1: missed: not vectorized: relevant phi not supported: last_19 = PHI <last_8(7), 108(15)>

While for V4HI, VF = 4 and thus ncopies = 1, so it succeeds.

For x86_64, the vectorizer doesn't seem to try V4HI mode.
If I "force" the vectorizer to use V4HI mode, we get the following dump:
foo.c:9:21: note: === vect_analyze_loop_operations ===
foo.c:9:21: note: examining phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note: vect_is_simple_use: operand (int) aval_13, type of def: internal
foo.c:9:21: note: vect_is_simple_use: vectype vector(2) int
foo.c:9:21: note: vect_is_simple_use: operand last_19 = PHI <last_8(7), 108(15)>, type of def: reduction
foo.c:9:21: note: vect_is_simple_use: vectype vector(2) int
foo.c:9:21: missed: multiple types in double reduction or condition reduction or fold-left reduction.

Not sure though if this is the only reason for the test to fail to
vectorize on the target.
Will investigate in more detail next week.

Thanks,
Prathamesh
>
> Richard.

[-- Attachment #2: aarch64-vect-dump.txt --]
[-- Type: text/plain, Size: 70531 bytes --]

;; Function condition_reduction (condition_reduction, funcdef_no=0, decl_uid=4390, cgraph_uid=1, symbol_order=0)

Analyzing loop at foo.c:9
foo.c:9:21: note: === analyze_loop_nest ===
foo.c:9:21: note: === vect_analyze_loop_form ===
foo.c:9:21: note: === get_loop_niters ===
Analyzing # of iterations of loop 1
  exit condition [42, + , 4294967295] != 0
  bounds on difference of bases: -42 ... -42
  result: # of iterations 42, bounded by 42
Creating dr for *_3
analyze_innermost: success.
	base_address: a_12(D)
	offset from base address: 0
	constant offset from base address: 0
	step: 2
	base alignment: 2
	base misalignment: 0
	offset alignment: 128
	step alignment: 2
	base_object: *a_12(D)
	Access function 0: {0B, +, 2}_1
Creating dr for *_6
analyze_innermost: success.
base_address: b_14(D) offset from base address: 0 constant offset from base address: 0 step: 4 base alignment: 4 base misalignment: 0 offset alignment: 128 step alignment: 4 base_object: *b_14(D) Access function 0: {0B, +, 4}_1 foo.c:9:21: note: === vect_analyze_data_refs === foo.c:9:21: note: got vectype for stmt: aval_13 = *_3; vector(8) short int foo.c:9:21: note: got vectype for stmt: _7 = *_6; vector(4) int foo.c:9:21: note: === vect_analyze_scalar_cycles === foo.c:9:21: note: Analyze phi: last_19 = PHI <last_8(7), 108(15)> foo.c:9:21: note: Access function of PHI: last_19 foo.c:9:21: note: Analyze phi: i_21 = PHI <i_17(7), 0(15)> foo.c:9:21: note: Access function of PHI: {0, +, 1}_1 foo.c:9:21: note: step: 1, init: 0 foo.c:9:21: note: Detected induction. foo.c:9:21: note: Analyze phi: ivtmp_18 = PHI <ivtmp_10(7), 43(15)> foo.c:9:21: note: Access function of PHI: {43, +, 4294967295}_1 foo.c:9:21: note: step: 4294967295, init: 43 foo.c:9:21: note: Detected induction. foo.c:9:21: note: Analyze phi: last_19 = PHI <last_8(7), 108(15)> foo.c:9:21: note: reduction path: last_8 last_19 foo.c:9:21: note: reduction: detected reduction foo.c:9:21: note: Detected reduction. 
foo.c:9:21: note: === vect_determine_precisions ===
foo.c:9:21: note: using boolean precision 32 for _9 = _7 < min_v_15(D);
foo.c:9:21: note: ivtmp_10 has no range info
foo.c:9:21: note: i_17 has range [0x1, 0x2b]
foo.c:9:21: note: can narrow to unsigned:6 without loss of precision: i_17 = i_21 + 1;
foo.c:9:21: note: last_8 has no range info
foo.c:9:21: note: last_16 has no range info
foo.c:9:21: note: _7 has no range info
foo.c:9:21: note: _5 has range [0x0, 0xa8]
foo.c:9:21: note: can narrow to unsigned:8 without loss of precision: _5 = _1 * 4;
foo.c:9:21: note: aval_13 has no range info
foo.c:9:21: note: _2 has range [0x0, 0x54]
foo.c:9:21: note: can narrow to unsigned:7 without loss of precision: _2 = _1 * 2;
foo.c:9:21: note: _1 has range [0x0, 0x2a]
foo.c:9:21: note: === vect_pattern_recog ===
foo.c:9:21: note: vect_is_simple_use: operand (long unsigned int) i_21, type of def: internal
foo.c:9:21: note: vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note: vect_is_simple_use: operand (long unsigned int) i_21, type of def: internal
foo.c:9:21: note: vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note: vect_recog_widen_mult_pattern: detected: _2 = _1 * 2;
foo.c:9:21: note: widen_mult pattern recognized: patt_37 = (long unsigned int) patt_4;
foo.c:9:21: note: extra pattern stmt: patt_4 = i_21 w* 2;
foo.c:9:21: note: vect_is_simple_use: operand (long unsigned int) i_21, type of def: internal
foo.c:9:21: note: vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note: vect_is_simple_use: operand (long unsigned int) i_21, type of def: internal
foo.c:9:21: note: vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note: vect_recog_widen_mult_pattern: detected: _5 = _1 * 4;
foo.c:9:21: note: widen_mult pattern recognized: patt_39 = (long unsigned int) patt_38;
foo.c:9:21: note: extra pattern stmt: patt_38 = i_21 w* 4;
foo.c:9:21: note: vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note: vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note: vect_is_simple_use: operand ivtmp_18 = PHI <ivtmp_10(7), 43(15)>, type of def: induction
foo.c:9:21: note: === vect_analyze_data_ref_accesses ===
foo.c:9:21: note: === vect_mark_stmts_to_be_vectorized ===
foo.c:9:21: note: init: phi relevant? last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note: init: phi relevant? i_21 = PHI <i_17(7), 0(15)>
foo.c:9:21: note: init: phi relevant? ivtmp_18 = PHI <ivtmp_10(7), 43(15)>
foo.c:9:21: note: init: stmt relevant? _1 = (long unsigned int) i_21;
foo.c:9:21: note: init: stmt relevant? _2 = _1 * 2;
foo.c:9:21: note: init: stmt relevant? _3 = a_12(D) + _2;
foo.c:9:21: note: init: stmt relevant? aval_13 = *_3;
foo.c:9:21: note: init: stmt relevant? _5 = _1 * 4;
foo.c:9:21: note: init: stmt relevant? _6 = b_14(D) + _5;
foo.c:9:21: note: init: stmt relevant? _7 = *_6;
foo.c:9:21: note: init: stmt relevant? last_16 = (int) aval_13;
foo.c:9:21: note: init: stmt relevant? _9 = _7 < min_v_15(D);
foo.c:9:21: note: init: stmt relevant? last_8 = _9 ? last_16 : last_19;
foo.c:9:21: note: vec_stmt_relevant_p: used out of loop.
foo.c:9:21: note: vect_is_simple_use: operand _7 < min_v_15(D), type of def: internal
foo.c:9:21: note: vec_stmt_relevant_p: stmt live but not relevant.
foo.c:9:21: note: mark relevant 1, live 1: last_8 = _9 ? last_16 : last_19;
foo.c:9:21: note: init: stmt relevant? i_17 = i_21 + 1;
foo.c:9:21: note: init: stmt relevant? ivtmp_10 = ivtmp_18 - 1;
foo.c:9:21: note: init: stmt relevant? if (ivtmp_10 != 0)
foo.c:9:21: note: worklist: examine stmt: last_8 = _9 ? last_16 : last_19;
foo.c:9:21: note: vect_is_simple_use: operand _7 < min_v_15(D), type of def: internal
foo.c:9:21: note: mark relevant 1, live 0: _9 = _7 < min_v_15(D);
foo.c:9:21: note: vect_is_simple_use: operand (int) aval_13, type of def: internal
foo.c:9:21: note: mark relevant 1, live 0: last_16 = (int) aval_13;
foo.c:9:21: note: vect_is_simple_use: operand last_19 = PHI <last_8(7), 108(15)>, type of def: reduction
foo.c:9:21: note: mark relevant 1, live 0: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note: worklist: examine stmt: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note: vect_is_simple_use: operand _9 ? last_16 : last_19, type of def: reduction
foo.c:9:21: note: reduc-stmt defining reduc-phi in the same nest.
foo.c:9:21: note: mark relevant 1, live 1: last_8 = _9 ? last_16 : last_19;
foo.c:9:21: note: already marked relevant/live.
foo.c:9:21: note: vect_is_simple_use: operand 108, type of def: constant
foo.c:9:21: note: worklist: examine stmt: last_16 = (int) aval_13;
foo.c:9:21: note: vect_is_simple_use: operand *_3, type of def: internal
foo.c:9:21: note: mark relevant 1, live 0: aval_13 = *_3;
foo.c:9:21: note: worklist: examine stmt: aval_13 = *_3;
foo.c:9:21: note: worklist: examine stmt: _9 = _7 < min_v_15(D);
foo.c:9:21: note: vect_is_simple_use: operand *_6, type of def: internal
foo.c:9:21: note: mark relevant 1, live 0: _7 = *_6;
foo.c:9:21: note: vect_is_simple_use: operand min_v_15(D), type of def: external
foo.c:9:21: note: worklist: examine stmt: _7 = *_6;
foo.c:9:21: note: === vect_analyze_data_ref_dependences ===
foo.c:9:21: note: === vect_determine_vectorization_factor ===
foo.c:9:21: note: ==> examining phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note: get vectype for scalar type: int
foo.c:9:21: note: vectype: vector(4) int
foo.c:9:21: note: nunits = 4
foo.c:9:21: note: ==> examining phi: i_21 = PHI <i_17(7), 0(15)>
foo.c:9:21: note: ==> examining phi: ivtmp_18 = PHI <ivtmp_10(7), 43(15)>
foo.c:9:21: note: ==> examining statement: _1 = (long unsigned int) i_21;
foo.c:9:21: note: skip.
foo.c:9:21: note: ==> examining statement: _2 = _1 * 2;
foo.c:9:21: note: skip.
foo.c:9:21: note: ==> examining pattern def stmt: patt_4 = i_21 w* 2;
foo.c:9:21: note: skip.
foo.c:9:21: note: ==> examining pattern statement: patt_37 = (long unsigned int) patt_4;
foo.c:9:21: note: skip.
foo.c:9:21: note: ==> examining statement: _3 = a_12(D) + _2;
foo.c:9:21: note: skip.
foo.c:9:21: note: ==> examining statement: aval_13 = *_3;
foo.c:9:21: note: precomputed vectype: vector(8) short int
foo.c:9:21: note: nunits = 8
foo.c:9:21: note: ==> examining statement: _5 = _1 * 4;
foo.c:9:21: note: skip.
foo.c:9:21: note: ==> examining pattern def stmt: patt_38 = i_21 w* 4;
foo.c:9:21: note: skip.
foo.c:9:21: note: ==> examining pattern statement: patt_39 = (long unsigned int) patt_38;
foo.c:9:21: note: skip.
foo.c:9:21: note: ==> examining statement: _6 = b_14(D) + _5;
foo.c:9:21: note: skip.
foo.c:9:21: note: ==> examining statement: _7 = *_6;
foo.c:9:21: note: precomputed vectype: vector(4) int
foo.c:9:21: note: nunits = 4
foo.c:9:21: note: ==> examining statement: last_16 = (int) aval_13;
foo.c:9:21: note: get vectype for scalar type: int
foo.c:9:21: note: vectype: vector(4) int
foo.c:9:21: note: get vectype for smallest scalar type: short int
foo.c:9:21: note: nunits vectype: vector(8) short int
foo.c:9:21: note: nunits = 8
foo.c:9:21: note: ==> examining statement: _9 = _7 < min_v_15(D);
foo.c:9:21: note: vectype: vector(4) <signed-boolean:32>
foo.c:9:21: note: nunits = 4
foo.c:9:21: note: ==> examining statement: last_8 = _9 ? last_16 : last_19;
foo.c:9:21: note: get vectype for scalar type: int
foo.c:9:21: note: vectype: vector(4) int
foo.c:9:21: note: nunits = 4
foo.c:9:21: note: ==> examining statement: i_17 = i_21 + 1;
foo.c:9:21: note: skip.
foo.c:9:21: note: ==> examining statement: ivtmp_10 = ivtmp_18 - 1;
foo.c:9:21: note: skip.
foo.c:9:21: note: ==> examining statement: if (ivtmp_10 != 0)
foo.c:9:21: note: skip.
foo.c:9:21: note: vectorization factor = 8
foo.c:9:21: note: === vect_compute_single_scalar_iteration_cost ===
*_3 1 times scalar_load costs 1 in prologue
*_6 1 times scalar_load costs 1 in prologue
(int) aval_13 1 times scalar_stmt costs 1 in prologue
_7 < min_v_15(D) 1 times scalar_stmt costs 1 in prologue
_9 ? last_16 : last_19 1 times scalar_stmt costs 1 in prologue
foo.c:9:21: note: === vect_analyze_slp ===
foo.c:9:21: note: === vect_make_slp_decision ===
foo.c:9:21: note: vectorization_factor = 8, niters = 43
foo.c:9:21: note: === vect_analyze_data_refs_alignment ===
foo.c:9:21: note: recording new base alignment for a_12(D)
  alignment: 2
  misalignment: 0
  based on: aval_13 = *_3;
foo.c:9:21: note: recording new base alignment for b_14(D)
  alignment: 4
  misalignment: 0
  based on: _7 = *_6;
foo.c:9:21: note: vect_compute_data_ref_alignment:
foo.c:9:21: note: can't force alignment of ref: *_3
foo.c:9:21: note: vect_compute_data_ref_alignment:
foo.c:9:21: note: can't force alignment of ref: *_6
foo.c:9:21: note: === vect_prune_runtime_alias_test_list ===
foo.c:9:21: note: === vect_enhance_data_refs_alignment ===
foo.c:9:21: missed: Unknown misalignment, naturally aligned
foo.c:9:21: missed: Unknown misalignment, naturally aligned
foo.c:9:21: note: vect_can_advance_ivs_p:
foo.c:9:21: note: Analyze phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note: reduc or virtual phi. skip.
foo.c:9:21: note: Analyze phi: i_21 = PHI <i_17(7), 0(15)>
foo.c:9:21: note: Analyze phi: ivtmp_18 = PHI <ivtmp_10(7), 43(15)>
foo.c:9:21: note: vect_model_load_cost: aligned.
foo.c:9:21: note: vect_get_data_access_cost: inside_cost = 1, outside_cost = 0.
foo.c:9:21: note: vect_model_load_cost: unaligned supported by hardware.
foo.c:9:21: note: vect_get_data_access_cost: inside_cost = 3, outside_cost = 0.
foo.c:9:21: note: vect_model_load_cost: unaligned supported by hardware.
foo.c:9:21: note: vect_get_data_access_cost: inside_cost = 1, outside_cost = 0.
foo.c:9:21: note: vect_model_load_cost: unaligned supported by hardware.
foo.c:9:21: note: vect_get_data_access_cost: inside_cost = 3, outside_cost = 0.
foo.c:9:21: note: === vect_dissolve_slp_only_groups ===
foo.c:9:21: note: === vect_analyze_loop_operations ===
foo.c:9:21: note: examining phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note: vect_is_simple_use: operand (int) aval_13, type of def: internal
foo.c:9:21: note: vect_is_simple_use: vectype vector(4) int
foo.c:9:21: note: vect_is_simple_use: operand last_19 = PHI <last_8(7), 108(15)>, type of def: reduction
foo.c:9:21: note: vect_is_simple_use: vectype vector(4) int
foo.c:9:21: missed: multiple types in double reduction or condition reduction or fold-left reduction.
foo.c:4:1: missed: not vectorized: relevant phi not supported: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: missed: bad operation or unsupported loop bound.
foo.c:9:21: note: ***** Analysis failed with vector mode V8HI
foo.c:9:21: note: ***** The result for vector mode V16QI would be the same
foo.c:9:21: note: ***** The result for vector mode V8QI would be the same
foo.c:9:21: note: ***** Re-trying analysis with vector mode V4HI
foo.c:9:21: note: === vect_analyze_data_refs ===
foo.c:9:21: note: got vectype for stmt: aval_13 = *_3; vector(4) short int
foo.c:9:21: note: got vectype for stmt: _7 = *_6; vector(4) int
foo.c:9:21: note: === vect_analyze_scalar_cycles ===
foo.c:9:21: note: Analyze phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note: Access function of PHI: last_19
foo.c:9:21: note: Analyze phi: i_21 = PHI <i_17(7), 0(15)>
foo.c:9:21: note: Access function of PHI: {0, +, 1}_1
foo.c:9:21: note: step: 1, init: 0
foo.c:9:21: note: Detected induction.
foo.c:9:21: note: Analyze phi: ivtmp_18 = PHI <ivtmp_10(7), 43(15)>
foo.c:9:21: note: Access function of PHI: {43, +, 4294967295}_1
foo.c:9:21: note: step: 4294967295, init: 43
foo.c:9:21: note: Detected induction.
foo.c:9:21: note: Analyze phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note: reduction path: last_8 last_19
foo.c:9:21: note: reduction: detected reduction
foo.c:9:21: note: Detected reduction.
foo.c:9:21: note: === vect_determine_precisions ===
foo.c:9:21: note: using boolean precision 32 for _9 = _7 < min_v_15(D);
foo.c:9:21: note: ivtmp_10 has no range info
foo.c:9:21: note: i_17 has range [0x1, 0x2b]
foo.c:9:21: note: can narrow to unsigned:6 without loss of precision: i_17 = i_21 + 1;
foo.c:9:21: note: last_8 has no range info
foo.c:9:21: note: last_16 has no range info
foo.c:9:21: note: _7 has no range info
foo.c:9:21: note: _5 has range [0x0, 0xa8]
foo.c:9:21: note: can narrow to unsigned:8 without loss of precision: _5 = _1 * 4;
foo.c:9:21: note: aval_13 has no range info
foo.c:9:21: note: _2 has range [0x0, 0x54]
foo.c:9:21: note: can narrow to unsigned:7 without loss of precision: _2 = _1 * 2;
foo.c:9:21: note: _1 has range [0x0, 0x2a]
foo.c:9:21: note: === vect_pattern_recog ===
foo.c:9:21: note: vect_is_simple_use: operand (long unsigned int) i_21, type of def: internal
foo.c:9:21: note: vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note: vect_is_simple_use: operand (long unsigned int) i_21, type of def: internal
foo.c:9:21: note: vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note: vect_recog_widen_mult_pattern: detected: _2 = _1 * 2;
foo.c:9:21: note: widen_mult pattern recognized: patt_41 = (long unsigned int) patt_40;
foo.c:9:21: note: extra pattern stmt: patt_40 = i_21 w* 2;
foo.c:9:21: note: vect_is_simple_use: operand (long unsigned int) i_21, type of def: internal
foo.c:9:21: note: vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note: vect_is_simple_use: operand (long unsigned int) i_21, type of def: internal
foo.c:9:21: note: vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note: vect_recog_widen_mult_pattern: detected: _5 = _1 * 4;
foo.c:9:21: note: widen_mult pattern recognized: patt_43 = (long unsigned int) patt_42;
foo.c:9:21: note: extra pattern stmt: patt_42 = i_21 w* 4;
foo.c:9:21: note: vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note: vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note: vect_is_simple_use: operand ivtmp_18 = PHI <ivtmp_10(7), 43(15)>, type of def: induction
foo.c:9:21: note: === vect_analyze_data_ref_accesses ===
foo.c:9:21: note: === vect_mark_stmts_to_be_vectorized ===
foo.c:9:21: note: init: phi relevant? last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note: init: phi relevant? i_21 = PHI <i_17(7), 0(15)>
foo.c:9:21: note: init: phi relevant? ivtmp_18 = PHI <ivtmp_10(7), 43(15)>
foo.c:9:21: note: init: stmt relevant? _1 = (long unsigned int) i_21;
foo.c:9:21: note: init: stmt relevant? _2 = _1 * 2;
foo.c:9:21: note: init: stmt relevant? _3 = a_12(D) + _2;
foo.c:9:21: note: init: stmt relevant? aval_13 = *_3;
foo.c:9:21: note: init: stmt relevant? _5 = _1 * 4;
foo.c:9:21: note: init: stmt relevant? _6 = b_14(D) + _5;
foo.c:9:21: note: init: stmt relevant? _7 = *_6;
foo.c:9:21: note: init: stmt relevant? last_16 = (int) aval_13;
foo.c:9:21: note: init: stmt relevant? _9 = _7 < min_v_15(D);
foo.c:9:21: note: init: stmt relevant? last_8 = _9 ? last_16 : last_19;
foo.c:9:21: note: vec_stmt_relevant_p: used out of loop.
foo.c:9:21: note: vect_is_simple_use: operand _7 < min_v_15(D), type of def: internal
foo.c:9:21: note: vec_stmt_relevant_p: stmt live but not relevant.
foo.c:9:21: note: mark relevant 1, live 1: last_8 = _9 ? last_16 : last_19;
foo.c:9:21: note: init: stmt relevant? i_17 = i_21 + 1;
foo.c:9:21: note: init: stmt relevant? ivtmp_10 = ivtmp_18 - 1;
foo.c:9:21: note: init: stmt relevant? if (ivtmp_10 != 0)
foo.c:9:21: note: worklist: examine stmt: last_8 = _9 ? last_16 : last_19;
foo.c:9:21: note: vect_is_simple_use: operand _7 < min_v_15(D), type of def: internal
foo.c:9:21: note: mark relevant 1, live 0: _9 = _7 < min_v_15(D);
foo.c:9:21: note: vect_is_simple_use: operand (int) aval_13, type of def: internal
foo.c:9:21: note: mark relevant 1, live 0: last_16 = (int) aval_13;
foo.c:9:21: note: vect_is_simple_use: operand last_19 = PHI <last_8(7), 108(15)>, type of def: reduction
foo.c:9:21: note: mark relevant 1, live 0: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note: worklist: examine stmt: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note: vect_is_simple_use: operand _9 ? last_16 : last_19, type of def: reduction
foo.c:9:21: note: reduc-stmt defining reduc-phi in the same nest.
foo.c:9:21: note: mark relevant 1, live 1: last_8 = _9 ? last_16 : last_19;
foo.c:9:21: note: already marked relevant/live.
foo.c:9:21: note: vect_is_simple_use: operand 108, type of def: constant
foo.c:9:21: note: worklist: examine stmt: last_16 = (int) aval_13;
foo.c:9:21: note: vect_is_simple_use: operand *_3, type of def: internal
foo.c:9:21: note: mark relevant 1, live 0: aval_13 = *_3;
foo.c:9:21: note: worklist: examine stmt: aval_13 = *_3;
foo.c:9:21: note: worklist: examine stmt: _9 = _7 < min_v_15(D);
foo.c:9:21: note: vect_is_simple_use: operand *_6, type of def: internal
foo.c:9:21: note: mark relevant 1, live 0: _7 = *_6;
foo.c:9:21: note: vect_is_simple_use: operand min_v_15(D), type of def: external
foo.c:9:21: note: worklist: examine stmt: _7 = *_6;
foo.c:9:21: note: === vect_analyze_data_ref_dependences ===
foo.c:9:21: note: === vect_determine_vectorization_factor ===
foo.c:9:21: note: ==> examining phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note: get vectype for scalar type: int
foo.c:9:21: note: vectype: vector(4) int
foo.c:9:21: note: nunits = 4
foo.c:9:21: note: ==> examining phi: i_21 = PHI <i_17(7), 0(15)>
foo.c:9:21: note: ==> examining phi: ivtmp_18 = PHI <ivtmp_10(7), 43(15)>
foo.c:9:21: note: ==> examining statement: _1 = (long unsigned int) i_21;
foo.c:9:21: note: skip.
foo.c:9:21: note: ==> examining statement: _2 = _1 * 2;
foo.c:9:21: note: skip.
foo.c:9:21: note: ==> examining pattern def stmt: patt_40 = i_21 w* 2;
foo.c:9:21: note: skip.
foo.c:9:21: note: ==> examining pattern statement: patt_41 = (long unsigned int) patt_40;
foo.c:9:21: note: skip.
foo.c:9:21: note: ==> examining statement: _3 = a_12(D) + _2;
foo.c:9:21: note: skip.
foo.c:9:21: note: ==> examining statement: aval_13 = *_3;
foo.c:9:21: note: precomputed vectype: vector(4) short int
foo.c:9:21: note: nunits = 4
foo.c:9:21: note: ==> examining statement: _5 = _1 * 4;
foo.c:9:21: note: skip.
foo.c:9:21: note: ==> examining pattern def stmt: patt_42 = i_21 w* 4;
foo.c:9:21: note: skip.
foo.c:9:21: note: ==> examining pattern statement: patt_43 = (long unsigned int) patt_42;
foo.c:9:21: note: skip.
foo.c:9:21: note: ==> examining statement: _6 = b_14(D) + _5;
foo.c:9:21: note: skip.
foo.c:9:21: note: ==> examining statement: _7 = *_6;
foo.c:9:21: note: precomputed vectype: vector(4) int
foo.c:9:21: note: nunits = 4
foo.c:9:21: note: ==> examining statement: last_16 = (int) aval_13;
foo.c:9:21: note: get vectype for scalar type: int
foo.c:9:21: note: vectype: vector(4) int
foo.c:9:21: note: get vectype for smallest scalar type: short int
foo.c:9:21: note: nunits vectype: vector(4) short int
foo.c:9:21: note: nunits = 4
foo.c:9:21: note: ==> examining statement: _9 = _7 < min_v_15(D);
foo.c:9:21: note: vectype: vector(4) <signed-boolean:32>
foo.c:9:21: note: nunits = 4
foo.c:9:21: note: ==> examining statement: last_8 = _9 ? last_16 : last_19;
foo.c:9:21: note: get vectype for scalar type: int
foo.c:9:21: note: vectype: vector(4) int
foo.c:9:21: note: nunits = 4
foo.c:9:21: note: ==> examining statement: i_17 = i_21 + 1;
foo.c:9:21: note: skip.
foo.c:9:21: note: ==> examining statement: ivtmp_10 = ivtmp_18 - 1;
foo.c:9:21: note: skip.
foo.c:9:21: note: ==> examining statement: if (ivtmp_10 != 0)
foo.c:9:21: note: skip.
foo.c:9:21: note: vectorization factor = 4
foo.c:9:21: note: === vect_compute_single_scalar_iteration_cost ===
*_3 1 times scalar_load costs 1 in prologue
*_6 1 times scalar_load costs 1 in prologue
(int) aval_13 1 times scalar_stmt costs 1 in prologue
_7 < min_v_15(D) 1 times scalar_stmt costs 1 in prologue
_9 ? last_16 : last_19 1 times scalar_stmt costs 1 in prologue
foo.c:9:21: note: === vect_analyze_slp ===
foo.c:9:21: note: === vect_make_slp_decision ===
foo.c:9:21: note: vectorization_factor = 4, niters = 43
foo.c:9:21: note: === vect_analyze_data_refs_alignment ===
foo.c:9:21: note: recording new base alignment for a_12(D)
  alignment: 2
  misalignment: 0
  based on: aval_13 = *_3;
foo.c:9:21: note: recording new base alignment for b_14(D)
  alignment: 4
  misalignment: 0
  based on: _7 = *_6;
foo.c:9:21: note: vect_compute_data_ref_alignment:
foo.c:9:21: note: can't force alignment of ref: *_3
foo.c:9:21: note: vect_compute_data_ref_alignment:
foo.c:9:21: note: can't force alignment of ref: *_6
foo.c:9:21: note: === vect_prune_runtime_alias_test_list ===
foo.c:9:21: note: === vect_enhance_data_refs_alignment ===
foo.c:9:21: missed: Unknown misalignment, naturally aligned
foo.c:9:21: missed: Unknown misalignment, naturally aligned
foo.c:9:21: note: vect_can_advance_ivs_p:
foo.c:9:21: note: Analyze phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note: reduc or virtual phi. skip.
foo.c:9:21: note: Analyze phi: i_21 = PHI <i_17(7), 0(15)>
foo.c:9:21: note: Analyze phi: ivtmp_18 = PHI <ivtmp_10(7), 43(15)>
foo.c:9:21: note: vect_model_load_cost: aligned.
foo.c:9:21: note: vect_get_data_access_cost: inside_cost = 1, outside_cost = 0.
foo.c:9:21: note: vect_model_load_cost: unaligned supported by hardware.
foo.c:9:21: note: vect_get_data_access_cost: inside_cost = 2, outside_cost = 0.
foo.c:9:21: note: vect_model_load_cost: unaligned supported by hardware.
foo.c:9:21: note: vect_get_data_access_cost: inside_cost = 1, outside_cost = 0.
foo.c:9:21: note: vect_model_load_cost: unaligned supported by hardware.
foo.c:9:21: note: vect_get_data_access_cost: inside_cost = 2, outside_cost = 0.
foo.c:9:21: note: === vect_dissolve_slp_only_groups ===
foo.c:9:21: note: === vect_analyze_loop_operations ===
foo.c:9:21: note: examining phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note: vect_is_simple_use: operand (int) aval_13, type of def: internal
foo.c:9:21: note: vect_is_simple_use: vectype vector(4) int
foo.c:9:21: note: vect_is_simple_use: operand last_19 = PHI <last_8(7), 108(15)>, type of def: reduction
foo.c:9:21: note: vect_is_simple_use: vectype vector(4) int
Estimating # of iterations of loop 1
Analyzing # of iterations of loop 1
  exit condition [42, + , 4294967295] != 0
  bounds on difference of bases: -42 ... -42
  result: # of iterations 42, bounded by 42
Analyzing # of iterations of loop 1
  exit condition [42, + , 4294967295] != 0
  bounds on difference of bases: -42 ... -42
  result: # of iterations 42, bounded by 42
Statement (exit)if (ivtmp_10 != 0) is executed at most 42 (bounded by 42) + 1 times in loop 1.
Induction variable (short int *) a_12(D) + 2 * iteration does not wrap in statement _3 = a_12(D) + _2; in loop 1.
Statement _3 = a_12(D) + _2; is executed at most 9223372036854775806 (bounded by 9223372036854775806) + 1 times in loop 1.
Induction variable (int *) b_14(D) + 4 * iteration does not wrap in statement _6 = b_14(D) + _5; in loop 1.
Statement _6 = b_14(D) + _5; is executed at most 4611686018427387902 (bounded by 4611686018427387902) + 1 times in loop 1.
Induction variable (int) 1 + 1 * iteration does not wrap in statement i_17 = i_21 + 1; in loop 1.
Statement i_17 = i_21 + 1; is executed at most 42 (bounded by 42) + 1 times in loop 1.
vect_model_reduction_cost: inside_cost = 0, prologue_cost = 4, epilogue_cost = 7 .
foo.c:9:21: note: examining phi: i_21 = PHI <i_17(7), 0(15)>
foo.c:9:21: note: examining phi: ivtmp_18 = PHI <ivtmp_10(7), 43(15)>
foo.c:9:21: note: ==> examining statement: _1 = (long unsigned int) i_21;
foo.c:9:21: note: irrelevant.
foo.c:9:21: note: ==> examining statement: _2 = _1 * 2;
foo.c:9:21: note: irrelevant.
foo.c:9:21: note: ==> examining statement: _3 = a_12(D) + _2;
foo.c:9:21: note: irrelevant.
foo.c:9:21: note: ==> examining statement: aval_13 = *_3;
foo.c:9:21: missed: can't operate on partial vectors because the target doesn't have the appropriate partial vectorization load or store.
foo.c:9:21: note: Vectorizing an unaligned access.
foo.c:9:21: note: vect_model_load_cost: unaligned supported by hardware.
foo.c:9:21: note: vect_model_load_cost: inside_cost = 1, prologue_cost = 0 .
foo.c:9:21: note: ==> examining statement: _5 = _1 * 4;
foo.c:9:21: note: irrelevant.
foo.c:9:21: note: ==> examining statement: _6 = b_14(D) + _5;
foo.c:9:21: note: irrelevant.
foo.c:9:21: note: ==> examining statement: _7 = *_6;
foo.c:9:21: note: Vectorizing an unaligned access.
foo.c:9:21: note: vect_model_load_cost: unaligned supported by hardware.
foo.c:9:21: note: vect_model_load_cost: inside_cost = 1, prologue_cost = 0 .
foo.c:9:21: note: ==> examining statement: last_16 = (int) aval_13;
foo.c:9:21: note: vect_is_simple_use: operand *_3, type of def: internal
foo.c:9:21: note: vect_is_simple_use: vectype vector(4) short int
foo.c:9:21: note: === vectorizable_conversion ===
foo.c:9:21: note: vect_model_simple_cost: inside_cost = 1, prologue_cost = 0 .
foo.c:9:21: note: ==> examining statement: _9 = _7 < min_v_15(D);
foo.c:9:21: note: vect_is_simple_use: operand *_6, type of def: internal
foo.c:9:21: note: vect_is_simple_use: vectype vector(4) int
foo.c:9:21: note: vect_is_simple_use: operand min_v_15(D), type of def: external
foo.c:9:21: note: vect_model_simple_cost: inside_cost = 1, prologue_cost = 1 .
foo.c:9:21: note: ==> examining statement: last_8 = _9 ? last_16 : last_19;
foo.c:9:21: note: vect_is_simple_use: operand _7 < min_v_15(D), type of def: internal
foo.c:9:21: note: vect_is_simple_use: vectype vector(4) <signed-boolean:32>
foo.c:9:21: note: vect_is_simple_use: operand (int) aval_13, type of def: internal
foo.c:9:21: note: vect_is_simple_use: vectype vector(4) int
foo.c:9:21: note: vect_is_simple_use: operand last_19 = PHI <last_8(7), 108(15)>, type of def: reduction
foo.c:9:21: note: vect_is_simple_use: vectype vector(4) int
foo.c:9:21: note: vect_model_simple_cost: inside_cost = 1, prologue_cost = 0 .
foo.c:9:21: note: ==> examining statement: i_17 = i_21 + 1;
foo.c:9:21: note: irrelevant.
foo.c:9:21: note: ==> examining statement: ivtmp_10 = ivtmp_18 - 1;
foo.c:9:21: note: irrelevant.
foo.c:9:21: note: ==> examining statement: if (ivtmp_10 != 0)
foo.c:9:21: note: irrelevant.
_9 ? last_16 : last_19 4 times scalar_to_vec costs 4 in prologue
_9 ? last_16 : last_19 2 times vector_stmt costs 2 in epilogue
_9 ? last_16 : last_19 2 times vec_to_scalar costs 4 in epilogue
_9 ? last_16 : last_19 1 times scalar_to_vec costs 1 in epilogue
*_3 1 times unaligned_load (misalign -1) costs 1 in body
*_6 1 times unaligned_load (misalign -1) costs 1 in body
(int) aval_13 1 times vector_stmt costs 1 in body
_7 < min_v_15(D) 1 times scalar_to_vec costs 1 in prologue
_7 < min_v_15(D) 1 times vector_stmt costs 1 in body
_9 ? last_16 : last_19 1 times vector_stmt costs 1 in body
foo.c:9:21: note: operating on full vectors.
foo.c:9:21: note: cost model disabled.
foo.c:9:21: note: epilog loop required
foo.c:9:21: note: vect_can_advance_ivs_p:
foo.c:9:21: note: Analyze phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note: reduc or virtual phi. skip.
foo.c:9:21: note: Analyze phi: i_21 = PHI <i_17(7), 0(15)>
foo.c:9:21: note: Analyze phi: ivtmp_18 = PHI <ivtmp_10(7), 43(15)>
foo.c:9:21: note: ***** Analysis succeeded with vector mode V4HI
foo.c:9:21: note: ***** Choosing vector mode V4HI
foo.c:9:21: note: ***** Re-trying epilogue analysis with vector mode V16QI
foo.c:9:21: note: === vect_analyze_data_refs ===
foo.c:9:21: note: got vectype for stmt: aval_13 = *_3; vector(8) short int
foo.c:9:21: note: got vectype for stmt: _7 = *_6; vector(4) int
foo.c:9:21: note: === vect_analyze_scalar_cycles ===
foo.c:9:21: note: Analyze phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note: Access function of PHI: last_19
foo.c:9:21: note: Analyze phi: i_21 = PHI <i_17(7), 0(15)>
foo.c:9:21: note: Access function of PHI: {0, +, 1}_1
foo.c:9:21: note: step: 1, init: 0
foo.c:9:21: note: Detected induction.
foo.c:9:21: note: Analyze phi: ivtmp_18 = PHI <ivtmp_10(7), 43(15)>
foo.c:9:21: note: Access function of PHI: {43, +, 4294967295}_1
foo.c:9:21: note: step: 4294967295, init: 43
foo.c:9:21: note: Detected induction.
foo.c:9:21: note: Analyze phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note: reduction path: last_8 last_19
foo.c:9:21: note: reduction: detected reduction
foo.c:9:21: note: Detected reduction.
foo.c:9:21: note: === vect_determine_precisions ===
foo.c:9:21: note: using boolean precision 32 for _9 = _7 < min_v_15(D);
foo.c:9:21: note: ivtmp_10 has no range info
foo.c:9:21: note: i_17 has range [0x1, 0x2b]
foo.c:9:21: note: can narrow to unsigned:6 without loss of precision: i_17 = i_21 + 1;
foo.c:9:21: note: last_8 has no range info
foo.c:9:21: note: last_16 has no range info
foo.c:9:21: note: _7 has no range info
foo.c:9:21: note: _5 has range [0x0, 0xa8]
foo.c:9:21: note: can narrow to unsigned:8 without loss of precision: _5 = _1 * 4;
foo.c:9:21: note: aval_13 has no range info
foo.c:9:21: note: _2 has range [0x0, 0x54]
foo.c:9:21: note: can narrow to unsigned:7 without loss of precision: _2 = _1 * 2;
foo.c:9:21: note: _1 has range [0x0, 0x2a]
foo.c:9:21: note: === vect_pattern_recog ===
foo.c:9:21: note: vect_is_simple_use: operand (long unsigned int) i_21, type of def: internal
foo.c:9:21: note: vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note: vect_is_simple_use: operand (long unsigned int) i_21, type of def: internal
foo.c:9:21: note: vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note: vect_recog_widen_mult_pattern: detected: _2 = _1 * 2;
foo.c:9:21: note: widen_mult pattern recognized: patt_45 = (long unsigned int) patt_44;
foo.c:9:21: note: extra pattern stmt: patt_44 = i_21 w* 2;
foo.c:9:21: note: vect_is_simple_use: operand (long unsigned int) i_21, type of def: internal
foo.c:9:21: note: vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note: vect_is_simple_use: operand (long unsigned int) i_21, type of def: internal
foo.c:9:21: note: vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note: vect_recog_widen_mult_pattern: detected: _5 = _1 * 4;
foo.c:9:21: note: widen_mult pattern recognized: patt_47 = (long unsigned int) patt_46;
foo.c:9:21: note: extra pattern stmt: patt_46 = i_21 w* 4;
foo.c:9:21: note: vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note: vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note: vect_is_simple_use: operand ivtmp_18 = PHI <ivtmp_10(7), 43(15)>, type of def: induction
foo.c:9:21: note: === vect_analyze_data_ref_accesses ===
foo.c:9:21: note: === vect_mark_stmts_to_be_vectorized ===
foo.c:9:21: note: init: phi relevant? last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note: init: phi relevant? i_21 = PHI <i_17(7), 0(15)>
foo.c:9:21: note: init: phi relevant? ivtmp_18 = PHI <ivtmp_10(7), 43(15)>
foo.c:9:21: note: init: stmt relevant? _1 = (long unsigned int) i_21;
foo.c:9:21: note: init: stmt relevant? _2 = _1 * 2;
foo.c:9:21: note: init: stmt relevant? _3 = a_12(D) + _2;
foo.c:9:21: note: init: stmt relevant? aval_13 = *_3;
foo.c:9:21: note: init: stmt relevant? _5 = _1 * 4;
foo.c:9:21: note: init: stmt relevant? _6 = b_14(D) + _5;
foo.c:9:21: note: init: stmt relevant? _7 = *_6;
foo.c:9:21: note: init: stmt relevant? last_16 = (int) aval_13;
foo.c:9:21: note: init: stmt relevant? _9 = _7 < min_v_15(D);
foo.c:9:21: note: init: stmt relevant? last_8 = _9 ? last_16 : last_19;
foo.c:9:21: note: vec_stmt_relevant_p: used out of loop.
foo.c:9:21: note: vect_is_simple_use: operand _7 < min_v_15(D), type of def: internal
foo.c:9:21: note: vec_stmt_relevant_p: stmt live but not relevant.
foo.c:9:21: note: mark relevant 1, live 1: last_8 = _9 ? last_16 : last_19;
foo.c:9:21: note: init: stmt relevant? i_17 = i_21 + 1;
foo.c:9:21: note: init: stmt relevant? ivtmp_10 = ivtmp_18 - 1;
foo.c:9:21: note: init: stmt relevant? if (ivtmp_10 != 0)
foo.c:9:21: note: worklist: examine stmt: last_8 = _9 ? last_16 : last_19;
foo.c:9:21: note: vect_is_simple_use: operand _7 < min_v_15(D), type of def: internal
foo.c:9:21: note: mark relevant 1, live 0: _9 = _7 < min_v_15(D);
foo.c:9:21: note: vect_is_simple_use: operand (int) aval_13, type of def: internal
foo.c:9:21: note: mark relevant 1, live 0: last_16 = (int) aval_13;
foo.c:9:21: note: vect_is_simple_use: operand last_19 = PHI <last_8(7), 108(15)>, type of def: reduction
foo.c:9:21: note: mark relevant 1, live 0: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note: worklist: examine stmt: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note: vect_is_simple_use: operand _9 ? last_16 : last_19, type of def: reduction
foo.c:9:21: note: reduc-stmt defining reduc-phi in the same nest.
foo.c:9:21: note: mark relevant 1, live 1: last_8 = _9 ? last_16 : last_19;
foo.c:9:21: note: already marked relevant/live.
foo.c:9:21: note: vect_is_simple_use: operand 108, type of def: constant
foo.c:9:21: note: worklist: examine stmt: last_16 = (int) aval_13;
foo.c:9:21: note: vect_is_simple_use: operand *_3, type of def: internal
foo.c:9:21: note: mark relevant 1, live 0: aval_13 = *_3;
foo.c:9:21: note: worklist: examine stmt: aval_13 = *_3;
foo.c:9:21: note: worklist: examine stmt: _9 = _7 < min_v_15(D);
foo.c:9:21: note: vect_is_simple_use: operand *_6, type of def: internal
foo.c:9:21: note: mark relevant 1, live 0: _7 = *_6;
foo.c:9:21: note: vect_is_simple_use: operand min_v_15(D), type of def: external
foo.c:9:21: note: worklist: examine stmt: _7 = *_6;
foo.c:9:21: note: === vect_analyze_data_ref_dependences ===
foo.c:9:21: note: === vect_determine_vectorization_factor ===
foo.c:9:21: note: ==> examining phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note: get vectype for scalar type: int
foo.c:9:21: note: vectype: vector(4) int
foo.c:9:21: note: nunits = 4
foo.c:9:21: note: ==> examining phi: i_21 = PHI <i_17(7), 0(15)>
foo.c:9:21: note: ==> examining phi: ivtmp_18 = PHI <ivtmp_10(7), 43(15)>
foo.c:9:21: note: ==> examining statement: _1 = (long unsigned int) i_21;
foo.c:9:21: note: skip.
foo.c:9:21: note: ==> examining statement: _2 = _1 * 2;
foo.c:9:21: note: skip.
foo.c:9:21: note: ==> examining pattern def stmt: patt_44 = i_21 w* 2;
foo.c:9:21: note: skip.
foo.c:9:21: note: ==> examining pattern statement: patt_45 = (long unsigned int) patt_44;
foo.c:9:21: note: skip.
foo.c:9:21: note: ==> examining statement: _3 = a_12(D) + _2;
foo.c:9:21: note: skip.
foo.c:9:21: note: ==> examining statement: aval_13 = *_3;
foo.c:9:21: note: precomputed vectype: vector(8) short int
foo.c:9:21: note: nunits = 8
foo.c:9:21: note: ==> examining statement: _5 = _1 * 4;
foo.c:9:21: note: skip.
foo.c:9:21: note: ==> examining pattern def stmt: patt_46 = i_21 w* 4;
foo.c:9:21: note: skip.
foo.c:9:21: note: ==> examining pattern statement: patt_47 = (long unsigned int) patt_46;
foo.c:9:21: note: skip.
foo.c:9:21: note: ==> examining statement: _6 = b_14(D) + _5;
foo.c:9:21: note: skip.
foo.c:9:21: note: ==> examining statement: _7 = *_6;
foo.c:9:21: note: precomputed vectype: vector(4) int
foo.c:9:21: note: nunits = 4
foo.c:9:21: note: ==> examining statement: last_16 = (int) aval_13;
foo.c:9:21: note: get vectype for scalar type: int
foo.c:9:21: note: vectype: vector(4) int
foo.c:9:21: note: get vectype for smallest scalar type: short int
foo.c:9:21: note: nunits vectype: vector(8) short int
foo.c:9:21: note: nunits = 8
foo.c:9:21: note: ==> examining statement: _9 = _7 < min_v_15(D);
foo.c:9:21: note: vectype: vector(4) <signed-boolean:32>
foo.c:9:21: note: nunits = 4
foo.c:9:21: note: ==> examining statement: last_8 = _9 ? last_16 : last_19;
foo.c:9:21: note: get vectype for scalar type: int
foo.c:9:21: note: vectype: vector(4) int
foo.c:9:21: note: nunits = 4
foo.c:9:21: note: ==> examining statement: i_17 = i_21 + 1;
foo.c:9:21: note: skip.
foo.c:9:21: note: ==> examining statement: ivtmp_10 = ivtmp_18 - 1;
foo.c:9:21: note: skip.
foo.c:9:21: note: ==> examining statement: if (ivtmp_10 != 0)
foo.c:9:21: note: skip.
foo.c:9:21: note: vectorization factor = 8
foo.c:9:21: note: === vect_compute_single_scalar_iteration_cost ===
*_3 1 times scalar_load costs 1 in prologue
*_6 1 times scalar_load costs 1 in prologue
(int) aval_13 1 times scalar_stmt costs 1 in prologue
_7 < min_v_15(D) 1 times scalar_stmt costs 1 in prologue
_9 ? last_16 : last_19 1 times scalar_stmt costs 1 in prologue
foo.c:9:21: note: === vect_analyze_slp ===
foo.c:9:21: note: === vect_make_slp_decision ===
foo.c:9:21: note: vectorization_factor = 8, niters = 43
foo.c:9:21: note: === vect_analyze_data_refs_alignment ===
foo.c:9:21: note: recording new base alignment for a_12(D)
  alignment: 2
  misalignment: 0
  based on: aval_13 = *_3;
foo.c:9:21: note: recording new base alignment for b_14(D)
  alignment: 4
  misalignment: 0
  based on: _7 = *_6;
foo.c:9:21: note: vect_compute_data_ref_alignment:
foo.c:9:21: note: can't force alignment of ref: *_3
foo.c:9:21: note: vect_compute_data_ref_alignment:
foo.c:9:21: note: can't force alignment of ref: *_6
foo.c:9:21: note: === vect_prune_runtime_alias_test_list ===
foo.c:9:21: note: === vect_dissolve_slp_only_groups ===
foo.c:9:21: note: === vect_analyze_loop_operations ===
foo.c:9:21: note: examining phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note: vect_is_simple_use: operand (int) aval_13, type of def: internal
foo.c:9:21: note: vect_is_simple_use: vectype vector(4) int
foo.c:9:21: note: vect_is_simple_use: operand last_19 = PHI <last_8(7), 108(15)>, type of def: reduction
foo.c:9:21: note: vect_is_simple_use: vectype vector(4) int
foo.c:9:21: missed: multiple types in double reduction or condition reduction or fold-left reduction.
foo.c:4:1: missed: not vectorized: relevant phi not supported: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: missed: bad operation or unsupported loop bound.
foo.c:9:21: note: ***** Analysis failed with vector mode V16QI foo.c:9:21: note: ***** The result for vector mode V8QI would be the same foo.c:9:21: note: ***** Re-trying epilogue analysis with vector mode V2SI foo.c:9:21: note: === vect_analyze_data_refs === foo.c:9:21: note: got vectype for stmt: aval_13 = *_3; vector(4) short int foo.c:9:21: note: got vectype for stmt: _7 = *_6; vector(2) int foo.c:9:21: note: === vect_analyze_scalar_cycles === foo.c:9:21: note: Analyze phi: last_19 = PHI <last_8(7), 108(15)> foo.c:9:21: note: Access function of PHI: last_19 foo.c:9:21: note: Analyze phi: i_21 = PHI <i_17(7), 0(15)> foo.c:9:21: note: Access function of PHI: {0, +, 1}_1 foo.c:9:21: note: step: 1, init: 0 foo.c:9:21: note: Detected induction. foo.c:9:21: note: Analyze phi: ivtmp_18 = PHI <ivtmp_10(7), 43(15)> foo.c:9:21: note: Access function of PHI: {43, +, 4294967295}_1 foo.c:9:21: note: step: 4294967295, init: 43 foo.c:9:21: note: Detected induction. foo.c:9:21: note: Analyze phi: last_19 = PHI <last_8(7), 108(15)> foo.c:9:21: note: reduction path: last_8 last_19 foo.c:9:21: note: reduction: detected reduction foo.c:9:21: note: Detected reduction. 
foo.c:9:21: note: === vect_determine_precisions === foo.c:9:21: note: using boolean precision 32 for _9 = _7 < min_v_15(D); foo.c:9:21: note: ivtmp_10 has no range info foo.c:9:21: note: i_17 has range [0x1, 0x2b] foo.c:9:21: note: can narrow to unsigned:6 without loss of precision: i_17 = i_21 + 1; foo.c:9:21: note: last_8 has no range info foo.c:9:21: note: last_16 has no range info foo.c:9:21: note: _7 has no range info foo.c:9:21: note: _5 has range [0x0, 0xa8] foo.c:9:21: note: can narrow to unsigned:8 without loss of precision: _5 = _1 * 4; foo.c:9:21: note: aval_13 has no range info foo.c:9:21: note: _2 has range [0x0, 0x54] foo.c:9:21: note: can narrow to unsigned:7 without loss of precision: _2 = _1 * 2; foo.c:9:21: note: _1 has range [0x0, 0x2a] foo.c:9:21: note: === vect_pattern_recog === foo.c:9:21: note: vect_is_simple_use: operand (long unsigned int) i_21, type of def: internal foo.c:9:21: note: vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction foo.c:9:21: note: vect_is_simple_use: operand (long unsigned int) i_21, type of def: internal foo.c:9:21: note: vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction foo.c:9:21: note: vect_recog_widen_mult_pattern: detected: _2 = _1 * 2; foo.c:9:21: note: vect_recog_mult_pattern: detected: _2 = _1 * 2; foo.c:9:21: note: mult pattern recognized: patt_48 = _1 << 1; foo.c:9:21: note: vect_is_simple_use: operand (long unsigned int) i_21, type of def: internal foo.c:9:21: note: vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction foo.c:9:21: note: vect_is_simple_use: operand (long unsigned int) i_21, type of def: internal foo.c:9:21: note: vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction foo.c:9:21: note: vect_recog_widen_mult_pattern: detected: _5 = _1 * 4; foo.c:9:21: note: vect_recog_mult_pattern: detected: _5 = _1 * 4; foo.c:9:21: note: mult pattern recognized: patt_49 = _1 << 2; foo.c:9:21: note: 
vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction foo.c:9:21: note: vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction foo.c:9:21: note: vect_is_simple_use: operand ivtmp_18 = PHI <ivtmp_10(7), 43(15)>, type of def: induction foo.c:9:21: note: === vect_analyze_data_ref_accesses === foo.c:9:21: note: === vect_mark_stmts_to_be_vectorized === foo.c:9:21: note: init: phi relevant? last_19 = PHI <last_8(7), 108(15)> foo.c:9:21: note: init: phi relevant? i_21 = PHI <i_17(7), 0(15)> foo.c:9:21: note: init: phi relevant? ivtmp_18 = PHI <ivtmp_10(7), 43(15)> foo.c:9:21: note: init: stmt relevant? _1 = (long unsigned int) i_21; foo.c:9:21: note: init: stmt relevant? _2 = _1 * 2; foo.c:9:21: note: init: stmt relevant? _3 = a_12(D) + _2; foo.c:9:21: note: init: stmt relevant? aval_13 = *_3; foo.c:9:21: note: init: stmt relevant? _5 = _1 * 4; foo.c:9:21: note: init: stmt relevant? _6 = b_14(D) + _5; foo.c:9:21: note: init: stmt relevant? _7 = *_6; foo.c:9:21: note: init: stmt relevant? last_16 = (int) aval_13; foo.c:9:21: note: init: stmt relevant? _9 = _7 < min_v_15(D); foo.c:9:21: note: init: stmt relevant? last_8 = _9 ? last_16 : last_19; foo.c:9:21: note: vec_stmt_relevant_p: used out of loop. foo.c:9:21: note: vect_is_simple_use: operand _7 < min_v_15(D), type of def: internal foo.c:9:21: note: vec_stmt_relevant_p: stmt live but not relevant. foo.c:9:21: note: mark relevant 1, live 1: last_8 = _9 ? last_16 : last_19; foo.c:9:21: note: init: stmt relevant? i_17 = i_21 + 1; foo.c:9:21: note: init: stmt relevant? ivtmp_10 = ivtmp_18 - 1; foo.c:9:21: note: init: stmt relevant? if (ivtmp_10 != 0) foo.c:9:21: note: worklist: examine stmt: last_8 = _9 ? 
last_16 : last_19; foo.c:9:21: note: vect_is_simple_use: operand _7 < min_v_15(D), type of def: internal foo.c:9:21: note: mark relevant 1, live 0: _9 = _7 < min_v_15(D); foo.c:9:21: note: vect_is_simple_use: operand (int) aval_13, type of def: internal foo.c:9:21: note: mark relevant 1, live 0: last_16 = (int) aval_13; foo.c:9:21: note: vect_is_simple_use: operand last_19 = PHI <last_8(7), 108(15)>, type of def: reduction foo.c:9:21: note: mark relevant 1, live 0: last_19 = PHI <last_8(7), 108(15)> foo.c:9:21: note: worklist: examine stmt: last_19 = PHI <last_8(7), 108(15)> foo.c:9:21: note: vect_is_simple_use: operand _9 ? last_16 : last_19, type of def: reduction foo.c:9:21: note: reduc-stmt defining reduc-phi in the same nest. foo.c:9:21: note: mark relevant 1, live 1: last_8 = _9 ? last_16 : last_19; foo.c:9:21: note: already marked relevant/live. foo.c:9:21: note: vect_is_simple_use: operand 108, type of def: constant foo.c:9:21: note: worklist: examine stmt: last_16 = (int) aval_13; foo.c:9:21: note: vect_is_simple_use: operand *_3, type of def: internal foo.c:9:21: note: mark relevant 1, live 0: aval_13 = *_3; foo.c:9:21: note: worklist: examine stmt: aval_13 = *_3; foo.c:9:21: note: worklist: examine stmt: _9 = _7 < min_v_15(D); foo.c:9:21: note: vect_is_simple_use: operand *_6, type of def: internal foo.c:9:21: note: mark relevant 1, live 0: _7 = *_6; foo.c:9:21: note: vect_is_simple_use: operand min_v_15(D), type of def: external foo.c:9:21: note: worklist: examine stmt: _7 = *_6; foo.c:9:21: note: === vect_analyze_data_ref_dependences === foo.c:9:21: note: === vect_determine_vectorization_factor === foo.c:9:21: note: ==> examining phi: last_19 = PHI <last_8(7), 108(15)> foo.c:9:21: note: get vectype for scalar type: int foo.c:9:21: note: vectype: vector(2) int foo.c:9:21: note: nunits = 2 foo.c:9:21: note: ==> examining phi: i_21 = PHI <i_17(7), 0(15)> foo.c:9:21: note: ==> examining phi: ivtmp_18 = PHI <ivtmp_10(7), 43(15)> foo.c:9:21: note: ==> 
examining statement: _1 = (long unsigned int) i_21; foo.c:9:21: note: skip. foo.c:9:21: note: ==> examining statement: _2 = _1 * 2; foo.c:9:21: note: skip. foo.c:9:21: note: ==> examining pattern statement: patt_48 = _1 << 1; foo.c:9:21: note: skip. foo.c:9:21: note: ==> examining statement: _3 = a_12(D) + _2; foo.c:9:21: note: skip. foo.c:9:21: note: ==> examining statement: aval_13 = *_3; foo.c:9:21: note: precomputed vectype: vector(4) short int foo.c:9:21: note: nunits = 4 foo.c:9:21: note: ==> examining statement: _5 = _1 * 4; foo.c:9:21: note: skip. foo.c:9:21: note: ==> examining pattern statement: patt_49 = _1 << 2; foo.c:9:21: note: skip. foo.c:9:21: note: ==> examining statement: _6 = b_14(D) + _5; foo.c:9:21: note: skip. foo.c:9:21: note: ==> examining statement: _7 = *_6; foo.c:9:21: note: precomputed vectype: vector(2) int foo.c:9:21: note: nunits = 2 foo.c:9:21: note: ==> examining statement: last_16 = (int) aval_13; foo.c:9:21: note: get vectype for scalar type: int foo.c:9:21: note: vectype: vector(2) int foo.c:9:21: note: get vectype for smallest scalar type: short int foo.c:9:21: note: nunits vectype: vector(4) short int foo.c:9:21: note: nunits = 4 foo.c:9:21: note: ==> examining statement: _9 = _7 < min_v_15(D); foo.c:9:21: note: vectype: vector(2) <signed-boolean:32> foo.c:9:21: note: nunits = 2 foo.c:9:21: note: ==> examining statement: last_8 = _9 ? last_16 : last_19; foo.c:9:21: note: get vectype for scalar type: int foo.c:9:21: note: vectype: vector(2) int foo.c:9:21: note: nunits = 2 foo.c:9:21: note: ==> examining statement: i_17 = i_21 + 1; foo.c:9:21: note: skip. foo.c:9:21: note: ==> examining statement: ivtmp_10 = ivtmp_18 - 1; foo.c:9:21: note: skip. foo.c:9:21: note: ==> examining statement: if (ivtmp_10 != 0) foo.c:9:21: note: skip. 
foo.c:9:21: note: vectorization factor = 4 foo.c:9:21: note: === vect_compute_single_scalar_iteration_cost === *_3 1 times scalar_load costs 1 in prologue *_6 1 times scalar_load costs 1 in prologue (int) aval_13 1 times scalar_stmt costs 1 in prologue _7 < min_v_15(D) 1 times scalar_stmt costs 1 in prologue _9 ? last_16 : last_19 1 times scalar_stmt costs 1 in prologue foo.c:9:21: note: === vect_analyze_slp === foo.c:9:21: note: === vect_make_slp_decision === foo.c:9:21: note: vectorization_factor = 4, niters = 43 foo.c:9:21: note: === vect_analyze_data_refs_alignment === foo.c:9:21: note: recording new base alignment for a_12(D) alignment: 2 misalignment: 0 based on: aval_13 = *_3; foo.c:9:21: note: recording new base alignment for b_14(D) alignment: 4 misalignment: 0 based on: _7 = *_6; foo.c:9:21: note: vect_compute_data_ref_alignment: foo.c:9:21: note: can't force alignment of ref: *_3 foo.c:9:21: note: vect_compute_data_ref_alignment: foo.c:9:21: note: can't force alignment of ref: *_6 foo.c:9:21: note: === vect_prune_runtime_alias_test_list === foo.c:9:21: note: === vect_dissolve_slp_only_groups === foo.c:9:21: note: === vect_analyze_loop_operations === foo.c:9:21: note: examining phi: last_19 = PHI <last_8(7), 108(15)> foo.c:9:21: note: vect_is_simple_use: operand (int) aval_13, type of def: internal foo.c:9:21: note: vect_is_simple_use: vectype vector(2) int foo.c:9:21: note: vect_is_simple_use: operand last_19 = PHI <last_8(7), 108(15)>, type of def: reduction foo.c:9:21: note: vect_is_simple_use: vectype vector(2) int foo.c:9:21: missed: multiple types in double reduction or condition reduction or fold-left reduction. foo.c:4:1: missed: not vectorized: relevant phi not supported: last_19 = PHI <last_8(7), 108(15)> foo.c:9:21: missed: bad operation or unsupported loop bound. 
foo.c:9:21: note: ***** Analysis failed with vector mode V2SI foo.c:9:21: optimized: loop vectorized using 8 byte vectors foo.c:9:21: note: === vec_transform_loop === split exit edge split exit edge of scalar loop Removing basic block 19 ;; basic block 19, loop depth 0 ;; pred: 16 ;; succ: foo.c:9:21: note: vect_can_advance_ivs_p: foo.c:9:21: note: Analyze phi: last_19 = PHI <last_8(7), 108(15)> foo.c:9:21: note: reduc or virtual phi. skip. foo.c:9:21: note: Analyze phi: i_21 = PHI <i_17(7), 0(15)> foo.c:9:21: note: Analyze phi: ivtmp_18 = PHI <ivtmp_10(7), 43(15)> foo.c:9:21: note: vect_update_ivs_after_vectorizer: phi: last_19 = PHI <last_8(7), 108(15)> foo.c:9:21: note: reduc or virtual phi. skip. foo.c:9:21: note: vect_update_ivs_after_vectorizer: phi: i_21 = PHI <i_17(7), 0(15)> foo.c:9:21: note: vect_update_ivs_after_vectorizer: phi: ivtmp_18 = PHI <ivtmp_10(7), 43(15)> ;; Guessed iterations of loop 3 is 42.052870. New upper bound 2. ;; Scaling loop 3 with scale 7.0% (guessed) to reach upper bound 2 foo.c:9:21: note: ------>vectorizing phi: last_19 = PHI <last_8(7), 108(25)> foo.c:9:21: note: transform phi. foo.c:9:21: note: ------>vectorizing phi: i_21 = PHI <i_17(7), 0(25)> foo.c:9:21: note: ------>vectorizing phi: ivtmp_18 = PHI <ivtmp_10(7), 43(25)> foo.c:9:21: note: ------>vectorizing phi: vect_last_19.7_67 = PHI <(7), { 108, 108, 108, 108 }(25)> foo.c:9:21: note: ------>vectorizing statement: _1 = (long unsigned int) i_21; foo.c:9:21: note: ------>vectorizing statement: patt_40 = i_21 w* 2; foo.c:9:21: note: ------>vectorizing statement: patt_41 = (long unsigned int) patt_40; foo.c:9:21: note: ------>vectorizing statement: _3 = a_12(D) + _2; foo.c:9:21: note: ------>vectorizing statement: aval_13 = *_3; foo.c:9:21: note: transform statement. foo.c:9:21: note: transform load. 
ncopies = 1 foo.c:9:21: note: create vector_type-pointer variable to type: vector(4) short int vectorizing a pointer ref: *a_12(D) foo.c:9:21: note: created a_12(D) foo.c:9:21: note: add new stmt: vect_aval_13.10_70 = MEM <vector(4) short int> [(short int *)vectp_a.8_68]; foo.c:9:21: note: ------>vectorizing statement: patt_42 = i_21 w* 4; foo.c:9:21: note: ------>vectorizing statement: patt_43 = (long unsigned int) patt_42; foo.c:9:21: note: ------>vectorizing statement: _6 = b_14(D) + _5; foo.c:9:21: note: ------>vectorizing statement: _7 = *_6; foo.c:9:21: note: transform statement. foo.c:9:21: note: transform load. ncopies = 1 foo.c:9:21: note: create vector_type-pointer variable to type: vector(4) int vectorizing a pointer ref: *b_14(D) foo.c:9:21: note: created b_14(D) foo.c:9:21: note: add new stmt: vect__7.13_73 = MEM <vector(4) int> [(int *)vectp_b.11_71]; foo.c:9:21: note: ------>vectorizing statement: last_16 = (int) aval_13; foo.c:9:21: note: transform statement. foo.c:9:21: note: vect_is_simple_use: operand *_3, type of def: internal foo.c:9:21: note: vect_is_simple_use: vectype vector(4) short int foo.c:9:21: note: transform conversion. ncopies = 1. foo.c:9:21: note: vect_get_vec_defs_for_operand: aval_13 foo.c:9:21: note: vect_is_simple_use: operand *_3, type of def: internal foo.c:9:21: note: def_stmt = aval_13 = *_3; foo.c:9:21: note: add new stmt: vect_last_16.14_74 = (vector(4) int) vect_aval_13.10_70; foo.c:9:21: note: ------>vectorizing statement: _9 = _7 < min_v_15(D); foo.c:9:21: note: transform statement. 
foo.c:9:21: note: vect_is_simple_use: operand *_6, type of def: internal foo.c:9:21: note: vect_is_simple_use: vectype vector(4) int foo.c:9:21: note: vect_is_simple_use: operand min_v_15(D), type of def: external foo.c:9:21: note: vect_get_vec_defs_for_operand: _7 foo.c:9:21: note: vect_is_simple_use: operand *_6, type of def: internal foo.c:9:21: note: def_stmt = _7 = *_6; foo.c:9:21: note: vect_get_vec_defs_for_operand: min_v_15(D) foo.c:9:21: note: vect_is_simple_use: operand min_v_15(D), type of def: external foo.c:9:21: note: created new init_stmt: vect_cst__75 = {min_v_15(D), min_v_15(D), min_v_15(D), min_v_15(D)}; foo.c:9:21: note: add new stmt: mask__9.15_76 = vect__7.13_73 < vect_cst__75; foo.c:9:21: note: ------>vectorizing statement: last_8 = _9 ? last_16 : last_19; foo.c:9:21: note: transform statement. foo.c:9:21: note: vect_is_simple_use: operand _7 < min_v_15(D), type of def: internal foo.c:9:21: note: vect_is_simple_use: vectype vector(4) <signed-boolean:32> foo.c:9:21: note: vect_is_simple_use: operand (int) aval_13, type of def: internal foo.c:9:21: note: vect_is_simple_use: vectype vector(4) int foo.c:9:21: note: vect_is_simple_use: operand last_19 = PHI <last_8(7), 108(25)>, type of def: reduction foo.c:9:21: note: vect_is_simple_use: vectype vector(4) int foo.c:9:21: note: vect_get_vec_defs_for_operand: _9 foo.c:9:21: note: vect_is_simple_use: operand _7 < min_v_15(D), type of def: internal foo.c:9:21: note: def_stmt = _9 = _7 < min_v_15(D); foo.c:9:21: note: vect_get_vec_defs_for_operand: last_16 foo.c:9:21: note: vect_is_simple_use: operand (int) aval_13, type of def: internal foo.c:9:21: note: def_stmt = last_16 = (int) aval_13; foo.c:9:21: note: vect_get_vec_defs_for_operand: last_19 foo.c:9:21: note: vect_is_simple_use: operand last_19 = PHI <last_8(7), 108(25)>, type of def: reduction foo.c:9:21: note: def_stmt = last_19 = PHI <last_8(7), 108(25)> foo.c:9:21: note: add new stmt: vect_last_8.16_77 = VEC_COND_EXPR <mask__9.15_76, 
vect_last_16.14_74, vect_last_19.7_67>; foo.c:9:21: note: ------>vectorizing statement: i_17 = i_21 + 1; foo.c:9:21: note: ------>vectorizing statement: ivtmp_10 = ivtmp_18 - 1; foo.c:9:21: note: ------>vectorizing statement: if (ivtmp_10 != 0) foo.c:9:21: note: New loop exit condition: if (ivtmp_91 < 10) ;; Scaling loop 1 with scale 25.0% (adjusted) ;; Guessed iterations of loop 1 is 9.763217. New upper bound 9. ;; Scaling loop 1 with scale 92.9% (guessed) to reach upper bound 9 foo.c:9:21: note: LOOP VECTORIZED foo.c:4:1: note: vectorized 1 loops in function. ;; Created LCSSA PHI: _92 = PHI <_81(3)> Updating SSA: Registering new PHI nodes in block #3 Updating SSA information for statement _81 = VEC_COND_EXPR <mask__9.15_76, ivtmp_78, _80>; Registering new PHI nodes in block #7 Registering new PHI nodes in block #20 Updating SSA information for statement _83 = .REDUC_MAX (_81); Updating SSA information for statement _85 = _81 == _84; Registering new PHI nodes in block #21 SSA replacement table N_i -> { O_1 ... 
O_j } means that N_i replaces O_1, ..., O_j _92 -> { _81 } Incremental SSA update started at block: 3 Number of blocks in CFG: 26 Number of blocks to update: 3 ( 12%) Affected blocks: 3 7 20 Processing block 0: BB25 Value numbering stmt = vect_cst__75 = {min_v_15(D), min_v_15(D), min_v_15(D), min_v_15(D)}; Setting value number of vect_cst__75 to vect_cst__75 (changed) marking outgoing edge 25 -> 3 executable Making available beyond BB25 vect_cst__75 for value vect_cst__75 Processing block 1: BB3 Cannot trust state of predecessor edge 7 -> 3, marking executable Value numbering stmt = last_19 = PHI <last_8(7), 108(25)> Setting value number of last_19 to last_19 (changed) Making available beyond BB3 last_19 for value last_19 Value numbering stmt = i_21 = PHI <i_17(7), 0(25)> Setting value number of i_21 to i_21 (changed) Making available beyond BB3 i_21 for value i_21 Value numbering stmt = ivtmp_18 = PHI <ivtmp_10(7), 43(25)> Setting value number of ivtmp_18 to ivtmp_18 (changed) Making available beyond BB3 ivtmp_18 for value ivtmp_18 Value numbering stmt = vect_last_19.7_67 = PHI <vect_last_8.16_77(7), { 108, 108, 108, 108 }(25)> Setting value number of vect_last_19.7_67 to vect_last_19.7_67 (changed) Making available beyond BB3 vect_last_19.7_67 for value vect_last_19.7_67 Value numbering stmt = vectp_a.8_68 = PHI <vectp_a.8_69(7), a_12(D)(25)> Setting value number of vectp_a.8_68 to vectp_a.8_68 (changed) Making available beyond BB3 vectp_a.8_68 for value vectp_a.8_68 Value numbering stmt = vectp_b.11_71 = PHI <vectp_b.11_72(7), b_14(D)(25)> Setting value number of vectp_b.11_71 to vectp_b.11_71 (changed) Making available beyond BB3 vectp_b.11_71 for value vectp_b.11_71 Value numbering stmt = ivtmp_78 = PHI <ivtmp_79(7), { 1, 2, 3, 4 }(25)> Setting value number of ivtmp_78 to ivtmp_78 (changed) Making available beyond BB3 ivtmp_78 for value ivtmp_78 Value numbering stmt = _80 = PHI <_81(7), { 0, 0, 0, 0 }(25)> Setting value number of _80 to _80 (changed) Making 
available beyond BB3 _80 for value _80 Value numbering stmt = ivtmp_90 = PHI <ivtmp_91(7), 0(25)> Setting value number of ivtmp_90 to ivtmp_90 (changed) Making available beyond BB3 ivtmp_90 for value ivtmp_90 Value numbering stmt = _1 = (long unsigned int) i_21; Setting value number of _1 to _1 (changed) Making available beyond BB3 _1 for value _1 Value numbering stmt = _2 = _1 * 2; Setting value number of _2 to _2 (changed) Making available beyond BB3 _2 for value _2 Value numbering stmt = _3 = a_12(D) + _2; Setting value number of _3 to _3 (changed) Making available beyond BB3 _3 for value _3 Value numbering stmt = vect_aval_13.10_70 = MEM <vector(4) short int> [(short int *)vectp_a.8_68]; Setting value number of vect_aval_13.10_70 to vect_aval_13.10_70 (changed) Making available beyond BB3 vect_aval_13.10_70 for value vect_aval_13.10_70 Value numbering stmt = aval_13 = *_3; Setting value number of aval_13 to aval_13 (changed) Making available beyond BB3 aval_13 for value aval_13 Value numbering stmt = _5 = _1 * 4; Setting value number of _5 to _5 (changed) Making available beyond BB3 _5 for value _5 Value numbering stmt = _6 = b_14(D) + _5; Setting value number of _6 to _6 (changed) Making available beyond BB3 _6 for value _6 Value numbering stmt = vect__7.13_73 = MEM <vector(4) int> [(int *)vectp_b.11_71]; Setting value number of vect__7.13_73 to vect__7.13_73 (changed) Making available beyond BB3 vect__7.13_73 for value vect__7.13_73 Value numbering stmt = _7 = *_6; Setting value number of _7 to _7 (changed) Making available beyond BB3 _7 for value _7 Value numbering stmt = vect_last_16.14_74 = (vector(4) int) vect_aval_13.10_70; Setting value number of vect_last_16.14_74 to vect_last_16.14_74 (changed) Making available beyond BB3 vect_last_16.14_74 for value vect_last_16.14_74 Value numbering stmt = last_16 = (int) aval_13; Setting value number of last_16 to last_16 (changed) Making available beyond BB3 last_16 for value last_16 Value numbering stmt = 
mask__9.15_76 = vect__7.13_73 < vect_cst__75; Setting value number of mask__9.15_76 to mask__9.15_76 (changed) Making available beyond BB3 mask__9.15_76 for value mask__9.15_76 Value numbering stmt = _9 = _7 < min_v_15(D); Setting value number of _9 to _9 (changed) Making available beyond BB3 _9 for value _9 Value numbering stmt = vect_last_8.16_77 = VEC_COND_EXPR <mask__9.15_76, vect_last_16.14_74, vect_last_19.7_67>; Setting value number of vect_last_8.16_77 to vect_last_8.16_77 (changed) Making available beyond BB3 vect_last_8.16_77 for value vect_last_8.16_77 Value numbering stmt = last_8 = _9 ? last_16 : last_19; Setting value number of last_8 to last_8 (changed) Making available beyond BB3 last_8 for value last_8 Value numbering stmt = i_17 = i_21 + 1; Setting value number of i_17 to i_17 (changed) Making available beyond BB3 i_17 for value i_17 Value numbering stmt = ivtmp_10 = ivtmp_18 - 1; Setting value number of ivtmp_10 to ivtmp_10 (changed) Making available beyond BB3 ivtmp_10 for value ivtmp_10 Value numbering stmt = vectp_a.8_69 = vectp_a.8_68 + 8; Setting value number of vectp_a.8_69 to vectp_a.8_69 (changed) Making available beyond BB3 vectp_a.8_69 for value vectp_a.8_69 Value numbering stmt = vectp_b.11_72 = vectp_b.11_71 + 16; Setting value number of vectp_b.11_72 to vectp_b.11_72 (changed) Making available beyond BB3 vectp_b.11_72 for value vectp_b.11_72 Value numbering stmt = _81 = VEC_COND_EXPR <mask__9.15_76, ivtmp_78, _80>; Setting value number of _81 to _81 (changed) Making available beyond BB3 _81 for value _81 Value numbering stmt = ivtmp_79 = ivtmp_78 + { 4, 4, 4, 4 }; Setting value number of ivtmp_79 to ivtmp_79 (changed) Making available beyond BB3 ivtmp_79 for value ivtmp_79 Value numbering stmt = ivtmp_91 = ivtmp_90 + 1; Setting value number of ivtmp_91 to ivtmp_91 (changed) Making available beyond BB3 ivtmp_91 for value ivtmp_91 Value numbering stmt = if (ivtmp_91 < 10) Recording on edge 3->7 ivtmp_91 lt_expr 10 == true Recording on 
edge 3->7 ivtmp_91 ge_expr 10 == false
Recording on edge 3->7 ivtmp_91 ne_expr 10 == true
Recording on edge 3->7 ivtmp_91 le_expr 10 == true
Recording on edge 3->7 ivtmp_91 gt_expr 10 == false
Recording on edge 3->7 ivtmp_91 eq_expr 10 == false
marking outgoing edge 3 -> 7 executable
marking destination block 20 reachable
Processing block 2: BB7
RPO iteration over 3 blocks visited 3 blocks in total discovering 3 executable blocks iterating 1.0 times, a block was visited max. 1 times
RPO tracked 35 values available at 32 locations and 35 lattice elements
Removing basic block 9
;; basic block 9, loop depth 1
;;  pred:       16
;;              13
  # last_23 = PHI <108(16), last_34(13)>
  # i_24 = PHI <0(16), i_35(13)>
  # ivtmp_25 = PHI <43(16), ivtmp_36(13)>
  _26 = (long unsigned int) i_24;
  _27 = _26 * 2;
  _28 = a_12(D) + _27;
  aval_29 = *_28;
  _30 = _26 * 4;
  _31 = b_14(D) + _30;
  _32 = *_31;
  if (_32 < min_v_15(D))
    goto <bb 11>; [50.00%]
  else
    goto <bb 12>; [50.00%]
;;  succ:       11
;;              12

Removing basic block 11
;; basic block 11, loop depth 1
;;  pred:
  last_33 = (int) _29;
;;  succ:       12

Removing basic block 12
;; basic block 12, loop depth 1
;;  pred:
  # last_34 = PHI <>
  i_35 = _24 + 1;
  ivtmp_36 = _25 - 1;
  if (ivtmp_36 != 0)
    goto <bb 13>; [97.68%]
  else
    goto <bb 18>; [2.32%]
;;  succ:       13
;;              18

Removing basic block 13
;; basic block 13, loop depth 1
;;  pred:
;;  succ:

Removing basic block 16
;; basic block 16, loop depth 0
;;  pred:
;;  succ:

Removing basic block 18
;; basic block 18, loop depth 0
;;  pred:
  # last_51 = PHI <>
  goto <bb 6>; [100.00%]
;;  succ:       6

Merging blocks 2 and 15
Merging blocks 17 and 6
Merging blocks 2 and 25
fix_loop_structure: fixing up loops for function
fix_loop_structure: removing loop 2
__attribute__((noipa, noinline, noclone, no_icf))
int condition_reduction (short int * a, int min_v, int * b)
{
  int stmp_last_8.17;
  vector(4) int vect_last_8.16;
  vector(4) <signed-boolean:32> mask__9.15;
  vector(4) int vect_last_16.14;
  vector(4) int vect__7.13;
  int * vectp_b.12;
  vector(4) int * vectp_b.11;
  vector(4) short int vect_aval_13.10;
  short int * vectp_a.9;
  vector(4) short int * vectp_a.8;
  vector(4) int vect_last_19.7;
  unsigned int tmp.6;
  int tmp.5;
  int i;
  short int aval;
  int last;
  long unsigned int _1;
  long unsigned int _2;
  short int * _3;
  long unsigned int _5;
  int * _6;
  int _7;
  _Bool _9;
  unsigned int ivtmp_10;
  unsigned int ivtmp_18;
  _Bool _22;
  unsigned int ivtmp_54;
  long unsigned int _55;
  long unsigned int _56;
  short int * _57;
  long unsigned int _59;
  int * _60;
  int _61;
  unsigned int ivtmp_64;
  vector(4) int vect_cst__75;
  vector(4) unsigned int ivtmp_78;
  vector(4) unsigned int ivtmp_79;
  vector(4) unsigned int _80;
  vector(4) unsigned int _81;
  unsigned int _83;
  vector(4) unsigned int _84;
  vector(4) <signed-boolean:32> _85;
  vector(4) int _86;
  vector(4) unsigned int _87;
  unsigned int _88;
  int _89;
  unsigned int ivtmp_90;
  unsigned int ivtmp_91;
  vector(4) unsigned int _92;

  <bb 2> [local count: 24373936]:
  _22 = 1;
  vect_cst__75 = {min_v_15(D), min_v_15(D), min_v_15(D), min_v_15(D)};

  <bb 3> [local count: 243739360]:
  # last_19 = PHI <last_8(7), 108(2)>
  # i_21 = PHI <i_17(7), 0(2)>
  # ivtmp_18 = PHI <ivtmp_10(7), 43(2)>
  # vect_last_19.7_67 = PHI <vect_last_8.16_77(7), { 108, 108, 108, 108 }(2)>
  # vectp_a.8_68 = PHI <vectp_a.8_69(7), a_12(D)(2)>
  # vectp_b.11_71 = PHI <vectp_b.11_72(7), b_14(D)(2)>
  # ivtmp_78 = PHI <ivtmp_79(7), { 1, 2, 3, 4 }(2)>
  # _80 = PHI <_81(7), { 0, 0, 0, 0 }(2)>
  # ivtmp_90 = PHI <ivtmp_91(7), 0(2)>
  _1 = (long unsigned int) i_21;
  _2 = _1 * 2;
  _3 = a_12(D) + _2;
  vect_aval_13.10_70 = MEM <vector(4) short int> [(short int *)vectp_a.8_68];
  aval_13 = *_3;
  _5 = _1 * 4;
  _6 = b_14(D) + _5;
  vect__7.13_73 = MEM <vector(4) int> [(int *)vectp_b.11_71];
  _7 = *_6;
  vect_last_16.14_74 = (vector(4) int) vect_aval_13.10_70;
  last_16 = (int) aval_13;
  mask__9.15_76 = vect__7.13_73 < vect_cst__75;
  _9 = _7 < min_v_15(D);
  vect_last_8.16_77 = VEC_COND_EXPR <mask__9.15_76, vect_last_16.14_74, vect_last_19.7_67>;
  last_8 = _9 ? last_16 : last_19;
  i_17 = i_21 + 1;
  ivtmp_10 = ivtmp_18 - 1;
  vectp_a.8_69 = vectp_a.8_68 + 8;
  vectp_b.11_72 = vectp_b.11_71 + 16;
  _81 = VEC_COND_EXPR <mask__9.15_76, ivtmp_78, _80>;
  ivtmp_79 = ivtmp_78 + { 4, 4, 4, 4 };
  ivtmp_91 = ivtmp_90 + 1;
  if (ivtmp_91 < 10)
    goto <bb 7>; [90.00%]
  else
    goto <bb 20>; [10.00%]

  <bb 7> [local count: 219365424]:
  goto <bb 3>; [100.00%]

  <bb 20> [local count: 24373936]:
  # last_66 = PHI <last_8(3)>
  # vect_last_8.16_82 = PHI <vect_last_8.16_77(3)>
  # _92 = PHI <_81(3)>
  _83 = .REDUC_MAX (_92);
  _84 = {_83, _83, _83, _83};
  _85 = _92 == _84;
  _86 = VEC_COND_EXPR <_85, vect_last_8.16_82, { 0, 0, 0, 0 }>;
  _87 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(_86);
  _88 = .REDUC_MAX (_87);
  _89 = (int) _88;

  <bb 21> [local count: 73121805]:
  # last_52 = PHI <_89(20), last_62(22)>
  # i_53 = PHI <40(20), i_63(22)>
  # ivtmp_54 = PHI <3(20), ivtmp_64(22)>
  _55 = (long unsigned int) i_53;
  _56 = _55 * 2;
  _57 = a_12(D) + _56;
  aval_58 = *_57;
  _59 = _55 * 4;
  _60 = b_14(D) + _59;
  _61 = *_60;
  if (_61 < min_v_15(D))
    goto <bb 24>; [50.00%]
  else
    goto <bb 23>; [50.00%]

  <bb 22> [local count: 48747874]:
  goto <bb 21>; [100.00%]

  <bb 23> [local count: 73121805]:
  # last_62 = PHI <last_52(21), last_65(24)>
  i_63 = i_53 + 1;
  ivtmp_64 = ivtmp_54 - 1;
  if (ivtmp_64 != 0)
    goto <bb 22>; [66.67%]
  else
    goto <bb 17>; [33.33%]

  <bb 24> [local count: 36560903]:
  last_65 = (int) aval_58;
  goto <bb 23>; [100.00%]

  <bb 17> [local count: 24373936]:
  # last_50 = PHI <last_62(23)>
  return last_50;
}
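For reference, a minimal sketch of the scalar loop the dump above analyzes, reconstructed from the GIMPLE rather than copied from the actual testsuite file (the constants 108 and 43 come from the reduction PHI and the iteration count in the dump): a condition reduction where `last` keeps the most recent `a[i]` whose corresponding `b[i]` compares below `min_v`.

```c
#define N 43

/* Reconstruction of condition_reduction as seen in the dump; the
   attribute keeps the function from being inlined or cloned so the
   loop is analyzed in isolation, as in the testsuite.  */
__attribute__((noipa))
int condition_reduction (short *a, int min_v, int *b)
{
  int last = 108;             /* initial value of the reduction PHI  */
  for (int i = 0; i < N; i++)
    if (b[i] < min_v)
      last = a[i];            /* last_8 = _9 ? last_16 : last_19  */
  return last;
}
```

Because the store to `last` is conditional, vectorizing this requires turning the if-body into a VEC_COND_EXPR select plus an extract-last-matching-lane reduction, which is exactly the REDUC_MAX/VEC_COND_EXPR sequence visible in the transformed function.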
* Re: [PATCH][RFC] tree-optimization/92335 - Improve sinking heuristics for vectorization 2023-08-07 0:04 ` Prathamesh Kulkarni @ 2023-08-07 7:48 ` Richard Biener 2023-08-14 14:58 ` Prathamesh Kulkarni 0 siblings, 1 reply; 18+ messages in thread From: Richard Biener @ 2023-08-07 7:48 UTC (permalink / raw) To: Prathamesh Kulkarni; +Cc: Richard Biener, Jeff Law, gcc-patches On Mon, Aug 7, 2023 at 2:05 AM Prathamesh Kulkarni via Gcc-patches <gcc-patches@gcc.gnu.org> wrote: > > On Thu, 3 Aug 2023 at 17:48, Richard Biener <rguenther@suse.de> wrote: > > > > On Thu, 3 Aug 2023, Richard Biener wrote: > > > > > On Thu, 3 Aug 2023, Richard Biener wrote: > > > > > > > On Thu, 3 Aug 2023, Prathamesh Kulkarni wrote: > > > > > > > > > On Wed, 2 Aug 2023 at 14:17, Richard Biener via Gcc-patches > > > > > <gcc-patches@gcc.gnu.org> wrote: > > > > > > > > > > > > On Mon, 31 Jul 2023, Jeff Law wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > On 7/28/23 01:05, Richard Biener via Gcc-patches wrote: > > > > > > > > The following delays sinking of loads within the same innermost > > > > > > > > loop when it was unconditional before. That's a not uncommon > > > > > > > > issue preventing vectorization when masked loads are not available. > > > > > > > > > > > > > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu. > > > > > > > > > > > > > > > > I have a followup patch improving sinking that without this would > > > > > > > > cause more of the problematic sinking - now that we have a second > > > > > > > > sink pass after loop opts this looks like a reasonable approach? > > > > > > > > > > > > > > > > OK? > > > > > > > > > > > > > > > > Thanks, > > > > > > > > Richard. > > > > > > > > > > > > > > > > PR tree-optimization/92335 > > > > > > > > * tree-ssa-sink.cc (select_best_block): Before loop > > > > > > > > optimizations avoid sinking unconditional loads/stores > > > > > > > > in innermost loops to conditional executed places. 
> > > > > > > > > > > > > > > > * gcc.dg/tree-ssa/ssa-sink-10.c: Disable vectorizing. > > > > > > > > * gcc.dg/tree-ssa/predcom-9.c: Clone from ssa-sink-10.c, > > > > > > > > expect predictive commoning to happen instead of sinking. > > > > > > > > * gcc.dg/vect/pr65947-3.c: Adjust. > > > > > > > I think it's reasonable -- there's probably going to be cases where it's not > > > > > > > great, but more often than not I think it's going to be a reasonable > > > > > > > heuristic. > > > > > > > > > > > > > > If there is undesirable fallout, better to find it over the coming months than > > > > > > > next spring. So I'd suggest we go forward now to give more time to find any > > > > > > > pathological cases (if they exist). > > > > > > > > > > > > Agreed, I've pushed this now. > > > > > Hi Richard, > > > > > After this patch (committed in 399c8dd44ff44f4b496223c7cc980651c4d6f6a0), > > > > > pr65947-7.c "failed" for aarch64-linux-gnu: > > > > > FAIL: gcc.dg/vect/pr65947-7.c scan-tree-dump-not vect "LOOP VECTORIZED" > > > > > FAIL: gcc.dg/vect/pr65947-7.c -flto -ffat-lto-objects > > > > > scan-tree-dump-not vect "LOOP VECTORIZED" > > > > > > > > > > /* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" { target { > > > > > ! vect_fold_extract_last } } } } */ > > > > > > > > > > With your commit, condition_reduction in pr65947-7.c gets vectorized > > > > > regardless of vect_fold_extract_last, > > > > > which gates the above test (which is an improvement, because the > > > > > function didn't get vectorized before the commit). > > > > > > > > > > The attached patch thus removes the gating on vect_fold_extract_last, > > > > > and the test passes again. > > > > > OK to commit ? > > > > > > > > OK. > > > > > > Or wait - the loop doesn't vectorize on x86_64, so I guess one > > > critical target condition is missing. Can you figure out which? 
> > > > I see > > > > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21: > > note: vect_is_simple_use: operand last_19 = PHI <last_8(7), 108(15)>, > > type of def: reduction > > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21: > > note: vect_is_simple_use: vectype vector(4) int > > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21: > > missed: multiple types in double reduction or condition reduction or > > fold-left reduction. > > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:13:1: > > missed: not vectorized: relevant phi not supported: last_19 = PHI > > <last_8(7), 108(15)> > > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21: > > missed: bad operation or unsupported loop bound. > Hi Richard, > Looking at the aarch64 vect dump, it seems the loop in > condition_reduction gets vectorized with V4HI mode > while fails for other modes in vectorizable_condition: > > if ((double_reduc || reduction_type != TREE_CODE_REDUCTION) > && ncopies > 1) > { > if (dump_enabled_p ()) > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > "multiple types in double reduction or condition " > "reduction or fold-left reduction.\n"); > return false; > } > > From the dump: > foo.c:9:21: note: === vect_analyze_loop_operations === > foo.c:9:21: note: examining phi: last_19 = PHI <last_8(7), 108(15)> > foo.c:9:21: note: vect_is_simple_use: operand (int) aval_13, type of > def: internal > foo.c:9:21: note: vect_is_simple_use: vectype vector(4) int > foo.c:9:21: note: vect_is_simple_use: operand last_19 = PHI > <last_8(7), 108(15)>, type of def: reduction > foo.c:9:21: note: vect_is_simple_use: vectype vector(4) int > > For V8HI, VF = 8, and vectype_in = vector(4) int. > Thus ncopies = VF / length(vectype_in) = 2, which is greater than 1, > and thus fails: > foo.c:9:21: missed: multiple types in double reduction or condition > reduction or fold-left reduction. 
> foo.c:4:1: missed: not vectorized: relevant phi not supported:
> last_19 = PHI <last_8(7), 108(15)>
> While for V4HI, VF = 4 and thus ncopies = 1, so it succeeds.
>
> For x86_64, it seems the vectorizer doesn't seem to try V4HI mode.
> If I "force" the vectorizer to use V4HI mode, we get the following dump:
> foo.c:9:21: note: === vect_analyze_loop_operations ===
> foo.c:9:21: note: examining phi: last_19 = PHI <last_8(7), 108(15)>
> foo.c:9:21: note: vect_is_simple_use: operand (int) aval_13, type of
> def: internal
> foo.c:9:21: note: vect_is_simple_use: vectype vector(2) int
> foo.c:9:21: note: vect_is_simple_use: operand last_19 = PHI
> <last_8(7), 108(15)>, type of def: reduction
> foo.c:9:21: note: vect_is_simple_use: vectype vector(2) int
> foo.c:9:21: missed: multiple types in double reduction or condition
> reduction or fold-left reduction.
>
> Not sure tho if this is the only reason for the test to fail to
> vectorize on the target.
> Will investigate in more details next week.

The odd thing is that you say

  for (int i = 0; i < N; i++)
    {
      aval = a[i];
      if (b[i] < min_v)
        last = aval;
    }

fails to vectorize but

  for (int i = 0; i < N; i++)
    {
      if (b[i] < min_v)
        last = a[i];
    }

succeeds?  The IL difference should be irrelevant for the reduction
vectorization:

  <bb 3> [local count: 1049367889]:
  # last_19 = PHI <last_8(7), 108(15)>
  # i_21 = PHI <i_17(7), 0(15)>
  # ivtmp_18 = PHI <ivtmp_10(7), 43(15)>
  _1 = (long unsigned int) i_21;
  _2 = _1 * 2;
  _3 = a_12(D) + _2;
  aval_13 = *_3;
  _5 = _1 * 4;
  _6 = b_14(D) + _5;
  _7 = *_6;
  last_16 = (int) aval_13;
  _9 = _7 < min_v_15(D);
  last_8 = _9 ? last_16 : last_19;
  i_17 = i_21 + 1;
  ivtmp_10 = ivtmp_18 - 1;
  if (ivtmp_10 != 0)
    goto <bb 7>; [97.68%]

vs

  <bb 3> [local count: 1049367889]:
  # last_19 = PHI <last_9(7), 108(15)>
  # i_21 = PHI <i_17(7), 0(15)>
  # ivtmp_11 = PHI <ivtmp_10(7), 43(15)>
  _1 = (long unsigned int) i_21;
  _2 = _1 * 4;
  _3 = b_13(D) + _2;
  _4 = *_3;
  _5 = _4 < min_v_14(D);
  _6 = _1 * 2;
  _38 = _37 + _6;
  _7 = (short int *) _38;
  _8 = .MASK_LOAD (_7, 16B, _5);
  last_16 = (int) _8;
  last_9 = _5 ? last_16 : last_19;
  i_17 = i_21 + 1;
  ivtmp_10 = ivtmp_11 - 1;
  if (ivtmp_10 != 0)
    goto <bb 7>; [97.68%]

maybe since the "mask" is used twice with the .MASK_LOAD
we are not actually looking at the def (the comparison) and it's
the comparison which would introduce the "multiple types"?

That is, I wonder why not sinking the load, avoiding a conditional
load, makes a difference to vectorizing the condition/extract last reduction.

It doesn't seem to make a difference for x86.  That said, the "fix" is
probably sticking the correct target on the dump-check, it seems
that vect_fold_extract_last is no longer correct here.

Richard.

> Thanks,
> Prathamesh
> > > >
> > > > Richard.
* Re: [PATCH][RFC] tree-optimization/92335 - Improve sinking heuristics for vectorization 2023-08-07 7:48 ` Richard Biener @ 2023-08-14 14:58 ` Prathamesh Kulkarni 2023-08-15 7:36 ` Richard Biener 0 siblings, 1 reply; 18+ messages in thread From: Prathamesh Kulkarni @ 2023-08-14 14:58 UTC (permalink / raw) To: Richard Biener; +Cc: Richard Biener, Jeff Law, gcc-patches On Mon, 7 Aug 2023 at 13:19, Richard Biener <richard.guenther@gmail.com> wrote: > > On Mon, Aug 7, 2023 at 2:05 AM Prathamesh Kulkarni via Gcc-patches > <gcc-patches@gcc.gnu.org> wrote: > > > > On Thu, 3 Aug 2023 at 17:48, Richard Biener <rguenther@suse.de> wrote: > > > > > > On Thu, 3 Aug 2023, Richard Biener wrote: > > > > > > > On Thu, 3 Aug 2023, Richard Biener wrote: > > > > > > > > > On Thu, 3 Aug 2023, Prathamesh Kulkarni wrote: > > > > > > > > > > > On Wed, 2 Aug 2023 at 14:17, Richard Biener via Gcc-patches > > > > > > <gcc-patches@gcc.gnu.org> wrote: > > > > > > > > > > > > > > On Mon, 31 Jul 2023, Jeff Law wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On 7/28/23 01:05, Richard Biener via Gcc-patches wrote: > > > > > > > > > The following delays sinking of loads within the same innermost > > > > > > > > > loop when it was unconditional before. That's a not uncommon > > > > > > > > > issue preventing vectorization when masked loads are not available. > > > > > > > > > > > > > > > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu. > > > > > > > > > > > > > > > > > > I have a followup patch improving sinking that without this would > > > > > > > > > cause more of the problematic sinking - now that we have a second > > > > > > > > > sink pass after loop opts this looks like a reasonable approach? > > > > > > > > > > > > > > > > > > OK? > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > Richard. 
> > > > > > > > > > > > > > > > > > PR tree-optimization/92335 > > > > > > > > > * tree-ssa-sink.cc (select_best_block): Before loop > > > > > > > > > optimizations avoid sinking unconditional loads/stores > > > > > > > > > in innermost loops to conditional executed places. > > > > > > > > > > > > > > > > > > * gcc.dg/tree-ssa/ssa-sink-10.c: Disable vectorizing. > > > > > > > > > * gcc.dg/tree-ssa/predcom-9.c: Clone from ssa-sink-10.c, > > > > > > > > > expect predictive commoning to happen instead of sinking. > > > > > > > > > * gcc.dg/vect/pr65947-3.c: Adjust. > > > > > > > > I think it's reasonable -- there's probably going to be cases where it's not > > > > > > > > great, but more often than not I think it's going to be a reasonable > > > > > > > > heuristic. > > > > > > > > > > > > > > > > If there is undesirable fallout, better to find it over the coming months than > > > > > > > > next spring. So I'd suggest we go forward now to give more time to find any > > > > > > > > pathological cases (if they exist). > > > > > > > > > > > > > > Agreed, I've pushed this now. > > > > > > Hi Richard, > > > > > > After this patch (committed in 399c8dd44ff44f4b496223c7cc980651c4d6f6a0), > > > > > > pr65947-7.c "failed" for aarch64-linux-gnu: > > > > > > FAIL: gcc.dg/vect/pr65947-7.c scan-tree-dump-not vect "LOOP VECTORIZED" > > > > > > FAIL: gcc.dg/vect/pr65947-7.c -flto -ffat-lto-objects > > > > > > scan-tree-dump-not vect "LOOP VECTORIZED" > > > > > > > > > > > > /* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" { target { > > > > > > ! vect_fold_extract_last } } } } */ > > > > > > > > > > > > With your commit, condition_reduction in pr65947-7.c gets vectorized > > > > > > regardless of vect_fold_extract_last, > > > > > > which gates the above test (which is an improvement, because the > > > > > > function didn't get vectorized before the commit). 
> > > > > > > > > > > > The attached patch thus removes the gating on vect_fold_extract_last, > > > > > > and the test passes again. > > > > > > OK to commit ? > > > > > > > > > > OK. > > > > > > > > Or wait - the loop doesn't vectorize on x86_64, so I guess one > > > > critical target condition is missing. Can you figure out which? > > > > > > I see > > > > > > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21: > > > note: vect_is_simple_use: operand last_19 = PHI <last_8(7), 108(15)>, > > > type of def: reduction > > > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21: > > > note: vect_is_simple_use: vectype vector(4) int > > > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21: > > > missed: multiple types in double reduction or condition reduction or > > > fold-left reduction. > > > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:13:1: > > > missed: not vectorized: relevant phi not supported: last_19 = PHI > > > <last_8(7), 108(15)> > > > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21: > > > missed: bad operation or unsupported loop bound. 
> > Hi Richard, > > Looking at the aarch64 vect dump, it seems the loop in > > condition_reduction gets vectorized with V4HI mode > > while fails for other modes in vectorizable_condition: > > > > if ((double_reduc || reduction_type != TREE_CODE_REDUCTION) > > && ncopies > 1) > > { > > if (dump_enabled_p ()) > > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > > "multiple types in double reduction or condition " > > "reduction or fold-left reduction.\n"); > > return false; > > } > > > > From the dump: > > foo.c:9:21: note: === vect_analyze_loop_operations === > > foo.c:9:21: note: examining phi: last_19 = PHI <last_8(7), 108(15)> > > foo.c:9:21: note: vect_is_simple_use: operand (int) aval_13, type of > > def: internal > > foo.c:9:21: note: vect_is_simple_use: vectype vector(4) int > > foo.c:9:21: note: vect_is_simple_use: operand last_19 = PHI > > <last_8(7), 108(15)>, type of def: reduction > > foo.c:9:21: note: vect_is_simple_use: vectype vector(4) int > > > > For V8HI, VF = 8, and vectype_in = vector(4) int. > > Thus ncopies = VF / length(vectype_in) = 2, which is greater than 1, > > and thus fails: > > foo.c:9:21: missed: multiple types in double reduction or condition > > reduction or fold-left reduction. > > foo.c:4:1: missed: not vectorized: relevant phi not supported: > > last_19 = PHI <last_8(7), 108(15)> > > While for V4HI, VF = 4 and thus ncopies = 1, so it succeeds. > > > > For x86_64, it seems the vectorizer doesn't seem to try V4HI mode. 
> > If I "force" the vectorizer to use V4HI mode, we get the following dump:
> > foo.c:9:21: note: === vect_analyze_loop_operations ===
> > foo.c:9:21: note: examining phi: last_19 = PHI <last_8(7), 108(15)>
> > foo.c:9:21: note: vect_is_simple_use: operand (int) aval_13, type of
> > def: internal
> > foo.c:9:21: note: vect_is_simple_use: vectype vector(2) int
> > foo.c:9:21: note: vect_is_simple_use: operand last_19 = PHI
> > <last_8(7), 108(15)>, type of def: reduction
> > foo.c:9:21: note: vect_is_simple_use: vectype vector(2) int
> > foo.c:9:21: missed: multiple types in double reduction or condition
> > reduction or fold-left reduction.
> >
> > Not sure tho if this is the only reason for the test to fail to
> > vectorize on the target.
> > Will investigate in more details next week.
>
> The odd thing is that you say
>
>   for (int i = 0; i < N; i++)
>     {
>       aval = a[i];
>       if (b[i] < min_v)
>         last = aval;
>     }
>
> fails to vectorize but
>
>   for (int i = 0; i < N; i++)
>     {
>       if (b[i] < min_v)
>         last = a[i];
>     }
>
> succeeds?  The IL difference should be irrelevant for the reduction
Hi Richard,
Sorry for the late response.
No, this case containing a conditional load doesn't vectorize on aarch64 either:
foo2.c:9:21: note: === analyze_loop_nest ===
foo2.c:9:21: note: === vect_analyze_loop_form ===
foo2.c:9:21: missed: not vectorized: control flow in loop.
foo2.c:9:21: missed: bad loop form.

For this test:
  for (int i = 0; i < N; i++)
    {
      aval = a[i];
      if (b[i] < min_v)
        last = aval;
    }

IIUC the sink pass made the load conditional, preventing vectorization
(similar to the above), but your PR92335 fix delays the sinking of the
load until after loop opts, and thus it gets vectorized.
Up to the vect pass, the dumps are similar for x86 and aarch64.
> vectorization:
>
> <bb 3> [local count: 1049367889]:
> # last_19 = PHI <last_8(7), 108(15)>
> # i_21 = PHI <i_17(7), 0(15)>
> # ivtmp_18 = PHI <ivtmp_10(7), 43(15)>
> _1 = (long unsigned int) i_21;
> _2 = _1 * 2;
> _3 = a_12(D) + _2;
> aval_13 = *_3;
> _5 = _1 * 4;
> _6 = b_14(D) + _5;
> _7 = *_6;
> last_16 = (int) aval_13;
> _9 = _7 < min_v_15(D);
> last_8 = _9 ? last_16 : last_19;
> i_17 = i_21 + 1;
> ivtmp_10 = ivtmp_18 - 1;
> if (ivtmp_10 != 0)
>   goto <bb 7>; [97.68%]
>
> vs
>
> <bb 3> [local count: 1049367889]:
> # last_19 = PHI <last_9(7), 108(15)>
> # i_21 = PHI <i_17(7), 0(15)>
> # ivtmp_11 = PHI <ivtmp_10(7), 43(15)>
> _1 = (long unsigned int) i_21;
> _2 = _1 * 4;
> _3 = b_13(D) + _2;
> _4 = *_3;
> _5 = _4 < min_v_14(D);
> _6 = _1 * 2;
> _38 = _37 + _6;
> _7 = (short int *) _38;
> _8 = .MASK_LOAD (_7, 16B, _5);
> last_16 = (int) _8;
> last_9 = _5 ? last_16 : last_19;
> i_17 = i_21 + 1;
> ivtmp_10 = ivtmp_11 - 1;
> if (ivtmp_10 != 0)
>   goto <bb 7>; [97.68%]
>
> maybe since the "mask" is used twice with the .MASK_LOAD
> we are not actually looking at the def (the comparison) and it's
> the comparison which would introduce the "multiple types"?
>
> That is, I wonder why not sinking the load, avoiding a conditional
> load, makes a difference to vectorizing the condition/extract last reduction.
IIUC, the issue is that the vector type used for the reduction can differ
from the vector type used for determining the VF, and the above check in
vectorizable_condition passes only if the VF matches the length of the
vector type used for the reduction.

For V4HI mode on aarch64, it sets VF = 4, which matches the length of the
vector type used for the reduction (vector(4) int):
foo.c:9:21: note: examining phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note: vect_is_simple_use: operand (int) aval_13, type of
def: internal
foo.c:9:21: note: vect_is_simple_use: vectype vector(4) int
foo.c:9:21: note: vect_is_simple_use: operand last_19 = PHI
<last_8(7), 108(15)>, type of def: reduction
foo.c:9:21: note: vect_is_simple_use: vectype vector(4) int
>
> It doesn't seem to make a difference for x86.  That said, the "fix" is
> probably sticking the correct target on the dump-check, it seems
> that vect_fold_extract_last is no longer correct here.
Um sorry, I did go through the various checks in target-supports.exp, but
am not sure which one is appropriate for this case, and am stuck here :/
Could you please suggest how to proceed?
Thanks,
Prathamesh
>
> Richard.
>
> > Thanks,
> > Prathamesh
> > > >
> > > > Richard.
* Re: [PATCH][RFC] tree-optimization/92335 - Improve sinking heuristics for vectorization 2023-08-14 14:58 ` Prathamesh Kulkarni @ 2023-08-15 7:36 ` Richard Biener 2023-08-15 8:58 ` Richard Sandiford 0 siblings, 1 reply; 18+ messages in thread From: Richard Biener @ 2023-08-15 7:36 UTC (permalink / raw) To: Prathamesh Kulkarni; +Cc: richard.sandiford, Jeff Law, gcc-patches On Mon, 14 Aug 2023, Prathamesh Kulkarni wrote: > On Mon, 7 Aug 2023 at 13:19, Richard Biener <richard.guenther@gmail.com> wrote: > > > > On Mon, Aug 7, 2023 at 2:05?AM Prathamesh Kulkarni via Gcc-patches > > <gcc-patches@gcc.gnu.org> wrote: > > > > > > On Thu, 3 Aug 2023 at 17:48, Richard Biener <rguenther@suse.de> wrote: > > > > > > > > On Thu, 3 Aug 2023, Richard Biener wrote: > > > > > > > > > On Thu, 3 Aug 2023, Richard Biener wrote: > > > > > > > > > > > On Thu, 3 Aug 2023, Prathamesh Kulkarni wrote: > > > > > > > > > > > > > On Wed, 2 Aug 2023 at 14:17, Richard Biener via Gcc-patches > > > > > > > <gcc-patches@gcc.gnu.org> wrote: > > > > > > > > > > > > > > > > On Mon, 31 Jul 2023, Jeff Law wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On 7/28/23 01:05, Richard Biener via Gcc-patches wrote: > > > > > > > > > > The following delays sinking of loads within the same innermost > > > > > > > > > > loop when it was unconditional before. That's a not uncommon > > > > > > > > > > issue preventing vectorization when masked loads are not available. > > > > > > > > > > > > > > > > > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu. > > > > > > > > > > > > > > > > > > > > I have a followup patch improving sinking that without this would > > > > > > > > > > cause more of the problematic sinking - now that we have a second > > > > > > > > > > sink pass after loop opts this looks like a reasonable approach? > > > > > > > > > > > > > > > > > > > > OK? > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > Richard. 
> > > > > > > > > > > > > > > > > > > > PR tree-optimization/92335 > > > > > > > > > > * tree-ssa-sink.cc (select_best_block): Before loop > > > > > > > > > > optimizations avoid sinking unconditional loads/stores > > > > > > > > > > in innermost loops to conditional executed places. > > > > > > > > > > > > > > > > > > > > * gcc.dg/tree-ssa/ssa-sink-10.c: Disable vectorizing. > > > > > > > > > > * gcc.dg/tree-ssa/predcom-9.c: Clone from ssa-sink-10.c, > > > > > > > > > > expect predictive commoning to happen instead of sinking. > > > > > > > > > > * gcc.dg/vect/pr65947-3.c: Adjust. > > > > > > > > > I think it's reasonable -- there's probably going to be cases where it's not > > > > > > > > > great, but more often than not I think it's going to be a reasonable > > > > > > > > > heuristic. > > > > > > > > > > > > > > > > > > If there is undesirable fallout, better to find it over the coming months than > > > > > > > > > next spring. So I'd suggest we go forward now to give more time to find any > > > > > > > > > pathological cases (if they exist). > > > > > > > > > > > > > > > > Agreed, I've pushed this now. > > > > > > > Hi Richard, > > > > > > > After this patch (committed in 399c8dd44ff44f4b496223c7cc980651c4d6f6a0), > > > > > > > pr65947-7.c "failed" for aarch64-linux-gnu: > > > > > > > FAIL: gcc.dg/vect/pr65947-7.c scan-tree-dump-not vect "LOOP VECTORIZED" > > > > > > > FAIL: gcc.dg/vect/pr65947-7.c -flto -ffat-lto-objects > > > > > > > scan-tree-dump-not vect "LOOP VECTORIZED" > > > > > > > > > > > > > > /* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" { target { > > > > > > > ! vect_fold_extract_last } } } } */ > > > > > > > > > > > > > > With your commit, condition_reduction in pr65947-7.c gets vectorized > > > > > > > regardless of vect_fold_extract_last, > > > > > > > which gates the above test (which is an improvement, because the > > > > > > > function didn't get vectorized before the commit). 
> > > > > > > > > > > > > > The attached patch thus removes the gating on vect_fold_extract_last, > > > > > > > and the test passes again. > > > > > > > OK to commit ? > > > > > > > > > > > > OK. > > > > > > > > > > Or wait - the loop doesn't vectorize on x86_64, so I guess one > > > > > critical target condition is missing. Can you figure out which? > > > > > > > > I see > > > > > > > > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21: > > > > note: vect_is_simple_use: operand last_19 = PHI <last_8(7), 108(15)>, > > > > type of def: reduction > > > > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21: > > > > note: vect_is_simple_use: vectype vector(4) int > > > > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21: > > > > missed: multiple types in double reduction or condition reduction or > > > > fold-left reduction. > > > > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:13:1: > > > > missed: not vectorized: relevant phi not supported: last_19 = PHI > > > > <last_8(7), 108(15)> > > > > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21: > > > > missed: bad operation or unsupported loop bound. 
> > > Hi Richard, > > > Looking at the aarch64 vect dump, it seems the loop in > > > condition_reduction gets vectorized with V4HI mode > > > while fails for other modes in vectorizable_condition: > > > > > > if ((double_reduc || reduction_type != TREE_CODE_REDUCTION) > > > && ncopies > 1) > > > { > > > if (dump_enabled_p ()) > > > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > > > "multiple types in double reduction or condition " > > > "reduction or fold-left reduction.\n"); > > > return false; > > > } > > > > > > From the dump: > > > foo.c:9:21: note: === vect_analyze_loop_operations === > > > foo.c:9:21: note: examining phi: last_19 = PHI <last_8(7), 108(15)> > > > foo.c:9:21: note: vect_is_simple_use: operand (int) aval_13, type of > > > def: internal > > > foo.c:9:21: note: vect_is_simple_use: vectype vector(4) int > > > foo.c:9:21: note: vect_is_simple_use: operand last_19 = PHI > > > <last_8(7), 108(15)>, type of def: reduction > > > foo.c:9:21: note: vect_is_simple_use: vectype vector(4) int > > > > > > For V8HI, VF = 8, and vectype_in = vector(4) int. > > > Thus ncopies = VF / length(vectype_in) = 2, which is greater than 1, > > > and thus fails: > > > foo.c:9:21: missed: multiple types in double reduction or condition > > > reduction or fold-left reduction. > > > foo.c:4:1: missed: not vectorized: relevant phi not supported: > > > last_19 = PHI <last_8(7), 108(15)> > > > While for V4HI, VF = 4 and thus ncopies = 1, so it succeeds. > > > > > > For x86_64, it seems the vectorizer doesn't seem to try V4HI mode. 
> > > If I "force" the vectorizer to use V4HI mode, we get the following dump: > > > foo.c:9:21: note: === vect_analyze_loop_operations === > > > foo.c:9:21: note: examining phi: last_19 = PHI <last_8(7), 108(15)> > > > foo.c:9:21: note: vect_is_simple_use: operand (int) aval_13, type of > > > def: internal > > > foo.c:9:21: note: vect_is_simple_use: vectype vector(2) int > > > foo.c:9:21: note: vect_is_simple_use: operand last_19 = PHI > > > <last_8(7), 108(15)>, type of def: reduction > > > foo.c:9:21: note: vect_is_simple_use: vectype vector(2) int > > > foo.c:9:21: missed: multiple types in double reduction or condition > > > reduction or fold-left reduction. > > > > > > Not sure tho if this is the only reason for the test to fail to > > > vectorize on the target. > > > Will investigate in more details next week. > > > > The odd thing is that you say > > > > for (int i = 0; i < N; i++) > > { > > aval = a[i]; > > if (b[i] < min_v) > > last = aval; > > } > > > > fails to vectorize but > > > > for (int i = 0; i < N; i++) > > { > > if (b[i] < min_v) > > last = a[i]; > > } > > > > succeeds? The IL difference should be irrelevant for the reduction > Hi Richard, > Sorry for late response. > No this case containing a conditional load doesn't vectorize on aarch64 either: > foo2.c:9:21: note: === analyze_loop_nest === > foo2.c:9:21: note: === vect_analyze_loop_form === > foo2.c:9:21: missed: not vectorized: control flow in loop. > foo2.c:9:21: missed: bad loop form. > > For this test: > for (int i = 0; i < N; i++) > { > aval = a[i]; > if (b[i] < min_v) > last = aval; > } > > IIUC sink pass made the load conditional preventing vectorization > (similar to above), > but your PR92335 fix delays the sinking of load before loop opts, and > thus gets vectorized. > Till vect pass, the dumps are similar for x86 and aarch64. 
> > vectorization: > > > > <bb 3> [local count: 1049367889]: > > # last_19 = PHI <last_8(7), 108(15)> > > # i_21 = PHI <i_17(7), 0(15)> > > # ivtmp_18 = PHI <ivtmp_10(7), 43(15)> > > _1 = (long unsigned int) i_21; > > _2 = _1 * 2; > > _3 = a_12(D) + _2; > > aval_13 = *_3; > > _5 = _1 * 4; > > _6 = b_14(D) + _5; > > _7 = *_6; > > last_16 = (int) aval_13; > > _9 = _7 < min_v_15(D); > > last_8 = _9 ? last_16 : last_19; > > i_17 = i_21 + 1; > > ivtmp_10 = ivtmp_18 - 1; > > if (ivtmp_10 != 0) > > goto <bb 7>; [97.68%] > > > > vs > > > > <bb 3> [local count: 1049367889]: > > # last_19 = PHI <last_9(7), 108(15)> > > # i_21 = PHI <i_17(7), 0(15)> > > # ivtmp_11 = PHI <ivtmp_10(7), 43(15)> > > _1 = (long unsigned int) i_21; > > _2 = _1 * 4; > > _3 = b_13(D) + _2; > > _4 = *_3; > > _5 = _4 < min_v_14(D); > > _6 = _1 * 2; > > _38 = _37 + _6; > > _7 = (short int *) _38; > > _8 = .MASK_LOAD (_7, 16B, _5); > > last_16 = (int) _8; > > last_9 = _5 ? last_16 : last_19; > > i_17 = i_21 + 1; > > ivtmp_10 = ivtmp_11 - 1; > > if (ivtmp_10 != 0) > > goto <bb 7>; [97.68%] > > > > maybe since the "mask" is used twice with the .MASK_LOAD > > we are not actually looking at the def (the comparison) and it's > > the comparison which would introduce the "multiple types"? > > > > That is, I wonder why not sinking the load, avoiding a conditional > > load, makes a difference to vectorizing the condition/extract last reduction. > IIUC, the issue is that the vector type used for reduction seems to be > different than > the vector type used for determining VF, and it passes the above check > in vectoriable_reduction, > only if VF matches the length of the vector type used for reduction. 
>
> For V4HI mode on aarch64, it sets VF = 4 which matches with the length of vector
> type used for reduction (vector (4) int):
> foo.c:9:21: note: examining phi: last_19 = PHI <last_8(7), 108(15)>
> foo.c:9:21: note: vect_is_simple_use: operand (int) aval_13, type of
> def: internal
> foo.c:9:21: note: vect_is_simple_use: vectype vector(4) int
> foo.c:9:21: note: vect_is_simple_use: operand last_19 = PHI
> <last_8(7), 108(15)>, type of def: reduction
> foo.c:9:21: note: vect_is_simple_use: vectype vector(4) int

so it doesn't use a fold_extract_last reduction but a regular
condition reduction?  How does it end up with a V4HI + V4SI combo
here?  Ah, so the "key" is that we end up using

  vect_last_16.23_74 = (vector(4) int) vect_aval_13.19_70;
  vect_last_8.25_77 = VEC_COND_EXPR <mask__9.24_76, vect_last_16.23_74, vect_last_19.16_67>;

so we can promote V4HI to V4SI via direct conversion instead of via
unpacking.  The x86 backend has no such feat; while it can do a
zero_extend via punpcklwd, doing a sign_extend requires a compare,
unpacking the result with itself and then doing the punpcklwd on top
of that.  Of course that's what we're doing when using
vec_unpack_lo/hi_expr already.

> >
> > It doesn't seem to make a difference for x86.  That said, the "fix" is
> > probably sticking the correct target on the dump-check, it seems
> > that vect_fold_extract_last is no longer correct here.
> Um sorry, I did go thru various checks in target-supports.exp, but not
> sure which one will be appropriate for this case,
> and am stuck here :/ Could you please suggest how to proceed ?

Maybe Richard S. knows the magic thing to test, he originally
implemented the direct conversion support.  I suggest to implement
such dg-checks if they are not present (I can't find them),
possibly quite specific to the modes involved (like we have
other checks with _qi_to_hi suffixes, for float modes maybe
just _float).

Richard.
* Re: [PATCH][RFC] tree-optimization/92335 - Improve sinking heuristics for vectorization
  2023-08-15  7:36                 ` Richard Biener
@ 2023-08-15  8:58                   ` Richard Sandiford
  2023-08-17 17:10                     ` Prathamesh Kulkarni
  0 siblings, 1 reply; 18+ messages in thread
From: Richard Sandiford @ 2023-08-15  8:58 UTC (permalink / raw)
To: Richard Biener; +Cc: Prathamesh Kulkarni, Jeff Law, gcc-patches

Richard Biener <rguenther@suse.de> writes:
> On Mon, 14 Aug 2023, Prathamesh Kulkarni wrote:
>> On Mon, 7 Aug 2023 at 13:19, Richard Biener <richard.guenther@gmail.com> wrote:
>> > It doesn't seem to make a difference for x86.  That said, the "fix" is
>> > probably sticking the correct target on the dump-check, it seems
>> > that vect_fold_extract_last is no longer correct here.
>> Um sorry, I did go thru various checks in target-supports.exp, but not
>> sure which one will be appropriate for this case,
>> and am stuck here :/ Could you please suggest how to proceed ?
>
> Maybe Richard S. knows the magic thing to test, he originally
> implemented the direct conversion support.  I suggest to implement
> such dg-checks if they are not present (I can't find them),
> possibly quite specific to the modes involved (like we have
> other checks with _qi_to_hi suffixes, for float modes maybe
> just _float).

Yeah, can't remember specific selectors for that feature.  TBH I think
most (all?) of the tests were AArch64-specific.

Thanks,
Richard
* Re: [PATCH][RFC] tree-optimization/92335 - Improve sinking heuristics for vectorization 2023-08-15 8:58 ` Richard Sandiford @ 2023-08-17 17:10 ` Prathamesh Kulkarni 2023-08-18 9:29 ` Richard Biener 0 siblings, 1 reply; 18+ messages in thread From: Prathamesh Kulkarni @ 2023-08-17 17:10 UTC (permalink / raw) To: Richard Biener, Prathamesh Kulkarni, Jeff Law, gcc-patches, richard.sandiford [-- Attachment #1: Type: text/plain, Size: 1729 bytes --] On Tue, 15 Aug 2023 at 14:28, Richard Sandiford <richard.sandiford@arm.com> wrote: > > Richard Biener <rguenther@suse.de> writes: > > On Mon, 14 Aug 2023, Prathamesh Kulkarni wrote: > >> On Mon, 7 Aug 2023 at 13:19, Richard Biener <richard.guenther@gmail.com> wrote: > >> > It doesn't seem to make a difference for x86. That said, the "fix" is > >> > probably sticking the correct target on the dump-check, it seems > >> > that vect_fold_extract_last is no longer correct here. > >> Um sorry, I did go thru various checks in target-supports.exp, but not > >> sure which one will be appropriate for this case, > >> and am stuck here :/ Could you please suggest how to proceed ? > > > > Maybe Richard S. knows the magic thing to test, he originally > > implemented the direct conversion support. I suggest to implement > > such dg-checks if they are not present (I can't find them), > > possibly quite specific to the modes involved (like we have > > other checks with _qi_to_hi suffixes, for float modes maybe > > just _float). > > Yeah, can't remember specific selectors for that feature. TBH I think > most (all?) of the tests were AArch64-specific. Hi, As Richi mentioned above, the test now vectorizes on AArch64 because it has support for direct conversion between vectors while x86 doesn't. IIUC this is because supportable_convert_operation returns true for V4HI -> V4SI on Aarch64 since it can use extend_v4hiv4si2 for doing the conversion ? 
In the attached patch, I added a new target check vect_extend which (currently) returns 1 only for aarch64*-*-*, which makes the test PASS on both the targets, altho I am not sure if this is entirely correct. Does the patch look OK ? Thanks, Prathamesh > > Thanks, > Richard [-- Attachment #2: gnu-875-1.txt --] [-- Type: text/plain, Size: 1382 bytes --] diff --git a/gcc/testsuite/gcc.dg/vect/pr65947-7.c b/gcc/testsuite/gcc.dg/vect/pr65947-7.c index 16cdcd1c6eb..c8623854af5 100644 --- a/gcc/testsuite/gcc.dg/vect/pr65947-7.c +++ b/gcc/testsuite/gcc.dg/vect/pr65947-7.c @@ -52,5 +52,4 @@ main (void) return 0; } -/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" { target vect_fold_extract_last } } } */ -/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" { target { ! vect_fold_extract_last } } } } */ +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" { target vect_extend } } } */ diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp index 92b6f69730e..29ef64b84f3 100644 --- a/gcc/testsuite/lib/target-supports.exp +++ b/gcc/testsuite/lib/target-supports.exp @@ -7768,6 +7768,16 @@ proc check_effective_target_vect_unpack { } { || [istarget amdgcn*-*-*] }}] } +# Return 1 if the target plus current options supports vector +# conversion of chars (to shorts) and shorts (to ints), 0 otherwise. +# +# This won't change for different subtargets so cache the result. + +proc check_effective_target_vect_extend { } { + return [check_cached_effective_target_indexed vect_extend { + expr { [istarget aarch64*-*-*]}}] +} + # Return 1 if the target plus current options does not guarantee # that its STACK_BOUNDARY is >= the reguired vector alignment. # ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH][RFC] tree-optimization/92335 - Improve sinking heuristics for vectorization 2023-08-17 17:10 ` Prathamesh Kulkarni @ 2023-08-18 9:29 ` Richard Biener 2023-08-18 11:41 ` Richard Biener 0 siblings, 1 reply; 18+ messages in thread From: Richard Biener @ 2023-08-18 9:29 UTC (permalink / raw) To: Prathamesh Kulkarni; +Cc: Jeff Law, gcc-patches, richard.sandiford On Thu, 17 Aug 2023, Prathamesh Kulkarni wrote: > On Tue, 15 Aug 2023 at 14:28, Richard Sandiford > <richard.sandiford@arm.com> wrote: > > > > Richard Biener <rguenther@suse.de> writes: > > > On Mon, 14 Aug 2023, Prathamesh Kulkarni wrote: > > >> On Mon, 7 Aug 2023 at 13:19, Richard Biener <richard.guenther@gmail.com> wrote: > > >> > It doesn't seem to make a difference for x86. That said, the "fix" is > > >> > probably sticking the correct target on the dump-check, it seems > > >> > that vect_fold_extract_last is no longer correct here. > > >> Um sorry, I did go thru various checks in target-supports.exp, but not > > >> sure which one will be appropriate for this case, > > >> and am stuck here :/ Could you please suggest how to proceed ? > > > > > > Maybe Richard S. knows the magic thing to test, he originally > > > implemented the direct conversion support. I suggest to implement > > > such dg-checks if they are not present (I can't find them), > > > possibly quite specific to the modes involved (like we have > > > other checks with _qi_to_hi suffixes, for float modes maybe > > > just _float). > > > > Yeah, can't remember specific selectors for that feature. TBH I think > > most (all?) of the tests were AArch64-specific. > Hi, > As Richi mentioned above, the test now vectorizes on AArch64 because > it has support for direct conversion > between vectors while x86 doesn't. IIUC this is because > supportable_convert_operation returns true > for V4HI -> V4SI on Aarch64 since it can use extend_v4hiv4si2 for > doing the conversion ? 
> > In the attached patch, I added a new target check vect_extend which > (currently) returns 1 only for aarch64*-*-*, > which makes the test PASS on both the targets, altho I am not sure if > this is entirely correct. > Does the patch look OK ? Can you make vect_extend more specific, say vect_extend_hi_si or what is specifically needed here? Note I'll have to investigate why x86 cannot vectorize here since in fact it does have the extend operation ... it might be also worth splitting the sign/zero extend case, so - vect_sign_extend_hi_si or vect_extend_short_int? > Thanks, > Prathamesh > > > > Thanks, > > Richard > -- Richard Biener <rguenther@suse.de> SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg, Germany; GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg) ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH][RFC] tree-optimization/92335 - Improve sinking heuristics for vectorization 2023-08-18 9:29 ` Richard Biener @ 2023-08-18 11:41 ` Richard Biener 2023-08-19 15:48 ` Prathamesh Kulkarni 0 siblings, 1 reply; 18+ messages in thread From: Richard Biener @ 2023-08-18 11:41 UTC (permalink / raw) To: Prathamesh Kulkarni; +Cc: Jeff Law, gcc-patches, richard.sandiford On Fri, 18 Aug 2023, Richard Biener wrote: > On Thu, 17 Aug 2023, Prathamesh Kulkarni wrote: > > > On Tue, 15 Aug 2023 at 14:28, Richard Sandiford > > <richard.sandiford@arm.com> wrote: > > > > > > Richard Biener <rguenther@suse.de> writes: > > > > On Mon, 14 Aug 2023, Prathamesh Kulkarni wrote: > > > >> On Mon, 7 Aug 2023 at 13:19, Richard Biener <richard.guenther@gmail.com> wrote: > > > >> > It doesn't seem to make a difference for x86. That said, the "fix" is > > > >> > probably sticking the correct target on the dump-check, it seems > > > >> > that vect_fold_extract_last is no longer correct here. > > > >> Um sorry, I did go thru various checks in target-supports.exp, but not > > > >> sure which one will be appropriate for this case, > > > >> and am stuck here :/ Could you please suggest how to proceed ? > > > > > > > > Maybe Richard S. knows the magic thing to test, he originally > > > > implemented the direct conversion support. I suggest to implement > > > > such dg-checks if they are not present (I can't find them), > > > > possibly quite specific to the modes involved (like we have > > > > other checks with _qi_to_hi suffixes, for float modes maybe > > > > just _float). > > > > > > Yeah, can't remember specific selectors for that feature. TBH I think > > > most (all?) of the tests were AArch64-specific. > > Hi, > > As Richi mentioned above, the test now vectorizes on AArch64 because > > it has support for direct conversion > > between vectors while x86 doesn't. 
IIUC this is because
> > supportable_convert_operation returns true
> > for V4HI -> V4SI on Aarch64 since it can use extend_v4hiv4si2 for
> > doing the conversion ?
> >
> > In the attached patch, I added a new target check vect_extend which
> > (currently) returns 1 only for aarch64*-*-*,
> > which makes the test PASS on both the targets, altho I am not sure if
> > this is entirely correct.
> > Does the patch look OK ?
>
> Can you make vect_extend more specific, say vect_extend_hi_si or
> what is specifically needed here?  Note I'll have to investigate
> why x86 cannot vectorize here since in fact it does have
> the extend operation ... it might be also worth splitting the
> sign/zero extend case, so - vect_sign_extend_hi_si or
> vect_extend_short_int?

And now having analyzed _why_ x86 doesn't vectorize, it's rather a
question of why we get this vectorized with NEON, which is because

static opt_machine_mode
aarch64_vectorize_related_mode (machine_mode vector_mode,
				scalar_mode element_mode,
				poly_uint64 nunits)
{
...
  /* Prefer to use 1 128-bit vector instead of 2 64-bit vectors.  */
  if (TARGET_SIMD
      && (vec_flags & VEC_ADVSIMD)
      && known_eq (nunits, 0U)
      && known_eq (GET_MODE_BITSIZE (vector_mode), 64U)
      && maybe_ge (GET_MODE_BITSIZE (element_mode)
		   * GET_MODE_NUNITS (vector_mode), 128U))
    {
      machine_mode res = aarch64_simd_container_mode (element_mode, 128);
      if (VECTOR_MODE_P (res))
	return res;

which makes us get a V4SImode vector for a V4HImode loop vector_mode.

So I think the appropriate effective dejagnu target is
aarch64-*-* (there's none specific to advsimd; not sure if one
can disable that?)

Richard.

> > Thanks,
> > Prathamesh
> > >
> > > Thanks,
> > > Richard

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH][RFC] tree-optimization/92335 - Improve sinking heuristics for vectorization 2023-08-18 11:41 ` Richard Biener @ 2023-08-19 15:48 ` Prathamesh Kulkarni 2023-08-21 6:57 ` Richard Biener 0 siblings, 1 reply; 18+ messages in thread From: Prathamesh Kulkarni @ 2023-08-19 15:48 UTC (permalink / raw) To: Richard Biener; +Cc: Jeff Law, gcc-patches, richard.sandiford [-- Attachment #1: Type: text/plain, Size: 4044 bytes --] On Fri, 18 Aug 2023 at 17:11, Richard Biener <rguenther@suse.de> wrote: > > On Fri, 18 Aug 2023, Richard Biener wrote: > > > On Thu, 17 Aug 2023, Prathamesh Kulkarni wrote: > > > > > On Tue, 15 Aug 2023 at 14:28, Richard Sandiford > > > <richard.sandiford@arm.com> wrote: > > > > > > > > Richard Biener <rguenther@suse.de> writes: > > > > > On Mon, 14 Aug 2023, Prathamesh Kulkarni wrote: > > > > >> On Mon, 7 Aug 2023 at 13:19, Richard Biener <richard.guenther@gmail.com> wrote: > > > > >> > It doesn't seem to make a difference for x86. That said, the "fix" is > > > > >> > probably sticking the correct target on the dump-check, it seems > > > > >> > that vect_fold_extract_last is no longer correct here. > > > > >> Um sorry, I did go thru various checks in target-supports.exp, but not > > > > >> sure which one will be appropriate for this case, > > > > >> and am stuck here :/ Could you please suggest how to proceed ? > > > > > > > > > > Maybe Richard S. knows the magic thing to test, he originally > > > > > implemented the direct conversion support. I suggest to implement > > > > > such dg-checks if they are not present (I can't find them), > > > > > possibly quite specific to the modes involved (like we have > > > > > other checks with _qi_to_hi suffixes, for float modes maybe > > > > > just _float). > > > > > > > > Yeah, can't remember specific selectors for that feature. TBH I think > > > > most (all?) of the tests were AArch64-specific. 
> > > Hi, > > > As Richi mentioned above, the test now vectorizes on AArch64 because > > > it has support for direct conversion > > > between vectors while x86 doesn't. IIUC this is because > > > supportable_convert_operation returns true > > > for V4HI -> V4SI on Aarch64 since it can use extend_v4hiv4si2 for > > > doing the conversion ? > > > > > > In the attached patch, I added a new target check vect_extend which > > > (currently) returns 1 only for aarch64*-*-*, > > > which makes the test PASS on both the targets, altho I am not sure if > > > this is entirely correct. > > > Does the patch look OK ? > > > > Can you make vect_extend more specific, say vect_extend_hi_si or > > what is specifically needed here? Note I'll have to investigate > > why x86 cannot vectorize here since in fact it does have > > the extend operation ... it might be also worth splitting the > > sign/zero extend case, so - vect_sign_extend_hi_si or > > vect_extend_short_int? > > And now having anaylzed _why_ x86 doesn't vectorize it's rather > why we get this vectorized with NEON which is because > > static opt_machine_mode > aarch64_vectorize_related_mode (machine_mode vector_mode, > scalar_mode element_mode, > poly_uint64 nunits) > { > ... > /* Prefer to use 1 128-bit vector instead of 2 64-bit vectors. */ > if (TARGET_SIMD > && (vec_flags & VEC_ADVSIMD) > && known_eq (nunits, 0U) > && known_eq (GET_MODE_BITSIZE (vector_mode), 64U) > && maybe_ge (GET_MODE_BITSIZE (element_mode) > * GET_MODE_NUNITS (vector_mode), 128U)) > { > machine_mode res = aarch64_simd_container_mode (element_mode, 128); > if (VECTOR_MODE_P (res)) > return res; > > which makes us get a V4SImode vector for a V4HImode loop vector_mode. Thanks for the explanation! > > So I think the appropriate effective dejagnu target is > aarch64-*-* (there's none specifically to advsimd, not sure if one > can disable that?) 
The attached patch uses aarch64*-*-* target check, and additionally for SVE (and other targets supporting vect_fold_extract_last) it checks if the condition reduction was carried out using FOLD_EXTRACT_LAST. Does that look OK ? Thanks, Prathamesh > > Richard. > > > > Thanks, > > > Prathamesh > > > > > > > > Thanks, > > > > Richard > > > > > > > > > -- > Richard Biener <rguenther@suse.de> > SUSE Software Solutions Germany GmbH, > Frankenstrasse 146, 90461 Nuernberg, Germany; > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg) [-- Attachment #2: gnu-875-2.txt --] [-- Type: text/plain, Size: 695 bytes --] diff --git a/gcc/testsuite/gcc.dg/vect/pr65947-7.c b/gcc/testsuite/gcc.dg/vect/pr65947-7.c index 16cdcd1c6eb..58c46df5c54 100644 --- a/gcc/testsuite/gcc.dg/vect/pr65947-7.c +++ b/gcc/testsuite/gcc.dg/vect/pr65947-7.c @@ -52,5 +52,5 @@ main (void) return 0; } -/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" { target vect_fold_extract_last } } } */ -/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" { target { ! vect_fold_extract_last } } } } */ +/* { dg-final { scan-tree-dump "optimizing condition reduction with FOLD_EXTRACT_LAST" "vect" { target vect_fold_extract_last } } } */ +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" { target aarch64*-*-* } } } */ ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH][RFC] tree-optimization/92335 - Improve sinking heuristics for vectorization 2023-08-19 15:48 ` Prathamesh Kulkarni @ 2023-08-21 6:57 ` Richard Biener 2023-08-21 11:32 ` Prathamesh Kulkarni 0 siblings, 1 reply; 18+ messages in thread From: Richard Biener @ 2023-08-21 6:57 UTC (permalink / raw) To: Prathamesh Kulkarni; +Cc: Jeff Law, gcc-patches, richard.sandiford On Sat, 19 Aug 2023, Prathamesh Kulkarni wrote: > On Fri, 18 Aug 2023 at 17:11, Richard Biener <rguenther@suse.de> wrote: > > > > On Fri, 18 Aug 2023, Richard Biener wrote: > > > > > On Thu, 17 Aug 2023, Prathamesh Kulkarni wrote: > > > > > > > On Tue, 15 Aug 2023 at 14:28, Richard Sandiford > > > > <richard.sandiford@arm.com> wrote: > > > > > > > > > > Richard Biener <rguenther@suse.de> writes: > > > > > > On Mon, 14 Aug 2023, Prathamesh Kulkarni wrote: > > > > > >> On Mon, 7 Aug 2023 at 13:19, Richard Biener <richard.guenther@gmail.com> wrote: > > > > > >> > It doesn't seem to make a difference for x86. That said, the "fix" is > > > > > >> > probably sticking the correct target on the dump-check, it seems > > > > > >> > that vect_fold_extract_last is no longer correct here. > > > > > >> Um sorry, I did go thru various checks in target-supports.exp, but not > > > > > >> sure which one will be appropriate for this case, > > > > > >> and am stuck here :/ Could you please suggest how to proceed ? > > > > > > > > > > > > Maybe Richard S. knows the magic thing to test, he originally > > > > > > implemented the direct conversion support. I suggest to implement > > > > > > such dg-checks if they are not present (I can't find them), > > > > > > possibly quite specific to the modes involved (like we have > > > > > > other checks with _qi_to_hi suffixes, for float modes maybe > > > > > > just _float). > > > > > > > > > > Yeah, can't remember specific selectors for that feature. TBH I think > > > > > most (all?) of the tests were AArch64-specific. 
> > > > Hi, > > > > As Richi mentioned above, the test now vectorizes on AArch64 because > > > > it has support for direct conversion > > > > between vectors while x86 doesn't. IIUC this is because > > > > supportable_convert_operation returns true > > > > for V4HI -> V4SI on Aarch64 since it can use extend_v4hiv4si2 for > > > > doing the conversion ? > > > > > > > > In the attached patch, I added a new target check vect_extend which > > > > (currently) returns 1 only for aarch64*-*-*, > > > > which makes the test PASS on both the targets, altho I am not sure if > > > > this is entirely correct. > > > > Does the patch look OK ? > > > > > > Can you make vect_extend more specific, say vect_extend_hi_si or > > > what is specifically needed here? Note I'll have to investigate > > > why x86 cannot vectorize here since in fact it does have > > > the extend operation ... it might be also worth splitting the > > > sign/zero extend case, so - vect_sign_extend_hi_si or > > > vect_extend_short_int? > > > > And now having anaylzed _why_ x86 doesn't vectorize it's rather > > why we get this vectorized with NEON which is because > > > > static opt_machine_mode > > aarch64_vectorize_related_mode (machine_mode vector_mode, > > scalar_mode element_mode, > > poly_uint64 nunits) > > { > > ... > > /* Prefer to use 1 128-bit vector instead of 2 64-bit vectors. */ > > if (TARGET_SIMD > > && (vec_flags & VEC_ADVSIMD) > > && known_eq (nunits, 0U) > > && known_eq (GET_MODE_BITSIZE (vector_mode), 64U) > > && maybe_ge (GET_MODE_BITSIZE (element_mode) > > * GET_MODE_NUNITS (vector_mode), 128U)) > > { > > machine_mode res = aarch64_simd_container_mode (element_mode, 128); > > if (VECTOR_MODE_P (res)) > > return res; > > > > which makes us get a V4SImode vector for a V4HImode loop vector_mode. > Thanks for the explanation! > > > > So I think the appropriate effective dejagnu target is > > aarch64-*-* (there's none specifically to advsimd, not sure if one > > can disable that?) 
> The attached patch uses aarch64*-*-* target check, and additionally > for SVE (and other targets supporting vect_fold_extract_last) it > checks > if the condition reduction was carried out using FOLD_EXTRACT_LAST. > Does that look OK ? Works for me. Richard. > Thanks, > Prathamesh > > > > > Richard. > > > > > > Thanks, > > > > Prathamesh > > > > > > > > > > Thanks, > > > > > Richard > > > > > > > > > > > > > > -- > > Richard Biener <rguenther@suse.de> > > SUSE Software Solutions Germany GmbH, > > Frankenstrasse 146, 90461 Nuernberg, Germany; > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg) > -- Richard Biener <rguenther@suse.de> SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg, Germany; GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg) ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH][RFC] tree-optimization/92335 - Improve sinking heuristics for vectorization 2023-08-21 6:57 ` Richard Biener @ 2023-08-21 11:32 ` Prathamesh Kulkarni 0 siblings, 0 replies; 18+ messages in thread From: Prathamesh Kulkarni @ 2023-08-21 11:32 UTC (permalink / raw) To: Richard Biener; +Cc: Jeff Law, gcc-patches, richard.sandiford On Mon, 21 Aug 2023 at 12:27, Richard Biener <rguenther@suse.de> wrote: > > On Sat, 19 Aug 2023, Prathamesh Kulkarni wrote: > > > On Fri, 18 Aug 2023 at 17:11, Richard Biener <rguenther@suse.de> wrote: > > > > > > On Fri, 18 Aug 2023, Richard Biener wrote: > > > > > > > On Thu, 17 Aug 2023, Prathamesh Kulkarni wrote: > > > > > > > > > On Tue, 15 Aug 2023 at 14:28, Richard Sandiford > > > > > <richard.sandiford@arm.com> wrote: > > > > > > > > > > > > Richard Biener <rguenther@suse.de> writes: > > > > > > > On Mon, 14 Aug 2023, Prathamesh Kulkarni wrote: > > > > > > >> On Mon, 7 Aug 2023 at 13:19, Richard Biener <richard.guenther@gmail.com> wrote: > > > > > > >> > It doesn't seem to make a difference for x86. That said, the "fix" is > > > > > > >> > probably sticking the correct target on the dump-check, it seems > > > > > > >> > that vect_fold_extract_last is no longer correct here. > > > > > > >> Um sorry, I did go thru various checks in target-supports.exp, but not > > > > > > >> sure which one will be appropriate for this case, > > > > > > >> and am stuck here :/ Could you please suggest how to proceed ? > > > > > > > > > > > > > > Maybe Richard S. knows the magic thing to test, he originally > > > > > > > implemented the direct conversion support. I suggest to implement > > > > > > > such dg-checks if they are not present (I can't find them), > > > > > > > possibly quite specific to the modes involved (like we have > > > > > > > other checks with _qi_to_hi suffixes, for float modes maybe > > > > > > > just _float). > > > > > > > > > > > > Yeah, can't remember specific selectors for that feature. 
TBH I think > > > > > > most (all?) of the tests were AArch64-specific. > > > > > Hi, > > > > > As Richi mentioned above, the test now vectorizes on AArch64 because > > > > > it has support for direct conversion > > > > > between vectors while x86 doesn't. IIUC this is because > > > > > supportable_convert_operation returns true > > > > > for V4HI -> V4SI on Aarch64 since it can use extend_v4hiv4si2 for > > > > > doing the conversion ? > > > > > > > > > > In the attached patch, I added a new target check vect_extend which > > > > > (currently) returns 1 only for aarch64*-*-*, > > > > > which makes the test PASS on both the targets, altho I am not sure if > > > > > this is entirely correct. > > > > > Does the patch look OK ? > > > > > > > > Can you make vect_extend more specific, say vect_extend_hi_si or > > > > what is specifically needed here? Note I'll have to investigate > > > > why x86 cannot vectorize here since in fact it does have > > > > the extend operation ... it might be also worth splitting the > > > > sign/zero extend case, so - vect_sign_extend_hi_si or > > > > vect_extend_short_int? > > > > > > And now having anaylzed _why_ x86 doesn't vectorize it's rather > > > why we get this vectorized with NEON which is because > > > > > > static opt_machine_mode > > > aarch64_vectorize_related_mode (machine_mode vector_mode, > > > scalar_mode element_mode, > > > poly_uint64 nunits) > > > { > > > ... > > > /* Prefer to use 1 128-bit vector instead of 2 64-bit vectors. */ > > > if (TARGET_SIMD > > > && (vec_flags & VEC_ADVSIMD) > > > && known_eq (nunits, 0U) > > > && known_eq (GET_MODE_BITSIZE (vector_mode), 64U) > > > && maybe_ge (GET_MODE_BITSIZE (element_mode) > > > * GET_MODE_NUNITS (vector_mode), 128U)) > > > { > > > machine_mode res = aarch64_simd_container_mode (element_mode, 128); > > > if (VECTOR_MODE_P (res)) > > > return res; > > > > > > which makes us get a V4SImode vector for a V4HImode loop vector_mode. > > Thanks for the explanation! 
> > > > > > So I think the appropriate effective dejagnu target is > > > aarch64-*-* (there's none specifically to advsimd, not sure if one > > > can disable that?) > > The attached patch uses aarch64*-*-* target check, and additionally > > for SVE (and other targets supporting vect_fold_extract_last) it > > checks > > if the condition reduction was carried out using FOLD_EXTRACT_LAST. > > Does that look OK ? > > Works for me. Thanks, committed to trunk in dd606dc7c7e49feb7a900902ec6d35b421789173 Thanks, Prathamesh > > Richard. > > > Thanks, > > Prathamesh > > > > > > > > Richard. > > > > > > > > Thanks, > > > > > Prathamesh > > > > > > > > > > > > Thanks, > > > > > > Richard > > > > > > > > > > > > > > > > > > > -- > > > Richard Biener <rguenther@suse.de> > > > SUSE Software Solutions Germany GmbH, > > > Frankenstrasse 146, 90461 Nuernberg, Germany; > > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg) > > > > -- > Richard Biener <rguenther@suse.de> > SUSE Software Solutions Germany GmbH, > Frankenstrasse 146, 90461 Nuernberg, Germany; > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg) ^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2023-08-21 11:33 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-28  7:05 [PATCH][RFC] tree-optimization/92335 - Improve sinking heuristics for vectorization Richard Biener
2023-07-31 15:33 ` Jeff Law
2023-08-02  8:46 ` Richard Biener
2023-08-03 11:42 ` Prathamesh Kulkarni
2023-08-03 12:14 ` Richard Biener
2023-08-03 12:16 ` Richard Biener
2023-08-03 12:18 ` Richard Biener
2023-08-07  0:04 ` Prathamesh Kulkarni
2023-08-07  7:48 ` Richard Biener
2023-08-14 14:58 ` Prathamesh Kulkarni
2023-08-15  7:36 ` Richard Biener
2023-08-15  8:58 ` Richard Sandiford
2023-08-17 17:10 ` Prathamesh Kulkarni
2023-08-18  9:29 ` Richard Biener
2023-08-18 11:41 ` Richard Biener
2023-08-19 15:48 ` Prathamesh Kulkarni
2023-08-21  6:57 ` Richard Biener
2023-08-21 11:32 ` Prathamesh Kulkarni