From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 88708 invoked by alias); 1 Dec 2016 11:34:12 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Received: (qmail 86377 invoked by uid 89); 1 Dec 2016 11:34:11 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-4.8 required=5.0 tests=BAYES_00,RP_MATCHES_RCVD,SPF_PASS autolearn=ham version=3.3.2 spammy=rguenther@suse.de, rguenthersusede, statistics, 20161114 X-HELO: mx2.suse.de Received: from mx2.suse.de (HELO mx2.suse.de) (195.135.220.15) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Thu, 01 Dec 2016 11:34:01 +0000 Received: from relay2.suse.de (charybdis-ext.suse.de [195.135.220.254]) by mx2.suse.de (Postfix) with ESMTP id 97575AABE; Thu, 1 Dec 2016 11:33:58 +0000 (UTC) Date: Thu, 01 Dec 2016 11:34:00 -0000 From: Richard Biener To: Yuri Rumyantsev cc: Christophe Lyon , Jeff Law , gcc-patches , Ilya Enkovich Subject: Re: [PATCH, vec-tails] Support loop epilogue vectorization In-Reply-To: Message-ID: References: <0810E36B-CFA5-45E3-B352-1DE4B11FBF8A@suse.de> User-Agent: Alpine 2.11 (LSU 23 2013-08-11) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-SW-Source: 2016-12/txt/msg00047.txt.bz2 On Mon, 28 Nov 2016, Yuri Rumyantsev wrote: > Richard! > > I attached vect dump for hte part of attached test-case which > illustrated how vectorization of epilogues works through masking: > #define SIZE 1023 > #define ALIGN 64 > > extern int posix_memalign(void **memptr, __SIZE_TYPE__ alignment, > __SIZE_TYPE__ size) __attribute__((weak)); > extern void free (void *); > > void __attribute__((noinline)) > test_citer (int * __restrict__ a, > int * __restrict__ b, > int * __restrict__ c) > { > int i; > > a = (int *)__builtin_assume_aligned (a, ALIGN); > b = (int *)__builtin_assume_aligned (b, ALIGN); > c = (int *)__builtin_assume_aligned (c, ALIGN); > > for (i = 0; i < SIZE; i++) > c[i] = a[i] + b[i]; > } > > It was compiled with -mavx2 --param vect-epilogues-mask=1 options. > > I did not include in this patch vectorization of low trip-count loops > since in the original patch additional parameter was introduced: > +DEFPARAM (PARAM_VECT_SHORT_LOOPS, > + "vect-short-loops", > + "Enable vectorization of low trip count loops using masking.", > + 0, 0, 1) > > I assume that this ability can be included very quickly but it > requires cost model enhancements also. Comments on the patch itself (as I'm having a closer look again, I know how it vectorizes the above but I wondered why epilogue and short-trip loops are not basically the same code path). Btw, I don't like that the features are behind a --param paywall. That just means a) nobody will use it, b) it will bit-rot quickly, c) bugs are well-hidden. + if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) + && integer_zerop (nested_in_vect_loop + ? STMT_VINFO_DR_STEP (stmt_info) + : DR_STEP (dr))) + { + if (dump_enabled_p ()) + dump_printf_loc (MSG_NOTE, vect_location, + "allow invariant load for masked loop.\n"); + } this can test memory_access_type == VMAT_INVARIANT. Please put all the checks in a common if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)) { if (memory_access_type == VMAT_INVARIANT) { } else if (...) { LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false; } else if (..) ... } @@ -6667,6 +6756,15 @@ vectorizable_load (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt, gcc_assert (!nested_in_vect_loop); gcc_assert (!STMT_VINFO_GATHER_SCATTER_P (stmt_info)); + if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)) + { + if (dump_enabled_p ()) + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, + "cannot be masked: grouped access is not" + " supported."); + LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false; + } + isn't this already handled by the above? Or rather the general disallowance of SLP? @@ -5730,6 +5792,24 @@ vectorizable_store (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt, &memory_access_type, &gs_info)) return false; + if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) + && memory_access_type != VMAT_CONTIGUOUS) + { + LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false; + if (dump_enabled_p ()) + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, + "cannot be masked: unsupported memory access type.\n"); + } + + if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) + && !can_mask_load_store (stmt)) + { + LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false; + if (dump_enabled_p ()) + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, + "cannot be masked: unsupported masked store.\n"); + } + likewise please combine the ifs. @@ -2354,7 +2401,10 @@ vectorizable_mask_load_store (gimple *stmt, gimple_stmt_iterator *gsi, ptr, vec_mask, vec_rhs); vect_finish_stmt_generation (stmt, new_stmt, gsi); if (i == 0) - STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt; + { + STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt; + STMT_VINFO_FIRST_COPY_P (vinfo_for_stmt (new_stmt)) = true; + } else STMT_VINFO_RELATED_STMT (prev_stmt_info) = new_stmt; prev_stmt_info = vinfo_for_stmt (new_stmt); here you only set the flag, elsewhere you copy DR and VECTYPE as well. @@ -2113,6 +2146,20 @@ vectorizable_mask_load_store (gimple *stmt, gimple_stmt_iterator *gsi, && !useless_type_conversion_p (vectype, rhs_vectype))) return false; + if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)) + { + /* Check that mask conjuction is supported. */ + optab tab; + tab = optab_for_tree_code (BIT_AND_EXPR, vectype, optab_default); + if (!tab || optab_handler (tab, TYPE_MODE (vectype)) == CODE_FOR_nothing) + { + if (dump_enabled_p ()) + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, + "cannot be masked: unsupported mask operation\n"); + LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false; + } + } does this really test whether we can bit-and the mask? You are using the vector type of the store (which might be V2DF for example), also for AVX512 it might be a vector-bool type with integer mode? Of course we maybe can simply assume mask conjunction is available (I know no ISA where that would be not true). +/* Return true if STMT can be converted to masked form. */ + +static bool +can_mask_load_store (gimple *stmt) +{ + stmt_vec_info stmt_info = vinfo_for_stmt (stmt); + tree vectype, mask_vectype; + tree lhs, ref; + + if (!stmt_info) + return false; + lhs = gimple_assign_lhs (stmt); + ref = (TREE_CODE (lhs) == SSA_NAME) ? gimple_assign_rhs1 (stmt) : lhs; + if (may_be_nonaddressable_p (ref)) + return false; + vectype = STMT_VINFO_VECTYPE (stmt_info); You probably modeled this after ifcvt_can_use_mask_load_store but I don't think checking may_be_nonaddressable_p is necessary (we couldn't even vectorize such refs). stmt_info should never be NULL either. With the check removed tree-ssa-loop-ivopts.h should no longer be necessary. +static void +vect_mask_load_store_stmt (gimple *stmt, tree vectype, tree mask, + data_reference *dr, gimple_stmt_iterator *si) +{ ... + addr = force_gimple_operand_gsi (&gsi, build_fold_addr_expr (mem), + true, NULL_TREE, true, + GSI_SAME_STMT); + + align = TYPE_ALIGN_UNIT (vectype); + if (aligned_access_p (dr)) + misalign = 0; + else if (DR_MISALIGNMENT (dr) == -1) + { + align = TYPE_ALIGN_UNIT (elem_type); + misalign = 0; + } + else + misalign = DR_MISALIGNMENT (dr); + set_ptr_info_alignment (get_ptr_info (addr), align, misalign); + ptr = build_int_cst (reference_alias_ptr_type (mem), + misalign ? misalign & -misalign : align); you should simply use align = get_object_alignment (mem) / BITS_PER_UNIT; here rather than trying to be clever. Eventually you don't need the DR then (see question above). + } + gsi_replace (si ? si : &gsi, new_stmt, false); when you replace the load/store please previously copy VUSE and VDEF from the original one (we were nearly clean enough to no longer require a virtual operand rewrite after vectorization...) Thus gimple_set_vuse (new_stmt, gimple_vuse (stmt)); gimple_set_vdef (new_stmt, gimple_vdef (stmt)); +static void +vect_mask_loop (loop_vec_info loop_vinfo) +{ ... + /* Scan all loop statements to convert vector load/store including masked + form. */ + for (unsigned i = 0; i < loop->num_nodes; i++) + { + basic_block bb = bbs[i]; + for (gimple_stmt_iterator si = gsi_start_bb (bb); + !gsi_end_p (si); gsi_next (&si)) + { + gimple *stmt = gsi_stmt (si); + stmt_vec_info stmt_info = NULL; + tree vectype = NULL; + data_reference *dr; + + /* Mask load case. */ + if (is_gimple_call (stmt) + && gimple_call_internal_p (stmt) + && gimple_call_internal_fn (stmt) == IFN_MASK_LOAD + && !VECTOR_TYPE_P (TREE_TYPE (gimple_call_arg (stmt, 2)))) + { ... + /* Skip invariant loads. */ + if (integer_zerop (nested_in_vect_loop_p (loop, stmt) + ? STMT_VINFO_DR_STEP (stmt_info) + : DR_STEP (STMT_VINFO_DATA_REF (stmt_info)))) + continue; seeing this it would be nice if stmt_info had a flag for whether the stmt needs masking (and a flag on wheter this is a scalar or a vectorized stmt). + /* Skip hoisted out statements. */ + if (!flow_bb_inside_loop_p (loop, gimple_bb (stmt))) + continue; err, you walk stmts in the loop! Isn't this covered by the above skipping of 'invariant loads'? +static gimple * +vect_mask_reduction_stmt (gimple *stmt, tree mask, gimple *prev) +{ depending on the reduction operand there are variants that could get away w/o the VEC_COND_EXPR, like S1': tem_4 = d_3 & MASK; S2': r_1 = r_2 + tem_4; which works for plus at least. More generally doing S1': tem_4 = VEC_COND_EXPR S2': r_1 = r_2 OP tem_4; and leaving optimization to & to later opts (& won't work for AVX512 mask registers I guess). Good enough for later enhacement of course. +static void +vect_gen_ivs_for_masking (loop_vec_info loop_vinfo, vec *ivs) +{ ... isn't it enough to always create a single IV and derive the additional copies by IV + i * { elems, elems, elems ... }? IVs are expensive -- I'm sure we can optimize the rest of the scheme further as well but this one looks obvious to me. @@ -3225,12 +3508,32 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, int npeel = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo); void *target_cost_data = LOOP_VINFO_TARGET_COST_DATA (loop_vinfo); + if (LOOP_VINFO_EPILOGUE_P (loop_vinfo)) + { + /* Currently we don't produce scalar epilogue version in case + its masked version is provided. It means we don't need to + compute profitability one more time here. Just make a + masked loop version. */ + if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) + && PARAM_VALUE (PARAM_VECT_EPILOGUES_MASK)) + { + dump_printf_loc (MSG_NOTE, vect_location, + "cost model: mask loop epilogue.\n"); + LOOP_VINFO_MASK_LOOP (loop_vinfo) = true; + *ret_min_profitable_niters = 0; + *ret_min_profitable_estimate = 0; + return; + } + } /* Cost model disabled. */ - if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo))) + else if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo))) { dump_printf_loc (MSG_NOTE, vect_location, "cost model disabled.\n"); *ret_min_profitable_niters = 0; *ret_min_profitable_estimate = 0; + if (PARAM_VALUE (PARAM_VECT_EPILOGUES_MASK) + && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)) + LOOP_VINFO_MASK_LOOP (loop_vinfo) = true; return; } the unlimited_cost_model case should come first? OTOH masking or not is probably not sth covered by 'unlimited' - that is about vectorizing or not. But the above code means that for epilogue vectorization w/o masking we ignore unlimited_cost_model ()? That doesn't make sense to me. Plus if this is short-trip or epilogue vectorization and the cost model is _not_ unlimited then we dont' want to enable masking always (if it is possible). It might be we statically know the epilogue executes for at most two iterations for example. I don't see _any_ cost model for vectorizing the epilogue with masking? Am I missing something? A "trivial" cost model should at least consider the additional IV(s), the mask compute and the widening and narrowing ops required. diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c index e13d6a2..36be342 100644 --- a/gcc/tree-vect-loop-manip.c +++ b/gcc/tree-vect-loop-manip.c @@ -1635,6 +1635,13 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1, bool epilog_peeling = (LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)); + if (LOOP_VINFO_EPILOGUE_P (loop_vinfo)) + { + prolog_peeling = false; + if (LOOP_VINFO_MASK_LOOP (loop_vinfo)) + epilog_peeling = false; + } + if (!prolog_peeling && !epilog_peeling) return NULL; I think the prolog_peeling was fixed during the epilogue vectorization review and should no longer be necessary. Please add a && ! LOOP_VINFO_MASK_LOOP () to the epilog_peeling init instead (it should also work for short-trip loop vectorization). @@ -2022,11 +2291,18 @@ start_over: || (max_niter != -1 && (unsigned HOST_WIDE_INT) max_niter < vectorization_factor)) { - if (dump_enabled_p ()) - dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, - "not vectorized: iteration count smaller than " - "vectorization factor.\n"); - return false; + /* Allow low trip count for loop epilogue we want to mask. */ + if (LOOP_VINFO_EPILOGUE_P (loop_vinfo) + && PARAM_VALUE (PARAM_VECT_EPILOGUES_MASK)) + LOOP_VINFO_MASK_LOOP (loop_vinfo) = true; + else + { + if (dump_enabled_p ()) so why do we test only LOOP_VINFO_EPILOGUE_P here? All the code I saw sofar would also work for the main loop (but the cost model is missing). I am missing testcases. There's only a single one but we should have cases covering all kinds of mask IV widths and widen/shorten masks. Do you have any numbers on SPEC 2k6 with epilogue vect and/or masking enabled for an AVX2 machine? Oh, and I really dislike the --param paywall. Thanks, Richard. > Best regards. > Yuri. > > > 2016-11-28 17:39 GMT+03:00 Richard Biener : > > On Thu, 24 Nov 2016, Yuri Rumyantsev wrote: > > > >> Hi All, > >> > >> Here is the second patch which supports epilogue vectorization using > >> masking without cost model. Currently it is possible > >> only with passing parameter "--param vect-epilogues-mask=1". > >> > >> Bootstrapping and regression testing did not show any new regression. > >> > >> Any comments will be appreciated. > > > > Going over the patch the main question is one how it works -- it looks > > like the decision whether to vectorize & mask the epilogue is made > > when vectorizing the loop that generates the epilogue rather than > > in the epilogue vectorization path? > > > > That is, I'd have expected to see this handling low-trip count loops > > by masking? And thus masking the epilogue simply by it being > > low-trip count? > > > > Richard. > > > >> ChangeLog: > >> 2016-11-24 Yuri Rumyantsev > >> > >> * params.def (PARAM_VECT_EPILOGUES_MASK): New. > >> * tree-vect-data-refs.c (vect_get_new_ssa_name): Support vect_mask_var. > >> * tree-vect-loop.c: Include insn-config.h, recog.h and alias.h. > >> (new_loop_vec_info): Add zeroing can_be_masked, mask_loop and > >> required_mask fields. > >> (vect_check_required_masks_widening): New. > >> (vect_check_required_masks_narrowing): New. > >> (vect_get_masking_iv_elems): New. > >> (vect_get_masking_iv_type): New. > >> (vect_get_extreme_masks): New. > >> (vect_check_required_masks): New. > >> (vect_analyze_loop_operations): Call vect_check_required_masks if all > >> statements can be masked. > >> (vect_analyze_loop_2): Inititalize to zero min_scalar_loop_bound. > >> Add check that epilogue can be masked with the same vf with issue > >> fail notes. Allow epilogue vectorization through masking of low trip > >> loops. Set to true can_be_masked field before loop operation analysis. > >> Do not set-up min_scalar_loop_bound for epilogue vectorization through > >> masking. Do not peeling for epilogue masking. Reset can_be_masked > >> field before repeat analysis. > >> (vect_estimate_min_profitable_iters): Do not compute profitability > >> for epilogue masking. Set up mask_loop filed to true if parameter > >> PARAM_VECT_EPILOGUES_MASK is non-zero. > >> (vectorizable_reduction): Add check that statement can be masked. > >> (vectorizable_induction): Do not support masking for induction. > >> (vect_gen_ivs_for_masking): New. > >> (vect_get_mask_index_for_elems): New. > >> (vect_get_mask_index_for_type): New. > >> (vect_create_narrowed_masks): New. > >> (vect_create_widened_masks): New. > >> (vect_gen_loop_masks): New. > >> (vect_mask_reduction_stmt): New. > >> (vect_mask_mask_load_store_stmt): New. > >> (vect_mask_load_store_stmt): New. > >> (vect_mask_loop): New. > >> (vect_transform_loop): Invoke vect_mask_loop if required. > >> Use div_ceil to recompute upper bounds for masked loops. Issue > >> statistics for epilogue vectorization through masking. Do not reduce > >> vf for masking epilogue. > >> * tree-vect-stmts.c: Include tree-ssa-loop-ivopts.h. > >> (can_mask_load_store): New. > >> (vectorizable_mask_load_store): Check that mask conjuction is > >> supported. Set-up first_copy_p field of stmt_vinfo. > >> (vectorizable_simd_clone_call): Check that simd clone can not be > >> masked. > >> (vectorizable_store): Check that store can be masked. Mark the first > >> copy of generated vector stores and provide it with vectype and the > >> original data reference. > >> (vectorizable_load): Check that load can be masked. > >> (vect_stmt_should_be_masked_for_epilogue): New. > >> (vect_add_required_mask_for_stmt): New. > >> (vect_analyze_stmt): Add check on unsupported statements for masking > >> with printing message. > >> * tree-vectorizer.h (struct _loop_vec_info): Add new fields > >> can_be_maske, required_masks, masl_loop. > >> (LOOP_VINFO_CAN_BE_MASKED): New. > >> (LOOP_VINFO_REQUIRED_MASKS): New. > >> (LOOP_VINFO_MASK_LOOP): New. > >> (struct _stmt_vec_info): Add first_copy_p field. > >> (STMT_VINFO_FIRST_COPY_P): New. > >> > >> gcc/testsuite/ > >> > >> * gcc.dg/vect/vect-tail-mask-1.c: New test. > >> > >> 2016-11-18 18:54 GMT+03:00 Christophe Lyon : > >> > On 18 November 2016 at 16:46, Yuri Rumyantsev wrote: > >> >> It is very strange that this test failed on arm, since it requires > >> >> target avx2 to check vectorizer dumps: > >> >> > >> >> /* { dg-final { scan-tree-dump-times "LOOP VECTORIZED" 2 "vect" { > >> >> target avx2_runtime } } } */ > >> >> /* { dg-final { scan-tree-dump-times "LOOP EPILOGUE VECTORIZED > >> >> \\(VS=16\\)" 2 "vect" { target avx2_runtime } } } */ > >> >> > >> >> Could you please clarify what is the reason of the failure? > >> > > >> > It's not the scan-dumps that fail, but the execution. > >> > The test calls abort() for some reason. > >> > > >> > It will take me a while to rebuild the test manually in the right > >> > debug environment to provide you with more traces. > >> > > >> > > >> > > >> >> > >> >> Thanks. > >> >> > >> >> 2016-11-18 16:20 GMT+03:00 Christophe Lyon : > >> >>> On 15 November 2016 at 15:41, Yuri Rumyantsev wrote: > >> >>>> Hi All, > >> >>>> > >> >>>> Here is patch for non-masked epilogue vectoriziation. > >> >>>> > >> >>>> Bootstrap and regression testing did not show any new failures. > >> >>>> > >> >>>> Is it OK for trunk? > >> >>>> > >> >>>> Thanks. > >> >>>> Changelog: > >> >>>> > >> >>>> 2016-11-15 Yuri Rumyantsev > >> >>>> > >> >>>> * params.def (PARAM_VECT_EPILOGUES_NOMASK): New. > >> >>>> * tree-if-conv.c (tree_if_conversion): Make public. > >> >>>> * * tree-if-conv.h: New file. > >> >>>> * tree-vect-data-refs.c (vect_analyze_data_ref_dependences) Avoid > >> >>>> dynamic alias checks for epilogues. > >> >>>> * tree-vect-loop-manip.c (vect_do_peeling): Return created epilog. > >> >>>> * tree-vect-loop.c: include tree-if-conv.h. > >> >>>> (new_loop_vec_info): Add zeroing orig_loop_info field. > >> >>>> (vect_analyze_loop_2): Don't try to enhance alignment for epilogues. > >> >>>> (vect_analyze_loop): Add argument ORIG_LOOP_INFO which is not NULL > >> >>>> if epilogue is vectorized, set up orig_loop_info field of loop_vinfo > >> >>>> using passed argument. > >> >>>> (vect_transform_loop): Check if created epilogue should be returned > >> >>>> for further vectorization with less vf. If-convert epilogue if > >> >>>> required. Print vectorization success for epilogue. > >> >>>> * tree-vectorizer.c (vectorize_loops): Add epilogue vectorization > >> >>>> if it is required, pass loop_vinfo produced during vectorization of > >> >>>> loop body to vect_analyze_loop. > >> >>>> * tree-vectorizer.h (struct _loop_vec_info): Add new field > >> >>>> orig_loop_info. > >> >>>> (LOOP_VINFO_ORIG_LOOP_INFO): New. > >> >>>> (LOOP_VINFO_EPILOGUE_P): New. > >> >>>> (LOOP_VINFO_ORIG_VECT_FACTOR): New. > >> >>>> (vect_do_peeling): Change prototype to return epilogue. > >> >>>> (vect_analyze_loop): Add argument of loop_vec_info type. > >> >>>> (vect_transform_loop): Return created loop. > >> >>>> > >> >>>> gcc/testsuite/ > >> >>>> > >> >>>> * lib/target-supports.exp (check_avx2_hw_available): New. > >> >>>> (check_effective_target_avx2_runtime): New. > >> >>>> * gcc.dg/vect/vect-tail-nomask-1.c: New test. > >> >>>> > >> >>> > >> >>> Hi, > >> >>> > >> >>> This new test fails on arm-none-eabi (using default cpu/fpu/mode): > >> >>> gcc.dg/vect/vect-tail-nomask-1.c -flto -ffat-lto-objects execution test > >> >>> gcc.dg/vect/vect-tail-nomask-1.c execution test > >> >>> > >> >>> It does pass on the same target if configured --with-cpu=cortex-a9. > >> >>> > >> >>> Christophe > >> >>> > >> >>> > >> >>> > >> >>>> > >> >>>> 2016-11-14 20:04 GMT+03:00 Richard Biener : > >> >>>>> On November 14, 2016 4:39:40 PM GMT+01:00, Yuri Rumyantsev wrote: > >> >>>>>>Richard, > >> >>>>>> > >> >>>>>>I checked one of the tests designed for epilogue vectorization using > >> >>>>>>patches 1 - 3 and found out that build compiler performs vectorization > >> >>>>>>of epilogues with --param vect-epilogues-nomask=1 passed: > >> >>>>>> > >> >>>>>>$ gcc -Ofast -mavx2 t1.c -S --param vect-epilogues-nomask=1 -o > >> >>>>>>t1.new-nomask.s -fdump-tree-vect-details > >> >>>>>>$ grep VECTORIZED -c t1.c.156t.vect > >> >>>>>>4 > >> >>>>>> Without param only 2 loops are vectorized. > >> >>>>>> > >> >>>>>>Should I simply add a part of tests related to this feature or I must > >> >>>>>>delete all not necessary changes also? > >> >>>>> > >> >>>>> Please remove all not necessary changes. > >> >>>>> > >> >>>>> Richard. > >> >>>>> > >> >>>>>>Thanks. > >> >>>>>>Yuri. > >> >>>>>> > >> >>>>>>2016-11-14 16:40 GMT+03:00 Richard Biener : > >> >>>>>>> On Mon, 14 Nov 2016, Yuri Rumyantsev wrote: > >> >>>>>>> > >> >>>>>>>> Richard, > >> >>>>>>>> > >> >>>>>>>> In my previous patch I forgot to remove couple lines related to aux > >> >>>>>>field. > >> >>>>>>>> Here is the correct updated patch. > >> >>>>>>> > >> >>>>>>> Yeah, I noticed. This patch would be ok for trunk (together with > >> >>>>>>> necessary parts from 1 and 2) if all not required parts are removed > >> >>>>>>> (and you'd add the testcases covering non-masked tail vect). > >> >>>>>>> > >> >>>>>>> Thus, can you please produce a single complete patch containing only > >> >>>>>>> non-masked epilogue vectoriziation? > >> >>>>>>> > >> >>>>>>> Thanks, > >> >>>>>>> Richard. > >> >>>>>>> > >> >>>>>>>> Thanks. > >> >>>>>>>> Yuri. > >> >>>>>>>> > >> >>>>>>>> 2016-11-14 15:51 GMT+03:00 Richard Biener : > >> >>>>>>>> > On Fri, 11 Nov 2016, Yuri Rumyantsev wrote: > >> >>>>>>>> > > >> >>>>>>>> >> Richard, > >> >>>>>>>> >> > >> >>>>>>>> >> I prepare updated 3 patch with passing additional argument to > >> >>>>>>>> >> vect_analyze_loop as you proposed (untested). > >> >>>>>>>> >> > >> >>>>>>>> >> You wrote: > >> >>>>>>>> >> tw, I wonder if you can produce a single patch containing just > >> >>>>>>>> >> epilogue vectorization, that is combine patches 1-3 but rip out > >> >>>>>>>> >> changes only needed by later patches? > >> >>>>>>>> >> > >> >>>>>>>> >> Did you mean that I exclude all support for vectorization > >> >>>>>>epilogues, > >> >>>>>>>> >> i.e. exclude from 2-nd patch all non-related changes > >> >>>>>>>> >> like > >> >>>>>>>> >> > >> >>>>>>>> >> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c > >> >>>>>>>> >> index 11863af..32011c1 100644 > >> >>>>>>>> >> --- a/gcc/tree-vect-loop.c > >> >>>>>>>> >> +++ b/gcc/tree-vect-loop.c > >> >>>>>>>> >> @@ -1120,6 +1120,12 @@ new_loop_vec_info (struct loop *loop) > >> >>>>>>>> >> LOOP_VINFO_PEELING_FOR_GAPS (res) = false; > >> >>>>>>>> >> LOOP_VINFO_PEELING_FOR_NITER (res) = false; > >> >>>>>>>> >> LOOP_VINFO_OPERANDS_SWAPPED (res) = false; > >> >>>>>>>> >> + LOOP_VINFO_CAN_BE_MASKED (res) = false; > >> >>>>>>>> >> + LOOP_VINFO_REQUIRED_MASKS (res) = 0; > >> >>>>>>>> >> + LOOP_VINFO_COMBINE_EPILOGUE (res) = false; > >> >>>>>>>> >> + LOOP_VINFO_MASK_EPILOGUE (res) = false; > >> >>>>>>>> >> + LOOP_VINFO_NEED_MASKING (res) = false; > >> >>>>>>>> >> + LOOP_VINFO_ORIG_LOOP_INFO (res) = NULL; > >> >>>>>>>> > > >> >>>>>>>> > Yes. > >> >>>>>>>> > > >> >>>>>>>> >> Did you mean also that new combined patch must be working patch, > >> >>>>>>i.e. > >> >>>>>>>> >> can be integrated without other patches? > >> >>>>>>>> > > >> >>>>>>>> > Yes. > >> >>>>>>>> > > >> >>>>>>>> >> Could you please look at updated patch? > >> >>>>>>>> > > >> >>>>>>>> > Will do. > >> >>>>>>>> > > >> >>>>>>>> > Thanks, > >> >>>>>>>> > Richard. > >> >>>>>>>> > > >> >>>>>>>> >> Thanks. > >> >>>>>>>> >> Yuri. > >> >>>>>>>> >> > >> >>>>>>>> >> 2016-11-10 15:36 GMT+03:00 Richard Biener : > >> >>>>>>>> >> > On Thu, 10 Nov 2016, Richard Biener wrote: > >> >>>>>>>> >> > > >> >>>>>>>> >> >> On Tue, 8 Nov 2016, Yuri Rumyantsev wrote: > >> >>>>>>>> >> >> > >> >>>>>>>> >> >> > Richard, > >> >>>>>>>> >> >> > > >> >>>>>>>> >> >> > Here is updated 3 patch. > >> >>>>>>>> >> >> > > >> >>>>>>>> >> >> > I checked that all new tests related to epilogue > >> >>>>>>vectorization passed with it. > >> >>>>>>>> >> >> > > >> >>>>>>>> >> >> > Your comments will be appreciated. > >> >>>>>>>> >> >> > >> >>>>>>>> >> >> A lot better now. Instead of the ->aux dance I now prefer to > >> >>>>>>>> >> >> pass the original loops loop_vinfo to vect_analyze_loop as > >> >>>>>>>> >> >> optional argument (if non-NULL we analyze the epilogue of that > >> >>>>>>>> >> >> loop_vinfo). OTOH I remember we mainly use it to get at the > >> >>>>>>>> >> >> original vectorization factor? So we can pass down an > >> >>>>>>(optional) > >> >>>>>>>> >> >> forced vectorization factor as well? > >> >>>>>>>> >> > > >> >>>>>>>> >> > Btw, I wonder if you can produce a single patch containing just > >> >>>>>>>> >> > epilogue vectorization, that is combine patches 1-3 but rip out > >> >>>>>>>> >> > changes only needed by later patches? > >> >>>>>>>> >> > > >> >>>>>>>> >> > Thanks, > >> >>>>>>>> >> > Richard. > >> >>>>>>>> >> > > >> >>>>>>>> >> >> Richard. > >> >>>>>>>> >> >> > >> >>>>>>>> >> >> > 2016-11-08 15:38 GMT+03:00 Richard Biener > >> >>>>>>: > >> >>>>>>>> >> >> > > On Thu, 3 Nov 2016, Yuri Rumyantsev wrote: > >> >>>>>>>> >> >> > > > >> >>>>>>>> >> >> > >> Hi Richard, > >> >>>>>>>> >> >> > >> > >> >>>>>>>> >> >> > >> I did not understand your last remark: > >> >>>>>>>> >> >> > >> > >> >>>>>>>> >> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change): > >> >>>>>>>> >> >> > >> > > >> >>>>>>>> >> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void) > >> >>>>>>>> >> >> > >> > && dump_enabled_p ()) > >> >>>>>>>> >> >> > >> > dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, > >> >>>>>>vect_location, > >> >>>>>>>> >> >> > >> > "loop vectorized\n"); > >> >>>>>>>> >> >> > >> > - vect_transform_loop (loop_vinfo); > >> >>>>>>>> >> >> > >> > + new_loop = vect_transform_loop (loop_vinfo); > >> >>>>>>>> >> >> > >> > num_vectorized_loops++; > >> >>>>>>>> >> >> > >> > /* Now that the loop has been vectorized, allow > >> >>>>>>it to be unrolled > >> >>>>>>>> >> >> > >> > etc. */ > >> >>>>>>>> >> >> > >> > loop->force_vectorize = false; > >> >>>>>>>> >> >> > >> > > >> >>>>>>>> >> >> > >> > + /* Add new loop to a processing queue. To make > >> >>>>>>it easier > >> >>>>>>>> >> >> > >> > + to match loop and its epilogue vectorization > >> >>>>>>in dumps > >> >>>>>>>> >> >> > >> > + put new loop as the next loop to process. > >> >>>>>>*/ > >> >>>>>>>> >> >> > >> > + if (new_loop) > >> >>>>>>>> >> >> > >> > + { > >> >>>>>>>> >> >> > >> > + loops.safe_insert (i + 1, new_loop->num); > >> >>>>>>>> >> >> > >> > + vect_loops_num = number_of_loops (cfun); > >> >>>>>>>> >> >> > >> > + } > >> >>>>>>>> >> >> > >> > > >> >>>>>>>> >> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo, > >> >>>>>>new_loop) > >> >>>>>>>> >> >> > >> f> unction which will set up stuff properly (and also > >> >>>>>>perform > >> >>>>>>>> >> >> > >> > the if-conversion of the epilogue there). > >> >>>>>>>> >> >> > >> > > >> >>>>>>>> >> >> > >> > That said, if we can get in non-masked epilogue > >> >>>>>>vectorization > >> >>>>>>>> >> >> > >> > separately that would be great. > >> >>>>>>>> >> >> > >> > >> >>>>>>>> >> >> > >> Could you please clarify your proposal. > >> >>>>>>>> >> >> > > > >> >>>>>>>> >> >> > > When a loop was vectorized set things up to immediately > >> >>>>>>vectorize > >> >>>>>>>> >> >> > > its epilogue, avoiding changing the loop iteration and > >> >>>>>>avoiding > >> >>>>>>>> >> >> > > the re-use of ->aux. > >> >>>>>>>> >> >> > > > >> >>>>>>>> >> >> > > Richard. > >> >>>>>>>> >> >> > > > >> >>>>>>>> >> >> > >> Thanks. > >> >>>>>>>> >> >> > >> Yuri. > >> >>>>>>>> >> >> > >> > >> >>>>>>>> >> >> > >> 2016-11-02 15:27 GMT+03:00 Richard Biener > >> >>>>>>: > >> >>>>>>>> >> >> > >> > On Tue, 1 Nov 2016, Yuri Rumyantsev wrote: > >> >>>>>>>> >> >> > >> > > >> >>>>>>>> >> >> > >> >> Hi All, > >> >>>>>>>> >> >> > >> >> > >> >>>>>>>> >> >> > >> >> I re-send all patches sent by Ilya earlier for review > >> >>>>>>which support > >> >>>>>>>> >> >> > >> >> vectorization of loop epilogues and loops with low > >> >>>>>>trip count. We > >> >>>>>>>> >> >> > >> >> assume that the only patch - > >> >>>>>>vec-tails-07-combine-tail.patch - was not > >> >>>>>>>> >> >> > >> >> approved by Jeff. > >> >>>>>>>> >> >> > >> >> > >> >>>>>>>> >> >> > >> >> I did re-base of all patches and performed > >> >>>>>>bootstrapping and > >> >>>>>>>> >> >> > >> >> regression testing that did not show any new failures. > >> >>>>>>Also all > >> >>>>>>>> >> >> > >> >> changes related to new vect_do_peeling algorithm have > >> >>>>>>been changed > >> >>>>>>>> >> >> > >> >> accordingly. > >> >>>>>>>> >> >> > >> >> > >> >>>>>>>> >> >> > >> >> Is it OK for trunk? > >> >>>>>>>> >> >> > >> > > >> >>>>>>>> >> >> > >> > I would have prefered that the series up to > >> >>>>>>-03-nomask-tails would > >> >>>>>>>> >> >> > >> > _only_ contain epilogue loop vectorization changes but > >> >>>>>>unfortunately > >> >>>>>>>> >> >> > >> > the patchset is oddly separated. > >> >>>>>>>> >> >> > >> > > >> >>>>>>>> >> >> > >> > I have a comment on that part nevertheless: > >> >>>>>>>> >> >> > >> > > >> >>>>>>>> >> >> > >> > @@ -1608,7 +1614,10 @@ vect_enhance_data_refs_alignment > >> >>>>>>(loop_vec_info > >> >>>>>>>> >> >> > >> > loop_vinfo) > >> >>>>>>>> >> >> > >> > /* Check if we can possibly peel the loop. */ > >> >>>>>>>> >> >> > >> > if (!vect_can_advance_ivs_p (loop_vinfo) > >> >>>>>>>> >> >> > >> > || !slpeel_can_duplicate_loop_p (loop, > >> >>>>>>single_exit (loop)) > >> >>>>>>>> >> >> > >> > - || loop->inner) > >> >>>>>>>> >> >> > >> > + || loop->inner > >> >>>>>>>> >> >> > >> > + /* Required peeling was performed in prologue > >> >>>>>>and > >> >>>>>>>> >> >> > >> > + is not required for epilogue. */ > >> >>>>>>>> >> >> > >> > + || LOOP_VINFO_EPILOGUE_P (loop_vinfo)) > >> >>>>>>>> >> >> > >> > do_peeling = false; > >> >>>>>>>> >> >> > >> > > >> >>>>>>>> >> >> > >> > if (do_peeling > >> >>>>>>>> >> >> > >> > @@ -1888,7 +1897,10 @@ vect_enhance_data_refs_alignment > >> >>>>>>(loop_vec_info > >> >>>>>>>> >> >> > >> > loop_vinfo) > >> >>>>>>>> >> >> > >> > > >> >>>>>>>> >> >> > >> > do_versioning = > >> >>>>>>>> >> >> > >> > optimize_loop_nest_for_speed_p (loop) > >> >>>>>>>> >> >> > >> > - && (!loop->inner); /* FORNOW */ > >> >>>>>>>> >> >> > >> > + && (!loop->inner) /* FORNOW */ > >> >>>>>>>> >> >> > >> > + /* Required versioning was performed for the > >> >>>>>>>> >> >> > >> > + original loop and is not required for > >> >>>>>>epilogue. */ > >> >>>>>>>> >> >> > >> > + && !LOOP_VINFO_EPILOGUE_P (loop_vinfo); > >> >>>>>>>> >> >> > >> > > >> >>>>>>>> >> >> > >> > if (do_versioning) > >> >>>>>>>> >> >> > >> > { > >> >>>>>>>> >> >> > >> > > >> >>>>>>>> >> >> > >> > please do that check in the single caller of this > >> >>>>>>function. > >> >>>>>>>> >> >> > >> > > >> >>>>>>>> >> >> > >> > Otherwise I still dislike the new ->aux use and I > >> >>>>>>believe that simply > >> >>>>>>>> >> >> > >> > passing down info from the processed parent would be > >> >>>>>>_much_ cleaner. > >> >>>>>>>> >> >> > >> > That is, here (and avoid the FOR_EACH_LOOP change): > >> >>>>>>>> >> >> > >> > > >> >>>>>>>> >> >> > >> > @@ -580,12 +586,21 @@ vectorize_loops (void) > >> >>>>>>>> >> >> > >> > && dump_enabled_p ()) > >> >>>>>>>> >> >> > >> > dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, > >> >>>>>>vect_location, > >> >>>>>>>> >> >> > >> > "loop vectorized\n"); > >> >>>>>>>> >> >> > >> > - vect_transform_loop (loop_vinfo); > >> >>>>>>>> >> >> > >> > + new_loop = vect_transform_loop (loop_vinfo); > >> >>>>>>>> >> >> > >> > num_vectorized_loops++; > >> >>>>>>>> >> >> > >> > /* Now that the loop has been vectorized, allow > >> >>>>>>it to be unrolled > >> >>>>>>>> >> >> > >> > etc. */ > >> >>>>>>>> >> >> > >> > loop->force_vectorize = false; > >> >>>>>>>> >> >> > >> > > >> >>>>>>>> >> >> > >> > + /* Add new loop to a processing queue. To make > >> >>>>>>it easier > >> >>>>>>>> >> >> > >> > + to match loop and its epilogue vectorization > >> >>>>>>in dumps > >> >>>>>>>> >> >> > >> > + put new loop as the next loop to process. > >> >>>>>>*/ > >> >>>>>>>> >> >> > >> > + if (new_loop) > >> >>>>>>>> >> >> > >> > + { > >> >>>>>>>> >> >> > >> > + loops.safe_insert (i + 1, new_loop->num); > >> >>>>>>>> >> >> > >> > + vect_loops_num = number_of_loops (cfun); > >> >>>>>>>> >> >> > >> > + } > >> >>>>>>>> >> >> > >> > > >> >>>>>>>> >> >> > >> > simply dispatch to a vectorize_epilogue (loop_vinfo, > >> >>>>>>new_loop) > >> >>>>>>>> >> >> > >> > function which will set up stuff properly (and also > >> >>>>>>perform > >> >>>>>>>> >> >> > >> > the if-conversion of the epilogue there). > >> >>>>>>>> >> >> > >> > > >> >>>>>>>> >> >> > >> > That said, if we can get in non-masked epilogue > >> >>>>>>vectorization > >> >>>>>>>> >> >> > >> > separately that would be great. > >> >>>>>>>> >> >> > >> > > >> >>>>>>>> >> >> > >> > I'm still torn about all the rest of the stuff and > >> >>>>>>question its > >> >>>>>>>> >> >> > >> > usability (esp. merging the epilogue with the main > >> >>>>>>vector loop). > >> >>>>>>>> >> >> > >> > But it has already been approved ... oh well. > >> >>>>>>>> >> >> > >> > > >> >>>>>>>> >> >> > >> > Thanks, > >> >>>>>>>> >> >> > >> > Richard. > >> >>>>>>>> >> >> > >> > >> >>>>>>>> >> >> > >> > >> >>>>>>>> >> >> > > > >> >>>>>>>> >> >> > > -- > >> >>>>>>>> >> >> > > Richard Biener > >> >>>>>>>> >> >> > > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, > >> >>>>>>Graham Norton, HRB 21284 (AG Nuernberg) > >> >>>>>>>> >> >> > > >> >>>>>>>> >> >> > >> >>>>>>>> >> >> > >> >>>>>>>> >> > > >> >>>>>>>> >> > -- > >> >>>>>>>> >> > Richard Biener > >> >>>>>>>> >> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham > >> >>>>>>Norton, HRB 21284 (AG Nuernberg) > >> >>>>>>>> >> > >> >>>>>>>> > > >> >>>>>>>> > -- > >> >>>>>>>> > Richard Biener > >> >>>>>>>> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham > >> >>>>>>Norton, HRB 21284 (AG Nuernberg) > >> >>>>>>>> > >> >>>>>>> > >> >>>>>>> -- > >> >>>>>>> Richard Biener > >> >>>>>>> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham > >> >>>>>>Norton, HRB 21284 (AG Nuernberg) > >> >>>>> > >> >>>>> > >> > > > > -- > > Richard Biener > > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg) > -- Richard Biener SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)