public inbox for gcc-patches@gcc.gnu.org
* [RFC][PATCH, vec-tails 00/10] Support vectorization of loop epilogues
@ 2016-05-19 19:36 Ilya Enkovich
  2016-06-15 12:06 ` Richard Biener
  2016-06-16  6:28 ` Jeff Law
  0 siblings, 2 replies; 4+ messages in thread
From: Ilya Enkovich @ 2016-05-19 19:36 UTC (permalink / raw)
  To: gcc-patches

Hi,

This series is an extension of previous work on loop epilogue combining [1].

It introduces three ways to handle the epilogue of a vectorized loop: combine
it with the vectorized loop, vectorize it with masks, or vectorize it using a
smaller vector size.

It also supports vectorization of loops with a low trip count.

Epilogue combining is used as the basic masking transformation.  Epilogue
masking and low trip count loop vectorization are treated as epilogue
combining with a zero trip count vector loop.
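
To make the difference between a scalar epilogue and a combined (masked)
epilogue concrete, here is a minimal scalar simulation.  It is only a sketch:
VF, the kernel and both function names are made up for illustration, and real
vector code would use masked vector instructions rather than a per-lane "if".

```c
#include <assert.h>
#include <stddef.h>

#define VF 4  /* assumed vectorization factor, for illustration only */

/* Default scheme: "vector" loop followed by a scalar epilogue.  */
void
scale_with_scalar_epilogue (int *dst, const int *src, size_t n)
{
  size_t i = 0;
  assert (dst != NULL && src != NULL);
  for (; i + VF <= n; i += VF)          /* vector body: VF lanes at once */
    for (size_t lane = 0; lane < VF; lane++)
      dst[i + lane] = 2 * src[i + lane];
  for (; i < n; i++)                    /* scalar epilogue: < VF iterations */
    dst[i] = 2 * src[i];
}

/* Combined scheme: the epilogue is folded into the vector loop, so every
   iteration is masked and the last one has some lanes switched off.  */
void
scale_with_combined_epilogue (int *dst, const int *src, size_t n)
{
  assert (dst != NULL && src != NULL);
  for (size_t i = 0; i < n; i += VF)
    for (size_t lane = 0; lane < VF; lane++)
      {
        int mask = (i + lane < n);      /* lane is active while index < n */
        if (mask)
          dst[i + lane] = 2 * src[i + lane];
      }
}
```

In this picture, a masked epilogue (or a masked low trip count loop) is the
combined form in which the unmasked vector loop runs zero times.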

Epilogue vectorization is controlled via a new option,
-ftree-vectorize-epilogues=, which takes a comma-separated list of enabled
modes: combine, mask, nomask.  There is a separate option,
-ftree-vectorize-short-loops, for low trip count loops.
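
As an illustration of how such a comma-separated mode list could be turned
into a set of flags, here is a small standalone sketch.  It is not the code
from opts.c: the enum values and the function name parse_epilogue_modes are
hypothetical, and the real enum vect_epilogue_mode may look different.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical flag values, for illustration only.  */
enum epilogue_mode
{
  EPILOGUE_NONE = 0,
  EPILOGUE_COMBINE = 1 << 0,
  EPILOGUE_MASK = 1 << 1,
  EPILOGUE_NOMASK = 1 << 2
};

/* Parse a list such as "combine,mask" into a bit set of modes.
   Returns -1 if any list element is not a known mode name.  */
int
parse_epilogue_modes (const char *arg)
{
  int modes = EPILOGUE_NONE;
  assert (arg != NULL);
  while (*arg)
    {
      size_t len = strcspn (arg, ",");  /* length of the current element */
      if (len == 7 && strncmp (arg, "combine", len) == 0)
        modes |= EPILOGUE_COMBINE;
      else if (len == 4 && strncmp (arg, "mask", len) == 0)
        modes |= EPILOGUE_MASK;
      else if (len == 6 && strncmp (arg, "nomask", len) == 0)
        modes |= EPILOGUE_NOMASK;
      else
        return -1;
      arg += len;
      if (*arg == ',')                  /* skip the separator */
        arg++;
    }
  return modes;
}
```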

To support epilogue vectorization I use a queue of loops to be vectorized in
vectorize_loops and change vect_transform_loop to return the generated epilogue
(in case we want to try to vectorize it).  If an epilogue is returned, it is
queued for processing.  This variant of epilogue processing was chosen because
it is simple and works for all epilogue processing options.
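
The queue scheme can be sketched with toy types as follows.  This is an
assumption-laden model, not the patch itself: struct toy_loop,
toy_transform_loop and toy_vectorize_loops are invented names, and halving the
vector size for each epilogue is just one possible policy chosen to keep the
example small.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_QUEUE 16

/* Toy stand-in for a loop: a trip count and the vector factor to try.  */
struct toy_loop
{
  unsigned niters;
  unsigned vf;
};

/* Toy stand-in for vect_transform_loop.  "Vectorizes" LOOP; if iterations
   remain and a smaller vector size exists, stores the epilogue in *EPILOGUE
   and returns 1, otherwise returns 0.  */
static int
toy_transform_loop (const struct toy_loop *loop, struct toy_loop *epilogue)
{
  assert (loop->vf > 0);
  unsigned rest = loop->niters % loop->vf;
  if (rest == 0 || loop->vf <= 2)
    return 0;
  epilogue->niters = rest;
  epilogue->vf = loop->vf / 2;
  return 1;
}

/* Process loops from a FIFO queue; every epilogue returned by the transform
   is appended to the same queue.  Returns the number of loops processed.  */
unsigned
toy_vectorize_loops (struct toy_loop first)
{
  struct toy_loop queue[MAX_QUEUE];
  size_t head = 0, tail = 0;
  unsigned processed = 0;

  queue[tail++] = first;
  while (head < tail)
    {
      struct toy_loop epilogue;
      struct toy_loop cur = queue[head++];
      processed++;
      if (toy_transform_loop (&cur, &epilogue) && tail < MAX_QUEUE)
        queue[tail++] = epilogue;
    }
  return processed;
}
```

With 23 iterations and a starting vector factor of 8, this toy model processes
the main loop, a 7-iteration epilogue at factor 4, and a 3-iteration epilogue
at factor 2.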

There are currently some limitations implied by this scheme:
 - The copied loop lacks some required optimization info (e.g. scev info),
which may result in an epilogue which cannot be vectorized
 - The loop epilogue may require if-conversion
 - Alias/alignment checks are not inherited and therefore would be performed
one more time for the epilogue.  For now, epilogue vectorization is simply
disabled when alias versioning is required, and alignment enhancement is
disabled for epilogues.

A set of new fields is added to _loop_vec_info to support epilogue
vectorization.

LOOP_VINFO_CAN_BE_MASKED - true if the vectorized loop can be masked.  It is
computed during vectorization analysis (in various vectorizable_* functions).

LOOP_VINFO_REQUIRED_MASKS - for a loop which can be masked, holds all masks
required to mask the loop.

LOOP_VINFO_COMBINE_EPILOGUE - true if we decided the vectorized loop should be
masked.

LOOP_VINFO_MASK_EPILOGUE - true if we decided the epilogue of this loop
should be vectorized and masked.

LOOP_VINFO_NEED_MASKING - true if the vectorized loop has to be masked (set for
epilogues we want to mask and for low trip count loops).

LOOP_VINFO_ORIG_LOOP_INFO - for epilogues, holds the loop_vec_info of the
original vectorized loop.
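
A toy mirror of these fields may help to see how they relate.  Everything
below is a guess made for illustration: the field types, the reduction of
required_masks to a simple count, the helper names, and the assumption that a
loop counts as an epilogue exactly when its orig_loop_info is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy mirror of the new _loop_vec_info fields; types are assumptions.  */
struct toy_loop_vinfo
{
  bool can_be_masked;
  unsigned required_masks;                /* simplified to a count */
  bool combine_epilogue;
  bool mask_epilogue;
  bool need_masking;
  struct toy_loop_vinfo *orig_loop_info;  /* NULL unless this is an epilogue */
};

/* Assumed test for "this loop is an epilogue": it remembers the loop it
   was created for.  */
static bool
toy_epilogue_p (const struct toy_loop_vinfo *vinfo)
{
  return vinfo->orig_loop_info != NULL;
}

/* Build the info of an epilogue created for ORIG.  If ORIG decided that its
   epilogue should be masked, the epilogue has to be masked.  */
struct toy_loop_vinfo
toy_make_epilogue_vinfo (struct toy_loop_vinfo *orig)
{
  struct toy_loop_vinfo epi = { 0 };
  assert (orig != NULL);
  epi.orig_loop_info = orig;
  epi.need_masking = orig->mask_epilogue;
  return epi;
}
```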

To decide whether to mask or combine a loop epilogue, the cost model is
extended with masking costs.  This includes vect_masking_prologue and
vect_masking_body elements added to the vect_cost_model_location enum, and
finish_cost extended with two corresponding additional return values.  In
addition to add_stmt_cost, I add add_stmt_masking_cost to compute the cost of
masking a statement.

vect_estimate_min_profitable_iters checks whether epilogue masking is
profitable and also computes the number of iterations required for epilogue
combining to be profitable (this number may be used as a threshold in the
vectorized loop guard).
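
The kind of comparison involved can be sketched as below.  This is a
deliberately simplified model, not the logic of
vect_estimate_min_profitable_iters: the struct, its field names and the
formulas are assumptions that only loosely follow the cost kinds named above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-loop cost summary, for illustration only.  */
struct toy_costs
{
  unsigned scalar_iter;       /* cost of one scalar iteration */
  unsigned vec_body;          /* cost of one unmasked vector iteration */
  unsigned masking_prologue;  /* one-time cost of setting up masks */
  unsigned masking_body;      /* extra per-iteration cost of masking */
};

/* Cost of running EPILOGUE_ITERS remaining iterations as scalar code.  */
unsigned
toy_scalar_epilogue_cost (const struct toy_costs *c, unsigned epilogue_iters)
{
  return c->scalar_iter * epilogue_iters;
}

/* Cost of running the same iterations as a single masked vector iteration.  */
unsigned
toy_masked_epilogue_cost (const struct toy_costs *c)
{
  return c->masking_prologue + c->vec_body + c->masking_body;
}

/* Nonzero if masking the epilogue is estimated to be cheaper than leaving
   it scalar.  */
int
toy_mask_epilogue_profitable_p (const struct toy_costs *c,
                                unsigned epilogue_iters)
{
  assert (c != NULL);
  return toy_masked_epilogue_cost (c)
         < toy_scalar_epilogue_cost (c, epilogue_iters);
}
```

Under this model, a masked epilogue pays off only when enough scalar
iterations are replaced to cover the mask setup and per-statement masking
overhead.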

These patches do not enable any of the new features by default at any
optimization level.  Masking features are expected to be used mostly on
AVX-512 targets, and the lack of hardware suitable for wide performance testing
is the reason the cost model is not tuned and the optimizations are not enabled
by default.  With small tests using a small number of loop iterations and
'heavy' epilogues (e.g. the number of iterations is VF*2-1) I see the expected
~2x gain on existing KNL hardware.  Later this year we expect to get access to
KNL machines and have an opportunity to tune the masking cost model.

On Haswell hardware I don't see performance gains on similar loops, which means
masked code is no better than scalar code when masks are used heavily.
Masking still might be useful when the number of statements requiring masking
is relatively small (I used the test a[i] += b[i], which needs masking for 3
out of 4 vector statements).  We will continue to search for cases where
masking is profitable on Haswell to tune masking costs appropriately.
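
For reference, the a[i] += b[i] case breaks down into two loads, one add and
one store per vector iteration, of which only the add needs no mask.  The
following scalar simulation shows that breakdown; VF, the vec_t/mask_t types
and all helper names are invented for this sketch and do not correspond to
real intrinsics or to code in the patches.

```c
#include <assert.h>
#include <stddef.h>

#define VF 8  /* assumed number of int lanes per vector; illustrative only */

typedef struct { int lane[VF]; } vec_t;
typedef struct { unsigned char lane[VF]; } mask_t;

/* Masked load: inactive lanes read nothing and yield zero.  */
static vec_t
masked_load (const int *p, mask_t m)
{
  vec_t v;
  for (size_t i = 0; i < VF; i++)
    v.lane[i] = m.lane[i] ? p[i] : 0;
  return v;
}

/* Masked store: inactive lanes leave memory untouched.  */
static void
masked_store (int *p, vec_t v, mask_t m)
{
  for (size_t i = 0; i < VF; i++)
    if (m.lane[i])
      p[i] = v.lane[i];
}

/* Plain vector add: results in inactive lanes are never stored.  */
static vec_t
vec_add (vec_t x, vec_t y)
{
  vec_t r;
  for (size_t i = 0; i < VF; i++)
    r.lane[i] = x.lane[i] + y.lane[i];
  return r;
}

/* a[i] += b[i] for i in [0, n), with every vector iteration masked.  */
void
masked_add_kernel (int *a, const int *b, size_t n)
{
  assert (a != NULL && b != NULL);
  for (size_t i = 0; i < n; i += VF)
    {
      mask_t m;
      for (size_t l = 0; l < VF; l++)
        m.lane[l] = (i + l < n);
      vec_t va = masked_load (a + i, m);   /* statement 1: masked */
      vec_t vb = masked_load (b + i, m);   /* statement 2: masked */
      vec_t vr = vec_add (va, vb);         /* statement 3: not masked */
      masked_store (a + i, vr, m);         /* statement 4: masked */
    }
}
```

Calling masked_add_kernel with n equal to 2 * VF - 1 reproduces the 'heavy'
epilogue shape mentioned above: one full vector iteration followed by one in
which all lanes but the last are active.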

Below are the ChangeLogs for the whole series.

[1] https://gcc.gnu.org/ml/gcc-patches/2015-10/msg03014.html

Thanks,
Ilya
--
gcc/

2016-05-19  Ilya Enkovich  <ilya.enkovich@intel.com>

	* common.opt (flag_tree_vectorize_epilogues): New.
	(ftree-vectorize-short-loops): New.
	(ftree-vectorize-epilogues=): New.
	(fno-tree-vectorize-epilogues): New.
	(fvect-epilogue-cost-model=): New.
	* flag-types.h (enum vect_epilogue_mode): New.
	* opts.c (parse_vectorizer_options): New.
	(common_handle_option): Support -ftree-vectorize-epilogues=
	and -fno-tree-vectorize-epilogues options.


gcc/

2016-05-19  Ilya Enkovich  <ilya.enkovich@intel.com>

	* tree-vectorizer.h (struct _loop_vec_info): Add new fields
	can_be_masked, required_masks, mask_epilogue, combine_epilogue,
	need_masking, orig_loop_info.
	(LOOP_VINFO_CAN_BE_MASKED): New.
	(LOOP_VINFO_REQUIRED_MASKS): New.
	(LOOP_VINFO_COMBINE_EPILOGUE): New.
	(LOOP_VINFO_MASK_EPILOGUE): New.
	(LOOP_VINFO_NEED_MASKING): New.
	(LOOP_VINFO_ORIG_LOOP_INFO): New.
	(LOOP_VINFO_EPILOGUE_P): New.
	(LOOP_VINFO_ORIG_MASK_EPILOGUE): New.
	(LOOP_VINFO_ORIG_VECT_FACTOR): New.
	* tree-vect-loop.c (new_loop_vec_info): Initialize new
	_loop_vec_info fields.


gcc/

2016-05-19  Ilya Enkovich  <ilya.enkovich@intel.com>

	* tree-if-conv.c (tree_if_conversion): Make public.
	* tree-if-conv.h: New file.
	* tree-vect-data-refs.c (vect_enhance_data_refs_alignment): Don't
	try to enhance alignment for epilogues.
	* tree-vect-loop-manip.c (vect_do_peeling_for_loop_bound): Return
	created loop.
	* tree-vect-loop.c: Include tree-if-conv.h.
	(destroy_loop_vec_info): Preserve LOOP_VINFO_ORIG_LOOP_INFO in
	loop->aux.
	(vect_analyze_loop_form): Init LOOP_VINFO_ORIG_LOOP_INFO and reset
	loop->aux.
	(vect_analyze_loop): Reset loop->aux.
	(vect_transform_loop): Check if created epilogue should be returned
	for further vectorization.  If-convert epilogue if required.
	* tree-vectorizer.c (vectorize_loops): Add a queue of loops to
	process and insert vectorized loop epilogues into this queue.
	* tree-vectorizer.h (vect_do_peeling_for_loop_bound): Return created
	loop.
	(vect_transform_loop): Return created loop.


gcc/

2016-05-19  Ilya Enkovich  <ilya.enkovich@intel.com>

	* config/i386/i386.c (ix86_init_cost): Extend costs array.
	(ix86_add_stmt_masking_cost): New.
	(ix86_finish_cost): Add masking_prologue_cost and masking_body_cost
	args.
	(TARGET_VECTORIZE_ADD_STMT_MASKING_COST): New.
	* config/i386/i386.h (TARGET_INCREASE_MASK_STORE_COST): New.
	* config/i386/x86-tune.def (X86_TUNE_INCREASE_MASK_STORE_COST): New.
	* config/rs6000/rs6000.c (_rs6000_cost_data): Extend cost array.
	(rs6000_init_cost): Initialize new cost elements.
	(rs6000_finish_cost): Add masking_prologue_cost and masking_body_cost.
	* config/spu/spu.c (spu_init_cost): Extend costs array.
	(spu_finish_cost): Add masking_prologue_cost and masking_body_cost args.
	* doc/tm.texi.in (TARGET_VECTORIZE_ADD_STMT_MASKING_COST): New.
	* doc/tm.texi: Regenerated.
	* target.def (add_stmt_masking_cost): New.
	(finish_cost): Add masking_prologue_cost and masking_body_cost args.
	* target.h (enum vect_cost_for_stmt): Add vector_mask_load and
	vector_mask_store.
	(enum vect_cost_model_location): Add vect_masking_prologue
	and vect_masking_body.
	* targhooks.c (default_builtin_vectorization_cost): Support
	vector_mask_load and vector_mask_store.
	(default_init_cost): Extend costs array.
	(default_add_stmt_masking_cost): New.
	(default_finish_cost): Add masking_prologue_cost and masking_body_cost
	args.
	* targhooks.h (default_add_stmt_masking_cost): New.
	* tree-vect-loop.c (vect_estimate_min_profitable_iters): Adjust
	finish_cost call.
	* tree-vect-slp.c (vect_bb_vectorization_profitable_p): Likewise.
	* tree-vectorizer.h (add_stmt_masking_cost): New.
	(finish_cost): Add masking_prologue_cost and masking_body_cost args.


gcc/

2016-05-19  Ilya Enkovich  <ilya.enkovich@intel.com>

	* tree-vect-loop.c: Include insn-config.h and recog.h.
	(vect_check_required_masks_widening): New.
	(vect_check_required_masks_narrowing): New.
	(vect_get_masking_iv_elems): New.
	(vect_get_masking_iv_type): New.
	(vect_get_extreme_masks): New.
	(vect_check_required_masks): New.
	(vect_analyze_loop_operations): Add vect_check_required_masks
	call to compute LOOP_VINFO_CAN_BE_MASKED.
	(vect_analyze_loop_2): Initialize LOOP_VINFO_CAN_BE_MASKED and
	LOOP_VINFO_NEED_MASKING before starting over.
	(vectorizable_reduction): Compute LOOP_VINFO_CAN_BE_MASKED and
	masking cost.
	* tree-vect-stmts.c (can_mask_load_store): New.
	(vect_model_load_masking_cost): New.
	(vect_model_store_masking_cost): New.
	(vect_model_simple_masking_cost): New.
	(vectorizable_mask_load_store): Compute LOOP_VINFO_CAN_BE_MASKED
	and masking cost.
	(vectorizable_simd_clone_call): Likewise.
	(vectorizable_store): Likewise.
	(vectorizable_load): Likewise.
	(vect_stmt_should_be_masked_for_epilogue): New.
	(vect_add_required_mask_for_stmt): New.
	(vect_analyze_stmt): Compute LOOP_VINFO_CAN_BE_MASKED.
	* tree-vectorizer.h (vect_model_load_masking_cost): New.
	(vect_model_store_masking_cost): New.
	(vect_model_simple_masking_cost): New.


gcc/

2016-05-19  Ilya Enkovich  <ilya.enkovich@intel.com>

	* tree-vect-stmts.c (vectorizable_mask_load_store): Mark
	the first copy of generated vector stores.
	(vectorizable_store): Mark the first copy of generated
	vector stores and provide it with vectype and the original
	data reference.
	* tree-vectorizer.h (struct _stmt_vec_info): Add first_copy_p
	field.
	(STMT_VINFO_FIRST_COPY_P): New.


gcc/

2016-05-19  Ilya Enkovich  <ilya.enkovich@intel.com>

	* dbgcnt.def (vect_tail_combine): New.
	* params.def (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD): New.
	* tree-vect-data-refs.c (vect_get_new_ssa_name): Support vect_mask_var.
	* tree-vect-loop-manip.c (slpeel_tree_peel_loop_to_edge): Support
	epilogue combined with loop body.
	(vect_do_peeling_for_loop_bound): Likewise.
	(vect_do_peeling_for_alignment): ???
	* tree-vect-loop.c: Include alias.h and dbgcnt.h.
	(vect_estimate_min_profitable_iters): Add ret_min_profitable_combine_niters
	arg, compute number of iterations for which loop epilogue combining is
	profitable.
	(vect_generate_tmps_on_preheader): Support combined epilogue.
	(vect_gen_ivs_for_masking): New.
	(vect_get_mask_index_for_elems): New.
	(vect_get_mask_index_for_type): New.
	(vect_gen_loop_masks): New.
	(vect_mask_reduction_stmt): New.
	(vect_mask_mask_load_store_stmt): New.
	(vect_mask_load_store_stmt): New.
	(vect_combine_loop_epilogue): New.
	(vect_transform_loop): Support combined epilogue.


gcc/

2016-05-19  Ilya Enkovich  <ilya.enkovich@intel.com>

	* dbgcnt.def (vect_tail_mask): New.
	* tree-vect-loop.c (vect_analyze_loop_2): Support masked loop
	epilogues and low trip count loops.
	(vect_get_known_peeling_cost): Ignore scalar epilogue cost for
	loops we are going to mask.
	(vect_estimate_min_profitable_iters): Support masked loop
	epilogues and low trip count loops.
	* tree-vectorizer.c (vectorize_loops): Add a message for a case
	when loop epilogue can't be vectorized.


gcc/

2016-05-19  Ilya Enkovich  <ilya.enkovich@intel.com>

	* tree-vect-loop.c (vect_transform_loop): Print more info
	about vectorized loop and specify used vector size.


Thread overview: 4+ messages
2016-05-19 19:36 [RFC][PATCH, vec-tails 00/10] Support vectorization of loop epilogues Ilya Enkovich
2016-06-15 12:06 ` Richard Biener
2016-06-16  4:14   ` Jeff Law
2016-06-16  6:28 ` Jeff Law