From: Ilya Enkovich
To: gcc-patches@gcc.gnu.org
Date: Thu, 19 May 2016 19:36:00 -0000
Subject: [RFC][PATCH, vec-tails 00/10] Support vectorization of loop epilogues
Message-ID: <20160519193515.GA40563@msticlxl57.ims.intel.com>

Hi,

This series is an extension of previous work on loop epilogue
combining [1].  It introduces three ways to handle a vectorized loop
epilogue: combine it with the vectorized loop, vectorize it with masks,
or vectorize it using a smaller vector size.  It also supports
vectorization of loops with a low trip count.

Epilogue combining is used as the basic masking transformation.
Epilogue masking and low trip count loop vectorization are treated as
epilogue combining with a zero trip count vector loop.

Epilogue vectorization is controlled via the new option
-ftree-vectorize-epilogues=, which takes a comma-separated list of
enabled modes; these include combine, mask and nomask.  There is a
separate option, -ftree-vectorize-short-loops, for low trip count
loops.

To support epilogue vectorization I use a queue of loops to be
vectorized in vectorize_loops and change vect_transform_loop to return
the generated epilogue (in case we want to try to vectorize it).  If an
epilogue is returned then it is queued for processing.  This variant
of epilogue processing was chosen because it is simple and works for
all epilogue processing options.  There are currently some limitations
implied by this scheme:

 - Copied loop misses some required optimization info (e.g.
   scev info) which may result in an epilogue which cannot be
   vectorized
 - Loop epilogue may require if-conversion
 - Alias/alignment checks are not inherited and therefore will be
   performed one more time for the epilogue.  For now epilogue
   vectorization is simply disabled when alias versioning is
   required, and alignment enhancement is disabled for epilogues.

There is a set of new fields added to _loop_vec_info to support
epilogue vectorization.

LOOP_VINFO_CAN_BE_MASKED - true if the vectorized loop can be masked.
  It is computed during vectorization analysis (in various
  vectorizable_* functions).
LOOP_VINFO_REQUIRED_MASKS - for a loop which can be masked, it holds
  all masks required to mask the loop.
LOOP_VINFO_COMBINE_EPILOGUE - true if we decided the vectorized loop
  should be masked.
LOOP_VINFO_MASK_EPILOGUE - true if we decided the epilogue of this
  loop should be vectorized and masked.
LOOP_VINFO_NEED_MASKING - true if the vectorized loop has to be masked
  (set for epilogues we want to mask and for low trip count loops).
LOOP_VINFO_ORIG_LOOP_INFO - for epilogues, this holds the loop_vec_info
  of the original vectorized loop.

To decide whether we want to mask or combine a loop epilogue, the cost
model is extended with masking costs.  This includes the
vect_masking_prologue and vect_masking_body elements added to the
vect_cost_model_location enum, and finish_cost extended with two
corresponding additional return values.  In addition to add_stmt_cost
I add add_stmt_masking_cost to compute the cost of masking a
statement.  vect_estimate_min_profitable_iters checks whether epilogue
masking is profitable and also computes the number of iterations
required for epilogue combining to be profitable (this number may be
used as a threshold in the vectorized loop guard).

These patches do not enable any of the new features by default at any
optimization level.
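
As a rough illustration of what "combining" an epilogue with the
vector loop means, here is a small scalar C emulation.  It is only a
sketch under stated assumptions: VF, MAX_N and all function names are
made up for the example, the "vector" operations are emulated lane by
lane, and nothing here is taken from the patches themselves.  The
first variant keeps a scalar epilogue after the full vector
iterations; the second runs every iteration under a mask in which
lane l is active only while i + l < n, so no separate epilogue is
needed.

```c
#include <assert.h>
#include <stddef.h>

#define VF 4      /* assumed vectorization factor, for illustration only */
#define MAX_N 64  /* assumed buffer size for the self-check below */

/* Reference: the plain scalar loop.  */
static void add_scalar(int *a, const int *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] += b[i];
}

/* Full "vector" iterations followed by a scalar epilogue.  */
static void add_with_scalar_epilogue(int *a, const int *b, size_t n)
{
    size_t i = 0;
    for (; i + VF <= n; i += VF)
        for (size_t lane = 0; lane < VF; lane++)  /* one emulated vector op */
            a[i + lane] += b[i + lane];
    for (; i < n; i++)                            /* scalar epilogue */
        a[i] += b[i];
}

/* Epilogue combined with the loop body: every iteration is masked.  */
static void add_combined(int *a, const int *b, size_t n)
{
    for (size_t i = 0; i < n; i += VF) {
        int mask[VF];
        for (size_t lane = 0; lane < VF; lane++)
            mask[lane] = (i + lane < n);          /* lane active? */
        for (size_t lane = 0; lane < VF; lane++)
            if (mask[lane])                       /* emulated masked load/add/store */
                a[i + lane] += b[i + lane];
    }
}

/* Returns 1 if all three variants produce identical arrays for n
   iterations, including the elements past n that must stay untouched.  */
static int modes_agree(size_t n)
{
    int ref[MAX_N], epi[MAX_N], comb[MAX_N], b[MAX_N];
    assert(n <= MAX_N);
    for (size_t i = 0; i < MAX_N; i++) {
        ref[i] = epi[i] = comb[i] = (int)i;
        b[i] = (int)(2 * i + 1);
    }
    add_scalar(ref, b, n);
    add_with_scalar_epilogue(epi, b, n);
    add_combined(comb, b, n);
    for (size_t i = 0; i < MAX_N; i++)
        if (ref[i] != epi[i] || ref[i] != comb[i])
            return 0;
    return 1;
}
```

With n = VF*2-1 = 7 the masked variant runs two iterations, the second
with one inactive lane, while the other variant runs one vector
iteration plus three scalar ones.  Whether the masked form is actually
faster is exactly what the masking cost model has to decide.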
Masking features are expected to be used mostly on AVX-512 targets,
and the lack of hardware suitable for wide performance testing is the
reason the cost model is not tuned and the optimizations are not
enabled by default.  With small tests using a small number of loop
iterations and 'heavy' epilogues (e.g. number of iterations is VF*2-1)
I see the expected ~2x gain on existing KNL hardware.  Later this year
we expect to get access to KNL machines and have an opportunity to
tune the masking cost model.

On Haswell hardware I don't see performance gains on similar loops,
which means masked code is not better than scalar code when masks are
used heavily.  Masking might still be useful when the number of
statements requiring masking is relatively small (I used the test
a[i] += b[i], which needs masking for 3 out of 4 vector statements).
We will continue to search for cases where masking is profitable on
Haswell in order to tune masking costs appropriately.

Below are ChangeLogs for the whole series.

[1] https://gcc.gnu.org/ml/gcc-patches/2015-10/msg03014.html

Thanks,
Ilya
--
gcc/

2016-05-19  Ilya Enkovich

	* common.opt (flag_tree_vectorize_epilogues): New.
	(ftree-vectorize-short-loops): New.
	(ftree-vectorize-epilogues=): New.
	(fno-tree-vectorize-epilogues): New.
	(fvect-epilogue-cost-model=): New.
	* flag-types.h (enum vect_epilogue_mode): New.
	* opts.c (parse_vectorizer_options): New.
	(common_handle_option): Support -ftree-vectorize-epilogues=
	and -fno-tree-vectorize-epilogues options.

gcc/

2016-05-19  Ilya Enkovich

	* tree-vectorizer.h (struct _loop_vec_info): Add new fields
	can_be_masked, required_masks, mask_epilogue, combine_epilogue,
	need_masking, orig_loop_info.
	(LOOP_VINFO_CAN_BE_MASKED): New.
	(LOOP_VINFO_REQUIRED_MASKS): New.
	(LOOP_VINFO_COMBINE_EPILOGUE): New.
	(LOOP_VINFO_MASK_EPILOGUE): New.
	(LOOP_VINFO_NEED_MASKING): New.
	(LOOP_VINFO_ORIG_LOOP_INFO): New.
	(LOOP_VINFO_EPILOGUE_P): New.
	(LOOP_VINFO_ORIG_MASK_EPILOGUE): New.
	(LOOP_VINFO_ORIG_VECT_FACTOR): New.
	* tree-vect-loop.c (new_loop_vec_info): Initialize new
	_loop_vec_info fields.

gcc/

2016-05-19  Ilya Enkovich

	* tree-if-conv.c (tree_if_conversion): Make public.
	* tree-if-conv.h: New file.
	* tree-vect-data-refs.c (vect_enhance_data_refs_alignment): Don't
	try to enhance alignment for epilogues.
	* tree-vect-loop-manip.c (vect_do_peeling_for_loop_bound): Return
	created loop.
	* tree-vect-loop.c: Include tree-if-conv.h.
	(destroy_loop_vec_info): Preserve LOOP_VINFO_ORIG_LOOP_INFO in
	loop->aux.
	(vect_analyze_loop_form): Init LOOP_VINFO_ORIG_LOOP_INFO and reset
	loop->aux.
	(vect_analyze_loop): Reset loop->aux.
	(vect_transform_loop): Check if created epilogue should be
	returned for further vectorization.  If-convert epilogue if
	required.
	* tree-vectorizer.c (vectorize_loops): Add a queue of loops to
	process and insert vectorized loop epilogues into this queue.
	* tree-vectorizer.h (vect_do_peeling_for_loop_bound): Return
	created loop.
	(vect_transform_loop): Return created loop.

gcc/

2016-05-19  Ilya Enkovich

	* config/i386/i386.c (ix86_init_cost): Extend costs array.
	(ix86_add_stmt_masking_cost): New.
	(ix86_finish_cost): Add masking_prologue_cost and
	masking_body_cost args.
	(TARGET_VECTORIZE_ADD_STMT_MASKING_COST): New.
	* config/i386/i386.h (TARGET_INCREASE_MASK_STORE_COST): New.
	* config/i386/x86-tune.def (X86_TUNE_INCREASE_MASK_STORE_COST):
	New.
	* config/rs6000/rs6000.c (_rs6000_cost_data): Extend cost array.
	(rs6000_init_cost): Initialize new cost elements.
	(rs6000_finish_cost): Add masking_prologue_cost and
	masking_body_cost.
	* config/spu/spu.c (spu_init_cost): Extend costs array.
	(spu_finish_cost): Add masking_prologue_cost and
	masking_body_cost args.
	* doc/tm.texi.in (TARGET_VECTORIZE_ADD_STMT_MASKING_COST): New.
	* doc/tm.texi: Regenerated.
	* target.def (add_stmt_masking_cost): New.
	(finish_cost): Add masking_prologue_cost and masking_body_cost
	args.
	* target.h (enum vect_cost_for_stmt): Add vector_mask_load and
	vector_mask_store.
	(enum vect_cost_model_location): Add vect_masking_prologue
	and vect_masking_body.
	* targhooks.c (default_builtin_vectorization_cost): Support
	vector_mask_load and vector_mask_store.
	(default_init_cost): Extend costs array.
	(default_add_stmt_masking_cost): New.
	(default_finish_cost): Add masking_prologue_cost and
	masking_body_cost args.
	* targhooks.h (default_add_stmt_masking_cost): New.
	* tree-vect-loop.c (vect_estimate_min_profitable_iters): Adjust
	finish_cost call.
	* tree-vect-slp.c (vect_bb_vectorization_profitable_p): Likewise.
	* tree-vectorizer.h (add_stmt_masking_cost): New.
	(finish_cost): Add masking_prologue_cost and masking_body_cost
	args.

gcc/

2016-05-19  Ilya Enkovich

	* tree-vect-loop.c: Include insn-config.h and recog.h.
	(vect_check_required_masks_widening): New.
	(vect_check_required_masks_narrowing): New.
	(vect_get_masking_iv_elems): New.
	(vect_get_masking_iv_type): New.
	(vect_get_extreme_masks): New.
	(vect_check_required_masks): New.
	(vect_analyze_loop_operations): Add vect_check_required_masks
	call to compute LOOP_VINFO_CAN_BE_MASKED.
	(vect_analyze_loop_2): Initialize LOOP_VINFO_CAN_BE_MASKED and
	LOOP_VINFO_NEED_MASKING before starting over.
	(vectorizable_reduction): Compute LOOP_VINFO_CAN_BE_MASKED and
	masking cost.
	* tree-vect-stmts.c (can_mask_load_store): New.
	(vect_model_load_masking_cost): New.
	(vect_model_store_masking_cost): New.
	(vect_model_simple_masking_cost): New.
	(vectorizable_mask_load_store): Compute
	LOOP_VINFO_CAN_BE_MASKED and masking cost.
	(vectorizable_simd_clone_call): Likewise.
	(vectorizable_store): Likewise.
	(vectorizable_load): Likewise.
	(vect_stmt_should_be_masked_for_epilogue): New.
	(vect_add_required_mask_for_stmt): New.
	(vect_analyze_stmt): Compute LOOP_VINFO_CAN_BE_MASKED.
	* tree-vectorizer.h (vect_model_load_masking_cost): New.
	(vect_model_store_masking_cost): New.
	(vect_model_simple_masking_cost): New.

gcc/

2016-05-19  Ilya Enkovich

	* tree-vect-stmts.c (vectorizable_mask_load_store): Mark the
	first copy of generated vector stores.
	(vectorizable_store): Mark the first copy of generated vector
	stores and provide it with vectype and the original data
	reference.
	* tree-vectorizer.h (struct _stmt_vec_info): Add first_copy_p
	field.
	(STMT_VINFO_FIRST_COPY_P): New.

gcc/

2016-05-19  Ilya Enkovich

	* dbgcnt.def (vect_tail_combine): New.
	* params.def (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD): New.
	* tree-vect-data-refs.c (vect_get_new_ssa_name): Support
	vect_mask_var.
	* tree-vect-loop-manip.c (slpeel_tree_peel_loop_to_edge): Support
	epilogue combined with loop body.
	(vect_do_peeling_for_loop_bound): Likewise.
	(vect_do_peeling_for_alignment): ???
	* tree-vect-loop.c: Include alias.h and dbgcnt.h.
	(vect_estimate_min_profitable_iters): Add
	ret_min_profitable_combine_niters arg, compute number of
	iterations for which loop epilogue combining is profitable.
	(vect_generate_tmps_on_preheader): Support combined epilogue.
	(vect_gen_ivs_for_masking): New.
	(vect_get_mask_index_for_elems): New.
	(vect_get_mask_index_for_type): New.
	(vect_gen_loop_masks): New.
	(vect_mask_reduction_stmt): New.
	(vect_mask_mask_load_store_stmt): New.
	(vect_mask_load_store_stmt): New.
	(vect_combine_loop_epilogue): New.
	(vect_transform_loop): Support combined epilogue.

gcc/

2016-05-19  Ilya Enkovich

	* dbgcnt.def (vect_tail_mask): New.
	* tree-vect-loop.c (vect_analyze_loop_2): Support masked loop
	epilogues and low trip count loops.
	(vect_get_known_peeling_cost): Ignore scalar epilogue cost for
	loops we are going to mask.
	(vect_estimate_min_profitable_iters): Support masked loop
	epilogues and low trip count loops.
	* tree-vectorizer.c (vectorize_loops): Add a message for a case
	when loop epilogue can't be vectorized.

gcc/

2016-05-19  Ilya Enkovich

	* tree-vect-loop.c (vect_transform_loop): Print more info about
	vectorized loop and specify used vector size.
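
For readers following the vectorize_loops and vect_transform_loop
entries, the queue-based scheme from the cover text can be modelled
with a toy worklist.  This is a hedged sketch, not GCC code: struct
toy_loop, toy_transform, toy_vectorize_loops and QUEUE_CAP are all
hypothetical names, and the rule "the epilogue gets half the vector
size" is an assumption standing in for the smaller-vector-size mode.
The only point it shows is that the transform step hands back an
epilogue, which is then queued and processed like any other loop.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_CAP 16  /* hypothetical fixed capacity for the sketch */

/* Hypothetical loop description: iteration count and vector factor.  */
struct toy_loop {
    size_t niters;
    size_t vf;
};

/* Toy stand-in for a transform step that may return an epilogue.
   Returns 1 and fills *epilogue when remainder iterations exist and a
   smaller vector factor (assumed: vf / 2, at least 2) is available.  */
static int toy_transform(const struct toy_loop *loop, struct toy_loop *epilogue)
{
    size_t rem = loop->niters % loop->vf;
    if (rem == 0 || loop->vf / 2 < 2)
        return 0;
    epilogue->niters = rem;
    epilogue->vf = loop->vf / 2;
    return 1;
}

/* Toy driver: process a queue of loops, pushing each returned epilogue
   back onto the queue.  Returns the number of loops processed.  */
static size_t toy_vectorize_loops(struct toy_loop first)
{
    struct toy_loop queue[QUEUE_CAP];
    size_t head = 0, tail = 0, processed = 0;

    assert(first.vf > 0);
    queue[tail++] = first;
    while (head < tail) {
        struct toy_loop cur = queue[head++];
        struct toy_loop epi;
        processed++;
        if (tail < QUEUE_CAP && toy_transform(&cur, &epi))
            queue[tail++] = epi;  /* epilogue is queued for processing */
    }
    return processed;
}
```

For example, a loop with 23 iterations and a vector factor of 8 yields
an epilogue of 7 iterations at factor 4, which in turn yields one of
3 iterations at factor 2, so three loops pass through the queue in
this toy model.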