From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 14142 invoked by alias); 8 Apr 2007 14:37:55 -0000 Received: (qmail 14129 invoked by uid 22791); 8 Apr 2007 14:37:53 -0000 X-Spam-Check-By: sourceware.org Received: from mtagate7.de.ibm.com (HELO mtagate7.de.ibm.com) (195.212.29.156) by sourceware.org (qpsmtpd/0.31) with ESMTP; Sun, 08 Apr 2007 15:37:49 +0100 Received: from d12nrmr1607.megacenter.de.ibm.com (d12nrmr1607.megacenter.de.ibm.com [9.149.167.49]) by mtagate7.de.ibm.com (8.13.8/8.13.8) with ESMTP id l38Ebkew224788 for ; Sun, 8 Apr 2007 14:37:46 GMT Received: from d12av04.megacenter.de.ibm.com (d12av04.megacenter.de.ibm.com [9.149.165.229]) by d12nrmr1607.megacenter.de.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l38EbkT84165734 for ; Sun, 8 Apr 2007 16:37:46 +0200 Received: from d12av04.megacenter.de.ibm.com (loopback [127.0.0.1]) by d12av04.megacenter.de.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l38Ebj78010475 for ; Sun, 8 Apr 2007 16:37:46 +0200 Received: from d12mc102.megacenter.de.ibm.com (d12mc102.megacenter.de.ibm.com [9.149.167.114]) by d12av04.megacenter.de.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l38EbjUF010472; Sun, 8 Apr 2007 16:37:45 +0200 In-Reply-To: Subject: Re: [patch] Vectorizer cost model implementation To: "Linthicum, Tony" Cc: "GCC Patches" , "Jagasia, Harsha" X-Mailer: Lotus Notes Release 7.0 HF277 June 21, 2006 Message-ID: From: Dorit Nuzman Date: Sun, 08 Apr 2007 14:37:00 -0000 MIME-Version: 1.0 Content-type: text/plain; charset=US-ASCII X-IsSubscribed: yes Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org X-SW-Source: 2007-04/txt/msg00362.txt.bz2 > Hello All, > > This patch contains the initial implementation for a cost model for the > vectorizer. Thanks Tony for this work, and Harsha for taking it on! As I told Tony off-line, I think this is an excellent start. With this, people can start experimenting and tuning the cost model on different targets, and gradually develop the required target-specific APIs, and also see how to generalize this for the use of other loop transformations. Here are some comments to the patch: > Index: gcc/testsuite/gcc.dg/vect/vect.exp > =================================================================== > --- gcc/testsuite/gcc.dg/vect/vect.exp (revision 123477) > +++ gcc/testsuite/gcc.dg/vect/vect.exp (working copy) > @@ -23,7 +23,7 @@ > set DEFAULT_VECTCFLAGS "" > > # These flags are used for all targets. > -lappend DEFAULT_VECTCFLAGS "-O2" "-ftree-vectorize" > +lappend DEFAULT_VECTCFLAGS "-O2" "-ftree-vectorize" "-fno-vect-cost-model" This is discussed separately (consider not to disable the cost-model by default, and instead xfail the tests that fail to vectorize with cost model enabled (if there's not too many of them), duplicating them etc...). > @@ -2216,6 +2219,7 @@ > only over initial loops skipping newly generated ones. */ > FOR_EACH_LOOP (li, loop, 0) > { > + int required_iters; What do you think about renaming this variable to something like "min_profitable_iters", or "estimated_min_iters" or anyhow something that would indicate that it's about estimated profitability and not something that's really inherently required? > @@ -2225,7 +2229,25 @@ > if (!loop_vinfo || !LOOP_VINFO_VECTORIZABLE_P (loop_vinfo)) > continue; > > + required_iters = vect_model_required_iters (loop_vinfo); > + if (required_iters < 0) > + { > + if (vect_print_dump_info (REPORT_DETAILS)) > + fprintf (vect_dump, "Vector body less profitable than scalar.\n"); > + continue; > + } > + > + if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) && > + LOOP_VINFO_INT_NITERS (loop_vinfo) < (unsigned) required_iters) > + { > + if (vect_print_dump_info (REPORT_DETAILS)) > + fprintf (vect_dump, > + "Not vectorized: min iters: %d, actual: %lu\n", > + required_iters, LOOP_VINFO_INT_NITERS (loop_vinfo)); > + continue; > + } > vect_transform_loop (loop_vinfo); > + > num_vectorized_loops++; > } > vect_loop_location = UNKNOWN_LOC; One comment is that we use REPORT_UNVECTORIZED_LOOPS (rather then REPORT_DETAILS) to guard the last printout that explains why a loop wasn't vectorized. The idea is that when you use -fdump-tree-vect-stats (or -ftree-vectorizer-verbose=2) you'll get one printout per loop explaining why it wasn't vectorized. I think that with the current patch, if a loop doesn't get vectorized because of the cost model, you'll see an explanation only when using -fdump-tree-vect-details (or -ftree-vectorizer-verbose=7). So you can either change REPORT_DETAILS to REPORT_UNVECTORIZED_LOOPS in the code above, or leave it as is and add another, more user friedly, printout under REPORT_UNVECTORIZED_LOOPS (e.g. "not vectorized: vectorization not profitable"). The other comment is that this piece of code would be better placed together with the last part of the function vect_analyze_operations, where we already do several similar loop-bound related checks (if the loop-bound is known, we compare it against the VF or whatever the user specified via --param min-vect-loop-bound, and other checks). So, please move it there, or, if you want, factor all of this out to a separate function, to be called after vect_analyze_operations. This also reminds me that we need to consider what to do if the user asked to vectorize when the loop-bound is greater than X (via --param min-vect-loop-bound), but the cost model determined that it would be profitable when the loop-bound is greater than Y, and X != Y. The --param min-vect-loop-bound was originally added only until such time that we have a cost model, but maybe we want to keep it around, at least until the cost model has been tuned/matured. In the meantime we can decide to do one of the following: (1) always go with the user (2) always go with the cost model (3) go with the user if the user is more conservative than the cost-model (i.e. X > Y) (4) go with the user if the user is less conservative than the cost-model (i.e. X < Y) The first option is equivalent to turning off the cost model with -fno-vect-cost-model, so I'd go with one of 2,3,4. > + > + /* Vectorization costs associated with statement */ According to the coding conventions comments should be treated like real sentences, so there should be a period at the end. Also, coding conventions ask for two spaces before end of the comment. (There are a few other occurrences of that in the patch). > +extern int vect_model_required_iters (loop_vec_info); same comment as I made above - how about 'vect_estimate_min_profitable_iters'? (or something shorter in that spirit) > +fno-vect-cost-model > +Common Report Var(flag_no_vect_cost_model) Optimization > +Disable use of cost model in vectorization > + As others already commented - the flag should be introduced as -fvect-cost-model, and documented in the texi files. > + /* Requires loop versioning tests. */ > + if (LOOP_VINFO_MAY_MISALIGN_STMTS (loop_vinfo)) > + vec_outside_cost += TARG_COND_BRANCH_COST; We can try to be a bit more "accurate" here and have the cost depend on the number of stmts in the may_misalign list (the more datarefs in the list, the heavier is the test we create). Obviously, there are a lot of improvements/fine-tuning like that that can be done, and these can be done later. For now, I would just ask that we record this in the code (add a FIXME comment here). > + if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)) > + vec_outside_cost += TARG_COND_BRANCH_COST; ... > + /* Requires guard code for peeling. */ > + if (LOOP_PEELING_FOR_ALIGNMENT (loop_vinfo)) > + vec_outside_cost += TARG_COND_BRANCH_COST; In the smae spirit of the above comment - we actually generate 2 guards for each peel-loop that we create, so 2*TARG_COND_BRANCH_COST would be more "accurate". > + /* Count statements in scalar loop. Using this as scalar cost for a single > + iteration for now. */ > + for (i = 0; i < nbbs; i++) > + { > + block_stmt_iterator si; > + basic_block bb = bbs[i]; > + > + for (si = bsi_start (bb); !bsi_end_p (si); bsi_next (&si)) > + { > + tree stmt = bsi_stmt (si); > + stmt_vec_info stmt_info = vinfo_for_stmt (stmt); > + if (!STMT_VINFO_RELEVANT_P (stmt_info) > + && !STMT_VINFO_LIVE_P (stmt_info)) > + continue; > + scalar_single_iter_cost++; > + vec_inside_cost += STMT_VINFO_INSIDE_OF_LOOP_COST (stmt_info); > + vec_outside_cost += STMT_VINFO_OUTSIDE_OF_LOOP_COST (stmt_info); > + } > + } This is more a note for myself - to remember to update this code when looking at outer-loop vectorization (the stmts in the inner-loop execute more times than the rest of the stmts). More adjustments would probably be needed for outer-loop support. > + /* Add in costs of peeled instructions, if any. */ > + peel_iters = LOOP_PEELING_FOR_ALIGNMENT (loop_vinfo); > + vec_outside_cost += peel_iters * scalar_single_iter_cost; > + You'll have a problem here if the number of iterations to be peeled is not known at compile time... in this case LOOP_PEELING_FOR_ALIGNEMNT is negative, and the actual number of iterations is computed at run time. So in that case you also need to build an expression that would be evaluated at runtime. I understand that at this point this patch handles only the case that the number of iterations in known, so until it is extended you just want to check if LOOP_PEELING_FOR_ALIGNMENT > 0. > +static void > +vect_model_reduction_cost (stmt_vec_info stmt_info, enum tree_code reduc_code, > + int ncopies) > +{ As discussed with Tony off-line - there's a close dependency between this function and the reduction transformation functions, as is the case with all the vect_model_*_cost functions - they all depend on how the respective vectorizable_* functions are implemented. At some point we may want to revisit this and see how to better design it. > + /* Add in cost for intial definition. */ typo - initial > + /* Determine cost of epilog code. */ > + if (reduc_code < NUM_TREE_CODES) > + /* We have a reduction operator that will reduce the vector in one > + statement. Also requires scalar extract. */ > + outer_cost += 2 * TARG_VEC_STMT_COST; Did you want to use TARG_VEC_TO_SCALAR_COST for the scalar extract? (As pointed out by others - it is not documented in tree-vectorizer.h). > +static void > +vect_model_simple_cost (stmt_vec_info stmt_info, int ncopies) > +{ > + STMT_VINFO_INSIDE_OF_LOOP_COST (stmt_info) = ncopies * TARG_VEC_STMT_COST; > +} > + > +/* Function vect_cost_strided_group_size > + > + For a load or store, returns the group_size if it's the last store, and 1 > + if it is not. */ > +static int > +vect_cost_strided_group_size (stmt_vec_info stmt_info) > +{ This is more of a convention within the tree-vect-* files: we keep 2 empty lines between functions, and one empty line between the description of the function and the function itself (you have kept this convention almost everywhere else, but these are a couple such occurrences here and there). > +/* Function vect_cost_strided_group_size > + > + For a load or store, returns the group_size if it's the last store, and 1 > + if it is not. */ > +static int > +vect_cost_strided_group_size (stmt_vec_info stmt_info) > +{ > + tree first_stmt = DR_GROUP_FIRST_DR (stmt_info); > + stmt_vec_info first_info = vinfo_for_stmt (first_stmt); > + int group_size = DR_GROUP_SIZE (first_info); > + STMT_VINFO_MEM_COUNT_FOR_COST (first_info)++; > + > + /* Is it the last in the group? */ > + if (STMT_VINFO_MEM_COUNT_FOR_COST (first_info) < group_size) > + group_size = 1; > + > + return group_size; > +} I think there's no need to keep track of whether this is the last load/store in a group or not. it doesn't matter for which of the stores/loads we return the actual group_size, as long as we do it exactly once per group. So I think you could rewrite it as follows (this way you can also get rid of the field MEM_COUNT_FOR_COST): static int vect_cost_strided_group_size (stmt_vec_info stmt_info) { tree first_stmt = DR_GROUP_FIRST_DR (stmt_info); if (first_stmt == STMT_VINFO_STMT (stmt_info)) return DR_GROUP_SIZE (stmt_info); return 1; } > + /* Is this the last access in a group of stores providing strided acess? typo - access. Appears a couple of times. > + if (group_size > 1) > + { > + /* Uses a high and low interleave operation for each needed permute. */ > + cost = ncopies * exact_log2(group_size) * group_size * > + TARG_VEC_TO_SCALAR_COST; > + } why are you using TARG_VEC_TO_SCALAR_COST? Again, it's not documented, so I don't know what it's meant to represent, but permute/shuffle like operations do not produce a scalar from a vector. Looks like TARG_VEC_STMT_COST is more appropriate here? > + /* Uses a high and low interleave operation for each needed permute. */ > + inner_cost = ncopies * exact_log2(group_size) * group_size * > + TARG_VEC_STMT_COST; For the loads we actually use even and odd extract operations, not interleave operations (and the cost for both should probably be the same) > + case dr_unaligned_software_pipeline: > + { > + int outer_cost; > + > + /* Unaligned software pipeline has a load of an address, an intial typo - initial > + load, and possibly a mask operation to "prime" the loop. Inside > + the loop, there is a load op and a realignment op. */ > + outer_cost = 2 * TARG_VEC_STMT_COST; > + if (targetm.vectorize.builtin_mask_for_load) > + outer_cost += TARG_VEC_STMT_COST; > + STMT_VINFO_OUTSIDE_OF_LOOP_COST (stmt_info) = outer_cost; if this is a group of loads (when vectorizing a strided access) - looks like the outer-cost will be added by each load in the group? (it should be added only once per group) > + > + inner_cost += (2*ncopies) * TARG_VEC_LOAD_COST; so you're using TARG_VEC_LOAD_COST also for the cost of the realignment op? I think is should be a TARG_VEC_STMT_COST. thanks! dorit > The patch is currently missing loop versioning for > dynamically trip counted loops and some cost model specific tests. My > colleague, Harsha Jagasia, will be providing follow on patches for those > things shortly. > > Thank you. > > Tony > > > [attachment "patch" deleted by Dorit Nuzman/Haifa/IBM]