Hi Richard,

Thanks for the review!

on 2020/7/22 5:11 PM, Richard Sandiford wrote:
> "Kewen.Lin" writes:
>> -  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
>> -    {
>> -      peel_iters_prologue = 0;
>> -      peel_iters_epilogue = 0;
>> +  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
>> +    {
>> +      /* Calculate how many masks we need to generate.  */
>> +      unsigned int num_masks = 0;
>> +      rgroup_controls *rgm;
>> +      unsigned int num_vectors_m1;
>> +      FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm)
>> +        if (rgm->type)
>> +          num_masks += num_vectors_m1 + 1;
>> +      gcc_assert (num_masks > 0);
>> +
>> +      /* In the worst case, we need to generate each mask in the prologue
>> +         and in the loop body.  One of the loop body mask instructions
>> +         replaces the comparison in the scalar loop, and since we don't
>> +         count the scalar comparison against the scalar body, we shouldn't
>> +         count that vector instruction against the vector body either.
>> +
>> +         Sometimes we can use unpacks instead of generating prologue
>> +         masks and sometimes the prologue mask will fold to a constant,
>> +         so the actual prologue cost might be smaller.  However, it's
>> +         simpler and safer to use the worst-case cost; if this ends up
>> +         being the tie-breaker between vectorizing or not, then it's
>> +         probably better not to vectorize.  */
>> +      (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks,
>> +                            vector_stmt, NULL, NULL_TREE, 0, vect_prologue);
>> +      (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks - 1,
>> +                            vector_stmt, NULL, NULL_TREE, 0, vect_body);
>> +    }
>> +  else
>> +    {
>> +      gcc_assert (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo));
>> +
>> +      /* Consider cost for LOOP_VINFO_PEELING_FOR_ALIGNMENT.  */
>> +      if (npeel < 0)
>> +        {
>> +          peel_iters_prologue = assumed_vf / 2;
>> +          /* See below, if peeled iterations are unknown, count a taken
>> +             branch and a not taken branch per peeled loop.
>> +             */
>> +          (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
>> +                                cond_branch_taken, NULL, NULL_TREE, 0,
>> +                                vect_prologue);
>> +          (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
>> +                                cond_branch_not_taken, NULL, NULL_TREE, 0,
>> +                                vect_prologue);
>> +        }
>> +      else
>> +        {
>> +          peel_iters_prologue = npeel;
>> +          if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
>> +            /* See vect_get_known_peeling_cost, if peeled iterations are
>> +               known but number of scalar loop iterations are unknown, count
>> +               a taken branch per peeled loop.  */
>> +            (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
>> +                                  cond_branch_taken, NULL, NULL_TREE, 0,
>> +                                  vect_prologue);
>> +        }
>
> I think it'd be good to avoid duplicating this.  How about the
> following structure?
>
>   if (vect_use_loop_mask_for_alignment_p (…))
>     {
>       peel_iters_prologue = 0;
>       peel_iters_epilogue = 0;
>     }
>   else if (npeel < 0)
>     {
>       … // A
>     }
>   else
>     {
>       …vect_get_known_peeling_cost stuff…
>     }
>
> but in A and vect_get_known_peeling_cost, set peel_iters_epilogue to:
>
>   LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) ? 1 : 0
>
> for LOOP_VINFO_USING_PARTIAL_VECTORS_P, instead of setting it to
> whatever value we'd normally use.  Then wrap:
>
>   (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken,
>                         NULL, NULL_TREE, 0, vect_epilogue);
>   (void) add_stmt_cost (loop_vinfo,
>                         target_cost_data, 1, cond_branch_not_taken,
>                         NULL, NULL_TREE, 0, vect_epilogue);
>
> in !LOOP_VINFO_USING_PARTIAL_VECTORS_P and make the other vect_epilogue
> stuff in A conditional on peel_iters_epilogue != 0.
>
> This will also remove the need for the existing LOOP_VINFO_FULLY_MASKED_P
> code:
>
>   if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
>     {
>       /* We need to peel exactly one iteration.
>          */
>       peel_iters_epilogue += 1;
>       stmt_info_for_cost *si;
>       int j;
>       FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),
>                         j, si)
>         (void) add_stmt_cost (loop_vinfo, target_cost_data, si->count,
>                               si->kind, si->stmt_info, si->vectype,
>                               si->misalign, vect_epilogue);
>     }
>
> Then, after the above, have:
>
>   if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
>     …add costs for mask overhead…
>   else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
>     …add costs for lengths overhead…
>
> So we'd have one block of code for estimating the prologue and epilogue
> peeling cost, and a separate block of code for the loop control overhead.

It's a great idea!  Following your subsequent suggestion, I made the
structure like:

  - calculate peel_iters_prologue
  - calculate peel_iters_epilogue
  - add costs associated with peel_iters_prologue
  - add costs associated with peel_iters_epilogue
  - add costs related to branch taken/not_taken

The updated v3 is attached.  It has just been bootstrapped/regtested on
powerpc64le-linux-gnu (P9) with the explicit param
vect-partial-vector-usage=1; I'll test it without any partial vector
setting, and also on aarch64, later.

BR,
Kewen

-----
gcc/ChangeLog:

	* config/rs6000/rs6000.c (adjust_vect_cost_per_loop): New function.
	(rs6000_finish_cost): Call adjust_vect_cost_per_loop.
	* tree-vect-loop.c (vect_get_known_peeling_cost): Factor out some
	code to determine peel_iters_epilogue to function ...
	(vect_get_peel_iters_epilogue): ... this.  New function.
	(vect_estimate_min_profitable_iters): Add cost modeling for vector
	with length, refactor cost calculation on peel_iters_prologue and
	peel_iters_epilogue.