From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 32036 invoked by alias); 6 Jul 2011 11:35:30 -0000 Received: (qmail 31928 invoked by uid 22791); 6 Jul 2011 11:35:28 -0000 X-SWARE-Spam-Status: No, hits=-2.5 required=5.0 tests=AWL,BAYES_00,RCVD_IN_DNSWL_LOW X-Spam-Check-By: sourceware.org Received: from mail-fx0-f49.google.com (HELO mail-fx0-f49.google.com) (209.85.161.49) by sourceware.org (qpsmtpd/0.43rc1) with ESMTP; Wed, 06 Jul 2011 11:35:13 +0000 Received: by fxd20 with SMTP id 20so5818848fxd.22 for ; Wed, 06 Jul 2011 04:35:12 -0700 (PDT) Received: by 10.223.10.150 with SMTP id p22mr12915448fap.86.1309952111853; Wed, 06 Jul 2011 04:35:11 -0700 (PDT) Received: from richards-thinkpad (gbibp9ph1--blueice3n2.emea.ibm.com [195.212.29.84]) by mx.google.com with ESMTPS id j23sm1575537fai.39.2011.07.06.04.35.10 (version=TLSv1/SSLv3 cipher=OTHER); Wed, 06 Jul 2011 04:35:11 -0700 (PDT) From: Richard Sandiford To: gcc@gcc.gnu.org Mail-Followup-To: gcc@gcc.gnu.org, richard.sandiford@linaro.org Subject: Help with ivopts Date: Wed, 06 Jul 2011 11:35:00 -0000 Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.1 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Mailing-List: contact gcc-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-owner@gcc.gnu.org X-SW-Source: 2011-07/txt/msg00050.txt.bz2 Consider: void loop (unsigned char *restrict ydst, unsigned char *restrict udst, unsigned char *restrict vdst, unsigned char *restrict src, int n) { int i; for (i = 0; i < n; i++) { ydst[2*i+0] = src[4*i+0]; udst[i] = src[4*i+1]; ydst[2*i+1] = src[4*i+2]; vdst[i] = src[4*i+3]; } } (based on libav's yuy2toyv12 code). Compiled for ARM with: -std=c99 -O3 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp the loop gets vectorised. The code before ivopts looks like this: : vect_p.31_116 = src_10(D); vect_p.40_123 = udst_14(D); vect_p.44_126 = ydst_6(D); vect_p.49_129 = vdst_31(D); [...] : # vect_p.28_117 = PHI # vect_p.37_124 = PHI # vect_p.41_127 = PHI # vect_p.46_130 = PHI # ivtmp.50_132 = PHI vect_array.32 = LOAD_LANES (MEM[(unsigned char *)vect_p.28_117]); vect_var_.33_119 = vect_array.32_I_lsm0.54_47; vect_var_.34_120 = vect_array.32_I_lsm0.53_109; vect_var_.35_121 = vect_array.32_I_lsm0.52_114; vect_var_.36_122 = vect_array.32_I_lsm0.51_113; MEM[(unsigned char *)vect_p.37_124] = vect_var_.34_120; vect_array.45[0] = vect_var_.33_119; vect_array.45[1] = vect_var_.35_121; MEM[(unsigned char *)vect_p.41_127] = STORE_LANES (vect_array.45); MEM[(unsigned char *)vect_p.46_130] = vect_var_.36_122; vect_p.28_118 = vect_p.28_117 + 32; vect_p.37_125 = vect_p.37_124 + 8; vect_p.41_128 = vect_p.41_127 + 16; vect_p.46_131 = vect_p.46_130 + 8; ivtmp.50_133 = ivtmp.50_132 + 1; if (ivtmp.50_133 < bnd.25_70) goto ; else goto ; [...] : goto ; We record these uses: use 1 generic in statement vect_p.37_124 = PHI at position type vector(8) unsigned char * restrict base (vector(8) unsigned char * restrict) udst_14(D) step 8 base object (void *) udst_14(D) is a biv related candidates use 3 generic in statement vect_p.46_130 = PHI at position type vector(8) unsigned char * restrict base (vector(8) unsigned char * restrict) vdst_31(D) step 8 base object (void *) vdst_31(D) is a biv related candidates Note how they are "generic" rather than "address"es. We also have the candidates: candidate 5 (important) var_before ivtmp.81 var_after ivtmp.81 incremented before exit test type unsigned int base (unsigned int) udst_14(D) step 8 base object (void *) udst_14(D) [...] candidate 7 (important) original biv type unsigned int base (unsigned int) udst_14(D) step 8 base object (void *) udst_14(D) [...] candidate 13 (important) var_before ivtmp.87 var_after ivtmp.87 incremented before exit test type unsigned int base (unsigned int) vdst_31(D) step 8 base object (void *) vdst_31(D) candidate 14 (important) original biv type unsigned int base (unsigned int) vdst_31(D) step 8 base object (void *) vdst_31(D) The problem (from an ARM POV) is that we end using candidate 5 for use 3: : [...] D.3678_95 = (unsigned int) vdst_31(D); D.3679_93 = (unsigned int) udst_14(D); D.3680_89 = D.3678_95 - D.3679_93; D.3681_87 = D.3680_89 + ivtmp.81_94; D.3682_83 = (vector(8) unsigned char * restrict) D.3681_87; [vdst + i:] vect_p.46_130 = D.3682_83; [udst + i:] vect_p.37_124 = (vector(8) unsigned char * restrict) ivtmp.81_94; [...] This is based on the following costs: Use 3: cand cost compl. depends on 0 8 0 inv_expr:18 5 4 0 inv_expr:19 6 4 0 inv_expr:18 7 4 0 inv_expr:19 8 4 1 inv_expr:20 13 0 0 14 0 0 15 4 1 16 8 0 inv_expr:18 17 8 1 inv_expr:21 The cost of using candidate 5 for use 3 is calculated as: /* use = ubase + ratio * (var - cbase). If either cbase is a constant or ratio == 1, it is better to handle this like ubase - ratio * cbase + ratio * var (also holds in the case ratio == -1, TODO. */ [...] else if (ratio == 1) { tree real_cbase = cbase; /* Check to see if any adjustment is needed. */ if (cstepi == 0 && stmt_is_after_inc) { aff_tree real_cbase_aff; aff_tree cstep_aff; tree_to_aff_combination (cbase, TREE_TYPE (real_cbase), &real_cbase_aff); tree_to_aff_combination (cstep, TREE_TYPE (cstep), &cstep_aff); aff_combination_add (&real_cbase_aff, &cstep_aff); real_cbase = aff_combination_to_tree (&real_cbase_aff); } cost = difference_cost (data, ubase, real_cbase, &symbol_present, &var_present, &offset, depends_on); cost.cost /= avg_loop_niter (data->current_loop); } [...] cost.cost += add_cost (TYPE_MODE (ctype), speed); The individual difference_cost and add_cost seem reasonable (4 in each case). I don't understand the reasoning behind the division though. Is the idea that this should be hoisted? If so, then: (a) That doesn't happen at the tree level. The subtraction is still inside the loop at RTL generation time. (b) What's the advantage of introducing a new hoisted subtraction that is going to be live throughout the loop, and then adding another IV to it inside the loop, over using the original IV and incrementing it in the normal way? There is code to stop this happening for uses that are marked as addresses: if (address_p) { /* Do not try to express address of an object with computation based on address of a different object. This may cause problems in rtl level alias analysis (that does not expect this to be happening, as this is illegal in C), and would be unlikely to be useful anyway. */ if (use->iv->base_object && cand->iv->base_object && !operand_equal_p (use->iv->base_object, cand->iv->base_object, 0)) return infinite_cost; } but it doesn't trigger for these uses because we treat them as "generic". That could fixed by, e.g., treating them as addresses, or by using POINTER_TYPE_P instead of or as well as address_p in the condition above. But it seems like a slightly odd transformation in any case. The reason I'm interested is that NEON has auto-increment instructions. This transformation stops us using them for vdst and (because of unfortunate instruction ordering) on udst as well. Richard