Date: Mon, 15 Feb 2016 10:49:00 -0000
From: James Greenhalgh
To: gcc-patches@gcc.gnu.org
Cc: nd@arm.com, ramana.radhakrishnan@arm.com, marcus.shawcroft@arm.com, richard.earnshaw@arm.com
Subject: Re: [Patch AArch64] GCC 6 regression in vector performance. - Fix vector initialization to happen with lane load instructions.
Message-ID: <20160215104907.GB16295@arm.com>
References: <1453303331-14492-1-git-send-email-james.greenhalgh@arm.com> <20160202102928.GA5661@arm.com> <20160208105628.GA39718@arm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160208105628.GA39718@arm.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-IsSubscribed: yes
X-SW-Source: 2016-02/txt/msg00971.txt.bz2

On Mon, Feb 08, 2016 at 10:56:29AM +0000, James Greenhalgh wrote:
> On Tue, Feb 02, 2016 at 10:29:29AM +0000, James Greenhalgh wrote:
> > On Wed, Jan 20, 2016 at 03:22:11PM +0000, James Greenhalgh wrote:
> > >
> > > Hi,
> > >
> > > In a number of cases where we try to create vectors we end up spilling to the
> > > stack and then filling. This is one example distilled from a couple of
> > > micro-benchmarks where the issue shows up. The reason for the extra cost
> > > in this case is the unnecessary use of the stack. The patch attempts to
> > > finesse this by using lane loads or vector inserts to produce the right
> > > results.
> > >
> > > This patch is mostly Ramana's work, I've just cleaned it up a little.
> > >
> > > This has been in a number of our trees lately, and we haven't seen any
> > > regressions. I've also bootstrapped and tested it, and run a set of
> > > benchmarks to show no regressions on Cortex-A57 or Cortex-A53.
> > >
> > > The patch fixes some regressions caused by the more aggressive vectorization
> > > in GCC6, so I'd like to propose it to go in even though we are in Stage 4.
> > >
> > > OK?
> >
> > *Ping*
>
> *ping^2*

*ping ^3*

Thanks,
James

> > > 2016-01-20  James Greenhalgh
> > >             Ramana Radhakrishnan
> > >
> > >         * config/aarch64/aarch64.c (aarch64_expand_vector_init): Refactor,
> > >         always use lane loads to construct non-constant vectors.
> > >
> > > gcc/testsuite/
> > >
> > > 2016-01-20  James Greenhalgh
> > >             Ramana Radhakrishnan
> > >
> > >         * gcc.target/aarch64/vector_initialization_nostack.c: New.
> > >
> > > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> > > index 03bc1b9..3787b38 100644
> > > --- a/gcc/config/aarch64/aarch64.c
> > > +++ b/gcc/config/aarch64/aarch64.c
> > > @@ -10985,28 +10985,37 @@ aarch64_simd_make_constant (rtx vals)
> > >      return NULL_RTX;
> > >  }
> > >
> > > +/* Expand a vector initialisation sequence, such that TARGET is
> > > +   initialised to contain VALS.  */
> > > +
> > >  void
> > >  aarch64_expand_vector_init (rtx target, rtx vals)
> > >  {
> > >    machine_mode mode = GET_MODE (target);
> > >    machine_mode inner_mode = GET_MODE_INNER (mode);
> > > +  /* The number of vector elements.  */
> > >    int n_elts = GET_MODE_NUNITS (mode);
> > > +  /* The number of vector elements which are not constant.  */
> > >    int n_var = 0;
> > >    rtx any_const = NULL_RTX;
> > > +  /* The first element of vals.  */
> > > +  rtx v0 = XVECEXP (vals, 0, 0);
> > >    bool all_same = true;
> > >
> > > +  /* Count the number of variable elements to initialise.  */
> > >    for (int i = 0; i < n_elts; ++i)
> > >      {
> > >        rtx x = XVECEXP (vals, 0, i);
> > > -      if (!CONST_INT_P (x) && !CONST_DOUBLE_P (x))
> > > +      if (!(CONST_INT_P (x) || CONST_DOUBLE_P (x)))
> > >          ++n_var;
> > >        else
> > >          any_const = x;
> > >
> > > -      if (i > 0 && !rtx_equal_p (x, XVECEXP (vals, 0, 0)))
> > > -        all_same = false;
> > > +      all_same &= rtx_equal_p (x, v0);
> > >      }
> > >
> > > +  /* No variable elements, hand off to aarch64_simd_make_constant which knows
> > > +     how best to handle this.  */
> > >    if (n_var == 0)
> > >      {
> > >        rtx constant = aarch64_simd_make_constant (vals);
> > > @@ -11020,14 +11029,15 @@ aarch64_expand_vector_init (rtx target, rtx vals)
> > >    /* Splat a single non-constant element if we can.  */
> > >    if (all_same)
> > >      {
> > > -      rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, 0));
> > > +      rtx x = copy_to_mode_reg (inner_mode, v0);
> > >        aarch64_emit_move (target, gen_rtx_VEC_DUPLICATE (mode, x));
> > >        return;
> > >      }
> > >
> > > -  /* Half the fields (or less) are non-constant.  Load constant then overwrite
> > > -     varying fields.  Hope that this is more efficient than using the stack.  */
> > > -  if (n_var <= n_elts/2)
> > > +  /* Initialise a vector which is part-variable.  We want to first try
> > > +     to build those lanes which are constant in the most efficient way we
> > > +     can.  */
> > > +  if (n_var != n_elts)
> > >      {
> > >        rtx copy = copy_rtx (vals);
> > >
> > > @@ -11054,31 +11064,21 @@ aarch64_expand_vector_init (rtx target, rtx vals)
> > >            XVECEXP (copy, 0, i) = subst;
> > >          }
> > >        aarch64_expand_vector_init (target, copy);
> > > +    }
> > >
> > > -      /* Insert variables.  */
> > > -      enum insn_code icode = optab_handler (vec_set_optab, mode);
> > > -      gcc_assert (icode != CODE_FOR_nothing);
> > > +  /* Insert the variable lanes directly.  */
> > >
> > > -      for (int i = 0; i < n_elts; i++)
> > > -        {
> > > -          rtx x = XVECEXP (vals, 0, i);
> > > -          if (CONST_INT_P (x) || CONST_DOUBLE_P (x))
> > > -            continue;
> > > -          x = copy_to_mode_reg (inner_mode, x);
> > > -          emit_insn (GEN_FCN (icode) (target, x, GEN_INT (i)));
> > > -        }
> > > -      return;
> > > -    }
> > > +  enum insn_code icode = optab_handler (vec_set_optab, mode);
> > > +  gcc_assert (icode != CODE_FOR_nothing);
> > >
> > > -  /* Construct the vector in memory one field at a time
> > > -     and load the whole vector.  */
> > > -  rtx mem = assign_stack_temp (mode, GET_MODE_SIZE (mode));
> > >    for (int i = 0; i < n_elts; i++)
> > > -    emit_move_insn (adjust_address_nv (mem, inner_mode,
> > > -                                       i * GET_MODE_SIZE (inner_mode)),
> > > -                    XVECEXP (vals, 0, i));
> > > -  emit_move_insn (target, mem);
> > > -
> > > +    {
> > > +      rtx x = XVECEXP (vals, 0, i);
> > > +      if (CONST_INT_P (x) || CONST_DOUBLE_P (x))
> > > +        continue;
> > > +      x = copy_to_mode_reg (inner_mode, x);
> > > +      emit_insn (GEN_FCN (icode) (target, x, GEN_INT (i)));
> > > +    }
> > >  }
> > >
> > >  static unsigned HOST_WIDE_INT
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/vector_initialization_nostack.c b/gcc/testsuite/gcc.target/aarch64/vector_initialization_nostack.c
> > > new file mode 100644
> > > index 0000000..bbad04d
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/vector_initialization_nostack.c
> > > @@ -0,0 +1,53 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O3 -ftree-vectorize -fno-vect-cost-model" } */
> > > +float arr_f[100][100];
> > > +float
> > > +f9 (void)
> > > +{
> > > +
> > > +  int i;
> > > +  float sum = 0;
> > > +  for (i = 0; i < 100; i++)
> > > +    sum += arr_f[i][0] * arr_f[0][i];
> > > +  return sum;
> > > +
> > > +}
> > > +
> > > +
> > > +int arr[100][100];
> > > +int
> > > +f10 (void)
> > > +{
> > > +
> > > +  int i;
> > > +  int sum = 0;
> > > +  for (i = 0; i < 100; i++)
> > > +    sum += arr[i][0] * arr[0][i];
> > > +  return sum;
> > > +
> > > +}
> > > +
> > > +double arr_d[100][100];
> > > +double
> > > +f11 (void)
> > > +{
> > > +  int i;
> > > +  double sum = 0;
> > > +  for (i = 0; i < 100; i++)
> > > +    sum += arr_d[i][0] * arr_d[0][i];
> > > +  return sum;
> > > +}
> > > +
> > > +char arr_c[100][100];
> > > +char
> > > +f12 (void)
> > > +{
> > > +  int i;
> > > +  char sum = 0;
> > > +  for (i = 0; i < 100; i++)
> > > +    sum += arr_c[i][0] * arr_c[0][i];
> > > +  return sum;
> > > +}
> > > +
> > > +
> > > +/* { dg-final { scan-assembler-not "sp" } } */
> > >
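
For anyone skimming the diff, the order of the checks in the patched aarch64_expand_vector_init can be modelled in plain C. This is only a toy sketch under stated assumptions: struct elt, enum strategy, elt_equal and choose_strategy are made-up names for illustration and are not GCC internals, and each element is reduced to either a constant or a numbered run-time value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for one vector element.  Hypothetical type for this
   sketch only; it is not a GCC data structure.  */
struct elt
{
  bool is_const;  /* True for a compile-time constant.  */
  int id;         /* Constant value, or a made-up register number.  */
};

enum strategy
{
  ALL_CONSTANT,       /* n_var == 0: hand off to the constant path.  */
  SPLAT,              /* Every lane is the same variable: duplicate it.  */
  CONST_THEN_INSERT,  /* Build the constant lanes, then insert the rest.  */
  INSERT_ALL          /* No constant lanes: insert every lane.  */
};

/* Simplified equality test for the toy element type.  */
static bool
elt_equal (struct elt a, struct elt b)
{
  return a.is_const == b.is_const && a.id == b.id;
}

/* Mirror the order of the checks in the patched function.  */
static enum strategy
choose_strategy (const struct elt *vals, size_t n_elts)
{
  assert (n_elts > 0);

  size_t n_var = 0;      /* Number of non-constant lanes.  */
  bool all_same = true;  /* Are all lanes equal to lane 0?  */

  for (size_t i = 0; i < n_elts; i++)
    {
      if (!vals[i].is_const)
        n_var++;
      all_same &= elt_equal (vals[i], vals[0]);
    }

  if (n_var == 0)
    return ALL_CONSTANT;
  if (all_same)
    return SPLAT;
  if (n_var != n_elts)
    return CONST_THEN_INSERT;
  return INSERT_ALL;
}
```

In this model a four-lane vector with three variable lanes and one constant lane gives CONST_THEN_INSERT. Before the patch that case failed the n_var <= n_elts/2 test and was built through the stack temporary.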