From mboxrd@z Thu Jan  1 00:00:00 1970
From: Richard Biener
Date: Tue, 03 Sep 2019 11:24:00 -0000
Subject: Re: [PATCH, i386]: Do not limit the cost of moves to/from XMM register to minimum 8.
To: Hongtao Liu
Cc: Uros Bizjak, Jakub Jelinek, Alan Modra, gcc-patches@gcc.gnu.org
References: <20190831005151.GD9227@bubble.grove.modra.org>

On Tue, Sep 3, 2019 at 9:57 AM Hongtao Liu wrote:
>
> On Mon, Sep 2, 2019 at 4:41 PM Uros Bizjak wrote:
> >
> > On Mon, Sep 2, 2019 at 10:13 AM Hongtao Liu wrote:
> > >
> > > > which is not the case with core_cost (and similar with skylake_cost):
> > > >
> > > >   2, 2, 4,                          /* cost of moving XMM,YMM,ZMM register */
> > > >   {6, 6, 6, 6, 12},                 /* cost of loading SSE registers
> > > >                                        in 32,64,128,256 and 512-bit */
> > > >   {6, 6, 6, 6, 12},                 /* cost of storing SSE registers
> > > >                                        in 32,64,128,256 and 512-bit */
> > > >   2, 2,                             /* SSE->integer and integer->SSE moves */
> > > >
> > > > We have the same cost of moving between integer registers (by default
> > > > set to 2), between SSE registers and between integer and SSE register
> > > > sets.  I think that at least the cost of moves between regsets should
> > > > be substantially higher, rs6000 uses 3x cost of intra-regset moves;
> > > > that would translate to the value of 6.  The value should be low enough
> > > > to keep the cost below the value that forces move through the memory.
> > > > Changing core register allocation cost of SSE <-> integer to:
> > > >
> > > > --cut here--
> > > > Index: config/i386/x86-tune-costs.h
> > > > ===================================================================
> > > > --- config/i386/x86-tune-costs.h        (revision 275281)
> > > > +++ config/i386/x86-tune-costs.h        (working copy)
> > > > @@ -2555,7 +2555,7 @@ struct processor_costs core_cost = {
> > > >                                        in 32,64,128,256 and 512-bit */
> > > >    {6, 6, 6, 6, 12},                 /* cost of storing SSE registers
> > > >                                        in 32,64,128,256 and 512-bit */
> > > > -  2, 2,                             /* SSE->integer and integer->SSE moves */
> > > > +  6, 6,                             /* SSE->integer and integer->SSE moves */
> > > >   /* End of register allocator costs.  */
> > > >   },
> > > >
> > > > --cut here--
> > > >
> > > > still produces direct move in gcc.target/i386/minmax-6.c
> > > >
> > > > I think that in addition to attached patch, values between 2 and 6
> > > > should be considered in benchmarking.  Unfortunately, without access to
> > > > regressed SPEC tests, I can't analyse these changes by myself.
> > > >
> > > > Uros.
> > >
> > > Applying a similar change to skylake_cost, on a skylake workstation we got
> > > performance like this:
> > > ---------------------------------------------------------------------
> > > version                                      | 548_exchange_r score
> > > ---------------------------------------------------------------------
> > > gcc10_20180822                               | 10
> > > apply remove_max8                            | 8.9
> > > also apply increase integer_tofrom_sse cost  | 9.69
> > > ---------------------------------------------------------------------
> > > There is still a 3% regression, which is related to _gfortran_mminloc0_4_i4
> > > in libgfortran.so.5.0.0.
> > >
> > > I found suspicious code as below, does it affect?
> >
> > Hard to say without access to the test, but I'm glad that changing the
> > knob has a noticeable effect.  I think that (as said by Alan) fine-tuning
> > of the register pressure calculation will be needed to push this forward.
> >
> > Uros.
> >
> > > ------------------
> > > modified   gcc/config/i386/i386-features.c
> > > @@ -590,7 +590,7 @@ general_scalar_chain::compute_convert_gain ()
> > >    if (dump_file)
> > >      fprintf (dump_file, "  Instruction conversion gain: %d\n", gain);
> > >
> > > -  /* ??? What about integer to SSE?  */
> > > +  /* ??? What about integer to SSE?  */???
> > >    EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi)
> > >      cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer;
> > > ------------------
> > > --
> > > BR,
> > > Hongtao
>
> Note:
> Removing the limit on the cost would introduce lots of regressions in
> SPEC2017, as follows:
> --------------------------------
> 531.deepsjeng_r   -7.18%
> 548.exchange_r    -6.70%
> 557.xz_r          -6.74%
> 508.namd_r        -2.81%
> 527.cam4_r        -6.48%
> 544.nab_r         -3.99%
>
> Tested on a skylake server.
> -------------------------------------
> How about changing the cost from 2 to 8 until we figure out a better number?

Certainly works for me.  Note the STV code uses the "other" sse_to_integer
number, while the numbers in question here are those for the RA.  There is
a multitude of values used in the tables here, including some a lot larger,
so the overall bumping to 8 certainly was the wrong thing to do; instead,
the individual numbers should have been adjusted (I didn't look at the
history of that bumping).  For example, Pentium4 has quite high base move
costs: an xmm <-> xmm move costs 12, and SSE->integer costs 20 while the
opposite direction costs 12.

So yes, we want to revert the patch by applying its effect to the
individual cost tables, so we can revisit this for the still interesting
micro-architectures.

Richard.

> --
> BR,
> Hongtao
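
To make the distinction Richard draws above concrete, between the SSE<->integer
entries in the register-allocator block of the per-CPU cost tables and the
separate sse_to_integer value that the STV gain computation multiplies by the
number of converted defs, here is a minimal, self-contained C sketch.  The
struct layout, field names, and cost values are simplified assumptions for
illustration only; they do not reproduce GCC's actual processor_costs
definition, only the idea of two independent knobs and the multiply-by-def-count
pattern visible in the quoted i386-features.c hunk.

--cut here--
/* Illustrative sketch only: two independent SSE<->integer move cost
   knobs, one consulted by the register allocator and one consumed by
   an STV-style gain computation.  All names and values are simplified
   stand-ins, not GCC's real processor_costs layout.  */

#include <stdio.h>

struct ra_costs
{
  int sse_to_integer;   /* SSE -> integer move, RA block of the table */
  int integer_to_sse;   /* integer -> SSE move, RA block of the table */
};

struct cpu_costs
{
  struct ra_costs ra;   /* "register allocator costs" block */
  int sse_to_integer;   /* the "other" number, used by the STV gain code */
};

/* Mirrors the shape of the quoted i386-features.c hunk: weight the
   conversion cost by how many defs would need an SSE<->integer move.  */
static int
stv_extra_cost (const struct cpu_costs *c, int converted_def_count)
{
  return converted_def_count * c->sse_to_integer;
}

int
main (void)
{
  /* 6/6 follows the RA values proposed in the thread; the STV-side 2 is
     just a placeholder to show that the two knobs can differ.  */
  struct cpu_costs example = { { 6, 6 }, 2 };

  printf ("RA SSE->integer move cost: %d\n", example.ra.sse_to_integer);
  printf ("STV extra cost for 3 converted defs: %d\n",
          stv_extra_cost (&example, 3));
  return 0;
}
--cut here--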