From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-327331-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 9914 invoked by alias); 24 Sep 2012 20:48:55 -0000
Received: (qmail 9883 invoked by uid 22791); 24 Sep 2012 20:48:54 -0000
X-SWARE-Spam-Status: No, hits=-4.3 required=5.0	tests=AWL,BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FROM,KHOP_RCVD_TRUST,RCVD_IN_DNSWL_LOW,RCVD_IN_HOSTKARMA_YE
X-Spam-Check-By: sourceware.org
Received: from mail-wg0-f51.google.com (HELO mail-wg0-f51.google.com) (74.125.82.51)    by sourceware.org (qpsmtpd/0.43rc1) with ESMTP; Mon, 24 Sep 2012 20:48:41 +0000
Received: by wgbed3 with SMTP id ed3so3835023wgb.8        for <gcc-patches@gcc.gnu.org>; Mon, 24 Sep 2012 13:48:39 -0700 (PDT)
Received: by 10.216.136.157 with SMTP id w29mr7874804wei.148.1348519719746;        Mon, 24 Sep 2012 13:48:39 -0700 (PDT)
Received: from localhost ([2.26.188.227])        by mx.google.com with ESMTPS id bc2sm18747241wib.0.2012.09.24.13.48.37        (version=TLSv1/SSLv3 cipher=OTHER);        Mon, 24 Sep 2012 13:48:37 -0700 (PDT)
From: Richard Sandiford <rdsandiford@googlemail.com>
To: "Maciej W. Rozycki" <macro@codesourcery.com>
Mail-Followup-To: "Maciej W. Rozycki" <macro@codesourcery.com>,Sandra Loosemore <sandra@codesourcery.com>,  <gcc-patches@gcc.gnu.org>, rdsandiford@googlemail.com
Cc: Sandra Loosemore <sandra@codesourcery.com>,  <gcc-patches@gcc.gnu.org>
Subject: Re: PING Re: [PATCH, MIPS] add new peephole for 74k dspr2
References: <502D0DF4.3070302@codesourcery.com> <87zk5qu783.fsf@talisman.home>	<5034EBC6.5080602@codesourcery.com> <87bohw1ecb.fsf@talisman.home>	<5058ACDA.5060706@codesourcery.com> <87lig719ig.fsf@talisman.home>	<alpine.DEB.1.10.1209241548180.28358@tp.orcam.me.uk>
Date: Mon, 24 Sep 2012 21:40:00 -0000
In-Reply-To: <alpine.DEB.1.10.1209241548180.28358@tp.orcam.me.uk> (Maciej	W. Rozycki's message of "Mon, 24 Sep 2012 16:08:55 +0100")
Message-ID: <87sja7tadr.fsf@talisman.home>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
X-SW-Source: 2012-09/txt/msg01666.txt.bz2

"Maciej W. Rozycki" <macro@codesourcery.com> writes:
> On Tue, 18 Sep 2012, Richard Sandiford wrote:
>
>> > Have you had time to think about this some more?  I am not sure I can 
>> > guess how you'd like me to fix this patch now without some more specific 
>> > review and/or suggestions about where the optimization should happen and 
>> > what cases it should be extended to detect in addition to the dsp 
>> > accumulator multiplies.
>> 
>> The patch below is the one I've been testing.  But I got sidetracked
>> by looking into the possibility of removing the MD0_REG and MD1_REG
>> classes, in order to get more sensible costs.  I think that was needed
>> for the madd-9.c test to pass.
>
>  Sorry to come up with this so late -- I have only now noticed this being 
> discussed.
>
>> @@ -4105,39 +4105,55 @@ mips_subword (rtx op, bool high_p)
>>    return simplify_gen_subreg (word_mode, op, mode, byte);
>>  }
>>  
>> -/* Return true if a 64-bit move from SRC to DEST should be split into two.  */
>> +/* Return true if SRC can be moved into DEST using MULT $0, $0.  */
>> +
>> +static bool
>> +mips_mult_move_p (rtx dest, rtx src)
>> +{
>> +  return (src == const0_rtx
>> +	  && REG_P (dest)
>> +	  && GET_MODE_SIZE (GET_MODE (dest)) == 2 * UNITS_PER_WORD
>> +	  && (ISA_HAS_DSP_MULT
>> +	      ? ACC_REG_P (REGNO (dest))
>> +	      : MD_REG_P (REGNO (dest))));
>> +}
>> +
>> +/* Return true if a move from SRC to DEST should be split into two.  */
>
>  Does the DSP ASE guarantee that a MULT $0, $0 is not going to be slower 
> than MTHI $0/MTLO $0?  The latency of multiplication varies among 
> implementations, for example the original R3000 took 12 cycles (of course 
> the R3000 itself is not relevant for this change, but you see the 
> picture!).  On the other hand in some (but not all!) processors 
> multiplication runs in parallel to the main pipeline so it is the 
> difference, if positive, between the number of cycles consumed by other 
> instructions up to the next HI/LO access instruction and the latency of 
> MULT run in the background that matters.
>
>  From the context I am assuming none of this matters for the 74K (and 
> presumably the 24KE/34K) and a MULT $0, $0 is indeed faster, but overall 
> isn't it something that should be decided based on instruction costs from 
> DFA schedulers?  Is there anything that I've missed here?  It doesn't 
> appear to me that either your proposal or the original takes instruction 
> cost calculation into consideration.

In practice, we only move 0 into HI and LO for MADD- and MSUB-style
operations.  We deliberately don't use HI and LO as scratch space.

I think it's a reasonable default assumption that anything that supports
those instructions also has a fast path from MULT to MADD or MULT to MSUB.
I certainly don't know of any counter-examples.  The decision is deliberately
centralised in one place so that the condition can be tweaked in future
if necessary.
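
To illustrate what "tweaked in future" could look like, here is a minimal
standalone sketch.  It is not GCC code: the sketch_* names, the
fast_zero_via_mult flag and the 4-byte word size are all assumptions made
up for this example, and the register-kind check only approximates what
ACC_REG_P and MD_REG_P test in the patch.  The point is just that a
per-processor condition would slot into the single predicate rather than
into each caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed word size for a 32-bit target; stands in for UNITS_PER_WORD.  */
#define SKETCH_UNITS_PER_WORD 4u

/* Simplified register classification (hypothetical).  */
enum sketch_reg_kind { SKETCH_GPR, SKETCH_MD, SKETCH_DSP_ACC };

/* Hypothetical description of the selected processor.  */
struct sketch_target {
  bool has_dsp_mult;        /* models ISA_HAS_DSP_MULT */
  bool fast_zero_via_mult;  /* assumed tuning flag, not in the patch */
};

/* Hypothetical description of a move.  */
struct sketch_move {
  enum sketch_reg_kind dest_kind;
  unsigned dest_size;       /* size of the destination in bytes */
  bool src_is_zero;         /* models src == const0_rtx */
};

/* Return true if the move should be done with MULT $0, $0.  The first
   three conditions mirror the shape of mips_mult_move_p; the last one
   is the kind of extra condition that could be added later.  */
static bool
sketch_mult_move_p (const struct sketch_target *t,
                    const struct sketch_move *m)
{
  assert (t != NULL && m != NULL);

  /* With the DSP multiplies, any accumulator is acceptable; otherwise
     only the HI/LO pair is.  */
  bool dest_ok = t->has_dsp_mult
    ? (m->dest_kind == SKETCH_MD || m->dest_kind == SKETCH_DSP_ACC)
    : m->dest_kind == SKETCH_MD;

  return (m->src_is_zero
          && m->dest_size == 2 * SKETCH_UNITS_PER_WORD
          && dest_ok
          && t->fast_zero_via_mult);
}
```

A processor for which the multiply turned out to be slow would simply clear
the flag and fall back to the MTHI/MTLO sequence, with no change needed in
the code that calls the predicate.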

Richard