From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-201878-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 14029 invoked by alias); 20 Aug 2007 23:25:35 -0000
Received: (qmail 13983 invoked by uid 22791); 20 Aug 2007 23:25:34 -0000
X-Spam-Check-By: sourceware.org
Received: from mail.codesourcery.com (HELO mail.codesourcery.com) (65.74.133.4)     by sourceware.org (qpsmtpd/0.31) with ESMTP; Mon, 20 Aug 2007 23:25:30 +0000
Received: (qmail 12850 invoked from network); 20 Aug 2007 23:25:28 -0000
Received: from unknown (HELO bullfrog.localdomain) (sandra@127.0.0.2)   by mail.codesourcery.com with ESMTPA; 20 Aug 2007 23:25:28 -0000
Message-ID: <46CA222D.2050107@codesourcery.com>
Date: Mon, 20 Aug 2007 23:38:00 -0000
From: Sandra Loosemore <sandra@codesourcery.com>
User-Agent: Thunderbird 2.0.0.5 (X11/20070716)
MIME-Version: 1.0
To: GCC Patches <gcc-patches@gcc.gnu.org>,   Nigel Stephens <nigel@mips.com>,  Guy Morrogh <guym@mips.com>, David Ung <davidu@mips.com>,   Thiemo Seufer <ths@mips.com>,  Mark Mitchell <mark@codesourcery.com>,  richard@codesourcery.com
Subject: Re: PATCH: fine-tuning for can_store_by_pieces
References: <46C3343A.5080407@codesourcery.com> <87ps1nop2x.fsf@firetop.home>	<46C778D6.5060808@codesourcery.com> <87y7g6r50c.fsf@firetop.home>
In-Reply-To: <87y7g6r50c.fsf@firetop.home>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
X-SW-Source: 2007-08/txt/msg01312.txt.bz2

Richard Sandiford wrote:

> Thanks for the testing.  In that case, I agree 4 is fine for everything.
> If you still have the results, could you post the totals?  I'm curious
> what kind of figures we're talking about here.

Here's what I have.  Apart from the mips64-elfoabi measurements of the
original version of the patch, everything was done with mips32r2-elfoabi;
the numbers are total sizes from CSiBE.

                default    -mips16   -mabicalls     mips64
baseline        3583977    2860177                 3558373
call ratio 3    3566997    2859401      4039960    3541493
call ratio 4    3565961    2858881      4037876
call ratio 5    3566857    2859901      4037172
call ratio 6                            4037332
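
To put the totals in perspective, the differences can be worked out mechanically.  The following is only a sketch: the array and function names are made up for illustration, and the sizes are copied straight from the table (in whatever unit CSiBE reports).

```c
#include <stddef.h>

/* CSiBE totals copied from the table; index 0 is call ratio 3. */
const long default_sizes[]  = { 3566997, 3565961, 3566857 };          /* ratios 3..5 */
const long mips16_sizes[]   = { 2859401, 2858881, 2859901 };          /* ratios 3..5 */
const long abicalls_sizes[] = { 4039960, 4037876, 4037172, 4037332 }; /* ratios 3..6 */

/* Size reduction relative to the unpatched baseline. */
long size_saved(long baseline, long size)
{
    return baseline - size;
}

/* Return the call ratio with the smallest total, given that sizes[0]
   was measured with call ratio FIRST_RATIO and later entries with
   consecutive ratios.  */
int best_call_ratio(const long *sizes, size_t n, int first_ratio)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (sizes[i] < sizes[best])
            best = i;
    return first_ratio + (int) best;
}
```

For example, size_saved(3583977, 3565961) is 18016, about 0.5% of the default baseline.  best_call_ratio picks 4 for the default and -mips16 columns and 5 for -mabicalls, where ratios 4 and 5 differ by only 704.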

>> + #define MIPS_CALL_RATIO 4
> 
> I think the number you use in CLEAR_RATIO (MIPS_CALL_RATIO + 2)
> is effectively estimating the number of instruction for a call.
> ISTM CLEAR_RATIO is basically being compared against an estimate of
> the number of zero stores, and zero stores are 1 instruction on MIPS.
> (Also, nothing really explained why CLEAR_RATIO adds a magic 2 to the
> ratio.)
> 
> So I think this should really be 6 and that CLEAR_RATIO should be:
> 
> #define CLEAR_RATIO	(optimize_size ? MIPS_CALL_RATIO : 15)
> 
> Then...
> 
>> + #define MOVE_RATIO ((TARGET_MIPS16 || TARGET_MEMCPY) ? MIPS_CALL_RATIO : 2)
> 
> ...a comment in the original patch said that MOVE_RATIO effectively
> counted memory-to-memory moves.  I think that was a useful comment,
> and that the use of the old MIPS_CALL_RATIO above should be the new
> MIPS_CALL_RATIO / 2.  Conveniently, that gives us the 3 that you had
> in the original patch.  

Except that 4 seems to be a better number, and that number doesn't fall out of 
this theory.  I guess I could run some tests with different values for 
CLEAR_RATIO too, and just document both numbers as being experimentally determined?

> (You didn't say whether you'd benchmarked
> -mips16 or -mmemcpy; if so, did you see any difference between a
> MOVE_RATIO of 3 and a MOVE_RATIO of 4?)

I tried -mips16 but not -mmemcpy.  See table above.

>> + /* STORE_BY_PIECES_P can be used when copying a constant string, but
>> +    in that case each word takes 3 insns (lui, ori, sw), or more in
>> +    64-bit mode, instead of 2 (lw, sw). So better to always fail this
>> +    and let the move_by_pieces code copy the string from read-only
>> +    memory.  */
>> + 
>> + #define STORE_BY_PIECES_P(SIZE, ALIGN) 0
> 
> You asked when lui/ori/sw might be faster.  Consider a three-word
> store on a typical 2-way superscalar target:
> 
>   Cycle 1:    lui     lui
>         2:    ori     ori
>         3:    sw             lui
>         4:            sw     ori
>         5:                   sw
> 
> That's 5 cycles.  The equivalent lw/sw version is at least 6 cycles
> (more if the read-only string is not in cache).
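
The cycle counts in the schedule above can be reproduced with a toy issue model.  This is a sketch under stated assumptions, not a description of any real MIPS pipeline: it assumes two issue slots per cycle, a single load/store port, and a one-cycle latency between dependent instructions.  Only "2-way superscalar" comes from the message; the other parameters and all of the names are hypothetical.

```c
#include <stdbool.h>

/* Assumed machine parameters (hypothetical). */
enum { ISSUE_WIDTH = 2, MEM_PORTS = 1, MAX_WORDS = 64 };

/* Greedily issue WORDS independent chains of CHAIN_LEN instructions.
   is_mem[i] says whether step i of a chain is a load or store.
   Each chain issues at most one instruction per cycle, which models
   the one-cycle dependency between its steps.
   Returns the cycle count, or -1 if WORDS exceeds MAX_WORDS.  */
int schedule_cycles(int words, const bool *is_mem, int chain_len)
{
    int next[MAX_WORDS] = { 0 };   /* next step to issue, per chain */
    int remaining, cycles = 0;

    if (words > MAX_WORDS)
        return -1;
    remaining = words * chain_len;
    while (remaining > 0) {
        int slots = ISSUE_WIDTH, ports = MEM_PORTS;
        cycles++;
        for (int w = 0; w < words && slots > 0; w++) {
            if (next[w] == chain_len)
                continue;          /* chain already finished */
            bool mem = is_mem[next[w]];
            if (mem && ports == 0)
                continue;          /* load/store port is busy */
            next[w]++;
            slots--;
            remaining--;
            if (mem)
                ports--;
        }
    }
    return cycles;
}

/* lui, ori, sw: only the final store uses the memory port. */
int lui_ori_sw_cycles(int words)
{
    static const bool mem[] = { false, false, true };
    return schedule_cycles(words, mem, 3);
}

/* lw, sw: both instructions use the memory port. */
int lw_sw_cycles(int words)
{
    static const bool mem[] = { true, true };
    return schedule_cycles(words, mem, 2);
}
```

Under these assumptions lui_ori_sw_cycles(3) returns 5 and lw_sw_cycles(3) returns 6, matching the figures quoted above.  A real load has more than one cycle of latency, which is why 6 is only a lower bound for the lw/sw version.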

OK, but what I was really asking was, is there a way to *test* for situations 
where we should generate the lui/ori/sw sequences instead of the lw/sw?  Some 
combination of TARGET_foo flags and/or the size of the string?

-Sandra the clueless