Multiplication of a block of integers

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* Multiplication of a block of integers
       [not found]                 ` <20010902110741.D13434@atrey.karlin.mff.cuni.cz>
@ 2001-09-02  8:45                   ` Frank Klemm
  0 siblings, 0 replies; 21+ messages in thread
From: Frank Klemm @ 2001-09-02  8:45 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc

This is a simple function multipliying a block of integers with 5000.

C code needs 6.35 clocks per item
unrolled C code needs 5.4 clocks per item
stupid assembler code needs 4 clocks
unrolled assembler code 2.26 clocks

Because of a factor of 1:2.4 between Assembler and C this may be interesting.

I know this is not a typical application, but it show some weak points.

-- 
Frank Klemm

^ permalink raw reply	[flat|nested] 21+ messages in thread

* mul + div with 64 bit signed ints on IA32
       [not found]                 ` <20010903171717.E13574@atrey.karlin.mff.cuni.cz>
@ 2001-09-04 13:04                   ` Frank Klemm
  2001-09-04 15:10                     ` Tim Prince
       [not found]                   ` <20010904003947.D335@fuchs.offl.uni-jena.de>
  1 sibling, 1 reply; 21+ messages in thread
From: Frank Klemm @ 2001-09-04 13:04 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc

There are possibilities to speed up the following operations for IA32

	int64_t * int64_t

	int64_t / int64_t

	int64_t % int64_t	(?)

Interests? Is this used in any benchmark?

Another floating point problem are the rounding bits of the FPU.
It should be forced that these two bits are always '11' (round to zero).
This would decrease code size and speed up significantly the code.

I know the is a problem with ill written assembler code which polutes these
two bits. May be a compiler switch switches between 'safe' and 'fast'
behaviour. If you have 'clean' assembler libs, you can use 'fast', otherwise
you must use 'safe'.

It is stupid to save, modofy and restore the RC bits several million times
per second, especially because this operation is VERY expensive.

A rounding of a 'double' to an 'int' with ANSI-C took 160 clocks on a K6-2.
In the same time it was possible to calculate the scalar of _two_ 1200 byte
long vectors of float values. This is brain dead!

Here we have problems with the design of C and with the design of the FPU
of the iA32 architecture.

Option proposals:

	-fsaverc
	-ffastrc
	-fsavecld	; the same for the cld flag
	-ffastcld

-- 
Frank Klemm

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: mul + div with 64 bit signed ints on IA32
  2001-09-04 13:04                   ` mul + div with 64 bit signed ints on IA32 Frank Klemm
@ 2001-09-04 15:10                     ` Tim Prince
  2001-09-05  4:44                       ` Jan Hubicka
  0 siblings, 1 reply; 21+ messages in thread
From: Tim Prince @ 2001-09-04 15:10 UTC (permalink / raw)
  To: Frank Klemm, Jan Hubicka; +Cc: gcc

----- Original Message -----
From: "Frank Klemm" <pfk@fuchs.offl.uni-jena.de>
To: "Jan Hubicka" <jh@suse.cz>
Cc: <gcc@gcc.gnu.org>
Sent: Tuesday, September 04, 2001 12:51 PM
Subject: mul + div with 64 bit signed ints on IA32


>
> A rounding of a 'double' to an 'int' with ANSI-C took 160
clocks on a K6-2.
> In the same time it was possible to calculate the scalar of
_two_ 1200 byte
> long vectors of float values. This is brain dead!

I don't see that you have specified which gcc version you have
chosen.  The 3.0 series chooses by far the slowest method.  I
don't attempt to run gcc on my K6-2 anyway.  Honza has done an
excellent job of correcting this in gcc-3.1, without taking any
unusual shortcuts, implementing both x87 and SSE2.  I hope you
don't advocate undoing his work in favor of something totally
strange.
>
> Here we have problems with the design of C and with the design
of the FPU
> of the iA32 architecture.
No doubt an unusually slow code generation choice could be made
for almost any architecture.  It's true that the idea of a single
instruction to store float to integer with truncation toward zero
came late in the IA32 series, but the gcc-3.0 code sequence is
contrary to any recommendation ever published.  Compaq just
corrected a similar problem in their compilers, before the Intel
acquisition.
>
>
> Option proposals:
>
> -fsaverc
> -ffastrc
> -fsavecld ; the same for the cld flag
> -ffastcld
>
glibc includes implementations of lrint() and the like.  I'd like
to see something like -ffast-rint to support g77 (Fortran
spelling -ffast-nint), which has a precedent in the MipsPro
compilers.  That only modifies the code to accept IEEE style
round-to-nearest in place of Fortran style round-to-nearest.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: mul + div with 64 bit signed ints on IA32
  2001-09-04 15:10                     ` Tim Prince
@ 2001-09-05  4:44                       ` Jan Hubicka
  2001-09-05  5:49                         ` Tim Prince
  2001-09-07 11:54                         ` Multiplications on Pentium 4 Frank Klemm
  0 siblings, 2 replies; 21+ messages in thread
From: Jan Hubicka @ 2001-09-05  4:44 UTC (permalink / raw)
  To: Tim Prince; +Cc: Frank Klemm, Jan Hubicka, gcc

> > Option proposals:
> >
> > -fsaverc
> > -ffastrc
> > -fsavecld ; the same for the cld flag
> > -ffastcld
> >
> glibc includes implementations of lrint() and the like.  I'd like
> to see something like -ffast-rint to support g77 (Fortran
> spelling -ffast-nint), which has a precedent in the MipsPro
> compilers.  That only modifies the code to accept IEEE style
> round-to-nearest in place of Fortran style round-to-nearest.

This appears to be requested commonly enought and make perfect sense for 3d
software, where converison to integer is perofrmance cirtical and IEEE
behaviour secondary.  With my new 3.1 mode switching implementation it can be
easy to make gcc expect "round towards zero" CW setting, so I can implement it.

Insead of -ffast-rint, I think we can name it -mround-towards-zero, as it
will make user to more think what the switch actually does.
When compiling it can make fast fp->int converison to be generated and we
can manage spec file to link in an library to actually set the switch,
as we do on other platforms for -ffast-math.

Sounds sensible?

Honza
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: mul + div with 64 bit signed ints on IA32
  2001-09-05  4:44                       ` Jan Hubicka
@ 2001-09-05  5:49                         ` Tim Prince
  2001-09-05 12:16                           ` Richard Henderson
  2001-09-07 11:54                         ` Multiplications on Pentium 4 Frank Klemm
  1 sibling, 1 reply; 21+ messages in thread
From: Tim Prince @ 2001-09-05  5:49 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Frank Klemm, Jan Hubicka, gcc

----- Original Message -----
From: "Jan Hubicka" <jh@suse.cz>
To: "Tim Prince" <tprince@computer.org>
Cc: "Frank Klemm" <pfk@fuchs.offl.uni-jena.de>; "Jan Hubicka"
<jh@suse.cz>; <gcc@gcc.gnu.org>
Sent: Wednesday, September 05, 2001 4:44 AM
Subject: Re: mul + div with 64 bit signed ints on IA32


> > > Option proposals:
> > >
> > > -fsaverc
> > > -ffastrc
> > > -fsavecld ; the same for the cld flag
> > > -ffastcld
> > >
> > glibc includes implementations of lrint() and the like.  I'd
like
> > to see something like -ffast-rint to support g77 (Fortran
> > spelling -ffast-nint), which has a precedent in the MipsPro
> > compilers.  That only modifies the code to accept IEEE style
> > round-to-nearest in place of Fortran style round-to-nearest.
>
> This appears to be requested commonly enought and make perfect
sense for 3d
> software, where converison to integer is perofrmance cirtical
and IEEE
> behaviour secondary.  With my new 3.1 mode switching
implementation it can be
> easy to make gcc expect "round towards zero" CW setting, so I
can implement it.
>
> Insead of -ffast-rint, I think we can name
it -mround-towards-zero, as it
> will make user to more think what the switch actually does.
> When compiling it can make fast fp->int converison to be
generated and we
> can manage spec file to link in an library to actually set the
switch,
> as we do on other platforms for -ffast-math.
>
> Sounds sensible?
>
> Honza
> >
My personal belief is that code should be using the C99 or f77
round-to-nearest functions where they are wanted or at least
acceptable, and that compilers should provide means for in-line
compilation of such round-to-nearest operations.  I don't find
the Intel compiler option which substitutes round-to-nearest for
all (int) casts so suitable.  Evidently, others differ with me.

These problems with performance of (int) casts have received
enough attention lately, that future CPU families are likely to
take care of them with hardware instructions.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: mul + div with 64 bit signed ints on IA32
  2001-09-05  5:49                         ` Tim Prince
@ 2001-09-05 12:16                           ` Richard Henderson
  0 siblings, 0 replies; 21+ messages in thread
From: Richard Henderson @ 2001-09-05 12:16 UTC (permalink / raw)
  To: Tim Prince; +Cc: Jan Hubicka, Frank Klemm, gcc

On Wed, Sep 05, 2001 at 05:49:15AM -0700, Tim Prince wrote:
> My personal belief is that code should be using the C99 or f77
> round-to-nearest functions where they are wanted or at least
> acceptable, and that compilers should provide means for in-line
> compilation of such round-to-nearest operations.  I don't find
> the Intel compiler option which substitutes round-to-nearest for
> all (int) casts so suitable.

I completely agree.  If people care so much about performance,
they can be bothered to write rint(x) instead of (int)x.

As for proper inlining, we already have FIX_ROUND_EXPR for trees,
all we need is a fix_round rtx code to match.


r~

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Multiplications on Pentium 4
  2001-09-05  4:44                       ` Jan Hubicka
  2001-09-05  5:49                         ` Tim Prince
@ 2001-09-07 11:54                         ` Frank Klemm
  2001-09-07 12:14                           ` Michael Meissner
                                             ` (2 more replies)
  1 sibling, 3 replies; 21+ messages in thread
From: Frank Klemm @ 2001-09-07 11:54 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc

The Pentim 4 is so different from all other CPUs so I must write a special
Code Choice Generator. Some Examples:

	imul:		14 Clocks Latency
	shl:		 4 Clocks Latency
	lea (,,1)	 0.5 Clocks  Latency
	lea (,,2)	 4 Clocks  Latency
	lea (,,4)	 4 Clocks  Latency
	lea (,,8)	 4 Clocks  Latency
	add, sub, neg:	 0.5 Clocks Latency
	mov		 0...0.5 Clocks Latency

This generates fully different Code compared with i386...Pentium-III,
K5...Athlon.

Optimizing code for size is easy. It's the same as for other CPUs.

Optimizing for speed normally blows the code. Nearly always 
cascades of adds and lea(,,1) are the fastest solution, also
for huge multiplier. Code can increase up to 50 bytes for ONE
multiplication (2 register solutions).

Only few multiplier. are a _little_ bit faster using the imul
instruction. So the optimization is more a speed <=> code size
tradeoff.

So it should be programmed a proposal generator which generates
the shortest path method for a given multiplier.

Examples: *12:

	lea (r,r,1),t; add t,r; add r,r; add r,r		 2 Clocks (1)
	lea (r,2,2),r; shl $2,r					 8 Clocks (2)
	imul $12,r						14 Clocks (4.667)

Latency! Throughput is higher (in () ).

-- 
Frank Klemm

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Multiplications on Pentium 4
  2001-09-07 11:54                         ` Multiplications on Pentium 4 Frank Klemm
@ 2001-09-07 12:14                           ` Michael Meissner
  2001-09-08  8:28                           ` Jan Hubicka
  2001-09-10  2:11                           ` Multiplications on Pentium 4 Torbjorn Granlund
  2 siblings, 0 replies; 21+ messages in thread
From: Michael Meissner @ 2001-09-07 12:14 UTC (permalink / raw)
  To: Frank Klemm; +Cc: Jan Hubicka, gcc

On Fri, Sep 07, 2001 at 08:04:03PM +0200, Frank Klemm wrote:
> The Pentim 4 is so different from all other CPUs so I must write a special
> Code Choice Generator. Some Examples:
> 
> 
> 	imul:		14 Clocks Latency
> 	shl:		 4 Clocks Latency
> 	lea (,,1)	 0.5 Clocks  Latency
> 	lea (,,2)	 4 Clocks  Latency
> 	lea (,,4)	 4 Clocks  Latency
> 	lea (,,8)	 4 Clocks  Latency
> 	add, sub, neg:	 0.5 Clocks Latency
> 	mov		 0...0.5 Clocks Latency
> 
> This generates fully different Code compared with i386...Pentium-III,
> K5...Athlon.

As a historical trivia, I believe a similar situation occurred with with the
486, which had a huge penalty for imul (according to the gcc sources, 15 clocks
+ 1 clock for every bit set).  I recall that in some cases replacing a divide
of a constant by a 32x32->64 bit multiply of a large constant, actually took
more cycles to do it as a multiple than a divide.

-- 
Michael Meissner, Red Hat, Inc.  (GCC group)
PMB 198, 174 Littleton Road #3, Westford, Massachusetts 01886, USA
Work:	  meissner@redhat.com		phone: +1 978-486-9304
Non-work: meissner@spectacle-pond.org	fax:   +1 978-692-4482

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Multiplication
       [not found]                         ` <20010907111255.D9651@atrey.karlin.mff.cuni.cz>
@ 2001-09-07 13:46                           ` Frank Klemm
  0 siblings, 0 replies; 21+ messages in thread
From: Frank Klemm @ 2001-09-07 13:46 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc

On Fri, Sep 07, 2001 at 11:12:55AM +0200, Jan Hubicka wrote:
> > The Multiplication cost table for the Pentium 4 may be at monday.
> > I'm not sure about the search depth. 5 or 6. For a search depth of 6
> > may be a weekend be too short.
>
> P4 will be interesting, due to extreme cost of multiply and shifts :)
>
I need a search depth of up to 28. This needs something around 10^41 years
for my current program for searching the best solution ...

Currently I'm only search two register solutions. Is this enough for IA32?
For CPUs with much more registers (IA64, PPC, Alpha) you also interested in
three or four register solutions?

How does the multiplication code generator work? My stupid idea is:

  - Code_Proposal_Generator [size] (multiplier)
    Code_Proposal_Generator [time] (CPU_type, multiplier)

        emits some known useful proposals for  a *= b;

  - Multiple solutions are necessary because there are also other conditions
    affecting which of them is the best:
        src: mem/reg? 
        dst: mem/reg?
        src!=dst?
        # of free registers?

  - Multiple solutions within small code ranges will be weighted.

  - All solutions have their calculated costs:
      - latency
      - throughput (for larger peepholes throughput stalls will be
        calculated into additional latency)
      - registers used (especially on systems with rare regs)
        (for larger peepholes register usage will be
        calculated into additional load/stores and related latency)
      - code size

  - For very large peeopholes (;-) you have all transformed into
      - latency
      - code size
  
  - Depending on the -O setting the "best" solution is selected.
    On the Pentium 4 fast solutions are often huge and the 1st level code
    cache is very small. So size also affects speed. But this also depends
    on the use pattern. This smell
    like a more sophisticated solution for code optimzation like
    some switches -O -O2 -Os and -O3.

Proposal:


Cost from the view of compile time:

-O0	no optimization, compile as fast as you can! using as less memory as you can
-O1	simple optimizations, speed and memory usage should increase by 10...20% (typical)
-O2	optimizations, speed and memory usage should increase by 50...100% (typical)
-O3	high optimizations, speed and memory usage should increase by 200...1000% (typical)
-O4	highest optimizations, speed and memory usage should increase by 1000...100000% (typical)

This can be done by switch on/off optimization features and setting search
depth and things like that (-O4 unlimited).

parameters like treedepth and things like that.


-Os1...9
-Ot1...9
-Os0		(== -Ot0)
-Ot0		(== -Os0)
-Os		(== -Os5)

old -O  is something like -Ot2
old -O2 is something like -Ot5
old -O3 is something like -Ot8

This selects size <=> speed trade off. So -O4 -Os9 generates smallest code
using all possible methods, also it tries all methods normally known to
increase code size (-O4).

-Os9 affects RTL optimization (never use a solution which is one bit
longer), but also code generator:

int two ( void )
{
    return 2;
}

-Ot9...-Ot0, -Os0...-Ot5:


	movl	$2, %eax
	ret

-Ot6...9:

	xorl	%eax, %eax
	movb	$2, %al		# saves 1 byte
	ret

-- 
Frank Klemm

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Multiplications on Pentium 4
  2001-09-07 11:54                         ` Multiplications on Pentium 4 Frank Klemm
  2001-09-07 12:14                           ` Michael Meissner
@ 2001-09-08  8:28                           ` Jan Hubicka
  2001-09-08 17:13                             ` typed cast's Frank Klemm
  2001-09-10  2:11                           ` Multiplications on Pentium 4 Torbjorn Granlund
  2 siblings, 1 reply; 21+ messages in thread
From: Jan Hubicka @ 2001-09-08  8:28 UTC (permalink / raw)
  To: Frank Klemm; +Cc: Jan Hubicka, gcc

> The Pentim 4 is so different from all other CPUs so I must write a special
> Code Choice Generator. Some Examples:
> 
> 
> 	imul:		14 Clocks Latency
> 	shl:		 4 Clocks Latency
> 	lea (,,1)	 0.5 Clocks  Latency
> 	lea (,,2)	 4 Clocks  Latency
> 	lea (,,4)	 4 Clocks  Latency
> 	lea (,,8)	 4 Clocks  Latency
Actually lea for ,,2 can be rewriten to lea doing addition, that is faster.
The rule is that shift has 4 cycle latency, while add 0.5.
Lea is broken to trivial operations, so for your measurements you probably
can ignore her existence.
> 	add, sub, neg:	 0.5 Clocks Latency
> 	mov		 0...0.5 Clocks Latency
> 
> This generates fully different Code compared with i386...Pentium-III,
> K5...Athlon.
Agreed. Thats the poroblem.
Other problem is that imul's and shift's extreme latency causes that
we can benefit from replacing it by relativly many adds, but P4 is
limited by trace cache. More adds, less cache space so this tradeoff
needs to be controlled mainly by program's profile to find hot spots
and aditionally by scheduler to reduce only critical paths trought
BB.  This is _extremly_ dificult to integrate to existing gcc model.

I hope that Intel will realize that and do some funding to gcc development
as good Pentium4 support will be tricky.

Honza
> 
> Optimizing code for size is easy. It's the same as for other CPUs.
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Type problems with C
       [not found]                             ` <20010908182611.N8451@atrey.karlin.mff.cuni.cz>
@ 2001-09-08 17:13                               ` Frank Klemm
  0 siblings, 0 replies; 21+ messages in thread
From: Frank Klemm @ 2001-09-08 17:13 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc

For data types there is a generic type for generic pointers:

    void*

For functions there is no such type because functions are normally
incompatible. But Functions with the same return type and the same parameter
layout can be compatible.

Is it possible to define a type which is compatible with
fn_t, fn_schar_t and fn_uchar_t ?

For data this is void*:

void*  p1 =(void*)...;
void*  p1 =(char*)...;
void*  p1 =(signed char*)...;
void*  p1 =(unsigned char*)...;

Is there something similar for function pointers?


------------------------------------------------------------------

int  p1_strcmp ( void** p1, void** p2 )
{
    signed char*    q1 = *(signed char**)p1;		<---- ugly typecast
    signed char*    q2 = *(signed char**)p2;		<---- ugly typecast
     
    for ( ; *q1; q1++, q2++ )
        if (*q1 != *q2)
            return *q1 - *q2;
    return *q1 - *q2;
}

int  p2_strcmp ( void** p1, void** p2 )
{
    unsigned char*  q1 = *(unsigned char**)p1;		<---- ugly typecast
    unsigned char*  q2 = *(unsigned char**)p2;		<---- ugly typecast
     
    for ( ; *q1; q1++, q2++ )
        if (*q1 != *q2)
            return *q1 - *q2;
    return *q1 - *q2;
}

int  p3_strcmp ( signed char** p1, signed char** p2 )
{
    signed char*    q1 = *p1;
    signed char*    q2 = *p2;
     
    for ( ; *q1; q1++, q2++ )
        if (*q1 != *q2)
            return *q1 - *q2;
    return *q1 - *q2;
}

int  p4_strcmp ( unsigned char** p1, unsigned char** p2 )
{
    unsigned char*  q1 = *p1;
    unsigned char*  q2 = *p2;
     
    for ( ; *q1; q1++, q2++ )
        if (*q1 != *q2)
            return *q1 - *q2;
    return *q1 - *q2;
}

typedef int (*fn_t)       (void**, void**);
typedef int (*fn_schar_t) (signed char**, signed char**);
typedef int (*fn_uchar_t) (unsigned char**, unsigned char**);

fn_t  p1 = p1_strcmp;
fn_t  p2 = p2_strcmp;
fn_t  p3 = (fn_t)p3_strcmp;				<---- ugly typecast
fn_t  p4 = (fn_t)p4_strcmp;				<---- ugly typecast
fn_t  p5 = p3_strcmp;					<---- warning or error (C++)
fn_t  p6 = p4_strcmp;					<---- warning or error (C++)

-- 
Frank Klemm

^ permalink raw reply	[flat|nested] 21+ messages in thread

* typed cast's
  2001-09-08  8:28                           ` Jan Hubicka
@ 2001-09-08 17:13                             ` Frank Klemm
  2001-09-08 18:12                               ` Zack Weinberg
  0 siblings, 1 reply; 21+ messages in thread
From: Frank Klemm @ 2001-09-08 17:13 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc

This has nothing to do with optimization.

This has something to do with the parser.



In C you can convert an expression of (nearly) any type to another type.

    float  f = (float) (&f);

Typecasting is dangerous. But sometimes you need typecasting.
Often you want to cast type A to type B.
C has only the possibity to cast any type to type B.


int    x [128];
float  f [128];

// p is a char*, but can also work with int-data from x, but never can work with float data)
char*  p = (char*) x;		// dangerous
char*  p = (char*) f;		// bug, but C can't determine that

char*  q = (char*,int*) x;	// explicit convering from int* to char*	
char*  q = (char*,int*) f;	// compile time error

Especially if you are using macros this can be useful to determine bugs as
early as possible.


Syntax:

	(dest_type)
	(dest_type,allowed_src_type)
	(dest_type,allowed_src_type1,allowed_src_type2)
	(dest_type,allowed_src_type1,allowed_src_type2,allowed_src_type3)

etc.

Examples:
	(size_t,int)-1
	(void*,int)0
	*p++   = (uint8_t,uint32_t)(word32 >>  8);
	*p++ = (int,unsigned int) Bitstream_read (Res_bit[Res[Band].R]) - Dc[Res[Band].R];


-- 
Frank Klemm

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: typed cast's
  2001-09-08 17:13                             ` typed cast's Frank Klemm
@ 2001-09-08 18:12                               ` Zack Weinberg
  0 siblings, 0 replies; 21+ messages in thread
From: Zack Weinberg @ 2001-09-08 18:12 UTC (permalink / raw)
  To: Frank Klemm; +Cc: Jan Hubicka, gcc

On Sun, Sep 09, 2001 at 12:41:39AM +0200, Frank Klemm wrote:
> This has nothing to do with optimization.
> 
> This has something to do with the parser.

In the future, would you mind starting a new thread if you want to
discuss a completely different subject?

> Typecasting is dangerous. But sometimes you need typecasting.
> Often you want to cast type A to type B.
> C has only the possibity to cast any type to type B.
[etc]

I believe it is the general feeling of the group that we already have
too many extensions, and we are not interested in adding new ones.

I'd like to point out that most of the casts in code I write are
because I don't _know_ the type of the item being cast from.  The
usual example is something like

printf("%lu\n", (unsigned long)value);

where the type of VALUE is e.g. "time_t", which does not correspond to
a printf letter, and assuming it's any specific one is wrong.

zw

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Simple optimizations missing
       [not found]                             ` <20010907111641.G9651@atrey.karlin.mff.cuni.cz>
  2001-09-09  7:20                               ` Code size Frank Klemm
@ 2001-09-09  7:20                               ` Frank Klemm
  2001-09-09  7:39                                 ` Graham Stott
  2001-09-09  7:45                                 ` Andreas Jaeger
  1 sibling, 2 replies; 21+ messages in thread
From: Frank Klemm @ 2001-09-09  7:20 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc

Expression optimizations:

	x / x = 1
	x % x = 0

Trivial.

----------------------------------------------------------

Another thing which can save at least size:

 b is a 8 bit value. There is an expression like:

	( b + otherstuff ) & mask

 otherstuff is any other expression, can also be very complex.
 mask is a value between 0 and 255.

 So you can change

	movzbl	(....), %eax
	addl	...
	andl	$mask, %eax
  or
	movzbl	%al, %eax
	addl	...
	andl	$mask, %eax

 to

	movzb	(....), %al		# may be only if optimizing for size
	addl	...
	andl	$mask, %eax
  or 
	addl	...			# always good
	andl	$mask, %eax


 This optimzation can also be done for

	(b + otherstuff) | value

 if value is in the range from 0xFFFFFF00...0xFFFFFFFF.

 This optimization is also possible for 16 bit quantities. Range of the masks
 is different. The operator '+' can also be:

	+
	-
	^
	|
	&

 or a mixture of them.

-- 
Frank Klemm

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Code size
       [not found]                             ` <20010907111641.G9651@atrey.karlin.mff.cuni.cz>
@ 2001-09-09  7:20                               ` Frank Klemm
  2001-09-09  7:20                               ` Simple optimizations missing Frank Klemm
  1 sibling, 0 replies; 21+ messages in thread
From: Frank Klemm @ 2001-09-09  7:20 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc

long long  x;

void main (void)
{
    x &= 100;
}

------------------------------------------------
Code with -Os (31 bytes + 7.5 bytes for alignment = 38.5 bytes on average)

.align 16
.global main
.type   main,@function

main:
        pushl   %ebp
        xorl    %edx, %edx
        movl    %esp, %ebp
        pushl   %eax
        pushl   %eax
        movl    x, %eax
        andl    $-16, %esp
        movl    %edx, x+4
        andl    $100, %eax
        movl    %eax, x
        leave
        ret
-------------------------------------------------
Code with -Os -fomit-frame-pointer (31 bytes + 7.5 bytes for alignment = 38.5 bytes on average)

.align 16
.global main
.type   main,@function

main:
        pushl   %ebp			<---- incomprehensible
        xorl    %edx, %edx		<---- okay
        movl    %esp, %ebp		<---- incomprehensible
        pushl   %eax			<---- incomprehensible
        pushl   %eax			<---- incomprehensible
        movl    x, %eax			<---- okay
        andl    $-16, %esp		<---- incomprehensible
        movl    %edx, x+4		<---- okay
        andl    $100, %eax		<---- okay
        movl    %eax, x			<---- okay
        leave				<---- incomprehensible
        ret				<---- okay
-------------------------------------------------
Optimized code (14 bytes):

.text
.global main
.type   main,@function
					<---- no .align, we want optimize for size
main:
        xorl    %eax, %eax		<---- xor r,r; mov r,(mem) is shorter than mov $0,(mem)
        movl    $x, %ecx		<---- we have a free reg and x is accessed more than once
        andl    $100, (%ecx)		<---- operation can be done directly in memory
        movl    %eax, 4(%ecx)		<---- wipe memory
        ret

May be the last two statements must be switched to improve speed.

-- 
Frank Klemm

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Simple optimizations missing
  2001-09-09  7:20                               ` Simple optimizations missing Frank Klemm
@ 2001-09-09  7:39                                 ` Graham Stott
  2001-09-09  7:45                                 ` Andreas Jaeger
  1 sibling, 0 replies; 21+ messages in thread
From: Graham Stott @ 2001-09-09  7:39 UTC (permalink / raw)
  To: Frank Klemm; +Cc: Jan Hubicka, gcc

Frank Klemm wrote:
> 
> Expression optimizations:
> 
>         x / x = 1
>         x % x = 0
> 
> Trivial.
Only if you can deduce that x != 0


> --
> Frank Klemm

Graham

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Simple optimizations missing
  2001-09-09  7:20                               ` Simple optimizations missing Frank Klemm
  2001-09-09  7:39                                 ` Graham Stott
@ 2001-09-09  7:45                                 ` Andreas Jaeger
  1 sibling, 0 replies; 21+ messages in thread
From: Andreas Jaeger @ 2001-09-09  7:45 UTC (permalink / raw)
  To: Frank Klemm; +Cc: Jan Hubicka, gcc

Frank Klemm <pfk@fuchs.offl.uni-jena.de> writes:

> Expression optimizations:
>
> 	x / x = 1
> 	x % x = 0

For floating point or for integer?

For floating point: 0.0/0.0 might generate a floating point exceptions
and the result is a NaN.  The same applies for the infinites.

Frank, I'm looking forward to your patches hoping that they're as good
as your analysis,

Andreas
-- 
 Andreas Jaeger
  SuSE Labs aj@suse.de
   private aj@arthur.inka.de
    http://www.suse.de/~aj

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Multiplications on Pentium 4
  2001-09-07 11:54                         ` Multiplications on Pentium 4 Frank Klemm
  2001-09-07 12:14                           ` Michael Meissner
  2001-09-08  8:28                           ` Jan Hubicka
@ 2001-09-10  2:11                           ` Torbjorn Granlund
  2001-09-10  4:44                             ` Jan Hubicka
  2 siblings, 1 reply; 21+ messages in thread
From: Torbjorn Granlund @ 2001-09-10  2:11 UTC (permalink / raw)
  To: Frank Klemm; +Cc: Jan Hubicka, gcc

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 522 bytes --]

Frank Klemm <pfk@fuchs.offl.uni-jena.de> writes:

  The Pentim 4 is so different from all other CPUs so I must write a special
  Code Choice Generator. Some Examples:

I don't know what a Code Choice Generator is, but I know that
multiply-by-constant code generation is controlled by the *COSTS
macros.  If you set these up properly for Pentium 4, you should get
code sequences that use fewer shifts and more adds.

(It isn't going to become perfect since this code understands nothing
about scheduling.)

-- 
TorbjÃ¶rn

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Multiplications on Pentium 4
  2001-09-10  2:11                           ` Multiplications on Pentium 4 Torbjorn Granlund
@ 2001-09-10  4:44                             ` Jan Hubicka
  0 siblings, 0 replies; 21+ messages in thread
From: Jan Hubicka @ 2001-09-10  4:44 UTC (permalink / raw)
  To: Torbjorn Granlund; +Cc: Frank Klemm, Jan Hubicka, gcc

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 733 bytes --]

> Frank Klemm <pfk@fuchs.offl.uni-jena.de> writes:
> 
>   The Pentim 4 is so different from all other CPUs so I must write a special
>   Code Choice Generator. Some Examples:
> 
> I don't know what a Code Choice Generator is, but I know that
> multiply-by-constant code generation is controlled by the *COSTS
> macros.  If you set these up properly for Pentium 4, you should get
> code sequences that use fewer shifts and more adds.
Frank wrote3 his own code to generate various sequences and compare them.
His code produces better results in some cases, as it is aware of some details
gcc generator isn't.

Honza
> 
> (It isn't going to become perfect since this code understands nothing
> about scheduling.)
> 
> -- 
> TorbjÃ¶rn

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Multiplications on Pentium 4
  2001-09-08  8:31 Multiplications on Pentium 4 dewar
@ 2001-09-08  9:17 ` Jan Hubicka
  0 siblings, 0 replies; 21+ messages in thread
From: Jan Hubicka @ 2001-09-08  9:17 UTC (permalink / raw)
  To: dewar; +Cc: jh, pfk, gcc

> It's amazing how poor the scaling lea's are on the Pentium 4, probably they
> should never be generated.
They are no worse than shifts - actually LEA is simply translated to trivial
operations.
It is amazing how bad shifts are :) but replacing them by ADDs is not always
solution due to trace cache misses.

Honza
> 
> (at least you could take care of this by simply removing them :-)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Multiplications on Pentium 4
@ 2001-09-08  8:31 dewar
  2001-09-08  9:17 ` Jan Hubicka
  0 siblings, 1 reply; 21+ messages in thread
From: dewar @ 2001-09-08  8:31 UTC (permalink / raw)
  To: jh, pfk; +Cc: gcc

It's amazing how poor the scaling lea's are on the Pentium 4, probably they
should never be generated.

(at least you could take care of this by simply removing them :-)

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2001-09-10  4:44 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20010826202953.E2544@fuchs.offl.uni-jena.de>
     [not found] ` <20010826233634.A6693@atrey.karlin.mff.cuni.cz>
     [not found]   ` <20010827004731.G2544@fuchs.offl.uni-jena.de>
     [not found]     ` <20010827121624.D8568@atrey.karlin.mff.cuni.cz>
     [not found]       ` <20010827143032.C636@fuchs.offl.uni-jena.de>
     [not found]         ` <20010827173025.F11402@atrey.karlin.mff.cuni.cz>
     [not found]           ` <20010901202854.A7713@fuchs.offl.uni-jena.de>
     [not found]             ` <20010902000000.C27182@atrey.karlin.mff.cuni.cz>
     [not found]               ` <20010902024104.F7713@fuchs.offl.uni-jena.de>
     [not found]                 ` <20010902110741.D13434@atrey.karlin.mff.cuni.cz>
2001-09-02  8:45                   ` Multiplication of a block of integers Frank Klemm
     [not found]                 ` <20010903171717.E13574@atrey.karlin.mff.cuni.cz>
2001-09-04 13:04                   ` mul + div with 64 bit signed ints on IA32 Frank Klemm
2001-09-04 15:10                     ` Tim Prince
2001-09-05  4:44                       ` Jan Hubicka
2001-09-05  5:49                         ` Tim Prince
2001-09-05 12:16                           ` Richard Henderson
2001-09-07 11:54                         ` Multiplications on Pentium 4 Frank Klemm
2001-09-07 12:14                           ` Michael Meissner
2001-09-08  8:28                           ` Jan Hubicka
2001-09-08 17:13                             ` typed cast's Frank Klemm
2001-09-08 18:12                               ` Zack Weinberg
2001-09-10  2:11                           ` Multiplications on Pentium 4 Torbjorn Granlund
2001-09-10  4:44                             ` Jan Hubicka
     [not found]                   ` <20010904003947.D335@fuchs.offl.uni-jena.de>
     [not found]                     ` <20010905133553.E15564@atrey.karlin.mff.cuni.cz>
     [not found]                       ` <20010906210657.B420@fuchs.offl.uni-jena.de>
     [not found]                         ` <20010907111255.D9651@atrey.karlin.mff.cuni.cz>
2001-09-07 13:46                           ` Multiplication Frank Klemm
     [not found]                       ` <20010905182009.A339@fuchs.offl.uni-jena.de>
     [not found]                         ` <20010907120836.A14477@atrey.karlin.mff.cuni.cz>
     [not found]                           ` <20010907221210.G5281@fuchs.offl.uni-jena.de>
     [not found]                             ` <20010908182611.N8451@atrey.karlin.mff.cuni.cz>
2001-09-08 17:13                               ` Type problems with C Frank Klemm
     [not found]                 ` <20010903154513.C13574@atrey.karlin.mff.cuni.cz>
     [not found]                   ` <20010904003127.C335@fuchs.offl.uni-jena.de>
     [not found]                     ` <20010905133458.D15564@atrey.karlin.mff.cuni.cz>
     [not found]                       ` <20010905185526.A559@fuchs.offl.uni-jena.de>
     [not found]                         ` <20010906174007.B1486@atrey.karlin.mff.cuni.cz>
     [not found]                           ` <20010906222524.A2578@fuchs.offl.uni-jena.de>
     [not found]                             ` <20010907111641.G9651@atrey.karlin.mff.cuni.cz>
2001-09-09  7:20                               ` Code size Frank Klemm
2001-09-09  7:20                               ` Simple optimizations missing Frank Klemm
2001-09-09  7:39                                 ` Graham Stott
2001-09-09  7:45                                 ` Andreas Jaeger
2001-09-08  8:31 Multiplications on Pentium 4 dewar
2001-09-08  9:17 ` Jan Hubicka

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).