* Multiplication of a block of integers [not found] ` <20010902110741.D13434@atrey.karlin.mff.cuni.cz> @ 2001-09-02 8:45 ` Frank Klemm 0 siblings, 0 replies; 21+ messages in thread From: Frank Klemm @ 2001-09-02 8:45 UTC (permalink / raw) To: Jan Hubicka; +Cc: gcc This is a simple function multipliying a block of integers with 5000. C code needs 6.35 clocks per item unrolled C code needs 5.4 clocks per item stupid assembler code needs 4 clocks unrolled assembler code 2.26 clocks Because of a factor of 1:2.4 between Assembler and C this may be interesting. I know this is not a typical application, but it show some weak points. -- Frank Klemm ^ permalink raw reply [flat|nested] 21+ messages in thread
[parent not found: <20010903171717.E13574@atrey.karlin.mff.cuni.cz>]
* mul + div with 64 bit signed ints on IA32 [not found] ` <20010903171717.E13574@atrey.karlin.mff.cuni.cz> @ 2001-09-04 13:04 ` Frank Klemm 2001-09-04 15:10 ` Tim Prince [not found] ` <20010904003947.D335@fuchs.offl.uni-jena.de> 1 sibling, 1 reply; 21+ messages in thread From: Frank Klemm @ 2001-09-04 13:04 UTC (permalink / raw) To: Jan Hubicka; +Cc: gcc There are possibilities to speed up the following operations for IA32 int64_t * int64_t int64_t / int64_t int64_t % int64_t (?) Interests? Is this used in any benchmark? Another floating point problem are the rounding bits of the FPU. It should be forced that these two bits are always '11' (round to zero). This would decrease code size and speed up significantly the code. I know the is a problem with ill written assembler code which polutes these two bits. May be a compiler switch switches between 'safe' and 'fast' behaviour. If you have 'clean' assembler libs, you can use 'fast', otherwise you must use 'safe'. It is stupid to save, modofy and restore the RC bits several million times per second, especially because this operation is VERY expensive. A rounding of a 'double' to an 'int' with ANSI-C took 160 clocks on a K6-2. In the same time it was possible to calculate the scalar of _two_ 1200 byte long vectors of float values. This is brain dead! Here we have problems with the design of C and with the design of the FPU of the iA32 architecture. Option proposals: -fsaverc -ffastrc -fsavecld ; the same for the cld flag -ffastcld -- Frank Klemm ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: mul + div with 64 bit signed ints on IA32 2001-09-04 13:04 ` mul + div with 64 bit signed ints on IA32 Frank Klemm @ 2001-09-04 15:10 ` Tim Prince 2001-09-05 4:44 ` Jan Hubicka 0 siblings, 1 reply; 21+ messages in thread From: Tim Prince @ 2001-09-04 15:10 UTC (permalink / raw) To: Frank Klemm, Jan Hubicka; +Cc: gcc ----- Original Message ----- From: "Frank Klemm" <pfk@fuchs.offl.uni-jena.de> To: "Jan Hubicka" <jh@suse.cz> Cc: <gcc@gcc.gnu.org> Sent: Tuesday, September 04, 2001 12:51 PM Subject: mul + div with 64 bit signed ints on IA32 > > A rounding of a 'double' to an 'int' with ANSI-C took 160 clocks on a K6-2. > In the same time it was possible to calculate the scalar of _two_ 1200 byte > long vectors of float values. This is brain dead! I don't see that you have specified which gcc version you have chosen. The 3.0 series chooses by far the slowest method. I don't attempt to run gcc on my K6-2 anyway. Honza has done an excellent job of correcting this in gcc-3.1, without taking any unusual shortcuts, implementing both x87 and SSE2. I hope you don't advocate undoing his work in favor of something totally strange. > > Here we have problems with the design of C and with the design of the FPU > of the iA32 architecture. No doubt an unusually slow code generation choice could be made for almost any architecture. It's true that the idea of a single instruction to store float to integer with truncation toward zero came late in the IA32 series, but the gcc-3.0 code sequence is contrary to any recommendation ever published. Compaq just corrected a similar problem in their compilers, before the Intel acquisition. > > > Option proposals: > > -fsaverc > -ffastrc > -fsavecld ; the same for the cld flag > -ffastcld > glibc includes implementations of lrint() and the like. I'd like to see something like -ffast-rint to support g77 (Fortran spelling -ffast-nint), which has a precedent in the MipsPro compilers. That only modifies the code to accept IEEE style round-to-nearest in place of Fortran style round-to-nearest. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: mul + div with 64 bit signed ints on IA32 2001-09-04 15:10 ` Tim Prince @ 2001-09-05 4:44 ` Jan Hubicka 2001-09-05 5:49 ` Tim Prince 2001-09-07 11:54 ` Multiplications on Pentium 4 Frank Klemm 0 siblings, 2 replies; 21+ messages in thread From: Jan Hubicka @ 2001-09-05 4:44 UTC (permalink / raw) To: Tim Prince; +Cc: Frank Klemm, Jan Hubicka, gcc > > Option proposals: > > > > -fsaverc > > -ffastrc > > -fsavecld ; the same for the cld flag > > -ffastcld > > > glibc includes implementations of lrint() and the like. I'd like > to see something like -ffast-rint to support g77 (Fortran > spelling -ffast-nint), which has a precedent in the MipsPro > compilers. That only modifies the code to accept IEEE style > round-to-nearest in place of Fortran style round-to-nearest. This appears to be requested commonly enought and make perfect sense for 3d software, where converison to integer is perofrmance cirtical and IEEE behaviour secondary. With my new 3.1 mode switching implementation it can be easy to make gcc expect "round towards zero" CW setting, so I can implement it. Insead of -ffast-rint, I think we can name it -mround-towards-zero, as it will make user to more think what the switch actually does. When compiling it can make fast fp->int converison to be generated and we can manage spec file to link in an library to actually set the switch, as we do on other platforms for -ffast-math. Sounds sensible? Honza > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: mul + div with 64 bit signed ints on IA32 2001-09-05 4:44 ` Jan Hubicka @ 2001-09-05 5:49 ` Tim Prince 2001-09-05 12:16 ` Richard Henderson 2001-09-07 11:54 ` Multiplications on Pentium 4 Frank Klemm 1 sibling, 1 reply; 21+ messages in thread From: Tim Prince @ 2001-09-05 5:49 UTC (permalink / raw) To: Jan Hubicka; +Cc: Frank Klemm, Jan Hubicka, gcc ----- Original Message ----- From: "Jan Hubicka" <jh@suse.cz> To: "Tim Prince" <tprince@computer.org> Cc: "Frank Klemm" <pfk@fuchs.offl.uni-jena.de>; "Jan Hubicka" <jh@suse.cz>; <gcc@gcc.gnu.org> Sent: Wednesday, September 05, 2001 4:44 AM Subject: Re: mul + div with 64 bit signed ints on IA32 > > > Option proposals: > > > > > > -fsaverc > > > -ffastrc > > > -fsavecld ; the same for the cld flag > > > -ffastcld > > > > > glibc includes implementations of lrint() and the like. I'd like > > to see something like -ffast-rint to support g77 (Fortran > > spelling -ffast-nint), which has a precedent in the MipsPro > > compilers. That only modifies the code to accept IEEE style > > round-to-nearest in place of Fortran style round-to-nearest. > > This appears to be requested commonly enought and make perfect sense for 3d > software, where converison to integer is perofrmance cirtical and IEEE > behaviour secondary. With my new 3.1 mode switching implementation it can be > easy to make gcc expect "round towards zero" CW setting, so I can implement it. > > Insead of -ffast-rint, I think we can name it -mround-towards-zero, as it > will make user to more think what the switch actually does. > When compiling it can make fast fp->int converison to be generated and we > can manage spec file to link in an library to actually set the switch, > as we do on other platforms for -ffast-math. > > Sounds sensible? > > Honza > > My personal belief is that code should be using the C99 or f77 round-to-nearest functions where they are wanted or at least acceptable, and that compilers should provide means for in-line compilation of such round-to-nearest operations. I don't find the Intel compiler option which substitutes round-to-nearest for all (int) casts so suitable. Evidently, others differ with me. These problems with performance of (int) casts have received enough attention lately, that future CPU families are likely to take care of them with hardware instructions. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: mul + div with 64 bit signed ints on IA32 2001-09-05 5:49 ` Tim Prince @ 2001-09-05 12:16 ` Richard Henderson 0 siblings, 0 replies; 21+ messages in thread From: Richard Henderson @ 2001-09-05 12:16 UTC (permalink / raw) To: Tim Prince; +Cc: Jan Hubicka, Frank Klemm, gcc On Wed, Sep 05, 2001 at 05:49:15AM -0700, Tim Prince wrote: > My personal belief is that code should be using the C99 or f77 > round-to-nearest functions where they are wanted or at least > acceptable, and that compilers should provide means for in-line > compilation of such round-to-nearest operations. I don't find > the Intel compiler option which substitutes round-to-nearest for > all (int) casts so suitable. I completely agree. If people care so much about performance, they can be bothered to write rint(x) instead of (int)x. As for proper inlining, we already have FIX_ROUND_EXPR for trees, all we need is a fix_round rtx code to match. r~ ^ permalink raw reply [flat|nested] 21+ messages in thread
* Multiplications on Pentium 4 2001-09-05 4:44 ` Jan Hubicka 2001-09-05 5:49 ` Tim Prince @ 2001-09-07 11:54 ` Frank Klemm 2001-09-07 12:14 ` Michael Meissner ` (2 more replies) 1 sibling, 3 replies; 21+ messages in thread From: Frank Klemm @ 2001-09-07 11:54 UTC (permalink / raw) To: Jan Hubicka; +Cc: gcc The Pentim 4 is so different from all other CPUs so I must write a special Code Choice Generator. Some Examples: imul: 14 Clocks Latency shl: 4 Clocks Latency lea (,,1) 0.5 Clocks Latency lea (,,2) 4 Clocks Latency lea (,,4) 4 Clocks Latency lea (,,8) 4 Clocks Latency add, sub, neg: 0.5 Clocks Latency mov 0...0.5 Clocks Latency This generates fully different Code compared with i386...Pentium-III, K5...Athlon. Optimizing code for size is easy. It's the same as for other CPUs. Optimizing for speed normally blows the code. Nearly always cascades of adds and lea(,,1) are the fastest solution, also for huge multiplier. Code can increase up to 50 bytes for ONE multiplication (2 register solutions). Only few multiplier. are a _little_ bit faster using the imul instruction. So the optimization is more a speed <=> code size tradeoff. So it should be programmed a proposal generator which generates the shortest path method for a given multiplier. Examples: *12: lea (r,r,1),t; add t,r; add r,r; add r,r 2 Clocks (1) lea (r,2,2),r; shl $2,r 8 Clocks (2) imul $12,r 14 Clocks (4.667) Latency! Throughput is higher (in () ). -- Frank Klemm ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Multiplications on Pentium 4 2001-09-07 11:54 ` Multiplications on Pentium 4 Frank Klemm @ 2001-09-07 12:14 ` Michael Meissner 2001-09-08 8:28 ` Jan Hubicka 2001-09-10 2:11 ` Multiplications on Pentium 4 Torbjorn Granlund 2 siblings, 0 replies; 21+ messages in thread From: Michael Meissner @ 2001-09-07 12:14 UTC (permalink / raw) To: Frank Klemm; +Cc: Jan Hubicka, gcc On Fri, Sep 07, 2001 at 08:04:03PM +0200, Frank Klemm wrote: > The Pentim 4 is so different from all other CPUs so I must write a special > Code Choice Generator. Some Examples: > > > imul: 14 Clocks Latency > shl: 4 Clocks Latency > lea (,,1) 0.5 Clocks Latency > lea (,,2) 4 Clocks Latency > lea (,,4) 4 Clocks Latency > lea (,,8) 4 Clocks Latency > add, sub, neg: 0.5 Clocks Latency > mov 0...0.5 Clocks Latency > > This generates fully different Code compared with i386...Pentium-III, > K5...Athlon. As a historical trivia, I believe a similar situation occurred with with the 486, which had a huge penalty for imul (according to the gcc sources, 15 clocks + 1 clock for every bit set). I recall that in some cases replacing a divide of a constant by a 32x32->64 bit multiply of a large constant, actually took more cycles to do it as a multiple than a divide. -- Michael Meissner, Red Hat, Inc. (GCC group) PMB 198, 174 Littleton Road #3, Westford, Massachusetts 01886, USA Work: meissner@redhat.com phone: +1 978-486-9304 Non-work: meissner@spectacle-pond.org fax: +1 978-692-4482 ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Multiplications on Pentium 4 2001-09-07 11:54 ` Multiplications on Pentium 4 Frank Klemm 2001-09-07 12:14 ` Michael Meissner @ 2001-09-08 8:28 ` Jan Hubicka 2001-09-08 17:13 ` typed cast's Frank Klemm 2001-09-10 2:11 ` Multiplications on Pentium 4 Torbjorn Granlund 2 siblings, 1 reply; 21+ messages in thread From: Jan Hubicka @ 2001-09-08 8:28 UTC (permalink / raw) To: Frank Klemm; +Cc: Jan Hubicka, gcc > The Pentim 4 is so different from all other CPUs so I must write a special > Code Choice Generator. Some Examples: > > > imul: 14 Clocks Latency > shl: 4 Clocks Latency > lea (,,1) 0.5 Clocks Latency > lea (,,2) 4 Clocks Latency > lea (,,4) 4 Clocks Latency > lea (,,8) 4 Clocks Latency Actually lea for ,,2 can be rewriten to lea doing addition, that is faster. The rule is that shift has 4 cycle latency, while add 0.5. Lea is broken to trivial operations, so for your measurements you probably can ignore her existence. > add, sub, neg: 0.5 Clocks Latency > mov 0...0.5 Clocks Latency > > This generates fully different Code compared with i386...Pentium-III, > K5...Athlon. Agreed. Thats the poroblem. Other problem is that imul's and shift's extreme latency causes that we can benefit from replacing it by relativly many adds, but P4 is limited by trace cache. More adds, less cache space so this tradeoff needs to be controlled mainly by program's profile to find hot spots and aditionally by scheduler to reduce only critical paths trought BB. This is _extremly_ dificult to integrate to existing gcc model. I hope that Intel will realize that and do some funding to gcc development as good Pentium4 support will be tricky. Honza > > Optimizing code for size is easy. It's the same as for other CPUs. > ^ permalink raw reply [flat|nested] 21+ messages in thread
* typed cast's 2001-09-08 8:28 ` Jan Hubicka @ 2001-09-08 17:13 ` Frank Klemm 2001-09-08 18:12 ` Zack Weinberg 0 siblings, 1 reply; 21+ messages in thread From: Frank Klemm @ 2001-09-08 17:13 UTC (permalink / raw) To: Jan Hubicka; +Cc: gcc This has nothing to do with optimization. This has something to do with the parser. In C you can convert an expression of (nearly) any type to another type. float f = (float) (&f); Typecasting is dangerous. But sometimes you need typecasting. Often you want to cast type A to type B. C has only the possibity to cast any type to type B. int x [128]; float f [128]; // p is a char*, but can also work with int-data from x, but never can work with float data) char* p = (char*) x; // dangerous char* p = (char*) f; // bug, but C can't determine that char* q = (char*,int*) x; // explicit convering from int* to char* char* q = (char*,int*) f; // compile time error Especially if you are using macros this can be useful to determine bugs as early as possible. Syntax: (dest_type) (dest_type,allowed_src_type) (dest_type,allowed_src_type1,allowed_src_type2) (dest_type,allowed_src_type1,allowed_src_type2,allowed_src_type3) etc. Examples: (size_t,int)-1 (void*,int)0 *p++ = (uint8_t,uint32_t)(word32 >> 8); *p++ = (int,unsigned int) Bitstream_read (Res_bit[Res[Band].R]) - Dc[Res[Band].R]; -- Frank Klemm ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: typed cast's 2001-09-08 17:13 ` typed cast's Frank Klemm @ 2001-09-08 18:12 ` Zack Weinberg 0 siblings, 0 replies; 21+ messages in thread From: Zack Weinberg @ 2001-09-08 18:12 UTC (permalink / raw) To: Frank Klemm; +Cc: Jan Hubicka, gcc On Sun, Sep 09, 2001 at 12:41:39AM +0200, Frank Klemm wrote: > This has nothing to do with optimization. > > This has something to do with the parser. In the future, would you mind starting a new thread if you want to discuss a completely different subject? > Typecasting is dangerous. But sometimes you need typecasting. > Often you want to cast type A to type B. > C has only the possibity to cast any type to type B. [etc] I believe it is the general feeling of the group that we already have too many extensions, and we are not interested in adding new ones. I'd like to point out that most of the casts in code I write are because I don't _know_ the type of the item being cast from. The usual example is something like printf("%lu\n", (unsigned long)value); where the type of VALUE is e.g. "time_t", which does not correspond to a printf letter, and assuming it's any specific one is wrong. zw ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Multiplications on Pentium 4 2001-09-07 11:54 ` Multiplications on Pentium 4 Frank Klemm 2001-09-07 12:14 ` Michael Meissner 2001-09-08 8:28 ` Jan Hubicka @ 2001-09-10 2:11 ` Torbjorn Granlund 2001-09-10 4:44 ` Jan Hubicka 2 siblings, 1 reply; 21+ messages in thread From: Torbjorn Granlund @ 2001-09-10 2:11 UTC (permalink / raw) To: Frank Klemm; +Cc: Jan Hubicka, gcc [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #1: Type: text/plain, Size: 522 bytes --] Frank Klemm <pfk@fuchs.offl.uni-jena.de> writes: The Pentim 4 is so different from all other CPUs so I must write a special Code Choice Generator. Some Examples: I don't know what a Code Choice Generator is, but I know that multiply-by-constant code generation is controlled by the *COSTS macros. If you set these up properly for Pentium 4, you should get code sequences that use fewer shifts and more adds. (It isn't going to become perfect since this code understands nothing about scheduling.) -- Torbjörn ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Multiplications on Pentium 4 2001-09-10 2:11 ` Multiplications on Pentium 4 Torbjorn Granlund @ 2001-09-10 4:44 ` Jan Hubicka 0 siblings, 0 replies; 21+ messages in thread From: Jan Hubicka @ 2001-09-10 4:44 UTC (permalink / raw) To: Torbjorn Granlund; +Cc: Frank Klemm, Jan Hubicka, gcc [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #1: Type: text/plain, Size: 733 bytes --] > Frank Klemm <pfk@fuchs.offl.uni-jena.de> writes: > > The Pentim 4 is so different from all other CPUs so I must write a special > Code Choice Generator. Some Examples: > > I don't know what a Code Choice Generator is, but I know that > multiply-by-constant code generation is controlled by the *COSTS > macros. If you set these up properly for Pentium 4, you should get > code sequences that use fewer shifts and more adds. Frank wrote3 his own code to generate various sequences and compare them. His code produces better results in some cases, as it is aware of some details gcc generator isn't. Honza > > (It isn't going to become perfect since this code understands nothing > about scheduling.) > > -- > Torbjörn ^ permalink raw reply [flat|nested] 21+ messages in thread
[parent not found: <20010904003947.D335@fuchs.offl.uni-jena.de>]
[parent not found: <20010905133553.E15564@atrey.karlin.mff.cuni.cz>]
[parent not found: <20010906210657.B420@fuchs.offl.uni-jena.de>]
[parent not found: <20010907111255.D9651@atrey.karlin.mff.cuni.cz>]
* Re: Multiplication [not found] ` <20010907111255.D9651@atrey.karlin.mff.cuni.cz> @ 2001-09-07 13:46 ` Frank Klemm 0 siblings, 0 replies; 21+ messages in thread From: Frank Klemm @ 2001-09-07 13:46 UTC (permalink / raw) To: Jan Hubicka; +Cc: gcc On Fri, Sep 07, 2001 at 11:12:55AM +0200, Jan Hubicka wrote: > > The Multiplication cost table for the Pentium 4 may be at monday. > > I'm not sure about the search depth. 5 or 6. For a search depth of 6 > > may be a weekend be too short. > > P4 will be interesting, due to extreme cost of multiply and shifts :) > I need a search depth of up to 28. This needs something around 10^41 years for my current program for searching the best solution ... Currently I'm only search two register solutions. Is this enough for IA32? For CPUs with much more registers (IA64, PPC, Alpha) you also interested in three or four register solutions? How does the multiplication code generator work? My stupid idea is: - Code_Proposal_Generator [size] (multiplier) Code_Proposal_Generator [time] (CPU_type, multiplier) emits some known useful proposals for a *= b; - Multiple solutions are necessary because there are also other conditions affecting which of them is the best: src: mem/reg? dst: mem/reg? src!=dst? # of free registers? - Multiple solutions within small code ranges will be weighted. - All solutions have their calculated costs: - latency - throughput (for larger peepholes throughput stalls will be calculated into additional latency) - registers used (especially on systems with rare regs) (for larger peepholes register usage will be calculated into additional load/stores and related latency) - code size - For very large peeopholes (;-) you have all transformed into - latency - code size - Depending on the -O setting the "best" solution is selected. On the Pentium 4 fast solutions are often huge and the 1st level code cache is very small. So size also affects speed. But this also depends on the use pattern. This smell like a more sophisticated solution for code optimzation like some switches -O -O2 -Os and -O3. Proposal: Cost from the view of compile time: -O0 no optimization, compile as fast as you can! using as less memory as you can -O1 simple optimizations, speed and memory usage should increase by 10...20% (typical) -O2 optimizations, speed and memory usage should increase by 50...100% (typical) -O3 high optimizations, speed and memory usage should increase by 200...1000% (typical) -O4 highest optimizations, speed and memory usage should increase by 1000...100000% (typical) This can be done by switch on/off optimization features and setting search depth and things like that (-O4 unlimited). parameters like treedepth and things like that. -Os1...9 -Ot1...9 -Os0 (== -Ot0) -Ot0 (== -Os0) -Os (== -Os5) old -O is something like -Ot2 old -O2 is something like -Ot5 old -O3 is something like -Ot8 This selects size <=> speed trade off. So -O4 -Os9 generates smallest code using all possible methods, also it tries all methods normally known to increase code size (-O4). -Os9 affects RTL optimization (never use a solution which is one bit longer), but also code generator: int two ( void ) { return 2; } -Ot9...-Ot0, -Os0...-Ot5: movl $2, %eax ret -Ot6...9: xorl %eax, %eax movb $2, %al # saves 1 byte ret -- Frank Klemm ^ permalink raw reply [flat|nested] 21+ messages in thread
[parent not found: <20010905182009.A339@fuchs.offl.uni-jena.de>]
[parent not found: <20010907120836.A14477@atrey.karlin.mff.cuni.cz>]
[parent not found: <20010907221210.G5281@fuchs.offl.uni-jena.de>]
[parent not found: <20010908182611.N8451@atrey.karlin.mff.cuni.cz>]
* Type problems with C [not found] ` <20010908182611.N8451@atrey.karlin.mff.cuni.cz> @ 2001-09-08 17:13 ` Frank Klemm 0 siblings, 0 replies; 21+ messages in thread From: Frank Klemm @ 2001-09-08 17:13 UTC (permalink / raw) To: Jan Hubicka; +Cc: gcc For data types there is a generic type for generic pointers: void* For functions there is no such type because functions are normally incompatible. But Functions with the same return type and the same parameter layout can be compatible. Is it possible to define a type which is compatible with fn_t, fn_schar_t and fn_uchar_t ? For data this is void*: void* p1 =(void*)...; void* p1 =(char*)...; void* p1 =(signed char*)...; void* p1 =(unsigned char*)...; Is there something similar for function pointers? ------------------------------------------------------------------ int p1_strcmp ( void** p1, void** p2 ) { signed char* q1 = *(signed char**)p1; <---- ugly typecast signed char* q2 = *(signed char**)p2; <---- ugly typecast for ( ; *q1; q1++, q2++ ) if (*q1 != *q2) return *q1 - *q2; return *q1 - *q2; } int p2_strcmp ( void** p1, void** p2 ) { unsigned char* q1 = *(unsigned char**)p1; <---- ugly typecast unsigned char* q2 = *(unsigned char**)p2; <---- ugly typecast for ( ; *q1; q1++, q2++ ) if (*q1 != *q2) return *q1 - *q2; return *q1 - *q2; } int p3_strcmp ( signed char** p1, signed char** p2 ) { signed char* q1 = *p1; signed char* q2 = *p2; for ( ; *q1; q1++, q2++ ) if (*q1 != *q2) return *q1 - *q2; return *q1 - *q2; } int p4_strcmp ( unsigned char** p1, unsigned char** p2 ) { unsigned char* q1 = *p1; unsigned char* q2 = *p2; for ( ; *q1; q1++, q2++ ) if (*q1 != *q2) return *q1 - *q2; return *q1 - *q2; } typedef int (*fn_t) (void**, void**); typedef int (*fn_schar_t) (signed char**, signed char**); typedef int (*fn_uchar_t) (unsigned char**, unsigned char**); fn_t p1 = p1_strcmp; fn_t p2 = p2_strcmp; fn_t p3 = (fn_t)p3_strcmp; <---- ugly typecast fn_t p4 = (fn_t)p4_strcmp; <---- ugly typecast fn_t p5 = p3_strcmp; <---- warning or error (C++) fn_t p6 = p4_strcmp; <---- warning or error (C++) -- Frank Klemm ^ permalink raw reply [flat|nested] 21+ messages in thread
[parent not found: <20010903154513.C13574@atrey.karlin.mff.cuni.cz>]
[parent not found: <20010904003127.C335@fuchs.offl.uni-jena.de>]
[parent not found: <20010905133458.D15564@atrey.karlin.mff.cuni.cz>]
[parent not found: <20010905185526.A559@fuchs.offl.uni-jena.de>]
[parent not found: <20010906174007.B1486@atrey.karlin.mff.cuni.cz>]
[parent not found: <20010906222524.A2578@fuchs.offl.uni-jena.de>]
[parent not found: <20010907111641.G9651@atrey.karlin.mff.cuni.cz>]
* Simple optimizations missing [not found] ` <20010907111641.G9651@atrey.karlin.mff.cuni.cz> @ 2001-09-09 7:20 ` Frank Klemm 2001-09-09 7:39 ` Graham Stott 2001-09-09 7:45 ` Andreas Jaeger 2001-09-09 7:20 ` Code size Frank Klemm 1 sibling, 2 replies; 21+ messages in thread From: Frank Klemm @ 2001-09-09 7:20 UTC (permalink / raw) To: Jan Hubicka; +Cc: gcc Expression optimizations: x / x = 1 x % x = 0 Trivial. ---------------------------------------------------------- Another thing which can save at least size: b is a 8 bit value. There is an expression like: ( b + otherstuff ) & mask otherstuff is any other expression, can also be very complex. mask is a value between 0 and 255. So you can change movzbl (....), %eax addl ... andl $mask, %eax or movzbl %al, %eax addl ... andl $mask, %eax to movzb (....), %al # may be only if optimizing for size addl ... andl $mask, %eax or addl ... # always good andl $mask, %eax This optimzation can also be done for (b + otherstuff) | value if value is in the range from 0xFFFFFF00...0xFFFFFFFF. This optimization is also possible for 16 bit quantities. Range of the masks is different. The operator '+' can also be: + - ^ | & or a mixture of them. -- Frank Klemm ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Simple optimizations missing 2001-09-09 7:20 ` Simple optimizations missing Frank Klemm @ 2001-09-09 7:39 ` Graham Stott 2001-09-09 7:45 ` Andreas Jaeger 1 sibling, 0 replies; 21+ messages in thread From: Graham Stott @ 2001-09-09 7:39 UTC (permalink / raw) To: Frank Klemm; +Cc: Jan Hubicka, gcc Frank Klemm wrote: > > Expression optimizations: > > x / x = 1 > x % x = 0 > > Trivial. Only if you can deduce that x != 0 > -- > Frank Klemm Graham ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Simple optimizations missing 2001-09-09 7:20 ` Simple optimizations missing Frank Klemm 2001-09-09 7:39 ` Graham Stott @ 2001-09-09 7:45 ` Andreas Jaeger 1 sibling, 0 replies; 21+ messages in thread From: Andreas Jaeger @ 2001-09-09 7:45 UTC (permalink / raw) To: Frank Klemm; +Cc: Jan Hubicka, gcc Frank Klemm <pfk@fuchs.offl.uni-jena.de> writes: > Expression optimizations: > > x / x = 1 > x % x = 0 For floating point or for integer? For floating point: 0.0/0.0 might generate a floating point exceptions and the result is a NaN. The same applies for the infinites. Frank, I'm looking forward to your patches hoping that they're as good as your analysis, Andreas -- Andreas Jaeger SuSE Labs aj@suse.de private aj@arthur.inka.de http://www.suse.de/~aj ^ permalink raw reply [flat|nested] 21+ messages in thread
* Code size [not found] ` <20010907111641.G9651@atrey.karlin.mff.cuni.cz> 2001-09-09 7:20 ` Simple optimizations missing Frank Klemm @ 2001-09-09 7:20 ` Frank Klemm 1 sibling, 0 replies; 21+ messages in thread From: Frank Klemm @ 2001-09-09 7:20 UTC (permalink / raw) To: Jan Hubicka; +Cc: gcc long long x; void main (void) { x &= 100; } ------------------------------------------------ Code with -Os (31 bytes + 7.5 bytes for alignment = 38.5 bytes on average) .align 16 .global main .type main,@function main: pushl %ebp xorl %edx, %edx movl %esp, %ebp pushl %eax pushl %eax movl x, %eax andl $-16, %esp movl %edx, x+4 andl $100, %eax movl %eax, x leave ret ------------------------------------------------- Code with -Os -fomit-frame-pointer (31 bytes + 7.5 bytes for alignment = 38.5 bytes on average) .align 16 .global main .type main,@function main: pushl %ebp <---- incomprehensible xorl %edx, %edx <---- okay movl %esp, %ebp <---- incomprehensible pushl %eax <---- incomprehensible pushl %eax <---- incomprehensible movl x, %eax <---- okay andl $-16, %esp <---- incomprehensible movl %edx, x+4 <---- okay andl $100, %eax <---- okay movl %eax, x <---- okay leave <---- incomprehensible ret <---- okay ------------------------------------------------- Optimized code (14 bytes): .text .global main .type main,@function <---- no .align, we want optimize for size main: xorl %eax, %eax <---- xor r,r; mov r,(mem) is shorter than mov $0,(mem) movl $x, %ecx <---- we have a free reg and x is accessed more than once andl $100, (%ecx) <---- operation can be done directly in memory movl %eax, 4(%ecx) <---- wipe memory ret May be the last two statements must be switched to improve speed. -- Frank Klemm ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Multiplications on Pentium 4 @ 2001-09-08 8:31 dewar 2001-09-08 9:17 ` Jan Hubicka 0 siblings, 1 reply; 21+ messages in thread From: dewar @ 2001-09-08 8:31 UTC (permalink / raw) To: jh, pfk; +Cc: gcc It's amazing how poor the scaling lea's are on the Pentium 4, probably they should never be generated. (at least you could take care of this by simply removing them :-) ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Multiplications on Pentium 4 2001-09-08 8:31 Multiplications on Pentium 4 dewar @ 2001-09-08 9:17 ` Jan Hubicka 0 siblings, 0 replies; 21+ messages in thread From: Jan Hubicka @ 2001-09-08 9:17 UTC (permalink / raw) To: dewar; +Cc: jh, pfk, gcc > It's amazing how poor the scaling lea's are on the Pentium 4, probably they > should never be generated. They are no worse than shifts - actually LEA is simply translated to trivial operations. It is amazing how bad shifts are :) but replacing them by ADDs is not always solution due to trace cache misses. Honza > > (at least you could take care of this by simply removing them :-) ^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2001-09-10 4:44 UTC | newest] Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- [not found] <20010826202953.E2544@fuchs.offl.uni-jena.de> [not found] ` <20010826233634.A6693@atrey.karlin.mff.cuni.cz> [not found] ` <20010827004731.G2544@fuchs.offl.uni-jena.de> [not found] ` <20010827121624.D8568@atrey.karlin.mff.cuni.cz> [not found] ` <20010827143032.C636@fuchs.offl.uni-jena.de> [not found] ` <20010827173025.F11402@atrey.karlin.mff.cuni.cz> [not found] ` <20010901202854.A7713@fuchs.offl.uni-jena.de> [not found] ` <20010902000000.C27182@atrey.karlin.mff.cuni.cz> [not found] ` <20010902024104.F7713@fuchs.offl.uni-jena.de> [not found] ` <20010902110741.D13434@atrey.karlin.mff.cuni.cz> 2001-09-02 8:45 ` Multiplication of a block of integers Frank Klemm [not found] ` <20010903171717.E13574@atrey.karlin.mff.cuni.cz> 2001-09-04 13:04 ` mul + div with 64 bit signed ints on IA32 Frank Klemm 2001-09-04 15:10 ` Tim Prince 2001-09-05 4:44 ` Jan Hubicka 2001-09-05 5:49 ` Tim Prince 2001-09-05 12:16 ` Richard Henderson 2001-09-07 11:54 ` Multiplications on Pentium 4 Frank Klemm 2001-09-07 12:14 ` Michael Meissner 2001-09-08 8:28 ` Jan Hubicka 2001-09-08 17:13 ` typed cast's Frank Klemm 2001-09-08 18:12 ` Zack Weinberg 2001-09-10 2:11 ` Multiplications on Pentium 4 Torbjorn Granlund 2001-09-10 4:44 ` Jan Hubicka [not found] ` <20010904003947.D335@fuchs.offl.uni-jena.de> [not found] ` <20010905133553.E15564@atrey.karlin.mff.cuni.cz> [not found] ` <20010906210657.B420@fuchs.offl.uni-jena.de> [not found] ` <20010907111255.D9651@atrey.karlin.mff.cuni.cz> 2001-09-07 13:46 ` Multiplication Frank Klemm [not found] ` <20010905182009.A339@fuchs.offl.uni-jena.de> [not found] ` <20010907120836.A14477@atrey.karlin.mff.cuni.cz> [not found] ` <20010907221210.G5281@fuchs.offl.uni-jena.de> [not found] ` <20010908182611.N8451@atrey.karlin.mff.cuni.cz> 2001-09-08 17:13 ` Type problems with C Frank Klemm [not found] ` <20010903154513.C13574@atrey.karlin.mff.cuni.cz> [not found] ` <20010904003127.C335@fuchs.offl.uni-jena.de> [not found] ` <20010905133458.D15564@atrey.karlin.mff.cuni.cz> [not found] ` <20010905185526.A559@fuchs.offl.uni-jena.de> [not found] ` <20010906174007.B1486@atrey.karlin.mff.cuni.cz> [not found] ` <20010906222524.A2578@fuchs.offl.uni-jena.de> [not found] ` <20010907111641.G9651@atrey.karlin.mff.cuni.cz> 2001-09-09 7:20 ` Simple optimizations missing Frank Klemm 2001-09-09 7:39 ` Graham Stott 2001-09-09 7:45 ` Andreas Jaeger 2001-09-09 7:20 ` Code size Frank Klemm 2001-09-08 8:31 Multiplications on Pentium 4 dewar 2001-09-08 9:17 ` Jan Hubicka
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).