public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed
* Re: GCC and fast evaluation of polynomials
@ 2001-11-09 14:01 Stephen L Moshier
  2001-11-09 15:49 ` Andreas Svrcek-Seiler
  0 siblings, 1 reply; 6+ messages in thread
From: Stephen L Moshier @ 2001-11-09 14:01 UTC (permalink / raw)
  To: Timothy J. Wood; +Cc: Andreas Svrcek-Seiler, gcc


> An example might help here, say we have:
>
>   y = a7*x^7 ... + a1*x + a0
>
>   Rewrite this as:
>
>   y = yEven + yOdd
>   yEven = a6*x^6 + a4*x^4 ... + a0
>   yOdd = a7*x^7 + a5*x^5 ... + a1*x
>
>
>   Factor yOdd, leaving:
>
>   y = yEven + x * yOdd;
>   yEven = a6*x^6 + a4*x^4 ... + a0
>   yOdd = a7*x^6 + a5*x^4 ... + a1

Some of the glibc functions were modified to work this way.
Unfortunately the details of how you would best keep the pipeline(s)
full are hardware dependent.  I recall that the best code for a Motorola 88000
didn't work well at all on an Intel i860, for example.

> On Monday, November 19, 2001, at 04:48  AM, Andreas Svrcek-Seiler wrote:
> If I do it naively according to Horner's scheme:
> (replacing a log() call, which is even more time-consuming)
> u2 = u*u;
> v = (((((B5*u2+B4)*u2+B3)*u2+B2)*u2+B1)*u2+B0)*u;
>
> the compiled executable is significantly slower than
> when the source reads
> v = B5*u2 + B4;
> v = v*u2 + B3;
> v = v*u2 + B2;
> v = v*u2 + B1;
> v = v*u2 + B0;
> v *= u;

That's really too bad!  It looks like the second version will defeat
some optimizations, which would mean that writing it that way
actually prevents the compiler from making a mess of it.  Please
mention what brand of computer you are using.  If you give a complete
test case in the form of a bug report, maybe someone will investigate.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: GCC and fast evaluation of polynomials
  2001-11-09 14:01 GCC and fast evaluation of polynomials Stephen L Moshier
@ 2001-11-09 15:49 ` Andreas Svrcek-Seiler
  0 siblings, 0 replies; 6+ messages in thread
From: Andreas Svrcek-Seiler @ 2001-11-09 15:49 UTC (permalink / raw)
  To: moshier; +Cc: Timothy J. Wood, gcc

> That's really too bad!  It looks like the second version will defeat
> some optimizations, which would mean that writing it that way
> actually prevents the compiler from making a mess of it.  Please
> mention what brand of computer you are using.  If you give a complete
> test case in the form of a bug report, maybe someone will investigate.
I'm using 
"Reading specs from /usr/lib/gcc-lib/i386-redhat-linux/2.96/specs
gcc version 2.96 20000731 (Red Hat Linux 7.1 2.96-85)"

the command line was
gcc -c -O2 -ffast-math -mpentiumpro -march=pentiumpro  -Dflex sff.c

(-O3 or even more brought no gain, using gcc3 brought a minor slowdown)

the machine is a Pentium III (Coppermine) at 733 MHz. 

By 'significant' speed difference I only meant that it
was above the uncertainties due to system overhead,
so in practice it does not matter; I was just curious and did not want
to report a bug.

What I WOULD like are faster transcendental functions (e.g. log()),
which is probably an FPU problem, not a gcc one
(someone told me there are (RISC) processors capable of doing a
sqrt() in one cycle, although I doubt that ;-)

By the way, manually splitting the polynomial into two halves
(as Timothy Wood suggested) brought no improvement.

 
best regards,
greetings from Vienna
andreas 
             
            )))))
            (((((
           ( O O )
-------oOOO--(_)--OOOo-----------------------------------------------------
              o        Wolfgang Andreas Svrcek-Seiler  
              o        (godzilla) 
                       svrci@tbi.univie.ac.at
      .oooO            Tel.:01-4277-52733 
      (   )   Oooo.    
-------\ (----(   )--------------------------------------------------------
        \_)    ) /
              (_/


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: GCC and fast evaluation of polynomials
@ 2001-11-09 14:55 Stephen L Moshier
  0 siblings, 0 replies; 6+ messages in thread
From: Stephen L Moshier @ 2001-11-09 14:55 UTC (permalink / raw)
  To: dewar; +Cc: gcc


> It is certainly familiar that Horner's rule is a bad idea on modern 
> processors with significant ILP. That's nothing new. 

But both of the example versions he shows are Horner's rule.  Why
should the compiler generate significantly slower code for one of
them?
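Indeed, the two spellings describe exactly the same chain of multiplies and
adds.  A minimal sketch in C (with made-up coefficients, since the thread
never gives the actual B0..B5 of the log() approximation) makes the point
that they are operation-for-operation identical:

```c
#include <assert.h>

/* Hypothetical coefficients for illustration only; the original
 * B0..B5 (a series approximation of log()) are not given in the thread. */
static const double B0 = 1.0, B1 = -0.5, B2 = 0.333,
                    B3 = -0.25, B4 = 0.2, B5 = -0.166;

/* The nested form from the original post. */
double poly_nested(double u)
{
    double u2 = u * u;
    return (((((B5 * u2 + B4) * u2 + B3) * u2 + B2) * u2 + B1) * u2 + B0) * u;
}

/* The step-by-step form: the same multiplies and adds, in the same order. */
double poly_stepwise(double u)
{
    double u2 = u * u;
    double v;
    v = B5 * u2 + B4;
    v = v * u2 + B3;
    v = v * u2 + B2;
    v = v * u2 + B1;
    v = v * u2 + B0;
    v *= u;
    return v;
}
```

Since the arithmetic is identical, any speed difference between the two
must come from how the compiler schedules the instructions, not from the
computation itself.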

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: GCC and fast evaluation of polynomials
@ 2001-11-09 14:15 dewar
  0 siblings, 0 replies; 6+ messages in thread
From: dewar @ 2001-11-09 14:15 UTC (permalink / raw)
  To: moshier, tjw; +Cc: gcc, svrci

<<That's really too bad!  It looks like the second version will defeat
some optimizations, which would mean that writing it that way
actually prevents the compiler from making a mess of it.  Please
mention what brand of computer you are using.  If you give a complete
test case in the form of a bug report, maybe someone will investigate.
>>

It is certainly familiar that Horner's rule is a bad idea on modern 
processors with significant ILP. That's nothing new. 

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: GCC and fast evaluation of polynomials
  2001-11-07 12:23 Andreas Svrcek-Seiler
@ 2001-11-08  8:31 ` Timothy J. Wood
  0 siblings, 0 replies; 6+ messages in thread
From: Timothy J. Wood @ 2001-11-08  8:31 UTC (permalink / raw)
  To: Andreas Svrcek-Seiler; +Cc: gcc



   This formulation of the series approximation to log() completely 
serializes the execution of your code (each step cannot be computed 
until the previous has finished due to dependence on 'v').

   Instead, you can compute two 'halves' of the sum in parallel.

   First, rewrite your expression as the sum of one polynomial with 
only even powers and one with only odd powers.  Then, factor one power of 
x out of the odd-power half (since its lowest power is 1), leaving you 
with two independent expressions involving only even powers.

   Finally, start your two partial sums, one with the zeroth-power 
coefficient, the other with the coefficient of what was the first power 
before the factoring above, and then accumulate the rest of the terms in 
parallel.

   An example might help here, say we have:

   y = a7*x^7 ... + a1*x + a0

   Rewrite this as:

   y = yEven + yOdd
   yEven = a6*x^6 + a4*x^4 ... + a0
   yOdd = a7*x^7 + a5*x^5 ... + a1*x


   Factor yOdd, leaving:

   y = yEven + x * yOdd;
   yEven = a6*x^6 + a4*x^4 ... + a0
   yOdd = a7*x^6 + a5*x^4 ... + a1

   Then, compute yEven and yOdd using Horner's scheme in x^2.

   Once you have the two halves, the final sum is just another 
multiply-accumulate operation.
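Putting the recipe above into C, a sketch for a degree-7 polynomial (with
hypothetical coefficients a0..a7, since none are given in the thread)
might look like this:

```c
#include <assert.h>
#include <math.h>

/* Hypothetical coefficients for illustration only. */
static const double a0 = 1.0,    a1 = 0.5,     a2 = 0.25,     a3 = 0.125,
                    a4 = 0.0625, a5 = 0.03125, a6 = 0.015625, a7 = 0.0078125;

/* Plain Horner: one serial chain, each step waits on the previous one. */
double poly_horner(double x)
{
    return ((((((a7 * x + a6) * x + a5) * x + a4) * x + a3) * x + a2) * x
            + a1) * x + a0;
}

/* Even/odd split: two independent Horner chains in x^2, joined by one
 * final multiply-add.  The two chains have no data dependence on each
 * other, so the CPU can overlap their execution. */
double poly_split(double x)
{
    double x2   = x * x;
    double even = ((a6 * x2 + a4) * x2 + a2) * x2 + a0;
    double odd  = ((a7 * x2 + a5) * x2 + a3) * x2 + a1;
    return even + x * odd;
}
```

Whether a compiler will perform this reassociation on its own (even under
-ffast-math) is exactly the question raised in this thread; the rounding
of the split version can differ slightly from plain Horner's.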

   The advantage to this approach is that you free up the compiler to 
overlap the two independent instruction streams (and of course we are 
hoping the compiler is smart enough to do this on CPUs where multiple 
instructions can be in flight).  The only point of synchronization is 
the very ending bit for the final sum.

    This approach could be extended to an x^{3,4,...} scheme if you had a 
lot of terms and a CPU that would still have FPU stalls with the 
x^2 version above.  At that point, though, you should probably be using 
some sort of vector unit :)

   Bringing this back to something relevant to this list, it would be 
nice for the compiler to be able to do this in the -ffast-math case, but 
I'm not 100% familiar with all the precision-loss issues, so please don't 
flame me (duck!)

-tim




On Monday, November 19, 2001, at 04:48  AM, Andreas Svrcek-Seiler wrote:

> Dear developers,
>
> I've got an inner loop
> inside which I evaluate a polynomial.
> If I do it naively according to Horner's scheme:
> (replacing a log() call, which is even more time-consuming)
> 	      u2 = u*u;
> 	      v = (((((B5*u2+B4)*u2+B3)*u2+B2)*u2+B1)*u2+B0)*u;
>
> the compiled executable is significantly slower than
> when the source reads
> 	      v = B5*u2 + B4;
> 	      v = v*u2 + B3;
> 	      v = v*u2 + B2;
> 	      v = v*u2 + B1;
> 	      v = v*u2 + B0;
> 	      v *= u;
>
> this behavior is independent of how much optimization I turn on.
> By the way, the newly released Intel compiler produces even slower 
> code ;-)
>
> ... maybe there's room for easy improvement ?
>
> best regards
> andreas
> --
>
>             )))))
>             (((((
>            ( O O )
> -------oOOO--(_)--OOOo-----------------------------------------------------
>               o        Wolfgang Andreas Svrcek-Seiler
>               o        (godzilla)
>                        svrci@tbi.univie.ac.at
>       .oooO            Tel.:01-4277-52733
>       (   )   Oooo.
> -------\ (----(   
> )--------------------------------------------------------
>         \_)    ) /
>               (_/
>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* GCC and fast evaluation of polynomials
@ 2001-11-07 12:23 Andreas Svrcek-Seiler
  2001-11-08  8:31 ` Timothy J. Wood
  0 siblings, 1 reply; 6+ messages in thread
From: Andreas Svrcek-Seiler @ 2001-11-07 12:23 UTC (permalink / raw)
  To: gcc

Dear developers,

I've got an inner loop 
inside which I evaluate a polynomial.
If I do it naively according to Horner's scheme:
(replacing a log() call, which is even more time-consuming)
	      u2 = u*u;
	      v = (((((B5*u2+B4)*u2+B3)*u2+B2)*u2+B1)*u2+B0)*u;

the compiled executable is significantly slower than
when the source reads
	      v = B5*u2 + B4;
	      v = v*u2 + B3;
	      v = v*u2 + B2;
	      v = v*u2 + B1;
	      v = v*u2 + B0;
	      v *= u;

this behavior is independent of how much optimization I turn on.
By the way, the newly released Intel compiler produces even slower code ;-)

... maybe there's room for easy improvement?

best regards
andreas
-- 
             
            )))))
            (((((
           ( O O )
-------oOOO--(_)--OOOo-----------------------------------------------------
              o        Wolfgang Andreas Svrcek-Seiler  
              o        (godzilla) 
                       svrci@tbi.univie.ac.at
      .oooO            Tel.:01-4277-52733 
      (   )   Oooo.    
-------\ (----(   )--------------------------------------------------------
        \_)    ) /
              (_/


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2001-11-20 14:33 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-11-09 14:01 GCC and fast evaluation of polynomials Stephen L Moshier
2001-11-09 15:49 ` Andreas Svrcek-Seiler
  -- strict thread matches above, loose matches on Subject: below --
2001-11-09 14:55 Stephen L Moshier
2001-11-09 14:15 dewar
2001-11-07 12:23 Andreas Svrcek-Seiler
2001-11-08  8:31 ` Timothy J. Wood

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).