Making GNU GCC choose_multiplier in expmed.c significantly faster

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* Making GNU GCC choose_multiplier in expmed.c significantly faster
       [not found] <CAMfC4qj_rfkcLnOYNDdnAU2RAqFmrPhHH+1g4mCW=4wgcHNwDg@mail.gmail.com>
@ 2018-07-10 11:09 ` colinb2 .
  2018-07-19 20:19   ` Jeff Law
  0 siblings, 1 reply; 4+ messages in thread
From: colinb2 . @ 2018-07-10 11:09 UTC (permalink / raw)
  To: gcc

[-- Attachment #1: Type: text/plain, Size: 5909 bytes --]

Feel free to copy this email and attachment to anyone who might be interested.
I'm very happy to answer any questions anyone has.
The program can be compiled and run like this on Linux with GNU GCC:
gcc -O2 -o expmed2.exe expmed2.c
./expmed2.exe

This email deals with making part of the GNU GCC compiler - integer division
by a constant divisor - faster. (The calculation of the parameters for the
machine code will be faster; compiled programs won't run faster.)
Further down I mention inequality (1) which can be used to make the LLVM
compiler somewhat faster, because that currently uses code based on (2).
I don't know what - if anything - the Java JVM uses for this, or how other
compilers do this, but these ideas may be useful there.

By significantly faster I mean I have benchmarked alternative versions of
choose_multiplier which on my low specification netbook can take maybe less than
half the time the current version takes. Time saved in compilation is much less
important than time saved in running compiled programs, but the code for the
faster versions is about the same length as the code for the current version,
and is only a bit more complicated, so is worth considering?

A short summary of the following is that choose_multiplier currently uses an
elegant algorithm due to Granlund & Montgomery, but which as implemented seems
rather slow. We can make it faster while retaining the basic structure, and
using a different, mostly equivalent, algorithm, may be a bit faster than that.

Licencing: in Fractint people's words "Don't want money, want recognition!"
The version choose_multiplier_v2 is based - but improves - on what's in
the GCC choose_multiplier function in file expmed.c, so the GCC licence.
The version choose_multiplier_v4 is based - but improves - on magicu2 in
"Hacker's Delight", so the licence is you needn't acknowledge the source
but it would be nice to credit the code as from
magicu2 in "Hacker's Delight" by Henry S Warren http://hackersdelight.org
with improvements by Colin Bartlett <colinb2@gmail.com>.
This latter also applies to choose_multiplier_power2_divrem because that
is also an obvious (?) idea from magicu2 in "Hacker's Delight".  */

The idea is using "wide int" seems slow compared to "unsigned HOST_WIDE_INT",
so it makes sense to avoid using "wide int" as much as possible. We can easily
rewrite choose_multiplier to only use "wide int" to calculate the initial mlow;
this is choose_multiplier_v2. An alternative for choose_multiplier_v2 completely
avoids using "wide int" by iterating upwards for the initial mlow, but if that
benchmarks as faster than using "wide int" even once (I suspect it might) then
just iterating upwards may even be a bit faster; this is choose_multiplier_v4.

The attachment is self-contained, and I have checked that the values produced
agree with a "Hacker's Delight" table of M/2^s for small d and n=precision=32.

What follows is a short summary of the theory, applying it to choose_multiplier.

Problem: find M/2^s so that for 0<=i<=iMax>=d we have floor(i/d)=floor(i*M/2^s).
Let qc=floor((iMax+1)/d); nc=qc*d-1; lgup=ceiling(log2(d)).
Given s let M=floor(2^s/d)+1; delta=M*d-2^s.

For GCC choose_multiplier:

* equivalent necessary & sufficient conditions:
(1) 0<delta and qc*delta<M
(2) 0<delta and nc*delta<2^s
Proof of (1) if and only if (2):
qc*delta*d-delta=nc*delta<2^s==M*d-delta
(3) 1/d<M/2^s<(1+1/nc)*1/d

* equivalent sufficient conditions: we have nc<2^precision, so:
(4) 0<delta and 2^precision*delta<=2^s
(4.1) 0<delta and delta<=2^(s-precision)
(5) 1/d<M/2^s<=(1+1/2^precision)*1/d
(5.1) 2^s/d<M<=(2^s+2^(s-precision))/d

(1) seems to be new to the literature, but is equivalent to (2) which is in
"Hacker's Delight" (Chapter 10) by Henry S Warren. Both give upwards iterating
algorithms which give minimal M/2^s and which avoid needing to use "wide int",
but the algorithm code using (1) is simpler and faster than that using (2).

(4.1) is in "Hacker's Delight". It gives an upwards iterating algorithm which
avoids using "wide int", and the algorithm code is a bit simpler than using (1).
Mostly it gives the same result as GCC choose_multiplier which implements
(possibly slowly because it uses "wide int") the elegant algorithm due to
Granlund & Montgomery. (Which perhaps builds on work by Alverson.)

We can prove if iMax<2^precision then for 0<=i<2^(precision/2) inequality (4)
etc give the same M/2^s as inequality (1) etc. For n=precision=32
the smallest d for which (4) gives a non-minimal M is d=102807.

Rough not necessarily reliable statistics suggest that using (4) etc
we have that if n=precision the average post_shift=lgup-1.

So if we can *quickly" calculate M/2^s at s=n+lgup-1, then either that fails
and we need to use an extra "top" bit for M and use post_shift=lgup,
or it works and we can try reducing M.

The catch is *if* we can *quickly* calculate M/2^s at s=n+lgup-1;
it's easy to directly calculate M/2^s at that s using "wide int", but using
"wide int" seems slow, and it might actually be faster to iterate upwards
from s=n and completely avoid using "wide int".

In any case "wide int" seems slow compared with using "unsigned HOST_WIDE_INT",
so it makes sense to avoid "wide int" as far as possible.

So in the attachment choose_multiplier_v2 rewrites choose_multiplier to:
(a) only use "wide int" to calculate the initial mlow at s=n+lgup-1;
    all other calculations are made using "unsigned HOST_WIDE_INT";
or (b) avoid using "wide int" to calculate the initial mlow at s=n+lgup-1,
       by iterating upwards from s=n to find the initial mlow.
But if (b) benchmarks as faster than (a), which I suspect might be the case,
then it may even be on average a bit faster to use choose_multiplier_v4 which
iterates upwards to find M/2^s, and which would avoid needing to calculate lgup
unless that is useful as a return value of choose_multiplier.

Colin Bartlett

[-- Attachment #2: expmed2.c --]
[-- Type: text/x-csrc, Size: 8948 bytes --]

#include <stdio.h>
#include <stdlib.h>

#define HOST_WIDE_INT int
#define HOST_BITS_PER_WIDE_INT 32
#define HOST_BITS_PER_DOUBLE_INT (2 * HOST_BITS_PER_WIDE_INT)

/* Making GNU GCC choose_multiplier in expmed.c significantly faster.
   By which we mean up to 50% or more faster for the compiler to calculate
   parameters for the machine code for integer division by a constant divisor.
   (Compiled programs won't run faster.)
   Licencing: in Fractint people's words "Don't want money, want recognition!"
   The version choose_multiplier_v2 is based - but improves - on what's in
   the GCC choose_multiplier function in file expmed.c, so the GCC licence.
   The version choose_multiplier_v4 is based - but improves - on magicu2 in
   "Hacker's Delight", so the licence is you needn't acknowledge the source
   but it would be nice to credit the code as from
   magicu2 in "Hacker's Delight" by Henry S Warren http://hackersdelight.org
   with improvements by Colin Bartlett <colinb2@gmail.com>.
   This latter also applies to choose_multiplier_power2_divrem because that
   is also an obvious (?) idea from magicu2 in "Hacker's Delight".  */

int
ceil_log2 (unsigned long long int iv)
{
  /* for now do it the long way */
  int s;
  if (iv == 0) return -1;
  iv -= 1;
  s = 0;
  while (iv)
    {
      s += 1;
      iv = iv >> 1;
    }
  return s;
}

void
gcc_assert (int iv)
{
  if (iv) return;
  printf ("gcc_assert error\n");
  exit (1);
}

\f
/* Choose a minimal N + 1 bit approximation to 1/D that can be used to
   replace division by D, and put the least significant N bits of the result
   in *MULTIPLIER_PTR and return the most significant bit.

   The width of operations is N (should be <= HOST_BITS_PER_WIDE_INT), the
   needed precision is in PRECISION (should be <= N).

   PRECISION should be as small as possible so this function can choose
   multiplier more freely.

   The rounded-up logarithm of D is placed in *lgup_ptr.  A shift count that
   is to be used for a final right shift is placed in *POST_SHIFT_PTR.

   Using this function, x/D will be equal to (x * m) >> (*POST_SHIFT_PTR),
   where m is the full HOST_BITS_PER_WIDE_INT + 1 bit multiplier.  */

#define CMUHWI1 ((unsigned HOST_WIDE_INT) 1)
// #define CMUHWI2 ((unsigned HOST_WIDE_INT) 2)

/* Licencing: */

void
choose_multiplier_power2_divrem (int two_exp, unsigned HOST_WIDE_INT d,
		   unsigned HOST_WIDE_INT *quotient, unsigned HOST_WIDE_INT *remainder) 
{
  /* Must have that 2^two_exp / d < 2^HOST_BITS_PER_WIDE_INT */
  unsigned HOST_WIDE_INT q, r, rtest;
  int s;
  if (two_exp < HOST_BITS_PER_WIDE_INT)
    {
      r = CMUHWI1 << two_exp;
      q = r / d;
      r = r - q * d;
    }
//  else if (0 == 0)
//    {
//      /* Using wide_int seems slowish & it may be faster to iterate upwards. */
//      wide_int val = wi::set_bit_in_zero (two_exp, HOST_BITS_PER_DOUBLE_INT);
//      q = wi::udiv_trunc (val, d).to_uhwi ();
//      r = 0 - q * d;
//    }
  else
    {
      /* Iterate upwards to get q, r;
         there may be "overflows" but that's OK as using unsigned integers. */
      s = HOST_BITS_PER_WIDE_INT;
      q = (0 - d) / d + 1;
      r = 0 - q * d;
      rtest = (d - 1) >> 1;
      while (s < two_exp)
        {
          s = s + 1;
          if (r <= rtest)
            {
              q = q << 1;
              r = r << 1;
            }
          else
            {
              q = (q << 1) | 1;
              r = (r << 1) - d;
            }
        }
    }
  *quotient = q;
  *remainder = r;
  return;
}

/* Why is choose_multiplier "unsigned HOST_WIDE_INT" instead of just "int"?
   It only returns 0 or 1.                                                 */

int
choose_multiplier_v2 (unsigned HOST_WIDE_INT d, int n, int precision,
		   unsigned HOST_WIDE_INT *multiplier_ptr,
		   int *post_shift_ptr, int *lgup_ptr)
{
  int lgup, shiftv, topbit, s;
  unsigned HOST_WIDE_INT mlow, mhigh, mlowv, mhighv;

  /* lgup = ceil(log2(divisor)); */
  lgup = ceil_log2 (d);

  gcc_assert (lgup <= n);

  if (lgup == 0)
    {
      /* It's easier to deal with d = 1 separately, as that
         is the only d for which we need to be very careful
         about avoiding shifting bits by >= HOST_BITS_PER_WIDE_INT. */ 
      *multiplier_ptr = CMUHWI1 << (n - precision);
      *post_shift_ptr = 0;
      *lgup_ptr = lgup;
      return 1;
    }

  topbit = 0;
  /* shiftv = (n or precision) + lgup - 1
     mlow = 2^shiftv / d
     mhigh = (2^shiftv + 2^(shiftv - precision)) / d
     An unlikely case is if precision < lgup
     when we could just use mhigh = 0, shiftv == 0. */
  shiftv = n + lgup - 1;
  choose_multiplier_power2_divrem (shiftv, d, &mlow, &mlowv);
  s = shiftv - precision;
  /* Avoid shifts by >= HOST_BITS_PER_WIDE_INT. */
  mhigh = precision < HOST_BITS_PER_WIDE_INT ? mlow >> precision : 0;
  mhighv = (s < HOST_BITS_PER_WIDE_INT ? CMUHWI1 << s : 0) - d * mhigh;
  mhigh += mlow + (mlowv < d - mhighv ? 0 : 1);
  if (mlow < mhigh)
    {
      /* Reduce to lowest terms. */
      shiftv -= n;
      while (shiftv > 0)
        {
          mlowv = mlow >> 1;
          mhighv = mhigh >> 1;
          if (mlowv >= mhighv)
            break;
          mlow = mlowv;
          mhigh = mhighv;
          shiftv -= 1;
        }
    }
  else
    {
      mhigh = ((mhigh - (CMUHWI1 << (n - 1))) << 1) | 1;
      shiftv = lgup;
      topbit = 1;
    }

  *multiplier_ptr = mhigh;
  *post_shift_ptr = shiftv;
  *lgup_ptr = lgup;
  return topbit;
}

int
choose_multiplier_v4 (unsigned HOST_WIDE_INT d, int n, int precision,
		   unsigned HOST_WIDE_INT *multiplier_ptr,
		   int *post_shift_ptr, int *lgup_ptr)
{
  int lgup, shiftv, topbit;
  unsigned HOST_WIDE_INT q, delta, deltatest, twos, topbitcheck;

  /* lgup = ceil(log2(divisor)); */
  lgup = ceil_log2 (d);

  gcc_assert (lgup <= n);

  if (lgup == 0)
    {
      /* It's easier to deal with d = 1 separately, as that
         is the only d for which we need to be very careful
         about avoiding shifting bits by >= HOST_BITS_PER_WIDE_INT. */ 
      *multiplier_ptr = CMUHWI1 << (n - precision);
      *post_shift_ptr = 0;
      *lgup_ptr = lgup;
      return 1;
    }

  /* Iterate upwards to find multiplier and post_shift. */
  topbit = 0;
  shiftv = 0;
  twos = CMUHWI1 << (n - precision);
  topbitcheck = CMUHWI1 << (n - 1);
  deltatest = d >> 1;
  if (n < HOST_BITS_PER_WIDE_INT)
    {
      delta = CMUHWI1 << n;
      q = delta / d;
    }
  else
    {
      delta = 0;
      q = (delta - d) / d + 1;
    }
  delta = (q + 1) * d - delta;
// printf("\n");
// printf ("//#// N %2d P %2d L %2d d %10d = 0x%8x; M 0x %8x %8x rv %1d %1d s %2d %2d;\n",
//          n, precision, lgup, d, d, m, mv, rv, rvv, s, sv);
  while (delta > twos)
    {
      shiftv += 1;
      if (delta <= deltatest)
        {
          q = (q << 1) | 1;
          delta = delta << 1;
        }
      else if (q < topbitcheck)
        {
          q = q << 1;
          delta = (delta << 1) - d;
        }
      else
        {
          topbit = 1;
          q = (q - topbitcheck) << 1;
          break;
        }
      twos = twos << 1;
    }

  *multiplier_ptr = q + 1;
  *post_shift_ptr = shiftv;
  *lgup_ptr = lgup;
  return topbit;
}

int
test_v2 (int n, int precision, unsigned HOST_WIDE_INT d, int qshow)
{
  int s, lgup, rv, sv, rvv, neq;
  unsigned HOST_WIDE_INT m, mv;
  rv = choose_multiplier_v2 (d, n, precision, &m, &s, &lgup);
  rvv = choose_multiplier_v4 (d, n, precision, &mv, &sv, &lgup);
  neq = (m == mv ? 0 : m>mv ? 1 : 2) | (rv == rvv ? 0 : 4) | (s == sv ? 0 : 8);
  if (qshow >= 4 || (neq && qshow))
    printf ("//#// N %2d P %2d L %2d d %10d = 0x%8x; M 0x %8x %8x rv %1d %1d s %2d %2d; neq %2d;\n",
          n, precision, lgup, d, d, m, mv, rv, rvv, s, sv, neq);
  return neq;
}

void
many_test_v2 (int qshow)
{
/*
*/
  test_v2 (32, 32, 1, qshow);
  test_v2 (32, 32, 2, qshow);
  test_v2 (32, 32, 4, qshow);
  test_v2 (32, 32, 8, qshow);
  test_v2 (32, 32, 1 << 6, qshow);
  test_v2 (32, 31, 1 << 6, qshow);
  test_v2 (32, 30, 1 << 6, qshow);
  test_v2 (32, 29, 1 << 6, qshow);
  test_v2 (32, 28, 1 << 6, qshow);
  test_v2 (32, 27, 1 << 6, qshow);
  test_v2 (32, 26, 1 << 6, qshow);
  test_v2 (32, 25, 1 << 6, qshow);
  test_v2 (32, 32, 0x800000, qshow);
  test_v2 (32, 32, 3, qshow);
  test_v2 (32, 32, 5, qshow);
  test_v2 (32, 32, 6, qshow);
  test_v2 (32, 32, 7, qshow);
  test_v2 (32, 31, 7, qshow);
  test_v2 (32, 30, 7, qshow);
  test_v2 (32, 29, 7, qshow);
  test_v2 (32, 28, 7, qshow);
  test_v2 (32, 32, 9, qshow);
  test_v2 (32, 32, 10, qshow);
  test_v2 (32, 32, 11, qshow);
  test_v2 (32, 32, 12, qshow);
  test_v2 (32, 32, 25, qshow);
  test_v2 (32, 32, 125, qshow);
  test_v2 (32, 32, 625, qshow);
  test_v2 (32, 32, 102807, qshow);
  test_v2 (32, 31, 102807, qshow);
  test_v2 (32, 30, 102807, qshow);
  return;
}

void
lots_test_v2 ()
{
  int qshow = 1;
  test_v2 (32, 32, 1, qshow);
  return;
}

int main()
{
  many_test_v2 (4);
  exit (0);
}

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Making GNU GCC choose_multiplier in expmed.c significantly faster
  2018-07-10 11:09 ` Making GNU GCC choose_multiplier in expmed.c significantly faster colinb2 .
@ 2018-07-19 20:19   ` Jeff Law
  2018-07-19 22:40     ` colinb2 .
  0 siblings, 1 reply; 4+ messages in thread
From: Jeff Law @ 2018-07-19 20:19 UTC (permalink / raw)
  To: colinb2 ., gcc

On 07/10/2018 05:09 AM, colinb2 . wrote:
> Feel free to copy this email and attachment to anyone who might be interested.
> I'm very happy to answer any questions anyone has.
> The program can be compiled and run like this on Linux with GNU GCC:
> gcc -O2 -o expmed2.exe expmed2.c
> ./expmed2.exe
> 
> This email deals with making part of the GNU GCC compiler - integer division
> by a constant divisor - faster. (The calculation of the parameters for the
> machine code will be faster; compiled programs won't run faster.)
> Further down I mention inequality (1) which can be used to make the LLVM
> compiler somewhat faster, because that currently uses code based on (2).
> I don't know what - if anything - the Java JVM uses for this, or how other
> compilers do this, but these ideas may be useful there.
> 
> By significantly faster I mean I have benchmarked alternative versions of
> choose_multiplier which on my low specification netbook can take maybe less than
> half the time the current version takes. Time saved in compilation is much less
> important than time saved in running compiled programs, but the code for the
> faster versions is about the same length as the code for the current version,
> and is only a bit more complicated, so is worth considering?
Improvements to compile time behavior are always worth considering.


> 
> Licencing: in Fractint people's words "Don't want money, want recognition!"
> The version choose_multiplier_v2 is based - but improves - on what's in
> the GCC choose_multiplier function in file expmed.c, so the GCC licence.
> The version choose_multiplier_v4 is based - but improves - on magicu2 in
> "Hacker's Delight", so the licence is you needn't acknowledge the source
> but it would be nice to credit the code as from
> magicu2 in "Hacker's Delight" by Henry S Warren http://hackersdelight.org
> with improvements by Colin Bartlett <colinb2@gmail.com>.
> This latter also applies to choose_multiplier_power2_divrem because that
> is also an obvious (?) idea from magicu2 in "Hacker's Delight".  */
So a key issue with GCC is that for nontrivial code to be included in
GCC, the code's author must assign their copyright ownership to the FSF.


> 
> The idea is using "wide int" seems slow compared to "unsigned HOST_WIDE_INT",
> so it makes sense to avoid using "wide int" as much as possible.
Sure, but the use of wide_int types has certain advantages and if
possible we'd like that class to be just as efficient as a HOST_WIDE_INT
for common operations on 32 and 64 bit types.

In fact, for 32 and 64 bit types, wide_int is supposed to generate the
same code as HOST_WIDE_INT.  At least that was the state in 2013 when
wide_int was introduced.  So it may be worth spending some time figuring
out why changing the types changes the performance.

If you're going to argue to use HOST_WIDE_INT, then you'll have to think
about how you're going to support 128bit or wider target data types
which is the whole point behind the introduction of wide_int.

Anyway, I think these higher level questions/concerns need to be
addressed before we can reasonably discuss replacing wide_int with
HOST_WIDE_INT.

WRT algorithmic changes.   ISTM it would be best to address those
separately from the wide_int vs HOST_WIDE_INT discussion.  An
algorithmic change should have the same impact regardless of whether
we're using wide_int or HOST_WIDE_INT.




 We can easily
> rewrite choose_multiplier to only use "wide int" to calculate the initial mlow;
> this is choose_multiplier_v2. An alternative for choose_multiplier_v2 completely
> avoids using "wide int" by iterating upwards for the initial mlow, but if that
> benchmarks as faster than using "wide int" even once (I suspect it might) then
> just iterating upwards may even be a bit faster; this is choose_multiplier_v4.
So the most obvious take-away from this is to answer the question of
whether or not iterating upwards for the initial mlow is a win or not.
If so, that might be able to go forward independent of any other changes.

> 
> The attachment is self-contained, and I have checked that the values produced
> agree with a "Hacker's Delight" table of M/2^s for small d and n=precision=32.
Rather than attaching new implementations, the generally preferred
packaging is diffs against the trunk.


Jeff

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Making GNU GCC choose_multiplier in expmed.c significantly faster
  2018-07-19 20:19   ` Jeff Law
@ 2018-07-19 22:40     ` colinb2 .
  0 siblings, 0 replies; 4+ messages in thread
From: colinb2 . @ 2018-07-19 22:40 UTC (permalink / raw)
  To: Jeff Law; +Cc: gcc

On 7/19/18, Jeff Law <law@redhat.com> wrote:
> On 07/10/2018 05:09 AM, colinb2 . wrote:

> Improvements to compile time behavior are always worth considering.
The points you make below about wide_int might address this, albeit in
a somewhat different way from the way I am proposing.

> So a key issue with GCC is that for nontrivial code to be included in
> GCC, the code's author must assign their copyright ownership to the FSF.
That's OK for choose_multiplier_v2 which is what I'm suggesting as a
replacement for choose_multiplier.

>> The idea is using "wide int" seems slow compared to "unsigned HOST_WIDE_INT",
>> so it makes sense to avoid using "wide int" as much as possible.
> Sure, but the use of wide_int types has certain advantages and if
> possible we'd like that class to be just as efficient as a HOST_WIDE_INT
> for common operations on 32 and 64 bit types.
That makes sense, and *if* you can do it has some advantages over what
I was proposing. Reviewing my code I realised that I needed to split
off two special cases (i) d=1 and (ii) PRECISION<lgup. I've put an
updated version of choose_multiplier_v2 here:
https://github.com/colinb2r/expmed2

> In fact, for 32 and 64 bit types, wide_int is supposed to generate the
> same code as HOST_WIDE_INT.  At least that was the state in 2013 when
> wide_int was introduced.  So it may be worth spending some time figuring
> out why changing the types changes the performance.
I was assuming that it was something intrinsic to the way wide_int had
to be implemented that made it substantially slower than fixed width
like HOST_WIDE_INT, and I was also assuming that earlier versions of
choose_multiplier which didn't use wide_int, but used an equivalent
extended precision would have a similar lack of speed? I did look at
earlier versions of choose_multiplier, and I have a vague recollection
that one of the earlier implementations basically used HOST_WIDE_INT
throughout, using auxiliary variables to indicate that mlow and mhigh
need an extra bit. Looked at like that it becomes sort of obvious why
an extended precision integer is likely to be slower than a fixed
width HOST_WIDE_INT: if nothing else, there is more "housekeeping"
involved?

> If you're going to argue to use HOST_WIDE_INT, then you'll have to think
> about how you're going to support 128bit or wider target data types
> which is the whole point behind the introduction of wide_int.
>
> Anyway, I think these higher level questions/concerns need to be
> addressed before we can reasonably discuss replacing wide_int with
> HOST_WIDE_INT.

In principle I agree with both these paragraphs. But that said the
current choose_multiplier happily (?) switches between using wide_int
and HOST_WIDE_INT, and at its end the multiplier is of type unsigned
HOST_WIDE_INT, so if we're focussing on choose_multiplier (rather than
the broader issue of making wide_int faster in general) I don't think
these valid points actually apply to choose_multiplier? If what you
were saying was strictly true for choose_multiplier, then it could use
wide_int throughout, only switching to HOST_WIDE_INT at the end? But
that's not how it's written.

> WRT algorithmic changes.   ISTM it would be best to address those
> separately from the wide_int vs HOST_WIDE_INT discussion.  An
> algorithmic change should have the same impact regardless of whether
> we're using wide_int or HOST_WIDE_INT.
Yes. Or to be precise, wide_int ought to be able to replace
HOST_WIDE_INT, but the reverse might not be true? Taking
choose_multiplier as an example, we must be rather more careful going
from wide_int to HOST_WIDE_INT than going from HOST_WIDE_INT to
wide_int?

> Rather than attaching new implementations, the generally preferred
> packaging is diffs against the trunk.
That makes sense. Unfortunately I don't feel I have the competence to
do that, but that's a bridge that can if necessary be crossed in an
appropriate way when the time comes?

> Jeff
Thanks for your pertinent comments. I must confess I had rather
assumed that after over a week without any "bites" that my post wasn't
going to attract any comments, so finding your response just now was a
welcome surprise.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Making GNU GCC choose_multiplier in expmed.c significantly faster
@ 2018-07-11  7:26 colinb2 .
  0 siblings, 0 replies; 4+ messages in thread
From: colinb2 . @ 2018-07-11  7:26 UTC (permalink / raw)
  To: gcc

Clarification: this and my first post assume familiarity with choose_multiplier.

lgup=ceiling(log2(divisor))

Currently choose_multiplier initializes mlow, mhigh at post_shift=lgup;
initially 2^n<=mlow<mhigh so we must use "wide int" in the reduction iteration,
which is relatively slow because using "wide int" or a similar extended integer
type seems slow compared to using fixed width integer types.

But choose_multiplier_v2 initializes at post_shift=lgup-1, so mlow<=mhigh<2^n
and we can now avoid "wide int" in the reduction iteration. We can also
calculate mhigh from mlow without using "wide int", so we can limit using
"wide int" to calculating the initial value of mlow. Depending on the results
of benchmarking, it may even be that using choose_multiplier_v4 which completely
avoids "wide int" is faster than choose_multiplier_v2.

If for the initial values of mlow, mhigh we have mlow<mhigh then we can try
reducing mhigh. But if mlow>=mhigh then we need post_shift=lgup and we need
to use an extra bit for the multiplier, signified by returning 1 instead of 0.

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2018-07-19 21:58 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAMfC4qj_rfkcLnOYNDdnAU2RAqFmrPhHH+1g4mCW=4wgcHNwDg@mail.gmail.com>
2018-07-10 11:09 ` Making GNU GCC choose_multiplier in expmed.c significantly faster colinb2 .
2018-07-19 20:19   ` Jeff Law
2018-07-19 22:40     ` colinb2 .
2018-07-11  7:26 colinb2 .

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).