public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug rtl-optimization/97459] New: __uint128_t remainder for division by 3
@ 2020-10-16 13:33 tkoenig at gcc dot gnu.org
  2020-10-16 15:22 ` [Bug rtl-optimization/97459] " jakub at gcc dot gnu.org
                   ` (26 more replies)
  0 siblings, 27 replies; 28+ messages in thread
From: tkoenig at gcc dot gnu.org @ 2020-10-16 13:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97459

            Bug ID: 97459
           Summary: __uint128_t remainder for division by 3
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tkoenig at gcc dot gnu.org
  Target Milestone: ---

The following two functions are equivalent:

unsigned r3_128u_v1 (__uint128_t n)
{
  unsigned long a;
  a = (n >> 64) + (n & 0xffffffffffffffff);
  return a % 3;
}

unsigned r3_128u_v2 (__uint128_t n)
{
  return (unsigned) (n%3);
}

and the first one is definitely faster.

(The approach is due to Hacker's Delight, 2nd edition, "Remainder by
Summing Digits". There are also other interesting approaches there.)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug rtl-optimization/97459] __uint128_t remainder for division by 3
  2020-10-16 13:33 [Bug rtl-optimization/97459] New: __uint128_t remainder for division by 3 tkoenig at gcc dot gnu.org
@ 2020-10-16 15:22 ` jakub at gcc dot gnu.org
  2020-10-16 15:52 ` jakub at gcc dot gnu.org
                   ` (25 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: jakub at gcc dot gnu.org @ 2020-10-16 15:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97459

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #1 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
I don't think the two are equivalent, consider e.g.
(((__uint128_t)0xffffffffffffffdfULL)<<64)+0xffffffffffffffdfULL
on this, the first function returns 1, the second 2.
I believe Hacker's delign says that
n % 3 == (((n >> 64) + (n & 0xffffffffffffffff)) % 3),
but the addition there isn't a 64-bit number, but a 65-bit number, so either
we'd need to take into account also the carry from the addition, or we'd need
to do the addition still in 128-bits and do it once again afterwards e.g. (n >>
32) + (n & 0xffffffff), which then could be done in 64-bits already.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug rtl-optimization/97459] __uint128_t remainder for division by 3
  2020-10-16 13:33 [Bug rtl-optimization/97459] New: __uint128_t remainder for division by 3 tkoenig at gcc dot gnu.org
  2020-10-16 15:22 ` [Bug rtl-optimization/97459] " jakub at gcc dot gnu.org
@ 2020-10-16 15:52 ` jakub at gcc dot gnu.org
  2020-10-16 15:57 ` jakub at gcc dot gnu.org
                   ` (24 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: jakub at gcc dot gnu.org @ 2020-10-16 15:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97459

--- Comment #2 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
E.g.
unsigned r3_128u_v3 (__uint128_t n)
{
  unsigned long a;
  a = (n >> 88);
  a += (n >> 44) & 0xfffffffffffULL;
  a += (n & 0xfffffffffffULL);
  return a % 3;
}
could work, but haven't measured how fast it is on average against the libcall.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug rtl-optimization/97459] __uint128_t remainder for division by 3
  2020-10-16 13:33 [Bug rtl-optimization/97459] New: __uint128_t remainder for division by 3 tkoenig at gcc dot gnu.org
  2020-10-16 15:22 ` [Bug rtl-optimization/97459] " jakub at gcc dot gnu.org
  2020-10-16 15:52 ` jakub at gcc dot gnu.org
@ 2020-10-16 15:57 ` jakub at gcc dot gnu.org
  2020-10-17 12:22 ` tkoenig at gcc dot gnu.org
                   ` (23 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: jakub at gcc dot gnu.org @ 2020-10-16 15:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97459

--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Or:
unsigned r3_128u_v4 (__uint128_t n)
{
  unsigned long a;
  a = (n >> 96);
  a += (n >> 64) & 0xffffffffULL;
  a += (n >> 32) & 0xffffffffULL;
  a += (n & 0xffffffffULL);
  return a % 3;
}
if the target doesn't have (efficient) multi-word shifts.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug rtl-optimization/97459] __uint128_t remainder for division by 3
  2020-10-16 13:33 [Bug rtl-optimization/97459] New: __uint128_t remainder for division by 3 tkoenig at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2020-10-16 15:57 ` jakub at gcc dot gnu.org
@ 2020-10-17 12:22 ` tkoenig at gcc dot gnu.org
  2020-10-18  7:33 ` tkoenig at gcc dot gnu.org
                   ` (22 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: tkoenig at gcc dot gnu.org @ 2020-10-17 12:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97459

--- Comment #4 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
Here's a complete program for benchmarks on x86_64, using Jakub's
functions (so they are indeed correct):

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <x86intrin.h>

unsigned r3_128u_v2 (__uint128_t n)
{
  return (unsigned) (n%3);
}

unsigned r3_128u_v3 (__uint128_t n)
{
  unsigned long a;
  a = (n >> 88);
  a += (n >> 44) & 0xfffffffffffULL;
  a += (n & 0xfffffffffffULL);
  return a % 3;
}

unsigned r3_128u_v4 (__uint128_t n)
{
  unsigned long a;
  a = (n >> 96);
  a += (n >> 64) & 0xffffffffULL;
  a += (n >> 32) & 0xffffffffULL;
  a += (n & 0xffffffffULL);
  return a % 3;
}

#define N 1000000

int main()
{
  __uint128_t *a;
  unsigned int s;
  unsigned long t1, t2;
  int fd;
  int i;
  a = malloc (sizeof (*a) * N);
  fd = open ("/dev/random", O_RDONLY);
  read (fd, a, sizeof (*a) * N);
  s = 0;
  t1 = __rdtsc();
  for (i=0; i<N; i++)
    s += r3_128u_v2(a[i]);
  t2 = __rdtsc();
  printf ("s = %u r3_128u_v2: %f cycles per iteration\n", s, (t2-t1)/(double)
N);

  s = 0;
  t1 = __rdtsc();
  for (i=0; i<N; i++)
    s += r3_128u_v3(a[i]);
  t2 = __rdtsc();
  printf ("s = %u r3_128u_v2: %f cycles per iteration\n", s, (t2-t1)/(double)
N);

  s = 0;
  t1 = __rdtsc();
  for (i=0; i<N; i++)
    s += r3_128u_v4(a[i]);
  t2 = __rdtsc();
  printf ("s = %u r3_128u_v2: %f cycles per iteration\n", s, (t2-t1)/(double)
N);

}

And here are the results on my box using -O3 -march=native:

s = 7 r3_128u_v2: 22.204618 cycles per iteration
s = 7 r3_128u_v2: 8.143544 cycles per iteration
s = 7 r3_128u_v2: 6.110718 cycles per iteration

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug rtl-optimization/97459] __uint128_t remainder for division by 3
  2020-10-16 13:33 [Bug rtl-optimization/97459] New: __uint128_t remainder for division by 3 tkoenig at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2020-10-17 12:22 ` tkoenig at gcc dot gnu.org
@ 2020-10-18  7:33 ` tkoenig at gcc dot gnu.org
  2020-10-18  9:11 ` jakub at gcc dot gnu.org
                   ` (21 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: tkoenig at gcc dot gnu.org @ 2020-10-18  7:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97459

--- Comment #5 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
OK, so here is a benchmark with its function names corrected. It also
includes one version (_v5) which is a bit faster.

(Note I increased the number of iterations to get more accuracy out
of the cycle count, which leads to numbers not being comparable
to the previous benchmark.)

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <x86intrin.h>

unsigned r3_128u_v2 (__uint128_t n)
{
  return (unsigned) (n%3);
}

unsigned r3_128u_v3 (__uint128_t n)
{
  unsigned long a;
  a = (n >> 88);
  a += (n >> 44) & 0xfffffffffffULL;
  a += (n & 0xfffffffffffULL);
  return a % 3;
}

unsigned r3_128u_v4 (__uint128_t n)
{
  unsigned long a;
  a = (n >> 96);
  a += (n >> 64) & 0xffffffffULL;
  a += (n >> 32) & 0xffffffffULL;
  a += (n & 0xffffffffULL);
  return a % 3;
}

unsigned r3_128u_v5 (__uint128_t n)
{
  unsigned long a, b, c;
  b = n >> 64;
  c = n;
  if (__builtin_add_overflow (b, c, &a))
    a++;

  return a%3;
}

#define N 100000000

int main()
{
  __uint128_t *a;
  unsigned int s;
  unsigned long t1, t2;
  int fd;
  int i;
  a = malloc (sizeof (*a) * N);
  fd = open ("/dev/random", O_RDONLY);
  read (fd, a, sizeof (*a) * N);
  s = 0;
  t1 = __rdtsc();
  for (i=0; i<N; i++)
    s += r3_128u_v2(a[i]);
  t2 = __rdtsc();
  printf ("s = %u r3_128u_v2: %f cycles per iteration\n", s, (t2-t1)/(double)
N);

  s = 0;
  t1 = __rdtsc();
  for (i=0; i<N; i++)
    s += r3_128u_v3(a[i]);
  t2 = __rdtsc();
  printf ("s = %u r3_128u_v3: %f cycles per iteration\n", s, (t2-t1)/(double)
N);

  s = 0;
  t1 = __rdtsc();
  for (i=0; i<N; i++)
    s += r3_128u_v4(a[i]);
  t2 = __rdtsc();
  printf ("s = %u r3_128u_v4: %f cycles per iteration\n", s, (t2-t1)/(double)
N);

  s = 0;
  t1 = __rdtsc();
  for (i=0; i<N; i++)
    s += r3_128u_v5(a[i]);
  t2 = __rdtsc();
  printf ("s = %u r3_128u_v5: %f cycles per iteration\n", s, (t2-t1)/(double)
N);

}

This gets me

s = 6 r3_128u_v2: 12.638648 cycles per iteration
s = 6 r3_128u_v3: 5.588043 cycles per iteration
s = 6 r3_128u_v4: 5.524949 cycles per iteration
s = 6 r3_128u_v5: 3.539010 cycles per iteration

so the _v5 version seems to be the fastest (so far).

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug rtl-optimization/97459] __uint128_t remainder for division by 3
  2020-10-16 13:33 [Bug rtl-optimization/97459] New: __uint128_t remainder for division by 3 tkoenig at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2020-10-18  7:33 ` tkoenig at gcc dot gnu.org
@ 2020-10-18  9:11 ` jakub at gcc dot gnu.org
  2020-10-19 11:51 ` jakub at gcc dot gnu.org
                   ` (20 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: jakub at gcc dot gnu.org @ 2020-10-18  9:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97459

--- Comment #6 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
I wasn't sure if the v5 version would be always correct.  But the highest
possible halves of the uint128_t number are 0xffffffffffffffff +
0xffffffffffffffff, which gives carry 1 and 0xfffffffffffffffeULL and so it
won't overflow again when adding the carry to it.  So perhaps for 3 we are ok,
but probably not for the other modulo which could be using similar technique.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug rtl-optimization/97459] __uint128_t remainder for division by 3
  2020-10-16 13:33 [Bug rtl-optimization/97459] New: __uint128_t remainder for division by 3 tkoenig at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2020-10-18  9:11 ` jakub at gcc dot gnu.org
@ 2020-10-19 11:51 ` jakub at gcc dot gnu.org
  2020-10-19 12:05 ` jakub at gcc dot gnu.org
                   ` (19 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: jakub at gcc dot gnu.org @ 2020-10-19 11:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97459

--- Comment #7 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
So, can we use this for anything but modulo 3, or 5, or 17, or 257 (all of
those have 2^32 mod N == 2^64 mod N == 2^128 mod N == 1), probably also keyed
on the target having corresponding uaddv4_optab handler, normal expansion not
being able to handle it and emitting a libcall?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug rtl-optimization/97459] __uint128_t remainder for division by 3
  2020-10-16 13:33 [Bug rtl-optimization/97459] New: __uint128_t remainder for division by 3 tkoenig at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2020-10-19 11:51 ` jakub at gcc dot gnu.org
@ 2020-10-19 12:05 ` jakub at gcc dot gnu.org
  2020-10-20 18:16 ` tkoenig at gcc dot gnu.org
                   ` (18 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: jakub at gcc dot gnu.org @ 2020-10-19 12:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97459

--- Comment #8 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
And perhaps for other (but constant and not power of two) modulos use that
  unsigned long long a;
  a = (n >> 96) * (unsigned long long) (((__uint128_t 1) << 96) % c);
  a += ((n >> 64) & 0xffffffffULL) * (unsigned long long) (((__uint128_t 1) <<
64) % c);
  a += ((n >> 32) & 0xffffffffULL) * (unsigned long long) (((__uint128_t 1) <<
32) % c);
  a += (n & 0xffffffffULL);
  return a % c;
form?  Though, guess it might get quite large because the multiplications by
constants can expand again into multiple instructions.  And it needs to be
careful that the 4 addends don't overflow the 32 bits.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug rtl-optimization/97459] __uint128_t remainder for division by 3
  2020-10-16 13:33 [Bug rtl-optimization/97459] New: __uint128_t remainder for division by 3 tkoenig at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2020-10-19 12:05 ` jakub at gcc dot gnu.org
@ 2020-10-20 18:16 ` tkoenig at gcc dot gnu.org
  2020-10-23 16:56 ` tkoenig at gcc dot gnu.org
                   ` (17 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: tkoenig at gcc dot gnu.org @ 2020-10-20 18:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97459

--- Comment #9 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #7)
> So, can we use this for anything but modulo 3, or 5, or 17, or 257 (all of
> those have 2^32 mod N == 2^64 mod N == 2^128 mod N == 1)

I think so, too.

> probably also
> keyed on the target having corresponding uaddv4_optab handler, normal
> expansion not being able to handle it and emitting a libcall?

Again, yes.

This can also be used as a building block for handling division
and remainder base 10.

Here's a benchmark for this (it uses the sum of digits base 10
instead). qsum1 uses the standard method, which you can find
(for example) in libgfortran.

div_rem5_v2 first calculates the remainder of the division by 5 using this
method, then does an exact division by multiplying with its modular inverse
for 2^128.

div_rem10_v2 then uses div_rem5_v2 to calculate the value and
remainder of the division by 10, and qsum_v2 uses that to
calculate the sum of digits.

The timings are about a factor of 2 faster than the straightforward
libcall version:

s = 360398898 qsum_v1: 1.091621 s
s = 360398898 qsum_v2: 0.485509 s


#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>
#include <stdlib.h>

#define ONE ((__uint128_t) 1)
#define TWO_64 (ONE << 64)

typedef __uint128_t mytype;

double this_time ()
{
  struct timeval tv;
  gettimeofday (&tv, NULL);
  return tv.tv_sec + tv.tv_usec * 1e-6;
}

unsigned
qsum_v1 (mytype n)
{
  unsigned ret;
  ret = 0;
  while (n > 0)
    {
      ret += n % 10;
      n = n / 10;
    }
  return ret;
}

static void inline __attribute__((always_inline))
div_rem_5_v2 (mytype n, mytype *div, unsigned *rem)
{
  unsigned long a, b, c;
  /* The modular inverse to 5 modulo 2^128  */
  const mytype magic = (0xCCCCCCCCCCCCCCCC * TWO_64 + 0xCCCCCCCCCCCCCCCD *
ONE);
  b = n >> 64;
  c = n;
  if (__builtin_add_overflow (b, c, &a))
    a++;

  *rem = a % 5;
  *div = (n-*rem) * magic;
}

static void inline __attribute__((always_inline))
div_rem_10_v2 (mytype n, mytype *div, unsigned *rem)
{
  mytype n5;
  unsigned rem5;
  div_rem_5_v2 (n, &n5, &rem5);
  *rem = rem5 + (n5 % 2) * 5;
  *div = n5/2;
}

unsigned
qsum_v2 (mytype n)
{
  unsigned ret;
  unsigned rem;
  mytype n_new;
  ret = 0;
  while (n > 0)
    {
      div_rem_10_v2 (n, &n_new, &rem);
      ret += rem;
      n = n_new;
    }
  return ret;
}

#define N 10000000

int main()
{
  mytype *a;
  unsigned long int s;
  double t1, t2;
  int fd;
  long int i;
  a = malloc (sizeof (*a) * N);
  fd = open ("/dev/urandom", O_RDONLY);
  read (fd, a, sizeof (*a) * N);

  s = 0;
  t1 = this_time();
  for (i=0; i<N; i++)
    s += qsum_v1(a[i]);

  t2 = this_time();
  printf ("s = %lu qsum_v1: %f s\n", s, (t2-t1));

  s = 0;
  t1 = this_time();
  for (i=0; i<N; i++)
    s += qsum_v2(a[i]);

  t2 = this_time();
  printf ("s = %lu qsum_v2: %f s\n", s, (t2-t1));

  return 0;
}

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug rtl-optimization/97459] __uint128_t remainder for division by 3
  2020-10-16 13:33 [Bug rtl-optimization/97459] New: __uint128_t remainder for division by 3 tkoenig at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2020-10-20 18:16 ` tkoenig at gcc dot gnu.org
@ 2020-10-23 16:56 ` tkoenig at gcc dot gnu.org
  2020-10-24 18:09 ` tkoenig at gcc dot gnu.org
                   ` (16 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: tkoenig at gcc dot gnu.org @ 2020-10-23 16:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97459

--- Comment #10 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
There are a couple of more constants for this could be tried.

Base 7:

static unsigned 
rem_7_v2 (mytype n)
{
  unsigned long a, b, c, d;
  a = n & MASK_48;
  b = (n >> 48) & MASK_48;
  c = n >> 96;
  return (a+b+c) % 7;
}

gives the reminder with respect to 7.

The reason is that 2^48-1 = 3*3*5*7*13*17*97*241*257*673, so a shift
of 48 bits works for any combination of these factors. However, for 15,
I would have to check if it would be better to use the 64-bit shift.

For 19, it's a shift of 56 that would work.

I think I'd better make a table.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug rtl-optimization/97459] __uint128_t remainder for division by 3
  2020-10-16 13:33 [Bug rtl-optimization/97459] New: __uint128_t remainder for division by 3 tkoenig at gcc dot gnu.org
                   ` (9 preceding siblings ...)
  2020-10-23 16:56 ` tkoenig at gcc dot gnu.org
@ 2020-10-24 18:09 ` tkoenig at gcc dot gnu.org
  2020-10-24 18:14 ` tkoenig at gcc dot gnu.org
                   ` (15 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: tkoenig at gcc dot gnu.org @ 2020-10-24 18:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97459

--- Comment #11 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
Created attachment 49438
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49438&action=edit
Numbers a, b so that 2^b  ≡ 1 mod a up to b=64, larger b taken if several
solutions exist

Here is the promised table of divisors for which this optimization
is possible.

For example, the line 

9 60

means that the remainder of a division by 9 can be calculated by

#define MASK60 ((1ul << 60) - 1)

unsigned rem_9 (__uint128_t n)
{
    __uint64_t a, b, c;
    a = n & MASK60;
    b = (n >> 60) && MASK60;
    c = (n >> 120);
    return (a+b+c) % 9;
}

The number of terms varies; for b=64, it is two terms; for
63 >= b >= 43, it is three terms, and for the rest, it is four terms.

67 is the first odd divisor for which there is no such shortcut, the
next one is 83. Those are the only gaps below 100.

Of course, if a is in the list, then a*2^n can be treated by shifting
(like it was shown for a=10).

Now, the interesting question, what to make of it.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug rtl-optimization/97459] __uint128_t remainder for division by 3
  2020-10-16 13:33 [Bug rtl-optimization/97459] New: __uint128_t remainder for division by 3 tkoenig at gcc dot gnu.org
                   ` (10 preceding siblings ...)
  2020-10-24 18:09 ` tkoenig at gcc dot gnu.org
@ 2020-10-24 18:14 ` tkoenig at gcc dot gnu.org
  2020-10-26  5:28 ` tkoenig at gcc dot gnu.org
                   ` (14 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: tkoenig at gcc dot gnu.org @ 2020-10-24 18:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97459

--- Comment #12 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
(In reply to Thomas Koenig from comment #11)
> Created attachment 49438 [details]
> Numbers a, b so that 2^b  ≡ 1 mod a up to b=64, larger b taken if several
> solutions exist
>

A quick check that all numbers are correct is

awk ' { print 2 "^" $2 "%" $1 } ' divisiontable.dat | bc

which shows 1 as output only.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug rtl-optimization/97459] __uint128_t remainder for division by 3
  2020-10-16 13:33 [Bug rtl-optimization/97459] New: __uint128_t remainder for division by 3 tkoenig at gcc dot gnu.org
                   ` (11 preceding siblings ...)
  2020-10-24 18:14 ` tkoenig at gcc dot gnu.org
@ 2020-10-26  5:28 ` tkoenig at gcc dot gnu.org
  2020-11-08  9:31 ` tkoenig at gcc dot gnu.org
                   ` (13 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: tkoenig at gcc dot gnu.org @ 2020-10-26  5:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97459

Thomas Koenig <tkoenig at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #49438|divisiontable.dat           |divisiontable.txt
           filename|                            |

--- Comment #13 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
Comment on attachment 49438
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49438
Numbers a, b so that 2^b  ≡ 1 mod a up to b=64, larger b taken if several
solutions exist

Seems the bugzilla system decided this was an MPEG file.

Well, it is not, hopefully renaming it as txt will help.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug rtl-optimization/97459] __uint128_t remainder for division by 3
  2020-10-16 13:33 [Bug rtl-optimization/97459] New: __uint128_t remainder for division by 3 tkoenig at gcc dot gnu.org
                   ` (12 preceding siblings ...)
  2020-10-26  5:28 ` tkoenig at gcc dot gnu.org
@ 2020-11-08  9:31 ` tkoenig at gcc dot gnu.org
  2020-11-08 10:24 ` jakub at gcc dot gnu.org
                   ` (12 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: tkoenig at gcc dot gnu.org @ 2020-11-08  9:31 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97459

--- Comment #14 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
Created attachment 49520
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49520&action=edit
Numbers a, b so that 2^b  ≡ 1 mod a up to b=64, larger b taken if several
solutions exist, plus the multiplicative inverse for 2^128

I've added the multiplicative inverse to the table, calculated with
maxima by inv_mod(x,2^128). Output is in hex, to make it easier to
break down into two numbers.

Is there any more info that I could provide?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug rtl-optimization/97459] __uint128_t remainder for division by 3
  2020-10-16 13:33 [Bug rtl-optimization/97459] New: __uint128_t remainder for division by 3 tkoenig at gcc dot gnu.org
                   ` (13 preceding siblings ...)
  2020-11-08  9:31 ` tkoenig at gcc dot gnu.org
@ 2020-11-08 10:24 ` jakub at gcc dot gnu.org
  2020-11-08 19:28 ` tkoenig at gcc dot gnu.org
                   ` (11 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: jakub at gcc dot gnu.org @ 2020-11-08 10:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97459

--- Comment #15 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
I plan to work on this early in stage3.
And we really shouldn't use any tables, GCC should figure it all out.
So, for double-word modulo by constant that would be expanded using a libcall,
go for x from the word bitsize to double-word bitsize and check if (1max << x)
% cst
is 1 (and prefer what we've agreed on for 3), and fall back to multiplications
(see #c8) if there aren't any other options and the costs don't say it is too
costly.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug rtl-optimization/97459] __uint128_t remainder for division by 3
  2020-10-16 13:33 [Bug rtl-optimization/97459] New: __uint128_t remainder for division by 3 tkoenig at gcc dot gnu.org
                   ` (14 preceding siblings ...)
  2020-11-08 10:24 ` jakub at gcc dot gnu.org
@ 2020-11-08 19:28 ` tkoenig at gcc dot gnu.org
  2020-11-08 19:33 ` tkoenig at gcc dot gnu.org
                   ` (10 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: tkoenig at gcc dot gnu.org @ 2020-11-08 19:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97459

--- Comment #16 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #15)
> I plan to work on this early in stage3.
> And we really shouldn't use any tables, GCC should figure it all out.
> So, for double-word modulo by constant that would be expanded using a
> libcall, go for x from the word bitsize to double-word bitsize and check if
> (1max << x) % cst
> is 1

It's probably better to search from high to low, to reduce the number
of necessary shifts for division by constants like 9 or 13.

> (and prefer what we've agreed on for 3), and fall back to
> multiplications (see #c8) if there aren't any other options and the costs
> don't say it is too costly.

I think for variants where the constants aren't power of two,

#define ONE ((__uint128_t) 1)
#define TWO_64 (ONE << 64)
#define MASK60 ((1ul << 60) - 1)

void
div_rem_13 (mytype n, mytype *div, unsigned int *rem)
{
  const mytype magic = TWO_64 * 14189803133622732012u + 5675921253449092805u *
ONE; /* 0xC4EC4EC4EC4EC4EC4EC4EC4EC4EC4EC5 */
  __uint64_t a, b, c;
  unsigned int r;

  a = n & MASK60;
  b = (n >> 60);
  b = b & MASK60;
  c = (n >> 120);
  r = (a+b+c) % 13;
  n = n - r;
  *div = n * magic;
  *rem = r;
}

should be pretty efficient; there is only one shift which spans two
words.  (The assembly generated from the function looks weird
because of quite a few move instructions, but that should not be
an issue for code generated inline).

Regarding the approach in comment #8, I think I'll run some benchmarks
to see how well that works for other constants which don't fit
the pattern of being divisors for 2^n-1.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug rtl-optimization/97459] __uint128_t remainder for division by 3
  2020-10-16 13:33 [Bug rtl-optimization/97459] New: __uint128_t remainder for division by 3 tkoenig at gcc dot gnu.org
                   ` (15 preceding siblings ...)
  2020-11-08 19:28 ` tkoenig at gcc dot gnu.org
@ 2020-11-08 19:33 ` tkoenig at gcc dot gnu.org
  2020-11-26 19:12 ` jakub at gcc dot gnu.org
                   ` (9 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: tkoenig at gcc dot gnu.org @ 2020-11-08 19:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97459

--- Comment #17 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---

To be compilable, my previous code lacks

typedef __uint128_t mytype;

> #define ONE ((__uint128_t) 1)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug rtl-optimization/97459] __uint128_t remainder for division by 3
  2020-10-16 13:33 [Bug rtl-optimization/97459] New: __uint128_t remainder for division by 3 tkoenig at gcc dot gnu.org
                   ` (16 preceding siblings ...)
  2020-11-08 19:33 ` tkoenig at gcc dot gnu.org
@ 2020-11-26 19:12 ` jakub at gcc dot gnu.org
  2020-11-27 13:26 ` jakub at gcc dot gnu.org
                   ` (8 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: jakub at gcc dot gnu.org @ 2020-11-26 19:12 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97459

--- Comment #18 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Created attachment 49633
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49633&action=edit
gcc11-pr97459-wip.patch

Completely untested patch, so far only for double-word unsigned modulo.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug rtl-optimization/97459] __uint128_t remainder for division by 3
  2020-10-16 13:33 [Bug rtl-optimization/97459] New: __uint128_t remainder for division by 3 tkoenig at gcc dot gnu.org
                   ` (17 preceding siblings ...)
  2020-11-26 19:12 ` jakub at gcc dot gnu.org
@ 2020-11-27 13:26 ` jakub at gcc dot gnu.org
  2020-11-27 15:10 ` jakub at gcc dot gnu.org
                   ` (7 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: jakub at gcc dot gnu.org @ 2020-11-27 13:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97459

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #49633|0                           |1
        is obsolete|                            |

--- Comment #19 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Created attachment 49634
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49634&action=edit
gcc11-pr97459-wip.patch

The earlier patch was buggy, we can't really consider the bit 63 for 128-bit
umod or 31 for 64-bit umod, because the sum of the 3 numbers like:
0x7fffffffffffffffULL + 0x7fffffffffffffffULL + 0x3ULL can overflow and unlike
the bit 64 (or 32) case it would need more code to handle those 2 values where
it overflowed.

I'll see if I figure out how to do signed modulo too.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug rtl-optimization/97459] __uint128_t remainder for division by 3
  2020-10-16 13:33 [Bug rtl-optimization/97459] New: __uint128_t remainder for division by 3 tkoenig at gcc dot gnu.org
                   ` (18 preceding siblings ...)
  2020-11-27 13:26 ` jakub at gcc dot gnu.org
@ 2020-11-27 15:10 ` jakub at gcc dot gnu.org
  2020-11-27 18:34 ` jakub at gcc dot gnu.org
                   ` (6 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: jakub at gcc dot gnu.org @ 2020-11-27 15:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97459

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #49634|0                           |1
        is obsolete|                            |
             Status|UNCONFIRMED                 |ASSIGNED
     Ever confirmed|0                           |1
           Assignee|unassigned at gcc dot gnu.org      |jakub at gcc dot gnu.org
   Last reconfirmed|                            |2020-11-27

--- Comment #20 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Created attachment 49636
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49636&action=edit
gcc11-pr97459-wip.patch

Updated patch that can also handle signed modulo.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug rtl-optimization/97459] __uint128_t remainder for division by 3
  2020-10-16 13:33 [Bug rtl-optimization/97459] New: __uint128_t remainder for division by 3 tkoenig at gcc dot gnu.org
                   ` (19 preceding siblings ...)
  2020-11-27 15:10 ` jakub at gcc dot gnu.org
@ 2020-11-27 18:34 ` jakub at gcc dot gnu.org
  2020-11-30  9:57 ` cvs-commit at gcc dot gnu.org
                   ` (5 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: jakub at gcc dot gnu.org @ 2020-11-27 18:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97459

--- Comment #21 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Created attachment 49639
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49639&action=edit
gcc11-pr97459-alt.patch

Variant version of the patch which avoids the final adjustment from unsigned to
signed modulo by using different correction addend and signed modulo on the
word.
But it seems to have mixed effects, on i386 results in somewhat smaller code,
but on x86_64 in larger code.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug rtl-optimization/97459] __uint128_t remainder for division by 3
  2020-10-16 13:33 [Bug rtl-optimization/97459] New: __uint128_t remainder for division by 3 tkoenig at gcc dot gnu.org
                   ` (20 preceding siblings ...)
  2020-11-27 18:34 ` jakub at gcc dot gnu.org
@ 2020-11-30  9:57 ` cvs-commit at gcc dot gnu.org
  2020-12-01 15:25 ` cvs-commit at gcc dot gnu.org
                   ` (4 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2020-11-30  9:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97459

--- Comment #22 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:4d87bd39bafae86747944b2f8c53fdbc43b8dac3

commit r11-5533-g4d87bd39bafae86747944b2f8c53fdbc43b8dac3
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Mon Nov 30 10:55:43 2020 +0100

    expansion: Improve double-word modulo by certain constant divisors
[PR97459]

    As discussed in the PR, e.g. on x86_64 (both -m32 and -m64) there is no
    double-word modulo and so we expand it to a __{,u}mod[dt]i3 call.
    For certain constant divisors we can do better.  E.g. consider
    32-bit word-size, 0x100000000ULL % 3 == 1, so we can use partly the
Hacker's
    delight modulo by summing digits approach and optimize
    unsigned long long foo (unsigned long long x) { return x % 3; }
    as
    unsigned long long foo (unsigned long long x) {
      unsigned int sum, carry;
      carry = __builtin_add_overflow ((unsigned int) x, (unsigned int) (x >>
32), &sum);
      sum += carry;
      return sum % 3;
    }
    Similarly, 0x10000000ULL % 5 == 1 (note, 1 << 28), so
    unsigned long long bar (unsigned long long x) { return x % 5; }
    as
    unsigned long long bar (unsigned long long x) {
      unsigned int sum = x & ((1 << 28) - 1);
      sum += (x >> 28) & ((1 << 28) - 1);
      sum += (x >> 56);
      return sum % 5;
    }
    etc.
    And we can do also signed modulo,
    long long baz (long long x) { return x % 5; }
    as
    long long baz (long long x) {
      unsigned int sum = x & ((1 << 28) - 1);
      sum += ((unsigned long long) x >> 28) & ((1 << 28) - 1);
      sum += ((unsigned long long) x >> 56);
      /* Sum adjustment for negative x.  */
      sum += (x >> 63) & 3;
      unsigned int rem = sum % 5;
      /* And finally adjust it to the right interval for negative values.  */
      return (int) (rem + ((x >> 63) & -4));
    }

    2020-11-30  Jakub Jelinek  <jakub@redhat.com>

            PR rtl-optimization/97459
            * internal-fn.h (expand_addsub_overflow): Declare.
            * internal-fn.c (expand_addsub_overflow): No longer static.
            * optabs.c (expand_doubleword_mod): New function.
            (expand_binop): Optimize double-word mod with constant divisor.

            * gcc.dg/pr97459-1.c: New test.
            * gcc.dg/pr97459-2.c: New test.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug rtl-optimization/97459] __uint128_t remainder for division by 3
  2020-10-16 13:33 [Bug rtl-optimization/97459] New: __uint128_t remainder for division by 3 tkoenig at gcc dot gnu.org
                   ` (21 preceding siblings ...)
  2020-11-30  9:57 ` cvs-commit at gcc dot gnu.org
@ 2020-12-01 15:25 ` cvs-commit at gcc dot gnu.org
  2020-12-02 10:41 ` cvs-commit at gcc dot gnu.org
                   ` (3 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2020-12-01 15:25 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97459

--- Comment #23 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:855bb43f6d0bee5a74b5d3739456ca34b4609a50

commit r11-5614-g855bb43f6d0bee5a74b5d3739456ca34b4609a50
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Tue Dec 1 16:25:06 2020 +0100

    Improve double-word mod even on powerpc [PR97459]

    I have noticed that while my (already committed, thanks for review)
    patch works on x86, it doesn't work on powerpc*.  The problem is that
    we don't have lshr double-word optab (neither TImode nor for -m32 DImode),
    but as expander has code for double-word shift, that doesn't really matter.
    As the implementation is prepared to punt whenever something can't be
    expanded with OPTAB_DIRECT and in the end also punts if any library calls
    would be emitted, the optab_handler checks were just to save compile time.

    On the other side, for even divisors, we know that (1 << bit) % (2 * x)
    for bit > 0 will never be equal to 1, because both dividend and divisor
    are even and so remainder will be even too, so we can save some compile
time
    by adding an early exit.

    The even divisors can be handled with the approach Thomas wrote about
    (perhaps generalized into divisors equal to what expand_doubleword_mod
    can handle times some power of two where we can handle power of two modulo
    cheaply), but that would be done in a different function...
    And we could use ctz to find the power of two...

    2020-12-01  Jakub Jelinek  <jakub@redhat.com>

            PR rtl-optimization/97459
            * optabs.c (expand_doubleword_mod): Punt early for even op1.
            (expand_binop): Don't require lshr_optab double-word handler.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug rtl-optimization/97459] __uint128_t remainder for division by 3
  2020-10-16 13:33 [Bug rtl-optimization/97459] New: __uint128_t remainder for division by 3 tkoenig at gcc dot gnu.org
                   ` (22 preceding siblings ...)
  2020-12-01 15:25 ` cvs-commit at gcc dot gnu.org
@ 2020-12-02 10:41 ` cvs-commit at gcc dot gnu.org
  2020-12-02 13:36 ` jakub at gcc dot gnu.org
                   ` (2 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2020-12-02 10:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97459

--- Comment #24 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:037fe26ee1ac18581bf0ad646d48591c97d10bd3

commit r11-5648-g037fe26ee1ac18581bf0ad646d48591c97d10bd3
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Wed Dec 2 11:32:19 2020 +0100

    expansion: Further improve double-word modulo, division and divmod
[PR97459]

    The following patch implements what Thomas wrote about, in particular
    that we can handle also double-word divison by the constants for which
    the earlier patch optimized modulo (if it would be otherwise a library
    call) and that we can also easily handle such constants shifted to the
left.
    Unfortunately, seems CSE isn't able to optimize away the two almost
    identical sequences (one to compute remainder, one to compute quotient),
    probably because of the ADD_OVERFLOW introduced jumps, so the patch also
    adjusts expand_DIVMOD.

    2020-12-02  Jakub Jelinek  <jakub@redhat.com>

            PR rtl-optimization/97459
            * optabs.h (expand_doubleword_divmod): Declare.
            * optabs.c (expand_doubleword_divmod): New function.
            (expand_binop): Use it.
            * internal-fn.c (expand_DIVMOD): Likewise.

            * gcc.target/i386/pr97282.c (foo): Use 123456 divisor instead of
            10.
            * gcc.dg/pr97459-1.c (TESTS): Add tests for 10, 12 and
            6144.
            * gcc.dg/pr97459-2.c (TESTS): Likewise.
            * gcc.dg/pr97459-3.c: New test.
            * gcc.dg/pr97459-4.c: New test.
            * gcc.dg/pr97459-5.c: New test.
            * gcc.dg/pr97459-6.c: New test.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug rtl-optimization/97459] __uint128_t remainder for division by 3
  2020-10-16 13:33 [Bug rtl-optimization/97459] New: __uint128_t remainder for division by 3 tkoenig at gcc dot gnu.org
                   ` (23 preceding siblings ...)
  2020-12-02 10:41 ` cvs-commit at gcc dot gnu.org
@ 2020-12-02 13:36 ` jakub at gcc dot gnu.org
  2020-12-03 19:55 ` tkoenig at gcc dot gnu.org
  2021-08-15 11:24 ` pinskia at gcc dot gnu.org
  26 siblings, 0 replies; 28+ messages in thread
From: jakub at gcc dot gnu.org @ 2020-12-02 13:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97459

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #25 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Should be implemented now.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug rtl-optimization/97459] __uint128_t remainder for division by 3
  2020-10-16 13:33 [Bug rtl-optimization/97459] New: __uint128_t remainder for division by 3 tkoenig at gcc dot gnu.org
                   ` (24 preceding siblings ...)
  2020-12-02 13:36 ` jakub at gcc dot gnu.org
@ 2020-12-03 19:55 ` tkoenig at gcc dot gnu.org
  2021-08-15 11:24 ` pinskia at gcc dot gnu.org
  26 siblings, 0 replies; 28+ messages in thread
From: tkoenig at gcc dot gnu.org @ 2020-12-03 19:55 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97459

--- Comment #26 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
Yep, it's implemented and works great.

For a simple "sum of digits" program in base ten, it's an acceleration
by more than a factor of two.

Thanks!

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [Bug rtl-optimization/97459] __uint128_t remainder for division by 3
  2020-10-16 13:33 [Bug rtl-optimization/97459] New: __uint128_t remainder for division by 3 tkoenig at gcc dot gnu.org
                   ` (25 preceding siblings ...)
  2020-12-03 19:55 ` tkoenig at gcc dot gnu.org
@ 2021-08-15 11:24 ` pinskia at gcc dot gnu.org
  26 siblings, 0 replies; 28+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-08-15 11:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97459

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |11.0

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2021-08-15 11:24 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-16 13:33 [Bug rtl-optimization/97459] New: __uint128_t remainder for division by 3 tkoenig at gcc dot gnu.org
2020-10-16 15:22 ` [Bug rtl-optimization/97459] " jakub at gcc dot gnu.org
2020-10-16 15:52 ` jakub at gcc dot gnu.org
2020-10-16 15:57 ` jakub at gcc dot gnu.org
2020-10-17 12:22 ` tkoenig at gcc dot gnu.org
2020-10-18  7:33 ` tkoenig at gcc dot gnu.org
2020-10-18  9:11 ` jakub at gcc dot gnu.org
2020-10-19 11:51 ` jakub at gcc dot gnu.org
2020-10-19 12:05 ` jakub at gcc dot gnu.org
2020-10-20 18:16 ` tkoenig at gcc dot gnu.org
2020-10-23 16:56 ` tkoenig at gcc dot gnu.org
2020-10-24 18:09 ` tkoenig at gcc dot gnu.org
2020-10-24 18:14 ` tkoenig at gcc dot gnu.org
2020-10-26  5:28 ` tkoenig at gcc dot gnu.org
2020-11-08  9:31 ` tkoenig at gcc dot gnu.org
2020-11-08 10:24 ` jakub at gcc dot gnu.org
2020-11-08 19:28 ` tkoenig at gcc dot gnu.org
2020-11-08 19:33 ` tkoenig at gcc dot gnu.org
2020-11-26 19:12 ` jakub at gcc dot gnu.org
2020-11-27 13:26 ` jakub at gcc dot gnu.org
2020-11-27 15:10 ` jakub at gcc dot gnu.org
2020-11-27 18:34 ` jakub at gcc dot gnu.org
2020-11-30  9:57 ` cvs-commit at gcc dot gnu.org
2020-12-01 15:25 ` cvs-commit at gcc dot gnu.org
2020-12-02 10:41 ` cvs-commit at gcc dot gnu.org
2020-12-02 13:36 ` jakub at gcc dot gnu.org
2020-12-03 19:55 ` tkoenig at gcc dot gnu.org
2021-08-15 11:24 ` pinskia at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).