From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
 id 0EC993861845; Tue, 20 Oct 2020 18:16:53 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 0EC993861845
From: "tkoenig at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug rtl-optimization/97459] __uint128_t remainder for division by 3
Date: Tue, 20 Oct 2020 18:16:52 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: rtl-optimization
X-Bugzilla-Version: 11.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: tkoenig at gcc dot gnu.org
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-97459-4-mgKgtkLRYp@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-97459-4@http.gcc.gnu.org/bugzilla/>
References: <bug-97459-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: gcc-bugs@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-bugs mailing list <gcc-bugs.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Tue, 20 Oct 2020 18:16:53 -0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=3D97459
--- Comment #9 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #7)
> So, can we use this for anything but modulo 3, or 5, or 17, or 257 (all of
> those have 2^32 mod N =3D=3D 2^64 mod N =3D=3D 2^128 mod N =3D=3D 1)

I think so, too.

> probably also
> keyed on the target having corresponding uaddv4_optab handler, normal
> expansion not being able to handle it and emitting a libcall?

Again, yes.

This can also be used as a building block for handling division
and remainder base 10.

Here's a benchmark for this (it uses the sum of digits base 10
instead). qsum1 uses the standard method, which you can find
(for example) in libgfortran.

div_rem5_v2 first calculates the remainder of the division by 5 using this
method, then does an exact division by multiplying with its modular inverse
for 2^128.

div_rem10_v2 then uses div_rem5_v2 to calculate the value and
remainder of the division by 10, and qsum_v2 uses that to
calculate the sum of digits.

The timings are about a factor of 2 faster than the straightforward
libcall version:

s =3D 360398898 qsum_v1: 1.091621 s
s =3D 360398898 qsum_v2: 0.485509 s


#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>
#include <stdlib.h>

#define ONE ((__uint128_t) 1)
#define TWO_64 (ONE << 64)

typedef __uint128_t mytype;

double this_time ()
{
  struct timeval tv;
  gettimeofday (&tv, NULL);
  return tv.tv_sec + tv.tv_usec * 1e-6;
}

unsigned
qsum_v1 (mytype n)
{
  unsigned ret;
  ret =3D 0;
  while (n > 0)
    {
      ret +=3D n % 10;
      n =3D n / 10;
    }
  return ret;
}

static void inline __attribute__((always_inline))
div_rem_5_v2 (mytype n, mytype *div, unsigned *rem)
{
  unsigned long a, b, c;
  /* The modular inverse to 5 modulo 2^128  */
  const mytype magic =3D (0xCCCCCCCCCCCCCCCC * TWO_64 + 0xCCCCCCCCCCCCCCCD *
ONE);
  b =3D n >> 64;
  c =3D n;
  if (__builtin_add_overflow (b, c, &a))
    a++;

  *rem =3D a % 5;
  *div =3D (n-*rem) * magic;
}

static void inline __attribute__((always_inline))
div_rem_10_v2 (mytype n, mytype *div, unsigned *rem)
{
  mytype n5;
  unsigned rem5;
  div_rem_5_v2 (n, &n5, &rem5);
  *rem =3D rem5 + (n5 % 2) * 5;
  *div =3D n5/2;
}

unsigned
qsum_v2 (mytype n)
{
  unsigned ret;
  unsigned rem;
  mytype n_new;
  ret =3D 0;
  while (n > 0)
    {
      div_rem_10_v2 (n, &n_new, &rem);
      ret +=3D rem;
      n =3D n_new;
    }
  return ret;
}

#define N 10000000

int main()
{
  mytype *a;
  unsigned long int s;
  double t1, t2;
  int fd;
  long int i;
  a =3D malloc (sizeof (*a) * N);
  fd =3D open ("/dev/urandom", O_RDONLY);
  read (fd, a, sizeof (*a) * N);

  s =3D 0;
  t1 =3D this_time();
  for (i=3D0; i<N; i++)
    s +=3D qsum_v1(a[i]);

  t2 =3D this_time();
  printf ("s =3D %lu qsum_v1: %f s\n", s, (t2-t1));

  s =3D 0;
  t1 =3D this_time();
  for (i=3D0; i<N; i++)
    s +=3D qsum_v2(a[i]);

  t2 =3D this_time();
  printf ("s =3D %lu qsum_v2: %f s\n", s, (t2-t1));

  return 0;
}=