public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug rtl-optimization/97282] New: division done twice for modulo and divsion for 128-bit integers
@ 2020-10-03 16:37 tkoenig at gcc dot gnu.org
  2020-10-03 16:37 ` [Bug rtl-optimization/97282] " tkoenig at gcc dot gnu.org
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: tkoenig at gcc dot gnu.org @ 2020-10-03 16:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97282

            Bug ID: 97282
           Summary: division done twice for modulo and divsion for 128-bit
                    integers
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tkoenig at gcc dot gnu.org
  Target Milestone: ---

Currently, gcc calls the (long and slow) division routines for 128-bit
integers twice when both the residual and the value is needed.  For
the other integer types, this is optimized.

Take this test case:

$ cat digsum.c
#include <stdio.h>
#include <x86intrin.h>

typedef __uint128_t myint;

unsigned long digsum1 (myint x)
{
  unsigned long ret;

  if (x == 0)
    return 0; 

  ret = 0;
  while (x > 0)
    {
      ret = ret + x % 10;
      x = x / 10;
    }
  return ret;
}

unsigned long digsum2 (myint x)
{
  unsigned long ret;
  myint tmp;

  if (x == 0)
    return 0; 

  ret = 0;
  while (x > 0)
    {
      tmp = x / 10;
      ret = ret + (x - tmp * 10);
      x = tmp;
    }
  return ret;
}

#define NUM 1000000
int main()
{
  myint x;
  unsigned long sum;
  long int t1, t2;
  __uint128_t from, to;

  from = 1;
  from = (from << 93) - NUM/2;
  to = from + NUM;
  sum = 0;

  t1 = __rdtsc();
  for (x=from; x<to; x++)
    sum = sum + digsum1(x);

  t2 = __rdtsc();
  printf ("digsum1:\nsum = %lu\n", sum);
  printf ("cycles per sum = %.2f\n\n", (double) (t2-t1)/NUM);

  sum = 0;
  t1 = __rdtsc();
  for (x=from; x<to; x++)
    sum = sum + digsum2(x);

  t2 = __rdtsc();
  printf ("digsum2:\nsum = %lu\n", sum);
  printf ("cycles per sum = %.2f\n", (double) (t2-t1)/NUM);

  return 0;
}

"As is", this gives on my machine

$ gcc -O3 digsum.c
$ ./a.out
digsum1:
sum = 113493792
cycles per sum = 2021.68

digsum2:
sum = 113493792
cycles per sum = 1025.47

(similar timings if a signed type is used).

This also affects Fortran I/O.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug rtl-optimization/97282] division done twice for modulo and divsion for 128-bit integers
  2020-10-03 16:37 [Bug rtl-optimization/97282] New: division done twice for modulo and divsion for 128-bit integers tkoenig at gcc dot gnu.org
@ 2020-10-03 16:37 ` tkoenig at gcc dot gnu.org
  2020-10-03 23:18 ` tkoenig at gcc dot gnu.org
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: tkoenig at gcc dot gnu.org @ 2020-10-03 16:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97282

Thomas Koenig <tkoenig at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           See Also|                            |https://gcc.gnu.org/bugzill
                   |                            |a/show_bug.cgi?id=89256
           Keywords|                            |missed-optimization
           Severity|normal                      |enhancement

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug rtl-optimization/97282] division done twice for modulo and divsion for 128-bit integers
  2020-10-03 16:37 [Bug rtl-optimization/97282] New: division done twice for modulo and divsion for 128-bit integers tkoenig at gcc dot gnu.org
  2020-10-03 16:37 ` [Bug rtl-optimization/97282] " tkoenig at gcc dot gnu.org
@ 2020-10-03 23:18 ` tkoenig at gcc dot gnu.org
  2020-10-05 11:00 ` jakub at gcc dot gnu.org
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: tkoenig at gcc dot gnu.org @ 2020-10-03 23:18 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97282

--- Comment #1 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
And here is a version which uses two 64-bit numbers for calculation of he
sum of decimal digits as a benchmark for the division and modulo:

unsigned long digsum3 (myint x)
{
  unsigned long ret;
  __uint64_t high, low;
  const unsigned long int rem_high[10] = {0,6,2,8,4,0,6,2,8,4};
  const unsigned long int foo_high[10] =
    {0x0000000000000000, 0x1999999999999999, 0x3333333333333333,
0x4CCCCCCCCCCCCCCC,
     0x6666666666666666, 0x8000000000000000, 0x9999999999999999,
0xB333333333333333,
     0xCCCCCCCCCCCCCCCC, 0xE666666666666666 };

  high = x >> 64;
  low = x;
  ret = 0;
  while (low > 0 || high > 0)
    {
      unsigned long r_high, r_low, r_sum, r_carry;
      r_high = high % 10;
      r_carry = rem_high[r_high];
      high = high / 10;
      r_low = low % 10;
      low = low / 10;
      low = low + foo_high[r_high];
      r_sum = r_low + r_carry;
      if (r_sum >= 10)
        {
          r_sum = r_sum - 10;
          low ++;
        }
      ret = ret + r_sum;
    }
  return ret;
}

It is _much_ faster, taking around 250 to 260 cycles per
calculation, a speedup of a factor of around 8 versus the
original code.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug rtl-optimization/97282] division done twice for modulo and divsion for 128-bit integers
  2020-10-03 16:37 [Bug rtl-optimization/97282] New: division done twice for modulo and divsion for 128-bit integers tkoenig at gcc dot gnu.org
  2020-10-03 16:37 ` [Bug rtl-optimization/97282] " tkoenig at gcc dot gnu.org
  2020-10-03 23:18 ` tkoenig at gcc dot gnu.org
@ 2020-10-05 11:00 ` jakub at gcc dot gnu.org
  2020-10-05 13:37 ` jakub at gcc dot gnu.org
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: jakub at gcc dot gnu.org @ 2020-10-05 11:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97282

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #2 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
We have the divmod discovery in tree-ssa-math-opts.c and it will handle this
case if the division/modulo is by the same variable.  But for constants it
punts:
  /* Disable the transform if either is a constant, since division-by-constant
     may have specialized expansion.  */
  if (CONSTANT_CLASS_P (op1) || CONSTANT_CLASS_P (op2))
    return false;

So, perhaps we want to get it through if CONSTANT_CLASS_P (op2) (but not op1)
and
some further conditions are met, e.g. that op2 is not a power of two, and
either the type's precision is > HOST_BITS_PER_WIDE_INT (then expand_divmod
punts on it), as choose_multiplier etc. work on HOST_WIDE_INTs), or compute
choose_multiplier for the constant divisor and if pre or post shift needs to be
BITS_PER_WORD or more bits.
Or optimize into DIVMOD always even for constant op2 and when trying to expand
DIVMOD ifn with constant op2, see if the target can expand division by that
constant using just shifts and +/- and if so, emit it as that division +
subtraction, otherwise throw away the division expansion and emit a divmod.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug rtl-optimization/97282] division done twice for modulo and divsion for 128-bit integers
  2020-10-03 16:37 [Bug rtl-optimization/97282] New: division done twice for modulo and divsion for 128-bit integers tkoenig at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2020-10-05 11:00 ` jakub at gcc dot gnu.org
@ 2020-10-05 13:37 ` jakub at gcc dot gnu.org
  2020-10-06  8:33 ` cvs-commit at gcc dot gnu.org
  2021-08-15 11:23 ` pinskia at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: jakub at gcc dot gnu.org @ 2020-10-05 13:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97282

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
           Assignee|unassigned at gcc dot gnu.org      |jakub at gcc dot gnu.org
             Status|UNCONFIRMED                 |ASSIGNED
   Last reconfirmed|                            |2020-10-05

--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Created attachment 49307
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49307&action=edit
gcc11-pr97282.patch

Untested fix.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug rtl-optimization/97282] division done twice for modulo and divsion for 128-bit integers
  2020-10-03 16:37 [Bug rtl-optimization/97282] New: division done twice for modulo and divsion for 128-bit integers tkoenig at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2020-10-05 13:37 ` jakub at gcc dot gnu.org
@ 2020-10-06  8:33 ` cvs-commit at gcc dot gnu.org
  2021-08-15 11:23 ` pinskia at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2020-10-06  8:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97282

--- Comment #4 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:bf510679bb3f9bfd6019666065016bb26a5b5466

commit r11-3671-gbf510679bb3f9bfd6019666065016bb26a5b5466
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Tue Oct 6 10:32:22 2020 +0200

    divmod: Match and expand DIVMOD even in some cases of constant divisor
[PR97282]

    As written in the comment, tree-ssa-math-opts.c wouldn't create a DIVMOD
    ifn call for division + modulo by constant for the fear that during
    expansion we could generate better code for those cases.
    If the divisoris a power of two, that is certainly the case always,
    but otherwise expand_divmod can punt in many cases, e.g. if the division
    type's precision is above HOST_BITS_PER_WIDE_INT, we don't even call
    choose_multiplier, because it works on HOST_WIDE_INTs (true, something
    we should fix eventually now that we have wide_ints), or if pre/post shift
    is larger than BITS_PER_WORD.

    So, the following patch recognizes DIVMOD with constant last argument even
    when it is unclear if expand_divmod will be able to optimize it, and then
    during DIVMOD expansion if the divisor is constant attempts to expand it as
    division + modulo and if they actually don't contain any libcalls or
    division/modulo, they are kept as is, otherwise that sequence is thrown
away
    and divmod optab or libcall is used.

    2020-10-06  Jakub Jelinek  <jakub@redhat.com>

            PR rtl-optimization/97282
            * tree-ssa-math-opts.c (divmod_candidate_p): Don't return false for
            constant op2 if it is not a power of two and the type has precision
            larger than HOST_BITS_PER_WIDE_INT or BITS_PER_WORD.
            * internal-fn.c (contains_call_div_mod): New function.
            (expand_DIVMOD): If last argument is a constant, try to expand it
as
            TRUNC_DIV_EXPR followed by TRUNC_MOD_EXPR, but if the sequence
            contains any calls or {,U}{DIV,MOD} rtxes, throw it away and use
            divmod optab or divmod libfunc.

            * gcc.target/i386/pr97282.c: New test.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug rtl-optimization/97282] division done twice for modulo and divsion for 128-bit integers
  2020-10-03 16:37 [Bug rtl-optimization/97282] New: division done twice for modulo and divsion for 128-bit integers tkoenig at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2020-10-06  8:33 ` cvs-commit at gcc dot gnu.org
@ 2021-08-15 11:23 ` pinskia at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-08-15 11:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97282

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED
   Target Milestone|---                         |11.0

--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
All fixed for GCC 11 by the patches for PR 97459 .

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2021-08-15 11:23 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-03 16:37 [Bug rtl-optimization/97282] New: division done twice for modulo and divsion for 128-bit integers tkoenig at gcc dot gnu.org
2020-10-03 16:37 ` [Bug rtl-optimization/97282] " tkoenig at gcc dot gnu.org
2020-10-03 23:18 ` tkoenig at gcc dot gnu.org
2020-10-05 11:00 ` jakub at gcc dot gnu.org
2020-10-05 13:37 ` jakub at gcc dot gnu.org
2020-10-06  8:33 ` cvs-commit at gcc dot gnu.org
2021-08-15 11:23 ` pinskia at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).