public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug target/106484] New: Failure to optimize uint64_t/constant division on ARM32
@ 2022-07-30 18:38 rsaxvc at gmail dot com
  2022-07-30 18:39 ` [Bug target/106484] " rsaxvc at gmail dot com
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: rsaxvc at gmail dot com @ 2022-07-30 18:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106484

            Bug ID: 106484
           Summary: Failure to optimize uint64_t/constant division on
                    ARM32
           Product: gcc
           Version: 12.1.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rsaxvc at gmail dot com
  Target Milestone: ---
            Target: arm

The following test function compiles into a call to __aeabi_uldivmod, even
though the divisor is a constant. Here's an example function:

    #include <stdint.h>
    uint64_t ns_to_s( uint64_t ns64 )
    {
    return ns64 / 1000000000ULL;
    }

CortexM4(-O3 -Wall -Wextra -mcpu=cortex-m4) assembly:

ns_to_s(unsigned long long):
        push    {r3, lr}
        adr     r3, .L4
        ldrd    r2, [r3]
        bl      __aeabi_uldivmod
        pop     {r3, pc}
.L4:
        .word   1000000000
        .word   0

Interestingly, gcc 12.1 for aarch64 compiles the above C function by
implementing division by a constant with scaled multiplication by the inverse
using the umulh instruction(not present on 32-bit ARM). (-O3 -Wall -Wextra):

ns_to_s(unsigned long):
        mov     x1, 23123
        lsr     x0, x0, 9
        movk    x1, 0xa09b, lsl 16
        movk    x1, 0xb82f, lsl 32
        movk    x1, 0x44, lsl 48
        umulh   x0, x0, x1
        lsr     x0, x0, 11
        ret

Instead, if something like __umulh could be added to libgcc, then GCC could use
the constants generated in the aarch64 logic to implement uint64_t/constant
division. Example umulh approach is attached. On Cortex-M4, the umulh-based
approach is significantly faster, although this depends on the specific libc
__aeabi_uldivmod linked against as well as the numerator.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug target/106484] Failure to optimize uint64_t/constant division on ARM32
  2022-07-30 18:38 [Bug target/106484] New: Failure to optimize uint64_t/constant division on ARM32 rsaxvc at gmail dot com
@ 2022-07-30 18:39 ` rsaxvc at gmail dot com
  2022-07-30 22:29 ` pinskia at gcc dot gnu.org
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: rsaxvc at gmail dot com @ 2022-07-30 18:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106484

rsaxvc at gmail dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rsaxvc at gmail dot com

--- Comment #1 from rsaxvc at gmail dot com ---
Created attachment 53388
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53388&action=edit
example umulh implementation

C implementation of umulh which can be used to optimize uint64_t/constant
division on 32-bit processors without a umulh instruction but with a
32x32=64bit multiplier.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug target/106484] Failure to optimize uint64_t/constant division on ARM32
  2022-07-30 18:38 [Bug target/106484] New: Failure to optimize uint64_t/constant division on ARM32 rsaxvc at gmail dot com
  2022-07-30 18:39 ` [Bug target/106484] " rsaxvc at gmail dot com
@ 2022-07-30 22:29 ` pinskia at gcc dot gnu.org
  2022-08-01 10:50 ` rearnsha at gcc dot gnu.org
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: pinskia at gcc dot gnu.org @ 2022-07-30 22:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106484

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug target/106484] Failure to optimize uint64_t/constant division on ARM32
  2022-07-30 18:38 [Bug target/106484] New: Failure to optimize uint64_t/constant division on ARM32 rsaxvc at gmail dot com
  2022-07-30 18:39 ` [Bug target/106484] " rsaxvc at gmail dot com
  2022-07-30 22:29 ` pinskia at gcc dot gnu.org
@ 2022-08-01 10:50 ` rearnsha at gcc dot gnu.org
  2022-08-01 12:12 ` rsaxvc at gmail dot com
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: rearnsha at gcc dot gnu.org @ 2022-08-01 10:50 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106484

--- Comment #2 from Richard Earnshaw <rearnsha at gcc dot gnu.org> ---
32-bit division by constant relies on a 32x32->64 bit multiply.  So it seems
reasonable that a 64-bit division would need a 64x64->128 bit multiply.  We
don't have an instruction for that on arm nor does gcc support 128-bit integer
types on a 32-bit compiler.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug target/106484] Failure to optimize uint64_t/constant division on ARM32
  2022-07-30 18:38 [Bug target/106484] New: Failure to optimize uint64_t/constant division on ARM32 rsaxvc at gmail dot com
                   ` (2 preceding siblings ...)
  2022-08-01 10:50 ` rearnsha at gcc dot gnu.org
@ 2022-08-01 12:12 ` rsaxvc at gmail dot com
  2023-05-04 12:55 ` rsaxvc at gmail dot com
  2023-07-08 18:10 ` rsaxvc at gmail dot com
  5 siblings, 0 replies; 7+ messages in thread
From: rsaxvc at gmail dot com @ 2022-08-01 12:12 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106484

--- Comment #3 from rsaxvc at gmail dot com ---
> 32-bit division by constant relies on a 32x32->64 bit multiply.

I believe only the upper half of the bits are required for the product to
represent the integer component of the quotient. This makes parameter passing
much easier in the uint64_t/constant case.

> We don't have an instruction for that on arm nor does gcc support 128-bit integer types on a 32-bit compiler.

I ran into that as well. The attached uint64_t umulh( uint64_t a, uint64_t b )
implements a 64x64->64MSBsOf128BitProduct using only 32x32->64 bit
multiplication, 64-bit addition, and register-aligned shifting.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug target/106484] Failure to optimize uint64_t/constant division on ARM32
  2022-07-30 18:38 [Bug target/106484] New: Failure to optimize uint64_t/constant division on ARM32 rsaxvc at gmail dot com
                   ` (3 preceding siblings ...)
  2022-08-01 12:12 ` rsaxvc at gmail dot com
@ 2023-05-04 12:55 ` rsaxvc at gmail dot com
  2023-07-08 18:10 ` rsaxvc at gmail dot com
  5 siblings, 0 replies; 7+ messages in thread
From: rsaxvc at gmail dot com @ 2023-05-04 12:55 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106484

--- Comment #4 from rsaxvc at gmail dot com ---
Benchmarking shows the speedup to be highly variable depending on CPU core as
well as __aeabi_uldivmod() implementation, and somewhat on numerator.

The best __aeabi_uldivmod()s I've seen do use 32bit division instructions when
available, and umulh() based approach is only 2-3x faster when division
instructions are available.

When umull(32x32 with 64bit result) is available and udiv is not available or
libc doesn't use it, the umulh() based approach proposed here completes 28-38x
faster, on Cortex-M4, measured via GPIO and oscilloscope. The wide variation in
relative speed is due to variable execution time of __aeabi_uldivmod(). Similar
on ARM11.

There's a partial list of some contemporary cores have udiv here:
https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/divide-and-conquer
it does look like things are headed towards more cores having udiv available.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug target/106484] Failure to optimize uint64_t/constant division on ARM32
  2022-07-30 18:38 [Bug target/106484] New: Failure to optimize uint64_t/constant division on ARM32 rsaxvc at gmail dot com
                   ` (4 preceding siblings ...)
  2023-05-04 12:55 ` rsaxvc at gmail dot com
@ 2023-07-08 18:10 ` rsaxvc at gmail dot com
  5 siblings, 0 replies; 7+ messages in thread
From: rsaxvc at gmail dot com @ 2023-07-08 18:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106484

--- Comment #5 from rsaxvc at gmail dot com ---
Related ticket requested with Clang:
https://github.com/llvm/llvm-project/issues/63731

latest umulh() function is a little shorter:

static uint64_t umulh(uint64_t a, uint64_t b)
{
const uint32_t a_lo = a;
const uint32_t a_hi = a >> 32;
const uint32_t b_lo = b;
const uint32_t b_hi = b >> 32;

/* FOIL method of multiplication
See https://en.wikipedia.org/wiki/FOIL_method,
but instead of binomials with constants a,b,c,d variables x,y: (ax+b) * (cy +
d),
consider it with variables a,b,c,d, constants x,y = 1<<32
Results in one UMULL or UMLAL(when merged with accumulate below) per multiply*/
const uint64_t acc0 = (uint64_t)a_lo * b_lo;
const uint64_t acc1 = (uint64_t)a_hi * b_lo;
const uint64_t acc2 = (uint64_t)a_lo * b_hi;
const uint64_t acc3 = (uint64_t)a_hi * b_hi;

/* Accumulate the results, keeping only the top 64-bits of the 128-bit result*/
uint64_t acc = acc0;
acc >>= 32;
acc += acc1;
acc += acc2;
acc >>= 32;
acc += acc3;

return acc;
}

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2023-07-08 18:10 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-30 18:38 [Bug target/106484] New: Failure to optimize uint64_t/constant division on ARM32 rsaxvc at gmail dot com
2022-07-30 18:39 ` [Bug target/106484] " rsaxvc at gmail dot com
2022-07-30 22:29 ` pinskia at gcc dot gnu.org
2022-08-01 10:50 ` rearnsha at gcc dot gnu.org
2022-08-01 12:12 ` rsaxvc at gmail dot com
2023-05-04 12:55 ` rsaxvc at gmail dot com
2023-07-08 18:10 ` rsaxvc at gmail dot com

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).