[Bug c/20283] New: optimising muldiv() type operations

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug c/20283] New: optimising muldiv() type operations
@ 2005-03-02 14:26 ajrobb at bigfoot dot com
  2005-03-02 16:35 ` [Bug middle-end/20283] " pinskia at gcc dot gnu dot org
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: ajrobb at bigfoot dot com @ 2005-03-02 14:26 UTC (permalink / raw)
  To: gcc-bugs

Reading specs from /usr/lib/gcc-lib/i586-suse-linux/3.3.4/specs
Configured with: ../configure --enable-threads=posix --prefix=/usr
--with-local-prefix=/usr/local --infodir=/usr/share/info --mandir=/usr/share/man
--enable-languages=c,c++,f77,objc,java,ada --disable-checking --libdir=/usr/lib
--enable-libgcj --with-gxx-include-dir=/usr/include/g++ --with-slibdir=/lib
--with-system-zlib --enable-shared --enable-__cxa_atexit i586-suse-linux
Thread model: posix
gcc version 3.3.4 (pre 3.3.5 20040809)
 /usr/lib/gcc-lib/i586-suse-linux/3.3.4/cc1 -E -quiet -v -D__GNUC__=3
-D__GNUC_MINOR__=3 -D__GNUC_PATCHLEVEL__=4 muldiv.c -march=pentium -O3 muldiv.i
#include "..." search starts here:
#include <...> search starts here:
 /usr/local/include
 /usr/lib/gcc-lib/i586-suse-linux/3.3.4/include
 /usr/i586-suse-linux/include
 /usr/include
End of search list.
 /usr/lib/gcc-lib/i586-suse-linux/3.3.4/cc1 -fpreprocessed muldiv.i -quiet
-dumpbase muldiv.c -march=pentium -auxbase muldiv -O3 -version -o muldiv.s
GNU C version 3.3.4 (pre 3.3.5 20040809) (i586-suse-linux)
        compiled by GNU C version 3.3.4 (pre 3.3.5 20040809).
GGC heuristics: --param ggc-min-expand=63 --param ggc-min-heapsize=63486

When compiling the following function:

gcc -O3 -S -march=pentium

int vat(int a)
{
  return a * 47 / 40;
}

optimising misses the trick of combining the multiply and divide into a single
multiply and shift arithmetic right.

# assume value is already in EAX
movl 1261646643,%ecx
imul %ecx
sarl $30 # I mean shift %edx,%eax pair right by 30 bits

I realise I haven't chosen the value properly to give the same results as an
overflow when multiplying by 47, but is that defined in the standard?

-- 
           Summary: optimising muldiv() type operations
           Product: gcc
           Version: 3.3.4
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: ajrobb at bigfoot dot com
                CC: gcc-bugs at gcc dot gnu dot org
  GCC host triplet: SuSE Linux 9.2


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=20283


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug middle-end/20283] optimising muldiv() type operations
  2005-03-02 14:26 [Bug c/20283] New: optimising muldiv() type operations ajrobb at bigfoot dot com
@ 2005-03-02 16:35 ` pinskia at gcc dot gnu dot org
  2005-03-02 16:53 ` ajrobb at bigfoot dot com
  2005-06-04 16:44 ` pinskia at gcc dot gnu dot org
  2 siblings, 0 replies; 7+ messages in thread
From: pinskia at gcc dot gnu dot org @ 2005-03-02 16:35 UTC (permalink / raw)
  To: gcc-bugs


------- Additional Comments From pinskia at gcc dot gnu dot org  2005-03-02 16:35 -------
With unsigned instead of signed, I get the following code with the mainline:
        movl    4(%esp), %eax
        leal    (%eax,%eax,2), %edx
        sall    $4, %edx
        subl    %eax, %edx
        movl    $-858993459, %eax
        mull    %edx
        shrl    $5, %edx
        movl    %edx, %eax
        ret

Is this okay, If so then there is no bug here really.

-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|c                           |middle-end


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=20283


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug middle-end/20283] optimising muldiv() type operations
  2005-03-02 14:26 [Bug c/20283] New: optimising muldiv() type operations ajrobb at bigfoot dot com
  2005-03-02 16:35 ` [Bug middle-end/20283] " pinskia at gcc dot gnu dot org
@ 2005-03-02 16:53 ` ajrobb at bigfoot dot com
  2005-06-04 16:44 ` pinskia at gcc dot gnu dot org
  2 siblings, 0 replies; 7+ messages in thread
From: ajrobb at bigfoot dot com @ 2005-03-02 16:53 UTC (permalink / raw)
  To: gcc-bugs


------- Additional Comments From ajrobb at bigfoot dot com  2005-03-02 16:53 -------
Subject: Re:  optimising muldiv() type operations

Hi,
Thanks for getting back. Your code below still performs a separate 
multiply by 47 before 'dividing' by 40.  My enhancement request is to 
combine these two operations making the 2nd-4th (and 8th) lines below 
redundant.

Well done though for spotting it should have been unsigned (signed 
division is horrible as it is not defined/portable).

pinskia at gcc dot gnu dot org wrote:

>------- Additional Comments From pinskia at gcc dot gnu dot org  2005-03-02 16:35 -------
>With unsigned instead of signed, I get the following code with the mainline:
>        movl    4(%esp), %eax
>        leal    (%eax,%eax,2), %edx
>        sall    $4, %edx
>        subl    %eax, %edx
>        movl    $-858993459, %eax
>        mull    %edx
>        shrl    $5, %edx
>        movl    %edx, %eax
>        ret
>
>Is this okay, If so then there is no bug here really.
>
>  
>



-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=20283


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug middle-end/20283] optimising muldiv() type operations
  2005-03-02 14:26 [Bug c/20283] New: optimising muldiv() type operations ajrobb at bigfoot dot com
  2005-03-02 16:35 ` [Bug middle-end/20283] " pinskia at gcc dot gnu dot org
  2005-03-02 16:53 ` ajrobb at bigfoot dot com
@ 2005-06-04 16:44 ` pinskia at gcc dot gnu dot org
  2 siblings, 0 replies; 7+ messages in thread
From: pinskia at gcc dot gnu dot org @ 2005-06-04 16:44 UTC (permalink / raw)
  To: gcc-bugs


------- Additional Comments From pinskia at gcc dot gnu dot org  2005-06-04 16:44 -------
Confirmed.

-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
     Ever Confirmed|                            |1
   Last reconfirmed|0000-00-00 00:00:00         |2005-06-04 16:44:06
               date|                            |


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=20283


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug middle-end/20283] optimising muldiv() type operations
       [not found] <bug-20283-8690@http.gcc.gnu.org/bugzilla/>
  2008-08-31  7:50 ` ajrobb at bigfoot dot com
  2008-08-31 13:41 ` ajrobb at bigfoot dot com
@ 2008-09-01  5:51 ` ajrobb at bigfoot dot com
  2 siblings, 0 replies; 7+ messages in thread
From: ajrobb at bigfoot dot com @ 2008-09-01  5:51 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #6 from ajrobb at bigfoot dot com  2008-09-01 05:50 -------
OK - it's me again. I have formulated a general case optimization - I should
keep quiet now ;-) :

/* Demonstrate a general fast multiply and divide by rational number.
 * Both methods agree over the entire domain: 0..0xffffffffu.
 * (even when they overflow the 32-bit result)
 */
#include <stdint.h>
/* 32-bit numerator */
#define Y 47
/* 32-bit denominator */
#define Z 40
/* bit shift (at least 32) to normalize 'fact' below (with MSB set) */
#define B (32 + 2)

/* this is what we might write - but it calls a 64-bit divide function */
uint32_t slow(uint32_t x) {
  return x * (uint64_t) Y / Z;
}

/* this is how it can be implemented without a 64-bit divide call */
uint32_t fast(uint32_t x) {
  static const uint64_t fact = ((((uint64_t)(Y % Z)) << B) + (Z - 1)) / Z;
  /*
   * While this is constructed as 2 multiplies, the value of (Y / Z) will
   * often be small (e.g. 0 or 1) and can be implemented without an imul
   * instruction. This can overwrite the value of %eax after the mul
   * instruction is issued as we only need the value from %edx.
   * A superscalar x86 will assign a new %eax register to allow the mul
   * instruction to continue in parallel.
   *
   * Where the routine is not inline, it can be implemented as:
   *    movl    $-1288490188, %eax
   *    mull    4(%esp)
   *    movl    4(%esp), %eax
   *    # small multiplication by (Y/Z) could go here
   *    shrl    $2, %edx
   *    addl    %edx, %eax
   *    ret
   *
   * Where this routine is inlined and x is already in a register other than
   * %eax or %edx, the final multiply and addition might be achieved with a
   * single lea instruction into %eax. Example: assume that x is in %ebx with
   * the above ratio (Y/Z is 1):
   *    movl    $-1288490188, %eax
   *    mull    %ebx
   *    shrl    $2, %edx
   *    leal    (%edx, %ebx), %eax
   */
  return (uint32_t)((x * fact) >> B) + x * (Y / Z);
}

NOTE: the general case 64-bit result can be calculated with at most 2 mul
instructions and no divides.

uint64_t fast64(uint32_t x) {
  static const uint64_t fact = ((((uint64_t)(Y % Z)) << B) + (Z - 1)) / Z;
  return ((x * fact) >> B) + x * (uint64_t)(Y / Z);
}

With the above ratio of 47 / 40, this could be compiled to:

        movl    $-1288490188, %eax
        mull    4(%esp)
        movl    4(%esp), %eax
        shrl    $2, %edx
        addl    %edx, %eax
        xorl    %edx, %edx
        adcl    $0, %edx
        ret

Where the ratio is less than 1, the 32-bit and 64-bit versions are very similar
- both using a single mul instruction. (The 64-bit version must zero %edx.)


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=20283


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug middle-end/20283] optimising muldiv() type operations
       [not found] <bug-20283-8690@http.gcc.gnu.org/bugzilla/>
  2008-08-31  7:50 ` ajrobb at bigfoot dot com
@ 2008-08-31 13:41 ` ajrobb at bigfoot dot com
  2008-09-01  5:51 ` ajrobb at bigfoot dot com
  2 siblings, 0 replies; 7+ messages in thread
From: ajrobb at bigfoot dot com @ 2008-08-31 13:41 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #5 from ajrobb at bigfoot dot com  2008-08-31 13:40 -------
For the case where the overall scale is between 1 and 2 (exclusive), it is
straightforward to implement a solution with a single multiply and no shifts:

/* Demostrate fast multiply and divide by rational number, r, where 1 < r < 2
 * Both methods work over entire domain: 0 < x < (1<<32)/r.
 */
#include <stdint.h>
#define Y 47
#define Z 40

uint32_t slow(uint32_t x) {
  return x * (uint64_t) Y / Z;
}

uint32_t fast(uint32_t x) {
  static const uint64_t fact = ((((uint64_t)(Y-Z)) << 32) + (Z-1)) / Z;
  return x + (uint32_t)((x * fact) >> 32);
}

This compiles to the following with gcc 4.2.1 on i586-suse-linux

gcc -O3 -fomit-frame-pointer

        .file   "vat.c"
.globl __udivdi3
        .text
        .p2align 4,,15
.globl slow
        .type   slow, @function
slow:
        subl    $28, %esp
        movl    $47, %eax
        mull    32(%esp)
        movl    $40, 8(%esp)
        movl    $0, 12(%esp)
        movl    %eax, (%esp)
        movl    %edx, 4(%esp)
        call    __udivdi3
        addl    $28, %esp
        ret
        .size   slow, .-slow
        .p2align 4,,15
.globl fast
        .type   fast, @function
fast:
        subl    $12, %esp
        movl    $751619277, %ecx
        movl    %ebx, (%esp)
        movl    16(%esp), %ebx
        movl    %edi, 8(%esp)
        movl    %esi, 4(%esp)
        movl    %ebx, %eax
        mull    %ecx
        movl    %edx, %edi
        movl    %eax, %esi
        movl    %edi, %esi
        xorl    %edi, %edi
        movl    8(%esp), %edi
        leal    (%ebx,%esi), %eax
        movl    (%esp), %ebx
        movl    4(%esp), %esi
        addl    $12, %esp
        ret
        .size   fast, .-fast
# additional hand-coded assembler equivalent to fast above:
        .p2align 4,,15
.globl faster
        .type   faster, @function
faster:
        movl    $751619277, %eax
        mull    4(%esp)
        movl    4(%esp), %eax
        addl    %edx, %eax
        ret
        .size   faster, .-faster
        .ident  "GCC: (GNU) 4.2.1 (SUSE Linux)"
        .section        .note.GNU-stack,"",@progbits


-- 

ajrobb at bigfoot dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Version|3.3.4                       |4.2.1


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=20283


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug middle-end/20283] optimising muldiv() type operations
       [not found] <bug-20283-8690@http.gcc.gnu.org/bugzilla/>
@ 2008-08-31  7:50 ` ajrobb at bigfoot dot com
  2008-08-31 13:41 ` ajrobb at bigfoot dot com
  2008-09-01  5:51 ` ajrobb at bigfoot dot com
  2 siblings, 0 replies; 7+ messages in thread
From: ajrobb at bigfoot dot com @ 2008-08-31  7:50 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #4 from ajrobb at bigfoot dot com  2008-08-31 07:48 -------
I should point out that for operations (x * y / z) where y > z, there is
nothing to be gained on an Intel Core 2 duo (as extra adds and a shrd are
required for the scaling to work over the whole domain: 0..(1<<32)/y. However,
there are still gains to be had where y < z and a simple 32-bit scale factor
can be used.

In the following example, fast is at least 20% faster than std with gcc 3.4
(Cygwin) when compiled without inlining (gcc -O2 -fomit-frame-pointer).

uint32_t std(uint32_t x) {
  return x * 40 / 47;
} 

uint32_t fast(uint32_t x) {
  static const uint64_t fact = ((((uint64_t)40) << 32) + 46) / 47;
  return (x * fact) >> 32;
}

        .def    _std;   .scl    2;      .type   32;     .endef
_std:
        movl    4(%esp), %edx
        movl    $-1370734243, %eax
        leal    (%edx,%edx,4), %edx
        sall    $3, %edx
        mull    %edx
        shrl    $5, %edx
        movl    %edx, %eax
        ret
        .p2align 4,,15
.globl _fast
        .def    _fast;  .scl    2;      .type   32;     .endef
_fast:
        movl    $-639675980, %eax
        mull    4(%esp)
        movl    %edx, %eax
        xorl    %edx, %edx
        ret


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=20283


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2008-09-01  5:51 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-03-02 14:26 [Bug c/20283] New: optimising muldiv() type operations ajrobb at bigfoot dot com
2005-03-02 16:35 ` [Bug middle-end/20283] " pinskia at gcc dot gnu dot org
2005-03-02 16:53 ` ajrobb at bigfoot dot com
2005-06-04 16:44 ` pinskia at gcc dot gnu dot org
     [not found] <bug-20283-8690@http.gcc.gnu.org/bugzilla/>
2008-08-31  7:50 ` ajrobb at bigfoot dot com
2008-08-31 13:41 ` ajrobb at bigfoot dot com
2008-09-01  5:51 ` ajrobb at bigfoot dot com

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).