From mboxrd@z Thu Jan  1 00:00:00 1970
From: Frank Klemm <pfk@fuchs.offl.uni-jena.de>
To: Jan Hubicka <jh@suse.cz>
Cc: gcc@gcc.gnu.org
Subject: long long / long long
Date: Sat, 08 Sep 2001 19:08:00 -0000
Message-id: <20010909040234.A5552@fuchs.offl.uni-jena.de>
References: <20010908153112.8DAF1F2B62@nile.gnat.com> <20010908181701.K8451@atrey.karlin.mff.cuni.cz>
X-SW-Source: 2001-09/msg00298.html

---- Code ----------------------------------------------------

.text
.type   __divdi3,@function
.global __divdi3

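__divdi3:
        fildll  12(%esp)        # load divisor (second argument)
        fildll   4(%esp)        # load dividend (first argument)
        subl    $12,%esp        # scratch: two control words + 64-bit result
        movl    %esp,%ecx
        movw    $0x0C00,%ax     # RC bits = 11: round toward zero
        fnstcw  (%ecx)          # save the current control word
        orw     0(%ecx),%ax
        movw    %ax,2(%ecx)
        fldcw   2(%ecx)         # switch to truncation
        fdivp                   # quotient = dividend / divisor
        fistpll 4(%ecx)         # store the truncated 64-bit quotient
        fldcw   0(%ecx)         # restore the saved control word
        movl    4(%esp),%eax    # result, low 32 bits
        movl    8(%esp),%edx    # result, high 32 bits
        addl    $12,%esp
        ret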
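
For readers who prefer C, here is a rough sketch of the same idea. It is
not the libgcc code; div_via_long_double is a made-up name. It assumes that
long double has at least a 64-bit mantissa (x87 extended precision) and that
the FPU precision control is left at extended. Where that does not hold, the
sketch simply falls back to the ordinary operator.

```c
#include <assert.h>
#include <float.h>
#include <limits.h>

/* Hypothetical helper, not part of libgcc: signed 64-bit division done in
   floating point, mirroring the fildll / fdivp / fistpll sequence. */
long long div_via_long_double(long long a, long long b)
{
    /* Same preconditions as a / b in C. */
    assert(b != 0);
    assert(!(a == LLONG_MIN && b == -1));

#if LDBL_MANT_DIG >= 64
    /* Both operands are exactly representable with a 64-bit mantissa. */
    long double q = (long double)a / (long double)b;
    /* Floating-to-integer conversion in C discards the fraction,
       i.e. it truncates toward zero like integer division. */
    return (long long)q;
#else
    /* long double is too narrow here; use the native operator instead. */
    return a / b;
#endif
}
```

The cast does the job of the rounding control switch in the assembly: the
compiler emits whatever is needed to truncate.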


---- "Benchmark": Duration of a loop of --------------------------

    long long  x [1000];
    long long  y [1000];
    long long  s = 0;
    int        i;

    for (i = 0; i < 1000; i++)
        s += x[i] / y[i];


---- results ---------------------------------------------------- 
Old routine on Athlon:
	106 clocks including the outer loop and storing the arguments on the stack.
	
This routine on Athlon:
	79 clocks including the outer loop and storing the arguments on the stack.

  + shorter
  + can be inlined
  + sometimes the rounding control switch can be avoided by moving it outside a loop
  + faster for a lot of data
  - slower for trivial data (?)
  - does not work with SSE2 (needs a 63 or 64 bit mantissa)
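
To illustrate the last point: a double has only a 53 bit mantissa, so the
same trick loses low bits as soon as an operand exceeds 2^53. A small
sketch (div_via_double is a made-up name, used only for this example):

```c
#include <assert.h>

/* Hypothetical helper for illustration: the same trick with double,
   which is the widest type SSE2 arithmetic offers. */
long long div_via_double(long long a, long long b)
{
    assert(b != 0);
    /* volatile forces each value to be rounded to a real 53-bit double,
       even on targets that evaluate in wider registers. */
    volatile double x = (double)a;
    volatile double y = (double)b;
    volatile double q = x / y;
    return (long long)q;
}
```

With a = 9007199254740993 (2^53 + 1) and b = 1 this returns
9007199254740992, because the dividend is already rounded when it is
converted to double.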

---- optimization -----------------------------------------------
This routine on Athlon after inlining and moving fnstcw/fldcw outside the loop:
	21 clocks including the outer loop


Interested? Or is 64 bit arithmetic uninteresting for benchmarks?

-- 
Frank Klemm


Still remaining:
	long long % long long
	long long / long
	long long % long
	long long / const
	long long % const