From mboxrd@z Thu Jan 1 00:00:00 1970
From: Frank Klemm
To: Jan Hubicka
Cc: gcc@gcc.gnu.org
Subject: long long / long long
Date: Sat, 08 Sep 2001 19:08:00 -0000
Message-id: <20010909040234.A5552@fuchs.offl.uni-jena.de>
References: <20010908153112.8DAF1F2B62@nile.gnat.com> <20010908181701.K8451@atrey.karlin.mff.cuni.cz>
X-SW-Source: 2001-09/msg00298.html

---- Code ----------------------------------------------------

        .text
        .type   __divdi3,@function
        .global __divdi3
__divdi3:
        fildll  12(%esp)            # push the divisor (second argument)
        fildll  4(%esp)             # push the dividend (first argument)
        subl    $12,%esp            # scratch: two control words + 64-bit result
        movl    %esp,%ecx
        movw    $0x0C00,%ax         # rounding control bits: round toward zero
        fnstcw  (%ecx)              # save the current control word
        orw     0(%ecx),%ax
        movw    %ax,2(%ecx)
        fldcw   2(%ecx)             # switch to truncation
        fdivp                       # divide and pop, quotient left in st(0)
        fistpll 4(%ecx)             # store the quotient as a 64-bit integer
        fldcw   0(%ecx)             # restore the original control word
        movl    4(%esp),%eax        # low half of the result
        movl    8(%esp),%edx        # high half of the result
        addl    $12,%esp
        ret

---- "Benchmark": Duration of a loop of --------------------------

        long long  x [1000];
        long long  y [1000];

        for (i = 0; i < 1000; i++)
            s += x[i] / y[i];

---- results ----------------------------------------------------

Old routine on Athlon:   106 clocks, including the outer loop and
                         storing the arguments on the stack.
This routine on Athlon:   79 clocks, including the outer loop and
                         storing the arguments on the stack.

+ shorter
+ can be inlined
+ sometimes the rounding control switch can be avoided by moving it
  outside a loop
+ faster for a lot of data
- slower for trivial data (?)
- does not work with SSE2 (needs a 63 or 64 bit mantissa)

---- optimization -----------------------------------------------

This routine on Athlon after inlining and moving fstcw/fldcw outside
the loop:                 21 clocks, including the outer loop.

Interested? Or is 64 bit arithmetic uninteresting for benchmarks?

-- 
Frank Klemm

Still remaining:
        long long % long long
        long long / long
        long long % long
        long long / const
        long long % const
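
For readers who prefer C to assembly, the idea behind the routine can be sketched roughly as follows. This is only an illustration under stated assumptions, not the libgcc implementation: the names div64_via_fp and sum_quotients are hypothetical, the divisor is assumed to be nonzero, and the quotient is assumed to fit in a long long. The sketch relies on long double having at least a 64 bit mantissa (as the x87 extended format does) and falls back to ordinary integer division otherwise, which mirrors the SSE2 limitation noted in the post.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Hypothetical helper (not part of libgcc): signed 64-bit division carried
   out in extended-precision floating point, in the spirit of the assembly
   routine. Assumes b != 0 and that the quotient fits in a long long. */
static long long div64_via_fp(long long a, long long b)
{
    assert(b != 0);
#if LDBL_MANT_DIG >= 64
    /* Every long long converts to long double exactly here; the cast back
       to an integer type truncates toward zero, like C integer division. */
    return (long long)((long double)a / (long double)b);
#else
    /* Mantissa too narrow to hold a long long exactly: use plain division. */
    return a / b;
#endif
}

/* The benchmark loop from the post, written against the helper above. */
static long long sum_quotients(const long long *x, const long long *y, size_t n)
{
    long long s = 0;
    for (size_t i = 0; i < n; i++)
        s += div64_via_fp(x[i], y[i]);
    return s;
}
```

Calling sum_quotients(x, y, 1000) on the two arrays corresponds to the loop measured above. Whether a compiler turns the cast into the same control word switch, and whether it hoists that switch out of the loop, depends on the target and the options used, so the clock counts in the post should not be expected from this sketch as written.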