Possible suboptimal code generated in 32-bit ABI mode

public inbox for gcc@gcc.gnu.org

* Possible suboptimal code generated in 32-bit ABI mode
From: bradfirj @ 2017-11-19 16:20 UTC
  To: gcc

Hello,

I was playing around with 64bit arithmetic with -m32 enabled and
encountered some strange optimization in what I thought was a very
simple case.

My test function, which I appreciate is totally artificial, is as
follows:

#include <stdint.h>

uint64_t sum(uint64_t a, uint64_t b) {
    return a + b;
}

This is obviously a single instruction on a 64bit machine, but I
compiled with -m32 -O2 to see how the compiler would emulate this addition
and by default I see the behaviour I expect, a 64bit addition emulated
using only the 32bit registers:

  # 64-m32-example.cpp:6:     return a + b;
  	mov	eax, DWORD PTR [esp+12]	# b, b
  	add	eax, DWORD PTR [esp+4]	# tmp90, a
  	mov	edx, DWORD PTR [esp+16]	# b, b
  	adc	edx, DWORD PTR [esp+8]	#, a
  # 64-m32-example.cpp:7: }
  	ret
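
For readers less familiar with the add/adc idiom, here is a minimal sketch of the same carry propagation written out in C++. The names sum_via_halves and check_sum_via_halves are hypothetical helpers for illustration, not part of the original test file, and this is not a claim about how GCC lowers the addition internally.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: add two 64-bit values using only 32-bit arithmetic,
// mirroring the add (low halves) / adc (high halves plus carry) pair.
std::uint64_t sum_via_halves(std::uint64_t a, std::uint64_t b) {
    const std::uint32_t a_lo = static_cast<std::uint32_t>(a);
    const std::uint32_t a_hi = static_cast<std::uint32_t>(a >> 32);
    const std::uint32_t b_lo = static_cast<std::uint32_t>(b);
    const std::uint32_t b_hi = static_cast<std::uint32_t>(b >> 32);

    // "add": unsigned 32-bit addition wraps modulo 2^32.
    const std::uint32_t lo = a_lo + b_lo;
    // The carry flag is set exactly when the low sum wrapped around.
    const std::uint32_t carry = (lo < a_lo) ? 1u : 0u;
    // "adc": add the high halves plus the carry out of the low halves.
    const std::uint32_t hi = a_hi + b_hi + carry;

    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}

// Hypothetical self-check comparing against the native 64-bit addition.
void check_sum_via_halves() {
    assert(sum_via_halves(1, 2) == 3);
    // The low half overflows, so the carry must reach the high half.
    assert(sum_via_halves(0xFFFFFFFFull, 1) == 0x100000000ull);
}
```

The comparison lo < a_lo stands in for the hardware carry flag: an unsigned sum is smaller than one of its operands only if it wrapped.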

However when I compile with -m32 -O2 -march=broadwell, or -march=native,
I see the following code being generated instead:

  	vmovq	xmm1, QWORD PTR [esp+12]	# b, b
  # 64-m32-example.cpp:6:     return a + b;
  	vmovq	xmm0, QWORD PTR [esp+4]	# tmp92, a
  	vpaddq	xmm0, xmm0, xmm1	# tmp90, tmp92, b
  	vmovd	eax, xmm0	# tmp93, tmp90
  	vpextrd	edx, xmm0, 1	# tmp94, tmp90,
  # 64-m32-example.cpp:7: }
  	ret
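
To make the data movement in that sequence easier to follow, here is a rough equivalent written with SSE2 intrinsics. It is only a sketch: sum_via_sse2 and check_sum_via_sse2 are hypothetical names, the high half is extracted with a byte shift plus a second movd rather than vpextrd (which would need SSE4.1), and the function falls back to a plain addition when the compiler does not define __SSE2__.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: perform the 64-bit add in a vector register and
// then move the two 32-bit halves back out, as the generated code does.
std::uint64_t sum_via_sse2(std::uint64_t a, std::uint64_t b) {
    // movq: load 64 bits into the low lane, zeroing the upper lane.
    const __m128i vb = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(&b));
    const __m128i va = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(&a));
    // paddq: 64-bit integer add within each lane.
    const __m128i vsum = _mm_add_epi64(va, vb);
    // movd: the low 32 bits go to a general-purpose register.
    const std::uint32_t lo =
        static_cast<std::uint32_t>(_mm_cvtsi128_si32(vsum));
    // Shift right by 4 bytes so bits 32..63 land in the low 32 bits.
    const std::uint32_t hi = static_cast<std::uint32_t>(
        _mm_cvtsi128_si32(_mm_srli_si128(vsum, 4)));
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}
#else
// Fallback so the sketch still builds on targets without SSE2.
std::uint64_t sum_via_sse2(std::uint64_t a, std::uint64_t b) {
    return a + b;
}
#endif

// Hypothetical self-check against the native 64-bit addition.
void check_sum_via_sse2() {
    assert(sum_via_sse2(1, 2) == 3);
    assert(sum_via_sse2(0xFFFFFFFFull, 1) == 0x100000000ull);
}
```

Written this way, the cost of the round trip is visible: two loads into vector registers, one add, then two transfers back to eax and edx.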

I found it fascinating that using the SIMD instructions for such a
simple function would be the optimal approach, so I ran a microbenchmark
using hayai, and the results are quite interesting.

For the simple case, using mov, add and adc, the operation is so fast
that it's beyond the resolution of my benchmark ('inf'):
----------
 Average time: 0.006 us (~0.095 us)
 Fastest time: 0.000 us (-0.006 us / -100.000 %)
 Slowest time: 3.958 us (+3.952 us / +68209.689 %)
  Median time: 0.000 us (1st quartile: 0.000 us | 3rd quartile: 0.000 us)
                      
 Average performance: 172586379.48293 runs/s
    Best performance: inf runs/s (+inf runs/s / +inf %)
   Worst performance: 252652.85498 runs/s (-172333726.62795 runs/s / -99.85361 %)
  Median performance: inf runs/s (1st quartile: inf | 3rd quartile: inf)
----------


For the code using xmm0 and xmm1:
----------
 Average time: 24.901 us (~1.144 us)
 Fastest time: 23.867 us (-1.034 us / -4.153 %)
 Slowest time: 61.867 us (+36.966 us / +148.451 %)
  Median time: 24.867 us (1st quartile: 24.867 us | 3rd quartile: 24.867 us)
                      
 Average performance: 40158.86848 runs/s
    Best performance: 41898.85616 runs/s (+1739.98768 runs/s / +4.33276 %)
   Worst performance: 16163.70601 runs/s (-23995.16247 runs/s / -59.75059 %)
  Median performance: 40213.93815 runs/s (1st quartile: 40213.93815 | 3rd quartile: 40213.93815)
----------
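
The 'inf' figures in the first run suggest that a single call is shorter than the timer can resolve. One common workaround, sketched below with std::chrono rather than hayai, is to time a whole batch of calls and divide. BatchResult and time_batch are hypothetical names, and the volatile operand is an assumption about what is needed to stop the compiler folding the loop away; the figure it reports includes loop and load overhead, so treat it as a rough upper bound rather than the cost of the addition itself.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// The function under test, as in the original post.
std::uint64_t sum(std::uint64_t a, std::uint64_t b) {
    return a + b;
}

// Hypothetical result type: average time per call plus a checksum that
// makes the computed values observable.
struct BatchResult {
    double ns_per_call;
    std::uint64_t checksum;
};

// Hypothetical helper: time `iterations` calls as one batch.
BatchResult time_batch(std::uint64_t iterations) {
    // Reading a volatile each iteration keeps the compiler from replacing
    // the loop with a closed-form expression.
    volatile std::uint64_t step = 1;
    std::uint64_t acc = 0;

    const auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) {
        acc = sum(acc, step);
    }
    const auto stop = std::chrono::steady_clock::now();

    const double total_ns =
        std::chrono::duration<double, std::nano>(stop - start).count();
    const double per_call =
        iterations ? total_ns / static_cast<double>(iterations) : 0.0;
    return BatchResult{per_call, acc};
}

// Hypothetical self-check: adding 1 a thousand times must give 1000.
void check_time_batch() {
    const BatchResult r = time_batch(1000);
    assert(r.checksum == 1000);
    assert(r.ns_per_call >= 0.0);
}
```

Calling time_batch with a large iteration count, once per set of compiler flags, gives a per-call average that can be compared across the two code sequences.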


Can anyone explain why the optimizer is producing this output? I don't
pretend to have any knowledge of how these decisions are made. Perhaps
using the SIMD instructions results in a higher throughput which is
invisible in my contrived benchmarks?

For reference, I am compiling everything using gcc trunk, at commit
254929 from Sun Nov 19, and I am benchmarking on a Skylake i7-6700K
at 4.0GHz.

Thanks,

Richard


* Re: Possible suboptimal code generated in 32-bit ABI mode
From: Richard Bradfield @ 2017-11-19 16:36 UTC
  To: gcc

On Sun, Nov 19, 2017 at 04:19:45PM +0000, bradfirj@fstab.me wrote:
>For reference, I am compiling everything using gcc trunk, at commit
>254929 from Sun Nov 19, and I am benchmarking on a Skylake i7-6700K
>at 4.0GHz.

And something else I should definitely have included in the first place,
I cannot reproduce this on the GCC 7.2 shipped with my distribution, so
is it possible this is a regression?

Richard


* Re: Possible suboptimal code generated in 32-bit ABI mode
From: Richard Biener @ 2017-11-20 10:21 UTC
  To: GCC Development

On Sun, Nov 19, 2017 at 5:35 PM, Richard Bradfield <bradfier@fstab.me> wrote:
> On Sun, Nov 19, 2017 at 04:19:45PM +0000, bradfirj@fstab.me wrote:
>>
>> For reference, I am compiling everything using gcc trunk, at commit
>> 254929 from Sun Nov 19, and I am benchmarking on a Skylake i7-6700K
>> at 4.0GHz.
>
>
> And something else I should definitely have included in the first place,
> I cannot reproduce this on the GCC 7.2 shipped with my distribution, so
> is it possible this is a regression?

Please open a bug report.

Thanks,
Richard.

> Richard

