Why does this unrolled function write to the stack?

From: Gaelan Steele @ 2023-02-08 13:29 UTC
  To: gcc-help

Hi all,

In a computer architecture class, we happened across a strange compilation choice by GCC that neither I nor my professor can make much sense of. The source is as follows:

void foo(int *a, const int *__restrict b, const int *__restrict c)
{
  for (int i = 0; i < 16; i++) {
    a[i] = b[i] + c[i];
  }
}

I won't reproduce the full compiled output here, as it's rather long, but when compiled with -O3 -mno-avx -mno-sse, GCC 12.2 for x86-64 (via Compiler Explorer: https://godbolt.org/z/o9e4o7cj4) produces an unrolled loop that appears to write each sum into an array on the stack before copying it into the provided pointer a. This seems hugely inefficient - it's doing quite a few memory accesses - and I can't see why it would be necessary.

Am I missing some reason why this is more efficient than the naive approach (computing each sum into an intermediate register, then writing it directly into a)?
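
As a rough source-level picture of the two shapes being compared (an illustration only, not GCC's actual output; the names add_direct and add_staged are made up for this sketch), the naive form stores each sum straight into a, while the staged form mimics the reported behaviour by filling a local array first and copying it out afterwards:

```c
#include <string.h>

enum { N = 16 };

/* Naive form: each sum is computed into a temporary (ideally a register)
   and written directly to a[i]. */
void add_direct(int *a, const int *__restrict b, const int *__restrict c)
{
    for (int i = 0; i < N; i++) {
        int sum = b[i] + c[i];
        a[i] = sum;
    }
}

/* Staged form: a hand-written analogy of the reported behaviour.
   Sums go into a local (stack) array first and are copied to a afterwards. */
void add_staged(int *a, const int *__restrict b, const int *__restrict c)
{
    int tmp[N];
    for (int i = 0; i < N; i++)
        tmp[i] = b[i] + c[i];
    memcpy(a, tmp, sizeof tmp);
}
```

Both leave the same contents in a. At the source level, the staged form adds an extra round trip through memory for every element, which is the overhead the question is about.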

Thanks,
Gaelan

* Re: Why does this unrolled function write to the stack?
From: Jonathan Wakely @ 2023-02-08 13:49 UTC
  To: Gaelan Steele; +Cc: gcc-help

On Wed, 8 Feb 2023 at 13:31, Gaelan Steele via Gcc-help
<gcc-help@gcc.gnu.org> wrote:
> I won't reproduce the full compiled output here, as it's rather long, but when compiled with -O3 -mno-avx -mno-sse, GCC 12.2 for x86-64 (via Compiler Explorer: https://godbolt.org/z/o9e4o7cj4) produces an unrolled loop that appears to write each sum into an array on the stack before copying it into the provided pointer a. This seems hugely inefficient - it's doing quite a few memory accesses - and I can't see why it would be necessary.

I don't think it's *necessary*. If you use -Os or -O1 or -O2 you get a
loop. So it's just an optimization choice at -O3 presumably based on
cost estimates that say that fully unrolling the loop will make the
code faster than looping.
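
To make "fully unrolling" concrete, here is a hand-written sketch of the same loop in rolled and fully unrolled form. The names foo_rolled, foo_unrolled and the STEP macro are invented for illustration; at -O3 the compiler performs this transformation internally, and the code it actually generates may look different.

```c
/* Rolled form: one add per iteration, plus the counter update,
   compare and branch that control the loop. */
void foo_rolled(int *a, const int *__restrict b, const int *__restrict c)
{
    for (int i = 0; i < 16; i++)
        a[i] = b[i] + c[i];
}

/* Fully unrolled form, written out by hand: sixteen independent
   statements and no loop control at all. */
#define STEP(i) (a[(i)] = b[(i)] + c[(i)])
void foo_unrolled(int *a, const int *__restrict b, const int *__restrict c)
{
    STEP(0);  STEP(1);  STEP(2);  STEP(3);
    STEP(4);  STEP(5);  STEP(6);  STEP(7);
    STEP(8);  STEP(9);  STEP(10); STEP(11);
    STEP(12); STEP(13); STEP(14); STEP(15);
}
#undef STEP
```

The unrolled form trades code size for the removal of loop overhead, which is why it is only expected at the more aggressive optimization levels.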

>
> Am I missing some reason why this is more efficient than the naive approach (computing each sum into an intermediate register, then writing it directly into a)?

Benchmarking the function at different optimization levels I get:

Run on (8 X 4500 MHz CPU s)
CPU Caches:
 L1 Data 32 KiB (x4)
 L1 Instruction 32 KiB (x4)
 L2 Unified 256 KiB (x4)
 L3 Unified 8192 KiB (x1)
Load Average: 0.14, 0.22, 0.39
***WARNING*** CPU scaling is enabled, the benchmark real time
measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
O3               1.60 ns         1.60 ns    432901632
O2               3.56 ns         3.56 ns    197086506
O1               6.87 ns         6.86 ns    101839250
Os               8.23 ns         8.22 ns     85273333


Using quickbench:
https://quick-bench.com/q/sSwVvtrkOCp9q-XyKAevthiaNAw
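
For anyone who wants to repeat the measurement locally, a minimal timing sketch in plain C follows. It is not the Google Benchmark harness behind the quick-bench link; time_foo_ns is a hypothetical helper, and the idea is to build the file once per optimization level and compare the reported averages. It sticks to integer arithmetic on purpose, because floating-point arguments and return values may fail to compile on x86-64 when SSE is disabled.

```c
#include <time.h>

void foo(int *a, const int *__restrict b, const int *__restrict c)
{
    for (int i = 0; i < 16; i++)
        a[i] = b[i] + c[i];
}

/* Returns the average CPU time of one foo() call in nanoseconds,
   truncated to an integer. */
long long time_foo_ns(long long calls)
{
    static int a[16], b[16], c[16];
    for (int i = 0; i < 16; i++) {
        b[i] = i;
        c[i] = 2 * i;
    }

    /* Calling through a volatile function pointer keeps the compiler
       from inlining foo() or discarding the repeated calls. */
    void (*volatile fn)(int *, const int *, const int *) = foo;

    clock_t start = clock();
    for (long long n = 0; n < calls; n++)
        fn(a, b, c);
    clock_t stop = clock();

    long long ticks = (long long)(stop - start);
    return ticks * 1000000000LL / CLOCKS_PER_SEC / calls;
}
```

Calling time_foo_ns(100000000) from main and printing the result gives one number per build. The figure includes the cost of the indirect call, so it is only useful for comparing builds against each other, not as an absolute latency.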

* Re: Why does this unrolled function write to the stack?
From: Jonathan Wakely @ 2023-02-08 13:53 UTC
  To: Gaelan Steele; +Cc: gcc-help

Oops, sorry, those were my original results *without* the -mno-avx
-mno-sse options! But that just shows that vectorization makes the
function fast.

Turning that off I get:

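-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
O3               58.3 ns         58.2 ns     11725604
O2               61.7 ns         61.6 ns     10930434
O1               7.37 ns         7.35 ns     95752192
Os               8.57 ns         8.56 ns     79448548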

So it does look like GCC is making poor choices here: without
vectorization, -O3 and -O2 are roughly eight times slower than -O1.

* Re: Why does this unrolled function write to the stack?
From: David Brown @ 2023-02-08 15:32 UTC
  To: Jonathan Wakely, Gaelan Steele; +Cc: gcc-help

On 08/02/2023 14:53, Jonathan Wakely via Gcc-help wrote:
> On Wed, 8 Feb 2023 at 13:49, Jonathan Wakely <jwakely.gcc@gmail.com> wrote:
>> I don't think it's *necessary*. If you use -Os or -O1 or -O2 you get a
>> loop. So it's just an optimization choice at -O3 presumably based on
>> cost estimates that say that fully unrolling the loop will make the
>> code faster than looping.
>>

There's nothing wrong with the loop unrolling.  It's the use of space on 
the stack that's the problem.

> So it does look like GCC is making poor choices here.
> 

It seems to be a regression between gcc 10 and gcc 11 (discovered by 
changing the compiler on godbolt.org).  With gcc 11 onwards, the 
compiler seems to be using the stack to combine two 4-byte elements at a 
time into a single 8-byte element.  It's easy to see the effect by 
changing the loop size to 2.
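
A hand-written analogy of that pattern, with the loop size reduced to 2 as suggested. This is only a guess at the shape of the code, assuming a 4-byte int; foo2 and foo2_via_stack are hypothetical names, and the real instruction sequence should be checked on Compiler Explorer.

```c
#include <stdint.h>
#include <string.h>

/* Assumption for this sketch: two ints fit exactly in one 8-byte word. */
_Static_assert(sizeof(int) * 2 == sizeof(uint64_t), "expects 4-byte int");

/* Reduced test case: the original loop with a size of 2. */
void foo2(int *a, const int *__restrict b, const int *__restrict c)
{
    for (int i = 0; i < 2; i++)
        a[i] = b[i] + c[i];
}

/* Analogy of the observed code: both 4-byte sums are stored to a
   stack slot, reloaded as one 8-byte value, then stored to a. */
void foo2_via_stack(int *a, const int *__restrict b, const int *__restrict c)
{
    int pair[2];                        /* stack slot for the two sums */
    uint64_t wide;

    pair[0] = b[0] + c[0];
    pair[1] = b[1] + c[1];
    memcpy(&wide, pair, sizeof wide);   /* one 8-byte load from the stack */
    memcpy(a, &wide, sizeof wide);      /* one 8-byte store to a */
}
```

Both functions write the same two values. The second one saves a single store to a at the price of two stores and a load on the stack, which is consistent with the slowdown measured earlier in the thread.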

(I've no idea what causes the effect, or how to fix it - but knowing it 
is a regression might make it easier for you to find.)



* Re: Why does this unrolled function write to the stack?
From: Gaelan Steele @ 2023-02-08 18:52 UTC
  To: David Brown, Jonathan Wakely; +Cc: gcc-help

Thanks for the help investigating, both of you!

> It seems to be a regression between gcc 10 and gcc 11 (discovered by
> changing the compiler on godbolt.org).  With gcc 11 onwards, the
> compiler seems to be using the stack to combine two 4-byte elements at a
> time into a single 8-byte element.  It's easy to see the effect by
> changing the loop size to 2.

I gave this a bisect, and it looks like the commit that caused it is this one:

commit 33c0f246f799b7403171e97f31276a8feddd05c9
Author: Richard Biener <rguenther@suse.de>
Date:   Fri Oct 30 11:26:18 2020 +0100

    tree-optimization/97626 - handle SCCs properly in SLP stmt analysis

The details of how this ties back to the behavior we're seeing are
well over my head, unfortunately.

I'll go ahead and file a bug on the tracker.

Thanks again,
Gaelan
PS Apologies for the horrendous wrapping in my previous email! I'm
using webmail because my normal machine is in repair, and list etiquette
totally slipped my mind.
