From: Jonathan Wakely
Date: Wed, 8 Feb 2023 13:53:44 +0000
Subject: Re: Why does this unrolled function write to the stack?
To: Gaelan Steele
Cc: "gcc-help@gcc.gnu.org"

On Wed, 8 Feb 2023 at 13:49, Jonathan Wakely wrote:
>
> On Wed, 8 Feb 2023 at 13:31, Gaelan Steele via Gcc-help wrote:
> >
> > Hi all,
> >
> > In a computer architecture class, we happened across a strange
> > compilation choice by GCC that neither I nor my professor can make
> > much sense of.
> > The source is as follows:
> >
> > void foo(int *a, const int *__restrict b, const int *__restrict c)
> > {
> >     for (int i = 0; i < 16; i++) {
> >         a[i] = b[i] + c[i];
> >     }
> > }
> >
> > I won't reproduce the full compiled output here, as it's rather long,
> > but when compiled with -O3 -mno-avx -mno-sse, GCC 12.2 for x86-64
> > (via Compiler Explorer: https://godbolt.org/z/o9e4o7cj4) produces an
> > unrolled loop that appears to write each sum into an array on the
> > stack before copying it into the provided pointer a. This seems
> > hugely inefficient - it's doing quite a few memory accesses - and I
> > can't see why it would be necessary.
>
> I don't think it's *necessary*. If you use -Os or -O1 or -O2 you get a
> loop. So it's just an optimization choice at -O3, presumably based on
> cost estimates that say that fully unrolling the loop will make the
> code faster than looping.
>
> >
> > Am I missing some reason why this is more efficient than the naive
> > approach (computing each sum into an intermediate register, then
> > writing it directly into a)?
>
> Benchmarking the function at different optimization levels I get:
>
> Run on (8 X 4500 MHz CPU s)
> CPU Caches:
>   L1 Data 32 KiB (x4)
>   L1 Instruction 32 KiB (x4)
>   L2 Unified 256 KiB (x4)
>   L3 Unified 8192 KiB (x1)
> Load Average: 0.14, 0.22, 0.39
> ***WARNING*** CPU scaling is enabled, the benchmark real time
> measurements may be noisy and will incur extra overhead.
> -----------------------------------------------------
> Benchmark           Time             CPU   Iterations
> -----------------------------------------------------
> O3               1.60 ns         1.60 ns    432901632
> O2               3.56 ns         3.56 ns    197086506
> O1               6.87 ns         6.86 ns    101839250
> Os               8.23 ns         8.22 ns     85273333
>
> Using quickbench:
> https://quick-bench.com/q/sSwVvtrkOCp9q-XyKAevthiaNAw

Oops, sorry, those were my original results *without* the -mno-avx
-mno-sse options! But that just shows that vectorization makes the
function fast. Turning that off I get:

O3               58.3 ns         58.2 ns     11725604
O2               61.7 ns         61.6 ns     10930434
O1               7.37 ns         7.35 ns     95752192
Os               8.57 ns         8.56 ns     79448548

So it does look like GCC is making poor choices here.
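
In case anyone wants to reproduce this locally rather than on
quick-bench, a Google Benchmark harness along these lines should do it.
This is only a sketch, not the exact quick-bench code: the foo_O3/foo_O2
names, the global test arrays, and the use of GCC's optimize attribute
to get per-level copies of the loop into one binary are my own shorthand;
separate translation units built at different -O levels work just as well.

// Sketch only: per-level copies of foo() via GCC's optimize attribute.
// Separate source files compiled at different -O levels are an alternative.
#include <benchmark/benchmark.h>

#define DEFINE_FOO(name, level)                              \
  __attribute__((optimize(level), noinline))                 \
  static void name(int *a, const int *__restrict b,          \
                   const int *__restrict c)                  \
  {                                                          \
    for (int i = 0; i < 16; i++)                             \
      a[i] = b[i] + c[i];                                    \
  }

DEFINE_FOO(foo_O3, "O3")
DEFINE_FOO(foo_O2, "O2")
// ... and likewise for O1 and Os.

static int a[16], b[16], c[16];   // contents don't matter for the timing

static void O3(benchmark::State& state) {
  for (auto _ : state) {
    foo_O3(a, b, c);
    benchmark::DoNotOptimize(a[0]);  // keep the result live
    benchmark::ClobberMemory();      // force the stores to happen
  }
}
BENCHMARK(O3);

static void O2(benchmark::State& state) {
  for (auto _ : state) {
    foo_O2(a, b, c);
    benchmark::DoNotOptimize(a[0]);
    benchmark::ClobberMemory();
  }
}
BENCHMARK(O2);

BENCHMARK_MAIN();

Build with something like "g++ -O2 bench.cc -lbenchmark -lpthread"
(assuming Google Benchmark is installed). Two caveats: the optimize
attribute is not a perfect substitute for a real command-line -O level,
so take per-level numbers from a single binary with a grain of salt; and
for the -mno-avx -mno-sse numbers the foo variants want to live in their
own file compiled with those flags, because disabling SSE for the whole
harness is likely to break Google Benchmark's own floating-point code.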