From: Jonathan Wakely
Date: Wed, 8 Feb 2023 13:53:44 +0000
Subject: Re: Why does this unrolled function write to the stack?
To: Gaelan Steele
Cc: "gcc-help@gcc.gnu.org"

On Wed, 8 Feb 2023 at 13:49, Jonathan Wakely wrote:
>
> On Wed, 8 Feb 2023 at 13:31, Gaelan Steele via Gcc-help wrote:
> >
> > Hi all,
> >
> > In a computer architecture class, we happened across a strange
> > compilation choice by GCC that neither I nor my professor can make
> > much sense of.
> > The source is as follows:
> >
> > void foo(int *a, const int *__restrict b, const int *__restrict c)
> > {
> >     for (int i = 0; i < 16; i++) {
> >         a[i] = b[i] + c[i];
> >     }
> > }
> >
> > I won't reproduce the full compiled output here, as it's rather long,
> > but when compiled with -O3 -mno-avx -mno-sse, GCC 12.2 for x86-64
> > (via Compiler Explorer: https://godbolt.org/z/o9e4o7cj4) produces an
> > unrolled loop that appears to write each sum into an array on the
> > stack before copying it into the provided pointer a. This seems
> > hugely inefficient - it's doing quite a few memory accesses - and I
> > can't see why it would be necessary.
>
> I don't think it's *necessary*. If you use -Os or -O1 or -O2 you get a
> loop. So it's just an optimization choice at -O3, presumably based on
> cost estimates that say that fully unrolling the loop will make the
> code faster than looping.
>
> >
> > Am I missing some reason why this is more efficient than the naive
> > approach (computing each sum into an intermediate register, then
> > writing it directly into a)?
>
> Benchmarking the function at different optimization levels I get:
>
> Run on (8 X 4500 MHz CPU s)
> CPU Caches:
>   L1 Data 32 KiB (x4)
>   L1 Instruction 32 KiB (x4)
>   L2 Unified 256 KiB (x4)
>   L3 Unified 8192 KiB (x1)
> Load Average: 0.14, 0.22, 0.39
> ***WARNING*** CPU scaling is enabled, the benchmark real time
> measurements may be noisy and will incur extra overhead.
> -----------------------------------------------------
> Benchmark           Time             CPU   Iterations
> -----------------------------------------------------
> O3               1.60 ns         1.60 ns    432901632
> O2               3.56 ns         3.56 ns    197086506
> O1               6.87 ns         6.86 ns    101839250
> Os               8.23 ns         8.22 ns     85273333
>
> Using quickbench:
> https://quick-bench.com/q/sSwVvtrkOCp9q-XyKAevthiaNAw

Oops, sorry, those were my original results *without* the -mno-avx
-mno-sse options! But that just shows that vectorization makes the
function fast. Turning that off I get:

O3               58.3 ns         58.2 ns     11725604
O2               61.7 ns         61.6 ns     10930434
O1               7.37 ns         7.35 ns     95752192
Os               8.57 ns         8.56 ns     79448548

So it does look like GCC is making poor choices here.
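
In case anyone wants to reproduce this locally rather than on
quick-bench, a Google Benchmark harness along these lines should do it.
This is only a sketch, not the exact quick-bench code: the foo_O3/foo_O2
names, the global test arrays, and the use of GCC's optimize attribute
to get per-level copies of the loop into one binary are my own shorthand;
separate translation units built at different -O levels work just as well.

// Sketch only: per-level copies of foo() via GCC's optimize attribute.
// Separate source files compiled at different -O levels are an alternative.
#include <benchmark/benchmark.h>

#define DEFINE_FOO(name, level)                              \
  __attribute__((optimize(level), noinline))                 \
  static void name(int *a, const int *__restrict b,          \
                   const int *__restrict c)                  \
  {                                                          \
    for (int i = 0; i < 16; i++)                             \
      a[i] = b[i] + c[i];                                    \
  }

DEFINE_FOO(foo_O3, "O3")
DEFINE_FOO(foo_O2, "O2")
// ... and likewise for O1 and Os.

static int a[16], b[16], c[16];   // contents don't matter for the timing

static void O3(benchmark::State& state) {
  for (auto _ : state) {
    foo_O3(a, b, c);
    benchmark::DoNotOptimize(a[0]);  // keep the result live
    benchmark::ClobberMemory();      // force the stores to happen
  }
}
BENCHMARK(O3);

static void O2(benchmark::State& state) {
  for (auto _ : state) {
    foo_O2(a, b, c);
    benchmark::DoNotOptimize(a[0]);
    benchmark::ClobberMemory();
  }
}
BENCHMARK(O2);

BENCHMARK_MAIN();

Build with something like "g++ -O2 bench.cc -lbenchmark -lpthread"
(assuming Google Benchmark is installed). Two caveats: the optimize
attribute is not a perfect substitute for a real command-line -O level,
so take per-level numbers from a single binary with a grain of salt; and
for the -mno-avx -mno-sse numbers the foo variants want to live in their
own file compiled with those flags, because disabling SSE for the whole
harness is likely to break Google Benchmark's own floating-point code.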