From: Jonathan Wakely
Date: Wed, 8 Feb 2023 13:49:50 +0000
Subject: Re: Why does this unrolled function write to the stack?
To: Gaelan Steele
Cc: "gcc-help@gcc.gnu.org"

On Wed, 8 Feb 2023 at 13:31, Gaelan Steele via Gcc-help wrote:
>
> Hi all,
>
> In a computer architecture class, we happened across a strange compilation choice by GCC that neither I nor my professor can make much sense of. The source is as follows:
>
> void foo(int *a, const int *__restrict b, const int *__restrict c)
> {
>     for (int i = 0; i < 16; i++) {
>         a[i] = b[i] + c[i];
>     }
> }
>
> I won't reproduce the full compiled output here, as it's rather long, but when compiled with -O3 -mno-avx -mno-sse, GCC 12.2 for x86-64 (via Compiler Explorer: https://godbolt.org/z/o9e4o7cj4) produces an unrolled loop that appears to write each sum into an array on the stack before copying it into the provided pointer a. This seems hugely inefficient - it's doing quite a few memory accesses - and I can't see why it would be necessary.

I don't think it's *necessary*. If you use -Os or -O1 or -O2 you get a loop. So it's just an optimization choice at -O3 presumably based on cost estimates that say that fully unrolling the loop will make the code faster than looping.

>
> Am I missing some reason why this is more efficient than the naive approach (computing each sum into an intermediate register, then writing it directly into a)?

Benchmarking the function at different optimization levels I get:

Run on (8 X 4500 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 8192 KiB (x1)
Load Average: 0.14, 0.22, 0.39
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
O3               1.60 ns         1.60 ns    432901632
O2               3.56 ns         3.56 ns    197086506
O1               6.87 ns         6.86 ns    101839250
Os               8.23 ns         8.22 ns     85273333

Using quickbench: https://quick-bench.com/q/sSwVvtrkOCp9q-XyKAevthiaNAw
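
For anyone who wants to try a comparison like this locally without quick-bench, here is a rough sketch in plain C. It is not the harness behind the numbers above (that one is at the quick-bench link). The helper name time_foo_ns, the input values and the use of clock_gettime are my own assumptions, and the empty asm statement is a GCC/Clang extension used only to stop the compiler from folding the work away. Since foo is in the same file as the timing loop, it may be inlined, so treat the result as indicative at best.

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Function under test, copied from the original question. */
void foo(int *a, const int *__restrict b, const int *__restrict c)
{
    for (int i = 0; i < 16; i++) {
        a[i] = b[i] + c[i];
    }
}

/* Hypothetical helper (not from the thread): returns the average
 * wall-clock time of one call to foo(), in nanoseconds. */
double time_foo_ns(long iterations)
{
    int a[16], b[16], c[16];

    /* Arbitrary input data, chosen only for illustration. */
    for (int i = 0; i < 16; i++) {
        b[i] = i;
        c[i] = 2 * i;
    }

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long n = 0; n < iterations; n++) {
        foo(a, b, c);
        /* Compiler barrier: makes the three arrays escape and tells the
         * compiler that memory may have changed, so the additions and
         * stores cannot be hoisted out of the loop or discarded. */
        __asm__ volatile("" : : "r"(a), "r"(b), "r"(c) : "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9
                      + (double)(end.tv_nsec - start.tv_nsec);
    return elapsed_ns / (double)iterations;
}
```

To use it, call time_foo_ns() from a small main() with a large iteration count, print the result, and build the file once per optimization level (for example with -O3 -mno-avx -mno-sse, then -O2, -O1 and -Os). As with the run above, CPU frequency scaling will make the figures noisy.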