From: Richard Biener
Date: Mon, 23 Nov 2020 09:42:43 +0100
Subject: Re: [PATCH] Check calls before loop unrolling
To: Segher Boessenkool
Cc: Jan Hubicka, GCC Patches, Bill Schmidt, David Edelsohn
In-Reply-To: <20201120180945.GC2672@gate.crashing.org>
References: <20200820043445.2216872-1-guojiufu@linux.ibm.com>
 <20201119194206.GX2672@gate.crashing.org>
 <83caaae1-4348-97ee-be9c-5e3c2083729c@redhat.com>
 <20201119200111.GY2672@gate.crashing.org>
 <20201119235620.GB2672@gate.crashing.org>
 <20201120152247.GA97803@kam.mff.cuni.cz>
 <20201120180945.GC2672@gate.crashing.org>
List-Id: Gcc-patches mailing list

On Fri, Nov 20, 2020 at 7:11 PM Segher Boessenkool wrote:
>
> Hi!
>
> On Fri, Nov 20, 2020 at 04:22:47PM +0100, Jan Hubicka wrote:
> > As you know, I spend quite some time on inliner heuristics, but even
> > after all these years I have no clear idea how the requirements differ
> > from x86-64 to ppc, arm and s390. Clearly, compared to x86-64, prologues
> > may get more expensive on ppc/arm because of more registers (so we
> > should inline less into cold code), and function calls are more
> > expensive (so we should inline more into hot code). We do have PRs for
> > that in the testsuite, most of which I have looked through.
>
> I made -fshrink-wrap-separate to make prologues less expensive for stuff
> that is only used on the cold paths. This matters a lot -- and much
> more could be done there, but that requires changing the generated code,
> not just reordering it, so it is harder to do.
>
> Prologues (and epilogues) are only expensive if they are only needed for
> cold code, in a hot function.
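
A minimal C sketch (with made-up names) of the kind of function where this
pays off: the hot loop needs no stack frame or saved registers at all, only
the cold call does, so shrink-wrapping, and -fshrink-wrap-separate for the
individual prologue components, can keep that cost off the hot path.

/* Hypothetical example, for illustration only.  */
extern int slow_lookup (int key);     /* cold, assumed external helper */

int
cached_lookup (const int *cache, int n, int key)
{
  /* Hot path: no calls and few registers live, so no prologue is
     needed for it.  */
  for (int i = 0; i < n; i++)
    if (cache[i] == key)
      return i;

  /* Cold path: the call needs the return address (and anything live
     across it) saved, which is the only reason this function would set
     up a frame at all.  */
  int v = slow_lookup (key);
  return v >= 0 ? v : -1;
}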
>
> > The problem is that each of us has a different methodology - different
> > benchmarks to look at
>
> This is often a good thing as well; it increases our total coverage.
> But if not everyone sees all the results, that also hurts :-/
>
> > and different opinions on what is good for O2 and
> > O3.
>
> Yeah. The documentation for -O3 merely says "Optimize yet more.", but
> that is no guidance at all: why would a user ever use -O2 then?
>
> I always understood it as "-O2 is always faster than -O1, but -O3 is not
> always faster than -O2". Aka "-O2 is always a good choice, and -O3 is
> an even better choice for *some* code, but that needs testing per case".

So basically -O2 is supposed to be well balanced in compile time, code
size, performance and debuggability (if there is such a thing with
optimized code...). -O1 is what you should use for machine-generated
code; we kind-of promise to have no quadratic or worse algorithms in
compile time/memory use there, so you can throw a multi-gigabyte source
function at GCC and it should not blow up. And -O1 still optimizes.
-Os is for when you want small code size at all cost (more compile time,
less performance). -O3 is for when you want performance at all cost
(compile time, code size and the ability to debug).

So I'd always use -O2, unless doing a compute workload where I'd choose
-O3 (maybe selectively, for the relevant TUs; a small sketch of one way
to do that is at the end of this mail). Then there are profile feedback
and LTO; enable them if you can (I'd avoid them for code you need -O1
for). They really help GCC make an appropriate decision about which code
to optimize more or less, in line with the balanced profile that -O2
aims for.

> With at least that understanding, and also to battle inflation in general,
> we probably should move some things from -O3 to -O2.
>
> > From a long-term maintenance POV, I am worried about changing a lot of
> > --param defaults in different backends
>
> Me too. But changing a few key ones is just too important for
> performance :-/
>
> > simply because the meaning of
> > those values keeps changing (as early opts improve; we get better at
> > tracking optimizations during IPA passes; and our focus shifts from C
> > with sane inlines, to basic C++, to heavily templatized C++ with many
> > broken inline hints, to heavy C++ with LTO).
>
> I don't like it if targets start to differ too much (in what the generic
> passes effectively do), no matter what. It's just not maintainable.
>
> > For this reason I tend to prefer not to tweak things in target-specific
> > ways unless there is very clear evidence to do so, simply because I
> > think I will not be able to maintain code quality testing in the future.
>
> Yes, completely agreed. But that exception is important :-)
>
> > It would be very interesting to set up testing that could let us compare
> > the basic arches side by side with different defaults. Our LNT testing
> > does a good job for x86-64, but we have basically zero coverage publicly
> > available on other targets, and it is very hard to get inliner-relevant
> > benchmarks (where SPEC is not the best choice) done in a comparable way
> > on multiple arches.
>
> We cannot help with that on the cfarm unless we get dedicated hardware
> for such benchmarking (and I am not holding my breath for that; getting
> good coverage at all is hard enough). So you probably need to get such
> support for every arch separately, elsewhere :-/
>
>
> Segher
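
As a small sketch of the "selective -O3" idea above, one way to opt just a
hot kernel into -O3-style optimization while the rest of the build stays at
-O2 is GCC's optimize function attribute (the function below is made up,
and the GCC documentation recommends the attribute mainly for debugging, so
building the relevant TUs with -O3 remains the more robust route):

/* Compile the file with -O2; only this kernel gets the -O3 pipeline.  */
__attribute__ ((optimize ("O3")))
void
saxpy (float *restrict y, const float *restrict x, float a, int n)
{
  for (int i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}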