From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Bin.Cheng"
Date: Sat, 8 May 2021 09:49:43 +0800
Subject: Re: [PATCH/RFC] Add a new memory gathering optimization for loop (PR98598)
To: Feng Xue OS
Cc: David Malcolm, "gcc-patches@gcc.gnu.org"
List-Id: Gcc-patches mailing list
On Fri, Apr 30, 2021 at 1:20 PM Feng Xue OS via Gcc-patches wrote:
>
> >> This patch implements a new loop optimization according to the proposal
> >> in the RFC given at
> >> https://gcc.gnu.org/pipermail/gcc/2021-January/234682.html,
> >> so the idea is not repeated in this mail. Comments are welcome.
> >
> > With the caveat that I'm not an optimization expert (but no one else
> > seems to have replied), here are some thoughts.
> >
> > [...snip...]
> >
> >> Subject: [PATCH 1/3] mgo: Add a new memory gathering optimization for loop
> >> [PR98598]
> >
> > BTW, did you mean to also post patches 2 and 3?
> >
> Not yet, but they are ready. Since this is a kind of special optimization
> that uses the heap as temporary storage, which is not a common approach in
> GCC, we do not know the community's basic attitude towards it. So only the
> first patch was sent out for initial comments, in that it implements a
> generic MGO framework and is complete and self-contained. The other two
> patches just compose some enhancements for specific code patterns and a
> dynamic alias check. If this proposal is accepted in principle, we will
> submit the other two for review.
>
> >
> >> In nested loops, if scattered memory accesses inside the inner loop
> >> remain unchanged in the outer loop, we can sequentialize these loads by
> >> caching their values into a temporary memory region the first time, and
> >> reuse the cached data in following iterations. This can improve the
> >> efficiency of the cpu cache subsystem by reducing its unpredictable
> >> activities.
> >
> > I don't think you've cited any performance numbers so far.  Does the
> > optimization show a measurable gain on some benchmark(s)?  e.g.
is this > > ready to run SPEC yet, and how does it do? > > Yes, we have done that. Minor improvement about several point percentage > could gain for some real applications. And to be specific, we also get major > improvement as more than 30% for certain benchmark in SPEC2017. Hi Feng Xue, Could you help point out which bench it is? I failed to observe improvement in spec2017 on local x86 machine. I am running with O3 level. Thanks, bin > > > > >> To illustrate what the optimization will do, two pieces of pseudo code, > >> before and after transformation, are given. Suppose all loads and > >> "iter_count" are invariant in outer loop. > >> > >> From: > >> > >> outer-loop () > >> { > >> inner-loop (iter, iter_count) > >> { > >> Type1 v1 = LOAD (iter); > >> Type2 v2 = LOAD (v1); > >> Type3 v3 = LOAD (v2); > >> ... > >> iter = NEXT (iter); > >> } > >> } > >> > >> To: > >> > >> typedef struct cache_elem > >> { > >> bool init; > >> Type1 c_v1; > >> Type2 c_v2; > >> Type3 c_v3; > > > > Putting the "bool init;" at the front made me think "what about > > packing?" but presumably the idea is that every element is accessed in > > order, so it presumably benefits speed to have "init" at the top of the > > element, right? > > Yes, layout of the struct layout could be optimized in terms of size by > some means, such as: > o. packing "init" into a padding hole after certain field > o. if certain field is a pointer type, the field can take the role of "init" > (Non-NULL implies "initialized") > Now this simple scheme is straightforward, and would be enhanced > in various aspects later. > > >> } cache_elem; > >> > >> cache_elem *cache_arr = calloc (iter_count, sizeof (cache_elem)); > > > What if the allocation fails at runtime? Do we keep an unoptimized > > copy of the nested loops around as a fallback and have an unlikely > > branch to that copy? > > Yes, we should. 
> But in a different way: a flag is added into the original nested loop to
> control a runtime switch between optimized and unoptimized execution.
> This definitely incurs a runtime cost, but it avoids possible code-size
> bloat. A better handling, as a TODO, is to apply the dynamic switch for
> large loops, and loop cloning for small ones.
>
> > I notice that you're using calloc, presumably to clear all of the
> > "init" flags (and the whole buffer).
> >
> > FWIW, this feels like a case where it would be nice to have a thread-
> > local heap allocation, perhaps something like an obstack implemented in
> > the standard library - but that's obviously scope creep for this.
> Yes, that would be good, especially for many-threaded applications.
>
> > Could it make sense to use alloca for small allocations?  (or is that
> > scope creep?)
> We did consider using alloca as you said. But if we cannot determine an
> upper limit for a non-constant size, we have to place the alloca inside a
> loop that encloses the nested loop. Without a corresponding free
> operation, this kind of alloca-in-loop might cause stack overflow. So it
> becomes another TODO.
>
> >>   outer-loop ()
> >>     {
> >>       size_t cache_idx = 0;
> >>
> >>       inner-loop (iter, iter_count)
> >>         {
> >>           if (!cache_arr[cache_idx]->init)
> >>             {
> >>               v1 = LOAD (iter);
> >>               v2 = LOAD (v1);
> >>               v3 = LOAD (v2);
> >>
> >>               cache_arr[cache_idx]->init = true;
> >>               cache_arr[cache_idx]->c_v1 = v1;
> >>               cache_arr[cache_idx]->c_v2 = v2;
> >>               cache_arr[cache_idx]->c_v3 = v3;
> >>             }
> >>           else
> >>             {
> >>               v1 = cache_arr[cache_idx]->c_v1;
> >>               v2 = cache_arr[cache_idx]->c_v2;
> >>               v3 = cache_arr[cache_idx]->c_v3;
> >>             }
> >>           ...
> >>           cache_idx++;
> >>           iter = NEXT (iter);
> >>         }
> >>     }
> >>
> >>   free (cache_arr);
> >
> > I see that the pass puts a "free" on every CFG exit from the loop.
> >
> > What about "longjmp" and C++ exceptions?  Are we guaranteed that "free"
> > will happen for those ways of exiting the loop?
> > (should have test cases for this, I think)
> >
> We simply disable the optimization if a loop contains an abnormal or
> EH exit.
>
> > [...]
> >
> >> 2020-12-25  Feng Xue
> >>
> >> gcc/
> >>         PR tree-optimization/98598
> >>         * Makefile.in (OBJS): Add tree-ssa-loop-mgo.o.
> >>         * common.opt (-ftree-loop-mgo): New option.
> >>         * opts.c (default_options_table): Enable -ftree-loop-mgo at -O3+.
> >>         * params.opt (mgo-max-dep-load-level): New parameter.
> >>         (mgo-max-cache-elem-size): Likewise.
> >>         (mgo-min-cache-array-length): Likewise.
> >>         (mgo-max-cache-array-length): Likewise.
> >>         * doc/invoke.texi (mgo-max-dep-load-level): Document new parameter.
> >>         (mgo-max-cache-elem-size): Likewise.
> >>         (mgo-min-cache-array-length): Likewise.
> >>         (mgo-max-cache-array-length): Likewise.
> >
> > Nit: also need to document "-ftree-loop-mgo".
> OK.
>
> >>         * passes.def: Add new pass_loop_mgo pass.
> >>         * timevar.def (TV_LOOP_MGO): New timevar.
> >>         * tree-pass.h (make_pass_loop_mgo): New declaration.
> >>         * tree-ssa-loop-mgo.c: New file.
> >
> > Nit: new C++ source files should have a .cc extension, rather than .c
> OK.
>
> >> gcc/testsuite/
> >>         PR tree-optimization/98598
> >>         * gcc.dg/tree-ssa/mgo/mgo.exp: New file.
> >>         * gcc.dg/tree-ssa/mgo/mgo-common.h: New test header.
> >>         * gcc.dg/tree-ssa/mgo/list/list-1.c: New test.
> >>         * gcc.dg/tree-ssa/mgo/array/simple-1.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/simple-2.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/simple-3.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/param-1.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/param-2.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/param-3.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/dep-load-1.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/dep-load-2.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/dep-load-3.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/dep-load-4.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/dep-load-5.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/dep-load-6.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/dep-load-7.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/dep-load-8.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/dep-load-9.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/dep-load-10.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/indx-iter-1.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/indx-iter-2.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/indx-iter-3.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/indx-iter-4.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/indx-iter-5.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/indx-iter-6.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/outer-loop-1.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/outer-loop-2.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/outer-loop-3.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/outer-loop-4.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/outer-loop-5.c: Likewise.
> >
> > [...]
> >
> > Presumably the optimization can't happen if the underlying data gets
> > modified during the iteration.
> Yes. Any data to be gathered should be read-only in the loop.
>
> > I briefly looked through the test cases, but I didn't see any/(much?)
> > test coverage for loops that write back to the data, or call functions
> > that might do that.  Sorry if I missed some, but clearly we need to
> > test that we don't incorrectly optimize those cases.
> If we cannot statically deduce whether the data might be written, we add
> a dynamic alias check to detect write hazards, which is implemented in
> another patch. Test cases covering data writes and function calls are
> also included in it.
>
> > What happens in a multithreaded program in which another thread might
> > be writing to the data?  What are we allowed to optimize?
> For data access in multi-threaded code, we assume the programmer knows
> how to avoid data races, using synchronization primitive calls or the
> "volatile" specifier to annotate the related data.
> If a non-pure call or a "volatile" memory access is found, the
> optimization will be suppressed.
>
> > Are there some C++ examples?  Fortran?
> OK. Will add some.
>
> > It would be great to be able to inject calloc failures when testing
> > this, but obviously that's out of scope for such a patch.
> OK. This is a TODO.
>
> > Hope this is constructive
> It definitely is, thanks for your comments.
>
> Feng