From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Bin.Cheng"
Date: Sat, 8 May 2021 09:49:43 +0800
Subject: Re: [PATCH/RFC] Add a new memory gathering optimization for loop (PR98598)
To: Feng Xue OS
Cc: David Malcolm, "gcc-patches@gcc.gnu.org"
List-Id: Gcc-patches mailing list
On Fri, Apr 30, 2021 at 1:20 PM Feng Xue OS via Gcc-patches wrote:
>
> >> This patch implements a new loop optimization according to the proposal
> >> in the RFC given at
> >> https://gcc.gnu.org/pipermail/gcc/2021-January/234682.html,
> >> so the idea is not repeated in this mail. Comments are welcome.
> >
> > With the caveat that I'm not an optimization expert (but no one else
> > seems to have replied), here are some thoughts.
> >
> > [...snip...]
> >
> >> Subject: [PATCH 1/3] mgo: Add a new memory gathering optimization for loop
> >> [PR98598]
> >
> > BTW, did you mean to also post patches 2 and 3?
> >
> Not yet, but they are ready. Since this is a kind of special optimization
> that uses the heap as temporary storage, which is not a common approach in
> GCC, we do not know the community's basic attitude towards it. So only the
> first patch was sent out for initial comments, in that it implements a
> generic MGO framework and is complete and self-contained. The other two
> patches just compose some enhancements for specific code patterns and a
> dynamic alias check. If this proposal is accepted in principle, we will
> submit the other two for review.
>
> >
> >> In nested loops, if scattered memory accesses inside the inner loop
> >> remain unchanged in the outer loop, we can sequentialize these loads by
> >> caching their values into a temporary memory region the first time, and
> >> reuse the cached data in following iterations. This can improve the
> >> efficiency of the cpu cache subsystem by reducing its unpredictable
> >> activities.
> >
> > I don't think you've cited any performance numbers so far.  Does the
> > optimization show a measurable gain on some benchmark(s)?  e.g.
is this > > ready to run SPEC yet, and how does it do? > > Yes, we have done that. Minor improvement about several point percentage > could gain for some real applications. And to be specific, we also get major > improvement as more than 30% for certain benchmark in SPEC2017. Hi Feng Xue, Could you help point out which bench it is? I failed to observe improvement in spec2017 on local x86 machine. I am running with O3 level. Thanks, bin > > > > >> To illustrate what the optimization will do, two pieces of pseudo code, > >> before and after transformation, are given. Suppose all loads and > >> "iter_count" are invariant in outer loop. > >> > >> From: > >> > >> outer-loop () > >> { > >> inner-loop (iter, iter_count) > >> { > >> Type1 v1 = LOAD (iter); > >> Type2 v2 = LOAD (v1); > >> Type3 v3 = LOAD (v2); > >> ... > >> iter = NEXT (iter); > >> } > >> } > >> > >> To: > >> > >> typedef struct cache_elem > >> { > >> bool init; > >> Type1 c_v1; > >> Type2 c_v2; > >> Type3 c_v3; > > > > Putting the "bool init;" at the front made me think "what about > > packing?" but presumably the idea is that every element is accessed in > > order, so it presumably benefits speed to have "init" at the top of the > > element, right? > > Yes, layout of the struct layout could be optimized in terms of size by > some means, such as: > o. packing "init" into a padding hole after certain field > o. if certain field is a pointer type, the field can take the role of "init" > (Non-NULL implies "initialized") > Now this simple scheme is straightforward, and would be enhanced > in various aspects later. > > >> } cache_elem; > >> > >> cache_elem *cache_arr = calloc (iter_count, sizeof (cache_elem)); > > > What if the allocation fails at runtime? Do we keep an unoptimized > > copy of the nested loops around as a fallback and have an unlikely > > branch to that copy? > > Yes, we should. 
> But in a different way: a flag is added into the original nested loop to
> control a runtime switch between optimized and unoptimized execution.
> This definitely incurs a runtime cost, but it avoids possible code-size
> bloat. A better handling, as a TODO, is to apply the dynamic switch for
> large loops, and loop cloning for small ones.
>
> > I notice that you're using calloc, presumably to clear all of the
> > "init" flags (and the whole buffer).
> >
> > FWIW, this feels like a case where it would be nice to have a thread-
> > local heap allocation, perhaps something like an obstack implemented in
> > the standard library - but that's obviously scope creep for this.
> Yes, that would be good, especially for many-threaded applications.
>
> > Could it make sense to use alloca for small allocations?  (or is that
> > scope creep?)
> We did consider using alloca as you said. But if we cannot determine an
> upper limit for a non-constant size, we have to place the alloca inside a
> loop that encloses the nested loop. Without a corresponding free
> operation, this kind of alloca-in-loop might cause stack overflow. So it
> becomes another TODO.
>
> >>   outer-loop ()
> >>     {
> >>       size_t cache_idx = 0;
> >>
> >>       inner-loop (iter, iter_count)
> >>         {
> >>           if (!cache_arr[cache_idx]->init)
> >>             {
> >>               v1 = LOAD (iter);
> >>               v2 = LOAD (v1);
> >>               v3 = LOAD (v2);
> >>
> >>               cache_arr[cache_idx]->init = true;
> >>               cache_arr[cache_idx]->c_v1 = v1;
> >>               cache_arr[cache_idx]->c_v2 = v2;
> >>               cache_arr[cache_idx]->c_v3 = v3;
> >>             }
> >>           else
> >>             {
> >>               v1 = cache_arr[cache_idx]->c_v1;
> >>               v2 = cache_arr[cache_idx]->c_v2;
> >>               v3 = cache_arr[cache_idx]->c_v3;
> >>             }
> >>           ...
> >>           cache_idx++;
> >>           iter = NEXT (iter);
> >>         }
> >>     }
> >>
> >>   free (cache_arr);
> >
> > I see that the pass puts a "free" on every CFG exit from the loop.
> >
> > What about "longjmp" and C++ exceptions?  Are we guaranteed that "free"
> > will happen for those ways of exiting the loop?
> > (should have test cases for this, I think)
> >
> We simply disable the optimization if a loop contains an abnormal or
> EH exit.
>
> > [...]
> >
> >> 2020-12-25  Feng Xue
> >>
> >> gcc/
> >>         PR tree-optimization/98598
> >>         * Makefile.in (OBJS): Add tree-ssa-loop-mgo.o.
> >>         * common.opt (-ftree-loop-mgo): New option.
> >>         * opts.c (default_options_table): Enable -ftree-loop-mgo at -O3+.
> >>         * params.opt (mgo-max-dep-load-level): New parameter.
> >>         (mgo-max-cache-elem-size): Likewise.
> >>         (mgo-min-cache-array-length): Likewise.
> >>         (mgo-max-cache-array-length): Likewise.
> >>         * doc/invoke.texi (mgo-max-dep-load-level): Document new parameter.
> >>         (mgo-max-cache-elem-size): Likewise.
> >>         (mgo-min-cache-array-length): Likewise.
> >>         (mgo-max-cache-array-length): Likewise.
> >
> > Nit: also need to document "-ftree-loop-mgo".
> OK.
>
> >>         * passes.def: Add new pass_loop_mgo pass.
> >>         * timevar.def (TV_LOOP_MGO): New timevar.
> >>         * tree-pass.h (make_pass_loop_mgo): New declaration.
> >>         * tree-ssa-loop-mgo.c: New file.
> >
> > Nit: new C++ source files should have a .cc extension, rather than .c
> OK.
>
> >> gcc/testsuite/
> >>         PR tree-optimization/98598
> >>         * gcc.dg/tree-ssa/mgo/mgo.exp: New file.
> >>         * gcc.dg/tree-ssa/mgo/mgo-common.h: New test header.
> >>         * gcc.dg/tree-ssa/mgo/list/list-1.c: New test.
> >>         * gcc.dg/tree-ssa/mgo/array/simple-1.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/simple-2.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/simple-3.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/param-1.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/param-2.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/param-3.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/dep-load-1.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/dep-load-2.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/dep-load-3.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/dep-load-4.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/dep-load-5.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/dep-load-6.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/dep-load-7.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/dep-load-8.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/dep-load-9.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/dep-load-10.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/indx-iter-1.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/indx-iter-2.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/indx-iter-3.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/indx-iter-4.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/indx-iter-5.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/indx-iter-6.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/outer-loop-1.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/outer-loop-2.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/outer-loop-3.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/outer-loop-4.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/outer-loop-5.c: Likewise.
> >
> > [...]
> >
> > Presumably the optimization can't happen if the underlying data gets
> > modified during the iteration.
> Yes. Any data to be gathered should be read-only in the loop.
>
> > I briefly looked through the test cases, but I didn't see any/(much?)
> > test coverage for loops that write back to the data, or call functions
> > that might do that.  Sorry if I missed some, but clearly we need to
> > test that we don't incorrectly optimize those cases.
> If we cannot statically deduce whether the data might be written, we add
> a dynamic alias check to detect write hazards, which is implemented in
> another patch. Test cases covering data writes and function calls are
> also included in it.
>
> > What happens in a multithreaded program in which another thread might
> > be writing to the data?  What are we allowed to optimize?
> For data access in multi-threaded code, we assume the programmer knows
> how to avoid data races, using synchronization primitive calls or the
> "volatile" specifier to annotate the related data.
> If a non-pure call or a "volatile" memory access is found, the
> optimization will be suppressed.
>
> > Are there some C++ examples?  Fortran?
> OK. Will add some.
>
> > It would be great to be able to inject calloc failures when testing
> > this, but obviously that's out of scope for such a patch.
> OK. This is a TODO.
>
> > Hope this is constructive
> It definitely is, thanks for your comments.
>
> Feng