public inbox for gcc-patches@gcc.gnu.org
From: David Malcolm <dmalcolm@redhat.com>
To: Feng Xue OS <fxue@os.amperecomputing.com>,
	"gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH/RFC] Add a new memory gathering optimization for loop (PR98598)
Date: Thu, 29 Apr 2021 09:46:46 -0400	[thread overview]
Message-ID: <49775de69149498230edb6554d176c758841f341.camel@redhat.com> (raw)
In-Reply-To: <MWHPR01MB2304C068CB1CBA463DB2AA9BF7A10@MWHPR01MB2304.prod.exchangelabs.com>

On Thu, 2021-01-21 at 06:27 +0000, Feng Xue OS via Gcc-patches wrote:
> This patch implements a new loop optimization according to the proposal
> in the RFC given at
> https://gcc.gnu.org/pipermail/gcc/2021-January/234682.html,
> so I will not repeat the idea in this mail. I hope for your comments on it.

With the caveat that I'm not an optimization expert (but no one else
seems to have replied), here are some thoughts.

[...snip...]

> Subject: [PATCH 1/3] mgo: Add a new memory gathering optimization for loop
>  [PR98598]

BTW, did you mean to also post patches 2 and 3?


> In nested loops, if scattered memory accesses inside the inner loop
> remain unchanged in the outer loop, we can sequentialize these loads by
> caching their values into a temporary memory region the first time, and
> reuse the cached data in subsequent iterations. This can improve the
> efficiency of the CPU cache subsystem by reducing its unpredictable activity.

I don't think you've cited any performance numbers so far.  Does the
optimization show a measurable gain on some benchmark(s)?  e.g. is this
ready to run SPEC yet, and how does it do?
 
> To illustrate what the optimization will do, two pieces of pseudo code,
> before and after transformation, are given. Suppose all loads and
> "iter_count" are invariant in outer loop.
> 
> From:
> 
>   outer-loop ()
>     {
>       inner-loop (iter, iter_count)
>         {
>           Type1 v1 = LOAD (iter);
>           Type2 v2 = LOAD (v1);
>           Type3 v3 = LOAD (v2);
>           ...
>           iter = NEXT (iter);
>         }
>     }
> 
> To:
> 
>   typedef struct cache_elem
>     {
>       bool   init;
>       Type1  c_v1;
>       Type2  c_v2;
>       Type3  c_v3;

Putting the "bool init;" at the front made me think "what about
packing?", but presumably the idea is that every element is accessed in
order, so having "init" at the start of the element benefits speed,
right?

>     } cache_elem;
> 
>   cache_elem *cache_arr = calloc (iter_count, sizeof (cache_elem));

What if the allocation fails at runtime?  Do we keep an unoptimized
copy of the nested loops around as a fallback and have an unlikely
branch to that copy?

I notice that you're using calloc, presumably to clear all of the
"init" flags (and the whole buffer).

FWIW, this feels like a case where it would be nice to have a thread-
local heap allocation, perhaps something like an obstack implemented in
the standard library - but that's obviously scope creep for this.

Could it make sense to use alloca for small allocations?  (or is that
scope creep?)
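Roughly, I was imagining the emitted code choosing stack storage when
the cache is below some threshold, along these lines (a hand-written
sketch, not what the pass emits; the threshold name and the fallback
strategy are made up):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical threshold below which the cache lives on the stack.  */
#define MGO_STACK_BYTES 256

/* Sum the elements through a cache buffer, picking stack vs heap
   storage by allocation size.  Purely illustrative.  */
static long sum_cached(const long *data, size_t n)
{
  size_t bytes = n * sizeof(long);
  long stack_buf[MGO_STACK_BYTES / sizeof(long)];
  long *cache = bytes <= sizeof(stack_buf) ? stack_buf : malloc(bytes);
  if (!cache)
    return -1; /* a real transform would branch to the unoptimized
                  loop copy here rather than fail */
  memcpy(cache, data, bytes);
  long sum = 0;
  for (size_t i = 0; i < n; i++)
    sum += cache[i];
  if (cache != stack_buf)
    free(cache);
  return sum;
}
```

The stack path also sidesteps the allocation-failure question for
small iteration counts, at the cost of needing a safe upper bound on
the size at the point of allocation.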
 
>   outer-loop ()
>     {
>       size_t cache_idx = 0;
> 
>       inner-loop (iter, iter_count)
>         {
>           if (!cache_arr[cache_idx].init)
>             {
>               v1 = LOAD (iter);
>               v2 = LOAD (v1);
>               v3 = LOAD (v2);
> 
>               cache_arr[cache_idx].init = true;
>               cache_arr[cache_idx].c_v1 = v1;
>               cache_arr[cache_idx].c_v2 = v2;
>               cache_arr[cache_idx].c_v3 = v3;
>             }
>           else
>             {
>               v1 = cache_arr[cache_idx].c_v1;
>               v2 = cache_arr[cache_idx].c_v2;
>               v3 = cache_arr[cache_idx].c_v3;
>             }
>           ...
>           cache_idx++;
>           iter = NEXT (iter);
>         }
>     }
> 
>   free (cache_arr);

I see that the pass puts a "free" on every CFG exit from the loop.

What about "longjmp" and C++ exceptions?  Are we guaranteed that "free"
will happen for those ways of exiting the loop?  (should have test
cases for this, I think)
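To check that I'm reading the transformation correctly, here's a
self-contained C sketch of the "before" and "after" shapes, using a
linked list whose nodes index into two outer-loop-invariant arrays
(all names here are mine, not from the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct node { int key; struct node *next; };

/* Values reached through dependent loads; invariant in the outer loop.  */
static const int level1[4] = { 3, 1, 2, 0 };
static const int level2[4] = { 40, 10, 20, 30 };

/* "Before": re-walks the dependent load chain on every outer pass.  */
static long run_plain(struct node *head, int outer)
{
  long sum = 0;
  for (int i = 0; i < outer; i++)
    for (struct node *it = head; it; it = it->next)
      sum += level2[level1[it->key]];
  return sum;
}

/* "After": the first inner pass fills the cache; later passes reuse it.  */
struct cache_elem { bool init; int c_v2; };

static long run_mgo(struct node *head, int outer, size_t iter_count)
{
  struct cache_elem *cache = calloc(iter_count, sizeof *cache);
  if (!cache)
    return run_plain(head, outer); /* unoptimized fallback */
  long sum = 0;
  for (int i = 0; i < outer; i++)
    {
      size_t idx = 0;
      for (struct node *it = head; it; it = it->next, idx++)
        {
          if (!cache[idx].init)
            {
              cache[idx].c_v2 = level2[level1[it->key]];
              cache[idx].init = true;
            }
          sum += cache[idx].c_v2;
        }
    }
  free(cache);
  return sum;
}
```

Both versions should agree whenever the loads really are invariant;
that equivalence seems like the key property the test suite needs to
pin down.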


[...]


> 2020-12-25  Feng Xue  <fxue@os.amperecomputing.com>
> 
> gcc/
> 	PR tree-optimization/98598
> 	* Makefile.in (OBJS): Add tree-ssa-loop-mgo.o.
> 	* common.opt (-ftree-loop-mgo): New option.
> 	* opts.c (default_options_table): Enable -ftree-loop-mgo at -O3+.
> 	* params.opt (mgo-max-dep-load-level): New parameter.
> 	(mgo-max-cache-elem-size): Likewise.
> 	(mgo-min-cache-array-length): Likewise.
> 	(mgo-max-cache-array-length): Likewise.
> 	* doc/invoke.texi (mgo-max-dep-load-level): Document new parameter.
> 	(mgo-max-cache-elem-size): Likewise.
> 	(mgo-min-cache-array-length): Likewise.
> 	(mgo-max-cache-array-length): Likewise.

Nit: also need to document "-ftree-loop-mgo".

> 	* passes.def: Add new pass_loop_mgo pass.
> 	* timevar.def (TV_LOOP_MGO): New timevar.
> 	* tree-pass.h (make_pass_loop_mgo): New declaration.
> 	* tree-ssa-loop-mgo.c: New file.

Nit: new C++ source files should have a .cc extension, rather than .c

> gcc/testsuite/
> 	PR tree-optimization/98598
> 	* gcc.dg/tree-ssa/mgo/mgo.exp: New file.
> 	* gcc.dg/tree-ssa/mgo/mgo-common.h: New test header.
> 	* gcc.dg/tree-ssa/mgo/list/list-1.c: New test.
> 	* gcc.dg/tree-ssa/mgo/array/simple-1.c: Likewise.
> 	* gcc.dg/tree-ssa/mgo/array/simple-2.c: Likewise.
> 	* gcc.dg/tree-ssa/mgo/array/simple-3.c: Likewise.
> 	* gcc.dg/tree-ssa/mgo/array/param-1.c: Likewise.
> 	* gcc.dg/tree-ssa/mgo/array/param-2.c: Likewise.
> 	* gcc.dg/tree-ssa/mgo/array/param-3.c: Likewise.
> 	* gcc.dg/tree-ssa/mgo/array/dep-load-1.c: Likewise.
> 	* gcc.dg/tree-ssa/mgo/array/dep-load-2.c: Likewise.
> 	* gcc.dg/tree-ssa/mgo/array/dep-load-3.c: Likewise.
> 	* gcc.dg/tree-ssa/mgo/array/dep-load-4.c: Likewise.
> 	* gcc.dg/tree-ssa/mgo/array/dep-load-5.c: Likewise.
> 	* gcc.dg/tree-ssa/mgo/array/dep-load-6.c: Likewise.
> 	* gcc.dg/tree-ssa/mgo/array/dep-load-7.c: Likewise.
> 	* gcc.dg/tree-ssa/mgo/array/dep-load-8.c: Likewise.
> 	* gcc.dg/tree-ssa/mgo/array/dep-load-9.c: Likewise.
> 	* gcc.dg/tree-ssa/mgo/array/dep-load-10.c: Likewise.
> 	* gcc.dg/tree-ssa/mgo/array/indx-iter-1.c: Likewise.
> 	* gcc.dg/tree-ssa/mgo/array/indx-iter-2.c: Likewise.
> 	* gcc.dg/tree-ssa/mgo/array/indx-iter-3.c: Likewise.
> 	* gcc.dg/tree-ssa/mgo/array/indx-iter-4.c: Likewise.
> 	* gcc.dg/tree-ssa/mgo/array/indx-iter-5.c: Likewise.
> 	* gcc.dg/tree-ssa/mgo/array/indx-iter-6.c: Likewise.
> 	* gcc.dg/tree-ssa/mgo/array/outer-loop-1.c: Likewise.
> 	* gcc.dg/tree-ssa/mgo/array/outer-loop-2.c: Likewise.
> 	* gcc.dg/tree-ssa/mgo/array/outer-loop-3.c: Likewise.
> 	* gcc.dg/tree-ssa/mgo/array/outer-loop-4.c: Likewise.
> 	* gcc.dg/tree-ssa/mgo/array/outer-loop-5.c: Likewise.

[...]

Presumably the optimization can't happen if the underlying data gets
modified during the iteration.
I briefly looked through the test cases, but I didn't see any/(much?)
test coverage for loops that write back to the data, or call functions
that might do that.  Sorry if I missed some, but clearly we need to
test that we don't incorrectly optimize those cases.
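For instance, I'd expect coverage along these lines (my own minimal
example, not from the patch), where a store between inner-loop passes
makes the loads non-invariant, so first-pass caching would return a
stale, wrong sum:

```c
#include <assert.h>

/* The inner loads are NOT invariant: the outer loop updates vals, so
   caching vals[idx[j]] after the first pass would read stale data.  */
static int idx[3] = { 2, 0, 1 };
static int vals[3];

static long run_with_writes(int outer)
{
  long sum = 0;
  vals[0] = 1; vals[1] = 2; vals[2] = 3;
  for (int i = 0; i < outer; i++)
    {
      for (int j = 0; j < 3; j++)
        sum += vals[idx[j]];
      vals[0]++;            /* the store that must block the caching */
    }
  return sum;
}
```

The correct result for three outer passes is 21; caching the
first-pass values would yield 18, so a dump-scan test asserting the
transform does NOT fire here seems essential.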

What happens in a multithreaded program in which another thread might
be writing to the data?  What are we allowed to optimize?

Are there some C++ examples?  Fortran?

It would be great to be able to inject calloc failures when testing
this, but obviously that's out of scope for such a patch.

Hope this is constructive
Dave



Thread overview: 10+ messages
2021-01-21  6:27 Feng Xue OS
2021-02-02  1:45 ` Ping: " Feng Xue OS
2021-04-02  5:44 ` PING^2 " Feng Xue OS
2021-04-29  2:50   ` PING^3 " Feng Xue OS
2021-04-29 13:46 ` David Malcolm [this message]
2021-04-30  4:02   ` Feng Xue OS
2021-05-08  1:49     ` Bin.Cheng
2021-04-30  7:37 ` Richard Biener
2021-05-07  4:29   ` Feng Xue OS
2021-05-10 11:56     ` Richard Biener
