Subject: Re: [PATCH/RFC] Add a new memory gathering optimization for loop (PR98598)
From: David Malcolm
To: Feng Xue OS, "gcc-patches@gcc.gnu.org"
Date: Thu, 29 Apr 2021 09:46:46 -0400
Message-ID: <49775de69149498230edb6554d176c758841f341.camel@redhat.com>
On Thu, 2021-01-21 at 06:27 +0000, Feng Xue OS via Gcc-patches wrote:
> This patch implements a new loop optimization according to the proposal
> in the RFC given at
> https://gcc.gnu.org/pipermail/gcc/2021-January/234682.html,
> so I do not repeat the idea in this mail. I hope for your comments on it.

With the caveat that I'm not an optimization expert (but no one else
seems to have replied), here are some thoughts.

[...snip...]

> Subject: [PATCH 1/3] mgo: Add a new memory gathering optimization for
> loop [PR98598]

BTW, did you mean to also post patches 2 and 3?

> In nested loops, if scattered memory accesses inside the inner loop
> remain unchanged in the outer loop, we can sequentialize these loads
> by caching their values into a temporary memory region the first time,
> and reuse the cached data in the following iterations. This can
> improve the efficiency of the CPU cache subsystem by reducing its
> unpredictable activity.

I don't think you've cited any performance numbers so far. Does the
optimization show a measurable gain on some benchmark(s)? e.g. is this
ready to run SPEC yet, and how does it do?

> To illustrate what the optimization will do, two pieces of pseudo
> code, before and after the transformation, are given. Suppose all
> loads and "iter_count" are invariant in the outer loop.
>
> From:
>
>   outer-loop ()
>     {
>       inner-loop (iter, iter_count)
>         {
>           Type1 v1 = LOAD (iter);
>           Type2 v2 = LOAD (v1);
>           Type3 v3 = LOAD (v2);
>           ...
>           iter = NEXT (iter);
>         }
>     }
>
> To:
>
>   typedef struct cache_elem
>     {
>       bool  init;
>       Type1 c_v1;
>       Type2 c_v2;
>       Type3 c_v3;

Putting the "bool init;" at the front made me think "what about
packing?", but presumably the idea is that every element is accessed in
order, so it presumably benefits speed to have "init" at the top of the
element, right?
>     } cache_elem;
>
>   cache_elem *cache_arr = calloc (iter_count, sizeof (cache_elem));

What if the allocation fails at runtime? Do we keep an unoptimized copy
of the nested loops around as a fallback and have an unlikely branch to
that copy?

I notice that you're using calloc, presumably to clear all of the
"init" flags (and the whole buffer).

FWIW, this feels like a case where it would be nice to have a
thread-local heap allocation, perhaps something like an obstack
implemented in the standard library - but that's obviously scope creep
for this.

Could it make sense to use alloca for small allocations? (or is that
scope creep?)

>   outer-loop ()
>     {
>       size_t cache_idx = 0;
>
>       inner-loop (iter, iter_count)
>         {
>           if (!cache_arr[cache_idx].init)
>             {
>               v1 = LOAD (iter);
>               v2 = LOAD (v1);
>               v3 = LOAD (v2);
>
>               cache_arr[cache_idx].init = true;
>               cache_arr[cache_idx].c_v1 = v1;
>               cache_arr[cache_idx].c_v2 = v2;
>               cache_arr[cache_idx].c_v3 = v3;
>             }
>           else
>             {
>               v1 = cache_arr[cache_idx].c_v1;
>               v2 = cache_arr[cache_idx].c_v2;
>               v3 = cache_arr[cache_idx].c_v3;
>             }
>           ...
>           cache_idx++;
>           iter = NEXT (iter);
>         }
>     }
>
>   free (cache_arr);

(Note: since cache_arr[cache_idx] is a struct, not a pointer, the
accesses above should use "." rather than "->"; I've written them that
way here.)

I see that the pass puts a "free" on every CFG exit from the loop.
What about "longjmp" and C++ exceptions? Are we guaranteed that "free"
will happen for those ways of exiting the loop? (we should have test
cases for this, I think)

[...]

> 2020-12-25  Feng Xue
>
> gcc/
>         PR tree-optimization/98598
>         * Makefile.in (OBJS): Add tree-ssa-loop-mgo.o.
>         * common.opt (-ftree-loop-mgo): New option.
>         * opts.c (default_options_table): Enable -ftree-loop-mgo at -O3+.
>         * params.opt (mgo-max-dep-load-level): New parameter.
>         (mgo-max-cache-elem-size): Likewise.
>         (mgo-min-cache-array-length): Likewise.
>         (mgo-max-cache-array-length): Likewise.
>         * doc/invoke.texi (mgo-max-dep-load-level): Document new parameter.
>         (mgo-max-cache-elem-size): Likewise.
>         (mgo-min-cache-array-length): Likewise.
>         (mgo-max-cache-array-length): Likewise.
Nit: also need to document "-ftree-loop-mgo".

>         * passes.def: Add new pass_loop_mgo pass.
>         * timevar.def (TV_LOOP_MGO): New timevar.
>         * tree-pass.h (make_pass_loop_mgo): New declaration.
>         * tree-ssa-loop-mgo.c: New file.

Nit: new C++ source files should have a .cc extension, rather than .c

> gcc/testsuite/
>         PR tree-optimization/98598
>         * gcc.dg/tree-ssa/mgo/mgo.exp: New file.
>         * gcc.dg/tree-ssa/mgo/mgo-common.h: New test header.
>         * gcc.dg/tree-ssa/mgo/list/list-1.c: New test.
>         * gcc.dg/tree-ssa/mgo/array/simple-1.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/simple-2.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/simple-3.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/param-1.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/param-2.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/param-3.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/dep-load-1.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/dep-load-2.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/dep-load-3.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/dep-load-4.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/dep-load-5.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/dep-load-6.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/dep-load-7.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/dep-load-8.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/dep-load-9.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/dep-load-10.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/indx-iter-1.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/indx-iter-2.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/indx-iter-3.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/indx-iter-4.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/indx-iter-5.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/indx-iter-6.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/outer-loop-1.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/outer-loop-2.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/outer-loop-3.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/outer-loop-4.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/outer-loop-5.c: Likewise.

[...]
Presumably the optimization can't happen if the underlying data gets
modified during the iteration. I briefly looked through the test cases,
but I didn't see any/(much?) test coverage for loops that write back to
the data, or call functions that might do that. Sorry if I missed some,
but clearly we need to test that we don't incorrectly optimize those
cases.

What happens in a multithreaded program in which another thread might
be writing to the data? What are we allowed to optimize?

Are there some C++ examples? Fortran?

It would be great to be able to inject calloc failures when testing
this, but obviously that's out of scope for such a patch.

Hope this is constructive

Dave