Subject: Re: [PATCH/RFC] Add a new memory gathering optimization for loop (PR98598)
From: David Malcolm
To: Feng Xue OS, "gcc-patches@gcc.gnu.org"
Date: Thu, 29 Apr 2021 09:46:46 -0400
Message-ID: <49775de69149498230edb6554d176c758841f341.camel@redhat.com>
On Thu, 2021-01-21 at 06:27 +0000, Feng Xue OS via Gcc-patches wrote:
> This patch implements a new loop optimization according to the proposal
> in the RFC given at
> https://gcc.gnu.org/pipermail/gcc/2021-January/234682.html,
> so I do not repeat the idea in this mail. I hope for your comments on it.

With the caveat that I'm not an optimization expert (but no one else
seems to have replied), here are some thoughts.

[...snip...]

> Subject: [PATCH 1/3] mgo: Add a new memory gathering optimization for
> loop [PR98598]

BTW, did you mean to also post patches 2 and 3?

> In nested loops, if scattered memory accesses inside the inner loop
> remain unchanged in the outer loop, we can sequentialize these loads
> by caching their values into a temporary memory region the first time,
> and reuse the cached data in the following iterations. This can
> improve the efficiency of the CPU cache subsystem by reducing its
> unpredictable activity.

I don't think you've cited any performance numbers so far. Does the
optimization show a measurable gain on some benchmark(s)? e.g. is this
ready to run SPEC yet, and how does it do?

> To illustrate what the optimization will do, two pieces of pseudo
> code, before and after the transformation, are given. Suppose all
> loads and "iter_count" are invariant in the outer loop.
>
> From:
>
>   outer-loop ()
>     {
>       inner-loop (iter, iter_count)
>         {
>           Type1 v1 = LOAD (iter);
>           Type2 v2 = LOAD (v1);
>           Type3 v3 = LOAD (v2);
>           ...
>           iter = NEXT (iter);
>         }
>     }
>
> To:
>
>   typedef struct cache_elem
>     {
>       bool  init;
>       Type1 c_v1;
>       Type2 c_v2;
>       Type3 c_v3;

Putting the "bool init;" at the front made me think "what about
packing?", but presumably the idea is that every element is accessed in
order, so it presumably benefits speed to have "init" at the top of the
element, right?
>     } cache_elem;
>
>   cache_elem *cache_arr = calloc (iter_count, sizeof (cache_elem));

What if the allocation fails at runtime? Do we keep an unoptimized copy
of the nested loops around as a fallback and have an unlikely branch to
that copy?

I notice that you're using calloc, presumably to clear all of the
"init" flags (and the whole buffer).

FWIW, this feels like a case where it would be nice to have a
thread-local heap allocation, perhaps something like an obstack
implemented in the standard library - but that's obviously scope creep
for this.

Could it make sense to use alloca for small allocations? (or is that
scope creep?)

>   outer-loop ()
>     {
>       size_t cache_idx = 0;
>
>       inner-loop (iter, iter_count)
>         {
>           if (!cache_arr[cache_idx].init)
>             {
>               v1 = LOAD (iter);
>               v2 = LOAD (v1);
>               v3 = LOAD (v2);
>
>               cache_arr[cache_idx].init = true;
>               cache_arr[cache_idx].c_v1 = v1;
>               cache_arr[cache_idx].c_v2 = v2;
>               cache_arr[cache_idx].c_v3 = v3;
>             }
>           else
>             {
>               v1 = cache_arr[cache_idx].c_v1;
>               v2 = cache_arr[cache_idx].c_v2;
>               v3 = cache_arr[cache_idx].c_v3;
>             }
>           ...
>           cache_idx++;
>           iter = NEXT (iter);
>         }
>     }
>
>   free (cache_arr);

(Note: since cache_arr[cache_idx] is a struct, not a pointer, the
accesses above should use "." rather than "->"; I've written them that
way here.)

I see that the pass puts a "free" on every CFG exit from the loop.
What about "longjmp" and C++ exceptions? Are we guaranteed that "free"
will happen for those ways of exiting the loop? (we should have test
cases for this, I think)

[...]

> 2020-12-25  Feng Xue
>
> gcc/
>         PR tree-optimization/98598
>         * Makefile.in (OBJS): Add tree-ssa-loop-mgo.o.
>         * common.opt (-ftree-loop-mgo): New option.
>         * opts.c (default_options_table): Enable -ftree-loop-mgo at -O3+.
>         * params.opt (mgo-max-dep-load-level): New parameter.
>         (mgo-max-cache-elem-size): Likewise.
>         (mgo-min-cache-array-length): Likewise.
>         (mgo-max-cache-array-length): Likewise.
>         * doc/invoke.texi (mgo-max-dep-load-level): Document new parameter.
>         (mgo-max-cache-elem-size): Likewise.
>         (mgo-min-cache-array-length): Likewise.
>         (mgo-max-cache-array-length): Likewise.
Nit: also need to document "-ftree-loop-mgo".

>         * passes.def: Add new pass_loop_mgo pass.
>         * timevar.def (TV_LOOP_MGO): New timevar.
>         * tree-pass.h (make_pass_loop_mgo): New declaration.
>         * tree-ssa-loop-mgo.c: New file.

Nit: new C++ source files should have a .cc extension, rather than .c

> gcc/testsuite/
>         PR tree-optimization/98598
>         * gcc.dg/tree-ssa/mgo/mgo.exp: New file.
>         * gcc.dg/tree-ssa/mgo/mgo-common.h: New test header.
>         * gcc.dg/tree-ssa/mgo/list/list-1.c: New test.
>         * gcc.dg/tree-ssa/mgo/array/simple-1.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/simple-2.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/simple-3.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/param-1.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/param-2.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/param-3.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/dep-load-1.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/dep-load-2.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/dep-load-3.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/dep-load-4.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/dep-load-5.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/dep-load-6.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/dep-load-7.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/dep-load-8.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/dep-load-9.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/dep-load-10.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/indx-iter-1.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/indx-iter-2.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/indx-iter-3.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/indx-iter-4.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/indx-iter-5.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/indx-iter-6.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/outer-loop-1.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/outer-loop-2.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/outer-loop-3.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/outer-loop-4.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/outer-loop-5.c: Likewise.

[...]
Presumably the optimization can't happen if the underlying data gets
modified during the iteration. I briefly looked through the test cases,
but I didn't see any/(much?) test coverage for loops that write back to
the data, or call functions that might do that. Sorry if I missed some,
but clearly we need to test that we don't incorrectly optimize those
cases.

What happens in a multithreaded program in which another thread might
be writing to the data? What are we allowed to optimize?

Are there some C++ examples? Fortran?

It would be great to be able to inject calloc failures when testing
this, but obviously that's out of scope for such a patch.

Hope this is constructive

Dave