From mboxrd@z Thu Jan 1 00:00:00 1970
From: Richard Biener
Date: Mon, 10 May 2021 13:56:43 +0200
Subject: Re: [PATCH/RFC] Add a new memory gathering optimization for loop (PR98598)
To: Feng Xue OS
Cc: "gcc-patches@gcc.gnu.org"
On Fri, May 7, 2021 at 6:29 AM Feng Xue OS wrote:
>
> >> gcc/
> >>         PR tree-optimization/98598
> >>         * Makefile.in (OBJS): Add tree-ssa-loop-mgo.o.
> >>         * common.opt (-ftree-loop-mgo): New option.
> >
> > Just a quick comment - -ftree-loop-mgo is user-facing and it isn't really
> > a good name.  -floop-mgo would be better, but still I'd have no idea what
> > this would do.
> >
> > I don't have a good suggestion here other than to expand it to
> > -floop-gather-memory (?!).
>
> OK.  That is better than "mgo"; the abbreviation is only a term for
> development use.
>
> > The option documentation isn't informative either.
> >
> > From:
> >
> >   outer-loop ()
> >     {
> >       inner-loop (iter, iter_count)
> >         {
> >           Type1 v1 = LOAD (iter);
> >           Type2 v2 = LOAD (v1);
> >           Type3 v3 = LOAD (v2);
> >           ...
> >           iter = NEXT (iter);
> >         }
> >     }
> >
> > To:
> >
> >   typedef struct cache_elem
> >     {
> >       bool  init;
> >       Type1 c_v1;
> >       Type2 c_v2;
> >       Type3 c_v3;
> >     } cache_elem;
> >
> >   cache_elem *cache_arr = calloc (iter_count, sizeof (cache_elem));
> >
> >   outer-loop ()
> >     {
> >       size_t cache_idx = 0;
> >
> >       inner-loop (iter, iter_count)
> >         {
> >           if (!cache_arr[cache_idx].init)
> >             {
> >               v1 = LOAD (iter);
> >               v2 = LOAD (v1);
> >               v3 = LOAD (v2);
> >
> >               cache_arr[cache_idx].init = true;
> >               cache_arr[cache_idx].c_v1 = v1;
> >               cache_arr[cache_idx].c_v2 = v2;
> >               cache_arr[cache_idx].c_v3 = v3;
> >             }
> >           else
> >             {
> >               v1 = cache_arr[cache_idx].c_v1;
> >               v2 = cache_arr[cache_idx].c_v2;
> >               v3 = cache_arr[cache_idx].c_v3;
> >             }
> >           ...
> >           cache_idx++;
> >           iter = NEXT (iter);
> >         }
> >     }
> >
> >   free (cache_arr);
> >
> > This is a _very_ special transform.  What it seems to do is
> > optimize the dependent loads for outer loop iteration n > 1
> > by caching the result(s).
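To make the transform above concrete, here is a hedged, self-contained C sketch of the cached form for a linked-list walk; the names (`node`, `cache_elem`, `sum_with_cache`) and the single cached field are illustrative, not taken from the patch:

```c
/* Illustrative sketch of the MGO caching transform described above:
   the dependent load chain in the inner loop is executed once on the
   first outer iteration and replayed from a flat cache afterwards.  */
#include <stdbool.h>
#include <stdlib.h>

struct node { struct node *next; int *payload; };

struct cache_elem { bool init; int c_v1; };

/* Sum payloads over OUTER passes of an N-element list, caching the
   dependent load v1 = *(iter->payload) after the first pass.  */
static long
sum_with_cache (struct node *head, size_t n, int outer)
{
  struct cache_elem *cache = calloc (n, sizeof *cache);
  if (!cache)
    return 0;
  long sum = 0;
  for (int o = 0; o < outer; o++)
    {
      size_t idx = 0;
      for (struct node *iter = head; iter; iter = iter->next, idx++)
	{
	  int v1;
	  if (!cache[idx].init)
	    {
	      v1 = *iter->payload;	/* dependent load chain */
	      cache[idx].init = true;
	      cache[idx].c_v1 = v1;
	    }
	  else
	    v1 = cache[idx].c_v1;	/* replay from the cache */
	  sum += v1;
	}
    }
  free (cache);
  return sum;
}
```

After the first outer pass the pointer chasing disappears; only the sequential reads of `cache[]` remain.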
> > If that's possible then you should
> > be able to distribute the outer loop to one doing the caching
> > and one using the cache.  Then this transform would be more
> > like a traditional array expansion of scalars?  In some cases
> > loop interchange could also remove the need for the caching.
> >
> > Doing MGO as the very first loop pass thus looks bad; I think
> > MGO should run much later, for example after interchange.
> > I also think that MGO should work in concert with loop
> > distribution (which could have an improved cost model)
> > rather than being a separate pass.
> >
> > Your analysis phase looks quite expensive, building something
> > like an on-the-side representation very closely matching SSA.
> > It seems to work from PHI defs to uses, which looks backwards.
>
> I did not quite catch this point.  Would you please explain it in
> more detail?

I don't remember exactly, but you are building a lot of data structures
that resemble ones readily available when you do find_dep_loads.  You're
looking at each and every stmt, searching for operands matching up
something, and then you're walking SSA uses again.  And the searching
uses linear vector walks (vec::contains).

> > You seem to roll your own dependence analysis code :/  Please
> > have a look at loop distribution.
> >
> > Also you build an actual structure type for reasons that escape
> > me rather than simply accessing the allocated storage at
> > appropriate offsets.
> >
> > I think simply calling 'calloc' isn't OK because you might need
> > aligned storage and because calloc might not be available.
> > Please at least use 'malloc' and make sure MALLOC_ABI_ALIGNMENT
> > is large enough for the data you want to place (or perform
> > dynamic re-alignment yourself).  We probably want some generic
> > middle-end utility to obtain aligned allocated storage at some
> > point.
> >
> > As said above I think you want to re-do this transform as
> > a loop distribution transform.
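For reference, the dynamic re-alignment mentioned above could look something like the following sketch, a generic over-allocate-and-round-up helper; this is an assumed illustration, not code from GCC or the patch:

```c
/* Sketch of manual re-alignment over plain malloc: over-allocate by
   ALIGN - 1 bytes, round the pointer up to the requested power-of-two
   alignment, zero the usable region, and hand the original pointer
   back through RAW_OUT so the caller can free it.  */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static void *
aligned_zalloc (size_t align, size_t size, void **raw_out)
{
  /* ALIGN must be a power of two.  */
  void *raw = malloc (size + align - 1);
  if (!raw)
    return NULL;
  *raw_out = raw;		/* free (*raw_out), never the result */
  uintptr_t aligned
    = ((uintptr_t) raw + align - 1) & ~(uintptr_t) (align - 1);
  memset ((void *) aligned, 0, size);	/* match calloc's zeroing */
  return (void *) aligned;
}
```

The worst-case waste is ALIGN - 1 bytes; when malloc already returns suitably aligned storage (MALLOC_ABI_ALIGNMENT large enough), the rounded pointer equals the raw one.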
> > I think if caching works then
> > the loads should be distributable, and the loop distribution
> > transform should be enhanced to expand the scalars to arrays.
>
> I checked the code of loop distribution, and its trigger strategy seems
> to be very conservative: it currently targets only simple and regular
> index-based loops, and cannot handle linked-list traversal, which
> consists of a series of discrete memory accesses for which MGO would
> matter a lot.  Additionally, for some complicated cases, we cannot
> completely decompose MGO into two separate loops for "do caching" and
> "use caching" respectively.  An example:
>
>   for (i = 0; i < N; i++)
>     {
>       for (j = 0; j < i; j++)
>         {
>           Type1 v1 = LOAD_FN1 (j);
>           Type2 v2 = LOAD_FN2 (v1);
>           Type3 v3 = LOAD_FN3 (v2);
>
>           ...
>
>           condition = ...
>         }
>
>       if (condition)
>         break;
>     }
>
> We should not cache all loads (N in total) in one step, since some of
> them might be invalid after "condition" breaks the loop.  We have to
> mix up "do caching" and "use caching", and let them switch dynamically
> based on the "init" flag.

Maybe, but then classical array expansion is useful and runs into the
same dynamic allocation problem.  I suppose loop distribution could be
taught the dependent-loads case - it's not unlike gathers.

> But loop distribution does have some
> overlap in analysis and transformation with MGO, so we will try to
> see if there is a way to unify them.

Thanks.
Richard.

> Thanks,
> Feng
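For the simple case without the early break, the distributed "do caching" / "use caching" form discussed in this thread could be sketched as follows; the helper name and the single expanded scalar are hypothetical, used only to illustrate the array expansion:

```c
/* Illustrative sketch of the distributed form: loop 1 performs the
   dependent loads once, expanding the scalar v1 into an array, and
   the hot nested loop then reads only the array.  */
#include <stdlib.h>

struct node { struct node *next; int *payload; };

static long
sum_distributed (struct node *head, size_t n, int outer)
{
  int *c_v1 = malloc (n * sizeof *c_v1);
  if (!c_v1)
    return 0;

  /* Loop 1: do caching (array expansion of the scalar v1).  */
  size_t idx = 0;
  for (struct node *iter = head; iter; iter = iter->next, idx++)
    c_v1[idx] = *iter->payload;

  /* Loop 2: use caching; no pointer chasing remains.  */
  long sum = 0;
  for (int o = 0; o < outer; o++)
    for (size_t i = 0; i < n; i++)
      sum += c_v1[i];

  free (c_v1);
  return sum;
}
```

With an early break in the outer loop this split is unsafe, as noted above, because loop 1 would execute loads the original program never reaches; that is what forces the interleaved `init`-flag form.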