From mboxrd@z Thu Jan 1 00:00:00 1970
From: Richard Biener
Date: Mon, 10 May 2021 13:56:43 +0200
Subject: Re: [PATCH/RFC] Add a new memory gathering optimization for loop (PR98598)
To: Feng Xue OS
Cc: "gcc-patches@gcc.gnu.org"
On Fri, May 7, 2021 at 6:29 AM Feng Xue OS wrote:
>
> >> gcc/
> >>         PR tree-optimization/98598
> >>         * Makefile.in (OBJS): Add tree-ssa-loop-mgo.o.
> >>         * common.opt (-ftree-loop-mgo): New option.
> >
> > Just a quick comment - -ftree-loop-mgo is user-facing and it isn't really
> > a good name.  -floop-mgo would be better, but still I'd have no idea what
> > this would do.
> >
> > I don't have a good suggestion here other than to expand it to
> > -floop-gather-memory (?!).
>
> OK.  That is better than "mgo"; the abbreviation is only a term for
> development use.
>
> > The option documentation isn't informative either.
> >
> > From:
> >
> >   outer-loop ()
> >     {
> >       inner-loop (iter, iter_count)
> >         {
> >           Type1 v1 = LOAD (iter);
> >           Type2 v2 = LOAD (v1);
> >           Type3 v3 = LOAD (v2);
> >           ...
> >           iter = NEXT (iter);
> >         }
> >     }
> >
> > To:
> >
> >   typedef struct cache_elem
> >     {
> >       bool  init;
> >       Type1 c_v1;
> >       Type2 c_v2;
> >       Type3 c_v3;
> >     } cache_elem;
> >
> >   cache_elem *cache_arr = calloc (iter_count, sizeof (cache_elem));
> >
> >   outer-loop ()
> >     {
> >       size_t cache_idx = 0;
> >
> >       inner-loop (iter, iter_count)
> >         {
> >           if (!cache_arr[cache_idx].init)
> >             {
> >               v1 = LOAD (iter);
> >               v2 = LOAD (v1);
> >               v3 = LOAD (v2);
> >
> >               cache_arr[cache_idx].init = true;
> >               cache_arr[cache_idx].c_v1 = v1;
> >               cache_arr[cache_idx].c_v2 = v2;
> >               cache_arr[cache_idx].c_v3 = v3;
> >             }
> >           else
> >             {
> >               v1 = cache_arr[cache_idx].c_v1;
> >               v2 = cache_arr[cache_idx].c_v2;
> >               v3 = cache_arr[cache_idx].c_v3;
> >             }
> >           ...
> >           cache_idx++;
> >           iter = NEXT (iter);
> >         }
> >     }
> >
> >   free (cache_arr);
> >
> > This is a _very_ special transform.  What it seems to do is
> > optimize the dependent loads for outer loop iteration n > 1
> > by caching the result(s).
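To make the transform above concrete, here is a hedged, self-contained C sketch of the cached form for a linked-list walk; the names (`node`, `cache_elem`, `sum_with_cache`) and the single cached field are illustrative, not taken from the patch:

```c
/* Illustrative sketch of the MGO caching transform described above:
   the dependent load chain in the inner loop is executed once on the
   first outer iteration and replayed from a flat cache afterwards.  */
#include <stdbool.h>
#include <stdlib.h>

struct node { struct node *next; int *payload; };

struct cache_elem { bool init; int c_v1; };

/* Sum payloads over OUTER passes of an N-element list, caching the
   dependent load v1 = *(iter->payload) after the first pass.  */
static long
sum_with_cache (struct node *head, size_t n, int outer)
{
  struct cache_elem *cache = calloc (n, sizeof *cache);
  if (!cache)
    return 0;
  long sum = 0;
  for (int o = 0; o < outer; o++)
    {
      size_t idx = 0;
      for (struct node *iter = head; iter; iter = iter->next, idx++)
	{
	  int v1;
	  if (!cache[idx].init)
	    {
	      v1 = *iter->payload;	/* dependent load chain */
	      cache[idx].init = true;
	      cache[idx].c_v1 = v1;
	    }
	  else
	    v1 = cache[idx].c_v1;	/* replay from the cache */
	  sum += v1;
	}
    }
  free (cache);
  return sum;
}
```

After the first outer pass the pointer chasing disappears; only the sequential reads of `cache[]` remain.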
> > If that's possible then you should
> > be able to distribute the outer loop to one doing the caching
> > and one using the cache.  Then this transform would be more
> > like a traditional array expansion of scalars?  In some cases
> > loop interchange could also remove the need for the caching.
> >
> > Doing MGO as the very first loop pass thus looks bad; I think
> > MGO should run much later, for example after interchange.
> > I also think that MGO should work in concert with loop
> > distribution (which could have an improved cost model)
> > rather than being a separate pass.
> >
> > Your analysis phase looks quite expensive, building something
> > like an on-the-side representation very closely matching SSA.
> > It seems to work from PHI defs to uses, which looks backwards.
>
> I did not quite catch this point.  Would you please explain it in
> more detail?

I don't remember exactly, but you are building a lot of data structures
that resemble ones readily available when you do find_dep_loads.  You're
looking at each and every stmt, searching for operands matching up
something, and then you're walking SSA uses again.  And the searching
uses linear vector walks (vec::contains).

> > You seem to roll your own dependence analysis code :/  Please
> > have a look at loop distribution.
> >
> > Also you build an actual structure type for reasons that escape
> > me rather than simply accessing the allocated storage at
> > appropriate offsets.
> >
> > I think simply calling 'calloc' isn't OK because you might need
> > aligned storage and because calloc might not be available.
> > Please at least use 'malloc' and make sure MALLOC_ABI_ALIGNMENT
> > is large enough for the data you want to place (or perform
> > dynamic re-alignment yourself).  We probably want some generic
> > middle-end utility to obtain aligned allocated storage at some
> > point.
> >
> > As said above I think you want to re-do this transform as
> > a loop distribution transform.
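For reference, the dynamic re-alignment mentioned above could look something like the following sketch, a generic over-allocate-and-round-up helper; this is an assumed illustration, not code from GCC or the patch:

```c
/* Sketch of manual re-alignment over plain malloc: over-allocate by
   ALIGN - 1 bytes, round the pointer up to the requested power-of-two
   alignment, zero the usable region, and hand the original pointer
   back through RAW_OUT so the caller can free it.  */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static void *
aligned_zalloc (size_t align, size_t size, void **raw_out)
{
  /* ALIGN must be a power of two.  */
  void *raw = malloc (size + align - 1);
  if (!raw)
    return NULL;
  *raw_out = raw;		/* free (*raw_out), never the result */
  uintptr_t aligned
    = ((uintptr_t) raw + align - 1) & ~(uintptr_t) (align - 1);
  memset ((void *) aligned, 0, size);	/* match calloc's zeroing */
  return (void *) aligned;
}
```

The worst-case waste is ALIGN - 1 bytes; when malloc already returns suitably aligned storage (MALLOC_ABI_ALIGNMENT large enough), the rounded pointer equals the raw one.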
> > I think if caching works then
> > the loads should be distributable, and the loop distribution
> > transform should be enhanced to expand the scalars to arrays.
>
> I checked the code of loop distribution, and its trigger strategy seems
> to be very conservative: it currently targets only simple and regular
> index-based loops, and cannot handle linked-list traversal, which
> consists of a series of discrete memory accesses for which MGO would
> matter a lot.  Additionally, for some complicated cases, we cannot
> completely decompose MGO into two separate loops for "do caching" and
> "use caching" respectively.  An example:
>
>   for (i = 0; i < N; i++)
>     {
>       for (j = 0; j < i; j++)
>         {
>           Type1 v1 = LOAD_FN1 (j);
>           Type2 v2 = LOAD_FN2 (v1);
>           Type3 v3 = LOAD_FN3 (v2);
>
>           ...
>
>           condition = ...
>         }
>
>       if (condition)
>         break;
>     }
>
> We should not cache all loads (N in total) in one step, since some of
> them might be invalid after "condition" breaks the loop.  We have to
> mix up "do caching" and "use caching", and let them switch dynamically
> based on the "init" flag.

Maybe, but then classical array expansion is useful and runs into the
same dynamic allocation problem.  I suppose loop distribution could be
taught the dependent-loads case - it's not unlike gathers.

> But loop distribution does have some
> overlap in analysis and transformation with MGO, so we will try to
> see if there is a way to unify them.

Thanks.
Richard.

> Thanks,
> Feng
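For the simple case without the early break, the distributed "do caching" / "use caching" form discussed in this thread could be sketched as follows; the helper name and the single expanded scalar are hypothetical, used only to illustrate the array expansion:

```c
/* Illustrative sketch of the distributed form: loop 1 performs the
   dependent loads once, expanding the scalar v1 into an array, and
   the hot nested loop then reads only the array.  */
#include <stdlib.h>

struct node { struct node *next; int *payload; };

static long
sum_distributed (struct node *head, size_t n, int outer)
{
  int *c_v1 = malloc (n * sizeof *c_v1);
  if (!c_v1)
    return 0;

  /* Loop 1: do caching (array expansion of the scalar v1).  */
  size_t idx = 0;
  for (struct node *iter = head; iter; iter = iter->next, idx++)
    c_v1[idx] = *iter->payload;

  /* Loop 2: use caching; no pointer chasing remains.  */
  long sum = 0;
  for (int o = 0; o < outer; o++)
    for (size_t i = 0; i < n; i++)
      sum += c_v1[i];

  free (c_v1);
  return sum;
}
```

With an early break in the outer loop this split is unsafe, as noted above, because loop 1 would execute loads the original program never reaches; that is what forces the interleaved `init`-flag form.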