From: Richard Biener
Date: Mon, 14 Sep 2020 11:47:56 +0200
Subject: Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
To: luoxhu
Cc: Segher Boessenkool, GCC Patches, David Edelsohn, Bill Schmidt, linkw@gcc.gnu.org
In-Reply-To: <572b36a7-2d52-2f46-05bd-140e16794a61@linux.ibm.com>
On Mon, Sep 14, 2020 at 10:05 AM luoxhu wrote:
>
> On 2020/9/10 18:08, Richard Biener wrote:
> > On Wed, Sep 9, 2020 at 6:03 PM Segher Boessenkool wrote:
> >>
> >> On Wed, Sep 09, 2020 at 04:28:19PM +0200, Richard Biener wrote:
> >>> On Wed, Sep 9, 2020 at 3:49 PM Segher Boessenkool wrote:
> >>>>
> >>>> Hi!
> >>>>
> >>>> On Tue, Sep 08, 2020 at 10:26:51AM +0200, Richard Biener wrote:
> >>>>> Hmm, yeah - I guess that's what should be addressed first then.
> >>>>> I'm quite sure that in case 'v' is not on the stack but in memory, like
> >>>>> in my case, an SImode store is better than what we get from
> >>>>> vec_insert - in fact vec_insert will likely introduce an RMW cycle
> >>>>> which is prone to inserting store-data-races?
> >>>>
> >>>> The other way around -- if it is in memory, and was stored as a vector
> >>>> recently, then reading back something shorter from it is prone to
> >>>> SHL/LHS (store-hit-load / load-hit-store) problems.  There is nothing
> >>>> special about the stack here, except of course it is more likely to
> >>>> have been stored recently if on the stack.  So which option is best
> >>>> depends on how often it has been stored recently.  On newer CPUs,
> >>>> although they can avoid SHL/LHS flushes more often, the penalty is
> >>>> relatively bigger, so memory does not often win.
> >>>>
> >>>> I.e.: it needs to be measured.  Intuition is often wrong here.
> >>>
> >>> But the current method would simply do a direct store to memory
> >>> without a preceding read of the whole vector.
> >>
> >> The problem is even worse the other way: you do a short store here,
> >> but then a full vector read later.  If the store and read are far
> >> apart, that is fine, but if they are close (that is, on the order of
> >> fifty or fewer insns), there can be problems.
> >
> > Sure, but you can't simply load/store a whole vector when the code
> > didn't, unless you know it will not introduce data races and it will
> > not trap (thus the whole vector needs to be at least naturally aligned).
> >
> > Also, if there's a preceding short store you would now load the whole
> > vector to avoid the short store ... catch-22.
> >
> >> There often are problems over function calls (where the compiler
> >> cannot usually *see* how something is used).
> >
> > Yep.  The best way would be to use small loads and larger stores,
> > which is what CPUs usually tend to handle fine (with alignment
> > constraints, etc.).  Of course that's not what either of the
> > "solutions" can do.
> >
> > That said, since you seem to be "first" in having an instruction
> > to insert into a vector at a variable position, the idea that we'd
> > have to spill anyway for this to be expanded - and thus that we expand
> > the vector to a stack location in the first place - falls down.  And
> > that's where I'd first try to improve things.
> >
> > So what can the CPU actually do?
> >
>
> I'm not sure whether this reflects the issues you discussed above.
> I constructed the test cases below and tested with and without this
> patch; only for "a+c" (which means store only) does the performance get
> worse with this patch, due to the extra load/store (about 50% slower).
> For the "v = vec_insert (i, v, n);" usage, v is always loaded after the
> store, so this patch always gives a big benefit.
> But for the "v[n % 4] = i;" usage, it depends on whether "v" is used
> again soon, inside the function or shortly after it returns.
> Does this mean that unifying the two usages into the same gimple code
> is sometimes not a good idea?  Or is it possible to detect whether the
> destination "v" of the generated IFN ".VEC_INSERT (&v, i_4(D), _1);"
> is used nearby, inside or outside of the function?
>
>
> #define TYPE int
>
> vector TYPE v = {1, 2, 3, 4};   // a. global vector.
> vector TYPE s = {4, 3, 2, 1};
>
> __attribute__ ((noinline))
> vector TYPE
> test (vector TYPE u, TYPE i, size_t n)
> {
>   v[n % 4] = i;

^^^ this should be u[n % 4] = i; I guess.  Is the % 4 mandated by the
vec_insert semantics, btw?

If you tested with the above error you probably need to re-measure.

On gimple the above function (after fixing it) looks like

  VIEW_CONVERT_EXPR<int[4]>(u)[_1] = i_4(D);

and the IFN idea I had would - for non-global memory 'u' only -
transform this to

  vector_register_2 = u;
  vector_register_3 = .IFN_VEC_SET (vector_register_2, _1, i_4(D));
  u = vector_register_3;

if vec_set can handle variable indexes.  This then becomes a vec_set on
a register and, if that was the only variable indexing of 'u', will also
cause 'u' to be expanded as a register rather than as stack memory.
Note we can't use the direct-optab method here since the vec_set optab
modifies operand 0, which isn't possible in SSA form.

That might hint that we eventually want to extend BIT_INSERT_EXPR to
handle a non-constant bit position, but for experiments using an
alternate internal function is certainly easier.

Richard.

>   return u;
> }
>
> int main ()
> {
>   //vector TYPE v = {1, 2, 3, 4};   // b. local vector.
>   //vector TYPE s = {4, 3, 2, 1};
>   vector TYPE r = {0};
>   for (long k = 0; k < 3989478945; k++)
>     {
>       r += test (s, 254.0f, k);      // c. access different vector; store only.
>       //r += test (v, 254.0f, k);    // d. access same vector; load after store in callee function.
>       //test (s, 254.0f, k); r += v; // e. load after store outside of function.
>     }
>   return r[0];
> }
>
>
> Thanks,
> Xionghu
>