From: "Kewen.Lin" <linkw@linux.ibm.com>
To: Richard Biener <richard.guenther@gmail.com>
Cc: GCC Patches <gcc-patches@gcc.gnu.org>,
	Bill Schmidt <wschmidt@linux.ibm.com>,
	David Edelsohn <dje.gcc@gmail.com>,
	Segher Boessenkool <segher@kernel.crashing.org>
Subject: Re: [PATCH 1/2] vect: Add costing_for_scalar parameter to init_cost hook
Date: Thu, 13 May 2021 15:04:08 +0800	[thread overview]
Message-ID: <989572f5-c6c3-c32d-3cb1-5b341348f687@linux.ibm.com> (raw)
In-Reply-To: <CAFiYyc1tLkACa9_UmG4GQO5_=b9Y-gypAYhZSNFejyp0FukHWw@mail.gmail.com>

Hi!

>>> But in the end the vector code shouldn't end up worse than the
>>> scalar code with respect to IVs - the cases where it would should
>>> be already costed.  So I wonder if you have specific examples
>>> where things go worse enough for the heuristic to trigger?
>>>
>>
>> One typical case that I worked on to reuse this density check is the
>> function mat_times_vec in the source file block_solver.fppized.f of
>> SPEC2017 503.bwaves_r; with the existing heuristic its density is 83
>> (unluckily it doesn't exceed the threshold).  The interesting loop is
>> the innermost one, with the option set "-O2 -mcpu=power8 -ffast-math
>> -ftree-vectorize".  We have verified that this loop isn't profitable
>> to vectorize at O2 (without loop-interchange).
> 
> Yeah, but that's because the loop only runs 5 iterations, not because
> of some "density" (which suggests AGU overloading or some such)?
> Because if you modify it so it iterates more, while keeping the
> "density" measurement constant, do you suddenly become profitable?
> 

Yes, I agree this isn't a perfect example of how the density check
matters, though it is what led me to find this check.  I ran the
SPEC2017 benchmarks with and without this density heuristic to try to
catch some "expected" case, but unluckily failed to find one.  It may
be worth trying some more option sets, or even testing with the
previous SPEC suites, later.

I hacked the innermost loop's iteration count from 5 to 20, but the
baseline run didn't stop (I killed it after more than 7 hrs); I
suspect it became endless because of some garbage (out-of-bounds) data.

But the current cost modeling for this loop on Power is still bad: the
minimum profitable iteration count (both static and runtime) is
evaluated as 2, while in reality even 5 iterations isn't profitable.


> The loop does have quite many memory streams so optimizing
> the (few) arithmetic ops by vectorizing them might not be worth
> the trouble, esp. since most of the loads are "strided" (composed
> from scalars) when no interchange is performed.  So it's probably
> more a "density" of # memory streams vs. # arithmetic ops, and
> esp. with any non-consecutive vector loads this balance being
> worse in the vector case?
> 

Yeah, these many scalar "strided" loads make things worse.  The vector
CTORs they feed have to wait until all of their required loads are
ready, and those vector CTORs are in turn required by further
multiplications.

I posted one patch[1] on this, which tries to model it with
some counts: nload (total number of loads), nload_ctor (number of
strided loads fed into CTORs) and nctor_strided (number of CTORs
fed by strided loads).

The penalization is restricted by considering several factors (a rough
sketch of how they could be combined follows this list):
  1) vector density ratio: if there are many vector instructions,
     stalls from loads can easily impact the subsequent
     computation.
  2) total load number: if nload is small, it's unlikely to
     bother the load/store units much.
  3) percentage of strided loads fed into CTORs: if a high
     portion of the strided loads feed CTORs, they are very
     likely to block the CTORs and their subsequent chains.
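
Just to make the intent concrete, here is a minimal sketch (not the
actual patch[1]) of how these counts could be combined into a
whole-loop penalty; the threshold values and the penalty formula are
made-up placeholders, and density_pct stands for the existing density
ratio:

  /* Hypothetical whole-loop adjustment; the thresholds and the
     penalty below are placeholders, not the values proposed in the
     patch[1].  */
  static void
  rs6000_penalize_strided_ctors (rs6000_cost_data *data,
                                 unsigned density_pct)
  {
    unsigned strided_pct
      = data->nload ? data->nload_ctor * 100 / data->nload : 0;

    /* 1) enough vector insns that load stalls hit the computation,
       2) enough loads to bother the load/store units,
       3) most of the loads are strided ones feeding CTORs.  */
    if (density_pct > 45 && data->nload > 16 && strided_pct > 50)
      data->cost[vect_body] += data->nctor_strided * 2;
  }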

Btw, following your previous comments on add_stmt_cost, the
load/strided/ctor statistics should be gathered there instead, like:

  if (!data->costing_for_scalar && data->loop_info && where == vect_body)
    {
      /* Count every load in the loop body.  */
      if (kind == scalar_load || kind == vector_load || kind == unaligned_load
          || kind == vector_gather_load)
        data->nload += count;
      /* For strided accesses, also count the scalar loads that feed
         vector CTORs and the CTORs themselves.  */
      if (stmt_info && STMT_VINFO_STRIDED_P (stmt_info))
        {
          if (kind == scalar_load || kind == unaligned_load)
            data->nload_ctor += count;
          else if (kind == vec_construct)
            data->nctor_strided += count;
        }
    }

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/569791.html

> The x86 add_stmt_cost has
> 
>   /* If we do elementwise loads into a vector then we are bound by
>      latency and execution resources for the many scalar loads
>      (AGU and load ports).  Try to account for this by scaling the
>      construction cost by the number of elements involved.  */
>   if ((kind == vec_construct || kind == vec_to_scalar)
>       && stmt_info
>       && (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
>           || STMT_VINFO_TYPE (stmt_info) == store_vec_info_type)
>       && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
>       && TREE_CODE (DR_STEP (STMT_VINFO_DATA_REF (stmt_info))) != INTEGER_CST)
>     {
>       stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
>       stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
>     }
> 
> so it penalizes VMAT_ELEMENTWISE for variable step for both loads and stores.
> The above materialized over PRs 84037, 85491 and 87561, so not specifically
> for the bwaves case.  IIRC on x86 bwaves at -O2 is slower with vectorization
> as well.
> 

Thanks for the pointer!  rs6000 can probably follow this way instead.
IIUC, this cost adjustment applies to each individual
vec_construct/vec_to_scalar; would it be better to first collect the
data in add_stmt_cost and then consider it from the view of the whole
loop?  The per-statement way seems easy to overkill: if there are just
a few VMAT_ELEMENTWISE loads, the resource hazard isn't so bad.  And
since, as you mentioned above, bwaves_r mainly suffers from strided
loads, shouldn't this tuning be applied for VMAT_STRIDED_SLP as well?
Is the reason it only considers variable steps that constant-step
loads can be grouped on x86 (like paired load/store fusion)?
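
Purely to make the VMAT_STRIDED_SLP half of that question concrete,
the widened x86 check I have in mind would look roughly like the
below (an untested sketch, not a proposal):

  /* Untested sketch: the same hunk as quoted above, but also
     covering strided SLP accesses.  Whether that is desirable is
     exactly the question.  */
  if ((kind == vec_construct || kind == vec_to_scalar)
      && stmt_info
      && (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
          || STMT_VINFO_TYPE (stmt_info) == store_vec_info_type)
      && (STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
          || STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_STRIDED_SLP)
      && TREE_CODE (DR_STEP (STMT_VINFO_DATA_REF (stmt_info))) != INTEGER_CST)
    {
      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
      stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
    }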

BR,
Kewen

> Oh, and yes - the info the vectorizer presents the target with for
> vector loads/stores leaves a lot to be desired ...
> 
> Richard.
> 
