public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug tree-optimization/100173] New: telecom/viterb00data_1 has 16.92% regression compared O2 -ftree-vectorize -fvect-cost-model=very-cheap to O2  on CLX/ICX, 9% regression on znver3
@ 2021-04-21  5:48 crazylht at gmail dot com
  2021-04-21  6:46 ` [Bug tree-optimization/100173] " rguenth at gcc dot gnu.org
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: crazylht at gmail dot com @ 2021-04-21  5:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100173

            Bug ID: 100173
           Summary: telecom/viterb00data_1 has 16.92% regression compared
                    O2 -ftree-vectorize -fvect-cost-model=very-cheap to O2
                     on CLX/ICX, 9% regression on znver3
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: crazylht at gmail dot com
                CC: hjl.tools at gmail dot com
  Target Milestone: ---
              Host: x86_64-pc-linux-gnu
            Target: x86_64-*-* i?86-*-*

Created attachment 50647
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50647&action=edit
ACS.cpp

cat testcase

void
__attribute__ ((noipa))
ACS(e_s16 *pBranchMetric)
{
  n_int i;
  e_s16 esMetricIn, esMetric1, esMetric2;

  StatePathMetricData *pIn1 = BufPtr[BufSelector];
  StatePathMetricData *pIn2 = pIn1 + (1<<5)/2;
  StatePathMetricData *pOut = BufPtr[1 - BufSelector];

  BufSelector ^= 1;

  for (i = 0; i < (1<<5)/2; i++) {

    esMetricIn = *pBranchMetric++;

    esMetric1 = pIn1->m_esPathMetric - esMetricIn;
    esMetric2 = pIn2->m_esPathMetric + esMetricIn;

    if (esMetric1 >= esMetric2) {
      pOut->m_esPathMetric = esMetric1;
      pOut->m_esState = (pIn1->m_esState << 1);
    }
    else {
      pOut->m_esPathMetric = esMetric2;
      pOut->m_esState = (pIn2->m_esState << 1);
    }
    pOut++;

    esMetric1 = pIn1->m_esPathMetric + esMetricIn;
    esMetric2 = pIn2->m_esPathMetric - esMetricIn;

    if (esMetric1 >=esMetric2) {
      pOut->m_esPathMetric =esMetric1;
      pOut->m_esState = (pIn1->m_esState << 1) | 1;
    }
    else {
      pOut->m_esPathMetric =esMetric2;
      pOut->m_esState = (pIn2->m_esState << 1) | 1;
    }
    pOut++;

    pIn1++;
    pIn2++;
  }
}

It is if conditional store replacement plays here, it sinks 2 stores from IF_BB
and ELSE_BB to JOIN_BB since they have same address. But failed to vectorize
them with -fvect-cost-model=very-cheap, and it causes worse IPC for consecutive
stores in JOIN_BB on both ICX and znver3. With -fvect-cost-model=cheap, the
loop can be vectorized and 2.6x faster than O2.

So I think we should either vectorize this loop or not sink conditional stores
when cost model is very-cheap.

and the codes related are here: 

  /* If either vectorization or if-conversion is disabled then do
     not sink any stores.  */
  if (param_max_stores_to_sink == 0
      || (!flag_tree_loop_vectorize && !flag_tree_slp_vectorize)
      || !flag_tree_loop_if_convert)
    return false;

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug tree-optimization/100173] telecom/viterb00data_1 has 16.92% regression compared O2 -ftree-vectorize -fvect-cost-model=very-cheap to O2  on CLX/ICX, 9% regression on znver3
  2021-04-21  5:48 [Bug tree-optimization/100173] New: telecom/viterb00data_1 has 16.92% regression compared O2 -ftree-vectorize -fvect-cost-model=very-cheap to O2 on CLX/ICX, 9% regression on znver3 crazylht at gmail dot com
@ 2021-04-21  6:46 ` rguenth at gcc dot gnu.org
  2021-04-29  3:02 ` crazylht at gmail dot com
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-04-21  6:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100173

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note that store commoning of code sinking will sink the last store anyway:

@@ -546,7 +233,6 @@
   _27 = _26 << 1;
   _28 = (short int) _27;
   _29 = _28 | 1;
-  MEM[(struct StatePathMetricData *)pOut_90 + 4B].m_esState = _29;
   goto <bb 9>; [100.00%]

   <bb 8> [local count: 505302904]:
@@ -556,21 +242,19 @@
   _32 = _31 << 1;
   _33 = (short int) _32;
   _34 = _33 | 1;
-  MEM[(struct StatePathMetricData *)pOut_90 + 4B].m_esState = _34;

   <bb 9> [local count: 1010605809]:
+  # _94 = PHI <_29(7), _34(8)>
+  MEM[(struct StatePathMetricData *)pOut_90 + 4B].m_esState = _94;

but yes, cselim will also sink the first store, moving it across the
scalar compute in the block.  I might note that ideally we'd sink
all the compute as well and end up with just a conditional load of
either pIn1->m_esState or pIn2_89->m_esState.  That might then allow
scheduling to recover the original performance.

You can try that as a source transform, like

    e_s16 tem1, tem2;
    if (esMetric1 >=esMetric2) {
        tem1 = esMetric1;
        tem2 = pIn1->m_esState;
    }
    else {
        tem1 = esMetric2;
        tem2 = pIn2->m_esState;
    }
      pOut->m_esPathMetric =tem1;
      pOut->m_esState = (tem2 << 1) | 1;

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug tree-optimization/100173] telecom/viterb00data_1 has 16.92% regression compared O2 -ftree-vectorize -fvect-cost-model=very-cheap to O2  on CLX/ICX, 9% regression on znver3
  2021-04-21  5:48 [Bug tree-optimization/100173] New: telecom/viterb00data_1 has 16.92% regression compared O2 -ftree-vectorize -fvect-cost-model=very-cheap to O2 on CLX/ICX, 9% regression on znver3 crazylht at gmail dot com
  2021-04-21  6:46 ` [Bug tree-optimization/100173] " rguenth at gcc dot gnu.org
@ 2021-04-29  3:02 ` crazylht at gmail dot com
  2021-04-29  6:57 ` rguenther at suse dot de
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: crazylht at gmail dot com @ 2021-04-29  3:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100173

--- Comment #2 from Hongtao.liu <crazylht at gmail dot com> ---

> but yes, cselim will also sink the first store, moving it across the
> scalar compute in the block.  I might note that ideally we'd sink
> all the compute as well and end up with just a conditional load of
> either pIn1->m_esState or pIn2_89->m_esState.  That might then allow
> scheduling to recover the original performance.
> 

I want to clasify this regression is not related to 2 sinked stores, it just
trigger some micro-architecture bound.

Also w/o -fvect-cost-model=very-cheap, it can be 2-3x faster, the tripper count
is constant, so i wonder why very-cheap cost model doesn't vectorize this loop?

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug tree-optimization/100173] telecom/viterb00data_1 has 16.92% regression compared O2 -ftree-vectorize -fvect-cost-model=very-cheap to O2  on CLX/ICX, 9% regression on znver3
  2021-04-21  5:48 [Bug tree-optimization/100173] New: telecom/viterb00data_1 has 16.92% regression compared O2 -ftree-vectorize -fvect-cost-model=very-cheap to O2 on CLX/ICX, 9% regression on znver3 crazylht at gmail dot com
  2021-04-21  6:46 ` [Bug tree-optimization/100173] " rguenth at gcc dot gnu.org
  2021-04-29  3:02 ` crazylht at gmail dot com
@ 2021-04-29  6:57 ` rguenther at suse dot de
  2021-04-29 11:02 ` crazylht at gmail dot com
  2021-04-29 12:15 ` rguenth at gcc dot gnu.org
  4 siblings, 0 replies; 6+ messages in thread
From: rguenther at suse dot de @ 2021-04-29  6:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100173

--- Comment #3 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 29 Apr 2021, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100173
> 
> --- Comment #2 from Hongtao.liu <crazylht at gmail dot com> ---
> 
> > but yes, cselim will also sink the first store, moving it across the
> > scalar compute in the block.  I might note that ideally we'd sink
> > all the compute as well and end up with just a conditional load of
> > either pIn1->m_esState or pIn2_89->m_esState.  That might then allow
> > scheduling to recover the original performance.
> > 
> 
> I want to clasify this regression is not related to 2 sinked stores, it just
> trigger some micro-architecture bound.
> 
> Also w/o -fvect-cost-model=very-cheap, it can be 2-3x faster, the tripper count
> is constant, so i wonder why very-cheap cost model doesn't vectorize this loop?

Because it needs versioning for alias

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug tree-optimization/100173] telecom/viterb00data_1 has 16.92% regression compared O2 -ftree-vectorize -fvect-cost-model=very-cheap to O2  on CLX/ICX, 9% regression on znver3
  2021-04-21  5:48 [Bug tree-optimization/100173] New: telecom/viterb00data_1 has 16.92% regression compared O2 -ftree-vectorize -fvect-cost-model=very-cheap to O2 on CLX/ICX, 9% regression on znver3 crazylht at gmail dot com
                   ` (2 preceding siblings ...)
  2021-04-29  6:57 ` rguenther at suse dot de
@ 2021-04-29 11:02 ` crazylht at gmail dot com
  2021-04-29 12:15 ` rguenth at gcc dot gnu.org
  4 siblings, 0 replies; 6+ messages in thread
From: crazylht at gmail dot com @ 2021-04-29 11:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100173

--- Comment #4 from Hongtao.liu <crazylht at gmail dot com> ---

> but yes, cselim will also sink the first store, moving it across the

Can we also sink loads? assign pointer to another temp pointer in both if and
else bb, and then load val from this temp pointer. those assignments that in if
and else branch would be finially transformed to conditional mov. 

performance can benifit 100% with below change.

 for (i = 0; i < (1<<5)/2; i++) {

    esMetricIn = *pBranchMetric++;

    esMetric1 = pIn1->m_esPathMetric - esMetricIn;
    esMetric2 = pIn2->m_esPathMetric + esMetricIn;

    e_s16 *t1p = (esMetric1 >= esMetric2) ? &(pIn1->m_esState) :
&(pIn2->m_esState);
    e_s16 t1  = (esMetric1 >= esMetric2) ? esMetric1 : esMetric2;
    pOut->m_esPathMetric = t1;
    pOut->m_esState = *t1p << 1;
    pOut++;

    esMetric1 = pIn1->m_esPathMetric + esMetricIn;
    esMetric2 = pIn2->m_esPathMetric - esMetricIn;

    e_s16 *t2p = (esMetric1 >= esMetric2) ? &(pIn1->m_esState) :
&(pIn2->m_esState);
    e_s16 t2  = (esMetric1 >= esMetric2) ? esMetric1 : esMetric2;
    pOut->m_esPathMetric = t2;
    pOut->m_esState = *t2p << 1;
    pOut++;

    pIn1++;
    pIn2++;
  }

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug tree-optimization/100173] telecom/viterb00data_1 has 16.92% regression compared O2 -ftree-vectorize -fvect-cost-model=very-cheap to O2  on CLX/ICX, 9% regression on znver3
  2021-04-21  5:48 [Bug tree-optimization/100173] New: telecom/viterb00data_1 has 16.92% regression compared O2 -ftree-vectorize -fvect-cost-model=very-cheap to O2 on CLX/ICX, 9% regression on znver3 crazylht at gmail dot com
                   ` (3 preceding siblings ...)
  2021-04-29 11:02 ` crazylht at gmail dot com
@ 2021-04-29 12:15 ` rguenth at gcc dot gnu.org
  4 siblings, 0 replies; 6+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-04-29 12:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100173

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #4)
> > but yes, cselim will also sink the first store, moving it across the
> 
> Can we also sink loads?

loads are usually hoisted, not sunk.

> assign pointer to another temp pointer in both if
> and else bb, and then load val from this temp pointer. those assignments
> that in if and else branch would be finially transformed to conditional mov. 

Hmm, so conditional move is faster.  No, I don't think we'd do this at
the moment and I'm not sure we want in general (since aggressive
if-conversion tends to be bad).

> performance can benifit 100% with below change.
> 
>  for (i = 0; i < (1<<5)/2; i++) {
> 
>     esMetricIn = *pBranchMetric++;
> 
>     esMetric1 = pIn1->m_esPathMetric - esMetricIn;
>     esMetric2 = pIn2->m_esPathMetric + esMetricIn;
> 
>     e_s16 *t1p = (esMetric1 >= esMetric2) ? &(pIn1->m_esState) :
> &(pIn2->m_esState);
>     e_s16 t1  = (esMetric1 >= esMetric2) ? esMetric1 : esMetric2;
>     pOut->m_esPathMetric = t1;
>     pOut->m_esState = *t1p << 1;
>     pOut++;
> 
>     esMetric1 = pIn1->m_esPathMetric + esMetricIn;
>     esMetric2 = pIn2->m_esPathMetric - esMetricIn;
> 
>     e_s16 *t2p = (esMetric1 >= esMetric2) ? &(pIn1->m_esState) :
> &(pIn2->m_esState);
>     e_s16 t2  = (esMetric1 >= esMetric2) ? esMetric1 : esMetric2;
>     pOut->m_esPathMetric = t2;
>     pOut->m_esState = *t2p << 1;
>     pOut++;
> 
>     pIn1++;
>     pIn2++;
>   }

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2021-04-29 12:15 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-21  5:48 [Bug tree-optimization/100173] New: telecom/viterb00data_1 has 16.92% regression compared O2 -ftree-vectorize -fvect-cost-model=very-cheap to O2 on CLX/ICX, 9% regression on znver3 crazylht at gmail dot com
2021-04-21  6:46 ` [Bug tree-optimization/100173] " rguenth at gcc dot gnu.org
2021-04-29  3:02 ` crazylht at gmail dot com
2021-04-29  6:57 ` rguenther at suse dot de
2021-04-29 11:02 ` crazylht at gmail dot com
2021-04-29 12:15 ` rguenth at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).