[Bug middle-end/99638] New: s132 benchmarks of TSVC on zen3 benefits from -mno-fma

public inbox for gcc-bugs@sourceware.org

* [Bug middle-end/99638] New: s132 benchmarks of TSVC on zen3 benefits from -mno-fma
@ 2021-03-17 21:24 hubicka at gcc dot gnu.org
  2021-03-17 21:28 ` [Bug middle-end/99638] s132 and s281 " hubicka at gcc dot gnu.org
  2021-03-18  9:17 ` rguenth at gcc dot gnu.org
  0 siblings, 2 replies; 3+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-03-17 21:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99638

            Bug ID: 99638
           Summary: s132 benchmarks of TSVC on zen3 benefits from -mno-fma
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

typedef float real_t;

#define iterations 1000000
#define LEN_1D 32000
#define LEN_2D 256
// array definitions
real_t flat_2d_array[LEN_2D*LEN_2D];

real_t x[LEN_1D];

real_t a[LEN_1D],b[LEN_1D],c[LEN_1D],d[LEN_1D],e[LEN_1D],
aa[LEN_2D][LEN_2D],bb[LEN_2D][LEN_2D],cc[LEN_2D][LEN_2D],tt[LEN_2D][LEN_2D];

int indx[LEN_1D];

real_t* __restrict__ xx;
real_t* yy;


// %2.5

void dummy(void); // not defined in this testcase; declared to avoid an implicit declaration

int main(void)
{
//    global data flow analysis
//    loop with multiple dimension ambiguous subscripts

    int m = 0;
    int j = m;
    int k = m+1;
    for (int nl = 0; nl < 400*iterations; nl++) {
        for (int i= 1; i < LEN_2D; i++) {
            aa[j][i] = aa[k][i-1] + b[i] * c[1];
        }
        dummy();
    }
}

Compiled with -Ofast -march=native, this runs in 4.4s, compared to 4.2s with
-Ofast -march=native -mno-fma.
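
A standalone variant of the kernel above can help reproduce the comparison.
The sketch below is not part of the report: the noinline dummy() stub, the
init() values, the reps parameter, and check() are assumptions made for
illustration (the posted testcase calls dummy() without defining it). Inputs
are small integers, so each product and sum is exact in float and the result
is the same whether or not the compiler fuses the multiply and the add.

```c
#include <assert.h>

#define LEN_1D 32000
#define LEN_2D 256

typedef float real_t;

real_t b[LEN_1D], c[LEN_1D];
real_t aa[LEN_2D][LEN_2D];

/* Hypothetical stand-in for the undefined dummy(): an opaque, non-inlined
   call with a compiler memory barrier (GCC/Clang syntax), so the repeated
   inner loop cannot be hoisted or deleted. */
__attribute__((noinline)) void dummy(void)
{
    __asm__ volatile("" ::: "memory");
}

/* Assumed initial values: small integers, exactly representable in float. */
void init(void)
{
    for (int i = 0; i < LEN_2D; i++) {
        aa[0][i] = 0;
        aa[1][i] = (real_t)(i % 5);
        b[i] = (real_t)(i % 3);
    }
    c[1] = 2;
}

/* The s132 kernel from the report, with the repeat count as a parameter. */
void s132(int reps)
{
    assert(reps >= 0);
    int m = 0;
    int j = m;
    int k = m + 1;
    for (int nl = 0; nl < reps; nl++) {
        for (int i = 1; i < LEN_2D; i++) {
            aa[j][i] = aa[k][i-1] + b[i] * c[1];
        }
        dummy();
    }
}

/* Count elements of row 0 that differ from the expected integer result. */
int check(void)
{
    int bad = 0;
    for (int i = 1; i < LEN_2D; i++) {
        real_t want = (real_t)((i - 1) % 5 + (i % 3) * 2);
        if (aa[0][i] != want)
            bad++;
    }
    return bad;
}
```

Calling init(), then s132() with a large repeat count, from a main() built
once with -Ofast -march=native and once with -mno-fma added, should mirror the
measurement described above. The timings quoted in the report come from the
original testcase, not from this sketch.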


* [Bug middle-end/99638] s132 and s281 benchmarks of TSVC on zen3 benefits from -mno-fma
From: hubicka at gcc dot gnu.org @ 2021-03-17 21:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99638

Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jamborm at gcc dot gnu.org
            Summary|s132 benchmarks of TSVC on  |s132 and s281 benchmarks of
                   |zen3 benefits from -mno-fma |TSVC on zen3 benefits from
                   |                            |-mno-fma

--- Comment #1 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
s281 benchmark:

typedef float real_t;

#define iterations 1000000
#define LEN_1D 32000
#define LEN_2D 256
// array definitions
real_t flat_2d_array[LEN_2D*LEN_2D];

real_t x[LEN_1D];

real_t a[LEN_1D],b[LEN_1D],c[LEN_1D],d[LEN_1D],e[LEN_1D],
aa[LEN_2D][LEN_2D],bb[LEN_2D][LEN_2D],cc[LEN_2D][LEN_2D],tt[LEN_2D][LEN_2D];

int indx[LEN_1D];

real_t* __restrict__ xx;
real_t* yy;

// %2.5

void dummy(void); // not defined in this testcase; declared to avoid an implicit declaration

int main(void)
{
//    crossing thresholds
//    index set splitting
//    reverse data access

    real_t x;
    for (int nl = 0; nl < iterations; nl++) {
        for (int i = 0; i < LEN_1D; i++) {
            x = a[LEN_1D-i-1] + b[i] * c[i];
            a[i] = x-(real_t)1.0;
            b[i] = x;
        }
        dummy();
    }
}


With FMA this runs in 18s; without FMA, in 14s.


* [Bug middle-end/99638] s132 and s281 benchmarks of TSVC on zen3 benefits from -mno-fma
From: rguenth at gcc dot gnu.org @ 2021-03-18  9:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99638

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
.L4:
        vmovups b(%rax), %ymm0
        addq    $32, %rax
        vfmadd213ps     aa+988(%rax), %ymm1, %ymm0
        vmovups %ymm0, aa-32(%rax)
        cmpq    $996, %rax
        jne     .L4

vs. with -mno-fma:

.L4:
        vmulps  b(%rax), %ymm2, %ymm0
        addq    $32, %rax
        vaddps  aa+988(%rax), %ymm0, %ymm0
        vmovups %ymm0, aa-32(%rax)
        cmpq    $996, %rax
        jne     .L4

I'm not sure we can explain the difference, can we?

On Zen2, -mfma doesn't make a difference, btw. (and Zen3 should, if anything,
have FMA with one cycle less latency...)

The 2nd testcase has one more load uop in the loop.  Both Zen2 and Zen3
should be able to issue two load uops per cycle.  The 2nd testcase is not
vectorized; on Zen2 the difference between -mno-fma and -mfma is in the noise
(-mfma looks slightly faster).
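
The second testcase's own comments mention "index set splitting" and "reverse
data access", which points at a dependence that may be what keeps the loop
from being vectorized: a[LEN_1D-i-1] is read while a[i] is written, so the
first half of the iterations reads elements that are overwritten later, and
the second half reads elements the first half already wrote. The sketch below
illustrates that under this assumption; it is not something stated in the
thread, and s281_iter, s281_ref, s281_split_reversed, and s281_orders_match
are hypothetical names. It splits the iteration space at the midpoint and runs
each half backwards. If the results still match the original order, the
iterations within each half do not depend on one another.

```c
#include <assert.h>

#define LEN_1D 32000

typedef float real_t;

/* One iteration of the s281 loop body from the report. */
static void s281_iter(real_t *a, real_t *b, const real_t *c, int i)
{
    assert(i >= 0 && i < LEN_1D);
    real_t x = a[LEN_1D - i - 1] + b[i] * c[i];
    a[i] = x - (real_t)1.0;
    b[i] = x;
}

/* Reference: the loop in its original order. */
void s281_ref(real_t *a, real_t *b, const real_t *c)
{
    for (int i = 0; i < LEN_1D; i++)
        s281_iter(a, b, c, i);
}

/* Split at the midpoint and walk each half in reverse order. */
void s281_split_reversed(real_t *a, real_t *b, const real_t *c)
{
    const int mid = LEN_1D / 2;
    /* First half: reads a[mid..LEN_1D-1], writes a[0..mid-1]. */
    for (int i = mid - 1; i >= 0; i--)
        s281_iter(a, b, c, i);
    /* Second half: reads a[0..mid-1] (now final), writes a[mid..LEN_1D-1]. */
    for (int i = LEN_1D - 1; i >= mid; i--)
        s281_iter(a, b, c, i);
}

/* Returns 1 if both orders produce identical a[] and b[], else 0.
   Inputs are small integers so every operation is exact in float,
   with or without FMA contraction. */
int s281_orders_match(void)
{
    static real_t a1[LEN_1D], b1[LEN_1D], a2[LEN_1D], b2[LEN_1D], c[LEN_1D];
    for (int i = 0; i < LEN_1D; i++) {
        a1[i] = a2[i] = (real_t)(i % 7);
        b1[i] = b2[i] = (real_t)(i % 5);
        c[i] = (real_t)(i % 3);
    }
    s281_ref(a1, b1, c);
    s281_split_reversed(a2, b2, c);
    for (int i = 0; i < LEN_1D; i++) {
        if (a1[i] != a2[i] || b1[i] != b2[i])
            return 0;
    }
    return 1;
}
```

Whether GCC performs such a split, and whether vectorizing the halves would
change the FMA vs. no-FMA picture on Zen3, is not addressed in this thread.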
