[Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.

public inbox for gcc-bugs@sourceware.org

From: "rguenther at suse dot de" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
Date: Thu, 25 Jan 2024 09:05:44 +0000
Message-ID: <bug-113583-4-GoRIwfy2W9@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-113583-4@http.gcc.gnu.org/bugzilla/>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #6 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 25 Jan 2024, juzhe.zhong at rivai dot ai wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583
> 
> --- Comment #5 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
> Both ICC and Clang X86 can vectorize SPEC 2017 lbm:
> 
> https://godbolt.org/z/MjbTbYf1G
> 
> But I am not sure X86 ICC is better or X86 Clang is better.

Gather/scatter are possibly slow (and gather now has that Intel
security issue, "Downfall").  The reason GCC does not vectorize the
loop is a "cost" one:

t.c:47:21: note:   ==> examining statement: _4 = *_3;
t.c:47:21: missed:   no array mode for V8DF[20]
t.c:47:21: missed:   no array mode for V8DF[20]
t.c:47:21: missed:   the size of the group of accesses is not a power of 2 or not equal to 3
t.c:47:21: missed:   not falling back to elementwise accesses
t.c:58:15: missed:   not vectorized: relevant stmt not supported: _4 = *_3;
t.c:47:21: missed:  bad operation or unsupported loop bound.

where we don't consider using gather because we have a known constant
stride (20).  Since the stores are really scatters we don't attempt
to SLP either.
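
For illustration, a reduced loop with the same shape as the access in
the dump might look like the sketch below.  Only the stride of 20
doubles and the "power of 2 or equal to 3" condition come from the dump;
the names STRIDE, group_size_ok and stream_cells, and the arithmetic in
the loop body, are assumptions and are not taken from 519.lbm or from
GCC's sources.

```c
#include <stddef.h>

/* Assumed layout: each cell holds 20 doubles, so consecutive
   iterations touch addresses 20 * sizeof (double) = 160 bytes apart.  */
#define STRIDE 20

/* Paraphrase of the condition quoted in the dump: the group size must
   be 3 or a power of two.  Not GCC's actual code.  */
static int
group_size_ok (unsigned n)
{
  return n == 3 || (n != 0 && (n & (n - 1)) == 0);
}

/* Hypothetical reduced kernel: two strided loads and one strided
   store per cell, all with the known constant stride.  */
static void
stream_cells (double *dst, const double *src, size_t ncells)
{
  for (size_t i = 0; i < ncells; i++)
    dst[i * STRIDE + 1] = src[i * STRIDE] + src[i * STRIDE + 1];
}
```

With a group size of 20, group_size_ok returns 0, which corresponds to
the "missed" note above.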

Disabling the above heuristic we get this vectorized as well, avoiding
gather/scatter by manually implementing them with scalar loads and
stores and using a quite high VF of 8 (with -mprefer-vector-width=256
you get VF 4 and likely faster code in the end).  But yes, I doubt that
either the ICC or the clang vectorized code is faster anywhere (though
without specifying an uarch you get some generic cost modelling
applied).  Maybe SPR doesn't have the gather bug and does have
reasonable gather and scatter (zen4 scatter sucks).

.L3:
        vmovsd  952(%rax), %xmm0
        vmovsd  -8(%rax), %xmm2
        addq    $1280, %rsi
        addq    $1280, %rax
        vmovhpd -168(%rax), %xmm0, %xmm1
        vmovhpd -1128(%rax), %xmm2, %xmm2
        vmovsd  -648(%rax), %xmm0
        vmovhpd -488(%rax), %xmm0, %xmm0
        vinsertf32x4    $0x1, %xmm1, %ymm0, %ymm0
        vmovsd  -968(%rax), %xmm1
        vmovhpd -808(%rax), %xmm1, %xmm1
        vinsertf32x4    $0x1, %xmm1, %ymm2, %ymm2
        vinsertf64x4    $0x1, %ymm0, %zmm2, %zmm2
        vmovsd  -320(%rax), %xmm0
        vmovhpd -160(%rax), %xmm0, %xmm1
        vmovsd  -640(%rax), %xmm0
        vmovhpd -480(%rax), %xmm0, %xmm0
        vinsertf32x4    $0x1, %xmm1, %ymm0, %ymm1
        vmovsd  -960(%rax), %xmm0
        vmovhpd -800(%rax), %xmm0, %xmm8
        vmovsd  -1280(%rax), %xmm0
        vmovhpd -1120(%rax), %xmm0, %xmm0
        vinsertf32x4    $0x1, %xmm8, %ymm0, %ymm0
        vinsertf64x4    $0x1, %ymm1, %zmm0, %zmm0
        vmovsd  -312(%rax), %xmm1
        vmovhpd -152(%rax), %xmm1, %xmm8
        vmovsd  -632(%rax), %xmm1
        vmovhpd -472(%rax), %xmm1, %xmm1
        vinsertf32x4    $0x1, %xmm8, %ymm1, %ymm8
        vmovsd  -952(%rax), %xmm1
        vmovhpd -792(%rax), %xmm1, %xmm9
        vmovsd  -1272(%rax), %xmm1
        vmovhpd -1112(%rax), %xmm1, %xmm1
        vinsertf32x4    $0x1, %xmm9, %ymm1, %ymm1
        vinsertf64x4    $0x1, %ymm8, %zmm1, %zmm1
        vaddpd  %zmm1, %zmm0, %zmm0
        vaddpd  %zmm7, %zmm2, %zmm1
        vfnmadd132pd    %zmm3, %zmm2, %zmm1
        vfmadd132pd     %zmm6, %zmm5, %zmm0
        valignq $3, %ymm1, %ymm1, %ymm2
        vmovlpd %xmm1, -1280(%rsi)
        vextractf64x2   $1, %ymm1, %xmm8
        vmovhpd %xmm1, -1120(%rsi)
        vextractf64x4   $0x1, %zmm1, %ymm1
        vmovlpd %xmm1, -640(%rsi)
        vmovhpd %xmm1, -480(%rsi)
        vmovsd  %xmm2, -800(%rsi)
        vextractf64x2   $1, %ymm1, %xmm2
        vmovsd  %xmm8, -960(%rsi)
        valignq $3, %ymm1, %ymm1, %ymm1
        vmovsd  %xmm2, -320(%rsi)
        vmovsd  %xmm1, -160(%rsi)
        vmovsd  -320(%rax), %xmm1
        vmovhpd -160(%rax), %xmm1, %xmm2
        vmovsd  -640(%rax), %xmm1
        vmovhpd -480(%rax), %xmm1, %xmm1
        vinsertf32x4    $0x1, %xmm2, %ymm1, %ymm2
        vmovsd  -960(%rax), %xmm1
        vmovhpd -800(%rax), %xmm1, %xmm8
        vmovsd  -1280(%rax), %xmm1
        vmovhpd -1120(%rax), %xmm1, %xmm1
        vinsertf32x4    $0x1, %xmm8, %ymm1, %ymm1
        vinsertf64x4    $0x1, %ymm2, %zmm1, %zmm1
        vfnmadd132pd    %zmm3, %zmm1, %zmm0
        vaddpd  %zmm4, %zmm0, %zmm0
        valignq $3, %ymm0, %ymm0, %ymm1
        vmovlpd %xmm0, 14728(%rsi)
        vextractf64x2   $1, %ymm0, %xmm2
        vmovhpd %xmm0, 14888(%rsi)
        vextractf64x4   $0x1, %zmm0, %ymm0
        vmovlpd %xmm0, 15368(%rsi)
        vmovhpd %xmm0, 15528(%rsi)
        vmovsd  %xmm1, 15208(%rsi)
        vextractf64x2   $1, %ymm0, %xmm1
        vmovsd  %xmm2, 15048(%rsi)
        valignq $3, %ymm0, %ymm0, %ymm0
        vmovsd  %xmm1, 15688(%rsi)
        vmovsd  %xmm0, 15848(%rsi)
        cmpq    %rdx, %rsi
        jne     .L3
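
Read as scalar C, the listing above does roughly the following: the
load offsets are spaced 160 bytes apart (20 doubles), eight lanes are
assembled per vector, and both pointers advance by 1280 bytes per
iteration.  The sketch below is only an approximation of that
structure.  The helper names, the choice of fields and the arithmetic
are assumptions, and it ignores any remainder when the cell count is
not a multiple of VF.

```c
#include <stddef.h>

#define VF 8       /* lanes per vector, as in the listing */
#define STRIDE 20  /* doubles between lanes, i.e. 160 bytes */

/* Emulated gather: one scalar load per lane into a contiguous buffer.  */
static void
gather_strided (double lanes[VF], const double *base)
{
  for (int l = 0; l < VF; l++)
    lanes[l] = base[(size_t) l * STRIDE];
}

/* Emulated scatter: one scalar store per lane.  */
static void
scatter_strided (double *base, const double lanes[VF])
{
  for (int l = 0; l < VF; l++)
    base[(size_t) l * STRIDE] = lanes[l];
}

/* Hypothetical kernel: each vector iteration covers VF cells, so the
   pointers advance VF * STRIDE doubles = 1280 bytes.  */
static void
add_fields (double *dst, const double *src, size_t ncells)
{
  double a[VF], b[VF];
  for (size_t i = 0; i + VF <= ncells; i += VF)
    {
      gather_strided (a, src + i * STRIDE);
      gather_strided (b, src + i * STRIDE + 1);
      for (int l = 0; l < VF; l++)  /* the part that maps to vaddpd etc. */
        a[l] += b[l];
      scatter_strided (dst + i * STRIDE, a);
    }
}
```

Only the inner arithmetic loop runs on full vectors; every lane is still
loaded and stored individually, which is why the listing is dominated by
vmovsd/vmovhpd and insert/extract instructions.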
