public inbox for gcc-bugs@sourceware.org
From: "juzhe.zhong at rivai dot ai" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
Date: Thu, 25 Jan 2024 09:16:17 +0000
Message-ID: <bug-113583-4-E36XBngfTR@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-113583-4@http.gcc.gnu.org/bugzilla/>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #7 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
(In reply to rguenther@suse.de from comment #6)
> On Thu, 25 Jan 2024, juzhe.zhong at rivai dot ai wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583
> >
> > --- Comment #5 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
> > Both ICC and Clang X86 can vectorize SPEC 2017 lbm:
> >
> > https://godbolt.org/z/MjbTbYf1G
> >
> > But I am not sure whether X86 ICC or X86 Clang is better.
>
> gather/scatter are possibly slow (and gather now has that Intel
> security issue).  The reason is a "cost" one:
>
> t.c:47:21: note: ==> examining statement: _4 = *_3;
> t.c:47:21: missed: no array mode for V8DF[20]
> t.c:47:21: missed: no array mode for V8DF[20]
> t.c:47:21: missed: the size of the group of accesses is not a power of 2
>            or not equal to 3
> t.c:47:21: missed: not falling back to elementwise accesses
> t.c:58:15: missed: not vectorized: relevant stmt not supported: _4 = *_3;
> t.c:47:21: missed: bad operation or unsupported loop bound.
>
> where we don't consider using gather because we have a known constant
> stride (20).  Since the stores are really scatters we don't attempt
> to SLP either.
>
> Disabling the above heuristic we get this vectorized as well, avoiding
> gather/scatter by implementing them manually and using a quite high
> VF of 8 (with -mprefer-vector-width=256 you get VF 4 and likely
> faster code in the end).  But yes, I doubt that any of the ICC or clang
> vectorized codes are faster anywhere (but without specifying a
> uarch you get some generic cost modelling applied).
> Maybe SPR doesn't have the gather bug, and it does have reasonable
> gather and scatter (zen4 scatter sucks).
>
> .L3:
>     vmovsd 952(%rax), %xmm0
>     vmovsd -8(%rax), %xmm2
>     addq $1280, %rsi
>     addq $1280, %rax
>     vmovhpd -168(%rax), %xmm0, %xmm1
>     vmovhpd -1128(%rax), %xmm2, %xmm2
>     vmovsd -648(%rax), %xmm0
>     vmovhpd -488(%rax), %xmm0, %xmm0
>     vinsertf32x4 $0x1, %xmm1, %ymm0, %ymm0
>     vmovsd -968(%rax), %xmm1
>     vmovhpd -808(%rax), %xmm1, %xmm1
>     vinsertf32x4 $0x1, %xmm1, %ymm2, %ymm2
>     vinsertf64x4 $0x1, %ymm0, %zmm2, %zmm2
>     vmovsd -320(%rax), %xmm0
>     vmovhpd -160(%rax), %xmm0, %xmm1
>     vmovsd -640(%rax), %xmm0
>     vmovhpd -480(%rax), %xmm0, %xmm0
>     vinsertf32x4 $0x1, %xmm1, %ymm0, %ymm1
>     vmovsd -960(%rax), %xmm0
>     vmovhpd -800(%rax), %xmm0, %xmm8
>     vmovsd -1280(%rax), %xmm0
>     vmovhpd -1120(%rax), %xmm0, %xmm0
>     vinsertf32x4 $0x1, %xmm8, %ymm0, %ymm0
>     vinsertf64x4 $0x1, %ymm1, %zmm0, %zmm0
>     vmovsd -312(%rax), %xmm1
>     vmovhpd -152(%rax), %xmm1, %xmm8
>     vmovsd -632(%rax), %xmm1
>     vmovhpd -472(%rax), %xmm1, %xmm1
>     vinsertf32x4 $0x1, %xmm8, %ymm1, %ymm8
>     vmovsd -952(%rax), %xmm1
>     vmovhpd -792(%rax), %xmm1, %xmm9
>     vmovsd -1272(%rax), %xmm1
>     vmovhpd -1112(%rax), %xmm1, %xmm1
>     vinsertf32x4 $0x1, %xmm9, %ymm1, %ymm1
>     vinsertf64x4 $0x1, %ymm8, %zmm1, %zmm1
>     vaddpd %zmm1, %zmm0, %zmm0
>     vaddpd %zmm7, %zmm2, %zmm1
>     vfnmadd132pd %zmm3, %zmm2, %zmm1
>     vfmadd132pd %zmm6, %zmm5, %zmm0
>     valignq $3, %ymm1, %ymm1, %ymm2
>     vmovlpd %xmm1, -1280(%rsi)
>     vextractf64x2 $1, %ymm1, %xmm8
>     vmovhpd %xmm1, -1120(%rsi)
>     vextractf64x4 $0x1, %zmm1, %ymm1
>     vmovlpd %xmm1, -640(%rsi)
>     vmovhpd %xmm1, -480(%rsi)
>     vmovsd %xmm2, -800(%rsi)
>     vextractf64x2 $1, %ymm1, %xmm2
>     vmovsd %xmm8, -960(%rsi)
>     valignq $3, %ymm1, %ymm1, %ymm1
>     vmovsd %xmm2, -320(%rsi)
>     vmovsd %xmm1, -160(%rsi)
>     vmovsd -320(%rax), %xmm1
>     vmovhpd -160(%rax), %xmm1, %xmm2
>     vmovsd -640(%rax), %xmm1
>     vmovhpd -480(%rax), %xmm1, %xmm1
>     vinsertf32x4 $0x1, %xmm2, %ymm1, %ymm2
>     vmovsd -960(%rax), %xmm1
>     vmovhpd -800(%rax), %xmm1, %xmm8
>     vmovsd -1280(%rax), %xmm1
>     vmovhpd -1120(%rax), %xmm1, %xmm1
>     vinsertf32x4 $0x1, %xmm8, %ymm1, %ymm1
>     vinsertf64x4 $0x1, %ymm2, %zmm1, %zmm1
>     vfnmadd132pd %zmm3, %zmm1, %zmm0
>     vaddpd %zmm4, %zmm0, %zmm0
>     valignq $3, %ymm0, %ymm0, %ymm1
>     vmovlpd %xmm0, 14728(%rsi)
>     vextractf64x2 $1, %ymm0, %xmm2
>     vmovhpd %xmm0, 14888(%rsi)
>     vextractf64x4 $0x1, %zmm0, %ymm0
>     vmovlpd %xmm0, 15368(%rsi)
>     vmovhpd %xmm0, 15528(%rsi)
>     vmovsd %xmm1, 15208(%rsi)
>     vextractf64x2 $1, %ymm0, %xmm1
>     vmovsd %xmm2, 15048(%rsi)
>     valignq $3, %ymm0, %ymm0, %ymm0
>     vmovsd %xmm1, 15688(%rsi)
>     vmovsd %xmm0, 15848(%rsi)
>     cmpq %rdx, %rsi
>     jne .L3

Thanks Richard.

>> Disabling the above heuristic we get this vectorized as well, avoiding
>> gather/scatter by manually implementing them and using a quite high
>> VF of 8 (with -mprefer-vector-width=256 you get VF 4 and likely
>> faster code in the end).

Ok. It seems that enabling vectorization for this case doesn't necessarily
make lbm faster; it depends on the hardware. I think we can test SPEC 2017
lbm on a RISC-V board to see whether vectorization is beneficial. But if we
do see that it is beneficial on some boards, could you teach us how to
enable vectorization for such cases according to the uarch?
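For readers following along, the access pattern the cost model rejects above can be sketched roughly as follows. This is not the actual 519.lbm source; the names and sizes are invented for illustration, but the shape matches the diagnostics: 20 doubles per lattice cell, so walking one field of every cell gives the known constant stride of 20 that the vectorizer reports ("no array mode for V8DF[20]").

```c
#include <assert.h>
#include <stddef.h>

#define N_CELLS  8
#define N_FIELDS 20   /* doubles per cell: the stride the vectorizer sees */

static double grid[N_CELLS * N_FIELDS];

/* Touch field f of every cell: a stride-20 load and a stride-20 store,
   the pattern GCC currently refuses to vectorize elementwise. */
static void scale_field(double *g, size_t n, size_t f, double s)
{
    for (size_t i = 0; i < n; i++)
        g[i * N_FIELDS + f] *= s;
}
```

Compiling a loop of this shape with -O3 -fopt-info-vec-missed reproduces diagnostics of the kind quoted above.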
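What "implementing gather manually" amounts to at the VF of 4 that -mprefer-vector-width=256 selects can be sketched with the GNU C vector extension: four scalar loads assembled into one 256-bit vector, which is what the vmovsd/vmovhpd/vinsertf sequences in the assembly above do in register form. The helper name and layout are hypothetical, not code from GCC or SPEC.

```c
#include <assert.h>

/* A 4 x double vector (one 256-bit ymm register), matching VF 4. */
typedef double v4df __attribute__((vector_size(32)));

/* Manually "gather" four doubles spaced 20 elements apart, i.e. one
   field from four consecutive 20-double cells. */
static v4df gather_stride20(const double *p)
{
    return (v4df){ p[0], p[20], p[40], p[60] };
}
```

Whether this beats a hardware gather instruction is exactly the uarch-dependent cost question discussed in this thread.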
Thread overview: 21+ messages

2024-01-24 14:21 [Bug tree-optimization/113583] New: Main loop in 519.lbm not vectorized. rdapp at gcc dot gnu.org
2024-01-24 14:42 ` [Bug tree-optimization/113583] Main loop in 519.lbm not vectorized. juzhe.zhong at rivai dot ai
2024-01-24 14:44 ` rdapp at gcc dot gnu.org
2024-01-24 15:00 ` juzhe.zhong at rivai dot ai
2024-01-25  3:06 ` juzhe.zhong at rivai dot ai
2024-01-25  3:13 ` juzhe.zhong at rivai dot ai
2024-01-25  5:41 ` pinskia at gcc dot gnu.org
2024-01-25  9:05 ` rguenther at suse dot de
2024-01-25  9:16 ` juzhe.zhong at rivai dot ai [this message]
2024-01-25  9:34 ` rguenth at gcc dot gnu.org
2024-01-26  9:50 ` rdapp at gcc dot gnu.org
2024-01-26 10:21 ` rguenther at suse dot de
2024-02-05  6:59 ` juzhe.zhong at rivai dot ai
2024-02-07  3:39 ` juzhe.zhong at rivai dot ai
2024-02-07  7:48 ` juzhe.zhong at rivai dot ai
2024-02-07  8:04 ` rguenther at suse dot de
2024-02-07  8:08 ` juzhe.zhong at rivai dot ai
2024-02-07  8:13 ` juzhe.zhong at rivai dot ai
2024-02-07 10:24 ` rguenther at suse dot de
2024-05-13 14:17 ` rdapp at gcc dot gnu.org
2024-05-16 12:41 ` rguenth at gcc dot gnu.org