public inbox for gcc-bugs@sourceware.org
From: "juzhe.zhong at rivai dot ai" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
Date: Thu, 25 Jan 2024 09:16:17 +0000
Message-ID: <bug-113583-4-E36XBngfTR@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-113583-4@http.gcc.gnu.org/bugzilla/>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #7 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
(In reply to rguenther@suse.de from comment #6)
> On Thu, 25 Jan 2024, juzhe.zhong at rivai dot ai wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583
> >
> > --- Comment #5 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
> > Both ICC and Clang X86 can vectorize SPEC 2017 lbm:
> >
> > https://godbolt.org/z/MjbTbYf1G
> >
> > But I am not sure whether X86 ICC or X86 Clang is better.
>
> gather/scatter are possibly slow (and gather now has that Intel
> security issue).  The reason is a "cost" one:
>
> t.c:47:21: note: ==> examining statement: _4 = *_3;
> t.c:47:21: missed: no array mode for V8DF[20]
> t.c:47:21: missed: no array mode for V8DF[20]
> t.c:47:21: missed: the size of the group of accesses is not a power of 2
>            or not equal to 3
> t.c:47:21: missed: not falling back to elementwise accesses
> t.c:58:15: missed: not vectorized: relevant stmt not supported: _4 = *_3;
> t.c:47:21: missed: bad operation or unsupported loop bound.
>
> where we don't consider using gather because we have a known constant
> stride (20).  Since the stores are really scatters we don't attempt
> to SLP either.
>
> Disabling the above heuristic we get this vectorized as well, avoiding
> gather/scatter by implementing them manually and using a quite high
> VF of 8 (with -mprefer-vector-width=256 you get VF 4 and likely
> faster code in the end).  But yes, I doubt that any of the ICC or clang
> vectorized codes are faster anywhere (but without specifying a
> uarch you get some generic cost modelling applied).
> Maybe SPR doesn't have the gather bug, and it does have reasonable
> gather and scatter (zen4 scatter sucks).
>
> .L3:
>     vmovsd 952(%rax), %xmm0
>     vmovsd -8(%rax), %xmm2
>     addq $1280, %rsi
>     addq $1280, %rax
>     vmovhpd -168(%rax), %xmm0, %xmm1
>     vmovhpd -1128(%rax), %xmm2, %xmm2
>     vmovsd -648(%rax), %xmm0
>     vmovhpd -488(%rax), %xmm0, %xmm0
>     vinsertf32x4 $0x1, %xmm1, %ymm0, %ymm0
>     vmovsd -968(%rax), %xmm1
>     vmovhpd -808(%rax), %xmm1, %xmm1
>     vinsertf32x4 $0x1, %xmm1, %ymm2, %ymm2
>     vinsertf64x4 $0x1, %ymm0, %zmm2, %zmm2
>     vmovsd -320(%rax), %xmm0
>     vmovhpd -160(%rax), %xmm0, %xmm1
>     vmovsd -640(%rax), %xmm0
>     vmovhpd -480(%rax), %xmm0, %xmm0
>     vinsertf32x4 $0x1, %xmm1, %ymm0, %ymm1
>     vmovsd -960(%rax), %xmm0
>     vmovhpd -800(%rax), %xmm0, %xmm8
>     vmovsd -1280(%rax), %xmm0
>     vmovhpd -1120(%rax), %xmm0, %xmm0
>     vinsertf32x4 $0x1, %xmm8, %ymm0, %ymm0
>     vinsertf64x4 $0x1, %ymm1, %zmm0, %zmm0
>     vmovsd -312(%rax), %xmm1
>     vmovhpd -152(%rax), %xmm1, %xmm8
>     vmovsd -632(%rax), %xmm1
>     vmovhpd -472(%rax), %xmm1, %xmm1
>     vinsertf32x4 $0x1, %xmm8, %ymm1, %ymm8
>     vmovsd -952(%rax), %xmm1
>     vmovhpd -792(%rax), %xmm1, %xmm9
>     vmovsd -1272(%rax), %xmm1
>     vmovhpd -1112(%rax), %xmm1, %xmm1
>     vinsertf32x4 $0x1, %xmm9, %ymm1, %ymm1
>     vinsertf64x4 $0x1, %ymm8, %zmm1, %zmm1
>     vaddpd %zmm1, %zmm0, %zmm0
>     vaddpd %zmm7, %zmm2, %zmm1
>     vfnmadd132pd %zmm3, %zmm2, %zmm1
>     vfmadd132pd %zmm6, %zmm5, %zmm0
>     valignq $3, %ymm1, %ymm1, %ymm2
>     vmovlpd %xmm1, -1280(%rsi)
>     vextractf64x2 $1, %ymm1, %xmm8
>     vmovhpd %xmm1, -1120(%rsi)
>     vextractf64x4 $0x1, %zmm1, %ymm1
>     vmovlpd %xmm1, -640(%rsi)
>     vmovhpd %xmm1, -480(%rsi)
>     vmovsd %xmm2, -800(%rsi)
>     vextractf64x2 $1, %ymm1, %xmm2
>     vmovsd %xmm8, -960(%rsi)
>     valignq $3, %ymm1, %ymm1, %ymm1
>     vmovsd %xmm2, -320(%rsi)
>     vmovsd %xmm1, -160(%rsi)
>     vmovsd -320(%rax), %xmm1
>     vmovhpd -160(%rax), %xmm1, %xmm2
>     vmovsd -640(%rax), %xmm1
>     vmovhpd -480(%rax), %xmm1, %xmm1
>     vinsertf32x4 $0x1, %xmm2, %ymm1, %ymm2
>     vmovsd -960(%rax), %xmm1
>     vmovhpd -800(%rax), %xmm1, %xmm8
>     vmovsd -1280(%rax), %xmm1
>     vmovhpd -1120(%rax), %xmm1, %xmm1
>     vinsertf32x4 $0x1, %xmm8, %ymm1, %ymm1
>     vinsertf64x4 $0x1, %ymm2, %zmm1, %zmm1
>     vfnmadd132pd %zmm3, %zmm1, %zmm0
>     vaddpd %zmm4, %zmm0, %zmm0
>     valignq $3, %ymm0, %ymm0, %ymm1
>     vmovlpd %xmm0, 14728(%rsi)
>     vextractf64x2 $1, %ymm0, %xmm2
>     vmovhpd %xmm0, 14888(%rsi)
>     vextractf64x4 $0x1, %zmm0, %ymm0
>     vmovlpd %xmm0, 15368(%rsi)
>     vmovhpd %xmm0, 15528(%rsi)
>     vmovsd %xmm1, 15208(%rsi)
>     vextractf64x2 $1, %ymm0, %xmm1
>     vmovsd %xmm2, 15048(%rsi)
>     valignq $3, %ymm0, %ymm0, %ymm0
>     vmovsd %xmm1, 15688(%rsi)
>     vmovsd %xmm0, 15848(%rsi)
>     cmpq %rdx, %rsi
>     jne .L3

Thanks Richard.

>> Disabling the above heuristic we get this vectorized as well, avoiding
>> gather/scatter by manually implementing them and using a quite high
>> VF of 8 (with -mprefer-vector-width=256 you get VF 4 and likely
>> faster code in the end).

Ok. It seems that enabling vectorization for this case doesn't necessarily
make lbm faster; it depends on the hardware. I think we can test SPEC 2017
lbm on a RISC-V board to see whether vectorization is beneficial. But if we
do see that it is beneficial on some boards, could you teach us how to
enable vectorization for such cases according to the uarch?
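For readers following along, the access pattern the cost model rejects above can be sketched roughly as follows. This is not the actual 519.lbm source; the names and sizes are invented for illustration, but the shape matches the diagnostics: 20 doubles per lattice cell, so walking one field of every cell gives the known constant stride of 20 that the vectorizer reports ("no array mode for V8DF[20]").

```c
#include <assert.h>
#include <stddef.h>

#define N_CELLS  8
#define N_FIELDS 20   /* doubles per cell: the stride the vectorizer sees */

static double grid[N_CELLS * N_FIELDS];

/* Touch field f of every cell: a stride-20 load and a stride-20 store,
   the pattern GCC currently refuses to vectorize elementwise. */
static void scale_field(double *g, size_t n, size_t f, double s)
{
    for (size_t i = 0; i < n; i++)
        g[i * N_FIELDS + f] *= s;
}
```

Compiling a loop of this shape with -O3 -fopt-info-vec-missed reproduces diagnostics of the kind quoted above.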
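What "implementing gather manually" amounts to at the VF of 4 that -mprefer-vector-width=256 selects can be sketched with the GNU C vector extension: four scalar loads assembled into one 256-bit vector, which is what the vmovsd/vmovhpd/vinsertf sequences in the assembly above do in register form. The helper name and layout are hypothetical, not code from GCC or SPEC.

```c
#include <assert.h>

/* A 4 x double vector (one 256-bit ymm register), matching VF 4. */
typedef double v4df __attribute__((vector_size(32)));

/* Manually "gather" four doubles spaced 20 elements apart, i.e. one
   field from four consecutive 20-double cells. */
static v4df gather_stride20(const double *p)
{
    return (v4df){ p[0], p[20], p[40], p[60] };
}
```

Whether this beats a hardware gather instruction is exactly the uarch-dependent cost question discussed in this thread.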
Thread overview: 21+ messages

2024-01-24 14:21 [Bug tree-optimization/113583] New: Main loop in 519.lbm not vectorized. rdapp at gcc dot gnu.org
2024-01-24 14:42 ` [Bug tree-optimization/113583] Main loop in 519.lbm not vectorized. juzhe.zhong at rivai dot ai
2024-01-24 14:44 ` rdapp at gcc dot gnu.org
2024-01-24 15:00 ` juzhe.zhong at rivai dot ai
2024-01-25  3:06 ` juzhe.zhong at rivai dot ai
2024-01-25  3:13 ` juzhe.zhong at rivai dot ai
2024-01-25  5:41 ` pinskia at gcc dot gnu.org
2024-01-25  9:05 ` rguenther at suse dot de
2024-01-25  9:16 ` juzhe.zhong at rivai dot ai [this message]
2024-01-25  9:34 ` rguenth at gcc dot gnu.org
2024-01-26  9:50 ` rdapp at gcc dot gnu.org
2024-01-26 10:21 ` rguenther at suse dot de
2024-02-05  6:59 ` juzhe.zhong at rivai dot ai
2024-02-07  3:39 ` juzhe.zhong at rivai dot ai
2024-02-07  7:48 ` juzhe.zhong at rivai dot ai
2024-02-07  8:04 ` rguenther at suse dot de
2024-02-07  8:08 ` juzhe.zhong at rivai dot ai
2024-02-07  8:13 ` juzhe.zhong at rivai dot ai
2024-02-07 10:24 ` rguenther at suse dot de
2024-05-13 14:17 ` rdapp at gcc dot gnu.org
2024-05-16 12:41 ` rguenth at gcc dot gnu.org