* [x86 PATCH] Split SUBREGs of SSE vector registers into vec_select insns.
@ 2023-08-03 7:10 Roger Sayle
2023-08-03 9:37 ` Uros Bizjak
0 siblings, 1 reply; 2+ messages in thread
From: Roger Sayle @ 2023-08-03 7:10 UTC (permalink / raw)
To: gcc-patches; +Cc: 'Uros Bizjak'
[-- Attachment #1: Type: text/plain, Size: 2577 bytes --]
This patch is the final piece in the series to improve the ABI issues
affecting PR 88873. The previous patches tackled inserting DFmode
values into V2DFmode registers, by introducing insvti_{low,high}part
patterns. This patch improves the extraction of DFmode values from
v2DFmode registers via TImode intermediates.
I'd initially thought this would require new extvti_{low,high}part
patterns to be defined, but all that's required is to recognize that
the SUBREG idioms produced by combine are equivalent to (forms of)
vec_select patterns. The target-independent middle-end can't be sure
that the appropriate vec_select instruction exists on the target,
hence doesn't canonicalize a SUBREG of a vector mode as a vec_select,
but the backend can provide a define_split stating where and when
this is useful, for example, considering whether the operand is in
memory, or whether !TARGET_SSE_MATH and the destination is i387.
For pr88873.c, gcc -O2 -march=cascadelake currently generates:
foo: vpunpcklqdq %xmm3, %xmm2, %xmm7
vpunpcklqdq %xmm1, %xmm0, %xmm6
vpunpcklqdq %xmm5, %xmm4, %xmm2
vmovdqa %xmm7, -24(%rsp)
vmovdqa %xmm6, %xmm1
movq -16(%rsp), %rax
vpinsrq $1, %rax, %xmm7, %xmm4
vmovapd %xmm4, %xmm6
vfmadd132pd %xmm1, %xmm2, %xmm6
vmovapd %xmm6, -24(%rsp)
vmovsd -16(%rsp), %xmm1
vmovsd -24(%rsp), %xmm0
ret
with this patch, we now generate:
foo: vpunpcklqdq %xmm1, %xmm0, %xmm6
vpunpcklqdq %xmm3, %xmm2, %xmm7
vpunpcklqdq %xmm5, %xmm4, %xmm2
vmovdqa %xmm6, %xmm1
vfmadd132pd %xmm7, %xmm2, %xmm1
vmovsd %xmm1, %xmm1, %xmm0
vunpckhpd %xmm1, %xmm1, %xmm1
ret
The improvement is even more dramatic when compared to the original
29 instructions shown in comment #8. GCC 13, for example, required
12 transfers to/from memory.
This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures. Ok for mainline?
2023-08-03 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
* config/i386/sse.md (define_split): Convert highpart:DF extract
from V2DFmode register into a sse2_storehpd instruction.
(define_split): Likewise, convert lowpart:DF extract from V2DF
register into a sse2_storelpd instruction.
gcc/testsuite/ChangeLog
* gcc.target/i386/pr88873.c: Tweak to check for improved code.
Thanks in advance,
Roger
--
[-- Attachment #2: patchet.txt --]
[-- Type: text/plain, Size: 1707 bytes --]
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 35fd66e..bc419ff 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -13554,6 +13554,14 @@
[(set_attr "type" "ssemov")
(set_attr "mode" "V2SF,V4SF,V2SF")])
+;; Convert highpart SUBREG in sse2_storehpd or *vec_extractv2df_1_sse.
+(define_split
+ [(set (match_operand:DF 0 "register_operand")
+ (subreg:DF (match_operand:V2DF 1 "register_operand") 8))]
+ "TARGET_SSE"
+ [(set (match_dup 0)
+ (vec_select:DF (match_dup 1) (parallel [(const_int 1)])))])
+
;; Avoid combining registers from different units in a single alternative,
;; see comment above inline_secondary_memory_needed function in i386.cc
(define_insn "sse2_storelpd"
@@ -13599,6 +13607,14 @@
[(set_attr "type" "ssemov")
(set_attr "mode" "V2SF,V4SF,V2SF")])
+;; Convert lowpart SUBREG into sse2_storelpd or *vec_extractv2df_0_sse.
+(define_split
+ [(set (match_operand:DF 0 "register_operand")
+ (subreg:DF (match_operand:V2DF 1 "register_operand") 0))]
+ "TARGET_SSE"
+ [(set (match_dup 0)
+ (vec_select:DF (match_dup 1) (parallel [(const_int 0)])))])
+
(define_expand "sse2_loadhpd_exp"
[(set (match_operand:V2DF 0 "nonimmediate_operand")
(vec_concat:V2DF
diff --git a/gcc/testsuite/gcc.target/i386/pr88873.c b/gcc/testsuite/gcc.target/i386/pr88873.c
index d893aac..a3a7ef2 100644
--- a/gcc/testsuite/gcc.target/i386/pr88873.c
+++ b/gcc/testsuite/gcc.target/i386/pr88873.c
@@ -9,3 +9,5 @@ s_t foo (s_t a, s_t b, s_t c)
}
/* { dg-final { scan-assembler-times "vpunpcklqdq" 3 } } */
+/* { dg-final { scan-assembler "vunpckhpd" } } */
+/* { dg-final { scan-assembler-not "rsp" } } */
^ permalink raw reply [flat|nested] 2+ messages in thread
* Re: [x86 PATCH] Split SUBREGs of SSE vector registers into vec_select insns.
2023-08-03 7:10 [x86 PATCH] Split SUBREGs of SSE vector registers into vec_select insns Roger Sayle
@ 2023-08-03 9:37 ` Uros Bizjak
0 siblings, 0 replies; 2+ messages in thread
From: Uros Bizjak @ 2023-08-03 9:37 UTC (permalink / raw)
To: Roger Sayle; +Cc: gcc-patches
On Thu, Aug 3, 2023 at 9:10 AM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> This patch is the final piece in the series to improve the ABI issues
> affecting PR 88873. The previous patches tackled inserting DFmode
> values into V2DFmode registers, by introducing insvti_{low,high}part
> patterns. This patch improves the extraction of DFmode values from
> v2DFmode registers via TImode intermediates.
>
> I'd initially thought this would require new extvti_{low,high}part
> patterns to be defined, but all that's required is to recognize that
> the SUBREG idioms produced by combine are equivalent to (forms of)
> vec_select patterns. The target-independent middle-end can't be sure
> that the appropriate vec_select instruction exists on the target,
> hence doesn't canonicalize a SUBREG of a vector mode as a vec_select,
> but the backend can provide a define_split stating where and when
> this is useful, for example, considering whether the operand is in
> memory, or whether !TARGET_SSE_MATH and the destination is i387.
>
> For pr88873.c, gcc -O2 -march=cascadelake currently generates:
>
> foo: vpunpcklqdq %xmm3, %xmm2, %xmm7
> vpunpcklqdq %xmm1, %xmm0, %xmm6
> vpunpcklqdq %xmm5, %xmm4, %xmm2
> vmovdqa %xmm7, -24(%rsp)
> vmovdqa %xmm6, %xmm1
> movq -16(%rsp), %rax
> vpinsrq $1, %rax, %xmm7, %xmm4
> vmovapd %xmm4, %xmm6
> vfmadd132pd %xmm1, %xmm2, %xmm6
> vmovapd %xmm6, -24(%rsp)
> vmovsd -16(%rsp), %xmm1
> vmovsd -24(%rsp), %xmm0
> ret
>
> with this patch, we now generate:
>
> foo: vpunpcklqdq %xmm1, %xmm0, %xmm6
> vpunpcklqdq %xmm3, %xmm2, %xmm7
> vpunpcklqdq %xmm5, %xmm4, %xmm2
> vmovdqa %xmm6, %xmm1
> vfmadd132pd %xmm7, %xmm2, %xmm1
> vmovsd %xmm1, %xmm1, %xmm0
> vunpckhpd %xmm1, %xmm1, %xmm1
> ret
>
> The improvement is even more dramatic when compared to the original
> 29 instructions shown in comment #8. GCC 13, for example, required
> 12 transfers to/from memory.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures. Ok for mainline?
>
>
> 2023-08-03 Roger Sayle <roger@nextmovesoftware.com>
>
> gcc/ChangeLog
> * config/i386/sse.md (define_split): Convert highpart:DF extract
> from V2DFmode register into a sse2_storehpd instruction.
> (define_split): Likewise, convert lowpart:DF extract from V2DF
> register into a sse2_storelpd instruction.
>
> gcc/testsuite/ChangeLog
> * gcc.target/i386/pr88873.c: Tweak to check for improved code.
OK.
Thanks,
Uros.
>
>
> Thanks in advance,
> Roger
> --
>
^ permalink raw reply [flat|nested] 2+ messages in thread
end of thread, other threads:[~2023-08-03 9:37 UTC | newest]
Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-03 7:10 [x86 PATCH] Split SUBREGs of SSE vector registers into vec_select insns Roger Sayle
2023-08-03 9:37 ` Uros Bizjak
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).