From: Richard Biener <rguenther@suse.de>
To: Jeff Law <law@redhat.com>,gcc-patches@gcc.gnu.org
Cc: Jan Hubicka <jh@suse.de>, ubizjak@gmail.com, Martin Jambor <mjambor@suse.de>
Subject: Re: [PATCH] Fix PR80846, change vectorizer reduction epilogue (on x86)
Date: Wed, 06 Dec 2017 06:42:00 -0000 [thread overview]
Message-ID: <956A88C4-387C-429A-BD7E-958AAFB45C17@suse.de> (raw)
In-Reply-To: <227ee67d-908d-8fc3-3fe1-173341294fdc@redhat.com>
On December 5, 2017 9:18:46 PM GMT+01:00, Jeff Law <law@redhat.com> wrote:
>On 11/28/2017 08:15 AM, Richard Biener wrote:
>>
>> The following adds a new target hook, targetm.vectorize.split_reduction,
>> which allows the target to specify a preferred mode to perform the
>> final reduction on using either vector shifts or scalar extractions.
>> Up to that mode the vector reduction result is reduced by combining
>> lowparts and highparts recursively. This avoids lane-crossing operations
>> when doing AVX256 on Zen and Bulldozer and also speeds up things on
>> Haswell (I verified ~20% speedup on Broadwell).
>>
>> Thus the patch implements the target hook on x86 to _always_ prefer
>> SSE modes for the final reduction.
>>
>> For the testcase in the bugzilla
>>
>> int sumint(const int arr[]) {
>>   arr = __builtin_assume_aligned(arr, 64);
>>   int sum = 0;
>>   for (int i = 0; i < 1024; i++)
>>     sum += arr[i];
>>   return sum;
>> }
>>
>> this changes -O3 -mavx512f code from
>>
>> sumint:
>> .LFB0:
>> .cfi_startproc
>> vpxord %zmm0, %zmm0, %zmm0
>> leaq 4096(%rdi), %rax
>> .p2align 4,,10
>> .p2align 3
>> .L2:
>> vpaddd (%rdi), %zmm0, %zmm0
>> addq $64, %rdi
>> cmpq %rdi, %rax
>> jne .L2
>> vpxord %zmm1, %zmm1, %zmm1
>> vshufi32x4 $78, %zmm1, %zmm0, %zmm2
>> vpaddd %zmm2, %zmm0, %zmm0
>> vmovdqa64 .LC0(%rip), %zmm2
>> vpermi2d %zmm1, %zmm0, %zmm2
>> vpaddd %zmm2, %zmm0, %zmm0
>> vmovdqa64 .LC1(%rip), %zmm2
>> vpermi2d %zmm1, %zmm0, %zmm2
>> vpaddd %zmm2, %zmm0, %zmm0
>> vmovdqa64 .LC2(%rip), %zmm2
>> vpermi2d %zmm1, %zmm0, %zmm2
>> vpaddd %zmm2, %zmm0, %zmm0
>> vmovd %xmm0, %eax
>>
>> to
>>
>> sumint:
>> .LFB0:
>> .cfi_startproc
>> vpxord %zmm0, %zmm0, %zmm0
>> leaq 4096(%rdi), %rax
>> .p2align 4,,10
>> .p2align 3
>> .L2:
>> vpaddd (%rdi), %zmm0, %zmm0
>> addq $64, %rdi
>> cmpq %rdi, %rax
>> jne .L2
>> vextracti64x4 $0x1, %zmm0, %ymm1
>> vpaddd %ymm0, %ymm1, %ymm1
>> vmovdqa %xmm1, %xmm0
>> vextracti128 $1, %ymm1, %xmm1
>> vpaddd %xmm1, %xmm0, %xmm0
>> vpsrldq $8, %xmm0, %xmm1
>> vpaddd %xmm1, %xmm0, %xmm0
>> vpsrldq $4, %xmm0, %xmm1
>> vpaddd %xmm1, %xmm0, %xmm0
>> vmovd %xmm0, %eax
>>
>> and for -O3 -mavx2 from
>>
>> sumint:
>> .LFB0:
>> .cfi_startproc
>> vpxor %xmm0, %xmm0, %xmm0
>> leaq 4096(%rdi), %rax
>> .p2align 4,,10
>> .p2align 3
>> .L2:
>> vpaddd (%rdi), %ymm0, %ymm0
>> addq $32, %rdi
>> cmpq %rdi, %rax
>> jne .L2
>> vpxor %xmm1, %xmm1, %xmm1
>> vperm2i128 $33, %ymm1, %ymm0, %ymm2
>> vpaddd %ymm2, %ymm0, %ymm0
>> vperm2i128 $33, %ymm1, %ymm0, %ymm2
>> vpalignr $8, %ymm0, %ymm2, %ymm2
>> vpaddd %ymm2, %ymm0, %ymm0
>> vperm2i128 $33, %ymm1, %ymm0, %ymm1
>> vpalignr $4, %ymm0, %ymm1, %ymm1
>> vpaddd %ymm1, %ymm0, %ymm0
>> vmovd %xmm0, %eax
>>
>> to
>>
>> sumint:
>> .LFB0:
>> .cfi_startproc
>> vpxor %xmm0, %xmm0, %xmm0
>> leaq 4096(%rdi), %rax
>> .p2align 4,,10
>> .p2align 3
>> .L2:
>> vpaddd (%rdi), %ymm0, %ymm0
>> addq $32, %rdi
>> cmpq %rdi, %rax
>> jne .L2
>> vmovdqa %xmm0, %xmm1
>> vextracti128 $1, %ymm0, %xmm0
>> vpaddd %xmm0, %xmm1, %xmm0
>> vpsrldq $8, %xmm0, %xmm1
>> vpaddd %xmm1, %xmm0, %xmm0
>> vpsrldq $4, %xmm0, %xmm1
>> vpaddd %xmm1, %xmm0, %xmm0
>> vmovd %xmm0, %eax
>> vzeroupper
>> ret
>>
>> which besides being faster is also smaller (less prefixes).
>>
>> SPEC 2k6 results on Haswell (thus AVX2) are neutral. As it merely
>> affects reduction vectorization epilogues I didn't expect big effects
>> except for loops that do not iterate much (more likely with AVX512).
>>
>> Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
>>
>> Ok for trunk?
>>
>> The PR mentions some more tricks to optimize the sequence but
>> those look like backend only optimizations.
>>
>> Thanks,
>> Richard.
>>
>> 2017-11-28 Richard Biener <rguenther@suse.de>
>>
>> PR tree-optimization/80846
>> * target.def (split_reduction): New target hook.
>> * targhooks.c (default_split_reduction): New function.
>> * targhooks.h (default_split_reduction): Declare.
>> * tree-vect-loop.c (vect_create_epilog_for_reduction): If the
>> target requests first reduce vectors by combining low and high
>> parts.
>> * tree-vect-stmts.c (vect_gen_perm_mask_any): Adjust.
>> (get_vectype_for_scalar_type_and_size): Export.
>> * tree-vectorizer.h (get_vectype_for_scalar_type_and_size): Declare.
>>
>> * doc/tm.texi.in (TARGET_VECTORIZE_SPLIT_REDUCTION): Document.
>> * doc/tm.texi: Regenerate.
>>
>> i386/
>> * config/i386/i386.c (ix86_split_reduction): Implement
>> TARGET_VECTORIZE_SPLIT_REDUCTION.
>>
>> * gcc.target/i386/pr80846-1.c: New testcase.
>> * gcc.target/i386/pr80846-2.c: Likewise.
>I'm not a big fan of introducing these kinds of target queries into the
>gimple optimizers, but I think we've all agreed to allow them to varying
>degrees within the vectorizer.
>
>So no objections from me. You know the vectorizer bits far better than
>I :-)
I had first (ab)used the vector_sizes hook, but Jakub convinced me to add a new one. There might be non-trivial costs when moving between vector sizes.
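
For readers following along, the split-reduction epilogue is equivalent to this plain-C model of recursive halving (a sketch only; the helper name is hypothetical, not code from the patch):

```c
/* Scalar model of the split-reduction epilogue: repeatedly fold the
   high half of the accumulator's lanes onto the low half until a
   single lane remains.  Each round corresponds to one extract/shift
   plus vpaddd step in the generated code above.  n must be a power
   of two (the number of vector lanes). */
static int split_reduce(int *lanes, int n)
{
  while (n > 1) {
    n /= 2;                       /* move to the next-narrower "mode" */
    for (int i = 0; i < n; i++)
      lanes[i] += lanes[i + n];   /* add high half onto low half */
  }
  return lanes[0];
}
```

For a V8SI accumulator this performs three folding rounds, matching the vextracti128 / vpsrldq $8 / vpsrldq $4 steps in the improved AVX2 epilogue.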
Richard.
>
>jeff
Thread overview: 8+ messages
2017-11-28 15:25 Richard Biener
2017-12-05 20:18 ` Jeff Law
2017-12-06 6:42 ` Richard Biener [this message]
2018-01-05 9:01 ` Richard Biener
2018-01-09 23:13 ` Jeff Law
2018-01-10 8:23 ` Richard Biener
2018-01-11 12:11 ` Richard Biener
2018-01-11 16:21 ` Jeff Law