public inbox for gcc-patches@gcc.gnu.org
From: Jeff Law <law@redhat.com>
To: Richard Biener <rguenther@suse.de>, gcc-patches@gcc.gnu.org
Cc: Jan Hubicka <jh@suse.de>,
	ubizjak@gmail.com, Martin Jambor <mjambor@suse.de>
Subject: Re: [PATCH] Fix PR80846, change vectorizer reduction epilogue (on x86)
Date: Tue, 05 Dec 2017 20:18:00 -0000
Message-ID: <227ee67d-908d-8fc3-3fe1-173341294fdc@redhat.com>
In-Reply-To: <alpine.LSU.2.20.1711281605550.12252@zhemvz.fhfr.qr>

On 11/28/2017 08:15 AM, Richard Biener wrote:
> 
> The following adds a new target hook, targetm.vectorize.split_reduction,
> which allows the target to specify a preferred mode in which to perform
> the final reduction using either vector shifts or scalar extractions.
> Up to that mode the vector reduction result is reduced by combining
> lowparts and highparts recursively.  This avoids lane-crossing operations
> when doing AVX256 on Zen and Bulldozer and also speeds up things on
> Haswell (I verified ~20% speedup on Broadwell).
> 
> Thus the patch implements the target hook on x86 to _always_ prefer
> SSE modes for the final reduction.
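> 
> In sketch form the two hooks look like this (an illustrative,
> simplified version -- the actual patch handles the full set of
> vector modes):
> 
> /* Default implementation: no preference, keep performing the
>    final reduction in the vector mode we already have.  */
> machine_mode
> default_split_reduction (machine_mode mode)
> {
>   return mode;
> }
> 
> /* x86 implementation: always prefer an SSE-sized mode; the
>    vectorizer then halves the vector by adding highparts to
>    lowparts until this mode is reached.  Only the int modes
>    used in the testcase below are shown.  */
> static machine_mode
> ix86_split_reduction (machine_mode mode)
> {
>   if (GET_MODE_SIZE (mode) <= 16)
>     return mode;        /* Already SSE-sized.  */
>   switch (mode)
>     {
>     case E_V16SImode:   /* AVX512 */
>     case E_V8SImode:    /* AVX2 */
>       return V4SImode;
>     default:
>       return mode;
>     }
> }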
> 
> For the testcase in the bugzilla
> 
> int sumint(const int arr[]) {
>     arr = __builtin_assume_aligned(arr, 64);
>     int sum=0;
>     for (int i=0 ; i<1024 ; i++)
>       sum+=arr[i];
>     return sum;
> }
> 
> this changes -O3 -mavx512f code from
> 
> sumint:
> .LFB0:
>         .cfi_startproc
>         vpxord  %zmm0, %zmm0, %zmm0
>         leaq    4096(%rdi), %rax
>         .p2align 4,,10
>         .p2align 3
> .L2:
>         vpaddd  (%rdi), %zmm0, %zmm0
>         addq    $64, %rdi
>         cmpq    %rdi, %rax
>         jne     .L2
>         vpxord  %zmm1, %zmm1, %zmm1
>         vshufi32x4      $78, %zmm1, %zmm0, %zmm2
>         vpaddd  %zmm2, %zmm0, %zmm0
>         vmovdqa64       .LC0(%rip), %zmm2
>         vpermi2d        %zmm1, %zmm0, %zmm2
>         vpaddd  %zmm2, %zmm0, %zmm0
>         vmovdqa64       .LC1(%rip), %zmm2
>         vpermi2d        %zmm1, %zmm0, %zmm2
>         vpaddd  %zmm2, %zmm0, %zmm0
>         vmovdqa64       .LC2(%rip), %zmm2
>         vpermi2d        %zmm1, %zmm0, %zmm2
>         vpaddd  %zmm2, %zmm0, %zmm0
>         vmovd   %xmm0, %eax
> 
> to
> 
> sumint:
> .LFB0:
>         .cfi_startproc
>         vpxord  %zmm0, %zmm0, %zmm0
>         leaq    4096(%rdi), %rax
>         .p2align 4,,10
>         .p2align 3
> .L2:
>         vpaddd  (%rdi), %zmm0, %zmm0
>         addq    $64, %rdi
>         cmpq    %rdi, %rax
>         jne     .L2
>         vextracti64x4   $0x1, %zmm0, %ymm1
>         vpaddd  %ymm0, %ymm1, %ymm1
>         vmovdqa %xmm1, %xmm0
>         vextracti128    $1, %ymm1, %xmm1
>         vpaddd  %xmm1, %xmm0, %xmm0
>         vpsrldq $8, %xmm0, %xmm1
>         vpaddd  %xmm1, %xmm0, %xmm0
>         vpsrldq $4, %xmm0, %xmm1
>         vpaddd  %xmm1, %xmm0, %xmm0
>         vmovd   %xmm0, %eax
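> 
> In C intrinsics terms the new epilogue is roughly equivalent to the
> following illustrative helper (name and code made up for exposition,
> not what the compiler emits internally):
> 
> #include <immintrin.h>
> 
> /* Horizontal add of 16 ints by halving the width at each step,
>    512 -> 256 -> 128 bits, then two in-register byte shifts
>    (the vpsrldq $8 / $4 steps above).  */
> static int
> hsum_v16si (__m512i v)
> {
>   __m256i lo = _mm512_castsi512_si256 (v);
>   __m256i hi = _mm512_extracti64x4_epi64 (v, 1);
>   __m256i s2 = _mm256_add_epi32 (lo, hi);
>   __m128i s = _mm_add_epi32 (_mm256_castsi256_si128 (s2),
>                              _mm256_extracti128_si256 (s2, 1));
>   s = _mm_add_epi32 (s, _mm_srli_si128 (s, 8));
>   s = _mm_add_epi32 (s, _mm_srli_si128 (s, 4));
>   return _mm_cvtsi128_si32 (s);
> }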
> 
> and for -O3 -mavx2 from
> 
> sumint:
> .LFB0:
>         .cfi_startproc
>         vpxor   %xmm0, %xmm0, %xmm0
>         leaq    4096(%rdi), %rax
>         .p2align 4,,10
>         .p2align 3
> .L2:
>         vpaddd  (%rdi), %ymm0, %ymm0
>         addq    $32, %rdi
>         cmpq    %rdi, %rax
>         jne     .L2
>         vpxor   %xmm1, %xmm1, %xmm1
>         vperm2i128      $33, %ymm1, %ymm0, %ymm2
>         vpaddd  %ymm2, %ymm0, %ymm0
>         vperm2i128      $33, %ymm1, %ymm0, %ymm2
>         vpalignr        $8, %ymm0, %ymm2, %ymm2
>         vpaddd  %ymm2, %ymm0, %ymm0
>         vperm2i128      $33, %ymm1, %ymm0, %ymm1
>         vpalignr        $4, %ymm0, %ymm1, %ymm1
>         vpaddd  %ymm1, %ymm0, %ymm0
>         vmovd   %xmm0, %eax
> 
> to
> 
> sumint:
> .LFB0:
>         .cfi_startproc
>         vpxor   %xmm0, %xmm0, %xmm0
>         leaq    4096(%rdi), %rax
>         .p2align 4,,10
>         .p2align 3
> .L2:
>         vpaddd  (%rdi), %ymm0, %ymm0
>         addq    $32, %rdi
>         cmpq    %rdi, %rax
>         jne     .L2
>         vmovdqa %xmm0, %xmm1
>         vextracti128    $1, %ymm0, %xmm0
>         vpaddd  %xmm0, %xmm1, %xmm0
>         vpsrldq $8, %xmm0, %xmm1
>         vpaddd  %xmm1, %xmm0, %xmm0
>         vpsrldq $4, %xmm0, %xmm1
>         vpaddd  %xmm1, %xmm0, %xmm0
>         vmovd   %xmm0, %eax
>         vzeroupper
>         ret
> 
> which besides being faster is also smaller (fewer prefixes).
> 
> SPEC 2k6 results on Haswell (thus AVX2) are neutral.  As it merely
> affects reduction vectorization epilogues I didn't expect big effects
> except for loops that do not run many iterations (more likely with AVX512).
> 
> Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
> 
> Ok for trunk?
> 
> The PR mentions some more tricks to optimize the sequence, but
> those look like backend-only optimizations.
> 
> Thanks,
> Richard.
> 
> 2017-11-28  Richard Biener  <rguenther@suse.de>
> 
> 	PR tree-optimization/80846
> 	* target.def (split_reduction): New target hook.
> 	* targhooks.c (default_split_reduction): New function.
> 	* targhooks.h (default_split_reduction): Declare.
> 	* tree-vect-loop.c (vect_create_epilog_for_reduction): If the
> 	target requests, first reduce vectors by combining low and high
> 	parts.
> 	* tree-vect-stmts.c (vect_gen_perm_mask_any): Adjust.
> 	(get_vectype_for_scalar_type_and_size): Export.
> 	* tree-vectorizer.h (get_vectype_for_scalar_type_and_size): Declare.
> 
> 	* doc/tm.texi.in (TARGET_VECTORIZE_SPLIT_REDUCTION): Document.
> 	* doc/tm.texi: Regenerate.
> 
> 	i386/
> 	* config/i386/i386.c (ix86_split_reduction): Implement
> 	TARGET_VECTORIZE_SPLIT_REDUCTION.
> 
> 	* gcc.target/i386/pr80846-1.c: New testcase.
> 	* gcc.target/i386/pr80846-2.c: Likewise.
I'm not a big fan of introducing these kinds of target queries into the
gimple optimizers, but I think we've all agreed to allow them to varying
degrees within the vectorizer.

So no objections from me.  You know the vectorizer bits far better than
I :-)


jeff

Thread overview: 8+ messages
2017-11-28 15:25 Richard Biener
2017-12-05 20:18 ` Jeff Law [this message]
2017-12-06  6:42   ` Richard Biener
2018-01-05  9:01 ` Richard Biener
2018-01-09 23:13   ` Jeff Law
2018-01-10  8:23     ` Richard Biener
2018-01-11 12:11       ` Richard Biener
2018-01-11 16:21       ` Jeff Law
