public inbox for gcc-patches@gcc.gnu.org
* [PATCH] fix vectorizer performance problem on cygwin hosted cross compiler
@ 2015-11-20  7:21 Jim Wilson
  2015-11-20 10:51 ` Richard Biener
From: Jim Wilson @ 2015-11-20  7:21 UTC
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 2979 bytes --]

A cygwin-hosted cross compiler targeting aarch64-linux, compiling a C
version of linpack with -Ofast, produces code that runs 17% slower than
the code from a linux-hosted compiler.  The problem shows up in the vect
dump, where the cygwin-hosted compiler makes different vectorization
decisions than the linux-hosted one.  That happens because
tree-vect-data-refs.c calls qsort in vect_analyze_data_ref_accesses,
and the newlib and glibc qsort routines sort the list differently.  I
can reproduce the same problem on linux by adding the newlib qsort
sources to a gcc build.  For an x86_64 target, I see about a 30%
performance loss using the newlib qsort.

The qsort trouble turns out to be a problem in the qsort comparison
function, dr_group_sort_cmp.  It does this:
  if (!operand_equal_p (DR_BASE_ADDRESS (dra), DR_BASE_ADDRESS (drb), 0))
    {
      cmp = compare_tree (DR_BASE_ADDRESS (dra), DR_BASE_ADDRESS (drb));
      if (cmp != 0)
        return cmp;
    }
operand_equal_p calls STRIP_NOPS, so it considers two trees to be the
same even if one of them is wrapped in a NOP_EXPR.  However,
compare_tree does not call STRIP_NOPS, so it handles trees with
NOP_EXPRs differently than trees without.  The result is that,
depending on which array entry gets used as the qsort pivot point, you
can get very different sorts.  The newlib qsort happens to choose a bad
pivot for this testcase, and the glibc qsort happens to choose a good
one.  This then triggers a cascading problem in
vect_analyze_data_ref_accesses, which assumes that array entries that
pass the operand_equal_p test for the base address will end up
adjacent, and will only vectorize in that case.

For a contrived example, suppose we have four entries to sort: (plus Y
8), (mult A 4), (pointer_plus Z 16), and (nop (mult A 4)).  Suppose we
choose the mult as the pivot point. The plus sorts before because
tree_code plus is less than mult. The pointer_plus sorts after for the
same reason. The nop sorts equal. So we end up with plus, mult, nop,
pointer_plus. The mult and nop are then combined into the same
vectorization group.  Now suppose we choose the pointer_plus as the
pivot point. The plus and mult sort before. The nop sorts after. The
final result is plus, mult, pointer_plus, nop. And we fail to
vectorize as the mult and nop are not adjacent as they should be.
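
The contrived example above can be modeled outside of GCC.  What follows
is only a toy sketch, not GCC code: the toy_* names are hypothetical, the
enum values are chosen just to mirror the relative order the example
assumes (plus < mult < pointer_plus < nop), and toy_group_cmp imitates
only the base-address step of dr_group_sort_cmp.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for tree codes; only their relative order matters. */
enum toy_code { TOY_PLUS, TOY_MULT, TOY_POINTER_PLUS, TOY_NOP };

struct toy_tree
{
  enum toy_code code;
  const struct toy_tree *inner;	/* Wrapped operand for TOY_NOP, else NULL.  */
  int id;			/* Identifies the underlying expression.  */
};

/* Look through any wrapping TOY_NOP nodes.  */
static const struct toy_tree *
toy_strip_nops (const struct toy_tree *t)
{
  while (t->code == TOY_NOP)
    {
      assert (t->inner != NULL);
      t = t->inner;
    }
  return t;
}

/* Models operand_equal_p: equality is decided after stripping nops.  */
static int
toy_operand_equal_p (const struct toy_tree *a, const struct toy_tree *b)
{
  a = toy_strip_nops (a);
  b = toy_strip_nops (b);
  return a->code == b->code && a->id == b->id;
}

/* Models the pre-patch compare_tree: no stripping before comparing codes.  */
static int
toy_compare_raw (const struct toy_tree *a, const struct toy_tree *b)
{
  if (a->code != b->code)
    return a->code < b->code ? -1 : 1;
  return (a->id > b->id) - (a->id < b->id);
}

/* Models the base-address step of the sort comparison function.  */
static int
toy_group_cmp (const struct toy_tree *a, const struct toy_tree *b)
{
  if (!toy_operand_equal_p (a, b))
    {
      int cmp = toy_compare_raw (a, b);
      if (cmp != 0)
	return cmp;
    }
  return 0;
}

/* The entries from the example that matter for the inconsistency.  */
static const struct toy_tree toy_mult = { TOY_MULT, NULL, 2 };
static const struct toy_tree toy_pplus = { TOY_POINTER_PLUS, NULL, 3 };
static const struct toy_tree toy_nop = { TOY_NOP, &toy_mult, 2 };
```

In this model toy_group_cmp (&toy_mult, &toy_nop) is 0, yet toy_mult
compares less than toy_pplus while toy_nop compares greater than it.  Two
entries that tie with each other land on opposite sides of a third, so
the comparison is not a consistent ordering and the result of qsort
depends on which elements it happens to compare.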

When I modify compare_tree to call STRIP_NOPS, this problem goes away.
I get the same sort from both the newlib and glibc qsort functions,
and I get the same linpack performance from a cygwin hosted compiler
and a linux hosted compiler.
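
As a rough illustration of why stripping helps, here is a toy model of a
comparison that looks through nops before comparing codes.  It is a
sketch under assumptions, not the patched GCC function: the toy_* names
are hypothetical and the enum order only mirrors the contrived example.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for tree codes; only their relative order matters. */
enum toy_code { TOY_PLUS, TOY_MULT, TOY_POINTER_PLUS, TOY_NOP };

struct toy_tree
{
  enum toy_code code;
  const struct toy_tree *inner;	/* Wrapped operand for TOY_NOP, else NULL.  */
  int id;			/* Identifies the underlying expression.  */
};

/* Look through any wrapping TOY_NOP nodes.  */
static const struct toy_tree *
toy_strip_nops (const struct toy_tree *t)
{
  while (t->code == TOY_NOP)
    {
      assert (t->inner != NULL);
      t = t->inner;
    }
  return t;
}

/* Models a compare_tree that strips nops first, then compares.  */
static int
toy_compare_stripped (const struct toy_tree *a, const struct toy_tree *b)
{
  a = toy_strip_nops (a);
  b = toy_strip_nops (b);
  if (a->code != b->code)
    return a->code < b->code ? -1 : 1;
  return (a->id > b->id) - (a->id < b->id);
}

/* qsort callback over an array of pointers to toy trees.  */
static int
toy_sort_cmp (const void *pa, const void *pb)
{
  const struct toy_tree *a = *(const struct toy_tree *const *) pa;
  const struct toy_tree *b = *(const struct toy_tree *const *) pb;
  return toy_compare_stripped (a, b);
}

static const struct toy_tree toy_plus = { TOY_PLUS, NULL, 1 };
static const struct toy_tree toy_mult = { TOY_MULT, NULL, 2 };
static const struct toy_tree toy_pplus = { TOY_POINTER_PLUS, NULL, 3 };
static const struct toy_tree toy_nop = { TOY_NOP, &toy_mult, 4 };

/* Sort the four entries and return 1 if mult and nop end up adjacent.  */
static int
toy_sorted_adjacent (void)
{
  const struct toy_tree *v[4] = { &toy_nop, &toy_pplus, &toy_mult, &toy_plus };
  qsort (v, 4, sizeof v[0], toy_sort_cmp);
  for (int i = 0; i + 1 < 4; i++)
    if ((v[i] == &toy_mult && v[i + 1] == &toy_nop)
	|| (v[i] == &toy_nop && v[i + 1] == &toy_mult))
      return 1;
  return 0;
}
```

Because the nop now compares equal to the mult it wraps and orders the
same way against every other entry, any conforming qsort has to place
the two next to each other, whichever pivots it picks.  Their relative
order remains unspecified, but adjacency is what the grouping relies on.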

This patch was tested with an x86_64 bootstrap and make check.  There
were no regressions.  I've also done a SPEC CPU2000 run with and
without the patch on aarch64-linux; there is no performance change.
And I've verified it by building linpack for aarch64-linux with a
cygwin-hosted cross compiler, an x86_64-hosted cross compiler, and an
aarch64 native compiler.

Jim

[-- Attachment #2: vect-qsort.patch --]
[-- Type: text/x-patch, Size: 517 bytes --]

2015-11-19  Jim Wilson  <jim.wilson@linaro.org>

	* tree-vect-data-refs.c (compare_tree): Call STRIP_NOPS.

Index: tree-vect-data-refs.c
===================================================================
--- tree-vect-data-refs.c	(revision 230429)
+++ tree-vect-data-refs.c	(working copy)
@@ -2545,6 +2545,8 @@ compare_tree (tree t1, tree t2)
   if (t2 == NULL)
     return 1;
 
+  STRIP_NOPS (t1);
+  STRIP_NOPS (t2);
 
   if (TREE_CODE (t1) != TREE_CODE (t2))
     return TREE_CODE (t1) < TREE_CODE (t2) ? -1 : 1;


* Re: [PATCH] fix vectorizer performance problem on cygwin hosted cross compiler
  2015-11-20  7:21 [PATCH] fix vectorizer performance problem on cygwin hosted cross compiler Jim Wilson
@ 2015-11-20 10:51 ` Richard Biener
From: Richard Biener @ 2015-11-20 10:51 UTC
  To: Jim Wilson; +Cc: gcc-patches

On Fri, Nov 20, 2015 at 8:21 AM, Jim Wilson <jim.wilson@linaro.org> wrote:

Ok.

Thanks,
Richard.

> Jim
