[Bug middle-end/53090] New: suboptimal ivopt

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug middle-end/53090] New: suboptimal ivopt
@ 2012-04-23 17:36 xinliangli at gmail dot com
  2012-04-23 17:38 ` [Bug middle-end/53090] " xinliangli at gmail dot com
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: xinliangli at gmail dot com @ 2012-04-23 17:36 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53090

             Bug #: 53090
           Summary: suboptimal ivopt
    Classification: Unclassified
           Product: gcc
           Version: 4.8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: xinliangli@gmail.com


Compiling the attached benchmark code with trunk gcc, the code generated for
the hot memory swap loop (line 60) is very inefficient : both icc and llvm use
two ivs and generate a tight loop with 9 instructions, but gcc decides to use 3
ivs, and the loop exit testing code is wierd and inefficient -- it ends up
produce a loop with 11 instructions.

#define XCH(x,y)    { Aint t_mp; t_mp=(x); (x)=(y); (y)=t_mp; }

    for( i=1, j=k-1 ; i<j ; ++i, --j ) {
        XCH(perm[i], perm[j])
    }

The tight version:


.LBB0_11:                               # %for.body57.i
                                        #   Parent Loop BB0_1 Depth=1
                                        #     Parent Loop BB0_9 Depth=2
                                        # =>    This Inner Loop Header: Depth=3
    movl    (%rbx,%rdi,4), %ebp
    movl    (%rbx,%rsi,4), %eax
    movl    %eax, (%rbx,%rdi,4)
    movl    %ebp, (%rbx,%rsi,4)
    decq    %rsi
    incq    %rdi
    cmpl    %edx, %edi
    leal    -1(%rdx), %edx
    jl    .LBB0_11


The gcc version:

.L18:
    movl    (%rdx), %edi
    addl    $1, %ecx
    movl    (%rsi), %eax
    movl    %eax, (%rdx)
    addq    $4, %rdx
    movl    %edi, (%rsi)
    movl    %r8d, %edi
    subq    $4, %rsi
    subl    %ecx, %edi
    cmpl    %edi, %ecx
    jl    .L18


However gcc is doing the right thing when applied on the extracted test case:

#define XCH(x,y) { int t_mp; t_mp=(x); (x)=(y); (y)=t_mp; }

void foo (int *perm, int k)
{
 int i,j;
 for( i=1, j=k-1 ; i<j ; ++i, --j ) {
    XCH(perm[i], perm[j])
 }
}


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug middle-end/53090] suboptimal ivopt
  2012-04-23 17:36 [Bug middle-end/53090] New: suboptimal ivopt xinliangli at gmail dot com
@ 2012-04-23 17:38 ` xinliangli at gmail dot com
  2014-03-29 11:04 ` [Bug target/53090] " amker.cheng at gmail dot com
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: xinliangli at gmail dot com @ 2012-04-23 17:38 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53090

--- Comment #1 from davidxl <xinliangli at gmail dot com> 2012-04-23 17:37:40 UTC ---
Created attachment 27223
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=27223
benchmark source


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug target/53090] suboptimal ivopt
  2012-04-23 17:36 [Bug middle-end/53090] New: suboptimal ivopt xinliangli at gmail dot com
  2012-04-23 17:38 ` [Bug middle-end/53090] " xinliangli at gmail dot com
@ 2014-03-29 11:04 ` amker.cheng at gmail dot com
  2014-04-02 11:17 ` rguenth at gcc dot gnu.org
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: amker.cheng at gmail dot com @ 2014-03-29 11:04 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53090

bin.cheng <amker.cheng at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amker.cheng at gmail dot com

--- Comment #2 from bin.cheng <amker.cheng at gmail dot com> ---
I tried the simple case, gcc doesn't work as expected on x86_64, but x86 is
fine.

I think there are several issues in ivopt causing this.  The first issue is
IVOPT is too conservative when representing iv_use with iv_cand in type with
smaller precision.
Consider below use/cand:
use 0
  address
  in statement t_mp_11 = *_10;

  at position *_10
  type int *
  base perm_9(D) + 4
  step 4
  base object (void *) perm_9(D)
  related candidates 
candidate 5 (important)
  var_before ivtmp.8
  var_after ivtmp.8
  incremented before exit test
  type unsigned int
  base 1
  step 1
candidate 6 (important)
  original biv
  type int
  base 1
  step 1

Use 0 is in type "int *" which has precision 64 on x86_64; cand is in type
"int" which has precision 32 on x86_64.  In function get_computation_cost_at,
there is below code:

  if (TYPE_PRECISION (utype) > TYPE_PRECISION (ctype))
    {
      /* We do not have a precision to express the values of use.  */
      return infinite_cost;
    }
But this is too conservative because the loop runs for "(j-i)/2" times, which
can be expressed by the candidate.  Even though the candidate has smaller type
than iv_use.

We should add some code checking loop niters against candidate's coverage here.
For example, the generated assembly changed into:
.L14:
    movl    (%rdx), %edi
    movslq    %eax, %rcx
    addl    $1, %eax
    movl    (%r15,%rcx,4), %esi
    subq    $4, %rdx
    movl    %edi, (%r15,%rcx,4)
    movl    %r8d, %ecx
    subl    %eax, %ecx
    movl    %esi, 4(%rdx)
    cmpl    %ecx, %eax
    jl    .L14

Now the original candidate is chosen as rcs for original induction variable
"i".  Unfortunately there are some other issues which prevent IVOPT from
choosing right candidate for original induction variable "j".

I will keep looking into it see what's going on.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug target/53090] suboptimal ivopt
  2012-04-23 17:36 [Bug middle-end/53090] New: suboptimal ivopt xinliangli at gmail dot com
  2012-04-23 17:38 ` [Bug middle-end/53090] " xinliangli at gmail dot com
  2014-03-29 11:04 ` [Bug target/53090] " amker.cheng at gmail dot com
@ 2014-04-02 11:17 ` rguenth at gcc dot gnu.org
  2014-04-02 11:34 ` rguenth at gcc dot gnu.org
  2014-04-02 11:50 ` rguenth at gcc dot gnu.org
  4 siblings, 0 replies; 6+ messages in thread
From: rguenth at gcc dot gnu.org @ 2014-04-02 11:17 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53090

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
I think the main observation is that

use 1
  address
  in statement _15 = *_14;

  at position *_14
  type int *
  base perm_9(D) + (sizetype) ((long unsigned int) (k_4(D) + -1) * 4)
  step 18446744073709551612
  base object (void *) perm_9(D)
  related candidates

((long unsigned int) (k_4(D) + -1) * 4) is a widening multiplication
and the x86 addressing mode widens(?).  Note that with the current
specification of TARGET_MEM_REF:

/* Low-level memory addressing.  Operands are BASE (address of static or
   global variable or register), OFFSET (integer constant),
   INDEX (register), STEP (integer constant), INDEX2 (register),
   The corresponding address is BASE + STEP * INDEX + INDEX2 + OFFSET.
   Only variations and values valid on the target are allowed.

   The type of STEP, INDEX and INDEX2 is sizetype.

this isn't allowed.  One would need to a) extend the spec to allow
step * index be a widening multiplication, b) extend the recog code
in tree-ssa-address.c properly to see what the target does here,
esp. whether it treats INDEX as signed or unsigned.

Then IVOPTs could choose (unsigned) k_4 as IV and use it directly in
the widening multiplication places.

A way to test this would be to emit the widening multiplies from
c-common.c:pointer_int_sum already.  (but expect lots of fallout, there is
a lot of code handling MULT_EXPR but not a lot handling WIDEN_MULT_EXPR)

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug target/53090] suboptimal ivopt
  2012-04-23 17:36 [Bug middle-end/53090] New: suboptimal ivopt xinliangli at gmail dot com
                   ` (2 preceding siblings ...)
  2014-04-02 11:17 ` rguenth at gcc dot gnu.org
@ 2014-04-02 11:34 ` rguenth at gcc dot gnu.org
  2014-04-02 11:50 ` rguenth at gcc dot gnu.org
  4 siblings, 0 replies; 6+ messages in thread
From: rguenth at gcc dot gnu.org @ 2014-04-02 11:34 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53090

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
Sth like

Index: gcc/c-family/c-common.c
===================================================================
--- gcc/c-family/c-common.c     (revision 209018)
+++ gcc/c-family/c-common.c     (working copy)
@@ -4415,19 +4415,21 @@ pointer_int_sum (location_t loc, enum tr

   /* Convert the integer argument to a type the same size as sizetype
      so the multiply won't overflow spuriously.  */
+  enum tree_code code = MULT_EXPR;
   if (TYPE_PRECISION (TREE_TYPE (intop)) != TYPE_PRECISION (sizetype)
       || TYPE_UNSIGNED (TREE_TYPE (intop)) != TYPE_UNSIGNED (sizetype))
-    intop = convert (c_common_type_for_size (TYPE_PRECISION (sizetype),
-                                            TYPE_UNSIGNED (sizetype)), intop);
+    code = WIDEN_MULT_EXPR;

   /* Replace the integer argument with a suitable product by the object size.
      Do this multiplication as signed, then convert to the appropriate type
      for the pointer operation and disregard an overflow that occurred only
      because of the sign-extension change in the latter conversion.  */
   {
-    tree t = build_binary_op (loc,
-                             MULT_EXPR, intop,
-                             convert (TREE_TYPE (intop), size_exp), 1);
+    tree t = fold_build2_loc (loc, code,
+                             c_common_type_for_size (TYPE_PRECISION
(sizetype),
+                                                     TYPE_UNSIGNED
(sizetype)),
+                             intop,
+                             convert (TREE_TYPE (intop), size_exp));
     intop = convert (sizetype, t);
     if (TREE_OVERFLOW_P (intop) && !TREE_OVERFLOW (t))
       intop = build_int_cst_wide (TREE_TYPE (intop), TREE_INT_CST_LOW (intop),


but then you notice that for example SCEV doesn't handle WIDEN_MULT_EXPR.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug target/53090] suboptimal ivopt
  2012-04-23 17:36 [Bug middle-end/53090] New: suboptimal ivopt xinliangli at gmail dot com
                   ` (3 preceding siblings ...)
  2014-04-02 11:34 ` rguenth at gcc dot gnu.org
@ 2014-04-02 11:50 ` rguenth at gcc dot gnu.org
  4 siblings, 0 replies; 6+ messages in thread
From: rguenth at gcc dot gnu.org @ 2014-04-02 11:50 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53090

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #4)
> Sth like
> 
> Index: gcc/c-family/c-common.c
> ===================================================================
> --- gcc/c-family/c-common.c     (revision 209018)
> +++ gcc/c-family/c-common.c     (working copy)
> @@ -4415,19 +4415,21 @@ pointer_int_sum (location_t loc, enum tr
>  
>    /* Convert the integer argument to a type the same size as sizetype
>       so the multiply won't overflow spuriously.  */
> +  enum tree_code code = MULT_EXPR;
>    if (TYPE_PRECISION (TREE_TYPE (intop)) != TYPE_PRECISION (sizetype)
>        || TYPE_UNSIGNED (TREE_TYPE (intop)) != TYPE_UNSIGNED (sizetype))
> -    intop = convert (c_common_type_for_size (TYPE_PRECISION (sizetype),
> -                                            TYPE_UNSIGNED (sizetype)),
> intop);
> +    code = WIDEN_MULT_EXPR;
>  
>    /* Replace the integer argument with a suitable product by the object
> size.
>       Do this multiplication as signed, then convert to the appropriate type
>       for the pointer operation and disregard an overflow that occurred only
>       because of the sign-extension change in the latter conversion.  */
>    {
> -    tree t = build_binary_op (loc,
> -                             MULT_EXPR, intop,
> -                             convert (TREE_TYPE (intop), size_exp), 1);
> +    tree t = fold_build2_loc (loc, code,
> +                             c_common_type_for_size (TYPE_PRECISION
> (sizetype),
> +                                                     TYPE_UNSIGNED
> (sizetype)),
> +                             intop,
> +                             convert (TREE_TYPE (intop), size_exp));
>      intop = convert (sizetype, t);
>      if (TREE_OVERFLOW_P (intop) && !TREE_OVERFLOW (t))
>        intop = build_int_cst_wide (TREE_TYPE (intop), TREE_INT_CST_LOW
> (intop),
> 
> 
> but then you notice that for example SCEV doesn't handle WIDEN_MULT_EXPR.

But for example SCEV could be teached to _create_ the WIDEN_MULT_EXPRs
in the first place ... thus analyze for example _14 instead of as

(set_scalar_evolution
  instantiated_below = 4
  (scalar = _14)
  (scalar_evolution = {perm_9(D) + (sizetype) ((long unsigned int) j_5 * 4), +,
18446744073709551612}_1))

as

(set_scalar_evolution
  instantiated_below = 4
  (scalar = _14)
  (scalar_evolution = {perm_9(D) + (sizetype) (j_5 w* 4), +,
18446744073709551612}_1))


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2014-04-02 11:50 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-04-23 17:36 [Bug middle-end/53090] New: suboptimal ivopt xinliangli at gmail dot com
2012-04-23 17:38 ` [Bug middle-end/53090] " xinliangli at gmail dot com
2014-03-29 11:04 ` [Bug target/53090] " amker.cheng at gmail dot com
2014-04-02 11:17 ` rguenth at gcc dot gnu.org
2014-04-02 11:34 ` rguenth at gcc dot gnu.org
2014-04-02 11:50 ` rguenth at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).