public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug target/94298] New: x86 duplicates loads
@ 2020-03-24 10:17 rguenth at gcc dot gnu.org
  2020-03-24 13:15 ` [Bug target/94298] " ubizjak at gmail dot com
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-03-24 10:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94298

            Bug ID: 94298
           Summary: x86 duplicates loads
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rguenth at gcc dot gnu.org
  Target Milestone: ---

For the following testcase at -O3 -fgimple (gimple testcase because the
vectorizer generated code depends on not committed patches) we somehow
duplicate the load from y:

typedef double v2df __attribute__((vector_size(16)));
typedef long v2di __attribute__((vector_size(16)));
double x[1024], y[1024];
void __GIMPLE (ssa,guessed_local(10737416))
foo ()
{
  v2df * vectp_x7;
  v2df vect__56;
  v2df vect__45;
  v2df * vectp_y3;
  v2df _12;
  v2df _13;
  unsigned int _19;
  unsigned int _24;

  __BB(2,guessed_local(10737416)):
  goto __BB3(precise(134217728));

  __BB(3,loop_header(1),guessed_local(1063004409)):
  vectp_y3_21 = __PHI (__BB2: &y, __BB3: vectp_y3_17);
  vectp_x7_6 = __PHI (__BB2: &x, __BB3: vectp_x7_20);
  _19 = __PHI (__BB2: 0u, __BB3: _24);
  vect__45_14 = __MEM <v2df> ((double *)vectp_y3_21);
  _13 = __VEC_PERM (vect__45_14, vect__45_14, _Literal (v2di) { 1l, 1l });
  _12 = __VEC_PERM (vect__45_14, vect__45_14, _Literal (v2di) { 0l, 0l });
  vect__56_7 = _12 + _13;
  __MEM <v2df> ((double *)vectp_x7_6) = vect__56_7;
  vectp_y3_17 = vectp_y3_21 + 16ul;
  vectp_x7_20 = vectp_x7_6 + 16ul;
  _24 = _19 + 1u;
  if (_24 != 512u)
    goto __BB3(adjusted(132875551));
  else
    goto __BB4(adjusted(1342177));

  __BB(4,guessed_local(10737416)):
  return;

}


results in

foo:
.LFB0:
        .cfi_startproc
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L2:
        movapd  y(%rax), %xmm1
        movapd  y(%rax), %xmm0
        addq    $16, %rax
        unpcklpd        %xmm1, %xmm1
        unpckhpd        %xmm0, %xmm0
        addpd   %xmm1, %xmm0
        movaps  %xmm0, x-16(%rax)
        cmpq    $8192, %rax
        jne     .L2
        ret

The duplication happens in IRA/LRA but I suspect either x86 costing or
operand constraints makes them think this is cheaper.  LRA is fed with

(insn 8 6 11 3 (set (reg/v:V2DF 85 [ vect__45 ])
        (mem:V2DF (plus:DI (reg:DI 86 [ ivtmp.6 ])
                (symbol_ref:DI ("y") [flags 0x2]  <var_decl 0x7f2a94b8dbd0 y>))
[1 MEM[symbol: y, index: ivtmp.6_7, offset: 0B]+0 S16 A128])) 1338
{movv2df_internal}
     (expr_list:REG_EQUIV (mem:V2DF (plus:DI (reg:DI 86 [ ivtmp.6 ])
                (symbol_ref:DI ("y") [flags 0x2]  <var_decl 0x7f2a94b8dbd0 y>))
[1 MEM[symbol: y, index: ivtmp.6_7, offset: 0B]+0 S16 A128])
        (nil)))
(insn 11 8 12 3 (set (reg:V2DF 89)
        (vec_select:V2DF (vec_concat:V4DF (reg/v:V2DF 85 [ vect__45 ])
                (reg/v:V2DF 85 [ vect__45 ]))
            (parallel [
                    (const_int 0 [0])
                    (const_int 2 [0x2])
                ]))) "t2.c":27:3 2995 {*vec_interleave_lowv2df}
     (nil))
(insn 12 11 13 3 (set (reg:V2DF 90)
        (vec_select:V2DF (vec_concat:V4DF (reg/v:V2DF 85 [ vect__45 ])
                (reg/v:V2DF 85 [ vect__45 ]))
            (parallel [
                    (const_int 1 [0x1])
                    (const_int 3 [0x3])
                ]))) "t2.c":27:3 2989 {*vec_interleave_highv2df}
     (expr_list:REG_DEAD (reg/v:V2DF 85 [ vect__45 ])
        (nil)))
(insn 13 12 14 3 (set (reg:V2DF 91 [ vect__56 ])
        (plus:V2DF (reg:V2DF 89)
            (reg:V2DF 90))) "t2.c":27:3 1519 {*addv2df3}
     (expr_list:REG_DEAD (reg:V2DF 90)
        (expr_list:REG_DEAD (reg:V2DF 89)
            (expr_list:REG_EQUIV (mem:V2DF (plus:DI (reg:DI 86 [ ivtmp.6 ])
                        (symbol_ref:DI ("x") [flags 0x2]  <var_decl
0x7f2a94b8db40 x>)) [1 MEM[symbol: x, index: ivtmp.6_7, offset: 0B]+0 S16
A128])
                (nil)))))
(insn 14 13 15 3 (set (mem:V2DF (plus:DI (reg:DI 86 [ ivtmp.6 ])
                (symbol_ref:DI ("x") [flags 0x2]  <var_decl 0x7f2a94b8db40 x>))
[1 MEM[symbol: x, index: ivtmp.6_7, offset: 0B]+0 S16 A128])
        (reg:V2DF 91 [ vect__56 ])) "t2.c":27:3 1338 {movv2df_internal}
     (expr_list:REG_DEAD (reg:V2DF 91 [ vect__56 ])
        (nil)))

and the LRA:

         Choosing alt 3 in insn 11:  (0) x  (1) 0  (2) m
{*vec_interleave_lowv2df}
      Creating newreg=92 from oldreg=89, assigning class SSE_REGS to r92
   11: r92:V2DF=vec_select(vec_concat(r92:V2DF,[r86:DI+`y']),parallel)
    Inserting insn reload before:
   26: r92:V2DF=[r86:DI+`y']
    Inserting insn reload after:
   27: r89:V2DF=r92:V2DF

and postreload CSE cannot do anything because the shuffle clobbers the
reg we loaded into (only sched2 moves things in a way that CSE would
be possible again but after sched2 there's no CSE anymore).

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug target/94298] x86 duplicates loads
  2020-03-24 10:17 [Bug target/94298] New: x86 duplicates loads rguenth at gcc dot gnu.org
@ 2020-03-24 13:15 ` ubizjak at gmail dot com
  2020-03-24 13:43 ` rguenther at suse dot de
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: ubizjak at gmail dot com @ 2020-03-24 13:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94298

--- Comment #1 from Uroš Bizjak <ubizjak at gmail dot com> ---
The situation is a bit more complicated. IRA DTRT:

    8: r85:V2DF=[r86:DI+`y']
      REG_EQUIV [r86:DI+`y']
   11: r89:V2DF=vec_select(vec_concat(r85:V2DF,r85:V2DF),parallel)
   12: r90:V2DF=vec_select(vec_concat(r85:V2DF,r85:V2DF),parallel)
      REG_DEAD r85:V2DF

Later, LRA propagates memory operand into the insn. Since the insn clobbers its
input, multiple loads are emitted:

   26: xmm1:V2DF=[ax:DI+`y']
   11: xmm1:V2DF=vec_select(vec_concat(xmm1:V2DF,[ax:DI+`y']),parallel)
   28: xmm0:V2DF=[ax:DI+`y']
   12: xmm0:V2DF=vec_select(vec_concat([ax:DI+`y'],xmm0:V2DF),parallel)

which is further "optimized" in postreload pass:

   26: xmm1:V2DF=[ax:DI+`y']
   11: xmm1:V2DF=vec_select(vec_concat(xmm1:V2DF,xmm1:V2DF),parallel)
   28: xmm0:V2DF=[ax:DI+`y']
   12: xmm0:V2DF=vec_select(vec_concat(xmm0:V2DF,xmm0:V2DF),parallel)

It looks to me that a heuristics is missing in LRA, where memory operand
shouldn't propagate into insn, if there are multiple uses of a register.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug target/94298] x86 duplicates loads
  2020-03-24 10:17 [Bug target/94298] New: x86 duplicates loads rguenth at gcc dot gnu.org
  2020-03-24 13:15 ` [Bug target/94298] " ubizjak at gmail dot com
@ 2020-03-24 13:43 ` rguenther at suse dot de
  2020-03-24 13:53 ` ubizjak at gmail dot com
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: rguenther at suse dot de @ 2020-03-24 13:43 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94298

--- Comment #2 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 24 Mar 2020, ubizjak at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94298
> 
> --- Comment #1 from Uroš Bizjak <ubizjak at gmail dot com> ---
> The situation is a bit more complicated. IRA DTRT:
> 
>     8: r85:V2DF=[r86:DI+`y']
>       REG_EQUIV [r86:DI+`y']
>    11: r89:V2DF=vec_select(vec_concat(r85:V2DF,r85:V2DF),parallel)
>    12: r90:V2DF=vec_select(vec_concat(r85:V2DF,r85:V2DF),parallel)
>       REG_DEAD r85:V2DF
> 
> Later, LRA propagates memory operand into the insn. Since the insn clobbers its
> input, multiple loads are emitted:
> 
>    26: xmm1:V2DF=[ax:DI+`y']
>    11: xmm1:V2DF=vec_select(vec_concat(xmm1:V2DF,[ax:DI+`y']),parallel)
>    28: xmm0:V2DF=[ax:DI+`y']
>    12: xmm0:V2DF=vec_select(vec_concat([ax:DI+`y'],xmm0:V2DF),parallel)
> 
> which is further "optimized" in postreload pass:
> 
>    26: xmm1:V2DF=[ax:DI+`y']
>    11: xmm1:V2DF=vec_select(vec_concat(xmm1:V2DF,xmm1:V2DF),parallel)
>    28: xmm0:V2DF=[ax:DI+`y']
>    12: xmm0:V2DF=vec_select(vec_concat(xmm0:V2DF,xmm0:V2DF),parallel)
> 
> It looks to me that a heuristics is missing in LRA, where memory operand
> shouldn't propagate into insn, if there are multiple uses of a register.

Yeah, but the odd thing is the memory doesn't actually end up in the
insn but is reloaded!  (I've filed a related PR recently where it actually
ends up in the insns but duplicated and thus code size grows but register
pressure decreases)

So I wonder whether the bug is that there is a memory alternative
in the first place?

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug target/94298] x86 duplicates loads
  2020-03-24 10:17 [Bug target/94298] New: x86 duplicates loads rguenth at gcc dot gnu.org
  2020-03-24 13:15 ` [Bug target/94298] " ubizjak at gmail dot com
  2020-03-24 13:43 ` rguenther at suse dot de
@ 2020-03-24 13:53 ` ubizjak at gmail dot com
  2020-03-24 14:05 ` rguenther at suse dot de
  2020-03-30 14:42 ` vmakarov at gcc dot gnu.org
  4 siblings, 0 replies; 6+ messages in thread
From: ubizjak at gmail dot com @ 2020-03-24 13:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94298

--- Comment #3 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to rguenther@suse.de from comment #2)

> So I wonder whether the bug is that there is a memory alternative
> in the first place?

Memory alternative should be OK, we do have insns that access memory. Perhaps
vec_interleave_high/vec_interleave_low shouldn't be used by middel end in this
particular case?

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug target/94298] x86 duplicates loads
  2020-03-24 10:17 [Bug target/94298] New: x86 duplicates loads rguenth at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2020-03-24 13:53 ` ubizjak at gmail dot com
@ 2020-03-24 14:05 ` rguenther at suse dot de
  2020-03-30 14:42 ` vmakarov at gcc dot gnu.org
  4 siblings, 0 replies; 6+ messages in thread
From: rguenther at suse dot de @ 2020-03-24 14:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94298

--- Comment #4 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 24 Mar 2020, ubizjak at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94298
> 
> --- Comment #3 from Uroš Bizjak <ubizjak at gmail dot com> ---
> (In reply to rguenther@suse.de from comment #2)
> 
> > So I wonder whether the bug is that there is a memory alternative
> > in the first place?
> 
> Memory alternative should be OK, we do have insns that access memory. Perhaps
> vec_interleave_high/vec_interleave_low shouldn't be used by middel end in this
> particular case?

It's going through the generic vec_perm_const expander.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug target/94298] x86 duplicates loads
  2020-03-24 10:17 [Bug target/94298] New: x86 duplicates loads rguenth at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2020-03-24 14:05 ` rguenther at suse dot de
@ 2020-03-30 14:42 ` vmakarov at gcc dot gnu.org
  4 siblings, 0 replies; 6+ messages in thread
From: vmakarov at gcc dot gnu.org @ 2020-03-30 14:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94298

--- Comment #5 from Vladimir Makarov <vmakarov at gcc dot gnu.org> ---
I think that the root of the problem is that IRA on register cost calculation
sub-pass chooses memory for the pseudo.

It happens because the current algorithm (which is just an adoption of old
recglass.c) considers insn alternatives for each operand independently of other
operands. Roughly speaking, if we have insn with 2 alternatives and constraints
'r,m' for one operand and 'm,r' for another operand, the algorithm considers
'm' for the both operands (or 'r' for the both operands) are perfect (zero
cost) match.

I've tried to modify the current algorithm by small changes without a success. 
The full changes would result in what I have on ira-select branch.  ira-select
chooses an insn alternative and then calculate the costs taking this choice
into account.  Although move insns are a real challenge for such algorithm
because they have too many alternatives.

Switching to ira-select branch creates a lot of new testsuite failures on some
targets for GCC tests expecting a specific code generation and actually does
not improve overall SPEC rates on x86-64.

I am thinking about combination of the two approaches but it is definitely not
work which could be done for GCC10.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2020-03-30 14:42 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-24 10:17 [Bug target/94298] New: x86 duplicates loads rguenth at gcc dot gnu.org
2020-03-24 13:15 ` [Bug target/94298] " ubizjak at gmail dot com
2020-03-24 13:43 ` rguenther at suse dot de
2020-03-24 13:53 ` ubizjak at gmail dot com
2020-03-24 14:05 ` rguenther at suse dot de
2020-03-30 14:42 ` vmakarov at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).