public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads
       [not found] <bug-70053-4@http.gcc.gnu.org/bugzilla/>
@ 2020-05-21  5:29 ` luoxhu at gcc dot gnu.org
  2020-05-22 14:59 ` wschmidt at gcc dot gnu.org
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 9+ messages in thread
From: luoxhu at gcc dot gnu.org @ 2020-05-21  5:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053

luoxhu at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bergner at gcc dot gnu.org,
                   |                            |luoxhu at gcc dot gnu.org

--- Comment #4 from luoxhu at gcc dot gnu.org ---
Is this just the difference of O3 and O2? Since O3 is OK, maybe this bug is not
effective?

$ /opt/at10.0/bin/gcc -O3 -S pr70053.c
$ cat pr70053.s
        .file   "pr70053.c"
        .abiversion 2
        .section        ".text"
        .align 2
        .p2align 4,,15
        .globl D256_add_finite
        .type   D256_add_finite, @function
D256_add_finite:
        dcmpuq 7,4,6
        beq 7,.L3
        fmr 7,3
        fmr 6,2
        fmr 3,7
        fmr 2,6
        blr
        .p2align 4,,15
.L3:
        fmr 5,7
        fmr 4,6
        fmr 3,7
        fmr 2,6
        blr
        .long 0
        .byte 0,0,0,0,0,0,0,0
        .size   D256_add_finite,.-D256_add_finite
        .ident  "GCC: (GNU) 6.4.1 20170720 (Advance-Toolchain-at10.0) IBM AT 10
branch, based on subversion id 250395."
        .section        .note.GNU-stack,"",@progbits

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads
       [not found] <bug-70053-4@http.gcc.gnu.org/bugzilla/>
  2020-05-21  5:29 ` [Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads luoxhu at gcc dot gnu.org
@ 2020-05-22 14:59 ` wschmidt at gcc dot gnu.org
  2020-05-25  8:06 ` luoxhu at gcc dot gnu.org
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 9+ messages in thread
From: wschmidt at gcc dot gnu.org @ 2020-05-22 14:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053

Bill Schmidt <wschmidt at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2020-05-22

--- Comment #5 from Bill Schmidt <wschmidt at gcc dot gnu.org> ---
I'd like to understand why the difference between -O2 and -O3 exists.  We
shouldn't generate this kind of nasty store-load at -O2.

Confirmed, BTW. :)

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads
       [not found] <bug-70053-4@http.gcc.gnu.org/bugzilla/>
  2020-05-21  5:29 ` [Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads luoxhu at gcc dot gnu.org
  2020-05-22 14:59 ` wschmidt at gcc dot gnu.org
@ 2020-05-25  8:06 ` luoxhu at gcc dot gnu.org
  2020-05-25  9:50 ` luoxhu at gcc dot gnu.org
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 9+ messages in thread
From: luoxhu at gcc dot gnu.org @ 2020-05-25  8:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053

--- Comment #6 from luoxhu at gcc dot gnu.org ---
"-O2 -ftree-slp-vectorize" could also generate the expected simple fmrs.

Reason is pass_cselim will transform conditional stores into unconditional ones
with PHI instructions when vectorization and if-conversion is
enabled(gcc/tree-ssa-phiopt.c:2482).

pr70053.c.108t.cdce:
D256_add_finite (_Decimal128 a, _Decimal128 b, _Decimal128 c)
{
  struct TDx2_t D.2914;

  <bb 2> [local count: 1073741824]:
  if (b_4(D) == c_5(D))
    goto <bb 3>; [34.00%]
  else
    goto <bb 4>; [66.00%]

  <bb 3> [local count: 365072224]:
  D.2914.td0 = c_5(D);
  D.2914.td1 = c_5(D);
  goto <bb 5>; [100.00%]

  <bb 4> [local count: 708669601]:
  D.2914.td0 = a_3(D);
  D.2914.td1 = b_4(D);

  <bb 5> [local count: 1073741824]:
  return D.2914;

}

=> pr70053.c.109t.cselim:

D256_add_finite (_Decimal128 a, _Decimal128 b, _Decimal128 c)
{
  struct TDx2_t D.2914;
  _Decimal128 cstore_10;
  _Decimal128 cstore_11;

  <bb 2> [local count: 1073741824]:
  if (b_4(D) == c_5(D))
    goto <bb 4>; [34.00%]
  else
    goto <bb 3>; [66.00%]

  <bb 3> [local count: 708669601]:

  <bb 4> [local count: 1073741824]:
  # cstore_10 = PHI <c_5(D)(2), a_3(D)(3)>
  # cstore_11 = PHI <c_5(D)(2), b_4(D)(3)>
  D.2914.td1 = cstore_11;
  D.2914.td0 = cstore_10;
  return D.2914;

}

Then at expand pass, the PHI instruction "cstore_10 = PHI <c_5(D)(2),
a_3(D)(3)>" will be expanded to move for "-O2 -ftree-slp-vectorize". If no such
PHI generated, bb3 and bb4 in pr70053.c.108t.cdce will be expanded to
STORE/LOAD with TD->DI conversion, causing a lot st/ld conversion finally.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads
       [not found] <bug-70053-4@http.gcc.gnu.org/bugzilla/>
                   ` (2 preceding siblings ...)
  2020-05-25  8:06 ` luoxhu at gcc dot gnu.org
@ 2020-05-25  9:50 ` luoxhu at gcc dot gnu.org
  2020-05-25 17:22 ` segher at gcc dot gnu.org
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 9+ messages in thread
From: luoxhu at gcc dot gnu.org @ 2020-05-25  9:50 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053

luoxhu at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |segher at gcc dot gnu.org

--- Comment #7 from luoxhu at gcc dot gnu.org ---
When expanding "D.2914.td0 = c_5(D);" in expand_assignment (to=<component_ref
0x7ffff55909c0>, from=<ssa_name 0x7ffff5390b40>, nontemporal=false) at
../../gcc-master/gcc/expr.c:5058

1) expr.c:5158:   to_rtx = expand_expr (tem, NULL_RTX, VOIDmode, EXPAND_WRITE);

gdb pr to_rtx
(mem/c:BLK (reg/f:DI 112 virtual-stack-vars) [2 D.2914+0 S32 A128])

...

2) expr.c:5167:   to_rtx = adjust_address (to_rtx, mode1, 0);

p mode1
$86 = E_TDmode
(gdb) pr to_rtx
(mem/c:TD (reg/f:DI 112 virtual-stack-vars) [2 D.2914+0 S16 A128])

to_rtx is generated with address conversion from DImode to TDmode here.

...

3) expr.c:5374:   result = store_field (to_rtx, bitsize,
bitpos,bitregion_start, bitregion_end, mode1, from, get_alias_set (to),
nontemporal, reversep);

then the assignment instruction is generated as below:

(insn 11 10 12 4 (set (mem/c:TD (reg/f:DI 112 virtual-stack-vars) [1
D.2914.td0+0 S16 A128]) (reg/v:TD 121 [ c ])) "pr70053.c":20:14 -1 (nil))


So if we need remove the redundant store/load in expand, the conversion from
DImode to TDmode should be avoided for this case when using virtual-stack-vars
registers. (For PR65421, there are similar DImode to DFmode conversion). 

pr70053.c.236r.expand with -O2:
    1: NOTE_INSN_DELETED
    6: NOTE_INSN_BASIC_BLOCK 2
    2: r119:TD=%2:TD
    3: r120:TD=%4:TD
    4: r121:TD=%6:TD
    5: NOTE_INSN_FUNCTION_BEG
    8: r122:CCFP=cmp(r120:TD,r121:TD)
    9: pc={(r122:CCFP!=0)?L16:pc}
      REG_BR_PROB 708669604
   10: NOTE_INSN_BASIC_BLOCK 4
   11: [r112:DI]=r121:TD
   12: r123:DI=r112:DI+0x10
   13: [r123:DI]=r121:TD
   14: pc=L21
   15: barrier
   16: L16:
   17: NOTE_INSN_BASIC_BLOCK 5
   18: [r112:DI]=r119:TD
   19: r124:DI=r112:DI+0x10
   20: [r124:DI]=r120:TD
   21: L21:
   22: NOTE_INSN_BASIC_BLOCK 6
   23: r125:TD=[r112:DI]
   24: r127:DI=r112:DI+0x10
   25: r126:TD=[r127:DI]
   26: r117:TD=r125:TD
   27: r118:TD=r126:TD
   31: %2:TD=r117:TD
   32: %4:TD=r118:TD
   33: use %2:TD
   34: use %4:TD

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads
       [not found] <bug-70053-4@http.gcc.gnu.org/bugzilla/>
                   ` (3 preceding siblings ...)
  2020-05-25  9:50 ` luoxhu at gcc dot gnu.org
@ 2020-05-25 17:22 ` segher at gcc dot gnu.org
  2020-06-01  2:40 ` luoxhu at gcc dot gnu.org
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 9+ messages in thread
From: segher at gcc dot gnu.org @ 2020-05-25 17:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053

--- Comment #8 from Segher Boessenkool <segher at gcc dot gnu.org> ---
I see no conversion there?

But, why does it it store to memory at all?

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads
       [not found] <bug-70053-4@http.gcc.gnu.org/bugzilla/>
                   ` (4 preceding siblings ...)
  2020-05-25 17:22 ` segher at gcc dot gnu.org
@ 2020-06-01  2:40 ` luoxhu at gcc dot gnu.org
  2021-01-31  1:53 ` michaeljclark at mac dot com
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 9+ messages in thread
From: luoxhu at gcc dot gnu.org @ 2020-06-01  2:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053

--- Comment #9 from luoxhu at gcc dot gnu.org ---
(In reply to Segher Boessenkool from comment #8)
> I see no conversion there?
> 
> But, why does it it store to memory at all?

Yes, no conversion for this case, only adjust_address to TImode. mem/c:TD means
a MEM cannot trap.

Reason of store to memory:
D.2914 is a local struct variable here, seems we need do some optimization to
sink the D.2914.td0 and D.2914.td1 from BB3&BB4 to BB5 to avoid store/load on
stack? Or if there already exists some pass in Gimple? Or should this be
optimized after expander by some new pass like store sink?

O2/pr70053.c.234t.optimized:
D256_add_finite (_Decimal128 a, _Decimal128 b, _Decimal128 c)
{
  struct TDx2_t D.2914;

  <bb 2> [local count: 1073741824]:
  if (b_4(D) == c_5(D))
    goto <bb 3>; [34.00%]
  else
    goto <bb 4>; [66.00%]

  <bb 3> [local count: 365072224]:
  D.2914.td0 = c_5(D);
  D.2914.td1 = c_5(D);
  goto <bb 5>; [100.00%]

  <bb 4> [local count: 708669601]:
  D.2914.td0 = a_3(D);
  D.2914.td1 = b_4(D);

  <bb 5> [local count: 1073741824]:
  return D.2914;

}

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads
       [not found] <bug-70053-4@http.gcc.gnu.org/bugzilla/>
                   ` (5 preceding siblings ...)
  2020-06-01  2:40 ` luoxhu at gcc dot gnu.org
@ 2021-01-31  1:53 ` michaeljclark at mac dot com
  2021-02-02  0:01 ` segher at gcc dot gnu.org
  2022-10-28  6:56 ` LpSolit at gmail dot com
  8 siblings, 0 replies; 9+ messages in thread
From: michaeljclark at mac dot com @ 2021-01-31  1:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053

Michael Clark <michaeljclark at mac dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |michaeljclark at mac dot com

--- Comment #10 from Michael Clark <michaeljclark at mac dot com> ---
another data point. I am seeing something similar on x86-64.
SysV x86-64 ABI specifies that _Decimal128 is to be passed in
xmm regs so I believe the stack stores here are redundant.

; cat > dec1.c << EOF
_Decimal128 add_d(_Decimal128 a, _Decimal128 b) { return a + b; }
EOF
; gcc -O2 -S -masm=intel dec1.c 
; cat dec1.s
add_d:
.LFB0:
        .cfi_startproc
        endbr64
        sub     rsp, 40
        .cfi_def_cfa_offset 48
        movaps  XMMWORD PTR [rsp], xmm0
        movaps  XMMWORD PTR 16[rsp], xmm1
        call    __bid_addtd3@PLT
        movaps  XMMWORD PTR [rsp], xmm0
        add     rsp, 40
        .cfi_def_cfa_offset 8
        ret
        .cfi_endproc

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads
       [not found] <bug-70053-4@http.gcc.gnu.org/bugzilla/>
                   ` (6 preceding siblings ...)
  2021-01-31  1:53 ` michaeljclark at mac dot com
@ 2021-02-02  0:01 ` segher at gcc dot gnu.org
  2022-10-28  6:56 ` LpSolit at gmail dot com
  8 siblings, 0 replies; 9+ messages in thread
From: segher at gcc dot gnu.org @ 2021-02-02  0:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053

--- Comment #11 from Segher Boessenkool <segher at gcc dot gnu.org> ---
Please open a separate bug for x86 problems.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads
       [not found] <bug-70053-4@http.gcc.gnu.org/bugzilla/>
                   ` (7 preceding siblings ...)
  2021-02-02  0:01 ` segher at gcc dot gnu.org
@ 2022-10-28  6:56 ` LpSolit at gmail dot com
  8 siblings, 0 replies; 9+ messages in thread
From: LpSolit at gmail dot com @ 2022-10-28  6:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053

Jiu Fu Guo <guojiufu at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #12 from Jiu Fu Guo <guojiufu at gcc dot gnu.org> ---
The generated code is a lot better on the trunk (both -O2 and -O3):
        .cfi_startproc
        dcmpuq 0,4,6
        beq 0,.L4
        blr
        .p2align 4,,15
.L4:
        fmr 5,7
        fmr 4,6
        fmr 3,7
        fmr 2,6
        blr


And the expanded RTL like below, and cse/fwprop/dse optimize it as expected. 
    2: r119:TD=%2:TD
    3: r120:TD=%4:TD
    4: r121:TD=%6:TD
    5: NOTE_INSN_FUNCTION_BEG
   10: r122:CCFP=cmp(r120:TD,r121:TD)
   11: pc={(r122:CCFP!=0)?L13:pc}
      REG_BR_PROB 708669604
   12: NOTE_INSN_BASIC_BLOCK 4
    6: r120:TD=r121:TD
    7: r119:TD=r121:TD
   13: L13:
   14: NOTE_INSN_BASIC_BLOCK 5
   15: r123:DI=r112:DI+0x10
   16: [r123:DI]=r120:TD
   17: [r112:DI]=r119:TD
   18: r124:TD=[r112:DI]
   19: r126:DI=r112:DI+0x10
   20: r125:TD=[r126:DI]
   21: r117:TD=r124:TD
   22: r118:TD=r125:TD
   26: %2:TD=r117:TD
   27: %4:TD=r118:TD
   28: use %2:TD
   29: use %4:TD

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2024-01-14 21:55 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <bug-70053-4@http.gcc.gnu.org/bugzilla/>
2020-05-21  5:29 ` [Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads luoxhu at gcc dot gnu.org
2020-05-22 14:59 ` wschmidt at gcc dot gnu.org
2020-05-25  8:06 ` luoxhu at gcc dot gnu.org
2020-05-25  9:50 ` luoxhu at gcc dot gnu.org
2020-05-25 17:22 ` segher at gcc dot gnu.org
2020-06-01  2:40 ` luoxhu at gcc dot gnu.org
2021-01-31  1:53 ` michaeljclark at mac dot com
2021-02-02  0:01 ` segher at gcc dot gnu.org
2022-10-28  6:56 ` LpSolit at gmail dot com

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).