public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug rtl-optimization/96031] New: suboptimal codegen for store low 16-bits value
@ 2020-07-02 12:45 zhongyunde at tom dot com
2020-07-06 11:01 ` [Bug rtl-optimization/96031] " zhongyunde at tom dot com
` (4 more replies)
0 siblings, 5 replies; 6+ messages in thread
From: zhongyunde at tom dot com @ 2020-07-02 12:45 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96031
Bug ID: 96031
Summary: suboptimal codegen for store low 16-bits value
Product: gcc
Version: 8.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: zhongyunde at tom dot com
Target Milestone: ---
For the following code, as instruction strh only store the low 16-bits value,
so the 'and w2, w2, 65535 ' is redundant.
test base on the ARM64 gcc 8.2 on https://gcc.godbolt.org/, so get complicated
assemble.
typedef unsigned int UINT32;
typedef unsigned short UINT16;
UINT16 array[12];
void foo (UINT32 len, UINT32 step)
{
UINT32 index = 1;
for (index = 1 ; index < len; index++ )
{
array[index] = index * step;
}
}
// the assemble of kernel loop body --------------------------
b .L4 //
.L6:
add x3, x3, 2 // ivtmp.6, ivtmp.6,
.L4:
strh w2, [x4, 2] // ivtmp.4, MEM[base: _2, offset: 2B]
add w2, w1, w2 // tmp105, _12, ivtmp.4
and w2, w2, 65535 // ivtmp.4, tmp105 ????
cmp x3, x0 // ivtmp.6, _23
mov x4, x3 // ivtmp.6, ivtmp.6
bne .L6 //,
^ permalink raw reply [flat|nested] 6+ messages in thread
* [Bug rtl-optimization/96031] suboptimal codegen for store low 16-bits value
2020-07-02 12:45 [Bug rtl-optimization/96031] New: suboptimal codegen for store low 16-bits value zhongyunde at tom dot com
@ 2020-07-06 11:01 ` zhongyunde at tom dot com
2020-07-19 14:32 ` amker at gcc dot gnu.org
` (3 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: zhongyunde at tom dot com @ 2020-07-06 11:01 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96031
--- Comment #1 from zhongyunde at tom dot com <zhongyunde at tom dot com> ---
this may can be enhance by ivopts.
If the case adjusted as following, then the 'and w2, w2, 65535 ' will
disappear.
typedef unsigned int UINT32;
typedef unsigned short UINT16;
UINT16 array[12];
void foo (UINT32 len, UINT32 step)
{
UINT32 index = 0;
UINT32 sum = 0;
for (index = 0; index < len; index++ )
{
array[index] = sum;
sum += step;
}
}
// the assemble of kernel loop body --------------------------
.L9:
add x2, x2, 2 // ivtmp.6, ivtmp.6,
.L3:
strh w3, [x4] // sum, MEM[base: _12, offset: 0B]
cmp x2, x0 // ivtmp.6, _22
add w3, w3, w1 // sum, sum, step
mov x4, x2 // ivtmp.6, ivtmp.6
bne .L9 //,
^ permalink raw reply [flat|nested] 6+ messages in thread
* [Bug rtl-optimization/96031] suboptimal codegen for store low 16-bits value
2020-07-02 12:45 [Bug rtl-optimization/96031] New: suboptimal codegen for store low 16-bits value zhongyunde at tom dot com
2020-07-06 11:01 ` [Bug rtl-optimization/96031] " zhongyunde at tom dot com
@ 2020-07-19 14:32 ` amker at gcc dot gnu.org
2020-07-20 12:25 ` zhongyunde at tom dot com
` (2 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: amker at gcc dot gnu.org @ 2020-07-19 14:32 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96031
--- Comment #2 from bin cheng <amker at gcc dot gnu.org> ---
Interesting case, I see two issues in generated asm. One is the unnecessary
bitwise and, the other is allocating different registers for induction variable
and the base address. However, looks like neither issue is caused by ivopts.
Check the dump:
431 <bb 4> [local count: 105119324]:
432 _12 = (short unsigned int) step_8(D);
433 ivtmp.10_11 = (unsigned long) &array;
434 _18 = len_7(D) + 4294967294;
435 _19 = (unsigned long) _18;
436 _20 = _19 * 2;
437 _21 = (unsigned long) &array;
438 _22 = _21 + 2;
439 _23 = _20 + _22;
440
441 <bb 5> [local count: 955630224]:
442 # ivtmp.8_15 = PHI <_12(4), ivtmp.8_5(6)>
443 # ivtmp.10_16 = PHI <ivtmp.10_11(4), ivtmp.10_4(6)>
444 _3 = ivtmp.8_15;
445 _2 = (void *) ivtmp.10_16;
446 MEM[base: _2, offset: 2B] = _3;
447 ivtmp.8_5 = ivtmp.8_15 + _12;
448 ivtmp.10_4 = ivtmp.10_16 + 2;
449 if (ivtmp.10_4 != _23)
450 goto <bb 6>; [89.00%]
451 else
452 goto <bb 8>; [11.00%]
453
454 <bb 8> [local count: 105119324]:
455 goto <bb 3>; [100.00%]
456
457 <bb 6> [local count: 850510900]:
458 goto <bb 5>; [100.00%]
As far as I can tell, it's optimal.
The register allocation issue is introduced by rtl PRE, apparently we should
not save the add 2 instruction in the last iteration with a false dependence
which is more harmful.
As for ivopt, I can see a minor improvement by replacing != exit condition with
<=, thus saving add 2 instruction computing _22, which happens to "disable" the
wrong PRE transformation.
Ah, I see it's already classified as rtl-optimization.
Thanks
^ permalink raw reply [flat|nested] 6+ messages in thread
* [Bug rtl-optimization/96031] suboptimal codegen for store low 16-bits value
2020-07-02 12:45 [Bug rtl-optimization/96031] New: suboptimal codegen for store low 16-bits value zhongyunde at tom dot com
2020-07-06 11:01 ` [Bug rtl-optimization/96031] " zhongyunde at tom dot com
2020-07-19 14:32 ` amker at gcc dot gnu.org
@ 2020-07-20 12:25 ` zhongyunde at tom dot com
2020-08-26 4:04 ` zhongyunde at tom dot com
2021-08-22 9:00 ` pinskia at gcc dot gnu.org
4 siblings, 0 replies; 6+ messages in thread
From: zhongyunde at tom dot com @ 2020-07-20 12:25 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96031
--- Comment #3 from zhongyunde at tom dot com <zhongyunde at tom dot com> ---
I find there is some different between the two cases during in ivopts.
For the 2nd case, a UINT32 type iv sum is choosed
<bb 3> [local count: 955630224]:
# sum_15 = PHI <0(5), sum_9(6)>
# ivtmp.10_17 = PHI <ivtmp.10_3(5), ivtmp.10_4(6)>
_2 = (short unsigned int) sum_15;
_1 = _2;
_11 = (void *) ivtmp.10_17;
MEM[base: _11, offset: 0B] = _1;
sum_9 = step_8(D) + sum_15;
ivtmp.10_4 = ivtmp.10_17 + 2;
if (ivtmp.10_4 != _22)
goto <bb 6>; [89.00%]
For the 1st case, a 'short unsigned int type' ivtmp.8 is choosed as your dump
showed, and there is no UINT32 type candidate with Step step.
typedef unsigned int UINT32;
typedef unsigned short UINT16;
UINT16 array[12];
void foo (UINT32 len, UINT32 step)
{
UINT32 index = 0;
UINT32 sum = 0;
for (index = 0; index < len; index++ )
{
sum = index * step;
array[index] = sum;
}
}
I tried to add a UINT32 type temporary sum as above case (the 3rd case), then
modify the gcc to add an UINT32 type candidate variable and adjust the cost to
choose the Candidate variable (do the similar things as the 2nd case in ivopt),
then we can also optimize the 'and w2, w2, 65535' insn.
But above method is not conformed to the implementation method of ivopt, may be
we need extend an UINT32 candidate variable base 'on short unsigned int' IV
struct ?
===== the change of gcc to add UINT32 type candidate variable
==================
@@ -3389,7 +3389,7 @@ add_iv_candidate_for_bivs (struct ivopts_data *data)
EXECUTE_IF_SET_IN_BITMAP (data->relevant, 0, i, bi)
{
iv = ver_info (data, i)->iv;
- if (iv && iv->biv_p && !integer_zerop (iv->step))
+ if (iv && !integer_zerop (iv->step))
add_iv_candidate_for_biv (data, iv);
}
}
^ permalink raw reply [flat|nested] 6+ messages in thread
* [Bug rtl-optimization/96031] suboptimal codegen for store low 16-bits value
2020-07-02 12:45 [Bug rtl-optimization/96031] New: suboptimal codegen for store low 16-bits value zhongyunde at tom dot com
` (2 preceding siblings ...)
2020-07-20 12:25 ` zhongyunde at tom dot com
@ 2020-08-26 4:04 ` zhongyunde at tom dot com
2021-08-22 9:00 ` pinskia at gcc dot gnu.org
4 siblings, 0 replies; 6+ messages in thread
From: zhongyunde at tom dot com @ 2020-08-26 4:04 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96031
--- Comment #4 from zhongyunde at tom dot com <zhongyunde at tom dot com> ---
> As for ivopt, I can see a minor improvement by replacing != exit condition
> with <=, thus saving add 2 instruction computing _22, which happens to
> "disable" the wrong PRE transformation.
>
I take a look at the function may_eliminate_iv, now iv_elimination_compare
will only return EQ_EXPR or NE_EXPR, so do you mean to do some extend for this
case?
5411 *bound = fold_convert (TREE_TYPE (cand->iv->base),
5412 aff_combination_to_tree (&bnd));
5413 *comp = iv_elimination_compare (data, use);
5414
^ permalink raw reply [flat|nested] 6+ messages in thread
* [Bug rtl-optimization/96031] suboptimal codegen for store low 16-bits value
2020-07-02 12:45 [Bug rtl-optimization/96031] New: suboptimal codegen for store low 16-bits value zhongyunde at tom dot com
` (3 preceding siblings ...)
2020-08-26 4:04 ` zhongyunde at tom dot com
@ 2021-08-22 9:00 ` pinskia at gcc dot gnu.org
4 siblings, 0 replies; 6+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-08-22 9:00 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96031
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Severity|normal |enhancement
See Also| |https://gcc.gnu.org/bugzill
| |a/show_bug.cgi?id=57231
Status|UNCONFIRMED |NEW
Last reconfirmed| |2021-08-22
Ever confirmed|0 |1
--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Confirmed. There might be a few others like this too.
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2021-08-22 9:00 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-02 12:45 [Bug rtl-optimization/96031] New: suboptimal codegen for store low 16-bits value zhongyunde at tom dot com
2020-07-06 11:01 ` [Bug rtl-optimization/96031] " zhongyunde at tom dot com
2020-07-19 14:32 ` amker at gcc dot gnu.org
2020-07-20 12:25 ` zhongyunde at tom dot com
2020-08-26 4:04 ` zhongyunde at tom dot com
2021-08-22 9:00 ` pinskia at gcc dot gnu.org
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).