public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug rtl-optimization/49028] New: Missed optimization of pointer arithmetic
@ 2011-05-17 17:32 piotr.wyderski at gmail dot com
2011-05-17 17:40 ` [Bug rtl-optimization/49028] " piotr.wyderski at gmail dot com
` (3 more replies)
0 siblings, 4 replies; 5+ messages in thread
From: piotr.wyderski at gmail dot com @ 2011-05-17 17:32 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49028
Summary: Missed optimization of pointer arithmetic
Product: gcc
Version: 4.6.0
Status: UNCONFIRMED
Severity: minor
Priority: P3
Component: rtl-optimization
AssignedTo: unassigned@gcc.gnu.org
ReportedBy: piotr.wyderski@gmail.com
The following examples come from x64, but I believe the problem
shows on other architectures too.
I have implemented an object recycler based on a circular buffer of pointers
with a single cursor. N is the capacity of the buffer; powers of two are highly
preferred, so let we assume N = 16.
template <unsigned int N> struct R {
void* m_Data[N];
void** m_Cursor;
void xxx_release(void* p) __attribute__((__noinline__));
};
template <unsigned int N> void R<N>::xxx_release(void* p) {
m_Cursor = m_Data + ((m_Cursor + 1 - m_Data) % N);
*m_Cursor = p;
}
int main(int argc, char *argv[]) {
R<16> rrr;
rrr.xxx_release(0);
return 0;
}
This generates (-O3 -msse -msse2 -msse4.2 -mfpmath=sse -march=core2
-mtune=core2):
000000000041a910 <_ZN1RILj16EE11xxx_releaseEPv>:
41a910: 48 8b 97 80 00 00 00 mov 0x80(%rdi),%rdx
41a917: 48 83 c2 08 add $0x8,%rdx
41a91b: 48 29 fa sub %rdi,%rdx
41a91e: 48 89 d0 mov %rdx,%rax
41a921: 48 c1 fa 3f sar $0x3f,%rdx
41a925: 48 c1 ea 3c shr $0x3c,%rdx
41a929: 48 c1 f8 03 sar $0x3,%rax
41a92d: 48 01 d0 add %rdx,%rax
41a930: 83 e0 0f and $0xf,%eax
41a933: 48 29 d0 sub %rdx,%rax
41a936: 48 8d 14 c7 lea (%rdi,%rax,8),%rdx
41a93a: 48 89 34 c7 mov %rsi,(%rdi,%rax,8)
41a93e: 48 89 97 80 00 00 00 mov %rdx,0x80(%rdi)
41a945: c3 retq
The sequence is far from being optimal. The compiler should not move pointer
arithmetic to the type-independent integer domain (i.e. were (p + 1 - p) == 1),
but notice that the actual increment, whis is M = sizeof(void*), is a power of
2, in this case 8, so the final modulo result in integer domain for (p + 1) % N
will be the same as (((char*) p) + M) % (N * M). To cut a long story short:
instead of the above it should just generate:
000000000041a910 <_ZN1RILj16EE11xxx_releaseEPv>:
mov 0x80(%rdi),%rdx
add $0x08, %rdx
sub %rdi, %rdx
and $0x7f, %rdx
add %rdi, %rdx
mov %rsi, (%rdx)
mov %rdx, 0x80(%rdi)
retq
I'm not sure I understand what is GCC trying to achieve with the shifts.
Secondly, my example above uses only simple addressing modes, but GCC
uses the most complex mode twice: in lea and in the subsequent mov.
Complex lea has 3 cycles of latency on SandyBridge, according to the
Intel Optimization Manual, p. 3.5.1.3, so it should best be avoided.
^ permalink raw reply [flat|nested] 5+ messages in thread
* [Bug rtl-optimization/49028] Missed optimization of pointer arithmetic
2011-05-17 17:32 [Bug rtl-optimization/49028] New: Missed optimization of pointer arithmetic piotr.wyderski at gmail dot com
@ 2011-05-17 17:40 ` piotr.wyderski at gmail dot com
2011-05-18 10:18 ` rguenth at gcc dot gnu.org
` (2 subsequent siblings)
3 siblings, 0 replies; 5+ messages in thread
From: piotr.wyderski at gmail dot com @ 2011-05-17 17:40 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49028
--- Comment #1 from Piotr Wyderski <piotr.wyderski at gmail dot com> 2011-05-17 17:24:03 UTC ---
If I change the function to:
template <unsigned int N> void R<N>::xxx_release(void* p) {
char* q = reinterpret_cast<char*>(m_Cursor);
char* b = reinterpret_cast<char*>(m_Data);
q = ((q + sizeof(void*)) - b) % (N * sizeof(void*)) + b;
m_Cursor = reinterpret_cast<void**>(q);
*m_Cursor = p;
}
Then the generated code is:
000000000041a910 <_ZN1RILj16EE11xxx_releaseEPv>:
41a910: 48 8b 87 80 00 00 00 mov 0x80(%rdi),%rax
41a917: 48 83 c0 08 add $0x8,%rax
41a91b: 48 29 f8 sub %rdi,%rax
41a91e: 83 e0 7f and $0x7f,%eax
41a921: 48 01 f8 add %rdi,%rax
41a924: 48 89 87 80 00 00 00 mov %rax,0x80(%rdi)
41a92b: 48 89 30 mov %rsi,(%rax)
41a92e: c3 retq
which is astonishingly close to my hand-made assembly...
^ permalink raw reply [flat|nested] 5+ messages in thread
* [Bug rtl-optimization/49028] Missed optimization of pointer arithmetic
2011-05-17 17:32 [Bug rtl-optimization/49028] New: Missed optimization of pointer arithmetic piotr.wyderski at gmail dot com
2011-05-17 17:40 ` [Bug rtl-optimization/49028] " piotr.wyderski at gmail dot com
@ 2011-05-18 10:18 ` rguenth at gcc dot gnu.org
2011-05-18 13:15 ` piotr.wyderski at gmail dot com
2021-08-14 20:53 ` [Bug tree-optimization/49028] " pinskia at gcc dot gnu.org
3 siblings, 0 replies; 5+ messages in thread
From: rguenth at gcc dot gnu.org @ 2011-05-18 10:18 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49028
Richard Guenther <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|UNCONFIRMED |NEW
Last reconfirmed| |2011.05.18 09:40:28
Ever Confirmed|0 |1
--- Comment #2 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-05-18 09:40:28 UTC ---
I think we are confused by the division by the element size which is of course
what the language semantic requires. Thus, we start with
(void) ((this)->m_Cursor = (void * *) &(this)->m_Data + (long unsigned int)
((long unsigned int) ((((long int) ((this)->m_Cursor + 8) - (long int)
&(this)->m_Data) /[ex] 8) % 16) * 8))
and because of the modulus we can't fold it. In the IL we end up with
<bb 2>:
D.2101_3 = this_1(D)->m_Cursor;
D.2102_4 = D.2101_3 + 8;
D.2103_5 = (long int) D.2102_4;
D.2100_6 = &this_1(D)->m_Data;
D.2104_7 = (long int) D.2100_6;
D.2105_8 = D.2103_5 - D.2104_7;
D.2106_9 = D.2105_8 /[ex] 8;
D.2107_10 = D.2106_9 % 16;
D.2108_11 = (long unsigned int) D.2107_10;
D.2110_13 = &this_1(D)->m_Data[D.2108_11]{lb: 0 sz: 8};
this_1(D)->m_Cursor = D.2110_13;
MEM[(void * *)this_1(D)].m_Data[D.2108_11]{lb: 0 sz: 8} = p_15(D);
return;
where it requires quite some elaboration to combine the D.2108_11 index
computation with the address generation via the array-reference.
Maybe we can do some generic clever tricks to (A /[ex] CST1) % CST2?
We'd like to re-associate it somehow.
^ permalink raw reply [flat|nested] 5+ messages in thread
* [Bug rtl-optimization/49028] Missed optimization of pointer arithmetic
2011-05-17 17:32 [Bug rtl-optimization/49028] New: Missed optimization of pointer arithmetic piotr.wyderski at gmail dot com
2011-05-17 17:40 ` [Bug rtl-optimization/49028] " piotr.wyderski at gmail dot com
2011-05-18 10:18 ` rguenth at gcc dot gnu.org
@ 2011-05-18 13:15 ` piotr.wyderski at gmail dot com
2021-08-14 20:53 ` [Bug tree-optimization/49028] " pinskia at gcc dot gnu.org
3 siblings, 0 replies; 5+ messages in thread
From: piotr.wyderski at gmail dot com @ 2011-05-18 13:15 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49028
--- Comment #3 from Piotr Wyderski <piotr.wyderski at gmail dot com> 2011-05-18 12:02:58 UTC ---
(In reply to comment #2)
> Maybe we can do some generic clever tricks to (A /[ex] CST1) % CST2?
> We'd like to re-associate it somehow.
Wouldn't it be possible to do add the following subtree replacement rule?
* %
/ \ / \
% B A *
/ \ => / \
/ \ C B C
A B
It works in contexts where it can be proven that for_all a.(a % b) == 0,
which by definition includes b-aligned pointers.
^ permalink raw reply [flat|nested] 5+ messages in thread
* [Bug tree-optimization/49028] Missed optimization of pointer arithmetic
2011-05-17 17:32 [Bug rtl-optimization/49028] New: Missed optimization of pointer arithmetic piotr.wyderski at gmail dot com
` (2 preceding siblings ...)
2011-05-18 13:15 ` piotr.wyderski at gmail dot com
@ 2021-08-14 20:53 ` pinskia at gcc dot gnu.org
3 siblings, 0 replies; 5+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-08-14 20:53 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49028
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Severity|minor |enhancement
Component|rtl-optimization |tree-optimization
Last reconfirmed|2011-05-18 09:40:28 |2021-8-14
--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
_3 = _2 + 8;
_4 = _3 - _1;
_5 = _4 /[ex] 8;
_6 = _5 % 16;
_7 = (long unsigned int) _6;
_8 = _7 * 8;
_9 = _1 + _8;
I Noticed only MSVC actually does this optimization.
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2021-08-14 20:53 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-05-17 17:32 [Bug rtl-optimization/49028] New: Missed optimization of pointer arithmetic piotr.wyderski at gmail dot com
2011-05-17 17:40 ` [Bug rtl-optimization/49028] " piotr.wyderski at gmail dot com
2011-05-18 10:18 ` rguenth at gcc dot gnu.org
2011-05-18 13:15 ` piotr.wyderski at gmail dot com
2021-08-14 20:53 ` [Bug tree-optimization/49028] " pinskia at gcc dot gnu.org
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).