* Unroller gone wild
@ 2010-03-08 8:50 Piotr Wyderski
2010-03-08 10:17 ` Richard Guenther
From: Piotr Wyderski @ 2010-03-08 8:50 UTC (permalink / raw)
To: GCC Mailing List
I have the following code:
struct bounding_box {

    pack4sf m_Mins;
    pack4sf m_Maxs;

    void set(__v4sf v_mins, __v4sf v_maxs) {
        m_Mins = v_mins;
        m_Maxs = v_maxs;
    }
};

struct bin {

    bounding_box m_Box[3];
    pack4si m_NL;
    pack4sf m_AL;
};

static const std::size_t bin_count = 16;
bin aBins[bin_count];

for(std::size_t i = 0; i != bin_count; ++i) {

    bin& b = aBins[i];

    b.m_Box[0].set(g_VecInf, g_VecMinusInf);
    b.m_Box[1].set(g_VecInf, g_VecMinusInf);
    b.m_Box[2].set(g_VecInf, g_VecMinusInf);
    b.m_NL = __v4si{ 0, 0, 0, 0 };
}
where pack4sf/si are union-based wrappers for __v4sf/si.
GCC 4.5 on Core i7/Cygwin with
-O3 -fno-lto -msse -msse2 -mfpmath=sse -march=native -mtune=native
-fomit-frame-pointer
completely unrolled the loop into 112 movdqa instructions,
which is "a bit" too aggressive. Should I file a bug report?
The processor has an 18-instruction prefetch queue
and the loop is perfectly predictable by the built-in branch
prediction circuitry, so translating it as-is would greatly
reduce fetch/decode bandwidth pressure. Is there something like
"#pragma nounroll" to selectively disable this optimization?
Best regards
Piotr Wyderski
* Re: Unroller gone wild
2010-03-08 8:50 Unroller gone wild Piotr Wyderski
@ 2010-03-08 10:17 ` Richard Guenther
From: Richard Guenther @ 2010-03-08 10:17 UTC (permalink / raw)
To: Piotr Wyderski; +Cc: GCC Mailing List
On Mon, Mar 8, 2010 at 9:49 AM, Piotr Wyderski <piotr.wyderski@gmail.com> wrote:
> [quoted code and compiler flags snipped]
>
> completely unrolled the loop into 112 movdqa instructions,
> which is "a bit" too aggressive. Should I file a bug report?
> The processor has an 18-instruction prefetch queue
> and the loop is perfectly predictable by the built-in branch
> prediction circuitry, so translating it as-is would greatly
> reduce fetch/decode bandwidth pressure. Is there something like
> "#pragma nounroll" to selectively disable this optimization?
No, only --param max-completely-peel-times (which is 16)
or --param max-completely-peeled-insns (which probably should
then be way lower than the current 400).
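[A sketch of how those parameters might be applied on the command line; the
file name is hypothetical and the thresholds are illustrative tuning choices,
not recommendations:]

```shell
# Lower the complete-peeling instruction budget from its default of 400,
# so a 112-instruction fully unrolled body no longer fits the limit.
g++ -O3 -msse -msse2 --param max-completely-peeled-insns=100 -c bins.cpp

# Or cap how many iterations may be fully peeled (the default of 16
# is exactly this loop's trip count, so any smaller value stops it).
g++ -O3 -msse -msse2 --param max-completely-peel-times=8 -c bins.cpp
```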
Richard.
> Best regards
> Piotr Wyderski
>