public inbox for gcc@gcc.gnu.org
* Unroller gone wild
@ 2010-03-08  8:50 Piotr Wyderski
  2010-03-08 10:17 ` Richard Guenther
From: Piotr Wyderski @ 2010-03-08  8:50 UTC (permalink / raw)
  To: GCC Mailing List

I have the following code:

    struct bounding_box {

        pack4sf m_Mins;
        pack4sf m_Maxs;

        void set(__v4sf v_mins, __v4sf v_maxs) {

            m_Mins = v_mins;
            m_Maxs = v_maxs;
        }
    };

    struct bin {

        bounding_box m_Box[3];
        pack4si      m_NL;
        pack4sf      m_AL;
    };

    static const std::size_t bin_count = 16;
    bin aBins[bin_count];

    for(std::size_t i = 0; i != bin_count; ++i) {

        bin& b = aBins[i];

        b.m_Box[0].set(g_VecInf, g_VecMinusInf);
        b.m_Box[1].set(g_VecInf, g_VecMinusInf);
        b.m_Box[2].set(g_VecInf, g_VecMinusInf);
        b.m_NL = __v4si{ 0, 0, 0, 0 };
    }

where pack4sf/si are union-based wrappers for __v4sf/si.
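
A minimal sketch of what such union-based wrappers might look like
(hypothetical; the actual pack4sf/pack4si definitions are not shown
here and may differ):

    #include <emmintrin.h> // defines __v4si and, via xmmintrin.h, __v4sf

    // Hypothetical sketch only -- the real definitions may differ.
    union pack4sf {
        __v4sf v;
        float  f[4];

        pack4sf() {}
        pack4sf(__v4sf x) : v(x) {}  // lets set() assign a __v4sf directly
    };

    union pack4si {
        __v4si v;
        int    i[4];

        pack4si() {}
        pack4si(__v4si x) : v(x) {}  // lets m_NL be assigned a __v4si value
    };
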
GCC 4.5 on Core i7/Cygwin with

-O3 -fno-lto -msse -msse2 -mfpmath=sse -march=native -mtune=native
-fomit-frame-pointer

completely unrolled the loop into 112 movdqa instructions
(seven 16-byte stores per iteration times 16 iterations),
which is "a bit" too aggressive. Should I file a bug report?
The processor has an 18-instruction-deep prefetch queue
and the loop is perfectly predictable by the built-in branch
prediction circuitry, so translating it as-is would greatly
reduce the fetch/decode bandwidth it consumes. Is there something
like "#pragma nounroll" to selectively disable this optimization?

Best regards
Piotr Wyderski


* Re: Unroller gone wild
  2010-03-08  8:50 Unroller gone wild Piotr Wyderski
@ 2010-03-08 10:17 ` Richard Guenther
From: Richard Guenther @ 2010-03-08 10:17 UTC (permalink / raw)
  To: Piotr Wyderski; +Cc: GCC Mailing List

On Mon, Mar 8, 2010 at 9:49 AM, Piotr Wyderski <piotr.wyderski@gmail.com> wrote:
> I have the following code:
>
>    struct bounding_box {
>
>        pack4sf m_Mins;
>        pack4sf m_Maxs;
>
>        void set(__v4sf v_mins, __v4sf v_maxs) {
>
>            m_Mins = v_mins;
>            m_Maxs = v_maxs;
>        }
>    };
>
>    struct bin {
>
>        bounding_box m_Box[3];
>        pack4si      m_NL;
>        pack4sf      m_AL;
>    };
>
>    static const std::size_t bin_count = 16;
>    bin aBins[bin_count];
>
>    for(std::size_t i = 0; i != bin_count; ++i) {
>
>        bin& b = aBins[i];
>
>        b.m_Box[0].set(g_VecInf, g_VecMinusInf);
>        b.m_Box[1].set(g_VecInf, g_VecMinusInf);
>        b.m_Box[2].set(g_VecInf, g_VecMinusInf);
>        b.m_NL = __v4si{ 0, 0, 0, 0 };
>    }
>
> where pack4sf/si are union-based wrappers for __v4sf/si.
> GCC 4.5 on Core i7/Cygwin with
>
> -O3 -fno-lto -msse -msse2 -mfpmath=sse -march=native -mtune=native
> -fomit-frame-pointer
>
> completely unrolled the loop into 112 movdqa instructions
> (seven 16-byte stores per iteration times 16 iterations),
> which is "a bit" too aggressive. Should I file a bug report?
> The processor has an 18-instruction-deep prefetch queue
> and the loop is perfectly predictable by the built-in branch
> prediction circuitry, so translating it as-is would greatly
> reduce the fetch/decode bandwidth it consumes. Is there something
> like "#pragma nounroll" to selectively disable this optimization?

No, only --param max-completely-peel-times (which is 16)
or --param max-completely-peeled-insns (which probably should
then be way lower than the current 400).
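
For example, on the command line quoted above the limits could be
tightened roughly like this (the compiler driver, the file name and
the specific values here are illustrative only):

    g++ -O3 -fno-lto -msse -msse2 -mfpmath=sse -march=native -mtune=native \
        -fomit-frame-pointer \
        --param max-completely-peel-times=1 \
        --param max-completely-peeled-insns=50 \
        -c bins.cpp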

Richard.

> Best regards
> Piotr Wyderski
>

