Loop Vectorization and OpenMP

* Loop Vectorization and OpenMP
From: Freddie Witherden @ 2013-01-14 16:05 UTC
  To: gcc-help

Hi all,

I have a function which I wish to accelerate with auto-vectorization and
OpenMP:

void fn(float *restrict rho_in,     float *restrict E_in,
        float *restrict rhou_in,    float *restrict rhov_in,
        float *restrict f0rho_out,  float *restrict f0E_out,
        float *restrict f0rhou_out, float *restrict f0rhov_out,
        float *restrict f1rho_out,  float *restrict f1E_out,
        float *restrict f1rhou_out, float *restrict f1rhov_out,
        int n)
{
    rho_in  = (float *) __builtin_assume_aligned(rho_in, 32);
    E_in    = (float *) __builtin_assume_aligned(E_in, 32);
    rhou_in = (float *) __builtin_assume_aligned(rhou_in, 32);
    rhov_in = (float *) __builtin_assume_aligned(rhov_in, 32);

    f0rho_out  = (float *) __builtin_assume_aligned(f0rho_out, 32);
    f0E_out    = (float *) __builtin_assume_aligned(f0E_out, 32);
    f0rhou_out = (float *) __builtin_assume_aligned(f0rhou_out, 32);
    f0rhov_out = (float *) __builtin_assume_aligned(f0rhov_out, 32);

    f1rho_out  = (float *) __builtin_assume_aligned(f1rho_out, 32);
    f1E_out    = (float *) __builtin_assume_aligned(f1E_out, 32);
    f1rhou_out = (float *) __builtin_assume_aligned(f1rhou_out, 32);
    f1rhov_out = (float *) __builtin_assume_aligned(f1rhov_out, 32);

    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
    {
        float rho = rho_in[i], E = E_in[i];
        float rhou = rhou_in[i], rhov = rhov_in[i];

        float invrho = 1.0f/rho;
        float u = invrho*rhou, v = invrho*rhov;

        float p = 0.4f*(E - 0.5f*(rhou*u + rhov*v));

        f0rho_out[i]  = rhou;       f1rho_out[i]  = rhov;
        f0rhou_out[i] = rhou*u + p; f1rhou_out[i] = rhov*u;
        f0rhov_out[i] = rhou*v;     f1rhov_out[i] = rhov*v + p;
        f0E_out[i]    = (E + p)*u;  f1E_out[i]    = (E + p)*v;
    }
}

The combination of "restrict" along with the alignment fluff yields some
extremely tight ASM on my AVX-capable system.  However, when OpenMP
enters the mix the resulting code is not vectorized:

  gcc-4.7.2 -std=c99 -Ofast -fopenmp -march=native -S fn.c

as can be seen by a simple inspection of the resulting assembly.  I
believe this is due to Bug 46032 (although some of the comments imply
that it should be fixed).  It appears as if either the "restrict"
property or the alignment is getting clobbered when the OpenMP 'inner'
function is generated.

Can anyone suggest any workarounds?  It seems like a common problem and
I really do not want to reinvent the wheel if a simple refactoring of my
code can iron everything out.

Regards, Freddie.

* Re: Loop Vectorization and OpenMP
From: Tim Prince @ 2013-01-14 16:21 UTC
  To: gcc-help

On 1/14/2013 9:33 AM, Freddie Witherden wrote:
> [...]
It's a Frequently Encountered Problem.  What did
-ftree-vectorizer-verbose=3 produce?

Part of the problem is that the OpenMP chunks won't have the alignments
you set carefully for the start of the array, unless the loop count
happens to be a multiple of the number of threads times the unrolling
factor times the vector register width.  The alignment of each chunk is
thus unknown at compile time.

It remains to be seen how much the OpenMP 4.0 proposals for pragmas to
deal with this may help.  Until then, OpenMP tends to work better with
at least 2 levels of loops, where the outer is parallelizable and the
inner vectorizable.
Tim

-- 
Tim Prince
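
To make the alignment point above concrete, here is a minimal sketch.  It
assumes a static schedule that splits the n iterations into near-equal
contiguous chunks, one per thread; the real split is chosen by the OpenMP
runtime, and the helper names chunk_start and chunk_is_aligned are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: index of the first iteration of chunk t when n
 * iterations are split into nthreads near-equal contiguous chunks.
 * A static schedule typically behaves like this, but the exact split
 * is up to the OpenMP runtime. */
static size_t chunk_start(size_t n, size_t nthreads, size_t t)
{
    assert(nthreads > 0 && t < nthreads);
    size_t base = n / nthreads, rem = n % nthreads;
    /* The first `rem` chunks receive one extra iteration each. */
    return t * base + (t < rem ? t : rem);
}

/* Does chunk t start on a vector boundary, given a base pointer that is
 * aligned to `width` elements (8 floats for 256-bit AVX)? */
static int chunk_is_aligned(size_t n, size_t nthreads, size_t t, size_t width)
{
    return chunk_start(n, nthreads, t) % width == 0;
}
```

With n = 1000 and 4 threads, chunk 1 starts at element 250, which is not a
multiple of 8, so that thread's slice is misaligned for 256-bit AVX even
though the array itself is 32-byte aligned.  With n = 1024 every chunk
starts on a vector boundary.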

* Re: Loop Vectorization and OpenMP
From: Freddie Witherden @ 2013-01-14 21:17 UTC
  To: tprince; +Cc: Tim Prince, gcc-help

On 14/01/13 16:04, Tim Prince wrote:
> It's a Frequently Encountered Problem.  What did 
> -ftree-vectorizer-verbose=3 produce?

Nothing.  At 5 it gave:

  27: versioning for alias required: can't determine dependence between
  *D.1967_20 and *D.1988_49
  27: mark for run-time aliasing test between *D.1967_20 and *D.1988_49
  [...]
  27: disable versioning for alias - max number of generated checks
  exceeded.

which implies that "restrict" is being clobbered.
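
A rough count shows why the versioning limit is hit once "restrict" is
lost.  This is only a sketch: it assumes the vectorizer needs one run-time
overlap check per load/store and store/store pointer pair, the helper name
alias_check_pairs is hypothetical, and how GCC actually groups or prunes
its checks may differ.

```c
#include <assert.h>

/* Number of pointer pairs that would need a run-time overlap check if no
 * restrict information survives: every store against every load, plus
 * every store against every other store.  Loads need no checks among
 * themselves. */
static int alias_check_pairs(int nloads, int nstores)
{
    assert(nloads >= 0 && nstores >= 0);
    return nstores * nloads + nstores * (nstores - 1) / 2;
}
```

For the loop in question, with 4 input and 8 output arrays, this gives
8*4 + 28 = 60 pairs, far above the limit set by
--param vect-max-version-for-alias-checks (whose default is, to my
knowledge, 10 in GCC of this vintage).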

> Part of the problem is that the OpenMP chunks won't have the 
> alignments you set carefully for the start of the array, unless the 
> loop count happens to be a multiple of number of threads times 
> unrolling factor times vector register width, thus unknown at compile
> time. It remains to be seen how much OpenMP 4.0 proposals for pragmas
> to deal with this may help. Until then, OpenMP tends to work better
> with at least 2 levels of loops, where the outer is parallelizable
> and the inner vectorizable.

Okay.  Can anyone suggest a good blocking methodology which, given

  for (int i = 0; i < n; ++i)
    // Code which uses parameters ...

where we require the parameters ... to have an alignment of X 'items'
(so for 256-bit AVX registers and float types X = 32/4 = 8), yields:

  for (outer)
    for (inner)
      // Code

such that the outer loop can be hit with OpenMP and the inner loop with
auto-vectorization?

Regards, Freddie.
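
One possible shape for such a blocking scheme is sketched below.  It is not
taken from the thread: the kernel is a simplified two-input, two-output
stand-in for the flux loop, and the names VEC, BLOCK, kernel and fn_blocked
are hypothetical.  It assumes the arrays are 32-byte aligned and that GCC
is the compiler (noinline and __builtin_assume_aligned are GCC
extensions).

```c
#include <assert.h>

enum {
    VEC   = 8,        /* floats per 256-bit AVX register: 32/4 */
    BLOCK = 64 * VEC  /* hypothetical block size; any multiple of VEC */
};

/* Inner kernel: a simplified stand-in for the flux loop in this thread.
 * As a separate function it makes its own restrict and alignment promises.
 * noinline is an assumption: it keeps the compiler from merging the kernel
 * back into the OpenMP-outlined caller. */
__attribute__((noinline))
static void kernel(const float *restrict rho_in, const float *restrict rhou_in,
                   float *restrict u_out, float *restrict f_out, int len)
{
    rho_in  = __builtin_assume_aligned(rho_in, 32);
    rhou_in = __builtin_assume_aligned(rhou_in, 32);
    u_out   = __builtin_assume_aligned(u_out, 32);
    f_out   = __builtin_assume_aligned(f_out, 32);

    for (int i = 0; i < len; ++i) {
        float u = rhou_in[i] / rho_in[i];
        u_out[i] = u;
        f_out[i] = rhou_in[i] * u;
    }
}

/* Outer driver: every block starts at a multiple of BLOCK elements, so if
 * the arrays are 32-byte aligned, each pointer handed to kernel() is too.
 * Only the last block can be shorter than BLOCK. */
void fn_blocked(const float *rho, const float *rhou, float *u, float *f, int n)
{
    assert(n >= 0);
    int nblocks = (n + BLOCK - 1) / BLOCK;

    #pragma omp parallel for
    for (int b = 0; b < nblocks; ++b) {
        int start = b * BLOCK;
        int len = (n - start < BLOCK) ? n - start : BLOCK;
        kernel(rho + start, rhou + start, u + start, f + start, len);
    }
}
```

The outer loop is what OpenMP distributes, so threads receive whole blocks
and the alignment promise inside kernel() holds regardless of how the
runtime schedules them.  Whether the inner loop actually vectorizes still
has to be checked in the vectorizer report or the assembly.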
