public inbox for gcc-bugs@sourceware.org
From: "jakub at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/49442] [4.5/4.6/4.7 Regression] Misaligned store support pessimization
Date: Thu, 16 Jun 2011 16:33:00 -0000
Message-ID: <bug-49442-4-swj7UIO6nI@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-49442-4@http.gcc.gnu.org/bugzilla/>

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49442

--- Comment #2 from Jakub Jelinek <jakub at gcc dot gnu.org> 2011-06-16 16:33:16 UTC ---
I was testing on SandyBridge, but it was reported to us for Core2.
The loop was vectorized in 4.4 and still is now; in both cases it performs a
huge, hard-to-decipher test with many conditions and then uses either the
non-vectorized loop or the vectorized loop.  In r148210 the condition also
included:
  vect_p.44_29 = (vector double *) out1_6(D);
  addr2int0.45_28 = (long int) vect_p.44_29;
  vect_p.48_37 = (vector double *) out2_21(D);
  addr2int1.49_38 = (long int) vect_p.48_37;
  orptrs1.50_40 = addr2int0.45_28 | addr2int1.49_38;
  vect_p.53_41 = (vector double *) out3_34(D);
  addr2int2.54_42 = (long int) vect_p.53_41;
  orptrs2.55_51 = orptrs1.50_40 | addr2int2.54_42;
  andmask.56_52 = orptrs2.55_51 & 15;
...
  D.2833_72 = andmask.56_52 == 0;
but in the new condition it is not, and previously the loop used movapd
stores:
    movapd    %xmm0, (%rdi,%r10)
...
    movapd    %xmm0, (%rsi,%r10)
...
    movapd    %xmm0, (%rdx,%r10)
while now it uses:
    movlpd    %xmm0, (%rdi,%rbx)
    movhpd    %xmm0, 8(%rdi,%rbx)
...
    movlpd    %xmm0, (%rsi,%rbx)
    movhpd    %xmm0, 8(%rsi,%rbx)
...
    movlpd    %xmm0, (%rdx,%rbx)
    movhpd    %xmm0, 8(%rdx,%rbx)

Surprisingly, the new code is slower even when the pointers aren't aligned:
r128110:
Strip out best and worst realtime result
minimum: 8.849950347 sec real / 0.000085810 sec CPU
maximum: 9.278652529 sec real / 0.000153471 sec CPU
average: 9.055898562 sec real / 0.000138755 sec CPU
stdev  : 0.073603342 sec real / 0.000016469 sec CPU
r128111:
Strip out best and worst realtime result
minimum: 12.089365836 sec real / 0.000081233 sec CPU
maximum: 12.378188295 sec real / 0.000158253 sec CPU
average: 12.234883839 sec real / 0.000136920 sec CPU
stdev  : 0.073461527 sec real / 0.000017463 sec CPU
(same baz routine, and
double a[60000] __attribute__((aligned (32)));
int
main ()
{
  int i;
  for (i = 0; i < 500000; i++)
    baz (a + 1, a + 10001, a + 30000, a + 40000, a + 50000, 10000);
  return 0;
}
instead).  Here the code generated by r128110 uses the scalar loop, while the
code generated by r128111 uses the vectorized one with those movlpd+movhpd
stores.  So in this particular case, for this particular CPU, it would be
better if the cost model required verifying that all store pointers are
sufficiently aligned and used the vectorized loop only in that case.

BTW, the vectorization condition is really long.  Is it a good idea to let it
go through with just a single branch at the end?  Wouldn't it be better to
test the few checks most likely to fail first, then a conditional branch, then
some other tests, then another conditional branch?
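
To illustrate the staged variant in source form, here is a rough sketch.  The
function names, the order of the stages, and the minimum trip count are all
assumptions made for the example; which checks really fail most often would
have to be measured.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if [p, p+len) and [q, q+len) are disjoint.
   Ignores address wrap-around for brevity.  */
static bool
no_overlap (const double *p, const double *q, size_t len)
{
  uintptr_t a = (uintptr_t) p, b = (uintptr_t) q;
  uintptr_t bytes = len * sizeof (double);
  return a + bytes <= b || b + bytes <= a;
}

/* Hypothetical guard with one branch per stage instead of a single
   branch after evaluating everything.  */
static bool
can_use_vector_loop (const double *in1, const double *in2,
                     double *out1, double *out2, double *out3, size_t len)
{
  /* Stage 1: store alignment, assumed here to be the likeliest failure.  */
  if ((((uintptr_t) out1 | (uintptr_t) out2 | (uintptr_t) out3) & 15) != 0)
    return false;

  /* Stage 2: enough iterations to fill at least one 2-element vector.  */
  if (len < 2)
    return false;

  /* Stage 3: the more expensive overlap tests, reached only if the
     cheap stages passed.  */
  return no_overlap (in1, out1, len) && no_overlap (in1, out2, len)
         && no_overlap (in1, out3, len) && no_overlap (in2, out1, len)
         && no_overlap (in2, out2, len) && no_overlap (in2, out3, len);
}
```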

I've talked with Richard on IRC about how users could promise the compiler
that the pointers are sufficiently aligned, so that it can simply assume the
alignment (as if it had tested for it) and rely on it in the loop, both for
loads and stores.  Possibilities include
__attribute__((ptr_align (align [, misalign]))) on const pointer parameters and
const pointer variables, or adding assertions that use
__builtin_unreachable ().
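
The __builtin_unreachable () variant might look something like this on the
user side.  It is a sketch only: the macro and function names are invented
for the example, and whether the vectorizer actually picks up the alignment
from such an assertion is exactly the open question here, not something this
code demonstrates.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical assertion macro: tells the compiler that control never
   reaches the point after the test with a misaligned pointer.  Passing
   a misaligned pointer is then undefined behavior.  */
#define ASSUME_ALIGNED_16(p) \
  do { if (((uintptr_t) (p) & 15) != 0) __builtin_unreachable (); } while (0)

/* Hypothetical kernel, used only to show where the promises would go.  */
void
scale (double *out, const double *in, size_t len)
{
  ASSUME_ALIGNED_16 (out);
  ASSUME_ALIGNED_16 (in);
  for (size_t i = 0; i < len; i++)
    out[i] = in[i] * 2.0;
}
```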

But now that I think about it more: we already version the loop for
vectorization in this case, so wouldn't it be better to just add some
extension that lets the user say something is likely?  Such a hint could say,
e.g., that some pointer is likely to be aligned/misaligned in a particular
way, or that pointers don't overlap (yeah, I know, we have restrict, but
e.g. on STL containers it is more fun to add those)?

E.g. if this loop were hinted that all 5 pointers are 16-byte aligned and
that neither in1[0..len-1] nor in2[0..len-1] overlaps out{1,2,3}[0..len-1], the
vectorizer could verify those conditions at runtime and use a faster
vectorized loop that assumes correct alignment and __restrict, while the
fallback case (vectorization not beneficial, or some overlaps somewhere, or
misaligned pointers) would be a scalar loop that assumes none of that.
Or perhaps the hints could tell the vectorizer to emit 3 different versions
instead of two, each with different assumptions, or something similar.
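
Written out by hand, the two-version scheme would be roughly the following.
Everything here is illustrative: the loop body is a made-up stand-in (not the
baz routine from the report), all function names are hypothetical, and
__builtin_assume_aligned is a builtin from newer GCC and Clang releases that
is swapped in to express the alignment assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static bool
aligned16 (const void *p)
{
  return ((uintptr_t) p & 15) == 0;
}

/* True if [p, p+len) and [q, q+len) are disjoint (wrap-around ignored).  */
static bool
disjoint (const double *p, const double *q, size_t len)
{
  uintptr_t a = (uintptr_t) p, b = (uintptr_t) q;
  uintptr_t bytes = len * sizeof (double);
  return a + bytes <= b || b + bytes <= a;
}

/* Fast version: inputs are restrict-qualified and every pointer is
   assumed 16-byte aligned.  The outputs are deliberately not restrict,
   since the hint says nothing about them overlapping each other.  */
static void
kernel_fast (const double *restrict in1, const double *restrict in2,
             double *out1, double *out2, double *out3, size_t len)
{
  const double *a = __builtin_assume_aligned (in1, 16);
  const double *b = __builtin_assume_aligned (in2, 16);
  double *x = __builtin_assume_aligned (out1, 16);
  double *y = __builtin_assume_aligned (out2, 16);
  double *z = __builtin_assume_aligned (out3, 16);
  for (size_t i = 0; i < len; i++)
    {
      double u = a[i], v = b[i];
      x[i] = u + v;
      y[i] = u - v;
      z[i] = u * v;
    }
}

/* Fallback version: assumes nothing about alignment or aliasing.  */
static void
kernel_scalar (const double *in1, const double *in2,
               double *out1, double *out2, double *out3, size_t len)
{
  for (size_t i = 0; i < len; i++)
    {
      double u = in1[i], v = in2[i];
      out1[i] = u + v;
      out2[i] = u - v;
      out3[i] = u * v;
    }
}

/* Runtime check of the hinted conditions, then dispatch.  */
void
kernel (const double *in1, const double *in2,
        double *out1, double *out2, double *out3, size_t len)
{
  bool ok = aligned16 (in1) && aligned16 (in2) && aligned16 (out1)
            && aligned16 (out2) && aligned16 (out3)
            && disjoint (in1, out1, len) && disjoint (in1, out2, len)
            && disjoint (in1, out3, len) && disjoint (in2, out1, len)
            && disjoint (in2, out2, len) && disjoint (in2, out3, len);
  if (ok)
    kernel_fast (in1, in2, out1, out2, out3, len);
  else
    kernel_scalar (in1, in2, out1, out2, out3, len);
}
```

Both versions compute the same results; only the assumptions handed to the
compiler differ, which is the point of versioning on the hint.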

