public inbox for gcc-bugs@sourceware.org
* [Bug middle-end/101908] New: cray regression with -O2 -ftree-slp-vectorize compared to -O2
From: hubicka at gcc dot gnu.org @ 2021-08-14 14:27 UTC

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

            Bug ID: 101908
           Summary: cray regression with -O2 -ftree-slp-vectorize compared
                    to -O2
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

This should be easy to track down since c-ray is a very simple benchmark.

According to the run
https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?younger_in_days=14&older_in_days=0&all_changes=on&min_percentage_change=0.001&revisions=e54acea9e5a821448af97c66e94a1e4c4f977d5d%2Ce87209a1269622017bf3d98bf71502dcb0f893aa%2C73474527aaa24d9236aca074c5494a07f40ce058&include_user_branches=on

SLP vectorization seems to cause a 101%-124% regression on Zen and a 96%
regression on Kaby Lake.

* [Bug middle-end/101908] cray regression with -O2 -ftree-slp-vectorize compared to -O2
From: crazylht at gmail dot com @ 2021-08-16  8:33 UTC

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #1 from Hongtao.liu <crazylht at gmail dot com> ---
Is c-ray open source?  If so, could you give us a link to the benchmark?
I'd like to run it on CLX too.

* [Bug tree-optimization/101908] cray regression with -O2 -ftree-slp-vectorize compared to -O2
From: rguenth at gcc dot gnu.org @ 2021-08-16  9:05 UTC

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|middle-end                  |tree-optimization
             Target|                            |x86_64-*-*

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
See http://www.futuretech.blinkenlights.nl/c-ray.html

* [Bug tree-optimization/101908] cray regression with -O2 -ftree-slp-vectorize compared to -O2
From: rguenth at gcc dot gnu.org @ 2021-08-16  9:06 UTC

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 51307
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51307&action=edit
c-ray v1.1

Hmm, that link is dead.  See attached.

* [Bug tree-optimization/101908] cray regression with -O2 -ftree-slp-vectorize compared to -O2
From: crazylht at gmail dot com @ 2021-08-25  7:47 UTC

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #4 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Richard Biener from comment #3)
> Created attachment 51307 [details]
> c-ray v1.1
> 
> Hmm, that link is dead.  See attached.

No regression on CLX.

* [Bug tree-optimization/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
From: hubicka at gcc dot gnu.org @ 2021-10-28 12:26 UTC

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|cray regression with -O2    |[12 regression] cray
                   |-ftree-slp-vectorize        |regression with -O2
                   |compared to -O2             |-ftree-slp-vectorize
                   |                            |compared to -O2

--- Comment #5 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Since vectorization is now on by default at -O2 in GCC 12, we see the problem
at plain -O2.

* [Bug tree-optimization/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
From: hubicka at gcc dot gnu.org @ 2021-10-28 12:39 UTC

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #6 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
zen
https://lnt.opensuse.org/db_default/v4/CPP/graph?plot.0=198.639.0&plot.1=180.639.0&plot.2=201.639.0&plot.3=150.639.0&plot.4=246.639.0&plot.5=256.639.0&plot.6=176.639.0&
kabylake
https://lnt.opensuse.org/db_default/v4/CPP/graph?plot.0=266.639.0&plot.1=21.639.0&
zen2
https://lnt.opensuse.org/db_default/v4/CPP/graph?plot.0=276.639.0&plot.1=11.639.0&
zen3 also sees a >100% regression

* [Bug tree-optimization/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
From: rguenth at gcc dot gnu.org @ 2021-10-28 13:02 UTC

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |12.0
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2021-10-28

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
Btw, with -O3/-Ofast c-ray is _much_ faster despite the vectorization (we
vectorize even more there).  At -O2 we get

c-ray-f.c:370:9: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:378:9: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:450:9: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:442:9: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:421:9: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:490:3: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:360:9: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:360:9: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:360:9: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:273:10: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:553:18: optimized: basic block part vectorized using 16 byte vectors

perf data: .v is vectorized, .nv not:

Samples: 53K of event 'cycles', Event count (approx.): 53285786739              
Overhead       Samples  Command     Shared Object     Symbol                    
  62.31%         32941  c-ray-f.v   c-ray-f.v         [.] ray_sphere
  31.13%         16666  c-ray-f.nv  c-ray-f.nv        [.] ray_sphere

and _likely_ the issue is that we have

       |     int ray_sphere(const struct sphere *sph, struct ray ray, struct spoint *sp)
       |       sub      $0x98,%rsp
       |     double a, b, c, d, sqrt_d, t1, t2;
       |
       |     a = SQ(ray.dir.x) + SQ(ray.dir.y) + SQ(ray.dir.z);
       |     b = 2.0 * ray.dir.x * (ray.orig.x - sph->pos.x) +
       |       movupd   (%rdi),%xmm5
       |     2.0 * ray.dir.y * (ray.orig.y - sph->pos.y) +
       |     2.0 * ray.dir.z * (ray.orig.z - sph->pos.z);
  0.02 |       movsd    0x10(%rdi),%xmm9
  0.01 |       movupd   0xb8(%rsp),%xmm13
 37.67 |       movupd   0xa0(%rsp),%xmm15

so we pass struct ray on the stack(?) and perform SSE loads from it, but
the argument passing does

  0.88 |       movups %xmm2,(%rsp)
  0.22 |       movups %xmm3,0x10(%rsp)
 43.81 |       movups %xmm4,0x20(%rsp)
  0.66 |       call   ray_sphere

IIRC Zen2 had some 'tricks' to forward stack spills/restores, and if that
fails for some reason there is probably a penalty - at least in this case
it shouldn't be STLF.  Note the non-vectorized code has the same code
on the caller side but loads scalar pieces.

Not inlining ray_sphere at -O2 is of course what makes it overall slow.

Confirmed on Zen2.

* [Bug tree-optimization/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
From: rguenth at gcc dot gnu.org @ 2021-10-28 13:05 UTC

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
It _is_ likely STLF.

struct vec3 {
        double x, y, z;
};

struct ray {
        struct vec3 orig, dir;
};

the vectorized ray_sphere wants { ray.orig.x, ray.orig.y } and
{ ray.dir.x, ray.dir.y }, where the former is fine but the latter is
misaligned relative to the way we push the struct to the stack, which is
{ ray.orig.x, ray.orig.y } { ray.orig.z, ray.dir.x } { ray.dir.y, ray.dir.z }.
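
To make the layout concrete (my own illustration, not part of the original
comment; it assumes the usual x86_64 struct layout with no padding):

#include <stddef.h>
#include <stdio.h>

struct vec3 { double x, y, z; };
struct ray { struct vec3 orig, dir; };

int main (void)
{
  /* The caller spills the 48-byte struct with three 16-byte stores
     covering [0,16), [16,32) and [32,48).  The vectorized callee then
     loads 16 bytes at offset 24 for { dir.x, dir.y }, straddling the
     stores at offsets 16 and 32 - the pattern that defeats STLF.  */
  printf ("dir.x at offset %zu, dir.y at offset %zu\n",
          offsetof (struct ray, dir.x),    /* 24 */
          offsetof (struct ray, dir.y));   /* 32 */
  return 0;
}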

* [Bug tree-optimization/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
From: hubicka at kam dot mff.cuni.cz @ 2021-10-28 13:06 UTC

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #9 from hubicka at kam dot mff.cuni.cz ---
> Not inlining ray_sphere at -O2 is of course what makes it overall slow.

ray_sphere is not at all a small function.  We already play tricks
at -O3 to inline it by detecting that some of its parameters are loop
invariant, so inlining will enable hoisting part of the function body
out of the loop.  We account that as a very large expected speedup for
inlining and bump the inline limits up.
I do not think we could do the same at -O2 since this is all quite
speculative and leads to quite some code growth.

Honza

* [Bug tree-optimization/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
From: hubicka at kam dot mff.cuni.cz @ 2021-10-28 13:09 UTC

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #10 from hubicka at kam dot mff.cuni.cz ---
>        |     b = 2.0 * ray.dir.x * (ray.orig.x - sph->pos.x) +
>        |       movupd   (%rdi),%xmm5
>        |     2.0 * ray.dir.y * (ray.orig.y - sph->pos.y) +
>        |     2.0 * ray.dir.z * (ray.orig.z - sph->pos.z);
>   0.02 |       movsd    0x10(%rdi),%xmm9
>   0.01 |       movupd   0xb8(%rsp),%xmm13
>  37.67 |       movupd   0xa0(%rsp),%xmm15
> 
> so we pass struct ray on the stack(?) and perform SSE loads from it, but
> the argument passing does
> 
>   0.88 |       movups %xmm2,(%rsp)
>   0.22 |       movups %xmm3,0x10(%rsp)
>  43.81 |       movups %xmm4,0x20(%rsp)
>   0.66 |       call   ray_sphere

Adding Martin to CC.  I think we could teach IPA-SRA, with -flto, to
turn the structure either into scalar arguments or into being passed by
reference, which would allow us to hoist its initialization out of the
loop body.
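
As a rough sketch of what that would look like at the source level
(hand-written illustration; the _scalar/_ref variants are hypothetical
names, and the real transform would be done automatically by IPA-SRA
under -flto):

struct vec3 { double x, y, z; };
struct ray { struct vec3 orig, dir; };
struct sphere;
struct spoint;

/* Today: the 48-byte struct is passed by value and spilled to the stack.  */
int ray_sphere (const struct sphere *sph, struct ray ray, struct spoint *sp);

/* Scalarized: each double arrives in its own SSE register, so there is
   no stack round-trip to defeat store forwarding.  */
int ray_sphere_scalar (const struct sphere *sph,
                       double orig_x, double orig_y, double orig_z,
                       double dir_x, double dir_y, double dir_z,
                       struct spoint *sp);

/* By reference: the caller can build the struct once and hoist it out of
   the loop instead of re-storing it for every call.  */
int ray_sphere_ref (const struct sphere *sph, const struct ray *ray,
                    struct spoint *sp);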

Honza

* [Bug tree-optimization/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
From: rguenth at gcc dot gnu.org @ 2021-10-28 13:09 UTC

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #11 from Richard Biener <rguenth at gcc dot gnu.org> ---
-mtune-ctrl=^sse_unaligned_load_optimal fixes the observed regression.

* [Bug tree-optimization/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
From: hubicka at kam dot mff.cuni.cz @ 2021-10-28 13:12 UTC

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #12 from hubicka at kam dot mff.cuni.cz ---
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908
> 
> --- Comment #11 from Richard Biener <rguenth at gcc dot gnu.org> ---
> -mtune-ctrl=^sse_unaligned_load_optimal fixes the observed regression.
Interesting.  I suppose we may want to run SPEC with the generic tuning
model changed this way to see if it cures other STLF problems?  I can do
that if it makes sense.

* [Bug tree-optimization/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
From: rguenth at gcc dot gnu.org @ 2021-10-28 13:13 UTC

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> ---
And before H.J.'s patches to use by_pieces to set up arguments we had code like

        pushq   136(%rsp)
        .cfi_def_cfa_offset 224
        pushq   136(%rsp)
        .cfi_def_cfa_offset 232
        pushq   136(%rsp)
        .cfi_def_cfa_offset 240
        pushq   136(%rsp)
        .cfi_def_cfa_offset 248
        pushq   136(%rsp)
        .cfi_def_cfa_offset 256
        call    ray_sphere

which was also fine.

* [Bug tree-optimization/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
From: rguenth at gcc dot gnu.org @ 2021-10-28 13:15 UTC

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #14 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to hubicka from comment #12)
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908
> > 
> > --- Comment #11 from Richard Biener <rguenth at gcc dot gnu.org> ---
> > -mtune-ctrl=^sse_unaligned_load_optimal fixes the observed regression.
> Interesting.  I suppose we may want to run specs with generic model
> changed this way to see if it cures other stlf problems? I can do that
> if that makes sense.

It will only help for V2DF I think, so no, not really.  But an IPA notion of
whether there are cross-call STLF issues might be nice.

Generally doing wider stores is fine, but of course if structs end up
"misaligned" then doing wide loads tends to run into these issues.

In theory the backend should have good enough knowledge to split the
wide loads from the argument area near the prologue, because it should
know how we stored to it.

But then - just fix the CPUs :P

* [Bug tree-optimization/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
From: rguenth at gcc dot gnu.org @ 2021-10-28 13:21 UTC

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #15 from Richard Biener <rguenth at gcc dot gnu.org> ---
And even when making ray_sphere static we do not consider using alternate
argument-passing conventions (use 6 SSE regs for the 'double's, or 3 for how
the vectors are used to set up the stack right now).  We might even consider
driving the local ABI decision by how the vectorizer ends up using things -
like, if we see the vector loads, pass the aggregate in _4_ SSE regs:
the two desired vectors plus two regs for the two leftover scalars.

The vectorizer is of course too late for IPA-SRA, but an IPA-SRA-like
pass could run after the GIMPLE opts if we serialize for another late IPA
phase.

* [Bug tree-optimization/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
From: hubicka at kam dot mff.cuni.cz @ 2021-10-29 13:58 UTC

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #16 from hubicka at kam dot mff.cuni.cz ---
> It will only help for V2DF I think, so no, not really.  But an IPA idea of
> whether there's cross-call STLF issues might be nice.
> 
> Generally doing wider stores is fine but of course if structs end up
> "misaligned" then doing wide loads tends to run into these issues.

Well, we don't do this kind of analysis intraprocedurally; doing it
interprocedurally will only be harder.  I guess intraprocedurally we
copy-propagate most of the obvious cases, which would correspond to
IPA-SRAing the structure.
> 
> But then - just fix the CPUs :P
Seems to be a hard problem for hardware architects  :)

Honza

* [Bug tree-optimization/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
From: rguenth at gcc dot gnu.org @ 2022-01-20 11:12 UTC

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #17 from Richard Biener <rguenth at gcc dot gnu.org> ---
No good idea how to tackle such issues.  Possibly a mdreorg pass could, for
the code region "near" the function prologue, scan for loads that are known
to access the arguments in a way that conflicts, with respect to STLF, with
how GCC itself would pass them, and split those.

--- c-ray-f.s   2022-01-20 12:00:41.660954367 +0100
+++ c-ray-f.s.fixed     2022-01-20 12:00:38.160908539 +0100
@@ -334,8 +334,12 @@
        .cfi_def_cfa_offset 160
        movupd  (%rdi), %xmm5
        movsd   16(%rdi), %xmm9
-       movupd  184(%rsp), %xmm13
-       movupd  160(%rsp), %xmm15
+       movsd   184(%rsp), %xmm13
+       movhpd  192(%rsp), %xmm13
+#      movupd  184(%rsp), %xmm13
+       movsd   160(%rsp), %xmm15
+       movhpd  168(%rsp), %xmm15
+#      movupd  160(%rsp), %xmm15
        movsd   176(%rsp), %xmm10
        movaps  %xmm5, 16(%rsp)
        unpckhpd        %xmm5, %xmm5

indeed improves performance back to previous levels.  That's the ray_sphere
"prologue"; the only preceding code is

ray_sphere:
.LFB33:
        .cfi_startproc
        subq    $152, %rsp


At .stv1/.stv2 we see

(note 4 3 11 2 NOTE_INSN_FUNCTION_BEG)
(insn 11 4 13 2 (set (reg:V2DF 174 [ vect_ray_orig_x_87.270 ])
        (mem/c:V2DF (reg/f:DI 16 argp) [1 MEM <vector(2) double> [(double *)&ray]+0 S16 A64])) 1673 {movv2df_internal}
     (nil))
...
(insn 16 15 18 2 (set (reg:V2DF 178 [ vect_ray_dir_x_90.266 ])
        (mem/c:V2DF (plus:DI (reg/f:DI 16 argp)
                (const_int 24 [0x18])) [1 MEM <vector(2) double> [(double *)&ray + 24B]+0 S16 A64])) 1673 {movv2df_internal}
     (nil))

at the classic mdreorg place it is

(insn:TI 16 30 11 2 (set (reg:V2DF 49 xmm13 [orig:178 vect_ray_dir_x_90.266 ] [178])
        (mem/c:V2DF (plus:DI (reg/f:DI 7 sp)
                (const_int 184 [0xb8])) [1 MEM <vector(2) double> [(double *)&ray + 24B]+0 S16 A64])) 1673 {movv2df_internal}
     (nil))
(insn 11 16 15 2 (set (reg:V2DF 51 xmm15 [orig:174 vect_ray_orig_x_87.270 ] [174])
        (mem/c:V2DF (plus:DI (reg/f:DI 7 sp)
                (const_int 160 [0xa0])) [1 MEM <vector(2) double> [(double *)&ray]+0 S16 A64])) 1673 {movv2df_internal}
     (nil))

both might have enough info to tell that we load from an argument and how
that argument was passed.  But I don't know enough RTL details to say
how difficult it would be to split vector loads from the argument space
if it is "misaligned" compared to the argument passing sequence.

I do wonder though how CLX is fine with such an access pattern ;)  (did you
test with just -O2?)

* [Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
From: rguenth at gcc dot gnu.org @ 2022-01-20 11:26 UTC

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P1
          Component|tree-optimization           |target

--- Comment #18 from Richard Biener <rguenth at gcc dot gnu.org> ---
I do think this general issue, using %xmm for argument passing and
vectorizing with -O2 will see a lot of people stumbling into such unexpected
issues so I fear we have to do something about it.

* [Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
From: rguenth at gcc dot gnu.org @ 2022-01-20 11:52 UTC

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #19 from Richard Biener <rguenth at gcc dot gnu.org> ---
#include <stdlib.h>

struct X { double x[3]; };
typedef double v2df __attribute__((vector_size(16)));

/* Loads 16 bytes at offset 8 of the 24-byte by-value argument,
   straddling the caller's argument stores.  */
v2df __attribute__((noipa))
foo (struct X x)
{
  return (v2df) { x.x[1], x.x[2] };
}

struct X y;
int main(int argc, char **argv)
{
  struct X x = y;
  int cnt = atoi (argv[1]);
  for (int i = 0; i < cnt; ++i)
    foo (x);
  return 0;
}

also reproduces it.  On both trunk and the branch we see 'foo' using
movups (combine does this as well, even when not vectorizing).  Using
-mtune-ctrl=^sse_unaligned_load_optimal improves performance of the
micro benchmark more than 4-fold on Zen2.  Note that the tuning also
causes us to not pass the argument using vector registers even though
the stack slot is aligned (but we use movupd there; we could use an
aligned move - that's a missed optimization).

Note that doing non-vector argument setup but a misaligned vector load
does _not_ improve the situation, so the c-ray issue is solely caused by
-O2 enabling vectorization, and eventually by the fact that using vector
stores for the argument setup might make them more likely to not have
retired yet compared to doing more scalar stores.

The same behavior can be observed on Haswell.

* [Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
From: rguenth at gcc dot gnu.org @ 2022-02-24  9:54 UTC

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |aros at gmx dot com

--- Comment #20 from Richard Biener <rguenth at gcc dot gnu.org> ---
*** Bug 104663 has been marked as a duplicate of this bug. ***

* [Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
From: crazylht at gmail dot com @ 2022-02-25  3:57 UTC

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #21 from Hongtao.liu <crazylht at gmail dot com> ---
Now that we have the SLP node available in the vector cost hook, maybe we can
do something in the cost model to prevent vectorization when the node's
definition comes from a big-size parameter.
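
Something along these lines, perhaps (a purely hypothetical sketch: the
helper name and the factor are invented, and a real implementation would
live in the x86 add_stmt_cost hook and look at the load's data reference):

/* Hypothetical helper: charge extra for vector loads whose base object
   is a large by-value parameter, since those read back the caller's
   argument stores and risk STLF failures.  */
static unsigned
penalize_param_read (tree base, unsigned stmt_cost)
{
  if (base != NULL_TREE
      && TREE_CODE (base) == PARM_DECL
      && int_size_in_bytes (TREE_TYPE (base)) > 16)
    stmt_cost *= 2;  /* invented factor; would need tuning on real data */
  return stmt_cost;
}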

* [Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
From: rguenth at gcc dot gnu.org @ 2022-02-25  7:33 UTC

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #22 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #21)
> Now we have SLP node available in vector cost hook, maybe we can do sth in
> cost model to prevent vectorization when node's definition from big-size
> parameter.

Note we vectorize a load here for which we do not pass down an SLP node.
But of course there's the stmt-info one could look at - the issue is that
for SLP it doesn't tell you which part of the variable is accessed.
Also, even if we were to pass down the SLP node, we do not know exactly how
it is going to vectorize - but sure, we could play with some heuristics
there.

For x86 we can just assume that all aggregates > 16 bytes are passed on the
stack, correct?  Note I see for

#include <stdlib.h>

struct X { double x[3]; };
typedef double v2df __attribute__((vector_size(16)));

v2df __attribute__((noipa))
foo (struct X x, struct X y)
{
  return (v2df) {x.x[1], x.x[2] } + (v2df) { y.x[0], y.x[1] };
}

struct X y;
int main(int argc, char **argv)
{
  struct X x = y;
  int cnt = atoi (argv[1]);
  for (int i = 0; i < cnt; ++i)
    foo (x, x);
  return 0;
}

the structs are passed as

        movups  %xmm0, 24(%rsp)
        movq    %rax, 40(%rsp)
        movq    %rax, 16(%rsp)
        movups  %xmm0, (%rsp)
        call    foo

so the alignment of the stack variable depends on the position of the
function argument (and thus on preceding parameters).  That means
we cannot rely on &y being 16-byte aligned, and it seems we cannot
rely on a particular store sequence order either here.

That would mean pessimizing all incoming stack parameters
> 16 bytes in size (maybe also == 16 bytes?) because we do not know
how the caller pushed the parameters?  (Without the caller using
%xmm stores, all such vectorization would trigger STLF failures -
dependent on the load-to-store "distance" of course.)

Can you ask the engineers at Intel what a big enough "distance" would be
to make sure the store has hit L1 (and whether a load from L1 is better
than a failed STLF, i.e. the store still in buffers but not forwardable)?
* [Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
From: lili.cui at intel dot com @ 2022-02-25  8:26 UTC

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

cuilili <lili.cui at intel dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |lili.cui at intel dot com

--- Comment #23 from cuilili <lili.cui at intel dot com> ---
(In reply to Richard Biener from comment #17)
> I do wonder though how CLX is fine with such access pattern ;)  (did you test
> with just -O2?)

Actually CLX also has STLF issues; there is a 13.7% regression when comparing
"gcc trunk + -O2" w/ and w/o "-fno-tree-vectorize".

* [Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
From: lili.cui at intel dot com @ 2022-02-25  8:31 UTC

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #24 from cuilili <lili.cui at intel dot com> ---
(In reply to cuilili from comment #23)
> (In reply to Richard Biener from comment #17)
> > I do wonder though how CLX is fine with such access pattern ;)  (did you test
> > with just -O2?)
> 
Sorry, I got the w/ and w/o order backwards:

 Actually CLX also has STLF issues; there is a 13.7% regression when comparing
 "gcc trunk + -O2" w/o and w/ "-fno-tree-vectorize".

* [Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
From: hjl.tools at gmail dot com @ 2022-02-25 15:27 UTC

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #25 from H.J. Lu <hjl.tools at gmail dot com> ---
(In reply to cuilili from comment #24)
> (In reply to cuilili from comment #23)
> > (In reply to Richard Biener from comment #17)
> > > I do wonder though how CLX is fine with such access pattern ;)  (did you test
> > > with just -O2?)
> > 
> Sorry, I got the w/ and w/o order backwards:
> 
>  Actually CLX also has STLF issues; there is a 13.7% regression when comparing
>  "gcc trunk + -O2" w/o and w/ "-fno-tree-vectorize".

Can this be mitigated by removing the redundant loads and stores?

* [Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
From: crazylht at gmail dot com @ 2022-02-28  1:29 UTC

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #26 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Richard Biener from comment #22)
> (In reply to Hongtao.liu from comment #21)
> > Now we have SLP node available in vector cost hook, maybe we can do sth in
> > cost model to prevent vectorization when node's definition from big-size
> > parameter.
> 
> Note we vectorize a load here for which we do not pass down an SLP node.
> But of course there's the stmt-info one could look at - but the issue
> is that for SLP that doesn't tell you which part of the variable is accessed.
> Also even if we were to pass down the SLP node we do not know exactly how
> it is going to vectorize - but sure, we could play with some heuristics
> there.
> 
> For x86 we can just assume that all aggregates > 16 bytes are passed on the
> stack, correct?  Note I see for
> 
> #include <stdlib.h>
> 
> struct X { double x[3]; };
> typedef double v2df __attribute__((vector_size(16)));
> 
> v2df __attribute__((noipa))
> foo (struct X x, struct X y)
> {
>   return (v2df) {x.x[1], x.x[2] } + (v2df) { y.x[0], y.x[1] };
> }
> 
> struct X y;
> int main(int argc, char **argv)
> {
>   struct X x = y;
>   int cnt = atoi (argv[1]);
>   for (int i = 0; i < cnt; ++i)
>     foo (x, x);
>   return 0;
> }
> 
> the structs passed as
> 
>         movups  %xmm0, 24(%rsp)
>         movq    %rax, 40(%rsp)
>         movq    %rax, 16(%rsp)
>         movups  %xmm0, (%rsp)
>         call    foo
> 
> so alignment of the stack variable depends on the position of the
> function argument (and thus preceding parameters).  That means
> we cannot rely on &y being 16 byte aligned and it seems we cannot
> rely on a particular store sequence order either here.
We can start with disabling this vectorization under the very-cheap cost model
to fix the -O2 regressions, then fine-tune that in GCC 13.

* [Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
From: crazylht at gmail dot com @ 2022-02-28  1:30 UTC

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #27 from Hongtao.liu <crazylht at gmail dot com> ---
> We can start with disabling this vectorization under the very-cheap cost model
Of course, only for (>=)16-byte struct passing.

* [Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
From: lili.cui at intel dot com @ 2022-02-28  5:13 UTC

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #28 from cuilili <lili.cui at intel dot com> ---
(In reply to H.J. Lu from comment #25)
> Can this be mitigated by removing the redundant loads and stores?
Yes, inlining ray_sphere can remove the redundant loads and stores; -O3 does
the inlining, but -O2 is more sensitive to code size and does not inline it.

* [Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
From: crazylht at gmail dot com @ 2022-03-01  9:33 UTC

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #29 from Hongtao.liu <crazylht at gmail dot com> ---
From Agner Fog's excellent optimization manuals
(https://www.agner.org/optimize/microarchitecture.pdf), for ICX/TGL:

An aligned write of 128 bits or more followed by a read of one or both of the
two halves or the four quarters, etc., has little or no penalty.  A partial
read that does not fit into the halves or quarters fails to forward.  The
write-to-read latency is 19-20 clock cycles when forwarding fails.
A read that is bigger than the write, or a read that covers both written and
unwritten bytes, fails to forward.  The write-to-read latency is 19-20 clock
cycles.
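
For illustration, here is a small hand-written model of that rule (my own
simplification and an assumption, not code from the manual or from GCC):
a load forwards only if it is fully covered by the store and sits at a
naturally aligned offset of its own size within it.

#include <stdbool.h>
#include <stdio.h>

/* Simplified STLF model: write of WSZ bytes at WOFF, later read of RSZ
   bytes at ROFF (absolute byte offsets, sizes powers of two).  */
static bool
forwards (unsigned woff, unsigned wsz, unsigned roff, unsigned rsz)
{
  if (roff < woff || roff + rsz > woff + wsz)
    return false;                   /* read covers unwritten bytes: fails */
  return (roff - woff) % rsz == 0;  /* must hit an aligned half/quarter */
}

int main (void)
{
  /* The c-ray pattern: 16-byte stores at offsets 16 and 32 of the
     argument area, and a 16-byte load at offset 24.  */
  printf ("%d\n", forwards (16, 16, 24, 16));  /* 0 - straddles, ~19-20 cycles */
  printf ("%d\n", forwards (16, 16, 24, 8));   /* 1 - upper half, forwards */
  return 0;
}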

And from the Intel software optimization guide:

There are several cases in which data is passed through memory, where the
store may need to be separated from the load:
• Spills, save and restore registers in a stack frame.
• Parameter passing.
• Global and volatile variables.
• Type conversion between integer and floating-point.
• When compilers do not analyze code that is inlined, forcing variables that
  are involved in the interface with inlined code to be in memory, creating
  more memory variables and preventing the elimination of redundant loads.

* [Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
From: crazylht at gmail dot com @ 2022-03-10 13:47 UTC

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #30 from Hongtao.liu <crazylht at gmail dot com> ---
Created attachment 52594
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52594&action=edit
tar -xvf micro.tar.gz

Num  char/s   char/v   char/vn   short/s  short/v  short/vn  int/s    int/v    int/vn    int64/s  int64/v  int64/vn  float/s  float/v  float/vn  double/s double/v double/vn
2    3.01308  5.77472  2.51209   3.01211  5.1863   2.51186   3.01316  5.87912  2.51149   3.01267  6.842    2.51195   3.01294  7.28071  2.51211   3.01343  8.28379  2.51226
4    3.57279  4.97372  2.51137   3.5156   5.18539  2.51204   3.51603  5.9016   2.51148   3.57062  7.34315  2.51127   3.56799  7.28184  2.5105    3.78715  8.78754  2.51126
8    4.524    4.97573  2.51168   4.55842  5.08339  2.51106   4.66614  6.40174  2.51107   5.32924  7.66509  2.6445    5.42716  7.78232  2.51272   5.80704  9.51308  2.64533
16   6.52829  4.83359  2.51139   6.5292   5.56546  2.51095   6.53379  6.61226  2.64337   6.69231  7.93031  2.90873   8.03185  8.45706  2.65844   8.03236  10.3075  2.91103

type/s:  scalar
type/v:  vector with penalty
type/vn: vector w/o penalty

* [Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
From: crazylht at gmail dot com @ 2022-03-10 13:54 UTC

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #31 from Hongtao.liu <crazylht at gmail dot com> ---
Created attachment 52595
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52595&action=edit
microbenchmark

* [Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
From: crazylht at gmail dot com @ 2022-03-10 13:55 UTC

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #32 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Hongtao.liu from comment #31)
> Created attachment 52595 [details]
> microbenchmark

The microbenchmark is used to test the penalty for STLF failures.  I've run
it on CLX and found that 1 stalled vector load is faster than 16 scalar
loads, but a little bit slower than 8 scalar loads, and far behind 4 (or
fewer) scalar loads.

Num  char/s   char/v   char/vn   short/s  short/v  short/vn  int/s    int/v    int/vn    int64/s  int64/v  int64/vn  float/s  float/v  float/vn  double/s double/v double/vn
2    3.01308  5.77472  2.51209   3.01211  5.1863   2.51186   3.01316  5.87912  2.51149   3.01267  6.842    2.51195   3.01294  7.28071  2.51211   3.01343  8.28379  2.51226
4    3.57279  4.97372  2.51137   3.5156   5.18539  2.51204   3.51603  5.9016   2.51148   3.57062  7.34315  2.51127   3.56799  7.28184  2.5105    3.78715  8.78754  2.51126
8    4.524    4.97573  2.51168   4.55842  5.08339  2.51106   4.66614  6.40174  2.51107   5.32924  7.66509  2.6445    5.42716  7.78232  2.51272   5.80704  9.51308  2.64533
16   6.52829  4.83359  2.51139   6.5292   5.56546  2.51095   6.53379  6.61226  2.64337   6.69231  7.93031  2.90873   8.03185  8.45706  2.65844   8.03236  10.3075  2.91103

type/s:  scalar
type/v:  vector with penalty
type/vn: vector w/o penalty

* [Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
From: crazylht at gmail dot com @ 2022-03-11  7:11 UTC

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #33 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Hongtao.liu from comment #32)
> (In reply to Hongtao.liu from comment #31)
> > Created attachment 52595 [details]
> > microbenchmark
> 
The interesting thing is that the microbenchmark didn't hit a
store-forwarding stall on the znver2 client for the overlapping 128-bit
v2di/v2df/v2si/v4si/v2sf/v4sf loads, but for c-ray there's a regression
due to STLF failures???

For the microbenchmark it's like

        leaq    -1200(%rbp), %rsi
...
        vmovdqa %xmm0, -1088(%rbp)
        vmovdqa %xmm0, -1072(%rbp)
...
        movq    %rsi, %rdi


        vmovupd 120(%rsi), %xmm0
        vaddpd  120(%rdi), %xmm0, %xmm0
        vmovupd %xmm0, (%rdx)


120(%rsi) equals -1080(%rbp) (%rsi is %rbp - 1200), so vmovupd 120(%rsi), %xmm0
reads half of vmovdqa %xmm0, -1088(%rbp) and half of vmovdqa %xmm0, -1072(%rbp).


whole data:

char
NUM2
scalar: 2.66484
   vec: 7.14645: penalty
  vecn: 2.26811
NUM4
scalar: 3.17188
   vec: 5.79971: penalty
  vecn: 2.22844
NUM8
scalar: 4.06115
   vec: 5.76087: penalty
  vecn: 2.25474
NUM16
scalar: 5.84893
   vec: 5.77123: penalty
  vecn: 2.23649
short
NUM2
scalar: 2.6982
   vec: 5.98521: penalty
  vecn: 2.25488
NUM4
scalar: 3.15688
   vec: 5.98339: penalty
  vecn: 2.25535
NUM8
scalar: 4.10435
   vec: 5.98285: penalty
  vecn: 2.25676
NUM16
scalar: 5.92615
   vec: 5.77799: penalty
  vecn: 2.24804
int
NUM2
scalar: 2.72005
   vec: 2.46749: no!!
  vecn: 2.25704
NUM4
scalar: 3.18113
   vec: 2.46506: no!!
  vecn: 2.26846
NUM8
scalar: 4.01626
   vec: 6.67516: penalty
  vecn: 2.27382
NUM16
scalar: 5.92935
   vec: 7.17056: penalty
  vecn: 10.0371
int64_t
NUM2
scalar: 2.67302
   vec: 2.48949: no!!
  vecn: 2.24273
NUM4
scalar: 3.17415
   vec: 7.80522: penalty
  vecn: 2.25004
NUM8
scalar: 4.07681
   vec: 8.31397: penalty
  vecn: 10.0378
NUM16
scalar: 5.81931
   vec: 7.85716: penalty
  vecn: 10.863
float
NUM2
scalar: 2.67386
   vec: 2.48: no!!
  vecn: 2.26215
NUM4
scalar: 3.17401
   vec: 2.48121: no!!
  vecn: 2.23051
NUM8
scalar: 4.05976
   vec: 7.16108: penalty
  vecn: 2.27791
NUM16
scalar: 6.08089
   vec: 7.61818: penalty
  vecn: 10.6009
double
NUM2
scalar: 2.67811
   vec: 2.46635: no!!
  vecn: 2.22982
NUM4
scalar: 3.19169
   vec: 8.2489: penalty
  vecn: 2.25086
NUM8
scalar: 4.05351
   vec: 8.70083: penalty

* [Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
From: rguenth at gcc dot gnu.org @ 2022-03-11  8:32 UTC

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #34 from Richard Biener <rguenth at gcc dot gnu.org> ---
I can confirm this observation on Zen2.  Note perf still records STLF
failures for these cases; it just seems that the penalties are well hidden
by the high store load on the caller side for small NUM?

I'm not sure how well CPUs handle OOO execution across calls here, but I'm
guessing that for c-ray there are only dependent instructions on the
STLF-failing loads, while for your test the result is always stored to
memory.

Not sure if there's a good way to add a "serializing" instruction
instead of a store - if we wrote vector code directly we'd return
an accumulation vector and pass that as input to further iterations,
but that makes it difficult to compare to the scalar variant.

If we look at

double
NUM2
scalar: 2.67811
   vec: 2.46635: no!!
  vecn: 2.22982

then we do see a slight penalty relative to the case with successful STLF,
but I suspect the main load of the test is the 9 vector stores in the caller.

What's odd though is

NUM4
scalar: 3.19169
   vec: 8.2489: penalty
  vecn: 2.25086

we still have the "same" assembly in foo, just using %ymm instead of %xmm.


I'll also note that foo2n vs. foo2 access stores at different distances:

void
__attribute__ ((noipa))
foo (TYPE* x, TYPE* y, TYPE* __restrict p)
{
  p[0] = x[0] + y[0];
  p[1] = x[1] + y[1];
}

vs.

void
__attribute__ ((noipa))
foo (TYPE* x, TYPE* y, TYPE* __restrict p)
{
  p[0] = x[15] + y[15];
  p[1] = x[16] + y[16];
}      

shouldn't the former access x[14] and x[15]?  Also, on Zen2, using
512-bit vector stores in main() causes them to be decomposed into
128-bit vector stores - not in generic vector lowering, which should
choose 256-bit vector stores, but during RTL expansion.  So we have
to avoid this, otherwise the vecn cases with larger vector sizes
will fail to STLF as well.

With the two possible issues resolved I get

char
NUM2
scalar: 2.61746
   vec: 6.99399
  vecn: 2.17881
NUM4
scalar: 3.04455
   vec: 5.6571
  vecn: 2.17512
NUM8
scalar: 3.99576
   vec: 5.64829
  vecn: 2.18647
NUM16
scalar: 5.71159
   vec: 5.70879
  vecn: 2.222
short
NUM2
scalar: 2.63836
   vec: 5.92917
  vecn: 2.22295
NUM4
scalar: 3.07966
   vec: 5.93041
  vecn: 2.22694
NUM8
scalar: 4.14134
   vec: 6.16279
  vecn: 2.29287
NUM16
scalar: 5.96713
   vec: 5.91371
  vecn: 2.29854
int
NUM2
scalar: 2.74058
   vec: 2.51288
  vecn: 2.28018
NUM4
scalar: 3.22811
   vec: 2.53454
  vecn: 2.30637
NUM8
scalar: 4.14464
   vec: 6.84145
  vecn: 2.30211
NUM16
scalar: 5.97653
   vec: 7.28825
  vecn: 2.52693
int64_t
NUM2
scalar: 2.75497
   vec: 2.51353
  vecn: 2.29852
NUM4
scalar: 3.20552
   vec: 8.02914
  vecn: 2.28612
NUM8
scalar: 4.1486
   vec: 8.40673
  vecn: 2.54104
NUM16
scalar: 5.96569
   vec: 8.03334
  vecn: 2.98774
float
NUM2
scalar: 2.74666
   vec: 2.53057
  vecn: 2.29079
NUM4
scalar: 3.22499
   vec: 2.52525
  vecn: 2.29374
NUM8
scalar: 4.12471
   vec: 7.33367
  vecn: 2.30114
NUM16
scalar: 6.27016
   vec: 7.78154
  vecn: 2.53966
double
NUM2
scalar: 2.76049
   vec: 2.52339
  vecn: 2.31286
NUM4
scalar: 3.25052
   vec: 8.09372
  vecn: 2.31465
NUM8
scalar: 4.19226
   vec: 8.90108
  vecn: 2.56059
NUM16
scalar: 6.32366
   vec: 8.22693
  vecn: 3.00417

Note Zen2 has comparatively few entries in the store queue, 22 when
SMT is enabled (the 44 are statically partitioned).

What I take away from this is that modern OOO archs do not benefit much
from short sequences of low-lane vectorized code (here in particular
NUM2) since there's a good chance there are enough resources to carry
out the scalar variant in parallel.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
  2021-08-14 14:27 [Bug middle-end/101908] New: cray regression with -O2 -ftree-slp-vectorize compared to -O2 hubicka at gcc dot gnu.org
                   ` (33 preceding siblings ...)
  2022-03-11  8:32 ` rguenth at gcc dot gnu.org
@ 2022-03-11  8:48 ` crazylht at gmail dot com
  2022-03-11 10:41 ` rguenth at gcc dot gnu.org
                   ` (13 subsequent siblings)
  48 siblings, 0 replies; 50+ messages in thread
From: crazylht at gmail dot com @ 2022-03-11  8:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #35 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Richard Biener from comment #34)
> I can confirm this observation on Zen2.  Note perf still records STLF
> failures
The penalty is much higher on Znver3 than on Zen2 for the same case (v2df).

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
  2021-08-14 14:27 [Bug middle-end/101908] New: cray regression with -O2 -ftree-slp-vectorize compared to -O2 hubicka at gcc dot gnu.org
                   ` (34 preceding siblings ...)
  2022-03-11  8:48 ` crazylht at gmail dot com
@ 2022-03-11 10:41 ` rguenth at gcc dot gnu.org
  2022-03-11 13:14 ` crazylht at gmail dot com
                   ` (12 subsequent siblings)
  48 siblings, 0 replies; 50+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-03-11 10:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #36 from Richard Biener <rguenth at gcc dot gnu.org> ---
As additional observation for the c-ray case we end up with

  <bb 2> [local count: 1073741824]:
  vect_ray_orig_x_87.270_173 = MEM <vector(2) double> [(double *)&ray];
  _170 = BIT_FIELD_REF <vect_ray_orig_x_87.270_173, 64, 64>;
  _171 = BIT_FIELD_REF <vect_ray_orig_x_87.270_173, 64, 0>;
  # DEBUG D#93 => ray.orig.x
  # DEBUG ray$orig$x => D#93
  # DEBUG D#92 => ray.orig.y
  # DEBUG ray$orig$y => D#92
  ray$orig$z_89 = ray.orig.z;
  # DEBUG ray$orig$z => ray$orig$z_89
  vect_ray_dir_x_90.266_178 = MEM <vector(2) double> [(double *)&ray + 24B];
  _175 = BIT_FIELD_REF <vect_ray_dir_x_90.266_178, 64, 64>;
  _176 = BIT_FIELD_REF <vect_ray_dir_x_90.266_178, 64, 0>;

so we load as a vector but need both lanes in scalar code pieces we
couldn't vectorize (live lanes).  It's somewhat difficult to reverse the
vectorization decision at that point - we need the final idea of which
stmts we vectorize to compute live lanes, and we need to know which
operands are vectorized to tell whether we can vectorize a stmt.  But at
least for loads we eventually could use scalar loads and a CTOR "late".
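
In source form such a "late" variant would look like the sketch below
(GCC vector extensions; vec3 and load_orig_xy are made-up names standing
in for the c-ray types):

typedef double v2df __attribute__ ((vector_size (16)));
struct vec3 { double x, y, z; };

/* Two scalar loads feeding a vector CTOR instead of one vector load;
   each 8-byte load can forward from a matching 8-byte store in the
   caller.  */
static v2df
load_orig_xy (const struct vec3 *o)
{
  return (v2df) { o->x, o->y };
}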

There's also code in GIMPLE forwprop that can decompose vector loads
feeding BIT_FIELD_REFs, but it only does that if there's no other use of
the vector (in this case of course there is - a single use for the first
and two for the second).

There is not much value in the vectorization we do in this function
(when manually fixing the STLF issue the speed is as good as with the
scalar code).  We cost

ray.dir.x 1 times scalar_load costs 12 in body
ray.dir.y 1 times scalar_load costs 12 in body

vs.

ray.dir.x 1 times unaligned_load (misalign -1) costs 12 in body
ray.dir.x 1 times vec_to_scalar costs 4 in epilogue
ray.dir.y 1 times vec_to_scalar costs 4 in epilogue

which is probably OK; with SSE it's two loads vs. one load + move + unpck,
with AVX we can elide the move (but a move is free).  The disadvantage of
the vector load is the higher latency on the high part (plus of course
the STLF hit).  Since the vectorizer doesn't prune individual stmts
because of costs but only throws away the whole opportunity if the
overall cost doesn't seem profitable, it's difficult to optimally
handle this on the costing side I think.  Instead the vectorizer should
somehow be directed to use scalar loads + vector construction if
likely STLF failures are detected.

For example the following mitigates the issue for c-ray without resorting
to "late" adjustments via costs but instead by changing the vectorization
strategy for possibly affected loads, using target-independent and
likely flawed heuristics.  A full exercise of the cumulative-args
machinery might be able to tell how (parts of) a PARM_DECL are passed.
Whether the caller will end up using wider moves with %xmm remains a guess
of course.  What's also completely missing is an idea how far from
function entry this vectorization happens - for c-ray it would be enough
to restrict this to loads in BB 2, for example.

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 5c9e8cfefa5..4f07e5ddc61 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -2197,7 +2197,24 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
   /* Stores can't yet have gaps.  */
   gcc_assert (slp_node || vls_type == VLS_LOAD || gap == 0);

-  if (slp_node)
+  if (!loop_vinfo
+      && vls_type == VLS_LOAD
+      && TREE_CODE (DR_BASE_ADDRESS (first_dr_info->dr)) == ADDR_EXPR
+      && (TREE_CODE (TREE_OPERAND (DR_BASE_ADDRESS (first_dr_info->dr), 0))
+         == PARM_DECL)
+      /* Assume that for a power of two number of elements the aggregate
+        move to the stack is using larger moves at the caller side.  */
+      && !pow2p_hwi (group_size))
+    {
+      /* When doing BB vectorizing force loads from function parameters
+        (???  that are passed in memory and stored in pieces likely
+        causing STLF failures) to be done elementwise.  */
+      /* ???  Note this will cause vectorization to fail because of
+        the fear of underestimating the cost of elementwise accesses,
+        see the end of get_load_store_type.  */
+      *memory_access_type = VMAT_ELEMENTWISE;
+    }
+  else if (slp_node)
     {
       /* For SLP vectorization we directly vectorize a subchain
         without permutation.  */

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
  2021-08-14 14:27 [Bug middle-end/101908] New: cray regression with -O2 -ftree-slp-vectorize compared to -O2 hubicka at gcc dot gnu.org
                   ` (35 preceding siblings ...)
  2022-03-11 10:41 ` rguenth at gcc dot gnu.org
@ 2022-03-11 13:14 ` crazylht at gmail dot com
  2022-03-11 13:27 ` rguenther at suse dot de
                   ` (11 subsequent siblings)
  48 siblings, 0 replies; 50+ messages in thread
From: crazylht at gmail dot com @ 2022-03-11 13:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #37 from Hongtao.liu <crazylht at gmail dot com> ---
> There is not much value in the vectorization we do in this function
> (when manually fixing the STLF issue the speed is as good as with the
> scalar code).  We cost
> 
> ray.dir.x 1 times scalar_load costs 12 in body
> ray.dir.y 1 times scalar_load costs 12 in body
Still, from a target-related perspective, instead of adding a cost for
the STLF penalty, maybe we should just reduce the cost of scalar_load
when it's from a parm_decl, because STLF will probably succeed there.
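
Something like the following in the backend cost hook (a sketch only; the
macros are the ones used in the patch from comment #36, while the exact
hook plumbing and the discount of 1/4 are assumptions):

  /* Scalar loads from an incoming argument likely forward from the
     caller's element stores, so make them cheaper relative to a vector
     load from the same location.  */
  if (kind == scalar_load
      && stmt_info
      && STMT_VINFO_DATA_REF (stmt_info)
      && TREE_CODE (DR_BASE_ADDRESS (STMT_VINFO_DATA_REF (stmt_info)))
         == ADDR_EXPR
      && TREE_CODE (TREE_OPERAND (DR_BASE_ADDRESS
                                    (STMT_VINFO_DATA_REF (stmt_info)), 0))
         == PARM_DECL)
    stmt_cost -= stmt_cost / 4;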

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
  2021-08-14 14:27 [Bug middle-end/101908] New: cray regression with -O2 -ftree-slp-vectorize compared to -O2 hubicka at gcc dot gnu.org
                   ` (36 preceding siblings ...)
  2022-03-11 13:14 ` crazylht at gmail dot com
@ 2022-03-11 13:27 ` rguenther at suse dot de
  2022-03-14  9:05 ` crazylht at gmail dot com
                   ` (10 subsequent siblings)
  48 siblings, 0 replies; 50+ messages in thread
From: rguenther at suse dot de @ 2022-03-11 13:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #38 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 11 Mar 2022, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908
> 
> --- Comment #37 from Hongtao.liu <crazylht at gmail dot com> ---
> > There is not much value in the vectorization we do in this function
> > (when manually fixing the STLF issue the speed is as good as with the
> > scalar code).  We cost
> > 
> > ray.dir.x 1 times scalar_load costs 12 in body
> > ray.dir.y 1 times scalar_load costs 12 in body
> Still, from a target-related perspective, instead of adding a cost for
> the STLF penalty, maybe we should just reduce the cost of scalar_load
> when it's from a parm_decl, because STLF will probably succeed there.

That's an interesting idea - it would eventually also improve the case
where the argument is passed in register(s) but we fail to realize that.

I'll see if I get around to prototyping some argument classification
in the vectorizer (looking at how hard it is to use
INIT_CUMULATIVE_ARGS in a context where we are not expanding to RTL);
unfortunately stack passing is done by code in function.cc (plus
extra target hooks of course), but it might be easy enough to figure
out alignment and size at least (and whether arguments are passed on
the stack or not).
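
Roughly along these lines (untested sketch; assuming the usual
INIT_CUMULATIVE_ARGS/cumulative_args_t interface and that fndecl is the
function whose incoming arguments we want to classify):

  CUMULATIVE_ARGS cum_v;
  INIT_CUMULATIVE_ARGS (cum_v, TREE_TYPE (fndecl), NULL_RTX, fndecl, -1);
  cumulative_args_t cum = pack_cumulative_args (&cum_v);
  for (tree parm = DECL_ARGUMENTS (fndecl); parm; parm = DECL_CHAIN (parm))
    {
      function_arg_info arg (TREE_TYPE (parm), /*named=*/true);
      /* A NULL return means the argument is passed on the stack.  */
      rtx incoming = targetm.calls.function_arg (cum, arg);
      bool in_reg = incoming && REG_P (incoming);
      /* ... record in_reg plus size/alignment of the slot here ...  */
      targetm.calls.function_arg_advance (cum, arg);
    }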

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
  2021-08-14 14:27 [Bug middle-end/101908] New: cray regression with -O2 -ftree-slp-vectorize compared to -O2 hubicka at gcc dot gnu.org
                   ` (37 preceding siblings ...)
  2022-03-11 13:27 ` rguenther at suse dot de
@ 2022-03-14  9:05 ` crazylht at gmail dot com
  2022-03-14  9:16 ` rguenther at suse dot de
                   ` (9 subsequent siblings)
  48 siblings, 0 replies; 50+ messages in thread
From: crazylht at gmail dot com @ 2022-03-14  9:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #39 from Hongtao.liu <crazylht at gmail dot com> ---

> I'll see if I get around to prototype some argument classification
> in the vectorizer (looking how hard it is to use
> INIT_CUMULATIVE_ARGS in a context where we are not expanding to RTL),
> unfortunately stack passing is done by code in function.cc (plus
> extra target hooks of course), but it might be easy enough to figure
> alignment and size at least (and whether arguments are passed on
> the stack or not).

According to the Intel software optimization guide:
When using an unmasked store instruction and a load instruction after it, data
forwarding depends on ***load type, size and address offset from the store
address***, and does not depend on the store address itself (i.e., the store
address does not have to be aligned to or fit into a cache line; forwarding
will occur for nonaligned and even line-split stores).  The figure in the
guide describes all possible cases when data forwarding will occur.

I'm not sure if we can get the store size in the vectorizer; how the
parameter was pushed to the stack by the caller also matters for STLF.
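
As a concrete instance of the offset dependence (made-up example in GNU C;
the actual penalty is uarch specific): calling load_wide shortly after
store_elems hits the failing case because no single store covers the load.

typedef double v2df __attribute__ ((vector_size (16)));
double buf[4];

__attribute__ ((noipa)) void
store_elems (double a, double b)
{
  buf[1] = a;   /* two 8-byte stores */
  buf[2] = b;
}

__attribute__ ((noipa)) double
load_wide (void)
{
  v2df v;
  /* One 16-byte load spanning both 8-byte stores: forwarding fails and
     the load has to wait for the stores to drain.  */
  __builtin_memcpy (&v, &buf[1], sizeof v);
  return v[0] + v[1];
}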

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
  2021-08-14 14:27 [Bug middle-end/101908] New: cray regression with -O2 -ftree-slp-vectorize compared to -O2 hubicka at gcc dot gnu.org
                   ` (38 preceding siblings ...)
  2022-03-14  9:05 ` crazylht at gmail dot com
@ 2022-03-14  9:16 ` rguenther at suse dot de
  2022-03-15  1:52 ` crazylht at gmail dot com
                   ` (8 subsequent siblings)
  48 siblings, 0 replies; 50+ messages in thread
From: rguenther at suse dot de @ 2022-03-14  9:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #40 from rguenther at suse dot de <rguenther at suse dot de> ---
On Mon, 14 Mar 2022, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908
> 
> --- Comment #39 from Hongtao.liu <crazylht at gmail dot com> ---
> 
> > I'll see if I get around to prototype some argument classification
> > in the vectorizer (looking how hard it is to use
> > INIT_CUMULATIVE_ARGS in a context where we are not expanding to RTL),
> > unfortunately stack passing is done by code in function.cc (plus
> > extra target hooks of course), but it might be easy enough to figure
> > alignment and size at least (and whether arguments are passed on
> > the stack or not).
> 
> According to the Intel software optimization guide:
> When using an unmasked store instruction and a load instruction after it, data
> forwarding depends on ***load type, size and address offset from the store
> address***, and does not depend on the store address itself (i.e., the store
> address does not have to be aligned to or fit into a cache line; forwarding
> will occur for nonaligned and even line-split stores).  The figure in the
> guide describes all possible cases when data forwarding will occur.
> 
> I'm not sure if we can get the store size in the vectorizer; how the
> parameter was pushed to the stack by the caller also matters for STLF.

Yes, but since we now use _by_pieces for stack pushing we can try aligning
the heuristics on both sides.  The main point of using INIT_CUMULATIVE_ARGS
is of course to figure out whether a decl is passed in registers - there
are plenty of PRs where we get costs wrong for that case.

My additional worry is that we're going to be too pessimistic for
cases that execute long after the argument setup and thus will fetch
from L1 instead of forwarding from the store buffers.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
  2021-08-14 14:27 [Bug middle-end/101908] New: cray regression with -O2 -ftree-slp-vectorize compared to -O2 hubicka at gcc dot gnu.org
                   ` (39 preceding siblings ...)
  2022-03-14  9:16 ` rguenther at suse dot de
@ 2022-03-15  1:52 ` crazylht at gmail dot com
  2022-03-15  7:14 ` rguenther at suse dot de
                   ` (7 subsequent siblings)
  48 siblings, 0 replies; 50+ messages in thread
From: crazylht at gmail dot com @ 2022-03-15  1:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #41 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Richard Biener from comment #22)
> (In reply to Hongtao.liu from comment #21)
> > Now we have SLP node available in vector cost hook, maybe we can do sth in
> > cost model to prevent vectorization when node's definition from big-size
> > parameter.
> 
> Note we vectorize a load here for which we do not pass down an SLP node.
> But of course there's the stmt-info one could look at - but the issue
> is that for SLP that doesn't tell you which part of the variable is accessed.
> Also even if we were to pass down the SLP node we do not know exactly how
> it is going to vectorize - but sure, we could play with some heuristics
Then we can't get the exact offset between the load address and the store
address.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
  2021-08-14 14:27 [Bug middle-end/101908] New: cray regression with -O2 -ftree-slp-vectorize compared to -O2 hubicka at gcc dot gnu.org
                   ` (40 preceding siblings ...)
  2022-03-15  1:52 ` crazylht at gmail dot com
@ 2022-03-15  7:14 ` rguenther at suse dot de
  2022-03-29  3:40 ` crazylht at gmail dot com
                   ` (6 subsequent siblings)
  48 siblings, 0 replies; 50+ messages in thread
From: rguenther at suse dot de @ 2022-03-15  7:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #42 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 15 Mar 2022, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908
> 
> --- Comment #41 from Hongtao.liu <crazylht at gmail dot com> ---
> (In reply to Richard Biener from comment #22)
> > (In reply to Hongtao.liu from comment #21)
> > > Now we have SLP node available in vector cost hook, maybe we can do sth in
> > > cost model to prevent vectorization when node's definition from big-size
> > > parameter.
> > 
> > Note we vectorize a load here for which we do not pass down an SLP node.
> > But of course there's the stmt-info one could look at - but the issue
> > is that for SLP that doesn't tell you which part of the variable is accessed.
> > Also even if we were to pass down the SLP node we do not know exactly how
> > it is going to vectorize - but sure, we could play with some heuristics
> Then we can't get the exact offset between the load address and the store
> address.

Yes, at the moment this info is not present.  I do have ideas how to
refactor things to make the exact generated stores and loads available,
but it will be quite some work to do that.  But sure, for costing we
should know exactly what is going to be generated (at the GIMPLE level);
we shouldn't have to second-guess.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
  2021-08-14 14:27 [Bug middle-end/101908] New: cray regression with -O2 -ftree-slp-vectorize compared to -O2 hubicka at gcc dot gnu.org
                   ` (41 preceding siblings ...)
  2022-03-15  7:14 ` rguenther at suse dot de
@ 2022-03-29  3:40 ` crazylht at gmail dot com
  2022-03-29  4:01 ` crazylht at gmail dot com
                   ` (5 subsequent siblings)
  48 siblings, 0 replies; 50+ messages in thread
From: crazylht at gmail dot com @ 2022-03-29  3:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #43 from Hongtao.liu <crazylht at gmail dot com> ---
One thing I found by experiment:
Inserting 64 vaddps %xmm18, %xmm19, %xmm20 instructions (no dependence
between each other, just to fill the pipeline) before the stalled load
makes the STLF-stall case as fast as the no-stall cases on CLX. I guess
this is the "distance" you mean.

Is there any existing structure in GCC from which I can get the latency
from function entry to the load instruction? And of course for a loop with
an unknown trip count the latency can't be exactly estimated. Similarly for
cases where the load is in a join_bb - I guess we need to calculate the
"average" latency among all possible predecessors?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
  2021-08-14 14:27 [Bug middle-end/101908] New: cray regression with -O2 -ftree-slp-vectorize compared to -O2 hubicka at gcc dot gnu.org
                   ` (42 preceding siblings ...)
  2022-03-29  3:40 ` crazylht at gmail dot com
@ 2022-03-29  4:01 ` crazylht at gmail dot com
  2022-03-29  6:47 ` rguenther at suse dot de
                   ` (4 subsequent siblings)
  48 siblings, 0 replies; 50+ messages in thread
From: crazylht at gmail dot com @ 2022-03-29  4:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #44 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Hongtao.liu from comment #43)
> One thing I found by experiment:
> Inserting 64 vaddps %xmm18, %xmm19, %xmm20 instructions (no dependence
> between each other, just to fill the pipeline) before the stalled load
> makes the STLF-stall case as fast as the no-stall cases on CLX. I guess
> this is the "distance" you mean.
> 
But there are still events for STLF blocks; I guess the processor's
scheduler helps here.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
  2021-08-14 14:27 [Bug middle-end/101908] New: cray regression with -O2 -ftree-slp-vectorize compared to -O2 hubicka at gcc dot gnu.org
                   ` (43 preceding siblings ...)
  2022-03-29  4:01 ` crazylht at gmail dot com
@ 2022-03-29  6:47 ` rguenther at suse dot de
  2022-03-29  8:27 ` crazylht at gmail dot com
                   ` (3 subsequent siblings)
  48 siblings, 0 replies; 50+ messages in thread
From: rguenther at suse dot de @ 2022-03-29  6:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #45 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 29 Mar 2022, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908
> 
> --- Comment #43 from Hongtao.liu <crazylht at gmail dot com> ---
> One thing I found by experiment:
> Inserting 64 vaddps %xmm18, %xmm19, %xmm20 instructions (no dependence
> between each other, just to fill the pipeline) before the stalled load
> makes the STLF-stall case as fast as the no-stall cases on CLX. I guess
> this is the "distance" you mean.

Yes - on the micro-architecture that's likely the point the data is
then available from L1-D.  The "distance" might depend on the store
workload (# of stores that can issue / retire / flush to L1 per cycle).

> Is there any existing structure in GCC from which I can get the latency
> from function entry to the load instruction?

There's the DFA description used by the instruction scheduler.  I'm
not familiar with that part of GCC but IIRC the dependence and DFA
query part should be sufficiently separate.  For OOO
uarchs we can compute a minimum distance based purely on frontend
cycles.  Doing better would need to look at instruction dependences.
I'm not sure if the CPUs we care about use forwarding possibilities
in the decision to OOO schedule loads/stores but IIRC store buffer
entries are allocated early at insn issue time and memory dependences
are taken into account.

Since we have no idea about the instruction sequence before function
entry, going into too much detail will probably suffer from GIGO, so
I'd resort to approximating only the frontend side of the pipeline
by some manual bean counting.

> And of course for a loop with an unknown trip count the latency can't be
> exactly estimated. Similarly for cases where the load is in a join_bb - I
> guess we need to calculate the "average" latency among all possible
> predecessors?

I'd have simply stopped at backwards-reachable blocks since whether
or not a load will forward from a store before function entry will
depend on the iteration number.

Likewise for CFG joins - I suppose one could conservatively assume
the shorter or longer path is taken, depending on which side we want
to err on (maybe even look at the edge probabilities and choose the
most probable incoming path length).

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
  2021-08-14 14:27 [Bug middle-end/101908] New: cray regression with -O2 -ftree-slp-vectorize compared to -O2 hubicka at gcc dot gnu.org
                   ` (44 preceding siblings ...)
  2022-03-29  6:47 ` rguenther at suse dot de
@ 2022-03-29  8:27 ` crazylht at gmail dot com
  2022-03-29 10:00 ` rguenther at suse dot de
                   ` (2 subsequent siblings)
  48 siblings, 0 replies; 50+ messages in thread
From: crazylht at gmail dot com @ 2022-03-29  8:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #46 from Hongtao.liu <crazylht at gmail dot com> ---
Another issue is splitting a vector load into halves or into elements: the
latter requires scratch registers, which may not be available; the former
doesn't require an extra register but may still trigger STLF stalls. For
the cray case, splitting into halves is equal to splitting into elements.

For x86, there are the sse/256_unaligned_load_optimal tunings that would
split 128/256-bit vector loads into halves.
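
For V2DF the element split is just the following (sketch in GNU C;
whether the compiler re-fuses the two loads into one 16-byte load is a
separate codegen decision):

typedef double v2df __attribute__ ((vector_size (16)));

/* Two 8-byte loads plus a CTOR instead of one 16-byte movupd; each
   narrow load can forward from a matching 8-byte store, and no scratch
   register is needed.  */
v2df
load_v2df_split (const double *p)
{
  return (v2df) { p[0], p[1] };
}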

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
  2021-08-14 14:27 [Bug middle-end/101908] New: cray regression with -O2 -ftree-slp-vectorize compared to -O2 hubicka at gcc dot gnu.org
                   ` (45 preceding siblings ...)
  2022-03-29  8:27 ` crazylht at gmail dot com
@ 2022-03-29 10:00 ` rguenther at suse dot de
  2022-04-05  5:17 ` cvs-commit at gcc dot gnu.org
  2022-04-05  6:24 ` rguenth at gcc dot gnu.org
  48 siblings, 0 replies; 50+ messages in thread
From: rguenther at suse dot de @ 2022-03-29 10:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #47 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 29 Mar 2022, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908
> 
> --- Comment #46 from Hongtao.liu <crazylht at gmail dot com> ---
> Another issue is splitting a vector load into halves or into elements: the
> latter requires scratch registers, which may not be available; the former
> doesn't require an extra register but may still trigger STLF stalls. For
> the cray case, splitting into halves is equal to splitting into elements.
> 
> For x86, there are the sse/256_unaligned_load_optimal tunings that would
> split 128/256-bit vector loads into halves.

I suggest trying the easy case first: only split when splitting would
split into elements and when that doesn't require scratch registers.
For large N (number of elements) the separate loads + inserts will
eventually offset the penalty of the failing forwarding anyway, so
splitting is less obviously a win (or less obviously not a loss).

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
  2021-08-14 14:27 [Bug middle-end/101908] New: cray regression with -O2 -ftree-slp-vectorize compared to -O2 hubicka at gcc dot gnu.org
                   ` (46 preceding siblings ...)
  2022-03-29 10:00 ` rguenther at suse dot de
@ 2022-04-05  5:17 ` cvs-commit at gcc dot gnu.org
  2022-04-05  6:24 ` rguenth at gcc dot gnu.org
  48 siblings, 0 replies; 50+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-04-05  5:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #48 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:e3174d6183e5c042e822d9feabb670235b737441

commit r12-7990-ge3174d6183e5c042e822d9feabb670235b737441
Author: liuhongt <hongtao.liu@intel.com>
Date:   Wed Mar 30 20:35:55 2022 +0800

    Split vector load from parm_decl to elemental loads to avoid STLF stalls.

    Since cfg is freed before machine_reorg, just do a rough calculation
    of the window according to the layout.
    Also according to an experiment on CLX, set window size to 64.

    Currently only handle V2DFmode load since it doesn't need any scratch
    registers, and it's sufficient to recover cray performance for -O2
    compared to GCC11.

    gcc/ChangeLog:

            PR target/101908
            * config/i386/i386.cc (ix86_split_stlf_stall_load): New
            function.
            (ix86_reorg): Call ix86_split_stlf_stall_load.
            * config/i386/i386.opt (-param=x86-stlf-window-ninsns=): New
            param.

    gcc/testsuite/ChangeLog:

            * gcc.target/i386/pr101908-1.c: New test.
            * gcc.target/i386/pr101908-2.c: New test.
            * gcc.target/i386/pr101908-3.c: New test.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
  2021-08-14 14:27 [Bug middle-end/101908] New: cray regression with -O2 -ftree-slp-vectorize compared to -O2 hubicka at gcc dot gnu.org
                   ` (47 preceding siblings ...)
  2022-04-05  5:17 ` cvs-commit at gcc dot gnu.org
@ 2022-04-05  6:24 ` rguenth at gcc dot gnu.org
  48 siblings, 0 replies; 50+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-04-05  6:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #49 from Richard Biener <rguenth at gcc dot gnu.org> ---
Thanks Hongtao!  Fixed on trunk.

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2022-04-05  6:24 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-14 14:27 [Bug middle-end/101908] New: cray regression with -O2 -ftree-slp-vectorize compared to -O2 hubicka at gcc dot gnu.org
2021-08-16  8:33 ` [Bug middle-end/101908] " crazylht at gmail dot com
2021-08-16  9:05 ` [Bug tree-optimization/101908] " rguenth at gcc dot gnu.org
2021-08-16  9:06 ` rguenth at gcc dot gnu.org
2021-08-25  7:47 ` crazylht at gmail dot com
2021-10-28 12:26 ` [Bug tree-optimization/101908] [12 regression] " hubicka at gcc dot gnu.org
2021-10-28 12:39 ` hubicka at gcc dot gnu.org
2021-10-28 13:02 ` rguenth at gcc dot gnu.org
2021-10-28 13:05 ` rguenth at gcc dot gnu.org
2021-10-28 13:06 ` hubicka at kam dot mff.cuni.cz
2021-10-28 13:09 ` hubicka at kam dot mff.cuni.cz
2021-10-28 13:09 ` rguenth at gcc dot gnu.org
2021-10-28 13:12 ` hubicka at kam dot mff.cuni.cz
2021-10-28 13:13 ` rguenth at gcc dot gnu.org
2021-10-28 13:15 ` rguenth at gcc dot gnu.org
2021-10-28 13:21 ` rguenth at gcc dot gnu.org
2021-10-29 13:58 ` hubicka at kam dot mff.cuni.cz
2022-01-20 11:12 ` rguenth at gcc dot gnu.org
2022-01-20 11:26 ` [Bug target/101908] " rguenth at gcc dot gnu.org
2022-01-20 11:52 ` rguenth at gcc dot gnu.org
2022-02-24  9:54 ` rguenth at gcc dot gnu.org
2022-02-25  3:57 ` crazylht at gmail dot com
2022-02-25  7:33 ` rguenth at gcc dot gnu.org
2022-02-25  8:26 ` lili.cui at intel dot com
2022-02-25  8:31 ` lili.cui at intel dot com
2022-02-25 15:27 ` hjl.tools at gmail dot com
2022-02-28  1:29 ` crazylht at gmail dot com
2022-02-28  1:30 ` crazylht at gmail dot com
2022-02-28  5:13 ` lili.cui at intel dot com
2022-03-01  9:33 ` crazylht at gmail dot com
2022-03-10 13:47 ` crazylht at gmail dot com
2022-03-10 13:54 ` crazylht at gmail dot com
2022-03-10 13:55 ` crazylht at gmail dot com
2022-03-11  7:11 ` crazylht at gmail dot com
2022-03-11  8:32 ` rguenth at gcc dot gnu.org
2022-03-11  8:48 ` crazylht at gmail dot com
2022-03-11 10:41 ` rguenth at gcc dot gnu.org
2022-03-11 13:14 ` crazylht at gmail dot com
2022-03-11 13:27 ` rguenther at suse dot de
2022-03-14  9:05 ` crazylht at gmail dot com
2022-03-14  9:16 ` rguenther at suse dot de
2022-03-15  1:52 ` crazylht at gmail dot com
2022-03-15  7:14 ` rguenther at suse dot de
2022-03-29  3:40 ` crazylht at gmail dot com
2022-03-29  4:01 ` crazylht at gmail dot com
2022-03-29  6:47 ` rguenther at suse dot de
2022-03-29  8:27 ` crazylht at gmail dot com
2022-03-29 10:00 ` rguenther at suse dot de
2022-04-05  5:17 ` cvs-commit at gcc dot gnu.org
2022-04-05  6:24 ` rguenth at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).