public inbox for gcc-patches@gcc.gnu.org
* [RFC] propagation leap over memory copy for struct
@ 2022-10-31  2:42 Jiufu Guo
  2022-10-31 22:13 ` Jeff Law
  2022-11-01  0:37 ` Segher Boessenkool
  0 siblings, 2 replies; 14+ messages in thread
From: Jiufu Guo @ 2022-10-31  2:42 UTC (permalink / raw)
  To: gcc-patches; +Cc: segher, dje.gcc, linkw, guojiufu, rguenth, pinskia

Hi,

We know that for struct variable assignment, a memory copy may be used.
And for memcpy, we load and store as many bytes as possible at one time.
But that may not be best here:
1. Before/after a struct variable assignment, the variable may be operated
on, and it is hard for some optimizations to leap over the memcpy.  Some
struct operations may then be sub-optimal, like the issue in PR65421.
2. The size of a struct is mostly constant, so the memcpy would be expanded
inline.  Using small sizes to load/store and executing in parallel may be no
slower than using large sizes to load/store.  (Sure, more registers may be
used for the smaller accesses.)


In PR65421, for source code as below:
////////t.c
#define FN 4
typedef struct { double a[FN]; } A;

A foo (const A *a) { return *a; }
A bar (const A a) { return a; }
///////

If FN<=2, the size of "A" fits into TImode, and this code can be optimized 
(by subreg/cse/fwprop/cprop) as:
-------
foo:
.LFB0:
        .cfi_startproc
        blr

bar:
.LFB1:
      	.cfi_startproc
	lfd 2,8(3)
	lfd 1,0(3)
	blr
--------
If the size of "A" is larger than any INT mode size, RTL insns would be 
generated as:
   13: r125:V2DI=[r112:DI+0x20]
   14: r126:V2DI=[r112:DI+0x30]
   15: [r112:DI]=r125:V2DI
   16: [r112:DI+0x10]=r126:V2DI  /// memcpy for assignment: D.3338 = arg;
   17: r127:DF=[r112:DI]
   18: r128:DF=[r112:DI+0x8]
   19: r129:DF=[r112:DI+0x10]
   20: r130:DF=[r112:DI+0x18]
------------

I'm thinking about ways to improve this.
Method1: One way may be to change the memory copy to reference the type 
of the struct, if the size of the struct is not too big, and generate insns 
like the below:
   13: r125:DF=[r112:DI+0x20]
   15: r126:DF=[r112:DI+0x28]
   17: r127:DF=[r112:DI+0x30]
   19: r128:DF=[r112:DI+0x38]
   14: [r112:DI]=r125:DF
   16: [r112:DI+0x8]=r126:DF
   18: [r112:DI+0x10]=r127:DF
   20: [r112:DI+0x18]=r128:DF
   21: r129:DF=[r112:DI]
   22: r130:DF=[r112:DI+0x8]
   23: r131:DF=[r112:DI+0x10]
   24: r132:DF=[r112:DI+0x18]

Then passes (cse, prop, dse, ...) could help to optimize the code.
Concerns with the method: we may not need to do this if the number of 
fields is too large.  And the types/modes of each load/store may depend
on the platform and may not be the same as the types of the fields of
the struct.  For example, for "struct {double a[3]; long long l;}" on
ppc64le, DImode may be better for assignments of the parameter.


Method2: One way may be enhancing CSE to make it able to treat one large
memory slot as two(or more) combined slots: 
   13: r125:V2DI#0=[r112:DI+0x20]
   13': r125:V2DI#8=[r112:DI+0x28]
   15: [r112:DI]#0=r125:V2DI#0
   15': [r112:DI]#8=r125:V2DI#8

This may seem more of a hack in CSE.


Method3: For some record types, use "PARALLEL:BLK" instead of "MEM:BLK".
To do this, "moving" between "PARALLEL<->PARALLEL" and "PARALLEL<->MEM" 
may need to be enhanced.  This method may require more effort to make
it work for corner/unknown cases.

I'm wondering which would be the more flexible way to handle this issue.
Thanks for any comments and suggestions!

BR,
Jeff(Jiufu)

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] propagation leap over memory copy for struct
  2022-10-31  2:42 [RFC] propagation leap over memory copy for struct Jiufu Guo
@ 2022-10-31 22:13 ` Jeff Law
  2022-11-01  0:49   ` Segher Boessenkool
                     ` (2 more replies)
  2022-11-01  0:37 ` Segher Boessenkool
  1 sibling, 3 replies; 14+ messages in thread
From: Jeff Law @ 2022-10-31 22:13 UTC (permalink / raw)
  To: Jiufu Guo, gcc-patches; +Cc: segher, rguenth, pinskia, linkw, dje.gcc


On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote:
> Hi,
>
> We know that for struct variable assignment, a memory copy may be used.
> And for memcpy, we load and store as many bytes as possible at one time.
> But that may not be best here:
> 1. Before/after a struct variable assignment, the variable may be operated
> on, and it is hard for some optimizations to leap over the memcpy.  Some
> struct operations may then be sub-optimal, like the issue in PR65421.
> 2. The size of a struct is mostly constant, so the memcpy would be expanded
> inline.  Using small sizes to load/store and executing in parallel may be no
> slower than using large sizes to load/store.  (Sure, more registers may be
> used for the smaller accesses.)
>
>
> In PR65421, for source code as below:
> ////////t.c
> #define FN 4
> typedef struct { double a[FN]; } A;
>
> A foo (const A *a) { return *a; }
> A bar (const A a) { return a; }

So the first question in my mind is can we do better at the gimple 
phase?  For the second case in particular can't we just "return a" 
rather than copying a into <retval> then returning <retval>?  This feels 
a lot like the return value optimization from C++.  I'm not sure if it 
applies to the first case or not, it's been a long time since I looked 
at NRV optimizations, but it might be worth poking around in there a bit 
(tree-nrv.cc).


But even so, these kinds of things are still bound to happen, so it's 
probably worth thinking about if we can do better in RTL as well.


The first thing that comes to my mind is to annotate memcpy calls that 
are structure assignments.  The idea here is that we may want to expand 
a memcpy differently in those cases.   Changing how we expand an opaque 
memcpy call is unlikely to be beneficial in most cases.  But changing 
how we expand a structure copy may be beneficial by exposing the 
underlying field values.   This would roughly correspond to your method #1.

Or instead of changing how we expand, teach the optimizers about these 
annotated memcpy calls -- they're just a copy of each field.   That's 
how CSE and the propagators could treat them. After some point we'd 
lower them in the usual ways, but at least early in the RTL pipeline we 
could keep them as annotated memcpy calls.  This roughly corresponds to 
your second suggestion.


jeff




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] propagation leap over memory copy for struct
  2022-10-31  2:42 [RFC] propagation leap over memory copy for struct Jiufu Guo
  2022-10-31 22:13 ` Jeff Law
@ 2022-11-01  0:37 ` Segher Boessenkool
  2022-11-01  3:01   ` Jiufu Guo
  1 sibling, 1 reply; 14+ messages in thread
From: Segher Boessenkool @ 2022-11-01  0:37 UTC (permalink / raw)
  To: Jiufu Guo; +Cc: gcc-patches, dje.gcc, linkw, rguenth, pinskia

Hi!

On Mon, Oct 31, 2022 at 10:42:35AM +0800, Jiufu Guo wrote:
> #define FN 4
> typedef struct { double a[FN]; } A;
> 
> A foo (const A *a) { return *a; }
> A bar (const A a) { return a; }
> ///////
> 
> If FN<=2, the size of "A" fits into TImode, and this code can be optimized 
> (by subreg/cse/fwprop/cprop) as:
> -------
> foo:
> .LFB0:
>         .cfi_startproc
>         blr
> 
> bar:
> .LFB1:
>       	.cfi_startproc
> 	lfd 2,8(3)
> 	lfd 1,0(3)
> 	blr
> --------

I think you swapped foo and bar here?

> If the size of "A" is larger than any INT mode size, RTL insns would be 
> generated as:
>    13: r125:V2DI=[r112:DI+0x20]
>    14: r126:V2DI=[r112:DI+0x30]
>    15: [r112:DI]=r125:V2DI
>    16: [r112:DI+0x10]=r126:V2DI  /// memcpy for assignment: D.3338 = arg;
>    17: r127:DF=[r112:DI]
>    18: r128:DF=[r112:DI+0x8]
>    19: r129:DF=[r112:DI+0x10]
>    20: r130:DF=[r112:DI+0x18]
> ------------
> 
> I'm thinking about ways to improve this.
> Method1: One way may be changing the memory copy by referencing the type 
> of the struct if the size of struct is not too big. And generate insns 
> like the below:
>    13: r125:DF=[r112:DI+0x20]
>    15: r126:DF=[r112:DI+0x28]
>    17: r127:DF=[r112:DI+0x30]
>    19: r128:DF=[r112:DI+0x38]
>    14: [r112:DI]=r125:DF
>    16: [r112:DI+0x8]=r126:DF
>    18: [r112:DI+0x10]=r127:DF
>    20: [r112:DI+0x18]=r128:DF
>    21: r129:DF=[r112:DI]
>    22: r130:DF=[r112:DI+0x8]
>    23: r131:DF=[r112:DI+0x10]
>    24: r132:DF=[r112:DI+0x18]

This is much worse though?  The expansion with memcpy used V2DI, which
typically is close to 2x faster than DFmode accesses.

Or are you trying to avoid small reads of large stores here?  Those
aren't so bad, large reads of small stores is the nastiness we need to
avoid.

The code we have now does

   15: [r112:DI]=r125:V2DI
...
   17: r127:DF=[r112:DI]
   18: r128:DF=[r112:DI+0x8]

Can you make this optimised to not use a memory temporary at all, just
immediately assign from r125 to r127 and r128?

> Method2: One way may be enhancing CSE to make it able to treat one large
> memory slot as two(or more) combined slots: 
>    13: r125:V2DI#0=[r112:DI+0x20]
>    13': r125:V2DI#8=[r112:DI+0x28]
>    15: [r112:DI]#0=r125:V2DI#0
>    15': [r112:DI]#8=r125:V2DI#8
> 
> This may seems more hack in CSE.

The current CSE pass is the pass most in need of a full rewrite, and has
been for many, many years.  It does a lot of things, important things
that we should not lose, but it does a pretty bad job of CSE.

> Method3: For some record type, use "PARALLEL:BLK" instead "MEM:BLK".

:BLK can never be optimised well.  It always has to live in memory, by
definition.


Segher

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] propagation leap over memory copy for struct
  2022-10-31 22:13 ` Jeff Law
@ 2022-11-01  0:49   ` Segher Boessenkool
  2022-11-01  4:30     ` Jiufu Guo
  2022-11-01  3:30   ` Jiufu Guo
  2022-11-05 11:38   ` Richard Biener
  2 siblings, 1 reply; 14+ messages in thread
From: Segher Boessenkool @ 2022-11-01  0:49 UTC (permalink / raw)
  To: Jeff Law; +Cc: Jiufu Guo, gcc-patches, rguenth, pinskia, linkw, dje.gcc

On Mon, Oct 31, 2022 at 04:13:38PM -0600, Jeff Law wrote:
> On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote:
> >We know that for struct variable assignment, a memory copy may be used.
> >And for memcpy, we load and store as many bytes as possible at one time.
> >But that may not be best here:

> So the first question in my mind is can we do better at the gimple 
> phase?  For the second case in particular can't we just "return a" 
> rather than copying a into <retval> then returning <retval>?  This feels 
> a lot like the return value optimization from C++.  I'm not sure if it 
> applies to the first case or not, it's been a long time since I looked 
> at NRV optimizations, but it might be worth poking around in there a bit 
> (tree-nrv.cc).

If it is a bigger struct you end up with quite a lot of stuff in
registers.  GCC will eventually put that all in memory so it will work
out fine in the end, but you are likely to get inefficient code.

OTOH, 8 bytes isn't as big as we would want these days, is it?  So it
would be useful to put smaller temporaries, say 32 bytes and smaller,
in registers instead of in memory.

> But even so, these kinds of things are still bound to happen, so it's 
> probably worth thinking about if we can do better in RTL as well.

Always.  It is a mistake to think that having better high-level
optimisations means that you don't need good low-level optimisations
anymore: in fact deficiencies there become more glaringly apparent if
the early pipeline opts become better :-)

> The first thing that comes to my mind is to annotate memcpy calls that 
> are structure assignments.  The idea here is that we may want to expand 
> a memcpy differently in those cases.   Changing how we expand an opaque 
> memcpy call is unlikely to be beneficial in most cases.  But changing 
> how we expand a structure copy may be beneficial by exposing the 
> underlying field values.   This would roughly correspond to your method 
> #1.
> 
> Or instead of changing how we expand, teach the optimizers about these 
> annotated memcpy calls -- they're just a copy of each field.   That's 
> how CSE and the propagators could treat them. After some point we'd 
> lower them in the usual ways, but at least early in the RTL pipeline we 
> could keep them as annotated memcpy calls.  This roughly corresponds to 
> your second suggestion.

Ideally this won't ever make it as far as RTL, if the structures do not
need to go via memory.  All high-level optimisations should have been
done earlier, and hopefully it was not expand itself that forced stuff
into memory!  :-/


Segher

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] propagation leap over memory copy for struct
  2022-11-01  0:37 ` Segher Boessenkool
@ 2022-11-01  3:01   ` Jiufu Guo
  0 siblings, 0 replies; 14+ messages in thread
From: Jiufu Guo @ 2022-11-01  3:01 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc-patches, dje.gcc, linkw, rguenth, pinskia

Segher Boessenkool <segher@kernel.crashing.org> writes:

> Hi!
>
> On Mon, Oct 31, 2022 at 10:42:35AM +0800, Jiufu Guo wrote:
>> #define FN 4
>> typedef struct { double a[FN]; } A;
>> 
>> A foo (const A *a) { return *a; }
>> A bar (const A a) { return a; }
>> ///////
>> 
>> If FN<=2, the size of "A" fits into TImode, and this code can be optimized 
>> (by subreg/cse/fwprop/cprop) as:
>> -------
>> foo:
>> .LFB0:
>>         .cfi_startproc
>>         blr
>> 
>> bar:
>> .LFB1:
>>       	.cfi_startproc
>> 	lfd 2,8(3)
>> 	lfd 1,0(3)
>> 	blr
>> --------
>
> I think you swapped foo and bar here?
Oh, thanks!
>
>> If the size of "A" is larger than any INT mode size, RTL insns would be 
>> generated as:
>>    13: r125:V2DI=[r112:DI+0x20]
>>    14: r126:V2DI=[r112:DI+0x30]
>>    15: [r112:DI]=r125:V2DI
>>    16: [r112:DI+0x10]=r126:V2DI  /// memcpy for assignment: D.3338 = arg;
>>    17: r127:DF=[r112:DI]
>>    18: r128:DF=[r112:DI+0x8]
>>    19: r129:DF=[r112:DI+0x10]
>>    20: r130:DF=[r112:DI+0x18]
>> ------------
>> 
>> I'm thinking about ways to improve this.
>> Method1: One way may be changing the memory copy by referencing the type 
>> of the struct if the size of struct is not too big. And generate insns 
>> like the below:
>>    13: r125:DF=[r112:DI+0x20]
>>    15: r126:DF=[r112:DI+0x28]
>>    17: r127:DF=[r112:DI+0x30]
>>    19: r128:DF=[r112:DI+0x38]
>>    14: [r112:DI]=r125:DF
>>    16: [r112:DI+0x8]=r126:DF
>>    18: [r112:DI+0x10]=r127:DF
>>    20: [r112:DI+0x18]=r128:DF
>>    21: r129:DF=[r112:DI]
>>    22: r130:DF=[r112:DI+0x8]
>>    23: r131:DF=[r112:DI+0x10]
>>    24: r132:DF=[r112:DI+0x18]
>
> This is much worse though?  The expansion with memcpy used V2DI, which
> typically is close to 2x faster than DFmode accesses.
Using V2DI helps to access 2x the bytes at one time compared with DF/DI.
But since those loads can be executed in parallel, it would not be too
bad to use DF/DI.

>
> Or are you trying to avoid small reads of large stores here?  Those
> aren't so bad, large reads of small stores is the nastiness we need to
> avoid.
Here, I want to use two DF loads, because the optimizations cse/fwprop/dse
could eliminate those same-size memory accesses.
>
> The code we have now does
>
>    15: [r112:DI]=r125:V2DI
> ...
>    17: r127:DF=[r112:DI]
>    18: r128:DF=[r112:DI+0x8]
>
> Can you make this optimised to not use a memory temporary at all, just
> immediately assign from r125 to r127 and r128?
r125 is not directly assigned to r127/r128, since 'insn 15' and 'insns
17/18' are expanded for different gimple stmts:
  D.3331 = a;  ==> 'insn 15' is generated for the struct assignment here.
  return D.3331; ==> 'insns 17/18' are prepared for the return registers.

I'm trying to eliminate those memory temporaries, but have not found a
good way yet.  Methods 1-3 are the ideas I'm trying in order to delete
those temporaries.

>
>> Method2: One way may be enhancing CSE to make it able to treat one large
>> memory slot as two(or more) combined slots: 
>>    13: r125:V2DI#0=[r112:DI+0x20]
>>    13': r125:V2DI#8=[r112:DI+0x28]
>>    15: [r112:DI]#0=r125:V2DI#0
>>    15': [r112:DI]#8=r125:V2DI#8
>> 
> This may seem more of a hack in CSE.
>
> The current CSE pass is the pass most in need of a full rewrite, and has
> been for many, many years.  It does a lot of things, important things
> that we should not lose, but it does a pretty bad job of CSE.
>
>> Method3: For some record type, use "PARALLEL:BLK" instead "MEM:BLK".
>
> :BLK can never be optimised well.  It always has to live in memory, by
> definition.

Thanks for your suggestions!

BR,
Jeff (Jiufu)
>
>
> Segher

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] propagation leap over memory copy for struct
  2022-10-31 22:13 ` Jeff Law
  2022-11-01  0:49   ` Segher Boessenkool
@ 2022-11-01  3:30   ` Jiufu Guo
  2022-11-05 11:38   ` Richard Biener
  2 siblings, 0 replies; 14+ messages in thread
From: Jiufu Guo @ 2022-11-01  3:30 UTC (permalink / raw)
  To: Jeff Law; +Cc: gcc-patches, segher, rguenth, pinskia, linkw, dje.gcc

Jeff Law <jeffreyalaw@gmail.com> writes:

> On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote:
>> Hi,
>>
>> We know that for struct variable assignment, a memory copy may be used.
>> And for memcpy, we load and store as many bytes as possible at one time.
>> But that may not be best here:
>> 1. Before/after a struct variable assignment, the variable may be operated
>> on, and it is hard for some optimizations to leap over the memcpy.  Some
>> struct operations may then be sub-optimal, like the issue in PR65421.
>> 2. The size of a struct is mostly constant, so the memcpy would be expanded
>> inline.  Using small sizes to load/store and executing in parallel may be no
>> slower than using large sizes to load/store.  (Sure, more registers may be
>> used for the smaller accesses.)
>>
>>
>> In PR65421, for source code as below:
>> ////////t.c
>> #define FN 4
>> typedef struct { double a[FN]; } A;
>>
>> A foo (const A *a) { return *a; }
>> A bar (const A a) { return a; }
>
> So the first question in my mind is can we do better at the gimple
> phase?  For the second case in particular can't we just "return a"
> rather than copying a into <retval> then returning <retval>?  This
> feels a lot like the return value optimization from C++.  I'm not sure
> if it applies to the first case or not, it's been a long time since I
> looked at NRV optimizations, but it might be worth poking around in
> there a bit (tree-nrv.cc).
Thanks for pointing out this idea!

Currently the optimized gimple looks like:
  D.3334 = a;
  return D.3334;

and
  D.3336 = *a_2(D);
  return D.3336;

It may be better to have:
"return a;" and "return *a;"
-----------------

If the code looks like:
typedef struct { double a[3]; long l;} A; //mix types
A foo (const A a) { return a; }
A bar (const A *a) { return *a; }

Current optimized gimples looks like:
  <retval> = a;
  return <retval>;
and
  <retval> = *a_2(D);
  return <retval>;

"return a;" and "return *a;" may work here too.
>
>
> But even so, these kinds of things are still bound to happen, so it's
> probably worth thinking about if we can do better in RTL as well. 
>
Yep, thanks!
>
> The first thing that comes to my mind is to annotate memcpy calls that
> are structure assignments.  The idea here is that we may want to
> expand a memcpy differently in those cases.   Changing how we expand
> an opaque memcpy call is unlikely to be beneficial in most cases.  But
> changing how we expand a structure copy may be beneficial by exposing
> the underlying field values.   This would roughly correspond to your
> method #1.
Right.  For a general memcpy, we would read/write as many bytes as possible
at one time.  Reading/writing small fields may only be beneficial for
structure assignment.

>
> Or instead of changing how we expand, teach the optimizers about these
> annotated memcpy calls -- they're just a copy of each field.  
> That's how CSE and the propagators could treat them. After some point
> we'd lower them in the usual ways, but at least early in the RTL
> pipeline we could keep them as annotated memcpy calls.  This roughly
> corresponds to your second suggestion.
Thanks for your insights on this idea! Use an annotated memcpy for the
early optimizations, and treat it as a general memcpy in the later
passes.


Thanks again for your very helpful comments and suggestions!

BR,
Jeff(Jiufu)

>
>
> jeff

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] propagation leap over memory copy for struct
  2022-11-01  0:49   ` Segher Boessenkool
@ 2022-11-01  4:30     ` Jiufu Guo
  2022-11-05 14:13       ` Richard Biener
  0 siblings, 1 reply; 14+ messages in thread
From: Jiufu Guo @ 2022-11-01  4:30 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Jeff Law, gcc-patches, rguenth, pinskia, linkw, dje.gcc

Segher Boessenkool <segher@kernel.crashing.org> writes:

> On Mon, Oct 31, 2022 at 04:13:38PM -0600, Jeff Law wrote:
>> On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote:
>> >We know that for struct variable assignment, a memory copy may be used.
>> >And for memcpy, we load and store as many bytes as possible at one time.
>> >But that may not be best here:
>
>> So the first question in my mind is can we do better at the gimple 
>> phase?  For the second case in particular can't we just "return a" 
>> rather than copying a into <retval> then returning <retval>?  This feels 
>> a lot like the return value optimization from C++.  I'm not sure if it 
>> applies to the first case or not, it's been a long time since I looked 
>> at NRV optimizations, but it might be worth poking around in there a bit 
>> (tree-nrv.cc).
>
> If it is a bigger struct you end up with quite a lot of stuff in
> registers.  GCC will eventually put that all in memory so it will work
> out fine in the end, but you are likely to get inefficient code.
Yes.  We may need to use memory to save registers for a big struct.
For a small struct it may be practical to use registers.  We may leverage
the idea that some types of small structs are passed to functions through
registers.

>
> OTOH, 8 bytes isn't as big as we would want these days, is it?  So it
> would be useful to put smaller temporaries, say 32 bytes and smaller,
> in registers instead of in memory.
I think you mean: we should try to use registers to avoid memory accesses,
and using registers would be OK for larger memcpys (32 bytes).
Great suggestion, thanks a lot!

Like below idea:
[r100:TI, r101:TI] = src;  //Or r100:OI/OO = src;
dest = [r100:TI, r101:TI];

Currently, for an 8-byte structure, we are using TImode for it.
And the subreg/fwprop/cse passes are able to optimize it as expected.
Two concerns here: larger integer modes (OI/OO/...) may not be introduced
yet, and I'm not sure whether the current infrastructure supports using
two or more registers for one structure.

>
>> But even so, these kinds of things are still bound to happen, so it's 
>> probably worth thinking about if we can do better in RTL as well.
>
> Always.  It is a mistake to think that having better high-level
> optimisations means that you don't need good low-level optimisations
> anymore: in fact deficiencies there become more glaringly apparent if
> the early pipeline opts become better :-)
Understood, thanks :)

>
>> The first thing that comes to my mind is to annotate memcpy calls that 
>> are structure assignments.  The idea here is that we may want to expand 
>> a memcpy differently in those cases.   Changing how we expand an opaque 
>> memcpy call is unlikely to be beneficial in most cases.  But changing 
>> how we expand a structure copy may be beneficial by exposing the 
>> underlying field values.   This would roughly correspond to your method 
>> #1.
>> 
>> Or instead of changing how we expand, teach the optimizers about these 
>> annotated memcpy calls -- they're just a copy of each field.   That's 
>> how CSE and the propagators could treat them. After some point we'd 
>> lower them in the usual ways, but at least early in the RTL pipeline we 
>> could keep them as annotated memcpy calls.  This roughly corresponds to 
>> your second suggestion.
>
> Ideally this won't ever make it as far as RTL, if the structures do not
> need to go via memory.  All high-level optimisations should have been
> done earlier, and hopefully it was not expand itself that forced stuff
> into memory!  :-/
Currently, after early gimple optimization, the struct member accesses
may still need to go through memory (if the mode of the struct is BLK).
For example:

_Bool foo (const A a) { return a.a[0] > 1.0; }

The optimized gimple would be:
  _1 = a.a[0];
  _3 = _1 > 1.0e+0;
  return _3;

During expansion to RTL, parm 'a' is first stored to memory from the arg
regs, and "a.a[0]" is also read back from memory.  It may be better to use
"f1" for "a.a[0]" here.

Maybe method3 is similar to your idea: use "parallel:BLK {DF;DF;DF;DF}"
for the struct (BLK may be changed), and use 4 DF registers to access
the structure in the expand pass.


Thanks again for your kind and helpful comments!

BR,
Jeff(Jiufu)

>
>
> Segher

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] propagation leap over memory copy for struct
  2022-10-31 22:13 ` Jeff Law
  2022-11-01  0:49   ` Segher Boessenkool
  2022-11-01  3:30   ` Jiufu Guo
@ 2022-11-05 11:38   ` Richard Biener
  2022-11-09  9:21     ` Jiufu Guo
  2 siblings, 1 reply; 14+ messages in thread
From: Richard Biener @ 2022-11-05 11:38 UTC (permalink / raw)
  To: Jeff Law; +Cc: Jiufu Guo, gcc-patches, pinskia, dje.gcc, linkw, segher, rguenth

On Mon, Oct 31, 2022 at 11:14 PM Jeff Law via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
>
> On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote:
> > Hi,
> >
> > We know that for struct variable assignment, a memory copy may be used.
> > And for memcpy, we load and store as many bytes as possible at one time.
> > But that may not be best here:
> > 1. Before/after a struct variable assignment, the variable may be operated
> > on, and it is hard for some optimizations to leap over the memcpy.  Some
> > struct operations may then be sub-optimal, like the issue in PR65421.
> > 2. The size of a struct is mostly constant, so the memcpy would be expanded
> > inline.  Using small sizes to load/store and executing in parallel may be no
> > slower than using large sizes to load/store.  (Sure, more registers may be
> > used for the smaller accesses.)
> >
> >
> > In PR65421, for source code as below:
> > ////////t.c
> > #define FN 4
> > typedef struct { double a[FN]; } A;
> >
> > A foo (const A *a) { return *a; }
> > A bar (const A a) { return a; }
>
> So the first question in my mind is can we do better at the gimple
> phase?  For the second case in particular can't we just "return a"
> rather than copying a into <retval> then returning <retval>?  This feels
> a lot like the return value optimization from C++.  I'm not sure if it
> applies to the first case or not, it's been a long time since I looked
> at NRV optimizations, but it might be worth poking around in there a bit
> (tree-nrv.cc).
>
>
> But even so, these kinds of things are still bound to happen, so it's
> probably worth thinking about if we can do better in RTL as well.
>
>
> The first thing that comes to my mind is to annotate memcpy calls that
> are structure assignments.  The idea here is that we may want to expand
> a memcpy differently in those cases.   Changing how we expand an opaque
> memcpy call is unlikely to be beneficial in most cases.  But changing
> how we expand a structure copy may be beneficial by exposing the
> underlying field values.   This would roughly correspond to your method #1.
>
> Or instead of changing how we expand, teach the optimizers about these
> annotated memcpy calls -- they're just a copy of each field.   That's
> how CSE and the propagators could treat them. After some point we'd
> lower them in the usual ways, but at least early in the RTL pipeline we
> could keep them as annotated memcpy calls.  This roughly corresponds to
> your second suggestion.

In the end it depends on the access patterns, so some analysis like the
one SRA performs would be nice.  The issue with expanding memcpy on GIMPLE
is that we currently cannot express 'rep; movsb;' or other target-specific
sequences from the cpymem-like optabs on GIMPLE, and recovering those
from piecewise copies on RTL is going to be difficult.

>
> jeff
>
>
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] propagation leap over memory copy for struct
  2022-11-01  4:30     ` Jiufu Guo
@ 2022-11-05 14:13       ` Richard Biener
  2022-11-08  4:05         ` Jiufu Guo
  0 siblings, 1 reply; 14+ messages in thread
From: Richard Biener @ 2022-11-05 14:13 UTC (permalink / raw)
  To: Jiufu Guo
  Cc: Segher Boessenkool, Jeff Law, gcc-patches, rguenth, pinskia,
	linkw, dje.gcc


On Tue, 1 Nov 2022, Jiufu Guo wrote:

> Segher Boessenkool <segher@kernel.crashing.org> writes:
> 
> > On Mon, Oct 31, 2022 at 04:13:38PM -0600, Jeff Law wrote:
> >> On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote:
> >> >We know that for struct variable assignment, a memory copy may be used.
> >> >And for memcpy, we load and store as many bytes as possible at one time.
> >> >But that may not be best here:
> >
> >> So the first question in my mind is can we do better at the gimple 
> >> phase?  For the second case in particular can't we just "return a" 
> >> rather than copying a into <retval> then returning <retval>?  This feels 
> >> a lot like the return value optimization from C++.  I'm not sure if it 
> >> applies to the first case or not, it's been a long time since I looked 
> >> at NRV optimizations, but it might be worth poking around in there a bit 
> >> (tree-nrv.cc).
> >
> > If it is a bigger struct you end up with quite a lot of stuff in
> > registers.  GCC will eventually put that all in memory so it will work
> > out fine in the end, but you are likely to get inefficient code.
> Yes.  We may need to use memory to save registers for a big struct.
> For a small struct it may be practical to use registers.  We may leverage
> the idea that some types of small structs are passed to functions through
> registers.
> 
> >
> > OTOH, 8 bytes isn't as big as we would want these days, is it?  So it
> > would be useful to put smaller temporaries, say 32 bytes and smaller,
> > in registers instead of in memory.
> I think you mean: we should try to use registers to avoid memory accesses,
> and using registers would be OK for larger memcpys (32 bytes).
> Great suggestion, thanks a lot!
> 
> Like below idea:
> [r100:TI, r101:TI] = src;  //Or r100:OI/OO = src;
> dest = [r100:TI, r101:TI];
> 
> Currently, for an 8-byte structure, we are using TImode for it.
> And the subreg/fwprop/cse passes are able to optimize it as expected.
> Two concerns here: larger integer modes (OI/OO/...) may not be introduced
> yet, and I'm not sure whether the current infrastructure supports using
> two or more registers for one structure.
> 
> >
> >> But even so, these kinds of things are still bound to happen, so it's 
> >> probably worth thinking about if we can do better in RTL as well.
> >
> > Always.  It is a mistake to think that having better high-level
> > optimisations means that you don't need good low-level optimisations
> > anymore: in fact deficiencies there become more glaringly apparent if
> > the early pipeline opts become better :-)
> Understood, thanks :)
> 
> >
> >> The first thing that comes to my mind is to annotate memcpy calls that 
> >> are structure assignments.  The idea here is that we may want to expand 
> >> a memcpy differently in those cases.   Changing how we expand an opaque 
> >> memcpy call is unlikely to be beneficial in most cases.  But changing 
> >> how we expand a structure copy may be beneficial by exposing the 
> >> underlying field values.   This would roughly correspond to your method 
> >> #1.
> >> 
> >> Or instead of changing how we expand, teach the optimizers about these 
> >> annotated memcpy calls -- they're just a copy of each field.   That's 
> >> how CSE and the propagators could treat them. After some point we'd 
> >> lower them in the usual ways, but at least early in the RTL pipeline we 
> >> could keep them as annotated memcpy calls.  This roughly corresponds to 
> >> your second suggestion.
> >
> > Ideally this won't ever make it as far as RTL, if the structures do not
> > need to go via memory.  All high-level optimisations should have been
> > done earlier, and hopefully it was not expand itself that forced stuff
> > into memory!  :-/
> Currently, after early gimple optimization, accesses to struct members
> may still need to go through memory (if the mode of the struct is BLK).
> For example:
> 
> _Bool foo (const A a) { return a.a[0] > 1.0; }
> 
> The optimized gimple would be:
>   _1 = a.a[0];
>   _3 = _1 > 1.0e+0;
>   return _3;
> 
> During expand to RTL, parm 'a' is first stored to memory from the arg
> regs, and "a.a[0]" is then read back from memory.  It may be better to
> use "f1" for "a.a[0]" here.
> 
> Maybe method3 is similar to your idea: using "parallel:BLK {DF;DF;DF;DF}"
> for the struct (BLK may be changed), and using 4 DF registers to access
> the structure in the expand pass.

I think for cases like this it might be a good idea to perform
SRA-like analysis at RTL expansion time when we know how parameters
arrive (in pieces) and take that knowledge into account when
assigning the RTL to a decl.  The same applies for the return ABI.
Since we rely on RTL to elide copies to/from return/argument
registers/slots we have to assign "layout compatible" registers
to the corresponding auto vars.

> 
> Thanks again for your kindly and helpful comments!
> 
> BR,
> Jeff(Jiufu)
> 
> >
> >
> > Segher
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] propgation leap over memory copy for struct
  2022-11-05 14:13       ` Richard Biener
@ 2022-11-08  4:05         ` Jiufu Guo
  2022-11-09  7:51           ` Jiufu Guo
  0 siblings, 1 reply; 14+ messages in thread
From: Jiufu Guo @ 2022-11-08  4:05 UTC (permalink / raw)
  To: Richard Biener
  Cc: Segher Boessenkool, Jeff Law, gcc-patches, rguenth, pinskia,
	linkw, dje.gcc

Richard Biener <rguenther@suse.de> writes:

> On Tue, 1 Nov 2022, Jiufu Guo wrote:
>
>> Segher Boessenkool <segher@kernel.crashing.org> writes:
>> 
>> > On Mon, Oct 31, 2022 at 04:13:38PM -0600, Jeff Law wrote:
>> >> On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote:
>> >> >We know that for struct variable assignment, a memory copy may be used.
>> >> >And for memcpy, we may load and store as many bytes as possible at one time.
>> >> >While that may not be best here:
>> >
>> >> So the first question in my mind is can we do better at the gimple 
>> >> phase?  For the second case in particular can't we just "return a" 
>> >> rather than copying a into <retval> then returning <retval>?  This feels 
>> >> a lot like the return value optimization from C++.  I'm not sure if it 
>> >> applies to the first case or not, it's been a long time since I looked 
>> >> at NRV optimizations, but it might be worth poking around in there a bit 
>> >> (tree-nrv.cc).
>> >
>> > If it is a bigger struct you end up with quite a lot of stuff in
>> > registers.  GCC will eventually put that all in memory so it will work
>> > out fine in the end, but you are likely to get inefficient code.
>> Yes.  We may need to use memory to save registers for big structs.
>> For small structs it may be practical to use registers.  We may leverage
>> the idea that some types of small structs are passed to functions through
>> registers. 
>> 
>> >
>> > OTOH, 8 bytes isn't as big as we would want these days, is it?  So it
>> > would be useful to put smaller temportaries, say 32 bytes and smaller,
>> > in registers instead of in memory.
>> I think you mean:  we should try to registers to avoid memory accesing,
>> and using registers would be ok for more bytes memcpy(32bytes).
>> Great sugguestion, thanks a lot!
>> 
>> Like below idea:
>> [r100:TI, r101:TI] = src;  //Or r100:OI/OO = src;
>> dest = [r100:TI, r101:TI];
>> 
>> Currently, for 8bytes structure, we are using TImode for it.
>> And subreg/fwprop/cse passes are able to optimize it as expected.
>> Two concerns here: larger int modes(OI/OO/..) may be not introduced yet;
>> I'm not sure if current infrastructure supports to use two more
>> registers for one structure.
>> 
>> >
>> >> But even so, these kinds of things are still bound to happen, so it's 
>> >> probably worth thinking about if we can do better in RTL as well.
>> >
>> > Always.  It is a mistake to think that having better high-level
>> > optimisations means that you don't need good low-level optimisations
>> > anymore: in fact deficiencies there become more glaringly apparent if
>> > the early pipeline opts become better :-)
>> Understant, thanks :)
>> 
>> >
>> >> The first thing that comes to my mind is to annotate memcpy calls that 
>> >> are structure assignments.  The idea here is that we may want to expand 
>> >> a memcpy differently in those cases.   Changing how we expand an opaque 
>> >> memcpy call is unlikely to be beneficial in most cases.  But changing 
>> >> how we expand a structure copy may be beneficial by exposing the 
>> >> underlying field values.   This would roughly correspond to your method 
>> >> #1.
>> >> 
>> >> Or instead of changing how we expand, teach the optimizers about these 
>> >> annotated memcpy calls -- they're just a a copy of each field.   That's 
>> >> how CSE and the propagators could treat them. After some point we'd 
>> >> lower them in the usual ways, but at least early in the RTL pipeline we 
>> >> could keep them as annotated memcpy calls.  This roughly corresponds to 
>> >> your second suggestion.
>> >
>> > Ideally this won't ever make it as far as RTL, if the structures do not
>> > need to go via memory.  All high-level optimissations should have been
>> > done earlier, and hopefully it was not expand tiself that forced stuff
>> > into memory!  :-/
>> Currently, after early gimple optimization, the struct member accessing
>> may still need to be in memory (if the mode of the struct is BLK).
>> For example:
>> 
>> _Bool foo (const A a) { return a.a[0] > 1.0; }
>> 
>> The optimized gimple would be:
>>   _1 = a.a[0];
>>   _3 = _1 > 1.0e+0;
>>   return _3;
>> 
>> During expand to RTL, parm 'a' is store to memory from arg regs firstly,
>> and "a.a[0]" is also reading from memory.  It may be better to use
>> "f1" for "a.a[0]" here.
>> 
>> Maybe, method3 is similar with your idea: using "parallel:BLK {DF;DF;DF; DF}"
>> for the struct (BLK may be changed), and using 4 DF registers to access
>> the structure in expand pass.
>
> I think for cases like this it might be a good idea to perform
> SRA-like analysis at RTL expansion time when we know how parameters
> arrive (in pieces) and take that knowledge into account when
> assigning the RTL to a decl.  The same applies for the return ABI.
> Since we rely on RTL to elide copies to/from return/argument
> registers/slots we have to assign "layout compatible" registers
> to the corresponding auto vars.
>
Thanks for pointing this out!
This looks like a kind of SRA, especially for parms and return values.
As you pointed out, there are some things we may need to take care of
and adjust:
1. We would use the "layout compatible" mode reg for the scalar. e.g.
DF for "{double arr[4];}", but DI for "{double arr[3]; long l;}".

2. For an aggregate that will be assigned to the return value, before
expanding the 'return' stmt we may not be sure whether we need to assign
'scalar rtl(s)' to the decl.
To handle this issue, we may use 'scalar rtl(s)' for every struct decl,
as if it were a parm or the return result.
Then method3 may be similar to this idea: using "parallel RTL" for
the decl (maybe using DECL_RTL directly).

Please point out any misunderstandings or suggestions.
Thanks again!

BR,
Jeff(Jiufu)

>> 
>> Thanks again for your kindly and helpful comments!
>> 
>> BR,
>> Jeff(Jiufu)
>> 
>> >
>> >
>> > Segher
>> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] propgation leap over memory copy for struct
  2022-11-08  4:05         ` Jiufu Guo
@ 2022-11-09  7:51           ` Jiufu Guo
  2022-11-09  8:50             ` Richard Biener
  0 siblings, 1 reply; 14+ messages in thread
From: Jiufu Guo @ 2022-11-09  7:51 UTC (permalink / raw)
  To: Jiufu Guo via Gcc-patches
  Cc: Richard Biener, Segher Boessenkool, Jeff Law, rguenth, pinskia,
	linkw, dje.gcc

Jiufu Guo via Gcc-patches <gcc-patches@gcc.gnu.org> writes:

> Richard Biener <rguenther@suse.de> writes:
>
>> On Tue, 1 Nov 2022, Jiufu Guo wrote:
>>
>>> Segher Boessenkool <segher@kernel.crashing.org> writes:
>>> 
>>> > On Mon, Oct 31, 2022 at 04:13:38PM -0600, Jeff Law wrote:
>>> >> On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote:
>>> >> >We know that for struct variable assignment, memory copy may be used.
>>> >> >And for memcpy, we may load and store more bytes as possible at one time.
>>> >> >While it may be not best here:
>>> >
>>> >> So the first question in my mind is can we do better at the gimple 
>>> >> phase?  For the second case in particular can't we just "return a" 
>>> >> rather than copying a into <retval> then returning <retval>?  This feels 
>>> >> a lot like the return value optimization from C++.  I'm not sure if it 
>>> >> applies to the first case or not, it's been a long time since I looked 
>>> >> at NRV optimizations, but it might be worth poking around in there a bit 
>>> >> (tree-nrv.cc).
>>> >
>>> > If it is a bigger struct you end up with quite a lot of stuff in
>>> > registers.  GCC will eventually put that all in memory so it will work
>>> > out fine in the end, but you are likely to get inefficient code.
>>> Yes.  We may need to use memory to save regiters for big struct.
>>> Small struct may be practical to use registers.  We may leverage the
>>> idea that: some type of small struct are passing to function through
>>> registers. 
>>> 
>>> >
>>> > OTOH, 8 bytes isn't as big as we would want these days, is it?  So it
>>> > would be useful to put smaller temportaries, say 32 bytes and smaller,
>>> > in registers instead of in memory.
>>> I think you mean:  we should try to registers to avoid memory accesing,
>>> and using registers would be ok for more bytes memcpy(32bytes).
>>> Great sugguestion, thanks a lot!
>>> 
>>> Like below idea:
>>> [r100:TI, r101:TI] = src;  //Or r100:OI/OO = src;
>>> dest = [r100:TI, r101:TI];
>>> 
>>> Currently, for 8bytes structure, we are using TImode for it.
>>> And subreg/fwprop/cse passes are able to optimize it as expected.
>>> Two concerns here: larger int modes(OI/OO/..) may be not introduced yet;
>>> I'm not sure if current infrastructure supports to use two more
>>> registers for one structure.
>>> 
>>> >
>>> >> But even so, these kinds of things are still bound to happen, so it's 
>>> >> probably worth thinking about if we can do better in RTL as well.
>>> >
>>> > Always.  It is a mistake to think that having better high-level
>>> > optimisations means that you don't need good low-level optimisations
>>> > anymore: in fact deficiencies there become more glaringly apparent if
>>> > the early pipeline opts become better :-)
>>> Understant, thanks :)
>>> 
>>> >
>>> >> The first thing that comes to my mind is to annotate memcpy calls that 
>>> >> are structure assignments.  The idea here is that we may want to expand 
>>> >> a memcpy differently in those cases.   Changing how we expand an opaque 
>>> >> memcpy call is unlikely to be beneficial in most cases.  But changing 
>>> >> how we expand a structure copy may be beneficial by exposing the 
>>> >> underlying field values.   This would roughly correspond to your method 
>>> >> #1.
>>> >> 
>>> >> Or instead of changing how we expand, teach the optimizers about these 
>>> >> annotated memcpy calls -- they're just a a copy of each field.   That's 
>>> >> how CSE and the propagators could treat them. After some point we'd 
>>> >> lower them in the usual ways, but at least early in the RTL pipeline we 
>>> >> could keep them as annotated memcpy calls.  This roughly corresponds to 
>>> >> your second suggestion.
>>> >
>>> > Ideally this won't ever make it as far as RTL, if the structures do not
>>> > need to go via memory.  All high-level optimissations should have been
>>> > done earlier, and hopefully it was not expand tiself that forced stuff
>>> > into memory!  :-/
>>> Currently, after early gimple optimization, the struct member accessing
>>> may still need to be in memory (if the mode of the struct is BLK).
>>> For example:
>>> 
>>> _Bool foo (const A a) { return a.a[0] > 1.0; }
>>> 
>>> The optimized gimple would be:
>>>   _1 = a.a[0];
>>>   _3 = _1 > 1.0e+0;
>>>   return _3;
>>> 
>>> During expand to RTL, parm 'a' is store to memory from arg regs firstly,
>>> and "a.a[0]" is also reading from memory.  It may be better to use
>>> "f1" for "a.a[0]" here.
>>> 
>>> Maybe, method3 is similar with your idea: using "parallel:BLK {DF;DF;DF; DF}"
>>> for the struct (BLK may be changed), and using 4 DF registers to access
>>> the structure in expand pass.
>>
>> I think for cases like this it might be a good idea to perform
>> SRA-like analysis at RTL expansion time when we know how parameters
>> arrive (in pieces) and take that knowledge into account when
>> assigning the RTL to a decl.  The same applies for the return ABI.
>> Since we rely on RTL to elide copies to/from return/argument
>> registers/slots we have to assign "layout compatible" registers
>> to the corresponding auto vars.
>>
In other words, for this kind of parameter, we would create a scalar
for each piece, and 'accesses to the parameter' would be expanded into
'accesses to the scalars' accordingly.
This would also avoid memory accesses for the parameter.

Maybe we could use something like "parallel:M? {DF;DF;DF;DF}" or
"parallel:M? {DI;DI;DI;DI}" to group the scalars in DECL_RTL.

For this, we would need to support 'move'/'access' these sub-RTLs.

Any more suggestions? Thanks.

BR,
Jeff(Jiufu)

> Thanks for pointing out this!
> This looks like a kind of SRA, especially for parm and return value.
> As you pointed out, there is something that we may need to take care
> to adjust:
> 1. We would use the "layout compatible" mode reg for the scalar. e.g.
> DF for "{double arr[4];}", but DI for "{double arr[3]; long l;}".
>
> 2. For an aggregate that will be assigned to return value, before
> expanding to 'return stmt', we may not sure if need to assign
> 'scalar rtl(s)' to decl. 
> To handle this issue, we may use 'scalar rtl(s)' for all struct decl
> as if it is parm or return result.
> Then method3 may be similar to this idea: using "parallel RTL" for
> the decl (may use DECL_RTL directly).
>
> Please point out any misunderstandings or suggestions.
> Thanks again!
>
> BR,
> Jeff(Jiufu)
>
>>> 
>>> Thanks again for your kindly and helpful comments!
>>> 
>>> BR,
>>> Jeff(Jiufu)
>>> 
>>> >
>>> >
>>> > Segher
>>> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] propgation leap over memory copy for struct
  2022-11-09  7:51           ` Jiufu Guo
@ 2022-11-09  8:50             ` Richard Biener
  0 siblings, 0 replies; 14+ messages in thread
From: Richard Biener @ 2022-11-09  8:50 UTC (permalink / raw)
  To: Jiufu Guo
  Cc: Jiufu Guo via Gcc-patches, Segher Boessenkool, Jeff Law, pinskia,
	linkw, dje.gcc


On Wed, 9 Nov 2022, Jiufu Guo wrote:

> Jiufu Guo via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> 
> > Richard Biener <rguenther@suse.de> writes:
> >
> >> On Tue, 1 Nov 2022, Jiufu Guo wrote:
> >>
> >>> Segher Boessenkool <segher@kernel.crashing.org> writes:
> >>> 
> >>> > On Mon, Oct 31, 2022 at 04:13:38PM -0600, Jeff Law wrote:
> >>> >> On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote:
> >>> >> >We know that for struct variable assignment, memory copy may be used.
> >>> >> >And for memcpy, we may load and store more bytes as possible at one time.
> >>> >> >While it may be not best here:
> >>> >
> >>> >> So the first question in my mind is can we do better at the gimple 
> >>> >> phase?  For the second case in particular can't we just "return a" 
> >>> >> rather than copying a into <retval> then returning <retval>?  This feels 
> >>> >> a lot like the return value optimization from C++.  I'm not sure if it 
> >>> >> applies to the first case or not, it's been a long time since I looked 
> >>> >> at NRV optimizations, but it might be worth poking around in there a bit 
> >>> >> (tree-nrv.cc).
> >>> >
> >>> > If it is a bigger struct you end up with quite a lot of stuff in
> >>> > registers.  GCC will eventually put that all in memory so it will work
> >>> > out fine in the end, but you are likely to get inefficient code.
> >>> Yes.  We may need to use memory to save regiters for big struct.
> >>> Small struct may be practical to use registers.  We may leverage the
> >>> idea that: some type of small struct are passing to function through
> >>> registers. 
> >>> 
> >>> >
> >>> > OTOH, 8 bytes isn't as big as we would want these days, is it?  So it
> >>> > would be useful to put smaller temportaries, say 32 bytes and smaller,
> >>> > in registers instead of in memory.
> >>> I think you mean:  we should try to registers to avoid memory accesing,
> >>> and using registers would be ok for more bytes memcpy(32bytes).
> >>> Great sugguestion, thanks a lot!
> >>> 
> >>> Like below idea:
> >>> [r100:TI, r101:TI] = src;  //Or r100:OI/OO = src;
> >>> dest = [r100:TI, r101:TI];
> >>> 
> >>> Currently, for 8bytes structure, we are using TImode for it.
> >>> And subreg/fwprop/cse passes are able to optimize it as expected.
> >>> Two concerns here: larger int modes(OI/OO/..) may be not introduced yet;
> >>> I'm not sure if current infrastructure supports to use two more
> >>> registers for one structure.
> >>> 
> >>> >
> >>> >> But even so, these kinds of things are still bound to happen, so it's 
> >>> >> probably worth thinking about if we can do better in RTL as well.
> >>> >
> >>> > Always.  It is a mistake to think that having better high-level
> >>> > optimisations means that you don't need good low-level optimisations
> >>> > anymore: in fact deficiencies there become more glaringly apparent if
> >>> > the early pipeline opts become better :-)
> >>> Understant, thanks :)
> >>> 
> >>> >
> >>> >> The first thing that comes to my mind is to annotate memcpy calls that 
> >>> >> are structure assignments.  The idea here is that we may want to expand 
> >>> >> a memcpy differently in those cases.   Changing how we expand an opaque 
> >>> >> memcpy call is unlikely to be beneficial in most cases.  But changing 
> >>> >> how we expand a structure copy may be beneficial by exposing the 
> >>> >> underlying field values.   This would roughly correspond to your method 
> >>> >> #1.
> >>> >> 
> >>> >> Or instead of changing how we expand, teach the optimizers about these 
> >>> >> annotated memcpy calls -- they're just a a copy of each field.   That's 
> >>> >> how CSE and the propagators could treat them. After some point we'd 
> >>> >> lower them in the usual ways, but at least early in the RTL pipeline we 
> >>> >> could keep them as annotated memcpy calls.  This roughly corresponds to 
> >>> >> your second suggestion.
> >>> >
> >>> > Ideally this won't ever make it as far as RTL, if the structures do not
> >>> > need to go via memory.  All high-level optimissations should have been
> >>> > done earlier, and hopefully it was not expand tiself that forced stuff
> >>> > into memory!  :-/
> >>> Currently, after early gimple optimization, the struct member accessing
> >>> may still need to be in memory (if the mode of the struct is BLK).
> >>> For example:
> >>> 
> >>> _Bool foo (const A a) { return a.a[0] > 1.0; }
> >>> 
> >>> The optimized gimple would be:
> >>>   _1 = a.a[0];
> >>>   _3 = _1 > 1.0e+0;
> >>>   return _3;
> >>> 
> >>> During expand to RTL, parm 'a' is store to memory from arg regs firstly,
> >>> and "a.a[0]" is also reading from memory.  It may be better to use
> >>> "f1" for "a.a[0]" here.
> >>> 
> >>> Maybe, method3 is similar with your idea: using "parallel:BLK {DF;DF;DF; DF}"
> >>> for the struct (BLK may be changed), and using 4 DF registers to access
> >>> the structure in expand pass.
> >>
> >> I think for cases like this it might be a good idea to perform
> >> SRA-like analysis at RTL expansion time when we know how parameters
> >> arrive (in pieces) and take that knowledge into account when
> >> assigning the RTL to a decl.  The same applies for the return ABI.
> >> Since we rely on RTL to elide copies to/from return/argument
> >> registers/slots we have to assign "layout compatible" registers
> >> to the corresponding auto vars.
> >>
> In other words, for this kind of parameter, creating a few scalars
> for each pieces.  And the 'accessing to the paramter' is expanded to
> 'accessing the scalars' accordingly.
> This would also able to avoid memory accesing for the paramter.
> 
> Maybe we could use something like "parallel:M? {DF;DF;DF;DF}" or
> "parallel:M? {DI;DI;DI;DI}" to group the scalars in DECL_RTL.
> 
> For this, we would need to support 'move'/'access' these sub-RTLs.
> 
> Any more sugguestions? Thanks.

I think the key is going to be exposing the ABI to the GIMPLE phase;
doing "true" SRA at RTL expansion looks dubious, my suggestion was
to do SRA-like analysis to improve the memcpy expansion heuristics.
For the case of matching up with incoming parameters, that can only
be done if the uses match up with that layout (which doesn't
necessarily agree with what we'd choose for auto variables).
> BR,
> Jeff(Jiufu)
> 
> > Thanks for pointing out this!
> > This looks like a kind of SRA, especially for parm and return value.
> > As you pointed out, there is something that we may need to take care
> > to adjust:
> > 1. We would use the "layout compatible" mode reg for the scalar. e.g.
> > DF for "{double arr[4];}", but DI for "{double arr[3]; long l;}".
> >
> > 2. For an aggregate that will be assigned to return value, before
> > expanding to 'return stmt', we may not sure if need to assign
> > 'scalar rtl(s)' to decl. 
> > To handle this issue, we may use 'scalar rtl(s)' for all struct decl
> > as if it is parm or return result.
> > Then method3 may be similar to this idea: using "parallel RTL" for
> > the decl (may use DECL_RTL directly).
> >
> > Please point out any misunderstandings or suggestions.
> > Thanks again!
> >
> > BR,
> > Jeff(Jiufu)
> >
> >>> 
> >>> Thanks again for your kindly and helpful comments!
> >>> 
> >>> BR,
> >>> Jeff(Jiufu)
> >>> 
> >>> >
> >>> >
> >>> > Segher
> >>> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] propgation leap over memory copy for struct
  2022-11-05 11:38   ` Richard Biener
@ 2022-11-09  9:21     ` Jiufu Guo
  2022-11-09 12:56       ` Richard Biener
  0 siblings, 1 reply; 14+ messages in thread
From: Jiufu Guo @ 2022-11-09  9:21 UTC (permalink / raw)
  To: Richard Biener
  Cc: Jeff Law, gcc-patches, pinskia, dje.gcc, linkw, segher, rguenth

Hi,

Richard Biener <richard.guenther@gmail.com> writes:

> On Mon, Oct 31, 2022 at 11:14 PM Jeff Law via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
>>
>>
>> On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote:
>> > Hi,
>> >
>> > We know that for struct variable assignment, memory copy may be used.
>> > And for memcpy, we may load and store more bytes as possible at one time.
>> > While it may be not best here:
>> > 1. Before/after a struct variable assignment, the variable may be operated on,
>> > and it is hard for some optimizations to leap over the memcpy.  Then some struct
>> > operations may be sub-optimal, like the issue in PR65421.
>> > 2. The size of a struct is mostly constant, so the memcpy would be expanded.  Using
>> > a small size to load/store and executing in parallel may not be slower than using
>> > a large size to load/store.  (Sure, more registers may be used for smaller pieces.)
>> >
>> >
>> > In PR65421, For source code as below:
>> > ////////t.c
>> > #define FN 4
>> > typedef struct { double a[FN]; } A;
>> >
>> > A foo (const A *a) { return *a; }
>> > A bar (const A a) { return a; }
>>
>> So the first question in my mind is can we do better at the gimple
>> phase?  For the second case in particular can't we just "return a"
>> rather than copying a into <retval> then returning <retval>?  This feels
>> a lot like the return value optimization from C++.  I'm not sure if it
>> applies to the first case or not, it's been a long time since I looked
>> at NRV optimizations, but it might be worth poking around in there a bit
>> (tree-nrv.cc).
>>
>>
>> But even so, these kinds of things are still bound to happen, so it's
>> probably worth thinking about if we can do better in RTL as well.
>>
>>
>> The first thing that comes to my mind is to annotate memcpy calls that
>> are structure assignments.  The idea here is that we may want to expand
>> a memcpy differently in those cases.   Changing how we expand an opaque
>> memcpy call is unlikely to be beneficial in most cases.  But changing
>> how we expand a structure copy may be beneficial by exposing the
>> underlying field values.   This would roughly correspond to your method #1.
>>
>> Or instead of changing how we expand, teach the optimizers about these
>> annotated memcpy calls -- they're just a a copy of each field.   That's
>> how CSE and the propagators could treat them. After some point we'd
>> lower them in the usual ways, but at least early in the RTL pipeline we
>> could keep them as annotated memcpy calls.  This roughly corresponds to
>> your second suggestion.
>
> In the end it depends on the access patterns so some analysis like SRA
> performs would be nice.  The issue with expanding memcpy on GIMPLE
> is that we currently cannot express 'rep; movb;' or other target specific
> sequences from the cpymem like optabs on GIMPLE and recovering those
> from piecewise copies on RTL is going to be difficult.
Actually, it is a special memcpy: it is generated while expanding the
struct assignment (expand_assignment/store_expr/emit_block_move).
We may introduce a function block_move_for_record for struct types, and
this function could be a target hook to generate target-specific sequences.
For example:
r125:DF=[r112:DI+0x20]
r126:DF=[r112:DI+0x28]
[r112:DI]=r125:DF
[r112:DI+0x8]=r126:DF

After expanding, the following passes (cse/prop/dse/...) could optimize
the insn sequences, e.g. "[r112:DI+0x20]=f1;r125:DF=[r112:DI+0x20];
[r112:DI]=r125:DF;r129:DF=[r112:DI]" ==> "r129:DF=f1".

And if the small reading/writing insns still occur in late passes,
e.g. combine, we could recover the insns into a better sequence:
r125:DF=[r112:DI+0x20];r126:DF=[r112:DI+0x28]
==>
r155:V2DI=[r112:DI+0x20];

Any comments? Thanks!

BR,
Jeff(Jiufu)


>
>>
>> jeff
>>
>>
>>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] propgation leap over memory copy for struct
  2022-11-09  9:21     ` Jiufu Guo
@ 2022-11-09 12:56       ` Richard Biener
  0 siblings, 0 replies; 14+ messages in thread
From: Richard Biener @ 2022-11-09 12:56 UTC (permalink / raw)
  To: Jiufu Guo
  Cc: Richard Biener, Jeff Law, gcc-patches, pinskia, dje.gcc, linkw,
	segher, rguenth

On Wed, 9 Nov 2022, Jiufu Guo wrote:

> Hi,
> 
> Richard Biener <richard.guenther@gmail.com> writes:
> 
> > On Mon, Oct 31, 2022 at 11:14 PM Jeff Law via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> >>
> >>
> >> On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote:
> >> > Hi,
> >> >
> >> > We know that for struct variable assignment, memory copy may be used.
> >> > And for memcpy, we may load and store more bytes as possible at one time.
> >> > While it may be not best here:
> >> > 1. Before/after stuct variable assignment, the vaiable may be operated.
> >> > And it is hard for some optimizations to leap over memcpy.  Then some struct
> >> > operations may be sub-optimimal.  Like the issue in PR65421.
> >> > 2. The size of struct is constant mostly, the memcpy would be expanded.  Using
> >> > small size to load/store and executing in parallel may not slower than using
> >> > large size to loat/store. (sure, more registers may be used for smaller bytes.)
> >> >
> >> >
> >> > In PR65421, For source code as below:
> >> > ////////t.c
> >> > #define FN 4
> >> > typedef struct { double a[FN]; } A;
> >> >
> >> > A foo (const A *a) { return *a; }
> >> > A bar (const A a) { return a; }
> >>
> >> So the first question in my mind is can we do better at the gimple
> >> phase?  For the second case in particular can't we just "return a"
> >> rather than copying a into <retval> then returning <retval>?  This feels
> >> a lot like the return value optimization from C++.  I'm not sure if it
> >> applies to the first case or not, it's been a long time since I looked
> >> at NRV optimizations, but it might be worth poking around in there a bit
> >> (tree-nrv.cc).
> >>
> >>
> >> But even so, these kinds of things are still bound to happen, so it's
> >> probably worth thinking about if we can do better in RTL as well.
> >>
> >>
> >> The first thing that comes to my mind is to annotate memcpy calls that
> >> are structure assignments.  The idea here is that we may want to expand
> >> a memcpy differently in those cases.   Changing how we expand an opaque
> >> memcpy call is unlikely to be beneficial in most cases.  But changing
> >> how we expand a structure copy may be beneficial by exposing the
> >> underlying field values.   This would roughly correspond to your method #1.
> >>
> >> Or instead of changing how we expand, teach the optimizers about these
> >> annotated memcpy calls -- they're just a a copy of each field.   That's
> >> how CSE and the propagators could treat them. After some point we'd
> >> lower them in the usual ways, but at least early in the RTL pipeline we
> >> could keep them as annotated memcpy calls.  This roughly corresponds to
> >> your second suggestion.
> >
> > In the end it depends on the access patterns so some analysis like SRA
> > performs would be nice.  The issue with expanding memcpy on GIMPLE
> > is that we currently cannot express 'rep; movb;' or other target specific
> > sequences from the cpymem like optabs on GIMPLE and recovering those
> > from piecewise copies on RTL is going to be difficult.
> Actually, it is a special memcpy. It is generated during expanding the
> struct assignment(expand_assignment/store_expr/emit_block_move).
> We may introduce a function block_move_for_record for struct type.  And
> this function could be a target hook to generate specificed sequences.
> For example:
> r125:DF=[r112:DI+0x20]
> r126:DF=[r112:DI+0x28]
> [r112:DI]=r125:DF
> [r112:DI+0x8]=r126:DF
> 
> After expanding, following passes(cse/prop/dse/..) could optimize the
> insn sequences. e.g "[r112:DI+0x20]=f1;r125:DF=[r112:DI+0x20];
> [r112:DI]=r125:DF;r129:DF=[r112:DI]" ==> "r129:DF=f1"
> 
> And if the small reading/writing insns are still occur in late passes
> e.g. combine, we would recover the isnsn to better sequence:
> r125:DF=[r112:DI+0x20];r126:DF=[r112:DI+0x28]
> ==>
> r155:V2DI=[r112:DI+0x20];
> 
> Any comments? Thanks!

As said, the best copying decomposition depends on the follow-up uses
and the argument-passing ABI, which is why I suggested performing an
SRA-like analysis that collects the access patterns and using that
to drive the heuristics for expanding this special memcpy.

Richard.

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2022-11-09 12:56 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-31  2:42 [RFC] propgation leap over memory copy for struct Jiufu Guo
2022-10-31 22:13 ` Jeff Law
2022-11-01  0:49   ` Segher Boessenkool
2022-11-01  4:30     ` Jiufu Guo
2022-11-05 14:13       ` Richard Biener
2022-11-08  4:05         ` Jiufu Guo
2022-11-09  7:51           ` Jiufu Guo
2022-11-09  8:50             ` Richard Biener
2022-11-01  3:30   ` Jiufu Guo
2022-11-05 11:38   ` Richard Biener
2022-11-09  9:21     ` Jiufu Guo
2022-11-09 12:56       ` Richard Biener
2022-11-01  0:37 ` Segher Boessenkool
2022-11-01  3:01   ` Jiufu Guo
