* [RFC] propagation leap over memory copy for struct
@ 2022-10-31  2:42 Jiufu Guo
  2022-10-31 22:13 ` Jeff Law
  2022-11-01  0:37 ` Segher Boessenkool
  0 siblings, 2 replies; 14+ messages in thread

From: Jiufu Guo @ 2022-10-31 2:42 UTC (permalink / raw)
To: gcc-patches; +Cc: segher, dje.gcc, linkw, guojiufu, rguenth, pinskia

Hi,

We know that a struct variable assignment may be implemented as a memory
copy, and for memcpy we load and store as many bytes as possible at one
time.  But this may not be best here:
1. Before/after a struct variable assignment, the variable may be
   operated on, and it is hard for some optimizations to leap over the
   memcpy.  Some struct operations may then be sub-optimal, like the
   issue in PR65421.
2. The size of a struct is mostly constant, so the memcpy is expanded.
   Using a smaller size to load/store and executing in parallel may not
   be slower than using a larger size to load/store.  (Sure, more
   registers may be used for the smaller accesses.)

In PR65421, for source code as below:
////////t.c
#define FN 4
typedef struct { double a[FN]; } A;

A foo (const A *a) { return *a; }
A bar (const A a) { return a; }
///////

If FN<=2, the size of "A" fits into TImode, and this code can be
optimized (by subreg/cse/fwprop/cprop) as:
-------
foo:
.LFB0:
	.cfi_startproc
	blr
bar:
.LFB1:
	.cfi_startproc
	lfd 2,8(3)
	lfd 1,0(3)
	blr
--------

If the size of "A" is larger than any INT mode size, RTL insns would be
generated as:
   13: r125:V2DI=[r112:DI+0x20]
   14: r126:V2DI=[r112:DI+0x30]
   15: [r112:DI]=r125:V2DI
   16: [r112:DI+0x10]=r126:V2DI   /// memcpy for assignment: D.3338 = arg;
   17: r127:DF=[r112:DI]
   18: r128:DF=[r112:DI+0x8]
   19: r129:DF=[r112:DI+0x10]
   20: r130:DF=[r112:DI+0x18]
------------

I'm thinking about ways to improve this.

Method1: One way may be changing the memory copy by referencing the
type of the struct if the size of the struct is not too big, and
generating insns like the below:
   13: r125:DF=[r112:DI+0x20]
   15: r126:DF=[r112:DI+0x28]
   17: r127:DF=[r112:DI+0x30]
   19: r128:DF=[r112:DI+0x38]
   14: [r112:DI]=r125:DF
   16: [r112:DI+0x8]=r126:DF
   18: [r112:DI+0x10]=r127:DF
   20: [r112:DI+0x18]=r128:DF
   21: r129:DF=[r112:DI]
   22: r130:DF=[r112:DI+0x8]
   23: r131:DF=[r112:DI+0x10]
   24: r132:DF=[r112:DI+0x18]
Then passes (cse, prop, dse...) could help to optimize the code.

Concerns with this method: we may not want to do this if the number of
fields is too large.  And the types/modes of each load/store may depend
on the platform and differ from the types of the struct's fields.  For
example, for "struct {double a[3]; long long l;}" on ppc64le, DImode
may be better for assignments to a parameter.

Method2: Another way may be enhancing CSE so that it can treat one
large memory slot as two (or more) combined slots:
   13: r125:V2DI#0=[r112:DI+0x20]
   13': r125:V2DI#8=[r112:DI+0x28]
   15: [r112:DI]#0=r125:V2DI#0
   15': [r112:DI]#8=r125:V2DI#8
This may seem more of a hack in CSE.

Method3: For some record types, use "PARALLEL:BLK" instead of
"MEM:BLK".  To do this, "moving" between "PARALLEL<->PARALLEL" and
"PARALLEL<->MEM" may need to be enhanced.  This method may require more
effort to make it work for corner/unknown cases.

I'm wondering which would be more flexible for handling this issue?

Thanks for any comments and suggestions!

BR,
Jeff(Jiufu)

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [RFC] propagation leap over memory copy for struct
  2022-10-31  2:42 [RFC] propagation leap over memory copy for struct Jiufu Guo
@ 2022-10-31 22:13 ` Jeff Law
  2022-11-01  0:49   ` Segher Boessenkool
                     ` (2 more replies)
  2022-11-01  0:37 ` Segher Boessenkool
  1 sibling, 3 replies; 14+ messages in thread

From: Jeff Law @ 2022-10-31 22:13 UTC (permalink / raw)
To: Jiufu Guo, gcc-patches; +Cc: segher, rguenth, pinskia, linkw, dje.gcc

On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote:
> Hi,
>
> We know that a struct variable assignment may be implemented as a
> memory copy, and for memcpy we load and store as many bytes as
> possible at one time.  But this may not be best here:
> 1. Before/after a struct variable assignment, the variable may be
>    operated on, and it is hard for some optimizations to leap over the
>    memcpy.  Some struct operations may then be sub-optimal, like the
>    issue in PR65421.
> 2. The size of a struct is mostly constant, so the memcpy is expanded.
>    Using a smaller size to load/store and executing in parallel may
>    not be slower than using a larger size to load/store.
>
> In PR65421, for source code as below:
> ////////t.c
> #define FN 4
> typedef struct { double a[FN]; } A;
>
> A foo (const A *a) { return *a; }
> A bar (const A a) { return a; }

So the first question in my mind is: can we do better at the gimple
phase?  For the second case in particular, can't we just "return a"
rather than copying a into <retval> and then returning <retval>?  This
feels a lot like the return value optimization from C++.  I'm not sure
whether it applies to the first case or not; it's been a long time
since I looked at NRV optimizations, but it might be worth poking
around in there a bit (tree-nrv.cc).

But even so, these kinds of things are still bound to happen, so it's
probably worth thinking about whether we can do better in RTL as well.

The first thing that comes to my mind is to annotate memcpy calls that
are structure assignments.  The idea here is that we may want to expand
a memcpy differently in those cases.  Changing how we expand an opaque
memcpy call is unlikely to be beneficial in most cases, but changing
how we expand a structure copy may be beneficial by exposing the
underlying field values.  This would roughly correspond to your method
#1.

Or, instead of changing how we expand, teach the optimizers about these
annotated memcpy calls -- they're just a copy of each field.  That's
how CSE and the propagators could treat them.  After some point we'd
lower them in the usual ways, but at least early in the RTL pipeline we
could keep them as annotated memcpy calls.  This roughly corresponds to
your second suggestion.

jeff
* Re: [RFC] propagation leap over memory copy for struct
  2022-10-31 22:13 ` Jeff Law
@ 2022-11-01  0:49   ` Segher Boessenkool
  2022-11-01  4:30     ` Jiufu Guo
  2022-11-01  3:30   ` Jiufu Guo
  2022-11-05 11:38   ` Richard Biener
  2 siblings, 1 reply; 14+ messages in thread

From: Segher Boessenkool @ 2022-11-01 0:49 UTC (permalink / raw)
To: Jeff Law; +Cc: Jiufu Guo, gcc-patches, rguenth, pinskia, linkw, dje.gcc

On Mon, Oct 31, 2022 at 04:13:38PM -0600, Jeff Law wrote:
> On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote:
> > We know that a struct variable assignment may be implemented as a
> > memory copy, and for memcpy we load and store as many bytes as
> > possible at one time.  But this may not be best here:

> So the first question in my mind is: can we do better at the gimple
> phase?  For the second case in particular, can't we just "return a"
> rather than copying a into <retval> and then returning <retval>?  This
> feels a lot like the return value optimization from C++.  I'm not sure
> whether it applies to the first case or not; it's been a long time
> since I looked at NRV optimizations, but it might be worth poking
> around in there a bit (tree-nrv.cc).

If it is a bigger struct you end up with quite a lot of stuff in
registers.  GCC will eventually put that all in memory so it will work
out fine in the end, but you are likely to get inefficient code.

OTOH, 8 bytes isn't as big as we would want these days, is it?  So it
would be useful to put smaller temporaries, say 32 bytes and smaller,
in registers instead of in memory.

> But even so, these kinds of things are still bound to happen, so it's
> probably worth thinking about whether we can do better in RTL as well.

Always.  It is a mistake to think that having better high-level
optimisations means that you don't need good low-level optimisations
anymore: in fact, deficiencies there become more glaringly apparent if
the early pipeline opts become better :-)

> The first thing that comes to my mind is to annotate memcpy calls that
> are structure assignments.  The idea here is that we may want to
> expand a memcpy differently in those cases.  Changing how we expand an
> opaque memcpy call is unlikely to be beneficial in most cases, but
> changing how we expand a structure copy may be beneficial by exposing
> the underlying field values.  This would roughly correspond to your
> method #1.
>
> Or, instead of changing how we expand, teach the optimizers about
> these annotated memcpy calls -- they're just a copy of each field.
> That's how CSE and the propagators could treat them.  After some point
> we'd lower them in the usual ways, but at least early in the RTL
> pipeline we could keep them as annotated memcpy calls.  This roughly
> corresponds to your second suggestion.

Ideally this won't ever make it as far as RTL, if the structures do not
need to go via memory.  All high-level optimisations should have been
done earlier, and hopefully it was not expand itself that forced stuff
into memory!  :-/


Segher
* Re: [RFC] propagation leap over memory copy for struct
  2022-11-01  0:49 ` Segher Boessenkool
@ 2022-11-01  4:30   ` Jiufu Guo
  2022-11-05 14:13     ` Richard Biener
  0 siblings, 1 reply; 14+ messages in thread

From: Jiufu Guo @ 2022-11-01 4:30 UTC (permalink / raw)
To: Segher Boessenkool
Cc: Jeff Law, gcc-patches, rguenth, pinskia, linkw, dje.gcc

Segher Boessenkool <segher@kernel.crashing.org> writes:

> On Mon, Oct 31, 2022 at 04:13:38PM -0600, Jeff Law wrote:
>> On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote:
>> > We know that a struct variable assignment may be implemented as a
>> > memory copy, and for memcpy we load and store as many bytes as
>> > possible at one time.  But this may not be best here:
>
>> So the first question in my mind is: can we do better at the gimple
>> phase?  For the second case in particular, can't we just "return a"
>> rather than copying a into <retval> and then returning <retval>?
>> This feels a lot like the return value optimization from C++.  I'm
>> not sure whether it applies to the first case or not; it's been a
>> long time since I looked at NRV optimizations, but it might be worth
>> poking around in there a bit (tree-nrv.cc).
>
> If it is a bigger struct you end up with quite a lot of stuff in
> registers.  GCC will eventually put that all in memory so it will work
> out fine in the end, but you are likely to get inefficient code.

Yes.  We may need to use memory to save registers for a big struct.
For a small struct it may be practical to use registers.  We may
leverage the idea that some kinds of small structs are passed to
functions in registers.

> OTOH, 8 bytes isn't as big as we would want these days, is it?  So it
> would be useful to put smaller temporaries, say 32 bytes and smaller,
> in registers instead of in memory.

I think you mean: we should try to use registers to avoid memory
accesses, and using registers would be OK for larger copies (e.g. a
32-byte memcpy).  Great suggestion, thanks a lot!

Like the below idea:
  [r100:TI, r101:TI] = src;  // Or r100:OI/OO = src;
  dest = [r100:TI, r101:TI];

Currently, for an 8-byte structure, we are using TImode for it, and the
subreg/fwprop/cse passes are able to optimize it as expected.  Two
concerns here: larger integer modes (OI/OO/...) may not be introduced
yet, and I'm not sure whether the current infrastructure supports using
two or more registers for one structure.

>> But even so, these kinds of things are still bound to happen, so it's
>> probably worth thinking about whether we can do better in RTL as well.
>
> Always.  It is a mistake to think that having better high-level
> optimisations means that you don't need good low-level optimisations
> anymore: in fact, deficiencies there become more glaringly apparent if
> the early pipeline opts become better :-)

Understood, thanks :)

>> The first thing that comes to my mind is to annotate memcpy calls
>> that are structure assignments.  [...]
>
> Ideally this won't ever make it as far as RTL, if the structures do
> not need to go via memory.  All high-level optimisations should have
> been done earlier, and hopefully it was not expand itself that forced
> stuff into memory!  :-/

Currently, after early gimple optimization, the struct member accesses
may still need to be in memory (if the mode of the struct is BLK).
For example:

  _Bool foo (const A a) { return a.a[0] > 1.0; }

The optimized gimple would be:
  _1 = a.a[0];
  _3 = _1 > 1.0e+0;
  return _3;

During expand to RTL, parm 'a' is stored to memory from the arg regs
first, and "a.a[0]" is also read back from memory.  It may be better to
use "f1" for "a.a[0]" here.

Maybe method3 is similar to your idea: using "parallel:BLK {DF;DF;DF;DF}"
for the struct (the BLK part may be changed), and using 4 DF registers
to access the structure in the expand pass.

Thanks again for your kind and helpful comments!

BR,
Jeff(Jiufu)
* Re: [RFC] propagation leap over memory copy for struct
  2022-11-01  4:30 ` Jiufu Guo
@ 2022-11-05 14:13   ` Richard Biener
  2022-11-08  4:05     ` Jiufu Guo
  0 siblings, 1 reply; 14+ messages in thread

From: Richard Biener @ 2022-11-05 14:13 UTC (permalink / raw)
To: Jiufu Guo
Cc: Segher Boessenkool, Jeff Law, gcc-patches, rguenth, pinskia,
    linkw, dje.gcc

On Tue, 1 Nov 2022, Jiufu Guo wrote:

> Segher Boessenkool <segher@kernel.crashing.org> writes:
> [...]
> > Ideally this won't ever make it as far as RTL, if the structures do
> > not need to go via memory.  All high-level optimisations should have
> > been done earlier, and hopefully it was not expand itself that
> > forced stuff into memory!  :-/
>
> Currently, after early gimple optimization, the struct member accesses
> may still need to be in memory (if the mode of the struct is BLK).
> For example:
>
>   _Bool foo (const A a) { return a.a[0] > 1.0; }
>
> The optimized gimple would be:
>   _1 = a.a[0];
>   _3 = _1 > 1.0e+0;
>   return _3;
>
> During expand to RTL, parm 'a' is stored to memory from the arg regs
> first, and "a.a[0]" is also read back from memory.  It may be better
> to use "f1" for "a.a[0]" here.
>
> Maybe method3 is similar to your idea: using "parallel:BLK
> {DF;DF;DF;DF}" for the struct (the BLK part may be changed), and using
> 4 DF registers to access the structure in the expand pass.

I think for cases like this it might be a good idea to perform SRA-like
analysis at RTL expansion time, when we know how parameters arrive (in
pieces), and take that knowledge into account when assigning the RTL to
a decl.  The same applies for the return ABI.  Since we rely on RTL to
elide copies to/from return/argument registers/slots, we have to assign
"layout compatible" registers to the corresponding auto vars.

--
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461
Nuernberg, Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald,
Boudien Moerman; HRB 36809 (AG Nuernberg)
* Re: [RFC] propagation leap over memory copy for struct
  2022-11-05 14:13 ` Richard Biener
@ 2022-11-08  4:05   ` Jiufu Guo
  2022-11-09  7:51     ` Jiufu Guo
  0 siblings, 1 reply; 14+ messages in thread

From: Jiufu Guo @ 2022-11-08 4:05 UTC (permalink / raw)
To: Richard Biener
Cc: Segher Boessenkool, Jeff Law, gcc-patches, rguenth, pinskia,
    linkw, dje.gcc

Richard Biener <rguenther@suse.de> writes:

> On Tue, 1 Nov 2022, Jiufu Guo wrote:
> [...]
> I think for cases like this it might be a good idea to perform
> SRA-like analysis at RTL expansion time, when we know how parameters
> arrive (in pieces), and take that knowledge into account when
> assigning the RTL to a decl.  The same applies for the return ABI.
> Since we rely on RTL to elide copies to/from return/argument
> registers/slots, we have to assign "layout compatible" registers to
> the corresponding auto vars.

Thanks for pointing this out!
This looks like a kind of SRA, especially for parms and return values.
As you pointed out, there are a few things we may need to take care of:
1. We would use the "layout compatible" mode reg for the scalar, e.g.
   DF for "{double arr[4];}", but DI for "{double arr[3]; long l;}".
2. For an aggregate that will be assigned to the return value, before
   expanding the 'return stmt' we may not be sure whether we need to
   assign 'scalar rtl(s)' to the decl.  To handle this issue, we may
   use 'scalar rtl(s)' for every struct decl, as if it were a parm or
   the return result.

Then method3 may be similar to this idea: using "parallel RTL" for the
decl (maybe using DECL_RTL directly).

Please point out any misunderstandings, or further suggestions.
Thanks again!

BR,
Jeff(Jiufu)
* Re: [RFC] propagation leap over memory copy for struct
  2022-11-08  4:05 ` Jiufu Guo
@ 2022-11-09  7:51   ` Jiufu Guo
  2022-11-09  8:50     ` Richard Biener
  0 siblings, 1 reply; 14+ messages in thread

From: Jiufu Guo @ 2022-11-09 7:51 UTC (permalink / raw)
To: Jiufu Guo via Gcc-patches
Cc: Richard Biener, Segher Boessenkool, Jeff Law, rguenth, pinskia,
    linkw, dje.gcc

Jiufu Guo via Gcc-patches <gcc-patches@gcc.gnu.org> writes:

> Richard Biener <rguenther@suse.de> writes:
> [...]
>> I think for cases like this it might be a good idea to perform
>> SRA-like analysis at RTL expansion time, when we know how parameters
>> arrive (in pieces), and take that knowledge into account when
>> assigning the RTL to a decl.  The same applies for the return ABI.
>> Since we rely on RTL to elide copies to/from return/argument
>> registers/slots, we have to assign "layout compatible" registers to
>> the corresponding auto vars.

In other words, for this kind of parameter, create a few scalars for
each of the pieces, and expand 'accesses to the parameter' into
'accesses to the scalars' accordingly.  This would also avoid memory
accesses for the parameter.

Maybe we could use something like "parallel:M? {DF;DF;DF;DF}" or
"parallel:M? {DI;DI;DI;DI}" to group the scalars in DECL_RTL.  For
this, we would need to support 'moving'/'accessing' these sub-RTLs.

Any more suggestions?  Thanks.

BR,
Jeff(Jiufu)

> Thanks for pointing this out!
> This looks like a kind of SRA, especially for parms and return values.
> [...]
* Re: [RFC] propgation leap over memory copy for struct 2022-11-09 7:51 ` Jiufu Guo @ 2022-11-09 8:50 ` Richard Biener 0 siblings, 0 replies; 14+ messages in thread From: Richard Biener @ 2022-11-09 8:50 UTC (permalink / raw) To: Jiufu Guo Cc: Jiufu Guo via Gcc-patches, Segher Boessenkool, Jeff Law, pinskia, linkw, dje.gcc [-- Attachment #1: Type: text/plain, Size: 7467 bytes --] On Wed, 9 Nov 2022, Jiufu Guo wrote: > Jiufu Guo via Gcc-patches <gcc-patches@gcc.gnu.org> writes: > > > Richard Biener <rguenther@suse.de> writes: > > > >> On Tue, 1 Nov 2022, Jiufu Guo wrote: > >> > >>> Segher Boessenkool <segher@kernel.crashing.org> writes: > >>> > >>> > On Mon, Oct 31, 2022 at 04:13:38PM -0600, Jeff Law wrote: > >>> >> On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote: > >>> >> >We know that for struct variable assignment, memory copy may be used. > >>> >> >And for memcpy, we may load and store more bytes as possible at one time. > >>> >> >While it may be not best here: > >>> > > >>> >> So the first question in my mind is can we do better at the gimple > >>> >> phase? For the second case in particular can't we just "return a" > >>> >> rather than copying a into <retval> then returning <retval>? This feels > >>> >> a lot like the return value optimization from C++. I'm not sure if it > >>> >> applies to the first case or not, it's been a long time since I looked > >>> >> at NRV optimizations, but it might be worth poking around in there a bit > >>> >> (tree-nrv.cc). > >>> > > >>> > If it is a bigger struct you end up with quite a lot of stuff in > >>> > registers. GCC will eventually put that all in memory so it will work > >>> > out fine in the end, but you are likely to get inefficient code. > >>> Yes. We may need to use memory to save regiters for big struct. > >>> Small struct may be practical to use registers. We may leverage the > >>> idea that: some type of small struct are passing to function through > >>> registers. 
> >>> > >>> > > >>> > OTOH, 8 bytes isn't as big as we would want these days, is it? So it > >>> > would be useful to put smaller temportaries, say 32 bytes and smaller, > >>> > in registers instead of in memory. > >>> I think you mean: we should try to registers to avoid memory accesing, > >>> and using registers would be ok for more bytes memcpy(32bytes). > >>> Great sugguestion, thanks a lot! > >>> > >>> Like below idea: > >>> [r100:TI, r101:TI] = src; //Or r100:OI/OO = src; > >>> dest = [r100:TI, r101:TI]; > >>> > >>> Currently, for 8bytes structure, we are using TImode for it. > >>> And subreg/fwprop/cse passes are able to optimize it as expected. > >>> Two concerns here: larger int modes(OI/OO/..) may be not introduced yet; > >>> I'm not sure if current infrastructure supports to use two more > >>> registers for one structure. > >>> > >>> > > >>> >> But even so, these kinds of things are still bound to happen, so it's > >>> >> probably worth thinking about if we can do better in RTL as well. > >>> > > >>> > Always. It is a mistake to think that having better high-level > >>> > optimisations means that you don't need good low-level optimisations > >>> > anymore: in fact deficiencies there become more glaringly apparent if > >>> > the early pipeline opts become better :-) > >>> Understant, thanks :) > >>> > >>> > > >>> >> The first thing that comes to my mind is to annotate memcpy calls that > >>> >> are structure assignments. The idea here is that we may want to expand > >>> >> a memcpy differently in those cases. Changing how we expand an opaque > >>> >> memcpy call is unlikely to be beneficial in most cases. But changing > >>> >> how we expand a structure copy may be beneficial by exposing the > >>> >> underlying field values. This would roughly correspond to your method > >>> >> #1. > >>> >> > >>> >> Or instead of changing how we expand, teach the optimizers about these > >>> >> annotated memcpy calls -- they're just a a copy of each field. 
That's > >>> >> how CSE and the propagators could treat them. After some point we'd > >>> >> lower them in the usual ways, but at least early in the RTL pipeline we > >>> >> could keep them as annotated memcpy calls. This roughly corresponds to > >>> >> your second suggestion. > >>> > > >>> > Ideally this won't ever make it as far as RTL, if the structures do not > >>> > need to go via memory. All high-level optimissations should have been > >>> > done earlier, and hopefully it was not expand tiself that forced stuff > >>> > into memory! :-/ > >>> Currently, after early gimple optimization, the struct member accessing > >>> may still need to be in memory (if the mode of the struct is BLK). > >>> For example: > >>> > >>> _Bool foo (const A a) { return a.a[0] > 1.0; } > >>> > >>> The optimized gimple would be: > >>> _1 = a.a[0]; > >>> _3 = _1 > 1.0e+0; > >>> return _3; > >>> > >>> During expand to RTL, parm 'a' is store to memory from arg regs firstly, > >>> and "a.a[0]" is also reading from memory. It may be better to use > >>> "f1" for "a.a[0]" here. > >>> > >>> Maybe, method3 is similar with your idea: using "parallel:BLK {DF;DF;DF; DF}" > >>> for the struct (BLK may be changed), and using 4 DF registers to access > >>> the structure in expand pass. > >> > >> I think for cases like this it might be a good idea to perform > >> SRA-like analysis at RTL expansion time when we know how parameters > >> arrive (in pieces) and take that knowledge into account when > >> assigning the RTL to a decl. The same applies for the return ABI. > >> Since we rely on RTL to elide copies to/from return/argument > >> registers/slots we have to assign "layout compatible" registers > >> to the corresponding auto vars. > >> > In other words, for this kind of parameter, creating a few scalars > for each pieces. And the 'accessing to the paramter' is expanded to > 'accessing the scalars' accordingly. > This would also able to avoid memory accesing for the paramter. 
> > Maybe we could use something like "parallel:M? {DF;DF;DF;DF}" or > "parallel:M? {DI;DI;DI;DI}" to group the scalars in DECL_RTL. > > For this, we would need to support 'move'/'access' these sub-RTLs. > > Any more sugguestions? Thanks. I think the key is going to be exposing the ABI to the GIMPLE phase; doing "true" SRA at RTL expansion looks dubious. My suggestion was to do SRA-like analysis to improve the memcpy expansion heuristics. For the case of matching up with incoming parameters, that can only be done if the uses match up with that layout (which doesn't necessarily agree with what we'd choose for auto variables). > BR, > Jeff(Jiufu) > > > Thanks for pointing out this! > > This looks like a kind of SRA, especially for parm and return value. > > As you pointed out, there is something that we may need to take care > > to adjust: > > 1. We would use the "layout compatible" mode reg for the scalar. e.g. > > DF for "{double arr[4];}", but DI for "{double arr[3]; long l;}". > > > > 2. For an aggregate that will be assigned to return value, before > > expanding to 'return stmt', we may not sure if need to assign > > 'scalar rtl(s)' to decl. > > To handle this issue, we may use 'scalar rtl(s)' for all struct decl > > as if it is parm or return result. > > Then method3 may be similar to this idea: using "parallel RTL" for > > the decl (may use DECL_RTL directly). > > > > Please point out any misunderstandings or suggestions. > > Thanks again! > > > > BR, > > Jeff(Jiufu) > > > >>> > >>> Thanks again for your kindly and helpful comments! > >>> > >>> BR, > >>> Jeff(Jiufu) > >>> > >>> > > >>> > > >>> > Segher > >>> > -- Richard Biener <rguenther@suse.de> SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg, Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman; HRB 36809 (AG Nuernberg) ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC] propgation leap over memory copy for struct 2022-10-31 22:13 ` Jeff Law 2022-11-01 0:49 ` Segher Boessenkool @ 2022-11-01 3:30 ` Jiufu Guo 2022-11-05 11:38 ` Richard Biener 2 siblings, 0 replies; 14+ messages in thread From: Jiufu Guo @ 2022-11-01 3:30 UTC (permalink / raw) To: Jeff Law; +Cc: gcc-patches, segher, rguenth, pinskia, linkw, dje.gcc Jeff Law <jeffreyalaw@gmail.com> writes: > On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote: >> Hi, >> >> We know that for struct variable assignment, memory copy may be used. >> And for memcpy, we may load and store more bytes as possible at one time. >> While it may be not best here: >> 1. Before/after stuct variable assignment, the vaiable may be operated. >> And it is hard for some optimizations to leap over memcpy. Then some struct >> operations may be sub-optimimal. Like the issue in PR65421. >> 2. The size of struct is constant mostly, the memcpy would be expanded. Using >> small size to load/store and executing in parallel may not slower than using >> large size to loat/store. (sure, more registers may be used for smaller bytes.) >> >> >> In PR65421, For source code as below: >> ////////t.c >> #define FN 4 >> typedef struct { double a[FN]; } A; >> >> A foo (const A *a) { return *a; } >> A bar (const A a) { return a; } > > So the first question in my mind is can we do better at the gimple > phase? For the second case in particular can't we just "return a" > rather than copying a into <retval> then returning <retval>? This > feels a lot like the return value optimization from C++. I'm not sure > if it applies to the first case or not, it's been a long time since I > looked at NRV optimizations, but it might be worth poking around in > there a bit (tree-nrv.cc). Thanks for pointing out this idea!
Currently the optimized gimple looks like: D.3334 = a; return D.3334; and D.3336 = *a_2(D); return D.3336; It may be better to have: "return a;" and "return *a;" ----------------- If the code looks like: typedef struct { double a[3]; long l;} A; // mixed types A foo (const A a) { return a; } A bar (const A *a) { return *a; } The current optimized gimple looks like: <retval> = a; return <retval>; and <retval> = *a_2(D); return <retval>; "return a;" and "return *a;" may work here too. > > > But even so, these kinds of things are still bound to happen, so it's > probably worth thinking about if we can do better in RTL as well. > Yeap, thanks! > > The first thing that comes to my mind is to annotate memcpy calls that > are structure assignments. The idea here is that we may want to > expand a memcpy differently in those cases. Changing how we expand > an opaque memcpy call is unlikely to be beneficial in most cases. But > changing how we expand a structure copy may be beneficial by exposing > the underlying field values. This would roughly correspond to your > method #1. Right. For general memcpy, we would read/write larger bytes at one time. Reading/writing small fields may only be beneficial for structure assignment. > > Or instead of changing how we expand, teach the optimizers about these > annotated memcpy calls -- they're just a a copy of each field. > That's how CSE and the propagators could treat them. After some point > we'd lower them in the usual ways, but at least early in the RTL > pipeline we could keep them as annotated memcpy calls. This roughly > corresponds to your second suggestion. Thanks for your insights about this idea! Using annotated memcpy for early optimizations; it would be treated as general memcpy in later passes. Thanks again for your very helpful comments and suggestions! BR, Jeff(Jiufu) > > > jeff ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC] propgation leap over memory copy for struct 2022-10-31 22:13 ` Jeff Law 2022-11-01 0:49 ` Segher Boessenkool 2022-11-01 3:30 ` Jiufu Guo @ 2022-11-05 11:38 ` Richard Biener 2022-11-09 9:21 ` Jiufu Guo 2 siblings, 1 reply; 14+ messages in thread From: Richard Biener @ 2022-11-05 11:38 UTC (permalink / raw) To: Jeff Law; +Cc: Jiufu Guo, gcc-patches, pinskia, dje.gcc, linkw, segher, rguenth On Mon, Oct 31, 2022 at 11:14 PM Jeff Law via Gcc-patches <gcc-patches@gcc.gnu.org> wrote: > > > On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote: > > Hi, > > > > We know that for struct variable assignment, memory copy may be used. > > And for memcpy, we may load and store more bytes as possible at one time. > > While it may be not best here: > > 1. Before/after stuct variable assignment, the vaiable may be operated. > > And it is hard for some optimizations to leap over memcpy. Then some struct > > operations may be sub-optimimal. Like the issue in PR65421. > > 2. The size of struct is constant mostly, the memcpy would be expanded. Using > > small size to load/store and executing in parallel may not slower than using > > large size to loat/store. (sure, more registers may be used for smaller bytes.) > > > > > > In PR65421, For source code as below: > > ////////t.c > > #define FN 4 > > typedef struct { double a[FN]; } A; > > > > A foo (const A *a) { return *a; } > > A bar (const A a) { return a; } > > So the first question in my mind is can we do better at the gimple > phase? For the second case in particular can't we just "return a" > rather than copying a into <retval> then returning <retval>? This feels > a lot like the return value optimization from C++. I'm not sure if it > applies to the first case or not, it's been a long time since I looked > at NRV optimizations, but it might be worth poking around in there a bit > (tree-nrv.cc). 
> > > But even so, these kinds of things are still bound to happen, so it's > probably worth thinking about if we can do better in RTL as well. > > > The first thing that comes to my mind is to annotate memcpy calls that > are structure assignments. The idea here is that we may want to expand > a memcpy differently in those cases. Changing how we expand an opaque > memcpy call is unlikely to be beneficial in most cases. But changing > how we expand a structure copy may be beneficial by exposing the > underlying field values. This would roughly correspond to your method #1. > > Or instead of changing how we expand, teach the optimizers about these > annotated memcpy calls -- they're just a a copy of each field. That's > how CSE and the propagators could treat them. After some point we'd > lower them in the usual ways, but at least early in the RTL pipeline we > could keep them as annotated memcpy calls. This roughly corresponds to > your second suggestion. In the end it depends on the access patterns so some analysis like SRA performs would be nice. The issue with expanding memcpy on GIMPLE is that we currently cannot express 'rep; movb;' or other target specific sequences from the cpymem like optabs on GIMPLE and recovering those from piecewise copies on RTL is going to be difficult. > > jeff > > > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC] propgation leap over memory copy for struct 2022-11-05 11:38 ` Richard Biener @ 2022-11-09 9:21 ` Jiufu Guo 2022-11-09 12:56 ` Richard Biener 0 siblings, 1 reply; 14+ messages in thread From: Jiufu Guo @ 2022-11-09 9:21 UTC (permalink / raw) To: Richard Biener Cc: Jeff Law, gcc-patches, pinskia, dje.gcc, linkw, segher, rguenth Hi, Richard Biener <richard.guenther@gmail.com> writes: > On Mon, Oct 31, 2022 at 11:14 PM Jeff Law via Gcc-patches > <gcc-patches@gcc.gnu.org> wrote: >> >> >> On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote: >> > Hi, >> > >> > We know that for struct variable assignment, memory copy may be used. >> > And for memcpy, we may load and store more bytes as possible at one time. >> > While it may be not best here: >> > 1. Before/after stuct variable assignment, the vaiable may be operated. >> > And it is hard for some optimizations to leap over memcpy. Then some struct >> > operations may be sub-optimimal. Like the issue in PR65421. >> > 2. The size of struct is constant mostly, the memcpy would be expanded. Using >> > small size to load/store and executing in parallel may not slower than using >> > large size to loat/store. (sure, more registers may be used for smaller bytes.) >> > >> > >> > In PR65421, For source code as below: >> > ////////t.c >> > #define FN 4 >> > typedef struct { double a[FN]; } A; >> > >> > A foo (const A *a) { return *a; } >> > A bar (const A a) { return a; } >> >> So the first question in my mind is can we do better at the gimple >> phase? For the second case in particular can't we just "return a" >> rather than copying a into <retval> then returning <retval>? This feels >> a lot like the return value optimization from C++. I'm not sure if it >> applies to the first case or not, it's been a long time since I looked >> at NRV optimizations, but it might be worth poking around in there a bit >> (tree-nrv.cc). 
>> >> >> But even so, these kinds of things are still bound to happen, so it's >> probably worth thinking about if we can do better in RTL as well. >> >> >> The first thing that comes to my mind is to annotate memcpy calls that >> are structure assignments. The idea here is that we may want to expand >> a memcpy differently in those cases. Changing how we expand an opaque >> memcpy call is unlikely to be beneficial in most cases. But changing >> how we expand a structure copy may be beneficial by exposing the >> underlying field values. This would roughly correspond to your method #1. >> >> Or instead of changing how we expand, teach the optimizers about these >> annotated memcpy calls -- they're just a a copy of each field. That's >> how CSE and the propagators could treat them. After some point we'd >> lower them in the usual ways, but at least early in the RTL pipeline we >> could keep them as annotated memcpy calls. This roughly corresponds to >> your second suggestion. > > In the end it depends on the access patterns so some analysis like SRA > performs would be nice. The issue with expanding memcpy on GIMPLE > is that we currently cannot express 'rep; movb;' or other target specific > sequences from the cpymem like optabs on GIMPLE and recovering those > from piecewise copies on RTL is going to be difficult. Actually, it is a special memcpy. It is generated during expanding the struct assignment(expand_assignment/store_expr/emit_block_move). We may introduce a function block_move_for_record for struct type. And this function could be a target hook to generate specificed sequences. For example: r125:DF=[r112:DI+0x20] r126:DF=[r112:DI+0x28] [r112:DI]=r125:DF [r112:DI+0x8]=r126:DF After expanding, following passes(cse/prop/dse/..) could optimize the insn sequences. e.g "[r112:DI+0x20]=f1;r125:DF=[r112:DI+0x20]; [r112:DI]=r125:DF;r129:DF=[r112:DI]" ==> "r129:DF=f1" And if the small reading/writing insns are still occur in late passes e.g. 
combine, we would recover the insns to a better sequence: r125:DF=[r112:DI+0x20];r126:DF=[r112:DI+0x28] ==> r155:V2DI=[r112:DI+0x20]; Any comments? Thanks! BR, Jeff(Jiufu) > >> >> jeff >> >> >> ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC] propgation leap over memory copy for struct 2022-11-09 9:21 ` Jiufu Guo @ 2022-11-09 12:56 ` Richard Biener 0 siblings, 0 replies; 14+ messages in thread From: Richard Biener @ 2022-11-09 12:56 UTC (permalink / raw) To: Jiufu Guo Cc: Richard Biener, Jeff Law, gcc-patches, pinskia, dje.gcc, linkw, segher, rguenth On Wed, 9 Nov 2022, Jiufu Guo wrote: > Hi, > > Richard Biener <richard.guenther@gmail.com> writes: > > > On Mon, Oct 31, 2022 at 11:14 PM Jeff Law via Gcc-patches > > <gcc-patches@gcc.gnu.org> wrote: > >> > >> > >> On 10/30/22 20:42, Jiufu Guo via Gcc-patches wrote: > >> > Hi, > >> > > >> > We know that for struct variable assignment, memory copy may be used. > >> > And for memcpy, we may load and store more bytes as possible at one time. > >> > While it may be not best here: > >> > 1. Before/after stuct variable assignment, the vaiable may be operated. > >> > And it is hard for some optimizations to leap over memcpy. Then some struct > >> > operations may be sub-optimimal. Like the issue in PR65421. > >> > 2. The size of struct is constant mostly, the memcpy would be expanded. Using > >> > small size to load/store and executing in parallel may not slower than using > >> > large size to loat/store. (sure, more registers may be used for smaller bytes.) > >> > > >> > > >> > In PR65421, For source code as below: > >> > ////////t.c > >> > #define FN 4 > >> > typedef struct { double a[FN]; } A; > >> > > >> > A foo (const A *a) { return *a; } > >> > A bar (const A a) { return a; } > >> > >> So the first question in my mind is can we do better at the gimple > >> phase? For the second case in particular can't we just "return a" > >> rather than copying a into <retval> then returning <retval>? This feels > >> a lot like the return value optimization from C++. I'm not sure if it > >> applies to the first case or not, it's been a long time since I looked > >> at NRV optimizations, but it might be worth poking around in there a bit > >> (tree-nrv.cc). 
> >> > >> > >> But even so, these kinds of things are still bound to happen, so it's > >> probably worth thinking about if we can do better in RTL as well. > >> > >> > >> The first thing that comes to my mind is to annotate memcpy calls that > >> are structure assignments. The idea here is that we may want to expand > >> a memcpy differently in those cases. Changing how we expand an opaque > >> memcpy call is unlikely to be beneficial in most cases. But changing > >> how we expand a structure copy may be beneficial by exposing the > >> underlying field values. This would roughly correspond to your method #1. > >> > >> Or instead of changing how we expand, teach the optimizers about these > >> annotated memcpy calls -- they're just a a copy of each field. That's > >> how CSE and the propagators could treat them. After some point we'd > >> lower them in the usual ways, but at least early in the RTL pipeline we > >> could keep them as annotated memcpy calls. This roughly corresponds to > >> your second suggestion. > > > > In the end it depends on the access patterns so some analysis like SRA > > performs would be nice. The issue with expanding memcpy on GIMPLE > > is that we currently cannot express 'rep; movb;' or other target specific > > sequences from the cpymem like optabs on GIMPLE and recovering those > > from piecewise copies on RTL is going to be difficult. > Actually, it is a special memcpy. It is generated during expanding the > struct assignment(expand_assignment/store_expr/emit_block_move). > We may introduce a function block_move_for_record for struct type. And > this function could be a target hook to generate specificed sequences. > For example: > r125:DF=[r112:DI+0x20] > r126:DF=[r112:DI+0x28] > [r112:DI]=r125:DF > [r112:DI+0x8]=r126:DF > > After expanding, following passes(cse/prop/dse/..) could optimize the > insn sequences. 
e.g "[r112:DI+0x20]=f1;r125:DF=[r112:DI+0x20]; > [r112:DI]=r125:DF;r129:DF=[r112:DI]" ==> "r129:DF=f1" > > And if the small reading/writing insns are still occur in late passes > e.g. combine, we would recover the insns to a better sequence: > r125:DF=[r112:DI+0x20];r126:DF=[r112:DI+0x28] > ==> > r155:V2DI=[r112:DI+0x20]; > > Any comments? Thanks! As said, the best copying decomposition depends on the follow-up uses and the argument-passing ABI, which is why I suggested performing SRA-like analysis that collects the access patterns and uses them to drive the heuristic expanding this special memcpy. Richard. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC] propgation leap over memory copy for struct 2022-10-31 2:42 [RFC] propgation leap over memory copy for struct Jiufu Guo 2022-10-31 22:13 ` Jeff Law @ 2022-11-01 0:37 ` Segher Boessenkool 2022-11-01 3:01 ` Jiufu Guo 1 sibling, 1 reply; 14+ messages in thread From: Segher Boessenkool @ 2022-11-01 0:37 UTC (permalink / raw) To: Jiufu Guo; +Cc: gcc-patches, dje.gcc, linkw, rguenth, pinskia Hi! On Mon, Oct 31, 2022 at 10:42:35AM +0800, Jiufu Guo wrote: > #define FN 4 > typedef struct { double a[FN]; } A; > > A foo (const A *a) { return *a; } > A bar (const A a) { return a; } > /////// > > If FN<=2; the size of "A" fits into TImode, then this code can be optimized > (by subreg/cse/fwprop/cprop) as: > ------- > foo: > .LFB0: > .cfi_startproc > blr > > bar: > .LFB1: > .cfi_startproc > lfd 2,8(3) > lfd 1,0(3) > blr > -------- I think you swapped foo and bar here? > If the size of "A" is larger than any INT mode size, RTL insns would be > generated as: > 13: r125:V2DI=[r112:DI+0x20] > 14: r126:V2DI=[r112:DI+0x30] > 15: [r112:DI]=r125:V2DI > 16: [r112:DI+0x10]=r126:V2DI /// memcpy for assignment: D.3338 = arg; > 17: r127:DF=[r112:DI] > 18: r128:DF=[r112:DI+0x8] > 19: r129:DF=[r112:DI+0x10] > 20: r130:DF=[r112:DI+0x18] > ------------ > > I'm thinking about ways to improve this. > Metod1: One way may be changing the memory copy by referencing the type > of the struct if the size of struct is not too big. And generate insns > like the below: > 13: r125:DF=[r112:DI+0x20] > 15: r126:DF=[r112:DI+0x28] > 17: r127:DF=[r112:DI+0x30] > 19: r128:DF=[r112:DI+0x38] > 14: [r112:DI]=r125:DF > 16: [r112:DI+0x8]=r126:DF > 18: [r112:DI+0x10]=r127:DF > 20: [r112:DI+0x18]=r128:DF > 21: r129:DF=[r112:DI] > 22: r130:DF=[r112:DI+0x8] > 23: r131:DF=[r112:DI+0x10] > 24: r132:DF=[r112:DI+0x18] This is much worse though? The expansion with memcpy used V2DI, which typically is close to 2x faster than DFmode accesses. Or are you trying to avoid small reads of large stores here? 
Those aren't so bad, large reads of small stores is the nastiness we need to avoid. The code we have now does 15: [r112:DI]=r125:V2DI ... 17: r127:DF=[r112:DI] 18: r128:DF=[r112:DI+0x8] Can you make this optimised to not use a memory temporary at all, just immediately assign from r125 to r127 and r128? > Method2: One way may be enhancing CSE to make it able to treat one large > memory slot as two(or more) combined slots: > 13: r125:V2DI#0=[r112:DI+0x20] > 13': r125:V2DI#8=[r112:DI+0x28] > 15: [r112:DI]#0=r125:V2DI#0 > 15': [r112:DI]#8=r125:V2DI#8 > > This may seems more hack in CSE. The current CSE pass we have is the pass most in need of a full rewrite we have, since many many years. It does a lot of things, important things that we should not lose, but it does a pretty bad job of CSE. > Method3: For some record type, use "PARALLEL:BLK" instead "MEM:BLK". :BLK can never be optimised well. It always has to live in memory, by definition. Segher ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC] propgation leap over memory copy for struct 2022-11-01 0:37 ` Segher Boessenkool @ 2022-11-01 3:01 ` Jiufu Guo 0 siblings, 0 replies; 14+ messages in thread From: Jiufu Guo @ 2022-11-01 3:01 UTC (permalink / raw) To: Segher Boessenkool; +Cc: gcc-patches, dje.gcc, linkw, rguenth, pinskia Segher Boessenkool <segher@kernel.crashing.org> writes: > Hi! > > On Mon, Oct 31, 2022 at 10:42:35AM +0800, Jiufu Guo wrote: >> #define FN 4 >> typedef struct { double a[FN]; } A; >> >> A foo (const A *a) { return *a; } >> A bar (const A a) { return a; } >> /////// >> >> If FN<=2; the size of "A" fits into TImode, then this code can be optimized >> (by subreg/cse/fwprop/cprop) as: >> ------- >> foo: >> .LFB0: >> .cfi_startproc >> blr >> >> bar: >> .LFB1: >> .cfi_startproc >> lfd 2,8(3) >> lfd 1,0(3) >> blr >> -------- > > I think you swapped foo and bar here? Oh, thanks! > >> If the size of "A" is larger than any INT mode size, RTL insns would be >> generated as: >> 13: r125:V2DI=[r112:DI+0x20] >> 14: r126:V2DI=[r112:DI+0x30] >> 15: [r112:DI]=r125:V2DI >> 16: [r112:DI+0x10]=r126:V2DI /// memcpy for assignment: D.3338 = arg; >> 17: r127:DF=[r112:DI] >> 18: r128:DF=[r112:DI+0x8] >> 19: r129:DF=[r112:DI+0x10] >> 20: r130:DF=[r112:DI+0x18] >> ------------ >> >> I'm thinking about ways to improve this. >> Metod1: One way may be changing the memory copy by referencing the type >> of the struct if the size of struct is not too big. And generate insns >> like the below: >> 13: r125:DF=[r112:DI+0x20] >> 15: r126:DF=[r112:DI+0x28] >> 17: r127:DF=[r112:DI+0x30] >> 19: r128:DF=[r112:DI+0x38] >> 14: [r112:DI]=r125:DF >> 16: [r112:DI+0x8]=r126:DF >> 18: [r112:DI+0x10]=r127:DF >> 20: [r112:DI+0x18]=r128:DF >> 21: r129:DF=[r112:DI] >> 22: r130:DF=[r112:DI+0x8] >> 23: r131:DF=[r112:DI+0x10] >> 24: r132:DF=[r112:DI+0x18] > > This is much worse though? The expansion with memcpy used V2DI, which > typically is close to 2x faster than DFmode accesses. 
Using V2DI helps to access twice as many bytes at one time as DF/DI. But since those readings can be executed in parallel, using DF/DI may not be too bad. > > Or are you trying to avoid small reads of large stores here? Those > aren't so bad, large reads of small stores is the nastiness we need to > avoid. Here, I want to use 2 DF readings, because optimizations cse/fwprop/dse could eliminate those memory accesses of the same size. > > The code we have now does > > 15: [r112:DI]=r125:V2DI > ... > 17: r127:DF=[r112:DI] > 18: r128:DF=[r112:DI+0x8] > > Can you make this optimised to not use a memory temporary at all, just > immediately assign from r125 to r127 and r128? r125 is not directly assigned to r127/r128, since 'insn 15' and 'insn 17/18' are expanded for different gimple stmts: D.3331 = a; ==> 'insn 15' is generated for the struct assignment here. return D.3331; ==> 'insn 17/18' are prepared for the return registers. I'm trying to eliminate those memory temporaries, and did not find a good way yet. Method1-3 are the ideas which I'm trying to use to delete those temporaries. > >> Method2: One way may be enhancing CSE to make it able to treat one large >> memory slot as two(or more) combined slots: >> 13: r125:V2DI#0=[r112:DI+0x20] >> 13': r125:V2DI#8=[r112:DI+0x28] >> 15: [r112:DI]#0=r125:V2DI#0 >> 15': [r112:DI]#8=r125:V2DI#8 >> >> This may seems more hack in CSE. > > The current CSE pass we have is the pass most in need of a full rewrite > we have, since many many years. It does a lot of things, important > things that we should not lose, but it does a pretty bad job of CSE. > >> Method3: For some record type, use "PARALLEL:BLK" instead "MEM:BLK". > > :BLK can never be optimised well. It always has to live in memory, by > definition. Thanks for your suggestions! BR, Jeff (Jiufu) > > > Segher ^ permalink raw reply [flat|nested] 14+ messages in thread
Thread overview: 14+ messages -- 2022-10-31 2:42 [RFC] propgation leap over memory copy for struct Jiufu Guo 2022-10-31 22:13 ` Jeff Law 2022-11-01 0:49 ` Segher Boessenkool 2022-11-01 4:30 ` Jiufu Guo 2022-11-05 14:13 ` Richard Biener 2022-11-08 4:05 ` Jiufu Guo 2022-11-09 7:51 ` Jiufu Guo 2022-11-09 8:50 ` Richard Biener 2022-11-01 3:30 ` Jiufu Guo 2022-11-05 11:38 ` Richard Biener 2022-11-09 9:21 ` Jiufu Guo 2022-11-09 12:56 ` Richard Biener 2022-11-01 0:37 ` Segher Boessenkool 2022-11-01 3:01 ` Jiufu Guo