* Re: Inefficient code
@ 2018-07-06 10:18 Bernd Edlinger
2018-07-06 12:55 ` Paul Koning
0 siblings, 1 reply; 11+ messages in thread
From: Bernd Edlinger @ 2018-07-06 10:18 UTC (permalink / raw)
To: Paul Koning; +Cc: gcc
You can get much better code if you make xrci a bit field,
so that the entire bit-field region can be accessed word-wise:
#include <stdint.h>
struct Xrb
{
uint16_t xrlen; /* Length of I/O buffer in bytes */
uint16_t xrbc; /* Byte count for transfer */
void * xrloc; /* Pointer to I/O buffer */
uint8_t xrci:8; /* Channel number times 2 for transfer */
uint32_t xrblk:24; /* Random access block number */
uint16_t xrtime; /* Wait time for terminal input */
uint16_t xrmod; /* Modifiers */
};
void test(struct Xrb *XRB)
{
XRB->xrblk = 5;
}
Bernd.
* Re: Inefficient code
2018-07-06 10:18 Inefficient code Bernd Edlinger
@ 2018-07-06 12:55 ` Paul Koning
0 siblings, 0 replies; 11+ messages in thread
From: Paul Koning @ 2018-07-06 12:55 UTC (permalink / raw)
To: Bernd Edlinger; +Cc: gcc
> On Jul 6, 2018, at 6:18 AM, Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
>
> You can get much better code if you make xrci a bit field,
> so that the entire bit-field region can be accessed word-wise:
>
>
> #include <stdint.h>
>
> struct Xrb
> {
> uint16_t xrlen; /* Length of I/O buffer in bytes */
> uint16_t xrbc; /* Byte count for transfer */
> void * xrloc; /* Pointer to I/O buffer */
> uint8_t xrci:8; /* Channel number times 2 for transfer */
> uint32_t xrblk:24; /* Random access block number */
> uint16_t xrtime; /* Wait time for terminal input */
> uint16_t xrmod; /* Modifiers */
> };
>
> void test(struct Xrb *XRB)
> {
> XRB->xrblk = 5;
> }
>
>
> Bernd.
That helps with x86. It makes no difference with xstormy16, and it makes things slightly worse with pdp11 (though I can fiddle with the patterns to help with that; there's a zero_extend optimization I haven't coded yet).
On the other hand, since the two are equivalent it's reasonable to call this a missed optimization.
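A minimal sketch (not part of the original exchange) of one way to check that the two declarations really are equivalent on a given compiler. The names `XrbPlain`, `XrbBits`, and `layouts_match` are made up for illustration. Bit-field layout is implementation-defined, so the result only describes the compiler and target it is built with.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Original declaration: xrci is an ordinary byte member. */
struct XrbPlain {
    uint16_t xrlen;
    uint16_t xrbc;
    void    *xrloc;
    uint8_t  xrci;
    uint32_t xrblk:24;
    uint16_t xrtime;
    uint16_t xrmod;
};

/* Variant from this thread: xrci declared as an 8-bit bit field. */
struct XrbBits {
    uint16_t xrlen;
    uint16_t xrbc;
    void    *xrloc;
    uint8_t  xrci:8;
    uint32_t xrblk:24;
    uint16_t xrtime;
    uint16_t xrmod;
};

/* Returns 1 if both declarations have the same size and store xrci and
   xrblk into the same bytes, 0 otherwise.  Only the region between the
   end of xrloc and the start of xrtime is compared, since offsetof
   cannot be applied to a bit field. */
int layouts_match(void)
{
    struct XrbPlain p;
    struct XrbBits b;
    size_t start = offsetof(struct XrbPlain, xrloc) + sizeof(void *);
    size_t end = offsetof(struct XrbPlain, xrtime);

    if (sizeof p != sizeof b
        || start != offsetof(struct XrbBits, xrloc) + sizeof(void *)
        || end != offsetof(struct XrbBits, xrtime))
        return 0;

    memset(&p, 0, sizeof p);
    memset(&b, 0, sizeof b);
    p.xrci = 0x12;
    p.xrblk = 0x345678;
    b.xrci = 0x12;
    b.xrblk = 0x345678;

    return memcmp((unsigned char *)&p + start,
                  (unsigned char *)&b + start,
                  end - start) == 0;
}
```

If `layouts_match()` returns 1, any difference in the generated stores comes from how the compiler expands the access, not from the data layout.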
paul
* Re: Inefficient code
2018-07-06 1:04 ` Paul Koning
@ 2018-07-06 6:54 ` Eric Botcazou
0 siblings, 0 replies; 11+ messages in thread
From: Eric Botcazou @ 2018-07-06 6:54 UTC (permalink / raw)
To: Paul Koning; +Cc: gcc
> Xstormy does 3 mov.b also. For that matter, so does the x86 target (both
> -m32 and -m64). Hm.
Indeed, even at -Os, so this may be a generic issue.
--
Eric Botcazou
* Re: Inefficient code
2018-07-06 1:01 ` Paul Koning
@ 2018-07-06 1:04 ` Paul Koning
2018-07-06 6:54 ` Eric Botcazou
0 siblings, 1 reply; 11+ messages in thread
From: Paul Koning @ 2018-07-06 1:04 UTC (permalink / raw)
To: Eric Botcazou; +Cc: gcc
> On Jul 5, 2018, at 9:01 PM, Paul Koning <paulkoning@comcast.net> wrote:
>
>
>
>> On Jul 5, 2018, at 6:47 PM, Eric Botcazou <ebotcazou@adacore.com> wrote:
>>
>>> So back to the previous one: anything I can do about a 24 bit field getting
>>> split into three movqi rather than a movqi plus a movhi? That happens
>>> during RTL expand, I believe.
>>
>> Yes, this one doesn't look as hopeless as the store merging issue. A way of
>> tackling it would be to do a side-by-side debugging of a compiler built for a
>> similar target for which only 2 stores are generated.
>
> I'll try xstormy16 since that's also 16 bit words, strict alignment.
>
> Then again, I fed the code to GCC for VAX and it also produces a sequence of 3 separate byte stores. No mixed endians there.
Xstormy does 3 mov.b also. For that matter, so does the x86 target (both -m32 and -m64). Hm.
paul
* Re: Inefficient code
2018-07-05 22:47 ` Eric Botcazou
@ 2018-07-06 1:01 ` Paul Koning
2018-07-06 1:04 ` Paul Koning
0 siblings, 1 reply; 11+ messages in thread
From: Paul Koning @ 2018-07-06 1:01 UTC (permalink / raw)
To: Eric Botcazou; +Cc: gcc
> On Jul 5, 2018, at 6:47 PM, Eric Botcazou <ebotcazou@adacore.com> wrote:
>
>> So back to the previous one: anything I can do about a 24 bit field getting
>> split into three movqi rather than a movqi plus a movhi? That happens
>> during RTL expand, I believe.
>
> Yes, this one doesn't look as hopeless as the store merging issue. A way of
> tackling it would be to do a side-by-side debugging of a compiler built for a
> similar target for which only 2 stores are generated.
I'll try xstormy16 since that's also 16 bit words, strict alignment.
Then again, I fed the code to GCC for VAX and it also produces a sequence of 3 separate byte stores. No mixed endians there.
paul
* Re: Inefficient code
2018-07-05 20:53 ` Paul Koning
@ 2018-07-05 22:47 ` Eric Botcazou
2018-07-06 1:01 ` Paul Koning
0 siblings, 1 reply; 11+ messages in thread
From: Eric Botcazou @ 2018-07-05 22:47 UTC (permalink / raw)
To: Paul Koning; +Cc: gcc
> So back to the previous one: anything I can do about a 24 bit field getting
> split into three movqi rather than a movqi plus a movhi? That happens
> during RTL expand, I believe.
Yes, this one doesn't look as hopeless as the store merging issue. A way of
tackling it would be to do a side-by-side debugging of a compiler built for a
similar target for which only 2 stores are generated.
--
Eric Botcazou
* Re: Inefficient code
2018-07-05 20:44 ` Eric Botcazou
@ 2018-07-05 20:53 ` Paul Koning
2018-07-05 22:47 ` Eric Botcazou
0 siblings, 1 reply; 11+ messages in thread
From: Paul Koning @ 2018-07-05 20:53 UTC (permalink / raw)
To: Eric Botcazou; +Cc: GCC Mailing List
> On Jul 5, 2018, at 4:44 PM, Eric Botcazou <ebotcazou@adacore.com> wrote:
>
> ...
> The GIMPLE pass responsible for the optimization simply punts for the "funny-
> endian ordering" of the PDP11. More generally, you shouldn't expect anything
> sparkling for such a peculiar architecture as the PDP11.
Ok. Yet another item for the machine specific optimization pass (to be written).
So back to the previous one: anything I can do about a 24 bit field getting split into three movqi rather than a movqi plus a movhi? That happens during RTL expand, I believe.
paul
* Re: Inefficient code
2018-07-05 16:29 ` Paul Koning
@ 2018-07-05 20:44 ` Eric Botcazou
2018-07-05 20:53 ` Paul Koning
0 siblings, 1 reply; 11+ messages in thread
From: Eric Botcazou @ 2018-07-05 20:44 UTC (permalink / raw)
To: Paul Koning; +Cc: gcc, Segher Boessenkool
> I just constructed another test case that shows the same issue more
> blatantly:
>
> struct s
> {
> char a;
> char b;
> char c;
> char d;
> int e;
> int f;
> char h;
> char i;
> };
No, it's not the same issue.
> struct s ts;
>
> void setts(void)
> {
> ts.a=2;
> ts.b=4;
> ts.c=1;
> ts.d=24;
> ts.e=5;
> ts.f=42;
> ts.h=9;
> ts.i=3;
> }
>
> Each of the fields is written separately, even though clearly the adjacent
> byte writes can and should be combined into a single HImode move. This
> happens both with -O2 and -Os.
The GIMPLE pass responsible for the optimization simply punts for the "funny-
endian ordering" of the PDP11. More generally, you shouldn't expect anything
sparkling for such a peculiar architecture as the PDP11.
--
Eric Botcazou
* Re: Inefficient code
2018-07-05 16:01 ` Segher Boessenkool
@ 2018-07-05 16:29 ` Paul Koning
2018-07-05 20:44 ` Eric Botcazou
0 siblings, 1 reply; 11+ messages in thread
From: Paul Koning @ 2018-07-05 16:29 UTC (permalink / raw)
To: Segher Boessenkool; +Cc: GCC Development
> On Jul 5, 2018, at 12:01 PM, Segher Boessenkool <segher@kernel.crashing.org> wrote:
>
> On Thu, Jul 05, 2018 at 08:45:30AM -0400, Paul Koning wrote:
>> I have a struct that looks like this:
>>
>> struct Xrb
>> {
>> uint16_t xrlen; /* Length of I/O buffer in bytes */
>> uint16_t xrbc; /* Byte count for transfer */
>> void * xrloc; /* Pointer to I/O buffer */
>> uint8_t xrci; /* Channel number times 2 for transfer */
>> uint32_t xrblk:24; /* Random access block number */
>> uint16_t xrtime; /* Wait time for terminal input */
>> uint16_t xrmod; /* Modifiers */
>> };
>>
>> When I write to xrblk (that 24 bit field) on my 16 bit target, I get unexpectedly inefficient output:
>>
>> XRB->xrblk = 5;
>>
>> movb #5,10(r0)
>> clrb 11(r0)
>> clrb 7(r0)
>
> (7? not 12?)
Octal offsets. It's writing the 3 bytes in LSB to MSB order. (PDP11 -- which has funny-endian ordering.)
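A small sketch (added for illustration, not from the original message) of how the listing's offsets read once they are taken as octal. `octal_offset` is a hypothetical helper. The interpretation in the usage note assumes 16-bit pointers, which puts xrci at byte 6 and leaves bytes 7 through 9 for xrblk.

```c
#include <stdlib.h>

/* Convert an offset printed in octal, as in the PDP-11 listing,
   into its numeric value. */
unsigned long octal_offset(const char *text)
{
    return strtoul(text, NULL, 8);
}
```

With this, `octal_offset("10")` is 8, `octal_offset("11")` is 9, and `octal_offset("7")` is 7, so the three stores touch bytes 8, 9, and 7 in that order.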
>> rather than the expected word write to the word-aligned lower half of that field.
>>
>> Looking at the dumps, I see it coming into the RTL expand phase as a single write, which expand then turns into the three insns corresponding to the above. But (of course) there is a word (HImode) move also, which has the same cost as the byte one.
>>
>> Is there something I have to do in my target definition to get this to come out right? This is a strict_alignment target, but alignment is satisfied in this example. Also, SLOW_BYTE_ACCESS is 1.
>
> What is your MOVE_MAX? It should be 2 probably.
It is.
I just constructed another test case that shows the same issue more blatantly:
struct s
{
char a;
char b;
char c;
char d;
int e;
int f;
char h;
char i;
};
struct s ts;
void setts(void)
{
ts.a=2;
ts.b=4;
ts.c=1;
ts.d=24;
ts.e=5;
ts.f=42;
ts.h=9;
ts.i=3;
}
Each of the fields is written separately, even though clearly the adjacent byte writes can and should be combined into a single HImode move. This happens both with -O2 and -Os.
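A rough sketch (illustrative only, not compiler output) of the merge being asked for, using just the first two fields. `struct pair` and the function names are hypothetical. The word store assumes the low byte of a 16-bit word sits at the lower address, which holds for a PDP-11 word and for x86 hosts.

```c
#include <stdint.h>
#include <string.h>

struct pair {
    char a;
    char b;
};

/* Two separate byte stores, as currently emitted. */
void set_bytewise(struct pair *p)
{
    p->a = 2;
    p->b = 4;
}

/* One 16-bit store with the same effect.
   Assumes the low byte of the word lands at the lower address. */
void set_wordwise(struct pair *p)
{
    uint16_t word = (uint16_t)(2u | (4u << 8));
    memcpy(p, &word, sizeof word);
}

/* Returns 1 if both variants leave identical bytes in memory. */
int same_result(void)
{
    struct pair x, y;
    set_bytewise(&x);
    set_wordwise(&y);
    return memcmp(&x, &y, sizeof x) == 0;
}
```

The constant for the merged store has to be assembled in the target's byte order, so a pass performing this merge needs to know the byte layout within a word.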
paul
* Re: Inefficient code
2018-07-05 12:46 Paul Koning
@ 2018-07-05 16:01 ` Segher Boessenkool
2018-07-05 16:29 ` Paul Koning
0 siblings, 1 reply; 11+ messages in thread
From: Segher Boessenkool @ 2018-07-05 16:01 UTC (permalink / raw)
To: Paul Koning; +Cc: GCC Development
On Thu, Jul 05, 2018 at 08:45:30AM -0400, Paul Koning wrote:
> I have a struct that looks like this:
>
> struct Xrb
> {
> uint16_t xrlen; /* Length of I/O buffer in bytes */
> uint16_t xrbc; /* Byte count for transfer */
> void * xrloc; /* Pointer to I/O buffer */
> uint8_t xrci; /* Channel number times 2 for transfer */
> uint32_t xrblk:24; /* Random access block number */
> uint16_t xrtime; /* Wait time for terminal input */
> uint16_t xrmod; /* Modifiers */
> };
>
> When I write to xrblk (that 24 bit field) on my 16 bit target, I get unexpectedly inefficient output:
>
> XRB->xrblk = 5;
>
> movb #5,10(r0)
> clrb 11(r0)
> clrb 7(r0)
(7? not 12?)
> rather than the expected word write to the word-aligned lower half of that field.
>
> Looking at the dumps, I see it coming into the RTL expand phase as a single write, which expand then turns into the three insns corresponding to the above. But (of course) there is a word (HImode) move also, which has the same cost as the byte one.
>
> Is there something I have to do in my target definition to get this to come out right? This is a strict_alignment target, but alignment is satisfied in this example. Also, SLOW_BYTE_ACCESS is 1.
What is your MOVE_MAX? It should be 2 probably.
Segher
* Inefficient code
@ 2018-07-05 12:46 Paul Koning
2018-07-05 16:01 ` Segher Boessenkool
0 siblings, 1 reply; 11+ messages in thread
From: Paul Koning @ 2018-07-05 12:46 UTC (permalink / raw)
To: GCC Development
I have a struct that looks like this:
struct Xrb
{
uint16_t xrlen; /* Length of I/O buffer in bytes */
uint16_t xrbc; /* Byte count for transfer */
void * xrloc; /* Pointer to I/O buffer */
uint8_t xrci; /* Channel number times 2 for transfer */
uint32_t xrblk:24; /* Random access block number */
uint16_t xrtime; /* Wait time for terminal input */
uint16_t xrmod; /* Modifiers */
};
When I write to xrblk (that 24 bit field) on my 16 bit target, I get unexpectedly inefficient output:
XRB->xrblk = 5;
movb #5,10(r0)
clrb 11(r0)
clrb 7(r0)
rather than the expected word write to the word-aligned lower half of that field.
Looking at the dumps, I see it coming into the RTL expand phase as a single write, which expand then turns into the three insns corresponding to the above. But (of course) there is a word (HImode) move also, which has the same cost as the byte one.
Is there something I have to do in my target definition to get this to come out right? This is a strict_alignment target, but alignment is satisfied in this example. Also, SLOW_BYTE_ACCESS is 1.
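For illustration, a sketch in C (not GCC output) of the two store sequences being compared. `unit` is a hypothetical 4-byte buffer standing in for struct bytes 6 through 9, reading the offsets in the listing as octal: low byte of xrblk at 8, middle byte at 9, high byte at 7. The word store assumes the host keeps the low byte of a 16-bit word at the lower address.

```c
#include <stdint.h>
#include <string.h>

/* unit[0..3] stands in for struct bytes 6..9:
   unit[0] is xrci, unit[1..3] hold the 24-bit xrblk. */

/* Three byte stores, mirroring the movb/clrb/clrb sequence. */
void store24_bytewise(unsigned char *unit, uint32_t value)
{
    unit[2] = (unsigned char)(value & 0xff);          /* low byte, offset 8 */
    unit[3] = (unsigned char)((value >> 8) & 0xff);   /* middle byte, offset 9 */
    unit[1] = (unsigned char)((value >> 16) & 0xff);  /* high byte, offset 7 */
}

/* One 16-bit store to the aligned low half plus one byte store. */
void store24_word_plus_byte(unsigned char *unit, uint32_t value)
{
    uint16_t low = (uint16_t)(value & 0xffff);
    memcpy(unit + 2, &low, sizeof low);   /* assumes a little-endian host */
    unit[1] = (unsigned char)((value >> 16) & 0xff);
}
```

Both functions leave the same bytes in `unit` and do not touch `unit[0]`, which is the equivalence the expected word write relies on.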
paul