public inbox for gcc@gcc.gnu.org
* Re: Inefficient code
@ 2018-07-06 10:18 Bernd Edlinger
  2018-07-06 12:55 ` Paul Koning
  0 siblings, 1 reply; 11+ messages in thread
From: Bernd Edlinger @ 2018-07-06 10:18 UTC (permalink / raw)
  To: Paul Koning; +Cc: gcc

You can get much better code if you make xrci a bit field, so the
entire bit-field region can be accessed word-wise:


#include <stdint.h>

struct Xrb
{
    uint16_t xrlen;             /* Length of I/O buffer in bytes */
    uint16_t xrbc;              /* Byte count for transfer */
    void * xrloc;               /* Pointer to I/O buffer */
    uint8_t xrci:8;             /* Channel number times 2 for transfer */
    uint32_t xrblk:24;  /* Random access block number */
    uint16_t xrtime;    /* Wait time for terminal input */
    uint16_t xrmod;             /* Modifiers */
};

void test(struct Xrb *XRB)
{
    XRB->xrblk = 5;
}


Bernd.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Inefficient code
  2018-07-06 10:18 Inefficient code Bernd Edlinger
@ 2018-07-06 12:55 ` Paul Koning
  0 siblings, 0 replies; 11+ messages in thread
From: Paul Koning @ 2018-07-06 12:55 UTC (permalink / raw)
  To: Bernd Edlinger; +Cc: gcc



> On Jul 6, 2018, at 6:18 AM, Bernd Edlinger <bernd.edlinger@hotmail.de> wrote:
> 
> You can get much better code if you make xrci a bit field, so the
> entire bit-field region can be accessed word-wise:
> 
> 
> #include <stdint.h>
> 
> struct Xrb
> {
>    uint16_t xrlen;             /* Length of I/O buffer in bytes */
>    uint16_t xrbc;              /* Byte count for transfer */
>    void * xrloc;               /* Pointer to I/O buffer */
>    uint8_t xrci:8;             /* Channel number times 2 for transfer */
>    uint32_t xrblk:24;  /* Random access block number */
>    uint16_t xrtime;    /* Wait time for terminal input */
>    uint16_t xrmod;             /* Modifiers */
> };
> 
> void test(struct Xrb *XRB)
> {
>    XRB->xrblk = 5;
> }
> 
> 
> Bernd.

That helps with x86.  It makes no difference with xstormy16, and it makes things slightly worse with pdp11 (though I can fiddle with the patterns to help with that; there's a zero_extend optimization I haven't coded yet).

On the other hand, since the two are equivalent it's reasonable to call this a missed optimization.

	paul


* Re: Inefficient code
  2018-07-06  1:04             ` Paul Koning
@ 2018-07-06  6:54               ` Eric Botcazou
  0 siblings, 0 replies; 11+ messages in thread
From: Eric Botcazou @ 2018-07-06  6:54 UTC (permalink / raw)
  To: Paul Koning; +Cc: gcc

> Xstormy does 3 mov.b also.  For that matter, so does the x86 target (both
> -m32 and -m64).  Hm.

Indeed, even at -Os, so this may be a generic issue.

-- 
Eric Botcazou


* Re: Inefficient code
  2018-07-06  1:01           ` Paul Koning
@ 2018-07-06  1:04             ` Paul Koning
  2018-07-06  6:54               ` Eric Botcazou
  0 siblings, 1 reply; 11+ messages in thread
From: Paul Koning @ 2018-07-06  1:04 UTC (permalink / raw)
  To: Eric Botcazou; +Cc: gcc



> On Jul 5, 2018, at 9:01 PM, Paul Koning <paulkoning@comcast.net> wrote:
> 
> 
> 
>> On Jul 5, 2018, at 6:47 PM, Eric Botcazou <ebotcazou@adacore.com> wrote:
>> 
>>> So back to the previous one: anything I can do about a 24 bit field getting
>>> split into three movqi rather than a movqi plus a movhi?  That happens
>>> during RTL expand, I believe.
>> 
>> Yes, this one doesn't look as hopeless as the store merging issue.  A way of 
>> tackling it would be to do a side-by-side debugging of a compiler built for a 
>> similar target for which only 2 stores are generated.
> 
> I'll try xstormy16 since that's also 16 bit words, strict alignment.
> 
> Then again, I fed the code to GCC for VAX and it also produces a sequence of 3 separate byte stores.  No mixed endians there.

Xstormy does 3 mov.b also.  For that matter, so does the x86 target (both -m32 and -m64).  Hm.

	paul



* Re: Inefficient code
  2018-07-05 22:47         ` Eric Botcazou
@ 2018-07-06  1:01           ` Paul Koning
  2018-07-06  1:04             ` Paul Koning
  0 siblings, 1 reply; 11+ messages in thread
From: Paul Koning @ 2018-07-06  1:01 UTC (permalink / raw)
  To: Eric Botcazou; +Cc: gcc



> On Jul 5, 2018, at 6:47 PM, Eric Botcazou <ebotcazou@adacore.com> wrote:
> 
>> So back to the previous one: anything I can do about a 24 bit field getting
>> split into three movqi rather than a movqi plus a movhi?  That happens
>> during RTL expand, I believe.
> 
> Yes, this one doesn't look as hopeless as the store merging issue.  A way of 
> tackling it would be to do a side-by-side debugging of a compiler built for a 
> similar target for which only 2 stores are generated.

I'll try xstormy16 since that's also 16 bit words, strict alignment.

Then again, I fed the code to GCC for VAX and it also produces a sequence of 3 separate byte stores.  No mixed endians there.

	paul


* Re: Inefficient code
  2018-07-05 20:53       ` Paul Koning
@ 2018-07-05 22:47         ` Eric Botcazou
  2018-07-06  1:01           ` Paul Koning
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Botcazou @ 2018-07-05 22:47 UTC (permalink / raw)
  To: Paul Koning; +Cc: gcc

> So back to the previous one: anything I can do about a 24 bit field getting
> split into three movqi rather than a movqi plus a movhi?  That happens
> during RTL expand, I believe.

Yes, this one doesn't look as hopeless as the store merging issue.  A way of 
tackling it would be to do a side-by-side debugging of a compiler built for a 
similar target for which only 2 stores are generated.

-- 
Eric Botcazou


* Re: Inefficient code
  2018-07-05 20:44     ` Eric Botcazou
@ 2018-07-05 20:53       ` Paul Koning
  2018-07-05 22:47         ` Eric Botcazou
  0 siblings, 1 reply; 11+ messages in thread
From: Paul Koning @ 2018-07-05 20:53 UTC (permalink / raw)
  To: Eric Botcazou; +Cc: GCC Mailing List



> On Jul 5, 2018, at 4:44 PM, Eric Botcazou <ebotcazou@adacore.com> wrote:
> 
> ...
> The GIMPLE pass responsible for the optimization simply punts for the "funny-
> endian ordering" of the PDP11.  More generally, you shouldn't expect anything 
> sparkling for such a peculiar architecture as the PDP11.

Ok.  Yet another item for the machine specific optimization pass (to be written).

So back to the previous one: anything I can do about a 24 bit field getting split into three movqi rather than a movqi plus a movhi?  That happens during RTL expand, I believe.

	paul


* Re: Inefficient code
  2018-07-05 16:29   ` Paul Koning
@ 2018-07-05 20:44     ` Eric Botcazou
  2018-07-05 20:53       ` Paul Koning
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Botcazou @ 2018-07-05 20:44 UTC (permalink / raw)
  To: Paul Koning; +Cc: gcc, Segher Boessenkool

> I just constructed another test case that shows the same issue more
> blatantly:
> 
> struct s
> {
>     char a;
>     char b;
>     char c;
>     char d;
>     int e;
>     int f;
>     char h;
>     char i;
> };

No, it's not the same issue.

> struct s ts;
> 
> void setts(void)
> {
>     ts.a=2;
>     ts.b=4;
>     ts.c=1;
>     ts.d=24;
>     ts.e=5;
>     ts.f=42;
>     ts.h=9;
>     ts.i=3;
> }
> 
> Each of the fields is written separately, even though clearly the adjacent
> byte writes can and should be combined into a single HImode move.  This
> happens both with -O2 and -Os.

The GIMPLE pass responsible for the optimization simply punts for the "funny-
endian ordering" of the PDP11.  More generally, you shouldn't expect anything 
sparkling for such a peculiar architecture as the PDP11.

-- 
Eric Botcazou


* Re: Inefficient code
  2018-07-05 16:01 ` Segher Boessenkool
@ 2018-07-05 16:29   ` Paul Koning
  2018-07-05 20:44     ` Eric Botcazou
  0 siblings, 1 reply; 11+ messages in thread
From: Paul Koning @ 2018-07-05 16:29 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: GCC Development



> On Jul 5, 2018, at 12:01 PM, Segher Boessenkool <segher@kernel.crashing.org> wrote:
> 
> On Thu, Jul 05, 2018 at 08:45:30AM -0400, Paul Koning wrote:
>> I have a struct that looks like this:
>> 
>> struct Xrb
>> {
>>    uint16_t xrlen;		/* Length of I/O buffer in bytes */
>>    uint16_t xrbc;		/* Byte count for transfer */
>>    void * xrloc;		/* Pointer to I/O buffer */
>>    uint8_t xrci;		/* Channel number times 2 for transfer */
>>    uint32_t xrblk:24;	/* Random access block number */
>>    uint16_t xrtime;	/* Wait time for terminal input */
>>    uint16_t xrmod;		/* Modifiers */
>> };
>> 
>> When I write to xrblk (that 24-bit field) on my 16-bit target, I get unexpectedly inefficient output:
>> 
>>    XRB->xrblk = 5;
>> 
>> 	movb	#5,10(r0)
>> 	clrb	11(r0)
>> 	clrb	7(r0)
> 
> (7? not 12?)

Octal offsets.  It's writing the 3 bytes in LSB to MSB order.  (PDP11 -- which has funny-endian ordering.)

> rather than the expected word write to the word-aligned lower half of that field.
>> 
>> Looking at the dumps, I see it coming into the RTL expand phase as a single write, which expand then turns into the three insns corresponding to the above.  But (of course) there is a word (HImode) move also, which has the same cost as the byte one.
>> 
>> Is there something I have to do in my target definition to get this to come out right?  This is a strict_alignment target, but alignment is satisfied in this example.  Also, SLOW_BYTE_ACCESS is 1.
> 
> What is your MOVE_MAX?  It should be 2 probably.

It is. 

I just constructed another test case that shows the same issue more blatantly:

struct s
{
    char a;
    char b;
    char c;
    char d;
    int e;
    int f;
    char h;
    char i;
};

struct s ts;

void setts(void)
{
    ts.a=2;
    ts.b=4;
    ts.c=1;
    ts.d=24;
    ts.e=5;
    ts.f=42;
    ts.h=9;
    ts.i=3;
}

Each of the fields is written separately, even though clearly the adjacent byte writes can and should be combined into a single HImode move.  This happens both with -O2 and -Os.

	paul


* Re: Inefficient code
  2018-07-05 12:46 Paul Koning
@ 2018-07-05 16:01 ` Segher Boessenkool
  2018-07-05 16:29   ` Paul Koning
  0 siblings, 1 reply; 11+ messages in thread
From: Segher Boessenkool @ 2018-07-05 16:01 UTC (permalink / raw)
  To: Paul Koning; +Cc: GCC Development

On Thu, Jul 05, 2018 at 08:45:30AM -0400, Paul Koning wrote:
> I have a struct that looks like this:
> 
> struct Xrb
> {
>     uint16_t xrlen;		/* Length of I/O buffer in bytes */
>     uint16_t xrbc;		/* Byte count for transfer */
>     void * xrloc;		/* Pointer to I/O buffer */
>     uint8_t xrci;		/* Channel number times 2 for transfer */
>     uint32_t xrblk:24;	/* Random access block number */
>     uint16_t xrtime;	/* Wait time for terminal input */
>     uint16_t xrmod;		/* Modifiers */
> };
> 
> When I write to xrblk (that 24-bit field) on my 16-bit target, I get unexpectedly inefficient output:
> 
>     XRB->xrblk = 5;
> 
> 	movb	#5,10(r0)
> 	clrb	11(r0)
> 	clrb	7(r0)

(7? not 12?)

> rather than the expected word write to the word-aligned lower half of that field.
> 
> Looking at the dumps, I see it coming into the RTL expand phase as a single write, which expand then turns into the three insns corresponding to the above.  But (of course) there is a word (HImode) move also, which has the same cost as the byte one.
> 
> Is there something I have to do in my target definition to get this to come out right?  This is a strict_alignment target, but alignment is satisfied in this example.  Also, SLOW_BYTE_ACCESS is 1.

What is your MOVE_MAX?  It should be 2 probably.


Segher


* Inefficient code
@ 2018-07-05 12:46 Paul Koning
  2018-07-05 16:01 ` Segher Boessenkool
  0 siblings, 1 reply; 11+ messages in thread
From: Paul Koning @ 2018-07-05 12:46 UTC (permalink / raw)
  To: GCC Development

I have a struct that looks like this:

struct Xrb
{
    uint16_t xrlen;		/* Length of I/O buffer in bytes */
    uint16_t xrbc;		/* Byte count for transfer */
    void * xrloc;		/* Pointer to I/O buffer */
    uint8_t xrci;		/* Channel number times 2 for transfer */
    uint32_t xrblk:24;	/* Random access block number */
    uint16_t xrtime;	/* Wait time for terminal input */
    uint16_t xrmod;		/* Modifiers */
};

When I write to xrblk (that 24-bit field) on my 16-bit target, I get unexpectedly inefficient output:

    XRB->xrblk = 5;

	movb	#5,10(r0)
	clrb	11(r0)
	clrb	7(r0)

rather than the expected word write to the word-aligned lower half of that field.

Looking at the dumps, I see it coming into the RTL expand phase as a single write, which expand then turns into the three insns corresponding to the above.  But (of course) there is a word (HImode) move also, which has the same cost as the byte one.

Is there something I have to do in my target definition to get this to come out right?  This is a strict_alignment target, but alignment is satisfied in this example.  Also, SLOW_BYTE_ACCESS is 1.

	paul



Thread overview: 11+ messages
2018-07-06 10:18 Inefficient code Bernd Edlinger
2018-07-06 12:55 ` Paul Koning
  -- strict thread matches above, loose matches on Subject: below --
2018-07-05 12:46 Paul Koning
2018-07-05 16:01 ` Segher Boessenkool
2018-07-05 16:29   ` Paul Koning
2018-07-05 20:44     ` Eric Botcazou
2018-07-05 20:53       ` Paul Koning
2018-07-05 22:47         ` Eric Botcazou
2018-07-06  1:01           ` Paul Koning
2018-07-06  1:04             ` Paul Koning
2018-07-06  6:54               ` Eric Botcazou
