public inbox for gcc@gcc.gnu.org
* Riscv code generation
@ 2023-10-23 14:14 Jacob Navia
From: Jacob Navia @ 2023-10-23 14:14 UTC (permalink / raw)
  To: gcc


Hi
In a previous post I pointed out some strange code generation by gcc for the riscv-64 target.
To summarize:
	Suppose a 64-bit operation: c = a OP b;
Gcc does the following:
	Instead of loading 64 bits from memory in a single instruction, gcc loads the 8 bytes of each operand into 8 separate registers. Then it ORs those 8 bytes together into a single 64-bit value and performs the 64-bit operation. Lastly, it splits the 64-bit result back into 8 bytes in 8 different registers and stores those 8 bytes one after the other.

When I saw this I was surprised that this utterly bloated code ran faster than a hastily written assembly program I did in 10 minutes. Obviously I didn’t take any pipeline turbulence into account, so my program was slower. When I did take pipeline turbulence into account, I managed to write a program that runs several times faster than the bloated code.

For the example above, the straightforward code would be:
1) Load 64 bits into a register for each operand (2 loads)
2) Do the operation
3) Store the result

That is 4 instructions in total (2 loads, 1 operation, 1 store), compared with roughly 46 instructions the « gcc way » (16 byte loads, 14 × 2 shift-and-OR operations to assemble the two operands, 8 shifts to split the result, and 8 byte stores).
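To make the comparison concrete, here is a C-level sketch of what the byte-wise expansion amounts to for a single 64-bit load (the function names are mine and a little-endian layout is assumed; this is only an illustration, not what gcc literally emits):

#include <stdint.h>

/* The straightforward way: one 64-bit load. */
static inline uint64_t load64_direct(const uint64_t *p)
{
	return *p;
}

/* What the observed code is effectively doing: 8 byte loads,
   then shifts and ORs to glue the bytes back into one word. */
static inline uint64_t load64_bytewise(const unsigned char *p)
{
	uint64_t v = 0;
	for (int i = 0; i < 8; i++)
		v |= (uint64_t)p[i] << (8 * i);	/* little-endian reassembly */
	return v;
}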

I think this is a BUG, but I am not completely sure it is one, and I do not have a clue WHY gcc does this.

Is anyone here working on the riscv backend? This happens only with -O3, by the way.

Sample code:

#define ACCUM_LENGTH 9
#define WORDSIZE 64

/* QELT is not defined in the snippet; an unsigned 64-bit word type is assumed. */
typedef unsigned long long QELT;

typedef struct {
   int sign, exponent;
   /* The code below indexes mantissa[ACCUM_LENGTH], so one extra word is assumed. */
   QELT mantissa[ACCUM_LENGTH + 1];
} QfloatAccum, *QfloatAccump;

void shup1(QfloatAccump x)
{
	QELT newbits, bits;
	int i;

	/* Shift the whole multi-word mantissa left by one bit,
	   carrying the top bit of each word into the adjacent word. */
	bits = x->mantissa[ACCUM_LENGTH] >> (WORDSIZE - 1);
	x->mantissa[ACCUM_LENGTH] <<= 1;
	for (i = ACCUM_LENGTH - 1; i > 0; i--) {
		newbits = x->mantissa[i] >> (WORDSIZE - 1);
		x->mantissa[i] <<= 1;
		x->mantissa[i] |= bits;
		bits = newbits;
	}
	x->mantissa[0] <<= 1;
	x->mantissa[0] |= bits;
}
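For completeness, a minimal driver that exercises shup1 once (the values are arbitrary, and the cross-compiler name is only an example); compiling the file with riscv64-unknown-linux-gnu-gcc -O3 -S should show the generated code:

#include <stdio.h>

int main(void)
{
	/* Arbitrary test data; index ACCUM_LENGTH is the extra guard word. */
	QfloatAccum a = { 0, 0, { 1, 2, 3, 4, 5, 6, 7, 8, 9, 0x8000000000000000ULL } };

	shup1(&a);
	for (int i = 0; i <= ACCUM_LENGTH; i++)
		printf("mantissa[%d] = %llx\n", i, (unsigned long long)a.mantissa[i]);
	return 0;
}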

Please point me to the right person. Thanks




* Re: Riscv code generation
  2023-10-23 14:14 Riscv code generation Jacob Navia
@ 2023-10-24  9:43 ` Benny Lyne Amorsen
From: Benny Lyne Amorsen @ 2023-10-24  9:43 UTC (permalink / raw)
  To: gcc

Jacob Navia via Gcc <gcc@gcc.gnu.org> writes:

> That is 4 instructions in total (2 loads, 1 operation, 1 store),
> compared with roughly 46 instructions the « gcc way » (16 byte loads,
> 14 × 2 shift-and-OR operations to assemble the two operands, 8 shifts
> to split the result, and 8 byte stores).

The sample code seems to have a couple of errors; I fixed it up and put
it on godbolt: https://godbolt.org/z/obbr7K7dx

Let me know if the fixups were wrong. The issue should probably be
reported on Bugzilla as a missed-optimization bug.


/Benny



