* Riscv code generation
@ 2023-10-23 14:14 Jacob Navia
2023-10-24 9:43 ` Benny Lyne Amorsen
0 siblings, 1 reply; 2+ messages in thread
From: Jacob Navia @ 2023-10-23 14:14 UTC (permalink / raw)
To: gcc
Hi
In a previous post I pointed out strange code generation by gcc for the riscv-64 target.
To summarize:
Suppose a 64 bit operation: c = a OP b;
Gcc does the following:
Instead of loading 64 bits from memory, gcc loads 8 separate bytes into 8 separate registers for each operand. Then it ORs the 8 bytes together into a single 64-bit number. Then it executes the 64-bit operation. And lastly, it splits the 64-bit result back into 8 bytes in 8 different registers, and stores these 8 bytes one after the other.
When I saw this I was impressed that this utterly bloated code ran faster than a hastily written assembly program I did in 10 minutes. Obviously I didn’t take any pipeline turbulence into account, and my program was slower. When I did take pipeline turbulence into account, I managed to write a program that runs several times faster than the bloated code.
You realize that for the example above, instead of
1) Load each operand into a register (2 loads)
2) Do the operation
3) Store the result
we have 2 loads + 1 operation + 1 store: 4 instructions, compared to 46 for the « gcc way » (16 byte loads, 14 ORs to assemble the two operands, 8 shifts to split the result, and 8 byte stores).
I think this is a BUG, but I’m still not convinced that it is one, and I do not have a clue WHY gcc does this.
Is anyone here working on the riscv backend? This happens only with -O3, by the way.
Sample code:
#define ACCUM_LENGTH 9
#define WORDSIZE 64
typedef unsigned long long QELT;  /* 64-bit word type used below */
typedef struct {
    int sign, exponent;
    QELT mantissa[ACCUM_LENGTH + 1];  /* indices 0..ACCUM_LENGTH are used */
} QfloatAccum, *QfloatAccump;
void shup1(QfloatAccump x)
{
    QELT newbits, bits;
    int i;
    bits = x->mantissa[ACCUM_LENGTH] >> (WORDSIZE - 1);
    x->mantissa[ACCUM_LENGTH] <<= 1;
    for (i = ACCUM_LENGTH - 1; i > 0; i--) {
        newbits = x->mantissa[i] >> (WORDSIZE - 1);
        x->mantissa[i] <<= 1;
        x->mantissa[i] |= bits;
        bits = newbits;
    }
    x->mantissa[0] <<= 1;
    x->mantissa[0] |= bits;
}
Please point me to the right person. Thanks
* Re: Riscv code generation
2023-10-23 14:14 Riscv code generation Jacob Navia
@ 2023-10-24 9:43 ` Benny Lyne Amorsen
0 siblings, 0 replies; 2+ messages in thread
From: Benny Lyne Amorsen @ 2023-10-24 9:43 UTC (permalink / raw)
To: gcc
Jacob Navia via Gcc <gcc@gcc.gnu.org> writes:
> we have 2 loads + 1 operation + 1 store: 4 instructions, compared to
> 46 for the « gcc way » (16 byte loads, 14 ORs to assemble the two
> operands, 8 shifts to split the result, and 8 byte stores).
The sample code seems to have a couple of errors; I fixed it up and put
it on godbolt: https://godbolt.org/z/obbr7K7dx
Let me know if the fixups were wrong. The issue should probably be
reported on Bugzilla as a missed-optimization bug.
/Benny