public inbox for gcc-bugs@sourceware.org
* [Bug target/53949] New: [SH] Add support for mac.w / mac.l instructions
@ 2012-07-13 9:01 olegendo at gcc dot gnu.org
2012-07-13 10:34 ` [Bug target/53949] " olegendo at gcc dot gnu.org
` (14 more replies)
0 siblings, 15 replies; 16+ messages in thread
From: olegendo at gcc dot gnu.org @ 2012-07-13 9:01 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53949
Bug #: 53949
Summary: [SH] Add support for mac.w / mac.l instructions
Classification: Unclassified
Product: gcc
Version: 4.8.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
AssignedTo: unassigned@gcc.gnu.org
ReportedBy: olegendo@gcc.gnu.org
CC: chrbr@gcc.gnu.org
Target: sh*-*-*
So far, GCC does not utilize the integer multiply-add instructions.
On SH1 only the mac.w instruction is supported.
On SH2 and above the mac.w and mac.l instructions are available.
Carry over from PR 39423 comment #20
> On a related thread, for further work, I'm thinking on adding support for the
> MAC instruction, now that was have the multiply and add. But this requires
> exposing the MACLH registers to reload. Had anyone had a thought on this ? I'd
> like to give this a try pretty soon.
* [Bug target/53949] [SH] Add support for mac.w / mac.l instructions
2012-07-13 9:01 [Bug target/53949] New: [SH] Add support for mac.w / mac.l instructions olegendo at gcc dot gnu.org
@ 2012-07-13 10:34 ` olegendo at gcc dot gnu.org
2012-07-13 11:01 ` chrbr at gcc dot gnu.org
` (13 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: olegendo at gcc dot gnu.org @ 2012-07-13 10:34 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53949
--- Comment #1 from Oleg Endo <olegendo at gcc dot gnu.org> 2012-07-13 10:34:20 UTC ---
(In reply to comment #0)
> So far, GCC does not utilize the integer multiply-add instructions.
> On SH1 only the mac.w instruction is supported.
> On SH2 and above the mac.w and mac.l instructions are available.
>
> Carry over from PR 39423 comment #20
>
> > On a related thread, for further work, I'm thinking on adding support for the
> > MAC instruction, now that was have the multiply and add. But this requires
> > exposing the MACLH registers to reload. Had anyone had a thought on this ? I'd
> > like to give this a try pretty soon.
I think the biggest problem is that the mac operands have to be in memory.
For example:
long long fun (int a, int b, long long c)
{
return (long long)a * (long long)b + c;
}
would need to become something like ...
mov.l r4,@-r15
mov r15,r1
mov.l r5,@-r15
lds r6,mach
lds r7,macl
mac.l @r15+,@r1+
sts mach,r1
sts macl,r0
rts
add #4,r15
Not using the mac instruction seems a bit simpler in this case:
dmuls.l r4,r5
sts mach,r1
clrt
sts macl,r0
addc r6,r0
rts
addc r7,r1
I think the mac instructions can be very useful when they can be used inside of
loops, but for this the whole post-inc memory addressing has to integrate
properly into the surrounding code.
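To make the memory-operand constraint concrete, here is a minimal C model of what a single mac.l does when SR.S = 0. The type mac_regs and the function model_mac_l are hypothetical names for illustration only; they are not part of GCC or of any Renesas header.

```c
#include <stdint.h>

/* Hypothetical model of the accumulator register pair. */
typedef struct {
    uint32_t mach;  /* upper 32 bits of the accumulator */
    uint32_t macl;  /* lower 32 bits of the accumulator */
} mac_regs;

/* Models "mac.l @rm+,@rn+" with SR.S = 0: both operands are read from
   memory, both address registers are post-incremented, and the signed
   64-bit product is added to mach:macl. */
static void model_mac_l(mac_regs *acc, const int32_t **rm, const int32_t **rn)
{
    int64_t a = *(*rm)++;               /* load, then post-increment */
    int64_t b = *(*rn)++;
    uint64_t sum = ((uint64_t)acc->mach << 32) | acc->macl;
    sum += (uint64_t)(a * b);           /* wraps modulo 2^64 */
    acc->mach = (uint32_t)(sum >> 32);
    acc->macl = (uint32_t)sum;
}
```

Calling model_mac_l repeatedly with the same two pointers walks both arrays, which is the situation where the instruction fits naturally. For operands that already live in registers, the extra stores needed to get them into memory are what make the first sequence above longer.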
Chris, do you have any ideas/plans on how to handle the SR.S bit, for example
to implement the ssmaddhisi4 pattern with mac.w?
* [Bug target/53949] [SH] Add support for mac.w / mac.l instructions
2012-07-13 9:01 [Bug target/53949] New: [SH] Add support for mac.w / mac.l instructions olegendo at gcc dot gnu.org
2012-07-13 10:34 ` [Bug target/53949] " olegendo at gcc dot gnu.org
@ 2012-07-13 11:01 ` chrbr at gcc dot gnu.org
2012-07-13 11:24 ` chrbr at gcc dot gnu.org
` (12 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: chrbr at gcc dot gnu.org @ 2012-07-13 11:01 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53949
--- Comment #2 from chrbr at gcc dot gnu.org 2012-07-13 11:00:55 UTC ---
I see MAC only as a global optimization, since its benefit comes from spanning
several loop BBs, as you said. There is also the problem of clearing the
accumulator.
That should certainly be a new extension in the GIMPLE SSA loop optimizers, based
on the presence of a multiply-and-add pattern. Not sure what the best way to
do this is at this point.
* [Bug target/53949] [SH] Add support for mac.w / mac.l instructions
2012-07-13 9:01 [Bug target/53949] New: [SH] Add support for mac.w / mac.l instructions olegendo at gcc dot gnu.org
2012-07-13 10:34 ` [Bug target/53949] " olegendo at gcc dot gnu.org
2012-07-13 11:01 ` chrbr at gcc dot gnu.org
@ 2012-07-13 11:24 ` chrbr at gcc dot gnu.org
2012-07-15 12:11 ` olegendo at gcc dot gnu.org
` (11 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: chrbr at gcc dot gnu.org @ 2012-07-13 11:24 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53949
chrbr at gcc dot gnu.org changed:
What |Removed |Added
----------------------------------------------------------------------------
Severity|normal |enhancement
* [Bug target/53949] [SH] Add support for mac.w / mac.l instructions
2012-07-13 9:01 [Bug target/53949] New: [SH] Add support for mac.w / mac.l instructions olegendo at gcc dot gnu.org
` (2 preceding siblings ...)
2012-07-13 11:24 ` chrbr at gcc dot gnu.org
@ 2012-07-15 12:11 ` olegendo at gcc dot gnu.org
2012-07-17 19:13 ` olegendo at gcc dot gnu.org
` (10 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: olegendo at gcc dot gnu.org @ 2012-07-15 12:11 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53949
--- Comment #3 from Oleg Endo <olegendo at gcc dot gnu.org> 2012-07-15 12:11:20 UTC ---
Created attachment 27799
--> http://gcc.gnu.org/bugzilla/attachment.cgi?id=27799
Proof of concept patch
This is a proof of concept patch just to probe around.
The idea is to allow the RA to allocate macl and mach registers in DImode, and
have mac insns that use the macl/mach regs as a pair in DImode.
With the patch applied, the following function ...
int64_t test01 (const int16_t* a, const int16_t* b)
{
int64_t sum = 0;
for (int i = 0; i < 16; ++i)
sum += (int64_t)(*a++) * (int64_t)(*b++);
return sum;
}
compiled with -m4 -O2 results in ...
__Z6test01PKsS0_:
.LFB0:
.cfi_startproc
mov #16,r1 ! 88 movsi_ie/3 [length = 2]
clrmac ! 39 clrmac/1 [length = 2]
.align 2
.L3:
dt r1 ! 89 dect [length = 2]
bf/s .L3 ! 90 branch_false [length = 2]
mac.w @r4+,@r5+ ! 61 *macw [length = 2]
sts macl,r0 ! 82 movsi_ie/8 [length = 2]
rts ! 99 *return_i [length = 2]
sts mach,r1 ! 83 movsi_ie/8 [length = 2]
... which is not that bad already.
Some notes I took while playing around with this:
- When compiling for big endian the RA mistakes mach and macl when
storing mach:macl to a DImode reg:reg pair.
This could probably be fixed by providing appropriate move insn patterns.
- Move insns/splits for DImode mach:macl <-> memory have to be added.
I've seen an ICE when compiling with -O1:
error: unrecognizable insn:
(insn 122 14 15 2 (set (mem/c:DI (plus:SI (reg/f:SI 15 r15)
(const_int 8 [0x8])) [0 %sfp+-8 S8 A32])
(reg:DI 148 macl)) sh_mac.cpp:38 -1
(nil))
- In some cases the mach:macl reg pair gets swapped to a general reg pair
without any obvious need. Example function:
int64_t test04 (const int16_t* a, const int16_t* b,
const int16_t* c, const int16_t* d)
{
int64_t sum0 = 0;
int64_t sum1 = 0;
for (int i = 0; i < 16; ++i)
sum0 += (int64_t)(*a++) * (int64_t)(*b++);
for (int i = 0; i < 16; ++i)
sum1 += (int64_t)(*c++) * (int64_t)(*d++);
return sum0 - sum1;
}
The IRA pass first allocates sum0 and sum1 to mach:macl, but then reload
seems to think that they are conflicting and moves sum0 to a general reg
pair. This results in ...
mov #0,r2
mov #16,r1
mov r2,r3
.L16:
lds r2,macl
lds r3,mach
dt r1
mac.w @r4+,@r5+
sts macl,r2
bf/s .L16
sts mach,r3
mov #16,r1
clrmac
.align 2
.L18:
dt r1
bf/s .L18
mac.w @r6+,@r7+
which would be better as:
mov #16,r1
clrmac
.L16:
dt r1
bf/s .L16
mac.w @r4+,@r5+
sts macl,r2
sts mach,r3
clrmac
mov #16,r1
.L18:
dt r1
bf/s .L18
mac.w @r6+,@r7+
- Loops with multiple running sums like
for (int i = 0; i < 16; ++i)
{
sum0 += (int64_t)(*a++) * (int64_t)(*b++);
sum1 += (int64_t)(*c++) * (int64_t)(*d++);
}
result in macl:mach swapping to general reg pairs between subsequent
mac.w instructions. Ideally such loops should be split into multiple
loops like in the previous example.
- When loop unrolling is turned on, the auto-inc address refs are
converted to displacement addresses. Because the auto-inc-dec pass
currently fails to detect a lot of auto-inc-dec possibilities, the
mac.w pattern will not match.
The same goes for manually unrolled code like
sum += (int64_t)(*a++) * (int64_t)(*b++);
sum += (int64_t)(*a++) * (int64_t)(*b++);
- Running sum variables should be turned into DImode variables if possible:
int32_t test00 (const int16_t* a, const int16_t* b)
{
int32_t sum = 0;
for (int i = 0; i < 16; ++i)
sum += (*a++) * (*b++);
return sum;
}
- The existing multiplication patterns could be adapted to utilize mach:macl
reg pair allocation, especially 32x32 -> 64 bit multiplications.
- Normal multiplications that do not need a full MAC operation but use
memory operands can be done with a clrmac-mac sequence.
Probably there are more subtle issues. Also, I have not tried expanding
the standard name 'maddmn4' pattern, maybe it would make some of the
problems mentioned above automagically disappear.
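Regarding the note about turning 32-bit running sums into DImode variables: the low 32 bits of a 64-bit accumulation are the same as a wrapping 32-bit accumulation, which is why the widening should be safe as long as only macl is read back. A minimal sketch, assuming the mac.w result is accumulated in the full mach:macl pair as in the -m4 output above. The function names are made up for illustration, and the 32-bit sum uses unsigned arithmetic so that the wrap-around is well defined in C.

```c
#include <stdint.h>

/* 32-bit running sum, as in test00; unsigned so that overflow wraps. */
static uint32_t dot16_sum32(const int16_t *a, const int16_t *b, int n)
{
    uint32_t sum = 0;
    for (int i = 0; i < n; ++i)
        sum += (uint32_t)((int32_t)a[i] * (int32_t)b[i]);
    return sum;
}

/* 64-bit running sum, the form that maps onto the mach:macl pair. */
static int64_t dot16_sum64(const int16_t *a, const int16_t *b, int n)
{
    int64_t sum = 0;
    for (int i = 0; i < n; ++i)
        sum += (int64_t)a[i] * (int64_t)b[i];
    return sum;
}
```

For any input, dot16_sum32 (a, b, n) equals (uint32_t)dot16_sum64 (a, b, n), even when the 64-bit sum no longer fits in 32 bits.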
* [Bug target/53949] [SH] Add support for mac.w / mac.l instructions
2012-07-13 9:01 [Bug target/53949] New: [SH] Add support for mac.w / mac.l instructions olegendo at gcc dot gnu.org
` (3 preceding siblings ...)
2012-07-15 12:11 ` olegendo at gcc dot gnu.org
@ 2012-07-17 19:13 ` olegendo at gcc dot gnu.org
2012-07-17 23:05 ` kkojima at gcc dot gnu.org
` (9 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: olegendo at gcc dot gnu.org @ 2012-07-17 19:13 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53949
Oleg Endo <olegendo at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |kkojima at gcc dot gnu.org
--- Comment #4 from Oleg Endo <olegendo at gcc dot gnu.org> 2012-07-17 19:12:49 UTC ---
To be able to use the mac.w instruction for implementing the ssmaddhisi4
pattern I think the already existing mode-switching facilities can be used, as
it is done for float/double mode switching.
Even without doing SR.S mode switches, before any mac instructions can be used,
we must make sure that the SR.S bit is in a defined state at function entry and
function exit.
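For reference, this is roughly the arithmetic that an ssmaddhisi4 pattern is expected to provide: a signed 16x16 multiply added to a 32-bit accumulator with saturation. The sketch below is a plain C model; model_ssmaddhisi is a hypothetical name, and the assumption that mac.w with SR.S = 1 clamps macl in exactly this way should be checked against the manual for the particular SH variant.

```c
#include <stdint.h>

/* Saturating signed multiply-add: acc + a * b, clamped to the int32_t
   range. Assumed to correspond to mac.w with SR.S = 1 (unverified). */
static int32_t model_ssmaddhisi(int16_t a, int16_t b, int32_t acc)
{
    int64_t r = (int64_t)acc + (int64_t)a * (int64_t)b;
    if (r > INT32_MAX)
        return INT32_MAX;   /* positive overflow saturates */
    if (r < INT32_MIN)
        return INT32_MIN;   /* negative overflow saturates */
    return (int32_t)r;
}
```

With SR.S = 0 the same inputs would simply accumulate without clamping, which is why the state of the bit at function entry matters even when no saturating pattern is used.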
Kaz, do you think it is safe to assume that SR.S = 0 at function entry?
* [Bug target/53949] [SH] Add support for mac.w / mac.l instructions
2012-07-13 9:01 [Bug target/53949] New: [SH] Add support for mac.w / mac.l instructions olegendo at gcc dot gnu.org
` (4 preceding siblings ...)
2012-07-17 19:13 ` olegendo at gcc dot gnu.org
@ 2012-07-17 23:05 ` kkojima at gcc dot gnu.org
2012-07-22 16:47 ` olegendo at gcc dot gnu.org
` (8 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: kkojima at gcc dot gnu.org @ 2012-07-17 23:05 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53949
--- Comment #5 from Kazumoto Kojima <kkojima at gcc dot gnu.org> 2012-07-17 23:04:54 UTC ---
(In reply to comment #4)
> Kaz, do you think it is safe to assume that SR.S = 0 at function entry?
I think so. I can't imagine a practical system setting
SR.S to one in its start-up code, though I may be wrong about it.
* [Bug target/53949] [SH] Add support for mac.w / mac.l instructions
2012-07-13 9:01 [Bug target/53949] New: [SH] Add support for mac.w / mac.l instructions olegendo at gcc dot gnu.org
` (5 preceding siblings ...)
2012-07-17 23:05 ` kkojima at gcc dot gnu.org
@ 2012-07-22 16:47 ` olegendo at gcc dot gnu.org
2012-10-11 20:43 ` olegendo at gcc dot gnu.org
` (7 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: olegendo at gcc dot gnu.org @ 2012-07-22 16:47 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53949
--- Comment #6 from Oleg Endo <olegendo at gcc dot gnu.org> 2012-07-22 16:47:44 UTC ---
If I understand correctly PR 29961 is somewhat related to this.
* [Bug target/53949] [SH] Add support for mac.w / mac.l instructions
2012-07-13 9:01 [Bug target/53949] New: [SH] Add support for mac.w / mac.l instructions olegendo at gcc dot gnu.org
` (6 preceding siblings ...)
2012-07-22 16:47 ` olegendo at gcc dot gnu.org
@ 2012-10-11 20:43 ` olegendo at gcc dot gnu.org
2012-11-07 21:31 ` olegendo at gcc dot gnu.org
` (6 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: olegendo at gcc dot gnu.org @ 2012-10-11 20:43 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53949
--- Comment #7 from Oleg Endo <olegendo at gcc dot gnu.org> 2012-10-11 20:43:02 UTC ---
A note regarding the SR.S bit. The insns sets and clrs are available only on
SH3* and SH4*. SH1* and SH2* (incl SH2A) do not implement them.
* [Bug target/53949] [SH] Add support for mac.w / mac.l instructions
2012-07-13 9:01 [Bug target/53949] New: [SH] Add support for mac.w / mac.l instructions olegendo at gcc dot gnu.org
` (7 preceding siblings ...)
2012-10-11 20:43 ` olegendo at gcc dot gnu.org
@ 2012-11-07 21:31 ` olegendo at gcc dot gnu.org
2013-05-04 13:39 ` olegendo at gcc dot gnu.org
` (5 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: olegendo at gcc dot gnu.org @ 2012-11-07 21:31 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53949
--- Comment #8 from Oleg Endo <olegendo at gcc dot gnu.org> 2012-11-07 21:31:39 UTC ---
Christian, I just wanted to check with you whether you've already started doing
something regarding the mac.w / mac.l instructions?
* [Bug target/53949] [SH] Add support for mac.w / mac.l instructions
2012-07-13 9:01 [Bug target/53949] New: [SH] Add support for mac.w / mac.l instructions olegendo at gcc dot gnu.org
` (8 preceding siblings ...)
2012-11-07 21:31 ` olegendo at gcc dot gnu.org
@ 2013-05-04 13:39 ` olegendo at gcc dot gnu.org
2013-12-17 12:25 ` olegendo at gcc dot gnu.org
` (4 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: olegendo at gcc dot gnu.org @ 2013-05-04 13:39 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53949
--- Comment #9 from Oleg Endo <olegendo at gcc dot gnu.org> 2013-05-04 13:39:10 UTC ---
(In reply to comment #3)
> - Loops with multiple running sums like
> for (int i = 0; i < 16; ++i)
> {
> sum0 += (int64_t)(*a++) * (int64_t)(*b++);
> sum1 += (int64_t)(*c++) * (int64_t)(*d++);
> }
>
> result in macl:mach swapping to general reg pairs between subsequent
> mac.w instructions. Ideally such loops should be split into multiple
> loops like in the previous example.
This is basically what -ftree-loop-distribution does. The question would be
how to re-use it for this particular case.
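As a source-level illustration of the transformation in question, the two functions below compute the same pair of sums. The fused form interleaves two accumulators, while the distributed form keeps one accumulator live per loop, which is the shape a single mach:macl pair can handle. The function names are hypothetical, and whether -ftree-loop-distribution would actually pick this split is not something the sketch claims.

```c
#include <stdint.h>

/* Fused loop: two running sums compete for one accumulator pair. */
static void sums_fused(const int16_t *a, const int16_t *b,
                       const int16_t *c, const int16_t *d, int n,
                       int64_t *sum0, int64_t *sum1)
{
    int64_t s0 = 0, s1 = 0;
    for (int i = 0; i < n; ++i) {
        s0 += (int64_t)a[i] * (int64_t)b[i];
        s1 += (int64_t)c[i] * (int64_t)d[i];
    }
    *sum0 = s0;
    *sum1 = s1;
}

/* Distributed loops: each loop has a single running sum. */
static void sums_distributed(const int16_t *a, const int16_t *b,
                             const int16_t *c, const int16_t *d, int n,
                             int64_t *sum0, int64_t *sum1)
{
    int64_t s0 = 0, s1 = 0;
    for (int i = 0; i < n; ++i)
        s0 += (int64_t)a[i] * (int64_t)b[i];
    for (int i = 0; i < n; ++i)
        s1 += (int64_t)c[i] * (int64_t)d[i];
    *sum0 = s0;
    *sum1 = s1;
}
```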
* [Bug target/53949] [SH] Add support for mac.w / mac.l instructions
2012-07-13 9:01 [Bug target/53949] New: [SH] Add support for mac.w / mac.l instructions olegendo at gcc dot gnu.org
` (9 preceding siblings ...)
2013-05-04 13:39 ` olegendo at gcc dot gnu.org
@ 2013-12-17 12:25 ` olegendo at gcc dot gnu.org
2013-12-17 12:37 ` olegendo at gcc dot gnu.org
` (3 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: olegendo at gcc dot gnu.org @ 2013-12-17 12:25 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53949
--- Comment #10 from Oleg Endo <olegendo at gcc dot gnu.org> ---
I was wondering whether it would make sense to convert sequences such as
SH4 SH4A
mov.l @r15,r3 LS/2 LS/2
mul.l r2,r3 CO/4 EX/3
sts macl,r3 CO/3 LS/2
add r1,r3 EX/1 EX/1
into
mov r15,r0 MT/0 MT/1
mov.l r2,@-r15 LS/1 LS/1
lds r1,macl CO/3 LS/1
mac.l @r15+,@r0+ CO/4 CO/5
sts macl,r3 CO/3 LS/2
Looking simply at the issue cycles (the numbers above) would suggest that it's
not worth doing it, at least not if the value has to be pulled out from the mac
register immediately after the mac operation. Probably it's not beneficial to
emit a single mac insn if the data is not already in place so that it can be
reached easily with the post-inc addressing.
On the other hand something like ...
int test33 (int* x, int y, int z)
{
return x[0] * 40 + z;
}
currently compiles to:
mov.l @r4,r2
mov #40,r1
mul.l r1,r2
sts macl,r0
rts
add r6,r0
where this one maybe could be better:
mova .L40,r0
lds r6,macl
mac.l @r4+,@r0+
rts
sts macl,r0
.align 2
.L40: .long 40
* [Bug target/53949] [SH] Add support for mac.w / mac.l instructions
2012-07-13 9:01 [Bug target/53949] New: [SH] Add support for mac.w / mac.l instructions olegendo at gcc dot gnu.org
` (10 preceding siblings ...)
2013-12-17 12:25 ` olegendo at gcc dot gnu.org
@ 2013-12-17 12:37 ` olegendo at gcc dot gnu.org
2015-01-30 19:37 ` olegendo at gcc dot gnu.org
` (2 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: olegendo at gcc dot gnu.org @ 2013-12-17 12:37 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53949
--- Comment #11 from Oleg Endo <olegendo at gcc dot gnu.org> ---
Another question is whether the following is OK to do on all SH
implementations:
int test33 (int x, int y, int z)
{
return x * y + z;
}
currently compiles to:
mul.l r5,r4
sts macl,r0
rts
add r6,r0
could also be done as:
lds r6,macl
mov.l r4,@-r15
mov.l r5,@-r15
mac.l @r15+,@r15+
rts
sts macl,r0
This assumes that a mac insn with both address operands being the same works
exactly as it's described in the Renesas manuals:
tempn = Read_32 (R[n]);
R[n] += 4;
tempm = Read_32 (R[m]);
R[m] += 4;
However, I don't know whether this is true for all SH implementations.
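The assumed behaviour can be written down as a small C model in which a single pointer plays the role of both address registers. This only restates the pseudocode order quoted above; model_mac_l_same_reg is a hypothetical name, and the sketch says nothing about what real silicon does.

```c
#include <stdint.h>

/* Models "mac.l @rX+,@rX+" under the assumption that the two reads and
   increments happen strictly in the order given in the pseudocode, so
   the instruction pops two consecutive words from the same pointer. */
static int64_t model_mac_l_same_reg(const int32_t **sp, int64_t acc)
{
    int32_t tempn = *(*sp)++;   /* tempn = Read_32 (R[n]); R[n] += 4; */
    int32_t tempm = *(*sp)++;   /* tempm = Read_32 (R[m]); R[m] += 4; */
    return acc + (int64_t)tempn * (int64_t)tempm;
}
```

If an implementation instead read both operands before applying either increment, the two temporaries would hold the same word and the result would be the square of one operand, which is the failure mode to test for on hardware.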
* [Bug target/53949] [SH] Add support for mac.w / mac.l instructions
2012-07-13 9:01 [Bug target/53949] New: [SH] Add support for mac.w / mac.l instructions olegendo at gcc dot gnu.org
` (11 preceding siblings ...)
2013-12-17 12:37 ` olegendo at gcc dot gnu.org
@ 2015-01-30 19:37 ` olegendo at gcc dot gnu.org
2015-02-01 0:37 ` olegendo at gcc dot gnu.org
2015-04-08 20:08 ` olegendo at gcc dot gnu.org
14 siblings, 0 replies; 16+ messages in thread
From: olegendo at gcc dot gnu.org @ 2015-01-30 19:37 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53949
Oleg Endo <olegendo at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|UNCONFIRMED |NEW
Last reconfirmed| |2015-01-30
Ever confirmed|0 |1
--- Comment #12 from Oleg Endo <olegendo at gcc dot gnu.org> ---
This is an example from Renesas public material regarding the SH-DSP. The
example program can utilize either the regular mac.w instruction, or the DSP
ISA (found on SH2-DSP, SH3-DSP, SH4AL-DSP).
typedef short Fixed;
typedef long Lfixed;
typedef long Laccum;
#define __X
#define __Y
void func(__X Fixed x[5],Fixed *out, __Y Fixed y[128][5] )
{
int i;
__Y Fixed *yp=y[0];
__X Fixed *xp=x;
Fixed x0,y0;
Lfixed m0;
Laccum a0;
for(i=0;i<128;i++)
{
a0=0;
x0=*xp++; y0=*yp++;
m0=x0*y0; x0=*xp++; y0=*yp++;
a0+=m0; m0=x0*y0; x0=*xp++; y0=*yp++;
a0+=m0; m0=x0*y0; x0=*xp++; y0=*yp++;
a0+=m0; m0=x0*y0; x0=*xp++; y0=*yp++;
a0+=m0; m0=x0*y0;
a0+=m0;
*out++=a0;
}
}
which should compile to something like:
_func:
mov #-128,r1
mov r5,r3
extu.b r1,r1
.L11:
clrmac
mac.w @r6+,@r4+
dt r1
mac.w @r6+,@r4+
mac.w @r6+,@r4+
mac.w @r6+,@r4+
sts macl,r2
mov.w r2,@r3
bf/s .L11
add #2,r3
rts
nop
* [Bug target/53949] [SH] Add support for mac.w / mac.l instructions
2012-07-13 9:01 [Bug target/53949] New: [SH] Add support for mac.w / mac.l instructions olegendo at gcc dot gnu.org
` (12 preceding siblings ...)
2015-01-30 19:37 ` olegendo at gcc dot gnu.org
@ 2015-02-01 0:37 ` olegendo at gcc dot gnu.org
2015-04-08 20:08 ` olegendo at gcc dot gnu.org
14 siblings, 0 replies; 16+ messages in thread
From: olegendo at gcc dot gnu.org @ 2015-02-01 0:37 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53949
--- Comment #13 from Oleg Endo <olegendo at gcc dot gnu.org> ---
A more interesting real-world example from libjpeg would be function
jpeg_idct_ifast (jidctint.c).
If we take the code as-is, there are few mac opportunities due to sharing of
the terms. The expressions could be un-CSE'd, which would result in longer mac
chains, but the overall result gets worse because the data layout is not
mac-friendly.
The first loop in jpeg_idct_ifast can be split into 8 independent loops for the
output value wsptr[8*n+i].
For n = 1,2,3,4,5,6 the loops look a bit complex, but for n = 0 and n = 7 we
get similar looking loops like:
for (int i = 0; i < 8; ++i)
{
wsptr[8*7+i] = inptr[8*0 + i] * quantptr[8*0 + i]
- inptr[8*1 + i] * quantptr[8*1 + i]
+ inptr[8*2 + i] * quantptr[8*2 + i]
- inptr[8*3 + i] * quantptr[8*3 + i]
+ inptr[8*4 + i] * quantptr[8*4 + i]
- inptr[8*5 + i] * quantptr[8*5 + i]
+ inptr[8*6 + i] * quantptr[8*6 + i]
- inptr[8*7 + i] * quantptr[8*7 + i];
}
Still, due to the subtractions and memory access pattern, plain mac insns can't
be used.
The subtractions can be converted into additions by negating the operands.
Since mac wants both operands in memory, those can be placed on the stack.
Also, in this case the address registers can be pre-computed outside the loop,
since there are enough registers.
A possible outcome would be something like this:
// r4 = inptr[8*0+i]
// r5 = quantptr[8*0+i]
// r6 = wsptr[8*0+i]
mov r4,r3; add #32,r3 // r3 = inptr[8*1+i]
mov r3,r7; add #32,r7 // r7 = inptr[8*2+i]
mov r7,r8; add #32,r8 // r8 = inptr[8*3+i]
mov r8,r9; add #32,r9 // r9 = inptr[8*4+i]
mov r9,r10; add #32,r10 // r10 = inptr[8*5+i]
mov r10,r11; add #32,r11 // r11 = inptr[8*6+i]
mov r11,r12; add #32,r12 // r12 = inptr[8*7+i]
mov #8,r14
add #126,r6; add #102,r6 // r6 = wsptr + 8*7*4 + 4
mov r5,r0; sub r4,r0 // r0 = quantptr - inptr
.Loop:
mov.l @(r0,r12),r1 // quantptr[8*7+i]
mov.l @(r0,r11),r2 // quantptr[8*6+i]
mov.l @(r0,r10),r13 // quantptr[8*5+i]
neg r1,r1
mov.l r1,@-r15
mov.l r2,@-r15
neg r13,r13
mov.l @(r0,r8),r1 // quantptr[8*3+i]
mov.l @(r0,r9),r2 // quantptr[8*4+i]
mov.l r13,@-r15
neg r1,r1
mov.l r2,@-r15
mov.l @(r0,r7),r2 // quantptr[8*2+i]
mov.l @(r0,r3),r13 // quantptr[8*1+i]
mov.l r1,@-r15
mov.l r2,@-r15
neg r13,r13
mov.l r13,@-r15
clrmac
mac.l @r4+,@r5+
mac.l @r3+,@r15+
mac.l @r7+,@r15+
mac.l @r8+,@r15+
mac.l @r9+,@r15+
mac.l @r10+,@r15+
mac.l @r11+,@r15+
mac.l @r12+,@r15+
dt r14
sts.l macl,@-r6
bf/s .Loop
add #8,r6
which is 31 insns per loop and (almost) no pipeline stalls, vs. 53 insns per
loop + stalls on mul-sts sequences when the mac insn is not used.
The above loop can be optimized even further with partial unrolling to avoid
the latency of the last mac and sts.
Of course it'd be even better, if the application's data was in a mac friendly
layout.
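The negate-then-accumulate idea used in the sequence above can be checked at the C level. The sketch below compares the alternating-sign sum with a version that only ever adds products, after negating every second quantptr operand, which mirrors the neg / mov.l @-r15 / mac.l steps. The function names are hypothetical, 64-bit accumulation is used for the model, and the negation assumes no operand equals INT32_MIN.

```c
#include <stdint.h>

/* Alternating-sign sum, as written in the source loop. */
static int64_t alt_sum_direct(const int32_t *in, const int32_t *q, int terms)
{
    int64_t acc = 0;
    for (int k = 0; k < terms; ++k) {
        int64_t p = (int64_t)in[k] * (int64_t)q[k];
        acc += (k & 1) ? -p : p;
    }
    return acc;
}

/* Additions only: odd terms get their second operand negated first. */
static int64_t alt_sum_mac_style(const int32_t *in, const int32_t *q, int terms)
{
    int64_t acc = 0;                                /* clrmac */
    for (int k = 0; k < terms; ++k) {
        /* neg + push; assumes q[k] != INT32_MIN */
        int32_t qk = (k & 1) ? -q[k] : q[k];
        acc += (int64_t)in[k] * (int64_t)qk;        /* mac.l */
    }
    return acc;
}
```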
* [Bug target/53949] [SH] Add support for mac.w / mac.l instructions
2012-07-13 9:01 [Bug target/53949] New: [SH] Add support for mac.w / mac.l instructions olegendo at gcc dot gnu.org
` (13 preceding siblings ...)
2015-02-01 0:37 ` olegendo at gcc dot gnu.org
@ 2015-04-08 20:08 ` olegendo at gcc dot gnu.org
14 siblings, 0 replies; 16+ messages in thread
From: olegendo at gcc dot gnu.org @ 2015-04-08 20:08 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53949
--- Comment #14 from Oleg Endo <olegendo at gcc dot gnu.org> ---
(In reply to Oleg Endo from comment #3)
>
> - When compiling for big endian the RA mistakes mach and macl when
> storing mach:macl to a DImode reg:reg pair.
> This could probably be fixed by providing appropriate move insn patterns.
I believe this is PR 29961.